Improving machine learning techniques for influenza-A classification
Shaltout, Nermin Ashraf
Abstract:
Influenza-A's ability to mutate constantly has resulted in recurring seasonal epidemics and pandemics. Recently, the virus's spread has been enhanced by its ability to infect multiple hosts simultaneously. Fast identification of the subtype and hosts of the Influenza-A virus is thus crucial for quickly measuring its drug resistance and virulence. Research on data mining techniques for Influenza-A host and subtype classification is already underway. The main goal of earlier studies was to improve the accuracy, speed and safety of virus analyses. With newer infectious strains of Influenza-A appearing yearly, these techniques are still open to improvement.
The current research aims to improve existing machine learning techniques for classifying Influenza-A through the following: (a) Exploring the effectiveness of using RNA/cDNA data rather than protein data for virus classification. (b) Measuring the impact on classifier performance and speed of preprocessing the viral sequences by selecting the most informative positions in each sequence; both neural networks (NNs) and decision trees (DTs) were analyzed. (c) Testing the previous method on more than one classification problem; host identification experiments were conducted on both subtypes H1 and H5, while antiviral resistance identification was conducted on the H1N1 strain. Accuracy, sensitivity, specificity, precision and time were used as performance measures.
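As an illustration of the performance measures named above, the following is a minimal sketch of how accuracy, sensitivity, specificity and precision can be derived from true and predicted labels for a two-class problem such as host identification. The function name `binary_metrics`, the `positive` parameter and the example labels are assumptions made for illustration; they are not taken from the thesis.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute four standard measures for a two-class problem.

    `positive` is the label treated as the positive class
    (a hypothetical choice, e.g. "human" for host identification).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class is absent.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),
    }


# Made-up labels, for illustration only.
y_true = ["human", "human", "avian", "avian", "human"]
y_pred = ["human", "avian", "avian", "human", "human"]
metrics = binary_metrics(y_true, y_pred, positive="human")
```

Time, the fifth measure, would be recorded separately, for example by wrapping training and prediction in calls to `time.perf_counter()`.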
The final results showed that: (a) DNA data is more sensitive than protein data for both subtypes. (b) Using the 100 and 10 most informative positions with DTs yielded an overall speed improvement of 92-100% when identifying hosts for segments of subtype H1; the performance decrease was insignificant. Using the 100 and 60 most informative positions with NNs yielded a speed improvement of 88% when identifying hosts of both subtypes H1 and H5, with no significant drop in overall performance. Of the two classifiers, NNs had better performance, while DTs had better efficiency. (c) Testing the method on antiviral resistance identification of Influenza-A showed promising results: using the 100 most informative positions with DTs yielded an overall performance of not less than 95%, in not more than 3 seconds, for all 8 segments. The method has the potential to improve the efficiency of other Influenza-A classification problems, as well as other viral classification problems in the bioinformatics field.
The thesis provided the following contributions: (a) A way to extract informative positions directly from DNA sequences, without converting the DNA data to protein data. This can aid in detecting silent mutations in the Influenza-A virus. (b) Identification of resistance to the antiviral adamantane using all eight segments of the virus. Previously, one viral segment was known to be mainly responsible for antiviral resistance. (c) Measuring the efficiency, in terms of speed, of using informative positions as a preprocessing step. (d) A clear comparison of the performance of two classifiers when using the information gain algorithm.
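The selection of informative positions by information gain can be sketched as follows. This is a minimal illustration, assuming aligned nucleotide sequences of equal length and one class label per sequence; the function names, the tie-breaking rule and the toy data are hypothetical and do not reproduce the thesis's implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(column, labels):
    """Reduction in label entropy after splitting on the symbol at one position."""
    total = len(labels)
    groups = {}
    for symbol, label in zip(column, labels):
        groups.setdefault(symbol, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_informative_positions(sequences, labels, k):
    """Return the indices of the k positions with the highest information gain.

    Assumes all sequences are aligned and of equal length.
    Ties are broken by the lower position index (an arbitrary choice).
    """
    length = len(sequences[0])
    gains = [
        (information_gain([s[i] for s in sequences], labels), i)
        for i in range(length)
    ]
    ranked = sorted(gains, key=lambda gain_index: (-gain_index[0], gain_index[1]))
    return [index for _, index in ranked[:k]]


# Toy aligned sequences and host labels, made up for illustration.
sequences = ["ACGT", "ACGA", "TCGT", "TCGA"]
labels = ["human", "human", "avian", "avian"]
best = top_informative_positions(sequences, labels, k=1)
```

In this toy example only the first position separates the two hosts, so it is the one selected. A classifier would then be trained on the selected columns only, which is where a speed gain would be expected to come from.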
Advisor:
Drs. Rafea, Ahmed; El-Hefnawi, Mahmoud; Moustafa, Ahmed
Committee Member:
Mahmoud, Wael; ElKafrawi, Passent; Alkabani, Yousra
Department:
American University in Cairo. Dept. of Computer Science and Engineering
Discipline:
Computer Science
Keyword:
DNA, Bioinformatics, Computer science, Biology, Influenza A virus, Machine learning, Classification, Artificial intelligence, Neural networks (Computer science), Decision tree, Protein
Date Created:
2014 Spring
Date Issued:
2014-06-08
Type:
Text
Medium:
theses
Language:
en
Access Rights:
This item is available