Vaismann, Iosif
Elattar, Marwy
2017-12-07
https://hdl.handle.net/1920/10794

This project focuses on machine learning classification of thermophilic and mesophilic proteins using an n-gram-based representation of protein sequences. Two datasets containing proteins from both classes were used for the analysis. Alphabet reduction was performed on both datasets, and n-gram frequencies were calculated for each sequence using the reduced alphabet. The data were normalized by calculating n-gram likelihoods. Four machine learning algorithms (Naïve Bayes, Support Vector Machines, Decision Trees and Random Forests) were used for the protein classification. Accuracies of 100.0% were achieved using SVM, 99.3% using Random Forests, 90.3% using Naïve Bayes and 99.6% using Decision Trees.

en
Thermophilic; Mesophilic; N-grams; Machine learning
Classification of Thermophilic and Mesophilic Proteins Using N-Grams
Thesis
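
The feature-extraction steps named in the abstract (alphabet reduction, n-gram counting, normalization) could look roughly like the sketch below. It is only an illustration under stated assumptions: the six-group reduced alphabet, the function names `reduce_sequence` and `ngram_features`, the choice of n = 2, and the reading of "n-gram likelihood" as relative frequency within a sequence are all hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet: the 20 standard amino acids grouped into
# six broad physicochemical classes. The thesis's actual grouping is not
# stated in the abstract.
GROUPS = {
    "h": "AVLIMC",  # hydrophobic / aliphatic
    "a": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "b": "KR",      # basic
    "n": "DE",      # acidic
    "s": "GP",      # small / special
}
REDUCE = {aa: symbol for symbol, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its group symbol; non-standard residues are dropped."""
    return "".join(REDUCE[aa] for aa in seq.upper() if aa in REDUCE)


def ngram_features(seq, n=2):
    """Return relative n-gram frequencies over the reduced alphabet.

    The vector has one entry per possible n-gram, in sorted order, so every
    sequence yields a vector of the same length (6**n for six groups).
    """
    reduced = reduce_sequence(seq)
    counts = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    total = sum(counts.values())
    vocab = ["".join(p) for p in product(sorted(GROUPS), repeat=n)]
    # Assumption: "likelihood" is taken to mean count / total n-grams.
    return [counts[g] / total if total else 0.0 for g in vocab]
```

Fixed-length vectors of this kind could then be passed to any of the four classifier families listed in the abstract. Reducing the alphabet keeps the feature space small: with six groups, bigrams give 36 features instead of the 400 that the full 20-letter alphabet would produce.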