Classification of Thermophilic and Mesophilic Proteins Using N-Grams
Authors
Elattar, Marwy
Abstract
The project applies machine learning to the classification of thermophilic and mesophilic proteins, using an n-gram based representation of protein sequences. Two datasets containing proteins from both classes were analyzed. Alphabet reduction was performed on both datasets, and n-gram frequencies were calculated for each sequence over the reduced alphabet. The data were normalized by calculating n-gram likelihoods. Four machine learning algorithms (Naïve Bayes, Support Vector Machines, Decision Trees and Random Forests) were used for protein classification. The reported accuracies were 100.0% using SVM, 99.6% using Decision Trees, 99.3% using Random Forests and 90.3% using Naïve Bayes.
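The feature-extraction steps named in the abstract (alphabet reduction, n-gram counting, normalization to likelihoods) might look roughly like the sketch below. It is only an illustration under stated assumptions: the abstract does not give the reduced alphabet, the value of n, or the exact likelihood formula, so the four-group mapping `REDUCED_GROUPS`, the default `n=2`, and the use of relative frequency as the "likelihood" are all hypothetical choices, as are the function names.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet. The grouping used in the project is not stated
# in the abstract, so this four-class mapping is illustrative only.
REDUCED_GROUPS = {
    "H": "AVLIMFWC",  # hydrophobic
    "P": "STNQYGP",   # polar / small
    "B": "KRH",       # basic
    "A": "DE",        # acidic
}
RESIDUE_TO_GROUP = {
    residue: group
    for group, members in REDUCED_GROUPS.items()
    for residue in members
}


def reduce_alphabet(sequence):
    """Map each standard amino acid to its group symbol; skip unknown residues."""
    return "".join(
        RESIDUE_TO_GROUP[residue]
        for residue in sequence.upper()
        if residue in RESIDUE_TO_GROUP
    )


def ngram_likelihoods(sequence, n=2):
    """Return a fixed-length feature dict of relative n-gram frequencies.

    Assumption: "likelihood" is taken here to mean the count of each n-gram
    divided by the total number of n-grams in the reduced sequence.
    """
    reduced = reduce_alphabet(sequence)
    total = len(reduced) - n + 1
    # Every possible n-gram over the reduced alphabet, in a stable order,
    # so that all sequences yield feature vectors of the same length.
    vocabulary = ["".join(p) for p in product(sorted(REDUCED_GROUPS), repeat=n)]
    if total <= 0:
        return {gram: 0.0 for gram in vocabulary}
    counts = Counter(reduced[i:i + n] for i in range(total))
    return {gram: counts[gram] / total for gram in vocabulary}
```

With these assumptions every sequence yields a vector of 4^n values that sum to 1, which could then be passed to any of the four classifiers listed in the abstract.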
Keywords
Thermophilic, Mesophilic, N-grams, Machine learning