Classification of Thermophilic and Mesophilic Proteins Using N-Grams

Authors

Elattar, Marwy

Abstract

This project applies machine learning to the classification of thermophilic and mesophilic proteins, using an n-gram-based representation of protein sequences. Two datasets containing proteins from both classes were used for the analysis. Alphabet reduction was performed on both datasets, and n-gram frequencies were calculated for each sequence over the reduced alphabet. The data were normalized by calculating n-gram likelihoods. Four machine learning algorithms (Naïve Bayes, Support Vector Machines (SVM), Decision Trees, and Random Forests) were used to classify the proteins. The reported accuracies were 100.0% for SVM, 99.6% for Decision Trees, 99.3% for Random Forests, and 90.3% for Naïve Bayes.
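
The feature-extraction steps described in the abstract (alphabet reduction, n-gram counting, normalization to likelihoods) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the abstract does not state which reduced alphabet was used, so the five-group mapping `REDUCED_GROUPS` is hypothetical, and "likelihood" is interpreted as the relative frequency of each n-gram within a sequence. The function names are likewise invented for this example.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet (NOT taken from the source): the 20 standard
# amino acids grouped into 5 rough physicochemical classes.
REDUCED_GROUPS = {
    "a": "AVLIMC",  # hydrophobic
    "b": "FWYH",    # aromatic
    "c": "STNQ",    # polar, uncharged
    "d": "KRDE",    # charged
    "e": "GP",      # glycine / proline
}

# Invert the grouping into a residue -> group-symbol lookup table.
AA_TO_GROUP = {aa: g for g, members in REDUCED_GROUPS.items() for aa in members}


def reduce_alphabet(seq):
    """Map each residue to its group symbol; unknown residues are skipped."""
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper() if aa in AA_TO_GROUP)


def ngram_likelihoods(seq, n=2):
    """Return a fixed-length feature dict: relative frequency of every
    possible n-gram over the reduced alphabet (assumed normalization)."""
    reduced = reduce_alphabet(seq)
    counts = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    total = sum(counts.values())
    # Enumerate the full vocabulary so every sequence yields the same features.
    vocab = ["".join(p) for p in product(sorted(REDUCED_GROUPS), repeat=n)]
    if total == 0:
        return {gram: 0.0 for gram in vocab}
    return {gram: counts[gram] / total for gram in vocab}


if __name__ == "__main__":
    features = ngram_likelihoods("MKVLAAGIVGE", n=2)
    # Print only the n-grams that actually occur in this toy sequence.
    print({gram: round(p, 3) for gram, p in features.items() if p > 0})
```

Because every sequence maps to a vector of the same length (25 features for bigrams over 5 groups), the resulting vectors could in principle be passed to any standard classifier, such as the four named in the abstract.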

Keywords

Thermophilic, Mesophilic, N-grams, Machine learning
