Identification and Prediction of Intrinsically Disordered Regions in Proteins
Date
2019
Abstract
The dominant paradigm in structural biology holds that a well-defined structure determines protein function. Intrinsically disordered proteins (IDPs), which lack a stable three-dimensional structure under normal physiological conditions, challenge this structure-to-function paradigm. Disorder exists in up to half of the amino acids in eukaryotic proteins, and disordered regions are involved in numerous biological functions as a result of their flexibility. Because amino acid sequence is known to determine protein structure, sequence information can be used to identify disordered regions. Protein disorder is involved in the development of many diseases, and identifying disordered regions can help us understand how to use them as potential drug targets. The identified regions can also be used to better understand the pathways of protein folding and to provide insights into protein function. In this study, we developed two machine-learning-based algorithms that distinguish between disordered and ordered residues within a sequence, one based on n-gram frequencies and the other on reduced amino acid alphabets. Our results show that using n-gram frequencies is an accurate, fast, and computationally inexpensive method for predicting disordered regions from raw protein sequence data. Furthermore, we show that an algorithm combining a convolutional neural network architecture with reduced amino acid alphabet encoding achieves state-of-the-art prediction results on the CASP datasets. Both prediction algorithms can subsequently aid in the development of next-generation treatments for a variety of biomedical applications.
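As a rough illustration of the two sequence encodings named in the abstract, the sketch below maps residues to a reduced alphabet and computes n-gram frequencies over a window around a residue. The five-class grouping, the function names, and the window size are assumptions made for this example; the abstract does not specify the alphabets, n-gram lengths, or windowing used in the study.

```python
from collections import Counter

# Hypothetical 5-class reduced alphabet (an assumption for illustration,
# not the grouping used in the study).
REDUCED_ALPHABET = {
    "A": "h", "V": "h", "L": "h", "I": "h", "M": "h", "C": "h",  # hydrophobic
    "F": "a", "W": "a", "Y": "a", "H": "a",                      # aromatic
    "S": "p", "T": "p", "N": "p", "Q": "p",                      # polar
    "G": "s", "P": "s",                                          # small / special
    "D": "c", "E": "c", "K": "c", "R": "c",                      # charged
}


def reduce_sequence(seq):
    """Map each residue to its reduced-alphabet class; unknown residues become 'x'."""
    return "".join(REDUCED_ALPHABET.get(res, "x") for res in seq.upper())


def ngram_frequencies(seq, n=2):
    """Return the relative frequency of each overlapping n-gram in seq."""
    total = len(seq) - n + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def window_features(seq, center, half_width=7, n=2):
    """N-gram frequencies of the reduced-alphabet window centered on one residue."""
    start = max(0, center - half_width)
    end = min(len(seq), center + half_width + 1)
    return ngram_frequencies(reduce_sequence(seq[start:end]), n)
```

Feature dictionaries of this kind could be vectorized and passed to any per-residue classifier; which classifier and which labels to use are left open here, since the abstract does not describe them.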