Effective Automated Feature Construction and Selection for Classification of Biological Sequences

Kamath, Uday; De Jong, Kenneth; Shehu, Amarda

dc.contributor.author	Kamath, Uday
dc.contributor.author	De Jong, Kenneth
dc.contributor.author	Shehu, Amarda
dc.date.accessioned	2015-09-10T18:06:53Z
dc.date.available	2015-09-10T18:06:53Z
dc.date.issued	2014-07-17
dc.description.abstract	Background: Many open problems in bioinformatics involve elucidating underlying functional signals in biological sequences. DNA sequences, in particular, are characterized by rich architectures in which functional signals are increasingly found to combine local and distal interactions at the nucleotide level. Problems of interest include detection of regulatory regions, splice sites, exons, hypersensitive sites, and more. These problems naturally lend themselves to formulation as classification problems in machine learning. When classification is based on features extracted from the sequences under investigation, success is critically dependent on the chosen set of features. Methodology: We present an algorithmic framework (EFFECT) for automated detection of functional signals in biological sequences. We focus here on classification problems involving DNA sequences that state-of-the-art work in machine learning shows to be challenging and to involve complex combinations of local and distal features. EFFECT uses a two-stage process to first construct a set of candidate sequence-based features and then select the most effective subset for the classification task at hand. Both stages make heavy use of evolutionary algorithms to efficiently guide the search towards informative features capable of discriminating between sequences that contain a particular functional signal and those that do not. Results: To demonstrate its generality, EFFECT is applied to three separate problems of importance in DNA research: the recognition of hypersensitive sites, splice sites, and ALU sites. Comparisons with state-of-the-art algorithms show that the framework is both general and powerful. In addition, a detailed analysis of the constructed features shows that they contain valuable biological information about DNA architecture, allowing biologists and other researchers to directly inspect the features and potentially use the insights obtained to assist wet-laboratory studies on retention or modification of a specific signal. Code, documentation, and all data for the applications presented here are provided for the community at http://www.cs.gmu.edu/~ashehu/?q=OurTools.
dc.description.sponsorship	Publication of this article was funded in part by the George Mason University Libraries Open Access Publishing Fund.
dc.identifier.citation	Kamath U, De Jong K, Shehu A (2014) Effective Automated Feature Construction and Selection for Classification of Biological Sequences. PLoS ONE 9(7): e99982. doi:10.1371/journal.pone.0099982
dc.identifier.doi	http://dx.doi.org/10.1371/journal.pone.0099982
dc.identifier.uri	https://hdl.handle.net/1920/9828
dc.language.iso	en_US
dc.publisher	Public Library of Science
dc.rights	Attribution 3.0 United States
dc.rights.uri	https://creativecommons.org/licenses/by/3.0/us/
dc.subject	DNA sequence analysis
dc.subject	Alu elements
dc.subject	Sequence motif analysis
dc.subject	Sequence analysis
dc.subject	Kernel methods
dc.subject	Machine learning
dc.subject	Algorithms
dc.subject	Nucleotide sequencing
dc.title	Effective Automated Feature Construction and Selection for Classification of Biological Sequences
dc.type	Article

Files

Original bundle

Name: 2014-07-17-Kamath.pdf
Size: 1.01 MB
Format: Adobe Portable Document Format
Description: Main article

License bundle

Name: license.txt
Size: 1.63 KB
Format: Item-specific license agreed to upon submission
Description:

Collections

Recipients of OA Publishing Fund
Papers and Publications, Department of Computer Science