Effective Automated Feature Construction and Selection for Classification of Biological Sequences

dc.contributor.author: Kamath, Uday
dc.contributor.author: De Jong, Kenneth
dc.contributor.author: Shehu, Amarda
dc.date.accessioned: 2015-09-10T18:06:53Z
dc.date.available: 2015-09-10T18:06:53Z
dc.date.issued: 2014-07-17
dc.description.abstract: Background: Many open problems in bioinformatics involve elucidating underlying functional signals in biological sequences. DNA sequences, in particular, are characterized by rich architectures in which functional signals are increasingly found to combine local and distal interactions at the nucleotide level. Problems of interest include detection of regulatory regions, splice sites, exons, hypersensitive sites, and more. These problems naturally lend themselves to formulation as classification problems in machine learning. When classification is based on features extracted from the sequences under investigation, success is critically dependent on the chosen set of features. Methodology: We present an algorithmic framework (EFFECT) for automated detection of functional signals in biological sequences. We focus here on classification problems involving DNA sequences, which state-of-the-art work in machine learning shows to be challenging and to involve complex combinations of local and distal features. EFFECT uses a two-stage process to first construct a set of candidate sequence-based features and then select the most effective subset for the classification task at hand. Both stages make heavy use of evolutionary algorithms to efficiently guide the search towards informative features capable of discriminating between sequences that contain a particular functional signal and those that do not. Results: To demonstrate its generality, EFFECT is applied to three separate problems of importance in DNA research: the recognition of hypersensitive sites, splice sites, and ALU sites. Comparisons with state-of-the-art algorithms show that the framework is both general and powerful. In addition, a detailed analysis of the constructed features shows that they contain valuable biological information about DNA architecture, allowing biologists and other researchers to directly inspect the features and potentially use the insights obtained to assist wet-laboratory studies on retention or modification of a specific signal. Code, documentation, and all data for the applications presented here are provided for the community at http://www.cs.gmu.edu/~ashehu/?q=OurTools.
dc.description.sponsorship: Publication of this article was funded in part by the George Mason University Libraries Open Access Publishing Fund.
dc.identifier.citation: Kamath U, De Jong K, Shehu A (2014) Effective Automated Feature Construction and Selection for Classification of Biological Sequences. PLoS ONE 9(7): e99982. doi:10.1371/journal.pone.0099982
dc.identifier.doi: http://dx.doi.org/10.1371/journal.pone.009998
dc.identifier.uri: https://hdl.handle.net/1920/9828
dc.language.iso: en_US
dc.publisher: Public Library of Science
dc.rights: Attribution 3.0 United States
dc.rights.uri: https://creativecommons.org/licenses/by/3.0/us/
dc.subject: DNA sequence analysis
dc.subject: Alu elements
dc.subject: Sequence motif analysis
dc.subject: Sequence analysis
dc.subject: Kernel methods
dc.subject: Machine learning
dc.subject: Algorithms
dc.subject: Nucleotide sequencing
dc.title: Effective Automated Feature Construction and Selection for Classification of Biological Sequences
dc.type: Article

Files

Original bundle
Name: 2014-07-17-Kamath.pdf
Size: 1.01 MB
Format: Adobe Portable Document Format
Description: Main article
License bundle
Name: license.txt
Size: 1.63 KB
Format: Item-specific license agreed upon to submission