Latent Variable Models of Sequence Data for Classification and Discovery

Rangwala, Huzefa
Blasiak, Samuel J.
2014-08-28
2014-08-28
2013-08
https://hdl.handle.net/1920/8798

The need to operate on sequence data is prevalent across a range of real-world applications including protein/DNA classification, speech recognition, intrusion detection, and text classification. Sequence data can be distinguished from the more typical vector representation in that the length of sequences within a dataset can vary and that the order of symbols within a sequence carries meaning. Although it has become increasingly easy to collect large amounts of sequence data, our ability to infer useful information from these sequences has not kept pace. For instance, in the domain of biological sequences, experimentally determining the order of amino acids in a protein is far easier than determining the protein's physical structure or its role within a living organism. This asymmetry holds over a number of sequence data domains, and, as a result, researchers increasingly rely on computational techniques to infer properties of sequences that are either difficult or costly to collect through direct measurement. The methods I describe in this dissertation attempt to mitigate this asymmetry by advancing state-of-the-art techniques for extracting useful information from sequence data.

210 pages
en
Copyright 2013 Samuel J. Blasiak
Computer science; Hidden Markov Model; Latent Variable Model; Neural Network; Sequences; Sparse Dictionary Learning; Topic Model
Dissertation