Latent Variable Models of Sequence Data for Classification and Discovery




Blasiak, Samuel J.




The need to operate on sequence data is prevalent across a range of real-world applications, including protein/DNA classification, speech recognition, intrusion detection, and text classification. Sequence data can be distinguished from the more typical vector representation in two ways: the length of sequences within a dataset can vary, and the order of symbols within a sequence carries meaning. Although it has become increasingly easy to collect large amounts of sequence data, our ability to infer useful information from these sequences has not kept pace. For instance, in the domain of biological sequences, experimentally determining the order of amino acids in a protein is far easier than determining the protein's physical structure or its role within a living organism. This asymmetry holds across a number of sequence data domains, and, as a result, researchers increasingly rely on computational techniques to infer properties of sequences that are either difficult or costly to collect through direct measurement. The methods I describe in this dissertation attempt to mitigate this asymmetry by advancing state-of-the-art techniques for extracting useful information from sequence data.



Computer science, Hidden Markov Model, Latent Variable Model, Neural Network, Sequences, Sparse Dictionary Learning, Topic Model