Mason Archival Repository Service

A Dynamic Graph Model for Representing Streaming Text Documents

Hohman, Elizabeth Leeds
dc.creator Hohman, Elizabeth Leeds 2008-04-25 2008-06-05T19:30:28Z NO_RESTRICTION en 2008-06-05T19:30:28Z 2008-06-05T19:30:28Z
dc.description.abstract This work presents new techniques for representing an evolving stream of text documents. Text processing is traditionally performed on a fixed corpus of documents by representing the documents as vectors in a high-dimensional space, with each dimension corresponding to a different word in the lexicon. The lexicon is formed by the set of unique words in the corpus. The vector entries are equal to the counts of each word in the document and are often weighted by the inverse of the probability of the corresponding word occurring in a document. The probability of word occurrence, also called the document frequency, is needed in order to create document vectors that emphasize the informative words in each document. In order to apply statistical text processing techniques to a changing corpus of documents, a generalization of the vector space model is introduced. The generalization relies on managing a changing lexicon of words and approximating the probability of word occurrence over documents in the document stream. The methods presented here can be used to represent any new document as a vector, including documents that contain words that have not been seen previously in the document stream. Additionally, this work presents a graph model for representing a dynamic corpus of text documents. The graph model differs from other methods for text clustering, which act on a fixed corpus of documents. The vertices in the graph represent topics and evolve as the document stream changes. The vertices contain statistics on documents of a similar topic. Each vertex has an associated lexicon and document frequency, which can be used to provide information about the document stream. The graph model is demonstrated on a dataset of news articles collected over several years.
dc.language.iso en_US en
dc.subject Text Mining en_US
dc.subject Streaming Text en_US
dc.subject Graph Methods en_US
dc.subject Text Clustering en_US
dc.subject Dynamic Graphs en_US
dc.title A Dynamic Graph Model for Representing Streaming Text Documents en
dc.type Dissertation en Doctor of Philosophy in Computational Sciences and Informatics en Doctoral en Computational Sciences and Informatics en George Mason University en
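
The streaming vector space idea described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's method: the class name StreamingVectorizer, the whitespace tokenizer, the add-one smoothing, and the logarithm applied to the inverse probability are all assumptions made for illustration. The sketch keeps a running document count and per-word document counts, lets the lexicon grow as new words arrive, and weights each word count by the log of the inverse of its estimated probability of occurrence, so that a previously unseen word still receives a finite weight.

```python
import math
from collections import Counter


class StreamingVectorizer:
    """Hypothetical helper: running document-frequency estimates over a stream."""

    def __init__(self):
        self.num_docs = 0          # documents seen so far
        self.doc_freq = Counter()  # word -> number of documents containing it

    def update(self, text):
        """Add one document to the running statistics; the lexicon grows as needed."""
        self.num_docs += 1
        # Count each word at most once per document.
        self.doc_freq.update(set(text.lower().split()))

    def vectorize(self, text):
        """Return a sparse vector {word: weight} for a document, seen or unseen."""
        counts = Counter(text.lower().split())
        vector = {}
        for word, tf in counts.items():
            # Smoothed estimate of P(word occurs in a document).
            # Assumption: add-one smoothing, so unseen words (count 0) get p > 0.
            p = (self.doc_freq[word] + 1) / (self.num_docs + 2)
            # Assumption: log of the inverse probability, as in common TF-IDF variants.
            vector[word] = tf * math.log(1.0 / p)
        return vector


stream = StreamingVectorizer()
stream.update("the cat sat")
stream.update("the dog ran")
vec = stream.vectorize("the cat zebra")
```

In this example "zebra" has not appeared in the stream, "cat" has appeared in one document, and "the" in both, so the weights are ordered zebra > cat > the. Calling vectorize does not alter the running statistics; only update does.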
