A Dynamic Graph Model for Representing Streaming Text Documents

dc.contributor.author: Hohman, Elizabeth Leeds
dc.creator: Hohman, Elizabeth Leeds
dc.date: 2008-04-25
dc.date.accessioned: 2008-06-05T19:30:28Z
dc.date.available: NO_RESTRICTION
dc.date.available: 2008-06-05T19:30:28Z
dc.date.issued: 2008-06-05T19:30:28Z
dc.description.abstract: This work presents new techniques for representing an evolving stream of text documents. Text processing is traditionally performed on a fixed corpus of documents by representing the documents as vectors in a high-dimensional space with each dimension corresponding to a different word in the lexicon. The lexicon is formed by the set of unique words in the corpus. The vector entries are equal to the counts of the word in the document and are often weighted by the inverse of the probability of the corresponding word occurring in a document. The probability of word occurrence, also called the document frequency, is needed in order to create document vectors which emphasize the informative words in each document. In order to apply statistical text processing techniques to a changing corpus of documents, a generalization of the vector space model is introduced. The generalization relies on managing a changing lexicon of words and approximating the probability of word occurrence over documents in the document stream. The methods presented here can be used to represent any new document as a vector, including documents that contain words that have not been seen previously in the document stream. Additionally, this work presents a graph model for representing a dynamic corpus of text documents. The graph model differs from other methods for text clustering which act on a fixed corpus of documents. The vertices in the graph represent topics and evolve as the document stream changes. The vertices contain statistics on documents of a similar topic. Each vertex has an associated lexicon and document frequency which can be used to provide information about the document stream. The graph model is demonstrated on a dataset of news articles collected over several years.
dc.identifier.uri: https://hdl.handle.net/1920/3062
dc.language.iso: en_US
dc.subject: Text Mining
dc.subject: Streaming Text
dc.subject: Graph Methods
dc.subject: Text Clustering
dc.subject: Dynamic Graphs
dc.title: A Dynamic Graph Model for Representing Streaming Text Documents
dc.type: Dissertation
thesis.degree.discipline: Computational Sciences and Informatics
thesis.degree.grantor: George Mason University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy in Computational Sciences and Informatics
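
Illustrative note: the abstract describes weighting word counts by inverse document frequency while the lexicon and the document frequencies change with the stream. This record does not give the dissertation's actual estimator, so the following is only a minimal sketch under stated assumptions: whitespace tokenization, running counts as the document-frequency estimate, and add-one smoothing so that previously unseen words receive a finite weight. The class and method names (`StreamingVectorizer`, `update`, `vectorize`) are hypothetical and do not come from the dissertation.

```python
import math
from collections import Counter


class StreamingVectorizer:
    """Hypothetical sketch: lexicon and document-frequency counts that grow with the stream."""

    def __init__(self):
        self.doc_count = 0         # number of documents folded in so far
        self.doc_freq = Counter()  # word -> number of documents containing it

    def update(self, text):
        """Fold one document into the running statistics."""
        self.doc_count += 1
        # Count each word once per document (assumed whitespace tokenization).
        self.doc_freq.update(set(text.lower().split()))

    def vectorize(self, text):
        """Return a sparse {word: weight} vector for a document.

        weight = count * log((N + 1) / (df + 1)). The add-one smoothing is an
        assumption; it keeps the weight finite for words with df = 0.
        """
        counts = Counter(text.lower().split())
        n = self.doc_count
        return {
            word: count * math.log((n + 1) / (self.doc_freq[word] + 1))
            for word, count in counts.items()
        }


vectorizer = StreamingVectorizer()
vectorizer.update("the cat sat")
vectorizer.update("the dog ran")
vector = vectorizer.vectorize("the cat zebra")  # "zebra" has not been seen before
```

In this sketch a word present in every document seen so far gets weight zero, and a word new to the stream gets the largest weight per occurrence, which is one simple way to emphasize informative words without a fixed corpus.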

Files

Original bundle
Name: Hohman_Elizabeth.pdf
Size: 2.13 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 1.72 KB
Format: Item-specific license agreed upon to submission