dc.description.abstract |
This work presents new techniques for representing an evolving stream of text documents.
Text processing is traditionally performed on a fixed corpus of documents by representing the documents as vectors in a high-dimensional space with each dimension
corresponding to a different word in the lexicon. The lexicon is formed by
the set of unique words in the corpus. Each vector entry equals the count of
the corresponding word in the document and is often weighted by the inverse of the
probability of that word occurring in a document. The probability of word occurrence,
also called the document frequency, is needed in order to create document vectors
which emphasize the informative words in each document.
In order to apply statistical text processing techniques to a changing corpus of documents,
a generalization of the vector space model is introduced. The generalization
relies on managing a changing lexicon of words and approximating the probability of
word occurrence over documents in the document stream. The methods presented
here can be used to represent any new document as a vector, including documents
that contain words that have not been seen previously in the document stream.
Additionally, this work presents a graph model for representing a dynamic corpus
of text documents. The graph model differs from other text clustering methods,
which act on a fixed corpus of documents. The vertices in the graph represent topics
and evolve as the document stream changes. The vertices contain statistics on
documents of a similar topic. Each vertex has an associated lexicon and document
frequency, which can be used to provide information about the document stream. The
graph model is demonstrated on a dataset of news articles collected over several years. |
|