Network Neighborhood Analysis For Detecting Anomalies in Time Series of Graphs

Authors

Goswami, Suchismita

Abstract

Terabytes of unstructured electronic data are generated every day from Twitter networks, scientific collaborations, organizational e-mails, telephone calls and websites. Excessive communication in communication networks, particularly in organizational e-mail networks, continues to be a major problem. In some cases, for example the Enron e-mails, frequent contact or excessive activity on interconnected networks leads to fraudulent activities. Analyzing excessive activity in a social network is thus important for understanding the behavior of individuals in subregions of the network. In a social network, anomalies can occur as a result of abrupt changes in the interactions among a group of individuals. Methodologies are therefore needed to analyze and detect excessive communications in dynamic social networks. The motivation of this research is to investigate excessive activities and make inferences in dynamic subnetworks. In this dissertation, I implement new methodologies and techniques to detect excessive communications, topic activities and the associated influential individuals in dynamic networks obtained from organizational e-mails, using scan statistics, multivariate time series models and probabilistic topic modeling. Three major contributions are presented for detecting anomalies in these networks. First, I develop an approach that invokes the log-likelihood ratio as a scan statistic with overlapping and variable window sizes to rank the clusters, and devise a two-step scan process to detect excessive activities, using an organization's e-mail network as a case study. The initial step is to determine the structural stability of the e-mail count time series, perform differencing and de-seasonalizing operations to make the time series stationary, and obtain a primary cluster using a Poisson process model.
I then extract neighborhood ego subnetworks around the observed primary cluster to obtain a more refined cluster, invoking the graph invariant betweenness as the locality statistic under a binomial model. I demonstrate that the two-step scan statistics algorithm is more scalable for detecting excessive activity in large dynamic social networks. Second, I implement, for the first time, multivariate time series models to detect a group of influential people and the dynamic relationships among them that are associated with excessive communications, which cannot be assessed using scan statistics models. For the multivariate modeling, a vector autoregressive (VAR) model is applied to time series of subgraphs in e-mail networks constructed using the graph edit distance, since the nodes (vertices) of the subgraphs are interrelated. Anomalies, or excessive communications, are identified where the residuals of the fitted time series models exceed three standard deviations. Finally, I devise a new method of detecting excessive topic activities in the unstructured text of e-mail contents by combining probabilistic topic modeling with scan statistics algorithms. I first identify the major topics discussed using probabilistic modeling, namely latent Dirichlet allocation (LDA), and then employ scan statistics to find the excessive topic activity, which is the one with the largest log-likelihood ratio in the neighborhood of the primary cluster. These analyses provide new ways of detecting excessive communications and topic flow through the influential vertices in a dynamic network, and can be extended to other dynamic social networks to critically investigate excessive activities.
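
As a rough illustration of the first scan step described above (ranking overlapping, variable-length time windows by a Poisson log-likelihood ratio), the following Python sketch may help. It is not the dissertation's implementation. It assumes a Kulldorff-style Poisson likelihood ratio with a homogeneous null rate, works on raw counts rather than a differenced and de-seasonalized series, and uses hypothetical names (poisson_llr, scan_windows) and made-up example counts.

```python
import math


def poisson_llr(c, e, total):
    """Kulldorff-style Poisson log-likelihood ratio for one window.

    c: observed count inside the window
    e: expected count inside the window under the null
    total: observed count over the whole series
    Only excesses (c > e) score above zero.
    """
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    rest = total - c
    # 0 * log(0) is taken as 0 when every event falls inside the window.
    outside = rest * math.log(rest / (total - e)) if rest > 0 else 0.0
    return inside + outside


def scan_windows(counts, min_len=1, max_len=None):
    """Score every window [start, end) with length in [min_len, max_len].

    Assumption: the null model spreads the total count evenly over time,
    so the expected count of a window is proportional to its length.
    Returns (llr, start, end) tuples sorted from highest to lowest LLR.
    """
    n = len(counts)
    total = sum(counts)
    if max_len is None:
        max_len = n // 2  # keep windows shorter than the full series
    # Prefix sums give each window count in constant time.
    prefix = [0]
    for x in counts:
        prefix.append(prefix[-1] + x)
    scored = []
    for length in range(min_len, max_len + 1):
        for start in range(0, n - length + 1):
            c = prefix[start + length] - prefix[start]
            e = total * length / n
            scored.append((poisson_llr(c, e, total), start, start + length))
    scored.sort(reverse=True)
    return scored


# Hypothetical weekly e-mail counts with a burst in weeks 5 to 7.
counts = [4, 5, 3, 6, 4, 20, 22, 19, 5, 4, 3, 5]
best_llr, start, end = scan_windows(counts)[0]
print(f"primary cluster: weeks {start}..{end - 1}, LLR = {best_llr:.2f}")
```

The top-ranked window plays the role of the primary cluster. In the two-step process described above, the second step would then examine ego subnetworks around that cluster with a different locality statistic, which this sketch does not attempt.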

Keywords

Time series analysis, Computational social science, Research methods