Fast Graph Embedding
-
Sampled Spectral Distance Embedding (SSDE)
Contributors: Ali Civril, Eli Bocke-Rivel, Jonathan Purnell,
Malik Magdon-Ismail
Summary:
A given (undirected and connected) graph is
embedded into Cartesian space in such a way as to minimize the loss of
information contained in pairwise distances.
The algorithm executes quickly due to
clever sampling of the nodes and a matrix approximation method that
runs in linear time with respect to the number of nodes in the graph.
Code:
SSDE;
Test Data;
Documentation.
Papers:
SSDE (2007).
Unsupervised Learning
-
Low-Rank Perturbation Gaussian
Mixture Model (LRPGMM)
Contributors: Jonathan Purnell, Saurabh Paul, Malik Magdon-Ismail
Summary:
A modified EM algorithm for
training a Gaussian Mixture Model (GMM). The defining difference is the
approximation of the covariance matrix for each mixture. The GMM can be
trained using the full covariance matrix, a diagonal matrix
approximation (i.e. only the variances), or using a low-rank perturbed
diagonal matrix decomposition (LRPDD). With the LRPDD, the covariance
matrix is represented by a linear number of parameters (with respect to
the dimension of the data being modeled). This decomposition allows for
a linear time (vs. quadratic or cubic) training the GMM and has less
overfitting that occurs with a full rank covariance matrix.
Additionally, there is a naive overlapping clustering algorithm
built-in to the code that clusters the data based on the posterior
probabilities.
Code:
LRPGMM;
Documentation.
Note: This code relies on the ALGLIB
library.
Papers:
LRPGMM (Ideal 2010);
LRPGMM (IJDMMM 2011).
Graph Clustering; Overlapping Communities
-
SSDE-Cluster
Contributors: Jonathan Purnell, Malik Magdon-Ismail
Summary:
Fast overlapping clustering of a graph which
first applies SSDE to embed the graph into
Carteisan coordinates and then uses a GMM to do soft (probabilistic)
clustering.
Finally the data is clustered in one of two
ways. One method is to use the posterior probabilities to cluster the
nodes in such a way as to create higher quality clusters than would be
found using the GMM's naive clustering function. The quality of the
clusters is based on a metric described in the related work. The other
method is to specify the average number of clusters a data sample is
assigned to. The SSDE-Cluster code provides the code for Cluster, which
clusters the data, and the SSDE-Cluster shell script which provides a
wrapper for SSDE, GMM, and Cluster.
Code:
SSDE-Cluster;
Documentation.
Papers:
SSDE-Cluster (SocialCom 2011).
-
Connected Iterative Scan (CIS)
Contributors: Jeff Baumes, Stephen Kelley, Robert Escriva, James Thompson, Mark Goldberg, Malik Magdon-Ismail
Summary:
Another approach to overlapping clustering of a network. Iteratively expands seed clusters in order to maximize a
two-part density function. The first part focuses on improving the density of edges inside the cluster with respect
to edges leaving the cluster, and the second examines the edge probabilities within the cluster. Balance between the two
parts can be controlled with the lambda value [0, 1]. Higher values usually result in a smaller number of communities, with a lambda
value of 1 only considering cliques as clusters.
Code:
CIS;
Documentation.
Papers:
Overlapping Communities
(SocialCom 2010);
(ISI 2005)
-
Overlapping Clustering Wrapper
Summary:
The code provides a central program where multiple overlapping clustering algorithms can be run from. Currently, there is support for the following algorithms:
(Algorithms other than Connected Iterative Scan must be installed separately - links are provided to the software)
Clique Percolation; CONGA; Connected Iterative Scan; COPRA; Link Communities; OSLOM; Repeated Random Walks; SLPA
Code:
Overlapping Clustering;
Documentation.
Papers:
Clique Percolation; CONGA; COPRA; Repeated Random Walks; SLPA; Link Communities; OSLOM
-
Spectral Clustering of Signed Networks
Contributors: Pranay Anchuri, Malik Magdon-Ismail
Summary:
A fast algorithm for partitioning signed networks based on minimizing
frustration or maximizing signed modularity. The algorithm uses a spectral
approach to obtain a partition into 2 comunities and then iteratively improves
this partitioning by moving the nodes which have low certainty. A partition
into more than two components is obtained by recursively ontinuing to partition
the current partioning as long as the objective can be improved.
Code:
Spectral Partition of Signed Graphs;
Documentation.
Papers:
Communities and Balance in Signed Networks: A Spectral Approach (ASONAM 2012).
Ranking and Prominence in Networks
-
ABC-Centrality (attentive betweenness centrality)
Contributors: Xiaohui Lu, Sibel Adali, Malik Magdon-Ismail
Summary:
Betweenness centrality measures how critical a node is to information flow in a network. A node is critical (and hence should have high betweeness) if it is on many shortest paths. Two shortcomings of such a measure are:
- It ignores nodes on ``almost shortest'' paths;
- It assumes that a node can provide the same attention to information flow through each of those shortest paths, no matter how many shortest paths the node controls.
There have been attempts to address these concerns in the literature, with partial success. We provide a new measure, ABC centrality, that measures criticality by the amount of attention a node devotes to the information flow between other nodes. Our measure addresses both the aforementioned concerns and can be computed efficiently. It performs as well or better than betweenness centrality on both stylized networks and large scale real data networks, and hence provides a useful tool for measuring node criticality.
Code:
ABC-Centrality;
Documentation.
Papers:
ABC-Centrality (SocialCom 2011).
-
iHypR (Iterative Hyperedge Ranking)
Contributors: Xiaohui Lu, Sibel Adali, Malik Magdon-Ismail
Summary:
We present an iterative hyperedge ranking (iHypR) algorithm for computing prominence of actors in heterogeneous collaborative networks. The algorithm builds on the assumption that prominent actors collaborate on prominent objects, and prominent objects are naturally grouped into prominent clusters or groups (hyperedges in a graph). iHypR makes use of the relationships between actors, objects and hyperedges to compute a global prominence score for the actors in the network. The algorithm is customized for networks of collaborations, but it is generally applicable without further tuning. A core component in our algorithm is the
clustering of objects into hyperedges. We use SSDE-cluster for this. (SSDE-cluster could be replaced with any overlapping clustering algorithm, but we have
found the SSDE-cluster performs best.
Code:
iHypR;
Documentation.
Papers:
iHypR (submitted).
Hidden Groups
-
Extracting Hidden Groups and their Structure from Streaming Interaction Data
Contributors: Ben Caulfield, Mark K. Goldberg, Mykola Hayvanovych, Malik Magdon-Ismail, and William A. Wallace
Summary:
This program is an implementation of the algorithms described in "Extracting Hidden Groups and their Structure from Streaming Interaction Data" by Mark K. Goldberg, Mykola Hayvanovych, Malik Magdon-Ismail, and William A. Wallace. This program may be used to find chain and sibling communication triples from interaction data. There are two sub-programs that may be run with this program:
The first is used to determine at what frequency triples must occur in order to be considered significant (called the "significance threshold"). This is done by generating lists of random communications based on the original data and finding the number of triples that occur in these lists.
The second sub-program seperates the triples into cliques and may be used to find larger communication-structures in the data.
Code:
Hidden_Groups;
Documentation.
Papers:
Hidden_Groups.pdf.
-
Find Matchings Between a List of Sent Messages and a List of Received Messages
Contributors: Lingxun Hu, Ben Caulfield, and Malik Magdon-Ismail
Summary:
In a social network, each node is able to send or receive messages. Each node can send messages to any other nodes in the network except to itself. The only information we can acquire is whether a node sent or received a message and the time it sent or received that message.
Our goal is to find as many as possible valid matchings between the sent messages and received messages, based on the information we can get.
We use two different algorithms: dynamic programming methods and chain algorithm from the paper below. The output of this program can be used as the input of program "Extracting Hidden Groups and their Structure from Streaming Interaction Data".
Code:
Find_matchings.zip;
Papers:
Hidden_Groups.pdf.
|