13 Jan 2020
* About me and the course
- Fourth year at RPI
- **1st** time teaching Graph Mining
+ I welcome any feedback, topics, etc.
- My research: graph analytical applications at the 'very large' scale
* Go over syllabus
* Go over website (https://www.cs.rpi.edu/~slotag/classes/SP20m/index.html)
* Go over real-world datasets
* Talk about project
- Take a look at the real-world data repositories listed on the website
- Prior projects:
+ Using vertex centrality analysis to identify population centers on
unlabeled road network data
+ Recommender systems for Amazon co-purchasing dataset
+ Grad students: you can use/involve your research topic
================================================================================
* Define graphs
Graphs => vertex set V(G) and edge set E(G)
V(G) = {v1, v2, v3, ... , vn}, |V| = n
E(G) = {e1, e2, e3, ... , em}, |E| = m
ei = {u, v}
Vertices => represent discrete objects (e.g., people in a social network)
Edges => relationship between these objects (e.g., friendships in social net)
Vertices and edges can have associated meta-data/label
Vertex => e.g., in Amazon copurchasing network -> {people, products}
Edges => e.g., in copurchasing network -> {bought, looked at}
Other labels exist, such as weights, timestamps in a temporal graph (when the
edge was created), etc.
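A minimal sketch of the definitions above in plain Python. The example vertices and edges, and the names `vertex_label`, `edge_label`, and `adj`, are made up for illustration; edges are stored as unordered pairs to match ei = {u, v}.

```python
# Vertex set V(G) and edge set E(G) for a tiny, made-up co-purchasing graph.
V = ["alice", "bob", "book", "lamp"]
E = [frozenset(("alice", "book")),
     frozenset(("bob", "book")),
     frozenset(("bob", "lamp"))]

n = len(V)  # |V| = n
m = len(E)  # |E| = m

# Meta-data/labels attached to vertices and edges (illustrative values).
vertex_label = {"alice": "person", "bob": "person",
                "book": "product", "lamp": "product"}
edge_label = {E[0]: "bought", E[1]: "looked at", E[2]: "bought"}

# Adjacency sets: for each vertex, the set of its neighbors.
adj = {v: set() for v in V}
for e in E:
    u, v = tuple(e)
    adj[u].add(v)
    adj[v].add(u)

print(n, m, sorted(adj["book"]))
```

Because each edge is a frozenset, looking up `edge_label[frozenset(("book", "alice"))]` works regardless of the order in which the endpoints are written.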
================================================================================
* Real-world graph properties
Sparsity: n < m << O(n^2) => number of edges is much less than the number of
possible edges
Degree skew: the number of low degree vertices >> high degree vertices
Hubs: high degree or otherwise important vertices that have some functional
importance within the network
Irregularity: information networks, social networks, etc. are not constrained by
a physical layout, so their structure is irregular
Small-world: "six degrees of Kevin Bacon"; the average shortest-path length is
small relative to the size of the network, e.g., 4-5 friendship hops on average
between you and any other person on Earth
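A rough sketch of how some of these properties could be measured on a small undirected graph, using only the standard library. The toy graph (one hub plus a short tail) and all function names are assumptions for illustration; the average path length here is taken over reachable pairs only.

```python
from collections import Counter, deque

def build_adj(n, edges):
    """Adjacency sets for an undirected graph on vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def density(adj):
    """Fraction of possible edges present: m / (n choose 2)."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return m / (n * (n - 1) / 2)

def degree_histogram(adj):
    """Map degree -> number of vertices with that degree."""
    return Counter(len(nbrs) for nbrs in adj.values())

def hop_distances(adj, source):
    """BFS hop counts from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def average_shortest_path_length(adj):
    """Mean hop count over all ordered pairs of distinct, reachable vertices."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in hop_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

# Toy graph: vertex 0 is a hub for 1..5, and 6 hangs off vertex 5.
adj = build_adj(7, [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (5, 6)])
print(density(adj), dict(degree_histogram(adj)), average_shortest_path_length(adj))
```

Even on this tiny graph the degree skew is visible: most vertices have degree 1, while a single hub carries most of the edges.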
================================================================================
* Define graph mining applications
Take a look at the syllabus for a list of topics (think about project)
Similarity to "data mining"
- Classification => can we 'classify' a vertex as having some property?
- Clustering => an approach for classification; placing similar vertices into
the same cluster/group/classification
- Prediction => how will a network evolve over time? will a new friendship
emerge? will you buy a product?
- Measurement => how can we measure properties of a network?
Graph mining == data mining, but specifically on graph-structured datasets
================================================================================
* Talk about graph processing approaches
"Think Like a Vertex"
- Many/most graph algorithms can be implemented in a 'vertex centric' way
- Format:
Input: Graph G(V, E), vertex state S
For all v in V:
  S(v) = initialize(v)
For some number of iterations:
  updateAlgorithmicData()
  For all v in V:
    For all neighbors u of v:
      S(v)/S(u) = updateState(S(u), S(v))
For all v in V:
  S(v) = finalize()
return S
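One possible Python rendering of the vertex-centric format above, shown as a sketch rather than a fixed API. The function name `vertex_centric`, its callback parameters, the `max_iters` cap, and the early exit when no state changes are all assumptions. This version only updates S(v), not S(u). As a usage example it computes hop counts from a source vertex.

```python
import math

def vertex_centric(adj, initialize, update_state, finalize, max_iters=100):
    """Run a vertex-centric computation and return the final state S."""
    # Initialization phase: one state value per vertex.
    S = {v: initialize(v) for v in adj}
    # Iteration phase: every vertex looks at each neighbor's state.
    for _ in range(max_iters):
        changed = False
        for v in adj:
            for u in adj[v]:
                new_state = update_state(S[u], S[v])
                if new_state != S[v]:
                    S[v] = new_state
                    changed = True
        if not changed:  # assumption: stop early once states are stable
            break
    # Finalization phase.
    return {v: finalize(S[v]) for v in adj}

# Usage: hop counts from `source` on a small made-up graph (4 is isolated).
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1], 4: []}
source = 0
hops = vertex_centric(
    adj,
    initialize=lambda v: 0 if v == source else math.inf,
    update_state=lambda s_u, s_v: min(s_v, s_u + 1),
    finalize=lambda s: s,
)
print(hops)
```

Vertices that cannot be reached from the source keep their initial value of infinity.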
================================================================================
* NetworkX and coding
Today: connected components algorithms
A graph is 'connected' if there exists a path between every pair u,v in V
e.g., u->w->y->...->v
Connected components: maximal subgraphs of a (possibly disconnected) graph G
that are themselves connected
Two algorithms:
BFS(G, S)
  For all v in V:
    S(v) = -1 # haven't determined connected component yet
  numComponents = 0
  # one pass over V assigns every vertex a component
  For all v in V:
    if S(v) > -1:
      continue
    Q <- v
    S(v) = numComponents
    While Q != empty:
      w <- remove next vertex from Q
      For all neighbors u of w:
        if S(u) == -1:
          Q <- u
          S(u) = numComponents
    numComponents += 1
  return S
LabelPropagation(G, S)
  counter = 0
  For all v in V:
    S(v) = counter
    counter += 1
  updates = 1
  While updates > 0:
    updates = 0
    For all v in V:
      For all neighbors u of v:
        if S(u) > S(v):
          S(v) = S(u)
          updates += 1
  return S
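The two algorithms above can be sketched in plain Python as follows. The adjacency-dict input format, the function names, the helper `as_partition`, and the toy graph are assumptions for illustration. The two methods assign different label values but should group the vertices identically.

```python
from collections import deque

def bfs_components(adj):
    """Label each vertex with a component id 0, 1, 2, ... using BFS."""
    S = {v: -1 for v in adj}  # -1: component not determined yet
    num_components = 0
    for v in adj:
        if S[v] > -1:
            continue
        S[v] = num_components
        queue = deque([v])
        while queue:
            w = queue.popleft()
            for u in adj[w]:
                if S[u] == -1:
                    S[u] = num_components
                    queue.append(u)
        num_components += 1
    return S

def label_propagation_components(adj):
    """Give every vertex a unique label, then spread the largest label."""
    S = {v: counter for counter, v in enumerate(adj)}
    updates = 1
    while updates > 0:
        updates = 0
        for v in adj:
            for u in adj[v]:
                if S[u] > S[v]:
                    S[v] = S[u]
                    updates += 1
    return S

def as_partition(S):
    """Convert vertex -> label into a set of vertex groups (helper, assumed)."""
    groups = {}
    for v, label in S.items():
        groups.setdefault(label, set()).add(v)
    return {frozenset(g) for g in groups.values()}

# Toy undirected graph with three components: {a, b, c}, {d, e}, {f}.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"],
       "d": ["e"], "e": ["d"], "f": []}

bfs_labels = bfs_components(adj)
lp_labels = label_propagation_components(adj)
print(as_partition(bfs_labels) == as_partition(lp_labels))
```

BFS visits each vertex and edge once, while label propagation may sweep the whole graph several times before labels stop changing; its appeal is that it fits the vertex-centric format directly.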