13 Jan 2020

* About me and the course
  - Fourth year at RPI
  - **1st** time teaching Graph Mining
    + I welcome any feedback, topics, etc.
  - My research: graph analytical applications at the 'very large' scale

* Go over syllabus

* Go over website (https://www.cs.rpi.edu/~slotag/classes/SP20m/index.html)

* Go over real-world datasets

* Talk about project
  - Take a look at the real-world data repositories listed on the website
  - Prior projects:
    + Using vertex centrality analysis to identify population centers on
      unlabeled road network data
    + Recommender systems for Amazon co-purchasing dataset
    + Grad students: you can use/involve your research topic

================================================================================
* Define graphs 

Graphs => vertex set V(G) and edge set E(G)

V(G) = {v1, v2, v3, ... , vn}, |V| = n
E(G) = {e1, e2, e3, ... , em}, |E| = m
ei = {u, v}, where u, v in V(G)

Vertices => represent discrete objects (e.g., people in a social network)
Edges => represent relationships between these objects (e.g., friendships in a
social network)

Vertices and edges can have associated metadata/labels
Vertices => e.g., in Amazon co-purchasing network -> {people, products}
Edges => e.g., in co-purchasing network -> {bought, looked at}

Other labels exist, such as weights, time in a temporal graph (when the edge
was created), etc.
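
A minimal sketch of the definitions above in plain Python, assuming a made-up
toy co-purchasing graph (the vertex names and labels are hypothetical, not from
a real dataset). Each undirected edge {u, v} is stored as a frozenset so that
{u, v} and {v, u} are the same edge:

```python
# Hypothetical toy graph; names and labels are invented for illustration.
V = {"alice", "bob", "book", "lamp"}

# Each undirected edge e_i = {u, v} is a frozenset, so order does not matter.
E = {
    frozenset({"alice", "book"}),
    frozenset({"alice", "lamp"}),
    frozenset({"bob", "book"}),
}

n, m = len(V), len(E)  # |V| = n, |E| = m

# Vertex and edge labels (metadata) kept in plain dictionaries.
vertex_label = {
    "alice": "person",
    "bob": "person",
    "book": "product",
    "lamp": "product",
}
edge_label = {
    frozenset({"alice", "book"}): "bought",
    frozenset({"alice", "lamp"}): "looked at",
    frozenset({"bob", "book"}): "bought",
}

print(n, m)
print(edge_label[frozenset({"book", "alice"})])
```

Weights or timestamps could be stored the same way, as additional
dictionaries keyed by edge.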

================================================================================
* Real-world graph properties

Sparsity: n < m << O(n^2) => number of edges is much less than the number of 
possible edges

Degree skew: the number of low-degree vertices >> the number of high-degree
vertices

Hubs: high-degree or otherwise central vertices that play an important
functional role within the network

Irregularity: information networks, social networks, etc. are not constrained
by a physical layout, so their structure follows no regular (e.g., grid-like)
pattern

Small-world: "six degrees of Kevin Bacon"; the average shortest-path length is
small relative to the size of the network, e.g., roughly 4-5 friendship hops on
average between you and any other person on Earth
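
A rough sketch of how some of these properties could be measured with
NetworkX, assuming the small karate club social network that ships with the
library as a stand-in dataset (an assumption for illustration; it is far
smaller than the 'real-world' graphs discussed here, so the effects are mild):

```python
import networkx as nx

# Zachary's karate club: a small social network bundled with NetworkX.
G = nx.karate_club_graph()
n, m = G.number_of_nodes(), G.number_of_edges()

# Sparsity: compare m against the n*(n-1)/2 possible undirected edges.
max_edges = n * (n - 1) // 2
density = nx.density(G)  # m / max_edges for an undirected graph

# Degree skew and hubs: sort (vertex, degree) pairs, highest degree first.
degrees = sorted(G.degree(), key=lambda pair: pair[1], reverse=True)
hubs = degrees[:3]
median_degree = sorted(d for _, d in G.degree())[n // 2]

# Small-world: average shortest-path length over all vertex pairs (in hops).
avg_path = nx.average_shortest_path_length(G)

print(n, m, max_edges, density)
print(hubs, median_degree)
print(avg_path)
```

The same calls apply to larger datasets, though the all-pairs path computation
becomes expensive and is usually replaced by sampling at scale.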

================================================================================
* Define graph mining applications

Take a look at the syllabus for a list of topics (think about project)

Similarity to "data mining"
- Classification => can we 'classify' a vertex with some property?
- Clustering => an approach for classification; placing similar vertices into
  the same cluster/group/classification
- Prediction => how will a network evolve over time? will a new friendship
  emerge? will you buy a product?
- Measurement => how can we measure properties of a network?

Graph mining == data mining, but specifically on graph-structured datasets

================================================================================
* Talk about graph processing approaches

"Think Like a Vertex"
- Many/most graph algorithms can be implemented in a 'vertex centric' way
- Format:

Input: Graph G(V, E), vertex state S
For all v in V:
  S(v) = initialize(v)

For some number of iterations:
  updateAlgorithmicData()
  For all v in V:
    For all neighbors u of v:
      S(v)/S(u) = updateState(S(u), S(v))

For all v in V:
  S(v) = finalize(S(v))

return S
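
One possible Python rendering of this template, as a sketch only. The function
name `vertex_centric`, the adjacency-list dict, and the example graph are
assumptions for illustration; the updateAlgorithmicData() and finalize steps
are omitted for brevity. The example use computes hop distance from vertex 0:

```python
import math

def vertex_centric(adj, initialize, update_state, num_iterations):
    """Run a generic vertex-centric loop over an adjacency-list graph."""
    # Initialization phase: every vertex gets a starting state.
    S = {v: initialize(v) for v in adj}

    for _ in range(num_iterations):
        # Update phase: each vertex combines its state with each neighbor's.
        for v in adj:
            for u in adj[v]:
                S[v] = update_state(S[u], S[v])
    return S

# Hypothetical path graph 0 - 1 - 2 - 3, stored as an adjacency list.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Example use: hop distance from vertex 0.
dist = vertex_centric(
    adj,
    initialize=lambda v: 0 if v == 0 else math.inf,
    update_state=lambda s_u, s_v: min(s_v, s_u + 1),
    num_iterations=len(adj),  # at most n passes are needed on n vertices
)
print(dist)
```

Only the `initialize` and `update_state` callbacks change between algorithms;
the loop structure stays the same.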

================================================================================
* NetworkX and coding

Today: connected components algorithms

A graph is 'connected' if there exists a path between every pair of vertices
u,v in V
e.g., u->w->y->...->v

Connected components: maximal subgraphs of a (possibly disconnected) graph G
that are themselves connected
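
For reference, NetworkX already provides these checks. A short sketch on a
made-up graph (the edge list is hypothetical) with one triangle and one
separate edge:

```python
import networkx as nx

# Hypothetical graph: a triangle {0, 1, 2} plus a separate edge {3, 4}.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 0), (3, 4)])

connected = nx.is_connected(G)
num_components = nx.number_connected_components(G)

# connected_components yields one set of vertices per component.
components = sorted(sorted(c) for c in nx.connected_components(G))

print(connected, num_components, components)
```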

Two algorithms:

BFS(G, S)
For all v in V:
  S(v) = -1  # haven't determined connected component yet

numComponents = 0
For all v in V:
  if S(v) > -1:
    continue  # v already belongs to a component

  Q <- v
  S(v) = numComponents

  While Q != empty:
    w <- dequeue(Q)
    For all neighbors u of w:
      if S(u) == -1:
        Q <- u
        S(u) = numComponents

  numComponents += 1

return S
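
A runnable sketch of the BFS approach in plain Python, assuming the graph is
given as an adjacency-list dict (the function name `bfs_components` and the
example graph are made up for illustration):

```python
from collections import deque

def bfs_components(adj):
    """Label each vertex with a component id using repeated BFS."""
    S = {v: -1 for v in adj}  # -1 means "component not yet determined"
    num_components = 0

    for root in adj:
        if S[root] > -1:
            continue  # already reached by an earlier search

        # Start a new BFS from an unlabeled vertex.
        S[root] = num_components
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if S[u] == -1:
                    S[u] = num_components
                    queue.append(u)

        num_components += 1

    return S

# Hypothetical graph: triangle {0, 1, 2}, edge {3, 4}, isolated vertex 5.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3], 5: []}
bfs_labels = bfs_components(adj)
print(bfs_labels)
```

Each vertex and edge is visited a constant number of times, so the work is
O(n + m).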


LabelPropagation(G, S)
counter = 0
For all v in V:
  S(v) = counter
  counter += 1

updates = 1
While updates > 0:
  updates = 0
  For all v in V:
    For all neighbors u of v:
      if S(u) > S(v):
        S(v) = S(u)
        updates += 1

return S
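
A runnable sketch of the label propagation approach in plain Python, assuming
an undirected graph stored as a symmetric adjacency-list dict (the function
name `label_propagation_components` and the example graph are made up for
illustration). Every vertex ends up with the largest initial label in its
component:

```python
def label_propagation_components(adj):
    """Propagate the maximum label through each connected component."""
    # Give every vertex a unique starting label: 0, 1, 2, ...
    S = {v: i for i, v in enumerate(adj)}

    updates = 1
    while updates > 0:
        updates = 0
        for v in adj:
            for u in adj[v]:
                # Adopt the neighbor's label if it is larger than our own.
                if S[u] > S[v]:
                    S[v] = S[u]
                    updates += 1

    return S

# Hypothetical graph: triangle {0, 1, 2}, edge {3, 4}, isolated vertex 5.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3], 5: []}
lp_labels = label_propagation_components(adj)
print(lp_labels)
```

Unlike BFS, the component ids are not consecutive, and the number of passes
grows with the longest path a label must travel; in exchange, the inner loop
fits the vertex-centric format directly.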