07 Apr 2022

Recall from last class, Graph Mining = 
- Link Prediction
- Centrality
- Clustering and community detection
- Vertex classification and label prediction

Now ==> Subgraph Mining
(obviously important to graph mining since it has almost same name)
Big Dog (draw big dog)

Many different subproblems under this big ol' umbrella:
- Triangle counting (clustering coefficient, CD, etc.)
- Template matching 
  * Matching a general input template, aka subgraph isomorphism
  * Subgraph counting (how many show up)
  * Subgraph enumeration (where are they actually embedded)
- Motif finding and anomaly detection
  * What subgraphs show up more/less than expected
  * aka Homework problem for grad students
- Subgraph-based comparative analytics
  * Comparing networks based on their constituent subgraphs
- Graph alignment
  * How similar are these two networks, aligning similar regions
  * Can consider as a generalization of subgraph isomorphism
  * Non-exact, contains a cost function

================================================================================
Basic Problem: Subgraph isomorphism

Graph isomorphism: 
- draw a basic example
- Tougher problem than you think: in NP, but not known to be in P or to be
  NP-complete (best known general algorithm is quasi-polynomial)
- Current approach: convert graphs to a canonical string s.t. two graphs are
  isomorphic iff they map to the same string
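
The canonical-string idea can be sketched by brute force for tiny graphs. This
is only an illustration: it tries all n! relabelings, which real canonical
labeling tools avoid. The function name and the bit-string encoding are
assumptions made for this sketch.

```python
from itertools import permutations

def canonical_string(n, edges):
    """Smallest upper-triangle adjacency string over all relabelings (O(n!))."""
    best = None
    for perm in permutations(range(n)):
        # perm[v] is the new label given to vertex v
        relabeled = {frozenset((perm[u], perm[v])) for u, v in edges}
        bits = "".join(
            "1" if frozenset((i, j)) in relabeled else "0"
            for i in range(n) for j in range(i + 1, n)
        )
        if best is None or bits < best:
            best = bits
    return best

# Two different drawings of a 4-cycle, plus a path on 4 vertices
cycle_a = [(0, 1), (1, 2), (2, 3), (3, 0)]
cycle_b = [(0, 2), (2, 1), (1, 3), (3, 0)]
path = [(0, 1), (1, 2), (2, 3)]

print(canonical_string(4, cycle_a) == canonical_string(4, cycle_b))
print(canonical_string(4, cycle_a) == canonical_string(4, path))
```

Isomorphic graphs get the same string because the minimum is taken over the
same set of relabeled graphs.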

What we care about: subgraph isomorphism
- Good: ton of applications as we've seen
- Bad: Much tougher than graph isomorphism 
  * NP-complete for decision problem; brute force is O(n^k) for a k-vertex
    subgraph in an n-vertex graph
  * and we need to solve this for all of the above analytics (womp)
  * However, certain subgraph structures are much easier (more later)

Note: induced vs. non-induced subgraph isomorphism
- Induced must match all edges AND non-edges
- Non-induced DGAF about them non-edges
- Draw examples
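
A small brute-force sketch of the induced vs. non-induced distinction, assuming
vertices are numbered 0..n-1 (the function name is made up for this example).
It tries every injective mapping of template vertices into the graph, which is
the O(n^k) brute force. Note it counts mappings, so a template with symmetries
is counted once per automorphism.

```python
from itertools import permutations

def count_embeddings(t_n, t_edges, g_n, g_edges, induced):
    """Count injective maps of template vertices into G that preserve edges
    (and also non-edges when induced=True)."""
    t = {frozenset(e) for e in t_edges}
    g = {frozenset(e) for e in g_edges}
    count = 0
    for image in permutations(range(g_n), t_n):
        ok = True
        for i in range(t_n):
            for j in range(i + 1, t_n):
                in_t = frozenset((i, j)) in t
                in_g = frozenset((image[i], image[j])) in g
                if in_t and not in_g:
                    ok = False  # a template edge is missing in G
                if induced and in_g and not in_t:
                    ok = False  # a template non-edge is an edge in G
        count += ok
    return count

wedge = [(0, 1), (1, 2)]             # path a-b-c
triangle = [(0, 1), (1, 2), (0, 2)]

print(count_embeddings(3, wedge, 3, triangle, induced=False))
print(count_embeddings(3, wedge, 3, triangle, induced=True))
```

The wedge maps into the triangle in the non-induced sense, but never in the
induced sense, because the wedge's missing edge (a,c) is present in the
triangle.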

================================================================================
Triangle Counting

Triangles are cool, so they get their own special mention
- Why: very useful for analyzing clustering properties of networks
- Most interesting networks have such clustering

Can be much easier than larger subgraphs to count and enumerate
- O(n^w) complexity, where w < 2.373 is the fast matrix multiplication exponent
- Essentially just computing closure of all wedges (a-b-c)
- Really: for sparse graphs, O(m^(3/2))
  * Some algorithms: runtime closer to the 4/3 moment of the degree distribution
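
One common way to get the sparse-graph bound is to orient every edge from its
lower-degree endpoint to its higher-degree endpoint and only close wedges along
those directions. A minimal sketch, assuming the graph is a dict mapping each
vertex to a set of neighbors (undirected, no self-loops) and vertex ids are
comparable:

```python
def count_triangles_ordered(adj):
    """Triangle count using a degree ordering of the vertices."""
    # Rank by (degree, id) so that ties are broken consistently.
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}
    # Keep only neighbors with higher rank: each edge is stored once.
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    count = 0
    for v in adj:
        for u in out[v]:
            # A triangle is found exactly once, at its lowest-ranked vertex.
            count += len(out[v] & out[u])
    return count

k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
print(count_triangles_ordered(k4))
```

No division by 3 is needed here, since the orientation already removes the
repeated discoveries.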

Basic Algorithm
count = 0
For all v in V(G):
  For all unordered pairs {u,w} in N(v):
    if (u,w) in E(G):
      count += 1
count /= 3    (each triangle is found once from each of its 3 vertices)

Most subgraph count/enumeration algorithms use some variation of above approach.
- Need to be sure about induced vs. non-induced
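
The basic algorithm can be written as runnable Python along these lines. This
is a sketch that assumes the graph is stored as a dict of neighbor sets; the
representation is a choice made here, not something fixed by the notes.

```python
from itertools import combinations

def count_triangles(adj):
    """Close every wedge u-v-w centered at v, then divide by 3."""
    count = 0
    for v in adj:
        # unordered pairs of neighbors of v
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:
                count += 1
    # each triangle was found once from each of its three vertices
    return count // 3

# triangle 0-1-2 with a pendant vertex 3 attached to 2
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(count_triangles(g))
```

To count induced wedges instead of triangles, the same loop would count the
pairs where the closing edge is absent.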

================================================================================
Template Matching

The general term given for actually doing subgraph isomorphism
- We consider some input 'template' -- i.e., the subgraph we're looking for in G
- Not only graph topology, but often vertex/edge labels
  * Luckily, much easier in practice, as labels restrict the search space
    (worst case is still NP-complete)
  * Draw an example
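
A hedged sketch of how labels prune the search: a simple backtracking matcher
that only tries graph vertices whose label equals the template vertex's label.
The function name, the dict-of-sets representation, and the toy labels are all
assumptions for illustration. It enumerates non-induced matches.

```python
def match_template(t_adj, t_label, g_adj, g_label):
    """Enumerate label-preserving, non-induced embeddings of a template."""
    order = list(t_adj)
    results = []

    def extend(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        tv = order[len(mapping)]
        for gv in g_adj:
            # labels must agree, and the mapping must stay injective
            if g_label[gv] != t_label[tv] or gv in mapping.values():
                continue
            # every already-mapped template neighbor must be adjacent in G
            if all(mapping[tn] in g_adj[gv] for tn in t_adj[tv] if tn in mapping):
                mapping[tv] = gv
                extend(mapping)
                del mapping[tv]

    extend({})
    return results

g_adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
g_label = {0: "person", 1: "school", 2: "person", 3: "city"}
t_adj = {"p": {"s"}, "s": {"p"}}
t_label = {"p": "person", "s": "school"}

print(match_template(t_adj, t_label, g_adj, g_label))
```

With every vertex given the same label, the same template has 6 matches in this
graph instead of 2, which is the search-space reduction in miniature.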

Template search applications
- Facebook graph search (from my interpretation)
  * "Find all people who went to Penn State who are friends with someone from 
    Troy, NY who live in Portland, OR and play ultimate frisbee"
  * Draw example affiliation network template
  * Unfortunately, intelligence agencies and stalkers abused it quite readily
    (obviously), so they shut 'er dahn
- Financial networks
  * Known bad financial transaction patterns (Russians)
- Information/knowledge networks, relational data
  * data-relation-data
  * (person)-(married)-(person)-(lived in)-(troy)
  * What a lot of databases do: all about the joins

Generally, to solve template search/subgraph isomorphism
- Brute force - bad
- Dynamic programming
  * Better in practice, but complexity still bad
  * Show example
- Approximate algorithm - many exist
  * Sampling, trick is how you do it to ensure guarantees
  * Most useful for counting, and for some subset of enumeration-based analytics
- Color-coding
  * Show example, combines approximate+dynamic programming
- Parallel algorithms
  * Can be applied, and are usually applied, to all of the above
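
A minimal color-coding sketch for one simple template: counting paths on k
vertices. Each trial colors vertices uniformly at random with k colors, a
dynamic program counts the "colorful" paths (all k colors distinct), and the
result is scaled by k^k / k!, the inverse of the chance that a fixed path is
colorful. Function names and the dict-of-sets graph are assumptions; general
tree templates need a more involved DP.

```python
import random
from math import factorial

def count_colorful_paths(adj, color, k):
    """Count k-vertex paths whose vertices all have distinct colors 0..k-1."""
    # table[v][mask] = colorful paths ending at v using exactly that color set
    table = {v: {1 << color[v]: 1} for v in adj}
    for _ in range(k - 1):
        nxt = {v: {} for v in adj}
        for v in adj:
            bit = 1 << color[v]
            for u in adj[v]:
                for mask, cnt in table[u].items():
                    if not mask & bit:  # v's color is not used yet
                        nxt[v][mask | bit] = nxt[v].get(mask | bit, 0) + cnt
        table = nxt
    total = sum(sum(row.values()) for row in table.values())
    # for k > 1 each undirected path was found from both of its ends
    return total // 2 if k > 1 else total

def estimate_paths(adj, k, trials, seed=0):
    """Average colorful counts over random colorings and rescale."""
    rng = random.Random(seed)
    scale = k ** k / factorial(k)
    total = 0
    for _ in range(trials):
        color = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, color, k)
    return scale * total / trials

tri = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(count_colorful_paths(tri, {0: 0, 1: 1, 2: 2}, 3))
print(estimate_paths(tri, 3, trials=200))
```

Distinct colors force distinct vertices, so the DP never has to remember which
vertices a partial path used, only which colors.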

================================================================================
Motif Finding

Motif is a subgraph that occurs more frequently than you might otherwise expect
- We've already talked about it in context of null models
- I.e., how we'd define 'more frequently'
- We play it fast n' loose here

Why care that it's more frequent: implies some structural or topological function
- Social networks: way more triangles than we'd expect, because of all the 
  reasons we talked about at the beginning of the semester.
- Protein interaction network: strength of interactions between proteins within
  a biological network (outside of Slota's expertise)
- Financial transaction network: diffusion of money through banks, people, etc.

Last thing brings us to the related problem: anomaly detection
- Finding patterns that are less frequent than we expect
- Can be more useful for financial transaction networks
  * Weird pattern that represents 'abnormal' transaction patterns
  * 'abnormal' maybe relative to other financial transaction networks (temporal)

Interesting ongoing problem: cortical conjecture
- Consider brain network defined via connections
- Structural connectome: physical neuron-neuron connections
- Functional connectome: higher-level functional region connections
- Diffusion of electrical signals through the connectomes
- Cortical Conjecture: there exist repeated substructures within the brain
  connectome that explain 'intelligence computation'

How do we do this:
- Counting occurrences of various templates in network A
- Counting occurrences of same templates in network B
  * If C(T, A) >> C(T, B), then we have ourselves a motif
- Where our null models come into play (our network Bs)
  * Average over n null-model samples B_1..B_n:
    C(T, A) >> (1/n) * sum_{i=1..n} C(T, B_i)
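
A rough sketch of the comparison, using triangles as the template T. The null
model here is an assumption chosen for brevity: random graphs with the same
number of vertices and edges. A degree-preserving null model is often a better
fit. All function names are made up for this example.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """C(T, G) for T = triangle, with G given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = sum(1 for v in adj
                for u, w in combinations(adj[v], 2) if w in adj[u])
    return found // 3

def random_gnm(n, m, rng):
    """Assumed null model: m edges chosen uniformly among n vertices."""
    return rng.sample(list(combinations(range(n), 2)), m)

def motif_ratio(count_a, null_counts):
    """C(T, A) divided by the mean of C(T, B_i)."""
    mean = sum(null_counts) / len(null_counts)
    return count_a / mean if mean > 0 else float("inf")

# Network A: two disjoint 5-cliques (10 vertices, 20 edges)
a_edges = [e for block in (range(0, 5), range(5, 10))
           for e in combinations(block, 2)]
rng = random.Random(0)
null_counts = [count_triangles(random_gnm(10, 20, rng)) for _ in range(50)]
print(motif_ratio(count_triangles(a_edges), null_counts))
```

A ratio well above 1 suggests a motif and a ratio well below 1 an
under-represented pattern. How far from 1 counts as significant depends on the
spread of the null counts, which this sketch does not measure.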