07 Apr 2022

Recall from last class, Graph Mining =
- Link Prediction
- Centrality
- Clustering and community detection
- Vertex classification and label prediction

Now ==> Subgraph Mining
(obviously important to graph mining since it has almost the same name)

Big Dog (draw big dog)

Many different subproblems under this big ol' umbrella:
- Triangle counting (clustering coefficient, community detection, etc.)
- Template matching
  * Matching a general input template, aka subgraph isomorphism
  * Subgraph counting (how many show up)
  * Subgraph enumeration (where are they actually embedded)
- Motif finding and anomaly detection
  * What subgraphs show up more/less than expected
  * aka Homework problem for grad students
- Subgraph-based comparative analytics
  * Comparing networks based on their constituent subgraphs
- Graph alignment
  * How similar are these two networks, aligning similar regions
  * Can consider as a generalization of subgraph isomorphism
  * Non-exact, contains a cost function

================================================================================
Basic Problem: Subgraph isomorphism

Graph isomorphism:
- draw a basic example
- Tougher problem than you think: it is in NP, but not known to be in P or to be NP-complete
- Current approach: convert each graph to a canonical string s.t. two graphs are isomorphic iff they give the same string

What we care about: subgraph isomorphism
- Good: ton of applications as we've seen
- Bad: Much tougher than graph isomorphism
  * Decision problem is NP-complete; brute force is O(n^k) for a k-vertex template
  * and we need to solve this for all of the above analytics (womp)
  * However, certain subgraph structures are much easier (more later)

Note: induced vs.
non-induced subgraph isomorphism
- Induced must match all edges AND non-edges
- Non-induced DGAF about them non-edges
- Draw examples

================================================================================
Triangle Counting

Triangles are cool, so they get their own special mention
- Why: very useful for analyzing clustering properties of networks
- Most interesting networks have such clustering

Can be much easier than larger subgraphs to count and enumerate
- O(n^w) complexity, where w is the fast matrix multiplication exponent
- Essentially just computing closure of all wedges (a-b-c)
- Really: for sparse graphs, O(m^(3/2))
  * Some algorithms: closer to 4/3 moment of degree distribution

Basic Algorithm
  count = 0
  For all v in V(G):
    For all unordered pairs {u,w} in N(v):
      if (u,w) in E(G):
        count += 1
  count /= 3    (each triangle gets found once per corner)

Most subgraph count/enumeration algorithms use some variation of the above approach.
- Need to be sure about induced vs. non-induced

================================================================================
Template Matching

The general term given for actually doing subgraph isomorphism
- We consider some input 'template' -- i.e., the subgraph we're looking for in G
- Not only graph topology, but often vertex/edge labels
  * Luckily, usually a much easier problem in practice, as labels restrict the search space
  * Draw an example

Template search applications
- Facebook graph search (from my interpretation)
  * "Find all people who went to Penn State who are friends with someone from Troy, NY who live in Portland, OR and play ultimate frisbee"
  * Draw example affiliation network template
  * Unfortunately, lot of abuse: intelligence agencies and stalkers abused it quite readily (obviously), so they shut 'er dahn
- Financial networks
  * Known bad financial transaction patterns (Russians)
- Information/knowledge networks, relational data
  * data-relation-data
  * (person)-(married)-(person)-(lived in)-(troy)
  * What a lot of databases do: all about the joins

Generally, to solve template search/subgraph isomorphism
- Brute force - bad
- Dynamic programming
  * Better in practice, but complexity still bad
  * Show example
- Approximate algorithm - many exist
  * Sampling, trick is how you do it to ensure guarantees
  * Most useful for counting, some subset of enumeration-based analytics
- Color-coding
  * Show example, combines approximate+dynamic programming
- Parallel algorithms
  * Can be applied, and are usually applied, to all of the above

================================================================================
Motif Finding

Motif is a subgraph that occurs more frequently than you might otherwise expect
- We've already talked about it in context of null models
- I.e., how we'd define 'more frequently'
- We play it fast n' loose here

Why more frequent matters: it implies some structural or topological function
- Social networks: way more triangles than we'd expect, because of all the reasons we talked about at the beginning of the semester.
- Protein interaction network: strength of interactions between proteins within a biological network (outside of Slota's expertise)
- Financial transaction network: diffusion of money through banks, people, etc.
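
The "way more triangles than we'd expect" point can be checked on a toy graph. The sketch below is not from the lecture: it assumes a made-up "social" network (ten friend triples joined in a ring) and uses an Erdos-Renyi G(n,p) graph with the same density as a stand-in null model, where the expected triangle count is C(n,3) * p^3. The helper names `count_triangles` and `expected_triangles_gnp` are hypothetical. The counter is the basic wedge-closure algorithm from the Triangle Counting section.

```python
from itertools import combinations
from math import comb

def count_triangles(adj):
    """Wedge closure: for each v, test every unordered pair of neighbors."""
    count = 0
    for v in adj:
        for u, w in combinations(sorted(adj[v]), 2):
            if w in adj[u]:      # wedge u-v-w is closed by edge (u, w)
                count += 1
    return count // 3            # each triangle is seen once per corner

def expected_triangles_gnp(n, m):
    """Expected triangles in G(n, p) with p chosen to match m edges (assumed null model)."""
    p = m / comb(n, 2)
    return comb(n, 3) * p ** 3

# Hypothetical toy network: 10 friend triples, neighboring triples joined by one edge.
groups = 10
n = 3 * groups
edges = []
for g in range(groups):
    a, b, c = 3 * g, 3 * g + 1, 3 * g + 2
    edges += [(a, b), (b, c), (a, c)]   # the triangle itself
    edges.append((c, (c + 1) % n))      # bridge to the next triple

adj = {v: set() for v in range(n)}
for x, y in edges:
    adj[x].add(y)
    adj[y].add(x)

observed = count_triangles(adj)                   # 10 triangles by construction
expected = expected_triangles_gnp(n, len(edges))  # noticeably smaller than observed
print(observed, round(expected, 2))
```

A single closed-form expectation is the fast n' loose version; a more careful comparison would average counts over sampled null-model graphs.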

Last thing brings us to the related problem: anomaly detection
- Finding patterns that are less frequent than we expect
- Can be more useful for financial transaction networks
  * Weird pattern that represents 'abnormal' transaction patterns
  * 'abnormal' maybe relative to other financial transaction networks (temporal)

Interesting ongoing problem: cortical conjecture
- Consider brain network defined via connections
- Structural connectome: physical neuron-neuron connections
- Functional connectome: higher-level functional region connections
- Diffusion of electrical signals through the connectomes
- Cortical Conjecture: there exist repeated substructures within the brain connectome that explain 'intelligence computation'

How do we do this:
- Counting occurrences of various templates in network A
- Counting occurrences of same templates in network B
  * If C(T, A) >> C(T, B), then we have ourselves a motif
- Where our null models come into play (our network Bs)
  * C(T, A) >> (1/n) * sum_i C(T, B_i), averaged over n null-model networks B_i
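
The comparison C(T, A) vs. (1/n) * sum_i C(T, B_i) can be sketched end to end on a small example. Everything below is an illustrative assumption rather than the lecture's method: the counter is the brute-force approach (the "bad" one, fine only for tiny templates), counts are non-induced, the null model is a uniformly random graph with the same number of vertices and edges as A, and the names `count_template`, `random_graph`, and `motif_counts` are hypothetical.

```python
import random
from itertools import combinations, permutations

def build_adj(n, edges):
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def count_template(template_edges, k, adj):
    """Brute-force non-induced count C(T, G); template vertices are 0..k-1."""
    def injective_maps(target_adj):
        # Try every ordered choice of k distinct target vertices.
        return sum(
            all(img[b] in target_adj[img[a]] for a, b in template_edges)
            for img in permutations(list(target_adj), k)
        )
    # Dividing by the template's automorphisms counts each occurrence once.
    automorphisms = injective_maps(build_adj(k, template_edges))
    return injective_maps(adj) // automorphisms

def random_graph(n, m, rng):
    """Assumed null model: n vertices, m edges chosen uniformly at random."""
    return build_adj(n, rng.sample(list(combinations(range(n), 2)), m))

def motif_counts(template_edges, k, adj_a, num_nulls=20, seed=0):
    """Return C(T, A) and the mean of C(T, B_i) over sampled null graphs B_i."""
    rng = random.Random(seed)
    n = len(adj_a)
    m = sum(len(nbrs) for nbrs in adj_a.values()) // 2
    c_a = count_template(template_edges, k, adj_a)
    nulls = [count_template(template_edges, k, random_graph(n, m, rng))
             for _ in range(num_nulls)]
    return c_a, sum(nulls) / num_nulls

TRIANGLE = [(0, 1), (1, 2), (0, 2)]

# Hypothetical network A: 6 triangles joined in a ring (18 vertices, 24 edges).
edges_a = []
for g in range(6):
    a, b, c = 3 * g, 3 * g + 1, 3 * g + 2
    edges_a += [(a, b), (b, c), (a, c), (c, (c + 1) % 18)]
adj_a = build_adj(18, edges_a)

c_a, c_null = motif_counts(TRIANGLE, 3, adj_a)
print(c_a, c_null)   # c_a well above c_null suggests the triangle is a motif in A
```

How large the gap has to be before calling T a motif is exactly where the choice of null model matters; a degree-preserving null model would be a stricter test than the uniform one assumed here.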