13 Jan 2020

* About me and the course
  - Fourth year at RPI
  - **1st** time teaching Graph Mining
    + I welcome any feedback, topics, etc.
  - My research: graph analytical applications at the 'very large' scale
* Go over syllabus
* Go over website (https://www.cs.rpi.edu/~slotag/classes/SP20m/index.html)
* Go over real-world datasets
* Talk about project
  - Take a look at the real-world data repositories listed on the website
  - Prior projects:
    + Using vertex centrality analysis to identify population centers on
      unlabeled road network data
    + Recommender systems for Amazon co-purchasing dataset
    + Grad students: you can use/involve your research topic

================================================================================

* Define graphs

Graphs => vertex set V(G) and edge set E(G)
  V(G) = {v1, v2, v3, ... , vn}, |V| = n
  E(G) = {e1, e2, e3, ... , em}, |E| = m
  ei = {u, v}, with u, v in V(G)

Vertices => represent discrete objects (e.g., people in a social network)
Edges => relationships between these objects (e.g., friendships in a social
  network)

Vertices and edges can have associated meta-data/labels
  Vertices => e.g., in Amazon co-purchasing network -> {people, products}
  Edges => e.g., in co-purchasing network -> {bought, looked at}
  Other labels exist, such as weights, time in a temporal graph (when an edge
  was created), etc.

================================================================================

* Real-world graph properties

Sparsity: n < m << O(n^2) => the number of edges is much less than the number
  of possible edges
Degree skew: the number of low degree vertices >> the number of high degree
  vertices
Hubs: high degree or otherwise important vertices that have some functional
  importance within the network
Irregularity: information networks, social networks, etc. are not constrained
  by a physical existence
Small-world: "six degrees of Kevin Bacon"; the average shortest path length is
  small relative to the size of the network; 4-5 friendship hops on average
  between you and any other person on Earth

================================================================================

* Define graph mining applications

Take a look at the syllabus for a list of topics (think about project)

Similarity to "data mining"
  - Classification => can we 'classify' a vertex as having some property?
  - Clustering => an approach for classification; placing similar vertices into
    the same cluster/group/classification
  - Prediction => how will a network evolve over time? will a new friendship
    emerge? will you buy a product?
  - Measurement => how can we measure properties of a network?

Graph mining == data mining, but specifically on graph-structured datasets

================================================================================

* Talk about graph processing approaches

"Think Like a Vertex"
  - Many/most graph algorithms can be implemented in a 'vertex centric' way
  - Format:

    Input: Graph G(V, E), vertex state S
    For all v in V:
      S(v) = initialize(v)
    For some number of iterations:
      updateAlgorithmicData()
      For all v in V:
        For all neighbors u of v:
          S(v)/S(u) = updateState(S(u), S(v))
    For all v in V:
      S(v) = finalize()
    return S

================================================================================

* NetworkX and coding

Today: connected components algorithms

A graph is 'connected' if there exists a path between every pair u, v in V
  e.g., u->w->y->...->v
Connected components: maximal subgraphs in a (possibly non-)connected graph G
  that are themselves connected

Two algorithms:

BFS(G, S)
  For all v in V:
    S(v) = -1   # haven't determined connected component yet
  numComponents = 0
  # one pass over V suffices: every vertex is either already labeled or
  # becomes the root of a new search
  For all v in V:
    if S(v) != -1: continue
    Q <- v
    S(v) = numComponents
    While Q != empty:
      w <- remove next vertex from Q
      For all neighbors u of w:
        if S(u) == -1:
          Q <- u
          S(u) = numComponents
    numComponents += 1
  return S

LabelPropagation(G, S)
  counter = 0
  For all v in V:
    S(v) = counter
    counter += 1
  updates = 1
  While updates > 0:
    updates = 0
    For all v in V:
      For all neighbors u of v:
        if S(u) > S(v):
          S(v) = S(u)
          updates += 1
  return S
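
The two connected-components procedures above can be sketched in plain Python roughly as follows. This is a minimal sketch, not the course's reference code: it assumes an undirected graph stored as a dict mapping each vertex to a list of its neighbors, and the function names and the small example graph are made up for illustration.

```python
from collections import deque


def bfs_components(adj):
    """Label each vertex with a component id using repeated BFS."""
    comp = {v: -1 for v in adj}  # -1 means "component not yet determined"
    num_components = 0
    for root in adj:
        if comp[root] != -1:
            continue  # already reached by an earlier search
        comp[root] = num_components
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if comp[u] == -1:
                    comp[u] = num_components
                    queue.append(u)
        num_components += 1
    return comp


def label_propagation_components(adj):
    """Start with unique labels, then repeatedly adopt the largest neighbor label."""
    label = {v: i for i, v in enumerate(adj)}
    updates = 1
    while updates > 0:
        updates = 0
        for v in adj:
            for u in adj[v]:
                if label[u] > label[v]:
                    label[v] = label[u]
                    updates += 1
    return label


# Hypothetical example graph: a path 0-1-2, an edge 3-4, and isolated vertex 5.
example_adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}

bfs_labels = bfs_components(example_adj)
lp_labels = label_propagation_components(example_adj)
```

The two functions generally return different label values (BFS numbers components 0, 1, 2, ... while label propagation ends with the largest initial label in each component), but vertices share a label in one exactly when they share a label in the other. With NetworkX installed, `nx.connected_components(G)` could serve as a reference to compare against.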