18 Feb 2020

* Discuss Homework

P1.)
- Make sure you use a simple graph for centrality calculations
- Make a choice for how you define the I, R sets initially when there's
  overlap; as long as it's a reasonable choice, don't worry too much
  ==> care more about the overall discussion
- Differences won't be too significant, but should still be measurable

P2.)
- Create the user-user connections based on Jaccard, then use those to
  compute potential link attachments
- Predictive power is relatively weak, but should still be measurable
- Stay tuned for a potential update (using only ~6% of the data as of now)

================================================================================

* Coronavirus Update

From the last time we talked ==> we noted that the long asymptomatic duration
makes it difficult to prevent its spread (2 - 24 days, potentially).

What the WHO/CDC/other government agencies are trying to do:
- 'Sparsify' the global human-human contact network
  * sparsify = decrease the number of edges within a network
  * Via travel restrictions, quarantines, etc.
- Effectively, they're taking a bit of a 'brute force' approach
  * Realistically, this is the best approach they can take
  * Limited real interaction data

Recent news: a taxi driver in Tokyo was determined to be infected
- Taxi driver -> high centrality vertex in a contact network
  * For almost all centrality measures we talked about
  * Also, strength of interaction is relatively high
- Overall = bad news (possibly)

Professor Slota = high centrality
- Denver, Seattle, Newark, Dulles
- Close contact on flights with hundreds of people
- Close (physical = handshakes) contact at conferences/meetings with dozens
- Teaches class to ~a hundred or so students

================================================================================

* Shorting the global economy

'Short' = bet against

The stock market (or really, the global economy) = a big ol' graph
- Vertices = businesses, people, governments, etc.
- Edges = exchange of money, goods, services, etc.

The economy = a diffusive process on a financial transaction network
- Strong economy = high flow within this network

So, consider the sparsification due to coronavirus:
- If the disease spreads out of control: bigger quarantines, restrictions on
  travel, commercial shipping, etc.
- Highly impacting the flow of goods, money, etc. in the global economic
  network ==> the economy will have problems if coronavirus continues to
  spread
- Note: the flu of 1918 had wide-reaching impacts; the stock market fell 25%

So if we want to 'bet' on graph mining yet again, buy long-dated out of the
money puts on SPY.

Note: DON'T ACTUALLY DO THIS

================================================================================

* Link Prediction

Inferring the growth process of a network and predicting which links are more
likely to form relative to a random process.
- Facebook => which people you will become friends with
- Twitter => which people you will follow
- Tinder => who you will match with
- Amazon => what you will purchase

Also, a related problem: inferring 'hidden links'
- Facebook => people you're friends with in reality, but not friends with
  online
- Maybe you're trying to hide that fact
  * Terrorists 'hiding in plain sight'
  * Hiding = adding 'noise': erroneous connections, not connecting to people
    they would otherwise be connected to, etc.
- The problem reduces to eliminating the noise and predicting which links are
  most probable
- Or, as in a prior context, financial transaction networks and money
  laundering

Our general approach for 'link prediction':
- Consider a network snapshot at time t
- Use topological features (and/or metadata) to predict the most probable
  links
- Test on time t+x to validate our approach
- Compare various methods using this same approach (a small validation sketch
  appears at the end of these notes)
- Or, if there is no time data, explicitly partition into training + test sets
  * Probably a better approach for validating methods on 'hidden links'

================================================================================

* Unsupervised Methods

See: https://www.cs.cornell.edu/home/kleinber/link-pred.pdf

Unsupervised = we *aren't* explicitly 'training' some algorithm
We *are* using explicit measurements to 'classify' or 'make predictions'

Common Neighbors: overlap in the neighborhoods of vertices x and y
  C(x, y) = |N(x) ∩ N(y)|

Jaccard Index: neighborhood overlap over the total size of both neighborhoods
  J(x, y) = |N(x) ∩ N(y)| / |N(x) ∪ N(y)|

Adamic-Adar Index: consider the common neighbors of x and y, but explicitly
bias against connections to large-degree vertices
  A(x, y) = sum_{u ∈ N(x) ∩ N(y)} 1 / log(|N(u)|)

Preferential Attachment: assumes that two high-degree vertices are more
likely to attach, with attachment probability proportional to their degrees
  P(x, y) = |N(x)| * |N(y)|

Personalized PageRank:
- To modify PPR to determine attachment likelihood:
  * Calculate PPR from some vertex v on the whole graph
  * Attachment probability is directly correlated with the PPR values
    determined
- Not implementing it here, so there's stuff left to do on the HW
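
To make the measures above concrete, here is a minimal sketch in plain Python,
assuming the graph is stored as a dict mapping each vertex to its set of
neighbors (simple, undirected graph). The function names and the adjacency
representation are illustrative choices, not part of the lecture or the HW
spec, and PPR is deliberately left out since that's the part left for the HW.

    import math

    def common_neighbors(adj, x, y):
        # C(x, y) = |N(x) ∩ N(y)|
        return len(adj[x] & adj[y])

    def jaccard(adj, x, y):
        # J(x, y) = |N(x) ∩ N(y)| / |N(x) ∪ N(y)|
        union = adj[x] | adj[y]
        return len(adj[x] & adj[y]) / len(union) if union else 0.0

    def adamic_adar(adj, x, y):
        # A(x, y) = sum over u in N(x) ∩ N(y) of 1 / log(|N(u)|)
        # Skip degree-1 common neighbors to avoid dividing by log(1) = 0.
        return sum(1.0 / math.log(len(adj[u]))
                   for u in adj[x] & adj[y] if len(adj[u]) > 1)

    def preferential_attachment(adj, x, y):
        # P(x, y) = |N(x)| * |N(y)|
        return len(adj[x]) * len(adj[y])

    # Tiny usage example: build the adjacency sets from an edge list, then
    # score and rank every currently-missing pair.
    edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    candidates = [(x, y) for x in adj for y in adj if x < y and y not in adj[x]]
    ranked = sorted(candidates, key=lambda p: adamic_adar(adj, *p), reverse=True)
    print(ranked[:3])  # highest-scoring pairs = predicted links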
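
And a minimal sketch of the validation idea from the 'general approach' above:
score the non-edges of the time-t snapshot, then check how many of the
top-ranked pairs actually appear by time t+x; with no timestamps, a random
edge holdout plays the same role. precision_at_k and random_holdout are
hypothetical helper names, and score_fn can be any of the scoring functions
sketched above.

    import random

    def precision_at_k(adj_t, edges_t_plus_x, score_fn, k=100):
        # Rank all non-edges in the time-t snapshot by score_fn and count how
        # many of the top k appear as new edges by time t+x.
        # Assumes vertex IDs are comparable (e.g., ints) so pairs can be sorted.
        existing = {tuple(sorted((u, v))) for u in adj_t for v in adj_t[u]}
        new_edges = {tuple(sorted(e)) for e in edges_t_plus_x} - existing
        candidates = [(x, y) for x in adj_t for y in adj_t
                      if x < y and y not in adj_t[x]]
        ranked = sorted(candidates, key=lambda p: score_fn(adj_t, *p),
                        reverse=True)
        hits = sum(1 for pair in ranked[:k] if pair in new_edges)
        return hits / k

    def random_holdout(edges, frac=0.1, seed=0):
        # No time data: randomly hold out a fraction of edges to act as the
        # 'future' (test) set, and build the snapshot from the rest.
        rng = random.Random(seed)
        edges = list(edges)
        rng.shuffle(edges)
        cut = int(len(edges) * frac)
        return edges[cut:], edges[:cut]  # (training edges, held-out test edges)

Comparing precision_at_k across the different scoring functions on the same
holdout is the 'compare various methods using this same approach' step.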