02 Mar 2020

================================================================================
* Coronavirus update v3

"It’s going to disappear. One day it’s like a miracle, it will disappear."
- Current US government approach = hope for a miracle

https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

Since we last met:
- 50% daily increase in (known) infections in certain countries
- 2 deaths in the US
  * Washington state, King County -- aka the Seattle metro
  * Professor Slota was there 3 weeks ago
- First known case in New York -- NYC
- Visit your grandparents/parents
  * CDC recommending all those over 60 years old avoid crowds

Next homework:
- Predict if Professor Slota has coronavirus
  * I found a Portland contact dataset, Portland ~= Seattle
- Use community detection to help prevent spread
  * (see the sketch after the review section below)

Spanish Flu of 1918 (not actually from Spain ==> from France)
- Very short latency for symptoms (1-2 days, at most)
- Killed a lot of people very quickly
  * Diseases that kill quickly tend to die out
  * You die before you can spread
  * E.g., Ebola
- Still, 27% global infection rate
- 2-3% death rate
- Duration ==> ~11 months

================================================================================
* Medium-term economic predictions

Not good.
- Coronavirus is sparsifying the global economic network
- Lowered productivity due to illness
  * China ==> very, very noticeable
  * Sparsification is probably the bigger impact right now
- Other reasons:
  * Massive personal and corporate debt
  * 10%+ of companies are 'zombies' -- they exist only to pay debt interest
  * Large cascading effects due to all of the above
  * 10-year / 3-month Treasury yield curve inversion

Even worse -- the government handles recessions in two ways:
- Fiscal policy:
  * Tax cuts (already did -- already a very large deficit)
  * Stimulus/investment (also costs money)
- Monetary policy:
  * Lower interest rates (already at historic lows)
  * Quantitative easing (already doing it)

My prediction:
- Continued volatility in the market until the risk of coronavirus is known
  * Pay attention to earnings (April, July)
- If a recession comes, might be trouble

================================================================================
* Community detection review

Basically: identifying dense subgraphs within a network.

Plenty of applications. Plenty of methods. Widely used and studied.
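
A minimal Python sketch of the homework idea, using networkx: detect
communities on a contact network and flag the groups most exposed to spread.
The file name 'portland_contacts.txt' is a hypothetical stand-in for the
Portland contact dataset mentioned above (assumed format: one
"person_a person_b" contact pair per line).

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Hypothetical edge list: one "person_a person_b" contact per line
    G = nx.read_edgelist("portland_contacts.txt")

    # Detect communities -- one infected member puts the whole community at
    # elevated risk, so these make natural units for intervention
    communities = greedy_modularity_communities(G)

    # Large, dense communities spread disease fastest -- check those first
    for i, comm in enumerate(sorted(communities, key=len, reverse=True)[:5]):
        sub = G.subgraph(comm)
        print(f"community {i}: {len(comm)} people, "
              f"density = {nx.density(sub):.3f}")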
================================================================================
* Modularity basic definitions

Modularity: one of the most widely-used approaches for community detection
- A measure for a given set of community assignments
- Used for maximization or for evaluation

Basic hypothesis: randomly-wired networks lack inherent community structure
- So ==> measure how our assigned clustering compares to what we would expect
  on a random network

M = (1 / 2m) * sum_{uv in C} (A_uv - d(u)*d(v) / 2m)

m = # undirected edges in our graph
uv in C = all vertex pairs u,v in the same community
A_uv = # edges between u and v
d(u) = degree of vertex u
d(u)*d(v) / 2m = expected number of edges between u and v in a random network
- Exactly true for loopy multi-graphs
- An 'alright' approximation for simple graphs

(sketch 1 at the end of these notes computes M directly from this formula)

================================================================================
* Modularity maximization

An approach to detect communities on a given network
- Specifically, we're trying to maximize the measure of modularity given above
- Usually: using an agglomerative approach

Newman Algorithm:
- Greedy agglomerative maximization algorithm
- Initially: all vertices in their own community
- While not a single community (or while modularity is still increasing):
  * Merge the pair of communities with the highest modularity gain
- Pros: outputs hierarchical structure, good real-world performance, strong
  theoretical foundation, can be fast
- Cons: the 'issues with modularity' to be discussed, not optimal

Louvain Algorithm:
- Same greedy idea as above, but we explicitly contract communities into
  single vertices between passes
- Can be slightly faster due to implementation

(sketch 2 at the end runs both algorithms via networkx)

Note that while these algorithms are not 'optimal':
- Modularity on most networks doesn't have an explicit 'peak'
- More of a 'plateau'

================================================================================
* Issues with modularity

Resolution limit:
- Modularity maximization can't 'resolve' small communities

Change in modularity from combining community A and community B:

ΔM = (l_AB / m) - (k_A*k_B / 2*m^2)

l_AB = # edges between communities A and B
k_A = sum of degrees of vertices within community A
k_B = sum of degrees of vertices within community B
m = total # undirected edges in the graph

Consider the break-even point:
(l_AB / m) = (k_A*k_B / 2*m^2)
l_AB = k_A*k_B / 2*m

If l_AB > k_A*k_B / 2*m ==> ΔM > 0, so we merge communities A and B

Assume for simplicity: k_A = k_B = k
Also assume: l_AB = 1 (a single edge between A and B)
1 > k^2 / 2*m
2m > k^2
sqrt(2m) > k

==> From this, we merge A and B whenever k < sqrt(2m)

This sets a lower bound on the community size (in total degree) that
modularity maximization can resolve.
- The 'ring of cliques' graph highlights this fact
- Graph constructed by attaching cliques of the same size in a cycle, each
  pair of neighboring cliques joined by a single edge
- Once we pass the resolution limit, modularity maximization algorithms will
  combine neighboring cliques into single communities
  * Goes against all our assumed notions of what constitutes a 'good'
    community
  * Real graphs often have a wide spread of community sizes
- (sketch 3 at the end demonstrates this on a ring of cliques)

Why this isn't a huuuge problem:
- Most algorithms are hierarchical -- select the level that gives us
  reasonable community sizes relative to our data (or use some measure
  besides modularity)
- Note that a wide spread in real community sizes might still be problematic
  * 'Multi-resolution' methods attempt to address this

One other problem: the d(u)*d(v) / 2m approximation can be 'bad' for simple
graphs
- Most social graphs are treated as simple
- Especially problematic for small, dense, and/or skewed networks
- d(u)*d(v) >> 2m can occur for multiple pairs u,v, giving 'expected' edge
  counts above 1 -- impossible in a simple graph
- So, our measure of modularity can be very 'off'
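
Sketch 1 -- computing M directly from the definition in the notes. This is a
direct transcription of the formula: sum over ordered same-community pairs
(including u == v, matching the loopy-multigraph null model). networkx's
built-in modularity should agree; weight=None makes it treat every edge as
unweighted.

    import itertools
    import networkx as nx
    from networkx.algorithms.community import modularity as nx_modularity

    def modularity(G, communities):
        # M = (1 / 2m) * sum_{uv in C} (A_uv - d(u)*d(v) / 2m)
        m = G.number_of_edges()
        deg = dict(G.degree())
        M = 0.0
        for comm in communities:
            # all ordered pairs (u, v) in the same community, incl. u == v
            for u, v in itertools.product(comm, repeat=2):
                A_uv = 1 if G.has_edge(u, v) else 0
                M += A_uv - deg[u] * deg[v] / (2 * m)
        return M / (2 * m)

    G = nx.karate_club_graph()
    comms = [set(range(17)), set(range(17, 34))]
    print(modularity(G, comms))                   # hand-rolled version
    print(nx_modularity(G, comms, weight=None))   # should match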
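
Sketch 2 -- running the agglomerative maximizers. networkx ships a greedy
Clauset-Newman-Moore implementation (the Newman-style algorithm above) and,
in versions 2.8+, a Louvain implementation.

    import networkx as nx
    from networkx.algorithms.community import (
        greedy_modularity_communities,
        louvain_communities,   # requires networkx >= 2.8
    )

    G = nx.karate_club_graph()

    # Greedy agglomerative: start with singleton communities, repeatedly
    # merge the pair with the highest modularity gain
    greedy = greedy_modularity_communities(G)
    print([sorted(c) for c in greedy])

    # Louvain: same greedy idea, but contracts communities into single
    # vertices between passes
    louvain = louvain_communities(G, seed=42)
    print([sorted(c) for c in louvain])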
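
Sketch 3 -- the resolution limit on a ring of cliques. networkx has a
ring_of_cliques generator; with 30 cliques of size 5, each clique's degree
sum k = 22 falls below sqrt(2m) ~ 25.7, so we'd expect the maximizer to
merge neighboring cliques instead of returning one community per clique.

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # 30 cliques of 5 vertices each, joined into a cycle by single edges
    G = nx.ring_of_cliques(30, 5)

    m = G.number_of_edges()   # 30 cliques * 10 internal + 30 connecting = 330
    k = 2 * 10 + 2            # degree sum of one clique: 20 internal + 2 out
    print(f"k = {k}, sqrt(2m) = {(2 * m) ** 0.5:.1f}")   # 22 < 25.7

    # With l_AB = 1 and k < sqrt(2m), merging neighbors has dM > 0:
    # 1 > k*k / (2*m) = 484 / 660 ~ 0.73
    comms = greedy_modularity_communities(G)
    print(f"{len(comms)} communities found for 30 cliques")
    print(sorted(len(c) for c in comms))   # expect merged cliques (size 10)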