02 Mar 2020

================================================================================
* Coronavirus update v3

"It’s going to disappear. One day it’s like a miracle, it will disappear."
- Current US government approach = hope for a miracle

https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

Since we last met:
- 50% daily increase in (known) infections in certain countries
- 2 deaths in the US
  * Washington state, King County -- aka the Seattle metro
  * Professor Slota was there 3 weeks ago
- First known case in New York -- NYC
- Visit your grandparents/parents
  * CDC recommending all those over 60 years old avoid crowds

Next homework:
- Predict if Professor Slota has coronavirus
  * I found a Portland contact dataset, Portland ~= Seattle
- Use community detection to help prevent spread
  * (see the sketch after the review section below)

Spanish Flu of 1918 (not actually from Spain ==> from France)
- Very short latency for symptoms (1-2 days, at most)
- Killed a lot of people very quickly
  * Diseases that kill quickly tend to die out
  * You die before you can spread
  * E.g., Ebola
- Still, 27% global infection rate
- 2-3% death rate
- Duration ==> ~11 months

================================================================================
* Medium-term economic predictions

Not good.
- Coronavirus is sparsifying the global economic network
- Lowered productivity due to illness
  * China ==> very, very noticeable
  * Sparsification is probably the bigger impact right now
- Other reasons:
  * Massive personal and corporate debt
  * 10%+ of companies are 'zombies' -- they exist only to pay debt interest
  * Large cascading effects due to all of the above
  * 10-year / 3-month Treasury yield curve inversion

Even worse -- the government handles recessions in two ways:
- Fiscal policy:
  * Tax cuts (already did -- already a very large deficit)
  * Stimulus/investment (also costs money)
- Monetary policy:
  * Lower interest rates (already at historic lows)
  * Quantitative easing (already doing it)

My prediction:
- Continued volatility in the market until the risk of coronavirus is known
  * Pay attention to earnings (April, July)
- If a recession comes, might be trouble

================================================================================
* Community detection review

Basically: identifying dense subgraphs within a network.

Plenty of applications. Plenty of methods. Widely used and studied.
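
A minimal Python sketch of the homework idea, using networkx: detect
communities on a contact network and flag the groups most exposed to spread.
The file name 'portland_contacts.txt' is a hypothetical stand-in for the
Portland contact dataset mentioned above (assumed format: one
"person_a person_b" contact pair per line).

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Hypothetical edge list: one "person_a person_b" contact per line
    G = nx.read_edgelist("portland_contacts.txt")

    # Detect communities -- one infected member puts the whole community at
    # elevated risk, so these make natural units for intervention
    communities = greedy_modularity_communities(G)

    # Large, dense communities spread disease fastest -- check those first
    for i, comm in enumerate(sorted(communities, key=len, reverse=True)[:5]):
        sub = G.subgraph(comm)
        print(f"community {i}: {len(comm)} people, "
              f"density = {nx.density(sub):.3f}")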
================================================================================
* Modularity basic definitions

Modularity: one of the most widely-used approaches for community detection
- A measure for a given set of community assignments
- Used for maximization or for evaluation

Basic hypothesis: randomly-wired networks lack inherent community structure
- So ==> measure how our assigned clustering compares to what we would expect
  on a random network

M = (1 / 2m) * sum_{uv in C} (A_uv - d(u)*d(v) / 2m)

m = # undirected edges in our graph
uv in C = all vertex pairs u,v in the same community
A_uv = # edges between u and v
d(u) = degree of vertex u
d(u)*d(v) / 2m = expected number of edges between u and v in a random network
- Exactly true for loopy multi-graphs
- An 'alright' approximation for simple graphs

(sketch 1 at the end of these notes computes M directly from this formula)

================================================================================
* Modularity maximization

An approach to detect communities on a given network
- Specifically, we're trying to maximize the measure of modularity given above
- Usually: using an agglomerative approach

Newman Algorithm:
- Greedy agglomerative maximization algorithm
- Initially: all vertices in their own community
- While not a single community (or while modularity is still increasing):
  * Merge the pair of communities with the highest modularity gain
- Pros: outputs hierarchical structure, good real-world performance, strong
  theoretical foundation, can be fast
- Cons: the 'issues with modularity' to be discussed, not optimal

Louvain Algorithm:
- Same greedy idea as above, but we explicitly contract communities into
  single vertices between passes
- Can be slightly faster due to implementation

(sketch 2 at the end runs both algorithms via networkx)

Note that while these algorithms are not 'optimal':
- Modularity on most networks doesn't have an explicit 'peak'
- More of a 'plateau'

================================================================================
* Issues with modularity

Resolution limit:
- Modularity maximization can't 'resolve' small communities

Change in modularity from combining community A and community B:

ΔM = (l_AB / m) - (k_A*k_B / 2*m^2)

l_AB = # edges between communities A and B
k_A = sum of degrees of vertices within community A
k_B = sum of degrees of vertices within community B
m = total # undirected edges in the graph

Consider the break-even point:
(l_AB / m) = (k_A*k_B / 2*m^2)
l_AB = k_A*k_B / 2*m

If l_AB > k_A*k_B / 2*m ==> ΔM > 0, so we merge communities A and B

Assume for simplicity: k_A = k_B = k
Also assume: l_AB = 1 (a single edge between A and B)
1 > k^2 / 2*m
2m > k^2
sqrt(2m) > k

==> From this, we merge A and B whenever k < sqrt(2m)

This sets a lower bound on the community size (in total degree) that
modularity maximization can resolve.
- The 'ring of cliques' graph highlights this fact
- Graph constructed by attaching cliques of the same size in a cycle, each
  pair of neighboring cliques joined by a single edge
- Once we pass the resolution limit, modularity maximization algorithms will
  combine neighboring cliques into single communities
  * Goes against all our assumed notions of what constitutes a 'good'
    community
  * Real graphs often have a wide spread of community sizes
- (sketch 3 at the end demonstrates this on a ring of cliques)

Why this isn't a huuuge problem:
- Most algorithms are hierarchical -- select the level that gives us
  reasonable community sizes relative to our data (or use some measure
  besides modularity)
- Note that a wide spread in real community sizes might still be problematic
  * 'Multi-resolution' methods attempt to address this

One other problem: the d(u)*d(v) / 2m approximation can be 'bad' for simple
graphs
- Most social graphs are treated as simple
- Especially problematic for small, dense, and/or skewed networks
- d(u)*d(v) >> 2m can occur for multiple pairs u,v, giving 'expected' edge
  counts above 1 -- impossible in a simple graph
- So, our measure of modularity can be very 'off'
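
Sketch 1 -- computing M directly from the definition in the notes. This is a
direct transcription of the formula: sum over ordered same-community pairs
(including u == v, matching the loopy-multigraph null model). networkx's
built-in modularity should agree; weight=None makes it treat every edge as
unweighted.

    import itertools
    import networkx as nx
    from networkx.algorithms.community import modularity as nx_modularity

    def modularity(G, communities):
        # M = (1 / 2m) * sum_{uv in C} (A_uv - d(u)*d(v) / 2m)
        m = G.number_of_edges()
        deg = dict(G.degree())
        M = 0.0
        for comm in communities:
            # all ordered pairs (u, v) in the same community, incl. u == v
            for u, v in itertools.product(comm, repeat=2):
                A_uv = 1 if G.has_edge(u, v) else 0
                M += A_uv - deg[u] * deg[v] / (2 * m)
        return M / (2 * m)

    G = nx.karate_club_graph()
    comms = [set(range(17)), set(range(17, 34))]
    print(modularity(G, comms))                   # hand-rolled version
    print(nx_modularity(G, comms, weight=None))   # should match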
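
Sketch 2 -- running the agglomerative maximizers. networkx ships a greedy
Clauset-Newman-Moore implementation (the Newman-style algorithm above) and,
in versions 2.8+, a Louvain implementation.

    import networkx as nx
    from networkx.algorithms.community import (
        greedy_modularity_communities,
        louvain_communities,   # requires networkx >= 2.8
    )

    G = nx.karate_club_graph()

    # Greedy agglomerative: start with singleton communities, repeatedly
    # merge the pair with the highest modularity gain
    greedy = greedy_modularity_communities(G)
    print([sorted(c) for c in greedy])

    # Louvain: same greedy idea, but contracts communities into single
    # vertices between passes
    louvain = louvain_communities(G, seed=42)
    print([sorted(c) for c in louvain])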
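
Sketch 3 -- the resolution limit on a ring of cliques. networkx has a
ring_of_cliques generator; with 30 cliques of size 5, each clique's degree
sum k = 22 falls below sqrt(2m) ~ 25.7, so we'd expect the maximizer to
merge neighboring cliques instead of returning one community per clique.

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # 30 cliques of 5 vertices each, joined into a cycle by single edges
    G = nx.ring_of_cliques(30, 5)

    m = G.number_of_edges()   # 30 cliques * 10 internal + 30 connecting = 330
    k = 2 * 10 + 2            # degree sum of one clique: 20 internal + 2 out
    print(f"k = {k}, sqrt(2m) = {(2 * m) ** 0.5:.1f}")   # 22 < 25.7

    # With l_AB = 1 and k < sqrt(2m), merging neighbors has dM > 0:
    # 1 > k*k / (2*m) = 484 / 660 ~ 0.73
    comms = greedy_modularity_communities(G)
    print(f"{len(comms)} communities found for 30 cliques")
    print(sorted(len(c) for c in comms))   # expect merged cliques (size 10)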