27 Feb 2020

================================================================================
* Coronavirus update v2

"We’re ready to adapt and we’re ready to do whatever we have to as the disease 
spreads, if it spreads. It probably will, it possibly will. It could be at a 
very small level, or it could be at a larger level. Whatever happens we’re 
totally prepared."
- Someone totally not prepared

Since last week:
- Infections on every continent (minus Antarctica and Zealandia)
- Spread is un-contained in multiple countries
- 60 or so in US (~40 were from the cruise)
  * One 'community spread' -- i.e., no idea where it came from
- Tokyo canceling Olympics if not contained by May
- Issues with the virus:
  * Can remain latent/asymptomatic for up to 27 days
    + Infectious within 1-2 days
  * Reproductive number (R):
    + Average new transmissions per infected person
    + R < 1: a disease will disappear
    + R > 1: a disease can become an epidemic (faster for larger R)
    + Measles: R ~= 14, Flu ~= 1.5, COVID-19 ~= 2.5
  * Mutating: instances of infected->recovered->infected
- We might return to epidemiology later in class
  * If we're still having class in a month or so
- Epidemiologist statements:
  * Serious concern unless drastic measures are immediately taken

================================================================================
* Community Detection

Basic definition: identifying dense clusters within a network

Specifically considered in various ways:
- Friend groups
- Classmates
- Teammates

Zachary Karate Club:
- A graph of the members of a single karate club
- Edges represent interactions between members outside of the club
- At a point in time:
  * There was a disagreement between the instructor and club president
  * The club split into two - one following pres. and one following instructor
  * The split almost perfectly followed the notion of 'communities' on the graph

Fundamental Hypothesis: communities can be described fully in the context of 
graph analysis based only on network topology.

What we'll discuss: many variations of the 'density' definition, many algorithms
that optimize based on these definitions, and different ways to evaluate the
quality of their outputs.

================================================================================
* Defining Communities

How we define 'density' within a network is not clear cut:
- Cliques: fully connected subgraphs
  * Can't be more dense than a clique (in a simple graph)
  * Issue: large cliques are rare in actual networks
  * Using only cliques to define communities usually doesn't work in practice
  * Cliques are deceptively difficult to enumerate
  * However: triangles (K_3) are often used by themselves to measure density
 
- Connected components
  * Trivial to compute
  * Don't really capture that 'density' aspect of our definition
  
- Ratio of external to internal edges
  * Strong community: for all v in C : d_int(v) > d_ext(v)
  * Weak community: sum_{all v in C} d_int(v) > sum_{all v in C} d_ext(v)
  * Both: measures of density of a community
  * Can extrapolate to a global measure: sum of total internal vs. external
    + Also: edge cut = total external edges

- Modularity and Conductance:
  * to be discussed later
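
The strong and weak community definitions above can be checked directly from
an adjacency structure. The following is a minimal sketch, assuming an
undirected simple graph stored as a dict of neighbor sets; the function names
and the small example graph (two triangles joined by one edge) are made up for
illustration and are not from lecture.

```python
def degree_split(adj, community):
    """Map each v in the community to (d_int(v), d_ext(v))."""
    split = {}
    for v in community:
        d_int = sum(1 for w in adj[v] if w in community)
        split[v] = (d_int, len(adj[v]) - d_int)
    return split


def is_strong_community(adj, community):
    # Strong: every vertex has more internal than external neighbors.
    return all(d_int > d_ext
               for d_int, d_ext in degree_split(adj, community).values())


def is_weak_community(adj, community):
    # Weak: only the totals over the community need to satisfy the inequality.
    split = list(degree_split(adj, community).values())
    return sum(d for d, _ in split) > sum(e for _, e in split)


def edge_cut(adj, communities):
    """Count edges whose endpoints lie in different communities.

    Assumes the communities partition the vertex set.
    """
    owner = {v: i for i, c in enumerate(communities) for v in c}
    # Each cut edge is seen from both endpoints, so halve the count.
    return sum(1 for u in adj for v in adj[u] if owner[u] != owner[v]) // 2


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by edge (2,3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
strong = is_strong_community(adj, {0, 1, 2})
weak_only = (is_weak_community(adj, {0, 1, 2, 3})
             and not is_strong_community(adj, {0, 1, 2, 3}))
cut = edge_cut(adj, [{0, 1, 2}, {3, 4, 5}])
```

Here {0,1,2} is a strong community, while {0,1,2,3} is only weak: vertex 3 has
one internal neighbor and two external ones, but the community totals still
favor internal edges.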
  
================================================================================
* Number of Communities

The number of communities within a network will vary based on the notion of 
density or our optimization criteria.
- Usually unknown - we're considering only topology and not a ground truth
- Possible per-vertex groupings increase exponentially
- Number of ways to create grouping sizes increases super-exponentially
- Community detection algorithms have to consider both
  * Possible solution space is 'quite large'
  * Exact solutions are often infeasible
  * Algorithms are often greedy or use heuristics
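
To get a feel for how large the solution space is, here is a small counting
sketch. It rests on one possible reading of the bullets above, which is an
assumption: with k communities fixed there are k^n ways to label n vertices,
and the number of ways to split n vertices into any number of non-empty groups
is the Bell number B_n. The function names are made up for illustration.

```python
def labelings(n, k):
    """Number of ways to assign one of k labels to each of n vertices."""
    return k ** n


def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], computed with the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every later entry adds the entry above-left to the entry on its left.
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells


two_way_labelings = labelings(10, 2)   # 1024
partitions = bell_numbers(10)          # partitions[10] == 115975
```

Even at n = 10 there are over a hundred thousand possible partitions, which is
why exhaustive search is not an option for real networks.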

================================================================================
* Agglomerative, Divisive, Hierarchical Clustering

The above are different greedy/heuristic approaches used to address the problem.

Agglomerative: we combine communities in a certain way to reach some maximal 
value for our optimization criteria.

Divisive: we cut communities in a certain way to reach some maximal value for
our optimization criteria.

Hierarchical clustering: using the above to create a hierarchical structure
for our communities.
- Note: real communities often contain a hierarchical structure
- E.g., graph theory->CS->Science->RPI

================================================================================
* Ravasz Algorithm

Agglomerative algorithm:
- All vertices initially in their own communities
- Iterate until only a single community remains:
  * Select optimal (c_i, c_j) pair to combine
  * Based on similarity: 
    + s_uv = (|N(u) ∩ N(v)| + A_uv) / (min(d(u), d(v)) + 1 - A_uv)
    + A_uv = 1 if u and v are adjacent, 0 otherwise
    + Gives us a similarity value in [0,1]
    + For communities: take the average, min, or max of the pairwise values
      to pick the 'optimal' pair
- Pros: captures full community hierarchy
- Cons: O(n^2), not optimizing a global metric
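
The similarity measure and merge loop above can be sketched as follows. This
is a naive illustration, not a tuned implementation: it assumes an undirected
simple graph stored as a dict of neighbor sets, uses the average of pairwise
similarities to score community pairs (min or max would work the same way),
and recomputes every pair score on each pass, so it is slower than the O(n^2)
bound quoted above. Function names and the example graph are made up.

```python
from itertools import combinations


def similarity(adj, u, v):
    """s_uv = (|N(u) & N(v)| + A_uv) / (min(d(u), d(v)) + 1 - A_uv)."""
    a_uv = 1 if v in adj[u] else 0
    common = len(adj[u] & adj[v])
    return (common + a_uv) / (min(len(adj[u]), len(adj[v])) + 1 - a_uv)


def ravasz_merges(adj):
    """Merge communities until one remains; return the merges in order."""
    comms = [frozenset([v]) for v in adj]
    merges = []

    def avg_sim(a, b):
        total = sum(similarity(adj, u, v) for u in a for v in b)
        return total / (len(a) * len(b))

    while len(comms) > 1:
        # Pick the pair of communities with the highest average similarity.
        a, b = max(combinations(comms, 2), key=lambda p: avg_sim(*p))
        comms = [c for c in comms if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return merges


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by edge (2,3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
merges = ravasz_merges(adj)
last_merge = set(merges[-1])
```

The list of merges is the hierarchy: read backwards, it gives successively
finer splits. On this example the final merge joins the two triangles, so
undoing it recovers them as the top-level communities.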

================================================================================
* Label Propagation

Agglomerative algorithm:
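- All vertices initially in their own communities (unique labels)
- Iterate until labels stop changing:
  * For all v in V(G) in random order:
    + C(v) = most frequent label among the neighbors of v
    + Ties broken randomly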
- Pros: simple to implement, O(n) per iteration on sparse graphs (O(m) in
  general), can return good results
- Cons: can return bad results, can converge to a single community, hierarchy
  is not explicitly captured.
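
A minimal sketch of the label propagation loop described above, assuming an
undirected graph stored as a dict of neighbor sets. The stopping rule (stop
once every vertex holds one of the most frequent labels among its neighbors),
the max_sweeps cap, and the seed parameter are assumptions added to make the
sketch terminate and be repeatable; they are not from lecture.

```python
import random
from collections import Counter


def label_propagation(adj, max_sweeps=1000, seed=None):
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # every vertex starts in its own community

    def top_labels(v):
        # Labels that are (tied for) most frequent among the neighbors of v.
        counts = Counter(labels[w] for w in adj[v])
        best = max(counts.values())
        return [lab for lab, c in counts.items() if c == best]

    vertices = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(vertices)             # visit vertices in random order
        for v in vertices:
            if adj[v]:                    # isolated vertices keep their label
                labels[v] = rng.choice(top_labels(v))   # random tie-break
        # Converged when no vertex would be forced to change its label.
        if all(not adj[v] or labels[v] in top_labels(v) for v in vertices):
            break
    return labels


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by edge (2,3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
labels = label_propagation(adj, seed=0)
```

Because of the random visiting order and tie-breaks, different seeds can give
different answers on the same graph, including the single-community outcome
listed under the cons.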

================================================================================
* Girvan-Newman

- Divisive algorithm:
  * Iterate:
    + Remove the edge with highest betweenness centrality
    + Recompute betweenness on the remaining graph
    + Note: weak ties disconnect our graph quicker than strong ties, weak ties
      have high betweenness centrality.
  * Our communities are the connected components that remain
- Pros: uses a lot of intuitive 'social network theory', can get good results
- Cons: Slooooowwwww, O(n^3) on sparse graphs (O(m^2 n) in general)
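
The divisive loop above can be sketched in plain Python. This is an
illustration under assumptions, not a reference implementation: the graph is
an undirected, unweighted simple graph stored as a dict of neighbor sets, edge
betweenness is computed with a BFS-based accumulation in the style of Brandes'
algorithm, and the loop stops at a target number of communities k (the
stopping rule is an assumption; lecture did not fix one). All names are made
up.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an unweighted, undirected graph."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist, order = {s: 0}, []
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, pushing path credit onto edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += credit
                delta[v] += credit
    # Every vertex pair was counted from both endpoints.
    return {e: b / 2.0 for e, b in bet.items()}


def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        seen.add(s)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v] - seen:
                seen.add(w)
                stack.append(w)
        comps.append(comp)
    return comps


def girvan_newman(adj, k):
    """Remove highest-betweenness edges until k components remain."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while len(components(adj)) < k:
        bet = edge_betweenness(adj)               # recomputed after every removal
        if not bet:                               # no edges left to remove
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by edge (2,3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
bridge_score = edge_betweenness(adj)[frozenset((2, 3))]
communities = girvan_newman(adj, 2)
```

The bridge (2,3) is the weak tie here: all 9 shortest paths between the two
triangles cross it, so it is removed first and the two triangles fall out as
the communities. Recomputing betweenness after every removal is what makes the
method so slow.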

Next class: modularity, evaluating algorithms