27 Jan 2020

* Centrality - basic definitions
A measure of a vertex's 'importance' within a network
- Note: related measures are 'influence' metrics
- Social networks: important people (Twitter: number of followers)
- Road/infrastructure networks: key intersections/endpoints of bridges
- Epidemiological networks: someone who is a small # of hops from everyone else
- Recall: the 'hubs' discussed in the first class

================================================================================

* Degree centrality
Defined as the number of connections a vertex has
- # of incident edges
- Size of the neighborhood N(v)
- For directed graphs: d_in(v), d_out(v), or d_in(v) + d_out(v)
  * Think of bots on Twitter, spam sites on the Web
Pros: Simple to calculate; # of connections directly determines how information
flows in a single timestep when originating from some v
Cons: Easy metric to falsify (e.g., spam sites); doesn't capture much about
subsequent timesteps in a diffusive process

================================================================================

* Closeness centrality
A measure of how 'close' a vertex is to all other vertices in a network
- Based on the average shortest-path length from v to all other vertices
  (commonly defined as its inverse, so that 'closer' means a higher score)
- Think in terms of how 'close' a vertex is to the 'center' of a graph
  * I.e., how 'close' it is to all other vertices in the network
- How many hops/timesteps/etc. does it take for information originating at
  vertex v to reach a majority of vertices in the network
Pros: Loosely determines how quickly information might be able to reach others
Cons: Difficult to calculate for all vertices in a graph; at least O(n^2)

================================================================================

* Betweenness centrality
The proportion of information flow in a network that passes through vertex v
- The ratio of the number of shortest x,y-paths through v to the number of all
  shortest x,y-paths, taken over all x != y != v in V(G)
- For information to flow from x to y, it will pass through v some proportion
  of the time, relative to the above
Pros: Loosely identifies key information-flow 'cutpoints' within a network
Cons: Difficult to calculate; another O(n^2)

================================================================================

* Eigenvector centrality
Note: We'll talk more about this tomorrow (specifically, PageRank)
Basically, defines a vertex's importance based on the importance of its
neighbors.
- Consider adjacency matrix A
  * Eigenvector centrality ==> solved for via Ax = λx
- You're important if your friends are important (and you have a lot of them)
Pros: Relatively easier to calculate than the above (using power iteration);
gives really good and intuitive results (PageRank via Google, Twitter's 'who
to follow')
Cons: Tough for us (humans) to infer or interpret directly

================================================================================

* Diffusive processes
Generally, we consider diffusion and diffusive processes to statistically
measure how 'information' or 'data' or 'etc.' might flow through a network
Basic models:
- Vertex-centric behaviors
  * Vertex v updates its state based on the state of its neighbors (and itself)
- Complexity of the 'network response' depends on the complexity of each
  vertex's individual behavior
  * E.g., a small change in a local region might have a large global effect
Simple example:
- Initialize two competing 'ideas'
- All vertices update to the dominant idea in their local neighborhood
  * A variant of 'label propagation' (more later) -- lots of applications

================================================================================

* Epidemiology
The study of how diseases spread; in our context, specifically when
considering some network topology.
- Can be considered a diffusive process
  * Key differentiator: **randomness**
- Note: while 'disease' has an explicit definition, this general concept can
  be applied to a number of things (e.g., adoption of technology, memes, etc.)
SIR epidemic model:
- S = Susceptible -- a vertex isn't infected yet
- I = Infected -- a vertex is infected and can spread the disease
- R = Removed -- a vertex is no longer infected/can't spread (immune or dead)
How the model runs:
- Initialize some subset of vertices in I
- The I state on a vertex lasts for t timesteps
- p = probability of transmission on each interaction between u in I, v in S
- Iterate over timesteps until no vertices remain in I
Notes:
- p is not likely fixed in reality
- Networks are dynamic
- etc.
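The SIR loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name, the adjacency-list representation, and the default parameter values are all my own choices, and the fixed RNG seed is only for reproducibility.

```python
import random

def run_sir(adj, seeds, p=0.3, t_infect=2, rng=None):
    """Simulate the SIR model on an undirected graph.

    adj: dict mapping each vertex to a list of its neighbors
    seeds: vertices initially placed in state I
    p: transmission probability per infected-susceptible contact
    t_infect: number of timesteps a vertex stays in state I
    Returns the set of vertices that were ever infected (I or R at the end).
    """
    rng = rng or random.Random(0)  # fixed seed for a reproducible demo
    state = {v: "S" for v in adj}
    timer = {}                     # remaining infectious timesteps per vertex
    for v in seeds:
        state[v] = "I"
        timer[v] = t_infect

    # Iterate over timesteps until no vertices remain in I
    while any(s == "I" for s in state.values()):
        newly = []
        for u, s in state.items():
            if s != "I":
                continue
            # Each I-S contact transmits independently with probability p
            for v in adj[u]:
                if state[v] == "S" and rng.random() < p:
                    newly.append(v)
        # Decrement infection timers; expired vertices move to R
        for u in list(timer):
            timer[u] -= 1
            if timer[u] == 0:
                state[u] = "R"
                del timer[u]
        for v in newly:
            if state[v] == "S":   # guard against duplicate contacts
                state[v] = "I"
                timer[v] = t_infect
    return {v for v, s in state.items() if s != "S"}
```

With p = 1.0 on a connected graph every vertex is eventually infected, and with p = 0.0 the outbreak never leaves the seed set -- a quick sanity check that matches the notes above.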