20 Feb 2020
================================================================================
* Link prediction review

Link prediction problem:
- Trying to determine the relative probability that a link will form
- Predicting growth within some network
- We've used basic topological features of the network for prediction
- Results = relatively good on certain networks
  * Networks that grow based on the properties we've talked about
  * Triadic closure, weak/strong links, etc.
- Not great on other networks
  * You'll see that on Homework 2 with the Amazon datasets
  * Limitation of the data; the growth process can be widely different

================================================================================
* Supervised Method(s)

In general: supervised methods use some notion of 'truth' (i.e., a training set)
- In this context: we know what links have formed in the past
- Create 'features' based on these existing links
- Use that to 'train' some algorithm
  * We're going to be using matrix factorization
  * The features we will use are not explicit
  * Other algorithms might make use of explicit features
    * E.g., what subgraphs appear in the network, meta-data, etc.
    * The above is a good project idea

================================================================================
* Matrix Factorization for Link Prediction

Consider:
- We have no meta-data associated with our graph
- We have no notion of what explicit 'features' we should use
- We want some general-purpose approach that is dataset-independent

Matrix factorization in general: X = UV'
- Take some matrix X, and 'factorize' it into U, V

In the context of link prediction:
- Each vertex has some set of features that define it
- We consider this set of features as u_i, for vertex i
- These features, in this context, are 'latent'
  * Latent features = we don't explicitly compute or 'realize' these features
  * We instead just train our algorithm to learn these features
- We predict an (i,j) link based on u_i, u_j, and cross-interactions V

For link prediction:
- We factorize our adjacency matrix into A = UVU'
- Prediction for A_ij = u_i*V*u_j' (see the sketch at the end of this section)
- We're trying to min_{U,V} sum_{nonzeros} abs(A_ij - u_i*V*u_j')
- Really, we'll do min_{U,V} sum_{nonzeros} (A_ij - u_i*V*u_j')^2
- How we do that: gradient descent

Pros:
- Don't need to make features explicit or explicitly determine/calculate them
- Very scalable; easy to parallelize and train
- Relatively high-quality results without much effort

Cons:
- Weak generalization
- Might need to retrain with novel data
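To make the A = UVU' model concrete, here is a minimal NumPy sketch of the
prediction and the nonzeros-only objective. This is illustrative code, not code
from lecture; the names (score, objective) and the assumption that A is a dense
0/1 NumPy adjacency matrix are mine.

import numpy as np

# U: n x k matrix of latent vertex features (row U[i] is u_i for vertex i)
# V: k x k matrix of cross-interactions between latent features

def score(U, V, i, j):
    # Predicted link score for the pair (i, j): u_i * V * u_j'
    return U[i] @ V @ U[j]

def objective(A, U, V):
    # Squared error summed over the nonzero entries of A only
    rows, cols = np.nonzero(A)
    return sum((A[i, j] - score(U, V, i, j)) ** 2 for i, j in zip(rows, cols))

After training, score(U, V, i, j) is the quantity we threshold or rank to
predict whether the (i, j) link will form.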
================================================================================
* Derivation of gradient descent

Gradient descent: a_{n+1} = a_n - α*∇f(a_n)
  α = learning rate
  a_n = current set of features
  a_{n+1} = features for the next iteration
  ∇f(a_n) = how our error changes based on the current set of features

We look at how our error changes with changing inputs at our current point.
We then move in the direction of greatest decrease in our error.

e_ij^2 = (A_ij - u_i*V*u_j')^2

How our error changes with respect to each feature component:
(d e_ij^2) / (d u_i) = 2*(A_ij - u_i*V*u_j') * (-V*u_j')   = -2*e_ij*V*u_j'
(d e_ij^2) / (d u_j) = 2*(A_ij - u_i*V*u_j') * (-u_i*V)    = -2*e_ij*u_i*V
(d e_ij^2) / (d V)   = 2*(A_ij - u_i*V*u_j') * (-u_i*u_j') = -2*e_ij*u_i*u_j'

So our gradient descent update equations:
u_i = u_i + α*2*e_ij*V*u_j'
u_j = u_j + α*2*e_ij*u_i*V
V   = V   + α*2*e_ij*u_i*u_j'

Note: when training only on the nonzeros of a simple adjacency matrix
- We'll just train our predictor to always output values of '1'
- So why not just train on our zeros as well?
  * Note that most graphs are extremely sparse - opposite problem
  * We'd just train our predictor to output '0'
  * The amount of training input scales as O(n^2) instead of ~O(n log n)
- Workaround: weight zeros and nonzeros differently
  * min_{U,V} sum_{nonzeros} (A_ij - u_i*V*u_j')^2
             + w_0*sum_{zeros} (u_i*V*u_j')^2
  * w_0 is a weighting parameter to control the influence of zeros vs. nonzeros

================================================================================
* Regularization

If A and n, k are suitably large:
- We might observe the values of U and V 'blow up'
- So: use regularization
  * We also minimize the magnitudes of U, V within our objective

min_{U,V} sum_{nonzeros} (A_ij - u_i*V*u_j')^2 + w_0*sum_{zeros} (u_i*V*u_j')^2
          + β_1*|U|^2 + β_2*|V|^2

  * Within gradient descent, we have additional 'loss terms' (the gradients of
    the regularizer):
    - 2*β*u_i
    - 2*β*u_j
    - 2*β*V

So our new gradient descent update equations:
u_i = u_i + α*2*(e_ij*V*u_j' - β*u_i)
u_j = u_j + α*2*(e_ij*u_i*V - β*u_j)
V   = V   + α*2*(e_ij*u_i*u_j' - β*V)
(these updates are pulled together in the sketch at the end of these notes)

================================================================================
* Summary

Matrix factorization learns 'latent features' within a given dataset
- We don't have to explicitly realize these latent features to learn them
- We use them to predict future links/growth within a network
- We can solve the problem relatively quickly/simply
- Generalization power can be weak
- Results can be 'not great' relative to our 'unsupervised' methods
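To tie the update equations together, here is a minimal sketch of the weighted,
regularized SGD training loop. This is illustrative code, not from lecture: the
hyperparameter values, the random sampling of (mostly zero) pairs in place of
summing over all O(n^2) zero entries, and the names (train, alpha, beta, w0)
are assumptions of the sketch.

import numpy as np

def train(A, k=16, alpha=0.01, beta=0.05, w0=0.1, epochs=50, seed=0):
    # Sketch of SGD for A ~= U V U' with regularization. A is assumed to be a
    # dense 0/1 NumPy adjacency matrix; hyperparameter values are illustrative.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((k, k))

    nonzeros = list(zip(*np.nonzero(A)))
    for _ in range(epochs):
        # Train on every observed link plus an equal-sized random sample of
        # pairs (almost all zeros), weighted by w0, instead of all zeros.
        zeros = [(rng.integers(n), rng.integers(n)) for _ in nonzeros]
        samples = [(i, j, 1.0) for i, j in nonzeros] + \
                  [(i, j, w0) for i, j in zeros]
        for i, j, w in samples:
            e = A[i, j] - U[i] @ V @ U[j]                 # e_ij
            # Per-sample steps matching the update equations above
            # (1-D vectors, so u_i*V is written as V.T @ U[i]):
            grad_ui = w * e * (V @ U[j]) - beta * U[i]
            grad_uj = w * e * (V.T @ U[i]) - beta * U[j]
            grad_V = w * e * np.outer(U[i], U[j]) - beta * V
            U[i] = U[i] + alpha * 2 * grad_ui
            U[j] = U[j] + alpha * 2 * grad_uj
            V = V + alpha * 2 * grad_V
    return U, V

After training, U[i] @ V @ U[j] serves as the link score for a candidate pair
(i, j). The random zero-sampling here is just one cheap way to approximate the
w_0-weighted sum over zeros in the objective.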