Assignment 2: Graph Analysis

Due Date: Mon 3rd Oct, Before Midnight


Network Properties for Ecoli

Download the file Attach:Ecoli.txt that gives the gene network for E.Coli. It has an edge whenever there is an interaction between the two genes. You should use the networkx python module for this assignment, for constructing an undirected graph from the data file.

Write an python script to compute several of the graph properties for the E.coli network:

  • Diameter: Compute the diameter. Note that normally, d_{ij}, the distance (shortest path length) between node i and j, is considered to the infinity if i cannot reach j. For this assignment, simply ignore d_{ij} if i cannot reach j. In other words, compute the diameter as the maximum d_{ij} over all reachable pairs i and j. Note: you cannot directly use the diameter function in networkx. However you may compute the shortest paths using the shortest.paths functions in networkx.
  • Degree Distribution: Print the degree distribution of the Ecoli network. Next, plot the degree k and the probability P(k) in a log-log plot to see whether the plot looks like a straight line. Use the linear regression function within numpy to fit a line to the log-log data, and plot the line and print its slope, corresponding to the exponent in the power-law degree distribution: P(k) \propto k^{−\gamma}. You may omit the first few and last few degree values to get a better fit. Use your judgement to get a "good" fit. Note: you may not directly use the degree distribution related functions in networkx.
  • Clustering Coefficient: Compute the clustering coefficient for the graph. If a node has degree less than 2, assume that its local clustering coefficient is 0. Note: you may not use the transitivity, triangle, clustering functions in networkx.

Page Rank in E. Coli

Download the file Attach:Ecoli-directed.txt that gives the gene regulatory network for E.Coli. It has a directed edge from gene u to gene v, if u controls v. You should use networkx to create a directed graph from these edges.

Compute the pagerank for each gene, via the power-iteration method, assuming that d=0.1, that is with probability 0.1 we follow a random link, and with probability 0.9 an existing link from each node. That is write a function to find the dominant eigenvector and eigenvalue for the pagerank matrix of the directed Ecoli graph. One can compute the dominant eigen-vector/-value of a matrix M iteratively as follows. Let

x_0 = \begin{pmatrix} 1 \\ 1\\ \vdots \\ 1 \end{pmatrix}
be the starting vector. In each iteration i, we compute the new vector:

x_i = M \; x_{i-1}

We then find the element of x_i that has the maximum absolute value, say at index m. For the next round, to avoid numerical issues with large values, we re-scale x_i by dividing all elements by x_i[m], so that the largest value is always 1 before we begin the next iteration. To test convergence, you may test the magnitude of the difference between the old and new scaled vector, i.e., \|x_i - x_{i-1}\|, and you can stop if the difference between these two falls below some threshold, say 0.001.

Run your iterative dominant eigensolver on the Ecoli-directed dataset. Find the dominant eigen-value and eigen-vector. For the final eigen-vector, make sure to normalize it, so that it has unit length. Print the top 10 genes according to pagerank.

Note: You may not use the pagerank methods from networkx, or the eig function in numpy.


What to turn in

  • Write a python script called RCSID-Assign2.py. Use comments lines to separate your code/function for each question. Here RCSID is your RPI email id (without the rpi.edu part).
  • Submit a PDF file named RCSID-Assign2.pdf that should include your solutions to each of the questions (just cut and paste the output from python). The figure (degree distribution log-log plot) should also be part of this file.
  • You may not use any of the inbuilt functions in networkx or any other python package that directly answer any of the questions. You may use them to verify your answers. Secondary functions may be used, e.g., shortest-paths, neighbors, etc.
  • Do not hard code any of the file path names (you may hard code just the filename). You can assume that any data file will be in the local directory.
  • Submit the assignment as a zip or tar file that includes the RCSID-Assign2.py sript and RCSID-Assign2.pdf file. Email both as an attachment to the course assignment submission email address: . The subject of your email should be "RCSID-Assign2 Submission", where RCSID is your RPI email id. Please DO NOT use your RIN number; use your RCSID.

Solutions

Here are the answers to Part 1:

  • diameter = 9
  • Degree Distribution: slope -1.96, Plot: Attach:degdist.pdf
  • Clustering coefficient: 0.2116

Part 2:

gene ranked 0 is 895 with rank 0.192730429044
gene ranked 1 is 297 with rank 0.190228983655
gene ranked 2 is 298 with rank 0.190228983655
gene ranked 3 is 559 with rank 0.188642875597
gene ranked 4 is 758 with rank 0.188642875597
gene ranked 5 is 1409 with rank 0.188642875597
gene ranked 6 is 305 with rank 0.188642875597
gene ranked 7 is 314 with rank 0.188642875597
gene ranked 8 is 109 with rank 0.0473436393026
gene ranked 9 is 1562 with rank 0.0473436393026

The python script is as follows:

#!/usr/bin/env python

import networkx as nx
from math import log
import matplotlib.pyplot as plt
import numpy as np
from operator import itemgetter

##########################
#Part 1: Network Properties for Ecoli

def part1():
    #create undirected E-coli graph
    f = open("Ecoli.txt", "rU")

    G = nx.Graph()
    for l in f.readlines():
        a = l.strip().split()
        if a[0] != a[1]:
            G.add_edge(a[0],a[1])

    print G.number_of_nodes()
    print G.number_of_edges()


    #Q1. diameter
    splen = nx.all_pairs_shortest_path_length(G)
    diameter = -1
    for n in G.nodes():
        for n2 in G.nodes():
            if n in splen and n2 in splen[n]\
               and splen[n][n2] > diameter:
                diameter = splen[n][n2]
    print "diameter =", diameter


    #Q2. compute degree distribution
    Pk = {} #deg dist
    for n in G.nodes():
        dn = G.degree(n)
        if dn not in Pk: Pk[dn] = 0
        Pk[dn] += 1

    print sorted(Pk)
    D = [G.degree(n) for n in G.nodes()]
    print "max deg node", max(D), D.index(max(D))

    K = [log(k,2) for k in sorted(Pk)]
    P = [log(1.0*Pk[k]/G.order(),2) for k in sorted(Pk)]

    xs = 0 #start at deg 1
    xe = 30 #index of max degree to consider
    plt.plot(K,P,'ro')
    plt.ylabel('log P(k): probability')
    plt.xlabel('log k: degree')
    m = np.polyfit(K[xs:xe], P[xs:xe], 1)
    #print m
    pfit = np.polyval(m,K[xs:xe])
    s = "slope=%0.2f"%(m[0])
    plt.text(2.5, -4, s)
    plt.plot(K[xs:xe],pfit,'b-')
    plt.savefig("degdist.pdf")
    plt.close()

    #Q3. clustering coefficient
    Ck = {}
    sumCn = 0.0
    for n in G.nodes():
        dn = G.degree(n)
        Nn = G.neighbors(n)
        Ne = 0.0
        for j in Nn:
            Nj = G.neighbors(j)
            Common = filter(lambda(x): x in Nn, Nj)
            Ne += len(Common)
        if dn > 1:
            if dn not in Ck: Ck[dn] = []
            Cn = Ne/(dn*(dn-1))
            Ck[dn].append(Cn)
            if Cn > 1: print "error", n, Cn
            sumCn += Cn

    print "average clustering coefficient =", sumCn/G.order(),\
            nx.average_clustering(G)

#########################
#Part 2: Pagerank in Ecoli

#compute the dominant eigenvalue
def myeig(M,n):
    X = M.T #take transpose of pagerank matrix
    iter = 0
    threshold = 0.001
    converged = False
    p = np.matrix(np.ones((n,1)))

    while not converged:
        ip = X * p
        maxval = np.max(ip)
        ip = ip/maxval  #scale the vector
        err = np.linalg.norm(p-ip)
        p = ip
        iter += 1
        #print "myeig", iter, err
        converged = err <= threshold and True or False

    p = p/np.linalg.norm(p) #normalize the vector
    return p


#create pagerank matrix, call myeig, and print the top ranks
def part2():
    #create directed E-coli graph
    f = open("Ecoli-directed.txt", "rU")
    d= 0.1 #jump probability

    dG = nx.DiGraph()
    for l in f.readlines():
        a = l.strip().split()
        dG.add_edge(a[0],a[1])

    print dG.number_of_nodes()
    print dG.number_of_edges()
    n = dG.number_of_nodes()

    A = nx.adj_matrix(dG)

    # what is the degree of each node
    degAry = np.squeeze(np.sum(A,axis=1).A)

    #create normalized adjacency matrix
    #handle degenerate case when a node is isolated; 
    #assume it as fully connected
    N = A
    for i in range(n):
        if degAry[i] == 0:
            N[i,:] = np.ones((1,n))/n
        else:
            N[i,:] /= degAry[i]

    Nf = np.ones((n,n))/n #random jump matrix
    M = d*Nf + (1-d)*N #pagerank matrix
    #print M

    pageranks = myeig(M, n)

    #sort the pageranks in descending order
    R = [(i,r) for (i,r) in enumerate(pageranks)]
    sortedR = sorted(R, key=itemgetter(1), reverse=True)

    #print top 10 genes
    for i in range(10):
        print "gene ranked", i, "is ", sortedR[i][0], \
                "with rank ",sortedR[i][1].item()

#######MAIN######
if __name__ == "__main__":
    part1()
    part2()
GlossyBlue theme adapted by David Gilbert
Powered by PmWiki