Selecting PCA-correlated SNPs

P. Paschou, E. Ziv, E. Burchard, M.W. Mahoney, and P. Drineas

The algorithm

Our algorithm for identifying PCA-correlated SNPs takes as input a properly encoded SNP data matrix and outputs a score for each SNP. SNPs corresponding to the highest scores are typically the most informative in reproducing the structure of the population. The scores are computed as described in our paper (a link to our paper will be posted here).

We provide a MatLab implementation of the score computation. The file PCAscores.m is a MatLab function that returns a score for each SNP. It takes two inputs:
• an m-by-n matrix A, whose rows correspond to m subjects and whose columns correspond to n SNPs, and
• a positive integer k corresponding to the number of significant principal components in A. If k is set to zero, then our function automatically infers the number of significant principal components using the procedure described in our paper. This procedure is implemented in estimate_k.m and must be downloaded in the same directory as PCAscores.m
We now provide details regarding the encoding of the matrix A. There are three possible values for each entry in A. Consider for example the j-th column of A, corresponding to the j-th SNP in our data. Let A and B be the (alphabetically ordered) alleles associated with SNP j. Then, the entries in the j-th column of A are either -1 (denoting an individual with genotype AA), or 0 (denoting an individual with genotype AB), or +1 (denoting an individual with genotype BB). 0. We note that the choices of -1 and +1 could be reversed with no effect on the algorithm. At this point, the input to our code should have this format.

Missing entries are not allowed; we assume that the user has filled in missing entries using some procedure. In our work, we employed a regression-based method to fill in missing entries. (Code for this procedure will also be posted.)

Example

We provide an example run of the code in MatLab. First, download test.txt. This file contains data on 62 individuals (one per row) and 400 SNPs. The (i,j)-th entry in the file (i-th row, j-th column) corresponds to the genotype of the i-th indiviual for the j-th SNP.

To run our code, first we need to load the input data in MatLab. Type

in the MatLab command prompt and hit enter. Now, type whos in the MatLab command prompt and hit enter. A variable named test should appear; this variable is a 62-by-400 matrix containing the data. To run our code, type

[scores] = PCAscores(test,0);

and hit enter. The code will automatically figure out the number of principal components to be used. Alternatively, the user could type

k = estimate_k(test)

and hit enter; this command would return the number of significant principal components identified by the permutation test. Now type

[scores] = PCAscores(test,k);

and hit enter. This command manually specified the number of significant principal components to use.

The output of the code is a variable named scores. Typing whos in the MatLab prompt and hitting enter would return two variables: test (described above) and scores. Notice that scores is a vector with 400 elements; each element corresponds to the PCA score of the corresponding SNP. The following two commands would return (for example) the locations of the top five PCA-correlated SNPs.

[Y,I] = sort(scores); I(1:5)

Lists of PCA-correlated SNPs

We provide text files containing the top PCA-correlated SNPs for certain combinations of populations that we examined in our paper. For each of the top PCA-correlated SNPs we posted its tsc number, its rs number, the chromosome and the physical position of the SNP, and the frequency of the alphabetically first allele in each studied population.

More files/details will be posted soon.
Notice that an rs number of zero corresponds to SNPs whose rs number was unknown. A tsc number is typically available for such SNPs.

Figures