Selecting PCA-correlated SNPs
P. Paschou, E. Ziv, E. Burchard, M.W. Mahoney, and P. Drineas
The algorithm
Our algorithm for identifying PCA-correlated SNPs takes as input a properly encoded SNP data matrix and outputs a score for each SNP. SNPs corresponding to the highest
scores are typically the most informative in reproducing the structure of the population. The scores are computed as described in our paper (a link to our paper will
be posted here).
We provide a MatLab implementation of the score computation. The file PCAscores.m is a MatLab function that returns a score for
each SNP. It takes two inputs:
- an m-by-n matrix A, whose rows correspond to m subjects and whose columns correspond to n SNPs, and
- a positive integer k corresponding to the number of significant principal components in A. If k is set to zero, then our function automatically
infers the number of significant principal components using the procedure described in our paper. This procedure is implemented in
estimate_k.m and must be downloaded in the same directory as PCAscores.m
We now provide details regarding the encoding of the matrix A. There are three possible values for each entry in A.
Consider for example the j-th column of A, corresponding to the j-th SNP in our data. Let A and B be the (alphabetically
ordered) alleles associated with SNP j. Then, the entries in the j-th column of A are either -1 (denoting an individual with genotype AA), or 0 (denoting an individual
with genotype AB), or +1 (denoting an individual with genotype BB). 0. We note that the choices of -1 and +1 could be reversed with no effect on the algorithm. At this point, the input
to our code should have this format.
Missing entries are not allowed; we assume that the user has filled in missing entries using some
procedure. In our work, we employed a regression-based method to fill in missing entries. (Code for this procedure will also be posted.)
Example
We provide an example run of the code in MatLab. First, download test.txt. This file contains data on 62 individuals (one per row) and 400 SNPs.
The (i,j)-th entry in the file (i-th row, j-th column) corresponds to the genotype of the i-th indiviual for the j-th SNP.
To run our code, first we need to load the input data in MatLab. Type
load test.txt -ASCII
in the MatLab command prompt and hit enter. Now, type whos in the MatLab command prompt and hit enter. A variable named test should appear; this variable is a 62-by-400 matrix
containing the data. To run our code, type
[scores] = PCAscores(test,0);
and hit enter. The code will automatically figure out the number of principal components to be used. Alternatively, the user could type
k = estimate_k(test)
and hit enter; this command would return the number of significant principal components identified by the permutation test. Now type
[scores] = PCAscores(test,k);
and hit enter. This command manually specified the number of significant principal components to use.
The output of the code is a variable named scores. Typing whos in the MatLab prompt and hitting enter would return two variables: test (described above) and
scores. Notice that scores is a vector with 400 elements; each element corresponds to the PCA score of the corresponding SNP. The following two commands would return
(for example) the locations of the top five PCA-correlated SNPs.
[Y,I] = sort(scores);
I(1:5)
Lists of PCA-correlated SNPs
We provide text files containing the top PCA-correlated SNPs for certain combinations of populations that we examined in our paper. For each of the top PCA-correlated SNPs
we posted its tsc number, its rs number, the chromosome and the physical position of the SNP, and the frequency of the alphabetically first allele in each studied
population.
More files/details will be posted soon.
- African Populations (Mbuti, Mende, Burunge)
- African Populations, incl. African Americans (Mbuti, Mende, Burunge, African Americans)
- European Populations (Caucasians, Spanish)
- Asian Populations (Asians, South Altaians, Mala)
- American Populations (Nahua, Quechua)
- Nine Indigenous Populations (Mbuti, Asians, Mende, Nahua, Quechua, South Altaians, Mala, Spanish, Burunge)
- Four Continents (Africa, Europe, Asia, America)
Notice that an rs number of zero corresponds to SNPs whose rs number was unknown. A tsc number is typically available for such SNPs.
Figures