Tracing sub-structure in the European American
population with PCA-informative markers
Paschou P, Drineas P, Lewis J,
Nievergelt CM, Nickerson DA, Smith JD,
Ridker PM, Chasman DI, Krauss RM, and Ziv E.
Our redundancy removal approach
Our algorithm for removing redundant PCAIMs uses the Rank-Revealing QR (RRQR) matrix decomposition,
computed here as a QR decomposition with column pivoting, which is implemented in the MATLAB command qr.
Let A be an m-by-n matrix, whose rows correspond to m subjects and whose columns
correspond to n SNPs. Here we assume that n is small, e.g., that the user has already selected a subset of "informative" SNPs using an
algorithm of their choice (see, for example, our PCA-Informative SNP selection algorithm).
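For reference, the factorization that a column-pivoted QR computes can be written as below. The symbols P, Q, R and r_kk are notation introduced here for illustration; they do not appear in the original text beyond the output names Q and R.

```latex
% Column-pivoted QR of the m-by-n genotype matrix A.
% P is an n-by-n permutation matrix, Q has orthonormal columns,
% and R is upper triangular with diagonal entries r_{kk}.
\[
  A P = Q R,
  \qquad
  |r_{11}| \ge |r_{22}| \ge \cdots \ge |r_{kk}|,
  \quad k = \min(m, n).
\]
```

The permutation P records the order in which columns (SNPs) were picked: at each step the pivot is the remaining column with the largest component orthogonal to the columns already chosen, so a column that is nearly a linear combination of earlier picks is pushed toward the end.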
We now provide details regarding the encoding of the matrix A. There are three possible values
for each entry in A. Consider for example the j-th column of A, corresponding to
the j-th SNP in the data. Let a and b be the (alphabetically ordered) alleles
associated with SNP j. Then, the entries in the j-th column of A are either -1
(denoting an individual with genotype aa), or 0 (denoting an individual with
genotype ab), or +1 (denoting an individual with genotype bb). We note that
the choices of -1 and +1 could be reversed with no effect on the algorithm. Missing
entries are not allowed; we assume that the user has filled in missing entries
using other procedures.
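The encoding described above can be sketched in MATLAB as follows. This is only an illustration under stated assumptions: the input G (a cell array of two-character genotype strings) and its contents are hypothetical, every SNP is assumed to be biallelic in the sample, and missing calls are assumed to have been filled in already.

```matlab
% Hypothetical input: G is an m-by-n cell array of genotype strings
% (rows = subjects, columns = SNPs), with no missing calls.
G = {'AA','CT'; 'AG','CC'; 'GG','TT'; 'AG','CT'};

[m, n] = size(G);
A = zeros(m, n);
for j = 1:n
    % Sorted distinct allele characters observed for SNP j.
    alleles = unique([G{:, j}]);
    if numel(alleles) ~= 2
        error('SNP %d is not biallelic in this sample.', j);
    end
    b = alleles(2);                      % alphabetically second allele
    for i = 1:m
        % Count copies of allele b and shift: aa -> -1, ab -> 0, bb -> +1.
        A(i, j) = sum(G{i, j} == b) - 1;
    end
end
disp(A)
```

Counting copies of the second allele and subtracting one reproduces the -1/0/+1 coding directly, so heterozygotes written as 'AG' or 'GA' are treated the same way.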
Given the matrix A, in MATLAB type
[Q, R, order] = qr (A, 0);
and hit enter. Ignore the output matrices Q and R, and keep only the vector order, which lists the column (SNP) indices of A
ranked from least to most redundant. For example, to see the indices of the 10 least redundant (most uncorrelated) SNPs, simply type
order(1:10)
and hit enter.
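As a small illustration of how the vector order can be used, the sketch below runs the same qr call on a toy matrix in which one SNP is an exact copy of another. The matrix, the cutoff k, and the variable names keep and A_reduced are made up for this example and are not part of the original procedure.

```matlab
% Toy genotype matrix: 6 subjects, 4 SNPs, encoded as -1/0/+1.
% Column 3 is an exact copy of column 1, so it adds no new information.
A = [ 1  0  1 -1;
     -1  1 -1  0;
      0 -1  0  1;
      1  1  1  1;
     -1  0 -1 -1;
      0 -1  0  0];

% Economy-size QR with column pivoting; order is the pivot order.
[~, R, order] = qr(A, 0);

k = 3;                        % hypothetical number of SNPs to keep
keep = sort(order(1:k));      % column indices of the retained SNPs
A_reduced = A(:, keep);       % genotype matrix restricted to those SNPs

disp(order)                   % the duplicated SNP (1 or 3) is ranked last
disp(abs(diag(R))')           % last value is (numerically) zero
```

Whichever of columns 1 and 3 is picked first, the other has nothing left to contribute and ends up at the end of order. The magnitudes of the diagonal of R are non-increasing, and a sharp drop is one possible (heuristic) guide when choosing how many SNPs to keep.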
PCA Informative SNPs (PCAIMs) for the CHORI dataset
PCA Informative SNPs (PCAIMs) for the CORIELL dataset
PCA Informative SNPs (PCAIMs) for the joint (CHORI + CORIELL) dataset