Tracing sub-structure in the European American population with PCA-informative markers

Paschou P, Drineas P, Lewis J, Nievergelt CM, Nickerson DA, Smith JD, Ridker PM, Chasman DI, Krauss RM, and Ziv E.



Our redundancy removal approach

Our algorithm for removing redundant PCAIMs employs the so-called Rank-Revealing QR matrix decomposition, which is available through the column-pivoted form of the MATLAB command qr.

Let A be an m-by-n matrix whose rows correspond to m subjects and whose columns correspond to n SNPs. Here we assume that n is small, e.g., that the user has already selected a subset of "informative" SNPs using their preferred algorithm (see, for example, our PCA-Informative SNP selection algorithm).

We now provide details regarding the encoding of the matrix A. Each entry of A takes one of three possible values. Consider, for example, the j-th column of A, corresponding to the j-th SNP in the data, and let a and b be the (alphabetically ordered) alleles associated with SNP j. Then the entries in the j-th column of A are -1 (denoting an individual with genotype aa), 0 (denoting an individual with genotype ab), or +1 (denoting an individual with genotype bb). We note that the choices of -1 and +1 could be reversed with no effect on the algorithm. Missing entries are not allowed; we assume that the user has filled in missing entries using other procedures.
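The encoding described above can be sketched in MATLAB as follows. This is only an illustration, not part of the original procedure: the toy genotype table G, its contents, and the loop itself are hypothetical, and the sketch assumes every SNP is biallelic in the sample and has no missing genotypes.

```matlab
% Hypothetical toy input: 4 subjects (rows) by 3 SNPs (columns).
% Each cell holds a two-letter genotype; the data are made up for illustration.
G = {'AA', 'CT', 'GG';
     'AG', 'CC', 'GG';
     'GG', 'TT', 'GT';
     'AG', 'CT', 'TT'};

[m, n] = size(G);
A = zeros(m, n);

for j = 1:n
    % Alleles observed at SNP j, sorted alphabetically (e.g. 'AG').
    alleles = unique([G{:, j}]);
    if numel(alleles) ~= 2
        error('SNP %d is not biallelic in this sample.', j);
    end
    b = alleles(2);   % the alphabetically second allele

    % Count copies of allele b per subject (0, 1, or 2) and shift by one:
    % aa -> -1, ab -> 0, bb -> +1.
    A(:, j) = cellfun(@(g) sum(g == b), G(:, j)) - 1;
end

disp(A)
% Displays the encoded matrix
%     -1     0    -1
%      0    -1    -1
%      1     1     0
%      0     0     1
```

Counting copies of the second allele makes the result independent of the order in which a heterozygous genotype is written (for instance 'AG' versus 'GA').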

Given the matrix A, in MATLAB, type

[Q, R, order] = qr (A, 0);

and hit enter. Ignore the output matrices Q and R and keep only the vector order, which lists the column (SNP) indices of A so that the most uncorrelated SNPs come first. For example, to see the indices of the top 10 most uncorrelated SNPs, simply type

order(1:10)

and hit enter.
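As a minimal sketch of how the vector order might be used, the example below builds a toy matrix in which one SNP is an exact copy of another, runs the same qr call, and keeps the first k columns. The random data, the seed, the value of k, and the variable names keep and A_reduced are assumptions made for illustration; they do not come from the CHORI or CORIELL datasets.

```matlab
rng(0);                          % fixed seed, only so the toy example is repeatable
A = randi([-1, 1], 50, 8);       % toy 50-by-8 matrix with entries in {-1, 0, +1}
A(:, 8) = A(:, 1);               % make SNP 8 an exact copy of SNP 1 (fully redundant)

% Economy-size QR with column pivoting; order is a permutation vector.
[Q, R, order] = qr(A, 0);

% The factorization satisfies A(:, order) = Q*R up to rounding error.
disp(norm(A(:, order) - Q * R));

% Keep the k SNPs that the pivoting selected first.
k = 5;                           % hypothetical number of SNPs to retain
keep = order(1:k);
A_reduced = A(:, keep);

% The magnitudes of the diagonal of R are non-increasing; a value near zero
% signals a column that is (numerically) a combination of earlier ones.
disp(abs(diag(R))');
disp(order);
```

In this toy case one would expect the last entry of order to be either 1 or 8, because once one of the two identical columns has been selected the other adds nothing new and is deferred to the end.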

PCA Informative SNPs (PCAIMs) for the CHORI dataset


PCA Informative SNPs (PCAIMs) for the CORIELL dataset


PCA Informative SNPs (PCAIMs) for the joint (CHORI + CORIELL) dataset