Tracing sub-structure in the European American
population with PCA-informative markers

Paschou P, Drineas P, Lewis J,
Nievergelt CM, Nickerson DA, Smith JD,
Ridker PM, Chasman DI, Krauss RM, and Ziv E.

**Our redundancy removal approach**

Our algorithm for removing redundant PCAIMs uses the so-called Rank-Revealing QR matrix decomposition,
as implemented in the MATLAB command **qr**.

Let *A* be an *m-by-n* matrix, whose rows correspond to *m* subjects and whose columns
correspond to *n* SNPs. Here we assume that *n* is small, e.g., that the user has already selected a subset of "informative" SNPs using a
preferred algorithm (see, for example, our PCA-Informative SNP selection algorithm).
We now provide details regarding the encoding of the matrix *A*. There are three possible values
for each entry in *A*. Consider for example the *j*-th column of *A*, corresponding to
the *j*-th SNP in the data. Let *a* and *b* be the (alphabetically ordered) alleles
associated with SNP *j*. Then, the entries in the *j*-th column of *A* are either -1
(denoting an individual with genotype *aa*), or 0 (denoting an individual with
genotype *ab*), or +1 (denoting an individual with genotype *bb*). We note that
the choices of -1 and +1 could be reversed with no effect on the algorithm. Missing
entries are not allowed; we assume that the user has filled in missing entries
using other procedures.
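
The encoding described above can be sketched in MATLAB as follows. This is only an illustration: the cell array *G* of two-letter genotype calls and its contents are hypothetical, and the sketch assumes there are no missing entries.

```matlab
% Hypothetical genotype calls: rows are subjects, columns are SNPs.
G = {'AA', 'CT';
     'AG', 'CC';
     'GG', 'TT'};                      % 3 subjects, 2 SNPs

[m, n] = size(G);
A = zeros(m, n);
for j = 1:n
    % unique() returns the distinct characters in sorted order,
    % so alleles(1) is the alphabetically first allele (a).
    alleles = unique([G{:, j}]);
    a = alleles(1);
    for i = 1:m
        % Count copies of allele a: 2 copies -> -1 (aa),
        % 1 copy -> 0 (ab), 0 copies -> +1 (bb).
        A(i, j) = 1 - sum(G{i, j} == a);
    end
end
```

For this toy input, the first column of *A* encodes AA, AG, GG as -1, 0, +1, and the second column encodes CT, CC, TT as 0, -1, +1.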

Given the matrix *A*, type the following in MATLAB

**[Q, R, order] = qr (A, 0);**

and hit enter. Ignore the output matrices *Q* and *R*, and keep only the vector *order*, which lists the column indices of *A* (i.e., the SNPs),
starting with the least correlated ones. For example, to see the indices of the 10 least correlated SNPs, simply type

**order(1:10)**

and hit enter.
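
As a small, hedged illustration of this behavior, the sketch below applies the same **qr** call to a made-up encoded matrix in which one SNP exactly duplicates another. The matrix, the variable names *k* and *keep*, and the choice of *k* are assumptions for the example, not part of our datasets.

```matlab
% Hypothetical encoded matrix: 5 subjects, 4 SNPs.
% Column 4 is an exact copy of column 1, so one of the two is redundant.
A = [-1  0  1 -1;
      0  1 -1  0;
      1  1  0  1;
      1 -1  0  1;
      0  0  1  0];

% With three outputs and the flag 0, qr returns a permutation vector
% such that A(:, order) equals Q*R (up to rounding).
[Q, R, order] = qr(A, 0);

k = 3;                    % number of SNPs to retain (illustrative choice)
keep = order(1:k);        % indices of the retained SNPs
dropped = order(k+1:end); % indices of the SNPs treated as redundant
```

Because columns 1 and 4 are identical, one of them should appear last in *order*, and the last diagonal entry of *R* should be zero up to rounding error. In practice *k* is chosen by the user; the value 3 here is only for the toy example.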

**PCA Informative SNPs (PCAIMs) for the CHORI dataset**

**PCA Informative SNPs (PCAIMs) for the CORIELL dataset**

**PCA Informative SNPs (PCAIMs) for the joint (CHORI + CORIELL) dataset**