Selecting tagging SNPs
Polymorphisms are variations in DNA sequence between individuals of the same
population. A significant number of these variations involve only a single
nucleotide and hence are called single nucleotide polymorphisms (SNPs, pronounced "snips").
It is conjectured that human DNA has more than ten million SNPs but only a few
million have been discovered so far. These variations are important because,
although they do not cause diseases themselves, they do determine the susceptibility of
an individual to them. Strong correlations have been observed between certain
genetic variations and diseases like heart disease, diabetes, and different
types of cancer. Recent developments have significantly reduced the cost of
assaying SNPs. However, their sheer number and the strong correlation among
neighboring SNPs invite compression. If we can somehow select a much
smaller representative set of SNPs, they can then be used to reconstruct the
complete original set. The SNPs in this smaller set are known as tagging SNPs
(or tSNPs) in the literature. We use de-randomized counterparts of recent
linear algebraic algorithms to select the tSNPs that best capture the SNP variance.
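
The select-then-reconstruct idea can be illustrated with a minimal sketch. It is not the algorithm from the publications listed on this page: as a stand-in for the de-randomized algorithms, it uses a simple deterministic greedy rule (column-pivoted Gram-Schmidt) that repeatedly picks the SNP with the most variance not yet explained by the SNPs already chosen. The function names, the 0/1/2 genotype encoding, and the least-squares reconstruction step are all assumptions made for illustration.

```python
import numpy as np

def select_tagging_snps(X, c):
    """Greedily pick up to c columns (SNPs) of the individuals-by-SNPs matrix X.

    Hypothetical helper: at each step, choose the SNP with the largest
    residual variance, then project that SNP out of all remaining columns.
    """
    R = X.astype(float) - X.mean(axis=0)   # centered residual matrix
    tags = []
    for _ in range(c):
        norms = np.sum(R ** 2, axis=0)     # unexplained variance per SNP
        j = int(np.argmax(norms))
        if norms[j] <= 1e-12:              # nothing left to explain
            break
        tags.append(j)
        q = R[:, j] / np.sqrt(norms[j])    # unit vector along the chosen SNP
        R = R - np.outer(q, q @ R)         # remove its contribution everywhere
    return tags

def reconstruct(X_train, tags, X_new_tags):
    """Predict every SNP of new individuals from their tSNP genotypes.

    Fits a least-squares map from tSNPs to all SNPs on training individuals,
    then rounds predictions to the assumed 0/1/2 genotype encoding.
    """
    mean = X_train.mean(axis=0)
    A = X_train[:, tags] - mean[tags]      # training tSNPs, centered
    B = X_train - mean                     # all training SNPs, centered
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    pred = (X_new_tags - mean[tags]) @ W + mean
    return np.clip(np.rint(pred), 0, 2).astype(int)

# Synthetic demo: 3 independent SNPs, each copied 4 times (perfect correlation).
rng = np.random.default_rng(0)
X = np.repeat(rng.integers(0, 3, size=(200, 3)), 4, axis=1)
X_train, X_test = X[:150], X[150:]
tags = select_tagging_snps(X_train, c=3)
X_hat = reconstruct(X_train, tags, X_test[:, tags])
print("tSNPs:", tags, "reconstruction accuracy:", np.mean(X_hat == X_test))
```

On real genotype data the correlations are imperfect, so the number of tSNPs `c` trades assay cost against reconstruction accuracy and would need to be tuned on held-out individuals.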
P. Paschou, M.W. Mahoney, A. Javed, J.R. Kidd, A.J. Pakstis, S. Gu, K.K. Kidd, and P. Drineas,
Intra- and inter-population genotype reconstruction from tagging SNPs,
Genome Research, January 2007. [pubmed, dataset and code]
A. Javed and P. Paschou,
Extracting tagging SNPs from Genome-wide Datasets,
Data Mining for Biomedical Informatics, workshop held in conjunction with the 7th SIAM Conference on Data Mining, April 2007.
(Image courtesy of David Hall, under the terms of the GNU Free Documentation License.)