A SPECTRAL GRAPH APPROACH TO DISCOVERING GENETIC ANCESTRY

Outline: Introduction; Methods; Spectral embeddings revisited. Connection to MDS and kernel PCA; Spectral clustering and Laplacian eigenmaps; Number of dimensions via eigengap heuristic; Controlling for ancestry in association studies; Algorithm for SpectralR and SpectralGEM; Analysis of data; Data analysis of POPRES; Outlier dataset; Cluster dataset; Full dataset; Simulations for association; Discussion; References; Author's Addresses

The Annals of Applied Statistics
2010, Vol. 4, No. 1, 179–202
DOI: 10.1214/09-AOAS281
© Institute of Mathematical Statistics, 2010

BY ANN B. LEE, DIANA LUCA AND KATHRYN ROEDER
Carnegie Mellon University, Genentech Inc. and Carnegie Mellon University

Mapping human genetic variation is fundamentally interesting in fields such as anthropology and forensic inference. At the same time, patterns of genetic diversity confound efforts to determine the genetic basis of complex disease. Due to technological advances, it is now possible to measure hundreds of thousands of genetic variants per individual across the genome. Principal component analysis (PCA) is routinely used to summarize the genetic similarity between subjects. The eigenvectors are interpreted as dimensions of ancestry. We build on this idea using a spectral graph approach. In the process we draw on connections between multidimensional scaling and spectral kernel methods. Our approach, based on a spectral embedding derived from the normalized Laplacian of a graph, can produce more meaningful delineation of ancestry than by using PCA. The method is stable to outliers and can more easily incorporate different similarity measures of genetic data than PCA. We illustrate a new algorithm for genetic clustering and association analysis on a large, genetically heterogeneous sample.

Received April 2009; revised August 2009.
Supported by NIH (Grant MH057881) and ONR (Grant N0014-08-1-0673).
Key words and phrases. Human genetics, dimension reduction, multidimensional scaling, population structure, spectral embedding.

1. Introduction. Human genetic diversity is of interest in a broad range of contexts, ranging from understanding the genetic basis of disease to applications in forensic science. Mapping clusters and clines in the pattern of genetic diversity provides the key to uncovering the demographic history of our ancestors. To determine the genetic basis of complex disease, individuals are measured at large numbers of genetic variants across the genome as part of the effort to discover the variants that increase liability to complex diseases such as autism and diabetes.

Genetic variants, called alleles, occur in pairs, one inherited from each parent. High throughput genotyping platforms routinely yield genotypes for hundreds of thousands of variants per sample. These are usually single nucleotide variants (SNPs), which have two possible alleles, hence, the genotype for a particular variant can be coded based on allele counts (0, 1 or 2) at each variant. The objective is to identify SNPs that either increase the chance of disease, or are physically nearby an SNP that affects disease status.

Due to demographic, biological and random forces, variants differ in allele frequency in populations around the world [Cavalli-Sforza, Menozzi and Piazza (1994)]. An allele that is common in one geographical or ethnic group may be rare in another. For instance, the O blood type is very common among the indigenous populations of Central and South America, while the B blood type is most common in Eastern Europe and Central Asia [Cavalli-Sforza, Menozzi and Piazza (1994)]. The lactase mutation, which facilitates the digestion of milk in adults, occurs with much higher frequency in northwestern Europe than in southeastern Europe (Figure 1). Ignoring the structure in populations leads to spurious associations in case-control genetic association studies due to differential prevalence of disease by ancestry.

FIG. 1. Percent of adult population who are lactose intolerant (http://www.medbio.info/Horn/Time). A gradient runs from north to south, correlating with the spread of the lactase mutation. Finland provides an exception to the gradient due to the Asian influence in the north.

Although most SNPs do not vary dramatically in allele frequency across populations, genetic ancestry can be estimated based on allele counts derived from individuals measured at a large number of SNPs. An approach known as structured association clusters individuals to discrete subpopulations based on allele frequencies [Pritchard, Stephens and Donnelly (2000a)]. This approach suffers from two limitations: results are highly dependent on the number of clusters; and realistic populations do not naturally resolve into discrete clusters. If fractional membership in more than one cluster is allowed, the calculations become computationally intractable for the large data sets currently available. A simple and appealing alternative is principal component analysis (PCA) [Cavalli-Sforza, Menozzi and Piazza (1994), Price et al. (2006), Patterson, Price and Reich (2006)], or principal component maps (PC maps). This approach summarizes the genetic similarity between subjects at a large number of SNPs using the dominant eigenvectors of a data-based similarity matrix. Using this “spectral” embedding of the data, a small number of eigenvectors is usually sufficient to describe the key variation. The PCA framework provides a formal test for the presence of population structure based on the Tracy–Widom distribution [Patterson, Price and Reich (2006), Johnstone (2001)]. Based on this theory, a test for the number of significant eigenvectors is obtained. In Europe, eigenvectors displayed in two dimensions often reflect the geographical distribution of populations [Heath et al. (2008), Novembre et al. (2008)]. There are some remarkable examples in the population genetics literature of how PC maps can reveal hidden structures in human genetic data that correlate with tolerance of lactose across Europe [Tishkoff et al. (2007)], migration patterns and the spread of farming technology from Near East to Europe [Cavalli-Sforza, Menozzi and Piazza (1994)]. Although these stunning patterns can lead to overinterpretation [Novembre and Stephens (2008)], they are remarkably consistent across the literature.

In theory, if the sample consists of k distinct subpopulations, k − 1 axes should be
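
As a rough illustration of the PC map idea described above (allele counts coded 0, 1 or 2, summarized by the dominant eigenvectors of a data-based similarity matrix), the following sketch simulates two hypothetical populations and embeds the subjects. The function names simulate_genotypes and pc_map, the simulated allele frequencies, and the per-SNP standardization by the sample standard deviation are all assumptions made for illustration; they are not taken from the paper, and published PCA pipelines may normalize differently.

```python
import numpy as np


def simulate_genotypes(freqs, n_per_pop, rng):
    """Draw allele counts (0, 1 or 2) for each hypothetical population.

    freqs has shape (n_pops, n_snps); row p holds the allele frequencies
    assumed for population p. Each genotype is drawn as Binomial(2, freq).
    """
    blocks = [rng.binomial(2, f, size=(n_per_pop, f.size)) for f in freqs]
    return np.vstack(blocks).astype(float)


def pc_map(genotypes, n_axes=2):
    """Embed subjects using leading eigenvectors of a similarity matrix."""
    # Drop SNPs that show no variation, then center and scale each column.
    x = genotypes[:, genotypes.std(axis=0) > 0]
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    # Subject-by-subject similarity matrix (n_subjects x n_subjects).
    similarity = x @ x.T / x.shape[1]
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(similarity)
    order = np.argsort(eigvals)[::-1][:n_axes]
    return eigvals[order], eigvecs[:, order]


rng = np.random.default_rng(0)
n_snps, n_per_pop = 2000, 50

# Assumed allele frequencies: population 2 is a perturbed copy of population 1.
base = rng.uniform(0.1, 0.9, size=n_snps)
shift = rng.normal(0.0, 0.15, size=n_snps)
freqs = np.clip(np.vstack([base, base + shift]), 0.05, 0.95)

genotypes = simulate_genotypes(freqs, n_per_pop, rng)
labels = np.repeat([0, 1], n_per_pop)

eigvals, axes = pc_map(genotypes, n_axes=2)
pc1 = axes[:, 0]

# Average position of each simulated population along the first axis.
print("mean PC1, population 0:", pc1[labels == 0].mean())
print("mean PC1, population 1:", pc1[labels == 1].mean())
```

In this simulated setting the first axis is expected to place the two populations on opposite sides of zero. Real genotype data would additionally need quality control, handling of missing calls, and a considered choice of normalization.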

