Computational Approaches to Haplotype Inference
Ravi Vijaya Satya
Amar Mukherjee

Overview
- SNPs & Haplotypes
- The HapMap Project
- Why “Infer” Haplotypes?
- Computational Methods
  - Maximum Resolution
  - Perfect Phylogeny Haplotyping
  - Haplotyping with Pedigree information
  - Haplotyping via sequencing
- Direct Approach for PPH (Bafna, Gusfield, et al.)

Genetic Variations
- underlie phenotypic differences
- cause inherited diseases
- allow tracking ancestral human history
Source: Gabor T. Marth, www.vanbug.org/talk_ppts/Gabor_2004.ppt

SNP: Single Nucleotide Polymorphism
“Loci in the human genome in which a considerable percentage of the population differs from the rest.”
  …CATGATCACGTCGACGATCGAT…
  …CATGATCACGTCGACGATCGAT…
  …CATGATCATGTCGACGATCGAT…
  …CATGATCACGTCGACGGTCGAT…
- Allele: one of the possible states of a given locus
- The locations, or loci, are also called ‘markers’

Types of SNPs
- Number of alleles:
  - Bi-allelic: a site is called bi-allelic if there are only two possible states for that site
  - Multi-allelic: a site is called multi-allelic if there are more than two possible states for that site
  - Almost all SNPs are bi-allelic
- Coding / Noncoding
  - Coding (CSNP), if the SNP occurs in an exon
  - Non-coding, if it occurs in an intron or in a non-coding region

Types of SNPs (contd.)
- Coding SNPs can be:
  - Silent (the encoded amino acid is unchanged):
      …aca gat cag atc atg…    … T D Q I M …
      …aca gat caa atc atg…    … T D Q I M …
  - Non-silent (the encoded amino acid changes):
      …aca gat cag atc atg…    … T D Q I M …
      …aca gaa cag atc atg…    … T E Q I M …

Haplotypes
Definition 1: “The sequence of a copy of the chromosome”
- Over 10 million SNPs in total
- 1 SNP every 300 base pairs
- If each SNP were independent, there could be $2^{10{,}000{,}000}$ possible combinations
- Limited variation
  - Adjacent SNPs are interdependent
  - ‘A’ at SNP1 → ‘G’ at SNP2, and ‘C’ at SNP1 → ‘T’ at SNP2

Haplotypes (contd.)
Definition 2: Each individual form taken by a block of adjacent, interdependent SNPs is called a ‘Haplotype’.
- A block consisting of 15 SNPs might in fact have only five or six common haplotypes.
- One possible reason: there is a limited number of loci where recombinations are possible

The International HapMap Project
“multi-country effort to identify and catalog genetic similarities and differences in human beings” - HapMap.org
Target: A complete map of genetic variations in different populations
Countries currently involved: United States, Japan, China, Canada, UK and Nigeria

HapMap Goals
- To provide tools and data for ‘association studies’
- The HapMap will help in:
  - Linking diseases to genetic variations
  - Diagnosing diseases
  - Preventing diseases
  - Estimating response to drugs
  - Designing ‘custom’ drugs

Construction of HapMap
- Identification of SNPs
- Compilation of SNPs into Haplotypes
- Finding ‘tag’ SNPs
Picture Source: HapMap.org

Sample Populations
- Yoruba in Ibadan, Nigeria
  - Individuals having four Yoruba grandparents
- Japanese in Tokyo, Japan
  - Individuals from different parts of Japan
- Han Chinese in Beijing, China
  - Individuals having at least three out of four Han grandparents
- CEPH (Centre d'Etude du Polymorphisme Humain)
  - Utah residents with Northern and Western European ancestry

Sample Populations (contd.)
- 270 individuals in total:
  - Yoruba – 30 ‘trios’ (two parents and an adult child)
  - Japanese – 45 unrelated individuals
  - Han Chinese – 45 unrelated individuals
  - CEPH – 30 ‘trios’ – collected in the 1980s
- The samples are anonymous with regard to individual identity

Why ‘infer’ Haplotypes?
- Humans are diploid:
  - Two copies of each chromosome
  - One from each parent
- A site is homozygous if it has the same allele on both chromosomes
- A site is heterozygous if it has different alleles on the two chromosomes
- It is expensive to sequence each chromosome separately
- The chromosomes are sequenced together, producing the ‘genotype’ information

Genotype Data
- Genotype data tells whether each site is:
  - Heterozygous (Aa, unordered)
  - Homozygous for the major allele (AA)
  - Homozygous for the minor allele (aa)
- Haplotype data:
  - Gives the actual alleles at each site on each chromosome copy
- Need to infer haplotypes from genotypes

Haplotype Inference Problem: Given a set of genotypes, can the underlying haplotypes be determined computationally?

Types of Genotype data
- With pedigree information
  - Relationships between at least some of the individuals are known
  - E.g., trios
- Without pedigree information
  - Unrelated individuals
  - Relationship information not available

Haplotyping: Definitions
- All sites are bi-allelic
- The two alleles are represented by ‘0’ and ‘1’
  - ‘0’ generally indicates the more frequent allele
  - ‘1’ indicates the less frequent, or minor, allele
- A haplotype of length m:
  - Is a vector $h = \langle h_1, \dots, h_m \rangle$ over $\{0,1\}^m$
  - Each position i is a site, or locus

Haplotyping: Definitions (contd.)
- A genotype represents two haplotypes:
  - Each site (position) is an unordered pair over {0,1}
  - Can be written as $g = \langle g_1, \dots, g_m \rangle$ over $\{0,1,2\}^m$
  - ‘0’ indicates the pair (0,0); ‘1’ indicates the pair (1,1)
  - ‘2’ indicates the pair (0,1) or (1,0)

  The two haplotypes:  0 1 1 1 0 0 1 1 0
                       1 1 0 1 0 0 1 0 0
  The genotype:        2 1 2 1 0 0 1 2 0

Haplotyping: Definitions (contd.)
- Resolution of a genotype $g = \langle g_1, \dots, g_m \rangle$:
  - A pair $\langle h, k \rangle$ of haplotypes such that, for each i, $1 \le i \le m$:
    - $h_i = k_i = g_i$ if $g_i = 0$ or $1$
    - $h_i \ne k_i$ if $g_i = 2$
- A haplotype h is compatible with a genotype g if there exists another haplotype h’ such that the pair $\langle h, h' \rangle$ resolves g
  - h’ is called the realization of g by h
  - h’ is denoted R(g,h)

Haplotyping: Definitions (contd.)
- Given h and g, there can be only one h’:
  - h’[i] = h[i] if g[i] is homozygous
  - h’[i] = 1 - h[i] if g[i] is heterozygous

  Compatible:
    g    2 1 2 1 0 0 1 2 0
    h    0 1 1 1 0 0 1 0 0
    h’   1 1 0 1 0 0 1 1 0

  Incompatible (h has 0 at the second site, where g is homozygous for 1):
    g    2 1 2 1 0 0 1 2 0
    h    0 0 1 1 0 0 1 0 0

Haplotype inference problem
Input: a set $G = \{g_1, \dots, g_n\}$ of genotypes
Output: for each g ∈ G, a pair $\langle h, h' \rangle$ of haplotypes resolving g
Simple solution:
- Find h by copying the homozygous sites of g and randomly assigning ‘1’ or ‘0’ for each ‘2’ in g
- h’ ← R(g,h)

    g    2 1 2 1 0 0 1 2 0
    h    0 1 0 1 0 0 1 0 0
    h’   1 1 1 1 0 0 1 1 0

If there are p heterozygous sites, there are $2^{p-1}$ different solutions (unordered haplotype pairs)
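
The definitions above (genotype encoding, compatibility, the realization R(g,h), and the random "simple solution") can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the slides: the function names `genotype_of`, `is_compatible`, `realization`, `random_resolution` and `count_resolutions` are hypothetical, haplotypes and genotypes are assumed to be plain lists of integers, and the count for p = 0 is assumed to be 1 (the single pair <h,h>), a case the slide formula does not cover.

```python
import random


def genotype_of(h, k):
    """Combine two haplotypes over {0,1} into a genotype over {0,1,2}."""
    if len(h) != len(k):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles keep their value (0 or 1); differing alleles become 2.
    return [a if a == b else 2 for a, b in zip(h, k)]


def is_compatible(g, h):
    """h is compatible with g if it agrees with g at every homozygous site."""
    return len(g) == len(h) and all(gi == 2 or gi == hi for gi, hi in zip(g, h))


def realization(g, h):
    """Return R(g, h): the unique h' such that <h, h'> resolves g."""
    if not is_compatible(g, h):
        raise ValueError("h is not compatible with g")
    # Copy h at homozygous sites, flip it at heterozygous sites.
    return [1 - hi if gi == 2 else hi for gi, hi in zip(g, h)]


def random_resolution(g, rng=random):
    """The 'simple solution': pick h at random on the '2' sites, then h' = R(g, h)."""
    h = [rng.randint(0, 1) if gi == 2 else gi for gi in g]
    return h, realization(g, h)


def count_resolutions(g):
    """Number of unordered resolutions: 2^(p-1) for p >= 1 heterozygous sites."""
    p = sum(1 for gi in g if gi == 2)
    return 1 if p == 0 else 2 ** (p - 1)  # p == 0 handled by assumption


# Example taken from the slides.
g = [2, 1, 2, 1, 0, 0, 1, 2, 0]
h = [0, 1, 1, 1, 0, 0, 1, 0, 0]
print(realization(g, h))        # [1, 1, 0, 1, 0, 0, 1, 1, 0]
print(count_resolutions(g))     # 4
```

A random resolution is valid by construction but is not guided by any biological model; it only shows why the problem is under-determined when several sites are heterozygous.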