Slide titles:
Gene Recognition
Gene structure
HMM-based Gene Finders
Better way to do it: negative binomial
GENSCAN's hidden weapon
Evaluation of Accuracy
Results of GENSCAN
Comparison-based Methods
Cross-species gene finding
Not always: HoxA human-mouse
Patterns of Conservation
Twinscan
Twinscan Algorithm
Example
HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs
The SLAM hidden Markov model
Exon GPHMM
Measuring Performance
Example: HoxA2 and HoxA3
Gene Regulation and Microarrays
Overview
Cells respond to environment
Genome is fixed – Cells are dynamic
Where gene regulation takes place
Transcriptional Regulation
Transcription Factors Binding to DNA
Promoter and Enhancers
Regulation of Genes
Example: A Human heat shock protein
The Cell as a Regulatory Network
The Cell as a Regulatory Network (2)
DNA Microarrays
What is a microarray
Goal of Microarray Experiments
Clustering vs. Classification
Clustering Algorithms
Hierarchical clustering
Distance between clusters
Results of Clustering Gene Expression
K-Means Clustering Algorithm
K-Means Algorithm
Mixture of Gaussians – Probabilistic K-means
Analysis of Clustering Data
Evaluating clusters – Hypergeometric Distribution

CS262 Lecture 16, Win07, Batzoglou

Gene Recognition
Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov

Gene structure
[Figure: a gene with exon1, intron1, exon2, intron2, exon3, and the steps transcription, splicing, translation]
exon = protein-coding
intron = non-coding
Codon: A triplet of nucleotides that is converted to one amino acid

Hidden Markov Models for Gene Finding
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
[Figure: the sequence above annotated with exon (x3), intron (x2), and intergene (x2) regions; HMM states shown: Intergene State, First Exon State, Intron State]

Duration HMM for Gene Finding
[Figure: a nucleotide sequence segmented into Exon1, Exon2, Exon3, each with a duration d; positions i and j+2 are marked]
Terms shown on the slide:
$P_{\text{INTRON}}(x_i \mid x_{i-1} \ldots x_{i-w})$
$P_{\text{EXON\_DUR}}(d)$
$P_{\text{EXON}((i - j + 2)\,\%\,3)}(x_i \mid x_{i-1} \ldots x_{i-w})$
$P_{5'\text{SS}}(x_{i-3} \ldots x_{i+4})$
$P_{\text{STOP}}(x_{i-4} \ldots x_{i+3})$

HMM-based Gene Finders
• GENMARK (Borodovsky & McIninch 1993)
• GENIE (Kulp 1996)
• GENSCAN (Burge 1997)
  – Big jump in accuracy of de novo gene finding
  – Currently, one of the best
  – HMM with duration modeling for Exon states
• FGENESH (Solovyev 1997)
  – Currently one of the best
• HMMgene (Krogh 1997)
• VEIL (Henderson, Salzberg, & Fasman 1997)

Better way to do it: negative binomial
• EasyGene: Prokaryotic gene-finder (Larsen TS, Krogh A)
• Negative binomial with n = 3

GENSCAN's hidden weapon
• C+G content is correlated with:
  – Gene content (+)
  – Mean exon length (+)
  – Mean intron length (–)
• These quantities affect parameters of model
• Solution: Train parameters of model in four different C+G content ranges!

Evaluation of Accuracy (Slide by NF Samatova)
• Sensitivity (Sn): Fraction of actual exons (coding nucleotides) whose boundaries are predicted exactly (that are predicted as coding)
• Specificity (Sp): Fraction of the predicted exons (coding nucleotides) that are exactly correct (that are coding)
• Correlation Coefficient (CC): Combined measure of Sensitivity & Specificity. Range: -1 (always wrong) to +1 (always right)
[Figure: actual vs. predicted coding regions along a sequence, with segments marked TP, FP, TN, FN]

                      Actual Coding   Actual Non-coding
Predicted Coding      TP              FP
Predicted Non-coding  FN              TN

Results of GENSCAN
• On the initial test dataset (Burset & Guigo)
  – 80% exact exon detection
  – 10% partial exons
  – 10% wrong exons
• In general
  – HMMs have been best in de novo prediction
  – In practice they overpredict human genes by ~2x

Comparison-based Methods

Cross-species gene finding
[Figure: a gene drawn 5' to 3' with Exon1, Intron1, Exon2, Intron2, Exon3]
[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
        | ||||| ||||| ||| ||||| ||||||||||||| | |
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-

Comparison of 1196 orthologous genes (Makalowski et al., 1996)
• Sequence identity between genes in human/mouse
  – exons: 84.6%
  – protein: 85.4%
  – introns: 35%
  – 5' UTRs: 67%
  – 3' UTRs: 69%
• 27 proteins were 100% identical

Not always: HoxA human-mouse

Patterns of Conservation
             Mutations   Gaps      Frameshifts
Genes        30%         1.3%      0.14%
Intergenic   58%         14%       10.2%
Separation   2-fold      10-fold   75-fold

Twinscan
• Twinscan is an augmented version of the Genscan HMM.
[Figure: HMM with states E and I, annotated with transitions, duration, and emissions over the sequence ACUAUACAGACAUAUAUCAU]

Twinscan Algorithm
1. Align the two sequences (e.g., from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), match ( | )
   New "alphabet": 4 x 3 = 12 letters
   $\Sigma$ = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
3. Run Viterbi using emissions $e_k(b)$ where $b \in$ { A-, A:, A|, …, U| }
   Emission distributions $e_k(b)$ estimated from real genes from human/mouse
   $e_I(x|) < e_E(x|)$: matches favored in exons
   $e_I(x-) > e_E(x-)$: gaps (and mismatches) favored in introns

Example
Human:     ACGGCGACGUGCACGU
Mouse:     ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
Input to Twinscan HMM: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
Recall, $e_E(A|) > e_I(A|)$ and $e_E(A-) < e_I(A-)$
Likely exon

HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs

The SLAM hidden Markov model

Exon GPHMM
[Figure: an exon pair with lengths d and e]
1. Choose exon lengths (d, e).
2. Generate alignment of length d+e.

Approximate alignment

Measuring Performance

Example: HoxA2 and HoxA3
[Figure: annotation tracks for SLAM, SGP-2, Twinscan, Genscan, TBLASTX, SLAM CNS, VISTA, RefSeq]

Gene Regulation and Microarrays

Overview
• A. Gene Expression and Regulation
• B. Measuring Gene Expression:
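
Returning to step 2 of the Twinscan Algorithm slide above, the encoding of an aligned human/mouse pair into the 12-letter alphabet can be sketched in a few lines of Python. This is only an illustrative sketch, not Twinscan's actual code: the function name `conservation_symbols` is hypothetical, and skipping columns where the human sequence has a gap is an assumption made here (the slide only says that each human base is marked).

```python
def conservation_symbols(target, informant):
    """Label each target base with '|' (match), ':' (mismatch), or '-' (gap).

    Both arguments are rows of one pairwise alignment, so they must have
    the same length. Hypothetical helper for illustration only.
    """
    if len(target) != len(informant):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for t, q in zip(target, informant):
        if t == "-":
            # Assumption: a column with no target base produces no symbol.
            continue
        if q == "-":
            mark = "-"   # target base aligned to a gap in the informant
        elif t == q:
            mark = "|"   # match
        else:
            mark = ":"   # mismatch
        symbols.append(t + mark)
    return symbols


# Sequences from the Example slide.
human = "ACGGCGACGUGCACGU"
mouse = "ACUGUGACGUGCACUU"
print(" ".join(conservation_symbols(human, mouse)))
# A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
```

The printed string is the "Input to Twinscan HMM" line from the Example slide; Viterbi would then be run over these symbols with state-specific emission tables.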
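
The nucleotide-level measures on the Evaluation of Accuracy slide can be computed from the TP/FP/TN/FN counts. The sketch below assumes the conventions commonly used in gene-finding evaluation: Sn = TP/(TP+FN), Sp = TP/(TP+FP), and CC as the correlation between actual and predicted coding labels. The slide itself gives only the verbal definitions, so these formulas, the function name `nucleotide_accuracy`, and the 'C'/'N' label encoding are assumptions for illustration.

```python
import math


def nucleotide_accuracy(actual, predicted):
    """Return (Sn, Sp, CC) for two equal-length label strings.

    'C' marks a coding nucleotide and 'N' a non-coding one
    (hypothetical encoding chosen for this sketch).
    """
    if len(actual) != len(predicted):
        raise ValueError("label strings must have equal length")
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == "C" and p == "C":
            tp += 1
        elif a == "N" and p == "C":
            fp += 1
        elif a == "C" and p == "N":
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined ratios are reported as NaN rather than raising.
        return num / den if den else float("nan")

    sn = ratio(tp, tp + fn)   # fraction of actual coding bases found
    sp = ratio(tp, tp + fp)   # fraction of predicted coding bases correct
    cc = ratio(tp * tn - fp * fn,
               math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)))
    return sn, sp, cc


actual    = "NNCCCCNNNN"
predicted = "NCCCNNNNNN"
sn, sp, cc = nucleotide_accuracy(actual, predicted)
print(f"Sn={sn:.3f} Sp={sp:.3f} CC={cc:.3f}")
```

In this toy example TP = 2, FP = 1, FN = 2, and TN = 5, so Sn = 2/4 and Sp = 2/3. Exon-level accuracy would instead compare predicted and actual exon boundaries exactly, which this sketch does not do.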