
Gene Finding

Finding Genes
• Prokaryotes
  – Genome under 10 Mb
  – >85% of sequence codes for proteins
• Eukaryotes
  – Large genomes (up to 10 Gb)
  – 1-3% coding for vertebrates

Introns
• Humans
  – 95% of genes have introns
  – 10% of genes have more than 20 introns
  – Some have more than 60
  – Largest gene (Duchenne muscular dystrophy locus) spans >2 Mb (larger than some entire prokaryotic genomes)
  – Average exon = 150 b
  – Introns can interrupt an open reading frame (ORF) at any position, even within a codon
  – ORF finding is not sufficient for eukaryotic genomes

Open Reading Frames in Bacteria
• Without introns, look for a long open reading frame (start codon ATG, … , stop codon TAA, TAG, or TGA)
• Short genes (<300 nucleotides) are missed
• Shadow genes (overlapping open reading frames on opposite DNA strands) are hard to detect
• Some genes use UUG, AUA, UUA, or CUG as the start codon
• Some genes use TGA to encode selenocysteine, so it is not a stop codon there

Eukaryotes
• Maps are used as scaffolding during sequencing
• Recombination is used to estimate the distance between genes (the further apart two loci are on the chromosome, the more likely they are to be separated by recombination during meiosis)
• Pedigree analysis

Gene Finding in Eukaryotes
• Look for strongly conserved regions
• RNA blots: map expressed RNA to DNA
• Identification of CpG islands
  – Short stretches of CG-rich DNA are associated with the promoters of vertebrate genes
• Exon trapping: put a questionable clone between two exons that are expressed. If there is a gene, it will be spliced into the mature transcript

Computational methods
• Signals: TATA box and other sequences
  – TATA box is found 30 bp upstream from about 70% of the genes
• Content: coding and non-coding DNA differ in hexamer frequency (the frequency with which specific 6-nucleotide strings are used)
  – Some organisms prefer different codons for the same amino acid
• Homology: BLAST the sequence against other organisms

Genome Browser
• http://genome.ucsc.edu/
• Tables
• Genome browser

Non-coding RNA genes
• Ribosomal RNA (rRNA) and transfer RNA (tRNA) can be recognized by stochastic context-free grammars
• Detection is still an open problem

Hidden Markov Models (HMMs)
• Provide a probabilistic view of a process that we don't fully understand
• The model can be trained with data we don't understand to learn patterns
• You get to implement one for the first lab!

State Transitions
[Figure: Markov model example]
  – x = states of the Markov model
  – a = transition probabilities
  – b = output probabilities
  – y = observable outputs
• How does this differ from a finite state machine?
• Why is it a Markov process?

Example
• A distant friend talks to you daily about his activities (walk, shop, clean)
• You believe that the weather is a discrete Markov chain (no memory) with two states (rainy, sunny), but you can't observe them directly. You know the average weather patterns

Formal Description
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}

Observations
• Given (walk, shop, clean)
  – What is the probability of this sequence of observations? (Is he really still at home, or did he skip the country?)
  – What was the most likely sequence of rainy/sunny days?

Matrix
Each entry is (probability of entering the state) * (probability of the activity in that state). In the walk row the first factor is the start probability. In the shop and clean rows the first product assumes the previous day was Rainy and the second that it was Sunny.

         Rainy             Sunny
walk     .6*.1             .4*.6
shop     .7*.4   .4*.4     .3*.3   .6*.3
clean    .7*.5   .4*.5     .3*.1   .6*.1

Sunny, Rainy, Rainy = (.4*.6)(.4*.4)(.7*.5) = 0.01344

The CpG island problem
• Methylation in the human genome
  – “CG” -> “TG” happens in most places except the “start regions” of genes and within genes
  – CpG islands = 100-1,000 bases before a gene starts
• Question
  – Given a long sequence, how would we find the CpG islands in it?

Hidden Markov Model
X = ATTGATGCAAAAGGGGGATCGGGCGATATAAAATTTG
[Figure: X divided into regions labeled Other, CpG Island, Other]
How can we identify a CpG island in a long sequence?
Idea 1: Test each window of a fixed number of nucleotides
Idea 2: Classify the whole sequence
  Class label S1: OOOO………….……O
  Class label S2: OOOO………….OCC
  …
  Class label Si: OOOO…OCC..CO…O
  …
  Class label SN: CCCC……………….CC
  S* = argmax_S P(S|X) = argmax_S P(S,X)
  S* = OOOO…OCC..CO…O (the run of C labels marks the CpG island)

HMM is just one way of modeling p(X,S)…

A simple HMM
Parameters
Initial state prob: p(B) = 0.5; p(I) = 0.5
State transition prob: p(B→B) = 0.7, p(B→I) = 0.3, p(I→B) = 0.5, p(I→I) = 0.5
Output prob:
  P(x|H_Other) = p(x|B): P(a|B) = 0.25, P(t|B) = 0.40, P(c|B) = 0.10, P(g|B) = 0.25
  P(x|H_CpG) = p(x|I): P(a|I) = 0.25, P(t|I) = 0.25, P(c|I) = 0.25, P(g|I) = 0.25
[Figure: two-state diagram with states B and I, labeled with the probabilities above]

A General Definition of HMM
HMM = $(S, V, B, A, \Pi)$
N states: $S = \{s_1, \ldots, s_N\}$
M symbols: $V = \{v_1, \ldots, v_M\}$
Initial state probability: $\Pi = \{\pi_1, \ldots, \pi_N\}$, $\sum_{i=1}^{N} \pi_i = 1$, where $\pi_i$ is the prob of starting at state $s_i$
State transition probability: $A = \{a_{ij}\}$, $1 \le i, j \le N$, $\sum_{j=1}^{N} a_{ij} = 1$, where $a_{ij}$ is the prob of going $s_i \to s_j$
Output probability: $B = \{b_i(v_k)\}$, $1 \le i \le N$, $1 \le k \le M$, $\sum_{k=1}^{M} b_i(v_k) = 1$, where $b_i(v_k)$ is the prob of "generating" $v_k$ at $s_i$

How to “Generate” a Sequence?
[Figure: the simple HMM above, a lattice with state B or I at each position, and the sequence a c g t t …]
Given a model, follow a path to generate the observations.

How to “Generate” a Sequence? (continued)
[Figure: the same model and sequence with one path, B I I I B, highlighted; the probabilities 0.5 and 0.3 are marked on it]
P(“BIIIB”,
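
The questions asked of the weather model (how probable are the observations, and what is the most likely sequence of states) and the CpG decoding rule S* = argmax_S P(S,X) can be tried out with a short script. What follows is a minimal Python sketch, not the lab's reference implementation. The function names `joint_probability`, `forward`, and `viterbi` and the `(start, transition, emission)` tuple layout are illustrative assumptions. Pairing the path BIIIB with the symbols a c g t t is also an assumption, read off the slide's example.

```python
def joint_probability(model, states, observations):
    """P(S, X): probability of following `states` while emitting `observations`."""
    start, trans, emit = model
    p = start[states[0]] * emit[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= trans[prev][cur] * emit[cur][obs]
    return p


def forward(model, observations):
    """P(X): the sum of P(S, X) over every state path, built one position at a time."""
    start, trans, emit = model
    alpha = {s: start[s] * emit[s][observations[0]] for s in start}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[r] * trans[r][s] for r in alpha) * emit[s][obs]
            for s in start
        }
    return sum(alpha.values())


def viterbi(model, observations):
    """S* = argmax_S P(S, X), returned as (path, probability)."""
    start, trans, emit = model
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in start}
    for obs in observations[1:]:
        new_best = {}
        for s in start:
            p, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in best),
                key=lambda pair: pair[0],
            )
            new_best[s] = (p * emit[s][obs], path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda pair: pair[0])
    return path, prob


# Weather model, copied from the Formal Description slide.
weather = (
    {'Rainy': 0.6, 'Sunny': 0.4},
    {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
     'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}},
    {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
     'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}},
)

# Simple CpG model: B = background ("Other"), I = island.
cpg = (
    {'B': 0.5, 'I': 0.5},
    {'B': {'B': 0.7, 'I': 0.3}, 'I': {'B': 0.5, 'I': 0.5}},
    {'B': {'a': 0.25, 't': 0.40, 'c': 0.10, 'g': 0.25},
     'I': {'a': 0.25, 't': 0.25, 'c': 0.25, 'g': 0.25}},
)

obs = ('walk', 'shop', 'clean')
weather_total = forward(weather, obs)                 # probability of the observations
weather_path, weather_best = viterbi(weather, obs)    # most likely weather sequence
# Assumed pairing: states B I I I B emit a c g t t.
cpg_joint = joint_probability(cpg, 'BIIIB', 'acgtt')

if __name__ == '__main__':
    print(weather_total, weather_path, weather_best, cpg_joint)
```

For the weather model, `viterbi` recovers Sunny, Rainy, Rainy with probability (.4*.6)(.4*.4)(.7*.5) = 0.01344, while `forward` adds up the contributions of all eight possible weather sequences.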


BYU BIO 465 - Gene Finding
