UCSD CSE 182 - Gene Finding


Slide titles:
• Slide 1
• HMM ‘fair-coin’ example
• Slide 3
• Syllabus for midterm
• Translation
• Translation
• Eukaryotic gene structure
• Gene Features
• Gene Features
• Gene identification
• Computational Gene Finding
• Gene Finding: The 1st generation
• Coding versus non-coding
• Generalizing
• A geometric approach (2 hexamers)
• Choosing between Intergenic and Exonic
• Coding versus non-coding signals
• 5th order markov chain
• Scoring for coding regions
• Coding differential for 380 genes
• Coding region can be detected
• Other Signals
• Combining Signals
• The second generation of Gene finding
• Combining signals using D.P.
• Hidden states & gene structure
• Gene finding reformulated
• Optimum labeling using D.P. (Viterbi)
• Optimum parse of the gene
• Generalizing
• Generalizing
• An HMM for Gene structure
• Gene Finding via HMMs
• Generalized HMMs, and other refinements
• Length distributions of Introns & Exons
• Generalized HMM for gene finding
• Forward algorithm for gene finding
• De novo Gene prediction: Summary
• DNA Signals
• DNA signal example:
• PWMs
• Improvements to signal detection
• MDD
• Maximal Dependence Decomposition
• MDD for Donor sites
• Gene prediction: Summary
• How many genes do we have?
• Alternative splicing
• Comparative methods
• Comparative gene finding tools
• Databases
• Course

CSE182-L10
Gene Finding
1/14/19

HMM ‘fair-coin’ example
[Figure: two-state diagram with states F and L; each state stays put with probability 0.6 and switches with probability 0.4; the chain starts in F with probability 1]
• Emission probabilities: E_F(H) = 0.5, E_L(H) = 0.1
• H H T T T is the observed sequence

Best-path probability of ending in each state after each observation:

State   start   H     H        T        T        T
F       1       0.5   1.5e-1   4.5e-2   1.3e-2   5.8e-3
L       0       0     2e-2     5.4e-2   2.9e-2   1.6e-2

Syllabus for midterm
• Sequence alignment using Blast
  – Global, local, space saving, affine gap costs
• P-value, e-value computation
• Pigeonhole principle, keyword matching
• Column specific scoring (profiles)
• Pattern matching (regular expressions)
• HMMs

Translation
• The ribosomal machinery reads mRNA.
• Each triplet is translated into a unique amino acid until the STOP codon is encountered.
• There is also a special signal where translation starts, usually at the ATG (M) codon.
• Given a DNA sequence, how many ways can you translate it?

Eukaryotic gene structure
• The coding regions of a gene are discontiguous regions (exons), separated by non-coding regions (introns).
• Transcription initially copies the entire region into RNA.
• The introns are ‘spliced out’ to form the mature mRNA (message).
• Translation starts from an initiating ATG somewhere in the message.

Gene Features
[Figure: gene structure diagram labeled with transcription start, 5’ UTR, translation start (ATG), exon, donor splice site, intron, acceptor, 3’ UTR]

Gene Features
• The gene can lie on either strand (relative to the reference genome).
• The code can be in one of 3 frames.

  AGTAGAGTATAGTGGACG
  Frame 1: S R V * W T
  Frame 2: V E Y S G
  Frame 3: * S I V D
  -ve strand: TCATCTCATATCACCTGC

Gene identification
• Eukaryotic gene definitions:
  – Location that codes for a protein
  – The transcript sequence(s) that encode the protein
  – The protein sequence(s)
• Suppose you want to know all of the genes in an organism.
• This was a major problem in the 70s. PhDs and careers were spent isolating a single gene sequence.
• All of that changed with better reagents and the development of high-throughput methods like EST sequencing.
• With genome sequencing, the initial problem became computational.

Computational Gene Finding
• Given genomic DNA, identify all the coordinates of the gene.
• TRIVIA QUIZ! What is the name of the FIRST gene finding program? (google testcode)
[Figure: the gene structure diagram from the Gene Features slide, repeated]

Gene Finding: The 1st generation
• Given genomic DNA, does it contain a gene (or not)?
• Key idea: The distribution of nucleotides is different in coding (translated exons) and non-coding regions.
• Therefore, a statistical test can be used to discriminate between coding and non-coding regions.
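The question on the Translation slide, how many ways a DNA sequence can be translated, comes down to the reading frames described under Gene Features: three offsets on each of the two strands, six in total. The following is a minimal Python sketch of that enumeration. It assumes the standard genetic code and an input made only of A, C, G, and T (any other codon is rendered as X). The function names `reverse_complement`, `translate`, and `six_frame_translations`, and the frame labels `+1` to `-3`, are illustrative choices and do not come from the slides.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at `offset`; '*' marks a stop."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    # Codons containing anything other than A, C, G, T become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
    opposite = reverse_complement(dna)
    for offset in range(3):
        frames[f"-{offset + 1}"] = translate(opposite, offset)
    return frames


if __name__ == "__main__":
    # The example sequence from the Gene Features slide.
    for label, protein in six_frame_translations("AGTAGAGTATAGTGGACG").items():
        print(label, protein)
```

The sketch translates straight through stop codons rather than cutting at them, since the slide's frames also show `*` in the middle of a frame; a gene finder would typically split each frame into open reading frames at that point.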
Coding versus non-coding
• You are given a collection of exons, and a collection of intergenic sequence.
• Count the number of occurrences of ATGATG in the exons and in the intergenic sequence.
  – Suppose 1% of the hexamers in Exons are ATGATG
  – Only 0.01% of the hexamers in Intergenic are ATGATG
• How can you use this idea to find genes?

Generalizing
[Table: frequencies (×10^-5) of the hexamers AAAAAA, AAAAAC, AAAAAG, AAAAAT in I (intergenic), E (exonic), and the sequence X]
• Compute a frequency count for all hexamers.
• Exons, Intergenic and the sequence X are all vectors in a multi-dimensional space.
• Use this to decide whether a sequence X is exonic/intergenic.

A geometric approach (2 hexamers)
• Plot the following vectors
  – E = [10, 20]
  – I = [10, 5]
  – V3 = [6, 10]
  – V4 = [9, 15]
• Is V3 more like E or more like I?
[Figure: 2-D plot of the vectors E, I, and V3]

Choosing between Intergenic and Exonic
• Normalize V’ = V/||V||
• All vectors have the same length (lie on the unit circle).
• Next, compute the angle to E, and I.
• Choose the feature that is ‘closer’ (smaller angle).
[Figure: unit vectors E, I, and V3 with the angles α and β marked]

$$\text{E-score}(V_3) = \frac{\alpha}{\alpha + \beta}$$

Coding versus non-coding signals
• Fickett and Tung (1992) compared various measures.
• Measures that preserve the triplet frame are the most successful.
• Genscan uses a 5th order Markov Model.

5th order markov chain

$$\Pr_{\mathrm{Exon}}[\mathrm{AAAAAACGAGAC}\ldots] = T[\mathrm{AAAAA},\mathrm{A}]\; T[\mathrm{AAAAA},\mathrm{C}]\; T[\mathrm{AAAAC},\mathrm{G}]\; T[\mathrm{AAACG},\mathrm{A}] \cdots = (20/78)\,(50/78) \cdots$$

Hexamer   Exon   Intron
AAAAAA    20     1
AAAAAC    50     10
AAAAAG    5      30
AAAAAT    3      ..
Tot       78     ..

Scoring for coding regions

$$\mathrm{CodingDifferential}[x] = \log\left(\frac{\Pr_{\mathrm{Exon}}[x]}{\Pr_{\mathrm{Intron}}[x]}\right)$$

• The coding differential can be computed as the log odds of the probability that a sequence is an exon vs. an intron.
• In Genscan, separate transition matrices are trained for each frame, as different frames have different hexamer distributions.

Coding differential for 380 genes
[Figure: coding differential for 380 genes]

Coding region can be detected
• Plot the coding score using a sliding window of fixed length.
• The (large) exons will show up reliably.
• Not enough to predict gene boundaries reliably.

Other Signals
[Figure: the signals GT, ATG, and AG marked around a coding region]
• Signals at exon boundaries are precise but not specific. Coding signals are specific but not precise.
• When combined they can be effective.

Combining Signals
• We can compute the following:
  – E-score[i,j]
  – I-score[i,j]
  – D-score[i]
  – A-score[i]
  – Goal is to find coordinates that maximize the total score
[Figure: sequence with positions i and j marked]

The second generation of Gene finding
• Ex: Grail II. Used statistical
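The "5th order markov chain", "Scoring for coding regions", and "Coding region can be detected" slides fit together as one small pipeline: count hexamers in known exons and introns, turn the counts into transition probabilities T[context, base], and score a window by the log odds of the two models. The Python sketch below illustrates that idea under stated assumptions. The function names, the add-one pseudocount, and the window and step sizes are illustrative choices that do not come from the slides, and it uses a single transition matrix per class rather than the frame-specific matrices the slides attribute to Genscan.

```python
import math
from collections import Counter

ORDER = 5          # a 5th-order chain conditions on the previous 5 bases
ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def count_hexamers(sequences, order=ORDER):
    """Count every (order + 1)-mer, i.e. every hexamer when order is 5."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            counts[seq[i:i + order + 1]] += 1
    return counts


def transition(counts, context, base, pseudocount=1.0):
    """T[context, base]: estimated probability that `base` follows `context`.

    The pseudocount (an assumption, not from the slides) keeps unseen
    hexamers from getting probability zero; with pseudocount=0 this is the
    plain ratio count / total, and unseen contexts would divide by zero.
    """
    total = sum(counts[context + b] for b in ALPHABET)
    return (counts[context + base] + pseudocount) / (
        total + pseudocount * len(ALPHABET)
    )


def log_prob(seq, counts, order=ORDER, pseudocount=1.0):
    """Log of the product of transitions along seq, given its first 5 bases."""
    seq = seq.upper()
    return sum(
        math.log(transition(counts, seq[i - order:i], seq[i], pseudocount))
        for i in range(order, len(seq))
    )


def coding_differential(seq, exon_counts, intron_counts):
    """log(Pr_Exon[seq] / Pr_Intron[seq]); positive means more exon-like."""
    return log_prob(seq, exon_counts) - log_prob(seq, intron_counts)


def sliding_scores(seq, exon_counts, intron_counts, window=30, step=3):
    """Return (start, coding differential) for each fixed-length window."""
    return [
        (start, coding_differential(seq[start:start + window],
                                    exon_counts, intron_counts))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy training data, invented purely for illustration.
    exon_counts = count_hexamers(["ATGGCC" * 8])
    intron_counts = count_hexamers(["TTTTA" * 8])
    query = "TTTTA" * 6 + "ATGGCC" * 6 + "TTTTA" * 6
    for start, score in sliding_scores(query, exon_counts, intron_counts):
        print(start, round(score, 2))
```

With the exon column of the slide's table as counts and `pseudocount=0`, `transition(counts, "AAAAA", "A")` reduces to 20/78, matching the first factor in the slide's product.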

