BYU BIO 465 - Genome Annotation


Slide titles:
• Slide 1
• Slide 2
• Challenges to Genome Annotation?
• Slide 4
• What kinds of things are we looking to annotate?
• Pseudogenes
• Noncoding RNA (ncRNA)
• Slide 8
• BLAST
• Which Matrix?
• Protein coding genes in prokaryotes, and simple eukaryotes
• Slide 12
• Ab initio gene identification
• Gene-Finding Strategies
• Gene-Finding Methods
• Evaluation Statistics
• The Process
• Lots of Software:
• GenBank Features
• GenBank Features: the important ones
• Gene Prediction Caveats
• Genome annotation problems:
• The ideal annotation of “MyGene”
• Some Concluding remarks

Genome Annotation
• Now that you’ve assembled your genome, what is next?
• GENOME ANNOTATION
• What is that?
• Why is it important?
• How do you do it?

Challenges to Genome Annotation?
• Finding genes involves computational methods as well as experimental validation
• Computational methods are often inadequate, and often generate erroneous ‘gene’ (false positive) sequences which:
  – Are missing exons
  – Have incorrect exons
  – Over-predict genes
  – Are missing the 5’ and 3’ UTRs

What kinds of things are we looking to annotate?
• CDS - coding sequences
• mRNA
• Alternative RNA
• Promoter and Poly-A Signal
• Pseudogenes
• ncRNA

Pseudogenes
• As many as 20-30% of all genomic sequence predictions could be pseudogenes
• Non-functional copy of a gene
  – Processed pseudogene
    • Retro-transposon derived
    • No 5’ promoters
    • No introns
    • Often includes polyA tail
  – Non-processed pseudogene
    • Gene duplication derived
  – Both include events that make the gene non-functional
    • Frameshift
    • Stop codons
• We assume pseudogenes have no function, but we really don’t know!

Noncoding RNA (ncRNA)
• ncRNAs represent 98% of all transcripts in a mammalian cell
• ncRNAs have not been taken into account in gene counts
• cDNA
• ORF computational prediction
• Comparative genomics looking at ORF
• ncRNA can be:
  – Structural
  – Catalytic
  – Regulatory
• tRNA – transfer RNA: involved in translation
• rRNA – ribosomal RNA: structural component of ribosome, where translation takes place
• snoRNA – small nucleolar RNA: functional/catalytic in RNA maturation
• Antisense RNA: gene regulation/silencing?

Covariance model searches are extremely compute intensive. A small model (like tRNA) can search a sequence database at a rate of around 300 bases/sec. The compute time scales roughly with the 4th power of the length of the RNA, so larger models quickly become infeasible without significant compute resources.

BLAST
• Seeks high-scoring segment pairs (HSP)
  – pair of sequences that can be aligned without gaps
  – when aligned, have maximal aggregate score (score cannot be improved by extension or trimming)
  – score must be above score threshold S
• Public search engines
  – WWW search form
    http://www.ncbi.nlm.nih.gov/BLAST
  – Unix command line
    blastall -p progname -d db -i query > outfile

Which Matrix?
• Triple-PAM strategy (Altschul, 1991)
  – PAM 40: short alignments, highly similar
    • tblastn against ESTs
  – PAM 120
  – PAM 250: longer, weaker local alignments
    • Looking in the twilight zone
• BLOSUM (Henikoff, 1993)
  – BLOSUM 90: short alignments, highly similar
  – BLOSUM 62: most effective in detecting known members of a protein family
    • Standard on NCBI server – works in most cases
  – BLOSUM 30: longer, weaker local alignments

Protein coding genes in prokaryotes, and simple eukaryotes
• Use ORF finder
  http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi
• Simple ATG/Stop
• Simple link to FASTA formatted files and BLAST.
• Problems:
  – In-frame methionine
  – Small protein
• Solution: comparative genomics

Figure 11 from: Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery. Kellis M, Patterson N, Birren B, Berger B, Lander ES. J Comput Biol. 2004;11(2-3):319-55.
Saccharomyces cerevisiae, Saccharomyces paradoxus, Saccharomyces mikatae, Saccharomyces bayanus

Ab initio gene identification
• Goals
  – Identify coding exons
  – Seek gene structure information
  – Get a protein sequence for further analysis
• Relevance
  – Characterization of anonymous DNA genomic sequences
  – Works on all DNA sequences

Gene-Finding Strategies
Genomic Sequence
• Content-Based: bulk properties of sequence
  – Open reading frames
  – Codon usage
  – Repeat periodicity
  – Compositional complexity
• Site-Based: absolute properties of sequence
  – Consensus sequences
  – Donor and acceptor splice sites
  – Transcription factor binding sites
  – Polyadenylation signals
  – “Right” ATG start
  – Stop codons out-of-context
• Comparative: inferences based on sequence homology
  – Protein sequence with similarity to translated product of query
  – Modular structure of proteins usually precludes finding complete gene

Gene-Finding Methods
Genomic Sequence
• Rule-Based: cutoff method
  – Criteria applied sequentially to identify possible exons
  – Rank or eliminate candidates from consideration based on pre-determined cutoff at each step
• Neural Network: composite method
  – Criteria applied in parallel
  – Training sets used to optimize performance
  – Weight scores in order of importance

Evaluation Statistics
[Figure: actual vs. predicted coding regions, with segments labeled TP, FP, TN and FN]
• Sensitivity: fraction of actual coding regions that are correctly predicted as coding
• Specificity: fraction of the prediction that is actually correct
• Correlation Coefficient: combined measure of sensitivity and specificity, ranging from -1 (always wrong) to +1 (always right)

The Process
• Compute the prediction
• Confirm with biological sequences (also with computational tools)
• Integrate all of this
• Annotate genome (often via a GUI: Graphical User Interface)
• Validate
• Re-annotate/Update
• Check it twice
• Submit to GenBank

Lots of Software:
• EnsEMBL (EBI)
• Sequin (NCBI)
• PseudoCAP (SFU)
• GMOD (CSHL)
• Pegasys (UBiC)
• Apollo (EBI/Berkeley)
• GeneMark (Georgia Institute of Tech)
• GENSCAN (MIT)
• GenomeThreader (University of Hamburg)
• HMMgene (Technical University of Denmark)

GenBank Features
-10_signal, -35_signal, 3'clip, 3'UTR, 5'clip, 5'UTR, attenuator, CAAT_signal, CDS, conflict, C_region, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding, misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, primer_bind, prim_transcript, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, satellite, scRNA, sig_peptide, snoRNA, snRNA, S_region, stem_loop, STS, TATA_signal, terminator, transit_peptide, tRNA, unsure, variation, V_region, V_segment

GenBank Features: the important ones
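The "Simple ATG/Stop" approach from the ORF finder slide can be illustrated with a short Python sketch. This is not the NCBI ORF finder's implementation. The function name `find_orfs`, the `min_codons` parameter, and the choice to scan only the three forward-strand frames are assumptions made here to keep the example small; a practical tool would also scan the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs.

    Hypothetical helper: scans the three forward frames only.
    start is the 0-based index of the A in ATG; end is the index
    just past the stop codon. In each frame, only the first ATG
    after the previous stop opens an ORF, so later in-frame ATGs
    are treated as internal methionines.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Codons before the stop, counting the ATG itself.
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAAATGTTTTAACC"
    for start, end, frame in find_orfs(dna):
        print(frame, start, end, dna[start:end])
```

The example sequence contains a second in-frame ATG inside the reported ORF. A plain scan cannot tell which methionine is the true start, which is the "in-frame methionine" problem listed on the slide. Raising `min_codons` reduces spurious hits but also discards genuine small proteins.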
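The measures on the Evaluation Statistics slide can be computed from per-nucleotide counts of TP, FP, TN and FN. The Python sketch below follows the slide's definitions, so specificity here is TP / (TP + FP), the fraction of the prediction that is correct, rather than the TN-based definition used in some other fields. The slide gives no formula for the correlation coefficient; the Matthews correlation coefficient is used as an assumption because it matches the stated range of -1 to +1. The function names are hypothetical.

```python
import math


def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN over paired per-position labels.

    actual and predicted are equal-length sequences of truthy
    (coding) or falsy (non-coding) values, one per nucleotide.
    """
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif p:
            fp += 1
        elif a:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluation_statistics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Undefined ratios (zero denominator) are returned as NaN.
    """
    nan = float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else nan
    # Slide definition: fraction of the prediction that is correct.
    specificity = tp / (tp + fp) if tp + fp else nan
    # Assumption: Matthews correlation coefficient.
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else nan
    return sensitivity, specificity, cc


if __name__ == "__main__":
    actual = [1, 1, 1, 1, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 1, 0, 0, 0]
    counts = confusion_counts(actual, predicted)
    print(counts, evaluation_statistics(*counts))
```

Reporting the correlation coefficient alongside the other two guards against a predictor that scores well on one measure simply by calling almost everything, or almost nothing, coding.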

