MIT 6.837 - A Visualization for the Dictionary Approach to Gene Annotation

School: Massachusetts Institute of Technology

Course: 6.837 - Computer Graphics


6.837: Computer Graphics (Omprakash D. Gnawali & Valentin I. Spitkovsky)

A Visualization for the Dictionary Approach to Gene Annotation

Report

Abstract | Introduction | Goals | Achievements | Individual Contributions | Lessons Learned | Acknowledgments | Conclusion | Appendix

Abstract

Five intensive weeks of brainstorming and coding culminated in a stand-alone Java applet: the Genomic Dictionary Visualizer (GDV). The GDV tool should prove useful to researchers in computational biology. It facilitates quick and simple visualization of interesting regions of arbitrary genetic material. While many features are missing, GDV’s fundamental contribution is its completeness: all of the basic functionality is there, and the framework for adding features exists. To the authors’ knowledge, GDV is the first and only tool for visualizing genetic sequences.

Introduction

Begun in 1990, the U.S. Human Genome Project is coordinated by the U.S. Department of Energy and the National Institutes of Health. Conceived as a 15-year endeavor, the project has been accelerated by rapid technological advances to an expected completion date of 2003. Project goals are to 1) identify all the estimated 80,000-100,000 genes in human DNA; 2) determine the sequences of the 3 billion chemical bases that make up human DNA; 3) store this information in databases; 4) develop tools for data analysis; and 5) address the ethical, legal, and social issues (ELSI) that may arise from the project. Once the genes are identified and their nucleic sequences are determined, researchers will have to tackle new and more difficult problems to understand what these genes do. To get a feel for just how difficult this problem is, consider the modern central biological dogma:

1. genes (DNA) are transcribed into RNA by the enzyme RNA polymerase
2. RNA transcripts are subjected to post-transcriptional modification
3. mRNA molecules are translated by ribosomes
4. newly synthesized proteins are often modified after translation
5. the protein carries out its function

Not all DNA is expressed as protein. Sections of non-expression are termed introns. Introns are excised by spliceosomes during the pre-mRNA phase and are not found in the final mRNA product. Exons are all that remain of the original DNA message once splicing is complete. Clearly, even knowing all the nucleotides that make up a gene is not enough to predict the protein that it codes for.

Gene Recognition

In light of these complications, gene recognition emerged as a subfield of computational biology. Researchers in this area attempt to design algorithms and heuristics that can be implemented as efficient computer programs for variations of the following problem: given a large sequence of genetic material (read: a string of hundreds of thousands of nucleotides)

1. Locate the promoter regions (these occur just before the genes).
2. Identify the genes.
3. Identify the exons and the introns within each gene.
4. Predict the initial amino acid sequences.
5. Predict the proteins made by each gene.

Gene Annotation

Of particular interest in the gene recognition problem is the subproblem of gene annotation. Suppose that we had access to a reliable promoter detector. Recall that promoters are short regions of DNA that occur close to the beginning of the gene. They are the chemical signals which alert the biological machinery that there is a gene to be expressed and specify how much of the protein should be made. Having identified the promoters in our large chunk of DNA, we could then split that DNA into shorter segments, each beginning with a promoter region and ending just before the next promoter region. We would then be confident that each resulting segment of DNA contains exactly one gene.
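The splitting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the promoter start positions are taken as given (the text supposes a reliable promoter detector but does not describe one), and the function name `split_at_promoters` and the toy sequence are hypothetical, not part of GDV or the dictionary code.

```python
def split_at_promoters(sequence, promoter_starts):
    """Split a DNA string into segments, each starting at a promoter.

    promoter_starts holds 0-based indices where promoter regions begin.
    They are assumed to come from some external promoter detector.
    Material before the first promoter is dropped, because it does not
    begin with a promoter.
    """
    starts = sorted(set(promoter_starts))
    for s in starts:
        if not 0 <= s < len(sequence):
            raise ValueError(f"promoter start {s} is outside the sequence")
    # Each segment ends just before the next promoter start;
    # the last segment runs to the end of the sequence.
    ends = starts[1:] + [len(sequence)]
    return [sequence[s:e] for s, e in zip(starts, ends)]


if __name__ == "__main__":
    # Toy input: the sequence and the positions 2 and 21 are made up.
    toy = "CCTATAAAGGATGCCCTAAGGTATAAATTATGAAATGA"
    for segment in split_at_promoters(toy, [2, 21]):
        print(segment)
```

Each returned segment would then be handed to the annotation step on its own, on the assumption that it contains exactly one gene.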
The next task would be to annotate that gene: to specify which segments are introns (to be cut out) and at which point the cell begins and ends its translation of the concatenation of the remaining segments (exons). Of tremendous help in this task are several biological facts:

1. Translation begins after an ATG.
2. Translation ends at the first stop codon (one of {TAA, TAG, TGA}).
3. Introns start with a GT.
4. Introns end with an AG.

This limits the number of possible annotations or “parses” of a gene substantially, but does not reduce the solution space to a single nice crisp answer. As a result, there is a need for good tools for evaluating or scoring a parse. Armed with an arsenal of such tools, a gene annotation algorithm could intelligently decide between parses and predict the “best” one. Even if the level of gene annotation were not good enough for the gene recognition problem, it would still be extremely useful. One of the main challenges facing biologists today is the accurate annotation of newly sequenced genomic data. If the annotating algorithm could latch on to more than just a handful of the most basic signals, it would save a lot of hours of human work. Available for this task are extremely large databases of proteins, expressed sequence tags (roughly, snippets of coding regions of DNA), and much smaller databases of annotated genes.

Dictionary

One tool already developed is a fast and fully automated dictionary for looking up genomic information. Qualitatively, given a brand new sequence, the dictionary answers the question “Have we ever seen anything like it before?” Quantitatively, the answer to that question is a list of all matches (longer than a cutoff value) between the substrings of the input sequence and the substrings of all the sequences on which the dictionary was based.

Implementation

All of the dictionary code is in the C++ programming language, written and tested under the Linux operating system.
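To make the lookup described under Dictionary concrete, here is a minimal Python sketch of one way to list exact matches above a cutoff. It is not the authors' C++ implementation, whose data structures are not described in this text; the seed-and-extend approach, the names `build_index` and `find_matches`, and the reading of the cutoff as a minimum match length `min_length` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Map every length-k substring to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for sid, seq in enumerate(sequences):
        for j in range(len(seq) - k + 1):
            index[seq[j:j + k]].append((sid, j))
    return index


def find_matches(query, sequences, min_length):
    """Return maximal exact matches of at least min_length characters.

    Each match is a tuple (query_offset, sequence_id, sequence_offset, length).
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    index = build_index(sequences, min_length)
    matches = []
    for i in range(len(query) - min_length + 1):
        for sid, j in index.get(query[i:i + min_length], ()):
            seq = sequences[sid]
            # Skip seeds that merely continue a match starting further left.
            if i > 0 and j > 0 and query[i - 1] == seq[j - 1]:
                continue
            # Extend the seed to the right for as long as the strings agree.
            length = min_length
            while (i + length < len(query) and j + length < len(seq)
                   and query[i + length] == seq[j + length]):
                length += 1
            matches.append((i, sid, j, length))
    return matches


if __name__ == "__main__":
    # Toy dictionary and query; both are made up for this example.
    known = ["GATTACA", "TTACAGG"]
    for match in find_matches("CATTACAG", known, 4):
        print(match)
```

A hash index over fixed-length seeds keeps the sketch short, but its memory use grows with the total size of the dictionary, so a production tool would likely need a more compact structure.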
The dictionary was designed and optimized to utilize the resources of the host computer as efficiently as possible. The alphabet used by the dictionary is entirely user-defined. Because there are also no restrictions on the lengths, numbers, and types of sequences in the dictionary, its uses are virtually endless.

Exon Prediction & Gene Recognition

Exons, the coding regions of DNA, are under high selective pressure to not mutate very much. As a result, large pieces of coding DNA are conserved between various proteins and even species. Therefore, one could build a dictionary of the known exons or proteins. Faced with a new sequence, using the dictionary, one could quickly find all the segments in it that match reasonably long pieces of known exons
