BCH 5405: Molecular Biology and Biotechnology
Dr. Qing-Xiang Amy Sang
Mon. and Wed., March 24 and 26, 2008

A BioInformatics Survey: some taste of theory and a few practicalities

Steve Thompson
Florida State University School of Computational Science (SCS)

To begin, some terminology

What is bioinformatics, genomics, proteomics, sequence analysis, computational molecular biology? My definitions (lots of overlap):

Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.

Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

And one way to think about it: the Reverse Biochemistry Analogy

Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, scientists can now amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA, and, using sequence analysis tools, infer all sorts of functional, evolutionary, and perhaps structural insight into that stretch of DNA. The computer and molecular databases are a necessary, integral part of this entire process.

The exponential growth of molecular sequence databases and CPU power

Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883
2003    36553368485    30968418
2004    44575745176    40604319
2005    56037734462    52016762
2006    69019290705    64893747
2007    83874179730    80388382

Doubling time: about a year and a half.
Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Sequence database growth (cont.)

The International Human Genome Sequencing Consortium announced the completion of the "Working Draft" of the human genome in June 2000; independently, that same month, the private company Celera Genomics announced that it had completed the first "Assembly" of the human genome. The classic articles were published mid-February 2001 in the journals Science and Nature.

Genome projects have kept the data coming at an incredible rate. Currently around 50 Archaea, 600 Bacteria, and 20 Eukaryote complete genomes, and 200 Eukaryote assemblies, are represented, not counting the almost 3,000 virus and viroid genomes available.

Some neat stuff from the human genome papers

Homo sapiens aren't nearly as special as we once thought. Of the 3.2 billion base pairs in our DNA:

Traditional gene number estimates were often in the 100,000 range; turns out we've only got about twice as many as a fruit fly, between 25,000 and 30,000!

The protein coding region of the genome is only about 1% or so; a bunch of the remainder is "jumping," "junk," "selfish" DNA, much of which may be involved in regulation and control.

Some 100 to 200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be false by more extensive analyses, and to be due to gene loss, not transfer.)

NCBI's Entrez

Let's start with
sequence databases

Sequence databases are an organized way to store exponentially accumulating sequence data. An "alphabet soup" of three major organizations maintains them. They largely mirror one another and share accession codes, but NOT proper identifier names:

North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), maintains the GenBank (and WGS) nucleotide, GenPept amino acid, and RefSeq genome, transcriptome, and proteome databases.

Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics (SIB) all help maintain the EMBL nucleotide sequence database and the UNIPROT (SWISS-PROT + TrEMBL) amino acid sequence database, with USA PIR/NBRF support also.

Asia: the National Institute of Genetics (NIG) supports the Center for Information Biology's (CIB) DNA Data Bank of Japan (DDBJ).

A little history

The first well-recognized sequence database was Dr. Margaret Dayhoff's hardbound Atlas of Protein Sequence and Structure, begun in the mid-sixties. That became PIR. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive, and openly available library of genetic sequences. Sequence databases have long since outgrown a hardbound atlas that you can pull off a library shelf. They have become gargantuan and have evolved through many, many changes.

What are sequence databases like?

Just what are primary sequences? (Central Dogma: DNA -> RNA -> protein)

Primary refers to one dimension: all of the "symbol" information, written in sequential order, necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide. The symbols are the one-letter codes for all of the biological nitrogenous bases and amino acid residues, and their ambiguity codes. Biological carbohydrates, lipids, and structural and functional information are not sequence data. Not even DNA CDS translations in a DNA database are sequence data!

However, much of this feature and bibliographic type information is available in the reference documentation sections associated with primary sequences in the databases.

Sequence database content

Sequence database installations are commonly a complex ASCII/binary mix, and Web-based ones are often relational or object-oriented. They usually consist of several very long text files, each containing different types of related information, such as all of the