CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 5: Biological databases
CMSC423, Fall 2008

Biological databases
  General
    - GenBank (US)
    - EMBL (Europe)
  Specialized by data type
    - NCBI Trace Archive: raw sequencing data
    - SwissProt: curated protein information
    - KEGG: biological pathways
    - Gene Expression Omnibus: microarray data
  Specialized by organism
    - ZFIN: zebrafish
    - SGD: yeast
    - WormBase: worms

What data gets stored
  DNA
    - string of letters
    - quality information (maybe chromatograms)
    - location of genes (ranges along a chromosome)
  Proteins
    - string of letters
    - protein domains
    - 3D coordinates of each atom
  Pathways
    - graph of interactions between genes
  For all: often store links to scientific articles related to the data

How the data get accessed
  Gene by gene (object by object)
    - targeted at manual inspection of data
    - usually lots of clicking involved
    - simple search capability (similarity searches in addition to text queries)
  Bulk
    - targeted at computational analyses
    - often programmatic access through a web server
    - most frequently just bulk download (ftp)

NCBI: National Center for Biotechnology Information
  - Virtually all biological data generated in the US gets stored here
  - One-stop shop for biological data
  - Primarily focused on gene-by-gene analyses
  - Provides simple scripts for programmatic access
  - Provides ftp access for bulk downloads
  - http://www.ncbi.nlm.nih.gov

EMBL: European Molecular Biology Laboratory
  - European version of NCBI
  - BioMart query builder
  - http://www.ebi.ac.uk/embl

Expasy proteomics server
  - Home of SwissProt and other useful information on proteins
  - http://www.expasy.org

Kyoto Encyclopedia of Genes and Genomes (KEGG)
  - Central repository of pathway information
  - http://www.genome.jp/kegg

Genome browsers
  - UCSC Genome Browser: http://genome.ucsc.edu
  - ENSEMBL Genome Browser: http://www.ensembl.org
  - GBrowse: http://www.gmod.org

Direct database access
  - SQL
  - CHADO schema (www.gmod.org)
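To make direct SQL access a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. It builds a toy, in-memory stand-in for three tables from the CHADO phylogeny module (phylotree, phylonode, phylonode_organism) and lists the nodes of a named tree together with their organism IDs. The column subset, the sample rows, and the helper names build_toy_db and nodes_in_tree are assumptions made for illustration; a real CHADO instance typically runs on a full database server and defines many more columns.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with a made-up subset of three CHADO-style tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE phylotree (phylotree_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE phylonode (
            phylonode_id INTEGER PRIMARY KEY,
            phylotree_id INTEGER,
            parent_phylonode_id INTEGER
        );
        CREATE TABLE phylonode_organism (phylonode_id INTEGER, organism_id INTEGER);

        -- Sample rows are invented purely for this illustration.
        INSERT INTO phylotree VALUES (1, 'Archaea'), (2, 'Bacteria');
        INSERT INTO phylonode VALUES (900, 1, NULL), (1000, 1, 900), (2000, 2, NULL);
        INSERT INTO phylonode_organism VALUES (900, 42), (1000, 43), (2000, 44);
        """
    )
    return conn


def nodes_in_tree(conn: sqlite3.Connection, tree_name: str) -> list:
    """Return (phylonode_id, organism_id) pairs for every node of the named tree."""
    query = """
        SELECT pn.phylonode_id, po.organism_id
        FROM phylotree pt
        JOIN phylonode pn ON pn.phylotree_id = pt.phylotree_id
        JOIN phylonode_organism po ON po.phylonode_id = pn.phylonode_id
        WHERE pt.name = ?
        ORDER BY pn.phylonode_id
    """
    # The "?" placeholder lets the driver quote the value safely.
    return conn.execute(query, (tree_name,)).fetchall()


if __name__ == "__main__":
    conn = build_toy_db()
    print(nodes_in_tree(conn, "Archaea"))  # [(900, 42), (1000, 43)]
```

The point of the sketch is only that direct access means writing joins against the schema yourself, which is flexible but ties the code to the table layout.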

SQL

    select pt.phylotree_id, pn.parent_phylonode_id, po.organism_id
    from phylotree pt, phylonode pn, phylonode_organism po
    where pt.name = 'Archaea'
      and pt.phylotree_id = pn.phylotree_id
      and pn.phylonode_id = 1000
      and po.phylonode_id = pn.parent_phylonode_id

  Selects the parent node ID, and the ID of the organism attached to that parent node, for the node with ID 1000 in the Archaea tree.

Programmatic database access

    use strict;
    use warnings;
    use DBI;

    # Connect to the database server as an anonymous user.
    my $dbh = DBI->connect("dbi:Sybase:server=SERV;packetSize=8092",
                           "anonymous", "anonymous");
    if (! defined $dbh) {
        die("Cannot connect to server\n");
    }

    # Read one line containing the SQL query from standard input.
    my $mysqlqry = <STDIN>;

    $dbh->do("set textsize 65535");
    my $qh = $dbh->prepare($mysqlqry) || die("Cannot prepare\n");
    $qh->execute() || die("Cannot execute\n");

    # processrow() is a user-supplied routine that handles one result row.
    while (my @row = $qh->fetchrow_array()) {
        processrow(@row);
    }

NCBI programmatic access
  http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
    - must write your own HTTP client (the LWP Perl module helps)
    - queries go directly to the web server
    - data returned in XML
  http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=doc&m=obtain&s=stips
    - stub script provided (query_tracedb)
    - queries still go through the web server
    - data returned in a variety of user-selected formats
  For both, limits are set on the amount of data retrieved (e.g., fewer than 40,000 records at a time)
  Download procedure
    1. figure out the number of records to be retrieved (count query)
    2. read the data in allowable chunks
    3. combine the chunks

Biological Ontologies
  Gene Ontology: http://www.geneontology.org
    "The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism." (text from the GO homepage)
  Note: similar to the semantic web
  GO is not the only one: http://www.obofoundry.org
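The download procedure from the "NCBI programmatic access" slide (count the records, read them in allowable chunks, combine the chunks) can be sketched roughly as follows. This is a generic illustration, not NCBI's API: count_records and fetch_chunk are hypothetical callables that the caller would implement on top of whatever HTTP client and service they use, and the 40,000 limit is just the example figure from the slide.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Example per-request limit taken from the slide; real limits depend on the service.
MAX_RECORDS_PER_REQUEST = 40_000


def chunk_ranges(total: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, size) pairs that together cover `total` records."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, total, chunk_size):
        # The last chunk may be smaller than chunk_size.
        yield offset, min(chunk_size, total - offset)


def download_all(
    count_records: Callable[[], int],
    fetch_chunk: Callable[[int, int], Sequence[str]],
    chunk_size: int = MAX_RECORDS_PER_REQUEST,
) -> List[str]:
    """Count the records, fetch them chunk by chunk, and combine the chunks.

    Both callables are hypothetical placeholders supplied by the caller:
    count_records() runs the count query, and fetch_chunk(offset, size)
    retrieves up to `size` records starting at position `offset`.
    """
    total = count_records()                    # step 1: count query
    records: List[str] = []
    for offset, size in chunk_ranges(total, chunk_size):
        chunk = fetch_chunk(offset, size)      # step 2: read an allowable chunk
        records.extend(chunk)                  # step 3: combine the chunks
    return records


if __name__ == "__main__":
    # A list stands in for the remote server so the sketch runs offline.
    fake_server = [f"record-{i}" for i in range(100_000)]
    records = download_all(
        count_records=lambda: len(fake_server),
        fetch_chunk=lambda offset, size: fake_server[offset:offset + size],
    )
    print(len(records))  # 100000
```

Passing the two operations in as functions keeps the chunking logic independent of any particular server, so the same loop could in principle wrap either of the two access routes listed on the slide.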