UCSD CSE 182 - An EST database of H. medicinalis - D760054


School name University of California, San Diego

Course Cse 182- Biological Databases


CSE182 Class project: An EST database of H. medicinalis
October 15, 2006

1 Introduction to Hirudo

Hirudo medicinalis (the medicinal leech) is an organism with historical medical importance as well as contemporary relevance as a model organism for medicine. Our project will focus on extracting biological information from a database of transcribed genomic sequences (ESTs), and on the applicability of that information to Professor Eduardo Macagno's neuroscience lab at UCSD.

See http://animaldiversity.ummz.umich.edu/site/accounts/information/Hirudo_medicinalis.html for basic physiological information. It is a segmented annelid with 21 body segments (the basic plan is a theme with variations), 32 metameres (and neuromeres), and segmental ganglia with ~400 neurons. Its neuronal function, morphology, and connectivity are characterized; it shows good regeneration and an innate immune response; and it is a medically important system.

For neurobiology, H. medicinalis is the only organism that is sufficiently well characterized as a whole. Most of its central neurons are identified and their functions and synaptic connections are known (or can be easily studied), making it possible to relate detailed gene expression profiles to physiological programs for each neuron. For medicine, many important molecules and factors remain to be discovered. For evolution, it represents a clade (Lophotrochozoa) and phylum (Annelida) that are not well described. It is also the source of a number of bio-products, including anticoagulants and possible treatments for arthritic joints, and it is useful as an adjunct to certain surgeries.

The basic questions facing the Hirudo community are:
1. What patterns of gene expression underlie neuronal function and synaptic circuitry?
2. What are the relations among gene expression programs related to developmental, regeneration, immune, and other stress responses?
3. What leech-specific factors (anticoagulants, proteases, antibiotic peptides, etc.) are of medical interest?
4. How do synaptic circuits generate specific behavioral responses?

How does an EST resource help?
It is key to assaying expression profiles that can be correlated with function for individual identified neurons (either via in situ hybridization or gene microarrays). It is also an entry point for discovering genetic circuits, a source of data for the analysis of mass spec data and proteomics, and a source of partial transcripts that can be used to obtain full transcripts and protein sequences for siRNA, ectopic expression, biochemical pathways, etc.

What questions do we hope to answer with an EST database? Functional questions about the origins of cell properties, and evolutionary issues.

2 Introduction to EST sequencing

We will discuss this and related topics in class lectures. See a basic introduction at http://www.ncbi.nlm.nih.gov/About/primer/est.html.

3 Projects

The class projects are designed to produce data that is useful for the neuroscience lab. Therefore, your customer is Professor Eduardo Macagno, together with the members of his lab, who will answer any questions about the relevance of the project and their needs from the database. Your consultants on these projects are the instructor, the TA (Julio Ng), and Jeff Wang.

Please note that not everything described below might make sense to all of you, but the classroom lectures prior to the first mid-term should help with that. You must choose one of the following projects, working in teams of 3 (except for project 4, where larger teams are allowed).

The project has 3 checkpoints, with dates on the course home page.
1. Checkpoint 1: Form teams, decide on a specific project, and download the data.
2. Checkpoint 2: See individual projects for C2 requirements.
3. Checkpoint 3:

3.1 Functional analysis (annotation) team

Build a pipeline that does the following:

Input: A multi-FASTA file of DNA sequences. The sequences can be individual ESTs or assembled transcripts. The header line of each entry is free-format but contains a unique ID.

Output: Search each sequence against each of the annotation (functional domain) databases mentioned below.
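As an illustration of the input side of such a pipeline, the following is a minimal Python sketch that reads a multi-FASTA stream and yields one (ID, sequence) pair per entry. The handout says only that the header is free-format and contains a unique ID; taking the first whitespace-delimited token after ">" as that ID is an assumption made here, and the function name and sample records are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_multi_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from a multi-FASTA stream."""
    seq_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            tokens = line[1:].split()
            # ASSUMPTION: the unique ID is the first token of the header.
            seq_id = tokens[0] if tokens else ""
            chunks = []
        elif seq_id is not None:
            # Sequence lines may wrap; normalise case and concatenate.
            chunks.append(line.upper())
    if seq_id is not None:
        yield seq_id, "".join(chunks)


if __name__ == "__main__":
    # Hypothetical sample input, not real leech data.
    sample = io.StringIO(">EST001 ganglion library\nACGT\nacgt\n\n>contig7\nTTGA\n")
    for est_id, seq in read_multi_fasta(sample):
        print(est_id, len(seq))
```

Yielding entries one at a time keeps memory use flat for large EST files, and each yielded ID can serve directly as the "Sequence Id" of the annotation output.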
For each type of annotation-source, the pipeline produces a tabular (tab-delimited) output with the following columns in each row:
1. Sequence Id
2. Annotation Id
3. Seq begin (the sequence coordinate where the domain match begins)
4. Seq end
5. Domain begin (relevant if the sequence matches only a portion of the domain)
6. Annotation end
7. P-value, Score
8. Strand (+/-), Frame (1,2,3) if relevant
9. Domain specific fields

Various annotations must be produced, including, but not limited to, the following annotation-sources:
1. BLAST: Download the NCBI toolkit (ftp://ftp.ncbi.nih.gov/blast/) to run BLAST locally, and download the NCBI nr database. Query each sequence against the NCBI nr database. The annotation here is simply a protein sequence. Use a conservative p-value cut-off after some experimentation.
2. PSI-BLAST: Exactly as for BLAST.
3. PFAM: Download the Pfam database and the HMMer software toolkit from http://pfam.wustl.edu/. Search each sequence for a Pfam domain using HMMer.
4. PDB: Download the PDB sequences (but not structures) from ftp://ftp.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt. Use BLAST to search the sequences, and report possible structures, if any.
5. UTR: Run the UTR prediction tool (XXX) to predict the coding region, frame, and untranslated region of the EST. As there is no annotation Id, simply use "UTR" as the annotation Id.
6. PROSITE: Download the PROSITE regular expressions from ftp://ca.expasy.org/databases/prosite. Search each sequence against the PROSITE database using either your own tool or a tool like ScanMotif.

Checkpoint 2 deliverable
Produce an annotation table for BLAST using all available EST sequences and assemblies. Support the Bioinformatics Algorithm Team I by providing them with the annotation table for queryIndex.

3.2 The web team

The goal of this project is simple: Create 3 views (dynamic HTML, or appropriate technology), and link them to the appropriate CGI scripts etc.
1. The Home page
2. An EST-Collection page
3. EST Sequence page
4. EST Assembly page

The Home Page
The home page should have a logo for the database (choose an appropriate name, e.g. Leechee if you like the fruit, and design a graphical logo) and a single search box. The search box is free-format, and it can accept EST-IDs, keywords (e.g. annexins), or a sequence.

The output of the search is always a collection of EST-assembly sequences, a collection of EST-sequences, or both. If the output is a collection, display the EST-collection page. If the output is a single assembly or a single EST, display the appropriate EST-assembly or EST-sequence page.

To execute the
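One way the free-format search box and the page-routing rule described for the home page might be handled is sketched below in Python. The handout does not define the EST-ID format or how a pasted sequence is recognized, so the ID pattern, the minimum sequence length, the view names, and both function names are assumptions made purely for illustration.

```python
import re
from typing import List, Tuple

# ASSUMPTION: IDs look like letters, an optional "_" or "-", then digits
# (e.g. "EST_00042"). The real ID scheme is not given in the handout.
EST_ID_PATTERN = re.compile(r"[A-Za-z]+[_-]?\d+")
NUCLEOTIDES = set("ACGTUN")
MIN_SEQUENCE_LENGTH = 20  # ASSUMPTION: shorter inputs are not treated as sequences


def classify_query(query: str) -> str:
    """Guess whether a search-box entry is an EST ID, a sequence, or a keyword."""
    stripped = query.strip()
    compact = "".join(query.split())  # pasted sequences may contain line breaks
    if not compact:
        return "empty"
    if len(compact) >= MIN_SEQUENCE_LENGTH and set(compact.upper()) <= NUCLEOTIDES:
        return "sequence"
    if EST_ID_PATTERN.fullmatch(stripped):
        return "est_id"
    return "keyword"


def choose_view(hits: List[Tuple[str, str]]) -> str:
    """Pick a page for search hits given as (kind, identifier) pairs.

    kind is "assembly" or "est". A single hit gets its own page; anything
    else (including no hits) is shown on the collection page.
    """
    if len(hits) == 1:
        kind, _ = hits[0]
        return "est_assembly_page" if kind == "assembly" else "est_sequence_page"
    return "est_collection_page"


if __name__ == "__main__":
    for q in ["annexins", "EST_00042", "ACGTACGTACGTACGTACGTAC\nGTTA"]:
        print(repr(q), "->", classify_query(q))
    print(choose_view([("assembly", "A1")]))
```

Checking for a sequence before checking for an ID is a deliberate choice in this sketch: a long run of nucleotide letters is unlikely to be a keyword, whereas short strings fall through to the ID and keyword tests.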
