UCSD CSE 182 - Biological Data Analysis



Biological Data Analysis (CSE 182)
Final projects

1 Logistics

The final 1 or 2 lectures of the class will be devoted to final presentations of the projects. You are required to work on one of the projects described below. Checkpoints have been created to help you stay on track. They will not be graded separately, but you must meet the deadlines. Soon after your first checkpoint, you should schedule a meeting with the instructor, preferably at the discussion time, Monday 11/17, or Tuesday 11/18. Students choosing 1.3 must also submit A4. Otherwise A4 is optional.

Checkpoints

C1: 11/10/08: Submit a 1-2 page written report with answers to the following questions:
1. Your project partner’s name, if any. Teams of 2 people are recommended, but 1 or 3 will also be accepted.
2. Your choice of the project. Choose from one of the following, or from ubergrid.org.
3. For each project, the ability to read input files, available after 11/11.

C2: 11/17/08: Answers to the C2 part of your project. Schedule a meeting with the instructor to discuss the final project.

C3: 12/1,3: C3 part, and final presentation.

1.1 Homology based gene finder

Goal: Given a protein sequence from a species, find the location of the orthologous gene using genomic data from another species. (Ideal group size: 3 people)

Skills: Scripting/programming. The tool could be built entirely using public domain tools such as Blast, and a scripting language.

Motivation: As more genomes are sequenced, initial gene identification is based on such comparative analysis. We want to look at the current maize genome, using protein sequences from other plants, like rice.

Questions: Answer the following:
1. Download the maize genome from XXX. How large is the available data-set? How many sequences (C1)?
2. Download the rice proteome from XXX. How large is the data-set? How many sequences?
3. Blast each rice protein against the maize genome. What version of Blast will you use? How many queries get at least one hit (C1)?
4. For each rice sequence, decide if the ortholog is completely present in the genomic database, partially present, or absent. Develop code that takes the Blast hits, and builds the following OrthologTable: each row has 4 columns: the protein ID for rice, an identifier for the genomic sequence, strand (+ve/-ve), and an indicator set to 0 (no-hit), 1 (partial-hit), or 2 (over 90% of the sequence is hit) (C2).
5. Build a gene finder as follows:
(a) For each protein query, use the Blast output to identify a genomic region that contains the ortholog. Remember that the query might hit multiple locations (paralogs), and genomic sequence from multiple contigs. Separate the different paralogs, and order the contigs within each paralog (C3).
(b) While Blast should help locate most of the exons, it might miss some of the smaller exons, and get the boundaries wrong. Use the tool Augustus to reconstruct the best transcript (C3).
(c) Build a table of exon coordinates. Each row contains the following: rice protein ID, genomic ID, strand, begin-exon coordinate, end-exon coordinate (C3).
(d) Concatenate and translate the exons to get a protein sequence. Blast the rice query against the translated amino-acid sequence to check similarity (C3).
6. Redo OrthologTable. For each rice sequence, decide if the ortholog is completely present in the genomic database, partially present, or absent. Develop code that takes the Blast hits, and builds the following table: each row has 4 columns: the protein ID for rice, an identifier for the genomic sequence, strand (+ve/-ve), and an indicator set to 0 (no-hit), 1 (partial-hit), or 2 (over 90% of the sequence is hit) (C3).

Figure 1: Homology based gene finding. The protein query will match the translated genomic sequence in exons.

Presentation:
1. Describe the project.
2. Provide general statistics (answers to the questions above).
3. Show interesting cases where the gene finder could identify the gene, or failed to reconstruct the final gene.
4. Future work.

1.2 Post-translational protein analysis

Goal: The goal of this project is to identify post-translational modifications and protein expression levels that are condition/tissue specific.

Skills: The programming skills needed here are minimal, but the project requires you to use web-resources or freely available tools such as R. This is a more open ended project, and points will be given for creativity.

Motivation: A comprehensive search of the Arabidopsis proteome resulted in identification of a number of proteins, and their modifications. The data were acquired from different tissues. Additionally, peptides were identified from root with and without infection from a nematode. It is expected that different proteins are expressed preferentially in different tissues/conditions. The goal of this project is to help visualize this massive data-resource. As a simple example, compare different tissues/conditions by making a heatmap of protein expression.

Questions:
1. Collect peptide/protein data from different Arabidopsis tissues: leaf, silique, root, etc. by emailing Natalie Castellana [email protected] (C1).
2. Identify a web resource, or learn how to draw a heat-map using the R programming language or Python (C1).
3. Build heat maps and other visualizations using spectral counts of expressed proteins, and their post-translational modifications. If you select this project, meet with the instructor in the week of 11/9 to discuss.

1.3 Dictionary matching tool

Students choosing this project must also submit A4.

Goal: Build a robust dictionary matching tool.

Skills: Good programming skills (Java/C++)

Motivation: Dictionary matching is an important part of a string matching toolkit. In the class, we learned only to search with an existing dictionary. Here, we will build a tool for constructing a dictionary, and searching with it.

Questions: Answer the following (C1):
1. What is a trie?
2. What data structure will you use to encode the trie?
3. What data structure will you use to encode the failure function?
4. Describe algorithms to search with a trie, and to build one.
5. Implement code searchtrie to search a multi-fasta file with a trie. If a sub-sequence from sequence S matches the dictionary, you must report the header of the sequence, and the position in the sequence.

Presentation:
1. Implement code buildtrie to build a trie, given a dictionary of keywords (C2).
2. Run buildtrie and show your output for various dictionaries provided (C2).
3. Run searchtrie using the provided dictionaries, and
4. Generate random database strings of size 1K, 10K, 100K, 1M,
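
As an illustration of the trie, failure function, and multi-fasta search asked about in project 1.3, here is a minimal sketch in Python. It assumes the failure function is the one used in Aho-Corasick style keyword matching; the handout does not name the algorithm. The function names `build_trie`, `search_trie`, `read_fasta`, and `search_fasta` are hypothetical and are not the required buildtrie and searchtrie programs. The array-of-dictionaries encoding is only one of several possible answers to questions 2 and 3, and reported positions are 0-based.

```python
from collections import deque


def build_trie(keywords):
    """Build a keyword trie with failure links (Aho-Corasick style).

    Nodes are integers; node 0 is the root.
    """
    children = [{}]  # children[node][char] -> child node
    fail = [0]       # failure link of each node
    output = [[]]    # keywords recognized on arrival at each node

    for word in dict.fromkeys(keywords):  # drop duplicate keywords
        if not word:
            continue  # an empty keyword would match everywhere
        node = 0
        for ch in word:
            if ch not in children[node]:
                children.append({})
                fail.append(0)
                output.append([])
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        output[node].append(word)

    # Breadth-first pass: depth-1 nodes keep failure link 0 (the root).
    queue = deque(children[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in children[node].items():
            f = fail[node]
            while f and ch not in children[f]:
                f = fail[f]
            fail[child] = children[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            output[child] = output[child] + output[fail[child]]
            queue.append(child)
    return children, fail, output


def search_trie(text, trie):
    """Return (start_position, keyword) for every match in text."""
    children, fail, output = trie
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in children[node]:
            node = fail[node]
        node = children[node].get(ch, 0)
        for word in output[node]:
            hits.append((i - len(word) + 1, word))
    return hits


def read_fasta(lines):
    """Yield (header, sequence) pairs from the lines of a multi-fasta file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def search_fasta(lines, keywords):
    """Report (header, position, keyword) for each dictionary match."""
    trie = build_trie(keywords)
    return [(header, pos, word)
            for header, seq in read_fasta(lines)
            for pos, word in search_trie(seq, trie)]
```

For example, `search_fasta([">seq1", "ACGT", "ACGA", ">seq2", "TTACG"], ["ACG", "CGA"])` returns `[("seq1", 0, "ACG"), ("seq1", 4, "ACG"), ("seq1", 5, "CGA"), ("seq2", 2, "ACG")]`. Each text character is consumed once, so the search time grows with the sequence length plus the number of matches reported, which is the behavior the random database strings of increasing size are meant to exercise.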

