UCSB CS 290 - Botryllus schlosseri: A BLAST study - D3017718

School: University of California, Santa Barbara

Course: CS 290 - Big Data and Networks

Pages: 6

CS 167/290D Course Project
Botryllus schlosseri: A BLAST study

Project Team
Name: Eric Kwei, Perm #: 8419855, Email: [email protected]
Name: Nan Li, Perm #: 3003100, Email: [email protected]

Project Abstract
Ascidians (sea squirts) are organisms that have typical chordate characteristics as larvae but then metamorphose into sedentary adults several hours after birth. Most ascidians reproduce sexually, but a number of species are colonial and capable of reproducing asexually by budding entire adult bodies from stem cells. One model organism of this type, Botryllus schlosseri, has a genome significantly larger (4x) than those of other sequenced ascidians that do not exhibit colonial behavior (e.g. Ciona intestinalis) [3]. One hypothesis is that the colonial behavior arose from the duplication of ancestral genes related to embryonic development, which subfunctionalized into genes related to regenerative development [4],[5]. Using BLAST, we will search a subset of expressed sequences of Botryllus (the genome is not completely sequenced) to find evidence of this gene/genome duplication and, possibly, identify potential orthologous genes in the Ciona genome. If successful, our work may help in understanding the genetic basis of the colonial behavior of B. schlosseri and its relatives.

Milestones
1. Install and get familiar with a local version of BLAST
2. Download and understand the Botryllus sequence data
3. Write a script that will BLAST each Botryllus sequence against the others
4. Run the script to identify possibly duplicated genes (if any)
5. Connect duplicated genes (if any) with known functional pathways (possible)
6. Find similar genes in the C. intestinalis genome (possible) [2]

Data Sources
Tony De Tomaso, currently an assistant professor at Stanford, has contributed assembled Botryllus cDNA sequence data in FASTA format.
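
As a rough illustration of the FASTA layout mentioned above (a header line starting with ">" followed by one or more sequence lines), the following minimal Python sketch reads records into a dictionary. The function name read_fasta and the toy records seq_a and seq_b are hypothetical and are not taken from the project data.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name.
            header = line[1:].split()[0]
            records[header] = ""
        elif header is not None:
            # Sequence data may span several lines; join them.
            records[header] += line.upper()
    return records


# Hypothetical toy input, not the JGI consensus sequences.
example = """>seq_a
ACGTAC
GTTA
>seq_b
ttgaca
"""

records = read_fasta(example.splitlines())
lengths = {name: len(seq) for name, seq in records.items()}
print(lengths)
```

A dictionary keyed by header keeps the sketch short; for a real cDNA library, streaming the records one at a time may be preferable to holding every sequence in memory.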
>24_1_CCAO_CCTT_CCTU
AACTCGAACACCACTGTGGTCTTACTTTTATTAACTCAAAACGGAAAATAGATAAAAGAA … GTAGAAACGGATCTTAATACAACCAAATACTCCAAT [876 nucleotides]
>24_2_CCAO_CCTT_CCTU
TGGTTGTATTAAGATCCGTTTCTACTAGCTTGATCCGAATTGATTAAAAGCAATGGCTAG…

Fig. 1: Sample consensus sequences clustered by JGI from cDNAs in libraries CCAO, CCTT and CCTU

Experiment Design and Results
The essence of our experiment is to cluster expressed DNA sequences from Botryllus schlosseri using BLAST. The principal motivation is to identify genes that may have duplicated and subfunctionalized during the species' evolution; these genes may be implicated in the ability of Botryllus (and related species), unique among ascidians, to reproduce asexually.

Data Description
cDNA Consensus Sequences and Annotation
The data used in our experiment is a composite of three different normalized Botryllus cDNA libraries, generated from mRNA sequences. The three libraries represent an initial attempt to characterize the expressed genome of Botryllus (well over 50 percent). The DOE Joint Genome Institute (JGI) sequenced, clustered, and combined the sequence reads into a consensus set of expressed sequences (Fig. 1). Additionally, using a blastx (nucleotide-protein) search, a subset of the consensus cDNA sequences listed above was linked to known proteins in the NCBI database; the annotated results were combined with the cDNA clustering data into a structured description file. Based on this description file, we can assign genes and proteins to some of the sequences and potentially identify functions of duplicated genes using GenInfo Identifier (gi) numbers and NCBI descriptions.

Methodology and Experiment Results
Sequence Clustering
We used the BLASTCLUST tool from NCBI to cluster the provided sequences. BLASTCLUST systematically clusters DNA or protein sequences based on statistically significant pairwise matches found using the default versions of the megablast and blastp algorithms, respectively.
If the input sequences are provided in FASTA format, BLASTCLUST formats them into a database and searches each query sequence against this database. Alternatively, BLASTCLUST can take a formatted sequence database as input instead of a set of FASTA sequences [1].

Fig. 2: Pairwise alignment used for NCBI BLASTCLUST

As listed below, for a given pair of sequences, a set of parameters can be derived from the top-scoring alignment for use in alignment evaluation (additional symbols are defined in Fig. 2):

Coverage of seqX: Cx = Hx/Lx
Coverage of seqY: Cy = Hy/Ly
Coverage: MAX(Cx,Cy) or MIN(Cx,Cy)
Alignment Length: Al = Hx+Gx = Hy+Gy
Score Density: S/MIN(Hx,Hy) or N/Al*100%

These quantities enter BLASTCLUST as thresholds through the -S and -L options, which set the score density and length coverage thresholds, respectively [1]. A sequence that exceeds these thresholds with at least one member of a cluster is included in that cluster (single-linkage clustering).

Clustering Statistics
Since we expect that few sequences will be duplicated, the overwhelming majority of clusters should contain exactly one sequence. However, if clusters contain too many sequences, then the clustering may not be significant, and we cannot safely conclude that clustered sequences are homologous. Thus, we use the percentage of clusters that contain multiple consensus sequences as a metric to compare clustering results from different -S and -L threshold values.

-S and -L Parameter Sensitivity Results
In our experiments, we fixed -L at the default value of 0.9 and varied -S between 0.0 and 3.0 to observe the impact of -S on cluster quality (Fig. 3). In Fig. 3, note that the percentage of clusters with multiple sequences decreases monotonically as -S increases. Because the -S value represents the score density threshold, a higher value of -S implies stricter clustering; thus, the higher the -S value, the less likely it is that two sequences will be grouped together.
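
The threshold rule and the cluster metric described above can be sketched in Python as follows. This is not BLASTCLUST's implementation, only an illustration under stated assumptions: Hx and Hy are taken to be the number of residues of each sequence inside the top-scoring alignment, Lx and Ly the full sequence lengths, and S the alignment score. Only the S/MIN(Hx,Hy) form of score density and the MIN/MAX coverage forms are modelled, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alignment:
    """Top-scoring alignment of seqX and seqY (field meanings are assumptions)."""
    x: str
    y: str
    Lx: int   # full length of seqX
    Ly: int   # full length of seqY
    Hx: int   # residues of seqX inside the alignment
    Hy: int   # residues of seqY inside the alignment
    S: float  # alignment score


def coverage(a: Alignment, both: bool = True) -> float:
    """MIN(Cx, Cy) when both sequences must be covered, else MAX(Cx, Cy)."""
    cx, cy = a.Hx / a.Lx, a.Hy / a.Ly
    return min(cx, cy) if both else max(cx, cy)


def score_density(a: Alignment) -> float:
    """Score density in the S / MIN(Hx, Hy) form."""
    return a.S / min(a.Hx, a.Hy)


def single_linkage(names: List[str], alignments: List[Alignment],
                   s_min: float, l_min: float) -> List[List[str]]:
    """Merge sequences linked by any alignment that passes both thresholds."""
    parent: Dict[str, str] = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a in alignments:
        if score_density(a) >= s_min and coverage(a) >= l_min:
            parent[find(a.x)] = find(a.y)

    clusters: Dict[str, List[str]] = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    # Largest clusters first, ties broken by member names.
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


def percent_multi(clusters: List[List[str]]) -> float:
    """Percentage of clusters that contain more than one sequence."""
    return 100.0 * sum(len(c) > 1 for c in clusters) / len(clusters)


# Hypothetical alignments between five made-up sequences.
names = ["a", "b", "c", "d", "e"]
alignments = [
    Alignment("a", "b", 100, 100, 95, 95, 190.0),
    Alignment("b", "c", 100, 100, 92, 92, 170.0),
    Alignment("c", "d", 100, 200, 95, 95, 190.0),  # low coverage of seqY
    Alignment("d", "e", 100, 100, 95, 95, 100.0),  # low score density
]

clusters = single_linkage(names, alignments, s_min=1.75, l_min=0.9)
sweep = [percent_multi(single_linkage(names, alignments, s, 0.9))
         for s in (1.0, 1.75, 1.9, 2.5)]
print(clusters, sweep)
```

With this toy input, raising the score density threshold while holding the coverage threshold at 0.9 can only remove links, so the share of multi-sequence clusters never increases, which matches the trend reported for Fig. 3.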
In addition, we observe that the percentage of clusters with multiple sequences is highly sensitive to the value of -S near 1.75 (the default BLASTCLUST value). Since we are interested in balancing the number of clusters with the significance of each cluster, we believe that this -S value
