UMD CMSC 423 - Project Part 1: Multiple Sequence Aligner

CMSC 423: Project Part 1: Multiple Sequence Aligner

Due: 11:59pm on November 16, 2010

You will build a multiple sequence aligner. In part 1, you’ll implement the 2-approximation to MSA using affine gaps.

Rules: You can work in groups of 2. The same grade will be assigned to both members of a group. Do not use code from the internet or other groups. Your code should be in Java or C++. If you choose to work in C++, the TA might not help debug C++ problems.

Part 1: Read in a scoring matrix file and a multi-FASTA file and output the 2-approximation to the MSA under the sum-of-pairs score.

Format of the command: Your program will be run using the following command:

    msa score.matrix in.fasta out.fasta

Your program must run and compile on the glue.umd.edu machines.

Format of the multi-FASTA file: The multi-FASTA file contains any number of sequences. A sequence begins with a line containing a > character in the first column. Following the > on the same line there is a name for the sequence. This name ends at the first space or at the end of the line, whichever comes first. Characters after the first space on this line are ignored. Every name must be unique among the sequences in the file; if you find a duplicate name, report an error (see below). On the line after the > line, the sequence begins. The sequence consists of any alphabetic character (A-Z, a-z). Upper- and lower-case are distinct. Spaces and newlines can be put anywhere in the sequence and should be ignored. The sequence ends with either the start of another sequence or the end of the file.

Example multi-FASTA file:

    >seq1 this text is ignored
    ACGTTTGAAATAGATACTGATCTGCATGACTAGACGAGTACGGGGGTATATATAAAAAA
    >seq2
    ACGTTTGAAATAG ATACTGATCTGCATGACTAGACGAGTACGGGGGTATATATAAAAAA
    >seq3
    ATACTGATCTGCATGACTAGACGAGTACGGGGGTATATATAAAAA A

Format of the scoring matrix file: The file contains lines giving the cost of aligning two characters together.
Each line is of the format:

    c1 c2 score

where c1 and c2 are alphabetic characters or the gap character “-”, and where score is a floating point cost of aligning c1 to c2.

The special case where c1 = - and c2 = - gives the gap opening cost (the cost of starting a new gap). The case where c1 ≠ - and c2 = - gives the gap extension cost for pairing c1 with a gap. The true cost of aligning a gap to a gap is always taken to be 0.

The scores should be given with the understanding that the minimum cost alignment will be sought.

Example score matrix file:

    - - 10
    A - 5
    T - 5
    C - 5
    G - 5
    A A -5
    A T 2.5
    A C 2.5
    A G 2.5
    T T -5
    T C 2.5
    T G 2.5
    C C -5
    C G 2.5
    G G -5

Above, the gap opening penalty is 10, and the gap extension penalty is 5 no matter what the gap is aligned with. The cost of a mismatch is 2.5 and the cost of a match is -5.

Only one of c1 c2 and c2 c1 should be included in the scoring matrix. If both are, report an error as described below.

If a score needed by the input sequences is not defined, report an error.

Format of the output: Your computed MSA should be output as a multi-FASTA file in the same format as described above, except that now sequences are allowed to contain “-” characters representing where you have placed the gaps. The filename for the output will be given on the command line.

You should also output the sum-of-pairs score on the terminal in the format:

    SP-score = -50

where -50 is replaced by the actual computed SP-score for the alignment you have computed. You can (and probably should) output some informative messages to track the progress of your program.
Only the above line should start with SP-score, however.

Reporting errors: If there is an error in the input FASTA file, you should output the following:

    error: bad fasta file - DETAILED MESSAGE HERE

replacing the text “DETAILED MESSAGE HERE” with some short explanation of the error.

If there is an error in the scoring matrix or if some needed score is missing, you should output the following:

    error: bad scoring matrix - DETAILED MESSAGE HERE

replacing the text “DETAILED MESSAGE HERE” with some short explanation of the error.

In either case, your program should exit after printing the error message.

Submission: Submit a .zip file using the CS submit server: submit.cs.umd.edu. The zip file must be named DIRID-msa1.zip, where DIRID is your UMD directory id. The zip file should contain all of the files needed to compile and run your program. If your program is written in Java, it must be able to be compiled with “javac *.java” from the command line (ensure this is true even if you use Eclipse). If your program is written in C/C++, it must be able to be compiled with the command “make”. The TA will spend no more than 1 minute attempting to compile your
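
Sketch: reading the multi-FASTA file. The multi-FASTA rules described above (name ends at the first space, whitespace in the sequence is ignored, duplicate names are errors) can be sketched in Java roughly as follows. This is only one possible reading of the format, not a reference solution. The class name FastaSketch and the helpers parse and bad are hypothetical, and treating an empty name or sequence data before the first > line as errors is an assumption the handout does not state.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not from the handout): multi-FASTA lines -> name/sequence map.
class FastaSketch {

    // Builds the error text in the "error: bad fasta file - ..." format.
    static IllegalArgumentException bad(String detail) {
        return new IllegalArgumentException("error: bad fasta file - " + detail);
    }

    static Map<String, String> parse(List<String> lines) {
        Map<String, StringBuilder> seqs = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (line.startsWith(">")) {
                // The name runs from after '>' to the first space or the end of the line.
                String rest = line.substring(1);
                int space = rest.indexOf(' ');
                String name = space < 0 ? rest : rest.substring(0, space);
                if (name.isEmpty()) throw bad("empty sequence name"); // assumption
                if (seqs.containsKey(name)) throw bad("duplicate name " + name);
                current = new StringBuilder();
                seqs.put(name, current);
            } else {
                for (char c : line.toCharArray()) {
                    if (Character.isWhitespace(c)) continue; // spaces/newlines are ignored
                    boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
                    if (!letter) throw bad("invalid character '" + c + "'");
                    if (current == null) throw bad("sequence data before first > line"); // assumption
                    current.append(c); // case is preserved: upper and lower are distinct
                }
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : seqs.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> example = Arrays.asList(
            ">seq1 this text is ignored", "ACGTTTGA AATAG", ">seq2", "acgt");
        try {
            System.out.println(parse(example));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // print the error line, then exit
            System.exit(1);
        }
    }
}
```

In a real program the lines would come from the file named on the command line, for example via java.nio.file.Files.readAllLines.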

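Sketch: reading the scoring matrix. The scoring matrix rules above (one "c1 c2 score" entry per line, "- -" for the gap opening cost, an error if both c1 c2 and c2 c1 appear or if a needed score is missing) might be handled along these lines. The class ScoreMatrixSketch and its methods gapOpen, gapExtend, and pair are hypothetical names. Skipping blank lines and treating "- c" the same as "c -" are assumptions, not something the handout specifies.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not from the handout): holds the parsed scoring matrix.
class ScoreMatrixSketch {
    private final Map<String, Double> scores = new HashMap<>();

    // Builds the error text in the "error: bad scoring matrix - ..." format.
    static IllegalArgumentException bad(String detail) {
        return new IllegalArgumentException("error: bad scoring matrix - " + detail);
    }

    // Order-independent key, so "A T" and "T A" map to the same entry.
    private static String key(char a, char b) {
        return a <= b ? "" + a + b : "" + b + a;
    }

    private static boolean valid(char c) {
        return c == '-' || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    static ScoreMatrixSketch parse(List<String> lines) {
        ScoreMatrixSketch m = new ScoreMatrixSketch();
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty()) continue; // assumption: blank lines are tolerated
            String[] f = t.split("\\s+");
            if (f.length != 3 || f[0].length() != 1 || f[1].length() != 1) {
                throw bad("malformed line: " + t);
            }
            char c1 = f[0].charAt(0), c2 = f[1].charAt(0);
            if (!valid(c1) || !valid(c2)) throw bad("invalid character in line: " + t);
            double score;
            try {
                score = Double.parseDouble(f[2]);
            } catch (NumberFormatException e) {
                throw bad("invalid score in line: " + t);
            }
            // put() returns the previous value, so non-null means a duplicate pair.
            if (m.scores.put(key(c1, c2), score) != null) {
                throw bad("pair " + c1 + " " + c2 + " is defined more than once");
            }
        }
        return m;
    }

    private double lookup(char a, char b) {
        Double s = scores.get(key(a, b));
        if (s == null) throw bad("no score defined for " + a + " " + b);
        return s;
    }

    double gapOpen() { return lookup('-', '-'); }         // the "- -" entry
    double gapExtend(char c) { return lookup(c, '-'); }   // the "c -" entry
    double pair(char a, char b) { return lookup(a, b); }  // two letters

    public static void main(String[] args) {
        ScoreMatrixSketch m = parse(Arrays.asList("- - 10", "A - 5", "A A -5", "A T 2.5"));
        System.out.println(m.gapOpen() + " " + m.gapExtend('A') + " " + m.pair('T', 'A'));
    }
}
```

Note that the "- -" entry stores the gap opening cost only; the cost of actually aligning a gap to a gap is 0 and has to be handled by the caller.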

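Sketch: computing the SP-score. The sum-of-pairs score with affine gaps can be computed from a finished alignment by summing the cost of every pair of rows. The sketch below makes several assumptions that the handout does not spell out: a gap of length k costs the opening cost plus k extension costs; columns where both rows have a gap cost 0 and are skipped before gap openings are counted; and a gap that switches from one row to the other opens a new gap. The names SpScoreSketch, Costs, exampleCosts, pairCost, and spScore are hypothetical, and exampleCosts simply hard-codes the example matrix above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not from the handout): SP-score of an alignment under affine gaps.
class SpScoreSketch {

    // Hypothetical view of the scoring matrix.
    interface Costs {
        double open();               // gap opening cost ("- -" entry)
        double extend(char c);       // cost of pairing c with a gap
        double pair(char a, char b); // cost of aligning two letters
    }

    // The example matrix: open 10, extend 5, match -5, mismatch 2.5.
    static Costs exampleCosts() {
        return new Costs() {
            public double open() { return 10; }
            public double extend(char c) { return 5; }
            public double pair(char a, char b) { return a == b ? -5 : 2.5; }
        };
    }

    // Cost of the pairwise alignment induced by two rows of the MSA.
    static double pairCost(String r1, String r2, Costs c) {
        if (r1.length() != r2.length()) {
            throw new IllegalArgumentException("rows differ in length");
        }
        double total = 0;
        int gapRow = 0; // 0: last kept column had no gap; 1: gap in r1; 2: gap in r2
        for (int i = 0; i < r1.length(); i++) {
            char a = r1.charAt(i), b = r2.charAt(i);
            if (a == '-' && b == '-') continue; // gap-gap costs 0; column is skipped
            if (a == '-' || b == '-') {
                int row = (a == '-') ? 1 : 2;
                if (row != gapRow) total += c.open(); // a new gap starts here (assumption)
                total += c.extend(a == '-' ? b : a);
                gapRow = row;
            } else {
                total += c.pair(a, b);
                gapRow = 0;
            }
        }
        return total;
    }

    // Sum of pairCost over all unordered pairs of rows.
    static double spScore(List<String> rows, Costs c) {
        double total = 0;
        for (int i = 0; i < rows.size(); i++) {
            for (int j = i + 1; j < rows.size(); j++) {
                total += pairCost(rows.get(i), rows.get(j), c);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> msa = Arrays.asList("AC-T", "A-GT", "ACGT");
        System.out.println("SP-score = " + spScore(msa, exampleCosts()));
    }
}
```

If the course intends a different convention (for example, the opening cost already covering the first gap character), only pairCost would need to change.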