CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 4: Writing bioinformatics software; Biological databases
CMSC423 Fall 2008


Writing bioinformatics software


Libraries & utilities

- Bio::Perl (Perl)
- BioJava (Java)
- BioPython (Python)
- BioRuby (Ruby)
- SeqAn (C++)
- Bioconductor (R)
- Chado (SQL)


Bio::Perl
http://www.bioperl.org

    use Bio::Perl;

    # read one sequence from a FASTA file and one from a GenBank file
    my $seq   = read_sequence("mytest.fa", "fasta");
    my $gbseq = read_sequence("mytest.gb", "genbank");

    # write the GenBank record back out in FASTA format
    write_sequence(">test.fasta", "fasta", $gbseq);

... vs ...


Bio::Perl: Homework question 5

    use Bio::Perl;

    # read_sequence() returns only the first record of a file,
    # so load every record with read_all_sequences() and loop over them
    foreach my $seq (read_all_sequences("test.fa", "fasta")) {
        # ">" is assumed here and in the other Question 5 slides:
        # the comparison operator is missing from this copy
        if ($seq->length() > 500) {
            print $seq->primary_id(), "\n";
        }
    }

Note: you still need to write your own version.


Bio::Perl: Other useful stuff

    use Bio::SeqIO;
    use Bio::DB::GenBank;

    # stream records from a very large FASTA file
    $seqio = new Bio::SeqIO(-format => 'largefasta',
                            -file   => 't/data/genomic-seq.fasta');
    $pseq  = $seqio->next_seq();

    # fetch a record directly from GenBank
    $gb   = new Bio::DB::GenBank;
    $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');

etc.


BioJava
http://www.biojava.org

    import java.io.*;
    import org.biojava.bio.*;
    import org.biojava.bio.seq.db.*;
    import org.biojava.bio.seq.io.*;
    import org.biojava.bio.symbol.*;

    String filename = args[0];
    BufferedInputStream is =
        new BufferedInputStream(new FileInputStream(filename));

    // get the appropriate Alphabet
    Alphabet alpha = AlphabetManager.alphabetForName(args[1]);

    // get a SequenceDB of all sequences in the file
    SequenceDB db = SeqIOTools.readFasta(is, alpha);


BioJava: Question 5

    BufferedReader br = new BufferedReader(new FileReader(args[0]));
    String format = args[1];
    String alphabet = args[2];

    SequenceIterator iter =
        (SequenceIterator) SeqIOTools.fileToBiojava(format, alphabet, br);

    while (iter.hasNext()) {
        Sequence seq = iter.nextSequence();
        if (seq.length() > 500) {
            System.out.println(seq.getName());
        }
    }


BioJava: more

Same as Bio::Perl:
- can directly connect to databases
- various sequence manipulations (reverse complement, translate, etc.)
- basic bioinformatics algorithms
- etc.


BioPython
http://www.biopython.org

    from Bio import SeqIO

    handle = open("file.fasta")
    # SeqIO.parse returns an iterator over the records in the file
    seq_record = SeqIO.parse(handle, "fasta")

    # my_records and handle2 stand for a list of records and an output handle
    SeqIO.write(my_records, handle2, "fasta")


BioPython: Question 5

    from Bio import SeqIO

    handle = open("test.fasta")
    for seq_record in SeqIO.parse(handle, "fasta"):
        if len(seq_record) > 500:
            print(seq_record.id)
    handle.close()


BioPython: more

Same as Bio::Perl:
- can directly connect to databases
- various sequence manipulations (reverse complement, translate, etc.)
- basic bioinformatics algorithms
- etc.


BioRuby
http://www.bioruby.org

    require 'bio'

    input_seq = ARGF.read    # reads all files in arguments
    my_naseq = Bio::Sequence::NA.new(input_seq)


BioRuby: Question 5

    #!/usr/bin/env ruby
    require 'bio'

    ff = Bio::FlatFile.new(Bio::FastaFormat, ARGF)
    ff.each_entry do |f|
      if f.length > 500
        puts f.entry_id
      end
    end


BioRuby: more

Same as Bio::Perl:
- can directly connect to databases
- various sequence manipulations (reverse complement, translate, etc.)
- basic bioinformatics algorithms
- etc.


SeqAn
http://www.seqan.de

    #include <fstream>
    #include <iostream>
    #include <seqan/sequence.h>
    #include <seqan/file.h>

    using namespace seqan;
    using namespace std;

    String<Dna> seq;
    String<char> name;

    fstream f;
    f.open("test.fasta");
    readMeta(f, name, Fasta());   // header line of the record
    read(f, seq, Fasta());        // sequence data of the record


SeqAn: Question 5

    String<Dna> seq;
    String<char> name;

    fstream f;
    f.open("test.fasta");
    while (!f.eof()) {
        readMeta(f, name, Fasta());
        read(f, seq, Fasta());
        if (length(seq) > 500)
            cout << name << endl;
    }


SeqAn: more

Not quite as much as Perl/Java/Python, but still lots of utilities, including graph algorithms.


R/BioConductor
http://www.bioconductor.org

Mainly for statistical applications, e.g. microarray analysis.

    library(affy)
    library(geneplotter)
    library(gplots)

    data <- ReadAffy()     # read the Affymetrix CEL files in the working directory
    eset <- rma(data)      # normalize with RMA
    e <- exprs(eset)       # extract the expression matrix
    heatmap.2(e, margins = c(15, 15), trace = "none",
              col = redgreen(25), cexRow = 0.5)


R/BioConductor

- Book has lots of examples.
- Worth learning more about it: easy to do various "cool" things (example if time).


Chado
http://www.gmod.org

Relational schema for storing biological data types in a relational database (e.g. MySQL, Oracle, Sybase).

    SELECT o.organism_id, o.abbreviation, o.genus, o.species, o.common_name,
           count(f.feature_id) AS n_features, o.comment
    FROM organism o LEFT JOIN feature f USING (organism_id)
    GROUP BY o.organism_id, o.abbreviation, o.genus, o.species,
             o.common_name, o.comment
    ORDER BY o.genus, o.species;


Chado: more

- The Bio* libraries generally provide the ability to interface with a relational database.
- Understanding SQL and Chado is useful irrespective of the language used.
- Relational databases are particularly useful for web services (GBrowse example if time).


Programming for bioinformatics

- Details of specialized libraries are beyond the scope of the course.
- Good software engineering practices are essential.
- Often "correct" is undefined: the output of the program defines correctness.
- Pitfalls: e.g. papers retracted from Science due to software bugs.
- Key: be proactive and learn by yourselves from online resources.
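
The Question 5 slides all rely on a library parser, while the Bio::Perl slide notes that you still need to write your own version. As a rough illustration of what a library-free version might look like, here is a minimal Python sketch. The function names `read_fasta` and `long_sequence_ids` are hypothetical, and the sketch assumes that the task is to print the identifiers of sequences longer than 500 and that the identifier is the first whitespace-delimited token after `>`; neither assumption is confirmed by the slides.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA file handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # assumption: the identifier is the first token after ">"
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        elif seq_id is not None:
            # sequence data may span several lines; collect the pieces
            chunks.append(line)
    # emit the last record in the file
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def long_sequence_ids(handle, min_length=500):
    """Return identifiers of records whose sequence is longer than min_length."""
    return [seq_id for seq_id, seq in read_fasta(handle) if len(seq) > min_length]


if __name__ == "__main__":
    # small in-memory FASTA file: one 4-base record and one 600-base record
    demo = io.StringIO(
        ">short example\nACGT\n>long\n" + "A" * 300 + "\n" + "C" * 300 + "\n"
    )
    print(long_sequence_ids(demo))  # prints ['long']
```

For a real file, `long_sequence_ids(open("test.fasta"))` would play the role of the loops on the Question 5 slides. Joining the collected chunks once per record, rather than concatenating strings line by line, keeps the parser linear in the size of the input.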