Database Searching

Searching for Data
  - Text Patterns
      - LookUp
  - Sequence Patterns
      - FindPatterns
      - ProfileSearch
  - Sequence Similarity
      - FastA, TFastA
      - BLAST, NetBLAST

Introduction to Data Base Searching
  - What are you looking for?
  - "Exact" matches
      - "Have I cloned something that someone else has already worked on?"
  - "Related" sequences
      - Is there something similar to my sequence?
      - Evolutionary relationships
      - Convergent function

Search Program Considerations
  - Sensitivity
  - Stringency
  - Speed
  - Cost

Speed and Cost
  - Time and cost of the search is dependent on the size of the database and the size of the query
  - Restrict the size of the database
  - Use the -batch qualifier to save money
  - Use GenBank's Services

Results
  - Histogram
      - Plot of "match scores" vs. number of sequences
      - Allows you to distinguish background noise from significant matches
  - Sequence Names
  - Alignments

FindPatterns
  - Locate short sequence patterns in sequences
  - Nucleic acid or Protein
  - Searches both strands of a nucleic acid sequence

Pattern Definitions
  - Findpatterns, Map, Mapsort, Mapplot, and Motifs all let you search with ambiguous expressions
  - Expressions can include any legal GCG sequence character
  - Expressions can also specify:
      - OR and NOT matching
      - Begin and end constraints
      - Repeat counts

Repeats
  - Parentheses () enclose one or more symbols that can be repeated
  - Braces {} enclose numbers that tell how many times the symbol(s) must be found
      - (GA){2,10} - GA repeated 2 to 10 times
      - G{2,} - G repeated 2 to 350,000 times
      - (GAT){,10} - GAT repeated 0 to 10 times
  - TAATA(N){20,30}ATG
      - TAATA, followed by 20 to 30 of any base, followed by ATG

OR Matching
  - Enclose the different choices in parentheses and separate the choices with commas
  - RGF(Q,A)S
      - RGF followed by either Q or A followed by S.
  - GAT(TG,T,G){1,4}A means
      - GAT followed by any combination of TG, T, or G repeated from 1 to 4 times followed by A

NOT Matching
  - Use the ~ symbol
  - GC~CAT
      - GC, followed by any symbol except C followed by AT
  - GC~(A,T)CC
      - GC followed by any symbol except A or T, followed by CC.

BEGIN AND END Constraints
  - The pattern <GACCAT can only be found at the beginning of the sequence
  - The pattern GACCAT> can only be found at the end of the sequence

    analyze% findpatterns -check

    FindPatterns identifies sequences that contain short patterns like
    GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow
    mismatches. You can provide the patterns in a file or simply type them
    in from the terminal.

    Minimal Syntax: % findpatterns [-INfile=]Genbank:Humig* -Default

    Prompted Parameters:

    -PATterns=GAATTC,RGGAY         patterns to be found
    [-OUTfile=]findpatterns.find   the output file name

    Local Data Files:

    -DATa=pattern.dat   a file with a set of patterns

    Optional Parameters:

    -MISmatch=1      allows mismatches in the search for your subsequence
    -NAMes           writes the output as a list file
    -ONEstrand       searches only the top strand of nucleotide sequences
    -SIXbase         searches only for patterns with six or more symbols
    -CIRcular        searches all sequences as if they were circular
    -ALL             does an "overlapping-set" search in nucleotide sequences
    -PERFect         looks only for perfect matches
    -APPend          appends the pattern data file to the output file
    -SHOw            shows every file searched even if there are no finds
    -TERminal        writes output to the terminal screen instead of a file
    -NOMONitor       suppresses the screen trace showing each file
    -ONCe            limits finds to patterns found a maximum of 1 time
    -MINCuts=1       limits finds to patterns found a minimum of 1 time
    -MAXCuts=3       limits finds to patterns found a maximum of 3 times
    -EXCLude=n1,n2   excludes patterns found between positions n1 and n2
    -SINce=6.90      limits search to sequences dated on or after June 1990
    -BATch           Submits the program to run in the batch queue

    Add what to the command line ?

    FINDPATTERNS in what sequence(s) ? swp:*

    Enter patterns individually, one per line.
    End the list with a blank line.

    Pattern 1: ygdd
    Pattern 2:

    What should I call the output file (* findpatterns.find *) ? ygdd.find

    ** findpatterns will run as a batch or at job.
    ** findpatterns was submitted using the command:
    " atnow "
    Job class000.894911339.a will be run at Mon May 11 13:28:59 CDT 1998.

    analyze%

    ! FINDPATTERNS on swp:* allowing 0 mismatches
    ! 1 YGDD May 11, 1998 11:02 ..

    AAC1_PSEAE ck: 7052 len: 177 ! P23181 pseudomonas aeruginosa. gentamicin 3'-acetyltransferase (ec 2.3.1.6
    1 YGDD
        148: YVQAD YGDD PAVAL

    AMDZ_YEAST ck: 8601 len: 464 ! Q03557 saccharomyces cerevisiae (baker's yeast). probable amidase ymr293c
    1 YGDD
        450: QVVGQ YGDD STVLD

    AMOB_NITEU ck: 4649 len: 420 ! Q04508 nitrosomonas europaea. ammonia monooxygenase (ec 1.13.12.-). 2/96
    1 YGDD
        227: RVLLA YGDD LLMDP

    AMYM_BACST ck: 5976 len: 717 ! P19531 bacillus stearothermophilus. maltogenic alpha-amylase precursor (ec

    POLG_HRV1B  VPSGCSGTSI FNTMINNIII RTLVLDAYKN IDLDKLKIIA YGDDVIFSYK
    POLG_HRV2   VPSGCSGTSI FNTMINNIII RTLVLDAYKN IDLDKLKIIA YGDDVIFSYI
    POLG_HRV89  MPSGCAGTSI FNTIINNIII RTLVLDAYKN IDLDKLKILA YGDDVIFSYN
    POLG_CXA16  MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELNMVA YGDDVLASYP
    POLG_HE71M  MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELNMVA YGDDVLASYP
    POLG_HE71B  MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELKMVA YGDDVLASYP
    POLG_SVDVU  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
    POLG_SVDVH  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
    POLG_COXB5  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
    POLG_COXB3  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
    POLG_CXA9   MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
    POLG_COXB4  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
    POLG_COXB1  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
    POLG_EC11G  MPSGYSGTSM FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
    POLG_FMDV1  MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD
    POLG_FMDVO  MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD
    POLG_FMDVZ  MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD
    POLG_FMDVA  MPSDCSATGI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD
    POLG_FMDVS  MPSGCSATSI VNTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD
    POLG_TMEVB  LPSGCAATSM LNTIMNNVII RAALYLTYSN FDFDDIKVLS YGDDLLIGTN
    POLG_TMEVG  LPSGCAATSM LNTIMNNVII RAALYLTYSN FEFDDIKVLS YGDDLLIGTN
    POLG_TMEVD  LLSGCAATSM LNTIMNNVII RAALYLTYSN FEFDDIKVLS YGDDLLIGTN
    POLG_EMCVD  LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN
    POLG_EMCVB  LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN
    POLG_EMCV   LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN

    ! FINDPATTERNS on swp:* allowing 0 mismatches
    ! 1 (L,I,V)(S,A)YGDD(L,I,V){2} May 11, 1998 11:31 ..

    AMOB_NITEU ck: 4649 len: 420 ! Q04508 nitrosomonas europaea. ammonia monooxygenase (ec 1.13.12.-). 2/96
    1
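
The pattern rules in the slides (repeats in braces, OR choices in parentheses, ~ for NOT, < and > for begin and end) map closely onto regular expressions. The following is a minimal Python sketch of that mapping, not how FindPatterns itself is implemented. The names gcg_to_regex and find_pattern are hypothetical helpers invented for this illustration. The sketch assumes perfect matches only (no -MISmatch), searches one strand, reports non-overlapping finds, expands IUPAC ambiguity codes only for nucleic acid patterns, and accepts only single symbols after ~.

```python
import re

# IUPAC nucleotide ambiguity codes (R = A or G, Y = C or T, N = any base, ...).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Pattern characters that translate one-for-one into regex syntax.
FIXED = {"<": "^", ">": "$", "(": "(?:", ")": ")", ",": "|"}


def _expand(symbol, nucleic):
    """Return the residues that a single pattern symbol stands for."""
    if len(symbol) != 1 or not symbol.isalpha():
        raise ValueError(f"unsupported symbol: {symbol!r}")
    if not nucleic:
        return symbol  # assumption: protein letters are taken literally
    if symbol not in IUPAC_DNA:
        raise ValueError(f"not an IUPAC nucleotide code: {symbol!r}")
    return IUPAC_DNA[symbol]


def gcg_to_regex(pattern, nucleic=True):
    """Translate a FindPatterns-style expression into a Python regex string."""
    pattern = pattern.upper()
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "~":
            # NOT: either ~X or ~(X,Y,...) with single-symbol members.
            if pattern[i + 1] == "(":
                close = pattern.index(")", i)
                members = pattern[i + 2:close].split(",")
            else:
                close = i + 1
                members = [pattern[i + 1]]
            excluded = "".join(_expand(m, nucleic) for m in members)
            out.append("[^" + excluded + "]")
            i = close + 1
        elif ch == "{":
            # Repeat count: {m,n}, {m,} or {,n}; a missing minimum means 0.
            close = pattern.index("}", i)
            body = pattern[i + 1:close]
            if body.startswith(","):
                body = "0" + body
            out.append("{" + body + "}")
            i = close + 1
        elif ch in FIXED:
            out.append(FIXED[ch])
            i += 1
        else:
            residues = _expand(ch, nucleic)
            out.append(residues if len(residues) == 1 else "[" + residues + "]")
            i += 1
    return "".join(out)


def find_pattern(pattern, sequence, nucleic=True):
    """Return (1-based position, matched text) for each non-overlapping find."""
    regex = re.compile(gcg_to_regex(pattern, nucleic))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(gcg_to_regex("TAATA(N){20,30}ATG"))  # TAATA(?:[ACGT]){20,30}ATG
    print(gcg_to_regex("GC~(A,T)CC"))          # GC[^AT]CC
    print(find_pattern("ygdd", "IDLDKLKIIAYGDDVIFSYK", nucleic=False))
```

Run against the flanking residues shown in the first output, the stricter pattern (L,I,V)(S,A)YGDD(L,I,V){2} still finds the AMOB_NITEU site (RVLLA YGDD LLMDP) but not the AAC1_PSEAE site (YVQAD YGDD PAVAL), which is consistent with AMOB_NITEU appearing in the second listing.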


UAB MIC 753 - Database Searching
