CMSC 838T – Lecture 3

Pairwise sequence alignment:
  - Find similarity between two DNA / protein sequences
  - Hope similar sequence → similar function

This lecture:
  - Basic concepts, terminology
  - Alignment scoring metrics
      - Gap penalty
      - Scoring matrices
  - Overview of alignment algorithms
      - Dynamic programming
      - Dot matrix plot
      - FASTA
      - BLAST

Motivation for Sequence Alignment

Large public sequence databases:
  - Genomic DNA – full DNA sequence in genome
  - mRNA – messenger RNA sequences (expressed genes)
  - cDNA – reverse transcription of mRNA (coding region)
  - Expressed sequence tags (ESTs) – short partial cDNA
  - Proteins – amino acids, structure, function

Alignment to similar sequence(s) in database can:
  - Transfer information (structure, function) between sequences
  - Suggest evolutionary relationships
  - Organize and classify genomic data

Similarity

Similarity:
  - Measure of closeness based on observable quantity
  - One example: % identity
      - 42% identity = 42% of bases match exactly

Homology:
  - Conclusion that genes have common ancestry
  - All or nothing (42% homologous is meaningless)
  - High similarity may indicate homology

Conserved region:
  - Region(s) of highest similarity between homologous genes
  - May imply the region performs a useful / important function, and is thus conserved by evolution

Two Types of Homology

Orthologs:
  - Genes with same function found in different species
  - Inherited from common ancestor
  - Differences due to speciation (evolution)

Paralogs:
  - Genes duplicated due to replication mutation
  - Genes assume different functions

Example:
  - A is ancestor to B, C
  - B, C are orthologs
  - C, C' are paralogs

  [Figure: tree relating A, B, C and C', with branches labeled Speciation and Duplication]

Alignment

Alignment:
  - Mutual arrangement of sequences
  - Gaps inserted if necessary
  - Exhibits similarities and differences
  - Score → measure of quality of alignment

      C  A  T  C  A  G  A  T
      :     :  :  :  :     :
      C  –  T  C  A  G  G  T
      1 -2  1  1  1  1 -1  1  = 3
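
The score of the example alignment above can be reproduced with a short sketch. The per-column values (match = 1, mismatch = -1, gap = -2) are read off the slide's example; the function names are hypothetical, and taking % identity over the full alignment length (gap columns included) is an assumption, since the slide does not define the denominator.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2, gap_char="-"):
    """Sum per-column scores for two already-aligned, equal-length sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            total += gap          # column contains a gap
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


def percent_identity(top, bottom, gap_char="-"):
    """Percentage of alignment columns that hold identical residues."""
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != gap_char)
    return 100.0 * matches / len(top)


top = "CATCAGAT"
bottom = "C-TCAGGT"
print(score_alignment(top, bottom))   # 3
print(percent_identity(top, bottom))  # 75.0
```

Six matches, one gap, and one mismatch give 6 - 2 - 1 = 3, the total shown on the slide.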
“Optimal” alignment:
  - Alignment with best score relative to metrics used
  - May or may not have biological significance, because the algorithm relies on approximations
      - Scoring matches / mismatches
      - Scoring gaps
      - many more…

Global & Local Alignment

Global alignment:
  - Best alignment of entire sequences to each other
  - Q: Are two sequences generally the same?

Local alignment:
  - Best alignment of parts of sequence
  - Q: Do two sequences contain regions of high similarity?
  - Biologically
      - Two sequences may differ in structure and function, but share common substructure / subfunction

In general:
  - Use local alignment to find sequences with shared similarity
  - Use global alignment to compare resulting sequences

Talk Outline

Basic concepts, terminology

Alignment scoring metrics:
  - Gap penalty
  - Scoring matrices

Overview of alignment algorithms:
  - Dynamic programming
  - Dot matrix plot
  - FASTA
  - BLAST

Scoring Similarity

Scoring:
  - Used to compare alignments
  - Can only score aligned sequences

Three components to scoring:
  1. Match or mismatch
      - Protein
      - DNA / RNA
  2. Gap opening
  3. Gap extension

Example:

      A A G C A G - - - A
      A T G - - G T A C A

  [Figure labels: mismatch, gap opening, gap extension]

Scoring Similarity - Gaps

Gaps:
  - Can be inserted in aligned sequences
  - Can represent
      - Actual insertion / deletion (indel) mutations
      - Regions of low sequence similarity

      A A G C A G ----- C A
      A A G ----- G T A C A

Biologically:
  - Probability of indels seems to drop slowly (log(n)?) with size

Scoring gaps:
  - Commonly use affine cost model (Cost = h + g × gap length)
      - h = gap opening penalty (large)
      - g = gap extension penalty (small)
  - Costs empirically determined (relative to scoring matrix)

Scoring Matches and Mismatches

Protein:
  - Amino acids have varying properties
  - Use scoring matrix (amino acid substitution)
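
The three scoring components (match / mismatch from a scoring matrix, gap opening, gap extension) can be combined in one function, as the minimal sketch below does. It uses the affine model Cost = h + g × gap length from the "Scoring Similarity - Gaps" slide. The values h = 5 and g = 1, the toy DNA substitution table, and the example sequences are all assumptions chosen for illustration, not values from the lecture.

```python
def affine_gap_cost(length, h=5, g=1):
    """Affine model: cost = h + g * length for one gap spanning `length` columns."""
    return h + g * length


def score_alignment(top, bottom, substitution, h=5, g=1, gap_char="-"):
    """Substitution scores for residue pairs minus affine costs for gap runs."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    run_owner, run_length = None, 0   # row holding the current gap run, and its length
    for a, b in zip(top, bottom):
        owner = "top" if a == gap_char else "bottom" if b == gap_char else None
        if owner != run_owner and run_length:
            # The gap run just ended (or switched rows): charge it once.
            score -= affine_gap_cost(run_length, h, g)
            run_length = 0
        run_owner = owner
        if owner is None:
            score += substitution[(a, b)]
        else:
            run_length += 1
    if run_length:                    # gap run reaching the end of the alignment
        score -= affine_gap_cost(run_length, h, g)
    return score


# Hypothetical DNA table: +1 for identical bases, -1 otherwise.
dna = {(a, b): (1 if a == b else -1) for a in "ACGT" for b in "ACGT"}

print(score_alignment("ACGT--AC", "A-GTTTAC", dna))   # 5 matches - (5+1) - (5+2) = -8
print(affine_gap_cost(4), 4 * affine_gap_cost(1))     # 9 24
```

The last line shows why the opening penalty is made large relative to the extension penalty: one gap of length 4 costs far less than four separate gaps of length 1.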
Scoring:
  - Some matrices based on observation
      - PAM - pointwise mutations in similar proteins
      - BLOSUM - mutations in locally-conserved regions of distant proteins
      - Mutations observed in specific protein families…
  - Other matrices based on chemical / physical properties
      - Chemical similarity (e.g., hydrophobicity of residues)
      - Codon distance (# base changes to convert amino acid)

Scoring Protein Similarity – PAM

PAM (Percent Accepted Mutations) [Dayhoff+ 1978]

Weight derivation:
  - Amino acid replacement rates found in evolution
  - Based on 1,572 changes found in 71 groups of closely related proteins (no more than 15% different)
  - Log-odds

      weight = log( observed substitution probability / random substitution probability )

Implementation:
  - One PAM unit = 1% divergence in amino acids
      - Higher PAM → more divergent
  - Higher PAM extrapolated from PAM 1 (PAM k = (PAM 1)^k)
  - Authors suggest gap penalty 6 for PAM 250 (trial-and-error)

PAM 250 Scoring Matrix

A Ala   2
R Arg  -2   6
N Asn   0   0   2
D Asp   0  -1   2   4
C Cys  -2  -4  -4  -5   4
Q Gln   0   1   1   2  -5   4
E Glu   0  -1   1   3  -5   2   4
G Gly   1  -3   0   1  -3  -1   0   5
H His  -1   2   2   1  -3   3   1  -2   6
I Ile  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L Leu  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K Lys  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M Met  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F Phe  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P Pro   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S Ser   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T Thr   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W Trp  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y Tyr  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V Val   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

Scoring Protein Similarity – PAM

Observations:
  - Amino acids vary greatly in mutability
  - In PAM 250, replaced
      - 45% tryptophans and 48% cysteines (low variability)
      - 73% glycines and 94% asparagines (high variability)
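
The extrapolation PAM k = (PAM 1)^k and the log-odds weight can be illustrated with a toy model. The sketch below assumes a hypothetical two-letter alphabet, not the real 20 × 20 amino acid matrix; the 1% mutation probability, the background frequencies, the base-10 logarithm, and the scale factor are all assumptions made for illustration. The "fraction replaced" it prints plays the same role as the per-residue percentages listed under Observations: it is one minus the diagonal entry of the extrapolated matrix.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_k(pam1, k):
    """Extrapolate PAM k as the k-th matrix power of the PAM 1 probabilities."""
    n = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, pam1)
    return result


def log_odds(p_observed, p_random, scale=10.0):
    """Log-odds weight; the base-10 log and scale factor are assumptions."""
    return scale * math.log10(p_observed / p_random)


# Hypothetical two-letter alphabet: each letter changes with probability 1%
# per PAM unit, and both letters have background frequency 0.5.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
background = [0.5, 0.5]

pam250 = pam_k(pam1, 250)
fraction_replaced = 1.0 - pam250[0][0]          # share of letter 0 no longer letter 0
identity_weight = log_odds(pam250[0][0], background[0])
print(round(fraction_replaced, 3), round(identity_weight, 2))
```

Because repeated changes can revert a site to its original letter, the fraction replaced after 250 PAM units stays below 100% even though 250 × 1% exceeds it.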
Limitations:
  - Some replacements observed too infrequently (36 never seen)
      - Use estimated replacement rate
  - Assumes amino acid replacements are independent
  - Assumes sequences have average amino acid composition

Updated version of PAM (1992):
  - Based on 59K replacements in 16K sequences

Scoring Protein Similarity – BLOSUM

BLOSUM (BLOcks SUbstitution Matrix) [Henikoff+ 1992]

Weight derivation:
  - Amino acid replacement rates found in (locally aligned) conserved regions of distantly related proteins
  - Based on proteins in BLOCKS database (> 10^6 substitutions)

Implementation:
  - BLOSUM unit


UMD CMSC 838T - CMSC 838T Lecture 3
