OSU BMI 731 - A compression Algorithm for DNA sequences and its applications in Genome Comparison

Unformatted text preview:

A compression Algorithm for DNA sequences and its applications in Genome ComparisonSlide 2Can we compress DNA sequences?Related WorkRelated Work…Slide 6Slide 7Encoding Edit Operations:Slide 91 ) Two Bits Encoding Method2) Exact Matching Method3) Approximate Matching Method4) Approximate with Edit OP sequenceGenCompress: an algorithm based on approximate matchingCondition C, and the Compression Gain AlgorithmOptimal Prefix?Symmetry?RelatednessQuestions?A compression A compression Algorithm for DNA Algorithm for DNA sequences and its sequences and its applications in applications in Genome Genome Comparison Comparison X. Chen, S. Kwong, M. Li, Genome Informatics (GIW'99), Tokyo, Japan, pp.51-61, 1999Compressibility of DNACompressibility of DNARelatedness between 2 Relatedness between 2 DNA sequencesDNA sequencesCan we compress DNA Can we compress DNA sequences?sequences?DNA sequences only consist of 4 DNA sequences only consist of 4 nucleotide bases {A,G,C,T} nucleotide bases {A,G,C,T}  2 bits 2 bits for each basefor each baseRegular-global compression Regular-global compression algorithms do not work (they even algorithms do not work (they even expand the data) for DNA. Why?expand the data) for DNA. Why?Related WorkRelated Work(in the spirit of Ziv and Lempel data (in the spirit of Ziv and Lempel data compression method) compression method) BiocompressBiocompress and and Biocompress-2Biocompress-2 (second one (second one uses arithmetic coding of order-2). uses arithmetic coding of order-2). => They are searching the => They are searching the previously previously processed processed part of the sequence for part of the sequence for repeats.repeats.Related Work…Related Work…CfactCfact:: idea same as Biocompress-2 but idea same as Biocompress-2 but no use of arithmetic coding of order 2.no use of arithmetic coding of order 2. 11stst phase: phase: Builds Suffix TreeBuilds Suffix Tree 22ndnd Phase: Phase: repetitions are coded with repetitions are coded with garanteed garanteed gain; otherwise, two-bit per gain; otherwise, two-bit per base encoding is used.base encoding is used.=> Searching the whole sequence for exact => Searching the whole sequence for exact matches.matches.Related Work…Related Work…Some compression schemes based on Some compression schemes based on appropriate string matching which are appropriate string matching which are lossylossy. . In the paper, they are not interested in lossy In the paper, they are not interested in lossy compression.compression.““Relatedness” between two DNA sequences. Relatedness” between two DNA sequences. Some approaches to define such distance d(x,y):Some approaches to define such distance d(x,y):conditional compression, transformation distanceconditional compression, transformation distance both are both are not symmetricnot symmetric, hence cannot be used in , hence cannot be used in general. Some authors tried (d(x,y)+d(y,x))/2. general. Some authors tried (d(x,y)+d(y,x))/2. Obviously such a definition is not well founded.Obviously such a definition is not well founded.Related Work…Related Work…Some Biologists: methods like Some Biologists: methods like counting the number of shared counting the number of shared genes in two genomes or comparing genes in two genomes or comparing the ordering of genes.the ordering of genes.Encoding Edit Encoding Edit Operations:Operations:Replace:Replace: (R, p, char) means (R, p, char) means replacing the character at position p replacing the character at position p by character char.by character char.Insert:Insert: (I, p, char) means inserting (I, p, char) means inserting character char at p.character char at p.Delete:Delete: (D, p) means deleting the (D, p) means deleting the character at p.character at p.Encoding Edit Encoding Edit Operations:Operations:1) Two Bits Encoding Method1) Two Bits Encoding Method2) Exact Matching Method2) Exact Matching Method3) Approximate Matching Method3) Approximate Matching Method4) Approximate Matching Method 4) Approximate Matching Method with edit operation sequencewith edit operation sequence1 ) Two Bits Encoding Method1 ) Two Bits Encoding Method10 00 01 01 10 11 01 00 encodes 10 00 01 01 10 11 01 00 encodes “gaccgtca” “gaccgtca” It needs 16 bits.It needs 16 bits.““gaccgtca” using string “gaccttca”gaccgtca” using string “gaccttca”2) Exact Matching Method2) Exact Matching MethodWe can use (repeat position, repeat We can use (repeat position, repeat length)length)Use one bit to indicate if the next pair Use one bit to indicate if the next pair is a pairis a pair string can be encoded “gaccgtca” string can be encoded “gaccgtca” using using string “gaccttca”string “gaccttca” 0 000 100 1 10 0 100 011 ….in 17 bits 0 000 100 1 10 0 100 011 ….in 17 bits {(0,4),g,(5,3)} is encoded… {(0,4),g,(5,3)} is encoded…3) Approximate Matching 3) Approximate Matching MethodMethodcan be more than one representation can be more than one representation for a specific sequencefor a specific sequence in this case.. in this case.. “ “gaccgtca” using string “gaccttca”gaccgtca” using string “gaccttca”{(0,8),(R,4,g)} or{(0,8),(R,4,g)} or0 000 111 1 00 100 10 0 000 111 1 00 100 10 in this case R encoded by 00, I encoded in this case R encoded by 00, I encoded by 01, and D encoded by 11, and 0/1 by 01, and D encoded by 11, and 0/1 indicating next item double /triple ?indicating next item double /triple ?4) Approximate with Edit 4) Approximate with Edit OP sequenceOP sequenceIf we use the edit operation sequence (I,4,g),If we use the edit operation sequence (I,4,g),(D,6).(D,6).Then the string Then the string “gaccgtca” can be encoded “gaccgtca” can be encoded as as {(0,8),(I,4,g),(D,6)} or…{(0,8),(I,4,g),(D,6)} or…0 000 111 1 01 100 10 1 10 110 in total 21 0 000 111 1 01 100 10 1 10 110 in total 21 bits…bits…From the examples Approximate (3From the examples Approximate (3rdrd) has ) has the least number of bits… the least number of bits…GenCompress:GenCompress: an algorithm an algorithm based on approximate based on approximate matchingmatchingGenCompress is a one-pass algorithm.GenCompress is a one-pass algorithm.For input For input ww, part of it already , part of it already processed, say processed, say vv, and the remaining , and the remaining part is u, i.e. part is u, i.e. ww = =


View Full Document

OSU BMI 731 - A compression Algorithm for DNA sequences and its applications in Genome Comparison

Download A compression Algorithm for DNA sequences and its applications in Genome Comparison
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view A compression Algorithm for DNA sequences and its applications in Genome Comparison and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view A compression Algorithm for DNA sequences and its applications in Genome Comparison 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?