Unformatted text preview:

3/17/09'1'Compression'CISC489/689‐010,'Lecture'#5'Monday,'February'23'Ben'CartereFe'Why'Compress?'• Recall'from'last'Mme:'index'files'– Vocabulary'file'contains'all'terms'with'pointers'to'lists'in'an'inverted'file.'– Inverted'file'contains'lists'of'all'documents'the'terms'appear'in.'– CollecMon'file'contains'all'the'document'names.'• This'can'be'a'lot'of'informaMon'to'store,'access,'and'transfer!'– Easily'takes'up'several'gigabytes'in'memory'or'on'disk.'• Compression'helps'work'with'large'files.'3/17/09'2'What'is'Compression?'• Compression'is'a'type'of'encoding'of'data.'• The'goal'is'to'make'the'data'smaller.'• A'very'big'topic'in'CS'and'engineering.'– We'have'a'full'course'on'data'compression.'Encoder'Data' Encoded'data'Model'Encoded'data' Decoder' Data’'Model'Types'of'Compression'• Lossless'compression:'– The'encoding'preserves'all'informaMon'about'the'original'data.'– The'original'data'can'be'recovered'completely.'• Lossy'compression:'– The'encoding'loses'some'informaMon'about'the'original'data.'– The'original'data'can'be'recovered'approximately.'• Signature'file'indexes'are'a'type'of'lossy'compression.'3/17/09'3'Compression'in'IR'• Text'compression:'– Used'to'compress'vocabulary,'document'names,'original'document'text.'– Based'on'assumpMons'about'language.'• Data'compression:'– Used'to'compress'inverted'lists.'– Not'generally'based'on'assumpMons,'but'on'observaMons'about'the'data.'Preliminaries'• “Text”'means'based'on'characters.'• What'is'a'character?''(Think'C,'C++)'– A'data'type.'– Generally'stores'1'byte.'– 1'byte'='8'bits.'– Since'each'bit'can'be'0'or'1,'one'byte'can'store'28'='256'possible'characters.'3/17/09'4'ASCII'Encoding'• ASCII'is'a'common'character'encoding.'• Each'character'is'represented'with'8'bits.'– A'='ASCII'65'='01000001'– ¿'='ASCII'168'='10101000'– 256'possible'characters.'• Decoding:''table'maps'bytes'to'characters.'• Fish:''01000110'01101001'01110011'01101000'– 32'bits'='4'bytes.'Fixed'Length'Codes'• Short'bytes:''use'the'smallest'number'of'bits'needed'to'represent'all'characters.'– English'has'26'leFers.''How'many'bits'needed?'– 5'bits'can'represent'25'='32'leFers.'– 26'leFers'*'2'cases'='52'characters.'• Requires'6'bits…'or'does'it?'• Use'numbers'1‐30'(00001'–'11110)'to'represent'two'sets'of'characters.'– Use'0'(00000)'to'toggle'the'first'set'(e.g.'capital'leFers).'– Use'31'(11111)'to'toggle'the'second'set'(e.g.'small'leFers).'• Fish:''00110'11111'01001'10011'01000'– 25'bits,'slightly'over'3'bytes.'F'↓' i' s' h'3/17/09'5'Fixed'Length'Codes'• Bigram'codes:''use'8'bits'to'encode'either'1'or'2'characters.'– is'would'be'encoded'in'8'bits.''• Use'values'0‐87'for'space,'26'lower'case,'26'upper'case,'10'numbers,'and'25'other'characters.'• Use'values'88‐255'for'character'pairs.'– Master'(8):''blank,'A,'E,'I,'O,'N,'T,'U'– Combining'(21):''blank,'all'other'leFers'except'JKQXYZ'– 88'+'8*21'='256'possibiliMes'encoded'• Fish:''00100000'10101010'00001000'– 24'bits,'3'bytes.'F'is' h'Fixed'Length'Codes'• N‐gram'codes:''same'as'bigram,'but'encode'character'strings'of'length'less'than'or'equal'to'n.'• Select'most'common'strings'for'8‐bit'encoding'in'advance.'– Goal:''most'commonly'occurring'n‐grams'require'only'one'byte.'• Fish:''00100000'10111010'– 16'bits,'2'bytes.'F'ish'3/17/09'6'Fixed'Length'Summary'• Fixed'length'codes'are'generally'simple,'easy'to'use,'and'effecMve'when'assumpMons'are'met.'• Limited'alphabet'size'allowed.'• If'data'does'not'meet'assumpMons,'compression'will'not'be'good.'Restricted'Variable'Length'Codes'• Idea:''different'characters'can'have'encodings'of'different'lengths.'• Similar'to'case‐shiwing'in'short'byte'codes:'– First'bit'indicates'case.'– 8'most'common'characters'encoded'in'4'bits'(0xxx)'– 128'less'common'characters'encoded'in'8'bits'(1xxxxxxx)'– First'bit'tells'you'how'many'bits'to'read'next.'• 8'most'common'English'leFers'are'e,'t,'a,'i,'n,'o,'r,'s.'• Fish:''10000110'0011'0110'10000100'– 24'bits,'3'bytes.'F'i' s'h'3/17/09'7'Restricted'Variable'Length'Codes'• 8'most'common'leFers'in'English'are'64%'of'characters'in'wiki000'subset.'• Expected'code'length'='0.64*4'bits'+'0.36*8'bits'='5.44'bits'per'character.'• A'liFle'worse'than'short'bytes,'but'can'encode'many'more'characters.'– Can'also'generalize'to'more'than'2'cases:'• 0xxx'for'most'common'8'characters.'• 1xxx0xxx'for'next'26'='64'characters.'• 1xxx1xxx0xxx'for'next'29'='512'characters,'…'Unicode'• Unicode'is'an'encoding'designed'to'handle'many'different'alphabets'and'symbol'sets.'• Unicode'is'a'type'of'restricted'variable'length'coding.'– Uses'21'bits'to'encode'1,114,112'symbols.'– First'5'bits'encode'“plane”'(numbered'0‐16).'– Within'each'plane,'16'bits'encode'characters'(numbered'0‐65,536).'3/17/09'8'UTF‐n'for'Unicode'• UTF‐n'encodes'Unicode'using'n‐bit'chunks.'– Each'value'of'n'can'encode'all'1,114,112'symbols.'• Encodings'designed'to'map'between'different'values'of'n'without'losing'informaMon.'• UTF‐32:'– 32'bits'can'store'more'than'4'billion'symbols.'– Just'assign'each'Unicode'symbol'a'32‐bit'string.'– 11'bits'never'used.'UTF‐8'• “Chunk”'is'8'bits'(1'byte).'• Use'7'bits'(0xxxxxxx)'to'store'first'128'Unicode'symbols'(which'are'basic'ASCII).'• Higher'values'stored'in'2'or'more'bytes.'– First'byte'encodes'number'of'bytes'in'unary.'• 110xxxxx'means'a'2‐byte'character.'• 1110xxxx'means'a'3‐byte'character.'– Remaining'bytes'in'form'10xxxxxx.'– Free'bits'(x’s)'used'to'encode'symbols.'3/17/09'9'UTF‐8'Templates'• 0xxxxxxx'(1'byte,'7'free'bits):'– Unicode'symbols'0'to'127'(basic'ASCII:''A‐Z,'a‐z,'0‐9,'etc.)'• 110xxxxx'10xxxxxx'(2'bytes,'11'free'bits):'– Unicode'symbols'128'to'2176'(LaMn,'Greek,'Cyrillic,'Armenian,'Hebrew,'Arabic,'etc.)'• 1110xxxx'10xxxxxx'10xxxxxx'(3'bytes,'16'free'bits):'– Unicode'symbols'2177'to'67,714'(almost'all'other'alphabets)'• 11110xxx'10xxxxxx'10xxxxxx'10xxxxxx'(4'bytes):'– All'remaining'Unicode'symbols.'UTF‐8'Examples'•


View Full Document

UD CISC 689 - Compression


Download Compression

Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Compression
 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Compression
 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?