IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 46, NO. 3, MAY 2000 737Grammar-Based Codes: A New Class of UniversalLossless Source CodesJohn C. Kieffer, Fellow, IEEE, and En-hui Yang, Member, IEEEAbstract—We investigate a type of lossless source code called agrammar-based code, which, in response to any input data stringover a fixed finite alphabet, selects a context-free grammarrepresenting in the sense that is the unique string belongingto the language generated by. Lossless compression of takesplace indirectly via compression of the production rules of thegrammar. It is shown that, subject to some mild restrictions,a grammar-based code is a universal code with respect to thefamily of finite-state information sources over the finite alphabet.Redundancy bounds for grammar-based codes are established.Reduction rules for designing grammar-based codes are pre-sented.Index Terms—Chomsky hierarchy, context-free grammars, en-tropy, Kolmogorov complexity, lossless coding, redundancy, uni-versal coding.I. INTRODUCTIONGRAMMARS (especially context-free grammars) havemany applications in engineering and computer science.Some of these applications are speech recognition [8, Ch. 13],image understanding [22, p. 289], compiler design [1], andlanguage modeling [10, Theorems 4.5, 4.6]. In this paper, weshall be interested in using context-free grammars for losslessdata compression. There has been some previous work ofthis nature, including the papers [3], [2], [11], [14], [24],[18].Two approaches have been used. In one of these approaches(as illustrated in [2], [11], and [14]), one fixes a context-freegrammar, known to both encoder and decoder, such that thelanguage generated bycontains all of the data strings thatare to be compressed. To compress a particular data string, onethen compresses the derivation tree [2, p. 844] showing how thegiven string is derived from the start symbol of the grammar.In the second of the two approaches (exemplified by the papers[3], [24], and [18]), a different context-free grammarisassigned to each data string, so that the language generated byis . If the data string is to be compressed, the encodertransmits code bits to the decoder that allow reconstruction ofthe grammar, from which the decoder infers . This secondapproach is the approach that we employ in this paper. WeManuscript received December 20, 1995; revised December 16, 1999. Thiswork was supported in part by the National Science Foundation under GrantsNCR-9304984, NCR-9508282, NCR-9627965, and by the Natural Sciences andEngineering Research Council of Canada under Grant RGPIN203035-98.J. C. Kieffer is with the Department of Electrical and Computer Engi-neering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail:[email protected]).E.-h. Yang is with the Department of Electrical and Computer Engi-neering, University of Waterloo, Waterloo, Ont., Canada N2L 3G1 (e-mail:[email protected]).Communicated by M. Weinberger, Associate Editor for Source Coding.Publisher Item Identifier S 0018-9448(00)03094-7.Fig. 1. Encoder structure of grammar-based code.shall put forth a class of lossless source codes that employ thisapproach that we call grammar-based codes. Unlike previousworkers using the second approach, we place our results in aninformation-theoretic perspective, showing how to properlydesign a grammar-based code so that it will be a universal codewith respect to the family of finite-state information sources ona fixed finite alphabet.In this section, we wish to give the reader an informal notionof theidea of a grammar-based code.For this purpose, we do notneed a precise definition of the concept of context-free grammar(this will be done in Section II). All we need to know about acontext-free grammarhere is that it furnishes us with someproduction rules via which we can construct certain sequencesover a finite alphabet which form what is called the languagegenerated by, denoted by .A grammar-based code consists of encoder and decoder.• Fig. 1 depicts the encoder structure. Lettingdenote thedata string that is to be compressed, consisting of finitelymany terms chosen from some fixed finite alphabet, thegrammar transform in Fig. 1 constructs a context-freegrammarsatisfying the property that ,which tells us thatmay be inferred from becauseis the unique string belonging to the language .The grammar encoder in Fig. 1 assigns to the grammara binary codeword which is denoted .• When the decoder is presented with the codeword,the data stringis recovered by first reconstructing thegrammar, and then inferring from the productionrules of.From the preceding, the reader can see that our philosophy isnot to directly compress the data string; instead, we try to “ex-plain”by finding a grammar that is simple and generatesin the sense that . Since can be recoveredfrom, we can compress instead of . As the grammarthat we shall use to represent will be simple, we will getgood compression by compressing.The main results of this paper (Theorems 7 and 8) tell usthat, under some weak restrictions, a grammar-based code is auniversal lossless source code for any finite-state informationsource. We shall be able to obtain specific redundancy boundsforgrammar-based codes with respectto finite-stateinformation0018–9448/00$10.00 © 2000 IEEE738 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 46, NO. 3, MAY 2000sources. Also, we shall see how to design efficient grammar-based codes by means of reduction rules.As a result of this paper, the code designer is afforded withmore flexibility in universal lossless source code design. Forexample, for some data strings, a properly designed grammar-based code yields better compression performance than that af-forded by the Lempel–Ziv universaldata compression algorithm[27].Notation and Terminology: We explain the following nota-tions and terminologies used throughout the paper:•denotes the cardinality of a finite set .•denotes the length of a string of finite length.•denotes the smallest positive integer greater than orequal to the real number.•denotes the set of all strings of finite length whose en-tries come from the finite set, including the empty string.We represent each nonempty string inmultiplicativelyand uniquely as, where is the length of thestring and; we write the empty stringinas .•Ifand are elements of , we define the product tobe the element ofsuch thati) If, then ;if , then .ii) Ifand , thenwhere and are the uniquemultiplicative
View Full Document