A Minimum Description Length Proposal for Lossy Data Compression‡

M. Madiman∗   M. Harrison∗   I. Kontoyiannis†

June 2004

Abstract

We give a development of the theory of lossy data compression from the point of view of statistics. This is partly motivated by the enormous success of the statistical approach in lossless compression, in particular Rissanen's celebrated Minimum Description Length (MDL) principle. A precise characterization of the fundamental limits of compression performance is given, for arbitrary data sources and with respect to general distortion measures.

The starting point for this development is the observation that there is a precise correspondence between compression algorithms and probability distributions (in analogy with the Kraft inequality in lossless compression). This leads us to formulate a version of the MDL principle for lossy data compression. We discuss the consequences of the lossy MDL principle and explain how it leads to potential practical lessons for vector-quantizer design.

We introduce two methods for selecting efficient compression algorithms, the lossy Maximum Likelihood Estimate (LMLE) and the lossy Minimum Description Length Estimate (LMDLE). We describe their theoretical performance and give examples illustrating how the LMDLE has performance superior to that of the LMLE.

1 Introduction

Formally, and somewhat roughly speaking, the central problem of universal data compression is that of selecting an appropriate code among a given family, in order to obtain good compression performance.
Following [3], we identify compression algorithms with probability distributions on the reproduction space.

More precisely, consider a source $\{X_n\}$ with values in the alphabet $A$, which is to be compressed with distortion no more than $D$ with respect to an arbitrary sequence of distortion measures $\rho_n : A^n \times \hat{A}^n \to [0, \infty)$, where $\hat{A}$ is the reproduction alphabet. For a source string $x_1^n = (x_1, x_2, \ldots, x_n) \in A^n$, let $B(x_1^n, D)$ denote the distortion-ball of radius $D$ around $x_1^n$:

    $B(x_1^n, D) = \{ y_1^n \in \hat{A}^n : \rho_n(x_1^n, y_1^n) \le D \}.$

∗ Mokshay Madiman (corresponding author) and Matthew Harrison are with the Division of Applied Mathematics, Brown University, Box F, 182 George Street, Providence, RI 02912, USA. Email: [email protected], [email protected]
† Ioannis Kontoyiannis is with the Division of Applied Mathematics and the Department of Computer Science, Brown University, Box F, 182 George Street, Providence, RI 02912, USA. Email: [email protected]
‡ The research for this work was supported in part by NSF grant #0073378-CCR and by USDA-IFAFS grant #00-52100-9615.

Presented by M. Madiman at ISIT 2004

We consider the class of codes $C_n : A^n \to \{0,1\}^*$ that operate at distortion level $D$. Any such code $C_n$ is the composition of a quantizer $\phi_n$ that maps $A^n$ to a (finite or countably infinite) codebook $B_n \subset \hat{A}^n$, followed by a prefix-free encoder $\psi_n : B_n \to \{0,1\}^*$. The code $C_n$ operates at distortion level $D$ if $\rho_n(x_1^n, \phi_n(x_1^n)) \le D$ for all $x_1^n \in A^n$. The figure of merit here is, of course, given by the length function $L_n^{(\mathrm{true})}$ of the code $C_n$:

    $L_n^{(\mathrm{true})}(x_1^n) = \text{length of } \psi_n(\phi_n(x_1^n)) \text{ bits}.$

As shown in [3], there is a precise correspondence between compression algorithms and probability distributions $Q$ on $\hat{A}^n$. Similar to the lossless case, this correspondence is expressed in terms of the idealized lossy Shannon code-lengths

    $L_n(x_1^n) = -\log Q(B(x_1^n, D)) \text{ bits}. \qquad (1)$
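To make the idealized lossy code-length $-\log Q(B(x_1^n, D))$ concrete, here is a minimal numerical sketch (not from the paper): it assumes a binary alphabet $A = \hat{A} = \{0,1\}$, average Hamming distortion for $\rho_n$, and an i.i.d. Bernoulli($q$) reproduction distribution $Q$, and computes the ball probability by brute-force enumeration, which is feasible only for small $n$.

```python
import itertools
import math

def hamming(x, y):
    """Average Hamming distortion between two binary tuples."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def lossy_code_length(x, D, q):
    """Idealized lossy Shannon code-length -log2 Q(B(x, D)) in bits,
    where Q is an i.i.d. Bernoulli(q) distribution on {0,1}^n and the
    distortion measure is average Hamming distance (brute force over 2^n)."""
    n = len(x)
    ball_prob = 0.0
    for y in itertools.product((0, 1), repeat=n):
        if hamming(x, y) <= D:
            k = sum(y)  # number of ones in the candidate reproduction
            ball_prob += (q ** k) * ((1 - q) ** (n - k))
    return -math.log2(ball_prob)

x = (1, 0, 1, 1, 0, 1, 0, 1)
print(lossy_code_length(x, 0.0, 0.5))   # D=0: ball is {x}, so -log2 Q(x) = 8.0 bits
print(lossy_code_length(x, 0.25, 0.5))  # larger ball, fewer bits (about 2.79)
```

Note how $D = 0$ recovers the lossless code-length $-\log Q(x_1^n)$, while increasing $D$ enlarges the ball and thus shortens the idealized code-length.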
More specifically, for any code $C_n$ with length function $L_n$ operating at distortion level $D$, there is a probability distribution $Q_n$ on $\hat{A}^n$ such that $L_n^{(\mathrm{true})}(x_1^n) \ge -\log Q_n(B(x_1^n, D))$ for all $x_1^n \in A^n$. Conversely, for any source $\{X_n\}$ and any sequence of "admissible" distributions $\{Q_n\}$, there is a sequence of codes $\{C_n, L_n\}$ operating at distortion level $D$ such that, for (almost) any realization of the source, $L_n^{(\mathrm{true})}(X_1^n) \le -\log Q_n(B(X_1^n, D)) + O(\log n)$, eventually with probability 1 (henceforth written "w.p.1").

Thus motivated, we pose the problem of selecting a "good" code among a given family as the statistical estimation problem of selecting one of the available probability distributions $\{Q_\theta ; \theta \in \Theta\}$ on the reproduction space.

2 Inference and Lossy Compression

In the lossless case, the problem of optimal compression is, theoretically at least, equivalent to finding a probability distribution $Q$ that in some sense minimizes the code-lengths $-\log Q(x_1^n)$. In that case the optimal choice is simply to take $Q$ to be the true source distribution.

As it turns out [3], in the lossy case the best choice is to take $Q$ to be the optimal reproduction distribution $Q^*$ on $\hat{A}^n$, i.e., the optimal output distribution in Shannon's rate-distortion problem. Thus, our goal here is to do statistical inference aimed at estimating this distribution $Q^*$, and not the true source distribution, from the data.

More generally, given a family of probability distributions $\{Q_\theta ; \theta \in \Theta\}$ on the reproduction space, we want to choose the one whose limiting coding rate, call it $R(P, \theta, D)$,

    $R(P, \theta, D) \triangleq \lim_{n \to \infty} -\frac{1}{n} \log Q_\theta(B(X_1^n, D)) \text{ bits/symbol}, \qquad (2)$

is as small as possible. If $Q^* = Q_{\theta^*}$ happens to be in the above class, then of course $R(P, \theta^*, D)$ is simply the rate-distortion function of the source. But in general we do not always require that to be the case, and we think of our target distribution $Q_{\theta^*}$ as the one corresponding to $\theta^* = \arg\min_\theta R(P, \theta, D)$.
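As an illustration of the selection rule $\theta^* = \arg\min_\theta R(P, \theta, D)$, the following toy sketch (again an assumption-laden example, not the paper's construction) compares the empirical coding rates $-\frac{1}{n}\log Q_\theta(B(x_1^n, D))$ over a small finite family of i.i.d. Bernoulli($q$) reproduction distributions under average Hamming distortion and returns the parameter with the smallest rate.

```python
import itertools
import math

def ball_log_prob(x, D, q):
    """log2 Q_q(B(x, D)) for an i.i.d. Bernoulli(q) distribution on {0,1}^n,
    under average Hamming distortion, by brute-force enumeration."""
    n = len(x)
    p = 0.0
    for y in itertools.product((0, 1), repeat=n):
        if sum(a != b for a, b in zip(x, y)) / n <= D:
            k = sum(y)  # number of ones in the candidate reproduction
            p += (q ** k) * ((1 - q) ** (n - k))
    return math.log2(p)

def best_theta(x, D, thetas):
    """Pick the parameter q in `thetas` minimizing the empirical coding
    rate -(1/n) log2 Q_q(B(x, D)) -- a finite-family stand-in for
    theta* = argmin_theta R(P, theta, D)."""
    n = len(x)
    return min(thetas, key=lambda q: -ball_log_prob(x, D, q) / n)

x = (1, 1, 1, 0, 1, 1, 0, 1)  # a sample that is mostly ones
print(best_theta(x, 0.1, [0.25, 0.5, 0.75]))  # → 0.75
```

For this mostly-ones sample the ones-heavy reproduction distribution $q = 0.75$ yields the largest ball probability and hence the smallest empirical coding rate, as one would expect.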
Intuitively, we think of $Q_{\theta^*}$ as the simplest distribution describing all the regularity in the data, with accuracy no worse than $D$.

3 A Lossy MDL Principle

A natural way to estimate the optimal $\theta^*$ empirically is to try to minimize the idealized code-lengths (1), or equivalently to maximize the probabilities $Q_\theta(B(X_1^n, D))$. We thus define the Lossy Maximum Likelihood Estimate (LMLE) as

    $\hat{\theta}_n^{\mathrm{LML}} = \arg\min_{\theta \in \Theta} \left[ -\log Q_\theta(B(X_1^n, D)) \right]. \qquad (3)$

Our first main result (Theorem 1 below) states that, under very general conditions [1][4], this estimate is consistent as $n \to \infty$. But, as with the classical (lossless) maximum likelihood estimate (MLE), $\hat{\theta}_n^{\mathrm{LML}}$ also has some undesirable properties. First, the infimum in the definition of $\hat{\theta}_n^{\mathrm{LML}}$ is not really a code-length; if we choose one of the $\theta$'s based on the data, we should also describe the chosen $\theta$ itself. Indeed, there are examples [1] where $\hat{\theta}_n^{\mathrm{LML}}$ is