By this species of argument, stochastic models are practically always a stop-gap approximation. Take stochastic queue theory, for example, by which one can give a probabilistic model of how many trucks will be arriving at given depots in a transportation system. One could argue that if we could just model everything about the state of the trucks and the conditions of the roads, the location of every nail that might cause a flat and every drunk driver that might cause an accident, then we could in principle predict deterministically how many trucks will be arriving at any depot at any time, and there is no need of stochastic queue theory. Stochastic queue theory is only an approximation in lieu of information that it is impractical to collect.

But this argument is flawed. If we have a complex deterministic system, and if we have access to the initial conditions in complete detail, so that we can compute the state of the system unerringly at every point in time, a simpler stochastic description may still be more insightful. To use a dirty word, some properties of the system are genuinely emergent, and a stochastic account is not just an approximation; it provides more insight than identifying every deterministic factor. Or to use a different dirty word, it is a reductionist error to reject a successful stochastic account and insist that only a more complex, lower-level, deterministic model advances scientific understanding.

4.2 Chomsky v. Shannon

In one's introductory linguistics course, one learns that Chomsky disabused the field once and for all of the notion that there was anything of interest to statistical models of language. But one usually comes away a little fuzzy on the question of what, precisely, he proved.

The arguments of Chomsky's that I know are from "Three Models for the Description of Language" [5] and Syntactic Structures [6] (essentially the same argument repeated in both places), and from the Handbook of Mathematical Psychology, chapter 13 [17]. I think the first argument in Syntactic Structures is the best known. It goes like this.

    Neither (a) 'colorless green ideas sleep furiously' nor (b) 'furiously sleep ideas green colorless', nor any of their parts, has ever occurred in the past linguistic experience of an English speaker. But (a) is grammatical, while (b) is not.

This argument only goes through if we assume that if the frequency of a sentence or 'part' is zero in a training sample, its probability is zero. But in fact, there is quite a literature on how to estimate the probabilities of events that do not occur in the sample, and in particular how to distinguish real zeros from zeros that just reflect something that is missing by chance.
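To make the point concrete, here is a minimal sketch using the crudest estimator of that kind, add-one (Laplace) smoothing, on a tiny invented training sample. The more refined estimators in that literature (Good-Turing and back-off methods, among others) differ in detail but serve the same purpose; the toy corpus and the choice of add-one smoothing are illustrative assumptions only, not a claim about how such estimation is best done.

```python
from collections import Counter

# Toy training sample, invented purely for illustration.
corpus = [
    "revolutionary new ideas appear infrequently",
    "colorless liquids sleep soundly",
    "green ideas irritate philosophers",
]

sentences = [("<s> " + line + " </s>").split() for line in corpus]
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
V = len(unigrams)  # vocabulary size, used by add-one smoothing

def p_next(prev, w):
    """P(w | prev) with add-one smoothing: an unseen bigram gets a small
    nonzero probability instead of zero."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def p_sentence(sentence):
    """Bigram probability of a whole sentence, including boundary markers."""
    words = ("<s> " + sentence + " </s>").split()
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_next(prev, w)
    return p

print(p_sentence("colorless green ideas sleep furiously"))   # (a)
print(p_sentence("furiously sleep ideas green colorless"))   # (b)
# Neither string occurs in the sample, yet neither receives probability
# zero; with this toy sample, (a) happens to share a few bigrams with the
# training text and so comes out more probable than its reversal (b).
```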
Chomsky also gives a more general argument:

    If we rank the sequences of a given length in order of statistical approximation to English, we will find both grammatical and ungrammatical sequences scattered throughout the list; there appears to be no particular relation between order of approximation and grammaticalness.

The reason is that, for any n, there are sentences with grammatical dependencies spanning more than n words, so that no nth-order statistical approximation can sort out the grammatical from the ungrammatical examples. In a word, you cannot define grammaticality in terms of probability.

It is clear from context that 'statistical approximation to English' is a reference to nth-order Markov models, as discussed by Shannon. Chomsky is saying that there is no way to choose n and ε such that, for all sentences s,

    grammatical(s) ↔ P_n(s) > ε

where P_n(s) is the probability of s according to the 'best' nth-order approximation to English.

But Shannon himself was careful to call attention to precisely this point: that for any n, there will be some dependencies affecting the well-formedness of a sentence that an nth-order model does not capture. The point of Shannon's approximations is that, as n increases, the total mass of ungrammatical sentences that are erroneously assigned nonzero probability decreases. That is, we can in fact define grammaticality in terms of probability, as follows:

    grammatical(s) ↔ lim_{n→∞} P_n(s) > 0

A third variant of the argument appears in the Handbook. There Chomsky states that parameter estimation is impractical for an nth-order Markov model where n is large enough "to give a reasonable fit to ordinary usage". He emphasizes that the problem is not just an inconvenience for statisticians, but renders the model untenable as a model of human language acquisition: "we cannot seriously propose that a child learns the values of 10^9 parameters in a childhood lasting only 10^8 seconds."

This argument is also only partially valid. If it takes at least a second to estimate each parameter, and parameters are estimated sequentially, the argument is correct. But if parameters are estimated in parallel, say, by a high-dimensional iterative or gradient-pursuit method, all bets are off. Nonetheless, I think even the most hardcore statistical types are willing to admit that Markov models represent a brute-force approach, and are not an adequate basis for psychological models of language processing.

However, the inadequacy of Markov models is not that they are statistical, but that they are statistical versions of finite-state automata! Each of Chomsky's arguments turns on the fact that Markov models are finite-state, not on the fact that they are stochastic. None of his criticisms is applicable to stochastic models generally. More sophisticated stochastic models do exist: stochastic context-free grammars are well understood, and stochastic versions of Tree-Adjoining Grammar [18], GB [8], and HPSG [3] have been proposed.

In fact, probabilities make Markov models more adequate than their non-probabilistic counterparts, not less adequate. Markov models are surprisingly effective, given their finite-state substrate. For example, they are the workhorse of speech recognition technology. Stochastic grammars can also be easier to learn than their non-stochastic counterparts. For example, though Gold [9] showed that the class of context-free grammars is not learnable, Horning [13] showed that the class of stochastic context-free grammars is learnable.

In short, Chomsky's arguments do not bear at all on the probabilistic nature of Markov models, only on the fact that they are finite-state. His arguments are not by any stretch of the imagination a sweeping condemnation of statistical methods.
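To make the contrast with finite-state models concrete, here is a minimal sketch of a stochastic context-free grammar assigning a probability to one parse of 'colorless green ideas sleep furiously'. The grammar and its rule probabilities are invented for illustration rather than estimated from data, and the sketch scores a single hand-built parse; a real system would sum over all parses (for example, with the inside algorithm).

```python
# Toy stochastic (probabilistic) context-free grammar. The rules and their
# probabilities are invented for illustration; for each left-hand side the
# probabilities of its expansions sum to 1.
RULES = {
    ("S",   ("NP", "VP")):   1.0,
    ("NP",  ("Adj", "NP")):  0.4,
    ("NP",  ("N",)):         0.6,
    ("VP",  ("V", "Adv")):   0.3,
    ("VP",  ("V",)):         0.7,
    ("Adj", ("colorless",)): 0.5,
    ("Adj", ("green",)):     0.5,
    ("N",   ("ideas",)):     1.0,
    ("V",   ("sleep",)):     1.0,
    ("Adv", ("furiously",)): 1.0,
}

def tree_prob(tree):
    """Probability of a parse tree: the product of the probabilities of the
    rules used in it. A tree is (label, child, ...); a leaf is a bare word."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

# One parse of 'colorless green ideas sleep furiously'.
parse = ("S",
         ("NP", ("Adj", "colorless"),
                ("NP", ("Adj", "green"),
                       ("NP", ("N", "ideas")))),
         ("VP", ("V", "sleep"), ("Adv", "furiously")))

print(tree_prob(parse))  # 0.0072 with these invented probabilities
```

Because the probabilities attach to the rules of a hierarchical grammar rather than to n-word windows, a model of this kind is not subject to the finite-state limitation on which Chomsky's arguments turn.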


5 Conclusion

In closing, let me repeat the main line of
