By this species of argument, stochastic models are practically always a stop-gap approximation. Take stochastic queue theory, for example, by which one can give a probabilistic model of how many trucks will be arriving at given depots in a transportation system. One could argue that if we could just model everything about the state of the trucks and the conditions of the roads, the location of every nail that might cause a flat and every drunk driver that might cause an accident, then we could in principle predict deterministically how many trucks will be arriving at any depot at any time, and there is no need of stochastic queue theory. Stochastic queue theory is only an approximation in lieu of information that it is impractical to collect.

But this argument is flawed. If we have a complex deterministic system, and if we have access to the initial conditions in complete detail, so that we can compute the state of the system unerringly at every point in time, a simpler stochastic description may still be more insightful. To use a dirty word, some properties of the system are genuinely emergent, and a stochastic account is not just an approximation; it provides more insight than identifying every deterministic factor. Or to use a different dirty word, it is a reductionist error to reject a successful stochastic account and insist that only a more complex, lower-level, deterministic model advances scientific understanding.

4.2 Chomsky v. Shannon

In one's introductory linguistics course, one learns that Chomsky disabused the field once and for all of the notion that there was anything of interest to statistical models of language. But one usually comes away a little fuzzy on the question of what, precisely, he proved.

The arguments of Chomsky's that I know are from "Three Models for the Description of Language" [5] and Syntactic Structures [6] (essentially the same argument repeated in both places), and from the Handbook of Mathematical Psychology, chapter 13 [17]. I think the first argument in Syntactic Structures is the best known. It goes like this.

    Neither (a) 'colorless green ideas sleep furiously' nor (b) 'furiously sleep ideas green colorless', nor any of their parts, has ever occurred in the past linguistic experience of an English speaker. But (a) is grammatical, while (b) is not.

This argument only goes through if we assume that if the frequency of a sentence or 'part' is zero in a training sample, its probability is zero. But in fact, there is quite a literature on how to estimate the probabilities of events that do not occur in the sample, and in particular how to distinguish real zeros from zeros that just reflect something that is missing by chance.
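To make the point concrete, here is a minimal sketch using the crudest estimator of that kind, add-one (Laplace) smoothing, on a tiny invented training sample. The more refined estimators in that literature (Good-Turing and back-off methods, among others) differ in detail but serve the same purpose; the toy corpus and the choice of add-one smoothing are illustrative assumptions only, not a claim about how such estimation is best done.

```python
from collections import Counter

# Toy training sample, invented purely for illustration.
corpus = [
    "revolutionary new ideas appear infrequently",
    "colorless liquids sleep soundly",
    "green ideas irritate philosophers",
]

sentences = [("<s> " + line + " </s>").split() for line in corpus]
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
V = len(unigrams)  # vocabulary size, used by add-one smoothing

def p_next(prev, w):
    """P(w | prev) with add-one smoothing: an unseen bigram gets a small
    nonzero probability instead of zero."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def p_sentence(sentence):
    """Bigram probability of a whole sentence, including boundary markers."""
    words = ("<s> " + sentence + " </s>").split()
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_next(prev, w)
    return p

print(p_sentence("colorless green ideas sleep furiously"))   # (a)
print(p_sentence("furiously sleep ideas green colorless"))   # (b)
# Neither string occurs in the sample, yet neither receives probability
# zero; with this toy sample, (a) happens to share a few bigrams with the
# training text and so comes out more probable than its reversal (b).
```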
Chomsky also gives a more general argument:

    If we rank the sequences of a given length in order of statistical approximation to English, we will find both grammatical and ungrammatical sequences scattered throughout the list; there appears to be no particular relation between order of approximation and grammaticalness.

The reason is that, for any n, there are sentences with grammatical dependencies spanning more than n words, so that no nth-order statistical approximation can sort out the grammatical from the ungrammatical examples. In a word, you cannot define grammaticality in terms of probability.

It is clear from context that 'statistical approximation to English' is a reference to nth-order Markov models, as discussed by Shannon. Chomsky is saying that there is no way to choose n and ε such that, for all sentences s,

    grammatical(s) ↔ P_n(s) > ε

where P_n(s) is the probability of s according to the 'best' nth-order approximation to English.

But Shannon himself was careful to call attention to precisely this point: that for any n, there will be some dependencies affecting the well-formedness of a sentence that an nth-order model does not capture. The point of Shannon's approximations is that, as n increases, the total mass of ungrammatical sentences that are erroneously assigned nonzero probability decreases. That is, we can in fact define grammaticality in terms of probability, as follows:

    grammatical(s) ↔ lim_{n→∞} P_n(s) > 0

A third variant of the argument appears in the Handbook. There Chomsky states that parameter estimation is impractical for an nth-order Markov model where n is large enough "to give a reasonable fit to ordinary usage". He emphasizes that the problem is not just an inconvenience for statisticians, but renders the model untenable as a model of human language acquisition: "we cannot seriously propose that a child learns the values of 10^9 parameters in a childhood lasting only 10^8 seconds."

This argument is also only partially valid. If it takes at least a second to estimate each parameter, and parameters are estimated sequentially, the argument is correct. But if parameters are estimated in parallel, say, by a high-dimensional iterative or gradient-pursuit method, all bets are off. Nonetheless, I think even the most hardcore statistical types are willing to admit that Markov models represent a brute-force approach, and are not an adequate basis for psychological models of language processing.

However, the inadequacy of Markov models is not that they are statistical, but that they are statistical versions of finite-state automata! Each of Chomsky's arguments turns on the fact that Markov models are finite-state, not on the fact that they are stochastic. None of his criticisms is applicable to stochastic models generally. More sophisticated stochastic models do exist: stochastic context-free grammars are well understood, and stochastic versions of Tree-Adjoining Grammar [18], GB [8], and HPSG [3] have been proposed.

In fact, probabilities make Markov models more adequate than their non-probabilistic counterparts, not less adequate. Markov models are surprisingly effective, given their finite-state substrate. For example, they are the workhorse of speech recognition technology. Stochastic grammars can also be easier to learn than their non-stochastic counterparts. For example, though Gold [9] showed that the class of context-free grammars is not learnable, Horning [13] showed that the class of stochastic context-free grammars is learnable.

In short, Chomsky's arguments do not bear at all on the probabilistic nature of Markov models, only on the fact that they are finite-state. His arguments are not by any stretch of the imagination a sweeping condemnation of statistical methods.
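To make the contrast with finite-state models concrete, here is a minimal sketch of a stochastic context-free grammar assigning a probability to one parse of 'colorless green ideas sleep furiously'. The grammar and its rule probabilities are invented for illustration rather than estimated from data, and the sketch scores a single hand-built parse; a real system would sum over all parses (for example, with the inside algorithm).

```python
# Toy stochastic (probabilistic) context-free grammar. The rules and their
# probabilities are invented for illustration; for each left-hand side the
# probabilities of its expansions sum to 1.
RULES = {
    ("S",   ("NP", "VP")):   1.0,
    ("NP",  ("Adj", "NP")):  0.4,
    ("NP",  ("N",)):         0.6,
    ("VP",  ("V", "Adv")):   0.3,
    ("VP",  ("V",)):         0.7,
    ("Adj", ("colorless",)): 0.5,
    ("Adj", ("green",)):     0.5,
    ("N",   ("ideas",)):     1.0,
    ("V",   ("sleep",)):     1.0,
    ("Adv", ("furiously",)): 1.0,
}

def tree_prob(tree):
    """Probability of a parse tree: the product of the probabilities of the
    rules used in it. A tree is (label, child, ...); a leaf is a bare word."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

# One parse of 'colorless green ideas sleep furiously'.
parse = ("S",
         ("NP", ("Adj", "colorless"),
                ("NP", ("Adj", "green"),
                       ("NP", ("N", "ideas")))),
         ("VP", ("V", "sleep"), ("Adv", "furiously")))

print(tree_prob(parse))  # 0.0072 with these invented probabilities
```

Because the probabilities attach to the rules of a hierarchical grammar rather than to n-word windows, a model of this kind is not subject to the finite-state limitation on which Chomsky's arguments turn.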


5 Conclusion

In closing, let me repeat the main line of
