Cognition 61 (1996) 161-193

A language learning model for finite parameter spaces

Partha Niyogi*, Robert C. Berwick
Center for Biological and Computational Learning, Massachusetts Institute of Technology, E25-201, Cambridge, MA 02142, USA

* Corresponding author. Fax: 1 617 253 5060; e-mail: [email protected], [email protected].

Abstract

This paper shows how to formally characterize language learning in a finite parameter space, for instance, in the principles-and-parameters approach to language, as a Markov structure. New language learning results follow directly; we can explicitly calculate how many positive examples on average ("sample complexity") it will take for a learner to correctly identify a target language with high probability. We show how sample complexity varies with input distributions and learning regimes. In particular we find that the average time to converge under reasonable language input distributions for a simple three-parameter system first described by Gibson and Wexler (1994) is psychologically plausible, in the range of 100-150 positive examples. We further find that a simple random step algorithm - that is, simply jumping from one language hypothesis to another rather than changing one parameter at a time - works faster and always converges to the right target language, in contrast to the single-step, local parameter setting method advocated in some recent work.

1. Introduction: language acquisition in finite parameter spaces

With the advent of the "principles-and-parameters" approach to language (Chomsky, 1981), the question of language learnability can again be raised in a new context. We take "principles-and-parameters" approaches to encompass such diverse linguistic theories as government and binding theory (including its current minimalist incarnations); head-driven phrase structure grammar (HPSG); and current lexical-functional grammar (LFG). In each of these frameworks it seems to be possible to characterize the class of possible target (learnable) grammars (or languages) as fixed by the parametric variation of a finite number of discrete variables. Learning a grammar (language) involves fixing the values of these parameters. The notion of a finite parameterization for grammars and learning extends even to phonological systems, such as stress, described by Dresher and Kaye (1990), and perhaps even lexical knowledge, if parameterized along the lines discussed by Hale and Keyser (1993) and others.

The classic example of a finite parameterization, common to GB, HPSG, and LFG, is X-bar theory: each of the theories above assumes a basic phrase structure determined by an unordered template of the form {Head, Complement}, where Head is one of the lexical categories Noun, Verb, Preposition, ..., and Complement is some list of (possibly argument) phrases. (Under some recent accounts, Heads may be extended to non-lexical or functional categories such as Inf(lection) or Tense, but we shall not depend on such details in the sequel.) By fixing the order Head first, we get languages like English, French, and so forth; the other possibility, Head final, applies to languages like Japanese, German, and the like.
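To make the finite-parameter picture concrete, here is a minimal Python sketch (an illustration, not part of the paper) that enumerates the 2**3 = 8 candidate grammars generated by three binary parameters, in the spirit of the three-parameter system of Gibson and Wexler (1994) discussed below; the parameter names are placeholders, not the paper's.

    from itertools import product

    # Illustrative only: three hypothetical binary parameters; the names are
    # placeholders, not taken from the paper.
    PARAMETERS = ["spec_first", "comp_first", "verb_second"]

    # Each grammar is one assignment of values to the parameters, so a
    # three-parameter space yields 2**3 = 8 candidate grammars.
    grammars = [dict(zip(PARAMETERS, values))
                for values in product([False, True], repeat=len(PARAMETERS))]

    for grammar in grammars:
        print(grammar)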
However, as emphasized particularly by Wexler in a series of works (Hamburger and Wexler, 1977; Wexler and Culicover, 1980; and Gibson and Wexler, 1994), the finite character of these hypothesis spaces does not solve the language acquisition problem. As Chomsky noted in Aspects of the Theory of Syntax (Chomsky, 1965), the key point is how the space of possible grammars - even if finite - is "scattered" with respect to the primary language input data. It is logically possible for just two grammars (or languages) to be so near each other that they are not separable by psychologically realistic input data. This was the thrust of Hamburger and Wexler's and Wexler and Culicover's earlier work on the learnability of transformational grammars from simple data (with at most two embeddings). Note that this is essentially a question of how many examples it will take to identify a target language - in other words, a sample complexity problem. More recently, Gibson and Wexler (1994) show that more subtle difficulties can arise in the principles-and-parameters framework: given a linguistically plausible three-parameter space, some target languages are not learnable from some initial grammatical hypotheses, given plausible positive-only input.

This article provides a complete mathematical model for analyzing Chomsky's informal notion of "scattering" and particular learnability results like those of Gibson and Wexler in a more general and precise framework. Our central goal is to focus on the question of convergence time: how many positive examples will it take to reach a target grammar, under varying assumptions about learning procedures, input data, possible grammars, and so forth. In this sense we can now exactly quantify the amount of data needed to learn (natural) languages.

As a way of organizing our analysis, we may consider the language learning problem to vary along five familiar dimensions: (1) the type of learning algorithm involved; (2) the distribution of the input data; (3) the presence or absence of noise or extraneous examples; (4) the use of memory; (5) the parameterization of the language space itself (the class of possible grammars/languages).

Our central observation is that one can model learning in a finite grammar (language) space by memoryless algorithms (like the Triggering Learning Algorithm and others discussed below) completely and mathematically precisely by a Markov process. The states in the Markov process denote possible languages or grammars (henceforth, we shall use the terms "language" and "grammar" interchangeably where no confusion would arise). We can then apply standard Markov theory to exactly compute the number of examples required to
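As a concrete illustration of the "standard Markov theory" step, the Python sketch below computes the expected number of examples before absorption into the target grammar from the fundamental matrix N = (I - Q)^(-1), where Q holds the transition probabilities among the non-target hypothesis states. The transition probabilities here are invented for illustration; in the paper they are derived from the learning algorithm and the input distribution.

    import numpy as np

    # Q: transition probabilities among the transient (non-target) grammar
    # states only; the remaining probability mass in each row flows toward the
    # absorbing target state. These numbers are made up for illustration.
    Q = np.array([
        [0.6, 0.2],
        [0.1, 0.7],
    ])

    # Fundamental matrix of the absorbing chain: N = (I - Q)^(-1).
    N = np.linalg.inv(np.eye(Q.shape[0]) - Q)

    # Expected number of examples consumed before reaching the target,
    # starting from each non-target hypothesis: the row sums of N.
    expected_examples = N.sum(axis=1)
    print(expected_examples)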