CS 294-5: Statistical Natural Language Processing
Course Introduction
Lecture 1: 8/29/05

Course Info
- Meeting times
  - Lectures: Monday / Wednesday 1-2:30pm
  - Office hours: Thursday 4-5pm, Friday 1-2pm
- Communication
  - Web page: www.cs.berkeley.edu/~klein/cs294-5
  - My email: [email protected]
  - Course newsgroup: ucb.class.cs294-5 (link to webnews on the web page)
- Questionnaires!

Accounts and Access
- Accounts
  - Data and code available on: CS instructional accounts, EECS research accounts, Millennium accounts
  - Make sure you have one of them working
  - More details on resources in assignment 1 (next class), but sort out access ASAP
- Computing resources
  - Lab resources may not be enough
  - Recommendation: start assignments early to find out
  - NLP cluster on the Millennium network; signups later in the term

The Dream
- It'd be great if machines could:
  - Process our email (usefully)
  - Translate languages accurately
  - Help us manage, summarize, and aggregate information
  - Use speech as a UI (when needed)
  - Talk to us / listen to us
- But they can't:
  - Language is complex, ambiguous, flexible, and subtle
  - Good solutions need both linguistics and machine learning knowledge
- So:

What is NLP?
- Fundamental goal: deep understanding of broad-coverage language
- Not just string processing or keyword matching!
- End systems that we want to build:
  - Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering…
  - Modest: spelling correction, text categorization…

Automatic Speech Recognition (ASR)
- Audio in, text out
- SOTA: error rates of 0.3% for digit strings, 5% for dictation, 50%+ for TV speech

Text to Speech (TTS)
- Text in, audio out
- SOTA: totally intelligible (if sometimes unnatural)

Speech Systems
- Speech systems currently:
  - Model the speech signal (later part of term)
  - Model language (next class)
- "Speech Lab"

Machine Translation
- Translation systems encode:
  - Something about fluent language (next class)
  - Something about how two languages correspond (middle of term)
- SOTA: for easy language pairs, better than nothing, but more an understanding aid than a replacement for human translators

Information Extraction
- Information Extraction (IE): unstructured text to database entries
- SOTA: perhaps 70% accuracy for multi-sentence templates, 90%+ for single easy fields
- Example text: "New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent."
- Extracted records:

  Person           | Company                  | Post                          | State
  Russell T. Lewis | New York Times newspaper | president and general manager | start
  Russell T. Lewis | New York Times newspaper | executive vice president      | end
  Lance R. Primis  | New York Times Co.       | president and CEO             | start

Question Answering
- Question Answering: more than search
- Ask general comprehension questions of a document collection
- Can be really easy: "What's the capital of Wyoming?"
- Can be harder: "How many US states' capitals are also their largest cities?"
- Can be open ended: "What are the main issues in the global warming debate?"
- SOTA: can do factoids, even when the text isn't a perfect match

What is Nearby NLP?
- Computational Linguistics
  - Using computational methods to learn more about how language works
  - We end up doing this and using it
- Cognitive Science
  - Figuring out how the human brain works
  - Includes the bits that do language
  - Humans: the only working NLP prototype!
- Speech?
  - Mapping audio signals to text
  - Traditionally separate from NLP; converging?
  - Two components: acoustic models and language models
  - Language models are in the domain of statistical NLP

What is This Class?
- Three aspects to the course:
  - Linguistic issues
    - What is the range of language phenomena?
    - What are the knowledge sources that let us disambiguate?
    - What representations are appropriate?
  - Technical methods
    - Learning and parameter estimation
    - Increasingly complex model structures
    - Efficient algorithms: dynamic programming, search
  - Engineering methods
    - Issues of scale
    - Sometimes, very ugly hacks
- We'll focus on what makes the problems hard, and what works in practice…

Class Requirements and Goals
- Class requirements
  - Uses a variety of skills / knowledge:
    - Basic probability and statistics
    - Basic linguistics background
    - Decent coding skills (Java)
  - Most people are probably missing one of the above
  - We'll address some review concepts in sections, TBD
- Class goals
  - Learn the issues and techniques of statistical NLP
  - Build the real tools used in NLP (language models, taggers, parsers, translation systems)
  - Be able to read current research papers in the field
  - See where the gaping holes in the field are!

Course Work
- Readings
  - Texts: Manning and Schütze (available online); Jurafsky and Martin
  - Both on reserve in the Engineering library
  - Papers (on web page)
- Assignments
  - 5 individual coding assignments (60% of grade)
  - 7 late days, at most 3 per assignment
  - Lowest score dropped
  - Substantial programming in Java 1.5
  - Evaluated by write-ups
  - 1 group final project (40% of grade)

Some Early NLP History
- 1950s: Foundational work: automata, information theory, etc.
  - First speech systems
  - Machine translation (MT) hugely funded by military (imagine that)
  - Toy models: MT using basically word substitution
  - Optimism!
- 1960s and 1970s: NLP winter
  - Bar-Hillel (FAHQT) and ALPAC reports kill MT
  - Work shifts to deeper models, syntax
  - … but toy domains / grammars (SHRDLU, LUNAR)
- 1980s: The empirical revolution
  - Expectations get reset
  - Corpus-based methods become central
  - Deep analysis often traded for robust and simple approximations
  - Evaluate everything

Classical NLP: Parsing
- Write symbolic or logical rules:

  Grammar (CFG):          Lexicon:
    ROOT → S                NN → interest
    S → NP VP               NNS → raises
    NP → DT NN              VBP → interest
    NP → NN NNS             VBZ → raises
    NP → NP PP              …
    VP → VBP NP
    VP → VBP NP PP
    PP → IN NP

- Use deduction systems to prove parses from words
- Minimal grammar on the "Fed raises" sentence: 36 parses
- Simple 10-rule grammar: 592 parses
- Real-size grammar: many millions of parses
- This scaled very badly and didn't yield broad-coverage tools

NLP: Annotation
John bought a blue
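The parse-count blow-up on the classical parsing slide is easy to reproduce. The sketch below is a minimal CKY-style parse counter in Python; the five binary rules and the small lexicon are illustrative assumptions of mine (much smaller than the slide's grammar fragment), chosen so the sentence "Fed raises interest rates" gets exactly its two classic readings:

```python
from collections import defaultdict

# Toy CKY parse counter for a CFG in Chomsky normal form.
# Grammar and lexicon are illustrative assumptions, loosely based on the
# slide's "Fed raises" example; they are not the course's actual grammar.

def count_parses(words, lexicon, rules, start="S"):
    n = len(words)
    # chart[(i, j)][X] = number of derivations of symbol X over words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for tag in lexicon[w]:
            chart[(i, i + 1)][tag] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for parent, (left, right) in rules:
                    chart[(i, j)][parent] += (
                        chart[(i, k)][left] * chart[(k, j)][right]
                    )
    return chart[(0, n)][start]

rules = [
    ("S",  ("NP",  "VP")),
    ("NP", ("NNP", "NNS")),   # "Fed raises" as a noun compound
    ("NP", ("NN",  "NNS")),   # "interest rates" as a noun compound
    ("VP", ("VBZ", "NP")),
    ("VP", ("VBP", "NP")),
]
# Each word lists every category it can fill (folding in unary NP rules).
lexicon = {
    "Fed":      {"NNP", "NP"},
    "raises":   {"NNS", "VBZ"},
    "interest": {"NN",  "VBP"},
    "rates":    {"NNS", "NP"},
}

print(count_parses("Fed raises interest rates".split(), lexicon, rules))  # → 2
```

The two derivations are "(Fed) (raises (interest rates))" and "((Fed raises) (interest rates))". Because the chart multiplies counts across split points, every extra rule or tag ambiguity compounds, which is why the slide's modest grammars already yield dozens to hundreds of parses.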