CS 294-5: Statistical Natural Language Processing
Dan Klein
MF 1-2:30pm, Soda Hall 310

Course Info
- Meeting times
  - Lectures: MF 1-2:30pm (1:00-1:45, break, 1:50-2:30)
  - Office hours: W 3-5pm (but negotiable)
- Communication
  - Web page: www.cs.berkeley.edu/~klein/cs294-5
  - My email: [email protected]
  - Course newsgroup: ucb.class.cs294-5
- Assignments
  - Readings (M+S, J+M, papers)
  - 5 small projects, 1 final project
  - Questionnaires!

The Dream
- Language is our UI! It'd be great if machines could:
  - Process our email (usefully)
  - Translate for us
  - Write up our research
  - Talk to us / listen to us
- But they can't:
  - Language is ambiguous
  - Language is flexible
  - Language is complex
  - Language is subtle
- So we use their UIs.

What is NLP?
- Fundamental goal: deep understanding of broad language
  - Not just string processing or keyword matching!
- Applications:
  - Ambitious: machine translation, information extraction, dialog systems, question answering...
  - Modest: spelling correction, text categorization

What is nearby NLP?
- Computational linguistics
  - Using computational methods to learn more about how language works
  - We end up doing this and using it
- Cognitive science
  - Figuring out how the human brain works, including the bits that do language
  - Humans: the only working NLP prototype!
- Speech recognition
  - Mapping audio signals to text
  - Traditionally separate from NLP; converging?
  - Two components: acoustic models and language models
  - Language models are in the domain of stat NLP

What is this Class?
- Three aspects to the course:
  - Linguistic issues
    - What is the range of language phenomena?
    - What are the knowledge sources that let us disambiguate?
    - What representations are appropriate?
  - Technical methods
    - Learning and parameter estimation
    - Increasingly complex model structures
    - Efficient algorithms: dynamic programming, search
  - Engineering methods
    - Issues of scale
    - Memory limitations
    - Sometimes, very ugly hacks
- We'll visit a series of language problems

Class Requirements and Goals
- Class requirements
  - Uses a variety of skills / knowledge:
    - Basic probability and statistics
    - Basic linguistics background
    - Decent coding skills (Java)
  - Most people are probably missing one of the above!
- Class goals
  - Learn the issues and techniques of statistical NLP
  - Build the real tools used in NLP (language models, taggers, parsers, translation systems)
  - Be able to read current research papers in the field
  - See where the gaping holes in the field are!

An Example
- "John bought a blue car"

Language is Ambiguous
- Headlines:
  - Iraqi Head Seeks Arms
  - Ban on Nude Dancing on Governor's Desk
  - Juvenile Court to Try Shooting Defendant
  - Teacher Strikes Idle Kids
  - Stolen Painting Found by Tree
  - Kids Make Nutritious Snacks
  - Local HS Dropouts Cut in Half
  - British Left Waffles on Falkland Islands
  - Clinton Wins on Budget, but More Lies Ahead
  - Hospitals Are Sued by 7 Foot Doctors
- Why are these funny?

Ambiguities Everywhere
- Maybe we're sunk on funny headlines, but normal, boring sentences are unambiguous?
- "Fed raises interest rates 0.5% in a measure against inflation"

More Attachment Ambiguities
(figures in the original slides)

Semantic Ambiguities
- Even correct tree-structured syntactic analyses don't always nail down the meaning
- "Every morning someone's alarm clock wakes me up"
- "John's boss said he was doing better"

Other Levels of Language
- Tokenization/morphology: what are the words, and what is the sub-word structure?
  - Often simple rules work (a period after "Mr" isn't a sentence break)
  - Relatively easy in English; other languages are harder:
    - Segmentation
    - Morphology: sarà andata = be+fut+3sg go+ppt+fem = "she will have gone"
- Discourse: how do sentences relate?
- Pragmatics: what intent is expressed by the literal meaning, and how should one react?

Disambiguation for Applications
- Sometimes life is easy
  - Text classification works pretty well just from the set of words used in the document; same for authorship attribution
  - Word-sense disambiguation is not usually needed for web search because of majority effects or intersection effects ("jaguar habitat" isn't about the car)
- Sometimes only certain ambiguities are relevant: "he hoped to record a world record"
- Other times, all levels can be relevant (e.g., translation)

Some Early NLP History
- 1950s
  - Foundational work: automata, information theory, etc.
  - First speech systems
  - Machine translation (MT) hugely funded by the military (imagine that)
  - Toy models: MT using basically word substitution
  - Optimism!
- 1960s and 1970s: NLP winter
  - The Bar-Hillel (FAHQT) and ALPAC reports kill MT
  - Work shifts to deeper models, syntax
  - ... but toy domains / grammars (SHRDLU, LUNAR)
- 1980s: the empirical revolution
  - Expectations get reset
  - Corpus-based methods
  - Deep analysis often traded for robust and simple approximations
  - Evaluate everything

Classical NLP: Parsing
- Write symbolic or logical rules:
  - Grammar (CFG): ROOT → S, S → NP VP, NP → DT NN, NP → NN NNS, NP → NP PP, VP → VBP NP, VP → VBP NP PP, PP → IN NP, ...
  - Lexicon: NN → interest, NNS → raises, VBP → interest, VBZ → raises, ...
- Use deduction systems to prove parses from words (a chart-parser sketch follows below)
  - Minimal grammar on the "Fed raises" sentence: 36 parses
  - Simple 10-rule grammar: 592 parses
  - Real-size grammar: many millions of parses
- This scaled very badly and didn't yield broad-coverage tools
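To make the deduction-system idea concrete, here is a minimal sketch of CKY-style dynamic programming that counts how many analyses a toy grammar admits, in the course's language (Java, 16+ for records). It is an illustration, not the course's actual code: the class name is made up, and the VP → VBZ NP rule, the unary NP promotions, and the lexicon entries for "Fed" and "rates" are assumed additions to the grammar fragment above.

```java
import java.util.*;

public class CkyParseCounter {
    record Binary(String parent, String left, String right) {}
    record Unary(String parent, String child) {}

    // Binary subset of the toy grammar above; VP -> VBZ NP and the unary
    // NP promotions are assumed additions so single nouns can head NPs.
    static final List<Binary> BINARY = List.of(
            new Binary("S",  "NP",  "VP"),
            new Binary("NP", "NN",  "NNS"),
            new Binary("VP", "VBZ", "NP"),
            new Binary("VP", "VBP", "NP"));
    static final List<Unary> UNARY = List.of(
            new Unary("NP", "NN"),
            new Unary("NP", "NNS"));

    // Toy lexicon from the slide; "Fed" and "rates" are assumed entries.
    static final Map<String, List<String>> LEXICON = Map.of(
            "Fed",      List.of("NN"),
            "raises",   List.of("NNS", "VBZ"),
            "interest", List.of("NN", "VBP"),
            "rates",    List.of("NNS"));

    // chart[i][j] maps a category to the number of distinct analyses
    // of that category over words i..j-1.
    public static long countParses(List<String> words, String goal) {
        int n = words.size();
        Map<String, Long>[][] chart = new HashMap[n + 1][n + 1];
        for (int i = 0; i < n; i++) {
            chart[i][i + 1] = new HashMap<>();
            for (String tag : LEXICON.getOrDefault(words.get(i), List.of()))
                chart[i][i + 1].merge(tag, 1L, Long::sum);
            applyUnaries(chart[i][i + 1]);
        }
        for (int span = 2; span <= n; span++)
            for (int i = 0; i + span <= n; i++) {
                Map<String, Long> cell = chart[i][i + span] = new HashMap<>();
                for (int k = i + 1; k < i + span; k++)
                    for (Binary r : BINARY) {
                        long ways = chart[i][k].getOrDefault(r.left(), 0L)
                                  * chart[k][i + span].getOrDefault(r.right(), 0L);
                        if (ways > 0) cell.merge(r.parent(), ways, Long::sum);
                    }
                applyUnaries(cell);
            }
        return chart[0][n].getOrDefault(goal, 0L);
    }

    // One pass suffices because these unary rules have no chains or cycles.
    static void applyUnaries(Map<String, Long> cell) {
        for (Unary u : UNARY) {
            long ways = cell.getOrDefault(u.child(), 0L);
            if (ways > 0) cell.merge(u.parent(), ways, Long::sum);
        }
    }

    public static void main(String[] args) {
        // Prints 2: [Fed] [raises [interest rates]] and the unintended
        // reading [Fed raises] [interest(V) rates].
        System.out.println(countParses(
                List.of("Fed", "raises", "interest", "rates"), "S"));
    }
}
```

Counting rather than enumerating is exactly the response to the slide's complaint: each cell records how many ways a category spans those words, so the 592 (or millions of) parses never have to be built explicitly, yet their number falls out in polynomial time.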
What Were They Thinking?
- People did know that language was ambiguous!
  - ... but they hoped that all interpretations would be "good" ones (or ruled out pragmatically)
  - ... they didn't realize how bad it would be

Problems (and Solutions?)
- Dark ambiguities: most analyses are shockingly bad (meaning, they don't have an interpretation you can get your mind around)
  - (A slide figure showed one such analysis, whose structure corresponds to the correct parse of "This will panic buyers!")
- Unknown words and new usages
- Solution: we need mechanisms to focus attention on the best analyses; probabilistic techniques do this

Corpora
- A corpus is a collection of text
  - Often annotated in some way
  - Sometimes just lots of text
  - Balanced vs. uniform corpora
- Examples:
  - Newswire collections: 500M+ words
  - Brown corpus: 1M words of tagged "balanced" text
  - Penn Treebank: 1M words of parsed WSJ
  - Canadian Hansards: 10M+ words of aligned French / English sentences
  - The Web: billions of words
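A natural first exercise with any of these corpora is reducing text to counts, the raw material of every statistical model in the course. The sketch below, again in Java, assumes a local file named corpus.txt and uses a deliberately crude regex tokenizer (real corpora need real tokenization, as discussed under "Other Levels of Language"); it prints maximum-likelihood unigram estimates P(w) = count(w) / N for the ten most frequent words.

```java
import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

public class CorpusCounts {
    public static void main(String[] args) throws Exception {
        // "corpus.txt" is a placeholder path, not a real distributed corpus.
        String text = Files.readString(Path.of("corpus.txt")).toLowerCase();

        // Crude tokenization: runs of letters/apostrophes count as words.
        Map<String, Long> counts = new HashMap<>();
        long total = 0;
        Matcher m = Pattern.compile("[a-z']+").matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1L, Long::sum);
            total++;
        }

        // Maximum-likelihood unigram estimate: P(w) = count(w) / N.
        // (Unseen words get probability zero; smoothing is a later topic.)
        final long n = total;
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> System.out.printf("%-15s %8d  P=%.5f%n",
                      e.getKey(), e.getValue(), (double) e.getValue() / n));
    }
}
```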