CS 294-5: Statistical Natural Language Processing
Dan Klein
MF 1-2:30pm, Soda Hall 310

Course Info
- Meeting times
  - Lectures: MF 1-2:30pm (1:00-1:45, break, 1:50-2:30)
  - Office hours: W 3-5pm (but negotiable)
- Communication
  - Web page: www.cs.berkeley.edu/~klein/cs294-5
  - My email: [email protected]
  - Course newsgroup: ucb.class.cs294-5
- Assignments
  - Readings (M+S, J+M, papers)
  - 5 small projects, 1 final project
  - Questionnaires!

The Dream
- Language is our UI! It'd be great if machines could:
  - Process our email (usefully)
  - Translate for us
  - Write up our research
  - Talk to us / listen to us
- But they can't:
  - Language is ambiguous
  - Language is flexible
  - Language is complex
  - Language is subtle
- So we use their UIs.

What is NLP?
- Fundamental goal: deep understanding of broad language
  - Not just string processing or keyword matching!
- Applications:
  - Ambitious: machine translation, information extraction, dialog systems, question answering...
  - Modest: spelling correction, text categorization

What is nearby NLP?
- Computational linguistics
  - Using computational methods to learn more about how language works
  - We end up doing this and using it
- Cognitive science
  - Figuring out how the human brain works, including the bits that do language
  - Humans: the only working NLP prototype!
- Speech recognition
  - Mapping audio signals to text
  - Traditionally separate from NLP; converging?
  - Two components: acoustic models and language models
  - Language models are in the domain of stat NLP

What is this Class?
- Three aspects to the course:
  - Linguistic issues
    - What is the range of language phenomena?
    - What are the knowledge sources that let us disambiguate?
    - What representations are appropriate?
  - Technical methods
    - Learning and parameter estimation
    - Increasingly complex model structures
    - Efficient algorithms: dynamic programming, search
  - Engineering methods
    - Issues of scale
    - Memory limitations
    - Sometimes, very ugly hacks
- We'll visit a series of language problems

Class Requirements and Goals
- Class requirements
  - Uses a variety of skills / knowledge:
    - Basic probability and statistics
    - Basic linguistics background
    - Decent coding skills (Java)
  - Most people are probably missing one of the above!
- Class goals
  - Learn the issues and techniques of statistical NLP
  - Build the real tools used in NLP (language models, taggers, parsers, translation systems)
  - Be able to read current research papers in the field
  - See where the gaping holes in the field are!

An Example
- "John bought a blue car"

Language is Ambiguous
- Headlines:
  - Iraqi Head Seeks Arms
  - Ban on Nude Dancing on Governor's Desk
  - Juvenile Court to Try Shooting Defendant
  - Teacher Strikes Idle Kids
  - Stolen Painting Found by Tree
  - Kids Make Nutritious Snacks
  - Local HS Dropouts Cut in Half
  - British Left Waffles on Falkland Islands
  - Clinton Wins on Budget, but More Lies Ahead
  - Hospitals Are Sued by 7 Foot Doctors
- Why are these funny?

Ambiguities Everywhere
- Maybe we're sunk on funny headlines, but normal, boring sentences are unambiguous?
- "Fed raises interest rates 0.5% in a measure against inflation"

More Attachment Ambiguities
(figures in the original slides)

Semantic Ambiguities
- Even correct tree-structured syntactic analyses don't always nail down the meaning
- "Every morning someone's alarm clock wakes me up"
- "John's boss said he was doing better"

Other Levels of Language
- Tokenization/morphology: what are the words, and what is the sub-word structure?
  - Often simple rules work (a period after "Mr" isn't a sentence break)
  - Relatively easy in English; other languages are harder:
    - Segmentation
    - Morphology: sarà andata = be+fut+3sg go+ppt+fem = "she will have gone"
- Discourse: how do sentences relate?
- Pragmatics: what intent is expressed by the literal meaning, and how should one react?

Disambiguation for Applications
- Sometimes life is easy
  - Text classification works pretty well just from the set of words used in the document; same for authorship attribution
  - Word-sense disambiguation is not usually needed for web search because of majority effects or intersection effects ("jaguar habitat" isn't about the car)
- Sometimes only certain ambiguities are relevant: "he hoped to record a world record"
- Other times, all levels can be relevant (e.g., translation)

Some Early NLP History
- 1950s
  - Foundational work: automata, information theory, etc.
  - First speech systems
  - Machine translation (MT) hugely funded by the military (imagine that)
  - Toy models: MT using basically word substitution
  - Optimism!
- 1960s and 1970s: NLP winter
  - The Bar-Hillel (FAHQT) and ALPAC reports kill MT
  - Work shifts to deeper models, syntax
  - ... but toy domains / grammars (SHRDLU, LUNAR)
- 1980s: the empirical revolution
  - Expectations get reset
  - Corpus-based methods
  - Deep analysis often traded for robust and simple approximations
  - Evaluate everything

Classical NLP: Parsing
- Write symbolic or logical rules:
  - Grammar (CFG): ROOT → S, S → NP VP, NP → DT NN, NP → NN NNS, NP → NP PP, VP → VBP NP, VP → VBP NP PP, PP → IN NP, ...
  - Lexicon: NN → interest, NNS → raises, VBP → interest, VBZ → raises, ...
- Use deduction systems to prove parses from words (a chart-parser sketch follows below)
  - Minimal grammar on the "Fed raises" sentence: 36 parses
  - Simple 10-rule grammar: 592 parses
  - Real-size grammar: many millions of parses
- This scaled very badly and didn't yield broad-coverage tools
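To make the deduction-system idea concrete, here is a minimal sketch of CKY-style dynamic programming that counts how many analyses a toy grammar admits, in the course's language (Java, 16+ for records). It is an illustration, not the course's actual code: the class name is made up, and the VP → VBZ NP rule, the unary NP promotions, and the lexicon entries for "Fed" and "rates" are assumed additions to the grammar fragment above.

```java
import java.util.*;

public class CkyParseCounter {
    record Binary(String parent, String left, String right) {}
    record Unary(String parent, String child) {}

    // Binary subset of the toy grammar above; VP -> VBZ NP and the unary
    // NP promotions are assumed additions so single nouns can head NPs.
    static final List<Binary> BINARY = List.of(
            new Binary("S",  "NP",  "VP"),
            new Binary("NP", "NN",  "NNS"),
            new Binary("VP", "VBZ", "NP"),
            new Binary("VP", "VBP", "NP"));
    static final List<Unary> UNARY = List.of(
            new Unary("NP", "NN"),
            new Unary("NP", "NNS"));

    // Toy lexicon from the slide; "Fed" and "rates" are assumed entries.
    static final Map<String, List<String>> LEXICON = Map.of(
            "Fed",      List.of("NN"),
            "raises",   List.of("NNS", "VBZ"),
            "interest", List.of("NN", "VBP"),
            "rates",    List.of("NNS"));

    // chart[i][j] maps a category to the number of distinct analyses
    // of that category over words i..j-1.
    public static long countParses(List<String> words, String goal) {
        int n = words.size();
        Map<String, Long>[][] chart = new HashMap[n + 1][n + 1];
        for (int i = 0; i < n; i++) {
            chart[i][i + 1] = new HashMap<>();
            for (String tag : LEXICON.getOrDefault(words.get(i), List.of()))
                chart[i][i + 1].merge(tag, 1L, Long::sum);
            applyUnaries(chart[i][i + 1]);
        }
        for (int span = 2; span <= n; span++)
            for (int i = 0; i + span <= n; i++) {
                Map<String, Long> cell = chart[i][i + span] = new HashMap<>();
                for (int k = i + 1; k < i + span; k++)
                    for (Binary r : BINARY) {
                        long ways = chart[i][k].getOrDefault(r.left(), 0L)
                                  * chart[k][i + span].getOrDefault(r.right(), 0L);
                        if (ways > 0) cell.merge(r.parent(), ways, Long::sum);
                    }
                applyUnaries(cell);
            }
        return chart[0][n].getOrDefault(goal, 0L);
    }

    // One pass suffices because these unary rules have no chains or cycles.
    static void applyUnaries(Map<String, Long> cell) {
        for (Unary u : UNARY) {
            long ways = cell.getOrDefault(u.child(), 0L);
            if (ways > 0) cell.merge(u.parent(), ways, Long::sum);
        }
    }

    public static void main(String[] args) {
        // Prints 2: [Fed] [raises [interest rates]] and the unintended
        // reading [Fed raises] [interest(V) rates].
        System.out.println(countParses(
                List.of("Fed", "raises", "interest", "rates"), "S"));
    }
}
```

Counting rather than enumerating is exactly the response to the slide's complaint: each cell records how many ways a category spans those words, so the 592 (or millions of) parses never have to be built explicitly, yet their number falls out in polynomial time.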
What Were They Thinking?
- People did know that language was ambiguous!
  - ... but they hoped that all interpretations would be "good" ones (or ruled out pragmatically)
  - ... they didn't realize how bad it would be

Problems (and Solutions?)
- Dark ambiguities: most analyses are shockingly bad (meaning, they don't have an interpretation you can get your mind around)
  - (A slide figure showed one such analysis, whose structure corresponds to the correct parse of "This will panic buyers!")
- Unknown words and new usages
- Solution: we need mechanisms to focus attention on the best analyses; probabilistic techniques do this

Corpora
- A corpus is a collection of text
  - Often annotated in some way
  - Sometimes just lots of text
  - Balanced vs. uniform corpora
- Examples:
  - Newswire collections: 500M+ words
  - Brown corpus: 1M words of tagged "balanced" text
  - Penn Treebank: 1M words of parsed WSJ
  - Canadian Hansards: 10M+ words of aligned French / English sentences
  - The Web: billions of words
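A natural first exercise with any of these corpora is reducing text to counts, the raw material of every statistical model in the course. The sketch below, again in Java, assumes a local file named corpus.txt and uses a deliberately crude regex tokenizer (real corpora need real tokenization, as discussed under "Other Levels of Language"); it prints maximum-likelihood unigram estimates P(w) = count(w) / N for the ten most frequent words.

```java
import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

public class CorpusCounts {
    public static void main(String[] args) throws Exception {
        // "corpus.txt" is a placeholder path, not a real distributed corpus.
        String text = Files.readString(Path.of("corpus.txt")).toLowerCase();

        // Crude tokenization: runs of letters/apostrophes count as words.
        Map<String, Long> counts = new HashMap<>();
        long total = 0;
        Matcher m = Pattern.compile("[a-z']+").matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1L, Long::sum);
            total++;
        }

        // Maximum-likelihood unigram estimate: P(w) = count(w) / N.
        // (Unseen words get probability zero; smoothing is a later topic.)
        final long n = total;
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> System.out.printf("%-15s %8d  P=%.5f%n",
                      e.getKey(), e.getValue(), (double) e.getValue() / n));
    }
}
```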