Berkeley COMPSCI 294 - Lecture Notes

CS 294-5: Statistical Natural Language Processing
Dan Klein
MF 1-2:30pm, Soda Hall 310

Course Info
- Meeting times: lectures MF 1-2:30pm (1:00-1:45, break, 1:50-2:30); office hours W 3-5pm (but negotiable)
- Communication: web page www.cs.berkeley.edu/~klein/cs294-5; email [email protected]; course newsgroup ucb.class.cs294-5
- Assignments: readings (M+S, J+M, papers); 5 small projects, 1 final project; questionnaires!

The Dream
- Language is our UI! It would be great if machines could: process our email (usefully), translate for us, write up our research, talk to us / listen to us.
- But they can't: language is ambiguous, flexible, complex, and subtle.
- So we use their UIs.

What is NLP?
- Fundamental goal: deep understanding of broad language, not just string processing or keyword matching!
- Ambitious applications: machine translation, information extraction, dialog systems, question answering...
- Modest applications: spelling correction, text categorization.

What is nearby NLP?
- Computational linguistics: using computational methods to learn more about how language works. We end up doing this and using it.
- Cognitive science: figuring out how the human brain works, including the bits that do language. Humans: the only working NLP prototype!
- Speech recognition: mapping audio signals to text. Traditionally separate from NLP, converging? Two components: acoustic models and language models; language models are in the domain of stat NLP.

What is this Class?
Three aspects to the course:
- Linguistic issues: what is the range of language phenomena? What are the knowledge sources that let us disambiguate? What representations are appropriate?
- Technical methods: learning and parameter estimation; increasingly complex model structures; efficient algorithms (dynamic programming, search).
- Engineering methods: issues of scale, memory limitations, and sometimes very ugly hacks.
We'll visit a series of language problems.

Class Requirements and Goals
- Requirements: the class uses a variety of skills / knowledge: basic probability and statistics, basic linguistics background, decent coding skills (Java). Most people are probably missing one of the above!
- Goals: learn the issues and techniques of statistical NLP; build the real tools used in NLP (language models, taggers, parsers, translation systems); be able to read current research papers in the field; see where the gaping holes in the field are!

An Example
  John bought a blue car

Language is Ambiguous
Headlines:
- Iraqi Head Seeks Arms
- Ban on Nude Dancing on Governor's Desk
- Juvenile Court to Try Shooting Defendant
- Teacher Strikes Idle Kids
- Stolen Painting Found by Tree
- Kids Make Nutritious Snacks
- Local HS Dropouts Cut in Half
- British Left Waffles on Falkland Islands
- Clinton Wins on Budget, but More Lies Ahead
- Hospitals Are Sued by 7 Foot Doctors
Why are these funny?

Ambiguities Everywhere
Maybe we're sunk on funny headlines, but normal, boring sentences are unambiguous?
  Fed raises interest rates 0.5% in a measure against inflation

More Attachment Ambiguities

Semantic Ambiguities
- Even correct tree-structured syntactic analyses don't always nail down the meaning:
  Every morning someone's alarm clock wakes me up
  John's boss said he was doing better

Other Levels of Language
- Tokenization/morphology: what are the words, and what is the sub-word structure? Often simple rules work (a period after "Mr" isn't a sentence break). Relatively easy in English; other languages are harder: segmentation, morphology. Example: sarà andata = be+fut+3sg go+ppt+fem = "she will have gone"
- Discourse: how do sentences relate?
- Pragmatics: what intent is expressed by the literal meaning, and how should we react?

Disambiguation for Applications
- Sometimes life is easy: text classification works pretty well just knowing the set of words used in the document, and the same holds for authorship attribution. Word-sense disambiguation is not usually needed for web search because of majority effects or intersection effects ("jaguar habitat" isn't the car).
- Sometimes only certain ambiguities are relevant.
- Other times, all levels can be relevant (e.g., translation):
  he hoped to record a world record

Some Early NLP History
- 1950s: foundational work (automata, information theory, etc.); first speech systems; machine translation (MT) hugely funded by the military (imagine that); toy models, with MT done basically by word substitution; optimism!
- 1960s and 1970s: NLP winter. The Bar-Hillel (FAHQT) and ALPAC reports kill MT; work shifts to deeper models and syntax... but toy domains / grammars (SHRDLU, LUNAR).
- 1980s: the empirical revolution. Expectations get reset; corpus-based methods; deep analysis often traded for robust and simple approximations; evaluate everything.

Classical NLP: Parsing
- Write symbolic or logical rules, then use deduction systems to prove parses from words.
- A minimal grammar on the "Fed raises" sentence yields 36 parses; a simple 10-rule grammar yields 592; a real-size grammar yields many millions (see the parse-counting sketch below).
- This scaled very badly and didn't yield broad-coverage tools.

Grammar (CFG):
  ROOT → S
  S → NP VP
  NP → DT NN
  NP → NN NNS
  NP → NP PP
  VP → VBP NP
  VP → VBP NP PP
  PP → IN NP
Lexicon:
  NN → interest
  NNS → raises
  VBP → interest
  VBZ → raises
  …
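
To make the combinatorics concrete, here is a minimal CKY-style parse counter in Java (the course's language). It is only a sketch under invented assumptions: the grammar below is a small hand-made fragment, not the slide's grammar, and lexicon entries such as NN → fed are made up so the demo runs standalone, so it will not reproduce the 36- or 592-parse counts above.

import java.util.*;

/**
 * A toy CKY-style parse counter in the spirit of the "Fed raises"
 * example. The grammar is a small hand-made fragment, not the
 * slide's exact grammar, and lexicon entries like NN -> fed are
 * invented so the demo is self-contained.
 */
public class ParseCounter {

    // Binary rules, each as {parent, leftChild, rightChild}.
    static final String[][] BINARY = {
        {"S",  "NP", "VP"},   // sentence = subject + predicate
        {"NP", "NN", "NP"},   // right-branching noun compounds
        {"VP", "V",  "NP"},   // verb + object
    };

    // Unary rules, each as {parent, child}: promote a tag to a phrase.
    static final String[][] UNARY = {
        {"NP", "NN"},
    };

    // Lexicon: word -> possible tags. Tag ambiguity drives parse ambiguity.
    static final Map<String, List<String>> LEXICON = Map.of(
        "fed",      List.of("NN"),
        "raises",   List.of("NN", "V"),
        "interest", List.of("NN", "V"),
        "rates",    List.of("NN", "V"));

    /** Counts distinct parse trees rooted in `goal` over the words. */
    @SuppressWarnings("unchecked")
    static long countParses(String[] words, String goal) {
        int n = words.length;
        // chart[i][j] maps a symbol to its number of parses over words i..j-1.
        Map<String, Long>[][] chart = new HashMap[n + 1][n + 1];
        for (int i = 0; i < n; i++) {          // length-1 spans: tag the words
            chart[i][i + 1] = new HashMap<>();
            for (String tag : LEXICON.getOrDefault(words[i], List.of()))
                chart[i][i + 1].merge(tag, 1L, Long::sum);
            applyUnaries(chart[i][i + 1]);
        }
        for (int len = 2; len <= n; len++) {   // longer spans: combine children
            for (int i = 0; i + len <= n; i++) {
                int j = i + len;
                Map<String, Long> cell = new HashMap<>();
                for (int k = i + 1; k < j; k++)
                    for (String[] r : BINARY) {
                        long left = chart[i][k].getOrDefault(r[1], 0L);
                        long right = chart[k][j].getOrDefault(r[2], 0L);
                        if (left > 0 && right > 0)
                            cell.merge(r[0], left * right, Long::sum);
                    }
                applyUnaries(cell);
                chart[i][j] = cell;
            }
        }
        return chart[0][n].getOrDefault(goal, 0L);
    }

    // One pass suffices here because the unary rule set is acyclic.
    static void applyUnaries(Map<String, Long> cell) {
        for (String[] u : UNARY) {
            long count = cell.getOrDefault(u[1], 0L);
            if (count > 0) cell.merge(u[0], count, Long::sum);
        }
    }

    public static void main(String[] args) {
        String[] sentence = {"fed", "raises", "interest", "rates"};
        // Prints 2: either "raises" or "interest" can serve as the main verb.
        System.out.println("Parses as S: " + countParses(sentence, "S"));
    }
}

Even this tiny fragment finds two parses of "fed raises interest rates", one with "raises" as the main verb and one with "interest". Every additional attachment rule or ambiguous tag multiplies such counts, which is exactly the blow-up described above.
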
What Were They Thinking?
- People did know that language was ambiguous!
- ...but they hoped that all interpretations would be "good" ones (or ruled out pragmatically).
- ...they didn't realize how bad it would be.

Problems (and Solutions?)
- Dark ambiguities: most analyses are shockingly bad, meaning they don't have an interpretation you can get your mind around. (The slide's example: an analysis that corresponds to the correct parse of "This will panic buyers!")
- Unknown words and new usages.
- Solution: we need mechanisms to focus attention on the best analyses; probabilistic techniques do this.

Corpora
- A corpus is a collection of text: often annotated in some way, sometimes just lots of text; balanced vs. uniform corpora.
- Examples: newswire collections (500M+ words); the Brown corpus (1M words of tagged "balanced" text); the Penn Treebank (1M words of parsed WSJ); the Canadian Hansards (10M+ words of aligned French / English sentences); the Web (billions of words).
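
Since the first tools the course builds are language models estimated from corpora like these, here is a minimal count-and-normalize sketch. Everything in it is illustrative: the inline three-sentence "corpus" is invented, and add-one smoothing is a crude stand-in for the smoothing techniques treated later.

import java.util.*;

/**
 * A minimal sketch of the corpus-to-counts-to-model pipeline.
 * The "corpus" here is a few inline sentences standing in for real
 * data; actual corpora (Brown, Penn Treebank, Hansards) would be
 * read from files, and their annotation would matter.
 */
public class BigramModel {
    private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    /** Count unigrams and bigrams, padding each sentence with boundary markers. */
    void train(List<String> sentences) {
        for (String sentence : sentences) {
            String[] toks = ("<s> " + sentence.toLowerCase() + " </s>").split("\\s+");
            for (int i = 0; i < toks.length; i++) {
                vocab.add(toks[i]);
                unigramCounts.merge(toks[i], 1, Integer::sum);
                if (i > 0)
                    bigramCounts.computeIfAbsent(toks[i - 1], k -> new HashMap<>())
                                .merge(toks[i], 1, Integer::sum);
            }
        }
    }

    /**
     * P(word | prev) with add-one (Laplace) smoothing, a deliberately
     * crude placeholder for better smoothing methods.
     */
    double prob(String prev, String word) {
        int joint = bigramCounts.getOrDefault(prev, Map.of()).getOrDefault(word, 0);
        int context = unigramCounts.getOrDefault(prev, 0);
        return (joint + 1.0) / (context + vocab.size());
    }

    public static void main(String[] args) {
        BigramModel lm = new BigramModel();
        // Toy stand-in corpus; these sentences are invented for this sketch.
        lm.train(List.of(
            "fed raises interest rates",
            "fed raises rates again",
            "interest in rates rises"));
        System.out.printf("P(raises | fed)   = %.3f%n", lm.prob("fed", "raises"));
        System.out.printf("P(interest | fed) = %.3f%n", lm.prob("fed", "interest"));
    }
}

Swapping the toy list for tokenized newswire text changes only the input to train(); the estimation logic stays the same, which is much of the appeal of corpus-based methods.
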

