Berkeley COMPSCI 294 - Course Introduction

CS 294-5: Statistical Natural Language Processing
Course Introduction
Lecture I: 8/29/05

Course Info
- Meeting times
  - Lectures: Monday / Wednesday 1-2:30pm
  - Office hours: Thursday 4-5pm, Friday 1-2pm
- Communication
  - Web page: www.cs.berkeley.edu/~klein/cs294-5
  - My email: [email protected]
  - Course newsgroup: ucb.class.cs294-5 (link to webnews on the web page)
- Questionnaires!

Accounts and Access
- Accounts
  - Data and code available on: CS instructional accounts, EECS research accounts, Millennium accounts
  - Make sure you have one of them working
  - More details on resources in assignment 1 (next class), but sort out access ASAP
- Computing resources
  - Lab resources may not be enough
  - Recommendation: start assignments early to find out
  - NLP cluster on Millennium network, signups later in the term

The Dream
- It'd be great if machines could:
  - Process our email (usefully)
  - Translate languages accurately
  - Help us manage, summarize, and aggregate information
  - Use speech as a UI (when needed)
  - Talk to us / listen to us
- But they can't: language is complex, ambiguous, flexible, and subtle
- Good solutions need linguistics and machine learning knowledge
- So:

What is NLP?
- Fundamental goal: deep understanding of broad language
- Not just string processing or keyword matching!
- End systems that we want to build:
  - Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering...
  - Modest: spelling correction, text categorization...

Speech Systems
- Automatic Speech Recognition (ASR)
  - Audio in, text out
  - SOTA: 0.3% error for digit strings, 5% for dictation, 50%+ for TV
- Text to Speech (TTS)
  - Text in, audio out
  - SOTA: totally intelligible (if sometimes unnatural)
- Speech systems currently:
  - Model the speech signal (later part of term)
  - Model language (next class)

[Slide: "Speech Lab" (only the title survives in this preview)]

Machine Translation
- Translation systems encode:
  - Something about fluent language (next class)
  - Something about how two languages correspond (middle of term)
- SOTA: for easy language pairs, better than nothing, but more an understanding aid than a replacement for human translators

Information Extraction
- Information Extraction (IE): unstructured text to database entries
- SOTA: perhaps 70% accuracy for multi-sentence templates, 90%+ for single easy fields

Example input:
  "New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent."

Extracted entries:
  Person            Company                   Post                            State
  Russell T. Lewis  New York Times newspaper  president and general manager   start
  Russell T. Lewis  New York Times newspaper  executive vice president        end
  Lance R. Primis   New York Times Co.        president and CEO               start
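To make "unstructured text to database entries" concrete, here is a minimal sketch of template-style extraction over the example sentence above. The class name and the single hand-written pattern are hypothetical illustrations, not the course's method. Note that the naive template even gets the company wrong (the post is at the flagship newspaper, not the parent company), which hints at why multi-sentence templates sit near 70% accuracy.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical one-template extractor, for illustration only: real IE
// systems (and this course) use learned models, not a single regex.
public class ToyExtractor {
    public static void main(String[] args) {
        String text = "New York Times Co. named Russell T. Lewis, 45, "
                + "president and general manager of its flagship "
                + "New York Times newspaper, responsible for all "
                + "business-side activities.";

        // Template: "<company> named <person>, <age>, <post> of ..."
        // A "named to post" event opens a tenure, hence state = start.
        Pattern p = Pattern.compile("^(.*?) named (.*?), \\d+, (.*?) of");
        Matcher m = p.matcher(text);
        if (m.find()) {
            String person = m.group(2);
            String company = m.group(1); // wrong entry: grabs the parent
                                         // company, not the newspaper
            String post = m.group(3);
            // person | company | post | state
            System.out.println(person + " | " + company + " | "
                    + post + " | start");
        }
    }
}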
Question Answering
- Question Answering: more than search
- Ask general comprehension questions of a document collection
- Can be really easy: "What's the capital of Wyoming?"
- Can be harder: "How many US states' capitals are also their largest cities?"
- Can be open ended: "What are the main issues in the global warming debate?"
- SOTA: can do factoids, even when the text isn't a perfect match

What is nearby NLP?
- Computational Linguistics
  - Using computational methods to learn more about how language works
  - We end up doing this and using it
- Cognitive Science
  - Figuring out how the human brain works
  - Includes the bits that do language
  - Humans: the only working NLP prototype!
- Speech?
  - Mapping audio signals to text
  - Traditionally separate from NLP; converging?
  - Two components: acoustic models and language models
  - Language models are in the domain of stat NLP

What is this Class?
- Three aspects to the course:
  - Linguistic issues
    - What is the range of language phenomena?
    - What are the knowledge sources that let us disambiguate?
    - What representations are appropriate?
  - Technical methods
    - Learning and parameter estimation
    - Increasingly complex model structures
    - Efficient algorithms: dynamic programming, search
  - Engineering methods
    - Issues of scale
    - Sometimes, very ugly hacks
- We'll focus on what makes the problems hard, and what works in practice...

Class Requirements and Goals
- Class requirements
  - Uses a variety of skills / knowledge:
    - Basic probability and statistics
    - Basic linguistics background
    - Decent coding skills (Java)
  - Most people are probably missing one of the above
  - We'll address some review concepts in sections, TBD
- Class goals
  - Learn the issues and techniques of statistical NLP
  - Build the real tools used in NLP (language models, taggers, parsers, translation systems)
  - Be able to read current research papers in the field
  - See where the gaping holes in the field are!

Course Work
- Readings
  - Texts: Manning and Schütze (available online); Jurafsky and Martin; both on reserve in the Engineering library
  - Papers (on web page)
- Assignments
  - 5 individual coding assignments (60% of grade)
  - 7 late days, 3 per assignment
  - Lowest score dropped
  - Substantial programming in Java 1.5
  - Evaluated by write-ups
- 1 group final project (40% of grade)

Some Early NLP History
- 1950s
  - Foundational work: automata, information theory, etc.
  - First speech systems
  - Machine translation (MT) hugely funded by the military (imagine that)
  - Toy models: MT using basically word substitution
  - Optimism!
- 1960s and 1970s: NLP winter
  - Bar-Hillel (FAHQT) and ALPAC reports kill MT
  - Work shifts to deeper models, syntax
  - ... but toy domains / grammars (SHRDLU, LUNAR)
- 1980s: the empirical revolution
  - Expectations get reset
  - Corpus-based methods become central
  - Deep analysis often traded for robust and simple approximations
  - Evaluate everything

Classical NLP: Parsing
- Write symbolic or logical rules; use deduction systems to prove parses from words
- Minimal grammar on the "Fed raises" sentence: 36 parses
- Simple 10-rule grammar: 592 parses
- Real-size grammar: many millions of parses
- This scaled very badly and didn't yield broad-coverage tools

Grammar (CFG):
  ROOT → S
  S → NP VP
  NP → DT NN
  NP → NN NNS
  NP → NP PP
  VP → VBP NP
  VP → VBP NP PP
  PP → IN NP

Lexicon:
  NN → interest
  NNS → raises
  VBP → interest
  VBZ → raises
  ...

NLP: Annotation
John bought a blue
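Parse counts like the 36 and 592 above come from a chart parser run over an ambiguous grammar; dynamic programming (one of the course's "technical methods") is what makes the counting tractable. Below is a minimal CKY parse counter as a sketch. The toy grammar, sentence, and class name are hypothetical stand-ins, not the "Fed raises" grammar from the slide; with them, the classic PP-attachment ambiguity yields exactly 2 parses.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Hypothetical toy CKY parse counter over a grammar in Chomsky normal form.
public class CkyCounter {
    // Binary rules: parent -> left right
    static String[][] rules = {
        {"S", "NP", "VP"}, {"VP", "V", "NP"}, {"VP", "VP", "PP"},
        {"NP", "NP", "PP"}, {"PP", "P", "NP"}
    };
    // Lexicon: word -> possible preterminals
    static Map<String, String[]> lexicon = new HashMap<String, String[]>();
    static {
        lexicon.put("kids", new String[]{"NP"});
        lexicon.put("birds", new String[]{"NP"});
        lexicon.put("telescopes", new String[]{"NP"});
        lexicon.put("saw", new String[]{"V"});
        lexicon.put("with", new String[]{"P"});
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        String[] words = "kids saw birds with telescopes".split(" ");
        int n = words.length;
        // chart[i][j] maps a symbol to its number of parses over words i..j-1
        Map<String, Long>[][] chart = new HashMap[n + 1][n + 1];
        for (int i = 0; i < n; i++) {
            chart[i][i + 1] = new HashMap<String, Long>();
            for (String tag : lexicon.get(words[i]))
                chart[i][i + 1].put(tag, 1L);
        }
        // Fill longer spans from shorter ones: classic O(n^3) dynamic program
        for (int span = 2; span <= n; span++)
            for (int i = 0; i + span <= n; i++) {
                int j = i + span;
                chart[i][j] = new HashMap<String, Long>();
                for (int k = i + 1; k < j; k++)
                    for (String[] r : rules) {
                        Long left = chart[i][k].get(r[1]);
                        Long right = chart[k][j].get(r[2]);
                        if (left != null && right != null) {
                            Long old = chart[i][j].get(r[0]);
                            chart[i][j].put(r[0],
                                (old == null ? 0 : old) + left * right);
                        }
                    }
            }
        // Prints "S parses: 2": [saw [birds with telescopes]] vs
        // [[saw birds] with telescopes]
        System.out.println("S parses: " + chart[0][n].get("S"));
    }
}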


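The Speech Systems and "What is nearby NLP?" slides above both point at language models as the stat-NLP half of speech recognition (and the topic of the next class), and "build language models" is one of the stated class goals. As a preview, here is a minimal count-based bigram model with add-one smoothing; the two-sentence corpus and class name are hypothetical, and add-one is the crudest possible smoothing choice, not what the course ultimately teaches.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal bigram language model with add-one smoothing.
public class BigramLm {
    private final Map<String, Integer> contextCounts = new HashMap<String, Integer>();
    private final Map<String, Integer> bigramCounts = new HashMap<String, Integer>();
    private final Set<String> vocab = new HashSet<String>();

    // Count bigrams in one sentence, padded with <s> and </s> markers
    public void observe(String[] sentence) {
        String prev = "<s>";
        vocab.add(prev);
        for (String w : sentence) {
            vocab.add(w);
            inc(contextCounts, prev);
            inc(bigramCounts, prev + " " + w);
            prev = w;
        }
        vocab.add("</s>");
        inc(contextCounts, prev);
        inc(bigramCounts, prev + " </s>");
    }

    // Add-one (Laplace) smoothed conditional P(w | prev):
    // (c(prev, w) + 1) / (c(prev) + |V|), so unseen bigrams get mass > 0
    public double prob(String prev, String w) {
        int big = get(bigramCounts, prev + " " + w);
        int ctx = get(contextCounts, prev);
        return (big + 1.0) / (ctx + vocab.size());
    }

    private static void inc(Map<String, Integer> m, String k) {
        Integer c = m.get(k);
        m.put(k, c == null ? 1 : c + 1);
    }

    private static int get(Map<String, Integer> m, String k) {
        Integer c = m.get(k);
        return c == null ? 0 : c;
    }

    public static void main(String[] args) {
        BigramLm lm = new BigramLm();
        // Tiny hypothetical corpus
        lm.observe("the fed raises interest rates".split(" "));
        lm.observe("the fed raises rates".split(" "));
        System.out.println(lm.prob("fed", "raises")); // seen bigram: 3/9
        System.out.println(lm.prob("fed", "the"));    // unseen bigram: 1/9
    }
}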