Instructor: Saif Mohammad
Co-instructor: Nitin Madnani
Coordinator: Professor Bonnie Dorr
Teaching Assistant: Sajib Dasgupta

Guest Lectures:
◦ Bonnie Dorr
◦ Philip Resnik
◦ Doug Oard

Competent programmers
Do not have to be linguists
◦ Have high-school English behind you
◦ Know parts of speech, syntactic parse trees, subject, object, …
◦ Read material on word classes and context-free grammars from J&M chapters 5 and 12 for background

Text:
◦ Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, second edition (published in 2008), by Daniel Jurafsky and James H. Martin

Course webpage:
◦ http://www.umiacs.umd.edu/~saif/WebPages/CS723.htm

Class:
◦ Wednesdays, 4 to 6:30 pm (5–10 min break in between)

Exams: 50%
◦ Midterm exam: 25%
◦ Final exam: 25%
Class assignments/projects: 45%
◦ Assignments 1 through 4: 10%, 12.5%, 10%, 12.5%
◦ Assignment 0: no credit; designed to calibrate programming skills
Class participation: 5%
◦ Showing up for class, demonstrating preparedness, and contributing to class discussions

Office hours:
◦ Saif: by appointment
◦ Sajib: TA room 1112; Mondays 4 to 5:30 pm, Tuesdays 2 to 3:30 pm

Forum:
◦ https://forum.cs.umd.edu/forumdisplay.php?f=113

Focus on statistical models
◦ HMMs, EM, N-gram LMs, TAGs (approx. 4 lectures)

Assignments
◦ All written in Python/NLTK
◦ Python/NLTK tutorial next week (show up!)

Assignment 0 (not for credit)
◦ Purpose: introspection and practice
◦ Try to solve problem 1 before the tutorial next week, problem 2 after

Forums
◦ Register unless already registered for another class
◦ Preferred way to ask questions
◦ Feel free to start discussion threads, if necessary
◦ Subscribe to notifications!
Study of computer processing, understanding, and generation of human languages
Interdisciplinary field
◦ Linguistics, machine learning and artificial intelligence, statistics, cognitive science, psychology, and others
Common applications:
◦ Machine translation, information retrieval, text summarization, question answering

Disambiguation decisions of word sense, word category, syntactic structure, …
Maximize coverage, minimize errors (false positives)
Robust
Generalize well

AI approaches with deep understanding had hand-coded rules
◦ Creating the rules is time-consuming
◦ One may miss rules; sometimes there are too many rules to encode
◦ May not scale to different domains
◦ Brittle in the face of metaphor ("I swallowed his story")

Counting things
Determining patterns that occur in language use
Features:
◦ Learn rules and patterns automatically
◦ Statistical models are robust, generalize well, and behave gracefully when faced with less-than-perfect conditions

Corpus: a collection of natural language documents
◦ British National Corpus, Wall Street Journal, Google's web-indexed corpus, Switchboard corpus
Can we learn how language works from this text?
◦ Look for patterns in the corpus

Properties of a corpus:
◦ Size
◦ Balanced or domain-specific
◦ Written or spoken
◦ Raw or annotated (senses, POS, structure)
◦ Electronically available or hard copy
◦ Free to use or one needs to pay for a license

Example corpora: Brown, Susanne, Penn Treebank, Canadian Hansards

Dictionaries
◦ Gloss, example sentence
Thesauri
◦ Categories, paragraphs, semicolon units
WordNet
◦ Synsets, gloss
◦ Hypernyms, holonyms, troponyms

Most frequent words in Tom Sawyer:

word  freq  part of speech
the   3332  determiner (article)
and   2972  conjunction
a     1775  determiner
to    1725  preposition, verbal infinitive marker
of    1440  preposition
was   1161  auxiliary verb
it    1027  (personal/expletive) pronoun
in     906  preposition

Tom Sawyer
◦ Tokens: 71,370
◦ Types: 8,018
◦ Memory: half a megabyte
◦ Average frequency of a word: # tokens / # types ≈ 8.9

Frequency of frequencies in Tom Sawyer:

freq    freq of freq
1       3993
2       1292
3        664
4        410
5        243
6        199
7        172
8        131
9         82
10        91
11–50    540
51–100    99
> 100    102

Hapax legomena
◦ word types that occur only once in the corpus

Direct applications of simple word counts
◦ cryptography, style of authorship
Indirectly, counts are used pervasively in NLP

Why is statistical NLP difficult?
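Before turning to that question: the token/type counts, average frequency, and hapax legomena described above can be reproduced with a short script. A minimal sketch in plain Python, using a crude regex tokenizer in place of a real one such as NLTK's (the sample sentence is invented for illustration):

```python
import re
from collections import Counter

def word_stats(text):
    """Token count, type count, average type frequency, and hapax legomena."""
    # Crude tokenizer: lowercase, keep alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)   # total word occurrences
    n_types = len(counts)    # distinct word forms
    avg_freq = n_tokens / n_types  # e.g. 71,370 / 8,018 ≈ 8.9 for Tom Sawyer
    hapax = sorted(w for w, c in counts.items() if c == 1)  # types occurring once
    return n_tokens, n_types, avg_freq, hapax

n_tokens, n_types, avg_freq, hapax = word_stats(
    "the cat sat on the mat and the dog sat too"
)
print(n_tokens, n_types, hapax)  # 11 tokens, 8 types, 6 hapax legomena
```

On a real corpus the hapax legomena dominate the vocabulary: in Tom Sawyer, 3,993 of the 8,018 types occur exactly once.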
◦ Hard to predict much about the behavior of words that occur rarely (if at all)

The Principle of Least Effort: "people will act so as to minimize their probable average rate of work"
Evidence: underlying statistical distributions in language
◦ Count up words in a corpus
◦ List (rank) words in order of frequency

frequency ∝ 1/rank

Example:
◦ The 50th most common word should occur three times as often as the 150th
First observed by Estoup (1916)

There are a few very common words, a middling number of medium-frequency words, and many low-frequency words
The speaker and the hearer are both trying to minimize their effort

[Figure: rank vs. frequency, plotted on regular (non-logarithmic) scales]

Other regularities:
◦ # meanings ∝ √frequency ∝ 1/√rank
◦ Length of a word ∝ 1/frequency

Often, we deal with the occurrences and frequencies of sets of strings:
◦ Given a sentence with the word bank, did the words teller or tellers occur in the sentence?
◦ How many times did the various forms of the word dissect (dissect, dissection, dissected, dissectible) occur in a book?
◦ What are the different dates mentioned in a history book?
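Questions like the dissect example above are typically answered by pattern matching over the text. A hedged sketch in Python — the sample text and the hard-coded suffix list are invented for illustration; a real system would lemmatize rather than enumerate suffixes:

```python
import re

# Invented sample text containing several forms of "dissect".
text = ("She chose to dissect the frog; the dissection took an hour. "
        "Earlier she had dissected a worm, which proved easily dissectible.")

# \b marks word boundaries; the non-capturing group lists allowed suffixes.
pattern = re.compile(r"\bdissect(?:s|ed|ion|ions|ible)?\b", re.IGNORECASE)
forms = pattern.findall(text)

print(len(forms), sorted({f.lower() for f in forms}))
# 4 occurrences: dissect, dissected, dissectible, dissection
```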
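The rank–frequency relation above (frequency ∝ 1/rank) can also be sanity-checked numerically: writing it as f(r) = c/r, the constant c cancels in any ratio of frequencies, which is exactly the 50th-vs-150th example. A minimal sketch:

```python
def zipf_freq(rank, c=1.0):
    """Predicted frequency of the word at a given rank, assuming freq = c / rank."""
    return c / rank

# The 50th most common word should occur three times as often as the 150th,
# regardless of the constant c (it cancels in the ratio): 150 / 50 = 3.
ratio = zipf_freq(50) / zipf_freq(150)
print(ratio)
```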