Saif Mohammad

"Language of the subconscious" by WildCherry

Instructor: Saif Mohammad
Co-instructor: Nitin Madnani
Coordinator: Professor Bonnie Dorr
Teaching Assistant: Sajib Dasgupta
Guest lectures: Bonnie Dorr, Philip Resnik, Doug Oard

Competent programmers. Do not have to be linguists. Have high school English behind you. Know parts of speech, syntactic parse trees, subject, object. Read the material on word classes and context-free grammars from J&M chapters 5 and 12 for background.

Text: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, second edition, published in 2008, by Daniel Jurafsky and James H. Martin.

Course webpage: http://www.umiacs.umd.edu/~saif/WebPages/CS723.htm

Class: Wednesdays, 4 to 6:30 pm, with a 5-10 min break in between.

Exams (50%): midterm exam 25%, final exam 25%.
Class assignments/projects (45%): Assignments 1 through 4 are worth 10%, 12.5%, 10%, and 12.5%. Assignment 0 carries no credit and is designed to calibrate programming skills.
Class participation (5%): showing up for class, demonstrating preparedness, and contributing to class discussions.

Office hours:
Saif: by appointment.
Sajib (TA): room 1112, Mondays 4 to 5:30 pm, Tuesdays 2 to 3:30 pm.

Forum: https://forum.cs.umd.edu/forumdisplay.php?f=113

Focus on statistical models: HMMs, EM, N-gram LMs, TAGs (approx. 4 lectures).

Assignments: all written in Python/NLTK. Python/NLTK tutorial next week (show up). Assignment 0 is not for credit. Purpose: introspection and practice. Try to solve problem 1 before the tutorial next week, and problem 2 after.

Forums: register (unless already registered for another class). Preferred way to ask questions. Feel free to start discussion threads if necessary. Subscribe to notifications.

Study of computer processing, understanding, and generation of human languages. Interdisciplinary field: linguistics, machine learning and artificial intelligence, statistics, cognitive science,
psychology, and others.

Common applications: machine translation, information retrieval, text summarization, question answering.

Professor Bonnie Dorr

Disambiguation decisions of word sense, word category, syntactic structure. Maximize coverage, minimize errors (false positives). Robust. Generalize well.

AI approaches with deep understanding had hand-coded rules. Creating the rules is time consuming. One may miss rules; sometimes the rules are too many to encode. May not scale to different domains. Brittle (metaphors): "I swallowed his story."

Counting things. Determining patterns that occur in language use. Features. Learn rules/patterns automatically. Statistical models are robust, generalize well, and behave gracefully when faced with less-than-perfect conditions.

Corpus: a collection of natural language documents. British National Corpus, Wall Street Journal, Google's web-indexed corpus, Switchboard corpus. Can we learn how language works from this text? Look for patterns in the corpus.

Size. Balanced or domain-specific. Written or spoken. Raw or annotated (senses, POS, structure). Electronically available or hard copy. Free to use, or one needs to pay for a license. Brown, Susanne, Penn Treebank, Canadian Hansards.

Dictionaries: gloss, example sentence.
Thesauri: categories, paragraphs, semicolon units.
WordNet: synsets, gloss, hypernyms, holonyms, troponyms.

Tom Sawyer

Word   Freq   Use
the    3332   determiner (article)
and    2972   conjunction
a      1775   determiner
to     1725   preposition, verbal infinitive marker
of     1440   preposition
was    1161   auxiliary verb
it     1027   (personal/expletive) pronoun
in      906   preposition

Tom Sawyer
Tokens: 71,370
Types: 8,018
Memory: half a megabyte
Average frequency of a word: tokens/types = 8.9

Tom Sawyer

Freq      Freq of freq
1         3993
2         1292
3          664
4          410
5          243
6          199
7          172
8          131
9           82
10          91
11-50      540
51-100      99
>100       102

Hapax legomena: word types that occur only once in the corpus.
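The token, type, and frequency-of-frequency counts above can be sketched in a few lines of Python. This is a minimal sketch using only the standard library; the function name `corpus_stats`, the lowercase-and-split tokenizer, and the sample sentence are all assumptions made for illustration, not the procedure that produced the Tom Sawyer figures.

```python
from collections import Counter


def corpus_stats(text):
    """Return basic word-count statistics for a text."""
    # Assumption: lowercasing and splitting on whitespace is a crude
    # stand-in for a real tokenizer.
    tokens = text.lower().split()
    counts = Counter(tokens)                 # word type -> frequency
    freq_of_freq = Counter(counts.values())  # frequency -> number of types
    hapaxes = sorted(w for w, c in counts.items() if c == 1)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        # Average frequency of a word = tokens / types (0.0 for empty text).
        "avg_freq": len(tokens) / len(counts) if counts else 0.0,
        "freq_of_freq": dict(freq_of_freq),
        "hapaxes": hapaxes,
    }


# Hypothetical sample text, for illustration only.
sample = "the cat saw the dog and the dog saw a bird"
stats = corpus_stats(sample)
print(stats["tokens"], stats["types"], stats["hapaxes"])
```

On a real corpus the same counts would normally be taken over properly tokenized text, and the choice of tokenizer changes the numbers.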
Direct applications of simple word counts: cryptography, style of authorship. Indirectly, counts are used pervasively in NLP.

Why is statistical NLP difficult? It is hard to predict much about the behavior of words that occur rarely, if at all.

The Principle of Least Effort: people will act so as to minimize their probable average rate of work. Evidence: underlying statistical distributions in language.

Count up the words in a corpus. List (rank) the words in order of frequency: frequency ∝ 1/rank. Example: the 50th most common word should occur three times more often than the 150th. First observed by Estoup (1916).

There are a few very common words, a middling number of medium-frequency words, and many low-frequency words. The speaker and the hearer are both trying to minimize their effort.

[Plot on regular, non-logarithmic scales]

Other relations: the number of meanings of a word grows with its frequency (and so falls with its rank); the length of a word is inversely related to its frequency.

Often we deal with the occurrence and frequencies of sets of strings:
Given a sentence with the word "bank", did the words "teller" or "tellers" occur in the sentence?
How many times did the various forms of the word "dissect" (dissect, dissection, dissected, dissectible) occur in a book?
What are the different dates mentioned in a history book?

Regular expressions are a way of identifying sets of strings: a formula (notation) in a special language that is used for specifying simple classes (sets) of strings. Developed by Kleene (1956). Regular expressions can be implemented by finite-state automata. Variations of automata, namely finite-state transducers and hidden Markov models, are used in speech recognition and synthesis, machine translation, spell checking, and IE (information extraction).
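Two of the questions above about sets of strings can be answered directly with Python's `re` module. The following is a minimal sketch: the example sentence and the "book" text are hypothetical, and the `dissect` pattern assumes that only the four forms listed in the text should be counted.

```python
import re

# Hypothetical example texts, for illustration only.
sentence = "The teller at the bank counted the cash twice."
book = ("We dissect a frog. The dissection went well. "
        "She dissected it; it is dissectible.")

# Given a sentence with the word "bank", did "teller" or "tellers" occur?
has_bank = re.search(r"\bbank\b", sentence) is not None
has_teller = re.search(r"\btellers?\b", sentence) is not None

# How many times did the forms of "dissect" occur?
# Assumption: only dissect, dissection, dissected, dissectible are counted.
forms = re.findall(r"\bdissect(?:ion|ed|ible)?\b", book)

print(has_bank and has_teller)
print(len(forms), forms)
```

The `\b` word boundaries keep the patterns from matching inside longer words, and the non-capturing group `(?:...)` makes `re.findall` return the whole matched word rather than just the suffix.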
Pattern        Matches
olympics       olympics
[abcd]         a, b, c, d
[a-d]          a, b, c, d
[Oo]lympics    Olympics, olympics
[A-Z]9         A9, B9, C9, ..., M9, ..., Z9
[^a-d]         e, f, ..., z
yours|mine     yours, mine

Optional characters: ? and *

Operator   Meaning      Pattern    Matches
?          0 or 1       colou?r    color, colour
*          0 or more    oo*h       oh, ooh, oooh, ...
+          1 or more    o+h        oh, ooh, oooh, ...
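The character-class and repetition operators listed above can be checked against their example strings with Python's `re` module. This is a small sketch rather than part of the original slides: the exact pattern spellings are reconstructed from the examples and should be read as assumptions, and `all_match` is a hypothetical helper name.

```python
import re

# Each pattern is paired with strings it is expected to match in full.
examples = {
    r"[Oo]lympics": ["Olympics", "olympics"],
    r"[A-Z]9": ["A9", "M9", "Z9"],
    r"[^a-d]": ["e", "f", "z"],
    r"yours|mine": ["yours", "mine"],
    r"colou?r": ["color", "colour"],   # ? means 0 or 1 of the previous item
    r"oo*h": ["oh", "ooh", "oooh"],    # * means 0 or more
    r"o+h": ["oh", "ooh", "oooh"],     # + means 1 or more
}


def all_match(pattern, strings):
    """True if the pattern matches every string from start to end."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)


results = {p: all_match(p, strs) for p, strs in examples.items()}
for pattern, ok in results.items():
    print(pattern, ok)

# Counter-examples: at least one "o" is required, and "a" is excluded
# by the negated class, so both of these are None.
rejected = [re.fullmatch(r"o+h", "h"), re.fullmatch(r"[^a-d]", "a")]
print(rejected)
```

`re.fullmatch` is used instead of `re.search` so that a pattern only counts as matching when it covers the whole string, which is closer to the "set of strings" reading of a regular expression.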



UMD CMSC 723 - COURSE INFORMATION
