Columbia COMS W4706 - Text Normalization


Text Normalization
Slide 2
Text Normalization (1)
Text Normalization (2)
Today
Segmentation: What is a sentence?
Rule-based Approaches
Machine Learning Approaches
Hybrid Approaches
Tokenization: What is a word?
Slide 11
Word Decisions are Arbitrary but must be Consistent
Abbreviations and Acronyms
Slide 14
Abbreviation Identification/Resolution (Sproat et al ’99)
Abbreviations and their Expansions
Ambiguous Abbreviations
Numbers in Context
Slide 19
Markup Languages
Mark-up Languages
An Example
Concept-to-Speech
Slide 24
Cultural Dependence
Slide 26
Next Class

Text Normalization
Julia Hirschberg
CS 4706
01/14/19

•TTS demos:
–AT&T
–Cepstral
•SNL Robot Repair
•An interesting application for TTS

Text Normalization (1)

A sworn deposition that Sen. John McCain gave in a lawsuit more than 5 years ago appears to contradict one part of a sweeping denial that his campaign issued this week to rebut a New York Times story about his ties to a Washington lobbyist. On Wednesday night the Times published a story suggesting that McCain might have done legislative favors for the clients of the lobbyist, Vicki Iseman, who worked for the firm of Alcalde & Fay. One example it cited were two letters McCain wrote in late 1999 demanding that the Federal Communications Commission act on a long-stalled bid by one of Iseman's clients, Florida-based Paxson Communications, to purchase a Pittsburgh TV station. Just hours after the Times's story was posted, the McCain campaign issued a point-by-point response that depicted the letters as routine correspondence handled by his staff—and insisted that McCain had never even spoken with anybody from Paxson or Alcalde & Fay about the matter. "No representative of Paxson or Alcalde & Fay personally asked Senator McCain to send a letter to the FCC," the campaign said in a statement e-mailed to reporters. But that flat claim seems to be contradicted by an impeccable source: McCain himself. "I was contacted by Mr. [Lowell] Paxson on this issue," McCain said in the Sept. 25, 2002, deposition obtained by NEWSWEEK. "He wanted their approval very bad for purposes of his business. I believe that Mr. Paxson had a legitimate complaint." While McCain said "I don't recall" if he ever directly spoke to the firm's lobbyist about the issue—an apparent reference to Iseman, though she is not named—"I'm sure I spoke to [Paxson]." McCain agreed that his letters on behalf of Paxson, a campaign contributor, could "possibly be an appearance of corruption"—even though McCain denied doing anything improper. McCain's subsequent letters to the FCC—coming around the same time that Paxson's firm was flying the senator to campaign events aboard its corporate jet and contributing $20,000 to his campaign—first surfaced as an issue during his unsuccessful 2000 presidential bid. William Kennard, the FCC chair at the time, described the sharply worded letters from McCain, then chairman of the Senate Commerce Committee, as "highly unusual."

Text Normalization (2)

Dr. Julia Hirschberg
Dept. of Computer Science
450 CS Bldg, M/C 0401
1214 Amsterdam Ave.
New York NY [email protected]: 212-939-7114
Fax: 212-666-0140
http://www.cs.columbia.edu/~julia/

Today

•Segmentation
•Tokenization
•Abbreviations
•Numbers
•TTS markup
•Concept to Speech

Segmentation: What is a sentence?

A sworn deposition that Sen. John McCain gave in a lawsuit more than 5 years ago appears to contradict one part of a sweeping denial that his campaign issued this week to rebut a New York Times story about his ties to a Washington lobbyist. On Wednesday night the Times published a story suggesting that McCain might have done legislative favors for the clients of the lobbyist, Vicki Iseman, who worked for the firm of Alcalde & Fay. One example it cited were two letters McCain wrote in late 1999 demanding that the Federal Communications Commission act on a long-stalled bid by one of Iseman's clients, Florida-based Paxson Communications, to purchase a Pittsburgh TV station. Just hours after the Times's story was posted, the McCain campaign issued a point-by-point response that depicted the letters as routine correspondence handled by his staff—and insisted that McCain had never even spoken with anybody from Paxson or Alcalde & Fay about the matter. "No representative of Paxson or Alcalde & Fay personally asked Senator McCain to send a letter to the FCC," the campaign said in a statement e-mailed to reporters.

Rule-based Approaches

•For a potential sentence-ending word w followed by a ‘.’
–If w is an abbreviation (e.g. ‘Mr’ or ‘Mrs’ or ‘Dr’ or ‘Sen’ or ….) then w does not end the sentence
–O.w. w ends the sentence
•How do we know whether w is an abbreviation?
•What if an abbreviation ends a sentence?
He works for Cisco, Inc.

Machine Learning Approaches

•Labeled data
–Mechanical Turk??
•What features best predict sentence boundaries?
•Is preceding word a known abbreviation?
•How long is preceding word?
•Is preceding word capitalized?
•Is succeeding word capitalized?
•….
•Create feature vectors for each potential boundary
•Apply ML algorithm to produce classifier
•Test on held-out data

Hybrid Approaches

•Combine rules (for ‘easy’ decisions) with ML
–Use rules to label initial corpus and build classifier, or
–Add rules directly to ML results

Tokenization: What is a word?

…On Wednesday night the Times published a story suggesting that McCain might have done legislative favors for the clients of the lobbyist, Vicki Iseman, who worked for the firm of Alcalde & Fay. One example it cited were two letters McCain wrote in late 1999 demanding that the Federal Communications Commission act on a long-stalled bid by one of Iseman's clients, Florida-based Paxson Communications, to purchase a Pittsburgh TV station. Just hours after the Times's story was posted, the McCain campaign issued a point-by-point response that depicted the letters as routine correspondence handled by his staff—and insisted that McCain had never even spoken with anybody from Paxson or Alcalde & Fay about the matter. "No representative of Paxson or Alcalde & Fay personally asked Senator McCain to send a letter to the FCC," the campaign said in a statement e-mailed to reporters. But that flat claim seems to be contradicted by an impeccable source: McCain himself. "I was contacted by Mr. [Lowell] Paxson on this issue," McCain said in the Sept. 25, 2002,
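
The rule-based test from the segmentation slides, and the boundary features listed under the machine learning approach, can be sketched together in a few lines of Python. This is a minimal illustration, not the lecture's own code. The function names (rule_label, boundary_features, split_sentences) are hypothetical, whitespace tokenization is an assumption, and the abbreviation list holds only the four examples the slides name.

```python
# Assumed abbreviation list: only the four examples named on the slide.
ABBREVIATIONS = {"Mr", "Mrs", "Dr", "Sen"}


def rule_label(tokens, i):
    """Slide rule: the word w before a '.' ends the sentence
    unless w is a known abbreviation."""
    w = tokens[i].rstrip(".")
    return w not in ABBREVIATIONS


def boundary_features(tokens, i):
    """Feature vector for the potential boundary after tokens[i],
    using the four questions listed on the ML slide."""
    w = tokens[i].rstrip(".")
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "prev_is_abbrev": w in ABBREVIATIONS,
        "prev_length": len(w),
        "prev_capitalized": w[:1].isupper(),
        "next_capitalized": nxt[:1].isupper(),
    }


def split_sentences(text):
    """Split on whitespace (an assumption) and cut after every
    token ending in '.' that the rule labels as a boundary."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith(".") and rule_label(tokens, i):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words with no final period
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    sample = ("I was contacted by Mr. Paxson on this issue. "
              "He works for Cisco, Inc. It is large.")
    for sentence in split_sentences(sample):
        print(sentence)
    # Rule labels paired with features could seed a classifier,
    # as the hybrid slide suggests.
    toks = sample.split()
    print(boundary_features(toks, 4), rule_label(toks, 4))
```

Because 'Inc' is not in the assumed list, "He works for Cisco, Inc." is split correctly here. Adding 'Inc' to the list would suppress that boundary, which is exactly the "abbreviation ends a sentence" problem the slide raises and one reason to back the rule with learned features.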

