Speech Processing 15-492/18-492
Speech Synthesis
Overview
Text processing

Speech Synthesis
  • From text to speech
  • Text Analysis
      • Strings of characters to words
  • Linguistic Analysis
      • From words to pronunciations and prosody
  • Waveform Synthesis
      • From pronunciations to waveforms

Text Analysis
    This is a pen.
    My cat who lives dangerously has nine lives.
    He stole $100 from the bank.
    He stole 1996 cattle on 25 Nov 1996.
    He stole $100 million from the bank.
    It's 13 St. Andrew St. near the bank.
    Its a PIII 1.5Ghz, 512MB RAM, 160Gb SATA, (no IDE) 24x cdrom and 19" LCD.
    My home pgae is http://www.geocities.com/awb/.

Email
    from [email protected] ("Alan W Black") on Thu 23 Nov 15:30:45:
    >
    > ... but, *I* wont make it :-) Can you tell me who's going?
    >
    IMHO I think you should go, but I think the followign are going
    George Bush
    Bill Clinton
    and that other guy
    Bob
    --
    [ASCII-art signature box: Bob Beck, E-mail [email protected], "Alba gu brath"]

Text Analysis Tasks
  • Character encodings:
      • Latin-1, iso-8859-1, utf-8 (or special)
  • Find tokens
      • White space separated
  • Chunk into reasonably sized chunks
      • Sort of sentences
  • Map tokens to words
      • Disambiguate token types
      • Numbers

Chunking
  • Making reasonable sized sections
  • Something to do with full stops …

    Hi Alan,
    I went to the conference. They listed you as Mr. Black when we
    know you should be Dr. Black days ahead for their research.
    Next month I'll be in the U.S.A. I'll try to drop by C.M.U.
    if I have time.
    bye
    Dorothy
    Institute of XYZ
    University of Foreign Place
    email: [email protected]

[…] analysis
  • Normal words
      • Homographs, OOVs
  • Numbers
      • Years, quantities, digits, addresses
  • Other standard forms
      • Dates, times, money
  • Abbreviations and Letter Sequences
      • NASA, CIA, SATA, IDE
  • Spelling errors (choices)
      • Sooooo, … , colour, collor
  • Punctuation
      • :-) quotes, dashes, ascii art,
  • Text layout

Finding Words
  • White space separated tokens
      • But---if I may interject---not all word(s) are like that
      • Wean-Hall-like architecture
  • Some languages don’t use spaces
      • Chinese, Japanese, Thai
  • Some languages use lots of compounding
      • unspaced multiwords

Homographs
  • Same writing, different pronunciation
      • (Homophones: same pronunciation different writing. “to” “two” “write” “right”)
  • English: not many:
      • Stress shift (Noun/Verb)
          • Segment, project, convict
      • Semantic
          • Bass, read, Begin, bathing, lives, Celtic, wind, Reading, sun, wed, …
      • Roman Numerals

Non-standard Words (NSW)
  • Words not in the lexicon

    Text Type      %NSW
    Novels          1.5%
    Press wire      4.9%
    Email          10.7%
    Recipes        13.7%
    Classifieds    27.9%
    IM             20.1%

Distribution of NSW
  • 3yrs News text, 2.2M tokens 120K NSWs

    Major type    Minor type    % of NSW
    Numeric       Number        26%
                  Year           7%
                  Ordinal        3%
    Alphabetic    As word       30%
                  As letters    12%
                  As Abbrev      2%

Processing NSWs
  • How hard are they?
      • Finding them
      • Identifying them
      • Expanding them
  • Current processing techniques
      • Ignored
      • Lexical lookup
      • Hacky hand-written rules
      • (not so) Hacky hand-written rules
      • Statistically train models (and hacky hand written rules)

Homograph Disambiguation (Yarowsky)
  • Same tokens in different contexts
  • Identify target homograph
      • E.g. numbers, roman numerals, “St”
  • Find instances in large text corpora
  • Hand label them with correct answer
  • Train a decision tree to
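
To make the “St” homograph concrete, here is a minimal sketch of the hand-written-rule style of processing listed under current techniques. It is not the trained decision tree from the Yarowsky-style procedure, only a rule baseline of the kind such a tree would be compared against. The function names `expand_st` and `is_name` and the capitalisation heuristic are assumptions made for illustration; they do not come from the slides.

```python
def is_name(tok):
    """Assumed heuristic: a capitalised alphabetic token looks like a proper name."""
    return tok[:1].isupper() and tok.rstrip(".,").isalpha()


def expand_st(tokens):
    """Expand "St"/"St." to "saint" or "street" using the neighbouring tokens.

    Hand-written rules (illustrative only):
      - followed by a name  -> "saint"   (St. Andrew)
      - preceded by a name  -> "street"  (Andrew St.)
      - otherwise leave the token unchanged
    """
    out = []
    for i, tok in enumerate(tokens):
        if tok.rstrip(".") != "St":
            out.append(tok)
            continue
        prev_tok = tokens[i - 1] if i > 0 else ""
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        if is_name(next_tok):
            out.append("saint")
        elif is_name(prev_tok):
            out.append("street")
        else:
            out.append(tok)  # undecided: no rule fired
    return out


# White-space tokenisation, as in the "Find tokens" step.
sentence = "It's 13 St. Andrew St. near the bank."
print(" ".join(expand_st(sentence.split())))
```

On the example sentence from the Text Analysis slide, the first “St.” is followed by a capitalised word and the second is preceded by one, so the two rules pick different expansions. Rules like these break easily (for instance when both neighbours are capitalised), which is the motivation for labelling corpus instances and training a model instead.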