CMU CS 15492 - multilingual



Speech Processing 15-492/18-492
Multilinguality

Dealing with *all* Languages
- Over 6000 languages
  - Maybe not all commercially interesting … now
- Major languages (economic)
  - Cell phone manufacturers list 46 languages
  - But even those are not all covered

What you need
- ASR
  - Acoustic model (lots of speakers)
  - Pronunciation lexicon
  - Language model
- TTS
  - Acoustic model (one speaker)
  - Pronunciation lexicon
  - Text analysis

Writing Systems
- Romanized writing systems
  - Latin-1 (ISO-8859-1)
  - Covers many Western European languages
- Cyrillic
  - Covers many Eastern European languages
- Arabic scripts
  - Arabic(s), Farsi, Urdu, etc.
- Devanagari
  - Covers many Northern India languages
- Chinese Hanzi
  - Covers some Chinese dialects, but in different versions
- Many other scripts, some non-standard

Writing Systems
- Letter based
  - Latin, Cyrillic
- Consonant based
  - Arabic, Hebrew
- Mora based
  - Half syllable or syllable
  - Indian scripts, Japanese native scripts
- Syllable based
  - Hangul, Chinese

Standards
- Writing standards
  - Taught at schools, newspapers, computer support
  - Typically standardized spelling
- May be mostly spoken
  - Occasionally written

Language Specific Issues
- No explicit markings
  - Stress, accent, tones
- No word boundaries
  - Chinese, Thai
- No (short) vowels
  - Arabic, Hebrew
- Rich morphology
  - Many different words in the languages
  - Finnish, Turkish, Greenlandic

Genre Specific Issues
- No capitals, punctuation
  - Unpunctuated
- Plain vs polite form
- Speech vs text form
- Many foreign phrases
  - (technology directed genres)
- Many new abbreviations
  - E.g. SMS messages

Character Encoding
- Unicode vs utf8 vs latin
  - Documents mix them
- Sometimes accents omitted
  - For ease of typing
- Lots of standards
  - Unicode, EUC, BIG5, TIS42, …
- Everyone has their own standard
  - Some create their own standards
  - Mixed character sets

Phoneme Sets
- Hard to find consensus for new languages
  - Typically lots of different dialects
- What level of distinction?
  - Some good for speech but not really phonetic
  - /t/ vs /dx/ in “water”
- Often doesn’t include foreign phones
  - /w/ in German is common for younger people

Words
- May be hard to define
  - No word boundaries
- Rich morphology
  - Words have many variations of compounds
  - Yomenakatta -> could not read
  - Yomemasendeshita -> could not read (polite)
- Gender specific speech
  - Boku vs atashi
- Language mixtures

Pronunciation lexicons
- “proper” speech vs “actual” speech
- Hard to generalize
  - Chinese
- Cross lingual pronunciations
  - “Human” (English/German)

“Industry” way
- Collect at least 100 hours of spoken speech
  - At least 20 different speakers
  - Mixture of gender, age, etc.
  - Through desired channel (phone/desktop)
- Collect at least 5 hours from one speaker
  - High quality recording studio
- Data should be targeted to application
- Build pronunciation lexicon
  - Expert phonologist

Industry way
- Probably 3-6 months
  - Lead developer
  - Local language expert
  - Lots of human transcribers
- Costs?
  - Many hundreds of thousands

Or cheaper (?) …
- Find existing data
  - Linguistic Data Consortium (UPenn)
  - ELRA (European equivalent)
  - Appen, Australia
- Find local people who have collected data
- Found data might be in wrong format
  - Data cleaning is often the most expensive

Actual way
- Often mixture
  - Found data for initial model
  - Collect data with actual/initial application

Multilingual
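
Example (not from the slides): the Character Encoding points about documents mixing utf8 and latin, and about accents being omitted for ease of typing, can be sketched in Python roughly as below. The function names decode_mixed and fold_accents are hypothetical, and treating Latin-1 as the only fallback is an assumption; real found data may need other encodings such as EUC or BIG5.

```python
import unicodedata


def decode_mixed(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Latin-1 (assumed legacy encoding)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never raises,
        # but it can be the wrong guess for non-Western text.
        return raw.decode("latin-1")


def fold_accents(text: str) -> str:
    """Drop combining accents so accented and unaccented spellings match."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Two lines of "found data", one per encoding, as a mixed corpus might contain.
utf8_line = "café".encode("utf-8")
latin1_line = "café".encode("latin-1")
decoded = [decode_mixed(b) for b in (utf8_line, latin1_line)]
print(decoded)
print(fold_accents("café") == "cafe")
```

Accent folding is only suitable for matching lexicon entries against sloppy input; the accented form should still be kept for pronunciation, since the accent often carries phonetic information.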

