Speech Processing 15-492/18-492
Multilinguality
SPICE: making it easier

Dealing with *all* Languages
• Over 6000 Languages
• Maybe not all commercially interesting … now
• Major languages (economic)
• Cell phone manufacturers list 46 languages
• But even those not all covered

Motivation
• Computerization: Speech is key technology
    • Mobile Devices, Ubiquitous Information Access
• Globalization: Multilinguality
    • More than 6000 Languages in the world
    • Multiple official languages
    • Europe has 20+ official languages
    • South Africa has 11 official languages
⇒ Speech Processing in multiple Languages
    • Cross-cultural Human-Human Interaction
    • Human-Machine Interface in mother tongue

Challenges
• Algorithms are language independent but require data
    • Dozens of hours of audio recordings and corresponding transcriptions
    • Pronunciation dictionaries for large vocabularies (>100,000 words)
    • Millions of words of written text corpora in the various domains in question
    • Bilingual aligned text corpora
• BUT: Such data is only available in very few languages
    • Audio data ≤ 40 languages; transcriptions take up to 40x real time
    • Large vocabulary pronunciation dictionaries ≤ 20 languages
    • Small text corpora ≤ 100 languages, large corpora ≤ 30 languages
    • Bilingual corpora in very few language pairs, pivot mostly English
• Additional complications:
    • Combinatorial explosion (domain, speaking style, accent, dialect, ...)
    • Few native speakers at hand for minority (endangered) languages
    • Languages without writing systems

Solution: Learning Systems
⇒ Systems that learn a language from the user
• Efficient learning algorithms for speech processing
    • Learning:
        • Interactive learning with user in the loop
        • Statistical modeling approaches
    • Efficiency:
        • Reduce amount of data (save time and costs): by a factor of 10
        • Speed up development cycles: days rather than months
⇒ Rapid Language Adaptation from universal models
• Bridge the gap: language and technology experts
    • Technology experts do not speak all languages in question
    • Native users are not in control of the technology

Sharing data between modules
[Diagram: Speech-to-Speech Translation between Lsource and Ltarget. Component labels: AMs, AMt; Dicts, Dictt (Word → phone sequence); LMs, LMt (N-grams); Lexst, Lexts (Word s ↔ Word t). Signal labels: Input Ls, Input Lt, Output Ls.]

SPICE
Speech Processing: Interactive Creation and Evaluation toolkit
• National Science Foundation, Grant 10/2004, 3 years
• Principal Investigators Tanja Schultz and Alan Black
• Bridge the gap between technology experts → language experts
• Automatic Speech Recognition (ASR),
• Machine Translation (MT),
• Text-to-Speech (TTS)
• Develop web-based intelligent systems
• Interactive Learning with user in the loop
• Rapid Adaptation of universal models to unseen languages
• SPICE webpage http://cmuspice.org

Spice Project Page

Speech Processing Systems
[Diagram labels: Input: Speech; AM, Lex, LM; NLP / MT; TTS; Output: Speech & Text ("Hello"); Phone set & Speech data; Pronunciation rules (hi /h/ /ai/, you /j/ /u/, we /w/ /i/); Text data ("hi you", "you are", "I am").]

Rapid Portability: Data
[Diagram labels: same as the previous diagram, without Pronunciation rules and Text data, and with a "+" next to Phone set & Speech data.]

Finding “Nice” Prompts
• From very large text databases
• Find “nice” sentences:
    • Containing only high frequency words
    • 5-15 words
• Find grapheme/phoneme balanced set
    • Select sentences with best triphone/graph
    • 500-1000 sentences
• Collect for ASR and TTS acoustic modeling

Prompt Selection Issues
• Need good text
    • De-htmlify, well-written, no misspelling
• Need word segmentation
    • Japanese, Chinese, Thai
• Natural text is often mixed language
    • Hindi Newspaper Text has lots of English words
• Automatic selection has errors
    • Need Speaker to do further selection
    • E.g. lots of telephone numbers, formatting commands
• CMU Arctic used similar methods

Recording Prompts

GlobalPhone

Multilingual
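
The criteria listed under Finding “Nice” Prompts (only high frequency words, 5-15 words, a grapheme-balanced set) can be sketched in a few lines of Python. This is a minimal illustration and not the SPICE or CMU Arctic implementation. It assumes whitespace word segmentation (which, as noted under Prompt Selection Issues, does not hold for Japanese, Chinese, or Thai), it uses character trigrams as a stand-in for triphones, and it picks sentences with a greedy coverage heuristic, which the slides do not specify. The function names and the `top_n_words` and `max_prompts` parameters are hypothetical.

```python
from collections import Counter


def char_trigrams(sentence):
    """Return the set of character trigrams in a sentence.

    Assumption: graphemes stand in for phones, so no lexicon is needed.
    """
    text = " ".join(sentence.lower().split())
    return {text[i:i + 3] for i in range(len(text) - 2)}


def select_prompts(sentences, top_n_words=5000, min_len=5, max_len=15,
                   max_prompts=1000):
    """Pick a small, trigram-rich prompt list from a large sentence list."""
    # Word frequencies over the whole corpus (whitespace tokens: an assumption).
    counts = Counter(w for s in sentences for w in s.lower().split())
    frequent = {w for w, _ in counts.most_common(top_n_words)}

    # Keep "nice" sentences: 5-15 words, all of them high frequency.
    candidates = []
    for s in sentences:
        words = s.lower().split()
        if min_len <= len(words) <= max_len and all(w in frequent for w in words):
            candidates.append((s, char_trigrams(s)))

    # Greedy coverage (a swapped-in heuristic): repeatedly take the sentence
    # that adds the most trigrams not yet covered by earlier picks.
    covered, selected = set(), []
    while candidates and len(selected) < max_prompts:
        best = max(candidates, key=lambda c: len(c[1] - covered))
        if not best[1] - covered:
            break  # no remaining sentence adds anything new
        selected.append(best[0])
        covered |= best[1]
        candidates.remove(best)
    return selected
```

Because automatic selection has errors, the output of a sketch like this would still go to a native speaker for further selection, for example to drop sentences full of telephone numbers or formatting commands.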