Are Learners Myna Birds to the Averaged Distributions of Native Speakers?

A Note of Warning from a Serious Speech Engineer

Nobuaki Minematsu
Graduate School of Frontier Sciences, The University of [email protected]

Current speech recognition technology consists of clearly separate modules: acoustic models, language models, a pronunciation dictionary, and a decoder. CALL systems often use the acoustic matching module to compare a learner's utterance to the templates stored in the system. The acoustic template of a phrase is usually calculated by collecting utterances of that phrase spoken by native speakers and estimating their averaged distribution. If phoneme-based comparison is required, phoneme-based templates must be prepared, and Hidden Markov Models are often adopted for training the templates. In this framework, a learner's utterance is acoustically and directly compared to the averaged distributions, and the notorious mismatch problem more or less inevitably arises. I wonder whether this framework is pedagogically sound enough. No children acquire language by imitating their parents' voices acoustically. Male learners do not have to produce female voices even when a female teacher asks them to repeat after her. What in a learner's utterance should be acoustically matched with what in a teacher's utterance? I consider that current speech technology does not have a good answer, and this paper proposes a good candidate answer by regarding speech as music.

1. Introduction

Many speech sounds are produced as standing waves in a vocal tract, and their acoustic properties depend on the shape of the vocal tube. No two speakers have the same tube, and therefore speech acoustics vary among them. The process of producing a vowel sound is similar to that of producing a sound with a wind instrument. A vocal tube is an instrument and, by changing its shape dynamically, a sequence such as /aiueo/ is generated.
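To make the wind-instrument analogy concrete, the sketch below uses a textbook idealization that is not taken from the paper: a uniform tube closed at one end (the glottis) and open at the other (the lips), whose resonances are F_n = (2n - 1) c / (4L). The function name, the speed of sound, and the two tube lengths are illustrative assumptions.

```python
# Assumed constant: approximate speed of sound in warm air, in cm/s.
SPEED_OF_SOUND_CM_PER_S = 35000.0

def tube_resonances(length_cm, count=3):
    """Resonances (Hz) of an idealized uniform tube, closed at one end.

    Quarter-wavelength model: F_n = (2n - 1) * c / (4 * L).
    A real vocal tract is not uniform, so this is only a rough picture.
    """
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
            for n in range(1, count + 1)]

# Two hypothetical speakers holding the "same" neutral tube shape.
longer_tube = tube_resonances(17.5)   # [500.0, 1500.0, 2500.0]
shorter_tube = tube_resonances(14.0)  # [625.0, 1875.0, 3125.0]
```

Under this idealization, shortening the tube by a factor of 1.25 raises every resonance by the same factor, so two speakers producing the same articulation still produce different acoustics.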
Different shapes cause different resonances, which in turn cause different timbres. Acoustic differences between speakers are due to differences in the shape of the tube. Those between the vowels of a single speaker arise for the same reason.

The aim of speech recognition is to extract only the linguistic information from speech. As speech contains both linguistic and extra-linguistic features, the current technology tries to extract only the linguistic information based on the following strategy:

$$g(\text{linguistic}) = \sum_{\text{extra-linguistic}} f(\text{linguistic}, \text{extra-linguistic}).$$

This is called collectionism, and HMMs are a typical example. IBM ViaVoice collected speech samples from 350 thousand American speakers. Many CALL products adopted ViaVoice as their speech recognition engine, and this number is even used in advertisements [1]. As far as I know, however, no children acquire the ability to recognize speech after hearing 350 thousand speakers. A major part of the speech an infant hears comes from its father and mother. After the infant begins to talk, as the speech chain implies, about half of the speech it hears will be its own. It is completely impossible for a human hearer to experience a speaker-balanced speech corpus, yet collectionism requires exactly that for machines.

Why is a large corpus covering an enormous number of speakers needed? This is because current speech technology does not have a good way to remove the speaker information from speech. Pitch information can be removed effectively by smoothing a given spectrum slice. Similarly, is there any good method to remove the extra-linguistic information from speech? What I am discussing is not normalization or adaptation with respect to speakers. Spectrum smoothing is not a technique for normalizing pitch but for removing it. Given a smoothed spectrum, it is difficult to guess the pitch information included in the original speech. Is there any speech representation from which it is difficult to guess who generated the speech sample?
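The claim that smoothing removes pitch, rather than normalizing it, can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the author's method: it uses a plain moving average as the smoother, a made-up single-resonance envelope, and synthetic spectra that are nonzero only at multiples of the fundamental frequency. All names and constants are hypothetical.

```python
import math

BIN_HZ = 10                                   # assumed frequency resolution
FREQS = [BIN_HZ * i for i in range(1, 401)]   # 10 Hz ... 4000 Hz

def envelope(f):
    """Hypothetical vocal-tract envelope: one broad resonance near 1 kHz."""
    return math.exp(-0.5 * ((f - 1000.0) / 600.0) ** 2)

def harmonic_spectrum(f0):
    """Toy voiced spectrum: the envelope sampled only at multiples of f0."""
    return [envelope(f) if f % f0 == 0 else 0.0 for f in FREQS]

def smooth(spectrum, width_bins=40):
    """Moving sum over full 400 Hz windows, rescaled so the peak equals 1."""
    sums = [sum(spectrum[i:i + width_bins])
            for i in range(len(spectrum) - width_bins + 1)]
    peak = max(sums)
    return [s / peak for s in sums]

def max_abs_diff(a, b):
    """Largest bin-wise difference between two equally long spectra."""
    return max(abs(x - y) for x, y in zip(a, b))

# Same envelope, two different pitches (100 Hz and 200 Hz).
low, high = harmonic_spectrum(100), harmonic_spectrum(200)

raw_gap = max_abs_diff(low, high)                        # harmonics sit in different bins
smoothed_gap = max_abs_diff(smooth(low), smooth(high))   # the harmonic comb is averaged out
```

Before smoothing, the two spectra disagree strongly wherever only one of them has a harmonic. After smoothing, they nearly coincide, and the spacing of the harmonics can no longer be read off the result. The open question raised in the text is whether an analogous operation exists for speaker identity.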
If one hears speech sounds, one can guess who produced them. This means that the desired representation should not include any factors from which the sound substances can be reconstructed, but should indicate only the linguistic skeleton of spoken language.

Developmental psychology tells us that infants acquire spoken language by imitating the speech of their parents, which is called vocal imitation [2]. But no infants try to imitate the voices themselves. As they have little phonemic awareness [3], they cannot identify a sound as a phoneme, although they can discriminate two different sounds. Namely, they cannot decode speech into a sequence of phonemes or convert phonemes into sounds. In this situation, what in a father's speech is acoustically imitated by infants? Some researchers claim that infants first learn the holistic sound pattern of a word [2], called the word Gestalt. Then, what is the acoustic definition of that word Gestalt? If it included speaker information, many infants would try to produce their fathers' voices. This consideration indicates that the word Gestalt has to be speaker-invariant. But what is it acoustically? I put this question to many researchers at several conferences on infant study [4], but none gave me a definite answer. If the word Gestalt could be defined acoustically, I wonder whether it might be the linguistic skeleton.

No infants imitate voices, but myna birds imitate not only voices but also many sounds of cars, doors, animals, etc. Hearing a good myna bird say something, one can guess its keeper [5]. Hearing a very good child say something, however, one cannot guess its keeper. If one trains a myna bird to be a better imitator, the bird's voice and the target sound will be acoustically and directly compared and, to reduce the difference, further training will be done. Most CALL systems directly compare an input utterance to the averaged distributions of many native speakers.
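The direct comparison described above can be sketched in a few lines. This is a deliberately simplified stand-in: real systems typically use HMMs over frame sequences, whereas the sketch fits a single diagonal Gaussian to one feature vector per speaker. The function names and every number below are hypothetical.

```python
import math

def fit_template(native_vectors):
    """Averaged distribution: per-dimension mean and variance over speakers."""
    n = len(native_vectors)
    dims = range(len(native_vectors[0]))
    means = [sum(v[d] for v in native_vectors) / n for d in dims]
    variances = [sum((v[d] - means[d]) ** 2 for v in native_vectors) / n
                 for d in dims]
    return means, variances

def log_likelihood(vector, template):
    """Diagonal-Gaussian log-likelihood of one feature vector."""
    means, variances = template
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
               for x, mu, var in zip(vector, means, variances))

# Hypothetical two-dimensional features (e.g. two resonance frequencies in Hz).
native = [[700.0, 1200.0], [750.0, 1250.0], [800.0, 1300.0], [750.0, 1250.0]]
template = fit_template(native)

learner_close = [760.0, 1240.0]     # acoustically near the native average
learner_shifted = [937.5, 1562.5]   # the native average scaled by 1.25

score_close = log_likelihood(learner_close, template)
score_shifted = log_likelihood(learner_shifted, template)
```

In this toy setting, `score_shifted` is far lower than `score_close` even though the second learner's features are simply the native average scaled by a constant, as a different vocal tube might produce. The score rewards acoustic closeness to the averaged voice, which is the behaviour one would want from a trainer of myna birds.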
This fact simply means that the systems assume that a learner is a myna bird to the averaged distributions of the native speakers. Is this assumption correct and pedagogically sound enough?

The problem I am addressing is one of the fundamental but unsolved questions in speech science: the variability of speech acoustics and the invariance of speech perception [6]. I consider that, as this problem remains unsolved, all the technical discussions have had to be based on collectionism. In the following sections, I propose a novel framework which can solve this problem by considering some

