SPEECH COMPARISON IN The Rosetta Stone™

ABSTRACT

The Rosetta Stone™ is a successful CD-ROM-based interactive program for teaching foreign languages that uses speech comparison to help students improve their pronunciation. The input to a speech comparison system is N+1 digitized utterances. The output is a measure of the similarity of the last utterance to each of the N others. Which language is being spoken is irrelevant. This differs from classical speech recognition, where the input data includes but one utterance, a set of expectations tuned to the particular language in use (typically digraphs or similar), and a grammar of expected words or phrases, and the output is recognition in the utterance of one of the phrases in the grammar (or rejection). This paper describes a speech comparison system and its application in The Rosetta Stone™.

1. INTRODUCTION

Funding for this research came from the developers [1] of The Rosetta Stone™ (TRS), a highly successful interactive multimedia program for teaching foreign languages. The developers wanted to use speech recognition technology to help students of foreign languages improve their pronunciation and their active vocabulary. As of this writing TRS is available in twenty languages, which was part of the motivation to develop a language-independent approach to speech recognition. Classical approaches require extensive development per language.

TRS provides an immersion experience, where images, movies and sounds are used to build knowledge of a language from scratch. Since there is no concession to the native language of the learner, a German speaker and a Korean speaker both learning Vietnamese have the same experience: all in Vietnamese.

The most recent release of TRS includes EAR, the speech comparison system described in this paper. The input to a speech comparison system is N+1 digitized utterances. The output is a measure of the similarity of the last utterance to each of the N others. Which language is being spoken is irrelevant. This differs from classical speech recognition, where the input data includes an utterance, a set of expectations tuned to the particular language in use (typically digraphs or similar), and a grammar of expected words or phrases, and the output is recognition in the utterance of one of the phrases in the grammar, or rejection.
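To make this N+1 input/output contract concrete, here is a minimal sketch in C (the language EAR is implemented in) of what a comparison routine with this shape might look like. Every name below, and the toy RMS-energy similarity, are hypothetical illustrations; the paper does not publish EAR's actual interface or scoring method.

#include <stddef.h>
#include <math.h>

typedef struct {
    const short *samples;   /* digitized audio, e.g. 22050 Hz PCM */
    size_t       n_samples;
} Utterance;

/* Root-mean-square energy: a crude stand-in for the per-frame
 * features EAR actually computes (Section 2). */
static double rms(const Utterance *u)
{
    double sum = 0.0;
    for (size_t i = 0; i < u->n_samples; i++)
        sum += (double)u->samples[i] * (double)u->samples[i];
    return u->n_samples ? sqrt(sum / (double)u->n_samples) : 0.0;
}

/* The N+1 contract: score the student's utterance against each of
 * the n native utterances. Similarity here is mere closeness of RMS
 * energy, only to make the input/output shape concrete. */
void speech_compare(const Utterance natives[], size_t n,
                    const Utterance *student, double scores_out[])
{
    double s = rms(student);
    for (size_t i = 0; i < n; i++) {
        double d = fabs(rms(&natives[i]) - s);
        scores_out[i] = 1.0 / (1.0 + d);   /* higher = more similar */
    }
}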
The TRS CD-ROM contains tens of thousands of utterances by native speakers. Thus the TRS data set already included the necessary input for speech comparison, but not for classical speech recognition. The first application we developed was a pronunciation guide (see Fig. 1). The user clicks on a picture, hears a native speaker's utterance, attempts to mimic that utterance, sees a display of two images visually portraying the two utterances, and observes a gauge which shows a measure of the similarity between the two utterances. The system normalizes both voices (native speaker's and student's) to a common standard, and displays various abstract, or at least highly processed, features of the normalized voices, so that differences irrelevant to speech (such as how deep your voice is) do not play a role.

The second application, currently under development, is active vocabulary building. The user sees four pictures and hears four phrases semantically related to the pictures. This is material they have already worked over in other learning modes designed to build passive vocabulary, i.e. the ability to recognize the meaning of speech. However, in this exercise the user must be able to generate the speech with less prompting. The order of the pictures is scrambled, and they are flashed one at a time. The user must respond to each with the phrase that was given for that picture. The system evaluates their success, i.e. whether they responded with the correct phrase, one of the other phrases, or some unrelated utterance. One difficulty for the system is that frequently the four phrases are very similar, so that the difference between them might hinge on a short piece in the middle of otherwise nearly identical utterances (for example, "the girl is cutting the blue paper" versus "the girl is cutting the red paper").

EAR is written in C. Since TRS is written in Macromedia Director™, EAR is interfaced to TRS using Director's interface for extending Director with C code. TRS is multithreaded, so EAR is able to do its work incrementally, as it must not take the CPU for extended periods of time. Indeed, EAR itself contains multiple threads of two kinds: description threads and comparison threads.

Since the system might load several prerecorded utterances of native speakers at once, it is desirable that the work of computing the normalized high-level description of each utterance be done in parallel, while the user is listening to those utterances. Thus each stream of sound data (22050 Hz sound samples) is analyzed by a separate description thread, with a real-time visual display as an option. Similarly, sound data from the microphone is analyzed in real time by a description thread while the student is speaking, and the resulting visual display is updated in real time. Description threads are discussed in Section 2.

Once the user has finished speaking, a comparison thread can be launched for each of the native speaker descriptions; these threads compare those descriptions to the description of the student's utterance. Comparison threads are discussed in Section 3.

2. UTTERANCE DESCRIPTION

An EAR utterance description is a vector of feature vectors. Of these, only pitch, emphasis and a dozen spectral features are portrayed in the visual display. An utterance description contains one feature vector for each 1/100 of a second of the utterance.
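A minimal sketch of how such a description might be laid out as a C data structure follows. The 10 ms frame rate and the pitch/emphasis/dozen-spectral-features split come from the text above; the type names, field layout, and allocation helper are assumptions made purely for illustration.

#include <stdlib.h>

#define FRAMES_PER_SEC 100   /* one feature vector per 1/100 s */
#define N_SPECTRAL      12   /* "a dozen spectral features" */

/* Hypothetical layout; only the frame rate and the named features
 * come from the text. */
typedef struct {
    float pitch;                 /* fundamental-frequency estimate */
    float emphasis;              /* stress/loudness-like measure */
    float spectral[N_SPECTRAL];  /* the displayed spectral features */
    /* The real descriptions may hold further features that are not
     * portrayed in the visual display; they are omitted here. */
} FeatureVector;

typedef struct {
    FeatureVector *frames;   /* one entry per 10 ms of the utterance */
    size_t         n_frames;
} UtteranceDescription;

/* Allocates an all-zero description for an utterance of the given
 * duration in seconds. */
static UtteranceDescription *description_new(double seconds)
{
    UtteranceDescription *d = malloc(sizeof *d);
    if (!d) return NULL;
    d->n_frames = (size_t)(seconds * FRAMES_PER_SEC);
    d->frames = calloc(d->n_frames, sizeof *d->frames);
    if (!d->frames) { free(d); return NULL; }
    return d;
}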


2.1 Filters

Description of a sound stream begins with 48 tuned …
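This excerpt does not say how those 48 filters are tuned or realized. As one plausible, purely illustrative reading, the sketch below builds a bank of 48 second-order band-pass resonators (biquads) over the 22050 Hz sample stream mentioned in Section 1; the filter type, center frequencies, and bandwidths are all assumptions, not details from the paper.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define SAMPLE_RATE 22050.0
#define N_FILTERS   48

/* One second-order band-pass resonator (an RBJ-style biquad). The
 * choice of biquads, and the tuning below, are assumptions. */
typedef struct {
    double b0, b1, b2, a1, a2;  /* normalized coefficients */
    double x1, x2, y1, y2;      /* filter state (previous samples) */
} Biquad;

static void biquad_bandpass_init(Biquad *f, double f0, double q)
{
    double w0 = 2.0 * M_PI * f0 / SAMPLE_RATE;
    double alpha = sin(w0) / (2.0 * q);
    double a0 = 1.0 + alpha;
    f->b0 = alpha / a0;
    f->b1 = 0.0;
    f->b2 = -alpha / a0;
    f->a1 = -2.0 * cos(w0) / a0;
    f->a2 = (1.0 - alpha) / a0;
    f->x1 = f->x2 = f->y1 = f->y2 = 0.0;
}

/* Advances one filter by one input sample, returning its output. */
static double biquad_step(Biquad *f, double x)
{
    double y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
             - f->a1 * f->y1 - f->a2 * f->y2;
    f->x2 = f->x1; f->x1 = x;
    f->y2 = f->y1; f->y1 = y;
    return y;
}

/* Tunes 48 filters, here spaced logarithmically from ~100 Hz to
 * ~8 kHz (an assumed tuning; the paper's actual center frequencies
 * are not given in this excerpt). */
static void init_filter_bank(Biquad bank[N_FILTERS])
{
    double lo = 100.0, hi = 8000.0;
    for (int i = 0; i < N_FILTERS; i++) {
        double f0 = lo * pow(hi / lo, (double)i / (N_FILTERS - 1));
        biquad_bandpass_init(&bank[i], f0, 8.0);
    }
}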
