POCKETSPHINX: A FREE, REAL-TIME CONTINUOUS SPEECH RECOGNITION SYSTEM FOR HAND-HELD DEVICES

David Huggins-Daines, Mohit Kumar, Arthur Chan, Alan W Black, Mosur Ravishankar, and Alex I. Rudnicky*

Carnegie Mellon University
Language Technologies Institute
5000 Forbes Avenue, Pittsburgh, PA, USA 15213
(dhuggins, mohitkum, archan, awb, rkm, air)@cs.cmu.edu

ABSTRACT

The availability of real-time continuous speech recognition on mobile and embedded devices has opened up a wide range of research opportunities in human-computer interactive applications. Unfortunately, most of the work in this area to date has been confined to proprietary software, or has focused on limited domains with constrained grammars. In this paper, we present a preliminary case study on the porting and optimization of CMU SPHINX-II, a popular open source large vocabulary continuous speech recognition (LVCSR) system, to hand-held devices. The resulting system operates in an average 0.87 times real-time on a 206 MHz device, 8.03 times faster than the baseline system. To our knowledge, this is the first hand-held LVCSR system available under an open-source license.

1. INTRODUCTION

Mobile, embedded, and hands-free speech applications fundamentally require continuous, real-time speech recognition. Consider, for example, an intelligent, interactive personal information assistant in which natural speech replaces the cumbersome stylus input and cramped graphical user interface of a PDA. Many current applications, such as speech control of GPS navigation systems and speech-controlled song selection for portable music players and car stereos, also require a reliable and flexible speech interface. Finally, sophisticated natural language applications such as handheld speech-to-speech translation [1] require fast and lightweight speech recognition.

Several technical challenges have hindered the deployment of such applications on embedded devices.
The most difficult of these is the computational requirement of continuous speech recognition for a medium-to-large-vocabulary scenario. The need to minimize the size and power consumption of these devices leads to compromises in their hardware and operating system software that further restrict their capabilities below what one might assume from their raw CPU speed. For example, embedded CPUs typically lack hardware support for floating-point arithmetic. Moreover, memory, storage capacity, and bandwidth on embedded devices are also very limited. For these reasons, much of the past work (e.g. [2], [3]) has concentrated on simple tasks with restrictive grammars.

* This research was supported by DARPA grant NB CH-D-03-0010. The content of the information in this publication does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.

In addition to hardware limitations, interested developers face a high barrier to building such systems: it requires access to proprietary speech recognition toolkits, which are often expensive and usually provided without source code. As well, popular embedded operating systems may lack many of the features developers take for granted on modern desktop systems, most notably a complete standard C/C++ programming library and a fast virtual memory subsystem.

POCKETSPHINX is the authors' attempt to address the above issues. Our work builds on previous research in the Carnegie Mellon Speech group related to fast search techniques ([4] and [5]) and fast GMM computation techniques ([6], [7], and [8]). We believe that this work will benefit the development community and lead to the easier creation of interesting speech applications. Therefore, we have made this work available to the public without cost under an open-source license. To the best of our knowledge, this is the first open-source embedded speech recognition system capable of real-time, medium-vocabulary continuous speech recognition.
2. BASELINE SPHINX-II SYSTEM

The target hardware platform for this work was the Sharp Zaurus SL-5500 hand-held computer. The Zaurus is typical of the previous generation of hand-held PCs, having a 206 MHz StrongARM processor, 64 MB of SDRAM, 16 MB of flash memory, and a quarter-VGA color LCD screen. We chose this particular device because it runs the GNU/Linux operating system, simplifying the initial port of our system. However, the CPU speed and memory capacity of this device are several years behind the current state of the art, making it commensurately more difficult to achieve the desired level of performance. To build our system, we used a GCC 3.4.1 cross-compiler built with the crosstool script (http://kegel.com/crosstool/).

Platform speed directly affected our choice of a speech recognition system for our work. Though all the members of the SPHINX recognizer family have well-developed programming interfaces, and are actively used by researchers in fields such as spoken dialog systems and computer-assisted learning, we chose the SPHINX-II recognizer (http://www.cmusphinx.org/) as our baseline system because it is faster than the other recognizers currently available in the SPHINX family.

To evaluate our system's performance, we used 400 utterances randomly selected from the evaluation portion of the DARPA Resource Management (RM-1) corpus. The acoustic model uses Hidden Markov Models with a 5-state Bakis topology and semi-continuous output probabilities. It was trained on 1600 utterances from the RM-1 speaker-independent training corpus, using 256 tied Gaussian densities, 1245 tied Gaussian Mixture Models (senones), and 39570 context-dependent triphones. The input features consisted of four independent streams: MFCC features, delta and delta-delta MFCCs, and power. A bigram statistical language model was used, with a vocabulary of 994 words and a language weight of 9.5.
The test-set perplexity of this language model is 50.86.

On our development workstation, a 3 GHz Intel Pentium 4 running GNU/Linux, SPHINX-II runs this task in 0.06 xRT. After the first stage of porting the system to the Zaurus target platform, without applying any optimizations, the same task takes 7.15 xRT. The baseline word error rate is 9.73%. Clearly, this is much too slow to be useful for even the simplest recognition tasks.

3. PLATFORM OPTIMIZATIONS

In the next stage of development, we investigated potential speed-ups based on our knowledge of the hardware platform. First, we noted that for embedded devices, memory access is slow and RAM is at a premium. We made several changes to the system to

