Speech Sound Production: Recognition Using Recurrent Neural Networks
Eric Nutt, ECE 539, December 2003

Abstract: In this paper I present a study of speech sound production and of methods for speech recognition systems. One method for extracting important speech sound features is presented, along with a possible full-scale recognition system implementation using recurrent neural networks. Neural network testing results are examined, and suggestions for further research and testing are given at the end of the paper.

Speech Sound Mechanisms

Anatomy responsible for speech production: the diaphragm, lungs, throat, mouth, and nasal cavity.

Processes controlling speech production:
- Phonation: converting air pressure into sound via the vocal folds
- Resonation: emphasizing certain frequencies through resonances in the vocal tract
- Articulation: changing the vocal tract resonances to produce distinguishable sounds

Air forced up from the lungs by the diaphragm passes through the vocal folds at the base of the larynx. Sound is produced by vibrations of the vocal folds, and that sound is then filtered by the rest of the vocal tract. This sound production system acts much like a cavity resonator, whose resonant frequency is given by

    F0 = (c / 2π) · √(A / (L′V))

where c is the speed of sound, A is the area of the opening, L′ is the effective length of the neck, and V is the volume of the cavity.

Phonology

The study of phonemes, the smallest distinguishable speech sounds. Phonemes can be separated in different ways, but one common way is by manner of articulation, which breaks phonemes into groups based on the location and shape of the vocal articulators. One of the most important properties of a particular phoneme is its voicing: if the vocal folds are used to produce the sound, it is said to be voiced; otherwise it is unvoiced.

Articulators: the lip opening, the shape of the body of the tongue, and the location of the tip of the tongue.

Phoneme groups:
- Fricatives: constant restriction of airflow, e.g. [f] as in foot and [v] as in view
- Stops/Plosives: complete restriction of airflow followed by a release, e.g. [p] as in pie and [b] as in buy
- Affricates: a stop followed by a fricative, e.g. [č] as in chalk and [ǰ] as in gin

[Figure: waveform and spectrum of /s/, an unvoiced fricative, and of /v/, a voiced fricative]

Speech Feature Extraction

[Block diagram: Speech Waveform → Windowing → |FFT|² → Mel Filter Bank → Logarithm → DCT → MFCC]

1. Record the speech waveform at 20 kHz, because human speech reaches only about 10 kHz (so sampling at twice that satisfies the Nyquist criterion).
2. Select a small section (20 to 30 ms) representing the phone of interest.
3. Break it into 100 or more overlapping sections and apply a Hamming window to each section.
4. Calculate the 256-point |FFT|² power spectrum of each section. Discard the phase information, because studies show that perception is based on magnitude.
5. Take the logarithm, because humans hear loudness on an approximately logarithmic scale.
6. Apply a mel-frequency filter bank to enhance perceptually important frequencies and reduce the feature dimensions.
7. Average over time to reduce the time dimension.
8. Take the discrete cosine transform of the time-averaged spectrum to produce the mel-frequency cepstral coefficients (MFCCs). Keep the first 13 to 15 coefficients, as they contain nearly all of the energy of the spectrum.

Mel-Filter Bank

Purpose: enhance perceptually important frequencies and reduce the feature size by applying a bank of filters to each spectrum.

Making the filter bank: take about 40 linearly spaced points on the mel-frequency scale and convert them to the regular frequency scale using

    f_Hz = 700 · (10^(f_mel / 2595) − 1)

Use these points as the peaks of the filters.

Applying the filter bank: multiply each filter by the spectrum values in the spectrum index range covered by that filter, then sum the results.

Result: the spectrum dimension is reduced to the number of filters in the filter bank. The lower frequencies are filtered at a higher resolution, enhancing these perceptually more important frequencies.

[Figure: triangular mel filter bank, filters F1 … F40]
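The slides describe this recipe only in prose and diagrams. Below is a minimal NumPy sketch of the filter-bank construction and the feature-extraction pipeline, using the slides' numbers: 20 kHz sampling, a 256-point FFT, 40 filters, and 13 retained coefficients. The function names, the 128-sample window length, and the hop-size rule are illustrative assumptions; note also that the sketch applies the filter bank before the logarithm, following the conventional order in the block diagram, while the numbered steps above list the logarithm first.

```python
import numpy as np
from scipy.fftpack import dct  # DCT-II for the final cepstral step

def hz_to_mel(f_hz):
    """Forward mel mapping (the inverse of the formula on the slide)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Mapping from the slide: f_Hz = 700 * (10**(f_mel/2595) - 1)."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def mel_filter_bank(n_filters=40, n_fft=256, fs=20000):
    """Triangular filters whose peaks sit at ~40 linearly spaced mel points."""
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    # Map each edge/peak frequency to an FFT bin index.
    bins = np.floor((n_fft // 2) * hz_points / (fs / 2.0)).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                  # rising edge
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                 # falling edge
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(segment, fs=20000, n_fft=256, win=128, n_frames=100,
         n_filters=40, n_keep=13):
    """Steps 2-8 of the slide: window -> |FFT|^2 -> mel bank -> log -> average -> DCT."""
    hop = max(1, (len(segment) - win) // n_frames)     # >= 100 overlapping sections
    frames = np.array([segment[i:i + win] * np.hamming(win)   # Hamming window
                       for i in range(0, len(segment) - win, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2    # 256-pt power spectrum, phase discarded
    mel_energy = power @ mel_filter_bank(n_filters, n_fft, fs).T
    log_mel = np.log(mel_energy + 1e-10)               # loudness is roughly logarithmic
    averaged = log_mel.mean(axis=0)                    # average over time
    return dct(averaged, norm='ortho')[:n_keep]        # keep the first 13 coefficients
```

Given a 20 to 30 ms phone segment sampled at 20 kHz, mfcc(segment) returns a 13-element feature vector; skipping the time average in step 7 would instead yield the coefficients-by-time matrix that the recurrent networks below consume.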
Neural Network Recognition

Intro: Neural networks are an obvious choice for speech recognition because of their ability to classify patterns. One important thing to note about classifying phonemes is that each phone's feature vector is really a sequence of vectors (i.e., a matrix), with the rows representing the mel-frequency cepstral coefficients and the columns representing time.

Acoustic-Phonetic Recognition System: This system is based on distinguishing phonemes by their acoustic properties. Feature vectors (phone/acoustic vectors) are gathered and presented in parallel to expert networks, each trained to recognize a particular phoneme. The complete output is recorded over time to get phoneme hypotheses, which are then stochastically processed to decide the closest-matching word.

The Experts: To process feature vectors that depend on time, a recurrent network should be used, because recurrent networks have memory (their current outputs depend on past inputs). One could use an Elman or Jordan network to accomplish such a task, and a slightly modified back-propagation algorithm can be used to train networks like these. (A sketch of such a network appears below, after the results.)

Neural Network Testing and Results

In all, 5 networks were trained to recognize the phoneme /s/. First, 25 samples were gathered and split into a training set (20 samples) and a test set (5 samples). The testing consisted of the following steps:

1. The training set was randomly ordered and the current network structure was trained for 3000 epochs. The training algorithm was back-propagation steepest descent with an adaptive learning rate and momentum. The trained network was then used to classify the training set and the test set, giving the training and testing error (in terms of percent missed).
2. Five trials of step 1 were run and averaged for each network.
3. Step 2 was repeated four times with different training parameters (initial learning rate and momentum constant) each time.
4. The results were tallied:

Network     Mean-Squared    Training    Testing     Misclassification Error
Structure   Training Error  Error (%)   Error (%)   (Training - Testing)
[8,4,1]     0.1948          37          88          7.4/20 - 4.4/5
[16,4,1]    0.1617          19          90          3.8/20 - 4.5/5
[16,8,1]    0.5815          22          80          4.4/20 - 4.0/5
[15,1]      0.2356          46          88          9.1/20 - 4.4/5
[30,1]      0.4561          51          90          10.1/20 - 4.5/5

Conclusion: The network structure that learned /s/ best was the [16,8,1] structure. This network had two fully connected recurrent layers, with 16 and 8 neurons, and one output layer with 1 neuron. Although this network had the second-best training
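The "Neural Network Recognition" slide names Elman and Jordan networks as candidate experts but gives no implementation. The following is a minimal sketch, assuming an Elman-style layout: a recurrent hidden layer whose previous activations (the context units) feed back into the next time step, and a single sigmoid output for a one-phoneme yes/no decision. The class name, layer sizes, and initialization are illustrative, and training (the modified back-propagation the slides mention) is omitted.

```python
import numpy as np

class ElmanNet:
    """Minimal Elman-style recurrent network: one recurrent hidden layer with
    context feedback, plus a single sigmoid output unit."""

    def __init__(self, n_in=13, n_hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))       # input -> hidden
        self.W_ctx = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # context -> hidden
        self.w_out = rng.normal(0.0, 0.1, n_hidden)              # hidden -> output
        self.n_hidden = n_hidden

    def forward(self, mfcc_frames):
        """mfcc_frames: shape (T, n_in), one MFCC vector per time step
        (the columns of the feature matrix described above)."""
        h = np.zeros(self.n_hidden)                  # context starts at zero
        for x in mfcc_frames:
            # The new hidden state depends on the input AND the previous state,
            # which is what gives the network its memory.
            h = np.tanh(self.W_in @ x + self.W_ctx @ h)
        return 1.0 / (1.0 + np.exp(-self.w_out @ h))  # sigmoid output in (0, 1)
```

An expert for /s/ would be run over the MFCC frames of a candidate phone (e.g. score = ElmanNet().forward(features.T) for a coefficients-by-time matrix), with an output near 1 taken as a /s/ hypothesis; one such network per phoneme gives the parallel bank of experts described above.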
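The four-step test protocol above can also be summarized as a small harness. In this sketch, train_fn and score_fn are hypothetical stand-ins for the back-propagation trainer (steepest descent with adaptive learning rate and momentum) and the percent-missed scorer; the epoch, trial, and setting counts come from the slides, while the concrete learning-rate and momentum values do not appear in the preview and are left to the caller (settings would hold the four (learning rate, momentum) pairs).

```python
import numpy as np

def run_protocol(train_fn, score_fn, structure, train_set, test_set,
                 settings, trials=5, epochs=3000):
    """Steps 1-4 of the test procedure above. train_fn(structure, samples,
    epochs=..., lr=..., momentum=...) and score_fn(net, samples) are
    hypothetical stand-ins supplied by the caller."""
    tallies = []
    for lr, momentum in settings:                  # step 3: repeat per parameter setting
        train_errs, test_errs = [], []
        for _ in range(trials):                    # step 2: five trials, averaged
            order = np.random.permutation(len(train_set))   # step 1: random order
            net = train_fn(structure, [train_set[i] for i in order],
                           epochs=epochs, lr=lr, momentum=momentum)
            train_errs.append(score_fn(net, train_set))     # % missed, training set
            test_errs.append(score_fn(net, test_set))       # % missed, test set
        tallies.append((lr, momentum, np.mean(train_errs), np.mean(test_errs)))
    return tallies                                 # step 4: tally the results
```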