CS 551/651: Structure of Spoken Language
Lecture 12: Tests of Human Speech Perception
John-Paul Hosom
Fall 2008

Recommended Reading

Chapter 5 (Strong/Weak Forms, Intonation and Stress)
Chapter 11, pp. 267-275 (Balance between Phonetic Forces and Physical Phonetics)

Final Exam

The final will be a take-home exam with 10 questions, in the same style as the midterm, but it may require the use of a calculator and will include a number of spectrograms to be deciphered. It will be handed out at the end of class on Wednesday, December 3, and is due back to me by Friday, December 12. It is worth about 30% of your grade.

The final will cover material from Lecture 7 (Syllable Structure) until the end of the term. Material covered on the midterm will probably not be covered on the final. The spectrogram-reading exercises will be similar to those on the midterm but will include the other classes of speech sounds that we've been studying (nasals, approximants, and affricates) as well as the usual vowels and diphthongs, fricatives, and stops.

The Perceptual Second Formant, F2'

Most vowels can be simulated using two resonances.

(figure: example with resonances at 400 Hz and 2200 Hz)

In one study, the lower resonance was fixed at the frequency of a vowel formant, and the subject was asked to vary the higher resonance (F2') until the perceived sound most closely matched the target vowel.

(figure: lower resonance at 400 Hz, target /ih/)

For back vowels and central vowels, subjects adjusted F2' to a frequency near the vowel's F2. For front vowels other than /iy/, F2' was between the vowel's F2 and F3; for /iy/, F2' was at or above the vowel's F3.

These findings suggest that when formants are close in frequency, they are integrated so that there is a single effective formant, equivalent to an average of the two peaks. It has also been shown that when two or more formants occur within 3 to 3.5 Barks, the perceived vowel quality is equivalent to a resonance pattern with a single formant at the center of gravity of the two formants. So, for two formants within 3 Barks, the formant positions affect a center-of-gravity
measure of a single perceived resonance; beyond 3 Barks, two formants are heard as perceptually distinct. These results suggest that for steady vowels there is an internal representation that has fairly low resolution.

The conversion from frequency f (in Hz) to Barks is:

$$\mathrm{Bark}(f) = \frac{26.81\,f}{1960 + f} - 0.53$$

Perception of Coarticulation

In most cases, vowels are affected by coarticulation, and in some cases the vowel does not reach its target formant pattern. How does the brain deal with this variation in the signal?

The acoustic effects of coarticulation are referred to by Lindblom as "target undershoot". The amount of undershoot depends on syllable duration as well as on speaking style, and it varies both across and within speakers.

In vowel perception, Lindblom hypothesized that people compensate for target undershoot and attempt to recover the canonical vowel targets. In an experiment, synthetic speech stimuli in /wVw/ and /yVy/ contexts were presented to listeners, with the F2 of V varying from high (for an /ih/ vowel) to low (for an /uh/ vowel).

The boundary for perception of /ih/ and /uh/, given the varying F2 values, was different in the /wVw/ context and the /yVy/ context. In /yVy/ contexts, mid-level values of F2 were heard as /uh/, and in /wVw/ contexts, mid-level values of F2 were heard as /ih/.

(figure: /w ih w/ and /y uh y/)

This demonstrates "perceptual overshoot": subjects rely on the direction and slope of formant transitions to classify vowels. Lindblom proposed a Perceptual Compensation model, which normalizes formant frequencies based on the formants of the surrounding consonants, the canonical vowel targets, and syllable duration. However, many factors may account for target undershoot, and so a simple model is not effective in this case. Also, if the model is applied to automatic speech recognition, determining the locations of consonants and vowels is a non-trivial problem.

Are Formant Targets Important?

Strange et al. conducted an experiment in which target information, dynamic information in formant transitions, and duration information were manipulated independently in CVC syllables.
Given a CVC, the middle region of the V was removed, or the transition regions were removed, or the duration was normalized, or some combination of these was applied. The CVCs were presented to subjects, who were asked to identify the vowel. Stimuli with no target information are referred to as "Silent Center", stimuli with no transitions as "Centers Alone", and time-normalized versions as "Neutral Duration".

Identification of Silent Center vowels was remarkably accurate, in some cases as good as identification of the unmodified CVC. Neutral Duration Silent Center vowels were not correctly identified as often as Silent Center vowels. However, Neutral Duration Silent Center vowels were still correctly identified more often than Neutral Duration Centers Alone vowels.

Conclusions:
1. When vowel transition and duration information are present, recognition is highly accurate.
2. With no duration information, transition information is more useful than nucleus information for vowel identification.
3. Vowel targets alone are neither sufficient nor necessary.

In another study, by Furui, CV syllables were truncated either from the beginning or from the end, and perception of the truncated syllable was measured. In another experiment, both the initial and final sections of the syllable were truncated, with a minimum duration of 40 msec. The "perceptual critical point" was defined as the truncation position at which there was 80% correct recognition.

Furui found that:
(a) The 10 msec at the point of greatest spectral transition is most important for identification of CV syllables.
(b) The crucial information for both vowels and consonants is in this 10 msec region; consonants can be perceived mainly from the spectral transition into the following vowel.

Tekieli and Cullinan showed that:
(a) Given the first 10 msec of an isolated vowel, Place and Height can be distinguished at levels above chance; the tense/lax feature requires 30 msec.
(b) Place of
articulation in a CV can be identified based on the 10 msec after release, but the voicing feature requires 20 to 30 msec.

In short, timing information is critical for the tense/lax and voiced/unvoiced distinctions, and making these distinctions requires about 30 msec of speech; other features can be identified in 10 msec.

Finally, DiBenedetto demonstrated that the F1 trajectory influences the perception of front vowels: synthetic syllables in which F1 targets are reached earlier than normal are perceived as lower in Height (/iy/ → /ih/).
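
As a worked illustration of the Bark conversion and the 3-Bark integration rule from the Perceptual Second Formant discussion, here is a minimal Python sketch. It uses the formula quoted in the lecture, 26.81 f / (1960 + f) - 0.53. The function names, the default 3.0-Bark threshold (the lecture gives a range of 3 to 3.5), and the use of an unweighted mean on the Bark scale as the "center of gravity" are assumptions made for illustration; a real center of gravity would likely weight each peak by its amplitude.

```python
def hz_to_bark(f_hz: float) -> float:
    """Convert frequency in Hz to Barks using the lecture's formula."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z: float) -> float:
    """Algebraic inverse of hz_to_bark (valid for z < 26.28)."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def effective_formants(f_low_hz: float, f_high_hz: float,
                       threshold_bark: float = 3.0) -> list[float]:
    """Return the perceptually effective resonance(s) for two formants.

    Assumption: equal-amplitude peaks, so the center of gravity is
    taken as the plain mean of the two Bark values.
    """
    z_low, z_high = hz_to_bark(f_low_hz), hz_to_bark(f_high_hz)
    if abs(z_high - z_low) <= threshold_bark:
        # Close formants: merge into one effective formant.
        return [bark_to_hz((z_low + z_high) / 2.0)]
    # Widely spaced formants: heard as perceptually distinct.
    return [f_low_hz, f_high_hz]


if __name__ == "__main__":
    # Example values only: 2200 and 3000 Hz are about 2 Barks apart.
    print(effective_formants(2200.0, 3000.0))
    # 400 and 2200 Hz are well over 3 Barks apart.
    print(effective_formants(400.0, 2200.0))
```

Under these assumptions, formants at 2200 Hz and 3000 Hz merge into a single effective resonance between them, while 400 Hz and 2200 Hz remain separate. This is consistent with the observation that F2' for front vowels falls between F2 and F3.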