Effect of temporal envelope smearing on speech reception

Rob Drullman, Joost M. Festen, and Reinier Plomp
Department of Oto-rhino-laryngology, Free University Hospital, P.O. Box 7057, 1007 MB Amsterdam, The Netherlands

(Received 17 September 1992; revised 30 June 1993; accepted 20 September 1993)

The effect of smearing the temporal envelope on the speech-reception threshold (SRT) for sentences in noise and on phoneme identification was investigated for normal-hearing listeners. For this purpose, the speech signal was split up into a series of frequency bands (width of 1/4, 1/2, or 1 oct) and the amplitude envelope for each band was low-pass filtered at cutoff frequencies of 0, 1/2, 1, 2, 4, 8, 16, 32, or 64 Hz. Results for 36 subjects show (1) a severe reduction in sentence intelligibility for narrow processing bands at low cutoff frequencies (0-2 Hz); and (2) a marginal contribution of modulation frequencies above 16 Hz to the intelligibility of sentences (provided that lower modulation frequencies are completely present). For cutoff frequencies above 4 Hz, the SRT appears to be independent of the frequency bandwidth upon which envelope filtering takes place. Vowel and consonant identification with nonsense syllables were studied for cutoff frequencies of 0, 2, 4, 8, or 16 Hz in 1/4-oct bands. Results for 24 subjects indicate that consonants are more affected than vowels. Errors in vowel identification mainly consist of reduced recognition of diphthongs and of confusions between long and short vowels. In the case of consonant recognition, stops appear to suffer most, with confusion patterns depending on the position in the syllable (initial, medial, or final).

PACS numbers: 43.71.Es, 43.71.Gv

INTRODUCTION

The speech signal is characterized by a spectrum that varies in time. This is clearly illustrated by the spectrogram: the distribution of light and dark spots in the vertical direction (frequency) changes continuously in the horizontal direction (time).
These variations contain the information that is essential for the identification of phonemes, syllables, words, and sentences. For this identification we need a detector which is able to perceive the spectrotemporal differences. Our ear is such a detector.

The ear's resolution in both frequency and time is sufficiently high to perceive the essential acoustical features of the various speech sounds. Depending on the speech material, we even have a reserve capacity. This reserve capacity is rather small for isolated phonemes, but large for sentences. For normal-hearing listeners, the speech-reception threshold (SRT) in noise, defined as the speech-to-noise ratio at which 50% of short everyday sentences are reproduced correctly, is about -5 dB (Plomp, 1986).

An interesting question is: How critical is the resolution in frequency and time for the intelligibility of speech? Recently, ter Keurs et al. (1992, 1993) investigated the effect of smearing in the frequency domain, as a way to reduce spectral contrast. They smeared the envelope of the spectrum over bandwidths of up to 4 oct. The effect of this operation can be considered as a blurring of the formant structure. The results indicate that the SRT for sentences in noise increases once spectral energy is smeared over bandwidths exceeding the ear's critical bandwidth.

In the present study we focus on the temporal envelope. Temporal modulations of the speech signal have been described in terms of the modulation index (Houtgast and Steeneken, 1985). In all octave bands the most important modulation frequencies (i.e., where the modulation index reaches its peak value) are 3-4 Hz, reflecting the syllable rate in speech. Taking the frequency for which the modulation index is reduced to half its peak value (comparable to the -6-dB point of a filter), one can find relevant modulation frequencies up to about 15-20 Hz in undisturbed speech.
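
The envelope filtering described in the abstract (split the signal into frequency bands, low-pass filter each band's amplitude envelope, recombine) can be illustrated with a minimal sketch. This is not the authors' processing chain: the Hilbert envelope, the Butterworth filters, the filter orders, the treatment of a 0-Hz cutoff as "keep only the mean envelope", and the function names `smear_band_envelope` and `smear_speech` are all assumptions made for illustration.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt


def smear_band_envelope(band, fs, cutoff_hz, order=4):
    """Low-pass filter the amplitude envelope of one band-limited signal.

    Assumption: the envelope is the magnitude of the analytic signal, and
    the band's fine structure is kept by rescaling the band with the ratio
    of the smoothed envelope to the original envelope.
    """
    envelope = np.abs(hilbert(band))
    if cutoff_hz <= 0:
        # Assumed reading of a 0-Hz cutoff: only the mean (DC) level remains.
        smoothed = np.full_like(envelope, envelope.mean())
    else:
        sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
        # Zero-phase filtering; clip small negative overshoots of the filter.
        smoothed = np.clip(sosfiltfilt(sos, envelope), 0.0, None)
    gain = smoothed / np.maximum(envelope, 1e-12)
    return band * gain


def smear_speech(x, fs, cutoff_hz, band_edges_hz):
    """Apply envelope smearing per band and sum the processed bands.

    band_edges_hz is an increasing list of edge frequencies (all below
    fs / 2); consecutive pairs define the analysis bands.
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        out += smear_band_envelope(band, fs, cutoff_hz)
    return out


if __name__ == "__main__":
    fs = 8000
    t = np.arange(2 * fs) / fs
    # 1-kHz carrier with a 32-Hz amplitude modulation of depth 0.8.
    x = (1 + 0.8 * np.sin(2 * np.pi * 32 * t)) * np.sin(2 * np.pi * 1000 * t)
    y = smear_band_envelope(x, fs, cutoff_hz=4.0)
    env = np.abs(hilbert(y))[fs // 2 : -fs // 2]
    print("residual modulation depth:",
          (env.max() - env.min()) / (env.max() + env.min()))
```

With a 4-Hz cutoff the 32-Hz modulation of the test tone is largely removed, whereas a 64-Hz cutoff would leave it nearly intact. Narrower analysis bands can be obtained by passing more closely spaced values in `band_edges_hz`.
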
The ear's sensitivity for temporal modulations shows a lowpass characteristic, with a 6-dB down point corresponding to a frequency roughly between 25 Hz (Festen and Plomp, 1981; Plomp, 1984) and 100 Hz (Rodenburg, 1977; Viemeister, 1979). From these data we may conclude that for normal hearing the ear's capacity to detect temporal modulations is not a limiting factor in speech perception.

What is the effect of reducing the degree to which temporal fluctuations are present in the speech signal? In case of reverberation, resulting in attenuation of fast temporal modulations (due to "filling" of the minima in the waveform by reflected speech), experiments have demonstrated a reduction in intelligibility for sentences (Duquesnoy and Plomp, 1980). With regard to multichannel amplitude compression, Plomp (1988) argues that with small time constants intensity fluctuations (particularly at low modulation frequencies) are attenuated in every channel, resulting in reduced intelligibility. Plomp states that this reduction increases as the compression ratio and the number of channels increase. In a comment on Plomp's paper, Villchur (1989) claims that infinite peak clipping (i.e., 100% compression) in a two-channel compression system would hardly affect intelligibility for normal-hearing listeners. One of the goals of the present paper is to quantify this effect as a function of the number of frequency bands.

J. Acoust. Soc. Am. 95 (2), February 1994 0001-4966/94/95(2)/1053/12/$6.00 © 1994 Acoustical Society of America

The significance of the various modulation frequencies for speech communication can be compared with the significance of the various audiofrequencies. For example, in designing channel vocoders, we need to know not only the frequency range (e.g., up to 4 kHz) to be covered by the channels, but also the upper limit of the envelope frequencies required to preserve intelligible speech.
Similarly, in applying alternative presentation of speech information to the deaf, we need to know up to which envelope frequency the (tactile, visual) channel must transfer the signal faithfully.

The range of modulation frequencies most relevant for speech, as mentioned above, has been determined by means of physical/acoustical measurements, and not by any formal perceptual evaluation. In much the same way, a 25-Hz limit for temporal modulations in up to 100 filter bands was applied in early channel vocoders (cf. Flanagan, 1972). There have been

