
SPHSC 503 – Speech Signal Processing
UW – Summer 2006

Lecture notes 9 – Friday 7/7

Summary of last lecture

IIR filter design
• Design process
  o analog low-pass filter prototypes
  o frequency band transformation
  o analog to digital conversion
• Types of IIR filters: Butterworth, Chebyshev I and II, elliptic
• Magnitude response, phase response, group delay

Non-linear phase response
• Possible source of distortion
• Forward-backward filtering removes all phase distortion, but can't be applied in real time
• Could use an all-pass filter to compensate for phase distortion, but such filters are not easy to design

Introduction

In the second week of this course we discussed short-term frequency analysis of speech signals in the form of the short-time Fourier transform and the spectrogram. Those analysis tools are generic in the sense that they do not require any prior knowledge about a signal to be applied successfully. This is a great strength of these techniques and one of the main reasons why they have been so widely applied to many kinds of signals. On the other hand, precisely because they are so general, these techniques do not give very specific information about speech signals. If we're interested in frequency analysis of speech in particular, we may want to use techniques that yield speech-specific information, such as the fundamental frequency and the formant frequencies. In this lecture we'll discuss a common model of speech signals that is used to perform more specific frequency analysis, and use that model to perform linear prediction of speech signals.

Acoustical model of speech production [1]

In engineering, speech signals are typically considered to consist of two components: glottal excitation and vocal tract resonances.
Separating speech signals into these two components allows engineers to define very simple models for both processes. Each model can be quantified by a small number of parameters, which compactly represent the salient features of the speech signal, such as the fundamental and formant frequencies.

Glottal excitation

Glottal excitation is the process that describes the events in human speech production that take place between the lungs and the vocal tract. The vocal cords constrict the path from the lungs to the vocal tract. As lung pressure is increased, air flows out of the lungs and through the opening between the vocal cords (the 'glottis'). At one point, the vocal cords come together, blocking the airflow and building up pressure behind them. Eventually the pressure reaches a level sufficient to force the vocal cords apart, allowing air to flow through the glottis. The pressure in the glottis then falls and, if the tension in the vocal cords is properly adjusted, the reduced pressure allows the cords to come together again, and the cycle repeats. This condition of sustained oscillation occurs for voiced sounds, as illustrated in the figure below:

[Figure: one cycle of sustained vocal cord oscillation during voiced speech]

This glottal excitation can be further separated into an impulse train that drives a glottal pulse FIR filter g[n].

For unvoiced sounds, the airflow between the lungs and the vocal tract is obstructed only slightly, if at all, by the vocal cords. In that case, the glottal excitation consists mostly of turbulence, which is modeled as random noise.

This model of the glottal excitation is a decent approximation, but it fails on voiced fricatives, since those sounds contain both a periodic component and an aspirated component. In this case, a mixed excitation model can be applied, using a sum of an impulse train and random noise.

[1] Text and figures copied/adapted from "Spoken language processing", X. Huang et al.
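The voiced, unvoiced, and mixed excitation signals described above can be sketched numerically. The sampling rate, pitch, glottal pulse shape, and mixing weights below are illustrative assumptions, not values from the lecture; in particular, a crude two-tap smoother stands in for a realistic glottal pulse filter g[n].

```python
import numpy as np

fs = 8000                       # sampling rate in Hz (assumed for illustration)
f0 = 100                        # fundamental frequency in Hz (assumed)
n_samples = fs                  # one second of excitation

# Voiced excitation: impulse train with pitch period fs/f0 samples
period = fs // f0
impulse_train = np.zeros(n_samples)
impulse_train[::period] = 1.0

# Glottal pulse FIR filter g[n]; a two-tap smoother stands in for a
# realistic glottal pulse shape here
g = np.array([0.5, 0.5])
voiced = np.convolve(impulse_train, g)[:n_samples]

# Unvoiced excitation: turbulence modeled as white noise
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(n_samples)

# Mixed excitation (e.g. voiced fricatives): weighted sum of both
alpha_v, alpha_u = 0.8, 0.2     # mixing weights (assumed)
mixed = alpha_v * voiced + alpha_u * unvoiced
```

Replacing `g` with a measured or parametric glottal pulse shape (and the constant weights with time-varying ones) moves this sketch closer to the model in the notes.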
Lossless tube concatenation

A widely used model for speech production is based on the assumption that the vocal tract can be represented as a concatenation of lossless tubes, as shown in the figure below:

[Figure: concatenation of five lossless tubes with cross-sectional areas A1 through A5]

The constant cross-sectional areas A1, A2, ..., A5 of the tubes approximate the continuous area function A(x) of the vocal tract. This model of the vocal tract ignores a number of things, such as the vocal tract's three-dimensional 'bend', its elasticity, its viscosity, and its thermal conduction. By leaving those aspects out of the model, the sound waves in the tubes satisfy a pair of differential equations, which can be solved to find the system that models the vocal tract frequency response. In general, the concatenation of N lossless tubes results in an IIR system with an N-th order feedback sequence and a feed-forward sequence that is only a gain. The N-th order feedback sequence causes at most N/2 resonances, or formants, in the vocal tract. These resonances occur when a given frequency gets trapped in the vocal tract because it is reflected at the lips and then again at the glottis. The number of tubes required to accurately model the formants in a speech signal generated by a given vocal tract depends on the physical length of the vocal tract L, the sampling frequency of the speech signal Fs, and the speed of sound c, as follows:

    N = 2 L Fs / c

For example, for Fs = 8000 Hz, c = 34000 cm/s, and L = 17 cm, the average length of an adult male vocal tract, we obtain N = 8, or equivalently 4 formants. Experimentally, the vocal tract system has been observed to have approximately 1 formant per kHz. Shorter vocal tracts (females or children) have fewer resonances per kHz, and vice versa.
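The tube-count formula above is simple enough to check directly. This sketch just evaluates N = 2LFs/c with the lecture's example values; the function name is my own.

```python
def num_tube_sections(L_cm, fs_hz, c_cm_per_s=34000.0):
    """Number of lossless tube sections needed: N = 2 * L * Fs / c."""
    return 2.0 * L_cm * fs_hz / c_cm_per_s

# The lecture's example: a 17 cm vocal tract sampled at 8 kHz
N = num_tube_sections(17.0, 8000.0)   # -> 8.0 tube sections
max_formants = N / 2                  # -> at most 4 formants
```

Doubling the sampling rate to 16 kHz doubles N to 16, i.e. the model must then account for up to 8 formants in the wider analysis band.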
Source-filter models of speech production

For a complete model of human speech production, we combine the glottal excitation with the lossless tube concatenation model into a mixed excitation model, as shown below:

[Figure: source-filter model with voiced and unvoiced excitation branches driving the vocal tract filter H(ω)]

Mathematically, this mixed excitation model can be expressed as

    s[n] = { α_v[n] · (s_f0[n] ∗ g[n]) + α_u[n] · w[n] } ∗ h[n]

where α_v[n] and α_u[n] control the amount of mixing between the voiced and unvoiced excitation, s_f0[n] is the impulse train at the fundamental frequency that drives the glottal pulse filter g[n], w[n] is the random noise modeling the unvoiced excitation, and h[n] is the vocal tract filter. Despite all the approximations made, the combination of glottal excitation and the lossless tube concatenation model represents …
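The full source-filter equation can be sketched end to end. As before, all parameter values here are illustrative assumptions; the vocal tract h[n] is realized as an all-pole (feedback-only) IIR filter, matching the lossless-tube result, with a single made-up resonance near 500 Hz where a real vocal tract would have several pole pairs.

```python
import numpy as np

fs = 8000
n_samples = fs

# Excitation (assumed parameters): impulse train s_f0[n] through glottal
# pulse g[n], plus noise w[n], mixed by constant weights alpha_v, alpha_u
e = np.zeros(n_samples)
e[::80] = 1.0                           # 100 Hz impulse train at fs = 8 kHz
g = np.array([0.5, 0.5])                # stand-in glottal pulse FIR filter
rng = np.random.default_rng(0)
w = rng.standard_normal(n_samples)
alpha_v, alpha_u = 0.9, 0.1
excitation = alpha_v * np.convolve(e, g)[:n_samples] + alpha_u * w

# Vocal tract h[n]: all-pole IIR filter (feedback only, unit gain
# feed-forward), as the lossless tube model predicts. One illustrative
# resonance at 500 Hz with a 100 Hz bandwidth:
r = np.exp(-np.pi * 100 / fs)           # pole radius from bandwidth
theta = 2 * np.pi * 500 / fs            # pole angle from formant frequency
a1, a2 = -2 * r * np.cos(theta), r * r  # denominator coefficients

# Direct-form recursion computes s[n] = excitation[n] * h[n]
s = np.zeros(n_samples)
for i in range(n_samples):
    s[i] = excitation[i]
    if i >= 1:
        s[i] -= a1 * s[i - 1]
    if i >= 2:
        s[i] -= a2 * s[i - 2]
```

Because the pole radius r is below 1, the recursion is stable; adding more conjugate pole pairs to the denominator adds more formants, up to the N/2 limit from the tube model.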

