Speech Processing 15-492/18-492
Speech Recognition
Signal Processing

Analog to Digital
• Speech (sound) is analog
• Computers are digital
• We need to convert
• Sample from A-D converter N times a second
• How many times a second?

Goals of Signal Processing
• Distinguish between phonetic types
• Be invariant to channel/room conditions
• Be invariant to speaker characteristics
• Computational efficiency

Time vs Frequency Domain
• Human ear distinguishes frequencies
• Initial ASR used time domain features
• Power
• Zero crossings (sort of frequency)

Source Filter Model
[Diagram labels: Pulse, Noise, Filter]

Vocal Tract Model
[Diagram labels: Pitch, Voiced, Unvoiced]

Time domain Signal
Waveform Representation

Speech Spectrogram

/iy/ vs /ae/
• “beat” /b iy t/ and “bat” /b ae t/

Frequency Domain
• “pencils” /p eh n s ih l z/

Frequency Domain
• “beats pits” /b iy t s p ih t s/

Speech Analysis

Standard Parameterization
• Split waveform into “frames”
• Advance every 10ms
• Size around 25ms (overlapping frames)
• Window them
• Perform FFT/Mel Cepstral analysis
• Find Deltas (difference from previous)
• Find Delta Deltas (difference in delta)

Summary
• Time domain vs Frequency domain
• Parameterization of speech
• Frequency domain
• Short term FFTs
• FFT vs MEL
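
The Standard Parameterization steps (25ms frames advanced every 10ms, windowing, a short term FFT, then deltas and delta deltas) can be sketched in Python with NumPy. This is a minimal illustration, not the course's implementation: the function names, the Hamming window, and the 16 kHz sample rate in the usage note are assumptions, and the sketch stops at a log power spectrum rather than going on to Mel cepstral analysis.

```python
import numpy as np


def frame_signal(signal, sample_rate, frame_ms=25.0, step_ms=10.0):
    """Split a 1-D waveform into overlapping frames, one frame per row."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Only whole frames are kept; trailing samples are dropped.
    n_frames = 1 + (len(signal) - frame_len) // step
    starts = step * np.arange(n_frames)
    index = starts[:, None] + np.arange(frame_len)[None, :]
    return signal[index]


def log_power_spectrum(frames):
    """Window each frame (Hamming is an assumed choice) and take its FFT."""
    windowed = frames * np.hamming(frames.shape[1])
    spectrum = np.fft.rfft(windowed, axis=1)
    # Small constant avoids log(0) on silent frames.
    return np.log(np.abs(spectrum) ** 2 + 1e-10)


def deltas(features):
    """Difference from the previous frame; the first frame gets zeros."""
    out = np.zeros_like(features)
    out[1:] = features[1:] - features[:-1]
    return out


def parameterize(signal, sample_rate):
    """Stack static features, deltas, and delta deltas for every frame."""
    static = log_power_spectrum(frame_signal(signal, sample_rate))
    d = deltas(static)
    dd = deltas(d)
    return np.hstack([static, d, dd])
```

For example, one second of audio sampled at an assumed 16 kHz gives 400-sample frames advanced by 160 samples, so `parameterize(x, 16000)` returns one row per frame with the static, delta, and delta delta values side by side.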