Building an ASR using HTK CS4706 Outline Speech Recognition It’s hard to recognize speech Feature Extraction

Building an ASR using HTK
CS4706
Fadi Biadsy
April 21st, 2008

Outline
- Speech Recognition
- Feature Extraction
- HMM
  - 3 basic problems
- HTK
  - Steps to build a speech recognizer

Speech Recognition
- Speech signal to linguistic units
- Speech signal -> ASR -> "There's something happening when Americans"

It's hard to recognize speech
- Contextual effects
  - Speech sounds vary with context, e.g., "How do you do"
- Within-speaker variability
  - Speaking style
  - Pitch, intensity, speaking rate
  - Voice quality
- Between-speaker variability
  - Accents, dialects, native vs. non-native
- Environment variability
  - Background noise
  - Microphone

Feature Extraction
- Waveform
- Spectrogram
- We need a stable representation for different examples of the same speech sound

Feature Extraction
- Extract features from short frames (frame period 10 ms, frame size 25 ms), giving a sequence of features
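The framing step can be sketched as follows. This is a minimal illustration, not HTK's own code: the 16 kHz sampling rate, the function name `frame_signal`, and the random test signal are assumptions made for the example; only the 10 ms frame period and 25 ms frame size come from the slide.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_size_s=0.025, frame_period_s=0.010):
    """Split a 1-D signal into overlapping frames, one row per frame."""
    frame_len = int(round(frame_size_s * sample_rate))   # samples per frame
    hop = int(round(frame_period_s * sample_rate))       # samples between frame starts
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row i of idx selects samples [i*hop, i*hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

sample_rate = 16000  # assumed sampling rate, not stated in the slides
one_second = np.random.default_rng(0).standard_normal(sample_rate)
frames = frame_signal(one_second, sample_rate)
print(frames.shape)
```

Because the frame size is longer than the frame period, consecutive frames overlap by 15 ms, so each feature vector shares most of its samples with its neighbours.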

Feature Extraction – MFCC
- The mel scale approximates the unequal sensitivity of human hearing at different frequencies

Feature Extraction – MFCC
- MFCC (mel-frequency cepstral coefficients)
  - Widely used in speech recognition
  1. Take the Fourier transform of the signal
  2. Map the log amplitudes of the spectrum to the mel scale
  3. Take the discrete cosine transform of the mel log-amplitudes
  4. The MFCCs are the amplitudes of the resulting spectrum

Feature Extraction – MFCC
- Extract a feature vector from each frame
  - 12 MFCCs + 1 normalized energy = 13 features
  - Delta MFCC = 13
  - Delta-delta MFCC = 13
  - Total: 39 features
- (Figure: inverted MFCCs; 39-dimensional feature vector)
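The MFCC recipe and the 13 + 13 + 13 = 39 layout can be sketched in NumPy as below. This is an illustrative approximation under stated assumptions, not HTK's implementation: the 26 triangular filters, the 512-point FFT, the common mel formula 2595 log10(1 + f/700), the max-subtraction energy normalisation, and the two-frame central-difference deltas are all choices made for the example. As in most implementations, the log is taken after the mel filterbank is applied.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):        # rising edge
            fbank[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):       # falling edge
            fbank[m - 1, k] = (right - k) / (right - centre)
    return fbank

def mfcc(frames, sample_rate, n_fft=512, n_filters=26, n_ceps=12):
    """Return 12 cepstral coefficients plus 1 normalised log energy per frame."""
    windowed = frames * np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, n=n_fft, axis=1)) ** 2
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sample_rate).T + 1e-10)
    # DCT-II of the log mel energies; keep coefficients 1..n_ceps (c0 is dropped)
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), n + 0.5) / n_filters)
    ceps = log_mel @ basis.T
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    log_energy = log_energy - log_energy.max()   # simple normalisation (assumption)
    return np.hstack([ceps, log_energy[:, None]])

def deltas(feats):
    """Time derivatives approximated by central differences (a simplification)."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

rng = np.random.default_rng(0)
sample_rate = 16000                        # assumed sampling rate
frames = rng.standard_normal((98, 400))    # stand-in for 98 frames of 25 ms audio
static = mfcc(frames, sample_rate)         # 13 features per frame
delta = deltas(static)                     # 13 delta features
delta_delta = deltas(delta)                # 13 delta-delta features
features = np.hstack([static, delta, delta_delta])
print(features.shape)
```

Real toolkits usually compute deltas by regression over a window of several frames; the central difference here only shows where the extra 26 dimensions come from.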

Markov Chain
- Weighted finite-state acceptor: the future is independent of the past given the present

Hidden Markov Model (HMM)
- An HMM is a Markov chain plus an emission probability function for each state
- Markov chain
- HMM M = (A, B, Pi)
  - A = transition matrix
  - B = observation distributions
  - Pi = initial state probabilities

HMM Example

HMM – 3 basic problems (1) – Evaluation
1. Given the observation sequence O and a model M, how do we efficiently compute $P(O \mid M)$, the probability of the observation sequence given the model?
- With several candidate models $M_i$: $\arg\max_i P(O \mid M_i)$

HMM – 3 basic problems (2) – Decoding
2. Given the observation sequence O and the model M, how do we choose a corresponding state sequence $Q = q_1 q_2 \dots q_T$ that best "explains" the observation O?
   $Q^* = \arg\max_Q P(Q \mid O, M) = \arg\max_Q P(O \mid Q, M)\, P(Q \mid M)$

Viterbi algorithm
- An efficient algorithm for decoding: $O(TN^2)$
- (Figure: path from Start to End through /d/ /aa/ /n/ /aa/ => dana)
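Problems 1 and 2 can be sketched for a small discrete-observation HMM as below. The two-state model, its probabilities, and the observation sequence are made up for illustration; speech recognizers use continuous emission densities over feature vectors, but the recursions have the same shape.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Evaluation: P(O | M) via the forward algorithm, O(T N^2)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

def viterbi(A, B, pi, obs):
    """Decoding: most likely state sequence and its log probability, O(T N^2)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))             # best log score of a path ending in each state
    back = np.zeros((T, N), dtype=int)   # best predecessor of each state
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        # scores[i, j]: best path ending in state i at t-1, then moving i -> j
        scores = delta[t - 1][:, None] + np.log(A)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):        # follow back-pointers from the end
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

# Toy model M = (A, B, Pi); all values are illustrative assumptions
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[state, symbol]
pi = np.array([0.6, 0.4])                # initial state probabilities
obs = [0, 0, 1]

likelihood = forward(A, B, pi, obs)
best_path, best_logp = viterbi(A, B, pi, obs)
print(likelihood, best_path)
```

The two functions differ only in replacing the sum over predecessor states with a max, which is why both cost $O(TN^2)$. Working in log space in `viterbi` avoids underflow on long sequences; `forward` as written would need scaling for the same reason.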

HMM – 3 basic problems (3) – Training
3. How do we adjust the model parameters M = (A, B, Pi) to maximize $P(O \mid M)$?
- Given dana => /d/ /aa/ /n/ /aa/, estimate:
  1) Transition matrix: A
  2) Emission probability distribution: B

HTK
- HTK is a toolkit for building hidden Markov models (HMMs)
- HTK is primarily designed for building HMM-based speech processing tools (e.g., extracting MFCC features)
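For the training problem, one re-estimation step of the Baum-Welch algorithm (the algorithm this deck uses in its training steps) can be sketched for a discrete-observation HMM. This is a minimal, unscaled version for a single short sequence; the toy parameters and the function names are assumptions for the example, and HTK itself re-estimates Gaussian emission densities rather than a symbol table.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Unscaled forward (alpha) and backward (beta) probabilities."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch_step(A, B, pi, obs):
    """One EM update of (A, B, Pi); also returns P(O | M) under the old parameters."""
    obs = np.asarray(obs)
    alpha, beta = forward_backward(A, B, pi, obs)
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood     # gamma[t, i] = P(q_t = i | O, M)
    # xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, M)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):           # expected count of symbol k in each state
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, float(likelihood)

# Illustrative starting point; every number here is an assumption
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([0.5, 0.5])
obs = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1]

likelihoods = []
for _ in range(5):
    A, B, pi, likelihood = baum_welch_step(A, B, pi, obs)
    likelihoods.append(likelihood)
print(likelihoods)
```

Each step can only keep or increase $P(O \mid M)$, so the recorded likelihoods form a non-decreasing sequence; the procedure converges to a local maximum, which is why initialisation matters.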

Steps for building ASR
- Task: a voice-operated interface for phone dialing
- Examples:
  - Dial three three two six five four
  - Phone Woodland
  - Call Steve Young
- Grammar:
    $digit = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | OH | ZERO;
    $name = [ JOOP ] JANSEN | [ JULIAN ] ODELL | [ DAVE ] OLLASON | [ PHIL ] WOODLAND | [ STEVE ] YOUNG;
    ( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name) SENT-END )
- Training prompts:
    S0001 ONE VALIDATED ACTS OF SCHOOL DISTRICTS
    S0002 TWO OTHER CASES ALSO WERE UNDER ADVISEMENT
    S0003 BOTH FIGURES WOULD GO HIGHER IN LATER YEARS
    S0004 THIS IS NOT A PROGRAM OF SOCIALIZED MEDICINE
    etc.
- Dictionary:
    A       ah sp
    A       ax sp
    A       ey sp
    CALL    k ao l sp
    DIAL    d ay ax l sp
    EIGHT   ey t sp
    PHONE   f ow n sp
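To make the dialing grammar concrete, the sketch below encodes it directly in Python and both generates and checks sentences. It assumes the usual reading of the notation: alternatives within a rule are separated by `|`, `[ ]` marks an optional item, and `< >` marks one or more repetitions. The function names, the 50/50 branch choice, and the cap of seven digits are illustrative assumptions; this is not an HTK tool.

```python
import random

DIGITS = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
          "SEVEN", "EIGHT", "NINE", "OH", "ZERO"]
NAMES = [("JOOP", "JANSEN"), ("JULIAN", "ODELL"), ("DAVE", "OLLASON"),
         ("PHIL", "WOODLAND"), ("STEVE", "YOUNG")]

def random_sentence(rng):
    """Generate one sentence allowed by the grammar."""
    if rng.random() < 0.5:
        # DIAL <$digit>: one or more digits (capped at 7 for the example)
        words = ["DIAL"] + [rng.choice(DIGITS) for _ in range(rng.randint(1, 7))]
    else:
        # (PHONE|CALL) $name, where the first name is optional
        first, last = rng.choice(NAMES)
        words = [rng.choice(["PHONE", "CALL"])]
        if rng.random() < 0.5:
            words.append(first)
        words.append(last)
    return ["SENT-START"] + words + ["SENT-END"]

def is_valid(words):
    """Check whether a word sequence is accepted by the grammar."""
    if len(words) < 3 or words[0] != "SENT-START" or words[-1] != "SENT-END":
        return False
    body = words[1:-1]
    if body[0] == "DIAL":
        return len(body) >= 2 and all(w in DIGITS for w in body[1:])
    if body[0] in ("PHONE", "CALL"):
        rest = tuple(body[1:])
        return any(rest in ((last,), (first, last)) for first, last in NAMES)
    return False

rng = random.Random(0)
for _ in range(3):
    print(" ".join(random_sentence(rng)))
```

A constrained grammar like this keeps the recognizer's search space small: only word sequences that `is_valid` accepts can ever be output, regardless of what the acoustics suggest.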

- The HTK scripting language is used to generate the phonetic transcription for all training data

Extracting MFCC
- For each wave file, extract MFCC features

Creating Monophone HMMs
- Create the monophone HMM topology
  - 5 states: 3 emitting states
  - (Figure: HMM states S1 to S5)
- Flat start: the mean and variance are initialized as the global mean and variance of all the data
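The 5-state topology and the flat start can be sketched as below. The left-to-right structure without skips, the 0.6 self-loop probability, the all-zero row for the exit state, and the random stand-in features are assumptions for the example; only "5 states, 3 emitting" and "global mean and variance" come from the slide.

```python
import numpy as np

def flat_start(feature_files):
    """Global per-dimension mean and variance over all training frames."""
    all_frames = np.vstack(feature_files)
    return all_frames.mean(axis=0), all_frames.var(axis=0)

def left_to_right_transitions(n_states=5, self_loop=0.6):
    """Transition matrix with non-emitting entry (first) and exit (last) states."""
    A = np.zeros((n_states, n_states))
    A[0, 1] = 1.0                          # entry state moves straight to the first emitting state
    for s in range(1, n_states - 1):       # emitting states: stay or advance
        A[s, s] = self_loop
        A[s, s + 1] = 1.0 - self_loop
    # The exit state has no outgoing transitions, so its row stays all zeros
    return A

rng = np.random.default_rng(0)
# Stand-in for three utterances of 39-dimensional MFCC features
utterances = [rng.standard_normal((n, 39)) for n in (120, 80, 95)]
global_mean, global_var = flat_start(utterances)
A = left_to_right_transitions()

# Every emitting state of every monophone starts from the same Gaussian
emitting_states = [{"mean": global_mean.copy(), "var": global_var.copy()} for _ in range(3)]
print(A)
```

Starting all models from identical parameters means no phone-level alignment is needed up front; the subsequent re-estimation passes are what make the models diverge.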

Training
- For each training pair of files (mfc + lab):
  1. Concatenate the corresponding monophone HMMs
  2. Use the Baum-Welch algorithm to train the HMMs given the MFC features
- (Figure: monophone HMMs /w/ /ah/ /n/ concatenated for ONE in ONE VALIDATED ACTS OF SCHOOL DISTRICTS)

Training
- So far, we have all the monophone models trained
- Train the sp model

Forced alignment
- The dictionary contains multiple pronunciations for some words
- Realign the training data
- Run Viterbi to get the best pronunciation that matches the acoustics

Retrain
- After getting the best pronunciation, train again with the Baum-Welch algorithm using the "correct" pronunciation

Creating Triphone models
- Context-dependent HMMs
- Make triphones from monophones
- Generate a list of all the triphones for which there is at least one example in the training data
    jh-oy+s
    oy-s
    ax+z
    f-iy+t
    iy-t
    s+l
    s-l+ow

Creating Tied-Triphone models
- Data insufficiency => tie states
- (Figure: tied states S1 to S5; phone labels /aa/ /t/ /b/ and /b/ /aa/ /l/)
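The triphone naming used in the list above (left context, `-`, phone, `+`, right context) can be sketched as a word-internal expansion, where context does not cross word boundaries. The function names and the two example pronunciations are assumptions chosen to reproduce a few entries from the slide; this is an illustration, not the HTK command that performs the conversion.

```python
def word_internal_triphones(phones):
    """Expand one word's monophone sequence into word-internal triphone names."""
    names = []
    for i, phone in enumerate(phones):
        name = phone
        if i > 0:                          # left context exists inside the word
            name = phones[i - 1] + "-" + name
        if i < len(phones) - 1:            # right context exists inside the word
            name = name + "+" + phones[i + 1]
        names.append(name)
    return names

def triphone_inventory(pronunciations):
    """All distinct triphones seen at least once in the given pronunciations."""
    seen = set()
    for phones in pronunciations:
        seen.update(word_internal_triphones(phones))
    return sorted(seen)

# Hypothetical training pronunciations
training = [["jh", "oy", "s"], ["f", "iy", "t"], ["s", "l", "ow"]]
for phones in training:
    print(phones, "->", word_internal_triphones(phones))
print(triphone_inventory(training))
```

Since the number of possible triphones grows roughly with the cube of the phone set, many occur rarely or never in the training data, which is the data insufficiency that state tying addresses.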

Summary
- MFCC features
- HMM 3 basic problems
- HTK

Thanks!

HMM – Problem 1
