This paper provides a literature review of speech recognition: its background, methodologies, and the techniques used. Speech-to-Text, or ASR, involves processing the sound waves, extracting basic linguistic units or phonemes [1], and then creating contextually correct and meaningful words to form a complete sentence. This paper explains types of speech, different approaches to speech recognition, techniques, the extraction of features, and the mathematical representation of ASR.
Keywords: Speech Recognition, modules/phases of ASR, HMM, feature extraction, MEL, LPC, Large Vocabulary Continuous Speech Recognition (LVCSR), Pattern Classification techniques, Applications, tools
The most common and primary method of communication is to speak in a natural language. Speech Recognition (also known as Automatic Speech Recognition (ASR), or computer speech recognition) is the process of converting a speech signal into a sequence of words by means of an algorithm implemented as a computer program. Work on speech recognition has been going on for decades, and the most evolved forms of ASR can be seen in this era. Recognition accuracy, error rate, speaker variation, and surrounding noise are still areas of consideration.
[1] PHONEMES are the linguistically significant sounds (or sets of sounds) of a language; a phoneme is the minimal unit of sound that has semantic content.
Speech recognition is one of the most researched fields of signal processing. Its application areas are diverse, ranging from dictation to speaker recognition for security purposes and other language applications.
TYPES OF SPEECH RECOGNITION TASK
Some of the parameters and tasks related to speech recognition are:
One aspect of the speech recognition task is the dictionary size. It affects the complexity, processing requirements, and accuracy of recognition. There are no defined standards for size; the general perception is:
Ø small vocabulary – tens of words
Ø medium vocabulary – hundreds of words
Ø large vocabulary – thousands of words
Ø very large vocabulary – tens of thousands of words.
Types of Speech Recognition Systems
Speech recognition systems can be categorized as follows:
Ø ISOLATED WORDS: In this type of recognition, each word is surrounded by a pause or break. The system accepts a single word or single utterance [3] at a time. These systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually performing processing during the pauses) [2].
Ø CONNECTED WORDS: Similar to isolated words, but separate utterances are allowed to run together with minimal gaps between them.
Ø CONTINUOUS SPEECH:
Words run into each other and have to be segmented.
Ø SPONTANEOUS SPEECH: Recognizes natural speech. Such a system must be able to handle features of natural language, as spontaneous speech may include mispronunciations, false starts, and slang words.
Ø READ SPEECH: Simulates the dictation task, or conversing with a speech dialogue system [4].
[3] An UTTERANCE is any stretch of talk, by one person, before and after which there is silence on the part of that person.
Ø CONVERSATIONAL SPEECH: Recognizing the speech of two humans talking to each other, for example in telephone conversations.
ASR systems are also classified as discrete or continuous systems that are speaker dependent or independent. Discrete systems maintain a separate acoustic model for each word, combination of words, or phrase; this is also known as isolated word speech recognition (ISR). Continuous speech recognition (CSR) systems respond to a user who pronounces words, phrases, or sentences connected in a series, in a particular order.
Every person has a distinct voice; therefore, ASR systems are also classified according to speaker models:
Ø SPEAKER DEPENDENT: Developed for a particular speaker, i.e., trained using a data set of that person's voice. These systems are easier to develop, cheaper, and accurate.
Ø SPEAKER INDEPENDENT: Such an ASR system can recognize a variety of speakers and thus does not require prior training. It is used in Interactive Voice Response Systems (IVRS) that must accept input from a large number of different users. The drawback is that the number of words in the vocabulary is limited. Speaker-independent systems are the most difficult to implement; they are also expensive, and their accuracy is lower than that of speaker-dependent systems [5].
Ø SPEAKER ADAPTIVE: Uses speaker-dependent data and adapts to the best-suited speaker to recognize the input speech; its operation adapts to the characteristics of new speakers.
A few issues and difficulties need to be addressed and are relevant during the design of an ASR system:
Ø Environment: the type of noise in the speech signal, signal-to-noise ratio.
Ø Channel: amplitude, distortion, echo.
Ø Speaker: dependence/independence, gender, age, physical and psychical state.
Ø Speech style: voice tone (quiet, normal, shouted); production (isolated, continuous, or spontaneous speech); speed (slow, fast, normal).
Ø Vocabulary: generic vocabulary, characteristics of available training data, dialects.
The general structure of an ASR system includes the following modules [5]:

3.1 Pre-processing
The input to an ASR system is an acoustic waveform in the form of analog signals. The pre-processing task includes converting this signal into a digital signal, clipping large values of the signal, and frequency-spectrum filtering that alters the shape of the signal, enhancing the speech and de-emphasizing the noise that is present.
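The clipping and filtering steps above can be sketched as follows. This is a minimal illustration, not the paper's method: the pre-emphasis coefficient 0.97 and the clipping threshold are assumed typical values.

```python
import numpy as np

def preprocess(signal, alpha=0.97, clip=1.0):
    """Sketch of ASR pre-processing: clip extreme amplitudes, then apply
    a pre-emphasis filter y[n] = x[n] - alpha*x[n-1], a first-order
    high-pass that boosts the weaker high-frequency speech energy."""
    x = np.clip(np.asarray(signal, dtype=float), -clip, clip)  # limit large values
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]  # pre-emphasis filtering
    return y
```

In practice the coefficient alpha is chosen close to 1 so that only a small fraction of the previous sample is subtracted, flattening the spectral tilt of voiced speech.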
3.2 Feature Extraction
Feature extraction is the process of extracting, from the pre-processed speech signal, data that are unique for each speaker. Feature extraction (signal processing) keeps the relevant information and discards the irrelevant. The feature extractor divides the acoustic signal into 10-25 ms frames; the data acquired in these frames are multiplied by a window function and then transformed into spectral features.
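The framing and windowing described above can be sketched as follows; the 25 ms frame, 10 ms hop, and Hamming window are common choices assumed here for illustration.

```python
import numpy as np

def frame_signal(x, sample_rate, frame_ms=25, hop_ms=10):
    """Split a speech signal into overlapping frames (e.g. 25 ms every
    10 ms) and multiply each frame by a Hamming window, so each frame
    is ready for a short-term spectral transform such as the FFT."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    window = np.hamming(frame_len)  # tapers frame edges to reduce spectral leakage
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)
```

For a one-second signal at 16 kHz this yields 400-sample frames advanced 160 samples at a time.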
There are many methods used for feature extraction; the most widely used are described below.

3.2.1 Linear Predictive Coding (LPC)
LPC is one of the most powerful methods of signal analysis for linear prediction. It is a dominant technique for determining the basic parameters of speech and provides a precise estimation of speech parameters as well as a computational model of speech [6].
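As an illustration of linear prediction, the sketch below estimates LPC coefficients by the autocorrelation method, i.e., by solving the normal equations directly. This is an assumption-laden toy: real systems typically use the Levinson-Durbin recursion for efficiency.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Toy LPC by the autocorrelation method: find coefficients a[1..p]
    that best predict each sample from the previous p samples,
    x[n] ~ sum_k a[k] * x[n-k]."""
    x = np.asarray(frame, dtype=float)
    # autocorrelation values r[0..p]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz system of normal equations: R a = r[1..p]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])
```

On a decaying exponential x[n] = 0.9^n, a first-order predictor recovers a coefficient close to 0.9, since each sample is almost exactly 0.9 times the previous one.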
3.2.2 Mel Frequency Cepstral Coefficients (MFCC)
The MFCC front end extracts a sequence of 39-dimensional MFCC feature vectors from a quantized, digitized waveform [4]. MFCC is the most evident and popular feature extraction technique for speech recognition. It approximates the response of the human auditory system more closely than any other technique because its frequency bands are positioned logarithmically. The coefficients are obtained from a mel-frequency cepstrum, in which the frequency bands are equally spaced on the mel scale. The computation of MFCCs is based on short-term analysis, so an MFCC vector is computed from each frame [7]. The mel frequency can be computed from the raw acoustic frequency f (in Hz) as follows:

    mel(f) = 1127 ln(1 + f/700)

Each vector representing the information in a small time window of the signal is called a feature vector.
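The mel mapping above, and its inverse (used when placing the centers of the triangular mel filter bank), can be written directly as:

```python
import math

def mel(f):
    """Mel value for an acoustic frequency f in Hz: mel(f) = 1127 ln(1 + f/700)."""
    return 1127.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping from mel back to Hz."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)
```

The scale is roughly linear below 1 kHz and logarithmic above; by construction 1000 Hz maps to approximately 1000 mel, while higher frequencies are compressed.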
OTHER METHODS [5]
Ø Principal Component Analysis (PCA): a linear, eigenvector-based map; rapid; gives good results for Gaussian data. However, the directions maximizing variance do not always maximize information.
Ø Linear Discriminant Analysis (LDA): a supervised linear map, dependent on eigenvectors; better than PCA for classification. It handles the case where the within-class frequencies are unequal, and its performance has been examined on randomly generated test data. If the distribution is significantly non-Gaussian, the LDA projection will not be able to preserve any complex structure of the data, which may be needed for classification.
Ø Independent Component Analysis (ICA): a linear, iterative, non-Gaussian feature extraction method; performs blind separation of components. The extracted components are not ordered.
Ø Filter bank analysis: requires tuned filters. It provides a spectral analysis with any degree of frequency resolution (wide or narrow), even with nonlinear filter spacing, but always takes more calculation and processing time than discrete Fourier analysis using the FFT.
Ø Kernel-based feature extraction: dimensionality reduction leads to better classification; it is used to remove noisy and redundant features and improve recognition. Its drawback is slow similarity calculation speed.
Ø Wavelet analysis: better time resolution than the Fourier transform. It replaces the fixed bandwidth of the Fourier transform with one proportional to frequency, which allows better time resolution at high frequencies, but it requires longer compression time.
Ø Noise-robust methods (for noisy speech): they find features in noisy data, but increase the dependence of the data on its environment.
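As a concrete illustration of one of the methods above, the sketch below implements PCA as an eigenvector-based linear map that projects feature vectors onto the directions of maximum variance:

```python
import numpy as np

def pca_reduce(features, k):
    """Project feature vectors (rows of `features`) onto the k top
    eigenvectors of the sample covariance matrix, i.e. the k directions
    of maximum variance, as PCA does."""
    X = features - features.mean(axis=0)      # center the data
    cov = X.T @ X / (len(X) - 1)              # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return X @ top                            # reduced (n, k) features
```

Since the eigenvectors are sorted by eigenvalue, the first projected dimension always carries at least as much variance as the second, reflecting the "directions maximizing variance" property noted above.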
3.3 Decoding
Decoding is performed to find the best match for the incoming feature vectors using the knowledge base; it recognizes the speech utterance by combining and optimizing the information conveyed by the acoustic and language models.
The standard approach to large vocabulary continuous speech recognition (LVCSR) is to assume a simple probabilistic model of speech production whereby a specified word sequence, W, produces an acoustic observation sequence A, with joint probability P(W, A). The goal is then to decode the word string, based on the acoustic observation sequence, so that the decoded string has the maximum a posteriori (MAP) probability [2][4][5]:

    W' = argmax_W P(W|A)    (1)

Using Bayes' rule, this can be written as:

    P(W|A) = P(A|W) P(W) / P(A)    (2)

Since P(A) is independent of W, the MAP decoding rule becomes:

    W' = argmax_W P(A|W) P(W)    (3)
In the acoustic modeling or phone recognition stage, we compute the likelihood of the observed spectral feature vectors given linguistic units (words, phones, subparts of phones) [4].
An acoustic model can be implemented using different approaches, such as HMMs, artificial neural networks (ANNs), dynamic Bayesian networks (DBNs), and support vector machines (SVMs). The HMM is the most widely used of all, as it has proved to be an efficient algorithm for training and recognition.
The first term in equation (3), P(A|W), is generally called the acoustic model, as it estimates the probability of a sequence of acoustic observations conditioned on the word string. For LVCSR systems, it is necessary to build statistical models for subword speech units, build up word models from these subword unit models (using a lexicon to describe the composition of words), and then postulate word sequences and evaluate the acoustic model probabilities via standard concatenation methods [2].
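Since HMM-based acoustic decoding ultimately searches for the most likely hidden state sequence given the observations, the core Viterbi recursion can be sketched as follows. This is a generic, illustrative implementation in log-probabilities; the toy two-state model in the test is invented for demonstration.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Viterbi decoding for an HMM: returns the most likely state sequence.
    log_A:  (S, S) transition log-probabilities between states
    log_B:  (T, S) log-likelihood of each observation frame under each state
    log_pi: (S,)   initial state log-probabilities"""
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best log-prob of paths ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):        # follow backpointers in reverse
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With states that strongly self-transition and observations that favor state 0 for the first two frames and state 1 afterwards, the decoder switches state exactly once, at the point where the observation evidence outweighs the transition penalty.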
The second term in equation (3), P(W), is called the language model. It describes the probability associated with a postulated sequence of words. Such language models can incorporate both syntactic and semantic constraints of the language and of the recognition task [2].
Generally, speech recognition systems use bigram, trigram, or, more generally, n-gram language models to find the correct word sequence by predicting the likelihood of the nth word using the n-1 preceding words.
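A minimal illustration of such an n-gram model, here a bigram trained by maximum likelihood on a hypothetical two-sentence corpus (the corpus and sentence-start token are assumptions for the example):

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate bigram probabilities P(w_n | w_{n-1}) by relative
    frequency from a toy corpus given as lists of tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent              # "<s>" marks sentence start
        unigrams.update(tokens[:-1])         # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

corpus = [["recognize", "speech"], ["recognize", "words"]]
P = train_bigram(corpus)
# "speech" follows "recognize" in half of the training sentences
```

A decoder would multiply such probabilities along a candidate word sequence (the P(W) term in equation (3)); real systems add smoothing for unseen bigrams.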
Language models can be classified into:
Ø Uniform models: each word has an equal probability of occurrence.
Ø Stochastic models: the probability of occurrence of a word depends on the word preceding it.
Ø Finite state languages: these use a finite state network to define the allowed word sequences.
Ø Context-free grammars: these can be used to encode which kinds of sentences are allowed.
Figure: Architecture for a (simplified) speech recognizer decoding a single sentence [4].
3.4 Pattern Classification
The two steps of pattern classification are pattern training and pattern comparison.
Pattern comparison is the process of comparing the unknown test pattern with each sound-class reference pattern and computing a measure of similarity between them. After the system has been fully trained, patterns are classified at testing time to recognize the speech. There are various techniques adopted for pattern classification.
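One classic pattern-comparison technique, dynamic time warping (DTW), computes a similarity measure between a test pattern and a reference pattern even when the two differ in length, by finding the minimum-cost alignment of their frames. A minimal sketch over 1-D feature sequences (real systems compare multi-dimensional feature vectors per frame):

```python
def dtw_distance(a, b):
    """Dynamic time warping: minimum cumulative frame-to-frame distance
    over all monotonic alignments of sequences a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]   # cumulative cost table
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])       # local distance
            D[i][j] = cost + min(D[i - 1][j],     # advance in a only
                                 D[i][j - 1],     # advance in b only
                                 D[i - 1][j - 1]) # advance in both
    return D[n][m]
```

Because the warping path can stretch or compress time, an utterance spoken slowly (each frame repeated) still matches its reference template with zero cost, which is why DTW suited template-based isolated word recognition.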