
ABSTRACT

This paper provides a literature review of speech recognition: its background, methodologies, and the techniques used. Speech-to-Text, or ASR, involves processing the sound waves, extracting basic linguistic units or phonemes [1], and then creating contextually correct and meaningful words to form a complete sentence. The paper explains types of speech, different approaches to speech recognition, techniques, feature extraction, and the mathematical representation of ASR.


Keywords

Speech recognition, modules/phases of ASR, HMM, feature extraction, MEL, LPC, Large Vocabulary Continuous Speech Recognition (LVCSR), pattern classification techniques, applications, tools

 

1. INTRODUCTION

The most common and primary method of communication is to speak in a natural language.

Speech Recognition (also known as Automatic Speech Recognition (ASR), or computer speech recognition) is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program. [2]

Work on speech recognition has been going on for decades, and the most evolved form of ASR can be seen in this era. Recognition accuracy, error rate, speaker variation, and surrounding noise are still areas of active consideration.

[1] Phonemes are the linguistically significant sounds (or sets of sounds) of a language. A phoneme is the minimal unit of sound that has semantic content.
 

 

It is one of the most researched fields of signal processing. Application areas of speech recognition are diverse, ranging from dictation to speaker recognition for security purposes, language identification, etc.

 

2. DETAILS OF SPEECH RECOGNITION TASK

Some of the parameters and tasks related to speech recognition are:

 

2.1 Vocabulary Size

One aspect of the speech recognition task is the dictionary size. It affects the complexity, processing requirements, and accuracy of the recognition. There are no defined standards for size; the general perception is:

• small vocabulary – tens of words

• medium vocabulary – hundreds of words

• large vocabulary – thousands of words

• very large vocabulary – tens of thousands of words

2.2 Types of Speech Recognition Systems

Speech recognition systems can be categorized as follows:

• ISOLATED WORDS:

In this type of recognition, each word is surrounded by a pause or break. The system accepts single words or single utterances [3] at a time. These systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses). [2]

 

• CONNECTED WORDS:

This is similar to isolated words, but allows separate utterances to run together with minimal gaps between them.

 

• CONTINUOUS SPEECH:

Words run into each other and have to be segmented. [4]

 

• SPONTANEOUS SPEECH:

This recognizes natural speech. Such a system must be able to handle the features of natural language, as spontaneous speech may include mispronunciations, false starts, and slang words.

 

• READ SPEECH:

This simulates the dictation task, or conversing with a speech dialogue system. [4]

 

 

[3] An UTTERANCE is any stretch of talk, by one person, before and after which there is silence on the part of that person.
 

 

 

 

• CONVERSATIONAL SPEECH:

This recognizes the speech of two humans talking to each other, for example a telephone conversation. [4]

 

2.3 Speaker Model

 

ASR systems are
classified as discrete or continuous systems that are speaker
dependent or independent.

 

DISCRETE SYSTEMS:

Discrete systems maintain a separate acoustic model for each word, combination of words, or phrase. This is also known as isolated word speech recognition (ISR).

 

CONTINUOUS SYSTEMS:

Continuous speech recognition (CSR) systems respond to a user who pronounces words, phrases, or sentences in a series, in a particular order.

 

Every person has a distinct voice; therefore, ASR systems classified according to speaker models are:

 

• Speaker Dependent Models

This is developed for a particular speaker, i.e. it is trained using a data set of a particular person's voice. These systems are easier to develop, cheaper, and more accurate.

 

• Speaker Independent Models

Such an ASR can recognize a variety of speakers; thus, it does not require prior training for each user. It is used in Interactive Voice Response Systems (IVRS) that must accept input from a large number of different users. The drawback is that the vocabulary size is limited. Implementation of a speaker-independent system is the most difficult; it is also expensive, and its accuracy is lower than that of speaker-dependent systems. [5]

 

• Speaker Adaptive Models

These use speaker-dependent data and adapt to the best-suited speaker to recognize the input speech, adjusting operation according to the characteristics of the speaker.

 

 

2.4 Difficulties of ASR

A few issues and difficulties that need to be addressed and are relevant during the design of an ASR are:

 

 

 

Environment – unwanted information in the speech signal, i.e. noise, signal-to-noise ratio.

Transducer – microphone, telephone.

Channel – band amplitude, distortion, echo.

Speakers – speaker dependence/independence, gender, age, physical and psychological state.

Speech Styles – voice tone (quiet, normal, shouted); production (isolated, continuous, or spontaneous speech); speed (slow, fast, normal).

Vocabulary – specific or generic vocabulary, characteristics of available training data, dialects.

 

3. MODULES OF ASR

 

The basic structure of an ASR includes the following modules [5]:

 

 

                                                          

[Figure: block diagram of the basic ASR structure; the modules described below process the input speech and produce the decoded results.]

3.1 Pre-Processing

The input to an ASR is an acoustic waveform in the form of an analog signal. The pre-processing task includes converting this signal into a digital signal, clipping large values of the signal, and frequency-spectrum filtering that alters the shape of the signal or enhances the speech while de-emphasizing any noise that is present.
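As a concrete illustration, here is a minimal pre-processing sketch assuming NumPy; the clipping threshold and pre-emphasis coefficient are illustrative values, not prescribed by the paper.

import numpy as np

def preprocess(signal, alpha=0.97, clip=0.99):
    """Clip large amplitude values, then apply a pre-emphasis filter."""
    signal = np.clip(signal, -clip, clip)
    # Pre-emphasis y[n] = x[n] - alpha * x[n-1] boosts the higher frequencies
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s of a 440 Hz tone
y = preprocess(x)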

 

3.2 Feature Extraction

This is the process of extracting data from the pre-processed speech signal that is unique to each speaker. Feature extraction (signal processing) is done to keep the relevant information and discard the irrelevant. The feature extractor divides the acoustic signal into 10-25 ms frames; the data acquired in these frames are multiplied by a window function to transform them into spectral features. There are many methods for feature extraction; the most widely used, LPC and MFCC, are described below.
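The framing-and-windowing step itself can be sketched as follows; this is a minimal illustration assuming NumPy, a 16 kHz sampling rate, 25 ms frames, and a 10 ms hop.

import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len] * window
                     for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))        # (98, 400) for 1 s of audio
spectra = np.abs(np.fft.rfft(frames, axis=1))        # per-frame magnitude spectra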

 

3.2.1 Linear Predictive Coding (LPC)

 

LPC is one of the most powerful methods of signal analysis for linear prediction. It is a dominant technique for determining the basic parameters of speech, and it provides a precise estimation of speech parameters and a computational model of speech. [6], [7]
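As a sketch of the underlying computation, the code below estimates LPC coefficients with the standard autocorrelation (Levinson-Durbin) formulation, assuming NumPy; it is an illustration, not necessarily the exact procedure of the cited works.

import numpy as np

def lpc(frame, order=12):
    """Estimate LPC coefficients via autocorrelation + Levinson-Durbin."""
    n = len(frame)
    # Autocorrelation for lags 0..order
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                                  # residual prediction error
    return a   # a[0] = 1; x[n] is predicted as -(a[1]x[n-1] + ... + a[p]x[n-p])

frame = np.hamming(400) * np.random.randn(400)   # one windowed 25 ms frame at 16 kHz
coeffs = lpc(frame, order=12)                    # a 12th-order model is typical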

 


 

3.2.2 Mel Frequency Cepstral Coefficients (MFCC)

 

MFCC extraction produces a sequence of 39-dimensional feature vectors from a quantized, digitized waveform. [4]

 

MFCC is the most evident and popular feature extraction technique for speech recognition. It approximates the response of the human auditory system more closely than other techniques, because its frequency bands are positioned logarithmically. The coefficients are obtained from a mel-frequency cepstrum in which the frequency bands are equally spaced on the mel scale. The computation of MFCCs is based on short-term analysis, so an MFCC vector is computed from each frame. [7] The mel frequency can be computed from the raw acoustic frequency f (in Hz) as follows:

    mel(f) = 1127 ln(1 + f/700)
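For concreteness, the mapping can be implemented directly (a minimal sketch assuming NumPy):

import numpy as np

def hz_to_mel(f):
    """mel(f) = 1127 * ln(1 + f / 700), as given above."""
    return 1127.0 * np.log(1.0 + np.asarray(f, dtype=float) / 700.0)

print(hz_to_mel(1000))   # ~1000 mel: the scale is roughly linear below 1 kHz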


Each vector representing the information in a small time window of the signal is called a feature vector. [4]
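A sketch of assembling such 39-dimensional vectors, assuming the librosa library and the common layout of 13 MFCCs plus their delta and delta-delta coefficients; the random waveform is a stand-in for real speech.

import numpy as np
import librosa

y = np.random.randn(16000)      # stand-in for 1 s of 16 kHz speech
sr = 16000

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
d1 = librosa.feature.delta(mfcc)                     # first-order deltas
d2 = librosa.feature.delta(mfcc, order=2)            # second-order deltas
features = np.vstack([mfcc, d1, d2])                 # (39, n_frames)
print(features.shape)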

 

 

OTHER METHODS [5]

1. Principal Component Analysis (PCA)

Property: linear feature extraction method; linear map; rapid; eigenvector-based.

Advantage: good results for Gaussian data.

Disadvantage: the directions maximizing variance do not always maximize information.

2. Linear Discriminant Analysis (LDA)

Property: supervised linear map; eigenvector-based.

Advantage: better than PCA for classification; handles the case where the within-class frequencies are unequal, and its performance has been examined on randomly generated test data.

Disadvantage: if the distribution is significantly non-Gaussian, the LDA projection cannot preserve any complex structure of the data that may be needed for classification.

3. Independent Component Analysis (ICA)

Property: linear map; iterative; exploits non-Gaussianity.

Advantage: performs blind separation of statistically independent components.

Disadvantage: extracted components are not ordered.

4. Filter Bank Analysis

Property: filters tuned to the required frequencies.

Advantage: provides a spectral analysis with any degree of frequency resolution (wide or narrow), even with nonlinear filter spacing and bandwidths.

Disadvantage: always takes more calculation and processing time than discrete Fourier analysis using the FFT.

5. Kernel-Based Feature Extraction

Property: nonlinear transformations.

Advantage: dimensionality reduction leads to better classification; it is used to remove noisy and redundant features and improves classification error.

Disadvantage: slow similarity-calculation speed.

6. Wavelet

Property: better time resolution than the Fourier transform.

Advantage: replaces the fixed bandwidth of the Fourier transform with one proportional to frequency, which allows better time resolution at high frequencies.

Disadvantage: requires longer compression time.

7. RASTA Filtering

Property: for noisy speech.

Advantage: finds features in noisy data.

Disadvantage: increases the dependence of the data on its previous context.
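As an illustration of the first method in the table, the sketch below reduces hypothetical 39-dimensional frame features to 12 dimensions with PCA, assuming scikit-learn; the data are random stand-ins, and the other methods have analogous implementations.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 1000 frames of 39-dimensional feature vectors
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 39))

pca = PCA(n_components=12)         # keep the 12 directions of maximum variance
X_reduced = pca.fit_transform(X)   # shape (1000, 12)
print(pca.explained_variance_ratio_.sum())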

 

3.3 DECODING

 

Decoding is performed to find the best match for the incoming feature vectors using the knowledge base; it actually recognizes the speech utterance by combining and optimizing the information conveyed by the acoustic and language models.

The standard approach to large vocabulary continuous speech recognition (LVCSR) is to assume a simple probabilistic model of speech production whereby a specified word sequence W produces an acoustic observation sequence A with joint probability P(W, A). The goal is then to decode the word string, based on the acoustic observation sequence, so that the decoded string has the maximum a posteriori (MAP) probability [2], [4], [5]:

    W* = argmax_W P(W | A)                    (1)

 

 

Using Bayes' rule, this can be written as:

    P(W | A) = P(A | W) P(W) / P(A)           (2)

Since P(A) is independent of W, the MAP decoding rule becomes:

    W* = argmax_W P(A | W) P(W)               (3)
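In log form, equation (3) can be sketched as follows; the candidate word strings and their scores are hypothetical, standing in for the outputs of real acoustic and language models.

import math

# Hypothetical log-probabilities for three candidate word strings W
hypotheses = {
    "recognize speech":   {"log_p_a_w": -120.0, "log_p_w": math.log(1e-4)},
    "wreck a nice beach": {"log_p_a_w": -118.0, "log_p_w": math.log(1e-7)},
    "recognized speech":  {"log_p_a_w": -123.0, "log_p_w": math.log(5e-5)},
}

# Equation (3) in log form: W* = argmax_W [log P(A|W) + log P(W)]
best = max(hypotheses,
           key=lambda w: hypotheses[w]["log_p_a_w"] + hypotheses[w]["log_p_w"])
print(best)   # "recognize speech"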

 

3.3.1 ACOUSTIC MODEL

 

In the acoustic modeling or phone recognition stage, we compute the likelihood of the observed spectral feature vectors given linguistic units (words, phones, or subparts of phones). [4]

An acoustic model can be implemented using different approaches such as HMMs, ANNs, dynamic Bayesian networks (DBNs), and Support Vector Machines (SVMs). HMMs are the most widely used of these, as they have proved to be efficient for both training and recognition.

The first term in equation (3), P(A|W), is generally called the acoustic model, as it estimates the probability of a sequence of acoustic observations conditioned on the word string; hence P(A|W) is computed. For LVCSR systems, it is necessary to build statistical models for sub-word speech units, build up word models from these sub-word unit models (using a lexicon to describe the composition of words), and then postulate word sequences and evaluate the acoustic model probabilities via standard concatenation methods. [2]
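A minimal sketch of the HMM approach for isolated words, assuming the hmmlearn package and synthetic stand-in features; real LVCSR systems model sub-word units rather than whole words.

import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Hypothetical training data: for each word, a (n_frames, n_features) matrix
# of feature vectors (e.g. MFCCs); shifted means stand in for real acoustics.
train = {
    "yes": rng.standard_normal((200, 13)) + 1.0,
    "no":  rng.standard_normal((200, 13)) - 1.0,
}

# Train one Gaussian HMM per word (a whole-word acoustic model)
models = {}
for word, X in train.items():
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    m.fit(X)
    models[word] = m

# Recognition: choose the word whose model gives the highest log P(A|W)
test = rng.standard_normal((50, 13)) + 1.0   # should resemble "yes"
print(max(models, key=lambda w: models[w].score(test)))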

 

 

3.3.2 LANGUAGE MODEL

The second term in equation (3), P(W), is called the language model. It describes the probability associated with a postulated sequence of words. Such language models can incorporate both syntactic and semantic constraints of the language and the recognition task. [2]

Generally, speech recognition systems use bigram, trigram, or, more generally, n-gram language models to find the correct word sequence by predicting the likelihood of the nth word using the n-1 preceding words. [5]
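A minimal bigram sketch over a tiny hypothetical corpus; real systems estimate counts from large corpora and apply smoothing.

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()   # tiny hypothetical corpus

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("the", "cat"))   # 2/3: "the" appears 3 times, twice followed by "cat"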

Language models can be classified into: [8]

Uniform model: each word has an equal probability of occurrence.

Stochastic model: the probability of occurrence of a word depends on the word preceding it.

Finite state languages: languages use a finite state network to define the allowed word sequences.

Context free grammar: this can be used to encode which kinds of sentences are allowed.

 

[Figure: schematic architecture for a (simplified) speech recognizer decoding a single sentence. [4]]

3.4 Pattern Classification

The two steps of pattern classification are pattern training and pattern comparison. It is the process of comparing the unknown test pattern with each sound-class reference pattern and computing a measure of similarity between them. After the system has been fully trained, patterns are classified at test time to recognize the speech.

There are various techniques that can be adopted for pattern classification; one of them is sketched below.
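One classic pattern-comparison technique is dynamic time warping (DTW), which computes a similarity measure between the test pattern and each reference pattern while tolerating differences in speaking rate; the sketch below assumes NumPy and random stand-in templates (HMMs and neural networks are the other common classifiers).

import numpy as np

def dtw_distance(ref, test):
    """Dynamic time warping distance between two (n_frames, n_features) sequences."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Classify a test pattern by its nearest reference template
rng = np.random.default_rng(0)
templates = {"yes": rng.standard_normal((40, 13)),
             "no":  rng.standard_normal((35, 13))}
test = templates["yes"] + 0.1 * rng.standard_normal((40, 13))
print(min(templates, key=lambda w: dtw_distance(templates[w], test)))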
