Open Access. Published by De Gruyter, October 1, 2018, under a CC BY 4.0 license.

Optimizing Integrated Features for Hindi Automatic Speech Recognition System

  • Mohit Dua, Rajesh Kumar Aggarwal and Mantosh Biswas

Abstract

An automatic speech recognition (ASR) system translates spoken words or utterances (isolated, connected, continuous, and spontaneous) into text format. State-of-the-art ASR systems mainly use Mel frequency (MF) cepstral coefficients (MFCC), perceptual linear prediction (PLP), and Gammatone frequency (GF) cepstral coefficients (GFCC) for extracting features in the training phase of the ASR system. Initially, the paper proposes a sequential combination of all three feature extraction methods, taking two at a time. Six combinations, MF-PLP, PLP-MFCC, MF-GFCC, GF-MFCC, GF-PLP, and PLP-GFCC, are used, and the accuracy of the proposed system is tested with each of them. The results show that the GF-MFCC and MF-GFCC integrations outperform all other proposed integrations. Further, these two feature vector integrations are optimized using three different optimization methods: particle swarm optimization (PSO), PSO with crossover, and PSO with quadratic crossover (Q-PSO). The results demonstrate that the Q-PSO-optimized GF-MFCC integration shows significant improvement over all other optimized combinations.

1 Introduction

Humans use speech as a basic mode of communication. However, with the advent of technology, speech is also being used for man-machine communication. Speech recognition converts a recorded speech signal into readable text, actions, or notations, and the objective of speech recognition research is to create machines that can receive spoken information and act appropriately upon it. Speech recognition systems can be categorized into different classes, such as isolated, connected, continuous, and spontaneous speech recognition systems, based on the speaking mode or utterance recognition ability. Technology and research methods aim to make this speech-to-text translation as accurate as possible independent of the environment, speaker, or device. In the last six decades, a lot of research work has been carried out to develop accurate and efficient automatic speech recognition (ASR) systems [2, 6, 20, 27]. However, differences in dialect and grammar across languages make it difficult for researchers to develop a standard system [3, 21, 24].

1.1 Feature Integration Review

While human beings can segregate and recognize signals under noisy conditions, it remains a challenge for an automatic speech recognizer to perform this task accurately in noisy environments. Researchers have applied various methods to enhance speech or to compensate for noise in order to improve the accuracy of ASR systems in adverse conditions. These methods fall into three categories: filtering the speech signal prior to the training phase, adapting speech models to include the effects of noise, and using noise robust features. One of the simplest and most effective ways to deal with a noisy speech signal is to use noise robust features like Gammatone frequency (GF) cepstral coefficients (GFCCs) instead of conventional features like linear predictive cepstral coefficients [25], TRAPs (temporal patterns) [15], Mel frequency (MF) cepstral coefficients (MFCCs) [9], perceptual linear prediction (PLP) [14], and wavelets [30], which do not perform accurately in the presence of additive noise.

In Refs. [28] and [34], linear discriminant analysis (LDA) has been applied to integrate MFCC feature vectors with phase features and with voiced-unvoiced features, respectively [35]. MFCC has been integrated with main spectral peak features by using a log-linear model in Ref. [31], and PLP has been integrated with modulation spectrogram features using acoustic posterior probabilities [16]. All these proposed feature integrations significantly reduced the word error rate [16, 28, 31, 34].

In Ref. [35], the recognizer uses three distinct acoustic feature optimizations – vocal tract length-normalized MFCC, PLP coefficients derived from Mel scale (MF-PLP), and MFCC derived from an all-pole magnitude spectrum – by applying LDA and log-linear model integration methods. In Ref. [29], it is shown that GFCC outperforms MF-PLP integration in recognition accuracy and the ROVER (recognizer output voting error reduction) algorithm outperforms the log-linear model and LDA in combining features.

The sequential integration of the features MFCC, gravity centroids, and PLP has been implemented to enhance the accuracy of a Hindi language ASR system for a medium-sized vocabulary in specific environments in Ref. [5]. Recently, in Refs. [7] and [11], the MFCC and GFCC features have been integrated to improve the accuracy of an automatic speaker recognition system and an ASR system, respectively, in low signal-to-noise ratio (SNR) conditions.

1.2 ASR Optimization Review

Researchers have made various efforts to optimize the front and back end of the ASR system by using different optimization approaches. At the back end, the hidden Markov model (HMM) conventionally uses the Baum-Welch (BW) algorithm, which depends on a correct initial estimation of model attributes; an incorrect or arbitrary estimation of such parameters does not lead to optimal solutions. The issues of non-linear time warping and the ideal number of states in the HMM topology have been addressed by Kwong et al. in Refs. [18] and [19], respectively, by using genetic algorithms (GAs). In Ref. [33], hybrid combinations of GA-BW and particle swarm optimization (PSO)-BW are compared for continuous HMM optimization. The results conclude that PSO-BW outperforms GA-BW in recognition performance. PSO has also been combined with the Viterbi algorithm to improve the recognition accuracy of the decoding phase of the ASR system [22]. Both the GA and PSO techniques have been applied in the front-end feature extractor module to optimize MFCC filter banks [4]. The results conclude that ASR performance is improved by using PSO- and GA-optimized filter banks. GA and PSO have also been used for integrated feature vector refinement in Ref. [11]. However, in both Refs. [11] and [4], the results conclude that PSO outperforms GA in recognition accuracy. In Ref. [11], a discriminatively trained PSO-optimized feature-based ASR system has an accuracy rate 5.4% better than that of the discriminatively trained GA-optimized feature-based ASR system. In Ref. [4], the ASR system using PSO-optimized filter banks is approximately 3%–5% more accurate than the ASR system using GA-optimized filter banks. It has also been shown that PSO converges at a faster rate and requires less computational time than GA.

The information-sharing idea of the PSO approach is entirely different from that of techniques like GA and differential evolution. In PSO, only the global best shares information with the other particles while the swarm searches for the optimal solution, whereas the other two methods use a group-sharing mechanism in which every individual shares information with every other. One common limitation of PSO is premature convergence after some iterations, which results in a suboptimal solution. To address this, researchers have added a crossover operator to PSO, proposing PSO with crossover (C-PSO) and PSO with quadratic crossover (Q-PSO) [13, 23].

The proposed work mainly integrates features and then optimizes the integrated features to enhance the performance of a Hindi language ASR system. Initially, the sequential integration of all three feature extraction methods (taking two at a time) is evaluated. The results show that the MF-GFCC and GF-MFCC integrations outperform the MF-PLP, PLP-MFCC, GF-PLP, and PLP-GFCC integrations. Further, these feature vector integrations are optimized using three different optimization methods: PSO, C-PSO, and Q-PSO. The results show that the Q-PSO-optimized GF-MFCC integration achieves significant improvement over all other optimized combinations. The developed ASR system uses the HMM Tool Kit [HTK, Cambridge University Engineering Department (CUED), Cambridge, UK] version 3.5 beta-2 and MATLAB (MathWorks, MA, USA) version R2015a for its implementation.

The paper is organized as follows: Section 2 discusses the basics of the feature extraction and optimization methods used in the implementation of the proposed work. The details of the proposed architecture and the speech corpus are described in Sections 3 and 4, respectively. The experimental results and the conclusions derived from the work are discussed in Sections 5 and 6, respectively.

2 Theoretical Background

2.1 Feature Extraction

The objective of feature extraction is to detect a set of acoustically correlated variables from the speech signal. Such variables are termed features. Feature extraction is of utmost importance in the development of a speech recognition system, as it removes unwanted and redundant information. There are various feature extraction techniques, and this section gives the theoretical background of the methods used in the proposed work.

2.1.1 MFCC

Researchers have been using MFCC as a de facto standard for extracting features from the input acoustic signal [24]. MFCC exploits aspects of both speech production and speech perception to extract a feature vector that captures the information in the speech signal [9].

The MFCC feature extraction first amplifies the energy at high frequencies and distributes this power across the relative frequencies by pre-emphasizing the input speech signal [8]. Secondly, framing and windowing are carried out to eliminate discontinuities at the frame edges. Thirdly, the discrete Fourier transform (DFT) is applied, and the obtained spectrum is filtered with different band pass filters: the Fourier-transformed frame is passed through a logarithmic Mel-scaled filter bank. The relation between the Mel scale and the frequency of the speech signal is given in Eq. (1):

(1) Mel(f) = 2595 · log10(1 + f/700).

Finally, the discrete cosine transform, acting as the inverse DFT of the log Mel spectrum, is applied to obtain 12 MFCC coefficients and one energy coefficient. Generally, these first 13 coefficients are taken for further representation of the signal. The obtained cepstral coefficients are extended using first- and second-order derivatives to capture the dynamic nature of speech [9, 11, 24].
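
As an illustration of this pipeline, the sketch below computes the 39-dimensional vector (13 static coefficients plus first- and second-order derivatives) with the librosa library; the file name sample.wav is a placeholder, and this is a minimal sketch rather than the HTK front end used in this work.

    # Minimal sketch: 13 MFCCs plus delta and delta-delta coefficients.
    import librosa
    import numpy as np

    y, sr = librosa.load("sample.wav", sr=16000)         # hypothetical input file
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 static coefficients
    delta1 = librosa.feature.delta(mfcc)                 # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)        # second-order derivatives
    features = np.vstack([mfcc, delta1, delta2])         # 39 x n_frames feature matrix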

2.1.2 PLP

Like MFCC, PLP first performs windowing and a fast Fourier transform. The computed spectrum then passes through the Bark filter bank, which consists of 27 very sharp band pass filters. Analogous to the pre-emphasis step of MFCC, an equal-loudness function is applied, and its output is used in linear prediction. Finally, a recursive cepstrum computation is applied to obtain the PLP coefficients [14]. Like MFCC, the PLP feature vector also has 39 features: the first 13 coefficients, 13 first-order derivative coefficients, and 13 second-order derivative coefficients. However, PLP uses trapezoidal filters and cube root compression instead of the triangular filters and logarithmic compression of MFCC [11, 14].

2.1.3 GFCC

Based on the equivalent rectangular bandwidth scale and a Gammatone filter bank, GFCC is a more comprehensive model designed to simulate the process of the human hearing system [25, 29]. One of the major challenges for an ASR system is noise sensitivity, and performing accurately in real-time acoustic environments is an important requirement for a good feature extraction method. Sensitivity to additive noise is the main limitation of MFCC [7, 11]. The key difference between MFCC and GFCC is the filter bank. The Gammatone filter bank is a group of filters whose impulse responses are similar to the magnitude characteristics of the human auditory filters. Like MFCC and PLP, GFCC performs windowing and a Fourier transform, and the produced output is passed through the Gammatone filter bank. A Gammatone filter with center frequency f can be defined as follows [7]:

(2) g(f, t) = a · t^(n−1) · e^(−2πbt) · cos(2πft + ϕ),

where a is a constant, ϕ denotes the phase, and n defines the order of the filter. The value of n is usually set to a value less than 4, and ϕ is set to 0. The factor b in Eq. (2) is mathematically expressed as follows:

(3) b = 25.17 · (4.37f/1000 + 1).

The first 12 components are then selected to obtain a GFCC feature vector that consists of 12 cepstral coefficients, 12 first-order derivatives, and 12 second-order derivatives [7, 11]. The final feature vector thus contains 36 coefficient values.
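
For illustration, Eqs. (2) and (3) can be evaluated directly. The sketch below builds the impulse response of a single Gammatone filter; the sampling rate, duration, and the constant a = 1 are illustrative assumptions, not values taken from this work.

    # Sketch of Eq. (2): impulse response of a Gammatone filter with
    # center frequency f, using the bandwidth factor b from Eq. (3).
    import numpy as np

    def gammatone_ir(f, duration=0.025, fs=16000, a=1.0, n=4, phi=0.0):
        t = np.arange(int(duration * fs)) / fs       # time axis in seconds
        b = 25.17 * (4.37 * f / 1000 + 1)            # Eq. (3)
        return a * t**(n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t + phi)

    g = gammatone_ir(f=1000.0)                       # filter centered at 1 kHz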

2.2 Optimization Methods

The optimization methods PSO, C-PSO, and Q-PSO are population-based search techniques. The parameters used by a PSO algorithm are the initial size of the feature population and the fitness computation. C-PSO and Q-PSO use one more ingredient, a crossover operator, to obtain the optimal solution. The following sub-sections discuss these methods in brief.

2.2.1 PSO

PSO is an intelligent optimization technique that belongs to the class of optimization algorithms known as meta-heuristics. It is inspired by the social behavior of animals like fish and birds, and is a simple but powerful population-based optimization technique in which each member of the population is termed a particle. Each particle has two components: its velocity and its current position. Every particle of the swarm communicates with the others to share information about these two components [22, 33]. During a run of the algorithm, each particle updates its components, and the algorithm finds a new particle with better velocity and position. The updated particle has better velocity and position values because it uses previous particle experiences. Each particle also keeps its personal best experience, known as the local best (Lbestj), and there is a global best experience common to all particles, known as the global best (Gbest). A particle with velocity Vj(t) and position Sj(t) moves in the direction of vector Vj(t), with one component parallel to Lbestj and another parallel to Gbest. The resultant of these vectors gives the new updated position Sj(t + 1) and the updated velocity Vj(t + 1), as illustrated in Figure 1.

Figure 1: Particle Swarm Optimization.

The new updated position Sj(t + 1) and velocity Vj(t + 1) of a particle are mathematically expressed as follows:

(4) Sj(t + 1) = Sj(t) + Vj(t + 1),
(5) Vj(t + 1) = w·Vj(t) + C1·R1·(Lbestj(t) − Sj(t)) + C2·R2·(Gbest(t) − Sj(t)),

where j is the index number of the particle; t is the time step; w is the inertia weight; R1, R2 are the uniformly distributed random variables; and C1, C2 are the acceleration coefficients. The above equations are for one-dimensional search space. The new velocity and position can also be derived for multiple dimensions, as follows:

(6) Vjk(t + 1) = w·Vjk(t) + R1·C1·(Lbestjk(t) − Sjk(t)) + R2·C2·(Gbestk(t) − Sjk(t)),
(7) Sjk(t + 1) = Sjk(t) + Vjk(t + 1),

where j is the index number of the particle, k denotes the kth component of the velocity vector, and w is the inertia coefficient.
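
To make the update rules concrete, a minimal sketch of Eqs. (6) and (7) is given below; the fitness function, swarm size, and coefficient values are illustrative assumptions rather than the settings used in this work, and fitness is maximized.

    # Sketch of PSO using the multi-dimensional updates of Eqs. (6) and (7).
    import numpy as np

    def pso(fitness, dim=10, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5):
        rng = np.random.default_rng(0)
        S = rng.uniform(-1, 1, (n_particles, dim))   # positions S_jk
        V = np.zeros((n_particles, dim))             # velocities V_jk
        lbest = S.copy()                             # local best positions
        lbest_val = np.array([fitness(s) for s in S])
        gbest = lbest[np.argmax(lbest_val)]          # global best position
        for _ in range(iters):
            R1, R2 = rng.random((2, n_particles, dim))
            V = w * V + c1 * R1 * (lbest - S) + c2 * R2 * (gbest - S)  # Eq. (6)
            S = S + V                                                  # Eq. (7)
            vals = np.array([fitness(s) for s in S])
            better = vals > lbest_val                # update local bests
            lbest[better], lbest_val[better] = S[better], vals[better]
            gbest = lbest[np.argmax(lbest_val)]      # update global best
        return gbest

    best = pso(lambda x: -np.sum(x**2))              # toy fitness: maximum at the origin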

2.2.2 C-PSO and Q-PSO

It has been observed that PSO suffers from low solution precision, gets trapped in local optima, and exhibits premature convergence. To overcome these drawbacks and obtain more optimized results, PSO with a crossover operator has been proposed by various researchers. C-PSO and Q-PSO are two examples of such algorithms [13, 23].

In C-PSO, during the particles’ exploration of the search space, some particles find their current individual best and the other particles move toward it, so that no further exploration of the search space takes place. The fitness of each feature is evaluated. The two individuals with the best fitness values are selected, and the crossover operator is applied to them. After the crossover operation, the fitness of the two new offspring is compared with the individuals’ best positions, and the best value among these is chosen as the new best position. The crossover operator thus reduces the chance of PSO being trapped in local optima. The crossover operator on two parent chromosome vectors Parent1 and Parent2 in C-PSO is applied using Eqs. (8) and (9) [13]:

(8) Offspring1 = s × Parent1 + (1 − s) × Parent2,
(9) Offspring2 = (1 − s) × Parent1 + s × Parent2,

where s is any random number in the range [0, 1].

Q-PSO is a simple modified version of the basic PSO algorithm [23]. It uses a non-linear crossover operator that takes three parents from the swarm to generate a new offspring. The newly generated particle lies at the minimum of the quadratic curve passing through the three selected particles. This particle is accepted into the population irrespective of whether it produces a better result than the previous one, so the search over the feature space is not restricted to a certain region. The quadratic crossover is applied until the population reaches the optimum solution. In this algorithm, one particle with the minimum fitness value, Fmin, and two randomly selected particles, F1 and F2, are taken from the search space, and the new particle Fnew is computed using Eq. (10) [23]:

(10) Fnew = (1/2) · [(F1^2 − F2^2)·fitness(Fmin) + (F2^2 − Fmin^2)·fitness(F1) + (Fmin^2 − F1^2)·fitness(F2)] / [(F1 − F2)·fitness(Fmin) + (F2 − Fmin)·fitness(F1) + (Fmin − F1)·fitness(F2)].
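
For illustration, Eq. (10) can be implemented directly. The sketch below treats the particles as scalars for clarity (an assumption of this sketch; the actual feature vectors are multi-dimensional), and fitness() stands for whatever objective the swarm evaluates.

    # Sketch of the quadratic crossover of Eq. (10) for scalar particles.
    def quadratic_crossover(f_min, f1, f2, fitness):
        num = ((f1**2 - f2**2) * fitness(f_min)
               + (f2**2 - f_min**2) * fitness(f1)
               + (f_min**2 - f1**2) * fitness(f2))
        den = ((f1 - f2) * fitness(f_min)
               + (f2 - f_min) * fitness(f1)
               + (f_min - f1) * fitness(f2))
        return 0.5 * num / den                       # new particle F_new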

3 Proposed Architecture

The proposed ASR system comprises two major modules, i.e. the front end and the back end. The front end involves feature extraction, refinement of features, and acoustic modeling. The extraction of feature vectors and their refinement are done using the techniques and algorithms discussed above. An HMM model with various numbers of Gaussian mixtures is used to generate the acoustic model. The back end involves decoding using a pronunciation model and a language model. Figure 2 gives the proposed architecture for the HMM-Gaussian mixture model (GMM)-based ASR system using various refined and integrated feature vectors.

Figure 2: HMM-GMM-Based ASR System Using Various Feature Extraction Methods.

3.1 Pre-processing and Integrated Feature Extraction

The pre-processing step is used to increase the efficiency of the ASR system. Generally, pre-processing comprises sampling, windowing, and denoising. Noise may arise due to microphone or electrical issues and environmental conditions. Various filtering methods (e.g. Wiener filtering) are used in the pre-processing stage [1, 8]. Subsequently, the input speech signal is parameterized using feature extraction techniques like MFCC, PLP, and GFCC. Various other methods have also been developed to increase the efficiency of the system.

The proposed work also exploits the sequential integration of MFCC, PLP, and GFCC. An important part of this sequential integration is to reduce the integrated feature space. The reduced feature space should contain only the features that carry the richest possible discriminant information. The simplest approach for feature reduction is principal component analysis (PCA) [12]. However, it has some problems that can easily be addressed by using the supervised approaches known as LDA and heteroscedastic LDA (HLDA) [17]. Since the HLDA technique performs better than the PCA and LDA methods, the proposed work uses HLDA to reduce the integrated feature set. All six types of feature integrations are optimized separately for each experiment. The processing time of the proposed ASR system is related to the time taken by the integrated feature extraction methods. Full covariance matrix statistics for each component are required to estimate an HLDA transform, and the complexity of HMM decoding is directly proportional to the size of the feature vectors. Hence, applying this feature reduction technique improves both the performance and the speed of current speech recognizers, at the added cost of feature integration and reduction. Extracting features from the vocalizations begins by segmenting the waveform into frames. The frames are then parameterized into speech vectors. The time complexity of the proposed feature extraction method is O(n²V), where n is the number of sequences and V is the number of frames.

3.1.1 MFCC and PLP Integrated Features

An integrated MF-PLP feature vector consisting of 17 features is obtained by combining the 13 best features of MFCC and the 4 best features of PLP. First-order and second-order derivatives of this sequential combination of 17 features result in a total of 51 features. HLDA is used to reduce these 51 features to 39 features, which are termed MF-PLP features. Figure 3 shows the steps followed to compute the MF-PLP features; a code sketch of the same pipeline follows the figure.

Figure 3: MF-PLP Feature Extraction Method.
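
A minimal sketch of this 17 → 51 → 39 pipeline is given below. It assumes that mfcc and plp are precomputed frame-by-feature matrices and that frame-level phone labels are available for the supervised reduction; scikit-learn's LDA is used as a stand-in for the HLDA transform actually applied in this work.

    # Sketch of the MF-PLP integration: 13 MFCC + 4 PLP features per frame,
    # deltas to 51 dimensions, then a supervised reduction to 39 dimensions.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def integrate_mf_plp(mfcc, plp, labels):
        base = np.hstack([mfcc[:, :13], plp[:, :4]])  # 17 features per frame
        d1 = np.gradient(base, axis=0)                # first-order derivatives
        d2 = np.gradient(d1, axis=0)                  # second-order derivatives
        feats51 = np.hstack([base, d1, d2])           # 51 features per frame
        # LDA stand-in for HLDA; needs at least 40 phone classes for 39 components.
        lda = LinearDiscriminantAnalysis(n_components=39)
        return lda.fit_transform(feats51, labels)     # 39 features per frame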

Similarly, all 13 features of PLP and the top 4 coefficients of MFCC are combined and then reduced using HLDA to get the feature vector named PLP-MFCC. Figure 4 shows the steps followed to compute the PLP-MFCC features.

Figure 4: PLP-MFCC Feature Extraction Method.

3.1.2 MFCC and GFCC Integrated Features

An integrated GF-MFCC feature vector consisting of 17 features is obtained by combining the top 12 features of GFCC and the top 5 features of MFCC. First-order and second-order derivatives of this sequential combination of 17 features result in a total of 51 features. HLDA is used to reduce these 51 features to 39 features, which are termed GF-MFCC features. Figure 5 shows the steps followed to compute the GF-MFCC features.

Figure 5: GF-MFCC Feature Extraction Method.

Similarly, all 13 features of MFCC and the top 4 coefficients of GFCC are combined and then reduced using HLDA to get the feature vector named MF-GFCC. Figure 6 shows the steps followed to compute the MF-GFCC features.

Figure 6: MF-GFCC Feature Extraction Method.

3.1.3 PLP and GFCC Integrated Features

Finally, GFCC is combined with PLP. An integrated GF-PLP feature vector consisting of 17 features is obtained by combining the top 12 features of GFCC and the top 5 features of PLP. First-order and second-order derivatives of this sequential combination of 17 features result in a total of 51 features. HLDA is used to reduce these 51 features to 39 features, which are termed GF-PLP features. Figure 7 shows the steps followed to compute the GF-PLP features.

Figure 7: GF-PLP Feature Extraction Method.

Similarly, all 13 features of PLP and the top 4 coefficients of GFCC are combined and then reduced using HLDA to get the feature vector named PLP-GFCC. Figure 8 shows the steps followed to compute the PLP-GFCC features.

Figure 8: PLP-GFCC Feature Extraction Method.

3.2 Integrated Feature Optimization

The integrated feature vectors are optimized using the PSO, C-PSO, and Q-PSO methods. The PSO implementation follows the algorithm proposed by the authors in Ref. [11]. Like PSO, C-PSO computes the fitness of each feature vector in the population and assigns an individual fitness value to each feature. In addition, the feature vector having the best fitness in the population is found and assigned as the global best PSOgbest. Unlike PSO, however, C-PSO uses a crossover operator: two parents, psoparent1 and psoparent2, with the best fitness values are passed together with a random number psopfraction in the range [0, 1], and the crossover operator is applied to these two parents to generate two offspring. Finally, the values of the features are updated, and the global best PSOgbest and local best PSOpbest values are computed iteratively. Algorithm 1 shows the application of C-PSO to obtain refined features.

Algorithm 1:

C-PSO (psopopulation_size).

       Begin
        Initialization
           p = 0
           psopopulation = a random population generated using feature vector ft
           psogbest = ∅
           psopbest[psopopulation_size] = ∅
Fitness computation
        do
        {
           Compute fitness (p) ∀ p ∈ psopopulation
           psopopulation_best[p] = p
           If (fitness (psopopulation_best[p]) > fitness (psogbest))
              psogbest = psopopulation_best[p]
           p = p + 1
        }
        While (p < psopopulation_size)
Crossover
        Reinitialize p = 0
        crossover (psoparent1, psoparent2, psopfraction)
        {
           psooffspring1 = psopfraction × psoparent1 + (1 − psopfraction) × psoparent2
           psooffspring2 = (1 − psopfraction) × psoparent1 + psopfraction × psoparent2
        }
        do
        {
           for (psoparent1, psoparent2 ∈ psopopulation)
              psooffspring = crossover (psoparent1, psoparent2, psopfraction)
           newValue = comparison (ft, psogbest, psopbest[p])
           if (fitness (newValue) > fitness (psopbest[p]))
           {
              psopbest[p] = newValue
              If (fitness (psopbest[p]) > fitness (psogbest))
                 psogbest = psopbest[p]
           }
           psopopulation(p + 1) = psopopulation(p) + psooffspring
           p = p + 1
        }
        while (ft is not refined)
        return psogbest
       end

The algorithm for Q-PSO-based feature optimization uses the feature vector ft as search space and qpsoswarm_size as the swarm size, and finds two particles, qpsoparent1 and qpsoparent2, with the best fitness values. A particle Fmin with the worst fitness value is also found. The values of Q-PSOpbest and Q-PSOgbest are initialized. These three particles are used in Eq. (10) to find the new particle Fnew. The worst particle is replaced with the new particle, and this procedure is repeated for the whole population. The non-linear quadratic crossover operator in this proposed system finds a better solution in the feature search space. Algorithm 2 shows the application of Q-PSO to obtain refined features.

Algorithm 2:

Q-PSO

       Begin
        Initialization
           p = 0
           qpsoswarm = a random population generated using feature vector ft
           qpsogbest[p] = ∅
           qpsopbest[p] = ∅
Fitness computation
        do
        {
           Compute fitness (p) ∀ p ∈ qpsoswarm
           qpsoswarm_best[p] = p
           If (fitness (qpsoswarm_best[p]) > fitness (qpsogbest))
              qpsogbest[p] = qpsoswarm_best[p]
           p = p + 1
        }
        While (p < qpsoswarm_size)
Crossover
        Reinitialize p = 0
        do
        {
           Select two random particles qpsoparent1 and qpsoparent2 from the population
           F1 = qpsoparent1
           F2 = qpsoparent2
           Find the worst particle Fmin from the population
           Put F1, F2, and Fmin in Eq. (10) and find the new particle Fnew
           Replace the worst particle with Fnew
           if (fitness (Fnew) > fitness (qpsopbest[p]))
           {
              qpsopbest[p] = Fnew
              If (fitness (qpsopbest[p]) > fitness (qpsogbest))
                 qpsogbest = qpsopbest[p]
           }
           p = p + 1
        }
        while (ft is not refined)
        return qpsogbest
       end

3.3 Acoustic Modeling

In the acoustic modeling of the system, the optimized features are directly linked with the expected phones of the sentence. The two major approaches in acoustic modeling are HMM and GMM. HMM is a technique developed by Baum and Petrie to deal with the statistical variations of speech, and it is widely used in speech recognition. HMMs are stochastic finite state machines with a stochastic output process attached to each state. These states can be matched by vector quantization or by a GMM; a uniform number of mixtures is associated with the HMM states [24]. For a given observation sequence, the state sequence is not observable and is therefore hidden, which is why the word hidden is placed before Markov models. An HMM is defined as λ(A, B, π) and consists of the following elements:

  1. Shmm: Set of states S = {S1, S2, …, Sn}.

  2. Mhmm: Number of distinct observation symbols per state, where the individual symbols are denoted by V = {v1, v2, …, vk}.

  3. A: aij: State transition probability, where each aij represents the probability of transitioning from state Si to Sj, i.e. aij = P(Tt+1 = Sj | Tt = Si).

  4. B: bj(k): Probability distribution of each symbol in an HMM state, i.e. bj(k) = P(vk at t | Tt = Sj).

  5. π: Initial state distribution, i.e. the probability that Si is an initial state.

A GMM is a probability density function represented by a weighted sum of Gaussian density components. GMMs are commonly used in speaker recognition applications because a probabilistic model for multivariate densities is suitable for representing arbitrary densities. In the last few years, GMMs have become a powerful approach in the field of speech recognition. A GMM represents a continuum of possibilities by putting a prior on some parameter describing the mean [24]. A GMM can be represented as a weighted sum of K component Gaussian densities using Eq. (11):

(11) P(Xn|S) = Σ_{k=1}^{K} wk · Pk(Xn),

where Xn is the D-dimensional feature vector and Pk(Xn), k = 1, 2, …, K, are the Gaussian densities whose weighted linear combination forms the mixture. The weighting factors wk satisfy wk > 0 and Σ_{k=1}^{K} wk = 1, and the kth mixture component has mean vector μk and variance-covariance matrix Σk. Each component density is represented by the Gaussian function, as follows:

(12) pk(Xn) = 1 / ((2π)^{D/2} |Σk|^{1/2}) · exp[−(xn − μk)^T Σk^{−1} (xn − μk) / 2].

Due to its computational overhead, the full covariance matrix can be reduced to a diagonal covariance matrix. Hence, using Eqs. (11) and (12), the GMM can be defined as

(13) P(Xn|S) = Σ_{k=1}^{K} Zk · exp[−(1/2) Σ_{q=1}^{D} (xnq − μkq)^2 / σkq^2],

where Zk is a constant for each Gaussian.

In order to compute efficiently and to avoid underflow, the probabilities are computed in the log domain. Therefore, the log likelihood can be expressed using Eq. (14):

(14) log P(Xn|S) = logadd_{k=1}^{K} [log(Zk) − (1/2) Σ_{q=1}^{D} (xnq − μkq)^2 / σkq^2].
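
The logadd operator in Eq. (14) is a log-sum-exp over the mixture components. A minimal sketch for a diagonal-covariance GMM follows, where log Zk is assumed to absorb both the mixture weight and the Gaussian normalization term.

    # Sketch of Eq. (14): log-likelihood of a frame x under a diagonal GMM.
    import numpy as np
    from scipy.special import logsumexp

    def gmm_loglik(x, weights, means, variances):
        # log Z_k: mixture weight plus normalization of each Gaussian
        log_z = np.log(weights) - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        quad = 0.5 * np.sum((x - means)**2 / variances, axis=1)  # exponent of Eq. (13)
        return logsumexp(log_z - quad)                           # logadd of Eq. (14)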

The GMM model has various advantages: it is computationally inexpensive, it is insensitive to the temporal aspects of speech, and it models only the underlying distribution of the acoustic observations.

3.4 Decoder

After the training of the system is achieved with the help of acoustic and lexicon modeling, decoding the output is the major objective. This objective is achieved by the implementation of the Viterbi algorithm. The search space of the Viterbi decoder at the back end can be viewed as a directed acyclic graph; this word graph helps in producing the required output for each spoken utterance. The best sequence of states can be calculated with the help of the previous states and the observation sequence. Pattern matching is done using knowledge of the trained model and a well-defined set of phonemes. The Viterbi algorithm applies dynamic programming to determine the most suitable state sequence of the HMM model. In Viterbi decoding, to find the single best state sequence Q = {q1, q2, q3, …, qt} for a given observation sequence O = {o1, o2, o3, …, ot}, the following quantity is defined:

(15) δt(i) = max_{q1, q2, …, q(t−1)} P[{q1, q2, …, qt = i}, {O1, O2, …, Ot} | λ],

where δt(i) is the best score along a single path that accounts for the first t observations and ends in state Si. The Viterbi algorithm has properties very similar to those of the standard forward algorithm; the major difference is that it maximizes over the previous states instead of summing over them.
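
A minimal log-domain sketch of the Viterbi recursion in Eq. (15) for a discrete-observation HMM is given below, using the A, B, and π notation of Section 3.3; the decoder in the actual system operates over a much larger HTK word graph, so this is an illustration only.

    # Sketch of the Viterbi algorithm in the log domain for a discrete HMM.
    # A: transition matrix, B: emission matrix, pi: initial distribution.
    import numpy as np

    def viterbi(obs, A, B, pi):
        n_states, T = A.shape[0], len(obs)
        delta = np.zeros((T, n_states))              # best-path scores, Eq. (15)
        psi = np.zeros((T, n_states), dtype=int)     # back-pointers
        delta[0] = np.log(pi) + np.log(B[:, obs[0]])
        for t in range(1, T):
            scores = delta[t - 1][:, None] + np.log(A)   # extend every path
            psi[t] = np.argmax(scores, axis=0)           # best predecessor per state
            delta[t] = np.max(scores, axis=0) + np.log(B[:, obs[t]])
        path = [int(np.argmax(delta[-1]))]
        for t in range(T - 1, 0, -1):                # trace back the best sequence
            path.append(int(psi[t][path[-1]]))
        return path[::-1]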

4 Training and Testing Speech Corpus

For an ASR system, building a robust acoustic model and a well-defined language model is of utmost importance; a well-labeled speech corpus for the training model and a text dataset for the language model are always needed to build an accurate ASR system. A language like English has vast amounts of speech data that help in making complex acoustic models, and text data that help in building low-perplexity language models for a well-defined ASR task. Hindi, although spoken by a large number of people, still lags behind in the availability of speech as well as text data. However, it has some special features that are distinct from many other languages. It belongs to the Indo-European language family, is phonetically rich in nature, and is written in Devanagari script. It has fewer vowels but more consonants than English. There are 12 basic vowels in the Hindi language, which include short and long versions of the same sound; therefore, it is called Barakhadi. The basic set of consonants has 36 characters categorized according to the place and manner of articulation. Thus, the Hindi alphabet set has a total of 48 characters, which is large in comparison to the 26 characters of the English alphabet. A special feature of the Hindi language is that the set used to speak (phones) and the set used to write (orthographic representation) are identical. Due to this close correspondence between phonemes and graphemes, the string of Hindi characters can be used as the output symbols of the acoustic-phonetic analyzer, thus eliminating the need for a pronunciation dictionary. Moreover, the errors in signal-to-symbol conversion caused by vagaries of pronunciation are fewer for Indian languages.

A well-defined and real-time Hindi speech database developed by TIFR, Mumbai, has been used for the proposed system [26]. The database contains almost all the phonemes used in the Hindi language. It was designed with 100 speakers recording 1000 sentences. Two sentences that contain the maximum number of Hindi phones are common to every speaker, and the next eight sentences also try to cover the maximum number of Hindi phones. Hence, the speech corpus used for the implemented ASR system tries to cover the complete set of consonants and vowels defined in the Hindi character set. A 16-kHz sampling frequency is used to digitally record the speech data. A random set of 80 speakers consisting of 55 male and 25 female speakers is used for training, and the remaining 20 speakers are used for testing [26].

The database used for the proposed system is configured into two parts: one for training purposes and the other for testing. The entire speakers’ input data are divided into three different training datasets as well as three different testing datasets. Thirty randomly selected speakers (18 male and 12 female) who belong to north India (NI) and speak Hindi frequently are listed in Set1. Thirty randomly selected speakers (18 male and 12 female) who belong to south India (SI) and speak Hindi less frequently are listed in Set2, whereas Set3 contains the data of both Set1 and Set2. These datasets contain both male and female voices.

Similar to the training datasets, the testing data are also divided into three sets. A dataset consisting of 12 randomly selected male speakers (from both NI and SI) is listed in Set1, a dataset consisting of 8 female speakers (from both NI and SI) is listed in Set2, and Set3 contains the data from both the Set1 and Set2 datasets. The proposed speech recognition system also uses the state-of-the-art NOISEX-92 [32] database to add additive white Gaussian noise to test its performance in noisy environments.

5 Simulation Details and Experiment Results

The proposed system uses MATLAB version R2015a for developing the feature extractor module of the ASR system. The acoustic module and decoding module have been developed using the HTK 3.5 beta-2 toolkit. The word recognition rate (WRr) described by Eq. (16) has been used as the parameter for performance analysis of the developed system:

(16) WRr = (TN − TD − Ts − TI) / TN × 100,

where TN denotes the number of test set words, TD represents the number of deleted words, Ts refers to the number of substituted words, and TI is the number of inserted words.
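
As a worked example, Eq. (16) reduces to a one-line computation; the counts below are illustrative, not measured values from this work.

    # Sketch of Eq. (16): word recognition rate from error counts.
    def word_recognition_rate(t_n, t_d, t_s, t_i):
        return (t_n - t_d - t_s - t_i) / t_n * 100

    # Example: 1000 test words with 40 deletions, 120 substitutions, 30 insertions
    print(word_recognition_rate(1000, 40, 120, 30))  # -> 81.0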

Table 1:

WRr (%) of Various Feature Extraction Integrations and MFCC.

Training dataset Test dataset MFCC PLP-MFCC MF-PLP PLP-GFCC GFCC-PLP MF-GFCC GF-MFCC
Set1 Set1 64.36 64.36 66.45 66.50 66.10 72.42 73.10
Set2 66.50 66.90 68.36 68.10 69.40 70.96 71.40
Set3 65.86 65.42 68.40 68.02 68.50 72.04 73.36
Set2 Set1 66.87 66.20 69.25 69.50 69.96 70.65 71.20
Set2 68.20 64.36 68.45 67.36 67.86 72.26 72.42
Set3 69.56 63.72 67.05 66.96 67.50 73.80 73.76
Set3 Set1 62.74 62.10 65.74 65.50 66.96 71.23 71.56
Set2 63.96 64.16 64.86 65.86 66.50 70.12 71.82
Set3 65.25 63.26 66.20 66.50 67.10 76.16 76.40

5.1 Performance Analysis of Various Integrated Feature Vectors

The performance comparison of the implemented ASR system for all feature extraction integrations, including the conventional MFCC method, at the front end with HMM-GMM acoustic modeling at the back end is shown in Table 1. It can clearly be concluded from the comparison that the MF-GFCC- and GF-MFCC-based ASR systems have higher recognition accuracy than the other integrated feature vector-based ASR systems. Also, when testing Set3 against training Set3, where both datasets include north and south Indian male and female speech utterances, the ASR systems perform better than with all other combinations. The proposed feature integration methods have also been compared with traditional MFCC features [11]: the highest WRr of the GF-MFCC combination is 11.15% higher than the highest WRr of MFCC. One more interesting observation is that the accuracy of each training dataset with respect to another training dataset shows mixed behavior. For example, the training Set2/testing Set2 combination has higher accuracy than the training Set1/testing Set2 combination for the MFCC, MF-PLP, MF-GFCC, and GF-MFCC integrations, but lower accuracy for the PLP-MFCC, PLP-GFCC, and GFCC-PLP integrations. Similar distinct observations can also be made for training dataset Set3. Hence, different variabilities, such as speaker variability and speech variability, play a major role during the training and testing of an efficient speech recognition system.

5.2 Performance Analysis Using Different SNRs

A real-time speech recognition system is required to be robust to real-time noise interference in the input speech signal. A robust system falls into one of two major types: one that adapts itself to any type of noise based on adaptation techniques, and another that reduces the noise interference in noisy audio inputs based on reduction techniques [10]. The proposed system is evaluated using additive noise at different SNRs. The interference added to the input speech signal while it passes through a communication channel is known as additive noise. The proposed speech recognition system uses additive white Gaussian noise from the NOISEX-92 database [32]; this type of additive noise has uniform power throughout the frequency band and a normal distribution in the time domain. The accuracy values of the implemented ASR system using the MF-GFCC and GF-MFCC feature vector combinations at different SNRs are shown in Table 2. The results clearly show that the GF-MFCC-based ASR system performs better than the MF-GFCC-based ASR system and that the accuracy rate increases with increasing SNR value.

Table 2:

Performance Analysis in Different SNR (dB) Environment.

Training dataset Test dataset Feature extraction type WRr (%) at different SNRs
0 dB 5 dB 10 dB 15 dB 20 dB
Set1 Set1 MF-GFCC 45.84 54.96 65.25 70.45 72.42
GF-MFCC 45.86 55.15 66.05 71.36 72.60
Set2 MF-GFCC 41.96 52.23 62.05 67.65 70.96
GF-MFCC 42.56 53.26 63.86 69.36 71.56
Set3 MF-GFCC 43.26 54.04 63.55 68.96 71.04
GF-MFCC 44.56 55.10 64.40 70.40 72.40
Set2 Set1 MF-GFCC 48.85 59.05 68.85 74.04 76.65
GF-MFCC 48.20 59.10 70.20 75.20 76.26
Set2 MF-GFCC 47.60 58.15 67.20 72.02 75.26
GF-MFCC 48.96 57.86 68.36 71.36 75.20
Set3 MF-GFCC 51.20 59.96 70.45 75.32 77.80
GF-MFCC 52.86 59.40 69.40 76.40 75.56
Set3 Set1 MF-GFCC 44.50 53.74 64.66 69.45 71.23
GF-MFCC 47.56 56.20 67.86 69.86 71.56
Set2 MF-GFCC 42.65 53.10 63.40 68.25 70.12
GF-MFCC 46.20 55.36 65.10 70.10 72.20
Set3 MF-GFCC 45.99 56.75 66.65 71.36 76.16
GF-MFCC 48.96 58.96 74.40 74.40 78.10

5.3 Performance Analysis Using Refined and Integrated Feature Vectors

The integrated MF-GFCC and GF-MFCC feature vectors are refined using the optimization methods PSO, C-PSO, and Q-PSO. The accuracy of the implemented ASR system using these two refined features and HMM-GMM-based acoustic models is tested against all three test datasets. The results in Table 3 show that the GF-MFCC-based system performs better than the MF-GFCC system in almost all classifications. It can also be concluded from the Table 3 results that the Q-PSO-optimized integrated feature vectors outperform the C-PSO- and PSO-optimized feature vectors. The accuracy increase of the Q-PSO-optimized ASR system over the PSO- and C-PSO-optimized ASR systems across the different dataset combinations lies in the approximate range of 1%–3%.

Table 3:

WRr (%) Using Refined and Integrated Feature Vectors.

Training dataset Test dataset Feature extraction type Optimization methods
PSO C-PSO Q-PSO
Set1 Set1 MF-GFCC 72.10 73.20 73.96
GF-MFCC 74.86 75.40 74.20
Set2 MF-GFCC 75.20 75.96 76.56
GF-MFCC 75.56 76.10 76.86
Set3 MF-GFCC 76.40 77.56 77.96
GF-MFCC 77.10 77.86 78.56
Set2 Set1 MF-GFCC 75.56 76.20 76.86
GF-MFCC 77.86 78.40 77.96
Set2 MF-GFCC 73.40 74.10 78.56
GF-MFCC 75.20 75.86 76.40
Set3 MF-GFCC 73.56 74.40 74.96
GF-MFCC 74.40 75.20 75.86
Set3 Set1 MF-GFCC 72.10 72.96 73.56
GF-MFCC 75.56 76.10 76.56
Set2 MF-GFCC 74.86 75.40 75.96
GF-MFCC 76.20 77.20 77.86
Set3 MF-GFCC 76.40 78.56 79.20
GF-MFCC 78.96 79.96 80.40
Table 4:

Performance Analysis Based on Different Speaker Modes.

Training dataset Test dataset Feature extraction type WRr (%) using different speaker modes
(12 columns per row, in order: C-PSO + HMM — SD Clean, SD Noisy, SI Clean, SI Noisy, SA Clean, SA Noisy; then Q-PSO + HMM — SD Clean, SD Noisy, SI Clean, SI Noisy, SA Clean, SA Noisy)
Set1 Set1 MF-GFCC 73.20 71.56 70.40 68.86 72.40 70.10 73.96 71.56 70.56 69.10 72.96 70.20
GF-MFCC 75.40 73.20 73.86 71.40 75.16 73.20 74.20 72.40 71.40 69.56 73.86 71.40
Set2 MF-GFCC 75.96 73.56 71.20 69.86 73.10 71.86 76.56 74.10 72.10 71.96 74.70 72.56
GF-MFCC 76.10 74.40 72.56 70.20 74.40 72.40 76.86 74.20 73.86 71.56 75.40 73.20
Set3 MF-GFCC 77.56 75.96 74.86 73.70 75.10 73.86 77.96 75.96 74.40 72.20 76.10 74.10
GF-MFCC 77.86 75.20 73.40 71.86 69.20 67.56 78.56 75.56 75.10 73.96 77.56 75.96
Set2 Set1 MF-GFCC 76.20 74.40 72.20 71.40 70.10 68.40 76.86 73.86 72.56 71.56 74.86 73.56
GF-MFCC 78.40 75.56 75.86 73.10 71.40 69.20 78.96 76.20 74.10 72.40 76.40 74.40
Set2 MF-GFCC 74.10 71.86 71.20 70.86 69.56 67.86 75.56 73.56 71.40 70.20 72.56 70.86
GF-MFCC 75.86 72.40 71.40 70.20 68.20 66.56 76.40 74.96 72.20 70.86 74.10 72.20
Set3 MF-GFCC 74.40 72.96 70.20 68.86 67.40 65.40 74.96 72.40 70.86 68.10 72.00 70.56
GF-MFCC 75.20 73.10 71.40 70.10 68.10 66.96 75.86 73.20 71.40 69.56 73.20 71.96
Set3 Set1 MF-GFCC 72.96 70.86 68.56 67.40 65.56 63.20 73.56 71.10 70.10 68.20 72.40 70.20
GF-MFCC 76.10 73.20 72.86 70.20 69.86 67.56 76.56 74.56 72.56 70.96 74.56 72.10
Set2 MF-GFCC 75.40 72.10 72.20 71.86 69.20 67.96 75.96 73.96 71.20 69.40 73.86 71.40
GF-MFCC 77.20 75.56 74.40 72.10 70.56 69.40 77.96 75.20 74.96 72.10 76.20 74.10
Set3 MF-GFCC 78.56 76.20 75.56 73.40 71.96 70.86 79.20 77.56 73.56 71.86 75.10 73.40
GF-MFCC 79.96 77.40 76.96 74.20 72.20 70.56 80.40 78.10 75.20 73.40 77.20 75.86

5.4 Performance Analysis Based on Different Speaker Modes

On the basis of the speaker mode, ASR systems can be classified into three categories: speaker-dependent (SD), speaker-independent (SI), and speaker-adaptive (SA) systems. To perform accurately, SD ASR systems require prior training on the individual speaker’s voice, whereas SI ASR systems do not require such training, and SA ASR systems try to adapt their operation to the characteristics of new speakers. The performance of the developed systems is tested using all three types of speaker dataset in clean as well as noisy conditions. For the SA classification, the maximum likelihood linear regression adaptation approach is used. Table 4 describes the results for all possible combinations. It can be observed from the results that SD ASR systems have a higher WRr (%) than SI and SA ASR systems for all dataset combinations. The SA ASR system outperforms the SI ASR system for all Q-PSO-optimized dataset combinations. However, for the C-PSO-optimized dataset combinations, the SI ASR system outperforms the SA ASR system in some cases and vice versa in others.

6 Conclusion

In this paper, six sequential combinations of the feature extraction methods MFCC, GFCC, and PLP (taking two at a time) have been proposed. It has been concluded that the GF-MFCC- and MF-GFCC-based Hindi ASR systems perform better than the other proposed combinations. Further, the PSO, C-PSO, and Q-PSO techniques have been applied to refine the features of these two combinations. The results show that the Q-PSO-optimized GF-MFCC integration achieves significant improvement over all other optimized combinations. All the results have been obtained in both clean and noisy scenarios. The work can be extended further by using more robust feature extraction methods and more efficient optimization techniques in more difficult real-time scenarios.

Bibliography

[1] M. A. Abd El-Fattah, M. I. Dessouky, S. M. Diab and F. E. Abd El-samie, Adaptive Wiener filtering approach for speech enhancement, Ubiquitous Comput. Commun. J. 3 (2008), 1–8. doi:10.2528/PIERM08061206.

[2] A. Acero, Acoustical and Environmental Robustness in Automatic Speech Recognition, vol. 201, Springer Science & Business Media, New York, 2012.

[3] K. R. Aggarwal and M. Dave, Acoustic modeling problem for automatic speech recognition system: conventional methods (Part I), Int. J. Speech Technol. 14 (2011), 297–308. doi:10.1007/s10772-011-9108-2.

[4] K. R. Aggarwal and M. Dave, Filterbank optimization for robust ASR using GA and PSO, Int. J. Speech Technol. 15 (2012), 191–201. doi:10.1007/s10772-012-9133-9.

[5] K. R. Aggarwal and M. Dave, Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system, Telecommun. Syst. 52 (2013), 1457–1466. doi:10.1007/s11235-011-9623-0.

[6] M. J. Baker, L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan and D. O’Shaughnessy, Developments and directions in speech recognition and understanding, Part 1 [DSP Education], IEEE Signal Process. Mag. 26 (2009), 75–80. doi:10.1109/MSP.2009.932166.

[7] W. Burgos, Gammatone and MFCC Features in Speaker Recognition, Dissertation, 2014.

[8] P. H. Combrinck and E. C. Botha, On the Mel-Scaled Cepstrum, Department of Electrical and Electronic Engineering, University of Pretoria, Hatfield, South Africa, 1996.

[9] S. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process. 28 (1980), 357–366. doi:10.1016/B978-0-08-051584-7.50010-3.

[10] M. Dua, R. K. Aggarwal and M. Biswas, Performance evaluation of Hindi speech recognition system using optimized filterbanks, Eng. Sci. Technol. 21 (2018), 389–398. doi:10.1016/j.jestch.2018.04.005.

[11] M. Dua, R. K. Aggarwal and M. Biswas, Discriminative training using noise robust integrated features and refined HMM modeling, J. Intell. Syst. 29 (2020), 327–344. doi:10.1515/jisys-2017-0618.

[12] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 2013.

[13] Z.-F. Hao, Z.-G. Wang and H. Huang, A particle swarm optimization algorithm with crossover operator, in: 2007 International Conference on Machine Learning and Cybernetics, vol. 2, IEEE, Hong Kong, China, 2007. doi:10.1109/ICMLC.2007.4370295.

[14] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Am. 87 (1990), 1738–1752. doi:10.1121/1.399423.

[15] H. Hermansky and S. Sharma, Temporal patterns (TRAPS) in ASR of noisy speech, in: Proceedings of 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, IEEE, Phoenix, AZ, USA, 1999. doi:10.1109/ICASSP.1999.758119.

[16] K. Kirchhoff, Combining articulatory and acoustic information for speech recognition in noisy and reverberant environments, in: Fifth International Conference on Spoken Language Processing, Sydney, Australia, 1998. doi:10.21437/ICSLP.1998-313.

[17] N. Kumar and A. G. Andreou, Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, Speech Commun. 26 (1998), 283–297. doi:10.1016/S0167-6393(98)00061-2.

[18] S. Kwong, C.-W. Chau and W. A. Halang, Genetic algorithm for optimizing the nonlinear time alignment of automatic speech recognition systems, IEEE Trans. Indust. Electron. 43 (1996), 559–566. doi:10.1109/41.538613.

[19] S. Kwong, C. W. Chau, K. F. Man and K. S. Tang, Optimisation of HMM topology and its model parameters by genetic algorithms, Pattern Recogn. 34 (2001), 509–522. doi:10.1016/S0031-3203(99)00226-5.

[20] J. Li, L. Deng, Y. Gong and R. Haeb-Umbach, An overview of noise-robust automatic speech recognition, IEEE/ACM Trans. Audio Speech Lang. Process. 22 (2014), 745–777. doi:10.1109/TASLP.2014.2304637.

[21] T. Mittal and R. K. Sharma, Speech recognition using ANN and predator-influenced civilized swarm optimization algorithm, Turk. J. Elect. Eng. Comput. Sci. 24 (2016), 4790–4803. doi:10.3906/elk-1412-193.

[22] N. Najkar, F. Razzazi and H. Sameti, A novel approach to HMM-based speech recognition systems using particle swarm optimization, Math. Comput. Modell. 52 (2010), 1910–1920. doi:10.1109/BICTA.2009.5338098.

[23] M. Pant, R. Thangaraj and A. Abraham, A new PSO algorithm with crossover operator for global optimization problems, in: Innovations in Hybrid Intelligent Systems, pp. 215–222, Springer, Berlin, 2007. doi:10.1007/978-3-540-74972-1_29.

[24] R. L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.

[25] A. D. Reynolds, Experimental evaluation of features for robust speaker identification, IEEE Trans. Speech Audio Process. 2 (1994), 639–643. doi:10.1109/89.326623.

[26] K. Samudravijaya, P. V. S. Rao and S. S. Agrawal, Hindi speech database, in: International Conference on Spoken Language Processing, Beijing, China, pp. 456–464, 2002.

[27] G. Saon and J.-T. Chien, Large-vocabulary continuous speech recognition systems: a look at some recent advances, IEEE Signal Process. Mag. 29 (2012), 18–33. doi:10.1109/MSP.2012.2197156.

[28] R. Schluter and H. Ney, Using phase spectrum information for improved speech recognition performance, in: Proceedings 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’01), vol. 1, IEEE, Salt Lake City, UT, USA, 2001.

[29] R. Schluter, I. Bezrukov, H. Wagner and H. Ney, Gammatone features and feature combination for large vocabulary speech recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), vol. 4, IEEE, Honolulu, HI, USA, 2007. doi:10.1109/ICASSP.2007.366996.

[30] A. Sharma, M. C. Shrotriya, O. Farooq and Z. A. Abbasi, Hybrid wavelet based LPC features for Hindi speech recognition, Int. J. Inform. Commun. Technol. 1 (2008), 373–381. doi:10.1504/IJICT.2008.024008.

[31] H. Tolba, S.-A. Selouani and D. O’Shaughnessy, Auditory-based acoustic distinctive features and spectral cues for automatic speech recognition using a multi-stream paradigm, in: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, IEEE, Orlando, FL, USA, 2002. doi:10.1109/ICASSP.2002.1005870.

[32] A. Varga and H. J. Steeneken, Assessment for automatic speech recognition, II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems, Speech Commun. 12 (1993), 247–251. doi:10.1016/0167-6393(93)90095-3.

[33] F. Yang, C. Zhang and T. Sun, Comparison of particle swarm optimization and genetic algorithm for HMM training, in: 19th International Conference on Pattern Recognition, 2008 (ICPR 2008), IEEE, Tampa, FL, USA, 2008. doi:10.1109/ICPR.2008.4761282.

[34] A. Zolnay, R. Schlüter and H. Ney, Robust speech recognition using a voiced-unvoiced feature, in: Seventh International Conference on Spoken Language Processing, Denver, Colorado, USA, 2002. doi:10.21437/ICSLP.2002-38.

[35] A. Zolnay, R. Schluter and H. Ney, Acoustic feature combination for robust speech recognition, in: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing 2005 (ICASSP’05), vol. 1, IEEE, Philadelphia, PA, USA, 2005.

Received: 2018-01-25
Accepted: 2018-09-02
Published Online: 2018-10-01

©2020 Walter de Gruyter GmbH, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 Public License.
