Article

Speech Enhancement for Secure Communication Using Coupled Spectral Subtraction and Wiener Filter

1
Research Center for Informatics, Indonesian Institute of Sciences, Jawa Barat 40135, Indonesia
2
Department of Electrical Engineering, University of Indonesia, Jawa Barat 16424, Indonesia
*
Author to whom correspondence should be addressed.
Electronics 2019, 8(8), 897; https://doi.org/10.3390/electronics8080897
Submission received: 28 June 2019 / Revised: 5 August 2019 / Accepted: 13 August 2019 / Published: 14 August 2019
(This article belongs to the Section Computer Science & Engineering)
Figure 1. The block diagram of the proposed method.
Figure 2. Comparisons of actual SNR with estimated SNR.
Figure 3. The process of creating the data for the experiments. IMA, Interactive Multimedia Association; M, Microsoft.
Figure 4. The performance of the proposed method (perceptual evaluation of speech quality (PESQ)) when the adaptation speed $c$ was varied for various communication channels: (a) GSM, (b) IMA-ADPCM, (c) PCM, and (d) Microsoft-ADPCM. The results are the average PESQ scores over all 20 utterances.
Figure 5. The performance of the proposed method (frequency-weighted segmental SNR (FwSNR)) when the adaptation speed $c$ is varied for various communication channels: (a) GSM, (b) IMA-ADPCM, (c) PCM, and (d) Microsoft-ADPCM. The results are the average FwSNR over all 20 utterances.
Figure 6. Comparisons of the resulting noise estimation spectra of the proposed method and several noise estimators: Martin, Hirsch, and IMCRA. The figures show the magnitude spectra (in dB) of a frequency bin (k = 10) on (a) IMA-ADPCM and (b) M-ADPCM. For the proposed method, we used $c = 0.8$.
Figure 7. The spectrogram of female speakers: “hai selamat pagi apa kabar” of (a) original speech, (b) the decrypted speech from I-ADPCM, the enhanced speech from (c) the Wiener filter, (d) KLT, (e) LogMMSE, and (f) NMF, and (g) the proposed method.
Figure 8. The spectrogram of female speakers: “acara nontonnya jadi kan” of (a) original speech, (b) the decrypted speech from GSM, the enhanced speech from (c) the Wiener filter, (d) KLT, (e) LogMMSE, and (f) NMF, and (g) the proposed method.
Figure 9. The spectrogram of male speakers: “apakah kamu sudah makan siang.” of (a) the original speech, (b) the decrypted speech from PCM and the enhanced speech from (c) SS + IMCRA, (d) SS + Martin, and (e) SS + Hirsch, and (f) the proposed method.

Abstract

The encryption process for secure voice communication may degrade the speech quality when it is applied to the speech signals before encoding them through a conventional communication system such as GSM or radio trunking. This is because the encryption process usually includes a randomization of the speech signals, and hence, when the speech is decrypted, it may perceptibly be distorted, so satisfactory speech quality for communication is not achieved. To deal with this, we could apply a speech enhancement method to improve the quality of decrypted speech. However, many speech enhancement methods work by assuming noise is present all the time, so the voice activity detector (VAD) is applied to detect the non-speech period to update the noise estimate. Unfortunately, this assumption is not valid for the decrypted speech. Since the encryption process is applied only when speech is detected, distortions from the secure communication system are characteristically different. They exist when speech is present. Therefore, a noise estimator that is able to update noise even when speech is present is needed. However, most noise estimator techniques only adapt to slow changes of noise to avoid over-estimation of noise, making them unsuitable for this task. In this paper, we propose a speech enhancement technique to improve the quality of speech from secure communication. We use a combination of the Wiener filter and spectral subtraction for the noise estimator, so our method is better at tracking fast changes of noise without over-estimating them. Our experimental results on various communication channels indicate that our method is better than other popular noise estimators and speech enhancement methods.

1. Introduction

Secure voice communication is necessary for many applications. Transmitted speech may contain sensitive and personal information about the speakers that needs to be kept confidential. For military purposes, this becomes even more important since the information may affect national security. The Global System for Mobile Communications (GSM) is widely used for voice communication in many countries, even for military purposes. However, while GSM ensures confidentiality on the radio access channel, security is not guaranteed when speech is transmitted across the switch network using standard modulation such as pulse-code modulation (PCM) and adaptive differential pulse-code modulation (ADPCM) [1]. To ensure end-to-end confidentiality, the speech signals are encrypted before entering the communication system. However, the encryption process has several drawbacks. First, the process may delay the transmission. Second, the decrypted speech may be distorted since the encryption process usually includes a randomization of the speech signals, and as a result, the speech signals may fail to reach sufficient quality.
This study aims to improve the quality of the decrypted speech to a satisfactory level. We focus on the implementation of speech enhancement (SE) for this purpose. SE is an active area of research that aims to remove noise from speech and hence improve the quality of the speech signals. This is crucial in many speech-based applications such as speech recognition [2,3], hearing aid devices [4,5], mobile communications [6], and hands-free devices [7,8].
A plethora of algorithms has been proposed in this field. With the increasing trend of deep learning technologies, many current studies model the non-linear relations between speech and noise using various deep learning networks [9,10,11,12,13]. One of the first approaches was the denoising autoencoder, in which the network learns to map noise-corrupted speech back to the clean speech [9]. Later, there was increasing interest in using the generative adversarial network (GAN) to reconstruct the clean signals from the noisy ones [12]. However, deep learning methods require heavy computational loads for training, and the resulting models require a large amount of storage, which may make them unsuitable for deployment on mobile devices or other devices with limited storage, such as the hand-held transceivers (HT) in which the system is planned to be embedded.
For real-time and fast SE that requires minimum storage, signal processing-based methods are preferred. Earlier signal processing-based methods were derived from the assumption of linear relations between noise and speech in the time domain. Although the relation is assumed to be linear in the time domain, many techniques prefer to remove noise in the frequency domain by transforming the noisy speech, i.e., speech that is contaminated with noise, into the spectral domain [14], since the energy of speech and noise is more distinguishable in that domain. One of the simplest and arguably the most popular methods is spectral subtraction (SS) [15]. This method simply subtracts the noise estimate from the noisy speech. Noise is usually estimated using early frames of speech, assuming speech is not present during those periods. Later, spectral subtraction was also used for other purposes such as source separation [16] and dereverberation [17].
However, SS requires a good noise estimate; otherwise, artifacts may be left in the enhanced speech that cause unwanted and disturbing sounds when it is transformed back into the time domain. This is called musical noise [18]. Numerous methods have been proposed to reduce musical noise. In [18], over-subtraction and flooring factors were proposed for SS. The over-subtraction and flooring factors, which are determined heuristically, are set so that the gaps between the amplitudes of those artifacts and the actual speech are minimized, hence reducing the loudness of musical noise. Initially, these factors were set the same for all frequency bands. However, noise in the real world is non-stationary, and it may have different energies in different parts of the spectrum. Some types of noise, such as babble noise, may have more energy at low frequencies, while others, such as the sound of machinery or fans, may have more energy at high frequencies. In [19], a non-linear spectral subtraction method was proposed. The method works by computing the over-subtraction factor as a function of SNR for each frequency. Other studies also showed that the SNRs change more extremely when a narrower band is used, and they tend to change more slowly when a wider band is used [20]. This property was utilized in [21,22] by proposing multi-band spectral subtraction (MBSS). In their method, the speech spectra were grouped into several wider frequency bands, and spectral subtraction was performed separately for each band.
Other SE methods are derived using statistical properties of speech and noise signals. Usually, Gaussian assumptions are used to model the speech for mathematical convenience. In [23], the Wiener filter (WF) was used for speech enhancement. WF is the optimal solution, in the minimum-mean-squared-error (MMSE) sense, when we assume speech and noise spectra to be Gaussian and uncorrelated. Various variants of WF-based SE were proposed in later studies [24,25,26]. Meanwhile, the SE method proposed in [26] is derived from the assumption that the speech and noise spectral amplitudes are Gaussian distributed. While SS is derived intuitively, we could say that it is the optimal estimator of the variance, or power spectrum, in the maximum likelihood sense, when we assume that the noisy speech is Gaussian distributed. Recently, there have been several efforts to use non-negative matrix factorization (NMF) to factorize noisy speech into noise and speech components for speech enhancement [27,28]. NMF builds a noise model during non-speech periods assuming it follows a certain distribution, such as Gaussian.
Studies have indicated that the Gaussian assumptions are generally not valid when short time windows are used for processing the speech signals. Non-Gaussian distributions are often used instead, and SE methods are derived accordingly. In [29], the MMSE estimator was derived by assuming the log spectral amplitude of speech to be Gaussian, hence assuming a non-Gaussian distribution for the spectral amplitude. Other studies used Laplace [30,31] and Gamma distributions [32,33,34]. In [35,36], a variant of spectral subtraction, called q-spectral subtraction, was derived by assuming that noisy speech spectra were distributed as q-Gaussian.
Another common assumption for SE is that noise is present all the time. Noise is usually estimated during the non-speech period using a voice activity detector (VAD). However, VAD usually fails when the SNR is low or when noise is non-stationary. Noise estimators are used instead of VAD to provide estimations regardless of speech being present or not. Many noise estimators work by tracking the minimum of the noisy speech spectra [37,38]. Unfortunately, these methods usually work for slowly-changing noise. Other methods exploit the fact that noise usually does not change very rapidly, so noise is estimated by averaging it from previous frames [39,40,41,42]. The methods apply the adaptation speed weight factor to control how much the current frames affect the noise estimate. Usually, the weight is set to be close to zero to minimize over-subtraction. However, as a consequence, the methods were unable to track the sudden change of noise.
In secure communication, the noise energy is usually low during the non-speech period, while it changes drastically when speech is present. This is because the encryption process usually detects the speech presence and applies randomization on the detected speech parts only. While the sudden change could be tracked by giving larger values for adaptation speed, the enhanced speech is very likely to be over-subtracted, leaving unwanted artifacts. Therefore, the noise estimator needs not only to be able to track the sudden change of noise, but also avoid over-subtraction. In this paper, we propose a noise estimator that combines the Wiener filter and spectral subtraction. By doing so, we could control the amount of the noise energy of current frames using the Wiener filter and the average noise estimate from preceding frames using spectral subtraction to minimize over-subtraction.
The remainder of this paper is organized as follows. In Section 2, we briefly explain the spectral subtraction and Wiener filter. We describe our proposed method in Section 3. In Section 4, we describe the experimental setup, and the results are explained and discussed in Section 5. The paper is concluded in Section 6.

2. Speech Enhancement Methods

Spectral subtraction (SS) is one of the earliest and arguably the most popular SE methods [15]. SS works with the assumption that the noise is additive and stationary. Let us denote $y(t)$ as the noisy speech, i.e., the clean speech $x(t)$ contaminated with additive noise $d(t)$ at time $t$. Therefore, in the time domain, their relation is:
$$y(t) = x(t) + d(t). \tag{1}$$
By taking the discrete Fourier transform (DFT) of (1) and the power magnitude, we obtain their relation in the spectral domain, assuming that speech and noise are uncorrelated, as follows:
$$|Y(m,k)|^2 = |X(m,k)|^2 + |D(m,k)|^2, \tag{2}$$
where $m$ and $k$ denote the frame and frequency indices, respectively. Assuming that we can obtain an estimate of $|D(m,k)|^2$, denoted $|\hat{D}(m,k)|^2$, the spectral subtraction is formulated as:
$$|\hat{X}(m,k)|^2 = |Y(m,k)|^2 - |\hat{D}(m,k)|^2. \tag{3}$$
The performance of SS heavily depends on the accuracy of the noise estimate. However, this is not an easy task for several reasons. First, speech and noise are correlated when we process them with a limited window length. There exist cross-terms between them, and therefore, even when noise could be estimated perfectly, the exact clean signal could not be extracted [43,44]. Second, noise is usually non-stationary, and hence, obtaining a good estimator is very difficult. The inaccuracy of noise estimation leaves artifacts that produce unwanted and disturbing sound, called musical noise, in the time domain. To minimize this, over-subtraction and flooring factors are introduced for spectral subtraction [18]:
$$|\hat{X}(m,k)|^2 = \begin{cases} |Y(m,k)|^2 - \alpha\,|D(m,k)|^2 & \text{if } |Y(m,k)|^2 > (\alpha + \beta)\,|D(m,k)|^2 \\ \beta\,|Y(m,k)|^2 & \text{otherwise}, \end{cases} \tag{4}$$
where $\alpha$ and $\beta$ are the over-subtraction and flooring factors, respectively.
The parameter $\alpha$ is usually chosen to be higher than one to reduce the energy of the artifacts and hence reduce their loudness. Meanwhile, $\beta$ is set between zero and one to reduce the gap between the remaining peaks of the artifacts and the spectral valleys of the speech signals. There is a trade-off in determining the values of $\alpha$ and $\beta$. If $\alpha$ is too large, the intelligibility of the speech signals may suffer, but if it is too small, much of the noise may still remain. On the other hand, if $\beta$ is too large, more background noise would be present, but if it is too small, musical noise would be more audible. Many studies have proposed a solution to finding good values for $\alpha$ and $\beta$. They are determined heuristically based on the estimate of the signal-to-noise ratio (SNR) [19,21,22]. In [35,36], a variant of spectral subtraction was derived by assuming noisy speech to be non-Gaussian. The method, called q-spectral subtraction (q-SS), explains the non-linear relations between speech and noise and derives over-subtraction factors analytically.
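As a minimal sketch of the rule with over-subtraction and flooring factors, the following Python function processes the power spectrum of a single frame. The function name `spectral_subtract` and the values $\alpha = 2.0$ and $\beta = 0.01$ are illustrative assumptions, not settings taken from this paper.

```python
def spectral_subtract(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Apply over-subtraction (alpha) and flooring (beta) to one frame.

    noisy_power and noise_power are lists of per-bin power values,
    |Y(m,k)|^2 and the noise power, for a single frame m.
    alpha and beta are illustrative defaults (assumptions).
    """
    enhanced = []
    for y2, d2 in zip(noisy_power, noise_power):
        if y2 > (alpha + beta) * d2:
            # Enough energy above the noise: subtract the scaled noise.
            enhanced.append(y2 - alpha * d2)
        else:
            # Otherwise keep a small fraction of the noisy power (floor).
            enhanced.append(beta * y2)
    return enhanced


# Bin 0 is well above the noise level, bin 1 is below it.
frame = spectral_subtract([10.0, 1.0], [2.0, 2.0], alpha=2.0, beta=0.01)
print(frame)
```

In this example, the first bin is reduced by $\alpha$ times the noise power, while the second bin falls back to the floor, which is the mechanism that masks isolated residual peaks.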
The Wiener filter (WF) was first introduced by Norbert Wiener in the 1940s as a solution to find the optimum estimation of signals from a noisy environment. If the noisy signal $y(t)$ consists of the clean signal $x(t)$ and the noise $d(t)$, and the noise and the signal are uncorrelated, then the estimate of the clean signal can be obtained using linear combinations of the data $y(t)$ such that the mean squared error is minimized.
Many studies used WF for SE [23,24,25,26]. The transfer function of the Wiener filter is computed in the frequency domain as follows:
$$H(m,k) = \frac{|X(m,k)|^2}{|X(m,k)|^2 + |\hat{D}(m,k)|^2}. \tag{5}$$
However, as indicated by (5), WF requires knowledge of the clean speech. It is also non-causal, making it unrealizable. To overcome this, the transfer function is computed in an iterative manner, where the enhanced speech is estimated using the estimated transfer function from the preceding frame as follows:
$$|\hat{X}(m,k)|^2 = H(m-1,k)\,|Y(m,k)|^2. \tag{6}$$
However, a noise estimator or a VAD is still needed to estimate the noise spectra. To overcome this, we could modify the Wiener filter to estimate noise instead of the enhanced speech [45]. Now, the transfer function of the Wiener filter can be formulated as follows:
$$H(m,k) = \frac{|\hat{D}(m-1,k)|^2}{|\hat{X}(m-1,k)|^2 + |\hat{D}(m-1,k)|^2}. \tag{7}$$
By doing so, we compute the Wiener filter from the enhanced speech instead of the noisy speech, and a VAD is not needed.
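The noise-oriented Wiener gain can be illustrated with a short sketch. It assumes that the gain is applied to the noisy power spectrum in the same way as the speech-oriented gain is applied in the iterative form; the function names and the small constant `eps` that guards against division by zero are hypothetical additions.

```python
def wiener_noise_gain(prev_speech_power, prev_noise_power, eps=1e-12):
    """Per-bin gain: noise power over (speech power + noise power).

    Both inputs are estimates from the preceding frame. eps is an
    assumed safeguard against division by zero.
    """
    return [d2 / (x2 + d2 + eps)
            for x2, d2 in zip(prev_speech_power, prev_noise_power)]


def estimate_noise_power(prev_speech_power, prev_noise_power, noisy_power):
    """Scale the current noisy power by the noise-oriented gain."""
    gain = wiener_noise_gain(prev_speech_power, prev_noise_power)
    return [h * y2 for h, y2 in zip(gain, noisy_power)]


# Bin 0: speech dominated the previous frame, so little is assigned to noise.
# Bin 1: no speech in the previous frame, so almost everything is noise.
noise_est = estimate_noise_power([3.0, 0.0], [1.0, 1.0], [8.0, 2.0])
print(noise_est)
```

The gain approaches one when the preceding frame contained little speech and shrinks as the speech estimate grows, which is why no separate speech/non-speech decision is required.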

3. The Proposed Method

Figure 1 illustrates the block diagram of our proposed method. The proposed method works as follows. The noisy speech $y(t)$ is framed, windowed, and then transformed into the frequency domain by taking the discrete Fourier transform (DFT). After that, we take the magnitude spectra $|Y(m,k)|$. The magnitude spectra of the noisy speech are fed into the proposed noise estimator, and simple subtraction with the following formula is applied to obtain the enhanced speech:
$$|\hat{X}(m,k)| = \begin{cases} |Y(m,k)| - |\hat{D}(m,k)| & \text{if } |Y(m,k)| > |\hat{D}(m,k)| \\ \beta\,|Y(m,k)| & \text{otherwise}. \end{cases} \tag{8}$$
In this paper, $\beta$ was set to 0.002.
For the noise estimator, we combined the Wiener filter and q-spectral subtraction [35,36]. For the Wiener filter, we modify (7) and compute its transfer function as follows:
$$H(m,k) = \left( \frac{|\tilde{D}(m-1,k)|^2}{|\tilde{X}(m-1,k)|^2 + |\tilde{D}(m-1,k)|^2} \right)^{0.5}, \tag{9}$$
where $|\tilde{D}(m-1,k)|$ is the estimate of the noise spectra. It is obtained by averaging the estimate of noise at the current frame with its preceding estimates of noise. The decision-directed formula [26] is modified with the following rule:
$$|\tilde{D}(m,k)| = c\,|\tilde{D}(m-1,k)| + (1 - c)\,|\hat{D}_n(m,k)|, \tag{10}$$
where $c$ is the adaptation speed factor, which is chosen between zero and one. Here, instead of updating the noise spectra with the current noisy spectra $|Y(m,k)| - |\hat{D}(m,k)|$ as in [26], we updated them with the noise estimate from the Wiener filter (see (16)). Therefore, we could control the contribution of the slow changes of noise in $|\tilde{D}(m-1,k)|$, which was determined from the average estimate of clean speech from preceding frames, and of the faster changes in $|\hat{D}_n(m,k)|$, which was computed by the Wiener filter. By doing so, we would be able to track the changes of noise and minimize the risk of over-subtraction as well. In this paper, we evaluated the optimal $c$ for the performance of our method empirically by varying it from 0.5 to 0.95.
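To see how the adaptation speed factor controls tracking, the sketch below applies the recursive averaging rule to a sudden jump in the per-frame noise estimate. The function names and the unit-step input are illustrative assumptions; only the update rule itself comes from the text.

```python
def smooth_noise(prev_avg, wiener_estimate, c):
    """One recursive-averaging step: c weights the past, (1 - c) the present."""
    return c * prev_avg + (1.0 - c) * wiener_estimate


def track(estimates, c, start=0.0):
    """Run the recursion over a sequence of per-frame noise estimates."""
    avg = start
    history = []
    for value in estimates:
        avg = smooth_noise(avg, value, c)
        history.append(avg)
    return history


# The noise estimate jumps from 0 to 1 and stays there for five frames.
jump = [1.0] * 5
fast = track(jump, c=0.5)    # follows the jump quickly
slow = track(jump, c=0.95)   # lags far behind
print(fast[-1], slow[-1])
```

After five frames, the average with $c = 0.5$ has reached $1 - 0.5^5 \approx 0.97$ of the new level, whereas with $c = 0.95$ it has reached only $1 - 0.95^5 \approx 0.23$, which illustrates why a smaller $c$ is needed when the distortion changes quickly.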
Meanwhile, $|\tilde{X}(m-1,k)|$ of (9) is the average estimate of clean speech of the preceding frames. It is calculated using q-spectral subtraction (q-SS) with the following rule [36]:
$$|\tilde{X}(m,k)| = \frac{2(2 - q)}{3 - q}\,|Y(m,k)| - |\tilde{D}(m,k)|, \tag{11}$$
where the parameter $q$ is estimated based on the a priori SNR, $\Psi(m,k)$, as reported in [36], with the following relation:
$$q = \begin{cases} 1 & \text{if } \Psi(m,k) \ge 20\ \text{dB} \\ -0.038\,\Psi(m,k) + 1.8 & \text{if } 5\ \text{dB} \le \Psi(m,k) < 20\ \text{dB} \\ 1.88 & \text{if } \Psi(m,k) < 5\ \text{dB}. \end{cases} \tag{12}$$
The parameter $\Psi(m,k)$ is the clean speech-to-noise ratio (in dB), i.e., $\Psi(m,k) = 10 \log \psi(m,k)$. Since $\psi(m,k)$ cannot be directly estimated from the noisy speech, we estimated it using maximum likelihood, assuming the noise variance was known. It could be estimated from the posterior SNR estimations. However, to reduce the computations, we estimated $\psi(m,k)$ using the formula as reported in [26]:
$$\psi(m,k) = \max\left( \tilde{\phi}(m,k),\ 0 \right), \tag{13}$$
where:
$$\tilde{\phi}(m,k) = a\,\tilde{\phi}(m-1,k) + (1 - a)\,\frac{\phi(m,k)}{b}. \tag{14}$$
The notation $\phi(m,k)$ is the noisy signal-to-noise ratio, i.e.:
$$\phi(m,k) = \frac{|Y(m,k)|}{|\tilde{D}(m,k)|}. \tag{15}$$
The parameters $a$ and $b$ were set to 0.725 and 2, respectively, in this paper, as reported in [29].
The computation for calculating the posterior SNR was simple and fairly fast. We evaluated the performance of these formulations by comparing their estimations with the actual SNR. The actual SNR could be calculated since we had access to the clean speech. The results are illustrated in Figure 2. As we can see, the SNR estimator did not perform very well. This was as expected, as the distortions were highly non-stationary.
We opted to use q-SS instead of the conventional SS in this study. As was reported in [36], q-SS has more attenuation than SS. This means the flooring process would occur at higher SNR for q-SS than for SS. This is especially beneficial when over-subtraction occurs. Since noise is more dominant at low SNR and also more difficult to estimate, having the spectra floored at higher SNR conditions would minimize the loss of information.
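The stronger attenuation of q-SS can be checked numerically. The sketch below assumes that the q-SS rule scales the noisy magnitude by $2(2-q)/(3-q)$ before subtracting the averaged noise estimate; the function names are hypothetical, and the magnitudes are made-up values chosen only for illustration.

```python
def qss_scale(q):
    """Scaling applied to the noisy magnitude, assumed to be 2(2 - q)/(3 - q)."""
    return 2.0 * (2.0 - q) / (3.0 - q)


def qss_estimate(noisy_mag, noise_avg, q):
    """Scaled noisy magnitude minus the averaged noise estimate."""
    return qss_scale(q) * noisy_mag - noise_avg


# q = 1 reduces to conventional subtraction; q = 1.88 attenuates strongly.
plain = qss_estimate(2.0, 1.0, q=1.0)
strong = qss_estimate(2.0, 1.0, q=1.88)
print(qss_scale(1.0), qss_scale(1.88), plain, strong)
```

With $q = 1$ the scale is exactly one, so the result matches conventional subtraction. With $q = 1.88$ the scale drops to roughly 0.21, so the same bin already yields a negative difference and would be floored, even though the noisy magnitude is twice the noise estimate.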
Then, we can obtain the noise estimate $|\hat{D}(m,k)|$ using the Wiener filter:
$$|\hat{D}(m,k)| = H(m,k)\,|Y(m,k)|. \tag{16}$$
The estimate found in (16) is then used in (8) to find the enhanced speech $|\hat{X}(m,k)|$. For the first frame, we assumed $|\tilde{D}(m-1,k)|$ to be equal to $|\hat{D}(m,k)|$, and they were estimated using the average spectra of the first five frames of an utterance. After the enhancement process, the enhanced speech was transformed back to the time domain by applying the inverse discrete Fourier transform (IDFT) and overlap and add (OLAP).
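Putting the steps of this section together, one possible per-frame loop is sketched below. It is a simplified reading rather than the authors' implementation: $q$ is held fixed instead of being derived from the a priori SNR, the average speech estimate is initialized to zero, and `eps` guards against division by zero. These three choices, along with all function and variable names, are assumptions. The defaults $c = 0.8$ and $\beta = 0.002$ follow the values reported in the paper.

```python
import math


def enhance_frames(frames, c=0.8, beta=0.002, q=1.0, init_frames=5, eps=1e-12):
    """Enhance a list of magnitude-spectrum frames (each a list of bins).

    Returns (enhanced magnitudes, per-frame noise estimates).
    q is fixed here for simplicity (assumption).
    """
    n_bins = len(frames[0])
    n_init = min(init_frames, len(frames))
    # Initial averaged noise: mean magnitude of the first few frames.
    noise_avg = [sum(f[k] for f in frames[:n_init]) / n_init
                 for k in range(n_bins)]
    speech_avg = [0.0] * n_bins  # assumed initial speech estimate
    scale = 2.0 * (2.0 - q) / (3.0 - q)

    enhanced, noise_track = [], []
    for y in frames:
        out, est = [], []
        for k in range(n_bins):
            d2 = noise_avg[k] ** 2
            x2 = speech_avg[k] ** 2
            # Wiener gain from the averaged estimates of the previous frame.
            h = math.sqrt(d2 / (x2 + d2 + eps))
            d_hat = h * y[k]  # noise estimate for the current frame
            # Recursive averaging of the noise estimate.
            noise_avg[k] = c * noise_avg[k] + (1.0 - c) * d_hat
            # q-SS style average speech estimate for the next frame.
            speech_avg[k] = scale * y[k] - noise_avg[k]
            # Simple subtraction with flooring.
            out.append(y[k] - d_hat if y[k] > d_hat else beta * y[k])
            est.append(d_hat)
        enhanced.append(out)
        noise_track.append(est)
    return enhanced, noise_track


# One frequency bin: five quiet frames followed by three louder frames.
frames = [[1.0]] * 5 + [[4.0]] * 3
enhanced, noise_track = enhance_frames(frames)
print(enhanced[-1], noise_track[-1])
```

Because the gain never exceeds one, the noise estimate in each bin cannot exceed the noisy magnitude, so the enhanced magnitude stays non-negative before the inverse transform.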

4. Experimental Setup

For the evaluation data, we selected a subset from the Tokyo Institute of Technology Multilingual Speech Corpus-Indonesian (TITML-IDN) [46]. It is an Indonesian phonetically-balanced speech corpus, which consists of recordings of 20 Indonesian speakers (10 male and 10 female). Each speaker was asked to read 343 phonetically-balanced sentences. From this dataset, we selected two speakers, one female and one male, and 10 utterances for each speaker. Therefore, there was a total of 20 sentences. The phonetic transcription of the spoken utterances used for these experiments is shown in Table 1.
The process of creating the experimental data is shown in Figure 3. All selected utterances were passed through the encryptor before being transmitted using communication devices and encoded using various communication channels. In this study, we used four types of communication channels: GSM, Interactive Multimedia Association (IMA)-ADPCM (denoted as I-ADPCM), Microsoft ADPCM (denoted as M-ADPCM), and PCM. After that, the speech was received by another communication device, and then, the decryption process was applied to obtain the decoded speech. The decoded speech was then fed through the speech enhancement method.
The encryption utilized multicircular permutation to randomize the voice signals. Encryption and decryption were executed utilizing a digital dynamic complex permutation module rotated by a set of expanded keys. At the sender or transmitter, the voice was encrypted using permutation multicircular shrinking. At the receiver, the encrypted speech was decrypted with permutation multicircular expanding. The direction of the shift in permutations, both shrinking and expanding, was determined by the value of the expanded key. More details about the encryption process can be found in [47].
We evaluated our method using the perceptual evaluation of speech quality (PESQ) [48] and frequency-weighted segmental SNR (FwSNR) [49]. Both metrics were chosen since they correlate strongly with subjective quality measures [50].
Our noise estimator was compared with other popular methods based on minimum tracking and recursive averaging. Three noise estimation methods were selected: Martin [37], Hirsch [51], and improved minimum controlled recursive averaging (IMCRA) [52]. We also provided the results of SS with these noise estimators. In addition, we compared our method with several SE methods: SS [18], the Wiener filter (WF) [25], Karhunen–Loève transform-based speech enhancement (KLT) [53], NMF [27], and Log minimum-mean-squared-error (LogMMSE) [29]. For SS, WF, KLT, and LogMMSE, we used the code published in [54]. The implementations of IMCRA, Martin, and Hirsch were also taken from the same source. The NMF implementation was taken from [55].

5. Results and Discussions

In this section, we present evaluations of the proposed method. First, we evaluate the effect of various values of the adaptation speed $c$ on the performance of the method. Second, we evaluate the ability of our method to track the noise from decrypted speech and compare it with other noise estimators: Martin, Hirsch, and IMCRA. Lastly, we compare our method with several speech enhancement methods and noise estimators.

5.1. The Effect of Adaptation Speed

Figure 4 and Figure 5 show the performance of the proposed method when we varied the adaptation speed factor, $c$. The performance shown is the average over all utterances. It can clearly be seen that the performance of the method is affected by $c$. For PESQ, the best performance was obtained when $c = 0.75$ for PCM, I-ADPCM, and M-ADPCM, while the results indicated that lower $c$ was required for GSM. Meanwhile, the highest FwSNR was achieved when we used $c = 0.85$ for PCM, I-ADPCM, and M-ADPCM, while $c = 0.75$ was the best for GSM. As indicated by both metrics, lower $c$ was required for GSM to achieve better performance. This was because GSM produced the noisiest speech (as indicated by low FwSNR and low PESQ). Therefore, the effect of noise was more severe in GSM than in other channels. As a result, lower $c$ was needed to compensate for the fast changes of noise.
While our experiments showed that the objective quality could be improved by using a smaller $c$, the best PESQ and FwSNR were achieved at different values of $c$. Since having “cleaner” speech does not necessarily improve the quality of speech, the appropriate $c$ should be selected carefully.

5.2. Comparison with Other Noise Estimators

Table 2 shows the average log spectral distances (LSD) in dB. LSD is calculated as follows:
$$\mathrm{LSD} = \frac{1}{K \times M} \sum_{m=1}^{M} \sum_{k=1}^{K} 10 \log_{10} \frac{|N_{ref}(m,k)|^2}{|N_{ne}(m,k)|^2}, \tag{17}$$
where $|N_{ref}(m,k)|^2$ is the noise power spectrum of the actual noise, which is computed by subtracting the actual clean speech from the noisy decrypted speech, $|N_{ne}(m,k)|^2$ is the noise estimate from the noise estimators, $M$ is the number of frames, and $K$ is the number of frequency components. The results are the average of LSD over five utterances of female speakers on various communication channels. The results showed that our method had the smallest LSD among the noise estimators, implying that it tracked the changes in noise better than the other evaluated methods.
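The distance can be computed with a few lines of Python. This sketch follows the formula as it is written here, i.e., a plain average of the log power ratios over all frames and bins; the function name and the small constant `eps` that avoids taking the logarithm of zero are assumptions. Note that other definitions of LSD in the literature use an absolute value or a root-mean-square of the log ratio instead.

```python
import math


def log_spectral_distance(ref_power, est_power, eps=1e-12):
    """Average of 10*log10(reference power / estimated power) in dB.

    ref_power and est_power are lists of frames, each a list of
    per-bin power values. eps is an assumed numerical safeguard.
    """
    total = 0.0
    count = 0
    for ref_frame, est_frame in zip(ref_power, est_power):
        for r, e in zip(ref_frame, est_frame):
            total += 10.0 * math.log10((r + eps) / (e + eps))
            count += 1
    return total / count


# An estimator that underestimates the noise power by a factor of ten.
lsd = log_spectral_distance([[10.0, 10.0]], [[1.0, 1.0]])
print(lsd)
```

A perfect estimate gives a distance of 0 dB, and a consistent tenfold underestimate of the noise power gives about 10 dB.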
Figure 6 shows the comparisons of the noise estimates for the proposed method with other noise estimators: Martin, Hirsch, and IMCRA. It is the magnitude spectrum (in dB) of the 10th band of a female speaker on various encodings: GSM, I-ADPCM, M-ADPCM, and PCM. The figure further suggests that the proposed method was closer to the actual noise than the other noise estimators in tracking the changes of noise. Other estimators, such as Martin's and Hirsch's methods, are based on tracking the minimum statistics of the spectrum to avoid over-estimating the noise. As a result, while over-estimates of noise are very unlikely, there are large gaps between the actual noise and the estimate. On the other hand, our method generally had smaller gaps between the estimate and the actual noise. Interestingly, our method was also able to minimize over-estimation.

5.3. Comparisons with Other SE Methods

Table 3 compares our proposed method with several SE methods: SS, WF, KLT, NMF, and LogMMSE. We used $c = 0.8$ for the remainder of our experiments. In addition, we compared our method with SS when noise estimators were used: Martin, Hirsch, and IMCRA. We also investigated whether the improvements were due to the overlap and add (OLAP) process, so we present the results of applying OLAP only to the noisy speech. The results indicated that OLAP did not improve the PESQ and FwSNR metrics. On average, the results were slightly worse than the original noisy speech, suggesting that OLAP may not contribute to the improvements of our method. SS clearly did not improve the PESQ scores either. Similar results can also be seen for WF and KLT. In some cases, the PESQ scores were even worse. This is not surprising, because the methods were unable to track the changes of noise when speech was present. Only NMF and LogMMSE were able to improve the PESQ scores slightly. NMF, due to its prior assumptions on noise, may be able to remove some parts of the noise even during the speech period.
Our method improved the PESQ scores for most encoding techniques. We obtained PESQ improvements of 0.256, 0.253, 0.260, 0.261, 0.266, 0.533, 0.214, and 0.434 on average over all encoding schemes compared to the original decrypted speech, SS, SS + Martin, SS + IMCRA, SS + Hirsch, WF, LogMMSE, and KLT, respectively, confirming that our method could improve the quality of the decrypted speech. Only NMF had better PESQ scores than our method for GSM, whereas our method was slightly better for the other encoding schemes.
As indicated by the spectrograms of the enhanced speech (see Figure 7 and Figure 8), the spectrograms of most reference methods did not change drastically compared to the original noisy speech for IMA-ADPCM. In contrast, NMF and our proposed method were clearly able to remove some parts of noise from the speech. Similar spectrogram results were observed for other utterances as well. Meanwhile, for GSM, where the conditions were noticeably the worst, we see that the other methods removed parts of the noise during non-speech periods only. Our method could remove some parts of the noise during speech periods.
Surprisingly, we observed that the performance of SS tended to get worse when noise estimators were used. Because most noise estimators are designed to avoid over-subtraction, their noise estimates tended to be very low compared to the actual noise (as we showed in Section 5.2). As the spectrograms indicate, our method removed more noise than SS with the other noise estimators (see Figure 9). This is consistent with our previous findings.
Meanwhile, the enhanced speech tended to have a lower FwSNR for most methods. LogMMSE, WF, and the proposed method improved the FwSNR only on GSM encoding, where our method was better than LogMMSE. The FwSNR for NMF dropped substantially for all encoding schemes. NMF usually assumes Gaussian priors, whereas the noise and speech very likely do not follow the same distributions; forcing the enhanced speech to be Gaussian may therefore have caused the FwSNR to drop. For the other encoding schemes, the proposed method was generally among the methods that degraded the FwSNR the least. These results indicate that while our method may over-subtract the enhanced speech, the effect may be less significant than for other methods. This might be because other methods do not perform well at the boundaries between speech presence and absence, which distorts the boundary regions of the speech and hence worsens the FwSNR.
We compared the computational complexity of the proposed method with that of the other SE methods by measuring the runtime of each method when processing 80 utterances, comprising 20 sentences in each of 4 encoding schemes. We repeated the experiments five times and calculated the average. The experiments were conducted on an Intel i3 processor with 4 GB of memory. The results are shown in Table 4. WF had the longest running time, indicating that it was the method with the largest computational complexity. This was as expected, since WF is an iterative process with the maximum number of iterations set to 1000 in the implementation. NMF also had a large computational cost, to which the matrix decomposition operations may contribute. Meanwhile, SS had the shortest computation time, as it was one of the simplest methods. Only SS and LogMMSE had shorter computation times than the proposed method, suggesting that the method is quite simple and that its computational load can be considered low.
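The measurement procedure described above can be sketched as a small timing harness. This is an illustrative assumption about how such a benchmark might be written, not the code used for Table 4: `average_runtime` and `dummy_enhancer` are hypothetical names, and the synthetic all-zero utterances merely stand in for the 80 decrypted recordings.

```python
import time


def average_runtime(method, utterances, repeats=5):
    """Mean wall-clock time in seconds for `method` to process all utterances.

    One pass processes every utterance once; the pass is repeated `repeats`
    times and the per-pass times are averaged.
    """
    pass_times = []
    for _ in range(repeats):
        start = time.perf_counter()
        for utterance in utterances:
            method(utterance)
        pass_times.append(time.perf_counter() - start)
    return sum(pass_times) / len(pass_times)


def dummy_enhancer(samples):
    """Hypothetical stand-in for an enhancement method; it only rescales samples."""
    return [0.5 * s for s in samples]


# 20 sentences x 4 encoding schemes = 80 utterances, as in the experiment.
utterances = [[0.0] * 1000 for _ in range(80)]
print(f"{average_runtime(dummy_enhancer, utterances):.6f} s per pass")
```

Timing whole passes rather than single utterances keeps the measurement overhead small relative to the work being measured, which matters most for fast methods such as SS.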

6. Conclusions

One way to secure voice communication is to apply an encryption method before sending the voice through mobile or wired communication networks. However, this approach may cause the decrypted speech to have degraded quality. In this paper, we presented an enhancement method with coupled spectral subtraction and the Wiener filter for noise estimation to improve the quality of speech from secure communication.
Our experiments on speech transmitted over various communication channels (GSM, IMA-ADPCM, M-ADPCM, and PCM) showed that our proposed method was generally better at improving the PESQ scores of the decrypted speech than other speech enhancement or noise estimation methods. While the FwSNR tended to get worse for IMA-ADPCM, M-ADPCM, and PCM in general, our method improved it on GSM channels, the condition with the worst FwSNR. These results might indicate the effectiveness of our method for highly distorted speech. Our observations of the spectrograms suggest that the enhanced speech signals from our method were cleaner than those from the reference methods.
In the future, we plan to further improve the performance of our method. In this research, we applied only a single adaptation speed, determined heuristically. Since it clearly affected the performance, it would be interesting to investigate whether the optimum adaptation speed can be predicted automatically.

Author Contributions

Conceptualization, H.P. and K.R.; methodology, H.P., N.H., and Y.S.; software, H.P., N.H., and Y.S.; validation, H.P., Y.S., and N.H.; formal analysis, H.P.; investigation, H.P. and N.H.; resources, H.P., N.H., and Y.S.; data curation, N.H. and A.P.; writing, original draft preparation, H.P.; writing, review and editing, H.P. and K.R.; visualization, H.P.; supervision, K.R.; project administration, A.P., Y.S., and N.H.; funding acquisition, K.R.

Funding

This work was supported by Lembaga Pengelola Dana Pendidikan (LPDP) RISPRO from the Ministry of Finance of the Republic of Indonesia.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Katugampala, N.N.; Al-Naimi, K.T.; Villette, S.; Kondoz, A.M. Real time data transmission over GSM voice channel for secure voice and data applications. In Proceedings of the 2nd IEEE Secure Mobile Communications Forum: Exploring the Technical Challenges in Secure GSM and WLAN, London, UK, 23–23 September 2004; pp. 7/1–7/4.
  2. Gong, Y. Speech recognition in noisy environments: A survey. Speech Commun. 1995, 16, 261–291.
  3. Vincent, E.; Watanabe, S.; Nugraha, A.A.; Barker, J.; Marxer, R. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Comput. Speech Lang. 2017, 46, 535–557.
  4. Levitt, H. Noise reduction in hearing aids: A review. J. Rehabil. Res. Dev. 2001, 38, 111–122.
  5. Kam, A.C.S.; Sung, J.K.K.; Lee, T.; Wong, T.K.C.; van Hasselt, A. Improving mobile phone speech recognition by personalized amplification: Application in people with normal hearing and mild-to-moderate hearing loss. Ear Hear 2017, 38, e85–e92.
  6. Goulding, M.M.; Bird, J.S. Speech enhancement for mobile telephony. IEEE Trans. Veh. Technol. 1990, 39, 316–326.
  7. Juang, B.; Soong, F. Hands-free telecommunications. In Proceedings of the International Workshop on Hands-Free Speech Communication, Kyoto, Japan, 9–11 April 2001.
  8. Jin, W.; Taghizadeh, M.J.; Chen, K.; Xiao, W. Multi-channel noise reduction for hands-free voice communication on mobile phones. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 506–510.
  9. Lu, X.; Tsao, Y.; Matsuda, S.; Hori, C. Speech enhancement based on deep denoising autoencoder. In Proceedings of the Interspeech, 14th Annual Conference of the International Speech Communication Association, Lyon, France, 25–29 August 2013; pp. 436–440.
  10. Xu, Y.; Du, J.; Dai, L.; Lee, C. A Regression Approach to Speech Enhancement Based on Deep Neural Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 7–19.
  11. Weninger, F.; Erdogan, H.; Watanabe, S.; Vincent, E.; Le Roux, J.; Hershey, J.R.; Schuller, B. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, Liberec, Czech Republic, 25–28 August 2015; pp. 91–99.
  12. Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech Enhancement Generative Adversarial Network. In Proceedings of the Interspeech, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 20–24 August 2017; pp. 3642–3646.
  13. Kumar, A.; Florencio, D. Speech Enhancement in Multiple-Noise Conditions Using Deep Neural Networks. In Proceedings of the Interspeech, San Francisco, CA, USA, 8–12 September 2016; pp. 3738–3742.
  14. Shekokar, S.; Mali, M. A brief survey of a DCT-based speech enhancement system. Int. J. Sci. Eng. Res 2013, 4, 1–3.
  15. Boll, S. A spectral subtraction algorithm for suppression of acoustic noise in speech. In Proceedings of the 1979 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Washington, DC, USA, 2–4 April 1979; Volume 4, pp. 200–203.
  16. Nasu, Y.; Shinoda, K.; Furui, S. Cross-Channel Spectral Subtraction for meeting speech recognition. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 4812–4815.
  17. Furuya, K.; Kataoka, A. Robust speech dereverberation using multichannel blind deconvolution with spectral subtraction. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1579–1591.
  18. Berouti, M.; Schwartz, R.; Makhoul, J. Enhancement of speech corrupted by acoustic noise. In Proceedings of the 1979 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Washington, DC, USA, 2–4 April 1979; Volume 4, pp. 208–211.
  19. Lockwood, P.; Boudy, J. Experiments with a nonlinear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars. Speech Commun. 1992, 11, 215–228.
  20. Cappé, O. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Trans. Speech Audio Process. 1994, 2, 345–349.
  21. Kamath, S.; Loizou, P. A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, USA, 13–17 May 2002; Volume 4, p. 44164.
  22. Udrea, R.M.; Oprea, C.C.; Stanciu, C. Multi-microphone Noise Reduction System Integrating Nonlinear Multi-band Spectral Subtraction. In Pervasive Computing Paradigms for Mental Health; Oliver, N., Serino, S., Matic, A., Cipresso, P., Filipovic, N., Gavrilovska, L., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 133–138.
  23. McAulay, R.; Malpass, M. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 137–145.
  24. Scalart, P. Speech enhancement based on a priori signal to noise estimation. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA, USA, 9 May 1996; Volume 2, pp. 629–632.
  25. Hu, Y.; Loizou, P.C. Speech enhancement based on wavelet thresholding the multitaper spectrum. IEEE Trans. Speech Audio Process. 2004, 12, 59–67.
  26. Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 1109–1121.
  27. Lyubimov, N.; Kotov, M. Non-negative matrix factorization with linear constraints for single-channel speech enhancement. In Proceedings of the Interspeech, 14th Annual Conference of the International Speech Communication Association, Lyon, France, 25–29 August 2013; pp. 446–450.
  28. Duan, Z.; Mysore, G.J.; Smaragdis, P. Speech enhancement by online non-negative spectrogram decomposition in nonstationary noise environments. In Proceedings of the Interspeech, Portland, OR, USA, 9–13 September 2012.
  29. Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 443–445.
  30. Mahmmod, B.M.; Ramli, A.R.; Abdulhussian, S.H.; Al-Haddad, S.A.R.; Jassim, W.A. Low-Distortion MMSE Speech Enhancement Estimator Based on Laplacian Prior. IEEE Access 2017, 5, 9866–9881.
  31. Chen, B.; Loizou, P.C. Speech enhancement using a MMSE short time spectral amplitude estimator with Laplacian speech modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, PA, USA, 23–23 March 2005; Volume 1.
  32. Wang, Y.; Brookes, M. Speech enhancement using an MMSE spectral amplitude estimator based on a modulation domain Kalman filter with a Gamma prior. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5225–5229.
  33. Martin, R. Speech enhancement using MMSE short time spectral estimation with gamma distributed speech priors. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Orlando, FL, USA, 13–17 May 2002; Volume 1, pp. I-253–I-256.
  34. Andrianakis, I.; White, P.R. MMSE Speech Spectral Amplitude Estimators With Chi and Gamma Speech Priors. In Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toulouse, France, 14–19 May 2006; Volume 3.
  35. Pardede, H.F.; Koichi, S.; Koji, I. Q-Gaussian based spectral subtraction for robust speech recognition. In Proceedings of the Interspeech, 13th Annual Conference of the International Speech Communication Association, Portland, OR, USA, 9–13 September 2012; pp. 1255–1258.
  36. Pardede, H.; Iwano, K.; Shinoda, K. Spectral subtraction based on non-extensive statistics for speech recognition. IEICE Trans. Inf. Syst. 2013, 96, 1774–1782.
  37. Martin, R. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 2001, 9, 504–512.
  38. Barnov, A.; Bracha, V.B.; Markovich-Golan, S. QRD based MVDR beamforming for fast tracking of speech and noise dynamics. In Proceedings of the 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 15–18 October 2017; pp. 369–373.
  39. Cohen, I.; Berdugo, B. Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Process Lett. 2002, 9, 12–15.
  40. Lu, C.T.; Lei, C.L.; Shen, J.H.; Wang, L.L.; Tseng, K.F. Estimation of noise magnitude for speech denoising using minima-controlled-recursive-averaging algorithm adapted by harmonic properties. Appl. Sci. 2017, 7, 9.
  41. Lu, C.T.; Chen, Y.Y.; Shen, J.H.; Wang, L.L.; Lei, C.L. Noise Estimation for Speech Enhancement Using Minimum-Spectral-Average and Vowel-Presence Detection Approach. In Proceedings of the International Conference on Frontier Computing, Bangkok, Thailand, 9–11 September 2016; pp. 317–327.
  42. He, Q.; Bao, F.; Bao, C. Multiplicative update of auto-regressive gains for codebook-based speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 457–468.
  43. Faubel, F.; Mcdonough, J.; Klakow, D. A phase-averaged model for the relationship between noisy speech, clean speech and noise in the log-mel domain. In Proceedings of the Interspeech, 9th Annual Conference of the International Speech Communication Association, Brisbane, Australia, 22–26 September 2008.
  44. Zhu, Q.; Alwan, A. The effect of additive noise on speech amplitude spectra: A quantitative analysis. IEEE Signal Process Lett. 2002, 9, 275–277.
  45. Sovka, P.; Pollak, P.; Kybic, J. Extended spectral subtraction. In Proceedings of the 1996 8th European Signal Processing Conference (EUSIPCO 1996), Trieste, Italy, 10–13 September 1996; pp. 1–4.
  46. Lestari, D.P.; Iwano, K.; Furui, S. A large vocabulary continuous speech recognition system for Indonesian language. In Proceedings of the 15th Indonesian Scientific Conference in Japan Proceedings, Hiroshima, Japan, 4–7 August 2006; pp. 17–22.
  47. Hayati, N.; Suryanto, Y.; Ramli, K.; Suryanegara, M. End-to-End Voice Encryption Based on Multiple Circular Chaotic Permutation. In Proceedings of the 2019 2nd International Conference on Communication Engineering and Technology (ICCET), Nagoya, Japan, 12–15 April 2019.
  48. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)-A new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, USA, 7–11 May 2001; Volume 2, pp. 749–752.
  49. Tribolet, J.; Noll, P.; McDermott, B.; Crochiere, R. A study of complexity and quality of speech waveform coders. In Proceedings of the ICASSP’78. IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, OK, USA, 10–12 April 1978; Volume 3, pp. 586–590.
  50. Hu, Y.; Loizou, P.C. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 229–238.
  51. Hirsch, H.G.; Ehrlicher, C. Noise estimation techniques for robust speech recognition. In Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, USA, 9–12 May 1995; Volume 1, pp. 153–156.
  52. Cohen, I. Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process. 2003, 11, 466–475.
  53. Hu, Y.; Loizou, P.C. A generalized subspace approach for enhancing speech corrupted by colored noise. IEEE Trans. Speech Audio Process. 2003, 11, 334–341.
  54. Loizou, P.C. Speech Enhancement: Theory and Practice, 2nd ed.; CRC Press, Inc.: Boca Raton, FL, USA, 2013.
  55. NMFdenoiser. 2014. Available online: https://github.com/niklub/NMFdenoiser (accessed on 11 March 2019).
Figure 1. The block diagram of the proposed method.
Figure 2. Comparisons of actual SNR with estimated SNR.
Figure 3. The process of creating the data for the experiments. IMA, Interactive Multimedia Association; M, Microsoft.
Figure 4. The performance of the proposed method (perceptual evaluation of speech quality (PESQ)) when the adaptation speed c was varied for various communication channels: (a) GSM, (b) IMA-ADPCM, (c) PCM, and (d) Microsoft-ADPCM. The results are the average PESQ scores over all 20 utterances.
Figure 5. The performance of the proposed method (frequency-weighted segmental SNR (FwSNR)) when the adaptation speed c was varied for various communication channels: (a) GSM, (b) IMA-ADPCM, (c) PCM, and (d) Microsoft-ADPCM. The results are the average FwSNR over all 20 utterances.
Figure 6. Comparisons of the resulting noise estimation spectra of the proposed method and several noise estimators: Martin, Hirsch, and IMCRA. The figures show the magnitude spectra (in dB) of a frequency bin (k = 10) on (a) IMA-ADPCM and (b) M-ADPCM. For the proposed method, we used c = 0.8.
Figure 7. Spectrograms of a female speaker's utterance “hai selamat pagi apa kabar”: (a) the original speech, (b) the decrypted speech from I-ADPCM, and the enhanced speech from (c) the Wiener filter, (d) KLT, (e) LogMMSE, (f) NMF, and (g) the proposed method.
Figure 8. Spectrograms of a female speaker's utterance “acara nontonnya jadi kan”: (a) the original speech, (b) the decrypted speech from GSM, and the enhanced speech from (c) the Wiener filter, (d) KLT, (e) LogMMSE, (f) NMF, and (g) the proposed method.
Figure 9. Spectrograms of a male speaker's utterance “apakah kamu sudah makan siang”: (a) the original speech, (b) the decrypted speech from PCM, and the enhanced speech from (c) SS + IMCRA, (d) SS + Martin, (e) SS + Hirsch, and (f) the proposed method.
Table 1. The list of utterances used for the experiments.
| Utterance | Phonetic Transcription |
|---|---|
| 1 | [h-ai] [s-ə-l-a:-m-a:-t] [p-a:-g-i] [a:-p-a:] [k-a:-b-a:-R] |
| 2 | [s-ə-m-ɔ:-g-a:] [u:-ʤ-i-a:-n-ɲ-a:] [b-ə-R-ʤ-a:-l-a:-n] [l-a:-n-ʧ-a:-R] |
| 3 | [a:-ʧ-a:-R-a:] [n-ɔ:-n-t-ɔ:-n-ɲ-a:] [ʤ-a:-d-i] [k-a:-n] |
| 4 | [a:-p-a:-k-a:-h] [k-a:-m-u:] [s-u:-d-a:-h] [m-a:-k-a:-n] [s-i-a:-ɲ] |
| 5 | [n-a:-n-t-i] [m-a:-l-a:-m] [p-u:-l-a:-ɲ] [ʤ-a:-m] [b-ə-R-a:-p-a:] |
| 6 | [ʤ-a:-ɲ-a:-n] [l-u:-p-a:] [s-a:-R-a:-p-a:-n] [j-a:] |
| 7 | [ʧ-ə-p-a:-t] [i-s-t-i-R-a:-h-a:-t] [d-a:-n] [m-i-m-p-i] [j-a:-ɲ] [i-n-d-a:-h] |
| 8 | [m-a:-a:-f] [s-a:-j-a:] [t-ə-R-l-a:-m-b-a:-t] [d-a:-t-a:-ɲ] [k-ə] [k-a:-n-t-o-R] |
| 9 | [d-i-a:] [t-i-d-a:-k] [d-a:-t-a:-ɲ] [k-ə] [s-ə-k-ɔ:-l-a:-h] |
| 10 | [a:-l-a:-s-a:-n-ɲ-a:] [b-ə-l-u:-m] [m-ə-ɲ-ə-R-ʤ-a:-k-a:-n] [p-ə-k-ə-R-ʤ-a:-a:-n] [R-u:-m-a:-h] |
Table 2. The average LSD (in dB) between the actual noise and the estimated noise of our method and other noise estimators for various communication channels. I, IMA; IMCRA, improved minima controlled recursive averaging.
| Methods | GSM | I-ADPCM | M-ADPCM | PCM |
|---|---|---|---|---|
| Hirsch | 43.89 | 40.50 | 43.25 | 40.47 |
| Martin | 43.88 | 41.65 | 46.06 | 41.52 |
| IMCRA | 41.20 | 37.34 | 39.26 | 37.28 |
| Proposed | 40.00 | 36.17 | 43.93 | 36.17 |
Table 3. Results of our method (PESQ scores and FwSNR) in comparison with other methods. The proposed method applied c = 0.8. OLAP, overlap and add; SS, spectral subtraction; WF, Wiener filter; KLT, Karhunen–Loève transform.
| Methods | PESQ: GSM | PESQ: I-ADPCM | PESQ: M-ADPCM | PESQ: PCM | FwSNR: GSM | FwSNR: I-ADPCM | FwSNR: M-ADPCM | FwSNR: PCM |
|---|---|---|---|---|---|---|---|---|
| Noisy speech | 1.234 | 2.303 | 2.346 | 2.303 | 2.501 | 15.300 | 13.533 | 15.309 |
| With OLAP only | 1.234 | 2.295 | 2.345 | 2.295 | 2.495 | 15.210 | 13.523 | 15.190 |
| SS | 1.229 | 2.308 | 2.351 | 2.308 | 2.431 | 13.148 | 12.575 | 13.155 |
| SS + Martin | 1.235 | 2.296 | 2.341 | 2.296 | 2.213 | 13.505 | 12.773 | 13.510 |
| SS + IMCRA | 1.234 | 2.295 | 2.339 | 2.295 | 2.185 | 13.207 | 12.613 | 13.200 |
| SS + Hirsch | 1.234 | 2.289 | 2.333 | 2.288 | 2.131 | 13.192 | 12.562 | 13.187 |
| WF | 1.163 | 1.963 | 1.989 | 1.963 | 2.535 | 11.918 | 11.328 | 11.919 |
| LogMMSE | 1.240 | 2.352 | 2.405 | 2.352 | 2.711 | 12.439 | 11.895 | 12.440 |
| KLT | 1.173 | 2.090 | 2.118 | 2.091 | 2.206 | 10.512 | 10.128 | 10.510 |
| NMF | 1.350 | 2.551 | 2.606 | 2.591 | 1.654 | 2.987 | 2.957 | 2.995 |
| Proposed method | 1.251 | 2.654 | 2.649 | 2.654 | 3.620 | 13.415 | 12.799 | 13.405 |
Table 4. Computational complexity of the proposed method in comparison with other methods.
| No. | SS | KLT | LogMMSE | WF | NMF | Proposed |
|---|---|---|---|---|---|---|
| 1 | 12,945 | 43,157 | 27,170 | 255,627 | 205,544 | 37,139 |
| 2 | 13,131 | 41,292 | 28,038 | 256,006 | 204,908 | 37,268 |
| 3 | 13,022 | 41,753 | 27,171 | 255,987 | 206,773 | 37,117 |
| 4 | 13,420 | 41,885 | 27,477 | 256,151 | 207,229 | 37,103 |
| 5 | 13,470 | 41,003 | 27,405 | 255,775 | 205,974 | 36,644 |
| Average | 13,198 | 41,818 | 27,452 | 255,909 | 206,086 | 37,054 |

Share and Cite

MDPI and ACS Style

Pardede, H.; Ramli, K.; Suryanto, Y.; Hayati, N.; Presekal, A. Speech Enhancement for Secure Communication Using Coupled Spectral Subtraction and Wiener Filter. Electronics 2019, 8, 897. https://doi.org/10.3390/electronics8080897
