
CN115331678B - Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient - Google Patents


Info

Publication number
CN115331678B
CN115331678B (application CN202210304605.4A)
Authority
CN
China
Prior art keywords
grnn
signal
mfcc
training
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210304605.4A
Other languages
Chinese (zh)
Other versions
CN115331678A (en)
Inventor
Wang Yong (汪勇)
Yao Qihai (姚琦海)
Yang Yixin (杨益新)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202210304605.4A priority Critical patent/CN115331678B/en
Publication of CN115331678A publication Critical patent/CN115331678A/en
Application granted granted Critical
Publication of CN115331678B publication Critical patent/CN115331678B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/26Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention relates to a generalized regression neural network (GRNN) acoustic signal identification method using Mel frequency cepstrum coefficients (MFCC). The method combines the MFCC with the GRNN, fully exploiting the rich acoustic features of the MFCC and the nonlinear fitting ability of the GRNN, and effectively identifies seal species. First, the MFCC features of an acoustic signal are extracted: an FFT and Mel filtering are performed, the L-order MFCC is computed, and the cepstrum difference parameters are calculated. The GRNN model is then tested, with the optimal expansion factor determined by k-fold cross validation: the training data is divided into k folds, each of which is taken in turn as the validation set. The resulting optimal expansion factor is used for GRNN training, and the test acoustic data is identified. The GRNN method is least affected by a decrease in signal-to-noise ratio: when the signal-to-noise ratio is above 5 dB it achieves accurate identification, and at 0 dB it still achieves approximate identification.

Description

Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient
Technical Field
The invention belongs to the fields of digital signal processing, machine learning and underwater acoustic measurement, and relates to a generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficients, which uses Mel frequency cepstrum coefficients and a generalized regression neural network model to identify acoustic signals under multiple signal-to-noise ratios.
Background
Against the background of the rapid development of machine learning technology, data-driven models such as neural networks can mine deep features of different target acoustic signals, greatly reduce the influence of noise, and effectively make classification decisions autonomously and intelligently; machine learning methods are therefore widely researched and applied in the field of acoustic signal processing. In 2012, Liu Jian et al. input the energy spectrum features of acoustic signals into a support vector machine (SVM) classification model, and the results indicated that the method can effectively identify ship radiated noise signals (Liu Jian, Liu Zhong, Xiong Ying. Underwater target identification based on wavelet packet energy spectrum and SVM. Journal of Wuhan University of Technology (Transportation Science and Engineering), 2012, 36(2): 5.). In 2017, Hao et al. achieved identification of unlabeled underwater target acoustic signals by establishing a deep belief network (DBN) model (Hao Y, Zhang L, Wang D, et al. The Classification of Underwater Acoustic Targets Based on Deep Learning Methods. 2017 2nd International Conference on Control, Automation and Artificial Intelligence, 2017.). In 2019, Lv Haitao et al. classified framed and normalized ship noise signals using convolutional neural networks (CNN), and the results showed that the classification performance was superior to traditional higher-order spectrum classification methods (Lv Haitao, Jianwen, Kong Xiaopeng. Convolutional neural network-based underwater target classification technique. Ship Electronic Engineering, 2019, 39(2): 158-162.).
In 2020, zhong et al collect underwater sound signals and input the collected underwater sound signals into a CNN model for detection (Zhong M,Castellote M,Dodhia R,et al.Beluga whale acoustic signal classification using deep learning neural network models.The Journal of the Acoustical Society of America,147(3):1834.). of white whales, in 2021, MISHACHANDAR et al use CNN for marine noise identification, and a sound signal identification method based on a machine learning model above artificial sound, natural sound and marine animal sound (Mishachandar B,Vairamuthu S.Diverse ocean noise classification using deep learning.Applied Acoustics,2021.). is often complex and requires training of a large number of network parameters, and training time is long.
In summary, under multiple signal-to-noise-ratio environments, a machine learning method that combines acoustic signal feature extraction with the nonlinear fitting capability of a neural network, while keeping a simple structure and few parameters, is needed.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides a generalized regression neural network acoustic signal identification method utilizing Mel frequency cepstrum coefficients.
Technical solution
A generalized regression neural network acoustic signal identification method utilizing Mel frequency cepstrum coefficients is characterized by comprising the following steps:
Step 1: extracting MFCC characteristics from the acquired underwater acoustic signals;
Step 2: performing FFT on each frame of signal to obtain a frequency spectrum;
step 3: filtering the frequency spectrum through a group of triangular band-pass filters to obtain Mel filtering;
step 4: calculating the logarithmic energy output by each filter and its discrete cosine transform to obtain the L-order MFCC;
Step 5: calculating cepstrum differential parameters by using the L MFCC cepstrum coefficients, and combining the three parameters of the MFCC, the first-order cepstrum differential parameters and the second-order cepstrum differential parameters to serve as feature vectors of signals;
step 6: 3/4 of all processed data is used as training and validation data, and the remaining data is used for testing the GRNN model;
Step 7: the training and verifying data for the model are randomly divided into k folds, the optimal expansion factors are determined by a k-fold cross verification method, and the measurement indexes of the identification results of the verification set under different expansion factors are expressed as follows:
Where N is the number of samples, The number of correctly identified samples out of the N samples;
the validation set is 1 fold and the training set is the remaining k-1 folds;
for each expansion factor, the training set is first used for training, then the validation set is tested and the corresponding accuracy is calculated; with each fold serving once as the validation set, this process is repeated, and the mean of the k accuracies, namely the average accuracy, is computed;
repeating the above for all expansion factors, the expansion factor corresponding to the maximum average accuracy is taken as the optimal expansion factor;
step 8: the obtained optimal expansion factor is used as the parameter of the GRNN model; the training data is used to train the GRNN model, test data under multiple signal-to-noise ratios is input into the model for acoustic signal recognition, the recognition results are counted, and the recognition performance of the GRNN model is analyzed; the trained and tested GRNN model can then recognize real-time data.
When the MFCC features are extracted, pre-emphasis, framing and windowing preprocessing are performed on the acquired acoustic signals.
The value range of the expansion factor is 0.01, 0.02, …, 0.10, with a step size of 0.01.
Advantageous effects
The generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficients combines the MFCC with the GRNN, fully exploiting the rich acoustic features of the MFCC and the nonlinear fitting ability of the GRNN, and effectively identifies seal species. First, the MFCC features of the acoustic signal are extracted: an FFT and Mel filtering are performed, the L-order MFCC is computed, and the cepstrum difference parameters are calculated. The GRNN model is then tested, with the optimal expansion factor determined by k-fold cross validation: the training data is divided into k folds, each of which is taken in turn as the validation set. The resulting optimal expansion factor is used for GRNN training, and the test acoustic data is identified.
The study analyzes the identification performance of the method under different signal-to-noise ratios. Gaussian white noise is added to the initial acoustic signal, with the noise bandwidth matched to the different types of seal acoustic signals, at signal-to-noise ratios of 10 dB, 5 dB and 0 dB. Taking the waveforms of an initial seal signal and the signal at the different signal-to-noise ratios as an example, as shown in fig. 6, the lower the signal-to-noise ratio, the more noise spikes appear in the signal. The training samples are used to train and validate the GRNN model with 10-fold cross validation; the expansion-factor optimization process of the GRNN on the MFCC features of each scene is shown in fig. 7. It can be seen that, overall, as the expansion factor grows the recognition performance decreases, and the optimal expansion factors are concentrated between 0 and 0.1.
In the study, SVM and CNN are used as comparison models: the SVM model uses a radial basis function kernel, and the CNN model structure is shown in fig. 8, with the SGDM (stochastic gradient descent with momentum) optimization algorithm. Tables 1, 2 and 3 show the recognition accuracy of the GRNN, CNN and SVM methods in each scene, where A, B and C denote leopard seals, Ross seals and Weddell seals, respectively. All three methods can effectively identify seal species at high signal-to-noise ratio, but the SVM method is affected most by the decrease in signal-to-noise ratio: at 0 dB its error is large and effective identification cannot be achieved. Compared with the SVM, the CNN method is less affected by the decreasing signal-to-noise ratio and differs little from the GRNN method at high signal-to-noise ratio, where the GRNN method is only somewhat better; but at low signal-to-noise ratio, especially 0 dB, the error of the CNN method becomes large. The GRNN method is least affected by the decrease in signal-to-noise ratio: when the signal-to-noise ratio is above 5 dB it achieves accurate identification, and at 0 dB it still achieves approximate identification. Overall, the GRNN can robustly identify seal species at various signal-to-noise ratios because the MFCC-GRNN model combines the feature advantages of the MFCC with the nonlinear fitting capability of the GRNN.
Drawings
Fig. 1: GRNN model structure diagram
Fig. 2: seal type identification method overall flow block diagram based on MFCC and GRNN
Fig. 3: woltzfeldt database collection chart
Fig. 4: sound waveform diagram of seal of different kinds
Fig. 5: different kinds of seal MFCC characteristics
(A) A leopard seal; (b) ross seal; (c) Widel seal
Fig. 6: initial signal and waveforms at different signal-to-noise ratios
Fig. 7: optimizing expansion factor process for GRNN of each scene
Fig. 8: CNN structure diagram
Detailed Description
The invention will now be further described with reference to the examples and figures:
The Mel frequency cepstral coefficient (MFCC), a feature widely used in speech recognition, is designed based on the characteristics of the human ear. Owing to the particular structure of the human ear, a listener automatically separates the low-frequency and high-frequency parts of speech, and the low-frequency part is the main carrier of the features used to recognize speech.
The generalized regression neural network (GRNN) is a forward neural network based on kernel regression analysis with good nonlinear mapping capability. The GRNN estimates a conditional probability density function from the inputs and outputs of the training data together with the input of the test data, and from it obtains the output for the test data. The GRNN consists of an input layer, a pattern layer, a summation layer and an output layer, and needs only one network parameter, whereas other neural network models generally require several parameters; the GRNN therefore has a clear advantage in network construction. Since the GRNN has only this one parameter, its performance can be improved by optimizing the expansion factor. The present study uses k-fold cross validation to determine the optimal expansion factor; fig. 1 shows the GRNN model structure.
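The four-layer GRNN prediction described above can be written compactly as a Gaussian-kernel weighted average of the training targets. The following is a minimal NumPy sketch, not the patent's implementation; the function name `grnn_predict` and the toy data are illustrative, and `sigma` plays the role of the expansion factor.

```python
import numpy as np

def grnn_predict(X_train, Y_train, X_test, sigma):
    """GRNN prediction: pattern layer (one Gaussian unit per training
    sample), summation layer, and output layer (weighted average)."""
    # Squared Euclidean distances between each test and training sample
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Pattern layer: Gaussian kernel with spread (expansion factor) sigma
    w = np.exp(-d2 / (2.0 * sigma ** 2))            # (n_test, n_train)
    # Summation + output layers: weighted average of training targets
    num = w @ Y_train
    den = w.sum(axis=1, keepdims=True)
    return num / np.maximum(den, 1e-12)

# Class identification: one-hot training targets, argmax of the output
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
Y_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
X_test = np.array([[0.5, 0.5], [4.5, 4.5]])
pred = grnn_predict(X_train, Y_train, X_test, sigma=0.5).argmax(axis=1)
print(pred)  # → [0 1]
```

Because the only free parameter is `sigma`, "training" reduces to storing the training samples and tuning this single value, which is why the document's grid search over expansion factors suffices.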
The present study proposes a method that accurately identifies seal species with a GRNN fed with MFCC features. It combines the MFCC with the GRNN, fully exploiting the rich acoustic features of the MFCC and the nonlinear fitting ability of the GRNN, and effectively identifies seal species.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
1) The study requires extraction of the MFCC features of the acoustic signals. First, the acquired acoustic signals are preprocessed by pre-emphasis, framing and windowing. Pre-emphasis flattens the spectrum of the signal by boosting its high-frequency part. Framing divides the signal into several short periods, within which the signal can be regarded as a stationary process; an overlapped segmentation method is generally adopted so that the transition between frames is smooth. Windowing reduces the truncation effect of the signal.
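The preprocessing chain above can be sketched as follows. This is a minimal NumPy illustration, not the patent's code; the frame length of 256 samples and hop of 128 match the segmentation described later in the embodiment, while the pre-emphasis coefficient 0.97 is a common conventional choice assumed here.

```python
import numpy as np

def preprocess(signal, frame_len=256, hop=128, alpha=0.97):
    """Pre-emphasis, overlapped framing, and Hamming windowing."""
    # Pre-emphasis: boost high frequencies, s[n] - alpha * s[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Overlapped framing: frames of frame_len samples, shifted by hop
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # Windowing with a Hamming window: s'(n) = s(n) w(n)
    return frames * np.hamming(frame_len)

frames = preprocess(np.sin(2 * np.pi * 50 * np.arange(4096) / 8000.0))
print(frames.shape)  # → (31, 256)
```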
2) After the preprocessing is completed, FFT is carried out on each frame of signal to obtain a frequency spectrum.
3) And filtering the frequency spectrum by a group of triangular band-pass filters to obtain Mel filtering.
4) The logarithmic energy of each filter output is calculated and its discrete cosine transform is calculated to find the MFCC of the L-order.
5) The cepstrum difference parameters (Delta Cepstrum) are calculated from the L MFCC cepstrum coefficients, and the MFCC together with the first-order and second-order cepstrum difference parameters are combined as the feature vector of the signal.
6) The collected different types of sound data are randomly divided in proportion, and part of the data are used for training and verification, and the other data are used for testing the GRNN model.
7) The expansion factor range and the step length are selected in the research, so that a large number of expansion factors are obtained for optimizing the GRNN model.
8) The method comprises the steps of dividing training data into k folds, sequentially using the k folds as verification sets for testing, and selecting the expansion factor with the best recognition performance for the verification sets as the optimal expansion factor.
9) And using the obtained optimal expansion factor for training of GRNN, and identifying the test sound data.
Detailed description of the preferred embodiment
By extracting the MFCC features and optimizing the expansion factors, the GRNN model can fully utilize the training data and optimizing the model to realize the recognition of the sound signal, and referring to fig. 2, the overall process is specifically constructed and trained by the following steps:
1) The data preprocessing includes pre-emphasis, framing and windowing. Pre-emphasis: the spectrum of the high-frequency part of the signal is boosted so that the overall spectrum becomes flatter. Framing: the signal is divided into several short-period signals, within which the signal can be regarded as a stationary process. Windowing: let s(n) be the signal and w(n) the window function; the windowed signal s'(n) is:
s'(n)=s(n)w(n) (1)
where 0 ≤ n ≤ N-1, N is the number of sample points, and w(n) is usually a Hamming window.
2) After preprocessing is completed, an FFT is performed on each frame of the signal to obtain its spectrum. The discrete spectrum S'a(k) of the signal is:
S'a(k) = Σ_{n=0}^{N-1} s'(n) e^(-j2πnk/N), 0 ≤ k ≤ N-1 (2)
3) The spectrum is filtered by a bank of M triangular band-pass filters (Mel filtering) with center frequencies f(m), m = 1, 2, …, M. The frequency response of the m-th triangular filter is:
H_m(k) = 0, k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), f(m) < k ≤ f(m+1)
H_m(k) = 0, k > f(m+1) (3)
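A triangular filterbank with center frequencies equally spaced on the Mel scale can be built as below. This is a generic sketch under assumed parameters (24 filters, 256-point FFT, 8 kHz sampling rate — the filter count matches the embodiment, the sampling rate is an assumption); the function names are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=256, fs=8000.0):
    """M triangular band-pass filters whose center frequencies f(m)
    are equally spaced on the Mel scale."""
    # M+2 boundary points f(0), f(1), ..., f(M+1) on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):           # rising slope of the triangle
            H[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):           # falling slope of the triangle
            H[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return H

H = mel_filterbank()
print(H.shape)  # → (24, 129)
```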
4) The logarithmic energy output by each filter is calculated:
E(m) = ln( Σ_{k=0}^{N-1} |S'a(k)|² H_m(k) ), 1 ≤ m ≤ M (4)
5) A discrete cosine transform is applied to the M logarithmic energies obtained above to obtain the L-order MFCC, where L is typically 12-16. The discrete cosine transform formula is:
C(l) = Σ_{m=1}^{M} E(m) cos(πl(m - 0.5)/M), l = 1, 2, …, L (5)
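Steps 2), 4) and 5) together map one windowed frame to its L-order MFCC. A minimal NumPy sketch follows; the function name is illustrative, and the random placeholder filterbank stands in for the triangular filters of step 3).

```python
import numpy as np

def mfcc_from_frame(frame, H, L=13):
    """One frame to L-order MFCC: FFT power spectrum, filterbank
    log energies, then DCT of the log energies."""
    power = np.abs(np.fft.rfft(frame)) ** 2          # |S'a(k)|^2
    E = np.log(np.maximum(H @ power, 1e-12))         # log energies E(m)
    M = len(E)
    m = np.arange(1, M + 1)
    # C(l) = sum_m E(m) cos(pi * l * (m - 0.5) / M), l = 1..L
    return np.array([np.sum(E * np.cos(np.pi * l * (m - 0.5) / M))
                     for l in range(1, L + 1)])

# Usage with a placeholder 24-band filterbank of shape (M, n_fft//2 + 1)
rng = np.random.default_rng(0)
H = np.abs(rng.normal(size=(24, 129)))
frame = np.hamming(256) * np.sin(2 * np.pi * 40 * np.arange(256) / 256.0)
c = mfcc_from_frame(frame, H)
print(c.shape)  # → (13,)
```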
6) The cepstrum difference parameters are calculated from the L MFCC cepstrum coefficients. The formula is:
d_n = ( Σ_{k=1}^{K} k (C_{n+k} - C_{n-k}) ) / sqrt( 2 Σ_{k=1}^{K} k² ) (6)
where d_n is the n-th first-order difference result, C_n is the n-th cepstrum coefficient obtained from formula (5), L is the order of the MFCC, and K is the time span of the first-order derivative, usually 1 or 2. Substituting the first-order results back into formula (6) yields the second-order difference.
7) The MFCC, the first-order cepstrum difference parameters and the second-order cepstrum difference parameters are combined as the signal feature vector input to the GRNN model.
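The difference computation of step 6) and the feature fusion of step 7) can be sketched as below. This is an illustrative NumPy version, not the patent's code; it applies the difference along the frame axis with edge padding (a common convention assumed here), and the 12-coefficient MFCC yields the 1×36 fused vector mentioned in the embodiment.

```python
import numpy as np

def delta(coeffs, K=2):
    """First-order cepstrum difference over frames:
    d_t = sum_k k * (C[t+k] - C[t-k]) / sqrt(2 * sum_k k^2)."""
    denom = np.sqrt(2.0 * sum(k * k for k in range(1, K + 1)))
    padded = np.pad(coeffs, ((K, K), (0, 0)), mode="edge")
    T = coeffs.shape[0]
    d = np.zeros_like(coeffs, dtype=float)
    for k in range(1, K + 1):
        d += k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
    return d / denom

# Fused feature per frame: [MFCC, first-order delta, second-order delta]
mfcc = np.random.default_rng(0).normal(size=(31, 12))
d1 = delta(mfcc)           # first-order difference
d2 = delta(d1)             # second-order difference (delta of delta)
features = np.hstack([mfcc, d1, d2])
print(features.shape)  # → (31, 36)
```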
8) 3/4 of the data in each type of sound data is randomly selected for training and validation of the model, and the remaining 1/4 is used for testing.
9) The present study uses a k-fold cross validation method to determine the optimal spreading factor, which first requires determining the range of values for the spreading factor, e.g., 0.01,0.02, …,0.1, step size 0.01.
10) The data for training and validation of the model is randomly divided into k folds; the validation set is 1 fold and the training set is the remaining k-1 folds.
11) The accuracy is used as the measurement index of the validation-set identification results under different expansion factors:
Acc = N_c / N (7)
where N is the number of samples and N_c is the number of correctly identified samples among the N samples.
12) For each expansion factor, the training set is first used for training, then the validation set is tested and the corresponding accuracy is calculated; with each fold serving once as the validation set, this process is repeated, and the mean of the k accuracies, namely the average accuracy, is computed.
13) The above steps are repeated for all expansion factors, and the expansion factor corresponding to the maximum average accuracy is taken as the optimal expansion factor.
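Steps 9) to 13) amount to a grid search over the expansion factor with k-fold cross validation. The sketch below illustrates this with a minimal inline GRNN and synthetic two-class data; the function names and the data are illustrative, and the grid 0.01, …, 0.10 follows the range stated in the document.

```python
import numpy as np

def grnn_predict(Xtr, Ytr, Xte, sigma):
    # Minimal GRNN: Gaussian-weighted average of training targets
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return (w @ Ytr) / np.maximum(w.sum(1, keepdims=True), 1e-12)

def cv_select_sigma(X, Y, sigmas, k=10, seed=0):
    """k-fold cross validation over the expansion-factor grid: each
    fold serves once as validation set; the factor with the highest
    average accuracy is selected."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    best_sigma, best_acc = None, -1.0
    for sigma in sigmas:
        accs = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            out = grnn_predict(X[trn], Y[trn], X[val], sigma)
            accs.append(np.mean(out.argmax(1) == Y[val].argmax(1)))
        if np.mean(accs) > best_acc:
            best_acc, best_sigma = float(np.mean(accs)), sigma
    return best_sigma, best_acc

# Two synthetic, well-separated classes; grid 0.01 ... 0.10, step 0.01
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2))])
Y = np.repeat(np.eye(2), 40, axis=0)
sigma, acc = cv_select_sigma(X, Y, sigmas=np.arange(0.01, 0.11, 0.01), k=10)
print(round(sigma, 2), round(acc, 2))
```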
14) The obtained optimal expansion factor is used as the parameter of the GRNN model; the training data is used to train the GRNN model, test data under multiple signal-to-noise ratios is input into the model for acoustic signal recognition, the recognition results are counted, and the recognition performance of the GRNN model is analyzed. The GRNN model, after training and testing, can recognize real-time data.
Specific implementation examples:
In this study, the underwater acoustic signals of seals are taken as an example to identify leopard seals, Ross seals and Weddell seals; all three species live in the Antarctic and often appear in the same sea area. The input data used in this study are the target acoustic signals of the leopard seal, Ross seal and Weddell seal in the Watkins Marine Mammal Sound Database. Fig. 3 shows the area covered by the database; the seal data used in this study lie in the elliptical area. Fig. 4 shows the sound waveforms of the different kinds of seals.
The input of the study is the MFCC features of the audio data. First, each piece of data is segmented into frames of 256 samples, each offset by 128 samples, consistent with the framing described above. The MFCC of each segment is computed with a bank of 24 filters, the first-order and second-order difference coefficients of the MFCC are computed, and the MFCC and its first-order and second-order difference coefficients are fused, so that each segment yields a 1×36 feature vector, used as the feature input of a single sample. 3/4 of the data in each category is selected for training and the remaining 1/4 for testing, giving 14762 training samples and 4923 test samples. The MFCC features of the leopard seal, Ross seal and Weddell seal are shown in fig. 5(a), 5(b) and 5(c), respectively.
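The segmentation and per-class 3/4-1/4 split described above can be sketched as follows. This is an illustrative NumPy version with synthetic data; the function names and the toy sample counts are assumptions, while the 256-sample segments with 128-sample offset and the 75% split follow the document.

```python
import numpy as np

def segment(signal, seg_len=256, hop=128):
    """Split a recording into overlapped segments of 256 samples,
    each offset by 128 samples."""
    n = 1 + (len(signal) - seg_len) // hop
    return np.stack([signal[i * hop:i * hop + seg_len] for i in range(n)])

def split_per_class(features, labels, train_frac=0.75, seed=0):
    """Randomly select 3/4 of each class for training/validation and
    keep the remaining 1/4 for testing."""
    rng = np.random.default_rng(seed)
    tr_idx, te_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        cut = int(train_frac * len(idx))
        tr_idx.extend(idx[:cut])
        te_idx.extend(idx[cut:])
    return np.array(tr_idx), np.array(te_idx)

X = np.random.default_rng(1).normal(size=(100, 36))  # 1x36 feature vectors
y = np.array([0] * 40 + [1] * 30 + [2] * 30)         # three seal classes
tr, te = split_per_class(X, y)
print(len(tr), len(te))  # → 74 26
```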
In practical applications, the ocean often contains environmental noise of varying degrees. The study analyzes the identification performance of the method under different signal-to-noise ratios. Gaussian white noise is added to the initial acoustic signal, with the noise bandwidth matched to the different types of seal acoustic signals, at signal-to-noise ratios of 10 dB, 5 dB and 0 dB. Taking the waveforms of an initial seal signal and the signal at the different signal-to-noise ratios as an example, as shown in fig. 6, the lower the signal-to-noise ratio, the more noise spikes appear in the signal. The training samples are used to train and validate the GRNN model with 10-fold cross validation; the expansion-factor optimization process of the GRNN on the MFCC features of each scene is shown in fig. 7. It can be seen that, overall, as the expansion factor grows the recognition performance decreases, and the optimal expansion factors are concentrated between 0 and 0.1.
In the study, SVM and CNN are used as comparison models: the SVM model uses a radial basis function kernel, and the CNN model structure is shown in fig. 8, with the SGDM (stochastic gradient descent with momentum) optimization algorithm. Tables 1, 2 and 3 show the recognition accuracy of the GRNN, CNN and SVM methods in each scene, where A, B and C denote leopard seals, Ross seals and Weddell seals, respectively. All three methods can effectively identify seal species at high signal-to-noise ratio, but the SVM method is affected most by the decrease in signal-to-noise ratio: at 0 dB its error is large and effective identification cannot be achieved. Compared with the SVM, the CNN method is less affected by the decreasing signal-to-noise ratio and differs little from the GRNN method at high signal-to-noise ratio, where the GRNN method is only somewhat better; but at low signal-to-noise ratio, especially 0 dB, the error of the CNN method becomes large. The GRNN method is least affected by the decrease in signal-to-noise ratio: when the signal-to-noise ratio is above 5 dB it achieves accurate identification, and at 0 dB it still achieves approximate identification. Overall, the GRNN can robustly identify seal species at various signal-to-noise ratios because the MFCC-GRNN model combines the feature advantages of the MFCC with the nonlinear fitting capability of the GRNN.
Table 1 GRNN method seal class identification accuracy
SNR/dB     A        B        C        Overall
No noise   0.9956   0.9828   0.9424   0.9616
10         0.9867   0.9553   0.9124   0.9344
5          0.9757   0.9467   0.8937   0.9200
0          0.9115   0.9065   0.8647   0.8838
Table 2 CNN method seal class identification accuracy
SNR/dB     A        B        C        Overall
No noise   0.9823   0.9713   0.9171   0.9423
10         0.9646   0.9524   0.9061   0.9279
5          0.9292   0.9346   0.8918   0.9104
0          0.8717   0.8521   0.8544   0.8552
Table 3 SVM method seal class identification accuracy
SNR/dB     A        B        C        Overall
No noise   0.9862   0.9731   0.9252   0.9478
10         0.9403   0.9335   0.8966   0.9137
5          0.8805   0.8539   0.8302   0.8956
0          0.8164   0.8200   0.8163   0.8176

Claims (3)

1. A generalized regression neural network acoustic signal identification method utilizing Mel frequency cepstrum coefficients is characterized by comprising the following steps:
Step 1: extracting MFCC characteristics from the acquired underwater acoustic signals;
Step 2: performing FFT on each frame of signal to obtain a frequency spectrum;
step 3: filtering the frequency spectrum through a group of triangular band-pass filters to obtain Mel filtering;
step 4: calculating the logarithmic energy output by each filter and its discrete cosine transform to obtain the L-order MFCC;
Step 5: calculating cepstrum differential parameters by using the L MFCC cepstrum coefficients, and combining the three parameters of the MFCC, the first-order cepstrum differential parameters and the second-order cepstrum differential parameters to serve as feature vectors of signals;
step 6: 3/4 of all processed data is used as training and validation data, and the remaining data is used for testing the GRNN model;
Step 7: the training and verifying data for the model are randomly divided into k folds, the optimal expansion factors are determined by a k-fold cross verification method, and the measurement indexes of the identification results of the verification set under different expansion factors are expressed as follows:
Where N is the number of samples, The number of correctly identified samples out of the N samples;
the validation set is 1 fold and the training set is the remaining k-1 folds;
for each expansion factor, the training set is first used for training, then the validation set is tested and the corresponding accuracy is calculated; with each fold serving once as the validation set, this process is repeated, and the mean of the k accuracies, namely the average accuracy, is computed;
Repeating the steps for all the expansion factors, and taking the expansion factor corresponding to the minimum value of the average accuracy as the optimal expansion factor;
Step 8: the obtained optimal expansion factor is used as the parameter of the GRNN model; the training data are used to train the GRNN model, test data at several signal-to-noise ratios are input into the model for acoustic signal recognition, and the recognition results are collected to analyse the recognition performance of the GRNN model; the trained and tested GRNN model then performs recognition on real-time data.
2. The generalized regression neural network acoustic signal recognition method utilizing Mel frequency cepstrum coefficients according to claim 1, wherein: when the MFCC features are extracted, the acquired acoustic signal is preprocessed by pre-emphasis, framing and windowing.
3. The generalized regression neural network acoustic signal recognition method utilizing Mel frequency cepstrum coefficients according to claim 1, wherein: the value range of the expansion factor is 0.01, 0.02, …, 0.1, with a step size of 0.01.
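Steps 2–5 of claim 1 can be sketched as follows. This is a minimal illustration only: the 26-filter Mel bank and 12th-order MFCC are assumed defaults (the claims do not fix these values), and the input frames are taken to be already pre-emphasised, framed and windowed as in claim 2.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_features(frames, fs, n_filters=26, L=12):
    """L-order MFCCs per frame: FFT -> triangular Mel filter bank -> log -> DCT."""
    n_fft = frames.shape[1]
    # Step 2: power spectrum of each frame
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 3: triangular band-pass filters equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor(mel_to_hz(mel_pts) * n_fft / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    # Step 4: log filter-bank energies, then DCT-II, keeping coefficients 1..L
    log_e = np.log(spectrum @ fbank.T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(L + 1), 2 * n + 1) / (2 * n_filters))
    return (log_e @ dct.T)[:, 1:]

def cepstral_delta(c):
    """Step 5: simple first-order cepstral difference over frames
    (apply twice for the second-order difference)."""
    d = np.zeros_like(c)
    d[1:-1] = (c[2:] - c[:-2]) / 2.0
    return d
```

The feature vector of step 5 is then the concatenation `np.hstack([feats, cepstral_delta(feats), cepstral_delta(cepstral_delta(feats))])`, giving 3L dimensions per frame.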
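Steps 6–8 (GRNN recognition with the expansion factor chosen by k-fold cross-validation, using the candidate range of claim 3) can be sketched as below. The Gaussian-kernel GRNN form and the one-hot class targets are standard choices assumed for this illustration, not text from the claims.

```python
import numpy as np

def grnn_predict(X_train, Y_train, X_query, sigma):
    """GRNN output: Gaussian-kernel weighted average of training targets.
    Y_train is one-hot (n_train, n_classes); returns predicted class indices."""
    # squared Euclidean distances between query and training samples
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    y = w @ Y_train / (w.sum(axis=1, keepdims=True) + 1e-12)
    return y.argmax(axis=1)

def select_sigma(X, labels, sigmas, k=5, seed=0):
    """Steps 7-8: k-fold cross-validation over the candidate expansion factors;
    return the factor with the highest average validation accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    Y = np.eye(labels.max() + 1)[labels]          # one-hot targets
    best_sigma, best_acc = None, -1.0
    for s in sigmas:
        accs = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            pred = grnn_predict(X[trn], Y[trn], X[val], s)
            accs.append((pred == labels[val]).mean())   # Acc = N_c / N
        mean_acc = np.mean(accs)
        if mean_acc > best_acc:
            best_sigma, best_acc = s, mean_acc
    return best_sigma, best_acc
```

With the claim 3 range, the call would be `select_sigma(X, labels, np.arange(0.01, 0.11, 0.01))`; the returned factor is then used as the GRNN parameter for training on the full training data and testing at each signal-to-noise ratio.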
CN202210304605.4A 2022-03-21 2022-03-21 Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient Active CN115331678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210304605.4A CN115331678B (en) 2022-03-21 2022-03-21 Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient

Publications (2)

Publication Number Publication Date
CN115331678A CN115331678A (en) 2022-11-11
CN115331678B true CN115331678B (en) 2024-10-22

Family

ID=83915684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210304605.4A Active CN115331678B (en) 2022-03-21 2022-03-21 Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient

Country Status (1)

Country Link
CN (1) CN115331678B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115840877B (en) * 2022-12-06 2023-07-07 中国科学院空间应用工程与技术中心 Distributed stream processing method, system, storage medium and computer for MFCC extraction

Citations (2)

Publication number Priority date Publication date Assignee Title
CN105611477A (en) * 2015-12-27 2016-05-25 北京工业大学 Depth and breadth neural network combined speech enhancement algorithm of digital hearing aid
CN108447470A (en) * 2017-12-28 2018-08-24 中南大学 A kind of emotional speech conversion method based on sound channel and prosodic features

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US20180082679A1 (en) * 2016-09-18 2018-03-22 Newvoicemedia, Ltd. Optimal human-machine conversations using emotion-enhanced natural speech using hierarchical neural networks and reinforcement learning
KR20190019726A (en) * 2017-08-18 2019-02-27 인하대학교 산학협력단 System and method for hidden markov model based uav sound recognition using mfcc technique in practical noisy environments
CN108847244A (en) * 2018-08-22 2018-11-20 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Voiceprint recognition method and system based on MFCC and improved BP neural network

Also Published As

Publication number Publication date
CN115331678A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN108172238B (en) Speech enhancement algorithm based on multiple convolutional neural networks in speech recognition system
CN111816218A (en) Voice endpoint detection method, device, equipment and storage medium
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
CN106847293A (en) Facility cultivation sheep stress behavior acoustical signal monitoring method
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
Wang et al. Deep learning assisted time-frequency processing for speech enhancement on drones
Dubey et al. Non-intrusive speech quality assessment using several combinations of auditory features
CN113191178B (en) Underwater sound target identification method based on auditory perception feature deep learning
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
CN112908344A (en) Intelligent recognition method, device, equipment and medium for bird song
CN115331678B (en) Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
CN114283829B (en) Voice enhancement method based on dynamic gating convolution circulation network
CN117789699B (en) Speech recognition method, device, electronic equipment and computer readable storage medium
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN118351881A (en) Fusion feature classification and identification method based on noise reduction underwater sound signals
CN115472168B (en) Short-time voice voiceprint recognition method, system and equipment for coupling BGCC and PWPE features
CN117198324A (en) Bird sound identification method, device and system based on clustering model
CN116863956A (en) Robust snore detection method and system based on convolutional neural network
CN113488069B (en) Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network
CN116417011A (en) Underwater sound target identification method based on feature fusion and residual CNN
CN112201226B (en) Sound production mode judging method and system
TWI749547B (en) Speech enhancement system based on deep learning
CN113707172A (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant