CN118351881A - Fusion feature classification and identification method based on noise reduction underwater sound signals - Google Patents
Fusion feature classification and identification method based on noise reduction underwater sound signals
- Publication number
- CN118351881A (application number CN202410441314.9A)
- Authority
- CN
- China
- Prior art keywords
- signal
- wvd
- underwater sound
- fusion
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L25/51—Speech or voice analysis techniques specially adapted for comparison or discrimination
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/24—Classification techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- G06N3/045—Combinations of networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/09—Supervised learning
- G10L21/0208—Speech enhancement: noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Noise filtering: processing in the time domain
- G10L21/0232—Noise filtering: processing in the frequency domain
- G10L25/18—Extracted parameters being spectral information of each sub-band
- G10L25/24—Extracted parameters being the cepstrum
- G10L25/27—Speech or voice analysis characterised by the analysis technique
- G10L25/30—Analysis technique using neural networks
- Y02A90/30—Technologies for adaptation to climate change: assessment of water resources
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Quality & Reliability (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
Abstract
The invention relates to a fusion feature classification and identification method based on noise reduction underwater sound signals, and belongs to the field of underwater acoustic signal processing. The method comprises the following steps: inputting an original underwater acoustic signal and preprocessing it; performing multi-feature extraction on the preprocessed underwater acoustic signal, extracting features at least by means of the short-time Fourier transform, the WVD time-frequency transform and Mel-frequency cepstral coefficients; inputting the extracted short-time Fourier transform features, WVD time-frequency features and Mel-frequency cepstral coefficient features into a feature fusion module for fusion to obtain fused features; and inputting the fused features into a convolutional neural network based on a channel attention mechanism for training, the trained network being used for underwater acoustic signal recognition. The invention is more sensitive to signal changes and interference, and further improves recognition accuracy.
Description
Technical Field
The invention belongs to the field of underwater acoustic signal processing, and relates to a fusion characteristic classification and identification method based on noise reduction underwater acoustic signals.
Background
In recent years, underwater acoustic signal recognition technology has developed continuously, and target recognition of underwater acoustic signals has become a research hotspot in the field of ocean technology. Domestic universities have achieved remarkable theoretical innovations in this field, particularly important breakthroughs in deep learning and signal processing. Research teams at domestic universities have proposed many new methods and techniques. In the prior art, depthwise separable convolution and temporal dilated convolution have been applied to a network model whose input is the one-dimensional time-domain signal of a ship, improving accuracy by 6.8% compared with other traditional target recognition models. In the prior art, the temporal characteristics of underwater acoustic signals have been processed with a bidirectional long short-term memory network, and recognition robustness has been improved by capturing long-term dependencies in the signals. GAO T proposed an end-to-end lightweight mobile network structure with fewer parameters than other networks and added a coordinate attention mechanism, which greatly improves the recognition rate of the network model. Because publicly available underwater acoustic signal data sets contain few samples, classification and recognition with deep learning networks are prone to overfitting; to address this problem, a combination of a deep convolutional generative adversarial network and a densely connected convolutional network has been proposed to expand the sample data set and improve recognition accuracy.
Deep learning is a branch of machine learning whose core idea is to train a complex model on massive data and extract highly representative feature information from the data. Such feature information can greatly improve the accuracy of sample classification and prediction. A deep learning model is characterized by multiple hidden layers, each responsible for abstracting and representing the data from a different perspective. Through layer-by-layer propagation and training, the model learns the internal rules and complex patterns of the data and thereby processes the data accurately. Among deep learning models, convolutional neural networks (CNN) have received much attention in fields such as image classification and recognition and text and speech understanding because of their strong feature extraction capability. Conventional convolutional neural networks typically employ pooling layers in image processing to reduce data dimensionality and improve computational efficiency, with max pooling and average pooling being the most common. However, these methods may not adequately capture the details and features of the data in certain circumstances.
Disclosure of Invention
In view of the above, the present invention aims to provide a fusion feature classification and identification method based on noise reduction underwater sound signals.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A fusion feature classification and identification method based on noise reduction underwater sound signals comprises the following steps:
S1, inputting an original underwater sound signal, and preprocessing the original underwater sound signal;
S2, carrying out multi-feature extraction on the preprocessed underwater sound signal, and extracting the underwater sound signal features at least by adopting the short-time Fourier transform, the WVD time-frequency transform and Mel-frequency cepstral coefficients;
S3, inputting the extracted short-time Fourier transform underwater acoustic signal characteristics, WVD time-frequency transform underwater acoustic signal characteristics and Mel frequency cepstrum coefficient underwater acoustic signal characteristics into a characteristic fusion module for fusion, so as to obtain fusion characteristics;
S4, inputting the fusion characteristics into a convolutional neural network based on a channel attention mechanism for training, wherein the convolutional neural network after training is used for underwater sound signal identification.
Further, in step S1, the preprocessing of the original underwater sound signal mainly comprises noise reduction, the noise reduction adopting empirical mode decomposition (EMD), ensemble empirical mode decomposition (EEMD), complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), wavelet transform decomposition, or a CEEMDAN-IWT-SSA based noise reduction method,
the CEEMDAN-IWT-SSA based noise reduction method accomplishing noise reduction by combining the adaptive noise handling capability of CEEMDAN, the noise reduction characteristics of the improved wavelet threshold (IWT) and the signal-to-noise separation capability of SSA.
Further, in step S2, the features of STFT, WVD and MFCCs are extracted for the preprocessed underwater sound signal, respectively, using short-time fourier transform, WVD time-frequency transform and mel-frequency cepstrum coefficients, wherein,
The STFT features are extracted with the short-time Fourier transform: over a short time interval, a non-stationary signal s(t) can be regarded as an approximately stationary signal s(t)·h(t), and the Fourier transform is then applied to s(t)·h(t), specifically:
The Fourier transform of a stationary signal f(t) is:
$$F(f)=\int_{-\infty}^{+\infty} f(t)\,e^{-j2\pi ft}\,dt$$
Starting from the stationary case, the short-time Fourier transform of the non-stationary signal s(t) is expressed as:
$$STFT(t,f)=\int_{-\infty}^{+\infty} s(\tau)\,h^{*}(\tau-t)\,e^{-j2\pi f\tau}\,d\tau$$
wherein h(t) is the analysis window function; h*(τ-t) is the complex conjugate of h(τ-t); STFT(t,f) is the short-time Fourier transform of the non-stationary signal s(t).
Further, in step S2, WVD features are extracted by the WVD time-frequency transform; for an energy-limited continuous signal x(t), the continuous WVD distribution is defined as:
$$WVD_x(t,\Omega)=\int_{-\infty}^{+\infty} x\!\left(t+\frac{\tau}{2}\right)x^{*}\!\left(t-\frac{\tau}{2}\right)e^{-j\Omega\tau}\,d\tau$$
The WVD distribution satisfies the quadratic superposition principle: when $x(t)=ax_1(t)+bx_2(t)$, the WVD satisfies:
$$WVD_x(t,\Omega)=|a|^{2}WVD_{x_1}(t,\Omega)+|b|^{2}WVD_{x_2}(t,\Omega)+ab^{*}WVD_{x_1x_2}(t,\Omega)+a^{*}bWVD_{x_2x_1}(t,\Omega)$$
where $|a|^{2}WVD_{x_1}(t,\Omega)$ and $|b|^{2}WVD_{x_2}(t,\Omega)$ are the auto-WVDs of the two components of the signal x(t), and $ab^{*}WVD_{x_1x_2}(t,\Omega)$ and $a^{*}bWVD_{x_2x_1}(t,\Omega)$ are the cross-WVDs of the two components, i.e. their cross terms; the number of cross terms grows with the number of components contained in the signal;
Further, in step S2, the MFCC features are extracted using Mel-frequency cepstral coefficients through the following steps:
(1) Preprocessing, including pre-emphasis, framing and windowing, wherein framing and windowing are consistent with the short-time Fourier transform process; the spectrum is smoothed by boosting the high frequencies of the signal with an added filter, the added filter being a first-order high-pass filter:
$$H(z)=1-\frac{R(1)}{R(0)}z^{-1}$$
wherein R(n) is the autocorrelation coefficient of the signal x(n); R(1)/R(0) is the pre-emphasis coefficient; the pre-emphasized signal needs to be de-emphasized after processing;
(2) Performing an FFT on each frame of the signal, i.e. $X_i(k)=\mathrm{FFT}[x_i(n)]$;
(3) Mel filter bank and logarithm operation: first, the spectral line energy is calculated:
$$E_i(k)=\left|X_i(k)\right|^{2}$$
the energy spectrum is passed through the Mel filter bank, and the logarithm of the Mel-filtered energy is taken:
$$S_i(m)=\ln\!\left[\sum_{k=0}^{N-1}E_i(k)\,H_m(k)\right],\qquad 0\le m<M$$
wherein $H_m(k)$ is the frequency response of the m-th Mel filter and N is the number of FFT points;
(4) Discrete cosine transform, the DCT being computed from the logarithmic Mel filter energies:
$$C(n)=\sum_{m=0}^{M-1}S_i(m)\cos\!\left(\frac{\pi n\,(m+0.5)}{M}\right),\qquad n=1,2,\dots,L$$
where M is the number of Mel filters; L is the order of the MFCC coefficients.
Further, in step S3, the extracted STFT, WVD and MFCC features are fused with an early fusion method; the STFT, WVD and MFCC features are two-dimensional feature matrices, and the feature maps are padded or cropped so that the three features have the same dimensions, after which a channel concatenation operation is performed to obtain the fused features.
Further, in step S4, convolution and pooling are performed on the fused features using convolution kernels of various sizes of the convolutional neural network to capture feature information of different scales in the image; the result is then processed by successive convolution and pooling layers; finally, the images are classified and recognized through the Flatten layer and the weighted computation of the Dense layers.
Further, in step S4, the convolutional neural network adopts ResNet residual structures, including a bottleneck residual structure and a non-bottleneck residual structure; the convolutional neural network comprises an input layer, a plurality of convolutional layers, a max pooling layer, an average pooling layer and a plurality of fully connected layers; batch normalization layers reduce the internal covariate shift of the model;
Considering the relationship between the channels of the fused features, an SE channel attention mechanism module is added to learn the feature weight of each channel, wherein $F_{sq}(\cdot)$ denotes the squeeze operation:
$$z_c=F_{sq}(u_c)=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}u_c(i,j)$$
wherein $u_c$ denotes the feature map of the c-th channel; $z_c$ denotes the vector obtained by globally averaging the feature map $u_c$;
After the feature maps are compressed into a feature vector, the feature values of the channels need to be combined; two fully connected layers are adopted to combine the features, with the formula:
$$s=F_{ex}(z,W)=\sigma\big(g(z,W)\big)=\sigma\big(W_2\,\delta(W_1 z)\big)$$
wherein σ(·) is the sigmoid function; δ(·) is the ReLU function; $W_1$ and $W_2$ are weight matrices;
The scale operation multiplies the original feature map of size H×W×C by the weight feature vector of size 1×1×C, i.e.:
$$F_{scale}(s_c,u_c)=s_c\,u_c$$
The SE channel attention mechanism module performs global average pooling on the feature maps to obtain a weight feature vector s; after the parameters of s are learned, s is multiplied by the corresponding feature map channels to amplify or attenuate the role of each channel, the weight parameter being larger when the channel is favourable for feature extraction and smaller otherwise.
The invention has the beneficial effects that:
The invention constructs an underwater acoustic signal classification and recognition model using a multi-feature-fusion convolutional neural network. Unlike a conventional CNN, this network architecture can process and fuse multiple signal features simultaneously to obtain more comprehensive and accurate information. By training and optimizing the network parameters, the model can learn different feature representations of underwater acoustic signals, thereby achieving accurate classification and recognition of different types of signals. In addition, the multi-feature fusion strategy enhances the robustness of the model, making it more sensitive to signal changes and interference and further improving recognition accuracy.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a general flow chart of a fusion feature classification recognition method based on noise reduction underwater sound signals;
FIG. 2 is a short-time Fourier transform visualization time-frequency diagram of four types of underwater acoustic signal samples according to an embodiment, FIG. 2(a) being a type A signal, FIG. 2(b) a type B signal, FIG. 2(c) a type C signal and FIG. 2(d) a type D signal;
FIG. 3 is a WVD distribution visualization time-frequency diagram of the four types of underwater acoustic signal samples according to an embodiment, FIG. 3(a) being a type A signal, FIG. 3(b) a type B signal, FIG. 3(c) a type C signal and FIG. 3(d) a type D signal;
FIG. 4 is an MFCC view of the four types of underwater acoustic signal samples according to an embodiment, FIG. 4(a) being a type A signal, FIG. 4(b) a type B signal, FIG. 4(c) a type C signal and FIG. 4(d) a type D signal;
FIG. 5 is a block diagram of a STFT, WVD, MFCC-based feature fusion module, under one embodiment;
FIG. 6 is a convolutional neural network ResNet residual structure under one embodiment, where FIG. 6 (a) is a bottleneck residual structure and FIG. 6 (b) is a non-bottleneck residual structure;
FIG. 7 is a convolutional neural network architecture based on a channel attention mechanism, under one embodiment;
FIG. 8 is a schematic diagram of confusion matrices in one embodiment, in which FIG. 8(a) is the fusion-feature classification recognition confusion matrix without sample expansion by a generative adversarial network, and FIG. 8(b) is the fusion-feature classification recognition confusion matrix with sample expansion by a generative adversarial network.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other different embodiments, and the details in this specification may be modified or varied based on different viewpoints and applications without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention in a schematic way, and the following embodiments and the features in the embodiments may be combined with each other without conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
Referring to fig. 1 to 8, a fusion feature classification and identification method based on noise reduction underwater sound signals is provided.
The overall flow of the fusion feature classification recognition method based on the noise reduction underwater sound signal, shown in fig. 1, comprises the following steps:
S1, inputting an original underwater sound signal, and preprocessing the original underwater sound signal;
S2, carrying out multi-feature extraction on the preprocessed underwater sound signal, and extracting the underwater sound signal features at least by adopting the short-time Fourier transform, the WVD time-frequency transform and Mel-frequency cepstral coefficients;
S3, inputting the extracted short-time Fourier transform underwater acoustic signal characteristics, WVD time-frequency transform underwater acoustic signal characteristics and Mel frequency cepstrum coefficient underwater acoustic signal characteristics into a characteristic fusion module for fusion, so as to obtain fusion characteristics;
S4, inputting the fusion characteristics into a convolutional neural network based on a channel attention mechanism for training, wherein the convolutional neural network after training is used for underwater sound signal identification.
Examples
The fusion characteristic classification and identification method based on the noise reduction underwater sound signal is further described in detail by combining MATLAB simulation technology.
In the deep-learning-based underwater acoustic signal target recognition process, the hardware and software experimental environment is as follows: the operating system is Windows 10, the GPU is a GTX 3080 Ti, and the programming environment is MATLAB R2019b and the PyTorch framework in Python. The Librosa audio processing library was used in Python.
In this embodiment, the accuracy rate (PR) is the most commonly used metric in deep-learning underwater acoustic signal target recognition. Accuracy is defined as the proportion of correctly recognized targets among all inputs to a given system. Expressed mathematically, if the number of targets correctly recognized by the system is A and the total number of inputs is A+B (where B is the number of incorrect recognitions), the accuracy is calculated as:
$$PR=\frac{A}{A+B}$$
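For illustration, a minimal Python sketch of this accuracy computation (the function name and array layout are assumptions used only for this example):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Accuracy PR = A / (A + B): correctly recognized targets over all inputs."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = np.sum(y_true == y_pred)   # A
    total = y_true.size                  # A + B
    return correct / total

# Example: 3 of 4 predictions match the ground-truth labels -> 0.75
print(accuracy([0, 1, 2, 3], [0, 1, 2, 0]))
```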
This index directly reflects how accurately the system recognizes underwater targets and is one of the indicators necessary for evaluating the performance of an intelligent recognition system. By comparing the accuracy of different noise reduction methods or neural network models, one can intuitively see which method or model performs better on the underwater acoustic signal classification and recognition problem. However, accuracy has a limited range of applicability: if the input samples are unbalanced, accuracy may give a misleading picture of the prediction results. Therefore, in addition to accuracy, this embodiment uses the more intuitive confusion matrix (Confusion Matrix) to measure the experimental results of signal classification and recognition.
The confusion matrix is a widely used visualization tool in machine learning and artificial intelligence, and is particularly suitable for evaluating supervised classification problems. It is a two-dimensional matrix with n rows and n columns, where n is the total number of classes, and each element represents the number of samples that actually belong to one class but are predicted by the model as another class. Here the true class is taken as the row and the predicted class as the column, so each row corresponds to the samples of one class in the ground-truth (reference) labels and each column corresponds to the samples of one class in the model predictions. The elements on the diagonal of the confusion matrix therefore represent the numbers of samples correctly predicted by the classifier, while the other elements represent the numbers of samples predicted incorrectly.
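A minimal sketch of building such a confusion matrix (pure NumPy; the function name is an illustrative assumption):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes, as described above."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

cm = confusion_matrix([0, 0, 1, 2, 2], [0, 1, 1, 2, 2], n_classes=3)
print(cm)                        # diagonal entries are the correctly classified samples
print(np.trace(cm) / cm.sum())   # overall accuracy PR
```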
In step S1, this embodiment uses the measured underwater acoustic signals recorded in the ShipsEar database to support the subsequent verification experiments. The data set provides detailed information including the type of vessel, recording date, specific location of the vessel, number and depth of hydrophones, current wind speed and weather conditions, etc., ensuring data integrity and reliability. In total, the data set contains 90 sound recordings in wav format with durations ranging from 15 seconds to ten minutes, providing rich sound samples for the experiments. According to vessel size and sailing speed, these underwater acoustic signals are divided into five major classes, class E being the collected marine environmental noise, as shown in Table 1 (ShipsEar data set vessel classification):
TABLE 1
Then, the original underwater acoustic signals in the ShipsEar database are preprocessed to remove noise, improve signal quality, extract information that effectively characterizes the underwater acoustic signal, and possibly normalize the signals to ensure uniformity across scales. These preprocessing steps can significantly enhance the recognizability of the signal and provide higher-quality input for the subsequent target recognition module. The preprocessing is mainly noise reduction, which may use empirical mode decomposition (EMD), ensemble empirical mode decomposition (EEMD), complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), wavelet transform decomposition, or the CEEMDAN-IWT-SSA based method. This embodiment uses the CEEMDAN-IWT-SSA based method to denoise the original underwater acoustic signal; by combining the adaptive noise handling capability of CEEMDAN, the efficient noise reduction of the improved wavelet threshold (IWT) and the signal-to-noise separation capability of SSA, a more effective noise reduction effect can be achieved in complex underwater acoustic environments. The method reduces distortion of the underwater acoustic signal and shows higher performance in underwater acoustic signal noise reduction.
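A structural sketch of such a CEEMDAN-IWT-SSA pipeline is given below. The three helper functions are placeholders for the corresponding algorithms and are assumptions for illustration only; the patent does not specify their implementations or parameters here:

```python
import numpy as np

def ceemdan_decompose(signal):
    """Placeholder: decompose the signal into intrinsic mode functions (IMFs) with CEEMDAN."""
    raise NotImplementedError

def improved_wavelet_threshold(imf):
    """Placeholder: denoise a noise-dominated IMF with the improved wavelet threshold (IWT)."""
    raise NotImplementedError

def ssa_denoise(signal):
    """Placeholder: separate residual noise from the reconstructed signal with SSA."""
    raise NotImplementedError

def denoise_ceemdan_iwt_ssa(signal, n_noisy_imfs=3):
    # 1) adaptive decomposition into IMFs
    imfs = ceemdan_decompose(signal)
    # 2) apply IWT only to the first, high-frequency, noise-dominated IMFs and keep
    #    the remaining IMFs unchanged; the split index is an assumption
    cleaned = [improved_wavelet_threshold(m) if i < n_noisy_imfs else m
               for i, m in enumerate(imfs)]
    # 3) reconstruct the signal and run SSA for the final signal-to-noise separation
    reconstructed = np.sum(cleaned, axis=0)
    return ssa_denoise(reconstructed)
```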
Further, in step S2, in the present embodiment, the characteristics of the acoustic signal after the noise reduction processing are extracted by using a short-time fourier transform, a WVD time-frequency transform, and mel-frequency cepstrum coefficients.
First, for stationary signals, the Fourier transform can be used to obtain the frequency-domain characteristics of the signal directly. However, most underwater acoustic signals are non-stationary, so frequency-domain features alone cannot fully characterize them; in practice, the short-time Fourier transform is usually adopted to process underwater acoustic signals. The short-time Fourier transform is the most common time-frequency analysis method, and its principle is that, within a short time interval, a non-stationary signal s(t) can be regarded as an approximately stationary signal s(t)·h(t), and the Fourier transform is then applied to s(t)·h(t).
The Fourier transform of a stationary signal f(t) is:
$$F(f)=\int_{-\infty}^{+\infty} f(t)\,e^{-j2\pi ft}\,dt$$
Starting from the stationary case, the short-time Fourier transform of the non-stationary signal s(t) can be expressed as:
$$STFT(t,f)=\int_{-\infty}^{+\infty} s(\tau)\,h^{*}(\tau-t)\,e^{-j2\pi f\tau}\,d\tau$$
wherein h(t) is the analysis window function; h*(τ-t) is the complex conjugate of h(τ-t); STFT(t,f) is the short-time Fourier transform of the non-stationary signal s(t).
From the above equation it follows that the spectral characteristics of different time periods can be obtained as the window function h(t) is shifted in time. If the window function h(t) is identically 1, i.e. a rectangular window of infinite duration, the short-time Fourier transform reduces to the conventional Fourier transform; the choice of window function is therefore particularly important for the short-time Fourier transform. Common window functions include the rectangular, Hanning, Hamming and Blackman windows, and the specific type is selected according to the actual research task. For signals with non-stationary characteristics, the time window must be sufficiently narrow that the truncated signal segment can be regarded as approximately stationary: the characteristics of a non-stationary signal change over time, and a narrow time window ensures that the signal characteristics are relatively stable over a short period, so that the spectral characteristics and their variation in that period can be reflected more accurately.
In practical applications, the short-time Fourier transform requires a suitable time window. Although it is simple, efficient and free of cross terms, the choice of the time window leads to a trade-off between time resolution and frequency resolution: a longer window function decreases the time resolution of the feature map and increases the frequency resolution, whereas a shorter window function increases the time resolution and decreases the frequency resolution. It is difficult to obtain high-resolution time features and frequency features at the same time, so a suitable window length is selected according to the time-frequency resolution requirements of the research.
FIG. 2 shows the short-time Fourier transform visualization of the four types of underwater acoustic signal samples: FIG. 2(a) is a type A signal, FIG. 2(b) a type B signal, FIG. 2(c) a type C signal and FIG. 2(d) a type D signal. The left side shows the original underwater acoustic signal and the right side shows its short-time Fourier transform time-frequency diagram; different types of underwater acoustic signals exhibit different time-frequency characteristics. The short-time Fourier transform is computed at the time t corresponding to each of the n points, and after the time-domain and frequency-domain analysis the simulation figures are drawn from the results. The underwater acoustic signal data set contains four classes of signals, and each audio signal is segmented into 2-second clips, each clip serving as one sample.
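As an illustrative sketch only (the file path, frame length and hop size are assumptions; the patent does not fix them here), the STFT feature map of one 2-second sample can be computed with Librosa as follows:

```python
import numpy as np
import librosa

# Load one 2-second underwater acoustic sample (path and native sample rate assumed)
y, sr = librosa.load("sample_2s.wav", sr=None)

# Short-time Fourier transform with a Hanning window; n_fft/hop_length are illustrative
stft = librosa.stft(y, n_fft=1024, hop_length=512, window="hann")
stft_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)  # log-magnitude time-frequency map

print(stft_db.shape)  # (frequency bins, time frames): a 2-D feature matrix for fusion
```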
Second, the Wigner-Ville distribution (WVD) is a typical quadratic transform, defined as the Fourier transform of the instantaneous autocorrelation function of a signal; it reflects the instantaneous time-frequency relationship of the signal.
Unlike the time-frequency analysis method above, the WVD has several very favourable properties, such as true marginal properties, weak finite support and translation invariance. These properties give the WVD high stability and reliability in time-frequency analysis and allow it to cope with a variety of complex signal environments. However, for multi-component linear frequency-modulated signals, the time-frequency resolution of the WVD may decrease and cross terms may appear in the time-frequency plane.
The WVD is an energy distribution, also known as the time-frequency energy density (TFED). For an energy-limited continuous signal x(t), the continuous WVD is defined as:
$$WVD_x(t,\Omega)=\int_{-\infty}^{+\infty} x\!\left(t+\frac{\tau}{2}\right)x^{*}\!\left(t-\frac{\tau}{2}\right)e^{-j\Omega\tau}\,d\tau$$
The WVD distribution satisfies the quadratic superposition principle: when $x(t)=ax_1(t)+bx_2(t)$, the WVD satisfies:
$$WVD_x(t,\Omega)=|a|^{2}WVD_{x_1}(t,\Omega)+|b|^{2}WVD_{x_2}(t,\Omega)+ab^{*}WVD_{x_1x_2}(t,\Omega)+a^{*}bWVD_{x_2x_1}(t,\Omega)$$
where $|a|^{2}WVD_{x_1}(t,\Omega)$ and $|b|^{2}WVD_{x_2}(t,\Omega)$ are the auto-WVDs of the two components of the signal x(t), and $ab^{*}WVD_{x_1x_2}(t,\Omega)$ and $a^{*}bWVD_{x_2x_1}(t,\Omega)$ are the cross-WVDs of the two components, i.e. their cross terms. As the equation shows, the cross terms are proportional to the components contained in the signal, so the more components the signal contains, the more cross terms appear.
FIG. 3 shows the WVD distribution visualization time-frequency diagrams of the four types of underwater acoustic signal samples: FIG. 3(a) is a type A signal, FIG. 3(b) a type B signal, FIG. 3(c) a type C signal and FIG. 3(d) a type D signal. The abscissa is time in milliseconds and the ordinate is normalized frequency in the range [-0.5, 0.5]. The time-frequency diagrams of the four types of signals differ, but because of the symmetry of the WVD, half of the time-frequency diagram is redundant and contributes little to the deep learning network; cropping or other processing can therefore be chosen in the subsequent convolutional neural network to reduce the repeated features.
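A simplified discrete sketch of the WVD computation is given below; it is an illustrative approximation (no analytic-signal conversion or smoothing against cross terms), not the implementation used in the patent:

```python
import numpy as np

def wigner_ville(x):
    """Simplified discrete Wigner-Ville distribution: FFT over the symmetric lag
    of the instantaneous autocorrelation x[n + tau] * conj(x[n - tau])."""
    x = np.asarray(x, dtype=complex)
    n_samples = len(x)
    wvd = np.zeros((n_samples, n_samples))
    for n in range(n_samples):
        tau_max = min(n, n_samples - 1 - n)
        taus = np.arange(-tau_max, tau_max + 1)
        kernel = np.zeros(n_samples, dtype=complex)
        kernel[taus % n_samples] = x[n + taus] * np.conj(x[n - taus])
        wvd[:, n] = np.fft.fft(kernel).real
    return wvd  # rows: frequency bins, columns: time samples

# A two-component signal produces cross terms between its two auto components
t = np.arange(256) / 256.0
sig = np.exp(2j * np.pi * 30 * t) + np.exp(2j * np.pi * 90 * t)
print(wigner_ville(sig).shape)
```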
Third, the Mel frequency is a measure of sound frequency in Mel units, designed to simulate the human ear's perception of sound frequency. The human ear is more sensitive to low-frequency sound signals and relatively less sensitive to high-frequency sounds, so the relationship between the frequency perceived by the human ear and the actual frequency is non-linear. In terms of the structure and function of the cochlea, it is effectively equivalent to a filter bank whose filtering action is not uniform: in the frequency range below 1000 Hz the filtering of the cochlea is linearly related to the sound frequency, whereas in the high-frequency region above 1000 Hz this relationship becomes logarithmic. The Mel-frequency cepstral coefficient is a parameter extracted in the Mel-frequency domain, and the conversion between Mel frequency and frequency is given by:
$$f_{mel}=2595\,\lg\!\left(1+\frac{f}{700}\right)$$
wherein $f_{mel}$ is the Mel (perceived) frequency; f is the actual frequency of the signal.
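This conversion is straightforward to compute; a small sketch for illustration:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel-frequency conversion f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

print(hz_to_mel([100, 1000, 4000]))  # perceived spacing compresses at high frequencies
```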
The Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) are parameters extracted on the Mel frequency scale that mimic the auditory properties of the human ear. MFCC not only reflect the frequency content of the sound signal but also fully take into account the perceptual differences of the human ear at different frequencies. MFCC are extracted in the Mel-frequency domain through the following steps:
(1) Pretreatment: the preprocessing process includes pre-emphasis, framing and windowing in addition to the previous noise reduction, where framing and windowing is the same as the short-time fourier transform process. The underwater acoustic signal itself has energy loss in the propagation process, so that the low-frequency signal part of the sound is larger in capacity, and the high-frequency signal part is more lost, so that in order to make the frequency spectrum smoother, the high frequency of the signal is reinforced, and a first-order high-pass filter is added:
$$H(z)=1-\frac{R(1)}{R(0)}z^{-1}$$
wherein R(n) is the autocorrelation coefficient of the signal x(n); R(1)/R(0) is the pre-emphasis coefficient; the pre-emphasized signal needs to be de-emphasized after processing.
(2) FFT: performing FFT on the signal of each frame, namely X i(k)=FFT[xi (n) ];
(3) Mel filter bank and logarithm operation: first, the spectral line energy is calculated:
$$E_i(k)=\left|X_i(k)\right|^{2}$$
the energy spectrum is passed through the Mel filter bank, and the logarithm of the Mel-filtered energy is taken:
$$S_i(m)=\ln\!\left[\sum_{k=0}^{N-1}E_i(k)\,H_m(k)\right],\qquad 0\le m<M$$
wherein $H_m(k)$ is the frequency response of the m-th Mel filter and N is the number of FFT points;
(4) Discrete cosine transform: the energy of the mel filter is logarithmically calculated as DCT:
Where M is the number of mel filters (M total); l is the order of the MFCC coefficients, typically chosen to be 12-16.
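Referring back to step (1), a minimal sketch of the autocorrelation-based pre-emphasis filter H(z) = 1 - (R(1)/R(0))z^{-1} (for illustration only; the function name is an assumption):

```python
import numpy as np

def pre_emphasis(x):
    """First-order high-pass pre-emphasis y[n] = x[n] - a * x[n-1], with a = R(1)/R(0)."""
    x = np.asarray(x, dtype=float)
    r0 = np.dot(x, x)            # autocorrelation R(0)
    r1 = np.dot(x[:-1], x[1:])   # autocorrelation R(1)
    a = r1 / r0                  # pre-emphasis coefficient
    return np.append(x[0], x[1:] - a * x[:-1])
```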
FIG. 4 shows the MFCC features of the four types of underwater acoustic signal samples: FIG. 4(a) is a type A signal, FIG. 4(b) a type B signal, FIG. 4(c) a type C signal and FIG. 4(d) a type D signal; the number of Mel filters is 41 and the frame length is 1024. The MFCC coefficients of each type of signal in FIG. 4 differ, and the comparison shows that MFCC features can effectively classify and recognize the underwater acoustic signals.
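A hedged sketch of extracting such an MFCC feature matrix with Librosa (the library mentioned in this embodiment); the exact mapping of parameters to the 41 Mel filters and 1024-point frame quoted above, and the choice of 13 coefficients, are assumptions:

```python
import librosa

y, sr = librosa.load("sample_2s.wav", sr=None)   # one 2-second sample (path is illustrative)

# 41 Mel filters and a 1024-point frame, mirroring the values quoted for FIG. 4
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=41,
                            n_fft=1024, hop_length=512)
print(mfcc.shape)  # (n_mfcc, time frames): a 2-D feature matrix for fusion
```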
Further, in step S3, according to the underwater acoustic signal time-frequency feature maps obtained in step S2, the three feature maps are fused and input into the network model; the feature fusion module effectively concatenates the three signal feature maps along the channel dimension and uses the result as the input of the convolutional neural network. Feature fusion falls into two types:
(1) Early fusion: early Fusion refers to fusing features of different layers or different scales at an earlier stage of the network. This is typically accomplished by feature stitching (concatenation) or feature addition (summation). The early fusion has the advantages that the network can learn the characteristic information of different scales simultaneously in the training process, and the network is facilitated to better understand and identify the complex mode in the image.
(2) Late fusion: late Fusion (Late Fusion) is the Fusion of features at different layers or scales at a later stage of the network, usually before the last fully connected layer. This is typically achieved by means of weighted averaging or voting of the outputs of the different layers. The benefit of late fusion is that the risk of network overfitting can be reduced, as the network can focus on learning feature information of different scales at different stages. In addition, late fusion can also make the network focus more on the characteristic information influencing the final decision in the training process.
In this embodiment, the extracted STFT, WVD and MFCC features are fused with the early fusion method and then used as the input of the convolutional neural network; after the fused features pass through the backbone network, a channel attention mechanism is used to emphasize the relationship between the channel feature maps so that the model can attend to the more important features. Specifically, the block diagram of the STFT/WVD/MFCC feature fusion module is shown in FIG. 5. After the underwater acoustic signal is processed by the three feature extraction methods of step S2, each extracted feature is a two-dimensional feature matrix, so its number of channels is 1. Because the lengths and widths obtained by the different feature extraction methods differ, the feature maps must be padded or cropped before fusion so that the three features have the same dimensions, after which the channel concatenation operation is performed to obtain the fused features.
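A minimal sketch of this pad-or-crop-and-concatenate step (PyTorch tensors; the target size of 224x224 is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

def pad_or_crop(x, height, width):
    """Crop from the top-left and/or zero-pad a 2-D feature matrix to (height, width)."""
    x = x[:height, :width]                                  # crop if too large
    pad_h, pad_w = height - x.shape[0], width - x.shape[1]  # pad if too small
    return F.pad(x, (0, pad_w, 0, pad_h))

def fuse_features(stft_feat, wvd_feat, mfcc_feat, size=(224, 224)):
    """Early fusion: stack the three aligned feature maps as a 3-channel input."""
    maps = [pad_or_crop(torch.as_tensor(m, dtype=torch.float32), *size)
            for m in (stft_feat, wvd_feat, mfcc_feat)]
    return torch.stack(maps, dim=0)   # shape (3, H, W)

fused = fuse_features(torch.rand(513, 87), torch.rand(400, 400), torch.rand(13, 87))
print(fused.shape)  # torch.Size([3, 224, 224])
```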
Further, in step S4, in this embodiment, the feature fusion module fuses the underwater acoustic signal feature maps of different dimensions; convolution and pooling operations are then performed with convolution kernels of various sizes of the convolutional neural network to capture feature information of different scales in the image; the result is further processed by successive convolution and pooling layers; finally, effective classification and recognition of the images is achieved through the Flatten layer and the weighted computation of the Dense layers. This embodiment adopts a classical ResNet-based deep learning model, which can effectively prevent network degradation. ResNet introduced the residual module to solve the problem that, when the network is built too deep, the gradients of the intermediate layers tend to zero and cannot be back-propagated, which harms performance. The ResNet residual structure is shown in FIG. 6, where 'Conv' is a standard convolution and 'Add' is an element-wise addition.
The residual modules are of two types: the bottleneck residual structure shown in FIG. 6(a) and the non-bottleneck residual structure shown in FIG. 6(b). The bottleneck residual structure increases or decreases the channel dimension through 1×1 convolutions; it is mainly used in deep networks, where the 1×1 convolutions reduce the number of parameters. The non-bottleneck residual structure is mainly used in relatively shallow networks.
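For illustration, a minimal PyTorch sketch of a non-bottleneck residual block (the standard two 3x3 convolutions plus shortcut is assumed here; the patent's exact block parameters are those of its figures and Table 2):

```python
import torch
import torch.nn as nn

class NonBottleneckBlock(nn.Module):
    """Non-bottleneck residual block: two 3x3 convolutions plus a shortcut 'Add'."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection on the shortcut only when the output shape changes
        self.shortcut = (nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch))
                         if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))   # element-wise 'Add'

block = NonBottleneckBlock(64, 128, stride=2)
print(block(torch.rand(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```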
The classification task is a relatively simple computer vision task; if the network is too deep, overfitting occurs easily. ResNet-18, which has fewer layers, is therefore chosen as the basic structure of the network, and the non-bottleneck structure is chosen as the residual structure. The network structure parameters are shown in Table 2:
TABLE 2
The CNN used in this embodiment has 1 input layer, 17 convolutional layers, 1 max pooling layer, 1 average pooling layer and 3 fully connected layers in total. Batch normalization layers placed after the convolutional layers of the sequential model reduce the internal covariate shift of the model, help overcome overfitting of the whole network and aid learning. This design helps improve the accuracy of the overall model.
Because the three features are concatenated along the channel dimension, the convolutional neural network needs to consider the relationships between channels after the ResNet convolutional layers; therefore an SE channel attention mechanism module is added after the ResNet network to learn the feature weight of each channel. The convolutional neural network structure based on the channel attention mechanism is shown in FIG. 7, where $F_{sq}(\cdot)$ denotes the squeeze operation:
$$z_c=F_{sq}(u_c)=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}u_c(i,j)$$
wherein $u_c$ denotes the feature map of the c-th channel; $z_c$ denotes the vector obtained by globally averaging the feature map $u_c$.
After the feature maps are compressed into a feature vector, the feature values of the channels need to be combined; the excitation operation must capture a non-linear and non-mutually-exclusive relationship between channels, so two fully connected layers are adopted to combine the features, with the formula:
$$s=F_{ex}(z,W)=\sigma\big(g(z,W)\big)=\sigma\big(W_2\,\delta(W_1 z)\big)$$
wherein σ(·) is the sigmoid function; δ(·) is the ReLU function; $W_1$ and $W_2$ are weight matrices.
The final scale operation multiplies the original feature map of size H×W×C by the weight feature vector of size 1×1×C, namely:
$$F_{scale}(s_c,u_c)=s_c\,u_c$$
Therefore, the essence of the SE channel attention mechanism module is to perform global average pooling on the feature maps to obtain a weight feature vector s; after the parameters of s are learned, s is multiplied by the corresponding feature map channels to amplify or attenuate the role of each channel, the weight parameter being larger when the channel is favourable for feature extraction and smaller otherwise.
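A minimal PyTorch sketch of such an SE channel attention module (the reduction ratio is an assumption; the patent does not state it here):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling, two FC layers, channel-wise scaling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)      # F_sq: (B, C, H, W) -> (B, C, 1, 1)
        self.excite = nn.Sequential(                # F_ex: sigmoid(W2 * ReLU(W1 * z))
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)        # squeeze to a channel descriptor z
        s = self.excite(z).view(b, c, 1, 1)   # per-channel weights s
        return u * s                          # F_scale: reweight each channel

se = SEBlock(channels=512)
print(se(torch.rand(2, 512, 7, 7)).shape)  # torch.Size([2, 512, 7, 7])
```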
The network training parameters of the convolutional neural network structure based on the channel attention mechanism used in this embodiment are shown in Table 3.
TABLE 3 Table 3
The fused features are input into the trained convolutional neural network, and the resulting confusion matrices are shown in FIG. 8: FIG. 8(a) is the fusion-feature classification recognition confusion matrix without sample expansion by a generative adversarial network, and FIG. 8(b) is the classification recognition confusion matrix based on fused features with generative adversarial network sample expansion. The classification recognition accuracy of the fused features without data enhancement reaches 93.73%, which satisfies most task requirements. Because the ShipsEar data set has a relatively small number of samples, data enhancement is performed with a generative adversarial network: part of the training set is first fed into the generative adversarial network for training, and the generator with trained parameters can then generate additional underwater acoustic signal samples. With data enhancement, the classification recognition accuracy of the underwater acoustic signals reaches 95.59%.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.
Claims (8)
1. A fusion feature classification and identification method based on noise reduction underwater sound signals, characterized by comprising the following steps:
S1, inputting an original underwater sound signal, and preprocessing the original underwater sound signal;
S2, carrying out multi-feature extraction on the preprocessed underwater sound signal, extracting the underwater sound signal features at least by means of the short-time Fourier transform, the WVD time-frequency transform and Mel-frequency cepstrum coefficients;
S3, inputting the extracted short-time Fourier transform underwater acoustic signal characteristics, WVD time-frequency transform underwater acoustic signal characteristics and Mel frequency cepstrum coefficient underwater acoustic signal characteristics into a characteristic fusion module for fusion, so as to obtain fusion characteristics;
S4, inputting the fusion characteristics into a convolutional neural network based on a channel attention mechanism for training, wherein the convolutional neural network after training is used for underwater sound signal identification.
2. The fusion feature classification and identification method based on noise reduction underwater sound signals according to claim 1, wherein: in step S1, the preprocessing of the original underwater sound signal mainly comprises noise reduction, and the noise reduction adopts empirical mode decomposition (EMD), ensemble empirical mode decomposition (EEMD), complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), wavelet-transform decomposition, or a CEEMDAN-IWT-SSA-based noise reduction method,
wherein the CEEMDAN-IWT-SSA-based noise reduction method accomplishes noise reduction by combining the adaptive noise-handling capability of CEEMDAN, the noise reduction characteristics of the improved wavelet threshold (IWT), and the signal-to-noise separation capability of SSA.
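As a rough illustration of only the wavelet-threshold stage of such noise reduction, the sketch below applies a standard soft universal threshold with PyWavelets; the patent's improved threshold function (IWT) and the CEEMDAN and SSA stages are not reproduced here, and the wavelet family and decomposition level are assumptions.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=4):
    """Standard soft-threshold wavelet denoising, shown only to illustrate the
    wavelet-threshold stage (not the improved threshold of the claim)."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # universal threshold estimated from the finest-scale detail coefficients
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(x)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(x)]
```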
3. The fusion feature classification and identification method based on noise reduction underwater sound signals according to claim 2, wherein: in step S2, the STFT, WVD and MFCCs features are extracted from the preprocessed underwater sound signal by the short-time Fourier transform, the WVD time-frequency transform and the Mel-frequency cepstrum coefficients, respectively, wherein
the STFT feature is extracted by the short-time Fourier transform: for a non-stationary signal s(t), the windowed segment s(t)·h(t) can be regarded as approximately stationary within a short period of time, and the Fourier transform is then applied to s(t)·h(t), specifically:
The Fourier transform of a stationary signal f(t) is known as:
F(ω) = ∫_{-∞}^{+∞} f(t) e^{-jωt} dt
Starting from the stationary case, the short-time Fourier transform of the non-stationary signal s(t) is expressed as:
STFT(t, f) = ∫_{-∞}^{+∞} s(τ) h*(τ - t) e^{-j2πfτ} dτ
Wherein h(t) is the analysis window function; h*(τ - t) is the complex conjugate of h(τ - t); STFT(t, f) is the short-time Fourier transform of the non-stationary signal s(t).
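A minimal sketch of STFT feature extraction is given below, using scipy and a Hann analysis window on a synthetic test tone; the sampling rate, window length and overlap are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                   # sampling rate (illustrative)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)              # stand-in for a preprocessed signal

# h(t) is the analysis window (here a Hann window); each windowed segment
# s(t)h(t) is treated as approximately stationary and Fourier transformed.
f, frames_t, Zxx = stft(x, fs=fs, window="hann", nperseg=1024, noverlap=512)
stft_feature = np.abs(Zxx)                   # |STFT(t, f)| used as the 2-D feature
```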
4. The fusion feature classification and identification method based on noise reduction underwater sound signals according to claim 3, wherein: in step S2, the WVD features are extracted by the WVD time-frequency transform; for a continuous signal x(t) with finite energy, the continuous WVD is defined as:
WVD_x(t, Ω) = ∫_{-∞}^{+∞} x(t + τ/2) x*(t - τ/2) e^{-jΩτ} dτ
The WVD satisfies the quadratic superposition principle: when x(t) = a·x_1(t) + b·x_2(t), the WVD of x(t) satisfies:
WVD_x(t, Ω) = |a|² WVD_{x1}(t, Ω) + |b|² WVD_{x2}(t, Ω) + a b* WVD_{x1x2}(t, Ω) + a* b WVD_{x2x1}(t, Ω)
Wherein |a|² WVD_{x1}(t, Ω) and |b|² WVD_{x2}(t, Ω) are the auto-WVD terms of the two components of the signal x(t); a b* WVD_{x1x2}(t, Ω) and a* b WVD_{x2x1}(t, Ω) are the mutual WVD terms of the two components, i.e. the cross terms, and the number of cross terms grows with the number of components contained in the signal.
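A naive O(N²) numpy sketch of the discrete WVD corresponding to the definition above is shown below; in practice the analytic signal (e.g. via scipy.signal.hilbert) is often used first, and the parameters here are illustrative only.

```python
import numpy as np

def wigner_ville(x):
    """Naive discrete Wigner-Ville distribution; rows index frequency,
    columns index time. Intended for illustration, not efficiency."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    acf = np.zeros((n, n), dtype=complex)        # instantaneous autocorrelation
    for t in range(n):
        tau_max = min(t, n - 1 - t, n // 2 - 1)
        tau = np.arange(-tau_max, tau_max + 1)
        acf[tau % n, t] = x[t + tau] * np.conj(x[t - tau])
    return np.real(np.fft.fft(acf, axis=0))      # FFT over the lag variable

# Example: a two-component signal produces cross terms between its auto terms.
t = np.arange(256)
sig = np.exp(2j * np.pi * 0.1 * t) + np.exp(2j * np.pi * 0.3 * t)
wvd_feature = wigner_ville(sig)
```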
5. The fusion feature classification and identification method based on noise reduction underwater sound signals according to claim 4, wherein: in step S2, the MFCCs features are extracted by the Mel-frequency cepstrum coefficient method, which comprises the following steps:
(1) Preprocessing, comprising pre-emphasis, framing and windowing, wherein the framing and windowing are consistent with the short-time Fourier transform process; pre-emphasis boosts the high-frequency part of the signal with an added filter so as to flatten the spectrum, and the added filter is a first-order high-pass filter:
H(z) = 1 - [R(1)/R(0)] z^{-1}
Wherein R(n) is the autocorrelation coefficient of the signal x(n); R(1)/R(0) is the pre-emphasis coefficient; the pre-emphasized signal needs to be de-emphasized after processing;
(2) Performing FFT on the signal of each frame, namely X_i(k) = FFT[x_i(n)];
(3) Mel filter bank and logarithm operation: first, the energy of each spectral line is calculated:
E_i(k) = |X_i(k)|²
The energy spectrum is then passed through the Mel filter bank, and the logarithm of the output of each Mel filter is taken:
S_i(m) = ln[ Σ_k E_i(k) H_m(k) ], 1 ≤ m ≤ M
wherein H_m(k) is the frequency response of the m-th Mel filter;
(4) Discrete cosine transform: the DCT is applied to the logarithmic Mel-filter energies to obtain the MFCCs:
mfcc_i(n) = Σ_{m=1}^{M} S_i(m) cos( π n (m - 0.5) / M ), 1 ≤ n ≤ L
wherein M is the number of Mel filters; L is the order of the MFCC coefficients.
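A compact numpy/librosa sketch of steps (1)-(4) follows; the fixed pre-emphasis coefficient 0.97 (used instead of the autocorrelation-based factor R(1)/R(0) of the claim), the frame length, and the filter counts are assumptions for illustration only.

```python
import numpy as np
import scipy.fft
import librosa

def mfcc_features(x, fs, n_fft=1024, hop=512, n_mels=26, n_mfcc=13, alpha=0.97):
    # (1) pre-emphasis (fixed coefficient here), framing and Hann windowing
    x = np.append(x[0], x[1:] - alpha * x[:-1])
    frames = librosa.util.frame(x, frame_length=n_fft, hop_length=hop).copy()
    frames *= np.hanning(n_fft)[:, None]
    # (2) FFT of each frame
    spec = np.fft.rfft(frames, n=n_fft, axis=0)
    # (3) spectral-line energy, Mel filter bank, logarithm
    energy = np.abs(spec) ** 2
    mel_fb = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(mel_fb @ energy + 1e-10)
    # (4) DCT, keeping the first n_mfcc coefficients
    return scipy.fft.dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc]
```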
6. The fusion feature classification and identification method based on noise reduction underwater sound signals according to claim 5, wherein: in step S3, the extracted STFT, WVD and MFCCs features are fused by an early fusion method; the STFT, WVD and MFCCs features are two-dimensional feature matrices, each feature map is padded or cropped so that the three features have the same dimensions, and a channel splicing operation is then performed to obtain the fusion feature.
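A numpy sketch of this early fusion is given below: each 2-D feature matrix is padded or cropped to a common size and the three results are spliced along a new channel axis. The target size of 224 × 224 is an illustrative assumption.

```python
import numpy as np

def fuse_early(features, target_shape):
    """Pad or crop each 2-D feature matrix to target_shape, then splice the
    results along a new channel axis (early fusion)."""
    h, w = target_shape
    fused = []
    for f in features:
        f = f[:h, :w]                                        # crop if too large
        pad = ((0, h - f.shape[0]), (0, w - f.shape[1]))     # zero-pad if too small
        fused.append(np.pad(f, pad))
    return np.stack(fused, axis=0)                           # shape (3, H, W)

# e.g. fused = fuse_early([stft_feature, wvd_feature, mfcc_feature], (224, 224))
```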
7. The fusion feature classification and identification method based on noise reduction underwater sound signals according to claim 6, wherein: in step S4, convolution and pooling operations are performed on the fusion feature with convolution kernels of various sizes of the convolutional neural network, so as to capture feature information of different scales in the image; the features are then processed through successive convolution and pooling layers; finally, classification and identification are performed through the conversion of the Flatten layer and the weighted calculation of the Dense layers.
8. The fusion feature classification and identification method based on noise reduction underwater sound signals according to claim 7, wherein: in step S4, the convolutional neural network adopts ResNet residual structures, including a bottleneck residual structure and a non-bottleneck residual structure; the convolutional neural network comprises an input layer, a plurality of convolutional layers, a max pooling layer, an average pooling layer and a plurality of fully connected layers; batch normalization layers reduce the internal covariate shift of the model;
Considering the relationship between the channels of the fusion feature, an SE channel attention mechanism module is added to learn the feature weight of each channel, wherein F_sq(·) represents the squeeze operation, and the formula is:
z_c = F_sq(u_c) = (1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)
Wherein u_c represents the feature map of the c-th channel; z_c represents the c-th element of the vector z obtained by global averaging of the feature map u_c;
After the feature maps are compressed into the feature vector, the feature weight of each channel needs to be generated, and the features are synthesized by two fully connected layers, with the formula:
s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))
Wherein σ(·) is the sigmoid function; δ(·) is the ReLU function; W_1 and W_2 are weight matrices;
The scale operation multiplies the original feature map of size H × W × C by the weight feature vector of size 1 × 1 × C, i.e.:
F_scale(s_c, u_c) = s_c · u_c
The SE channel attention mechanism module performs global average pooling on the feature map to obtain a weight feature vector s; after the parameters of s are learned, each element s_c is multiplied with the corresponding feature-map channel so as to amplify or suppress that channel, and the more a channel contributes to feature extraction, the larger its weight parameter, and the smaller otherwise.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410441314.9A CN118351881A (en) | 2024-04-12 | 2024-04-12 | Fusion feature classification and identification method based on noise reduction underwater sound signals |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118351881A true CN118351881A (en) | 2024-07-16 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118522309A (en) * | 2024-07-22 | 2024-08-20 | 浙江交科环境科技有限公司 | Method and device for identifying noise sources along highway by using convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||