CN117789758A - Urban audio classification method of convolutional neural network based on residual calculation
Urban audio classification method of convolutional neural network based on residual calculation
- Publication number: CN117789758A
- Application number: CN202311833985.1A
- Authority: CN (China)
- Prior art keywords: audio, urban, layer, training set, features
- Prior art date: 2023-12-28
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The embodiment of the present application relates to an urban audio classification method based on a convolutional neural network with residual calculation. The method comprises the following steps: constructing an urban audio classification model; performing data enhancement on urban audio data, and dividing the enhanced urban audio data into a training set and a test set; processing the training set and the test set respectively to obtain training set audio features and test set audio features; feeding the training set audio features into the urban audio classification model for training to obtain a trained urban audio classification model; and feeding the test set audio features into the trained urban audio classification model to classify them, and classifying the test set according to the classification results of the test set audio features. The embodiment of the present application can effectively solve the problem that traditional deep learning neural networks achieve low classification accuracy on urban audio, improving the classification accuracy of urban audio while improving computational efficiency.
Description
Technical Field
The embodiment of the application relates to the technical field of computer hearing, in particular to a city audio classification method of a convolutional neural network based on residual calculation.
Background
In daily production and life, sound plays an irreplaceable role in transmitting important information. In recent years, computer hearing technology has developed rapidly: it allows audio signals to be screened and analyzed on computing devices, and the important information obtained from them is increasingly applied in practice. Computer hearing technology therefore greatly reduces the manpower and material resources required for audio-processing research, while to a great extent ensuring the accuracy of the audio content and of the feature selection.
Audio classification is one of the most fundamental problems studied in computer hearing technology. It is based on the feature information contained in audio, which is the main basis for distinguishing different sound sources. The scope of audio classification is wide, mainly involving speaker identification, the recognition and detection of specific audio events, scene judgment for specific environments, and the like.
The task of audio classification generally has two important steps. First, valid features are extracted from the audio data to represent the entire piece of sound information. Second, to complete the audio classification task in the test stage, an audio classifier with good performance needs to be constructed and trained with the effective audio features.
In the related art, the application of deep learning algorithms has brought major breakthroughs to scientific research in the audio classification field; as an audio classifier with better performance, a deep learning neural network can achieve better classification accuracy and generalization ability. However, different deep learning algorithms bring different effects, and limitations such as overfitting, vanishing gradients, exploding gradients, and the inability to further break through the performance ceiling of the neural network model remain to be studied and solved.
Accordingly, there is a need to improve one or more problems in the related art as described above.
It is noted that this section is intended to provide a background or context for the technical solutions of the present application as set forth in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Disclosure of Invention
An object of embodiments of the present application is to provide a method for classifying urban audio based on a convolutional neural network of residual calculation, thereby overcoming one or more problems due to limitations and disadvantages of the related art at least to some extent.
According to an embodiment of the present application, there is provided a method for classifying urban audio based on a convolutional neural network of residual calculation, the method including:
constructing a city audio classification model based on a convolution neural network of residual calculation;
carrying out data enhancement on the urban audio data, and dividing the urban audio data subjected to data enhancement into a training set and a testing set;
processing the training set and the testing set respectively to obtain training set audio characteristics and testing set audio characteristics;
sending the training set audio features into the urban audio classification model for training to obtain the trained urban audio classification model;
and feeding the test set audio features into the trained urban audio classification model to classify the test set audio features, and classifying the test set according to the classification results of the test set audio features.
In one embodiment of the present application, the urban audio classification model includes: a multi-layer convolution layer, a multi-layer pooling layer and a multi-layer full connection layer;
the urban audio classification model utilizes the residual calculation to optimize the output result of each layer of the convolution layer;
wherein the residual computation comprises an identity mapping for each layer of the convolutional layers in the urban audio classification model.
In an embodiment of the present application, the calculation formula of the convolution layer is:

$$x_{j_c}^{l_c} = g\left(\sum_{i_c \in M_{ci}} x_{i_c}^{l_c - 1} * k_{i_c j_c}^{l_c} + b_{i_c j_c}^{l_c}\right) \tag{1}$$

where $x_{j_c}^{l_c}$ denotes the $j_c$-th node of the $l_c$-th convolution layer, $l_c$ denotes the layer index of the convolution layer, $g$ denotes the activation function (written with the exponential constant $e$, e.g. $g(z) = 1/(1 + e^{-z})$), $k_{i_c j_c}^{l_c}$ denotes the convolution kernel between the $j_c$-th node and the $i_c$-th node of the $l_c$-th convolution layer, $b_{i_c j_c}^{l_c}$ denotes the offset between the $j_c$-th node and the $i_c$-th node, and $M_{ci}$ denotes the audio information mapping matrix in the convolutional neural network;

the calculation formula of the pooling layer is:

$$x_{j_p}^{l_p} = g\left(\beta_{j_p}^{l_p} \, \mathrm{down}\!\left(x_{j_p}^{l_p - 1}\right) + b_{j_p}^{l_p}\right), \qquad \mathrm{down}(n) = \frac{n - f + 2p}{s} + 1 \tag{2}$$

where $x_{j_p}^{l_p}$ denotes the $j_p$-th node of the $l_p$-th pooling layer, $x_{j_p}^{l_p - 1}$ denotes the $j_p$-th node of the $(l_p - 1)$-th pooling layer, $l_p$ denotes the layer index of the pooling layer, $\beta_{j_p}^{l_p}$ denotes the weight of the $j_p$-th node of the $l_p$-th pooling layer, $\mathrm{down}(\cdot)$ denotes the sampling function, $n$ denotes the size of the input data, $\mathrm{down}(n)$ denotes the size of the output data, $p$ denotes the padding size, $f$ denotes the window size of the pooling layer, $s$ denotes the stride, and $b_{j_p}^{l_p}$ denotes the offset of the $j_p$-th node of the $l_p$-th pooling layer;

the calculation formula of the fully connected layer is:

$$x_{j_f}^{l_f} = g\left(\sum_{i_f \in M_f} w_{i_f j_f}^{l_f} \, x_{i_f}^{l_f - 1} + b_{i_f j_f}^{l_f}\right) \tag{3}$$

where $x_{j_f}^{l_f}$ denotes the $j_f$-th node of the $l_f$-th fully connected layer, $x_{i_f}^{l_f - 1}$ denotes the $i_f$-th node of the $(l_f - 1)$-th fully connected layer, $l_f$ denotes the layer index of the fully connected layer, $w_{i_f j_f}^{l_f}$ denotes the weight between the $i_f$-th node and the $j_f$-th node, $b_{i_f j_f}^{l_f}$ denotes the offset between the $i_f$-th node and the $j_f$-th node, and $M_f$ denotes the mapping relation of the fully connected layer.
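As a quick check of the output-size relation $\mathrm{down}(n) = (n - f + 2p)/s + 1$ in formula (2), the following is a minimal sketch assuming PyTorch and the floor-division convention used by common frameworks:

```python
import torch
import torch.nn as nn

def down(n: int, f: int, p: int, s: int) -> int:
    """Output size per formula (2): window f, padding p, stride s, input n."""
    return (n - f + 2 * p) // s + 1   # floored, as in most frameworks

x = torch.randn(1, 1, 64, 64)                      # a 64x64 feature map
pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
print(pool(x).shape[-1], down(64, f=3, p=1, s=2))  # both print 32
```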
In an embodiment of the present application, the step of performing data enhancement on the urban audio data and dividing the data-enhanced urban audio data into a training set and a test set includes:
performing audio tuning or audio noise adding on the urban audio data;
the training set and the testing set are divided according to a preset proportion.
In one embodiment of the present application, the audio tuning comprises: adjusting the volume of the urban audio data, i.e., increasing the original urban audio data by x dB, where x ∈ [−10, 10];

the calculation formula of the audio tuning is:

$$f'(t) = f(t) + x \tag{4}$$

where $f'(t)$ denotes the urban audio data after audio tuning, $f(t)$ denotes the original urban audio data, and $x$ denotes the increase, in dB, applied to the original urban audio data, with x ∈ [−10, 10].
In an embodiment of the present application, the audio plus noise includes: adding random noise sections to the enhanced signal, with the damping coefficient set to a preset value;

the calculation formula of the audio plus noise is:

$$f_r(t) = f(t) + \lambda \sum_{i=1}^{n} \mathrm{noise}_i(t) \tag{5}$$

where $f_r(t)$ denotes the urban audio data after audio plus noise, $f(t)$ denotes the original urban audio data, $\lambda$ denotes the damping coefficient, and $\mathrm{noise}_i(t)$ denotes the $i$-th of the $n$ noise sources used to enhance the urban audio data.
In an embodiment of the present application, the step of processing the training set and the test set to obtain training set audio features and test set audio features includes:
and respectively carrying out pre-emphasis, framing and windowing, fast Fourier transformation, mel scale conversion and discrete cosine transformation on the training set and the testing set to obtain the audio characteristics of the training set and the audio characteristics of the testing set.
In an embodiment of the present application, the step of sending the training set audio feature to the urban audio classification model for training, and obtaining the trained urban audio classification model includes:
inputting the training set audio features into the convolution layer, and extracting training set key features through the convolution layer;
discarding, through the max pooling layer, the parts of the extracted training set key features that cannot correctly express the feature information, completing the dimension reduction of the training set key features and obtaining the dimension-reduced training set audio features;
further extracting the dimension-reduced training set audio features through repeatedly stacked multi-layer convolution layers with residual calculation identity mappings, and summarizing the further extracted audio features through the fully connected layer to obtain the summarized training set audio features;
dividing the summarized training set audio features into different categories of audio scenes, calculating the accuracy for the different categories of audio scenes using a softmax classifier, outputting the classification results of the training set audio features, and completing the training of the urban audio classification model.
In an embodiment of the present application, the expression for the convolution layer extracting the training set key features is:

$$h_1 = \mathrm{CONV}(X) \tag{6}$$

where $h_1$ denotes the training set key features extracted by the convolution layer, CONV denotes the convolution layer, and $X$ denotes the training set audio features;

the expression for the dimension-reduced training set audio features is:

$$h_2 = \mathrm{MAX\_POOLING}(h_1) \tag{7}$$

where $h_2$ denotes the dimension-reduced training set audio features and MAX_POOLING denotes the max pooling layer;

the expression for further extracting the dimension-reduced training set audio features is:

$$F(h_2) = D(h_2) - S(h_2) \tag{8}$$

where $F(h_2)$ denotes the residual of the further extracted dimension-reduced training set audio features, $S(h_2)$ denotes the output value of the shallow convolution layer, and $D(h_2)$ denotes the output value of the deep convolution layer;

the expression for summarizing the further extracted audio features through the fully connected layer to obtain the summarized training set audio features is:

$$h_3 = \mathrm{FC}(F(h_2)) \tag{9}$$

where $h_3$ denotes the summarized training set audio features and FC denotes the fully connected layer;

the expression for outputting the classification results of the training set audio features is:

$$h_4 = \mathrm{softmax}(h_3) \tag{10}$$

where $h_4$ denotes the classification results of the training set audio features and softmax denotes the softmax classifier.
In an embodiment of the present application, the loss function of the convolutional neural network in the urban audio classification model is a cross entropy loss function, the expression of which is:

$$\mathrm{loss}(r, \mathrm{class}) = -\log\!\left(\frac{e^{r[\mathrm{class}]}}{\sum_{v} e^{r_v}}\right) = -r[\mathrm{class}] + \log \sum_{v} e^{r_v} \tag{11}$$

where loss(r, class) denotes the cross entropy loss function of the urban audio data classification task, $r$ denotes the predicted classification result, class denotes the sample label of the urban audio data, $e$ denotes the exponential constant, $r[\mathrm{class}]$ denotes the classification result for the sample label class, and $r_v$ denotes the classification result belonging to category $v$ in the sample label.
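The right-hand form of formula (11) can be verified numerically; the following is a minimal sketch assuming PyTorch, with illustrative logits r for ten urban sound classes:

```python
import torch
import torch.nn.functional as F

r = torch.tensor([[1.2, -0.3, 0.5, 2.0, 0.1, -1.0, 0.7, 0.0, -0.5, 1.5]])
cls = torch.tensor([3])                                # the sample label "class"

manual = -r[0, cls[0]] + torch.logsumexp(r[0], dim=0)  # -r[class] + log sum e^{r_v}
builtin = F.cross_entropy(r, cls)                      # PyTorch's cross entropy
print(manual.item(), builtin.item())                   # the two values agree
```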
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
according to the embodiment of the application, the urban audio classification method based on the residual calculation convolutional neural network can effectively solve the problem that the urban audio classification precision of the traditional deep learning neural network is not high, and improves the calculation efficiency and the classification precision of the urban audio.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 shows a schematic step diagram of a method for urban audio classification based on residual calculation convolutional neural networks in an exemplary embodiment of the application;
FIG. 2 illustrates a schematic diagram of a convolutional neural network employing residual computation in an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a Mel spectrogram audio feature generation process in a method for urban audio classification based on residual calculation convolutional neural network in an exemplary embodiment of the application;
FIG. 4 is a flow chart of the MFCC (Mel-Frequency Cepstral Coefficients) audio feature generation process in a method for urban audio classification based on a residual calculation convolutional neural network in an exemplary embodiment of the present application;
FIG. 5 shows a result diagram of the classification accuracy confusion matrix on test data for the Mel cepstral coefficient audio features of urban audio with a 2D convolutional neural network in an exemplary embodiment of the present application;
FIG. 6 shows a result diagram of the classification accuracy confusion matrix on test data for the Mel cepstral coefficient audio features of urban audio with the method of the present application in an exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are only schematic illustrations of embodiments of the present application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
In this exemplary embodiment, a method for classifying urban audio based on a convolutional neural network of residual calculation is provided first. Referring to fig. 1, the urban audio classification method of the convolutional neural network based on residual calculation may include: steps S101 to S105.
Step S101: and constructing a city audio classification model based on the convolution neural network of residual calculation.
Step S102: and carrying out data enhancement on the urban audio data, and dividing the urban audio data subjected to data enhancement into a training set and a testing set.
Step S103: respectively processing the training set and the testing set to obtain audio characteristics of the training set and audio characteristics of the testing set;
step S104: sending the training set audio features into the urban audio classification model for training to obtain a trained urban audio classification model;
step S105: and sending the audio features of the test set into the trained urban audio classification model for training so as to classify the audio features of the test set, and classifying the test set according to the classification result of the audio features of the test set.
By the urban audio classification method of the convolutional neural network based on residual calculation, the problem that the urban audio classification precision of the traditional deep learning neural network is not high can be effectively solved, the calculation efficiency is improved, and meanwhile, the urban audio classification precision is improved.
Hereinafter, each step of the above-described urban audio classification method of the convolutional neural network based on residual calculation in the present exemplary embodiment will be described in more detail with reference to fig. 1 to 6.
Example 1
Urban audio data is first input and subjected to data enhancement and preprocessing operations to divide the data set. The audio data enhancement methods adopted mainly comprise audio tuning and audio plus noise. The audio tuning mainly adjusts the volume of the urban audio data, i.e., the original urban audio data is increased by 5 dB. The audio plus noise adds a random noise section to the enhanced signal with a damping coefficient of 0.4, so that the audio data covers more scenes and the learning ability on the audio features of the data is enhanced.
In step S101, the urban audio classification model described includes a multi-layer convolution layer, a multi-layer pooling layer, and a multi-layer fully connected layer. The urban audio classification model adopts residual calculation to optimize the output result of each convolution layer.
The residual calculation described includes an identity mapping for each convolution layer in the urban audio classification model;
the calculation formula of the convolution layer is as follows:
wherein,represents the first c Jth of layer convolution layer c Individual nodes, l c Indicating the number of layers of the convolution layer, g indicating the activation function, represents the first c Jth of layer convolution layer c Personal node and ith c Convolution kernel of individual nodes,>represents the j th c Personal node and ith c Offset of individual nodes, M ci Representing an audio information mapping matrix in the convolutional neural network, e representing an exponential constant;
the calculation formula of the pooling layer is as follows:
wherein,represents the first p Layer pooling layer j p Personal node->Represents the first p -j of layer 1 pooling layer p Individual nodes, l p Indicates the number of layers of the pooling layer, < >>Represents the first p Layer pooling layer j p Weight of individual node, down () represents sampling function, n represents size of input data, down (n) represents size of output data, < ->p represents the size of padding, f represents the window size of the pooling layer, s represents the step size, +.>Represents the first p Layer pooling layer j p Offset of individual nodes;
the calculation formula of the full connection layer is as follows:
wherein,represents the first f J of layer full connection layer f Personal node->Represents the first f -1 ith of full link layer f Individual nodes, l f Indicating the number of layers of the fully connected layer->Represents the ith f The next node and j f Weights of individual nodes, weight->Represents the ith f The next node and j f Offset of individual nodes, M f Representing the mapping relation of the full connection layer.
In step S102, the step of performing data enhancement on the urban audio data and dividing the data-enhanced urban audio data into a training set and a test set includes:
performing audio tuning or audio noise adding on the urban audio data;
the training set and the testing set are divided according to a preset proportion.
The audio tuning includes: adjusting the volume of the urban audio data, i.e., increasing the original urban audio data by x dB, where x ∈ [−10, 10].

The calculation formula of the audio tuning is:

$$f'(t) = f(t) + x \tag{4}$$

where $f'(t)$ denotes the urban audio data after audio tuning, $f(t)$ denotes the original urban audio data, and $x$ denotes the increase, in dB, applied to the original urban audio data, with x ∈ [−10, 10].
The audio plus noise includes: adding random noise sections to the enhanced signal, with the damping coefficient set to a preset value.

The calculation formula of the audio plus noise is:

$$f_r(t) = f(t) + \lambda \sum_{i=1}^{n} \mathrm{noise}_i(t) \tag{5}$$

where $f_r(t)$ denotes the urban audio data after audio plus noise, $f(t)$ denotes the original urban audio data, $\lambda$ denotes the damping coefficient, and $\mathrm{noise}_i(t)$ denotes the $i$-th of the $n$ noise sources used to enhance the urban audio data.
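As a concrete illustration of formulas (4) and (5), the following is a minimal NumPy sketch of the two enhancement operations; the function names, the 16 kHz sample rate, and the use of Gaussian noise sections are assumptions, not specified by this embodiment beyond the +5 dB shift and the 0.4 damping coefficient:

```python
import numpy as np

def tune(f: np.ndarray, x: float) -> np.ndarray:
    """Audio tuning per formula (4): f'(t) = f(t) + x, x in [-10, 10] dB."""
    return f + x

def add_noise(f: np.ndarray, n: int = 1, lam: float = 0.4) -> np.ndarray:
    """Audio plus noise per formula (5): add n random noise sections,
    scaled by the damping coefficient lam (0.4 in this embodiment)."""
    noise = sum(np.random.randn(f.shape[0]) for _ in range(n))
    return f + lam * noise

signal = np.random.randn(16000)   # stand-in for 1 s of audio at 16 kHz
louder = tune(signal, 5.0)        # the +5 dB adjustment of this embodiment
noisy = add_noise(signal)
```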
In step S103, the steps of processing the training set and the test set to obtain the training set audio feature and the test set audio feature respectively include: and respectively carrying out pre-emphasis, framing and windowing, fast Fourier transformation, mel scale conversion and discrete cosine transformation on the training set and the testing set to obtain the audio characteristics of the training set and the audio characteristics of the testing set.
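To make the pipeline concrete, the following is a minimal sketch of these five steps using the librosa library; the file path, sample rate, and the choice of 40 coefficients are illustrative assumptions (the 25 ms frame length and 10 ms frame shift follow the embodiments below):

```python
import librosa

y, sr = librosa.load("urban_clip.wav", sr=22050)   # illustrative path and rate
y = librosa.effects.preemphasis(y)                 # pre-emphasis: boost high band
mfcc = librosa.feature.mfcc(                       # framing/windowing, FFT,
    y=y, sr=sr, n_mfcc=40,                         # Mel conversion, and DCT
    n_fft=int(0.025 * sr),                         # 25 ms frame length
    hop_length=int(0.010 * sr),                    # 10 ms frame shift
)
print(mfcc.shape)                                  # (40, number_of_frames)
```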
In step S104, in the urban audio classification model, the key features of the preprocessed urban audio data are first extracted by convolution layer learning; the max pooling layer then discards the parts of the extracted features that cannot correctly express the feature information, reducing the order of magnitude of the parameters; the repeatedly stacked deep convolution layers with residual calculation identity mappings then further extract the features of the urban audio data; the fully connected layer summarizes the feature information extracted from the audio features; and finally, a softmax classifier calculates the accuracy with which each piece of audio data is classified into the different categories of audio scene labels and outputs the classification results.
Specifically, the step of sending the training set audio features into the urban audio classification model for training to obtain a trained urban audio classification model comprises the following steps:
inputting the audio features of the training set into the convolution layer, and extracting key features of the training set through the convolution layer;
discarding, through the max pooling layer, the parts of the extracted training set key features that cannot correctly express the feature information, completing the dimension reduction of the training set key features and obtaining the dimension-reduced training set audio features;
further extracting the dimension-reduced training set audio features through repeatedly stacked multi-layer convolution layers with residual calculation identity mappings, and summarizing the further extracted audio features through the fully connected layer to obtain the summarized training set audio features;
dividing the summarized training set audio features into different categories of audio scenes, calculating the accuracy for the different categories of audio scenes using a softmax classifier, outputting the classification results of the training set audio features, and completing the training of the urban audio classification model.
Further, the expression of the convolutional layer extracting the key features of the training set is:
$$h_1 = \mathrm{CONV}(X) \tag{6}$$

where $h_1$ denotes the training set key features extracted by the convolution layer, CONV denotes the convolution layer, and $X$ denotes the training set audio features.

The expression for the dimension-reduced training set audio features is:

$$h_2 = \mathrm{MAX\_POOLING}(h_1) \tag{7}$$

where $h_2$ denotes the dimension-reduced training set audio features and MAX_POOLING denotes the max pooling layer.

The expression for further extracting the dimension-reduced training set audio features is:

$$F(h_2) = D(h_2) - S(h_2) \tag{8}$$

where $F(h_2)$ denotes the residual of the further extracted dimension-reduced training set audio features, $S(h_2)$ denotes the output value of the shallow convolution layer, and $D(h_2)$ denotes the output value of the deep convolution layer. $F(h) = D(h) - S(h)$ is the nonlinear transformation output between convolution layers computed as a residual: when the audio features learned by the shallow output value $S(h)$ are already optimal, $F(h)$ automatically approaches 0, so that $S(h)$ is propagated along an identity path. Thus, when the shallow output is good enough, the remaining layers of the deep network perform an identity mapping, which prevents the network training result from degrading.

The expression for summarizing the further extracted audio features through the fully connected layer to obtain the summarized training set audio features is:

$$h_3 = \mathrm{FC}(F(h_2)) \tag{9}$$

where $h_3$ denotes the summarized training set audio features and FC denotes the fully connected layer.

The expression for outputting the classification results of the training set audio features is:

$$h_4 = \mathrm{softmax}(h_3) \tag{10}$$

where $h_4$ denotes the classification results of the training set audio features and softmax denotes the softmax classifier.
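The identity mapping described by formula (8) can be sketched in PyTorch as follows; this is a minimal sketch under assumptions (stride 1, no batch normalization, ReLU activation), not the exact layer configuration of the patent:

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Two stacked convolution layers plus an identity mapping: the block
    returns D(h) = F(h) + S(h), so F(h) = D(h) - S(h) as in formula (8),
    and the block degenerates to the identity when F(h) approaches 0."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        f = self.conv2(self.relu(self.conv1(h)))  # residual branch F(h)
        return self.relu(f + h)                   # D(h) = F(h) + identity S(h)
```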
The training process of the urban audio classification network is as follows:
firstly, a public urban audio data set UrbanSound8K is used and divided into a training set and a testing set according to the proportion of 7:3;
during training, the total number of epochs is set to 200; the batch size is 32; Adam is selected as the optimizer; in the optimizer, a learning rate decay training strategy is adopted, with an initial learning rate of 0.001 and a weight decay of 0.0005; for classification problems, the loss function applicable to the overall classification network is the cross entropy loss function, as follows:

$$\mathrm{loss}(r, \mathrm{class}) = -\log\!\left(\frac{e^{r[\mathrm{class}]}}{\sum_{v} e^{r_v}}\right) = -r[\mathrm{class}] + \log \sum_{v} e^{r_v} \tag{11}$$

where loss(r, class) denotes the cross entropy loss function of the urban audio data classification task, $r$ denotes the predicted classification result, class denotes the sample label of the urban audio data, $e$ denotes the exponential constant, $r[\mathrm{class}]$ denotes the classification result for the sample label class, and $r_v$ denotes the classification result belonging to category $v$ in the sample label.
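Under the stated settings, a minimal PyTorch training-loop sketch looks as follows; `model` and `train_loader` are assumed to exist, and the step-decay schedule is an assumption, since the embodiment only states that a learning rate decay strategy is used:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR

criterion = nn.CrossEntropyLoss()                       # formula (11)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=0.001, weight_decay=0.0005)
scheduler = StepLR(optimizer, step_size=50, gamma=0.1)  # decay schedule assumed

for epoch in range(200):                                # total epochs = 200
    for features, labels in train_loader:               # batches of size 32
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```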
Example 2
A city audio classification method of convolutional neural network based on residual calculation according to embodiment 1 is different in that:
the specific implementation process of step S102 includes:
First, the public urban audio data set UrbanSound8K is input and an audio tuning operation is performed on it: the volume of the urban audio data is adjusted so that the original data is increased by 5 dB. A random noise section with a damping coefficient of 0.4 is also added to the enhanced signal to complete the audio plus noise, so that the audio data covers more scenes and the learning ability on the audio features of the data is enhanced;
the preprocessed urban audio data set is then divided into a training set and a test set at a ratio of 7:3.
The specific implementation process of step S103 includes:
A pre-emphasis operation is first performed on the preprocessed urban audio data set to boost the high-frequency band of the audio signal and thereby flatten its spectrum;
a framing operation is then performed, the purpose of which is to divide the audio into small segments for easier analysis; at the same time, to eliminate possible signal discontinuities between frames, the audio is windowed, with the frame length and frame shift chosen as 25 ms and 10 ms respectively;
because the energy distribution in the frequency domain reflects the characteristics of the audio better than its time-domain features, the spectrum of the audio data is obtained by the fast Fourier transform and its modulus is squared;
finally, the logarithmic energy is obtained after filtering through a Mel-frequency filter bank, yielding the Mel spectrum audio features of the urban audio data.
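A minimal librosa sketch of this Mel spectrum feature extraction is given below; the file path and sample rate are illustrative assumptions, while the 25 ms frame length and 10 ms frame shift follow this embodiment:

```python
import librosa

y, sr = librosa.load("urban_clip.wav", sr=22050)        # illustrative path/rate
y = librosa.effects.preemphasis(y)                      # flatten the spectrum
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=int(0.025 * sr),                              # 25 ms frames
    hop_length=int(0.010 * sr),                         # 10 ms frame shift
)
log_mel = librosa.power_to_db(mel)                      # logarithmic energy
```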
The specific implementation process of step S104 includes:
the urban audio classification model comprises a plurality of convolution layers, a plurality of pooling layers and a plurality of full connection layers;
the described residual calculation includes an identity mapping for each layer of convolutional layers in the urban audio classification model;
In the urban audio classification model, for the Mel spectrum audio features of the preprocessed urban audio data, the key features are first extracted through a learned convolution layer with a 7×7 convolution kernel, a stride of 2, and a convolution padding of 3;
a max pooling layer with a 3×3 pooling kernel, a stride of 2, and a pooling padding of 1 then discards the parts of the extracted features that cannot accurately express the feature information, reducing the order of magnitude of the parameters;
next, two convolution layers with 3×3 convolution kernels and a stride of 2 are constructed, and an identity mapping is added to them according to the residual calculation structure described in Example 1, so that together they form a convolution group with a residual calculation structure, used to further extract the audio features of the urban audio data;
4 such groups of convolution layers with residual calculation structures are stacked repeatedly, making the urban audio classification model a deep neural network structure;
the information of the audio features extracted from the Mel spectrum audio features is summarized through the fully connected layer, the accuracy with which the audio data is classified into the labels of the different categories of audio scenes is calculated through a softmax classifier, and the classification results are output.
A schematic diagram of the convolutional neural network adopting residual calculation is shown in FIG. 2.
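Putting the layers above together, a minimal PyTorch sketch of the overall network could look as follows; the channel widths and the final pooling are assumptions not fixed by the embodiment, and `ResidualConvBlock` refers to the sketch in Example 1:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),  # 7x7, stride 2, pad 3
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # 3x3, stride 2, pad 1
    ResidualConvBlock(64),                                 # 4 repeatedly stacked
    ResidualConvBlock(64),                                 # residual groups
    ResidualConvBlock(64),
    ResidualConvBlock(64),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),  # 10 UrbanSound8K classes; softmax applied at inference
)
```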
The training process of the urban audio classification network in step S104 is as follows:
During training, the total number of epochs is set to 200; the batch size is 32; Adam is selected as the optimizer; in the optimizer, a learning rate decay training strategy is adopted, with an initial learning rate of 0.001 and a weight decay of 0.0005. For classification problems, the loss function applicable to the overall classification network is the cross entropy loss function, so the cross entropy loss function is selected as the loss function of the urban audio classification model, and the model is trained on the Mel spectrum audio features.
The specific implementation process of step S105 includes:
The Mel spectrum audio features of the test data are fed into the trained urban audio classification model, the probability values of each audio clip belonging to the specific classification labels are calculated, the class label corresponding to the maximum probability is taken as the final classification result for that audio clip, and the final total audio classification accuracy is calculated and output.
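A minimal sketch of this test-stage evaluation is shown below; `model` and `test_loader` are assumed to exist:

```python
import torch

model.eval()
correct = total = 0
with torch.no_grad():
    for features, labels in test_loader:
        probs = torch.softmax(model(features), dim=1)  # per-label probabilities
        preds = probs.argmax(dim=1)                    # label with max probability
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"total audio classification accuracy: {correct / total:.4f}")
```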
Example 3
The urban audio classification method of the convolutional neural network based on residual calculation according to embodiment 1 is different in that:
the specific implementation process of step S102 includes:
First, the public urban audio data set UrbanSound8K is input and an audio tuning operation is performed on it: the volume of the urban audio data is adjusted so that the original data is increased by 5 dB. A random noise section with a damping coefficient of 0.4 is also added to the enhanced signal to complete the audio plus noise, so that the audio data covers more scenes and the learning ability on the audio features of the data is enhanced;
the preprocessed urban audio data set is then divided into a training set and a test set at a ratio of 7:3.
The specific implementation process of step S103 includes:
A pre-emphasis operation is first performed on the preprocessed urban audio data set to boost the high-frequency band of the audio signal and thereby flatten its spectrum;
a framing operation is then performed, the purpose of which is to divide the audio into small segments for easier analysis; at the same time, to eliminate possible signal discontinuities between frames, the audio is windowed, with the frame length and frame shift chosen as 25 ms and 10 ms respectively;
because the energy distribution in the frequency domain reflects the characteristics of the audio better than its time-domain features, the spectrum of the audio data is obtained by the fast Fourier transform and its modulus is squared;
finally, the logarithmic energy is obtained after filtering through a Mel-frequency filter bank and a discrete cosine transform is applied, yielding the Mel cepstral coefficient audio features of the urban audio data.
The specific implementation process of step S104 includes:
the urban audio classification model comprises a plurality of convolution layers, a plurality of pooling layers and a plurality of full connection layers;
the described residual calculation includes an identity mapping for each layer of convolutional layers in the urban audio classification model;
In the urban audio classification model, for the Mel cepstral coefficient audio features of the preprocessed urban audio data, the key features are first extracted through a learned convolution layer with a 7×7 convolution kernel, a stride of 2, and a convolution padding of 3;
a max pooling layer with a 3×3 pooling kernel, a stride of 2, and a pooling padding of 1 then discards the parts of the extracted features that cannot accurately express the feature information, reducing the order of magnitude of the parameters;
next, two convolution layers with 3×3 convolution kernels and a stride of 2 are constructed, and an identity mapping is added to them according to the residual calculation structure described in Example 1, so that together they form a convolution group with a residual calculation structure, used to further extract the features of the urban audio data;
4 such groups of convolution layers with residual calculation structures are stacked repeatedly, making the urban audio classification model a deep neural network structure;
the feature information extracted from the Mel cepstral coefficient audio features is summarized through the fully connected layer, the accuracy with which the audio data is classified into the labels of the different categories of audio scenes is calculated through a softmax classifier, and the classification results are output.
The training process of the urban audio classification network in step S104 is as follows:
During training, the total number of epochs is set to 200; the batch size is 32; Adam is selected as the optimizer; in the optimizer, a learning rate decay training strategy is adopted, with an initial learning rate of 0.001 and a weight decay of 0.0005. For classification problems, the loss function applicable to the overall classification network is the cross entropy loss function, so the cross entropy loss function is selected as the loss function of the urban audio classification model, and the model is trained on the Mel cepstral coefficient audio features.
The specific implementation process of step S105 includes:
The Mel cepstral coefficient audio features of the test data are fed into the trained urban audio classification model, the probability values of each audio clip belonging to the specific classification labels are calculated, the class label corresponding to the maximum probability is taken as the final classification result for that audio data, and the final total audio classification accuracy is calculated and output. A result diagram of the classification accuracy confusion matrix on the test data for the Mel cepstral coefficient audio features of the urban audio is drawn, as shown in FIG. 6.
For the constructed urban audio classification network framework adopting a convolutional neural network with residual calculation, the Mel spectrogram and MFCC audio features of the urban audio data in the public data set UrbanSound8K are input separately for training; the flow of Mel spectrogram feature generation is shown in FIG. 3 and that of MFCC feature generation in FIG. 4. The Mel spectrum and MFCC features of the test data are then fed separately into the trained urban audio classification model, and the obtained classification accuracy results are shown below.
As shown in FIGS. 5 and 6, taking as examples the result diagrams of the classification accuracy confusion matrix on the MFCC audio features of the urban audio test data for a 2D convolutional neural network and for the method of the present application respectively, the horizontal axis ticks 0 to 9 represent the predicted classification labels of the urban audio test data, and the vertical axis ticks 0 to 9 represent the actual labels. The darker the color of a diagonal cell, the higher the accuracy with which urban audio data of that label is correctly predicted to its classification label. From the result diagrams of the classification accuracy confusion matrices of FIGS. 5 and 6, the method of the present application better improves the accuracy of the urban audio classification results.
TABLE 1
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, one skilled in the art can combine and combine the different embodiments or examples described in this specification.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
Claims (10)
1. The urban audio classification method of the convolutional neural network based on residual calculation is characterized by comprising the following steps of:
constructing a city audio classification model based on a convolution neural network of residual calculation;
carrying out data enhancement on the urban audio data, and dividing the urban audio data subjected to data enhancement into a training set and a testing set;
processing the training set and the testing set respectively to obtain training set audio characteristics and testing set audio characteristics;
sending the training set audio features into the urban audio classification model for training to obtain the trained urban audio classification model;
and feeding the test set audio features into the trained urban audio classification model to classify the test set audio features, and classifying the test set according to the classification results of the test set audio features.
2. The method for urban audio classification based on residual calculation of convolutional neural network of claim 1,
the urban audio classification model comprises: a multi-layer convolution layer, a multi-layer pooling layer and a multi-layer full connection layer;
the urban audio classification model utilizes the residual calculation to optimize the output result of each layer of the convolution layer;
wherein the residual computation comprises an identity mapping for each layer of the convolutional layers in the urban audio classification model.
3. The method for urban audio classification based on residual calculation of convolutional neural network of claim 1,
the calculation formula of the convolution layer is:

$$x_{j_c}^{l_c} = g\left(\sum_{i_c \in M_{ci}} x_{i_c}^{l_c - 1} * k_{i_c j_c}^{l_c} + b_{i_c j_c}^{l_c}\right) \tag{1}$$

where $x_{j_c}^{l_c}$ denotes the $j_c$-th node of the $l_c$-th convolution layer, $l_c$ denotes the layer index of the convolution layer, $g$ denotes the activation function (written with the exponential constant $e$, e.g. $g(z) = 1/(1 + e^{-z})$), $k_{i_c j_c}^{l_c}$ denotes the convolution kernel between the $j_c$-th node and the $i_c$-th node of the $l_c$-th convolution layer, $b_{i_c j_c}^{l_c}$ denotes the offset between the $j_c$-th node and the $i_c$-th node, and $M_{ci}$ denotes the audio information mapping matrix in the convolutional neural network;

the calculation formula of the pooling layer is:

$$x_{j_p}^{l_p} = g\left(\beta_{j_p}^{l_p} \, \mathrm{down}\!\left(x_{j_p}^{l_p - 1}\right) + b_{j_p}^{l_p}\right), \qquad \mathrm{down}(n) = \frac{n - f + 2p}{s} + 1 \tag{2}$$

where $x_{j_p}^{l_p}$ denotes the $j_p$-th node of the $l_p$-th pooling layer, $x_{j_p}^{l_p - 1}$ denotes the $j_p$-th node of the $(l_p - 1)$-th pooling layer, $l_p$ denotes the layer index of the pooling layer, $\beta_{j_p}^{l_p}$ denotes the weight of the $j_p$-th node of the $l_p$-th pooling layer, $\mathrm{down}(\cdot)$ denotes the sampling function, $n$ denotes the size of the input data, $\mathrm{down}(n)$ denotes the size of the output data, $p$ denotes the padding size, $f$ denotes the window size of the pooling layer, $s$ denotes the stride, and $b_{j_p}^{l_p}$ denotes the offset of the $j_p$-th node of the $l_p$-th pooling layer;

the calculation formula of the fully connected layer is:

$$x_{j_f}^{l_f} = g\left(\sum_{i_f \in M_f} w_{i_f j_f}^{l_f} \, x_{i_f}^{l_f - 1} + b_{i_f j_f}^{l_f}\right) \tag{3}$$

where $x_{j_f}^{l_f}$ denotes the $j_f$-th node of the $l_f$-th fully connected layer, $x_{i_f}^{l_f - 1}$ denotes the $i_f$-th node of the $(l_f - 1)$-th fully connected layer, $l_f$ denotes the layer index of the fully connected layer, $w_{i_f j_f}^{l_f}$ denotes the weight between the $i_f$-th node and the $j_f$-th node, $b_{i_f j_f}^{l_f}$ denotes the offset between the $i_f$-th node and the $j_f$-th node, and $M_f$ denotes the mapping relation of the fully connected layer.
4. The method for classifying urban audio by using a convolutional neural network based on residual calculation according to claim 3, wherein the step of performing data enhancement on the urban audio data and dividing the data-enhanced urban audio data into a training set and a test set comprises the steps of:
performing audio tuning or audio noise adding on the urban audio data;
the training set and the testing set are divided according to a preset proportion.
5. The method of urban audio classification based on residual calculation convolutional neural network of claim 4, wherein the audio tuning comprises: adjusting the volume of the urban audio data, i.e., increasing the original urban audio data by x dB, where x ∈ [−10, 10];

the calculation formula of the audio tuning is:

$$f'(t) = f(t) + x \tag{4}$$

where $f'(t)$ denotes the urban audio data after audio tuning, $f(t)$ denotes the original urban audio data, and $x$ denotes the increase, in dB, applied to the original urban audio data, with x ∈ [−10, 10].
6. The method for urban audio classification based on residual calculation convolutional neural network of claim 4, wherein the audio plus noise comprises: adding random noise sections to the enhanced signal, with the damping coefficient set to a preset value;

the calculation formula of the audio plus noise is:

$$f_r(t) = f(t) + \lambda \sum_{i=1}^{n} \mathrm{noise}_i(t) \tag{5}$$

where $f_r(t)$ denotes the urban audio data after audio plus noise, $f(t)$ denotes the original urban audio data, $\lambda$ denotes the damping coefficient, and $\mathrm{noise}_i(t)$ denotes the $i$-th of the $n$ noise sources used to enhance the urban audio data.
7. The method of claim 5 or 6, wherein the step of processing the training set and the test set to obtain training set audio features and test set audio features, respectively, comprises:
and respectively carrying out pre-emphasis, framing and windowing, fast Fourier transformation, mel scale conversion and discrete cosine transformation on the training set and the testing set to obtain the audio characteristics of the training set and the audio characteristics of the testing set.
8. The method for classifying urban audio by using a convolutional neural network based on residual calculation according to claim 7, wherein the step of training the training set audio features in the urban audio classification model to obtain the trained urban audio classification model comprises the following steps:
inputting the training set audio features into the convolution layer, and extracting training set key features through the convolution layer;
discarding, through the max pooling layer, the parts of the extracted training set key features that cannot correctly express the feature information, completing the dimension reduction of the training set key features and obtaining the dimension-reduced training set audio features;
further extracting the dimension-reduced training set audio features through repeatedly stacked multi-layer convolution layers with residual calculation identity mappings, and summarizing the further extracted audio features through the fully connected layer to obtain the summarized training set audio features;
dividing the summarized training set audio features into different categories of audio scenes, calculating the accuracy for the different categories of audio scenes using a softmax classifier, outputting the classification results of the training set audio features, and completing the training of the urban audio classification model.
9. The method for urban audio classification based on residual calculation of convolutional neural network of claim 8,
wherein the expression for the convolution layer extracting the training set key features is:

$$h_1 = \mathrm{CONV}(X) \tag{6}$$

where $h_1$ denotes the training set key features extracted by the convolution layer, CONV denotes the convolution layer, and $X$ denotes the training set audio features;

the expression for the dimension-reduced training set audio features is:

$$h_2 = \mathrm{MAX\_POOLING}(h_1) \tag{7}$$

where $h_2$ denotes the dimension-reduced training set audio features and MAX_POOLING denotes the max pooling layer;

the expression for further extracting the dimension-reduced training set audio features is:

$$F(h_2) = D(h_2) - S(h_2) \tag{8}$$

where $F(h_2)$ denotes the residual of the further extracted dimension-reduced training set audio features, $S(h_2)$ denotes the output value of the shallow convolution layer, and $D(h_2)$ denotes the output value of the deep convolution layer;

the expression for summarizing the further extracted audio features through the fully connected layer to obtain the summarized training set audio features is:

$$h_3 = \mathrm{FC}(F(h_2)) \tag{9}$$

where $h_3$ denotes the summarized training set audio features and FC denotes the fully connected layer;

the expression for outputting the classification results of the training set audio features is:

$$h_4 = \mathrm{softmax}(h_3) \tag{10}$$

where $h_4$ denotes the classification results of the training set audio features and softmax denotes the softmax classifier.
10. The method for classifying urban audio by using a convolutional neural network based on residual calculation according to claim 9, wherein the loss function of the convolutional neural network in the urban audio classification model is a cross entropy loss function, the expression of which is:

$$\mathrm{loss}(r, \mathrm{class}) = -\log\!\left(\frac{e^{r[\mathrm{class}]}}{\sum_{v} e^{r_v}}\right) = -r[\mathrm{class}] + \log \sum_{v} e^{r_v} \tag{11}$$

where loss(r, class) denotes the cross entropy loss function of the urban audio data classification task, $r$ denotes the predicted classification result, class denotes the sample label of the urban audio data, $e$ denotes the exponential constant, $r[\mathrm{class}]$ denotes the classification result for the sample label class, and $r_v$ denotes the classification result belonging to category $v$ in the sample label.
Priority Applications (1)
- CN202311833985.1A, filed 2023-12-28: Urban audio classification method of convolutional neural network based on residual calculation

Publications (1)
- CN117789758A, published 2024-03-29

Family
- ID: 90390571
- 2023-12-28: Application CN202311833985.1A filed; status Pending (CN117789758A)

Cited By (1)
- CN118585924A: Neural network noise source classification method and device based on model fusion (published 2024-09-03)
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination