
CN117789758A - Urban audio classification method of convolutional neural network based on residual calculation - Google Patents


Info

Publication number
CN117789758A
Authority
CN
China
Prior art keywords: audio, urban, layer, training set, features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311833985.1A
Other languages
Chinese (zh)
Inventor
邱博之
王磊
李盛
李迎纲
李莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHAANXI HUANGHE GROUP CO Ltd
Original Assignee
SHAANXI HUANGHE GROUP CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHAANXI HUANGHE GROUP CO Ltd
Priority to CN202311833985.1A
Publication of CN117789758A
Legal status: Pending

Landscapes

  • Complex Calculations (AREA)

Abstract

The embodiment of the application relates to an urban audio classification method using a convolutional neural network based on residual calculation. The method comprises the following steps: constructing an urban audio classification model; performing data enhancement on the urban audio data and dividing the enhanced urban audio data into a training set and a test set; processing the training set and the test set respectively to obtain training-set audio features and test-set audio features; sending the training-set audio features into the urban audio classification model for training, to obtain a trained urban audio classification model; and sending the test-set audio features into the trained urban audio classification model for testing, so as to classify the test-set audio features, and classifying the test set according to the classification results of its audio features. The embodiment of the application effectively solves the problem of low urban audio classification accuracy in traditional deep-learning neural networks, improving the classification accuracy of urban audio while also improving computational efficiency.

Description

Urban audio classification method of convolutional neural network based on residual calculation
Technical Field
The embodiment of the application relates to the technical field of computer hearing, in particular to a city audio classification method of a convolutional neural network based on residual calculation.
Background
In daily production and life, sound plays an irreplaceable role in transmitting important information. In recent years, computer hearing technology has developed rapidly: it enables audio signals to be screened and analyzed on computing devices so that the important information they carry can be extracted, and it is increasingly applied in practice. Computer hearing technology therefore greatly reduces the manpower and material resources required for audio-processing research, while largely preserving the fidelity of the audio content and the accuracy of feature selection.
Audio classification is one of the most fundamental problems in computer hearing. It relies on the feature information contained in audio, which is the main basis for distinguishing different sound sources. Its scope is broad, mainly covering speaker identification, recognition and detection of specific audio events, scene judgment for specific environments, and the like.
An audio classification task generally involves two important steps. First, effective features are extracted from the audio data to represent the entire piece of sound information. Second, to complete the classification task at the test stage, an audio classifier with good performance must be constructed and trained with those effective audio features.
In the related art, deep learning algorithms have brought major breakthroughs to research in the audio classification field: used as audio classifiers, deep neural networks achieve better classification accuracy and generalization ability. However, different deep learning algorithms behave differently, and limitations such as overfitting, vanishing gradients, exploding gradients, and the inability to push past the performance ceiling of a neural network model remain to be studied and solved.
Accordingly, there is a need to improve one or more problems in the related art as described above.
It is noted that this section is intended to provide a background or context for the technical solutions of the present application as set forth in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Disclosure of Invention
An object of embodiments of the present application is to provide a method for classifying urban audio based on a convolutional neural network of residual calculation, thereby overcoming one or more problems due to limitations and disadvantages of the related art at least to some extent.
According to an embodiment of the present application, there is provided a method for classifying urban audio based on a convolutional neural network of residual calculation, the method including:
constructing a city audio classification model based on a convolution neural network of residual calculation;
carrying out data enhancement on the urban audio data, and dividing the urban audio data subjected to data enhancement into a training set and a testing set;
processing the training set and the testing set respectively to obtain training set audio characteristics and testing set audio characteristics;
sending the training set audio features into the urban audio classification model for training to obtain the trained urban audio classification model;
and sending the test-set audio features into the trained urban audio classification model for testing, so as to classify the test-set audio features, and classifying the test set according to the classification results of the test-set audio features.
In one embodiment of the present application, the urban audio classification model includes: multiple convolution layers, multiple pooling layers and multiple fully connected layers;
the urban audio classification model uses the residual calculation to optimize the output result of each convolution layer;
wherein the residual calculation comprises an identity mapping for each convolution layer in the urban audio classification model.
In an embodiment of the present application, the calculation formula of the convolution layer is:

x_{j_c}^{l_c} = g( \sum_{i_c \in M_{ci}} x_{i_c}^{l_c-1} * k_{i_c j_c}^{l_c} + b_{i_c j_c}^{l_c} )   (1)

wherein x_{j_c}^{l_c} represents the j_c-th node of the l_c-th convolution layer, l_c represents the layer index of the convolution layer, g represents the activation function, k_{i_c j_c}^{l_c} represents the convolution kernel between the i_c-th and j_c-th nodes, b_{i_c j_c}^{l_c} represents the offset between the j_c-th and i_c-th nodes, M_{ci} represents the audio information mapping matrix in the convolutional neural network, and e represents the exponential constant used in the activation function;
the calculation formula of the pooling layer is:

x_{j_p}^{l_p} = \beta_{j_p}^{l_p} · down( x_{j_p}^{l_p-1} ) + b_{j_p}^{l_p},  with  down(n) = (n + 2p - f)/s + 1   (2)

wherein x_{j_p}^{l_p} represents the j_p-th node of the l_p-th pooling layer, x_{j_p}^{l_p-1} represents the j_p-th node of the (l_p-1)-th pooling layer, l_p represents the layer index of the pooling layer, \beta_{j_p}^{l_p} represents the weight of the j_p-th node of the l_p-th pooling layer, down() represents the sampling function, n represents the size of the input data, down(n) represents the size of the output data, p represents the padding size, f represents the window size of the pooling layer, s represents the step size, and b_{j_p}^{l_p} represents the offset of the j_p-th node of the l_p-th pooling layer;
the calculation formula of the fully connected layer is:

x_{j_f}^{l_f} = g( \sum_{i_f \in M_f} w_{i_f j_f} · x_{i_f}^{l_f-1} + b_{i_f j_f} )   (3)

wherein x_{j_f}^{l_f} represents the j_f-th node of the l_f-th fully connected layer, x_{i_f}^{l_f-1} represents the i_f-th node of the (l_f-1)-th fully connected layer, l_f represents the layer index of the fully connected layer, w_{i_f j_f} represents the weight between the i_f-th and j_f-th nodes, b_{i_f j_f} represents the offset between the i_f-th and j_f-th nodes, and M_f represents the mapping relation of the fully connected layer.
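For concreteness, the following minimal PyTorch sketch (an illustration, not part of the patent; the channel counts, feature-map size and class count are assumed placeholders) shows how formulas (1)-(3) correspond to standard convolution, max-pooling and fully connected layers:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)   # formula (1): kernels k, offsets b
pool = nn.MaxPool2d(kernel_size=2, stride=2)        # formula (2): down(n) = (n + 2p - f)/s + 1
fc   = nn.Linear(16 * 32 * 32, 10)                  # formula (3): weights w, offsets b

x = torch.randn(8, 1, 64, 64)    # a batch of 8 single-channel audio feature maps
h = torch.relu(conv(x))          # g(sum_i x_i * k_ij + b_ij), with g = ReLU here
h = pool(h)                      # 64x64 -> 32x32, per down(n)
y = fc(h.flatten(1))             # summarize into 10 class scores
print(y.shape)                   # torch.Size([8, 10])
```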
In an embodiment of the present application, the step of performing data enhancement on the urban audio data and dividing the data-enhanced urban audio data into a training set and a test set includes:
performing audio tuning or audio noise addition on the urban audio data;
the training set and the testing set are divided according to a preset proportion.
In one embodiment of the present application, the audio tuning comprises: adjusting the volume of the urban audio data, namely increasing the original urban audio data by x dB, wherein x ∈ [-10, 10];
the calculation formula of the audio tuning is:
f'(t) = f(t) + x   (4)
wherein f'(t) represents the urban audio data after audio tuning, f(t) represents the original data of the urban audio data, and x represents the increase of the original urban audio data by x dB, x ∈ [-10, 10].
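As a hedged illustration of formula (4): the patent writes the x dB change additively over f(t), and in a typical waveform implementation an x dB volume change is realized by scaling the samples by 10^(x/20), which is what the assumed helper below does:

```python
import numpy as np

def tune_volume(f, x_db):
    """Change the level of waveform f by x_db dB (x_db in [-10, 10])."""
    assert -10.0 <= x_db <= 10.0
    return f * (10.0 ** (x_db / 20.0))

wave = np.random.randn(16000).astype(np.float32)  # stand-in 1 s clip at 16 kHz
louder = tune_volume(wave, 5.0)                   # the +5 dB setting used in the examples
```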
In an embodiment of the present application, the audio noise addition comprises: adding a random noise section to the enhanced signal, with the damping coefficient set to a preset value;
the calculation formula of the audio noise addition is:
f_r(t) = f(t) + λ · n̄(t)   (5)
wherein f_r(t) represents the urban audio data after noise addition, f(t) represents the original data of the urban audio data, λ represents the damping coefficient, and n̄(t) represents the noise source used to enhance the urban audio data.
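A corresponding sketch of formula (5), under the assumption that the random noise section is Gaussian; only the damping coefficient (0.4 in the examples) comes from the patent:

```python
import numpy as np

def add_noise(f, damping=0.4):
    """f_r(t) = f(t) + damping * n(t), with n(t) a random noise section."""
    noise = np.random.default_rng().standard_normal(f.shape).astype(f.dtype)
    return f + damping * noise

wave = np.random.randn(16000).astype(np.float32)
noisy = add_noise(wave, damping=0.4)
```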
In an embodiment of the present application, the step of processing the training set and the test set to obtain the training-set audio features and the test-set audio features includes:
performing pre-emphasis, framing and windowing, fast Fourier transform, Mel-scale conversion and discrete cosine transform on the training set and the test set respectively, to obtain the training-set audio features and the test-set audio features.
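This chain of operations maps directly onto a standard MFCC front end. The sketch below uses librosa as an assumed implementation; the 25 ms frame length and 10 ms frame shift follow the embodiments, while the sampling rate and coefficient count are placeholders:

```python
import numpy as np
import librosa

def mfcc_features(y, sr=22050, n_mfcc=40):
    y = librosa.effects.preemphasis(y)          # pre-emphasis: boost the high band
    return librosa.feature.mfcc(                # framing + windowing + FFT + |.|^2
        y=y, sr=sr, n_mfcc=n_mfcc,              # + Mel filter bank + log + DCT
        n_fft=int(0.025 * sr),                  # 25 ms frame length
        hop_length=int(0.010 * sr),             # 10 ms frame shift
    )

clip = np.random.randn(22050).astype(np.float32)   # stand-in 1 s clip
print(mfcc_features(clip).shape)                   # (40, n_frames)
```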
In an embodiment of the present application, the step of sending the training-set audio features into the urban audio classification model for training, to obtain the trained urban audio classification model, includes:
inputting the training-set audio features into the convolution layer, and extracting the training-set key features through the convolution layer;
discarding, through the max-pooling layer, the part of the extracted training-set key features that cannot correctly express the feature information, completing the dimension reduction of the training-set key features and obtaining the dimension-reduced training-set audio features;
further extracting the dimension-reduced training-set audio features through repeatedly stacked convolution layers with residual-calculation identity mappings, and summarizing the further extracted audio features through the fully connected layer to obtain the summarized training-set audio features;
dividing the summarized training-set audio features into audio scenes of different categories, calculating the accuracy for the different categories of audio scenes with a softmax classifier, outputting the classification results of the training-set audio features, and completing the training of the urban audio classification model.
In an embodiment of the present application, the expression for the convolution layer extracting the training-set key features is:
h_1 = CONV(X)   (6)
wherein h_1 represents the training-set key features extracted by the convolution layer, CONV represents the convolution layer, and X represents the training-set audio features;
the expression for the dimension-reduced training-set audio features is:
h_2 = MAX_POOLING(h_1)   (7)
wherein h_2 represents the dimension-reduced training-set audio features and MAX_POOLING represents the max-pooling layer;
the expression for further extracting the dimension-reduced training-set audio features is:
F(h_2) = D(h_2) - S(h_2)   (8)
wherein F(h_2) represents the residual of the further extraction of the dimension-reduced training-set audio features, S(h_2) represents the output value of the shallow convolution layer, and D(h_2) represents the output value of the deep convolution layer;
the expression for summarizing the further extracted audio features through the fully connected layer to obtain the summarized training-set audio features is:
h_3 = FC(F(h_2))   (9)
wherein h_3 represents the summarized training-set audio features and FC represents the fully connected layer;
the expression for outputting the classification results of the training-set audio features is:
h_4 = softmax(h_3)   (10)
wherein h_4 represents the classification results of the training-set audio features and softmax represents the softmax classifier.
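Equations (6)-(10) compose as in the minimal sketch below; the module sizes are placeholder assumptions, and the single convolution standing in for the residual branch F(h_2) of equation (8) is added back onto the identity path S(h_2), so that D(h_2) = S(h_2) + F(h_2):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 16, 3, padding=1)
pool = nn.MaxPool2d(2)
branch = nn.Conv2d(16, 16, 3, padding=1)     # stands in for the residual F
fc = nn.Linear(16 * 20 * 22, 10)

X  = torch.randn(4, 1, 40, 44)               # batch of MFCC feature maps
h1 = torch.relu(conv(X))                     # (6)  h1 = CONV(X)
h2 = pool(h1)                                # (7)  h2 = MAX_POOLING(h1)
d  = h2 + branch(h2)                         # (8)  D(h2) = S(h2) + F(h2)
h3 = fc(d.flatten(1))                        # (9)  h3 = FC(F(h2))
h4 = torch.softmax(h3, dim=1)                # (10) h4 = softmax(h3)
```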
In an embodiment of the present application, the loss function of the convolutional neural network in the urban audio classification model is a cross-entropy loss function, whose expression is:

loss(r, class) = -log( e^{r[class]} / \sum_v e^{r_v} ) = -r[class] + log \sum_v e^{r_v}   (11)

wherein loss(r, class) represents the cross-entropy loss function, r represents the predicted classification result, class represents the sample label of the urban audio data, loss represents the loss function of the urban audio classification task, e represents the exponential constant, r[class] represents the classification score for the sample label class, and r_v represents the classification score belonging to category v in the sample label.
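As a quick sanity check (with made-up logits), formula (11) agrees with the cross-entropy provided by common deep-learning frameworks:

```python
import torch
import torch.nn.functional as F

r = torch.tensor([[1.2, -0.3, 0.7]])             # predicted scores for 3 classes
cls = torch.tensor([0])                          # sample label: class 0
manual = -r[0, 0] + torch.logsumexp(r[0], dim=0) # -r[class] + log sum_v e^{r_v}
assert torch.isclose(manual, F.cross_entropy(r, cls))
```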
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
according to the embodiment of the application, the urban audio classification method based on the residual calculation convolutional neural network can effectively solve the problem that the urban audio classification precision of the traditional deep learning neural network is not high, and improves the calculation efficiency and the classification precision of the urban audio.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from them without inventive effort by a person of ordinary skill in the art.
FIG. 1 shows a schematic step diagram of a method for urban audio classification based on residual calculation convolutional neural networks in an exemplary embodiment of the application;
FIG. 2 illustrates a schematic diagram of a convolutional neural network employing residual computation in an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a Mel spectrogram audio feature generation process in a method for urban audio classification based on residual calculation convolutional neural network in an exemplary embodiment of the application;
FIG. 4 is a flow chart of an MFCC (Mel-Frequency Cepstral Coefficients, mel-cepstrum coefficient) audio feature generation process in a method for urban audio classification based on a residual calculation convolutional neural network in an exemplary embodiment of the present application;
FIG. 5 shows a result graph of the classification-accuracy confusion matrix on test data of the Mel-cepstrum-coefficient audio features of urban audio for a 2D convolutional neural network in an exemplary embodiment of the present application;
FIG. 6 shows a result graph of the classification-accuracy confusion matrix on test data of the Mel-cepstrum-coefficient audio features of urban audio for the method of the present application in an exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are only schematic illustrations of embodiments of the present application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
In this exemplary embodiment, a method for classifying urban audio based on a convolutional neural network of residual calculation is provided first. Referring to fig. 1, the urban audio classification method of the convolutional neural network based on residual calculation may include: steps S101 to S105.
Step S101: and constructing a city audio classification model based on the convolution neural network of residual calculation.
Step S102: and carrying out data enhancement on the urban audio data, and dividing the urban audio data subjected to data enhancement into a training set and a testing set.
Step S103: respectively processing the training set and the testing set to obtain audio characteristics of the training set and audio characteristics of the testing set;
step S104: sending the training set audio features into the urban audio classification model for training to obtain a trained urban audio classification model;
step S105: and sending the audio features of the test set into the trained urban audio classification model for training so as to classify the audio features of the test set, and classifying the test set according to the classification result of the audio features of the test set.
The urban audio classification method of the convolutional neural network based on residual calculation can effectively solve the problem of low urban audio classification accuracy in traditional deep-learning neural networks, improving the classification accuracy of urban audio while also improving computational efficiency.
Hereinafter, each step of the above-described urban audio classification method of the convolutional neural network based on residual calculation in the present exemplary embodiment will be described in more detail with reference to fig. 1 to 6.
Example 1
Urban audio data is first input and subjected to data enhancement and preprocessing operations to divide the dataset. The audio data enhancement methods adopted are mainly audio tuning and audio noise addition. The audio tuning is mainly used to adjust the volume of the urban audio data, namely increasing the original urban audio data by 5 dB. The audio noise addition adds a random noise section to the enhanced signal with a damping coefficient of 0.4, so that the audio data can cover more scenes and the learnability of the data's audio features is enhanced.
In step S101, the described urban audio classification model includes multiple convolution layers, multiple pooling layers and multiple fully connected layers. The urban audio classification model optimizes the output result of each convolution layer by means of residual calculation.
The described residual calculation includes an identity mapping for each convolution layer in the urban audio classification model;
the calculation formula of the convolution layer is as follows:
wherein,represents the first c Jth of layer convolution layer c Individual nodes, l c Indicating the number of layers of the convolution layer, g indicating the activation function, represents the first c Jth of layer convolution layer c Personal node and ith c Convolution kernel of individual nodes,>represents the j th c Personal node and ith c Offset of individual nodes, M ci Representing an audio information mapping matrix in the convolutional neural network, e representing an exponential constant;
the calculation formula of the pooling layer is as follows:
wherein,represents the first p Layer pooling layer j p Personal node->Represents the first p -j of layer 1 pooling layer p Individual nodes, l p Indicates the number of layers of the pooling layer, < >>Represents the first p Layer pooling layer j p Weight of individual node, down () represents sampling function, n represents size of input data, down (n) represents size of output data, < ->p represents the size of padding, f represents the window size of the pooling layer, s represents the step size, +.>Represents the first p Layer pooling layer j p Offset of individual nodes;
the calculation formula of the full connection layer is as follows:
wherein,represents the first f J of layer full connection layer f Personal node->Represents the first f -1 ith of full link layer f Individual nodes, l f Indicating the number of layers of the fully connected layer->Represents the ith f The next node and j f Weights of individual nodes, weight->Represents the ith f The next node and j f Offset of individual nodes, M f Representing the mapping relation of the full connection layer.
In step S102, the step of performing data enhancement on the urban audio data and dividing the data-enhanced urban audio data into a training set and a test set includes:
performing audio tuning or audio noise adding on the urban audio data;
the training set and the testing set are divided according to a preset proportion.
The audio tuning includes: adjusting the volume of the urban audio data, namely increasing the original urban audio data by x dB, wherein x ∈ [-10, 10].
The calculation formula of the audio tuning is:
f'(t) = f(t) + x   (4)
wherein f'(t) represents the urban audio data after audio tuning, f(t) represents the original data of the urban audio data, and x represents the increase of the original urban audio data by x dB, x ∈ [-10, 10].
The audio noise addition includes: adding a random noise section to the enhanced signal, with the damping coefficient set to a preset value.
The calculation formula of the audio noise addition is:
f_r(t) = f(t) + λ · n̄(t)   (5)
wherein f_r(t) represents the urban audio data after noise addition, f(t) represents the original data of the urban audio data, λ represents the damping coefficient, and n̄(t) represents the noise source used to enhance the urban audio data.
In step S103, the steps of processing the training set and the test set to obtain the training set audio feature and the test set audio feature respectively include: and respectively carrying out pre-emphasis, framing and windowing, fast Fourier transformation, mel scale conversion and discrete cosine transformation on the training set and the testing set to obtain the audio characteristics of the training set and the audio characteristics of the testing set.
In step S104, in the urban audio classification model, the key features of the preprocessed urban audio data are first extracted by convolution-layer learning; the max-pooling layer then discards the part of the extracted features that cannot correctly express the feature information, reducing the order of magnitude of the parameters; the repeatedly stacked deep convolution layers with residual-calculation identity mappings then further extract the features of the urban audio data; the fully connected layer summarizes the feature information extracted from the audio features; finally, a softmax classifier calculates the accuracy with which each piece of audio data is classified into the different audio-scene labels, and the classification result is output.
Specifically, the step of sending the training set audio features into the urban audio classification model for training to obtain a trained urban audio classification model comprises the following steps:
inputting the training-set audio features into the convolution layer, and extracting the training-set key features through the convolution layer;
discarding, through the max-pooling layer, the part of the extracted training-set key features that cannot correctly express the feature information, completing the dimension reduction of the training-set key features and obtaining the dimension-reduced training-set audio features;
further extracting the dimension-reduced training-set audio features through repeatedly stacked convolution layers with residual-calculation identity mappings, and summarizing the further extracted audio features through the fully connected layer to obtain the summarized training-set audio features;
dividing the summarized training-set audio features into audio scenes of different categories, calculating the accuracy for the different categories of audio scenes with a softmax classifier, outputting the classification results of the training-set audio features, and completing the training of the urban audio classification model.
Further, the expression for the convolution layer extracting the training-set key features is:
h_1 = CONV(X)   (6)
wherein h_1 represents the training-set key features extracted by the convolution layer, CONV represents the convolution layer, and X represents the training-set audio features.
The expression for the dimension-reduced training-set audio features is:
h_2 = MAX_POOLING(h_1)   (7)
wherein h_2 represents the dimension-reduced training-set audio features and MAX_POOLING represents the max-pooling layer.
The expression for further extracting the dimension-reduced training-set audio features is:
F(h_2) = D(h_2) - S(h_2)   (8)
wherein F(h_2) represents the residual of the further extraction, S(h_2) represents the output value of the shallow convolution layer, and D(h_2) represents the output value of the deep convolution layer. F(h) = D(h) - S(h) is the residual-calculated nonlinear transformation between convolution layers: when the audio features learned by the shallow output S(h) are already optimal, F(h) automatically approaches 0, so that S(h) is propagated along an identity path. In this way, when the shallow output is good enough, the remaining layers of the deep network perform identity mappings, which prevents further training from degrading the network's result.
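The identity mapping described here corresponds to the standard residual block sketched below (the channel width and ReLU activation are assumptions). The two-convolution branch learns F(h), and the block outputs S(h) + F(h), so the branch can be driven to zero whenever the shallow output is already optimal:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.branch = nn.Sequential(              # learns the residual F(h)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, h):
        # identity path S(h) plus learned residual F(h); if h is already
        # optimal, the branch can approach zero and h passes through intact
        return torch.relu(h + self.branch(h))

out = ResidualBlock()(torch.randn(1, 64, 16, 16))
```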
The expression for summarizing the further extracted audio features through the fully connected layer to obtain the summarized training-set audio features is:
h_3 = FC(F(h_2))   (9)
wherein h_3 represents the summarized training-set audio features and FC represents the fully connected layer.
The expression for outputting the classification results of the training-set audio features is:
h_4 = softmax(h_3)   (10)
wherein h_4 represents the classification results of the training-set audio features and softmax represents the softmax classifier.
The training process of the urban audio classification network is as follows:
Firstly, the public urban audio dataset UrbanSound8K is used and divided into a training set and a test set at a ratio of 7:3.
During training, the total number of epochs is set to 200 and the batch size to 32; Adam is selected as the optimizer; a learning-rate decay strategy is adopted, with an initial learning rate of 0.001 and a weight decay of 0.0005. For classification problems, the loss function suited to the overall classification network is the cross-entropy loss function:
loss(r, class) = -log( e^{r[class]} / \sum_v e^{r_v} ) = -r[class] + log \sum_v e^{r_v}   (11)
wherein loss(r, class) represents the cross-entropy loss function, r represents the predicted classification result, class represents the sample label of the urban audio data, loss represents the loss function of the urban audio classification task, e represents the exponential constant, r[class] represents the classification score for the sample label class, and r_v represents the classification score belonging to category v in the sample label.
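A self-contained training-loop sketch with the quoted hyperparameters follows; the stand-in model, the random feature tensors and the step-decay schedule are assumptions, since the patent fixes only the epoch count, batch size, optimizer, initial learning rate and weight decay:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# stand-in training split (70% of a dummy UrbanSound8K-like feature set)
features = torch.randn(700, 1, 40, 44)
labels = torch.randint(0, 10, (700,))
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
criterion = nn.CrossEntropyLoss()                 # formula (11)

for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                              # learning-rate decay strategy
```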
Example 2
This embodiment of the urban audio classification method of the convolutional neural network based on residual calculation differs from Embodiment 1 in that:
the specific implementation process of step S102 includes:
Firstly, the public urban audio dataset UrbanSound8K is input and audio tuning is performed on it: the volume of the urban audio data is adjusted so that the original data is increased by 5 dB. A random noise section with a damping coefficient of 0.4 is then added to the enhanced signal to complete the audio noise addition, so that the audio data can cover more scenes and the learnability of the data's audio features is enhanced;
the preprocessed urban audio dataset is divided into a training set and a test set at a ratio of 7:3.
The specific implementation process of step S103 includes:
Pre-emphasis is first performed on the preprocessed urban audio dataset to boost the high-frequency band of the audio signals and thereby flatten their spectrum;
a framing operation is then performed to divide the audio into short segments convenient for analysis; meanwhile, to eliminate possible signal discontinuities between frames, the audio is windowed, with a frame length of 25 ms and a frame shift of 10 ms;
because the energy distribution in the frequency domain reveals the characteristics of the audio better than the time-domain waveform, a fast Fourier transform is applied to the audio data and the squared magnitude of the spectrum is taken;
finally, logarithmic energies are obtained after filtering through a Mel-frequency filter bank, yielding the Mel-spectrum audio features of the urban audio data.
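An illustrative librosa sketch of this Mel-spectrum front end (Embodiment 2 stops before the discrete cosine transform); the frame and hop lengths follow the 25 ms / 10 ms above, while the Mel-band count is an assumption:

```python
import numpy as np
import librosa

sr = 22050
clip = np.random.randn(sr).astype(np.float32)         # stand-in 1 s clip
y = librosa.effects.preemphasis(clip)                 # flatten the spectrum
mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                     n_fft=int(0.025 * sr),
                                     hop_length=int(0.010 * sr),
                                     n_mels=64)
log_mel = librosa.power_to_db(mel)                    # logarithmic energy
```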
The specific implementation process of step S104 includes:
the urban audio classification model comprises a plurality of convolution layers, a plurality of pooling layers and a plurality of full connection layers;
the described residual calculation includes an identity mapping for each layer of convolutional layers in the urban audio classification model;
in the urban audio classification model, for the Mel-spectrum audio features of the preprocessed urban audio data, the key features are first extracted by a learned convolution layer with a 7×7 kernel, a stride of 2 and a convolution padding of 3;
a max-pooling layer with a 3×3 kernel, a stride of 2 and a pooling padding of 1 then discards the part of the extracted features that cannot accurately express the feature information, reducing the order of magnitude of the parameters;
next, two convolution layers with 3×3 kernels and a stride of 2 are constructed, and an identity mapping is added across them according to the residual-calculation structure described in Embodiment 1, so that together they form a convolution group with a residual-calculation structure used to further extract the audio features of the urban audio data;
four such groups of convolution layers with residual-calculation structures are stacked, making the urban audio classification model a deep neural network;
the fully connected layer summarizes the information of the audio features extracted from the Mel-spectrum audio features, a softmax classifier calculates the accuracy with which the audio data is classified into the labels of the different audio-scene categories, and the classification result is output.
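Putting the pieces together, a hedged PyTorch reconstruction of the described network follows. The kernel sizes, strides and paddings of the stem convolution and pooling layer, and the four stacked residual groups of two 3×3 convolutions, come from the description above; the channel widths and the 1×1 projection on the identity path (needed to keep shapes compatible with stride 2) are assumptions:

```python
import torch
import torch.nn as nn

class ResidualGroup(nn.Module):
    """Two 3x3 convolutions with an identity mapping added across them."""
    def __init__(self, cin, cout, stride=2):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=stride, padding=1), nn.ReLU(True),
            nn.Conv2d(cout, cout, 3, padding=1),
        )
        self.shortcut = nn.Conv2d(cin, cout, 1, stride=stride)  # projected identity

    def forward(self, h):
        return torch.relu(self.shortcut(h) + self.branch(h))

class UrbanAudioNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.stem = nn.Sequential(                      # 7x7, stride 2, padding 3
            nn.Conv2d(1, 64, 7, stride=2, padding=3), nn.ReLU(True),
            nn.MaxPool2d(3, stride=2, padding=1),       # 3x3, stride 2, padding 1
        )
        self.body = nn.Sequential(*[ResidualGroup(64 * 2**i, 64 * 2**(i + 1))
                                    for i in range(4)]) # 4 stacked residual groups
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(1024, n_classes))

    def forward(self, x):
        return self.head(self.body(self.stem(x)))

print(UrbanAudioNet()(torch.randn(2, 1, 64, 64)).shape)  # torch.Size([2, 10])
```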
A schematic diagram of the convolutional neural network using residual calculation is shown in FIG. 2.
The training process of the urban audio classification network in step S104 is as follows:
During training, the total number of epochs is set to 200 and the batch size to 32; Adam is selected as the optimizer; a learning-rate decay strategy is adopted, with an initial learning rate of 0.001 and a weight decay of 0.0005. For the classification problem, the loss function suited to the overall classification network is the cross-entropy loss function, so it is selected as the loss function of the urban audio classification model, and the model is trained on the Mel-spectrum audio features.
The specific implementation process of step S105 includes:
The Mel-spectrum audio features of the test data are sent into the trained urban audio classification model; the probability that each piece of audio belongs to each classification label is calculated; the class label with the maximum probability is taken as the final classification result of each piece of audio; and the final overall audio classification accuracy is calculated and output.
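A minimal sketch of this test step (with a stand-in classifier and made-up test tensors):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(40 * 44, 10))  # stand-in classifier
test_x = torch.randn(30, 1, 40, 44)                          # stand-in test features
test_y = torch.randint(0, 10, (30,))

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(test_x), dim=1)   # probability per class label
    preds = probs.argmax(dim=1)                   # label with maximum probability
accuracy = (preds == test_y).float().mean().item()
print(f"total audio classification accuracy: {accuracy:.3f}")
```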
Example 3
This embodiment of the urban audio classification method of the convolutional neural network based on residual calculation differs from Embodiment 1 in that:
the specific implementation process of step S102 includes:
Firstly, the public urban audio dataset UrbanSound8K is input and audio tuning is performed on it: the volume of the urban audio data is adjusted so that the original data is increased by 5 dB. A random noise section with a damping coefficient of 0.4 is then added to the enhanced signal to complete the audio noise addition, so that the audio data can cover more scenes and the learnability of the data's audio features is enhanced;
the preprocessed urban audio dataset is divided into a training set and a test set at a ratio of 7:3.
The specific implementation process of step S103 includes:
Pre-emphasis is first performed on the preprocessed urban audio dataset to boost the high-frequency band of the audio signals and thereby flatten their spectrum;
a framing operation is then performed to divide the audio into short segments convenient for analysis; meanwhile, to eliminate possible signal discontinuities between frames, the audio is windowed, with a frame length of 25 ms and a frame shift of 10 ms;
because the energy distribution in the frequency domain reveals the characteristics of the audio better than the time-domain waveform, a fast Fourier transform is applied to the audio data and the squared magnitude of the spectrum is taken;
finally, logarithmic energies are obtained after filtering through a Mel-frequency filter bank and a discrete cosine transform is applied, yielding the Mel-cepstrum-coefficient (MFCC) audio features of the urban audio data.
The specific implementation process of step S104 includes:
the urban audio classification model comprises a plurality of convolution layers, a plurality of pooling layers and a plurality of full connection layers;
the described residual calculation includes an identity mapping for each layer of convolutional layers in the urban audio classification model;
in the urban audio classification model, for the Mel-cepstrum-coefficient audio features of the preprocessed urban audio data, the key features are first extracted by a learned convolution layer with a 7×7 kernel, a stride of 2 and a convolution padding of 3;
a max-pooling layer with a 3×3 kernel, a stride of 2 and a pooling padding of 1 then discards the part of the extracted features that cannot accurately express the feature information, reducing the order of magnitude of the parameters;
next, two convolution layers with 3×3 kernels and a stride of 2 are constructed, and an identity mapping is added across them according to the residual-calculation structure described in Embodiment 1, so that together they form a convolution group with a residual-calculation structure used to further extract the features of the urban audio data;
four such groups of convolution layers with residual-calculation structures are stacked, making the urban audio classification model a deep neural network;
the fully connected layer summarizes the feature information extracted from the Mel-cepstrum-coefficient audio features, a softmax classifier calculates the accuracy with which the audio data is classified into the labels of the different audio-scene categories, and the classification result is output.
The training process of the urban audio classification network in step S104 is as follows:
During training, the total number of epochs is set to 200 and the batch size to 32; Adam is selected as the optimizer; a learning-rate decay strategy is adopted, with an initial learning rate of 0.001 and a weight decay of 0.0005. For the classification problem, the loss function suited to the overall classification network is the cross-entropy loss function, so it is selected as the loss function of the urban audio classification model, and the model is trained on the Mel-cepstrum-coefficient audio features.
The specific implementation process of step S105 includes:
The Mel-cepstrum-coefficient audio features of the test data are sent into the trained urban audio classification model; the probability that each piece of audio belongs to each classification label is calculated; the class label with the maximum probability is taken as the final classification result of each piece of audio data; and the final overall audio classification accuracy is calculated and output. A result graph of the classification-accuracy confusion matrix on the test data of the Mel-cepstrum-coefficient audio features of the urban audio is drawn, as shown in FIG. 6.
For the constructed urban audio classification network framework using the convolutional neural network with residual calculation, the Mel spectrogram and MFCC audio features of the public dataset UrbanSound8K are input separately for training; the Mel-spectrogram feature generation process is shown in FIG. 3 and the MFCC feature generation process in FIG. 4. The Mel spectrum and the MFCC of the test data are then fed separately into the trained urban audio classification model, and the obtained classification accuracy results are as follows:
As shown in FIGS. 5 and 6, taking as examples the result graphs of the classification-accuracy confusion matrix on test data of the MFCC audio features of urban audio for a 2D convolutional neural network and for the method of the present application, the horizontal-axis scale 0 to 9 represents the predicted classification label of the urban audio test data and the vertical-axis scale 0 to 9 represents the actual label. The darker a cell on the diagonal of the result graph, the higher the accuracy with which urban audio data of that label is correctly classified. The result graphs of the classification-accuracy confusion matrices in FIGS. 5 and 6 therefore show that the method of the present application better improves the accuracy of urban audio classification results.
TABLE 1
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, one skilled in the art may combine the different embodiments or examples described in this specification.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.

Claims (10)

1. The urban audio classification method of the convolutional neural network based on residual calculation is characterized by comprising the following steps of:
constructing a city audio classification model based on a convolution neural network of residual calculation;
carrying out data enhancement on the urban audio data, and dividing the urban audio data subjected to data enhancement into a training set and a testing set;
processing the training set and the testing set respectively to obtain training set audio characteristics and testing set audio characteristics;
sending the training set audio features into the urban audio classification model for training to obtain the trained urban audio classification model;
and sending the test-set audio features into the trained urban audio classification model for testing, so as to classify the test-set audio features, and classifying the test set according to the classification results of the test-set audio features.
2. The method for urban audio classification based on residual calculation of convolutional neural network of claim 1,
the urban audio classification model comprises: multiple convolution layers, multiple pooling layers and multiple fully connected layers;
the urban audio classification model uses the residual calculation to optimize the output result of each convolution layer;
wherein the residual calculation comprises an identity mapping for each convolution layer in the urban audio classification model.
3. The method for urban audio classification based on residual calculation of convolutional neural network of claim 1,
the calculation formula of the convolution layer is:

x_{j_c}^{l_c} = g( \sum_{i_c \in M_{ci}} x_{i_c}^{l_c-1} * k_{i_c j_c}^{l_c} + b_{i_c j_c}^{l_c} )   (1)

wherein x_{j_c}^{l_c} represents the j_c-th node of the l_c-th convolution layer, l_c represents the layer index of the convolution layer, g represents the activation function, k_{i_c j_c}^{l_c} represents the convolution kernel between the i_c-th and j_c-th nodes, b_{i_c j_c}^{l_c} represents the offset between the j_c-th and i_c-th nodes, M_{ci} represents the audio information mapping matrix in the convolutional neural network, and e represents the exponential constant used in the activation function;
the calculation formula of the pooling layer is:

x_{j_p}^{l_p} = \beta_{j_p}^{l_p} · down( x_{j_p}^{l_p-1} ) + b_{j_p}^{l_p},  with  down(n) = (n + 2p - f)/s + 1   (2)

wherein x_{j_p}^{l_p} represents the j_p-th node of the l_p-th pooling layer, x_{j_p}^{l_p-1} represents the j_p-th node of the (l_p-1)-th pooling layer, l_p represents the layer index of the pooling layer, \beta_{j_p}^{l_p} represents the weight of the j_p-th node of the l_p-th pooling layer, down() represents the sampling function, n represents the size of the input data, down(n) represents the size of the output data, p represents the padding size, f represents the window size of the pooling layer, s represents the step size, and b_{j_p}^{l_p} represents the offset of the j_p-th node of the l_p-th pooling layer;
the calculation formula of the fully connected layer is:

x_{j_f}^{l_f} = g( \sum_{i_f \in M_f} w_{i_f j_f} · x_{i_f}^{l_f-1} + b_{i_f j_f} )   (3)

wherein x_{j_f}^{l_f} represents the j_f-th node of the l_f-th fully connected layer, x_{i_f}^{l_f-1} represents the i_f-th node of the (l_f-1)-th fully connected layer, l_f represents the layer index of the fully connected layer, w_{i_f j_f} represents the weight between the i_f-th and j_f-th nodes, b_{i_f j_f} represents the offset between the i_f-th and j_f-th nodes, and M_f represents the mapping relation of the fully connected layer.
4. The method for classifying urban audio by using a convolutional neural network based on residual calculation according to claim 3, wherein the step of performing data enhancement on the urban audio data and dividing the data-enhanced urban audio data into a training set and a test set comprises the steps of:
performing audio tuning or audio noise addition on the urban audio data;
the training set and the testing set are divided according to a preset proportion.
5. The method of urban audio classification based on the residual calculation convolutional neural network of claim 4, wherein the audio tuning comprises: adjusting the volume of the urban audio data, namely increasing the original urban audio data by x dB, wherein x ∈ [-10, 10];
the calculation formula of the audio tuning is:
f'(t) = f(t) + x   (4)
wherein f'(t) represents the urban audio data after audio tuning, f(t) represents the original data of the urban audio data, and x represents the increase of the original urban audio data by x dB, x ∈ [-10, 10].
6. The method for urban audio classification based on the residual calculation convolutional neural network of claim 4, wherein the audio noise addition comprises: adding a random noise section to the enhanced signal, with the damping coefficient set to a preset value;
the calculation formula of the audio noise addition is:
f_r(t) = f(t) + λ · n̄(t)   (5)
wherein f_r(t) represents the urban audio data after noise addition, f(t) represents the original data of the urban audio data, λ represents the damping coefficient, and n̄(t) represents the noise source used to enhance the urban audio data.
7. The method of claim 5 or 6, wherein the step of processing the training set and the test set to obtain training set audio features and test set audio features, respectively, comprises:
and respectively carrying out pre-emphasis, framing and windowing, fast Fourier transformation, mel scale conversion and discrete cosine transformation on the training set and the testing set to obtain the audio characteristics of the training set and the audio characteristics of the testing set.
8. The method for classifying urban audio by using a convolutional neural network based on residual calculation according to claim 7, wherein the step of training the training set audio features in the urban audio classification model to obtain the trained urban audio classification model comprises the following steps:
inputting the training set audio features into the convolution layer, and extracting training set key features through the convolution layer;
discarding, through the max-pooling layer, the part of the extracted training-set key features that cannot correctly express the feature information, completing the dimension reduction of the training-set key features and obtaining the dimension-reduced training-set audio features;
further extracting the dimension-reduced training-set audio features through repeatedly stacked convolution layers with residual-calculation identity mappings, and summarizing the further extracted audio features through the fully connected layer to obtain the summarized training-set audio features;
dividing the summarized training-set audio features into audio scenes of different categories, calculating the accuracy for the different categories of audio scenes with a softmax classifier, outputting the classification results of the training-set audio features, and completing the training of the urban audio classification model.
9. The method for urban audio classification based on residual calculation of convolutional neural network of claim 8,
the expression for the convolution layer extracting the training-set key features is:
h_1 = CONV(X)   (6)
wherein h_1 represents the training-set key features extracted by the convolution layer, CONV represents the convolution layer, and X represents the training-set audio features;
the expression for the dimension-reduced training-set audio features is:
h_2 = MAX_POOLING(h_1)   (7)
wherein h_2 represents the dimension-reduced training-set audio features and MAX_POOLING represents the max-pooling layer;
the expression for further extracting the dimension-reduced training-set audio features is:
F(h_2) = D(h_2) - S(h_2)   (8)
wherein F(h_2) represents the residual of the further extraction of the dimension-reduced training-set audio features, S(h_2) represents the output value of the shallow convolution layer, and D(h_2) represents the output value of the deep convolution layer;
the expression for summarizing the further extracted audio features through the fully connected layer to obtain the summarized training-set audio features is:
h_3 = FC(F(h_2))   (9)
wherein h_3 represents the summarized training-set audio features and FC represents the fully connected layer;
the expression for outputting the classification results of the training-set audio features is:
h_4 = softmax(h_3)   (10)
wherein h_4 represents the classification results of the training-set audio features and softmax represents the softmax classifier.
10. The method for classifying urban audio by using a convolutional neural network based on residual calculation according to claim 9, wherein the loss function of the convolutional neural network in the urban audio classification model is a cross-entropy loss function, whose expression is:

loss(r, class) = -log( e^{r[class]} / \sum_v e^{r_v} ) = -r[class] + log \sum_v e^{r_v}   (11)

wherein loss(r, class) represents the cross-entropy loss function, r represents the predicted classification result, class represents the sample label of the urban audio data, loss represents the loss function of the urban audio classification task, e represents the exponential constant, r[class] represents the classification score for the sample label class, and r_v represents the classification score belonging to category v in the sample label.
CN202311833985.1A 2023-12-28 2023-12-28 Urban audio classification method of convolutional neural network based on residual calculation Pending CN117789758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311833985.1A CN117789758A (en) 2023-12-28 2023-12-28 Urban audio classification method of convolutional neural network based on residual calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311833985.1A CN117789758A (en) 2023-12-28 2023-12-28 Urban audio classification method of convolutional neural network based on residual calculation

Publications (1)

Publication Number Publication Date
CN117789758A (en) 2024-03-29

Family

ID=90390571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311833985.1A Pending CN117789758A (en) 2023-12-28 2023-12-28 Urban audio classification method of convolutional neural network based on residual calculation

Country Status (1)

Country Link
CN (1) CN117789758A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118585924A (en) * 2024-08-05 2024-09-03 杭州爱华仪器有限公司 Neural network noise source classification method and device based on model fusion



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination