CN117524252B - Light-weight acoustic scene perception method based on drunken model - Google Patents
- Publication number
- CN117524252B (Application No. CN202311505530.7A)
- Authority
- CN
- China
- Prior art keywords
- model
- drunken
- convolution
- training
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a lightweight acoustic scene perception method based on a drunken model, comprising the following steps: conventional audio feature extraction; processing the conventional audio features with a feature conversion module to obtain drunken features; training a conventional model; applying channel reduction to the conventional model and adding a frequency grouping fusion convolution to obtain an initial version of the drunken model; training the initial drunken model produced by the guiding module; using the conventional model as the teacher model and the drunken model as the student model, and improving the performance of the student model through knowledge distillation; evaluating the fused lightweight model to obtain an evaluation result; optimizing and adjusting the lightweight model according to the evaluation result to obtain the final drunken model; and inputting the conventional audio features into the drunken model to obtain an acoustic scene perception result. The advantages of the invention are that it reduces training time and the consumption of computing resources while achieving higher accuracy and lower loss.
Description
Technical Field
The invention relates to the technical field of acoustic scene classification, in particular to a lightweight acoustic scene perception method based on a drunken model.
Background
Acoustic scene classification is one of the important applications of deep convolutional neural networks in the audio field: by simulating the human ability to perceive the external environment, it assigns the surrounding environment to the correct class, and it is widely used in audio monitoring, intelligent driver assistance, voiceprint recognition and other fields. Most acoustic scene classification tasks adopt a top-down serial pipeline in which the extracted feature information is fed directly into a neural network model for prediction, but this approach has some limitations. The mainstream networks are still deep convolutional neural networks, and several high-accuracy lightweight models have also been proposed. For example, the efficient BC-ResNet architecture achieves excellent performance by extracting two feature maps specific to the frequency and time dimensions through a two-dimensional convolution over frequency and a one-dimensional convolution over time. BC-Res2Net, which fuses the BC-ResNet and Res2Net structures, can effectively acquire frequency- and time-dimension features through broadcast learning, operates at multiple scales, and shows remarkable performance. The MobileNet series and ShuffleNet, proposed in recent years, achieve lightweight and efficient networks by introducing depthwise convolution and channel shuffle (Shuffle) operations.
Acoustic scene classification mainly comprises two parts, feature extraction and classification model construction, in which the extracted feature information is fed directly into a neural network model for prediction. The currently prevailing audio features are log-Mel spectrum (log-mel) features, MFCCs (Mel Frequency Cepstral Coefficients), temporal first-order and second-order differences, and so on. Some acoustic scene classification tasks adopt a single feature extraction mode; although this saves computation, it may ignore important information and lead to poor classification performance. Using multiple feature extraction modes, on the other hand, may introduce redundant feature maps and unnecessary computational cost.
In terms of models, the networks used for acoustic scene classification tasks are limited by their structure: computational overhead grows as the models deepen, which is detrimental to deployment on resource-constrained devices. In addition, most of these tasks use a single neural network model, which may not fully extract the key audio features, and since an optimal model architecture and hyper-parameter combination have not yet been determined, erroneous scene-category decisions are easily made.
Reference to the literature
[1] J. Hu, L. Shen, G. Sun, "Squeeze-and-excitation networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132-7141, 2018.
[2] K. He, X. Zhang, S. Ren, et al., "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778, 2016.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a lightweight acoustic scene perception method based on a drunken model.
In order to achieve the above object, the present invention adopts the following technical scheme:
A lightweight acoustic scene perception method based on a drunken model comprises the following steps:
1) Conventional audio feature extraction: the original audio data are collected and converted into conventional audio features using the log-Mel spectrum together with its first-order and second-order differences.
2) Drunken feature extraction: a feature conversion module based on an attention mechanism is designed with reference to the Squeeze-and-Excitation (SE) attention module. The conventional audio features are processed by the feature conversion module, and drunken features are obtained after redundant information is removed.
3) Conventional model training: training is performed using the structure and parameters of the conventional model. The conventional audio features are input so that the model learns scene perception capability.
4) Guiding module operation: channel reduction is applied to the conventional model and a frequency grouping fusion convolution is added to obtain the initial version of the drunken model. Through experiments and adjustment, the optimal channel number and grouped convolution settings are found.
5) Drunken model training: training is performed on the initial drunken model obtained from the guiding module. Similar to conventional model training, the drunken features are input and the drunken model is trained and optimized.
6) Fusion module operation: the conventional model serves as the teacher model and the drunken model as the student model, and the performance of the student model is improved by knowledge distillation. The distillation loss is calculated from the soft-label predictions of the student and teacher models, and the final loss is the weighted sum of the distillation loss and the hard-label loss.
7) Model evaluation: the fused lightweight model is evaluated, and its accuracy and performance on the acoustic scene perception task are checked to obtain an evaluation result.
8) Model optimization: the lightweight model is optimized and adjusted according to the evaluation result to obtain the final drunken model.
9) The drunken features are input into the drunken model to obtain the acoustic scene perception result.
Further, in the drunken feature extraction of step 2): a global average pooling operation reduces the spatial dimension of each channel of the conventional features to a scalar, compressing the information of each channel into its global average value; this captures the global statistical information within each channel, reduces the amount of computation, and extracts the intra-channel feature correlation.
The correlation among channels is then learned through a fully connected layer, and the weights of the different channel-dimension features are obtained with a sigmoid activation function. Finally, the weights are multiplied with each original feature to obtain the drunken features.
Further, the drunken model includes: frequency grouping fusion convolution (FreGroupConv2d), grouped convolution (GroupConv2d), improved bottleneck blocks (FCresnet_block), convolution layers, and pooling layers.
The operation flow for constructing the drunken model is as follows: feature information is first enriched by the frequency grouping fusion convolution, and channel expansion and downsampling are then performed through convolution and pooling layers to extract higher-level feature representations. The model then passes through a number of stacked improved bottleneck blocks with different channel numbers and strides, and Dropout is added to prevent overfitting. The outputs of each group are concatenated in the channel dimension so that the numbers of input and output channels stay consistent. The classification result is output after global average pooling and a fully connected layer.
The improved bottleneck block is obtained by modifying the bottleneck block of the residual network ResNet and mainly consists of grouped convolution: the 3×3 convolution in the bottleneck block is replaced by a grouped convolution. The grouped convolution uniformly divides the channel dimension into groups, and each group undergoes a 3×3 convolution.
Further, the frequency grouping fusion convolution is placed at the first layer of the drunken model. It divides the frequency dimension of the input features into several branches, each processed by a 3×3 convolution layer. Every branch except the first also receives feature information from the previous branch, so features are extracted progressively from low frequency to high frequency. The outputs of the branches are concatenated in the channel dimension and then fed to the next stage.
Compared with the prior art, the invention has the advantages that:
1. Improved model performance: introducing the feature conversion module and the fusion module improves the performance and accuracy of the model. The feature conversion module removes redundant information, reduces the complexity of the model, and improves the discriminability and characterization capability of the features. The fusion module improves the performance of the student model through knowledge distillation, bringing it close to that of the teacher model. Even with limited resources, higher accuracy and lower loss can still be obtained: the accuracy is improved to 45.2% at best, and the loss is reduced by 0.634 compared with the baseline.
2. Lightweight model: the parameter count and computational complexity of the lightweight model are reduced through channel reduction, grouped convolution and similar operations. This improves the running efficiency of the model and makes it suitable for resource-constrained environments and devices. The drunken model reaches 41.5% accuracy with only 442.67K parameters, surpassing mainstream lightweight models.
3. Guiding module operation: the guiding module provides the initial version of the drunken model, reducing training time and the consumption of computing resources. Through experiments and adjustment, the optimal channel number and grouped convolution settings are found, further optimizing the performance of the model.
Drawings
Fig. 1 is a general structural diagram of an embodiment of the present invention.
Fig. 2 is a flowchart of drunken feature extraction according to an embodiment of the present invention.
Fig. 3 is a structural diagram of the drunken model according to an embodiment of the present invention.
Fig. 4 is a structural diagram of the frequency grouping fusion convolution in an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the accompanying drawings and by way of examples in order to make the objects, technical solutions and advantages of the invention more apparent.
1. Overall structure of the lightweight acoustic scene perception method based on the drunken model
As shown in Fig. 1, the overall structure comprises two branches and a module that associates the two branches. The three parts are described in detail below:
The first branch simulates the modeling process of conventional behavior: conventional features are extracted from the input data with existing feature extraction methods, and the invention adopts the log-Mel spectrum together with its first-order and second-order differences. The extracted features are then input into the conventional model for training to obtain soft-label results. The features and model of this branch have higher complexity, and its main purpose is to obtain higher acoustic scene classification performance.
The second branch simulates the modeling process of drunken behavior, i.e. the perception of the surrounding environment by a drunken person. The conventional features from the first branch pass through the feature conversion module to obtain more compact drunken features. Compared with the conventional features, the drunken features discard part of the redundant feature information and focus more on the information at key positions. The drunken features are then input into the drunken model, which is derived from the conventional model by the guiding module: a layer of frequency grouping fusion convolution is added on top of the conventional model, and the representation capability of the features is enhanced by dividing the frequency dimension and reusing features across the resulting sub-bands. In addition, the width of the drunken model is reduced to achieve the lightweight goal.
The fusion module is the bridge that combines the two branches. A knowledge distillation strategy is chosen as the model compression means, so that, without adding excessive computational cost, the low-complexity, lower-performing drunken model can be improved under the guidance of the high-complexity, better-performing conventional model, and the two jointly complete the scene classification task.
2. Feature extraction section
2.1 Conventional features
Although new speech features are frequently designed for audio-related fields (e.g., speech recognition, sound event detection, information retrieval), the present invention favors mature and perceptually motivated feature representations. According to human auditory characteristics, the human ear is more sensitive to differences between low frequencies and does not perceive frequency linearly, so the invention first considers the log-Mel spectrum. The log-Mel spectrum is a common audio feature that contains time-domain and frequency-domain information as well as perceptually relevant amplitude information, and the Mel scale matches the auditory characteristics of the human ear.
In addition, because the speech signal is continuous in the time domain, the feature information extracted frame by frame only reflects the characteristics of that frame. To make the features better reflect temporal continuity, dimensions carrying information about neighboring frames can be added to the feature representation, and first-order and second-order differences are commonly used for this purpose. Therefore, the first-order and second-order differences are computed on top of the log-Mel spectrum and concatenated in the channel dimension. The final conventional feature is the concatenation of the log-Mel spectrum feature, the first-order difference feature and the second-order difference feature along a three-dimensional channel axis.
2.2 Drunken features
The invention introduces an attention mechanism in the feature extraction stage to suppress redundant information in the conventional features and enhance the useful information. Referring to part of the structure of the Squeeze-and-Excitation (SE) attention module [1], a feature conversion module based on an attention mechanism is designed. The conventional features pass through the feature conversion module to obtain features with the redundant information removed, which are called drunken features. The drunken features are lighter than the conventional features because the feature conversion module helps remove some similar features.
As shown in Fig. 2, the feature conversion module first reduces the spatial dimension of each channel to a scalar by a global average pooling operation to obtain the global average value of each channel; the purpose of this operation is to compress each channel so as to capture its global statistics while reducing the amount of computation. The correlation among channels is then learned through a fully connected layer, the weights of the different channel-dimension features are obtained with a sigmoid activation function, and finally the weights are multiplied with each original feature to obtain the drunken features.
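For illustration, a minimal Keras sketch of this feature conversion module is given below. The two-layer excitation structure and the reduction ratio follow the original SE design [1] and are assumptions; the patent text itself only specifies global average pooling, a fully connected layer and a sigmoid.

```python
import tensorflow as tf
from tensorflow.keras import layers

def feature_conversion_module(inputs, reduction=4):
    """SE-style feature conversion: conventional features -> drunken features.
    inputs: (batch, freq, time, channels) feature map; reduction is an assumed ratio."""
    channels = inputs.shape[-1]
    # Squeeze: global average pooling reduces each channel map to its global mean.
    squeeze = layers.GlobalAveragePooling2D()(inputs)                  # (batch, channels)
    # Excitation: fully connected layers learn the inter-channel correlation.
    excite = layers.Dense(max(channels // reduction, 1), activation="relu")(squeeze)
    weights = layers.Dense(channels, activation="sigmoid")(excite)     # per-channel weights
    weights = layers.Reshape((1, 1, channels))(weights)
    # Re-weight the original features: redundant channels are suppressed.
    return layers.Multiply()([inputs, weights])

# Example: 128 x 43 x 3 conventional features (log-Mel + delta + delta-delta).
x_in = layers.Input(shape=(128, 43, 3))
drunken = feature_conversion_module(x_in)
model = tf.keras.Model(x_in, drunken)
```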
3. Model structure
3.1 Drunken model
The data used in the experiments are all 1 s audio files, which contain relatively little information. To address this, a multi-scale drunken model named FC_ResNet is designed to capture feature information with richer detail at different scales. The model mainly contains three modules: the frequency grouping fusion convolution (Frequency Grouping Fusion Convolution, FreGroupConv2d), the grouped convolution (GroupConv2d), and the improved bottleneck block (FCresnet_block). The invention uses the drunken model in Table 1 as the baseline model. It is divided into six stages: feature information is first enriched by the frequency grouping fusion convolution, and channel expansion and downsampling are then performed through convolution and pooling layers. The fourth stage contains a number of stacked FCresnet_blocks with different channel numbers and strides, and Dropout is added to prevent overfitting. Finally, the classification result is output after global average pooling and a fully connected layer. Fig. 3 illustrates the model configuration of Table 1 more visually.
TABLE 1
(1) Frequency grouping fusion convolution
The frequency dimension of the convolution layer is divided into groups: the low-frequency part captures global, coarse features, while the high-frequency part captures local, detailed features, which improves the generalization of the model to the input data. In addition, because the frequency grouping convolution processes features in different frequency ranges separately, it also enhances, to some extent, the robustness of the model to noise in the input data.
Fig. 4 is a schematic diagram of the frequency grouping fusion convolution. The frequency dimension is uniformly divided into 4 groups and each group is convolved with a 3×3 kernel; every path except the first also receives information from the previous path. This low-frequency-to-high-frequency feature reuse, similar to the step-by-step extraction from edges to textures in the human visual system, enriches the feature information to some extent. Finally, the outputs are concatenated in the channel dimension and fed to the next stage. Because downsampling along the frequency axis is performed later in the network, the frequency grouping fusion convolution is placed at the first layer of the model, and adding this layer deeper in the model is not considered.
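A minimal TensorFlow/Keras sketch of this frequency grouping fusion convolution is shown below. The number of filters per branch and the exact fusion operator (concatenating the previous branch's output with the current band before its 3×3 convolution) are assumptions; the patent only states that later branches receive information from the previous branch.

```python
import tensorflow as tf
from tensorflow.keras import layers

def frequency_grouping_fusion_conv(inputs, filters=32, groups=4):
    """FreGroupConv2d sketch: split the frequency axis into `groups` sub-bands,
    convolve each with 3x3, pass information from each band to the next, and
    concatenate the branch outputs along the channel axis."""
    bands = tf.split(inputs, num_or_size_splits=groups, axis=1)   # split the frequency axis
    outputs, previous = [], None
    for band in bands:
        if previous is not None:
            # Reuse low-frequency features in the next (higher-frequency) branch.
            band = tf.concat([band, previous], axis=-1)
        previous = layers.Conv2D(filters, 3, padding="same", activation="relu")(band)
        outputs.append(previous)
    # Cascade the branch outputs in the channel dimension for the next stage.
    return tf.concat(outputs, axis=-1)

x = tf.random.normal([8, 128, 43, 3])           # a batch of drunken features
y = frequency_grouping_fusion_conv(x)           # (8, 128, 43, 3) -> (8, 32, 43, 128)
```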
(2) FCresnet_block
This part is the main body of the drunken model. It is obtained by modifying the bottleneck block of the residual network ResNet [2] and is named FCresnet_block. The original bottleneck block consists of three convolution layers with kernel sizes of 1×1, 3×3 and 1×1 in order, plus a shortcut connection that maps the input directly to the output. The 1×1 convolutions mainly change the number of channels and add no spatial feature information, so spatial features are extracted only by the single 3×3 convolution layer. To mine the deep information of the audio features more fully, the 3×3 convolution in the bottleneck block is replaced by a grouped convolution. The grouped convolution divides the channel dimension into several groups, applies a 3×3 convolution to each path, and finally concatenates the outputs of all paths in the channel dimension, ensuring that the numbers of output and input channels remain consistent. The number of channel groups is given by the parameter Groups, and the baseline of the invention uses Groups = 8.
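A minimal Keras sketch of FCresnet_block is given below. The 1/4 bottleneck width ratio, the use of batch normalization and the Dropout rate follow the original ResNet bottleneck design and are assumed details not fixed by the text; the essential point is the grouped 3×3 convolution with Groups = 8.

```python
from tensorflow.keras import layers

def fcresnet_block(inputs, channels, groups=8, stride=1, dropout=0.1):
    """FCresnet_block sketch: a ResNet bottleneck whose middle 3x3 convolution is
    replaced by a grouped convolution. `channels` is assumed to be a multiple of
    4 * groups so that the grouped convolution divides evenly."""
    mid = channels // 4                                        # bottleneck width (assumed ratio)
    shortcut = inputs
    x = layers.Conv2D(mid, 1, use_bias=False)(inputs)          # 1x1: reduce channels
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Grouped 3x3 convolution: the channel dimension is split into `groups` paths,
    # each path is convolved separately, and the outputs are concatenated again.
    x = layers.Conv2D(mid, 3, strides=stride, padding="same",
                      groups=groups, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(channels, 1, use_bias=False)(x)          # 1x1: restore channels
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(dropout)(x)
    # Shortcut connection: project when the shape changes, identity otherwise.
    if stride != 1 or shortcut.shape[-1] != channels:
        shortcut = layers.Conv2D(channels, 1, strides=stride, use_bias=False)(shortcut)
    return layers.ReLU()(layers.Add()([x, shortcut]))
```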
3.2 Conventional model
The conventional model is similar in overall structure to the drunken model of Table 1, with differences in several respects: 1) the conventional model does not contain the frequency grouping fusion convolution (stage 1 in Table 1); 2) the conventional model has higher complexity, mainly in terms of model width: the value of Groups in its FCresnet_block modules is 32, and its number of channels is twice that of the drunken model; 3) the depth of the conventional model is 37, its MAC count is 23.5M, slightly higher than the drunken model, and its parameter count is 1114.6K, about 3 times that of the drunken model. Experimental results show that the conventional model reaches an accuracy of 46.4% with a loss of 1.619.
3.3 Guiding and fusion modules
The guiding module derives the drunken model from the conventional model through a set of operations. One operation is channel reduction: the number of channels of the model is adjusted through multiple experiments and the optimal settings are kept, as shown in Table 1. In addition, the frequency grouping fusion convolution is added on top of the conventional model, and the number of Groups of the grouped convolution is reduced.
The fusion module adopts a knowledge distillation strategy, which involves a teacher model and a student model: the knowledge in the higher-performing teacher model is taught to the lower-performing student model to improve the latter's performance. Following the knowledge distillation principle and the drunken-model methodology, the conventional model is used as the teacher model and the drunken model as the student model, and the invention uses the original form of knowledge distillation. First, the conventional model is trained under different parameter configurations and the best-performing parameter settings are saved. Then the soft-label predictions of the conventional model and the drunken model are computed at the same temperature T, and the distillation loss is calculated. The final loss is a weighted sum of the distillation loss and the hard-label loss, where L_TOTAL is the total loss, α is the weight of the distillation loss, L_DIST is the distillation loss, and L_LABEL is the hard-label loss, as shown in equation (1):
L_TOTAL = α·L_DIST + (1 − α)·L_LABEL    (1)
The distillation loss matches the predictions of the student and teacher using the soft targets given in equation (2), where z denotes the logits, q the soft target, and T the temperature controlling the softness of the probability distribution:
q_i = exp(z_i / T) / Σ_j exp(z_j / T)    (2)
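For illustration, a minimal TensorFlow sketch of the fusion-module loss follows. The KL-divergence form of the distillation loss and the T² scaling follow the original knowledge distillation formulation, and the temperature and α values are assumed rather than taken from the patent.

```python
import tensorflow as tf

def distillation_losses(student_logits, teacher_logits, labels,
                        temperature=4.0, alpha=0.5):
    """Fusion-module loss sketch: L_TOTAL = alpha * L_DIST + (1 - alpha) * L_LABEL."""
    # Soft targets: softened probability distributions at temperature T (equation (2)).
    soft_teacher = tf.nn.softmax(teacher_logits / temperature, axis=-1)
    log_soft_student = tf.nn.log_softmax(student_logits / temperature, axis=-1)
    # Distillation loss: match the student's soft predictions to the teacher's.
    l_dist = tf.reduce_mean(
        tf.reduce_sum(soft_teacher * (tf.math.log(soft_teacher + 1e-8) - log_soft_student),
                      axis=-1)) * temperature ** 2
    # Hard-label loss: ordinary cross-entropy against the ground-truth scene labels.
    l_label = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(labels, student_logits,
                                                        from_logits=True))
    return alpha * l_dist + (1.0 - alpha) * l_label
```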
4. Experimental part
(1) Dataset and evaluation metrics
The dataset is the DCASE (Detection and Classification of Acoustic Scenes and Events) 2022 Task 1 development set (TAU Urban Acoustic Scenes 2022 Mobile, development dataset). The audio is provided in mono 44.1 kHz, 24-bit format. The dataset contains 230,359 one-second clips with a total duration of 64 hours; the invention splits them into 70% for training and 30% for testing. The recordings come from 10 cities and 9 devices: 3 real devices (A, B, C) and 6 simulated devices (S1-S6), and cover 10 scenes: airport, shopping mall, metro station, pedestrian street, public square, traffic street, tram, bus, metro, and park. The validation set provided by the DCASE organizers, containing 22 hours of audio, is used to evaluate the trained model. The complexity of the system is measured by the number of parameters and the number of multiply-accumulate operations (MACs), and Loss and Accuracy are used to evaluate model performance.
(2) Model training setup
For feature extraction, the audio signal is processed with 128 Mel filters: a fast Fourier transform (FFT) is applied with a Hamming window length of 0.04 s and a frame shift of 0.02 s, yielding a 128 × 43 spectrogram. The first-order and second-order difference features are then extracted, and the three feature maps are concatenated in the channel dimension into a 128 × 43 × 3 input feature map.
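A minimal librosa sketch of this feature extraction pipeline is shown below; the window and hop lengths follow the stated 0.04 s / 0.02 s settings, while the n_fft value is an assumption.

```python
import numpy as np
import librosa

def conventional_features(path, sr=44100, n_mels=128):
    """Log-Mel spectrum plus first- and second-order differences,
    stacked along the channel dimension."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    win = int(0.04 * sr)          # 0.04 s Hamming window
    hop = int(0.02 * sr)          # 0.02 s frame shift
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, win_length=win,
                                         hop_length=hop, n_mels=n_mels,
                                         window="hamming")
    log_mel = librosa.power_to_db(mel)                    # (128, frames)
    delta1 = librosa.feature.delta(log_mel, order=1)      # first-order difference
    delta2 = librosa.feature.delta(log_mel, order=2)      # second-order difference
    # Concatenate in the channel dimension -> (128, frames, 3)
    return np.stack([log_mel, delta1, delta2], axis=-1)
```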
In the training stage, the batch size is uniformly set to 128 across experiments; the drunken model and the conventional model are each trained independently for 256 epochs, and the knowledge distillation experiments use 200 epochs. Furthermore, two data augmentation techniques, Mixup and SpecAugment, are added to optimize the training process.
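For reference, a minimal sketch of the Mixup augmentation is shown below; the Beta-distribution parameter α = 0.2 is an assumed value, and SpecAugment masking is omitted.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2):
    """Mixup sketch: convexly combine a batch of feature maps and their one-hot
    labels with a randomly permuted copy of the same batch."""
    lam = np.random.beta(alpha, alpha)
    index = np.random.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[index]
    y_mix = lam * y + (1.0 - lam) * y[index]
    return x_mix, y_mix
```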
Baseline system settings of the invention: the log-Mel spectrum and its first-order and second-order differences are used as input features. FC_ResNet with Groups = 8 and a depth of 38 layers is trained for 256 epochs on the full dataset, without data augmentation or the coordinate attention mechanism. Experiments are run on an NVIDIA RTX 2080 Ti using the TensorFlow and Keras frameworks under the Windows 10 operating system.
(3) Experimental results
The drunken model (baseline model) constructed by the invention obtains an accuracy of 40.4% and a loss of 2.284. On this basis, after the drunken-model mechanism is adopted, the accuracy is improved to 45.2% at best, and the loss is reduced by 0.634 compared with the baseline system.
In addition, to verify the effectiveness of the drunken model, it is compared with the mainstream advanced ShuffleNetV2, ResNet, MobileNetV2 and GhostNet. To ensure fairness, the parameter counts of the five models are kept within the range [440K, 463K], so that the differences in parameter count stay within roughly 5% while accuracy is compared. The drunken model with 41.5% accuracy, referred to as FC_ResNet, is selected for comparison, and the other four models are trained under the same training settings; the experimental results are shown in Table 2. The results show that, with similar parameter counts, the model designed by the invention achieves higher accuracy than ShuffleNetV2 and ResNet, which stems from the rich feature information extracted at different scales. Compared with MobileNetV2, the accuracy improves by only 1%, but with fewer parameters, which also demonstrates the advantage of the method in balancing accuracy and parameter count. Compared with GhostNet, although the accuracy improves by only 0.3%, the parameter count is reduced by 5%, so the method is competitive with GhostNet.
TABLE 2
The above-described method according to the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code originally stored in a remote recording medium or non-transitory machine-readable medium and downloaded through a network for storage in a local recording medium, so that the method described herein can be processed by such software on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or special-purpose hardware such as an ASIC or FPGA. It is appreciated that a computer, processor, microprocessor controller, or programmable hardware includes a memory component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the lightweight acoustic scene perception method described herein. Further, when a general-purpose computer accesses code for implementing the processes shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing those processes.
Those of ordinary skill in the art will appreciate that the embodiments described herein are intended to aid the reader in understanding the practice of the invention and that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.
Claims (3)
1. A lightweight acoustic scene perception method based on a drunken model, characterized by comprising the following steps:
1) Conventional audio feature extraction: collecting original audio data and converting it into conventional audio features using the log-Mel spectrum together with its first-order and second-order differences;
2) Drunken feature extraction: designing a feature conversion module based on an attention mechanism with reference to the Squeeze-and-Excitation (SE) attention module; processing the conventional audio features with the feature conversion module and obtaining drunken features after redundant information is removed; the feature conversion module comprises: global average pooling, a fully connected layer and a sigmoid activation function;
the specific operation of the feature conversion module is as follows: the feature conversion module first reduces the spatial dimension of each channel to a scalar through a global average pooling operation to obtain the global average value of each channel, compressing each channel so as to capture its global statistical information while reducing the amount of computation; the correlation among channels is then learned through the fully connected layer; finally, the weights of the different channel-dimension features are obtained with the sigmoid activation function and multiplied with each original feature to obtain the drunken features;
3) Conventional model training: training using the structure and parameters of the conventional model; training the model and learning scene perception capability by inputting the conventional audio features;
4) Guiding module operation: applying channel reduction to the conventional model and adding a frequency grouping fusion convolution to obtain the initial version of the drunken model; finding the optimal channel number and grouped convolution settings through experiments and adjustment;
5) Drunken model training: training using the initial drunken model obtained from the guiding module; similar to the conventional model training, inputting the drunken features and training and optimizing the drunken model;
6) Fusion module operation: using the conventional model as the teacher model and the drunken model as the student model, and improving the performance of the student model by means of knowledge distillation;
7) Model evaluation: evaluating the drunken model obtained in step 6), and checking its accuracy and performance on the acoustic scene perception task to obtain an evaluation result;
8) Model optimization: optimizing and adjusting the drunken model according to the evaluation result to obtain the final drunken model;
9) Inputting the drunken features into the drunken model to obtain the acoustic scene perception result.
2. The drunken model-based lightweight acoustic scene perception method according to claim 1, characterized in that: the drunken model comprises: a frequency grouping fusion convolution, grouped convolutions, improved bottleneck blocks, convolution layers and pooling layers;
the operation flow for constructing the drunken model is as follows: feature information is first enriched by the frequency grouping fusion convolution, and channel expansion and downsampling are performed through convolution and pooling layers to extract higher-level feature representations; the model passes through a plurality of stacked improved bottleneck blocks with different channel numbers and strides, and Dropout is added to prevent overfitting; the outputs of each group are concatenated in the channel dimension so that the numbers of input and output channels stay consistent; the classification result is output after global average pooling and a fully connected layer;
the improved bottleneck block is obtained by modifying the bottleneck block of the residual network ResNet and mainly consists of grouped convolution; the improved bottleneck block replaces the 3×3 convolution in the bottleneck block with a grouped convolution; the grouped convolution uniformly divides the channel dimension into groups, and each group undergoes a 3×3 convolution.
3. The drunken model-based lightweight acoustic scene perception method according to claim 2, characterized in that: the frequency grouping fusion convolution is placed at the first layer of the drunken model; it divides the frequency dimension of the input features into a plurality of branches; each branch is processed by a 3×3 convolution layer; every branch except the first receives feature information from the previous branch, extracting features progressively from low frequency to high frequency; the outputs of the branches are concatenated in the channel dimension and then input to the next stage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311505530.7A CN117524252B (en) | 2023-11-13 | 2023-11-13 | Light-weight acoustic scene perception method based on drunken model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117524252A CN117524252A (en) | 2024-02-06 |
CN117524252B (en) | 2024-04-05
Family
ID=89741360
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311505530.7A Active CN117524252B (en) | 2023-11-13 | 2023-11-13 | Light-weight acoustic scene perception method based on drunken model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117524252B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110728354A (en) * | 2019-09-11 | 2020-01-24 | 东南大学 | Improved sliding type grouping convolution neural network |
CN112016639A (en) * | 2020-11-02 | 2020-12-01 | 四川大学 | Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet |
CN112465140A (en) * | 2020-12-07 | 2021-03-09 | 电子科技大学 | Convolutional neural network model compression method based on packet channel fusion |
CN113723474A (en) * | 2021-08-12 | 2021-11-30 | 浙江云澎科技有限公司 | Cross-channel aggregation similarity network system |
CN114078471A (en) * | 2020-08-20 | 2022-02-22 | 京东科技控股股份有限公司 | Network model processing method, device, equipment and computer readable storage medium |
CN114627895A (en) * | 2022-03-29 | 2022-06-14 | 大象声科(深圳)科技有限公司 | Acoustic scene classification model training method and device, intelligent terminal and storage medium |
CN116246639A (en) * | 2023-02-06 | 2023-06-09 | 思必驰科技股份有限公司 | Self-supervision speaker verification model training method, electronic device and storage medium |
KR20230154597A (en) * | 2022-05-02 | 2023-11-09 | 계명대학교 산학협력단 | Method and apparatus for detecting sound event based on sound event detection model using a differential feature |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |