
CN117524252B - Light-weight acoustic scene perception method based on drunken model - Google Patents

Light-weight acoustic scene perception method based on drunken model Download PDF

Info

Publication number
CN117524252B
CN117524252B (Application CN202311505530.7A)
Authority
CN
China
Prior art keywords
model
drunken
convolution
training
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311505530.7A
Other languages
Chinese (zh)
Other versions
CN117524252A (en)
Inventor
武梦龙
张琳
刘文楷
蔡希昌
黄明
张海月
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Technology
Original Assignee
North China University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China University of Technology filed Critical North China University of Technology
Priority to CN202311505530.7A priority Critical patent/CN117524252B/en
Publication of CN117524252A publication Critical patent/CN117524252A/en
Application granted granted Critical
Publication of CN117524252B publication Critical patent/CN117524252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G06F18/253: Fusion techniques of extracted features
    • G06N3/045: Combinations of networks
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/0495: Quantised networks; Sparse networks; Compressed networks
    • G06N3/096: Transfer learning
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Extracted parameters being spectral information of each sub-band
    • G10L25/24: Extracted parameters being the cepstrum
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a lightweight acoustic scene perception method based on a drunken model, which comprises the following steps. Conventional audio feature extraction: the conventional audio features are processed by a feature conversion module to obtain drunken features. The conventional model is trained. Channel reduction is applied to the conventional model and a frequency grouping fusion convolution is added, yielding an initial version of the drunken model. The initial drunken model obtained through the guiding module is trained. The conventional model serves as the teacher model and the drunken model as the student model, and the performance of the student model is improved by knowledge distillation. The fused lightweight model is evaluated to obtain an evaluation result, and the lightweight model is optimized and adjusted according to that result to obtain the final drunken model. The drunken features are input into the drunken model to obtain the acoustic scene perception result. The invention reduces training time and the consumption of computing resources while still achieving higher accuracy and a lower loss value.

Description

Light-weight acoustic scene perception method based on drunken model
Technical Field
The invention relates to the technical field of acoustic scene classification, in particular to a lightweight acoustic scene perception method based on a drunken model.
Background
Acoustic scene classification, one of the important applications of deep convolutional neural networks in the audio field, classifies the surrounding environment by simulating human perception of the external world, and is widely used in audio monitoring, intelligent driver assistance, voiceprint recognition and other fields. Most acoustic scene classification tasks adopt a top-down cascade in which the extracted feature information is fed directly into a neural network model for prediction, but this approach has limitations. The mainstream networks are still deep convolutional neural networks, and several high-accuracy lightweight models have also been proposed. For example, the efficient BC-ResNet architecture achieves excellent performance by extracting two feature maps specific to the frequency and time dimensions, using a two-dimensional convolution along frequency and a one-dimensional convolution along time. BC-Res2Net, which fuses the BC-ResNet and Res2Net structures, effectively captures frequency- and time-dimension characteristics through broadcast learning, operates at multiple scales and performs remarkably well. The MobileNet series and ShuffleNet proposed in recent years achieve lightweight, efficient networks by introducing depthwise convolution and channel shuffle operations.
Acoustic scene classification mainly comprises two parts, feature extraction and classification model construction, with the extracted feature information fed directly into a neural network model for prediction. The currently prevailing audio features are log-mel spectrum (log-mel) features, MFCCs (Mel Frequency Cepstral Coefficients), first-order and second-order time differences, and the like. Some acoustic scene classification tasks adopt a single feature extraction mode; although this saves computation, it may ignore important information and lead to poor classification performance. Adopting multiple feature extraction modes, on the other hand, may introduce redundant feature maps and unnecessary computational cost.
In terms of models, the networks used for acoustic scene classification tasks are limited by their structure: computational overhead grows as the model deepens, which is detrimental to deployment on resource-constrained devices. In addition, most of these tasks use a single neural network model, which may fail to fully extract the key audio features; and since no optimal model architecture and hyper-parameter combination has yet been determined, erroneous decisions on scene categories are easily made.
References
[1] J. Hu, L. Shen, G. Sun, "Squeeze-and-excitation networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132-7141, 2018.
[2] K. He, X. Zhang, S. Ren, et al., "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778, 2016.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a lightweight acoustic scene perception method based on a drunken model.
In order to achieve the above object, the present invention adopts the following technical scheme:
a lightweight acoustic scene perception method based on a drunken methodology comprises the following steps:
1) Conventional audio feature extraction: original audio data are collected and converted into conventional audio features using the log-mel spectrum together with its first-order and second-order differences.
2) Drunken feature extraction: based on the Squeeze-and-Excitation (SE) attention module, a feature conversion module built on an attention mechanism is designed. The conventional audio features are processed by the feature conversion module, and drunken features are obtained after redundant information is removed.
3) Training the conventional model: training is performed using the structure and parameters of the conventional model. The model is trained on the conventional audio features to learn scene perception capability.
4) Guiding module operation: channel reduction is applied to the conventional model and a frequency grouping fusion convolution is added, yielding an initial version of the drunken model. Through experiments and adjustment, the optimal number of channels and grouped convolution settings are found.
5) Training the drunken model: training is performed on the initial drunken model obtained from the guiding module. As with conventional model training, the drunken features are input and the drunken model is trained and optimized.
6) Fusion module operation: the conventional model serves as the teacher model and the drunken model as the student model, and the performance of the student model is improved by knowledge distillation. The distillation loss is calculated from the soft-label predictions of the student and teacher models, and the final loss is a weighted sum of the distillation loss and the hard-label loss.
7) Model evaluation: the fused lightweight model is evaluated, and its accuracy and performance on the acoustic scene perception task are checked to obtain an evaluation result.
8) Model optimization: the lightweight model is optimized and adjusted according to the evaluation result to obtain the final drunken model.
9) The drunken features are input into the drunken model to obtain the acoustic scene perception result.
Further, in the drunken feature extraction of step 2): a global average pooling operation reduces the spatial dimensions of each channel of the conventional features to a scalar, compressing the information of each channel into its global average value; this captures the global statistics within each channel, reduces computation, and extracts within-channel feature correlation.
The correlations among channels are then learned through a fully connected layer, and a sigmoid activation function produces the weights of the different channel-wise features. Finally, the weights are multiplied with each original feature to obtain the drunken features.
Further, the drunken model comprises: a frequency grouping fusion convolution (FreGroupConv2d), a grouped convolution (GroupConv2d), an improved bottleneck block (FCresnet_block), convolution layers and pooling layers.
The operation flow for constructing the drunken model is as follows: feature information is first enriched by the frequency grouping fusion convolution, and channel expansion and downsampling are then performed through convolution and pooling layers to extract higher-level feature representations. Several improved bottleneck blocks are stacked, with varying channel numbers and strides, and Dropout is added to prevent overfitting. The outputs of each group are cascaded in the channel dimension so that the numbers of input and output channels stay consistent. After global average pooling and a fully connected layer, the classification result is output.
The improved bottleneck block is obtained by modifying the bottleneck block of the residual network ResNet and mainly consists of grouped convolution: the 3 x 3 convolution in the bottleneck block is replaced with a grouped convolution. The grouped convolution uniformly divides the channel dimension into groups, each of which undergoes a 3 x 3 convolution.
Further, the frequency grouping fusion convolution is placed at the first layer of the drunken model. It divides the frequency dimension of the input feature into several branches, each processed by a 3 x 3 convolution layer. Every branch except the first receives feature information from the previous branch, so that features are extracted progressively from low to high frequencies. The outputs of the branches are cascaded in the channel dimension and then fed to the next stage.
Compared with the prior art, the invention has the following advantages:
1. Improved model performance: introducing the feature conversion module and the fusion module improves the performance and accuracy of the model. The feature conversion module removes redundant information, reduces model complexity, and improves the discriminability and representational power of the features. The fusion module uses knowledge distillation to bring the performance of the student model close to that of the teacher model. Even with limited resources, higher accuracy and a lower loss value are obtained: the accuracy rises to at most 45.2%, and the loss drops by 0.634 compared with the baseline.
2. Lightweight model: channel reduction, grouped convolution and similar operations reduce the parameter count and computational complexity of the lightweight model, improving its efficiency and making it suitable for resource-constrained environments and devices. The drunken model reaches 41.5% accuracy with only 442.67K parameters, exceeding mainstream lightweight models.
3. Guiding module operation: the guiding module provides an initial version of the drunken model, reducing training time and the consumption of computing resources. Through experiments and adjustment, the optimal number of channels and grouped convolution settings are found, further improving the model's performance.
Drawings
Fig. 1 is a general structural diagram of an embodiment of the present invention.
Fig. 2 is a flowchart of drunken feature extraction according to an embodiment of the present invention.
Fig. 3 is a structural diagram of the drunken model according to an embodiment of the present invention.
Fig. 4 is a structural diagram of the frequency grouping fusion convolution in an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the accompanying drawings and by way of examples in order to make the objects, technical solutions and advantages of the invention more apparent.
1. Overall structure of the lightweight acoustic scene perception method based on the drunken model
As shown in Fig. 1, the method comprises two branches and a module that connects them. The three parts are described in detail below.
The first branch simulates the modeling process of conventional behavior: the input data are turned into conventional features through existing feature extraction methods, and the invention adopts the log-mel spectrum together with first-order and second-order differences. The extracted features are then fed into the conventional model for training to obtain soft-label results. This branch has higher feature and model complexity, and its main purpose is to achieve higher acoustic scene classification performance.
The second branch simulates the modeling process of drunken behavior, modeling a drunk person's perception of the surrounding environment. The conventional features of the first branch pass through the feature conversion module to yield more compact drunken features which, compared with the conventional features, discard part of the redundant feature information and focus more on the key positions. They are then fed into the drunken model, which is transformed by the guiding module: a frequency grouping fusion convolution layer is added on top of the conventional model, and the representational power of the features is enhanced by dividing the frequency dimension and reusing features of different ranges. In addition, the width of the drunken model is reduced to achieve the lightweight goal.
The fusion module is the bridge joining the two branches. A knowledge distillation strategy is chosen as the model compression means so that, under the guidance of the high-complexity, well-performing conventional model, the performance of the low-complexity, weaker drunken model can be improved without excessive additional computational cost, and the two branches complete the scene classification task together.
2. Feature extraction section
2.1 Conventional features
Although new speech features are frequently designed for audio-related fields (e.g., speech recognition, sound event detection, information retrieval), the invention prefers more mature and perceptually motivated feature representations. According to human auditory characteristics, the human ear is more sensitive to differences between low frequencies and does not perceive frequency linearly, so the log-mel spectrum feature is considered first. The log-mel spectrum, a common audio feature extraction method, contains time-domain and frequency-domain information as well as perceptually relevant amplitude information, and its mel scale better matches the auditory characteristics of the human ear.
In addition, because the speech signal is continuous in the time domain, the feature information extracted frame by frame only reflects the characteristics of that frame. To make the features better reflect temporal continuity, dimensions carrying information about neighbouring frames can be added to the feature dimension; first-order and second-order differences are commonly used for this. Therefore, first-order and second-order differences are computed on the basis of the log-mel spectrum and concatenated in the channel dimension. The final conventional feature is a three-channel cascade of the log-mel spectrum feature, the first-order difference feature and the second-order difference feature.
2.2 Drunken features
The invention introduces an attention mechanism at the feature extraction stage to suppress redundant information in the conventional features and enhance useful information. Drawing on part of the structure of the Squeeze-and-Excitation (SE) attention module [1], a feature conversion module based on an attention mechanism is designed. The conventional features pass through the feature conversion module to obtain features with the redundant information removed, called drunken features. The drunken features are lighter than the conventional ones because the feature conversion module helps remove some similar features.
As shown in Fig. 2, the feature conversion module first reduces the spatial dimensions of each channel to a scalar through a global average pooling operation, obtaining the global average value of each channel; the purpose of this operation is to compress each channel so as to capture its global statistics while reducing computation. A fully connected layer then learns the correlations among channels, a sigmoid activation function produces the weights of the different channel-wise features, and the weights are multiplied with each original feature to obtain the drunken features.
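For illustration only, a minimal tf.keras sketch of such a feature conversion module might look as follows; the single fully connected layer follows the description above, while the helper name and layer choices are assumptions rather than the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def feature_conversion_module(x):
    """SE-style feature conversion sketch: squeeze each channel to a scalar,
    learn per-channel weights, and reweight the input features.
    x: (batch, freq, time, channels), e.g. the 128 x 43 x 3 conventional features."""
    channels = x.shape[-1]
    # Squeeze: global average pooling compresses each channel's map to one scalar.
    s = layers.GlobalAveragePooling2D()(x)                  # (batch, channels)
    # Excitation: one fully connected layer learns inter-channel correlations;
    # sigmoid turns them into per-channel weights in (0, 1).
    w = layers.Dense(channels, activation="sigmoid")(s)
    w = layers.Reshape((1, 1, channels))(w)
    # Reweight: multiply each original feature map by its learned weight
    # to obtain the drunken features with redundant information suppressed.
    return layers.Multiply()([x, w])
```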
3. Model structural part
3.1 Drunken model
The data used in the experiments are all 1 s audio files, which contain relatively little information. To address this, a multi-scale drunken model named FC_ResNet is designed to capture feature information with richer detail at different scales. The model mainly contains three modules: the frequency grouping fusion convolution (Frequency Grouping Fusion Convolution, FreGroupConv2d), the grouped convolution (GroupConv2d) and the improved bottleneck block (FCresnet_block). The invention uses the drunken model in Table 1 as the baseline model, divided into six stages: feature information is first enriched by the frequency grouping fusion convolution, followed by channel expansion and downsampling through convolution and pooling layers. The fourth stage contains several stacked FCresnet_blocks with different channel numbers and strides, and Dropout is added to prevent overfitting. Finally, the classification result is output after global average pooling and a fully connected layer. Fig. 3 illustrates the model configuration of Table 1 more intuitively.
TABLE 1
(1) Frequency grouping fusion convolution
The frequency dimension of the input to the convolution layer is divided: the low-frequency part captures global, coarse features, while the high-frequency part captures local, detailed features, improving the generalization ability of the model on the input data. In addition, because the frequency-grouped convolution processes the features of different frequency ranges separately, the robustness of the model to noise in the input data is enhanced to a certain extent.
Fig. 4 is a schematic diagram of the frequency grouping fusion convolution. The frequency dimension is uniformly divided into 4 groups and each group is convolved with a 3×3 kernel; every path except the first also receives information from the previous path. This low-to-high-frequency feature reuse, similar to the step-by-step edge-to-texture feature extraction of the human visual system, enriches the feature information to some extent. Finally, the outputs are cascaded in the channel dimension and fed to the next stage. Because downsampling along the frequency axis is performed later in the network, the frequency grouping fusion convolution is placed at the first layer of the model rather than deeper in the network.
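A possible tf.keras sketch of the frequency grouping fusion convolution is given below; the text does not specify how the previous branch's information is injected, so channel concatenation before the 3×3 convolution is assumed here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def frequency_grouping_fusion_conv(x, filters, groups=4):
    """Sketch of FreGroupConv2d: split the frequency axis into `groups` bands,
    convolve each band with 3x3 filters, and let every band after the first
    also see the previous band's output (low-to-high-frequency feature reuse).
    x: (batch, freq, time, channels); freq must be divisible by `groups`."""
    band_size = x.shape[1] // groups
    outputs = []
    for i in range(groups):
        band = x[:, i * band_size:(i + 1) * band_size, :, :]   # i-th frequency band
        if i > 0:
            # Fuse information from the previous branch (fusion operator assumed:
            # channel concatenation before the 3x3 convolution).
            band = layers.Concatenate(axis=-1)([band, outputs[i - 1]])
        outputs.append(layers.Conv2D(filters, 3, padding="same", activation="relu")(band))
    # Cascade all branch outputs in the channel dimension before the next stage.
    return layers.Concatenate(axis=-1)(outputs)
```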
(2) FCresnet_block
This part is the main body of the drunken model. It is obtained by modifying the bottleneck block of the residual network ResNet [2] and is named FCresnet_block. The original bottleneck block consists of three convolution layers with kernel sizes 1×1, 3×3 and 1×1 in order, plus one shortcut connection that maps the input directly to the output. The 1×1 convolutions mainly change the number of channels and add no spatial feature information; only the single 3×3 convolution layer extracts spatial features. To fully mine the deep information of the audio features, the 3×3 convolution in the bottleneck block is replaced by a grouped convolution. The grouped convolution divides the channel dimension into several groups, each path undergoes its own 3×3 convolution, and finally the outputs of all paths are cascaded in the channel dimension, ensuring that the numbers of output and input channels stay consistent. The number of channel groups is denoted by the parameter Groups; the baseline of the invention uses Groups = 8.
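The following sketch illustrates one plausible FCresnet_block; the bottleneck reduction ratio, the stride placement and the projection shortcut follow standard ResNet practice and are assumptions, as is the manual split-and-concatenate realisation of the grouped convolution.

```python
import tensorflow as tf
from tensorflow.keras import layers

def grouped_conv2d(x, out_channels, groups=8):
    """Grouped 3x3 convolution: split the channel dimension into `groups`,
    convolve each group separately, then cascade the outputs.
    Both the input channels and `out_channels` must be divisible by `groups`."""
    splits = tf.split(x, groups, axis=-1)
    convs = [layers.Conv2D(out_channels // groups, 3, padding="same")(s) for s in splits]
    return layers.Concatenate(axis=-1)(convs)

def fcresnet_block(x, filters, stride=1, groups=8):
    """Sketch of FCresnet_block: a ResNet bottleneck whose middle 3x3 conv is a
    grouped conv. Reduction ratio (filters // 4), stride placement and the
    projection shortcut are standard-ResNet assumptions, not Table 1 values."""
    shortcut = x
    y = layers.Conv2D(filters // 4, 1, strides=stride, activation="relu")(x)  # 1x1: reduce channels
    y = grouped_conv2d(y, filters // 4, groups=groups)                        # grouped 3x3: spatial features
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 1)(y)                                          # 1x1: restore channels
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride)(shortcut)        # projection shortcut
    return layers.Activation("relu")(layers.Add()([y, shortcut]))
```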
3.2 conventional model
The overall structure of the conventional model is similar to the drunken model in Table 1, with differences in several respects: 1) the conventional model does not contain the frequency grouping fusion convolution, i.e., stage 1 in Table 1; 2) the conventional model is more complex, mainly in terms of width: the value of Groups in its FCresnet_block modules is 32, and its number of channels is twice that of the drunken model; 3) the depth of the conventional model is 37, its MAC count is 23.5M, slightly higher than the drunken model, and its parameter count is 1114.6K, roughly 3 times that of the drunken model. Experimental results show that the conventional model reaches an accuracy of 46.4% with a loss of 1.619.
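Putting the pieces together, the six-stage baseline might be assembled roughly as sketched below, reusing the two helpers above; the stage widths, strides and block counts are illustrative guesses (Table 1 is not reproduced here), and the conventional model would differ mainly by omitting stage 1, using Groups = 32 and doubling the channel counts.

```python
from tensorflow.keras import layers, Input, Model

def build_fc_resnet(num_classes=10, groups=8):
    """Illustrative assembly of the drunken model FC_ResNet; channel widths,
    strides and the number of stacked FCresnet_blocks are assumptions."""
    inputs = Input(shape=(128, 43, 3))                               # log-mel + deltas
    x = frequency_grouping_fusion_conv(inputs, filters=16)           # stage 1: FreGroupConv2d
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)   # stage 2: channel expansion
    x = layers.MaxPooling2D(pool_size=2)(x)                          # stage 3: downsampling
    for filters, stride in [(64, 1), (128, 2), (128, 1)]:            # stage 4: stacked FCresnet_blocks
        x = fcresnet_block(x, filters, stride=stride, groups=groups)
        x = layers.Dropout(0.2)(x)
    x = layers.GlobalAveragePooling2D()(x)                           # stage 5: global average pooling
    outputs = layers.Dense(num_classes, activation="softmax")(x)     # stage 6: classifier
    return Model(inputs, outputs)
```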
3.3 Guiding and fusion modules
The guiding module derives the drunken model from the conventional model through certain operations. One specific operation is channel reduction: the number of channels of the model is adjusted through repeated experiments, and the optimal settings are kept, as shown in Table 1. On this basis, the frequency grouping fusion convolution is added to the conventional model and the number of Groups of the grouped convolution is reduced.
The fusion module adopts a knowledge distillation strategy involving a teacher model and a student model: the knowledge of the better-performing teacher model is taught to the weaker student model to improve the latter's performance. Following the knowledge distillation principle, the conventional model serves as the teacher and the drunken model as the student, and the invention uses the original form of knowledge distillation. First, the conventional model is trained under different parameter configurations and the best-performing parameter settings are saved. Then the soft-label predictions of the conventional model and the drunken model are computed at the same temperature T, and the distillation loss is calculated. The final loss is a weighted sum of the distillation loss and the hard-label loss, where L_TOTAL denotes the total loss, α the weight of the distillation loss, L_DIST the distillation loss and L_LABEL the hard-label loss, as shown in equation (1):

L_TOTAL = α·L_DIST + (1 − α)·L_LABEL    (1)
The distillation loss matches the predictions of the student and teacher using the soft targets given in equation (2), where z denotes the logits, q the soft target, and T the temperature controlling the softness of the probability distribution:

q_i = exp(z_i / T) / Σ_j exp(z_j / T)    (2)
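A minimal sketch of the fusion-module loss, assuming the standard Hinton-style formulation with illustrative values for T and α (the T² gradient-scaling factor is conventional practice, not stated in the text), could be:

```python
import tensorflow as tf

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of distillation loss and hard-label loss, equations (1)-(2).
    hard_labels are integer class indices; temperature and alpha are assumed values."""
    # Soft targets q_i = exp(z_i / T) / sum_j exp(z_j / T), equation (2).
    teacher_soft = tf.nn.softmax(teacher_logits / temperature)
    student_soft = tf.nn.softmax(student_logits / temperature)
    # Distillation loss: divergence between teacher and student soft targets.
    l_dist = tf.keras.losses.KLDivergence()(teacher_soft, student_soft) * temperature ** 2
    # Hard-label loss on the student's ordinary predictions.
    l_label = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(
        hard_labels, student_logits)
    # Equation (1): L_TOTAL = alpha * L_DIST + (1 - alpha) * L_LABEL.
    return alpha * l_dist + (1.0 - alpha) * l_label
```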
4. Experimental part
(1) Data set and evaluation index
The dataset is the DCASE (Detection and Classification of Acoustic Scenes and Events) 2022 Task 1 development set (TAU Urban Acoustic Scenes 2022 Mobile, development dataset). The audio is provided in a mono 44.1 kHz, 24-bit format. The dataset contains 230,359 clips with a total duration of 64 hours; the invention splits it into 70% for training and 30% for testing, with each clip lasting 1 second, recorded in 10 cities with 9 devices: 3 real devices (A, B, C) and 6 simulated devices (S1-S6). Ten scenes are included: airport, shopping mall, metro station, pedestrian street, public square, street with traffic, tram, bus, metro and park. The validation set provided by the DCASE organizers, containing 22 hours of audio, is used to evaluate the trained model. System complexity is measured by the number of parameters and the multiply-accumulate count (MAC), and Loss and Accuracy are used to evaluate model performance.
(2) Model training arrangement
For feature extraction, 128 mel filters are used; the audio signal undergoes a fast Fourier transform (FFT) with the Hamming window length and frame shift set to 0.04 s and 0.02 s respectively, producing a 128×43 spectrogram. First-order and second-order difference features are then extracted, and the three feature spectrograms are stacked in the channel dimension into a 128×43×3 input feature map.
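A librosa-based sketch of this feature extraction step is shown below; the exact framing parameters of the authors' implementation may differ.

```python
import numpy as np
import librosa

def extract_conventional_features(path, sr=44100, n_mels=128):
    """Sketch of the conventional feature extraction: log-mel spectrogram plus
    first- and second-order differences, stacked along the channel dimension.
    Window/hop follow the 0.04 s / 0.02 s settings described above."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    n_fft = int(0.04 * sr)     # 0.04 s Hamming window
    hop = int(0.02 * sr)       # 0.02 s frame shift
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop,
                                         n_mels=n_mels, window="hamming")
    log_mel = librosa.power_to_db(mel)                  # (n_mels, frames)
    delta1 = librosa.feature.delta(log_mel, order=1)    # first-order difference
    delta2 = librosa.feature.delta(log_mel, order=2)    # second-order difference
    return np.stack([log_mel, delta1, delta2], axis=-1)  # (n_mels, frames, 3)
```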
In the training stage, the batch_size is uniformly set to 128 across experiments; the drunken model and the conventional model are each trained independently for 256 epochs, and the knowledge distillation experiments run for 200 epochs. In addition, two data enhancement techniques, Mixup and SpecAugment, are added to optimize the training process.
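As an illustration of the Mixup augmentation mentioned here, a minimal NumPy sketch (with an assumed mixing parameter alpha = 0.2) is:

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2):
    """Minimal Mixup sketch: blend each training example with a randomly permuted
    partner; alpha = 0.2 is an illustrative value, not taken from the text.
    x: batch of feature maps, y: one-hot labels."""
    lam = np.random.beta(alpha, alpha)            # mixing coefficient
    idx = np.random.permutation(len(x))           # random partner for each example
    x_mix = lam * x + (1.0 - lam) * x[idx]
    y_mix = lam * y + (1.0 - lam) * y[idx]
    return x_mix, y_mix
```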
Baseline system setting of the invention: the log-mel spectrum and its first-order and second-order differences are used as input features. FC_ResNet with Groups of 8 and a depth of 38 layers is trained for 256 epochs on the full dataset, without data enhancement or coordinate attention mechanisms. Experiments are run on an NVIDIA RTX 2080 Ti using the TensorFlow and Keras frameworks under the Windows 10 operating system.
(3) Description of the Experimental results
The drunken model (baseline model) constructed by the invention achieves 40.4% accuracy with a loss of 2.284. On this basis, after adopting the drunken-model mechanism, the accuracy rises to at most 45.2%, and the loss drops by 0.634 compared with the baseline system.
In addition, to verify the effectiveness of the drunken model of the invention, it is compared with the mainstream advanced ShuffleNetV2, ResNet, MobileNet and GhostNet. To ensure fairness, the parameter counts of the five models are kept within the range [440K, 463K], and accuracy is observed while the difference in parameter count is held within roughly 5%. The drunken model with 41.5% accuracy, referred to as FC_ResNet, is selected for comparison, and the other four models are trained under the same training settings; the experimental results are shown in Table 2. The results show that, at a similar parameter count, the model designed by the invention is more accurate than ShuffleNetV2 and ResNet, which stems from the rich feature information extracted at different scales. Compared with MobileNetV2, the accuracy improves by only 1% but with fewer parameters, demonstrating the advantage of the method in balancing accuracy and parameters. Compared with GhostNet, although the accuracy improves by only 0.3%, the parameter count is reduced by 5%, so the method is competitive with GhostNet.
TABLE 2
The above-described method according to the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded over a network for storage in a local recording medium, so that the method described herein may be carried out by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes a memory component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the lightweight acoustic scene perception method described herein. Further, when a general-purpose computer accesses code for implementing the processing shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing that processing.
Those of ordinary skill in the art will appreciate that the embodiments described herein are intended to aid the reader in understanding the practice of the invention and that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims (3)

1. A lightweight acoustic scene perception method based on a drunken model, characterized by comprising the following steps:
1) Conventional audio feature extraction: collecting original audio data and converting it into conventional audio features using the log-mel spectrum together with first-order and second-order differences;
2) Drunken feature extraction: based on the Squeeze-and-Excitation (SE) attention module, designing a feature conversion module based on an attention mechanism; processing the conventional audio features with the feature conversion module and obtaining drunken features after redundant information is removed; the feature conversion module comprises: global average pooling, a fully connected layer and a sigmoid activation function;
the specific operation of the feature conversion module is as follows: the feature conversion module first reduces the spatial dimensions of each channel to a scalar through global average pooling to obtain the global average value of each channel, compressing each channel to capture its global statistics while reducing computation; the correlations among channels are then learned through the fully connected layer; finally, the weights of the different channel-wise features are obtained with the sigmoid activation function and multiplied with each original feature to obtain the drunken features;
3) Training the conventional model: training with the structure and parameters of the conventional model; training the model and learning scene perception capability by inputting the conventional audio features;
4) Guiding module operation: applying channel reduction to the conventional model and adding a frequency grouping fusion convolution to obtain an initial version of the drunken model; finding the optimal number of channels and grouped convolution settings through experiments and adjustment;
5) Training the drunken model: training with the initial drunken model obtained from the guiding module; as with the conventional model training, inputting the drunken features and training and optimizing the drunken model;
6) Fusion module operation: using the conventional model as the teacher model and the drunken model as the student model, and improving the performance of the student model by knowledge distillation;
7) Model evaluation: evaluating the drunken model obtained in step 6), checking its accuracy and performance on the acoustic scene perception task, and obtaining an evaluation result;
8) Model optimization: optimizing and adjusting the drunken model according to the evaluation result to obtain the final drunken model;
9) Inputting the drunken features into the drunken model to obtain the acoustic scene perception result.
2. The drunken model-based lightweight acoustic scene perception method according to claim 1, characterized in that the drunken model comprises: a frequency grouping fusion convolution, a grouped convolution, an improved bottleneck block, convolution layers and pooling layers;
the operation flow for constructing the drunken model is as follows: enriching feature information through the frequency grouping fusion convolution, and performing channel expansion and downsampling through convolution and pooling layers to extract higher-level feature representations; stacking several improved bottleneck blocks with different channel numbers and strides, and adding Dropout to prevent overfitting; cascading the outputs of each group in the channel dimension so that the numbers of input and output channels stay consistent; and outputting the classification result after global average pooling and a fully connected layer;
the improved bottleneck block is obtained by modifying the bottleneck block of the residual network ResNet and mainly consists of grouped convolution; the improved bottleneck block replaces the 3×3 convolution in the bottleneck block with a grouped convolution; the grouped convolution uniformly divides the channel dimension into groups, each of which undergoes a 3×3 convolution.
3. The drunken model-based lightweight acoustic scene perception method according to claim 2, characterized in that the frequency grouping fusion convolution is placed at the first layer of the drunken model; the frequency grouping fusion convolution divides the frequency dimension of the convolution layer into several branches; each branch is processed by a 3×3 convolution layer; every branch except the first receives feature information from the previous branch, extracting features progressively from low to high frequencies; the outputs of each group are cascaded in the channel dimension and then fed to the next stage.
CN202311505530.7A 2023-11-13 2023-11-13 Light-weight acoustic scene perception method based on drunken model Active CN117524252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311505530.7A CN117524252B (en) 2023-11-13 2023-11-13 Light-weight acoustic scene perception method based on drunken model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311505530.7A CN117524252B (en) 2023-11-13 2023-11-13 Light-weight acoustic scene perception method based on drunken model

Publications (2)

Publication Number Publication Date
CN117524252A CN117524252A (en) 2024-02-06
CN117524252B true CN117524252B (en) 2024-04-05

Family

ID=89741360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311505530.7A Active CN117524252B (en) 2023-11-13 2023-11-13 Light-weight acoustic scene perception method based on drunken model

Country Status (1)

Country Link
CN (1) CN117524252B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728354A (en) * 2019-09-11 2020-01-24 东南大学 Improved sliding type grouping convolution neural network
CN112016639A (en) * 2020-11-02 2020-12-01 四川大学 Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet
CN112465140A (en) * 2020-12-07 2021-03-09 电子科技大学 Convolutional neural network model compression method based on packet channel fusion
CN113723474A (en) * 2021-08-12 2021-11-30 浙江云澎科技有限公司 Cross-channel aggregation similarity network system
CN114078471A (en) * 2020-08-20 2022-02-22 京东科技控股股份有限公司 Network model processing method, device, equipment and computer readable storage medium
CN114627895A (en) * 2022-03-29 2022-06-14 大象声科(深圳)科技有限公司 Acoustic scene classification model training method and device, intelligent terminal and storage medium
CN116246639A (en) * 2023-02-06 2023-06-09 思必驰科技股份有限公司 Self-supervision speaker verification model training method, electronic device and storage medium
KR20230154597A (en) * 2022-05-02 2023-11-09 계명대학교 산학협력단 Method and apparatus for detecting sound event based on sound event detection model using a differential feature


Also Published As

Publication number Publication date
CN117524252A (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN106952643A (en) A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering
CN105788592A (en) Audio classification method and apparatus thereof
CN111429938A (en) Single-channel voice separation method and device and electronic equipment
CN113205820B (en) Method for generating voice coder for voice event detection
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN112562698A (en) Power equipment defect diagnosis method based on fusion of sound source information and thermal imaging characteristics
CN113191178A (en) Underwater sound target identification method based on auditory perception feature deep learning
CN113763966A (en) End-to-end text-independent voiceprint recognition method and system
Ma et al. Deep semantic encoder-decoder network for acoustic scene classification with multiple devices
CN110808067A (en) Low signal-to-noise ratio sound event detection method based on binary multiband energy distribution
CN117524252B (en) Light-weight acoustic scene perception method based on drunken model
CN118351881A (en) Fusion feature classification and identification method based on noise reduction underwater sound signals
CN112735466A (en) Audio detection method and device
CN118098247A (en) Voiceprint recognition method and system based on parallel feature extraction model
CN117789758A (en) Urban audio classification method of convolutional neural network based on residual calculation
CN115331678B (en) Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient
CN116863956A (en) Robust snore detection method and system based on convolutional neural network
CN115267672A (en) Method for detecting and positioning sound source
CN113488069B (en) Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network
CN113539298B (en) Sound big data analysis and calculation imaging system based on cloud edge end
CN115602158A (en) Voice recognition acoustic model construction method and system based on telephone channel
Bursuc et al. Separable convolutions and test-time augmentations for low-complexity and calibrated acoustic scene classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant