CN114974267A - Bird language classification model training method and bird language identification method - Google Patents
Bird language classification model training method and bird language identification method
- Publication number
- CN114974267A CN202210395700.XA
- Authority
- CN
- China
- Prior art keywords
- bird
- language
- classification model
- classification
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a bird language classification model training method and a bird language identification method. Bird language audio to be trained is collected, the corresponding bird information is determined, and the audio is divided into a bird sound part and a non-bird sound part according to a preset binary classification model. Feature extraction is performed on the bird sound part to obtain bird language classification features. A pre-established classification model is adjusted according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information. Finally, a bird language classification model is established from the preset binary classification model and the bird language identification model. On this basis, a bird language classification model for identifying bird information can be trained, while the precision and accuracy of recognition are improved.
Description
Technical Field
The invention relates to the technical field of bird research, in particular to a bird language classification model training method and a bird language identification method.
Background
Bird language refers to the sounds made by birds. In research on birds, birds can be identified and classified according to their calls, providing better support for environmental protection, animal protection and science popularization.
Traditional bird language classification approaches fall mainly into two types: 1. template-matching classification, which performs feature matching on spectrograms of bird calls; 2. training a multi-layer convolutional neural network with spectrograms of bird calls and the corresponding bird species as training data, and classifying the spectrum of a bird call at test time. However, traditional bird language classification methods have drawbacks. First, the template-matching approach requires a model for each bird species, which consumes considerable manpower and material resources; it also limits the range and efficiency of recognition, so large-scale recognition work cannot be completed. Second, the recognition rate of traditional neural-network bird language classification systems still has considerable room for improvement. Moreover, training a brand-new neural network for a particular bird requires a large amount of computing resources and labeled target data, which is costly.
Therefore, the traditional bird language classification methods have shortcomings.
Disclosure of Invention
Therefore, in view of the shortcomings of conventional bird language classification methods, it is necessary to provide a bird language classification model training method and a bird language identification method.
A method for training a bird language classification model comprises the following steps:
acquiring bird language audio to be trained and determining bird information corresponding to the bird language audio to be trained;
dividing the bird language audio to be trained into a bird sound part and a non-bird sound part according to a preset binary classification model;
extracting features from the bird sound part to obtain bird language classification features;
adjusting a pre-established classification model according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information;
and establishing a bird language classification model according to the preset binary classification model and the bird language identification model.
According to the above bird language classification model training method, after the bird language audio to be trained is collected and the corresponding bird information is determined, the audio is divided into a bird sound part and a non-bird sound part according to the preset binary classification model. Feature extraction is performed on the bird sound part to obtain bird language classification features. The pre-established classification model is adjusted according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information. Finally, a bird language classification model is established from the preset binary classification model and the bird language identification model. On this basis, a bird language classification model for identifying bird information can be trained from a relatively small amount of data, while the precision and accuracy of recognition are improved.
In one embodiment, the training process of the preset binary classification model includes the following steps:
acquiring an audio signal of a public data set; wherein the audio signal comprises bird sound data and environmental noise data;
performing a mathematical transformation on the audio signal to obtain a corresponding spectrogram;
applying band-pass filters at multiple Mel scales to the spectrogram to obtain a Mel spectrogram of the audio signal;
normalizing the Mel spectrogram to obtain a normalization result;
and training a binary classification neural network according to the normalization result to obtain the preset binary classification model.
In one embodiment, the bird language classification features include a Mel spectrogram of the bird sound part and bird sound recording metadata features.
In one embodiment, the process of extracting features from the bird sound part to obtain the bird language classification features includes the following steps:
dividing the bird sound part into audio segments;
performing a mathematical transformation on the audio segments, filtering the transformation result, and extracting frequency-band features;
and converting the frequency-band features into a Mel spectrogram expressed in decibels.
In one embodiment, the bird sound recording metadata features comprise a longitude vector, a latitude vector, an altitude vector, or a time vector;
the process of extracting the features of the bird sound part to obtain bird language classification features includes the following steps:
and mapping the metadata features of the bird sound recording to a high dimension, and combining the high dimension vector with the Mel frequency spectrogram to obtain bird language classification features.
In one embodiment, after the process of extracting features of the bird sound part and obtaining bird language classification features, the method further includes the following steps:
and performing data enhancement processing on the bird language classification features.
In one embodiment, the process of performing data enhancement processing on the bird language classification features includes the steps of:
adding augmentation data to the bird language classification features; wherein the augmentation data comprises data jitter, white noise, recordings of the same bird, or environmental noise.
In one embodiment, the classification model is a multi-layer convolutional neural network structure;
the multilayer convolutional neural network structure comprises a convolutional layer module and a full-connection layer module;
the convolutional layer module comprises a convolutional layer, an activation function layer, a BN layer and a maximum pooling layer;
the full connection layer module comprises a full connection layer, a BN layer, an activation function layer and a Dropout layer.
In one embodiment, the method further comprises the following steps:
and performing pooling processing on the bird language identification model according to a log-mean-exp (logarithm of averaged natural exponentials) pooling algorithm.
A bird language classification model training device comprises:
the data acquisition module is used for acquiring bird language audio to be trained and determining bird information corresponding to the bird language audio to be trained;
the audio classification module is used for dividing the bird language audio to be trained into a bird sound part and a non-bird sound part according to a preset binary classification model;
the feature extraction module is used for extracting features from the bird sound part to obtain bird language classification features;
the model training module is used for adjusting a pre-established classification model according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information;
and the model establishing module is used for establishing a bird language classification model according to the preset binary classification model and the bird language identification model.
According to the above bird language classification model training device, after the bird language audio to be trained is collected and the corresponding bird information is determined, the audio is divided into a bird sound part and a non-bird sound part according to the preset binary classification model. Feature extraction is performed on the bird sound part to obtain bird language classification features. The pre-established classification model is adjusted according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information. Finally, a bird language classification model is established from the preset binary classification model and the bird language identification model. On this basis, a bird language classification model for identifying bird information can be trained from a relatively small amount of data, while the precision and accuracy of recognition are improved.
A computer storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the bird language classification model training method according to any of the above embodiments.
With the above computer storage medium, after the bird language audio to be trained is collected and the corresponding bird information is determined, the audio is divided into a bird sound part and a non-bird sound part according to the preset binary classification model. Feature extraction is performed on the bird sound part to obtain bird language classification features. The pre-established classification model is adjusted according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information. Finally, a bird language classification model is established from the preset binary classification model and the bird language identification model. On this basis, a bird language classification model for identifying bird information can be trained from a relatively small amount of data, while the precision and accuracy of recognition are improved.
A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the bird language classification model training method according to any of the above embodiments when executing the computer program.
With the above computer device, after the bird language audio to be trained is collected and the corresponding bird information is determined, the audio is divided into a bird sound part and a non-bird sound part according to the preset binary classification model. Feature extraction is performed on the bird sound part to obtain bird language classification features. The pre-established classification model is adjusted according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information. Finally, a bird language classification model is established from the preset binary classification model and the bird language identification model. On this basis, a bird language classification model for identifying bird information can be trained from a relatively small amount of data, while the precision and accuracy of recognition are improved.
A bird language identification method comprises the following steps:
acquiring bird language audio to be identified;
and inputting the bird language audio to be identified into the bird language classification model to obtain bird information output by the bird language classification model.
According to the above bird language identification method, after the bird language audio to be identified is obtained, it is input into the bird language classification model, and the bird information output by the bird language classification model is obtained. On this basis, corresponding bird information can be determined from the bird language audio to be identified through the pre-trained bird language classification model, ensuring the accuracy and precision of bird information determination.
A bird language recognition device comprising:
the audio acquisition module is used for acquiring bird language audio to be identified;
and the information output module is used for inputting the bird language audio to be identified into the bird language classification model and obtaining bird information output by the bird language classification model.
According to the above bird language identification device, after the bird language audio to be identified is acquired, it is input into the bird language classification model, and the bird information output by the bird language classification model is obtained. On this basis, corresponding bird information can be determined from the bird language audio to be identified through the pre-trained bird language classification model, ensuring the accuracy and precision of bird information determination.
A computer storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the bird language identification method according to any of the above embodiments.
With the above computer storage medium, after the bird language audio to be identified is obtained, it is input into the bird language classification model, and the bird information output by the bird language classification model is obtained. On this basis, corresponding bird information can be determined from the bird language audio to be identified through the pre-trained bird language classification model, ensuring the accuracy and precision of bird information determination.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the bird language identification method according to any of the above embodiments when executing the program.
With the above computer device, after the bird language audio to be identified is obtained, it is input into the bird language classification model, and the bird information output by the bird language classification model is obtained. On this basis, corresponding bird information can be determined from the bird language audio to be identified through the pre-trained bird language classification model, ensuring the accuracy and precision of bird information determination.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a method for training a bird classification model according to an embodiment;
FIG. 2 is a flowchart of the training process of the preset binary classification model according to an embodiment;
FIG. 3 is a flowchart of a bird language classification model training method according to another embodiment;
FIG. 4 is a schematic diagram of a feature extraction process according to an embodiment;
FIG. 5 is a schematic diagram of a structural improvement of a multilayer convolutional neural network;
FIG. 6 is a flow chart of a filtering and re-scoring algorithm;
FIG. 7 is a block diagram of an apparatus for training a bird classification model according to an embodiment;
FIG. 8 is a flow diagram of a method for bird language identification according to an embodiment;
FIG. 9 is a block diagram of an exemplary bird language identification device;
FIG. 10 is a schematic diagram of an internal structure of a computer according to an embodiment.
Detailed Description
For better understanding of the objects, technical solutions and effects of the present invention, the present invention will be further explained with reference to the accompanying drawings and examples. Meanwhile, the following described examples are only for explaining the present invention, and are not intended to limit the present invention.
The embodiment of the invention provides a method for training a bird language classification model.
Fig. 1 is a flowchart of a method for training a bird language classification model according to an embodiment, and as shown in fig. 1, the method for training a bird language classification model according to an embodiment includes steps S100 to S104:
S100, collecting bird language audio to be trained and determining bird information corresponding to the bird language audio to be trained;
S101, dividing the bird language audio to be trained into a bird sound part and a non-bird sound part according to a preset binary classification model;
S102, extracting features from the bird sound part to obtain bird language classification features;
S103, adjusting a pre-established classification model according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information;
and S104, establishing a bird language classification model according to the preset binary classification model and the bird language identification model.
The bird language audio to be trained serves as a training sample and can be a recording of the calls of birds whose bird information is known. In one embodiment, the bird information includes the bird species, habitat area, recording time and place, and other information related to the birds. As a preferred embodiment, the bird information includes the bird species.
Because the collected bird language audio to be trained contains bird sounds, environmental sounds and the like, in order to improve the accuracy of subsequent model identification, the bird language audio to be trained needs to be divided into a bird sound part and a non-bird sound part by the preset binary classification model.
In one embodiment, fig. 2 is a flowchart illustrating the training process of the preset binary classification model according to an embodiment. As shown in fig. 2, the training process of the preset binary classification model includes steps S200 to S204:
S200, acquiring an audio signal of a public data set; wherein the audio signal comprises bird sound data and environmental noise data;
S201, performing a mathematical transformation on the audio signal to obtain a corresponding spectrogram;
S202, applying band-pass filters at multiple Mel scales to the spectrogram to obtain a Mel spectrogram of the audio signal;
S203, normalizing the Mel spectrogram to obtain a normalization result;
and S204, training a binary classification neural network according to the normalization result to obtain the preset binary classification model.
An acoustic scene classification data set can be selected as the public data set. For example, bird sound data and environmental noise data are obtained from the public data set of DCASE2017, whose audio clips are 5-second segments.
The mathematical transformations used to convert the audio signal into a spectrogram include Fourier transforms such as the short-time Fourier transform (STFT) or the fast Fourier transform.
The normalization processing of the Mel frequency spectrogram comprises mean normalization processing so as to improve the balance and the signal-to-noise ratio of the Mel frequency spectrogram.
An end-to-end binary classification neural network is trained according to the normalization result, and the resulting preset binary classification model can be used to divide the bird language audio to be trained into a bird sound part and a non-bird sound part. It should be noted that the non-bird sound part serves as a noise part and can be used for subsequent data enhancement processing. A minimal preprocessing sketch for this binary model is given below.
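As a rough illustration of steps S200 to S204 (excluding the network training itself), the following Python sketch uses librosa; the sample rate, FFT size, hop length and number of Mel bands are illustrative assumptions rather than values prescribed by this embodiment.

```python
import librosa
import numpy as np

def clip_to_normalized_mel(path, sr=22050, n_fft=1024, hop_length=512, n_mels=64):
    """Sketch of steps S201-S203: 5-second clip -> mean-normalized Mel spectrogram."""
    audio, _ = librosa.load(path, sr=sr, duration=5.0)
    # S201: mathematical transformation (here an STFT) giving a power spectrogram
    power_spec = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)) ** 2
    # S202: band-pass filter banks at multiple Mel scales -> Mel spectrogram
    mel = librosa.feature.melspectrogram(S=power_spec, sr=sr, n_mels=n_mels)
    # S203: mean normalization to balance the features and improve the signal-to-noise ratio
    return (mel - mel.mean()) / (mel.std() + 1e-8)

# S204 would then train a binary (bird / non-bird) network on these normalized features.
```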
Further, feature extraction is performed on the bird sound part to obtain bird language classification features. These features serve as the training data of the subsequent bird language identification model, with the corresponding bird information as the label; the bird language identification model establishes a mapping between the training data and the labels and takes the bird information as its output, so that bird information can be identified from other audio.
In one embodiment, the bird language classification features include a Mel spectrogram of the bird sound part and bird sound recording metadata features.
Fig. 3 is a flowchart of a method for training a bird language classification model according to another embodiment, and as shown in fig. 3, the process of extracting features of a bird sound part in step S102 to obtain bird language classification features includes steps S300 to S302:
S300, dividing the bird sound part into audio segments;
S301, performing a mathematical transformation on the audio segments, filtering the transformation result, and extracting frequency-band features;
S302, converting the frequency-band features into a Mel spectrogram expressed in decibels.
In one embodiment, fig. 4 is a schematic diagram of a feature extraction process according to an embodiment. As shown in fig. 4, the bird sound part is segmented into a plurality of N-second segments (N may be 5 s, corresponding to the public data set), with the overlap between adjacent segments set to M seconds, so as to retain richer information and increase the amount of training data. As a preferred embodiment, M = N - 1.
In one embodiment, as shown in FIG. 4, the mathematical transform comprises a Fourier transform, such as a short-time Fourier transform or a fast Fourier transform. As a preferred embodiment, the short-time fourier transform is used as the mathematical transform for the audio piece. The long-time audio clip is divided into a plurality of short equal-length signals, then Fourier transform of each equal-length signal is calculated, and the obtained frequency band characteristics can contain more time frequency information.
In one embodiment, as shown in fig. 4, the mathematical transformation result is filtered by a Mel filter. The Mel filter can extract the features of each frequency bin in a targeted manner.
In one embodiment, as shown in fig. 4, the method further includes the steps of:
and performing band-pass filtering processing on the frequency band characteristics through a band-pass filter.
The band-pass filtering includes high-pass and low-pass filtering, which removes information at excessively high or excessively low frequencies. Such information carries little feature content, and filtering it out reduces the amount of subsequent data processing.
As shown in fig. 4, converting the frequency-band features into a Mel spectrogram expressed in decibels, rather than on an ordinary linear scale, enriches the feature representation. After the corresponding Mel spectrogram is determined, its size is adjusted to meet the input requirements of the subsequent bird language identification model. A rough sketch of this feature extraction pipeline is given below.
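The following sketch is a non-authoritative reading of steps S300 to S302 together with the band-pass filtering described above, again using librosa; the frequency limits fmin/fmax, the number of Mel bands and all segmentation parameters other than N = 5 and M = N - 1 are assumptions.

```python
import librosa
import numpy as np

def bird_part_to_mel_db(audio, sr, n_seconds=5, overlap=4, fmin=500, fmax=12000, n_mels=64):
    """Sketch of steps S300-S302: overlapping N-second segments of the bird sound part,
    each converted to a band-limited Mel spectrogram expressed in decibels."""
    seg_len = n_seconds * sr
    hop = (n_seconds - overlap) * sr            # M = N - 1 gives a 1-second hop
    mel_specs = []
    for start in range(0, max(len(audio) - seg_len + 1, 1), hop):
        segment = audio[start:start + seg_len]
        # S301: short-time Fourier transform plus Mel filtering; fmin/fmax act as the
        # high-pass / low-pass limits that discard very low and very high frequencies
        mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=n_mels,
                                             fmin=fmin, fmax=fmax)
        # S302: express the features in decibels rather than on a linear scale
        mel_specs.append(librosa.power_to_db(mel, ref=np.max))
    # each spectrogram would then be resized to the input shape expected by the model
    return mel_specs
```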
In one embodiment, the bird sound recording metadata features comprise a longitude vector, a latitude vector, an altitude vector, or a time vector;
for the metadata feature, data information such as latitude, longitude, altitude, and recording time is considered. From these provided metadata, a vector containing a plurality of elements, for example, a 7-element vector, can be obtained as follows:
whether to provide latitude and longitude, 1 means yes, 0 means no;
latitude, normalized between 0 and 1;
longitude, normalized between 0 and 1;
whether altitude is provided, 1 means yes, 0 means no;
altitude, normalized between 0 and 1;
whether recording time is provided, 1 means yes, 0 means no;
time information, normalized directly between 0 and 1.
As shown in fig. 3, the process of extracting features from the bird sound part in step S102 to obtain the bird language classification features further includes step S303:
S303, mapping the bird sound recording metadata features to a high dimension, and combining the high-dimensional vector with the Mel spectrogram to obtain the bird language classification features.
In one embodiment, multiple features representing the metadata, such as altitude, longitude and latitude, and time information, are mapped to a high dimension through fully connected layers of a neural network and finally merged with the Mel spectrogram to determine the bird language classification features. This feature extraction approach is novel, simple and effective. A minimal sketch of the metadata vector is given below; the fully connected mapping itself appears in the network sketch later in this description.
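The sketch below builds the 7-element metadata vector described above; the exact normalization ranges (latitude/longitude spans, the altitude cap, hour-of-day scaling) are illustrative assumptions, since the text only requires values normalized between 0 and 1.

```python
import numpy as np

def metadata_vector(lat=None, lon=None, alt=None, hour=None):
    """Sketch of the 7-element recording metadata vector (presence flags + normalized values)."""
    has_latlon = lat is not None and lon is not None
    has_alt = alt is not None
    has_time = hour is not None
    return np.array([
        1.0 if has_latlon else 0.0,                                # latitude/longitude provided?
        (lat + 90.0) / 180.0 if has_latlon else 0.0,               # latitude in [0, 1]
        (lon + 180.0) / 360.0 if has_latlon else 0.0,              # longitude in [0, 1]
        1.0 if has_alt else 0.0,                                   # altitude provided?
        min(max(alt, 0.0), 8000.0) / 8000.0 if has_alt else 0.0,   # altitude in [0, 1]
        1.0 if has_time else 0.0,                                  # recording time provided?
        (hour % 24.0) / 24.0 if has_time else 0.0,                 # time of day in [0, 1]
    ], dtype=np.float32)
```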
and S400, performing data enhancement processing on the bird language classification features.
Through data enhancement processing, for example deformations on the time and frequency scales applied to the bird language classification features, the amount of data is increased, so that the subsequent classification model can adapt to more complex real-world scenarios. Meanwhile, data enhancement can alleviate the data imbalance caused by an insufficient amount of data, lowering the amount of bird language audio that needs to be collected for training.
In one embodiment, the process of performing data enhancement processing on the bird language classification features comprises the following steps:
adding augmentation data to the bird language classification features; wherein the augmentation data comprises data jitter, white noise, recordings of the same bird, or environmental noise.
Wherein, the non-bird sound part can be selected as the environmental noise.
Specifically, data jitter is added to the time and frequency axes of the Mel spectrogram, Gaussian white noise is added to the Mel spectrogram, recordings of the same bird are added to the Mel spectrogram, and the non-bird sound part is added to the Mel spectrogram as environmental noise. A sketch of these augmentation operations is given below.
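The following sketch applies these augmentations to a single Mel spectrogram; the shift ranges, noise level and mixing weights are illustrative choices, not values from this description.

```python
import numpy as np

def augment_mel(mel_db, rng, same_bird_mel=None, noise_mel=None):
    """Sketch of the data enhancement above: time/frequency jitter, Gaussian white noise,
    a recording of the same bird, and environmental noise from the non-bird sound part."""
    out = mel_db.copy()
    out = np.roll(out, rng.integers(-2, 3), axis=1)    # jitter along the time axis
    out = np.roll(out, rng.integers(-1, 2), axis=0)    # jitter along the frequency axis
    out = out + rng.normal(0.0, 0.5, size=out.shape)   # additive Gaussian white noise
    if same_bird_mel is not None:                      # mix in a recording of the same bird
        out = 0.7 * out + 0.3 * same_bird_mel
    if noise_mel is not None:                          # add the non-bird sound part as noise
        out = out + 0.2 * noise_mel
    return out

# rng = np.random.default_rng(0); each training sample can be augmented independently.
```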
In one embodiment, the classification model is a multi-layer convolutional neural network structure, and the subsequent identification precision is improved by improving the multi-layer convolutional neural network structure. Specifically, fig. 5 is a schematic diagram illustrating an improved structure of a multilayer convolutional neural network, as shown in fig. 5, the multilayer convolutional neural network structure includes a convolutional layer module and a fully connected layer module;
the convolutional layer module comprises a convolutional layer, an activation function layer, a BN layer and a maximum pooling layer;
the full connection layer module comprises a full connection layer, a BN layer, an activation function layer and a Dropout layer.
As shown in fig. 5, the entire convolutional neural network is formed by connecting convolutional layer modules (Conv Module) and fully connected layer modules (FC Module) in series. A convolutional layer module is composed of a convolutional layer, an ELU activation function layer, a BN (batch normalization) layer and a max pooling layer. A fully connected layer module consists of a fully connected layer, a BN layer, an ELU activation function layer and a Dropout layer. The network consists of five Conv Modules and two FC Modules, followed by a final fully connected layer, an ELU activation function layer and a SoftMax activation function layer.
In one embodiment, the convolutional neural network flattens the Mel spectrogram features into a vector of a certain dimension. As a preferred embodiment, the Mel spectrogram features are flattened into a 512-dimensional vector.
At the same time, an additional fully connected layer is constructed for the metadata, through which the above multi-element vector (e.g., 7-dimensional) is converted into a higher-dimensional vector. The flattened Mel spectrogram features and the transformed metadata vector are then concatenated and input into the next fully connected layer. Finally, the predicted probability for each bird is output by the SoftMax layer. As a preferred embodiment, the 7-dimensional metadata vector is converted into a 100-dimensional vector by the fully connected layer.
The ELU activation function can make use of the negative part of the input information. Adding BN layers mitigates the gradient problems caused by features of different scales, such as vanishing and exploding gradients. The Dropout layer prevents the model from overfitting and improves the overall generalization of the classification model. A minimal network sketch following this structure is given below.
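A minimal PyTorch sketch following the structure in fig. 5 is shown here. The channel counts, the hidden sizes of the two FC Modules and the dropout rate are assumptions; only the module composition, the 512-dimensional flattened spectrogram vector, the 7-to-100-dimensional metadata mapping and the per-species SoftMax output are taken from the description.

```python
import torch
import torch.nn as nn

def conv_module(c_in, c_out):
    # Conv Module: convolutional layer, ELU activation, BN layer, max pooling layer
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ELU(),
                         nn.BatchNorm2d(c_out), nn.MaxPool2d(2))

def fc_module(d_in, d_out, p_drop=0.5):
    # FC Module: fully connected layer, BN layer, ELU activation, Dropout layer
    return nn.Sequential(nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                         nn.ELU(), nn.Dropout(p_drop))

class BirdLanguageNet(nn.Module):
    def __init__(self, n_species, meta_dim=7):
        super().__init__()
        # five Conv Modules acting on the single-channel Mel spectrogram
        self.conv = nn.Sequential(conv_module(1, 32), conv_module(32, 64),
                                  conv_module(64, 128), conv_module(128, 256),
                                  conv_module(256, 256))
        self.flatten_fc = nn.LazyLinear(512)      # flatten the Mel features into 512 dimensions
        self.meta_fc = nn.Linear(meta_dim, 100)   # map the 7-dim metadata vector to 100 dims
        # two FC Modules, then the final fully connected layer
        self.head = nn.Sequential(fc_module(512 + 100, 512), fc_module(512, 256),
                                  nn.Linear(256, n_species))

    def forward(self, mel, meta):
        x = self.conv(mel).flatten(1)
        x = torch.cat([self.flatten_fc(x), self.meta_fc(meta)], dim=1)
        return self.head(x)                       # logits, one per bird species

    def predict_proba(self, mel, meta):
        # the final SoftMax activation layer: per-species probability estimates
        return torch.softmax(self.forward(mel, meta), dim=1)
```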
The training of the multi-layer convolutional neural network structure is mainly divided into two steps. First, the classification model is pre-trained; the purpose of this step is to use a large amount of preset data to give the classification model strong feature extraction capability. Second, the classification model parameters are fine-tuned; the purpose of this step is to transfer the pre-trained classification model to the current task (bird language classification features to bird information) using a smaller amount of labeled target data. The training hyper-parameters are set as follows: the batch size is set to a preset value, e.g., 128; the loss function is the cross-entropy loss; and the optimizer is SGD (stochastic gradient descent). The learning-rate scheduler uses cosine annealing, with the learning rate initialized to a set value, e.g., 0.01. During pre-training, the learning rate is adjusted according to the periodicity of the cosine function, with a certain number of iterations as one period; the periodic increase and decrease of the learning rate can effectively improve the performance of the classification model. As a preferred embodiment, the learning rate is adjusted according to the periodicity of the cosine function with 50 iterations as one period. A sketch of this training setup is given below.
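The sketch below reuses the BirdLanguageNet sketch above; the loss, optimizer, learning rate and cosine period follow the text, while the momentum value, the species count and the data loader are illustrative assumptions.

```python
import torch

model = BirdLanguageNet(n_species=100)            # hypothetical number of bird species
criterion = torch.nn.CrossEntropyLoss()           # cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)        # SGD, lr = 0.01
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)   # 50-iteration period

# train_loader is a hypothetical DataLoader yielding (mel, meta, label) batches of size 128
for mel, meta, label in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(mel, meta), label)     # logits vs. species labels
    loss.backward()
    optimizer.step()
    scheduler.step()                              # periodic cosine adjustment of the learning rate
```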
In one embodiment, as shown in fig. 3, the method further includes step S500:
and S500, performing pooling treatment on the bird language identification model according to a logarithm-average-natural index pooling algorithm.
Since the test audio given to the bird language identification model is of indefinite length, pooling over time is necessary. Therefore, a log-mean-exp pooling algorithm is used, which integrates the information across the whole recording and provides a more accurate estimate. The algorithm is implemented according to the following formula:

y = log( (1/T) * Σ_{t=1}^{T} exp(y_t) )

where T is the number of Mel spectrograms (bird language classification features) generated from the test recording, and y_t is the bird probability estimate (the probability of the corresponding bird information) obtained for the t-th Mel spectrogram. For a segment of test audio, a set of Mel spectrogram features is first generated with a sliding 5-second window. These features are then input into the trained neural network to obtain an estimate for each Mel spectrogram. Each estimate is a vector whose dimensions are the probabilities estimated by the system for each bird species. The probability values are converted by the natural exponential, accumulated and averaged, and finally transformed logarithmically to obtain the final result. This method integrates the results of all features in a recording to obtain a more accurate output.
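A small sketch of this pooling step is given below, under the assumption that the per-spectrogram outputs are stacked into an array of shape (T, number of species).

```python
import numpy as np

def log_mean_exp_pool(probs):
    """Log-mean-exp pooling: probs[t] is the per-species probability vector y_t estimated
    for the t-th Mel spectrogram of the recording; returns one pooled score per species."""
    return np.log(np.mean(np.exp(probs), axis=0) + 1e-12)   # exp -> average -> log

# pooled = log_mean_exp_pool(np.stack(per_segment_probs))
# top_species = np.argsort(pooled)[::-1][:10]               # candidates kept for re-scoring
```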
Fig. 6 is a flowchart of the filtering and re-scoring algorithm. As shown in fig. 6, since the range of calling frequencies of each bird species is relatively fixed, a fixed band-pass filter is designed for each bird. For each Mel spectrogram, the top ten estimates are kept; the band-pass filters of these ten specific birds are then applied to the original spectrogram, and the filtered results are fed into the network again for re-scoring. The new scores of the ten birds are re-ranked to obtain a more accurate estimate.
It should be noted that the selection of the number of specific birds can be adjusted according to model training and application range, and the core is to improve the estimation accuracy of the model by re-scoring.
For certain bird species, the calling frequency always lies within a certain range; outside this range, any other information, including environmental sounds or the songs of other birds, can be regarded as noise. Therefore, one mask is designed for each bird species: all spectrograms of one species are normalized in the frequency domain, a threshold is set, for example between 0.5 and 0.9, and frequency ranges whose values fall below the threshold are masked. As a preferred embodiment, the threshold is 0.6.
On this basis, the masks of all classified bird species can be regarded as band-pass filters. For every 5-second segment, the predicted bird species are ranked by their probabilities according to the output of the neural network. The top 3 or 5 bird species are selected, and the spectrogram is passed through the band-pass filters of these selected species respectively. After the band-pass filtering, the 3 or 5 new spectrograms are re-scored by the neural network. With this approach, interference can be reduced and more accurate results can be obtained with the current model. A sketch of this filter-and-re-score step is given below.
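The sketch below illustrates the filter-and-re-score step; the helper that turns a spectrogram into a network input, the mask format and the choice of top_k are assumptions.

```python
import numpy as np
import torch

def to_input(mel_db):
    # hypothetical helper: add the batch and channel dimensions expected by the network
    return torch.tensor(mel_db, dtype=torch.float32).unsqueeze(0).unsqueeze(0)

def filter_and_rescore(model, mel_db, meta, species_masks, top_k=5):
    """Keep the top-k candidate species, apply each species' band-pass mask (built from the
    thresholded, frequency-normalized spectrograms of that species) to the original
    spectrogram, and re-score the masked spectrograms with the same network."""
    with torch.no_grad():
        probs = model.predict_proba(to_input(mel_db), meta)[0].numpy()
    candidates = np.argsort(probs)[::-1][:top_k]
    rescored = {}
    for s in candidates:
        masked = mel_db * species_masks[s][:, None]          # band-pass filter for species s
        with torch.no_grad():
            rescored[int(s)] = model.predict_proba(to_input(masked), meta)[0, s].item()
    # re-rank the candidates by their new scores
    return sorted(rescored.items(), key=lambda kv: kv[1], reverse=True)
```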
In addition, bird species are distributed differently across regions, so if the recordings of the candidate birds carry location information, the range of candidate species can be narrowed accordingly. Using this geographic information makes the output more targeted.
According to any of the above bird language classification model training methods, after the bird language audio to be trained is collected and the corresponding bird information is determined, the audio is divided into a bird sound part and a non-bird sound part according to the preset binary classification model. Feature extraction is performed on the bird sound part to obtain bird language classification features. The pre-established classification model is adjusted according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information. Finally, a bird language classification model is established from the preset binary classification model and the bird language identification model. On this basis, a bird language classification model for identifying bird information can be trained from a relatively small amount of data, while the precision and accuracy of recognition are improved.
Fig. 7 is a block diagram of a bird language classification model training apparatus according to an embodiment, and as shown in fig. 7, the bird language classification model training apparatus according to an embodiment includes:
the data acquisition module 100 is configured to acquire bird language audio to be trained and determine bird information corresponding to the bird language audio to be trained;
the audio classification module 101 is configured to divide the bird language audio to be trained into a bird sound part and a non-bird sound part according to a preset binary classification model;
the feature extraction module 102 is configured to perform feature extraction on the bird sound part to obtain bird language classification features;
the model training module 103 is configured to adjust a pre-established classification model according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information;
and the model establishing module 104 is configured to establish a bird language classification model according to the preset binary classification model and the bird language identification model.
According to the above bird language classification model training device, after the bird language audio to be trained is collected and the corresponding bird information is determined, the audio is divided into a bird sound part and a non-bird sound part according to the preset binary classification model. Feature extraction is performed on the bird sound part to obtain bird language classification features. The pre-established classification model is adjusted according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information. Finally, a bird language classification model is established from the preset binary classification model and the bird language identification model. On this basis, a bird language classification model for identifying bird information can be trained from a relatively small amount of data, while the precision and accuracy of recognition are improved.
Fig. 8 is a flowchart of a bird language recognition method according to an embodiment, and as shown in fig. 8, the bird language recognition method according to an embodiment includes steps S600 and S601:
S600, acquiring bird language audio to be identified;
S601, inputting the bird language audio to be identified into the bird language classification model, and obtaining bird information output by the bird language classification model.
The bird language audio to be identified is input into the bird language classification model, which divides the audio into a bird sound part and a non-bird sound part and identifies the bird information corresponding to the audio according to the bird sound part. An illustrative end-to-end sketch is given below.
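The sketch below strings together the pieces sketched earlier for one recording; the file path, the metadata values and the already-trained `bird_model` are hypothetical, and the bird / non-bird split by the preset binary classification model is omitted for brevity.

```python
import librosa
import numpy as np
import torch

audio, sr = librosa.load("unknown_bird.wav", sr=22050)          # hypothetical recording
segments = bird_part_to_mel_db(audio, sr)                        # Mel features per 5 s window
meta = torch.tensor(metadata_vector(lat=22.5, lon=114.0, alt=30.0, hour=7.0)).unsqueeze(0)

bird_model.eval()                                                # BN / Dropout in inference mode
per_segment_probs = []
with torch.no_grad():
    for mel in segments:
        x = torch.tensor(mel, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
        per_segment_probs.append(bird_model.predict_proba(x, meta)[0].numpy())

pooled = log_mean_exp_pool(np.stack(per_segment_probs))          # combine the whole recording
print("predicted species index:", int(np.argmax(pooled)))
```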
According to the above bird language identification method, after the bird language audio to be identified is obtained, it is input into the bird language classification model, and the bird information output by the bird language classification model is obtained. On this basis, corresponding bird information can be determined from the bird language audio to be identified through the pre-trained bird language classification model, ensuring the accuracy and precision of bird information determination.
Fig. 9 is a block diagram of a bird language recognition apparatus according to an embodiment, and as shown in fig. 9, the bird language recognition apparatus according to an embodiment includes:
the audio acquisition module 200 is configured to acquire bird language audio to be identified;
and the information output module 201 is configured to input the bird language audio to be identified into the bird language classification model and obtain bird information output by the bird language classification model.
According to the above bird language identification device, after the bird language audio to be identified is acquired, it is input into the bird language classification model, and the bird information output by the bird language classification model is obtained. On this basis, corresponding bird information can be determined from the bird language audio to be identified through the pre-trained bird language classification model, ensuring the accuracy and precision of bird information determination.
The embodiment of the present invention further provides a computer storage medium on which computer instructions are stored; when the instructions are executed by a processor, the bird language classification model training method or the bird language identification method according to any of the above embodiments is implemented.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, the computer program can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the methods of the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a RAM, a ROM, a magnetic or optical disk, or various other media that can store program code.
Corresponding to the computer storage medium, in an embodiment, there is also provided a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement any one of the methods for training a bird classification model and identifying a bird.
The computer device may be a terminal, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for training a bird classification model or a method for bird recognition. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
After the computer device collects the bird language audio to be trained and determines the corresponding bird information, the bird language audio to be trained is divided into a bird sound part and a non-bird sound part according to the preset binary classification model. Feature extraction is performed on the bird sound part to obtain bird language classification features. The pre-established classification model is adjusted according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information. Finally, a bird language classification model is established from the preset binary classification model and the bird language identification model. On this basis, a bird language classification model for identifying bird information can be trained from a relatively small amount of data, while the precision and accuracy of recognition are improved.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only show some embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A method for training a bird language classification model is characterized by comprising the following steps:
acquiring bird language audio to be trained and determining bird information corresponding to the bird language audio to be trained;
dividing the bird language audio to be trained into a bird sound part and a non-bird sound part according to a preset binary classification model;
extracting features from the bird sound part to obtain bird language classification features;
adjusting a pre-established classification model according to the bird language classification features and the corresponding bird information to obtain a bird language identification model for outputting bird information;
and establishing a bird language classification model according to the preset binary classification model and the bird language identification model.
2. The bird language classification model training method according to claim 1, wherein the training process of the preset binary classification model comprises the following steps:
acquiring an audio signal of a public data set; wherein the audio signal comprises bird sound data and ambient noise data;
performing mathematical transformation on the audio signal to obtain a corresponding spectrogram;
applying band-pass filters under a plurality of Mel scales to the spectrogram to obtain a Mel spectrogram of the audio signal;
carrying out normalization processing on the Mel frequency spectrogram to obtain a normalization processing result;
and training a binary classification neural network according to the normalization result to obtain the preset binary classification model.
3. The bird language classification model training method according to claim 1, wherein the bird language classification features comprise a Mel spectrogram of the bird sound part and bird sound recording metadata features.
4. The bird language classification model training method according to claim 3, wherein the process of extracting features from the bird sound part to obtain the bird language classification features comprises the following steps:
segmenting the bird sound part into audio segments;
performing a mathematical transformation on the audio segments, filtering the transformation result, and extracting frequency-band features;
and converting the frequency-band features into a Mel spectrogram expressed in decibels.
5. The bird language classification model training method according to claim 4, wherein the bird sound recording metadata features comprise a longitude vector, a latitude vector, an altitude vector, or a time vector;
the process of extracting the features of the bird sound part to obtain the bird language classification features comprises the following steps:
and mapping the metadata features of the bird sound recording to a high dimension, and combining the high dimension vector with the Mel frequency spectrogram to obtain bird language classification features.
6. The bird language classification model training method according to claim 1, wherein after the process of extracting features from the bird sound part to obtain the bird language classification features, the method further comprises the following steps:
performing data enhancement processing on the bird language classification features.
7. The method for training a bird language classification model according to claim 6, wherein the process of performing data augmentation processing on the bird language classification features comprises the following step:
adding augmentation data to the bird language classification features, wherein the augmentation data comprises data jitter, white noise, recordings of the same bird species, or environmental noise.
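A rough sketch of the waveform-level augmentation in claim 7; the gain range, noise level, and mixing weight are arbitrary illustrative values rather than figures from the patent:

```python
import numpy as np

def augment_waveform(y, noise_clip=None, rng=None):
    """Apply data jitter, white noise, and optionally mix in another recording
    such as environmental noise or another clip of the same species."""
    rng = rng or np.random.default_rng()
    out = y.astype(np.float32).copy()
    out *= rng.uniform(0.9, 1.1)                                  # data jitter (random gain)
    out += rng.normal(0.0, 0.005 * np.abs(out).max(), out.shape)  # white noise
    if noise_clip is not None:                                    # mix in a second recording
        n = noise_clip[: len(out)]
        out[: len(n)] += 0.1 * n
    return out
```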
8. The method for training a bird language classification model according to claim 1, wherein the pre-established classification model is a multi-layer convolutional neural network structure;
the multi-layer convolutional neural network structure comprises a convolutional layer module and a fully connected layer module;
the convolutional layer module comprises a convolutional layer, an activation function layer, a BN layer and a maximum pooling layer;
the fully connected layer module comprises a fully connected layer, a BN layer, an activation function layer and a Dropout layer.
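The layer ordering in claim 8 could be composed as below (PyTorch); the kernel size, channel counts, and dropout probability are not given in the claims and are placeholders:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # convolutional layer module: convolution -> activation -> BN -> max pooling
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.BatchNorm2d(out_ch),
        nn.MaxPool2d(2),
    )

def fc_block(in_dim, out_dim, p_drop=0.5):
    # fully connected layer module: linear -> BN -> activation -> Dropout
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),
        nn.ReLU(),
        nn.Dropout(p_drop),
    )
```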
9. The method for training a bird language classification model according to claim 1, further comprising the following step:
performing pooling processing on the bird language recognition model according to a log-mean-exponential pooling algorithm.
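If "log-mean-exponential" is read as the usual log-mean-exp pooling over frame-level scores (an interpretation, since the claim does not spell out the formula), it interpolates between average pooling and max pooling and could be written as:

```python
import math
import torch

def log_mean_exp_pool(frame_scores, sharpness=1.0, dim=1):
    """Pool frame-level scores to a clip-level score; small sharpness values
    behave like average pooling, large values like max pooling."""
    num_frames = frame_scores.shape[dim]
    pooled = torch.logsumexp(sharpness * frame_scores, dim=dim) - math.log(num_frames)
    return pooled / sharpness
```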
10. A bird language identification method, characterized by comprising the following steps:
acquiring bird language audio to be identified;
and inputting the bird language audio to be identified into the bird language classification model to obtain the bird information output by the bird language classification model.
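Putting claims 1 and 10 together, the inference path could look roughly like the sketch below; `binary_model`, `recognition_model`, and `audio_to_normalized_mel` are hypothetical names, and the frame-mask handling is only one possible design:

```python
def identify_bird(audio_path, binary_model, recognition_model):
    """Two-stage inference: separate bird sound from non-bird sound with the
    preset binary classification model, then classify only the bird sound part."""
    mel = audio_to_normalized_mel(audio_path)   # preprocessing as sketched for claim 2
    bird_mask = binary_model(mel)               # boolean mask of bird-sound frames
    bird_part = mel[:, bird_mask]               # keep only the bird sound part
    return recognition_model(bird_part)         # bird information, e.g. a species label
```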
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210395700.XA CN114974267A (en) | 2022-04-15 | 2022-04-15 | Bird language classification model training method and bird language identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114974267A true CN114974267A (en) | 2022-08-30 |
Family
ID=82977935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210395700.XA Pending CN114974267A (en) | 2022-04-15 | 2022-04-15 | Bird language classification model training method and bird language identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114974267A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107393542A (en) * | 2017-06-28 | 2017-11-24 | 北京林业大学 | A kind of birds species identification method based on binary channels neutral net |
CN108877778A (en) * | 2018-06-13 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Sound end detecting method and equipment |
CN110120224A (en) * | 2019-05-10 | 2019-08-13 | 平安科技(深圳)有限公司 | Construction method, device, computer equipment and the storage medium of bird sound identification model |
CN110246504A (en) * | 2019-05-20 | 2019-09-17 | 平安科技(深圳)有限公司 | Birds sound identification method, device, computer equipment and storage medium |
CN111816166A (en) * | 2020-07-17 | 2020-10-23 | 字节跳动有限公司 | Voice recognition method, apparatus, and computer-readable storage medium storing instructions |
CN113936667A (en) * | 2021-09-14 | 2022-01-14 | 广州大学 | Bird song recognition model training method, recognition method and storage medium |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118430012A (en) * | 2024-03-29 | 2024-08-02 | 北京积加科技有限公司 | Multi-mode fusion bird identification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108877775B (en) | Voice data processing method and device, computer equipment and storage medium | |
CN110120224B (en) | Method and device for constructing bird sound recognition model, computer equipment and storage medium | |
CN112435684B (en) | Voice separation method and device, computer equipment and storage medium | |
EP2695160B1 (en) | Speech syllable/vowel/phone boundary detection using auditory attention cues | |
CN117095694B (en) | Bird song recognition method based on tag hierarchical structure attribute relationship | |
CN110942766A (en) | Audio event detection method, system, mobile terminal and storage medium | |
WO2022141868A1 (en) | Method and apparatus for extracting speech features, terminal, and storage medium | |
Passricha et al. | A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR | |
CN110931023B (en) | Gender identification method, system, mobile terminal and storage medium | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
CN108847253B (en) | Vehicle model identification method, device, computer equipment and storage medium | |
CN111932056A (en) | Customer service quality scoring method and device, computer equipment and storage medium | |
CN110136726A (en) | A kind of estimation method, device, system and the storage medium of voice gender | |
CN111933148A (en) | Age identification method and device based on convolutional neural network and terminal | |
CN113936667A (en) | Bird song recognition model training method, recognition method and storage medium | |
Dua et al. | Optimizing integrated features for Hindi automatic speech recognition system | |
CN112908344A (en) | Intelligent recognition method, device, equipment and medium for bird song | |
CN114974267A (en) | Bird language classification model training method and bird language identification method | |
Wang et al. | A hierarchical birdsong feature extraction architecture combining static and dynamic modeling | |
Zhang et al. | Discriminative frequency filter banks learning with neural networks | |
CN116153339A (en) | Speech emotion recognition method and device based on improved attention mechanism | |
CN115267672A (en) | Method for detecting and positioning sound source | |
Bang et al. | Audio-Based Recognition of Bird Species Using Deep Learning | |
Bakshi et al. | Spoken Indian language classification using GMM supervectors and artificial neural networks | |
CN114912539B (en) | Environmental sound classification method and system based on reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||