CN113870896A - Motion sound false judgment method and device based on time-frequency graph and convolutional neural network
- Publication number
- CN113870896A (application number CN202111136079.7A)
- Authority
- CN
- China
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a motion sound false judgment method and device based on a time-frequency diagram and a convolutional neural network, wherein the method comprises the following steps: S1, splicing a plurality of motion sound segments carrying the same sound category label into a motion audio; S2, randomly intercepting a plurality of motion sound segments from the motion audio by oversampling and a plurality of reverse sound segments from a reverse audio by undersampling, to serve respectively as forward and reverse sample data for model training; S3, inputting the forward and reverse sample data into an improved convolutional neural network and forming a motion sound false judgment model through iterative update training; S4, intercepting a sound segment to be recognized from audio collected in a real environment and inputting it into the motion sound false judgment model, which outputs the false judgment result for the sound segment. The method balances the positive and negative samples used to train the motion sound false judgment model through oversampling and undersampling, and improves model recognition efficiency and accuracy through the improved neural network.
Description
Technical Field
The invention relates to the technical field of sound recognition, and in particular to a motion sound false judgment method and device based on a time-frequency diagram and a convolutional neural network.
Background
Motion sound false judgment refers to using sound recognition technology to determine whether a captured sound is the sound of a specific motion, for example, whether the current sound is a rope-skipping sound or a running sound. In recent years, voice recognition technology has matured: a speaker can be identified by analyzing unique characteristics of the voice, such as pronunciation frequency, and the technology is widely applied in fields such as identity authentication and network payment. For training existing voice recognition models, the data volumes of positive and negative samples are mostly balanced. In the field of motion sound false judgment, however, the motion sound is usually one specific sound (such as rope skipping), so the positive sample data for model training is very limited, while reverse sounds (the negative samples for model training), such as false motion sounds, environmental noise and speech, cover many categories and are abundant. The data volumes of positive and negative samples are therefore severely unbalanced, which directly affects the recognition accuracy of the motion sound false judgment model.
In addition, a long short-term memory (LSTM) neural network is currently the usual choice for training sound recognition models, but the LSTM network has a complex structure, slow model training and low recognition efficiency, and cannot meet the high sound-recognition-speed requirement of motion sound false judgment application scenarios.
Disclosure of Invention
The invention aims to balance the positive and negative samples used for training the motion sound false judgment model and to improve model recognition efficiency and accuracy, and therefore provides a motion sound false judgment method and device based on a time-frequency diagram and a convolutional neural network.
In order to achieve the purpose, the invention adopts the following technical scheme:
The motion sound false judgment method based on the time-frequency diagram and the convolutional neural network comprises the following steps:
S1, splicing a plurality of motion sound segments carrying the same sound category label into a motion audio;
S2, randomly intercepting a plurality of motion sound segments from the motion audio in an oversampling manner to serve as forward sample data for training the motion sound false judgment model, and randomly intercepting, from a reverse audio in an undersampling manner, reverse sound segments equal in number to the motion sound segments intercepted from the motion audio to serve as reverse sample data for model training, wherein each reverse sound segment corresponds in duration to one of the motion sound segments serving as forward sample data;
S3, inputting the forward sample data and the reverse sample data into an improved convolutional neural network, and forming the motion sound false judgment model through iterative update training;
S4, intercepting the sound segment to be recognized from audio collected in a real environment and inputting it into the motion sound false judgment model, the model outputting the motion false judgment result for the sound segment to be recognized.
As a preferable aspect of the present invention, in step S1, the non-motion sound portions of each motion sound segment are cut out, and the motion sound segments are then spliced together to form the motion audio.
In a preferred embodiment of the present invention, in step S2, the volume levels of the motion sound segment and the reverse sound segment are changed to expand the data amount of the forward sample data and the reverse sample data.
As a preferable aspect of the present invention, in step S4, the method by which the motion sound false judgment model outputs the false judgment result specifically comprises:
S41, converting the sound segment to be recognized from audio into a time-frequency diagram;
S42, inputting the time-frequency diagram into a two-dimensional average pooling layer to reduce its size;
S43, extracting convolution features from the size-reduced time-frequency diagram through two convolutional layers, and outputting a feature map of the time-frequency diagram;
S44, inputting the feature map into a maximum pooling layer and flattening it to obtain a one-dimensional vector representing the time-frequency diagram;
S45, inputting the one-dimensional vector into a first fully connected layer for dimension reduction;
and S46, inputting the dimension-reduced one-dimensional vector into a second fully connected layer, and after softmax activation, outputting the probability that the sound segment to be recognized is a motion sound or a non-motion sound.
As a preferred aspect of the present invention, in step S41, the method for converting the sound segment to be recognized from audio into the time-frequency diagram comprises:
S411, intercepting a plurality of frame segments in a sliding-window manner from the start of the sound segment to be recognized to its end, wherein two adjacent frame segments have an overlapping sound portion;
S412, multiplying each frame segment by a Hanning raised cosine window of the same length;
S413, performing Fourier transform on each frame segment to convert it from the spatial domain to the frequency domain, thereby obtaining the time-frequency spectrum corresponding to each frame segment;
S414, arranging the time-frequency spectra corresponding to the frame segments in time order to form the time-frequency diagram corresponding to the sound segment to be recognized.
As a preferred embodiment of the present invention, in step S411, the volume value at each moment of the sound segment to be recognized is normalized before the frame segments are intercepted.
As a preferable mode of the present invention, if the length of the frame segment is less than the nth power of 2, the frame segment is padded with zeros at the end until its length reaches the nth power of 2.
In a preferred embodiment of the present invention, n is 8.
The invention also provides a motion sound false judgment device based on the time-frequency graph and the convolutional neural network, which can realize the motion sound false judgment method, and the motion sound false judgment device comprises:
the motion audio synthesizing module is used for splicing a plurality of motion sound segments with the same sound category label into motion audio;
the reverse audio acquisition module is used for acquiring the reverse audio corresponding to the motion audio;
the sample equalization module is respectively connected with the motion audio synthesis module and the reverse audio acquisition module, and is used for randomly intercepting a plurality of motion sound fragments from the motion audio in an oversampling manner to serve as forward sample data of motion sound false judgment model training, and randomly intercepting reverse sound fragments with the same number as the motion sound fragments from a reverse audio in an undersampling manner to serve as reverse sample data of model training, wherein each reverse sound fragment corresponds to the motion sound fragment with the same duration;
the sample input module is connected with the sample equalization module and used for inputting the forward sample data and the reverse sample data into an improved convolutional neural network and forming the motion sound false judgment model through iterative updating training;
the model training module is connected with the sample input module and used for forming the motion sound false judgment model through iterative update training by an improved convolutional neural network according to the input forward sample data and the input reverse sample data;
the sound collection module is used for collecting motion sound from a real environment and forming the motion sound into an audio file;
the sound intercepting module is connected with the sound collection module and is used for intercepting the sound segment to be recognized from the audio file;
the sound input module is connected with the sound intercepting module and is used for inputting the sound segment to be recognized into the sound recognition module;
the sound recognition module is respectively connected with the model training module and the sound input module, and is used for recognizing and outputting, through the motion sound false judgment model, the probability that the sound segment to be recognized is a motion sound.
As a preferable aspect of the present invention, the sound recognition module comprises:
a time-frequency diagram generating module, configured to convert the to-be-identified sound segment from audio to a time-frequency diagram, where the time-frequency diagram generating module specifically includes:
the frame segment intercepting unit is used for intercepting a plurality of frame segments from the starting position of the sound segment to be identified to the end in a sliding window mode, and two adjacent frame segments have sound overlapping parts;
the windowing unit is connected with the frame segment intercepting unit and used for multiplying each frame segment by a Hanning raised cosine window with the same length;
a spatial domain converting unit, connected to the windowing unit, for performing fourier transform on each of the windowed frame segments to convert each of the frame segments from a spatial domain to a frequency domain;
and the sequencing unit is connected with the spatial domain conversion unit and used for arranging the time-frequency spectrum corresponding to each frame segment which completes spatial domain conversion according to the time sequence to form the time-frequency graph.
The invention has the following beneficial effects:
1. according to the method, a plurality of motion sound segments are randomly intercepted from the motion audio in an oversampling mode to serve as forward sample data of motion sound false judgment model training, the forward sample data size of the model training is expanded, reverse sound segments with the same number as the motion sound segments are randomly intercepted from the reverse audio in an undersampling mode to serve as reverse sample data of the model training, the reverse sample data size of the model training is reduced, and the balance of the number of positive and negative samples is realized;
2. the motion sound false judgment model is trained with a neural network of simpler structure, so model training is faster and model recognition is more efficient;
3. the identification precision of the motion sound false judgment model is improved by balancing the number of the positive and negative samples.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a diagram illustrating the implementation steps of the motion sound false judgment method based on a time-frequency diagram and a convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a diagram of the steps of the method by which the motion sound false judgment model recognizes and outputs the false judgment result;
FIG. 3 is a diagram of the method steps for converting a sound segment to be recognized from audio into a time-frequency diagram;
FIG. 4 is a schematic diagram of the structure of an improved convolutional neural network;
FIG. 5 is a schematic diagram of a frame segment cut out from a sound segment to be recognized in a sliding window manner;
FIG. 6 is a schematic diagram of a time-frequency diagram;
FIG. 7 is a schematic diagram of convolutional layers in a convolutional neural network;
FIG. 8 is a schematic diagram of an average pooling layer in a convolutional neural network;
fig. 9 is a schematic structural diagram of the motion sound false judgment device based on a time-frequency diagram and a convolutional neural network according to an embodiment of the present invention;
fig. 10 is a schematic diagram of the internal structure of the time-frequency diagram generation module in the sound recognition module of the motion sound false judgment device.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
Wherein the showings are for the purpose of illustration only and are shown by way of illustration only and not in actual form, and are not to be construed as limiting the present patent; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if the terms "upper", "lower", "left", "right", "inner", "outer", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not indicated or implied that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and the specific meanings of the terms may be understood by those skilled in the art according to specific situations.
In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" or the like, if appearing to indicate a connection relationship between the components, is to be understood broadly, for example, as being fixed or detachable or integral; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or may be connected through one or more other components or may be in an interactive relationship with one another. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
The motion sound false judgment method based on the time-frequency diagram and the convolutional neural network mainly solves two technical problems: 1. the imbalance between positive and negative samples for training the motion sound false judgment model; 2. the fact that the recognition accuracy and recognition speed of a motion sound false judgment model trained with the existing LSTM long short-term memory network cannot meet the requirements of motion sound false judgment application scenarios.
To solve technical problem 1, the technical solution provided in this embodiment is as follows. The collected motion sound data, i.e., each motion sound segment, is manually screened to cut out portions that do not meet the data purity requirement. For example, if an animal scream and no rope-skipping sound appear from 01'00" to 01'05" of a rope-skipping sound segment, then, to ensure the purity of the forward sample data (motion sound segments), the portion from 01'00" to 01'05" containing the animal scream is manually cut out and discarded.
The categories of reverse sounds of a motion sound, such as environmental noise, speech and false motion sounds (for example, when the motion to be recognized is rope skipping but the collected motion sound is running, the running sound is a false motion sound with respect to the rope-skipping sound), are far more numerous than the categories of the motion sound itself, so the two kinds of sound data suffer from a serious data-volume imbalance. In the prior art, the forward data is copied to increase its volume, or the reverse data is deleted to reduce its volume, so as to balance the two; however, this approach hardly improves the effectiveness of the forward sample data and does little for the recognition accuracy of the motion sound false judgment model. To solve this problem, the following method is adopted:
The motion sound segments carrying the same category label (such as rope-skipping sounds) are spliced into one complete motion audio, and a plurality of motion sound segments are then randomly intercepted from this motion audio in an oversampling manner to serve as forward sample data for training the motion sound false judgment model, all motion sound segments intercepted from the same motion audio having the same duration. Reverse sound segments, equal in number to the motion sound segments intercepted from the motion audio, are randomly intercepted in an undersampling manner from a reverse audio (audio of a single sound category such as environmental noise, speech or false motion sound, or audio spliced from various non-motion sounds) and used as reverse sample data for model training. To further ensure balance between the forward sample data and the reverse sample data, each reverse sound segment preferably corresponds in duration to a motion sound segment serving as forward sample data.
Preferably, the volume of each motion sound segment is changed to further expand the data size of the forward sample data; the data volume of the reverse sample data is further expanded by changing the volume of the reverse sound segment.
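The data preparation described above can be sketched in Python as follows; this is a minimal illustration assuming the audio is held as one-dimensional numpy arrays, and the function names, segment counts and durations are illustrative choices rather than values fixed by the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_segments(audio, num_segments, min_len, max_len):
    """Randomly intercept `num_segments` clips from a long 1-D audio array."""
    segments = []
    for _ in range(num_segments):
        length = rng.integers(min_len, max_len + 1)
        start = rng.integers(0, len(audio) - length)
        segments.append(audio[start:start + length].copy())
    return segments

def scale_volume(segment, factors=(0.5, 0.8, 1.2, 1.5)):
    """Expand the data amount by re-scaling the volume of a segment."""
    return [segment * f for f in factors]

# Placeholder waveforms at 16 kHz: a short spliced motion audio and a much
# longer spliced reverse (non-motion) audio.
motion_audio = rng.standard_normal(16000 * 60)
reverse_audio = rng.standard_normal(16000 * 600)

# Oversample the scarce motion audio ...
positive = random_segments(motion_audio, num_segments=50,
                           min_len=16000 * 5, max_len=16000 * 10)
# ... and undersample the abundant reverse audio with matching count and durations.
negative = [random_segments(reverse_audio, 1, len(p), len(p))[0] for p in positive]

# Optional volume augmentation of both classes.
positive_aug = [s for p in positive for s in scale_volume(p)]
negative_aug = [s for n in negative for s in scale_volume(n)]
```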
To solve technical problem 2, the technical solution provided in this embodiment is as follows: the forward sample data obtained by oversampling and the reverse sample data obtained by undersampling are used as model training samples, the motion sound false judgment model is obtained through iterative training of an improved convolutional neural network (see fig. 4 for the network structure of the improved convolutional neural network), and the sound segment to be recognized is then subjected to motion sound false judgment by this model, thereby meeting the accuracy and speed requirements of motion sound false judgment application scenarios.
In summary, as shown in fig. 1, a method for determining a motion sound based on a time-frequency diagram and a convolutional neural network according to an embodiment of the present invention includes:
step S1, splicing a plurality of motion sound segments carrying the same sound category label (such as the label "rope skipping sound" or "running sound") into a motion audio; there are many existing methods for splicing sound segments into a complete audio, which are not described in detail herein;
step S2, randomly intercepting a plurality of motion sound segments from the motion audio in an oversampling manner to serve as forward sample data for training the motion sound false judgment model, and randomly intercepting, from a reverse audio in an undersampling manner, reverse sound segments equal in number to the motion sound segments intercepted from the motion audio to serve as reverse sample data for model training. Preferably, each reverse sound segment corresponds in duration to one of the motion sound segments (for example, if there are 3 motion sound segments serving as forward sample data, with durations of 1 minute, 2 minutes and 3 minutes, then there are 3 reverse sound segments with durations of 1 minute, 2 minutes and 3 minutes, matching the motion sound segments in both number and duration);
step S3, inputting the forward sample data and the reverse sample data into an improved convolutional neural network, and forming a motion sound false judgment model through iterative updating training;
step S4, intercepting the sound segment to be recognized from audio collected in a real environment and inputting it into the motion sound false judgment model, the model outputting the motion false judgment result for the sound segment to be recognized.
The motion sound false judgment performed by the motion sound false judgment model comprises the following two steps: 1. converting the sound segment to be recognized from audio into a time-frequency diagram; 2. recognizing and outputting the motion false judgment result from the time-frequency diagram.
In this embodiment, as shown in fig. 3, the method for converting the audio frequency of the to-be-recognized sound segment into the time-frequency diagram includes:
step S411, intercepting a plurality of frame segments from the starting position of the sound segment to be recognized to the end in a sliding window mode, wherein two adjacent frame segments have sound overlapping parts; when frame segments are cut in a sliding window mode, the overlapping parts are reserved in two adjacent frame segments to form redundant information, and the identification error of a model can be reduced. As shown in fig. 5, the waveform in fig. 5 is a one-dimensional audio waveform of the sound segment to be recognized, one frame segment is cut through the first sliding window 100 from the start position of the waveform data (the length of the sliding window is, for example, 256 audio data points), then another frame segment is cut through the second sliding window 200 with the same length partially overlapping with the first sliding window 100 (for example, overlapping 128 audio data points), and this operation is repeated until the end of the sound segment to be recognized, and the frame segment cutting of the sound segment to be recognized is completed.
In order to increase the speed of the subsequent Fourier transform, the length of each frame segment is preferably an nth power of 2; if the length of a frame segment is less than the nth power of 2, it is padded with zeros at the end until its length reaches the nth power of 2. Preferably, n is 8, i.e., the frame length is preferably 256 audio data points; a value of n = 8 was found to give the most favorable speed for the subsequent Fourier transform.
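A minimal sketch of the sliding-window framing described above, assuming 256-point frames with a 128-point hop as in the example of fig. 5; the function name and the zero-padding of the final frame are illustrative.

```python
import numpy as np

def frame_signal(waveform, frame_len=256, hop=128):
    """Cut a 1-D waveform into overlapping frame segments.

    Adjacent frames share `frame_len - hop` samples (here 128), and the last
    frame is zero-padded at the end so every frame has a power-of-two length.
    """
    frames = []
    for start in range(0, len(waveform), hop):
        frame = waveform[start:start + frame_len]
        if len(frame) < frame_len:                      # pad the tail with zeros
            frame = np.pad(frame, (0, frame_len - len(frame)))
        frames.append(frame)
        if start + frame_len >= len(waveform):
            break
    return np.stack(frames)                             # shape: (num_frames, 256)
```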
Step S412, each frame segment is multiplied by a Hanning raised cosine window of the same length. Each frame segment is periodically extended during the subsequent Fourier transform, i.e., copies of the segment are conceptually joined end to end; because the frame segments are intercepted with partially overlapping sliding windows, discontinuities that do not exist in the original signal appear at the joins, so the spectrum after the Fourier transform would not match the actual situation. Windowing reduces these discontinuities so that the obtained spectrum matches the actual spectrum. Multiplying a frame segment by a Hanning raised cosine window of the same length means multiplying the value (volume value) of each audio data point in the frame segment by the window function value at the corresponding position, and using the product as the new value of that data point. The window function is expressed by the following equation (1):
hann(n) = 0.5 * (1 - cos(2πn / D))        (1)
In equation (1), hann(n) represents the window function value by which the nth audio data point in the frame segment is multiplied during windowing;
π represents the circumference ratio;
n represents the nth audio data point in the frame segment;
D represents the width of the Hanning raised cosine window.
Step S413, Fourier transform is performed on each frame segment to convert it from the spatial domain to the frequency domain, obtaining the time-frequency spectrum corresponding to each frame segment. In order to obtain the spectrum of the frame segment at its center time, a discrete Fourier transform is performed on each windowed frame segment; the discrete Fourier transform process is expressed by the following equations (2)-(3):
x[n] = Σ_{k=0}^{N-1} a_k · e^(j·k·w0·n)        (2)
a_k = (1/N) · Σ_{n=0}^{N-1} x[n] · e^(-j·k·w0·n)        (3)
In equations (2)-(3), x[n] represents the volume value of the nth audio data point in the frame segment, recovered by the inverse discrete Fourier transform of equation (2);
n represents the nth audio data point in the frame segment;
k denotes the kth harmonic;
a_k is the amplitude of the kth harmonic;
e is the base of the natural logarithm;
j is the imaginary unit;
w0 is the fundamental frequency;
N is the total number of audio data points in the frame segment.
For each frame segment, the magnitude of each frequency component can be calculated by equation (3).
In step S414, the time-frequency spectrums corresponding to the frame segments are arranged in time sequence to form a time-frequency graph corresponding to the voice segment to be identified (see fig. 6).
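Steps S412-S414 can be sketched as follows, with numpy's np.hanning and np.fft.rfft standing in for the window of equation (1) and the discrete Fourier transform of equations (2)-(3); for 256-point frames the rfft yields 129 frequency bins, consistent with the 124 × 129 diagram size used below.

```python
import numpy as np

def time_frequency_map(frames):
    """Convert framed audio of shape (num_frames, 256) into a time-frequency diagram.

    Each frame is multiplied by a Hanning (raised cosine) window of the same
    length, transformed to the frequency domain, and the magnitude spectra are
    stacked in time order to form the two-dimensional time-frequency diagram.
    """
    window = np.hanning(frames.shape[1])                     # raised cosine window per frame
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))   # one magnitude spectrum per frame
    return spectra                                           # shape: (num_frames, 129) for 256-point frames
```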
In order to increase the recognition speed of the motion sound false judgment model, in step S411, it is preferable to normalize the volume value at each moment of the sound segment to be recognized before intercepting the frame segments. The method for normalizing the volume values of the sound segment to be recognized is as follows:
The original one-dimensional audio waveform data of the sound segment to be recognized consists of a volume value at every moment; these volume values arranged in time order form the one-dimensional audio waveform data. Each data value, i.e., the volume measurement at a moment, is represented as a 16-bit binary number whose magnitude does not exceed 32767; dividing each data value by 32767 yields a normalized value between -1 and 1. Normalizing the data in this way greatly improves the training speed of the model.
In this embodiment, as shown in fig. 2 and 4, the method for identifying and outputting a motion hypothesis result according to a time-frequency diagram includes:
step S41, converting the sound segment to be recognized from audio into a time-frequency diagram of size 124 × 129 × 1 (124 is the number of rows of the two-dimensional diagram, 129 is the number of columns, and 1 is the number of channels);
In step S42, the 124 × 129 × 1 time-frequency diagram is input into the two-dimensional average pooling layer 1, which reduces its size to 31 × 32 × 1, increasing the recognition speed of the model and reducing storage usage. The average pooling process is shown in fig. 8: for example, with input A and output B, the average pooling operation takes the average of the 16 values A11, A12, ..., A44 in fig. 8 as the value of the output B.
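A minimal sketch of this average pooling, assuming a 4 × 4 pooling window (which maps 124 × 129 to 31 × 32) and simple truncation of the ragged last column, neither of which is stated explicitly in the embodiment:

```python
import numpy as np

def average_pool_2d(x, pool=4):
    """2-D average pooling: each pool x pool block is replaced by its mean.

    A 124 x 129 time-frequency diagram becomes 31 x 32, as in step S42
    (the ragged last column is handled here by truncation for simplicity).
    """
    rows = (x.shape[0] // pool) * pool
    cols = (x.shape[1] // pool) * pool
    x = x[:rows, :cols]
    return x.reshape(rows // pool, pool, cols // pool, pool).mean(axis=(1, 3))
```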
Step S43, convolution features are extracted from the size-reduced time-frequency diagram through the first convolutional layer 2 and the second convolutional layer 3, and a feature map of size 27 × 28 × 64 corresponding to the time-frequency diagram is output; referring to fig. 7, for example, with convolution input A, convolution kernel B and convolution output C, the specific operations are as follows:
A11*B11+A12*B12+A21*B21+A22*B22=C11;
A12*B11+A13*B12+A22*B21+A23*B22=C12;
A21*B11+A22*B12+A31*B21+A32*B22=C21;
A22*B11+A23*B12+A32*B21+A33*B22=C22;
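The per-element arithmetic above can be checked with a small numeric example (the input and kernel values are chosen arbitrarily for illustration):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)       # convolution input
B = np.array([[1, 0],
              [0, 1]], dtype=float)          # 2 x 2 convolution kernel

# Valid (no padding) sliding of the kernel over the input, exactly as in the
# four equations above: C[i, j] is the sum of elementwise products of B with
# the 2 x 2 patch of A whose top-left corner is (i, j).
C = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        C[i, j] = np.sum(A[i:i + 2, j:j + 2] * B)

print(C)   # [[ 6.  8.], [12. 14.]]
```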
Step S44, the feature map is input into the maximum pooling layer and flattened (i.e., the rows of the two-dimensional feature map are rearranged into a single column, the head of the second row following the tail of the first row, and so on for the remaining rows) to obtain a one-dimensional vector of length 11648 representing the time-frequency diagram;
step S45, inputting the one-dimensional vector into the first fully-connected layer 5 to perform dimension reduction processing into a vector with a length of 128;
Step S46, the dimension-reduced one-dimensional vector is input into the second fully connected layer 6 and, after softmax activation, the probability that the sound segment to be recognized is a motion sound or a non-motion sound is output, thereby realizing motion sound false judgment. The method for training the motion sound false judgment model is briefly as follows:
First the model parameters are initialized so that every parameter has a value. A time-frequency diagram from the training set is input and computed layer by layer according to the calculation steps defined by the model, finally outputting the probability that the time-frequency diagram belongs to the positive or negative class. This probability is compared with the label of the time-frequency diagram to obtain the error, which is then back-propagated through gradients to adjust the value of every parameter; the updated parameters reduce the error. Inputting all samples into the model and adjusting the parameters in this way completes one round of training and reduces the error between the class predicted by the model and the actual class. The above steps are repeated until the error no longer decreases and is not smaller than the error on the validation set, giving the best, non-overfitted model parameters; model training is then complete. Once training is finished, inputting a new sample into the model yields the prediction with the smallest error with respect to the actual class, i.e., the best discrimination accuracy, so that the class of a sample can be recognized accurately. No matter what other sounds are present in a sample, the trained model attends only to the characteristics of motion sound: samples with motion sound characteristics are judged as the positive class, and samples without them as the negative class.
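A minimal sketch of the network of fig. 4 and of the training procedure just described, assuming a TensorFlow/Keras implementation; the layer sizes follow the dimensions given in steps S41-S46, while the 3 × 3 kernels, the filter count of the first convolutional layer, the optimizer and the early-stopping settings are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Improved convolutional neural network sketched from the sizes in steps S41-S46.
model = models.Sequential([
    layers.Input(shape=(124, 129, 1)),            # time-frequency diagram
    layers.AveragePooling2D(pool_size=4),         # -> 31 x 32 x 1   (step S42)
    layers.Conv2D(64, 3, activation='relu'),      # -> 29 x 30 x 64  (first convolutional layer)
    layers.Conv2D(64, 3, activation='relu'),      # -> 27 x 28 x 64  (second convolutional layer, step S43)
    layers.MaxPooling2D(pool_size=2),             # -> 13 x 14 x 64
    layers.Flatten(),                             # -> 11648         (step S44)
    layers.Dense(128, activation='relu'),         # dimension reduction (step S45)
    layers.Dense(2, activation='softmax'),        # motion / non-motion probability (step S46)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# x_train/x_val: time-frequency diagrams; y_train/y_val: 1 = motion sound, 0 = reverse sound.
x_train = np.random.rand(32, 124, 129, 1); y_train = np.random.randint(0, 2, 32)
x_val = np.random.rand(8, 124, 129, 1);    y_val = np.random.randint(0, 2, 8)

# Iterative update training, stopping once the validation error stops improving,
# as described above.
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=3,
                                                      restore_best_weights=True)])
```

With these sizes the flattened vector before the first fully connected layer has length 13 × 14 × 64 = 11648, matching the value given in step S44.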
In conclusion, the invention solves the problem of unbalanced model training sample data in motion sound false judgment application scenarios, as well as the problems of low training efficiency, low recognition accuracy and low recognition speed of the existing LSTM long short-term memory neural network.
The invention also provides a motion sound false judgment device based on the time-frequency graph and the convolutional neural network, which can realize the motion sound false judgment method, and as shown in fig. 9, the motion sound false judgment device comprises:
the motion audio synthesizing module is used for splicing a plurality of motion sound segments with the same sound category label into motion audio;
the reverse audio acquisition module is used for acquiring the reverse audio corresponding to the motion audio;
the sample equalization module is respectively connected with the motion audio synthesis module and the reverse audio acquisition module, and is used for randomly intercepting a plurality of motion sound segments from the motion audio in an oversampling manner to serve as forward sample data for training the motion sound false judgment model, and for randomly intercepting, from the reverse audio in an undersampling manner, reverse sound segments equal in number to the motion sound segments to serve as reverse sample data for model training, wherein each reverse sound segment corresponds in duration to a motion sound segment;
the sample input module is connected with the sample equalization module and used for inputting the forward sample data and the reverse sample data into the improved convolutional neural network and forming a motion sound false judgment model through iterative updating training;
the model training module is connected with the sample input module and used for forming a motion sound false judgment model through iterative update training by an improved convolutional neural network according to input forward sample data and reverse sample data;
the sound collection module is used for collecting motion sound from a real environment and forming the motion sound into an audio file;
the sound intercepting module is connected with the sound collection module and is used for intercepting the sound segment to be recognized from the audio file;
the sound input module is connected with the sound intercepting module and is used for inputting the sound segment to be recognized into the sound recognition module;
and the sound recognition module is respectively connected with the model training module and the sound input module, and is used for recognizing and outputting, through the motion sound false judgment model, the probability that the sound segment to be recognized is a motion sound.
As shown in fig. 10, the sound recognition module comprises:
a time-frequency diagram generating module, configured to convert the audio of the to-be-identified sound segment into a time-frequency diagram, as shown in fig. 10, the time-frequency diagram generating module specifically includes:
the frame segment intercepting unit is used for intercepting a plurality of frame segments from the starting position of the sound segment to be identified to the end in a sliding window mode, and two adjacent frame segments have sound overlapping parts;
the windowing unit is connected with the frame segment intercepting unit and is used for multiplying each frame segment by a Hanning raised cosine window of the same length, so as to suppress the discontinuities, absent from the original signal, that arise when the frame segment is periodically extended and that would make the spectrum after the Fourier transform deviate from the actual one;
the spatial domain conversion unit is connected with the windowing unit and is used for carrying out Fourier transform on each windowed frame segment so as to convert each frame segment from a spatial domain to a frequency domain;
and the sequencing unit is connected with the spatial domain conversion unit and used for arranging the time-frequency spectrum corresponding to each frame segment which completes the spatial domain conversion according to the time sequence to form the time-frequency graph.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terms used in the specification and claims of the present application are not limiting, but are used merely for convenience of description.
Claims (10)
1. A motion sound false judgment method based on a time-frequency diagram and a convolutional neural network, characterized by comprising the following steps:
S1, splicing a plurality of motion sound segments carrying the same sound category label into a motion audio;
S2, randomly intercepting a plurality of motion sound segments from the motion audio in an oversampling manner to serve as forward sample data for training the motion sound false judgment model, and randomly intercepting, from a reverse audio in an undersampling manner, reverse sound segments equal in number to the motion sound segments intercepted from the motion audio to serve as reverse sample data for model training, wherein each reverse sound segment corresponds in duration to one of the motion sound segments serving as forward sample data;
S3, inputting the forward sample data and the reverse sample data into an improved convolutional neural network, and forming the motion sound false judgment model through iterative update training;
S4, intercepting the sound segment to be recognized from audio collected in a real environment and inputting it into the motion sound false judgment model, the model outputting the motion false judgment result for the sound segment to be recognized.
2. The method according to claim 1, wherein in step S1, the non-motion sound portions of each motion sound segment are cut out, and the motion sound segments are then spliced into the motion audio.
3. The method according to claim 1, wherein in step S2, the data amount of the forward sample data and the reverse sample data is expanded by changing the volume of the motion sound segments and the reverse sound segments.
4. The motion sound false judgment method based on a time-frequency diagram and a convolutional neural network according to claim 1, wherein in step S4, the method by which the motion sound false judgment model outputs the false judgment result specifically comprises the following steps:
S41, converting the sound segment to be recognized from audio into a time-frequency diagram;
S42, inputting the time-frequency diagram into a two-dimensional average pooling layer to reduce its size;
S43, extracting convolution features from the size-reduced time-frequency diagram through two convolutional layers, and outputting a feature map of the time-frequency diagram;
S44, inputting the feature map into a maximum pooling layer and flattening it to obtain a one-dimensional vector representing the time-frequency diagram;
S45, inputting the one-dimensional vector into a first fully connected layer for dimension reduction;
and S46, inputting the dimension-reduced one-dimensional vector into a second fully connected layer, and after softmax activation, outputting the probability that the sound segment to be recognized is a motion sound or a non-motion sound.
5. The motion sound false judgment method based on a time-frequency diagram and a convolutional neural network according to claim 4, wherein in step S41, the step of converting the sound segment to be recognized from audio into the time-frequency diagram comprises:
S411, intercepting a plurality of frame segments in a sliding-window manner from the start of the sound segment to be recognized to its end, wherein two adjacent frame segments have an overlapping sound portion;
S412, multiplying each frame segment by a Hanning raised cosine window of the same length;
S413, performing Fourier transform on each frame segment to convert it from the spatial domain to the frequency domain, thereby obtaining the time-frequency spectrum corresponding to each frame segment;
S414, arranging the time-frequency spectra corresponding to the frame segments in time order to form the time-frequency diagram corresponding to the sound segment to be recognized.
6. The method according to claim 5, wherein in step S411, the frame segments are intercepted after normalization processing is performed on the volume value at each moment of the sound segment to be recognized.
7. The method according to claim 5, wherein if the length of the frame segment is less than the nth power of 2, the frame segment is padded with zeros at the end until its length reaches the nth power of 2.
8. The method according to claim 7, wherein n is 8.
9. A motion sound false judgment device based on a time-frequency diagram and a convolutional neural network, capable of implementing the motion sound false judgment method according to any one of claims 1-8, characterized in that the motion sound false judgment device comprises:
the motion audio synthesis module is used for splicing a plurality of motion sound segments carrying the same sound category label into a motion audio;
the reverse audio acquisition module is used for acquiring the reverse audio corresponding to the motion audio;
the sample equalization module is respectively connected with the motion audio synthesis module and the reverse audio acquisition module, and is used for randomly intercepting a plurality of motion sound fragments from the motion audio in an oversampling manner to serve as forward sample data of motion sound false judgment model training, and randomly intercepting reverse sound fragments with the same number as the motion sound fragments from a reverse audio in an undersampling manner to serve as reverse sample data of model training, wherein each reverse sound fragment corresponds to the motion sound fragment with the same duration;
the sample input module is connected with the sample equalization module and is used for inputting the forward sample data and the reverse sample data into an improved convolutional neural network;
the model training module is connected with the sample input module and used for forming the motion sound false judgment model through iterative update training by an improved convolutional neural network according to the input forward sample data and the input reverse sample data;
the sound collection module is used for collecting motion sound from a real environment and forming the motion sound into an audio file;
the sound intercepting module is connected with the sound collecting module and is used for intercepting a sound fragment to be identified from the audio file;
the sound input module is connected with the sound intercepting module and is used for inputting the sound segment to be recognized into the sound recognition module;
the sound recognition module is respectively connected with the model training module and the sound input module, and is used for recognizing and outputting, through the motion sound false judgment model, the probability that the sound segment to be recognized is a motion sound.
10. The apparatus according to claim 9, wherein the sound recognition module comprises:
a time-frequency diagram generating module, configured to convert the to-be-identified sound segment from audio to a time-frequency diagram, where the time-frequency diagram generating module specifically includes:
the frame segment intercepting unit is used for intercepting a plurality of frame segments from the starting position of the sound segment to be identified to the end in a sliding window mode, and two adjacent frame segments have sound overlapping parts;
the windowing unit is connected with the frame segment intercepting unit and used for multiplying each frame segment by a Hanning raised cosine window with the same length;
a spatial domain converting unit, connected to the windowing unit, for performing fourier transform on each of the windowed frame segments to convert each of the frame segments from a spatial domain to a frequency domain;
and the sequencing unit is connected with the spatial domain conversion unit and used for arranging the time-frequency spectrum corresponding to each frame segment which completes spatial domain conversion according to the time sequence to form the time-frequency graph.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111136079.7A CN113870896A (en) | 2021-09-27 | 2021-09-27 | Motion sound false judgment method and device based on time-frequency graph and convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113870896A true CN113870896A (en) | 2021-12-31 |
Family
ID=78991242
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111136079.7A Pending CN113870896A (en) | 2021-09-27 | 2021-09-27 | Motion sound false judgment method and device based on time-frequency graph and convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113870896A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114648987A (en) * | 2022-04-28 | 2022-06-21 | 歌尔股份有限公司 | Speech recognition method, device, equipment and computer readable storage medium |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107680586A (en) * | 2017-08-01 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | Far field Speech acoustics model training method and system |
CN109493874A (en) * | 2018-11-23 | 2019-03-19 | 东北农业大学 | A kind of live pig cough sound recognition methods based on convolutional neural networks |
CN109545242A (en) * | 2018-12-07 | 2019-03-29 | 广州势必可赢网络科技有限公司 | A kind of audio data processing method, system, device and readable storage medium storing program for executing |
CN109785850A (en) * | 2019-01-18 | 2019-05-21 | 腾讯音乐娱乐科技(深圳)有限公司 | A kind of noise detecting method, device and storage medium |
CN110827837A (en) * | 2019-10-18 | 2020-02-21 | 中山大学 | Whale activity audio classification method based on deep learning |
CN111179971A (en) * | 2019-12-03 | 2020-05-19 | 杭州网易云音乐科技有限公司 | Nondestructive audio detection method and device, electronic equipment and storage medium |
CN111507380A (en) * | 2020-03-30 | 2020-08-07 | 中国平安财产保险股份有限公司 | Image classification method, system and device based on clustering and storage medium |
US20200302949A1 (en) * | 2019-03-18 | 2020-09-24 | Electronics And Telecommunications Research Institute | Method and apparatus for recognition of sound events based on convolutional neural network |
CN112071322A (en) * | 2020-10-30 | 2020-12-11 | 北京快鱼电子股份公司 | End-to-end voiceprint recognition method, device, storage medium and equipment |
CN112382310A (en) * | 2020-11-12 | 2021-02-19 | 北京猿力未来科技有限公司 | Human voice audio recording method and device |
CN112399247A (en) * | 2020-11-18 | 2021-02-23 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method, audio processing device and readable storage medium |
CN112634882A (en) * | 2021-03-11 | 2021-04-09 | 南京硅基智能科技有限公司 | End-to-end real-time voice endpoint detection neural network model and training method |
CN112700793A (en) * | 2020-12-24 | 2021-04-23 | 国网福建省电力有限公司 | Method and system for identifying fault collision of water turbine |
CN112863530A (en) * | 2021-01-07 | 2021-05-28 | 广州欢城文化传媒有限公司 | Method and device for generating sound works |
CN112883931A (en) * | 2021-03-29 | 2021-06-01 | 动者科技(杭州)有限责任公司 | Real-time true and false motion judgment method based on long and short term memory network |
CN112883930A (en) * | 2021-03-29 | 2021-06-01 | 动者科技(杭州)有限责任公司 | Real-time true and false motion judgment method based on full-connection network |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20211231 |