CN113870896A - Motion sound false judgment method and device based on time-frequency graph and convolutional neural network
- Publication number
- CN113870896A (application number CN202111136079.7A)
- Authority
- CN
- China
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a motion sound false judgment method and device based on a time-frequency diagram and a convolutional neural network, wherein the method comprises the following steps: S1, splicing a plurality of motion sound segments carrying the same sound category label into a motion audio; S2, randomly intercepting a plurality of motion sound segments from the motion audio by oversampling and a plurality of reverse sound segments from a reverse audio by undersampling, to serve respectively as forward and reverse sample data for model training; S3, inputting the forward and reverse sample data into an improved convolutional neural network and forming a motion sound false judgment model through iterative update training; S4, intercepting a sound segment to be recognized from audio collected in a real environment and inputting it into the motion sound false judgment model, which outputs the false judgment result for the sound segment. The method balances the positive and negative samples used to train the motion sound false judgment model through oversampling and undersampling, and improves model recognition efficiency and accuracy through the improved neural network.
Description
Technical Field
The invention relates to the technical field of sound recognition, and in particular to a motion sound false judgment method and device based on a time-frequency diagram and a convolutional neural network.
Background
Motion sound false judgment refers to using sound recognition technology to determine whether a captured sound is the sound of a specific motion, for example, whether the current sound is a rope-skipping sound or a running sound. In recent years, voice recognition technology has matured: a speaker can be identified by analyzing unique characteristics of the voice, such as pronunciation frequency, and the technology is widely applied in fields such as identity authentication and network payment. For training existing voice recognition models, the data volumes of positive and negative samples are mostly balanced. In the field of motion sound false judgment, however, the motion sound is usually one specific sound (such as rope skipping), so the positive sample data for model training is very limited, while reverse sounds (the negative samples for model training), such as false motion sounds, environmental noise and speech, cover many categories and are abundant. The data volumes of positive and negative samples are therefore severely unbalanced, which directly affects the recognition accuracy of the motion sound false judgment model.
In addition, a long short-term memory (LSTM) neural network is currently the usual choice for training sound recognition models, but the LSTM network has a complex structure, slow model training and low recognition efficiency, and cannot meet the high sound-recognition-speed requirement of motion sound false judgment application scenarios.
Disclosure of Invention
The invention aims to balance the positive and negative samples used for training the motion sound false judgment model and to improve model recognition efficiency and accuracy, and therefore provides a motion sound false judgment method and device based on a time-frequency diagram and a convolutional neural network.
In order to achieve the purpose, the invention adopts the following technical scheme:
The motion sound false judgment method based on the time-frequency diagram and the convolutional neural network comprises the following steps:
S1, splicing a plurality of motion sound segments carrying the same sound category label into a motion audio;
S2, randomly intercepting a plurality of motion sound segments from the motion audio in an oversampling manner to serve as forward sample data for training the motion sound false judgment model, and randomly intercepting, from a reverse audio in an undersampling manner, reverse sound segments equal in number to the motion sound segments intercepted from the motion audio to serve as reverse sample data for model training, wherein each reverse sound segment corresponds in duration to one of the motion sound segments serving as forward sample data;
S3, inputting the forward sample data and the reverse sample data into an improved convolutional neural network, and forming the motion sound false judgment model through iterative update training;
S4, intercepting the sound segment to be recognized from audio collected in a real environment and inputting it into the motion sound false judgment model, the model outputting the motion false judgment result for the sound segment to be recognized.
As a preferable aspect of the present invention, in step S1, the non-motion sound portions of each motion sound segment are cut out, and the motion sound segments are then spliced together to form the motion audio.
In a preferred embodiment of the present invention, in step S2, the volume levels of the motion sound segment and the reverse sound segment are changed to expand the data amount of the forward sample data and the reverse sample data.
As a preferable aspect of the present invention, in step S4, the method by which the motion sound false judgment model outputs the false judgment result specifically comprises:
S41, converting the sound segment to be recognized from audio into a time-frequency diagram;
S42, inputting the time-frequency diagram into a two-dimensional average pooling layer to reduce its size;
S43, extracting convolution features from the size-reduced time-frequency diagram through two convolutional layers, and outputting a feature map of the time-frequency diagram;
S44, inputting the feature map into a maximum pooling layer and flattening it to obtain a one-dimensional vector representing the time-frequency diagram;
S45, inputting the one-dimensional vector into a first fully connected layer for dimension reduction;
and S46, inputting the dimension-reduced one-dimensional vector into a second fully connected layer, and after softmax activation, outputting the probability that the sound segment to be recognized is a motion sound or a non-motion sound.
As a preferred aspect of the present invention, in step S41, the method for converting the sound segment to be recognized from audio into the time-frequency diagram comprises:
S411, intercepting a plurality of frame segments in a sliding-window manner from the start of the sound segment to be recognized to its end, wherein two adjacent frame segments have an overlapping sound portion;
S412, multiplying each frame segment by a Hanning raised cosine window of the same length;
S413, performing Fourier transform on each frame segment to convert it from the spatial domain to the frequency domain, thereby obtaining the time-frequency spectrum corresponding to each frame segment;
S414, arranging the time-frequency spectra corresponding to the frame segments in time order to form the time-frequency diagram corresponding to the sound segment to be recognized.
As a preferred embodiment of the present invention, in step S411, the volume value at each moment of the sound segment to be recognized is normalized before the frame segments are intercepted.
As a preferable mode of the present invention, if the length of the frame segment is less than the nth power of 2, the frame segment is padded with zeros at the end until its length reaches the nth power of 2.
In a preferred embodiment of the present invention, n is 8.
The invention also provides a motion sound false judgment device based on the time-frequency graph and the convolutional neural network, which can realize the motion sound false judgment method, and the motion sound false judgment device comprises:
the motion audio synthesizing module is used for splicing a plurality of motion sound segments with the same sound category label into motion audio;
the reverse audio acquisition module is used for acquiring the reverse audio corresponding to the motion audio;
the sample equalization module is respectively connected with the motion audio synthesis module and the reverse audio acquisition module, and is used for randomly intercepting a plurality of motion sound fragments from the motion audio in an oversampling manner to serve as forward sample data of motion sound false judgment model training, and randomly intercepting reverse sound fragments with the same number as the motion sound fragments from a reverse audio in an undersampling manner to serve as reverse sample data of model training, wherein each reverse sound fragment corresponds to the motion sound fragment with the same duration;
the sample input module is connected with the sample equalization module and used for inputting the forward sample data and the reverse sample data into an improved convolutional neural network and forming the motion sound false judgment model through iterative updating training;
the model training module is connected with the sample input module and used for forming the motion sound false judgment model through iterative update training by an improved convolutional neural network according to the input forward sample data and the input reverse sample data;
the sound collection module is used for collecting motion sound from a real environment and forming the motion sound into an audio file;
the sound intercepting module is connected with the sound collection module and is used for intercepting the sound segment to be recognized from the audio file;
the sound input module is connected with the sound intercepting module and is used for inputting the sound segment to be recognized into the sound recognition module;
the sound recognition module is respectively connected with the model training module and the sound input module, and is used for recognizing and outputting, through the motion sound false judgment model, the probability that the sound segment to be recognized is a motion sound.
As a preferable aspect of the present invention, the sound recognition module comprises:
a time-frequency diagram generating module, configured to convert the to-be-identified sound segment from audio to a time-frequency diagram, where the time-frequency diagram generating module specifically includes:
the frame segment intercepting unit is used for intercepting a plurality of frame segments from the starting position of the sound segment to be identified to the end in a sliding window mode, and two adjacent frame segments have sound overlapping parts;
the windowing unit is connected with the frame segment intercepting unit and used for multiplying each frame segment by a Hanning raised cosine window with the same length;
a spatial domain converting unit, connected to the windowing unit, for performing fourier transform on each of the windowed frame segments to convert each of the frame segments from a spatial domain to a frequency domain;
and the sequencing unit is connected with the spatial domain conversion unit and used for arranging the time-frequency spectrum corresponding to each frame segment which completes spatial domain conversion according to the time sequence to form the time-frequency graph.
The invention has the following beneficial effects:
1. according to the method, a plurality of motion sound segments are randomly intercepted from the motion audio in an oversampling mode to serve as forward sample data of motion sound false judgment model training, the forward sample data size of the model training is expanded, reverse sound segments with the same number as the motion sound segments are randomly intercepted from the reverse audio in an undersampling mode to serve as reverse sample data of the model training, the reverse sample data size of the model training is reduced, and the balance of the number of positive and negative samples is realized;
2. the motion sound false judgment model is trained with a neural network of simpler structure, so model training is faster and model recognition is more efficient;
3. the identification precision of the motion sound false judgment model is improved by balancing the number of the positive and negative samples.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a diagram illustrating the implementation steps of the motion sound false judgment method based on a time-frequency diagram and a convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a diagram of the steps of the method by which the motion sound false judgment model recognizes and outputs the false judgment result;
FIG. 3 is a diagram of the method steps for converting a sound segment to be recognized from audio into a time-frequency diagram;
FIG. 4 is a schematic diagram of the structure of an improved convolutional neural network;
FIG. 5 is a schematic diagram of a frame segment cut out from a sound segment to be recognized in a sliding window manner;
FIG. 6 is a schematic diagram of a time-frequency diagram;
FIG. 7 is a schematic diagram of convolutional layers in a convolutional neural network;
FIG. 8 is a schematic diagram of an average pooling layer in a convolutional neural network;
fig. 9 is a schematic structural diagram of the motion sound false judgment device based on a time-frequency diagram and a convolutional neural network according to an embodiment of the present invention;
fig. 10 is a schematic diagram of the internal structure of the time-frequency diagram generation module in the sound recognition module of the motion sound false judgment device.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
Wherein the showings are for the purpose of illustration only and are shown by way of illustration only and not in actual form, and are not to be construed as limiting the present patent; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if the terms "upper", "lower", "left", "right", "inner", "outer", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not indicated or implied that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and the specific meanings of the terms may be understood by those skilled in the art according to specific situations.
In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" or the like, if appearing to indicate a connection relationship between the components, is to be understood broadly, for example, as being fixed or detachable or integral; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or may be connected through one or more other components or may be in an interactive relationship with one another. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
The motion sound false judgment method based on the time-frequency diagram and the convolutional neural network mainly solves two technical problems: 1. the imbalance between positive and negative samples for training the motion sound false judgment model; 2. the fact that the recognition accuracy and recognition speed of a motion sound false judgment model trained with the existing LSTM long short-term memory network cannot meet the requirements of motion sound false judgment application scenarios.
To solve technical problem 1, the technical solution provided in this embodiment is as follows. The collected motion sound data, i.e., each motion sound segment, is manually screened to cut out portions that do not meet the data purity requirement. For example, if an animal scream and no rope-skipping sound appear from 01'00" to 01'05" of a rope-skipping sound segment, then, to ensure the purity of the forward sample data (motion sound segments), the portion from 01'00" to 01'05" containing the animal scream is manually cut out and discarded.
The categories of reverse sounds of a motion sound, such as environmental noise, speech and false motion sounds (for example, when the motion to be recognized is rope skipping but the collected motion sound is running, the running sound is a false motion sound with respect to the rope-skipping sound), are far more numerous than the categories of the motion sound itself, so the two kinds of sound data suffer from a serious data-volume imbalance. In the prior art, the forward data is copied to increase its volume, or the reverse data is deleted to reduce its volume, so as to balance the two; however, this approach hardly improves the effectiveness of the forward sample data and does little for the recognition accuracy of the motion sound false judgment model. To solve this problem, the following method is adopted:
The motion sound segments carrying the same category label (such as rope-skipping sounds) are spliced into one complete motion audio, and a plurality of motion sound segments are then randomly intercepted from this motion audio in an oversampling manner to serve as forward sample data for training the motion sound false judgment model, all motion sound segments intercepted from the same motion audio having the same duration. Reverse sound segments, equal in number to the motion sound segments intercepted from the motion audio, are randomly intercepted in an undersampling manner from a reverse audio (audio of a single sound category such as environmental noise, speech or false motion sound, or audio spliced from various non-motion sounds) and used as reverse sample data for model training. To further ensure balance between the forward sample data and the reverse sample data, each reverse sound segment preferably corresponds in duration to a motion sound segment serving as forward sample data.
Preferably, the volume of each motion sound segment is changed to further expand the data size of the forward sample data; the data volume of the reverse sample data is further expanded by changing the volume of the reverse sound segment.
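The data preparation described above can be sketched in Python as follows; this is a minimal illustration assuming the audio is held as one-dimensional numpy arrays, and the function names, segment counts and durations are illustrative choices rather than values fixed by the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_segments(audio, num_segments, min_len, max_len):
    """Randomly intercept `num_segments` clips from a long 1-D audio array."""
    segments = []
    for _ in range(num_segments):
        length = rng.integers(min_len, max_len + 1)
        start = rng.integers(0, len(audio) - length)
        segments.append(audio[start:start + length].copy())
    return segments

def scale_volume(segment, factors=(0.5, 0.8, 1.2, 1.5)):
    """Expand the data amount by re-scaling the volume of a segment."""
    return [segment * f for f in factors]

# Placeholder waveforms at 16 kHz: a short spliced motion audio and a much
# longer spliced reverse (non-motion) audio.
motion_audio = rng.standard_normal(16000 * 60)
reverse_audio = rng.standard_normal(16000 * 600)

# Oversample the scarce motion audio ...
positive = random_segments(motion_audio, num_segments=50,
                           min_len=16000 * 5, max_len=16000 * 10)
# ... and undersample the abundant reverse audio with matching count and durations.
negative = [random_segments(reverse_audio, 1, len(p), len(p))[0] for p in positive]

# Optional volume augmentation of both classes.
positive_aug = [s for p in positive for s in scale_volume(p)]
negative_aug = [s for n in negative for s in scale_volume(n)]
```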
To solve technical problem 2, the technical solution provided in this embodiment is as follows: the forward sample data obtained by oversampling and the reverse sample data obtained by undersampling are used as model training samples, the motion sound false judgment model is obtained through iterative training of an improved convolutional neural network (see fig. 4 for the network structure of the improved convolutional neural network), and the sound segment to be recognized is then subjected to motion sound false judgment by this model, thereby meeting the accuracy and speed requirements of motion sound false judgment application scenarios.
In summary, as shown in fig. 1, a method for determining a motion sound based on a time-frequency diagram and a convolutional neural network according to an embodiment of the present invention includes:
step S1, splicing a plurality of motion sound segments carrying the same sound category label (such as the label "rope skipping sound" or "running sound") into a motion audio; there are many existing methods for splicing sound segments into a complete audio, which are not described in detail herein;
step S2, randomly intercepting a plurality of motion sound segments from the motion audio in an oversampling manner to serve as forward sample data for training the motion sound false judgment model, and randomly intercepting, from a reverse audio in an undersampling manner, reverse sound segments equal in number to the motion sound segments intercepted from the motion audio to serve as reverse sample data for model training. Preferably, each reverse sound segment corresponds in duration to one of the motion sound segments (for example, if there are 3 motion sound segments serving as forward sample data, with durations of 1 minute, 2 minutes and 3 minutes, then there are 3 reverse sound segments with durations of 1 minute, 2 minutes and 3 minutes, matching the motion sound segments in both number and duration);
step S3, inputting the forward sample data and the reverse sample data into an improved convolutional neural network, and forming a motion sound false judgment model through iterative updating training;
step S4, intercepting the sound segment to be recognized from audio collected in a real environment and inputting it into the motion sound false judgment model, the model outputting the motion false judgment result for the sound segment to be recognized.
The motion sound false judgment performed by the motion sound false judgment model comprises the following two steps: 1. converting the sound segment to be recognized from audio into a time-frequency diagram; 2. recognizing and outputting the motion false judgment result from the time-frequency diagram.
In this embodiment, as shown in fig. 3, the method for converting the audio frequency of the to-be-recognized sound segment into the time-frequency diagram includes:
step S411, intercepting a plurality of frame segments from the starting position of the sound segment to be recognized to the end in a sliding window mode, wherein two adjacent frame segments have sound overlapping parts; when frame segments are cut in a sliding window mode, the overlapping parts are reserved in two adjacent frame segments to form redundant information, and the identification error of a model can be reduced. As shown in fig. 5, the waveform in fig. 5 is a one-dimensional audio waveform of the sound segment to be recognized, one frame segment is cut through the first sliding window 100 from the start position of the waveform data (the length of the sliding window is, for example, 256 audio data points), then another frame segment is cut through the second sliding window 200 with the same length partially overlapping with the first sliding window 100 (for example, overlapping 128 audio data points), and this operation is repeated until the end of the sound segment to be recognized, and the frame segment cutting of the sound segment to be recognized is completed.
In order to increase the speed of the subsequent Fourier transform, the length of each frame segment is preferably an nth power of 2; if the length of a frame segment is less than the nth power of 2, it is padded with zeros at the end until its length reaches the nth power of 2. Preferably, n is 8, i.e., the frame length is preferably 256 audio data points; a value of n = 8 was found to give the most favorable speed for the subsequent Fourier transform.
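A minimal sketch of the sliding-window framing described above, assuming 256-point frames with a 128-point hop as in the example of fig. 5; the function name and the zero-padding of the final frame are illustrative.

```python
import numpy as np

def frame_signal(waveform, frame_len=256, hop=128):
    """Cut a 1-D waveform into overlapping frame segments.

    Adjacent frames share `frame_len - hop` samples (here 128), and the last
    frame is zero-padded at the end so every frame has a power-of-two length.
    """
    frames = []
    for start in range(0, len(waveform), hop):
        frame = waveform[start:start + frame_len]
        if len(frame) < frame_len:                      # pad the tail with zeros
            frame = np.pad(frame, (0, frame_len - len(frame)))
        frames.append(frame)
        if start + frame_len >= len(waveform):
            break
    return np.stack(frames)                             # shape: (num_frames, 256)
```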
Step S412, each frame segment is multiplied by a Hanning raised cosine window of the same length. Each frame segment is periodically extended during the subsequent Fourier transform, i.e., copies of the segment are conceptually joined end to end; because the frame segments are intercepted with partially overlapping sliding windows, discontinuities that do not exist in the original signal appear at the joins, so the spectrum after the Fourier transform would not match the actual situation. Windowing reduces these discontinuities so that the obtained spectrum matches the actual spectrum. Multiplying a frame segment by a Hanning raised cosine window of the same length means multiplying the value (volume value) of each audio data point in the frame segment by the window function value at the corresponding position, and using the product as the new value of that data point. The window function is expressed by the following equation (1):
hann(n) = 0.5 * (1 - cos(2πn / D))        (1)
In equation (1), hann(n) represents the window function value by which the nth audio data point in the frame segment is multiplied during windowing;
π represents the circumference ratio;
n represents the nth audio data point in the frame segment;
D represents the width of the Hanning raised cosine window.
Step S413, Fourier transform is performed on each frame segment to convert it from the spatial domain to the frequency domain, obtaining the time-frequency spectrum corresponding to each frame segment. In order to obtain the spectrum of the frame segment at its center time, a discrete Fourier transform is performed on each windowed frame segment; the discrete Fourier transform process is expressed by the following equations (2)-(3):
x[n] = Σ_{k=0}^{N-1} a_k · e^(j·k·w0·n)        (2)
a_k = (1/N) · Σ_{n=0}^{N-1} x[n] · e^(-j·k·w0·n)        (3)
In equations (2)-(3), x[n] represents the volume value of the nth audio data point in the frame segment, recovered by the inverse discrete Fourier transform of equation (2);
n represents the nth audio data point in the frame segment;
k denotes the kth harmonic;
a_k is the amplitude of the kth harmonic;
e is the base of the natural logarithm;
j is the imaginary unit;
w0 is the fundamental frequency;
N is the total number of audio data points in the frame segment.
For each frame segment, the magnitude of each frequency component can be calculated by equation (3).
In step S414, the time-frequency spectrums corresponding to the frame segments are arranged in time sequence to form a time-frequency graph corresponding to the voice segment to be identified (see fig. 6).
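Steps S412-S414 can be sketched as follows, with numpy's np.hanning and np.fft.rfft standing in for the window of equation (1) and the discrete Fourier transform of equations (2)-(3); for 256-point frames the rfft yields 129 frequency bins, consistent with the 124 × 129 diagram size used below.

```python
import numpy as np

def time_frequency_map(frames):
    """Convert framed audio of shape (num_frames, 256) into a time-frequency diagram.

    Each frame is multiplied by a Hanning (raised cosine) window of the same
    length, transformed to the frequency domain, and the magnitude spectra are
    stacked in time order to form the two-dimensional time-frequency diagram.
    """
    window = np.hanning(frames.shape[1])                     # raised cosine window per frame
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))   # one magnitude spectrum per frame
    return spectra                                           # shape: (num_frames, 129) for 256-point frames
```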
In order to increase the recognition speed of the motion sound false judgment model, in step S411, it is preferable to normalize the volume value at each moment of the sound segment to be recognized before intercepting the frame segments. The method for normalizing the volume values of the sound segment to be recognized is as follows:
The original one-dimensional audio waveform data of the sound segment to be recognized consists of a volume value at every moment; these volume values arranged in time order form the one-dimensional audio waveform data. Each data value, i.e., the volume measurement at a moment, is represented as a 16-bit binary number whose magnitude does not exceed 32767; dividing each data value by 32767 yields a normalized value between -1 and 1. Normalizing the data in this way greatly improves the training speed of the model.
In this embodiment, as shown in fig. 2 and 4, the method for identifying and outputting a motion hypothesis result according to a time-frequency diagram includes:
step S41, converting the sound segment to be recognized from audio into a time-frequency diagram of size 124 × 129 × 1 (124 is the number of rows of the two-dimensional diagram, 129 is the number of columns, and 1 is the number of channels);
In step S42, the 124 × 129 × 1 time-frequency diagram is input into the two-dimensional average pooling layer 1, which reduces its size to 31 × 32 × 1, increasing the recognition speed of the model and reducing storage usage. The average pooling process is shown in fig. 8: for example, with input A and output B, the average pooling operation takes the average of the 16 values A11, A12, ..., A44 in fig. 8 as the value of the output B.
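A minimal sketch of this average pooling, assuming a 4 × 4 pooling window (which maps 124 × 129 to 31 × 32) and simple truncation of the ragged last column, neither of which is stated explicitly in the embodiment:

```python
import numpy as np

def average_pool_2d(x, pool=4):
    """2-D average pooling: each pool x pool block is replaced by its mean.

    A 124 x 129 time-frequency diagram becomes 31 x 32, as in step S42
    (the ragged last column is handled here by truncation for simplicity).
    """
    rows = (x.shape[0] // pool) * pool
    cols = (x.shape[1] // pool) * pool
    x = x[:rows, :cols]
    return x.reshape(rows // pool, pool, cols // pool, pool).mean(axis=(1, 3))
```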
Step S43, convolution features are extracted from the size-reduced time-frequency diagram through the first convolutional layer 2 and the second convolutional layer 3, and a feature map of size 27 × 28 × 64 corresponding to the time-frequency diagram is output; referring to fig. 7, for example, with convolution input A, convolution kernel B and convolution output C, the specific operations are as follows:
A11*B11+A12*B12+A21*B21+A22*B22=C11;
A12*B11+A13*B12+A22*B21+A23*B22=C12;
A21*B11+A22*B12+A31*B21+A32*B22=C21;
A22*B11+A23*B12+A32*B21+A33*B22=C22;
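The per-element arithmetic above can be checked with a small numeric example (the input and kernel values are chosen arbitrarily for illustration):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)       # convolution input
B = np.array([[1, 0],
              [0, 1]], dtype=float)          # 2 x 2 convolution kernel

# Valid (no padding) sliding of the kernel over the input, exactly as in the
# four equations above: C[i, j] is the sum of elementwise products of B with
# the 2 x 2 patch of A whose top-left corner is (i, j).
C = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        C[i, j] = np.sum(A[i:i + 2, j:j + 2] * B)

print(C)   # [[ 6.  8.], [12. 14.]]
```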
Step S44, the feature map is input into the maximum pooling layer and flattened (i.e., the rows of the two-dimensional feature map are rearranged into a single column, the head of the second row following the tail of the first row, and so on for the remaining rows) to obtain a one-dimensional vector of length 11648 representing the time-frequency diagram;
step S45, inputting the one-dimensional vector into the first fully-connected layer 5 to perform dimension reduction processing into a vector with a length of 128;
Step S46, the dimension-reduced one-dimensional vector is input into the second fully connected layer 6 and, after softmax activation, the probability that the sound segment to be recognized is a motion sound or a non-motion sound is output, thereby realizing motion sound false judgment. The method for training the motion sound false judgment model is briefly as follows:
First the model parameters are initialized so that every parameter has a value. A time-frequency diagram from the training set is input and computed layer by layer according to the calculation steps defined by the model, finally outputting the probability that the time-frequency diagram belongs to the positive or negative class. This probability is compared with the label of the time-frequency diagram to obtain the error, which is then back-propagated through gradients to adjust the value of every parameter; the updated parameters reduce the error. Inputting all samples into the model and adjusting the parameters in this way completes one round of training and reduces the error between the class predicted by the model and the actual class. The above steps are repeated until the error no longer decreases and is not smaller than the error on the validation set, giving the best, non-overfitted model parameters; model training is then complete. Once training is finished, inputting a new sample into the model yields the prediction with the smallest error with respect to the actual class, i.e., the best discrimination accuracy, so that the class of a sample can be recognized accurately. No matter what other sounds are present in a sample, the trained model attends only to the characteristics of motion sound: samples with motion sound characteristics are judged as the positive class, and samples without them as the negative class.
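A minimal sketch of the network of fig. 4 and of the training procedure just described, assuming a TensorFlow/Keras implementation; the layer sizes follow the dimensions given in steps S41-S46, while the 3 × 3 kernels, the filter count of the first convolutional layer, the optimizer and the early-stopping settings are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Improved convolutional neural network sketched from the sizes in steps S41-S46.
model = models.Sequential([
    layers.Input(shape=(124, 129, 1)),            # time-frequency diagram
    layers.AveragePooling2D(pool_size=4),         # -> 31 x 32 x 1   (step S42)
    layers.Conv2D(64, 3, activation='relu'),      # -> 29 x 30 x 64  (first convolutional layer)
    layers.Conv2D(64, 3, activation='relu'),      # -> 27 x 28 x 64  (second convolutional layer, step S43)
    layers.MaxPooling2D(pool_size=2),             # -> 13 x 14 x 64
    layers.Flatten(),                             # -> 11648         (step S44)
    layers.Dense(128, activation='relu'),         # dimension reduction (step S45)
    layers.Dense(2, activation='softmax'),        # motion / non-motion probability (step S46)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# x_train/x_val: time-frequency diagrams; y_train/y_val: 1 = motion sound, 0 = reverse sound.
x_train = np.random.rand(32, 124, 129, 1); y_train = np.random.randint(0, 2, 32)
x_val = np.random.rand(8, 124, 129, 1);    y_val = np.random.randint(0, 2, 8)

# Iterative update training, stopping once the validation error stops improving,
# as described above.
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=3,
                                                      restore_best_weights=True)])
```

With these sizes the flattened vector before the first fully connected layer has length 13 × 14 × 64 = 11648, matching the value given in step S44.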
In conclusion, the invention solves the problem of unbalanced model training sample data in motion sound false judgment application scenarios, as well as the problems of low training efficiency, low recognition accuracy and low recognition speed of the existing LSTM long short-term memory neural network.
The invention also provides a motion sound false judgment device based on the time-frequency graph and the convolutional neural network, which can realize the motion sound false judgment method, and as shown in fig. 9, the motion sound false judgment device comprises:
the motion audio synthesizing module is used for splicing a plurality of motion sound segments with the same sound category label into motion audio;
the reverse audio acquisition module is used for acquiring the reverse audio corresponding to the motion audio;
the sample equalization module is respectively connected with the motion audio synthesis module and the reverse audio acquisition module, and is used for randomly intercepting a plurality of motion sound segments from the motion audio in an oversampling manner to serve as forward sample data for training the motion sound false judgment model, and for randomly intercepting, from the reverse audio in an undersampling manner, reverse sound segments equal in number to the motion sound segments to serve as reverse sample data for model training, wherein each reverse sound segment corresponds in duration to a motion sound segment;
the sample input module is connected with the sample equalization module and used for inputting the forward sample data and the reverse sample data into the improved convolutional neural network and forming a motion sound false judgment model through iterative updating training;
the model training module is connected with the sample input module and used for forming a motion sound false judgment model through iterative update training by an improved convolutional neural network according to input forward sample data and reverse sample data;
the sound collection module is used for collecting motion sound from a real environment and forming the motion sound into an audio file;
the sound intercepting module is connected with the sound collection module and is used for intercepting the sound segment to be recognized from the audio file;
the sound input module is connected with the sound intercepting module and is used for inputting the sound segment to be recognized into the sound recognition module;
and the sound recognition module is respectively connected with the model training module and the sound input module, and is used for recognizing and outputting, through the motion sound false judgment model, the probability that the sound segment to be recognized is a motion sound.
As shown in fig. 10, the sound recognition module comprises:
a time-frequency diagram generating module, configured to convert the audio of the to-be-identified sound segment into a time-frequency diagram, as shown in fig. 10, the time-frequency diagram generating module specifically includes:
the frame segment intercepting unit is used for intercepting a plurality of frame segments from the starting position of the sound segment to be identified to the end in a sliding window mode, and two adjacent frame segments have sound overlapping parts;
the windowing unit is connected with the frame segment intercepting unit and is used for multiplying each frame segment by a Hanning raised cosine window of the same length, so as to suppress the discontinuities, absent from the original signal, that arise when the frame segment is periodically extended and that would make the spectrum after the Fourier transform deviate from the actual one;
the spatial domain conversion unit is connected with the windowing unit and is used for carrying out Fourier transform on each windowed frame segment so as to convert each frame segment from a spatial domain to a frequency domain;
and the sequencing unit is connected with the spatial domain conversion unit and used for arranging the time-frequency spectrum corresponding to each frame segment which completes the spatial domain conversion according to the time sequence to form the time-frequency graph.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terms used in the specification and claims of the present application are not limiting, but are used merely for convenience of description.
Claims (10)
1. A motion sound false judgment method based on a time-frequency diagram and a convolutional neural network, characterized by comprising the following steps:
S1, splicing a plurality of motion sound segments carrying the same sound category label into a motion audio;
S2, randomly intercepting a plurality of motion sound segments from the motion audio in an oversampling manner to serve as forward sample data for training the motion sound false judgment model, and randomly intercepting, from a reverse audio in an undersampling manner, reverse sound segments equal in number to the motion sound segments intercepted from the motion audio to serve as reverse sample data for model training, wherein each reverse sound segment corresponds in duration to one of the motion sound segments serving as forward sample data;
S3, inputting the forward sample data and the reverse sample data into an improved convolutional neural network, and forming the motion sound false judgment model through iterative update training;
S4, intercepting the sound segment to be recognized from audio collected in a real environment and inputting it into the motion sound false judgment model, the model outputting the motion false judgment result for the sound segment to be recognized.
2. The method according to claim 1, wherein in step S1, the non-motion sound portions of each motion sound segment are cut out, and the motion sound segments are then spliced into the motion audio.
3. The method according to claim 1, wherein in step S2, the data amount of the forward sample data and the reverse sample data is expanded by changing the volume of the motion sound segments and the reverse sound segments.
4. The motion sound false judgment method based on a time-frequency diagram and a convolutional neural network according to claim 1, wherein in step S4, the method by which the motion sound false judgment model outputs the false judgment result specifically comprises the following steps:
S41, converting the sound segment to be recognized from audio into a time-frequency diagram;
S42, inputting the time-frequency diagram into a two-dimensional average pooling layer to reduce its size;
S43, extracting convolution features from the size-reduced time-frequency diagram through two convolutional layers, and outputting a feature map of the time-frequency diagram;
S44, inputting the feature map into a maximum pooling layer and flattening it to obtain a one-dimensional vector representing the time-frequency diagram;
S45, inputting the one-dimensional vector into a first fully connected layer for dimension reduction;
and S46, inputting the dimension-reduced one-dimensional vector into a second fully connected layer, and after softmax activation, outputting the probability that the sound segment to be recognized is a motion sound or a non-motion sound.
5. The motion sound false judgment method based on a time-frequency diagram and a convolutional neural network according to claim 4, wherein in step S41, the step of converting the sound segment to be recognized from audio into the time-frequency diagram comprises:
S411, intercepting a plurality of frame segments in a sliding-window manner from the start of the sound segment to be recognized to its end, wherein two adjacent frame segments have an overlapping sound portion;
S412, multiplying each frame segment by a Hanning raised cosine window of the same length;
S413, performing Fourier transform on each frame segment to convert it from the spatial domain to the frequency domain, thereby obtaining the time-frequency spectrum corresponding to each frame segment;
S414, arranging the time-frequency spectra corresponding to the frame segments in time order to form the time-frequency diagram corresponding to the sound segment to be recognized.
6. The method according to claim 5, wherein in step S411, the frame segments are intercepted after normalization processing is performed on the volume value at each moment of the sound segment to be recognized.
7. The method according to claim 5, wherein if the length of the frame segment is less than the nth power of 2, the frame segment is padded with zeros at the end until its length reaches the nth power of 2.
8. The method according to claim 7, wherein n is 8.
9. A motion sound false judgment device based on a time-frequency diagram and a convolutional neural network, capable of implementing the motion sound false judgment method according to any one of claims 1-8, characterized in that the motion sound false judgment device comprises:
the motion audio synthesis module is used for splicing a plurality of motion sound segments carrying the same sound category label into a motion audio;
the reverse audio acquisition module is used for acquiring the reverse audio corresponding to the motion audio;
the sample equalization module is respectively connected with the motion audio synthesis module and the reverse audio acquisition module, and is used for randomly intercepting a plurality of motion sound fragments from the motion audio in an oversampling manner to serve as forward sample data of motion sound false judgment model training, and randomly intercepting reverse sound fragments with the same number as the motion sound fragments from a reverse audio in an undersampling manner to serve as reverse sample data of model training, wherein each reverse sound fragment corresponds to the motion sound fragment with the same duration;
the sample input module is connected with the sample equalization module and is used for inputting the forward sample data and the reverse sample data into an improved convolutional neural network;
the model training module is connected with the sample input module and used for forming the motion sound false judgment model through iterative update training by an improved convolutional neural network according to the input forward sample data and the input reverse sample data;
the sound collection module is used for collecting motion sound from a real environment and forming the motion sound into an audio file;
the sound intercepting module is connected with the sound collecting module and is used for intercepting a sound fragment to be identified from the audio file;
the sound input module is connected with the sound intercepting module and is used for inputting the sound segment to be recognized into the sound recognition module;
the sound recognition module is respectively connected with the model training module and the sound input module, and is used for recognizing and outputting, through the motion sound false judgment model, the probability that the sound segment to be recognized is a motion sound.
10. The apparatus according to claim 9, wherein the sound recognition module comprises:
a time-frequency diagram generating module, configured to convert the to-be-identified sound segment from audio to a time-frequency diagram, where the time-frequency diagram generating module specifically includes:
the frame segment intercepting unit is used for intercepting a plurality of frame segments from the starting position of the sound segment to be identified to the end in a sliding window mode, and two adjacent frame segments have sound overlapping parts;
the windowing unit is connected with the frame segment intercepting unit and used for multiplying each frame segment by a Hanning raised cosine window with the same length;
a spatial domain converting unit, connected to the windowing unit, for performing fourier transform on each of the windowed frame segments to convert each of the frame segments from a spatial domain to a frequency domain;
and the sequencing unit is connected with the spatial domain conversion unit and used for arranging the time-frequency spectrum corresponding to each frame segment which completes spatial domain conversion according to the time sequence to form the time-frequency graph.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111136079.7A CN113870896A (en) | 2021-09-27 | 2021-09-27 | Motion sound false judgment method and device based on time-frequency graph and convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113870896A true CN113870896A (en) | 2021-12-31 |
Family
ID=78991242
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111136079.7A Pending CN113870896A (en) | 2021-09-27 | 2021-09-27 | Motion sound false judgment method and device based on time-frequency graph and convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113870896A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114648987A (en) * | 2022-04-28 | 2022-06-21 | 歌尔股份有限公司 | Speech recognition method, device, equipment and computer readable storage medium |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107680586A (en) * | 2017-08-01 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | Far field Speech acoustics model training method and system |
CN109493874A (en) * | 2018-11-23 | 2019-03-19 | 东北农业大学 | A kind of live pig cough sound recognition methods based on convolutional neural networks |
CN109545242A (en) * | 2018-12-07 | 2019-03-29 | 广州势必可赢网络科技有限公司 | A kind of audio data processing method, system, device and readable storage medium storing program for executing |
CN109785850A (en) * | 2019-01-18 | 2019-05-21 | 腾讯音乐娱乐科技(深圳)有限公司 | A kind of noise detecting method, device and storage medium |
CN110827837A (en) * | 2019-10-18 | 2020-02-21 | 中山大学 | Whale activity audio classification method based on deep learning |
CN111179971A (en) * | 2019-12-03 | 2020-05-19 | 杭州网易云音乐科技有限公司 | Nondestructive audio detection method and device, electronic equipment and storage medium |
CN111507380A (en) * | 2020-03-30 | 2020-08-07 | 中国平安财产保险股份有限公司 | Image classification method, system and device based on clustering and storage medium |
US20200302949A1 (en) * | 2019-03-18 | 2020-09-24 | Electronics And Telecommunications Research Institute | Method and apparatus for recognition of sound events based on convolutional neural network |
CN112071322A (en) * | 2020-10-30 | 2020-12-11 | 北京快鱼电子股份公司 | End-to-end voiceprint recognition method, device, storage medium and equipment |
CN112382310A (en) * | 2020-11-12 | 2021-02-19 | 北京猿力未来科技有限公司 | Human voice audio recording method and device |
CN112399247A (en) * | 2020-11-18 | 2021-02-23 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio processing method, audio processing device and readable storage medium |
CN112634882A (en) * | 2021-03-11 | 2021-04-09 | 南京硅基智能科技有限公司 | End-to-end real-time voice endpoint detection neural network model and training method |
CN112700793A (en) * | 2020-12-24 | 2021-04-23 | 国网福建省电力有限公司 | Method and system for identifying fault collision of water turbine |
CN112863530A (en) * | 2021-01-07 | 2021-05-28 | 广州欢城文化传媒有限公司 | Method and device for generating sound works |
CN112883931A (en) * | 2021-03-29 | 2021-06-01 | 动者科技(杭州)有限责任公司 | Real-time true and false motion judgment method based on long and short term memory network |
CN112883930A (en) * | 2021-03-29 | 2021-06-01 | 动者科技(杭州)有限责任公司 | Real-time true and false motion judgment method based on full-connection network |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20211231 |