CN112863548A - Method for training audio detection model, audio detection method and device thereof
- Publication number: CN112863548A (application number CN202110090449.1A)
- Authority: CN (China)
- Prior art keywords: audio, detected, data set, sample data, file
- Prior art date: 2021-01-22
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The present disclosure provides a model training method, a model training apparatus, an electronic device, and a computer-readable storage medium, and relates to the field of artificial intelligence, in particular to the field of deep learning and the field of artificial intelligence chips. The specific implementation is as follows: acquiring a plurality of audio clips from an audio file; determining a first sample data set for training an audio detection model based on the audio clips, among the plurality of audio clips, that contain noise; determining a second sample data set for training the audio detection model based on the audio clips that do not contain noise, wherein the second sample data set is different from the first sample data set; and training the audio detection model based on the first sample data set and the second sample data set. In this way, the technical scheme of the present disclosure can complete the training of the audio detection model quickly, efficiently, and at low cost, and thereby determine the detection result of an audio file to be detected.
Description
Technical Field
The present disclosure relates to the field of computer technology, in particular to the field of deep learning, and more specifically to a method of training an audio detection model, an audio detection method, and a corresponding apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
As living standards and technology improve, the ways in which people acquire information, entertain themselves, and relax are gradually changing, and short video has rapidly come to occupy the fragmented time in people's lives thanks to its rich content, high information density, and strong appeal. However, users' shooting and production skills vary widely, so the quality of uploaded video works is uneven. For example, some videos may introduce noise during shooting or post-production. Noise in a video seriously degrades the viewing experience and can even cause physical discomfort, which lowers the reputation of the corresponding video product and, over time, drives users away. Video products therefore urgently need a solution to this problem.
Disclosure of Invention
The present disclosure provides a method of training an audio detection model, an audio detection method, and apparatuses, an electronic device, a computer-readable storage medium, and a computer program product thereof.
According to a first aspect of the present disclosure, a method of training an audio detection model is provided. The method may include obtaining a plurality of audio clips from an audio file. Further, a first sample data set for training the audio detection model may be determined based on an audio clip of the plurality of audio clips that contains noise. The method may further comprise determining a second sample data set for training the audio detection model based on audio clips of the plurality of audio clips that do not contain noise, wherein the second sample data set is different from the first sample data set. Furthermore, the method may comprise training the audio detection model based on the first sample data set and the second sample data set.
According to a second aspect of the present disclosure, an audio detection method is provided, which may include acquiring an audio file to be detected. Furthermore, a plurality of audio clips to be detected may be acquired from the audio file to be detected. The method may further comprise detecting the plurality of audio clips to be detected respectively using the audio detection model trained according to the method of the first aspect of the present disclosure. In addition, the method may further comprise determining the detection result of the audio file to be detected based on the respective detection results of the plurality of audio clips to be detected.
In a third aspect of the present disclosure, there is provided an apparatus for training an audio detection model, comprising: an audio clip acquisition module configured to acquire a plurality of audio clips from an audio file; a first sample data set determination module configured to determine a first sample data set for training the audio detection model based on an audio clip of the plurality of audio clips that contains noise; a second sample data set determination module configured to determine a second sample data set for training the audio detection model based on an audio clip of the plurality of audio clips that does not contain noise, wherein the second sample data set is different from the first sample data set; and an audio detection model training module configured to train the audio detection model based on the first sample data set and the second sample data set.
In a fourth aspect of the present disclosure, there is provided an audio detection apparatus comprising: an audio file acquisition module configured to acquire an audio file to be detected; an audio clip to be detected acquisition module configured to acquire a plurality of audio clips to be detected from the audio file to be detected; a detection module configured to detect the plurality of audio clips to be detected respectively using the audio detection model trained by the apparatus according to the third aspect of the present disclosure; and a detection result determination module configured to determine the detection result of the audio file to be detected based on the respective detection results of the plurality of audio clips to be detected.
In a fifth aspect of the present disclosure, there is provided an electronic device comprising one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method according to the first or second aspect of the present disclosure.
In a sixth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the method according to the first or second aspect of the present disclosure.
In a seventh aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method according to the first or second aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 shows a schematic diagram of a detailed example environment, according to an embodiment of the present disclosure;
FIG. 3 shows a flow diagram of a process of training an audio detection model according to an embodiment of the present disclosure;
FIG. 4 shows a flowchart of a detailed process of training an audio detection model according to an embodiment of the present disclosure;
FIG. 5 shows a flow diagram of a process of audio detection according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of an apparatus for training an audio detection model according to an embodiment of the present disclosure;
FIG. 7 shows a block diagram of an audio detection apparatus according to an embodiment of the present disclosure; and
FIG. 8 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
In describing embodiments of the present disclosure, the term "include" and its variants should be interpreted as open-ended, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
It should be understood that noise in video is typically caused by: a noisy background in the shooting environment; electrical noise introduced by low-quality shooting equipment; and codec distortion introduced during post-production. Noise from any of these sources degrades the final user experience. At present there is no good automated method for identifying noise in video; instead, human reviewers usually audit videos manually and reject those containing noise, which is inefficient, costly, and carries risks of false detection and missed detection. Besides manual detection, speech noise can be detected based on frequency-domain energy functions and pitch parameters. However, that method can only detect speech-like noise, and when such noise appears in only part of the audio timeline, its detection performance for the whole video is poor.
The present disclosure recognizes that there is a need for a model training method that can train a detection model, particularly an audio detection model, quickly, efficiently, and at low cost, so that the model can then be used to determine whether a video to be detected contains noise.
According to an embodiment of the present disclosure, a model training scheme is presented. In this scheme, audio segments that contain noise may be labeled as first samples (e.g., positive samples) and segments that do not contain noise may be labeled as second samples (e.g., negative samples), so that an audio detection model can be trained based on the first and second samples. Specifically, the training process of the audio detection model of the present disclosure may include: acquiring a plurality of audio clips from an audio file; determining a first sample data set for training an audio detection model based on the audio clips, among the plurality of audio clips, that contain noise; determining a second sample data set for training the audio detection model based on the audio clips that do not contain noise; and training the audio detection model based on the first sample data set and the second sample data set. In addition, embodiments of the disclosure also include detecting an audio file associated with a video file using the detection model trained by this method. In this way, efficient and accurate model training and video detection are achieved.
Further, in order to augment the first sample data set, noise may be superimposed on audio segments that do not contain noise, thereby generating audio segments that contain noise. In this way, a sufficient amount of training data can be obtained at low cost, and because the training data set is enlarged, the over-fitting problem of the deep neural network is alleviated.
In addition, in order to optimize the audio detection model, after one round of model training is finished, the model can be used to detect audio segments randomly collected from noise-free audio files; segments that are detected as containing noise can then be added to the second sample data set for retraining. In this way, false detections by the model can be markedly reduced.
Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings. Fig. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented. As shown in FIG. 1, an example environment 100 includes a video file 110 to be detected, a computing device 120, and a detection result 130 determined via the computing device 120.
In some embodiments, the video file 110 to be detected may be at least one of a large number of short videos on a network platform. In the present disclosure, short video refers to short-form video. As a form of internet content dissemination, short videos are typically videos distributed on new internet media with a duration of under 5 minutes. It should be understood that the video file 110 may also be video content other than short video.
In some embodiments, computing device 120 may include, but is not limited to, a personal computer, a server computer, a handheld or laptop device, a mobile device (such as a mobile phone, a personal digital assistant (PDA), a media player, etc.), a consumer electronics product, a minicomputer, a mainframe computer, a cloud computing resource, and the like. After computing device 120 receives the video file 110 to be detected, an audio file 122 may be parsed from the video file 110. In turn, the computing device 120 may intercept a plurality of audio clips 124 from the audio file 122 using a time window of fixed duration. The feature data of these audio clips 124 are input to a detection model 140 configured in the computing device 120, so that the probability of noise being contained in each audio clip can be predicted by the detection model 140. Based on these probabilities, it can be determined whether the video file 110 contains noise or meets standard sound quality, i.e., the detection result 130. It is understood that "noise" as described in this disclosure includes, but is not limited to: background noise in the shooting environment, electrical noise introduced by low-quality shooting equipment, and codec distortion introduced by post-production.
Further, it should also be understood that while the present disclosure shows computing device 120 as "one" processing unit, the process of parsing the audio file 122 from the video file 110, the process of intercepting the plurality of audio clips 124 from the audio file 122, and the process of predicting on each audio clip with the detection model 140, as described above, may be performed in a plurality of different processing units, which may collectively be referred to as computing device 120. As an example, parsing the audio file 122 from the video file 110 and intercepting the plurality of audio clips 124 from the audio file 122 may be performed in a field computer acting as an edge computing node, while prediction on each audio clip by the detection model 140 may be performed on a more computationally powerful cloud server.
At least one key point of the present disclosure is that an improved approach is used to train the audio detection model. The training and use of the detection model 140 in the computing device 120 will be described below with reference to FIG. 2, taking a machine learning model as an example.
Fig. 2 shows a schematic diagram of a detailed example environment 200, according to an embodiment of the present disclosure. Similar to fig. 1, the example environment 200 may include a computing device 220, a video file 210 to be detected, and a detection result 230. The difference is that the example environment 200 may generally include a model training system 270 and a model application system 280. By way of example, model training system 270 and/or model application system 280 may be implemented in computing device 120 as shown in FIG. 1 or computing device 220 as shown in FIG. 2. It should be understood that the description of the structure and functionality of the example environment 200 is for exemplary purposes only and is not intended to limit the scope of the subject matter described herein. The subject matter described herein may be implemented in various structures and/or functions.
As described above, the process of detecting a plurality of audio segments in an audio file parsed from the video file 210, so as to determine the detection result 230 of whether the video file 210 contains noise, may be divided into two stages: a model training phase and a model application phase. As an example, in the model training phase, the model training system 270 may train the model 240 for detecting audio clips using the first sample data set 250 and the second sample data set 260. It should be understood that the first sample data set 250 is a set of labeled audio segments that contain noise, and the second sample data set 260 is a set of labeled audio segments that do not contain noise. In the model application phase, the model application system 280 may receive the trained model 240, so that the model 240 determines the detection result 230 of whether the video file 210 contains noise based on an audio file associated with the video file 210.
In other embodiments, the model 240 may be constructed as a learning network. In some embodiments, the learning network may include a plurality of networks, where each network may be a multi-layer neural network, which may be composed of a large number of neurons. Through the training process, respective parameters of the neurons in each network can be determined. The parameters of the neurons in these networks are collectively referred to as the parameters of the model 240.
The training process of the model 240 may be performed in an iterative manner. In particular, the model training system 270 may obtain sample data from the first sample data set 250 and the second sample data set 260 and utilize the sample data to perform one iteration of the training process to update the corresponding parameters of the model 240. The model training system 270 may perform the above process based on a plurality of sample data in the first sample data set 250 and the second sample data set 260 until at least some of the parameters of the model 240 converge or until a predetermined number of iterations is reached, thereby obtaining final model parameters.
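As an illustrative sketch of this iterative procedure, the following Python code shows one possible mini-batch training loop. The model architecture, the Adam optimizer, the binary cross-entropy loss, the epoch count, and the loaders (assumed to yield batches of feature tensors from the first and second sample data sets) are assumptions for illustration and are not fixed by the disclosure; in practice the loop would also include the parameter-convergence check mentioned above.

```python
import torch
from torch import nn

def train_detector(model: nn.Module, pos_loader, neg_loader,
                   max_epochs: int = 50, lr: float = 1e-3) -> nn.Module:
    """Iteratively update the model's parameters on batches drawn from the
    first (noise-containing, label 1) and second (clean, label 0) sample sets."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(max_epochs):
        for pos_x, neg_x in zip(pos_loader, neg_loader):
            x = torch.cat([pos_x, neg_x])  # feature maps, e.g. shape (B, 1, 64, 300)
            y = torch.cat([torch.ones(len(pos_x)), torch.zeros(len(neg_x))])
            optimizer.zero_grad()
            loss = loss_fn(model(x).squeeze(-1), y)  # per-clip noise logits vs. labels
            loss.backward()
            optimizer.step()
    return model
```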
The technical solutions described above are used only for illustration and do not limit the invention. It should be understood that the various networks may also be arranged in other manners and with other connection relationships. To explain the principles of the above scheme more clearly, the process of training the model 240 will be described in more detail below with reference to FIG. 3.
Fig. 3 shows a flow diagram of a process 300 of training an audio detection model according to an embodiment of the present disclosure. In certain embodiments, process 300 may be implemented in computing device 120 of fig. 1 as well as computing device 220 of fig. 2. A process 300 of model training according to an embodiment of the present disclosure is now described with reference to FIG. 3 in conjunction with FIG. 2. For ease of understanding, the specific examples set forth in the following description are intended to be illustrative, and are not intended to limit the scope of the disclosure.
At 302, the computing device 220 may acquire a plurality of audio clips from an audio file. It should be appreciated that the audio file is obtained by parsing the video file 210, which reduces the data volume of the training samples and speeds up training. Further, the plurality of audio clips may be acquired by intercepting them from the audio file at predetermined time intervals using a time window of fixed duration. By way of example, the computing device 220 may slide a time window with a duration of, e.g., 3 seconds in steps of, e.g., 0.5 seconds to intercept audio clips from the audio file.
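A minimal sketch of this windowing step is given below; the 3-second window and 0.5-second step mirror the example above, while the 16 kHz sample rate is an assumption.

```python
import numpy as np

def slice_clips(waveform: np.ndarray, sr: int = 16000,
                window_s: float = 3.0, hop_s: float = 0.5):
    """Slide a fixed-duration window over the waveform and collect clips.

    With window_s=3.0 and hop_s=0.5, consecutive clips overlap by 2.5 s,
    so no part of the file is skipped between clips."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    return [waveform[i:i + win]
            for i in range(0, max(len(waveform) - win, 0) + 1, hop)]
```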
In some embodiments, the plurality of audio clips may have a predetermined length of time, and one of the plurality of audio clips may have an overlapping portion with another of the plurality of audio clips. In this way, sample omissions may be avoided, so that more data for the first and second sample data sets 250 and 260 may be obtained from a limited amount of annotated video.
At 304, the computing device 220 may determine a first sample data set 250 for training the model 240 for audio detection based on the audio segments, among the plurality of audio segments, that contain noise. The first sample data set 250 may be, for example, a positive sample data set.
At 306, the computing device 220 may accordingly determine a second sample data set 260 for training the model 240 for audio detection based on the audio segments that do not contain noise, wherein the second sample data set 260 is different from the first sample data set 250. The second sample data set 260 may be, for example, a negative sample data set. Training the model 240 on both positive and negative samples can significantly improve model performance.
In some embodiments, to augment the first sample data set 250, the computing device 220 may determine an additional sample data set by superimposing a predetermined noise audio clip on at least a portion of the audio clips that do not contain noise (i.e., at least a portion of the second sample data set 260), and add the additional sample data set to the first sample data set 250. In this way, the present disclosure can greatly expand the limited positive sample data based on the negative sample data.
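One way to realize this superposition is sketched below. Mixing at a chosen signal-to-noise ratio is an assumption for illustration, since the disclosure only specifies that a predetermined noise audio clip is superimposed on clips that do not contain noise.

```python
import numpy as np

def superimpose_noise(clean: np.ndarray, noise: np.ndarray,
                      snr_db: float = 10.0) -> np.ndarray:
    """Overlay a predetermined noise clip on a clean clip to synthesize
    a positive (noise-containing) training sample."""
    noise = np.resize(noise, clean.shape)          # tile/trim noise to the clip length
    clean_p = np.mean(clean ** 2) + 1e-12          # signal power
    noise_p = np.mean(noise ** 2) + 1e-12          # noise power
    scale = np.sqrt(clean_p / (noise_p * 10 ** (snr_db / 10)))
    return clean + scale * noise
```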
At 308, the computing device 220 may train the model 240 for audio detection based on the first sample data set and the second sample data set. It should be appreciated that prior to training the model 240, in order to adapt to the input requirements of the convolutional neural network and make the data associated with the audio segments more consistent with the response characteristics of the human ear to sounds of different frequencies, the computing device 220 typically performs pre-emphasis, framing, short-time fourier transform, mel-filtering, and logarithm operations on the audio segments to obtain two-dimensional feature data. For example, for an audio segment with a time window of 3 seconds, performing the above processing on each 0.01 second audio sub-segment may result in a one-dimensional feature vector (which may contain, for example, 64 feature values). It can be seen that the audio piece can be processed into 300 one-dimensional feature vectors, i.e. a two-dimensional feature with a size of 64 × 300. The two-dimensional features with labels may be input to the model 240 for training.
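The feature pipeline described above (pre-emphasis, framing, short-time Fourier transform, mel filtering, logarithm) can be sketched with librosa as follows. The FFT size and 16 kHz sample rate are assumptions, with the hop length chosen so that each 0.01-second sub-segment yields one 64-dimensional frame, giving a 64 × 300 feature map for a 3-second clip.

```python
import numpy as np
import librosa

def log_mel_features(clip: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Turn a 3-second clip into a 64 x 300 log-mel feature map."""
    emphasized = np.append(clip[0], clip[1:] - 0.97 * clip[:-1])  # pre-emphasis
    mel = librosa.feature.melspectrogram(
        y=emphasized, sr=sr, n_fft=400,
        hop_length=int(0.01 * sr),   # one frame per 0.01 s -> 300 frames for 3 s
        n_mels=64)
    return np.log(mel + 1e-6)[:, :300]                            # shape (64, 300)
```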
In addition, the present disclosure introduces an improved way of model training for some samples that are more difficult to correctly resolve by the model 240. Fig. 4 shows a flowchart of a detailed process 400 of training an audio detection model according to an embodiment of the present disclosure.
At 402, the computing device 220 may detect audio clips in another audio file, different from the audio file described above, using the trained model 240 for audio detection. It is to be understood that this other audio file is a predetermined audio file that does not contain noise. Generally, to facilitate the labeling work before model training, the training audio file is selected from predetermined audio files containing noise, so that both the noise-containing segments and the noise-free segments in that file can be fully utilized. Thus, given that the other audio file has been manually confirmed to contain no noise, if an audio clip in it is detected by the model 240 as containing noise, this indicates that the model 240 cannot yet accurately distinguish that clip. Therefore, at 404, the computing device 220 may add the audio clip from the other audio file to the negative sample data set, so that the model 240 for audio detection can be further trained.
As an example, when a short-video service provider performs noise detection on a short video supplied by a content provider using the model 240, the video may be taken down from the video website because the model 240 determines that it contains noise. The content provider may then request a manual review, and when the manual review confirms that the short video does not contain noise, the sample that the model 240 failed to distinguish correctly can be collected through the above process for further model training, thereby optimizing the model 240. In this way, samples that the current model cannot accurately distinguish can be fully collected and added to subsequent training, which can markedly reduce the model's false detection rate.
In some embodiments, to detect the audio clips in the other audio file, the computing device 220 may predict the probability that each clip contains noise. If the predicted probability that a clip contains noise is greater than a threshold probability (e.g., the predicted score exceeds 0.5), it may be determined that the clip contains noise. That clip can then be added to the training data set as a sample prone to false detection. In this way, samples prone to false positives can be quickly identified using the model.
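A hedged sketch of this hard-sample collection step follows. It assumes the trained model is a callable that returns a noise probability for a feature map, and it reuses the illustrative log_mel_features helper from the sketch above.

```python
def mine_hard_samples(model, clean_clips, threshold: float = 0.5):
    """Collect clips from noise-free files that the trained model wrongly
    flags as noisy; these are the samples it cannot yet distinguish and
    are added to the negative sample set for further training."""
    hard_negatives = []
    for clip in clean_clips:
        features = log_mel_features(clip)   # feature pipeline sketched above
        prob = float(model(features))       # assumed: model returns a noise probability
        if prob > threshold:                # e.g. predicted score exceeds 0.5
            hard_negatives.append(clip)
    return hard_negatives
```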
Through the above embodiments, a well-performing model can be trained more efficiently, saving labor and time.
It should be appreciated that after the training process of the model 240 is completed, the model 240 may be used to examine an audio file, or a video file containing an audio file, to determine whether the file contains noise that would degrade the user experience. Fig. 5 shows a flow diagram of a process 500 of audio detection according to an embodiment of the present disclosure. In certain embodiments, process 500 may be implemented in computing device 120 of fig. 1 or computing device 220 of fig. 2. A process 500 of audio detection according to an embodiment of the disclosure is now described with reference to fig. 5 in conjunction with fig. 1. For ease of understanding, the specific examples set forth in the following description are intended to be illustrative and are not intended to limit the scope of the disclosure.
As shown in fig. 5, at 502, the computing device 120 may obtain the audio file 122 to be detected. It should be appreciated that for noise detection on short video, as shown in fig. 1, the computing device 120 may first obtain the video file 110 to be detected and then parse the audio file 122 to be detected from the video file 110. In this way, only the audio portion of the short video needs to be examined, which reduces the data volume of the detected object and speeds up detection.
At 504, the computing device 120 may obtain a plurality of audio clips 124 to be detected from the audio file 122 to be detected. As an example, to meet the input requirements of a convolutional neural network and make the data associated with the audio clips better match the response characteristics of the human ear to sounds of different frequencies, the computing device 120 may perform pre-emphasis, framing, short-time Fourier transform, mel filtering, and logarithm operations on each audio clip 124 to obtain two-dimensional feature data. For example, for an audio clip with a 3-second time window, performing the above processing on each 0.01-second audio sub-segment may yield a one-dimensional feature vector (containing, for example, 64 feature values). Each audio clip 124 can thus be processed into 300 one-dimensional feature vectors, i.e., a two-dimensional feature of size 64 × 300. These two-dimensional features are then input into the detection model 140 in the subsequent detection step, yielding the probability that each audio clip contains noise. Further, in order to detect every part of the audio file 122 to be detected without omission, the plurality of audio clips 124 to be detected may each be intercepted so as to have overlapping portions.
At 506, the computing device 120 may separately detect the plurality of audio segments 124 to be detected using the detection model 140 trained by the process described above. Thereafter, at 508, the computing device 120 may determine the detection results 130 of the audio files 122 to be detected or the video files 110 containing the audio files 122 based on the respective detection results of the plurality of audio segments 124 to be detected.
As an example, the computing device 120 may separately predict the probability that each of the plurality of audio clips 124 to be detected contains noise. It will be appreciated that the predicted probability typically differs from clip to clip. Based on the human ear's experience of sound quality, if only a few individual audio clips in an audio file contain noise, the audio file may still be determined to contain no noise. Thus, to determine the detection result 130, the computing device 120 may compute the average of the predicted probabilities and determine the audio file 122 to be detected as a noise-containing audio file only if the average is greater than a threshold probability (e.g., the average score is greater than 0.3). It should be appreciated that, in addition to aggregating the probabilities by simple averaging, a weighted average may be used to determine the value most representative of the audio file 122 from the plurality of predicted probabilities.
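The file-level decision described above can be sketched as follows; the 0.3 threshold follows the example in the text, and the optional weights parameter illustrates the weighted-average variant.

```python
import numpy as np

def file_contains_noise(clip_probs, threshold: float = 0.3, weights=None) -> bool:
    """Aggregate per-clip noise probabilities into a file-level verdict.

    A plain (or weighted) mean reflects the observation that a few noisy
    clips should not condemn the whole file; the file is flagged only
    when the average probability exceeds the threshold."""
    mean_prob = np.average(np.asarray(clip_probs, dtype=float), weights=weights)
    return bool(mean_prob > threshold)
```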
Through the above embodiments, the present disclosure can effectively detect noisy videos among massive numbers of videos, with a high recall rate and good robustness; it can replace manual review, saving human resources and avoiding missed detections and false detections.
Fig. 6 shows a block diagram of an apparatus 600 for training an audio detection model according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 may include: an audio clip acquisition module 602 configured to acquire a plurality of audio clips from an audio file; a first sample data set determination module 604 configured to determine a first sample data set for training the audio detection model based on the audio clips, among the plurality of audio clips, that contain noise; a second sample data set determination module 606 configured to determine a second sample data set for training the audio detection model based on the audio clips that do not contain noise, wherein the second sample data set is different from the first sample data set; and an audio detection model training module 608 configured to train the audio detection model based on the first sample data set and the second sample data set.
In some embodiments, the apparatus 600 may further comprise: an additional sample data set determination module configured to determine an additional sample data set by superimposing a predetermined noise audio clip on at least a part of the audio clips that do not contain noise; and a first expansion module configured to add the additional sample data set to the first sample data set.
In some embodiments, the apparatus 600 may further comprise: a post-training detection module configured to detect audio clips in another audio file, different from the audio file, using the trained audio detection model, the other audio file being a predetermined audio file that does not contain noise; and a second expansion module configured to, in response to an audio clip in the other audio file being detected as containing noise, add that audio clip to the second sample data set for further training of the audio detection model.
In certain embodiments, the post-training detection module comprises: a probability prediction module configured to predict the probability that noise is included in an audio clip in the other audio file; and a prediction result decision module configured to determine that noise is included in the audio clip in response to the probability being greater than a threshold probability.
In some embodiments, the plurality of audio segments have a predetermined length of time and one of the plurality of audio segments has an overlapping portion with another audio segment.
In certain embodiments, the first sample data set is a positive sample data set and the second sample data set is a negative sample data set.
In some embodiments, the audio file is obtained from a video file.
Fig. 7 shows a block diagram of an audio detection apparatus 700 according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus 700 may include: an audio file acquisition module 702 configured to acquire an audio file to be detected; an audio clip to be detected acquisition module 704 configured to acquire a plurality of audio clips to be detected from the audio file to be detected; a detection module 706 configured to detect the plurality of audio clips to be detected respectively using the audio detection model trained by the apparatus 600; and a detection result determination module 708 configured to determine a detection result of the audio file to be detected based on the respective detection results of the plurality of audio clips to be detected.
In certain embodiments, the detection module 706 comprises: a probability prediction module configured to respectively predict the probabilities that the plurality of audio clips to be detected contain noise.
In certain embodiments, the detection result determination module 708 comprises: a mean determination module configured to determine the average of the predicted probabilities; and a determination module configured to determine the audio file to be detected as a noise-containing audio file in response to the average being greater than a threshold probability.
In some embodiments, the plurality of audio segments to be detected have a predetermined time length, and one of the plurality of audio segments to be detected has an overlapping portion with another audio segment to be detected.
In some embodiments, the apparatus 700 may further comprise: the video file acquisition module is configured to acquire a video file to be detected, wherein the audio file acquisition module acquires the audio file to be detected from the video file to be detected.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 8 illustrates a block diagram of a computing device 800 capable of implementing multiple embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to one another by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (27)
1. A method of training an audio detection model, comprising:
acquiring a plurality of audio clips from an audio file;
determining a first sample data set for training the audio detection model based on an audio segment of the plurality of audio segments that contains noise;
determining a second sample data set for training the audio detection model based on audio segments of the plurality of audio segments that do not contain noise, wherein the second sample data set is different from the first sample data set; and
training the audio detection model based on the first sample data set and the second sample data set.
2. The method of claim 1, further comprising:
determining an additional sample data set by superimposing a predetermined noise audio clip on at least a part of the plurality of audio clips that do not contain noise; and
adding the additional sample data set to the first sample data set.
3. The method of claim 1, further comprising:
detecting an audio clip in another audio file different from the audio file using the trained audio detection model, the other audio file being a predetermined audio file that does not contain noise; and
in response to the audio clip in the other audio file being detected as containing noise, adding the audio clip in the other audio file to the second sample data set for further training of the audio detection model.
4. The method of claim 3, wherein detecting an audio segment in the other audio file using the trained audio detection model comprises:
predicting a probability that noise is included in the audio segment in the other audio file; and
in response to the probability being greater than a threshold probability, determining that noise is included in the audio segment in the other audio file.
5. The method of claim 1, wherein the plurality of audio segments have a predetermined length of time and one of the plurality of audio segments has an overlapping portion with another audio segment.
6. The method of any of claims 1 to 5, wherein the first sample dataset is a positive sample dataset and the second sample dataset is a negative sample dataset.
7. The method of any of claims 1-5, wherein the audio file is obtained from a video file.
8. An audio detection method, comprising:
acquiring an audio file to be detected;
acquiring a plurality of audio clips to be detected from the audio files to be detected;
detecting the plurality of audio segments to be detected respectively using the audio detection model trained according to the method of any one of claims 1-7; and
and determining the detection result of the audio file to be detected based on the corresponding detection results of the plurality of audio clips to be detected.
9. The method of claim 8, wherein detecting the plurality of audio segments to be detected using the audio detection model comprises:
respectively predicting the probabilities that the plurality of audio segments to be detected contain noise.
10. The method of claim 9, wherein determining the detection result of the audio file to be detected comprises:
determining an average of the probabilities that are predicted;
and determining the audio file to be detected as the audio file containing the noise in response to the average value being greater than the threshold probability.
11. The method according to claim 8, wherein the plurality of audio segments to be detected have a predetermined length of time, and one of the plurality of audio segments to be detected has an overlapping portion with another audio segment to be detected.
12. The method of claim 8, further comprising:
acquiring a video file to be detected; and
and acquiring the audio file to be detected from the video file to be detected.
13. An apparatus to train an audio detection model, comprising:
the audio clip acquisition module is configured to acquire a plurality of audio clips from an audio file;
a first sample data set determination module configured to determine a first sample data set for training the audio detection model based on an audio segment of the plurality of audio segments that contains noise;
a second sample data set determination module configured to determine a second sample data set for training the audio detection model based on an audio segment of the plurality of audio segments that does not contain noise, wherein the second sample data set is different from the first sample data set; and
an audio detection model training module configured to train the audio detection model based on the first sample data set and the second sample data set.
14. The apparatus of claim 13, further comprising:
an additional sample data set determination module configured to determine an additional sample data set by superimposing a predetermined noise audio clip on at least a part of the plurality of audio clips that do not contain noise; and
a first expansion module configured to add the additional sample data set to the first sample data set.
15. The apparatus of claim 13, further comprising:
a post-training detection module configured to detect an audio clip in another audio file different from the audio file using the trained audio detection model, the other audio file being a predetermined audio file that does not contain noise; and
a second expansion module configured to add an audio clip in the other audio file to the second sample data set for further training of the audio detection model in response to the audio clip in the other audio file being detected as containing noise.
16. The apparatus of claim 15, wherein the post-training detection module comprises:
a probability prediction module configured to predict a probability that noise is included in an audio segment in the other audio file; and
a prediction result determination module configured to determine that noise is included in the audio segment in the other audio file in response to the probability being greater than a threshold probability.
17. The apparatus of claim 13, wherein the plurality of audio segments have a predetermined length of time and one of the plurality of audio segments has an overlapping portion with another audio segment.
18. The device of any of claims 13 to 17, wherein the first sample data set is a positive sample data set and the second sample data set is a negative sample data set.
19. The apparatus of any of claims 13 to 17, wherein the audio file is obtained from a video file.
20. An audio detection apparatus comprising:
the audio file acquisition module is configured to acquire an audio file to be detected;
the to-be-detected audio clip acquisition module is configured to acquire a plurality of to-be-detected audio clips from the to-be-detected audio file;
a detection module configured to detect the plurality of audio segments to be detected respectively using the audio detection model trained by the apparatus according to any one of claims 13-19; and
a detection result determination module configured to determine a detection result of the audio file to be detected based on respective detection results of the plurality of audio clips to be detected.
21. The apparatus of claim 20, wherein the detection module comprises:
a probability prediction module configured to respectively predict the probabilities that the plurality of audio segments to be detected contain noise.
22. The apparatus of claim 21, wherein the detection result determination module comprises:
a mean determination module configured to determine a mean of the predicted probabilities;
a determination module configured to determine the audio file to be detected as an audio file containing noise in response to the average being greater than a threshold probability.
23. The apparatus according to claim 20, wherein the plurality of audio segments to be detected have a predetermined length of time, and one of the plurality of audio segments to be detected has an overlapping portion with another audio segment to be detected.
24. The apparatus of claim 20, further comprising:
a video file acquisition module configured to acquire a video file to be detected,
wherein the audio file acquisition module acquires the audio file to be detected from the video file to be detected.
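The claims do not say how the audio file is extracted from the video file to be detected; one common approach, assumed here purely for illustration, is to invoke ffmpeg (which must be installed separately):

```python
import subprocess

def extract_audio(video_path: str, wav_path: str, sample_rate: int = 16000):
    # Extract the audio track from a video file with ffmpeg.
    # The 16 kHz mono WAV output format is an illustrative assumption.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,  # -y: overwrite output if present
         "-vn",                             # drop the video stream
         "-ac", "1",                        # downmix to mono
         "-ar", str(sample_rate),           # resample
         wav_path],
        check=True,
    )
```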
25. An electronic device, the electronic device comprising:
one or more processors; and
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1-12.
26. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110090449.1A | 2021-01-22 | 2021-01-22 | Method for training audio detection model, audio detection method and device thereof
Publications (1)
Publication Number | Publication Date |
---|---|
CN112863548A (en) | 2021-05-28
Family
ID=76008128
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110090449.1A (Pending) | Method for training audio detection model, audio detection method and device thereof | 2021-01-22 | 2021-01-22
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112863548A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109785850A (en) * | 2019-01-18 | 2019-05-21 | 腾讯音乐娱乐科技(深圳)有限公司 | Noise detection method, device and storage medium
WO2020151338A1 (en) * | 2019-01-23 | 2020-07-30 | 平安科技(深圳)有限公司 | Audio noise detection method and apparatus, storage medium, and mobile terminal |
CN111145763A (en) * | 2019-12-17 | 2020-05-12 | 厦门快商通科技股份有限公司 | GRU-based voice recognition method and system in audio |
CN111341333A (en) * | 2020-02-10 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Noise detection method, noise detection device, medium, and electronic apparatus |
CN111797708A (en) * | 2020-06-12 | 2020-10-20 | 瑞声科技(新加坡)有限公司 | Airflow noise detection method and device, terminal and storage medium |
CN111863021A (en) * | 2020-07-21 | 2020-10-30 | 上海宜硕网络科技有限公司 | Method, system and equipment for recognizing breath sound data |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113689843A (en) * | 2021-07-22 | 2021-11-23 | 北京百度网讯科技有限公司 | Vocoder selection and model training method, device, equipment and storage medium |
Similar Documents
Publication | Title
---|---
CN108989882B (en) | Method and apparatus for outputting music pieces in video
JP6101196B2 (en) | Voice identification method and apparatus
CN107301170B (en) | Method and device for segmenting sentences based on artificial intelligence
CN112559800B (en) | Method, apparatus, electronic device, medium and product for processing video
CN112800919A (en) | Method, device and equipment for detecting target type video and storage medium
CN108877779B (en) | Method and device for detecting voice tail point
US11282514B2 (en) | Method and apparatus for recognizing voice
CN113076932B (en) | Method for training audio language identification model, video detection method and device thereof
CN111341333B (en) | Noise detection method, noise detection device, medium, and electronic apparatus
CN109634554B (en) | Method and device for outputting information
CN111312223A (en) | Training method and device of voice segmentation model and electronic equipment
CN113392920B (en) | Method, apparatus, device, medium, and program product for generating cheating prediction model
CN112863548A (en) | Method for training audio detection model, audio detection method and device thereof
CN114724144B (en) | Text recognition method, training device, training equipment and training medium for model
CN116761020A (en) | Video processing method, device, equipment and medium
CN113808619B (en) | Voice emotion recognition method and device and electronic equipment
CN115410048A (en) | Training method, device, equipment and medium of image classification model and image classification method, device and equipment
CN114550300A (en) | Video data analysis method and device, electronic equipment and computer storage medium
KR20180041072A (en) | Device and method for audio frame processing
CN115312042A (en) | Method, apparatus, device and storage medium for processing audio
CN113852835A (en) | Live broadcast audio processing method and device, electronic equipment and storage medium
CN113761115A (en) | Method, device, equipment and medium for detecting emergency
CN112487809A (en) | Text data noise reduction method and device, electronic equipment and readable storage medium
CN112926623A (en) | Method, device, medium and electronic equipment for identifying composite video
CN113408664B (en) | Training method, classification method, device, electronic equipment and storage medium
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210528