CN115132231A - Voice activity detection method, device, equipment and readable storage medium
- Publication number: CN115132231A
- Application number: CN202211051500.9A
- Authority: CN (China)
- Prior art keywords: voice, convolution, signal frame, layer, output
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
Abstract
The application discloses a voice activity detection method, apparatus, device and readable storage medium. First, the voice features of each signal frame corresponding to a voice signal to be detected are obtained. The voice features of each signal frame are then input into a voice activity detection model, which outputs a voice activity detection result for each signal frame indicating whether that frame is a voice frame or a non-voice frame. Finally, the active voice segments corresponding to the voice signal are determined based on the voice activity detection results of the signal frames. In this scheme, for each signal frame, the voice activity detection model obtains the detection result based on the signal frame and the historical signal frames before it; no future frame after the signal frame is used, so the waiting latency generated by forward propagation of the model at the inference stage can be reduced.
Description
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for detecting speech activity.
Background
A Voice Activity Detection (VAD) system determines which frames of an input speech signal are speech frames and which are non-speech frames, and forwards the detected speech frames to subsequent speech processing steps. Voice activity detection is a crucial front-end step in many voice-related applications (e.g., voice wake-up, speech enhancement, speech coding, speech recognition, speaker recognition), and many of these scenarios, such as video conferencing, impose high real-time requirements. The voice activity detection system therefore needs to deliver valid speech frames to the subsequent speech processing steps as quickly as possible.
At present, voice activity detection systems mostly use an ordinary Convolutional Neural Network (CNN) model to classify the speech frames and non-speech frames of an input speech signal. To keep the number of frames in the time dimension unchanged across each convolution operation, an ordinary CNN model uses future frames, which causes the model to incur a waiting latency during forward propagation at the inference stage.
Therefore, how to provide a voice activity detection system that reduces the latency generated by forward propagation of the model at the inference stage is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above problems, the present application provides a voice activity detection method, apparatus, device and readable storage medium. The specific scheme is as follows:
a voice activity detection method, the method comprising:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and determining an active speech segment corresponding to the speech signal based on the speech activity detection result of each signal frame.
Optionally, the acquiring the voice features of each signal frame corresponding to the voice signal to be detected includes:
performing framing and windowing processing on the voice signal to obtain a plurality of signal frames;
and for each signal frame, performing feature extraction on the signal frame to obtain the voice feature of the signal frame.
Optionally, the determining an active speech segment corresponding to the speech signal based on the detection result of the speech activity of each signal frame includes:
performing smooth operation on the voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
and determining the non-noise voice segment as an active voice segment corresponding to the voice signal.
Optionally, the determining a noise speech segment and a non-noise speech segment from each initial active speech segment includes:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
Optionally, the voice activity detection model includes a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a framing layer, a causal convolution neural network, a second convolution layer, and a full-connection layer, which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding filling in the convolution processing process;
the regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
Optionally, the causal convolutional neural network comprises: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the framing layer and performing convolution processing on the output of the framing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
Optionally, the second convolution module includes a preset number of convolution units, and each convolution unit includes a pre-padding layer, a first convolution sublayer, a second convolution sublayer, and a residual connection layer;
the pre-padding layer is used for receiving the output of the first convolution module and performing pre-padding processing on the output of the first convolution module based on a pre-padding parameter, wherein the pre-padding parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
A voice activity detection apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the voice characteristics of each signal frame corresponding to the voice signal to be detected;
the detection unit is used for inputting the voice characteristics of each signal frame into the voice activity detection model, the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and the determining unit is used for determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
Optionally, the obtaining unit includes:
a framing and windowing unit, configured to perform framing and windowing processing on the voice signal to obtain a plurality of signal frames;
and the feature extraction unit is used for extracting the features of the signal frames aiming at each signal frame to obtain the voice features of the signal frames.
Optionally, the determining unit includes:
a smoothing operation unit, configured to perform smoothing operation on a voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
a noise voice segment and non-noise voice segment determining unit, for determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
and the active voice segment determining unit is used for determining the non-noise voice segment as an active voice segment corresponding to the voice signal.
Optionally, the noise speech segment and non-noise speech segment determining unit is specifically configured to:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
Optionally, the voice activity detection model includes a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a framing layer, a causal convolution neural network, a second convolution layer, and a full connection layer, which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
the regularization layer is used for receiving the output of the first convolution layer and regularizing the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and splicing the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
Optionally, the causal convolutional neural network comprises: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the framing layer and performing convolution processing on the output of the framing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
Optionally, the second convolution module includes a preset number of convolution units, and each convolution unit includes a pre-padding layer, a first convolution sublayer, a second convolution sublayer, and a residual connection layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, and the pre-filling parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
A voice activity detection device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the voice activity detection method as described above.
A readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for detecting speech activity as described above.
By means of the above technical scheme, the application discloses a voice activity detection method, apparatus, device and readable storage medium. First, the voice features of each signal frame corresponding to a voice signal to be detected are obtained. The voice features of each signal frame are then input into a voice activity detection model, which outputs a voice activity detection result for each signal frame indicating whether that frame is a voice frame or a non-voice frame. Finally, the active voice segments corresponding to the voice signal are determined based on the voice activity detection results of the signal frames. In this scheme, for each signal frame, the voice activity detection model obtains the detection result based on the signal frame and the historical signal frames before it; no future frame after the signal frame is used, so the waiting latency generated by forward propagation of the model at the inference stage can be reduced.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic flow chart of a method for detecting speech activity disclosed in an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for determining an active speech segment corresponding to a speech signal based on a result of speech activity detection of each signal frame according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a voice activity detection model according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a causal convolutional neural network in a voice activity detection model disclosed in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a speech activity detection apparatus according to an embodiment of the present disclosure;
fig. 6 is a block diagram of a hardware structure of a speech activity detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
Next, the voice activity detection method provided in the present application will be described by the following examples.
Referring to fig. 1, fig. 1 is a schematic flow chart of a voice activity detection method disclosed in an embodiment of the present application, where the method may include:
step S101: and acquiring the voice characteristics of each signal frame corresponding to the voice signal to be detected.
In the present application, the voice signal to be detected may be a voice signal input in real time. For such a signal, framing and windowing processing may be performed on the voice signal to obtain a plurality of signal frames; then, for each signal frame, feature extraction is performed to obtain the voice feature of that signal frame.
It should be noted that, in the present application, the speech signal may be subjected to framing and windowing processing based on a preset frame length, frame shift, and window function, so as to obtain a plurality of signal frames. The speech feature may be a common speech feature such as PLP (Perceptual Linear Prediction) coefficients, MFCC (Mel-Frequency Cepstral Coefficients), or Filter Bank features. Since the Filter Bank feature retains more of the original acoustic information than MFCC, as one implementation, the Filter Bank feature may be selected as the speech feature of a signal frame; for example, a Filter Bank feature with a dimension of 40 may be used.
The human ear perceives different frequencies with different acuity: the higher the frequency, the lower the sensitivity, so the ear's perception of the frequency domain is nonlinear. The Mel scale describes exactly this rule, reflecting the relationship between the Mel frequency perceived linearly by the human ear and the ordinary frequency. Taking the logarithm of the Mel-spectrum energy values yields the Filter Bank feature.
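For concreteness, the following is a minimal sketch of how such 40-dimensional Filter Bank features could be extracted per frame. The 25 ms frame length, 10 ms frame shift, Hamming window, 16 kHz sampling rate, and 512-point FFT are illustrative assumptions and are not values fixed by the present application:

```python
import numpy as np

def framing_windowing(signal, sample_rate, frame_len_ms=25, frame_shift_ms=10):
    """Split a waveform into overlapping, Hamming-windowed frames.

    Assumes the signal is at least one frame long.
    """
    frame_len = int(sample_rate * frame_len_ms / 1000)
    frame_shift = int(sample_rate * frame_shift_ms / 1000)
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames * window

def mel_filterbank(num_filters, nfft, sample_rate):
    """Triangular filters equally spaced on the Mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0.0), mel(sample_rate / 2), num_filters + 2)
    bins = np.floor((nfft + 1) * inv_mel(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for i in range(1, num_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def log_fbank_features(signal, sample_rate=16000, num_filters=40, nfft=512):
    """Per-frame 40-dimensional log Mel Filter Bank features."""
    frames = framing_windowing(signal, sample_rate)
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft      # power spectrum
    # Apply the Mel filters, then take the logarithm of the energies.
    return np.log(power @ mel_filterbank(num_filters, nfft, sample_rate).T + 1e-10)
```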
Step S102: inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame.
In this application, the voice features of the signal frames can be input into the voice activity detection model in batches, and the voice activity detection model can be implemented based on a causal convolutional neural network. Compared with a voice activity detection model implemented with an ordinary convolutional neural network as in the prior art, when performing voice activity detection on each signal frame, this model obtains the voice activity detection result of the signal frame based on the signal frame and a preset number of historical signal frames before it, and does not use any future frame after the signal frame.
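The latency property described here can be illustrated with a small sketch (assuming PyTorch; the channel count, frame count, and kernel size are arbitrary choices for the demonstration). With left-only (causal) padding, the output at frame t is unchanged when future frames are perturbed, whereas symmetric "same" padding makes outputs depend on future frames:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 8, 100)           # (batch, channels, frames)
x_future_changed = x.clone()
x_future_changed[:, :, 60:] += 1.0   # perturb only frames from t = 60 onward

k, d = 3, 1                          # kernel size, dilation
causal = nn.Conv1d(8, 8, kernel_size=k, dilation=d)
same = nn.Conv1d(8, 8, kernel_size=k, dilation=d, padding=(k - 1) // 2)

# Causal: pad (k - 1) * d zeros on the left only, then convolve.
pad_left = nn.ConstantPad1d(((k - 1) * d, 0), 0.0)
y1, y2 = causal(pad_left(x)), causal(pad_left(x_future_changed))
print(torch.allclose(y1[:, :, :60], y2[:, :, :60]))  # True: no look-ahead

# Symmetric "same" padding: outputs before t = 60 already see future frames.
z1, z2 = same(x), same(x_future_changed)
print(torch.allclose(z1[:, :, :60], z2[:, :, :60]))  # False: one-frame look-ahead
```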
It should be noted that the specific structure and function implementation of the voice activity detection model will be described in detail by the following embodiments, and will not be described herein.
Step S103: and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
In the application, based on the voice activity detection result of each signal frame, it can be determined whether each signal frame is a voice frame or a non-voice frame, and the active voice segments can be determined on this basis.
In order to ensure the accuracy of the determined active voice segments, the temporal correlation between signal frames and the noise characteristics of the signal frames may also be considered; that is, the active voice segments corresponding to the voice signal are determined based on the voice activity detection results of the signal frames, the temporal correlation between signal frames, and the noise characteristics of the signal frames.
This embodiment discloses a voice activity detection method. First, the voice features of each signal frame corresponding to a voice signal to be detected are obtained. The voice features of each signal frame are then input into a voice activity detection model, which outputs a voice activity detection result for each signal frame indicating whether that frame is a voice frame or a non-voice frame. Finally, the active voice segments corresponding to the voice signal are determined based on the voice activity detection results of the signal frames. In this scheme, for each signal frame, the voice activity detection model obtains the detection result based on the signal frame and the historical signal frames before it; no future frame after the signal frame is used, so the waiting latency generated by forward propagation of the model at the inference stage can be reduced.
In another embodiment of the present application, a specific implementation manner of determining an active speech segment corresponding to a speech signal based on a detection result of speech activity of each signal frame in step S103 is described.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for determining an active speech segment corresponding to a speech signal based on a result of detecting speech activity of each signal frame according to an embodiment of the present application, where the method may include:
step S201: and performing smooth operation on the voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal.
A speech signal is a time-sequential signal, which means adjacent signal frames are correlated: for example, if the current signal frame is a speech frame, the probability that the next signal frame is also a speech frame is high. When each signal frame is judged independently, however, a non-speech frame may appear in the middle of a run of speech frames. It is therefore necessary to perform a segment-level smoothing operation, using artificially defined rules, on the voice activity detection results of the signal frames to reduce frequent jumps between speech frames and non-speech frames. In the present application, the voice activity detection results of the signal frames are smoothed to obtain the initial active voice segments corresponding to the voice signal.
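As one illustrative example of such an artificially defined rule (the gap-bridging and minimum-length thresholds below are assumptions, not values given by the present application), the smoothing operation might bridge short non-speech gaps inside speech runs and discard very short speech runs:

```python
def smooth_vad(flags, max_gap=10, min_speech=5):
    """Turn per-frame speech/non-speech flags into initial active segments.

    flags: iterable of 0/1 per frame (1 = speech frame).
    max_gap: non-speech runs shorter than this inside speech are bridged.
    min_speech: speech runs shorter than this are discarded as spurious.
    Returns a list of (start_frame, end_frame_exclusive) segments.
    """
    # Collect raw speech runs.
    runs, start = [], None
    for t, f in enumerate(flags):
        if f and start is None:
            start = t
        elif not f and start is not None:
            runs.append((start, t))
            start = None
    if start is not None:
        runs.append((start, len(flags)))
    # Bridge short non-speech gaps between adjacent runs.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # Drop runs that are too short to be real speech.
    return [(s, e) for s, e in merged if e - s >= min_speech]
```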
Step S202: from each of the initial active speech segments, a noisy speech segment and a non-noisy speech segment are determined.
In the present application, because the voice activity detection model obtains, for each signal frame, the voice activity detection result based on the signal frame and the historical signal frames before it, the model may more easily detect background human voice as speech frames. To solve this problem, the initial active voice segments may be further processed: noise voice segments and non-noise voice segments are determined from the initial active voice segments, the noise voice segments are discarded, and the non-noise voice segments are determined as the active voice segments.
As an implementation, the determining a noise speech segment and a non-noise speech segment from each initial active speech segment includes: calculating the posterior probability mean square error corresponding to each initial active voice segment; if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment; and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than the preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
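A sketch of this thresholding rule follows, assuming the model emits a per-frame speech posterior probability and reading "posterior probability mean square error" as the mean squared deviation of a segment's posteriors from their segment mean; the 0.01 threshold is an illustrative assumption:

```python
import numpy as np

def split_noise_segments(posteriors, segments, mse_threshold=0.01):
    """Classify each initial active segment as noise or non-noise.

    posteriors: per-frame speech posterior probabilities from the model.
    segments: (start, end) frame ranges of the initial active segments.
    A segment whose posteriors barely deviate from their own mean is
    treated as background noise; genuine speech posteriors fluctuate more.
    """
    noise, non_noise = [], []
    for start, end in segments:
        p = np.asarray(posteriors[start:end])
        mse = np.mean((p - p.mean()) ** 2)
        (noise if mse < mse_threshold else non_noise).append((start, end))
    return noise, non_noise
```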
Step S203: and determining the non-noise voice segment as an active voice segment corresponding to the voice signal.
In another embodiment of the present application, the structural and functional implementation of the voice activity detection model is described.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a voice activity detection model disclosed in an embodiment of the present application, where the voice activity detection model includes a first convolution layer, a regularization layer, an activation function layer, a pooling layer, a framing layer, a causal convolutional neural network, a second convolution layer, and a full connection layer, which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process; as an implementation, the first convolution layer may use a convolution kernel of 3 × 3. In the present application, the padding parameters of the first convolution layer may be set, thereby implementing front and back zero padding.
The regularization layer is used for receiving the output of the first convolution layer and regularizing the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and splicing the frame of the output of the pooling layer; it should be noted that, the frame splicing processing can reduce the computational complexity of the subsequent model structure, and further reduce the computational delay of the model.
The causal convolutional neural network is used for receiving the output of the frame splicing layer, carrying out convolutional processing on the output of the frame splicing layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the frame splicing layer through pre-zero padding in the convolutional processing process; in the application, the filling parameters of the causal convolutional neural network can be set, and therefore pre-zero padding is achieved.
The second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process; as an implementation, the second convolutional layer may employ a convolution kernel of 1 × 5. In the present application, the filling parameters of the second convolution layer may be set, thereby implementing front and back zero padding.
And the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
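The layer stack described above might be realized as in the following sketch. This is not the definitive implementation of this application: BatchNorm as the regularization layer, ReLU as the activation, max pooling over the feature dimension only, a splicing factor of 2, and the channel counts are all assumptions, and `CausalConvNet` refers forward to the causal-network sketch given after the next embodiment:

```python
import torch
import torch.nn as nn

class VADModel(nn.Module):
    """Sketch of the described layer stack; hyper-parameters are assumptions."""

    def __init__(self, feat_dim=40, channels=32, splice=2, num_classes=2):
        super().__init__()
        # First convolution layer: 3x3 kernel with symmetric ("front and
        # back") zero padding, so the frame count is unchanged.
        self.conv1 = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(channels)   # regularization layer (assumed)
        self.act = nn.ReLU()                   # activation function layer (assumed)
        self.pool = nn.MaxPool2d((1, 2))       # pool features, keep all frames
        self.splice = splice                   # frame-splicing factor (assumed)
        spliced_dim = channels * (feat_dim // 2) * splice
        self.causal = CausalConvNet(spliced_dim, spliced_dim)  # sketched below
        # Second convolution layer: 1x5 kernel over time, symmetric padding.
        self.conv2 = nn.Conv1d(spliced_dim, spliced_dim, kernel_size=5, padding=2)
        self.fc = nn.Linear(spliced_dim, num_classes)   # full-connection layer

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        x = self.pool(self.act(self.norm(self.conv1(feats.unsqueeze(1)))))
        b, c, t, f = x.shape
        t = t - t % self.splice                # splice adjacent frames:
        x = x[:, :, :t].permute(0, 2, 1, 3).reshape(b, t // self.splice, -1)
        x = self.causal(x.transpose(1, 2))     # (batch, dims, spliced frames)
        x = self.conv2(x)
        return self.fc(x.transpose(1, 2))      # per-frame speech/non-speech logits
```

Note that in this sketch the detection results are produced at the spliced-frame resolution; mapping them back to the original frame rate is left out for brevity.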
In another embodiment of the present application, the structure of a causal convolutional neural network in a voice activity detection model is described.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a causal convolutional neural network in a voice activity detection model disclosed in an embodiment of the present application, where the causal convolutional neural network includes: the first convolution module, a plurality of parallel second convolution modules respectively connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the framing layer and performing convolution processing on the output of the framing layer; as an implementation, the first convolution module may employ a convolution kernel of 1 × 1.
Each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
The second convolution module comprises a preset number of convolution units, and each convolution unit comprises a pre-filling layer, a first convolution sub-layer, a second convolution sub-layer and a residual error connecting layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, and the pre-filling parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer; as an implementation, the first convolution sublayer may employ a convolution kernel of 1 × 3.
The second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer; as an implementation, the second convolution sublayer may employ a convolution kernel of 1 × 1.
And the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
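Combining the descriptions of Fig. 3 and Fig. 4, the causal convolutional neural network and its convolution units might be sketched as follows. The number of parallel branches, the number of units per branch, the doubling dilation schedule, additive fusion, and taking the residual from the unit input are all assumptions beyond what the text fixes:

```python
import torch
import torch.nn as nn

class ConvUnit(nn.Module):
    """Pre-filling layer, 1x3 dilated causal conv, 1x1 conv, residual."""

    def __init__(self, channels, dilation):
        super().__init__()
        # Pre-filling layer: (kernel_size - 1) * dilation zeros on the left
        # only, determined by the dilation (expansion) coefficient, so that
        # no output frame depends on a future frame.
        self.pad = nn.ConstantPad1d((2 * dilation, 0), 0.0)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3,
                               dilation=dilation)               # first conv sublayer
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=1)  # second conv sublayer
        self.act = nn.ReLU()

    def forward(self, x):                       # x: (batch, channels, frames)
        y = self.conv2(self.act(self.conv1(self.pad(x))))
        return x + y    # residual connection (taken here from the unit input)

class CausalConvNet(nn.Module):
    """1x1 first conv module, parallel branches of ConvUnits, additive fusion."""

    def __init__(self, in_channels, channels, branches=2, units_per_branch=3):
        super().__init__()
        self.first = nn.Conv1d(in_channels, channels, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Sequential(*[ConvUnit(channels, dilation=2 ** i)
                            for i in range(units_per_branch)])
            for _ in range(branches))

    def forward(self, x):                       # x: (batch, in_channels, frames)
        h = self.first(x)
        # Fuse the parallel branch outputs; element-wise sum is assumed here.
        return torch.stack([branch(h) for branch in self.branches]).sum(0)
```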
It should be noted that the structure of the voice activity detection model proposed in the embodiment of the present application is merely exemplary, and other similar structures obtained on this basis should also be within the scope of the present application.
The following describes a voice activity detection device disclosed in an embodiment of the present application, and the voice activity detection device described below and the voice activity detection method described above may be referred to correspondingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a voice activity detection apparatus according to an embodiment of the present application. As shown in fig. 5, the voice activity detecting apparatus may include:
the acquiring unit 11 is configured to acquire a voice feature of each signal frame corresponding to a voice signal to be detected;
a detecting unit 12, configured to input the speech characteristics of each signal frame into a speech activity detection model, where the speech activity detection model outputs a speech activity detection result of each signal frame, and the speech activity detection result of each signal frame is used to indicate whether the signal frame is a speech frame or a non-speech frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
a determining unit 13, configured to determine an active speech segment corresponding to the speech signal based on the detection result of speech activity of each signal frame.
As an implementation, the obtaining unit includes:
a framing and windowing unit, configured to perform framing and windowing processing on the voice signal to obtain a plurality of signal frames;
and the feature extraction unit is used for extracting the features of the signal frames aiming at each signal frame to obtain the voice features of the signal frames.
As an implementable manner, the determining unit includes:
a smoothing operation unit, configured to perform smoothing operation on a voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
a noise voice segment and non-noise voice segment determining unit, for determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
an active speech segment determining unit, configured to determine the non-noise speech segment as an active speech segment corresponding to the speech signal.
As an implementation manner, the noise speech segment and non-noise speech segment determining unit is specifically configured to:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
As an implementation manner, the voice activity detection model comprises a first convolutional layer, a regularization layer, an activation function layer, a pooling layer, a framing layer, a causal convolutional neural network, a second convolutional layer and a full-connection layer which are connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
the regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and pooling the output of the activation function layer;
the frame splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the splicing frame layer, performing convolutional processing on the output of the splicing frame layer, and keeping the frame number of the output signal frame consistent with the frame number of the signal frame output by the splicing frame layer through pre-zero padding in the convolutional processing process;
the second convolution layer is used for receiving the output of the causal convolutional neural network and carrying out convolution processing on the output of the causal convolutional neural network, and the frame number of the output signal frame is kept consistent with the frame number of the signal frame output by the causal convolutional neural network through front and back zero padding in the convolution processing process;
and the full-connection layer is used for receiving the output of the second convolution layer and performing full-connection processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
As one possible implementation, the causal convolutional neural network includes: the convolution module comprises a first convolution module, a plurality of parallel second convolution modules and a fusion module, wherein the plurality of parallel second convolution modules are respectively connected with the first convolution module;
the first convolution module is used for receiving the output of the framing layer and performing convolution processing on the output of the framing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
As an implementation manner, the second convolution module includes a preset number of convolution units, each convolution unit includes a pre-padding layer, a first convolution sublayer, a second convolution sublayer and a residual connecting layer;
the pre-filling layer is used for receiving the output of the first convolution module and performing pre-filling processing on the output of the first convolution module based on a pre-filling parameter, and the pre-filling parameter is determined based on an expansion coefficient of the causal convolution neural network;
the first convolution sublayer is used for receiving the output of the pre-filling layer and performing convolution processing on the output of the pre-filling layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connecting layer is used for carrying out residual processing on the output of the pre-filling layer and the output of the second convolution sublayer.
Referring to fig. 6, fig. 6 is a block diagram of a hardware structure of a speech activity detection device according to an embodiment of the present application, and referring to fig. 6, the hardware structure of the speech activity detection device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit CPU, or an application Specific Integrated circuit asic, or one or more Integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for voice activity detection, the method comprising:
acquiring voice characteristics of each signal frame corresponding to a voice signal to be detected;
inputting the voice characteristics of each signal frame into a voice activity detection model, wherein the voice activity detection model outputs the voice activity detection result of each signal frame, and the voice activity detection result of each signal frame is used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains a voice activity detection result of the signal frame based on the signal frame and a historical signal frame before the signal frame;
and determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
2. The method according to claim 1, wherein the obtaining the speech features of the signal frames corresponding to the speech signal to be detected comprises:
performing framing and windowing processing on the voice signal to obtain a plurality of signal frames;
and performing feature extraction on the signal frame aiming at each signal frame to obtain the voice feature of the signal frame.
3. The method according to claim 1, wherein the determining the active speech segment corresponding to the speech signal based on the detection result of the speech activity of each signal frame comprises:
performing smooth operation on the voice activity detection result of each signal frame to obtain an initial active voice segment corresponding to the voice signal;
determining a noise voice segment and a non-noise voice segment from each initial active voice segment;
determining the non-noise speech segment as an active speech segment corresponding to the speech signal.
4. The method of claim 3, wherein the determining a noise speech segment and a non-noise speech segment from each initial active speech segment comprises:
calculating the posterior probability mean square error corresponding to each initial active voice segment;
if the posterior probability mean square error corresponding to the initial active voice segment is lower than a preset posterior probability mean square error threshold, determining the initial active voice segment as a noise voice segment;
and if the posterior probability mean square error corresponding to the initial active voice segment is not lower than a preset posterior probability mean square error threshold, determining that the initial active voice segment is a non-noise voice segment.
5. The method of claim 1, wherein the voice activity detection model comprises a first convolutional layer, a regularization layer, an activation function layer, a pooling layer, a framing layer, a causal convolutional neural network, a second convolutional layer, and a fully-connected layer connected in sequence;
the first convolution layer is used for receiving the voice characteristics of each signal frame and performing convolution processing on the voice characteristics of each signal frame, and the frame number of the output signal frame is kept consistent with the frame number of the received signal frame through front and back zero padding in the convolution processing process;
the regularization layer is used for receiving the output of the first convolution layer and carrying out regularization processing on the output of the first convolution layer;
the activation function layer is used for receiving the output of the regularization layer and performing activation processing on the output of the regularization layer;
the pooling layer is used for receiving the output of the activation function layer and performing pooling processing on the output of the activation function layer;
the frame-splicing layer is used for receiving the output of the pooling layer and performing frame splicing on the output of the pooling layer;
the causal convolutional neural network is used for receiving the output of the frame-splicing layer and performing convolution processing on the output of the frame-splicing layer, wherein, during the convolution processing, the number of output signal frames is kept consistent with the number of signal frames output by the frame-splicing layer through zero padding at the front only;
the second convolution layer is used for receiving the output of the causal convolutional neural network and performing convolution processing on the output of the causal convolutional neural network, wherein, during the convolution processing, the number of output signal frames is kept consistent with the number of signal frames output by the causal convolutional neural network through zero padding at both the front and the back;
and the fully-connected layer is used for receiving the output of the second convolution layer and performing fully-connected processing on the output of the second convolution layer to obtain the voice activity detection result of each signal frame.
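A minimal PyTorch sketch of this eight-stage pipeline follows. All channel widths, kernel sizes, the splice context, and the two-class output head are assumptions; `CausalCNN` refers to the sketches given after claims 6 and 7 below, so all three classes must be defined before instantiation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VADModel(nn.Module):
    """Minimal sketch of the claim-5 pipeline; hyperparameters
    are illustrative, not taken from the patent."""

    def __init__(self, feat_dim=257, hidden=64, context=2):
        super().__init__()
        self.context = context
        # first convolution layer: symmetric ('same') zero padding
        # keeps the output frame count equal to the input frame count
        self.conv1 = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(hidden)            # regularization layer
        # stride-1 pooling with padding so the frame count is preserved
        self.pool = nn.MaxPool1d(kernel_size=3, stride=1, padding=1)
        self.causal = CausalCNN(hidden * (2 * context + 1), hidden)
        # second convolution layer, again with symmetric zero padding
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, 2)                # fully-connected layer

    def splice(self, x):
        # frame-splicing layer: concatenate each frame with `context`
        # neighbours on each side (edges zero-padded), keeping T frames
        x = F.pad(x, (self.context, self.context))
        width = x.shape[2] - 2 * self.context
        return torch.cat([x[:, :, i:i + width]
                          for i in range(2 * self.context + 1)], dim=1)

    def forward(self, feats):                         # feats: (B, feat_dim, T)
        h = self.pool(F.relu(self.norm(self.conv1(feats))))
        h = self.conv2(self.causal(self.splice(h)))
        return self.fc(h.transpose(1, 2))             # (B, T, 2) frame logits
```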
6. The method according to claim 5, wherein the causal convolutional neural network comprises: a first convolution module, a plurality of parallel second convolution modules each connected with the first convolution module, and a fusion module connected with the plurality of parallel second convolution modules;
the first convolution module is used for receiving the output of the frame-splicing layer and performing convolution processing on the output of the frame-splicing layer;
each second convolution module is used for receiving the output of the first convolution module and performing convolution processing on the output of the first convolution module to obtain a convolution result;
and the fusion module is used for receiving the convolution results of the second convolution modules and carrying out fusion processing on the convolution results of the second convolution modules to obtain the output of the causal convolution neural network.
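Continuing the sketch above (same imports), one possible reading of this structure is shown below; element-wise summation stands in for the unspecified fusion processing, the 1x1 first convolution and the branch count are assumptions, and `ConvModule` is sketched after claim 7.

```python
class CausalCNN(nn.Module):
    """Sketch of the claim-6 structure: a first convolution module
    feeding several parallel second convolution modules whose
    results are fused."""

    def __init__(self, in_ch, out_ch, n_branches=3):
        super().__init__()
        # first convolution module (a 1x1 convolution is assumed)
        self.first = nn.Conv1d(in_ch, out_ch, kernel_size=1)
        # parallel second convolution modules
        self.branches = nn.ModuleList(
            ConvModule(out_ch) for _ in range(n_branches))

    def forward(self, x):
        h = self.first(x)
        # fusion module: sum the branch outputs (assumed fusion op)
        return sum(branch(h) for branch in self.branches)
```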
7. The method according to claim 6, wherein each second convolution module comprises a preset number of convolution units, each convolution unit comprising a pre-padding layer, a first convolution sublayer, a second convolution sublayer, and a residual connection layer;
the pre-padding layer is used for receiving the output of the first convolution module and performing pre-padding processing on the output of the first convolution module based on a pre-padding parameter, wherein the pre-padding parameter is determined based on a dilation coefficient of the causal convolutional neural network;
the first convolution sublayer is used for receiving the output of the pre-padding layer and performing convolution processing on the output of the pre-padding layer;
the second convolution sublayer is used for receiving the output of the first convolution sublayer and performing convolution processing on the output of the first convolution sublayer;
and the residual connection layer is used for performing residual processing on the output of the pre-padding layer and the output of the second convolution sublayer.
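Completing the sketch (same imports), one plausible convolution unit is given below. The number of units and the dilation schedule (1, 2, 4) are assumptions, and the residual is taken between the unit input and the second sublayer output rather than the pre-padded tensor, since the pre-padded tensor is longer than that output.

```python
class ConvModule(nn.Module):
    """Sketch of a claim-7 second convolution module: a stack of
    convolution units, each with a pre-padding layer, two
    convolution sublayers, and a residual connection."""

    def __init__(self, channels, n_units=3, kernel=3):
        super().__init__()
        self.units = nn.ModuleList()
        for i in range(n_units):
            dilation = 2 ** i
            self.units.append(nn.ModuleDict({
                # pre-padding layer: (kernel - 1) * dilation zeros at
                # the front only, which keeps the convolution causal
                # and the output frame count unchanged
                "pad": nn.ConstantPad1d(((kernel - 1) * dilation, 0), 0.0),
                "conv1": nn.Conv1d(channels, channels, kernel,
                                   dilation=dilation),
                "conv2": nn.Conv1d(channels, channels, kernel_size=1),
            }))

    def forward(self, x):
        for unit in self.units:
            h = F.relu(unit["conv1"](unit["pad"](x)))  # first sublayer
            h = unit["conv2"](h)                       # second sublayer
            x = x + h                                  # residual connection
        return x
```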
8. A voice activity detection apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the voice features of each signal frame corresponding to a voice signal to be detected;
the detection unit is used for inputting the voice features of each signal frame into a voice activity detection model so that the voice activity detection model outputs a voice activity detection result for each signal frame, the voice activity detection result of a signal frame being used for indicating whether the signal frame is a voice frame or a non-voice frame; for each signal frame, the voice activity detection model obtains the voice activity detection result of the signal frame based on the signal frame and the historical signal frames before the signal frame;
and the determining unit is used for determining an active voice segment corresponding to the voice signal based on the voice activity detection result of each signal frame.
9. A voice activity detection device, comprising a memory and a processor;
the memory is used for storing a program;
and the processor is used for executing the program to implement the steps of the voice activity detection method according to any one of claims 1 to 7.
10. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the voice activity detection method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211051500.9A CN115132231B (en) | 2022-08-31 | 2022-08-31 | Voice activity detection method, device, equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115132231A (en) | 2022-09-30
CN115132231B CN115132231B (en) | 2022-12-13 |
Family
ID=83387721
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211051500.9A Active CN115132231B (en) | 2022-08-31 | 2022-08-31 | Voice activity detection method, device, equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115132231B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1391212A (en) * | 2001-06-11 | 2003-01-15 | 阿尔卡塔尔公司 | Method for detecting phonetic activity in signals and phonetic signal encoder including device thereof |
WO2016188553A1 (en) * | 2015-05-22 | 2016-12-01 | Huawei Technologies Co., Ltd. | Methods and nodes in a wireless communication network |
CN106601229A (en) * | 2016-11-15 | 2017-04-26 | 华南理工大学 | Voice awakening method based on soc chip |
CN108564942A (en) * | 2018-04-04 | 2018-09-21 | 南京师范大学 | One kind being based on the adjustable speech-emotion recognition method of susceptibility and system |
CN111276125A (en) * | 2020-02-11 | 2020-06-12 | 华南师范大学 | Lightweight speech keyword recognition method facing edge calculation |
CN111312218A (en) * | 2019-12-30 | 2020-06-19 | 苏州思必驰信息科技有限公司 | Neural network training and voice endpoint detection method and device |
CN111816216A (en) * | 2020-08-25 | 2020-10-23 | 苏州思必驰信息科技有限公司 | Voice activity detection method and device |
CN113288183A (en) * | 2021-05-20 | 2021-08-24 | 中国科学技术大学 | Silent voice recognition method based on facial neck surface myoelectricity |
CN113470652A (en) * | 2021-06-30 | 2021-10-01 | 山东恒远智能科技有限公司 | Voice recognition and processing method based on industrial Internet |
WO2021201422A1 (en) * | 2020-03-31 | 2021-10-07 | 한밭대학교 산학협력단 | Semantic segmentation method and system applicable to ar |
WO2022036801A1 (en) * | 2020-08-18 | 2022-02-24 | 深圳大学 | Method and system for achieving coexistence of heterogeneous networks |
CN114155839A (en) * | 2021-12-15 | 2022-03-08 | 科大讯飞股份有限公司 | Voice endpoint detection method, device, equipment and storage medium |
CN114566179A (en) * | 2022-03-16 | 2022-05-31 | 北京声加科技有限公司 | Time delay controllable voice noise reduction method |
Non-Patent Citations (2)
Title |
---|
SY CHANG et al.: "Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing *
侯苗苗: "Research on Tibetan Speech Recognition Based on CNN Multi-Feature Fusion", China Masters' Theses Full-text Database *
Also Published As
Publication number | Publication date |
---|---|
CN115132231B (en) | 2022-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11508366B2 (en) | Whispering voice recovery method, apparatus and device, and readable storage medium | |
CN108428447B (en) | Voice intention recognition method and device | |
CN109841220B (en) | Speech signal processing model training method and device, electronic equipment and storage medium | |
CN110364143B (en) | Voice awakening method and device and intelligent electronic equipment | |
CN110415699B (en) | Voice wake-up judgment method and device and electronic equipment | |
CN111785288B (en) | Voice enhancement method, device, equipment and storage medium | |
CN108010538B (en) | Audio data processing method and device and computing equipment | |
CN109448746B (en) | Voice noise reduction method and device | |
CN109658943B (en) | Audio noise detection method and device, storage medium and mobile terminal | |
CN113436640B (en) | Audio noise reduction method, device and system and computer readable storage medium | |
CN111916061A (en) | Voice endpoint detection method and device, readable storage medium and electronic equipment | |
CN114333912B (en) | Voice activation detection method, device, electronic equipment and storage medium | |
CN112652306A (en) | Voice wake-up method and device, computer equipment and storage medium | |
CN111048118B (en) | Voice signal processing method and device and terminal | |
CN116312616A (en) | Processing recovery method and control system for noisy speech signals | |
CN115132231B (en) | Voice activity detection method, device, equipment and readable storage medium | |
WO2024017110A1 (en) | Voice noise reduction method, model training method, apparatus, device, medium, and product | |
CN112289311A (en) | Voice wake-up method and device, electronic equipment and storage medium | |
CN113689886B (en) | Voice data emotion detection method and device, electronic equipment and storage medium | |
JP6106618B2 (en) | Speech section detection device, speech recognition device, method thereof, and program | |
JP3006496B2 (en) | Voice recognition device | |
CN111048096A (en) | Voice signal processing method and device and terminal | |
CN116110393B (en) | Voice similarity-based refusing method, device, computer and medium | |
CN113393858B (en) | Voice separation method and system, electronic equipment and readable storage medium | |
US20240170003A1 (en) | Audio Signal Enhancement with Recursive Restoration Employing Deterministic Degradation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |