US10679643B2 - Automatic audio captioning - Google Patents
Automatic audio captioning
- Publication number
- US10679643B2 (application US15/691,546 / US201715691546A)
- Authority
- US
- United States
- Prior art keywords
- rnn
- audio
- speech sound
- computer
- characters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- the present invention relates to audio captioning, and more particularly to automatic audio captioning of a digital audio stream using a deep recurrent neural network.
- a method, computer readable medium, and system are disclosed for audio captioning.
- a raw audio waveform including a non-speech sound is received and relevant features are extracted from the raw audio waveform using a recurrent neural network (RNN) acoustic model.
- a discrete sequence of characters represented in a natural language is generated based on the relevant features, where the discrete sequence of characters comprises a caption that describes the non-speech sound.
- FIG. 1A illustrates a flowchart of a method for audio captioning, in accordance with one embodiment
- FIG. 1B illustrates a block diagram of an audio captioning system, in accordance with one embodiment
- FIG. 2A illustrates another block diagram of an audio captioning system, in accordance with one embodiment
- FIG. 2B illustrates another flowchart of a method for audio captioning, in accordance with one embodiment
- FIG. 3A illustrates another block diagram of an audio captioning system, in accordance with one embodiment
- FIG. 3B is a conceptual diagram illustrating context vectors generated from audio clip frames, in accordance with one embodiment
- FIG. 3C is a conceptual diagram illustrating captions generated from concepts and attention weights, in accordance with one embodiment
- FIG. 3D illustrates another flowchart of a method for audio captioning, in accordance with one embodiment.
- FIG. 4 illustrates an exemplary system in which the various architecture and/or functionality of the various previous embodiments may be implemented.
- Deep neural networks may be used to generate captions for a raw audio signal.
- the captions describe non-speech events that are present in the raw audio signal.
- the caption is a discrete sequence of characters in a natural language, such as English.
- a caption generated for a raw audio signal including a dog barking may be “a dog barks four times.”
- the captions may also describe speech events by translating speech into text.
- a caption generated for a raw audio signal including speech and a dog barking may be “a man says ‘good morning’ while a dog barks four times.”
- An audio captioning system receives a raw audio signal and processes the raw audio signal using 2D convolutional layers and a recurrent neural network (RNN) that together form an RNN acoustic model.
- the RNN acoustic model is followed by a decoder and an RNN language model to generate a caption for the raw audio signal.
- FIG. 1A illustrates a flowchart of a method 100 for automatic audio captioning, in accordance with one embodiment.
- although the method 100 is described in the context of the RNN acoustic model, decoder, and RNN language model, the method 100 may also be performed by a program, custom circuitry, or by a combination of custom circuitry and a program.
- the method 100 may be executed by a GPU, CPU, or any processor capable of performing the necessary operations.
- any system that performs method 100 is within the scope and spirit of embodiments of the present invention.
- a raw audio waveform including a non-speech sound is received by the audio captioning system.
- the raw audio waveform is sampled to generate a sequence of discrete audio samples.
- the raw audio waveform is sampled at a high enough rate to at least cover (but optionally to exceed) the human auditory range (e.g., up to 20 kHz).
- the audio samples are segmented into fixed width sub-sequences referred to as “frames” or “audio frames” that are provided as input to the RNN acoustic model.
- the frame width is a hyper-parameter of the audio captioning system that may be tuned using cross-validation.
- An audio frame may correspond to one or more processing timesteps.
- Input preprocessing techniques such as standardization or domain transformations such as a linear or mel-scale spectrogram, volume normalization, resampling, etc. may be applied to the sequence of discrete audio samples before the frames are provided to the RNN acoustic model.
- the use of frequency domain transformations reduces the size of the network required to achieve good performance. However, given enough training data and a large enough network, the performance (i.e., accuracy) of the system with and without the transformations may be similar.
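- As a concrete illustration of the preprocessing described above, the following sketch frames a raw waveform and computes a log-magnitude (linear-frequency) spectrogram; it assumes NumPy, and the sample rate, frame width, and hop length are illustrative hyper-parameter values of the kind that would be tuned by cross-validation.

```python
import numpy as np

def preprocess(waveform, frame_width=1024, hop=512):
    """Split a raw waveform into fixed-width frames ("audio frames") and
    compute a log-magnitude spectrogram, one feature vector per frame.
    frame_width and hop are illustrative hyper-parameters."""
    # Volume normalization: scale to unit peak amplitude.
    waveform = waveform.astype(np.float32)
    waveform = waveform / (np.max(np.abs(waveform)) + 1e-8)

    # Segment into overlapping fixed-width frames.
    n_frames = 1 + max(0, (len(waveform) - frame_width) // hop)
    frames = np.stack([waveform[i * hop: i * hop + frame_width]
                       for i in range(n_frames)])            # (T, frame_width)

    # Windowed FFT per frame -> log-magnitude spectrogram.
    window = np.hanning(frame_width)
    spectra = np.abs(np.fft.rfft(frames * window, axis=1))    # (T, frame_width//2 + 1)
    return np.log1p(spectra)                                  # features fed to the RNN acoustic model
```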
- relevant features are extracted from the raw audio waveform using a recurrent neural network (RNN) acoustic model.
- a discrete sequence of characters represented in a natural language is generated based on the relevant features, where the discrete sequence of characters comprises a caption that describes the non-speech sound.
- In a first embodiment of the audio captioning system, the RNN acoustic model generates a set of characters for a sequence of timesteps and a corresponding probability distribution for each character in the set.
- a decoder in the first embodiment is a connectionist temporal classification (CTC) decoder that receives the sets of characters and probability distributions and constructs valid combinations of characters. The combinations of characters are each associated with a probability and a number of the combinations having the highest probabilities are output by the decoder to the RNN language model.
- the RNN language model then generates the caption.
- One or more pooling layers may be interleaved between the convolution and/or recurrent layers in the RNN acoustic model.
- the RNN acoustic model locates one or more non-speech and/or speech events in the audio signal and the RNN language model generates the caption.
- In a second embodiment of the audio captioning system, the RNN acoustic model includes a context unit and generates a context vector for each timestep.
- In the second embodiment, the decoder is a feed forward neural network that implements an attention decoder.
- the attention decoder receives each character that is output by the RNN language model and a stream of context vectors from the RNN acoustic model.
- the attention decoder internally reduces each context vector to a single activation that indicates whether a sound event is present, the sound event (i.e., concept), and an attention weighting for the timestep.
- the attention decoder may use a softmax function to determine the attention weights.
- Each timestep the attention decoder outputs a probability distribution over all possible characters used in a caption.
- the RNN language model receives a stream of the single activations and determines the caption (i.e., label). When multiple valid captions exist for the sound event(s), the RNN language model selects one of the valid captions based on the probability that it is a valid sentence in the target language (e.g., English).
- FIG. 1B illustrates a block diagram of an audio captioning system 150 , in accordance with one embodiment.
- the audio captioning system 150 includes an RNN acoustic model 160 , a decoder 170 , and a RNN language model 180 .
- the audio captioning system 150 may be configured to perform the steps of the method 100 .
- the processor 155 may be a graphics processor or any processor capable of performing the necessary arithmetic operations of the method 100 .
- One or more of the RNN acoustic model 160 , the decoder 170 , and the RNN language model 180 can be implemented using any technically feasible techniques, including, without limitation, programming instructions executed by the processor 155 and/or circuitry configured to directly implement the operations.
- the RNN acoustic model 160 includes several layers of a neural network including layers configured to perform attention and time-adaptive computations.
- the output of the RNN acoustic model 160 is a probability distribution over characters along a sequence of timesteps.
- a first stack of layers of the neural network includes one or more two-dimensional (2D) convolutional layers that receive an input audio frame per timestep.
- a first dimension of the 2D convolutional layers is a number of samples within each audio frame and a second dimension is the number of frames in an audio clip.
- a second stack of layers of the neural network in the RNN acoustic model 160 is a stack of recurrent layers that follows the first stack of layers.
- the recurrent layers process one audio frame each timestep and include connections from one audio frame to the next audio frame in a sequence.
- the output of the recurrent layers resulting from processing the first audio frame during a first timestep is fed back into the recurrent layers as an input for processing the second audio frame during a second timestep.
- the output of the RNN acoustic model 160 is an activation vector for each timestep.
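- A minimal sketch of such an acoustic model, assuming PyTorch, is shown below; the layer counts, channel counts, kernel sizes, and the GRU cell choice are illustrative assumptions rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn

class RNNAcousticModel(nn.Module):
    """2D convolutional front end over (time, frequency) followed by a
    recurrent stack that emits one activation vector per timestep."""
    def __init__(self, n_freq_bins, n_chars, hidden=256):
        super().__init__()
        # First stack: 2D convolutions over samples-within-frame x frames.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(5, 5), padding=(2, 2)), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(5, 5), padding=(2, 2)), nn.ReLU(),
        )
        # Second stack: recurrent layers processing one frame per timestep.
        self.rnn = nn.GRU(32 * n_freq_bins, hidden, num_layers=3,
                          batch_first=True, bidirectional=True)
        # Per-timestep distribution over characters (+1 for a CTC blank).
        self.char_logits = nn.Linear(2 * hidden, n_chars + 1)

    def forward(self, spectrogram):             # (batch, time, n_freq_bins)
        x = spectrogram.unsqueeze(1)            # (batch, 1, time, freq)
        x = self.conv(x)                        # (batch, 32, time, freq)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.rnn(x)                      # (batch, time, 2*hidden)
        return self.char_logits(x)              # per-timestep character logits
```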
- the decoder 170 receives the probability distribution over the characters generated by the RNN acoustic model 160 .
- the decoder 170 is a connectionist temporal classification (CTC) decoder that receives the relevant features represented by the activation vector and constructs valid combinations of characters.
- the decoder 170 may be configured to remove repeated characters that are redundant.
- the decoder 170 performs a continuous associative lookup operation over each context vector generated by the RNN acoustic model 160 to reduce the context vector to a single activation for each timestep.
- the RNN language model 180 is a feed forward deep neural network with several layers, the final of which produces one output for each character in the target language and an end of sequence token.
- the end of sequence token indicates an end of each caption.
- the RNN language model 180 is trained with a cross entropy loss function to predict the next character in the reference label (i.e., caption).
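- A character-level next-character predictor of this kind might be sketched as follows, assuming PyTorch; the recurrent layer sizes and the training snippet in the trailing comment are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class CharLanguageModel(nn.Module):
    """Predicts the next character (or an end-of-sequence token) given the
    characters emitted so far. A recurrent sketch with illustrative sizes."""
    def __init__(self, n_chars, embed=64, hidden=256):
        super().__init__()
        self.eos = n_chars                        # extra index for end-of-sequence
        self.embed = nn.Embedding(n_chars + 1, embed)
        self.rnn = nn.GRU(embed, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_chars + 1)

    def forward(self, char_ids):                  # (batch, length) int64
        h, _ = self.rnn(self.embed(char_ids))
        return self.out(h)                        # logits for the next character

# Cross entropy training to predict the next reference character:
#   logits = model(caption[:, :-1])
#   loss = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)),
#                                caption[:, 1:].reshape(-1))
```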
- the audio captioning system 150 is trained end-to-end using supervised learning. In one embodiment, the audio captioning system 150 is trained using one or more of stochastic gradient descent optimization, a hybrid connectionist temporal classification (CTC) loss, and a cross entropy loss function.
- a large training dataset of “training samples” includes pairs of audio clips as the raw audio input and corresponding human annotated descriptions as the captions. For example, a training sample in the training dataset is the pair (“dog.mp3”, “a dog barks four times”).
- the training dataset should be as large as computationally feasible (e.g. thousands of hours or more).
- the training dataset is accurately labeled by humans, i.e., neither the audio nor the labels in the training dataset are synthetically generated.
- the training samples are drawn from the same distribution that the audio captioning system 150 is likely to encounter. Otherwise, if the audio captioning system 150 is not trained using samples of ocean waves, for example, the audio captioning system 150 will not be able to accurately describe recordings of waves crashing into rocks.
- a diverse and unbiased training dataset is generated by crawling the internet to find the complete set of available audio recordings, randomly sampling from the complete set until a large enough set of clips is found, and then having humans manually label each clip.
- the RNN acoustic model 160 , decoder 170 , and RNN language model 180 are randomly initialized according to a standard method (e.g., Xavier initialization) and audio clips from the training dataset are input to the audio captioning system 150 to generate a label (i.e., caption).
- the generated label is compared to the reference label (or set of labels) for the audio clip using one or more cost functions to generate a scalar loss.
- the cost function indicates the accuracy of the neural network that is being trained.
- An optimization algorithm updates the model parameters (e.g., weights) of the RNN acoustic model 160 and/or the RNN language model 180 to reduce the loss.
- the CTC decoder algorithm includes a beam search component, and includes a beam size parameter which determines the maximum number of labels to search for simultaneously.
- the number and size of each neural network layer should be chosen using a cross validation method. Additional training samples are applied until the RNN acoustic model 160 and/or the RNN language model 180 achieves a desired level of accuracy, or the model overfits on the dataset as determined via cross validation.
- a backpropagation algorithm is used to compute the gradient of the loss function with respect to parameters in the RNN acoustic model 160 and/or the RNN language model 180 .
- a recursive application of the chain rule is used to compute the gradient of the loss function with respect to parameters in the RNN acoustic model 160 and/or the RNN language model 180 .
- a suitable optimization algorithm, such as stochastic gradient descent, Nesterov's accelerated gradient method, adaptive estimates of lower-order moments (e.g., ADAM), etc., may be used together with gradients produced by the backpropagation algorithm to find suitable values for parameters of the RNN acoustic model 160 and/or the RNN language model 180 .
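- Putting these pieces together, a single training step combining backpropagation, a CTC loss, and stochastic gradient descent could look like the sketch below; it assumes PyTorch and reuses the hypothetical RNNAcousticModel sketched earlier, and the character set size and learning rate are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical setup reusing the RNNAcousticModel sketched above.
model = RNNAcousticModel(n_freq_bins=513, n_chars=28)
ctc_loss = nn.CTCLoss(blank=28)                  # blank index = n_chars
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9,
                            nesterov=True)       # or ADAM, RMSprop, etc.

def train_step(spectrograms, input_lengths, targets, target_lengths):
    """spectrograms: (batch, time, freq); targets: concatenated character ids."""
    logits = model(spectrograms)                 # (batch, time, n_chars + 1)
    log_probs = logits.log_softmax(-1).transpose(0, 1)   # CTCLoss expects (time, batch, C)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()                              # backpropagation computes the gradients
    optimizer.step()                             # gradient-based parameter update
    return loss.item()
```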
- a search over nondifferentiable parameters such as a learning rate, number of layers, etc. using cross validation is performed on a portion of the training dataset that is not used to train the audio captioning system 150 .
- the quality of the resulting audio captioning system 150 depends significantly on the size and the quality of the dataset that it is trained on.
- a training dataset may be improved using data augmentation, where an individual audio signal is mixed with noise, such as white noise. Noise samples may be drawn from white noise or from specific training data sources and mixed with the original audio clip to improve the robustness of the trained audio captioning system 150; for example, music or television clips may be mixed into the background of people talking. Conventional data augmentation mixes noise with an audio clip and pairs the mixed audio clip with the label of the original audio clip. In other words, the label is the same for the original audio clip and for the mixed audio clip.
- in contrast, when an additional labeled audio clip is mixed with the original audio clip, the corresponding caption is a combination of the caption for the original audio clip and a caption for the additional audio clip.
- the labels can automatically be combined into “a man says ‘good afternoon’ while rock music is playing in the background”.
- Data augmentation ensures the system is robust in terms of invariance. Invariance means that two or more audio clips may be combined with different relative timing, so that the resulting waveforms are different, and the audio captioning system 150 will generate the same correct label.
- an audio clip used for training may include a combination of a first non-speech sound and a second sound that overlaps at least partially in time with the first non-speech sound, where the second sound is one of a non-speech sound, a speech sound, and noise samples.
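- A minimal sketch of this style of augmentation, assuming NumPy: the random offset models the different relative timings mentioned above, and the caption template used to combine the two labels is illustrative.

```python
import numpy as np

def mix_clips(clip_a, caption_a, clip_b, caption_b, gain_b=0.5, rng=None):
    """Overlay clip_b onto clip_a at a random offset and combine the captions.
    Unlike conventional augmentation, the label describes both sounds."""
    rng = np.random.default_rng() if rng is None else rng
    mixed = clip_a.astype(np.float32).copy()
    # Random relative timing of the second sound within the first clip.
    offset = rng.integers(0, max(1, len(clip_a) - len(clip_b)))
    mixed[offset: offset + len(clip_b)] += gain_b * clip_b[: len(clip_a) - offset]
    combined_caption = f"{caption_a} while {caption_b}"   # e.g. "... while rock music is playing"
    return mixed, combined_caption
```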
- FIG. 2A illustrates another block diagram of an audio captioning system 200 , in accordance with one embodiment.
- the audio captioning system 200 may be used to implement the audio captioning system 150 shown in FIG. 1B .
- the audio captioning system 200 includes an audio sampling unit 205 , the RNN acoustic model 160 , a connectionist temporal classification (CTC) decoder 270 , a CTC cost unit 220 , the RNN language model 180 , and a cross entropy unit 225 .
- the audio sampling unit 205 is configured to receive the raw audio waveform for an audio clip, sample the raw audio waveform, and generate frames of audio samples. Each audio sample corresponds to an activation vector that is input to the RNN acoustic model 160 .
- the audio sampling unit 205 may be configured to implement one or more preprocessing operations on the audio samples before they are input to the RNN acoustic model 160 .
- the RNN acoustic model 160 includes a first stack of 2D convolutional layers 210 and a second stack of recurrent layers 215 .
- the output of the RNN acoustic model 160 is a probability distribution over all of the possible characters used in a caption.
- the output of the RNN acoustic model 160 is a direct representation of the output caption.
- the CTC decoder 270 removes redundant characters from the probability distribution representation.
- the output of the RNN acoustic model 160 is a sequence of concept vectors.
- Each element in the concept vector represents relevant features and corresponds to a character that may be included in the caption.
- the features represented in the concept vector are not determined a priori and are instead learned by the system during the end-to-end training process.
- the number of layers, the size of filters in a given layer of the 2D convolutional layers 210 , and the number of filters in a given layer are hyper-parameters that may be tuned using cross validation. In one embodiment, more layers, bigger filters, and more filters per layer improve performance given appropriate regularization or a large enough training dataset. In practice, the performance improvement should be balanced against computational limits, i.e. increasing the layer count, filter count, and/or filter size arbitrarily may result in an audio captioning system 150 or 200 that requires too much time to train. Consequently, in one embodiment, there is a maximum layer count, filter size, and filter count. In one embodiment, the maximum settings may be used to achieve the best accuracy, and when a tradeoff is required, cross validation is used to reduce the layer count, filter count, and/or filter size.
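- A minimal sketch of such a cross-validation search over these hyper-parameters is shown below; the candidate grid, the train_and_score helper, and the held-out split are all illustrative placeholders rather than values from the patent.

```python
import itertools

def cross_validate(train_set, heldout_set, train_and_score):
    """Select layer count, filter count, and filter size by training each
    candidate on train_set and scoring it on a held-out split.
    train_and_score is a hypothetical helper supplied by the caller."""
    layer_counts = [2, 3, 4]            # illustrative candidate values
    filter_counts = [16, 32, 64]
    filter_sizes = [3, 5, 7]
    best_score, best_config = float("-inf"), None
    for layers, filters, size in itertools.product(layer_counts,
                                                   filter_counts, filter_sizes):
        score = train_and_score(train_set, heldout_set,
                                layers=layers, filters=filters, filter_size=size)
        if score > best_score:
            best_score, best_config = score, (layers, filters, size)
    return best_config, best_score
```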
- the 2D convolutional layers 210 provide greater performance compared with fully connected or recurrent layers, and there is a natural interpretation of the 2D convolutions as implementations of finite impulse response (FIR) filters, with the parameters (e.g., weights) of the 2D convolutional layers 210 corresponding to FIR filter coefficients.
- the ability of the RNN acoustic model 160 to learn the parameters, allows the 2D convolutional layers 210 to perform an operation that is similar to a spectrogram, while having fine grained control over the frequency bands being measured. Therefore, the 2D convolutional layers 210 can be focused on specific frequency bands, e.g. 300 Hz-3 KHz for human speech, for specific musical instruments, etc.
- the recurrent layers 215 may be bidirectional recurrent layers.
- the number of layers and layer size within the recurrent layers 215 may follow the same guidelines as is used for the 2D convolutional layers 210 .
- increasing the layer count and/or layer size may be subject to computational limitations.
- a final layer of the recurrent layers 215 generates one element in the concept vector for each character in the target language specified for the captions.
- hierarchical connectivity is implemented in the recurrent layers 215 in addition to direct connections from one timestep to the next. Hierarchical connectivity means that the computation for timestep t may include inputs from timestep t−N for any choice of N, in addition to inputs from timestep t−1.
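- One way to realize such hierarchical connectivity is sketched below, assuming PyTorch; this simplified cell (not the patent's exact formulation) feeds the hidden state from timestep t−N back in alongside the state from t−1.

```python
import torch
import torch.nn as nn

class HierarchicalRNNCell(nn.Module):
    """GRU-style cell whose update at timestep t also reads the hidden state
    from N steps earlier (a simplified sketch of hierarchical connectivity)."""
    def __init__(self, input_size, hidden_size, skip=4):
        super().__init__()
        self.skip = skip
        # The cell input is the frame features concatenated with h[t - N].
        self.cell = nn.GRUCell(input_size + hidden_size, hidden_size)

    def forward(self, frames):                    # (batch, time, input_size)
        b, t, _ = frames.shape
        hidden = frames.new_zeros(b, self.cell.hidden_size)
        history, outputs = [], []
        for step in range(t):
            far_back = history[step - self.skip] if step >= self.skip \
                       else frames.new_zeros(b, self.cell.hidden_size)
            hidden = self.cell(torch.cat([frames[:, step], far_back], dim=-1), hidden)
            history.append(hidden)
            outputs.append(hidden)
        return torch.stack(outputs, dim=1)        # (batch, time, hidden_size)
```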
- one or more pooling layers are interleaved within the 2D convolution layers 210 and/or recurrent layers 215 .
- the one or more pooling layers are max pooling layers or other types of pooling such as mean pooling that combine the activations between layers over time.
- the concept vectors are processed by the CTC cost unit 220 to generate an accuracy value.
- the CTC cost unit 220 implements a CTC cost function to compute a loss according to the difference between the output of the audio captioning system 200 and all possible alignments of a list of possible labels.
- the CTC cost function is fully differentiable and uses a continuous optimization algorithm, such as stochastic gradient descent, thereby enabling computation of gradients with respect to all parameters in the audio captioning system 200 .
- the concept vectors are also passed through the CTC decoder 270 that selects the most likely character for each timestep and collapses timesteps that output the same character, resulting in a shorter sequence of characters.
- the CTC decoder 270 constructs valid combinations of characters to select the characters that are output.
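- The collapse-and-select behavior described above can be illustrated with a greedy (best-path) decode over a NumPy array of per-timestep scores; the full decoder additionally performs the beam search mentioned earlier, which is omitted from this sketch.

```python
def ctc_greedy_decode(logits, blank, id_to_char):
    """logits: NumPy array of shape (time, n_chars + 1) with per-timestep scores.
    Pick the most likely character per timestep, collapse runs of the same
    character, and drop CTC blanks to obtain a shorter character sequence."""
    best = logits.argmax(axis=-1)                 # most likely index per timestep
    chars, previous = [], None
    for idx in best.tolist():
        if idx != previous and idx != blank:      # collapse repeats, skip blanks
            chars.append(id_to_char[int(idx)])
        previous = idx
    return "".join(chars)
```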
- the sequence of characters is then passed through the RNN language model 180 , which predicts the next character in the reference label.
- the cross entropy unit 225 may be configured to compute a CTC gradient, and the cross entropy loss function can be used to compute a gradient with respect to the second parameters of the RNN language model 180 as well as the parameters of the RNN acoustic model 160 .
- a standard cross entropy cost function is used to compare the output of the RNN language model 180 against the reference label. If there are multiple valid labels, a combined loss function may be applied over all of the valid labels to measure the accuracy of the audio captioning system 200 . A combined loss function may also be applied when the CTC cost function is used to consider all possible alignments of characters for all possible valid labels.
- the cross entropy loss function may be optimized using standard techniques, such as batch/layer normalization, rectified linear activation functions, careful weight initialization (e.g., Glorot et al.), residual skip connections over individual layers, and advanced descent methods (e.g., Nesterov accelerated gradient, ADAM, RMSProp, etc.).
- FIG. 2B illustrates another flowchart of a method 230 for audio captioning, in accordance with one embodiment.
- although the method 230 is described in the context of the RNN acoustic model 160 , the decoder 270 , and the RNN language model 180 , the method 230 may also be performed by a program, custom circuitry, or by a combination of custom circuitry and a program.
- the method 230 may be executed by a GPU, CPU, or any processor capable of performing the necessary operations.
- persons of ordinary skill in the art will understand that any system that performs method 230 is within the scope and spirit of embodiments of the present invention.
- the steps 110 and 115 are performed as previously described in conjunction with FIG. 1A .
- the relevant features that are extracted from the raw audio waveform using the RNN acoustic model 160 are output as concept vectors.
- a probability distribution over all possible characters represented in a natural language used for a caption is computed, for each timestep of the raw audio waveform, based on the concept vectors (i.e., relevant features).
- the CTC decoder 270 receives the sets of characters and probability distributions and constructs valid sequences of the characters. The valid sequences of characters are each associated with a probability value and a number of the combinations having the highest probability values are output by the CTC decoder 270 to the RNN language model 180 .
- the CTC decoder 270 removes repeated characters that are redundant to produce the valid sequences of characters.
- any mistakes in the valid sequence of characters are corrected using the RNN language model 180 and the caption is generated. For example, the phonetically plausible spelling mistake “Rock musac is playing.” may be corrected to “Rock music is playing.” by the language model.
- the caption may include one or more non-speech and/or speech events.
- Attention addresses the problems related to per-sample generation of concept vectors by enabling the audio captioning system 150 to quickly scan the output from the RNN acoustic model 160 for each timestep to identify timesteps that are relevant to the next output character.
- attention is implemented in a standard form of content-based attention within the decoder 170 and an encoder component of the RNN acoustic model 160 .
- FIG. 3A illustrates another block diagram of an audio captioning system 300 , in accordance with one embodiment.
- the audio captioning system 300 may be used to implement the audio captioning system 150 shown in FIG. 1B .
- the audio captioning system 300 includes the audio sampling unit 205 , the RNN acoustic model 160 , an attention decoder 370 , the CTC cost unit 220 , the RNN language model 180 , and a cross entropy unit 225 .
- the audio sampling unit 205 receives the raw audio waveform for an audio clip, samples the raw audio waveform, and generates frames of audio samples. Each audio sample corresponds to an activation vector that is input to the RNN acoustic model 160 .
- the RNN acoustic model 160 operates as an encoder 360 .
- the RNN acoustic model 160 processes the activation vectors for the audio clip to identify one or more concepts present in the audio clip. Each concept corresponds to a separate caption. For each concept, the RNN acoustic model 160 produces a variable sized vector of activations referred to as a context vector. Multiple concept vectors may be used to describe a single caption.
- the context vector identifies whether the concept associated with the context vector is present for each timestep. In one embodiment, the context vector tags each timestep of the audio clip to indicate whether the concept is present during the timestep.
- the attention decoder 370 performs a continuous associative lookup operation over each context vector to reduce the context vector into a single activation.
- the attention decoder 370 performs the continuous associative lookup operation for all timesteps within a window including multiple timesteps.
- the single activation for the current timestep and the character that is generated by the RNN language model 180 for the previous timestep are processed by the attention decoder 370 .
- the character for the previous timestep is used by the attention decoder 370 to determine when the end of a concept is reached.
- the attention decoder 370 processes the single activations resulting from the context vectors to generate an attention weight value for each timestep where a concept is present, until an end of sequence token is produced indicating the end of a caption has been reached.
- the attention weight values for a concept are then used by the attention decoder 370 to determine a sequence of characters associated with a caption describing the concept.
- the sequence y represents the characters in the output caption.
- Each output y_i is generated by focusing and reading data from only the relevant elements of h.
- the attention decoder 370 is implemented by a deep RNN.
- the deep RNN is implemented with a standard architecture such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU).
- α_i is the list of attention weights generated by the attention decoder 370 . Each value indicates a magnitude of the contribution from each input value (the single activations generated by the attention decoder 370 ).
- the operation performed by the attention decoder 370 is implemented by scoring each element in h independently and then converting the resulting output values e_{i,j} into a probability distribution α_{i,j}, usually with a softmax operation.
- the score operation may be implemented by a deep feed forward neural network or CNN.
- the RNN language model 180 receives sequences of characters corresponding to each concept (i.e., sound event) from the attention decoder 370 and determines the caption (i.e., label) for each concept. When multiple valid captions exist, the RNN language model 180 selects one of the valid captions based on a probability that it is a valid sentence in the target language (e.g., English). In one embodiment, the RNN language model 180 is a feed forward deep neural network with several layers, the final of which produces one output for each character of the caption in the target language and an end of sequence token.
- FIG. 3B is a conceptual diagram illustrating context vectors generated from audio clip frames, in accordance with one embodiment.
- An audio clip 320 is segmented into audio frames 330 , 331 , 332 , 333 , 334 , and 335 .
- Each audio frame is associated with a timestep.
- a concept A is present for audio frames 330 , 331 , 332 , 333 , 334 , and 335 .
- a concept B is present for audio frames 331 , 332 , 333 , and 334 .
- the RNN acoustic model 160 generates a context vector A corresponding to concept A after audio frame 331 is processed and generates a context vector B corresponding to concept B at the end of concept B.
- the context vector A indicates that the concept A is present in audio frames 330 , 331 , 332 , 333 , 334 , and 335 .
- the context vector B indicates that the concept B is not present in audio frames 330 and 335 and is present in audio frames 331 , 332 , 333 , and 334 .
- the RNN acoustic model 160 may generate a context vector for a corresponding concept at any point within an audio frame in which the concept is present or when the end of the concept is reached.
- FIG. 3C is a conceptual diagram illustrating captions generated from concepts and attention weights, in accordance with one embodiment.
- the attention decoder 370 receives the context vector A and generates the attention weights 341 , 342 , 343 , 344 , 346 , and 347 .
- the attention decoder 370 receives the context vector B and generates the attention weights 351 , 352 , 353 , and 354 .
- Example attention weight values for audio frames 330 , 331 , 332 , 333 , 334 , and 335 for concept A are 0.0, 0.7, 0.5, 0.2, 0.1, and 0.0, respectively, and for concept B are 0.0, 0.01, 0.75, 0.2, 0.0, 0.0, respectively.
- the attention decoder 370 produces the characters describing concept A and concept B, based on the respective weights.
- FIG. 3D illustrates a flowchart of a method 375 for audio captioning, in accordance with one embodiment.
- although the method 375 is described in the context of the RNN acoustic model 160 , an attention decoder 370 , and an RNN language model 180 , the method 375 may also be performed by a program, custom circuitry, or by a combination of custom circuitry and a program.
- the method 375 may be executed by a GPU, CPU, or any processor capable of performing the necessary operations.
- persons of ordinary skill in the art will understand that any system that performs method 375 is within the scope and spirit of embodiments of the present invention.
- Steps 110 and 115 are performed as previously described in conjunction with FIGS. 1A and 2B .
- the RNN acoustic model 160 generates a context vector that includes concept tags for timesteps of the raw audio waveform based on the relevant features.
- the attention decoder 370 computes per-timestep attention weights for each concept.
- the attention decoder 370 generates sequences of characters for a caption represented in the natural language for each concept based on the per-timestep attention weights for the concept.
- the RNN language model 180 corrects mistakes in the sequences of characters and generates the caption.
- FIG. 4 illustrates an exemplary system 400 in which the various architecture and/or functionality of the various previous embodiments may be implemented.
- the exemplary system 400 may be used to implement the audio captioning system 150 , 200 , and 300 for automatically generating audio captions for both speech and non-speech events.
- the system 400 includes at least one central processor 401 that is connected to a communication bus 402 .
- the communication bus 402 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s).
- the system 400 also includes a main memory 404 .
- Control logic (software) and data are stored in the main memory 404 which may take the form of random access memory (RAM).
- one or more training datasets are stored in the main memory 404 .
- the system 400 also includes input devices 412 , a graphics processor 406 , and a display 408 , i.e. a conventional CRT (cathode ray tube), LCD (liquid crystal display), LED (light emitting diode), plasma display or the like.
- User input may be received from the input devices 412 , e.g., keyboard, mouse, touchpad, microphone, and the like.
- the graphics processor 406 may include a plurality of shader modules, a rasterization module, etc. Each of the foregoing modules may even be situated on a single semiconductor platform to form a graphics processing unit (GPU).
- a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
- the system 400 may also include a secondary storage 410 .
- the secondary storage 410 includes, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, digital versatile disk (DVD) drive, recording device, universal serial bus (USB) flash memory.
- the removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.
- Computer programs, or computer control logic algorithms may be stored in the main memory 404 and/or the secondary storage 410 . Such computer programs, when executed, enable the system 400 to perform various functions.
- the memory 404 , the storage 410 , and/or any other storage are possible examples of computer-readable media.
- Data streams associated with audio clips and captions may be stored in the main memory 404 and/or the secondary storage 410 .
- the architecture and/or functionality of the various previous figures may be implemented in the context of the central processor 401 , the graphics processor 406 , an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the central processor 401 and the graphics processor 406 , a chipset (i.e., a group of integrated circuits designed to work and sold as a unit for performing related functions, etc.), and/or any other integrated circuit for that matter.
- the architecture and/or functionality of the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system, and/or any other desired system.
- the system 400 may take the form of a desktop computer, laptop computer, server, workstation, game console, embedded system, and/or any other type of logic.
- the system 400 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a television, etc.
- system 400 may be coupled to a network (e.g., a telecommunications network, local area network (LAN), wireless network, wide area network (WAN) such as the Internet, peer-to-peer network, cable network, or the like) for communication purposes.
Abstract
Description
α_i = Attention(s_{i−1}, h)
g_i = Σ_{j=0}^{L} α_{i,j} · h_j
y_i = Decoder(y_{i−1}, g_i)
where s_{i−1} is the (i−1)-th hidden state of the attention decoder RNN, h is the sequence of encoder outputs (the context vectors), α_{i,j} are the attention weights, g_i is the weighted sum read from h, and y_i is the i-th output character.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/691,546 US10679643B2 (en) | 2016-08-31 | 2017-08-30 | Automatic audio captioning |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662382197P | 2016-08-31 | 2016-08-31 | |
US15/691,546 US10679643B2 (en) | 2016-08-31 | 2017-08-30 | Automatic audio captioning |
Publications (2)
Publication Number | Publication Date |
---|---|
US20180061439A1 (en) | 2018-03-01 |
US10679643B2 (en) | 2020-06-09 |
Family
ID=61240650
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/691,546 Expired - Fee Related US10679643B2 (en) | 2016-08-31 | 2017-08-30 | Automatic audio captioning |
Country Status (1)
Country | Link |
---|---|
US (1) | US10679643B2 (en) |
Families Citing this family (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590153B (en) * | 2016-07-08 | 2021-04-27 | 微软技术许可有限责任公司 | Conversational relevance modeling using convolutional neural networks |
US10372814B2 (en) * | 2016-10-18 | 2019-08-06 | International Business Machines Corporation | Methods and system for fast, adaptive correction of misspells |
US10579729B2 (en) | 2016-10-18 | 2020-03-03 | International Business Machines Corporation | Methods and system for fast, adaptive correction of misspells |
US10546575B2 (en) | 2016-12-14 | 2020-01-28 | International Business Machines Corporation | Using recurrent neural network for partitioning of audio data into segments that each correspond to a speech feature cluster identifier |
US10592751B2 (en) * | 2017-02-03 | 2020-03-17 | Fuji Xerox Co., Ltd. | Method and system to generate targeted captions and summarize long, continuous media files |
US10373610B2 (en) * | 2017-02-24 | 2019-08-06 | Baidu Usa Llc | Systems and methods for automatic unit selection and target decomposition for sequence labelling |
US10440180B1 (en) * | 2017-02-27 | 2019-10-08 | United Services Automobile Association (Usaa) | Learning based metric determination for service sessions |
US10540961B2 (en) * | 2017-03-13 | 2020-01-21 | Baidu Usa Llc | Convolutional recurrent neural networks for small-footprint keyword spotting |
US11056104B2 (en) * | 2017-05-26 | 2021-07-06 | International Business Machines Corporation | Closed captioning through language detection |
US10714076B2 (en) * | 2017-07-10 | 2020-07-14 | Sony Interactive Entertainment Inc. | Initialization of CTC speech recognition with standard HMM |
CN107680587A (en) * | 2017-09-29 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | Acoustic training model method and apparatus |
US11556775B2 (en) | 2017-10-24 | 2023-01-17 | Baidu Usa Llc | Systems and methods for trace norm regularization and faster inference for embedded models |
US20190130896A1 (en) * | 2017-10-26 | 2019-05-02 | Salesforce.Com, Inc. | Regularization Techniques for End-To-End Speech Recognition |
US11508355B1 (en) * | 2017-10-27 | 2022-11-22 | Interactions Llc | Extracting natural language semantics from speech without the use of speech recognition |
WO2019126881A1 (en) * | 2017-12-29 | 2019-07-04 | Fluent.Ai Inc. | System and method for tone recognition in spoken languages |
JP6773707B2 (en) * | 2018-03-20 | 2020-10-21 | 株式会社東芝 | Signal processing equipment, signal processing methods and programs |
CN109036465B (en) * | 2018-06-28 | 2021-05-11 | 南京邮电大学 | Speech emotion recognition method |
US10380997B1 (en) | 2018-07-27 | 2019-08-13 | Deepgram, Inc. | Deep learning internal state index-based search and classification |
JP6882814B2 (en) * | 2018-09-13 | 2021-06-02 | LiLz株式会社 | Sound analyzer and its processing method, program |
US11475887B2 (en) | 2018-10-29 | 2022-10-18 | Spotify Ab | Systems and methods for aligning lyrics using a neural network |
US11308943B2 (en) * | 2018-10-29 | 2022-04-19 | Spotify Ab | Systems and methods for aligning lyrics using a neural network |
WO2020113031A1 (en) * | 2018-11-28 | 2020-06-04 | Google Llc | Training and/or using a language selection model for automatically determining language for speech recognition of spoken utterance |
JP7059166B2 (en) * | 2018-11-29 | 2022-04-25 | 株式会社東芝 | Information processing equipment, information processing methods and programs |
US11342002B1 (en) * | 2018-12-05 | 2022-05-24 | Amazon Technologies, Inc. | Caption timestamp predictor |
JP7267034B2 (en) * | 2019-02-27 | 2023-05-01 | 本田技研工業株式会社 | CAPTION GENERATION DEVICE, CAPTION GENERATION METHOD, AND PROGRAM |
EP3963580A1 (en) * | 2019-05-02 | 2022-03-09 | Google LLC | Automatically captioning audible parts of content on a computing device |
CN110189748B (en) * | 2019-05-31 | 2021-06-11 | 百度在线网络技术(北京)有限公司 | Model construction method and device |
US11183201B2 (en) * | 2019-06-10 | 2021-11-23 | John Alexander Angland | System and method for transferring a voice from one body of recordings to other recordings |
US11355134B2 (en) * | 2019-08-02 | 2022-06-07 | Audioshake, Inc. | Deep learning segmentation of audio using magnitude spectrogram |
CN110600059B (en) * | 2019-09-05 | 2022-03-15 | Oppo广东移动通信有限公司 | Acoustic event detection method and device, electronic equipment and storage medium |
EP3792915A1 (en) * | 2019-09-12 | 2021-03-17 | Spotify AB | Systems and methods for aligning lyrics using a neural network |
CN110767218A (en) * | 2019-10-31 | 2020-02-07 | 南京励智心理大数据产业研究院有限公司 | End-to-end speech recognition method, system, device and storage medium thereof |
JP7438744B2 (en) * | 2019-12-18 | 2024-02-27 | 株式会社東芝 | Information processing device, information processing method, and program |
CN111259188B (en) * | 2020-01-19 | 2023-07-25 | 成都潜在人工智能科技有限公司 | Lyric alignment method and system based on seq2seq network |
US11830473B2 (en) * | 2020-01-21 | 2023-11-28 | Samsung Electronics Co., Ltd. | Expressive text-to-speech system and method |
WO2022198474A1 (en) | 2021-03-24 | 2022-09-29 | Sas Institute Inc. | Speech-to-analytics framework with support for large n-gram corpora |
US11335350B2 (en) * | 2020-03-18 | 2022-05-17 | Sas Institute Inc. | Dual use of audio noise level in speech-to-text framework |
US11145309B1 (en) * | 2020-03-18 | 2021-10-12 | Sas Institute Inc. | Dynamic model selection in speech-to-text processing |
US11138979B1 (en) | 2020-03-18 | 2021-10-05 | Sas Institute Inc. | Speech audio pre-processing segmentation |
JP7537493B2 (en) * | 2020-05-11 | 2024-08-21 | 日本電信電話株式会社 | ENVIRONMENT ESTIMATION METHOD, ENVIRONMENT ESTIMATION DEVICE, AND PROGRAM |
US11527238B2 (en) * | 2020-10-30 | 2022-12-13 | Microsoft Technology Licensing, Llc | Internal language model for E2E models |
US20220156577A1 (en) * | 2020-11-13 | 2022-05-19 | Sony Group Corporation | Training neural network model based on data point selection |
CN113379875B (en) * | 2021-03-22 | 2023-09-29 | 平安科技(深圳)有限公司 | Cartoon character animation generation method, device, equipment and storage medium |
CN113066498B (en) * | 2021-03-23 | 2022-12-30 | 上海掌门科技有限公司 | Information processing method, apparatus and medium |
CN117594060A (en) * | 2023-10-31 | 2024-02-23 | 北京邮电大学 | Audio signal content analysis method, device, equipment and storage medium |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9378733B1 (en) * | 2012-12-19 | 2016-06-28 | Google Inc. | Keyword detection without decoding |
US20170004824A1 (en) * | 2015-06-30 | 2017-01-05 | Samsung Electronics Co., Ltd. | Speech recognition apparatus, speech recognition method, and electronic device |
US20170278525A1 (en) * | 2016-03-24 | 2017-09-28 | Google Inc. | Automatic smoothed captioning of non-speech sounds from audio |
Non-Patent Citations (6)
Title |
---|
Amodei et al., "Deep speech 2: End-to-end speech recognition in english and mandarin," International Conference on Machine Learning, 2016, pp. 1-10. |
Bahdanau, Dzmitry, et al., "Neural Machine Translation by Jointly Learning to Align and Translate," ICLR Conference 2015, arXiv:1409.0473v7, May 19, 2016, pp. 1-15. |
Glorot et al., "Understanding the difficulty of training deep feedforward neural networks," Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, vol. 9 of JMLR: W&CP 9, 2010, pp. 249-256. |
Graves, A., "Adaptive Computation Time for Recurrent Neural Networks," Cornell University Library, 2017, pp. 1-19. |
Graves, A., "Supervised Sequence Labelling with Recurrent Neural Networks," Thesis, Technische Universitat Munchen, 2012, pp. 1-124. |
Johnson et al., "DenseCap: Fully convolutional localization networks for dense captioning," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4565-4574. |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11366954B2 (en) * | 2017-02-16 | 2022-06-21 | Hitachi, Ltd. | Text preparation apparatus |
US20230215427A1 (en) * | 2022-01-05 | 2023-07-06 | International Business Machines Corporation | Automated domain-specific constrained decoding from speech inputs to structured resources |
US12094459B2 (en) * | 2022-01-05 | 2024-09-17 | International Business Machines Corporation | Automated domain-specific constrained decoding from speech inputs to structured resources |
Also Published As
Publication number | Publication date |
---|---|
US20180061439A1 (en) | 2018-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10679643B2 (en) | Automatic audio captioning | |
US10176811B2 (en) | Neural network-based voiceprint information extraction method and apparatus | |
EP3477633A1 (en) | Systems and methods for robust speech recognition using generative adversarial networks | |
EP2695160B1 (en) | Speech syllable/vowel/phone boundary detection using auditory attention cues | |
JP6189970B2 (en) | Combination of auditory attention cue and phoneme posterior probability score for sound / vowel / syllable boundary detection | |
US8676574B2 (en) | Method for tone/intonation recognition using auditory attention cues | |
US8886533B2 (en) | System and method for combining frame and segment level processing, via temporal pooling, for phonetic classification | |
Korvel et al. | Analysis of 2d feature spaces for deep learning-based speech recognition | |
WO2019204547A1 (en) | Systems and methods for automatic speech recognition using domain adaptation techniques | |
JP7266683B2 (en) | Information verification method, apparatus, device, computer storage medium, and computer program based on voice interaction | |
KR20220130565A (en) | Keyword detection method and apparatus thereof | |
EP3910625A2 (en) | Method and apparatus for utterance time estimation | |
El‐Bialy et al. | Developing phoneme‐based lip‐reading sentences system for silent speech recognition | |
CN111653270B (en) | Voice processing method and device, computer readable storage medium and electronic equipment | |
Smaragdis et al. | The Markov selection model for concurrent speech recognition | |
EP4457694A1 (en) | Generating data items using off-the-shelf guided generative diffusion processes | |
US12094453B2 (en) | Fast emit low-latency streaming ASR with sequence-level emission regularization utilizing forward and backward probabilities between nodes of an alignment lattice | |
JP5091202B2 (en) | Identification method that can identify any language without using samples | |
Sung et al. | Speech Recognition via CTC-CNN Model. | |
CN116434736A (en) | Voice recognition method, interaction method, system and equipment | |
CN112951270B (en) | Voice fluency detection method and device and electronic equipment | |
Shah et al. | Signal Quality Assessment for Speech Recognition using Deep Convolutional Neural Networks | |
Ahmed et al. | Efficient feature extraction and classification for the development of Pashto speech recognition system | |
Kastner et al. | R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS | |
CN118366433A (en) | Training method of fake voice detection model, fake voice detection method and equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| FEPP | Fee payment procedure | ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY
| FEPP | Fee payment procedure | ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY
| STPP | Information on status: patent application and granting procedure in general | DOCKETED NEW CASE - READY FOR EXAMINATION
| STPP | Information on status: patent application and granting procedure in general | NON FINAL ACTION MAILED
| STPP | Information on status: patent application and granting procedure in general | RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER
| STPP | Information on status: patent application and granting procedure in general | NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS
| ZAAA | Notice of allowance and fees due | ORIGINAL CODE: NOA
| ZAAB | Notice of allowance mailed | ORIGINAL CODE: MN/=.
| STPP | Information on status: patent application and granting procedure in general | PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED
| STCF | Information on status: patent grant | PATENTED CASE
| FEPP | Fee payment procedure | MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY
| LAPS | Lapse for failure to pay maintenance fees | PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY
| STCH | Information on status: patent discontinuation | PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362
2024-06-09 | FP | Lapsed due to failure to pay maintenance fee | Effective date: 20240609