CN108346427A - Voice recognition method, device, equipment and storage medium - Google Patents
- Publication number
- CN108346427A (application CN201810113879.9A)
- Authority
- CN
- China
- Prior art keywords
- signal
- voice
- recognition result
- lip
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Abstract
The invention discloses a voice recognition method, device, equipment and storage medium. The method comprises the following steps: when a sounding event is triggered, receiving a voice signal and an image signal containing lips, sent by a microphone and collected from the user in the process of executing the sounding event; carrying out feature extraction on the voice signal to generate a speech feature signal, and carrying out feature extraction on the image signal containing lips to generate a lip-language feature signal; and sending the speech feature signal and the lip-language feature signal to a server to instruct the server to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result, and to perform matching analysis on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result. If the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, a recognition feedback result is generated according to the speech recognition result and sent to the terminal. The embodiment of the invention improves the speech recognition rate.
Description
Technical field
Embodiments of the present invention relate to speech recognition technology, and in particular to a voice recognition method, device, equipment and storage medium.
Background technology
With the arrival of the electronic information era, mobile devices have become increasingly common, as have peripherals for mobile terminals, such as children's tablet computers and microphones. The functions achievable with such equipment are also increasingly rich: for example, a microphone can be connected to a mobile terminal, and language learning or singing can be carried out according to the content displayed on the terminal. In this process, the microphone records the user's voice in real time and uploads it to the mobile terminal, where the corresponding speech recognition is performed to obtain a speech recognition result, and an evaluation of the language learning or singing is then given according to that result.
The most critical factor in the above process is the accuracy of the speech recognition result, and relying on speech recognition technology alone may not improve the speech recognition rate any further.
Invention content
The present invention provides a voice recognition method, device, equipment and storage medium, so as to improve the speech recognition rate.
In a first aspect, an embodiment of the present invention provides a voice recognition method, including:
when a sounding event is triggered, receiving a voice signal and an image signal containing lips, sent by a microphone and collected from the user while the sounding event is performed;
performing feature extraction on the voice signal to generate a speech feature signal, and performing feature extraction on the image signal containing lips to generate a lip-language feature signal; and
sending the speech feature signal and the lip-language feature signal to a server, so as to instruct the server to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result, and to perform matching analysis on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; if the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, the server generates a recognition feedback result according to the speech recognition result and sends the recognition feedback result to the terminal.
Further, performing feature extraction on the voice signal to generate the speech feature signal includes:
extracting speech feature parameters from the voice signal;
performing a dimensionality-reduction transform on the speech feature parameters to obtain a speech feature signal to be processed; and
performing enhancement processing on the speech feature signal to be processed according to a speech enhancement algorithm to obtain the speech feature signal, the speech enhancement algorithm including a cepstral mean subtraction algorithm.
Performing feature extraction on the image signal containing lips to generate the lip-language feature signal includes:
performing feature extraction on the image signal containing lips according to a lip feature extraction algorithm to obtain a lip image signal, the lip feature extraction algorithm including at least one of a template-based feature extraction algorithm or an image-pixel-based feature extraction algorithm; and
performing mouth-shape contour feature extraction on the lip image signal according to a mouth-shape contour feature extraction algorithm to obtain the lip-language feature signal, the mouth-shape contour feature extraction algorithm including at least one of a deformable template algorithm or the Snakes algorithm.
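The extraction, reduction and enhancement chain on the speech side can be sketched roughly as follows. This is an illustrative reconstruction, not code from the patent: the framing, the simplified real-cepstrum features and the PCA projection, along with every function name and parameter value, are assumptions made for the example; cepstral mean subtraction is the one enhancement step actually named in the claim.

```python
import numpy as np

def cepstral_features(signal, frame_len=256, n_coeffs=12):
    """Frame the signal and compute real cepstra per frame, a simplified
    stand-in for LPCC/MFCC speech-feature-parameter extraction."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectrum = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    return np.fft.irfft(np.log(spectrum + 1e-10), axis=1)[:, :n_coeffs]

def pca_reduce(features, n_dims=6):
    """Dimensionality-reduction transform: project the feature parameters
    onto their top principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return features @ vt[:n_dims].T

def cepstral_mean_subtraction(features):
    """The claimed enhancement step: subtracting the per-dimension mean
    suppresses stationary channel effects."""
    return features - features.mean(axis=0)

rng = np.random.default_rng(0)
voice = rng.standard_normal(4096)          # stand-in for a recorded voice signal
params = cepstral_features(voice)          # speech feature parameters
reduced = pca_reduce(params)               # speech feature signal to be processed
speech_feature_signal = cepstral_mean_subtraction(reduced)
```

Each function corresponds to one of the three claimed sub-steps; a real implementation would substitute proper LPCC or MFCC extraction for the simplified cepstra.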
In a second aspect, an embodiment of the present invention further provides a voice recognition method, including:
receiving a speech feature signal and a lip-language feature signal sent by a terminal;
performing matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result;
performing matching analysis on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; and
if the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generating a recognition feedback result according to the speech recognition result and sending the recognition feedback result to the terminal, so as to instruct the terminal to generate an evaluation result of the sounding event according to the recognition feedback result.
Further, before receiving the speech feature signal and the lip-language feature signal sent by the terminal, the method further includes:
establishing a pre-trained speech recognition model and a pre-trained lip-reading recognition model.
Correspondingly, performing matching analysis on the speech feature signal against the preset voice signal to generate the speech recognition result includes:
calling, according to the speech feature signal, the preset voice signal in the pre-trained speech recognition model to perform matching analysis and generate the speech recognition result;
and performing matching analysis on the lip-language feature signal against the preset lip-language signal to generate the lip-reading recognition result includes:
calling, according to the lip-language feature signal, the preset lip-language signal in the pre-trained lip-reading recognition model to perform matching analysis and generate the lip-reading recognition result.
Further, establishing the pre-trained lip-reading recognition model includes:
obtaining, from a standard database, a set number of groups of training data generated by users while performing sounding events, the training data including image signals containing lips and the corresponding letter signals; and
performing feature extraction on the image signals containing lips to generate first lip-language feature signals, and, taking the first lip-language feature signals as input variables and the corresponding letter signals as output variables, training a preset mathematical model according to a machine learning algorithm using the first lip-language feature signals and the corresponding letter signals to generate the pre-trained lip-reading recognition model.
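The patent does not specify which machine learning algorithm or mathematical model is used, so the training step can only be sketched under assumptions. In the hypothetical example below, a simple nearest-centroid classifier stands in for the preset mathematical model, and synthetic two-dimensional vectors stand in for the lip-language feature signals and letter labels from the standard database:

```python
import numpy as np

class NearestCentroidLipModel:
    """Minimal stand-in for the trained lip-reading model: stores one
    centroid per letter, learned from (lip feature, letter) training pairs."""
    def fit(self, features, letters):
        self.labels = sorted(set(letters))
        self.centroids = np.stack(
            [features[[l == label for l in letters]].mean(axis=0)
             for label in self.labels])
        return self

    def predict(self, feature):
        # Matching analysis: pick the letter whose centroid is nearest.
        distances = np.linalg.norm(self.centroids - feature, axis=1)
        return self.labels[int(np.argmin(distances))]

# Synthetic stand-in for the standard-database training data:
# letter "a" clusters near (1, 0), letter "b" near (0, 1).
rng = np.random.default_rng(1)
letters = ["a", "b"] * 20
features = np.stack(
    [(np.array([1.0, 0.0]) if l == "a" else np.array([0.0, 1.0]))
     + 0.1 * rng.standard_normal(2)
     for l in letters])

model = NearestCentroidLipModel().fit(features, letters)
```

A production system would replace both the toy features and the classifier with the feature extraction and machine learning algorithm chosen in practice; the input/output structure (lip features in, letter signals out) is what the claim fixes.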
Further, the method also includes:
if the similarity between the speech recognition result and the lip-reading recognition result is less than the similarity threshold, obtaining current context information; and
adjusting the speech recognition result and the lip-reading recognition result according to the current context information, until the similarity between the adjusted speech recognition result and lip-reading recognition result is greater than or equal to the similarity threshold.
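A minimal sketch of this fallback, under two assumptions the patent does not make explicit: that the current context information can be approximated by a list of expected words, and that both recognizers can supply ranked candidate hypotheses to adjust between:

```python
from difflib import SequenceMatcher

def adjust_with_context(speech_candidates, lip_candidates, context_words,
                        threshold=0.9):
    """Re-rank candidate hypotheses from both recognizers by how many
    context words they contain, then return the first pair whose mutual
    similarity reaches the threshold, or None if no pair qualifies."""
    def context_score(text):
        return sum(word in text for word in context_words)
    for speech in sorted(speech_candidates, key=context_score, reverse=True):
        for lip in sorted(lip_candidates, key=context_score, reverse=True):
            if SequenceMatcher(None, speech, lip).ratio() >= threshold:
                return speech, lip
    return None

adjusted = adjust_with_context(["sin a son", "sing a song"],
                               ["ring along", "sing a song"],
                               ["sing", "song"])
```

Here the context words steer both recognizers toward the hypothesis pair that agrees, which is the effect the claim describes; the scoring and similarity functions are illustrative choices only.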
In a third aspect, an embodiment of the present invention further provides a speech recognition device, including:
a voice and image signal acquisition module, configured to, when a sounding event is triggered, receive a voice signal and an image signal containing lips, sent by a microphone and collected from the user while the sounding event is performed;
a feature signal generation module, configured to perform feature extraction on the voice signal to generate a speech feature signal, and to perform feature extraction on the image signal containing lips to generate a lip-language feature signal; and
a feature signal sending module, configured to send the speech feature signal and the lip-language feature signal to a server, so as to instruct the server to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result and on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result, and, if the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, to generate a recognition feedback result according to the speech recognition result and send the recognition feedback result to the terminal.
In a fourth aspect, an embodiment of the present invention further provides a speech recognition device, including:
a feature signal receiving module, configured to receive a speech feature signal and a lip-language feature signal sent by a terminal;
a speech recognition result generation module, configured to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result;
a lip-reading recognition result generation module, configured to perform matching analysis on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; and
a recognition feedback result sending module, configured to, if the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generate a recognition feedback result according to the speech recognition result and send the recognition feedback result to the terminal, so as to instruct the terminal to generate an evaluation result of the sounding event according to the recognition feedback result.
In a fifth aspect, an embodiment of the present invention further provides an equipment, including:
one or more processors; and
a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the voice recognition method as described above.
In a sixth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, the program implementing the voice recognition method as described above when executed by a processor.
In the present invention, when a sounding event is triggered, the voice signal and the image signal containing lips collected from the user while performing the sounding event are received from the microphone; feature extraction is performed on the voice signal to generate a speech feature signal and on the image signal containing lips to generate a lip-language feature signal; the speech feature signal and the lip-language feature signal are sent to the server, which is instructed to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result and on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; and, if the similarity between the two results is greater than or equal to a similarity threshold, a recognition feedback result is generated according to the speech recognition result and sent to the terminal. This solves the prior-art problem that speech recognition performed by speech recognition technology alone yields a low speech recognition rate, and thereby improves the speech recognition rate.
Description of the drawings
Fig. 1 is a flow chart of a voice recognition method in Embodiment 1 of the present invention;
Fig. 2 is a flow chart of a voice recognition method in Embodiment 2 of the present invention;
Fig. 3 is a flow chart of a voice recognition method in Embodiment 3 of the present invention;
Fig. 4 is a structural schematic diagram of a speech recognition device in Embodiment 4 of the present invention;
Fig. 5 is a structural schematic diagram of a speech recognition device in Embodiment 5 of the present invention;
Fig. 6 is a structural schematic diagram of an equipment in Embodiment 6 of the present invention.
Specific embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention and not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment 1
Fig. 1 is a flow chart of a voice recognition method provided by Embodiment 1 of the present invention. This embodiment is applicable to improving the speech recognition rate. The method may be executed by a speech recognition device, which may be implemented in software and/or hardware and configured in a terminal, typically a mobile phone or a tablet computer. As shown in Fig. 1, the method specifically includes the following steps:
S110: When a sounding event is triggered, receive a voice signal and an image signal containing lips, sent by the microphone and collected from the user while the sounding event is performed.
In a specific embodiment of the present invention, a sounding event may denote sound generated during a learning or entertainment activity carried out using functions of the terminal, such as text reading practice or singing. Usually, an evaluation result is obtained in the course of such an activity; this evaluation result can indicate how well the current activity is going, which requires that the voice signal be collected, features extracted, and matching recognition performed during the process. Typically, a microphone is connected to the terminal, an external microphone collects the voice signal, and the collected voice signal is uploaded to the terminal, where it is further analyzed and processed to obtain a speech recognition result. Understandably, to make the evaluation result more accurate, the key is to ensure the accuracy of the speech recognition result, and that accuracy is strongly affected by the environment: for example, if the user's surroundings are noisy while the sounding event is performed, the final speech recognition result will clearly be affected to some degree. On this basis, to further improve the accuracy of the speech recognition result, lip-reading recognition technology can be combined with speech recognition technology. Lip-reading recognition is a technology that integrates machine vision with natural language processing and can interpret speech content directly from images of a person talking. Since lip reading is not affected by the acoustic environment or noise, it can greatly improve the accuracy of the speech recognition result in noisy environments. To combine the two, a camera can be added to the microphone, so that the microphone not only collects the voice signal generated by the user during the sounding event but also collects the generated image signal, which contains at least the image of the lip area of the face. Of course, to better identify changes in the mouth shape, the image signal may also include images of other parts of the face, because changes in mouth shape are sometimes correlated with changes in facial expression. It should also be noted that, to capture the change of the user's mouth shape during the sounding event, the camera can be set to video recording mode, i.e., video images are collected; image analysis can later be performed on the video images so that the image signal can be further processed to obtain a lip-reading recognition result. Furthermore, before the sounding event is triggered, the terminal needs to establish a communication connection with the microphone; the connection may be wired, such as a USB data cable, or wireless, such as Bluetooth or WiFi. The specific connection mode can of course be set according to the actual situation and is not specifically limited here.
Illustratively, after the terminal and the microphone establish a wireless communication connection, the user opens the "×× K Song" karaoke application on the terminal and can select a song of interest from a recommendation list or a search list, for example "Palette". Meanwhile, the voice input function of the microphone and the video recording function of the camera are turned on, and the user can then start singing according to the prompts. The microphone collects the voice signal and the image signal in real time, where the image signal contains at least the image of the lip area of the face. After the song is finished, the real-time collection of the voice signal and the image signal stops, and the signals are uploaded to the terminal for subsequent analysis and processing.
S120: Perform feature extraction on the voice signal to generate a speech feature signal, and perform feature extraction on the image signal containing lips to generate a lip-language feature signal.
In a specific embodiment of the present invention, after receiving the voice signal and the image signal containing lips that the microphone collected while the user performed the sounding event, the terminal starts feature extraction on both signals. Feature extraction can be characterized as follows: the amount of data in the original signal is very large — in other words, the signal samples lie in a high-dimensional space — and the samples can be represented in a space of reduced dimension through a mapping (or transformation). The mapped features are combinations of the original features, so feature extraction is, broadly speaking, a kind of transformation. The purpose of feature extraction is to uncover the intrinsic properties of the signal and then extract them by digital signal processing means, reducing the dimensionality of the feature vectors and the complexity of the recognition system. During feature extraction, one must consider not only whether the features express the voice signal completely and accurately, but also that the coupling between feature parameters should be as small as possible, that the features should be robust in noisy environments, and that they should be easy to compute; this makes the design and training of the processing models for the voice signal and the image signal simple and efficient.
Specifically, algorithms such as LPCC (Linear Prediction Cepstrum Coefficient), MFCC (Mel Frequency Cepstrum Coefficient), HMM (Hidden Markov Model) and DTW (Dynamic Time Warping) may be used to perform feature extraction on the voice signal. Correspondingly, algorithms such as deformable templates, ASM (Active Shape Model), AAM (Active Appearance Model), PCA (Principal Component Analysis), DCT (Discrete Cosine Transform) and Snakes may be used to perform feature extraction on the image signal containing lips.
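Of the image-side algorithms listed, the DCT is the simplest to illustrate. The following sketch is illustrative only: the ROI size, the number of retained coefficients and the orthonormal DCT construction are all assumptions, not details from the patent. It compresses a cropped lip region into a low-frequency coefficient vector:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)
    return basis

def lip_dct_features(roi, keep=4):
    """Separable 2-D DCT of a square lip region of interest; the top-left
    (low-frequency) keep x keep block serves as a compact feature vector."""
    d = dct_matrix(roi.shape[0])
    coeffs = d @ roi @ d.T
    return coeffs[:keep, :keep].ravel()

rng = np.random.default_rng(2)
lip_roi = rng.random((16, 16))   # stand-in for a cropped grayscale lip image
features = lip_dct_features(lip_roi)
```

In practice the lip ROI would first be located by one of the contour methods named above (ASM, AAM, Snakes) before DCT compression; this sketch covers only the compression step.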
S130: Send the speech feature signal and the lip-language feature signal to the server, so as to instruct the server to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result and on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; if the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, a recognition feedback result is generated according to the speech recognition result and sent to the terminal.
In a specific embodiment of the present invention, the terminal sends the speech feature signal and the lip-language feature signal to the server, so that the server can carry out further recognition analysis on the received signals to obtain a speech recognition result and a lip-reading recognition result, and compare the similarity of the two. When the similarity is greater than or equal to the similarity threshold, the speech recognition result can be considered relatively accurate; a corresponding recognition feedback result can then be generated according to the speech recognition result and sent to the terminal, and after receiving the recognition feedback result the terminal can compare it with the reference result it has determined for the sounding event and then give an evaluation result for the sounding event. It should be noted that the similarity threshold can be set according to the actual situation and is not specifically limited here. The evaluation result may be expressed as a score or as a grade; the specific representation can likewise be set according to the actual situation and is also not specifically limited here. In addition, the speech recognition result and the lip-reading recognition result mentioned here can denote the textual expression determined from the speech feature signal or the lip-language feature signal, i.e., the content represented by the sounding event can be judged from the recognition result.
Illustratively, the terminal sends the speech feature signal and the lip-language feature signal to the server; through matching analysis against the preset voice signal, the server determines that the speech recognition result is "because of the pain, it is youth", and the lip-reading recognition result is likewise "because of the pain, it is youth". The server compares whether the similarity between the two is greater than or equal to the set similarity threshold of 90%; clearly the similarity exceeds 90%, so the speech recognition result "because of the pain, it is youth" can be considered relatively correct. A corresponding recognition feedback result is generated according to the speech recognition result and sent to the terminal; the terminal, after receiving the recognition feedback result, compares it with the reference result "because of the pain, it is youth" it has determined for the sounding event, and then gives the sounding event an evaluation result of 100 points.
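The threshold comparison in this example can be sketched with an ordinary string-similarity measure such as the ratio from Python's difflib; the patent does not specify how similarity is computed, so this metric, like the sample lyric strings, is an assumption for illustration:

```python
from difflib import SequenceMatcher

def recognition_feedback(speech_result, lip_result, threshold=0.9):
    """If the two recognition results are similar enough, return the speech
    result as the recognition feedback; otherwise return None to signal
    that context-based adjustment would be needed."""
    similarity = SequenceMatcher(None, speech_result, lip_result).ratio()
    return speech_result if similarity >= threshold else None

feedback = recognition_feedback("because of the pain, it is youth",
                                "because of the pain, it is youth")
```

Identical results give a ratio of 1.0, which clears the 90% threshold, so the speech recognition result is fed back; dissimilar results fall below the threshold and yield no feedback.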
Compared with the prior art, in which the sounding-event evaluation performed by the microphone and the terminal can only be carried out by means of speech recognition, lip-reading recognition is realized here by adding a camera to the microphone; lip-reading recognition can thus be used to assist and correct the speech recognition result, further improving the accuracy of speech recognition.
In the technical solution of this embodiment, when a sounding event is triggered, the voice signal and the image signal containing lips collected from the user while performing the sounding event are received from the microphone; feature extraction is performed on the voice signal to generate a speech feature signal and on the image signal containing lips to generate a lip-language feature signal; the speech feature signal and the lip-language feature signal are sent to the server, which is instructed to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result and on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; and, if the similarity between the two results is greater than or equal to the similarity threshold, a recognition feedback result is generated according to the speech recognition result and sent to the terminal. This solves the prior-art problem of a low speech recognition rate caused by relying on speech recognition technology alone, and improves the speech recognition rate.
Further, on the basis of the above technical solution, performing feature extraction on the voice signal to generate the voice feature signal may specifically include:
performing voice feature parameter extraction on the voice signal to obtain voice feature parameters;
performing a dimensionality-reduction transform on the voice feature parameters to obtain a pending voice feature signal;
performing enhancement processing on the pending voice feature signal according to a voice enhancement algorithm to obtain the voice feature signal, the voice enhancement algorithm including a cepstral mean subtraction algorithm.
In a specific embodiment of the present invention, feature extraction is performed on the voice signal that the terminal receives from the microphone, collected while the sounding event is performed. First, voice feature parameter extraction is performed on the voice signal to obtain voice feature parameters; illustratively, LPCC or MFCC may be used to extract voice feature parameters from the voice signal. Then, a dimensionality-reduction transform is applied to the voice feature parameters; illustratively, PCA or HMM may be used to transform the voice feature parameters into a pending voice feature signal. Finally, enhancement processing may be applied to the pending voice feature signal according to a voice enhancement algorithm to obtain the voice feature signal, where processing the pending voice feature signal with the voice enhancement algorithm can also achieve the purpose of reducing noise interference; illustratively, cepstral mean subtraction may be used to enhance the voice signal. Of course, it should be understood that which algorithm is specifically used can be set according to actual conditions and is not specifically limited here.
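As a minimal sketch of the enhancement step named above, cepstral mean subtraction simply removes the per-coefficient mean from a sequence of cepstral feature vectors, which cancels stationary channel effects. The feature matrix below is invented for illustration; real input would be MFCC or LPCC frames extracted from the voice signal:

```python
import numpy as np

def cepstral_mean_subtraction(features: np.ndarray) -> np.ndarray:
    """Subtract the per-coefficient mean over all frames (rows = frames)."""
    return features - features.mean(axis=0, keepdims=True)

# Toy example: 3 frames of 2 cepstral coefficients each.
frames = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
normalized = cepstral_mean_subtraction(frames)
print(normalized.mean(axis=0))  # ~[0. 0.] after normalization
```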
Performing feature extraction on the image signal containing the lips to generate the lip-reading feature signal may specifically include:
performing feature extraction on the image signal containing the lips according to a lip feature extraction algorithm to obtain a lip image signal, the lip feature extraction algorithm including at least one of a template-based feature extraction algorithm or an image-pixel-based feature extraction algorithm;
performing mouth-shape contour feature extraction on the lip image signal according to a mouth-shape contour feature extraction algorithm to obtain the lip-reading feature signal, the mouth-shape contour feature extraction algorithm including at least one of a deformable template algorithm or the Snake algorithm.
In a specific embodiment of the present invention, since lip extraction occupies a very important position in lip reading recognition, the choice of feature vector directly affects the lip reading recognition rate. The most important property of feature extraction is repeatability: the input image is generally smoothed by Gaussian blur in scale space, after which one or more features of the image are computed through local derivative operations. A lip feature extraction algorithm can be used to extract the lip image signal from the image signal containing the lips. Illustratively, the lip feature extraction algorithm includes at least one of a template-based feature extraction algorithm or an image-pixel-based feature extraction algorithm. The template-based feature extraction algorithm, also called a top-down algorithm, mainly establishes a model of the contours of the inner and outer lips and describes the relevant lip contour information with a set of parameters. This set of parameters, or a linear combination of the parameters, is used to describe the visual features of the lips; such methods usually need to presuppose which visual features are important. Specifically, they can be divided into three kinds: algorithms based on model points, algorithms based on active contour models, and algorithms based on deformable models. The image-pixel-based feature extraction algorithm, also called a bottom-up algorithm, obtains a feature vector directly from the entire grayscale image containing the lips, using lip images that have undergone several preprocessing steps. Specifically, it can be divided into three kinds: direct pixel algorithms, vector quantization algorithms, and PCA. After extraction of the lip image signal is completed, the mouth-shape contour features need to be further extracted; a mouth-shape contour feature extraction algorithm can be used to extract the lip-reading feature signal from the lip image signal. Illustratively, the mouth-shape contour feature extraction algorithm includes at least one of a deformable template algorithm or the Snake algorithm. The deformable template algorithm approaches the lip contour with multiple parametric curves and combines these curves into a template; then, under certain constraint conditions, an optimization method brings the curves close to the best-fitting lip position, so that the curve parameters reflect the mouth-shape variation and thereby describe the lip movement. The deformable model algorithm is not affected by lip deformation, rotation, or scaling and can portray the lip shape well; to represent the mouth shape, it extracts a mouth-shape template using the outer lip contour together with the width and height of the lips. The Snake algorithm can describe the mouth-shape contour well: several points are placed on the lips, and constraint conditions are then used to detect these points. Also, it should be understood that which algorithm is specifically used can be set according to actual conditions and is not specifically limited here.
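As a minimal illustration of the template idea described above, the sketch below reduces a set of detected lip contour points to a width/height mouth-shape descriptor. The point list and the descriptor fields are assumptions for illustration, not the patent's exact parameterization:

```python
def mouth_shape_template(points):
    """Reduce lip contour points (x, y) to a simple width/height template."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return {"width": max(xs) - min(xs), "height": max(ys) - min(ys)}

# Toy contour: four extreme points of an open mouth.
contour = [(10, 50), (90, 50), (50, 30), (50, 70)]
print(mouth_shape_template(contour))  # {'width': 80, 'height': 40}
```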
Embodiment two
Fig. 2 is a flowchart of a speech recognition method provided by Embodiment 2 of the present invention. This embodiment is applicable to the case of improving the speech recognition rate. The method can be executed by a speech recognition device, which may be implemented in software and/or hardware and can be configured at the server side, typically a server or the like. As shown in Fig. 2, the method specifically includes the following steps:
S210, receiving the voice feature signal and the lip-reading feature signal sent by the terminal;
In a specific embodiment of the present invention, the server side can receive the voice feature signal and the lip-reading feature signal sent by the terminal, so that the server side can perform further recognition analysis on the voice feature signal and the lip-reading feature signal.
S220, performing matching analysis between the voice feature signal and the preset voice signal to generate a voice recognition result;
In a specific embodiment of the present invention, matching analysis is performed between the voice feature signal and the preset voice signal to obtain the voice recognition result, where the preset voice signal can be a preset voice signal in a pre-trained speech recognition model, i.e., a pre-trained speech recognition model is established in advance. More specifically, training data of a set number of groups, generated by users while performing sounding events, can be obtained from a standard database, each item of training data including a voice signal and a corresponding text signal. Feature extraction is performed on the voice signal to generate a voice feature signal; taking the voice feature signal as the input variable and the corresponding text signal as the output variable, a preset mathematical model is trained with the voice feature signals and corresponding text signals according to a machine learning algorithm to generate the pre-trained speech recognition model. The standard database described here can denote a database that stores the voice signals collected by devices with a voice input function, such as microphones, from at least two users while performing various sounding events, together with the corresponding text signals; that is, the database contains a large number of diverse voice signals and corresponding text signals, and the number of users generating the voice signals is at least two. Relatively speaking, the more users the better, and the age, regional, and occupational distributions of the users should be as broad as possible, so that the database established is more representative and the robustness of the model established on its basis is better. The training data of the set number of groups can be groups formed from the training data of the same user or groups formed from the training data of different users, which can be set according to actual conditions and is not specifically limited here. Preferably, the training data of the set number of groups are groups formed from the training data of different users; the advantage of such a setting is that the performance of the model established on this basis can be made better. Each item of training data includes a voice signal and a corresponding text signal, i.e., each group of training data is a data pair formed by a voice signal and its corresponding text signal. The same algorithms as above may be used to perform feature extraction on the voice signal to obtain the voice feature signal, which is not repeated here. The preset mathematical model is trained according to a machine learning algorithm to generate the pre-trained speech recognition model, where the machine learning algorithm may include a neural network algorithm, an ant colony algorithm, and the like, which can be set specifically according to actual conditions and is not specifically limited here.
Since the preset voice signal comes from the pre-trained speech recognition model, it has a corresponding text signal; matching analysis between the voice feature signal and the preset voice signal then generates the voice recognition result, which can include the text signal corresponding to the voice feature signal. Specifically, the matching analysis, which could also be called a recognition analysis process, may be performed by splitting the voice feature signal and the preset voice signal into keywords. More specifically, the voice feature signal and the preset voice signal can be split into multiple keywords according to a preset keyword library, and the part of speech of each keyword is marked. It is then determined whether the parts of speech of adjacent keywords in the voice feature signal match. When the parts of speech of adjacent keywords do not match, the mismatched keyword is taken as a first keyword, and it is determined whether the first keyword exists in a preset confusable-sound dictionary. When the unmatched keyword exists in the confusable-sound dictionary, the second keyword corresponding to the first keyword in the confusable-sound dictionary is determined, and the first keyword is replaced with the second keyword; when the parts of speech of the replaced second keyword and its adjacent keywords match, the replaced second keyword is recombined with the other keywords. The recombined voice feature signal then undergoes keyword matching analysis with the preset voice signal to obtain the voice recognition result.
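The keyword-correction step above can be sketched as follows. This is a minimal illustration under simplifying assumptions: keywords are given as (word, part-of-speech) pairs, "matching" means two adjacent parts of speech appear in an allowed-pairs set, and the confusable-sound dictionary maps a mis-heard word to its replacement; none of these data structures are specified by the patent:

```python
ALLOWED_PAIRS = {("noun", "verb"), ("verb", "noun"), ("adj", "noun")}
CONFUSABLE = {"seen": ("scene", "noun")}  # mis-heard word -> (replacement, pos)

def correct_keywords(keywords):
    """Replace a keyword whose part of speech clashes with its neighbor,
    using the confusable-sound dictionary, then recombine the keywords."""
    fixed = list(keywords)
    for i in range(len(fixed) - 1):
        pair = (fixed[i][1], fixed[i + 1][1])
        if pair not in ALLOWED_PAIRS and fixed[i + 1][0] in CONFUSABLE:
            fixed[i + 1] = CONFUSABLE[fixed[i + 1][0]]
    return " ".join(word for word, _ in fixed)

# "verb verb" clashes, so "seen" is replaced via the confusable dictionary.
print(correct_keywords([("watch", "verb"), ("seen", "verb")]))  # "watch scene"
```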
S230, performing matching analysis between the lip-reading feature signal and the preset lip reading signal to generate a lip reading recognition result;
In a specific embodiment of the present invention, likewise, matching analysis is performed between the lip-reading feature signal and the preset lip reading signal to obtain the lip reading recognition result, where the preset lip reading signal can be a preset lip reading signal in a pre-trained lip reading recognition model, i.e., a pre-trained lip reading recognition model is established in advance. Since the preset lip reading signal comes from the pre-trained lip reading recognition model, it has a corresponding text signal; matching analysis between the lip-reading feature signal and the preset lip reading signal then generates the lip reading recognition result, which can include the text signal corresponding to the lip-reading feature signal. Since both the lip-reading feature signal and the preset lip reading signal are formed from mouth-shape contours, the lip-reading feature signal can be divided in the manner in which the mouth-shape contour of each frame in the image and the mouth-shape contour of the previous frame determine the mouth-shape contour output, and then compared and analyzed in sequence with the preset lip reading signal to obtain the recognition result.
S240, if the similarity between the voice recognition result and the lip reading recognition result is greater than or equal to the similarity threshold, generating a recognition feedback result according to the voice recognition result and sending the recognition feedback result to the terminal, so as to instruct the terminal to generate an evaluation result of the sounding event according to the recognition feedback result.
In a specific embodiment of the present invention, in order to improve the accuracy of speech recognition, the voice recognition result is compared with the lip reading recognition result for similarity. When the similarity of the two is greater than or equal to the similarity threshold, the voice recognition result can be considered relatively accurate; the server side can then generate a corresponding recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal. After receiving the recognition feedback result, the terminal can compare it with the result of the sounding event determined by the terminal itself, and then give the evaluation result of the sounding event.
In the technical solution of this embodiment, the voice feature signal and the lip-reading feature signal sent by the terminal are received; matching analysis is performed between the voice feature signal and the preset voice signal to generate a voice recognition result, and matching analysis is performed between the lip-reading feature signal and the preset lip reading signal to generate a lip reading recognition result; if the similarity between the voice recognition result and the lip reading recognition result is greater than or equal to the similarity threshold, a recognition feedback result is generated according to the voice recognition result and sent to the terminal, so as to instruct the terminal to generate an evaluation result of the sounding event according to the recognition feedback result. This solves the problem in the prior art that relying solely on speech recognition technology leads to a low speech recognition rate, and achieves an improved speech recognition rate.
Further, on the basis of the above technical solution, before receiving the voice feature signal and the lip-reading feature signal sent by the terminal, the method may also specifically include:
establishing a pre-trained speech recognition model and a pre-trained lip reading recognition model;
correspondingly, performing matching analysis between the voice feature signal and the preset voice signal to generate the voice recognition result may specifically include:
calling the preset voice signal in the pre-trained speech recognition model according to the voice feature signal to perform matching analysis and generate the voice recognition result;
performing matching analysis between the lip-reading feature signal and the preset lip reading signal to generate the lip reading recognition result may specifically include:
calling the preset lip reading signal in the pre-trained lip reading recognition model according to the lip-reading feature signal to perform matching analysis and generate the lip reading recognition result.
In a specific embodiment of the present invention, the pre-trained speech recognition model and lip reading recognition model are used for subsequent speech recognition and lip reading recognition analysis. Matching analysis is performed between the voice feature signal and the preset voice signal in the pre-trained speech recognition model, which can in other words also be considered a recognition analysis, and the voice recognition result is thereby generated. Likewise, matching analysis is performed between the lip-reading feature signal and the preset lip reading signal in the pre-trained lip reading recognition model, which can also be considered a recognition analysis, and the lip reading recognition result is thereby generated. Both the voice recognition result and the lip reading recognition result include the corresponding text signal.
By performing matching analysis against the preset signals in a pre-trained recognition model, the recognition result of the signal is obtained; since the recognition model itself has good properties, such as robustness, the accuracy of the recognition result is also relatively high.
Further, on the basis of the above technical solution, establishing the pre-trained lip reading recognition model may specifically include:
obtaining, from a standard database, training data of a set number of groups generated by users while performing sounding events, the training data including image signals containing the lips and corresponding text signals;
performing feature extraction on the image signals containing the lips to generate first lip-reading feature signals; taking the first lip-reading feature signals as the input variable and the corresponding text signals as the output variable, training a preset mathematical model with the first lip-reading feature signals and corresponding text signals according to a machine learning algorithm to generate the pre-trained lip reading recognition model.
In a specific embodiment of the present invention, training data of a set number of groups generated by users while performing sounding events are obtained from a standard database, the training data including image signals containing the lips and corresponding text signals. Feature extraction is performed on the image signals containing the lips to generate first lip-reading feature signals; taking the first lip-reading feature signals as the input variable and the corresponding text signals as the output variable, a preset mathematical model is trained with the first lip-reading feature signals and corresponding text signals according to a machine learning algorithm to generate the pre-trained lip reading recognition model. The standard database described here can denote a database that stores the image signals containing the lips collected by devices with an image input function, such as cameras, from at least two users while performing various sounding events, together with the corresponding text signals. The number of users generating the image signals is at least two; relatively speaking, the more users the better, and the age, regional, and occupational distributions of the users should be as broad as possible, so that the database established is more representative and the robustness of the model established on its basis is better. The training data of the set number of groups can be groups formed from the training data of the same user or groups formed from the training data of different users, which can be set according to actual conditions and is not specifically limited here. Preferably, the training data of the set number of groups are groups formed from the training data of different users; the advantage of such a setting is that the performance of the model established on this basis can be made better. Each group of training data is a data pair formed by an image signal containing the lips and its corresponding text signal. The same algorithms as above may be used to perform feature extraction on the image signals containing the lips to obtain the lip-reading feature signals, which is not repeated here. The preset mathematical model is trained according to a machine learning algorithm to generate the pre-trained lip reading recognition model, where the machine learning algorithm may include a neural network algorithm, an ant colony algorithm, and the like, which can be set specifically according to actual conditions and is not specifically limited here.
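The input/output pairing described above can be sketched with a generic trainable model. The sketch below uses a nearest-neighbor lookup in place of the neural network or ant colony algorithm named in the text, purely to keep the example self-contained; the feature vectors and text labels are invented for illustration:

```python
class NearestNeighborRecognizer:
    """Toy stand-in for the pre-trained recognition model: maps a feature
    vector (input variable) to the text signal (output variable) of the
    closest training vector."""

    def __init__(self):
        self.pairs = []  # (feature_vector, text_signal)

    def train(self, features, texts):
        self.pairs = list(zip(features, texts))

    def recognize(self, feature):
        def dist(vec):
            return sum((a - b) ** 2 for a, b in zip(vec, feature))
        return min(self.pairs, key=lambda p: dist(p[0]))[1]

model = NearestNeighborRecognizer()
model.train([[0.0, 0.1], [0.9, 1.0]], ["closed mouth", "open mouth"])
print(model.recognize([0.8, 0.9]))  # "open mouth"
```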
Furthermore, it should be noted that the above process of establishing the pre-trained speech recognition model and lip reading recognition model is carried out at the server side; of course, the process can also be carried out at the terminal. The storage location of the established models is set at the server side, and therefore the subsequent matching analysis process is also carried out at the server side. The advantage of this setting is that the storage space occupied by the established models is usually large, which places high demands on the configuration of the device, while a terminal, such as a mobile terminal, has a relatively low configuration compared with the server side; storing models that occupy large storage space at the terminal would further increase its operating load, substantially reducing the operating speed of the terminal and the user experience. Based on the above, the storage location of the established models is set at the server side, and the subsequent matching analysis process is also carried out at the server side.
Further, on the basis of the above technical solution, the method may also specifically include:
if the similarity between the voice recognition result and the lip reading recognition result is less than the similarity threshold, obtaining current context information;
adjusting the voice recognition result and the lip reading recognition result according to the current context information, until the similarity between the adjusted voice recognition result and lip reading recognition result is greater than or equal to the similarity threshold.
In a specific embodiment of the present invention, the server side can perform further recognition analysis according to the received voice feature signal and lip-reading feature signal to obtain the voice recognition result and the lip reading recognition result, and compare the similarity between the voice recognition result and the lip reading recognition result. When the similarity of the two is less than the similarity threshold, it can be considered that the voice recognition result and the lip reading recognition result need to be adjusted; current context information can then be obtained, and the voice recognition result and the lip reading recognition result are adjusted accordingly on the basis of the obtained context information. This is because, in the processes of speech recognition and lip reading recognition, the correspondences between mouth shapes and pronunciations, and between pronunciations and words, are not simply one-to-one; there are usually multiple possible candidate results, and what is usually output is the result with the highest likelihood. Therefore, if the result with the highest likelihood does not meet the set condition, a new selection must be made; at this point, the identified correspondences between mouth shapes and words and between pronunciations and words can be adjusted according to the current context information. Illustratively, suppose there are 4 candidate results in the speech recognition process; when the result with the highest likelihood, such as result 2, does not meet the set condition, result 3 can be redetermined from the candidate results as the voice recognition result according to the context information. Meanwhile, suppose there are 5 candidate results in the lip reading recognition process; when the result with the highest likelihood, such as result 4, does not meet the set condition, result 2 can be redetermined from the candidate results as the lip reading recognition result according to the context information. The redetermined recognition results are then compared again; when the similarity of the two is greater than or equal to the similarity threshold, the voice recognition result can be considered relatively accurate and no further adjustment is needed. When the similarity of the two is still less than the similarity threshold, selection from the candidate results based on the context information continues, until the selected recognition results satisfy a similarity greater than or equal to the similarity threshold. Furthermore, it should be noted that if all candidate results fail to meet the above condition, the pair with the maximum similarity can be selected as the final recognition result.
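The candidate re-selection loop above can be sketched as follows. This is a minimal illustration assuming each recognizer exposes a ranked candidate list, with `difflib` again standing in for the unspecified similarity measure; it keeps the patent's fallback of returning the maximum-similarity pair when no pair meets the threshold:

```python
from difflib import SequenceMatcher
from itertools import product

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def reconcile(speech_candidates, lip_candidates, threshold=0.9):
    """Try candidate pairs in ranked order; fall back to the most similar pair."""
    best = max(product(speech_candidates, lip_candidates),
               key=lambda pair: similarity(*pair))
    for s, l in product(speech_candidates, lip_candidates):
        if similarity(s, l) >= threshold:
            return s, l
    return best  # no pair met the threshold: keep the maximum-similarity pair

print(reconcile(["cat", "cart"], ["card", "cart"]))  # ('cart', 'cart')
```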
Compared with the prior art, in which a sounding event evaluation performed by a microphone together with a terminal can only proceed by way of speech recognition, lip reading recognition is realized here by adding a camera to the microphone. In this way, lip reading recognition can be used to assist the speech recognition and to correct the voice recognition result, so that the accuracy of speech recognition can be further improved.
Embodiment three
Fig. 3 is a flowchart of a speech recognition method provided by Embodiment 3 of the present invention. This embodiment is applicable to the case of improving the speech recognition rate. The method can be executed by a speech recognition device, which may be implemented in software and/or hardware and can be configured in a device, typically a mobile phone, tablet computer, server, or the like. As shown in Fig. 3, the method specifically includes the following steps:
The microphone and the terminal establish a communication connection;
when a sounding event is triggered, the microphone collects the voice signal and the image signal containing the lips while the user performs the sounding event, and sends the voice signal and the image signal containing the lips to the terminal;
the terminal performs feature extraction on the voice signal to generate a voice feature signal, performs feature extraction on the image signal containing the lips to generate a lip-reading feature signal, and sends the voice feature signal and the lip-reading feature signal to the server side;
the server side performs matching analysis between the voice feature signal and the preset voice signal to generate a voice recognition result, and performs matching analysis between the lip-reading feature signal and the preset lip reading signal to generate a lip reading recognition result; if the similarity between the voice recognition result and the lip reading recognition result is less than the similarity threshold, current context information is obtained, and the voice recognition result and the lip reading recognition result are adjusted according to the current context information until the similarity between the adjusted voice recognition result and lip reading recognition result is greater than or equal to the similarity threshold; if the similarity between the voice recognition result and the lip reading recognition result is greater than or equal to the similarity threshold, a recognition feedback result is generated according to the voice recognition result and sent to the terminal;
the terminal generates the evaluation result of the sounding event according to the recognition feedback result.
In the technical solution of this embodiment, lip reading recognition is realized by adding a camera to the microphone, and lip reading recognition is used to assist the speech recognition and to correct the voice recognition result. This solves the problem in the prior art that relying solely on speech recognition technology leads to a low speech recognition rate, and achieves an improved speech recognition rate.
Embodiment four
Fig. 4 is a structural schematic diagram of a speech recognition device provided by Embodiment 4 of the present invention. This embodiment is applicable to the case of improving the speech recognition rate. The device may be implemented in software and/or hardware and can be configured in a device, typically a mobile phone, tablet computer, or the like. As shown in Fig. 4, the device specifically includes:
a voice and image signal acquisition module 410, configured to receive, when a sounding event is triggered, the voice signal collected while the user performs the sounding event and the image signal containing the lips, both sent by the microphone;
a feature signal generation module 420, configured to perform feature extraction on the voice signal to generate a voice feature signal, and to perform feature extraction on the image signal containing the lips to generate a lip-reading feature signal;
a feature signal sending module 430, configured to send the voice feature signal and the lip-reading feature signal to the server side, so as to instruct the server side to perform matching analysis between the voice feature signal and the preset voice signal to generate a voice recognition result and to perform matching analysis between the lip-reading feature signal and the preset lip reading signal to generate a lip reading recognition result, and, if the similarity between the voice recognition result and the lip reading recognition result is greater than or equal to the similarity threshold, to generate a recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal.
In the technical solution of this embodiment, when a sounding event is triggered, the voice and image signal acquisition module 410 receives the voice signal collected while the user performs the sounding event and the image signal containing the lips, both sent by the microphone; the feature signal generation module 420 performs feature extraction on the voice signal to generate a voice feature signal and performs feature extraction on the image signal containing the lips to generate a lip-reading feature signal; the feature signal sending module 430 sends the voice feature signal and the lip-reading feature signal to the server side, so as to instruct the server side to perform matching analysis between the voice feature signal and the preset voice signal to generate a voice recognition result and to perform matching analysis between the lip-reading feature signal and the preset lip reading signal to generate a lip reading recognition result, and, if the similarity between the voice recognition result and the lip reading recognition result is greater than or equal to the similarity threshold, to generate a recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal. This solves the problem in the prior art that performing speech recognition solely by speech recognition technology leads to a low speech recognition rate, and achieves an improved speech recognition rate.
Further, on the basis of the above technical solution, the feature signal generation module 420 may specifically include:
a voice feature parameter generation unit, configured to extract voice feature parameters from the voice signal;
a pending voice feature signal generation unit, configured to perform a dimensionality-reduction transform on the voice feature parameters to obtain a pending voice feature signal;
a voice feature signal generation unit, configured to enhance the pending voice feature signal according to a speech enhancement algorithm to obtain the voice feature signal, the speech enhancement algorithm including a cepstral mean subtraction algorithm;
a lip image signal generation unit, configured to perform feature extraction on the image signal containing the lips according to a lip feature extraction algorithm to obtain a lip image signal, the lip feature extraction algorithm including at least one of a template-based feature extraction algorithm or an image-pixel-based feature extraction algorithm;
a lip-reading feature signal generation unit, configured to perform mouth-shape contour feature extraction on the lip image signal according to a mouth-shape contour feature extraction algorithm to obtain the lip-reading feature signal, the mouth-shape contour feature extraction algorithm including at least one of a deformable template algorithm or a Snakes algorithm.
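The cepstral mean subtraction named above as the speech enhancement algorithm can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name and the toy frame matrix are assumptions, and the per-coefficient mean is subtracted over all frames to remove a constant channel offset.

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-coefficient mean over all frames from cepstral
    features -- a standard channel-normalization (enhancement) step.

    features: (num_frames, num_coeffs) array of cepstral coefficients.
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)

# Toy example: 4 frames of 3 cepstral coefficients sharing a constant
# channel offset; after CMS every column has (near-)zero mean.
frames = np.array([[1.0, 2.0, 3.0],
                   [1.5, 2.5, 3.5],
                   [0.5, 1.5, 2.5],
                   [1.0, 2.0, 3.0]])
normalized = cepstral_mean_subtraction(frames)
print(np.allclose(normalized.mean(axis=0), 0.0))  # True
```

In practice the input would be the pending voice feature signal produced by the dimensionality-reduction unit; the normalized array then serves as the voice feature signal sent to the server side.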
The speech recognition apparatus configured in a terminal provided by this embodiment of the present invention can execute the terminal-side speech recognition method provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing that method.
Embodiment five
Fig. 5 is a structural schematic diagram of a speech recognition apparatus provided by Embodiment Five of the present invention. This embodiment is applicable to the case of improving the speech recognition rate. The apparatus may be implemented in software and/or hardware, and may be configured in a device such as a server. As shown in Fig. 5, the apparatus specifically includes:
a feature signal receiving module 510, configured to receive the voice feature signal and the lip-reading feature signal sent by the terminal;
a voice recognition result generation module 520, configured to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result;
a lip-reading recognition result generation module 530, configured to perform matching analysis between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result;
a recognition feedback result sending module 540, configured to, if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generate a recognition feedback result according to the voice recognition result and send it to the terminal, so as to instruct the terminal to generate an evaluation result of the utterance event according to the recognition feedback result.
In the technical solution of this embodiment, the feature signal receiving module 510 receives the voice feature signal and the lip-reading feature signal sent by the terminal; the voice recognition result generation module 520 performs matching analysis between the voice feature signal and the preset voice signal to generate the voice recognition result, and the lip-reading recognition result generation module 530 performs matching analysis between the lip-reading feature signal and the preset lip-reading signal to generate the lip-reading recognition result; if the similarity between the two recognition results is greater than or equal to the similarity threshold, the recognition feedback result sending module 540 generates a recognition feedback result according to the voice recognition result and sends it to the terminal, instructing the terminal to generate an evaluation result of the utterance event according to the recognition feedback result. This solves the problem in the prior art that relying on speech recognition technology alone leads to a low speech recognition rate, and thereby improves the speech recognition rate.
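The server-side decision described above — gating the feedback on the similarity between the two recognition results — can be sketched as follows. The patent does not specify a similarity measure or a threshold value, so `difflib.SequenceMatcher` and the 0.8 threshold here are purely illustrative assumptions.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # illustrative; the patent leaves the value open

def similarity(a, b):
    """Similarity ratio in [0, 1] between two recognized text strings."""
    return SequenceMatcher(None, a, b).ratio()

def recognition_feedback(voice_result, lip_result, threshold=SIMILARITY_THRESHOLD):
    """If the two modalities agree, generate the feedback result from the
    voice recognition result (as the patent prescribes); otherwise None."""
    if similarity(voice_result, lip_result) >= threshold:
        return voice_result
    return None

print(recognition_feedback("hello world", "hello world"))  # hello world
print(recognition_feedback("hello world", "goodbye"))      # None
```

When the function returns `None`, the apparatus would fall back to the context-based adjustment described further below in this embodiment.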
Further, on the basis of the above technical solution, the apparatus may also specifically include:
an identification model establishing module, configured to establish a pre-trained speech recognition model and a pre-trained lip-reading recognition model.
Correspondingly, the voice recognition result generation module 520 may specifically include:
a voice recognition result generation unit, configured to call, according to the voice feature signal, the preset voice signal in the pre-trained speech recognition model for matching analysis, generating the voice recognition result.
The lip-reading recognition result generation module 530 may specifically include:
a lip-reading recognition result generation unit, configured to call, according to the lip-reading feature signal, the preset lip-reading signal in the pre-trained lip-reading recognition model for matching analysis, generating the lip-reading recognition result.
Further, on the basis of the above technical solution, the identification model establishing module may specifically include:
a training data generation unit, configured to obtain, from a standard database, a set number of groups of training data generated while users perform utterance events, the training data including image signals containing the lips and the corresponding character signals;
a lip-reading recognition model establishing unit, configured to perform feature extraction on the image signals containing the lips to generate first lip-reading feature signals and, taking the first lip-reading feature signals as input variables and the corresponding character signals as output variables, to train a preset mathematical model according to a machine learning algorithm using the first lip-reading feature signals and the corresponding character signals, generating the pre-trained lip-reading recognition model.
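The training step described above — lip-reading feature signals as input variables, the corresponding character signals as output variables, fed into a preset mathematical model via a machine learning algorithm — might be sketched as follows. The patent does not name the model, so a toy nearest-centroid classifier stands in for it; the class name, feature vectors, and labels are all illustrative assumptions.

```python
import numpy as np

class NearestCentroidLipModel:
    """Toy stand-in for the patent's 'preset mathematical model': stores one
    centroid per output label and predicts the label of the nearest centroid."""

    def fit(self, features, labels):
        # features: (n_samples, n_dims) lip-reading feature vectors (inputs)
        # labels:   length-n_samples character labels (outputs)
        self.labels_ = sorted(set(labels))
        self.centroids_ = np.array([
            np.mean([f for f, y in zip(features, labels) if y == lab], axis=0)
            for lab in self.labels_
        ])
        return self

    def predict(self, feature):
        dists = np.linalg.norm(self.centroids_ - np.asarray(feature), axis=1)
        return self.labels_[int(np.argmin(dists))]

# Illustrative training data: 2-D lip features labeled with characters.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
y = ["a", "a", "b", "b"]
model = NearestCentroidLipModel().fit(X, y)
print(model.predict([0.05, 0.02]))  # a
print(model.predict([1.0, 0.95]))   # b
```

In the patent's terms, `fit` corresponds to training the preset mathematical model on the first lip-reading feature signals and their character signals, and the fitted object is the pre-trained lip-reading recognition model.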
Further, on the basis of the above technical solution, the apparatus may also specifically include:
a current context data obtaining module, configured to obtain current context information if the similarity between the voice recognition result and the lip-reading recognition result is less than the similarity threshold;
a recognition result adjustment module, configured to adjust the voice recognition result and the lip-reading recognition result according to the current context information until the similarity between the adjusted voice recognition result and the adjusted lip-reading recognition result is greater than or equal to the similarity threshold.
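The adjustment step described above is not specified further in the patent; one illustrative reading is re-scoring candidate hypotheses from the two recognizers using a context keyword until a pair clears the threshold. Everything below — the candidate lists, the context bonus, and the function name — is an assumption for the sketch, not the patent's method.

```python
from difflib import SequenceMatcher

def best_context_pair(voice_candidates, lip_candidates, context, threshold=0.8):
    """Pick the (voice, lip) candidate pair that agrees across modalities,
    biased toward hypotheses containing the context keyword; returns None
    if no pair reaches the threshold."""
    best, best_score = None, 0.0
    for v in voice_candidates:
        for l in lip_candidates:
            score = SequenceMatcher(None, v, l).ratio()
            if context in v:  # current context information biases the choice
                score += 0.1
            if score >= threshold and score > best_score:
                best, best_score = (v, l), score
    return best

pair = best_context_pair(["play music", "pay music"],
                         ["play music", "play mosaic"],
                         context="play")
print(pair)  # ('play music', 'play music')
```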
The speech recognition apparatus configured at the server side provided by this embodiment of the present invention can execute the server-side speech recognition method provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing that method.
Embodiment six
Fig. 6 is a structural schematic diagram of a device provided by Embodiment Six of the present invention; it shows a block diagram of an exemplary device 612 suitable for implementing embodiments of the present invention. The device 612 shown in Fig. 6 is only an example and should not impose any restriction on the function and scope of use of the embodiments of the present invention.
As shown in Fig. 6, the device 612 takes the form of a general-purpose computing device. The components of the device 612 may include, but are not limited to: one or more processors 616, a system memory 628, and a bus 618 connecting the different system components (including the system memory 628 and the processors 616).
The bus 618 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The device 612 typically comprises a variety of computer-system-readable media. These media can be any available media accessible by the device 612, including volatile and non-volatile media and removable and non-removable media.
The system memory 628 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 630 and/or cache memory 632. The device 612 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 634 may be used to read and write non-removable, non-volatile magnetic media (not shown in Fig. 6 and commonly referred to as a "hard disk drive"). Although not shown in Fig. 6, a magnetic disk drive for reading and writing removable non-volatile magnetic disks (e.g., "floppy disks") and an optical disk drive for reading and writing removable non-volatile optical disks (e.g., CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 618 via one or more data media interfaces. The memory 628 may include at least one program product having a set of (e.g., at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 640 having a set of (at least one) program modules 642 may be stored, for example, in the memory 628. Such program modules 642 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 642 generally perform the functions and/or methods of the embodiments described in the present invention.
The device 612 may also communicate with one or more external devices 614 (e.g., a keyboard, a pointing device, a display 624, etc.), with one or more devices that enable a user to interact with the device 612, and/or with any device (e.g., a network card, a modem, etc.) that enables the device 612 to communicate with one or more other computing devices. Such communication may take place via input/output (I/O) interfaces 622. Moreover, the device 612 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 620. As shown, the network adapter 620 communicates with the other modules of the device 612 via the bus 618. It should be understood that, although not shown in Fig. 6, other hardware and/or software modules may be used in conjunction with the device 612, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 616, by running programs stored in the system memory 628, executes various functional applications and data processing, for example implementing the terminal-side speech recognition method provided by the embodiments of the present invention, including:
when an utterance event is triggered, receiving the voice signal acquired while the user performs the utterance event, sent by the microphone, together with an image signal containing the lips;
performing feature extraction on the voice signal to generate a voice feature signal, and performing feature extraction on the image signal containing the lips to generate a lip-reading feature signal;
sending the voice feature signal and the lip-reading feature signal to the server side, so as to instruct the server side to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result and between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result; if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, a recognition feedback result is generated according to the voice recognition result and sent to the terminal.
The embodiment of the present invention also provides another device, comprising: one or more processors; and a storage apparatus for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the server-side speech recognition method provided by the embodiments of the present invention, including:
receiving the voice feature signal and the lip-reading feature signal sent by the terminal;
performing matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result;
performing matching analysis between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result;
if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generating a recognition feedback result according to the voice recognition result and sending it to the terminal, so as to instruct the terminal to generate an evaluation result of the utterance event according to the recognition feedback result.
Of course, those skilled in the art will understand that the processor can also implement the technical solution of the server-side speech recognition method provided by any embodiment of the present invention. For the hardware structure and functions of the server side, refer to the content of Embodiment Six.
Embodiment seven
Embodiment Seven of the present invention also provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the terminal-side speech recognition method provided by the embodiments of the present invention, the method including:
when an utterance event is triggered, receiving the voice signal acquired while the user performs the utterance event, sent by the microphone, together with an image signal containing the lips;
performing feature extraction on the voice signal to generate a voice feature signal, and performing feature extraction on the image signal containing the lips to generate a lip-reading feature signal;
sending the voice feature signal and the lip-reading feature signal to the server side, so as to instruct the server side to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result and between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result; if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, a recognition feedback result is generated according to the voice recognition result and sent to the terminal.
The computer storage medium of the embodiments of the present invention may employ any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by, or in connection with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, wireless, wireline, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The embodiment of the present invention also provides another computer-readable storage medium whose computer-executable instructions, when executed by a computer processor, perform a server-side speech recognition method, the method including:
receiving the voice feature signal and the lip-reading feature signal sent by the terminal;
performing matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result;
performing matching analysis between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result;
if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generating a recognition feedback result according to the voice recognition result and sending it to the terminal, so as to instruct the terminal to generate an evaluation result of the utterance event according to the recognition feedback result.
Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present invention, the computer-executable instructions are not limited to the method operations described above; they may also perform related operations of the server-side speech recognition method provided by any embodiment of the present invention. For an introduction to the storage medium, refer to the content of Embodiment Seven.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them; it may also include other equivalent embodiments without departing from the inventive concept, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A speech recognition method, characterized by comprising:
when an utterance event is triggered, receiving the voice signal acquired while the user performs the utterance event, sent by the microphone, together with an image signal containing the lips;
performing feature extraction on the voice signal to generate a voice feature signal, and performing feature extraction on the image signal containing the lips to generate a lip-reading feature signal;
sending the voice feature signal and the lip-reading feature signal to a server side, so as to instruct the server side to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result and between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result, and, if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, to generate a recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal.
2. The method according to claim 1, characterized in that performing feature extraction on the voice signal to generate the voice feature signal comprises:
extracting voice feature parameters from the voice signal;
performing a dimensionality-reduction transform on the voice feature parameters to obtain a pending voice feature signal;
enhancing the pending voice feature signal according to a speech enhancement algorithm to obtain the voice feature signal, the speech enhancement algorithm comprising a cepstral mean subtraction algorithm;
and that performing feature extraction on the image signal containing the lips to generate the lip-reading feature signal comprises:
performing feature extraction on the image signal containing the lips according to a lip feature extraction algorithm to obtain a lip image signal, the lip feature extraction algorithm comprising at least one of a template-based feature extraction algorithm or an image-pixel-based feature extraction algorithm;
performing mouth-shape contour feature extraction on the lip image signal according to a mouth-shape contour feature extraction algorithm to obtain the lip-reading feature signal, the mouth-shape contour feature extraction algorithm comprising at least one of a deformable template algorithm or a Snakes algorithm.
3. A speech recognition method, characterized by comprising:
receiving the voice feature signal and the lip-reading feature signal sent by a terminal;
performing matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result;
performing matching analysis between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result;
if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generating a recognition feedback result according to the voice recognition result and sending the recognition feedback result to the terminal, so as to instruct the terminal to generate an evaluation result of the utterance event according to the recognition feedback result.
4. The method according to claim 3, characterized in that, before receiving the voice feature signal and the lip-reading feature signal sent by the terminal, the method further comprises:
establishing a pre-trained speech recognition model and a pre-trained lip-reading recognition model;
correspondingly, performing matching analysis between the voice feature signal and the preset voice signal to generate the voice recognition result comprises:
calling, according to the voice feature signal, the preset voice signal in the pre-trained speech recognition model for matching analysis, generating the voice recognition result;
and performing matching analysis between the lip-reading feature signal and the preset lip-reading signal to generate the lip-reading recognition result comprises:
calling, according to the lip-reading feature signal, the preset lip-reading signal in the pre-trained lip-reading recognition model for matching analysis, generating the lip-reading recognition result.
5. The method according to claim 4, characterized in that establishing the pre-trained lip-reading recognition model comprises:
obtaining, from a standard database, a set number of groups of training data generated while users perform utterance events, the training data comprising image signals containing the lips and the corresponding character signals;
performing feature extraction on the image signals containing the lips to generate first lip-reading feature signals and, taking the first lip-reading feature signals as input variables and the corresponding character signals as output variables, training a preset mathematical model according to a machine learning algorithm using the first lip-reading feature signals and the corresponding character signals, generating the pre-trained lip-reading recognition model.
6. The method according to any one of claims 3-5, characterized by further comprising:
if the similarity between the voice recognition result and the lip-reading recognition result is less than the similarity threshold, obtaining current context information;
adjusting the voice recognition result and the lip-reading recognition result according to the current context information until the similarity between the adjusted voice recognition result and the adjusted lip-reading recognition result is greater than or equal to the similarity threshold.
7. A speech recognition apparatus, characterized by comprising:
a voice and image signal acquisition module, configured to receive, when an utterance event is triggered, the voice signal acquired while the user performs the utterance event, sent by the microphone, together with an image signal containing the lips;
a feature signal generation module, configured to perform feature extraction on the voice signal to generate a voice feature signal, and to perform feature extraction on the image signal containing the lips to generate a lip-reading feature signal;
a feature signal sending module, configured to send the voice feature signal and the lip-reading feature signal to a server side, so as to instruct the server side to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result and between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result, and, if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, to generate a recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal.
8. A speech recognition apparatus, characterized by comprising:
a feature signal receiving module, configured to receive the voice feature signal and the lip-reading feature signal sent by a terminal;
a voice recognition result generation module, configured to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result;
a lip-reading recognition result generation module, configured to perform matching analysis between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result;
a recognition feedback result sending module, configured to, if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generate a recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal, so as to instruct the terminal to generate an evaluation result of the utterance event according to the recognition feedback result.
9. A device, characterized by comprising:
one or more processors; and
a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the speech recognition method according to any one of claims 1-6.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the speech recognition method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810113879.9A CN108346427A (en) | 2018-02-05 | 2018-02-05 | Voice recognition method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810113879.9A CN108346427A (en) | 2018-02-05 | 2018-02-05 | Voice recognition method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108346427A true CN108346427A (en) | 2018-07-31 |
Family
ID=62958885
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810113879.9A Pending CN108346427A (en) | 2018-02-05 | 2018-02-05 | Voice recognition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108346427A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101281745A (en) * | 2008-05-23 | 2008-10-08 | 深圳市北科瑞声科技有限公司 | Interactive system for vehicle-mounted voice |
CN102298443A (en) * | 2011-06-24 | 2011-12-28 | 华南理工大学 | Smart home voice control system combined with video channel and control method thereof |
CN102324035A (en) * | 2011-08-19 | 2012-01-18 | 广东好帮手电子科技股份有限公司 | Method and system of applying lip posture assisted speech recognition technique to vehicle navigation |
CN106157956A (en) * | 2015-03-24 | 2016-11-23 | 中兴通讯股份有限公司 | The method and device of speech recognition |
CN107578459A (en) * | 2017-08-31 | 2018-01-12 | 北京麒麟合盛网络技术有限公司 | Expression is embedded in the method and device of candidates of input method |
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110837758B (en) * | 2018-08-17 | 2023-06-02 | 杭州海康威视数字技术股份有限公司 | Keyword input method and device and electronic equipment |
CN110837758A (en) * | 2018-08-17 | 2020-02-25 | 杭州海康威视数字技术股份有限公司 | Keyword input method and device and electronic equipment |
WO2020043007A1 (en) * | 2018-08-27 | 2020-03-05 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method, system, and computer-readable medium for purifying voice using depth information |
US11842745B2 (en) | 2018-08-27 | 2023-12-12 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method, system, and computer-readable medium for purifying voice using depth information |
CN109448711A (en) * | 2018-10-23 | 2019-03-08 | 珠海格力电器股份有限公司 | Voice recognition method and device and computer storage medium |
CN109637521A (en) * | 2018-10-29 | 2019-04-16 | 深圳壹账通智能科技有限公司 | Deep learning-based lip reading recognition method and device |
CN109377995A (en) * | 2018-11-20 | 2019-02-22 | 珠海格力电器股份有限公司 | Method and device for controlling equipment |
CN109583359B (en) * | 2018-11-26 | 2023-10-24 | 北京小米移动软件有限公司 | Method, apparatus, electronic device, and machine-readable storage medium for recognizing expression content |
CN109583359A (en) * | 2018-11-26 | 2019-04-05 | 北京小米移动软件有限公司 | Display content recognition method and device, electronic equipment, and machine-readable storage medium |
CN109697976B (en) * | 2018-12-14 | 2021-05-25 | 北京葡萄智学科技有限公司 | Pronunciation recognition method and device |
CN109697976A (en) * | 2018-12-14 | 2019-04-30 | 北京葡萄智学科技有限公司 | Pronunciation recognition method and device |
CN111326152A (en) * | 2018-12-17 | 2020-06-23 | 南京人工智能高等研究院有限公司 | Voice control method and device |
CN109872714A (en) * | 2019-01-25 | 2019-06-11 | 广州富港万嘉智能科技有限公司 | Method for improving speech recognition accuracy, electronic equipment and storage medium |
CN111724786A (en) * | 2019-03-22 | 2020-09-29 | 上海博泰悦臻网络技术服务有限公司 | Lip language identification system and method |
CN109817201B (en) * | 2019-03-29 | 2021-03-26 | 北京金山安全软件有限公司 | Language learning method and device, electronic equipment and readable storage medium |
CN109817201A (en) * | 2019-03-29 | 2019-05-28 | 北京金山安全软件有限公司 | Language learning method and device, electronic equipment and readable storage medium |
CN109961789A (en) * | 2019-04-30 | 2019-07-02 | 张玄武 | Service equipment based on video and voice interaction |
CN109961789B (en) * | 2019-04-30 | 2023-12-01 | 张玄武 | Service equipment based on video and voice interaction |
WO2020232867A1 (en) * | 2019-05-21 | 2020-11-26 | 平安科技(深圳)有限公司 | Lip-reading recognition method and apparatus, computer device, and storage medium |
CN110276259A (en) * | 2019-05-21 | 2019-09-24 | 平安科技(深圳)有限公司 | Lip reading recognition method, device, computer equipment and storage medium |
CN110276259B (en) * | 2019-05-21 | 2024-04-02 | 平安科技(深圳)有限公司 | Lip language identification method, device, computer equipment and storage medium |
CN110232911A (en) * | 2019-06-13 | 2019-09-13 | 南京地平线集成电路有限公司 | Singing-along recognition method, device, storage medium and electronic equipment |
CN110727854A (en) * | 2019-08-21 | 2020-01-24 | 北京奇艺世纪科技有限公司 | Data processing method and device, electronic equipment and computer readable storage medium |
CN110727854B (en) * | 2019-08-21 | 2022-07-12 | 北京奇艺世纪科技有限公司 | Data processing method and device, electronic equipment and computer readable storage medium |
CN110765868A (en) * | 2019-09-18 | 2020-02-07 | 平安科技(深圳)有限公司 | Lip reading model generation method, device, equipment and storage medium |
CN110719436A (en) * | 2019-10-17 | 2020-01-21 | 浙江同花顺智能科技有限公司 | Conference document information acquisition method and device and related equipment |
CN111028842B (en) * | 2019-12-10 | 2021-05-11 | 上海芯翌智能科技有限公司 | Method and equipment for triggering voice interaction response |
CN111028842A (en) * | 2019-12-10 | 2020-04-17 | 上海芯翌智能科技有限公司 | Method and equipment for triggering voice interaction response |
CN111221987A (en) * | 2019-12-30 | 2020-06-02 | 秒针信息技术有限公司 | Hybrid audio tagging method and apparatus |
CN111462733B (en) * | 2020-03-31 | 2024-04-16 | 科大讯飞股份有限公司 | Multi-modal speech recognition model training method, device, equipment and storage medium |
CN111462733A (en) * | 2020-03-31 | 2020-07-28 | 科大讯飞股份有限公司 | Multi-modal speech recognition model training method, device, equipment and storage medium |
CN111738100A (en) * | 2020-06-01 | 2020-10-02 | 广东小天才科技有限公司 | Mouth shape-based voice recognition method and terminal equipment |
WO2021223765A1 (en) * | 2020-06-01 | 2021-11-11 | 青岛海尔洗衣机有限公司 | Voice recognition method, voice recognition system and electrical device |
CN111738100B (en) * | 2020-06-01 | 2024-08-23 | 广东小天才科技有限公司 | Voice recognition method based on mouth shape and terminal equipment |
CN111966321A (en) * | 2020-08-24 | 2020-11-20 | Oppo广东移动通信有限公司 | Volume adjusting method, AR device and storage medium |
CN112927688A (en) * | 2021-01-25 | 2021-06-08 | 思必驰科技股份有限公司 | Voice interaction method and system for vehicle |
CN113611287B (en) * | 2021-06-29 | 2023-09-12 | 深圳大学 | Pronunciation error correction method and system based on machine learning |
CN113611287A (en) * | 2021-06-29 | 2021-11-05 | 深圳大学 | Pronunciation error correction method and system based on machine learning |
CN113506578A (en) * | 2021-06-30 | 2021-10-15 | 中汽创智科技有限公司 | Voice and image matching method and device, storage medium and equipment |
CN113660501A (en) * | 2021-08-11 | 2021-11-16 | 云知声(上海)智能科技有限公司 | Method and device for matching subtitles |
CN113961676A (en) * | 2021-09-14 | 2022-01-21 | 电信科学技术第五研究所有限公司 | Voice event extraction method based on deep learning classification combination |
CN114141249A (en) * | 2021-12-02 | 2022-03-04 | 河南职业技术学院 | Teaching voice recognition optimization method and system |
CN114974246A (en) * | 2022-06-15 | 2022-08-30 | 上海传英信息技术有限公司 | Processing method, intelligent terminal and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108346427A (en) | Voice recognition method, device, equipment and storage medium | |
US10699699B2 (en) | Constructing speech decoding network for numeric speech recognition | |
CN110265040B (en) | Voiceprint model training method and device, storage medium and electronic equipment | |
Czyzewski et al. | An audio-visual corpus for multimodal automatic speech recognition | |
US10235994B2 (en) | Modular deep learning model | |
CN107799126B (en) | Voice endpoint detection method and device based on supervised machine learning | |
CN105940407B (en) | System and method for assessing the intensity of audio password | |
CN111292764A (en) | Identification system and identification method | |
US20170178666A1 (en) | Multi-speaker speech separation | |
US20150325240A1 (en) | Method and system for speech input | |
CN113421547B (en) | Voice processing method and related equipment | |
CN112233698A (en) | Character emotion recognition method and device, terminal device and storage medium | |
CN106250400A (en) | Audio data processing method, device and system |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
WO2023207541A1 (en) | Speech processing method and related device | |
CN109785846A (en) | Role recognition method and device for monophonic voice data |
CN110955818A (en) | Searching method, searching device, terminal equipment and storage medium | |
EP1141943B1 (en) | Speaker recognition using spectrogram correlation | |
CN110781329A (en) | Image searching method and device, terminal equipment and storage medium | |
CN110211609A (en) | Method for improving speech recognition accuracy |
CN109961152B (en) | Personalized interaction method and system of virtual idol, terminal equipment and storage medium | |
JP6996627B2 (en) | Information processing equipment, control methods, and programs | |
CN108847251A (en) | Voice deduplication method, device, server and storage medium |
CN112652309A (en) | Dialect voice conversion method, device, equipment and storage medium | |
CN115104151A (en) | Offline voice recognition method and device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180731 |