CN108346427A - Voice recognition method, device, equipment and storage medium - Google Patents
- Publication number
- CN108346427A (application CN201810113879.9A)
- Authority
- CN
- China
- Prior art keywords
- signal
- voice
- recognition result
- lip
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Abstract
The invention discloses a voice recognition method, device, equipment and storage medium. The method comprises the following steps: when a sounding event is triggered, receiving a voice signal and an image signal containing lips, sent by a microphone and collected from the user in the process of executing the sounding event; carrying out feature extraction on the voice signal to generate a speech feature signal, and carrying out feature extraction on the image signal containing lips to generate a lip-language feature signal; and sending the speech feature signal and the lip-language feature signal to a server to instruct the server to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result, and to perform matching analysis on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result. If the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, a recognition feedback result is generated according to the speech recognition result and sent to the terminal. The embodiment of the invention improves the speech recognition rate.
Description
Technical field
Embodiments of the present invention relate to speech recognition technology, and in particular to a voice recognition method, device, equipment and storage medium.
Background technology
With the arrival of the electronic information era, mobile devices have become increasingly common, as have peripherals for mobile terminals, such as children's tablet computers and microphones. The functions achievable with such equipment are also increasingly rich: for example, a microphone can be connected to a mobile terminal, and language learning or singing can be carried out according to the content displayed on the terminal. In this process, the microphone records the user's voice in real time and uploads it to the mobile terminal, where the corresponding speech recognition is performed to obtain a speech recognition result, and an evaluation of the language learning or singing is then given according to that result.
The most critical factor in the above process is the accuracy of the speech recognition result, and relying on speech recognition technology alone may not improve the speech recognition rate any further.
Invention content
The present invention provides a voice recognition method, device, equipment and storage medium, so as to improve the speech recognition rate.
In a first aspect, an embodiment of the present invention provides a voice recognition method, including:
when a sounding event is triggered, receiving a voice signal and an image signal containing lips, sent by a microphone and collected from the user while the sounding event is performed;
performing feature extraction on the voice signal to generate a speech feature signal, and performing feature extraction on the image signal containing lips to generate a lip-language feature signal; and
sending the speech feature signal and the lip-language feature signal to a server, so as to instruct the server to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result, and to perform matching analysis on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; if the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, the server generates a recognition feedback result according to the speech recognition result and sends the recognition feedback result to the terminal.
Further, performing feature extraction on the voice signal to generate the speech feature signal includes:
extracting speech feature parameters from the voice signal;
performing a dimensionality-reduction transform on the speech feature parameters to obtain a speech feature signal to be processed; and
performing enhancement processing on the speech feature signal to be processed according to a speech enhancement algorithm to obtain the speech feature signal, the speech enhancement algorithm including a cepstral mean subtraction algorithm.
Performing feature extraction on the image signal containing lips to generate the lip-language feature signal includes:
performing feature extraction on the image signal containing lips according to a lip feature extraction algorithm to obtain a lip image signal, the lip feature extraction algorithm including at least one of a template-based feature extraction algorithm or an image-pixel-based feature extraction algorithm; and
performing mouth-shape contour feature extraction on the lip image signal according to a mouth-shape contour feature extraction algorithm to obtain the lip-language feature signal, the mouth-shape contour feature extraction algorithm including at least one of a deformable template algorithm or the Snakes algorithm.
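The extraction, reduction and enhancement chain on the speech side can be sketched roughly as follows. This is an illustrative reconstruction, not code from the patent: the framing, the simplified real-cepstrum features and the PCA projection, along with every function name and parameter value, are assumptions made for the example; cepstral mean subtraction is the one enhancement step actually named in the claim.

```python
import numpy as np

def cepstral_features(signal, frame_len=256, n_coeffs=12):
    """Frame the signal and compute real cepstra per frame, a simplified
    stand-in for LPCC/MFCC speech-feature-parameter extraction."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectrum = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    return np.fft.irfft(np.log(spectrum + 1e-10), axis=1)[:, :n_coeffs]

def pca_reduce(features, n_dims=6):
    """Dimensionality-reduction transform: project the feature parameters
    onto their top principal components."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return features @ vt[:n_dims].T

def cepstral_mean_subtraction(features):
    """The claimed enhancement step: subtracting the per-dimension mean
    suppresses stationary channel effects."""
    return features - features.mean(axis=0)

rng = np.random.default_rng(0)
voice = rng.standard_normal(4096)          # stand-in for a recorded voice signal
params = cepstral_features(voice)          # speech feature parameters
reduced = pca_reduce(params)               # speech feature signal to be processed
speech_feature_signal = cepstral_mean_subtraction(reduced)
```

Each function corresponds to one of the three claimed sub-steps; a real implementation would substitute proper LPCC or MFCC extraction for the simplified cepstra.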
In a second aspect, an embodiment of the present invention further provides a voice recognition method, including:
receiving a speech feature signal and a lip-language feature signal sent by a terminal;
performing matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result;
performing matching analysis on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; and
if the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generating a recognition feedback result according to the speech recognition result and sending the recognition feedback result to the terminal, so as to instruct the terminal to generate an evaluation result of the sounding event according to the recognition feedback result.
Further, before receiving the speech feature signal and the lip-language feature signal sent by the terminal, the method further includes:
establishing a pre-trained speech recognition model and a pre-trained lip-reading recognition model.
Correspondingly, performing matching analysis on the speech feature signal against the preset voice signal to generate the speech recognition result includes:
calling, according to the speech feature signal, the preset voice signal in the pre-trained speech recognition model to perform matching analysis and generate the speech recognition result;
and performing matching analysis on the lip-language feature signal against the preset lip-language signal to generate the lip-reading recognition result includes:
calling, according to the lip-language feature signal, the preset lip-language signal in the pre-trained lip-reading recognition model to perform matching analysis and generate the lip-reading recognition result.
Further, establishing the pre-trained lip-reading recognition model includes:
obtaining, from a standard database, a set number of groups of training data generated by users while performing sounding events, the training data including image signals containing lips and the corresponding letter signals; and
performing feature extraction on the image signals containing lips to generate first lip-language feature signals, and, taking the first lip-language feature signals as input variables and the corresponding letter signals as output variables, training a preset mathematical model according to a machine learning algorithm using the first lip-language feature signals and the corresponding letter signals to generate the pre-trained lip-reading recognition model.
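The patent does not specify which machine learning algorithm or mathematical model is used, so the training step can only be sketched under assumptions. In the hypothetical example below, a simple nearest-centroid classifier stands in for the preset mathematical model, and synthetic two-dimensional vectors stand in for the lip-language feature signals and letter labels from the standard database:

```python
import numpy as np

class NearestCentroidLipModel:
    """Minimal stand-in for the trained lip-reading model: stores one
    centroid per letter, learned from (lip feature, letter) training pairs."""
    def fit(self, features, letters):
        self.labels = sorted(set(letters))
        self.centroids = np.stack(
            [features[[l == label for l in letters]].mean(axis=0)
             for label in self.labels])
        return self

    def predict(self, feature):
        # Matching analysis: pick the letter whose centroid is nearest.
        distances = np.linalg.norm(self.centroids - feature, axis=1)
        return self.labels[int(np.argmin(distances))]

# Synthetic stand-in for the standard-database training data:
# letter "a" clusters near (1, 0), letter "b" near (0, 1).
rng = np.random.default_rng(1)
letters = ["a", "b"] * 20
features = np.stack(
    [(np.array([1.0, 0.0]) if l == "a" else np.array([0.0, 1.0]))
     + 0.1 * rng.standard_normal(2)
     for l in letters])

model = NearestCentroidLipModel().fit(features, letters)
```

A production system would replace both the toy features and the classifier with the feature extraction and machine learning algorithm chosen in practice; the input/output structure (lip features in, letter signals out) is what the claim fixes.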
Further, the method also includes:
if the similarity between the speech recognition result and the lip-reading recognition result is less than the similarity threshold, obtaining current context information; and
adjusting the speech recognition result and the lip-reading recognition result according to the current context information, until the similarity between the adjusted speech recognition result and lip-reading recognition result is greater than or equal to the similarity threshold.
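A minimal sketch of this fallback, under two assumptions the patent does not make explicit: that the current context information can be approximated by a list of expected words, and that both recognizers can supply ranked candidate hypotheses to adjust between:

```python
from difflib import SequenceMatcher

def adjust_with_context(speech_candidates, lip_candidates, context_words,
                        threshold=0.9):
    """Re-rank candidate hypotheses from both recognizers by how many
    context words they contain, then return the first pair whose mutual
    similarity reaches the threshold, or None if no pair qualifies."""
    def context_score(text):
        return sum(word in text for word in context_words)
    for speech in sorted(speech_candidates, key=context_score, reverse=True):
        for lip in sorted(lip_candidates, key=context_score, reverse=True):
            if SequenceMatcher(None, speech, lip).ratio() >= threshold:
                return speech, lip
    return None

adjusted = adjust_with_context(["sin a son", "sing a song"],
                               ["ring along", "sing a song"],
                               ["sing", "song"])
```

Here the context words steer both recognizers toward the hypothesis pair that agrees, which is the effect the claim describes; the scoring and similarity functions are illustrative choices only.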
In a third aspect, an embodiment of the present invention further provides a speech recognition device, including:
a voice and image signal acquisition module, configured to, when a sounding event is triggered, receive a voice signal and an image signal containing lips, sent by a microphone and collected from the user while the sounding event is performed;
a feature signal generation module, configured to perform feature extraction on the voice signal to generate a speech feature signal, and to perform feature extraction on the image signal containing lips to generate a lip-language feature signal; and
a feature signal sending module, configured to send the speech feature signal and the lip-language feature signal to a server, so as to instruct the server to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result and on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result, and, if the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, to generate a recognition feedback result according to the speech recognition result and send the recognition feedback result to the terminal.
In a fourth aspect, an embodiment of the present invention further provides a speech recognition device, including:
a feature signal receiving module, configured to receive a speech feature signal and a lip-language feature signal sent by a terminal;
a speech recognition result generation module, configured to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result;
a lip-reading recognition result generation module, configured to perform matching analysis on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; and
a recognition feedback result sending module, configured to, if the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generate a recognition feedback result according to the speech recognition result and send the recognition feedback result to the terminal, so as to instruct the terminal to generate an evaluation result of the sounding event according to the recognition feedback result.
In a fifth aspect, an embodiment of the present invention further provides an equipment, including:
one or more processors; and
a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the voice recognition method as described above.
In a sixth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, the program implementing the voice recognition method as described above when executed by a processor.
In the present invention, when a sounding event is triggered, the voice signal and the image signal containing lips collected from the user while performing the sounding event are received from the microphone; feature extraction is performed on the voice signal to generate a speech feature signal and on the image signal containing lips to generate a lip-language feature signal; the speech feature signal and the lip-language feature signal are sent to the server, which is instructed to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result and on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; and, if the similarity between the two results is greater than or equal to a similarity threshold, a recognition feedback result is generated according to the speech recognition result and sent to the terminal. This solves the prior-art problem that speech recognition performed by speech recognition technology alone yields a low speech recognition rate, and thereby improves the speech recognition rate.
Description of the drawings
Fig. 1 is a flow chart of a voice recognition method in Embodiment 1 of the present invention;
Fig. 2 is a flow chart of a voice recognition method in Embodiment 2 of the present invention;
Fig. 3 is a flow chart of a voice recognition method in Embodiment 3 of the present invention;
Fig. 4 is a structural schematic diagram of a speech recognition device in Embodiment 4 of the present invention;
Fig. 5 is a structural schematic diagram of a speech recognition device in Embodiment 5 of the present invention;
Fig. 6 is a structural schematic diagram of an equipment in Embodiment 6 of the present invention.
Specific embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention and not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment 1
Fig. 1 is a flow chart of a voice recognition method provided by Embodiment 1 of the present invention. This embodiment is applicable to improving the speech recognition rate. The method may be executed by a speech recognition device, which may be implemented in software and/or hardware and configured in a terminal, typically a mobile phone or a tablet computer. As shown in Fig. 1, the method specifically includes the following steps:
S110: When a sounding event is triggered, receive a voice signal and an image signal containing lips, sent by the microphone and collected from the user while the sounding event is performed.
In a specific embodiment of the present invention, a sounding event may denote sound generated during a learning or entertainment activity carried out using functions of the terminal, such as text reading practice or singing. Usually, an evaluation result is obtained in the course of such an activity; this evaluation result can indicate how well the current activity is going, which requires that the voice signal be collected, features extracted, and matching recognition performed during the process. Typically, a microphone is connected to the terminal, an external microphone collects the voice signal, and the collected voice signal is uploaded to the terminal, where it is further analyzed and processed to obtain a speech recognition result. Understandably, to make the evaluation result more accurate, the key is to ensure the accuracy of the speech recognition result, and that accuracy is strongly affected by the environment: for example, if the user's surroundings are noisy while the sounding event is performed, the final speech recognition result will clearly be affected to some degree. On this basis, to further improve the accuracy of the speech recognition result, lip-reading recognition technology can be combined with speech recognition technology. Lip-reading recognition is a technology that integrates machine vision with natural language processing and can interpret speech content directly from images of a person talking. Since lip reading is not affected by the acoustic environment or noise, it can greatly improve the accuracy of the speech recognition result in noisy environments. To combine the two, a camera can be added to the microphone, so that the microphone not only collects the voice signal generated by the user during the sounding event but also collects the generated image signal, which contains at least the image of the lip area of the face. Of course, to better identify changes in the mouth shape, the image signal may also include images of other parts of the face, because changes in mouth shape are sometimes correlated with changes in facial expression. It should also be noted that, to capture the change of the user's mouth shape during the sounding event, the camera can be set to video recording mode, i.e., video images are collected; image analysis can later be performed on the video images so that the image signal can be further processed to obtain a lip-reading recognition result. Furthermore, before the sounding event is triggered, the terminal needs to establish a communication connection with the microphone; the connection may be wired, such as a USB data cable, or wireless, such as Bluetooth or WiFi. The specific connection mode can of course be set according to the actual situation and is not specifically limited here.
Illustratively, after the terminal and the microphone establish a wireless communication connection, the user opens the "×× K Song" karaoke application on the terminal and can select a song of interest from a recommendation list or a search list, for example "Palette". Meanwhile, the voice input function of the microphone and the video recording function of the camera are turned on, and the user can then start singing according to the prompts. The microphone collects the voice signal and the image signal in real time, where the image signal contains at least the image of the lip area of the face. After the song is finished, the real-time collection of the voice signal and the image signal stops, and the signals are uploaded to the terminal for subsequent analysis and processing.
S120: Perform feature extraction on the voice signal to generate a speech feature signal, and perform feature extraction on the image signal containing lips to generate a lip-language feature signal.
In a specific embodiment of the present invention, after receiving the voice signal and the image signal containing lips that the microphone collected while the user performed the sounding event, the terminal starts feature extraction on both signals. Feature extraction can be characterized as follows: the amount of data in the original signal is very large — in other words, the signal samples lie in a high-dimensional space — and the samples can be represented in a space of reduced dimension through a mapping (or transformation). The mapped features are combinations of the original features, so feature extraction is, broadly speaking, a kind of transformation. The purpose of feature extraction is to uncover the intrinsic properties of the signal and then extract them by digital signal processing means, reducing the dimensionality of the feature vectors and the complexity of the recognition system. During feature extraction, one must consider not only whether the features express the voice signal completely and accurately, but also that the coupling between feature parameters should be as small as possible, that the features should be robust in noisy environments, and that they should be easy to compute; this makes the design and training of the processing models for the voice signal and the image signal simple and efficient.
Specifically, algorithms such as LPCC (Linear Prediction Cepstrum Coefficient), MFCC (Mel Frequency Cepstrum Coefficient), HMM (Hidden Markov Model) and DTW (Dynamic Time Warping) may be used to perform feature extraction on the voice signal. Correspondingly, algorithms such as deformable templates, ASM (Active Shape Model), AAM (Active Appearance Model), PCA (Principal Component Analysis), DCT (Discrete Cosine Transform) and Snakes may be used to perform feature extraction on the image signal containing lips.
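Of the image-side algorithms listed, the DCT is the simplest to illustrate. The following sketch is illustrative only: the ROI size, the number of retained coefficients and the orthonormal DCT construction are all assumptions, not details from the patent. It compresses a cropped lip region into a low-frequency coefficient vector:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)
    return basis

def lip_dct_features(roi, keep=4):
    """Separable 2-D DCT of a square lip region of interest; the top-left
    (low-frequency) keep x keep block serves as a compact feature vector."""
    d = dct_matrix(roi.shape[0])
    coeffs = d @ roi @ d.T
    return coeffs[:keep, :keep].ravel()

rng = np.random.default_rng(2)
lip_roi = rng.random((16, 16))   # stand-in for a cropped grayscale lip image
features = lip_dct_features(lip_roi)
```

In practice the lip ROI would first be located by one of the contour methods named above (ASM, AAM, Snakes) before DCT compression; this sketch covers only the compression step.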
S130: Send the speech feature signal and the lip-language feature signal to the server, so as to instruct the server to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result and on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; if the similarity between the speech recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, a recognition feedback result is generated according to the speech recognition result and sent to the terminal.
In a specific embodiment of the present invention, the terminal sends the speech feature signal and the lip-language feature signal to the server, so that the server can carry out further recognition analysis on the received signals to obtain a speech recognition result and a lip-reading recognition result, and compare the similarity of the two. When the similarity is greater than or equal to the similarity threshold, the speech recognition result can be considered relatively accurate; a corresponding recognition feedback result can then be generated according to the speech recognition result and sent to the terminal, and after receiving the recognition feedback result the terminal can compare it with the reference result it has determined for the sounding event and then give an evaluation result for the sounding event. It should be noted that the similarity threshold can be set according to the actual situation and is not specifically limited here. The evaluation result may be expressed as a score or as a grade; the specific representation can likewise be set according to the actual situation and is also not specifically limited here. In addition, the speech recognition result and the lip-reading recognition result mentioned here can denote the textual expression determined from the speech feature signal or the lip-language feature signal, i.e., the content represented by the sounding event can be judged from the recognition result.
Illustratively, the terminal sends the speech feature signal and the lip-language feature signal to the server; through matching analysis against the preset voice signal, the server determines that the speech recognition result is "because of the pain, it is youth", and the lip-reading recognition result is likewise "because of the pain, it is youth". The server compares whether the similarity between the two is greater than or equal to the set similarity threshold of 90%; clearly the similarity exceeds 90%, so the speech recognition result "because of the pain, it is youth" can be considered relatively correct. A corresponding recognition feedback result is generated according to the speech recognition result and sent to the terminal; the terminal, after receiving the recognition feedback result, compares it with the reference result "because of the pain, it is youth" it has determined for the sounding event, and then gives the sounding event an evaluation result of 100 points.
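The threshold comparison in this example can be sketched with an ordinary string-similarity measure such as the ratio from Python's difflib; the patent does not specify how similarity is computed, so this metric, like the sample lyric strings, is an assumption for illustration:

```python
from difflib import SequenceMatcher

def recognition_feedback(speech_result, lip_result, threshold=0.9):
    """If the two recognition results are similar enough, return the speech
    result as the recognition feedback; otherwise return None to signal
    that context-based adjustment would be needed."""
    similarity = SequenceMatcher(None, speech_result, lip_result).ratio()
    return speech_result if similarity >= threshold else None

feedback = recognition_feedback("because of the pain, it is youth",
                                "because of the pain, it is youth")
```

Identical results give a ratio of 1.0, which clears the 90% threshold, so the speech recognition result is fed back; dissimilar results fall below the threshold and yield no feedback.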
Compared with the prior art, in which the sounding-event evaluation performed by the microphone and the terminal can only be carried out by means of speech recognition, lip-reading recognition is realized here by adding a camera to the microphone; lip-reading recognition can thus be used to assist and correct the speech recognition result, further improving the accuracy of speech recognition.
In the technical solution of this embodiment, when a sounding event is triggered, the voice signal and the image signal containing lips collected from the user while performing the sounding event are received from the microphone; feature extraction is performed on the voice signal to generate a speech feature signal and on the image signal containing lips to generate a lip-language feature signal; the speech feature signal and the lip-language feature signal are sent to the server, which is instructed to perform matching analysis on the speech feature signal against a preset voice signal to generate a speech recognition result and on the lip-language feature signal against a preset lip-language signal to generate a lip-reading recognition result; and, if the similarity between the two results is greater than or equal to the similarity threshold, a recognition feedback result is generated according to the speech recognition result and sent to the terminal. This solves the prior-art problem of a low speech recognition rate caused by relying on speech recognition technology alone, and improves the speech recognition rate.
Further, on the basis of the above technical solution, performing feature extraction on the voice signal to generate the voice feature signal may specifically include:
performing voice feature parameter extraction on the voice signal to obtain voice feature parameters;
performing a dimensionality-reduction transform on the voice feature parameters to obtain a pending voice feature signal;
performing enhancement processing on the pending voice feature signal according to a voice enhancement algorithm to obtain the voice feature signal, the voice enhancement algorithm including a cepstral mean subtraction algorithm.
In a specific embodiment of the present invention, feature extraction is performed on the voice signal that the terminal receives from the microphone, collected while the sounding event is performed. First, voice feature parameter extraction is performed on the voice signal to obtain voice feature parameters; illustratively, LPCC or MFCC may be used to extract voice feature parameters from the voice signal. Then, a dimensionality-reduction transform is applied to the voice feature parameters; illustratively, PCA or HMM may be used to transform the voice feature parameters into a pending voice feature signal. Finally, enhancement processing may be applied to the pending voice feature signal according to a voice enhancement algorithm to obtain the voice feature signal, where processing the pending voice feature signal with the voice enhancement algorithm can also achieve the purpose of reducing noise interference; illustratively, cepstral mean subtraction may be used to enhance the voice signal. Of course, it should be understood that which algorithm is specifically used can be set according to actual conditions and is not specifically limited here.
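As a minimal sketch of the enhancement step named above, cepstral mean subtraction simply removes the per-coefficient mean from a sequence of cepstral feature vectors, which cancels stationary channel effects. The feature matrix below is invented for illustration; real input would be MFCC or LPCC frames extracted from the voice signal:

```python
import numpy as np

def cepstral_mean_subtraction(features: np.ndarray) -> np.ndarray:
    """Subtract the per-coefficient mean over all frames (rows = frames)."""
    return features - features.mean(axis=0, keepdims=True)

# Toy example: 3 frames of 2 cepstral coefficients each.
frames = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
normalized = cepstral_mean_subtraction(frames)
print(normalized.mean(axis=0))  # ~[0. 0.] after normalization
```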
Performing feature extraction on the image signal containing the lips to generate the lip-reading feature signal may specifically include:
performing feature extraction on the image signal containing the lips according to a lip feature extraction algorithm to obtain a lip image signal, the lip feature extraction algorithm including at least one of a template-based feature extraction algorithm or an image-pixel-based feature extraction algorithm;
performing mouth-shape contour feature extraction on the lip image signal according to a mouth-shape contour feature extraction algorithm to obtain the lip-reading feature signal, the mouth-shape contour feature extraction algorithm including at least one of a deformable template algorithm or the Snake algorithm.
In a specific embodiment of the present invention, since lip extraction occupies a very important position in lip reading recognition, the choice of feature vector directly affects the lip reading recognition rate. The most important property of feature extraction is repeatability: the input image is generally smoothed by Gaussian blur in scale space, after which one or more features of the image are computed through local derivative operations. A lip feature extraction algorithm can be used to extract the lip image signal from the image signal containing the lips. Illustratively, the lip feature extraction algorithm includes at least one of a template-based feature extraction algorithm or an image-pixel-based feature extraction algorithm. The template-based feature extraction algorithm, also called a top-down algorithm, mainly establishes a model of the contours of the inner and outer lips and describes the relevant lip contour information with a set of parameters. This set of parameters, or a linear combination of the parameters, is used to describe the visual features of the lips; such methods usually need to presuppose which visual features are important. Specifically, they can be divided into three kinds: algorithms based on model points, algorithms based on active contour models, and algorithms based on deformable models. The image-pixel-based feature extraction algorithm, also called a bottom-up algorithm, obtains a feature vector directly from the entire grayscale image containing the lips, using lip images that have undergone several preprocessing steps. Specifically, it can be divided into three kinds: direct pixel algorithms, vector quantization algorithms, and PCA. After extraction of the lip image signal is completed, the mouth-shape contour features need to be further extracted; a mouth-shape contour feature extraction algorithm can be used to extract the lip-reading feature signal from the lip image signal. Illustratively, the mouth-shape contour feature extraction algorithm includes at least one of a deformable template algorithm or the Snake algorithm. The deformable template algorithm approaches the lip contour with multiple parametric curves and combines these curves into a template; then, under certain constraint conditions, an optimization method brings the curves close to the best-fitting lip position, so that the curve parameters reflect the mouth-shape variation and thereby describe the lip movement. The deformable model algorithm is not affected by lip deformation, rotation, or scaling and can portray the lip shape well; to represent the mouth shape, it extracts a mouth-shape template using the outer lip contour together with the width and height of the lips. The Snake algorithm can describe the mouth-shape contour well: several points are placed on the lips, and constraint conditions are then used to detect these points. Also, it should be understood that which algorithm is specifically used can be set according to actual conditions and is not specifically limited here.
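As a minimal illustration of the template idea described above, the sketch below reduces a set of detected lip contour points to a width/height mouth-shape descriptor. The point list and the descriptor fields are assumptions for illustration, not the patent's exact parameterization:

```python
def mouth_shape_template(points):
    """Reduce lip contour points (x, y) to a simple width/height template."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return {"width": max(xs) - min(xs), "height": max(ys) - min(ys)}

# Toy contour: four extreme points of an open mouth.
contour = [(10, 50), (90, 50), (50, 30), (50, 70)]
print(mouth_shape_template(contour))  # {'width': 80, 'height': 40}
```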
Embodiment two
Fig. 2 is a flowchart of a speech recognition method provided by Embodiment 2 of the present invention. This embodiment is applicable to the case of improving the speech recognition rate. The method can be executed by a speech recognition device, which may be implemented in software and/or hardware and can be configured at the server side, typically a server or the like. As shown in Fig. 2, the method specifically includes the following steps:
S210, receiving the voice feature signal and the lip-reading feature signal sent by the terminal;
In a specific embodiment of the present invention, the server side can receive the voice feature signal and the lip-reading feature signal sent by the terminal, so that the server side can perform further recognition analysis on the voice feature signal and the lip-reading feature signal.
S220, performing matching analysis between the voice feature signal and the preset voice signal to generate a voice recognition result;
In a specific embodiment of the present invention, matching analysis is performed between the voice feature signal and the preset voice signal to obtain the voice recognition result, where the preset voice signal can be a preset voice signal in a pre-trained speech recognition model, i.e., a pre-trained speech recognition model is established in advance. More specifically, training data of a set number of groups, generated by users while performing sounding events, can be obtained from a standard database, each item of training data including a voice signal and a corresponding text signal. Feature extraction is performed on the voice signal to generate a voice feature signal; taking the voice feature signal as the input variable and the corresponding text signal as the output variable, a preset mathematical model is trained with the voice feature signals and corresponding text signals according to a machine learning algorithm to generate the pre-trained speech recognition model. The standard database described here can denote a database that stores the voice signals collected by devices with a voice input function, such as microphones, from at least two users while performing various sounding events, together with the corresponding text signals; that is, the database contains a large number of diverse voice signals and corresponding text signals, and the number of users generating the voice signals is at least two. Relatively speaking, the more users the better, and the age, regional, and occupational distributions of the users should be as broad as possible, so that the database established is more representative and the robustness of the model established on its basis is better. The training data of the set number of groups can be groups formed from the training data of the same user or groups formed from the training data of different users, which can be set according to actual conditions and is not specifically limited here. Preferably, the training data of the set number of groups are groups formed from the training data of different users; the advantage of such a setting is that the performance of the model established on this basis can be made better. Each item of training data includes a voice signal and a corresponding text signal, i.e., each group of training data is a data pair formed by a voice signal and its corresponding text signal. The same algorithms as above may be used to perform feature extraction on the voice signal to obtain the voice feature signal, which is not repeated here. The preset mathematical model is trained according to a machine learning algorithm to generate the pre-trained speech recognition model, where the machine learning algorithm may include a neural network algorithm, an ant colony algorithm, and the like, which can be set specifically according to actual conditions and is not specifically limited here.
Since the preset voice signal comes from the pre-trained speech recognition model, it has a corresponding text signal; matching analysis between the voice feature signal and the preset voice signal then generates the voice recognition result, which can include the text signal corresponding to the voice feature signal. Specifically, the matching analysis, which could also be called a recognition analysis process, may be performed by splitting the voice feature signal and the preset voice signal into keywords. More specifically, the voice feature signal and the preset voice signal can be split into multiple keywords according to a preset keyword library, and the part of speech of each keyword is marked. It is then determined whether the parts of speech of adjacent keywords in the voice feature signal match. When the parts of speech of adjacent keywords do not match, the mismatched keyword is taken as a first keyword, and it is determined whether the first keyword exists in a preset confusable-sound dictionary. When the unmatched keyword exists in the confusable-sound dictionary, the second keyword corresponding to the first keyword in the confusable-sound dictionary is determined, and the first keyword is replaced with the second keyword; when the parts of speech of the replaced second keyword and its adjacent keywords match, the replaced second keyword is recombined with the other keywords. The recombined voice feature signal then undergoes keyword matching analysis with the preset voice signal to obtain the voice recognition result.
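The keyword-correction step above can be sketched as follows. This is a minimal illustration under simplifying assumptions: keywords are given as (word, part-of-speech) pairs, "matching" means two adjacent parts of speech appear in an allowed-pairs set, and the confusable-sound dictionary maps a mis-heard word to its replacement; none of these data structures are specified by the patent:

```python
ALLOWED_PAIRS = {("noun", "verb"), ("verb", "noun"), ("adj", "noun")}
CONFUSABLE = {"seen": ("scene", "noun")}  # mis-heard word -> (replacement, pos)

def correct_keywords(keywords):
    """Replace a keyword whose part of speech clashes with its neighbor,
    using the confusable-sound dictionary, then recombine the keywords."""
    fixed = list(keywords)
    for i in range(len(fixed) - 1):
        pair = (fixed[i][1], fixed[i + 1][1])
        if pair not in ALLOWED_PAIRS and fixed[i + 1][0] in CONFUSABLE:
            fixed[i + 1] = CONFUSABLE[fixed[i + 1][0]]
    return " ".join(word for word, _ in fixed)

# "verb verb" clashes, so "seen" is replaced via the confusable dictionary.
print(correct_keywords([("watch", "verb"), ("seen", "verb")]))  # "watch scene"
```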
S230, performing matching analysis between the lip-reading feature signal and the preset lip reading signal to generate a lip reading recognition result;
In a specific embodiment of the present invention, likewise, matching analysis is performed between the lip-reading feature signal and the preset lip reading signal to obtain the lip reading recognition result, where the preset lip reading signal can be a preset lip reading signal in a pre-trained lip reading recognition model, i.e., a pre-trained lip reading recognition model is established in advance. Since the preset lip reading signal comes from the pre-trained lip reading recognition model, it has a corresponding text signal; matching analysis between the lip-reading feature signal and the preset lip reading signal then generates the lip reading recognition result, which can include the text signal corresponding to the lip-reading feature signal. Since both the lip-reading feature signal and the preset lip reading signal are formed from mouth-shape contours, the lip-reading feature signal can be divided in the manner in which the mouth-shape contour of each frame in the image and the mouth-shape contour of the previous frame determine the mouth-shape contour output, and then compared and analyzed in sequence with the preset lip reading signal to obtain the recognition result.
S240, if the similarity between the voice recognition result and the lip reading recognition result is greater than or equal to the similarity threshold, generating a recognition feedback result according to the voice recognition result and sending the recognition feedback result to the terminal, so as to instruct the terminal to generate an evaluation result of the sounding event according to the recognition feedback result.
In a specific embodiment of the present invention, in order to improve the accuracy of speech recognition, the voice recognition result is compared with the lip reading recognition result for similarity. When the similarity of the two is greater than or equal to the similarity threshold, the voice recognition result can be considered relatively accurate; the server side can then generate a corresponding recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal. After receiving the recognition feedback result, the terminal can compare it with the result of the sounding event determined by the terminal itself, and then give the evaluation result of the sounding event.
In the technical solution of this embodiment, the voice feature signal and the lip-reading feature signal sent by the terminal are received; matching analysis is performed between the voice feature signal and the preset voice signal to generate a voice recognition result, and matching analysis is performed between the lip-reading feature signal and the preset lip reading signal to generate a lip reading recognition result; if the similarity between the voice recognition result and the lip reading recognition result is greater than or equal to the similarity threshold, a recognition feedback result is generated according to the voice recognition result and sent to the terminal, so as to instruct the terminal to generate an evaluation result of the sounding event according to the recognition feedback result. This solves the problem in the prior art that relying solely on speech recognition technology leads to a low speech recognition rate, and achieves an improved speech recognition rate.
Further, on the basis of the above technical solution, before receiving the voice feature signal and the lip-reading feature signal sent by the terminal, the method may also specifically include:
establishing a pre-trained speech recognition model and a pre-trained lip reading recognition model;
correspondingly, performing matching analysis between the voice feature signal and the preset voice signal to generate the voice recognition result may specifically include:
calling the preset voice signal in the pre-trained speech recognition model according to the voice feature signal to perform matching analysis and generate the voice recognition result;
performing matching analysis between the lip-reading feature signal and the preset lip reading signal to generate the lip reading recognition result may specifically include:
calling the preset lip reading signal in the pre-trained lip reading recognition model according to the lip-reading feature signal to perform matching analysis and generate the lip reading recognition result.
In a specific embodiment of the present invention, the pre-trained speech recognition model and lip reading recognition model are used for subsequent speech recognition and lip reading recognition analysis. Matching analysis is performed between the voice feature signal and the preset voice signal in the pre-trained speech recognition model, which can in other words also be considered a recognition analysis, and the voice recognition result is thereby generated. Likewise, matching analysis is performed between the lip-reading feature signal and the preset lip reading signal in the pre-trained lip reading recognition model, which can also be considered a recognition analysis, and the lip reading recognition result is thereby generated. Both the voice recognition result and the lip reading recognition result include the corresponding text signal.
By performing matching analysis against the preset signals in a pre-trained recognition model, the recognition result of the signal is obtained; since the recognition model itself has good properties, such as robustness, the accuracy of the recognition result is also relatively high.
Further, on the basis of the above technical solution, establishing the pre-trained lip reading recognition model may specifically include:
obtaining, from a standard database, training data of a set number of groups generated by users while performing sounding events, the training data including image signals containing the lips and corresponding text signals;
performing feature extraction on the image signals containing the lips to generate first lip-reading feature signals; taking the first lip-reading feature signals as the input variable and the corresponding text signals as the output variable, training a preset mathematical model with the first lip-reading feature signals and corresponding text signals according to a machine learning algorithm to generate the pre-trained lip reading recognition model.
In a specific embodiment of the present invention, training data of a set number of groups generated by users while performing sounding events are obtained from a standard database, the training data including image signals containing the lips and corresponding text signals. Feature extraction is performed on the image signals containing the lips to generate first lip-reading feature signals; taking the first lip-reading feature signals as the input variable and the corresponding text signals as the output variable, a preset mathematical model is trained with the first lip-reading feature signals and corresponding text signals according to a machine learning algorithm to generate the pre-trained lip reading recognition model. The standard database described here can denote a database that stores the image signals containing the lips collected by devices with an image input function, such as cameras, from at least two users while performing various sounding events, together with the corresponding text signals. The number of users generating the image signals is at least two; relatively speaking, the more users the better, and the age, regional, and occupational distributions of the users should be as broad as possible, so that the database established is more representative and the robustness of the model established on its basis is better. The training data of the set number of groups can be groups formed from the training data of the same user or groups formed from the training data of different users, which can be set according to actual conditions and is not specifically limited here. Preferably, the training data of the set number of groups are groups formed from the training data of different users; the advantage of such a setting is that the performance of the model established on this basis can be made better. Each group of training data is a data pair formed by an image signal containing the lips and its corresponding text signal. The same algorithms as above may be used to perform feature extraction on the image signals containing the lips to obtain the lip-reading feature signals, which is not repeated here. The preset mathematical model is trained according to a machine learning algorithm to generate the pre-trained lip reading recognition model, where the machine learning algorithm may include a neural network algorithm, an ant colony algorithm, and the like, which can be set specifically according to actual conditions and is not specifically limited here.
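The input/output pairing described above can be sketched with a generic trainable model. The sketch below uses a nearest-neighbor lookup in place of the neural network or ant colony algorithm named in the text, purely to keep the example self-contained; the feature vectors and text labels are invented for illustration:

```python
class NearestNeighborRecognizer:
    """Toy stand-in for the pre-trained recognition model: maps a feature
    vector (input variable) to the text signal (output variable) of the
    closest training vector."""

    def __init__(self):
        self.pairs = []  # (feature_vector, text_signal)

    def train(self, features, texts):
        self.pairs = list(zip(features, texts))

    def recognize(self, feature):
        def dist(vec):
            return sum((a - b) ** 2 for a, b in zip(vec, feature))
        return min(self.pairs, key=lambda p: dist(p[0]))[1]

model = NearestNeighborRecognizer()
model.train([[0.0, 0.1], [0.9, 1.0]], ["closed mouth", "open mouth"])
print(model.recognize([0.8, 0.9]))  # "open mouth"
```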
Furthermore, it should be noted that the above process of establishing the pre-trained speech recognition model and lip reading recognition model is carried out at the server side; of course, the process can also be carried out at the terminal. The storage location of the established models is set at the server side, and therefore the subsequent matching analysis process is also carried out at the server side. The advantage of this setting is that the storage space occupied by the established models is usually large, which places high demands on the configuration of the device, while a terminal, such as a mobile terminal, has a relatively low configuration compared with the server side; storing models that occupy large storage space at the terminal would further increase its operating load, substantially reducing the operating speed of the terminal and the user experience. Based on the above, the storage location of the established models is set at the server side, and the subsequent matching analysis process is also carried out at the server side.
Further, on the basis of the above technical solution, the method may also specifically include:
if the similarity between the voice recognition result and the lip reading recognition result is less than the similarity threshold, obtaining current context information;
adjusting the voice recognition result and the lip reading recognition result according to the current context information, until the similarity between the adjusted voice recognition result and lip reading recognition result is greater than or equal to the similarity threshold.
In a specific embodiment of the present invention, the server side can perform further recognition analysis according to the received voice feature signal and lip-reading feature signal to obtain the voice recognition result and the lip reading recognition result, and compare the similarity between the voice recognition result and the lip reading recognition result. When the similarity of the two is less than the similarity threshold, it can be considered that the voice recognition result and the lip reading recognition result need to be adjusted; current context information can then be obtained, and the voice recognition result and the lip reading recognition result are adjusted accordingly on the basis of the obtained context information. This is because, in the processes of speech recognition and lip reading recognition, the correspondences between mouth shapes and pronunciations, and between pronunciations and words, are not simply one-to-one; there are usually multiple possible candidate results, and what is usually output is the result with the highest likelihood. Therefore, if the result with the highest likelihood does not meet the set condition, a new selection must be made; at this point, the identified correspondences between mouth shapes and words and between pronunciations and words can be adjusted according to the current context information. Illustratively, suppose there are 4 candidate results in the speech recognition process; when the result with the highest likelihood, such as result 2, does not meet the set condition, result 3 can be redetermined from the candidate results as the voice recognition result according to the context information. Meanwhile, suppose there are 5 candidate results in the lip reading recognition process; when the result with the highest likelihood, such as result 4, does not meet the set condition, result 2 can be redetermined from the candidate results as the lip reading recognition result according to the context information. The redetermined recognition results are then compared again; when the similarity of the two is greater than or equal to the similarity threshold, the voice recognition result can be considered relatively accurate and no further adjustment is needed. When the similarity of the two is still less than the similarity threshold, selection from the candidate results based on the context information continues, until the selected recognition results satisfy a similarity greater than or equal to the similarity threshold. Furthermore, it should be noted that if all candidate results fail to meet the above condition, the pair with the maximum similarity can be selected as the final recognition result.
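The candidate re-selection loop above can be sketched as follows. This is a minimal illustration assuming each recognizer exposes a ranked candidate list, with `difflib` again standing in for the unspecified similarity measure; it keeps the patent's fallback of returning the maximum-similarity pair when no pair meets the threshold:

```python
from difflib import SequenceMatcher
from itertools import product

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def reconcile(speech_candidates, lip_candidates, threshold=0.9):
    """Try candidate pairs in ranked order; fall back to the most similar pair."""
    best = max(product(speech_candidates, lip_candidates),
               key=lambda pair: similarity(*pair))
    for s, l in product(speech_candidates, lip_candidates):
        if similarity(s, l) >= threshold:
            return s, l
    return best  # no pair met the threshold: keep the maximum-similarity pair

print(reconcile(["cat", "cart"], ["card", "cart"]))  # ('cart', 'cart')
```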
Compared with the prior art, in which a sounding event evaluation performed by a microphone together with a terminal can only proceed by way of speech recognition, lip reading recognition is realized here by adding a camera to the microphone. In this way, lip reading recognition can be used to assist the speech recognition and to correct the voice recognition result, so that the accuracy of speech recognition can be further improved.
Embodiment three
Fig. 3 is a flowchart of a speech recognition method provided by Embodiment 3 of the present invention. This embodiment is applicable to the case of improving the speech recognition rate. The method can be executed by a speech recognition device, which may be implemented in software and/or hardware and can be configured in a device, typically a mobile phone, tablet computer, server, or the like. As shown in Fig. 3, the method specifically includes the following steps:
The microphone and the terminal establish a communication connection;
when a sounding event is triggered, the microphone collects the voice signal and the image signal containing the lips while the user performs the sounding event, and sends the voice signal and the image signal containing the lips to the terminal;
the terminal performs feature extraction on the voice signal to generate a voice feature signal, performs feature extraction on the image signal containing the lips to generate a lip-reading feature signal, and sends the voice feature signal and the lip-reading feature signal to the server side;
the server side performs matching analysis between the voice feature signal and the preset voice signal to generate a voice recognition result, and performs matching analysis between the lip-reading feature signal and the preset lip reading signal to generate a lip reading recognition result; if the similarity between the voice recognition result and the lip reading recognition result is less than the similarity threshold, current context information is obtained, and the voice recognition result and the lip reading recognition result are adjusted according to the current context information until the similarity between the adjusted voice recognition result and lip reading recognition result is greater than or equal to the similarity threshold; if the similarity between the voice recognition result and the lip reading recognition result is greater than or equal to the similarity threshold, a recognition feedback result is generated according to the voice recognition result and sent to the terminal;
the terminal generates the evaluation result of the sounding event according to the recognition feedback result.
In the technical solution of this embodiment, lip reading recognition is realized by adding a camera to the microphone, and lip reading recognition is used to assist the speech recognition and to correct the voice recognition result. This solves the problem in the prior art that relying solely on speech recognition technology leads to a low speech recognition rate, and achieves an improved speech recognition rate.
Embodiment four
Fig. 4 is a structural schematic diagram of a speech recognition device provided by Embodiment 4 of the present invention. This embodiment is applicable to the case of improving the speech recognition rate. The device may be implemented in software and/or hardware and can be configured in a device, typically a mobile phone, tablet computer, or the like. As shown in Fig. 4, the device specifically includes:
a voice and image signal acquisition module 410, configured to receive, when a sounding event is triggered, the voice signal collected while the user performs the sounding event and the image signal containing the lips, both sent by the microphone;
a feature signal generation module 420, configured to perform feature extraction on the voice signal to generate a voice feature signal, and to perform feature extraction on the image signal containing the lips to generate a lip-reading feature signal;
a feature signal sending module 430, configured to send the voice feature signal and the lip-reading feature signal to the server side, so as to instruct the server side to perform matching analysis between the voice feature signal and the preset voice signal to generate a voice recognition result and to perform matching analysis between the lip-reading feature signal and the preset lip reading signal to generate a lip reading recognition result, and, if the similarity between the voice recognition result and the lip reading recognition result is greater than or equal to the similarity threshold, to generate a recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal.
In the technical solution of this embodiment, when a sounding event is triggered, the voice and image signal acquisition module 410 receives the voice signal collected while the user performs the sounding event and the image signal containing the lips, both sent by the microphone; the feature signal generation module 420 performs feature extraction on the voice signal to generate a voice feature signal and performs feature extraction on the image signal containing the lips to generate a lip-reading feature signal; the feature signal sending module 430 sends the voice feature signal and the lip-reading feature signal to the server side, so as to instruct the server side to perform matching analysis between the voice feature signal and the preset voice signal to generate a voice recognition result and to perform matching analysis between the lip-reading feature signal and the preset lip reading signal to generate a lip reading recognition result, and, if the similarity between the voice recognition result and the lip reading recognition result is greater than or equal to the similarity threshold, to generate a recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal. This solves the problem in the prior art that performing speech recognition solely by speech recognition technology leads to a low speech recognition rate, and achieves an improved speech recognition rate.
Further, on the basis of the above technical solution, the feature signal generation module 420 may specifically include:
a voice feature parameter generation unit, configured to extract voice feature parameters from the voice signal;
a pending voice feature signal generation unit, configured to perform a dimensionality-reduction transform on the voice feature parameters to obtain a pending voice feature signal;
a voice feature signal generation unit, configured to enhance the pending voice feature signal according to a speech enhancement algorithm to obtain the voice feature signal, the speech enhancement algorithm including a cepstral mean subtraction algorithm;
a lip image signal generation unit, configured to perform feature extraction on the image signal containing the lips according to a lip feature extraction algorithm to obtain a lip image signal, the lip feature extraction algorithm including at least one of a template-based feature extraction algorithm or an image-pixel-based feature extraction algorithm;
a lip-reading feature signal generation unit, configured to perform mouth-shape contour feature extraction on the lip image signal according to a mouth-shape contour feature extraction algorithm to obtain the lip-reading feature signal, the mouth-shape contour feature extraction algorithm including at least one of a deformable template algorithm or a Snakes algorithm.
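The cepstral mean subtraction named above as the speech enhancement algorithm can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name and the toy frame matrix are assumptions, and the per-coefficient mean is subtracted over all frames to remove a constant channel offset.

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-coefficient mean over all frames from cepstral
    features -- a standard channel-normalization (enhancement) step.

    features: (num_frames, num_coeffs) array of cepstral coefficients.
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)

# Toy example: 4 frames of 3 cepstral coefficients sharing a constant
# channel offset; after CMS every column has (near-)zero mean.
frames = np.array([[1.0, 2.0, 3.0],
                   [1.5, 2.5, 3.5],
                   [0.5, 1.5, 2.5],
                   [1.0, 2.0, 3.0]])
normalized = cepstral_mean_subtraction(frames)
print(np.allclose(normalized.mean(axis=0), 0.0))  # True
```

In practice the input would be the pending voice feature signal produced by the dimensionality-reduction unit; the normalized array then serves as the voice feature signal sent to the server side.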
The speech recognition apparatus configured in a terminal provided by this embodiment of the present invention can execute the terminal-side speech recognition method provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing that method.
Embodiment five
Fig. 5 is a structural schematic diagram of a speech recognition apparatus provided by Embodiment Five of the present invention. This embodiment is applicable to the case of improving the speech recognition rate. The apparatus may be implemented in software and/or hardware, and may be configured in a device such as a server. As shown in Fig. 5, the apparatus specifically includes:
a feature signal receiving module 510, configured to receive the voice feature signal and the lip-reading feature signal sent by the terminal;
a voice recognition result generation module 520, configured to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result;
a lip-reading recognition result generation module 530, configured to perform matching analysis between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result;
a recognition feedback result sending module 540, configured to, if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generate a recognition feedback result according to the voice recognition result and send it to the terminal, so as to instruct the terminal to generate an evaluation result of the utterance event according to the recognition feedback result.
In the technical solution of this embodiment, the feature signal receiving module 510 receives the voice feature signal and the lip-reading feature signal sent by the terminal; the voice recognition result generation module 520 performs matching analysis between the voice feature signal and the preset voice signal to generate the voice recognition result, and the lip-reading recognition result generation module 530 performs matching analysis between the lip-reading feature signal and the preset lip-reading signal to generate the lip-reading recognition result; if the similarity between the two recognition results is greater than or equal to the similarity threshold, the recognition feedback result sending module 540 generates a recognition feedback result according to the voice recognition result and sends it to the terminal, instructing the terminal to generate an evaluation result of the utterance event according to the recognition feedback result. This solves the problem in the prior art that relying on speech recognition technology alone leads to a low speech recognition rate, and thereby improves the speech recognition rate.
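The server-side decision described above — gating the feedback on the similarity between the two recognition results — can be sketched as follows. The patent does not specify a similarity measure or a threshold value, so `difflib.SequenceMatcher` and the 0.8 threshold here are purely illustrative assumptions.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # illustrative; the patent leaves the value open

def similarity(a, b):
    """Similarity ratio in [0, 1] between two recognized text strings."""
    return SequenceMatcher(None, a, b).ratio()

def recognition_feedback(voice_result, lip_result, threshold=SIMILARITY_THRESHOLD):
    """If the two modalities agree, generate the feedback result from the
    voice recognition result (as the patent prescribes); otherwise None."""
    if similarity(voice_result, lip_result) >= threshold:
        return voice_result
    return None

print(recognition_feedback("hello world", "hello world"))  # hello world
print(recognition_feedback("hello world", "goodbye"))      # None
```

When the function returns `None`, the apparatus would fall back to the context-based adjustment described further below in this embodiment.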
Further, on the basis of the above technical solution, the apparatus may also specifically include:
an identification model establishing module, configured to establish a pre-trained speech recognition model and a pre-trained lip-reading recognition model.
Correspondingly, the voice recognition result generation module 520 may specifically include:
a voice recognition result generation unit, configured to call, according to the voice feature signal, the preset voice signal in the pre-trained speech recognition model for matching analysis, generating the voice recognition result.
The lip-reading recognition result generation module 530 may specifically include:
a lip-reading recognition result generation unit, configured to call, according to the lip-reading feature signal, the preset lip-reading signal in the pre-trained lip-reading recognition model for matching analysis, generating the lip-reading recognition result.
Further, on the basis of the above technical solution, the identification model establishing module may specifically include:
a training data generation unit, configured to obtain, from a standard database, a set number of groups of training data generated while users perform utterance events, the training data including image signals containing the lips and the corresponding character signals;
a lip-reading recognition model establishing unit, configured to perform feature extraction on the image signals containing the lips to generate first lip-reading feature signals and, taking the first lip-reading feature signals as input variables and the corresponding character signals as output variables, to train a preset mathematical model according to a machine learning algorithm using the first lip-reading feature signals and the corresponding character signals, generating the pre-trained lip-reading recognition model.
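The training step described above — lip-reading feature signals as input variables, the corresponding character signals as output variables, fed into a preset mathematical model via a machine learning algorithm — might be sketched as follows. The patent does not name the model, so a toy nearest-centroid classifier stands in for it; the class name, feature vectors, and labels are all illustrative assumptions.

```python
import numpy as np

class NearestCentroidLipModel:
    """Toy stand-in for the patent's 'preset mathematical model': stores one
    centroid per output label and predicts the label of the nearest centroid."""

    def fit(self, features, labels):
        # features: (n_samples, n_dims) lip-reading feature vectors (inputs)
        # labels:   length-n_samples character labels (outputs)
        self.labels_ = sorted(set(labels))
        self.centroids_ = np.array([
            np.mean([f for f, y in zip(features, labels) if y == lab], axis=0)
            for lab in self.labels_
        ])
        return self

    def predict(self, feature):
        dists = np.linalg.norm(self.centroids_ - np.asarray(feature), axis=1)
        return self.labels_[int(np.argmin(dists))]

# Illustrative training data: 2-D lip features labeled with characters.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
y = ["a", "a", "b", "b"]
model = NearestCentroidLipModel().fit(X, y)
print(model.predict([0.05, 0.02]))  # a
print(model.predict([1.0, 0.95]))   # b
```

In the patent's terms, `fit` corresponds to training the preset mathematical model on the first lip-reading feature signals and their character signals, and the fitted object is the pre-trained lip-reading recognition model.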
Further, on the basis of the above technical solution, the apparatus may also specifically include:
a current context data obtaining module, configured to obtain current context information if the similarity between the voice recognition result and the lip-reading recognition result is less than the similarity threshold;
a recognition result adjustment module, configured to adjust the voice recognition result and the lip-reading recognition result according to the current context information until the similarity between the adjusted voice recognition result and the adjusted lip-reading recognition result is greater than or equal to the similarity threshold.
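The adjustment step described above is not specified further in the patent; one illustrative reading is re-scoring candidate hypotheses from the two recognizers using a context keyword until a pair clears the threshold. Everything below — the candidate lists, the context bonus, and the function name — is an assumption for the sketch, not the patent's method.

```python
from difflib import SequenceMatcher

def best_context_pair(voice_candidates, lip_candidates, context, threshold=0.8):
    """Pick the (voice, lip) candidate pair that agrees across modalities,
    biased toward hypotheses containing the context keyword; returns None
    if no pair reaches the threshold."""
    best, best_score = None, 0.0
    for v in voice_candidates:
        for l in lip_candidates:
            score = SequenceMatcher(None, v, l).ratio()
            if context in v:  # current context information biases the choice
                score += 0.1
            if score >= threshold and score > best_score:
                best, best_score = (v, l), score
    return best

pair = best_context_pair(["play music", "pay music"],
                         ["play music", "play mosaic"],
                         context="play")
print(pair)  # ('play music', 'play music')
```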
The speech recognition apparatus configured at the server side provided by this embodiment of the present invention can execute the server-side speech recognition method provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing that method.
Embodiment six
Fig. 6 is a structural schematic diagram of a device provided by Embodiment Six of the present invention; it shows a block diagram of an exemplary device 612 suitable for implementing embodiments of the present invention. The device 612 shown in Fig. 6 is only an example and should not impose any restriction on the function and scope of use of the embodiments of the present invention.
As shown in Fig. 6, the device 612 takes the form of a general-purpose computing device. The components of the device 612 may include, but are not limited to: one or more processors 616, a system memory 628, and a bus 618 connecting the different system components (including the system memory 628 and the processors 616).
The bus 618 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The device 612 typically comprises a variety of computer-system-readable media. These media can be any available media accessible by the device 612, including volatile and non-volatile media and removable and non-removable media.
The system memory 628 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 630 and/or cache memory 632. The device 612 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 634 may be used to read and write non-removable, non-volatile magnetic media (not shown in Fig. 6 and commonly referred to as a "hard disk drive"). Although not shown in Fig. 6, a magnetic disk drive for reading and writing removable non-volatile magnetic disks (e.g., "floppy disks") and an optical disk drive for reading and writing removable non-volatile optical disks (e.g., CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 618 via one or more data media interfaces. The memory 628 may include at least one program product having a set of (e.g., at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 640 having a set of (at least one) program modules 642 may be stored, for example, in the memory 628. Such program modules 642 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 642 generally perform the functions and/or methods of the embodiments described in the present invention.
The device 612 may also communicate with one or more external devices 614 (e.g., a keyboard, a pointing device, a display 624, etc.), with one or more devices that enable a user to interact with the device 612, and/or with any device (e.g., a network card, a modem, etc.) that enables the device 612 to communicate with one or more other computing devices. Such communication may take place via input/output (I/O) interfaces 622. Moreover, the device 612 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 620. As shown, the network adapter 620 communicates with the other modules of the device 612 via the bus 618. It should be understood that, although not shown in Fig. 6, other hardware and/or software modules may be used in conjunction with the device 612, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 616, by running programs stored in the system memory 628, executes various functional applications and data processing, for example implementing the terminal-side speech recognition method provided by the embodiments of the present invention, including:
when an utterance event is triggered, receiving the voice signal acquired while the user performs the utterance event, sent by the microphone, together with an image signal containing the lips;
performing feature extraction on the voice signal to generate a voice feature signal, and performing feature extraction on the image signal containing the lips to generate a lip-reading feature signal;
sending the voice feature signal and the lip-reading feature signal to the server side, so as to instruct the server side to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result and between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result; if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, a recognition feedback result is generated according to the voice recognition result and sent to the terminal.
The embodiment of the present invention also provides another device, comprising: one or more processors; and a storage apparatus for storing one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors implement the server-side speech recognition method provided by the embodiments of the present invention, including:
receiving the voice feature signal and the lip-reading feature signal sent by the terminal;
performing matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result;
performing matching analysis between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result;
if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generating a recognition feedback result according to the voice recognition result and sending it to the terminal, so as to instruct the terminal to generate an evaluation result of the utterance event according to the recognition feedback result.
Of course, those skilled in the art will understand that the processor can also implement the technical solution of the server-side speech recognition method provided by any embodiment of the present invention. For the hardware structure and functions of the server side, refer to the content of Embodiment Six.
Embodiment seven
Embodiment Seven of the present invention also provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the terminal-side speech recognition method provided by the embodiments of the present invention, the method including:
when an utterance event is triggered, receiving the voice signal acquired while the user performs the utterance event, sent by the microphone, together with an image signal containing the lips;
performing feature extraction on the voice signal to generate a voice feature signal, and performing feature extraction on the image signal containing the lips to generate a lip-reading feature signal;
sending the voice feature signal and the lip-reading feature signal to the server side, so as to instruct the server side to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result and between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result; if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, a recognition feedback result is generated according to the voice recognition result and sent to the terminal.
The computer storage medium of the embodiments of the present invention may employ any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by, or in connection with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, wireless, wireline, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The embodiment of the present invention also provides another computer-readable storage medium whose computer-executable instructions, when executed by a computer processor, perform a server-side speech recognition method, the method including:
receiving the voice feature signal and the lip-reading feature signal sent by the terminal;
performing matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result;
performing matching analysis between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result;
if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generating a recognition feedback result according to the voice recognition result and sending it to the terminal, so as to instruct the terminal to generate an evaluation result of the utterance event according to the recognition feedback result.
Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present invention, the computer-executable instructions are not limited to the method operations described above; they may also perform related operations of the server-side speech recognition method provided by any embodiment of the present invention. For an introduction to the storage medium, refer to the content of Embodiment Seven.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them; it may also include other equivalent embodiments without departing from the inventive concept, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A speech recognition method, characterized by comprising:
when an utterance event is triggered, receiving the voice signal acquired while the user performs the utterance event, sent by the microphone, together with an image signal containing the lips;
performing feature extraction on the voice signal to generate a voice feature signal, and performing feature extraction on the image signal containing the lips to generate a lip-reading feature signal;
sending the voice feature signal and the lip-reading feature signal to a server side, so as to instruct the server side to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result and between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result, and, if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, to generate a recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal.
2. The method according to claim 1, characterized in that performing feature extraction on the voice signal to generate the voice feature signal comprises:
extracting voice feature parameters from the voice signal;
performing a dimensionality-reduction transform on the voice feature parameters to obtain a pending voice feature signal;
enhancing the pending voice feature signal according to a speech enhancement algorithm to obtain the voice feature signal, the speech enhancement algorithm comprising a cepstral mean subtraction algorithm;
and that performing feature extraction on the image signal containing the lips to generate the lip-reading feature signal comprises:
performing feature extraction on the image signal containing the lips according to a lip feature extraction algorithm to obtain a lip image signal, the lip feature extraction algorithm comprising at least one of a template-based feature extraction algorithm or an image-pixel-based feature extraction algorithm;
performing mouth-shape contour feature extraction on the lip image signal according to a mouth-shape contour feature extraction algorithm to obtain the lip-reading feature signal, the mouth-shape contour feature extraction algorithm comprising at least one of a deformable template algorithm or a Snakes algorithm.
3. A speech recognition method, characterized by comprising:
receiving the voice feature signal and the lip-reading feature signal sent by a terminal;
performing matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result;
performing matching analysis between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result;
if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generating a recognition feedback result according to the voice recognition result and sending the recognition feedback result to the terminal, so as to instruct the terminal to generate an evaluation result of the utterance event according to the recognition feedback result.
4. The method according to claim 3, characterized in that, before receiving the voice feature signal and the lip-reading feature signal sent by the terminal, the method further comprises:
establishing a pre-trained speech recognition model and a pre-trained lip-reading recognition model;
correspondingly, performing matching analysis between the voice feature signal and the preset voice signal to generate the voice recognition result comprises:
calling, according to the voice feature signal, the preset voice signal in the pre-trained speech recognition model for matching analysis, generating the voice recognition result;
and performing matching analysis between the lip-reading feature signal and the preset lip-reading signal to generate the lip-reading recognition result comprises:
calling, according to the lip-reading feature signal, the preset lip-reading signal in the pre-trained lip-reading recognition model for matching analysis, generating the lip-reading recognition result.
5. The method according to claim 4, characterized in that establishing the pre-trained lip-reading recognition model comprises:
obtaining, from a standard database, a set number of groups of training data generated while users perform utterance events, the training data comprising image signals containing the lips and the corresponding character signals;
performing feature extraction on the image signals containing the lips to generate first lip-reading feature signals and, taking the first lip-reading feature signals as input variables and the corresponding character signals as output variables, training a preset mathematical model according to a machine learning algorithm using the first lip-reading feature signals and the corresponding character signals, generating the pre-trained lip-reading recognition model.
6. The method according to any one of claims 3-5, characterized by further comprising:
if the similarity between the voice recognition result and the lip-reading recognition result is less than the similarity threshold, obtaining current context information;
adjusting the voice recognition result and the lip-reading recognition result according to the current context information until the similarity between the adjusted voice recognition result and the adjusted lip-reading recognition result is greater than or equal to the similarity threshold.
7. A speech recognition apparatus, characterized by comprising:
a voice and image signal acquisition module, configured to receive, when an utterance event is triggered, the voice signal acquired while the user performs the utterance event, sent by the microphone, together with an image signal containing the lips;
a feature signal generation module, configured to perform feature extraction on the voice signal to generate a voice feature signal, and to perform feature extraction on the image signal containing the lips to generate a lip-reading feature signal;
a feature signal sending module, configured to send the voice feature signal and the lip-reading feature signal to a server side, so as to instruct the server side to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result and between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result, and, if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, to generate a recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal.
8. A speech recognition apparatus, characterized by comprising:
a feature signal receiving module, configured to receive the voice feature signal and the lip-reading feature signal sent by a terminal;
a voice recognition result generation module, configured to perform matching analysis between the voice feature signal and a preset voice signal to generate a voice recognition result;
a lip-reading recognition result generation module, configured to perform matching analysis between the lip-reading feature signal and a preset lip-reading signal to generate a lip-reading recognition result;
a recognition feedback result sending module, configured to, if the similarity between the voice recognition result and the lip-reading recognition result is greater than or equal to a similarity threshold, generate a recognition feedback result according to the voice recognition result and send the recognition feedback result to the terminal, so as to instruct the terminal to generate an evaluation result of the utterance event according to the recognition feedback result.
9. A device, characterized by comprising:
one or more processors; and
a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the speech recognition method according to any one of claims 1-6.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the speech recognition method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810113879.9A CN108346427A (en) | 2018-02-05 | 2018-02-05 | Voice recognition method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810113879.9A CN108346427A (en) | 2018-02-05 | 2018-02-05 | Voice recognition method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108346427A true CN108346427A (en) | 2018-07-31 |
Family
ID=62958885
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810113879.9A Pending CN108346427A (en) | 2018-02-05 | 2018-02-05 | Voice recognition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108346427A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101281745A (en) * | 2008-05-23 | 2008-10-08 | 深圳市北科瑞声科技有限公司 | Interactive system for vehicle-mounted voice |
CN102298443A (en) * | 2011-06-24 | 2011-12-28 | 华南理工大学 | Smart home voice control system combined with video channel and control method thereof |
CN102324035A (en) * | 2011-08-19 | 2012-01-18 | 广东好帮手电子科技股份有限公司 | Method and system of applying lip posture assisted speech recognition technique to vehicle navigation |
CN106157956A (en) * | 2015-03-24 | 2016-11-23 | 中兴通讯股份有限公司 | The method and device of speech recognition |
CN107578459A (en) * | 2017-08-31 | 2018-01-12 | 北京麒麟合盛网络技术有限公司 | Expression is embedded in the method and device of candidates of input method |
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110837758B (en) * | 2018-08-17 | 2023-06-02 | 杭州海康威视数字技术股份有限公司 | Keyword input method and device and electronic equipment |
CN110837758A (en) * | 2018-08-17 | 2020-02-25 | 杭州海康威视数字技术股份有限公司 | Keyword input method and device and electronic equipment |
WO2020043007A1 (en) * | 2018-08-27 | 2020-03-05 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method, system, and computer-readable medium for purifying voice using depth information |
US11842745B2 (en) | 2018-08-27 | 2023-12-12 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Method, system, and computer-readable medium for purifying voice using depth information |
CN109448711A (en) * | 2018-10-23 | 2019-03-08 | 珠海格力电器股份有限公司 | Voice recognition method and device and computer storage medium |
CN109637521A (en) * | 2018-10-29 | 2019-04-16 | 深圳壹账通智能科技有限公司 | Deep learning-based lip reading recognition method and device |
CN109377995A (en) * | 2018-11-20 | 2019-02-22 | 珠海格力电器股份有限公司 | Method and device for controlling equipment |
CN109583359B (en) * | 2018-11-26 | 2023-10-24 | 北京小米移动软件有限公司 | Method, apparatus, electronic device, and machine-readable storage medium for recognizing expression content |
CN109583359A (en) * | 2018-11-26 | 2019-04-05 | 北京小米移动软件有限公司 | Display content recognition method and device, electronic equipment, and machine-readable storage medium |
CN109697976B (en) * | 2018-12-14 | 2021-05-25 | 北京葡萄智学科技有限公司 | Pronunciation recognition method and device |
CN109697976A (en) * | 2018-12-14 | 2019-04-30 | 北京葡萄智学科技有限公司 | Pronunciation recognition method and device |
CN111326152A (en) * | 2018-12-17 | 2020-06-23 | 南京人工智能高等研究院有限公司 | Voice control method and device |
CN109872714A (en) * | 2019-01-25 | 2019-06-11 | 广州富港万嘉智能科技有限公司 | Method for improving speech recognition accuracy, electronic equipment and storage medium |
CN111724786A (en) * | 2019-03-22 | 2020-09-29 | 上海博泰悦臻网络技术服务有限公司 | Lip language identification system and method |
CN109817201B (en) * | 2019-03-29 | 2021-03-26 | 北京金山安全软件有限公司 | Language learning method and device, electronic equipment and readable storage medium |
CN109817201A (en) * | 2019-03-29 | 2019-05-28 | 北京金山安全软件有限公司 | Language learning method and device, electronic equipment and readable storage medium |
CN109961789A (en) * | 2019-04-30 | 2019-07-02 | 张玄武 | Service equipment based on video and voice interaction |
CN109961789B (en) * | 2019-04-30 | 2023-12-01 | 张玄武 | Service equipment based on video and voice interaction |
WO2020232867A1 (en) * | 2019-05-21 | 2020-11-26 | 平安科技(深圳)有限公司 | Lip-reading recognition method and apparatus, computer device, and storage medium |
CN110276259A (en) * | 2019-05-21 | 2019-09-24 | 平安科技(深圳)有限公司 | Lip reading recognition method, device, computer equipment and storage medium |
CN110276259B (en) * | 2019-05-21 | 2024-04-02 | 平安科技(深圳)有限公司 | Lip language identification method, device, computer equipment and storage medium |
CN110232911A (en) * | 2019-06-13 | 2019-09-13 | 南京地平线集成电路有限公司 | Singing-along recognition method, device, storage medium and electronic equipment |
CN110727854A (en) * | 2019-08-21 | 2020-01-24 | 北京奇艺世纪科技有限公司 | Data processing method and device, electronic equipment and computer readable storage medium |
CN110727854B (en) * | 2019-08-21 | 2022-07-12 | 北京奇艺世纪科技有限公司 | Data processing method and device, electronic equipment and computer readable storage medium |
CN110765868A (en) * | 2019-09-18 | 2020-02-07 | 平安科技(深圳)有限公司 | Lip reading model generation method, device, equipment and storage medium |
CN110719436A (en) * | 2019-10-17 | 2020-01-21 | 浙江同花顺智能科技有限公司 | Conference document information acquisition method and device and related equipment |
CN111028842B (en) * | 2019-12-10 | 2021-05-11 | 上海芯翌智能科技有限公司 | Method and equipment for triggering voice interaction response |
CN111028842A (en) * | 2019-12-10 | 2020-04-17 | 上海芯翌智能科技有限公司 | Method and equipment for triggering voice interaction response |
CN111221987A (en) * | 2019-12-30 | 2020-06-02 | 秒针信息技术有限公司 | Hybrid audio tagging method and apparatus |
CN111462733B (en) * | 2020-03-31 | 2024-04-16 | 科大讯飞股份有限公司 | Multi-modal speech recognition model training method, device, equipment and storage medium |
CN111462733A (en) * | 2020-03-31 | 2020-07-28 | 科大讯飞股份有限公司 | Multi-modal speech recognition model training method, device, equipment and storage medium |
CN111738100A (en) * | 2020-06-01 | 2020-10-02 | 广东小天才科技有限公司 | Mouth shape-based voice recognition method and terminal equipment |
WO2021223765A1 (en) * | 2020-06-01 | 2021-11-11 | 青岛海尔洗衣机有限公司 | Voice recognition method, voice recognition system and electrical device |
CN111738100B (en) * | 2020-06-01 | 2024-08-23 | 广东小天才科技有限公司 | Voice recognition method based on mouth shape and terminal equipment |
CN111966321A (en) * | 2020-08-24 | 2020-11-20 | Oppo广东移动通信有限公司 | Volume adjusting method, AR device and storage medium |
CN112927688A (en) * | 2021-01-25 | 2021-06-08 | 思必驰科技股份有限公司 | Voice interaction method and system for vehicle |
CN113611287B (en) * | 2021-06-29 | 2023-09-12 | 深圳大学 | Pronunciation error correction method and system based on machine learning |
CN113611287A (en) * | 2021-06-29 | 2021-11-05 | 深圳大学 | Pronunciation error correction method and system based on machine learning |
CN113506578A (en) * | 2021-06-30 | 2021-10-15 | 中汽创智科技有限公司 | Voice and image matching method and device, storage medium and equipment |
CN113660501A (en) * | 2021-08-11 | 2021-11-16 | 云知声(上海)智能科技有限公司 | Method and device for matching subtitles |
CN113961676A (en) * | 2021-09-14 | 2022-01-21 | 电信科学技术第五研究所有限公司 | Voice event extraction method based on deep learning classification combination |
CN114141249A (en) * | 2021-12-02 | 2022-03-04 | 河南职业技术学院 | Teaching voice recognition optimization method and system |
CN114974246A (en) * | 2022-06-15 | 2022-08-30 | 上海传英信息技术有限公司 | Processing method, intelligent terminal and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108346427A (en) | Voice recognition method, device, equipment and storage medium | |
US10699699B2 (en) | Constructing speech decoding network for numeric speech recognition | |
CN110265040B (en) | Voiceprint model training method and device, storage medium and electronic equipment | |
Czyzewski et al. | An audio-visual corpus for multimodal automatic speech recognition | |
US10235994B2 (en) | Modular deep learning model | |
CN107799126B (en) | Voice endpoint detection method and device based on supervised machine learning | |
CN105940407B (en) | System and method for assessing the intensity of audio password | |
CN111292764A (en) | Identification system and identification method | |
US20170178666A1 (en) | Multi-speaker speech separation | |
US20150325240A1 (en) | Method and system for speech input | |
CN113421547B (en) | Voice processing method and related equipment | |
CN112233698A (en) | Character emotion recognition method and device, terminal device and storage medium | |
CN106250400A (en) | Audio data processing method, device and system |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
WO2023207541A1 (en) | Speech processing method and related device | |
CN109785846A (en) | Role recognition method and device for monophonic voice data |
CN110955818A (en) | Searching method, searching device, terminal equipment and storage medium | |
EP1141943B1 (en) | Speaker recognition using spectrogram correlation | |
CN110781329A (en) | Image searching method and device, terminal equipment and storage medium | |
CN110211609A (en) | Method for improving speech recognition accuracy |
CN109961152B (en) | Personalized interaction method and system of virtual idol, terminal equipment and storage medium | |
JP6996627B2 (en) | Information processing equipment, control methods, and programs | |
CN108847251A (en) | Voice deduplication method, device, server and storage medium |
CN112652309A (en) | Dialect voice conversion method, device, equipment and storage medium | |
CN115104151A (en) | Offline voice recognition method and device, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180731 |