
CN114974221A - Speech recognition model training method and device and computer readable storage medium - Google Patents

Speech recognition model training method and device and computer readable storage medium

Info

Publication number
CN114974221A
CN114974221A
Authority
CN
China
Prior art keywords
text
voice
speaker
correct
wrong
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210465435.8A
Other languages
Chinese (zh)
Other versions
CN114974221B (en)
Inventor
胡洪涛
徐景成
朱耀磷
彭成高
刘莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Internet Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202210465435.8A priority Critical patent/CN114974221B/en
Publication of CN114974221A publication Critical patent/CN114974221A/en
Application granted granted Critical
Publication of CN114974221B publication Critical patent/CN114974221B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a speech recognition model training method and apparatus and a computer-readable storage medium. The scheme provided by the application comprises the following steps: acquiring feedback information from a user on the speech recognition output of a target speech recognition model, wherein the feedback information comprises a wrong text of the speech recognition and the correct text corresponding to the wrong text; acquiring the speaker voice features of the speech corresponding to the wrong text; determining an updated training sample and a corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text; and performing update training on the target speech recognition model based on the updated training sample and the corresponding label.

Description

Speech recognition model training method and device and computer readable storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for training a speech recognition model, and a computer-readable storage medium.
Background
Speech recognition is a technique for converting speech into text, and a good speech recognition model requires thousands of hours of corpus to train. An existing speech recognition system, once deployed, is generally not changed; when such a system is to be updated, the main approaches currently are: 1. purchasing data targeted at the recognition weaknesses, either by commissioning a data company to collect custom data or by directly buying an off-the-shelf database; 2. manually re-labeling the data on which recognition performs poorly and adding it back into model training.
The data obtained in these ways is limited in both speech duration and quantity, and the whole process is lengthy, incurring high time and monetary costs while yielding only a limited improvement in the recognition accuracy of the speech recognition model.
Disclosure of Invention
An embodiment of the present application provides a method and an apparatus for training a speech recognition model, and a computer-readable storage medium, to address the problems in existing speech recognition model training.
In order to solve the above technical problem, the present specification is implemented as follows:
in a first aspect, a method for training a speech recognition model is provided, including:
acquiring feedback information from a user on the speech recognition output of a target speech recognition model, wherein the feedback information comprises a wrong text of the speech recognition and the correct text corresponding to the wrong text;
acquiring the speaker voice features of the speech corresponding to the wrong text;
determining an updated training sample and a corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text;
and performing update training on the target speech recognition model based on the updated training sample and the corresponding label.
Optionally, determining the updated training sample and the corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text includes:
calculating the perplexity of the correct text corresponding to the wrong text;
screening out a first correct text whose perplexity exceeds a preset perplexity threshold, and the first wrong text corresponding to the first correct text;
performing speech synthesis based on arbitrary pairings of the first correct text with the speaker voice features of the speech corresponding to the first wrong text, to generate updated training samples;
and determining the label corresponding to an updated training sample based on the first correct text.
Optionally, the method further comprises:
crawling hotwords from a target network;
matching the hotwords against the training sample library of the target speech recognition model;
determining a hotword to be a new word if the matching is unsuccessful;
and the determining of the updated training sample and the corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text then includes:
determining the updated training sample and the corresponding label based on the new word, the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text.
Optionally, determining the updated training sample and the corresponding label based on the new word, the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text includes:
calculating the perplexity of the correct text corresponding to the wrong text;
screening out a first correct text whose perplexity exceeds a preset perplexity threshold, and the first wrong text corresponding to the first correct text;
performing speech synthesis based on arbitrary pairings of the first correct text or the text corresponding to the new word with the speaker voice features of the speech corresponding to the first wrong text, to generate updated training samples;
and determining the label corresponding to an updated training sample based on the first correct text or the text of the new word.
Optionally, the perplexity of the correct text corresponding to the wrong text is calculated by the following formula:

$$\mathrm{PPL}(S) = P(W_1, W_2, \ldots, W_k)^{-\frac{1}{k}} = \sqrt[k]{\frac{1}{P(W_1, W_2, \ldots, W_k)}}$$

where S denotes a target correct text corresponding to a target wrong text, k denotes the number of words included in the target correct text, and P(W_1, W_2, …, W_k) denotes the sentence probability of the target correct text, i.e., the joint probability of its words W_1 through W_k.
Optionally, before performing speech synthesis, the method further includes:
clustering speakers according to the speaker voice features of the speech corresponding to the first wrong texts;
determining the number of speakers included in each cluster set after clustering;
and screening out, for speech synthesis, the speaker voice features of the speech corresponding to the first wrong texts that fall in cluster sets whose number of speakers is lower than a preset number.
Optionally, clustering speakers according to the speaker voice features of the speech corresponding to the first wrong texts includes:
calculating the similarity between the target speaker voice feature of the speech corresponding to a target first wrong text and the voice feature of each speaker in the training sample library of the target speech recognition model;
and clustering the target speaker into the cluster set of the speaker with the highest voice-feature similarity.
Optionally, determining the updated training sample and the corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text includes:
clustering speakers according to the speaker voice features of the speech corresponding to the wrong texts;
determining the number of speakers included in each cluster set after clustering;
screening out the speaker voice features of the speech corresponding to second wrong texts that fall in cluster sets whose number of speakers is lower than a preset number;
performing speech synthesis based on the correct texts corresponding to the wrong texts and the speaker voice features of the speech corresponding to the second wrong texts, to generate updated training samples;
and determining the label corresponding to an updated training sample based on the correct text.
In a second aspect, there is provided a speech recognition model training apparatus comprising a memory and a processor electrically connected to the memory, the memory storing a computer program executable by the processor, the computer program, when executed by the processor, implementing the steps of the method according to the first aspect.
In a third aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to the first aspect.
In the embodiments of the application, user feedback information on the speech recognition output of a target speech recognition model is acquired, the feedback information comprising a wrong text produced by the speech recognition and the correct text corresponding to the wrong text; the speaker voice features of the speech corresponding to the wrong text are acquired; updated training samples and corresponding labels are determined based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text; and update training of the target speech recognition model is performed based on the updated training samples and the corresponding labels. In this way, speech that users report as poorly recognized is collected dynamically, speech synthesis is performed along two dimensions, text and speaker voice features, and the resulting updated training samples are added to the update training of the target speech recognition model in real time. This enables faster, more timely, and lower-cost corpus augmentation and model training, and improves the recognition accuracy of the speech recognition model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart illustrating a speech recognition model training method according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating a determination step of updating training samples and labels according to a first embodiment of the present application.
Fig. 3 is a flowchart illustrating a determination step of updating training samples and labels according to a second embodiment of the present application.
Fig. 4 is a flowchart illustrating a determination step of updating training samples and labels according to a third embodiment of the present application.
Fig. 5 is a block diagram showing a configuration of a speech recognition model training apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. The reference numbers in the present application are only used for distinguishing the steps in the scheme and are not used for limiting the execution sequence of the steps, and the specific execution sequence is described in the specification.
In order to solve the problems in the prior art, an embodiment of the present application provides a method for training a speech recognition model, and fig. 1 is a schematic flow diagram of the method for training a speech recognition model according to the embodiment of the present application.
As shown in fig. 1, the method comprises the following steps:
Step 102, acquiring feedback information from a user on the speech recognition output of a target speech recognition model, wherein the feedback information comprises a wrong text of the speech recognition and the correct text corresponding to the wrong text;
step 104, acquiring the speaker voice features of the speech corresponding to the wrong text;
step 106, determining an updated training sample and a corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text;
and step 108, performing update training on the target speech recognition model based on the updated training sample and the corresponding label.
In step 102, the target speech recognition model is a general-purpose speech recognition model, used to recognize the input speech of different users and output the recognized text to the corresponding users.
The feedback information is sent when a target user is not satisfied with the recognized text, i.e., when the speech recognition is inaccurate or only partially accurate; the user can annotate the incorrectly recognized text with the correct text. The feedback information thus comprises the wrong text output by the target speech recognition model and the user-annotated correct text corresponding to the wrong text.
The backend collects user feedback information in real time and stores the wrong speech-recognition texts and the corresponding annotated correct texts in a database. In this way, multiple wrong texts output by the target speech recognition model, together with their corresponding correct texts, can be acquired through the feedback information of different users or different feedback information of the same user.
In step 104, the speech corresponding to the wrong text is the speech of the user who fed back the target wrong text. The speaker voice features include voiceprint vector information, which is related to speaker attributes such as age, gender, regional accent, and timbre; it is a vector determined from this information as a whole and characterizes the speaker's pronunciation.
When the target user feeds back a wrong text and the corresponding correct text for the speech recognition of the target speech recognition model, the speaker voice features corresponding to the target user can additionally be acquired and stored in the database. In this way, the voice features of multiple different speakers whose speech the target speech recognition model misrecognized can be acquired through the feedback information of different users.
In step 106, the correct texts and the speaker voice features can be retrieved from the database, and the data along these two dimensions can be arbitrarily paired and combined for speech synthesis to generate updated training samples. For example, if the database contains correct text 1, correct text 2, …, correct text n and speaker voice feature 1, speaker voice feature 2, …, speaker voice feature m, the n correct texts and the m speaker voice features can be combined pairwise and synthesized into speech, generating multiple corpora, that is, multiple updated training samples for the target speech recognition model; the correct text used to generate a target updated training sample is the label of that sample.
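As a minimal sketch of this pairing-and-synthesis step (the synthesize callable stands for whatever TTS interface is available; its name and signature are assumptions, not something specified by the application):

```python
from itertools import product

def build_updated_samples(correct_texts, speaker_features, synthesize):
    """Pair every correct text with every speaker voice feature and
    synthesize speech, yielding (audio, label) updated training samples.

    correct_texts:    n corrected transcripts from user feedback
    speaker_features: m voiceprint vectors of feedback speakers
    synthesize:       assumed TTS callable (text, voiceprint) -> waveform
    """
    samples = []
    for text, voiceprint in product(correct_texts, speaker_features):
        audio = synthesize(text, voiceprint)  # n * m synthesized utterances
        samples.append((audio, text))         # the source text is the label
    return samples
```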
In fact, a speech recognition error fed back by a user is not necessarily caused by the target speech recognition model itself; it may stem from a sound-pickup problem of the user's device or from strong background noise at the time. To reduce wasted resources, it is not necessary to synthesize new corpora for every wrong text fed back by users. Instead, the misrecognized texts fed back by users can be analyzed, and speech generation is performed on the texts and speaker voice features only after the genuinely valuable texts have been screened out.
In one embodiment, as shown in fig. 2, step 106 of determining the updated training sample and the corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text includes:
step 202, calculating the perplexity of the correct text corresponding to the wrong text;
step 204, screening out a first correct text whose perplexity exceeds a preset perplexity threshold, and the first wrong text corresponding to the first correct text;
step 206, performing speech synthesis based on arbitrary pairings of the first correct text with the speaker voice features of the speech corresponding to the first wrong text, to generate updated training samples;
and step 208, determining the label corresponding to the updated training sample based on the first correct text.
In this embodiment, in order to select texts accurately, the correct texts fed back by users for misrecognized speech are filtered by calculating their perplexity (PPL).
Optionally, the perplexity of the correct text corresponding to the wrong text is calculated by the following formula (1):
$$\mathrm{PPL}(S) = P(W_1, W_2, \ldots, W_k)^{-\frac{1}{k}} = \sqrt[k]{\frac{1}{P(W_1, W_2, \ldots, W_k)}} \tag{1}$$

where S denotes a target correct text corresponding to a target wrong text, k denotes the number of words included in the target correct text, and P(W_1, W_2, …, W_k) denotes the sentence probability of the target correct text.
For the sentence S corresponding to the target correct text, consisting of the word sequence W_1, W_2, …, W_k, the sentence probability factorizes by the chain rule:

$$P(W_1, W_2, \ldots, W_k) = P(W_1)\,P(W_2 \mid W_1) \cdots P(W_k \mid W_1, \ldots, W_{k-1})$$

As can be seen from equation (1), the PPL is the k-th root of the reciprocal of the sentence probability P(W_1, W_2, …, W_k). That is, the larger the sentence probability, the smaller the PPL; and the smaller the PPL, the less the speech recognition model is confused by the sentence, i.e., the better its ability to model the corresponding text.
Therefore, a PPL threshold is set, and the correct texts whose PPL exceeds the threshold are selected, that is, the correct texts (and the corresponding speaker voice features) of wrong texts whose PPL is too high and on which the target speech recognition model performs poorly. Speech synthesis is then performed on them to generate updated training samples for the update training of the target speech recognition model. This avoids wasting resources and improves the efficiency of corpus sample augmentation.
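As a minimal sketch of this screening step, assuming the language model exposes per-word conditional probabilities through a hypothetical lm_word_probs callable:

```python
import math

def perplexity(word_probs):
    """PPL of a sentence from the conditional probabilities
    P(W_i | W_1..W_{i-1}) of its k words, per formula (1):
    the k-th root of 1 / P(W_1, ..., W_k)."""
    k = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log P(W_1,...,W_k)
    return math.exp(-log_prob / k)

def screen_correct_texts(texts, lm_word_probs, ppl_threshold):
    """Keep only the correct texts whose PPL exceeds the threshold,
    i.e. the sentences the model handles poorly and is worth retraining on."""
    return [t for t in texts if perplexity(lm_word_probs(t)) > ppl_threshold]
```

For example, a 6-word sentence with joint probability 1e-12 has PPL = (1e12)^(1/6) = 100, while the same probability spread over 12 words gives PPL = 10; the shorter sentence confuses the model more per word.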
Besides obtaining the text for producing the updated training sample from the user feedback information, new words can be obtained from the network to augment the original training sample library of the target speech recognition model.
Optionally, the method further comprises: crawling hotwords from a target network; matching the hotwords against the training sample library of the target speech recognition model; and, in the event that the match is unsuccessful, determining the hotword as a new word.
The determining of the updated training sample and the corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text then includes: determining the updated training sample and the corresponding label based on the new word, the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text.
Specifically, hotwords, i.e., words appearing with high frequency on the network, are crawled at regular intervals and matched against the original training sample library of the target speech recognition model. If the original training sample library lacks a hotword, the text of that hotword is stored in the database, and speech synthesis is performed on it together with the speaker voice features of the speech corresponding to the wrong texts in the database, to generate updated training samples.
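A minimal sketch of the new-word screening (the crawl itself is omitted; hotwords is assumed to be the list of crawled terms):

```python
def find_new_words(hotwords, training_texts):
    """Return crawled hotwords that do not match any transcript in the
    existing training-sample library; unmatched hotwords are new words."""
    return [w for w in hotwords
            if not any(w in text for text in training_texts)]
```

Substring matching is used so the sketch also works for unsegmented languages such as Chinese; a production system would more likely match against a tokenized vocabulary or an inverted index.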
In another embodiment, as shown in fig. 3, determining the updated training sample and the corresponding label based on the new word, the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text includes:
step 302, calculating the perplexity of the correct text corresponding to the wrong text;
step 304, screening out a first correct text whose perplexity exceeds a preset perplexity threshold, and the first wrong text corresponding to the first correct text;
step 306, performing speech synthesis based on arbitrary pairings of the first correct text or the text corresponding to the new word with the speaker voice features of the speech corresponding to the first wrong text, to generate updated training samples;
and step 308, determining the label corresponding to the updated training sample based on the first correct text or the text of the new word.
Here, steps 302 and 304 correspond to steps 202 and 204 above: valuable wrong texts are screened out through the perplexity calculation, which is not repeated here.
In step 306, in addition to the filtered correct text, new word text may be added as text for generating updated training samples.
In this embodiment, the correct texts, the texts corresponding to the new words, and the speaker voice features can be retrieved from the database, with the correct texts and new-word texts unified into a single text dimension. Updated training samples are generated by arbitrarily pairing and combining the data of the text dimension and the speaker-voice-feature dimension and performing speech synthesis. For example, if the database contains text 1, text 2, …, text i and speaker voice feature 1, speaker voice feature 2, …, speaker voice feature m, the i texts and the m speaker voice features can be combined pairwise and synthesized into speech, generating multiple corpora, that is, multiple updated training samples for the target speech recognition model. If the text used to generate a target updated training sample is a correct text, that correct text is the sample's label; if the text used is a new word, the new-word text is the sample's label.
In the above embodiments, wrong texts are screened from user feedback information, or new-word texts are crawled from the network, and the corresponding texts are arbitrarily paired and combined with the speaker voice features of the speech corresponding to the screened wrong texts to generate updated training samples. In one embodiment, the speaker voice features of the speech corresponding to the screened wrong texts can themselves be screened a second time.
Optionally, before the speech synthesis of the above embodiments is performed, the method further includes:
clustering speakers according to the speaker voice features of the speech corresponding to the first wrong texts;
determining the number of speakers included in each cluster set after clustering;
and screening out, for speech synthesis, the speaker voice features of the speech corresponding to the first wrong texts that fall in cluster sets whose number of speakers is lower than a preset number.
Clustering speakers according to the speaker voice features of the speech corresponding to the first wrong texts includes: calculating the similarity between the target speaker voice feature of the speech corresponding to a target first wrong text and the voice feature of each speaker in the training sample library of the target speech recognition model; and clustering the target speaker into the cluster set of the speaker with the highest voice-feature similarity.
In the original training sample library of the target speech recognition model, the data is sorted by speaker, an audio segment of specified length is cut from each speaker's audio, and the voiceprint vector information of all speaker audio segments is extracted for hierarchical clustering. Specifically: each speaker's voiceprint vector initially forms its own class; the similarity between every pair of speaker voiceprint vectors is calculated, and the two most similar speakers are merged into a new class. This calculate-and-merge process is repeated until the distance between even the two closest classes exceeds the threshold, at which point clustering stops. This yields the speaker distribution of the original training library, including the number of speakers in each cluster set.
After the first wrong texts of the feedback users and the corresponding speaker voice features have been screened out, the voiceprint vector information of each target feedback user is extracted and added to the voiceprint hierarchical clustering, so that the target speaker corresponding to the target feedback user is assigned to the appropriate cluster set.
For a speech recognition model, recognition performance also degrades on speakers whose voices were not seen in the original training sample library. Speakers can therefore be screened further. If the speaker corresponding to a feedback user is clustered into a cluster set that is small in the original training sample library, there are few similar users in the library; such a feedback user is highly valuable, the original training sample library needs to be expanded with this speaker's voice features, and speech synthesis is performed on the texts and speaker voice features screened in the preceding steps. Conversely, if the speaker corresponding to a feedback user is clustered into a large cluster set, there are already many similar users in the original training sample library, and no speech is synthesized from that user's speaker voice features and the corresponding texts to generate updated training samples.
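A sketch of this clustering-and-screening flow under stated assumptions: voiceprints are fixed-length vectors, cosine distance is the similarity measure, and scikit-learn's AgglomerativeClustering (1.2 or later for the metric argument) stands in for the merge loop described above as an implementation choice, not tooling prescribed by the application.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2

def screen_rare_speakers(library_vecs, feedback_vecs,
                         distance_threshold, max_cluster_size):
    """Hierarchically cluster library + feedback voiceprints and return
    the indices of feedback speakers that land in small, under-represented
    cluster sets: the valuable voices to keep for speech synthesis."""
    all_vecs = np.vstack([library_vecs, feedback_vecs])
    labels = AgglomerativeClustering(
        n_clusters=None,                      # stop by distance, not count
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    ).fit_predict(all_vecs)
    sizes = np.bincount(labels)               # speakers per cluster set
    offset = len(library_vecs)
    return [i for i in range(len(feedback_vecs))
            if sizes[labels[offset + i]] < max_cluster_size]
```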
In yet another embodiment, based on user feedback of speech recognition errors, only the speaker voice features corresponding to the misrecognized speech may be screened to generate the updated training samples.
As shown in fig. 4, step 106 of determining the updated training sample and the corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text includes:
step 402, clustering speakers according to the speaker voice features of the speech corresponding to the wrong texts;
step 404, determining the number of speakers included in each cluster set after clustering;
step 406, screening out the speaker voice features of the speech corresponding to second wrong texts that fall in cluster sets whose number of speakers is lower than a preset number;
step 408, performing speech synthesis based on the correct texts corresponding to the wrong texts and the speaker voice features of the speech corresponding to the second wrong texts, to generate updated training samples;
and step 410, determining the label corresponding to the updated training sample based on the correct text.
Here, steps 402 to 406 correspond to the speaker-voice-feature screening steps described above for the speech corresponding to the first wrong texts: valuable speaker voice features among the feedback users are screened out through speaker clustering, which is not repeated here.
In step 408, the correct texts may be all the correct texts in the feedback information; they are arbitrarily paired and combined with the speaker voice features screened out in steps 402 to 406 for speech synthesis, generating updated training samples.
In the embodiments of the application, user feedback information on the speech recognition output of a target speech recognition model is acquired, the feedback information comprising a wrong text produced by the speech recognition and the correct text corresponding to the wrong text; the speaker voice features of the speech corresponding to the wrong text are acquired; updated training samples and corresponding labels are determined based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice features of the speech corresponding to the wrong text; and update training of the target speech recognition model is performed based on the updated training samples and the corresponding labels. In this way, speech that users report as poorly recognized is collected dynamically, speech synthesis is performed along two dimensions, text and speaker voice features, and the resulting updated training samples are added to the update training of the target speech recognition model in real time. This enables faster, more timely, and lower-cost corpus augmentation and model training, and improves the recognition accuracy of the speech recognition model.
In addition, because the texts and/or speaker voice features are screened based on user feedback, the generated augmentation training samples are genuinely valuable; this avoids resource waste, improves data augmentation efficiency, and correspondingly improves the recognition performance of the speech recognition model.
Optionally, an embodiment of the present application further provides a speech recognition model training apparatus, and fig. 5 is a block diagram of the speech recognition model training apparatus according to the embodiment of the present application.
As shown in fig. 5, the speech recognition model training apparatus 2000 includes a memory 2200 and a processor 2400 electrically connected to the memory 2200. The memory 2200 stores a computer program executable by the processor 2400; when executed by the processor, the computer program implements each process of any of the above speech recognition model training method embodiments and achieves the same technical effects, which are not repeated here.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of any one of the above embodiments of the speech recognition model training method, and can achieve the same technical effect, and is not described herein again to avoid repetition. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for training a speech recognition model, comprising:
acquiring feedback information from a user on voice recognition output of a target voice recognition model, wherein the feedback information comprises a wrong text of the voice recognition and a correct text corresponding to the wrong text;
acquiring the speaker voice characteristics of the voice corresponding to the wrong text;
determining an updated training sample and a corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice characteristics of the voice corresponding to the wrong text;
and performing update training on the target voice recognition model based on the updated training sample and the corresponding label.
2. The method of claim 1, wherein determining an updated training sample and a corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice characteristics of the voice corresponding to the wrong text comprises:
calculating the perplexity of the correct text corresponding to the wrong text;
screening out a first correct text with the perplexity exceeding a preset perplexity threshold and a first wrong text corresponding to the first correct text;
performing voice synthesis based on arbitrary pairings of the first correct text with the speaker voice characteristics of the voice corresponding to the first wrong text to generate an updated training sample;
and determining the label corresponding to the updated training sample based on the first correct text.
3. The method of claim 1, further comprising:
crawling hotwords from a target network;
matching the hot words with a training sample library of the target speech recognition model;
determining the hotword as a new word if the matching is unsuccessful;
wherein determining an updated training sample and a corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice characteristics of the voice corresponding to the wrong text comprises:
determining the updated training sample and the corresponding label based on the new word, the wrong text, the correct text corresponding to the wrong text, and the speaker voice characteristics of the voice corresponding to the wrong text.
4. The method of claim 3, wherein determining the updated training sample and the corresponding label based on the new word, the wrong text, the correct text corresponding to the wrong text, and the speaker voice characteristics of the voice corresponding to the wrong text comprises:
calculating the perplexity of the correct text corresponding to the wrong text;
screening out a first correct text with the perplexity exceeding a preset perplexity threshold and a first wrong text corresponding to the first correct text;
performing voice synthesis based on arbitrary pairings of the first correct text or the text corresponding to the new word with the speaker voice characteristics of the voice corresponding to the first wrong text to generate an updated training sample;
and determining the label corresponding to the updated training sample based on the first correct text or the text of the new word.
5. The method of claim 2 or 4, wherein the perplexity of the correct text corresponding to the wrong text is calculated by the following formula:

$$\mathrm{PPL}(S) = P(W_1, W_2, \ldots, W_k)^{-\frac{1}{k}} = \sqrt[k]{\frac{1}{P(W_1, W_2, \ldots, W_k)}}$$

wherein S represents a target correct text corresponding to a target wrong text, k represents the number of words included in the target correct text, and P(W_1, W_2, …, W_k) represents the sentence probability of the target correct text.
6. The method of claim 2 or 4, wherein prior to performing speech synthesis, further comprising:
clustering speakers according to the speaker voice characteristics of the voice corresponding to the first error text;
determining the number of speakers included in each clustered set after clustering;
and screening out, for voice synthesis, the speaker voice characteristics of the voices corresponding to the first wrong texts that fall in cluster sets whose number of speakers is lower than a preset number.
7. The method of claim 6, wherein clustering speakers based on speaker speech characteristics of speech corresponding to the first erroneous text comprises:
calculating the similarity between the voice characteristics of the target speaker of the voice corresponding to the target first error text and the voice characteristics of each speaker in the training sample library of the target voice recognition model;
and clustering the target speaker into the cluster set of the speaker with the highest voice feature similarity.
8. The method of claim 1, wherein determining an updated training sample and a corresponding label based on the wrong text, the correct text corresponding to the wrong text, and the speaker voice characteristics of the voice corresponding to the wrong text comprises:
clustering speakers according to the speaker voice characteristics of the voice corresponding to the error text;
determining the number of speakers included in each clustered set after clustering;
screening out the speaker voice characteristics of the voices corresponding to second wrong texts that fall in cluster sets whose number of speakers is lower than a preset number;
performing voice synthesis based on the correct text corresponding to the error text and the speaker voice characteristics of the voice corresponding to the second error text to generate an updated training sample;
and determining the label corresponding to the updated training sample based on the correct text.
9. A speech recognition model training apparatus, comprising: a memory and a processor electrically connected to the memory, the memory storing a computer program executable on the processor, the computer program, when executed by the processor, implementing the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202210465435.8A 2022-04-29 2022-04-29 Speech recognition model training method and device and computer readable storage medium Active CN114974221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210465435.8A CN114974221B (en) 2022-04-29 2022-04-29 Speech recognition model training method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210465435.8A CN114974221B (en) 2022-04-29 2022-04-29 Speech recognition model training method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114974221A true CN114974221A (en) 2022-08-30
CN114974221B CN114974221B (en) 2024-01-19

Family

ID=82978352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210465435.8A Active CN114974221B (en) 2022-04-29 2022-04-29 Speech recognition model training method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114974221B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024111387A1 (en) * 2022-11-24 2024-05-30 日本電気株式会社 Processing device, processing method, and recording medium
WO2024123240A1 (en) * 2022-12-08 2024-06-13 Grabtaxi Holdings Pte. Ltd. Method, device and system for building a dataset for automatic speech recognition

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100312555A1 (en) * 2009-06-09 2010-12-09 Microsoft Corporation Local and remote aggregation of feedback data for speech recognition
US20120290298A1 (en) * 2011-05-09 2012-11-15 At&T Intellectual Property I, L.P. System and method for optimizing speech recognition and natural language parameters with user feedback
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Method, system, device and storage medium for optimizing speech recognition acoustic model
CN108735220A (en) * 2018-04-11 2018-11-02 四川斐讯信息技术有限公司 A kind of language learning intelligent earphone, intelligent interactive system and man-machine interaction method
CN109949797A (en) * 2019-03-11 2019-06-28 北京百度网讯科技有限公司 A kind of generation method of training corpus, device, equipment and storage medium
US20190206389A1 (en) * 2017-12-29 2019-07-04 Samsung Electronics Co., Ltd. Method and apparatus with a personalized speech recognition model
CN110853628A (en) * 2019-11-18 2020-02-28 苏州思必驰信息科技有限公司 Model training method and device, electronic equipment and storage medium
CN110942772A (en) * 2019-11-21 2020-03-31 新华三大数据技术有限公司 Voice sample collection method and device
CN114141235A (en) * 2021-10-26 2022-03-04 招联消费金融有限公司 Voice corpus generation method and device, computer equipment and storage medium
CN114203166A (en) * 2021-12-10 2022-03-18 零犀(北京)科技有限公司 Method, device and equipment for generating training data based on man-machine conversation

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100312555A1 (en) * 2009-06-09 2010-12-09 Microsoft Corporation Local and remote aggregation of feedback data for speech recognition
US20160012817A1 (en) * 2009-06-09 2016-01-14 Microsoft Technology Licensing, Llc Local and remote aggregation of feedback data for speech recognition
US20120290298A1 (en) * 2011-05-09 2012-11-15 At&T Intellectual Property I, L.P. System and method for optimizing speech recognition and natural language parameters with user feedback
US20190206389A1 (en) * 2017-12-29 2019-07-04 Samsung Electronics Co., Ltd. Method and apparatus with a personalized speech recognition model
CN108389577A (en) * 2018-02-12 2018-08-10 广州视源电子科技股份有限公司 Method, system, device and storage medium for optimizing speech recognition acoustic model
CN108735220A (en) * 2018-04-11 2018-11-02 四川斐讯信息技术有限公司 A kind of language learning intelligent earphone, intelligent interactive system and man-machine interaction method
CN109949797A (en) * 2019-03-11 2019-06-28 北京百度网讯科技有限公司 A kind of generation method of training corpus, device, equipment and storage medium
CN110853628A (en) * 2019-11-18 2020-02-28 苏州思必驰信息科技有限公司 Model training method and device, electronic equipment and storage medium
CN110942772A (en) * 2019-11-21 2020-03-31 新华三大数据技术有限公司 Voice sample collection method and device
CN114141235A (en) * 2021-10-26 2022-03-04 招联消费金融有限公司 Voice corpus generation method and device, computer equipment and storage medium
CN114203166A (en) * 2021-12-10 2022-03-18 零犀(北京)科技有限公司 Method, device and equipment for generating training data based on man-machine conversation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024111387A1 (en) * 2022-11-24 2024-05-30 日本電気株式会社 Processing device, processing method, and recording medium
WO2024123240A1 (en) * 2022-12-08 2024-06-13 Grabtaxi Holdings Pte. Ltd. Method, device and system for building a dataset for automatic speech recognition

Also Published As

Publication number Publication date
CN114974221B (en) 2024-01-19

Similar Documents

Publication Publication Date Title
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US7562014B1 (en) Active learning process for spoken dialog systems
CN1196105C (en) Extensible speech recongnition system that provides user audio feedback
WO2020036178A1 (en) Voice conversion learning device, voice conversion device, method, and program
JP2015180966A (en) Speech processing system
CN109545185B (en) Interactive system evaluation method, evaluation system, server, and computer-readable medium
CN114974221B (en) Speech recognition model training method and device and computer readable storage medium
CN104299623A (en) Automated confirmation and disambiguation modules in voice applications
EP1317749A1 (en) Method of and system for improving accuracy in a speech recognition system
CN114242047B (en) Voice processing method and device, electronic equipment and storage medium
CN113674735B (en) Sound conversion method, device, electronic equipment and readable storage medium
US10978076B2 (en) Speaker retrieval device, speaker retrieval method, and computer program product
Imperl et al. Clustering of triphones using phoneme similarity estimation for the definition of a multilingual set of triphones
JP6220733B2 (en) Voice classification device, voice classification method, and program
JP2004348552A (en) Voice document search device, method, and program
US20070129946A1 (en) High quality speech reconstruction for a dialog method and system
CN117669553A (en) Keyword detection device, keyword detection method, and storage medium
JP4986301B2 (en) Content search apparatus, program, and method using voice recognition processing function
CN113345442B (en) Speech recognition method, device, electronic equipment and storage medium
CN115293156B (en) Method and device for extracting abnormal events of prison short messages, computer equipment and medium
CN114420086B (en) Speech synthesis method and device
Chen et al. Mismatched Crowdsourcing from Multiple Annotator Languages for Recognizing Zero-Resourced Languages: A Nullspace Clustering Approach.
CN118098240A (en) Text error correction method, device, equipment and readable storage medium
JP2961797B2 (en) Voice recognition device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant