JP3876703B2

JP3876703B2 - Speaker learning apparatus and method for speech recognition

Info

Publication number: JP3876703B2
Application number: JP2001378341A
Authority: JP
Inventors: 由実脇田; 研治水谷; 伸一芳澤
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2001-12-12
Filing date: 2001-12-12
Publication date: 2007-02-07
Anticipated expiration: 2021-12-12
Also published as: JP2003177779A

Description

[0001]
[Technical field of the invention]
The present invention relates to a speaker learning apparatus and method in speech recognition.
[0002]
[Prior art]
Conventional speaker learning methods are described below. A conventional speaker-independent speech recognition system builds and uses a standard acoustic model intended to cover as many unspecified speakers as possible. In practice, however, speakers' utterance characteristics vary widely, and it is difficult to train an acoustic model that guarantees high performance for every user. Conventionally, therefore, speaker adaptation has been used: for a speaker who is poorly recognized, the acoustic model parameters are re-learned using the speaker's own utterances, and an acoustic model adapted to that speaker is rebuilt, so that performance is guaranteed for all speakers. Speaker adaptation requires enough learning speech to capture the speaker's characteristics, but because this burdens the speaker, various techniques have been devised to keep the number of utterances to a minimum (for example, Japanese Patent No. 2037877). As another learning method, there is a speaker registration method in which the acoustic model sequence corresponding to the recognition result of a misrecognized word is added to the pronunciation dictionary as a correct sequence, so that what was recognized as an erroneous sequence can be recognized as a correct one (Japanese Patent Laid-Open No. 8-171396).
[0003]
[Problems to be solved by the invention]
The conventional speaker adaptation method can, in principle, reliably improve recognition performance if enough learning data is available. However, when the number of utterances is reduced to lighten the speaker's learning burden, as is done in almost all practical systems, the recognition rate may actually decrease for some utterances that are not present in the learning data. The conventional speaker registration method, on the other hand, reliably improves the recognition rate for the utterances that were learned, but a speaker who is hard to recognize across many utterance contents must speak every hard-to-recognize utterance during learning, which makes learning burdensome.
[0004]
An object of the present invention is to solve the problems of conventional speaker adaptive learning and speaker registration learning, and to provide a speaker learning method that reliably improves the recognition rate after learning with an amount of learning speech that does not burden the speaker.
[0005]
[Means for Solving the Problems]
To solve the above problems, the present invention comprises: speaker adaptive learning means for re-learning acoustic model parameters using a speaker's learning speech and creating an acoustic model adapted to the speaker; speaker registration learning means for adding an acoustic model sequence, consisting of the phonemes or syllables corresponding to the recognition result of a misrecognized word, to the pronunciation dictionary as a correct phoneme or syllable sequence; and means for determining whether ease of recognition depends on the utterance content. The invention selects between the speaker adaptive learning means and the speaker registration learning means according to each speaker's ease of recognition and the strength of its dependence on the utterance content.
[0006]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, speaker learning according to an embodiment of the present invention will be described with reference to the drawings.
[0007]
FIG. 1 is a block diagram of speaker learning according to an embodiment of the present invention.
[0008]
The speaker learning function is set up so that a speaker selects it when he or she feels the need to improve recognition performance for his or her own voice. First, the system prompts the user to utter specific words, and the speaker's utterance of those words is input. The utterance content is the minimum needed to judge how well the standard speech prepared in advance suits each speaker. For Japanese recognition, for example, a word that contains all five vowels, such as 「マイクテスト」 ("microphone test"), is suitable. If the system performs word recognition, several words may be selected from the target words so that all five vowels are covered.
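The selection of prompt words that together cover all five vowels can be sketched as a greedy cover. This is only an illustration: the function name `select_prompt_words`, the romanized vocabulary, and the greedy strategy are assumptions and are not described in the patent.

```python
# Hypothetical sketch: pick prompt words so that the five Japanese vowels
# are all covered. The greedy strategy is an assumption, not the patent's.
VOWELS = frozenset("aiueo")


def select_prompt_words(vocabulary):
    """vocabulary maps a word to its phoneme list.

    Returns (chosen_words, vowels_still_uncovered).
    """
    remaining = set(VOWELS)
    chosen = []
    if not vocabulary:
        return chosen, remaining
    while remaining:
        # Take the word that covers the most vowels not yet covered.
        best = max(vocabulary, key=lambda w: len(remaining & set(vocabulary[w])))
        gained = remaining & set(vocabulary[best])
        if not gained:
            break  # no word can cover the remaining vowels
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Illustrative vocabulary (romanized, hypothetical).
vocabulary = {
    "menu": ["m", "e", "ny", "u", "u"],
    "sakana": ["s", "a", "k", "a", "n", "a"],
    "moji": ["m", "o", "j", "i"],
}
chosen, uncovered = select_prompt_words(vocabulary)
```

If a single word such as "maikutesuto" is in the vocabulary, the loop stops after one pick, which matches the single-word case described above.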
[0009]
The speech recognition process 1 performs ordinary recognition on this utterance, and the recognition score calculation process 2 computes the recognition result and a recognition confidence score. For the recognition result, the recognized phoneme or syllable sequence is compared with the correct phoneme sequence; parts that differ are treated as errors and parts that match as correct, and correctness is recorded for each phoneme of the correct sequence. The confidence score is, for example, an acoustic distance score for each phoneme or syllable between the correct phoneme or syllable sequence and the uttered result. When a weighted cepstral distance is used as the distance measure, the confidence of each phoneme may be calculated by Equation 1.
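The per-phoneme correctness record could be produced roughly as follows. This is a minimal sketch: the patent does not say how the two sequences are aligned, so the use of Python's `difflib.SequenceMatcher` and the name `mark_phonemes` are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def mark_phonemes(correct, recognized):
    """Return one (phoneme, is_correct) pair per phoneme of the correct sequence.

    Phonemes of the correct sequence that fall inside a matching block of the
    alignment are marked correct; all others are marked as errors.
    """
    flags = [False] * len(correct)
    matcher = SequenceMatcher(a=correct, b=recognized, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            flags[i] = True
    return list(zip(correct, flags))


# Example sequences taken from the "menu" case discussed in this description.
correct = ["m", "e", "ny", "u", "u"]
recognized = ["d", "e", "ny", "u", "u"]
marks = mark_phonemes(correct, recognized)
# marks[0] is ("m", False); the remaining four phonemes are marked True.
```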
[0010]
[Equation 1]

[0011]
The learning method determination process 3 calculates the ratio of the phonemes or syllables whose confidence score is at or below a threshold, or that are misrecognized even though their score is at or above the threshold (called adaptive candidate phonemes or syllables), to the phonemes or syllables contained in all the utterances. If this ratio is large, it is inferred that the speaker's utterance characteristics do not fit the standard speech regardless of the utterance content, and that the whole of the standard speech needs to be adapted to the speaker. If the ratio is small, misrecognition depends on the utterance content: the speaker's utterance characteristics fit the standard speech, and learning is needed only for particular utterances. Therefore, speaker adaptive learning is selected when the ratio is at or above a certain value, and speaker registration learning is selected when it is at or below that value.
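The decision rule of the learning method determination process can be sketched as below. The names `PhonemeResult` and `select_learning_method` and the threshold parameters are hypothetical; the sketch also assumes that a higher confidence score means a more reliable phoneme and that a ratio exactly equal to the threshold selects adaptation, a tie the text leaves open.

```python
from dataclasses import dataclass


@dataclass
class PhonemeResult:
    phoneme: str
    confidence: float  # assumed: higher means more reliable
    is_correct: bool   # result of comparing with the correct sequence


def select_learning_method(results, score_threshold, ratio_threshold):
    """Return (method, ratio, candidates) for one speaker's test utterances."""
    if not results:
        raise ValueError("at least one phoneme result is required")
    # Adaptive candidates: low confidence, or misrecognized despite a high score.
    candidates = [
        r for r in results
        if r.confidence <= score_threshold or not r.is_correct
    ]
    ratio = len(candidates) / len(results)
    # Large ratio: errors do not depend on content, so adapt the whole model.
    # Small ratio: errors are content dependent, so register specific words.
    method = "adaptation" if ratio >= ratio_threshold else "registration"
    return method, ratio, candidates


# Illustrative values only; the thresholds are not given in the patent.
results = [
    PhonemeResult("m", 0.9, False),
    PhonemeResult("e", 0.9, True),
    PhonemeResult("ny", 0.9, True),
    PhonemeResult("u", 0.9, True),
    PhonemeResult("u", 0.9, True),
]
method, ratio, candidates = select_learning_method(results, 0.5, 0.5)
```

With these values only /m/ is a candidate, so the ratio is 1/5 and speaker registration learning is selected.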
[0012]
When speaker adaptive learning is selected, the speaker adaptation process 4 prompts the user for the minimum additional utterances needed for adaptation. As the speaker adaptation method, for example, when the VFS method described in JP-A-5-53599 is used, the standard acoustic model is matched against the learning input speech parameters, a fuzzy membership function is obtained from the relationship between corresponding parameters, and the parameters of the standard acoustic model are updated, using the obtained function as a weight, so that the standard speech approaches the learning input speech.
[0013]
When speaker registration learning is selected, the speaker registration process 5 prompts the user to utter only the words that contain the adaptive candidate phonemes or syllables calculated in the learning method determination process. For each word that contains a phoneme sequence corresponding to an adaptive candidate phoneme, the phoneme or syllable sequence recognized for the utterance is added to the pronunciation dictionary 7 alongside the word's existing phoneme sequence. For example, suppose the word 「メニュー」 ("menu") is misrecognized; the user is prompted to utter only this word, and the recognition result is 「デニュー」 ("denyuu"). When phoneme models are used as the acoustic models, the correct phoneme model sequence of "menu" is /m e ny u u/ and the recognized phoneme sequence is /d e ny u u/. For this speaker, the phoneme /m/ at the beginning of a word and followed by /e/ tends to be mistaken for /d/. Phoneme sequences are therefore added to the pronunciation dictionary so that, among the recognition target words, a word-initial /m/ followed by /e/ is still recognized as /m/ even when it is mistaken for /d/. In this example, /d e ny u u/ is added to the dictionary entry that was originally "menu /m e ny u u/", changing it to "menu /m e ny u u/ or /d e ny u u/". As a result, even if this speaker's utterance of "menu" is recognized as /d e ny u u/, the word "menu" is still recognized.
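The dictionary update in the "menu" example can be sketched as follows. The lexicon layout (a word mapped to a list of phoneme tuples), the romanized entries other than "menu", and the function names are assumptions for illustration; the patent does not specify a data structure.

```python
def add_initial_substitution(lexicon, target, following, substitute):
    """Add a variant for every word that starts with `target` followed by
    `following`, replacing the initial phoneme with `substitute`."""
    for variants in lexicon.values():
        for seq in list(variants):  # iterate over a copy while appending
            if len(seq) >= 2 and seq[0] == target and seq[1] == following:
                alt = (substitute,) + seq[1:]
                if alt not in variants:
                    variants.append(alt)
    return lexicon


def lookup(lexicon, recognized):
    """Return the words whose registered variants include the recognized sequence."""
    seq = tuple(recognized)
    return [word for word, variants in lexicon.items() if seq in variants]


# Hypothetical pronunciation dictionary with one phoneme sequence per word.
lexicon = {
    "menu": [("m", "e", "ny", "u", "u")],
    "memo": [("m", "e", "m", "o")],
    "maiku": [("m", "a", "i", "k", "u")],
}
# This speaker's word-initial /m/ before /e/ tends to be recognized as /d/.
add_initial_substitution(lexicon, target="m", following="e", substitute="d")
hits = lookup(lexicon, ["d", "e", "ny", "u", "u"])
```

After the update, "menu" and "memo" each have a second variant beginning with /d/, while "maiku" is unchanged because its /m/ is not followed by /e/.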
[0014]
As described above, it is estimated whether the speaker's utterances are misrecognized regardless of the utterance content; speaker adaptive learning is performed when the errors do not depend on the content, and speaker registration learning is performed when they do. The problem of conventional speaker adaptive learning, in which the recognition rate decreases even though the speaker has made many learning utterances for adaptation, is thus solved by performing speaker registration learning instead of speaker adaptive learning. Likewise, the problem of conventional speaker registration learning, in which learning is impossible unless many words are uttered, is solved by performing speaker adaptive learning instead of speaker registration learning.
[0015]
As described in detail above, the speaker learning method of the embodiment according to the present invention selects whether to perform speaker adaptive learning or speaker registration learning according to each speaker's ease of recognition and the strength of its dependence on the utterance content, and prompts the speaker to perform one of the two. The problem of conventional speaker adaptive learning, in which the recognition rate decreases even though the speaker has made many learning utterances for adaptation, is thus solved by automatically selecting speaker registration learning instead of speaker adaptive learning. Likewise, the problem of conventional speaker registration learning, in which learning is impossible unless many words are uttered, is solved by automatically selecting speaker adaptive learning instead of speaker registration learning. The invention therefore provides a speaker learning method that can reliably improve the recognition rate with an amount of learning that does not burden the speaker.
[0017]
As described in detail above, the speaker learning method of the embodiment according to the present invention determines whether ease of recognition depends on the utterance content, performs speaker registration learning if it is determined to depend on it, and performs speaker adaptive learning if it is determined not to. The problem of conventional speaker adaptive learning, in which the recognition rate decreases even though the speaker has made many learning utterances for adaptation, is thus solved by automatically selecting speaker registration learning instead of speaker adaptive learning. Likewise, the problem of conventional speaker registration learning, in which learning is impossible unless many words are uttered, is solved by automatically selecting speaker adaptive learning instead of speaker registration learning. The invention therefore provides a speaker learning method that can reliably improve the recognition rate with an amount of learning that does not burden the speaker.
[0019]
As described in detail above, the present invention selects whether to perform speaker adaptive learning or speaker registration learning according to each speaker's ease of recognition and the strength of its dependence on the utterance content, and prompts the speaker to perform one of the two. The problem of conventional speaker adaptive learning, in which the recognition rate decreases even though the speaker has made many learning utterances for adaptation, can thus be solved by automatically selecting speaker registration learning instead of speaker adaptive learning.
[Brief description of the drawings]
FIG. 1 is a block diagram of a speaker learning method according to an embodiment of the present invention.
[Explanation of reference numerals]
1 Speech recognition
2 Recognition score calculation
3 Learning method determination
4 Speaker adaptation
5 Speaker registration
6 Acoustic model
7 Pronunciation dictionary
8 Recognition score buffer

Claims

A speaker learning apparatus comprising: speaker adaptive learning means for re-learning acoustic model parameters using a speaker's learning speech to create an acoustic model adapted to the speaker; speaker registration learning means for adding an acoustic model sequence, consisting of phonemes or syllables corresponding to the recognition result of a misrecognized word, to a pronunciation dictionary as a correct phoneme or syllable sequence; means for determining whether ease of recognition depends on utterance content; and means for selecting between the speaker adaptive learning means and the speaker registration learning means according to each speaker's ease of recognition and the strength of its dependence on utterance content, and prompting the speaker to perform one of the two kinds of learning.

The speaker learning apparatus according to claim 1, wherein, as a result of determining whether ease of recognition depends on the utterance content, the speaker registration learning means is used if it is determined to depend on it, and the speaker adaptive learning means is used if it is determined not to depend on it.

The speaker learning apparatus according to claim 1, wherein the means for determining whether ease of recognition depends on the utterance content makes the determination based on the ratio of phonemes or syllables whose recognition score is at or below a predetermined threshold, or that are misrecognized even though the score is at or above the predetermined threshold, to the phonemes or syllables contained in all utterances.

A speaker learning method comprising: a speaker adaptive learning step of re-learning acoustic model parameters using a speaker's learning speech to create an acoustic model adapted to the speaker; a speaker registration learning step of adding an acoustic model sequence, consisting of phonemes or syllables corresponding to the recognition result of a misrecognized word, to a pronunciation dictionary as a correct phoneme or syllable sequence; a step of determining whether ease of recognition depends on utterance content; and a step of selecting between the speaker adaptive learning means and the speaker registration learning means according to each speaker's ease of recognition and the strength of its dependence on utterance content, and prompting the speaker to perform one of the two kinds of learning.