JP5089655B2

JP5089655B2 - Acoustic model creation device, method and program thereof

Info

Publication number: JP5089655B2
Application number: JP2009148089A
Authority: JP
Inventors: 哲小橋川; 太一浅見; 義和山口
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-06-22
Filing date: 2009-06-22
Publication date: 2012-12-05
Anticipated expiration: 2029-06-22
Also published as: JP2011002792A

Description

The present invention relates to an acoustic model creation device used for speech recognition, and to a method and a program therefor.

An acoustic model is usually created or updated as follows. First, from an observed signal sequence for training, the corresponding state sequence, and an acoustic model (any initial model when a new model is being created), one of two quantities is obtained: the posterior probability variable γ_t(i) based on forward and backward calculation (for example, with the Forward-Backward algorithm; see Non-Patent Document 1), or the forward probability δ_t(i) based on forward calculation only (for example, with the Viterbi algorithm; see Non-Patent Document 2). Here, the posterior probability variable γ_t(i) is the probability of being in state i (1 ≤ i ≤ J, where J is the number of states in the model) at time t (1 ≤ t ≤ T, where T is the length of the observed signal sequence), and the forward probability δ_t(i) is the probability of the most probable state sequence among those that reach state i at time t. The most likely state sequence (maximum likelihood path) is then obtained from γ_t(i) or δ_t(i). The various statistics needed to update the acoustic model (hereinafter, "sufficient statistics") are extracted while following that path, and the acoustic model is updated from them.

FIG. 1 shows a configuration example of a conventional acoustic model creation device 100 that uses the Viterbi algorithm, and FIG. 2 shows an example of its processing flow. The acoustic model creation device 100 includes a learning data acquisition unit 101, a feature amount analysis unit 102, a state sequence conversion unit 103, an acoustic model storage unit 104, a forward calculation unit 105, a sufficient statistics accumulation unit 106, and a model update unit 107.

The learning data acquisition unit 101 receives learning data consisting of a speech signal and a learning label that represents the speech, for example in kana characters, and outputs the speech signal and the learning label separately (S1). When acquisition of the learning data is complete, it sends data acquisition completion information to the sufficient statistics accumulation unit 106.

The feature amount analysis unit 102 receives the speech signal, extracts the speech feature sequence O = (o_1, o_2, ..., o_t, ..., o_T), and outputs it (S2). The extracted features are, for example, dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficient), dynamic parameters such as ΔMFCC (the change in MFCC over time), power, and Δpower. CMN (cepstral mean normalization) may also be applied. The features are not limited to MFCC and power; any parameters used for speech recognition may be used.
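The post-processing mentioned above (CMN and Δ features) can be sketched as follows. This is a minimal illustration, not the method of the device: it assumes the MFCC frames have already been computed by some front end, and the function names and the two-frame difference used for Δ are assumptions made for the example.

```python
def cepstral_mean_normalize(frames):
    """CMN: subtract the per-dimension mean over the utterance from every frame."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def delta(frames):
    """Delta features as (o[t+1] - o[t-1]) / 2, repeating the edge frames.

    This particular difference formula is an assumption for illustration.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_deltas(frames):
    """Append the delta features to each static feature vector."""
    return [f + d for f, d in zip(frames, delta(frames))]


# Toy 2-dimensional "cepstra" for a 3-frame utterance.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
features = add_deltas(cepstral_mean_normalize(frames))
```

Each resulting vector holds the normalized static coefficients followed by their deltas, which is the shape of o_t that the later steps consume.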

The state sequence conversion unit 103 receives the learning label and the acoustic model, decomposes the learning label into a phoneme sequence, converts that into a state sequence, and outputs it (S3). For example, the learning label 「こんにちは」 ("hello") is decomposed into the phoneme sequence "pause k o ng n i ch i h a pause", which is then converted into "*-pause+k[1] *-pause+k[2] *-pause+k[3] pause-k+o[1] ...". Here [1], [2], and [3] are state numbers within a phoneme model.
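The second half of this conversion (phoneme sequence to state sequence) can be sketched as below. The sketch assumes the left-center+right naming used for triphones in this document, three states per phoneme model, and "*" as the context at the utterance boundaries; the function name is hypothetical. Converting the kana label to phonemes would need a pronunciation lexicon and is not shown.

```python
def to_triphone_states(phonemes, states_per_phone=3):
    """Expand a phoneme sequence into triphone state names such as 'k-a+i[1]'."""
    states = []
    for idx, center in enumerate(phonemes):
        # Use '*' as the context where no neighbouring phoneme exists.
        left = phonemes[idx - 1] if idx > 0 else "*"
        right = phonemes[idx + 1] if idx + 1 < len(phonemes) else "*"
        name = f"{left}-{center}+{right}"
        states.extend(f"{name}[{s}]" for s in range(1, states_per_phone + 1))
    return states


phonemes = "pause k o ng n i ch i h a pause".split()
states = to_triphone_states(phonemes)
print(" ".join(states[:4]))
```

With three states per phoneme, the eleven phonemes of the example expand into 33 states.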

The acoustic model storage unit 104 stores an acoustic model λ consisting of the initial state distribution π_i, the state transition probability a_ij from state i to state j (1 ≤ j ≤ J, where J is the number of states in the model), the output probability of state j at time t

$$ b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t;\ \mu_{jk}, U_{jk}) \qquad (1) $$

(where M is the number of base normal distributions belonging to state j (the number of mixtures), N is a base normal distribution, c_jk is the mixture weight of the k-th base normal distribution of state j, μ_jk is its mean vector, and U_jk is its covariance matrix), and so on. FIG. 3 conceptually illustrates the acoustic model λ. The acoustic model λ is an HMM (Hidden Markov Model) and consists of multiple phoneme models. A phoneme model may be a monophone (for example, *-a+*) or a triphone (for example, k-a+i). Each phoneme model consists of, for example, three states. State transition probabilities are defined within each state and between states, and the output probability of each state is defined as a mixture normal distribution made up of several (three in FIG. 3) base normal distributions.
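As an illustration of the mixture output probability, the following sketch evaluates a Gaussian mixture for one state. It assumes diagonal covariances (each U_jk given as a list of per-dimension variances), which is a simplification chosen for the example rather than something the text requires; the function names are hypothetical.

```python
import math


def gaussian_diag(o, mean, var):
    """Density of a normal distribution with diagonal covariance at vector o."""
    log_p = 0.0
    for x, m, v in zip(o, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)


def output_probability(o, weights, means, variances):
    """b_j(o_t): weighted sum of the base normal densities of one state."""
    return sum(
        c * gaussian_diag(o, m, v)
        for c, m, v in zip(weights, means, variances)
    )


# One state with two identical 1-dimensional components, weights 0.5 each.
b = output_probability([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```

Practical systems usually work with log probabilities to avoid underflow; plain probabilities are used here only to stay close to the notation of the text.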

The forward calculation unit 105 receives the speech feature sequence O and the state sequence, performs forward calculation using the acoustic model λ corresponding to the state sequence, and outputs a forward calculation history consisting of the forward probabilities δ_t(i) and the back pointers ψ_t(i) (S4). The back pointer ψ_t(i) stores the source state of the last transition on the most probable state sequence among those that reach state i at time t. The forward calculation can be performed, for example, as follows.

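(i) Initialization

$$ \delta_1(i) = \pi_i\, b_i(o_1) \quad (1 \le i \le J) \qquad (2) $$

$$ \psi_1(i) = 0 \qquad (3) $$

where J is the number of states in the model.

(ii) Forward calculation

$$ \delta_t(j) = \max_{1 \le i \le J} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t) \quad (2 \le t \le T,\ 1 \le j \le J) \qquad (4) $$

$$ \psi_t(j) = \operatorname*{arg\,max}_{1 \le i \le J} \left[ \delta_{t-1}(i)\, a_{ij} \right] \qquad (5) $$

(iii) Termination

$$ P^* = \max_{1 \le i \le J} \delta_T(i) \qquad (6) $$

$$ q_T^* = \operatorname*{arg\,max}_{1 \le i \le J} \delta_T(i) \qquad (7) $$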
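The forward calculation can be sketched in a few lines. This is an illustrative implementation of the standard Viterbi recursion, not the device's actual code: it uses 0-based indices, plain probabilities instead of log probabilities, and a precomputed table `b[t][j]` of output probabilities, all of which are assumptions made for the example.

```python
def viterbi_forward(pi, a, b):
    """Return (delta, psi) for initial probs pi, transitions a, output probs b.

    pi[i]   : initial probability of state i
    a[i][j] : transition probability from state i to state j
    b[t][j] : output probability of state j for frame t (0-based t)
    """
    T, J = len(b), len(pi)
    delta = [[0.0] * J for _ in range(T)]
    psi = [[0] * J for _ in range(T)]
    # Initialization: best "path" into each state at the first frame.
    for i in range(J):
        delta[0][i] = pi[i] * b[0][i]
    # Recursion: keep only the best predecessor for every state and frame.
    for t in range(1, T):
        for j in range(J):
            best_i = max(range(J), key=lambda i: delta[t - 1][i] * a[i][j])
            psi[t][j] = best_i
            delta[t][j] = delta[t - 1][best_i] * a[best_i][j] * b[t][j]
    return delta, psi


# Toy left-to-right model with two states and three frames.
pi = [1.0, 0.0]
a = [[0.5, 0.5], [0.0, 1.0]]
b = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
delta, psi = viterbi_forward(pi, a, b)
```

The pair `(delta, psi)` plays the role of the forward calculation history that is passed on to the accumulation step.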

The sufficient statistics accumulation unit 106 obtains the maximum likelihood path q_t^* from the input forward calculation history and accumulates sufficient statistics while following that path, until all of the acquired learning data has been processed (S5). The maximum likelihood path q_t^* can be obtained from the back pointers by the following equation.

$$ q_t^* = \psi_{t+1}(q_{t+1}^*) \quad (t = T-1,\ T-2,\ \ldots,\ 1) \qquad (8) $$

The sufficient statistics include the statistics used for the model update, for example the weight coefficient, mean vector, and covariance of each mixture component of the mixture normal distribution of the output probability, each of which can be calculated by the following equations.

[Equations (9) to (12)]
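Equations (9) to (12) are not reproduced in this text. For orientation only, the textbook re-estimation formulas for a Gaussian-mixture HMM, as given in the tutorial cited as Non-Patent Document 2, take the following form. Treating these as the content of equations (9) to (12), and the assignment of numbers to formulas, is an assumption; the auxiliary symbol γ_t(j,k) (the posterior of being in state j with mixture component k at time t) is introduced here for the sketch.

```latex
\begin{align*}
c_{jk} &= \frac{\sum_{t=1}^{T} \gamma_t(j,k)}
               {\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m)} \\
\mu_{jk} &= \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}
                 {\sum_{t=1}^{T} \gamma_t(j,k)} \\
U_{jk} &= \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,
                (o_t - \mu_{jk})(o_t - \mu_{jk})^{\top}}
               {\sum_{t=1}^{T} \gamma_t(j,k)} \\
\gamma_t(j,k) &= \gamma_t(j)\,
   \frac{c_{jk}\, N(o_t;\ \mu_{jk}, U_{jk})}
        {\sum_{m=1}^{M} c_{jm}\, N(o_t;\ \mu_{jm}, U_{jm})}
\end{align*}
```

In this form the state posterior γ_t(j) enters only through the last formula, which is why changing its value on the maximum likelihood path changes the weight that each frame contributes to every statistic.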

In the Viterbi algorithm, the posterior probability variable γ_t(j) in equation (12) is

$$ \gamma_t(j) = \begin{cases} 1 & (q_t^* = j) \\ 0 & (\text{otherwise}) \end{cases} \qquad (13) $$

The process by which the forward calculation unit 105 and the sufficient statistics accumulation unit 106 obtain the maximum likelihood path with the Viterbi algorithm is described concretely with reference to FIGS. 4(a) and 4(b). In FIGS. 4(a) and 4(b), the horizontal axis represents time (corresponding to the speech feature sequence), the vertical axis represents the state (corresponding to the phoneme sequence), and each circle (●, ◎, ○) represents a state i at a time t. First, as shown in FIG. 4(a), the forward calculation unit 105 starts from a state at t = 1 (state 1 in FIG. 4(a)) and calculates the forward probabilities δ_t for several states at t = 2. In FIG. 4(a), ◎ marks states whose forward probability δ_t has been calculated, ○ marks states whose δ_t has not been calculated, and ● marks states on the maximum likelihood path (the start state at t = 1 and the end state at t = T are always on the maximum likelihood path). A back pointer ψ_t is kept for each calculated state. Next, starting from the states whose forward probabilities have been calculated at t = 2 (states 1 and 2 in FIG. 4(a)), the forward probabilities δ_t for several states at t = 3 are calculated and the back pointers ψ_t are kept.
The same procedure is repeated up to t = T (the length of the speech feature sequence) to accumulate the forward calculation history. Using this history, the sufficient statistics accumulation unit 106 obtains the maximum likelihood path q_t^* by tracing back from time T through T-1, T-2, ... to t = 1, as shown in FIG. 4(b). In equation (13), the posterior probability variable γ_t of the Viterbi algorithm is 1 on the maximum likelihood path because the probability of passing through the maximum likelihood path is taken to be 1 (100%) and that of any other path to be 0 (0%).
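The trace-back and the resulting 0/1 posterior can be sketched as follows. The `delta` and `psi` tables are small hand-written examples of a forward calculation history (0-based indices), and the function names are hypothetical. The end state is chosen here as the state with the largest final forward probability; a system that forces the path to end in the last state of the label would fix it instead.

```python
def backtrack(delta, psi):
    """Recover the maximum likelihood path from a forward calculation history."""
    T, J = len(delta), len(delta[0])
    path = [0] * T
    # End state: the state with the highest forward probability at the last frame.
    path[T - 1] = max(range(J), key=lambda i: delta[T - 1][i])
    # Follow the back pointers from the last frame to the first.
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path


def hard_posteriors(path, num_states):
    """Viterbi-style posterior: 1 on the path, 0 everywhere else."""
    return [[1.0 if j == q else 0.0 for j in range(num_states)] for q in path]


# Example history for a 2-state model over 3 frames.
delta = [[0.9, 0.0], [0.225, 0.225], [0.01125, 0.2025]]
psi = [[0, 0], [0, 0], [0, 1]]
path = backtrack(delta, psi)
gamma = hard_posteriors(path, 2)
```

Every frame contributes with full weight to exactly one state, regardless of how well the frame actually matches that state.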

The model update unit 107 receives the sufficient statistics, builds and outputs a trained acoustic model, and thereby updates the acoustic model in the acoustic model storage unit 104. The output need not overwrite the stored model directly; for example, a trained acoustic model storage unit 108 may be provided and the model written there first.

Non-Patent Document 1: Kiyohiro Shikano and 4 others, "Speech Recognition System" (音声認識システム), Ohmsha, May 2001, pp. 18-31.
Non-Patent Document 2: L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989, pp. 257-284.

With the Forward-Backward algorithm, the posterior probability variable γ_t(i) is calculated and the maximum likelihood path q_t^* is obtained as

$$ q_t^* = \operatorname*{arg\,max}_{1 \le i \le J} \gamma_t(i) \qquad (14) $$

that is, by narrowing down from a certain range of candidates. A highly accurate training effect can therefore be expected, but processing is slow because the backward calculation is required.
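For comparison with the Viterbi case, the Forward-Backward posterior can be sketched as below. This is the standard unscaled algorithm written for a toy model; real systems scale the probabilities or work in the log domain. The names and the precomputed table `b[t][j]` are assumptions made for the example.

```python
def state_posteriors(pi, a, b):
    """gamma[t][i]: probability of being in state i at frame t given all frames."""
    T, J = len(b), len(pi)
    alpha = [[0.0] * J for _ in range(T)]
    beta = [[1.0] * J for _ in range(T)]
    # Forward pass (sums over predecessors instead of taking the maximum).
    for i in range(J):
        alpha[0][i] = pi[i] * b[0][i]
    for t in range(1, T):
        for j in range(J):
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in range(J)) * b[t][j]
    # Backward pass: the extra cost that Viterbi training avoids.
    for t in range(T - 2, -1, -1):
        for i in range(J):
            beta[t][i] = sum(a[i][j] * b[t + 1][j] * beta[t + 1][j] for j in range(J))
    gamma = []
    for t in range(T):
        norm = sum(alpha[t][i] * beta[t][i] for i in range(J))
        gamma.append([alpha[t][i] * beta[t][i] / norm for i in range(J)])
    return gamma


pi = [1.0, 0.0]
a = [[0.5, 0.5], [0.0, 1.0]]
b = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
gamma = state_posteriors(pi, a, b)
# Most probable state at each frame, chosen from the posteriors.
path = [max(range(len(row)), key=row.__getitem__) for row in gamma]
```

Here the posteriors are graded values between 0 and 1, so an ambiguous frame spreads its weight over several states instead of committing to one.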

With the Viterbi algorithm, processing is fast because no backward calculation is needed. However, as equation (13) shows, the maximum likelihood path is trusted completely and the sufficient statistics of equations (9) to (12) are calculated with γ_t(i) = 1. Training accuracy may therefore degrade when the learning data contains errors, or when the speech is ambiguously uttered or has noise superimposed on it.

An object of the present invention is to realize an acoustic model creation device and the like that retain the speed of the Viterbi algorithm while achieving better training accuracy than the Viterbi algorithm.

The acoustic model creation device of the present invention includes a learning data acquisition unit, a feature amount analysis unit, a state sequence conversion unit, an acoustic model storage unit, a forward calculation unit, a sufficient statistics accumulation unit, and a model update unit.

The learning data acquisition unit receives learning data consisting of a speech signal and a learning label that represents the speech in characters, and outputs the speech signal and the learning label separately.

The feature amount analysis unit receives the speech signal, extracts the speech feature sequence O = (o_1, o_2, ..., o_t, ..., o_T), and outputs it.

The acoustic model storage unit stores an acoustic model consisting of an initial state distribution, state transition probabilities from one state to another, output probabilities of the states, and so on.

The state sequence conversion unit receives the learning label and the acoustic model, decomposes the learning label into a phoneme sequence, converts that into a state sequence, and outputs it.

The forward calculation unit receives the speech feature sequence O and the state sequence, performs forward calculation using the acoustic model corresponding to the state sequence, and outputs a forward calculation history consisting of forward probabilities and back pointers.

The sufficient statistics accumulation unit receives the forward calculation history, obtains the maximum likelihood path, and accumulates sufficient statistics while following that path. The value of the posterior probability variable used when accumulating the sufficient statistics is the state appearance reliability f_i(o_t) (0 ≤ f_i(o_t) ≤ 1) when the maximum likelihood path passes through state i (1 ≤ i ≤ J, where J is the number of states) at time t, and 0 elsewhere.
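The effect of replacing the 0/1 posterior with a reliability can be sketched with a deliberately simplified accumulator. The example assumes scalar observations and a single normal distribution per state, and the function names are hypothetical; it only illustrates how each frame on the path is weighted by f instead of by 1.

```python
def accumulate(path, reliabilities, observations, num_states):
    """Accumulate reliability-weighted zeroth and first order statistics per state."""
    occupancy = [0.0] * num_states
    first_order = [0.0] * num_states
    for q, f, o in zip(path, reliabilities, observations):
        # Only the state on the path receives weight; all others receive 0.
        occupancy[q] += f
        first_order[q] += f * o
    return occupancy, first_order


def update_means(occupancy, first_order):
    """Weighted mean per state; None where a state received no weight."""
    return [s / n if n > 0.0 else None for n, s in zip(occupancy, first_order)]


path = [0, 0, 1]
reliabilities = [1.0, 0.5, 0.8]   # f on the path for each frame
observations = [2.0, 4.0, 10.0]
occupancy, first_order = accumulate(path, reliabilities, observations, 2)
means = update_means(occupancy, first_order)
```

Setting every reliability to 1.0 reduces this to ordinary Viterbi training, so the change is confined to the weight applied during accumulation.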

The model update unit receives the sufficient statistics, builds and outputs a trained acoustic model, and thereby updates the acoustic model in the acoustic model storage unit.

According to the acoustic model creation device of the present invention, training accuracy can be improved while the speed of the Viterbi algorithm is retained.

FIG. 1 is a diagram showing an example of the functional configuration of the acoustic model creation devices 100 and 200.
FIG. 2 is a diagram showing an example of the processing flow of the acoustic model creation devices 100 and 200.
FIG. 3 is a diagram conceptually illustrating an acoustic model.
FIG. 4 is a diagram conceptually illustrating the process of obtaining the maximum likelihood path with the Viterbi algorithm.

FIG. 1 shows an example of the functional configuration of the acoustic model creation device 200 of the present invention. It differs from the acoustic model creation device 100 only in that the forward calculation unit 105 is replaced by a forward calculation unit 205 and the sufficient statistics accumulation unit 106 by a sufficient statistics accumulation unit 206. The other parts and the processing flow are the same, so the same parts are given the same reference numerals and their description is omitted.

The sufficient statistics accumulation unit 206 basically performs the same processing as the sufficient statistics accumulation unit 106. In the Viterbi algorithm, however, the posterior probability variable in equation (12) is γ_t(j) = 1 when the maximum likelihood path passes through state j at time t and γ_t(j) = 0 otherwise, whereas in the present invention γ_t(j) = f_j(o_t) when the maximum likelihood path passes through state j at time t. Here f_j(o_t) is the state appearance reliability, which normally takes a value in the range 0 ≤ f_j(o_t) ≤ 1. As the reliability of passing through state j at time t, f_j(o_t) may be the S/N of the frame at time t in the training speech data, or the S/N of the whole utterance. The likelihood of the training speech data with respect to its learning label may also be used. In every case, f_j(o_t) is normalized so that it lies between 0 and 1 inclusive. For example, it may be normalized using the maximum and minimum values obtained from the distribution of the S/N or the likelihood over the learning data. Alternatively, with a maximum S/N of 30 dB and a minimum of 0 dB, values of 30 dB or more may be mapped to 1 and values of 0 dB or less to 0.
Whereas the Viterbi algorithm sets the posterior probability γ_t(j) on the maximum likelihood path to 1, the present invention sets it to a reliability based on the S/N or likelihood of the training speech data. The posterior probability then reflects noise contamination of the training speech data and errors in the learning labels, so the accuracy of acoustic model training can be improved.
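The fixed-range S/N normalization described above can be sketched as follows. The 0 dB and 30 dB limits come from the text; the linear mapping between them and the function name are assumptions made for the example.

```python
def snr_to_reliability(snr_db, lo_db=0.0, hi_db=30.0):
    """Map an S/N in dB linearly onto [0, 1], clipping outside [lo_db, hi_db]."""
    x = (snr_db - lo_db) / (hi_db - lo_db)
    return min(1.0, max(0.0, x))


# Clean, moderately noisy, and very noisy frames.
weights = [snr_to_reliability(s) for s in (45.0, 15.0, -5.0)]
```

The same function covers the data-driven variant if `lo_db` and `hi_db` are set to the minimum and maximum S/N observed over the learning data.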

In addition to performing the same processing as the forward calculation unit 105, the forward calculation unit 205 may save the output probabilities calculated in equations (2) and (4). The reliability f_j(o_t) may then be set to the ratio of the output probability b_j(o_t) on the Viterbi path (maximum likelihood path) to the sum of all output probabilities already calculated at time t, that is, the value given by equation (15):

$$ f_j(o_t) = \frac{b_j(o_t)}{\sum_{k\,:\,\text{calculated at } t} b_k(o_t)} \qquad (15) $$

where the sum over k runs over the states whose output probabilities have been calculated at time t.

Because the output probabilities b_k(o_t) are those already calculated by the forward calculation unit 205 (the ◎ states in FIG. 4), training accuracy can be improved without increasing the amount of computation. The numerator, the output probability b_j(o_t) of state j on the maximum likelihood path at time t, is always included in the denominator, so the state appearance reliability f_j(o_t) is always 1 or less and is already normalized. When, for example, the learning label is wrong, the feature o_t at time t does not match state j on the maximum likelihood path, and the output probability b_j(o_t) of state j becomes small. Its difference from the output probabilities of the other states then shrinks and the state appearance reliability f_j(o_t) becomes small, which suppresses the influence of learning data with incorrect labels.
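The ratio described above can be sketched as a small function. It assumes the forward calculation has stored, for each frame, a mapping from state index to output probability covering only the states it actually evaluated; that data layout and the function name are assumptions made for the example.

```python
def state_reliability(stored_b_t, path_state):
    """Ratio of the path state's output probability to the sum of all stored ones.

    stored_b_t : dict mapping state -> b_k(o_t) for states evaluated at time t
    path_state : the state on the maximum likelihood path at time t
    """
    total = sum(stored_b_t.values())
    return stored_b_t[path_state] / total if total > 0.0 else 0.0


# The path state matches the frame well: it dominates the stored probabilities.
good = state_reliability({0: 0.02, 1: 0.06, 2: 0.02}, 1)
# The path state matches the frame poorly (e.g. a wrong label): low reliability.
poor = state_reliability({0: 0.04, 1: 0.01, 2: 0.05}, 1)
```

No new output probabilities are evaluated, so the only added cost is one sum and one division per frame.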

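Equation (16) may be applied instead of equation (15).

$$ f_j(o_t) = \frac{b_j(o_t)}{\sum_{k\,:\,\text{monophone states}} b_k(o_t)} \qquad (16) $$

where the sum over k runs over all states belonging to monophones.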

In equation (16), f_j(o_t) is the ratio of the output probability b_j(o_t) on the Viterbi path (maximum likelihood path) to the sum of all output probabilities belonging to monophones at time t. Unlike equation (15), this requires calculating some states that have not yet been calculated (the ○ states in FIG. 4). However, the number of monophone states is small compared with the total number of states, so the increase in computation is modest and training accuracy can be improved effectively. The likelihood of the monophone corresponding to the state on the Viterbi path may also be used (e.g., k-a+i[1] → *-a+*[1]). In that case the state appearance reliability f_j(o_t) is calculated entirely from monophones, and because the numerator is included in the denominator, it is a stable value of 1 or less. In equation (15), normalization uses the output probabilities of the states already calculated in the forward calculation, so the denominator depends only on states determined by the learning label. In equation (16), the denominator is always the sum of the output probabilities of all states belonging to monophones, which reduces the dependence on the learning label. The state appearance reliability f_j(o_t) is therefore stable and training accuracy can be improved.
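The monophone variant can be sketched as follows, using the mapping k-a+i[1] → *-a+*[1] from the text. The sketch implements the option in which the numerator is also taken from the corresponding monophone state; the state-name parsing, the dictionary of monophone output probabilities, and the function names are assumptions made for the example.

```python
def to_monophone_state(state_name):
    """Map a state name such as 'k-a+i[1]' to its monophone state '*-a+*[1]'."""
    name, index = state_name[:-1].split("[")
    center = name.split("-")[-1].split("+")[0]
    return f"*-{center}+*[{index}]"


def monophone_reliability(mono_b_t, path_state):
    """Ratio of the corresponding monophone's output probability to the sum
    over all monophone states at time t."""
    total = sum(mono_b_t.values())
    key = to_monophone_state(path_state)
    return mono_b_t[key] / total if total > 0.0 else 0.0


# Output probabilities of a few monophone states for one frame.
mono_b_t = {"*-a+*[1]": 0.3, "*-i+*[1]": 0.1, "*-k+*[1]": 0.1}
f = monophone_reliability(mono_b_t, "k-a+i[1]")
```

Because the denominator covers every monophone state rather than only the states reachable under the label, a wrong label cannot shrink the set of competitors.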

As described above, according to the acoustic model creation device of the present invention, training accuracy can be improved while the speed of the Viterbi algorithm is retained.

When the configuration of the acoustic model creation device of the above embodiment is implemented on a computer, the processing of the functions that each device should have is described by a program. Running this program on the computer implements those processing functions on the computer. At least part of the processing may instead be implemented in hardware.

The various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

Claims

An acoustic model creation device comprising:
a learning data acquisition unit that receives learning data consisting of a speech signal and a learning label representing the speech in characters, and outputs the speech signal and the learning label separately;
a feature amount analysis unit that receives the speech signal and extracts and outputs a speech feature sequence O = (o_1, o_2, ..., o_t, ..., o_T);
an acoustic model storage unit that stores an acoustic model including an initial state distribution, state transition probabilities from one state to another, output probabilities of the states, and the like;
a state sequence conversion unit that receives the learning label and the acoustic model, decomposes the learning label into a phoneme sequence, converts the phoneme sequence into a state sequence, and outputs the state sequence;
a forward calculation unit that receives the speech feature sequence O and the state sequence, performs forward calculation using the acoustic model corresponding to the state sequence, and outputs a forward calculation history including forward probabilities and back pointers;
a sufficient statistics accumulation unit that receives the forward calculation history, obtains a maximum likelihood path, and accumulates sufficient statistics while following the maximum likelihood path; and
a model update unit that receives the sufficient statistics, builds and outputs a trained acoustic model, and thereby updates the acoustic model in the acoustic model storage unit,
wherein the value of the posterior probability variable used when the sufficient statistics accumulation unit accumulates the sufficient statistics is a state appearance probability f_i(o_t) (0 ≤ f_i(o_t) ≤ 1) when the maximum likelihood path passes through state i (1 ≤ i ≤ J, where J is the number of states) at time t, and 0 elsewhere.

The acoustic model creation device according to claim 1, wherein
the forward calculation unit calculates and stores, in the course of the forward calculation, the output probability b_j(o_t) of state j (1 ≤ j ≤ J, where J is the number of states) at time t, and
the state appearance probability f_i(o_t) is

$$ f_i(o_t) = \frac{b_i(o_t)}{\sum_{j} b_j(o_t)} $$

(where b_i(o_t) is the output probability on the maximum likelihood path and the sum runs over the stored output probabilities at time t).

The acoustic model creation device according to claim 1, wherein
the forward calculation unit calculates and stores, in the course of the forward calculation, the output probability b_j(o_t) of each state j (1 ≤ j ≤ J, where J is the number of states) belonging to a monophone at time t, and
the state appearance probability f_i(o_t) is

$$ f_i(o_t) = \frac{b_i(o_t)}{\sum_{j} b_j(o_t)} $$

(where b_i(o_t) is the output probability on the maximum likelihood path, the sum runs over the stored output probabilities at time t, and the state i belongs to a monophone).

An acoustic model creation method that executes:
a learning data acquisition step of separating learning data consisting of a speech signal and a learning label representing the speech in characters;
a feature amount analysis step of extracting a speech feature sequence O = (o_1, o_2, ..., o_t, ..., o_T) from the speech signal;
a state sequence conversion step of decomposing the learning label into a phoneme sequence and converting the phoneme sequence into a state sequence;
a forward calculation step of performing, from the speech feature sequence O and the state sequence, forward calculation using the acoustic model corresponding to the state sequence, and outputting a forward calculation history including forward probabilities and back pointers;
a sufficient statistics accumulation step of obtaining a maximum likelihood path from the forward calculation history and accumulating sufficient statistics while following the maximum likelihood path; and
a model update step of building and outputting a trained acoustic model from the sufficient statistics and thereby updating the acoustic model,
wherein the value of the posterior probability variable used when accumulating the sufficient statistics in the sufficient statistics accumulation step is a state appearance probability f_i(o_t) (0 ≤ f_i(o_t) ≤ 1) when the maximum likelihood path passes through state i (1 ≤ i ≤ J, where J is the number of states) at time t, and 0 elsewhere.

The acoustic model creation method according to claim 4, wherein
the forward calculation step calculates and stores, in the course of the forward calculation, the output probability b_j(o_t) of state j (1 ≤ j ≤ J, where J is the number of states) at time t, and
the state appearance probability f_i(o_t) is

$$ f_i(o_t) = \frac{b_i(o_t)}{\sum_{j} b_j(o_t)} $$

(where b_i(o_t) is the output probability on the maximum likelihood path and the sum runs over the stored output probabilities at time t).

The acoustic model creation method according to claim 4, wherein
the forward calculation step calculates and stores, in the course of the forward calculation, the output probability b_j(o_t) of each state j (1 ≤ j ≤ J, where J is the number of states) belonging to a monophone at time t, and
the state appearance probability f_i(o_t) is

A program for causing a computer to function as the apparatus according to claim 1.