
JP5089655B2 - Acoustic model creation device, method and program thereof - Google Patents

Acoustic model creation device, method and program thereof

Info

Publication number
JP5089655B2
JP5089655B2 JP2009148089A JP2009148089A JP5089655B2 JP 5089655 B2 JP5089655 B2 JP 5089655B2 JP 2009148089 A JP2009148089 A JP 2009148089A JP 2009148089 A JP2009148089 A JP 2009148089A JP 5089655 B2 JP5089655 B2 JP 5089655B2
Authority
JP
Japan
Prior art keywords
state
acoustic model
forward calculation
probability
maximum likelihood
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2009148089A
Other languages
Japanese (ja)
Other versions
JP2011002792A (en)
Inventor
哲 小橋川
太一 浅見
義和 山口
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2009148089A
Publication of JP2011002792A
Application granted
Publication of JP5089655B2
Legal status: Expired - Fee Related

Description

The present invention relates to an acoustic model creation device used for speech recognition, and to a method and a program therefor.

An acoustic model is usually created or updated as follows. First, from an observed signal sequence for learning, the corresponding state sequence, and an acoustic model (an arbitrary initial model when a new model is created), one computes either the posterior probability variable γ_t(i) based on forward and backward calculation (for example, with the Forward-Backward algorithm; see Non-Patent Document 1) or the forward probability δ_t(i) based on forward calculation alone (for example, with the Viterbi algorithm; see Non-Patent Document 2). Here γ_t(i) is the probability of being in state i (1 ≤ i ≤ J, where J is the number of states in the model) at time t (1 ≤ t ≤ T, where T is the length of the observed signal sequence), and δ_t(i) is the probability of the most probable state sequence among those that reach state i at time t. The most likely state sequence (the maximum likelihood path) is then obtained from γ_t(i) or δ_t(i), and the statistics needed to update the acoustic model (hereinafter "sufficient statistics") are extracted while tracing that path and are used to update the acoustic model.

FIG. 1 shows a configuration example of a conventional acoustic model creation device 100 that uses the Viterbi algorithm, and FIG. 2 shows an example of its processing flow. The acoustic model creation device 100 includes a learning data acquisition unit 101, a feature amount analysis unit 102, a state sequence conversion unit 103, an acoustic model storage unit 104, a forward calculation unit 105, a sufficient statistics accumulation unit 106, and a model update unit 107.

The learning data acquisition unit 101 receives learning data consisting of a speech signal and a learning label that represents the speech in, for example, kana characters, and outputs the speech signal and the learning label separately (S1). When acquisition of the learning data is complete, it sends data acquisition completion information to the sufficient statistics accumulation unit 106.

The feature amount analysis unit 102 receives the speech signal and extracts and outputs a speech feature sequence O = (o_1, o_2, ..., o_t, ..., o_T) (S2). The extracted features are, for example, dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficient), dynamic parameters such as their deltas (ΔMFCC), power, and Δpower. CMN (cepstral mean normalization) may also be applied. The features are not limited to MFCC and power; any parameters used for speech recognition may be used.
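As an illustration of the optional CMN step, the following is a minimal sketch that subtracts the per-dimension mean over one utterance from every frame. The function name `cepstral_mean_normalize` and the list-of-lists frame layout are assumptions made for this example; the source does not specify how CMN is implemented.

```python
from typing import List


def cepstral_mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the utterance-level mean of each feature dimension.

    `frames` is assumed to hold one feature vector o_t per time step,
    all with the same dimensionality (hypothetical layout).
    """
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each dimension over all frames of the utterance.
    mean = [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]
    # Return new vectors with the mean removed.
    return [[frame[d] - mean[d] for d in range(dim)] for frame in frames]
```

Normalizing per utterance is only one possible choice; a running mean over a window would be an equally valid reading of "CMN processing".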

The state sequence conversion unit 103 receives the learning label and the acoustic model, decomposes the learning label into a phoneme sequence, converts that into a state sequence, and outputs it (S3). For example, if the learning label is "こんにちは" ("konnichiwa"), it is decomposed into a phoneme sequence such as "pause k o ng n i ch i h a pause", which is then converted into a state sequence such as "*-pause+k[1] *-pause+k[2] *-pause+k[3] pause-k+o[1] ...". Here [1], [2], and [3] are the state numbers of the phoneme model.
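The expansion from a phoneme sequence to a state sequence can be sketched as below. This is an illustrative assumption, not the patent's procedure: the function name `phonemes_to_states`, the `left-center+right[n]` naming with `*` for a missing context, and the fixed three states per phoneme are chosen to match the example strings in the text.

```python
from typing import List


def phonemes_to_states(phonemes: List[str], states_per_phone: int = 3) -> List[str]:
    """Expand a phoneme sequence into triphone state names.

    Each phoneme becomes `states_per_phone` states named
    "left-center+right[n]"; "*" stands for a missing neighbour.
    """
    names: List[str] = []
    for idx, center in enumerate(phonemes):
        left = phonemes[idx - 1] if idx > 0 else "*"
        right = phonemes[idx + 1] if idx + 1 < len(phonemes) else "*"
        for state_no in range(1, states_per_phone + 1):
            names.append(f"{left}-{center}+{right}[{state_no}]")
    return names
```

Calling it on the phoneme list obtained from the learning label yields one state name per HMM state, in the order the forward calculation visits them.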

The acoustic model storage unit 104 stores an acoustic model λ consisting of the initial state distribution π_i, the state transition probability a_ij from state i to state j (1 ≤ j ≤ J, where J is the number of states in the model), the output probability of state j at time t

$$ b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t;\ \mu_{jk},\ U_{jk}) \qquad (1) $$

(where M is the number of basis normal distributions belonging to state j (the number of mixtures), N is a basis normal distribution, c_jk is the mixture weight coefficient for the k-th basis normal distribution of state j, μ_jk is the mean vector, and U_jk is the covariance matrix), and so on. FIG. 3 conceptually illustrates the acoustic model λ. The acoustic model λ is an HMM (Hidden Markov Model) and consists of multiple phoneme models. A phoneme model may be a monophone (for example, *-a+*) or a triphone (for example, k-a+i). Each phoneme model consists of, for example, three states; state transition probabilities are defined within each state and between states, and the output probability of each state is defined as a mixture normal distribution consisting of multiple (three in FIG. 3) basis normal distributions.
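The mixture output probability of a single state can be evaluated as in the sketch below. It assumes diagonal covariance matrices (each U_jk given as a list of variances), which the source does not state; the function names `diag_gaussian` and `output_probability` are likewise hypothetical.

```python
import math
from typing import Sequence


def diag_gaussian(o: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Density of a normal distribution with diagonal covariance at o."""
    log_p = 0.0
    for x, m, v in zip(o, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)


def output_probability(
    o: Sequence[float],
    weights: Sequence[float],               # c_jk for k = 1..M
    means: Sequence[Sequence[float]],       # mu_jk
    variances: Sequence[Sequence[float]],   # diagonal of U_jk (assumption)
) -> float:
    """b_j(o_t): weighted sum of the M component densities of state j."""
    return sum(
        c * diag_gaussian(o, m, v)
        for c, m, v in zip(weights, means, variances)
    )
```

A practical implementation would usually work with log probabilities to avoid underflow on high-dimensional features; the plain-probability form is kept here so that it lines up with the notation b_j(o_t).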

The forward calculation unit 105 receives the speech feature sequence O and the state sequence, performs forward calculation using the acoustic model λ corresponding to the state sequence, and outputs a forward calculation history consisting of the forward probabilities δ_t(i) and the back pointers ψ_t(i) (S4). The back pointer ψ_t(i) stores the source state of the transition for the most probable state sequence among those that reach state i at time t. The forward calculation can be performed, for example, as follows.

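(i) Initialization

$$ \delta_1(i) = \pi_i \, b_i(o_1) \quad (1 \le i \le J,\ J \text{ is the number of states in the model}) \qquad (2) $$
$$ \psi_1(i) = 0 \qquad (3) $$

(ii) Forward calculation

$$ \delta_t(j) = \max_{1 \le i \le J} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t) \quad (2 \le t \le T,\ 1 \le j \le J) \qquad (4) $$
$$ \psi_t(j) = \operatorname*{arg\,max}_{1 \le i \le J} \left[ \delta_{t-1}(i)\, a_{ij} \right] \quad (2 \le t \le T,\ 1 \le j \le J) \qquad (5) $$

(iii) Termination

$$ P^* = \max_{1 \le i \le J} \delta_T(i) \qquad (6) $$
$$ q_T^* = \operatorname*{arg\,max}_{1 \le i \le J} \delta_T(i) \qquad (7) $$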
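The initialization and forward steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: states and times are 0-based, the output probabilities are passed in as a precomputed table `b[t][j]`, and the function name `viterbi_forward` is hypothetical. It also evaluates every state at every time step, whereas the device described here only evaluates states reachable along the state sequence.

```python
from typing import List, Tuple


def viterbi_forward(
    pi: List[float],        # pi[i]: initial state distribution
    a: List[List[float]],   # a[i][j]: transition probability i -> j
    b: List[List[float]],   # b[t][j]: output probability of state j at frame t
) -> Tuple[List[List[float]], List[List[int]]]:
    """Return the forward probabilities delta and back pointers psi."""
    num_frames, num_states = len(b), len(pi)
    delta = [[0.0] * num_states for _ in range(num_frames)]
    psi = [[0] * num_states for _ in range(num_frames)]

    # Initialization: delta at the first frame; psi at the first frame is unused.
    for i in range(num_states):
        delta[0][i] = pi[i] * b[0][i]

    # Forward calculation: keep the best predecessor of each state.
    for t in range(1, num_frames):
        for j in range(num_states):
            best_i = max(range(num_states), key=lambda i: delta[t - 1][i] * a[i][j])
            psi[t][j] = best_i
            delta[t][j] = delta[t - 1][best_i] * a[best_i][j] * b[t][j]
    return delta, psi
```

The returned pair corresponds to the forward calculation history; termination and backtracking operate on it afterwards.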

The sufficient statistics accumulation unit 106 obtains the maximum likelihood path q_t* from the input forward calculation history and, while tracing that path, accumulates sufficient statistics until all of the acquired learning data has been processed (S5). The maximum likelihood path q_t* can be obtained with the back pointers as follows.

$$ q_t^* = \psi_{t+1}(q_{t+1}^*) \quad (t = T-1,\ T-2,\ \ldots,\ 1) \qquad (8) $$

The sufficient statistics include the statistics used for the model update, for example the weight coefficient, mean vector, and covariance of each mixture component of the mixture normal distribution of the output probability, each of which can be computed from the following equations.

[Equations (9) to (12): equation images, not reproduced in the text]
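For orientation, the textbook re-estimation formulas for a Gaussian-mixture HMM, as given in the Rabiner tutorial cited as Non-Patent Document 2, take the form below. This is a sketch under the assumption that equations (9) to (12) follow that standard form; the component-level posterior γ_t(j,k) is notation introduced here and may differ from the patent's own symbols.

```latex
\begin{align}
c_{jk} &= \frac{\sum_{t=1}^{T} \gamma_t(j,k)}
               {\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m)} \\
\mu_{jk} &= \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}
                 {\sum_{t=1}^{T} \gamma_t(j,k)} \\
U_{jk} &= \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,(o_t - \mu_{jk})(o_t - \mu_{jk})^{\top}}
               {\sum_{t=1}^{T} \gamma_t(j,k)} \\
\gamma_t(j,k) &= \gamma_t(j)\,
                 \frac{c_{jk}\, N(o_t;\ \mu_{jk},\ U_{jk})}
                      {\sum_{m=1}^{M} c_{jm}\, N(o_t;\ \mu_{jm},\ U_{jm})}
\end{align}
```

In this form the state-level posterior γ_t(j) enters only through the last line, which is why replacing γ_t(j) changes the weight each frame contributes to all three statistics.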

Note that in the Viterbi algorithm the posterior probability variable γ_t(j) in equation (12) is

$$ \gamma_t(j) = \begin{cases} 1 & (q_t^* = j) \\ 0 & (\text{otherwise}) \end{cases} \qquad (13) $$
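Termination, backtracking, and the resulting 0/1 posterior can be sketched as follows. The sketch assumes 0-based indices and a forward calculation history stored as two nested lists `delta[t][i]` and `psi[t][i]`; the function names `backtrace` and `hard_posteriors` are hypothetical.

```python
from typing import List


def backtrace(delta: List[List[float]], psi: List[List[int]]) -> List[int]:
    """Recover the maximum likelihood path from the forward history."""
    num_frames = len(delta)
    path = [0] * num_frames
    # Termination: the best final state has the largest forward probability.
    last = num_frames - 1
    path[last] = max(range(len(delta[last])), key=lambda i: delta[last][i])
    # Backtracking: follow the back pointers from the last frame to the first.
    for t in range(num_frames - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path


def hard_posteriors(path: List[int], num_states: int) -> List[List[float]]:
    """gamma[t][j] is 1.0 on the maximum likelihood path and 0.0 elsewhere."""
    return [
        [1.0 if path[t] == j else 0.0 for j in range(num_states)]
        for t in range(len(path))
    ]
```

Every frame on the path receives full weight here, regardless of how well the frame actually matches its state.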

The process by which the forward calculation unit 105 and the sufficient statistics accumulation unit 106 obtain the maximum likelihood path with the Viterbi algorithm is described concretely with reference to FIGS. 4(a) and 4(b). In FIGS. 4(a) and 4(b), the horizontal axis is time (corresponding to the speech feature sequence), the vertical axis is the state (corresponding to the phoneme sequence), and each circle (●, ◎, ○) represents state i at some time t. First, as shown in FIG. 4(a), the forward calculation unit 105 starts from a state at t = 1 (state 1 in FIG. 4(a)) and computes the forward probability δ_t for several states at t = 2. In FIG. 4(a), ◎ marks states whose forward probability δ_t has been computed, ○ marks states for which δ_t has not been computed, and ● marks states on the maximum likelihood path (the start state at t = 1 and the end state at t = T are always on the maximum likelihood path). A back pointer ψ_t is kept for every computed state. Next, starting from the states whose forward probability has been computed at t = 2 (states 1 and 2 in FIG. 4(a)), the forward probabilities δ_t for several states at t = 3 are computed and the back pointers ψ_t are kept. The same steps are repeated up to t = T (the length of the speech feature sequence) to accumulate the forward calculation history. Using this history, the sufficient statistics accumulation unit 106 traces back from time T through T−1, T−2, ... to t = 1, as shown in FIG. 4(b), to obtain the maximum likelihood path q_t*. In equation (13), the posterior probability variable γ_t of the Viterbi algorithm is 1 on the maximum likelihood path because the probability of passing through the maximum likelihood path is taken to be 1 (100%) and that of any other path to be 0 (0%).

The model update unit 107 receives the sufficient statistics, builds and outputs a trained acoustic model, and thereby updates the acoustic model in the acoustic model storage unit 104. The stored model need not be overwritten directly by the output; for example, a trained acoustic model storage unit 108 may be provided and the model written there first.

Non-Patent Document 1: Kiyohiro Shikano and four others, "Speech Recognition System" (音声認識システム), Ohmsha, May 2001, pp. 18-31
Non-Patent Document 2: L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989, pp. 257-284

With the Forward-Backward algorithm, the posterior probability variable γ_t(i) is computed and the maximum likelihood path q_t* is obtained as

$$ q_t^* = \operatorname*{arg\,max}_{1 \le i \le J} \gamma_t(i) \quad (1 \le t \le T) \qquad (14) $$

that is, by narrowing down from a certain range of candidates, so a highly accurate learning effect can be expected. However, because backward calculation is required, processing is slow.

With the Viterbi algorithm, processing is fast because no backward calculation is needed. However, as equation (13) shows, the sufficient statistics in equations (9) to (12) are computed with γ_t(i) = 1 on the assumption that the maximum likelihood path is correct, so learning accuracy may degrade when the learning data contains errors or when ambiguous utterances or noise are superimposed.

An object of the present invention is to realize an acoustic model creation device and the like that retains the speed of the Viterbi algorithm while achieving better learning accuracy than the Viterbi algorithm.

The acoustic model creation device of the present invention includes a learning data acquisition unit, a feature amount analysis unit, a state sequence conversion unit, an acoustic model storage unit, a forward calculation unit, a sufficient statistics accumulation unit, and a model update unit.

The learning data acquisition unit receives learning data consisting of a speech signal and a learning label that represents the speech in characters, and outputs the speech signal and the learning label separately.

The feature amount analysis unit receives the speech signal and extracts and outputs a speech feature sequence O = (o_1, o_2, ..., o_t, ..., o_T).

The acoustic model storage unit stores an acoustic model consisting of an initial state distribution, state transition probabilities from one state to another, output probabilities in each state, and the like.

The state sequence conversion unit receives the learning label and the acoustic model, decomposes the learning label into a phoneme sequence, converts that into a state sequence, and outputs it.

The forward calculation unit receives the speech feature sequence O and the state sequence, performs forward calculation using the acoustic model corresponding to the state sequence, and outputs a forward calculation history consisting of forward probabilities and back pointers.

The sufficient statistics accumulation unit receives the forward calculation history, obtains the maximum likelihood path, and accumulates sufficient statistics while tracing that path. The value of the posterior probability variable used when accumulating the sufficient statistics is the state appearance reliability f_i(o_t) (0 ≤ f_i(o_t) ≤ 1) when the maximum likelihood path passes through state i (1 ≤ i ≤ J, where J is the number of states) at time t, and 0 elsewhere.

The model update unit receives the sufficient statistics, builds and outputs a trained acoustic model, and thereby updates the acoustic model in the acoustic model storage unit.

According to the acoustic model creation device of the present invention, learning accuracy can be improved while retaining the speed of the Viterbi algorithm.

FIG. 1 is a diagram showing an example functional configuration of the acoustic model creation devices 100 and 200.
FIG. 2 is a diagram showing an example processing flow of the acoustic model creation devices 100 and 200.
FIG. 3 is a diagram conceptually illustrating an acoustic model.
FIG. 4 is a diagram conceptually illustrating the process of obtaining the maximum likelihood path with the Viterbi algorithm.

FIG. 1 shows an example functional configuration of the acoustic model creation device 200 of the present invention. It differs from the acoustic model creation device 100 only in that the forward calculation unit 105 is replaced with a forward calculation unit 205 and the sufficient statistics accumulation unit 106 is replaced with a sufficient statistics accumulation unit 206. The other parts and the processing flow are the same, so the same parts are given the same reference numerals and their description is omitted.

The sufficient statistics accumulation unit 206 basically performs the same processing as the sufficient statistics accumulation unit 106. In the Viterbi algorithm, however, equation (12) uses γ_t(j) = 1 when the maximum likelihood path passes through state j at time t and γ_t(j) = 0 otherwise, whereas the present invention uses γ_t(j) = f_j(o_t) when the maximum likelihood path passes through state j at time t. Here f_j(o_t) is the state appearance reliability and normally takes a value in the range 0 ≤ f_j(o_t) ≤ 1. As the state appearance reliability f_j(o_t), that is, the reliability of passing through state j at time t, the present invention may use the S/N of the frame at time t in the learning speech data or the S/N of the entire utterance. The likelihood of the learning speech data given its learning label may also be used. In every case, the value of f_j(o_t) is normalized to lie between 0 and 1 inclusive. For example, it may be normalized using the maximum and minimum values obtained from the distribution of S/N or likelihood over the learning data. Alternatively, with the maximum S/N set to 30 dB and the minimum to 0 dB, values of 30 dB or more may be mapped to 1 and values of 0 dB or less to 0. Whereas the Viterbi algorithm sets the posterior probability γ_t(j) on the maximum likelihood path to 1, the present invention sets it to a reliability based on the S/N or likelihood of the learning speech data. The posterior probability then reflects noise contamination in the learning speech data and errors in the learning labels, so the accuracy of acoustic model learning can be improved.
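The S/N-based reliability and its effect on an accumulated statistic can be sketched as follows. The linear mapping between 0 dB and 30 dB is an assumption (the text fixes only the two end points), and the function names `snr_to_reliability` and `weighted_mean` are hypothetical; the weighted mean stands in for the mean-vector statistic of a single-component state.

```python
from typing import List


def snr_to_reliability(snr_db: float, lo_db: float = 0.0, hi_db: float = 30.0) -> float:
    """Map an S/N in dB to [0, 1]: <= lo_db gives 0, >= hi_db gives 1.

    Values in between are interpolated linearly (assumption).
    """
    return min(1.0, max(0.0, (snr_db - lo_db) / (hi_db - lo_db)))


def weighted_mean(frames: List[List[float]], weights: List[float]) -> List[float]:
    """Mean of the frames assigned to one state, weighted by reliability."""
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all frames have zero reliability")
    dim = len(frames[0])
    return [
        sum(w * frame[d] for w, frame in zip(weights, frames)) / total
        for d in range(dim)
    ]
```

With all weights equal to 1 this reduces to the ordinary Viterbi-training mean; lowering the weight of a noisy frame reduces its pull on the estimate.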

In addition to performing the same processing as the forward calculation unit 105, the forward calculation unit 205 may store the output probabilities computed in equations (2) and (4), and f_j(o_t) may be set to the ratio of the output probability b_j(o_t) on the Viterbi path (maximum likelihood path) to the sum of all output probabilities already computed at time t, that is, the value given by equation (15).

$$ f_j(o_t) = \frac{b_j(o_t)}{\sum_{k \in \text{computed}} b_k(o_t)} \qquad (15) $$

Because the output probabilities b_k(o_t) are those already computed by the forward calculation unit 205 (the ◎ marks in FIG. 4), learning accuracy can be improved without increasing the amount of computation. Since the numerator, the output probability b_j(o_t) of state j on the maximum likelihood path at time t, is always included in the denominator, the state appearance reliability f_j(o_t) is always 1 or less and is therefore already normalized. When, for example, the learning label contains an error, the feature o_t at time t does not match state j on the maximum likelihood path, and the output probability b_j(o_t) of state j becomes small. Its difference from the output probabilities of the other states then shrinks and f_j(o_t) takes a small value, which suppresses the influence of learning data with erroneous labels.
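The ratio described above can be sketched as a small function. It assumes that the output probabilities already computed at one time step are available as a dictionary from state index to b_k(o_t); the dictionary layout and the name `state_reliability` are illustrative choices, not taken from the source.

```python
from typing import Dict


def state_reliability(computed_b: Dict[int, float], path_state: int) -> float:
    """Ratio of the path state's output probability to the sum over all
    states whose output probability was computed at this time step."""
    denom = sum(computed_b.values())
    if denom <= 0.0:
        # No evidence for any state at this frame: give the frame no weight.
        return 0.0
    return computed_b[path_state] / denom
```

When only the path state was evaluated at a frame, the ratio is 1 and the frame is weighted exactly as in ordinary Viterbi training.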

Equation (16) may be applied instead of equation (15).

$$ f_j(o_t) = \frac{b_j(o_t)}{\sum_{k \in \text{monophone}} b_k(o_t)} \qquad (16) $$

In equation (16), f_j(o_t) is the ratio of the output probability b_j(o_t) on the Viterbi path (maximum likelihood path) to the sum of the output probabilities of all states belonging to monophones at time t. Unlike equation (15), this requires computing some output probabilities that have not yet been computed (the ○ marks in FIG. 4). However, because the number of monophone states is small compared with the total number of states, the increase in computation is modest and learning accuracy can be improved effectively. The likelihood of the monophone corresponding to the state on the Viterbi path may also be used (ex. k-a+i[1] → *-a+*[1]). In that case the state appearance reliability f_j(o_t) is computed entirely from monophones, and because the denominator contains the numerator, the value is stable and at most 1. In equation (15) the normalization uses the output probabilities of the states already computed in the forward calculation, so the denominator depends only on states determined by the learning label. In equation (16) the denominator is always the sum of the output probabilities of all states belonging to monophones, which reduces the dependence on the learning label. The state appearance reliability f_j(o_t) is therefore more stable, and learning accuracy can be improved.
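The monophone variant can be sketched as below, using the mapping k-a+i[1] → *-a+*[1] mentioned in the text. The state-name syntax `left-center+right[n]`, the dictionary of monophone output probabilities, and the function names `to_monophone` and `monophone_reliability` are assumptions made for this illustration.

```python
import re
from typing import Dict

# Matches names such as "k-a+i[1]" or "*-a+*[1]" (assumed naming scheme).
_STATE_NAME = re.compile(r"[^-+\[\]]+-([^-+\[\]]+)\+[^-+\[\]]+\[(\d+)\]")


def to_monophone(state_name: str) -> str:
    """Drop the left and right context of a state name."""
    match = _STATE_NAME.fullmatch(state_name)
    if match is None:
        raise ValueError(f"unexpected state name: {state_name}")
    return f"*-{match.group(1)}+*[{match.group(2)}]"


def monophone_reliability(mono_b: Dict[str, float], path_state_name: str) -> float:
    """Output probability of the path state's monophone divided by the sum
    of the output probabilities of all monophone states at this frame."""
    denom = sum(mono_b.values())
    if denom <= 0.0:
        return 0.0
    return mono_b[to_monophone(path_state_name)] / denom
```

Because the denominator covers every monophone state rather than only the states reachable under the learning label, a frame that fits some other phoneme better than the labelled one receives a low weight.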

As described above, according to the acoustic model creation device of the present invention, learning accuracy can be improved while retaining the speed of the Viterbi algorithm.

When the configuration of the acoustic model creation device of each of the above embodiments is implemented on a computer, the processing of the functions that each device should have is described by a program. Executing this program on the computer realizes the above processing functions on the computer. In this case, at least part of the processing may be implemented in hardware.

The various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other modifications may be made as appropriate without departing from the spirit of the present invention.

Claims (7)

1. An acoustic model creation device comprising:
a learning data acquisition unit that receives learning data consisting of a speech signal and a learning label representing the speech in characters, and separates and outputs the speech signal and the learning label;
a feature amount analysis unit that receives the speech signal and extracts and outputs a speech feature sequence O = (o_1, o_2, ..., o_t, ..., o_T);
an acoustic model storage unit that stores an acoustic model consisting of an initial state distribution, state transition probabilities from one state to another, output probabilities in each state, and the like;
a state sequence conversion unit that receives the learning label and the acoustic model, decomposes the learning label into a phoneme sequence, converts that into a state sequence, and outputs it;
a forward calculation unit that receives the speech feature sequence O and the state sequence, performs forward calculation using the acoustic model corresponding to the state sequence, and outputs a forward calculation history consisting of forward probabilities and back pointers;
a sufficient statistics accumulation unit that receives the forward calculation history, obtains the maximum likelihood path, and accumulates sufficient statistics while tracing that path; and
a model update unit that receives the sufficient statistics, builds and outputs a trained acoustic model, and thereby updates the acoustic model in the acoustic model storage unit,
wherein the value of the posterior probability variable used by the sufficient statistics accumulation unit when accumulating the sufficient statistics is a state appearance probability f_i(o_t) (0 ≤ f_i(o_t) ≤ 1) when the maximum likelihood path passes through state i (1 ≤ i ≤ J, where J is the number of states) at time t, and is 0 elsewhere.
2. The acoustic model creation device according to claim 1, wherein
the forward calculation unit computes and stores, in the course of the forward calculation, the output probability b_j(o_t) of state j (1 ≤ j ≤ J, where J is the number of states) at time t, and
the state appearance probability f_i(o_t) is
$$ f_i(o_t) = \frac{b_i(o_t)}{\sum_{j} b_j(o_t)} $$
(where b_i(o_t) is the output probability on the maximum likelihood path).
3. The acoustic model creation device according to claim 1, wherein
the forward calculation unit computes and stores, in the course of the forward calculation, the output probability b_j(o_t) of each state j (1 ≤ j ≤ J, where J is the number of states) belonging to a monophone at time t,
the state appearance probability f_i(o_t) is
$$ f_i(o_t) = \frac{b_i(o_t)}{\sum_{j} b_j(o_t)} $$
(where b_i(o_t) is the output probability on the maximum likelihood path), and state i belongs to a monophone.
An acoustic model creation method that executes:
a learning data acquisition step of separating learning data, which consists of a speech signal and a learning label representing that speech as text, into the speech signal and the learning label;
a feature analysis step of extracting a speech feature sequence O = (o_1, o_2, ..., o_t, ..., o_T) from the speech signal;
a state sequence conversion step of decomposing the learning label into a phoneme sequence and further converting the phoneme sequence into a state sequence;
a forward calculation step of performing a forward calculation on the speech feature sequence O and the state sequence, using the acoustic model corresponding to the state sequence, and outputting a forward calculation history consisting of forward probabilities and back pointers;
a sufficient statistics accumulation step of obtaining the maximum likelihood path from the forward calculation history and accumulating sufficient statistics while tracing that path; and
a model update step of constructing and outputting a trained acoustic model from the sufficient statistics, thereby updating the acoustic model,
wherein the value of the posterior probability variable used when accumulating the sufficient statistics in the sufficient statistics accumulation step is the state appearance probability f_i(o_t) (0 ≤ f_i(o_t) ≤ 1) where the maximum likelihood path passes through state i (1 ≤ i ≤ J, where J is the number of states) at time t, and 0 everywhere else.
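The forward calculation, path tracing, statistics accumulation, and model update steps can be sketched as follows. This is a toy illustration under several assumptions that do not come from the claim: one-dimensional Gaussian output distributions, a left-to-right state sequence with fixed transition log probabilities, a max-based (Viterbi-style) recursion, mean-only re-estimation, and a normalized form for f_i(o_t). All function names are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (stand-in for the state output probability)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def forward_pass(obs, means, variances, log_stay, log_move):
    """Max-based forward pass over a left-to-right state sequence.

    Returns the score table, the back pointers, and the stored log output
    probabilities log b_j(o_t) for every time t and state j.
    """
    T, J = len(obs), len(means)
    neg_inf = float("-inf")
    log_b = [[log_gauss(obs[t], means[j], variances[j]) for j in range(J)]
             for t in range(T)]
    score = [[neg_inf] * J for _ in range(T)]
    back = [[0] * J for _ in range(T)]
    score[0][0] = log_b[0][0]  # the path is assumed to start in the first state
    for t in range(1, T):
        for j in range(J):
            stay = score[t - 1][j] + log_stay
            move = score[t - 1][j - 1] + log_move if j > 0 else neg_inf
            best, back[t][j] = (move, j - 1) if move > stay else (stay, j)
            score[t][j] = best + log_b[t][j]
    return score, back, log_b

def trace_path(back):
    """Follow back pointers from the final state to recover the best path."""
    path = [len(back[0]) - 1]  # the path is assumed to end in the last state
    for t in range(len(back) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def accumulate(obs, path, log_b):
    """Accumulate occupancy and first-order statistics weighted by f_i(o_t)."""
    J = len(log_b[0])
    occ, first = [0.0] * J, [0.0] * J
    for t, i in enumerate(path):
        m = max(log_b[t])
        w = [math.exp(v - m) for v in log_b[t]]
        f = w[i] / sum(w)  # assumed normalized form; states off the path get 0
        occ[i] += f
        first[i] += f * obs[t]
    return occ, first

def update_means(occ, first, old_means):
    """Re-estimate each state mean; keep the old mean for unvisited states."""
    return [first[j] / occ[j] if occ[j] > 0.0 else old_means[j]
            for j in range(len(old_means))]
```

For example, with observations `[0.1, -0.2, 0.0, 4.9, 5.2, 5.0]`, initial means `[0.0, 5.0]`, unit variances, and equal transition log probabilities, `trace_path` returns `[0, 0, 0, 1, 1, 1]`, and each frame contributes a weight between 0 and 1 to its state rather than a hard count of 1.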
The acoustic model creation method according to claim 4, wherein
the forward calculation step calculates and stores the output probability b_j(o_t) of each state j (1 ≤ j ≤ J, where J is the number of states) at time t in the course of the forward calculation, and
the state appearance probability f_i(o_t) is given by
[Equation image: Figure 0005089655]
where b_i(o_t) is the output probability on the maximum likelihood path.
The acoustic model creation method according to claim 4, wherein
the forward calculation step calculates and stores the output probability b_j(o_t) of each state j (1 ≤ j ≤ J, where J is the number of states) belonging to a monophone at time t in the course of the forward calculation,
the state appearance probability f_i(o_t) is given by
[Equation image: Figure 0005089655]
where b_i(o_t) is the output probability on the maximum likelihood path, and
the state i belongs to a monophone.
A program for causing a computer to function as the device according to any one of claims 1 to 3.
JP2009148089A 2009-06-22 2009-06-22 Acoustic model creation device, method and program thereof Expired - Fee Related JP5089655B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2009148089A JP5089655B2 (en) 2009-06-22 2009-06-22 Acoustic model creation device, method and program thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2009148089A JP5089655B2 (en) 2009-06-22 2009-06-22 Acoustic model creation device, method and program thereof

Publications (2)

Publication Number Publication Date
JP2011002792A JP2011002792A (en) 2011-01-06
JP5089655B2 JP5089655B2 (en) 2012-12-05

Family

ID=43560760

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2009148089A Expired - Fee Related JP5089655B2 (en) 2009-06-22 2009-06-22 Acoustic model creation device, method and program thereof

Country Status (1)

Country Link
JP (1) JP5089655B2 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0776880B2 (en) * 1993-01-13 1995-08-16 日本電気株式会社 Pattern recognition method and apparatus
JP2003271185A (en) * 2002-03-15 2003-09-25 Nippon Telegr & Teleph Corp <Ntt> Device and method for preparing information for voice recognition, device and method for recognizing voice, information preparation program for voice recognition, recording medium recorded with the program, voice recognition program and recording medium recorded with the program

Also Published As

Publication number Publication date
JP2011002792A (en) 2011-01-06

Similar Documents

Publication Publication Date Title
US10902845B2 (en) System and methods for adapting neural network acoustic models
JP5088701B2 (en) Language model learning system, language model learning method, and language model learning program
Povey Discriminative training for large vocabulary speech recognition
US9672815B2 (en) Method and system for real-time keyword spotting for speech analytics
Kingsbury Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling
US9972306B2 (en) Method and system for acoustic data selection for training the parameters of an acoustic model
JP6110945B2 (en) Method and system for selectively biased linear discriminant analysis in automatic speech recognition systems
Kuo et al. Maximum entropy direct models for speech recognition
CN107615376B (en) Voice recognition device and computer program recording medium
US8494847B2 (en) Weighting factor learning system and audio recognition system
Bao et al. Incoherent training of deep neural networks to de-correlate bottleneck features for speech recognition
JP5752060B2 (en) Information processing apparatus, large vocabulary continuous speech recognition method and program
US8332222B2 (en) Viterbi decoder and speech recognition method using same using non-linear filter for observation probabilities
WO2018066436A1 (en) Learning device for acoustic model and computer program for same
AU2018271242A1 (en) Method and system for real-time keyword spotting for speech analytics
JP6027754B2 (en) Adaptation device, speech recognition device, and program thereof
JP5288378B2 (en) Acoustic model speaker adaptation apparatus and computer program therefor
JP5089655B2 (en) Acoustic model creation device, method and program thereof
JP3589044B2 (en) Speaker adaptation device
Sanchis et al. Improving utterance verification using a smoothed naive bayes model
JP5170449B2 (en) Detection device, voice recognition device, detection method, and program
Shigli et al. Automatic dialect and accent speech recognition of South Indian English
Wang Model-based approaches to robust speech recognition in diverse environments
Hamar Using Sub-Phonemic Units for HMM Based Phone Recognition
Ragni Discriminative models for speech recognition

Legal Events

Date Code Title Description
RD03 Notification of appointment of power of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7423

Effective date: 20110720

A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20111012

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20120827

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20120904

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20120911

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20150921

Year of fee payment: 3

R150 Certificate of patent or registration of utility model

Ref document number: 5089655

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

LAPS Cancellation because of no payment of annual fees