JP6884946B2 - Acoustic model learning device and computer program for it

Info

Publication number: JP6884946B2
Application number: JP2016197107A
Authority: JP
Inventors: 直之神田
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2016-10-05
Filing date: 2016-10-05
Publication date: 2021-06-09
Anticipated expiration: 2036-10-05
Also published as: WO2018066436A1; JP2018060047A

Description

The present invention relates to speech recognition technology, and more particularly to a learning device for improving the accuracy of a CTC (Connectionist Temporal Classification) acoustic model (CTC-AM) used in a speech recognition device.

An increasing number of devices and services use speech input and output as the interface between humans and computers. For example, speech input and output is used to operate mobile phones. For such an interface, the recognition accuracy of the underlying speech recognition device must be as high as possible.

Speech recognition commonly uses models obtained by statistical machine learning. For example, an HMM (Hidden Markov Model) is often used as the acoustic model. Also used are a word pronunciation dictionary, for calculating the probability that a phoneme sequence is obtained from a character string generated in the course of speech recognition, and a language model, for calculating the probability that a word sequence of a given language appears.

The basic concept of speech recognition in a conventional HMM-based speech recognition device is described with reference to FIG. 1. Conventionally, a word sequence 30 (word sequence W) is considered to be observed as an observation sequence 36 after being affected by various kinds of noise, and the word sequence with the highest likelihood of yielding the final observation sequence 36 is output as the recognition result. In this process, the probability that the word sequence W is generated is denoted P(W). The probability that an HMM state sequence S (state sequence 34) is generated from the word sequence W via a pronunciation sequence 32, which is an intermediate product, is denoted P(S|W). The probability that the observation X is obtained from the state sequence S is denoted P(X|S).

In speech recognition, given the observation sequence X_{1:T} from the start to time T, the word sequence that maximizes the likelihood of yielding that observation sequence is output as the recognition result. That is, the recognized word sequence ~W is obtained by the following equation (1). Note that the symbol "~", which appears directly above a character in the equations, is written immediately before the character in this specification.
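The equation itself does not survive in this text. From the definitions in the preceding paragraph, equation (1) can be reconstructed as the following maximum a posteriori criterion (a reconstruction, not a verbatim copy of the original):

$$
\tilde{W} = \arg\max_{W} P(W \mid X_{1:T}) \tag{1}
$$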

Transforming the right-hand side of this equation by Bayes' rule gives the following.
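Applying Bayes' rule to P(W | X_{1:T}) gives the form below. This is a reconstruction from the surrounding text, which later refers to a numerator and to the denominator P(X_{1:T}):

$$
\tilde{W} = \arg\max_{W} \frac{P(X_{1:T} \mid W)\, P(W)}{P(X_{1:T})} \tag{2}
$$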

Further, the first term of the numerator of this equation can be obtained with the HMM as follows.
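A reconstruction of equation (3) consistent with the description of its terms is given below. It is written for a single HMM state sequence S_{1:T}; treating the sum or maximum over state sequences as implicit is an assumption made here:

$$
P(X_{1:T} \mid W) = P(X_{1:T} \mid S_{1:T})\, P(S_{1:T} \mid W) \tag{3}
$$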

In this equation, the state sequence S_{1:T} denotes the HMM state sequence S_1, ..., S_T. The first term on the right-hand side of equation (3) is the output probability of the HMM. From equations (1) to (3), the recognized word sequence ~W is obtained by the following equation.
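Substituting the factorization of P(X_{1:T} | W) into the Bayes form of the criterion gives equation (4). As a reconstruction from the terms named in the text (output probability, P(S_{1:T} | W), P(W), and the denominator P(X_{1:T})), it reads:

$$
\tilde{W} = \arg\max_{W} \frac{P(X_{1:T} \mid S_{1:T})\, P(S_{1:T} \mid W)\, P(W)}{P(X_{1:T})} \tag{4}
$$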

In an HMM, the observation x_t at time t depends only on the state s_t. Therefore, the HMM output probability P(X_{1:T}|S_{1:T}) in equation (4) can be calculated by the following equation.
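Because each observation depends only on the state at the same time, the output probability factorizes over frames. Equation (5), reconstructed from that statement, is:

$$
P(X_{1:T} \mid S_{1:T}) = \prod_{t=1}^{T} P(x_t \mid s_t) \tag{5}
$$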

The probability P(x_t|s_t) is calculated by a Gaussian mixture model (GMM).

Of the other terms in equation (4), P(S_{1:T}|W) is calculated as the product of the HMM state transition probabilities and the word pronunciation probabilities, and P(W) is calculated by the language model. The denominator P(X_{1:T}) has the same value for every hypothesis and can therefore be ignored when performing the arg max operation.

Recently, a framework called the DNN-HMM hybrid method, in which the HMM output probabilities are calculated by a deep neural network (DNN) instead of a GMM, has been studied. The DNN-HMM hybrid method achieves higher accuracy than GMM-based acoustic models and has attracted attention. Because the DNN-HMM hybrid method has produced excellent results, methods have also been proposed that replace the DNN with another neural network (NN), such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a long short-term memory network (LSTM). These methods can be expected to further improve speech recognition accuracy.

However, in such an NN-HMM hybrid method, the output of the NN represents the posterior probability P(S_t|X_t), so as it stands it does not fit the conventional HMM framework, which uses the output probability P(X_t|S_t). To solve this problem, Bayes' rule must be applied to the posterior probability P(S_t|X_t) output by the DNN, forcibly transforming the NN output into a form that uses the output probability P(X_t|S_t) so that it fits equation (5). If a speech recognition method that does not need such a transformation can be realized, a further improvement in accuracy can be expected.

Non-Patent Document 1: Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proc. ASRU, 2015, pp. 167-174.
Non-Patent Document 2: D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. ICASSP, 2016, pp. 4945-4949.

Recently, the use of an End-to-End NN as the acoustic model for speech recognition has been proposed (Non-Patent Document 1). An End-to-End NN directly represents the posterior probability P(s|X) of a subword sequence s (a pronunciation sequence, phonetic symbol sequence, phoneme sequence, character sequence, or the like) for an observation (speech feature) sequence X, without going through an HMM or the like. It may therefore be applicable to speech recognition without a forced transformation like that of the DNN-HMM hybrid. End-to-End NNs are described later in connection with the embodiment. Here, in order to describe the problems of the conventional methods, the concept of speech recognition with an End-to-End RNN, the type commonly used among End-to-End NNs, is explained. Note that the present invention is applicable to End-to-End NNs in general and is not necessarily limited to RNNs.

An RNN has a structure that includes not only one-way connections between nodes from the input layer side to the output layer side, but also connections from nodes in a layer on the output side to nodes in the adjacent layer on the input side, connections between nodes within the same layer, self-feedback connections, and the like. Because of this structure, an RNN can represent time-dependent information, a property that ordinary feedforward neural networks lack. Speech is a typical example of time-dependent information. RNNs are therefore considered well suited to acoustic models.

The labels output by an End-to-End RNN are, for example, arbitrary subwords such as phonemes or syllables, characters, or HMM states. When an End-to-End RNN is used as the acoustic model, the NN output need not be forcibly transformed, unlike when an HMM is used, so an improvement in recognition accuracy can be expected.

As described above, an End-to-End RNN learns a direct mapping from the input observation sequence X to the subword sequence s. A representative example of an End-to-End RNN is the model called CTC. Since the observation sequence X is usually much longer than the subword sequence s, CTC adds an empty label φ to the RNN output to absorb the difference in length. That is, a node corresponding to the empty label φ is provided in the output layer. As a result, a frame-level subword sequence c = {c_1, ..., c_T} (including the empty label φ) is obtained at the output of the RNN. This sequence c is converted into a subword sequence s that does not depend on the number of frames by a function called the mapping function Φ. The mapping function Φ deletes the empty label φ from the frame-level subword sequence c and treats a run of repeated labels as a single output, thereby producing the frame-count-independent subword sequence s. Using the mapping function Φ, the probability P(s|X) that the observation sequence X corresponds to the subword sequence s can be formulated as follows.
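The formulation referred to here corresponds to equations (6) and (7). Reconstructed from the definitions of Φ, c, and the per-frame output score y_t^{c_t} used in this description, they can be written as:

$$
P(s \mid X) = \sum_{c \in \Phi^{-1}(s)} P(c \mid X) \tag{6}
$$

$$
P(c \mid X) = \prod_{t=1}^{T} y_t^{c_t} \tag{7}
$$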

Here, y_t^{c_t} is the output score of the RNN for the output label c_t at time t. Φ^{-1} is the inverse of the mapping function Φ. That is, Φ^{-1}(s) denotes the set of all phoneme sequences c that the mapping function Φ can map to the subword sequence s.
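The mapping function Φ and the sum over Φ^{-1}(s) can be illustrated with a minimal Python sketch. The names `collapse` and `subword_posterior`, the use of the character "φ" for the empty label, and the dictionary layout of the frame scores are all assumptions introduced for illustration. The sketch merges repeated labels before deleting empty labels, which is the usual CTC convention and is assumed here. It enumerates every label sequence, so it is only practical for very short inputs and is not how a real system would compute the sum.

```python
from itertools import groupby, product

BLANK = "φ"  # empty label; the symbol is an illustrative assumption


def collapse(frame_labels):
    """Mapping function Φ: merge runs of repeated labels, then delete blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


def subword_posterior(target, frame_scores):
    """P(s|X) by brute force: sum the path probabilities over Φ^-1(s).

    frame_scores[t][label] plays the role of y_t^{c_t}. Every label sequence
    of length T is enumerated, so the cost is exponential in T.
    """
    labels = list(frame_scores[0])
    total = 0.0
    for path in product(labels, repeat=len(frame_scores)):
        if collapse(path) == target:
            prob = 1.0
            for t, label in enumerate(path):
                prob *= frame_scores[t][label]  # product of per-frame scores
            total += prob
    return total
```

For example, `collapse("AAφφBφCCφ")` and `collapse("φAφBBφCφ")` both yield the same subword sequence, and summing `subword_posterior` over every reachable subword sequence gives 1 when the per-frame scores are normalized.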

A feature of End-to-End NNs is that the neural network directly learns the probability P(s|X) that the observation sequence X represents the subword sequence s. As a method other than CTC, Non-Patent Document 2 represents this probability with a model called the Attention-based Recurrent Sequence Generator.

Unlike an HMM, an End-to-End NN directly learns the probability P(s|X) that the observation sequence X represents the subword sequence s, so conventional HMM-based decoding methods cannot be used. Such an NN also has the character of both an acoustic model and a language model. For that reason, decoding was at first attempted using the NN alone, without a language model. However, decoding without an independent language model was found not to give the best results, and recently the mainstream approach is to use a language model in addition to the End-to-End NN. In that case, the question is how to combine the two. Furthermore, an acoustic model based on an End-to-End NN is usually trained in subword units (characters, phonemes, and so on), so the scores it outputs are also in subword units. Since language model scores are at the word level, this too makes it difficult to combine the two.

Conventionally, as a method of combining the two scores, the word sequence ~W has been calculated by simple interpolation of the two scores, as shown in the following equation.
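The interpolation equations (8) and (9) are not preserved in this text. One form consistent with the description (a log-linear interpolation of the acoustic score and the word-level language model score, with the subword sequence restricted to those that the word sequence can produce) is sketched below. The weight symbol λ and the exact placement of the weight are assumptions:

$$
\tilde{W} = \arg\max_{W} \left\{ \log P(s \mid X) + \lambda \log P(W) \right\} \tag{8}
$$

$$
s \in \Psi(W) \tag{9}
$$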

The function Ψ converts the word sequence W into the set of all possible subword sequences s. Non-Patent Document 1 proposes dividing the posterior probability by the prior probability P(c_t) in each frame.

However, there is no theoretical basis for using a score calculated by such interpolation, and sufficiently high recognition performance has not been obtained. For acoustic models that use an NN, speech recognition accuracy needs to be improved further by training the NN on a clear theoretical basis.

An object of the present invention is therefore to provide an acoustic model learning device that can improve speech recognition accuracy for an acoustic model that makes use of the characteristics of an NN.

An acoustic model learning device according to a first embodiment of the present invention trains an acoustic model based on an End-to-End neural network. The acoustic model is used to calculate, given an observation sequence of speech, the probability that the observation sequence corresponds to an arbitrary subword sequence. The learning device is used connected to computer-readable storage means that stores training data, consisting of aligned pairs of an observation sequence of training speech and the correct subword sequence for that training speech, and a word model that stores the frequencies of occurrence of word sequences. The learning device includes first optimization means for optimizing the End-to-End neural network so as to maximize the sum, over the entire training data, of the posterior probability of the correct subword sequence given the observation sequence of the training speech, and second optimization means for further optimizing the End-to-End neural network so as to maximize the expected value of the accuracy of the word sequence hypotheses estimated using the End-to-End neural network and a language model when an observation sequence of evaluation data is given.
Preferably, the second optimization means includes: speech recognition means for generating word sequence hypotheses by performing speech recognition on the observation sequences over the entire training speech, using the End-to-End neural network and the language model; first calculation means for calculating, over the entire training speech, the recognition accuracy of the word sequence constituting each hypothesis, based on the hypothesis and the correct subword sequence of the training data; second calculation means for calculating the expected value by calculating, over the entire training speech, the sum of the products of the posterior probability of each hypothesis, calculated by the language model when the hypothesis was generated, and the recognition accuracy of the word sequence constituting that hypothesis; update means for updating the parameter set of the acoustic model so that the expected value calculated by the second calculation means increases; determination means for executing, in response to completion of the update of the parameter set by the update means, a determination process as to whether a termination condition is satisfied; and control means for selectively executing, in response to the determination by the determination means, either a first process of ending the training of the End-to-End neural network or a second process of controlling the speech recognition means, the first calculation means, the second calculation means, the update means, and the determination means so that the hypothesis generation using the training speech, the calculation of the recognition accuracy, the calculation of the expected value, the update of the parameter set, and the determination process are performed again.
More preferably, the observation sequence is prepared frame by frame from the speech signal representing the training speech, and the first calculation means includes subword match count calculation means for calculating the number of frames in which each subword of the hypothesis word sequence output by the End-to-End neural network matches the corresponding subword of the subword sequence paired with the input observation sequence.
Still more preferably, the determination means includes means for determining that the termination condition is satisfied when the hypothesis generation over the entire training speech by the speech recognition means, the calculation of the recognition accuracy by the first calculation means, and the calculation of the sum by the second calculation means have been performed a predetermined number of times.
The determination means may include means for determining that the termination condition is satisfied in response to the difference between the parameter set defining the End-to-End neural network and its value at the previous iteration becoming equal to or less than a threshold.
A computer program according to a second aspect of the present invention causes a computer to operate as each means of any of the acoustic model learning devices described above.
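The frame-level match count and the expected accuracy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the device's actual implementation: the function names, the representation of a hypothesis as a `(score, frame_labels)` pair, and the normalization of hypothesis scores over the supplied list are all introduced here for illustration.

```python
def frame_match_count(hyp_labels, ref_labels):
    """Count the frames in which the hypothesis label equals the reference label."""
    if len(hyp_labels) != len(ref_labels):
        raise ValueError("hypothesis and reference must cover the same frames")
    return sum(h == r for h, r in zip(hyp_labels, ref_labels))


def expected_accuracy(hypotheses, ref_labels):
    """Sum over hypotheses of (normalized posterior) x (frame-level accuracy).

    hypotheses: list of (score, frame_labels) pairs, where score is an
    unnormalized posterior of the hypothesis (assumed layout).
    """
    total = sum(score for score, _ in hypotheses)
    return sum(
        (score / total) * frame_match_count(labels, ref_labels)
        for score, labels in hypotheses
    )
```

A training loop in the spirit of the second optimization means would update the acoustic model parameters so that this expected value increases, then repeat until the termination condition holds.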

FIG. 1 is a diagram showing the concept of conventional speech recognition.
FIG. 2 is a diagram schematically showing the configuration of an ordinary DNN.
FIG. 3 is a diagram schematically showing the configuration of an RNN and an example of the connections between nodes of the RNN at different times.
FIG. 4 is a diagram showing the concept of speech recognition in one embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of a speech recognition device that employs an NN trained by the method according to one embodiment of the present invention.
FIG. 6 is a schematic block diagram of a device that executes the CTC-AM training method according to one embodiment of the present invention.
FIG. 7 is a flowchart showing the control structure of a program that implements the CTC-AM training method according to one embodiment of the present invention.
FIG. 8 is a flowchart showing the control structure of a program that implements the initial training of the CTC-AM in the method shown in FIG. 7.
FIG. 9 is a flowchart showing the control structure of a program that implements the process of improving the accuracy of the initially trained CTC-AM in the method shown in FIG. 7.
FIG. 10 is a graph showing the effect of repeated training by the method according to one embodiment of the present invention.
FIG. 11 is a graph showing the effect of repeated training by the method according to one embodiment of the present invention.
FIG. 12 is a diagram showing the appearance of a computer that implements the speech recognition device according to one embodiment of the present invention.
FIG. 13 is a block diagram showing the hardware configuration of the computer shown in FIG. 12.

In the following description and drawings, the same parts are given the same reference numerals. Detailed descriptions of them are therefore not repeated.

First, the differences between the DNN and the RNN used in the conventional techniques are described. Referring to FIG. 2, a DNN 70 includes an input layer 72, an output layer 78, and a plurality of hidden layers 74 and 76 provided between the input layer 72 and the output layer 78. Only two hidden layers are shown in this example, but the number of hidden layers is not limited to two. Each layer has a plurality of nodes. In FIG. 2 every layer has the same number of nodes, five, but in practice these numbers usually vary. Adjacent nodes are connected to each other. Data, however, flows in only one direction, from the input layer side to the output layer side. A weight and a bias are assigned to each connection. These weights and biases are learned from training data by error backpropagation.

In the DNN 70, when the speech feature X_t at time t is given to the input layer 72 at time t, a predicted HMM state S_t is output from the output layer 78. For an acoustic model, the number of nodes in the output layer 78 is often designed to match the number of phonemes in the target language. In that case, the output of each node in the output layer indicates the probability that the input speech feature is the phoneme represented by that node. The state prediction values output by the nodes of the output layer 78 therefore sum to 1.

What the DNN shown in FIG. 2 provides is P(S_t|X_t), that is, the probability of the HMM state S_t when the speech feature X_t is observed at time t. In this example, the HMM state S_t corresponds to a phoneme. Comparing this with equation (5) above shows that the output of a DNN cannot be applied (substituted) directly into equation (5). Conventionally, therefore, the DNN output is converted to P(X_t|S_t) using Bayes' rule, as shown below.
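The conversion referred to here is equation (10). Reconstructed from Bayes' rule and the terms P(x_t) and P(s_t) discussed in this description, it reads:

$$
P(x_t \mid s_t) = \frac{P(s_t \mid x_t)\, P(x_t)}{P(s_t)} \tag{10}
$$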

In equation (10), P(x_t) is common to all HMM states and can therefore be ignored in the arg max operation. P(s_t) can be estimated by counting the occurrences of each state in the aligned training data. In short, in the DNN-HMM hybrid method, the recognition score is calculated with the DNN inside the conventional HMM framework by dividing the DNN output P(S_t|X_t) by the probability P(S_t).
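The division of the posterior by a count-based prior can be sketched in Python as follows. The function names and the dictionary-based data layout are assumptions for illustration. The computation is done in the log domain, a common practical choice that the text does not specify.

```python
import math
from collections import Counter


def state_priors(aligned_states):
    """Estimate P(s) by counting each state in frame-aligned training data."""
    counts = Counter(aligned_states)
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}


def scaled_log_likelihood(posteriors, priors):
    """Return log P(s|x) - log P(s) for each state.

    This equals log P(x|s) up to the state-independent term log P(x),
    which the arg max operation can ignore.
    """
    return {s: math.log(p) - math.log(priors[s]) for s, p in posteriors.items()}
```

Dividing by the prior can change which state scores highest: a frequent state with a slightly higher posterior may lose to a rare state once both are scaled.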

An example of the configuration of an End-to-End RNN, by contrast, is shown in FIG. 3. FIG. 3 shows the relationship among the RNN 100(t-1) at time t-1, the RNN 100(t) at time t, and the RNN 100(t+1) at time t+1. In this example, each node in the hidden layer of the RNN 100(t) receives not only the outputs of the nodes of the input layer but also its own output from the RNN 100(t-1). That is, the RNN 100 can generate an output for a time series of input speech features. Furthermore, in CTC, which is one kind of End-to-End RNN, the output layer of the RNN includes, in addition to the nodes corresponding to labels (for example, phonemes), a node corresponding to the empty label φ (shown at the right end in FIG. 3). The number of nodes in the output layer is therefore the number of labels plus one.

An End-to-End RNN such as that shown in FIG. 3 directly models the probability P(s|X) that the speech (speech feature) X corresponds to the pronunciation sequence s. Speech recognition using such an RNN therefore does not depend on an HMM. The output of the RNN is formulated as in equations (6) and (7) above.

To perform highly accurate speech recognition that makes use of the characteristics of an End-to-End RNN, a framework other than the DNN-HMM hybrid method is needed. FIG. 4 shows such a new framework. The present embodiment relates to a device that performs speech recognition according to this framework. In the present embodiment, CTC is adopted as the End-to-End RNN, and the pronunciation sequence is adopted as the subword unit. The decoding method using CTC is improved on the basis of a new framework for speech recognition that makes use of the characteristics of an End-to-End RNN, and the training method of the CTC itself is improved accordingly.

Referring to FIG. 4, in the present embodiment, the RNN is used to obtain, from the observation sequence 36, the probabilities of a plurality of phoneme sequences 110, each of which is a label sequence that includes the empty label φ. This probability is modeled as in equation (7) above. The mapping function Φ is applied to these phoneme sequences 110 to obtain a plurality of pronunciation sequences (subword sequences) 112, which are intermediate products. For example, the label sequences "AAφφBφCCφ" and "φAφBBφCφ" are both mapped to the subword sequence "ABC" by the mapping function Φ. With this mapping function, the probability of the pronunciation sequence s given the observation sequence X is modeled as in equation (6) above. The probabilities of a plurality of word sequences 30 obtained from the pronunciation sequences (subword sequences) 112 are then obtained. This probability is modeled as P(W) by a word-level language model. Finally, the word sequence 30 with the highest probability is output as the speech recognition result. From these relationships, the recognized word sequence ~W for the observation sequence X is obtained by the following equation.
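The equation referred to here is equation (11). As a reconstruction from the framework just described, in which the recognized word sequence maximizes the posterior probability given the observation sequence, it can be written as:

$$
\tilde{W} = \arg\max_{W} P(W \mid X) \tag{11}
$$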

This equation can be transformed and approximated as follows.
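Equation (12) is not preserved in this text. One reading consistent with the description (three expressions, the second summing over all subword sequences s with P(W | s) and the scaled acoustic score, the third replacing the sum by a maximum) is sketched below. The exact form of the first expression is an assumption:

$$
\begin{aligned}
\tilde{W} &= \arg\max_{W} \sum_{s} P(W \mid s, X)\, P(s \mid X) \\
&\approx \arg\max_{W} \sum_{s} P(W \mid s)\, P(s \mid X)^{\alpha} \\
&\approx \arg\max_{W} \max_{s} P(W \mid s)\, P(s \mid X)^{\alpha}
\end{aligned}
\tag{12}
$$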

In equation (12), P(s|X) is the score (posterior probability) of the CTC acoustic model, and α is its scaling factor. The pronunciation sequence s and the observation sequence X must satisfy the constraint of equation (9). The Viterbi algorithm is used for the approximation in equation (12). When the RNN is trained, P(W|s) is calculated over all s according to the second expression of equation (12), but at decoding time it is often approximated as in the third expression.

式(12)中で、P(W|s)は以下の式(13)により計算できる。 In equation (12), P (W | s) can be calculated by the following equation (13).

式(13)のうち、P(s)はサブワード単位の言語モデル確率であり、βはそのスケーリングファクタである。P(s)は従来の言語モデルと同様に計算できる。すなわち、Nグラム言語モデルでも、ニューラルネットワークでも実現できる。ただし、サブワード単位の言語モデルはサブワードコーパスで学習する必要がある。サブワードコーパスは、通常のテキストコーパスに対して単語をサブワードに変換する方法で容易に実現できる。 In equation (13), P (s) is the language model probability in subword units, and β is its scaling factor. P (s) can be calculated in the same way as the conventional language model. That is, it can be realized by either an N-gram language model or a neural network. However, the language model for each subword needs to be learned with the subword corpus. A subword corpus can be easily realized by converting a word into a subword with respect to a normal text corpus.
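As a rough sketch of how a subword corpus might be derived from an ordinary text corpus, the following rewrites each sentence as a flat list of subword units. The function name `to_subword_corpus` and the toy `lexicon` (one pronunciation per word) are hypothetical; the text does not specify the conversion procedure.

```python
def to_subword_corpus(sentences, lexicon):
    """Rewrite each sentence as a flat list of subword units.

    `lexicon` maps a word to its list of subwords (for example phonemes).
    """
    corpus = []
    for sentence in sentences:
        units = []
        for word in sentence.split():
            units.extend(lexicon[word])  # raises KeyError for out-of-lexicon words
        corpus.append(units)
    return corpus


# Toy lexicon, for illustration only.
lexicon = {"the": ["DH", "AH"], "cat": ["K", "AE", "T"]}
print(to_subword_corpus(["the cat"], lexicon))
```

The resulting unit sequences could then be fed to any N-gram or neural language model trainer to obtain P(s).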

式(13)の分子の第１項、P(s|W)は単語‐サブワード変換確率を示す。単語からサブワードへの変換は、多くの場合、１対１変換（例えば単語を各文字に分解すること）である。そうした場合には、P(s|W)は１になり、式(13)は次の式(14)のように簡略化される。 The first term of the numerator in equation (13), P (s | W), indicates the word-subword conversion probability. The word-to-subword conversion is often a one-to-one conversion (eg, breaking a word into letters). In such a case, P (s | W) becomes 1, and Eq. (13) is simplified as in Eq. (14) below.

以上をまとめると、以下のようになる。式(12)のP(W|s)に式(13)の右辺を代入すると以下の式(15)が得られる。この式(15)に従って仮説のスコアを計算し、最もよいスコアの仮説を音声認識結果として選択する。 The above can be summarized as follows. Substituting the right-hand side of equation (13) into P (W | s) of equation (12) gives the following equation (15). The hypothesis score is calculated according to this equation (15), and the hypothesis with the best score is selected as the speech recognition result.
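The equation images referred to in this passage are not present in this text. Based only on what the surrounding prose states, Eqs. (12) to (15) can be reconstructed approximately as follows. This is a reconstruction, not the original typesetting: the constraint of Eq. (9) on s and X is omitted, and the exact placement of the scaling factors in the original may differ.

```latex
% (12): search for the word sequence; exact sum over s, then Viterbi approximation
\tilde{W} = \arg\max_{W} P(W \mid X)
          = \arg\max_{W} \sum_{s} P(W \mid s)\, P(s \mid X)^{\alpha}
          \approx \arg\max_{W} \max_{s} P(W \mid s)\, P(s \mid X)^{\alpha}

% (13): word probability given the subword sequence
P(W \mid s) = \frac{P(s \mid W)\, P(W)}{P(s)^{\beta}}

% (14): special case of a one-to-one word-to-subword conversion, P(s \mid W) = 1
P(W \mid s) = \frac{P(W)}{P(s)^{\beta}}

% (15): (13) substituted into the approximation in (12)
\tilde{W} \approx \arg\max_{W} \max_{s}
    \frac{P(s \mid W)\, P(W)\, P(s \mid X)^{\alpha}}{P(s)^{\beta}}
```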

結局、RNNを用いる従来法では、式(6)〜式(9)に示されるように、RNNの出力する事後確率と言語モデル確率とを内挿して認識スコアを算出している。これに対し、本実施の形態に係る手法では、式(15)に示すように、ある仮説に関する単語‐サブワード変換確率P(s|W)、従来と同様の単語レベルの言語モデルから得られる単語言語モデルスコアP(W)、及びRNNの出力するサブワード事後確率P(s|X)^αの積を、サブワードレベルの言語モデルから得られる確率P(s)^βで割ることにより仮説のスコアを算出する。各仮説についてこのスコアを算出し、最もよいスコアが得られる仮説を音声認識結果として選択する。RNNの出力する事後確率を最大化するという意味で、この方式をmaximum a posteriori（MAP）方式デコーディングと呼ぶ。 In short, the conventional RNN-based method computes the recognition score by interpolating the posterior probability output by the RNN with the language model probability, as shown in Eqs. (6) to (9). In contrast, the method of the present embodiment computes the score of a hypothesis as shown in Eq. (15): the product of the word-to-subword conversion probability P(s|W) of the hypothesis, the word language model score P(W) obtained from a conventional word-level language model, and the subword posterior probability P(s|X)^α output by the RNN is divided by the probability P(s)^β obtained from the subword-level language model. This score is computed for each hypothesis, and the hypothesis with the best score is selected as the speech recognition result. Because it maximizes the posterior probability output by the RNN, this method is called maximum a posteriori (MAP) decoding.
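The hypothesis score described for Eq. (15) can be sketched in the log domain, where the product and quotient become a sum and a difference. The dictionary keys and function names below are hypothetical, and the numeric log-probabilities are made-up values used only to show the selection step.

```python
def map_log_score(hyp, alpha, beta):
    """log of P(s|W) * P(W) * P(s|X)^alpha / P(s)^beta for one hypothesis."""
    return (hyp["log_p_s_given_w"]
            + hyp["log_p_w"]
            + alpha * hyp["log_p_s_given_x"]
            - beta * hyp["log_p_s"])


def best_hypothesis(hyps, alpha=1.0, beta=1.0):
    """Return the hypothesis with the highest MAP score."""
    return max(hyps, key=lambda h: map_log_score(h, alpha, beta))


# Made-up log-probabilities for two competing hypotheses.
hyps = [
    {"words": "a", "log_p_s_given_w": 0.0, "log_p_w": -2.0,
     "log_p_s_given_x": -1.0, "log_p_s": -1.5},
    {"words": "b", "log_p_s_given_w": 0.0, "log_p_w": -1.0,
     "log_p_s_given_x": -2.0, "log_p_s": -0.5},
]
print(best_hypothesis(hyps)["words"])
```

Working with log-probabilities avoids numerical underflow when long utterances make the individual probabilities very small.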

なお、上記式におけるCTC-AMの学習においては、以下の式により表される目標関数F^CTC(θ)（θはCTC-AMを構成する各ノードの入出力の重み行列及びバイアス値を含むパラメータセット）を最大化するようなパラメータセットθを求める。 In training the CTC-AM that appears in the above equations, the parameter set θ is sought that maximizes the objective function F^CTC(θ) given by the following equation, where θ is the parameter set comprising the input/output weight matrices and bias values of the nodes constituting the CTC-AM.

この式において、s_uはu番目の学習音声に対する正解サブワード列、X_uはu番目の学習音声、Pr_θはパラメータセットθのもとでCTC-AMが出力するスコアを表す。この条件でCTC-AMの出力におけるsoftmax関数の活性化関数値は以下の式により計算される。 In this equation, s_u is the correct subword sequence for the u-th training utterance, X_u is the u-th training utterance, and Pr_θ is the score output by the CTC-AM under the parameter set θ. Under this condition, the activation function value of the softmax function at the output of the CTC-AM is calculated by the following equation.

上式は、フォワード‐バックワードアルゴリズムを用いて効率的に計算できることが知られており、NNパラメータセットの誤差逆伝搬法による学習に用いられている。 It is known that the above equation can be computed efficiently with the forward-backward algorithm, and it is used to train the NN parameter set by error backpropagation.
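To make the role of the forward-backward algorithm more concrete, the following is a minimal sketch of the forward half only: it sums the probabilities of all frame-level label sequences that collapse to a given subword sequence. It is the textbook CTC recursion, not code taken from this document; the function name, the integer label encoding with blank index 0, and the use of plain probabilities instead of log-probabilities are all assumptions.

```python
def ctc_sequence_probability(posteriors, target, blank=0):
    """Probability that the frame posteriors produce `target` after CTC collapsing.

    posteriors[t][k] is the probability of label k at frame t.
    target is a list of label ids without blanks.
    """
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_frames, size = len(posteriors), len(ext)

    # Initialization: a path may start in the leading blank or the first label.
    alpha = [0.0] * size
    alpha[0] = posteriors[0][ext[0]]
    if size > 1:
        alpha[1] = posteriors[0][ext[1]]

    for t in range(1, num_frames):
        new = [0.0] * size
        for i in range(size):
            total = alpha[i]                      # stay on the same symbol
            if i >= 1:
                total += alpha[i - 1]             # advance by one symbol
            if i >= 2 and ext[i] != blank and ext[i] != ext[i - 2]:
                total += alpha[i - 2]             # skip the blank between different labels
            new[i] = total * posteriors[t][ext[i]]
        alpha = new

    # A valid path ends in the last label or the trailing blank.
    prob = alpha[size - 1]
    if size > 1:
        prob += alpha[size - 2]
    return prob


# Two frames, labels {0: blank, 1: "A"}: the paths "Aφ", "φA" and "AA" map to "A".
print(ctc_sequence_probability([[0.5, 0.5], [0.5, 0.5]], [1]))  # 0.75
```

A practical implementation would work in the log domain and add the backward pass to obtain the per-frame gradients used in backpropagation.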

ところで、MAP方式デコーディングに関してこのF^CTCを最大化するということは、CTC-AMそれ自体を最適化しているということができる。しかし、実際にはCTC-AMを言語モデルと組み合わせて音声をデコードするので、CTC-AMを最適化したからといって単語認識率が最大化するとは限らない。そこで、本実施の形態では、F^CTCを最大化する学習を行った後、さらに以下の式により示される目標関数F^MBRを最大化するようにCTC-AMの学習を行う。 With respect to MAP decoding, maximizing F^CTC amounts to optimizing the CTC-AM in isolation. In practice, however, speech is decoded by combining the CTC-AM with language models, so optimizing the CTC-AM alone does not necessarily maximize the word recognition rate. In the present embodiment, therefore, after training that maximizes F^CTC, the CTC-AM is further trained to maximize the objective function F^MBR given by the following equation.
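Elsewhere in this description F^MBR is characterized as an expected word recognition accuracy: hypothesis posteriors multiplied by hypothesis accuracies and summed. The sketch below shows that expectation for a single utterance. Normalizing the hypothesis scores over the listed hypotheses is an assumption made here so that the weights sum to one; the function name and the toy numbers are hypothetical.

```python
def expected_accuracy(hypotheses):
    """Posterior-weighted average accuracy over competing hypotheses.

    `hypotheses` is a list of (score, accuracy) pairs, where score is an
    unnormalized posterior of the hypothesis and accuracy is its word accuracy.
    """
    total = sum(score for score, _ in hypotheses)
    if total <= 0.0:
        raise ValueError("hypothesis scores must have a positive sum")
    return sum(score / total * accuracy for score, accuracy in hypotheses)


# Three hypotheses with made-up scores and accuracies.
print(expected_accuracy([(0.6, 1.0), (0.3, 0.5), (0.1, 0.0)]))
```

Summing this quantity over all training utterances gives a value of the kind the MBR stage tries to increase.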

F^MBRをy_t ^t(c)で微分することにより次式(21)を得る。 Differentiating F^MBR with respect to y_t^t(c) gives the following Eq. (21).

この結果、最終層のsoftmax層の活性化関数値a_u ^t(c)に関する誤差信号は以下のように計算される。 As a result, the error signal with respect to the activation function value a_u^t(c) of the final softmax layer is calculated as follows.

式(22)はフォワード・バックワードアルゴリズムを用いて効率的に計算できる。 Eq. (22) can be computed efficiently with the forward-backward algorithm.

図５を参照して、本実施の形態に係る方法により学習したCTC-AMを用いる音声認識装置２８０について説明する。音声認識装置２８０は、入力音声２８２に対する音声認識を行って、音声認識テキスト２８４として出力する機能を持つ。音声認識装置２８０は、入力音声２８２に対してアナログ／デジタル（A/D）変換を行ってデジタル信号として出力するA/D変換回路３００と、A/D変換回路３００の出力するデジタル化された音声データを、所定長及び所定シフト量で一部重複するようなウィンドウを用いてフレーム化するフレーム化処理部３０２と、フレーム化処理部３０２の出力する各フレームに対して所定の信号処理を行うことにより、そのフレームの音声特徴量を抽出し特徴量ベクトルを出力する特徴量抽出部３０４とを含む。各フレーム及び特徴量ベクトルには、入力音声２８２の例えば先頭に対する相対時刻等の情報が付されている。音声特徴量としては、MFCC（Mel-Frequency Cepstrum Coefficient：メル周波数ケプストラム係数）、その一次微分、二次微分、及びパワー等が用いられるが、フィルタバンクの出力をそのまま特徴量として用いても良い。時系列で得られる特徴量ベクトルにより観測系列が構成される。 A voice recognition device 280 using CTC-AM learned by the method according to the present embodiment will be described with reference to FIG. The voice recognition device 280 has a function of performing voice recognition on the input voice 282 and outputting it as voice recognition text 284. The voice recognition device 280 is an A / D conversion circuit 300 that performs analog / digital (A / D) conversion on the input voice 282 and outputs it as a digital signal, and a digitized A / D conversion circuit 300 that outputs the data. A predetermined signal processing is performed on each frame output by the framing processing unit 302 that frames the audio data by using a window that partially overlaps with a predetermined length and a predetermined shift amount, and the framing processing unit 302. This includes a feature amount extraction unit 304 that extracts the voice feature amount of the frame and outputs the feature amount vector. Information such as the relative time to the beginning of the input voice 282 is attached to each frame and the feature amount vector. As the audio feature quantity, MFCC (Mel-Frequency Cepstrum Coefficient), its first-order differential, second-order differential, power, etc. are used, but the output of the filter bank may be used as it is as the feature quantity. The observation series is composed of the feature vectors obtained in time series.
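The framing step described above (windows of a fixed length, advanced by a fixed shift so that neighbouring frames overlap) can be sketched as follows. The function name, the choice to drop a trailing partial frame, and the use of the start sample index as the relative time stamp are assumptions for illustration.

```python
def frame_signal(samples, frame_len, shift):
    """Split a sample sequence into overlapping frames.

    Returns a list of (start_index, frame) pairs; start_index plays the role
    of the time stamp relative to the beginning of the input.
    """
    if frame_len <= 0 or shift <= 0:
        raise ValueError("frame_len and shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append((start, samples[start:start + frame_len]))
        start += shift
    return frames


# Ten samples, frames of 4 samples shifted by 2: consecutive frames share 2 samples.
for start, frame in frame_signal(list(range(10)), frame_len=4, shift=2):
    print(start, frame)
```

Each frame would then be passed to the feature extraction stage to produce one feature vector of the observation sequence.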

音声認識装置２８０はさらに、特徴量抽出部３０４が出力する特徴量ベクトルを一時記憶するための特徴量記憶部３０６と、特徴量記憶部３０６に記憶された特徴量ベクトルを入力として、各時刻における各フレームがある音素に対応する事後確率を音素ごとに示すベクトルを出力する、CTCに基づくEnd-to-End型RNN（CTC-AM）からなる音響モデル３０８と、音響モデル３０８の出力するベクトルを用いて、入力音声２８２に対応する音声認識テキスト２８４として最もスコア（確率）の高い単語列を出力するためのデコーダ３１０とを含む。音響モデル３０８が出力するベクトルの要素は、そのフレームが各音素である確率を音素ごとに示す値である。時系列で得られるこのベクトルから、フレームごとに各音素を選択して事後確率付で連結し、各音素を対応するラベルで表すことにより、ラベル列候補がラティス形式で得られる。このラベル列候補には空ラベルφも含まれることがある。各ラベル列候補の事後確率は、そのラベル列候補を構成するラティスの各パス上の音素の事後確率から算出できる。 Further, the voice recognition device 280 receives the feature amount storage unit 306 for temporarily storing the feature amount vector output by the feature amount extraction unit 304 and the feature amount vector stored in the feature amount storage unit 306 as inputs at each time. An acoustic model 308 consisting of an End-to-End type RNN (CTC-AM) based on CTC, which outputs a vector showing the posterior probability corresponding to a certain phoneme in each frame, and a vector output by the acoustic model 308. It includes a decoder 310 for outputting a word string having the highest score (probability) as a speech recognition text 284 corresponding to an input speech 282. The vector element output by the acoustic model 308 is a value indicating the probability that the frame is each phoneme for each phoneme. Label string candidates can be obtained in lattice format by selecting each phoneme for each frame from this vector obtained in time series, connecting them with posterior probabilities, and representing each phoneme with a corresponding label. This label column candidate may also include an empty label φ. The posterior probabilities of each label string candidate can be calculated from the posterior probabilities of phonemes on each path of the lattice that make up the label string candidate.

デコーダ３１０は、音響モデルにより算出されたラベル列候補の事後確率を用いて、入力された観測系列が表しうる複数の仮説を、それらの確率とともに算出して認識スコア付の仮説として出力し、認識スコアに基づき、最もスコア（確率）の高い仮説を音声認識テキスト２８４として出力する。 The decoder 310 uses the posterior probabilities of the label string candidates calculated by the acoustic model to calculate a plurality of hypotheses that can be represented by the input observation series together with those probabilities, outputs them as hypotheses with a recognition score, and recognizes them. Based on the score, the hypothesis with the highest score (probability) is output as voice recognition text 284.

本実施の形態に係る音響モデル３０８を構成するRNNの入力層のノードの数は、入力ベクトル（観測ベクトル）の要素の数と一致する。RNNの出力層のノードの数は、対象となる言語のサブワードの数に１を加算したものと一致する。すなわち、出力層のノードは、HMMによる音響モデルの各サブワード（例えば音素）と、空ラベルφとを表す。出力層の各ノードには、ある時刻で入力された音声が、そのノードの表すサブワード（空ラベルを含む）である確率が出力される。したがって音響モデル３０８の出力は、その時刻での入力音声が、各ノードの表すサブワードである確率を要素とするベクトルである。このベクトルの要素の値を合計すると１になる。 The number of nodes in the input layer of the RNN constituting the acoustic model 308 according to the present embodiment matches the number of elements of the input vector (observation vector). The number of nodes in the output layer of the RNN matches the number of subwords in the target language plus one. That is, the nodes of the output layer represent each subword (for example, a phoneme) of the acoustic model by HMM and the empty label φ. To each node of the output layer, the probability that the voice input at a certain time is a subword (including an empty label) represented by that node is output. Therefore, the output of the acoustic model 308 is a vector whose element is the probability that the input voice at that time is a subword represented by each node. The sum of the values of the elements of this vector is 1.
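The output layer described above has one node per subword plus one node for the blank label, and its values sum to 1. A minimal sketch of such a softmax output follows; the three-phoneme inventory and the activation values are made up for illustration.

```python
import math


def softmax(activations):
    """Convert output-layer activations into probabilities that sum to 1."""
    peak = max(activations)                       # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in activations]
    norm = sum(exps)
    return [e / norm for e in exps]


subwords = ["a", "k", "t"]                        # toy subword inventory
labels = subwords + ["φ"]                         # one extra output node for the blank
probs = softmax([1.2, -0.3, 0.4, 2.0])            # made-up activations for one frame
print(dict(zip(labels, probs)))
print(sum(probs))
```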

デコーダ３１０は、音響モデル３０８が出力するベクトルの各要素について、単語列Wの候補の確率計算をし、確率の低い枝については適宜枝刈りを行いながらラティスを生成して、仮説と確率計算を含めた認識スコアの計算をする。デコーダ３１０が、最終的に得られる単語列の中で最も認識スコアが高い（生起確率の高い）単語列を音声認識テキスト２８４として出力する。この際、デコーダ３１０は音響モデル３０８の出力を直接使いながら認識スコアを計算する。従来のDNN-HMMフレームワークのようにHMMの出力形式にあわせてRNNの出力を変換することが不要であり、認識の効率を高めることができる。また、End-to-End型RNNから得られた事後確率P(s|X)と、確率P(W|s)とを組み合わせて単語事後確率P(W|X)を算出することで、単語事後確率P(W|X)が最大となる仮説を探索する。End-to-end型RNNを用いる従来の方式のように理論的根拠のない内挿スコアを用いるものと異なり、理論的にも認識の精度を高めることが可能となる。またCTC-AMの学習方法として前述したように言語モデルと組み合わせて音声認識を行ったときに最も誤差が少なくなるように（F^MBRが最大となるように）パラメータセットを最適化する方法を採用している。したがって、F^CTCを最大化するような方式と比較して、最終的な認識精度をさらに高めることができる。 The decoder 310 calculates the probability of the candidate of the word string W for each element of the vector output by the acoustic model 308, and generates a lattice while appropriately pruning the branch with a low probability, and calculates the hypothesis and the probability. Calculate the recognition score including. The decoder 310 outputs the word string having the highest recognition score (high probability of occurrence) among the finally obtained word strings as the speech recognition text 284. At this time, the decoder 310 calculates the recognition score while directly using the output of the acoustic model 308. Unlike the conventional DNN-HMM framework, it is not necessary to convert the RNN output according to the HMM output format, and recognition efficiency can be improved. In addition, the word posterior probability P (W | X) is calculated by combining the posterior probability P (s | X) obtained from the End-to-End type RNN and the probability P (W | s). Search for the hypothesis that maximizes the posterior probability P (W | X). Unlike the conventional method that uses an end-to-end type RNN that uses an interpolated score that has no theoretical basis, it is possible to improve the recognition accuracy theoretically. 
In addition, as described above, the CTC-AM is trained by optimizing the parameter set so that the error is minimized (that is, so that F^MBR is maximized) when speech recognition is performed in combination with the language models. The final recognition accuracy can therefore be improved further compared with training that only maximizes F^CTC.

図６を参照して、本発明に係るCTC-AM３６４の学習を行うための学習システム３５０について説明する。学習システム３５０は、CTC-AM３６４の学習のためのデータを記憶する学習データ記憶部３６０と、学習データ記憶部３６０に記憶された学習データを用い、学習音声の観測系列が与えられたときの、学習データの正解サブワード列の事後確率の、学習データの全体に亘る和である式(16)に示すF^CTCを最大化するようにCTC-AM３６４の学習（最適化）を行うための学習処理部３６２と、学習処理部３６２による学習が済んだCTC-AM３６４に対し、学習データ記憶部３６０に記憶された学習データを用い、学習音声の観測系列が与えられたときに、CTC-AM３６４と言語モデルとを用いて推定した単語列の仮説の事後確率と、当該単語列の仮説を構成する単語の認識精度との積の、学習データ全体に亘る和からなる単語の認識精度の期待値である式(18)に示したF^MBRを最大化することにより、CTC-AM３６４をさらに最適化するよう、上記したMBR学習を行うためのMBR学習処理部３６６と、MBR学習処理部３６６がCTC-AM３６４による学習を行う際に参照する単語言語モデル３６８、音素言語モデル３７０、及び単語発音辞書３７２とを含む。 A learning system 350 for training the CTC-AM364 according to the present invention will be described with reference to FIG. 6. The learning system 350 includes a learning data storage unit 360 that stores data for training the CTC-AM364. It also includes a learning processing unit 362 that uses the learning data stored in the learning data storage unit 360 to train (optimize) the CTC-AM364 so as to maximize F^CTC of Eq. (16), which is the sum, over the entire learning data, of the posterior probability of the correct subword sequence given the observation sequence of the training speech. It further includes an MBR learning processing unit 366 that performs the MBR training described above on the CTC-AM364 already trained by the learning processing unit 362, again using the learning data stored in the learning data storage unit 360, so as to optimize the CTC-AM364 further by maximizing F^MBR of Eq. (18). F^MBR is the expected value of the word recognition accuracy: the sum, over the entire learning data, of the product of the posterior probability of each word-sequence hypothesis, estimated with the CTC-AM364 and the language models for the observation sequence of the training speech, and the recognition accuracy of the words constituting that hypothesis. Finally, the system includes a word language model 368, a phoneme language model 370, and a word pronunciation dictionary 372, which the MBR learning processing unit 366 refers to when training the CTC-AM364.

学習システム３５０はさらに、CTC-AM３６４による音声認識による仮説の精度を評価するための評価データを記憶する評価データ記憶部３７６と、MBR学習処理部３６６によるCTC-AM３６４の学習処理が１回終了するごとに、評価データ記憶部３７６に記憶された評価データ、単語言語モデル３６８、音素言語モデル３７０、及び単語発音辞書３７２を用いて、CTC-AM３６４を用いて音声認識を行い、その仮説に基づいて、仮説を構成する単語に対する認識精度と、仮説生成の際の言語モデルにより算出された仮説の事後確率とを算出し、さらに学習音声全体に亘る、当該仮説を構成する単語の認識精度との積の和を算出することにより音声認識精度の期待値である目標関数F^MBRの値を評価するための精度評価部３７４と、精度評価部３７４により評価された精度の期待値に基づいて、MBR学習処理部３６６によるMBR学習の終了条件が充足されたか否かを判定し、その結果にしたがってMBR学習処理部３６６を制御するための学習・評価制御部３７８とを含む。 The learning system 350 further includes an evaluation data storage unit 376 that stores evaluation data for evaluating the accuracy of hypotheses produced by speech recognition with the CTC-AM364. It also includes an accuracy evaluation unit 374. Each time the MBR learning processing unit 366 completes one pass of training of the CTC-AM364, the accuracy evaluation unit 374 performs speech recognition with the CTC-AM364 using the evaluation data stored in the evaluation data storage unit 376, the word language model 368, the phoneme language model 370, and the word pronunciation dictionary 372. From the resulting hypotheses it computes the recognition accuracy of the words constituting each hypothesis and the posterior probability of the hypothesis calculated with the language models when the hypothesis was generated, and it then evaluates the objective function F^MBR, the expected value of the speech recognition accuracy, by summing the product of these two quantities over the entire training speech. Finally, the system includes a learning/evaluation control unit 378 that determines, from the expected accuracy evaluated by the accuracy evaluation unit 374, whether the termination condition of the MBR training by the MBR learning processing unit 366 is satisfied, and controls the MBR learning processing unit 366 according to the result.

図７に、学習システム３５０によるCTC-AM３６４の学習を実現するプログラムの制御構造をフローチャート形式で示す。図７を参照して、このプログラムは、式(17)に基づいて、学習データ記憶部３６０に記憶された学習データを用いてF^CTCの値を最大化するように（F^CTCの値が増加するように）CTC-AM３６４のパラメータセットを更新することによる学習を行うステップ４００と、ステップ４００で学習が終了したCTC-AM３６４の精度を評価するステップ４０２と、MBR学習の終了判定のために、直前に評価されたCTC-AM３６４の精度を図示しないメモリ等の記憶装置に記憶するステップ４０４と、CTC-AM３６４に対して式(18)に示す目標関数F^MBRの値を最大化するよう（F^MBRの値が増加するよう）、CTC-AM３６４のパラメータセットを更新することによりMBR学習を行うステップ４０６と、評価データを用いて、ステップ４０６によりMBR学習が終了したCTC-AM３６４の精度を評価するステップ４０８と、ステップ４０８で得られた評価結果をステップ４０４で記憶された前回の評価値と比較し、その差が所定のしきい値以下か否かに応答してCTC-AM３６４の学習を終了する処理と、制御をステップ４０４に戻してMBR学習を繰り返す処理とを選択的に実行するステップ４１０とを含む。すなわち、本実施の形態では、MBR学習の結果得られたCTC-AM３６４による音声認識精度が、前回の音声認識精度からわずかしか向上しなかったときに学習を終了する。もちろん学習の終了条件はこれに限らない。例えば所定回数だけMBR学習が終了した時点で学習を終了させるようにしても良い。 FIG. 7 shows, in flowchart form, the control structure of a program that implements the training of the CTC-AM364 by the learning system 350. Referring to FIG. 7, the program includes: step 400 of training the CTC-AM364 by updating its parameter set, based on Eq. (17) and using the learning data stored in the learning data storage unit 360, so as to maximize (increase) the value of F^CTC; step 402 of evaluating the accuracy of the CTC-AM364 trained in step 400; step 404 of storing the most recently evaluated accuracy of the CTC-AM364 in a storage device such as a memory (not shown) for use in deciding when to end MBR training; step 406 of performing MBR training by updating the parameter set of the CTC-AM364 so as to maximize (increase) the value of the objective function F^MBR of Eq. (18); step 408 of evaluating, with the evaluation data, the accuracy of the CTC-AM364 after the MBR training of step 406; and step 410 of comparing the evaluation result of step 408 with the previous evaluation value stored in step 404 and, depending on whether the difference is at or below a predetermined threshold, either ending the training of the CTC-AM364 or returning control to step 404 to repeat the MBR training. In other words, in the present embodiment training ends when the speech recognition accuracy of the CTC-AM364 after MBR training has improved only slightly over the previous accuracy. The termination condition is of course not limited to this; for example, training may end once MBR training has been performed a predetermined number of times.
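The control flow of FIG. 7 can be summarized in a short sketch. The three callbacks (`ctc_train`, `mbr_epoch`, `evaluate`) are hypothetical stand-ins for the learning processing unit, the MBR learning processing unit, and the accuracy evaluation; the `max_epochs` guard is an added assumption corresponding to the alternative termination condition mentioned in the text.

```python
def train_two_stage(ctc_train, mbr_epoch, evaluate, threshold, max_epochs=100):
    """Initial CTC training followed by MBR training until accuracy stops changing."""
    ctc_train()                                   # step 400: maximize F^CTC
    previous = evaluate()                         # step 402: evaluate the CTC-AM
    for _ in range(max_epochs):
        # step 404: `previous` holds the most recently stored accuracy
        mbr_epoch()                               # step 406: one pass of MBR training
        current = evaluate()                      # step 408: evaluate again
        if abs(current - previous) <= threshold:  # step 410: change small enough, stop
            return current
        previous = current
    return previous


# Toy run: the evaluation returns a fixed, made-up sequence of accuracies.
scores = iter([0.80, 0.85, 0.88, 0.881])
final = train_two_stage(lambda: None, lambda: None, lambda: next(scores), threshold=0.005)
print(final)
```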

図８に、図６のステップ４００で実行されるCTC-AMの初期化を行うプログラムの制御構造をフローチャート形式で示す。図８を参照して、このプログラムは、CTC-AM３６４を初期化するステップ４４０を含む。このステップでは、例えばCTC-AM３６４の各パラメータを、正規分布に従った乱数で初期化する。 FIG. 8 shows the control structure of the program for initializing the CTC-AM executed in step 400 of FIG. 6 in the form of a flowchart. With reference to FIG. 8, this program includes step 440 to initialize CTC-AM364. In this step, for example, each parameter of CTC-AM364 is initialized with a random number according to a normal distribution.
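Step 440 can be illustrated by a small sketch that fills each parameter matrix with normally distributed random numbers. The parameter names, shapes, zero mean, standard deviation of 0.1, and fixed seed are all assumptions; the text only states that a normal distribution is used.

```python
import random


def init_parameters(shapes, std=0.1, seed=0):
    """Fill every parameter matrix with zero-mean Gaussian random numbers.

    `shapes` maps a parameter name to (rows, cols).
    """
    rng = random.Random(seed)                     # fixed seed for reproducibility
    return {
        name: [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]
        for name, (rows, cols) in shapes.items()
    }


# Hypothetical parameter names and sizes.
params = init_parameters({"W_in": (2, 3), "b": (1, 3)})
print(len(params["W_in"]), len(params["W_in"][0]))
```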

学習データは、複数のバッチに分割されている。以下の処理では、バッチごとにCTC-AM３６４の学習を行う。すなわち、このプログラムはさらに、全てのバッチについて、処理４４３を実行するステップ４４２と、ステップ４４２が終了した後に、学習後のCTC-AM３６４の評価を行うステップ４４８と、ステップ４４８での評価結果が終了条件を充足しているか否かを判定するステップ４５０とを含む。ステップ４５０での判定が肯定ならこのプログラムの実行は終了する。さもなければ制御はステップ４４２に戻る。 The training data is divided into a plurality of batches. In the following processing, learning of CTC-AM364 is performed for each batch. That is, this program further completes step 442, which executes the process 443 for all batches, step 448, which evaluates the CTC-AM364 after learning after step 442 is completed, and the evaluation result in step 448. Includes step 450 to determine if the condition is satisfied. If the determination in step 450 is affirmative, the execution of this program ends. Otherwise control returns to step 442.

処理４４３は、バッチ内の各文について処理４４６を実行するステップ４４４を含む。 Process 443 includes step 444 that executes process 446 for each statement in the batch.

処理４４６は、まずCTC-AM３６４を用いる音声認識装置にその文の音声データを入力して音素列を推定するステップ４６０と、ステップ４６０で推定された音素列と学習音声に付された音素ラベル列とを比較して誤差を算出するステップ４６２と、ステップ４６２で算出された誤差を用いて、式(18)に示す目標関数F^MBRの値が大きくなるよう、誤差逆伝播方式によりCTC-AM３６４のパラメータセットを修正するステップ４６４とを含む。 Process 446 includes step 460 of inputting the speech data of the sentence to a speech recognition device that uses the CTC-AM364 and estimating a phoneme sequence, step 462 of computing an error by comparing the phoneme sequence estimated in step 460 with the phoneme label sequence attached to the training speech, and step 464 of using the error computed in step 462 to modify the parameter set of the CTC-AM364 by error backpropagation so that the value of the objective function F^MBR of Eq. (18) increases.

［動作］ [Operation]
上記した学習システム３５０によるCTC-AM３６４の学習は以下のように行われる。まず、学習音声とその書き起こしとの音素列である正解サブワード列を含む学習データが学習データ記憶部３６０に記憶される。また、同様に、音声とその書き起こしとを含む評価データが評価データ記憶部３７６に記憶される。単語言語モデル３６８、音素言語モデル３７０及び単語発音辞書３７２については、既に存在するものを用いても良いし、学習データ記憶部３６０から作成するようにしてもよい。学習データ記憶部３６０に記憶された学習データはいくつかのバッチに分割される。 The learning system 350 described above trains the CTC-AM364 as follows. First, learning data comprising the training speech and the correct subword sequences, that is, the phoneme sequences of its transcriptions, is stored in the learning data storage unit 360. Likewise, evaluation data comprising speech and its transcriptions is stored in the evaluation data storage unit 376. The word language model 368, the phoneme language model 370, and the word pronunciation dictionary 372 may be existing ones or may be created from the data in the learning data storage unit 360. The learning data stored in the learning data storage unit 360 is divided into several batches.

まず学習処理部３６２が学習データ記憶部３６０に記憶された学習データによりCTC-AM３６４の学習を行う（図７のステップ４００）。具体的には、図８を参照して、最初にCTC-AM３６４の各パラメータを、正規分布に従った乱数で初期化する。続いて、各バッチに対して以下の処理を行う（図８のステップ４４２）。 First, the learning processing unit 362 learns the CTC-AM364 using the learning data stored in the learning data storage unit 360 (step 400 in FIG. 7). Specifically, referring to FIG. 8, first, each parameter of CTC-AM364 is initialized with a random number according to a normal distribution. Subsequently, the following processing is performed for each batch (step 442 in FIG. 8).

まず、処理中のバッチ中のある文の音声について、CTC-AM３６４による音声認識でその音素ラベル列の推定を行う（ステップ４６０）。続いて、その推定結果とその音声の書き起こしとを用いて誤差を算出する（ステップ４６２）。さらに、この誤差を用いて目標関数F^CTCの値が大きくなるようにCTC-AM３６４のパラメータセットを修正する（ステップ４６４）。 First, for the speech of a sentence in the batch being processed, the phoneme label sequence is estimated by speech recognition with the CTC-AM364 (step 460). Next, the error is computed from the estimation result and the transcription of that speech (step 462). This error is then used to modify the parameter set of the CTC-AM364 so that the value of the objective function F^CTC increases (step 464).

以上の処理４４６を、処理中のバッチ中の全ての文について実行する。あるバッチに対する処理が終わると、次のバッチに対して同じ処理を繰返す。こうして、学習データの全てのバッチについてステップ４４４を終了すると、ステップ４４８でCTC-AM３６４の評価を行う（これを１エポックという）。この評価は、図６に示す精度評価部３７４ではなく、学習処理部３６２が行うもので、図示しない評価データを学習処理部３６２によって音声認識した結果の誤差を評価データ全体にわたり総合してその精度を計算することにより得る。本実施の形態では、この精度と、前回の処理で得られた精度との差がしきい値以上であれば、再度、学習データ全体を使用してCTC-AM３６４に対する同じ学習処理を繰返す。精度の差がしきい値未満になったところでCTC-AM３６４の初期学習を終了する。 The above processing 446 is executed for all the statements in the batch being processed. When the processing for one batch is completed, the same processing is repeated for the next batch. When step 444 is completed for all batches of training data in this way, CTC-AM364 is evaluated in step 448 (this is called one epoch). This evaluation is performed not by the accuracy evaluation unit 374 shown in FIG. 6 but by the learning processing unit 362, and the error of the result of voice recognition of the evaluation data (not shown) by the learning processing unit 362 is integrated over the entire evaluation data to obtain the accuracy. Is obtained by calculating. In the present embodiment, if the difference between this accuracy and the accuracy obtained in the previous process is equal to or greater than the threshold value, the same training process for CTC-AM364 is repeated again using the entire training data. The initial learning of CTC-AM364 ends when the difference in accuracy becomes less than the threshold value.
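The epoch loop of the initial training (steps 442 to 450) can be sketched as follows. The callbacks `update` and `evaluate` are hypothetical placeholders for process 446 and the evaluation of step 448, and the `max_epochs` guard is an added assumption.

```python
def initial_training(batches, update, evaluate, threshold, max_epochs=100):
    """Repeat full passes over the data until accuracy changes by less than threshold.

    Returns the number of epochs that were run.
    """
    previous = None
    for epoch in range(1, max_epochs + 1):
        for batch in batches:                     # step 442: every batch
            for utterance in batch:               # step 444: every sentence in the batch
                update(utterance)                 # process 446: estimate, compare, update
        accuracy = evaluate()                     # step 448: evaluate after one epoch
        # step 450: stop once the change from the previous epoch is below the threshold
        if previous is not None and abs(accuracy - previous) < threshold:
            return epoch
        previous = accuracy
    return max_epochs


# Toy run with made-up accuracies after each epoch.
accuracies = iter([0.5, 0.7, 0.705])
seen = []
epochs = initial_training([[1, 2], [3]], seen.append, lambda: next(accuracies), threshold=0.01)
print(epochs, len(seen))
```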

CTC-AM３６４の初期学習が終了すると、MBR学習処理部３６６がCTC-AM３６４に対するMBR学習を行う（図７のステップ４０６）。本実施の形態では、この学習にも学習データ記憶部３６０に記憶された学習データを用いる。 When the initial learning of CTC-AM364 is completed, the MBR learning processing unit 366 performs MBR learning for CTC-AM364 (step 406 in FIG. 7). In the present embodiment, the learning data stored in the learning data storage unit 360 is also used for this learning.

具体的には、図９を参照して、学習データ記憶部３６０に含まれる各学習音声について、処理４８２を実行する（ステップ４８０）。処理４８２では、CTC-AM３６４を音響モデルとし、単語言語モデル３６８、音素言語モデル３７０、及び単語発音辞書３７２を用いて処理対象の音声データに対する音声認識を行い、音声認識仮説からなるラティスを作成する（ステップ５１０）。このラティス内で、前述した式(19)にしたがって誤差計算を行う（ステップ５１２）。この誤差を用い、CTC-AM３６４に対し、目標関数F^MBRの値が大きくなるようにCTC-AM３６４のパラメータセットを誤差逆伝播法により修正する（ステップ５１４）。この処理を全ての音声データに対して実行する（これも、学習処理部３６２による処理と同様、１エポックという。）。１エポックが終了すると、ステップ４８４でCTC-AM３６４の精度の評価を行う。この評価は図６の精度評価部３７４が評価データ記憶部３７６に記憶された評価データと、単語言語モデル３６８、音素言語モデル３７０、及び単語発音辞書３７２を用いて行う。CTC-AM３６４の評価自体は学習処理部３６２が行うものと同様である。 Specifically, referring to FIG. 9, process 482 is executed for each training utterance held in the learning data storage unit 360 (step 480). In process 482, speech recognition is performed on the speech data being processed, with the CTC-AM364 as the acoustic model and using the word language model 368, the phoneme language model 370, and the word pronunciation dictionary 372, and a lattice of speech recognition hypotheses is created (step 510). The error is computed within this lattice according to Eq. (19) above (step 512). Using this error, the parameter set of the CTC-AM364 is modified by error backpropagation so that the value of the objective function F^MBR increases (step 514). This processing is executed for all of the speech data (as with the processing by the learning processing unit 362, one such pass is called one epoch). When one epoch is complete, the accuracy of the CTC-AM364 is evaluated in step 484. The accuracy evaluation unit 374 of FIG. 6 performs this evaluation using the evaluation data stored in the evaluation data storage unit 376, the word language model 368, the phoneme language model 370, and the word pronunciation dictionary 372. The evaluation of the CTC-AM364 itself is the same as that performed by the learning processing unit 362.

続いてステップ４８６でMBR学習の終了条件が充足されているか否かが（図６の学習・評価制御部３７８により）判定される。具体的には、ステップ４８４で評価された精度と、前回の精度との差がしきい値未満か否かがステップ４８６において判定される。判定が肯定であればCTC-AM３６４に対するMBR学習は終了である。判定が否定であれば、すなわち今回の精度と前回の精度との差がしきい値以上であれば、制御はステップ４８０に戻り、もう一度、学習データ全体を用いてMBR学習処理部３６６によるMBR学習がCTC-AM３６４に対して実行される。 Subsequently, in step 486, it is determined whether or not the MBR learning end condition is satisfied (by the learning / evaluation control unit 378 of FIG. 6). Specifically, it is determined in step 486 whether or not the difference between the accuracy evaluated in step 484 and the previous accuracy is less than the threshold value. If the judgment is affirmative, MBR learning for CTC-AM364 is complete. If the judgment is negative, that is, if the difference between the current accuracy and the previous accuracy is equal to or greater than the threshold value, the control returns to step 480, and again, MBR learning by the MBR learning processing unit 366 using the entire training data. Is executed for CTC-AM364.

このようにして学習が終わったCTC-AM３６４を用いて音声認識を行う場合には、図５の音響モデル３０８にこのCTC-AM３６４を用いるようにすればよい。 When voice recognition is performed using the CTC-AM364 that has been learned in this way, the CTC-AM364 may be used for the acoustic model 308 of FIG.

［実験結果］ [Experimental Results]
図１０及び図１１に、上記した本発明の一実施例による音声認識精度と、従来の内挿方式による音声認識精度との、MBR学習の繰返しに伴う変化に関する実験結果を示す。 FIGS. 10 and 11 show experimental results on how the speech recognition accuracy of the example of the present invention described above, and that of the conventional interpolation method, change as MBR training is repeated.

実験では、学習コーパスとしてLDC93S6B及びLDC94S13として知られるウォール・ストリート・ジャーナル（WSJ）コーパスを用いた。学習音声は７７．５時間分、検証データは３．８時間分であった。CTC-AMとしては、音素に基づく双方向LSTM（BLSTM）からなる、４層の隠れ層を持つものを用いた。各隠れ層は３２０ノードを持ち、平均及び分散がともに正規化された１２０次元のフィルタバンク特徴量（４０次元のフィルタバンク特徴量＋Δ＋ΔΔ）により学習した。初期学習は学習率＝0.00004及びモーメンタムパラメータ＝０．９５で行った。CTC-BLSTM-AMの学習後、この音響モデルに基づいてラティスを生成した。このとき、学習データ内の書き起こしデータを用いてスケーリングファクタα＝１で学習した１グラム単語言語モデルを用いた。また、MAP方式によるラティスを生成する際には、学習音声の書き起こしを音素に変換したものにより学習したバイグラム音素言語モデルを、β＝0.5として用いた（式(13)(14)(15)参照）。MBR学習は学習率＝0.000001及びモーメンタムパラメータ＝0.9に固定して５エポック行った。 In the experiment, the Wall Street Journal (WSJ) corpus known as LDC93S6B and LDC94S13 was used as the learning corpus. The learning voice was for 77.5 hours, and the verification data was for 3.8 hours. As the CTC-AM, a phoneme-based bidirectional LSTM (BLSTM) having four hidden layers was used. Each hidden layer had 320 nodes and was trained by 120-dimensional filter bank features (40-dimensional filter bank features + Δ + ΔΔ) in which both the mean and the variance were normalized. The initial learning was performed with a learning rate of 0.00004 and a momentum parameter of 0.95. After training CTC-BLSTM-AM, a lattice was generated based on this acoustic model. At this time, a 1-gram word language model trained with a scaling factor α = 1 using the transcribed data in the training data was used. In addition, when generating the lattice by the MAP method, the bigram phoneme language model learned by converting the transcription of the learned speech into phonemes was used with β = 0.5 (Equations (13) (14) (15). reference). MBR learning was performed with 5 epochs fixed at a learning rate of 0.000001 and a momentum parameter of 0.9.
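The 120-dimensional input mentioned above is the 40-dimensional filter bank output with its first (Δ) and second (ΔΔ) time derivatives appended. The sketch below shows the stacking; the particular delta formula (a symmetric difference with the edge frames repeated) is an assumption, since the text does not state how the derivatives were computed.

```python
def deltas(frames):
    """Symmetric difference (x[t+1] - x[t-1]) / 2, repeating the edge frames."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        before = frames[max(t - 1, 0)]
        after = frames[min(t + 1, last)]
        out.append([(a - b) / 2.0 for a, b in zip(after, before)])
    return out


def add_deltas(frames):
    """Append delta and delta-delta features to each static feature vector."""
    first = deltas(frames)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(frames, first, second)]


# Five frames of made-up 40-dimensional features become 120-dimensional vectors.
static = [[float(t)] * 40 for t in range(5)]
stacked = add_deltas(static)
print(len(stacked[0]))  # 120
```

Mean and variance normalization, also mentioned in the text, would be applied to these vectors before training.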

評価では、単語言語モデルとしてはWSJ標準のプルーンドトライグラム言語モデル（pruned trigram LM）を用いた。MAP方式によるデコーディングにおいては、バイグラム音素言語モデルを用いた。デコード時、パラメータ（スケーリングファクタα及びβ、並びに単語挿入ペナルティ）はWSJコーパス中の「dev93」セットにより調整し、最もよいパラメータをWSJコーパス中の「eval92」セットのデコードに用いた。 In the evaluation, the WSJ standard pruned trigram language model (pruned trigram LM) was used as the word language model. In decoding by the MAP method, a bigram phoneme language model was used. At the time of decoding, the parameters (scaling factors α and β, as well as the word insertion penalty) were adjusted by the “dev93” set in the WSJ corpus, and the best parameters were used to decode the “eval92” set in the WSJ corpus.

図１０及び図１１において、横軸はMBR学習の繰返し回数を示し、縦軸は各繰返し終了時のCTC-AMによる音声認識結果の単語誤り率（WER）を示す。図１０はdev93に対するものであり、図１１はeval92に対するものである。 In FIGS. 10 and 11, the horizontal axis shows the number of repetitions of MBR learning, and the vertical axis shows the word error rate (WER) of the speech recognition result by CTC-AM at the end of each repetition. FIG. 10 is for dev93 and FIG. 11 is for eval92.
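For reference, the word error rate plotted on the vertical axis is conventionally the word-level edit distance between the recognition result and the reference transcription, divided by the number of reference words. A minimal sketch follows; the function name and whitespace tokenization are assumptions, and the scoring tool actually used in the experiments is not stated in the text.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                           # deletion
                           cur[j - 1] + 1,                        # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the hat sat"))  # one substitution in three words
```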

図１０において、グラフ５３０は従来の内挿方式によるグラフであり、グラフ５３２は上記実施の形態によるものである。同様に、図１１において、グラフ５４０は従来の内挿方式によるものであり、グラフ５４２は上記実施の形態によるものである。 In FIG. 10, graph 530 is a graph by the conventional interpolation method, and graph 532 is based on the above embodiment. Similarly, in FIG. 11, graph 540 is based on the conventional interpolation method, and graph 542 is based on the above embodiment.

図１０及び図１１において、MBR繰返し回数＝０でのMAP方式の精度は、F^CTCによる学習のみ行ったCTC-AMによる精度を表す。この時点でMAP方式によるCTCの単語誤り率（7.5%）が内挿方式のもの（8.5%）と比較してかなり低いことが分かる。MBR学習を行うと、両者とも単語誤り率は改善されていく。しかしこの場合も、一貫してMAP方式の単語誤り率が内挿方式の単語誤り率より低いという結果となった。 In FIGS. 10 and 11, the accuracy of the MAP method at zero MBR iterations is the accuracy of a CTC-AM trained only with F^CTC. At this point the word error rate of the CTC with the MAP method (7.5%) is already considerably lower than that with the interpolation method (8.5%). MBR training improves the word error rate of both. Even so, the word error rate of the MAP method remained consistently lower than that of the interpolation method.

That is, it was confirmed that the method according to equation (15) achieves higher accuracy than the interpolation method, and that performing MBR learning on the resulting CTC-AM raises its accuracy further.

[Realization by computer]
The speech recognition device 280 and the learning system 350 according to the embodiment of the present invention can be realized by computer hardware and a computer program executed on that computer hardware. FIG. 12 shows the appearance of this computer system 630, and FIG. 13 shows the internal configuration of the computer system 630.

Referring to FIG. 12, the computer system 630 includes a computer 640 having a memory port 652 and a DVD (Digital Versatile Disk) drive 650, a keyboard 646, a mouse 648, and a monitor 642.

Referring to FIG. 13, in addition to the memory port 652 and the DVD drive 650, the computer 640 includes a CPU (central processing unit) 656; a bus 666 connected to the CPU 656, the memory port 652, and the DVD drive 650; a read-only memory (ROM) 658 that stores a boot program and the like; a random access memory (RAM) 660 that is connected to the bus 666 and stores program instructions, a system program, work data, and the like; and a hard disk 654. The computer system 630 further includes a network interface (I/F) 644 that provides a connection to a network 668 enabling communication with other terminals.

The computer program for causing the computer system 630 to function as each functional unit of the speech recognition device 280 and the learning system 350 according to the above embodiment is stored on a DVD 662 or a removable memory 664 that is loaded into the DVD drive 650 or the memory port 652, and is then transferred to the hard disk 654. Alternatively, the program may be transmitted to the computer 640 over the network 668 and stored on the hard disk 654. The program is loaded into the RAM 660 when it is executed. The program may also be loaded into the RAM 660 directly from the DVD 662, from the removable memory 664, or via the network 668.

This program includes an instruction sequence consisting of a plurality of instructions for causing the computer 640 to function as each functional unit of the speech recognition device 280 and the learning system 350 according to the above embodiment. Some of the basic functions needed to make the computer 640 perform this operation are provided by the operating system running on the computer 640, by third-party programs, or by various dynamically linkable programming toolkits or program libraries installed on the computer 640. The program itself therefore need not include all of the functions required to realize the system, device, and method of this embodiment. The program need only include those instructions that realize the functions of the system, device, or method described above by dynamically calling, at run time and in a manner controlled to obtain the desired result, the appropriate functions or the appropriate programs in a programming toolkit or program library. Of course, the program alone may provide all of the necessary functions.

In the above embodiment, the CTC-AM is trained so as to maximize an objective function. However, the present invention is not limited to such an embodiment. For example, instead of such an objective function, a loss function may be defined and learning may be performed so as to minimize the value of that loss function.
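
The relation between the two formulations can be written compactly. As an illustrative assumption not stated in the embodiment, take the loss to be the negated objective; the symbols θ (the parameter set), F (the objective function), and L (the loss function) are notation introduced only for this sketch.

```latex
L(\theta) = -F(\theta)
\quad\Longrightarrow\quad
\arg\max_{\theta} F(\theta) = \arg\min_{\theta} L(\theta)
```

Under that assumption, gradient ascent on F and gradient descent on L produce the same parameter updates.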

In the above experiment, a CTC-AM built from LSTM units was used. However, as will be apparent to those skilled in the art, the CTC-AM is not limited to one using LSTM. For example, the scope may be extended to RNNs in general, or a CNN may be used. Further, in the above embodiment, both the learning by the learning processing unit 362 and the learning by the accuracy evaluation unit 374 use, as the termination condition, the difference between the accuracy of the CTC-AM after learning and its accuracy before learning falling below a predetermined value. However, the present invention is not limited to such an embodiment. For example, in either or both of these learning processes, the number of iterations may be set to a fixed value, and learning may be ended when the number of learning iterations reaches that value.
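
The two termination conditions mentioned above can be sketched in one loop. This is a minimal illustration under stated assumptions: `train_until_done`, `ToyModel`, and the parameters `min_gain` and `max_iters` are hypothetical names, and the toy accuracy values are invented so that the loop can be run without a real acoustic model.

```python
def train_until_done(step, evaluate, min_gain=None, max_iters=None):
    """Repeat step() until the accuracy gain drops below min_gain or max_iters is reached.

    step: runs one learning iteration; evaluate: returns the current accuracy.
    """
    if min_gain is None and max_iters is None:
        raise ValueError("at least one termination condition is required")
    history = [evaluate()]  # accuracy before any learning
    iterations = 0
    while True:
        step()
        iterations += 1
        history.append(evaluate())
        # Condition 1: accuracy after learning minus accuracy before is below the threshold
        if min_gain is not None and history[-1] - history[-2] < min_gain:
            break
        # Condition 2: a fixed number of iterations has been reached
        if max_iters is not None and iterations >= max_iters:
            break
    return iterations, history


class ToyModel:
    """Stand-in whose accuracy follows a fixed, made-up schedule."""

    def __init__(self, accuracies):
        self._accuracies = accuracies
        self._index = 0

    def step(self):
        self._index = min(self._index + 1, len(self._accuracies) - 1)

    def evaluate(self):
        return self._accuracies[self._index]


schedule = [0.80, 0.85, 0.88, 0.885, 0.886]
model_a = ToyModel(schedule)
iters_by_gain, _ = train_until_done(model_a.step, model_a.evaluate, min_gain=0.01)
model_b = ToyModel(schedule)
iters_by_count, _ = train_until_done(model_b.step, model_b.evaluate, max_iters=2)
```

Passing both `min_gain` and `max_iters` stops at whichever condition is met first, which corresponds to applying the fixed iteration count as an upper bound on the convergence test.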

Further, in the above embodiment, the value given by equation (19) is used as the measure of the accuracy of the word string W. However, the present invention is not limited to such an embodiment. For example, among the paths of the lattice obtained by performing speech recognition on the evaluation data with the CTC-AM, the average of the probabilities of the paths that pass through the word W may be adopted as the measure of the accuracy of the word string W. Alternatively, this value divided by the probability of all paths in the lattice may be used.
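
The alternative measure described above can be sketched on a lattice that has already been expanded into an explicit list of paths. The sketch makes several assumptions that are not in the embodiment: the names `passes_through` and `path_average_score`, the representation of a path as a (word tuple, probability) pair, the treatment of W as a contiguous word sequence, and the reading of "the probability of all paths" as the sum of the path probabilities. The path probabilities are invented toy values.

```python
def passes_through(words, target):
    """True if target occurs as a contiguous word sequence inside words."""
    n = len(target)
    return any(tuple(words[i:i + n]) == tuple(target) for i in range(len(words) - n + 1))


def path_average_score(paths, target, normalize=False):
    """Average probability of the lattice paths that pass through target.

    paths: list of (word_sequence, probability) pairs.
    normalize: if True, divide the average by the total probability of all paths.
    """
    hits = [prob for words, prob in paths if passes_through(words, target)]
    if not hits:
        return 0.0
    score = sum(hits) / len(hits)
    if normalize:
        score /= sum(prob for _, prob in paths)
    return score


# Toy lattice expanded into three paths (probabilities are illustrative only)
toy_paths = [
    (("the", "cat", "sat"), 0.25),
    (("the", "cat", "sad"), 0.125),
    (("the", "hat", "sat"), 0.125),
]
plain = path_average_score(toy_paths, ("cat",))
normalized = path_average_score(toy_paths, ("cat",), normalize=True)
```

Enumerating paths explicitly is only practical for small lattices; a real lattice would normally be scored with a forward-backward pass rather than by listing every path.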

The embodiments disclosed herein are merely illustrative, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording recited therein.

30 Word sequence
32 Pronunciation sequence
34 State sequence
36 Observation sequence
70 DNN
100 RNN
110 Phoneme sequence
112 Pronunciation sequence (subword sequence)
280 Speech recognition device
282 Input speech
302 Framing processing unit
304 Feature extraction unit
306 Feature storage unit
308 Acoustic model
310 Decoder
350 Learning system
362 Learning processing unit
364 CTC-AM
366 MBR learning processing unit
374 Accuracy evaluation unit
378 Learning/evaluation control unit
630 Computer system
640 Computer
654 Hard disk
656 CPU
658 ROM
660 RAM

Claims

An acoustic model learning device that trains an acoustic model based on an End-to-End type neural network for calculating, when an observation sequence of speech is given, the probability that the observation sequence is an arbitrary subword string, wherein
the acoustic model learning device is used while connected to a computer-readable storage means for storing learning data consisting of aligned pairs of an observation sequence of learning speech and the correct subword string corresponding to that learning speech, and a word-level language model that stores the frequencies of occurrence of word strings, and
the acoustic model learning device includes: a first optimization means for optimizing the End-to-End type neural network so that the sum, over the entire learning data, of the posterior probability of the correct subword string of the learning data given the observation sequence of the learning speech is maximized, the posterior probability being obtained, for each hypothesis, from the word-subword conversion probability, the language model score obtained from the language model, and the subword posterior probability output by the End-to-End type neural network; and
a second optimization means for further optimizing the End-to-End type neural network optimized by the first optimization means so that, when the observation sequence of the learning speech is given, the expected value of the accuracy of the word string hypotheses estimated using the End-to-End type neural network and the language model is maximized.

The acoustic model learning device according to claim 1, wherein the second optimization means includes:
a speech recognition means for generating word string hypotheses by performing speech recognition on the observation sequence, over the entire learning speech, using the End-to-End type neural network and the language model;
a first calculation means for calculating, over the entire learning speech, the recognition accuracy of the word string constituting each hypothesis on the basis of the hypothesis and the correct subword string of the learning data;
a second calculation means for calculating the expected value by computing, over the entire learning speech, the sum of the products of the posterior probability of each hypothesis, calculated using the language model when the hypothesis is generated, and the recognition accuracy of the word string constituting that hypothesis;
an update means for updating the parameter set of the acoustic model so that the expected value calculated by the second calculation means increases;
a determination means for executing, in response to completion of the update of the parameter set of the acoustic model by the update means, a determination process as to whether a termination condition is satisfied; and
a control means for selectively executing, in response to the determination by the determination means, a first process of ending the learning of the End-to-End type neural network, and a second process of controlling the speech recognition means, the first calculation means, the second calculation means, the update means, and the determination means so that the generation of the hypotheses using the learning speech, the calculation of the recognition accuracy, the calculation of the expected value, the update of the parameter set, and the determination process are performed again.

The acoustic model learning device according to claim 2, wherein
the observation sequence is prepared in units of the speech signal representing the learning speech, and
the first calculation means includes a subword match count calculation means for calculating, in frame units, the number of subwords of the hypothesis word string output by the End-to-End type neural network that match the corresponding subwords of the subword string paired with the input observation sequence.

The acoustic model learning device according to claim 2 or 3, wherein the determination means includes means for determining that the termination condition is satisfied when the hypothesis generation process for the entire learning speech by the speech recognition means, the recognition accuracy calculation process by the first calculation means, the sum calculation process by the second calculation means, and the update of the parameter set of the acoustic model by the update means have been executed a predetermined number of times.

The acoustic model learning device according to claim 2 or 3, wherein the determination means includes means for determining that the termination condition is satisfied in response to the difference between the value after the latest processing and the value after the preceding processing of the parameter set defining the End-to-End type neural network becoming less than a threshold value.

A computer program that causes a computer to function as each of the means recited in any one of claims 1 to 5.