JP6674706B2 - Program, apparatus and method for automatically scoring from dictation speech of learner

Info

Publication number: JP6674706B2
Application number: JP2016179157A
Authority: JP
Inventors: 安田 圭志
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2016-09-14
Filing date: 2016-09-14
Publication date: 2020-04-01
Anticipated expiration: 2036-09-14
Also published as: JP2018045062A

Description

The present invention relates to speaking-test technology that automatically scores a learner's dictation speech.

For written-answer questions in language learning (for example, translation questions), a grader compares the learner's answer sentence with a correct answer sentence and assigns a score according to how closely they match. Human grading, however, not only takes time and money but can also be swayed by arbitrary judgment. As a result, the questions themselves tend to be designed so that they are easy for a person to grade.

Techniques exist for automatically scoring a learner's answer sentence to a written-answer question (see, for example, Patent Document 1). In this technique, an e-learning system is connected to a language processing system. The e-learning system presents a question to the learner and forwards the answer sentence to the language processing system. The language processing system compares the answer sentence with the correct answer sentence linguistically and returns the degree of match to the e-learning system, which then assigns a score according to that degree of match.

There is also a technique that, for translation questions for example, evaluates the translation ability of a translation agent while tolerating variety of expression in the answer sentence (see, for example, Patent Document 2). With this technique, translation ability can be evaluated fairly even when the answer sentence and the correct answer sentence are worded differently. Specifically, a translation accuracy rate is calculated by comparing the answer sentence for a source-language test sentence against both the correct translation of that test sentence and the correct translation of a source-language reference sentence similar to the test sentence.

These conventional techniques are effective when the content of the answer to a written-answer question has little freedom, as in a one-question, one-answer format.

Patent Document 1: JP 2006-244003 A
Patent Document 2: JP 2004-013913 A
Patent Document 3: JP 2002-544570 A

Non-Patent Document 1: Shyamaa E. Sorour, Kazaumasa Goda and Tsunemori Mine, "Student performance Estimation Based on Topic Models Considering a Range of Lessons," Proc. of AIED2015, pp. 790-793, 2015.
Non-Patent Document 2: Quoc Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents", [online], [retrieved July 16, 2016], Internet <URL: http://cs.stanford.edu/~quocle/paragraph_vector.pdf>
Non-Patent Document 3: Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, Joel Tetreault, "The CoNLL-2013 Shared Task on Grammatical Error Correction", [online], [retrieved July 16, 2016], Internet <URL: http://www.comp.nus.edu.sg/~nlp/conll13st/CoNLLST01.pdf>
Non-Patent Document 4: 相澤一美、石川慎一郎、村田年、磯達夫、上村俊彦、小川貴宏、清水伸一、杉森直樹、羽井左昭彦、望月正道, "JACET8000英単語" (JACET8000 English Words), [online], [retrieved July 16, 2016], Internet <URL: http://iss.ndl.go.jp/books/R100000002-I000008184038-00>
Non-Patent Document 5: 投野由紀夫, "CAN‐DOリスト作成・活用英語到達度指標CEFR‐Jガイドブック" (CEFR-J Guidebook: English Achievement Index for Creating and Using CAN-DO Lists), [online], [retrieved July 16, 2016], Internet <URL: http://www.taishukan.co.jp/book/b197158.html>
Non-Patent Document 6: Julius, [online], [retrieved July 8, 2016], Internet <URL: http://julius.osdn.jp/>
Non-Patent Document 7: Frank K. Soong et al., "Generalized Word Posterior Probability (GWPP) for Measuring reliability of Recognized Words", Proc. SWIM 2004, [online], [retrieved September 5, 2016], Internet <URL: file:///C:/Users/hayah/Desktop/SoongSWIM2004.pdf>

However, the conventional techniques described above cannot simply be applied to a speaking test that automatically scores a learner's dictation speech. The more freedom of conversation a speaking test allows, the harder it is to prepare correct answer sentences in advance, and the less accurate automatic scoring becomes.

Moreover, even when the linguistic match between the answer sentence and the correct answer sentence is low, there are cases in which the meaning of the learner's answer should be judged to be close to that of the correct answer.

Furthermore, in a speaking test, recognition errors made by the speech recognition system may be mixed with answer errors made by the learner. In that case, the recognition error rate of the speech recognition system also has to be obtained in advance by feeding sample speech into it (see, for example, Patent Document 3).

An object of the present invention is therefore to provide a speaking test program, apparatus, and method that can improve the accuracy of automatically scoring a learner's dictation speech.

According to the present invention, there is provided a speaking test program that causes a computer to receive dictation speech in a language foreign to the learner and to output a scoring result, wherein
as a learning stage, teacher data that associates dictation speech with scoring results is input, and the computer is caused to function as
a speech recognition engine that outputs dictation text recognized from the dictation speech of the teacher data and a reliability for that recognition result,
feature extraction means that extracts linguistic features from the dictation text of the teacher data and speech features from inside the speech recognition engine,
a feature selection engine that selects features such that the higher the reliability output from the speech recognition engine, the more linguistic features are included relative to speech features, and
a scoring engine that learns the features selected by the feature selection engine in association with the scoring results of the teacher data; and
as a scoring stage, the learner's dictation speech is input,
the speech recognition engine outputs dictation text recognized from the learner's dictation speech and a reliability for that recognition result,
the feature extraction means extracts linguistic features from the dictation text based on the learner's dictation speech and speech features from inside the speech recognition engine,
the feature selection engine outputs features selected according to the reliability output from the speech recognition engine, and
the scoring engine receives the features selected by the feature selection engine and outputs a scoring result based on the learner's dictation speech.

According to another embodiment of the speaking test program of the present invention, it is also preferable that the computer is caused to function such that
the feature selection engine selects features such that the lower the reliability output from the speech recognition engine, the more speech features are included relative to linguistic features.
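
The reliability-dependent selection described above could be sketched as follows. This is only an illustration under stated assumptions, not the selection engine of the invention: the function name `select_features`, the fixed `budget`, the linear mapping from reliability to the share of linguistic features, and the idea that each feature list is already ordered by usefulness are all hypothetical.

```python
def select_features(linguistic, speech, reliability, budget=4):
    """Pick up to `budget` features; higher reliability favors linguistic ones.

    linguistic, speech: lists of (name, value) pairs, assumed pre-ranked.
    reliability: recognizer reliability, assumed to lie in [0.0, 1.0].
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    # Assumed linear rule: the linguistic share of the budget equals the reliability.
    n_linguistic = min(len(linguistic), round(budget * reliability))
    n_speech = min(len(speech), budget - n_linguistic)
    return linguistic[:n_linguistic] + speech[:n_speech]


# Hypothetical feature values for one utterance.
linguistic = [("total_words", 42), ("distinct_words", 30),
              ("grammar_errors", 2), ("hard_words", 5)]
speech = [("duration_s", 12.5), ("words_per_s", 3.4),
          ("acoustic_likelihood", -210.0), ("phonemes_per_s", 9.1)]

reliable = select_features(linguistic, speech, reliability=0.75)
unreliable = select_features(linguistic, speech, reliability=0.25)
```

With a reliability of 0.75 the sketch keeps three linguistic features and one speech feature; with 0.25 the proportions are reversed.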

According to another embodiment of the speaking test program of the present invention, it is also preferable that, as the learning stage,
the computer is further caused to function as utterance environment synthesis means that outputs one or more dictation speech signals obtained by synthesizing different utterance environment sounds with the dictation speech of the teacher data,
the speech recognition engine outputs the dictation text and reliability recognized from the dictation speech without environment sound synthesis, and the dictation text and reliability recognized from the dictation speech with environment sound synthesis,
the feature extraction means extracts linguistic features from the dictation text based on the dictation speech without environment sound synthesis, speech features from inside the speech recognition engine based on the dictation speech without environment sound synthesis, linguistic features from the dictation text based on the dictation speech with environment sound synthesis, and speech features from inside the speech recognition engine based on the dictation speech with environment sound synthesis,
the computer is further caused to function as feature difference detection means that detects those linguistic features based on the dictation speech with environment sound synthesis whose difference from the linguistic features based on the dictation speech without environment sound synthesis is smaller than a predetermined threshold, or that make up a predetermined number in ascending order of difference, and that likewise detects those speech features based on the dictation speech with environment sound synthesis whose difference from the speech features based on the dictation speech without environment sound synthesis is smaller than a predetermined threshold, or that make up a predetermined number in ascending order of difference, and
the feature selection engine learns the linguistic features and speech features output from the feature difference detection means in association with the reliability output from the speech recognition engine.
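
The feature difference detection described above (keep the features whose difference is below a threshold, or the predetermined number with the smallest differences) might look like the sketch below. It assumes scalar features held in dictionaries keyed by feature name; the function name `detect_stable_features` and the use of an absolute difference are illustrative choices, and vector-valued features would need a distance measure instead.

```python
def detect_stable_features(clean, noisy, threshold=None, top_k=None):
    """Return names of features that change least when environment sound is added.

    clean: feature name -> value, from speech without environment sound synthesis.
    noisy: feature name -> value, from speech with environment sound synthesis.
    Exactly one of `threshold` or `top_k` must be given.
    """
    if (threshold is None) == (top_k is None):
        raise ValueError("give exactly one of threshold or top_k")
    # Absolute difference per feature present in both conditions.
    diffs = {name: abs(clean[name] - noisy[name]) for name in clean if name in noisy}
    if threshold is not None:
        return sorted(name for name, d in diffs.items() if d < threshold)
    # Otherwise keep the top_k features in ascending order of difference.
    ranked = sorted(diffs, key=lambda name: (diffs[name], name))
    return ranked[:top_k]


# Hypothetical linguistic features before and after adding noise.
clean = {"total_words": 40, "distinct_words": 28, "grammar_errors": 2}
noisy = {"total_words": 39, "distinct_words": 20, "grammar_errors": 2}

by_threshold = detect_stable_features(clean, noisy, threshold=2)
by_count = detect_stable_features(clean, noisy, top_k=1)
```

The same function can be applied separately to the linguistic features and to the speech features, as the embodiment treats the two groups independently.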

According to another embodiment of the speaking test program of the present invention, it is also preferable that the computer is caused to function such that
the utterance environment synthesis means synthesizes different noises as the utterance environment sounds.
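
One common way to synthesize noise with speech is to scale the noise to a target signal-to-noise ratio before adding it. The text does not say how the noises are mixed, so the SNR-based rule, the function name `mix_noise`, and the representation of signals as equal-length lists of samples are assumptions made for this sketch.

```python
import math


def mix_noise(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mix has the requested SNR in dB."""
    if not speech or len(speech) != len(noise):
        raise ValueError("signals must be non-empty and of equal length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise signal is silent")
    # Choose gain g so that p_speech / (g**2 * p_noise) == 10 ** (snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for a speech signal and a noise signal.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [math.cos(1.7 * i) for i in range(1000)]

# Several noisy variants of the same utterance, one per SNR level.
variants = {snr: mix_noise(speech, noise, snr) for snr in (20.0, 10.0, 0.0)}
```

Different noise recordings or different SNR levels would each yield one additional dictation speech signal for the learning stage.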

According to another embodiment of the speaking test program of the present invention, it is also preferable that the computer is caused to function such that
the linguistic features are one or more of the following, based on the dictation text:
total number of words,
number of distinct words,
Bag-of-Words space vector,
Bag-of-ngram space vector,
LSA (Latent Semantic Analysis) dimension vector,
LDA (Latent Dirichlet Allocation) dimension vector,
distributed representation vector,
number and/or type of grammatical errors,
number of words by difficulty level,
and the speech features are one or more of the following, based on the dictation speech:
utterance duration,
number of words per unit time,
acoustic likelihood,
number of phonemes per unit time.
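
The rate-type speech features in this list are simple ratios. The sketch below assumes the recognizer has already provided the recognized words, a phoneme sequence, and the utterance duration in seconds; the function name `rate_features` and the sample values are hypothetical, and the acoustic likelihood is left out because it can only come from the recognizer itself.

```python
def rate_features(words, phonemes, duration_s):
    """Compute duration-based speech features for one utterance."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return {
        "duration_s": duration_s,
        "words_per_s": len(words) / duration_s,
        "phonemes_per_s": len(phonemes) / duration_s,
    }


# Hypothetical recognizer output for a two-second utterance.
features = rate_features(
    words=["i", "like", "green", "tea"],
    phonemes=["ay", "l", "ay", "k", "g", "r", "iy", "n", "t", "iy"],
    duration_s=2.0,
)
```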

According to the present invention, there is provided a speaking test apparatus that receives dictation speech in a language foreign to the learner and outputs a scoring result, wherein
as a learning stage, teacher data that associates dictation speech with scoring results is input, and the apparatus has
a speech recognition engine that outputs dictation text recognized from the dictation speech of the teacher data and a reliability for that recognition result,
feature extraction means that extracts linguistic features from the dictation text of the teacher data and speech features from inside the speech recognition engine,
a feature selection engine that selects features such that the higher the reliability output from the speech recognition engine, the more linguistic features are included relative to speech features, and
a scoring engine that learns the features selected by the feature selection engine in association with the scoring results of the teacher data; and
as a scoring stage, the learner's dictation speech is input,
the speech recognition engine outputs dictation text recognized from the learner's dictation speech and a reliability for that recognition result,
the feature extraction means extracts linguistic features from the dictation text derived from the learner and speech features from inside the speech recognition engine,
the feature selection engine outputs features selected according to the reliability output from the speech recognition engine, and
the scoring engine receives the features selected by the feature selection engine and outputs a scoring result for the learner.

According to another embodiment of the speaking test apparatus of the present invention, it is also preferable that, as the learning stage,
the apparatus further has utterance environment synthesis means that outputs one or more dictation speech signals obtained by synthesizing different utterance environment sounds with the dictation speech of the teacher data,
the speech recognition engine outputs the dictation text and reliability recognized from the dictation speech without environment sound synthesis, and the dictation text and reliability recognized from the dictation speech with environment sound synthesis,
the feature extraction means extracts linguistic features from the dictation text based on the dictation speech without environment sound synthesis, speech features from inside the speech recognition engine based on the dictation speech without environment sound synthesis, linguistic features from the dictation text based on the dictation speech with environment sound synthesis, and speech features from inside the speech recognition engine based on the dictation speech with environment sound synthesis,
the apparatus further has feature difference detection means that detects those linguistic features based on the dictation speech with environment sound synthesis whose difference from the linguistic features based on the dictation speech without environment sound synthesis is smaller than a predetermined threshold, or that make up a predetermined number in ascending order of difference, and that likewise detects those speech features based on the dictation speech with environment sound synthesis whose difference from the speech features based on the dictation speech without environment sound synthesis is smaller than a predetermined threshold, or that make up a predetermined number in ascending order of difference, and
the feature selection engine learns the linguistic features and speech features output from the feature difference detection means in association with the reliability output from the speech recognition engine.

According to the present invention, there is provided a speaking test method for an apparatus that receives dictation speech in a language foreign to the learner and outputs a scoring result, wherein
the apparatus,
as a learning stage, receives teacher data that associates dictation speech with scoring results, and executes
a first step of using a speech recognition engine to output dictation text recognized from the dictation speech of the teacher data and a reliability for that recognition result,
a second step of extracting linguistic features from the dictation text of the teacher data and speech features from inside the speech recognition engine,
a third step of using a feature selection engine to select features such that the higher the reliability output in the first step, the more linguistic features are included relative to speech features, and
a fourth step of using a scoring learning engine to learn the features selected in the third step in association with the scoring results of the teacher data; and,
as a scoring stage, receives the learner's dictation speech, and executes
a fifth step of using the speech recognition engine to output dictation text recognized from the learner's dictation speech and a reliability for that recognition result,
a sixth step of extracting linguistic features from the dictation text derived from the learner and speech features from inside the speech recognition engine,
a seventh step of using the feature selection engine to output features selected according to the reliability output in the fifth step, and
an eighth step of using the scoring engine to receive the features selected in the seventh step and output a scoring result for the learner.
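
The eight steps above can be sketched end to end as follows. Everything here is a simplified stand-in chosen for illustration: `Recognition` objects play the role of the speech recognition engine output in steps 1 and 5 (no real recognizer is called), the selection rule uses a single hypothetical 0.5 cut-off, and a 1-nearest-neighbor lookup replaces whatever learning method the scoring engine actually uses.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    """Assumed recognizer output (steps 1 and 5)."""
    text: str           # dictation text
    reliability: float  # reliability of the recognition result, in [0, 1]
    duration_s: float   # utterance duration reported by the recognizer


def extract_features(rec):
    """Steps 2 and 6: linguistic features from text, speech features from the recognizer."""
    words = rec.text.lower().split()
    linguistic = {"total_words": len(words), "distinct_words": len(set(words))}
    speech = {"duration_s": rec.duration_s, "words_per_s": len(words) / rec.duration_s}
    return linguistic, speech


def select_features(linguistic, speech, reliability):
    """Steps 3 and 7: more linguistic features when reliability is high (assumed rule)."""
    if reliability >= 0.5:
        chosen = dict(linguistic)
        chosen["words_per_s"] = speech["words_per_s"]
    else:
        chosen = dict(speech)
        chosen["total_words"] = linguistic["total_words"]
    return chosen


class NearestNeighborScorer:
    """Steps 4 and 8: stand-in scoring engine that returns the closest example's score."""

    def __init__(self):
        self.examples = []

    def learn(self, features, score):
        self.examples.append((features, score))

    def score(self, features):
        def distance(example):
            shared = features.keys() & example.keys()
            if not shared:
                return float("inf")
            return sum((features[k] - example[k]) ** 2 for k in shared) / len(shared)
        return min(self.examples, key=lambda ex: distance(ex[0]))[1]


def train(scorer, teacher_data):
    """Learning stage: teacher_data is a list of (Recognition, score) pairs."""
    for rec, score in teacher_data:
        linguistic, speech = extract_features(rec)
        scorer.learn(select_features(linguistic, speech, rec.reliability), score)


def grade(scorer, rec):
    """Scoring stage for one learner utterance."""
    linguistic, speech = extract_features(rec)
    return scorer.score(select_features(linguistic, speech, rec.reliability))


scorer = NearestNeighborScorer()
train(scorer, [
    (Recognition("i like to play tennis with my friends on sunday", 0.9, 4.0), 5),
    (Recognition("i tennis", 0.8, 3.0), 1),
])
fluent = grade(scorer, Recognition("i like to play soccer with my brother", 0.85, 3.5))
halting = grade(scorer, Recognition("i tennis uh", 0.3, 3.0))
```

In this toy data the fluent answer lands nearest the high-scoring teacher example and the halting one nearest the low-scoring example, regardless of which feature subset the reliability selects.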

According to another embodiment of the speaking test method of the present invention, it is also preferable that
the apparatus, as the learning stage,
outputs one or more dictation speech signals obtained by synthesizing different utterance environment sounds with the dictation speech of the teacher data,
in the first step, uses the speech recognition engine to output the dictation text and reliability recognized from the dictation speech without environment sound synthesis, and the dictation text and reliability recognized from the dictation speech with environment sound synthesis,
in the second step,
extracts linguistic features from the dictation text based on the dictation speech without environment sound synthesis, speech features from inside the speech recognition engine based on the dictation speech without environment sound synthesis, linguistic features from the dictation text based on the dictation speech with environment sound synthesis, and speech features from inside the speech recognition engine based on the dictation speech with environment sound synthesis, and
detects those linguistic features based on the dictation speech with environment sound synthesis whose difference from the linguistic features based on the dictation speech without environment sound synthesis is smaller than a predetermined threshold, or that make up a predetermined number in ascending order of difference, and likewise detects those speech features based on the dictation speech with environment sound synthesis whose difference from the speech features based on the dictation speech without environment sound synthesis is smaller than a predetermined threshold, or that make up a predetermined number in ascending order of difference, and
in the third step, uses the feature selection engine to learn the linguistic features and speech features output from the second step in association with the reliability output from the speech recognition engine in the first step.

According to the speaking test program, apparatus, and method of the present invention, the accuracy of automatically scoring a learner's dictation speech can be improved.

FIG. 1 is a functional configuration diagram of the scoring stage of the speaking test program of the present invention.
FIG. 2 is an explanatory diagram showing the features extracted using the speech recognition engine.
FIG. 3 is a functional configuration diagram of the learning stage of the scoring engine in the speaking test program of the present invention.
FIG. 4 is a functional configuration diagram of the learning stage of the feature selection engine in the speaking test program of the present invention.
FIG. 5 is a sequence diagram of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings.

<Scoring stage>
FIG. 1 is a functional configuration diagram of the scoring stage of the speaking test program of the present invention.

The "scoring stage" is the process of receiving the learner's dictation speech and automatically outputting a scoring result using the scoring engine. It is executed, for example, on a terminal carried by the learner.

In a speaking test program for language learning, the learner's dictation speech is in a foreign language (for example, English) that differs from the learner's native language (for example, Japanese). With the present invention, for example, a Japanese speaker can say an answer sentence of their own devising in English, and that dictation speech can be scored automatically.

The speaking test program 1 of FIG. 1 causes a computer to function as a speech recognition engine 11, a feature extraction unit 12, a feature selection engine 13, and a scoring engine 14. The processing flow of these functional components can also be understood as a speaking test apparatus and method.

[Speech recognition engine 11]
The speech recognition engine 11 outputs "dictation text" recognized from the learner's dictation speech and a "reliability" for that recognition result. The dictation text is output to the feature extraction unit 12.
One example of a speech recognition engine is Julius (registered trademark), which can perform continuous speech recognition with a vocabulary of tens of thousands of words in real time (see, for example, Non-Patent Document 6). This engine includes an "acoustic model" (a model representing acoustic features) that uses a GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) or a DNN-HMM (Deep Neural Network - Hidden Markov Model), and a "language model" (a model representing how words connect) that uses word N-grams, a description grammar, and a word dictionary. The language model and acoustic model modules can be swapped according to the purpose of the speaking test.

The acoustic model represents the frequency characteristics of each phoneme; a hidden Markov model is generally used.
The language model represents constraints on how words are ordered. One example is the constraint that "watashi" (私, "I") is very likely to be followed immediately by the word "ga" (が) or "wa" (は).
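
The word-order constraint in this example can be illustrated with a maximum-likelihood bigram model. The tiny romanized corpus and the function name `train_bigram_model` are invented for this sketch; a production language model such as the one in Julius is trained on far more data and applies smoothing.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Return prob(prev, word): the estimated probability that `word` follows `prev`."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        history_counts.update(tokens[:-1])             # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs

    def prob(prev, word):
        if history_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob


# Hypothetical training sentences, already split into words.
prob = train_bigram_model([
    "watashi wa gakusei desu",
    "watashi ga ikimasu",
    "watashi wa ikimasu",
    "kare wa gakusei desu",
])
p_wa = prob("watashi", "wa")
p_ga = prob("watashi", "ga")
p_desu = prob("watashi", "desu")
```

In this corpus "watashi" is always followed by "wa" or "ga", so those two words share all of the probability mass and any other continuation gets zero.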

[Feature extraction unit 12]
The feature extraction unit 12 extracts a plurality of features from the dictation text derived from the learner and/or from inside the speech recognition engine.

FIG. 2 is an explanatory diagram showing the features extracted from the speech recognition engine.

The features extracted from the dictation text are "linguistic features".
The features extracted from inside the speech recognition engine are "speech features".
These features are output to the feature selection engine 13.

<Linguistic features>
The linguistic features are one or more of the following, based on the dictation text:
total number of words,
number of distinct words,
Bag-of-Words space vector,
Bag-of-ngram space vector,
LSA (Latent Semantic Analysis) dimension vector,
LDA (Latent Dirichlet Allocation) dimension vector,
distributed representation vector,
number and/or type of grammatical errors,
number of words by difficulty level.
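
The count-based features at the top of this list can be computed with a few lines of code. The sketch below is illustrative only: `tokenize`, `bag_of_ngrams`, and `count_features` are hypothetical helper names, the tokenizer is a plain whitespace split with punctuation stripped, and no lemmatization is done, so inflected forms are not merged.

```python
from collections import Counter

PUNCTUATION = ".,!?;:\"'"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in stripped if w]


def bag_of_ngrams(tokens, n=1):
    """Count every run of n consecutive tokens; n=1 gives a bag of words."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def count_features(text):
    tokens = tokenize(text)
    return {
        "total_words": len(tokens),
        "distinct_words": len(set(tokens)),
        "bag_of_words": bag_of_ngrams(tokens, 1),
        "bag_of_bigrams": bag_of_ngrams(tokens, 2),
    }


features = count_features("I like tennis and I like soccer.")
```

For this sentence the sketch counts seven words in total and five distinct words, and both the word "like" and the bigram ("i", "like") occur twice.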

「のべ単語数」とは、解答となる口述テキストに含まれる全ての単語（語彙）の数をいう。
「異なり単語数」とは、同じ単語は１つとして数えた単語の数をいう。尚、活用形は問わず、全て同じとみなす。
「Bag-of-Wordsの空間ベクトル」とは、テキストに含まれる各単語の出現頻度のみを表現したベクトルをいう。ここでは、単語の出現順は無視される。この空間ベクトルは、単語を軸とし、出現頻度を値として、その空間の１点にそのテキストを位置付けたものである。また、予め導出されたＩＤＦ(Inverse Document Frequency)を単語の重みとして、文章間の類似度を導出する。
「Bag-of-ngramの空間ベクトル」とは、要素個数n=1とするBag-of-Wordsを含む枠組みのベクトルをいう。連続するn個の要素が何を表すかによって表現が異なる。
「ＬＳＡ(Latent Semantic Analysis)の次元ベクトル」とは、潜在意味解析に基づくものであって、文書群とそこに含まれる用語群とから生成した次元圧縮ベクトルをいう（例えば非特許文献１参照）。ＬＳＡによれば、文書毎の用語の出現を表した文書−単語マトリックスが用いられる。これは、各行が各単語に対応し、各列が各文書に対応した疎行列である。この行列の各成分の重み付けには、ＴＦ−ＩＤＦ(Term Frequency - Inverse Document Frequency)が用いられる。行列の各成分は、その文書でその単語が使用された回数に比例した値であり、単語は、その相対的重要性を反映するべく重み付けされる。
「ＬＤＡ(Latent Dirichlet Allocation)の次元ベクトル」とは、文書中の単語の「トピック」を確率的に生成した次元圧縮ベクトルをいう（例えば非特許文献１参照）。具体的には、テキストを、各トピックグループに属する確からしさ（トピック比率）で表したものである。単語は、独立に存在しているのではなく、潜在的にいずれか１つのトピックグループに分類することができ、同じトピックグループに含まれる単語は同じ文章に出現しやすい、という特徴を利用したものである。
「分散表現(Distributed representation)」とは、テキスト中の単語を高次元で表現した実数ベクトルをいう（例えば非特許文献２参照）。意味が近い単語ほど、近いベクトルに対応させられる。加法構成性を有し、ベクトルの足し算が、意味の足し算に対応することとなる。例えばdoc2vecのようなツールがある。このようなツールを用いることで、分の意味を数百次元のベクトルで表現することができる。
「文法誤り箇所の数及び／又は種別」によれば、文法誤り箇所が多いほど、採点も低くなる傾向がある（例えば非特許文献３参照）。
「難易度別の単語数」とは、難易度付き語彙リストを用いて、難易度毎に、単語を計数したものである（例えば非特許文献４及び５参照）。 The “total number of words” refers to the number of all words (vocabulary items) contained in the dictation text given as the answer.
The “number of distinct words” refers to the number of words when repeated occurrences of the same word are counted only once. Inflected forms are all treated as the same word.
The “Bag-of-Words space vector” refers to a vector that expresses only the frequency of occurrence of each word in the text; the order in which the words appear is ignored. The vector places the text at a single point in a space whose axes are the words and whose coordinates are their frequencies. The similarity between texts is derived using a previously derived IDF (Inverse Document Frequency) as the weight of each word.
The “Bag-of-ngram space vector” refers to a vector in a framework that includes Bag-of-Words as the case n = 1. The representation differs depending on what the n consecutive elements represent.
The “LSA (Latent Semantic Analysis) dimension vector” is based on latent semantic analysis and refers to a dimensionally compressed vector generated from a group of documents and the group of terms they contain (for example, see Non-Patent Document 1). LSA uses a document-word matrix that represents the occurrence of terms in each document. This is a sparse matrix in which each row corresponds to a word and each column corresponds to a document. TF-IDF (Term Frequency - Inverse Document Frequency) is used to weight each element of the matrix. Each element is a value proportional to the number of times the word is used in the document, and words are weighted to reflect their relative importance.
The “LDA (Latent Dirichlet Allocation) dimension vector” refers to a dimensionally compressed vector obtained by probabilistically generating the “topics” of the words in a document (for example, see Non-Patent Document 1). Specifically, the text is represented by the likelihood that it belongs to each topic group (the topic ratio). This exploits the property that words do not exist independently but can each be latently classified into one topic group, and that words in the same topic group tend to appear in the same text.
“Distributed representation” refers to a real-valued vector that expresses a word in the text in a high-dimensional space (for example, see Non-Patent Document 2). Words with closer meanings are mapped to closer vectors. The representation is additively compositional: adding vectors corresponds to adding meanings. Tools such as doc2vec are available, and with such a tool the meaning of a sentence can be expressed as a vector of several hundred dimensions.
As for “the number and/or type of grammatical errors”, the score tends to be lower as the number of grammatical errors increases (for example, see Non-Patent Document 3).
The “number of words per difficulty level” is obtained by counting the words at each difficulty level using a vocabulary list annotated with difficulty levels (for example, see Non-Patent Documents 4 and 5).
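
As an illustration of the simpler linguistic features defined above, the following is a minimal sketch, not the patented implementation. It assumes whitespace tokenization and does not merge inflected forms, which the description above would require; the function names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplification; no lemmatization)."""
    return text.lower().split()


def total_word_count(tokens):
    """Total number of words: every token is counted."""
    return len(tokens)


def distinct_word_count(tokens):
    """Number of distinct words: repeated tokens are counted once."""
    return len(set(tokens))


def bag_of_ngrams(tokens, n=1):
    """Frequency of each run of n consecutive tokens; n=1 is Bag-of-Words."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


tokens = tokenize("the cat saw the dog")
print(total_word_count(tokens))                  # 5
print(distinct_word_count(tokens))               # 4
print(bag_of_ngrams(tokens, 1)[("the",)])        # 2
print(bag_of_ngrams(tokens, 2)[("the", "cat")])  # 1
```

Word order is discarded in the unigram counts, while the bigram counts retain local order, which is the sense in which Bag-of-ngram generalizes Bag-of-Words.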

＜音声的特徴量＞
音声的特徴量は、口述音声に基づく
発話時間、
単位時間当たりの単語数、
音響尤度、
単位時間当たりの音素数
における１つ以上であってもよい。 <Speech feature>
The speech features may be one or more of the following, based on the dictation voice:
the utterance time,
the number of words per unit time,
the acoustic likelihood, and
the number of phonemes per unit time.

「発話時間」とは、解答となる口述音声の時間である。
「単位時間当たりの単語数」とは、例えば口述音声を単位時間（例えば５秒）毎に区分し、その単位時間毎に単語数を検出し、それら単語数を平均した数をいう。
「音響尤度」とは、当該音素について、音響モデルを用いた統計的観点からみた尤もらしさの度合いをいう。音響尤度が高い単語ほど音響的に正しく、音響尤度が低い単語ほど音響的に誤っている傾向がある。
「単位時間当たりの音素数」とは、単位時間（例えば５秒）に検出された音素の数（話速）をいう。 The “utterance time” is the time of the dictation voice that is the answer.
The “number of words per unit time” refers to, for example, the value obtained by dividing the dictation voice into unit-time segments (for example, 5 seconds each), counting the words in each segment, and averaging those counts.
The “acoustic likelihood” refers to the degree of likelihood of the phoneme from a statistical viewpoint using an acoustic model. Words with higher acoustic likelihood tend to be acoustically correct, and words with lower acoustic likelihood tend to be acoustically incorrect.
The "number of phonemes per unit time" refers to the number of phonemes (speech speed) detected in a unit time (for example, 5 seconds).
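
The per-unit-time averaging described above can be sketched as follows. This is only an illustration: it assumes the recognizer exposes a start time in seconds for each recognized word, which the description does not state, and the function name and sample timestamps are hypothetical.

```python
import math


def words_per_unit_time(word_start_times, utterance_seconds, unit_seconds=5.0):
    """Average number of words per window of unit_seconds."""
    if utterance_seconds <= 0:
        raise ValueError("utterance_seconds must be positive")
    n_windows = math.ceil(utterance_seconds / unit_seconds)
    counts = [0] * n_windows
    for t in word_start_times:
        # Clamp so that out-of-range timestamps fall into the first or last window.
        index = min(max(int(t // unit_seconds), 0), n_windows - 1)
        counts[index] += 1
    return sum(counts) / n_windows


# Hypothetical word start times (seconds) for a 12-second answer.
starts = [0.5, 1.2, 3.0, 6.1, 7.4, 11.0]
print(words_per_unit_time(starts, 12.0))  # 2.0
```

The same function applied to phoneme start times would give the number of phonemes per unit time.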

音声認識エンジン１１が出力する「信頼度」とは、例えばＧＷＰＰ(Generalized Word Posterior Probability)等のConfidence Measureである（例えば非特許文献７参照）。信頼度の値が低いほど、音声認識の誤りが含まれる可能性が高いと判定される。 The “reliability” output by the speech recognition engine 11 is, for example, a Confidence Measure such as GWPP (Generalized Word Posterior Probability) (for example, see Non-Patent Document 7). It is determined that the lower the value of the reliability is, the higher the possibility that a speech recognition error is included.

［特徴量選択エンジン１３］
特徴量選択エンジン１３は、信頼度と特徴量とを対応付けて予め学習したものであり、音声認識エンジン１１から出力された信頼度に応じて、１つ以上の特徴量を選択する。１つの特徴量を選択するものであってもよいし、複数の特徴量を選択する場合、言語的特徴量と音声的特徴量とが混在するものであってもよい。 [Feature selection engine 13]
The feature selection engine 13 has been trained in advance on the association between reliability and features, and selects one or more features according to the reliability output from the speech recognition engine 11. It may select a single feature; when it selects a plurality of features, linguistic features and speech features may be mixed.

ここで具体的には、特徴量選択エンジン１３は、信頼度が低いほど音声認識の誤りが高いために、言語的特徴量よりも、音声的特徴量が選択される。音声的特徴量は、音声認識エンジン１１による音声認識の誤りの度合いに関係なく、安定して高い精度で抽出できるパラメータである。そのために、音声的特徴量は、音声認識誤りに対して、頑健な特徴量として用いることができる。
即ち、特徴量選択エンジン１３は、
信頼度が高いほど、音声的特徴量よりも言語的特徴量を多く含む特徴量を選択し、
信頼度が低いほど、言語的特徴量よりも音声的特徴量を多く含む特徴量を選択する
ことが好ましい。 Specifically, because a lower reliability means more speech recognition errors, the feature selection engine 13 selects speech features in preference to linguistic features when the reliability is low. Speech features are parameters that can be extracted stably and with high accuracy regardless of the degree of error in the speech recognition performed by the speech recognition engine 11. For that reason, speech features can be used as features that are robust against speech recognition errors.
That is, the feature selection engine 13 preferably
selects features that include more linguistic features than speech features as the reliability is higher, and
selects features that include more speech features than linguistic features as the reliability is lower.
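
The preferred behaviour above can be illustrated with a deliberately simple rule. Note that the description specifies a learned selection engine; the fixed proportional rule below, the parameter k, and the feature names are assumptions made only to show the direction of the trade-off.

```python
def select_features(confidence, linguistic, speech, k=4):
    """Pick up to k feature names; the linguistic share grows with confidence in [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    n_linguistic = min(round(confidence * k), len(linguistic))
    n_speech = min(k - n_linguistic, len(speech))
    return linguistic[:n_linguistic] + speech[:n_speech]


# Hypothetical feature names, listed in priority order.
LINGUISTIC = ["total_words", "distinct_words", "bag_of_words", "grammar_errors"]
SPEECH = ["utterance_time", "words_per_unit_time",
          "acoustic_likelihood", "phonemes_per_unit_time"]

print(select_features(0.75, LINGUISTIC, SPEECH))
# ['total_words', 'distinct_words', 'bag_of_words', 'utterance_time']
print(select_features(0.25, LINGUISTIC, SPEECH))
# ['total_words', 'utterance_time', 'words_per_unit_time', 'acoustic_likelihood']
```

A learned engine would replace the proportional rule with parameters fitted to data, but the monotonic relationship between reliability and the share of linguistic features is the same.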

［採点エンジン１４］
採点エンジン１４は、特徴量と採点結果とを対応付けて予め学習したものであり、特徴量選択部１３によって選択された特徴量に応じて、採点結果を出力する。ここでは、採点エンジン１４が、自ら学習した採点モデルパラメータを内部に構築している。
採点結果は、例えば０〜１００点のように連続値であってもよいし、級や合否のような離散値であってもよい。 [Scoring engine 14]
The scoring engine 14 has been trained in advance on the association between features and scoring results, and outputs a scoring result according to the features selected by the feature selection engine 13. Here, the scoring engine 14 internally holds the scoring model parameters that it has learned.
The scoring result may be a continuous value such as 0 to 100 points, or may be a discrete value such as grade or pass / fail.

採点エンジン１４は、採点結果が連続値である場合、例えば回帰分析や、重回帰分析、Lasso回帰、Ridge回帰、ＳＶＲ(Support Vector Regression)、ＮＮ(Neural Net)のような機械学習方式を用いることができる。
また、採点結果が離散値である場合、ロジスティック回帰や、ＳＶＭ(Support Vector Machine)やＮＮのような機械学習方式を用いることができる。
採点エンジンの学習方式の選択として、教師データで線形分離可能か否かが１つの基準となる。 When the scoring result is a continuous value, the scoring engine 14 can use a machine learning method such as regression analysis, multiple regression analysis, Lasso regression, Ridge regression, SVR (Support Vector Regression), or an NN (Neural Net).
If the scoring result is a discrete value, logistic regression or a machine learning method such as SVM (Support Vector Machine) or NN can be used.
One criterion for selecting a learning method of the scoring engine is whether or not linear separation is possible with teacher data.

回帰分析(regression analysis)とは、統計学について、連続尺度の従属変数（目的変数）Yと、独立変数（説明変数）Xとの間にモデルを当てはめることをいう（Y＝f(X)）。最も基本的なモデルは、Y＝aX＋bである。Xが１次元であれば単回帰といい、Xが２次元以上であれば重回帰という。重回帰分析は、多変量解析の１つであって、一般的には最小二乗法が用いられる。
回帰分析の中でも、線形回帰として、Lasso回帰、Ridge回帰があり、非線形回帰として、ＳＶＲやＮＮがある。 Regression analysis, in statistics, refers to fitting a model between a continuous-scale dependent variable (objective variable) Y and an independent variable (explanatory variable) X (Y = f(X)). The most basic model is Y = aX + b. If X is one-dimensional, it is called simple regression; if X is two-dimensional or more, it is called multiple regression. Multiple regression analysis is a form of multivariate analysis and generally uses the least squares method.
Among regression analyses, Lasso regression and Ridge regression are linear regressions, and SVR and NN are nonlinear regressions.
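
For the most basic model Y = aX + b, the least squares fit has a closed form. The following sketch shows it for a single explanatory variable; the sample data (a feature value paired with a score) is made up purely for illustration and is not taken from the patent.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = a*x + b for one explanatory variable."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


# Made-up teacher data: one feature value per answer and the score it received.
feature_values = [10, 20, 30, 40]
scores = [35, 55, 75, 95]
a, b = fit_simple_regression(feature_values, scores)
print(a, b)  # 2.0 15.0
```

With several selected features the same idea becomes multiple regression, where the coefficients are usually obtained by solving the normal equations rather than by this scalar formula.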

ロジスティック回帰(Logistic regression)とは、ベルヌーイ分布に従う変数の統計的な分類モデルの一種である。
サポートベクター回帰とは、カーネル法と称される非線形回帰分析の１つである。パターン認識の分野で用いられているサポートベクターマシン(Support Vector Machine)の回帰バージョンである。サポートベクター回帰とは、モデルを事前に仮定することのないノンパラメトリックモデルであり、データの分布を考慮する必要はない。
ニューラルネットワーク(Neural Network)は、脳機能の特性を、計算機上のシミュレーションによって表現した数学モデルである。シナプスの結合によりネットワークを形成した人工ニューロン（ノード）が、学習によってシナプスの結合強度を変化させ、問題解決能力を持つようなモデル全般をいう。 Logistic regression is a type of statistical classification model for variables that follow a Bernoulli distribution.
Support vector regression is one type of nonlinear regression analysis called a kernel method. This is a regression version of the Support Vector Machine used in the field of pattern recognition. Support vector regression is a non-parametric model that does not assume a model in advance, and does not need to consider the distribution of data.
A neural network is a mathematical model that expresses the characteristics of brain function through computer simulation. The term covers models in general in which artificial neurons (nodes), networked through synaptic connections, acquire problem-solving ability by changing the strength of those connections through learning.

サポートベクターマシン(Support Vector Machine)は、教師あり学習を用いるパターン認識モデルの一つであって、分類や回帰に適用できる。サポートベクターマシンは、線形入力素子を用いて、２クラスのパターン識別器を構成する。教師データから、各データ点との距離が最大となるマージン最大化超平面を求めるという基準（超平面分離定理）で線形入力素子のパラメータを学習する。 A support vector machine is one of the pattern recognition models using supervised learning, and can be applied to classification and regression. The support vector machine forms two classes of pattern classifiers using linear input elements. The parameters of the linear input element are learned from the teacher data based on the criterion (hyperplane separation theorem) for finding a margin-maximizing hyperplane that maximizes the distance to each data point.

＜採点エンジン・学習段階＞
図３は、本発明のスピーキングテストプログラムにおける採点エンジンの学習段階の機能構成図である。 <Scoring engine, learning stage>
FIG. 3 is a functional configuration diagram of the scoring engine in the learning stage in the speaking test program of the present invention.

「採点エンジン・学習段階」とは、教師データを入力し、採点エンジン内部で採点モデルパラメータを構築する処理である。その採点モデルパラメータは、採点エンジン内部へ組み込まれる。スピーキングテストの場合、例えばテストの運用事業者によって実行される。 The “scoring engine / learning stage” is processing for inputting teacher data and constructing a scoring model parameter inside the scoring engine. The scoring model parameters are incorporated into the scoring engine. In the case of a speaking test, the test is performed by, for example, a test operator.

本発明によって入力される教師データ群は、過去の多数の学習者における口述音声及び採点結果を対応付けたものである。
（口述音声）<->（採点）
Ｖ１ <-> Ａ１
Ｖ２ <-> Ａ２
Ｖ３ <-> Ａ３
・・・・・・・
教師データ群の口述音声は、音声認識エンジン１１へ入力され、その採点結果は、採点エンジン１４へ入力される。 The teacher data group input according to the present invention is obtained by associating dictation voices and scores of many learners in the past.
(Dictation voice) <-> (Scoring result)
V1 <-> A1
V2 <-> A2
V3 <-> A3
・・・・・・・
The dictation voice of the teacher data group is input to the voice recognition engine 11, and the scoring result is input to the scoring engine 14.

［音声認識エンジン１１］
音声認識エンジン１１は、教師データの口述音声から口述テキストと、音声認識結果に対する信頼度とを出力する。口述テキストは、特徴量抽出部１２へ出力され、信頼度は、特徴量選択エンジン１３へ出力される。音声認識エンジン１１自体は、図１で前述したものと全く同じものである。 [Speech recognition engine 11]
The speech recognition engine 11 outputs, from the dictation voice of the teacher data, the dictation text and the reliability of the speech recognition result. The dictation text is output to the feature extraction unit 12, and the reliability is output to the feature selection engine 13. The speech recognition engine 11 itself is exactly the same as that described above with reference to FIG. 1.

［特徴量抽出部１２］
特徴量抽出部１２は、教師データに基づく口述テキストから及び／又は音声認識エンジン内部から、複数の特徴量を抽出する。複数の特徴量は、特徴量選択エンジン１３へ出力される。特徴量抽出部１２自体は、図１で前述したものと全く同じものである。 [Feature Extraction Unit 12]
The feature extraction unit 12 extracts a plurality of features from the dictation text based on the teacher data and/or from inside the speech recognition engine. The plurality of features are output to the feature selection engine 13. The feature extraction unit 12 itself is exactly the same as that described above with reference to FIG. 1.

［特徴量選択エンジン１３］
特徴量選択エンジン１３は、教師データに基づく信頼度に応じて、１つ以上の特徴量を選択する。特徴量選択エンジン１３自体は、図１で前述したものと全く同じものである。 [Feature selection engine 13]
The feature selection engine 13 selects one or more features according to the reliability based on the teacher data. The feature selection engine 13 itself is exactly the same as that described above with reference to FIG. 1.

［採点エンジン１４］
採点エンジン１４は、特徴量選択エンジン１３によって選択された特徴量と、教師データの採点結果とを対応付けて学習する。これによって、採点エンジン１４の内部に、採点モデルパラメータを構築する。採点エンジン１４自体は、図１で前述したものと全く同じであって、図３によって構築された採点モデルパラメータは、図１の採点段階の採点エンジン１４で用いられる。 [Scoring engine 14]
The scoring engine 14 learns by associating the features selected by the feature selection engine 13 with the scoring results of the teacher data. As a result, scoring model parameters are constructed inside the scoring engine 14. The scoring engine 14 itself is exactly the same as that described above with reference to FIG. 1, and the scoring model parameters constructed according to FIG. 3 are used by the scoring engine 14 in the scoring stage of FIG. 1.

＜特徴量選択エンジン・学習段階＞
図４は、本発明のスピーキングテストプログラムにおける特徴量選択エンジンの学習段階の機能構成図である。 <Feature selection engine, learning stage>
FIG. 4 is a functional block diagram of the feature selection engine in the learning stage in the speaking test program of the present invention.

「特徴量選択エンジン・学習段階」とは、教師データの口述音声を入力し、特徴量選択エンジン内部で選択モデルパラメータを構築する処理である。その選択モデルパラメータは、特徴量選択エンジンへ組み込まれる。
図４によれば、図１及び図３の機能構成部に加えて、発話環境合成部１５と、特徴量差分検出部１６を更に有する。 The “feature selection engine / learning stage” is a process of inputting the dictation voice of the teacher data and constructing a selection model parameter inside the feature selection engine. The selected model parameters are incorporated into a feature selection engine.
According to FIG. 4, in addition to the functional components of FIGS. 1 and 3, a speech environment synthesizing unit 15 and a feature difference detecting unit 16 are further provided.

［発話環境合成部１５］
発話環境合成部１５は、教師データの口述音声に、異なる発話環境音を合成した１つ以上の口述音声を出力する。これら口述音声は、音声認識エンジン１１へ入力される。
発話環境合成部１５は、教師データとしての同一の口述音声であっても、様々なノイズが合成された音声を、音声認識エンジン１１へ入力する。これによって、同一の口述音声であれば、音声認識エンジン１１から出力された口述テキストの認識に誤りがあっても、同一の採点が付与されるものとして学習する。即ち、学習段階について、学習者の口述音声に、発話環境音におけるノイズが混在していても、採点に対する耐性が高くなるような選択モデルパラメータを構築する。発話環境音における他の例としては、発話者の口述音声を収集するマイクの周波数特性や、発話者の存する部屋の反響特性等を模擬できるエフェクターが考えられる。 [Speech environment synthesis unit 15]
The utterance environment synthesis unit 15 outputs one or more dictation voices obtained by synthesizing different utterance environment sounds with the dictation voice of the teacher data. These dictation voices are input to the voice recognition engine 11.
The utterance environment synthesis unit 15 takes a single dictation voice from the teacher data and inputs versions of it, each with different noise synthesized, to the speech recognition engine 11. As a result, learning proceeds on the premise that the same dictation voice receives the same score even if the dictation text output from the speech recognition engine 11 contains recognition errors. That is, the learning stage constructs selection model parameters that make scoring more robust even when noise from the utterance environment is mixed into the learner's dictation voice. As another example of utterance environment sound, an effector that can simulate the frequency characteristics of the microphone that picks up the speaker's dictation voice, the reverberation characteristics of the room where the speaker is located, and the like is conceivable.
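
One common way to synthesize noise into a clean recording is to scale the noise to a target signal-to-noise ratio (SNR) before adding it. The patent does not say how the utterance environment sound is mixed, so the SNR-based rule, the function name, and the synthetic signals below are all assumptions for illustration.

```python
import math
import random


def mix_noise(speech, noise, snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("signals must be non-silent")
    # Gain g such that speech_power / (g^2 * noise_power) == 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for a dictation voice and an environment sound.
rng = random.Random(0)
clean = [math.sin(0.1 * i) for i in range(1000)]
environment = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Several noisy variants of the same answer, one per SNR.
variants = {snr: mix_noise(clean, environment, snr) for snr in (20.0, 10.0, 0.0)}
print(sorted(variants))
```

Each variant would be paired with the same teacher score as the clean recording, which matches the premise that the same dictation voice receives the same score.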

［音声認識エンジン１１］
音声認識エンジン１１は、ノイズ無しの口述音声から認識された口述テキストと、ノイズ有りの口述音声から認識された口述テキスト及びその認識結果に対する信頼度とを出力する。このとき、ノイズ有りの口述音声の口述テキストは、ノイズ無しの口述音声の口述テキストよりも、誤認識が多く、その信頼度も低くなる。音声認識エンジン１１自体は、図１及び図３で前述したものと全く同じものである。 [Speech recognition engine 11]
The speech recognition engine 11 outputs the dictation text recognized from the dictation voice without noise, and the dictation text recognized from the dictation voice with noise together with the reliability of that recognition result. At this time, the dictation text of the dictation voice with noise contains more misrecognitions, and has a lower reliability, than the dictation text of the dictation voice without noise. The speech recognition engine 11 itself is exactly the same as that described above with reference to FIGS. 1 and 3.

［特徴量抽出部１２］
特徴量抽出部１２は、ノイズ無しの口述音声及びノイズ有りの口述音声それぞれについて、認識された口述テキストから及び／又は音声認識エンジン内部から、１つ以上の特徴量を抽出する。言語的特徴量は、信頼度が低いほど音声認識の誤りが高い。一方で、音声的特徴量は、音声認識の誤りの影響を受けにくい。特徴量抽出部１２自体は、図１及び図３で前述したものと全く同じものである。 [Feature Extraction Unit 12]
The feature extraction unit 12 extracts one or more features from the recognized dictation text and/or from inside the speech recognition engine, for each of the dictation voice without noise and the dictation voice with noise. Linguistic features contain more speech recognition errors as the reliability is lower. Speech features, on the other hand, are less susceptible to speech recognition errors. The feature extraction unit 12 itself is exactly the same as that described above with reference to FIGS. 1 and 3.

［特徴量差分検出部１６］
特徴量差分検出部１６は、ノイズ無しの口述音声に基づく特徴量に対して、所定閾値よりも差分が小さい、又は、差分が小さい順に所定数となる、ノイズ有りの口述音声に基づく特徴量を検出する。特徴量に対する所定閾値又は所定数は、予め設定されたものである。ここでは、ノイズ無しとノイズ有りとで、口述音声に基づく特徴量の差分が小さい、即ち、ノイズの影響を受けにくい特徴量を検出しようとしている。検出された１つ以上の特徴量は、特徴量選択エンジン１３へ出力される。 [Feature amount difference detection unit 16]
The feature difference detection unit 16 detects the features based on the dictation voice with noise whose difference from the features based on the dictation voice without noise is smaller than a predetermined threshold, or which make up a predetermined number in ascending order of difference. The predetermined threshold or predetermined number for the features is set in advance. The aim here is to detect features whose values differ little between the dictation voice without noise and the dictation voice with noise, that is, features that are not easily affected by noise. The one or more detected features are output to the feature selection engine 13.
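
The two detection criteria above (difference below a threshold, or a fixed number of features in ascending order of difference) can be sketched as follows. This assumes scalar features that have already been normalized to a comparable scale, which the description does not specify; the function name and sample values are hypothetical.

```python
def detect_robust_features(clean, noisy, threshold=None, top_n=None):
    """Return names of features whose clean/noisy difference is small.

    clean, noisy: dicts mapping feature name -> scalar value.
    threshold: keep features with |clean - noisy| < threshold, or
    top_n: keep the top_n features with the smallest difference.
    """
    if (threshold is None) == (top_n is None):
        raise ValueError("give exactly one of threshold or top_n")
    diffs = {name: abs(clean[name] - noisy[name]) for name in clean}
    if threshold is not None:
        return [name for name, d in diffs.items() if d < threshold]
    return sorted(diffs, key=diffs.get)[:top_n]


# Hypothetical normalized feature values for one answer.
clean = {"total_words": 0.80, "distinct_words": 0.70,
         "utterance_time": 0.50, "phonemes_per_unit_time": 0.60}
noisy = {"total_words": 0.55, "distinct_words": 0.40,
         "utterance_time": 0.50, "phonemes_per_unit_time": 0.58}

print(detect_robust_features(clean, noisy, threshold=0.1))
# ['utterance_time', 'phonemes_per_unit_time']
print(detect_robust_features(clean, noisy, top_n=2))
# ['utterance_time', 'phonemes_per_unit_time']
```

In this made-up example the text-derived counts shift under noise while the speech-derived values barely move, so only the latter are passed on to the selection engine together with the low reliability of the noisy recognition.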

［特徴量選択エンジン１３］
特徴量選択エンジン１３は、特徴量差分検出部１６から出力された特徴量と、音声認識エンジン１１から出力された信頼度とを対応付けて学習する。これによって、特徴量選択エンジン１３の内部に、選択モデルパラメータを構築する。特徴量選択エンジン１３自体は、図１及び図３で前述したものと全く同じであって、図４によって構築された選択モデルパラメータは、図１の採点段階及び図３の採点エンジン・学習段階における特徴量選択エンジン１３で用いられる。 [Feature selection engine 13]
The feature selection engine 13 learns by associating the features output from the feature difference detection unit 16 with the reliability output from the speech recognition engine 11. As a result, selection model parameters are constructed inside the feature selection engine 13. The feature selection engine 13 itself is exactly the same as that described above with reference to FIGS. 1 and 3, and the selection model parameters constructed according to FIG. 4 are used by the feature selection engine 13 in the scoring stage of FIG. 1 and in the scoring engine learning stage of FIG. 3.

図５は、本発明におけるシーケンス図である。 FIG. 5 is a sequence diagram in the present invention.

図５によれば、スピーキングテストの事業者が運用するサーバと、学習者が所持する端末とが、ネットワークを介して接続されている。端末としては、マイク及びディスプレイのようなユーザインタフェースを予め搭載した、スマートフォンやタブレット端末であることが好ましい。 According to FIG. 5, the server operated by the speaking test operator and the terminal possessed by the learner are connected via a network. The terminal is preferably a smartphone or a tablet terminal equipped with a user interface such as a microphone and a display in advance.

図５（ａ）によれば、学習段階は、スピーキングテストの事業者が運用するサーバによって実行され、採点段階は、学習者が所持する端末によって実行される。
サーバは、学習段階で生成した採点モデルパラメータ及び選択モデルパラメータを、端末へ送信する。
端末は、受信した採点モデルパラメータ及び選択モデルパラメータを保持し、学習者の口述音声から採点する。 According to FIG. 5A, the learning step is performed by a server operated by a speaking test provider, and the scoring step is performed by a terminal owned by the learner.
The server transmits the scoring model parameters and the selected model parameters generated in the learning stage to the terminal.
The terminal holds the received scoring model parameters and selection model parameters, and scores from the dictation voice of the learner.

図５（ｂ）によれば、学習段階及び採点段階の両方とも、スピーキングテストの事業者が運用するサーバによって実行される。
サーバは、学習段階で生成した採点モデルパラメータ及び選択モデルパラメータを保持する。
端末は、学習者の口述音声をそのまま、サーバへ送信する。
サーバは、端末から受信した口述音声から採点し、その採点結果を端末へ返信する。 According to FIG. 5B, both the learning stage and the scoring stage are executed by the server operated by the speaking test company.
The server holds the scoring model parameters and the selected model parameters generated in the learning stage.
The terminal transmits the dictation voice of the learner as it is to the server.
The server scores from the dictation voice received from the terminal, and returns the scoring result to the terminal.

以上、詳細に説明したように、本発明のプログラム、装置及び方法によれば、学習者の口述音声に対する自動的な採点精度を高めることができる。特に、会話の自由度が高いスピーキングテストであっても、事前に正解文を準備しておく必要がない。 As described above in detail, according to the program, the apparatus, and the method of the present invention, it is possible to enhance the automatic scoring accuracy of the dictation voice of the learner. In particular, there is no need to prepare a correct sentence in advance even for a speaking test with a high degree of freedom in conversation.

前述した本発明の種々の実施形態について、本発明の技術思想及び見地の範囲の種々の変更、修正及び省略は、当業者によれば容易に行うことができる。前述の説明はあくまで例であって、何ら制約しようとするものではない。本発明は、特許請求の範囲及びその均等物として限定するものにのみ制約される。 For the above-described various embodiments of the present invention, various changes, modifications, and omissions in the scope of the technical idea and viewpoint of the present invention can be easily performed by those skilled in the art. The foregoing description is merely an example, and is not intended to be limiting. The invention is limited only as defined by the following claims and equivalents thereof.

１スピーキングテストプログラム
１１音声認識エンジン
１２特徴量抽出部
１３特徴量選択エンジン
１４採点エンジン
１５発話環境合成部
１６特徴量差分検出部
1 Speaking Test Program
11 Speech Recognition Engine
12 Feature Extraction Unit
13 Feature Selection Engine
14 Scoring Engine
15 Speech Environment Synthesis Unit
16 Feature Difference Detection Unit

Claims

A speaking test program that causes a computer to function so as to receive a dictation voice spoken by a learner in a foreign language and output a scoring result, the program causing the computer to function,
in a learning stage in which teacher data associating dictation voices with scoring results is input, as:
a speech recognition engine that outputs dictation text speech-recognized from the dictation voice of the teacher data, and a reliability for the speech recognition result;
feature extraction means that extracts linguistic features from the dictation text of the teacher data and speech features from inside the speech recognition engine;
a feature selection engine that selects features so as to include more linguistic features than speech features as the reliability output from the speech recognition engine is higher; and
a scoring engine that learns by associating the features selected by the feature selection engine with the scoring results of the teacher data,
and, in a scoring stage in which the dictation voice of the learner is input, such that:
the speech recognition engine outputs dictation text speech-recognized from the dictation voice of the learner, and a reliability for the speech recognition result,
the feature extraction means extracts linguistic features from the dictation text based on the dictation voice of the learner and speech features from inside the speech recognition engine,
the feature selection engine outputs features selected according to the reliability output from the speech recognition engine, and
the scoring engine receives the features selected by the feature selection engine and outputs a scoring result based on the dictation voice of the learner.

The speaking test program according to claim 1, wherein the computer is caused to function such that the feature selection engine selects features so as to include more speech features than linguistic features as the reliability output from the speech recognition engine is lower.

The speaking test program according to claim 1 or 2, wherein, in the learning stage,
the computer is further caused to function as utterance environment synthesizing means that outputs one or more dictation voices obtained by synthesizing different utterance environment sounds with the dictation voice of the teacher data,
the speech recognition engine outputs the dictation text and reliability recognized from the dictation voice without environmental sound synthesis, and the dictation text and reliability recognized from the dictation voice with environmental sound synthesis,
the feature extraction means extracts linguistic features from the dictation text based on the dictation voice without environmental sound synthesis, speech features from inside the speech recognition engine based on the dictation voice without environmental sound synthesis, linguistic features from the dictation text based on the dictation voice with environmental sound synthesis, and speech features from inside the speech recognition engine based on the dictation voice with environmental sound synthesis,
the computer is further caused to function as feature difference detection means that detects linguistic features based on the dictation voice with environmental sound synthesis whose difference from the linguistic features based on the dictation voice without environmental sound synthesis is smaller than a predetermined threshold, or which make up a predetermined number in ascending order of difference, and that detects speech features based on the dictation voice with environmental sound synthesis whose difference from the speech features based on the dictation voice without environmental sound synthesis is smaller than a predetermined threshold, or which make up a predetermined number in ascending order of difference, and
the feature selection engine learns by associating the linguistic features and speech features output from the feature difference detection means with the reliability output from the speech recognition engine.

The speaking test program according to claim 3, wherein the utterance environment synthesizing means causes a computer to function so as to synthesize different noises as utterance environment sounds.

The speaking test program according to any one of claims 1 to 4, wherein the computer is caused to function such that
the linguistic features are one or more of the following, based on the dictation text:
the total number of words,
the number of distinct words,
the Bag-of-Words space vector,
the Bag-of-ngram space vector,
the LSA (Latent Semantic Analysis) dimension vector,
the LDA (Latent Dirichlet Allocation) dimension vector,
the distributed representation vector,
the number and/or type of grammatical errors, and
the number of words per difficulty level, and
the speech features are one or more of the following, based on the dictation voice:
the utterance time,
the number of words per unit time,
the acoustic likelihood, and
the number of phonemes per unit time.

A speaking test device that receives a dictation voice spoken by a learner in a foreign language and outputs a scoring result, the device comprising,
for a learning stage in which teacher data associating dictation voices with scoring results is input:
a speech recognition engine that outputs dictation text speech-recognized from the dictation voice of the teacher data, and a reliability for the speech recognition result;
feature extraction means that extracts linguistic features from the dictation text of the teacher data and speech features from inside the speech recognition engine;
a feature selection engine that selects features so as to include more linguistic features than speech features as the reliability output from the speech recognition engine is higher; and
a scoring engine that learns by associating the features selected by the feature selection engine with the scoring results of the teacher data,
wherein, in a scoring stage in which the dictation voice of the learner is input,
the speech recognition engine outputs dictation text speech-recognized from the dictation voice of the learner, and a reliability for the speech recognition result,
the feature extraction means extracts linguistic features from the dictation text based on the dictation voice of the learner and speech features from inside the speech recognition engine,
the feature selection engine outputs features selected according to the reliability output from the speech recognition engine, and
the scoring engine receives the features selected by the feature selection engine and outputs a scoring result based on the dictation voice of the learner.

The speaking test device according to claim 6, wherein, for the learning stage,
the device further comprises utterance environment synthesizing means that outputs one or more dictation voices obtained by synthesizing different utterance environment sounds with the dictation voice of the teacher data,
the speech recognition engine outputs the dictation text and reliability recognized from the dictation voice without environmental sound synthesis, and the dictation text and reliability recognized from the dictation voice with environmental sound synthesis,
the feature extraction means extracts linguistic features from the dictation text based on the dictation voice without environmental sound synthesis, speech features from inside the speech recognition engine based on the dictation voice without environmental sound synthesis, linguistic features from the dictation text based on the dictation voice with environmental sound synthesis, and speech features from inside the speech recognition engine based on the dictation voice with environmental sound synthesis,
the device further comprises feature difference detection means that detects linguistic features based on the dictation voice with environmental sound synthesis whose difference from the linguistic features based on the dictation voice without environmental sound synthesis is smaller than a predetermined threshold, or which make up a predetermined number in ascending order of difference, and that detects speech features based on the dictation voice with environmental sound synthesis whose difference from the speech features based on the dictation voice without environmental sound synthesis is smaller than a predetermined threshold, or which make up a predetermined number in ascending order of difference, and
the feature selection engine learns by associating the linguistic features and speech features output from the feature difference detection means with the reliability output from the speech recognition engine.

A speaking test method for a device that inputs a learner's dictation voice in a language foreign to the learner and outputs a scoring result,
The device,
As a learning stage, inputs teacher data in which dictation voices are associated with scoring results, and executes:
A first step of using a speech recognition engine to output a dictation text recognized from the dictation voice of the teacher data and a reliability of the speech recognition result;
A second step of extracting linguistic feature amounts from the dictation text based on the teacher data and speech feature amounts from inside the speech recognition engine;
A third step of using a feature selection engine to select feature amounts such that the higher the reliability output in the first step, the more linguistic feature amounts are included relative to speech feature amounts; and
A fourth step of using a scoring engine to learn the feature amounts selected in the third step in association with the scoring results of the teacher data,
And, as a grading stage, inputs the dictation voice of the learner, and executes:
A fifth step of using the speech recognition engine to output a dictation text recognized from the dictation voice of the learner and a reliability of the speech recognition result;
A sixth step of extracting linguistic feature amounts from the dictation text of the learner and speech feature amounts from inside the speech recognition engine;
A seventh step of using the feature selection engine to output feature amounts selected according to the reliability output in the fifth step; and
An eighth step of using the scoring engine to receive the feature amounts selected in the seventh step and output a scoring result for the learner.
The speaking test method, characterized in that the device executes the above steps.
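The selection rule of the third and seventh steps (higher recognition reliability means more linguistic feature amounts relative to speech feature amounts) can be illustrated with a short sketch. In the claimed method the feature selection engine is learned; the fixed proportional rule below is only a stand-in assumption, and `select_features`, the `budget` parameter, the 0-to-1 reliability scale, and all feature names are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_features(
    linguistic: Sequence[str],
    speech: Sequence[str],
    reliability: float,
    budget: int,
) -> Tuple[List[str], List[str]]:
    """Split a fixed feature budget between linguistic and speech features.

    reliability is assumed to lie in [0, 1]; the larger it is, the larger the
    share of the budget given to linguistic features.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be in [0, 1]")
    # Share of the budget for linguistic features grows with reliability.
    n_linguistic = min(len(linguistic), round(budget * reliability))
    # Whatever remains of the budget goes to speech features.
    n_speech = min(len(speech), budget - n_linguistic)
    return list(linguistic[:n_linguistic]), list(speech[:n_speech])


# Hypothetical feature names, listed in an assumed order of priority.
LINGUISTIC = ["word_count", "vocab_level", "grammar_errors", "ngram_overlap"]
SPEECH = ["speech_rate", "pause_ratio", "acoustic_score", "pitch_range"]

high = select_features(LINGUISTIC, SPEECH, reliability=0.75, budget=4)
low = select_features(LINGUISTIC, SPEECH, reliability=0.25, budget=4)
```

With a reliable transcript the scoring engine leans on what was said; with an unreliable transcript it leans on how it was said, since text-derived features would inherit the recognition errors.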

The device, in the learning stage,
Outputs one or more dictation voices obtained by synthesizing different utterance environment sounds with the dictation voice of the teacher data,
In the first step, uses the speech recognition engine to output the dictation text and reliability recognized from the dictation voice without environmental sound synthesis, and the dictation text and reliability recognized from the dictation voice with environmental sound synthesis,
In the second step,
Extracts a linguistic feature amount from the dictation text based on the dictation voice without environmental sound synthesis, a speech feature amount from inside the speech recognition engine based on the dictation voice without environmental sound synthesis, a linguistic feature amount from the dictation text based on the dictation voice with environmental sound synthesis, and a speech feature amount from inside the speech recognition engine based on the dictation voice with environmental sound synthesis, and
Detects, with respect to the linguistic feature amount based on the dictation voice without environmental sound synthesis, the linguistic feature amounts based on the dictation voice with environmental sound synthesis whose difference is smaller than a predetermined threshold or that fall within a predetermined number in ascending order of difference, and detects, with respect to the speech feature amount based on the dictation voice without environmental sound synthesis, the speech feature amounts based on the dictation voice with environmental sound synthesis whose difference is smaller than a predetermined threshold or that fall within a predetermined number in ascending order of difference,
In the third step, uses the feature selection engine to learn the linguistic feature amounts and the speech feature amounts output from the second step in association with the reliability output from the speech recognition engine in the first step. The speaking test method according to claim 8.
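The environmental sound synthesis recited in this claim can be pictured with the sketch below. The claim does not state how the environment sound is combined with the dictation voice; additive mixing at a chosen signal-to-noise ratio is an assumption made here for illustration, as are the function name `mix_environment` and the `snr_db` parameter.

```python
import math
from typing import List, Sequence


def mix_environment(
    speech: Sequence[float], noise: Sequence[float], snr_db: float
) -> List[float]:
    """Overlay an environment sound on a voice signal at the requested SNR (dB)."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the environment sound so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    power_speech = sum(s * s for s in speech) / len(speech)
    power_noise = sum(n * n for n in tiled) / len(tiled)
    if power_noise == 0.0:
        return list(speech)  # a silent environment leaves the voice unchanged
    # Scale the noise so that power_speech / scaled_noise_power = 10^(snr_db / 10).
    gain = math.sqrt(power_speech / (power_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]


# Toy signals, for illustration only.
voice = [1.0, -1.0, 1.0, -1.0]
cafe = [0.5, -0.5]
mixed = mix_environment(voice, cafe, snr_db=20.0)
```

Calling the function with several different environment sounds, or several SNR values, would yield the "one or more dictation voices" with environmental sound synthesis that the first and second steps then process alongside the original voice.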