JP5605731B2 - Voice feature amount calculation device

Info

Publication number: JP5605731B2
Application number: JP2012171737A
Authority: JP
Inventors: 秀紀劔持; 啓嘉山; 達也入山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2012-08-02
Filing date: 2012-08-02
Publication date: 2014-10-15
Anticipated expiration: 2025-11-09
Also published as: JP2012234201A

Description

The present invention relates to a technique for comparing and evaluating the intonation of a model pronunciation and the intonation of a learner's pronunciation.

In language learning, a widely used way to practice pronunciation is to play back a model voice recorded on a recording medium such as a CD (Compact Disc) and to speak in imitation of that model voice. The aim is to acquire correct pronunciation by imitating the model voice. In this kind of learning, the learner must grasp his or her own pronunciation, recognize how it differs from the model voice, and improve it accordingly. However, it is difficult for learners to judge objectively, by listening to their own voice, whether they are pronouncing the sentence in the same way as the model voice.
For this reason, techniques have been devised that let a learner grasp his or her own voice objectively, as disclosed, for example, in Patent Document 1. The language learning device disclosed in Patent Document 1 extracts speech information such as intonation from a voice that serves as a pronunciation model and from the learner's voice. It then obtains the similarity between the model voice and the learner's voice, scores the learner's pronunciation according to that similarity, and shows the scores for intonation and other items on a display. According to the technique disclosed in Patent Document 1, the learner's pronunciation is evaluated objectively and the evaluation result is displayed, so the learner can tell whether his or her own pronunciation is close to the model.

Patent Document 1: JP 2000-347560 A

Patent Document 1 discloses a method of judging the similarity of intonation by superimposing a curve showing the intonation of the model voice on a curve showing the intonation of the learner's voice, which lets learners know whether their own pronunciation is close to the model. However, the similarity alone only tells the learner whether the two are alike; it does not tell the learner how to bring his or her pronunciation closer to the model. As a result, until the learner can pronounce the sentence in a way that matches the model, he or she has to go through a laborious process of repeatedly improving and evaluating the pronunciation by trial and error.

The present invention has been made against the background described above, and its object is to provide a technique that enables a learner, in language learning, to bring the intonation of his or her voice closer to the intonation of a model voice.

The present invention provides a voice feature amount calculation device comprising: voice input means to which a voice is input; calculation means for generating a pitch curve indicating the temporal change in the pitch of the voice input to the voice input means, and for calculating the path length of the generated pitch curve as a parameter representing the magnitude of the change in intonation; and output means for outputting the path length calculated by the calculation means.
In this aspect, for an unvoiced section of the voice input to the voice input means, the calculation means may generate the pitch curve by interpolating the pitch of the unvoiced section from the pitch of the voice before and after the unvoiced section.

According to the present invention, a learner engaged in language learning can bring the intonation of his or her voice closer to the intonation of a model voice.

FIG. 1 is a diagram showing the hardware configuration of a language learning device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the format of the example sentence table TB1.
FIG. 3 is a diagram illustrating the configuration of the functional blocks realized when the CPU 102 according to the first embodiment executes a program.
FIG. 4 is a flowchart showing the flow of processing performed by the CPU 102 according to the first embodiment.
FIG. 5 is a diagram for explaining the processing of step SA6.
FIG. 6 is a diagram illustrating the configuration of the functional blocks realized when the CPU 102 according to the second embodiment executes a program.
FIG. 7 is a flowchart showing the flow of processing performed by the CPU 102 according to the second embodiment.
FIG. 8 is a diagram illustrating the pitch curve of a learner's voice and the pitch curve of a model voice.

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
[Configuration of the embodiment]
FIG. 1 is a block diagram illustrating the hardware configuration of a language learning device 1 according to an embodiment of the present invention. As shown in FIG. 1, the units of the language learning device 1 are connected to a bus 101 and exchange signals and data with one another via the bus 101.

The microphone 109 is connected to the voice processing unit 108; it converts input voice into an analog electrical signal (hereinafter referred to as a voice signal) and outputs the signal to the voice processing unit 108. The speaker 110 is connected to the voice processing unit 108 and outputs sound corresponding to the signal output from the voice processing unit 108. The voice processing unit 108 has a function of converting the voice signal input from the microphone 109 into digital data (hereinafter referred to as learner data) and outputting it, and a function of converting digital data representing a voice into an analog voice signal and outputting it to the speaker 110.

The display unit 106 includes a display device such as a liquid crystal display and, under the control of the CPU 102, displays character strings, various messages, a menu screen for operating the language learning device 1, and the like. The input unit 107 includes input devices such as a keyboard and a mouse (neither is shown) and outputs to the CPU 102 a signal corresponding to the operation performed, such as a key press or a mouse operation.

The storage unit 105 includes an HDD (Hard Disk Drive) device that stores data persistently, and stores various kinds of data. Specifically, the storage unit 105 stores the learner data output from the voice processing unit 108. The storage unit 105 also stores example sentence text data representing the example sentences used for language learning, and digital data (hereinafter referred to as model voice data) representing the voice of a native speaker reading an example sentence aloud (hereinafter referred to as a model voice). The storage unit 105 stores an example sentence table TB1 in the format illustrated in FIG. 2; this table stores, in association with one another, the example sentence text data, the file name of the model voice data, and an identifier that uniquely identifies each piece of example sentence text data.
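The association held in the example sentence table TB1 can be pictured as a small lookup structure. The following is only a sketch under stated assumptions: the class name, field names, and helper function are hypothetical, the sentence texts are placeholders, and the second row is invented for illustration. Only the pairing of identifier "001" with file name "a001" comes from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleSentence:
    """One row of the example sentence table TB1 (field names are hypothetical)."""
    identifier: str        # uniquely identifies the example sentence text data
    text: str              # example sentence text data
    model_voice_file: str  # file name of the model voice data


# Placeholder rows: only the pairing "001" -> "a001" is taken from the description.
TB1 = {
    row.identifier: row
    for row in (
        ExampleSentence("001", "<example sentence 1>", "a001"),
        ExampleSentence("002", "<example sentence 2>", "a002"),
    )
}


def model_voice_file_for(identifier: str) -> str:
    """Return the model voice file name stored for the selected example sentence."""
    return TB1[identifier].model_voice_file
```

Keying the table by identifier keeps the lookup of the model voice file for a selected sentence a single dictionary access.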

The CPU (Central Processing Unit) 102 executes a program stored in the ROM (Read Only Memory) 103, using the RAM (Random Access Memory) 104 as a work area. When the CPU 102 executes the program, it controls each unit, and a function is realized that compares the model voice with the input voice of the learner (hereinafter referred to as the learner voice) and outputs an evaluation of the intonation of the learner voice.

FIG. 3 is a functional block diagram showing the configuration of the functions realized by executing the program. The time axis correction unit 10 corrects the voice represented by the learner data so that the utterance duration of the voice represented by the model voice data and that of the voice represented by the learner data stored in the storage unit 105 become the same. The pitch extraction unit 20 divides the voice indicated by the input data into a plurality of frames at predetermined time intervals on the playback time axis, and extracts the pitch of the voice in each frame. For frames in which the pitch extraction unit 20 could not extract a pitch, such as unvoiced sections and sections in which unvoiced consonants are pronounced, the pitch interpolation unit 30 determines the pitch by interpolation, such as linear interpolation or cubic spline interpolation, between the adjacent frames. The pitch curve generation unit 40 generates a pitch curve by connecting the pitches obtained for each frame by the pitch extraction unit 20 and the pitches interpolated by the pitch interpolation unit 30. The path length calculation unit 50 calculates the path length of the pitch curve generated by the pitch curve generation unit 40.

[Operation of the embodiment]
Next, the operation of this embodiment is described. First, when the learner performs an operation instructing display of a list of example sentences, the CPU 102 reads the example sentence text data stored in the example sentence table TB1 (FIG. 4: step SA1) and displays the list of example sentences represented by the read data on the display unit 106 (step SA2). When the learner then operates the input unit 107 to select one of the displayed example sentences (step SA3; YES), the CPU 102 identifies the selected example sentence on the basis of the screen displayed on the display unit 106 and the signal sent from the input unit 107 (step SA4). Having identified the selected example sentence, the CPU 102 reads from the example sentence table TB1 the file name of the model voice data stored in association with that sentence (step SA5). For example, when the example sentence whose identifier is "001" is selected in the table shown in FIG. 2, the file name "a001" is read.

Next, the CPU 102 reads the model voice data specified by the read file name from the storage unit 105 and extracts the pitch of the voice indicated by the read data. Specifically, the CPU 102 first divides the voice indicated by the model voice data at predetermined time intervals (for example, 5 msec) on its playback time axis, as shown in FIG. 5 (step SA6); each divided section is hereinafter referred to as a frame. The CPU 102 then extracts the pitch of the voice in each frame (step SA7). The duration of one frame need not be 5 msec and may be another interval such as 10 msec. Having extracted the pitch for each frame, the CPU 102 generates a pitch curve connecting the pitches obtained for the frames (hereinafter referred to as the first pitch curve) and stores curve data representing the generated first pitch curve in the storage unit 105 (step SA8). For frames in which no pitch can be extracted, such as unvoiced sections and sections in which unvoiced consonants are pronounced, the first pitch curve is generated by interpolation such as linear interpolation or cubic spline interpolation.

When generation of the first pitch curve is complete, the CPU 102 reads the model voice data specified by the read file name from the storage unit 105 and outputs it to the voice processing unit 108 (step SA9). When the model voice data is input to the voice processing unit 108, the model voice data, which is digital data, is converted into an analog signal and output to the speaker 110, and the model voice is played back from the speaker 110.

When playback of the model voice ends, the CPU 102 controls the display unit 106 to display a message prompting the learner to pronounce the example sentence, for example, "Press the key and then speak; press the key again when you have finished" (step SA10). After listening to the model voice output from the speaker 110, the learner operates the input unit 107 as instructed by the message and reads the example sentence aloud in imitation of the model voice. When the learner speaks, the microphone 109 converts the learner's voice into a voice signal and outputs the signal to the voice processing unit 108. The voice processing unit 108 converts the voice signal received from the microphone 109 into learner data, which is digital data. The learner data is output from the voice processing unit 108 and stored in the storage unit 105.

Next, the CPU 102 monitors the signal sent from the input unit 107 and determines whether the learner has finished speaking. When the learner finishes speaking and operates the input unit 107 (step SA11; YES), the CPU 102 corrects the voice represented by the learner data so that the utterance duration of the voice represented by the model voice data and that of the voice represented by the learner data stored in the storage unit 105 become the same (step SA12).

Next, in the same way as in step SA6, the CPU 102 divides the voice indicated by the learner data into a plurality of frames on its playback time axis (step SA13) and extracts the pitch of the voice in each frame (step SA14). Having extracted the pitch for each frame, the CPU 102 generates, in the same way as in step SA8, a pitch curve connecting the pitches obtained for the frames (hereinafter referred to as the second pitch curve) and stores data representing the generated second pitch curve in the storage unit 105 (step SA15). Here too, for unvoiced sections, sections in which unvoiced consonants are pronounced, and the like, the second pitch curve is generated by interpolation such as linear interpolation or cubic spline interpolation.

When generation of the second pitch curve is complete, the CPU 102 calculates the path length of the first pitch curve and then the path length of the second pitch curve (step SA16). As shown in FIG. 8, the path length of a pitch curve is obtained by integrating the absolute value of the first derivative of the pitch curve f(t). As FIG. 8 also shows, a voice with large changes in intonation has large changes in pitch, so its pitch curve has a long path length, whereas a voice with small changes in intonation has small changes in pitch, so its pitch curve has a short path length. In other words, the path length of the pitch curve represents the magnitude of the change in intonation.
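For a pitch curve formed by joining the per-frame pitches with straight lines, the integral of the absolute value of the first derivative reduces to the sum of the absolute differences between consecutive frame pitches. The following sketch assumes that piecewise-linear case; the function name `path_length` is illustrative and does not appear in the description.

```python
from typing import Sequence


def path_length(pitch_curve: Sequence[float]) -> float:
    """Path length of a piecewise-linear pitch curve.

    On each straight segment the integral of |f'(t)| equals the absolute
    difference between the segment's end points, so the total is the sum of
    the absolute differences between consecutive frame pitches.
    """
    return sum(abs(b - a) for a, b in zip(pitch_curve, pitch_curve[1:]))
```

A flat curve therefore has a path length of zero, and rises and falls both add to the total rather than cancelling, which is why the value reflects how much the intonation moves rather than where it ends up.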

Next, the CPU 102 compares the path length of the first pitch curve with that of the second pitch curve (step SA17). Because the path length of a pitch curve represents the magnitude of the change in intonation, comparing the path lengths shows whether the model voice or the learner voice has the larger change in intonation. When, as shown in FIG. 8, the path length of the first pitch curve is longer than that of the second pitch curve, the change in intonation of the learner voice is smaller than that of the model voice; the CPU 102 therefore displays a message such as "Your speech has little variation in intonation" on the display unit 106, giving the learner information useful for improving the pronunciation (step SA18). When the path length of the second pitch curve is longer than that of the first, the change in intonation of the learner voice is larger than that of the model voice, so the CPU 102 displays a message such as "The variation in intonation is too large" on the display unit 106, again giving the learner information useful for improving the pronunciation (step SA18). When the two path lengths are the same, the change in intonation of the learner voice equals that of the model voice, so the CPU 102 displays a message such as "Good pronunciation" on the display unit 106 (step SA18).
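The three-way comparison of step SA17 and the message selection of step SA18 might be sketched as below. The function name and the `tolerance` parameter are assumptions: the description speaks only of the two path lengths being "the same", and a tolerance is introduced here because two measured values will rarely be exactly equal. With the default of 0.0 the behaviour matches the strict comparison in the text.

```python
def intonation_feedback(model_length: float,
                        learner_length: float,
                        tolerance: float = 0.0) -> str:
    """Choose a feedback message from the two pitch-curve path lengths.

    model_length   -- path length of the first pitch curve (model voice)
    learner_length -- path length of the second pitch curve (learner voice)
    tolerance      -- hypothetical margin within which the two are treated
                      as equal (not part of the description)
    """
    if abs(model_length - learner_length) <= tolerance:
        return "Good pronunciation"
    if model_length > learner_length:
        # The learner's intonation changes less than the model's.
        return "Your speech has little variation in intonation"
    # The learner's intonation changes more than the model's.
    return "The variation in intonation is too large"
```

Because the message states the direction of the difference, the learner is told whether to vary the pitch more or less, which a bare similarity score does not convey.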

When output of the message to the learner is complete, the CPU 102 controls the display unit 106 to display a menu screen asking whether to practice the selected example sentence again or to practice another example sentence (step SA19). When the learner operates the input unit 107 to indicate that another example sentence is to be practiced (step SA20; YES), the CPU 102 returns the processing to step SA1 and executes the processing from step SA1 onward again. When the learner operates the input unit 107 to indicate that the selected example sentence is to be practiced again (step SA21; YES), the CPU 102 returns the processing to step SA6 and executes the processing from step SA6 onward again.

As described above, according to this embodiment, the learner is told specifically how the pronunciation should be improved, so the learner can bring his or her pronunciation closer to the model voice without repeatedly improving and evaluating the pronunciation by trial and error.

[Second Embodiment]
[Configuration of the embodiment]
Next, a second embodiment of the present invention is described. The hardware configuration of the language learning device 1A according to this embodiment is the same as that of the language learning device 1 according to the first embodiment, so its description is omitted. This embodiment differs from the first embodiment in the functions realized when the CPU 102 executes the program.

FIG. 6 is a functional block diagram showing the configuration of the functions realized when the CPU 102 executes the program. In FIG. 6, functional blocks that are the same as in the first embodiment are given the same reference numerals as in FIG. 3, and their description is omitted. The Hz→Cent conversion unit 60 converts a pitch expressed in Hz into cents. The filter unit 70 functions as a low-pass filter and removes minute changes in pitch.
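The two additional blocks can be sketched as follows. The conversion uses the standard definition of the cent, 1200 times the base-2 logarithm of a frequency ratio. The reference frequency of 440 Hz is an assumption, since the description does not name one. The description also does not specify the type of low-pass filter, so a centred moving average stands in for it here purely as an illustration; the function names and the window size are likewise hypothetical.

```python
import math
from typing import List, Sequence


def hz_to_cent(frequency_hz: float, reference_hz: float = 440.0) -> float:
    """Convert a pitch in Hz to cents relative to reference_hz (assumed 440 Hz)."""
    return 1200.0 * math.log2(frequency_hz / reference_hz)


def moving_average(values: Sequence[float], window: int = 5) -> List[float]:
    """Smooth a per-frame pitch sequence with a centred moving average.

    This is a stand-in for the unspecified low-pass filter. Near the ends the
    window is shortened so that the output has the same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

Changing the reference frequency shifts every value in cents by the same constant, so it does not affect the differences between consecutive frames and therefore does not affect the path length of the resulting pitch curve.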

[Operation of the embodiment]
Next, the operation of this embodiment is described. FIG. 7 is a flowchart illustrating the flow of processing performed by the CPU 102 in this embodiment. In FIG. 7, processing that is the same as in the first embodiment is given the same reference signs as in the first embodiment.

When the learner selects an example sentence, the CPU 102 reads the model voice data corresponding to the selected sentence from the storage unit 105 (steps SA1 to SA5). It then divides the voice indicated by the model voice data at predetermined time intervals on its playback time axis (step SA6) and extracts the pitch of the voice in each frame (step SA7). Having extracted the pitch for each frame, the CPU 102 converts the unit of the extracted pitch from Hz to cents (step SB1). When the conversion is complete, the CPU 102 removes, for each frame, minute changes in pitch such as fluctuations in the utterance (step SB2). The CPU 102 then generates a pitch curve connecting the pitches obtained for the frames (the first pitch curve) and stores curve data representing the generated first pitch curve in the storage unit 105 (step SA8). For frames in which no pitch can be extracted, such as unvoiced sections and sections in which unvoiced consonants are pronounced, the first pitch curve is generated by interpolation such as linear interpolation or cubic spline interpolation.

After this, the CPU 102 displays a message prompting the learner to pronounce the example sentence (step SA10). When the learner reads the example sentence aloud in imitation of the model voice, the learner voice is converted into learner data. The CPU 102 monitors the signal sent from the input unit 107, and when the learner finishes speaking and operates the input unit 107 (step SA11; YES), it corrects the voice represented by the learner data so that the utterance duration of the voice represented by the model voice data and that of the voice represented by the learner data stored in the storage unit 105 become the same (step SA12).

Next, in the same way as in step SA6, the CPU 102 divides the voice indicated by the learner data into a plurality of frames on its playback time axis (step SA13) and extracts the pitch of the voice in each frame (step SA14). Having extracted the pitch for each frame, the CPU 102 converts the unit of the extracted pitch from Hz to cents (step SB3). When the conversion is complete, the CPU 102 removes, for each frame, minute changes in pitch such as fluctuations in the utterance (step SB4). The CPU 102 then generates a pitch curve connecting the pitches obtained for the frames (the second pitch curve) and stores curve data representing the generated second pitch curve in the storage unit 105 (step SA15). For frames in which no pitch can be extracted, such as unvoiced sections and sections in which unvoiced consonants are pronounced, the second pitch curve is generated by interpolation such as linear interpolation or cubic spline interpolation.
The flow of processing after step SA15 is the same as in the first embodiment, so its description is omitted.

As described above, in this embodiment too, the learner is told specifically how the pronunciation should be improved, so the learner can bring his or her pronunciation closer to the model voice without repeatedly improving and evaluating the pronunciation by trial and error. In addition, because pitch is compared in units of cents, the evaluation is closer to human auditory perception.

[Modifications]
Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments and can be implemented in various other forms. For example, the embodiments described above may be modified as follows.

In the embodiments described above, the processing of step SA12, that is, the processing that matches the utterance duration of the model voice with that of the learner voice, may be omitted. Alternatively, the processing of step SA12 may make the duration of each word in the model voice and the duration of each word in the learner voice the same, or may make the duration of each phoneme in the model voice and the duration of each phoneme in the learner voice the same.
When the path lengths of the first pitch curve and the second pitch curve are compared, the time interval compared may be the whole example sentence, a predesignated part of the example sentence, or a part designated by the user.
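Restricting the comparison to part of the sentence amounts to computing the path length over the frames that fall inside the chosen interval. The sketch below assumes a piecewise-linear pitch curve sampled once per frame and a 5 msec frame period taken from the example in the description; the function name, the parameter names, and the rounding of times to the nearest frame are assumptions.

```python
from typing import Sequence


def path_length_between(pitch_curve: Sequence[float],
                        start_sec: float,
                        end_sec: float,
                        frame_sec: float = 0.005) -> float:
    """Path length of the pitch curve between two times on the playback axis.

    Times are rounded to the nearest frame and clamped to the curve, so an
    interval that covers the whole utterance gives the full path length.
    """
    first = max(0, int(round(start_sec / frame_sec)))
    last = min(len(pitch_curve) - 1, int(round(end_sec / frame_sec)))
    segment = pitch_curve[first:last + 1]
    return sum(abs(b - a) for a, b in zip(segment, segment[1:]))
```

Applying the same interval to both the first and the second pitch curve keeps the two values comparable, provided the utterance durations have been aligned beforehand.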

In the embodiments described above, the path length of the first pitch curve may be compared with that of the second pitch curve and, depending on the result, a point for improving the pronunciation may be displayed, for example, "Try speaking with a little more intonation." The embodiments may also, for example, recognize the words in the voice and display an intonation evaluation for each word.

In the embodiments described above, a pitch curve may be generated for the learner's voice only, the path length of that pitch curve obtained, and the obtained path length output.

DESCRIPTION OF SYMBOLS
1 ... Language learning device, 101 ... Bus, 102 ... CPU, 103 ... ROM, 104 ... RAM, 105 ... Storage unit, 106 ... Display unit, 107 ... Input unit, 108 ... Voice processing unit, 109 ... Microphone, 110 ... Speaker

Claims

1. A voice feature amount calculation device comprising:
voice input means to which a voice is input;
calculation means for generating a pitch curve indicating the temporal change in the pitch of the voice input to the voice input means, and for calculating the path length of the generated pitch curve as a parameter representing the magnitude of the change in intonation; and
output means for outputting the path length calculated by the calculation means.

2. The voice feature amount calculation device according to claim 1, wherein, for an unvoiced section of the voice input to the voice input means, the calculation means generates the pitch curve by interpolating the pitch of the unvoiced section from the pitch of the voice before and after the unvoiced section.