JP4654889B2

JP4654889B2 - Playback device

Info

Publication number: JP4654889B2
Application number: JP2005333324A
Authority: JP
Inventors: 直博江本
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2005-11-17
Filing date: 2005-11-17
Publication date: 2011-03-23
Anticipated expiration: 2025-11-17
Also published as: JP2007140079A

Description

The present invention relates to a technique for showing a learner the difference between a model sound and a sound produced by the learner.

In language learning, a widely used method for practicing pronunciation is to play back a model voice recorded on a recording medium such as a CD (Compact Disc) and to pronounce by imitating that model voice. The aim is to acquire correct pronunciation through imitation. In this kind of learning, the learner must grasp his or her own pronunciation objectively, recognize how it differs from the model voice, and improve it. However, it is difficult for learners to judge objectively, by listening to themselves, whether they are pronouncing in the same way as the model voice. For this reason, techniques have been devised that let a learner grasp his or her own voice objectively, as disclosed, for example, in Patent Document 1. The language learning device disclosed in Patent Document 1 extracts vowels from the learner's voice and graphs the pitch and utterance time of the extracted vowels.
Patent Document 1: JP-A-5-232856

It is said that, to pronounce correctly, it is important to match dynamic changes in pronunciation, such as pitch and rhythm, to the model voice. With the technique disclosed in Patent Document 1, static data such as vowel pitch, utterance time, and pronunciation timing are graphed, so the learner can grasp his or her own voice objectively. However, only static data are displayed, and dynamic changes in pronunciation, such as pronunciation rhythm, are hard to grasp. The technique of Patent Document 1 therefore has the problem that it is difficult for the learner to grasp the dynamic differences in pronunciation rhythm between the learner's voice and the model voice, and to find which points should be improved.

The present invention has been made against the background described above, and its object is to enable a learner to grasp the difference between the rhythm of the learner's voice and the rhythm of the model voice.

To solve the problem described above, the present invention provides a playback device comprising: storage means for storing sound data corresponding to a sound to be pronounced, divided into predetermined sections; reproduction means for reading the sound data from the storage means and reproducing it; collected-sound data generating means for generating data corresponding to collected sound as collected-sound data; position adding means for recognizing, in the collected-sound data, the portions corresponding to the section delimiters of the sound data and adding the corresponding delimiters; display means for displaying a scale formed by joining display sections, each having a display width corresponding to the time length of one of the sections; first display control means for displaying, in synchronization with the reproduction by the reproduction means, the playback position within each section of the sound data against the corresponding display section of the scale; collected-sound data reproduction means for reproducing the collected-sound data; and second display control means for displaying, in synchronization with the collected-sound data reproduction means, the playback position within each section of the collected-sound data, as delimited by the position adding means, against the corresponding display section of the scale.

In this aspect, the storage means may store a plurality of different pieces of sound data, the device may have selection means for selecting one of the pieces of sound data stored in the storage means, and the reproduction means may read the sound data selected by the selection means from the storage means and reproduce it.

According to the present invention, the learner can grasp the difference between the rhythm of the learner's voice and the rhythm of the model voice.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

[Configuration of the embodiment]
FIG. 1 is a block diagram illustrating the hardware configuration of a language learning device 1 according to an embodiment of the invention. As shown in FIG. 1, each unit of the language learning device 1 is connected to a bus 101, and the units exchange signals and data via the bus 101.

The microphone 109 is connected to the audio processing unit 108, converts input sound into an electrical signal (hereinafter referred to as an audio signal), and outputs it to the audio processing unit 108. The speaker 110 is connected to the audio processing unit 108 and outputs sound corresponding to the signal output from the audio processing unit 108. The audio processing unit 108 has a function of converting the audio signal input from the microphone 109 into digital data and outputting it, and a function of converting digital data representing audio into an analog audio signal and outputting it to the speaker 110. Because the voice of the learner using the language learning device 1 is input to the microphone 109, the digital data into which the audio processing unit 108 converts the audio signal is referred to as learner data, and the voice input to the microphone 109 (the voice represented by the learner data) is referred to as the learner voice.

The display unit 106 includes a display device such as a liquid crystal display and, under the control of the CPU 102, displays character strings, various messages, a menu screen for operating the language learning device 1, and the like. The input unit 107 includes input devices such as a keyboard and a mouse (neither shown) and outputs to the CPU 102 a signal corresponding to the operation performed, such as a key press or a mouse operation.

The storage unit 105 includes an HDD (Hard Disk Drive) and stores various data. Specifically, it stores the learner data output from the audio processing unit 108. It also stores example sentence text data representing the example sentences used for language learning, and digital data (hereinafter referred to as model voice data) representing the voice of a native speaker reading each example sentence aloud (hereinafter referred to as the model voice). The storage unit 105 stores an example sentence table TB1 in the format illustrated in FIG. 2. This table associates the example sentence text data, the file name of the model voice data, and an identifier that uniquely identifies each piece of example sentence text data.

In addition, the storage unit 105 stores, for each stored example sentence, a pitch table TB2 in the format illustrated in FIG. 3. In the pitch table TB2, the identifier field stores an identifier that uniquely identifies the example sentence text data. This identifier is the same as the one stored in the example sentence table TB1.
The frame number field stores a number indicating each frame obtained when the model voice of the example sentence specified by the identifier is divided on the time axis into frames of a predetermined length. For example, with 1 frame = 100 msec, if the model voice reads the example sentence "One centimeter is ten millimeters." in 2.6 seconds, the voice is divided into 26 frames, as shown in FIG. 4. The pitch table TB2 stores the numbers 1 to 26 indicating these 26 frames.
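The frame division described above can be sketched in a few lines of Python. This is only an illustration: the function name `frame_numbers` and the choice to express durations in integer milliseconds are assumptions, not details taken from the specification.

```python
def frame_numbers(duration_ms: int, frame_ms: int = 100) -> list[int]:
    """Return the 1-based frame numbers covering a recording of duration_ms.

    A final partial frame still counts as a frame (integer ceiling division).
    """
    count = -(-duration_ms // frame_ms)
    return list(range(1, count + 1))


# The 2.6-second model voice from the text, cut into 100 msec frames.
frames = frame_numbers(2600)
print(len(frames), frames[0], frames[-1])
```

With the values from the text (2.6 seconds, 100 msec per frame) this yields frame numbers 1 through 26.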

The pitch field stores pitch data indicating the pitch of the model voice at each frame. For example, as shown in FIG. 4, if the pitch of the model voice at the third frame on the time axis is X, then "X" is stored in the row containing frame number "3", as shown in FIG. 3.
The word field stores data indicating which frame corresponds to which word in the example sentence. For example, in "One centimeter is ten millimeters.", if the first and second frames of the model voice correspond to the word "One", the character string "One" is stored in the rows containing frame numbers "1" and "2", as shown in FIG. 3.
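One way to picture the pitch table TB2 is as a list of rows, each holding an identifier, a frame number, a pitch value, and a word. The sketch below is hypothetical: the `PitchRow` type, the helper `frames_for_word`, and the numeric pitch values are invented for illustration, and the specification does not say how the table is actually encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PitchRow:
    identifier: str  # example sentence identifier, e.g. "001"
    frame: int       # 1-based frame number
    pitch_hz: float  # pitch at this frame (values below are made up)
    word: str        # word of the example sentence this frame belongs to


# First rows for "One centimeter is ten millimeters." (illustrative values only).
TB2 = [
    PitchRow("001", 1, 180.0, "One"),
    PitchRow("001", 2, 175.0, "One"),
    PitchRow("001", 3, 190.0, "centimeter"),
    PitchRow("001", 4, 185.0, "centimeter"),
]


def frames_for_word(table, identifier, word):
    """Return the frame numbers whose word field matches `word`."""
    return [r.frame for r in table
            if r.identifier == identifier and r.word == word]


print(frames_for_word(TB2, "001", "One"))
```

Looking up "One" returns frames 1 and 2, matching the example in the text.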

The storage unit 105 also stores a word table TB3 in the format illustrated in FIG. 5. This table associates an identifier indicating an example sentence, the text data of each word contained in that example sentence, and the utterance start time of each word.

A CPU (Central Processing Unit) 102 executes a program stored in a ROM (Read Only Memory) 103, using a RAM (Random Access Memory) 104 as a work area.

[Operation of the embodiment]
Next, the operation of this embodiment will be described.
First, when the learner performs an operation instructing display of the list of example sentences, the CPU 102 reads the example sentence text data stored in the example sentence table TB1 (FIG. 6: step SA1) and displays a list of the example sentences represented by the read data on the display unit 106 (step SA2). When the learner then operates the input unit 107 to select one of the displayed example sentences (step SA3; YES), the CPU 102 identifies the selected example sentence based on the screen displayed on the display unit 106 and the signal sent from the input unit 107 (step SA4). Having identified the selected example sentence, the CPU 102 reads from the example sentence table TB1 the file name of the model voice data stored in association with it (step SA5). For example, in the table shown in FIG. 2, when the example sentence with the identifier "001" is selected, the file name "a001" is read.

Next, the CPU 102 reads the model voice data specified by the read file name from the storage unit 105 and outputs it to the audio processing unit 108 (step SA6). The CPU 102 also controls the display unit 106 to display the selected example sentence and the pitch of its model voice (step SA7). Here, because the example sentence with the identifier "001" has been selected, the example sentence text data stored in the example sentence table TB1 in association with the identifier "001" is read, and the pitch data is read from the pitch table TB2 in which the identifier "001" is stored. Then, as shown in FIG. 7, the selected example sentence is displayed according to the read example sentence text data, and the pitch of its model voice is displayed according to the read pitch data.
In FIG. 7, the displayed length of the example sentence and of the pitch corresponds to the total playback time of the model voice data. The left ends of the example sentence and of the pitch display indicate the playback start point of the model voice, and their right ends indicate its playback end point. The pitch is displayed as a waveform connecting the pitches of the individual frames.

When the model voice data is input to the audio processing unit 108, the model voice data, which is digital data, is converted into an analog signal and output to the speaker 110, and the model voice is played back from the speaker 110. When playback of the model voice starts, the CPU 102 controls the display unit 106 to display a bar 10 indicating the portion of the displayed example sentence that is currently being played back (step SA8). For example, while the audio of frame "3" (200 msec to 300 msec after the start of the utterance) is being played back, the CPU 102 displays the bar 10 so that it moves across the portion where the pitch of frame "3" is displayed over 100 msec, as shown in FIG. 7(a). While the audio of frame "4" (300 msec to 400 msec after the start of the utterance) is being played back, it displays the bar 10 so that it moves across the portion where the pitch of frame "4" is displayed over 100 msec, as shown in FIG. 7(b). In this way, the horizontal display position of the bar 10 is controlled in synchronization with the number of the frame being played back.
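As a minimal sketch of how the bar position could follow model playback, the code below maps elapsed playback time to a frame number and to a horizontal position. The names `current_frame` and `bar_x`, and the idea of measuring the display width in pixels, are assumptions made for this illustration; the specification only states that the bar crosses each frame's display portion in 100 msec.

```python
def current_frame(elapsed_ms: float, frame_ms: int = 100) -> int:
    """1-based number of the frame being played at elapsed_ms."""
    return int(elapsed_ms // frame_ms) + 1


def bar_x(elapsed_ms: float, total_ms: float, width_px: float) -> float:
    """Horizontal bar position during model playback.

    The display width represents the whole playback time, so the bar
    advances linearly; the position is clamped to [0, width_px].
    """
    fraction = min(max(elapsed_ms / total_ms, 0.0), 1.0)
    return fraction * width_px


# 250 msec into a 2.6-second model voice shown on a 520-pixel-wide scale.
print(current_frame(250), bar_x(250, 2600, 520))
```

At 250 msec the bar is inside frame 3, in line with the 200 msec to 300 msec range given in the text.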

When playback of the model voice ends, the CPU 102 controls the display unit 106 to display a message prompting the learner to pronounce the example sentence, for example, "Press a key, then speak, and press the key again when you have finished" (step SA9). After listening to the model voice output from the speaker 110, the learner operates the input unit 107 according to the message and reads the example sentence aloud, imitating the model voice. The learner's voice is converted into an audio signal by the microphone 109, and the signal is output to the audio processing unit 108. On receiving the audio signal from the microphone 109, the audio processing unit 108 converts it into learner data, which is digital data. The learner data is output from the audio processing unit 108 and stored in the storage unit 105.

When the learner finishes speaking and operates the input unit 107 (step SA10; YES), the CPU 102 adjusts the length of the voice represented by the learner data, processing the learner data so that the length of the learner voice equals the length of the model voice represented by the model voice data (step SA11). FIG. 8 illustrates the waveform of the model voice and the waveform of the learner voice input to the microphone 109. In FIG. 8, both waveforms show the same example sentence being uttered, but because the speaking rates differ, the waveforms differ in length. The CPU 102 analyzes the model voice data and the learner data and obtains the difference between the length of the model voice and the length of the learner voice (Δt in FIG. 8). If, as shown in FIG. 8, the learner voice is longer than the model voice by Δt, the CPU 102 shortens the learner voice by Δt.
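The specification does not say how the learner voice is shortened or lengthened. Purely to illustrate the idea of step SA11, the sketch below resamples a list of samples to a target length by linear interpolation. This is an assumed stand-in technique: plain resampling also shifts the pitch, so an actual device would more plausibly use a pitch-preserving time-stretch. The function name `match_length` is hypothetical.

```python
def match_length(samples: list[float], target_len: int) -> list[float]:
    """Resample `samples` to exactly target_len points by linear interpolation.

    Assumed illustration only; not the method stated in the specification.
    """
    if target_len <= 0 or not samples:
        return []
    if target_len == 1 or len(samples) == 1:
        return [float(samples[0])] * target_len
    step = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step                       # fractional index into the input
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


learner = [0.0, 1.0, 2.0, 3.0, 4.0]   # longer than the model
print(match_length(learner, 3))        # shortened to the model's length
```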

Next, the CPU 102 divides the waveform of the model voice and the waveform of the learner voice into a plurality of frames at a predetermined time interval (100 msec), as shown in FIG. 9. It then associates the audio waveform of each frame of the model voice with the audio waveform of each frame of the learner voice using the DP (Dynamic Programming) matching method (step SA12). For example, in the waveforms illustrated in FIG. 9, frame A1 of the model voice is associated with frame B1 of the learner voice, and frame A3 of the model voice is associated with frame B4 of the learner voice.
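DP matching of two frame sequences is commonly implemented as dynamic time warping. The following sketch shows that general technique under stated assumptions: each frame is reduced to a single number (the specification does not say which features or distance measure are used), and the function name `dp_match` and the sample values are invented for the example.

```python
def dp_match(model: list[float], learner: list[float]) -> list[tuple[int, int]]:
    """Align two per-frame feature sequences by dynamic time warping.

    Returns the alignment path as 0-based (model_index, learner_index) pairs.
    """
    n, m = len(model), len(learner)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning the first i model
    # frames with the first j learner frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(model[i - 1] - learner[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end to recover which frames were paired.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


# The learner holds the first sound one frame longer than the model does.
path = dp_match([1.0, 1.0, 5.0, 5.0], [1.0, 1.0, 1.0, 5.0, 5.0])
print(path)
```

In this toy input the third model frame (index 2) is paired with the fourth learner frame (index 3), which mirrors the A3 to B4 association in FIG. 9.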

When the association between the model voice and the learner voice is complete, the CPU 102 divides each audio waveform word by word (step SA13). Specifically, for the model voice, it first reads the utterance start times from the word table TB3. Because the example sentence selected by the learner is "One centimeter is ten millimeters.", the utterance start time "0.0 sec" of "One" is read from the word table TB3 first. As shown in FIG. 9, the CPU 102 adds information indicating a word break (hereinafter referred to as word break information C) to the frame at the "0.0 sec" position of the audio waveform (frame A1). Next, the CPU 102 reads the utterance start time "0.2 sec" of "centimeter" from the word table TB3 and adds word break information C to the frame corresponding to the position 0.2 sec after the start of the utterance (frame A3).
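The step from an utterance start time in the word table TB3 to the frame that receives word break information C can be sketched as follows. The table layout, the names `frame_for_time` and `word_break_frames`, and the decision to list only the two words whose start times appear in the text are assumptions for illustration.

```python
def frame_for_time(start_sec: float, frame_ms: int = 100) -> int:
    """1-based number of the frame that contains the given start time."""
    return int(round(start_sec * 1000)) // frame_ms + 1


# Hypothetical rendering of TB3 for identifier "001". Only the two start
# times given in the text are listed; the remaining words are left out.
TB3 = {"001": [("One", 0.0), ("centimeter", 0.2)]}


def word_break_frames(entries, frame_ms: int = 100) -> list[int]:
    """Frames of the model voice that receive word break information C."""
    return sorted({frame_for_time(start, frame_ms) for _, start in entries})


print(word_break_frames(TB3["001"]))
```

The start times 0.0 sec and 0.2 sec map to frames 1 and 3, that is, frames A1 and A3 in FIG. 9.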

After adding word break information C to the model voice up to the last word, "millimeters", the CPU 102 adds word break information to the learner voice. First, the CPU 102 extracts the frames of the model voice to which word break information has been added. It then identifies the corresponding frames in the learner voice and adds word break information C to them. For example, when frame A1, to which word break information C has been added, is extracted, the CPU 102 identifies frame B1, which was associated with frame A1 in step SA12, and adds word break information C to frame B1. Likewise, when frame A3 is extracted, the CPU 102 identifies frame B4, which was associated with frame A3 in step SA12, and adds word break information C to frame B4.
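Carrying the word breaks over from the model voice to the learner voice amounts to a lookup in the frame alignment. The sketch below assumes the alignment is available as 0-based (model, learner) index pairs and that, when one model frame is paired with several learner frames, the earliest one is taken. Both are assumptions; the name `transfer_breaks` is hypothetical.

```python
def transfer_breaks(path, model_breaks):
    """Map 1-based model break frames to 1-based learner frames.

    `path` is an alignment given as 0-based (model_index, learner_index) pairs.
    """
    first_match = {}
    for a, b in path:
        first_match.setdefault(a, b)  # keep the earliest learner frame
    return [first_match[a - 1] + 1
            for a in sorted(model_breaks) if a - 1 in first_match]


# Alignment in which model frame A1 pairs with B1 and A3 pairs with B4.
alignment = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4)]
print(transfer_breaks(alignment, [1, 3]))
```

With breaks on model frames 1 and 3, the learner voice receives breaks on frames 1 and 4, as in the A1/B1 and A3/B4 example.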

Having added word break information to the frames and divided the audio waveforms word by word, the CPU 102 calculates, for the learner voice, the pitch and utterance time of each word (step SA14). For example, for "One" in the learner voice, the audio waveforms of frames B1 to B3 are extracted as the waveform representing the pronunciation of "One". The extracted waveform is then analyzed, and the pitch of the voice and the utterance time of the word are calculated.
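The utterance time of each word follows directly from the break frames. The sketch below covers only that part of step SA14 (pitch analysis is not shown), and it assumes that a word lasts from its break frame up to the frame before the next break, with the last word running to the final frame. The function name `utterance_times_ms` and the six-frame total are invented for the example.

```python
def utterance_times_ms(break_frames, total_frames, frame_ms=100):
    """Duration of each word, given the 1-based frame where each word starts."""
    ends = list(break_frames[1:]) + [total_frames + 1]
    return [(end - start) * frame_ms for start, end in zip(break_frames, ends)]


# Learner voice with word breaks on frames B1 and B4, six frames in total.
print(utterance_times_ms([1, 4], 6))
```

The first word spans frames B1 to B3, which gives 300 msec.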

Next, the CPU 102 outputs the learner data stored in the storage unit 105 to the audio processing unit 108 (step SA15). The CPU 102 also controls the display unit 106 to display the selected example sentence and the pitch of the learner's voice obtained in step SA14, in the same manner as when the pitch of the model voice was displayed (step SA16). Here, the horizontal display length of each word of the example sentence and of the pitch is made the same as the display length used when the pitch of the model voice was displayed.
On the displayed screen, the left ends of the example sentence and of the pitch display indicate the playback start point of the model voice, and their right ends indicate the playback end point of the model voice.

When the learner data is input to the audio processing unit 108, the learner data, which is digital data, is converted into an analog signal and output to the speaker 110, and the learner's voice is played back from the speaker 110. When playback of the learner's voice starts, the CPU 102 controls the display unit 106 to display a bar indicating the portion of the displayed example sentence that is currently being played back (step SA17).

Here, if the learner's utterance time for "One" is the same as that of the model voice, the CPU 102 displays the bar 10 so that it moves across the pitch display section of "One" at the same speed as when the model voice was played back. That is, if the learner's utterance time for "One" is 200 msec, the same as in the model voice, the bar 10 moves across the pitch display section of "One" over 200 msec. If, on the other hand, the learner's utterance time for "One" differs from that of the model voice and is, for example, 300 msec, the bar 10 moves across the pitch display section of "One" over 300 msec. The bar 10 then moves through the "One" section more slowly than it did during playback of the model voice, so a learner watching it can tell that his or her utterance of "One" is longer than that of the model voice. Conversely, if the learner's utterance time for "One" is shorter than that of the model voice, for example 100 msec, the bar 10 moves across the pitch display section of "One" over 100 msec.
In that case the bar 10 moves through the "One" section faster than it did during playback of the model voice, so a learner watching it can tell that his or her utterance of "One" is shorter than that of the model voice.
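The behavior described above is a piecewise-linear mapping: each word keeps the display width it had for the model voice, and the bar crosses that width in the time the learner took to say the word. The sketch below is one possible reading of it. The name `learner_bar_x`, the `px_per_ms` scale factor, and the word durations other than the 200 msec and 300 msec for "One" are assumptions, and playback time is assumed to be non-negative.

```python
def learner_bar_x(t_ms, model_ms, learner_ms, px_per_ms=1.0):
    """Bar position on the model-based scale at learner playback time t_ms.

    model_ms   -- per-word durations in the model voice (set the section widths)
    learner_ms -- per-word durations in the learner voice (set the bar speed)
    """
    x_start = 0.0  # left edge of the current word's display section
    t_start = 0.0  # learner playback time at which the current word starts
    for m_dur, l_dur in zip(model_ms, learner_ms):
        width = m_dur * px_per_ms
        if t_ms < t_start + l_dur:
            # Inside this word: interpolate across its display section.
            return x_start + (t_ms - t_start) / l_dur * width
        x_start += width
        t_start += l_dur
    return x_start  # playback finished: the bar rests at the right end


# "One" takes 200 msec in the model voice but 300 msec for the learner.
model = [200, 800]
learner = [300, 600]
print(learner_bar_x(150, model, learner))
```

Halfway through the learner's 300 msec "One" the bar is halfway across the 200-unit-wide "One" section, so it visibly moves more slowly than it did for the model voice.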

For the other words as well, the CPU 102 moves and displays the bar 10 according to the utterance times calculated in step SA14. When playback of the learner's voice ends, the CPU 102 controls the display unit 106 to display a menu screen asking whether to practice the selected example sentence again or to practice another example sentence (step SA18). If the learner operates the input unit 107 to choose to practice another example sentence (step SA19; YES), the CPU 102 returns to step SA1 and executes the processing from step SA1 onward again. If the learner operates the input unit 107 to choose to practice the selected example sentence again (step SA20; YES), the CPU 102 returns to step SA9 and executes the processing from step SA9 onward again.

As described above, according to this embodiment, the dynamic change of the pronunciation rhythm in the model voice and the dynamic change of the pronunciation rhythm in the learner's voice are both displayed, so the learner can grasp how the pronunciation rhythm of his or her own voice differs from that of the model voice. Being able to grasp this difference makes it easier to bring one's pronunciation closer to the model voice.

[Modification]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various other forms, as follows.

In the embodiment described above, when the bar indicating the playback position of the learner's voice is displayed, a bar indicating the playback position of the model voice may be displayed at the same time. In this way, the learner can see concretely how much earlier or later he or she is speaking relative to the model voice. In addition, when the pitch of the learner's voice is displayed, the pitch of the model voice may be displayed superimposed on it, as shown in FIG. 10.

In the embodiment described above, the language learning device 1 may store and play back example sentences and model voices not only in English but also in other languages, such as French and German.

In the embodiment described above, the pitches of the model voice and of the learner's voice are displayed, but the pitch may be omitted and the bar 10 displayed over the displayed example sentence instead. A time axis may also be displayed in place of the pitch.

In the embodiment described above, the bar 10 indicates the portion of the voice being played back; however, the playback position may instead be indicated by changing the color of the portion already played back to a color different from that of the portion not yet played back.
In the embodiment described above, the time interval of one frame is 100 msec, but it may be a value other than 100 msec.
In the embodiment described above, a delimiter is displayed for each word when an example sentence is displayed, but a delimiter may instead be displayed for each phoneme.
In the embodiment described above, the temporal change in voice power may be analyzed and displayed in place of the pitch.
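The following Python sketch illustrates one way the temporal change in voice power could be obtained with a configurable frame interval. The specification does not state how power is computed; the mean-square value in decibels used here, as well as the function name, the `frame_ms` parameter, and the `floor_db` value for silent frames, are assumptions.

```python
import math


def frame_power_db(samples, sample_rate, frame_ms=100.0, floor_db=-120.0):
    """Return one power value (dB) per frame of the given sample sequence.

    samples: sequence of amplitude values (e.g. floats in [-1.0, 1.0]).
    sample_rate: samples per second.
    frame_ms: frame interval in milliseconds; 100 msec matches the embodiment,
              but any other positive value may be used.
    floor_db: value reported for completely silent frames (an assumption).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    powers = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Mean-square amplitude of the frame (the last frame may be shorter).
        mean_square = sum(x * x for x in frame) / len(frame)
        if mean_square > 0.0:
            powers.append(10.0 * math.log10(mean_square))
        else:
            powers.append(floor_db)
    return powers


# Hypothetical signal: 100 msec of full-scale samples, then 100 msec of silence.
signal = [1.0] * 800 + [0.0] * 800
print(frame_power_db(signal, sample_rate=8000))               # two frames
print(frame_power_db(signal, sample_rate=8000, frame_ms=50))  # four frames
```

The resulting per-frame values can be plotted along the same time axis that the embodiment uses for pitch.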

In the embodiment described above, the program executed by the CPU 102 may be stored in the storage unit 105, and the CPU 102 may read the program from the storage unit 105 and execute it. The language learning device 1 may also be given a communication function so that the program executed by the CPU 102 is downloaded via a communication network and stored in the storage unit 105. The language learning device 1 may also read a program recorded on a recording medium such as a CD-ROM and store the read program in the storage unit 105.

FIG. 1 is a block diagram of the learning device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the format of the example sentence table TB1.
FIG. 3 is a diagram illustrating the format of the pitch table TB2.
FIG. 4 is a diagram illustrating the change in pitch of the model voice.
FIG. 5 is a diagram illustrating the format of the word table TB3.
FIG. 6 is a flowchart showing the flow of processing performed by the CPU 102.
FIG. 7 is a diagram illustrating a screen displayed on the display unit 106.
FIG. 8 is a diagram illustrating the waveform of the model voice and the waveform of the learner's voice.
FIG. 9 is a diagram showing the waveform of the model voice and the waveform of the learner's voice divided into a plurality of frames.
FIG. 10 is a diagram illustrating the screen display in a modification.

Explanation of symbols

1 ... language learning device, 101 ... bus, 102 ... CPU, 103 ... ROM, 104 ... RAM, 105 ... storage unit, 106 ... display unit, 107 ... input unit, 108 ... voice processing unit, 109 ... microphone, 110 ... speaker.

Claims

A playback device comprising:
storage means for storing sound data corresponding to a sound to be pronounced, the sound data being divided into predetermined sections;
playback means for reading the sound data from the storage means and playing it back;
collected-sound data generating means for generating, as collected-sound data, data corresponding to a collected sound;
position adding means for recognizing the portions of the collected-sound data that correspond to the section delimiters of the sound data and adding corresponding delimiters to the collected-sound data;
display means for displaying a scale formed by connecting display sections, each having a display width corresponding to the time length of the respective section;
first display control means for displaying, in synchronization with playback by the playback means, the playback position within each section of the sound data on the corresponding display section of the scale;
collected-sound data playback means for playing back the collected-sound data; and
second display control means for displaying, in synchronization with playback by the collected-sound data playback means, the playback position within each section of the collected-sound data, as divided by the position adding means, on the corresponding display section of the scale.

The playback device according to claim 1, wherein
the storage means stores a plurality of different pieces of sound data,
the playback device further comprises selecting means for selecting one of the plurality of pieces of sound data stored in the storage means, and
the playback means reads the sound data selected by the selecting means from the storage means and plays it back.