JP3620787B2

JP3620787B2 - Audio data encoding method

Info

Publication number: JP3620787B2
Application number: JP2000051801A
Authority: JP
Inventors: 博司関口
Original assignee: カナース・データー株式会社; ペンタックス株式会社
Priority date: 2000-02-28
Filing date: 2000-02-28
Publication date: 2005-02-16
Anticipated expiration: 2020-02-28
Also published as: WO2003019535A1; JP2001242890A

Abstract

A data structure and related method for voice data that allow the data to be decoded at a reproduction speed chosen by the user without sacrificing ease of listening. In this voice data creating method, an analysis unit specifies decoding aiding information, which is referred to when the encoded voice data is decoded and which contains, among other things, the kind of sound of each part making up an uttered sound; a synthesizing unit then adds this information to the encoded voice data that an encoding unit has produced from voice signals according to a predetermined rule. As a result, the voice data can be decoded at an arbitrary reproduction speed on the user side, which makes it promising as content for services such as data distribution over information communication networks.

Description

【０００１】
【発明の属する技術分野】
この発明は、所定の音声信号に基づいて新たな音声データを生成するための音声データの符号化方法に関するものである。
【０００２】
【従来の技術】
従来から、英会話等の語学の独習用、詩吟の練習用、法律の独習用、歌の練習、その他の目的のために、カセットテープ等の記録媒体に音楽とともに音声情報が記録された教材が種々提供されている。ここで、英会話の独習用の教材を例に説明すると、従来の主な記録媒体は、例えば一連の英語の発声（音声情報）が記録されてたカセットテープ（又はＣＤ）であり、学習者はこのテープ教材とテキストとを組み合せて使用していた。なお、このような教材には、初級用から上級用まで種々のレベルが用意されている。
【０００３】
また、日本国特許第２５８１７００号には、複数の区画に区分された上級者学習用に適した音声情報列（ナチュラル・スピードの発生音を構成する各部の音）が記録された第１領域と、これら各区画に対応した等価な区画からなる初級者学習用に適した音声情報列（はっきりとした発生音を構成する各部の音であって、言語学上は同一の意味で派生の異なる音声）が記録された第２領域と、該上級者学習用及び初級者学習用の各音声情報列の対応する各区画の関係を、これら音声情報列の各区画の記録媒体における記録位置で示す情報が記録された第３領域とを、少なくとも備えたＣＤ−ＲＯＭ等の情報記録媒体、及びこのような構造を備えた情報記録媒体の対応する区画間での切替え再生等を含む再生方法が提案されている。
【０００４】
【発明が解決しようとする課題】
上述のように、日本国特許第２５８１７００号の情報記録媒体には、該媒体上の第１領域にネイティブスピーカーの発生音を構成する各部の音が記録され、また第２領域に言語上は同一の意味で遅緩した発音で構成された音声情報列が記録されている。したがって、第１領域に記録された音声情報列が再生されている最中に再生音を聞き取れなかった場合、第２領域に記録された同一内容の音声情報列（第１音声情報列の再生中の区画と第２音声情報列の再生すべき区画との対応は第３領域に記録されている）を切替えて再生することにより、学習者は聞き取れなかった音声を認知することができる。また、近年のパーソナル・コンピュータ等の情報処理機器の普及・高性能化を考慮すれば、制作時間の短縮や制作コストの削減のため、上記第１領域に記録される音声情報列から遅緩した第２領域に記録される音声情報列を生成することも不可能ではない。
【０００５】
しかしながら、単にネイティブ・スピーカーの音声を時間軸に沿って均一に伸張させたのでは、利用者の聞き取り易さを損なってしまう。すなわち、主として日本人がナチュラル・スピードの英語を単にゆっくり再生して聴けるようにした場合であっても、各周波数成分について単純にかつ一様に音声再生時間を伸ばしたり短縮したのでは不充分であり、発生音を構成する各部の音、例えば子音部のスペクトルの時間変化が言語上の音として別の音を意味する可能性があるからである。例えば、ＢＡ（バ）とＰＡ（パ）の発音は、前者のスペクトル変化が速く、後者は遅いだけでスペクトルそのものはほとんど同じ形をしており、ＢＡ（バ）という発音の子音部も含めて時間を伸長するとＰＡ（パ）と聴こえることになる。
【０００６】
一方、学習者のヒヤリング・レベルも、例えば上記第２領域に記録された音声再生速度でも十分に聞き取れない者、提供されたナチュラル・スピードでは満足できない者など様々であり、このように異なるヒヤリング・レベルの学習者を個々に満足させようとすると、各学習者のヒヤリング・レベルに応じた複数種類の音声情報を予め用意しなければならない。しかしながら、現状では学習者側で自分のヒヤリング・レベルに合った音声情報を選択できず、また、各学習者のヒヤリング・レベルに合った複数種類の音声情報を用意することは、ＣＤ等の記録媒体の記録容量に限界があるため現実的ではない。
【０００７】
さらに、近年の情報通信技術の発達により、インターネット等のコンピュータ・ネットワークを利用したデータ配信も注目されている。このようなデータ配信を利用した音声情報の提供を考える場合、大量のデータを送信するにはまだまだ通信時間や通信コストの面で実用レベルに達しているとは言えない。
【０００８】
この発明は上述のような課題を解決するためになされたもので、聞き取り易さを損なうことなく、利用者側で希望する再生速度の再生用音声データの復号化を可能にする音声データの符号化方法を提供することを目的としている。
【０００９】
【課題を解決するための手段】
上述の課題を解決するため、この発明に係る音声データの符号化方法により生成されたデータは、音声信号から所定の規則に従って符号化された符号化音声データと、該符号化音声データの復号化の際に参照される復号化補助情報とを含む。以下、このデータを音声基礎データという。
【００１０】
特に、上記復号化補助情報は、音声信号の波動に関する物理量、例えば周波数スペクトル情報などから特定された、少なくとも発生音を構成する各部の音の種類に関する情報を含む。これは、上述のような子音部の変化により異なる音に聞こえてしまう不具合を解消するためである。子音部の伸長度をＢＡ（バ）と聴こえる限界に留め、母音部のみ望みの音声再生時間に伸長あるいは短縮するようにすれば、ＢＡ（バ）のままに聴こえることになる。母音部はいくら伸長あるいは短縮してもその母音のままで聴こえるから望みの長さ（望みの再生時間）に設定できる。
【００１１】
一方、非英語圏の国民には弱すぎて聴き取りにくい音声や特定の周波数成分だけを選択的に２倍とか３倍に強調して聴かせることも語学学習等には必要である。母音部も含めて強調したのでは全体が大きくなり過ぎて効果がない。どうしても選択的に強調しなければならない。そこで、上記復号化補助情報は、強調すべき位置を指示するための強調位置識別情報を含むのが好ましい。また、この復号化補助情報は、復号化された音声データの周波数成分のうち、強調すべき周波数成分を個別に指示する情報を含むようにしてもよい。
【００１２】
なお、上記符号化音声データを得るための符号化としては、異なる再生速度でも聞き取り易くするため、予め符号化対象の音声信号を周波数成分に分解し、該分割された周波数成分ごとにその振幅情報等をデータ化する特願平１０−２４９６７２号記載の符号化が適している。また、符号化対象である音声信号は、ディジタル化された電気信号であり、その情報源としては、マイクを介して取り込まれたアナログ音声情報、磁気テープ等から読み出されたアナログ音声情報、ＭＯ、ＣＤ、ハードディスク等に記録されたディジタル音声情報のいずれであってもよい。ただし、アナログ音声情報の場合は、一旦Ａ／Ｄ変換される必要がある。また、ＣＤ等に記録されたデータが圧縮符号化されている場合には、該圧縮データを伸張（解凍）する必要がある。
【００１３】
近年普及し始めたインターネット等のコンピュータネットワーク、ケーブルＴＶネットワーク、衛生通信などの分野に着目すると、文字データ、音声データ、静止画データ、動画データなどのマルチメディアによる情報提供サービスも広く行われるようになってきており、このような情報通信技術を利用した情報提供サービスの１つとして、この発明を適用させるためには、画像データの表示タイミングの調節が不可欠となる。そこで、上述のようなデータ構造を備えた音声基礎データに、所定の表示手段に表示されるべき画像データの表示タイミングを指示する情報を含めることにより、音声データの再生動作に同期させた画像表示（特に、動画表示）が可能になる。
【００１４】
そこで、この発明に係る音声データの符号化方法は、上記符号化音声データを生成する第１行程と、上記復号化補助情報を特定する第２行程と、該符号化音声データに復号化補助情報を付加する第３行程とを備える。上記第１行程では、例えば特願平１０−２４９６７２号に記載された符号化技術のように、所定の規則に従って音声信号の符号化が行われる。上記第２行程では、符号化音声データの復号化の際に参照される復号化補助情報として、音声信号の波動に関する物理量（例えば周波数スペクトル情報）から少なくとも発生音を構成する各部の音の種類に関する情報が特定される。なお、上記第１及び第２行程は並行して実施することも可能である。
【００１５】
以上のように生成された音声基礎データ（符号化音声データを含む）を利用することにより、音声データの復号化において、指定された再生速度情報に基づいた発生音を構成する各部の音ごとに調節された各学習者のヒヤリング・レベルに合った再生用音声データの復号化が可能になる。このように復号化された再生用音声データを再生することにより、学習者は自己の指定した速度に調節された再生音声を聴くことができる。すなわち、音声データの再生は、上述のように生成された音声基礎データから復号化補助情報を抽出する第１行程と、抽出された抽出された復号化補助情報に含まれる少なくとも発生音を構成する各部の音に関する情報を参照しながら、発生音を構成する各部の音ごとに再生速度を決定する第２行程と、決定された再生速度に相当するよう符号化音声データの該当部分に対して伸張処理（弛緩した音声の再生のため）短縮処理（より早い音声の再生のため）を施しながら、該符号化音声データを復号化する第３行程とを備える。
【００１６】
なお、当該符号化方法は、聞き取り易さを損なうことなく、再生速度の異なる複数種類の再生用音声データが用意できるため、生成された再生速度の異なる再生用音声データ間での切り替え再生を行いながらの学習も可能になる。また、上記発生音を構成する各部の音には、音声スペクトルにおける母音部、該母音部の前後に現れる子音部、該母音部と子音部との間に現れる移行部、音声の切れ目（ポーズ）などが含まれる。
【００１７】
加えて、上述のように生成された音声基礎データの提供は、一旦ＣＤ等の記録媒体に記録された形態で利用者に提供される場合と、情報通信手段を介して利用者に提供される場合が考えられる。情報通信技術を利用する場合でも音声基礎データの取り扱いはハードディスク等への一時記録が不可欠であり、この音声基礎データの記録では、聞き取り易さを損なうことなく、利用者側で再生速度が変更された再生用音声データの復号化が可能になるよう、符号化音声データとともに復号化補助情報が所定の記録媒体に記録される。なお、この記録媒体において、符号化音声データが記録される領域と復号化補助情報が記録される領域は、異なっていてもよい。
【００１８】
上述のように生成された音声基礎データを有線又は無線の情報伝達手段を介して通信相手に送信するデータ配信方法としては、音声信号から所定の規則に従って符号化された符号化音声データと、該符号化音声データの復号化の際に参照される情報であって、音声信号の波動に関する物理量から少なくとも発生音を構成する各部の音の種類に関する情報を含む復号化補助情報とを、通信相手に送信する。なお、この配信方法では、送信される符号化音声データと復号化補助情報とを別個に通信相手に送信する構成であってもよい。
【００１９】
加えて、上述のように生成された音声基礎データの再生動作を文字データ、画像データ（静止画と動画）などのマルチメディア再生に対応させる場合、特に動画再生と、再生速度が自由に変更された音声の再生との同期をとることが重要である。すなわち、動画は例えば１秒間に２０フレーム程度の画像データをディスプレイに順次表示していくが、音声の再生動作との表示タイミングとの同期がとれていないと不自然な表示動作になってしまう。
【００２０】
そこで、この再生では、１又はそれ以上の画像データを一旦メモリ上に展開し、該メモリ上に格納された画像データのうち１フレーム分の画像データを、音声データの再生動作に同期して、順次所定の表示手段に表示するのが好ましい。
【００２１】
具体的な再生方法としては、メモリに格納される複数の画像データのうち１フレーム分又はそのＮ倍（正の有理数）の画像データの基準書き換え周期をＴｖ、音声のディジタル化サンプリング周期に基づいて決定された基準再生時間周期をＴａ、そして、指示された再生速度情報に基づいて発生音を構成する各部の音ごとに決定された再生時間周期をＴａ´（＞Ｔａ）とするとき、所定タイミングでの画像データ書き換え動作の終了時点からＴｖ×（（Ｔａ´／Ｔａ）−１）まで、次回の画像データ書き換え動作を休止させる。
【００２２】
なお、画像データの表示タイミングの調節は、音声のディジタル化サンプリング周期に基づいて決定された基準再生時間周期をＴａ、そして、指示された再生速度情報に基づいて発生音を構成する各部の音ごとに決定された再生時間周期をＴａ´（＞Ｔａ）とするとき、メモリに格納される画像データの平均書き換え周波数を、予め指定された書き換え周波数の（Ｔａ´／Ｔａ）倍に設定するようにしても、音声データの再生動作に同期した画像表示が可能になる。
【００２３】
【発明の実施の形態】
以下、この発明の各実施形態を図１〜図１３を用いて説明する。なお、図面の説明において同一部分には同一符号を付して重複する説明は省略する。
【００２４】
この発明により生成される音声基礎データは、再生時の聞き取り易さを損なうことなく、利用者が自由に設定した再生速度の再生用音声データの復号化を、該利用者側で行うことを可能にする。このような音声基礎データの利用形態は、近年のディジタル技術の発達やデータ通信環境の整備により種々の態様が考えられる。図１は、この発明に係る音声データの符号化方法により生成された音声基礎データがどのように産業上利用されるかを説明するための概念図である。
【００２５】
図１（ａ）に示されたように、音声データの符号化に利用される情報源１０としては、例えばマイクを介して直接取り込まれたり、既に磁気テープなどに記録されたアナログ音声情報、さらにはＭＯ、ＣＤ（ＤＶＤを含む）、Ｈ／Ｄ（ハードディスク）等に記録されているディジタル音声情報が利用可能であり、具体的には、市販されている教材やテレビ局、ラジオ局などから提供される音声情報などでも利用可能である。編集者１００は、このような情報源１０を利用して音声データ生成装置２００により、音声データが符号化される。なお、この際、現状のデータ提供方法を考えると、生成された音声基礎データはＣＤ（ＤＶＤを含む）、Ｈ／Ｄなどの記録媒体２０に一旦記録された状態で利用者に提供される場合が多い。また、これらＣＤやＨ／Ｄには当該音声基礎データとともに関連する画像データが記録される場合も十分に考えられる。
【００２６】
上記音声データ生成装置２００により生成される音声基礎データは、上記情報源１０から取り出されたディジタル音声信号から所定の規則に従って符号化された符号化音声データと、この符号化音声データの復号化の際に参照される情報であって、該音声信号の波動に関する物理量から特定された、少なくとも発生音を構成する各部の音の種類に関する情報を含む復号化補助情報とを備えたデータである。なお、符号化音声データを得るための符号化としては、例えば、異なる再生速度でも聞き取り易くするため、予め符号化対象の音声信号を周波数成分に分解し、該分割された周波数成分ごとにその振幅情報等をデータ化する特願平１０−２４９６７２号記載の符号化が利用可能である。生成された符号化音声データ及び復号化補助情報は、音声データ生成装置２００により、記録媒体２０に格納される。これにより、ＣＤ、ＤＶＤ、Ｈ／Ｄ等の記録媒体には上述の符号化音声データ、復号化補助情報とともに画像データ、文字データなどのマルチメディアが記録される。
【００２７】
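In particular, CDs and DVDs serving as the recording medium 20 are generally provided to users as magazine supplements or sold in stores in the same way as computer software and music CDs (distribution through the market). It is also quite conceivable that the generated audio basic data is distributed to users from the server 300 via information communication means such as the Internet or satellite communications, whether wired or wireless.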
【００２８】
データ配信の場合、上記音声データ生成装置２００により生成された音声基礎データは、サーバ３００の記憶装置３１０（例えばＨ／Ｄ）に画像データなどとともに一旦蓄積される。そして、Ｈ／Ｄ３１０に一旦蓄積された音声基礎データは、送受信装置３２０（図中のＩ／Ｏ）を介して利用者端末４００に送信される。利用者端末４００側では、送受信装置４５０を介して受信された音声基礎データが一旦Ｈ／Ｄ（外部記憶装置３０に含まれる）に格納される。一方、ＣＤやＤＶＤ等を利用したデータ提供では、利用者が購入したＣＤを端末装置４００のＣＤドライブやＤＶＤドライブに装着することにより該端末装置の外部記録装置３０として利用される。
【００２９】
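The terminal device 400 on the user side is normally equipped with an input device 460, a display 470 such as a CRT or liquid crystal display, and a speaker 480. The audio basic data recorded in the external storage device 30 together with image data and the like is first decoded by the decoding unit 410 of the terminal device 400 (which can also be implemented in software) into audio data at the playback speed instructed by the user, and is then output from the speaker 480. The image data stored in the external storage device 30, meanwhile, is first expanded in the VRAM 432 and then displayed frame by frame on the display 470 (bitmap display). If the audio data for reproduction decoded by the decoding unit 410 is accumulated sequentially in the external storage device 30, so that multiple kinds of audio data for reproduction at different playback speeds are prepared there, the user can switch playback among them using the technique described in Japanese Patent No. 2581700.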
【００３０】
利用者は、図１（ｂ）に示されたように、ディスプレイ４７０上に関連する画像４７１を表示させながらスピーカー４８０から出力される音声を聴くことになる。この際、音声のみ再生速度が変更されていたのでは、画像の表示タイミングがずれてしまう可能性がある。そこで、復号化部４１０が画像データの表示タイミングを制御できるよう、上記音声データ生成装置２００において生成される符号化音声データに画像表示タイミングを指示する情報を予め付加しておくのが好ましい。
【００３１】
次に、図１（ａ）に示された音声データ生成装置２００及び音声データ再生装置（端末装置４００）の詳細な構造を図２を用いて説明する。なお、図２（ａ）は、音声データ生成装置２００の構成を示す図であり、図２（ｂ）は、音声データ再生装置としての端末装置４００の構成を示す図である。
【００３２】
図２（ａ）に示されたように、音声データ生成装置２００に取り込まれる音声信号は情報源１０から提供される。なお、この情報源１０から提供される音声情報のうち、マイクから取り込まれる音声情報及び磁気テープからの音声情報は、ともにアナログ音声データであるため、当該音声データ生成装置２００へ入力される前にＡ／Ｄコンバータ１１（Ｉ／Ｏ１２に含まれる）によりＰＣＭデータに変換される。また、ＭＯ、ＣＤ（ＤＶＤ含む）、Ｈ／Ｄに既に格納された音声情報は、ＰＣＭデータとしてＩ／Ｏ１２を介して当該音声データ生成装置２００に取り込まれる。取り込まれた音声データが圧縮されている場合には、一旦ソフトウエア等の解凍しておく必要がある。
【００３３】
音声データ生成装置２００は、上述のように前処理された情報源１０からの音声信号（電気信号）から所定の規則に従って符号化された符号化音声データを生成する符号化部２１０と、この符号化音声データの復号化の際に参照される復号化補助情報として、音声信号の波動に関する物理量（例えば周波数スペクトル情報）から少なくとも発生音を構成する各部の音の種類に関する情報を特定する解析部２５０と、符号化部２１０により符号化された符号化音声データに、解析部２５０により特定された復号化補助情報を付加する合成部２６０とを備える。この合成部２６０から出力された符号化音声データと復号化補助情報は、ＣＤ、ＤＶＤ、Ｈ／Ｄ等の記録媒体２０に記録される。なお、上記符号化音声データと復号化補助情報は、記録媒体２０内の異なる領域にそれぞれ記録されてもよい。
【００３４】
一方、利用者側では、図２（ｂ）に示されたように、データ配信やＣＤ等の形態で提供された音声基礎データが端末装置４００の外部記憶装置３０内に格納される。復号化部４１０は、キーボードや、マウス等のポインティング・デバイスなどの入力手段４６０を介して入力された利用者の指示内容に従って、外部記憶装置３０からＩ／Ｏ３１を介して読み出されたディジタルデータを所定の速度で再生可能な再生用音声データとして復号化するとともに、画像同期信号Ｄも出力する。復号化された再生用音声データはアナログデータに変換された後、スピーカー４８０から音声として出力される。
【００３５】
なお、上記復号化部４１０は、外部記憶装置３０からＩ／Ｏ３１を介して読み出された音声基礎データを読み込み、この読み出された音声基礎データから、符号化音声データの復号化の際に参照される復号化補助情報を抽出するとともに、抽出された復号化補助情報に含まれる少なくとも発生音を構成する各部の音に関する情報を参照しながら上記符号化音声データに含まれる該発生音を構成する各部の音ごとに音声再生に適した再生速度を、利用者が指定した再生速度情報を基準にして決定する。この復号化部４１０における符号化音声データの復号化は、該符号化音声データに含まれる発生音を構成する各部の音ごとに、上述のように決定された再生速度に相当するよう該符号化音声データの該当部分に対して伸長処理又は短縮処理を施しながら行われる。
【００３６】
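FIG. 3 shows the structure of the encoding unit 210 in the audio data generation device 200 described above. The encoding unit 210 first takes in an audio signal corresponding to a native speaker's natural-speed speech, captured through a microphone or the like and sampled, for example, at the 44.1 kHz audio clock of a music CD. The captured signal is filtered to divide it into channels CH#1 to CH#85 (frequency components). The frequency range of the captured audio signal is 75 Hz to 10,000 Hz, and the sampling frequency is 44.1 kHz (a period of 22.68 μs), matching the audio clock of a music CD. The number of channels is 85 (seven octaves plus one note), and the center frequencies (center f) of channels #1 to #85 are set to form a chromatic sequence in equal temperament (12 equal divisions per octave), running from 77.78 Hz (D#) to 9,960 Hz (D#).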
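The channel layout described above can be checked numerically. The following is a minimal sketch, assuming only what the paragraph states (85 channels, one equal-tempered semitone apart, starting at 77.78 Hz); the function and constant names are illustrative and do not come from the patent.

```python
# Sketch: center frequencies of an 85-channel, equal-tempered filter bank.
# BASE_HZ and NUM_CHANNELS are taken from the text; the names are illustrative.

BASE_HZ = 77.78            # center frequency of channel #1 (D#)
NUM_CHANNELS = 85          # seven octaves plus one note
SEMITONES_PER_OCTAVE = 12  # 12-tone equal temperament


def center_frequencies(base_hz=BASE_HZ, n=NUM_CHANNELS):
    """Return the center frequency of each channel, one semitone apart."""
    ratio = 2.0 ** (1.0 / SEMITONES_PER_OCTAVE)
    return [base_hz * ratio ** k for k in range(n)]


freqs = center_frequencies()
print(f"CH#1  = {freqs[0]:.2f} Hz")
print(f"CH#85 = {freqs[-1]:.2f} Hz")
```

Channel #85 lies exactly seven octaves above channel #1, so its center frequency is 77.78 Hz × 128, or about 9,956 Hz, which the text rounds to 9,960 Hz.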
【００３７】
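From the data divided into channels #1 to #85 in this way, amplitude information is extracted every 2.268 ms (corresponding to 100 samples at 44.1 kHz; if 100 samples cannot form one waveform, the number of samples is increased). In this embodiment, therefore, the sampling rate (second period) of the amplitude information in each of channels #1 to #85 is 441 samples/s (one sample every 2.268 ms). The sampling need only follow a regular pattern; for example, an embodiment may take in 100 samples, then 120 samples, and repeat the processing while alternating between these different rates.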
【００３８】
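The encoding unit 210 expresses the amplitude information of each of channels #1 to #85, sampled every 2.268 ms, in one byte (8 bits), and generates encoded audio data a1, a2, a3, …, an of 85 bytes each (85 channels × 1 byte). An image synchronization signal D (1 byte) is added to each of the encoded audio data a1, a2, a3, …, an in order to control the display timing of the moving image shown while the corresponding audio is played.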
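The frame size and data rate implied by these figures can be worked out with a short sketch. It assumes the fixed 100-sample frame of this embodiment; placing the synchronization byte after the 85 amplitude bytes is an assumption made here for illustration, since the text does not state the byte order.

```python
# Sketch: one encoded frame (85 amplitude bytes + 1 sync byte) and its data rate.
# The byte order (sync byte last) is an assumption, not stated in the text.

SAMPLE_RATE_HZ = 44_100
SAMPLES_PER_FRAME = 100   # one amplitude sample per channel every 100 PCM samples
NUM_CHANNELS = 85
SYNC_BYTES = 1


def frames_per_second():
    """Amplitude frames produced per second (441 for the values above)."""
    return SAMPLE_RATE_HZ // SAMPLES_PER_FRAME


def pack_frame(amplitudes, sync):
    """Pack 85 one-byte amplitudes and the sync byte D into one frame."""
    if len(amplitudes) != NUM_CHANNELS:
        raise ValueError("expected one amplitude per channel")
    # bytes() raises ValueError if any value falls outside 0..255.
    return bytes(amplitudes) + bytes([sync])


def bytes_per_second():
    """Size of the encoded stream, excluding the decoding auxiliary information."""
    return frames_per_second() * (NUM_CHANNELS + SYNC_BYTES)


frame = pack_frame([0] * NUM_CHANNELS, sync=1)
print(len(frame), frames_per_second(), bytes_per_second())
```

With these values each frame is 86 bytes and the stream runs at 441 frames/s, or 37,926 bytes/s before the decoding auxiliary information is added.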
【００３９】
一方、図３に示された音声データ生成装置２００の解析部２５０は、上記符号化部２１０において生成された符号化音声データの復号化の際に参照される復号化補助情報を特定する。
【００４０】
復号化補助情報には、取り込まれた音声信号のスペクトル情報から発生音を構成する各部の音や、強調すべき部位を示す強調位置識別情報、強調すべき周波数成分等が含まれる。
【００４１】
例えば、この実施形態において、発生音を構成する各部の音は、図４に示されたように、母音部（Ｖ）と、この母音部（Ｖ）を挟んで前後に現れる子音部（図中、前子音部がＣ_Ｆ、後子音部がＣ_Ｒで示されている）と、母音部（Ｖ）と前後の各子音部Ｃ_Ｆ、Ｃ_Ｒとの間に現れる移行部（図中、前移行部がＴ_Ｆ、後移行部がＴ_Ｒで示されている）と、各音声の間に現れる無音期間を示すポーズ（Ｐ）とに分類される。なお、ポーズ（Ｐ）は、再生音声の遅延の際に他の発生音を構成する各部の音と同様に延伸されたのでは、利用者の聞き取り易さを損なう可能性がある。そこで、この実施形態においてポーズ（Ｐ）は、音節の間で発生することを示す場合（Ｐ１）と、句間で発生する場合（Ｐ２）と、文間で発生する場合（Ｐ３）とにさらに分類され、それぞれ特定すべき発生音を構成する各部の音に含められている。
【００４２】
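Each of the decoding auxiliary information items s1, s2, s3, …, sn is a data sequence prepared for each sampling interval of the encoded audio data a1, a2, a3, …, an, and amounts to about 4 bits in total: 3 bits for the information on the parts constituting the uttered sound and 1 bit for the emphasis position identification information. Because some frequencies, such as the third formant, are hard for non-English-speaking listeners to hear, the decoding auxiliary information may also individually include information designating the frequency band to be emphasized (in particular, its center frequency).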
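Three bits are exactly enough to distinguish the eight sound types named in this embodiment (V, C_F, C_R, T_F, T_R, P1, P2, P3). The sketch below packs one such item into a 4-bit value; the numeric codes and the bit positions are assumptions chosen for illustration, since the text gives only the bit counts.

```python
# Sketch: packing one decoding-auxiliary item into 4 bits.
# The code values and bit layout are hypothetical; the text only states
# "3 bits for the sound type, 1 bit for the emphasis position".
from enum import IntEnum


class SoundType(IntEnum):
    V = 0      # vowel part
    C_F = 1    # front consonant part
    C_R = 2    # rear consonant part
    T_F = 3    # front transition part
    T_R = 4    # rear transition part
    P1 = 5     # pause between syllables
    P2 = 6     # pause between phrases
    P3 = 7     # pause between sentences


def pack_aux(sound_type, emphasize):
    """Upper 3 bits: sound type. Lowest bit: emphasis flag."""
    return (int(sound_type) << 1) | (1 if emphasize else 0)


def unpack_aux(nibble):
    """Inverse of pack_aux; returns (SoundType, emphasis flag)."""
    if not 0 <= nibble <= 0xF:
        raise ValueError("auxiliary item must fit in 4 bits")
    return SoundType(nibble >> 1), bool(nibble & 1)


print(pack_aux(SoundType.C_F, True), unpack_aux(3))
```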
【００４３】
合成部２６０は、以上のように符号化部２１０により生成された符号化音声データａ１，ａ２，ａ３，…，ａｎに、解析部２５０により特定された復号化補助情報ｓ１，ｓ２，ｓ３，…，ｓｎを付加し、記録媒体２０にそれぞれを書き込む。なお、合成部２６０により生成された合成データは、図５（ａ）〜図５（ｃ）に示されたような種々の論理構造を備えることが可能である。例えば、図５（ａ）に示されたように、生成された合成データは、符号化音声データａ１，ａ２，ａ３，…，ａｎの各データごとに対応する復号化補助情報ｓ１，ｓ２，ｓ３，…，ｓｎが付加された構造であってもよい。また、図５（ｂ）に示されたように、生成された合成データは、符号化音声データａ１，ａ２，ａ３，…，ａｎと復号化補助情報ｓ１，ｓ２，ｓ３，…，ｓｎとがそれぞれ異なるグループのデータとして取り扱われる構造であってもよい。さらに、図５（ｃ）に示されたように、生成された合成データは、符号化音声データａ１，ａ２，ａ３，…，ａｎを構成する複数のグループと、復号化補助情報ｓ１，ｓ２，ｓ３，…，ｓｎの対応する複数のグループとの対により構成されてもよい。
【００４４】
次に、音声データの復号化及び再生を行う利用者側の端末装置４００の構造について説明する。
【００４５】
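FIG. 6 shows the structure of the decoding unit 410 of the terminal device 400, and FIG. 7 shows the structure of the PCM data generation unit 415 in the decoding unit 410 shown in FIG. 6.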
【００４６】
図６に示されたように、外部記憶装置３０からＩ／Ｏ３１を介して音声基礎データが復号化部４１０に取り込まれる。なお、外部記憶装置３０に格納された音声基礎データは、コンピュータ・ネットワークや衛星などの情報通信手段を介して配信されたり、利用者が購入したＣＤ等に格納されているデータであり、その他画像データも該外部記憶装置３０内には記録されている。また、外部記憶装置３０内に格納された音声基礎データが圧縮されている場合には、ソフトウエア等によるデータ伸長が復号化の前処理として行われる。
【００４７】
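In the decoding unit 410, the extraction unit 411 first extracts the decoding auxiliary information s1, s2, s3, …, sn from the audio basic data read from the external storage device 30. Of the extracted decoding auxiliary information s1, s2, s3, …, sn, the information on the parts constituting the uttered sound (V, C_F, C_R, T_F, T_R, P1, P2, P3) is input to the time coefficient generation unit 412 together with the user's instruction information entered through the input means 460. The emphasis position identification information is input to the amplitude emphasis coefficient generation unit 413, and the information on the frequency component to be emphasized (center CH) is input to the emphasis band data generation unit 414, in each case together with the user's instruction information entered through the input means 460.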
【００４８】
また、この実施形態では、入力手段４６０から入力される利用者の再生速度指示情報としては、図８の表に示されたように、複数の再生レベルＨ３〜Ｓ６が用意されている。図８の表からも分かるように、この実施形態では、再生レベルＮを標準の再生速度（ナチュラル・スピード）とし、Ｈ３に向かうほど再生速度を早く、逆にＳ６に向かうほど再生速度を遅くするように、該ナチュラル・スピードを基準とした再生時間の比及び再生速度の倍率で指示される。
【００４９】
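The time coefficient generation unit 412 holds a table, as shown in FIG. 9, in which playback speed multipliers are set in advance according to the relationship between the playback level (instructed by the user) and the type of each part constituting the uttered sound, and outputs the playback speed multiplier to the PCM data generation unit 415 on the basis of this table.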
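A table lookup of this kind, combined with the later statement that playback time is adjusted by changing how many times each encoded frame is output, might be sketched as follows. The table values, the level names other than N, and the carry-over scheme for fractional counts are all assumptions for illustration; FIG. 9 is not reproduced in this text.

```python
# Sketch: per-frame output counts from a (playback level, sound type) table.
# SPEED_TABLE values are hypothetical; 1.0 means natural speed.

SPEED_TABLE = {
    "N":  {"V": 1.0, "C_F": 1.0, "P1": 1.0},
    "S1": {"V": 0.5, "C_F": 1.0, "P1": 0.5},
}


def repeat_counts(level, sound_types, table=SPEED_TABLE):
    """Return how many times each frame is output.

    A speed multiplier of 0.5 means the frame should last twice as long.
    Fractional output counts are carried over to later frames so that the
    total duration stays close to the requested one.
    """
    counts = []
    owed = 0.0  # fractional outputs not yet emitted
    for kind in sound_types:
        owed += 1.0 / table[level][kind]
        whole = int(owed)
        counts.append(whole)
        owed -= whole
    return counts


print(repeat_counts("N", ["C_F", "V", "V", "P1"]))
print(repeat_counts("S1", ["C_F", "V", "V", "P1"]))
```

In this hypothetical table the consonant keeps its natural length at level S1 while vowels and pauses are doubled, which is the kind of per-part adjustment the description is aiming at.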
【００５０】
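The amplitude emphasis coefficient generation unit 413 holds two tables, as shown in FIG. 10. FIG. 10(a) is the table applied when the decoding auxiliary information s1, s2, s3, …, sn extracted by the extraction unit 411 contains no emphasis position identification information (no emphasis is instructed), and FIG. 10(b) is the table applied when it does contain emphasis position identification information (emphasis is instructed). The parameters in these tables are multipliers relative to the amplitude of each frequency component of the encoded audio data that the extraction unit 411 has separated from the decoding auxiliary information.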
【００５１】
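When the decoding auxiliary information s1, s2, s3, …, sn contains information designating a frequency band to be emphasized (specified by its center CH), the emphasis band data generation unit 414 generates parameters that change the amplitude of each frequency component for a total of 11 channels: the center CH, the five channels adjacent to it on the low-frequency side, and the five channels adjacent to it on the high-frequency side, as shown in FIG. 11(a). The emphasis band data generation unit 414 holds a table, as shown in FIG. 11(b), in which the amplitude multiplier of the center CH is set in advance for each playback level, and the amplitude multiplier of the center CH is determined according to the playback speed instruction information entered through the input means 460. The amplitude multipliers of the channels adjacent to the center CH are each set relative to that of the center CH so that they can be approximated by straight lines, as in FIG. 11(a), and are output to the PCM data generation unit 415.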
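One way to realize an 11-channel band with straight-line sides is a triangular taper around the center channel. The sketch below assumes that the multiplier falls linearly from the center value toward 1.0 just outside the band; the exact slope in FIG. 11(a) is not given in the text, so this shape and the names used are illustrative only.

```python
# Sketch: per-channel amplitude multipliers for an emphasized band.
# Assumption: linear taper from center_gain at the center CH toward 1.0
# at the first channel outside the 11-channel band.

NUM_CHANNELS = 85
HALF_WIDTH = 5  # five channels on each side of the center CH


def emphasis_gains(center_ch, center_gain,
                   num_channels=NUM_CHANNELS, half_width=HALF_WIDTH):
    """Return one multiplier per channel (index 0 corresponds to CH#1)."""
    if not 1 <= center_ch <= num_channels:
        raise ValueError("center channel out of range")
    gains = [1.0] * num_channels
    for offset in range(-half_width, half_width + 1):
        ch = center_ch + offset
        if 1 <= ch <= num_channels:
            weight = 1.0 - abs(offset) / (half_width + 1)
            gains[ch - 1] = 1.0 + (center_gain - 1.0) * weight
    return gains


g = emphasis_gains(center_ch=40, center_gain=2.0)
print([round(x, 3) for x in g[33:46]])
```

Channels that would fall outside CH#1 to CH#85 are simply skipped, so a center near either end of the range emphasizes fewer than 11 channels.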
【００５２】
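As shown in FIG. 7, the PCM data generation unit 415 includes a sine wave generator 422 that generates the frequency components corresponding to the channels. The control unit 421 generates a new amplitude coefficient from the amplitude information of each frequency component in the encoded audio data supplied by the extraction unit 411 and the amplitude multiplier data supplied by the emphasis band data generation unit 414, and has the multiplier 423 multiply the data from the sine wave generator 422 (which represents a reference amplitude) by this coefficient. The resulting data for the individual frequency components are summed in the adder 424 to give the decoded PCM data. In addition, the control unit 421 lengthens or shortens the decoded audio by adjusting the number of times each of the encoded audio data a1, a2, a3, …, an is output, on the basis of the playback speed multiplier data from the time coefficient generation unit 412. Because the number of times the image synchronization signal D is output for each of the encoded audio data a1, a2, a3, …, an is adjusted at the same time, the display timing of the image can be controlled on the audio reproduction side.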
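The decoding path described here (sine generators, per-channel multiplication, summation, and time adjustment by repeated output of frames) can be sketched as plain additive synthesis. This is a simplified illustration under stated assumptions: amplitudes are already scaled floats, every frame covers 100 PCM samples, and a single running sample index keeps the sine waves phase-continuous. None of the names below come from the patent.

```python
# Sketch: additive synthesis with frame repetition for time stretching.
# Assumptions: float amplitudes, fixed 100-sample frames, continuous phase.
import math

SAMPLE_RATE_HZ = 44_100
SAMPLES_PER_FRAME = 100


def synthesize(frames, center_freqs, repeats):
    """Decode amplitude frames to PCM samples.

    frames:       list of per-channel amplitude lists (one list per frame)
    center_freqs: center frequency in Hz of each channel
    repeats:      number of times each frame is output (0 drops the frame)
    """
    pcm = []
    n = 0  # running sample index across all output frames
    for amps, count in zip(frames, repeats):
        for _ in range(count):
            for _ in range(SAMPLES_PER_FRAME):
                t = n / SAMPLE_RATE_HZ
                pcm.append(sum(a * math.sin(2.0 * math.pi * f * t)
                               for a, f in zip(amps, center_freqs)))
                n += 1
    return pcm


# Two frames of a single 441 Hz channel; the first frame is output twice.
samples = synthesize([[1.0], [0.5]], [441.0], repeats=[2, 1])
print(len(samples))
```

Outputting a frame twice doubles its duration without changing its pitch, because only the amplitude envelope is repeated while the sine generators keep running at their fixed frequencies.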
【００５３】
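The data decoded in the PCM data generation unit 415 as described above is thus adjusted along the time axis according to the user's playback speed instruction. In the multiplier 416, this decoded data is multiplied by the multiplier parameter that the amplitude emphasis coefficient generation unit 413 has determined from the table of either FIG. 10(a) or FIG. 10(b), giving the audio data for reproduction. The audio data for reproduction is converted into analog data by the D/A converter 417 and output from the speaker 480 as audio at the playback speed instructed by the user.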
【００５４】
一方、端末装置４００では外部記憶装置３０から読み出された画像データの表示も可能である。図１２は、ビットマップ・ディスプレイの構造を示す図である。
【００５５】
ビットマップ・ディスプレイは、１又はそれ以上のフレームを格納するメモリ４３２（ＶＲＡＭ）を備えており、描画部４３１が、外部記憶装置３０からＩ／Ｏ３２を介して読み出された画像データ（圧縮されている場合にはソフトウエア等３２によりデータ伸長される）をこれらメモリ４３２に書き込んでいく。メモリ４３２に書き込まれた画像データは１フレームごとにスイッチＳ／Ｗ４３３を介してディスプレイ４７０に表示される。なお、これら描画部４３１の書き込みタイミング及びＳ／Ｗ４３３の切り替えタイミングはタイミングコントローラ４３４により行われる。
【００５６】
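In this embodiment, the timing of audio playback and image display is matched by counting the image synchronization signal D output from the PCM data generation unit 415, as shown in FIG. 13(a). For example, if the data in the memory 432 is rewritten every three clocks during playback at natural speed, then even when the PCM data generation unit 415 generates slowed audio data for reproduction, as shown in FIG. 13(b), the data can be rewritten in step with the delay of the audio (that is, the image display timing can be made to match the audio playback timing).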
【００５７】
すなわち、この実施形態では、メモリ４３２に格納される複数の画像データのうち１フレーム分又はそのＮ倍（正の有理数であって、Ｎは１／２や２／３であってもよい）の画像データの基準書き換え周期をＴｖ、音声のディジタル化サンプリング周期（例えば、音楽ＣＤの音響クロック）に基づいて決定された基準再生時間周期をＴａ、そして、指示された再生速度情報に基づいて発生音を構成する各部の音ごとに決定された再生時間周期をＴａ´（＞Ｔａ）とするとき、所定タイミングでの画像データ書き換え動作の終了時点からＴｖ×（（Ｔａ´／Ｔａ）−１）まで、次回の画像データ書き換え動作を休止させることを特徴としている。
【００５８】
なお、音声再生と画像表示とのタイミング調整は、上述の実施形態に限定されるものではない。例えば、音声のディジタル化サンプリング周期に基づいて決定された基準再生時間周期をＴａ、そして、指示された再生速度情報に基づいて発生音を構成する各部の音ごとに決定された再生時間周期をＴａ´（＞Ｔａ）とするとき、上記メモリ４３２に格納される画像データの平均書き換え周波数は、予め指定された書き換え周期の（Ｔａ´／Ｔａ）倍に設定されてもよい。
【００５９】
【発明の効果】
以上のようにこの発明によれば、マイク等から取り込まれたり、過去に蓄積された音声信号から所定の規則に従って符号化した符号化音声データと、該符号化音声データの復号化の際に参照される発生音を構成する各部の音の種類等を含む復号化補助情報とにより音声基礎データが得られる。このような音声基礎データを所定の記録媒体や配信方法により利用者に提供することにより、利用者側では任意に設定された速度の異なる複数種類の再生用音声データを復号化することができる。これにより、データ提供者から利用者へ提供すべき音声基礎データのデータ量を低減することができ、記録媒体の記録量の節約や、データ配信時間の短縮が実現される。
【００６０】
さらに、上記符号化音声データに復号化補助情報とともに画像同期信号を付加することにより、復号化された再生用音声データの再生速度に合った画像再生も可能になる。
【図面の簡単な説明】
【図１】この発明の各実施形態を概念的に説明するための図である。
【図２】（ａ）は、この発明に係る音声データの符号化方法を実現する音声データ生成装置の概略構成を示すブロック図であり、（ｂ）は、生成された音声基礎データから音声再生を実現する音声データ再生装置の概略構成を示すブロック図である。
【図３】図２（ａ）に示された音声データ生成装置における符号化部の構成を示すブロック図である。
【図４】符号化音声データの復号化に必要な復号化補助情報の一部を概念的に説明するための図である。
【図５】この発明に係る音声データの符号化方法により得られた符号化データを概念的に説明するための図である。
【図６】音声データ再生装置（端末装置）の構成を示すブロック図である。
【図７】図６に示された音声データ再生装置におけるＰＣＭデータ生成部の構成を示すブロック図である。
【図８】再生レベルごとに設定された、ナチュラル・スピードの再生レベルを基準とした再生時間の比率及び再生速度の倍率の一例を示す表である。
【図９】図６に示された時間係数生成部において参照される表であって、発生音を構成する各部の音の種類ごとに設定された再生速度の一例を、ナチュラル・スピードの再生レベルを基準とした倍率で示した表である。
【図１０】図６に示された振幅強調係数生成部において参照される表であって、発生音を構成する各部の音の種類ごとに設定される振幅の一例を、ナチュラル・スピードの再生レベルを基準とした倍率で示した表である。
【図１１】（ａ）は、図６に示された強調バンドデータ生成部において参照される表であって、指示された周波数バンドデータの編集動作を説明するための図であり、（ｂ）は、指示された周波数バンド（中心ＣＨ）の振幅を、ナチュラル・スピードの再生レベルを基準とした倍率で示した表である。
【図１２】図６に示された音声データ再生装置（端末装置）による音声再生に同期して画像データを表示する表示装置の構成を示す図である。
【図１３】音声再生動作に同期した画像表示タイミングを説明するためのタイムチャートである。
【符号の説明】
１０…情報源、２０、３０、３１０…記録媒体、２００…音声データ生成装置、２１０…符号化部、２５０…解析部、２６０…合成部、４００…端末装置（ＰＣ）、４１０…復号化部。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an audio data encoding method for generating new audio data based on a predetermined audio signal.
[0002]
[Prior art]
Various teaching materials in which audio information is recorded, together with music, on a recording medium such as a cassette tape have long been provided for self-study of English conversation and other languages, shigin (poetry recitation) practice, self-study of law, song practice, and other purposes. Taking self-study materials for English conversation as an example, the main conventional recording medium is a cassette tape (or CD) on which a series of English utterances (audio information) is recorded, and the learner uses this tape together with a textbook. Such teaching materials are available at various levels, from beginner to advanced.
[0003]
Japanese Patent No. 2581700 proposes an information recording medium such as a CD-ROM that includes at least: a first area recording an audio information sequence suitable for advanced learners, divided into a plurality of sections (the sounds of each part constituting an uttered sound at natural speed); a second area recording an audio information sequence suitable for beginners, made up of equivalent sections corresponding to those sections (the sounds of each part constituting a clearly articulated uttered sound, that is, speech that has the same linguistic meaning but is derived differently); and a third area recording information that expresses the relationship between corresponding sections of the advanced and beginner sequences as the recording positions of those sections on the medium. It also proposes a reproduction method for a medium of this structure, including switching playback between corresponding sections.
[0004]
[Problems to be solved by the invention]
As described above, in the information recording medium of Japanese Patent No. 2581700, the sounds of each part constituting a native speaker's uttered sound are recorded in the first area, and an audio information sequence with the same linguistic meaning but spoken slowly is recorded in the second area. Therefore, if the learner cannot catch the sound while the sequence in the first area is being played, switching to the sequence of the same content in the second area lets the learner recognize what was missed (the correspondence between the section being played in the first sequence and the section to be played in the second sequence is recorded in the third area). In addition, given the recent spread and improved performance of information processing equipment such as personal computers, it is not impossible to generate the slowed sequence for the second area from the sequence recorded in the first area, in order to shorten production time and reduce production cost.
[0005]
However, simply stretching a native speaker's speech uniformly along the time axis impairs ease of listening. That is, even when the aim is merely to let mainly Japanese listeners hear natural-speed English played back slowly, it is not enough to lengthen or shorten the playback time simply and uniformly for every frequency component, because the change over time of the spectrum of a part of the uttered sound, for example a consonant part, may signify a different linguistic sound. For example, BA and PA have spectra of almost the same shape and differ only in that the spectrum changes quickly in the former and slowly in the latter; if the time is stretched for the whole of BA, including its consonant part, it will be heard as PA.
[0006]
Learners' listening levels also vary: some cannot follow even the playback speed recorded in the second area, while others are not satisfied with the natural speed provided. Satisfying learners at each of these levels individually would require preparing in advance multiple kinds of audio information matched to each learner's level. At present, however, learners cannot select audio information that matches their own listening level, and preparing multiple kinds of audio information for every level is unrealistic because the recording capacity of media such as CDs is limited.
[0007]
Furthermore, with the recent development of information communication technology, data distribution over computer networks such as the Internet has attracted attention. When audio information is to be provided through such distribution, however, transmitting large amounts of data cannot yet be said to have reached a practical level in terms of communication time and cost.
[0008]
The present invention has been made to solve the problems described above, and its object is to provide an audio data encoding method that enables the user side to decode audio data for reproduction at a desired playback speed without impairing ease of listening.
[0009]
[Means for Solving the Problems]
To solve the above problems, the data generated by the audio data encoding method according to the present invention includes encoded audio data, encoded from an audio signal according to a predetermined rule, and decoding auxiliary information that is referred to when the encoded audio data is decoded. Hereinafter, this data is referred to as audio basic data.
[0010]
In particular, the decoding auxiliary information includes at least information on the type of sound of each part constituting the uttered sound, specified from a physical quantity related to the waveform of the audio signal, such as frequency spectrum information. This is to eliminate the problem described above, in which a change in the consonant part causes a different sound to be heard. If the stretching of the consonant part is kept within the limit at which it is still heard as BA, and only the vowel part is lengthened or shortened to the desired playback time, the sound is still heard as BA. A vowel part is heard as the same vowel however much it is lengthened or shortened, so it can be set to any desired length (desired playback time).
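The idea of capping the consonant stretch and letting the vowel absorb the remainder can be illustrated with a small calculation. This is a sketch of one possible way to distribute a requested overall time ratio, not the table-based method of the embodiment; the consonant limit of 1.2 and all names are hypothetical.

```python
# Sketch: distribute a requested time ratio so that consonants are stretched
# only up to a limit and vowels absorb the rest.
# CONSONANT_MAX_STRETCH is a hypothetical value, not taken from the patent.

CONSONANT_MAX_STRETCH = 1.2


def plan_durations(segments, target_ratio, consonant_max=CONSONANT_MAX_STRETCH):
    """segments: list of (kind, duration_ms), kind being "C" or "V".

    Returns new (kind, duration_ms) pairs whose total duration is
    target_ratio times the original total.
    """
    total = sum(d for _, d in segments)
    vowel_total = sum(d for k, d in segments if k == "V")
    if vowel_total == 0:
        raise ValueError("at least one vowel segment is required")
    c_ratio = min(target_ratio, consonant_max)
    consonant_new = sum(d * c_ratio for k, d in segments if k != "V")
    v_ratio = (total * target_ratio - consonant_new) / vowel_total
    return [(k, d * (v_ratio if k == "V" else c_ratio)) for k, d in segments]


# "BA": a 50 ms consonant followed by a 150 ms vowel, played twice as slowly.
print(plan_durations([("C", 50.0), ("V", 150.0)], target_ratio=2.0))
```

In this example the consonant grows only from 50 ms to about 60 ms, while the vowel grows from 150 ms to about 340 ms, so the whole syllable still takes twice as long.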
[0011]
It is also necessary in language learning and the like to let listeners hear, selectively emphasized by a factor of two or three, sounds that are too weak for people from non-English-speaking countries to catch, or only specific frequency components. Emphasizing everything, vowel parts included, makes the whole too loud and has no effect, so the emphasis has to be selective. The decoding auxiliary information therefore preferably includes emphasis position identification information that indicates the positions to be emphasized. It may also include information that individually designates which frequency components of the decoded audio data are to be emphasized.
[0012]
As the encoding for obtaining the encoded audio data, the encoding described in Japanese Patent Application No. 10-249672 is suitable: to keep the audio easy to hear at different playback speeds, the audio signal to be encoded is first decomposed into frequency components, and amplitude information and the like are then converted into data for each of the divided frequency components. The audio signal to be encoded is a digitized electrical signal, and its information source may be analog audio information captured through a microphone, analog audio information read from a magnetic tape or the like, or digital audio information recorded on an MO, CD, hard disk, or the like. Analog audio information must first undergo A/D conversion. If the data recorded on a CD or the like is compression-encoded, the compressed data must first be decompressed.
[0013]
In fields such as computer networks like the Internet, cable TV networks, and satellite communications, which have begun to spread in recent years, information services using multimedia such as text data, audio data, still image data, and moving image data are now widely offered. To apply the present invention as one such service, adjusting the display timing of image data is indispensable. Therefore, by including, in audio basic data having the data structure described above, information that indicates the display timing of image data to be displayed on a predetermined display means, image display (in particular, moving image display) synchronized with the reproduction of the audio data becomes possible.
[0014]
Accordingly, the audio data encoding method according to the present invention includes a first step of generating the encoded audio data, a second step of specifying the decoding auxiliary information, and a third step of adding the decoding auxiliary information to the encoded audio data. In the first step, the audio signal is encoded according to a predetermined rule, for example by the encoding technique described in Japanese Patent Application No. 10-249672. In the second step, at least information on the type of sound of each part constituting the uttered sound is specified, from a physical quantity related to the waveform of the audio signal (for example, frequency spectrum information), as the decoding auxiliary information referred to when the encoded audio data is decoded. The first and second steps can be performed in parallel.
[0015]
Using the audio basic data generated as described above (which includes the encoded audio data), decoding can produce audio data for reproduction that matches each learner's listening level, adjusted for each part of the uttered sound on the basis of the designated playback speed information. By reproducing the audio data decoded in this way, the learner can listen to audio adjusted to the speed the learner has specified. That is, reproduction of the audio data includes a first step of extracting the decoding auxiliary information from the audio basic data generated as described above; a second step of determining the playback speed for each part of the uttered sound while referring to at least the information on those parts contained in the extracted decoding auxiliary information; and a third step of decoding the encoded audio data while applying, to the corresponding portion of the encoded audio data, stretching (for slower playback) or shortening (for faster playback) so as to match the determined playback speed.
[0016]
Because this encoding method makes it possible to prepare multiple kinds of audio data for reproduction at different playback speeds without impairing ease of listening, the learner can also study while switching playback among the generated versions. The parts constituting the uttered sound include, in the speech spectrum, a vowel part, consonant parts that appear before and after the vowel part, transition parts that appear between the vowel part and the consonant parts, and breaks in the speech (pauses).
[0017]
The audio basic data generated as described above may be provided to the user either recorded on a recording medium such as a CD or through information communication means. Even when information communication technology is used, handling the audio basic data requires temporarily recording it on a hard disk or the like. In recording the audio basic data, the decoding auxiliary information is recorded on a predetermined recording medium together with the encoded audio data, so that audio data for reproduction whose playback speed has been changed on the user side can be decoded without impairing ease of listening. On this recording medium, the area in which the encoded audio data is recorded and the area in which the decoding auxiliary information is recorded may be different.
[0018]
In a data distribution method that transmits the audio basic data generated as described above to a communication partner via wired or wireless information transmission means, what is transmitted to the partner is the encoded audio data, encoded from an audio signal according to a predetermined rule, and the decoding auxiliary information, which is referred to when the encoded audio data is decoded and which includes at least information on the type of sound of each part constituting the uttered sound, specified from a physical quantity related to the waveform of the audio signal. In this distribution method, the encoded audio data and the decoding auxiliary information may be transmitted to the partner separately.
[0019]
In addition, when reproduction of the audio basic data generated as described above is combined with multimedia playback of text data and image data (still and moving images), it is particularly important to synchronize moving image playback with playback of audio whose speed has been freely changed. A moving image is shown by sequentially displaying, for example, about 20 frames of image data per second, and the display becomes unnatural if its timing is not synchronized with the audio playback.
[0020]
Therefore, in this reproduction it is preferable to first expand one or more frames of image data in memory and then to display the image data stored in that memory on a predetermined display means sequentially, one frame at a time, in synchronization with the reproduction of the audio data.
[0021]
As a specific reproduction method, let Tv be the reference rewrite period for one frame, or N frames (N being a positive rational number), of the image data stored in the memory; let Ta be the reference playback time period determined from the digitization sampling period of the audio; and let Ta′ (> Ta) be the playback time period determined for each part of the uttered sound on the basis of the instructed playback speed information. Then, after an image data rewrite operation performed at a given timing ends, the next image data rewrite operation is suspended for a duration of Tv × ((Ta′ / Ta) − 1).
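The suspension time in this paragraph is a one-line formula, and a small sketch makes its effect concrete. The function names are illustrative, and treating the effective rewrite period as the reference period plus the suspension is an interpretation made here, not a statement from the text.

```python
# Sketch: suspension between image rewrites when audio playback is slowed.
# Names are illustrative; the formula Tv * (Ta'/Ta - 1) is from the text.


def rewrite_pause(tv, ta, ta_prime):
    """Time to hold the current image before the next rewrite."""
    if ta <= 0 or ta_prime < ta:
        raise ValueError("expected 0 < Ta <= Ta'")
    return tv * (ta_prime / ta - 1.0)


def effective_rewrite_period(tv, ta, ta_prime):
    """Interpretation: reference period plus the suspension."""
    return tv + rewrite_pause(tv, ta, ta_prime)


# 20 frames per second (Tv = 50 ms) with audio slowed to 1.5 times its length.
print(rewrite_pause(50.0, 1.0, 1.5), effective_rewrite_period(50.0, 1.0, 1.5))
```

Under this reading, a 50 ms reference period with audio slowed by a factor of 1.5 gives a 25 ms suspension and an effective period of 75 ms, that is, Tv × (Ta′ / Ta), so the image advances at the same relative pace as the audio.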
[0022]
Alternatively, the display timing of the image data may be adjusted as follows. Where Ta is the reference reproduction time period determined from the digitization sampling period of the sound, and Ta′ (> Ta) is the reproduction time period determined for each sound of each part constituting the generated sound based on the instructed reproduction speed information, the average rewrite cycle of the image data stored in the memory is set to (Ta′ / Ta) times the rewrite cycle specified in advance. In this case as well, an image can be displayed in synchronization with the reproduction operation of the audio data.
[0023]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to FIGS. 1 to 13. In the description of the drawings, the same portions are denoted by the same reference numerals, and redundant description is omitted.
[0024]
The voice basic data generated by the present invention enables the user side to decode reproduction audio data at a playback speed freely set by the user, without impairing ease of listening during playback. Such voice basic data can be used in various ways thanks to recent developments in digital technology and improvements in the data communication environment. FIG. 1 is a conceptual diagram for explaining how voice basic data generated by the audio data encoding method according to the present invention is used industrially.
[0025]
As shown in FIG. 1(a), the information source 10 used for encoding audio data may be, for example, analog audio information taken in directly via a microphone or already recorded on magnetic tape or the like, or digital audio information recorded on an MO, CD (including DVD), H/D (hard disk), or the like. Specifically, audio information provided by commercially available teaching materials, TV stations, radio stations, and so on can also be used. The editor 100 encodes audio data with the audio data generation device 200 using such an information source 10. Given current data provision methods, the generated voice basic data is in many cases provided to the user after first being recorded on a recording medium 20 such as a CD (including DVD) or H/D. Related image data may also be recorded together with the voice basic data on these CDs and H/Ds.
[0026]
The voice basic data generated by the audio data generation device 200 includes encoded audio data, encoded according to a predetermined rule from the digital audio signal extracted from the information source 10, and decoding auxiliary information, which is referred to when the encoded audio data is decoded and includes at least information on the type of sound of each part constituting the generated sound, as specified from a physical quantity related to the waveform of the audio signal. For the encoding that yields the encoded audio data, the encoding described in Japanese Patent Application No. 10-249672 can be used, for example. To keep the audio easy to hear even at different playback speeds, that encoding decomposes the audio signal into frequency components in advance and converts the amplitude information and the like of each frequency component into data. The generated encoded audio data and decoding auxiliary information are stored on the recording medium 20 by the audio data generation device 200. In this way, multimedia data such as image data and character data can be recorded on a recording medium such as a CD, DVD, or H/D together with the encoded audio data and decoding auxiliary information.
[0027]
In particular, CDs and DVDs serving as the recording medium 20 are generally provided to users as magazine supplements or sold in stores in the same manner as computer software and music CDs (market distribution). The generated voice basic data may also be distributed from the server 300 to users via information communication means, wired or wireless, such as the Internet or satellite communication.
[0028]
In the case of data distribution, the voice basic data generated by the audio data generation device 200 is temporarily stored, together with image data and the like, in the storage device 310 (for example, an H/D) of the server 300. The voice basic data stored in the H/D 310 is then transmitted to the user terminal 400 via the transmission/reception device 320 (I/O in the figure). On the user terminal 400 side, the voice basic data received via the transmission/reception device 450 is temporarily stored in an H/D (included in the external storage device 30). When data is provided on a CD, DVD, or the like, the disc purchased by the user is mounted in the CD drive or DVD drive of the terminal device 400 and used as the external storage device 30 of the terminal device.
[0029]
Normally, the terminal device 400 on the user side is equipped with an input device 460, a display 470 such as a CRT or liquid crystal display, and a speaker 480. The voice basic data recorded together with image data and the like in the external storage device 30 is first decoded by the decoding unit 410 of the terminal device 400 (which can also be realized in software) into audio data at the playback speed designated by the user, and is then output from the speaker 480. The image data stored in the external storage device 30, on the other hand, is first expanded in the VRAM 432 and then displayed on the display 470 frame by frame (bitmap display). If the reproduction audio data decoded by the decoding unit 410 is stored sequentially in the external storage device 30, so that a plurality of types of reproduction audio data with different playback speeds are prepared there, the user can switch playback among them using the technology described in Japanese Patent No. 2581700.
[0030]
As shown in FIG. 1B, the user listens to the sound output from the speaker 480 while displaying the related image 471 on the display 470. At this time, if the playback speed of only the sound is changed, the display timing of the image may be shifted. Therefore, it is preferable to add in advance information indicating the image display timing to the encoded audio data generated by the audio data generation device 200 so that the decoding unit 410 can control the display timing of the image data.
[0031]
Next, the detailed structures of the audio data generation device 200 and the audio data reproduction device (terminal device 400) shown in FIG. 1(a) will be described with reference to FIG. 2. FIG. 2(a) is a diagram illustrating the configuration of the audio data generation device 200, and FIG. 2(b) is a diagram illustrating the configuration of the terminal device 400 as the audio data reproduction device.
[0032]
As shown in FIG. 2(a), the audio signal captured by the audio data generation device 200 is provided from the information source 10. Of the audio information provided from the information source 10, the audio captured by a microphone and the audio from magnetic tape are both analog audio data, so they are converted into PCM data by the A/D converter 11 (included in the I/O 12) before being input to the audio data generation device 200. Audio information already stored on an MO, CD (including DVD), or H/D is taken into the audio data generation device 200 via the I/O 12 as PCM data. If the captured audio data is compressed, it must first be decompressed by software or the like.
[0033]
The audio data generation device 200 includes: an encoding unit 210 that generates encoded audio data, encoded according to a predetermined rule, from the audio signal (electric signal) from the information source 10 preprocessed as described above; an analysis unit 250 that specifies, as the decoding auxiliary information referred to when the encoded audio data is decoded, at least information on the type of sound of each part constituting the generated sound from a physical quantity (for example, frequency spectrum information) related to the waveform of the audio signal; and a synthesis unit 260 that adds the decoding auxiliary information specified by the analysis unit 250 to the encoded audio data encoded by the encoding unit 210. The encoded audio data and the decoding auxiliary information output from the synthesis unit 260 are recorded on a recording medium 20 such as a CD, DVD, or H/D. The encoded audio data and the decoding auxiliary information may be recorded in separate areas of the recording medium 20.
[0034]
On the user side, the voice basic data provided by data distribution, on a CD, or the like is stored in the external storage device 30 of the terminal device 400, as shown in FIG. 2(b). In accordance with the user's instructions entered via the input means 460, such as a keyboard or a pointing device such as a mouse, the decoding unit 410 decodes the digital data read from the external storage device 30 via the I/O 31 into reproduction audio data that can be reproduced at the designated speed, and also outputs an image synchronization signal D. The decoded reproduction audio data is converted into analog data and then output from the speaker 480 as audio.
[0035]
The decoding unit 410 reads the voice basic data from the external storage device 30 via the I/O 31 and extracts from it the decoding auxiliary information to be referred to when the encoded audio data is decoded. Referring to the information, contained in the extracted decoding auxiliary information, on the sound of each part constituting the generated sound, it determines a playback speed suitable for audio reproduction for each such sound, based on the playback speed information designated by the user. The decoding unit 410 then decodes the encoded audio data while extending or shortening the portion of the audio data corresponding to each part of the generated sound so that it matches the playback speed determined in this way.
[0036]
FIG. 3 is a diagram illustrating the structure of the encoding unit 210 in the audio data generation device 200 described above. First, the encoding unit 210 captures, via a microphone or the like, an audio signal corresponding to the natural-speed speech of a native speaker, sampled at, for example, the 44.1 kHz audio clock of a music CD. The captured audio signal is first filtered to divide it into channels CH #1 to CH #85 (frequency components). The frequency range of the captured audio signal is 75 Hz to 10,000 Hz, and the sampling frequency is 44.1 kHz (a period of 22.68 μs), matching the audio clock of a music CD. The signal is divided into 85 channels (7 octaves + 1 note), and the center frequencies (center f) of channels #1 to #85 form an equal-tempered semitone series (12 equal steps per octave) from 77.78 Hz (D#) to 9,960 Hz (D#).
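The channel layout described above can be reproduced numerically. The following is a minimal sketch that assumes channel #1 sits exactly at 77.78 Hz and that each later channel is one equal-tempered semitone higher; the constant and function names are illustrative and do not come from the specification.

```python
SAMPLE_RATE_HZ = 44_100   # audio clock of a music CD (from the text)
BASE_CENTER_HZ = 77.78    # center frequency of channel #1, D# (from the text)
NUM_CHANNELS = 85         # 7 octaves + 1 note (from the text)


def channel_center_frequencies(base_hz=BASE_CENTER_HZ, count=NUM_CHANNELS):
    """Equal-tempered semitone series: each channel is 2**(1/12) above the last."""
    return [base_hz * 2 ** (k / 12) for k in range(count)]


centers = channel_center_frequencies()
sampling_period_us = 1e6 / SAMPLE_RATE_HZ
print(round(centers[0], 2), round(centers[-1], 2), round(sampling_period_us, 2))
```

Under these assumptions channel #85 lies seven octaves above channel #1, at about 9,956 Hz, which agrees with the stated 9,960 Hz to three significant figures.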
[0037]
From the data divided into channels #1 to #85 as described above, amplitude information is extracted every 2.268 ms (corresponding to 100 samples at 44.1 kHz sampling; where one waveform cannot be formed with 100 samples, the number of samples is increased). Therefore, in this embodiment, the sampling rate (second period) of the amplitude information in each channel #1 to #85 is 441 samples/s (a period of 2.268 ms). The sampling period need not be constant. For example, processing may alternate between different periods, such as capturing and processing 100 samples and then capturing and processing 120 samples.
[0038]
The encoding unit 210 represents the amplitude information of each channel #1 to #85, sampled every 2.268 ms, in 1 byte (8 bits), and generates encoded audio data a1, a2, a3, ..., an of 85 bytes each (85 channels × 1 byte). An image synchronization signal D (1 byte) is added to each of the encoded audio data a1, a2, a3, ..., an in order to control the display timing of a moving image displayed while the corresponding audio is reproduced.
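The frame format described above can be sketched as follows. This is only an illustration under stated assumptions: amplitudes are taken to be normalized to the range 0.0 to 1.0 and mapped linearly to one byte, and the synchronization byte D is placed after the 85 amplitude bytes. Neither the scaling nor the byte order is specified in the text, and the function names are hypothetical.

```python
CHANNELS = 85             # from the text
FRAME_SAMPLES = 100       # PCM samples per amplitude frame (from the text)
SAMPLE_RATE_HZ = 44_100   # from the text


def quantize(amplitude):
    """Clamp to [0.0, 1.0] and map linearly to 0..255 (assumed scale)."""
    return round(min(max(amplitude, 0.0), 1.0) * 255)


def pack_frame(amplitudes, sync_d):
    """Return 85 amplitude bytes followed by the 1-byte sync signal D (assumed order)."""
    if len(amplitudes) != CHANNELS:
        raise ValueError("expected one amplitude per channel")
    return bytes(quantize(a) for a in amplitudes) + bytes([sync_d & 0xFF])


frame = pack_frame([0.0] * 84 + [1.0], sync_d=1)
frames_per_second = SAMPLE_RATE_HZ // FRAME_SAMPLES
bytes_per_second = frames_per_second * len(frame)
print(len(frame), frames_per_second, bytes_per_second)
```

With 441 frames per second and 86 bytes per frame, the stream before any decoding auxiliary information is added comes to 37,926 bytes per second under these assumptions.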
[0039]
On the other hand, the analysis unit 250 of the audio data generation device 200 illustrated in FIG. 3 specifies decoding auxiliary information that is referred to when the encoded audio data generated by the encoding unit 210 is decoded.
[0040]
The decoding auxiliary information includes information, specified from the spectrum information of the captured audio signal, on the sound of each part constituting the generated sound, emphasized position identification information indicating the part to be emphasized, the frequency component to be emphasized, and the like.
[0041]
For example, in this embodiment, as shown in FIG. 4, the sounds of the parts constituting the generated sound comprise: a vowel part (V); consonant parts that appear before and after the vowel part (V) (in the figure, the front consonant part is C_F and the rear consonant part is C_R); transition parts between the vowel part (V) and the front and rear consonant parts C_F and C_R (in the figure, the front transition part is T_F and the rear transition part is T_R); and a pause (P), a silent period appearing between utterances. If the pause (P) were extended in the same way as the sounds of the other parts when playback is slowed, the user's ease of listening could be impaired. In this embodiment, therefore, the pause (P) is further classified into pauses between syllables (P1), pauses between phrases (P2), and pauses between sentences (P3), and these are included among the sounds of the parts to be identified.
[0042]
The decoding auxiliary information s1, s2, s3, ..., sn is a data string prepared for each sampling interval of the encoded audio data a1, a2, a3, ..., an. Each item is about 4 bits in total: 3 bits of information on the sound of each part and 1 bit of emphasized position identification information. In addition, because some frequencies, such as the third formant, are difficult for people from non-English-speaking backgrounds to hear, the decoding auxiliary information may also include information that individually specifies a frequency band to be emphasized (in particular, its center frequency).
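Eight sound types (V, C_F, C_R, T_F, T_R, P1, P2, P3) fit exactly into 3 bits, leaving 1 bit for the emphasis flag. The following is a minimal sketch of one possible 4-bit packing. The numeric codes and the bit positions are assumptions for illustration only, since the text does not assign them.

```python
from enum import IntEnum


class SoundType(IntEnum):
    """Hypothetical 3-bit codes for the eight sound types named in the text."""
    V = 0    # vowel part
    CF = 1   # front consonant part
    CR = 2   # rear consonant part
    TF = 3   # front transition part
    TR = 4   # rear transition part
    P1 = 5   # pause between syllables
    P2 = 6   # pause between phrases
    P3 = 7   # pause between sentences


def pack_aux(sound, emphasized):
    """Pack into 4 bits: bit 3 = emphasis flag, bits 0-2 = sound type (assumed layout)."""
    return (int(bool(emphasized)) << 3) | int(sound)


def unpack_aux(nibble):
    """Inverse of pack_aux: return (SoundType, emphasized)."""
    return SoundType(nibble & 0b111), bool((nibble >> 3) & 1)


s1 = pack_aux(SoundType.P2, emphasized=True)
print(s1, unpack_aux(s1))
```

Information specifying a frequency band to be emphasized would need additional bits beyond these four and is left out of this sketch.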
[0043]
The synthesis unit 260 adds the decoding auxiliary information s1, s2, s3, ..., sn specified by the analysis unit 250 to the encoded audio data a1, a2, a3, ..., an generated by the encoding unit 210 as described above, and writes them to the recording medium 20. The synthesized data generated by the synthesis unit 260 can have various logical structures, as shown in FIGS. 5(a) to 5(c). For example, as shown in FIG. 5(a), the corresponding decoding auxiliary information s1, s2, s3, ..., sn may be attached to each item of encoded audio data a1, a2, a3, ..., an. As shown in FIG. 5(b), the encoded audio data a1, a2, a3, ..., an and the decoding auxiliary information s1, s2, s3, ..., sn may instead be handled as separate groups of data. Further, as shown in FIG. 5(c), the encoded audio data a1, a2, a3, ..., an and the decoding auxiliary information s1, s2, s3, ..., sn may each be divided into a plurality of groups, with each group of encoded audio data paired with the corresponding group of decoding auxiliary information.
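The three logical structures of FIGS. 5(a) to 5(c) can be sketched as simple orderings of the two sequences. This is an illustration only. The function names, the order of data and auxiliary information within a pair, and the group size are assumptions, because the figures themselves are not reproduced in the text.

```python
def layout_interleaved(a, s):
    """FIG. 5(a) style: each a_i is followed by its own s_i."""
    if len(a) != len(s):
        raise ValueError("need one auxiliary item per encoded frame")
    return [item for pair in zip(a, s) for item in pair]


def layout_separate(a, s):
    """FIG. 5(b) style: all encoded data as one group, then all auxiliary data."""
    if len(a) != len(s):
        raise ValueError("need one auxiliary item per encoded frame")
    return list(a) + list(s)


def layout_grouped(a, s, group_size):
    """FIG. 5(c) style: groups of encoded data paired with matching auxiliary groups."""
    if len(a) != len(s):
        raise ValueError("need one auxiliary item per encoded frame")
    out = []
    for i in range(0, len(a), group_size):
        out += list(a[i:i + group_size]) + list(s[i:i + group_size])
    return out


a = ["a1", "a2", "a3", "a4"]
s = ["s1", "s2", "s3", "s4"]
print(layout_interleaved(a, s))
print(layout_separate(a, s))
print(layout_grouped(a, s, group_size=2))
```

The interleaved form suits streaming, since each frame arrives with its own auxiliary information, while the separate form allows the two kinds of data to be stored or transmitted independently.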
[0044]
Next, the structure of the terminal device 400 on the user side that decodes and reproduces audio data will be described.
[0045]
FIG. 6 is a diagram illustrating the structure of the decoding unit 410 of the terminal device 400, and FIG. 7 is a diagram illustrating the structure of the PCM data generation unit 415 in the decoding unit 410 illustrated in FIG. 6.
[0046]
As shown in FIG. 6, the voice basic data is taken into the decoding unit 410 from the external storage device 30 via the I/O 31. The voice basic data stored in the external storage device 30 is either data distributed via information communication means such as a computer network or satellite, or data stored on a CD or the like purchased by the user. If the voice basic data stored in the external storage device 30 is compressed, it is decompressed by software or the like as preprocessing for decoding.
[0047]
In the decoding unit 410, the extraction unit 411 first extracts the decoding auxiliary information s1, s2, s3, ..., sn from the voice basic data read from the external storage device 30. Of the extracted decoding auxiliary information s1, s2, s3, ..., sn, the information on the sound of each part constituting the generated sound (V, C_F, C_R, T_F, T_R, P1, P2, and P3) is input to the time coefficient generation unit 412 together with the user's instruction information from the input means 460. The emphasized position identification information is input to the amplitude emphasis coefficient generation unit 413 together with the user's instruction information from the input means 460. The information on the frequency component to be emphasized (center CH) is input to the enhancement band data generation unit 414, again together with the user's instruction information from the input means 460.
[0048]
In this embodiment, a plurality of playback levels H3 to S6 are provided, as shown in the table of FIG. 8, as the user's playback speed instruction information input from the input means 460. As the table of FIG. 8 shows, playback level N is the standard playback speed (natural speed); the playback speed increases toward H3 and decreases toward S6. Each level is expressed as a reproduction time ratio and a playback speed magnification relative to natural speed.
[0049]
The time coefficient generation unit 412 holds a preset table, shown in FIG. 9, of playback speed magnifications determined by the combination of the playback level (instructed by the user) and the type of sound of each part constituting the generated sound, and outputs the playback speed magnification to the PCM data generation unit 415 based on this table.
[0050]
The amplitude emphasis coefficient generation unit 413 holds two types of tables, as shown in FIG. 10. FIG. 10(a) is the table applied when the decoding auxiliary information s1, s2, s3, ..., sn extracted by the extraction unit 411 contains no emphasized position identification information (when there is no emphasis instruction). FIG. 10(b) is the table applied when the decoding auxiliary information s1, s2, s3, ..., sn contains emphasized position identification information (when there is an emphasis instruction). The parameters shown in these tables are magnifications relative to the amplitude of each frequency component of the encoded audio data separated from the decoding auxiliary information by the extraction unit 411.
[0051]
When the decoding auxiliary information s1, s2, s3, ..., sn contains instruction information for a frequency band to be emphasized (designated by its center CH), the enhancement band data generation unit 414 generates, as shown in FIG. 11(a), parameters for changing the amplitude of each frequency component over a total of 11 CHs: the center CH, the 5 adjacent CHs on the low-frequency side, and the 5 adjacent CHs on the high-frequency side. As shown in FIG. 11(b), the enhancement band data generation unit 414 holds a table in which the amplitude magnification of the center CH is preset for each playback level, and the amplitude magnification of the center CH is determined according to the playback speed instruction information input from the input means 460. The amplitude magnification of each CH adjacent to the center CH is set by linear approximation relative to the amplitude magnification of the center CH, as shown in FIG. 11(a), and is output to the PCM data generation unit 415.
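One way to picture the 11-channel emphasis band is as a triangular gain profile around the center CH. The following is a minimal sketch under an explicit assumption: the magnification falls linearly from the center value to 1.0 just outside the band. The actual slope in FIG. 11(a) is not given in the text, and the function name and the example magnification of 2.0 are hypothetical.

```python
def band_gains(center_ch, center_gain, num_channels=85, half_width=5):
    """Per-channel amplitude magnification with a linear taper around center_ch.

    Channels are numbered from 1. Channels outside the band keep a gain of 1.0.
    The taper reaching 1.0 one channel beyond the band edge is an assumption.
    """
    gains = [1.0] * num_channels
    low = max(1, center_ch - half_width)
    high = min(num_channels, center_ch + half_width)
    for ch in range(low, high + 1):
        distance = abs(ch - center_ch)
        weight = 1.0 - distance / (half_width + 1)
        gains[ch - 1] = 1.0 + (center_gain - 1.0) * weight
    return gains


gains = band_gains(center_ch=40, center_gain=2.0)
emphasized = [ch for ch, g in enumerate(gains, start=1) if g != 1.0]
print(emphasized[0], emphasized[-1], len(emphasized))
```

In the embodiment the center magnification would come from the table of FIG. 11(b) for the selected playback level rather than being passed in directly.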
[0052]
As shown in FIG. 7, the PCM data generation unit 415 includes a sine wave generator 422 that generates a frequency component corresponding to each channel. The control unit 421 newly generates an amplitude coefficient from the amplitude information of each frequency component of the encoded audio data from the extraction unit 411 and the amplitude magnification data from the enhancement band data generation unit 414, and the multiplier 423 multiplies this amplitude coefficient by the data (indicating the reference amplitude) from the sine wave generator 422. PCM data is then obtained by having the adder 424 sum the resulting data of all frequency components. The control unit 421 also extends or shortens the decoded audio data by adjusting the number of times each of the encoded audio data a1, a2, a3, ..., an is output, based on the playback speed magnification data from the time coefficient generation unit 412. Because the output frequency of the image synchronization signal D, which is output for each of the encoded audio data a1, a2, a3, ..., an, is adjusted at the same time, the audio data reproduction side can control the image display timing.
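The decoding step above amounts to additive sine-wave synthesis in which each frame is output a variable number of times. The following is a minimal sketch, not the device's actual algorithm. It assumes that a speed magnification of 0.5 means a frame lasts twice as long, carries fractional repeats forward to the next frame, and uses amplitudes that are already scaled. All names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 44_100   # from the text
FRAME_SAMPLES = 100       # PCM samples per frame at natural speed (from the text)


def repeat_counts(speed_factors):
    """Number of times each frame is output; the fractional remainder is carried over."""
    counts, carry = [], 0.0
    for factor in speed_factors:
        carry += 1.0 / factor
        n = int(carry)
        counts.append(n)
        carry -= n
    return counts


def synthesize(frames, speed_factors, centers_hz):
    """Sum one sine wave per channel, repeating frames to stretch or shrink time."""
    pcm, t = [], 0
    for amplitudes, n in zip(frames, repeat_counts(speed_factors)):
        for _ in range(n * FRAME_SAMPLES):
            pcm.append(sum(a * math.sin(2 * math.pi * f * t / SAMPLE_RATE_HZ)
                           for a, f in zip(amplitudes, centers_hz)))
            t += 1
    return pcm


# Hypothetical single-channel input: two frames of a 441 Hz tone,
# the first played at half speed and the second at natural speed.
pcm = synthesize([[1.0], [1.0]], speed_factors=[0.5, 1.0], centers_hz=[441.0])
print(repeat_counts([0.5, 1.0, 2.0, 2.0]), len(pcm))
```

Because the sample index t runs continuously across repeated frames, the sine waves keep their phase and the pitch is unchanged while the duration varies. A synchronization signal emitted once per output frame would be repeated or dropped in the same way.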
[0053]
As described above, the data decoded by the PCM data generation unit 415 is decoded data adjusted along the time axis according to the user's playback speed instruction information. In the multiplier 416, this decoded data is multiplied by the magnification parameter determined by the amplitude emphasis coefficient generation unit 413 from the table of either FIG. 10(a) or FIG. 10(b), yielding the reproduction audio data. The reproduction audio data is converted into analog data by the D/A converter 417 and output from the speaker 480 as audio at the playback speed designated by the user.
[0054]
On the other hand, the terminal device 400 can also display image data read from the external storage device 30. FIG. 12 is a diagram showing the structure of a bitmap display.
[0055]
The bitmap display includes a memory 432 (VRAM) that stores one or more frames, and the drawing unit 431 writes into it the image data read from the external storage device 30 via the I/O 32 (compressed data is first decompressed by software or the like). The image data written in the memory 432 is displayed on the display 470 frame by frame via the switch S/W 433. The timing controller 434 controls the write timing of the drawing unit 431 and the switching timing of the S/W 433.
[0056]
In this embodiment, the timing of audio reproduction and image display is matched by counting the image synchronization signal D output from the PCM data generation unit 415, as shown in FIG. 13. That is, if the data in the memory 432 is rewritten every three clocks, for example, during audio reproduction at natural speed, then even when the PCM data generation unit 415 generates reproduction audio data with a lower playback speed, the data can be rewritten in step with the delayed audio data (that is, the image display timing can be matched to the audio reproduction timing).
[0057]
That is, in this embodiment, let Tv be the reference rewrite cycle for one frame, or for N frames (N being a positive rational number, which may be 1/2 or 2/3), of the image data stored in the memory 432; let Ta be the reference reproduction time period determined from the digitization sampling cycle of the audio (for example, the audio clock of a music CD); and let Ta′ (> Ta) be the reproduction time period determined for each sound of each part constituting the generated sound based on the instructed playback speed information. The next image data rewriting operation is then suspended for Tv × ((Ta′ / Ta) − 1) after the end of the image data rewriting operation at a given timing.
[0058]
The timing adjustment between audio reproduction and image display is not limited to the embodiment described above. For example, where Ta is the reference reproduction time period determined from the digitization sampling period of the audio, and Ta′ (> Ta) is the reproduction time period determined for each sound of each part constituting the generated sound based on the instructed playback speed information, the average rewrite cycle of the image data stored in the memory 432 may be set to (Ta′ / Ta) times the rewrite cycle specified in advance.
[0059]
[Effect of the invention]
As described above, according to the present invention, voice basic data is obtained from encoded audio data, encoded according to a predetermined rule from an audio signal captured from a microphone or the like or stored in advance, and decoding auxiliary information, which is referred to when the encoded audio data is decoded and includes the type of sound of each part constituting the generated sound. By providing such voice basic data to the user via a predetermined recording medium or distribution method, the user side can decode a plurality of types of reproduction audio data at arbitrarily set speeds. As a result, the amount of voice basic data that the data provider must supply to the user is reduced, which saves space on the recording medium and shortens the data distribution time.
[0060]
Further, by adding an image synchronization signal together with decoding auxiliary information to the encoded audio data, it is possible to reproduce an image that matches the reproduction speed of the decoded reproduction audio data.
[Brief description of the drawings]
FIG. 1 is a diagram for conceptually explaining each embodiment of the present invention.
FIG. 2(a) is a block diagram showing a schematic configuration of an audio data generation device that realizes the audio data encoding method according to the present invention, and FIG. 2(b) is a block diagram showing a schematic configuration of an audio data reproduction device that reproduces audio from the generated voice basic data.
FIG. 3 is a block diagram illustrating the configuration of the encoding unit in the audio data generation device illustrated in FIG. 2.
FIG. 4 is a diagram conceptually illustrating a part of decoding auxiliary information necessary for decoding encoded audio data.
FIG. 5 is a diagram for conceptually explaining encoded data obtained by an audio data encoding method according to the present invention.
FIG. 6 is a block diagram illustrating a configuration of an audio data reproduction device (terminal device).
FIG. 7 is a block diagram showing a configuration of a PCM data generation unit in the audio data reproduction device shown in FIG. 6;
FIG. 8 is a table showing an example of a reproduction time ratio and a reproduction speed magnification set for each reproduction level based on a reproduction speed of natural speed.
FIG. 9 is a table referred to by the time coefficient generation unit shown in FIG. 6, showing an example of the reproduction speed set for each type of sound of each part constituting the generated sound, expressed as a magnification.
FIG. 10 is a table referred to by the amplitude emphasis coefficient generation unit shown in FIG. 6, showing an example of the amplitude set for each type of sound of each part constituting the generated sound, expressed as a magnification.
FIG. 11(a) is a diagram, referred to for the enhancement band data generation unit shown in FIG. 6, for explaining the editing operation on the designated frequency band data, and FIG. 11(b) is a table showing the amplitude of the designated frequency band (center CH) as a magnification relative to the natural-speed reproduction level.
FIG. 12 is a diagram showing a configuration of a display device that displays image data in synchronization with audio reproduction by the audio data reproduction device (terminal device) shown in FIG. 6;
FIG. 13 is a time chart for explaining image display timing synchronized with an audio reproduction operation;
[Explanation of symbols]
10 ... Information source, 20, 30, 310 ... Recording medium, 200 ... Audio data generation device, 210 ... Encoding unit, 250 ... Analysis unit, 260 ... Synthesis unit, 400 ... Terminal device (PC), 410 ... Decoding unit.

Claims

1. An audio data encoding method comprising: a first step of generating a plurality of encoded audio data by encoding, at a predetermined sampling period, a plurality of frequency components divided from an audio signal, each of the encoded audio data being composed of the encoded data of the plurality of frequency components encoded at the same timing;
a second step of specifying, as decoding auxiliary information that is prepared for each of the plurality of encoded audio data generated at each predetermined sampling period and is referred to when the plurality of encoded audio data are decoded, at least information on the type of sound of each part constituting the generated sound from a physical quantity related to the waveform of the audio signal; and
a third step of adding the plurality of pieces of decoding auxiliary information specified in the second step to the plurality of encoded audio data generated in the first step.

2. The audio data encoding method according to claim 1, wherein the physical quantity related to the wave of the audio signal includes frequency spectrum information of the audio signal.

3. The audio data encoding method according to claim 1, wherein each of the plurality of pieces of decoding auxiliary information further includes emphasized position identification information indicating a position in the time-axis direction to be emphasized in the amplitude direction.

4. The audio data encoding method according to claim 1, wherein each of the plurality of pieces of decoding auxiliary information includes information that individually indicates a frequency component to be emphasized among the frequency components of the decoded audio data.

5. The audio data encoding method as described, wherein information indicating the display timing of image data to be displayed on a predetermined display means is further added to each of the plurality of encoded audio data.