JPH04176244A

JPH04176244A - Sound information processor

Info

Publication number: JPH04176244A
Application number: JP2302398A
Authority: JP
Inventors: Akira Ichikawa (市川　熹)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1990-11-09
Filing date: 1990-11-09
Publication date: 1992-06-23

Abstract

PURPOSE: To allow the user to hear only the keyword speech estimated to be contained in an original speech, by extracting at least a pitch structure (pattern) from the target original speech, estimating the positions at which keyword speech exists from the output information of an intonation (stress) information extraction means, and copying the estimated keyword speech waveforms for the estimated periods and storing them collectively. CONSTITUTION: An intonation information extraction means 30 of a speech processing means 12 calculates the short-time power at a prescribed interval from the fetched speech data 24, regards a period in which the short-time power continuously exceeds a prescribed threshold for a prescribed time or longer as a period in which speech is present, and obtains the pitch frequency of the speech in that period. When the short-time power at the head of a period is at or above a certain level and continues for a prescribed time or longer, the probability that the period contains a keyword is higher. The period satisfying this condition, lengthened before and after by a prescribed length, is output as the estimated position of a keyword. The result is then associated with the original speech 24 and stored in a storage means 2.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to improvements in speech information processing devices that handle the speech signal itself, such as voice mail systems.

[Conventional technology]

It is said that in the human brain the center of thought and the center that processes speech almost coincide. Spoken expression is therefore very convenient for the user: one simply says what comes to mind, without the trouble of writing it down or typing it on a keyboard.

Speech information processing devices that exploit this advantage of speech, such as voice mail systems, are therefore coming into use.

For example, a voice mail system handles the speech signal itself: it records the input speech in memory and plays it back as needed so that the user can listen to the contents of the memo. In some cases the recording is also edited according to manual instructions from outside, for example by rearranging its order, cutting out part of it, or adding to it.

Devices have also been proposed in which the speech is associated with a separately entered electronic document or the like and used as a voice memo (annotation).

With a written document, on the other hand, keywords usually catch the eye at a glance, and the reader can judge whether the document is of interest. With speech, by its nature, one cannot judge whether a message is important without listening to all of it in sequence. With voice mail and voice memos, therefore, there is no easy way to know whether each message should be listened to or whether it can be ignored or postponed. As a result, even for long messages consisting of one or more spoken sentences, the user has no choice but to listen to everything, wasting time, and moreover spending it in almost real time.

As a way of solving this problem, a method has been proposed, for example in Document (1) JP-A-2-202258, in which subject information (keywords and the like) is entered as text from the keyboard of a data terminal, separately from the target speech. This method, however, requires a separate data terminal keyboard or the like, cannot be used for speech sent by simple means such as a telephone alone, and obliges the sender to go to the trouble of entering the text information.

[Problem to be solved by the invention]

The object of the present invention is to remedy these drawbacks and to provide an easy-to-use speech information processing device that relies on signal-level processing, without requiring subject information (keywords and the like) to be entered as text from a data terminal keyboard separately from the target speech, and without resorting to advanced processing that reaches the level of language, such as speech recognition or understanding.

[Means to solve the problem]

To achieve this object, the present invention has at least: a means for extracting intonation information; a means for estimating the position(s) at which keyword speech (one or more) exists; a means for estimating the section(s) of the estimated keyword speech (one or more); a means for copying each of the one or more estimated keyword speech waveforms of the estimated sections and storing them collectively; a means for associating the original speech with the stored estimated keyword speech waveforms; a means for reading out the copied and stored estimated keyword speech waveforms and reproducing them as speech; and a means for reading out the original speech waveform and reproducing it as speech.

[Operation]

As becomes apparent when a long passage produced by early synthetic speech is compared with the same passage spoken by a person, synthetic speech utters important words (keywords) in the same tone as all other words, so the content cannot be understood unless every sound is listened to carefully. Human speech, by contrast, can be understood to some extent even while the listener is doing something else. This is not merely because synthetic speech lacks phonemic clarity. Human speech carries information indicating which parts are important, whereas early synthetic speech is synthesized uniformly throughout.

The information in human speech that indicates which parts are important is carried mainly in the structure of intonation. Intonation here refers to the structure of temporal change in the pitch of the voice, the structure of temporal change in the loudness of the voice, and the structure of temporal change in the timing (duration and the like) of the phonemes that make up the speech.

When a person utters an important word, that part is observed, as a natural and unconscious psychological result, to be spoken slowly (temporal structure), loudly (structure of temporal change in loudness), and at a high pitch (structure of temporal change in pitch). See Document (2): Komatsu, Ohira, Ichikawa, "Conversational speech understanding based on syntax estimation using prosodic information and word spotting," IEICE Transactions D, J71-D, 7, pp. 1218-1228; Document (3): Ohira, Komatsu, Ichikawa, "A sentence structure estimation method for spoken conversational sentences using prosodic information," IEICE Transactions A, J72-A, 1, pp. 23-31; and others. As shown in those documents, an important word has at least one of these features, and the pitch structure (pattern) plays a particularly large role. As the same documents also show, the positions of important words can be estimated automatically.

The present invention makes active use of the features of speech described above. With these features in mind, the operation of each means is explained below.

The means for extracting intonation information has the function of extracting at least the pitch structure (pattern) from the target original speech.

The keyword presence position estimation means has the function of estimating, from the output information of the intonation information extraction means, the position(s) at which keyword speech (one or more) exists.

The means for estimating the estimated keyword speech section(s) (one or more) estimates those sections on the basis of the output information of the keyword speech presence position estimation means.

The copy storage means copies each of the one or more estimated keyword speech waveforms of the sections estimated by the estimated keyword speech section estimation means and stores them collectively.

The associating means associates the target original speech with the copied and stored estimated keyword speech waveforms.

The keyword speech waveform output reproduction means has the function of reading out the copied and stored estimated keyword speech waveforms and reproducing them as speech.

The original speech waveform output reproduction means has the function of reading out the target original speech waveform on the basis of the information held by the associating means and reproducing it as speech.

A person using the device according to the present invention can, by means of the keyword speech waveform output reproduction means, listen to only the keyword speech estimated to be contained in the original speech. If, from the one or more estimated keywords that are output, the user judges that the entire original speech should be heard, it can be heard by using the function of the original speech waveform output reproduction means.

If the user judges that there is no need to hear the entire original speech, the user can simply move on, as needed, to listening to the keyword speech estimated to be contained in the next (another) original speech, so that processing can proceed efficiently.

[Example]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram illustrating an embodiment of the speech information processing device according to the present invention.

In this embodiment, the hardware comprises: an information processing means 1 that controls the entire system and is used to create the user's application systems; a means 2 that stores the various data used for that information processing and the application systems; a means 5 for selecting the input from speech input means such as a microphone 3 and a telephone 4; a means 6 for amplifying and frequency-filtering the analog speech signal from the selecting means; an encoding means 7 for sampling the output signal of means 6 and converting it into digital quantities; a decoding means 8 for converting a speech signal expressed in digital quantities into an analog speech signal; a means 9 for amplifying and frequency-filtering the output signal of means 8; a means 11 for selecting a speech output means such as a speaker 10 or a telephone; a speech processing means 12 that controls speech input and output and performs speech information processing; a speech processing storage means 13 for temporarily storing speech signals expressed in digital quantities and the like; a means 14 for storing speech processing conditions, a means 15 for storing management information for a plurality of speech data, and a means 16 for storing speech reproduction conditions, all provided within the speech processing storage means 13; an analysis means 17 that analyzes the output signal of means 6 and outputs the loudness of the speech and the voiced/unvoiced decision at fixed intervals; and a means 18 (memory access conflict prevention means 18) that prevents conflicts between accesses from the information processing means 1 to the speech processing storage means 13 and accesses from the speech processing means 12 to the speech processing storage means 13.

The hardware further comprises an interface section 21, consisting of an access means 19 from the speech processing means 12 and the like to the information processing means 1 or the storage means 2 and the like, and an access means 20 from the information processing means 1 to the speech processing means 12 and the like; a communication means 23 for communicating with a network; and a file 27 that records the speech data 24, the extracted keyword speech 25, and the association information 26 linking the two. The means are connected by a main bus 28 or a speech bus 29, each consisting of data lines (carrying data including speech data), address lines, and control lines. The means that perform speech processing are collectively referred to as the speech processing system 22.

The information processing means 1 and the speech processing means 12 are assumed to be implemented with general-purpose microprocessors.

In this embodiment, the processing means that form the core of the invention are realized mainly as software that uses the above hardware, running on the information processing means 1 and the speech processing means 12, both of which are general-purpose microprocessors.

The software constituting the means 30 for extracting intonation information, the means 31 for estimating the position(s) of keyword speech (one or more), the means 32 for estimating the estimated keyword speech section(s), and the means 33 for copying and collectively storing the estimated keyword speech waveform(s) of the estimated sections resides mainly on the speech processing means 12. The software constituting the means 34 for associating the original speech with the stored estimated keyword speech waveforms, the means 35 for reading out the copied and stored estimated keyword speech waveforms and reproducing them as speech, and the means 36 for reading out the original speech waveform and reproducing it as speech is configured mainly on the information processing means 1 and the storage means 2 as part of the system's control software.

The operation of this embodiment is described below with reference to the drawings.

Speech that arrives via the communication means 23 is stored in the file 27 as speech data 24. In this embodiment the speech is assumed to be digitized. Needless to say, speech in analog form can easily be digitized by passing it through an ordinary analog/digital converter, so this assumption does not limit the invention in any way. Besides the communication means 23, speech may also be input as an analog speech signal from a speech input means such as the microphone 3 or the telephone 4, passed through the means 6 for amplification and frequency filtering, the encoding means 7 that samples the output signal of means 6 and converts it into digital quantities, and the access means 19 in the interface section 21, and then stored in the file 27 as speech data 24.

The processing described below is configured to start automatically once the speech data 24 has been completely stored in the file 27 and to run as a batch. In ordinary use, where the speech data 24 is read out at some later time, the keyword speech has therefore already been extracted and is immediately available. (If the speech data 24 is read out immediately after storage is complete, the keyword speech has not yet been extracted.)

A function is of course also provided for starting the processing in response to an instruction from outside. Which of the two modes is used is determined by an instruction set in the means 14 for storing speech processing conditions.

Under control information from the information processing means 1, which controls the entire system, the intonation information extraction means 30 on the speech processing means 12 fetches the speech data 24 from the file 27 and analyzes its intonation, in accordance with the conditions set in the means 14 for storing speech processing conditions and the means 15 for storing management information for a plurality of speech data, both provided within the speech processing storage means 13.

An example of the intonation information extraction processing in the intonation information extraction means 30 is briefly described with reference to FIGS. 2 and 3. The details are reported in Documents (2) and (3) cited above.

In FIG. 2, the short-time power of the fetched speech data 24 is calculated at fixed intervals (usually 10 to 20 milliseconds). A section in which the short-time power stays above a fixed threshold for at least a fixed time is regarded as a section in which speech is present, and the pitch frequency is obtained for the speech in that section. The decision conditions are set in the means 14 for storing speech processing conditions, provided within the speech processing storage means 13. Many widely known methods exist for obtaining the short-time power, the pitch frequency, and the speech sections, and the choice among them is not essential to the present invention, so the details are omitted. They are described in detail in, for example, Document (4): Niimi, "Speech Recognition" (音声認識), Kyoritsu Shuppan, Information Science Course E·19·3.
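The short-time power and speech-section steps described above can be sketched in Python as follows. This is a minimal illustration and not the embodiment's implementation: the function names, the use of non-overlapping frames, and the mean-square definition of power are assumptions, and pitch extraction is left out because the text defers it to known methods.

```python
def short_time_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame (assumed definition)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def speech_intervals(powers, threshold, min_frames):
    """Return (start, end) frame pairs, end exclusive, where the power stays
    above `threshold` for at least `min_frames` consecutive frames."""
    intervals = []
    run_start = None
    # The -inf sentinel closes a run that reaches the end of the data.
    for i, p in enumerate(list(powers) + [float("-inf")]):
        if p > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_frames:
                intervals.append((run_start, i))
            run_start = None
    return intervals
```

With a frame length corresponding to 10 to 20 milliseconds, `min_frames` plays the role of the "fixed time" in the text; both values would come from the stored processing conditions.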

Next, the shape of the fundamental frequency contour is obtained as a set of straight-line segments (a polyline) according to the flowchart of FIG. 2. When the mean difference between the approximating line and the analyzed values is at or above a certain value, the section is regarded as consisting of several utterance sections: the position of maximum error is taken as a candidate split point, the sections before and after it are denoted A and B, and the coupling coefficient Ri(A-B) shown in the figure is computed. When its value satisfies a prescribed condition, each part is treated as a new section and the same processing is repeated. The meaning of each variable used to compute the coupling coefficient is as shown in FIG. 3.
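The recursive splitting can be illustrated with the following Python sketch. It is a simplification under stated assumptions: the coupling coefficient Ri(A-B) is defined only in the figures, so the sketch splits on the mean fitting error alone; the least-squares line fit, the `min_len` guard, and all names are likewise assumptions made for illustration.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    if n < 2:
        return 0.0, (ys[0] if ys else 0.0)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def split_segments(f0, start, end, max_mean_error, min_len=3):
    """Split f0[start:end] at the point of largest fitting error until each
    piece is approximated well enough by one straight line."""
    ys = f0[start:end]
    slope, intercept = fit_line(ys)
    errors = [abs(y - (slope * x + intercept)) for x, y in enumerate(ys)]
    if len(ys) < 2 * min_len or sum(errors) / len(ys) <= max_mean_error:
        return [(start, end)]
    # Keep at least min_len points on each side of the split candidate.
    candidates = range(min_len, len(ys) - min_len + 1)
    cut = start + max(candidates, key=lambda i: errors[i])
    return (split_segments(f0, start, cut, max_mean_error, min_len)
            + split_segments(f0, cut, end, max_mean_error, min_len))
```

For a rise-fall pitch contour, the split lands at the peak, giving one rising and one falling segment. A fuller version would accept a split only when the coupling coefficient condition of FIG. 2 is met.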

Next, the processing of the keyword speech presence position estimation means 31 is described.

Splitting stops when it has progressed roughly to the level of the bunsetsu (phrase) and further splitting is judged unnecessary. In most cases a bunsetsu consists of an independent word followed by several attached words, so the probability that a keyword lies at the head of each section can be regarded as high.

Furthermore, since important words are naturally spoken loudly and slowly, the probability of a keyword is higher still when the short-time power at the head of a section is at or above a certain level and continues for at least a certain time. The section satisfying these conditions, lengthened before and after by a fixed length (about 200 to 400 milliseconds), is output as the estimated position of a keyword.
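A minimal Python sketch of this head-of-section test follows. The names and the frame-based representation are assumptions for illustration; `segments` stands for the phrase-level sections obtained from the pitch contour, and `powers` for the per-frame short-time power.

```python
def estimate_keyword_positions(powers, segments, head_power,
                               min_head_frames, margin_frames):
    """For each (start, end) segment, test whether the power at its head stays
    at or above `head_power` for at least `min_head_frames` frames; if so,
    return that head run widened by `margin_frames` on both sides."""
    positions = []
    for seg_start, seg_end in segments:
        run_end = seg_start
        while run_end < seg_end and powers[run_end] >= head_power:
            run_end += 1
        if run_end - seg_start >= min_head_frames:
            positions.append((max(0, seg_start - margin_frames),
                              min(len(powers), run_end + margin_frames)))
    return positions
```

With 10-millisecond frames, the 200 to 400 millisecond extension mentioned in the text corresponds to a `margin_frames` of roughly 20 to 40.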

Next, the processing of the keyword speech section estimation means 32 is described with reference to FIG. 4. In this embodiment the keyword speech section estimation means 32 uses the well-known continuous dynamic programming (DP) matching method. Continuous DP matching is also described in detail in Document (4) cited above, and can easily be implemented by referring to that document.

The keyword speech section estimation means 32 treats each section whose position was estimated by the keyword speech presence position estimation means 31 as the reference pattern for continuous DP matching, treats the speech of the speech data 24 as the input, and performs matching continuously. As the speech feature parameters for matching, the well-known LPC-family parameters (also described in detail in Document (4)) are computed by program. For this purpose a function for recording the matching path, such as that shown in Document (5) JP-A-57-207297, is provided at each analysis position on the reference pattern side (the section estimated to contain keyword speech), together with a register that records the number of matches at each analysis position. If, when matching against the whole of the speech data 24 has finished, the number of times the matching quality reached at least a certain value is itself at or above a certain value, the speech contained in the estimated section is even more likely to be keyword speech. At the same time, the part of the estimated section in which the registers recording the number of matches at each analysis position hold at least a certain value is very likely to be the stable part of the keyword speech. This part is taken as the keyword speech section. This processing is carried out for every section estimated to contain keyword speech.
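The matching-with-counters idea can be sketched as a small subsequence DP in Python. This is an illustrative simplification, not the method of Document (5): the city-block frame distance, the three-way path constraint, the per-length normalization, and the rule that a register is incremented only where an accepted path aligns a reference frame closely (within `frame_ok`) are all assumptions, and short feature lists stand in for LPC parameters.

```python
def frame_dist(a, b):
    """City-block distance between two feature vectors (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def spot_and_count(template, utterance, accept, frame_ok):
    """Match `template` against every position of `utterance`.
    Returns (hits, counts): hits is the number of input frames at which a match
    ends with normalized cost <= accept; counts[j] is the number of close
    alignments of template frame j over all accepted paths."""
    J, I = len(template), len(utterance)
    inf = float("inf")
    dist = [[frame_dist(u, t) for t in template] for u in utterance]
    cost = [[inf] * J for _ in range(I)]
    back = [[None] * J for _ in range(I)]
    for i in range(I):
        cost[i][0] = dist[i][0]  # a match may start at any input frame
        for j in range(1, J):
            best, prev = inf, None
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and cost[pi][pj] < best:
                    best, prev = cost[pi][pj], (pi, pj)
            cost[i][j] = dist[i][j] + best
            back[i][j] = prev
    hits, counts = 0, [0] * J
    for i in range(I):
        if cost[i][J - 1] / J > accept:
            continue
        hits += 1
        node = (i, J - 1)
        while node is not None:  # trace the recorded matching path
            ni, nj = node
            if dist[ni][nj] <= frame_ok:
                counts[nj] += 1
            node = back[ni][nj]
    return hits, counts


def stable_frames(counts, min_count):
    """Template frame indices whose register value reaches min_count."""
    return [j for j, c in enumerate(counts) if c >= min_count]
```

In this sketch `hits` corresponds to the overall number of good matches and `counts` to the per-position registers; the frames returned by `stable_frames` would delimit the keyword speech section.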

When the keyword speech is output as speech by the keyword speech waveform output reproduction means described later, the effect of section estimation errors can be reduced by adding, before and after each such section, waveforms multiplied by a weight that increases monotonically from 0 to 1 and by a weight that decreases monotonically from 1 to 0, respectively (fade-in and fade-out functions).
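The fade-in and fade-out step can be sketched as below. The text only requires monotonic weights; the linear ramp, the function name, and the sample-index interface are assumptions made for this illustration.

```python
def keyword_with_fades(samples, start, end, fade_len):
    """Return samples[start:end] preceded by a faded-in lead and followed by a
    faded-out tail, each up to `fade_len` samples taken from the original."""
    lead_start = max(0, start - fade_len)
    tail_end = min(len(samples), end + fade_len)
    lead = samples[lead_start:start]
    tail = samples[end:tail_end]
    # Linear weights strictly between 0 and 1 (assumed ramp shape).
    fade_in = [x * (k + 1) / (len(lead) + 1) for k, x in enumerate(lead)]
    fade_out = [x * (len(tail) - k) / (len(tail) + 1) for k, x in enumerate(tail)]
    return fade_in + list(samples[start:end]) + fade_out
```

Because the lead and tail come from the original speech around the estimated section, a boundary that was placed slightly too tight still lets the clipped sound through at reduced level instead of cutting it off abruptly.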

The keyword speech section information obtained for all the sections estimated to contain keyword speech is associated with the original speech 24 and recorded on the storage means 2, in a predetermined format, by the means 34 for associating the original speech with the stored estimated keyword speech waveforms, which is configured on the information processing means 1 and the storage means 2 as part of the system's control software. The program that carries out this processing is extremely simple, and its implementation needs no further description.
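As an illustration of what such an association might look like, the following Python sketch uses a hypothetical record layout. The text says only that the format is predetermined, so the field names, the sample-index representation, and the lookup helper are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordSection:
    start_sample: int  # first sample of the keyword section in the original speech
    end_sample: int    # one past the last sample


@dataclass
class AssociationRecord:
    speech_id: str                      # hypothetical identifier of the original speech
    sections: list = field(default_factory=list)


def sections_for(records, speech_id):
    """Look up the keyword sections recorded for one original speech item."""
    for record in records:
        if record.speech_id == speech_id:
            return record.sections
    return []
```

A playback routine could then read the listed sections for a quick preview and fall back to the whole item when the user asks for the full message.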

The keyword speech waveform output reproduction means follows a simple program configured on the information processing means 1 and the storage means 2 as part of the system's control software. Using the keyword speech section information recorded on the storage means 2, it reads out the speech of the corresponding sections of the original speech 24 recorded on the storage means 2, with the fade-in and fade-out functions described above applied. The speech then passes through the decoding means 8, which converts the speech signal expressed in digital quantities into an analog speech signal, and the means 9, which amplifies and frequency-filters the output signal of means 8, and is output as speech from the speech output means, such as the speaker 10 or a telephone, selected by the selecting means 11.

Needless to say, the weighting used for the fade-in and fade-out functions described above is easily realized on the speech processing means 12.

The means for outputting the original speech waveform 24 likewise follows a program configured on the information processing means 1 as part of the system's control software. Using the management information for a plurality of speech data recorded in the storage means 15 within the speech processing storage means 13, it reads out the original speech 24 recorded on the storage means 2. The speech then passes through the decoding means 8, which converts the speech signal expressed in digital quantities into an analog speech signal, and the means 9, which amplifies and frequency-filters the output signal of means 8, and is output as speech from the speech output means, such as the speaker 10 or a telephone, selected by the selecting means 11.

[Effect of the invention]

As described above, according to the present invention the user can listen to only the keyword speech estimated to be contained in the original speech. If, from the one or more estimated keywords that are output, the user judges that the entire original speech should be heard, it can be heard by using the function of the original speech waveform output reproduction means. If the user judges that there is no need to hear the entire original speech, the user can simply move on, as needed, to listening to the keyword speech estimated to be contained in another original speech. The content of the original speech can thus be estimated in a short time, and processing that uses speech can proceed efficiently.

[Brief description of the drawings]

FIG. 1 is a diagram illustrating the configuration of an embodiment of the present invention. FIG. 2 is a flowchart illustrating the processing of the intonation information extraction section. FIG. 3 is a diagram explaining the symbols of the formulas used in the intonation information extraction process. FIG. 4 is a flowchart illustrating the keyword audio section estimation process.

1: Information processing means
2: Storage means
3: Microphone
4: Telephone
5: Audio input means selection
6: Analog audio signal amplification and frequency filtering means
7: Analog/digital conversion and quantization means
8: Digital/analog signal conversion and decoding means
9: Amplification and frequency filtering means
10: Speaker
11: Audio output means selection means
12: Audio input/output control and audio processing means
13: Storage means for audio processing
14: Audio processing condition storage means
15: Multiple audio data management information storage means
16: Audio reproduction condition storage means
17: Analysis means
18: Memory access contention prevention means
19: Access means
20: Access means
21: Interface section
22: Audio processing system
23: Network communication means
24: Audio data
25: Keyword audio
26: Association information
27: File
28: Main bus
29: Audio bus
30: Intonation information extraction means
31: Keyword audio position estimation means
32: Estimated keyword audio section estimation means
33: Estimated keyword audio waveform copy and storage means
34: Means for associating the original audio with the stored estimated keyword audio
35: Audio output reproducing means
36: Original audio waveform output reproducing means

Claims

1. An audio information processing device comprising at least: means for extracting intonation information of the original audio; means for estimating the position where one or more keyword audio segments exist; means for estimating the one or more estimated keyword audio sections; means for copying and collectively storing the one or more estimated keyword audio waveforms in the estimated sections; means for associating the original audio with the stored estimated keyword audio waveforms; means for reading out, outputting and reproducing the copied and stored estimated keyword audio waveform as sound; and means for reading out, outputting and reproducing the original audio waveform as sound.

2. The audio information processing device according to claim 1, wherein keyword audio extraction processing is started upon completion of input of the original audio.

3. The audio information processing device according to claim 1, wherein, when a keyword audio extraction instruction has been set, the keyword audio extraction processing is started upon completion of input of the original audio.

4. The audio information processing device according to claim 1, wherein the extracted keyword audio is provided with a window and a fade-in/fade-out function.

5. The audio information processing device according to claim 1, wherein the keyword audio is estimated using a continuous DP method.
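The window with fade-in/fade-out mentioned in claim 4 can be illustrated with a minimal sketch. The text does not specify the window shape or length, so the raised-cosine ramp, the function name `apply_fade`, and the parameter `fade_len` are all assumptions made for illustration only, not the method of the specification.

```python
import math

def apply_fade(samples, fade_len):
    """Return a copy of a keyword segment with a fade-in and fade-out applied.

    Assumption: a raised-cosine (half-Hann) ramp; the patent text does not
    name a specific window shape or length.
    """
    n = len(samples)
    fade_len = min(fade_len, n // 2)  # keep the two ramps from overlapping
    out = list(samples)
    for i in range(fade_len):
        # Gain rises smoothly from 0 toward 1 over fade_len samples.
        gain = 0.5 * (1.0 - math.cos(math.pi * i / fade_len))
        out[i] *= gain            # fade-in at the start of the segment
        out[n - 1 - i] *= gain    # mirrored fade-out at the end
    return out

# Hypothetical usage: a constant-amplitude segment of 10 samples.
faded = apply_fade([1.0] * 10, 4)
```

Tapering the edges in this way would avoid audible clicks when several copied keyword segments are played back one after another, which is one plausible reason for the feature.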