JP4565162B2

JP4565162B2 - Speech event separation method, speech event separation system, and speech event separation program

Info

Publication number: JP4565162B2
Application number: JP2006057611A
Authority: JP
Inventors: 太浅野
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2006-03-03
Filing date: 2006-03-03
Publication date: 2010-10-20
Anticipated expiration: 2026-03-03
Also published as: JP2007233239A

Description

The present invention relates to an utterance event separation method, an utterance event separation system, and an utterance event separation program for separating utterance events when utterances overlap during a conference, for example through simultaneous speech or back-channel responses.

Conventionally, for customer meetings in companies and for committee meetings and the like in public institutions, minutes have generally been prepared by hand to record what was discussed. For small and medium-sized meetings in particular, however, this is often not worth the cost.

In recent years, methods for creating minutes automatically by speech recognition have therefore been researched, developed, and even commercialized. However, the recognition rate for natural speech is only about 60%, so the recognition results have had to be corrected manually.

Recording the audio and video of a conference with a video camera or recorder is also widely practiced. However, to find out what was discussed during the conference, the entire recording has to be played back, which is inefficient.

Research has therefore also been conducted on analyzing recorded audio and video and attaching information such as who said what, so that the desired part of a recording can be accessed efficiently (see Non-Patent Document 1).
Non-Patent Document 1: Jitendra Ajmera, et al., “Clustering and Segmenting speakers and their locations in meetings,” Proc. ICASSP 2004, Vol. I, pp. 605-608, 2004.

FIG. 5 is a flowchart showing a typical procedure for creating conference-minutes content in such research. The audio and video recorded at the conference are analyzed to extract information on who spoke and when. This processing is referred to here as “structuring”.

The structured audio, divided by speaker (segmentation), is then passed to speech recognition to analyze what was said. Higher-level information is added through keyword extraction, topic classification, semantic summarization, and the like, to produce the conference-minutes content.

For speech recognition to be accurate, the utterance must be clear. In a conference where participants can speak freely, however, a back-channel response, an interruption, a cough, or the like from another speaker while a given speaker (the target speaker) is talking causes the utterance of the other speaker (the competing speaker) to be superimposed on the target speaker's utterance. This so-called “overlap” significantly degrades the accuracy of speech recognition.

Conventional research on separating multiple sound sources, aimed at applications such as voice interfaces, has addressed the problem of separating a speaker's utterance in the presence of noise (for example, sound from a television or radio).
In that case, as shown in FIG. 6, if the interval over which the noise source and the target sound source (the utterance) overlap is sufficiently long, the noise and the speaker's utterance can be separated by well-known methods such as blind source separation (BSS) using independent component analysis (ICA) (see, for example, Non-Patent Document 2).
Non-Patent Document 2: Te-Won Lee, “Independent Component Analysis,” Kluwer Academic Publishers, 1998.

However, overlaps in conference speech, such as back-channel responses and coughs, often last only a very short time, as shown in FIG. 7. Conventional methods such as the BSS method described above therefore cannot be applied to separate such noise from the utterance.

Sound source separation using an adaptive beamformer has also been proposed. To use it, however, a “speaker position vector” must be supplied, which contains information on the arrival times from the speaker's position to each of the microphones used for recording (see, for example, Non-Patent Document 3).
Non-Patent Document 3: Don Johnson and Dan Dudgeon, “Array signal processing,” Prentice Hall, 1993.

In the adaptive beamformer approach, virtual sound sources are generally placed at the expected speaker positions (for example, every 5 degrees of central angle around the microphones), and the position vector for each position is obtained in advance by measurement. For actual conference audio, the closest of these pre-measured position vector candidates is then used.

This work is called calibration. The measurement has to be carried out for every set of microphones used for recording (hereinafter referred to as a microphone array), which is a serious disadvantage for mass production of input devices.

In addition, a noise spatial correlation matrix containing the spatial information of the noise must be supplied. Conventionally, the surrounding noise was assumed to be stationary, and this correlation matrix was estimated from pauses in the speaker's utterance. In a conference, however, the competing speakers acting as noise sources are not fixed and change constantly, so this approach cannot be used.

An object of the present invention is therefore to solve the above problems of the conventional adaptive beamformer and to provide an utterance event separation method, an utterance event separation system, and an utterance event separation program that can remove overlaps between utterances during a conference and separate the utterance content of each speaker with high accuracy.

To achieve the above object, the utterance event separation method for conference minutes according to the present invention comprises: a first step of performing sound source localization on multi-channel audio data recorded at a conference over a continuous section of the conference, detecting the peak values of the spatial spectrum, and estimating the sound source direction at each time in the section; a second step of clustering the peak values over the entire section to estimate the range in which each speaker acting as a sound source is present; a third step of identifying which speaker is speaking at each time from the sound source direction at each time and the ranges in which the speakers are present; a fourth step of finding, from the data obtained in the third step, blocks in the section in which the target speaker is speaking alone, and estimating the position vector of the target speaker from those blocks; a fifth step of finding, from the data obtained in the third step, blocks in the section in which another speaker is speaking alone, and calculating from those blocks the noise spatial correlation matrix of the other speaker with respect to the target speaker; a sixth step of generating a filter from the speaker position vector estimated in the fourth step and the noise spatial correlation matrix calculated in the fifth step; and a seventh step of applying the filter to blocks in which the utterances of the target speaker and another speaker are superimposed, and separating and outputting the utterance of the target speaker alone.

The utterance event separation system of the present invention comprises: sound source direction estimation means for performing sound source localization on multi-channel audio data recorded at a conference over a continuous section of the conference, detecting the peak values of the spatial spectrum, and estimating the sound source direction at each time in the section; speaker range estimation means for clustering the peak values detected by the sound source direction estimation means over the entire section to estimate the range in which each speaker acting as a sound source is present; speaker identification means for identifying which speaker is speaking at each time from the sound source direction at each time and the ranges in which the speakers are present, as obtained by the sound source direction estimation means and the speaker range estimation means; speaker position vector estimation means for finding, from the data obtained by the speaker identification means, blocks in the section in which the target speaker is speaking alone, and estimating the position vector of the target speaker from those blocks; noise spatial correlation matrix calculation means for finding, from the data obtained by the speaker identification means, blocks in the section in which another speaker is speaking alone, and calculating from those blocks the noise spatial correlation matrix of the other speaker with respect to the target speaker; filter generation means for generating a filter from the position vector estimated by the speaker position vector estimation means and the noise spatial correlation matrix calculated by the noise spatial correlation matrix calculation means; and filtering means for applying the filter to blocks in which the utterances of the target speaker and another speaker are superimposed, and separating and outputting the utterance of the target speaker alone.

In the utterance event separation system of the present invention, the multi-channel audio data is preferably recorded using a microphone array formed by arranging a plurality of microphones radially.

The utterance event separation program of the present invention operates on multi-channel audio data that was recorded at a conference over a continuous section of the conference, input to a computer, and stored in its memory, and causes the computer to execute: a first step of performing sound source localization on the audio data, detecting the peak values of the spatial spectrum, and estimating the sound source direction at each time in the section; a second step of clustering the peak values over the entire section to estimate the range in which each speaker acting as a sound source is present; a third step of identifying which speaker is speaking at each time from the sound source direction estimated at each time and the ranges in which the speakers are present; a fourth step of finding, from the data obtained in the third step, blocks in the section in which the target speaker is speaking alone, and estimating the position vector of the target speaker from those blocks; a fifth step of finding, from the data obtained in the third step, blocks in the section in which another speaker is speaking alone, and calculating from those blocks the noise spatial correlation matrix of the other speaker with respect to the target speaker; a sixth step of generating filter data from the speaker position vector estimated in the fourth step and the noise spatial correlation matrix calculated in the fifth step; and a seventh step of filtering, based on the filter data, blocks in which the utterances of the target speaker and another speaker are superimposed, and separating and outputting the utterance of the target speaker alone.

According to the invention of claim 1, it is possible to provide an utterance event separation method that can separate and extract the target speaker's utterance with high accuracy even when utterances of other speakers, such as back-channel responses or interruptions, are superimposed on it during a conference.

Furthermore, using the method of the present invention improves the recognition rate when, for example, minutes are created automatically by speech recognition from audio recorded during a conference, which reduces the time and effort spent correcting the automatically created minutes.

According to the invention of claim 2, it is possible to provide an utterance event separation system for implementing the utterance event separation method of claim 1.

According to the invention of claim 3, audio can be recorded with a single microphone array, formed by arranging a plurality of microphones radially, placed at the center of the conference table. Compared with the conventional practice of having every participant wear a tie-pin microphone on the chest, the wiring is simpler, and there is no risk of an incomplete recording because a participant forgot to put on a microphone.

According to the invention of claim 4, it is possible to provide a computer program with which the utterance event separation system can be realized easily and at low cost using a computer such as a notebook PC.

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a system configuration diagram showing one embodiment of an utterance event separation system for carrying out the utterance event separation method for conference minutes according to the present invention.

As shown in the figure, the utterance event separation system 1 is used connected to a microphone array 2 consisting of a plurality of microphones 2A. The sound picked up by each microphone 2A is sent as a separate channel of a multi-channel analog signal from the microphone array 2 through a cable 3 to an analog/digital signal conversion means 4, where it is converted into a digital signal and input to the utterance event separation system 1.

In this embodiment, the microphone array 2 is formed by arranging the microphones 2A radially on the circumferential surface of a cylindrical case, and a single array is placed at the center of the conference table to record the audio.

In conventional audio recording at conferences and the like, measures such as having every participant wear a tie-pin microphone on the chest were taken to improve the signal-to-noise ratio at the microphone as far as possible.

However, when every participant wears a microphone, the microphone cabling becomes cumbersome, and recordings were often left incomplete because a participant forgot to put on a microphone.

In this embodiment, by contrast, the recording device is a microphone array 2 consisting of a plurality of microphones 2A pointing radially in different directions. This frees the participants from the trouble of each wearing a tie-pin microphone, and has the advantage that sound from all surrounding directions can be recorded.

The microphone array is not limited to that of this embodiment. Depending on the shape of the conference table and the seating arrangement of the participants, an appropriate number of microphones may instead be arranged in a straight line or an arc.

The utterance event separation system 1 also incorporates a storage means 5 for recording the digital audio data input from the analog/digital signal conversion means 4 over the required section of the conference.

A hard disk or the like can be used as the storage means 5. In this embodiment it is built into the device that constitutes the utterance event separation system 1, but it may instead be provided externally as an independent unit, like the analog/digital signal conversion means 4. Alternatively, it may be built into the device that constitutes the utterance event separation system 1 together with the analog/digital signal conversion means 4.

The utterance event separation system 1 of the present invention includes a sound source direction estimation means 6, a speaker range estimation means 7, and a speaker identification means 8, which perform the first-stage “structuring” processing, and a speaker position vector estimation means 9, a noise spatial correlation matrix calculation means 10, a filter generation means 11, and a filtering means 12, which perform the second-stage “utterance separation” processing.

FIG. 2 is a flowchart outlining the processing performed by the utterance event separation system 1. A feature of the present invention is that the information obtained in the first-stage “structuring” processing, shown as steps S1 to S3 in the figure, is used to perform the second-stage “utterance separation” processing, shown as steps S4 to S7.

In the first-stage “structuring” processing, the multi-channel audio input to the microphone array 2 is analyzed to estimate who spoke and when during the conference. Because the aim of the present invention is to remove overlaps and isolate only the target speaker's utterance, the analysis has to be detailed enough to capture even small utterance events, such as back-channel responses that a competing speaker interjects while the target speaker is talking.

In the “structuring” stage, the audio data recorded in the storage means 5 is first read out, and the sound source direction estimation means 6 analyzes the multi-channel audio recorded by the microphone array 2 by sound source localization to estimate the direction of arrival of sound at each time (step S1 in FIG. 2).

A well-known general sound source localization technique (see, for example, Don Johnson and Dan Dudgeon, “Array signal processing,” Prentice Hall, 1993) can be used for the sound source localization. In this embodiment, however, to improve performance, a wideband extension (F. Asano et al., “Fusion of audio and video information for detecting speech events,” Proc. Fusion 2003, pp. 386-393, 2003) of the MUSIC method (R. O. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, vol. AP-24, No. 3, pp. 276-280, 1986) is used.
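As a rough illustration of step S1, the following is a minimal sketch of the basic narrowband MUSIC spatial spectrum for a single frequency bin. It is not the wideband extension used in the embodiment. The circular array geometry, the far-field plane-wave steering vector, the speed of sound, and all function names are assumptions introduced for this example.

```python
import numpy as np

def steering_vector(theta_deg, mic_angles_deg, radius, freq, c=343.0):
    """Plane-wave steering vector of a circular array (assumed geometry)."""
    theta = np.deg2rad(theta_deg)
    phi = np.deg2rad(np.asarray(mic_angles_deg))
    # Relative arrival delay at each microphone for a far-field source.
    delays = -radius * np.cos(theta - phi) / c
    return np.exp(-2j * np.pi * freq * delays)

def music_spectrum(snapshots, n_sources, grid_deg, mic_angles_deg, radius, freq):
    """Narrowband MUSIC spectrum from data of shape (n_mics, n_snapshots)."""
    n_mics, n_snap = snapshots.shape
    R = snapshots @ snapshots.conj().T / n_snap    # spatial correlation matrix
    _, vecs = np.linalg.eigh(R)                     # eigenvalues in ascending order
    noise = vecs[:, : n_mics - n_sources]           # noise-subspace eigenvectors
    spectrum = []
    for theta in grid_deg:
        a = steering_vector(theta, mic_angles_deg, radius, freq)
        proj = noise.conj().T @ a                   # projection onto noise subspace
        spectrum.append(1.0 / np.linalg.norm(proj) ** 2)
    return np.array(spectrum)

def demo_music(true_deg=60.0, seed=0):
    """Simulate one source on an 8-microphone array and return the peak direction."""
    rng = np.random.default_rng(seed)
    mic_angles = np.arange(8) * 45.0                # hypothetical microphone layout
    radius, freq = 0.05, 2000.0                     # hypothetical radius (m) and bin (Hz)
    a = steering_vector(true_deg, mic_angles, radius, freq)
    s = rng.standard_normal(200) + 1j * rng.standard_normal(200)
    noise = 0.05 * (rng.standard_normal((8, 200)) + 1j * rng.standard_normal((8, 200)))
    x = np.outer(a, s) + noise
    grid = np.arange(0.0, 360.0, 1.0)
    p = music_spectrum(x, 1, grid, mic_angles, radius, freq)
    return float(grid[np.argmax(p)])
```

Calling `demo_music()` scans the grid in 1° steps and returns the direction of the largest peak. In a real system the steering vectors would have to match the actual array, and the number of sources would itself need to be estimated.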

The peak data for each time detected by the sound source direction estimation means 6 is then input to the speaker range estimation means 7. The speaker range estimation means 7 accumulates the peaks of the spatial spectrum over the whole conference, or over the required section of it, to generate the histogram data shown in FIG. 3, and clusters this histogram data by the k-means method (see, for example, R. Duda, P. Hart and D. Stork, “Pattern Classification,” Wiley-Interscience, 2001).

The number of clusters used for the clustering is set to the number of conference participants. The range of ±R° around each cluster center obtained by the clustering is estimated as the range in which the corresponding speaker is present (step S2). Here, R is an arbitrary angle (for example, 20°) that is set for each conference according to the number of participants and other factors.
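Step S2 can be sketched as follows. This is only an illustration under simplifying assumptions: it clusters the raw peak directions rather than a histogram, treats angles as points on a line (ignoring wrap-around at 0°/360°), and uses a deterministic initialization chosen for the example. The function names are hypothetical.

```python
def kmeans_1d(values, k, n_iter=50):
    """Plain k-means on scalar values; returns the sorted cluster centers."""
    data = sorted(values)
    # Deterministic initialization: spread the centers over the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [sum(g) / len(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)

def speaker_ranges(peak_directions_deg, n_speakers, r_deg=20.0):
    """Estimate each speaker's range as (cluster center - R, cluster center + R)."""
    centers = kmeans_1d(peak_directions_deg, n_speakers)
    return [(c - r_deg, c + r_deg) for c in centers]
```

For example, `speaker_ranges([28, 30, 32, 148, 150, 152, 268, 270, 272], 3)` yields three ranges centered on 30°, 150°, and 270°.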

Next, the speaker identification means 8 estimates which speaker is speaking at each time from the sound source localization result obtained by the sound source direction estimation means 6 and the speaker ranges estimated by the speaker range estimation means 7. For example, if at some time a peak of the spatial spectrum falls within the range of a certain speaker, that speaker is identified as having been speaking at that time (step S3).
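The identification rule of step S3 amounts to a range lookup for each detected peak. A minimal sketch, assuming the peaks are given as a list of directions per time index and the speaker ranges as `(low, high)` pairs in degrees (both data layouts are assumptions for this example):

```python
def identify_speakers(peaks_by_time, ranges):
    """Return, for each time index, the set of speakers with a peak in their range."""
    active = []
    for peaks in peaks_by_time:
        speakers = {idx for idx, (lo, hi) in enumerate(ranges)
                    for p in peaks if lo <= p <= hi}
        active.append(speakers)
    return active
```

A time index with two speakers in its set marks an overlap, and an empty set marks silence or a sound that came from outside every speaker range.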

In the second-stage “utterance separation” processing that follows the “structuring” processing described above, the speakers are separated in the portions where utterances are superimposed (overlap), based on the information obtained in the “structuring” stage. The utterance events are separated using the maximum likelihood estimation method, a type of adaptive beamformer (see, for example, Don Johnson and Dan Dudgeon, “Array signal processing,” Prentice Hall, 1993).

The maximum likelihood estimation method requires two pieces of information: the position vector of the target speaker, and the correlation matrix of the noise, that is, of the competing speaker (the noise spatial correlation matrix). The speaker position vector estimation means 9 estimates the position vector of the target speaker based on the information obtained in the “structuring” stage described above.

FIG. 4 illustrates the procedure for estimating the speaker position vector and the noise spatial correlation matrix. Using the information obtained in the “structuring” stage, blocks in which the target speaker is speaking alone during the conference are located. Here, a “block” is a short unit of time, typically about 0.5 seconds, and the processing is performed block by block.
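The block search can be sketched as below. The rule used here, that a block counts as solo speech when the union of speakers active in its frames is exactly one speaker, is an assumption made for illustration, as are the function names and the per-frame input format.

```python
def block_speakers(active_by_frame, frames_per_block):
    """Union of the active speakers within each fixed-length block."""
    blocks = []
    for start in range(0, len(active_by_frame), frames_per_block):
        frames = active_by_frame[start:start + frames_per_block]
        blocks.append(set().union(*frames))
    return blocks

def solo_blocks(blocks, speaker):
    """Indices of blocks in which only the given speaker is active."""
    return [i for i, s in enumerate(blocks) if s == {speaker}]

def overlap_blocks(blocks, target):
    """Indices of blocks in which the target and at least one other speaker are active."""
    return [i for i, s in enumerate(blocks) if target in s and len(s) > 1]
```

With the target speaker numbered 0, `solo_blocks(blocks, 0)` gives the blocks usable for estimating the position vector, `solo_blocks(blocks, 1)` gives the blocks usable for the noise spatial correlation matrix of competing speaker 1, and `overlap_blocks(blocks, 0)` gives the blocks to be separated.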

The speaker position vector estimation means 9 calculates a correlation matrix from each block of solo speech and applies eigenvalue decomposition to it (see G. Strang, “Linear Algebra and its applications,” Harcourt Brace Jovanovich College Publishers, 1988) to obtain its eigenvalues and eigenvectors.

When there is only a single sound source, the eigenvector belonging to the largest eigenvalue is the position vector of that source (see Don Johnson and Dan Dudgeon, “Array signal processing,” Prentice Hall, 1993). Using this property, the eigenvector corresponding to the largest eigenvalue is extracted and taken as the speaker position vector candidate for the block.
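A minimal NumPy sketch of this candidate extraction for one frequency bin is given below. It assumes the block is supplied as a complex array of shape `(n_mics, n_frames)`, for example short-time Fourier transform values at one frequency; that layout and the function names are assumptions for this example.

```python
import numpy as np

def spatial_correlation(block):
    """Spatial correlation matrix of one block of shape (n_mics, n_frames)."""
    return block @ block.conj().T / block.shape[1]

def position_vector_candidate(block):
    """Return the principal eigenvector and the eigenvalues in descending order."""
    R = spatial_correlation(block)
    vals, vecs = np.linalg.eigh(R)    # eigh returns ascending eigenvalues for Hermitian R
    return vecs[:, -1], vals[::-1]
```

The eigenvector is determined only up to a complex scale factor, so a returned candidate should be compared with a reference vector through the normalized inner product rather than element by element.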

As shown in FIG. 4, there are multiple blocks in which the target speaker is speaking alone during the conference, so the speaker position vector estimation means 9 selects the most suitable of the speaker position vector candidates calculated for these blocks as the estimated speaker position vector.

The selection criterion is as follows. First, the candidate is chosen for which the difference between the target speaker's direction in the block containing the overlap and the speaker direction in the candidate block is smallest. If several candidates still remain, the one with the largest solo-speech index is selected (step S4).
The solo-speech index is the ratio of the largest eigenvalue to the second-largest eigenvalue, among the eigenvalues already calculated, averaged over the frequency axis.
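The two-stage selection rule can be expressed as a lexicographic comparison. In this sketch each candidate is a dictionary with hypothetical keys `direction_deg` and `eigvals_per_freq` (one list of eigenvalues per frequency bin); the data layout is an assumption, and the code assumes the second-largest eigenvalue is never exactly zero.

```python
def solo_index(eigvals_per_freq):
    """Mean over frequency bins of (largest eigenvalue / second-largest eigenvalue)."""
    ratios = []
    for vals in eigvals_per_freq:
        top = sorted(vals, reverse=True)
        ratios.append(top[0] / top[1])
    return sum(ratios) / len(ratios)

def select_candidate(candidates, target_direction_deg):
    """Smallest direction difference first; ties broken by the largest solo index."""
    def key(c):
        diff = abs(c["direction_deg"] - target_direction_deg)
        return (diff, -solo_index(c["eigvals_per_freq"]))
    return min(candidates, key=key)
```

Because exact ties in the direction difference are rare with real-valued estimates, a practical variant might first keep all candidates within a small tolerance of the minimum difference and then apply the index.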

Meanwhile, the noise spatial correlation matrix calculation means 10 locates blocks in which a competing speaker is speaking alone, in the same way as in the estimation procedure of the speaker position vector estimation means 9, and calculates the noise spatial correlation matrix in each such block. The noise spatial correlation matrix is calculated in the same way for every other block in which the competing speaker is speaking alone (step S5).

Next, the filter generation means 11 generates a maximum likelihood estimation filter for utterance separation for each of the noise spatial correlation matrices that the noise spatial correlation matrix calculation means 10 calculated for the blocks in which the competing speaker is speaking alone, in each case combined with the target speaker's position vector estimated by the speaker position vector estimation means 9. These filters are applied to the block containing the overlap that is to be separated, and the filter that gives the smallest output power is taken as the final filter (step S6).
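The following sketch illustrates steps S6 and S7 for a single frequency bin. It assumes the textbook form of the maximum likelihood beamformer, w = K⁻¹a / (aᴴK⁻¹a), where a is the target position vector and K a noise spatial correlation matrix; the source does not spell out the formula, so this form is an assumption. The diagonal loading term is also an addition made here, because a correlation matrix measured from a single competing speaker can be close to singular.

```python
import numpy as np

def ml_filter(a, K, loading=1e-6):
    """w = K^-1 a / (a^H K^-1 a), with diagonal loading (an added assumption)."""
    n = K.shape[0]
    K_reg = K + loading * np.trace(K).real / n * np.eye(n)
    Ka = np.linalg.solve(K_reg, a)
    return Ka / (a.conj() @ Ka)

def apply_filter(w, block):
    """Beamformer output y = w^H x for every frame of a (n_mics, n_frames) block."""
    return w.conj() @ block

def best_filter(a, noise_matrices, overlap_block):
    """Build one filter per noise matrix and keep the one with minimum output power."""
    filters = [ml_filter(a, K) for K in noise_matrices]
    powers = [np.mean(np.abs(apply_filter(w, overlap_block)) ** 2) for w in filters]
    return filters[int(np.argmin(powers))]
```

Every candidate filter passes the target direction with unit gain, so the one with the lowest output power on the overlapping block is the one that suppresses the competing speaker best. Applying the selected filter with `apply_filter` then gives the separated target utterance for that bin.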

The filtering means 12 applies the filter constructed by the filter generation means 11 to the digital audio signal sent from the analog/digital signal conversion means 4, and outputs, as the filter output, only the separated utterance of the target speaker (step S7).

The utterance event separation system 1 described above can be built from dedicated hardware, but it can also be realized by executing the utterance event separation program on a general-purpose computer such as a notebook computer.

The utterance event separation program causes a computer to operate as the sound source direction estimation means 6, the speaker range estimation means 7, the speaker identification means 8, the speaker position vector estimation means 9, the noise spatial correlation matrix calculation means 10, the filter generation means 11, and the filtering means 12 shown in FIG. 1. The output signal from the analog/digital signal conversion means 4 can be input to the computer via an input/output interface such as USB (Universal Serial Bus Specification Rev. 2.0) or PCI (Peripheral Component Interconnect).
The analog/digital signal conversion means 4 may also be incorporated in the computer as a signal conversion module.

The utterance event separation program is installed in advance, together with the operating system, in a storage device of the computer (for example, a hard disk or an optical disc). When the program is started, the CPU loads it into the RAM (Random Access Memory) of the computer and executes the processing of each step shown in FIG. 2 described above.

When a computer is used, the digital audio data input to the computer via the analog/digital signal conversion means 4 is accumulated, for the section of the conference that needs to be recorded, in storage means such as a hard disk built into or externally attached to the computer.

The utterance event separation program executes each of the steps described above using the accumulated data, finally separates only the utterance of the target speaker, and outputs the resulting audio data to the outside via an output interface. The separated audio data may also be output to and stored on a hard disk or the like.

The utterance event separation method, utterance event separation system, and utterance event separation program of the present invention can be used, for example, when conference minutes are created automatically by speech recognition from audio data recorded at a small-scale conference.

A system configuration diagram showing one embodiment of an utterance event separation system for carrying out the utterance event separation method of the present invention.
A flowchart showing an outline of the processing performed by the utterance event separation system 1 of the present invention.
A diagram showing the histogram used for speaker range estimation and the speaker ranges estimated from the clustering result.
A diagram explaining the procedure for estimating the speaker position vector and the noise correlation matrix.
A flowchart showing a conventional, general procedure for creating conference minutes content.
A diagram schematically showing a state in which the section where the target sound source and the noise source overlap is long.
A diagram schematically showing a state in which the utterances of the target speaker and the competing speaker overlap for a short section.

Explanation of symbols

1 Utterance event separation system
2 Microphone array
2A Microphone
3 Cable
4 Analog/digital signal conversion means
5 Storage means
6 Sound source direction estimation means
7 Speaker range estimation means
8 Speaker identification means
9 Speaker position vector estimation means
10 Noise spatial correlation matrix calculation means
11 Filter generation means
12 Filtering means

Claims

1. An utterance event separation method comprising:
a first step of performing sound source localization on multi-channel audio data recorded at a conference venue over a continuous section of the conference, detecting peak values of the spatial spectrum, and estimating the sound source direction at each time in the section;
a second step of clustering the peak values over the entire section to estimate the range in which each speaker serving as a sound source exists;
a third step of identifying which speaker is speaking at each time from the sound source direction at each time and the range in which each speaker exists;
a fourth step of searching, from the data obtained in the third step, for a block within the section in which the target speaker is speaking alone, and estimating the position vector of the target speaker from that block;
a fifth step of searching, from the data obtained in the third step, for a block within the section in which another speaker is speaking alone, and calculating from that block the noise spatial correlation matrix of the other speaker with respect to the target speaker;
a sixth step of generating a filter from the speaker position vector estimated in the fourth step and the noise spatial correlation matrix calculated in the fifth step; and
a seventh step of applying the filter to a block in which the utterances of the target speaker and the other speaker are superimposed, performing filtering, and separating and outputting the utterance of the target speaker only.

2. An utterance event separation system comprising:
sound source direction estimation means for performing sound source localization on multi-channel audio data recorded at a conference venue over a continuous section of the conference, detecting peak values of the spatial spectrum, and estimating the sound source direction at each time in the section;
speaker range estimation means for clustering the peak values detected by the sound source direction estimation means over the entire section to estimate the range in which each speaker serving as a sound source exists;
speaker identification means for identifying which speaker is speaking at each time from the sound source direction at each time obtained by the sound source direction estimation means and the range in which each speaker exists obtained by the speaker range estimation means;
speaker position vector estimation means for searching, from the data obtained by the speaker identification means, for a block within the section in which the target speaker is speaking alone, and estimating the position vector of the target speaker from that block;
noise spatial correlation matrix calculation means for searching, from the data obtained by the speaker identification means, for a block within the section in which another speaker is speaking alone, and calculating from that block the noise spatial correlation matrix of the other speaker with respect to the target speaker;
filter generation means for generating a filter from the position vector estimated by the speaker position vector estimation means and the noise spatial correlation matrix calculated by the noise spatial correlation matrix calculation means; and
filtering means for applying the filter to a block in which the utterances of the target speaker and the other speaker are superimposed, performing filtering, and separating and outputting the utterance of the target speaker only.

3. The utterance event separation system according to claim 2, wherein the multi-channel audio data is recorded using a microphone array in which a plurality of microphones are arranged radially.

4. An utterance event separation program that, with multi-channel audio data recorded at a conference venue over a continuous section of the conference having been input to a computer and stored in its memory, causes the computer to execute:
a first step of performing sound source localization on the audio data, detecting peak values of the spatial spectrum, and estimating the sound source direction at each time in the section;
a second step of clustering the peak values over the entire section to estimate the range in which each speaker serving as a sound source exists;
a third step of identifying which speaker is speaking at each time from the sound source direction estimated at each time and the range in which each speaker exists;
a fourth step of searching, from the data obtained in the third step, for a block within the section in which the target speaker is speaking alone, and estimating the position vector of the target speaker from that block;
a fifth step of searching, from the data obtained in the third step, for a block within the section in which another speaker is speaking alone, and calculating from that block the noise spatial correlation matrix of the other speaker with respect to the target speaker;
a sixth step of generating filter data from the speaker position vector estimated in the fourth step and the noise spatial correlation matrix calculated in the fifth step; and
a seventh step of performing filtering based on the filter data on a block in which the utterances of the target speaker and the other speaker are superimposed, and separating and outputting the utterance of the target speaker only.