JP2007322523A

JP2007322523A - Voice translation apparatus and its method

Info

Publication number: JP2007322523A
Application number: JP2006150136A
Authority: JP
Inventors: Kazunori Imoto; 和範井本
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2006-05-30
Filing date: 2006-05-30
Publication date: 2007-12-13

Abstract

PROBLEM TO BE SOLVED: To provide a voice translation apparatus capable of determining the language translation direction according to the dialog flow in a conversation attended by three or more persons. SOLUTION: The voice translation apparatus comprises a voice input section 10, a speaker identification section 20, a language determination section 30, a language control section 40, a speaker language recording section 50, a voice recognition section 60, and a machine translation section 70. The correspondence between each speaker and the language that speaker uses is recorded, and, while the pair of speakers currently in dialog changes dynamically, the language translation direction is determined automatically according to the dialog flow on the basis of the recorded correspondence. COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a speech translation apparatus and method for supporting communication between people who speak different languages.

In recent years, speech processing technologies such as speech recognition and speech synthesis, and language processing technologies such as machine translation, have been actively researched. Spoken language processing technologies that combine the two, such as speech translation, are also being actively studied. Many problems must be solved before speech translation can be used in the real world, but some products have already reached practical use by carefully restricting the usage scenario and by relying on user cooperation to cover the technical limitations.

Speech translation technology that supports communication between people with different native languages is expected to be applied in a variety of settings as opportunities for international exchange, such as overseas travel and international conferences, increase. Many speech translation devices currently on the market assume face-to-face conversation, that is, a situation in which two people communicate at close range with the speech translation device between them. In the future, however, translation is likely to be needed in more diverse situations, such as meetings where many people gather in one place and remote conferences conducted between distant sites over various infrastructures.

However, simply combining conventional techniques does not make speech translation widely applicable. Comparing the face-to-face dialogue and the meeting mentioned above, there are various differences, such as whether the participants can share a view of the terminal screen and whether their utterances can be captured with high quality. Consequently, not only the required level of technology but also the appropriate user interface differs from one usage scenario to another. It is therefore important not merely to combine conventional techniques but to introduce new techniques suited to the usage scenario and to design an interface that fits it.

Consider the problems that arise when speech translation technology is applied to a meeting in which three or more people with different native languages gather in one place. Unlike face-to-face dialogue, the number of participants is large, so it is not easy to configure the speech translation device in advance with who speaks which language. In particular, when three or more different languages are exchanged, the device cannot tell which language is being input or which language it should output, that is, the language conversion direction is unknown. This could be solved by preparing a conference room equipped so that every participant has a dedicated input device and translation device, and by configuring the equipment for each participant in advance. However, such equipment is costly and restricts where the system can be used, so it cannot be used casually in a variety of settings.

In addition, because the participants are gathered in one place, the waiting time from speech input to output of the translation result cannot be long. In face-to-face dialogue some waiting time is tolerable if the interface is designed well, for example by using a shared screen. In a conversation among three or more people, however, an unnatural situation arises in which some participants understand an utterance without translation while others do not. Unless the waiting time is kept short, the flow of the conversation suffers.

To address this, a method has been proposed that controls the source language for speech recognition and the target language for machine translation according to the direction of the sound source. In this method the speech translation device is equipped with a movable microphone or a microphone array to detect the sound source direction, and the language conversion direction is determined from the detected direction according to a preset configuration. For example, speech arriving from the near side is converted from Japanese to English, and speech arriving from the far side is converted from English to Japanese. This method is very effective for face-to-face conversation between two people: once the target language has been set for the conversation partner, the language conversion direction of the input speech is determined automatically without any further operation. When the method is applied to a meeting with three or more participants, however, it is not known in advance which seat a speaker of a given language will take, so associating sound source directions with input languages is not easy. Thus the conventional technique has the problem that, in a conversation with three or more participants, it is not easy to associate the sound source direction with the input language (see, for example, Patent Document 1).

A method has also been proposed in which the input speech is recognized by speech recognition engines for several languages, a likelihood is computed for each, and the language with the highest recognition likelihood is judged to be the input language. Because every utterance is processed on the assumption that any of several languages may arrive from any direction, no presetting is required. A further advantage is that the language of a person who joins the meeting partway through can be determined from that person's utterances. However, because language determination is performed for every utterance, processing takes time, and the resulting delay disrupts the flow of the conversation (see, for example, Patent Document 2).

A further problem common to the conventional techniques is that it is unclear into which language the input speech should be translated. In face-to-face conversation it suffices to translate into the one language that differs from the input language. In a conversation among three or more people, and especially one conducted in three or more languages, the language to translate into should be prioritized according to the flow of the dialogue, but no concrete method for doing so has been disclosed. Translating into every possible language and outputting all the results is conceivable, but it is impractical because the participants are gathered in one place and a long delay cannot be tolerated. Thus the conventional techniques have the problem that, in a conversation with three or more participants, it is not easy to decide into which language the input should be translated.

Patent Document 1: JP 2005-141759 A
Patent Document 2: JP 2004-347732 A

As described above, conventional speech translation devices that support communication in a conversation with three or more participants cannot tell as which language the input speech should be recognized and into which language it should be translated, that is, the language conversion direction is unknown.

There is also the problem that determining the language conversion direction of the input speech introduces a delay.

The present invention has been made in view of these circumstances, and its object is to provide a speech translation apparatus and method that determine the language conversion direction according to the flow of the dialogue in a conversation in which three or more people participate.

The present invention is a speech translation apparatus for use among three or more speakers, comprising: a speech input unit for inputting the speech of each speaker; a speaker identification unit that analyzes each speech and identifies the speaker; a language determination unit that analyzes each speech and determines the language spoken; a speaker-of-interest determination unit that determines which of the speakers is the speaker of interest; a speaker language recording unit that records the correspondence between the identified speaker and the determined language; a language control unit that, on the basis of the correspondence and the speaker of interest, (1) when the input speech comes from the speaker of interest, takes the determined language of the speaker of interest as the input language and determines a language other than the input language as the output language, and (2) when the input speech does not come from the speaker of interest, takes the determined language of the speaker who input the speech as the input language and determines the determined language of the speaker of interest as the output language; a speech recognition unit that recognizes the input speech as the input language; and a machine translation unit that translates the speech recognition result from the input language into the output language.
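The two cases handled by the language control unit can be illustrated with a minimal Python sketch. The function name `decide_direction`, the dictionary layout, the language codes, and the third language `zh` are assumptions made for illustration and do not appear in the specification; returning every candidate output language in case (1) is likewise an assumption.

```python
from typing import Dict, List, Tuple


def decide_direction(
    speaker: str,
    speaker_of_interest: str,
    speaker_language: Dict[str, str],
) -> Tuple[str, List[str]]:
    """Return (input_language, candidate_output_languages) for one utterance.

    speaker_language is the recorded speaker-to-language correspondence.
    """
    input_language = speaker_language[speaker]
    if speaker == speaker_of_interest:
        # Case (1): the speaker of interest is talking, so translate into
        # the recorded languages other than the input language.
        outputs = sorted(
            {lang for lang in speaker_language.values() if lang != input_language}
        )
    else:
        # Case (2): someone else is talking, so translate into the language
        # of the speaker of interest.
        outputs = [speaker_language[speaker_of_interest]]
    return input_language, outputs


# Hypothetical correspondence: A speaks English, E Japanese, G Chinese.
table = {"A": "en", "E": "ja", "G": "zh"}
print(decide_direction("E", "E", table))  # ('ja', ['en', 'zh'])
print(decide_direction("A", "E", table))  # ('en', ['ja'])
```

In case (2) the sketch returns the language of the speaker of interest even when it equals the input language; a caller would presumably skip translation in that situation.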

According to the present invention, in a conversation with three or more participants, the translation language is switched automatically in accordance with the flow of the dialogue regardless of which language is input from which direction, so that translation-based conversation support can be provided without disrupting the flow of the dialogue.

Hereinafter, a speech translation apparatus according to an embodiment of the present invention will be described with reference to the drawings.

(First embodiment)
A speech translation apparatus according to the first embodiment of the present invention will be described with reference to FIGS. 1 to 7.

(1) Configuration of the speech translation apparatus
FIG. 1 is a schematic configuration diagram of the speech translation apparatus according to this embodiment.

As shown in FIG. 1, the speech translation apparatus comprises a speech input unit 10, a speaker identification unit 20, a language determination unit 30, a language control unit 40, a speaker language recording unit 50, a speech recognition unit 60, and a machine translation unit 70.

The speech input unit 10 passes speech data input from, for example, a microphone to the speaker identification unit 20, the language determination unit 30, and the speech recognition unit 60. Decryption, decoding, format conversion, rate conversion, and similar processing may be applied to the speech data as necessary.

The speaker identification unit 20 analyzes the speech data input from the speech input unit 10, extracts feature quantities for speaker identification, identifies on the basis of those feature quantities who produced the utterance in the analysis interval, and outputs the identified speaker information to the language control unit 40. If utterances from each speaker can be obtained in advance, the speaker can be identified by comparing the feature sequence with a reference model trained for each speaker and selecting the speaker with the highest similarity. If they cannot, the speaker can be identified by clustering speaker vectors, each of which is the sequence of similarities to reference models built in advance for a large number of available speakers. Various known means can thus be used to identify the speaker. The speaker information output to the language control unit 40 need not identify the speaker as an individual; it is sufficient that it distinguishes the speaker from the other speakers. For example, the speech input unit 10 may be a microphone array, and the sound source direction or position of the speaker may be output as the speaker information.

The language determination unit 30 analyzes the speech data input from the speech input unit 10, extracts the feature quantities needed to determine in which language the utterance in the analysis interval was spoken, refers to dictionaries trained in advance, and outputs the acoustically most similar language to the language control unit 40 as the determination result. Various known means, such as Gaussian mixture models, can be used to calculate the similarity between the pre-trained dictionaries and the input speech.

The language control unit 40 receives the identified speaker from the speaker identification unit 20 and the determined language from the language determination unit 30, and records the two in association with each other in the speaker language recording unit 50. On the basis of the recorded correspondence between identified speakers and determined languages, it then decides as which language the input speech should be recognized and into which language it should be translated, that is, the language conversion direction, which is the pair of input language and output language. The input language is output to the speech recognition unit 60 and the machine translation unit 70, and the output language is output to the machine translation unit 70.

The speech recognition unit 60 analyzes the speech data input from the speech input unit 10, extracts the feature quantities needed for recognition, selects a pre-trained dictionary according to the input language received from the language control unit 40, and outputs the acoustically most similar word or word sequence to the machine translation unit 70 as the recognition result. Various known means, such as hidden Markov models, neural networks, and DP matching, can be used to calculate the similarity between the pre-trained dictionary and the input speech.

The machine translation unit 70 takes the input language received from the language control unit 40 as the source language and the output language as the target language, receives the source-language character sequence from the speech recognition unit 60, and converts it into the target language. Various existing methods, such as rule-based translation and example-based translation, can be used.
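The data flow among units 10 to 70 described above can be sketched as follows. This is only an illustrative wiring under stated assumptions: the class name `SpeechTranslator`, its field names, and the callable signatures are hypothetical, each unit is reduced to a plain function supplied by the caller, and the policy for choosing the output language is passed in rather than fixed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class SpeechTranslator:
    """Hypothetical wiring of the units in FIG. 1; every callable is a stand-in."""

    identify_speaker: Callable[[bytes], str]        # speaker identification unit 20
    determine_language: Callable[[bytes], str]      # language determination unit 30
    decide_output: Callable[[str, Dict[str, str]], Optional[str]]  # policy of unit 40
    recognize: Callable[[bytes, str], str]          # speech recognition unit 60
    translate: Callable[[str, str, str], str]       # machine translation unit 70
    speaker_language: Dict[str, str] = field(default_factory=dict)  # recording unit 50

    def process(self, audio: bytes) -> Tuple[str, Optional[str], str]:
        """Handle one utterance delivered by the speech input unit 10."""
        speaker = self.identify_speaker(audio)
        input_language = self.determine_language(audio)
        # Unit 40 records the speaker-language pair, then fixes the direction.
        self.speaker_language[speaker] = input_language
        output_language = self.decide_output(input_language, self.speaker_language)
        text = self.recognize(audio, input_language)
        if output_language is None:
            # No other language is known yet, so return the recognized text as is.
            return input_language, None, text
        translated = self.translate(text, input_language, output_language)
        return input_language, output_language, translated
```

Real recognizers and translation engines would replace the callables; the sketch is meant only to show which unit consumes which output.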

(2) Operation of the speech translation apparatus
Next, the operation of the speech translation apparatus will be described in detail using a specific example.

FIG. 2 shows an example of a conversation among several people. Of the meeting participants, speakers A to D speak English and speakers E and F speak Japanese. Only speaker F has a speech translation apparatus, which receives the speakers' utterances through its built-in microphone and performs speech translation.

The operations of the speaker identification unit 20, the language determination unit 30, and the language control unit 40 will be described in detail using the conversation example of FIG. 2.

(2-1) Speaker identification unit 20
The speaker identification unit 20 analyzes the speech data input from the speech input unit 10, extracts feature quantities for speaker identification, identifies on the basis of those feature quantities who produced the utterance in the analysis interval, and outputs the identified speaker information to the language control unit 40. The method for identifying the speaker is described in detail below.

The most common way to identify a speaker is to record a certain amount of each participant's speech in advance, create a reference model such as a vector quantization codebook for each speaker, and identify the speaker as the one whose reference model has the highest similarity to the feature vector sequence extracted from the interval to be identified.

In an actual meeting, however, registering the participants' speech in advance is costly and impractical. This embodiment therefore uses, as an example, a method in which a reference model is created for each of a large number of speakers whose speech is available in advance, even though they are not the participants, each feature vector is converted into a speaker vector consisting of its similarities to those reference models, and the speaker vectors are clustered. The method is described with reference to FIGS. 3 and 4.

In this embodiment, mel-frequency cepstral coefficients (hereinafter abbreviated as MFCC) are used as an example of the feature quantity for speaker identification, but any existing feature quantity that permits speaker identification may be used.

FIG. 3 shows the distribution in the feature vector space. Each ellipse in the feature vector space represents a reference model created for one of the speakers collected in advance. For example, the ellipse for speaker X shows the distribution of the reference model trained on utterances of speaker X, who was recorded in advance and is not a participant. Input (a) in FIG. 3 is a feature vector obtained by dividing the speech input from the speech input unit 10 into analysis frames (for example, 10 ms long) and analyzing each frame. For the purpose of explanation the feature vectors have three dimensions and there are three reference models; in practice the feature vectors have several tens of dimensions, and several hundred to several thousand reference models are often prepared.
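The division of the input speech into analysis frames can be sketched as below. The function name `split_into_frames`, the use of non-overlapping frames, and the 16 kHz sample rate in the usage lines are assumptions for illustration; the text specifies only an example frame length of 10 ms, and feature extraction such as MFCC computation is not shown.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float], sample_rate_hz: int, frame_ms: float = 10.0
) -> List[Sequence[float]]:
    """Cut a sample sequence into consecutive, non-overlapping analysis frames.

    A trailing remainder shorter than one frame is dropped.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must cover at least one sample")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


# 400 samples at an assumed 16 kHz rate give two 160-sample frames.
frames = split_into_frames(list(range(400)), sample_rate_hz=16000)
print(len(frames), len(frames[0]))  # 2 160
```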

As shown in FIG. 3, the reference models do not include the meeting participants, so input feature vectors often have low similarity to every reference model. The input is therefore re-expressed by a conversion that takes its similarities to the individual pre-collected speakers as a new feature vector; the result is the speaker vector space illustrated in FIG. 4. Converting feature vectors into speaker vectors in this way allows the speaker characteristics of an utterance that does not directly resemble any particular pre-collected speaker to be expressed by its distances to several of them.
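The conversion from a feature vector to a speaker vector can be sketched as follows. The sketch assumes that each reference model is a small vector quantization codebook and that similarity is the negative squared distance to the nearest codeword; the text does not fix the similarity measure, and the names `to_speaker_vector` and `similarity` as well as the numeric values are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity(feature: Vector, codebook: List[Vector]) -> float:
    """Assumed similarity: negative squared distance to the nearest codeword."""
    return -min(squared_distance(feature, codeword) for codeword in codebook)


def to_speaker_vector(
    feature: Vector, reference_models: Dict[str, List[Vector]]
) -> List[float]:
    """Map one feature vector to its similarities to every reference speaker.

    Components are ordered by reference speaker name so that all speaker
    vectors share the same axes.
    """
    return [similarity(feature, reference_models[name]) for name in sorted(reference_models)]


# Three hypothetical reference speakers with one codeword each.
models = {"X": [(0.0, 0.0, 0.0)], "Y": [(2.0, 0.0, 0.0)], "Z": [(0.0, 4.0, 0.0)]}
print(to_speaker_vector((1.0, 0.0, 0.0), models))  # [-1.0, -1.0, -17.0]
```

The example input lies midway between reference speakers X and Y, which shows up as equal components for X and Y in the speaker vector.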

For example, in FIGS. 3 and 4, input (b) and input (c) have a voice quality intermediate between speaker X and speaker Y, and they are represented by nearby vectors in the speaker vector space. The speaker of the input speech can be identified by classifying the speaker vectors in this space with a clustering method such as the LBG algorithm. For example, if inputs (b) and (c), which the clustering method merges into the same class, are labeled class (1), utterances subsequently classified into class (1) can be identified as coming from the same speaker. For simplicity the determination has been described as being made frame by frame; in practice it is considered effective to reuse the speech interval detection of the speech recognition front end and determine one speaker for each coherent utterance.
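The classification of speaker vectors into speaker classes can be sketched as below. Note that this is not the LBG algorithm named in the text: for brevity it substitutes a simple incremental scheme that opens a new class whenever no existing centroid lies within a distance threshold. The class name `SpeakerClusterer`, the threshold, and the per-utterance averaging in `utterance_vector` are assumptions.

```python
import math
from typing import List, Sequence


def utterance_vector(frame_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-frame speaker vectors of one utterance (assumed pooling)."""
    n = len(frame_vectors)
    return [sum(column) / n for column in zip(*frame_vectors)]


class SpeakerClusterer:
    """Incremental clustering used here as a stand-in for LBG."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def assign(self, vector: Sequence[float]) -> int:
        """Return the 1-based class of the vector, creating a class if needed."""
        best_index, best_distance = -1, math.inf
        for index, centroid in enumerate(self.centroids):
            distance = math.dist(vector, centroid)
            if distance < best_distance:
                best_index, best_distance = index, distance
        if best_index >= 0 and best_distance <= self.threshold:
            # Known speaker: move the centroid toward the new vector (running mean).
            self.counts[best_index] += 1
            count = self.counts[best_index]
            centroid = self.centroids[best_index]
            for d, value in enumerate(vector):
                centroid[d] += (value - centroid[d]) / count
            return best_index + 1
        # Unknown speaker: open a new class.
        self.centroids.append(list(vector))
        self.counts.append(1)
        return len(self.centroids)
```

With a threshold of 1.0, the vectors `[0.0, 0.0]`, `[5.0, 5.0]`, and `[0.2, 0.0]` presented in that order are assigned to classes 1, 2, and 1.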

A speaker identification method using speaker vectors has been described here, but, as noted above, the speaker information need not identify the speaker as an individual as long as it distinguishes the speaker from the others. For example, the speech input unit 10 may be a microphone array, and the sound source direction or position of the speaker may be output as the speaker information.

(2-2) Language determination unit 30
The language determination unit 30 analyzes the speech data input via the speech input unit 10, extracts the feature quantities needed to determine in which language the utterance in the analysis interval was spoken, refers to dictionaries trained in advance, and outputs the acoustically most similar language to the language control unit 40 as the determination result. The method for determining the language is described in detail below.

Compared with speakers, it is easy to collect speech for each language in advance (at the least, a large amount of speech data in each language has been collected to train the acoustic models for speech recognition). The language can therefore be determined by creating a reference model such as a Gaussian mixture model for each language and selecting the language whose reference model has the highest similarity to the feature vector sequence extracted from the interval to be determined.

In this embodiment, mel-frequency cepstral coefficients (MFCC) are again used as an example of the feature quantity for language determination, but any existing feature quantity that permits language determination may be used.

FIG. 5 shows the distribution in the feature vector space. Each ellipse in the feature vector space represents a reference model created for one language from speech collected in advance. Input (a) in the figure is a feature vector obtained by dividing the speech input via the speech input unit 10 into analysis frames (for example, 10 ms long) and analyzing each frame. For the purpose of explanation the feature vectors have three dimensions and there are three reference models; in practice the feature vectors have several tens of dimensions. The similarity between each feature vector and each language model is calculated, and the language with the highest similarity is selected. In the example of FIG. 5, input (a) is close to the Japanese model and inputs (b) to (d) are close to the English model, so their similarities to those models are judged to be relatively high. For simplicity the determination has been described as being made frame by frame; in practice it is considered effective to reuse the speech interval detection of the speech recognition front end and determine one language for each coherent utterance.
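Utterance-level language determination can be sketched as follows. As a simplification of the Gaussian mixture models mentioned in the text, each language is modeled here by a single diagonal-covariance Gaussian, and the per-frame log-likelihoods are summed over the utterance. The function names and the model parameters are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

# Assumed model layout: language -> (mean vector, variance vector).
LanguageModels = Dict[str, Tuple[Sequence[float], Sequence[float]]]


def log_likelihood(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def determine_language(frames: Sequence[Sequence[float]], models: LanguageModels) -> str:
    """Pick the language whose model best explains all frames of the utterance."""
    scores = {
        language: sum(log_likelihood(frame, mean, var) for frame in frames)
        for language, (mean, var) in models.items()
    }
    return max(scores, key=scores.get)


# Hypothetical two-dimensional models for Japanese and English.
models = {"ja": ([0.0, 0.0], [1.0, 1.0]), "en": ([3.0, 3.0], [1.0, 1.0])}
utterance = [[2.8, 3.1], [3.2, 2.7], [1.0, 1.2]]
print(determine_language(utterance, models))  # en
```

Summing over the utterance lets a single outlying frame, such as the third one above, be outvoted by the rest, which is one reason to decide per utterance rather than per frame.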

(2-3) Language control unit 40
Next, the language control unit 40 will be described in detail.

FIG. 6 shows an example of the correspondence stored in the speaker language recording unit 50. It shows the correspondence obtained when speaker F, speaker A, and speaker B have spoken in this order from the start of the conversation, the speaker identification unit 20 has classified them into three classes, and the language determination unit 30 has determined their languages correctly. The entries marked "unclassified" in FIG. 6 are shown only for convenience, to make clear which speakers have not yet spoken; the correspondence actually contains only classes (1), (2), and (3). The operation of the language control unit 40 when an utterance by speaker C (that is, the speaker of interest) is newly input while the correspondence of FIG. 6 is in effect is described in detail below.

新しく話者Ｃの発声が、音声入力部１０に入力されると、話者同定部２０から未知の話者が入力されたとして新しいクラス（４）が、言語同定部３０から英語が出力される。 When a new utterance of the speaker C is input to the voice input unit 10, a new class (4) is output from the language identification unit 30 as an unknown speaker is input from the speaker identification unit 20. .

言語判定部４０は図６に示す対応関係を参照して、入力された同定話者及び判定言語が既存の話者によるものか新しい話者による発声かを判定する。この場合クラス（４）は未知の話者であるため、対応関係に新しいエントリーとして話者を登録し、図７に示すように対応関係を更新する。またこの際に現在の話者がクラス（４）であることを話者属性に記録しておく。 The language determination unit 40 refers to the correspondence relationship shown in FIG. 6 and determines whether the input identified speaker and determination language are uttered by an existing speaker or a new speaker. In this case, since class (4) is an unknown speaker, the speaker is registered as a new entry in the correspondence relationship, and the correspondence relationship is updated as shown in FIG. At this time, it is recorded in the speaker attribute that the current speaker is class (4).

The language control unit 40 then determines the input language and the output language based on the updated correspondence of FIG. 7.

As an example, consider the case where the language control unit 40 operates on the simplest rule: among the languages contained in the correspondence, any language different from the input language is chosen as the output language.

In that case, the correspondence of FIG. 7 lists English and Japanese, and the speaker attribute shows that the current speaker is speaking English, so the other language, Japanese, is chosen as the output language. Based on the input and output languages determined in this way, the speech recognition unit 60 recognizes the input speech as English, and the machine translation unit 70 runs its English-to-Japanese translation engine to translate the English recognition result into Japanese. The language conversion direction is thus determined automatically.
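The registration of a new speaker and the simplest rule described above can be sketched as follows. This is only an illustration under stated assumptions: the correspondence is modeled as a plain dictionary from speaker class to language, the function name `update_and_decide` is hypothetical, and the table contents stand in for FIG. 6, whose actual entries are not reproduced in the text.

```python
def update_and_decide(table, speaker_class, language):
    """Register a new speaker if needed, then apply the simplest rule:
    every recorded language other than the input language becomes an output language."""
    if speaker_class not in table:
        # Unknown speaker: add a new entry (the update from FIG. 6 to FIG. 7).
        table[speaker_class] = language
    # For an existing speaker the recorded language is reused.
    input_language = table[speaker_class]
    output_languages = sorted(set(table.values()) - {input_language})
    return input_language, output_languages


# Hypothetical contents standing in for FIG. 6 (classes (1) to (3)).
table = {1: "Japanese", 2: "English", 3: "English"}

# A new speaker, class (4), speaks English.
print(update_and_decide(table, 4, "English"))  # ('English', ['Japanese'])
```

Reusing the recorded language for a known class reflects the point that language determination need not be repeated for an existing speaker.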

(3) Effect
As described above, according to this embodiment, the correspondence between speakers and their spoken languages is recorded and held, so that the language conversion direction can be determined automatically according to the flow of the dialog while the pair of speakers currently carrying on the conversation is switched dynamically.

Therefore, in a conversation in which three or more people participate, the translation languages are switched automatically to follow the dialog regardless of which language is input from which direction, so conversation support by translation can be provided without interrupting the flow of the dialog.

In addition, because the correspondence between speakers and spoken languages is recorded, language determination does not need to be performed again for an existing speaker, which also partly reduces the delay introduced by speech translation.

(Second Embodiment)
Next, a speech translation apparatus according to a second embodiment of the present invention is described with reference to FIGS. 8 to 11. The speech translation apparatus of this embodiment treats the owner of the apparatus as the speaker of interest and determines the language conversion direction relative to this owner.

(1) Configuration of the speech translation apparatus
The configuration of the speech translation apparatus of this embodiment is described below.

FIG. 8 is a schematic configuration diagram of the speech translation apparatus according to this embodiment.

As shown in FIG. 8, the speech translation apparatus comprises a voice input unit 10, a speaker identification unit 20, a language determination unit 30, a language control unit 40, a speaker language recording unit 50, a speech recognition unit 60, a machine translation unit 70, and an owner determination unit 80.

In FIG. 8, parts that operate in the same way as in the first embodiment carry the same reference numbers, and their description is omitted.

The owner determination unit 80 works with the language control unit 40 to detect who the owner of the speech translation apparatus, or the speaker who mainly uses it, is, and records the correspondence between the identified speaker and the owner in the speaker language recording unit 50. For example, the owner can be expected to keep the apparatus at hand and operate it easily, so if the owner presses a dedicated button on the device when speaking, the correspondence is easily obtained by synchronizing the utterance with the timing of the button press.

(2) Operation of the speech translation apparatus
Next, the detailed operation of the speech translation apparatus according to the second embodiment is described using a specific example.

FIG. 9 shows an example of a conversation among several people. Among the meeting participants, speaker G speaks English, speaker H speaks Chinese, and speaker I speaks Japanese. Only speaker I owns a speech translation apparatus, which picks up the speakers' utterances through its built-in microphone and performs speech translation.

The operation of the language control unit 40 is described in detail using the conversation example of FIG. 9.

FIG. 10 shows an example of the correspondence stored in the speaker language recording unit 50. In this example, speakers G, H, and I have each spoken several times, the speaker identification unit 20 has classified them into three classes, and the language determination unit 30 has determined each language correctly. It is also assumed that the owner determination unit 80 has already established that speaker I is the owner. The operation of the language control unit 40 when speaker G, speaker H, or speaker I then speaks is described in detail below.

FIG. 11 shows the operation flowchart of the language control unit 40.

(2-1) Utterance by speaker G in English
First, consider the case where speaker G speaks.

In step S100, the language of the current speaker G, English, is obtained from the correspondence shown in FIG. 10.

Next, in step S101, the language of the owner I, Japanese, is obtained from the correspondence.

Next, in step S102, it is determined whether the current speaker G and the owner I are the same person. They are not, so the process proceeds to step S103, where the language of the current speaker is compared with that of the owner. Because English and Japanese differ, step S104 sets the input language to English and the output language to Japanese, and the processing of the language control unit 40 ends.

(2-2) Utterance by speaker H in Chinese
Next, consider the case where speaker H speaks.

In step S100, the language of the current speaker H, Chinese, is obtained from the correspondence.

Next, in step S102, it is determined whether the current speaker H and the owner I are the same person. They are not, so the process proceeds to step S103, where the language of the current speaker is compared with that of the owner. Because Chinese and Japanese differ, step S104 sets the input language to Chinese and the output language to Japanese, and the processing of the language control unit 40 ends.

(2-3) Utterance by speaker I in Japanese
Finally, the operation of the language control unit 40 when speaker I, the owner, speaks is described.

In step S100, the language of the current speaker I, Japanese, is obtained from the correspondence.

Next, in step S102, it is determined whether the current speaker I and the owner I are the same person. They are, so the process proceeds to step S106.

Next, in step S106, all languages other than Japanese (here, English and Chinese) are obtained, and step S107 sets the input language to Japanese and the output languages to English and Chinese, and the processing of the language control unit 40 ends.

Although it does not occur in the conversation of FIG. 9, when a speaker who speaks the same language as the owner takes part in the conversation, step S105 of the flowchart of FIG. 11 leaves the input language and the output language unset. That is, the input speech is neither recognized nor machine translated.
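The flow of steps S100 to S107 described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the correspondence is assumed to be a dictionary keyed by speaker label, the function name `decide_direction` is hypothetical, and returning `None` with an empty list is just one way to represent "no recognition or translation" in step S105.

```python
def decide_direction(table, current, owner):
    """Return (input_language, output_languages) following the steps of FIG. 11."""
    current_lang = table[current]              # S100: language of the current speaker
    owner_lang = table[owner]                  # S101: language of the owner
    if current == owner:                       # S102: is the current speaker the owner?
        # S106: collect every language other than the owner's own.
        others = sorted(set(table.values()) - {owner_lang})
        return current_lang, others            # S107
    if current_lang != owner_lang:             # S103: do the two languages differ?
        return current_lang, [owner_lang]      # S104: translate into the owner's language
    return None, []                            # S105: nothing is set, so nothing is translated


# Correspondence for the conversation of FIG. 9, with speaker I as the owner.
table = {"G": "English", "H": "Chinese", "I": "Japanese"}

print(decide_direction(table, "G", "I"))  # ('English', ['Japanese'])
print(decide_direction(table, "I", "I"))  # ('Japanese', ['Chinese', 'English'])
```

The output languages are sorted only to make the result deterministic; the text does not specify an order for them in this embodiment.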

(3) Effect
As described above, in this embodiment utterances in languages other than the owner's are translated into a language the owner understands, and the owner's utterances are translated into all the other languages. Translation is therefore not performed needlessly for every language pair, and speech translation runs only when the owner cannot understand. This makes it possible to support the owner's understanding while reducing the interruptions that speech translation causes to the conversation.

(Third Embodiment)
Next, a speech translation apparatus according to a third embodiment of the present invention is described with reference to FIGS. 9, 12, 13, and 15. The speech translation apparatus of this embodiment treats the immediately preceding speaker as the speaker of interest and determines the language conversion direction relative to that speaker.

FIG. 12 is a schematic configuration diagram of the speech translation apparatus according to this embodiment.

As shown in FIG. 12, the speech translation apparatus comprises a voice input unit 10, a speaker identification unit 20, a language determination unit 30, a language control unit 40, a speaker language recording unit 50, a speech recognition unit 60, a machine translation unit 70, and an utterance history recording unit 90.

In FIG. 12, parts that operate in the same way as in the first embodiment carry the same reference numbers, and their description is omitted.

The utterance history recording unit 90 works with the language control unit 40 to record the past utterance history. One example of an utterance history, shown in FIG. 13, is a time-ordered list of the speakers of a fixed number of the most recent utterances. Utterance attributes such as utterance time may be recorded together with the history.
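A bounded, time-ordered utterance history of the kind described above can be sketched with a fixed-length queue. The class and field names (`UtteranceHistory`, `Utterance`, `duration_s`) and the window length are assumptions made for illustration; the text only states that a fixed number of recent speakers is kept in time order, optionally with attributes such as utterance time.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str        # identified speaker class or label
    duration_s: float   # optional utterance attribute (utterance time in seconds)


class UtteranceHistory:
    """Keeps only the most recent `max_len` utterances, oldest first."""

    def __init__(self, max_len=6):
        self._items = deque(maxlen=max_len)

    def record(self, speaker, duration_s=0.0):
        # Appending to a full deque silently drops the oldest entry.
        self._items.append(Utterance(speaker, duration_s))

    def speakers(self):
        return [u.speaker for u in self._items]

    def previous_speaker(self):
        """Speaker of the most recent recorded utterance, or None if empty."""
        return self._items[-1].speaker if self._items else None


history = UtteranceHistory(max_len=3)
for name in ["G", "I", "G", "H"]:
    history.record(name)
print(history.speakers(), history.previous_speaker())
```

With a window of three, the first utterance is dropped once the fourth is recorded, so the history always reflects only the recent part of the conversation.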

(2) Operation of the speech translation apparatus
Next, the detailed operation of the speech translation apparatus according to the third embodiment is described using a specific example.

FIG. 9 shows an example of a conversation among several people. Among the meeting participants, speaker G speaks English, speaker H speaks Chinese, and speaker I speaks Japanese. Only speaker I owns a speech translation apparatus, which picks up the speakers' utterances through its built-in microphone and performs speech translation. The operation of the language control unit 40 is described in detail using the conversation example of FIG. 9.

FIG. 15 shows an example of the correspondence stored in the speaker language recording unit 50. In this example, speakers G, H, and I have each spoken several times, the speaker identification unit 20 has classified them into three classes, and the language determination unit 30 has determined each language correctly. It is also assumed that the utterance history up to this point is the one shown in FIG. 13. In this case the utterance history shows that the immediately preceding speaker is H, and this is entered in the speaker attribute of speaker H in FIG. 15.

FIG. 14 shows the operation flowchart of the language control unit 40.

In step S200, the language of the current speaker G, English, is obtained from the correspondence shown in FIG. 15.

Next, in step S201, the language of the immediately preceding speaker H, Chinese, is obtained from the correspondence.

Next, in step S202, it is determined whether the current speaker G and the immediately preceding speaker H are the same person. They are not, so the process proceeds to step S203, where the language of the current speaker is compared with that of the immediately preceding speaker. Because English and Chinese differ, step S204 sets the input language to English and the output language to Chinese, and the processing of the language control unit 40 ends.

(2-2) Utterance by speaker H in Chinese
Next, consider the case where speaker H speaks.

In step S200, the language of the current speaker H, Chinese, is obtained from the correspondence.

Next, in step S202, it is determined whether the current speaker H and the immediately preceding speaker H are the same person. They are, so the process proceeds to step S206.

Next, in step S206, all languages other than Chinese (here, English and Japanese) are obtained, and step S207 sets the input language to Chinese and the output languages to English and Japanese, and the processing of the language control unit 40 ends.

(3) Effect
As described above, this embodiment assumes that the immediately preceding speaker and the current speaker are the ones carrying on the conversation, and controls the language conversion direction so that the languages of these two people are translated with priority. This allows speech translation to operate with as little disruption to the flow of the dialog as possible.

(Fourth Embodiment)
Next, a speech translation apparatus according to a fourth embodiment of the present invention is described with reference to FIGS. 9, 12, 13, 16, and 17. The speech translation apparatus of this embodiment treats the main speaker, the participant who has spoken the most, as the speaker of interest and determines the language conversion direction relative to that speaker.

FIG. 12 is a schematic configuration diagram of the speech translation apparatus according to the fourth embodiment.

The third embodiment extracted the immediately preceding speaker from the utterance history to determine the language conversion direction. This embodiment differs in that it obtains from the utterance history the main speaker, who has done most of the speaking over a fixed recent interval, and determines the language conversion direction so as to give priority to communication between the main speaker and the current speaker.

(2) Operation of the speech translation apparatus
The operation of this embodiment is described in detail below, using the conversation of FIG. 9 and the utterance history of FIG. 13 as an example.

FIG. 17 shows an example of the correspondence stored in the speaker language recording unit 50. In this example, speakers G, H, and I have each spoken several times, the speaker identification unit 20 has classified them into three classes, and the language determination unit 30 has determined each language correctly.

FIG. 16 shows the operation flowchart of the language control unit 40 in this embodiment.

In step S30, the main speaker is detected from the utterance history shown in FIG. 13. One possible detection method is to select, from the speakers stored in the utterance history, the speaker with the most utterances or the speaker with the longest utterance time. With the utterance history of FIG. 13, speaker G is selected as the main speaker in either case, so the fact that speaker G is the main speaker is recorded in the speaker attribute of speaker G in the correspondence shown in FIG. 17.
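The two detection methods mentioned for step S30 can be sketched as follows. The history below is hypothetical: it is not the content of FIG. 13, only a sample chosen so that speaker G comes out as the main speaker under both criteria, as the text states. The function names are also assumptions, and "longest utterance time" is interpreted here as the longest total time within the history.

```python
from collections import Counter


def main_speaker_by_count(history):
    """Speaker with the largest number of utterances in the history."""
    counts = Counter(speaker for speaker, _ in history)
    return counts.most_common(1)[0][0]


def main_speaker_by_time(history):
    """Speaker with the longest total utterance time (an assumed interpretation)."""
    totals = Counter()
    for speaker, duration_s in history:
        totals[speaker] += duration_s
    return max(totals, key=totals.get)


# Hypothetical (speaker, utterance time in seconds) pairs, oldest first.
history = [("G", 4.0), ("I", 2.0), ("G", 3.5), ("H", 2.5), ("G", 5.0), ("H", 1.5)]

print(main_speaker_by_count(history), main_speaker_by_time(history))
```

Both functions assume a non-empty history; how ties between speakers should be resolved is not specified in the text and is left to the default behavior here.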

Next, in step S300, the language of the current speaker G, English, is obtained from the correspondence, and in step S301, the language of the main speaker G, English, is obtained.

Next, in step S302, it is determined whether the current speaker G and the main speaker G are the same person. They are, so step S306 obtains all languages other than English (here, Chinese and Japanese), and step S307 sets the input language to English and the output languages to Chinese and Japanese, and the processing of the language control unit 40 ends.

(2-2) Utterance by speaker H in Chinese
Now consider the case where speaker H speaks instead of speaker G.

In step S300, the language of the current speaker H, Chinese, is obtained from the correspondence, and in step S301, the language of the main speaker G, English, is obtained.

Next, in step S302, it is determined whether the current speaker H and the main speaker G are the same person. They are not, so the process proceeds to step S303, where the language of the current speaker is compared with that of the main speaker. Because Chinese and English differ, step S304 sets the input language to Chinese and the output language to English, and the processing of the language control unit 40 ends.

(3) Effect
As described above, this embodiment assumes that the main speaker, who has done most of the speaking over a fixed interval, and the current speaker are the ones carrying on the conversation, and controls the language conversion direction so that the languages of these two people are translated with priority. This allows speech translation to operate with as little disruption to the flow of the dialog as possible.

(Fifth Embodiment)
Next, a speech translation apparatus according to a fifth embodiment of the present invention is described with reference to FIGS. 9, 12, 13, 18, and 20. The speech translation apparatus of this embodiment treats the immediately preceding speaker as the speaker of interest and determines the language conversion direction from the speaker pair history. That is, the speaker who has spoken the most is taken as the speaker of interest, and the speaker who has spoken the next most is taken as that speaker's conversation partner.

FIG. 12 is a schematic configuration diagram of the speech translation apparatus according to the fifth embodiment.

As shown in FIG. 12, the speech translation apparatus comprises a voice input unit 10, a speaker identification unit 20, a language determination unit 30, a language control unit 40, a speaker language recording unit 50, a speech recognition unit 60, a machine translation unit 70, and an utterance history recording unit 90.

In FIG. 12, parts that operate in the same way as in the first embodiment carry the same reference numbers, and their description is omitted.

The third and fourth embodiments extracted the immediately preceding speaker or the main speaker from the utterance history to determine the language conversion direction. This embodiment differs from them in that it obtains from the utterance history the pairs of speakers who spoke in alternation over a fixed recent interval and determines the language conversion direction according to whether such speaker pairs exist.

(2) Operation of the speech translation apparatus
This embodiment is described in detail below, using the conversation of FIG. 9 and the utterance history of FIG. 13 as an example.

FIG. 19 shows an example of the correspondence stored in the speaker language recording unit 50. In this example, speakers G, H, and I have each spoken several times, the speaker identification unit 20 has classified them into three classes, and the language determination unit 30 has determined each language correctly.

FIG. 18 shows the operation flowchart of the language control unit 40 in this embodiment.

In step S40, a table of speaker pairs is created from the utterance history shown in FIG. 13. FIG. 20 shows the speaker pairs created from the utterance history of FIG. 13. The table counts the frequency of each (preceding speaker, following speaker) pair over consecutive utterances.
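The construction of the speaker pair table in step S40 can be sketched as follows. The speaker sequence is hypothetical and is not the content of FIG. 13; it is chosen only to be consistent with what the text says about FIG. 20 (speaker G has been paired with both H and I, speaker H only with G). Skipping consecutive utterances by the same speaker is an assumption, since the text does not say how such cases are counted.

```python
from collections import Counter


def build_pair_table(speakers):
    """Count each (preceding speaker, following speaker) pair of consecutive utterances."""
    return Counter(
        (prev, nxt)
        for prev, nxt in zip(speakers, speakers[1:])
        if prev != nxt  # assumption: a speaker is not paired with themselves
    )


# Hypothetical time-ordered speaker sequence, oldest first.
speakers = ["G", "I", "G", "H", "G", "H"]
pair_table = build_pair_table(speakers)

for pair, count in sorted(pair_table.items()):
    print(pair, count)
```

Pairs are kept ordered, so (G, H) and (H, G) are counted separately, matching the (preceding speaker, following speaker) layout described for the table.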

Next, in step S400, the language of the current speaker G, English, is obtained from the correspondence. Next, in step S401, the speakers who have formed a speaker pair with the current speaker G are picked up from the speaker pairs illustrated in FIG. 20. According to FIG. 20, the speakers with a pair history with the current speaker G are speaker H and speaker I, so Chinese and Japanese are obtained as the languages of the paired speakers.

Next, in step S402, it is determined whether a speaker pair exists. One does, so step S403 sets the input language to English and the output languages to Chinese and Japanese, and the processing of the language control unit 40 ends.

(2-2) Utterance by speaker H in Chinese
Now consider the case where speaker H speaks instead of speaker G.

In step S400, the language of the current speaker H, Chinese, is obtained from the correspondence.

Next, in step S401, FIG. 20 shows that the only speaker with a pair history with the current speaker H is speaker G, so English is obtained as the language of the paired speaker.

Next, in step S402, it is determined whether a speaker pair exists. One does, so step S403 sets the input language to Chinese and the output language to English, and the processing of the language control unit 40 ends.

(3) Effect
As described above, this embodiment controls the language conversion direction based on the speaker pairs that have exchanged utterances with the current speaker over a fixed interval, so that the languages of the two people carrying on the conversation are translated with priority. This allows speech translation to operate with as little disruption to the flow of the dialog as possible.

（第６の実施形態）
次に、本発明の第６の実施形態の音声翻訳装置について図９、図１２、図１３、図１９、図２０、図２１に基づいて説明する。本実施形態の音声翻訳装置は、複数の言語変換方向を優先順位付けして決定する。 (Sixth embodiment)
Next, a speech translation apparatus according to a sixth embodiment of the present invention will be described with reference to FIGS. 9, 12, 13, 19, 20, and 21. FIG. The speech translation apparatus of this embodiment prioritizes and determines a plurality of language conversion directions.

図１２は、第６の実施形態に関わる音声翻訳装置の概略の構成図である。 FIG. 12 is a schematic configuration diagram of a speech translation apparatus according to the sixth embodiment.

第３の実施形態から第５の実施形態では発話履歴から複数の出力言語が存在した場合にはその優先順位を特に決定しなかったが、本実施形態では複数の出力言語が存在した場合には過去一定区間の発話履歴から出力すべき言語の優先順位を付ける点が以前の実施形態とは異なっている。 In the third to fifth embodiments, when there are a plurality of output languages from the utterance history, the priority order is not particularly determined. However, in the present embodiment, when there are a plurality of output languages, The point of prioritizing the language to be output from the utterance history in the past fixed section is different from the previous embodiment.

図２１は本実施形態における言語制御部４０の動作フローチャートを示している。 FIG. 21 shows an operation flowchart of the language control unit 40 in the present embodiment.

ステップＳ５０で図１３に示す発話履歴から話者対の表を作成する。図２０には図１３の発話履歴に基づいて作成された話者対を示している。表は連続する発話者の対毎に、（直前話者、直後話者）の頻度をカウントしたものである。 In step S50, a table of speaker pairs is created from the utterance history shown in FIG. FIG. 20 shows a pair of speakers created based on the utterance history of FIG. The table counts the frequency of (speaker just before, speaker just after) for each pair of consecutive speakers.

次に、ステップＳ５００で対応関係から現話者Ｇの言語が英語であること取得する。 In step S500, it is acquired from the correspondence relationship that the language of the current speaker G is English.

次に、ステップＳ５０１では図２０に例示する話者対から現話者Ｇとの話者対となった話者のリストをピックアップする。図２０を参照すると現話者Ｇとの話者対の履歴がある話者は話者Ｈと話者Ｉとなるので話者対の言語として中国語及び日本語を取得する。 Next, in step S501, a list of speakers that are speaker pairs with the current speaker G is picked up from the speaker pairs illustrated in FIG. Referring to FIG. 20, a speaker having a history of a speaker pair with the current speaker G becomes a speaker H and a speaker I, so Chinese and Japanese are acquired as the language of the speaker pair.

次に、ステップＳ５０２では話者対が一致するか否かを判定するが、この場合は存在するためにステップＳ５０３に進む。 Next, in step S502, it is determined whether or not the speaker pair matches. In this case, since it exists, the process proceeds to step S503.

次に、ステップＳ５０３では話者対が複数存在するか否かを判定するが、この場合は存在するためにステップＳ５０４にて出力言語の優先順位を付ける。優先順位の付け方としては、例えば図２０を参照して話者対となった頻度の多い話者を優先するなどの方法が考えられる。この場合には、話者Ｈと話者対になった回数が多いため、話者Ｈの言語である中国語を話者Ｉの日本語よりも優先することになる。 Next, in step S503, it is determined whether or not there are a plurality of speaker pairs. In this case, since there is a speaker pair, priority is given to output languages in step S504. As a method of assigning priorities, for example, referring to FIG. 20, a method of giving priority to a speaker having a high frequency as a speaker pair can be considered. In this case, since the speaker H and the speaker are paired many times, the speaker H's language, Chinese, is given priority over the speaker I's Japanese.

Next, in step S505, English is determined as the input language, Chinese is output first as the prioritized output language, followed by Japanese, and the processing of the language control unit 40 ends.

In step S500, the language of the current speaker H is obtained from the correspondence as Chinese.

Next, in step S501, FIG. 20 shows that the only speaker with a history of pairing with the current speaker H is speaker G, so English is obtained as the language of the speaker pair.

Next, in step S502, it is determined whether a speaker pair exists. In this case one exists, so the process proceeds to step S503.

Next, in step S503, it is determined whether a plurality of speaker pairs exist. In this case they do not, so the input language is determined to be Chinese and the output language to be English, and the processing of the language control unit 40 ends.
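Steps S500 to S505 can be sketched as below. The function name `decide_languages`, the sample pair counts, and the choice to add both directions of a pair together are assumptions made for illustration; the branch taken when no speaker pair exists is not described in this passage, so the sketch simply returns an empty output list there.

```python
from collections import Counter

def decide_languages(current, speaker_lang, pair_counts):
    """Return (input language, output languages in priority order)."""
    input_lang = speaker_lang[current]                     # S500
    partner_freq = Counter()                               # S501
    for (prev, nxt), n in pair_counts.items():
        if prev == current:
            partner_freq[nxt] += n
        elif nxt == current:
            partner_freq[prev] += n
    if not partner_freq:                                   # S502: no pair exists
        return input_lang, []                              # assumption, see lead-in
    outputs = []
    for partner, _ in partner_freq.most_common():          # S503/S504: frequent first
        lang = speaker_lang[partner]
        if lang != input_lang and lang not in outputs:
            outputs.append(lang)
    return input_lang, outputs                             # S505

# Hypothetical data consistent with the walk-through (G pairs with H more than with I).
speaker_lang = {"G": "English", "H": "Chinese", "I": "Japanese"}
pair_counts = {("G", "H"): 2, ("H", "G"): 2, ("G", "I"): 1, ("I", "G"): 1}
print(decide_languages("G", speaker_lang, pair_counts))
print(decide_languages("H", speaker_lang, pair_counts))
```

With this data, speaker G yields English as input with Chinese ahead of Japanese, and speaker H yields Chinese as input with English as the only output, matching the two cases walked through above.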

(3) Effect
As described above, in this embodiment, when a plurality of output languages are determined from the utterance history over a certain interval, the language translation direction is controlled so that the languages of the two people carrying the conversation are translated preferentially. This allows speech translation to operate while disturbing the flow of the dialogue as little as possible.

(Seventh embodiment)
Next, a speech translation apparatus according to a seventh embodiment of the present invention will be described with reference to FIGS. 22 and 23. The speech translation apparatus of this embodiment withholds output until the language translation direction has been determined.

(1) Configuration of the speech translation apparatus
FIG. 22 is a schematic configuration diagram of the speech translation apparatus according to the seventh embodiment.

As shown in FIG. 22, the speech translation apparatus comprises a voice input unit 10, a speaker identification unit 20, a language determination unit 30, a language control unit 40, a speaker language recording unit 50, a speech recognition unit 60, a machine translation unit 70, and a voice storage unit 100.

In FIG. 22, parts that operate in the same way as in the previous embodiments are given the same reference numbers, and their description is omitted.

The voice storage unit 100, under the control of the language control unit 40, determines whether the voice input from the voice input unit 10 is passed to the speech recognition unit 60 immediately or is stored once and recognized later. When the aim is to support a conversation by translating the remarks of participants who are in the same place, it is basically better not to store the voice, so that the flow of the conversation is not disturbed.

However, when the input language of the input voice or the output language into which it should be translated cannot be determined, the voice can be stored once, and the stored voice can be output together at the point when the input language or the output language has been determined using the processing results of voice that is input later.

(2) Operation of the speech translation apparatus
FIG. 23 shows an example of the correspondence stored in the speaker language recording unit 50.

In the example of FIG. 23, only speaker G has spoken so far, and it is not yet known which languages the other participants in the conversation speak. In this situation it is not known into which language the utterances of speaker G should be translated, so if speaker G speaks repeatedly, the voice is stored.

After that, suppose for example that an utterance by speaker H is input, that is, the number of utterances reaches a specified number (for example, one) or more, the speaker identification unit 20 creates a new speaker class, and the language determination unit 30 correctly determines the language to be Chinese. In that case, the utterances of speaker G stored up to that point are translated together from English into Chinese.
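The store-then-flush behavior of the voice storage unit can be sketched as follows. The class `DeferredTranslator` and the `translate(audio, src, dst)` callable are hypothetical stand-ins for the recognition and translation units, and the specified number of utterances is fixed at one for simplicity.

```python
class DeferredTranslator:
    """Hold utterances until a second language is known, then translate them together."""

    def __init__(self, translate):
        self.translate = translate   # hypothetical callable: (audio, src, dst) -> result
        self.speaker_lang = {}       # speaker -> determined language
        self.pending = []            # stored (language, audio) items

    def on_utterance(self, speaker, lang, audio):
        self.speaker_lang[speaker] = lang
        self.pending.append((lang, audio))
        known = set(self.speaker_lang.values())
        if len(known) < 2:
            return []                # direction undetermined: keep storing
        results = []
        for src, stored in self.pending:
            for dst in sorted(known - {src}):
                results.append(self.translate(stored, src, dst))
        self.pending.clear()
        return results

# Stand-in translator that only labels the requested direction.
t = DeferredTranslator(lambda audio, src, dst: f"{src}->{dst}:{audio}")
first = t.on_utterance("G", "English", "hello")
second = t.on_utterance("G", "English", "anyone here?")
third = t.on_utterance("H", "Chinese", "ni hao")
print(first, second, third)
```

In this sketch the two stored utterances of speaker G are released only when speaker H's Chinese utterance fixes the direction, and H's own utterance is translated in the same pass.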

(3) Effect
As described above, according to this embodiment, even when the input language and the output language cannot be determined at the time the voice is input, the voice data is stored, and the translation result is output once the number of utterances by a speaker reaches the specified number and the language translation direction is fixed. In this way, even the voice uttered at the beginning of a conversation is conveyed to the other party without being lost.

(Eighth embodiment)
Next, a speech translation apparatus according to an eighth embodiment of the present invention will be described with reference to FIGS. 9, 10, 24, 23, 18, and 20. The speech translation apparatus of this embodiment outputs a plurality of languages using different media or display methods.

(1) Configuration of the speech translation apparatus
FIG. 24 is a schematic configuration diagram of the speech translation apparatus according to the eighth embodiment.

As shown in FIG. 24, the speech translation apparatus comprises a voice input unit 10, a speaker identification unit 20, a language determination unit 30, a language control unit 40, a speaker language recording unit 50, a speech recognition unit 60, a machine translation unit 70, and a result output unit 110.

In FIG. 24, parts that operate in the same way as in the previous embodiments are given the same reference numbers, and their description is omitted.

The result output unit 110 presents the translation result input from the machine translation unit 70 to the participants by a presentation method such as speech synthesis or screen display. This embodiment differs from the preceding embodiments in that the language control unit 40 switches the presentation method of the translation result.

(2) Operation of the speech translation apparatus
The details of this embodiment are described below, taking the dialogue scene shown in FIG. 9 as an example.

FIG. 10 shows an example of the correspondence stored in the speaker language recording unit 50. The example of FIG. 10 shows the correspondence in the case where speakers G, H, and I have each spoken several times, the speaker identification unit 20 has classified the speakers into three classes, and the language determination unit 30 has determined their languages correctly.

FIG. 25 is a flowchart of the operation of the language control unit 40 in this embodiment.

In step S600, the language of the current speaker G is obtained as English from the correspondence shown in FIG. 10.

Next, in step S601, the language of the owner I is obtained from the correspondence as Japanese.

Next, in step S602, it is determined whether the current speaker G and the owner I are the same person. In this case they differ, so the process proceeds to step S603, where the language of the current speaker is compared with that of the owner. Here English and Japanese differ, so in step S604 the input language is determined to be English and the output language Japanese, and in step S608 the output method is determined to be screen display, and the processing of the language control unit 40 ends.

In step S600, the language of the current speaker H is obtained from the correspondence as Chinese.

Next, in step S602, it is determined whether the current speaker H and the owner I are the same person. In this case they differ, so the process proceeds to step S603, where the language of the current speaker is compared with that of the owner. Here Chinese and Japanese differ, so in step S604 the input language is determined to be Chinese and the output language Japanese, and in step S608 the output method is determined to be screen display, and the processing of the language control unit 40 ends.

(2-2) Utterance in Japanese by speaker I
Finally, the operation of the language control unit 40 when speaker I, the owner, speaks is described.

In step S600, the language of the current speaker I is obtained from the correspondence as Japanese.

Next, in step S602, it is determined whether the current speaker I and the owner I are the same person. They are, so the process proceeds to step S606.

Next, in step S606, all languages other than Japanese (in this case, English and Chinese) are obtained. In step S607 the input language is determined to be Japanese and the output languages English and Chinese, and in step S609 the output method is determined to be speech synthesis, and the processing of the language control unit 40 ends.
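The branching of steps S600 to S609 can be sketched as below. The function name `decide_output` and the string labels for the output methods are hypothetical. The case where a non-owner speaks the owner's language is not walked through above; the sketch treats it as "no translation", in line with the claim that recognition and translation are skipped when input and output languages match.

```python
def decide_output(current, owner, speaker_lang):
    """Return (input language, output languages, output method)."""
    input_lang = speaker_lang[current]                       # S600
    owner_lang = speaker_lang[owner]                         # S601
    if current == owner:                                     # S602: owner is speaking
        others = sorted({lang for lang in speaker_lang.values()
                         if lang != owner_lang})             # S606
        return input_lang, others, "speech_synthesis"        # S607, S609
    if input_lang != owner_lang:                             # S603
        return input_lang, [owner_lang], "screen_display"    # S604, S608
    return input_lang, [], None                              # assumption: nothing to translate

# Correspondence as described for FIG. 10, with speaker I as the owner.
speaker_lang = {"G": "English", "H": "Chinese", "I": "Japanese"}
for speaker in ("G", "H", "I"):
    print(speaker, decide_output(speaker, "I", speaker_lang))
```

The design point is that only the owner's remarks trigger synthesized speech, so the other participants' remarks reach the owner silently on the screen.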

(3) Effect
As described above, in this embodiment, by controlling the output so that translations of the owner's remarks are presented by speech synthesis and translations of the other speakers' remarks are presented on the screen, speech translation can operate while reducing the risk that synthesized speech interrupts the conversation and impairs the flow of the dialogue.

(Modifications)
The present invention is not limited to the above embodiments and can be modified in various ways without departing from its gist.

The above embodiments were described on the assumption that the determination results of the language determination unit 30 and the speaker identification unit 20 are always correct. In practice, however, 100% determination accuracy is not easy to achieve, and there is a risk of malfunction unless determination errors are taken into account.

To take the possibility of determination errors into account, countermeasures are conceivable such as deciding the result of language determination or speaker identification by a majority vote over a plurality of utterance sections. Introducing determination over multiple utterances risks slowing the response of speech translation in the early stage of a conversation, but once this initial delay is accepted, operation is stable thereafter, so the advantages can be said to outweigh the drawback.
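A majority vote over recent utterance sections might look like the following. This is only a sketch: the class name `MajorityVote` and the window size of three are assumptions, since the text does not specify how many utterance sections are used.

```python
from collections import Counter, deque

class MajorityVote:
    """Smooth per-utterance determinations by voting over the last few results."""

    def __init__(self, window=3):
        self.recent = deque(maxlen=window)   # oldest result is dropped automatically

    def update(self, label):
        self.recent.append(label)
        # Return the label seen most often within the window.
        return Counter(self.recent).most_common(1)[0][0]

vote = MajorityVote(window=3)
# Hypothetical per-utterance language determinations, with one error in the middle.
decisions = [vote.update(x) for x in ["English", "Chinese", "English", "English"]]
print(decisions[-1])
```

One voter would be kept per speaker class for language determination; the same structure applies to speaker identification results.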

FIG. 1 is a diagram showing a schematic configuration example of the speech translation apparatus according to the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of a conversation scene among several people in the first embodiment.
FIG. 3 is an example of the feature vector space in the speaker identification unit in the configuration diagram of FIG. 1 according to the first embodiment.
FIG. 4 is an example of the speaker vector space in the speaker identification unit in the configuration diagram of FIG. 1 according to the first embodiment.
FIG. 5 is an example of the feature vector space in the language determination unit in the configuration diagram of FIG. 1 according to the first embodiment.
FIG. 6 is an example of the correspondence stored in the speaker language recording unit in the configuration diagram of FIG. 1 according to the first embodiment.
FIG. 7 is an example of the correspondence stored in the speaker language recording unit in the configuration diagram of FIG. 1 according to the first embodiment.
FIG. 8 is a diagram showing a schematic configuration example of the speech translation apparatus according to the second embodiment of the present invention.
FIG. 9 is a diagram showing an example of a conversation scene among several people in the second embodiment.
FIG. 10 is an example of the correspondence stored in the speaker language recording unit in the configuration diagram of FIG. 8 according to the second embodiment.
FIG. 11 is an operation flowchart of the language control unit in the configuration diagram of FIG. 8 according to the second embodiment.
FIG. 12 is a diagram showing a schematic configuration example of the speech translation apparatus according to the third, fourth, fifth, and sixth embodiments of the present invention.
FIG. 13 is an example of the utterance history stored in the utterance history recording unit in the configuration diagram of FIG. 12 according to the third embodiment.
FIG. 14 is an operation flowchart of the language control unit in the configuration diagram of FIG. 12 according to the third embodiment.
FIG. 15 is an example of the correspondence stored in the speaker language recording unit in the configuration diagram of FIG. 12 according to the third embodiment.
FIG. 16 is an operation flowchart of the language control unit in the configuration diagram of FIG. 12 according to the fourth embodiment.
FIG. 17 is an example of the correspondence stored in the speaker language recording unit in the configuration diagram of FIG. 12 according to the fourth embodiment.
FIG. 18 is an operation flowchart of the language control unit in the configuration diagram of FIG. 12 according to the fifth embodiment.
FIG. 19 is an example of the correspondence stored in the speaker language recording unit in the configuration diagram of FIG. 12 according to the fifth embodiment.
FIG. 20 is an example of the utterance history stored in the utterance history recording unit in the configuration diagram of FIG. 12 according to the fifth embodiment.
FIG. 21 is an operation flowchart of the language control unit in the configuration diagram of FIG. 12 according to the sixth embodiment.
FIG. 22 is a diagram showing a schematic configuration example of the speech translation apparatus according to the seventh embodiment of the present invention.
FIG. 23 is an example of the correspondence stored in the speaker language recording unit in the configuration diagram of FIG. 22 according to the seventh embodiment.
FIG. 24 is a diagram showing a schematic configuration example of the speech translation apparatus according to the eighth embodiment of the present invention.
FIG. 25 is an operation flowchart of the language control unit in the configuration diagram of FIG. 12 according to the eighth embodiment.

Explanation of symbols

10 Voice input unit
20 Speaker identification unit
30 Language determination unit
40 Language control unit
50 Speaker language recording unit
60 Speech recognition unit
70 Machine translation unit
80 Owner determination unit
90 Utterance history recording unit
100 Voice storage unit
110 Result output unit

Claims

A speech translation apparatus for use among three or more speakers, comprising:
a voice input unit that inputs the voice of each speaker;
a speaker identification unit that analyzes each voice and identifies the speaker;
a language determination unit that analyzes each voice and determines the spoken language;
a speaker-of-interest determination unit that determines which of the speakers is the speaker of interest;
a speaker language recording unit that records a correspondence between each identified speaker and the determined language;
a language control unit that, based on the correspondence and the speaker of interest, (1) when the input voice is that of the speaker of interest, determines the determined language of the speaker of interest as the input language and a language other than the input language as the output language, and (2) when the input voice is not that of the speaker of interest, determines the determined language of the input voice as the input language and the determined language of the speaker of interest as the output language;
a speech recognition unit that recognizes the input voice in the input language; and
a machine translation unit that translates the speech recognition result from the input language into the output language.

The speech translation apparatus according to claim 1, wherein the speaker-of-interest determination unit determines which of the speakers is the owner and sets the determined owner as the speaker of interest.

The speech translation apparatus according to claim 1, wherein the speaker-of-interest determination unit records the order in which the speakers speak as an utterance history and sets the immediately preceding speaker recorded in the utterance history as the speaker of interest.

The speech translation apparatus according to claim 1, wherein the speaker-of-interest determination unit records the number of utterances or the utterance time of each speaker as an utterance history and, based on the number of utterances or the utterance time recorded in the utterance history, sets as the speaker of interest the speaker with the largest number of utterances or the longest utterance time within a predetermined time.

The speech translation apparatus according to claim 1, wherein the speaker-of-interest determination unit records the number of utterances or the utterance time of each speaker as an utterance history and, based on the number of utterances or the utterance time recorded in the utterance history, sets as the speaker of interest the speaker with the largest number of utterances or the longest utterance time within a predetermined time, and
the language control unit, based on the number of utterances or the utterance time recorded in the utterance history, sets as the conversation partner of the speaker of interest the speaker whose number of utterances or utterance time within the predetermined time is second only to that of the speaker of interest, and (1) when the input voice is that of the speaker of interest, determines the determined language of the speaker of interest as the input language and the language of the conversation partner as the output language, (2) when the input voice is that of the conversation partner, determines the language of the conversation partner as the input language and the determined language of the speaker of interest as the output language, and (3) when the input voice is that of neither the speaker of interest nor the conversation partner, determines the determined language of the input voice as the input language and the determined language of the speaker of interest as the output language.

The speech translation apparatus according to claim 1, wherein the speech recognition and the translation are not performed when the input language and the output language match.

The speech translation apparatus wherein, when there are a plurality of output languages, the language control unit prioritizes the output languages, and the speech recognition and the translation are performed in order of that priority.

The speech translation apparatus according to claim 1, further comprising a voice output unit and an image output unit that output the translation result,
wherein the language control unit selects the voice output unit or the image output unit for each output language to output the translation result.

The speech translation apparatus according to claim 1, further comprising a voice storage unit that stores the input voice,
wherein, when the language control unit cannot determine the input language, the input voice is stored in the voice storage unit until the number of utterances by the speaker of the input voice reaches a predetermined number or more.

The speech translation apparatus according to claim 1, further comprising a recognition result storage unit that stores the result of the speech recognition,
wherein, when the language control unit cannot determine the output language, the recognition result is stored in the recognition result storage unit until the number of utterances by speakers other than the speaker of interest reaches a predetermined number or more.

A speech translation method for use among three or more speakers, comprising:
inputting the voice of each speaker;
analyzing each voice to identify the speaker;
analyzing each voice to determine the spoken language;
determining which of the speakers is the speaker of interest;
recording a correspondence between each identified speaker and the determined language;
based on the correspondence and the speaker of interest, (1) when the input voice is that of the speaker of interest, determining the determined language of the speaker of interest as the input language and a language other than the input language as the output language, and (2) when the input voice is not that of the speaker of interest, determining the determined language of the input voice as the input language and the determined language of the speaker of interest as the output language;
recognizing the input voice in the input language; and
translating the speech recognition result from the input language into the output language.

A speech translation program for use among three or more speakers, the program causing a computer to realize:
a voice input function of inputting the voice of each speaker;
a speaker identification function of analyzing each voice and identifying the speaker;
a language determination function of analyzing each voice and determining the spoken language;
a speaker-of-interest determination function of determining which of the speakers is the speaker of interest;
a speaker language recording function of recording a correspondence between each identified speaker and the determined language;
a language control function of, based on the correspondence and the speaker of interest, (1) when the input voice is that of the speaker of interest, determining the determined language of the speaker of interest as the input language and a language other than the input language as the output language, and (2) when the input voice is not that of the speaker of interest, determining the determined language of the input voice as the input language and the determined language of the speaker of interest as the output language;
a speech recognition function of recognizing the input voice in the input language; and
a machine translation function of translating the speech recognition result from the input language into the output language.