JP5704686B2

JP5704686B2 - Speech translation system, speech translation device, speech translation method, and program

Info

Publication number: JP5704686B2
Application number: JP2010217559A
Authority: JP
Inventors: 英男大熊; 将夫内山; 隅田　英一郎
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2010-09-28
Filing date: 2010-09-28
Publication date: 2015-04-22
Anticipated expiration: 2030-09-28
Also published as: JP2012073369A

Description

The present invention relates to a speech translation system and related devices that translate input speech and output the translation as synthesized speech.

Conventionally, there has been a dialogue translation technique that translates utterances by preparing bilingual example sentences in advance and displaying the translation of a specific example sentence to the conversation partner (see Patent Document 1). This technique presents the conversation partner with answer choices for a question from the user, and displays to the user the translation of the answer the partner selects, thereby interpreting the partner's answer for the user.

There has also been an automatic interpretation system that extracts keywords from the sentence produced by speech recognition of the input speech, searches for example sentences using the keywords, and performs automatic interpretation using the retrieved example sentence (see Non-Patent Document 1).

Patent Document 1: Japanese Patent No. 3952709 (page 1, FIG. 1, etc.)

Non-Patent Document 1: Takahiro Ikeda and four others, "Automatic interpretation system integrating free sentence interpretation and example sentence selection type interpretation," FIT (Information Science and Technology Forum), 2002

However, in conventional speech translation systems, translation accuracy degrades markedly when the speech recognition result contains an error.

More specifically, a speech recognition result often contains words that sound similar to the intended words but differ greatly in meaning. Conventional techniques nevertheless search for example sentences using the words in the speech recognition result as keys, so text containing words far from the correct ones becomes the input to machine translation, and translation accuracy degrades markedly.

The speech translation system of the first invention comprises a terminal device and a server device. The terminal device comprises: a speech reception unit that receives speech; a speech-related information acquisition unit that acquires speech-related information, which is either the speech received by the speech reception unit or one or more feature quantities of that speech; a speech-related information transmission unit that transmits the speech-related information to the server device; a speech synthesis result reception unit that receives a speech synthesis result from the server device; and a synthesized speech output unit that outputs speech using the speech synthesis result. The server device comprises: a named entity information storage unit capable of storing two or more pieces of named entity information, each having a phoneme string and a character string; a speech-related information reception unit that receives the speech-related information; a speech recognition unit that performs speech recognition using the speech-related information and acquires a phoneme string; a similar phoneme string acquisition unit that acquires, from the named entity information storage unit, a phoneme string similar to the phoneme string acquired by the speech recognition unit; a similar character string acquisition unit that acquires, from the named entity information storage unit, a similar character string, which is the character string corresponding to the phoneme string acquired by the similar phoneme string acquisition unit; a machine translation unit that translates the similar character string acquired by the similar character string acquisition unit and acquires a translation result; a speech synthesis unit that performs speech synthesis on the translation result acquired by the machine translation unit and acquires a speech synthesis result; and a speech synthesis result transmission unit that transmits the speech synthesis result to the terminal device.

With this configuration, a good translation result can be obtained even when the speech recognition result contains an error.

The speech translation system of the second invention is the system of the first invention in which: the speech recognition unit performs speech recognition using the speech-related information and acquires one or more phoneme strings and one or more speech recognition character strings, that is, one or more character strings that are speech recognition results; the similar phoneme string acquisition unit acquires, from the named entity information storage unit, one or more phoneme strings similar to the phoneme string acquired by the speech recognition unit; and the similar character string acquisition unit acquires one or more similar character strings corresponding to the one or more phoneme strings acquired by the similar phoneme string acquisition unit. The server device further comprises: a candidate character string transmission unit that transmits to the terminal device two or more candidate character strings, namely the one or more speech recognition character strings acquired by the speech recognition unit and the one or more similar character strings acquired by the similar character string acquisition unit; and a candidate character string specifying information reception unit that receives from the terminal device, in response to the transmission of the two or more candidate character strings, candidate character string specifying information, which is information specifying one candidate character string. The machine translation unit translates the speech recognition character string or similar character string corresponding to the candidate character string specifying information and acquires a translation result. The terminal device further comprises: a candidate character string reception unit that receives the two or more candidate character strings from the server device; a candidate character string output unit that outputs the two or more candidate character strings received by the candidate character string reception unit; an instruction reception unit that receives an instruction selecting one candidate character string from among the two or more candidate character strings output by the candidate character string output unit; and a candidate character string specifying information transmission unit that transmits to the server device candidate character string specifying information specifying the candidate character string corresponding to the instruction received by the instruction reception unit.

With this configuration, an even better translation result can be obtained even when the speech recognition result contains an error.

The speech translation system of the third invention is the system of the first invention in which: the similar phoneme string acquisition unit acquires, from the named entity information storage unit, two or more phoneme strings similar to the phoneme string acquired by the speech recognition unit; and the similar character string acquisition unit acquires two or more similar character strings, which are the two or more character strings corresponding to the two or more phoneme strings acquired by the similar phoneme string acquisition unit. The server device further comprises: a candidate character string transmission unit that transmits to the terminal device two or more candidate character strings, namely the two or more similar character strings acquired by the similar character string acquisition unit; and a candidate character string specifying information reception unit that receives from the terminal device, in response to the transmission of the two or more candidate character strings, candidate character string specifying information, which is information specifying one candidate character string. The machine translation unit translates the similar character string corresponding to the candidate character string specifying information and acquires a translation result. The terminal device further comprises: a candidate character string reception unit that receives the two or more candidate character strings from the server device; a candidate character string output unit that outputs the two or more candidate character strings received by the candidate character string reception unit; an instruction reception unit that receives an instruction selecting one candidate character string from among the two or more candidate character strings output by the candidate character string output unit; and a candidate character string specifying information transmission unit that transmits to the server device candidate character string specifying information specifying the candidate character string corresponding to the instruction received by the instruction reception unit.

The speech translation system of the fourth invention is the system of the second or third invention in which the server device further comprises a control unit that compares the character string acquired by the speech recognition unit with each of the one or more similar character strings acquired by the similar character string acquisition unit and determines whether a character string matching the one acquired by the speech recognition unit exists among those similar character strings, and in which the candidate character string transmission unit does not transmit the candidate character strings when such a match exists.

With this configuration, processing can be fast when the speech recognition result is correct.
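The control unit's comparison can be sketched as follows. This is a minimal illustration, assuming that "match" means exact string equality and that candidates are withheld when a match is found; the function name `should_send_candidates` is hypothetical and does not come from the patent.

```python
from typing import List


def should_send_candidates(recognized: str, similar_strings: List[str]) -> bool:
    """Decide whether candidate strings need to be sent to the terminal.

    Assumption: if the recognized string is identical to one of the
    retrieved similar strings, the recognition result is treated as
    correct and the user is not asked to choose.
    """
    return recognized not in similar_strings


# Hypothetical strings, for illustration only.
print(should_send_candidates("kyoto station", ["kyoto station", "tokyo station"]))
print(should_send_candidates("kyodo station", ["kyoto station", "tokyo station"]))
```

Skipping the round trip to the terminal in the matching case is what makes the faster path possible.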

The speech translation device of the fifth invention comprises: a named entity information storage unit capable of storing two or more pieces of named entity information, each having a phoneme string and a character string; a speech reception unit that receives speech; a speech recognition unit that performs speech recognition on the speech received by the speech reception unit and acquires a phoneme string; a similar phoneme string acquisition unit that acquires, from the named entity information storage unit, a phoneme string similar to the phoneme string acquired by the speech recognition unit; a similar character string acquisition unit that acquires a similar character string, which is the character string corresponding to the phoneme string acquired by the similar phoneme string acquisition unit; a machine translation unit that translates the similar character string acquired by the similar character string acquisition unit and acquires a translation result; a speech synthesis unit that performs speech synthesis on the translation result acquired by the machine translation unit and acquires a speech synthesis result; and a synthesized speech output unit that outputs speech using the speech synthesis result.

The speech translation device of the sixth invention is the device of the fifth invention in which: the speech recognition unit performs speech recognition using speech-related information and acquires one or more phoneme strings and one or more speech recognition character strings, that is, one or more character strings that are speech recognition results; the similar phoneme string acquisition unit acquires, from the named entity information storage unit, one or more phoneme strings similar to the phoneme string acquired by the speech recognition unit; and the similar character string acquisition unit acquires one or more similar character strings, which are the one or more character strings corresponding to the one or more phoneme strings acquired by the similar phoneme string acquisition unit. The device further comprises: a candidate character string output unit that outputs two or more candidate character strings, namely the one or more speech recognition character strings acquired by the speech recognition unit and the one or more similar character strings acquired by the similar character string acquisition unit; and an instruction reception unit that receives an instruction selecting one candidate character string from among the two or more candidate character strings output by the candidate character string output unit. The machine translation unit translates the speech recognition character string or similar character string corresponding to candidate character string specifying information, which specifies the candidate character string corresponding to the instruction received by the instruction reception unit, and acquires a translation result.

The speech translation device of the seventh invention is the device of the fifth invention in which: the similar phoneme string acquisition unit acquires, from the named entity information storage unit, two or more phoneme strings similar to the phoneme string acquired by the speech recognition unit; and the similar character string acquisition unit acquires two or more similar character strings corresponding to the two or more phoneme strings acquired by the similar phoneme string acquisition unit. The device further comprises: a candidate character string output unit that outputs two or more candidate character strings, namely the two or more similar character strings acquired by the similar character string acquisition unit; and an instruction reception unit that receives an instruction selecting one candidate character string from among the two or more candidate character strings output by the candidate character string output unit. The machine translation unit translates the speech recognition character string or similar character string corresponding to candidate character string specifying information, which specifies the candidate character string corresponding to the instruction received by the instruction reception unit, and acquires a translation result.

The speech translation device of the eighth invention is the device of the sixth or seventh invention further comprising a control unit that compares the character string acquired by the speech recognition unit with each of the one or more similar character strings acquired by the similar character string acquisition unit and determines whether a character string matching the one acquired by the speech recognition unit exists among those similar character strings, and in which the candidate character string output unit does not output the candidate character strings when such a match exists.

According to the speech translation system of the present invention, a good translation result can be obtained even when the speech recognition result contains an error.

Conceptual diagram of the speech translation system 1 in Embodiment 1
Block diagram showing the internal structure of the speech translation system 1
Flowchart explaining the operation of the terminal device 11
Flowchart explaining the operation of the server device 12
Flowchart explaining the similar phoneme string acquisition process
Diagram showing the named entity management table
Diagram showing an output example of the candidate character strings
Block diagram of the speech translation device 2 in Embodiment 2
Flowchart explaining the operation of the speech translation device 2
External view of the computer system in the above embodiments
Block diagram of the computer system

Hereinafter, embodiments of a speech translation system and the like will be described with reference to the drawings. Components given the same reference numerals in the embodiments perform the same operations, so repeated description of them may be omitted.

(Embodiment 1)

This embodiment describes a speech translation system that acquires a phoneme string by speech recognition processing of input speech, uses the phoneme string to search a corpus for a similar sentence, translates the similar sentence, and outputs the translation as synthesized speech. It also describes a speech translation system that outputs one or more speech recognition results and one or more retrieved similar sentences, receives an instruction from the user, translates the indicated sentence, and outputs the translation as synthesized speech. It further describes a speech translation system that outputs two or more retrieved similar sentences, receives an instruction from the user, translates the indicated sentence, and outputs the translation as synthesized speech.

FIG. 1 is a conceptual diagram of the speech translation system 1 in this embodiment. The speech translation system 1 comprises one or more terminal devices 11 and a server device 12, which can communicate with each other over a network 13. The terminal device 11 is, for example, a personal computer, a portable terminal, a mobile phone, or a smartphone, but it may take any form as long as it can input and output speech. The network 13 may be the Internet, a telephone line, a dedicated line, or any other network.

FIG. 2 is a block diagram showing the internal structure of the speech translation system 1 in this embodiment.
The terminal device 11 comprises a speech reception unit 111, a speech-related information acquisition unit 112, a speech-related information transmission unit 113, a candidate character string reception unit 114, a candidate character string output unit 115, an instruction reception unit 116, a candidate character string specifying information transmission unit 117, a speech synthesis result reception unit 118, and a synthesized speech output unit 119.

The server device 12 comprises a named entity information storage unit 120, a speech-related information reception unit 121, a speech recognition unit 122, a similar phoneme string acquisition unit 123, a similar character string acquisition unit 124, a candidate character string transmission unit 125, a candidate character string specifying information reception unit 126, a machine translation unit 127, a speech synthesis unit 128, a speech synthesis result transmission unit 129, and a control unit 130.

The speech reception unit 111 normally receives speech from a user. Here, reception is a concept that includes receiving speech transmitted over a wired or wireless communication line and accepting speech read from a recording medium such as an optical disk, a magnetic disk, or a semiconductor memory.

The speech-related information acquisition unit 112 acquires speech-related information, which is either one or more feature quantities of the speech received by the speech reception unit 111 or that speech itself. In other words, the speech-related information acquisition unit 112 may or may not have a function for extracting one or more feature quantities from speech. Techniques for obtaining feature quantities from speech are known. Here, a feature quantity is a feature quantity of speech. The one or more feature quantities are, for example, MFCCs obtained by applying a discrete cosine transform to the outputs of a 24-channel filter bank of triangular filters: 12 dimensions each of static, delta, and delta-delta parameters, plus normalized power, delta power, and delta-delta power, for 39 dimensions in total. The content of the feature quantities is not limited to this, however.
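The 39-dimension count (12 + 12 + 12 + 3) can be checked with a small sketch. It assumes the 12 static MFCCs and the normalized power per frame are already available, and it uses a simple central difference for the delta computation; the patent does not say how deltas are computed, so that choice and all names here are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def delta(frames: List[Vector]) -> List[Vector]:
    """Central-difference delta: (x[t+1] - x[t-1]) / 2, repeating edge frames.

    Assumption: real front ends often use a wider regression window.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def assemble_features(mfcc: List[Vector], power: List[float]) -> List[Vector]:
    """Stack static, delta, and delta-delta MFCCs with the three power terms."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    p = [[x] for x in power]
    p1 = delta(p)
    p2 = delta(p1)
    return [m + a + b + q + q1 + q2
            for m, a, b, q, q1, q2 in zip(mfcc, d1, d2, p, p1, p2)]


# Dummy input: 5 frames of 12 MFCCs and one normalized power value per frame.
mfcc = [[float(t + k) for k in range(12)] for t in range(5)]
power = [0.1 * t for t in range(5)]
features = assemble_features(mfcc, power)
print(len(features), len(features[0]))  # number of frames, dimensions per frame
```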

The speech-related information transmission unit 113 transmits the speech-related information acquired by the speech-related information acquisition unit 112 to the server device 12.

The candidate character string reception unit 114 receives two or more candidate character strings from the server device 12 in response to the transmission of the speech-related information.

The candidate character string output unit 115 outputs the two or more candidate character strings received by the candidate character string reception unit 114. Here, output is a concept that includes display on a screen, projection with a projector, printing on a printer, sound output, transmission to an external device, storage on a recording medium, and delivery of processing results to another processing device or another program.

The instruction reception unit 116 receives an instruction selecting one candidate character string from among the two or more candidate character strings output by the candidate character string output unit 115. Any input means may be used for the instruction, such as a numeric keypad, a keyboard, a mouse, or a menu screen.

The candidate character string specifying information transmission unit 117 transmits to the server device 12 candidate character string specifying information specifying the candidate character string corresponding to the instruction received by the instruction reception unit 116. The candidate character string specifying information may be any information that specifies a candidate character string, for example the ID of the candidate character string or the candidate character string itself.

The speech synthesis result reception unit 118 receives the speech synthesis result from the server device 12. Here, the speech synthesis result may be speech data, data from which speech is synthesized, data immediately before speech output, or the like.

The synthesized speech output unit 119 outputs speech using the speech synthesis result received by the speech synthesis result reception unit 118. When the speech synthesis result is speech, outputting speech using it simply means outputting that speech. When the speech synthesis result is data from which speech is synthesized, it means synthesizing speech from that data and outputting it.

The named entity information storage unit 120 of the server device 12 can store two or more pieces of named entity information, each having a phoneme string and a character string. Here, the phoneme string is the phoneme string corresponding to the character string, that is, the sequence of phonemes produced when the character string is pronounced. The character string is a string of characters constituting a sentence, phrase, or word in the source language.
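One way to picture the stored records is as pairs of a phoneme sequence and its character string. The sketch below is an assumption about representation only: the class name `NamedEntityInfo`, the in-memory list, the romanized phoneme symbols, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class NamedEntityInfo:
    """One stored record: a phoneme sequence and the string it pronounces."""
    phonemes: Tuple[str, ...]
    text: str


# Hypothetical entries; the phoneme symbols are illustrative, not from the patent.
store: List[NamedEntityInfo] = [
    NamedEntityInfo(("k", "y", "o", "o", "t", "o"), "京都"),
    NamedEntityInfo(("t", "o", "o", "k", "y", "o", "o"), "東京"),
]


def texts_for(phonemes: Sequence[str]) -> List[str]:
    """Return the character strings whose stored phoneme sequence equals the input."""
    key = tuple(phonemes)
    return [entry.text for entry in store if entry.phonemes == key]


print(texts_for(["k", "y", "o", "o", "t", "o"]))
```

Looking up the character string from a retrieved phoneme sequence, as `texts_for` does, corresponds to the role of the similar character string acquisition unit.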

The speech-related information reception unit 121 receives the speech-related information from the terminal device 11.

The speech recognition unit 122 performs speech recognition using the speech-related information received by the speech-related information reception unit 121 and acquires a phoneme string. Since speech recognition is a known technique, a detailed description is omitted. The speech recognition unit 122 may also perform speech recognition using the received speech-related information and acquire one or more phoneme strings and one or more speech recognition character strings, that is, one or more character strings that are speech recognition results. The speech recognition unit 122 may perform speech recognition from one or more feature quantities, or may extract one or more feature quantities from speech supplied as the speech-related information and perform speech recognition from them. A phoneme string normally consists of two or more phonemes.

The similar phoneme string acquisition unit 123 acquires, from the named entity information storage unit 120, a phoneme string similar to the phoneme string acquired by the speech recognition unit 122. Specifically, it calculates the similarity (score) between the phoneme string acquired by the speech recognition unit 122 and each of the phoneme strings of the two or more pieces of named entity information stored in the named entity information storage unit 120, and acquires from the named entity information storage unit 120 the one or more phoneme strings whose similarity satisfies a predetermined condition. Such a phoneme string is, for example, the phoneme string with the highest similarity, a phoneme string whose similarity is at or above (or strictly above) a threshold, or one of the top n phoneme strings (n being an integer of 1 or more) when the phoneme strings are sorted in descending order of similarity.
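The three example conditions (highest score, threshold, top n) can be sketched as small selection functions. The sketch assumes the similarity scores have already been computed; the function names and the sample scores are hypothetical.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (stored phoneme string, similarity score)


def select_best(scored: List[Scored]) -> List[Scored]:
    """Keep only the entry with the highest similarity."""
    return [max(scored, key=lambda s: s[1])] if scored else []


def select_above(scored: List[Scored], threshold: float,
                 inclusive: bool = True) -> List[Scored]:
    """Keep entries at or above the threshold (strictly above if not inclusive)."""
    if inclusive:
        return [s for s in scored if s[1] >= threshold]
    return [s for s in scored if s[1] > threshold]


def select_top_n(scored: List[Scored], n: int) -> List[Scored]:
    """Sort by similarity in descending order and keep the first n entries."""
    return sorted(scored, key=lambda s: s[1], reverse=True)[:n]


# Hypothetical scores for three stored phoneme strings.
scored = [("t o o k y o o", 0.6), ("k y o o t o", 0.9), ("k y u u sh u u", 0.2)]
print(select_best(scored))
print(select_above(scored, 0.6))
print(select_top_n(scored, 2))
```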

The algorithm for calculating the similarity between two phoneme strings is, for example, BLEU or Word Error Rate (WER). It may also be, for example, "similarity = number of matching phonemes / total number of phonemes in the longer string." In other words, any algorithm for judging similarity may be used. BLEU and WER are representative evaluation metrics for machine translation results. They are normally used to score natural language sentences and words that are machine translation results, but the speech translation system 1 applies them to phoneme strings.
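The simple ratio can be sketched as follows. The text does not define what counts as a "matching" phoneme, so this sketch assumes a position-independent multiset overlap; a position-wise or alignment-based count would be an equally plausible reading. The function name is hypothetical.

```python
from collections import Counter
from typing import Sequence


def ratio_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """similarity = matching phonemes / phonemes in the longer string.

    Assumption: "matching" is counted as the multiset intersection of
    the two phoneme sequences, ignoring order.
    """
    if not a or not b:
        return 0.0
    matching = sum((Counter(a) & Counter(b)).values())
    return matching / max(len(a), len(b))


print(ratio_similarity("k y o o t o".split(), "k y o o t o".split()))
print(ratio_similarity("k y o o t o".split(), "k y u u sh u u".split()))
```

Because order is ignored under this assumption, anagram-like phoneme strings score as identical, which is one reason an n-gram or edit-distance measure may be preferred.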

When BLEU is used, the similarity is calculated by Formula 1.

In Formula 1, p_n is the proportion of n-grams in the phoneme string acquired by the speech recognition unit 122 that match n-grams of the phoneme string in the specific expression information storage unit 120. Further, r is the length of the phoneme string acquired by the speech recognition unit 122, and c is the length of the phoneme string held by the specific expression information in the specific expression information storage unit 120. In the experiment described later, N was 4 and w_n was 1/N.
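Formula 1 itself is not reproduced in this text. For reference, the standard BLEU definition uses exactly the symbols defined above (p_n, w_n, N, r, c) and reads as follows; that the patent's Formula 1 matches it in every detail is an assumption.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

With N = 4 and w_n = 1/N, the exponential term is the geometric mean of the 1-gram to 4-gram match ratios, and BP is the brevity penalty applied when the stored phoneme string is not longer than the recognized one.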

When WER is used, the similarity is calculated by Formula 2.

In Formula 2, I is the number of inserted words, D is the number of deleted words, S is the number of substituted words, and N is the number of words in the reference translation. Here, "word" is to be read as "phoneme". That is, in Formula 2, the number of inserted words is the number of inserted phonemes, the number of deleted words is the number of deleted phonemes, the number of substituted words is the number of substituted phonemes, and the number of words in the reference translation is the number of phonemes in the reference.
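Formula 2 itself is not reproduced in this text. The standard definition that fits the symbols above is WER = (I + D + S) / N. The sketch below computes it over space-separated phonemes using the Levenshtein distance, whose value equals the minimum I + D + S. Treating the stored phoneme string as the reference is an assumption, as are the function names. Note that WER is an error rate (lower means more similar); the text does not state how it is converted into a similarity.

```python
from typing import Sequence


def levenshtein(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions (I + D + S)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_wer(reference: str, hypothesis: str) -> float:
    """WER = (I + D + S) / N, with N the number of phonemes in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one phoneme")
    return levenshtein(ref, hyp) / len(ref)
```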

The similar phoneme string acquisition unit 123 may acquire one or more phoneme strings similar to the phoneme string acquired by the speech recognition unit 122 from the specific expression information storage unit 120, or it may acquire two or more such phoneme strings.

The similar character string acquisition unit 124 acquires, from the specific expression information storage unit 120, a similar character string, which is the character string corresponding to a phoneme string acquired by the similar phoneme string acquisition unit 123. It may acquire one or more similar character strings corresponding to one or more phoneme strings acquired by the similar phoneme string acquisition unit 123, or two or more similar character strings corresponding to two or more such phoneme strings.

The candidate character string transmission unit 125 transmits two or more candidate character strings to the terminal device 11. The two or more candidate character strings may consist of one or more speech recognition character strings acquired by the speech recognition unit 122 together with one or more similar character strings acquired by the similar character string acquisition unit 124, or of two or more similar character strings acquired by the similar character string acquisition unit 124. That is, the candidate character strings usually include a speech recognition character string, but need not.

The candidate character string specifying information receiving unit 126 receives from the terminal device 11, in response to the transmission of the two or more candidate character strings, candidate character string specifying information, which is information that specifies one candidate character string. The candidate character string specifying information may be an identifier of the candidate character string or the candidate character string itself.

The machine translation unit 127 translates the similar character string acquired by the similar character string acquisition unit 124 and acquires a translation result. The machine translation unit 127 may translate the speech recognition character string or the similar character string corresponding to the candidate character string specifying information, or it may translate only a similar character string corresponding to that information. The machine translation unit 127 can be implemented with known technology.

The speech synthesis unit 128 performs speech synthesis on the translation result acquired by the machine translation unit 127 and acquires a speech synthesis result. The speech synthesis result is, for example, audio data; however, it may also be the data from which speech is to be synthesized. The speech synthesis unit 128 can be implemented with known technology.

The speech synthesis result transmission unit 129 transmits the speech synthesis result acquired by the speech synthesis unit 128 to the terminal device 11.

The control unit 130 compares the character string acquired by the speech recognition unit 122 with each of the one or more similar character strings acquired by the similar character string acquisition unit 124, and determines whether a character string that matches the one acquired by the speech recognition unit 122 (including one that approximately matches) exists among those similar character strings. If a matching character string exists, the candidate character string transmission unit 125 does not transmit candidate character strings, and the machine translation unit 127 machine-translates the character string acquired by the speech recognition unit 122. Alternatively, the control unit 130 may compare the phoneme string acquired by the speech recognition unit 122 with each of the one or more similar phoneme strings acquired by the similar phoneme string acquisition unit 123, and determine whether a phoneme string that matches the one acquired by the speech recognition unit 122 (including one that approximately matches) exists among them. The comparison of phoneme strings is regarded as equivalent to the comparison of character strings.

The speech reception unit 111 can be realized by, for example, a microphone and its driver software.

The speech-related information acquisition unit 112, the speech recognition unit 122, the similar phoneme string acquisition unit 123, the similar character string acquisition unit 124, the machine translation unit 127, and the speech synthesis unit 128 can usually be realized by an MPU, memory, and the like. The processing procedures of the speech-related information acquisition unit 112 and the other units are usually implemented in software, and that software is recorded on a recording medium such as a ROM. However, they may also be implemented in hardware (dedicated circuits).

The speech-related information transmission unit 113, the candidate character string specifying information transmission unit 117, the candidate character string transmission unit 125, and the speech synthesis result transmission unit 129 are usually realized by wireless or wired communication means, but may be realized by broadcasting means.

The candidate character string receiving unit 114, the speech synthesis result receiving unit 118, the speech-related information receiving unit 121, and the candidate character string specifying information receiving unit 126 are usually realized by wireless or wired communication means, but may be realized by means for receiving broadcasts.

The candidate character string output unit 115 may or may not be considered to include an output device such as a display or a speaker. It can be realized by the driver software of an output device, or by that driver software together with the output device.

The instruction receiving unit 116 can be realized by a device driver for input means such as a numeric keypad or a keyboard, by control software for a menu screen, or the like.

The synthesized speech output unit 119 may or may not be considered to include an output device such as a speaker. It can be realized by the driver software of an output device, or by that driver software together with the output device.

The specific expression information storage unit 120 is preferably a non-volatile recording medium, but can also be realized by a volatile recording medium. The process by which specific expression information comes to be stored in the specific expression information storage unit 120 does not matter. For example, the specific expression information may be stored in the specific expression information storage unit 120 via a recording medium, specific expression information transmitted via a communication line or the like may be stored there, or specific expression information entered via an input device may be stored there.

Next, the operation of the speech translation system 1 will be described. First, the operation of the terminal device 11 will be described with reference to the flowchart of FIG. 3.

(Step S301) The speech reception unit 111 determines whether speech has been received. If speech has been received, the process proceeds to step S302; otherwise, it returns to step S301.

(Step S302) The speech-related information acquisition unit 112 acquires speech-related information, which is one or more feature values of the speech received in step S301.

(Step S303) The speech-related information transmission unit 113 transmits the speech-related information acquired in step S302 to the server device 12.

(Step S304) The candidate character string receiving unit 114 determines whether two or more candidate character strings have been received from the server device 12. If they have been received, the process proceeds to step S305; otherwise, it proceeds to step S309.

(Step S305) The candidate character string output unit 115 outputs the two or more candidate character strings received in step S304.

(Step S306) The instruction receiving unit 116 determines whether an instruction (given by the user) selecting one candidate character string from the two or more candidate character strings output in step S305 has been received. If the instruction has been received, the process proceeds to step S307; otherwise, it returns to step S306.

(Step S307) The candidate character string specifying information transmission unit 117 transmits, to the server device 12, candidate character string specifying information that specifies the candidate character string corresponding to the instruction received by the instruction receiving unit 116.

(Step S308) The speech synthesis result receiving unit 118 determines whether a speech synthesis result has been received from the server device 12 in response to the transmission of the candidate character string specifying information in step S307. If a speech synthesis result has been received, the process proceeds to step S310; otherwise, it returns to step S308.

(Step S309) The speech synthesis result receiving unit 118 determines whether a speech synthesis result has been received from the server device 12. If a speech synthesis result has been received, the process proceeds to step S310; otherwise, it returns to step S304.

(Step S310) The synthesized speech output unit 119 outputs speech using the speech synthesis result received in step S308 or step S309, and the process returns to step S301.

In the flowchart of FIG. 3, the process ends when the power is turned off or when an interrupt requesting the end of processing occurs.
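One pass through the terminal-side flow of FIG. 3 (steps S303 to S309) can be sketched as below. The polling loops of the flowchart are collapsed into blocking calls, and the callable parameters (`send_speech_info`, `send_choice`, `ask_user`) as well as the `("candidates" | "synthesis", payload)` reply shape are hypothetical stand-ins for the communication and user-interface units; none of them are defined in the specification.

```python
from typing import Callable, List, Sequence, Tuple, Union

# Hypothetical reply from the server: either candidate strings or synthesized audio.
Reply = Tuple[str, Union[List[str], bytes]]


def terminal_cycle(speech_info: Sequence[float],
                   send_speech_info: Callable[[Sequence[float]], Reply],
                   send_choice: Callable[[str], bytes],
                   ask_user: Callable[[List[str]], str]) -> bytes:
    """Return the speech synthesis result that step S310 would play back."""
    # S303: send the features, then wait for whichever reply arrives (S304 / S309).
    kind, payload = send_speech_info(speech_info)
    if kind == "candidates":
        # S305-S306: show the candidates and wait for the user's selection.
        specifying_info = ask_user(payload)
        # S307-S308: send the selection and wait for the synthesis result.
        return send_choice(specifying_info)
    # S309: the server skipped the candidate step and returned audio directly.
    return payload
```

The second branch corresponds to the case in which the server finds a stored character string matching the recognition result and therefore sends no candidates.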

Next, the operation of the server device 12 will be described with reference to the flowchart of FIG. 4.

(Step S401) The speech-related information receiving unit 121 determines whether speech-related information has been received from the terminal device 11. If it has been received, the process proceeds to step S402; otherwise, it returns to step S401.

(Step S402) The speech recognition unit 122 performs speech recognition processing using the speech-related information received in step S401, and acquires one or more phoneme strings and one or more speech recognition character strings. A speech recognition character string is a speech recognition result.

(Step S403) The similar phoneme string acquisition unit 123 acquires, from the specific expression information storage unit 120, one or more phoneme strings similar to the phoneme string acquired in step S402. This processing is called similar phoneme string acquisition processing and is described with reference to the flowchart of FIG. 5.

(Step S404) The similar character string acquisition unit 124 acquires, from the specific expression information storage unit 120, one or more similar character strings corresponding to the one or more phoneme strings acquired in step S403.

(Step S405) The control unit 130 compares the character string acquired by the speech recognition unit 122 in step S402 with each of the one or more similar character strings acquired by the similar character string acquisition unit 124 in step S404, and determines whether a similar character string matching the character string acquired by the speech recognition unit 122 exists. If such a similar character string exists, the process proceeds to step S409; otherwise, it proceeds to step S406.

(Step S406) The candidate character string transmission unit 125 constructs two or more candidate character strings from the character string acquired in step S402 and the one or more similar character strings acquired in step S404.

(Step S407) The candidate character string transmission unit 125 transmits the two or more candidate character strings constructed in step S406 to the terminal device 11.

(Step S408) The candidate character string specifying information receiving unit 126 determines whether candidate character string specifying information, which specifies one candidate character string, has been received from the terminal device 11 in response to the transmission of the two or more candidate character strings in step S407. If it has been received, the process proceeds to step S409; otherwise, it returns to step S408.

(Step S409) When the transition is from step S408, the machine translation unit 127 acquires the speech recognition character string or similar character string corresponding to the candidate character string specifying information. When the transition is from step S405, the machine translation unit 127 acquires the character string acquired by the speech recognition unit 122.

(Step S410) The machine translation unit 127 translates the character string acquired in step S409 and acquires a translation result.

(Step S411) The speech synthesis unit 128 performs speech synthesis on the translation result acquired in step S410 and acquires a speech synthesis result.

(Step S412) The speech synthesis result transmission unit 129 transmits the speech synthesis result acquired in step S411 to the terminal device 11, and the process returns to step S401.

In the flowchart of FIG. 4, the process ends when the power is turned off or when an interrupt requesting the end of processing occurs.
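The server-side flow of FIG. 4 for a single request can be sketched as follows. Each processing unit is passed in as a callable; these parameter names are hypothetical and merely stand in for units 122 to 128 and the exchange with the terminal. The sketch assumes one phoneme string and one speech recognition character string per request and checks only for an exact match in step S405; the "approximate match" mentioned in the text is left out because its criterion is not specified.

```python
from typing import Callable, List, Sequence, Tuple


def server_cycle(speech_info: Sequence[float],
                 recognize: Callable[[Sequence[float]], Tuple[str, str]],
                 find_similar: Callable[[str], List[str]],
                 lookup_text: Callable[[str], str],
                 translate: Callable[[str], str],
                 synthesize: Callable[[str], bytes],
                 ask_terminal: Callable[[List[str]], int]) -> bytes:
    """Return the speech synthesis result sent back in step S412."""
    phonemes, recognized = recognize(speech_info)               # S402
    similar_phonemes = find_similar(phonemes)                   # S403
    similar_texts = [lookup_text(p) for p in similar_phonemes]  # S404
    if recognized in similar_texts:                             # S405: match found
        source = recognized                                     # S409 (from S405)
    else:
        candidates = similar_texts + [recognized]               # S406
        chosen_index = ask_terminal(candidates)                 # S407-S408
        source = candidates[chosen_index]                       # S409 (from S408)
    return synthesize(translate(source))                        # S410-S411
```

Placing the similar character strings ahead of the recognition result mirrors the ordering of the candidate list in the worked example; any other ordering would serve equally well.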

Next, the similar phoneme string acquisition processing of step S403 will be described with reference to the flowchart of FIG. 5.

(Step S501) The similar phoneme string acquisition unit 123 acquires the phoneme string (referred to as the first phoneme string) obtained by the speech recognition processing in step S402.

(Step S502) The similar phoneme string acquisition unit 123 assigns 1 to a counter i.

(Step S503) The similar phoneme string acquisition unit 123 determines whether an i-th phoneme string (referred to as a second phoneme string) exists in the specific expression information storage unit 120. If the i-th second phoneme string exists, the process proceeds to step S504; otherwise, it proceeds to step S507.

(Step S504) The similar phoneme string acquisition unit 123 calculates the similarity between the first phoneme string acquired in step S501 and the i-th second phoneme string.

(Step S505) The similar phoneme string acquisition unit 123 temporarily stores the similarity calculated in step S504 on a recording medium (not shown) in association with the i-th second phoneme string.

(Step S506) The similar phoneme string acquisition unit 123 increments the counter i by 1, and the process returns to step S503.

(Step S507) The similar phoneme string acquisition unit 123 sorts the second phoneme strings using the similarities temporarily stored in step S505 as the key.

(Step S508) The similar phoneme string acquisition unit 123 acquires, from the specific expression information storage unit 120, the one or more phoneme strings (similar phoneme strings) whose similarity satisfies the predetermined condition, and returns to the calling process.
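The loop of FIG. 5 can be sketched as below. The similarity function is passed in as a parameter, since the text allows any algorithm; the function name, the 0-based counter, the descending sort order, and the use of "top n" as the predetermined condition are assumptions made for this sketch.

```python
from typing import Callable, List, Sequence, Tuple


def acquire_similar_phoneme_strings(first: str,
                                    stored: Sequence[str],
                                    similarity: Callable[[str, str], float],
                                    top_n: int = 1) -> List[str]:
    """Score every stored phoneme string against `first` and keep the best ones."""
    scored: List[Tuple[float, str]] = []
    i = 0                                    # S502 (the flowchart counts from 1)
    while i < len(stored):                   # S503: does an i-th string exist?
        second = stored[i]
        score = similarity(first, second)    # S504
        scored.append((score, second))       # S505: keep score with its string
        i += 1                               # S506
    # S507: sort by similarity, highest first (sort order is an assumption).
    scored.sort(key=lambda item: item[0], reverse=True)
    # S508: apply the predetermined condition (here: the top n strings).
    return [second for _, second in scored[:top_n]]
```

With `top_n=1` this corresponds to the "maximum similarity" condition used in the experiment, except that ties are resolved by storage order.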

Hereinafter, a specific operation of the speech translation system 1 in the present embodiment (an experiment that was carried out) will be described.

In this experiment, the specific expression information storage unit 120 of the server device 12 holds the specific expression management table shown in FIG. 6. The specific expression management table stores one or more records (specific expression information), each having an "ID", a "character string", and a "phoneme string". Here, there are 5,095 pieces of specific expression information. The phoneme strings of the specific expression information are written using the phoneme symbol scheme called "Ximera". The similarity algorithm used by the similar phoneme string acquisition unit 123 is BLEU (Formula 1), and the predetermined condition it uses is "the phoneme string with the maximum similarity".

In the experiment, the user input 300 sentences by voice. The specific operation of the speech translation system 1 is described below using two examples.

Suppose, for example, that the user says "雑誌売り場はどこですか" ("Where is the magazine counter?") to the terminal device 11. The speech reception unit 111 of the terminal device 11 receives the speech. The speech-related information acquisition unit 112 then acquires speech-related information, which is one or more feature values of the received speech, and the speech-related information transmission unit 113 transmits the acquired speech-related information to the server device 12.

Next, the speech-related information receiving unit 121 of the server device 12 receives the speech-related information, which is one or more feature values, from the terminal device 11.

Next, the speech recognition unit 122 performs speech recognition processing using the received speech-related information. The speech recognition unit 122 acquires the phoneme string "z a ng sh i ng u r i b a w a d o k o d e s u k a" and the speech recognition character string "斬新売り場はどこですか" ("Where is the novelty counter?"), a misrecognition of the utterance.

Next, the similar phoneme string acquisition unit 123 searches, using BLEU, for one or more phoneme strings similar to the acquired phoneme string "z a ng sh i ng u r i b a w a d o k o d e s u k a". It acquires the similar phoneme string "z a q sh i u r i b a w a d o k o d e s u k a" from the specific expression management table shown in FIG. 6.

Next, the similar character string acquisition unit 124 acquires, from the specific expression management table, the similar character string "雑誌売り場はどこですか" corresponding to the acquired phoneme string "z a q sh i u r i b a w a d o k o d e s u k a".
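The lookup from a similar phoneme string to its character string can be sketched with a small in-memory version of the specific expression management table. The two character-string and phoneme-string pairs are taken from the examples in this description; the ID values, the class name, and the function name are hypothetical, since the contents of FIG. 6 are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpecificExpression:
    """One record of the management table: ID, character string, phoneme string."""
    id: int          # hypothetical ID values; FIG. 6 is not reproduced in the text
    text: str
    phonemes: str


TABLE: List[SpecificExpression] = [
    SpecificExpression(1, "雑誌売り場はどこですか",
                       "z a q sh i u r i b a w a d o k o d e s u k a"),
    SpecificExpression(2, "フロントは内線九番です",
                       "f u r o ng t o w a n a i s e ng k j u u b a ng d e s u"),
]


def text_for_phonemes(table: List[SpecificExpression], phonemes: str) -> Optional[str]:
    """Return the character string stored with the given phoneme string, if any."""
    for record in table:
        if record.phonemes == phonemes:
            return record.text
    return None
```

A linear scan is enough for illustration; with the 5,095 records of the experiment, a dictionary keyed by phoneme string would avoid the repeated scan.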

Next, the control unit 130 compares the speech recognition character string "斬新売り場はどこですか" ("Where is the novelty counter?") with the similar character string "雑誌売り場はどこですか" ("Where is the magazine counter?"), and determines that the two character strings do not match.

Next, the candidate character string transmission unit 125 constructs two candidate character strings from the speech recognition character string "斬新売り場はどこですか" and the similar character string "雑誌売り場はどこですか". For example, the constructed candidate character strings are "１：雑誌売り場はどこですか，２：斬新売り場はどこですか". Here, "constructing" means putting the strings into the data structure to be transmitted.

Next, the candidate character string transmission unit 125 transmits the two constructed candidate character strings "１：雑誌売り場はどこですか，２：斬新売り場はどこですか" to the terminal device 11.

Next, the candidate character string receiving unit 114 of the terminal device 11 receives the two candidate character strings "１：雑誌売り場はどこですか，２：斬新売り場はどこですか" from the server device 12.

Next, the candidate character string output unit 115 outputs the received candidate character strings. FIG. 7 shows an output example. Suppose that, as shown in FIG. 7, the user checks the sentence "雑誌売り場はどこですか" and presses the "Send" button.

Next, the instruction receiving unit 116 receives the instruction (given by the user) selecting the candidate character string "雑誌売り場はどこですか" from the two output candidate character strings.

The candidate character string specifying information transmission unit 117 then acquires the candidate character string specifying information "１", which specifies the candidate character string "雑誌売り場はどこですか" corresponding to the received instruction, and transmits it to the server device 12.

Next, the candidate character string specifying information receiving unit 126 of the server device 12 receives from the terminal device 11, in response to the transmission of the candidate character strings, the candidate character string specifying information "１", which specifies one candidate character string.

Next, the machine translation unit 127 acquires the similar character string "雑誌売り場はどこですか" corresponding to the candidate character string specifying information "１".
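The numbering of candidates and the later resolution of the candidate character string specifying information can be sketched as follows. Representing the candidates as a mapping from an identifier string to the text, using ASCII digits as identifiers, and the function names themselves are assumptions; the text only says that the specifying information may be an identifier or the candidate character string itself.

```python
from typing import Dict, List


def build_candidates(recognized: str, similar: List[str]) -> Dict[str, str]:
    """Number the candidates from 1, similar character strings first."""
    ordered = list(similar) + [recognized]
    return {str(number): text for number, text in enumerate(ordered, start=1)}


def resolve(candidates: Dict[str, str], specifying_info: str) -> str:
    """Accept either an identifier or the candidate character string itself."""
    if specifying_info in candidates:
        return candidates[specifying_info]
    if specifying_info in candidates.values():
        return specifying_info
    raise KeyError(f"unknown candidate: {specifying_info!r}")
```

For the example above, `build_candidates("斬新売り場はどこですか", ["雑誌売り場はどこですか"])` yields the two numbered candidates, and `resolve(..., "1")` returns the magazine-counter sentence that is then handed to machine translation.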

Next, the machine translation unit 127 translates the acquired character string "雑誌売り場はどこですか" and acquires the translation result "Where is the magazine counter?".

Next, the speech synthesis unit 128 performs speech synthesis on the acquired translation result "Where is the magazine counter?" and acquires a speech synthesis result.

The speech synthesis result transmission unit 129 then transmits the acquired speech synthesis result to the terminal device 11.

Next, the speech synthesis result receiving unit 118 receives the speech synthesis result from the server device 12 in response to the transmission of the candidate character string specifying information.

The synthesized speech output unit 119 then outputs speech using the received speech synthesis result.

次に、ユーザが「フロントは内線九番です」と、端末装置１１に対して音声入力した。そして、上記と同様の動作により、サーバ装置１２の音声認識部１２２は、音素列「j o ng t o w a n a i s e ng k j u u b a ng d e s u」と音声認識文字列「夜んとは内線九番で」とを取得する。 Next, the user inputs a voice to the terminal device 11 saying “Front is extension number 9”. Then, by the same operation as described above, the speech recognition unit 122 of the server device 12 acquires the phoneme string “j o ng t o w a n i e s e n k k u u b a ng d e s u” and the voice recognition character string “night is extension number 9”.

そして、次に、類似音素列取得部１２３は、取得された音素列「j o ng t o w a n a i s e ng k j u u b a ng d e s u」に類似する１以上の音素列を、ＢＬＵＥを用いて探索する。そして、類似音素列取得部１２３は、図６に示す固有表現管理表から類似音素列「f u r o ng t o w a n a i s e ng k j u u b a ng d e s u」取得する。 Next, the similar phoneme string acquisition unit 123 searches for one or more phoneme strings that are similar to the acquired phoneme string “j o ng t o w a nai se kng u u b ang d e su” using BLUE. Then, the similar phoneme string acquisition unit 123 acquires the similar phoneme string “f u r o n t o w a n a i s e n k k u u b ang d e s u” from the specific expression management table shown in FIG.

次に、類似文字列取得部１２４は、取得された音素列「f u r o ng t o w a n a i s e ng k j u u b a ng d e s u」に対応する類似文字列「フロントは内線九番です」を、固有表現管理表から取得する。 Next, the similar character string acquisition unit 124 acquires, from the specific expression management table, the similar character string “Front is extension number 9” corresponding to the acquired phoneme string “f u r o ng t o w a n a i s e ng k j u u b a ng d e s u”.

次に、制御部１３０は、音声認識文字列「夜んとは内線九番で」と、類似文字列「フロントは内線九番です」とを比較する。そして、制御部１３０は、両文字列が一致しない、と判断する。 Next, the control unit 130 compares the voice recognition character string “night is extension number 9” with the similar character string “front is extension number 9.” Then, the control unit 130 determines that both character strings do not match.

次に、候補文字列送信部１２５は、音声認識文字列「夜んとは内線九番で」と、類似文字列「フロントは内線九番です」とを用いて、２つの候補文字列「１：フロントは内線九番です，２：夜んとは内線九番で」を構成する。 Next, the candidate character string transmission unit 125 uses the speech recognition character string “night is extension number 9” and the similar character string “Front is extension number 9” to construct two candidate character strings, “1: Front is extension number 9, 2: night is extension number 9”.

次に、候補文字列送信部１２５は、構成した２つの候補文字列「１：フロントは内線九番です，２：夜んとは内線九番で」を端末装置１１に送信する。 Next, the candidate character string transmission unit 125 transmits the two candidate character strings “1: Front is extension number 9 and 2: Night is extension number 9” to the terminal device 11.
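The comparison by the control unit 130 and the construction of the numbered candidates can be sketched as follows. This is a minimal sketch, assuming that similar character strings are listed first and that no candidate list is needed when the recognized string already matches one of them; the function name `build_candidates` and the dictionary representation are hypothetical.

```python
def build_candidates(recognized_text, similar_texts):
    """Build numbered candidates, or return None when no choice is needed.

    Assumption: similar strings come first and the recognized string last,
    mirroring the "1: ..., 2: ..." order of the example.
    """
    if recognized_text in similar_texts:
        # The strings match, so there is nothing for the user to choose.
        return None
    ordered = list(similar_texts) + [recognized_text]
    return {str(number): text for number, text in enumerate(ordered, start=1)}


candidates = build_candidates("夜んとは内線九番で", ["フロントは内線九番です"])
no_choice = build_candidates("フロントは内線九番です", ["フロントは内線九番です"])
```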

次に、端末装置１１の候補文字列受信部１１４は、サーバ装置１２から、２つの候補文字列「１：フロントは内線九番です，２：夜んとは内線九番で」を受信する。 Next, the candidate character string receiving unit 114 of the terminal device 11 receives two candidate character strings “1: Front is extension number 9; 2: Night is extension number 9” from the server device 12.

次に、候補文字列出力部１１５は、受信された候補文字列を出力する。 Next, the candidate character string output unit 115 outputs the received candidate character string.

そして、ユーザは、「フロントは内線九番です」の文をチェックし、「送信」ボタンを押下した、とする。 The user checks the sentence “Front is extension number 9” and presses the “Send” button.

次に、指示受付部１１６は、出力された２つの候補文字列の中から、一の候補文字列「フロントは内線九番です」の指示（ユーザによる指示）を受け付ける。 Next, the instruction receiving unit 116 receives an instruction (instruction from the user) of one candidate character string “front is extension number 9” from the two output candidate character strings.

そして、候補文字列特定情報送信部１１７は、受け付けた指示に対応する候補文字列を特定する候補文字列特定情報「１」を取得する。そして、候補文字列特定情報送信部１１７は、候補文字列特定情報「１」をサーバ装置１２に送信する。 Then, the candidate character string specifying information transmitting unit 117 acquires candidate character string specifying information “1” for specifying the candidate character string corresponding to the received instruction. Then, the candidate character string specifying information transmitting unit 117 transmits the candidate character string specifying information “1” to the server device 12.

次に、機械翻訳部１２７は、候補文字列特定情報「１」に対応する類似文字列「フロントは内線九番です」を取得する。 Next, the machine translation unit 127 acquires a similar character string “front is extension number 9” corresponding to the candidate character string specifying information “1”.
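Resolving the candidate character string specifying information back to the text to be translated can be sketched as a simple lookup. This is an illustrative sketch only; the function name `resolve_selection` and the error handling for an unknown identifier are assumptions, not something the description specifies.

```python
def resolve_selection(candidates, specifying_info):
    """Map candidate specifying information (e.g. "1") back to its text."""
    try:
        return candidates[specifying_info]
    except KeyError:
        # Assumption: an identifier that was never sent is treated as an error.
        raise ValueError(f"unknown candidate: {specifying_info!r}") from None


sent_candidates = {"1": "フロントは内線九番です", "2": "夜んとは内線九番で"}
text_to_translate = resolve_selection(sent_candidates, "1")
```

Sending only the identifier keeps the terminal-to-server message small, since the server already holds the candidate strings it sent.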

次に、機械翻訳部１２７は、取得した文字列「フロントは内線九番です」を翻訳し、翻訳結果「Extension because of the connection to the reception desk is the ninth.」を取得する。 Next, the machine translation unit 127 translates the acquired character string “front is extension number 9” and acquires a translation result “Extension because of the connection to the reception desk is the ninth.”.

次に、音声合成部１２８は、取得された翻訳結果「Extension because of the connection to the reception desk is the ninth.」を音声合成し、音声合成結果を取得する。 Next, the speech synthesis unit 128 performs speech synthesis on the acquired translation result “Extension because of the connection to the reception desk is the ninth.” and acquires a speech synthesis result.

以上の実験において、音声認識の段階において、認識が成功した数は２３５で、失敗した数は６５となった。そして、認識失敗した文をさらに類似文検索した結果、一番スコア（類似度）が良かったものが意図した文（検索成功）であった数は５３で、意図しなかった文（検索失敗）であった数は１２であった。 In the above experiment, speech recognition succeeded for 235 sentences and failed for 65. When a similar-sentence search was then run on the 65 misrecognized sentences, the top-scoring (most similar) result was the intended sentence in 53 cases (search success) and an unintended sentence in 12 cases (search failure).

つまり、「認識成功：２３５（７８．３％）、認識失敗：６５（２１．７％）」、「検索成功：５３、検索失敗：１２」であった。 That is, “recognition success: 235 (78.3%), recognition failure: 65 (21.7%)”, “search success: 53, search failure: 12”.

つまり、認識成功文と検索成功文とを同時に提示してユーザに選択させることにより、発話した文が意図どおりに機械翻訳部１２７に渡る数は２３５＋５３＝２８８（９６．０％）になる。以上より、本実験において、音声翻訳としての精度を大幅に上げることができたことが分かる。 That is, by presenting the recognition success sentence and the search success sentence at the same time and allowing the user to select, the number of spoken sentences passed to the machine translation unit 127 as intended becomes 235 + 53 = 288 (96.0%). From the above, it can be seen that in this experiment, the accuracy of speech translation could be greatly improved.
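The percentages quoted above can be re-derived from the raw counts with a few lines of arithmetic. The helper name `percent` is hypothetical; the counts are those reported for the experiment, and the total of 300 follows from 235 + 65.

```python
def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


recognition_ok, recognition_ng = 235, 65
search_ok, search_ng = 53, 12

total = recognition_ok + recognition_ng      # 300 utterances in all
as_intended = recognition_ok + search_ok     # 288 reach translation as intended

summary = {
    "recognition success": percent(recognition_ok, total),
    "recognition failure": percent(recognition_ng, total),
    "passed as intended": percent(as_intended, total),
}
```

The similar-sentence search only ever runs on the 65 misrecognized sentences, which is why the search counts sum to 65 rather than to 300.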

以上、本実施の形態によれば、音を表す音素記号列を検索のキーとして、類似文の検索を行うことにより、音声認識結果に誤りがある場合でも、良好な翻訳結果を得ることができる。 As described above, according to the present embodiment, a good translation result can be obtained even when the speech recognition result contains an error, by searching for similar sentences using a phoneme symbol string representing the sound as the search key.

なお、本実施の形態における処理は、ソフトウェアで実現しても良い。そして、このソフトウェアをソフトウェアダウンロード等により配布しても良い。また、このソフトウェアをＣＤ−ＲＯＭなどの記録媒体に記録して流布しても良い。なお、このことは、本明細書における他の実施の形態においても該当する。なお、本実施の形態における端末装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータを、音声を受け付ける音声受付部と、前記音声受付部が受け付けた音声、または前記音声受付部が受け付けた音声に関する１以上の特徴量である音声関連情報を取得する音声関連情報取得部と、前記音声関連情報を前記サーバ装置に送信する音声関連情報送信部と、前記サーバ装置から音声合成結果を受信する音声合成結果受信部と、前記音声合成結果を用いて音声出力する合成音声出力部として機能させるためのプログラム、である。 Note that the processing in the present embodiment may be realized by software. This software may be distributed by software download or the like, or recorded on a recording medium such as a CD-ROM and distributed. The same applies to the other embodiments in this specification. The software that realizes the terminal device in the present embodiment is the following program. That is, this program causes a computer to function as: a speech reception unit that receives speech; a speech-related information acquisition unit that acquires speech-related information, which is either the speech received by the speech reception unit or one or more feature quantities of that speech; a speech-related information transmission unit that transmits the speech-related information to the server device; a speech synthesis result reception unit that receives a speech synthesis result from the server device; and a synthesized speech output unit that outputs speech using the speech synthesis result.

また、上記プログラムにおいて、コンピュータを、前記サーバ装置から、２以上の候補文字列を受信する候補文字列受信部と、前記候補文字列受信部が受信した２以上の候補文字列を出力する候補文字列出力部と、前記候補文字列出力部が出力した２以上の候補文字列の中から、一の候補文字列の指示を受け付ける指示受付部と、前記指示受付部が受け付けた指示に対応する候補文字列を特定する候補文字列特定情報を、前記サーバ装置に送信する候補文字列特定情報送信部としてさらに機能させることは好適である。 In the above program, it is preferable that the computer further functions as: a candidate character string receiving unit that receives two or more candidate character strings from the server device; a candidate character string output unit that outputs the two or more candidate character strings received by the candidate character string receiving unit; an instruction receiving unit that receives an instruction selecting one candidate character string from among the two or more candidate character strings output by the candidate character string output unit; and a candidate character string specifying information transmitting unit that transmits, to the server device, candidate character string specifying information that specifies the candidate character string corresponding to the instruction received by the instruction receiving unit.

また、本実施の形態におけるサーバ装置を実現するソフトウェアは、以下のようなプログラムである。つまり、記憶媒体に、音素列と文字列とを有する２以上の固有表現情報を格納しており、コンピュータを、前記音声関連情報を受信する音声関連情報受信部と、前記音声関連情報を用いて、音声認識し、音素列を取得する音声認識部と、前記音声認識部が取得した音素列に類似する音素列を、前記記憶媒体から取得する類似音素列取得部と、前記類似音素列取得部が取得した音素列に対応する文字列である類似文字列を、前記記憶媒体から取得する類似文字列取得部と、前記類似文字列取得部が取得した類似文字列を翻訳し、翻訳結果を取得する機械翻訳部と、前記機械翻訳部が取得した翻訳結果を音声合成し、音声合成結果を取得する音声合成部と、前記音声合成結果を前記端末装置に送信する音声合成結果送信部として機能させることは好適である。 The software that realizes the server device in the present embodiment is the following program. That is, with two or more pieces of specific expression information, each having a phoneme string and a character string, stored in a storage medium, this program preferably causes a computer to function as: a speech-related information receiving unit that receives the speech-related information; a speech recognition unit that performs speech recognition using the speech-related information and acquires a phoneme string; a similar phoneme string acquisition unit that acquires, from the storage medium, a phoneme string similar to the phoneme string acquired by the speech recognition unit; a similar character string acquisition unit that acquires, from the storage medium, a similar character string, which is the character string corresponding to the phoneme string acquired by the similar phoneme string acquisition unit; a machine translation unit that translates the similar character string acquired by the similar character string acquisition unit and acquires a translation result; a speech synthesis unit that performs speech synthesis on the translation result acquired by the machine translation unit and acquires a speech synthesis result; and a speech synthesis result transmitting unit that transmits the speech synthesis result to the terminal device.
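The chain of server-side units listed above can be sketched as a small pipeline object. This is a hedged sketch, not the actual implementation: the class name `ServerPipeline`, its field names, and every stub passed to it (the recognizer, the similarity search, the translator, and the synthesizer) are hypothetical placeholders that stand in for real components.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ServerPipeline:
    """Chains the server-side units; every callable is a hypothetical stub."""
    recognize: Callable[[bytes], str]      # speech-related info -> phoneme string
    find_similar: Callable[[str], str]     # phoneme string -> similar phoneme string
    table: Dict[str, str]                  # phoneme string -> character string
    translate: Callable[[str], str]        # source text -> translated text
    synthesize: Callable[[str], bytes]     # translated text -> synthesized speech

    def handle(self, speech_related_info: bytes) -> bytes:
        phonemes = self.recognize(speech_related_info)
        similar_phonemes = self.find_similar(phonemes)
        similar_text = self.table[similar_phonemes]
        translation = self.translate(similar_text)
        return self.synthesize(translation)


FRONT = "f u r o ng t o w a n a i s e ng k j u u b a ng d e s u"
pipeline = ServerPipeline(
    recognize=lambda info: "j o ng t o w a n a i s e ng k j u u b a ng d e s u",
    find_similar=lambda phonemes: FRONT,   # stub: always returns the same row
    table={FRONT: "フロントは内線九番です"},
    translate=lambda text: f"<translation of {text}>",
    synthesize=lambda text: text.encode("utf-8"),
)
audio = pipeline.handle(b"raw-speech-features")
```

Passing the units in as callables keeps each stage replaceable, which matches the description of the units as separate functional blocks.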

また、上記プログラムにおいて、前記音声認識部は、前記音声関連情報を用いて、音声認識し、１以上の音素列および音声認識結果である１以上の文字列である１以上の音声認識文字列を取得し、前記類似音素列取得部は、前記音声認識部が取得した音素列に類似する１以上の音素列を、前記固有表現情報格納部から取得し、前記類似文字列取得部は、前記類似音素列取得部が取得した１以上の音素列に対応する１以上の類似文字列を取得し、コンピュータを、前記音声認識部が取得した１以上の音声認識文字列および前記類似文字列取得部が取得した１以上の類似文字列である２以上の候補文字列を、前記端末装置に送信する候補文字列送信部と、前記２以上の候補文字列の送信に対応して、一の候補文字列を特定する情報である候補文字列特定情報を、前記端末装置から受信する候補文字列特定情報受信部としてさらに機能させ、前記機械翻訳部は、前記候補文字列特定情報に対応する音声認識文字列または類似文字列を翻訳し、翻訳結果を取得するものとして、コンピュータを機能させることは好適である。 In the above program, it is preferable that the computer functions such that: the speech recognition unit performs speech recognition using the speech-related information and acquires one or more phoneme strings and one or more speech recognition character strings, which are one or more character strings obtained as speech recognition results; the similar phoneme string acquisition unit acquires, from the specific expression information storage unit, one or more phoneme strings similar to the phoneme string acquired by the speech recognition unit; the similar character string acquisition unit acquires one or more similar character strings corresponding to the one or more phoneme strings acquired by the similar phoneme string acquisition unit; the computer further functions as a candidate character string transmitting unit that transmits, to the terminal device, two or more candidate character strings consisting of the one or more speech recognition character strings acquired by the speech recognition unit and the one or more similar character strings acquired by the similar character string acquisition unit, and as a candidate character string specifying information receiving unit that receives from the terminal device, in response to the transmission of the two or more candidate character strings, candidate character string specifying information, which is information specifying one candidate character string; and the machine translation unit translates the speech recognition character string or similar character string corresponding to the candidate character string specifying information and acquires a translation result.

また、上記プログラムにおいて、前記類似音素列取得部は、前記音声認識部が取得した音素列に類似する２以上の音素列を、前記固有表現情報格納部から取得し、前記類似文字列取得部は、前記類似音素列取得部が取得した２以上の音素列に対応する２以上の文字列である２以上の類似文字列を取得し、コンピュータを、前記類似文字列取得部が取得した２以上の類似文字列である２以上の候補文字列を、前記端末装置に送信する候補文字列送信部と、前記２以上の候補文字列の送信に対応して、一の候補文字列を特定する情報である候補文字列特定情報を、前記端末装置から受信する候補文字列特定情報受信部としてさらに機能させ、前記機械翻訳部は、前記候補文字列特定情報に対応する類似文字列を翻訳し、翻訳結果を取得するものとして、コンピュータを機能させることは好適である。 In the above program, it is preferable that the computer functions such that: the similar phoneme string acquisition unit acquires, from the specific expression information storage unit, two or more phoneme strings similar to the phoneme string acquired by the speech recognition unit; the similar character string acquisition unit acquires two or more similar character strings, which are the two or more character strings corresponding to the two or more phoneme strings acquired by the similar phoneme string acquisition unit; the computer further functions as a candidate character string transmitting unit that transmits, to the terminal device, two or more candidate character strings consisting of the two or more similar character strings acquired by the similar character string acquisition unit, and as a candidate character string specifying information receiving unit that receives from the terminal device, in response to the transmission of the two or more candidate character strings, candidate character string specifying information, which is information specifying one candidate character string; and the machine translation unit translates the similar character string corresponding to the candidate character string specifying information and acquires a translation result.

また、上記プログラムにおいて、コンピュータを、前記音声認識部が取得した文字列と前記類似文字列取得部が取得した１以上の各類似文字列とを比較し、前記音声認識部が取得した文字列と一致する文字列が、前記類似文字列取得部が取得した１以上の類似文字列の中に存在するか否かを判断する制御部としてさらに機能させ、前記候補文字列送信部は、前記候補文字列を送信しないものとして、コンピュータを機能させることは好適である。 In the above program, it is preferable that the computer further functions as a control unit that compares the character string acquired by the speech recognition unit with each of the one or more similar character strings acquired by the similar character string acquisition unit and determines whether a character string matching the one acquired by the speech recognition unit exists among those similar character strings, and that the candidate character string transmitting unit does not transmit the candidate character strings.

（実施の形態２） (Embodiment 2)

本実施の形態において、スタンドアロンの音声翻訳装置について説明する。本実施の形態における音声翻訳装置の機能は、実施の形態１の音声翻訳システム１の機能と同様である。 In this embodiment, a stand-alone speech translation apparatus will be described. The function of the speech translation apparatus in the present embodiment is the same as the function of the speech translation system 1 of the first embodiment.

図８は、本実施の形態における音声翻訳装置２のブロック図である。音声翻訳装置２は、固有表現情報格納部１２０、音声受付部１１１、音声認識部２０１、類似音素列取得部１２３、類似文字列取得部１２４、候補文字列出力部２０２、指示受付部１１６、機械翻訳部２０３、音声合成部１２８、合成音声出力部２０４、および制御部１３０を具備する。 FIG. 8 is a block diagram of the speech translation apparatus 2 in the present embodiment. The speech translation apparatus 2 includes a specific expression information storage unit 120, a speech reception unit 111, a speech recognition unit 201, a similar phoneme string acquisition unit 123, a similar character string acquisition unit 124, a candidate character string output unit 202, an instruction receiving unit 116, a machine translation unit 203, a speech synthesis unit 128, a synthesized speech output unit 204, and a control unit 130.

音声認識部２０１は、音声受付部１１１が受け付けた音声を音声認識し、音素列を取得する。また、音声認識部２０１は、音声受付部１１１が受け付けた音声を音声認識し、音素列と音声認識文字列とを取得しても良い。また、音声認識部２０１は、音声受付部１１１が受け付けた音声に関する１以上の特徴量である音声関連情報を取得し、当該音声関連情報を用いて、音声認識し、１以上の音素列または、１以上の音素列と１以上の音声認識文字列とを取得しても良い。 The speech recognition unit 201 performs speech recognition on the speech received by the speech reception unit 111 and acquires a phoneme string. The speech recognition unit 201 may instead acquire both a phoneme string and a speech recognition character string from the received speech. The speech recognition unit 201 may also acquire speech-related information, which is one or more feature quantities of the speech received by the speech reception unit 111, perform speech recognition using that speech-related information, and acquire either one or more phoneme strings, or one or more phoneme strings together with one or more speech recognition character strings.

候補文字列出力部２０２は、２以上の候補文字列を出力する。２以上の候補文字列は、通常、音声認識部２０１が取得した１以上の音声認識文字列および類似文字列取得部１２４が取得した１以上の類似文字列である。ただし、２以上の候補文字列は、類似文字列取得部１２４が取得した２以上の類似文字列であっても良い。 The candidate character string output unit 202 outputs two or more candidate character strings. The two or more candidate character strings are usually one or more speech recognition character strings acquired by the speech recognition unit 201 and one or more similar character strings acquired by the similar character string acquisition unit 124. However, the two or more candidate character strings may be two or more similar character strings acquired by the similar character string acquisition unit 124.

機械翻訳部２０３は、類似文字列取得部１２４が取得した類似文字列を翻訳し、翻訳結果を取得する。機械翻訳部２０３は、指示受付部１１６が受け付けた指示に対応する候補文字列を特定する候補文字列特定情報に対応する音声認識文字列または類似文字列を翻訳し、翻訳結果を取得しても良い。機械翻訳部２０３は、候補文字列特定情報に対応する類似文字列を翻訳し、翻訳結果を取得しても良い。 The machine translation unit 203 translates the similar character string acquired by the similar character string acquisition unit 124 and acquires a translation result. The machine translation unit 203 may instead translate the speech recognition character string or similar character string corresponding to the candidate character string specifying information, which specifies the candidate character string corresponding to the instruction received by the instruction receiving unit 116, and acquire a translation result. The machine translation unit 203 may also translate the similar character string corresponding to the candidate character string specifying information and acquire a translation result.

合成音声出力部２０４は、音声合成部１２８が取得した音声合成結果を用いて音声出力する。 The synthesized speech output unit 204 outputs speech using the speech synthesis result acquired by the speech synthesizer 128.

音声認識部２０１、機械翻訳部２０３は、通常、ＭＰＵやメモリ等から実現され得る。音声認識部２０１等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The speech recognition unit 201 and the machine translation unit 203 can usually be realized by an MPU, a memory, or the like. The processing procedure of the voice recognition unit 201 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

合成音声出力部２０４は、スピーカー等の出力デバイスを含むと考えても含まないと考えても良い。合成音声出力部２０４は、出力デバイスのドライバーソフトまたは、出力デバイスのドライバーソフトと出力デバイス等で実現され得る。 The synthesized speech output unit 204 may be considered to include or not to include an output device such as a speaker. The synthesized speech output unit 204 can be realized by driver software for an output device, or by driver software for an output device together with the output device.

次に、音声翻訳装置２の動作について、図９のフローチャートを用いて説明する。図９のフローチャートにおいて、図３または図４のフローチャートと同一のステップの説明を省略する。なお、図９のフローチャートは、図３または図４のフローチャートと同様のステップにより構成されるので、説明を省略する。また、図９のフローチャートにおいて、電源オフや処理終了の割り込みにより処理は終了する。 Next, the operation of the speech translation apparatus 2 will be described using the flowchart of FIG. 9. Because the flowchart of FIG. 9 consists of the same steps as the flowcharts of FIG. 3 and FIG. 4, the description of those steps is omitted. In the flowchart of FIG. 9, processing ends when the power is turned off or a process termination interrupt occurs.

以下、本実施の形態における音声翻訳装置２の具体的な動作について説明する。 Hereinafter, a specific operation of the speech translation apparatus 2 in the present embodiment will be described.

本具体例では、実施の形態１における実験の環境と同じである。つまり、音声翻訳装置２の固有表現情報格納部１２０は、図６に示す固有表現管理表を保持している。また、固有表現情報が有する音素列の固有表現音素記号化手法は「Ximera」という手法を用いている。また、類似音素列取得部１２３が利用する類似度の算出のアルゴリズムは、ＢＬＥＵ（数式１）である。また、類似音素列取得部１２３が利用する所定の条件は「類似度が最大の音素列」である。 This specific example is the same as the environment of the experiment in the first embodiment. That is, the specific expression information storage unit 120 of the speech translation apparatus 2 holds the specific expression management table shown in FIG. Also, a technique called “Ximera” is used as a method for encoding a phoneme string included in the phonetic information. The algorithm for calculating the similarity used by the similar phoneme string acquisition unit 123 is BLEU (Equation 1). Further, the predetermined condition used by the similar phoneme string acquisition unit 123 is “phoneme string having the maximum similarity”.

例えば、ユーザが「雑誌売り場はどこですか」と、音声翻訳装置２に対して音声入力した。次に、音声翻訳装置２の音声受付部１１１は、音声を受け付ける。そして、音声認識部２０１は、受け付けられた音声に対して、音声認識処理を行う。そして、音声認識部２０１は、音素列「z a ng sh i ng u r i b a w a d o k o d e s u k a」と音声認識文字列「斬新売り場はどこですか」とを取得する。 For example, suppose the user speaks “Where is the magazine counter?” into the speech translation apparatus 2. The speech reception unit 111 of the speech translation apparatus 2 receives the speech, and the speech recognition unit 201 performs speech recognition processing on it. The speech recognition unit 201 then acquires the phoneme string “z a ng sh i ng u r i b a w a d o k o d e s u k a” and the (misrecognized) speech recognition character string “Where is the novel counter?”.

次に、候補文字列出力部２０２は、音声認識文字列「斬新売り場はどこですか」と、類似文字列「雑誌売り場はどこですか」とを用いて、２つの候補文字列を構成する。例えば、構成した候補文字列は「１：雑誌売り場はどこですか，２：斬新売り場はどこですか」である。 Next, the candidate character string output unit 202 constructs two candidate character strings using the speech recognition character string “Where is the novel counter?” and the similar character string “Where is the magazine counter?”. For example, the constructed candidate character strings are “1: Where is the magazine counter?, 2: Where is the novel counter?”.

次に、候補文字列出力部２０２は、候補文字列を出力する。候補文字列の出力例を図７に示す。そして、図７に示すように、ユーザは、「雑誌売り場はどこですか」の文をチェックし、「送信」ボタンを押下した、とする。 Next, the candidate character string output unit 202 outputs the candidate character string. An output example of the candidate character string is shown in FIG. Then, as shown in FIG. 7, it is assumed that the user checks the sentence “Where is the magazine store” and presses the “Send” button.

次に、機械翻訳部２０３は、候補文字列特定情報「１」に対応する類似文字列「雑誌売り場はどこですか」を取得する。 Next, the machine translation unit 203 acquires a similar character string “Where is the magazine department” corresponding to the candidate character string specifying information “1”.

次に、機械翻訳部２０３は、取得した文字列「雑誌売り場はどこですか」を翻訳し、翻訳結果「Where is the magazine counter?」を取得する。 Next, the machine translation unit 203 translates the acquired Japanese character string meaning “Where is the magazine counter?” and acquires the translation result “Where is the magazine counter?”.

そして、合成音声出力部２０４は、音声合成結果を用いて音声出力する。 The synthesized speech output unit 204 outputs speech using the speech synthesis result.

なお、本実施の形態における情報処理装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、記憶媒体に、音素列と文字列とを有する２以上の固有表現情報を格納しており、コンピュータを、音声を受け付ける音声受付部と、前記音声受付部が受け付けた音声を音声認識し、音素列を取得する音声認識部と、前記音声認識部が取得した音素列に類似する音素列を、前記記憶媒体から取得する類似音素列取得部と、前記類似音素列取得部が取得した音素列に対応する文字列を取得する文字列取得部と、前記文字列取得部が取得した文字列を翻訳し、翻訳結果を取得する機械翻訳部と、前記機械翻訳部が取得した翻訳結果を音声合成する音声合成部と、前記音声合成結果を用いて音声出力する合成音声出力部として機能させるためのプログラムである。 Note that the software that implements the information processing apparatus according to the present embodiment is the following program. In other words, this program stores two or more unique expression information having a phoneme string and a character string in a storage medium, and the computer receives a voice receiving unit that receives voice and a voice received by the voice receiving unit. A speech recognition unit that performs speech recognition and acquires a phoneme sequence, a similar phoneme sequence acquisition unit that acquires a phoneme sequence similar to the phoneme sequence acquired by the speech recognition unit from the storage medium, and the similar phoneme sequence acquisition unit A character string acquisition unit that acquires a character string corresponding to the acquired phoneme string, a machine translation unit that translates the character string acquired by the character string acquisition unit and acquires a translation result, and a translation acquired by the machine translation unit A program for functioning as a speech synthesizer that synthesizes a result and a synthesized speech output unit that outputs speech using the speech synthesis result.

また、上記プログラムにおいて、前記音声認識部は、前記音声関連情報を用いて、音声認識し、１以上の音素列および音声認識結果である１以上の文字列である１以上の音声認識文字列を取得し、前記類似音素列取得部は、前記音声認識部が取得した音素列に類似する１以上の音素列を、前記固有表現情報格納部から取得し、前記類似文字列取得部は、前記類似音素列取得部が取得した１以上の音素列に対応する１以上の文字列である１以上の類似文字列を取得し、前記音声認識部が取得した１以上の音声認識文字列および前記類似文字列取得部が取得した１以上の類似文字列である２以上の候補文字列を出力する候補文字列出力部と、前記候補文字列出力部が出力した２以上の候補文字列の中から、一の候補文字列の指示を受け付ける指示受付部とをさらに具備し、前記機械翻訳部は、前記指示受付部が受け付けた指示に対応する候補文字列を特定する候補文字列特定情報に対応する音声認識文字列または類似文字列を翻訳し、翻訳結果を取得するものとしてコンピュータを機能させることは好適である。 In the above program, it is preferable that the computer functions such that: the speech recognition unit performs speech recognition using the speech-related information and acquires one or more phoneme strings and one or more speech recognition character strings, which are one or more character strings obtained as speech recognition results; the similar phoneme string acquisition unit acquires, from the specific expression information storage unit, one or more phoneme strings similar to the phoneme string acquired by the speech recognition unit; the similar character string acquisition unit acquires one or more similar character strings, which are the one or more character strings corresponding to the one or more phoneme strings acquired by the similar phoneme string acquisition unit; the program further provides a candidate character string output unit that outputs two or more candidate character strings consisting of the one or more speech recognition character strings acquired by the speech recognition unit and the one or more similar character strings acquired by the similar character string acquisition unit, and an instruction receiving unit that receives an instruction selecting one candidate character string from among the two or more candidate character strings output by the candidate character string output unit; and the machine translation unit translates the speech recognition character string or similar character string corresponding to the candidate character string specifying information, which specifies the candidate character string corresponding to the instruction received by the instruction receiving unit, and acquires a translation result.

また、上記プログラムにおいて、前記類似音素列取得部は、前記音声認識部が取得した音素列に類似する２以上の音素列を、前記固有表現情報格納部から取得し、前記類似文字列取得部は、前記類似音素列取得部が取得した２以上の音素列に対応する２以上の類似文字列を取得し、前記類似文字列取得部が取得した２以上の類似文字列である２以上の候補文字列を出力する候補文字列出力部と、前記候補文字列出力部が出力した２以上の候補文字列の中から、一の候補文字列の指示を受け付ける指示受付部とをさらに具備し、前記機械翻訳部は、前記指示受付部が受け付けた指示に対応する候補文字列を特定する候補文字列特定情報に対応する音声認識文字列または類似文字列を翻訳し、翻訳結果を取得するものとしてコンピュータを機能させることは好適である。 In the above program, it is preferable that the computer functions such that: the similar phoneme string acquisition unit acquires, from the specific expression information storage unit, two or more phoneme strings similar to the phoneme string acquired by the speech recognition unit; the similar character string acquisition unit acquires two or more similar character strings corresponding to the two or more phoneme strings acquired by the similar phoneme string acquisition unit; the program further provides a candidate character string output unit that outputs two or more candidate character strings consisting of the two or more similar character strings acquired by the similar character string acquisition unit, and an instruction receiving unit that receives an instruction selecting one candidate character string from among the two or more candidate character strings output by the candidate character string output unit; and the machine translation unit translates the speech recognition character string or similar character string corresponding to the candidate character string specifying information, which specifies the candidate character string corresponding to the instruction received by the instruction receiving unit, and acquires a translation result.

また、上記プログラムにおいて、コンピュータを、前記音声認識部が取得した文字列と前記類似文字列取得部が取得した１以上の各類似文字列とを比較し、前記音声認識部が取得した文字列と一致する文字列が、前記類似文字列取得部が取得した１以上の類似文字列の中に存在するか否かを判断する制御部としてさらに機能させ、前記候補文字列出力部は、前記候補文字列を出力しないものとして、コンピュータを機能させることは好適である。 In the above program, it is preferable that the computer further functions as a control unit that compares the character string acquired by the speech recognition unit with each of the one or more similar character strings acquired by the similar character string acquisition unit and determines whether a character string matching the one acquired by the speech recognition unit exists among those similar character strings, and that the candidate character string output unit does not output the candidate character strings.

また、図１０は、本明細書で述べたプログラムを実行して、上述した実施の形態の音声翻訳装置等を実現するコンピュータの外観を示す。上述の実施の形態は、コンピュータハードウェア及びその上で実行されるコンピュータプログラムで実現され得る。図１０は、このコンピュータシステム３４０の概観図であり、図１１は、コンピュータシステム３４０の内部構成を示す図である。 FIG. 10 shows the external appearance of a computer that executes the program described in this specification to realize the speech translation apparatus and the like of the above-described embodiment. The above-described embodiments can be realized by computer hardware and a computer program executed thereon. FIG. 10 is an overview diagram of the computer system 340, and FIG. 11 is a diagram showing an internal configuration of the computer system 340.

図１０において、コンピュータシステム３４０は、ＦＤドライブ３４１１、ＣＤ−ＲＯＭドライブ３４１２を含むコンピュータ３４１と、キーボード３４２と、マウス３４３と、モニタ３４４と、マイク３４５とを含む。 In FIG. 10, the computer system 340 includes a computer 341 including an FD drive 3411 and a CD-ROM drive 3412, a keyboard 342, a mouse 343, a monitor 344, and a microphone 345.

図１１において、コンピュータ３４１は、ＦＤドライブ３４１１、ＣＤ−ＲＯＭドライブ３４１２に加えて、ＭＰＵ３４１３と、ＣＤ−ＲＯＭドライブ３４１２及びＦＤドライブ３４１１に接続されたバス３４１４と、ブートアッププログラム等のプログラムを記憶するためのＲＯＭ３４１５とに接続され、アプリケーションプログラムの命令を一時的に記憶するとともに一時記憶空間を提供するためのＲＡＭ３４１６と、アプリケーションプログラム、システムプログラム、及びデータを記憶するためのハードディスク３４１７とを含む。ここでは、図示しないが、コンピュータ３４１は、さらに、ＬＡＮへの接続を提供するネットワークカードを含んでも良い。 In FIG. 11, the computer 341 includes, in addition to the FD drive 3411 and the CD-ROM drive 3412, an MPU 3413; a bus 3414 connected to the CD-ROM drive 3412 and the FD drive 3411; a RAM 3416 that is connected to a ROM 3415 for storing programs such as a boot-up program, temporarily stores the instructions of application programs, and provides temporary storage space; and a hard disk 3417 for storing application programs, system programs, and data. Although not shown here, the computer 341 may further include a network card that provides a connection to a LAN.

コンピュータシステム３４０に、上述した実施の形態の音声翻訳装置等の機能を実行させるプログラムは、ＣＤ−ＲＯＭ３５０１、またはＦＤ３５０２に記憶されて、ＣＤ−ＲＯＭドライブ３４１２またはＦＤドライブ３４１１に挿入され、さらにハードディスク３４１７に転送されても良い。これに代えて、プログラムは、図示しないネットワークを介してコンピュータ３４１に送信され、ハードディスク３４１７に記憶されても良い。プログラムは実行の際にＲＡＭ３４１６にロードされる。プログラムは、ＣＤ−ＲＯＭ３５０１、ＦＤ３５０２またはネットワークから直接、ロードされても良い。 A program that causes the computer system 340 to execute the functions of the speech translation apparatus and the like of the above-described embodiments may be stored on the CD-ROM 3501 or the FD 3502, inserted into the CD-ROM drive 3412 or the FD drive 3411, and then transferred to the hard disk 3417. Alternatively, the program may be transmitted to the computer 341 via a network (not shown) and stored on the hard disk 3417. The program is loaded into the RAM 3416 at the time of execution. The program may also be loaded directly from the CD-ROM 3501, the FD 3502, or the network.

プログラムは、コンピュータ３４１に、上述した実施の形態の音声翻訳装置等の機能を実行させるオペレーティングシステム（ＯＳ）、またはサードパーティープログラム等は、必ずしも含まなくても良い。プログラムは、制御された態様で適切な機能（モジュール）を呼び出し、所望の結果が得られるようにする命令の部分のみを含んでいれば良い。コンピュータシステム３４０がどのように動作するかは周知であり、詳細な説明は省略する。 The program does not necessarily include an operating system (OS) or a third-party program that causes the computer 341 to execute the functions of the speech translation apparatus according to the above-described embodiment. The program only needs to include an instruction portion that calls an appropriate function (module) in a controlled manner and obtains a desired result. How the computer system 340 operates is well known and will not be described in detail.

なお、上記プログラムにおいて、情報を送信するステップや、情報を受信するステップなどでは、ハードウェアによって行われる処理、例えば、モデムやインターフェースカードなどで行われる処理（ハードウェアでしか行われない処理）は含まれない。 Note that, in the above program, the step of transmitting information, the step of receiving information, and the like do not include processing performed by hardware, for example, processing performed by a modem or an interface card (processing that can only be performed by hardware).

また、上記プログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 Further, the computer that executes the program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.

Needless to say, in each of the above embodiments, two or more communication means present in one apparatus may be physically realized by a single medium.

In each of the above embodiments, each process (each function) may be realized by centralized processing by a single apparatus (system), or by distributed processing by a plurality of apparatuses.

The present invention is not limited to the above embodiments. Various modifications are possible, and it goes without saying that these also fall within the scope of the present invention.

As described above, the speech translation system according to the present invention has the effect that a good translation result can be obtained even when the speech recognition result contains an error, and is useful as a speech translation system or the like.
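The effect stated above rests on looking up stored entries whose phoneme strings resemble the recognized phoneme sequence. The following is a minimal sketch of such a lookup, not the method of the embodiments. It assumes that similarity is a normalized edit distance over phoneme symbols and that the "predetermined condition" is a fixed threshold; the function names, the threshold value of 0.7, and the sample entries are all hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed similarity measure: 1.0 for identical sequences, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def similar_character_strings(
    recognized: Sequence[str],
    entries: List[Tuple[Sequence[str], str]],
    threshold: float = 0.7,  # hypothetical stand-in for the "predetermined condition"
) -> List[str]:
    """Return character strings of entries whose phoneme string is similar enough."""
    scored = [(similarity(recognized, phonemes), text) for phonemes, text in entries]
    scored.sort(key=lambda pair: -pair[0])  # most similar first
    return [text for score, text in scored if score >= threshold]


# Hypothetical stored entries: (phoneme string, character string).
ENTRIES = [
    (list("kiyomizu"), "Kiyomizu"),
    (list("nara"), "Nara"),
]

# A recognition result with one wrong phoneme still retrieves the intended entry.
print(similar_character_strings(list("kiyomisu"), ENTRIES))
```

With a threshold, a single misrecognized phoneme does not prevent the intended entry from being offered as a candidate, while clearly unrelated entries are filtered out.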

DESCRIPTION OF SYMBOLS
1 Speech translation system
2 Speech translation apparatus
11 Terminal device
12 Server device
111 Speech reception unit
112 Speech-related information acquisition unit
113 Speech-related information transmission unit
114 Candidate character string reception unit
115, 202 Candidate character string output unit
116 Instruction reception unit
117 Candidate character string specifying information transmission unit
118 Speech synthesis result reception unit
119, 204 Synthesized speech output unit
120 Specific expression information storage unit
121 Speech-related information reception unit
122, 201 Speech recognition unit
123 Similar phoneme string acquisition unit
124 Similar character string acquisition unit
125 Candidate character string transmission unit
126 Candidate character string specifying information reception unit
127, 203 Machine translation unit
128 Speech synthesis unit
129 Speech synthesis result transmission unit
130 Control unit

Claims

A speech translation system comprising a terminal device and a server device,
The terminal device
A voice reception unit for receiving voice;
A voice-related information acquisition unit that acquires voice-related information, which is either the voice received by the voice reception unit or one or more feature quantities of that voice;
A voice-related information transmitting unit that transmits the voice-related information to the server device;
A candidate character string receiving unit for receiving two or more candidate character strings from the server device;
A candidate character string output unit that outputs two or more candidate character strings received by the candidate character string receiver;
An instruction receiving unit for receiving an instruction for one candidate character string from among two or more candidate character strings output by the candidate character string output unit;
A candidate character string specifying information transmitting unit for transmitting candidate character string specifying information for specifying a candidate character string corresponding to the instruction received by the instruction receiving unit to the server device;
A speech synthesis result receiving unit for receiving a speech synthesis result from the server device;
A synthesized speech output unit that outputs speech using the speech synthesis result,
The server device
A unique expression information storage unit capable of storing two or more unique expression information having a phoneme string and a character string;
A voice related information receiving unit for receiving the voice related information;
A speech recognition unit that recognizes speech using the voice-related information and obtains a phoneme sequence and a speech recognition character string, which is the character string of the speech recognition result;
A similar phoneme string acquisition unit that calculates the similarity between the phoneme sequence acquired by the speech recognition unit and each of the two or more phoneme strings included in the two or more pieces of unique expression information stored in the unique expression information storage unit, and acquires from the unique expression information storage unit one or more phoneme strings whose similarity satisfies a predetermined condition;
A similar character string acquisition unit that acquires one or more similar character strings, which are the character strings corresponding to the one or more phoneme strings acquired by the similar phoneme string acquisition unit;
A control unit that compares the speech recognition character string acquired by the speech recognition unit with the one or more similar character strings acquired by the similar character string acquisition unit, and determines whether a character string that matches the speech recognition character string exists among the one or more similar character strings;
A candidate character string transmission unit that transmits to the terminal device two or more candidate character strings, namely the speech recognition character string acquired by the speech recognition unit and the one or more similar character strings acquired by the similar character string acquisition unit;
In response to the transmission of the two or more candidate character strings, a candidate character string specifying information receiving unit that receives candidate character string specifying information that is information for specifying one candidate character string from the terminal device;
A machine translation unit that translates a candidate character string corresponding to the candidate character string specifying information and obtains a translation result;
A speech synthesis unit that performs speech synthesis on the translation result obtained by the machine translation unit and obtains a speech synthesis result; and
A speech synthesis result transmission unit for transmitting the speech synthesis result to the terminal device,
Wherein the candidate character string transmission unit
Does not transmit the candidate character strings when the control unit determines that a character string that matches the speech recognition character string acquired by the speech recognition unit exists among the one or more similar character strings acquired by the similar character string acquisition unit.

The speech translation system according to claim 1, wherein
The similar phoneme string acquisition unit
Calculates the similarity between the phoneme sequence acquired by the speech recognition unit and each of the two or more phoneme strings included in the two or more pieces of unique expression information stored in the unique expression information storage unit, and acquires from the unique expression information storage unit two or more phoneme strings whose similarity satisfies the predetermined condition,
The similar character string acquisition unit
Acquires from the unique expression information storage unit two or more similar character strings, which are the character strings corresponding to the two or more phoneme strings acquired by the similar phoneme string acquisition unit, and
The candidate character string transmission unit
Transmits to the terminal device three or more candidate character strings, namely the speech recognition character string acquired by the speech recognition unit and the two or more similar character strings acquired by the similar character string acquisition unit.

A unique expression information storage unit capable of storing two or more unique expression information having a phoneme string and a character string;
A voice reception unit for receiving voice;
A speech recognition unit that recognizes the speech received by the speech reception unit and obtains a phoneme sequence and a speech recognition character string, which is the character string of the speech recognition result;
A similar phoneme string acquisition unit that calculates the similarity between the phoneme sequence acquired by the speech recognition unit and each of the two or more phoneme strings included in the two or more pieces of unique expression information stored in the unique expression information storage unit, and acquires from the unique expression information storage unit one or more phoneme strings whose similarity satisfies a predetermined condition;
A similar character string acquisition unit that acquires one or more similar character strings, which are the character strings corresponding to the one or more phoneme strings acquired by the similar phoneme string acquisition unit;
A control unit that compares the speech recognition character string acquired by the speech recognition unit with the one or more similar character strings acquired by the similar character string acquisition unit, and determines whether a character string that matches the speech recognition character string exists among the one or more similar character strings;
A candidate character string output unit for outputting two or more candidate character strings that are the voice recognition character string acquired by the voice recognition unit and the one or more similar character strings acquired by the similar character string acquisition unit;
An instruction receiving unit for receiving an instruction for one candidate character string from among two or more candidate character strings output by the candidate character string output unit;
A machine translation unit that translates one candidate character string corresponding to the instruction received by the instruction reception unit and obtains a translation result;
A speech synthesis unit that performs speech synthesis on the translation result obtained by the machine translation unit and obtains a speech synthesis result; and
A synthesized speech output unit that outputs speech using the speech synthesis result,
Wherein the candidate character string output unit
Does not output the candidate character strings when the control unit determines that a character string that matches the character string acquired by the speech recognition unit exists among the one or more similar character strings acquired by the similar character string acquisition unit.

The speech translation apparatus according to claim 3, wherein
The similar phoneme string acquisition unit
Calculates the similarity between the phoneme sequence acquired by the speech recognition unit and each of the two or more phoneme strings included in the two or more pieces of unique expression information stored in the unique expression information storage unit, and acquires from the unique expression information storage unit two or more phoneme strings whose similarity satisfies the predetermined condition,
The similar character string acquisition unit
Acquires from the unique expression information storage unit two or more similar character strings, which are the character strings corresponding to the two or more phoneme strings acquired by the similar phoneme string acquisition unit, and
The candidate character string output unit
Outputs three or more candidate character strings, namely the speech recognition character string acquired by the speech recognition unit and the two or more similar character strings acquired by the similar character string acquisition unit.

A speech translation method in which two or more pieces of unique expression information, each having a phoneme string and a character string, are stored on a storage medium,
The method being realized by a speech reception unit, a speech recognition unit, a similar phoneme string acquisition unit, a similar character string acquisition unit, a control unit, a candidate character string output unit, an instruction reception unit, a machine translation unit, a speech synthesis unit, and a synthesized speech output unit, and comprising:
A voice receiving step in which the voice receiving unit receives voice;
A speech recognition step in which the speech recognition unit recognizes the voice received in the voice receiving step and acquires a phoneme string and a speech recognition character string, which is the character string of the speech recognition result;
A similar phoneme string acquisition step in which the similar phoneme string acquisition unit calculates the similarity between the phoneme string acquired in the speech recognition step and each of the two or more phoneme strings included in the two or more pieces of unique expression information stored on the storage medium, and acquires from the storage medium one or more phoneme strings whose similarity satisfies a predetermined condition;
A similar character string acquisition step in which the similar character string acquisition unit acquires from the storage medium one or more similar character strings, which are the character strings corresponding to the one or more phoneme strings acquired in the similar phoneme string acquisition step;
A control step in which the control unit compares the speech recognition character string acquired in the speech recognition step with the one or more similar character strings acquired in the similar character string acquisition step, and determines whether a character string that matches the speech recognition character string exists among the one or more similar character strings;
A candidate character string output step in which the candidate character string output unit outputs two or more candidate character strings, namely the speech recognition character string acquired in the speech recognition step and the one or more similar character strings acquired in the similar character string acquisition step;
An instruction receiving step in which the instruction receiving unit receives an instruction of one candidate character string from two or more candidate character strings output in the candidate character string output step;
A machine translation step in which the machine translation unit translates the one candidate character string corresponding to the instruction received in the instruction receiving step and obtains a translation result;
A speech synthesis step in which the speech synthesis unit performs speech synthesis on the translation result obtained in the machine translation step and obtains a speech synthesis result; and
A synthesized speech output step in which the synthesized speech output unit outputs speech using the speech synthesis result,
Wherein, in the candidate character string output step,
The candidate character strings are not output if it is determined in the control step that a character string that matches the character string acquired in the speech recognition step exists among the one or more similar character strings acquired in the similar character string acquisition step.

The speech translation method according to claim 5, wherein
In the similar phoneme string acquisition step,
The similarity between the phoneme string acquired in the speech recognition step and each of the two or more phoneme strings included in the two or more pieces of unique expression information stored on the storage medium is calculated, and two or more phoneme strings whose similarity satisfies the predetermined condition are acquired from the storage medium,
In the similar character string acquisition step,
Two or more similar character strings, which are the character strings corresponding to the two or more phoneme strings acquired in the similar phoneme string acquisition step, are acquired from the storage medium, and
In the candidate character string output step,
Three or more candidate character strings are output, namely the speech recognition character string acquired in the speech recognition step and the two or more similar character strings acquired in the similar character string acquisition step.

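A program for a computer that can access a storage medium on which two or more pieces of unique expression information, each having a phoneme string and a character string, are stored,
The program causing the computer to function as: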
A voice reception unit for receiving voice;
A speech recognition unit that recognizes speech received by the speech reception unit and acquires a phoneme sequence;
A similar phoneme string acquisition unit that calculates the similarity between the phoneme sequence acquired by the speech recognition unit and each of the two or more phoneme strings included in the two or more pieces of unique expression information stored on the storage medium, and acquires from the storage medium one or more phoneme strings whose similarity satisfies a predetermined condition;
A similar character string acquisition unit that acquires from the storage medium one or more similar character strings, which are the character strings corresponding to the one or more phoneme strings acquired by the similar phoneme string acquisition unit;
A control unit that compares the speech recognition character string acquired by the speech recognition unit with the one or more similar character strings acquired by the similar character string acquisition unit, and determines whether a character string that matches the speech recognition character string exists among the one or more similar character strings;
A candidate character string output unit for outputting two or more candidate character strings, namely the speech recognition character string acquired by the speech recognition unit and the one or more similar character strings acquired by the similar character string acquisition unit;
An instruction receiving unit for receiving an instruction for one candidate character string from among two or more candidate character strings output by the candidate character string output unit;
A machine translation unit that translates one candidate character string corresponding to the instruction received by the instruction reception unit and obtains a translation result;
A speech synthesis unit that performs speech synthesis on the translation result obtained by the machine translation unit and obtains a speech synthesis result; and
A synthesized speech output unit that outputs speech using the speech synthesis result,
The program causing the computer to function such that the candidate character string output unit
Does not output the candidate character strings when the control unit determines that a character string that matches the character string acquired by the speech recognition unit exists among the one or more similar character strings acquired by the similar character string acquisition unit.

The program according to claim 7, causing the computer to function such that
The similar phoneme string acquisition unit
Calculates the similarity between the phoneme sequence acquired by the speech recognition unit and each of the two or more phoneme strings included in the two or more pieces of unique expression information stored on the storage medium, and acquires from the storage medium two or more phoneme strings whose similarity satisfies the predetermined condition,
The similar character string acquisition unit
Acquires from the storage medium two or more similar character strings, which are the character strings corresponding to the two or more phoneme strings acquired by the similar phoneme string acquisition unit, and
The candidate character string output unit
Outputs three or more candidate character strings, namely the speech recognition character string acquired by the speech recognition unit and the two or more similar character strings acquired by the similar character string acquisition unit.
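
The candidate handling recited in the claims above can be illustrated with a short sketch. It is not the claimed implementation. It assumes that, when the control unit finds the recognized string among the similar strings, the recognized string is passed on for translation without asking the user; the claims state only that the candidates are not output in that case. The function name `choose_source_text`, the `ask_user` callback, and the handling of an empty similar-string list are hypothetical.

```python
from typing import Callable, List


def choose_source_text(
    recognized: str,
    similar: List[str],
    ask_user: Callable[[List[str]], int],
) -> str:
    """Decide which character string is handed to machine translation.

    recognized: the speech recognition character string.
    similar:    similar character strings found via phoneme similarity.
    ask_user:   hypothetical callback that shows the candidates and returns
                the index of the one the user selects.
    """
    # Control-unit check: the recognized string already matches a stored entry,
    # so the candidates are not output (assumption: translate it as-is).
    if recognized in similar:
        return recognized

    # Assumption: with no similar strings there is nothing to choose between.
    if not similar:
        return recognized

    # Otherwise output two or more candidates and accept one instruction.
    candidates = [recognized] + similar
    return candidates[ask_user(candidates)]


# Usage with a stand-in for the instruction reception unit that always
# picks the second candidate.
shown: List[List[str]] = []


def pick_second(candidates: List[str]) -> int:
    shown.append(candidates)
    return 1


print(choose_source_text("Kiyomisu temple", ["Kiyomizu temple"], pick_second))
print(choose_source_text("Kiyomizu temple", ["Kiyomizu temple"], pick_second))
```

In the first call the user is shown both strings and selects the stored entry. In the second call the recognized string already matches a stored entry, so no candidate list is shown.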