JP5957269B2

JP5957269B2 - Voice recognition server integration apparatus and voice recognition server integration method

Info

Publication number: JP5957269B2
Application number: JP2012088230A
Authority: JP
Inventors: 康成大淵; 本間　健
Original assignee: Clarion Co Ltd
Current assignee: Faurecia Clarion Electronics Co Ltd
Priority date: 2012-04-09
Filing date: 2012-04-09
Publication date: 2016-07-27
Anticipated expiration: 2032-04-09
Also published as: JP2013218095A; EP2838085B1; WO2013154010A1; US20150088506A1; CN104221078B; CN104221078A; US9524718B2; EP2838085A4; EP2838085A1

Description

The present invention relates to an apparatus and a method that connect a terminal device, which a user operates by voice, to a plurality of speech recognition servers that recognize speech data and return the results, and that integrate the recognition results obtained from those servers to provide the user with an optimal speech recognition result.

Many devices, such as in-vehicle information equipment and mobile phones, have a speech recognition function that lets the user operate them by voice. In recent years, advances in data communication technology have popularized a scheme in which speech data is sent to a server and recognized more accurately using the server's abundant computing resources. In addition, as shown in Patent Document 1, a scheme has been proposed in which a relay server is placed between the personal terminal and the speech recognition server and performs additional processing, so that such a speech recognition server can be used more effectively.

Patent Document 1: JP 2008-242067 A; Patent Document 2: JP 2002-116796 A; Patent Document 3: JP 2010-224301 A

Speech recognition servers are increasingly operated as general-purpose services, and the entity that provides the service for the user's terminal is often different from the entity that operates the speech recognition server. Even when the operating entity is the same, the speech recognition server and the speech recognition application may be developed separately and not be optimized for each other. In such situations, a general-purpose speech recognition server may show high performance overall yet not necessarily perform well enough on specific phrases.

On the other hand, for a specific user of a specific application, there are phrases that are uncommon but highly important, such as the names of the user's acquaintances or the titles of favorite songs. Recognizing such phrases calls for a dedicated speech recognition server, but sufficient cost often cannot be spent on developing one, so it ends up inferior to a general-purpose speech recognition server on common phrases. The general-purpose and dedicated speech recognition servers thus each have phrases they recognize well and phrases they recognize poorly, and their recognition performance differs. Ideally the two would be used selectively according to what the user said. However, because speech recognition is by nature the task of estimating the content of an utterance whose content is unknown, it is impossible in principle to know the utterance in advance and choose the server accordingly.

An object of the present invention is to integrate the recognition result of a general-purpose speech recognition server and the recognition result of a dedicated speech recognition server in an optimal way, and thereby to provide a speech recognition function with few errors.

In the present invention, a list of the specific words held in the user's terminal device is obtained in advance, and a dedicated speech recognition server is built from that word data. The same data is also used to evaluate the performance of the general-purpose speech recognition servers in advance. Based on the evaluation, a database records which of the recognition results obtained from the dedicated and general-purpose servers should be adopted, and how they should be weighted, to obtain the optimal recognition result. When the user actually uses the speech recognition function, the input speech is recognized by the dedicated and general-purpose speech recognition servers, and the results are compared with the contents of this database to obtain the optimal result. In addition, by using response speed as well as recognition correctness as a criterion in the prior evaluation, a recognition result that is as correct as possible can be obtained as quickly as possible.

One example of the speech recognition server integration device of the present invention is a device that relays between a terminal device, which a user operates by voice, and speech recognition servers that recognize speech data and return the results. The device comprises: an integration method learning unit that learns and stores recognition result integration parameters based on a list of phrases registered by the user or frequently used by the user; means for receiving, from the terminal device, speech data of an utterance made by the user for the purpose of speech recognition; means for transmitting the received speech data to a general-purpose speech recognition server and a dedicated speech recognition server; means for receiving the results of recognition of the speech data by the general-purpose and dedicated speech recognition servers; a recognition result integration unit that compares those recognition results with the stored recognition result integration parameters and selects the optimal recognition result; and means for transmitting the selected recognition result to the terminal device.

The speech recognition server integration device of the present invention may further comprise: means for receiving, from the terminal device, a list of phrases registered by the user or frequently used by the user; a speech synthesis unit that generates synthesized speech from the received phrases; means for transmitting the generated synthesized speech to the general-purpose and dedicated speech recognition servers; and means for receiving the results of recognition of the synthesized speech by those servers. In this case, the integration method learning unit analyzes the phrases from which the synthesized speech was generated together with the recognition results, and learns and stores the recognition result integration parameters.

The speech recognition server integration device of the present invention may also further comprise: means for receiving, from the terminal device, a list of phrases registered by the user or frequently used by the user; means for receiving a recognition phrase list from the general-purpose speech recognition server; and a phrase comparison and similarity estimation unit that compares the recognition phrase list with the phrase list received from the terminal device and estimates their similarity. In this case, the integration method learning unit stores the estimation result as the recognition result integration parameters.

One example of the speech recognition server integration method of the present invention comprises the steps of: learning and storing recognition result integration parameters based on a list of phrases registered by the user or frequently used by the user; transmitting speech data of an utterance made by the user for the purpose of speech recognition to a general-purpose speech recognition server and a dedicated speech recognition server; receiving the results of recognition of the speech data by the general-purpose and dedicated speech recognition servers; and comparing the recognition result of the general-purpose server and the recognition result of the dedicated server with the recognition result integration parameters to select the optimal speech recognition result.

According to the present invention, recognition results are integrated in the form best suited to each input, for example by giving more weight to the result of the general-purpose speech recognition server for common phrases and to the result of the dedicated speech recognition server for user-specific phrases. This makes it possible to provide the user with a speech recognition function with few errors. Beyond reducing errors, the invention also makes it possible to realize a system that is convenient in terms of response speed.

A configuration diagram of the speech recognition server integration device of Embodiment 1 of the present invention.
A diagram showing the process of estimating result integration parameters using speech synthesis in Embodiment 1 of the present invention.
A diagram showing an example of result integration parameters using a single general-purpose speech recognition server.
A diagram showing an example of result integration parameters using a plurality of general-purpose speech recognition servers.
A diagram showing an example of a method for integrating the recognition results of a plurality of servers in Embodiment 1 of the present invention.
A diagram showing an example of result integration parameters using the recognition result reliability of a plurality of general-purpose speech recognition servers.
A diagram showing an example of result integration parameters using the recognition result reliability and the misrecognition results of a plurality of general-purpose speech recognition servers.
A diagram showing an example of a method for integrating recognition results using conversion between homophones with different notations.
A diagram showing a configuration example of a user terminal for implementing the present invention.
A diagram showing an example of a method for creating a user dictionary in the present invention.
A diagram showing an example of the configuration of the speech synthesis unit in the present invention.
A diagram showing an example of result integration parameters that take response time into account.
A configuration diagram of the speech recognition server integration device of Embodiment 2 of the present invention.
A configuration diagram of the speech recognition server integration device of Embodiment 3 of the present invention.
A diagram showing the process of estimating result integration parameters using the recognition phrase list in Embodiment 3 of the present invention.
A configuration diagram of the speech recognition server device of Embodiment 4 of the present invention.
A configuration diagram of the speech recognition server device of Embodiment 5 of the present invention.

Embodiments of the present invention are described below with reference to the drawings. In all the drawings used to describe the embodiments, elements having the same function are given the same names and reference numerals, and repeated description of them is omitted.

FIG. 1 is a diagram showing a configuration example of the speech recognition server integration device according to Embodiment 1 of the present invention. The speech recognition function is provided using a user terminal 102, a relay server 104, a general-purpose speech recognition server group 106, and a dedicated speech recognition server 108. The general-purpose speech recognition server group 106 may consist of a single general-purpose speech recognition server.

The user terminal 102 is a terminal device owned by an individual user. In addition to acquiring input speech data and providing services based on the speech recognition result, it holds phrase lists specific to the user, such as an address book and a list of song titles. These user-specific phrase lists are referred to below as the “user dictionary”. The user dictionary holds a list of phrases registered by the user or frequently used by the user.

The general-purpose speech recognition server group 106 consists of one or more speech recognition servers that are not assumed to be used exclusively by the service realized by the present invention. In general, such servers contain a large phrase list and recognize a wide variety of words well, but they may fail to recognize some of the phrases in the user dictionary correctly.

The dedicated speech recognition server 108 is a speech recognition server specialized for the service realized by the present invention, and is designed to recognize all or most of the phrases in the user dictionary. It is designed to output “no recognition result” when a phrase not contained in the user dictionary is input. The dedicated speech recognition server need not be configured as a server. It may be a dedicated speech recognition device, or it may be built into the user terminal or the relay server as in Embodiment 2 and Embodiment 5.

The relay server 104 corresponds to the “speech recognition server integration device” of the present invention. It connects the user terminal 102 to the speech recognition servers 106 and 108 and performs tasks such as integrating the speech recognition results. Data is exchanged with the user terminal 102 via a terminal device communication unit 110, and with the speech recognition servers 106 and 108 via a recognition server communication unit 112. The relay server 104 comprises the terminal device communication unit 110, a speech synthesis unit 114, an integration method learning unit 116, a signal processing unit 120, a recognition result integration unit 122, the recognition server communication unit 112, and other components.

The operation of the relay server 104 is as follows. First, when the user sets the user terminal 102 to a state in which it can communicate, the data of the user dictionary 124 is transmitted and received via the terminal device communication unit 110. This data is passed directly to the recognition server communication unit 112 and then sent to the dedicated speech recognition server 108. Based on the received user dictionary data, the dedicated speech recognition server 108 tunes itself so that it can correctly recognize the phrases the data contains. The user dictionary data received by the terminal device communication unit 110 is also sent to the speech synthesis unit 114, which creates synthesized speech data from the user dictionary data received as character strings. A single phrase may have one piece of synthesized speech data or several pieces with different voice qualities. The synthesized speech data is sent to the general-purpose speech recognition server group 106 and the dedicated speech recognition server 108 via the recognition server communication unit 112. When the servers return their recognition results, the recognition server communication unit 112 receives them and passes them to the integration method learning unit 116. The integration method learning unit 116 analyzes the user dictionary data from which the synthesized speech was generated together with the recognition results, and learns the parameters for integrating recognition results. The resulting parameters are stored as result integration parameters 118. At this point, the pre-learning process of the system using the present invention is complete.
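The pre-learning flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation in the patent: `learn_integration_params`, `fake_tts`, and `make_fake_server` are hypothetical names, the synthesizer and recognizers are stand-ins passed in as plain callables, and the learned parameters are reduced to one correct/incorrect flag per phrase and server.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a recognizer takes audio bytes and returns text.
Recognizer = Callable[[bytes], str]


def learn_integration_params(
    user_dictionary: List[str],
    synthesize: Callable[[str], bytes],
    general_servers: Dict[str, Recognizer],
) -> Dict[str, Dict[str, bool]]:
    """Record, per phrase, whether each general-purpose server
    recognizes the synthesized speech for that phrase correctly."""
    params: Dict[str, Dict[str, bool]] = {}
    for phrase in user_dictionary:
        audio = synthesize(phrase)
        params[phrase] = {
            name: recognize(audio) == phrase
            for name, recognize in general_servers.items()
        }
    return params


# Stand-ins for illustration only (assumptions, not real services).
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def make_fake_server(vocabulary: set) -> Recognizer:
    def recognize(audio: bytes) -> str:
        text = audio.decode("utf-8")
        return text if text in vocabulary else "<unknown>"
    return recognize


servers = {"general_1": make_fake_server({"Ichiro Suzuki", "Jiro Yamada"})}
params = learn_integration_params(
    ["Ichiro Suzuki", "Jiro Yamada", "Hitachi Taro"], fake_tts, servers
)
```

A real system would replace the stand-ins with calls to an actual speech synthesizer and to the recognition servers, and could synthesize several variants per phrase before aggregating.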

When the user actually uses the voice interface, the input speech data acquired by the user terminal 102 is received by the terminal device communication unit 110. The received data is sent to the signal processing unit 120, where any necessary processing is applied. Such processing is, for example, removing noise from noisy input speech. It is not mandatory, and the data may be passed through unprocessed. The data output from the signal processing unit 120 is sent via the recognition server communication unit 112 to the general-purpose speech recognition server group 106 and the dedicated speech recognition server 108. The recognition results returned from these servers are passed via the recognition server communication unit 112 to the recognition result integration unit 122. The recognition result integration unit 122 compares the recognition results with the parameters contained in the result integration parameters 118 and selects the optimal recognition result. The selected result is sent to the user terminal 102 via the terminal device communication unit 110. Based on this result, the user terminal 102 provides services such as setting a destination for the navigation function, placing a phone call, or playing a song.
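The recognition-time flow through the relay server can be sketched in Python as follows. This is only an illustration under assumptions: `relay_recognize` is a hypothetical name, the servers are modeled as plain callables, and the parallel fan-out with a thread pool is a design choice of this sketch rather than something the description specifies.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional


def relay_recognize(
    audio: bytes,
    servers: Dict[str, Callable[[bytes], Any]],
    integrate: Callable[[Dict[str, Any]], Any],
    preprocess: Optional[Callable[[bytes], bytes]] = None,
) -> Any:
    """Optionally preprocess the audio, send it to every server in
    parallel, then hand all results to the integration function.
    Assumes `servers` is non-empty."""
    if preprocess is not None:  # e.g. noise suppression; may be skipped
        audio = preprocess(audio)
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = {name: pool.submit(rec, audio) for name, rec in servers.items()}
        results = {name: future.result() for name, future in futures.items()}
    return integrate(results)


# Illustrative stand-ins: the dedicated server reports no result (None).
demo_servers = {
    "dedicated": lambda audio: None,
    "general_1": lambda audio: "play music",
}
demo_result = relay_recognize(
    b"raw-audio",
    demo_servers,
    integrate=lambda r: r["dedicated"] or r["general_1"],
)
```

Here `integrate` stands in for the recognition result integration unit 122, and the value it returns is what would be sent back to the user terminal.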

FIG. 2 is a diagram showing the processing steps, in the configuration of FIG. 1, from the user dictionary data to the creation of the result integration parameters. First, the user dictionary data is sent as is to the dedicated speech recognition server. The dedicated server tunes its speech recognition engine so that the received phrases become recognition targets. Consequently, when speech data of a phrase not in the user dictionary is sent, the dedicated server returns either a wrong result or a result indicating that recognition was not possible. The user dictionary data is also sent to the speech synthesis unit, where synthesized speech data is created. Normally one synthesized utterance is created per phrase. However, if the speech synthesis unit allows the speaker, speaking rate, pitch, and so on to be selected, varying them to create several synthesized utterances for the same phrase can further improve the integration method learning performed in the later stage.

The synthesized speech data obtained in this way is sent to each general-purpose speech recognition server and to the dedicated speech recognition server, and these servers return recognition results. In some cases a reliability score is returned together with each recognition result. Based on these, the integration method learning unit learns the integration method and stores the result in the result integration parameters.

FIG. 3 is a diagram showing an example of the simplest form of the result integration parameters. This example assumes a single general-purpose speech recognition server and records, with ○ and ×, only whether that server recognized each phrase in the user dictionary correctly. That is, the figure shows that the phrases “Ichiro Suzuki” and “Jiro Yamada” were recognized correctly by the general-purpose server, while the others were not. FIG. 4 is an example of the same learning performed with three general-purpose speech recognition servers.

FIG. 5 shows the processing steps when recognition is actually performed using results such as those shown in FIG. 3 and FIG. 4. The input speech data is first preprocessed by the signal processing unit. A typical example of this processing is noise suppression as described in Patent Document 1. The signal processing unit normally produces one piece of speech data for each piece of input speech data, but it may produce several pieces under different settings. In that case, the processing described below is repeated for each piece of speech data. When processing in the signal processing unit is considered unnecessary, the input speech data is used as is as the output of the signal processing unit.

The output data of the signal processing unit is sent to the general-purpose speech recognition servers and the dedicated speech recognition server, and all of their results are sent to the recognition result integration unit. The recognition result integration unit first checks the result of the dedicated server. If that result is “no recognition result”, the final recognition result is determined from the results of the general-purpose servers alone. That is, if there is only one general-purpose server, its result is adopted as is. If there are several, a majority vote is taken among their results. If each recognition server provides a reliability score, the vote can be weighted by that score. The performance of each recognition server can also be estimated in advance and used as a weighting coefficient. For this kind of integration of the results of several speech recognition servers for common phrases, known techniques such as the one described in Patent Document 2 can be used.
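The fallback case above, where the dedicated server reports no result and the general-purpose results are combined by an optionally weighted majority vote, can be sketched in Python as follows. The function name `weighted_majority` and the example phrases and weights are hypothetical. Breaking ties in favor of the first result encountered is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, Optional


def weighted_majority(
    results: Dict[str, str],
    weights: Optional[Dict[str, float]] = None,
) -> str:
    """Return the text with the largest total vote.

    results: server name -> recognized text.
    weights: optional server name -> weight (for example a reliability
             score or a performance estimate); each vote counts 1.0
             when no weights are given.
    """
    totals: Dict[str, float] = defaultdict(float)
    for server, text in results.items():
        totals[text] += 1.0 if weights is None else weights.get(server, 1.0)
    # max() keeps the first maximal key, so ties go to the earliest result.
    return max(totals, key=totals.get)


general_results = {
    "general_1": "Tokyo Station",
    "general_2": "Tokyo Station",
    "general_3": "Kyoto Station",
}
plain_winner = weighted_majority(general_results)
weighted_winner = weighted_majority(
    general_results, {"general_1": 0.2, "general_2": 0.2, "general_3": 0.9}
)
```

With equal votes the two matching results win, while the weighted call lets a single highly trusted server outweigh two less trusted ones.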

On the other hand, when the dedicated speech recognition server returns a phrase contained in the user dictionary data, result integration parameters such as those shown in FIG. 3 or FIG. 4 are consulted. For example, in the case of FIG. 3, if the recognition result of the dedicated server is “Hitachi Taro”, the corresponding row of the result integration parameters shows that the general-purpose server should not be able to recognize this phrase, so the result of the dedicated server is adopted as is. If instead the dedicated server's result is “Ichiro Suzuki”, the corresponding row shows that this phrase can also be recognized by the general-purpose speech recognition server, so the result of the general-purpose server is checked next. If that result is also “Ichiro Suzuki”, then “Ichiro Suzuki” is simply taken as the final recognition result. If not, either the general-purpose server's result is given priority, on the grounds that its performance is generally higher, or whichever of the two results has the higher reliability score is adopted as the final result. In this way, even if a word that sounds similar to “Ichiro Suzuki” has been misrecognized by the dedicated server, the misrecognition can be rejected based on the general-purpose result. The same applies to the example of FIG. 4: for “Hitachi Taro”, the result of the dedicated server is adopted unconditionally. “Ichiro Suzuki” can be recognized by all three general-purpose servers, so the final result is decided by a majority vote among their results, or by a majority vote that also includes the dedicated server. If the dedicated server's result is “Jiro Yamada”, general-purpose server 1 is the only one that may recognize it correctly, so the final result is obtained by applying the same processing as in the FIG. 3 example to that server and the dedicated server.
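The decision procedure above can be sketched in Python as follows. The table rows are the three phrases discussed in the text for the three-server case. The names `CAPABILITY`, `integrate`, and `majority` are hypothetical. Where the text offers alternatives, this sketch picks one: it prefers the general-purpose result on disagreement and votes only among the capable general-purpose servers.

```python
from collections import Counter
from typing import Dict, List, Optional

# True means the general-purpose server recognized the phrase in pre-learning.
CAPABILITY: Dict[str, Dict[str, bool]] = {
    "Hitachi Taro":  {"general_1": False, "general_2": False, "general_3": False},
    "Ichiro Suzuki": {"general_1": True,  "general_2": True,  "general_3": True},
    "Jiro Yamada":   {"general_1": True,  "general_2": False, "general_3": False},
}


def majority(votes: List[str]) -> str:
    """Most frequent vote; ties go to the earliest vote (an assumption)."""
    return Counter(votes).most_common(1)[0][0]


def integrate(
    dedicated: Optional[str],
    general: Dict[str, str],
    table: Dict[str, Dict[str, bool]] = CAPABILITY,
) -> str:
    # No dedicated result: decide from the general-purpose servers alone.
    if dedicated is None or dedicated not in table:
        return majority(list(general.values()))
    capable = [s for s, ok in table[dedicated].items() if ok and s in general]
    # No general-purpose server can recognize this phrase: trust the dedicated one.
    if not capable:
        return dedicated
    # One capable server: take its result (it equals the dedicated result when
    # they agree, and is preferred over it when they disagree).
    if len(capable) == 1:
        return general[capable[0]]
    # Several capable servers: majority vote among them.
    return majority([general[s] for s in capable])
```

For example, `integrate("Hitachi Taro", {...})` returns the dedicated result regardless of what the general-purpose servers say, while a dedicated result of “Ichiro Suzuki” is overruled if two of the three general-purpose servers agree on a different phrase.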

FIG. 6 shows another implementation of the result integration parameters, different from FIG. 3 and FIG. 4. Here, when a phrase can be recognized by a general-purpose speech recognition server, the probability that the phrase is recognized correctly is converted to a numerical weight and stored. This probability can be estimated, for example, by recognizing synthesized speech for the phrase “Ichiro Suzuki” generated with variously changed speech synthesis parameters and counting how many of the recognition results are correct. When a general-purpose server is specified to return several recognition candidates, the average rank or average reliability score of the correct word can also be used. These values are converted to weights by a suitable non-linear transformation and stored in the result integration parameters. In this example, if the dedicated server's result is “Ichiro Suzuki”, general-purpose server 1 returns “Ichiro Sasaki”, and general-purpose servers 2 and 3 return “Ichiro Suzuki”, then the weight of “Ichiro Sasaki” is 3.0 and the weight of “Ichiro Suzuki” is 1.4 + 1.2 = 2.6. Because the former is larger, “Ichiro Sasaki” becomes the final recognition result.
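The weighted comparison in this example can be reproduced with a short Python sketch. The weights 3.0, 1.4, and 1.2 for the “Ichiro Suzuki” row are the ones quoted in the text. The names `WEIGHTS` and `weighted_vote` are hypothetical. As in the worked example, only the general-purpose results are summed, which is an assumption about how the dedicated result enters the vote.

```python
from typing import Dict, Tuple

# Per-phrase weights for each general-purpose server (row taken from the text).
WEIGHTS: Dict[str, Dict[str, float]] = {
    "Ichiro Suzuki": {"general_1": 3.0, "general_2": 1.4, "general_3": 1.2},
}


def weighted_vote(
    dedicated: str,
    general: Dict[str, str],
    weights: Dict[str, Dict[str, float]] = WEIGHTS,
) -> Tuple[str, Dict[str, float]]:
    """Sum, per recognized text, the weights of the servers that returned it,
    using the weight row selected by the dedicated server's result."""
    row = weights[dedicated]
    totals: Dict[str, float] = {}
    for server, text in general.items():
        totals[text] = totals.get(text, 0.0) + row.get(server, 0.0)
    return max(totals, key=totals.get), totals


winner, totals = weighted_vote(
    "Ichiro Suzuki",
    {
        "general_1": "Ichiro Sasaki",
        "general_2": "Ichiro Suzuki",
        "general_3": "Ichiro Suzuki",
    },
)
# "Ichiro Sasaki" totals 3.0 and "Ichiro Suzuki" totals about 2.6,
# so the winner is "Ichiro Sasaki".
```

The sketch shows why a single server with a high weight can overrule two agreeing servers with lower weights.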

FIG. 7 shows yet another implementation of the result integration parameters, different from those in FIGS. 3, 4, and 6. Here, when a phrase in the user dictionary data is submitted to a general-purpose speech recognition server and is not recognized correctly, the recognition result obtained at that time is nevertheless stored as a result integration parameter. The weight of each server is set as in the example of FIG. 6. When the experiment is run multiple times, either only the most frequent result or several recognition results may be stored. Regardless of the number of runs, second-ranked and lower recognition candidates may also be stored. At recognition time, as in the earlier examples, the result integration parameters are looked up using the recognition result of the dedicated speech recognition server, and the recognition result of each general-purpose speech recognition server is checked against the results stored in the parameters. For example, suppose the dedicated speech recognition server returns “Hitachi Taro”, general-purpose server 1 returns “Hitachi City”, general-purpose server 2 returns “20 years old”, and general-purpose server 3 returns “Hitachi”. The result of general-purpose server 1 is converted to “Hitachi Taro”, a majority vote is taken over the recognition results, and “Hitachi Taro” is finally selected.
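A minimal sketch of this correction step follows. It is illustrative only: the stored table contains just the one confusion implied by the text (server 1 returning “Hitachi City” for “Hitachi Taro”), entries for other servers are omitted, the per-server weights are ignored, and counting the dedicated server's result in the vote is an assumption. The names `KNOWN_MISRECOGNITIONS` and `integrate_with_known_errors` are hypothetical.

```python
from collections import Counter

# Hypothetical stored parameters: wrong outputs that each general-purpose
# server produced for a user-dictionary phrase during the advance tests.
KNOWN_MISRECOGNITIONS = {
    "Hitachi Taro": {"general1": {"Hitachi City"}},
}


def integrate_with_known_errors(dedicated_result, general_results,
                                known=KNOWN_MISRECOGNITIONS):
    """Map known misrecognitions back to the intended phrase, then vote."""
    per_server = known.get(dedicated_result, {})
    votes = [dedicated_result]  # assumption: the dedicated result also votes
    for server, result in general_results.items():
        if result in per_server.get(server, set()):
            # This server is known to output `result` when the user actually
            # said `dedicated_result`, so count it as the intended phrase.
            result = dedicated_result
        votes.append(result)
    return Counter(votes).most_common(1)[0][0]
```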

FIG. 8 shows an example of a speech recognition result integration scheme that uses detection of homophones with different written forms. As shown in the figure, when the recognition result of the dedicated speech recognition server is “左藤一郎” (read “Sato Ichiro”), it is compared with each recognition result of the general-purpose speech recognition servers to check whether any of them is a homophone written differently. To estimate the pronunciation from the written form in Japanese, the readings of individual kanji are held as data, and the readings of the kanji that make up the phrase are concatenated to obtain a phonetic transcription. In English, rules that assign readings to partial spellings are held and applied in sequence to obtain a phonetic transcription. For other languages as well, it is well known that a phonetic transcription can be obtained with the technique generally called grapheme-to-phoneme conversion. The user dictionary data may also contain pronunciation information, such as paired kanji and kana forms, in which case that information is used. If the check finds a homophone with a different written form, the written form of that recognition result is converted to the written form returned by the dedicated speech recognition server. In the figure's example, the result “佐藤一郎” from general-purpose speech recognition server 1 has the same pronunciation as the dedicated server's result, so it is converted to “左藤一郎”. As a result, the majority vote among the three general-purpose speech recognition servers yields “左藤一郎”, which is adopted as the final result.
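The homophone check can be sketched as below, assuming a toy table with one reading per kanji. Real kanji often have several readings, so a production system would need a full reading dictionary or a grapheme-to-phoneme tool; the table `READINGS`, the function names, and the third server's result in the usage note are illustrative assumptions.

```python
from collections import Counter

# Toy per-character readings (assumption: exactly one reading per kanji).
READINGS = {"左": "さ", "佐": "さ", "加": "か", "藤": "とう", "一": "いち", "郎": "ろう"}


def reading(text, table=READINGS):
    """Concatenate per-character readings; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def unify_homophones(dedicated_result, general_results):
    """Rewrite results that sound like the dedicated result to its spelling."""
    target = reading(dedicated_result)
    return [dedicated_result if reading(r) == target else r
            for r in general_results]


def majority(results):
    """Return the most frequent result (first seen wins a tie)."""
    return Counter(results).most_common(1)[0][0]
```

For instance, with a dedicated result of “左藤一郎” and hypothetical general-purpose results “佐藤一郎”, “左藤一郎”, and “加藤一郎”, the first is rewritten to “左藤一郎” and the majority vote then selects “左藤一郎”.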

FIG. 9 shows examples of specific implementations of the user terminal, taking as an example the provision of in-vehicle functions such as navigation and hands-free calling. In (a), all functions, including the microphone device 904, the application 906, and the communication module 908, are implemented in the car navigation device 902. In (b), the car navigation device 902 and a smartphone 910 are linked; the microphone device 904 of the car navigation device 902 and the communication module 908 of the smartphone 910 are used. The applications 912 and 914 are either distributed between the car navigation device and the smartphone according to their functions, or placed on only one of them. In (c), all functions are implemented in the smartphone 910.

FIG. 10 shows an example of how the user dictionary 124 of the present invention is created. For example, when an address book 1002 exists in the user terminal 102, the personal names it contains are registered in the user dictionary. Similarly, when a music player's track list 1004 exists, the track titles and artist names it contains are registered in the user dictionary. Page titles registered as web browser bookmarks 1006 can also be registered. In addition, data such as mail 1008 and short messages stored in the user terminal can be analyzed and the phrases that appear frequently in them registered in the user dictionary. When the user terminal first connects to the system of the present invention, all user dictionary data in the terminal is sent to the system. Thereafter, whenever a new entry is added to the address book, track list, or the like, a scheme can be adopted in which only the newly added data is sent to the system, prompting an update of the result integration parameters. In that case, not only the result integration parameters but also the matching dictionary of the dedicated speech recognition unit must be updated at the same time.
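The dictionary-building and incremental-update steps can be sketched as follows. The function names, the argument shapes, and the `min_count` threshold are assumptions, and whitespace splitting is only a placeholder for the text analysis mentioned above (Japanese text would need a morphological analyzer).

```python
from collections import Counter


def build_user_dictionary(contacts, tracks, bookmarks, messages, min_count=3):
    """Collect phrases from the terminal's data sources into one set."""
    phrases = set(contacts)                 # personal names in the address book
    for title, artist in tracks:            # track titles and artist names
        phrases.update((title, artist))
    phrases.update(bookmarks)               # bookmarked page titles
    # Frequent words in stored messages (simplified whitespace tokenization).
    counts = Counter(word for msg in messages for word in msg.split())
    phrases.update(w for w, c in counts.items() if c >= min_count)
    return phrases


def new_entries(current, already_sent):
    """Return only phrases not yet sent to the integration system."""
    return sorted(set(current) - set(already_sent))
```

On first connection the whole set would be transmitted; afterwards only the output of `new_entries` needs to be sent, after which both the result integration parameters and the dedicated recognizer's dictionary are refreshed.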

FIG. 11 shows an example of a special configuration, tailored to the present invention, obtained by modifying the configuration of a typical speech synthesis unit. In general, the speech synthesis unit 114 consists of a synthesized speech creation unit 1102 and speech segment data 1106 to 1110. Strictly speaking, “segment data” is the name for data used in methods that create synthesized speech by concatenating data directly. However, methods that synthesize waveforms through statistical processing and signal processing instead of direct concatenation also use a similar data set for each processing unit, such as a phoneme or syllable, so the scheme described below applies to them as well. The synthesized speech creation unit 1102 concatenates speech segment data, applies appropriate signal processing where necessary, and produces standard synthesized speech. In the present invention, however, it is important to know how each general-purpose speech recognition server responds to the voice of the specific user who owns the user terminal, so the synthesized speech produced by the speech synthesis unit should preferably resemble that user's voice. Therefore, each time the user uses the speech recognition function, or uses another voice function or a voice call, the voice is accumulated as user voice data 1112, and the voice conversion unit 1104 uses this data to convert the standard synthesized speech into user-adapted speech. Feeding the converted speech to the general-purpose speech recognition server group enables more accurate performance prediction, and the values of the result integration parameters can be expected to become more appropriate.

FIG. 12 shows an example of result integration parameters for the case where response speed, in addition to recognition correctness, is used as an evaluation criterion. In this example, recognition is run on synthesized speech for each phrase in the user dictionary data, and the average processing time is stored as a parameter. Here, if the dedicated speech recognition server returns “Ichiro Suzuki”, the result from general-purpose server 2 is expected within 0.5 seconds, whereas obtaining the result from general-purpose server 1 requires waiting 1.5 seconds. If that response time exceeds the upper limit assumed by the application, result integration is performed as soon as the result from general-purpose server 2 is obtained. Assuming the integration itself takes almost no time, the final recognition result is then available with a response time of about 0.5 seconds, which improves convenience for the user.
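The time-based filtering can be sketched as follows. The 0.5 s and 1.5 s figures are from the text; the table layout, the server names, and the function `servers_to_wait_for` are assumptions, and treating a time exactly equal to the limit as acceptable is a choice made here.

```python
# Average response times in seconds measured in advance with synthesized
# speech. The two values are from the text; the layout is an assumption.
EXPECTED_SECONDS = {"Ichiro Suzuki": {"general1": 1.5, "general2": 0.5}}


def servers_to_wait_for(dedicated_result, limit_seconds, table=EXPECTED_SECONDS):
    """List the servers whose expected response time fits the application limit."""
    times = table.get(dedicated_result, {})
    return [server for server, seconds in times.items()
            if seconds <= limit_seconds]
```

With a limit of 1.0 s, only `general2` remains, so integration can start as soon as its result arrives instead of waiting the full 1.5 s for `general1`.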

FIG. 13 shows a configuration that achieves functions equivalent to the example of FIG. 1 using a dedicated speech recognition unit 108 built into the user terminal. Here, the user terminal 102 recognizes phrases contained in the user dictionary 124 with its internal dedicated speech recognition unit 108, without going through the relay server 104. The method of evaluating the performance of the general-purpose speech recognition server group 106 in advance using the user dictionary data is the same as in the case of FIG. 1. At recognition time, recognition by the general-purpose speech recognition servers 106 is run via the relay server 104 while, at the same time, recognition is run by the dedicated speech recognition unit 108 in the user terminal. A scheme that combines a speech recognition unit in the terminal with one connected through a communication device is also shown in Patent Document 3. However, the invention of Patent Document 3 selects among results according to whether a communication path has been established, whereas the present invention differs in that it uses result integration parameters derived from the results of speech recognition performed in advance.

FIG. 14 shows another configuration example of the speech recognition server integration device according to the present invention. This example assumes that, as a function of the general-purpose speech recognition server group 106, the list of recognition phrases used by the servers can be obtained. Under that condition, the user dictionary data sent from the user terminal 102 to the relay server 104 is passed to the phrase comparison / similarity estimation unit 126. That unit compares the recognition phrase lists obtained from the general-purpose speech recognition server group 106 with the user dictionary data and determines whether each phrase in the user dictionary 124 can be recognized correctly by each server. The determination results are sent to the integration method learning unit 116, organized into parameters, and held in the result integration parameters 118. Meanwhile, as in the example of FIG. 1, the user dictionary data is sent as-is to the dedicated speech recognition server 108, which is tuned accordingly.

Once these preparations are complete and input voice data arrives from the user terminal 102, the data is sent via the signal processing unit 120 to the general-purpose speech recognition servers 106 and the dedicated speech recognition server 108, as in the example of FIG. 1. The recognition results returned by those servers are sent to the recognition result integration unit 122, where the optimal recognition result is selected by comparison with the result integration parameters 118. The processing after the selected recognition result is sent to the user terminal 102 is the same as in the example of FIG. 1.

FIG. 15 shows the procedure, in the configuration of FIG. 14, up to the creation of result integration parameters from the user dictionary data. In this example, no synthesized speech is created and no trial recognition is run; the recognition phrase list is simply obtained from each general-purpose speech recognition server. These lists are compared with the phrases in the user dictionary data, and a record is made of which servers' phrase lists contain each phrase. Because a phrase can only be either contained in a recognition phrase list (○) or not (×), the resulting result integration parameters are the same as those of FIG. 3 or FIG. 4, and they are used at recognition time in the same way as in the examples above. When a language model expressing how readily each phrase is recognized can be obtained from each general-purpose speech recognition server in addition to the phrase list, weighted result integration parameters like those of FIG. 6 can also be created. For example, with an N-gram language model, which is a typical language model, the unigram value of a word can be taken as its ease of recognition, or the maximum of its bigram or trigram values can be used.
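The list comparison can be sketched as follows, assuming each server's recognition phrase list is available as a set and, optionally, its unigram probabilities as a dictionary. The function names are hypothetical, and using the raw unigram probability directly as the weight is an assumption, since the text does not specify the mapping.

```python
def build_table(user_phrases, server_word_lists):
    """For each user phrase, record which servers list it (True = ○, False = ×)."""
    return {
        phrase: {server: phrase in words
                 for server, words in server_word_lists.items()}
        for phrase in user_phrases
    }


def unigram_weights(user_phrases, server_unigrams):
    """Weighted variant: use each server's unigram probability as the weight."""
    return {
        phrase: {server: probs.get(phrase, 0.0)
                 for server, probs in server_unigrams.items()}
        for phrase in user_phrases
    }
```

The boolean table has the same shape as the ○/× parameters of FIG. 3 or FIG. 4, and the weighted table plays the role of the FIG. 6 parameters.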

FIG. 16 shows a configuration that achieves functions equivalent to the example of FIG. 1 with a single device that incorporates both the user input/output functions and the speech recognition server integration function. Here, the user dictionary data contained in the user dictionary 124 stored inside the speech recognition server integration device 104 is transferred to the speech synthesis unit 114 and the recognition server communication unit 112 within the device. The user's speech is captured by the microphone device 128 and transferred to the signal processing unit 120. Processing then proceeds as described for the example of FIG. 1, and the recognition result integration unit 122 determines the recognition result. This recognition result is transferred to the display unit 132 in the device and presented to the user.

FIG. 17 shows a configuration, based on the example of FIG. 16, in which the functions of the dedicated speech recognition server are also built into the speech recognition server integration device. As in the example of FIG. 16, input speech is captured by the microphone device 128 in the speech recognition server integration device 104, and user dictionary data is transferred from the user dictionary 124. In addition, a dedicated speech recognition unit 108 is built into the device; it reads the contents of the user dictionary directly and recognizes the voice data sent from the microphone device. The standalone recognition result obtained there is sent to the recognition result integration unit 122 and integrated with the recognition results obtained by the general-purpose speech recognition server group 106. The integrated recognition result is sent to the applications 130 in the device, where it is used according to the purpose of each application.

The present invention can be used as a voice data relay device that sits between an in-vehicle terminal and speech recognition servers to provide a highly accurate speech recognition function.

102 User terminal
104 Relay server
106 General-purpose speech recognition server group
108 Dedicated speech recognition server
110 Terminal device communication unit
112 Recognition server communication unit
114 Speech synthesis unit
116 Integration method learning unit
118 Result integration parameters
120 Signal processing unit
122 Recognition result integration unit
124 User dictionary
126 Phrase comparison / similarity estimation unit
128 Microphone device
130 Application
132 Display unit

Claims

A speech recognition server integration device that relays between a terminal device with which a user performs operations by voice and speech recognition servers that recognize voice data and return the results, the device comprising:
An integration method learning unit that learns and stores recognition result integration parameters based on a list of phrases registered by the user or frequently used by the user;
Means for receiving, from the terminal device, voice data that the user intends to have recognized;
Means for transmitting the received voice data to a general-purpose speech recognition server and a dedicated speech recognition server;
Means for receiving recognition results for the voice data from the general-purpose speech recognition server and the dedicated speech recognition server;
A recognition result integration unit that compares the recognition results of the general-purpose speech recognition server and the dedicated speech recognition server with the stored recognition result integration parameters and selects an optimal recognition result; and
Means for transmitting the selected recognition result to the terminal device.

The speech recognition server integration device according to claim 1, further comprising:
Means for receiving a list of phrases registered by the user or frequently used by the user from the terminal device;
A speech synthesizer that generates synthesized speech based on the received phrases;
Means for transmitting the generated synthesized speech to the general-purpose speech recognition server and the dedicated speech recognition server;
Means for receiving recognition results of the synthesized speech by the general-purpose speech recognition server and the dedicated speech recognition server;
Wherein the integration method learning unit analyzes the phrases on which the synthesized speech is based together with the recognition results, and learns and stores the recognition result integration parameters.

The speech recognition server integration device according to claim 1, further comprising:
Means for receiving a list of phrases registered by the user or frequently used by the user from the terminal device;
Means for receiving a word list for recognition from the general-purpose speech recognition server;
A phrase comparison / similarity estimation unit that compares the phrase list for recognition with the phrase list received from the terminal device and estimates similarity,
Wherein the integration method learning unit stores the estimation result as the recognition result integration parameters.

A speech recognition server integration device with which a user performs operations by voice, the device comprising:
An integration method learning unit that learns and stores recognition result integration parameters based on a list of phrases registered by the user or frequently used by the user;
Means for transmitting voice data that the user intends to have recognized to a general-purpose speech recognition server and a dedicated speech recognition server;
Means for receiving recognition results for the voice data from the general-purpose speech recognition server and the dedicated speech recognition server;
A recognition result integration unit that compares the recognition results of the general-purpose speech recognition server and the dedicated speech recognition server with the stored recognition result integration parameters and selects an optimal recognition result; and
A display unit that displays the selected recognition result.

The speech recognition server integration device according to claim 4, further comprising:
A user dictionary for storing words registered by the user or frequently used by the user;
A speech synthesizer that generates synthesized speech based on words stored in the user dictionary;
Means for transmitting the generated synthesized speech to the general-purpose speech recognition server and the dedicated speech recognition server;
Means for receiving recognition results of the synthesized speech by the general-purpose speech recognition server and the dedicated speech recognition server;
Wherein the integration method learning unit analyzes the phrases on which the synthesized speech is based together with the recognition results, and learns and stores the recognition result integration parameters.

The speech recognition server integration device according to claim 4, further comprising:
A user dictionary that stores a list of words registered by the user or frequently used by the user;
Means for receiving a word list for recognition from the general-purpose speech recognition server;
A phrase comparison / similarity estimation unit that compares the phrase list for recognition with the phrase list of the user dictionary and estimates similarity,
Wherein the integration method learning unit stores the estimation result as the recognition result integration parameters.

In the speech recognition server integration device according to any one of claims 1 to 6,
The dedicated speech recognition server creates a recognition target phrase list based on a list of phrases registered by the user or frequently used by the user, and is able to recognize the phrases included in that list with high accuracy.

In the speech recognition server integration device according to any one of claims 1 to 6,
The dedicated speech recognition server is incorporated in the speech recognition server integration device or the terminal device as a dedicated speech recognition unit.

In the speech recognition server integration device according to claim 2 or 5,
The recognition result integration parameter stores the correctness of the recognition result of the general-purpose speech recognition server for a word registered by the user or a word frequently used by the user,
The recognition result integration unit, based on the recognition result of the dedicated speech recognition server, retrieves from the recognition result integration parameters the correctness of the general-purpose speech recognition server's recognition results for that recognition result, and selects an optimal recognition result based on the speech recognition results of the general-purpose speech recognition servers for which the retrieved correctness is positive.

In the speech recognition server integration device according to claim 2 or 5,
The recognition result integration parameters store whether the recognition result of the general-purpose speech recognition server is correct for each phrase registered by the user or frequently used by the user, together with a value representing the reliability of the general-purpose speech recognition server's recognition result for each phrase,
The recognition result integration unit, based on the recognition result of the dedicated speech recognition server, retrieves from the recognition result integration parameters the correctness of the general-purpose speech recognition server's recognition results for that recognition result and their reliability, and selects an optimal recognition result based on the result of integrating, with reliability weighting, the speech recognition results of the general-purpose speech recognition servers for which the retrieved correctness is positive.

In the speech recognition server integration device according to claim 2 or 5,
The recognition result integration parameters store measured values of the time taken for recognition by the dedicated and general-purpose speech recognition servers for each phrase registered by the user or frequently used by the user,
The recognition result integration unit, based on the recognition result of the dedicated speech recognition server, retrieves from the recognition result integration parameters the recognition times of the dedicated and general-purpose speech recognition servers for that recognition result, extracts, from among the dedicated and general-purpose speech recognition servers, the recognition results whose recognition time is below an upper limit determined by the application, and selects an optimal recognition result based on the extracted recognition results.

In the speech recognition server integration device according to claim 2 or 5,
The recognition result integration parameters store recognition results, including misrecognition results, of the general-purpose speech recognition server for each phrase registered by the user or frequently used by the user,
The recognition result integration unit, based on the recognition result of the dedicated speech recognition server, retrieves from the recognition result integration parameters the recognition results, including misrecognition results, of the general-purpose speech recognition server for that recognition result; when the speech recognition result of a general-purpose speech recognition server matches a retrieved misrecognition result, converts it into the correct phrase; and selects an optimal recognition result by a majority vote over the recognition results.

A speech recognition server integration method comprising:
Learning and storing recognition result integration parameters based on a list of phrases registered by a user or frequently used by the user;
Transmitting voice data that the user intends to have recognized to a general-purpose speech recognition server and a dedicated speech recognition server;
Receiving recognition results for the voice data from the general-purpose speech recognition server and the dedicated speech recognition server; and
Comparing the recognition result of the general-purpose speech recognition server and the recognition result of the dedicated speech recognition server with the recognition result integration parameters, and selecting an optimal speech recognition result.

The speech recognition server integration method according to claim 13, further comprising:
Generating synthesized speech based on words registered by the user or frequently used by the user;
Transmitting the generated synthesized speech to the general-purpose speech recognition server and a dedicated speech recognition server;
Receiving a recognition result of the synthesized speech by the general-purpose speech recognition server and a dedicated speech recognition server,
Wherein the step of learning and storing the recognition result integration parameters analyzes the phrases on which the synthesized speech is based together with the recognition results, and learns and stores the recognition result integration parameters.

The speech recognition server integration method according to claim 13, further comprising:
Obtaining a list of phrases registered by the user or frequently used by the user;
Receiving a recognition word list from the general-purpose speech recognition server;
Comparing the recognition word list with a word registered by the user or a list of words frequently used by the user, and estimating a similarity,
Wherein the step of learning and storing the recognition result integration parameters stores the estimation result as the recognition result integration parameters.