JP2008250236A - Speech recognition device and speech recognition method

Info

Publication number: JP2008250236A
Application number: JP2007094855A
Authority: JP
Inventors: Masahiko Kubo; 雅彦久保; Toshio Kitahara; 俊夫北原
Original assignee: Denso Ten Ltd
Current assignee: Denso Ten Ltd
Priority date: 2007-03-30
Filing date: 2007-03-30
Publication date: 2008-10-16

Abstract

PROBLEM TO BE SOLVED: To improve speech recognition precision by recognizing the utterance target with high precision and selectively performing speech recognition on utterances directed to an onboard device.

SOLUTION: When a passenger detector 26 detects the presence of a passenger using the output of a pressure sensor 45 installed in a seat, the driver's speech, an image, and vibrations of the driver's seat while the driver is conversing with the passenger are obtained and stored as profile data in a database 24. After a speech input detector 25 detects that the driver has operated a talk switch 44, the state of the driver is compared with the profile data to determine whether the driver is speaking to the passenger or to the onboard system, and speech recognition is selectively performed on speech input directed to the onboard system.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a speech recognition device and a speech recognition method that perform input processing using speech recognition, and more particularly to an in-vehicle speech recognition device and speech recognition method.

In recent years, various approaches have been devised to realize technology that recognizes a user's voice. If the user's voice can be recognized, the user can operate various devices by voice. For in-vehicle devices in particular, there is concern that manual operation by the driver affects driving, so practical voice operation technology is eagerly awaited.

Speech recognition sometimes requires recognizing who is speaking and to whom. For example, Patent Document 1 discloses a technique that identifies the sound source with a microphone array and detects the direction of the face to identify the conversation partner.

Patent Document 2 discloses a technique for an in-vehicle device that accepts voice input, in which the silent time between utterances is measured and, based on the measured silent time, it is determined whether or not the speech is voice input to the in-vehicle device.

JP 2006-211156 A
JP 2003-308079 A

However, a method that uses the silent time between utterances, as in the prior art described above, cannot perform speech recognition until the silent time has been measured. In addition, when sufficient silent time cannot be obtained, for example because music data is being played back, it is difficult to determine whether or not a sound is an utterance.

It has therefore been an important challenge to realize a speech recognition device and a speech recognition method that can determine with high accuracy, at the moment the user speaks, whether the utterance is voice input to the in-vehicle device or conversation with a passenger, that is, the target of the utterance.

The present invention has been made to eliminate the problems of the prior art described above. Its object is to provide a speech recognition device and a speech recognition method that improve speech recognition accuracy by recognizing the utterance target with high accuracy and selectively performing speech recognition on utterances directed to the in-vehicle device.

To solve the problems described above and achieve the object, the speech recognition device and speech recognition method according to the present invention create profile data from the state in which the driver is conversing with a passenger, and refer to the profile data during an utterance to determine whether the utterance is conversation with the passenger or voice operation input.

According to the present invention, the speech recognition device and speech recognition method distinguish between the case where the driver converses with a passenger and the case where the driver performs voice input to the in-vehicle device, and perform speech recognition only when voice input is being performed. This yields a speech recognition device and a speech recognition method with improved speech recognition accuracy.

Preferred embodiments of the speech recognition device and speech recognition method according to the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic configuration diagram of an in-vehicle system equipped with a speech recognition device 20 according to an embodiment of the present invention. The in-vehicle system 10 shown in the figure includes the speech recognition device 20, an in-vehicle integrated machine 30, a microphone 41, a camera 42, a vibration sensor 43, a talk switch 44, a pressure sensor 45, a display 51, and a speaker 52.

The in-vehicle integrated machine 30 is a device that provides a navigation function for setting a route to a destination and giving route guidance, and an audio-visual function for receiving radio and television and playing back music and video stored on a recording medium. It contains a device operation unit 31, a recording medium 32, a navigation processing unit 33, an AV processing unit 34, and an output processing unit 35.

The device operation unit 31 is a processing unit that accepts operation input from the user; the accepted operation is used to control the navigation processing unit 33 and the AV processing unit 34. The recording medium 32 is an HDD, DVD, CD, MD, or the like, and stores map data, music data, and video data. The navigation processing unit 33 implements the navigation function, and the AV processing unit 34 implements the audio-visual function. The output processing unit 35 controls the display output of the display 51 and the audio output of the speaker 52 based on the outputs of the navigation processing unit 33 and the AV processing unit 34.

The speech recognition device 20 functions as an operation means that controls the in-vehicle integrated machine 30 by recognizing the user's voice and inputting the result to the device operation unit 31.

Specifically, the speech recognition device 20 is connected to the microphone 41, the camera 42, the vibration sensor 43, the talk switch 44, and the pressure sensor 45. Internally, it has a buffer memory 21, a speech recognition unit 22, a profile creation unit 23, a database 24 that holds profile data, a voice input detection unit 25, a passenger detection unit 26, and a main control unit 27.

The microphone 41 is a sound collection means that picks up ambient sound, in particular the driver's voice. The camera 42 is an imaging means that captures an image of the driver's face, and the vibration sensor 43 is a detection means that detects vibration of the driver's seat. The audio data acquired by the microphone 41, the image data captured by the camera 42, and the vibration data acquired by the vibration sensor 43 are stored in the buffer memory 21.

The talk switch 44 is an operation means operated by the user and is used to request the start of speech recognition. After the talk switch 44 is operated, the speech recognition device 20 extracts utterances from the audio data collected by the microphone 41 and performs speech recognition on them.

However, not all audio data acquired after the talk switch 44 is operated is necessarily a voice operation instruction for the in-vehicle system 10. For example, the driver may start talking with a passenger after pressing the talk switch. If every utterance after the talk switch operation were simply regarded as voice input to the in-vehicle system 10 and subjected to speech recognition, words actually addressed to the passenger would also be treated as operation input, which would increase misrecognition.

The speech recognition device 20 therefore acquires in advance the driver's state when speaking to a passenger and creates profile data from it. Then, by comparing the driver's state after the talk switch operation with the profile data, it determines whether an utterance is conversation with the passenger or voice input to the in-vehicle system 10.

The driver states that can be used for creating profile data and for comparison with it include the loudness of the driver's voice and its variation, the pitch of the voice, the difference in level between the voice and noise sections, the characteristics of the voice at the start or end of an utterance, the speech rate, changes in the face in the image data, and vibration of the driver's seat.

If there is no passenger, all of the driver's utterances can be presumed to be voice input. The speech recognition device 20 therefore uses the passenger detection unit 26 to collect the outputs of the pressure sensors 45 built into the seat cushion and backrest of each seat, and detects whether a passenger is on board.

The main control unit 27 controls the speech recognition device 20 as a whole. If profile data has not yet been created, the main control unit 27 connects the buffer memory 21 to the profile creation unit 23 so that it acquires the driver's state during conversation with the passenger. The profile creation unit 23 creates profile data from the acquired information and stores it in the database 24. When the voice input detection unit 25 detects that the talk switch 44 has been operated, the main control unit 27 connects the buffer memory 21 to the speech recognition unit 22 so that it acquires the driver's state at the time of the utterance.

The speech recognition unit 22 detects the start and end of an utterance from the audio data and image data stored in the buffer memory 21, recognizes the content of the utterance by speech recognition, and inputs the result to the device operation unit 31.

At this time, the speech recognition unit 22 uses its internal utterance target determination unit 22a to determine whether an utterance in the audio data is conversation with the passenger or voice input to the in-vehicle system 10, and performs speech recognition only on utterances directed to the in-vehicle system 10. Specifically, the utterance target determination unit 22a determines the utterance target by comparing the driver's state at the time of speech recognition with the profile data.

FIG. 2 is a flowchart showing the processing operation of the speech recognition device 20. The processing shown in the figure is an example of the main flow that the speech recognition device 20 executes repeatedly.

First, the speech recognition device 20 determines whether profile data needs to be created (step S101). Specifically, it determines that profile data needs to be created when no profile data exists yet, or when the occupants of the vehicle have changed since the profile data was created. A change in the occupants can be detected from a change in the output of the pressure sensor 45 or from detection of a door opening or closing.

If profile data needs to be created (step S101, Yes), the profile creation unit 23 creates profile data and registers it in the database 24 (step S102). After this profile learning ends, or if profile data does not need to be created (step S101, No), the main control unit 27 checks whether the talk switch 44 has been operated (step S103). If the talk switch 44 has not been operated (step S103, No), the processing ends.

If the talk switch 44 has been operated (step S103, Yes), the speech recognition device 20 performs speech recognition with the speech recognition unit 22 (step S104), performs operation input to the in-vehicle integrated machine 30 using the recognition result (step S105), and ends the processing.
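The main flow of FIG. 2 (steps S101 to S105) can be sketched roughly as follows. This is only an illustrative outline: the class `Recognizer`, its fields, and the callables `learn_profile` and `recognize` are hypothetical names that do not appear in the specification, and the learning callable is assumed to return nothing when no passenger is present.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Recognizer:
    """Hypothetical stand-in for the speech recognition device 20."""
    profile: Optional[dict] = None      # profile data held in the database 24
    occupants_changed: bool = False     # set on a seat pressure or door event
    log: List[str] = field(default_factory=list)

    def needs_profile(self) -> bool:
        # Step S101: no profile yet, or occupants changed since it was created.
        return self.profile is None or self.occupants_changed

    def run_once(self,
                 talk_switch_pressed: bool,
                 learn_profile: Callable[[], Optional[dict]],
                 recognize: Callable[[], Optional[str]]) -> Optional[str]:
        """One pass of the repeatedly executed main flow."""
        if self.needs_profile():                 # S101, Yes
            new_profile = learn_profile()        # S102 (None if no passenger)
            if new_profile is not None:
                self.profile = new_profile
                self.occupants_changed = False
                self.log.append("S102")
        if not talk_switch_pressed:              # S103, No: end of this pass
            return None
        command = recognize()                    # S104
        if command is not None:
            self.log.append("S105")              # S105: pass to operation unit 31
        return command
```

Calling `run_once` in a loop mirrors the repeated execution described above; the profile is refreshed only when step S101 says it is needed.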

Next, the profile learning process shown in FIG. 2 is described in more detail with reference to FIG. 3. As shown in the figure, in the profile learning process the passenger detection unit 26 first determines, based on the output of the pressure sensor 45, whether a passenger is on board (step S201). If there is no passenger (step S201, No), the processing ends.

If there is a passenger (step S201, Yes), the profile creation unit 23 acquires audio data, image data, and the like from the buffer memory 21 (step S202), registers them as profile data (step S203), and ends the processing.

Next, specific examples of the speech recognition process shown in FIG. 2 are described. As already mentioned, the present invention determines the utterance target by comparing the following with the profile data: the loudness of the driver's voice and its variation, the pitch of the voice, the difference in level between the voice and noise sections, the characteristics of the voice at the start or end of an utterance, the speech rate, changes in the face in the image data, vibration of the driver's seat, and the like.

For example, when the loudness of the voice is used, as shown in FIG. 4, the utterance section of the input speech is detected first, and the average input level of the utterance section is then calculated. The result is compared with the profile data value calculated in the same way, and if it exceeds the profile data value by a threshold or more, the utterance is determined to be directed to the in-vehicle system. In other words, this method determines the utterance target on the premise that the driver speaks louder when performing voice input than when conversing with a passenger.
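A minimal sketch of this loudness comparison is shown below. It assumes that the utterance section has already been detected and that the input level is the mean absolute amplitude of the samples; the specification does not define the level measure, and the function names and threshold value are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def mean_level(samples: Sequence[float]) -> float:
    """Average absolute amplitude over an already detected utterance section."""
    return fmean([abs(s) for s in samples])


def is_directed_at_system(utterance: Sequence[float],
                          profile_mean_level: float,
                          threshold: float) -> bool:
    """True when the utterance exceeds the conversational profile by the threshold or more."""
    return mean_level(utterance) - profile_mean_level >= threshold
```

Here `profile_mean_level` would be the value computed the same way from speech recorded while the driver was conversing with a passenger.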

FIG. 5 shows an utterance target determination method that uses inflection in the loudness of the voice. First, the utterance section of the input speech is detected, the input level of the utterance section is divided into unit-time intervals, and the average is calculated for each interval. The spread (variance) of the results is then calculated and compared with the profile value calculated in the same way. If it exceeds the profile data value by a threshold or more, the utterance is determined to be directed to the in-vehicle system. In other words, this method determines the utterance target on the premise that the driver speaks in a more distinct tone when performing voice input than when conversing with a passenger.
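The per-interval averaging and variance step can be illustrated as follows. The frame length, the use of mean absolute amplitude as the level, and the population variance are assumptions made for this sketch; all names are hypothetical.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def frame_levels(samples: Sequence[float], frame_size: int) -> List[float]:
    """Mean absolute amplitude of each complete frame (unit-time interval)."""
    return [
        fmean([abs(s) for s in samples[i:i + frame_size]])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def level_variance(samples: Sequence[float], frame_size: int) -> float:
    """Spread of the per-frame levels across the utterance section."""
    return pvariance(frame_levels(samples, frame_size))


def is_directed_by_loudness_inflection(samples: Sequence[float],
                                       frame_size: int,
                                       profile_variance: float,
                                       threshold: float) -> bool:
    """True when the loudness varies more than in the profile by the threshold or more."""
    return level_variance(samples, frame_size) - profile_variance >= threshold
```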

FIG. 6 shows an utterance target determination method that uses inflection in the pitch of the voice. In this method, the utterance section of the input speech is detected first, the utterance section is Fourier-transformed into the frequency domain, and the fluctuation range along the frequency axis is calculated for the strongest frequency band. The result is compared with the profile value calculated in the same way, and if it exceeds the profile data value by a threshold or more, the utterance is determined to be directed to the in-vehicle system. This method also determines the utterance target on the premise that the driver speaks in a more distinct tone when performing voice input than when conversing with a passenger.
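One possible reading of this step is sketched below: the utterance section is split into frames, the strongest frequency bin of each frame is found with a plain discrete Fourier transform, and the fluctuation range is taken as the difference between the highest and lowest of these dominant frequencies. The per-frame interpretation, the frame size, and the function names are assumptions, not details given in the specification.

```python
import cmath
import math
from typing import Sequence


def dominant_frequency(frame: Sequence[float], sample_rate: float) -> float:
    """Frequency (Hz) of the strongest non-DC bin of a naive DFT."""
    n = len(frame)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip the DC bin, stay below Nyquist
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


def pitch_fluctuation(samples: Sequence[float],
                      sample_rate: float,
                      frame_size: int) -> float:
    """Range of the dominant frequency across complete frames."""
    freqs = [
        dominant_frequency(samples[i:i + frame_size], sample_rate)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return max(freqs) - min(freqs)
```

The naive DFT keeps the sketch dependency-free; a practical implementation would more likely use an FFT routine and a proper pitch tracker.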

FIG. 7 shows an utterance target determination method based on the difference in input level between the utterance section and the other sections. First, the utterance section of the input speech is detected, and the average input level is calculated separately for the utterance section and for the other sections. The difference between the two averages is then calculated, and if the difference exceeds the profile data value calculated in the same way by a threshold or more, the utterance is determined to be directed to the in-vehicle system. In other words, this method determines the utterance target on the premise that the driver sets his or her speaking volume according to the level of the ambient sound (the surrounding sound in sections where the driver is not speaking) and that, relative to the ambient level, the driver speaks more distinctly when performing voice input than when conversing with a passenger.
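The level difference between the utterance section and the remaining sections might be computed as in the sketch below. The boolean mask marking which samples belong to the utterance section is assumed to come from a separate detection step, and the names are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def level_gap(samples: Sequence[float], speech_mask: Sequence[bool]) -> float:
    """Mean level inside the utterance section minus mean level outside it."""
    speech = [abs(s) for s, m in zip(samples, speech_mask) if m]
    other = [abs(s) for s, m in zip(samples, speech_mask) if not m]
    if not speech or not other:
        raise ValueError("need both speech and non-speech samples")
    return fmean(speech) - fmean(other)


def is_directed_by_level_gap(samples: Sequence[float],
                             speech_mask: Sequence[bool],
                             profile_gap: float,
                             threshold: float) -> bool:
    """True when the gap exceeds the conversational profile gap by the threshold or more."""
    return level_gap(samples, speech_mask) - profile_gap >= threshold
```

Because the comparison uses the gap rather than the absolute level, the decision adapts to how loud the cabin is at that moment.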

When image data is used, as shown in FIG. 8, the utterance section of the input speech is detected first, and the position of the lower lip during the same period as the utterance section is then detected from the camera image. The fluctuation range of the lower lip position is calculated, and the result is compared with the profile data value calculated in the same way. If it exceeds the profile data value by a threshold or more, the utterance is determined to be directed to the in-vehicle system. In other words, this method determines the utterance target on the premise that the driver speaks louder when performing voice input than when conversing with a passenger.

When vibration of the driver's seat is used, as shown in FIG. 9, the utterance section of the input speech is detected first, the vibration values of the driver's seat during the same period as the utterance section are then detected, and their fluctuation range is calculated. The result is compared with the profile data value calculated in the same way, and if it exceeds the profile data value by a threshold or more, the utterance is determined to be directed to the in-vehicle system. In other words, this method determines the utterance target on the premise that the driver speaks louder and more distinctly when performing voice input than when conversing with a passenger, and that this difference in speaking appears in the vibration of the driver's seat.
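Both the lower lip position (FIG. 8) and the seat vibration (FIG. 9) rely on the same operation: take the sensor readings that fall in the same period as the utterance section and compute their fluctuation range. A rough sketch, assuming timestamped readings and a peak-to-peak definition of the fluctuation range (neither is specified in the text), could look like this:

```python
from typing import Sequence, Tuple


def fluctuation_in_section(readings: Sequence[Tuple[float, float]],
                           start: float,
                           end: float) -> float:
    """Peak-to-peak range of (timestamp, value) readings within [start, end]."""
    values = [value for timestamp, value in readings if start <= timestamp <= end]
    if not values:
        raise ValueError("no readings inside the utterance section")
    return max(values) - min(values)


def is_directed_by_fluctuation(readings: Sequence[Tuple[float, float]],
                               start: float,
                               end: float,
                               profile_range: float,
                               threshold: float) -> bool:
    """True when the range exceeds the profile value by the threshold or more."""
    return fluctuation_in_section(readings, start, end) - profile_range >= threshold
```

The `readings` could be lower lip heights in pixels or vibration sensor outputs; only the unit of `profile_range` and `threshold` changes.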

In addition, when the characteristics of the voice at the start or end of an utterance are used, as shown in FIG. 10, the utterance section of the input speech is detected first, the input level of the utterance section is divided into unit-time intervals, and the average is calculated for each interval. The transition of the per-interval average input level is then calculated and compared with the profile value calculated in the same way. If the degree of decrease in the input level (or, for the start of the utterance, the degree of increase) exceeds the profile data value by a threshold or more, the utterance is determined to be directed to the in-vehicle system. In other words, this method determines the utterance target on the premise that the driver speaks in a more distinct tone when performing voice input than when conversing with a passenger, and that the difference appears at the onset and end of the utterance.
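As an illustration of the onset and ending comparison, the sketch below takes per-interval average levels as input and measures the average slope over the first and last few intervals. The number of edge intervals, the slope definition, and the rule that either edge alone may trigger the decision are assumptions made here, and the names are hypothetical.

```python
from typing import Sequence, Tuple


def edge_slopes(levels: Sequence[float], edge_frames: int) -> Tuple[float, float]:
    """Return (rise per frame at the start, fall per frame at the end)."""
    if edge_frames < 2 or len(levels) < edge_frames:
        raise ValueError("need at least two edge frames within the utterance")
    rise = (levels[edge_frames - 1] - levels[0]) / (edge_frames - 1)
    fall = (levels[-edge_frames] - levels[-1]) / (edge_frames - 1)
    return rise, fall


def is_directed_by_edges(levels: Sequence[float],
                         edge_frames: int,
                         profile_rise: float,
                         profile_fall: float,
                         threshold: float) -> bool:
    """True when the onset rise or the ending fall exceeds the profile by the threshold."""
    rise, fall = edge_slopes(levels, edge_frames)
    return rise - profile_rise >= threshold or fall - profile_fall >= threshold
```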

Furthermore, the speech rate can be used, as shown in FIG. 11. In this case, the utterance section of the input speech is detected first, the utterance section is Fourier-transformed into the frequency domain, and the number of syllables (vowels) per unit time is obtained from the features of the input speech. The result is compared with the profile value calculated in the same way, and if the speech rate is slower than the profile data value by a threshold or more, the utterance is determined to be directed to the in-vehicle system. This method determines the utterance target on the premise that the driver speaks more carefully and slowly when performing voice input than when conversing with a passenger.
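The specification obtains the syllable count from frequency-domain features without giving details. The sketch below substitutes a simpler stand-in technique, counting local peaks in the per-interval level envelope as syllable nuclei, purely to illustrate the rate comparison; the peak rule, the `floor` parameter, and all names are assumptions.

```python
from typing import Sequence


def count_syllable_peaks(levels: Sequence[float], floor: float) -> int:
    """Count local maxima above `floor` in the per-frame level envelope."""
    count = 0
    for i in range(1, len(levels) - 1):
        if (levels[i] > floor
                and levels[i] > levels[i - 1]
                and levels[i] >= levels[i + 1]):
            count += 1
    return count


def speech_rate(levels: Sequence[float], frame_seconds: float, floor: float) -> float:
    """Estimated syllables per second over the utterance section."""
    return count_syllable_peaks(levels, floor) / (len(levels) * frame_seconds)


def is_directed_by_rate(levels: Sequence[float],
                        frame_seconds: float,
                        floor: float,
                        profile_rate: float,
                        threshold: float) -> bool:
    """True when the speech is slower than the profile by the threshold or more."""
    return profile_rate - speech_rate(levels, frame_seconds, floor) >= threshold
```

Note that the comparison is reversed relative to the other features: a lower value than the profile indicates voice input.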

As described above, the speech recognition device 20 according to the present invention acquires the driver's state as profile data while the driver is conversing with a passenger, and compares the driver's state at the time of voice input with the profile data to determine whether the driver's utterance is conversation directed to the passenger or voice input directed to the in-vehicle system.

This suppresses misrecognition in which conversation with a passenger is mistakenly recognized as voice input, and achieves highly accurate voice input.

The configuration and operation shown in this embodiment are merely an example and do not limit the present invention. The configuration and operation of the present invention can be modified as appropriate. For example, if microphones with directivity toward the driver and toward the front passenger seat are provided, the difference between the input levels of the microphones can be used to determine who spoke and to estimate in which direction the driver was facing when speaking. Furthermore, by monitoring whether there is voice input from the driver, the invention can also be implemented in a configuration without the talk switch.
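The directional microphone variation could be sketched as a simple comparison of the two input levels. The margin parameter, the three-way result, and the function name are hypothetical choices made for illustration; the text only states that the level difference is used.

```python
def likely_speaker(driver_mic_level: float,
                   passenger_mic_level: float,
                   margin: float) -> str:
    """Guess who spoke from the level difference between two directional microphones."""
    gap = driver_mic_level - passenger_mic_level
    if gap >= margin:
        return "driver"
    if -gap >= margin:
        return "passenger"
    return "undetermined"  # levels too close to decide
```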

As described above, the speech recognition device and speech recognition method according to the present invention are useful for improving speech recognition accuracy, and are particularly suited to improving recognition accuracy by determining the utterance target in an in-vehicle device.

Brief description of the drawings

FIG. 1 is a schematic configuration diagram of an in-vehicle system equipped with the speech recognition device according to the present invention.
FIG. 2 is a flowchart of the processing operation of the speech recognition device.
FIG. 3 is a flowchart of the profile learning process.
FIG. 4 is an explanatory diagram of utterance target determination based on the loudness of the voice.
FIG. 5 is an explanatory diagram of utterance target determination based on inflection in the loudness of the voice.
FIG. 6 is an explanatory diagram of utterance target determination based on inflection in the pitch of the voice.
FIG. 7 is an explanatory diagram of determination using the level difference between the utterance section and the other sections.
FIG. 8 is an explanatory diagram of utterance target determination based on changes in the opening width of the mouth.
FIG. 9 is an explanatory diagram of utterance target determination based on detection of driver's seat vibration.
FIG. 10 is an explanatory diagram of utterance target determination based on the characteristics of the start and end of speaking.
FIG. 11 is an explanatory diagram of utterance target determination based on the speaking rate.

Explanation of symbols

10 In-vehicle system
20 Speech recognition device
21 Buffer memory
22 Speech recognition unit
22a Utterance target determination unit
23 Profile creation unit
24 Database
25 Voice input detection unit
26 Passenger detection unit
27 Main control unit
30 In-vehicle integrated machine
31 Device operation unit
32 Recording medium
33 Navigation processing unit
34 AV processing unit
35 Output processing unit
41 Microphone
42 Camera
43 Vibration sensor
44 Talk switch
45 Pressure sensor
51 Display
52 Speaker

Claims

1. A speech recognition apparatus comprising:
profile creation means for obtaining information on the state in which the driver is talking to a passenger and creating profile data from the obtained information;
utterance target determination means for comparing the state of the driver with the profile data during the driver's utterance and determining whether the utterance is a conversation with a passenger or a voice operation input; and
speech recognition means for performing speech recognition on an utterance determined to be a voice operation input by the utterance target determination means.

2. The speech recognition apparatus according to claim 1, wherein the profile creation means creates the profile data by acquiring at least one of the driver's voice, an image, and vibration of the driver's seat.

3. The speech recognition apparatus according to claim 2, wherein the profile creation means creates the profile data using at least one of the loudness of the driver's voice while the driver is talking to a passenger, the change in the loudness, the pitch of the voice, the difference in loudness between the voice and a noise section, the characteristics of the voice at the start or end of an utterance, and the speech rate.

4. The speech recognition apparatus according to claim 2, wherein the profile creation means creates the profile data by obtaining a fluctuation range of the mouth opening from the driver's face image.

5. The speech recognition apparatus according to any one of claims 1 to 4, further comprising passenger detection means for detecting the presence or absence of a passenger, wherein the profile creation means creates the profile data when the passenger detection means detects the presence of a passenger.

6. The speech recognition apparatus according to claim 1, further comprising operation means for instructing the start of voice operation input, wherein the speech recognition means performs speech recognition on an utterance that, among the voice data input after the operation means is operated, is determined to be a voice operation input by the utterance target determination means.

7. A speech recognition method comprising:
a profile creation step of obtaining information on the state in which the driver is talking to a passenger and creating profile data from the obtained information;
an utterance target determination step of comparing the state of the driver with the profile data during the driver's utterance and determining whether the utterance is a conversation with a passenger or a voice operation input; and
a speech recognition step of performing speech recognition on an utterance determined to be a voice operation input in the utterance target determination step.