JP3597398B2

JP3597398B2 - Voice recognition device

Info

Publication number: JP3597398B2
Application number: JP29325898A
Authority: JP
Inventors: 哲也室井
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1998-10-15
Filing date: 1998-10-15
Publication date: 2004-12-08
Anticipated expiration: 2018-10-15
Also published as: JP2000122692A

Description

【０００１】
【発明の属する技術分野】
この発明は、通信経路を通して行われる会話における音声認識装置、特に会話中の音声を認識し、認識した音声をキーワードとして情報を検索して、検索した情報を話者に提供することに関するものである。
【０００２】
【従来の技術】
例えば商品の注文を受けたり、あるいは商品の問い合わせに答えるというような会話を電話で行うときに、音声認識を利用して情報提供や操作支援を行う音声認識方法が例えば特開平８−２４８９７号公報に開示されている。特開平８−２４８９７号公報に示された音声認識方法は、話者Ａと話者Ｂとが会話を行っている通信経路から、話者Ａのみの音声信号を抽出して音声認識し、音声認識した結果を用いて話者Ａに提供する情報を決定するようにしている。例えば音声認識装置が置かれている側の話者Ａが「はい、商品Ｃの値段ですね、少しお待ちください」という発話から「商品Ｃ」という言葉を認識して、商品Ｃの情報をディスプレイに表示し、それを確認した話者Ａが話者Ｂに商品Ｃの値段を答えることができる。
【０００３】
しかしながら、音声認識装置を常に動作モードにしておくと、仮に音声認識結果が正しくとも、話者Ａが意図しないときに音声認識装置が反応してしまい、予期せぬ画面切り替えが起こってしまったりする。このため音声を認識するためのタイミングを指示するタイミング指示手段を設けている。このため話者Ａは話者Ｂとの会話をしているときに、スイッチなどで音声認識を開始する合図を指示する必要があり、操作が煩雑になって話者Ｂに対する対話がおろそかになる場合が生じる。
【０００４】
この発明はかかる短所を改善し、音声認識を常に動作モードにしておいても的確なタイミングで情報提供や操作支援を行えることができる音声認識装置を提供することを目的とするものである。
【０００５】
【課題を解決するための手段】
この発明に係る音声認識装置は、話者と話者とが会話を行っている通信経路から音声信号を抽出して音声認識を行う音声認識部と、片方の話者の発話の音声認識結果と他方の話者の発話の音声認識結果を比較し、２つの音声認識結果があらかじめ定められた条件であった場合のみ、話者に情報提供あるいは操作支援を行う認識結果比較部とを有することを特徴とする。
【０００６】
上記認識結果比較部は、片方の話者の発話の音声認識結果と他方の話者の発話の音声認識結果が同一であった場合のみ、話者に情報提供あるいは操作支援を行うと良い。
【０００７】
また、上記認識結果比較部は、片方の話者の発話の音声認識結果と他方の話者の発話の音声認識結果が同じ意味であった場合のみ、話者に情報提供あるいは操作支援を行っても良い。
【０００８】
【発明の実施の形態】
この発明の音声認識装置は、送信音声入力部と受信音声入力部と送信音声認識部と受信音声認識部と情報格納部と認識結果比較部及び表示部を有し、話者Ａと話者Ｂが電話機などの音声入出力部により公衆回線などの通信経路を介して会話を行う話者Ａの音声入出力部側に接続されている。
【０００９】
例えば話者Ｂが音声入出力部から話者Ａの音声入出力部に発呼して商品の問い合わせをしたときに、受信音声入力部は話者Ｂからの受信音声を抽出して受信認識部に送る。受信音声認識部は送られた受信音声から、その商品を特定するキーワードを認識し、認識した結果を認識結果比較部へ送る。一方、話者Ａ側の音声入出力部を介して受信音声を聴いた話者Ａは情報提供が必要だと判断した場合には、その商品を特定するキーワードを含む応答の送信音声を発話する。この送信音声を送信音声入力部で抽出して送信認識部に送る。送信音声認識部は送られた送信音声から商品を特定するキーワードを認識し、認識した結果を認識結果比較部へ送る。認識結果比較部は受信音声認識部で認識したキーワードと送信音声認識部で認識したキーワードとを比較し、同一の結果であった場合のみ、情報格納部からその商品の価格や性能などの情報を読み出して表示部に表示して話者Ａに伝える。話者Ａは表示部に表示された商品の情報を確認して話者Ｂに伝える。
【００１０】
【実施例】
図１はこの発明の一実施例の構成を示すブロック図である。図に示すように、話者Ａと話者Ｂは電話機などの音声入出力部１ａ，１ｂにより公衆回線などの通信経路２を介して会話を行う。話者Ａは、例えば商品の注文を受けたり、質問を受けたりする側であり、話者Ｂは商品の注文をしたり、質問をしたりする。話者Ａの音声入出力部１ａには音声認識装置３が接続されている。音声認識装置３は送信音声入力部４と受信音声入力部５と送信音声認識部６と受信音声認識部７と情報格納部８と認識結果比較部９及び表示部１０を有する。送信音声入力部４は音声マイクロフォンなどからなり、話者Ａが音声入出力部１ａで通話したときの送信音声を抽出して入力し、送信音声入力部５は通信経路２に接続され、話者Ｂが音声入出力部１ｂで通話したとき通信経路２を介して受信した受信音声を抽出して入力する。送信音声認識部６は送信音声入力部４から入力した送信音声を認識するものであり、話者Ａがあらかじめ特定できるので、特定話者方式あるいは話者適応によって話者Ａにチューニングされた音声認識を行い、認識性能の向上を図る。受信音声認識部７は受信音声入力部５から入力した受信音声を認識するものであり、話者が特定できないので、話者に依存しない形で音声が認識できる不特定話者方式の音声認識方式により音声を認識する。情報格納部８には、例えば各種商品の情報があらかじめ格納されている。認識結果比較部９は送信音声認識部６と受信音声認識部８の音声認識結果があらかじめ定められた条件であった場合のみ、音声認識結果に応じた情報を情報格納部８から読み出して表示部１０に表示する。
【００１１】
上記のように構成された音声認識装置３で、話者Ｂから話者Ａに対して例えば商品に対する問い合わせがあったときの動作を説明する。
【００１２】
話者Ｂが音声入出力部１ｂから音声入出力部１ａに発呼して通信経路が接続され、例えば話者Ｂが話者Ａに対して「商品Ｃの値段を教えて欲しいんですけど」というような発話をした場合に、受信音声入力部５は話者Ｂの「商品Ｃの値段を教えて欲しいんですけど」という受信音声を抽出して受信認識部７に送る。受信音声認識部７は送られた受信音声から「商品Ｃ」というキーワードを認識し、認識した結果を認識結果比較部９へ送る。一方、音声入出力部１ａを介して「商品Ｃの値段を教えて欲しいんですけど」という音声を聴いた話者Ａは情報提供が必要だと判断した場合には、「はい、商品Ｃの価格でございますね、少しお待ちください」という送信音声を発話する。この送信音声を送信音声入力部４で抽出して送信認識部６に送る。送信音声認識部６は送られた送信音声から「商品Ｃ」というキーワードを認識し、認識した結果を認識結果比較部９へ送る。この受信音声認識部５と送信音声認識部４で音声認識するための文法などの言語モデルは、例えば図２に示すように商品名等を表示した言語モデル２１を用い、商品名等を発話中から例えばワードスポッティング、すなわち、あらかじめ定めた言葉だけを自動的に抽出し、他を無視する方法で単語や音節を認識したり、単語を連続して発声した音声を認識する連続音声認識のように発話全体を認識してから、図２に示すような商品名を抽出したりする。
【００１３】
認識結果比較部９は受信音声認識部７で認識したキーワード「商品Ｃ」と送信音声認識部６で認識したキーワード「商品Ｃ」とを比較し、同一の結果であった場合のみ、情報格納部８から「商品Ｃ」の価格や性能などの情報を読み出して表示部１０に表示して話者Ａに伝える。話者Ａは表示部１０に表示された商品の情報を確認して話者Ｂに伝える。
【００１４】
また、話者Ｂが、例えば「値段が１００万円以下の商品はありますか」という発話を行い、これに対して話者Ａが「商品Ｃでしたら９８万円でお求めいただけます」のような対話をした場合、受信音声認識部７では認識するキーワードがなく、送信音声認識部６は「商品Ｃ」というキーワードを認識するが、認識結果比較部９で受信音声認識部７の認識結果と送信音声認識部６の認識結果が異なるので「商品Ｃ」の情報を表示部１０に表示しないようにする。
【００１５】
上記実施例は受信音声認識部７で認識した結果と送信音声認識部６で認識した結果が同一の場合の認識結果比較部９から該当する情報を表示部１０に表示した場合について説明したが、受信音声認識部７で認識した結果と送信音声認識部６で認識した結果が同じ意味の場合に、認識結果比較部９から該当する情報を表示部１０に表示するようにしても良い。
【００１６】
例えば認識結果比較部９に、図３に示すように、正式名称「ＮＴ９５」なる商品が、消費者にわかりやすいように、「おとぼけくん」なる愛称がつけられている場合、「エヌティーきゅうごう」，「エヌティーきゅうじゅうご」，「おとぼけくん」という読みは、全て同じ「ＮＴ９５」という意味（商品）を表すというような意味と読みの変換テーブル９１をあらかじめ設けておき、話者Ｂの発話を受信音声認識部７で認識した結果が「エヌティーきゅうごう」であり、話者Ａの発話を送信音声認識部６で認識した結果が「エヌティーきゅうじゅうご」であった場合、読みは異なるが同じ意味「ＮＴ９５」を表すと認識結果比較部９で判定して、「ＮＴ９５」の情報を表示部１０に表示する。このにして適切な情報を話者Ａから話者Ｂに伝えることができる。
【００１７】
また、上記実施例は送信音声入力部４と受信音声入力部５を別個に設け、送信音声認識部６と受信音声認識部７も別個に設けた場合について説明したが、送信音声入力部４と受信音声入力部５を共通にし、送信音声認識部６と受信音声認識部７も共通にして不特定話者方式で受信音声と送信音声を認識したり、音響モデルだけを話者Ａと話者Ｂで切り替えるようにしても良い。このようにして装置の簡素化を図ることができる。
【００１８】
【発明の効果】
この発明は以上説明したように、片方の話者の発話の音声認識結果と他方の話者の発話の音声認識結果を比較し、２つの音声認識結果があらかじめ定められた条件であった場合だけ、情報提供あるいは操作支援を行うようにしたから、音声認識装置を常に動作モードにしておいて、認識するタイミングを指示しなくとも情報提供が必要な場合にだけ情報提供や操作支援を行うことができる。
【００１９】
また、片方の話者の発話の音声認識結果と他方の話者の発話の音声認識結果が同一であった場合のみ、話者に情報提供あるいは操作支援を行うことにより、誤った情報提供等を行うことを防止でき、正確な情報を提供することができる。
【００２０】
さらに、片方の話者の発話の音声認識結果と他方の話者の発話の音声認識結果が同じ意味であった場合に、話者に情報提供あるいは操作支援を行うから、ある商品に対する名称や読みが複数ある場合に、話者の発話の読みが異なっていても、その意味が共通であれば的確なタイミングで情報提供等を行うことができ、型番などの数詞表現などでは、各種の読みがなされるときでも正確な情報提供を行うことができる。
【図面の簡単な説明】
【図１】この発明の実施例の構成を示すブロック図である。
【図２】言語モデルを示す説明図である。
【図３】意味と読みの変換テーブルの構成図である。
【符号の説明】
１音声入出力部
２通信経路
３音声認識装置
４送信音声入力部
５受信音声入力部
６送信音声認識部
７受信音声認識部
８情報格納部
９認識結果比較部
１０表示部
[0001]
[Technical Field of the Invention]
The present invention relates to a voice recognition device for conversations carried out over a communication path, and more particularly to recognizing speech during a conversation, searching for information using the recognized speech as a keyword, and providing the retrieved information to a speaker.
[0002]
[Prior art]
For example, Japanese Unexamined Patent Publication No. H8-24897 discloses a voice recognition method that uses voice recognition to provide information and operation support during telephone conversations such as taking an order for a product or answering a product inquiry. In the method disclosed in that publication, the voice signal of speaker A alone is extracted from the communication path over which speaker A and speaker B are conversing, voice recognition is performed on it, and the recognition result is used to decide what information to provide to speaker A. For example, when speaker A, on whose side the voice recognition device is installed, says "Yes, the price of product C; please wait a moment," the device recognizes the word "product C" in the utterance and shows information on product C on a display. Speaker A checks the display and can then tell speaker B the price of product C.
[0003]
However, if the voice recognition device is always kept in operation mode, then even when the recognition result is correct, the device may react when speaker A does not intend it to, causing unexpected screen switching. For this reason, the prior-art method provides timing instruction means for indicating when speech should be recognized. As a result, while talking with speaker B, speaker A must signal the start of voice recognition with a switch or the like. The operation becomes cumbersome, and speaker A's dialogue with speaker B may suffer.
[0004]
An object of the present invention is to remedy this drawback and to provide a voice recognition device that can provide information and operation support at the appropriate time even when voice recognition is always kept in operation mode.
[0005]
[Means for Solving the Problems]
A voice recognition device according to the present invention is characterized by comprising: a voice recognition unit that extracts voice signals from a communication path over which two speakers are conversing and performs voice recognition; and a recognition result comparison unit that compares the voice recognition result of one speaker's utterance with the voice recognition result of the other speaker's utterance and provides information or operation support to a speaker only when the two voice recognition results satisfy a predetermined condition.
[0006]
Preferably, the recognition result comparison unit provides information or operation support to the speaker only when the voice recognition result of one speaker's utterance and the voice recognition result of the other speaker's utterance are identical.
[0007]
Alternatively, the recognition result comparison unit may provide information or operation support to the speaker only when the voice recognition result of one speaker's utterance and the voice recognition result of the other speaker's utterance have the same meaning.
[0008]
[Embodiments of the Invention]
The voice recognition device of the present invention has a transmission voice input unit, a reception voice input unit, a transmission voice recognition unit, a reception voice recognition unit, an information storage unit, a recognition result comparison unit, and a display unit. Speaker A and speaker B converse over a communication path such as a public line using voice input/output units such as telephones, and the device is connected on the side of speaker A's voice input/output unit.
[0009]
For example, when speaker B calls speaker A's voice input/output unit from his or her own voice input/output unit and inquires about a product, the reception voice input unit extracts the received voice from speaker B and sends it to the reception voice recognition unit. The reception voice recognition unit recognizes a keyword specifying the product in the received voice and sends the result to the recognition result comparison unit. Meanwhile, speaker A, having heard the received voice through the voice input/output unit on speaker A's side, utters a response containing a keyword specifying the product if he or she judges that information needs to be provided. This transmitted voice is extracted by the transmission voice input unit and sent to the transmission voice recognition unit. The transmission voice recognition unit recognizes the keyword specifying the product in the transmitted voice and sends the result to the recognition result comparison unit. The recognition result comparison unit compares the keyword recognized by the reception voice recognition unit with the keyword recognized by the transmission voice recognition unit. Only when the two are identical does it read information such as the price and performance of the product from the information storage unit and show it on the display unit for speaker A. Speaker A checks the product information shown on the display unit and conveys it to speaker B.
[0010]
[Example]
FIG. 1 is a block diagram showing the configuration of one embodiment of the present invention. As shown in the figure, speaker A and speaker B converse over a communication path 2 such as a public line using voice input/output units 1a and 1b such as telephones. Speaker A is, for example, the party who takes product orders or receives questions, and speaker B is the party who places orders or asks questions. A voice recognition device 3 is connected to speaker A's voice input/output unit 1a. The voice recognition device 3 has a transmission voice input unit 4, a reception voice input unit 5, a transmission voice recognition unit 6, a reception voice recognition unit 7, an information storage unit 8, a recognition result comparison unit 9, and a display unit 10. The transmission voice input unit 4 consists of a microphone or the like and extracts and inputs the transmitted voice when speaker A talks on voice input/output unit 1a. The reception voice input unit 5 is connected to communication path 2 and extracts and inputs the received voice that arrives via communication path 2 when speaker B talks on voice input/output unit 1b. The transmission voice recognition unit 6 recognizes the transmitted voice input from the transmission voice input unit 4. Because speaker A is known in advance, it performs voice recognition tuned to speaker A, using a speaker-dependent method or speaker adaptation, to improve recognition performance. The reception voice recognition unit 7 recognizes the received voice input from the reception voice input unit 5. Because the speaker cannot be identified in advance, it uses a speaker-independent voice recognition method that can recognize speech regardless of who is speaking. The information storage unit 8 holds, for example, information on various products stored in advance. Only when the voice recognition results of the transmission voice recognition unit 6 and the reception voice recognition unit 7 satisfy a predetermined condition does the recognition result comparison unit 9 read the information corresponding to the recognition result from the information storage unit 8 and show it on the display unit 10.
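As a rough illustration of how the units in FIG. 1 fit together, the following Python sketch wires an information store, a comparison condition, and a display log into one object. The class name `VoiceRecognitionDevice`, its fields, and the sample product data are hypothetical. The patent does not prescribe any particular implementation, and the two recognizers are assumed to have already produced keyword strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class VoiceRecognitionDevice:
    """Hypothetical stand-in for device 3; recognition units 6 and 7 are assumed external."""
    info_store: Dict[str, str]                            # information storage unit 8
    condition: Callable[[str, str], bool]                 # rule applied by comparison unit 9
    displayed: List[str] = field(default_factory=list)    # stands in for display unit 10

    def on_results(self, tx_keyword: Optional[str], rx_keyword: Optional[str]) -> Optional[str]:
        """Show stored information only when both sides yielded a keyword
        and the predetermined condition holds."""
        if tx_keyword is None or rx_keyword is None:
            return None
        if not self.condition(tx_keyword, rx_keyword):
            return None
        info = self.info_store.get(tx_keyword)
        if info is not None:
            self.displayed.append(info)
        return info


device = VoiceRecognitionDevice(
    info_store={"product C": "product C: 980,000 yen"},   # sample data, assumed
    condition=lambda tx, rx: tx == rx,                    # "identical results" rule
)
```

Passing the condition in as a function keeps the "identical" and "same meaning" variants described in the text interchangeable without changing the rest of the object.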
[0011]
The operation of the voice recognition device 3 configured as described above will now be described for the case where speaker B makes, for example, a product inquiry to speaker A.
[0012]
Speaker B places a call from voice input/output unit 1b to voice input/output unit 1a, and the communication path is connected. Suppose speaker B then says to speaker A something like "I'd like to know the price of product C." The reception voice input unit 5 extracts this received voice and sends it to the reception voice recognition unit 7. The reception voice recognition unit 7 recognizes the keyword "product C" in the received voice and sends the result to the recognition result comparison unit 9. Meanwhile, speaker A, having heard "I'd like to know the price of product C" through voice input/output unit 1a, utters the transmitted voice "Yes, the price of product C; please wait a moment" if he or she judges that information needs to be provided. This transmitted voice is extracted by the transmission voice input unit 4 and sent to the transmission voice recognition unit 6. The transmission voice recognition unit 6 recognizes the keyword "product C" in the transmitted voice and sends the result to the recognition result comparison unit 9. As the language model (grammar or the like) used for recognition in the reception voice recognition unit 7 and the transmission voice recognition unit 6, a language model 21 listing product names and the like, as shown in FIG. 2, is used, for example. Product names and the like can be picked out of an utterance by word spotting, that is, a method of recognizing words or syllables that automatically extracts only predetermined words and ignores everything else. Alternatively, the whole utterance can first be recognized, as in continuous speech recognition, which recognizes speech in which words are uttered one after another, and product names such as those shown in FIG. 2 can then be extracted from the result.
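The keyword-extraction step can be approximated on text. The sketch below assumes each utterance has already been transcribed, and it scans the transcript for entries of a small vocabulary standing in for language model 21 of FIG. 2. An actual word spotter operates on the audio signal, so this is only an analogy. The function name `spot_keywords` and the vocabulary entries are hypothetical.

```python
from typing import List, Sequence

# Hypothetical vocabulary standing in for language model 21 (FIG. 2).
PRODUCT_NAMES: Sequence[str] = ("product A", "product B", "product C")


def spot_keywords(transcript: str, vocabulary: Sequence[str] = PRODUCT_NAMES) -> List[str]:
    """Return the vocabulary entries found in the transcript, ordered by first
    appearance. Everything outside the vocabulary is ignored."""
    lowered = transcript.lower()
    hits = [(lowered.find(word.lower()), word) for word in vocabulary]
    # find() returns -1 for entries that do not occur; drop those.
    return [word for pos, word in sorted(hits) if pos >= 0]
```

Calling `spot_keywords("I'd like to know the price of product C.")` yields a single-element list containing `"product C"`, while an utterance with no listed product name yields an empty list.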
[0013]
The recognition result comparison unit 9 compares the keyword "product C" recognized by the reception voice recognition unit 7 with the keyword "product C" recognized by the transmission voice recognition unit 6. Only when the two are identical does it read information such as the price and performance of "product C" from the information storage unit 8 and show it on the display unit 10 for speaker A. Speaker A checks the product information shown on the display unit 10 and conveys it to speaker B.
[0014]
Now suppose speaker B says, for example, "Do you have any products priced at 1,000,000 yen or less?" and speaker A replies, "Product C is available for 980,000 yen." In this case the reception voice recognition unit 7 finds no keyword to recognize, while the transmission voice recognition unit 6 recognizes the keyword "product C". Because the recognition result of the reception voice recognition unit 7 differs from that of the transmission voice recognition unit 6, the recognition result comparison unit 9 does not show the information on "product C" on the display unit 10.
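The two dialogues in paragraphs [0012] to [0014] can be traced with a small sketch of the "identical results" rule. It assumes each recognition unit returns a list of keywords, empty when nothing is recognized. The function name `matched_keywords` is hypothetical.

```python
from typing import List


def matched_keywords(rx_keywords: List[str], tx_keywords: List[str]) -> List[str]:
    """Keywords recognized on BOTH the reception side (speaker B) and the
    transmission side (speaker A); only these would trigger a display."""
    tx_set = set(tx_keywords)
    return [kw for kw in rx_keywords if kw in tx_set]


# Paragraphs [0012]-[0013]: both speakers say "product C", so it is shown.
inquiry = matched_keywords(["product C"], ["product C"])

# Paragraph [0014]: only speaker A says "product C", so nothing is shown.
price_range = matched_keywords([], ["product C"])
```

The source does not say how close in time the two utterances must be, so this sketch deliberately leaves out any time window.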
[0015]
The above embodiment describes the case where the recognition result comparison unit 9 shows the corresponding information on the display unit 10 when the result recognized by the reception voice recognition unit 7 and the result recognized by the transmission voice recognition unit 6 are identical. Alternatively, the recognition result comparison unit 9 may show the corresponding information on the display unit 10 when the two results have the same meaning.
[0016]
For example, as shown in FIG. 3, suppose a product whose official name is "NT95" has been given the nickname "Otoboke-kun" so that consumers can recognize it easily. The recognition result comparison unit 9 is provided in advance with a meaning-reading conversion table 91 stating that the readings "enu-tī kyū-gō" (NT nine-five, read digit by digit), "enu-tī kyūjūgo" (NT ninety-five), and "Otoboke-kun" all represent the same meaning (product) "NT95". If the result of recognizing speaker B's utterance in the reception voice recognition unit 7 is "enu-tī kyū-gō" and the result of recognizing speaker A's utterance in the transmission voice recognition unit 6 is "enu-tī kyūjūgo", the recognition result comparison unit 9 determines that, although the readings differ, they represent the same meaning "NT95", and shows the information on "NT95" on the display unit 10. In this way, appropriate information can be conveyed from speaker A to speaker B.
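The conversion table lends itself to a simple dictionary lookup. In the sketch below the three readings are taken from the Japanese text of paragraph 【００１６】. The dictionary layout and the function name `same_meaning` are assumptions made for illustration, since FIG. 3 itself is not reproduced here.

```python
from typing import Dict, Optional

# Hypothetical contents of meaning-reading conversion table 91 (FIG. 3):
# each recognized reading maps to the product it denotes.
READING_TO_MEANING: Dict[str, str] = {
    "エヌティーきゅうごう": "NT95",        # digit-by-digit reading
    "エヌティーきゅうじゅうご": "NT95",    # "ninety-five" reading
    "おとぼけくん": "NT95",                # nickname
}


def same_meaning(rx_reading: str, tx_reading: str) -> Optional[str]:
    """Return the shared meaning if both readings map to the same product, else None."""
    rx_meaning = READING_TO_MEANING.get(rx_reading)
    tx_meaning = READING_TO_MEANING.get(tx_reading)
    if rx_meaning is not None and rx_meaning == tx_meaning:
        return rx_meaning
    return None
```

Comparing the mapped meanings rather than the raw readings is what lets two different pronunciations of a model number still trigger the display.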
[0017]
The above embodiment describes the case where the transmission voice input unit 4 and the reception voice input unit 5 are provided separately, and the transmission voice recognition unit 6 and the reception voice recognition unit 7 are also provided separately. Alternatively, the transmission voice input unit 4 and the reception voice input unit 5 may be combined, and the transmission voice recognition unit 6 and the reception voice recognition unit 7 may also be combined, so that both received and transmitted voice are recognized by the speaker-independent method, or so that only the acoustic model is switched between speaker A and speaker B. In this way, the device can be simplified.
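The acoustic-model switching variant might look like the following sketch, in which one recognizer object serves both directions of the call and only the selected acoustic model changes. The class name `SharedRecognizer`, the channel labels, and the model names are all placeholders and do not come from the source.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SharedRecognizer:
    """Hypothetical single recognizer serving both directions of the call."""
    acoustic_models: Dict[str, str]   # channel -> acoustic model name (placeholders)
    active_model: str = ""

    def select(self, channel: str) -> str:
        """Switch only the acoustic model; everything else stays shared."""
        self.active_model = self.acoustic_models[channel]
        return self.active_model


recognizer = SharedRecognizer(
    acoustic_models={
        "transmission": "speaker_A_adapted",    # speaker A is known in advance
        "reception": "speaker_independent",     # speaker B cannot be identified
    }
)
```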
[0018]
[Effects of the Invention]
As described above, the present invention compares the voice recognition result of one speaker's utterance with the voice recognition result of the other speaker's utterance, and provides information or operation support only when the two results satisfy a predetermined condition. The voice recognition device can therefore be kept in operation mode at all times and still provide information or operation support only when it is needed, without anyone having to indicate when recognition should take place.
[0019]
Also, by providing information or operation support to the speaker only when the voice recognition result of one speaker's utterance and the voice recognition result of the other speaker's utterance are identical, the provision of erroneous information and the like can be prevented, and accurate information can be provided.
[0020]
Furthermore, because information or operation support is provided to the speaker when the voice recognition result of one speaker's utterance and the voice recognition result of the other speaker's utterance have the same meaning, information can be provided at the appropriate time even when a product has several names or readings and the two speakers use different ones, as long as the meaning is the same. Accurate information can thus be provided even for numeral expressions such as model numbers, which may be read in various ways.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing a language model.
FIG. 3 is a configuration diagram of a meaning-reading conversion table.
[Explanation of symbols]
1 voice input/output unit
2 communication path
3 voice recognition device
4 transmission voice input unit
5 reception voice input unit
6 transmission voice recognition unit
7 reception voice recognition unit
8 information storage unit
9 recognition result comparison unit
10 display unit

Claims

1. A voice recognition device comprising: a voice recognition unit that extracts voice signals from a communication path over which two speakers are conversing and performs voice recognition; and a recognition result comparison unit that compares the voice recognition result of one speaker's utterance with the voice recognition result of the other speaker's utterance and provides information or operation support to a speaker only when the two voice recognition results satisfy a predetermined condition.

2. The voice recognition device according to claim 1, wherein the recognition result comparison unit provides information or operation support to the speaker only when the voice recognition result of one speaker's utterance and the voice recognition result of the other speaker's utterance are identical.

3. The voice recognition device according to claim 1, wherein the recognition result comparison unit provides information or operation support to the speaker only when the voice recognition result of one speaker's utterance and the voice recognition result of the other speaker's utterance have the same meaning.