JP5263875B2

JP5263875B2 - Computer system, speech recognition method and computer program for speech recognition

Info

Publication number: JP5263875B2
Application number: JP2008236872A
Authority: JP
Inventors: 大輔友田; 茂樹竹内; 壮是志村
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2008-09-16
Filing date: 2008-09-16
Publication date: 2013-08-14
Anticipated expiration: 2028-09-16
Also published as: JP2010072098A

Abstract

PROBLEM TO BE SOLVED: To provide a computer system for speech recognition of utterance input that can improve recognition accuracy.

SOLUTION: The system includes: a first determination unit that, in response to input of a first utterance, determines whether the voice of the input first utterance matches a voice registered in a storage unit; a request unit that requests input of a second utterance when the voice of the input first utterance does not match a voice registered in the storage unit; a second determination unit that determines whether the voice of the input second utterance matches a voice registered in the storage unit; a comparison unit that compares the phoneme string of the second utterance with the phoneme string of the first utterance when the voice of the second utterance matches a voice registered in the storage unit; and an association unit that associates the voice of the first utterance with the command or action corresponding to the second utterance when the phoneme string of the second utterance is similar to the phoneme string of the first utterance.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a computer system for speech recognition of utterance input, and to a method and a computer program therefor.

As in-vehicle devices (for example, car navigation systems) become more sophisticated and complex, speech recognition serves as a human interface that is easy for users to use. When the user speaks a voice command to the in-vehicle device, the voice command is converted into a command through speech recognition, and the action corresponding to that command is executed. When the user uses a function that associates arbitrary utterance content with a command of the in-vehicle device (hereinafter referred to as the voice tag registration mode), a new utterance is registered by the following procedure:
1. Start the voice tag registration mode.
2. Select the menu for associating utterance content with a command.
3. Speak the utterance.
4. Register the new entry.
However, the user is not always aware of which command has been associated with which utterance. Moreover, the user notices a speech recognition failure only when recognition actually fails, or does not notice the failure at all. Furthermore, in an in-vehicle environment, the user does not want to register utterance content or correct pronunciation. Finally, it is tedious for the user to carry out the above procedure every time utterance content is to be associated with a command.

Patent Document 1 describes a recognition dictionary system that includes a client terminal, which is the terminal used by the user and has a recognition dictionary storing character string information, and a server terminal connected to the client terminal via a communication network. The server terminal has three units. A paraphrase vocabulary accumulation unit cumulatively stores, for a first vocabulary corresponding to information input from the client terminal, paraphrase vocabulary that has a meaning equivalent to a second vocabulary stored in the recognition dictionary but has different character string information. A paraphrase frequency analysis unit refers to the paraphrase vocabulary accumulation unit, analyzes the occurrence frequency of the paraphrase vocabulary, and determines at least one paraphrase vocabulary whose occurrence frequency is higher than a first predetermined value to be a main paraphrase vocabulary. A recognition dictionary update unit updates the recognition dictionary so that the first vocabulary determined to be a main paraphrase is registered in the recognition dictionary in association with the second vocabulary.

JP 2007-213005 A

In-vehicle devices need higher speech recognition accuracy. To raise recognition accuracy, there is therefore a need for a speech recognition system that, for example, adds pronunciation variations of a word or phrase and thereby copes automatically with fluctuations in sound.
In addition, adding pronunciation variations of words or phrases does not improve accuracy unless it is done manually, so it tends to be avoided for cost reasons. There is therefore a need for a speech recognition system that registers pronunciation variations of words or phrases automatically.
Furthermore, hardware resources may be limited in in-vehicle devices. Even so, associating several sound fluctuations with a single word or phrase, for example, requires more memory. There is therefore a need for a speech recognition system that can cope with hardware resource constraints.

The present invention provides a computer system for speech recognition of utterance input.
The computer system includes:
a first determination unit that, in response to input of a first utterance, determines whether the voice of the input first utterance matches a voice registered in a storage unit;
a request unit that requests input of a second utterance when the voice of the input first utterance does not match a voice registered in the storage unit;
a second determination unit that determines whether the voice of the input second utterance matches a voice registered in the storage unit;
a comparison unit that compares the phoneme string of the second utterance with the phoneme string of the first utterance when the voice of the second utterance matches a voice registered in the storage unit; and
an association unit that associates the voice of the first utterance with the command or action corresponding to the second utterance when the phoneme string of the second utterance is similar to the phoneme string of the first utterance.
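The interplay of the five units can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: an exact dictionary lookup stands in for the recognizer's "voice matches" decision, `difflib.SequenceMatcher` stands in for the phoneme string comparison, and the names `recognize`, `phonemes_similar`, `request_second` and the 0.6 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional, Tuple

Phonemes = Tuple[str, ...]


def phonemes_similar(a: Phonemes, b: Phonemes, threshold: float = 0.6) -> bool:
    """Stand-in similarity test; the 0.6 threshold is an arbitrary assumption."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def recognize(
    first: Phonemes,
    registered: Dict[Phonemes, str],
    request_second: Callable[[], Phonemes],
) -> Optional[str]:
    """Return the recognized command, learning the first utterance if possible."""
    # First determination unit: does the first utterance match a registered voice?
    command = registered.get(first)
    if command is not None:
        return command

    # Request unit: ask the user to speak again.
    second = request_second()

    # Second determination unit.
    command = registered.get(second)
    if command is None:
        return None

    # Comparison unit and association unit: if the two phoneme strings are
    # similar, the failed first utterance is tied to the same command.
    if phonemes_similar(second, first):
        registered[first] = command
    return command


# Hypothetical example: the user first says a clipped form, then the full form.
registered = {("e", "a", "k", "o", "N"): "AIR_CONDITIONER_ON"}
result = recognize(
    ("e", "a", "k", "o"),
    registered,
    lambda: ("e", "a", "k", "o", "N"),
)
```

After this call the clipped form is stored alongside the original entry, so the same utterance would be recognized directly the next time.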

In one embodiment of the present invention, the computer system further includes a first generation unit that generates, together with the first determination, a first phoneme string corresponding to the first utterance.

In one embodiment of the present invention, the first determination unit further determines whether the generated first phoneme string matches a phoneme string registered in the storage unit.

In one embodiment of the present invention, whether the voice of the input first utterance matches a voice registered in the storage unit is determined based on the degree of match of the phonemes in the phoneme string.

In one embodiment of the present invention, the computer system further includes an execution unit that executes the action corresponding to the first utterance when the voice of the input first utterance matches a voice registered in the storage unit.

In one embodiment of the present invention, the first utterance is a first voice command. In one embodiment of the present invention, the execution unit executes the action corresponding to the first voice command when the first voice command matches a voice command registered in the storage unit.

In one embodiment of the present invention, the computer system further includes a recording unit in which the voice of the first utterance is stored when the voice of the input first utterance does not match a voice registered in the storage unit.

In one embodiment of the present invention, the computer system further includes a second generation unit that generates, together with the second determination, a second phoneme string corresponding to the second utterance.

In one embodiment of the present invention, the second determination unit further determines whether the generated second phoneme string matches a phoneme string registered in the storage unit.

In one embodiment of the present invention, whether the voice of the input second utterance matches a voice registered in the storage unit is determined based on the degree of match of the phonemes in the phoneme string.

In one embodiment of the present invention, the computer system further includes a second execution unit that executes the action corresponding to the second utterance when the voice of the input second utterance matches a voice registered in the storage unit.

In one embodiment of the present invention, the second utterance is a second voice command. In one embodiment of the present invention, the second execution unit executes the action corresponding to the second voice command when the second voice command matches a voice command registered in the storage unit.

In one embodiment of the present invention, whether the phoneme string of the second utterance is similar to the phoneme string of the first utterance is determined based on the degree of match of the phonemes in the phoneme strings.

In one embodiment of the present invention, when the phoneme string of the second utterance is similar to the phoneme string of the first utterance, whether the voices match is determined based on sound fluctuation information.
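The text does not fix how the degree of match between two phoneme strings is computed. One common way to score it, shown here purely as an assumption, is a normalized edit distance over phoneme symbols; the function names and the 0.7 threshold below are hypothetical.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance counted in whole phonemes, not characters."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def match_degree(a: Sequence[str], b: Sequence[str]) -> float:
    """1.0 for identical strings, falling toward 0.0 as they diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_similar(a: Sequence[str], b: Sequence[str], threshold: float = 0.7) -> bool:
    # The threshold value is an assumption; the source only says it is predefined.
    return match_degree(a, b) >= threshold
```

With this scoring, `is_similar("k a r a o k e".split(), "k a r o k e".split())` holds because only one phoneme out of seven is missing, whereas an unrelated word such as `"r a j i o".split()` falls well below the threshold.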

In one embodiment of the present invention, the computer system further includes a registration unit that determines whether to register the phoneme string of the first utterance in the storage unit when the phoneme string of the second utterance is similar to the phoneme string of the first utterance.

In one embodiment of the present invention, the registration unit refers to a policy for determining whether to register the phoneme string of the first utterance in the storage unit.

In one embodiment of the present invention, the policy is based on at least one of the level of the noise ratio, the frequency of use of the word or phrase, and the arrangement of the phoneme string.
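A registration policy of this kind might look like the following sketch. The source names only the three criteria; the field names, the concrete limits, and the choice to require all three checks at once (the text says "at least one of") are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RegistrationPolicy:
    # All limits below are hypothetical defaults, not values from the source.
    max_noise_ratio: float = 0.3   # reject recordings that are too noisy
    min_use_count: int = 2         # only learn variants of commands in real use
    min_phonemes: int = 2          # reject degenerate, very short phoneme strings


@dataclass(frozen=True)
class Candidate:
    phonemes: Tuple[str, ...]
    noise_ratio: float             # assumed to be normalized to 0..1
    use_count: int                 # how often the target word or phrase is used


def should_register(candidate: Candidate,
                    policy: RegistrationPolicy = RegistrationPolicy()) -> bool:
    """Decide whether the failed first utterance is worth storing."""
    if candidate.noise_ratio > policy.max_noise_ratio:
        return False
    if candidate.use_count < policy.min_use_count:
        return False
    if len(candidate.phonemes) < policy.min_phonemes:
        return False
    return True
```

Keeping the limits in a separate policy object lets a device with little memory tighten them without touching the recognition flow.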

In one embodiment of the present invention, the computer system further includes a deletion unit that determines whether to delete the phoneme string of a registered utterance from the storage unit.

In one embodiment of the present invention, the computer system further includes an inquiry unit that, when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance, asks the user whether to associate the voice of the first utterance with the command or action corresponding to the second utterance.

In one embodiment of the present invention, the association unit associates the voice of the first utterance with the command or action corresponding to the second utterance in response to receiving from the user an instruction to perform the association.

In one embodiment of the present invention, the computer system further includes a selection unit that, when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance, allows a command or action corresponding to the second utterance to be selected for the voice of the first utterance.

In one embodiment of the present invention, allowing the selection includes presenting a list of voice commands.

In one embodiment of the present invention, associating the voice of the first utterance with the command or action corresponding to the second utterance includes registering the voice of the first utterance as a variation of the voice of the second utterance.
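The association step, including the user inquiry for dissimilar utterances and the registration as a variation, can be sketched as follows. The per-command list of variants and the `confirm` callback are assumptions chosen for illustration; the source does not specify a data layout or a dialogue mechanism.

```python
from typing import Callable, Dict, List, Tuple

Phonemes = Tuple[str, ...]


def associate(
    variants: Dict[str, List[Phonemes]],
    first: Phonemes,
    command: str,
    similar: bool,
    confirm: Callable[[str], bool],
) -> bool:
    """Register `first` as a pronunciation variation of `command`.

    When the phoneme strings were judged not similar, the user is asked
    first (inquiry unit); `confirm` is a hypothetical callback that returns
    True if the user wants the association.
    """
    if not similar and not confirm(command):
        return False
    bucket = variants.setdefault(command, [])
    if first not in bucket:          # avoid storing the same variation twice
        bucket.append(first)
    return True


# Hypothetical dictionary: one command with its factory pronunciation.
variants = {"PLAY_CD": [("s", "i", "d", "i")]}
added = associate(variants, ("s", "i", "r", "i"), "PLAY_CD", True, lambda c: False)
declined = associate(variants, ("o", "N", "g", "a", "k", "u"), "PLAY_CD", False, lambda c: False)
```

Here the similar utterance is added without any dialogue, while the dissimilar one is left out because the simulated user declines.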

In one embodiment of the present invention, the computer system further includes:
a second request unit that further requests input of a third utterance when the voice of the input second utterance does not match a voice registered in the storage unit;
a third determination unit that determines whether the voice of the input third utterance matches a voice registered in the storage unit;
a second comparison unit that compares the phoneme string of the third utterance with the phoneme string of the second utterance when the voice of the third utterance matches a voice registered in the storage unit; and
a second association unit that associates the voice of the second utterance with the command or action corresponding to the third utterance when the phoneme string of the third utterance is similar to the phoneme string of the second utterance.
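The third-utterance embodiment repeats the same pattern one step later: the utterance that finally matches is compared with the failed utterance immediately before it. A loop expresses this compactly. This is a sketch under assumptions: exact lookup stands in for recognition, `difflib.SequenceMatcher` with a 0.6 threshold stands in for the comparison, and the function name is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, Optional, Tuple

Phonemes = Tuple[str, ...]


def recognize_with_retries(
    utterances: Iterable[Phonemes],
    registered: Dict[Phonemes, str],
    threshold: float = 0.6,
) -> Optional[str]:
    """Consume utterances until one matches; learn the one just before it."""
    previous: Optional[Phonemes] = None
    for phonemes in utterances:
        command = registered.get(phonemes)
        if command is not None:
            # Compare the matching utterance with the preceding failed one only,
            # as in the second/third utterance pairing described above.
            if previous is not None and \
                    SequenceMatcher(None, phonemes, previous).ratio() >= threshold:
                registered[previous] = command
            return command
        previous = phonemes
    return None


registered = {("r", "a", "j", "i", "o"): "RADIO_ON"}
result = recognize_with_retries(
    [("t", "e", "r", "e", "b", "i"), ("r", "a", "j", "o"), ("r", "a", "j", "i", "o")],
    registered,
)
```

In this run the second utterance is learned as a variation of the radio command, and the unrelated first utterance is discarded.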

The present invention also provides a method for speech recognition of utterance input. The method causes a computer system to execute the following steps:
a first determination step of determining, in response to input of a first utterance, whether the voice of the input first utterance matches a voice registered in a storage unit;
a step of requesting input of a second utterance when the voice of the input first utterance does not match a voice registered in the storage unit;
a second determination step of determining whether the voice of the input second utterance matches a voice registered in the storage unit;
a step of comparing the phoneme string of the second utterance with the phoneme string of the first utterance when the voice of the second utterance matches a voice registered in the storage unit; and
a step of associating the voice of the first utterance with the command or action corresponding to the second utterance when the phoneme string of the second utterance is similar to the phoneme string of the first utterance.

In one embodiment of the present invention, the first determination step further includes a step of generating, together with the first determination, a first phoneme string corresponding to the first utterance.

In one embodiment of the present invention, the first determination step further includes a step of determining whether the generated first phoneme string matches a phoneme string registered in the storage unit.

In one embodiment of the present invention, the method further causes the computer system to execute a step of executing the action corresponding to the first utterance when the voice of the input first utterance matches a voice registered in the storage unit.

In one embodiment of the present invention, the first utterance is a first voice command. In one embodiment of the present invention, the method further causes the computer system to execute a step of executing the action corresponding to the first voice command when the first voice command matches a voice command registered in the storage unit.

In one embodiment of the present invention, the method further causes the computer system to execute a step of storing the voice of the first utterance in a recording unit when the voice of the input first utterance does not match a voice registered in the storage unit.

In one embodiment of the present invention, the method further causes the computer system to execute a step of generating, together with the second determination, a second phoneme string corresponding to the second utterance.

In one embodiment of the present invention, the second determination step includes a step of further determining whether the generated second phoneme string matches a phoneme string registered in the storage unit.

In one embodiment of the present invention, the second utterance is a second voice command. In one embodiment of the present invention, the method further causes the computer system to execute a step of executing the action corresponding to the second utterance when the voice of the input second utterance matches a voice registered in the storage unit.

In one embodiment of the present invention, the second utterance is a second voice command. In one embodiment of the present invention, the method further causes the computer system to execute a step of executing the action corresponding to the second voice command when the second voice command matches a voice command registered in the storage unit.

In one embodiment of the present invention, the method further causes the computer system to execute a step of determining whether to register the voice of the first utterance in the storage unit when the phoneme string of the second utterance is similar to the phoneme string of the first utterance.

In one embodiment of the present invention, the method further causes the computer system to execute a step of determining whether to delete the phoneme string of a registered utterance from the storage unit.

In one embodiment of the present invention, the method further causes the computer system to execute a step of asking the user, when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance, whether to associate the voice of the first utterance with the command or action corresponding to the second utterance.

In one embodiment of the present invention, the associating step includes a step of associating the voice of the first utterance with the command or action corresponding to the second utterance in response to receiving from the user an instruction to perform the association.

In one embodiment of the present invention, the method further causes the computer system to execute a step of allowing, when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance, a command or action corresponding to the second utterance to be selected for the voice of the first utterance.

In one embodiment of the present invention, the step of allowing the selection includes a step of presenting a list of voice commands.

In one embodiment of the present invention, the step of associating the voice of the first utterance with the command or action corresponding to the second utterance includes a step of registering the voice of the first utterance as a variation of the voice of the second utterance.

In one embodiment of the present invention, the method further causes the computer system to execute the following steps:
a step of further requesting input of a third utterance when the voice of the input second utterance does not match a voice registered in the storage unit;
a step of determining whether the voice of the input third utterance matches a voice registered in the storage unit;
a step of comparing the phoneme string of the third utterance with the phoneme string of the second utterance when the voice of the third utterance matches a voice registered in the storage unit; and
a step of associating the voice of the second utterance with the command or action corresponding to the third utterance when the phoneme string of the third utterance is similar to the phoneme string of the second utterance.

The present invention also provides a method for speech recognition of utterance input. The method causes a computer system to execute the following steps:
a step of determining, in response to input of a first utterance, whether the voice of the input first utterance matches a voice registered in a storage unit;
a step of executing the action corresponding to the first utterance when the voice of the input first utterance matches a voice registered in the storage unit;
a step of requesting input of a second utterance when the voice of the input first utterance does not match a voice registered in the storage unit;
a step of determining whether the voice of the input second utterance matches a voice registered in the storage unit;
a step of comparing the phoneme string of the second utterance with the phoneme string of the first utterance when the voice of the second utterance matches a voice registered in the storage unit;
a step of associating the voice of the first utterance with the command or action corresponding to the second utterance when the phoneme string of the second utterance is similar to the phoneme string of the first utterance;
a step of asking the user whether to associate the voice of the first utterance with the command corresponding to the second utterance when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance; and,
when the voice of the second utterance does not match a voice registered in the storage unit,
a step of further requesting input of a third utterance;
a step of determining whether the voice of the input third utterance matches a voice registered in the storage unit;
a step of comparing the phoneme string of the third utterance with the phoneme string of the second utterance when the voice of the third utterance matches a voice registered in the storage unit; and
a step of associating the voice of the second utterance with the command or action corresponding to the third utterance when the phoneme string of the third utterance is similar to the phoneme string of the second utterance.

The present invention also provides a computer program for speech recognition of utterance input. The computer program causes a computer system to execute each step of any one of the methods described above.

A computer system according to an embodiment of the present invention performs speech recognition and phoneme acquisition simultaneously, and does so multiple times. It thereby additionally registers the differing phoneme string obtained from a failed attempt under the same command as the voice command of the successful recognition. Because this additional registration takes place at the same time as speech recognition, the user's voice registration work is simplified. The user also does not need to keep track of which utterances are registered for a given voice command. In addition, because the computer system can register user-specific variations of words or phrases after it has shipped, speech recognition accuracy improves. And because words or phrases can be added later, the speech recognition dictionary can be kept small when the computer system is shipped.

本発明の実施形態において、「発話」とは、ユーザによってコンピュータ・システムに入力される発話をいう。「発話入力」は例えば、コンピュータ・システムに接続された音声入力部、例えばマイクロフォン又はサウンドカードを通して入力される。
本発明の実施形態において、「記憶部」は、音声及び／又は音素列についてのデータを含む。「記憶部」は、データベースであってよい。該記憶部は、コンピュータ・システム内若しくはコンピュータ・システム外の記憶装置、又はコンピュータ・システムにネットワークを介して接続されたサーバ、プロキシの記憶装置に配置されうる。
本発明の実施形態において、「音声認識」の手法として、慣用の技術が使用されうる。例えば、音声認識技術として、ＩＢＭＥｍｂｅｄｄｅｄＶｉａＶｏｉｃｅ（ＥＶＶ）を使用することができる。音声認識では、認識エンジン、単語又はフレーズ辞書、音響モデルを使用して、入力された発話について単語又はフレーズの認識の処理が行われる。
本発明の実施形態において、「音素列」は、音韻論で、任意の個別言語において意味の区別（弁別）に用いられる最小の音の単位を指す。音素は/ /で囲んで表記する。音素に使う記号は自由であり、各言語固有の音素文字が使われることもあるし、国際音声字母が使われることもある。
本発明の実施形態において、音素列間の比較の手法として、慣用の技術が使用されうる。例えば、音素列が似ているかどうかの判定は、例えばＩＢＭＥＶＶにおける音素列を比較するＡＰＩを使用することができる。音素列を比較するＡＰＩは例えば、esrCompareBaseformsである。
本発明の実施形態において、音声が一致するとは、認識結果の単語又はフレーズが、あらかじめ定義された閾値以上又はそれを超える値（スコア）で辞書（記憶域内の単語又はフレーズの集合）内の単語又はフレーズと一致することをいう。所定の閾値は、言語によっても変わりうる。また、音声が一致したとは、該音声に対応するアクションが実行されることで判定することも可能である。
本発明の実施形態において、音声が一致しないとは、上記あらかじめ定義された閾値以上又はそれを超える単語又はフレーズが上記辞書内にみつからないことをいう。また、音声が一致しないことは、該音声に対応するアクションが実行されなかったことで判定することも可能である。
本発明の実施形態において、音素列が似ているとは、音素列中の音素の一致度が、あらかじめ定義された閾値以上又はそれよりも高いことをいう。所定の閾値は、言語によっても変わりうる。
本発明の実施形態において、音素列が似ていないとは、音素列中の音素の一致度が、あらかじめ定義された閾値以下又はそれよりも低いことをいう。
本発明の実施形態において、「コンピュータ・システム」は、車載機器、例えばカーナビゲーション・システム、ハンドヘルド・コンピュータ、パーソナル・デジタル・アシスタント、携帯電話又はカーナビゲーション・システム以外のナビゲーション・システムを含むが、これらに制限されない。 In the embodiment of the present invention, “utterance” refers to an utterance input by a user to a computer system. “Speech input” is input through, for example, a voice input unit connected to a computer system, such as a microphone or a sound card.
In the embodiments of the present invention, the “storage unit” contains data on voices and/or phoneme strings. The “storage unit” may be a database. The storage unit may be located in a storage device inside or outside the computer system, or in a storage device of a server or proxy connected to the computer system via a network.
In the embodiments of the present invention, conventional techniques can be used for “speech recognition”. For example, IBM Embedded ViaVoice (EVV) can be used as the speech recognition technology. In speech recognition, a recognition engine, a word or phrase dictionary, and an acoustic model are used to recognize words or phrases in an input utterance.
In the embodiments of the present invention, a “phoneme string” is a sequence of phonemes, where a phoneme is, in phonology, the smallest unit of sound used to distinguish meaning in a given language. Phonemes are written enclosed in slashes (/ /). The symbols used for phonemes are arbitrary: a language-specific phonemic script may be used, or the International Phonetic Alphabet may be used.
In the embodiments of the present invention, conventional techniques can be used to compare phoneme strings. For example, an API of IBM EVV that compares phoneme strings can be used to determine whether phoneme strings are similar. One such API is esrCompareBaseforms.
In the embodiments of the present invention, voices “match” when the word or phrase in the recognition result matches a word or phrase in the dictionary (the set of words or phrases in the storage area) with a value (score) at or above a predefined threshold. The threshold may vary by language. A match can also be determined from the fact that the action corresponding to the voice was executed.
In the embodiments of the present invention, voices “do not match” when no word or phrase with a score at or above the predefined threshold is found in the dictionary. A mismatch can also be determined from the fact that the action corresponding to the voice was not executed.
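The match and no-match definitions above can be illustrated with a minimal sketch. It assumes the recognizer has already produced one score per dictionary entry; the function name `best_match`, the score scale, and the threshold values are hypothetical and are not taken from this description or from IBM EVV.

```python
from typing import Optional


def best_match(scores: dict[str, float], threshold: float) -> Optional[str]:
    """Return the best-scoring dictionary entry, or None when no entry reaches the threshold."""
    if not scores:
        return None  # empty dictionary: nothing can match
    entry, score = max(scores.items(), key=lambda item: item[1])
    # "Match" means a score at or above the predefined threshold.
    return entry if score >= threshold else None


if __name__ == "__main__":
    # Hypothetical scores for one utterance against two dictionary entries.
    print(best_match({"use highway": 0.91, "air conditioner": 0.12}, threshold=0.7))
    print(best_match({"use highway": 0.55, "air conditioner": 0.12}, threshold=0.7))
```

Because the description notes that the threshold may vary by language, the sketch takes it as a parameter rather than a constant.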
In the embodiment of the present invention, that phoneme strings are similar means that the degree of coincidence of phonemes in the phoneme string is equal to or higher than a predefined threshold. The predetermined threshold may vary depending on the language.
In the embodiments of the present invention, phoneme strings are “not similar” when the degree of coincidence of the phonemes in the phoneme strings is below the predefined threshold.
In the embodiments of the present invention, the “computer system” includes, but is not limited to, in-vehicle devices such as a car navigation system, handheld computers, personal digital assistants, mobile phones, and navigation systems other than car navigation systems.

Embodiments of the present invention are described below with reference to the drawings. It should be understood that these embodiments are intended to illustrate preferred aspects of the present invention and are not intended to limit the scope of the invention to what is shown here. Throughout the following drawings, the same reference signs denote the same objects unless otherwise noted.

FIG. 1A shows an outline of a speech recognition system according to an embodiment of the present invention.
The speech recognition system (101) can be a navigation system, a handheld computer, a personal digital assistant, or a mobile phone.
The speech recognition system (101) includes a recognition processing unit (102), a determination unit (103), a storage unit (104), a request unit (105), a comparison unit (106), an association unit (107), an inquiry unit (108), a selection unit (109), a registration unit (110), a policy (111), an execution unit (112), and a deletion unit (113).
The recognition processing unit (102) includes, for example, a recognition engine, a word or phrase dictionary, and an acoustic model. When an utterance is input through a voice input unit (not shown), for example a microphone or a sound card, the recognition engine performs word or phrase recognition on the input utterance using the word or phrase dictionary and the acoustic model, and generates a phoneme string at the same time. Recognition here means matching the utterance against words or phrases. The recognized voice and the generated phoneme string are associated with each other, for example, when the phoneme string is generated.
The determination unit (103) determines whether the voice data processed by the recognition processing unit (102) matches the voice data stored in the storage unit (104). Specifically, when the user inputs a first utterance, for example a voice command, the determination unit (103) determines whether the phoneme string of the voice command matches a phoneme string registered in the storage unit (104). Similarly, when the user inputs a second or subsequent utterance, for example a voice command, the determination unit (103) determines whether the phoneme string of that voice command matches a phoneme string registered in the storage unit (104).
The storage unit (104) stores voice data and phoneme string data for use in the determination unit (103). The storage unit (104) can be, for example, a database.
The request unit (105) requests input of a second utterance when the voice of the utterance first input by the user (the first utterance) does not match a voice registered in the storage unit (104). Similarly, the request unit (105) requests input of a third utterance when the voice of the utterance input by the user the second time (the second utterance) does not match a voice registered in the storage unit (104). Thereafter, the request unit (105) asks the user to input an utterance up to a preset number of times.
The comparison unit (106) compares the phoneme string of the second utterance with the phoneme string of the first utterance when the voice of the second utterance matches a voice registered in the storage unit (104). Similarly, the comparison unit (106) compares the phoneme string of the n-th utterance with the phoneme string of the (n-1)-th utterance when the voice of the n-th utterance matches a voice registered in the storage unit (104).
The association unit (107) associates the voice of the first utterance with the command or action corresponding to the second utterance when the phoneme string of the second utterance is similar to the phoneme string of the first utterance. Similarly, for example, when the phoneme string of the n-th utterance is similar to the phoneme string of the (n-1)-th or an earlier utterance, the association unit (107) associates the voice of the (n-1)-th or earlier utterance with the command corresponding to the n-th utterance.
The inquiry unit (108) asks the user whether to associate the voice of the first utterance with the command corresponding to the second utterance when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance. Similarly, when the phoneme string of the n-th utterance is not similar to the phoneme string of the (n-1)-th utterance, the inquiry unit (108) asks the user whether to associate the voice of the (n-1)-th utterance with the command corresponding to the n-th utterance.
The selection unit (109) allows a command corresponding to the voice of the first utterance to be selected when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance. Similarly, for example, when the phoneme string of the n-th utterance is not similar to the phoneme string of the (n-1)-th or an earlier utterance, the selection unit (109) allows a command corresponding to the voice of the (n-1)-th or earlier utterance to be selected.
When the phoneme string of the second utterance is similar to the phoneme string of the first utterance, the registration unit (110) refers to a policy for determining whether to register the voice data of the first utterance, in particular the phoneme string of the first utterance, in the storage unit (104), and determines in accordance with that policy whether to register the voice data of the first utterance in the storage unit (104).
The policy (111) is based on, for example, the noise ratio, the frequency of use of words or phrases, or the arrangement of phonemes in the phoneme string, but is not limited to these.
The execution unit (112) executes an action corresponding to the first utterance when the input voice of the first utterance matches the voice registered in the storage unit (104). In particular, the execution unit (112) executes an action corresponding to the first voice command when the input first voice command matches the voice registered in the storage unit (104). Similarly, the execution unit (112) executes an action corresponding to the nth utterance when the input nth utterance speech matches the speech registered in the storage unit (104). In particular, the execution unit (112) executes an action corresponding to the nth voice command when the inputted nth voice command matches the voice registered in the storage unit (104).
The action differs depending on the product in which the speech recognition system (101) is implemented. For example, when the speech recognition system (101) is a car navigation system, actions include, but are not limited to, displaying a search window, displaying a route to a given destination according to a search keyword, invoking a DVD playback function or a music playback function, and turning the power of the car navigation system on or off.
The deletion unit (113) allows the user to delete voice data or phoneme string data registered in the storage unit (104). Deletion is provided because the capacity of the storage unit that stores the voice data or phoneme string data is limited, and because an excessive increase in phoneme string variations could instead lower the recognition rate.
The targets of deletion are voice data or phoneme string data with a low frequency of use. The speech recognition system (101) may include a presentation unit that presents such voice data or phoneme string data to the user as a list. The presentation may display, together with each voice command that is a candidate for deletion, the date on which the voice command was last used and the number of times it has been used. The display is presented, for example, in a window. Alternatively, instead of displaying the candidate voice commands on the display, the system may play back each voice command as audio and ask the user each time whether to delete it.
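The list of deletion candidates described above could be assembled as in the following sketch. The record layout (`RegisteredCommand`), the `max_uses` cutoff, and the sort order are assumptions made for illustration; the description only says that rarely used entries are presented together with the last-used date and the use count.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RegisteredCommand:
    phrase: str      # registered voice command (hypothetical field names)
    use_count: int   # number of times the command has been used
    last_used: date  # date on which the command was last used


def deletion_candidates(entries, max_uses):
    """Return entries used at most max_uses times, least used first, then oldest first."""
    rare = [entry for entry in entries if entry.use_count <= max_uses]
    return sorted(rare, key=lambda entry: (entry.use_count, entry.last_used))


if __name__ == "__main__":
    registered = [
        RegisteredCommand("go home", 42, date(2008, 9, 1)),
        RegisteredCommand("today's fortune", 1, date(2008, 3, 2)),
        RegisteredCommand("use highway", 1, date(2008, 1, 15)),
    ]
    for entry in deletion_candidates(registered, max_uses=2):
        print(entry.phrase, entry.last_used.isoformat(), entry.use_count)
```

The user would then confirm each deletion, either from the displayed list or after audio playback, as the paragraph above describes.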

FIG. 1B shows a conceptual diagram of voice, phonemes, and actions for voice commands according to an embodiment of the present invention.
A voice command is input to the voice input unit (122) of the speech recognition system (FIG. 1A, 101). The recognition processing unit (123) performs, on the input voice, both recognition as a word or phrase and recognition of a phoneme string. The storage unit (124), for example a memory, stores the voice command “use highway” recognized by the recognition processing unit (123). The execution unit (FIG. 1A, 112) checks whether a command corresponding to the recognized voice command exists in the command group (125). The command group is stored in advance in a predetermined storage unit, for example a database. When a command corresponding to the recognized voice command is found in the command group, the execution unit finds the action “display a route to the destination that uses the highway”, which corresponds to the voice command “use highway”, in the action group (126) and executes that action.
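The two-stage lookup in FIG. 1B (recognized voice command, then command group, then action group) might be sketched as follows. The identifier `SHOW_HIGHWAY_ROUTE`, the dictionary layout, and the string returned by the action are hypothetical stand-ins; a real product would call into its navigation software instead.

```python
from typing import Optional

# Command group (125): recognized phrase -> command identifier (hypothetical).
COMMAND_GROUP = {"use highway": "SHOW_HIGHWAY_ROUTE"}

# Action group (126): command identifier -> callable that performs the action.
ACTION_GROUP = {
    "SHOW_HIGHWAY_ROUTE": lambda: "display a route to the destination that uses the highway",
}


def execute(recognized_phrase: str) -> Optional[str]:
    """Look up the command for a recognized phrase and run its action, if any."""
    command = COMMAND_GROUP.get(recognized_phrase)
    if command is None:
        return None  # no corresponding command in the command group
    return ACTION_GROUP[command]()


if __name__ == "__main__":
    print(execute("use highway"))
    print(execute("open sunroof"))
```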

FIG. 2A shows an outline (part 1) of a method for speech recognition according to an embodiment of the present invention.
In step 201, the user inputs the first utterance (first utterance) through the voice input unit. The speech recognition system (FIG. 1A, 101) starts speech recognition. The start of speech recognition may be before or after the first utterance is input. The input utterance is converted into an electrical voice signal and passed to the recognition engine. The recognition engine recognizes a word or phrase with respect to the input speech signal and simultaneously generates a phoneme string. The recognized word or phrase for the first utterance and the generated phoneme sequence are stored in memory or a hard disk drive or solid state disk.
For example, when the user wants to travel to a destination by expressway, the user says “コーソクドーロオリヨウ” (roughly “kōsokudōro o riyō”, “use highway”, pronounced with long vowels) to the car navigation system.
In step 202, in response to the input of the first utterance, the voice recognition system determines whether or not the input voice of the first utterance matches the voice registered in the storage unit.
In the above example, suppose that, for “use highway”, the phoneme string of the first utterance is recognized as “k o o: s o k u d o o: r o o: r i y o o:”. Suppose, on the other hand, that the phoneme string registered in the storage unit is “k o u s o k u d o u r o o r i y o u”. In this example, because each “u” in the first utterance was either recognized as a long-vowel mark (“ー”) or not recognized at all, the phoneme string of the first utterance, “k o o: s o k u d o o: r o o: r i y o o:”, is determined not to match the registered phoneme string “k o u s o k u d o u r o o r i y o u”.
In step 203, if there is no such match, the speech recognition system (101) requests the user to input a second utterance. The request can be made, for example, by voice guidance or by a message on a display device connected to the speech recognition system. In addition, the voice or phoneme string of the first utterance is stored in memory, on a hard disk drive, or on a solid state disk.
For example, the user says “コウソクドウロオリヨウ” (“kousokudouro o riyou”, with each “ou” articulated) to the car navigation system.
In step 204, the speech recognition system (101) executes an action corresponding to the first utterance when there is a match.
In step 205, the user inputs the second utterance through the voice input unit. The input utterance is converted into an electrical voice signal and passed to the recognition engine. The recognition engine recognizes a word or phrase in the voice signal and generates a phoneme string at the same time. The recognized word or phrase and the generated phoneme string for the second utterance are stored in memory, on a hard disk drive, or on a solid state disk.
In the above example, suppose that, for “use highway”, the phoneme string of the second utterance is recognized as “k o u s o k u d o u r o o r i y o u”.
In step 206, in response to the input of the second utterance, the voice recognition system (101) determines whether or not the input voice of the second utterance matches the voice registered in the storage unit.
In the above example, the phoneme string of the second utterance, “k o u s o k u d o u r o o r i y o u”, matches the phoneme string “k o u s o k u d o u r o o r i y o u” registered in the storage unit.
In step 207, if there is such a match, the speech recognition system (101) compares the phoneme string of the second utterance with the phoneme string of the first utterance.
In the above example, the phoneme string of the second utterance, “k o u s o k u d o u r o o r i y o u”, is compared with the phoneme string of the first utterance, “k o o: s o k u d o o: r o o: r i y o o:”.
In step 208, if there is such a match, the speech recognition system (101) also executes the action corresponding to the second utterance.
In this example, the phoneme string of the second utterance matches the phoneme string registered in the storage unit. Accordingly, as the action corresponding to “use highway”, a route to the destination that uses the highway is displayed, for example, on the display device of the car navigation system.
In step 209, if there is no such match, the speech recognition system (101) proceeds to the steps shown in FIG. 2B.
In step 210, the speech recognition system (101) associates the speech of the first utterance with the command or action corresponding to the second utterance if the phoneme sequence of the second utterance is similar to the phoneme sequence of the first utterance. In particular, the first phoneme string is associated with a command or action corresponding to the second voice command. This association makes it possible to execute an action corresponding to the second voice command by performing the same utterance as the first voice command thereafter. Therefore, the same utterance as that of the second voice command is not required thereafter.
In the above example, the phoneme string of the second utterance, “k o u s o k u d o u r o o r i y o u”, differs from the phoneme string of the first utterance, “k o o: s o k u d o o: r o o: r i y o o:”, only in that “u” and “o” in the second utterance correspond to “o:” in the first, so the two phoneme strings are determined to be similar. Therefore, the phoneme string of the first utterance, “k o o: s o k u d o o: r o o: r i y o o:”, is associated with the command or action corresponding to the phoneme string of the second utterance, “k o u s o k u d o u r o o r i y o u”. Alternatively, the phoneme string of the first utterance is associated with the phoneme string “k o u s o k u d o u r o o r i y o u” itself.
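As a rough illustration of how the similarity of the two example phoneme strings could be quantified, the sketch below uses a normalized Levenshtein (edit) distance over phoneme tokens. This is a swapped-in, generic technique: the description names the EVV API esrCompareBaseforms but does not give its signature or scoring method, so none of that is reproduced here. The function names and the threshold of 0.7 are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phoneme tokens."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x by y (or keep)
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Degree of coincidence in [0, 1]; 1.0 means identical phoneme strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    first = "k o o: s o k u d o o: r o o: r i y o o:".split()
    second = "k o u s o k u d o u r o o r i y o u".split()
    threshold = 0.7  # hypothetical; the description says it may vary by language
    print(similarity(first, second) >= threshold)
```

With these inputs the two 18-token strings differ in four positions, so the degree of coincidence is 14/18, which clears the assumed threshold; “k u u: r a a:” and “e a k o n” from the air-conditioner example in step 211 fall well below it.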
In step 211, when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance, the user is asked whether to associate the voice of the first utterance with the command or action corresponding to the second utterance. For example, suppose that the voice command “air conditioner” is registered in the storage unit, that the first voice command is “cooler”, and that the second voice command is “air conditioner”. The second voice command matches the word or phrase registered in the storage unit, but the phoneme string of the second voice command, “e a k o n”, is not similar to the phoneme string of the first voice command, “k u u: r a a:”. However, because the action intended by the voice command “cooler” and the action performed for the voice command “air conditioner” are the same, namely “turn on the air conditioner”, the user can associate the first voice command “cooler” with “turn on the air conditioner”, the action of the second voice command.
Alternatively, in step 211, when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance, the speech recognition system (101) allows a command corresponding to the voice of the first utterance to be selected. The selection is made, for example, by the user choosing the desired voice command from a presented list of voice commands. When the selection is made, the association unit registers the voice of the first utterance as a variation of the voice of the second utterance (step 217). This registration gives the user, when the pronunciations differ, an opportunity to confirm and register the first voice command as a synonym of the second voice command.
In addition, when registering the voice of the first utterance as a variation of the voice of the second utterance, the registration can be performed according to a criterion for determining whether to register. The criteria can be stored in a policy.
The judgment criteria are as follows.
・Level of the noise ratio.
  - For example, is the S/N ratio higher than a predetermined value? The S/N ratio can vary with the environment in which the utterance is made.
・Frequency of use of the word or phrase to be registered.
  - An address far from the current location is unlikely to be used often.
  - It may be acceptable to add only prefecture names and the like.
  - Common nouns versus proper nouns.
    - “Shrine” is a common noun and is likely to be used frequently, whereas the proper noun “Kumano Shrine” is likely to be used infrequently.
  - Well-known proper nouns versus others.
    - “McDonald's” (trademark) is likely to be used frequently, whereas “McDonald's Konan-Chuo branch” is likely to be used infrequently.
  - Frequently used commands versus others.
    - In a car navigation system, “go home” is likely to be used frequently, whereas “what is today's fortune” is likely to be used infrequently.
・Arrangement of phonemes in the phoneme string.
  - If an impossible arrangement of phonemes is detected, the phoneme string is not registered.
    - For example, “んんんんん” (“n n n n n”, the same phoneme three or more times in a row). Which arrangements are impossible depends on the language.
・Use of location information, personal preference information, and the like.
  - Frequency is judged from the distance from the current location and the distance from the home location.
  - Registered points.
  - Favorite foods and frequently visited shops.
Further, a step (not shown) is provided that allows registered voice data or phoneme string data to be deleted.
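Two of the criteria listed above lend themselves to a short sketch: the S/N check and the rejection of impossible phoneme runs. The function names, the decibel unit, and the default limits are hypothetical; the description only states that such criteria can be stored in the policy, and the frequency-of-use and location criteria are left out here.

```python
def has_impossible_run(phonemes, max_run=2):
    """True if the same phoneme occurs more than max_run times in a row."""
    run = 1
    for previous, current in zip(phonemes, phonemes[1:]):
        run = run + 1 if current == previous else 1
        if run > max_run:
            return True
    return False


def should_register(phonemes, snr_db, min_snr_db=10.0):
    """Register only clean recordings whose phoneme string is plausible.

    Assumption: a higher S/N ratio means a cleaner recording, so values
    below min_snr_db are rejected.
    """
    return snr_db >= min_snr_db and not has_impossible_run(phonemes)


if __name__ == "__main__":
    print(should_register("k o o: s o k u d o o: r o o: r i y o o:".split(), snr_db=18.0))
    print(should_register(["n", "n", "n", "n", "n"], snr_db=18.0))
```

Because the set of impossible arrangements depends on the language, `max_run` is a parameter rather than a fixed rule.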

FIG. 2B shows an outline (part 2) of the method for speech recognition according to an embodiment of the present invention.
In step 221, if the determination made in step 206 finds no match, the speech recognition system (FIG. 1A, 101) requests the user to input a third utterance. The request can be made, for example, by voice guidance or by a message on a display device connected to the speech recognition system. In addition, the voice or phoneme string of the first utterance is stored in memory, on a hard disk drive, or on a solid state disk.
In step 222, the user inputs the third utterance through the voice input unit. If the input utterance is analog, it may be converted into digital data by an analog-to-digital converter. The recognition engine recognizes a word or phrase in the input utterance and generates a phoneme string at the same time. The recognized word or phrase and the generated phoneme string for the third utterance are stored in memory, on a hard disk drive, or on a solid state disk.
In step 223, in response to the input of the third utterance, the voice recognition system (101) determines whether or not the input voice of the third utterance matches the voice registered in the storage unit.
In step 224, if there is a match, the speech recognition system (101) compares the phoneme sequence of the third utterance with the phoneme sequence of the second utterance.
In step 225, the speech recognition system (101) executes an action corresponding to the third utterance if there is a match.
In step 226, the speech recognition system (101) returns to the beginning (A) of the step in FIG. 2B if there is no match.
In step 227, the speech recognition system (101) associates the speech of the second utterance with the command corresponding to the third utterance if the phoneme sequence of the third utterance is similar to the phoneme sequence of the second utterance. In particular, the second voice command is associated with an action corresponding to the third voice command. With this association, by performing the same utterance as the second utterance later, it is possible to execute the action corresponding to the third voice command without requiring the subsequent utterance.
In step 228, the speech recognition system (101) allows a command corresponding to the speech of the first utterance to be selected if the phoneme sequence of the second utterance is not similar to the phoneme sequence of the first utterance. The selection is performed, for example, by the user selecting a desired voice command from the list of presented voice commands. When the selection is made, the associating unit registers the voice of the first utterance as a variation of the voice of the second utterance (step 227).
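Taken together, the steps of FIG. 2A and FIG. 2B amount to a retry loop: keep asking until an utterance matches, then link earlier failed utterances that are similar to the matching one. The sketch below is one possible reading under several assumptions: the registry is a plain dictionary keyed by phoneme tuples, `is_similar` is supplied by the caller, every earlier failed utterance is checked (following the "(n-1)-th or earlier" wording of the association unit), and the inquiry and selection steps for dissimilar utterances are omitted.

```python
def recognize_with_retries(utterances, registry, is_similar, max_attempts=3):
    """Return the matched command, registering similar failed utterances as variations.

    utterances : phoneme tuples in the order the user spoke them
    registry   : dict mapping phoneme tuple -> command (updated in place)
    is_similar : callable(earlier, matched) -> bool
    """
    failed = []
    for phonemes in utterances[:max_attempts]:
        command = registry.get(phonemes)
        if command is None:
            failed.append(phonemes)  # no match: keep it and request another utterance
            continue
        for earlier in failed:
            if is_similar(earlier, phonemes):
                registry[earlier] = command  # associate the variation with the command
            # Dissimilar utterances would trigger the inquiry/selection step (omitted).
        return command  # the action for this command would be executed here
    return None  # preset number of attempts exhausted without a match


if __name__ == "__main__":
    registry = {tuple("k o u s o k u d o u r o o r i y o u".split()): "USE_HIGHWAY"}
    first = tuple("k o o: s o k u d o o: r o o: r i y o o:".split())
    second = tuple("k o u s o k u d o u r o o r i y o u".split())
    same_length = lambda a, b: len(a) == len(b)  # toy similarity test for the demo
    print(recognize_with_retries([first, second], registry, same_length))
    print(registry.get(first))
```

After the call, the first utterance's phoneme string resolves to the same command, so repeating it later matches on the first attempt.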

FIG. 3A shows an outline (part 1) of the mechanism of speech recognition, which is an embodiment of the present invention.
FIG. 3A illustrates speech recognition using a grammar (commands).
Command recognition is commonly used in in-vehicle devices.
Assume that the utterance is “atama ga itai”, for example. When the utterance is input, the signal processing unit (301) converts the utterance into an electric voice signal and passes it to the recognition engine (302). The recognition engine (302) recognizes words or phrases in the speech signal using the dictionary (303) and the acoustic model (304).
The dictionary (303) may be a recognition dictionary, that is, a collection of grammars. A grammar is written, for example, as “<complaint> = <bodypart> hurts”. In the speech signal, the <bodypart> portion is recognized using the dictionary (303), in which the words for head, shoulder, arm, and leg are registered.
The acoustic model (304) uses acoustic features, which describe the frequency characteristics of each phoneme to be recognized. The acoustic model can be represented by a hidden Markov model (HMM) whose output probabilities are Gaussian mixture distributions. An HMM is applicable because a speech signal can be regarded as a piecewise, or short-time, stationary signal.
The recognition engine (302) recognizes that the <bodypart> portion is “head”. The recognition engine (302) then passes the recognized word string, “my head hurts”, to the application program (305).
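The grammar-based recognition described for FIG. 3A can be illustrated with a small sketch. It works on romanized text rather than on a speech signal, so it only shows how a grammar slot is filled from a dictionary; it does not model the acoustic side at all. The names `BODYPARTS`, `COMPLAINT`, and `recognize_complaint` are hypothetical, and the romanizations other than “atama ga itai” are supplied here for illustration.

```python
import re

# Dictionary of <bodypart> entries (romanized Japanese -> English gloss).
BODYPARTS = {"atama": "head", "kata": "shoulder", "ude": "arm", "ashi": "leg"}

# Grammar: <complaint> = <bodypart> ga itai  ("<bodypart> hurts")
COMPLAINT = re.compile(r"^(?P<bodypart>\w+) ga itai$")


def recognize_complaint(utterance):
    """Return the gloss of the body part if the utterance fits the grammar."""
    match = COMPLAINT.match(utterance)
    if match is None:
        return None  # the utterance does not fit the grammar at all
    # The slot is accepted only if the word is registered in the dictionary.
    return BODYPARTS.get(match.group("bodypart"))
```

An utterance that fits the grammar but uses an unregistered word yields no result, which mirrors the limitation of command recognition discussed later in the description.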

FIG. 3B shows an outline (part 2) of the mechanism of speech recognition, which is an embodiment of the present invention.
FIG. 3B illustrates speech recognition using large-vocabulary recognition (dictation).
Assume that the utterance is “posuto-wa-akai-maru”, for example. When the utterance is input, the signal processing unit (311) converts the utterance into an electric voice signal and passes it to the recognition engine (312). The recognition engine (312) recognizes words or phrases for the speech signal using the language model (313) and the acoustic model (314).
The language model (313) uses linguistic features, which express constraints on how phonemes may be ordered. For example, immediately after the utterance “anata” (a n a t a, “you”), an utterance such as “ga” (g a) or “wa” (w a) is highly likely to follow. The language model can be represented by an n-gram or by a context-free grammar. An n-gram is a model that predicts the next word from the immediately preceding N-1 words. A context-free grammar is a formal grammar in which every production rule has the form V → w, where V is a non-terminal symbol and w is a string of terminal and non-terminal symbols. The term “context-free” means that the non-terminal symbol V can be replaced by w regardless of its context. An n-gram is used, for example, when the language to be recognized is large, and a context-free grammar is used, for example, when the language to be recognized is small enough to be covered by hand.
The acoustic model (314) uses acoustic features, which describe the frequency characteristics of each phoneme to be recognized. The acoustic model can be represented by a hidden Markov model (HMM) whose output probabilities are Gaussian mixture distributions. An HMM is applicable because a speech signal can be regarded as a piecewise, or short-time, stationary signal.
The recognition engine (312) recognizes the speech signal as “the post is a red circle” and passes the recognized word string, “the post is a red circle”, to the application program (315).
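The n-gram idea mentioned for the language model (313) can be made concrete with a bigram (N = 2) sketch. The following is an illustrative assumption, not the model used by the recognition engine: `train_bigram` and `next_word_probability` are hypothetical helpers, and probabilities are plain relative frequencies without smoothing.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each preceding word (N = 2)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, nxt):
    """Relative-frequency estimate of P(nxt | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0
```

Trained on a corpus in which “anata” is usually followed by “ga” or “wa”, such a model assigns those continuations a high probability, which is the kind of constraint the description attributes to the language model.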

FIG. 4A shows a flow of processing for speech recognition, which is an embodiment of the present invention.
This processing is an example of operation in a Windows (trademark) application with a character user interface (CUI) that uses IBM EVV.
In response to an Enter input from the user, the main application thread (401) starts its portion of the speech recognition processing (steps 402 to 407).
In step 402, in response to the Enter input from the user, the speech recognition system (101) turns on the microphone to start speech recognition. When the user starts speaking, the utterance is input to the speech recognition system (101).
In step 403, the speech recognition system (101) calls, for example, the speech recognition API esrRecoStartListening to start the processing of the recognition engine. The speech recognition system (101) then waits for speech input.
In step 404, the speech recognition system (101) determines from a second Enter input by the user that the utterance has ended. The speech recognition system (101) turns off the microphone because the speech recognition has ended.
In step 405, the speech recognition system (101) calls, for example, a speech recognition API esrRecoStopListening to end the processing of the recognition engine. The voice recognition system (101) cancels the voice input standby state.
In step 406, the speech recognition system (101) checks the state of the engine through a callback function called from the recognition engine, and waits for acquisition of a speech recognition result.
In step 407, the voice recognition system (101) outputs a voice recognition result. The voice recognition system (101) executes each command when a voice recognition result is output.
Reference numeral 408 in FIG. 4A denotes an ESR recognition engine thread that is sequentially called at regular intervals during the speech recognition processing.
In step 409, once the speech recognition processing starts, the speech recognition system (101) keeps calling a callback function, through which the recognition engine reports its own state (RECOGNITION STATE), until a speech recognition result is produced. The callback function is a user-defined function. RECOGNITION STATE is a state that the recognition engine holds internally.
In step 410, the speech recognition system (101) acquires the state of the recognition engine, and confirms whether or not the target state (for example, recognition completion) is obtained in the function. If the determination result is YES, the process proceeds to step 411. On the other hand, if the determination result is NO, the step proceeds to 412.
In step 411, the voice recognition system (101) shares the signal state with the application side.
In step 412, the speech recognition system (101) completes the process and returns to the beginning.
In step 413, the voice recognition system (101) shares the signal state with the application.
At 414 in FIG. 4A, once the speech recognition processing starts, the speech recognition system (101) keeps calling a callback function, through which the recognition engine reports its own state, until a speech recognition result is produced. The callback function is a user-defined function.
In step 415, the speech recognition system (101) receives the recognition result required by the application, for example a spelling or an ID. The speech recognition system (101) keeps calling the callback function to report the recognition result.
In step 416, the speech recognition system (101) extracts the desired data from the speech recognition result, which is held as a phrase carrying various information (spelling, phoneme string, ID, score, and so on).
In step 417, the speech recognition system (101) completes the process and returns to the beginning.
In step 418, the speech recognition system (101) shares the event with the application.
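The division of work in FIG. 4A between the main application thread and the engine thread, with user-defined callbacks reporting state and results, can be sketched as below. This is a simplified illustration only: `FakeEngine` is a hypothetical stand-in and does not reproduce the IBM EVV API or the signatures of esrRecoStartListening and esrRecoStopListening; the state names and the sample result are likewise invented for the sketch.

```python
import threading


class FakeEngine:
    """Hypothetical stand-in for a recognition engine running on its own thread."""

    def __init__(self, state_callback, result_callback):
        self._state_callback = state_callback
        self._result_callback = result_callback
        self._thread = None

    def start_listening(self):
        # Corresponds loosely to steps 402-403: begin engine processing.
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def stop_listening(self):
        # Corresponds loosely to steps 404-405: end engine processing.
        self._thread.join()

    def _run(self):
        # The engine reports its state through the callback until it completes.
        for state in ("LISTENING", "DECODING", "COMPLETE"):
            self._state_callback(state)
        self._result_callback({"spelling": "eakon", "id": 1})


def recognize_once():
    done = threading.Event()  # state shared between engine and application
    states, result = [], {}

    def on_state(state):
        states.append(state)
        if state == "COMPLETE":
            done.set()

    def on_result(phrase):
        result.update(phrase)

    engine = FakeEngine(on_state, on_result)
    engine.start_listening()
    engine.stop_listening()
    done.wait(timeout=1.0)  # step 406: wait for the recognition result
    return states, result
```

Using an event object to share the completion state keeps the main thread from polling the engine directly, which is one plausible reading of the "shares the signal state with the application" steps.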

For example, suppose that the words “air conditioner” and “radio” are registered in the recognition dictionary and that the user utters “cooler”. Because the recognition dictionary contains no word corresponding to the utterance “cooler”, the recognition result is “no match”. When there is no matching result, however, the recognition engine may output a low-confidence result that the user did not intend.

FIG. 4B shows a flow of processing for generating a phoneme string, which is an embodiment of the present invention.
In response to an Enter input from the user, the main application thread (421) starts its portion of the phoneme string recognition processing (steps 422 to 427).
In step 422, in response to the Enter input from the user, the speech recognition system (101) turns on the microphone to start phoneme string recognition. When the user starts speaking, the utterance is input to the speech recognition system (101).
In step 423, the speech recognition system (101) calls, for example, the speech recognition API esrAcbfStartListening to start the processing of the recognition engine. The speech recognition system (101) then waits for speech input. Acbf stands for acoustic baseform, that is, a phoneme string.
In step 424, the speech recognition system (101) determines from a second Enter input by the user that the utterance has ended. The speech recognition system (101) turns off the microphone because the speech recognition has ended.
In step 425, the voice recognition system (101) calls, for example, a voice recognition API esrRecoStopListening to end the processing of the recognition engine. The voice recognition system (101) cancels the voice input standby state.
In step 426, the speech recognition system (101) checks the state of the engine through a callback function called from the recognition engine, and waits for acquisition of a phoneme string recognition result.
In step 427, the speech recognition system (101) outputs the recognition result of the phoneme string. The speech recognition system (101) executes each command when the recognition result of the phoneme string is output.
Reference numeral 428 in FIG. 4B denotes an ESR recognition engine thread, which is sequentially called at regular intervals during the phoneme string recognition processing.
In step 429, once the speech recognition processing starts, the speech recognition system (101) keeps calling a callback function, through which the recognition engine reports its own state (RECOGNITION STATE), until a speech recognition result is produced. The callback function is a user-defined function. RECOGNITION STATE is a state that the recognition engine holds internally.
In step 430, the speech recognition system (101) acquires the state of the recognition engine, and confirms whether the target state (for example, recognition completion) is obtained in the function. If the determination result is YES, the process proceeds to step 431. On the other hand, if the determination result is NO, the step proceeds to 432.
In step 431, the voice recognition system (101) shares the signal state with the application side.
In step 432, the speech recognition system (101) completes the process and returns to the beginning.
In step 433, the speech recognition system (101) shares the signal state with the application.
At 434 in FIG. 4B, once the phoneme string recognition processing starts, the speech recognition system (101) keeps calling a callback function, through which the recognition engine reports its own state, until a phoneme string recognition result is produced. The callback function is a user-defined function.
In step 435, the speech recognition system (101) receives the recognition result required by the application, for example a phoneme or an ID. The speech recognition system (101) keeps calling the callback function to report the recognition result.
In step 436, the speech recognition system (101) extracts the desired data from the phoneme string recognition result, which is held as a phrase carrying various information (spelling, phoneme string, ID, score, and so on).
In step 437, the speech recognition system (101) completes the process and returns to the beginning.
In step 438, the speech recognition system (101) shares the event with the application.

For example, suppose that the user utters “cooler”. In this case, the phoneme string “k u u: r a a:” is generated for “cooler”.

FIG. 5A shows a comparison between phoneme strings, which is an embodiment of the present invention.
In step 501, the speech recognition system (101) compares the first phoneme string with the second phoneme string. How phoneme strings are compared depends on the speech recognition system, so no single method applies; one simple method is to compare the phonemes in the two strings and count how many match. Alternatively, the phoneme strings are compared using, for example, the API for comparing phoneme strings shown in FIG. 5B.
In step 502, if the first phoneme string and the second phoneme string are similar (YES), the process proceeds to step 503. On the other hand, if the first phoneme string is not similar to the second phoneme string (NO), the process proceeds to step 505.
Whether or not the first phoneme string is similar to the second phoneme string is determined according to the following criteria.
Judgment criterion: whether the degree of matching of the phonemes in the phoneme strings is at or above a predefined threshold, or below it.
In step 503, the speech recognition system (101) evaluates sound fluctuation. Sound fluctuation information is stored in the storage unit and records, for each language, how phoneme strings tend to change. For Japanese, for example, it records that the phoneme string “o u” is likely to change to the phoneme string “o o:”. Fluctuation between phoneme strings is determined using, for example, this per-language information. Whether a recognized word or phrase may be subject to fluctuation is determined by searching the given phoneme string for any partial phoneme string contained in the fluctuation information stored in the storage unit.
When the phoneme strings are determined to be similar in view of the fluctuation characteristics of the language, the phoneme string is registered as a variation of the other phoneme string.
If it is determined that the phoneme strings are not similar in consideration of the characteristics of fluctuation by language (NO), the process proceeds to step 505. On the other hand, if it is determined that the phoneme strings are similar in consideration of the fluctuation characteristics of the language (YES), the process proceeds to step 504.
In step 505, the speech recognition system (101) determines that the command corresponding to the first phoneme string is different from the command corresponding to the second phoneme string.
In step 504, the speech recognition system (101) determines that the command corresponding to the first phoneme string is the same as the command corresponding to the second phoneme string. Therefore, the speech recognition system (101) associates the first phoneme string with the command or action corresponding to the second phoneme string. Therefore, the speech recognition system (101) can execute a command corresponding to the second phoneme string by the first phoneme string.
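Steps 501 to 505 can be sketched as a two-stage check: a threshold on the share of matching phonemes, followed by a fluctuation check. The sketch below is a minimal illustration with several assumptions: the names `FLUCTUATIONS`, `normalize`, `match_ratio`, and `same_command` are hypothetical, the threshold 0.8 is arbitrary, positions are compared without alignment, and step 503 is interpreted here as accepting a pair only when the remaining differences are explained by the fluctuation table. The description itself leaves the exact comparison to the recognition system.

```python
# Hypothetical fluctuation table: phoneme pairs that tend to change in Japanese.
FLUCTUATIONS = {("o", "u"): ("o", "o:")}


def normalize(phonemes):
    """Rewrite known fluctuating phoneme pairs to one canonical form."""
    out, i = [], 0
    while i < len(phonemes):
        pair = tuple(phonemes[i:i + 2])
        if pair in FLUCTUATIONS:
            out.extend(FLUCTUATIONS[pair])
            i += 2
        else:
            out.append(phonemes[i])
            i += 1
    return out


def match_ratio(first, second):
    """Share of aligned positions that hold the same phoneme."""
    longest = max(len(first), len(second))
    if longest == 0:
        return 0.0
    same = sum(1 for a, b in zip(first, second) if a == b)
    return same / longest


def same_command(first, second, threshold=0.8):
    """Threshold check (step 502), then fluctuation check (step 503)."""
    if match_ratio(first, second) < threshold:
        return False  # step 505: treated as different commands
    # Step 504 if the strings agree once fluctuations are normalized.
    return normalize(first) == normalize(second)
```

With this table, “k o u e N” and “k o o: e N” are treated as the same command, whereas “k o u e N” and “k o r e N” pass the threshold but fail the fluctuation check.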

FIG. 5B shows an API for phoneme string comparison, which is an embodiment of the present invention.
esrCompareBaseforms is an example of an API for comparing phoneme strings. Using this API, the computer system determines, against a threshold, whether phoneme strings are similar.
esrBaseformCompareInfo is an example of a structure that holds the result of comparing phoneme strings.

FIG. 6 shows a block diagram of a computer system according to an embodiment of the present invention.
The computer system (601) includes a CPU (602) and a main memory (603), which are connected to a bus (605). The CPU (602) is preferably based on a 32-bit or 64-bit architecture; for example, Intel's Xeon (trademark), Core (trademark), Atom (trademark), Pentium (trademark), and Celeron (trademark) series or AMD's Phenom (trademark) and Athlon (trademark) series can be used. A sound board (604) for audio input and output is connected to the bus (605). A microphone or a speaker is connected to the sound board (604) as necessary. A display (607) such as an LCD monitor is connected to the bus (605) via a display controller (606). The display (607) is used to show information about the software running on the computer system (601) through a suitable graphical interface. A hard disk or silicon disk (609) and a CD-ROM, DVD, or Blu-ray drive (610) are also connected to the bus (605) via an IDE or SATA controller (608). The CD-ROM, DVD, or Blu-ray drive (610) is used, as necessary, to install additional programs from a CD-ROM, DVD-ROM, or BD onto the hard disk or silicon disk (609). A keyboard (612) and a mouse (613) are further connected to the bus (605) via a keyboard/mouse controller (611) or via a USB controller (not shown).

The communication interface (615) conforms to, for example, the Ethernet protocol and is connected to the bus (605) via the communication controller (614). The communication interface (615) physically connects the computer system (601) to the communication line (616) and provides a network interface layer to the TCP/IP communication protocol of the communication function of the operating system of the computer system (601). The communication line may be a wired LAN environment or a wireless LAN environment based on a wireless LAN connection standard such as IEEE 802.11a/b/g/n.

Devices that can be used as network connection devices for connecting hardware such as computers include, in addition to the network switch described above and without being limited to these, routers and hardware management consoles. In short, such a device has a function of returning configuration information, such as the IP addresses and MAC addresses of the computers connected to it, in response to an inquiry issued by a predetermined command from a computer on which a network operation management program is installed. Network switches and routers hold an ARP table for the Address Resolution Protocol (ARP), which lists pairs of the IP address of each connected computer and the corresponding MAC address, and have a function of returning the contents of the ARP table in response to an inquiry by a predetermined command.

The present invention has been described above based on an embodiment. The content described in the embodiment is one example of the present invention, and it will be apparent that those skilled in the art can conceive of various modifications without departing from the technical scope of the present invention.

An outline of a speech recognition system that is an embodiment of the present invention.
A conceptual diagram of speech, phonemes, and actions for voice commands, which is an embodiment of the present invention.
An outline (part 1) of a method for speech recognition, which is an embodiment of the present invention.
An outline (part 2) of a method for speech recognition, which is an embodiment of the present invention.
An outline (part 1) of the mechanism of speech recognition, which is an embodiment of the present invention.
An outline (part 2) of the mechanism of speech recognition, which is an embodiment of the present invention.
A flow of processing for speech recognition, which is an embodiment of the present invention.
A flow of processing for generating a phoneme string, which is an embodiment of the present invention.
A comparison between phoneme strings, which is an embodiment of the present invention.
An API for phoneme string comparison, which is an embodiment of the present invention.
A block diagram of a computer system according to an embodiment of the present invention.

Claims

A computer system for speech recognition of speech input, the computer system comprising:
a first determination unit that determines, in response to input of a first utterance, whether the voice of the input first utterance matches a voice registered in a storage unit;
a request unit that requests input of a second utterance when the voice of the input first utterance does not match a voice registered in the storage unit;
a second determination unit that determines whether the voice of the input second utterance matches a voice registered in the storage unit;
a comparison unit that compares the phoneme string of the second utterance with the phoneme string of the first utterance when the voice of the second utterance matches a voice registered in the storage unit;
an association unit that associates the voice of the first utterance with the command or action corresponding to the second utterance when the phoneme string of the second utterance is similar to the phoneme string of the first utterance; and
a selection unit that allows a command or action corresponding to the second utterance to be selected when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance.

The computer system according to claim 1, wherein whether the phoneme string of the second utterance is similar to the phoneme string of the first utterance is determined based on the degree of matching of phonemes in the phoneme strings.

The computer system according to claim 1, wherein, when the phoneme string of the second utterance is similar to the phoneme string of the first utterance, whether the voices match is determined based on sound fluctuation information.

The computer system according to claim 1, further comprising a registration unit that determines whether to register the phoneme string of the first utterance in the storage unit when the phoneme string of the second utterance is similar to the phoneme string of the first utterance.

The computer system according to claim 4, wherein the registration unit refers to a policy for determining whether to register the phoneme string of the first utterance in the storage unit.

The computer system according to claim 5, wherein the policy is based on at least one of a high noise ratio, a use frequency of words or phrases, and a sequence of phoneme strings.

The computer system according to claim 4, further comprising a deletion unit that determines whether to delete the phoneme string of the registered utterance from the storage unit.

The computer system according to claim 1, further comprising an inquiry unit that asks the user whether to associate the voice of the first utterance with the command or action corresponding to the second utterance when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance.

The computer system according to claim 8, wherein the association unit associates the voice of the first utterance with the command or action corresponding to the second utterance in response to receiving an instruction from the user to perform the association.

The computer system according to claim 1, wherein allowing the selection includes presenting a list of voice commands.

The computer system according to claim 1, wherein associating the voice of the first utterance with the command or action corresponding to the second utterance includes registering the voice of the first utterance as a variation of the voice of the second utterance.

The computer system according to claim 1, further comprising:
A second requesting unit that requests input of a third utterance when the input voice of the second utterance does not match the voice registered in the storage unit;
A third determination unit that determines whether or not the voice of the input third utterance matches the voice registered in the storage unit;
A second comparison unit that compares the phoneme string of the third utterance with the phoneme string of the second utterance when the voice of the third utterance matches the voice registered in the storage unit; and
A second associating unit that associates the voice of the second utterance with the command or action corresponding to the third utterance when the phoneme string of the third utterance is similar to the phoneme string of the second utterance.

The computer system according to claim 1, further comprising a first generation unit that generates a first phoneme string corresponding to the first utterance together with the first determination.

The computer system according to claim 12, wherein the first determination unit further determines whether or not the generated first phoneme string matches a phoneme string registered in the storage unit.

The computer system according to claim 1, further comprising a second generation unit that generates a second phoneme string corresponding to the second utterance together with the second determination.

The computer system according to claim 14, wherein the second determination unit further determines whether or not the generated second phoneme string matches a phoneme string registered in the storage unit.

The computer system according to claim 1, wherein whether the input voice of the first utterance or the input voice of the second utterance matches the voice registered in the storage unit is determined based on the degree of coincidence of phonemes in the phoneme string.
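
As a non-authoritative illustration of a "degree of coincidence of phonemes", the sketch below scores two phoneme strings with a normalized edit distance. The claims do not specify a measure; the function name `coincidence` and the choice of Levenshtein distance are assumptions made only for this example.

```python
def coincidence(a, b):
    """Assumed measure: 1 - (edit distance / longer length), in [0, 1]."""
    if not a and not b:
        return 1.0  # two empty phoneme strings coincide trivially
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (pa != pb)))  # substitution or match
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))
```

A match against a registered voice could then be declared when the score exceeds a chosen threshold; the threshold value is likewise not given in the claims.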

The computer system according to claim 1, further comprising a recording unit that stores the voice of the first utterance in the recording unit when the input voice of the first utterance does not match the voice registered in the storage unit.

The computer system according to claim 1, further comprising an execution unit that executes an action corresponding to the first utterance when the input voice of the first utterance matches the voice registered in the storage unit.

The computer system according to claim 1, further comprising a second execution unit that executes an action corresponding to the second utterance when the input voice of the second utterance matches the voice registered in the storage unit.

A method for speech recognition of speech input, the method comprising a computer system performing the steps of:
In response to the input of a first utterance, determining whether the input voice of the first utterance matches a voice registered in a storage unit;
Requesting the input of a second utterance when the input voice of the first utterance does not match the voice registered in the storage unit;
Determining whether the voice of the input second utterance matches the voice registered in the storage unit;
Comparing the phoneme string of the second utterance with the phoneme string of the first utterance when the voice of the second utterance matches the voice registered in the storage unit;
Associating the voice of the first utterance with a command or action corresponding to the second utterance when the phoneme string of the second utterance is similar to the phoneme string of the first utterance; and
Allowing a selection of a command or action corresponding to the second utterance when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance.
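
The steps recited above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the storage unit is modeled as a dictionary from phoneme tuples to commands, "matching a registered voice" is simplified to an exact phoneme-string lookup, similarity is measured with Python's `difflib.SequenceMatcher`, and the names `recognize`, `request_second`, `select` and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher


def recognize(registry, first, request_second, select, threshold=0.6):
    """Return the command resolved for `first`, or None if none was resolved.

    registry       -- dict: phoneme tuple -> command (stand-in for the storage unit)
    first          -- phoneme sequence of the first utterance
    request_second -- callable returning the phoneme sequence of a second utterance
    select         -- callable offering a command to the user; returns a command or None
    """
    first = tuple(first)
    # Determine whether the first utterance matches a registered voice.
    if first in registry:
        return registry[first]
    # No match: request input of a second utterance.
    second = tuple(request_second())
    # Determine whether the second utterance matches a registered voice.
    if second not in registry:
        return None
    command = registry[second]
    # Compare the phoneme strings of the second and first utterances.
    if SequenceMatcher(None, first, second).ratio() >= threshold:
        # Similar: associate the first utterance with the command.
        registry[first] = command
        return command
    # Not similar: allow a selection rather than associating automatically.
    return select(command)
```

Once the first utterance has been associated, a later input of the same phoneme string resolves directly in the first step without a second utterance being requested.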

The method of claim 21, further comprising the computer system performing the step of:
Inquiring of the user whether to associate the voice of the first utterance with the command or action corresponding to the second utterance when the phoneme string of the second utterance is not similar to the phoneme string of the first utterance.

The method of claim 22, further comprising the computer system performing the step of:
Associating the voice of the first utterance with a command or action corresponding to the second utterance in response to receiving, from the user, an instruction to perform the association.

24. A method according to any of claims 21 to 23, wherein allowing the selection comprises presenting a list of voice commands.

25. A computer program for speech recognition of speech input, the computer program causing a computer system to execute each step of the method according to any one of claims 21 to 24.