JP2007163967A

JP2007163967A - Speech recognition device and speech recognition method

Info

Publication number: JP2007163967A
Application number: JP2005362014A
Authority: JP
Inventors: Hiroki Yamamoto; 寛樹山本; Hideo Kuboyama; 英生久保山
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2005-12-15
Filing date: 2005-12-15
Publication date: 2007-06-28

Abstract

PROBLEM TO BE SOLVED: When different users assign the same pronunciation to different objects, a conventional recognition device produces the same recognition score for that pronunciation and therefore cannot tell the objects apart.

SOLUTION: A speech recognition device comprises pronunciation information registration means for registering a plurality of pieces of pronunciation information for an object to be recognized, and acoustic model registration means for associating, with each piece of pronunciation information, the acoustic model to be used when that pronunciation information is recognized. Each piece of pronunciation information is recognized using the acoustic model associated with it. Consequently, objects that share a pronunciation but are associated with the acoustic models of different users can be distinguished by user.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech recognition apparatus and a speech recognition method that allow a different pronunciation to be assigned to the same recognition target for each user.

Even for words that refer to the same object, the word a person uses varies with that person's everyday vocabulary and relationship to the object. For example, when addressing the same person, A-ko, her husband calls her “A-ko” while her daughter calls her “Okaasan” (mother). Conversely, the same word can refer to different objects depending on who uses it. For example, “Okaasan” refers to one person when A-ko says it and to another when A-ko's daughter says it.

In speech recognition, too, it is convenient if each user can use familiar words as the recognition vocabulary. For example, when members of A-ko's family call A-ko's mobile phone from the home telephone by voice dialing, it is convenient if each of them can place the call with the name they are used to: “A-ko” for her husband and “Okaasan” for her daughter.

Patent Document 1 discloses a speech processing system in which the words used for speech recognition can be set for each user in this way. As an example of such a system, Patent Document 1 discloses a telephone set consisting of a base unit and handsets and equipped with a speech recognition device. In light of the current usage pattern in which each family member keeps a handset for personal use, this telephone set allows a recognition vocabulary to be set for each handset, that is, for each user who keeps a handset. In the earlier example, a call to A-ko's mobile phone can be placed with the word “A-ko” from the husband's handset and with the word “Okaasan” from the daughter's handset. Document 1 is also configured so that recognition vocabularies can be published between a handset and the base unit or between handsets, which allows a published vocabulary to be recognized on terminals other than the one on which it was set. The vocabulary set for each user can therefore be recognized on terminals other than that user's own handset. Publishing the recognition vocabulary of the husband's handset to the base unit allows the base unit to recognize “A-ko” as well. Similarly, publishing the recognition vocabulary of the daughter's handset to the base unit allows the base unit to recognize the vocabulary the daughter uses.
Patent Document 1: Japanese Patent Laid-Open No. 2004-72274

The speech recognition device disclosed in Document 1 allows a single device to recognize the vocabularies set by several users. However, because the vocabularies set by other users also become recognition targets, recognition accuracy may be lower than when only the user's own vocabulary is used.

Moreover, although a single device in Document 1 can recognize the vocabularies set by several users, it cannot cope with the case where several users assign the same word to different objects. For example, suppose A-ko has registered “Okaasan” for her own mother's telephone number and A-ko's daughter has registered “Okaasan” for A-ko's mobile phone number. Even if the device correctly recognizes the input “Okaasan”, it cannot settle on a single telephone number.

An object of the present invention is to solve the above problems and to provide a speech recognition apparatus with improved convenience for the user.

To solve these problems, the speech recognition apparatus of the present invention according to claim 1 comprises pronunciation information registration means for registering a plurality of pieces of pronunciation information for a recognition target, and acoustic model registration means for associating, with each piece of pronunciation information, the acoustic model to be used when recognizing it, and is characterized in that each piece of pronunciation information is recognized using the acoustic model associated with it.

The speech recognition apparatus according to claim 2 comprises pronunciation information registration means for registering a plurality of pieces of pronunciation information for a recognition target, user registration means for registering the users who may utter each piece of pronunciation information, and used-acoustic-model registration means for registering the acoustic model to be used by each user, and is characterized in that each piece of pronunciation information is recognized using the acoustic model used by a user who may utter that pronunciation information.

According to the speech recognition apparatus of the present invention, arbitrary pronunciation information can be registered for a recognition target for each user, and the acoustic model used by each user can be associated with the registered pronunciation information. As a result, when a different acoustic model is used for each user, misrecognition as pronunciation information associated with another user's acoustic model decreases and recognition accuracy improves.

In addition, the same pronunciation can be recognized as a different object depending on the user.

Preferred embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram showing the schematic configuration of a speech recognition apparatus according to one embodiment of the present invention.

In FIG. 1, 101 is a central processing unit (CPU), 102 is a control memory (ROM), 103 is a memory (RAM), 104 denotes operation keys such as a keyboard and buttons, 105 is a display device such as a liquid crystal display, 106 is a speech input device such as a microphone, 107 is a speech output device such as a speaker, 108 is a communication device for communicating with external equipment, and 109 is a data bus.

The control program that implements the speech recognition apparatus of this embodiment, and the data used by that control program, are recorded in the control memory (ROM) 102.

Under the control of the central processing unit 101, the control program and data are loaded into the memory 103 as needed via the data bus 109 and executed by the central processing unit 101. The result of execution, that is, the speech recognition result, is displayed on the display device 105, output from the speech output device 107 using speech synthesis, or output to external equipment via the communication device 108.

FIG. 2 is a functional block diagram of the speech recognition apparatus of the present invention applied to a data search system that searches for telephone numbers.

The data search system comprises: a database 201 that stores the telephone numbers to be searched and the information associated with each telephone number; an acoustic model 202 that stores the acoustic models used for speech recognition; a pronunciation information registration unit 203 that registers pronunciation information, such as kana readings, for the search keywords used in data search; an acoustic model registration unit 204 that registers the acoustic models used for speech recognition; a recognition grammar creation unit 205 that creates the recognition grammar used for speech recognition; a speech recognition unit 206 that performs speech recognition; and a recognition result processing unit 207 that searches the data stored in the database for the telephone number matching the speech recognition result.

FIG. 3 shows an example of the database 201 stored by the data search apparatus of this embodiment. For each telephone number, the database stores an ID (column 301), the telephone number (column 302), a name (column 303), and the kana reading of the name (column 304). As column 304 of FIG. 3 indicates, the rest of this embodiment is described for the case where kana readings are used as pronunciation information.

The acoustic model 202 of this embodiment stores a plurality of acoustic models in advance. This embodiment uses a speaker-independent model, an adult male model, an adult female model, and a child model. The models are not limited to these, however; a model dedicated to each user, created from that user's speech using speaker adaptation or a similar technique, may also be used.

The processing performed in each module is described in detail below, divided into the process of registering pronunciation information and acoustic models and the process of searching for data by speech recognition.

<Process for registering pronunciation information and acoustic models>
FIG. 6 shows the flow of the process for registering pronunciation information and acoustic models. The registration process is described below with reference to the flowchart of FIG. 6.

Using the operation keys 104, the user follows a predetermined procedure to select the data for which pronunciation information is to be registered (S601). At this point the pronunciation information registration unit 203 displays a GUI such as that of FIG. 4 on the display device 105. For the selected data, the user uses the operation keys 104 to register the desired pronunciation information and the acoustic model to be used when recognizing that pronunciation information (S602, S603). FIG. 4 shows pronunciation information and acoustic models registered for Hanako Yamada's mobile phone number 099-9999-9999, recorded in row 305 of the database shown in FIG. 3, so that Hanako's husband can search with the keywords “Hanako” and “Okaasan” (mother), Hanako herself with “Keitai” (mobile phone), and Hanako's child with “Okaasan”. In FIG. 4, a window 400 displays a name 401, the kana reading of the name 402, and a telephone number 403. In this embodiment, kana readings for four keywords can be registered in fields 408, 409, 410, and 411 in addition to the reading of the name 402. For each registered piece of pronunciation information, the acoustic model registration unit is used to select and register acoustic models from those displayed at 404 to 407. In FIG. 4, a filled circle marks a registered acoustic model: the adult male model is registered for “Hanako”, the adult female model for “Keitai”, and the adult male and child models for “Okaasan”. In this embodiment the speaker-independent model is registered automatically for the reading of the name, “Yamada Hanako”; alternatively, the user may register acoustic models for it as for the other readings, or all acoustic models may be registered automatically.

Next, the registered pronunciation information and the acoustic models to be used are stored in the database 201 (S604). Each piece of pronunciation information is stored in association with the acoustic models registered for it. The registered pronunciation information and acoustic models may be appended to the telephone number data already stored in the database 201, or stored in the database 201 as data separate from the telephone number data. FIG. 5 shows an example of telephone number data with this information appended. In the figure, each piece of pronunciation information is associated with the identification numbers of the acoustic models to be used (column 501). The identification numbers are 1 for the speaker-independent model, 2 for the adult male model, 3 for the adult female model, and 4 for the child model.
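As an illustration only, the association stored in S604 can be sketched in Python as a record that maps each reading to a set of acoustic model identification numbers. The class name, field names, and romanized readings are assumptions made for this sketch and do not appear in the original; the identification numbers 1 to 4 and the registrations for Hanako Yamada's entry follow the description of FIG. 4 and FIG. 5.

```python
from dataclasses import dataclass, field

# Identification numbers as listed for column 501 of FIG. 5.
ACOUSTIC_MODELS = {
    1: "speaker-independent",
    2: "adult male",
    3: "adult female",
    4: "child",
}


@dataclass
class PhoneEntry:
    """One telephone number record (hypothetical layout for this sketch)."""
    entry_id: str
    number: str
    name: str
    # reading -> identification numbers of the acoustic models registered for it
    readings: dict = field(default_factory=dict)

    def register(self, reading, model_ids):
        """Register a reading together with the acoustic models used to recognize it."""
        unknown = set(model_ids) - set(ACOUSTIC_MODELS)
        if unknown:
            raise ValueError(f"unknown acoustic model IDs: {sorted(unknown)}")
        self.readings.setdefault(reading, set()).update(model_ids)


hanako = PhoneEntry("002", "099-9999-9999", "Hanako Yamada")
hanako.register("yamadahanako", {1})  # name reading: speaker-independent model
hanako.register("hanako", {2})        # keyword used by her husband
hanako.register("keitai", {3})        # keyword used by Hanako herself
hanako.register("okaasan", {2, 4})    # keyword used by her husband and her child
```

Keeping a set of model numbers per reading lets one reading, such as “Okaasan”, be shared by speakers who use different acoustic models.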

To register pronunciation information and acoustic models for other data, the process returns to S601 to select the data; otherwise the registration process ends (S605).

This completes the registration of pronunciation information and acoustic models.

<Process for searching for data by speech recognition>
Next, the speech recognition process is described.

FIG. 7 shows the flow of the process for searching for data by speech recognition. The process is described below with reference to the flowchart of FIG. 7.

When the data search process is started by a predetermined operation, the recognition grammar creation unit 205 first creates a recognition grammar by referring to the database 201 (S701). The recognition grammar describes at least the value to be output as the recognition result and the pronunciation information. The grammar is also created so that the correspondence between each piece of pronunciation information and the acoustic model used to recognize it is known. Here, this association is made by creating a separate recognition grammar for each acoustic model. FIG. 8 shows an example of the recognition grammars created. In the figure, the pronunciation information in grammars (A) to (D) is associated with (A) acoustic model 1 (speaker-independent model), (B) acoustic model 2 (adult male model), (C) acoustic model 3 (adult female model), and (D) acoustic model 4 (child model), respectively. The ID of the telephone number data, shown in column 301 of FIG. 3, is output as the recognition result.
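The grammar creation of S701 can be sketched as grouping readings by acoustic model. This is a minimal illustration, not the format of FIG. 8: the function name, the tuple layout, and the romanized readings are assumptions, and the sample data contains only registrations mentioned in the text (entry 002 for Hanako Yamada, entry 001 for Taro Yamada).

```python
from collections import defaultdict


def build_grammars(entries):
    """Create one grammar per acoustic model.

    entries: iterable of (entry_id, {reading: set of model IDs}).
    Returns {model_id: [(reading, output_value), ...]}, where the output
    value is the ID of the telephone number data.
    """
    grammars = defaultdict(list)
    for entry_id, readings in entries:
        for reading, model_ids in readings.items():
            for model_id in sorted(model_ids):
                grammars[model_id].append((reading, entry_id))
    return dict(grammars)


entries = [
    ("001", {"keitai": {2}}),
    ("002", {"yamadahanako": {1}, "hanako": {2}, "keitai": {3}, "okaasan": {2, 4}}),
]
grammars = build_grammars(entries)
for model_id in sorted(grammars):
    print(model_id, grammars[model_id])
```

A reading registered with two models, such as “okaasan”, appears in both corresponding grammars, so it is scored once with each model.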

When the user speaks a search keyword into the speech input device 106 (S702), the speech recognition unit 206 recognizes it using the created recognition grammar (S703). In typical speech recognition, a recognition score representing the similarity between the input speech and the acoustic model corresponding to each recognition target word's pronunciation information is computed for every recognition target word, and the most similar word is taken as the recognition result. The speech recognition in S703 outputs its result in the same way, except that, when computing recognition scores, it follows the recognition grammar created by the recognition grammar creation unit 205 and uses, for each piece of pronunciation information, the acoustic model registered by the acoustic model registration unit 204. For example, with the recognition grammar of FIG. 8, the recognition score of “Yamada Hanako” at 801 is computed with the speaker-independent model, and that of “Okaasan” at 802 with the adult male model. Recognition scores are obtained for all pieces of pronunciation information, and the recognition result whose score indicates the greatest similarity to the input speech is output. Because the recognition grammar of this data search system is written to output the ID of the telephone number data as the recognition result, if, for example, “Keitai” at 804 in FIG. 8 is most similar to the input speech, its output value “001” is output as the recognition result.
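The selection in S703 can be sketched as a maximum over all (acoustic model, reading) pairs in the grammars. The scoring function below is a toy stand-in and not a real acoustic model: it treats the utterance as a pair of romanized text and speaker type, and subtracts an arbitrary penalty when the acoustic model does not match the speaker. The penalty values, the function names, and the entry ID "003" for “hanayo” are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def toy_score(model_id, reading, utterance):
    """Toy stand-in for a recognition score (higher means more similar)."""
    text, speaker_model = utterance
    phonetic = SequenceMatcher(None, reading, text).ratio()
    if model_id == speaker_model:
        penalty = 0.0   # acoustic model matches the speaker
    elif model_id == 1:
        penalty = 0.2   # speaker-independent model: mild mismatch (assumed)
    else:
        penalty = 0.5   # model for a different speaker type (assumed)
    return phonetic - penalty


def recognize(grammars, utterance, score=toy_score):
    """Score every reading with its own acoustic model and return the best.

    Returns (output_value, reading, model_id).
    """
    candidates = (
        (score(model_id, reading, utterance), entry_id, reading, model_id)
        for model_id, items in grammars.items()
        for reading, entry_id in items
    )
    best = max(candidates, key=lambda c: c[0])
    return best[1], best[2], best[3]


grammars = {
    1: [("yamadahanako", "002")],
    2: [("keitai", "001"), ("hanako", "002"), ("okaasan", "002")],
    3: [("keitai", "002"), ("hanayo", "003")],
    4: [("okaasan", "002")],
}

print(recognize(grammars, ("keitai", 2)))  # adult male speaker
print(recognize(grammars, ("keitai", 3)))  # adult female speaker
```

With this toy scorer, the same reading “keitai” resolves to entry 001 for an adult male speaker and to entry 002 for an adult female speaker, because only the entry whose acoustic model matches the speaker avoids the penalty.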

The recognition result processing unit 207 retrieves from the database 201 the telephone number data whose ID was output as the recognition result (S704). The retrieved data may be displayed on the display device 105, or the telephone number may be output as speech from the speech output device 107 using speech synthesis or the like. If the data search system is implemented on a telephone, it may communicate with the telephone via the communication device 108 and place a call to the telephone number registered in the retrieved data.

To perform another search, the process returns to the speech input of S702; otherwise the process ends (S705).

<Effects>
As described above, registering arbitrary pronunciation information for each user and changing the acoustic model used to compute the recognition score for each piece of pronunciation information improves recognition accuracy compared with using the same acoustic model for all pronunciation information. Suppose the recognition score is one whose value grows as the similarity to the input speech increases. When an adult male speaks, the scores of pronunciation information that uses the adult male model can be expected to be high, and the scores of pronunciation information that uses any other acoustic model can be expected to be low. This should reduce misrecognition as pronunciation information that is not associated with the adult male model. For example, in the recognition grammar of FIG. 8, “Hanako” and “Hanayo” sound very similar, so if an adult male says “Hanako” and the scores are computed with the speaker-independent model alone, the difference between the two scores is likely to be small. If, as in this embodiment, the score of “Hanayo” is computed with the adult female model, the male input speech and the adult female model are mismatched, the score of “Hanayo” drops, and misrecognition becomes less likely. This embodiment has been described using acoustic models prepared in advance; using dedicated acoustic models trained on each user's own speech strengthens the effect further.

The speech recognition apparatus of the present invention can also recognize the same pronunciation as a different object (telephone number data in this embodiment) depending on the user. As shown at 803 and 804 in FIG. 8, the same pronunciation “Keitai” is registered for the output values 002 (Hanako Yamada's telephone number) and 001 (Taro Yamada's telephone number). With only one acoustic model, both receive the same recognition score, so the result cannot be narrowed to one by score alone. That is, the case where Hanako Yamada says “Keitai” cannot be distinguished from the case where Taro Yamada says “Keitai”. In the apparatus of the present invention, the adult female model is associated with the “Keitai” of 002 and the adult male model with the “Keitai” of 001. When Taro Yamada says “Keitai”, the “Keitai” of 001, scored with the adult male model, receives a higher recognition score than the “Keitai” of 002, scored with the adult female model, which solves the problem described above.
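The argument above can be restated compactly. The notation is introduced here for illustration and does not appear in the original: $x$ is the input speech, $G$ is the set of grammar entries $(w, m, v)$ consisting of a pronunciation $w$, an acoustic model number $m$, and an output value $v$, $\lambda_m$ is acoustic model $m$, and $S(x \mid w, \lambda_m)$ is a recognition score that grows with similarity. The inequality in the last line is the behavior the text expects, not a guaranteed property. The snippet assumes the `amsmath` package.

```latex
% Decision rule: output the value of the entry with the highest score.
\[
  (\hat{w}, \hat{m}, \hat{v})
    = \operatorname*{arg\,max}_{(w,\,m,\,v) \in G} S(x \mid w, \lambda_m)
\]
% One shared model \lambda: the entries (keitai, 001) and (keitai, 002)
% are both scored as S(x | keitai, \lambda), so the scores tie and the
% output value cannot be decided by score alone.
%
% Separate models, adult male speaker (x = x_Taro):
\[
  S(x_{\mathrm{Taro}} \mid \text{keitai}, \lambda_{2})
    > S(x_{\mathrm{Taro}} \mid \text{keitai}, \lambda_{3})
  \quad\Longrightarrow\quad \hat{v} = 001
\]
```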

Furthermore, because the differences between acoustic models limit which pronunciation information is likely to be recognized, the apparatus achieves, without the user having to be specified, the same effect as a speech recognition apparatus that narrows the recognition vocabulary by having the user identified in advance and using a user-specific recognition grammar.

In the first embodiment, the acoustic model to be used was selected directly for each piece of pronunciation information. This embodiment describes a method in which users are associated with pronunciation information, and acoustic models are then associated with the pronunciation information through a relationship between users and acoustic models registered in advance.

FIG. 9 shows a functional block diagram of the data search system of this embodiment.

The acoustic model registration unit 204 of the first embodiment is replaced with a user registration unit 209, and a user information registration unit 208 is added.

<Process for registering user information>
Before pronunciation information is registered, the user information registration unit 208 registers the acoustic model to be used by each user and stores it in the database 201. FIG. 10 shows an example of the user information stored in the database 201. In the figure, column 1001 holds the user ID and column 1002 holds the identification number of the acoustic model that user uses. As in the first embodiment, this embodiment uses an adult male model, an adult female model, and a child model, but the models are not limited to these; a model dedicated to each user, created from that user's speech using speaker adaptation or a similar technique, may also be used.

<Process for registering pronunciation information and users>
The process of registering pronunciation information and users is described below with reference to the flowchart of FIG. 13, covering only the parts that differ from the procedure of the first embodiment shown in FIG. 6.

Instead of registering an acoustic model directly, as the acoustic model registration unit does in S603 of FIG. 6, this embodiment registers the users who use the registered pronunciation information (FIG. 13, S606). FIG. 11 illustrates this. For the pronunciation information “Hanako” (1108), “Keitai” (1109), and “Okaasan” (1110), the users of each are registered at 1104 to 1106, and a filled circle marks a user of that pronunciation information. That is, “Hanako” is registered as pronunciation information used by User 1, “Keitai” by User 2, and “Okaasan” by User 3. In this embodiment no user is registered for the reading of the name, “Yamada Hanako”; alternatively, users may be registered for it as for the other readings, or all users may be registered automatically.

Next, the registered users are stored in the database 201 in association with the pronunciation information (S607). The registered pronunciation information and users may be appended to the telephone number data already stored in the database 201, or stored in the database 201 as data separate from the telephone number data. FIG. 12 shows an example of telephone number data with this information appended. In the figure, each piece of pronunciation information is associated with the identification numbers of its users (column 1201). The identification numbers are 0 for no specific user, 1 for User 1, 2 for User 2, and 3 for User 3.

＜音声認識してデータを検索する処理＞
次に本実施形態の音声認識の処理について説明する。 <Process to search for data by voice recognition>
Next, the speech recognition process of this embodiment will be described.

The flow of the process of recognizing speech and retrieving data is the same as that of Embodiment 1 shown in FIG. 7. This embodiment differs from Embodiment 1 only in how the recognition grammar is created in S701, so only that part is described.

In Embodiment 1, the acoustic model to be used was associated with each piece of pronunciation information in the telephone number data stored in the database 201; in this embodiment, what is associated with each piece of pronunciation information is the user. To associate pronunciation information with the acoustic models to be used, this embodiment, when creating the recognition grammar, uses the correspondence between users and their acoustic models shown in FIG. 10, which is stored in the database 201, and associates the acoustic models to be used with each piece of pronunciation information. For example, User 1 and User 3 are registered in 1106 for the pronunciation information "okaasan" shown at 1110 in FIG. 11. The user information (FIG. 10) is therefore consulted, and the recognition grammar is created so that acoustic model 2 (adult male model), used by User 1, and acoustic model 3 (child model), used by User 3, are associated with that pronunciation information.
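The resolution step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names and the representation of the grammar as a list of (reading, acoustic model number) pairs are hypothetical, and only the two user-to-model correspondences named in the text are included.

```python
def models_for(user_ids, user_models):
    """Return the acoustic model numbers of the given users, without duplicates."""
    models = []
    for uid in user_ids:
        model = user_models[uid]   # look up the user information (cf. FIG. 10)
        if model not in models:
            models.append(model)
    return models


def build_grammar(pronunciations, user_models):
    """Pair every reading with each acoustic model its users employ.

    pronunciations: list of (reading, user_ids) tuples.
    The list-of-pairs output format is an assumption of this sketch.
    """
    return [
        (reading, model)
        for reading, user_ids in pronunciations
        for model in models_for(user_ids, user_models)
    ]


# Correspondences stated in the text: User 1 -> model 2 (adult male),
# User 3 -> model 3 (child). Other users are omitted here.
user_models = {1: 2, 3: 3}
grammar = build_grammar([("okaasan", [1, 3])], user_models)
# grammar == [("okaasan", 2), ("okaasan", 3)]
```

Because the users are resolved to acoustic models only when the grammar is built, changing a user's acoustic model in the user information would not require touching the registered pronunciation information.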

The recognition grammar created in this way has the same form as the recognition grammar of Embodiment 1 shown in FIG. 8.

The subsequent processing is the same as in Embodiment 1, so its description is omitted.

<Other embodiments>
When the telephone number data is managed separately for each user, performing the user selection of S606 first makes it unnecessary to register a user for each piece of pronunciation information.

<Other embodiments>
A recognition grammar may be created for each user. In this case, the recognition grammars shown in FIG. 8 are the grammars used by (A) no specific user, (B) User 1, (C) User 2, and (D) User 3, respectively. When a recognition grammar is created for each user, the speech recognition unit 206 refers to the user information stored in the database at recognition time and calculates the recognition score of the pronunciation information described in each user's recognition grammar using the corresponding acoustic model.
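One way to picture recognition with per-user grammars is the Python sketch below. It assumes a scoring function supplied by the recognizer; the `recognize` and `toy_score` names, the grammar contents, and the toy scoring rule are all hypothetical stand-ins, since the specification does not describe how the score itself is computed.

```python
def recognize(features, grammars, user_models, score):
    """Score each reading in each user's grammar with that user's acoustic model.

    grammars: dict mapping user ID -> list of readings in that user's grammar.
    user_models: dict mapping user ID -> acoustic model number.
    score: callable (features, reading, model) -> float, higher is better.
    Returns (best_score, user_id, reading), or None if all grammars are empty.
    """
    best = None
    for user_id, readings in grammars.items():
        model = user_models[user_id]          # consult the user information
        for reading in readings:
            s = score(features, reading, model)
            if best is None or s > best[0]:
                best = (s, user_id, reading)
    return best


def toy_score(features, reading, model):
    """Stand-in scorer: 1.0 only when both the reading and the model match."""
    return 1.0 if features == (reading, model) else 0.0


grammars = {1: ["hanako", "okaasan"], 3: ["okaasan"]}   # illustrative contents
user_models = {1: 2, 3: 3}
# Pretend the input is "okaasan" spoken in a voice matching model 3.
best = recognize(("okaasan", 3), grammars, user_models, toy_score)
# best == (1.0, 3, "okaasan")
```

In this toy run the same reading appears in two users' grammars, and the acoustic model match is what selects User 3's entry.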

<Effect>
In this embodiment, each piece of pronunciation information is associated with the users who use it. Registering pronunciation information and users is therefore more intuitive than in Embodiment 1, in which acoustic models are associated with the pronunciation information, and the registration work becomes easier to perform.

In addition, when telephone number data and pronunciation information are managed separately for each user, registration can be carried out efficiently because no acoustic model or user has to be registered for each piece of pronunciation information.

Embodiments 1 and 2 describe the case where the speech recognition of the present invention is applied to a data search system. The invention is not limited to this case, however, and can be applied to any speech recognition apparatus or application that has a function for registering pronunciation information and is intended to be used by a plurality of users.

Needless to say, the object of the present invention is also achieved by supplying a system or apparatus with a storage medium on which is recorded the program code of software that implements the functions of the embodiments described above, and by having a computer (or a CPU or MPU) of that system or apparatus read and execute the program code stored on the storage medium.

In this case, the program code read from the storage medium itself implements the functions of the embodiments described above, and the storage medium storing that program code constitutes the present invention.

Storage media that can be used to supply the program code include, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, and a ROM.

Needless to say, the invention covers not only the case where the functions of the embodiments described above are implemented by the computer executing the program code it has read, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing in accordance with the instructions of the program code, and the functions of the embodiments are implemented by that processing.

Needless to say, the invention further covers the case where the program code read from the storage medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on that board or in that unit performs part or all of the actual processing in accordance with the instructions of the program code, and the functions of the embodiments described above are implemented by that processing.

FIG. 1 is a block diagram of the speech recognition apparatus of the present invention.
FIG. 2 is a module configuration diagram of a data search system according to an embodiment of the present invention.
FIG. 3 shows an example of data used in the data search system of Embodiment 1.
FIG. 4 shows an example of a GUI for registering pronunciation information and acoustic models in the data search system of Embodiment 1.
FIG. 5 shows an example of data, used in the data search system of Embodiment 1, in which pronunciation information and acoustic models are registered.
FIG. 6 is a flowchart of the process of registering pronunciation information and acoustic models in the data search system of Embodiment 1.
FIG. 7 is a flowchart of the speech recognition process in the data search system of Embodiment 1.
FIG. 8 shows an example of a recognition grammar used in the data search system of Embodiment 1.
FIG. 9 is a flowchart of the process of registering pronunciation information and users in the data search system of Embodiment 2.
FIG. 10 shows an example of user information in the data search system of Embodiment 2.
FIG. 11 shows an example of a GUI for registering pronunciation information and users in the data search system of Embodiment 2.
FIG. 12 shows an example of data, used in the data search system of Embodiment 2, in which pronunciation information and users are registered.
FIG. 13 is a flowchart of the process of registering pronunciation information and users in the data search system of Embodiment 2.

Explanation of symbols

101 Central processing unit
102 Control memory
103 Memory
104 Operation keys
105 Display device
106 Audio input device
107 Audio output device
108 Communication device
109 Bus

Claims

A speech recognition apparatus comprising:
pronunciation information registration means for registering a plurality of pieces of pronunciation information for a recognition target; and
acoustic model registration means for associating, with each piece of pronunciation information, an acoustic model to be used when that pronunciation information is recognized,
wherein each piece of pronunciation information is recognized using the acoustic model associated with it.

A speech recognition apparatus comprising:
pronunciation information registration means for registering a plurality of pieces of pronunciation information for a recognition target;
user registration means for registering the users who can utter each piece of pronunciation information; and
used-acoustic-model registration means for registering the acoustic model to be used for each user,
wherein each piece of pronunciation information is recognized using the acoustic model used by a user who can utter that pronunciation information.