
JP5196114B2 - Speech recognition apparatus and program - Google Patents

Speech recognition apparatus and program

Info

Publication number
JP5196114B2
JP5196114B2 (application JP2007186184A)
Authority
JP
Japan
Prior art keywords
related word
word
words
speaker
reliability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2007186184A
Other languages
Japanese (ja)
Other versions
JP2009025411A (en)
Inventor
裕司 久湊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corp filed Critical Yamaha Corp
Priority to JP2007186184A priority Critical patent/JP5196114B2/en
Publication of JP2009025411A publication Critical patent/JP2009025411A/en
Application granted granted Critical
Publication of JP5196114B2 publication Critical patent/JP5196114B2/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Description

The present invention relates to a technology for recognizing speech.

Techniques for searching an audio signal for specific words (keywords) have long been proposed. For example, the speech recognition apparatus disclosed in Patent Document 1 searches a prerecorded audio signal for a word designated by the user (hereinafter, "designated word") by matching an acoustic model corresponding to that word against the signal.
Patent Document 1: Japanese Patent Laid-Open No. 2001-290496

Because misrecognition cannot be avoided entirely, the technique of Patent Document 1 may retrieve words other than the designated word, or fail to retrieve the designated word at all. Being able to search a recording of a meeting attended by multiple participants for a specific designated word would be convenient, for example, for preparing minutes; but if a word other than the designated word is falsely detected the user must weed it out, and if the designated word is missed, serious problems such as omitted statements in the minutes can arise.

Moreover, in the configuration of Patent Document 1, only portions of the audio signal that exactly match a single designated word are retrieved; distinct words used with the same meaning as the designated word, or words related to it, are not. Detecting every word related to a given designated word therefore requires setting each such word as the designated word and repeating the search. In view of these circumstances, one object of the present invention is to recognize words related to a designated word efficiently while suppressing the possibility of misrecognition.

To solve the above problems, a speech recognition apparatus according to the present invention comprises: related word specifying means that specifies a plurality of related words for a word designated by the user; storage means that stores an appearance probability for each of a plurality of words; probability adjusting means that raises the appearance probability of each related word specified by the related word specifying means relative to the appearance probabilities of words other than the related words; speech recognition means that identifies words corresponding to the speech represented by an audio signal based on the appearance probabilities as adjusted by the probability adjusting means; selection means that selects the related words from among the words identified by the speech recognition means; speaker identification means that divides the audio signal into utterance sections, one per speaker; and display control means that, by determining in which of the utterance sections delimited by the speaker identification means each related word selected by the selection means was uttered, displays the character string of each related word on a display device, organized by speaker. The speech recognition means calculates, for each related word selected by the selection means, a reliability of its own identification result (e.g., reliability A1 in FIG. 1); the speaker identification means calculates, for each utterance section, a reliability of its own segmentation result (e.g., reliability A2 in FIG. 1); and the display control means displays the character string of each related word in a manner (size, display color (hue, lightness, saturation), or typeface) that reflects both the reliability the speech recognition means calculated for that word and the reliability the speaker identification means calculated for the utterance section in which it was uttered.

In this configuration, recognition of the audio signal is performed after the appearance probability of each of the related words derived from the designated word has been raised relative to the others, so compared with a configuration in which recognition runs with every word's appearance probability left at its initial value, each related word can be recognized efficiently while the possibility of misrecognition is suppressed. Note that "raising the appearance probability of each related word relative to the appearance probabilities of words other than the related words" covers at least both a process that raises the probabilities of the related words (leaving the others unchanged) and a process that lowers the probabilities of the non-related words (leaving those of the related words unchanged). Whether the related words specified by the related word specifying means include the designated word itself (the word the user specified) is immaterial to the present invention.

A speech recognition apparatus according to a preferred aspect of the present invention comprises reproduction control means that, when the user designates a related word displayed on the display device, outputs from a sound emitting device the portion of the audio signal corresponding to that related word. Under this aspect, the audio around each related word is played back, so the user can easily verify what was actually said there. The reproduction control means outputs from the sound emitting device, for example, the portion of the audio signal starting at a point a predetermined length of time before the time of the user-designated related word.

The speech recognition apparatus according to the present invention may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to audio processing, or by the cooperation of a general-purpose processor such as a CPU (Central Processing Unit) with a program. A program according to the present invention causes a computer equipped with storage means storing an appearance probability for each of a plurality of words to execute: related word specifying processing that specifies a plurality of related words for a word designated by the user; probability adjustment processing that raises the appearance probability of each related word specified by the related word specifying processing relative to the appearance probabilities of words other than the related words; speech recognition processing that identifies words corresponding to the speech represented by an audio signal based on the appearance probabilities after the probability adjustment processing; selection processing that selects the related words from among the words identified by the speech recognition processing; speaker identification processing that divides the audio signal into utterance sections, one per speaker; and display control processing that, by determining in which of the utterance sections delimited by the speaker identification processing each related word selected by the selection processing was uttered, displays the character string of each related word on a display device, organized by speaker. In the speech recognition processing, a reliability of the identification result is calculated for each related word selected in the selection processing; in the speaker identification processing, a reliability of the segmentation result is calculated for each utterance section; and in the display control processing, the character string of each related word is displayed in a manner reflecting the reliability calculated for that word in the speech recognition processing and the reliability calculated in the speaker identification processing for the utterance section in which it was uttered. This program yields the same operation and effects as the speech recognition apparatus according to the present invention. The program may be provided to users stored on a computer-readable recording medium and installed on a computer, or provided from a server apparatus by distribution over a communication network and installed on a computer.

The present invention is also specified as a method of recognizing speech. A speech recognition method according to a specific aspect includes: a related word specifying step of specifying a plurality of related words for a word designated by the user; a probability adjustment step of raising, from their initial values, the appearance probabilities of the related words specified in the related word specifying step; and a speech recognition step of identifying words corresponding to the speech represented by an audio signal based on the appearance probabilities as processed by the probability adjustment step. This method yields the same operation and effects as the speech recognition apparatus according to the present invention.

FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus 100 according to an embodiment of the present invention. As shown in the figure, the speech recognition apparatus 100 is a computer system comprising a control device 10 and a storage device 30. An input device 42, a display device 44, and a sound emitting device 46 are connected to the control device 10. The input device 42 is equipment (a keyboard and mouse) with which the user enters instructions for the speech recognition apparatus 100; by operating it, the user enters a desired word (keyword) KW. The display device 44 displays various images under the control of the control device 10. The sound emitting device 46 is equipment (for example, loudspeakers or headphones) that emits sound corresponding to a signal supplied from the control device 10.

The storage device 30 stores the program executed by the control device 10 and the various data the control device 10 uses. Any known recording medium, such as a semiconductor or magnetic storage device, may serve as the storage device 30. As shown in FIG. 1, the storage device 30 holds an audio signal S, a co-occurrence database (DB) C, a recognition dictionary D, and a phoneme model group G. These items may also be distributed across separate storage devices.

The audio signal S represents the waveform of audio captured in advance with a sound pickup device (not shown): for example, audio recorded at a meeting where multiple participants speak at will in a space such as a conference room.

The co-occurrence database C associates each of a large number of words with a plurality of other words: a word is linked to its synonyms and to semantically related words (words likely to appear in the same context).

The recognition dictionary D is a database used in recognizing the audio signal S. FIG. 2 is a conceptual diagram schematically showing its contents. As shown there, the recognition dictionary D contains, for each of a plurality (N) of words, a phoneme string DA, a character string DB, and an appearance probability P. The phoneme string DA is the sequence of phonemes making up the word; the character string DB is the sequence of characters (for example, kanji) with which the word is written; and the appearance probability P is the probability that the word occurs, determined in advance for each word by statistically processing how frequently it is used in ordinary text such as newspaper articles.
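As a rough illustration, the dictionary can be modeled as a flat list of entries; the names DictEntry, phonemes, surface, and prob below are hypothetical, chosen only to mirror DA, DB, and P, and the priors are toy values, not corpus statistics (Python sketch):

```python
from dataclasses import dataclass

@dataclass
class DictEntry:
    """One word of the recognition dictionary D (illustrative layout)."""
    phonemes: list[str]  # phoneme string DA
    surface: str         # character string DB
    prob: float          # appearance probability P (initial unigram prior)

# A toy dictionary; in practice the priors come from corpus statistics.
recognition_dict = [
    DictEntry(["k", "o", "s", "u", "t", "o"], "コスト", 0.0010),
    DictEntry(["s", "a", "k", "u", "g", "e", "N"], "削減", 0.0006),
    DictEntry(["k", "a", "k", "a", "k", "u"], "価格", 0.0008),
]
```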

The phoneme model group G of FIG. 1 consists of a plurality of phoneme models, each modeling the acoustic characteristics of average speech for one phoneme. Any known probabilistic model, typified by the hidden Markov model, may be adopted as the phoneme model.

By executing the program stored in the storage device 30, the control device 10 functions as a plurality of elements (related word specifying unit 12, probability adjusting unit 14, speech recognition unit 16, speaker identification unit 22, display control unit 24, and reproduction control unit 26). The function of each element realized by the control device 10 is detailed below. Each element of the control device 10 may also be realized by an electronic circuit such as a DSP dedicated to audio processing, and the control device 10 may be implemented distributed across a plurality of integrated circuits.

The related word specifying unit 12 specifies a plurality of related words RW for the designated word KW. The related words RW it specifies comprise the words associated with the designated word KW in the co-occurrence database C, plus the designated word KW itself as entered by the user from the input device 42.

The probability adjusting unit 14 looks up each related word RW (including the designated word KW) among the words registered in the recognition dictionary D and raises that word's appearance probability P from its initial value; for example, it computes the updated appearance probability P by multiplying the initial value by, or adding to it, a predetermined coefficient. The appearance probability P of every word other than the related words RW is kept at its initial value.
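A minimal sketch of this adjustment, reusing the hypothetical DictEntry layout above; the boost factor of 5.0 is an assumed illustrative value, not one the patent prescribes:

```python
def adjust_probabilities(entries: list[DictEntry],
                         related_words: set[str],
                         boost: float = 5.0) -> None:
    """Raise the prior P of every related word RW (the designated word KW
    included); every other entry keeps its initial value."""
    for e in entries:
        if e.surface in related_words:
            e.prob *= boost  # multiplicative update; an additive offset also works

# e.g. for KW "コスト" with related words from the co-occurrence database C:
adjust_probabilities(recognition_dict, {"コスト", "削減", "価格"})
```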

The speech recognition unit 16 is the means that recognizes the audio signal S stored in the storage device 30. More specifically, based on the recognition dictionary D and the phoneme model group G, it sequentially identifies the time series of words (related words RW and others) corresponding to the speech of the audio signal S, together with the time at which each word was uttered. Among the words registered in the recognition dictionary D, the word with the highest evaluation value (score) SC is selected; SC is computed as the sum, or a weighted sum, of an acoustic evaluation value (acoustic score) AS and a linguistic evaluation value (language score) LS.

The acoustic evaluation value AS is a figure indicating the correlation (for example, the distance) between the time series of acoustic features (for example, MFCC (Mel Frequency Cepstral Coefficients)) extracted from the audio signal S frame by frame and the acoustic model of each word. The acoustic model of a word is the probabilistic model obtained by selecting from the phoneme model group G the phoneme model of each phoneme in the word's phoneme string DA and combining them; accordingly, the closer a word's acoustic model approximates the feature time series of the audio signal S, the higher its acoustic evaluation value AS. The linguistic evaluation value LS, in turn, is a figure corresponding to the appearance probability P set for each word in the recognition dictionary D after adjustment by the probability adjusting unit 14; in this embodiment the appearance probability P itself is adopted as LS. Since the evaluation value SC is computed by adding the acoustic evaluation value AS and the linguistic evaluation value LS, SC increases as either of them increases.
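The word selection then reduces to an argmax over SC. The sketch below assumes the acoustic scores AS have already been computed elsewhere from the phoneme-model matching described above, and follows this embodiment in taking the adjusted prior P directly as LS:

```python
def best_word(acoustic_scores: dict[str, float],
              entries: list[DictEntry],
              w: float = 1.0) -> str:
    """Return the surface form with the highest SC = AS + w * LS, where
    LS is the (adjusted) appearance probability P itself.
    `acoustic_scores` maps each surface form to its acoustic score AS."""
    return max(entries,
               key=lambda e: acoustic_scores[e.surface] + w * e.prob).surface
```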

Because the probability adjusting unit 14 increases the appearance probability P of each related word RW, the evaluation value SC of that word rises (via the linguistic evaluation value LS), and the probability that the speech recognition unit 16 recognizes it is higher than if the word had not been specified as a related word RW. In other words, each related word RW becomes easier for the speech recognition unit 16 to recognize.

As shown in FIG. 1, the speech recognition unit 16 includes a selection unit 162, which selects each related word RW out of the time series of words the speech recognition unit 16 recognized in the processing above. The speech recognition unit 16 further calculates, for each related word RW selected by the selection unit 162, the certainty (hereinafter "reliability") A1 of its recognition of that word. Since a higher evaluation value SC indicates a more plausible recognition result, this embodiment computes A1 as the ratio of the related word's evaluation value SC to the sum of the evaluation values SC calculated for the candidate words during recognition (SC of the related word RW / sum of the SC values). The speech recognition unit 16 outputs, for each related word RW in turn, the character string DB of the word selected by the selection unit 162, the time T1 at which it was uttered, and the reliability A1. This completes the processing performed by the speech recognition unit 16.
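The ratio defining A1 is straightforward; a sketch, assuming the SC values of all candidate words for the same stretch of signal are at hand:

```python
def confidence_a1(sc_word: float, sc_candidates: list[float]) -> float:
    """A1: the related word's evaluation value SC as a share of the sum
    of the SC values computed for all candidate words."""
    return sc_word / sum(sc_candidates)
```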

The speaker identification unit 22 divides the audio signal S along the time axis into a plurality of sections (hereinafter "utterance sections"), one per speaker. For example, it groups the acoustic features (for example, MFCC) extracted for each frame of the audio signal S into a plurality of sets (clusters), so that a distinct cluster is established for each speaker. It then assigns each frame of the audio signal S to the cluster whose center vector is most similar to (at minimum distance from) the frame's features, thereby partitioning the audio signal S into utterance sections per speaker (per cluster).

The speaker identification unit 22 also calculates, for each utterance section, the certainty (hereinafter "reliability") A2 of its own segmentation; for example, the reliability A2 of an utterance section belonging to a cluster is computed as the reciprocal of the mean distance between the features extracted from each frame in that section and the cluster's center vector. For each utterance section in turn, the speaker identification unit 22 outputs the identification code I uniquely assigned to that section's speaker (cluster), the start and end times T2 of the section, and its reliability A2.
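A sketch of the two preceding steps, assuming MFCC frames have already been extracted and cluster center vectors estimated (for example by k-means, a choice the patent does not prescribe):

```python
import numpy as np

def assign_frames(mfcc: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Label each MFCC frame (row of `mfcc`, shape (n_frames, dim)) with
    the index of the nearest cluster centroid, one cluster per speaker."""
    dists = np.linalg.norm(mfcc[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def confidence_a2(section_mfcc: np.ndarray, centroid: np.ndarray) -> float:
    """A2 for one utterance section: reciprocal of the mean distance
    between the section's frames and its cluster's center vector."""
    return 1.0 / np.linalg.norm(section_mfcc - centroid, axis=1).mean()
```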

The display control unit 24 causes the display device 44 to display the character string DB of each related word RW identified by the speech recognition unit 16. FIG. 3 is a schematic diagram illustrating the screen (hereinafter "search result display screen") 442 shown on the display device 44. As illustrated, the search result display screen 442 demarcates a region R (R1 to R3) for each speaker the speaker identification unit 22 distinguished; each speaker's identification code I (I1 to I3) is placed near the corresponding region R, and times measured from the start of the audio signal S (0:00) are laid out at equal intervals from top to bottom.

The display control unit 24 places the character string DB of each related word RW identified by the speech recognition unit 16 inside the region R of the speaker who uttered it; the character strings DB are thus displayed separated by speaker. More precisely, among the utterance sections identified by the times T2 from the speaker identification unit 22, the display control unit 24 determines the section containing each related word's time T1 (i.e., the section in which the related word RW was uttered) and places the word's character string DB at the position corresponding to T1 within that section's speaker's region R. FIG. 3, for example, illustrates the case in which the related word RW "削減" (reduction), identified for the designated word KW "コスト" (cost), was uttered by the speaker with identification code I1 within the interval from time "0:30" to "1:00". The display control unit 24 displays the character strings DB of the related words RW in a different color for each region R (each speaker).

The display control unit 24 also calculates a reliability A0 for each related word RW identified by the speech recognition unit 16: the sum (or a weighted sum) of the reliability A1 the speech recognition unit 16 determined for the word and the reliability A2 the speaker identification unit 22 determined for the utterance section containing the word's time T1. It then has the display device 44 render the character string DB of each related word RW in a manner reflecting that word's reliability A0; for example, at a size corresponding to A0. In the illustration of FIG. 3, the reliability A0 of the related word RW "価格" (price) uttered by the speaker with identification code I2 within the interval from time "1:00" to "1:15" is higher than that of the same word uttered by the speakers with identification codes I2 and I3 within the interval from "0:15" to "0:30" (hence its character string DB is rendered larger).
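One way to realize the size-based rendering; the weighted-sum variant is used here, and the point-size constants are assumptions for illustration only:

```python
def render_size(a1: float, a2: float,
                w1: float = 1.0, w2: float = 1.0,
                base_pt: int = 10, scale_pt: float = 20.0) -> int:
    """Map the combined reliability A0 = w1*A1 + w2*A2 to a font size
    in points for displaying the character string DB."""
    a0 = w1 * a1 + w2 * a2
    return round(base_pt + scale_pt * a0)
```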

The user can designate, via the input device 42, any related word RW placed on the search result display screen 442. The reproduction control unit 26 of FIG. 1 then sequentially outputs to the sound emitting device 46 the section of the stored audio signal S from the point corresponding to that word's time T1 onward (the portion corresponding to the related word RW), so the audio containing the related word RW is emitted by the sound emitting device 46. The playback start point set by the reproduction control unit 26 may also be a point a predetermined length of time before the related word's time T1.
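The lead-in playback variant amounts to a clamped offset; the 2-second default below is an assumed value, since the patent says only "a predetermined length of time":

```python
def playback_start(t1: float, lead_in: float = 2.0) -> float:
    """Start playback a fixed lead-in before the word's onset time T1
    (seconds), clamped to the beginning of the recording."""
    return max(0.0, t1 - lead_in)
```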

As explained above, in this embodiment the appearance probability P of each related word RW is raised above those of the other words, so the chance of misrecognizing the audio signal S is lower than in a configuration that feeds the initial appearance probabilities P straight into recognition. Furthermore, because a plurality of related words RW corresponding to the designated word KW are specified, a broad range of words (related words RW) reflecting the user's intent is retrieved efficiently, compared with a configuration that searches the audio signal S for the designated word KW alone.

In addition, since the audio signal S is partitioned by speaker and the character strings DB of the related words RW are displayed separated by speaker, the user can easily tell who uttered each related word RW. Since each character string DB is placed within a region R at the position corresponding to its time T1, the user can also grasp at a glance when each speaker uttered a related word RW and in what order the speakers spoke. And because each related word's character string DB is rendered in a manner reflecting its recognition reliability A0 (A1, A2), the user can intuitively judge the reliability A0 of each related word RW, enabling efficient use such as playing back the related words RW in descending order of reliability A0.

<Modifications>
Various modifications can be made to the embodiment above; specific examples follow. Two or more of the following aspects may be freely selected and combined.

(1) Modification 1
The embodiment above illustrated a configuration in which the speech recognition unit 16 contains the selection unit 162, but a configuration in which the display control unit 24 contains it can also be adopted. For example, the speech recognition unit 16 outputs to the display control unit 24 the character string DB, time T1, and reliability A1 of every word it recognized from the audio signal S (related words RW and others alike), and the selection unit 162 of the display control unit 24 selects the related words RW from the words so reported and has the display device 44 show their character strings DB. Under the configuration of FIG. 1, however, the related words RW are extracted from the recognition result before the character string DB, time T1, and reliability A1 are determined, so those need be determined only for the related words RW, which has the advantage of reducing the processing load.

(2) Modification 2
For convenience, the probability adjusting unit 14 was illustrated as an element separate from the speech recognition unit 16, but its function may be built into the speech recognition unit 16. For example, when the words are selected one after another and the evaluation value SC of each is computed, the speech recognition unit 16 raises a word's appearance probability P before computing SC if the word is a related word RW, and computes SC with the appearance probability P kept at its initial value otherwise.

(3) Modification 3
Selectively outputting (retrieving) the related words RW from among the words the speech recognition unit 16 recognized is not essential to the present invention. For example, a configuration in which the display control unit 24 sequentially outputs via the display device 44 the character string DB of every word the speech recognition unit 16 identified from the audio signal S (related words RW and others alike) can also be adopted. Since the appearance probability P of each related word RW corresponding to the designated word KW is raised from its initial value, the intended effect of recognizing the related words RW efficiently while suppressing misrecognition is still obtained even when the character strings DB of all identified words are output. The selection unit 162 (the element that retrieves the related words RW from the audio signal S) may thus be omitted where appropriate. The speaker identification unit 22, which partitions the audio signal S by speaker, may also be omitted; in that case, the character strings DB of the words recognized by the speech recognition unit 16 are displayed in time series, without distinction by speaker.

The reproduction control unit 26 may likewise be omitted. With the configuration of FIG. 1 including it, however, the user can easily check the audio at a desired point (the point in the audio signal S at which a particular speaker uttered a particular word). Note that the speaker identification unit 22 in the embodiment above merely tells speakers apart (it does not establish who each speaker is), so the user cannot determine from the search result display screen 442 alone whose region each region R is; with the reproduction control unit 26 present, though, the user can pin down the speaker of the related words RW in each region R by listening to the played-back audio signal S.

(4) Modification 4
The embodiment above illustrated the speaker identification unit 22 merely partitioning the audio signal S by speaker, but a configuration in which it goes on to identify each speaker is also suitable. For example, a model of the features extracted from each person's speech (for example, a Gaussian mixture model) and the person's name are stored in the storage device 30 in advance, speaker by speaker. The speaker identification unit 22 determines the speaker's name for each utterance section of the audio signal S by comparing the features extracted from the signal against the stored feature models, and displays the name near the corresponding region R of the search result display screen 442. With this configuration the user can identify the speaker of each related word RW without listening to the played-back audio signal S.

(5) Modification 5
In the embodiment above the probability adjusting unit 14 raised the appearance probability P of each related word RW, but a configuration in which it instead lowers the appearance probabilities P of the words other than the related words RW (keeping those of the related words RW at their initial values) can also be adopted. Since the words other than the related words RW vastly outnumber the related words RW, however, adjusting the probabilities of the related words RW has the advantage of a lighter processing load for the probability adjusting unit 14 than adjusting those of all the other words.

(6) Modification 6
Calculating the reliabilities A0 (A1, A2) is not essential to the present invention, so the variable control of how each related word's character string DB is rendered may be omitted. Configurations in which the display control unit 24 controls the rendering of the character strings DB based on the reliability A1 alone (omitting the calculation of A2), or on the reliability A2 alone (omitting the calculation of A1), are also possible.

FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram schematically showing the structure of the recognition dictionary.
FIG. 3 is a conceptual diagram showing the contents of the search result display screen.

Explanation of symbols

100: speech recognition apparatus; 10: control device; 12: related word specifying unit; 14: probability adjusting unit; 16: speech recognition unit; 162: selection unit; 22: speaker identification unit; 24: display control unit; 26: reproduction control unit; 30: storage device; 42: input device; 44: display device; 46: sound emitting device; KW: designated word; RW: related word; S: audio signal; C: co-occurrence database; D: recognition dictionary; DA: phoneme string; DB: character string; P: appearance probability; G: phoneme model group.

Claims (4)

1. A speech recognition apparatus comprising:
related word specifying means that specifies a plurality of related words for a word designated by a user;
storage means that stores an appearance probability for each of a plurality of words;
probability adjusting means that raises the appearance probability of each related word specified by the related word specifying means relative to the appearance probabilities of words other than the related words;
speech recognition means that identifies words corresponding to the speech represented by an audio signal based on the appearance probabilities as adjusted by the probability adjusting means;
selection means that selects the related words from among the words identified by the speech recognition means;
speaker identification means that divides the audio signal into utterance sections, one per speaker; and
display control means that, by determining in which of the utterance sections delimited by the speaker identification means each related word selected by the selection means was uttered, displays a character string of each related word on a display device, organized by speaker,
wherein the speech recognition means calculates, for each related word selected by the selection means, a reliability of the identification result produced by the speech recognition means,
the speaker identification means calculates, for each utterance section, a reliability of the segmentation result produced by the speaker identification means, and
the display control means displays the character string of each related word on the display device in a manner reflecting the reliability the speech recognition means calculated for that related word and the reliability the speaker identification means calculated for the utterance section in which that related word was uttered.

2. The speech recognition apparatus according to claim 1, further comprising reproduction control means that, when the user designates a related word displayed on the display device, outputs from a sound emitting device the portion of the audio signal corresponding to that related word.

3. The speech recognition apparatus according to claim 2, wherein the reproduction control means outputs from the sound emitting device the portion of the audio signal starting at a point a predetermined length of time before the time of the user-designated related word.

4. A program that causes a computer comprising storage means storing an appearance probability for each of a plurality of words to execute:
related word specifying processing that specifies a plurality of related words for a word designated by a user;
probability adjustment processing that raises the appearance probability of each related word specified by the related word specifying processing relative to the appearance probabilities of words other than the related words;
speech recognition processing that identifies words corresponding to the speech represented by an audio signal based on the appearance probabilities after the probability adjustment processing;
selection processing that selects the related words from among the words identified by the speech recognition processing;
speaker identification processing that divides the audio signal into utterance sections, one per speaker; and
display control processing that, by determining in which of the utterance sections delimited by the speaker identification processing each related word selected by the selection processing was uttered, displays a character string of each related word on a display device, organized by speaker,
wherein the speech recognition processing calculates, for each related word selected in the selection processing, a reliability of the identification result produced by the speech recognition processing,
the speaker identification processing calculates, for each utterance section, a reliability of the segmentation result produced by the speaker identification processing, and
the display control processing displays the character string of each related word on the display device in a manner reflecting the reliability calculated for that related word in the speech recognition processing and the reliability calculated in the speaker identification processing for the utterance section in which that related word was uttered.
JP2007186184A 2007-07-17 2007-07-17 Speech recognition apparatus and program Expired - Fee Related JP5196114B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007186184A JP5196114B2 (en) 2007-07-17 2007-07-17 Speech recognition apparatus and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2007186184A JP5196114B2 (en) 2007-07-17 2007-07-17 Speech recognition apparatus and program

Publications (2)

Publication Number Publication Date
JP2009025411A (en) 2009-02-05
JP5196114B2 (en) 2013-05-15

Family

ID=40397288

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2007186184A Expired - Fee Related JP5196114B2 (en) 2007-07-17 2007-07-17 Speech recognition apparatus and program

Country Status (1)

Country Link
JP (1) JP5196114B2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5206553B2 (en) * 2009-03-31 2013-06-12 日本電気株式会社 Browsing system, method, and program
JP6556575B2 (en) 2015-09-15 2019-08-07 株式会社東芝 Audio processing apparatus, audio processing method, and audio processing program
JP2021120786A (en) * 2020-01-30 2021-08-19 Tis株式会社 Information processing device, information processing method, and information processing program
JP6953597B1 (en) * 2020-09-17 2021-10-27 ベルフェイス株式会社 Information processing equipment, programs and information processing methods

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004030623A (en) * 1993-02-04 2004-01-29 Matsushita Electric Ind Co Ltd Work state management device
JP2001290496A (en) * 2000-04-07 2001-10-19 Ricoh Co Ltd Speech retrieval device, speech retrieval method and recording medium
JP2001325250A (en) * 2000-05-15 2001-11-22 Ricoh Co Ltd Minutes preparation device, minutes preparation method and recording medium
JP3927800B2 (en) * 2001-12-04 2007-06-13 キヤノン株式会社 Voice recognition apparatus and method, program, and storage medium
JP2005025571A (en) * 2003-07-03 2005-01-27 Ns Solutions Corp Business support device, business support method, and its program
JP4558308B2 (en) * 2003-12-03 2010-10-06 ニュアンス コミュニケーションズ,インコーポレイテッド Voice recognition system, data processing apparatus, data processing method thereof, and program
JP3955880B2 (en) * 2004-11-30 2007-08-08 松下電器産業株式会社 Voice recognition device
JP2007017839A (en) * 2005-07-11 2007-01-25 Nissan Motor Co Ltd Speech recognition device
JP2007171809A (en) * 2005-12-26 2007-07-05 Canon Inc Information processor and information processing method
JP2007178927A (en) * 2005-12-28 2007-07-12 Canon Inc Information retrieving device and method

Also Published As

Publication number Publication date
JP2009025411A (en) 2009-02-05

Similar Documents

Publication Publication Date Title
US11887590B2 (en) Voice enablement and disablement of speech processing functionality
US20220156039A1 (en) Voice Control of Computing Devices
US10884701B2 (en) Voice enabling applications
US10056078B1 (en) Output of content based on speech-based searching and browsing requests
US20230317074A1 (en) Contextual voice user interface
US11823678B2 (en) Proactive command framework
JP6550068B2 (en) Pronunciation prediction in speech recognition
US8380505B2 (en) System for recognizing speech for searching a database
JP2005157494A (en) Conversation control apparatus and conversation control method
JP2008097082A (en) Voice interaction apparatus
JP2001083987A (en) Mark insertion device and its method
US20130289987A1 (en) Negative Example (Anti-Word) Based Performance Improvement For Speech Recognition
JP5753769B2 (en) Voice data retrieval system and program therefor
JP2008046538A (en) System supporting text-to-speech synthesis
US8566091B2 (en) Speech recognition system
US20050187767A1 (en) Dynamic N-best algorithm to reduce speech recognition errors
JP2005227686A (en) Speech recognizer, speech recognition program and recording medium
JP5196114B2 (en) Speech recognition apparatus and program
KR100467590B1 (en) Apparatus and method for updating a lexicon
US11935533B1 (en) Content-related actions based on context
WO2019113516A1 (en) Voice control of computing devices
US11551666B1 (en) Natural language processing
US11328713B1 (en) On-device contextual understanding
JP2001109491A (en) Continuous voice recognition device and continuous voice recognition method
JP3841342B2 (en) Speech recognition apparatus and speech recognition program

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20100520

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20110720

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20110816

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20111014

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20120501

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20120627

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20130109

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20130122

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20160215

Year of fee payment: 3

R150 Certificate of patent or registration of utility model

Ref document number: 5196114

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150


LAPS Cancellation because of no payment of annual fees