
JP5196114B2 - Speech recognition apparatus and program - Google Patents

Speech recognition apparatus and program

Info

Publication number
JP5196114B2
JP5196114B2 (application JP2007186184A)
Authority
JP
Japan
Prior art keywords
related word
word
words
speaker
reliability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2007186184A
Other languages
Japanese (ja)
Other versions
JP2009025411A (en)
Inventor
裕司 久湊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corp filed Critical Yamaha Corp
Priority to JP2007186184A priority Critical patent/JP5196114B2/en
Publication of JP2009025411A publication Critical patent/JP2009025411A/en
Application granted granted Critical
Publication of JP5196114B2 publication Critical patent/JP5196114B2/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Description

The present invention relates to a technology for recognizing speech.

Techniques for searching an audio signal for specific words (keywords) have long been proposed. For example, the speech recognition apparatus disclosed in Patent Document 1 searches a prerecorded audio signal for a word designated by the user (hereinafter, "designated word") by matching an acoustic model corresponding to that word against the signal.
Patent Document 1: Japanese Patent Laid-Open No. 2001-290496

Because misrecognition cannot be avoided entirely, the technique of Patent Document 1 may retrieve words other than the designated word, or fail to retrieve the designated word at all. Being able to search a recording of a meeting attended by multiple participants for a specific designated word would be convenient, for example, for preparing minutes; but if a word other than the designated word is falsely detected the user must weed it out, and if the designated word is missed, serious problems such as omitted statements in the minutes can arise.

Moreover, in the configuration of Patent Document 1, only portions of the audio signal that exactly match a single designated word are retrieved; distinct words used with the same meaning as the designated word, or words related to it, are not. Detecting every word related to a given designated word therefore requires setting each such word as the designated word and repeating the search. In view of these circumstances, one object of the present invention is to recognize words related to a designated word efficiently while suppressing the possibility of misrecognition.

To solve the above problems, a speech recognition apparatus according to the present invention comprises: related word specifying means that specifies a plurality of related words for a word designated by the user; storage means that stores an appearance probability for each of a plurality of words; probability adjusting means that raises the appearance probability of each related word specified by the related word specifying means relative to the appearance probabilities of words other than the related words; speech recognition means that identifies words corresponding to the speech represented by an audio signal based on the appearance probabilities as adjusted by the probability adjusting means; selection means that selects the related words from among the words identified by the speech recognition means; speaker identification means that divides the audio signal into utterance sections, one per speaker; and display control means that, by determining in which of the utterance sections delimited by the speaker identification means each related word selected by the selection means was uttered, displays the character string of each related word on a display device, organized by speaker. The speech recognition means calculates, for each related word selected by the selection means, a reliability of its own identification result (e.g., reliability A1 in FIG. 1); the speaker identification means calculates, for each utterance section, a reliability of its own segmentation result (e.g., reliability A2 in FIG. 1); and the display control means displays the character string of each related word in a manner (size, display color (hue, lightness, saturation), or typeface) that reflects both the reliability the speech recognition means calculated for that word and the reliability the speaker identification means calculated for the utterance section in which it was uttered.

In this configuration, recognition of the audio signal is performed after the appearance probability of each of the related words derived from the designated word has been raised relative to the others, so compared with a configuration in which recognition runs with every word's appearance probability left at its initial value, each related word can be recognized efficiently while the possibility of misrecognition is suppressed. Note that "raising the appearance probability of each related word relative to the appearance probabilities of words other than the related words" covers at least both a process that raises the probabilities of the related words (leaving the others unchanged) and a process that lowers the probabilities of the non-related words (leaving those of the related words unchanged). Whether the related words specified by the related word specifying means include the designated word itself (the word the user specified) is immaterial to the present invention.

A speech recognition apparatus according to a preferred aspect of the present invention comprises reproduction control means that, when the user designates a related word displayed on the display device, outputs from a sound emitting device the portion of the audio signal corresponding to that related word. Under this aspect, the audio around each related word is played back, so the user can easily verify what was actually said there. The reproduction control means outputs from the sound emitting device, for example, the portion of the audio signal starting at a point a predetermined length of time before the time of the user-designated related word.

The speech recognition apparatus according to the present invention may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to audio processing, or by the cooperation of a general-purpose processor such as a CPU (Central Processing Unit) with a program. A program according to the present invention causes a computer equipped with storage means storing an appearance probability for each of a plurality of words to execute: related word specifying processing that specifies a plurality of related words for a word designated by the user; probability adjustment processing that raises the appearance probability of each related word specified by the related word specifying processing relative to the appearance probabilities of words other than the related words; speech recognition processing that identifies words corresponding to the speech represented by an audio signal based on the appearance probabilities after the probability adjustment processing; selection processing that selects the related words from among the words identified by the speech recognition processing; speaker identification processing that divides the audio signal into utterance sections, one per speaker; and display control processing that, by determining in which of the utterance sections delimited by the speaker identification processing each related word selected by the selection processing was uttered, displays the character string of each related word on a display device, organized by speaker. In the speech recognition processing, a reliability of the identification result is calculated for each related word selected in the selection processing; in the speaker identification processing, a reliability of the segmentation result is calculated for each utterance section; and in the display control processing, the character string of each related word is displayed in a manner reflecting the reliability calculated for that word in the speech recognition processing and the reliability calculated in the speaker identification processing for the utterance section in which it was uttered. This program yields the same operation and effects as the speech recognition apparatus according to the present invention. The program may be provided to users stored on a computer-readable recording medium and installed on a computer, or provided from a server apparatus by distribution over a communication network and installed on a computer.

The present invention is also specified as a method of recognizing speech. A speech recognition method according to a specific aspect includes: a related word specifying step of specifying a plurality of related words for a word designated by the user; a probability adjustment step of raising, from their initial values, the appearance probabilities of the related words specified in the related word specifying step; and a speech recognition step of identifying words corresponding to the speech represented by an audio signal based on the appearance probabilities as processed by the probability adjustment step. This method yields the same operation and effects as the speech recognition apparatus according to the present invention.

FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus 100 according to an embodiment of the present invention. As shown in the figure, the speech recognition apparatus 100 is a computer system comprising a control device 10 and a storage device 30. An input device 42, a display device 44, and a sound emitting device 46 are connected to the control device 10. The input device 42 is equipment (a keyboard and mouse) with which the user enters instructions for the speech recognition apparatus 100; by operating it, the user enters a desired word (keyword) KW. The display device 44 displays various images under the control of the control device 10. The sound emitting device 46 is equipment (for example, loudspeakers or headphones) that emits sound corresponding to a signal supplied from the control device 10.

The storage device 30 stores the program executed by the control device 10 and the various data the control device 10 uses. Any known recording medium, such as a semiconductor or magnetic storage device, may serve as the storage device 30. As shown in FIG. 1, the storage device 30 holds an audio signal S, a co-occurrence database (DB) C, a recognition dictionary D, and a phoneme model group G. These items may also be distributed across separate storage devices.

The audio signal S represents the waveform of audio captured in advance with a sound pickup device (not shown): for example, audio recorded at a meeting where multiple participants speak at will in a space such as a conference room.

The co-occurrence database C associates each of a large number of words with a plurality of other words: a word is linked to its synonyms and to semantically related words (words likely to appear in the same context).

The recognition dictionary D is a database used in recognizing the audio signal S. FIG. 2 is a conceptual diagram schematically showing its contents. As shown there, the recognition dictionary D contains, for each of a plurality (N) of words, a phoneme string DA, a character string DB, and an appearance probability P. The phoneme string DA is the sequence of phonemes making up the word; the character string DB is the sequence of characters (for example, kanji) with which the word is written; and the appearance probability P is the probability that the word occurs, determined in advance for each word by statistically processing how frequently it is used in ordinary text such as newspaper articles.
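As a rough illustration, the dictionary can be modeled as a flat list of entries; the names DictEntry, phonemes, surface, and prob below are hypothetical, chosen only to mirror DA, DB, and P, and the priors are toy values, not corpus statistics (Python sketch):

```python
from dataclasses import dataclass

@dataclass
class DictEntry:
    """One word of the recognition dictionary D (illustrative layout)."""
    phonemes: list[str]  # phoneme string DA
    surface: str         # character string DB
    prob: float          # appearance probability P (initial unigram prior)

# A toy dictionary; in practice the priors come from corpus statistics.
recognition_dict = [
    DictEntry(["k", "o", "s", "u", "t", "o"], "コスト", 0.0010),
    DictEntry(["s", "a", "k", "u", "g", "e", "N"], "削減", 0.0006),
    DictEntry(["k", "a", "k", "a", "k", "u"], "価格", 0.0008),
]
```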

The phoneme model group G of FIG. 1 consists of a plurality of phoneme models, each modeling the acoustic characteristics of average speech for one phoneme. Any known probabilistic model, typified by the hidden Markov model, may be adopted as the phoneme model.

By executing the program stored in the storage device 30, the control device 10 functions as a plurality of elements (related word specifying unit 12, probability adjusting unit 14, speech recognition unit 16, speaker identification unit 22, display control unit 24, and reproduction control unit 26). The function of each element realized by the control device 10 is detailed below. Each element of the control device 10 may also be realized by an electronic circuit such as a DSP dedicated to audio processing, and the control device 10 may be implemented distributed across a plurality of integrated circuits.

The related word specifying unit 12 specifies a plurality of related words RW for the designated word KW. The related words RW it specifies comprise the words associated with the designated word KW in the co-occurrence database C, plus the designated word KW itself as entered by the user from the input device 42.

The probability adjusting unit 14 looks up each related word RW (including the designated word KW) among the words registered in the recognition dictionary D and raises that word's appearance probability P from its initial value; for example, it computes the updated appearance probability P by multiplying the initial value by, or adding to it, a predetermined coefficient. The appearance probability P of every word other than the related words RW is kept at its initial value.
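A minimal sketch of this adjustment, reusing the hypothetical DictEntry layout above; the boost factor of 5.0 is an assumed illustrative value, not one the patent prescribes:

```python
def adjust_probabilities(entries: list[DictEntry],
                         related_words: set[str],
                         boost: float = 5.0) -> None:
    """Raise the prior P of every related word RW (the designated word KW
    included); every other entry keeps its initial value."""
    for e in entries:
        if e.surface in related_words:
            e.prob *= boost  # multiplicative update; an additive offset also works

# e.g. for KW "コスト" with related words from the co-occurrence database C:
adjust_probabilities(recognition_dict, {"コスト", "削減", "価格"})
```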

The speech recognition unit 16 is the means that recognizes the audio signal S stored in the storage device 30. More specifically, based on the recognition dictionary D and the phoneme model group G, it sequentially identifies the time series of words (related words RW and others) corresponding to the speech of the audio signal S, together with the time at which each word was uttered. Among the words registered in the recognition dictionary D, the word with the highest evaluation value (score) SC is selected; SC is computed as the sum, or a weighted sum, of an acoustic evaluation value (acoustic score) AS and a linguistic evaluation value (language score) LS.

The acoustic evaluation value AS is a figure indicating the correlation (for example, the distance) between the time series of acoustic features (for example, MFCC (Mel Frequency Cepstral Coefficients)) extracted from the audio signal S frame by frame and the acoustic model of each word. The acoustic model of a word is the probabilistic model obtained by selecting from the phoneme model group G the phoneme model of each phoneme in the word's phoneme string DA and combining them; accordingly, the closer a word's acoustic model approximates the feature time series of the audio signal S, the higher its acoustic evaluation value AS. The linguistic evaluation value LS, in turn, is a figure corresponding to the appearance probability P set for each word in the recognition dictionary D after adjustment by the probability adjusting unit 14; in this embodiment the appearance probability P itself is adopted as LS. Since the evaluation value SC is computed by adding the acoustic evaluation value AS and the linguistic evaluation value LS, SC increases as either of them increases.
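The word selection then reduces to an argmax over SC. The sketch below assumes the acoustic scores AS have already been computed elsewhere from the phoneme-model matching described above, and follows this embodiment in taking the adjusted prior P directly as LS:

```python
def best_word(acoustic_scores: dict[str, float],
              entries: list[DictEntry],
              w: float = 1.0) -> str:
    """Return the surface form with the highest SC = AS + w * LS, where
    LS is the (adjusted) appearance probability P itself.
    `acoustic_scores` maps each surface form to its acoustic score AS."""
    return max(entries,
               key=lambda e: acoustic_scores[e.surface] + w * e.prob).surface
```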

Because the probability adjusting unit 14 increases the appearance probability P of each related word RW, the evaluation value SC of that word rises (via the linguistic evaluation value LS), and the probability that the speech recognition unit 16 recognizes it is higher than if the word had not been specified as a related word RW. In other words, each related word RW becomes easier for the speech recognition unit 16 to recognize.

As shown in FIG. 1, the speech recognition unit 16 includes a selection unit 162, which selects each related word RW out of the time series of words the speech recognition unit 16 recognized in the processing above. The speech recognition unit 16 further calculates, for each related word RW selected by the selection unit 162, the certainty (hereinafter "reliability") A1 of its recognition of that word. Since a higher evaluation value SC indicates a more plausible recognition result, this embodiment computes A1 as the ratio of the related word's evaluation value SC to the sum of the evaluation values SC calculated for the candidate words during recognition (SC of the related word RW / sum of the SC values). The speech recognition unit 16 outputs, for each related word RW in turn, the character string DB of the word selected by the selection unit 162, the time T1 at which it was uttered, and the reliability A1. This completes the processing performed by the speech recognition unit 16.
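The ratio defining A1 is straightforward; a sketch, assuming the SC values of all candidate words for the same stretch of signal are at hand:

```python
def confidence_a1(sc_word: float, sc_candidates: list[float]) -> float:
    """A1: the related word's evaluation value SC as a share of the sum
    of the SC values computed for all candidate words."""
    return sc_word / sum(sc_candidates)
```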

The speaker identification unit 22 divides the audio signal S along the time axis into a plurality of sections (hereinafter "utterance sections"), one per speaker. For example, it groups the acoustic features (for example, MFCC) extracted for each frame of the audio signal S into a plurality of sets (clusters), so that a distinct cluster is established for each speaker. It then assigns each frame of the audio signal S to the cluster whose center vector is most similar to (at minimum distance from) the frame's features, thereby partitioning the audio signal S into utterance sections per speaker (per cluster).

The speaker identification unit 22 also calculates, for each utterance section, the certainty (hereinafter "reliability") A2 of its own segmentation; for example, the reliability A2 of an utterance section belonging to a cluster is computed as the reciprocal of the mean distance between the features extracted from each frame in that section and the cluster's center vector. For each utterance section in turn, the speaker identification unit 22 outputs the identification code I uniquely assigned to that section's speaker (cluster), the start and end times T2 of the section, and its reliability A2.
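A sketch of the two preceding steps, assuming MFCC frames have already been extracted and cluster center vectors estimated (for example by k-means, a choice the patent does not prescribe):

```python
import numpy as np

def assign_frames(mfcc: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Label each MFCC frame (row of `mfcc`, shape (n_frames, dim)) with
    the index of the nearest cluster centroid, one cluster per speaker."""
    dists = np.linalg.norm(mfcc[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def confidence_a2(section_mfcc: np.ndarray, centroid: np.ndarray) -> float:
    """A2 for one utterance section: reciprocal of the mean distance
    between the section's frames and its cluster's center vector."""
    return 1.0 / np.linalg.norm(section_mfcc - centroid, axis=1).mean()
```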

The display control unit 24 causes the display device 44 to display the character string DB of each related word RW identified by the speech recognition unit 16. FIG. 3 is a schematic diagram illustrating the screen (hereinafter "search result display screen") 442 shown on the display device 44. As illustrated, the search result display screen 442 demarcates a region R (R1 to R3) for each speaker the speaker identification unit 22 distinguished; each speaker's identification code I (I1 to I3) is placed near the corresponding region R, and times measured from the start of the audio signal S (0:00) are laid out at equal intervals from top to bottom.

The display control unit 24 places the character string DB of each related word RW identified by the speech recognition unit 16 inside the region R of the speaker who uttered it; the character strings DB are thus displayed separated by speaker. More precisely, among the utterance sections identified by the times T2 from the speaker identification unit 22, the display control unit 24 determines the section containing each related word's time T1 (i.e., the section in which the related word RW was uttered) and places the word's character string DB at the position corresponding to T1 within that section's speaker's region R. FIG. 3, for example, illustrates the case in which the related word RW "削減" (reduction), identified for the designated word KW "コスト" (cost), was uttered by the speaker with identification code I1 within the interval from time "0:30" to "1:00". The display control unit 24 displays the character strings DB of the related words RW in a different color for each region R (each speaker).

The display control unit 24 also calculates a reliability A0 for each related word RW identified by the speech recognition unit 16: the sum (or a weighted sum) of the reliability A1 the speech recognition unit 16 determined for the word and the reliability A2 the speaker identification unit 22 determined for the utterance section containing the word's time T1. It then has the display device 44 render the character string DB of each related word RW in a manner reflecting that word's reliability A0; for example, at a size corresponding to A0. In the illustration of FIG. 3, the reliability A0 of the related word RW "価格" (price) uttered by the speaker with identification code I2 within the interval from time "1:00" to "1:15" is higher than that of the same word uttered by the speakers with identification codes I2 and I3 within the interval from "0:15" to "0:30" (hence its character string DB is rendered larger).
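One way to realize the size-based rendering; the weighted-sum variant is used here, and the point-size constants are assumptions for illustration only:

```python
def render_size(a1: float, a2: float,
                w1: float = 1.0, w2: float = 1.0,
                base_pt: int = 10, scale_pt: float = 20.0) -> int:
    """Map the combined reliability A0 = w1*A1 + w2*A2 to a font size
    in points for displaying the character string DB."""
    a0 = w1 * a1 + w2 * a2
    return round(base_pt + scale_pt * a0)
```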

The user can designate, via the input device 42, any related word RW placed on the search result display screen 442. The reproduction control unit 26 of FIG. 1 then sequentially outputs to the sound emitting device 46 the section of the stored audio signal S from the point corresponding to that word's time T1 onward (the portion corresponding to the related word RW), so the audio containing the related word RW is emitted by the sound emitting device 46. The playback start point set by the reproduction control unit 26 may also be a point a predetermined length of time before the related word's time T1.
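The lead-in playback variant amounts to a clamped offset; the 2-second default below is an assumed value, since the patent says only "a predetermined length of time":

```python
def playback_start(t1: float, lead_in: float = 2.0) -> float:
    """Start playback a fixed lead-in before the word's onset time T1
    (seconds), clamped to the beginning of the recording."""
    return max(0.0, t1 - lead_in)
```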

As explained above, in this embodiment the appearance probability P of each related word RW is raised above those of the other words, so the chance of misrecognizing the audio signal S is lower than in a configuration that feeds the initial appearance probabilities P straight into recognition. Furthermore, because a plurality of related words RW corresponding to the designated word KW are specified, a broad range of words (related words RW) reflecting the user's intent is retrieved efficiently, compared with a configuration that searches the audio signal S for the designated word KW alone.

In addition, since the audio signal S is partitioned by speaker and the character strings DB of the related words RW are displayed separated by speaker, the user can easily tell who uttered each related word RW. Since each character string DB is placed within a region R at the position corresponding to its time T1, the user can also grasp at a glance when each speaker uttered a related word RW and in what order the speakers spoke. And because each related word's character string DB is rendered in a manner reflecting its recognition reliability A0 (A1, A2), the user can intuitively judge the reliability A0 of each related word RW, enabling efficient use such as playing back the related words RW in descending order of reliability A0.

<Modifications>
Various modifications can be made to the embodiment above; specific examples follow. Two or more of the following aspects may be freely selected and combined.

(1) Modification 1
The embodiment above illustrated a configuration in which the speech recognition unit 16 contains the selection unit 162, but a configuration in which the display control unit 24 contains it can also be adopted. For example, the speech recognition unit 16 outputs to the display control unit 24 the character string DB, time T1, and reliability A1 of every word it recognized from the audio signal S (related words RW and others alike), and the selection unit 162 of the display control unit 24 selects the related words RW from the words so reported and has the display device 44 show their character strings DB. Under the configuration of FIG. 1, however, the related words RW are extracted from the recognition result before the character string DB, time T1, and reliability A1 are determined, so those need be determined only for the related words RW, which has the advantage of reducing the processing load.

(2) Modification 2
For convenience, the probability adjusting unit 14 was illustrated as an element separate from the speech recognition unit 16, but its function may be built into the speech recognition unit 16. For example, when the words are selected one after another and the evaluation value SC of each is computed, the speech recognition unit 16 raises a word's appearance probability P before computing SC if the word is a related word RW, and computes SC with the appearance probability P kept at its initial value otherwise.

(3) Modification 3
Selectively outputting (retrieving) the related words RW from among the words the speech recognition unit 16 recognized is not essential to the present invention. For example, a configuration in which the display control unit 24 sequentially outputs via the display device 44 the character string DB of every word the speech recognition unit 16 identified from the audio signal S (related words RW and others alike) can also be adopted. Since the appearance probability P of each related word RW corresponding to the designated word KW is raised from its initial value, the intended effect of recognizing the related words RW efficiently while suppressing misrecognition is still obtained even when the character strings DB of all identified words are output. The selection unit 162 (the element that retrieves the related words RW from the audio signal S) may thus be omitted where appropriate. The speaker identification unit 22, which partitions the audio signal S by speaker, may also be omitted; in that case, the character strings DB of the words recognized by the speech recognition unit 16 are displayed in time series, without distinction by speaker.

The reproduction control unit 26 may likewise be omitted. With the configuration of FIG. 1 including it, however, the user can easily check the audio at a desired point (the point in the audio signal S at which a particular speaker uttered a particular word). Note that the speaker identification unit 22 in the embodiment above merely tells speakers apart (it does not establish who each speaker is), so the user cannot determine from the search result display screen 442 alone whose region each region R is; with the reproduction control unit 26 present, though, the user can pin down the speaker of the related words RW in each region R by listening to the played-back audio signal S.

(4) Modification 4
The embodiment above illustrated the speaker identification unit 22 merely partitioning the audio signal S by speaker, but a configuration in which it goes on to identify each speaker is also suitable. For example, a model of the features extracted from each person's speech (for example, a Gaussian mixture model) and the person's name are stored in the storage device 30 in advance, speaker by speaker. The speaker identification unit 22 determines the speaker's name for each utterance section of the audio signal S by comparing the features extracted from the signal against the stored feature models, and displays the name near the corresponding region R of the search result display screen 442. With this configuration the user can identify the speaker of each related word RW without listening to the played-back audio signal S.

(5) Modification 5
In the embodiment above the probability adjusting unit 14 raised the appearance probability P of each related word RW, but a configuration in which it instead lowers the appearance probabilities P of the words other than the related words RW (keeping those of the related words RW at their initial values) can also be adopted. Since the words other than the related words RW vastly outnumber the related words RW, however, adjusting the probabilities of the related words RW has the advantage of a lighter processing load for the probability adjusting unit 14 than adjusting those of all the other words.

(6) Modification 6
Calculating the reliabilities A0 (A1, A2) is not essential to the present invention, so the variable control of how each related word's character string DB is rendered may be omitted. Configurations in which the display control unit 24 controls the rendering of the character strings DB based on the reliability A1 alone (omitting the calculation of A2), or on the reliability A2 alone (omitting the calculation of A1), are also possible.

FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram schematically showing the structure of the recognition dictionary.
FIG. 3 is a conceptual diagram showing the contents of the search result display screen.

Explanation of symbols

100: speech recognition apparatus; 10: control device; 12: related word specifying unit; 14: probability adjusting unit; 16: speech recognition unit; 162: selection unit; 22: speaker identification unit; 24: display control unit; 26: reproduction control unit; 30: storage device; 42: input device; 44: display device; 46: sound emitting device; KW: designated word; RW: related word; S: audio signal; C: co-occurrence database; D: recognition dictionary; DA: phoneme string; DB: character string; P: appearance probability; G: phoneme model group.

Claims (4)

1. A speech recognition apparatus comprising:
related word specifying means that specifies a plurality of related words for a word designated by a user;
storage means that stores an appearance probability for each of a plurality of words;
probability adjusting means that raises the appearance probability of each related word specified by the related word specifying means relative to the appearance probabilities of words other than the related words;
speech recognition means that identifies words corresponding to the speech represented by an audio signal based on the appearance probabilities as adjusted by the probability adjusting means;
selection means that selects the related words from among the words identified by the speech recognition means;
speaker identification means that divides the audio signal into utterance sections, one per speaker; and
display control means that, by determining in which of the utterance sections delimited by the speaker identification means each related word selected by the selection means was uttered, displays a character string of each related word on a display device, organized by speaker,
wherein the speech recognition means calculates, for each related word selected by the selection means, a reliability of the identification result produced by the speech recognition means,
the speaker identification means calculates, for each utterance section, a reliability of the segmentation result produced by the speaker identification means, and
the display control means displays the character string of each related word on the display device in a manner reflecting the reliability the speech recognition means calculated for that related word and the reliability the speaker identification means calculated for the utterance section in which that related word was uttered.

2. The speech recognition apparatus according to claim 1, further comprising reproduction control means that, when the user designates a related word displayed on the display device, outputs from a sound emitting device the portion of the audio signal corresponding to that related word.

3. The speech recognition apparatus according to claim 2, wherein the reproduction control means outputs from the sound emitting device the portion of the audio signal starting at a point a predetermined length of time before the time of the user-designated related word.

4. A program that causes a computer comprising storage means storing an appearance probability for each of a plurality of words to execute:
related word specifying processing that specifies a plurality of related words for a word designated by a user;
probability adjustment processing that raises the appearance probability of each related word specified by the related word specifying processing relative to the appearance probabilities of words other than the related words;
speech recognition processing that identifies words corresponding to the speech represented by an audio signal based on the appearance probabilities after the probability adjustment processing;
selection processing that selects the related words from among the words identified by the speech recognition processing;
speaker identification processing that divides the audio signal into utterance sections, one per speaker; and
display control processing that, by determining in which of the utterance sections delimited by the speaker identification processing each related word selected by the selection processing was uttered, displays a character string of each related word on a display device, organized by speaker,
wherein the speech recognition processing calculates, for each related word selected in the selection processing, a reliability of the identification result produced by the speech recognition processing,
the speaker identification processing calculates, for each utterance section, a reliability of the segmentation result produced by the speaker identification processing, and
the display control processing displays the character string of each related word on the display device in a manner reflecting the reliability calculated for that related word in the speech recognition processing and the reliability calculated in the speaker identification processing for the utterance section in which that related word was uttered.
JP2007186184A 2007-07-17 2007-07-17 Speech recognition apparatus and program Expired - Fee Related JP5196114B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007186184A JP5196114B2 (en) 2007-07-17 2007-07-17 Speech recognition apparatus and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2007186184A JP5196114B2 (en) 2007-07-17 2007-07-17 Speech recognition apparatus and program

Publications (2)

Publication Number Publication Date
JP2009025411A (en) 2009-02-05
JP5196114B2 (en) 2013-05-15

Family

ID=40397288

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2007186184A Expired - Fee Related JP5196114B2 (en) 2007-07-17 2007-07-17 Speech recognition apparatus and program

Country Status (1)

Country Link
JP (1) JP5196114B2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5206553B2 (en) * 2009-03-31 2013-06-12 日本電気株式会社 Browsing system, method, and program
JP6556575B2 (en) 2015-09-15 2019-08-07 株式会社東芝 Audio processing apparatus, audio processing method, and audio processing program
JP2021120786A (en) * 2020-01-30 2021-08-19 Tis株式会社 Information processing device, information processing method, and information processing program
JP6953597B1 (en) * 2020-09-17 2021-10-27 ベルフェイス株式会社 Information processing equipment, programs and information processing methods

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004030623A (en) * 1993-02-04 2004-01-29 Matsushita Electric Ind Co Ltd Work state management device
JP2001290496A (en) * 2000-04-07 2001-10-19 Ricoh Co Ltd Speech retrieval device, speech retrieval method and recording medium
JP2001325250A (en) * 2000-05-15 2001-11-22 Ricoh Co Ltd Minutes preparation device, minutes preparation method and recording medium
JP3927800B2 (en) * 2001-12-04 2007-06-13 キヤノン株式会社 Voice recognition apparatus and method, program, and storage medium
JP2005025571A (en) * 2003-07-03 2005-01-27 Ns Solutions Corp Business support device, business support method, and its program
JP4558308B2 (en) * 2003-12-03 2010-10-06 ニュアンス コミュニケーションズ,インコーポレイテッド Voice recognition system, data processing apparatus, data processing method thereof, and program
JP3955880B2 (en) * 2004-11-30 2007-08-08 松下電器産業株式会社 Voice recognition device
JP2007017839A (en) * 2005-07-11 2007-01-25 Nissan Motor Co Ltd Speech recognition device
JP2007171809A (en) * 2005-12-26 2007-07-05 Canon Inc Information processor and information processing method
JP2007178927A (en) * 2005-12-28 2007-07-12 Canon Inc Information retrieving device and method

Also Published As

Publication number Publication date
JP2009025411A (en) 2009-02-05

Similar Documents

Publication Publication Date Title
US11887590B2 (en) Voice enablement and disablement of speech processing functionality
US20220156039A1 (en) Voice Control of Computing Devices
US10884701B2 (en) Voice enabling applications
US10056078B1 (en) Output of content based on speech-based searching and browsing requests
US20230317074A1 (en) Contextual voice user interface
US11823678B2 (en) Proactive command framework
JP6550068B2 (en) Pronunciation prediction in speech recognition
US8380505B2 (en) System for recognizing speech for searching a database
JP2005157494A (en) Conversation control apparatus and conversation control method
JP2008097082A (en) Voice interaction apparatus
JP2001083987A (en) Mark insertion device and its method
US20130289987A1 (en) Negative Example (Anti-Word) Based Performance Improvement For Speech Recognition
JP5753769B2 (en) Voice data retrieval system and program therefor
JP2008046538A (en) System supporting text-to-speech synthesis
US8566091B2 (en) Speech recognition system
US20050187767A1 (en) Dynamic N-best algorithm to reduce speech recognition errors
JP2005227686A (en) Speech recognizer, speech recognition program and recording medium
JP5196114B2 (en) Speech recognition apparatus and program
KR100467590B1 (en) Apparatus and method for updating a lexicon
US11935533B1 (en) Content-related actions based on context
WO2019113516A1 (en) Voice control of computing devices
US11551666B1 (en) Natural language processing
US11328713B1 (en) On-device contextual understanding
JP2001109491A (en) Continuous voice recognition device and continuous voice recognition method
JP3841342B2 (en) Speech recognition apparatus and speech recognition program

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20100520

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20110720

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20110816

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20111014

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20120501

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20120627

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20130109

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20130122

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20160215

Year of fee payment: 3

R150 Certificate of patent or registration of utility model

Ref document number: 5196114

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150


LAPS Cancellation because of no payment of annual fees