JP7501054B2

JP7501054B2 - Voice recognition device and voice recognition program

Info

Publication number: JP7501054B2
Application number: JP2020063732A
Authority: JP
Inventors: 征二松本
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2024-06-18
Anticipated expiration: 2040-03-31
Also published as: JP2021162697A

Description

The present invention relates to a voice recognition device and the like.

Speech from customers and visitors collected at exhibitions, showrooms, stores, and similar venues is converted to text by speech recognition, so that the content of what they said can be analyzed and used as marketing data. Speech recognition, however, has the following problems. At an exhibition venue many kinds of sound are mixed together, such as presenters' voices, environmental sounds, noise, and announcements, which makes it difficult to build an environment in which visitors' speech can be collected at good quality. Furthermore, when speech collected at the venue is recognized afterwards, appropriate recognition results cannot be obtained unless the audio is suitable for speech recognition. If the quality of the collected audio is below a certain level, even people have difficulty understanding what was said after quality improvement is attempted in post-processing, and high accuracy cannot be expected from computer-based speech recognition.

To address this situation, Patent Document 1 proposes a speech recognition device that records background sound near the location of the client device performing the sound collection, generates a noise model from the recorded background sound, applies noise reduction to the audio file from the client device based on the generated noise model, and performs speech recognition on the noise-reduced audio file to obtain the recognized text.

Patent Document 1: JP 2015-135494 A

However, when the sound collection environment changes significantly, for example at an exhibition venue where crowding varies over time, it is difficult with conventional technology to keep collecting sound under optimal conditions from the start to the end of sound collection. The present invention has been made in view of this situation. Its object is to provide a voice recognition device and the like that changes the operating parameters of the microphone and of the filter processing so that the collected audio remains suitable for speech recognition even when the sound collection environment changes significantly.

A voice recognition device according to one aspect of the present application comprises: an acquisition unit that has a noise reduction filter and acquires audio containing noise and a test voice corresponding to a test text; a conversion unit that recognizes the acquired test voice and converts it into an utterance text; a calculation unit that compares the utterance text with correct-answer data for the test text and calculates a recognition rate; a change unit that changes operating parameters of the acquisition unit, including parameters of the noise reduction filter; a repetition control unit that repeatedly operates the acquisition unit, the conversion unit, the calculation unit, and the change unit until the recognition rate satisfies a predetermined convergence condition; and a determination unit that determines the operating parameters giving a high recognition rate based on a plurality of recognition rates calculated using the changed operating parameters. The repetition control unit instructs the change unit to change the operating parameter by a first step size and repeatedly operates the acquisition unit, the conversion unit, the calculation unit, and the change unit to search for a first optimal value of the operating parameter at which the recognition rate reaches a maximum value. After that search ends, the repetition control unit instructs the change unit to change the operating parameter by a second step size smaller than the first step size and repeatedly operates the acquisition unit, the conversion unit, the calculation unit, and the change unit to search for a second optimal value of the operating parameter at which the recognition rate is maximized at a value equal to or greater than that maximum value.

According to one aspect of the present application, audio suitable for speech recognition can be collected even if the sound collection environment changes significantly, by changing the operating parameters of the microphone and of the filter processing.

FIG. 1 is a block diagram showing an example of the hardware configuration of the voice recognition device.
FIG. 2 is an explanatory diagram showing an example of the test voice DB.
FIG. 3 is an explanatory diagram showing an example of the test text DB.
FIG. 4 is an explanatory diagram showing an example of the parameter DB.
FIG. 5 is an explanatory diagram showing an example of the voice DB.
FIG. 6 is an explanatory diagram showing an example of the recognized text DB.
FIG. 7 is a flowchart showing an example of the procedure of the sound collection and recognition process.
FIG. 8 is a flowchart showing an example of the procedure of the operating parameter setting process.
FIG. 9 is a flowchart showing an example of the procedure of the evaluation process.
FIG. 10 is an explanatory diagram showing an operation example.
FIG. 11 is an explanatory diagram showing an operation example.
FIG. 12 is an explanatory diagram showing an example of the environment DB.
FIG. 13 is a block diagram showing an example of the functional units of the voice recognition device.

Embodiments are described below with reference to the drawings. FIG. 1 is a block diagram showing an example of the hardware configuration of the voice recognition device. The voice recognition device 1 is a notebook PC (Personal Computer), a tablet computer, a smartphone, a smart speaker, or the like. The voice recognition device 1 includes a control unit 11, a main storage unit 12, an auxiliary storage unit 13, a voice input unit 14, a voice output unit 15, a communication unit 16, and a reading unit 17, which are connected by a bus B. The voice recognition device 1 may also be configured as a multicomputer consisting of multiple computers, as a virtual machine constructed in software, or using a quantum computer. Furthermore, all or part of the functions of the voice recognition device 1 may be realized by a cloud service.

The control unit 11 has one or more arithmetic processing devices such as a CPU (Central Processing Unit), an MPU (Micro-Processing Unit), or a GPU (Graphics Processing Unit). By reading and executing a control program 1P (voice recognition program) stored in the auxiliary storage unit 13, the control unit 11 performs the various information processing and control processing of the voice recognition device 1 and realizes functional units such as the acquisition unit, the conversion unit, the calculation unit, the change unit, and the repetition control unit.

The main storage unit 12 is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), a flash memory, or the like. It mainly holds, temporarily, the data the control unit 11 needs in order to execute arithmetic processing.

The auxiliary storage unit 13 is a hard disk, an SSD (Solid State Drive), or the like, and stores the control program 1P and the various DBs (databases) that the control unit 11 needs in order to execute processing. The auxiliary storage unit 13 stores a test voice DB 131, a test text DB 132, a parameter DB 133, a voice DB 134, and a recognized text DB 135. The auxiliary storage unit 13 may be an external storage device connected to the voice recognition device 1. The DBs stored in the auxiliary storage unit 13 may instead be stored on a database server or in cloud storage separate from the voice recognition device 1.

The voice input unit 14 is a microphone device that collects sound. It includes an audio processing circuit that implements a noise reduction function and the like. The voice output unit 15 is a speaker device that outputs sound under the control of the control unit 11. There may be more than one voice input unit 14 and more than one voice output unit 15.

The communication unit 16 communicates with other computers via a network N. Under the control of the control unit 11, the communication unit 16 may download the control program 1P from another computer via the network N or the like. The reading unit 17 reads a portable storage medium 1a, such as a CD (Compact Disc)-ROM or a DVD (Digital Versatile Disc)-ROM. The control unit 11 may read the control program 1P from the portable storage medium 1a via the reading unit 17 and store it in the auxiliary storage unit 13. The control unit 11 may also read the control program 1P from a semiconductor memory 1b.

The voice recognition device 1 recognizes the audio collected by the voice input unit 14 using an external speech recognition engine provided by another computer or a cloud service. When no external speech recognition engine is used, the voice recognition device 1 performs feature extraction, phoneme analysis, analysis, and syntactic analysis on the collected audio and converts it into a kana character string. The voice recognition device 1 then converts the kana character string into a character string of mixed kanji and kana. The auxiliary storage unit 13 stores the phonetic dictionary, syntactic dictionary, word dictionary, kana-kanji conversion dictionary, and so on (none of which are shown) that are used in these analyses.

Next, the databases used by the voice recognition device 1 are described. FIG. 2 is an explanatory diagram showing an example of the test voice DB. The test voice DB 131 stores the test voices used when adjusting the operating parameters of the voice input unit 14. The test voice DB 131 includes a VoID column, a test voice column, and a TxtID column. The VoID column stores a VoID that uniquely identifies a test voice. The test voice column stores the test voice in binary form. Alternatively, the test voice may be stored as an audio file in the auxiliary storage unit 13, with the file name of that audio file stored in the test voice column. The data format of the test voice is MP3, WAV, WMA, or the like. The TxtID column stores the TxtID, the primary key of the test text DB 132, which holds the text corresponding to the test voice.

FIG. 3 is an explanatory diagram showing an example of the test text DB. The test text DB 132 stores the text transcription of each test voice. The test text DB 132 includes a TxtID column and a text column. The TxtID column stores a TxtID that uniquely identifies the text. The text column stores the text. In the example of FIG. 3, the text is written in hiragana, with the separator "/" inserted between words.
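The relationship between the test voice DB and the test text DB can be sketched as two tables joined on TxtID. This is a minimal illustration using an in-memory SQLite database; the table names, column names, and the sample row are hypothetical and are not taken from the figures.

```python
import sqlite3

# Hypothetical schema modeled on the VoID / test voice / TxtID columns described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test_text (
    txt_id TEXT PRIMARY KEY,   -- TxtID
    text   TEXT NOT NULL       -- hiragana, words separated by '/'
);
CREATE TABLE test_voice (
    vo_id  TEXT PRIMARY KEY,   -- VoID
    voice  BLOB NOT NULL,      -- binary audio (a file name would also work)
    txt_id TEXT NOT NULL REFERENCES test_text(txt_id)
);
""")

# Made-up sample record, for illustration only.
conn.execute("INSERT INTO test_text VALUES (?, ?)", ("T001", "きょう/は/よい/てんき/です"))
conn.execute("INSERT INTO test_voice VALUES (?, ?, ?)", ("V001", b"\x00\x01", "T001"))


def reference_words(vo_id):
    """Return the correct-answer word list for a test voice, or [] if unknown."""
    row = conn.execute(
        "SELECT t.text FROM test_voice v "
        "JOIN test_text t ON v.txt_id = t.txt_id WHERE v.vo_id = ?",
        (vo_id,),
    ).fetchone()
    return row[0].split("/") if row else []
```

Keeping the reference text in its own table lets several recordings of the same sentence share one correct-answer record.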

FIG. 4 is an explanatory diagram showing an example of the parameter DB. The parameter DB 133 stores values of the operating parameters that have actually been used in the voice input unit 14. The parameter DB 133 includes a PID column, a MicNo. column, a Power (voice/environment) dB column, a noise reduction (dB/sensitivity/band) column, a Mic (angle/range) column, a filter 1 column, and a recognition rate column. The PID column stores a PID that uniquely identifies a set of operating parameters. The MicNo. column stores a MicNo. that uniquely identifies the voice input unit 14. The Power (voice/environment) dB column stores the power of the voice to be collected (human speech) and the power of the environmental sound. The noise reduction (dB/sensitivity/band) column stores the amount by which noise is attenuated (dB), the power value above which sound is regarded as noise (sensitivity), and the width of the frequency range smoothed on either side of the noise frequency (band). The Mic (angle/range) column stores the directivity setting, namely the central angle of the directivity (angle) and the angular range over which sound is collected (range). The filter 1 column stores the coefficients and weights of the first filter. The filter (noise reduction filter) is, for example, a low-pass filter, a high-pass filter, a band-pass filter, a band elimination filter (BEF), also called a band reject filter (BRF), a finite impulse response (FIR) filter, or an infinite impulse response (IIR) digital filter. More than one filter may be used, in which case the parameter DB is given as many filter columns as there are filters. The recognition rate column stores the recognition rate obtained when, with the operating parameters of that record set, a test voice is output from the voice output unit 15, collected by the voice input unit 14, and recognized.
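One way to picture a parameter DB record is as a small data class with one field per stored value. The sketch below is only an illustration: the field names, the units in the comments (Hz, degrees), and the helper `best_record` are assumptions, not part of the described device.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OperatingParameters:
    """Hypothetical in-memory form of one parameter DB record."""
    pid: str                      # PID
    mic_no: int                   # MicNo.
    voice_power_db: float         # Power (voice)
    ambient_power_db: float       # Power (environment)
    nr_attenuation_db: float      # noise reduction: attenuation amount
    nr_sensitivity: float         # noise reduction: power regarded as noise
    nr_band: float                # noise reduction: smoothing width (Hz assumed)
    mic_angle: float              # directivity centre angle (degrees assumed)
    mic_range: float              # collection angle range (degrees assumed)
    filters: List[List[float]] = field(default_factory=list)  # one coefficient list per filter
    recognition_rate: Optional[float] = None  # filled in after evaluation


def best_record(records):
    """Return the evaluated record with the highest recognition rate, or None."""
    evaluated = [r for r in records if r.recognition_rate is not None]
    return max(evaluated, key=lambda r: r.recognition_rate) if evaluated else None
```

Storing the filters as a list of coefficient lists mirrors the statement that one column is added per filter in use.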

FIG. 5 is an explanatory diagram showing an example of the voice DB. The voice DB 134 stores collected audio. The voice DB 134 includes a collection ID column, a date column, a start/end time column, a PID column, and a voice column. The collection ID column stores a collection ID that uniquely identifies the collected audio. The date column stores the date of collection. The start/end time column stores the time at which collection started and the time at which it ended. The PID column stores the PID of the operating parameters used during collection. The voice column stores the collected audio in binary form. The audio data formats are the same as for the test voices described above.

FIG. 6 is an explanatory diagram showing an example of the recognized text DB. The recognized text DB 135 stores the text obtained by speech recognition of the collected audio. The recognized text DB 135 includes a collection ID column and a text column. The collection ID column stores the collection ID of the audio data corresponding to the text. The text column stores the text obtained by speech recognition.

Next, the processing performed by the voice recognition device 1 is described. FIG. 7 is a flowchart showing an example of the procedure of the sound collection and recognition process. This process recognizes the audio collected by the voice input unit 14 and stores the resulting character string of mixed kanji and kana (text) in the recognized text DB 135. The control unit 11 of the voice recognition device 1 sets the operating parameters of the voice input unit 14 (step S1). The control unit 11 collects sound via the voice input unit 14 (step S2) and stores the collected audio in the voice DB 134 (step S3). The control unit 11 performs speech recognition using the external speech recognition engine (step S4) and stores the recognized text returned by the engine in the recognized text DB 135 (step S5). The control unit 11 determines whether to end the process (step S6). If it determines not to end (NO in step S6), it determines whether to readjust the operating parameters (step S7). For example, the control unit 11 uses an interval timer or the like and decides to readjust each time a predetermined time elapses. The control unit 11 also decides to readjust when it receives an instruction input from the user. If the control unit 11 determines that readjustment is to be performed (YES in step S7), it returns the process to step S1. If it determines that the operating parameters are not to be readjusted (NO in step S7), it returns the process to step S2. If the control unit 11 determines that the process is to end (YES in step S6), it ends the process. If the speech recognition engine supports streaming, steps S2 to S5 may be executed in parallel.
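The control flow of steps S1 to S7 can be sketched as a loop with a timer-based readjustment check. This is a simplified illustration, not the actual control program 1P: every callable passed in (`tune`, `capture`, `recognize`, and so on) is a hypothetical stand-in, and only the interval-timer trigger is modeled, not the user-instruction trigger.

```python
import time


def run_capture_loop(tune, capture, recognize, store_audio, store_text,
                     should_stop, retune_interval_s=600.0, clock=time.monotonic):
    """Sketch of FIG. 7: tune, then capture/recognize until told to stop."""
    params = tune()                        # S1: set operating parameters
    last_tuned = clock()
    while True:
        audio = capture(params)            # S2: collect sound
        store_audio(audio, params)         # S3: save the audio
        text = recognize(audio)            # S4: speech recognition
        store_text(text)                   # S5: save the recognized text
        if should_stop():                  # S6: end?
            return
        if clock() - last_tuned >= retune_interval_s:   # S7: readjust?
            params = tune()                # back to S1
            last_tuned = clock()
```

Passing the clock in as a parameter is a design choice made here so that the readjustment interval can be exercised without waiting in real time.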

FIG. 8 is a flowchart showing an example of the procedure of the operating parameter setting process, which corresponds to step S1 in FIG. 7. The control unit 11 of the voice recognition device 1 sets the operating parameters of the voice input unit 14 to initial values (step S21). The initial values may be predetermined or entered by the user each time. When the operating parameters are being readjusted, the immediately preceding values may be used as the initial values. The control unit 11 sets the width by which an operating parameter is varied (the step size) to large (step S22). This step size (the first step size) is determined in advance for each operating parameter and stored in the auxiliary storage unit 13. The control unit 11 evaluates the operating parameters (step S23) and determines whether the search has converged (step S24). Each time it changes the operating parameters, the control unit 11 stores the evaluation value obtained in step S23 in a temporary storage area and judges convergence from the change in the evaluation values. The temporary storage area is provided in the main storage unit 12 or the auxiliary storage unit 13. The convergence condition is stored in the auxiliary storage unit 13 in advance; for example, the condition may be that the recognition rate has fallen five times in a row. If the control unit 11 determines that the search has not converged (NO in step S24), it changes the operating parameters (step S25) by increasing or decreasing the parameter value by the step size, and returns the process to step S23. If the control unit 11 determines that the search has converged (YES in step S24), it sets the step size to small (step S26). The value of the operating parameter at the point where convergence is determined is the first optimal value. As with the large step size, the small step size (the second step size) is determined in advance for each operating parameter and stored in the auxiliary storage unit 13. The control unit 11 evaluates the operating parameters (step S27) and determines whether the search has converged (step S28), using the same method as in step S24. If it determines that the search has not converged (NO in step S28), it changes the operating parameters (step S29) by increasing or decreasing the parameter value by the step size, and returns the process to step S27. If it determines that the search has converged (YES in step S28), it sets the operating parameters to the optimal value (the second optimal value) (step S30). The evaluation values obtained in step S27 are stored in association with the operating parameters, and the parameter value corresponding to the best stored evaluation value, for example the maximum, is taken as the optimal value. The control unit 11 then returns the process to the caller. If the search with the large step size (steps S23 to S25) already yields a high recognition rate (for example, 95%), the search with the small step size (steps S26 to S29) may be omitted.
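The two-stage search of FIG. 8 can be illustrated for a single numeric parameter as follows. This is a sketch under simplifying assumptions: the parameter is only stepped upward, the fine search window of one coarse step on either side of the first optimum is a choice made here, and `evaluate`, `patience`, `good_enough`, and `max_evals` are hypothetical names. `evaluate` stands in for the evaluation process and is assumed to return a recognition rate between 0 and 1.

```python
def scan(evaluate, start, step, patience=5, limit=None, max_evals=100):
    """Step upward from `start`; stop after `patience` consecutive drops in score."""
    history, prev, drops = [], None, 0
    for i in range(max_evals):
        value = start + i * step
        if limit is not None and value > limit:
            break
        score = evaluate(value)
        history.append((value, score))
        drops = drops + 1 if (prev is not None and score < prev) else 0
        if drops >= patience:          # convergence condition met
            break
        prev = score
    return history


def coarse_to_fine(evaluate, start, coarse_step, fine_step,
                   patience=5, good_enough=0.95):
    """Coarse search (S22-S25), then fine search near the first optimum (S26-S30)."""
    coarse = scan(evaluate, start, coarse_step, patience)
    best_value, best_score = max(coarse, key=lambda vs: vs[1])   # first optimal value
    if best_score >= good_enough:      # fine search may be omitted
        return best_value, best_score
    fine = scan(evaluate, best_value - coarse_step, fine_step, patience,
                limit=best_value + coarse_step)
    fine_value, fine_score = max(fine, key=lambda vs: vs[1])
    # The second optimum must be at least as good as the first.
    if fine_score >= best_score:
        return fine_value, fine_score
    return best_value, best_score
```

With several operating parameters, the same routine could be applied to one parameter at a time, although the text does not prescribe an order.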

FIG. 9 is a flowchart showing an example of the procedure of the evaluation process, which corresponds to steps S23 and S27 in FIG. 8. The control unit 11 of the voice recognition device 1 collects a test voice (step S41): it outputs a test voice stored in the test voice DB 131 from the voice output unit 15 and collects the output sound with the voice input unit 14. An utterance spoken by the user on the spot may be used as the test voice instead. The control unit 11 performs speech recognition (step S42). The control unit 11 compares the text obtained as the recognition result with the correct-answer text stored in the test text DB 132 (step S43). The comparison is made word by word. The control unit 11 obtains the number of words recognized correctly (number of correct words), the number of words recognized incorrectly (number of misrecognized words), and the number of words that were present in the test voice but not recognized at all (number of omitted words). From these three counts the control unit 11 calculates the recognition rate (step S44). The recognition rate is calculated using the following formula (1).

Recognition rate = number of correct words / (number of correct words + number of misrecognized words + number of omitted words) ... (1)
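Formula (1) can be tried out on slash-separated word strings like those in the test text DB. The sketch below uses Python's standard `difflib.SequenceMatcher` to align the two word lists; that alignment method is an assumption made for illustration, since the text does not say how words are matched, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def recognition_rate(reference, hypothesis, sep="/"):
    """Apply formula (1) to two word strings separated by `sep`."""
    ref, hyp = reference.split(sep), hypothesis.split(sep)
    correct = wrong = omitted = 0
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_ref, n_hyp = i2 - i1, j2 - j1
        if tag == "equal":
            correct += n_ref                   # correctly recognized words
        elif tag == "replace":
            wrong += min(n_ref, n_hyp)         # misrecognized words
            omitted += max(0, n_ref - n_hyp)   # reference words left unmatched
        elif tag == "delete":
            omitted += n_ref                   # words missing from the output
        # "insert": extra output words; formula (1) has no term for them
    total = correct + wrong + omitted
    return correct / total if total else 0.0
```

For a four-word reference recognized with one substitution and one missing word, the result is 2 / (2 + 1 + 1) = 0.5.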

The control unit 11 stores the recognition rate as the evaluation value in the temporary storage area (step S45) and returns the process to the caller. The operating parameter setting process shown in FIG. 8 amounts to solving an optimization problem, and any of various known methods for solving optimization problems can be adopted in this embodiment. For example, Bayesian optimization is used in this embodiment.

Next, an operation example of the operating parameter setting process is described. FIGS. 10 and 11 are explanatory diagrams showing the operation example. They list operating parameters together with the recognition rate obtained when those parameters were evaluated. The contents shown in FIGS. 10 and 11 are what the evaluation process stores in the temporary storage area. FIG. 10 shows the progress of the process with the large step size, and FIG. 11 shows the progress with the small step size. In both figures, records are added from top to bottom as the process advances.

In the example of FIG. 10, the recognition rate with the initial values of the operating parameters is 87.1%. In the next evaluation, after the operating parameters have been changed, the recognition rate is 87.9%. In the evaluations after that, the recognition rate tends to fall, so the process is terminated and the operating parameter values corresponding to the recognition rate of 87.9% become the provisional optimal values. The process is terminated, for example, when the recognition rate has fallen a predetermined number of times in a row.

The processing with the small step size is described with reference to FIG. 11. In the example of FIG. 11, the operating parameter values are varied around the provisional optimal values and evaluated. The maximum recognition rate is 93.2%; because the recognition rate tends to fall after that, the process is terminated and the operating parameters corresponding to 93.2% are determined to be the optimal values.

When the voice recognition device 1 has more than one voice input unit 14, the control unit 11 executes the operating parameter setting process (FIG. 8) and the evaluation process (FIG. 9) for each voice input unit 14 individually. When the voice input unit 14 has a microphone array, in which multiple physical microphones function as one logical microphone, the control unit 11 sets the operating parameters of the individual microphones in the array so that the recognition rate of the test voice collected by the array as a whole is maximized.

本実施の形態は以下の効果を奏する。音声認識装置１は所定条件が満たされていると判定すると、音声入力部１４の動作パラメータを更新する。動作パラメータの更新は、音声認識装置１の動作環境にて行なうので、収音環境が大きく変動しても、適切な値へ動作パラメータを変更することが可能となる。それにより、音声認識に適した音声が収音可能となる。また、収音した音声を記憶しておくので、認識したテキストに誤りがあった場合でも、記憶した音声を参考にユーザによるテキストの修正が可能となる。また、収音を開始する前に実環境で動作パラメータを設定するので、利用する音声認識エンジンが異なっても、同じ音声認識エンジンであるがバージョンアップにより動作特性が替わっていたとしても、適切な音声認識結果を得ることが可能となる。 The present embodiment has the following effects. When the voice recognition device 1 determines that a predetermined condition is satisfied, it updates the operating parameters of the voice input unit 14. The operating parameters are updated in the operating environment of the voice recognition device 1, so that even if the sound collection environment changes significantly, it is possible to change the operating parameters to appropriate values. This makes it possible to collect voice suitable for voice recognition. In addition, since the collected voice is stored, even if there is an error in the recognized text, the user can correct the text by referring to the stored voice. In addition, since the operating parameters are set in the actual environment before starting sound collection, it is possible to obtain appropriate voice recognition results even if a different voice recognition engine is used, or even if the same voice recognition engine has changed its operating characteristics due to a version upgrade.

（変形例） (Modification)
動作パラメータ設定処理において、処理を早く収束させるためには、動作パラメータの初期値を最適値に近いと推定される値に設定する。そこで、使用実績のある動作パラメータが使用された環境に関する情報（環境情報）を記憶しておく。動作パラメータ設定処理の初期値として、使用する環境と類似する環境に対応付けられた動作パラメータの値を設定する。図１２は環境情報ＤＢの例を示す説明図である。環境情報ＤＢ１３６は補助記憶部１３に記憶する。環境情報ＤＢ１３６はPID列、場面列、名称列、会場種別列、会場名称列及び雑音レベル列を含む。PID列は対応する動作パラメータを特定するPIDを記憶する。PIDはパラメータＤＢ１３３の主キーである。場面列は動作パラメータを使用された場面を記憶する。場面は例えば、大規模展示会、小規模展示会、常設展示である。大規模展示会は数十社の出展者が集まり展示場で行う展示会である。小規模展示会は数社の出展者が集まり貸しホールで行う展示会である。常設展示はショールーム等での展示を示す。名称列は展示会等の名称を記憶する。会場種別列は会場の種別を記憶する。会場種別は例えば展示場、貸しホール、ショールームである。会場名称列は会場の名称を記憶する。雑音レベル列は周囲雑音の音圧（dBSPL）を記憶する。単位はdBである。 To make the operation parameter setting process converge quickly, the initial values of the operation parameters are set to values estimated to be close to the optimal values. To this end, information on the environment in which previously used operation parameters were used (environment information) is stored. The values of the operation parameters associated with an environment similar to the environment of use are set as the initial values for the operation parameter setting process. Figure 12 is an explanatory diagram showing an example of the environment information DB. The environment information DB 136 is stored in the auxiliary storage unit 13. The environment information DB 136 includes a PID column, a scene column, a name column, a venue type column, a venue name column, and a noise level column. The PID column stores a PID that identifies the corresponding operation parameters. The PID is the primary key of the parameter DB 133. The scene column stores the scene in which the operation parameters were used. Scenes are, for example, a large-scale exhibition, a small-scale exhibition, and a permanent exhibition. A large-scale exhibition is an exhibition held at an exhibition hall where dozens of exhibitors gather. A small-scale exhibition is an exhibition held in a rental hall where several exhibitors gather. A permanent exhibition is an exhibition in a showroom or the like. The name column stores the name of the exhibition or the like. The venue type column stores the type of venue. The venue type is, for example, an exhibition hall, a rental hall, or a showroom. The venue name column stores the name of the venue. The noise level column stores the sound pressure level of the ambient noise (dB SPL). The unit is dB.

ユーザは動作パラメータの初期値を設定する際に、環境情報ＤＢ１３６を検索し、使用する環境と類似する環境を選択する。制御部１１は選択された環境情報に含まれるPIDをキーにパラメータＤＢ１３３を検索する。例えば、過去に収音を行った展示会で再度、収音する場合は、展示会の名称で、環境情報ＤＢ１３６を検索する。過去に収音を行った場面ではないが、収音の実績がある会場で再度、収音する場合は、会場名称で環境情報ＤＢ１３６を検索する。過去に収音を行った会場ではないが、類似する会場での実績がある場合は、会場種別で検索する。場面や会場が新規の場合、周囲雑音の音圧が似通った値のレコードを検索する。制御部１１は、検索にヒットしたレコードに含まれる動作パラメータを初期値として、動作パラメータ設定処理を行う。なお、初期値として用いる動作パラメータを検索ではなく、他の方法で選択してもよい。収音する環境の属性（場面、名称、会場種別、会場名称、雑音レベル等）と、環境情報ＤＢ１３６の名称列、会場種別列、会場名称列、雑音レベル列とをそれぞれ対照して類似度を算出し、収音する環境の属性と最も類似する値を持つレコードを特定し、特定したレコードの動作パラメータを初期値とする。 When setting the initial values of the operation parameters, the user searches the environment information DB 136 and selects an environment similar to the environment of use. The control unit 11 searches the parameter DB 133 using the PID included in the selected environment information as a key. For example, when sound is to be collected again at an exhibition where sound was collected in the past, the environment information DB 136 is searched by the exhibition name. When the scene is new but sound is to be collected again at a venue where sound has been collected before, the environment information DB 136 is searched by the venue name. When the venue is new but sound has been collected at a similar venue, the search is performed by venue type. When both the scene and the venue are new, records with a similar ambient noise sound pressure are searched for. The control unit 11 performs the operation parameter setting process using the operation parameters of the record found by the search as the initial values. The operation parameters to be used as initial values may also be selected by a method other than search. The attributes of the environment in which sound is to be collected (scene, name, venue type, venue name, noise level, etc.) are compared with the name column, venue type column, venue name column, and noise level column of the environment information DB 136 to calculate a similarity, the record whose values are most similar to the attributes of that environment is identified, and the operation parameters of the identified record are used as the initial values.
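
The similarity-based selection described above can be sketched as follows. This is only an illustration under assumptions: the record fields mirror the columns of the environment information DB 136, but the scoring rule (one point per exact match on name, venue type, and venue name, plus a noise-level closeness term scaled by a hypothetical `noise_scale`) is not given in the source, and all class and function names are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class EnvRecord:
    """One row of the environment information DB (columns as in Figure 12)."""
    pid: int            # key into the parameter DB
    scene: str          # e.g. "large-scale exhibition"
    name: str           # name of the exhibition or the like
    venue_type: str     # e.g. "exhibition hall", "rental hall", "showroom"
    venue_name: str
    noise_db: float     # ambient noise level in dB SPL


def similarity(target: EnvRecord, record: EnvRecord,
               noise_scale: float = 20.0) -> float:
    """Assumed scoring rule: exact-match points plus noise-level closeness."""
    score = 0.0
    score += 1.0 if target.name == record.name else 0.0
    score += 1.0 if target.venue_type == record.venue_type else 0.0
    score += 1.0 if target.venue_name == record.venue_name else 0.0
    # Closeness falls linearly from 1 (same level) to 0 (noise_scale dB apart).
    score += max(0.0, 1.0 - abs(target.noise_db - record.noise_db) / noise_scale)
    return score


def most_similar(target: EnvRecord, records: Iterable[EnvRecord]) -> EnvRecord:
    """Return the stored record most similar to the target environment."""
    return max(records, key=lambda r: similarity(target, r))
```

The `pid` of the returned record would then be used as the key into the parameter DB 133 to read the initial operation parameters.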

本変形例においては、過去の実績に基づき、動作パラメータの初期値を最適値に近いと推定される値に初期値を設定し、動作パラメータ設定処理を行なうことにより、処理が迅速に収束し、動作パラメータの最適値が定まると期待される。 In this modified example, the initial values of the operating parameters are set to values that are estimated to be close to the optimal values based on past performance, and the operating parameter setting process is then performed. It is expected that the process will converge quickly and the optimal values of the operating parameters will be determined.

図１３は音声認識装置が備える機能部の一例を示すブロック図である。音声認識装置１は、機能部として、取得部１１ａ、変換部１１ｂ、算出部１１ｃ、変更部１１ｄ、決定部１１ｅ及び繰り返し制御部１１ｆを備える。これらの各機能部は、制御部１１が制御プログラム１Ｐに基づいて動作することにより、実現される。 Figure 13 is a block diagram showing an example of functional units of a voice recognition device. The voice recognition device 1 includes, as functional units, an acquisition unit 11a, a conversion unit 11b, a calculation unit 11c, a change unit 11d, a determination unit 11e, and a repetition control unit 11f. Each of these functional units is realized by the control unit 11 operating based on a control program 1P.

取得部１１ａはテスト用テキストに対応するテスト音声及び雑音を含む音声を取得する。変換部１１ｂは取得部１１ａが取得したテスト音声を認識し、発話テキストに変換する。算出部１１ｃは発話テキストをテスト用テキストの正解データと比較し、認識率を算出する。変更部１１ｄは算出した認識率に基づいて、雑音除去フィルタのパラメータを含む取得部の動作パラメータを変更する。決定部１１ｅは変更動作パラメータを用いて算出した複数の認識率の結果に基づき、認識率の高い動作パラメータを決定する。繰り返し制御部１１ｆは認識率が所定の収束条件を満たすまで、前記取得部、前記変換部、前記算出部、前記変更部を繰り返し動作させる。 The acquisition unit 11a acquires test speech corresponding to the test text and speech including noise. The conversion unit 11b recognizes the test speech acquired by the acquisition unit 11a and converts it into spoken text. The calculation unit 11c compares the spoken text with the correct answer data of the test text and calculates a recognition rate. The change unit 11d changes the operation parameters of the acquisition unit, including the parameters of the noise reduction filter, based on the calculated recognition rate. The determination unit 11e determines the operation parameters with a high recognition rate based on the plurality of recognition rates calculated using the changed operation parameters. The repetition control unit 11f repeatedly operates the acquisition unit, the conversion unit, the calculation unit, and the change unit until the recognition rate satisfies a predetermined convergence condition.
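
As a rough illustration of what the calculation unit 11c might compute, the sketch below scores a spoken text against the correct answer data at the word level. The formula is an assumption, since this section does not define the recognition rate: it uses a word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words, and it assumes the text is already split into words by whitespace. Japanese text would first need a word segmenter.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def recognition_rate(reference: str, hypothesis: str) -> float:
    """Word-level recognition rate in percent (assumed definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference text must contain at least one word")
    errors = word_edit_distance(ref, hyp)
    # Clamp at zero: many insertions can push the error count above len(ref).
    return max(0.0, 1.0 - errors / len(ref)) * 100.0
```

A function of this shape could serve as the scoring step inside the loop driven by the repetition control unit 11f, with one score produced per candidate set of operation parameters.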

各実施の形態で記載されている技術的特徴（構成要件）はお互いに組み合わせ可能であり、組み合わせすることにより、新しい技術的特徴を形成することができる。 The technical features (constituent elements) described in each embodiment can be combined with each other, and by combining them, new technical features can be formed.
今回開示された実施の形態はすべての点で例示であって、制限的なものではないと考えられるべきである。本発明の範囲は、上記した意味ではなく、特許請求の範囲によって示され、特許請求の範囲と均等の意味及び範囲内でのすべての変更が含まれることが意図される。 The embodiments disclosed herein are illustrative in all respects and should not be considered limiting. The scope of the present invention is defined by the claims, not by the foregoing description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

１音声認識装置
１１制御部
１１ａ取得部
１１ｂ変換部
１１ｃ算出部
１１ｄ変更部
１１ｅ決定部
１１ｆ繰り返し制御部
１２主記憶部
１３補助記憶部
１３１テスト音声ＤＢ
１３２テストテキストＤＢ
１３３パラメータＤＢ
１３４音声ＤＢ
１３５認識テキストＤＢ
１３６環境情報ＤＢ
１４音声入力部
１５音声出力部
１６通信部
１７読み取り部
１Ｐ制御プログラム
１ａ可搬型記憶媒体
１ｂ半導体メモリ

REFERENCE SIGNS LIST
1 Voice recognition device
11 Control unit
11a Acquisition unit
11b Conversion unit
11c Calculation unit
11d Change unit
11e Determination unit
11f Repetition control unit
12 Main storage unit
13 Auxiliary storage unit
131 Test voice DB
132 Test text DB
133 Parameter DB
134 Voice DB
135 Recognition text DB
136 Environment information DB
14 Voice input unit
15 Voice output unit
16 Communication unit
17 Reading unit
1P Control program
1a Portable storage medium
1b Semiconductor memory

Claims

1. A speech recognition device comprising:
an acquisition unit that has a noise reduction filter and acquires test speech corresponding to a test text and speech including noise;
a conversion unit that recognizes the acquired test speech and converts it into spoken text;
a calculation unit that compares the spoken text with correct answer data of the test text and calculates a recognition rate;
a change unit that changes operation parameters of the acquisition unit, including parameters of the noise reduction filter;
a repetition control unit that repeatedly operates the acquisition unit, the conversion unit, the calculation unit, and the change unit until the recognition rate satisfies a predetermined convergence condition; and
a determination unit that determines an operation parameter having a high recognition rate based on a plurality of recognition rate results calculated using the changed operation parameters,
wherein the repetition control unit instructs the change unit to change the operation parameter by a first step size,
the repetition control unit repeatedly operates the acquisition unit, the conversion unit, the calculation unit, and the change unit to search for a first optimum value of the operation parameter that maximizes the recognition rate,
after the search is completed, the repetition control unit instructs the change unit to change the operation parameter by a second step size smaller than the first step size, and
the repetition control unit repeatedly operates the acquisition unit, the conversion unit, the calculation unit, and the change unit to search for a second optimum value of the operation parameter at which the recognition rate is maximized at a value equal to or greater than the maximum value.

2. The speech recognition device according to claim 1, wherein the recognition rate is calculated based on a result of comparison between words constituting the spoken text and words constituting correct answer data.

3. The speech recognition device according to claim 1, wherein a Bayesian optimization method is used to search for the first optimum value and the second optimum value.

4. The speech recognition device according to claim 1, further comprising a storage unit that stores environment information regarding an environment in which the acquisition unit acquired the speech, in association with the second optimum value,
wherein the repetition control unit acquires the environment information, reads the second optimum value associated with the acquired environment information from the storage unit, and controls the change unit to set the read second optimum value as an initial value of the operation parameter.

5. A speech recognition program for causing a computer, which includes an acquisition unit having a noise reduction filter and configured to acquire speech, to execute:
an acquisition process of acquiring, by the acquisition unit, speech corresponding to a test text and speech including noise;
a conversion process of recognizing the acquired speech and converting it into spoken text;
a calculation process of comparing the spoken text with correct answer data of the test text and calculating a recognition rate;
a change process of changing operation parameters of the acquisition unit, including parameters of the noise reduction filter; and
a determination process of determining an operation parameter having a high recognition rate based on a plurality of recognition rate results calculated using the changed operation parameters,
wherein the acquisition process, the conversion process, the calculation process, and the change process of changing the operation parameter by a first step size are repeatedly executed until the recognition rate satisfies a predetermined convergence condition, thereby searching for a first optimum value of the operation parameter that maximizes the recognition rate, and
after the search is completed, the acquisition process, the conversion process, the calculation process, and the change process of changing the operation parameter by a second step size smaller than the first step size are repeatedly executed until the recognition rate satisfies a predetermined convergence condition, thereby searching for a second optimum value of the operation parameter at which the recognition rate is maximized at a value equal to or greater than the maximum value.