JP7002822B2 - Voice analysis system and voice analysis method

Info

Publication number: JP7002822B2
Application number: JP2018228620A
Authority: JP
Inventors: 信範工藤
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2018-12-06
Filing date: 2018-12-06
Publication date: 2022-01-20
Anticipated expiration: 2038-12-06
Also published as: JP2020091405A

Description

The present invention relates to a voice analysis system and a voice analysis method for analyzing input voice.

In recent years, devices called smart speakers have become widespread. A smart speaker is generally used as follows. The user first utters a specific word, called a wake word, to the smart speaker, and then utters a question or a phrase that makes a request (hereinafter referred to as a "request"). Voice data containing the wake word and the request is transmitted to a cloud server, which performs voice recognition on the request, interprets its content, and executes processing according to that content. For example, when the request is a question, the cloud server generates voice data for speaking the answer, transmits it to the smart speaker, and causes the smart speaker to output the answer as voice.

With regard to voice recognition, Cited Document 1 describes the following technique. When the volume level of voice data (sound data 4) exceeds a threshold, a voice portion is identified in the already recorded voice data by going back a predetermined time (time Tw2) from the moment the threshold was exceeded, and voice recognition is performed on that voice portion. According to the technique of Cited Document 1, the entire voice portion can be recognized without losing the leading part that is needed for voice recognition.

Japanese Unexamined Patent Publication No. 2009-122598

Certain companies provide voice recognition services as cloud services. Vendors of in-vehicle devices want their products to be able to use such a cloud service, so that the in-vehicle device also functions as a smart speaker and the product gains added value.

To use the cloud service, voice data containing the wake word and the request must be transmitted to the cloud server (although including the request is not always mandatory). Regarding this transmission, the company providing the cloud service may require that information explicitly indicating the start point and the end point of the voice corresponding to the wake word within the voice data also be transmitted. In that case, the start point and the end point must be detected.

Some in-vehicle devices perform triggerless voice recognition so that, when a user in the vehicle utters the voice corresponding to a command word, the device executes the processing corresponding to that command word. An example of a command word is "nearby convenience store"; in this case, as the corresponding processing, the in-vehicle device searches for convenience stores around the vehicle in which it is mounted and presents the search results. In triggerless voice recognition, the voice pattern of each command word is registered in advance in a voice recognition dictionary, a similarity is calculated from a comparison between the waveform of the input voice and the registered voice pattern, and when this similarity exceeds a threshold, it is detected that the voice corresponding to the command word has been uttered.

With triggerless voice recognition, registering the voice pattern of the wake word in the voice recognition dictionary makes it possible to detect with high reliability that the user has uttered the voice corresponding to the wake word. However, as described above, triggerless voice recognition does not identify which range of the input voice corresponds to the command word. It only calculates the similarity to a pre-registered voice pattern and detects, based on that similarity, that the voice corresponding to the command word has been uttered. Triggerless voice recognition therefore cannot detect the start point and the end point.

The present invention has been made to solve this problem, and its object is to enable the start point and the end point of a wake word to be detected with high accuracy.

To solve the above problem, the present invention recognizes a specific word by comparing the input voice with a pre-registered voice pattern of the specific word, and identifies a target frequency: the frequency corresponding to the frequency with the highest sound pressure level among the frequencies constituting the input voice that was input during a period contiguous with the timing at which the specific word was recognized. The present invention then detects the start point and the end point of the specific word in the input voice based on the transition of the sound pressure level at the target frequency in the period before and after the timing at which the specific word was recognized.

According to the present invention configured as described above, voice recognition is performed by comparing the input voice with the pre-registered voice pattern of the specific word (wake word), so when the user utters the specific word, this can be detected with high reliability. In addition, the start point and the end point are detected from characteristic changes in the sound pressure level within the period before and after the timing at which the specific word was recognized, in other words, within a period in which the specific word is very likely to have been uttered; in this respect the start point and the end point can be detected with high accuracy. In particular, the transition of the sound pressure level is analyzed at a target frequency, which is the frequency corresponding to the frequency with the highest sound pressure level among the frequencies constituting the voice input when the user utters the specific word (this includes not only the frequency with the highest level itself but also frequencies close to it). Because the analysis targets a frequency characteristic of the voice of the user who uttered the specific word, the transition of the sound pressure level can be analyzed with little noise-like influence from sound at other frequencies. In this respect as well, the start point and the end point of the specific word in the input voice can be detected with high accuracy.

FIG. 1 is a block diagram showing an example of the functional configuration of a voice analysis system according to an embodiment of the present invention.
FIG. 2 is a diagram used to explain the recognition target voice data.
FIG. 3 is a diagram showing an example of the transition of the distance value.
FIG. 4 is a diagram showing an example of the transition of the sound pressure level at the target frequency.
FIG. 5 is a flowchart showing an example of the operation of the in-vehicle device according to an embodiment of the present invention.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing an example of the functional configuration of the voice analysis system 1 according to the present embodiment. As shown in FIG. 1, the voice analysis system 1 includes an in-vehicle device 2. The in-vehicle device 2 can access a network N that includes communication networks such as the Internet and a telephone network. A service providing server 3 is connected to the network N.

The in-vehicle device 2 is a so-called car navigation system mounted on a vehicle. Its functions include detecting the current position of the vehicle and searching for, and providing guidance along, a route to a destination set by a user in the vehicle. A detailed description of these functions is omitted. The in-vehicle device 2 does not have to be fixed to the vehicle; for example, it may be a mobile terminal brought into the vehicle by the user. As shown in FIG. 1, a microphone 4 and a speaker 5 are connected to the in-vehicle device 2. Hereinafter, the vehicle on which the in-vehicle device 2 is mounted is referred to as the "own vehicle". The microphone 4 is a sound collecting device placed where it can pick up voice uttered by a user in the own vehicle. The speaker 5 is a sound emitting device that outputs sound into the cabin of the own vehicle.

The service providing server 3 is a cloud server that provides services related to voice recognition of voice collected by client terminals. Hereinafter, the service provided by the service providing server 3 is referred to as the "voice recognition service". Smart speakers, mobile terminals equipped with a virtual assistant, the in-vehicle device 2 according to the present embodiment, and the like function as client terminals of the service providing server 3. One of the services related to voice recognition is to recognize the voice collected by a client terminal, interpret its content, and execute processing corresponding to that content. As an example, when the user utters a question to a client terminal, the service providing server 3 recognizes the voice, interprets its content, generates an answer to the question, and causes the client terminal to output the answer as voice, thereby realizing a spoken dialogue between the user and the client terminal.

The in-vehicle device 2 according to the present embodiment functions as a client of the service providing server 3, and the user can use the voice recognition service through the in-vehicle device 2. To use the voice recognition service, the user utters a predetermined specific word called a wake word and then, following the wake word, utters a question or a phrase that makes a request (hereinafter referred to as a "request"). In response to the user's utterance of the wake word and the request, the in-vehicle device 2 generates processing request data and transmits it to the service providing server 3. The processing request data contains recognition target voice data and control information data.

The recognition target voice data is voice data in which the voice corresponding to the wake word uttered by the user (hereinafter, the "wake word voice") and the voice corresponding to the request uttered by the user (hereinafter, the "request voice") are recorded. The recognition target voice data must satisfy the functional requirements imposed for use of the voice recognition service.

FIG. 2 schematically shows the structure of the recognition target voice data in a form suited to explanation. In FIG. 2, the recognition target voice data is represented as a band-shaped object whose left end is the beginning of the data. In the present embodiment, the recognition target voice data is voice data sampled at a sampling frequency of 16 kHz. In FIG. 2, range H1 indicates the range of the wake word voice recorded in the recognition target voice data (the set of samples in which the waveform of the wake word voice is recorded), and range H2 indicates the range of the request voice recorded in the recognition target voice data.

As shown in FIG. 2, the functional requirements demand that the recognition target voice data include a pre-roll of 8000 samples (500 milliseconds) before the wake word voice. Accordingly, in the recognition target voice data, the 8000 samples (500 milliseconds) of audio immediately preceding the start of the wake word voice are recorded as the pre-roll, ahead of the range in which the wake word voice is recorded.

As shown in FIG. 2, in the recognition target voice data the wake word voice is recorded in range H1, which spans α samples from sample Q0 at the end of the pre-roll (Q0 = 8000 in the present embodiment) to sample Q1. The value of α varies with the duration of the wake word voice uttered by the user (the length of the period from when the user begins uttering the wake word to when the user finishes), and the value of Q1 varies with α. In the recognition target voice data, sample Q0 is the timing at which the wake word voice starts and corresponds to the start point (sometimes called the start index). Likewise, sample Q1 is the timing at which the wake word voice ends and corresponds to the end point (sometimes called the end index). As shown in FIG. 2, the request voice is recorded in range H2, which follows range H1 after a gap.

The control information data is data in which the necessary reference information about the recognition target voice data is described in a predetermined format (for example, JSON). In the present embodiment, the control information data is required to contain information used to identify the processing request data, information indicating the format of the recognition target voice data, and information explicitly indicating the positions of the start point and the end point of the wake word voice in the recognition target voice data. In the control information data, both the start point position and the end point position are expressed as sample positions counted from the beginning of the data (sample 0); in particular, the start point position is fixed at sample 8000.
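As a rough illustration of such control information data, the sketch below builds a JSON document carrying the start and end points. The field names (`requestId`, `audioFormat`, `wakeWord`, and so on) are hypothetical and are not taken from any actual service; the relation Q1 = Q0 + α is also an assumption drawn from the description of range H1.

```python
import json

PREROLL_SAMPLES = 8000  # Q0: the start point is fixed at sample 8000 in the text


def build_control_info(wake_word_samples: int) -> str:
    """Return control information as a JSON string.

    wake_word_samples is alpha, the length of the wake word voice in samples.
    All field names below are hypothetical placeholders.
    """
    if wake_word_samples <= 0:
        raise ValueError("wake_word_samples must be positive")
    start_index = PREROLL_SAMPLES
    end_index = start_index + wake_word_samples  # assumed: Q1 = Q0 + alpha
    info = {
        "requestId": "example-request-0001",      # hypothetical identifier
        "audioFormat": {"sampleRateHz": 16000},   # format of the voice data
        "wakeWord": {
            "startIndexInSamples": start_index,
            "endIndexInSamples": end_index,
        },
    }
    return json.dumps(info, indent=2)


if __name__ == "__main__":
    print(build_control_info(12000))
```

Both indices count from sample 0 at the beginning of the recognition target voice data, matching the convention stated in the text.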

As described above, the control information data contains information indicating the position of the start point and information indicating the position of the end point. In the present embodiment, the functional requirements demand that the error between the start point indicated in the control information data (sample 8000 in this example) and the point at which the wake word voice actually starts in the recognition target voice data (the position of the leading edge of its waveform) be within 50 milliseconds (800 samples). The functional requirements further demand that the error between the end point indicated in the control information data and the point at which the wake word voice actually ends in the recognition target voice data (the position of the trailing edge of its waveform) be within 150 milliseconds (2400 samples). On this premise, the configuration and processing of the in-vehicle device 2 are described in detail below.
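The sample counts quoted above follow directly from the 16 kHz sampling frequency. The following sketch reproduces the arithmetic and checks a pair of declared points against the two tolerances; the function names are illustrative assumptions, not part of the described device.

```python
SAMPLE_RATE_HZ = 16_000  # sampling frequency stated for the recognition target voice data


def ms_to_samples(milliseconds: float, rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a number of samples."""
    return round(milliseconds * rate_hz / 1000)


START_TOLERANCE = ms_to_samples(50)   # 800 samples
END_TOLERANCE = ms_to_samples(150)    # 2400 samples


def within_tolerance(declared_start: int, declared_end: int,
                     actual_start: int, actual_end: int) -> bool:
    """Check the declared start/end points against the stated error limits."""
    start_ok = abs(declared_start - actual_start) <= START_TOLERANCE
    end_ok = abs(declared_end - actual_end) <= END_TOLERANCE
    return start_ok and end_ok


if __name__ == "__main__":
    print(ms_to_samples(500), START_TOLERANCE, END_TOLERANCE)
```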

As shown in FIG. 1, the in-vehicle device 2 includes, as its functional configuration, a communication unit 10, a voice processing unit 11, a registered word detection unit 12, a command word processing execution unit 13, a user frequency registration unit 14, a frequency specifying unit 15, a point detection unit 16, and a linkage processing execution unit 17. Each of the functional blocks 10 to 17 can be implemented in hardware, in a DSP (Digital Signal Processor), or in software. When implemented in software, for example, each of the functional blocks 10 to 17 is in practice built around a computer's CPU, RAM, ROM, and the like, and is realized by running a program stored in a recording medium such as RAM, ROM, a hard disk, or semiconductor memory.

As shown in FIG. 1, the in-vehicle device 2 includes, as storage means, a voice buffer 20, a dictionary storage unit 21, and a user frequency storage unit 22. The voice buffer 20 is a temporary storage area in which the input voice data described later is buffered. The dictionary storage unit 21 stores a voice recognition dictionary, which is a dictionary for voice recognition in which a voice pattern is registered for each registered word. The registered words include command words and the wake word. That is, in the present embodiment, the wake word is registered in the voice recognition dictionary as a registered word in addition to the command words.

A command word is a phrase that instructs the in-vehicle device 2 (or another device under the control of the in-vehicle device 2) to execute specific processing. By uttering the voice corresponding to a command word, the user can have the in-vehicle device 2 execute the specific processing corresponding to that command word without operating the in-vehicle device 2 by hand.

The communication unit 10 accesses the network N in accordance with a predetermined communication standard and communicates with external devices connected to the network N (including the service providing server 3). Any method may be used for the communication unit 10 to access the network N, and any communication standard may be used for communication with external devices. For example, the communication unit 10 communicates with a mobile terminal brought into the own vehicle by wireless communication such as Bluetooth (registered trademark) or Wi-Fi (registered trademark), and accesses the network N using the tethering function of the mobile terminal. Alternatively, the communication unit 10 accesses the network N by directly accessing a mobile communication network.

The voice processing unit 11 performs analog-to-digital conversion, including sampling, quantization, and encoding, on the voice picked up by the microphone 4 to generate voice data, and buffers the voice data in the voice buffer 20. As a result, the voice buffer 20 holds voice data based on the voice picked up by the microphone 4 during a predetermined period going back from the present. Hereinafter, the set of voice data stored in the voice buffer 20 is referred to as the "input voice data". The voice processing unit 11 also includes a D/A converter, a volume circuit, an amplifier circuit, and the like; it converts incoming voice data from digital to analog with the D/A converter, adjusts the volume level with the volume circuit, amplifies the signal with the amplifier circuit, and outputs it as sound from the speaker 5.
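A buffer that always holds the most recent stretch of audio can be sketched with a fixed-capacity queue. The class name `VoiceBuffer`, the 5-second capacity, and the chunked `append` interface are assumptions for illustration; the text does not specify how the voice buffer 20 is implemented.

```python
from collections import deque
from typing import Iterable, List


class VoiceBuffer:
    """Keeps only the most recent samples, discarding the oldest when full."""

    def __init__(self, sample_rate_hz: int = 16000, seconds: float = 5.0) -> None:
        # Capacity corresponds to the "predetermined period" going back from now
        # (the 5-second default is an assumed value).
        self._samples: deque = deque(maxlen=int(sample_rate_hz * seconds))

    def append(self, chunk: Iterable[int]) -> None:
        """Add newly digitized samples; old samples fall off the front."""
        self._samples.extend(chunk)

    def snapshot(self) -> List[int]:
        """Return the buffered input voice data, oldest sample first."""
        return list(self._samples)


if __name__ == "__main__":
    buf = VoiceBuffer(sample_rate_hz=4, seconds=1.0)  # tiny capacity for the demo
    buf.append([1, 2, 3])
    buf.append([4, 5, 6])
    print(buf.snapshot())
```

Because the buffer retains audio from before the detection moment, later stages can look back in time, which is what makes a pre-roll and a retrospective start-point search possible.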

When the input voice that enters the voice processing unit 11 through the microphone 4 contains a registered word, the registered word detection unit 12 detects this. Specifically, the registered word detection unit 12 continuously analyzes the input voice data that the voice processing unit 11 accumulates in the voice buffer 20, and continuously calculates the similarity between the waveform of the voice recorded in the input voice data (hereinafter sometimes simply the "waveform of the input voice") and the voice pattern of each registered word in the voice recognition dictionary. The similarity is calculated continuously for every registered word in the dictionary. In the present embodiment, the registered word detection unit 12 calculates, as the similarity, a "distance value" in the range 0 to 1000 that becomes smaller the more closely the waveform of the input voice resembles the voice pattern of the registered word.

When the distance value between the waveform of the input voice and the voice pattern of any registered word in the voice recognition dictionary falls below a predetermined threshold TH, the registered word detection unit 12 detects that the voice corresponding to that registered word is contained in the input voice (that is, that the user has uttered the voice corresponding to the registered word). Hereinafter, detecting that the voice corresponding to a registered word (command word, wake word) is contained in the input voice is expressed simply as "detecting the registered word (command word, wake word)".

FIG. 3 is a graph showing an example of the transition of the distance value calculated for a certain registered word, with time on the horizontal axis and the distance value on the vertical axis. In the example of FIG. 3, the distance value stays near 1000 from timing TM0 to timing TM1, gradually decreases from around timing TM1, and falls below the threshold TH at timing TM2. In this example, the registered word detection unit 12 detects the registered word at timing TM2.
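The threshold test on the distance value can be sketched as a scan over successive distance values. The threshold of 300 and the sample values are invented for illustration; the text does not give a value for TH or describe how the distance value itself is computed.

```python
from typing import Optional, Sequence


def detect_registered_word(distances: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first distance value below the threshold.

    distances are successive distance values (0 to 1000, smaller = more similar)
    for one registered word. Returns None if the threshold is never crossed.
    """
    for index, distance in enumerate(distances):
        if distance < threshold:
            return index  # corresponds to timing TM2 in FIG. 3
    return None


if __name__ == "__main__":
    # Stays near 1000, then falls, as in the FIG. 3 example (values are made up).
    trace = [1000, 990, 700, 400, 250, 300]
    print(detect_registered_word(trace, threshold=300))
```

Note that this yields only the moment of detection, not where the word began or ended in the audio, which is the limitation the invention addresses.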

When the registered word detection unit 12 detects a command word among the registered words, it notifies the command word processing execution unit 13 of the identification information assigned to the detected command word. When it detects the wake word among the registered words, it notifies the frequency specifying unit 15 that the wake word has been detected. The registered word detection unit 12 thus has the function of detecting the wake word (specific word) by comparing the input voice with the pre-registered voice pattern of the wake word, and in this respect corresponds to the "specific word detection unit" in the claims.

When the command word processing execution unit 13 receives the identification information of a command word from the registered word detection unit 12, it executes the processing associated with that identification information. As a result, when the user utters the voice corresponding to a command word, the processing corresponding to that command word is executed by the command word processing execution unit 13. For example, when the command word is "map display", which instructs the device to display a map of the area around the own vehicle, the command word processing execution unit 13 displays a map of the area around the own vehicle on a display means (not shown). Likewise, when the command word is "go home", which instructs the device to search for a guidance route to the registered home and to provide guidance along it, the command word processing execution unit 13 searches for a guidance route from the current position of the own vehicle to the pre-registered home and provides guidance along that route.
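One simple way to associate identification information with processing is a lookup table from identifier to handler. The identifiers `map_display` and `go_home` and the handler functions below are hypothetical stand-ins; the actual identification scheme is not described in the text.

```python
from typing import Callable, Dict


def show_map() -> str:
    """Placeholder for displaying a map around the own vehicle."""
    return "display map around own vehicle"


def route_home() -> str:
    """Placeholder for searching and guiding a route to the registered home."""
    return "search and guide route to registered home"


# Hypothetical identification information mapped to its processing.
COMMAND_HANDLERS: Dict[str, Callable[[], str]] = {
    "map_display": show_map,
    "go_home": route_home,
}


def execute_command(command_id: str) -> str:
    """Run the processing associated with the notified command word identifier."""
    handler = COMMAND_HANDLERS.get(command_id)
    if handler is None:
        raise KeyError(f"no processing registered for command id: {command_id}")
    return handler()


if __name__ == "__main__":
    print(execute_command("map_display"))
```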

Note that the function by which the in-vehicle device 2 automatically detects an instruction word without the user giving a trigger is an existing function called triggerless voice recognition, and has conventionally been implemented in such devices. Therefore, the wake word can be detected with high accuracy simply by registering the voice pattern of the wake word in the voice recognition dictionary used by this existing function.

The user frequency registration unit 14 registers user frequency-related information in the user frequency table stored in the user frequency storage unit 22. More specifically, a user who may ride in the own vehicle and may receive the service of the service providing server 3 via the in-vehicle device 2 is to perform the following frequency registration work in advance. This work is performed in the own vehicle in a quiet environment. In the frequency registration work, the user first performs a predetermined instruction input to notify the user frequency registration unit 14 that the frequency registration work has started, and then utters the wake word in the same manner as when the user normally utters it.

Upon receiving the notification that the frequency registration work has started, the user frequency registration unit 14 analyzes the voice data stored in the voice buffer 20 for a fixed period from that point and identifies the following items. The fixed period is long enough to include the timing at which the user finishes uttering the wake word. First, the user frequency registration unit 14 analyzes the input voice in the fixed period and identifies, among the frequencies constituting the input voice, the frequency having the highest sound pressure level. The frequency specifying unit 15 identifies the frequency by, for example, the spectrum analyzer function of a DSP mounted on the microcomputer of the in-vehicle device 2. Hereinafter, the frequency identified here is referred to as the "user-specific frequency". The frequency identified here is a frequency peculiar to the user's uttered voice, and is the dominant frequency in which the characteristics of the user's uttered voice are most likely to appear.
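As a non-limiting illustration, identifying the frequency with the highest level can be sketched with a plain discrete Fourier transform. The embodiment uses the spectrum analyzer function of a DSP; the function name, the use of spectral magnitude as a stand-in for sound pressure level, and the small synthetic signal below are assumptions made only for this sketch.

```python
import cmath
import math
from typing import Sequence


def dominant_frequency(samples: Sequence[float], sample_rate: int) -> float:
    """Return the frequency (Hz) of the DFT bin with the largest magnitude."""
    n = len(samples)
    best_bin, best_mag = 0, -1.0
    # Skip bin 0 (DC) and stop below the Nyquist bin.
    for k in range(1, n // 2):
        acc = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        mag = abs(acc)
        if mag > best_mag:
            best_bin, best_mag = k, mag
    return best_bin * sample_rate / n


# Synthetic input: a strong 200 Hz component plus a weaker 500 Hz component.
RATE = 1600
signal = [
    math.sin(2 * math.pi * 200 * t / RATE)
    + 0.3 * math.sin(2 * math.pi * 500 * t / RATE)
    for t in range(160)
]
peak_hz = dominant_frequency(signal, RATE)
```

The naive transform is quadratic in the number of samples, so a real device would more plausibly rely on an FFT or on dedicated hardware, as the embodiment suggests.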

Further, the user frequency registration unit 14 analyzes the input voice in the fixed period and identifies the temporal length of the voice corresponding to the wake word uttered by the user (the length of the period from when the user starts uttering the wake word voice until the user finishes uttering it). Here, the input voice received during the fixed period after the notification that the frequency registration work has started is certain to be the voice corresponding to the wake word uttered by the user, and the characteristic voice waveform recorded in the input voice data is therefore undoubtedly derived from the wake word. Based on this, the user frequency registration unit 14 identifies, in the input voice data, the temporal length between the position where the voice waveform of the identified frequency rises and the position where it falls as the temporal length of the voice corresponding to the wake word uttered by the user. Hereinafter, the length identified here is referred to as the "user-specific time length".
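As a non-limiting illustration, the measurement of the user-specific time length can be sketched as follows. The embodiment only states that the rising and falling positions of the waveform are used; treating them as crossings of a level threshold, and representing the waveform as one level value per fixed-length frame, are assumptions made for this sketch.

```python
from typing import Optional, Sequence


def utterance_length_ms(
    levels: Sequence[float], frame_ms: float, threshold: float
) -> Optional[float]:
    """Length between the first and last frame whose level exceeds the threshold.

    levels    : level of the identified frequency, one value per frame
    frame_ms  : duration of one frame in milliseconds (hypothetical parameter)
    threshold : level above which the waveform counts as risen (hypothetical)
    """
    above = [i for i, level in enumerate(levels) if level > threshold]
    if not above:
        return None  # the waveform never rose during the analyzed period
    return (above[-1] - above[0] + 1) * frame_ms
```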

After identifying the user-specific frequency and the user-specific time length, the user frequency registration unit 14 generates a unique user ID and registers, in the user frequency table, user frequency-related information including the user ID, information indicating the user-specific frequency, and information indicating the user-specific time length. As a result of the above processing by the user frequency registration unit 14, the user frequency table stored in the user frequency storage unit 22 holds, for each person who may ride in the own vehicle and use the voice recognition service, user frequency-related information including the user ID, information indicating the user-specific frequency, and information indicating the user-specific time length. In the above example, the "user-specific time length" is measured based on a single utterance by the user; however, the user may instead be made to utter the wake word-compatible voice a plurality of times, and the user-specific time length may be obtained from the measured values of the samples by averaging or another statistical method.

When the frequency specifying unit 15 receives, from the registered word detection unit 12, information indicating that the wake word has been detected, it refers to the voice buffer 20, analyzes the input voice data for a period going back a predetermined time from the current time (= the timing at which the distance value fell below the threshold TH and the wake word was detected by the registered word detection unit 12), and identifies the following. That is, the frequency specifying unit 15 identifies, as the target frequency, the frequency corresponding to the frequency having the highest sound pressure level among the frequencies constituting the input voice received during that period. Here, the frequency corresponding to the frequency having the highest sound pressure level includes not only the frequency having the highest sound pressure level itself but also frequencies close to it.

This period is continuous with the timing at which the registered word detection unit 12 detected the wake word. Therefore, the input voice in this period is assumed to include the wake word-compatible voice uttered by one person riding in the own vehicle (noise and other voices may also be included), and the frequency identified as the target frequency can be said to include a frequency peculiar to that person's uttered voice.

The point detection unit 16 detects the start point and the end point of the wake word in the input voice based on the "transition of the sound pressure level of the target frequency identified by the frequency specifying unit 15" in the periods before and after the timing at which the wake word was recognized by the registered word detection unit 12. The processing of the point detection unit 16 is described in detail below.

First, for the input voice data stored in the voice buffer 20, the point detection unit 16 extracts the transition of the sound pressure level of the target frequency (the waveform of the target frequency) in the first period, and loads it as first period transition data into a predetermined storage area that functions as a working area. The first period includes the timing at which the wake word was detected by the registered word detection unit 12 (hereinafter referred to as the "wake word detection timing"), and consists of a predetermined time going back from the wake word detection timing and a predetermined time going forward from it. If the current time has not yet reached the end of the first period, the point detection unit 16 loads the first period transition data into the predetermined storage area after the end of the first period is reached.

FIG. 4 is a graph showing an example of the transition of the sound pressure level of the target frequency during a period that includes the wake word detection timing as well as the start point and the end point. In the graph of FIG. 4, the horizontal axis represents the passage of time, and the vertical axis represents the sound pressure level of the target frequency. In FIG. 4, the timing TX1 is the wake word detection timing. Further, in FIG. 4, the period KX1 is the first period, and the first period is a period of 200 milliseconds before and after the wake word detection timing. When the sound pressure level of the target frequency changes in the manner illustrated in FIG. 4 and the wake word detection timing is the timing TX1, the point detection unit 16 stores, in the predetermined storage area, the first period transition data indicating the transition of the sound pressure level of the target frequency in the period KX1, which is the first period.

Next, the point detection unit 16 analyzes the first period transition data and detects, as the end point, the point in the first period at which the sound pressure level of the target frequency changes from a state of decreasing with the passage of time to a state of remaining stable. In the example of FIG. 4, the point detection unit 16 detects the point PX1 as the end point. As an example, the point detection unit 16 calculates an approximate curve of the transition of the sound pressure level of the target frequency in the first period, and detects, as the end point, the position from which the slope of the tangent to the approximate curve, followed over time, converges within a range at or below a predetermined value. In FIG. 4, reference numeral R1 represents the approximate curve. The advantages of this end point detection method are described below.
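As a non-limiting illustration, the slope-based end point detection can be sketched as follows. The embodiment does not specify how the approximate curve is calculated; using a moving average as the approximate curve, using the difference between neighboring frames as the tangent slope, and the parameter names below are assumptions made for this sketch.

```python
from typing import List, Optional, Sequence


def moving_average(levels: Sequence[float], width: int) -> List[float]:
    """Smooth the level transition; stands in for the approximate curve."""
    half = width // 2
    curve = []
    for i in range(len(levels)):
        lo, hi = max(0, i - half), min(len(levels), i + half + 1)
        curve.append(sum(levels[lo:hi]) / (hi - lo))
    return curve


def detect_end_point(
    levels: Sequence[float], slope_limit: float, width: int = 3
) -> Optional[int]:
    """Index of the first frame of the stable tail that follows a decrease.

    Returns None when the level never settles or never decreases first.
    """
    curve = moving_average(levels, width)
    slopes = [curve[i + 1] - curve[i] for i in range(len(curve) - 1)]
    # Walk back from the end while the slope stays within the limit.
    i = len(slopes)
    while i > 0 and abs(slopes[i - 1]) <= slope_limit:
        i -= 1
    if i == 0 or i == len(slopes):
        return None  # flat throughout, or not stable at the end
    if slopes[i - 1] >= 0:
        return None  # the stable tail is not preceded by a decrease
    return i
```

For a transition such as 10, 10, 10, 8, 6, 4, 2, 1, 1, 1, 1, 1 with a slope limit of 0.5, the sketch reports frame 7, the first frame of the low, stable tail.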

As described above, in the present embodiment, the wake word is detected when the distance value, which indicates the degree of similarity between the voice waveform of the input voice and the pre-registered voice pattern of the wake word, falls below the threshold TH. The method of calculating the distance value and the threshold TH are designed so that the distance value gradually decreases from the timing at which utterance of the wake word-compatible voice starts and falls below the threshold TH near the timing at which utterance of that voice ends (see also FIG. 2). With the distance value calculation method and the threshold TH designed in this way, the wake word is detected only when it is highly likely that the user has uttered the entire wake word-compatible voice. For example, the wake word is prevented from being erroneously detected when only part of the wording of the wake word is uttered unintentionally and by chance.

The length of the first period is set appropriately, based on the results of prior tests and simulations, so that the timing at which the wake word-compatible voice finishes being uttered falls within the first period and so that the first period does not become unnecessarily long. Therefore, when the user utters the wake word-compatible voice, the utterance can be regarded as having finished within the first period that includes the wake word detection timing. Based on the above, according to the present embodiment, the "transition of the sound pressure level of the target frequency" is analyzed, and the end point is detected, only for the limited period (the first period) to which the end point can be regarded as belonging. The end point can therefore be detected efficiently and with high accuracy, with a small processing load and a short processing time. The first period is kept from becoming unnecessarily long because the longer the first period is, the larger the period (amount of data) to be analyzed becomes, which increases the processing load and processing time, and the more likely it is that the first period transition data contains noise.

Further, in the present embodiment, the transition of the sound pressure level of the target frequency is the subject of the analysis for detecting the end point. As a result, the frequency peculiar to the person who actually uttered the wake word-compatible voice (the dominant frequency in which the characteristics of that person's uttered voice are most likely to appear) is analyzed, so the first period transition data can be analyzed with the influence of other sounds (for example, noise such as the engine sound of the own vehicle and environmental sounds) eliminated as far as possible. In this respect, the end point can be detected with high accuracy.

Further, in the present embodiment, the point at which the sound pressure level of the target frequency changes from a decreasing state to a stable state is detected as the end point. When a series of wake word-compatible voice ends, the sound input to the microphone 4 basically consists only of environmental sound, and the sound pressure level of the target frequency derived from the person who uttered the wake word-compatible voice drops rapidly and stabilizes at a low value. Therefore, the sound pressure level of the target frequency changes from a decreasing state to a stable state at the timing at which utterance of the wake word-compatible voice ends. Based on this, according to the present embodiment, the end point can be detected with high accuracy by exploiting this characteristic of how the sound pressure level of the target frequency changes.

Note that the end point is a specific position in the waveform of the target frequency indicated by the first period transition data, but it can also be used as information indicating a specific position in the voice waveform of the input voice data from which the waveform of the target frequency was derived. The same applies to the start point described later.

After detecting the end point, the point detection unit 16 refers to the user frequency table stored in the user frequency storage unit 22 and determines whether a user-specific frequency whose value is the same as or close to the value of the target frequency is registered. Two frequency values are close when the difference between them is at or below a predetermined value. When a user-specific frequency whose value is the same as or close to the value of the target frequency is registered, the point detection unit 16 reads the user-specific time length included in the user frequency-related information corresponding to that user-specific frequency and specifies the user-specific time length as the "detection time".

Here, when the value of the target frequency and the value of the user-specific frequency corresponding to one item of user frequency-related information are the same or close, it is strongly presumed that the person who uttered the wake word-compatible voice this time and the person corresponding to that item of user frequency-related information are the same. Therefore, the user-specific time length specified as the detection time can be used as the time that the person who uttered the wake word-compatible voice this time takes from starting to utter the wake word-compatible voice until finishing it.

On the other hand, when no such user frequency-related information is registered, the point detection unit 16 specifies a predetermined time as the "detection time". The predetermined time is the average time required to utter the wake word-compatible voice, and is calculated in advance from a plurality of samples by averaging or another statistical method. In this case, it is considered either that a user whose user frequency-related information is not registered in the user frequency table uttered the wake word-compatible voice, or that a registered user uttered it but, due to detection error or some other cause, a target frequency that is the same as or close to that user's user-specific frequency was not identified.
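As a non-limiting illustration, the selection of the detection time can be sketched as follows. The record layout, the tolerance of 10 Hz, the default of 1000 milliseconds, and the choice of the closest entry when several entries match are all assumptions made for this sketch; the embodiment only states that the difference must be at or below a predetermined value and that a pre-calculated average time is used otherwise.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class UserFrequencyInfo:
    user_id: str
    user_frequency_hz: float  # user-specific frequency
    user_length_ms: float     # user-specific time length


DEFAULT_DETECTION_MS = 1000.0  # hypothetical pre-calculated average time
TOLERANCE_HZ = 10.0            # hypothetical "predetermined value"


def detection_time_ms(
    target_hz: float,
    table: Sequence[UserFrequencyInfo],
    tolerance_hz: float = TOLERANCE_HZ,
    default_ms: float = DEFAULT_DETECTION_MS,
) -> float:
    """Return the matching user's time length, or the default when none matches."""
    matches = [
        e for e in table if abs(e.user_frequency_hz - target_hz) <= tolerance_hz
    ]
    if not matches:
        return default_ms
    # Assumption: prefer the entry whose frequency is closest to the target.
    closest = min(matches, key=lambda e: abs(e.user_frequency_hz - target_hz))
    return closest.user_length_ms
```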

After specifying the detection time, the point detection unit 16 identifies, in the input voice data stored in the voice buffer 20, the timing that goes back from the end point by the detection time (hereinafter referred to as the "second period correspondence timing"). In FIG. 4, the timing TX2 is the second period correspondence timing. Next, the point detection unit 16 specifies, as the second period, a period that includes the second period correspondence timing and consists of a predetermined time going back from the second period correspondence timing and a predetermined time going forward from it. In FIG. 4, the period KX2 is the second period, and the second period is a period of 200 milliseconds before and after the second period correspondence timing. Next, the point detection unit 16 extracts the transition of the sound pressure level of the target frequency (the waveform of the target frequency) in the second period and loads it into the predetermined storage area as second period transition data.

Next, the point detection unit 16 analyzes the second period transition data and detects, as the start point, the point in the second period at which the sound pressure level of the target frequency changes from a state of remaining stable with the passage of time to a state of rising. In the example of FIG. 4, the point detection unit 16 detects the point PX2 as the start point. As an example, the point detection unit 16 calculates an approximate curve of the transition of the sound pressure level of the target frequency in the second period, and detects, as the start point, the position from which the slope of the tangent to the approximate curve, followed over time, changes from a state of staying within a range at or below a predetermined value to a state of increasing beyond the predetermined value. The advantages of this start point detection method are described below.
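As a non-limiting illustration, locating the second period and detecting the start point within it can be sketched as follows. Expressing positions as sample indices, using the difference between neighboring values as the tangent slope, and the function and parameter names are assumptions made for this sketch. The sample counts in the usage lines assume 16 kHz sampling, which is inferred from the later statement that 50 milliseconds corresponds to 800 samples.

```python
from typing import Optional, Sequence, Tuple


def second_period(
    end_index: int, detection_samples: int, half_window: int
) -> Tuple[int, int]:
    """Window around the timing that lies one detection time before the end point."""
    center = end_index - detection_samples  # second period correspondence timing
    return max(0, center - half_window), center + half_window


def detect_start_point(
    levels: Sequence[float], slope_limit: float
) -> Optional[int]:
    """Index where a stable level gives way to a rise; None if no such change."""
    slopes = [levels[i + 1] - levels[i] for i in range(len(levels) - 1)]
    for i, slope in enumerate(slopes):
        if slope > slope_limit:
            # The rise must be preceded by a stable stretch.
            if i > 0 and all(abs(s) <= slope_limit for s in slopes[:i]):
                return i
            return None
    return None


# 200 ms on each side is 3200 samples at the assumed 16 kHz.
window = second_period(end_index=20000, detection_samples=12800, half_window=3200)
start = detect_start_point([1, 1, 1, 1, 3, 5, 7, 7], slope_limit=0.5)
```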

As described above, the detection time is either the user-specific time length of the user frequency-related information registered in the user frequency table or a predetermined value. When the detection time is the user-specific time length, the length of the detection time represents the time that the person who uttered the wake word-compatible voice this time normally takes to utter it. Therefore, when the detection time is the user-specific time length, the timing at which utterance of the wake word-compatible voice started and the timing going back from the end point by the detection time (the second period correspondence timing) can be regarded as close to each other.

Based on the above, when the detection time is the user-specific time length, the "transition of the sound pressure level of the target frequency" is analyzed, and the start point is detected, only for the limited period (the second period) to which the start point can be regarded as belonging, so the start point can be detected efficiently and with high accuracy. Further, the processing load is small and the processing time can be shortened. Furthermore, the possibility that the data being analyzed contains noise is reduced, and the analysis can be performed with the influence of noise eliminated as far as possible. On the other hand, when the detection time is the predetermined value, the length of the detection time represents the average time required to utter the wake word-compatible voice. In this case as well, the timing at which utterance of the wake word-compatible voice started and the timing going back from the end point by the detection time are not extremely far apart, and the difference is assumed to fall within a predetermined range, so the start point can be detected efficiently and with high accuracy.

The length of the second period is set appropriately, based on the results of prior tests and simulations, so that the timing at which the wake word-compatible voice starts being uttered falls within the second period and so that the second period does not become unnecessarily long. The second period may also be changed so that it is shorter when the detection time is the user-specific time length than when the detection time is the predetermined value. When the detection time is the user-specific time length, the difference between the timing at which utterance of the wake word-compatible voice started and the timing going back from the end point by the detection time is small, so the second period can be shortened. Shortening the second period further reduces the processing load and processing time and further reduces the influence of noise. Conversely, lengthening the second period when the detection time is the predetermined value makes it more certain that the start point belongs to the second period.

The effects of analyzing the transition of the sound pressure level of the target frequency to detect the start point, and of detecting as the start point the point at which the sound pressure level of the target frequency changes from a stable state to a rising state, are the same as for the end point.

The cooperation processing execution unit 17 generates processing request data using the start point and the end point detected by the point detection unit 16, and transmits the processing request data to the service providing server 3. Further, when a response is received from the service providing server 3, the cooperation processing execution unit 17 executes the processing corresponding to that response. The processing of the cooperation processing execution unit 17 is described in detail below.

When the start point and the end point are detected by the point detection unit 16, the cooperation processing execution unit 17 analyzes the input voice data stored in the voice buffer 20 and, for the input voice following the end point detected by the point detection unit 16 (the voice assumed to be the request), detects when the sound pressure level of the input voice has fallen to or below a predetermined value and has remained at or below that value for a fixed time or longer. Hereinafter, the timing at which this is detected is referred to as the "end detection timing". At this timing, the user is assumed to have finished uttering the request-compatible voice.
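As a non-limiting illustration, detection of the end detection timing can be sketched as follows. Representing the input as one level value per frame and counting consecutive quiet frames are assumptions made for this sketch, as are the parameter names.

```python
from typing import Optional, Sequence


def end_detection_index(
    levels: Sequence[float],
    start: int,
    quiet_level: float,
    min_quiet_frames: int,
) -> Optional[int]:
    """Index at which the level has stayed at or below quiet_level long enough.

    start            : first frame after the detected end point of the wake word
    quiet_level      : the predetermined level (hypothetical parameter)
    min_quiet_frames : the fixed time expressed in frames (hypothetical)
    """
    run = 0
    for i in range(start, len(levels)):
        if levels[i] <= quiet_level:
            run += 1
            if run >= min_quiet_frames:
                return i
        else:
            run = 0  # speech resumed, so restart the count
    return None
```

A brief pause inside the request resets the count, so only a sufficiently long quiet stretch is reported.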

Upon detecting that the sound pressure level of the input voice assumed to be the request has remained at or below the predetermined value for the fixed time or longer, the cooperation processing execution unit 17 extracts, from the input voice data in the voice buffer 20, the data from the point 8000 samples before the start point detected by the point detection unit 16 to the point corresponding to the end detection timing, and uses this data as the recognition target voice data to be included in the processing request data. As described with reference to FIG. 2, the requirements for the data structure are that the recognition target voice data records the wake word-compatible voice followed by the request-compatible voice, and that it includes a pre-roll of 8000 samples before the start point. Because the recognition target voice data is generated by the above method, recognition target voice data that satisfies these requirements can be generated.

Further, the cooperation processing execution unit 17 generates control information data according to the format. As described above, the cooperation processing execution unit 17 sets the "information specifying the start point" described in the control information data to the fixed value of the 8000th cycle. Here, in the present embodiment, the start point is detected with high accuracy by the point detection unit 16. The difference between the point at which the wake word-compatible voice actually starts in the recognition target voice data (that is, the start point detected by the point detection unit 16) and the start point indicated by the information described in the control information data (the 8000th cycle in this example) is therefore expected to be very small, and the functional requirement that the error be within 50 milliseconds (800 samples) is basically satisfied. Further, the cooperation processing execution unit 17 sets the "information specifying the end point" described in the control information data to the end point detected by the point detection unit 16. As described above, the end point is also detected with high accuracy by the point detection unit 16, so the functional requirement that the error be within 150 milliseconds (2400 samples) is basically satisfied.
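As a non-limiting illustration, extraction of the recognition target voice data and generation of the control information can be sketched as follows. The dictionary keys, the inclusive handling of the end detection position, and the expression of the end point relative to the extracted data are assumptions made for this sketch; the actual format of the control information data is not given here. The 16 kHz sampling rate is inferred from the stated correspondence of 50 milliseconds to 800 samples.

```python
from typing import Dict, List, Sequence, Tuple

PREROLL_SAMPLES = 8000   # pre-roll required before the start point
SAMPLE_RATE_HZ = 16000   # inferred: 800 samples per 50 ms


def ms_to_samples(ms: int) -> int:
    """Convert a tolerance in milliseconds to a number of samples."""
    return ms * SAMPLE_RATE_HZ // 1000


def build_request(
    buffer: Sequence[float], start_point: int, end_point: int, end_detection: int
) -> Tuple[List[float], Dict[str, int]]:
    """Cut out the recognition target voice data and describe its two points."""
    first = start_point - PREROLL_SAMPLES
    if first < 0:
        raise ValueError("buffer does not hold enough pre-roll")
    audio = list(buffer[first:end_detection + 1])
    control = {
        "start_sample": PREROLL_SAMPLES,   # fixed value by construction
        "end_sample": end_point - first,   # end point within the extracted data
    }
    return audio, control
```

Because the extraction always begins exactly 8000 samples before the detected start point, the start position in the control information can be a constant, and any error in it equals the error of the start point detection itself.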

After generating the recognition target voice data and the control information data, the cooperation processing execution unit 17 generates processing request data containing these data and transmits the generated processing request data to the service providing server 3. The information needed to transmit the processing request data to the service providing server 3 (the address of the service providing server 3, the information required for authentication, the protocol to be used, and so on) is registered in advance.

On receiving the processing request data, the service providing server 3 analyzes the recognition target voice data, including voice recognition, based on the contents of the control information data, understands the content of the request, and executes processing corresponding to that content. In the following, the content of the request is assumed to be a question, and the service providing server 3 is assumed to execute processing that causes the in-vehicle device 2 to output the answer to the question as voice.

When the content of the request is a question, the service providing server 3 generates a sentence that answers the request (hereinafter referred to as the "response"). The response is generated appropriately based on existing technology. For example, for the request "Tell me tomorrow's weather in XX" (where XX represents a place), the response is "Tomorrow's weather in XX is sunny." In this case, the service providing server 3 recognizes the content of the request through voice recognition processing by artificial intelligence, natural language processing, information summarization processing, and the like, collects the information needed to generate the response (in this example, tomorrow's weather in XX), and generates the response based on the collected information.

After generating the response, the service providing server 3 generates voice data in which the voice corresponding to the response is recorded (hereinafter referred to as "response voice data"). The service providing server 3 then returns the response voice data to the in-vehicle device 2.

The cooperation processing execution unit 17 of the in-vehicle device 2 receives the response voice data transmitted by the service providing server 3. The cooperation processing execution unit 17 outputs the received response voice data to the voice processing unit 11 and causes the speaker 5 to output the voice corresponding to the response recorded in the response voice data.

As described above, in the present embodiment, the in-vehicle device 2 recognizes the wake word (specific word) by comparing the input voice with the pre-registered voice pattern of the wake word, and specifies, as the target frequency, the frequency corresponding to the frequency with the highest sound pressure level among the frequencies constituting the input voice that was input in a period continuous with the timing at which the wake word was recognized. The in-vehicle device 2 then detects the start point and the end point of the wake word in the input voice based on the transition of the sound pressure level of the target frequency in a period including the timing at which the wake word was recognized. More specifically, the in-vehicle device 2 detects the end point based on the transition of the sound pressure level of the target frequency in the first period, which includes the wake word detection timing, and detects the start point based on the transition of the sound pressure level of the target frequency in the second period, which includes the second period corresponding timing.

According to the present embodiment, voice recognition (triggerless voice recognition) that detects the wake word by comparing the input voice with the pre-registered voice pattern of the wake word is performed, so when the user utters the wake word, that fact can be detected with high certainty. In addition, the start point and the end point can be detected from characteristic changes in the sound pressure level within the period before and after the timing at which the wake word was recognized, in other words, the period in which the wake word is very likely being uttered. The start point and the end point can therefore be detected appropriately, with high efficiency and high certainty.

In particular, according to the present embodiment, the transition of the sound pressure level is analyzed using, as the target frequency, the frequency corresponding to the frequency with the highest sound pressure level (that frequency itself or a frequency close to it) among the frequencies constituting the voice input when the user utters the wake word. Because the frequency characteristic of the voice of the user who uttered the wake word is thus the object of analysis, the transition of the sound pressure level can be analyzed under conditions that are less susceptible to noise-like influence from sounds at other frequencies.

Further, according to the present embodiment, the start point and the end point can be detected appropriately by making effective use of the resources already used for triggerless voice recognition, without implementing a dedicated voice recognition engine for wake word detection. This makes it possible to reduce the manufacturing cost and the product price.

Next, an operation example of the in-vehicle device 2 when the user utters the wake word-compatible voice will be described with reference to the flowchart of FIG. 5. As shown in the flowchart of FIG. 5, when the user utters the wake word-compatible voice, the registered word detection unit 12 detects the wake word when the distance value between the voice waveform of the input voice and the registered voice pattern of the wake word falls below the threshold value TH (step SA1). The registered word detection unit 12 then notifies the frequency specifying unit 15 of information indicating that the wake word has been detected (step SA2).
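
Step SA1 amounts to a threshold test on a distance value. The description does not say how the distance is computed, so the sketch below assumes a plain Euclidean distance between two equal-length feature vectors; the function names and the sample values are purely illustrative.

```python
import math


def distance(features, pattern):
    """Euclidean distance between two equal-length feature vectors.

    The metric is an assumption; the description only speaks of a
    "distance value" between the input waveform and the voice pattern.
    """
    if len(features) != len(pattern):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(features, pattern)))


def wake_word_detected(features, pattern, threshold):
    """Step SA1: the wake word is detected when the distance falls below TH."""
    return distance(features, pattern) < threshold


pattern = [1.0, 2.0, 3.0]
close_hit = wake_word_detected([1.0, 2.1, 2.9], pattern, threshold=0.5)
far_miss = wake_word_detected([0.0, 0.0, 0.0], pattern, threshold=0.5)
```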

When the frequency specifying unit 15 receives the information indicating that the wake word has been detected from the registered word detection unit 12, it refers to the voice buffer 20, analyzes the input voice data for the period going back a predetermined time from the wake word detection timing, and specifies the target frequency (step SA3).
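
One way to picture step SA3 is to take the spectrum of the buffered samples and pick the strongest component. The sketch below uses a naive discrete Fourier transform in pure Python so that it stays self-contained; the analysis method itself is an assumption, since the description states only that the frequency with the highest sound pressure level is specified.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC DFT bin.

    Naive O(n^2) DFT, adequate for a short illustrative window only.
    """
    n = len(samples)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n


# Synthetic window: a strong 200 Hz tone plus a weaker 600 Hz tone.
rate = 8000
window = [math.sin(2 * math.pi * 200 * t / rate)
          + 0.3 * math.sin(2 * math.pi * 600 * t / rate)
          for t in range(200)]
target_hz = dominant_frequency(window, rate)
```

A production system would more likely use an FFT over short frames; the naive transform is chosen here only to avoid external dependencies.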

When the target frequency has been specified by the frequency specifying unit 15, the point detection unit 16 extracts, from the input voice data stored in the voice buffer 20, the transition of the sound pressure level of the target frequency in the first period and expands it into a predetermined storage area as first period transition data (step SA4). The point detection unit 16 then analyzes the first period transition data and detects, as the end point, the point in the first period at which the sound pressure level of the target frequency changes from a decreasing state to a stable state (step SA5).
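
The end point rule of step SA5 (decreasing, then stable) can be sketched over a sequence of per-frame levels of the target frequency. The `eps` and `hold` parameters, the frame-based representation, and the function name are assumptions introduced for illustration; the description does not give concrete criteria for "decreasing" or "stable".

```python
def find_end_point(levels, eps=1.0, hold=3):
    """Return the first index where the level has just been falling and then
    stays flat for `hold` consecutive steps, or None if no such index exists.

    levels: per-frame sound pressure levels of the target frequency.
    eps:    hypothetical bound on a frame-to-frame change that still counts
            as "stable"; a drop larger than eps counts as "decreasing".
    """
    for i in range(1, len(levels) - hold):
        was_falling = levels[i - 1] - levels[i] > eps
        stays_flat = all(abs(levels[j + 1] - levels[j]) <= eps
                         for j in range(i, i + hold))
        if was_falling and stays_flat:
            return i
    return None


first_period = [60, 61, 60, 50, 40, 30, 30.2, 29.9, 30.1, 30.0]
end_index = find_end_point(first_period)
```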

Next, the point detection unit 16 refers to the user frequency table stored in the user frequency storage unit 22 and determines whether user frequency-related information corresponding to a user-specific frequency whose value is equal or close to the value of the target frequency is registered (step SA6).

When user frequency-related information corresponding to a user-specific frequency whose value is equal or close to the value of the target frequency is registered (step SA6: YES), the point detection unit 16 reads the user-specific time length contained in that user frequency-related information and specifies it as the "detection time" (step SA7). After step SA7, the point detection unit 16 proceeds to step SA9. When no such user frequency-related information is registered (step SA6: NO), the point detection unit 16 specifies a predetermined time as the "detection time" (step SA8). After step SA8, the point detection unit 16 proceeds to step SA9.
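
Steps SA6 to SA8 form a lookup with a fallback. In the sketch below the user frequency table is modeled as a dictionary from user-specific frequency to user-specific time length; the tolerance that defines "close", the default value, and all names are hypothetical.

```python
DEFAULT_DETECTION_TIME_MS = 1000  # hypothetical predetermined time (SA8)


def detection_time_ms(target_hz, user_table, tolerance_hz=10.0,
                      default_ms=DEFAULT_DETECTION_TIME_MS):
    """Return the user-specific time length whose registered frequency is
    equal or closest to `target_hz` within `tolerance_hz` (SA6: YES, SA7),
    or the predetermined default when nothing matches (SA6: NO, SA8).
    """
    best = None  # (frequency difference, time length)
    for user_hz, user_ms in user_table.items():
        diff = abs(user_hz - target_hz)
        if diff <= tolerance_hz and (best is None or diff < best[0]):
            best = (diff, user_ms)
    return best[1] if best is not None else default_ms


user_table = {180.0: 900, 240.0: 700}
known_user = detection_time_ms(183.0, user_table)
unknown_user = detection_time_ms(300.0, user_table)
```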

In step SA9, the point detection unit 16 specifies the second period corresponding timing in the input voice data stored in the voice buffer 20. The point detection unit 16 then specifies the second period, extracts the transition of the sound pressure level of the target frequency in the second period, and expands it into a predetermined storage area as second period transition data (step SA10). Next, the point detection unit 16 analyzes the second period transition data and detects, as the start point, the point in the second period at which the sound pressure level of the target frequency changes from a stable state to a rising state (step SA11).
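
Steps SA9 to SA11 can be sketched as follows, under stated assumptions: the second period corresponding timing is taken as the end point minus the detection time, the second period is a window of `half_width` frames on either side of that timing, and "stable" and "rising" are judged with the hypothetical `eps` and `hold` parameters. None of these concrete values come from the description.

```python
def find_start_point(levels, end_index, detection_frames,
                     half_width=5, eps=1.0, hold=2):
    """Search a window around (end_index - detection_frames) for the point
    where the level has been flat for `hold` steps and then starts to rise.

    All arguments are expressed in frames of the level sequence.
    Returns the frame index of the start point, or None.
    """
    center = end_index - detection_frames  # second period corresponding timing
    lo = max(hold, center - half_width)
    hi = min(len(levels) - 1, center + half_width + 1)
    for i in range(lo, hi):
        was_flat = all(abs(levels[j] - levels[j - 1]) <= eps
                       for j in range(i - hold + 1, i + 1))
        now_rising = levels[i + 1] - levels[i] > eps
        if was_flat and now_rising:
            return i
    return None


levels = [30, 30.1, 29.9, 30, 40, 50, 60, 61, 60, 50, 40,
          30, 30.2, 29.9, 30.1, 30.0]
start_index = find_start_point(levels, end_index=11, detection_frames=8)
```

Restricting the search to a window around the expected timing is what keeps the second period short, in line with the efficiency argument made for the embodiment.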

When the start point and the end point have been detected by the point detection unit 16, the cooperation processing execution unit 17 monitors the input voice following the detected end point to determine whether, after the sound pressure of the input voice falls to or below a predetermined value, that state continues for a certain time or longer (step SA12). When it does (step SA12: YES), the cooperation processing execution unit 17 generates the recognition target voice data by extracting the data from the point 8000 samples before the start point detected by the point detection unit 16 up to the point corresponding to the end detection timing (step SA13). The cooperation processing execution unit 17 then generates the control information data (step SA14). Next, the cooperation processing execution unit 17 generates processing request data containing the recognition target voice data and the control information data, and transmits the generated processing request data to the service providing server 3 (step SA15). As described above, when there is a response to the processing request data, the cooperation processing execution unit 17 executes the processing corresponding to the response.
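
The monitoring in step SA12 can be pictured as counting consecutive quiet frames after the wake word end point. This is a simplified sketch: the frame-based levels, the parameter names, and the choice to report the frame at which the quiet run reaches the required length are assumptions, and a real device would process the input as a stream rather than a finished list.

```python
def find_request_end(levels, end_point, threshold, min_quiet_frames):
    """Return the frame index at which the level has stayed at or below
    `threshold` for `min_quiet_frames` consecutive frames, scanning from
    the wake word end point onward. Returns None if that never happens.
    """
    quiet = 0
    for i in range(end_point, len(levels)):
        quiet = quiet + 1 if levels[i] <= threshold else 0
        if quiet >= min_quiet_frames:
            return i  # corresponds to the end detection timing
    return None


# Short pause after the wake word, then the request, then silence.
after_wake_word = [30, 30, 60, 62, 58, 30, 30, 30, 30]
end_detect = find_request_end(after_wake_word, end_point=0,
                              threshold=35, min_quiet_frames=3)
```

Note that `min_quiet_frames` must exceed the natural pause between the wake word and the request, otherwise the pause itself would be taken as the end of the request.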

<Modification example>
Next, a modification of the above embodiment will be described with reference to FIG. 4. In the above embodiment, the point detection unit 16 specifies the first period and detects the end point based on the transition of the sound pressure level of the target frequency in the first period, and specifies the second period and detects the start point based on the transition of the sound pressure level of the target frequency in the second period. In this modification, by contrast, the point detection unit 16 specifies a period (hereinafter referred to as the "target period"), such as the period indicated by reference sign Y in FIG. 4, that includes the wake word detection timing and that contains both the end point and the start point regardless of how the wake word-compatible voice is uttered (except when it is uttered in an extremely abnormal manner). The target period is, for example, the period between a timing a predetermined time after the wake word detection timing and a timing that goes back a predetermined time from that timing.

The point detection unit 16 extracts, from the input voice data stored in the voice buffer 20, the transition of the sound pressure level of the target frequency in the target period (the waveform of the target frequency) and expands it, as target period transition data, into a predetermined storage area that functions as a working area. The point detection unit 16 then analyzes the target period transition data and detects, as the end point, the point in the target period at which, with the passage of time, the sound pressure level of the target frequency changes from a decreasing state to a stable state. The point detection unit 16 also detects, as the start point, the point in the target period at which, with the passage of time, the sound pressure level of the target frequency changes from a stable state to a rising state.

For example, the point detection unit 16 analyzes the transition (manner of change) of the sound pressure level of the target frequency while going back in time from the end of the target period. It detects, as the end point, the point at which the sound pressure level of the target frequency changes from a stable state to a rising state, and then detects, as the start point, a point that is earlier in time than the end point and at which the sound pressure level of the target frequency changes from a decreasing state to a stable state. These changes are described as they appear when the data is traced backward in time: a change from stable to rising in the backward direction is a change from decreasing to stable in the forward direction. The end point and the start point specified in this way are therefore the same as those specified by the method described above.

As another example, the point detection unit 16 analyzes the transition (manner of change) of the sound pressure level of the target frequency forward in time from the beginning of the target period. It detects, as the start point, the point at which the sound pressure level of the target frequency changes from a stable state to a rising state, and then detects, as the end point, a point that is later in time than the start point and at which the sound pressure level of the target frequency changes from a decreasing state to a stable state.
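
The forward scan of the modification can be sketched as a single pass over the target period that first looks for the stable-to-rising transition and then for the decreasing-to-stable transition. As in the other sketches, the per-frame levels and the `eps` and `hold` criteria are assumptions, not values from the description.

```python
def scan_target_period(levels, eps=1.0, hold=2):
    """Scan forward through the target period and return (start, end).

    start: first index that follows `hold` flat steps and precedes a rise.
    end:   first later index that follows a drop and precedes `hold` flat
           steps. Either value is None if the transition is not found.
    """
    start = end = None
    for i in range(hold, len(levels) - hold):
        flat_before = all(abs(levels[j] - levels[j - 1]) <= eps
                          for j in range(i - hold + 1, i + 1))
        flat_after = all(abs(levels[j + 1] - levels[j]) <= eps
                         for j in range(i, i + hold))
        if start is None:
            if flat_before and levels[i + 1] - levels[i] > eps:
                start = i
        elif levels[i - 1] - levels[i] > eps and flat_after:
            end = i
            break
    return start, end


target_period = [30, 30.1, 29.9, 30, 40, 50, 60, 61, 60, 50, 40,
                 30, 30.2, 29.9, 30.1, 30.0]
points = scan_target_period(target_period)
```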

In this modification as well, the analysis is performed on a period including the wake word detection timing, that is, a period in which, as a result of comparing the voice waveform of the input voice with the registered voice pattern, the wake word-compatible voice can be judged very likely to have been uttered. Because the start point and the end point are detected within that period, each point can be detected with high accuracy. Furthermore, the start point and the end point are detected by analyzing the transition of the sound pressure level in the target frequency band, which is the dominant frequency among those constituting the input voice and the one in which the characteristics of the user's voice are most likely to appear, so each point can be detected with high accuracy. In this modification, however, the range of the input voice data subject to detection is larger than in the above embodiment, so it is inferior in processing efficiency and processing time.

Although embodiments of the present invention (including the modification) have been described above, they merely show examples of how the present invention may be embodied, and the technical scope of the present invention must not be interpreted restrictively because of them. That is, the present invention can be implemented in various forms without departing from its gist or its main features.

For example, the in-vehicle device 2 may execute part or all of the processing executed by the service providing server 3 in the above embodiment. Conversely, the service providing server 3 (or an external device other than the service providing server 3) may execute part or all of the processing executed by the in-vehicle device 2.

In the above embodiment, the distance value is used as the index representing the degree of similarity between the voice waveform of the input voice and the registered voice pattern, but an index other than the distance value may of course be used.

The contents of the recognition target voice data and the control information data transmitted from the in-vehicle device 2 to the service providing server 3 are not limited to those exemplified in the above embodiment and should be changed according to the required functional requirements. As one example, when 1000 milliseconds of voice data must be included as the pre-roll, the recognition target voice data is configured accordingly. Likewise, when the start point and the end point must be specified not by sampling period number but by bit position, or by time stamp when time stamps are attached to the data, the control information data is configured accordingly.

In the above embodiment, when a user-specific time length corresponding to the target frequency is registered, the point detection unit 16 uses the user-specific time length as the detection time. Alternatively, the point detection unit 16 may use a fixed value as the detection time. In this case, the user-specific time length need not be registered in advance. The length of the second period is set appropriately so that it does not become unnecessarily long and so that the start point falls within the second period.

In the above embodiment, the in-vehicle device 2 executes triggerless voice recognition, in which the in-vehicle device 2 automatically detects a command word without the user giving a trigger. However, the recognition need not be triggerless; voice recognition may instead be triggered by a predetermined operation on a touch panel or an operating element, a gesture, or the like.

1 Voice analysis system
12 Registered word detection unit (specific word detection unit)
15 Frequency specifying unit
16 Point detection unit

Claims

A voice analysis system comprising:
a specific word detection unit that detects a specific word by comparing input voice with a pre-registered voice pattern of the specific word;
a frequency specifying unit that specifies, as a target frequency, the frequency corresponding to the frequency with the highest sound pressure level among the frequencies constituting the input voice input during a period continuous with the timing at which the specific word was detected by the specific word detection unit; and
a point detection unit that detects a start point and an end point of the specific word in the input voice based on the transition of the sound pressure level of the target frequency specified by the frequency specifying unit in a period before and after the timing at which the specific word was detected by the specific word detection unit.

The voice analysis system according to claim 1, wherein the point detection unit:
detects, as the end point, a point at which the sound pressure level of the target frequency changes in a predetermined manner in a first period including the timing at which the specific word was detected by the specific word detection unit; and
detects, as the start point, a point at which the sound pressure level of the target frequency changes in a predetermined manner in a second period, which is a period before the first period and includes a timing going back a detection time from the detected end point.

The voice analysis system according to claim 2, wherein, when the value of the target frequency is a pre-registered user-specific frequency, the point detection unit sets the detection time to a user-specific time length associated in advance with the user-specific frequency.

The voice analysis system according to claim 2 or 3, wherein the point detection unit:
detects, as the end point, a point in the first period at which the sound pressure level of the target frequency changes from a decreasing state to a stable state; and
detects, as the start point, a point in the second period at which the sound pressure level of the target frequency changes from a stable state to a rising state.

The voice analysis system according to claim 1, wherein the point detection unit
specifies a target period that includes the timing at which the specific word was detected by the specific word detection unit and that is expected to include the start point and the end point, and detects the start point and the end point based on the transition of the sound pressure level of the target frequency in the target period.

The voice analysis system according to claim 5, wherein the point detection unit
analyzes the transition of the sound pressure level of the target frequency while going back in time from the end of the target period, detects, as the end point, a point at which the sound pressure level of the target frequency changes from a stable state to a rising state, and detects, as the start point, a point that is earlier in time than the end point and at which the sound pressure level of the target frequency changes from a decreasing state to a stable state.

The voice analysis system according to claim 5, wherein the point detection unit
analyzes the transition of the sound pressure level of the target frequency forward in time from the beginning of the target period, detects, as the start point, a point at which the sound pressure level of the target frequency changes from a stable state to a rising state, and detects, as the end point, a point that is later in time than the start point and at which the sound pressure level of the target frequency changes from a decreasing state to a stable state.

A voice analysis method comprising:
a step in which a specific word detection unit of a voice analysis system detects a specific word by comparing input voice with a pre-registered voice pattern of the specific word;
a step in which a frequency specifying unit of the voice analysis system specifies, as a target frequency, the frequency corresponding to the frequency with the highest sound pressure level among the frequencies constituting the input voice input during a period continuous with the timing at which the specific word was detected by the specific word detection unit; and
a step in which a point detection unit of the voice analysis system detects a start point and an end point of the specific word in the input voice based on the transition of the sound pressure level of the target frequency specified by the frequency specifying unit in a period before and after the timing at which the specific word was detected by the specific word detection unit.