JP2016180778A - Information processing system and information processing method

Info

Publication number: JP2016180778A
Application number: JP2015059566A
Authority: JP
Inventors: 真一河野; Shinichi Kono; 祐平滝; Yuhei Taki
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2015-03-23
Filing date: 2015-03-23
Publication date: 2016-10-13

Abstract

PROBLEM TO BE SOLVED: To provide a technology that enables voice recognition processing to be started flexibly according to the situation. SOLUTION: There is provided an information processing system including an output control unit that causes an output unit to output a start condition for voice recognition processing performed by a voice recognition unit on sound information input from a sound collection unit, wherein the output control unit dynamically changes the start condition for the voice recognition processing that the output unit is caused to output. SELECTED DRAWING: Figure 3

Description

The present disclosure relates to an information processing system and an information processing method.

In recent years, techniques have become known in which voice recognition processing is performed on sound information collected by a microphone to obtain a recognition result. The result of the voice recognition processing is output in a form that the user can perceive. For example, voice recognition processing on sound information collected by a microphone can be started with the input of a start operation by the user as a trigger (see, for example, Patent Document 1).

Patent Document 1: JP 2004-094077 A

Here, if the condition under which voice recognition processing is started on sound information collected by a microphone never changes, it is difficult to start the voice recognition processing flexibly according to the situation. It is therefore desirable to provide a technology that enables voice recognition processing to be started flexibly according to the situation.

According to the present disclosure, there is provided an information processing system including an output control unit that causes an output unit to output a start condition for voice recognition processing performed by a voice recognition unit on sound information input from a sound collection unit, wherein the output control unit dynamically changes the start condition for the voice recognition processing that the output unit is caused to output.

According to the present disclosure, there is provided an information processing method including causing an output unit to output a start condition for voice recognition processing performed by a voice recognition unit on sound information input from a sound collection unit, and dynamically changing, by a processor, the start condition for the voice recognition processing that the output unit is caused to output.

As described above, the present disclosure provides a technology that enables voice recognition processing to be started flexibly according to the situation. Note that the above effect is not necessarily limiting; together with or in place of the above effect, any of the effects described in this specification, or other effects that can be understood from this specification, may be achieved.

A diagram for explaining voice recognition processing in a general system.
A diagram illustrating a configuration example of the information processing system according to an embodiment of the present disclosure.
A block diagram illustrating a functional configuration example of the information processing system according to the embodiment of the present disclosure.
A diagram illustrating an example of the screen transition from the display of the initial screen to the detection of an activation trigger for voice recognition processing.
A diagram illustrating an example of the screen transition from when the remaining time until voice recognition processing starts is output as the start condition until voice recognition processing starts.
A diagram illustrating an example of the screen transition from when information on the user operation required to start voice recognition processing is output as the start condition until voice recognition processing starts.
A diagram for explaining an example of dynamically changing the start condition based on sound information input from the sound collection unit after the activation trigger for voice recognition processing is detected.
A diagram illustrating an example of causing the output unit to output display information as the start condition.
A diagram illustrating an example of causing the output unit to output display information as the start condition.
A diagram illustrating an example of causing the output unit to output voice information as the start condition.
A diagram illustrating an example of causing the output unit to output voice information as the start condition.
A flowchart illustrating an example of the flow of an operation that dynamically changes the start condition output by the output unit based on sound information input from the sound collection unit after the activation trigger for voice recognition processing is detected.
A flowchart illustrating an example of the flow of an operation that dynamically changes the start condition output by the output unit based on sound information input from the sound collection unit after the activation trigger for voice recognition processing is detected.
A diagram for explaining an example of dynamically shortening the remaining time until voice recognition processing starts, based on past sound information collected during the predetermined time between a past detection of the activation trigger and the start of voice recognition processing.
A diagram for explaining an example of dynamically shortening the remaining time until voice recognition processing starts, based on past sound information collected during the predetermined time between a past detection of the activation trigger and the start of voice recognition processing.
A diagram for explaining an example of dynamically lengthening the remaining time until voice recognition processing starts, based on past sound information collected during the predetermined time between a past detection of the activation trigger and the start of voice recognition processing.
A diagram for explaining an example of dynamically lengthening the remaining time until voice recognition processing starts, based on past sound information collected during the predetermined time between a past detection of the activation trigger and the start of voice recognition processing.
A diagram illustrating an example of display information when the remaining time until voice recognition processing starts has become shorter.
A diagram illustrating an example of display information when the remaining time until voice recognition processing starts has become longer.
A flowchart illustrating an example of the flow of an operation that dynamically changes the start condition output by the output unit, based on past sound information collected during the predetermined time between a past detection of the activation trigger and the start of voice recognition processing.
A flowchart illustrating an example of the flow of an operation that dynamically changes the start condition output by the output unit, based on past sound information collected during the predetermined time between a past detection of the activation trigger and the start of voice recognition processing.
A diagram illustrating an example of the correspondence between fillers and their voice waveforms.
A diagram for explaining an example of varying the operation depending on whether the sound information input from the sound collection unit contains a filler.
A diagram illustrating Modification 1 of the configuration of the information processing system.
A diagram illustrating Modification 2 of the configuration of the information processing system.
A diagram illustrating Modification 2 of the configuration of the information processing system.
A diagram illustrating Modification 2 of the configuration of the information processing system.
A diagram illustrating Modification 2 of the configuration of the information processing system.
A diagram illustrating Modification 3 of the configuration of the information processing system.
A diagram illustrating Modification 3 of the configuration of the information processing system.
A diagram illustrating Modification 3 of the configuration of the information processing system.
A diagram illustrating Modification 3 of the configuration of the information processing system.
A block diagram illustrating a hardware configuration example of the information processing system.

Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description of them is omitted.

In this specification and the drawings, a plurality of components having substantially the same functional configuration may also be distinguished by appending different numerals to the same reference numeral. However, when there is no particular need to distinguish between such components, only the same reference numeral is used.

The description will be given in the following order.
0. Background
1. Embodiment of the present disclosure
1.1. System configuration example
1.2. Functional configuration example
1.3. Functional details of the information processing system
1.4. Modifications of the system configuration
1.5. Hardware configuration example
2. Conclusion

<0. Background>
First, the background of the embodiment of the present disclosure will be described with reference to the drawings. FIG. 1 is a diagram for explaining voice recognition processing in a general system. In the following description, voice (or speech) and sound are distinguished from each other. In addition, "utterance" refers to a state in which the user is producing voice, and "silence" refers to a state in which sound information is being collected at a volume lower than a threshold.

As shown in FIG. 1, when the user inputs an operation of selecting the voice recognition start operation object G14 for starting voice recognition processing, a general system (hereinafter also simply referred to as the "system") detects that operation as an activation trigger for voice recognition processing and displays a sound collection start screen G91 (time T91). When the sound collection start screen G91 is displayed, the user starts speaking (time T92), and the system performs voice recognition processing on the collected sound information while collecting sound with a microphone (S91).

When the utterance section Ha ends (time T93), a silent state begins. The system then detects a section Ma (hereinafter also referred to as a "silent section") in which the time for which the volume of the sound information collected by the microphone has stayed continuously below a reference volume has reached a predetermined target time (time T94). On detecting it, the system executes a predetermined execution operation based on the result of the voice recognition processing performed on the sound information collected in the utterance section Ha (S92).
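The silent-section rule described above can be illustrated with a minimal sketch. It assumes that the sound information has already been reduced to one volume value per fixed-length frame. The function name `detect_silent_section`, the per-frame representation, and the parameter `target_frames` (the target time expressed as a number of frames) are illustrative assumptions and do not appear in the original text.

```python
from typing import Optional, Sequence


def detect_silent_section(
    volumes: Sequence[float],
    reference_volume: float,
    target_frames: int,
) -> Optional[int]:
    """Return the index of the frame at which a silent section is detected.

    A silent section is detected once the volume has stayed below
    reference_volume for target_frames consecutive frames.
    Returns None if no such run occurs.
    """
    run = 0  # number of consecutive frames below the reference volume
    for i, volume in enumerate(volumes):
        run = run + 1 if volume < reference_volume else 0
        if run >= target_frames:
            return i
    return None


# Hypothetical per-frame volumes: speech, a short pause, more speech, then silence.
volumes = [0.8, 0.9, 0.1, 0.7, 0.1, 0.05, 0.02, 0.6]
print(detect_silent_section(volumes, reference_volume=0.2, target_frames=3))  # 6
```

Resetting the counter whenever the volume rises back to the reference volume or above is what makes the duration "continuous": the short pause at index 2 does not count toward the later run.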

Here, the execution operation based on the result of the voice recognition processing is not particularly limited. For example, it may include any one of the following: an operation of outputting a search result corresponding to the character string obtained as the result of the voice recognition processing, an operation of outputting that character string, an operation of outputting processing result candidates obtained in the course of the voice recognition processing, and an operation of outputting a character string for replying to the utterance content extracted from the recognized character string.

Here, the method of extracting the utterance content from the character string obtained as the result of the voice recognition processing is not limited. For example, the utterance content may be extracted by applying natural language processing (for example, language analysis or semantic analysis) to that character string.

While the execution operation is being processed, the system displays a screen G92 indicating that the execution operation is in progress. When the execution operation ends (time T95), the system displays a screen G93 showing the result of the execution operation. In the example shown in FIG. 1, the screen G93 contains "襟" (collar), "競り" (auction), and "蹴り" (kick) as search results corresponding to the character string obtained as the result of the voice recognition processing.

As described above, in a general system, voice recognition processing is started without any start condition for the voice recognition processing having been output. Consequently, if, for example, a user thinks about what to say only after performing the operation of selecting the voice recognition start operation object G14, the sound information collected before the utterance starts also becomes a target of the voice recognition processing and may affect it.

For example, the sound information collected before the utterance starts may contain fillers or superfluous utterances produced by the user. A filler is a word that the user inserts between utterances, such as "ええと" (um), "あの" (uh), or "まあ" (well). The sound information collected before the utterance starts may also contain noise. Note that, as described above, noise may mean the sound information that remains after the voice uttered by the user is removed from the sound information input from the sound collection unit 120.

In addition, the result of voice recognition processing on sound information collected before the utterance starts may affect the voice recognition processing on sound information collected after the utterance starts. Furthermore, if voice recognition processing is performed on sound information collected before the utterance starts and a silent section is detected before the utterance starts, the execution operation based on the result of the voice recognition processing may begin before the user starts speaking.

This specification therefore proposes a technique for outputting the start condition for voice recognition processing before the voice recognition processing is started. However, if the same start condition were output regardless of the situation, it would still be difficult to start voice recognition processing flexibly according to the situation. This specification therefore also proposes a technique that enables voice recognition processing to be started flexibly according to the situation.

The background of the embodiment of the present disclosure has been described above.

<1. Embodiment of the present disclosure>
[1.1. System configuration example]
Next, a configuration example of the information processing system 10 according to the embodiment of the present disclosure will be described with reference to the drawings. FIG. 2 is a diagram illustrating a configuration example of the information processing system 10 according to the embodiment of the present disclosure. As illustrated in FIG. 2, the information processing system 10 includes an image input unit 110, an operation input unit 115, a sound collection unit 120, and an output unit 130. The information processing system 10 can perform voice recognition processing on voice uttered by a user U (hereinafter also simply referred to as the "user").

The image input unit 110 has a function of inputting images. In the example shown in FIG. 2, the image input unit 110 includes two cameras embedded in the table Tbl. However, the number of cameras included in the image input unit 110 is not particularly limited as long as it is one or more. The position of each of the one or more cameras included in the image input unit 110 is likewise not particularly limited. The one or more cameras may include a monocular camera or a stereo camera.

The operation input unit 115 has a function of inputting operations by the user U. In the example shown in FIG. 2, the operation input unit 115 includes one camera suspended from the ceiling above the table Tbl. However, the position of the camera included in the operation input unit 115 is not particularly limited. The camera may be a monocular camera or a stereo camera. Moreover, the operation input unit 115 need not be a camera as long as it has a function of inputting operations by the user U; for example, it may be a touch panel or a hardware button.

The output unit 130 has a function of displaying a screen on the table Tbl. In the example shown in FIG. 2, the output unit 130 is suspended from the ceiling above the table Tbl. However, the position of the output unit 130 is not particularly limited. Typically, the output unit 130 may be a projector capable of projecting a screen onto the top surface of the table Tbl, but it may be another type of display as long as it has a function of displaying a screen.

This specification mainly describes the case where the top surface of the table Tbl serves as the display surface of the screen, but the display surface may be something other than the top surface of the table Tbl. For example, the display surface may be a wall, a building, a floor, the ground, or a ceiling. Alternatively, the display surface may be a non-planar surface such as the folds of a curtain, or a surface in some other location. When the output unit 130 has its own display surface, that surface may serve as the display surface of the screen.

The sound collection unit 120 has a function of collecting sound. In the example shown in FIG. 2, the sound collection unit 120 includes a total of six microphones: three above the table Tbl and three on the upper surface of the table Tbl. However, the number of microphones included in the sound collection unit 120 is not particularly limited as long as it is one or more. The position of each of the one or more microphones included in the sound collection unit 120 is likewise not particularly limited.

However, if the sound collection unit 120 includes a plurality of microphones, the direction of arrival of a sound can be estimated based on the sound information collected by each of the microphones. Likewise, if the sound collection unit 120 includes a directional microphone, the direction of arrival of a sound can be estimated based on the sound information collected by the directional microphone.
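As one possible illustration of how a direction of arrival might be estimated from two microphones, the sketch below uses the time difference of arrival under a far-field assumption. The source does not specify any estimation method: the cross-correlation approach, the function names `best_lag` and `arrival_angle_deg`, and the sample rate, microphone spacing, and speed of sound used here are all assumptions.

```python
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at about 20 degrees C


def best_lag(a: Sequence[float], b: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) by which signal a trails signal b.

    The lag is the shift in [-max_lag, max_lag] that maximises the
    cross-correlation of the two signals.
    """
    def score(lag: int) -> float:
        return sum(
            a[n + lag] * b[n] for n in range(len(b)) if 0 <= n + lag < len(a)
        )

    return max(range(-max_lag, max_lag + 1), key=score)


def arrival_angle_deg(lag: int, sample_rate: float, mic_spacing_m: float) -> float:
    """Convert a sample lag into an angle from broadside (far-field assumption)."""
    ratio = SPEED_OF_SOUND * (lag / sample_rate) / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp so that asin stays defined
    return math.degrees(math.asin(ratio))


# Hypothetical impulse that reaches microphone B two samples before microphone A.
mic_b = [0, 0, 1, 0, 0, 0, 0, 0]
mic_a = [0, 0, 0, 0, 1, 0, 0, 0]
lag = best_lag(mic_a, mic_b, max_lag=3)
print(lag)  # 2
print(arrival_angle_deg(lag, sample_rate=16000, mic_spacing_m=0.1))
```

With more than two microphones, the same pairwise lag could be computed for several pairs and combined, but how the system described here actually does this is not stated in the text.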

The configuration example of the information processing system 10 according to the embodiment of the present disclosure has been described above.

[1.2. Functional configuration example]
Next, a functional configuration example of the information processing system 10 according to the embodiment of the present disclosure will be described. FIG. 3 is a block diagram illustrating a functional configuration example of the information processing system 10 according to the embodiment of the present disclosure. As illustrated in FIG. 3, the information processing system 10 includes an image input unit 110, an operation input unit 115, a sound collection unit 120, an output unit 130, and an information processing device 140 (hereinafter also referred to as the "control unit 140").

The information processing device 140 controls each unit of the information processing system 10. For example, the information processing device 140 generates the information to be output from the output unit 130. In addition, for example, the information processing device 140 reflects the information input by each of the image input unit 110, the operation input unit 115, and the sound collection unit 120 in the information output from the output unit 130. As illustrated in FIG. 3, the information processing device 140 includes an input image acquisition unit 141, a sound information acquisition unit 142, an operation detection unit 143, a recognition control unit 144, a voice recognition unit 145, and an output control unit 146. Details of these functional blocks will be described later.

Note that the information processing device 140 may be configured by, for example, a CPU (Central Processing Unit). When the information processing device 140 is configured by a processing device such as a CPU, that processing device may be configured by an electronic circuit.

The functional configuration example of the information processing system 10 according to the embodiment of the present disclosure has been described above.

[1.3. Functional details of the information processing system]
Next, the functional details of the information processing system 10 according to the embodiment of the present disclosure will be described. FIG. 4 is a diagram illustrating an example of the screen transition from the display of the initial screen to the detection of the activation trigger for voice recognition processing. Referring to FIG. 4, the output control unit 146 displays an initial screen G10-1. The initial screen G10-1 includes a voice recognition start operation object G14 for starting voice recognition processing and a recognized character string display field G11, which is a field for displaying the character string obtained by voice recognition processing (hereinafter also referred to as the "recognized character string").

The initial screen G10-1 also includes a delete-all operation object G12 for deleting the entire recognized character string and a confirm operation object G13 for confirming the recognized character string. In addition, the initial screen G10-1 includes a forward movement operation object G15 for moving the cursor position in the recognized character string back toward the beginning, a backward movement operation object G16 for moving the cursor position in the recognized character string toward the end, and a delete operation object G17 for deleting the character or word at the cursor position.

First, as shown in screen G10-2, when the user's operation of selecting the voice recognition start operation object G14 is input through the operation input unit 115, the operation detection unit 143 detects that operation as the activation trigger for voice recognition processing (time T10). When the activation trigger is detected, the output control unit 146 causes the start condition for voice recognition processing to be output. Although the operation of selecting the voice recognition start operation object G14 is used here as an example of the activation trigger, the activation trigger for voice recognition processing is not limited to this example.

For example, the activation trigger for voice recognition processing may be an operation of pressing a hardware button for activating voice recognition processing. In this case, voice recognition processing may be active for the period from when the hardware button is pressed until it is released (push-to-talk type). Alternatively, the activation trigger may be the execution of an activation command for voice recognition processing (for example, the utterance "voice").

Alternatively, the activation trigger for the voice recognition process may be a predetermined activation gesture (for example, raising a hand, swinging a hand down, or a face movement such as nodding or tilting the face to the left or right). The activation trigger for the voice recognition process may also include the acquisition, from the sound collection unit 120, of sound information whose speech likelihood exceeds a threshold.

First, an example in which the remaining time until the voice recognition process starts is output as the start condition will be described.

FIG. 5 is a diagram illustrating an example of the screen transition from when the remaining time until the voice recognition process starts is output as the start condition until the voice recognition process is started. When the activation trigger for the voice recognition process is detected, the output control unit 146 starts the output of the remaining time notification screen G21-1 (time T11). The remaining time notification screen G21-1 includes the remaining time G23-1 until the voice recognition process starts and a cancel object G22 for stopping the output of the start condition.

Subsequently, the output control unit 146 decreases the remaining time G23-1 as time elapses. For example, the output control unit 146 causes a remaining time notification screen G21-2 to be output, which includes the remaining time G23-2 obtained by decrementing the remaining time G23-1. When the remaining time until the voice recognition process starts reaches zero and the start condition is thereby satisfied (time T12), the output control unit 146 stops the output of the start condition (time T13). Once the output of the start condition has stopped, the user starts speaking toward the sound collection unit 120 (time T14).

When the sound information collected by the sound collection unit 120 is acquired by the sound information acquisition unit 142, the output control unit 146 displays a predetermined object Mu (hereinafter also referred to as the "display object"). The display object Mu may be stationary or may move. For example, when the display object Mu moves, its movement direction De may be determined according to the arrival direction of the user's uttered voice from the sound source to the sound collection unit 120. The method of estimating the arrival direction of the user's uttered voice is not particularly limited.

For example, the recognition control unit 144 may estimate, as the arrival direction of the user's uttered voice, one arrival direction that matches or is similar to the finger direction (for example, the direction from the base of the finger to the fingertip) of the user who performed the operation of selecting the voice recognition start operation object G14. The range of similarity may be determined in advance. The finger direction may be obtained by analyzing an input image.

Alternatively, the recognition control unit 144 may estimate the arrival direction of the sound input through the sound collection unit 120 as the arrival direction of the user's uttered voice. When there are a plurality of sound arrival directions, the arrival direction of the sound that was input first may be estimated as the arrival direction of the user's uttered voice, or the one arrival direction that matches or is similar to the finger direction of the user who performed the operation of selecting the voice recognition start operation object G14 may be estimated as the arrival direction of the user's uttered voice.

Alternatively, the recognition control unit 144 may estimate, as the arrival direction of the user's uttered voice, the arrival direction of the sound that was input through the sound collection unit 120 at the highest volume among the plurality of arrival directions. The arrival direction of the user's uttered voice can be estimated in this manner. Meanwhile, the recognition control unit 144 may treat sound input through the sound collection unit 120 from directions other than the arrival direction of the user's uttered voice as noise. Accordingly, the noise may also include sound output from the information processing system 10.
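The three estimation strategies described above (first-arriving sound, loudest sound, and closest match to the finger direction) can be sketched as follows. This is only an illustrative sketch: the `Arrival` record, the function names, the angle representation in degrees, and the 30-degree similarity range are assumptions introduced here and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Arrival:
    """One detected sound arrival (hypothetical record, not from the disclosure)."""
    angle_deg: float   # arrival direction at the sound collection unit
    volume: float      # input volume
    onset_time: float  # time at which the sound was first input


def angular_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def estimate_user_direction(arrivals: List[Arrival],
                            strategy: str = "loudest",
                            finger_angle_deg: Optional[float] = None,
                            similarity_range_deg: float = 30.0) -> Optional[Arrival]:
    """Pick the arrival treated as the user's uttered voice."""
    if not arrivals:
        return None
    if strategy == "first":
        return min(arrivals, key=lambda a: a.onset_time)
    if strategy == "loudest":
        return max(arrivals, key=lambda a: a.volume)
    if strategy == "finger":
        if finger_angle_deg is None:
            raise ValueError("finger_angle_deg is required for the 'finger' strategy")
        best = min(arrivals, key=lambda a: angular_diff(a.angle_deg, finger_angle_deg))
        # Accept the candidate only if it lies within the predetermined similarity range.
        if angular_diff(best.angle_deg, finger_angle_deg) <= similarity_range_deg:
            return best
        return None
    raise ValueError(f"unknown strategy: {strategy}")


def noise_arrivals(arrivals: List[Arrival], user: Optional[Arrival]) -> List[Arrival]:
    """Everything arriving from a direction other than the user's is treated as noise."""
    return [a for a in arrivals if a is not user]
```

Under these assumptions, sound output by the system itself would simply appear as another `Arrival` and would be classified as noise unless it happened to be selected by the chosen strategy.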

FIG. 5 also shows an example in which the output control unit 146 moves the display object Mu in the arrival direction of the user's uttered voice (movement direction De). This allows the user to grasp intuitively that his or her uttered voice is being collected by the sound collection unit 120. However, the movement of the display object Mu is not limited to such movement. FIG. 5 also shows an example in which the destination of the display object Mu is the voice recognition start operation object G14. However, the destination of the display object Mu is not limited to this example.

FIG. 5 also shows an example in which the output control unit 146 moves circular display objects Mu that appear one after another in response to sound collection by the sound collection unit 120, but the display mode of the display object Mu is not limited to this example. For example, the output control unit 146 may control various parameters of the display object Mu based on predetermined information corresponding to the sound information (for example, the speech likelihood or the volume of the sound information). The sound information used here is preferably sound information from the arrival direction of the user's uttered voice. The parameters of the display object Mu may include at least one of the shape, transparency, color, size, and movement of the display object Mu.
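As one possible illustration of such parameter control, the sketch below maps volume to the size of the display object and speech likelihood to its opacity. The linear mapping, the value ranges, and all names (`DisplayObjectParams`, `display_params`) are assumptions made for this example; the disclosure does not specify any particular mapping.

```python
from dataclasses import dataclass


@dataclass
class DisplayObjectParams:
    """Hypothetical subset of the parameters of the display object Mu."""
    size_px: float
    opacity: float  # 0.0 = fully transparent, 1.0 = fully opaque


def clamp01(x: float) -> float:
    """Clamp a value to the range [0, 1]."""
    return max(0.0, min(1.0, x))


def display_params(volume: float, speech_likelihood: float,
                   min_size: float = 8.0, max_size: float = 40.0) -> DisplayObjectParams:
    """Louder sound gives a larger object; more speech-like sound gives a more opaque one.

    Both inputs are assumed to be normalized to [0, 1].
    """
    v = clamp01(volume)
    s = clamp01(speech_likelihood)
    return DisplayObjectParams(
        size_px=min_size + (max_size - min_size) * v,
        opacity=0.2 + 0.8 * s,
    )
```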

The method of evaluating speech likelihood from sound information is not particularly limited. For example, the method described in a patent document (JP 2010-38943 A) may be adopted as the method of evaluating speech likelihood from sound information. As another example, the method described in a patent document (JP 2007-328228 A) may also be adopted. Although an example in which the output control unit 146 evaluates the speech likelihood is described here, the speech likelihood may instead be evaluated by a server (not shown).

When the start condition is satisfied, the recognition control unit 144 causes the voice recognition unit 145 to start the voice recognition process on the sound information acquired by the sound information acquisition unit 142. The timing at which the voice recognition process is started is not limited. For example, the recognition control unit 144 may cause the voice recognition unit 145 to start the voice recognition process after sound information whose speech likelihood exceeds a predetermined threshold has been collected, or may cause the voice recognition unit 145 to start the voice recognition process on the sound information corresponding to a display object Mu after that display object Mu reaches the voice recognition start operation object G14.

If the user wishes to cancel the start of the voice recognition process, the user may select the cancel object G22. When the user selects the cancel object G22, that operation is input through the operation input unit 115 as an output stop operation, and the output stop operation is detected by the operation detection unit 143. When the operation detection unit 143 detects the output stop operation, the output control unit 146 stops the output of the start condition.

The example in which the remaining time until the voice recognition process starts is output as the start condition has been described above. Next, an example in which information on the user operation required to start the voice recognition process is output as the start condition will be described.

FIG. 6 is a diagram illustrating an example of the screen transition from when information on the user operation required to start the voice recognition process is output as the start condition until the voice recognition process is started. When the activation trigger for the voice recognition process is detected, the output control unit 146 starts the output of the utterance start confirmation screen G24-1 (time T11). The utterance start confirmation screen G24-1 includes a voice recognition process start object G25-1, which serves as the information on the user operation required to start the voice recognition process, and the cancel object G22.

Subsequently, when the user performs an operation of selecting the voice recognition process start object G25-1 (utterance start confirmation screen G24-2), that operation is input through the operation input unit 115 and detected by the operation detection unit 143. When the operation of selecting the voice recognition process start object G25-1 is detected and the start condition is thereby satisfied (time T12), the output control unit 146 stops the output of the start condition (time T13). Once the output of the start condition has stopped, the user starts speaking toward the sound collection unit 120 (time T14). The subsequent operations can be executed in the same manner as in the example, already described, in which the remaining time until the voice recognition process starts is output as the start condition.

The example in which information on the user operation required to start the voice recognition process is output as the start condition has been described above. Because the start condition of the voice recognition process is output, the user can, as also shown in FIG. 5 and FIG. 6, reduce the influence that sound information collected before the start of the utterance (for example, fillers and unnecessary utterances) has on the voice recognition process.

The start condition can be output in this way, but if the start condition never changes, it is difficult to start the voice recognition process flexibly according to the situation. Therefore, in the embodiment of the present disclosure, the output control unit 146 dynamically changes the start condition of the voice recognition process that it causes the output unit 130 to output. This configuration makes it possible to start the voice recognition process flexibly according to the situation. For example, the output control unit 146 may dynamically change the start condition of the voice recognition process to be output by the output unit 130 based on predetermined information.

The predetermined information is not particularly limited. First, an example will be described in which the start condition to be output by the output unit 130 is dynamically changed based on sound information input from the sound collection unit 120 after the activation trigger for the voice recognition process is detected. FIG. 7 is a diagram for explaining an example in which the start condition is dynamically changed based on sound information input from the sound collection unit 120 after the activation trigger for the voice recognition process is detected.

As shown in FIG. 7, when the user's operation of selecting the voice recognition start operation object G14 is input through the operation input unit 115, the operation detection unit 143 detects that operation as an activation trigger for the voice recognition process (time T10). When the activation trigger for the voice recognition process is detected, the output control unit 146 dynamically changes the start condition to be output by the output unit 130 based on a first type of sound information included in the sound information input from the sound collection unit 120.

Here, the first type of sound information is not particularly limited. For example, the first type of sound information may include at least noise, because noise may interfere with the voice recognition process performed on the user's utterance. The description below continues with the case in which the first type of sound information is noise.

First, when the volume of the noise (hereinafter also referred to as the "noise level") exceeds a first threshold value n1, the success rate of the voice recognition process on the user's utterance is relatively low, so it is considered desirable to have the user input the start timing of the voice recognition process. Therefore, when the noise level exceeds the first threshold value n1, the output control unit 146 preferably changes the start condition to information on the user operation required to start the voice recognition process.

More specifically, when the noise level exceeds the first threshold value n1, the output control unit 146 preferably causes the utterance start confirmation screen G24-1 to be output. As in the example described above, the utterance start confirmation screen G24-1 includes the voice recognition process start object G25-1, which serves as the information on the user operation required to start the voice recognition process, and the cancel object G22.

Subsequently, when the user performs an operation of selecting the voice recognition process start object G25-1, that operation is input through the operation input unit 115 and detected by the operation detection unit 143. When the operation of selecting the voice recognition process start object G25-1 is detected and the start condition is thereby satisfied (time T12), the output control unit 146 stops the output of the start condition (time T13). The subsequent operations are as already described.

Second, when the noise level is equal to or lower than the first threshold value n1 and equal to or higher than a second threshold value n2 (which is smaller than the first threshold value n1), the success rate of the voice recognition process on the user's utterance is moderate, so it is considered desirable to start the voice recognition process automatically after a predetermined time has elapsed. Therefore, when the noise level is equal to or lower than the first threshold value n1 and equal to or higher than the second threshold value n2, the output control unit 146 preferably changes the start condition to the remaining time until the voice recognition process starts.

As in the example described above, the remaining time notification screen G21-1 includes the remaining time G23-1 until the voice recognition process starts and the cancel object G22 for stopping the output of the start condition. When the remaining time until the voice recognition process starts reaches zero and the start condition is thereby satisfied (time T12), the output control unit 146 stops the output of the start condition (time T13). The subsequent operations are as already described.

Third, when the noise level is lower than the second threshold value n2, the success rate of the voice recognition process on the user's utterance is relatively high, so it is desirable for the voice recognition process to be started without outputting a start condition. Therefore, when the noise level is lower than the second threshold value n2, the output control unit 146 preferably omits causing the output unit 130 to output the start condition.

In the above description, the case in which the noise level is equal to the first threshold value n1 is handled in the same way as the case in which the noise level is equal to or lower than the first threshold value n1 and equal to or higher than the second threshold value n2, but it may instead be handled in the same way as the case in which the noise level exceeds the first threshold value n1. Likewise, the case in which the noise level is equal to the second threshold value n2 is handled above in the same way as the case in which the noise level is equal to or lower than the first threshold value n1 and equal to or higher than the second threshold value n2, but it may instead be handled in the same way as the case in which the noise level is lower than the second threshold value n2.
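The three cases above, together with the alternative handling of the boundary values, can be summarized in a short sketch. The enum, the function name, and the two boolean flags are assumptions introduced for illustration; only the thresholds n1 and n2 and the three outcomes come from the description.

```python
from enum import Enum


class StartCondition(Enum):
    """Start condition that the output unit is caused to output (names are illustrative)."""
    USER_OPERATION = "utterance start confirmation screen"  # noise level high
    REMAINING_TIME = "remaining time notification screen"   # noise level moderate
    OMITTED = "no start condition output"                   # noise level low


def choose_start_condition(noise_level: float, n1: float, n2: float,
                           n1_equal_as_high: bool = False,
                           n2_equal_as_low: bool = False) -> StartCondition:
    """Select the start condition from the noise level and thresholds n1 > n2.

    By default a noise level equal to n1 or n2 falls in the middle band, as in the
    description; the flags select the alternative boundary handling it also allows.
    """
    if n2 >= n1:
        raise ValueError("n2 must be smaller than n1")
    is_high = noise_level >= n1 if n1_equal_as_high else noise_level > n1
    if is_high:
        return StartCondition.USER_OPERATION
    is_low = noise_level <= n2 if n2_equal_as_low else noise_level < n2
    if is_low:
        return StartCondition.OMITTED
    return StartCondition.REMAINING_TIME
```

For example, with hypothetical thresholds n1 = 0.6 and n2 = 0.2, a noise level of 0.6 selects the remaining time by default and the user operation when `n1_equal_as_high=True`.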

The output control unit 146 may cause the output unit 130 to output predetermined display information as the start condition. FIG. 8 and FIG. 9 are diagrams illustrating examples in which display information is output by the output unit 130 as the start condition. FIG. 8 shows an example in which display content is made to appear gradually on the voice recognition start operation object G14 (time T31 to time T36). FIG. 9 shows an example in which the color of the voice recognition start operation object G14 is changed gradually (time T41 to time T46).

The output control unit 146 may also cause the output unit 130 to output predetermined audio information as the start condition. FIG. 10 and FIG. 11 are diagrams illustrating examples in which audio information is output by the output unit 130 as the start condition. FIG. 10 shows an example in which audio information announcing the start timing of the voice recognition process (time T54) is output from time T51 to time T54. FIG. 11 shows an example in which audio information announcing the start timing of the voice recognition process (time T64) is output from time T61 to time T64.

Next, with reference to FIG. 12 and FIG. 13, the flow of the operation of dynamically changing the start condition to be output by the output unit 130, based on sound information input from the sound collection unit 120 after the activation trigger for the voice recognition process is detected, will be described. The flowcharts of FIG. 12 and FIG. 13 are merely examples of this flow of operation, and the flow of operation is therefore not limited to the examples shown in the flowcharts of FIG. 12 and FIG. 13.

First, as shown in FIG. 12, the operation detection unit 143 detects the activation trigger for the voice recognition process (S11), and sound information v1 is input from the sound collection unit 120 (S12). Subsequently, the output control unit 146 dynamically determines the start condition of the voice recognition process based on noise (S13). The details of the operation of dynamically determining the start condition based on noise are described here with reference to FIG. 13. First, the output control unit 146 acquires the sound information v1 (S131). If the noise level of the sound information v1 exceeds the threshold value n1 ("Yes" in S132), the output control unit 146 decides to output a modal UI (in the example above, the utterance start confirmation screen G24-1) (S133).

On the other hand, if the noise level of the sound information v1 does not exceed the threshold value n1 ("No" in S132), the output control unit 146 proceeds to S134. If the noise level of the sound information v1 is lower than the threshold value n2 ("Yes" in S134), the output control unit 146 decides not to output a start condition (S135). If the noise level of the sound information v1 is not lower than the threshold value n2 ("No" in S134), the output control unit 146 decides to output a timer UI (the remaining time notification screen G21-1) (S136).

Returning to FIG. 12, the description continues. If the output control unit 146 has decided to omit the output of the start condition ("Yes" in S14), it shifts the operation to S18. On the other hand, if the output control unit 146 has decided not to omit the output of the start condition ("No" in S14), it causes the start condition to be output (S15). Thereafter, the operation detection unit 143 detects an output stop trigger for the start condition (S16). The output stop trigger for the start condition may include the start condition being satisfied and the operation of selecting the cancel object G22 for stopping the output of the start condition.

Subsequently, the output control unit 146 stops the output of the start condition. If the start condition is not satisfied ("No" in S17), the voice recognition unit 145 ends the operation without starting the voice recognition process (S19). If the start condition is satisfied ("Yes" in S17), the voice recognition unit 145 starts the voice recognition process (S18).
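The flow of steps S11 to S19, including the decision of S131 to S136, can be traced with the following sketch. It is a simplified simulation under stated assumptions: the function name, the string labels, and the `stop_trigger` argument (standing in for whichever output stop trigger is detected in S16) are hypothetical, and the actual units of the system are not modeled.

```python
from typing import List, Tuple


def run_trigger_flow(noise_level: float, n1: float, n2: float,
                     stop_trigger: str) -> Tuple[bool, List[str]]:
    """Trace the flow of FIG. 12 and FIG. 13 and report whether recognition starts.

    stop_trigger is "satisfied" (start condition met) or "cancelled" (cancel object
    selected); it is ignored when the output of the start condition is omitted.
    """
    if stop_trigger not in ("satisfied", "cancelled"):
        raise ValueError("stop_trigger must be 'satisfied' or 'cancelled'")

    trace = ["S11", "S12", "S13"]      # trigger detected, sound input, condition decided
    if noise_level > n1:               # S132 "Yes" -> S133
        ui = "modal"
    elif noise_level < n2:             # S134 "Yes" -> S135
        ui = None
    else:                              # S134 "No" -> S136
        ui = "timer"

    if ui is None:                     # S14 "Yes": output omitted, go straight to S18
        trace += ["S14:yes", "S18"]
        return True, trace

    trace += ["S14:no", f"S15:{ui}", "S16"]  # output start condition, detect stop trigger
    if stop_trigger == "satisfied":    # S17 "Yes": start the voice recognition process
        trace += ["S17:yes", "S18"]
        return True, trace
    trace += ["S17:no", "S19"]         # S17 "No": end without starting
    return False, trace
```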

The example in which the start condition to be output by the output unit 130 is dynamically changed based on sound information input from the sound collection unit 120 after the activation trigger for the voice recognition process is detected has been described above.

Next, an example will be described in which the start condition to be output by the output unit 130 is dynamically changed based on past sound information, that is, sound information collected during a predetermined period between a past detection of the activation trigger and the start of the voice recognition process. FIG. 14 and FIG. 15 are diagrams for explaining an example in which the remaining time until the voice recognition process starts is dynamically shortened based on such past sound information.

As shown in the upper part of FIG. 14, in the first voice recognition process, when the user's operation of selecting the voice recognition start operation object G14 is input through the operation input unit 115, the operation detection unit 143 detects that operation as an activation trigger for the voice recognition process (time T10). When the activation trigger for the voice recognition process is detected, the output control unit 146 starts accumulating the sound information input from the sound collection unit 120 and starts the output of the remaining time notification screen G21-1 (time T11). As described above, the remaining time notification screen G21-1 includes the remaining time G23-1 until the voice recognition process starts and the cancel object G22 for stopping the output of the start condition.

Subsequently, the output control unit 146 decreases the remaining time G23-1 as time elapses. For example, the output control unit 146 causes a remaining time notification screen G21-2 to be output, which includes the remaining time G23-2 obtained by decrementing the remaining time G23-1. When the remaining time until the voice recognition process starts reaches zero and the start condition is thereby satisfied (time T12), the output control unit 146 stops the output of the start condition (time T13).

When the output of the start condition is stopped, the output control unit 146 ends the accumulation of the sound information input from the sound collection unit 120. The sound information accumulated in this way is used as past sound information in the next voice recognition process. The user then starts speaking toward the sound collection unit 120 (time T14). The subsequent operations can be executed in the same manner as in the example, already described, in which the remaining time until the voice recognition process starts is output as the start condition.

Subsequently, as shown in the lower part of FIG. 14, in the second voice recognition process, when the user's operation of selecting the voice recognition start operation object G14 is input through the operation input unit 115, the operation detection unit 143 detects that operation as an activation trigger for the voice recognition process (time T10). When the activation trigger for the voice recognition process is detected, the output control unit 146 acquires the accumulated past sound information, starts accumulating the sound information input from the sound collection unit 120, and starts the output of the remaining time notification screen G21-1 (time T11).

このとき、出力制御部１４６は、過去の音情報に含まれる第２の種類の音情報に基づいて、出力部１３０に出力させる開始条件を動的に変更する。ここで、第２の種類の音情報は特に限定されない。例えば、第２の種類の音情報は、少なくとも雑音を含んでよい。雑音は、ユーザの発話に対する音声認識処理の妨げになる可能性があるからである。ここでは、第２の種類の音情報が雑音である場合を例として説明を続ける。 At this time, the output control unit 146 dynamically changes the start condition to be output to the output unit 130 based on the second type of sound information included in the past sound information. Here, the second type of sound information is not particularly limited. For example, the second type of sound information may include at least noise. This is because the noise may interfere with the voice recognition process for the user's utterance. Here, the description will be continued by taking as an example the case where the second type of sound information is noise.

ここで、図１４の上段にも示すように、初回の音声認識処理時においては、音声認識処理の起動トリガが検出されてから開始条件の出力が停止されるまで、雑音レベルが閾値より小さかった場合を想定する。かかる場合、二回目の音声認識処理時において取得される初回の音声認識処理時における雑音レベルは閾値より小さいこととなる。かかる場合には、出力制御部１４６は、開始条件として出力させる音声認識処理が開始されるまでの残り時間を、初回の音声認識処理時よりも短くする。 Here, as shown in the upper part of FIG. 14, at the time of the first speech recognition processing, the noise level was lower than the threshold until the start condition output was stopped after the start trigger of the speech recognition processing was detected. Assume a case. In such a case, the noise level at the time of the first voice recognition process acquired at the time of the second voice recognition process is smaller than the threshold value. In such a case, the output control unit 146 shortens the remaining time until the voice recognition process to be output as the start condition is started compared to the time of the first voice recognition process.

より具体的には、図１４を参照すると、出力制御部１４６は、音声認識処理が開始されるまでの残り時間Ｇ２３−１を、初回の音声認識処理時においては「３」秒としているのに対し、二回目の音声認識処理時においては「１」秒と短くしている。なお、図１４に示した例では、音声認識処理が開始されるまでの残り時間Ｇ２３−１が二回目の音声認識処理時に直ちに短くなっているが、雑音レベルが閾値より小さい状態が複数回続いて初めて、音声認識処理が開始されるまでの残り時間Ｇ２３−１が短くなってもよい。 More specifically, referring to FIG. 14, the output control unit 146 sets the remaining time G23-1 until the speech recognition process is started to “3” seconds in the first speech recognition process, and shortens it to “1” second in the second speech recognition process. In the example shown in FIG. 14, the remaining time G23-1 is shortened immediately at the second speech recognition process. Alternatively, the remaining time G23-1 may be shortened only after the noise level has stayed below the threshold several times in a row.

続いて、図１５に示すように、三回目の音声認識処理時において、ユーザによる音声認識開始操作オブジェクトＧ１４を選択する操作が操作入力部１１５によって入力されると、その操作が音声認識処理の起動トリガとして操作検出部１４３によって検出される（時刻Ｔ１０）。出力制御部１４６は、音声認識処理の起動トリガが検出されると、蓄積されている二回目の音声認識処理時に集音部１２０から入力された音情報を取得し、集音部１２０から入力された音情報の蓄積を開始する（時刻Ｔ１１）。 Next, as shown in FIG. 15, when an operation of selecting the speech recognition start operation object G14 is input by the user through the operation input unit 115 during the third speech recognition process, the operation detection unit 143 detects that operation as an activation trigger for the speech recognition process (time T10). When the activation trigger is detected, the output control unit 146 acquires the accumulated sound information that was input from the sound collection unit 120 during the second speech recognition process, and starts accumulating the sound information input from the sound collection unit 120 (time T11).

ここで、図１４の下段にも示すように、二回目の音声認識処理時においては、音声認識処理の起動トリガが検出されてから開始条件の出力が停止されるまで、雑音レベルが閾値より小さかった場合を想定する。かかる場合、三回目の音声認識処理時において取得される二回目の音声認識処理時における雑音レベルは閾値より小さいこととなる。かかる場合には、出力制御部１４６は、開始条件として出力させる音声認識処理が開始されるまでの残り時間を、二回目の音声認識処理時よりも短くする。 Here, as also shown in the lower part of FIG. 14, assume that in the second speech recognition process the noise level was lower than the threshold from when the activation trigger of the speech recognition process was detected until the output of the start condition was stopped. In that case, the noise level of the second speech recognition process, as acquired during the third speech recognition process, is lower than the threshold. The output control unit 146 therefore makes the remaining time until the speech recognition process is started, which is output as the start condition, shorter than in the second speech recognition process.

より具体的には、図１５を参照すると、出力制御部１４６は、音声認識処理が開始されるまでの残り時間Ｇ２３−１を、二回目の音声認識処理時においては「１」秒としているのに対し、三回目の音声認識処理時においては、残り時間通知画面Ｇ２１−１の出力を省略している。なお、図１５に示した例では、残り時間通知画面Ｇ２１−１の出力が三回目の音声認識処理時に直ちに省略されているが、雑音レベルが閾値より小さい状態が複数回続いて初めて、残り時間通知画面Ｇ２１−１の出力が省略されてもよい。 More specifically, referring to FIG. 15, the output control unit 146 sets the remaining time G23-1 until the speech recognition process is started to “1” second in the second speech recognition process, whereas in the third speech recognition process it omits the output of the remaining time notification screen G21-1. In the example shown in FIG. 15, the output of the remaining time notification screen G21-1 is omitted immediately at the third speech recognition process. Alternatively, the output of the remaining time notification screen G21-1 may be omitted only after the noise level has stayed below the threshold several times in a row.

続いて、音声認識処理が開始されるまでの残り時間を動的に長くする例を説明する。図１６および図１７は、過去に起動トリガが検出されてから音声認識処理が開始されるまでの所定の時間に集音された過去の音情報に基づいて、音声認識処理が開始されるまでの残り時間を動的に長くする例を説明するための図である。 Next, an example in which the remaining time until the speech recognition process is started is dynamically lengthened will be described. FIGS. 16 and 17 are diagrams for explaining an example in which the remaining time until the speech recognition process is started is dynamically lengthened based on past sound information collected during a predetermined period from when an activation trigger was detected in the past until the speech recognition process was started.

図１６の上段に示すように、初回の音声認識処理時において、ユーザによる音声認識開始操作オブジェクトＧ１４を選択する操作が操作入力部１１５によって入力されると、その操作が音声認識処理の起動トリガとして操作検出部１４３によって検出される（時刻Ｔ１０）。出力制御部１４６は、音声認識処理の起動トリガが検出されると、集音部１２０から入力された音情報の蓄積を開始し、残り時間通知画面Ｇ２１−１の出力を開始させる（時刻Ｔ１１）。以降の動作は、既に説明した音声認識処理が開始されるまでの残り時間が開始条件として出力される例と同様に実行され得る。 As shown in the upper part of FIG. 16, when an operation of selecting the speech recognition start operation object G14 is input by the user through the operation input unit 115 during the first speech recognition process, the operation detection unit 143 detects that operation as an activation trigger for the speech recognition process (time T10). When the activation trigger is detected, the output control unit 146 starts accumulating the sound information input from the sound collection unit 120 and starts outputting the remaining time notification screen G21-1 (time T11). Subsequent operations can be executed in the same manner as in the already described example in which the remaining time until the speech recognition process is started is output as the start condition.

続いて、図１６の下段に示すように、二回目の音声認識処理時において、ユーザによる音声認識開始操作オブジェクトＧ１４を選択する操作が操作入力部１１５によって入力されると、その操作が音声認識処理の起動トリガとして操作検出部１４３によって検出される（時刻Ｔ１０）。出力制御部１４６は、音声認識処理の起動トリガが検出されると、蓄積されている過去の音情報を取得し、集音部１２０から入力された音情報の蓄積を開始し、残り時間通知画面Ｇ２１−１の出力を開始させる（時刻Ｔ１１）。 Subsequently, as shown in the lower part of FIG. 16, when an operation of selecting the speech recognition start operation object G14 is input by the user through the operation input unit 115 during the second speech recognition process, the operation detection unit 143 detects that operation as an activation trigger for the speech recognition process (time T10). When the activation trigger is detected, the output control unit 146 acquires the accumulated past sound information, starts accumulating the sound information input from the sound collection unit 120, and starts outputting the remaining time notification screen G21-1 (time T11).

ここで、図１６の上段にも示すように、初回の音声認識処理時においては、音声認識処理の起動トリガが検出されてから開始条件の出力が停止されるまで、雑音レベルが閾値より大きかった場合を想定する。かかる場合、二回目の音声認識処理時において取得される初回の音声認識処理時における雑音レベルは閾値より大きいこととなる。かかる場合には、出力制御部１４６は、開始条件として出力させる音声認識処理が開始されるまでの残り時間を、初回の音声認識処理時よりも長くする。 Here, as shown in the upper part of FIG. 16, at the time of the first speech recognition processing, the noise level was larger than the threshold until the start condition output was stopped after the start trigger of the speech recognition processing was detected. Assume a case. In such a case, the noise level at the time of the first voice recognition process acquired at the time of the second voice recognition process is larger than the threshold value. In such a case, the output control unit 146 makes the remaining time until the voice recognition process to be output as the start condition is started longer than that in the first voice recognition process.

より具体的には、図１６を参照すると、出力制御部１４６は、音声認識処理が開始されるまでの残り時間Ｇ２３−１を、初回の音声認識処理時においては「３」秒としているのに対し、二回目の音声認識処理時においては「５」秒と長くしている。なお、図１６に示した例では、音声認識処理が開始されるまでの残り時間Ｇ２３−１が二回目の音声認識処理時に直ちに長くなっているが、雑音レベルが閾値より大きい状態が複数回続いて初めて、音声認識処理が開始されるまでの残り時間Ｇ２３−１が長くなってもよい。 More specifically, referring to FIG. 16, the output control unit 146 sets the remaining time G23-1 until the speech recognition process is started to “3” seconds in the first speech recognition process, and lengthens it to “5” seconds in the second speech recognition process. In the example shown in FIG. 16, the remaining time G23-1 is lengthened immediately at the second speech recognition process. Alternatively, the remaining time G23-1 may be lengthened only after the noise level has stayed above the threshold several times in a row.

続いて、図１７に示すように、三回目の音声認識処理時において、ユーザによる音声認識開始操作オブジェクトＧ１４を選択する操作が操作入力部１１５によって入力されると、その操作が音声認識処理の起動トリガとして操作検出部１４３によって検出される（時刻Ｔ１０）。出力制御部１４６は、音声認識処理の起動トリガが検出されると、蓄積されている二回目の音声認識処理時に集音部１２０から入力された音情報を取得し、集音部１２０から入力された音情報の蓄積を開始する（時刻Ｔ１１）。 Subsequently, as shown in FIG. 17, when an operation of selecting the speech recognition start operation object G14 is input by the user through the operation input unit 115 during the third speech recognition process, the operation detection unit 143 detects that operation as an activation trigger for the speech recognition process (time T10). When the activation trigger is detected, the output control unit 146 acquires the accumulated sound information that was input from the sound collection unit 120 during the second speech recognition process, and starts accumulating the sound information input from the sound collection unit 120 (time T11).

ここで、図１６の下段にも示すように、二回目の音声認識処理時においては、音声認識処理の起動トリガが検出されてから開始条件の出力が停止されるまで、雑音レベルが閾値より大きかった場合を想定する。かかる場合、三回目の音声認識処理時において取得される二回目の音声認識処理時における雑音レベルは閾値より大きいこととなる。かかる場合には、出力制御部１４６は、開始条件として出力させる音声認識処理が開始されるまでの残り時間を、二回目の音声認識処理時よりも長くする。 Here, as also shown in the lower part of FIG. 16, assume that in the second speech recognition process the noise level was higher than the threshold from when the activation trigger of the speech recognition process was detected until the output of the start condition was stopped. In that case, the noise level of the second speech recognition process, as acquired during the third speech recognition process, is higher than the threshold. The output control unit 146 therefore makes the remaining time until the speech recognition process is started, which is output as the start condition, longer than in the second speech recognition process.

より具体的には、図１７を参照すると、出力制御部１４６は、音声認識処理が開始されるまでの残り時間Ｇ２３−１を、二回目の音声認識処理時においては「５」秒としているのに対し、三回目の音声認識処理時においては、発話開始確認画面Ｇ２４−２を出力させている。なお、図１７に示した例では、三回目の音声認識処理時に直ちに発話開始確認画面Ｇ２４−２を出力させているが、雑音レベルが閾値より大きい状態が複数回続いて初めて、発話開始確認画面Ｇ２４−２が出力されてもよい。 More specifically, referring to FIG. 17, the output control unit 146 sets the remaining time G23-1 until the speech recognition process is started to “5” seconds in the second speech recognition process, whereas in the third speech recognition process it outputs the utterance start confirmation screen G24-2. In the example shown in FIG. 17, the utterance start confirmation screen G24-2 is output immediately at the third speech recognition process. Alternatively, the utterance start confirmation screen G24-2 may be output only after the noise level has stayed above the threshold several times in a row.
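The variation noted for FIGS. 14 to 17, in which the start condition changes only after the noise level has stayed on the same side of the threshold several times in a row, can be sketched as a small streak counter. This is an illustrative sketch only: the class name `StreakTracker`, the `required_streak` parameter, and the returned labels are assumptions made for this example and do not appear in the disclosure, which also does not state how a noise level exactly equal to the threshold is handled.

```python
class StreakTracker:
    """Hypothetical helper: act only after N consecutive sessions agree."""

    def __init__(self, required_streak=2):
        self.required_streak = required_streak  # assumed tuning parameter
        self._last_state = None
        self._streak = 0

    def update(self, noise_level, threshold):
        """Record one past session and return "lengthen", "shorten" or "keep"."""
        # Assumption: a level equal to the threshold is treated as quiet.
        state = "noisy" if noise_level > threshold else "quiet"
        if state == self._last_state:
            self._streak += 1
        else:
            self._last_state = state
            self._streak = 1
        if self._streak < self.required_streak:
            return "keep"
        self._streak = 0  # require a fresh streak before the next change
        return "lengthen" if state == "noisy" else "shorten"
```

With `required_streak=1` the sketch reduces to the immediate behaviour drawn in the figures, where a single noisy or quiet session already changes the remaining time.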

ここで、上記したように、音声認識処理が開始されるまでの残り時間は変化し得る。このとき、音声認識処理が開始されるまでの残り時間の変化とともに出力部１３０に出力される表示情報も変化させるのがよい。そうすれば、ユーザにとっても、開始条件が変更されたことを容易に把握することが可能となる。 Here, as described above, the remaining time until the speech recognition process is started may change. In that case, it is preferable to change the display information output by the output unit 130 along with the change in the remaining time until the speech recognition process is started. This allows the user to easily grasp that the start condition has been changed.

図１８は、音声認識処理が開始されるまでの残り時間が短くなった場合における表示情報の例を示す図である。図１８に示すように、音声認識開始操作オブジェクトＧ１４に表示内容を徐々に出現させる例において、出力制御部１４６は、表示内容の出現速度を高くするようにしてもよい（時刻Ｔ３１〜時刻Ｔ３３）。また、図１９は、音声認識処理が開始されるまでの残り時間が長くなった場合における表示情報の例を示す図である。図１９に示すように、音声認識開始操作オブジェクトＧ１４に表示内容を徐々に出現させる例において、出力制御部１４６は、表示内容の出現速度を低くするようにしてもよい（時刻Ｔ３１〜時刻Ｔ３８）。 FIG. 18 is a diagram illustrating an example of display information when the remaining time until the speech recognition process is started has been shortened. As shown in FIG. 18, in the example in which the display content gradually appears on the speech recognition start operation object G14, the output control unit 146 may increase the appearance speed of the display content (time T31 to time T33). FIG. 19 is a diagram illustrating an example of display information when the remaining time until the speech recognition process is started has been lengthened. As shown in FIG. 19, in the example in which the display content gradually appears on the speech recognition start operation object G14, the output control unit 146 may reduce the appearance speed of the display content (time T31 to time T38).
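One way to read FIGS. 18 and 19 is that the display content is revealed in proportion to the elapsed share of the remaining time, so a shorter remaining time yields a faster appearance and a longer one a slower appearance. The sketch below assumes a linear reveal; the function name `appearance_fraction` and the linear mapping are assumptions for illustration, since the disclosure only says that the appearance speed may be raised or lowered.

```python
def appearance_fraction(elapsed_seconds, remaining_time_seconds):
    """Return how much of the display content is visible, from 0.0 to 1.0.

    Assumes a linear reveal that completes exactly when the remaining
    time until the speech recognition process starts has run out.
    """
    if remaining_time_seconds <= 0:
        return 1.0  # nothing to wait for, show the content at once
    fraction = elapsed_seconds / remaining_time_seconds
    return min(1.0, max(0.0, fraction))
```

Under this assumption, half a second after the trigger the content is half visible for a 1-second remaining time but only one tenth visible for a 5-second remaining time.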

続いて、図２０および図２１を参照しながら、過去に起動トリガが検出されてから音声認識処理が開始されるまでの所定の時間に集音された過去の音情報に基づいて、出力部１３０に出力させる開始条件を動的に変更する動作の流れについて説明する。なお、図２０および図２１のフローチャートは、過去に起動トリガが検出されてから音声認識処理が開始されるまでの所定の時間に集音された過去の音情報に基づいて、出力部１３０に出力させる開始条件を動的に変更する動作の流れの例に過ぎないため、かかる動作の流れは、図２０および図２１のフローチャートに示された例に限定されない。 Next, with reference to FIGS. 20 and 21, the flow of an operation for dynamically changing the start condition to be output by the output unit 130, based on past sound information collected during a predetermined period from when an activation trigger was detected in the past until the speech recognition process was started, will be described. Note that the flowcharts of FIGS. 20 and 21 are merely an example of such an operation flow, so the operation flow is not limited to the example shown in the flowcharts of FIGS. 20 and 21.

まず、図２０に示すように、操作検出部１４３は、音声認識処理の起動トリガを検出する（Ｓ２１）。また、出力制御部１４６は、過去の音情報ｈ１があれば過去の音情報ｈ１を取得し（Ｓ２２）、集音部１２０から音情報ｖ１の取得を開始する（Ｓ２３）。続いて、出力制御部１４６は、音声認識処理の開始条件を過去の音情報ｈ１に応じて動的に決定する（Ｓ２４）。ここで、図２１を参照しながら、音声認識処理の開始条件を過去の音情報ｈ１に基づいて動的に決定する動作の詳細を説明する。 First, as illustrated in FIG. 20, the operation detection unit 143 detects an activation trigger for the speech recognition process (S21). Further, if there is past sound information h1, the output control unit 146 acquires the past sound information h1 (S22), and starts acquiring the sound information v1 from the sound collection unit 120 (S23). Subsequently, the output control unit 146 dynamically determines the start condition of the speech recognition process according to the past sound information h1 (S24). Here, the details of the operation for dynamically determining the start condition of the speech recognition process based on the past sound information h1 will be described with reference to FIG. 21.

まず、出力制御部１４６は、過去の音情報ｈ１を取得し（Ｓ２４１）、タイムアウト値ｔ１（上記した例では、音声認識処理が開始されるまでの残り時間Ｇ２３−１）を取得する（Ｓ２４２）。続いて、出力制御部１４６は、過去の音情報ｈ１の音量が閾値ｍ１を上回っている場合には（Ｓ２４３において「Ｙｅｓ」）、Ｓ２４４に動作を移行させる。一方、出力制御部１４６は、過去の音情報ｈ１の音量が閾値ｍ１を上回っていない場合には（Ｓ２４３において「Ｎｏ」）、Ｓ２４８に動作を移行させる。 First, the output control unit 146 acquires the past sound information h1 (S241), and acquires the timeout value t1 (in the above example, the remaining time G23-1 until the speech recognition process is started) (S242). Subsequently, when the volume of the past sound information h1 exceeds the threshold m1 (“Yes” in S243), the output control unit 146 shifts the operation to S244. On the other hand, when the volume of the past sound information h1 does not exceed the threshold m1 (“No” in S243), the output control unit 146 shifts the operation to S248.

Ｓ２４４に動作が移行された場合、出力制御部１４６は、タイムアウト値ｔ１が閾値ｔ＿ｍａｘを上回っている場合には（Ｓ２４４において「Ｙｅｓ」）、モーダルＵＩ（上記した例では、発話開始確認画面Ｇ２４−１）を出力させることを決定し（Ｓ２４５）、タイムアウト値ｔ１が閾値ｔ＿ｍａｘを上回っていない場合には（Ｓ２４４において「Ｎｏ」）、タイムアウト値ｔ１を増加させ（Ｓ２４６）、タイムアウト値ｔ１が設定されたタイマＵＩ（上記では、残り時間通知画面Ｇ２１−１）を出力させることを決定する（Ｓ２４７）。 When the operation has shifted to S244, the output control unit 146 decides to output a modal UI (in the above example, the utterance start confirmation screen G24-1) if the timeout value t1 exceeds the threshold t_max (“Yes” in S244) (S245). If the timeout value t1 does not exceed the threshold t_max (“No” in S244), the output control unit 146 increases the timeout value t1 (S246) and decides to output a timer UI in which the timeout value t1 is set (in the above, the remaining time notification screen G21-1) (S247).

一方、Ｓ２４８に動作が移行された場合、出力制御部１４６は、タイムアウト値ｔ１が閾値ｔ＿ｍｉｎを下回っている場合には（Ｓ２４８において「Ｙｅｓ」）、開始条件を出力させないことを決定し（Ｓ２５１）、タイムアウト値ｔ１が閾値ｔ＿ｍｉｎを下回っていない場合には（Ｓ２４８において「Ｎｏ」）、タイムアウト値ｔ１を減少させ（Ｓ２４９）、タイムアウト値ｔ１が設定されたタイマＵＩ（上記では、残り時間通知画面Ｇ２１−１）を出力させることを決定する（Ｓ２４７）。 On the other hand, when the operation is shifted to S248, the output control unit 146 determines not to output the start condition when the timeout value t1 is lower than the threshold value t_min (“Yes” in S248) (S251). If the timeout value t1 is not lower than the threshold value t_min (“No” in S248), the timeout value t1 is decreased (S249), and the timer UI in which the timeout value t1 is set (in the above, remaining time notification screen G21) -1) is determined to be output (S247).
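The branch structure of steps S243 to S251 can be summarized in a few lines. The following is a minimal sketch under stated assumptions, not the implementation of the disclosure: the names `StartCondition` and `decide_start_condition`, the string labels, and the fixed `step` by which t1 is increased or decreased are all hypothetical, since the flowchart only says that t1 is increased or decreased without giving an amount.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StartCondition:
    kind: str                 # "modal", "timer", or "none" (hypothetical labels)
    timeout: Optional[float]  # seconds shown by the timer UI, if any


def decide_start_condition(past_volume: float, t1: float, m1: float,
                           t_max: float, t_min: float,
                           step: float = 1.0) -> Tuple[StartCondition, float]:
    """Return the start condition to output and the updated timeout t1."""
    if past_volume > m1:                               # S243: Yes
        if t1 > t_max:                                 # S244: Yes
            return StartCondition("modal", None), t1   # S245
        t1 += step                                     # S246 (step is assumed)
        return StartCondition("timer", t1), t1         # S247
    if t1 < t_min:                                     # S248: Yes
        return StartCondition("none", None), t1        # S251
    t1 -= step                                         # S249 (step is assumed)
    return StartCondition("timer", t1), t1             # S247
```

With the hypothetical settings t_min = 2, t_max = 4 and step = 2, a 3-second timeout becomes 1 second after one quiet session and the output is omitted after the next, while after noisy sessions it becomes 5 seconds and then the modal UI is chosen, which matches the values drawn in FIGS. 14 to 17.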

図２０に戻って説明を続ける。出力制御部１４６は、開始条件の出力を省略すると決定した場合には（Ｓ２５において「Ｙｅｓ」）、Ｓ３０に動作を移行させる。一方、出力制御部１４６は、開始条件の出力を省略しないと決定した場合には（Ｓ２５において「Ｎｏ」）、開始条件を出力させる（Ｓ２６）。その後、操作検出部１４３は、開始条件の出力停止トリガを検出する（Ｓ２７）。開始条件の出力停止トリガには、開始条件が満たされたことと開始条件の出力を停止させるための取り消しオブジェクトＧ２２を選択する操作とが含まれ得る。 Returning to FIG. 20, the description will be continued. If the output control unit 146 determines that the output of the start condition is omitted (“Yes” in S25), the output control unit 146 shifts the operation to S30. On the other hand, when it is determined that the output of the start condition is not omitted (“No” in S25), the output control unit 146 outputs the start condition (S26). Thereafter, the operation detection unit 143 detects an output stop trigger of the start condition (S27). The output stop trigger of the start condition may include an operation of selecting the cancel object G22 for stopping the output of the start condition and the start condition being satisfied.

続いて、出力制御部１４６は、開始条件の出力を停止させ、継続的に取得した音情報ｖ１を過去の音情報ｈ１として保存する（Ｓ２８）。そして、音声認識部１４５は、開始条件が満たされていない場合には（Ｓ２９において「Ｎｏ」）、音声認識処理を開始させずに（Ｓ３１）、動作を終了させる。一方、音声認識部１４５は、開始条件が満たされた場合には（Ｓ２９において「Ｙｅｓ」）、音声認識処理を開始させる（Ｓ３０）。 Subsequently, the output control unit 146 stops outputting the start condition, and stores the continuously acquired sound information v1 as past sound information h1 (S28). Then, when the start condition is not satisfied (“No” in S29), the voice recognition unit 145 ends the operation without starting the voice recognition process (S31). On the other hand, when the start condition is satisfied (“Yes” in S29), the voice recognition unit 145 starts the voice recognition process (S30).

以上、過去に起動トリガが検出されてから音声認識処理が開始されるまでの所定の時間に集音された過去の音情報に基づいて、出力部１３０に出力させる開始条件を動的に変更する例を説明した。 As described above, the start condition to be output to the output unit 130 is dynamically changed based on the past sound information collected in a predetermined time after the activation trigger is detected in the past until the voice recognition process is started. An example was explained.

以上においては、集音部１２０から入力される音情報にフィラーが含まれているか否かに依らずに動作する例を説明したが、集音部１２０から入力される音情報にフィラーが含まれているか否かによって動作を異ならせてもよい。まず、フィラーの例について説明する。図２２は、フィラーとその音声波形との対応関係の例を示す図である。図２２に示すように、フィラーとその音声波形とが対応付けられてなる情報があらかじめ記憶されている。この音声波形が集音部１２０から入力される音情報に含まれる場合、この音声波形に対応するフィラーが音声認識処理の結果として取得される。 In the above, an example has been described in which the sound information input from the sound collection unit 120 operates regardless of whether or not a filler is included, but the sound information input from the sound collection unit 120 includes a filler. The operation may be different depending on whether or not it is. First, an example of the filler will be described. FIG. 22 is a diagram illustrating an example of a correspondence relationship between a filler and its speech waveform. As shown in FIG. 22, information in which fillers and their speech waveforms are associated with each other is stored in advance. When this voice waveform is included in the sound information input from the sound collection unit 120, a filler corresponding to this voice waveform is acquired as a result of the voice recognition process.

図２３は、集音部１２０から入力される音情報にフィラーが含まれているか否かによって動作を異ならせる例を説明するための図である。出力制御部１４６は、音声認識処理の起動トリガが検出されると、残り時間通知画面Ｇ２１−１の出力を開始させる（時刻Ｔ１１）。このとき、認識制御部１４４は、集音部１２０から入力される音情報の蓄積を開始する。 FIG. 23 is a diagram for explaining an example in which the operation is varied depending on whether or not the sound information input from the sound collection unit 120 includes a filler. When the activation trigger for the speech recognition process is detected, the output control unit 146 starts outputting the remaining time notification screen G21-1 (time T11). At this time, the recognition control unit 144 starts accumulating sound information input from the sound collection unit 120.

続いて、認識制御部１４４は、開始条件が満たされると（時刻Ｔ１２）、開始条件が満たされるまでに蓄積された音情報Ｐ１の認識結果がフィラーであるか否かを判断し、認識結果がフィラーである場合、現時点までに蓄積された音情報Ｐ２から開始条件が満たされるまでに蓄積されたＰ１を除外して音声認識処理を音声認識部１４５に行わせる。一方、認識制御部１４４は、認識結果がフィラーではない場合、現時点までに蓄積された音情報Ｐ２から開始条件が満たされるまでに蓄積されたＰ１を除外せずに音声認識処理を音声認識部１４５に行わせる。また、出力制御部１４６は、開始条件が満たされると（時刻Ｔ１２）、開始条件の出力を停止させる（時刻Ｔ１３）。以降の動作は、既に説明した通りである。 Subsequently, when the start condition is satisfied (time T12), the recognition control unit 144 determines whether the recognition result of the sound information P1 accumulated until the start condition is satisfied is a filler. In the case of the filler, the speech recognition unit 145 performs speech recognition processing by excluding P1 accumulated until the start condition is satisfied from the sound information P2 accumulated up to the present time. On the other hand, when the recognition result is not a filler, the recognition control unit 144 performs the voice recognition processing without excluding P1 accumulated until the start condition is satisfied from the sound information P2 accumulated up to the present time. To do. Further, when the start condition is satisfied (time T12), the output control unit 146 stops outputting the start condition (time T13). The subsequent operation is as described above.
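The filler handling at time T12 amounts to dropping the leading segment P1 from the accumulated sound information P2 when P1 is recognized as a filler. The sketch below is illustrative only: the function name, the `recognize` callable standing in for the waveform lookup of FIG. 22, and the example filler set are assumptions, and the audio is modelled as a plain list of samples.

```python
from typing import Callable, List, Sequence, Set

# Hypothetical filler vocabulary; the disclosure stores fillers with their
# speech waveforms (FIG. 22) but this excerpt does not list them.
FILLERS: Set[str] = {"um", "uh", "er"}


def select_audio_for_recognition(p2: Sequence[float], p1_length: int,
                                 recognize: Callable[[Sequence[float]], str],
                                 fillers: Set[str] = FILLERS) -> List[float]:
    """Return the samples to pass to speech recognition.

    p2 holds everything accumulated so far; its first p1_length samples are
    P1, the part accumulated before the start condition was satisfied.
    """
    p1 = p2[:p1_length]
    if recognize(p1) in fillers:
        return list(p2[p1_length:])  # exclude P1 when it is only a filler
    return list(p2)                  # otherwise keep P1 as part of the input
```

Passing the recognizer in as a parameter keeps the sketch independent of any particular speech recognition engine.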

以上、出力部１３０に出力させる開始条件を動的に変更する例を説明したが、開始条件を動的に変更する例は、上記した例に限定されない。例えば、出力制御部１４６は、音声らしさが閾値を超える音情報の集音部１２０への到来方向の数に基づいて、出力部１３０に出力させる開始条件を動的に変更してもよい。音声らしさが閾値を超える音情報の集音部１２０への到来方向の数が所定値を超える場合には、音声認識処理が困難になる可能性があるからである。 The example in which the start condition to be output by the output unit 130 is dynamically changed has been described above, but such examples are not limited to those described. For example, the output control unit 146 may dynamically change the start condition to be output by the output unit 130 based on the number of directions from which sound information whose speech likeness exceeds a threshold arrives at the sound collection unit 120. This is because the speech recognition process may become difficult when the number of such arrival directions exceeds a predetermined value.

そこで、例えば、出力制御部１４６は、音声らしさが閾値を超える音情報の集音部１２０への到来方向の数が所定値を超える場合には、発話開始確認画面Ｇ２４−１を出力させてもよい。また、出力制御部１４６は、音声らしさが閾値を超える音情報の集音部１２０への到来方向の数が所定値以下である場合には、残り時間通知画面Ｇ２１−１を出力させてもよい。所定値は限定されないが、「１」であってもよい。 Therefore, for example, the output control unit 146 may output the utterance start confirmation screen G24-1 when the number of directions from which sound information whose speech likeness exceeds the threshold arrives at the sound collection unit 120 exceeds a predetermined value. The output control unit 146 may output the remaining time notification screen G21-1 when that number is equal to or less than the predetermined value. The predetermined value is not limited, but may be “1”.
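The direction-count rule can be sketched as follows. This is an assumption-laden illustration: the function name, the dictionary mapping a direction label to a speech-likeness score, and the returned screen labels are hypothetical, and how speech likeness or arrival direction is estimated is outside the scope of the sketch.

```python
from typing import Mapping


def choose_start_screen(speech_likeness_by_direction: Mapping[str, float],
                        likeness_threshold: float,
                        predetermined_value: int = 1) -> str:
    """Pick the screen to output from the number of speech-like directions."""
    # Count directions whose speech likeness exceeds the threshold.
    speech_like_directions = sum(
        1 for score in speech_likeness_by_direction.values()
        if score > likeness_threshold
    )
    if speech_like_directions > predetermined_value:
        return "utterance_start_confirmation"   # corresponds to screen G24-1
    return "remaining_time_notification"        # corresponds to screen G21-1
```

With the predetermined value of 1 mentioned in the text, a single talker keeps the timer-style screen, while two or more speech-like directions switch to the confirmation screen.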

以上、本開示の実施形態に係る情報処理システム１０の機能詳細について説明した。 The function details of the information processing system 10 according to the embodiment of the present disclosure have been described above.

［１．４．システム構成の変形例］ [1.4. Modification of system configuration]
上記においては、出力部１３０がテーブルＴｂｌの天面に画面を投影することが可能なプロジェクタである例について説明した。しかし、情報処理システム１０のシステム構成は、かかる例に限定されない。以下では、情報処理システム１０のシステム構成の変形例について説明する。図２４は、情報処理システム１０の構成の変形例１を示す図である。図２４に示すように、情報処理システム１０が携帯端末である場合に、出力部１３０は、携帯端末に備わっていてもよい。携帯端末の種類は特に限定されず、タブレット端末であってもよいし、スマートフォンであってもよいし、携帯電話であってもよい。 In the above, the example in which the output unit 130 is a projector capable of projecting a screen onto the top surface of the table Tbl has been described. However, the system configuration of the information processing system 10 is not limited to this example. Modifications of the system configuration of the information processing system 10 are described below. FIG. 24 is a diagram illustrating a first modification of the configuration of the information processing system 10. As illustrated in FIG. 24, when the information processing system 10 is a mobile terminal, the output unit 130 may be included in the mobile terminal. The type of mobile terminal is not particularly limited, and it may be a tablet terminal, a smartphone, or a mobile phone.

また、図２５〜図２８は、情報処理システム１０の構成の変形例２を示す図である。図２５〜図２８に示すように、出力部１３０は、テレビジョン装置であり、情報処理装置１４０は、ゲーム機であり、操作入力部１１５は、ゲーム機を操作するコントローラであってよい。 FIGS. 25 to 28 are diagrams illustrating a second modification of the configuration of the information processing system 10. As illustrated in FIGS. 25 to 28, the output unit 130 may be a television device, the information processing device 140 may be a game machine, and the operation input unit 115 may be a controller that operates the game machine.

また、図２５に示すように、集音部１２０および出力部１３０は、操作入力部１１５に接続されていてもよい。また、図２６に示すように、画像入力部１１０および集音部１２０は、情報処理装置１４０に接続されていてもよい。また、図２７に示すように、操作入力部１１５、集音部１２０および出力部１３０は、情報処理装置１４０に接続されたスマートフォンに備えられていてもよい。また、図２８に示すように、集音部１２０は、テレビジョン装置に備えられていてもよい。 Further, as shown in FIG. 25, the sound collection unit 120 and the output unit 130 may be connected to the operation input unit 115. In addition, as illustrated in FIG. 26, the image input unit 110 and the sound collection unit 120 may be connected to the information processing apparatus 140. As illustrated in FIG. 27, the operation input unit 115, the sound collection unit 120, and the output unit 130 may be provided in a smartphone connected to the information processing apparatus 140. As shown in FIG. 28, the sound collection unit 120 may be provided in a television device.

また、図２９〜図３２は、情報処理システム１０の構成の変形例３を示す図である。図２９に示すように、情報処理システム１０は、自動車に取り付け可能な車載向けナビゲーションシステムに搭載され、自動車を運転中のユーザＵによって利用されてもよい。また、図３０に示すように、情報処理システム１０は、携帯端末に搭載され、自動車を運転中のユーザＵによって利用されてもよい。上記したように、携帯端末の種類は特に限定されない。 FIGS. 29 to 32 are diagrams showing a third modification of the configuration of the information processing system 10. As shown in FIG. 29, the information processing system 10 may be mounted on a vehicle-mounted navigation system that can be attached to a vehicle and used by a user U who is driving the vehicle. As shown in FIG. 30, the information processing system 10 may be mounted on a mobile terminal and used by a user U who is driving a car. As described above, the type of mobile terminal is not particularly limited.

また、図３１に示すように、情報処理システム１０のうち、画像入力部１１０と、操作入力部１１５と、出力部１３０とは、携帯端末によって備えられており、集音部１２０は、ユーザＵの身体に取り付け可能なマイクロフォンであってもよい。また、図３２に示すように、情報処理システム１０は、自動車に内蔵されている車載向けナビゲーションシステムに搭載され、自動車を運転中のユーザＵによって利用されてもよい。 As shown in FIG. 31, in the information processing system 10, the image input unit 110, the operation input unit 115, and the output unit 130 may be provided in a mobile terminal, and the sound collection unit 120 may be a microphone that can be attached to the body of the user U. As shown in FIG. 32, the information processing system 10 may be mounted on an in-vehicle navigation system built into an automobile and used by a user U who is driving the automobile.

［１．５．ハードウェア構成例］ [1.5. Hardware configuration example]
次に、図３３を参照して、本開示の実施形態に係る情報処理システム１０のハードウェア構成について説明する。図３３は、本開示の実施形態に係る情報処理システム１０のハードウェア構成例を示すブロック図である。 Next, a hardware configuration of the information processing system 10 according to the embodiment of the present disclosure will be described with reference to FIG. 33. FIG. 33 is a block diagram illustrating a hardware configuration example of the information processing system 10 according to the embodiment of the present disclosure.

図３３に示すように、情報処理システム１０は、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇｕｎｉｔ）９０１、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）９０３、およびＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）９０５を含む。また、情報処理システム１０は、ホストバス９０７、ブリッジ９０９、外部バス９１１、インターフェース９１３、入力装置９１５、出力装置９１７、ストレージ装置９１９、ドライブ９２１、接続ポート９２３、通信装置９２５を含んでもよい。さらに、情報処理システム１０は、必要に応じて、撮像装置９３３、およびセンサ９３５を含んでもよい。情報処理システム１０は、ＣＰＵ９０１に代えて、またはこれとともに、ＤＳＰ（ＤｉｇｉｔａｌＳｉｇｎａｌＰｒｏｃｅｓｓｏｒ）またはＡＳＩＣ（ＡｐｐｌｉｃａｔｉｏｎＳｐｅｃｉｆｉｃＩｎｔｅｇｒａｔｅｄＣｉｒｃｕｉｔ）と呼ばれるような処理回路を有してもよい。 As illustrated in FIG. 33, the information processing system 10 includes a central processing unit (CPU) 901, a read only memory (ROM) 903, and a random access memory (RAM) 905. The information processing system 10 may also include a host bus 907, a bridge 909, an external bus 911, an interface 913, an input device 915, an output device 917, a storage device 919, a drive 921, a connection port 923, and a communication device 925. Furthermore, the information processing system 10 may include an imaging device 933 and a sensor 935 as necessary. The information processing system 10 may include a processing circuit called DSP (Digital Signal Processor) or ASIC (Application Specific Integrated Circuit) instead of or in addition to the CPU 901.

ＣＰＵ９０１は、演算処理装置および制御装置として機能し、ＲＯＭ９０３、ＲＡＭ９０５、ストレージ装置９１９、またはリムーバブル記録媒体９２７に記録された各種プログラムに従って、情報処理システム１０内の動作全般またはその一部を制御する。ＲＯＭ９０３は、ＣＰＵ９０１が使用するプログラムや演算パラメータなどを記憶する。ＲＡＭ９０５は、ＣＰＵ９０１の実行において使用するプログラムや、その実行において適宜変化するパラメータなどを一時的に記憶する。ＣＰＵ９０１、ＲＯＭ９０３、およびＲＡＭ９０５は、ＣＰＵバスなどの内部バスにより構成されるホストバス９０７により相互に接続されている。さらに、ホストバス９０７は、ブリッジ９０９を介して、ＰＣＩ（ＰｅｒｉｐｈｅｒａｌＣｏｍｐｏｎｅｎｔＩｎｔｅｒｃｏｎｎｅｃｔ／Ｉｎｔｅｒｆａｃｅ）バスなどの外部バス９１１に接続されている。 The CPU 901 functions as an arithmetic processing device and a control device, and controls all or a part of the operation in the information processing system 10 according to various programs recorded in the ROM 903, the RAM 905, the storage device 919, or the removable recording medium 927. The ROM 903 stores programs and calculation parameters used by the CPU 901. The RAM 905 temporarily stores programs used in the execution of the CPU 901, parameters that change as appropriate during the execution, and the like. The CPU 901, the ROM 903, and the RAM 905 are connected to each other by a host bus 907 configured by an internal bus such as a CPU bus. Further, the host bus 907 is connected to an external bus 911 such as a PCI (Peripheral Component Interconnect / Interface) bus via a bridge 909.

入力装置９１５は、例えば、マウス、キーボード、タッチパネル、ボタン、スイッチおよびレバーなど、ユーザによって操作される装置である。入力装置９１５は、ユーザの音声を検出するマイクロフォンを含んでもよい。入力装置９１５は、例えば、赤外線やその他の電波を利用したリモートコントロール装置であってもよいし、情報処理システム１０の操作に対応した携帯電話などの外部接続機器９２９であってもよい。入力装置９１５は、ユーザが入力した情報に基づいて入力信号を生成してＣＰＵ９０１に出力する入力制御回路を含む。ユーザは、この入力装置９１５を操作することによって、情報処理システム１０に対して各種のデータを入力したり処理動作を指示したりする。また、後述する撮像装置９３３も、ユーザの手の動き、ユーザの指などを撮像することによって、入力装置として機能し得る。このとき、手の動きや指の向きに応じてポインティング位置が決定されてよい。 The input device 915 is a device operated by the user, such as a mouse, a keyboard, a touch panel, a button, a switch, and a lever. The input device 915 may include a microphone that detects the user's voice. The input device 915 may be, for example, a remote control device using infrared rays or other radio waves, or may be an external connection device 929 such as a mobile phone that supports the operation of the information processing system 10. The input device 915 includes an input control circuit that generates an input signal based on information input by the user and outputs the input signal to the CPU 901. The user operates the input device 915 to input various data to the information processing system 10 and instruct processing operations. An imaging device 933, which will be described later, can also function as an input device by imaging a user's hand movement, a user's finger, and the like. At this time, the pointing position may be determined according to the movement of the hand or the direction of the finger.

The output device 917 is a device capable of notifying the user of acquired information visually or audibly. The output device 917 may be, for example, a display device such as an LCD (Liquid Crystal Display), a PDP (Plasma Display Panel), an organic EL (Electro-Luminescence) display, or a projector; a hologram display device; an audio output device such as a speaker or headphones; or a printer device. The output device 917 outputs results obtained by the processing of the information processing system 10 as video such as text or images, or as audio such as voice or other sound. The output device 917 may also include a light or the like for brightening the surroundings.

The storage device 919 is a data storage device configured as an example of a storage unit of the information processing system 10. The storage device 919 is configured from, for example, a magnetic storage device such as an HDD (Hard Disk Drive), a semiconductor storage device, an optical storage device, or a magneto-optical storage device. The storage device 919 stores programs executed by the CPU 901, various data, and various data acquired from the outside.

The drive 921 is a reader/writer for a removable recording medium 927 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and is built into or externally attached to the information processing system 10. The drive 921 reads information recorded on the mounted removable recording medium 927 and outputs the information to the RAM 905. The drive 921 also writes records to the mounted removable recording medium 927.

The connection port 923 is a port for directly connecting a device to the information processing system 10. The connection port 923 may be, for example, a USB (Universal Serial Bus) port, an IEEE 1394 port, or a SCSI (Small Computer System Interface) port. The connection port 923 may also be an RS-232C port, an optical audio terminal, an HDMI (registered trademark) (High-Definition Multimedia Interface) port, or the like. Connecting the external connection device 929 to the connection port 923 allows various data to be exchanged between the information processing system 10 and the external connection device 929.

The communication device 925 is a communication interface configured from, for example, a communication device for connecting to the communication network 931. The communication device 925 may be, for example, a communication card for wired or wireless LAN (Local Area Network), Bluetooth (registered trademark), or WUSB (Wireless USB). The communication device 925 may also be a router for optical communication, a router for ADSL (Asymmetric Digital Subscriber Line), or a modem for various types of communication. The communication device 925 transmits and receives signals and the like to and from the Internet and other communication devices using a predetermined protocol such as TCP/IP. The communication network 931 connected to the communication device 925 is a network connected by wire or wirelessly, such as the Internet, a home LAN, infrared communication, radio wave communication, or satellite communication.

The imaging device 933 is a device that images a real space and generates a captured image using various members such as an imaging element, for example a CCD (Charge Coupled Device) or a CMOS (Complementary Metal Oxide Semiconductor) sensor, and a lens for controlling the formation of a subject image on the imaging element. The imaging device 933 may capture still images or may capture moving images.

The sensor 935 comprises various sensors such as an acceleration sensor, a gyro sensor, a geomagnetic sensor, an optical sensor, and a sound sensor. The sensor 935 acquires information about the state of the information processing system 10 itself, such as the attitude of its housing, and information about the surrounding environment of the information processing system 10, such as the brightness and noise around it. The sensor 935 may also include a GPS sensor that receives a GPS (Global Positioning System) signal and measures the latitude, longitude, and altitude of the device.

An example of the hardware configuration of the information processing system 10 has been shown above. Each of the components described above may be configured using general-purpose members, or may be configured from hardware specialized for the function of that component. The configuration can be changed as appropriate according to the technical level at the time of implementation.

<2. Conclusion>
As described above, the embodiment of the present disclosure provides an information processing system 10 including an output control unit 146 that causes the output unit 130 to output a start condition for the voice recognition processing performed by the voice recognition unit 145 on sound information input from the sound collection unit 120, in which the output control unit 146 dynamically changes the start condition of the voice recognition processing that it causes the output unit 130 to output. With this configuration, the voice recognition processing can be started flexibly according to the situation.

This configuration also makes it possible to have the user think about the utterance content before the voice recognition processing starts. In other words, the voice recognition processing can be started after the user has decided what to say. This configuration further makes it possible to exclude noise and the like contained in the collected sound information from the target of the voice recognition processing. In addition, presenting the start condition of the voice recognition processing to the user can improve the success rate of the voice recognition processing.

The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, but the technical scope of the present disclosure is not limited to these examples. It is clear that a person having ordinary knowledge in the technical field of the present disclosure could conceive of various changes or modifications within the scope of the technical idea described in the claims, and these are naturally understood to belong to the technical scope of the present disclosure.

For example, the system configuration example of the information processing system 10 and modifications of that configuration have been described above, but the system configuration of the information processing system 10 is not limited to these examples. For example, the output unit 130 may be a display provided in a wearable terminal other than a head mounted display (for example, a watch or glasses). The output unit 130 may also be, for example, a display used in the healthcare field.

It is also possible to create a program that causes hardware such as a CPU, ROM, and RAM built into a computer to exhibit functions equivalent to those of the information processing device 140 described above. A computer-readable recording medium on which the program is recorded can also be provided.

The output control unit 146 can control the output unit 130 so that display content is displayed on the output unit 130 by generating display control information for displaying that content and outputting the generated display control information to the output unit 130. The content of the display control information may be changed as appropriate to suit the system configuration.

As a specific example, the program for realizing the information processing device 140 may be a web application. In that case, the display control information may be realized in a markup language such as HTML (HyperText Markup Language), SGML (Standard Generalized Markup Language), or XML (Extensible Markup Language).
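As a rough illustration of display control information expressed in HTML, the sketch below wraps a start-condition message in a markup fragment. The function name, the element id, and the message wording are hypothetical and are not taken from the disclosure; only the idea that the display control information may be a markup-language document comes from the text above.

```python
from html import escape


def build_display_control_info(start_condition_text: str) -> str:
    """Return an HTML fragment that presents a start condition.

    The element id "start-condition" is an illustrative assumption.
    """
    # Escape the text so that characters such as "<" cannot break the markup.
    return f'<div id="start-condition">{escape(start_condition_text)}</div>'


if __name__ == "__main__":
    print(build_display_control_info("Voice recognition starts in 3 s"))
```

An output unit implemented as a browser could render such a fragment directly, whereas a different system configuration might use another format for the same information.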

Note that the position of each component is not particularly limited as long as the operation of the information processing system 10 described above is realized. As a specific example, the image input unit 110, the operation input unit 115, and the sound collection unit 120, the output unit 130, and the information processing device 140 may be provided in different devices connected to one another via a network. In this case, the information processing device 140 corresponds to a server such as a web server or a cloud server, and the image input unit 110, the operation input unit 115, the sound collection unit 120, and the output unit 130 can correspond to clients connected to that server via the network.

Not all of the components of the information processing device 140 need to be housed in the same device. For example, some of the input image acquisition unit 141, the sound information acquisition unit 142, the operation detection unit 143, the recognition control unit 144, the voice recognition unit 145, and the output control unit 146 may reside in a device different from the information processing device 140. For example, the voice recognition unit 145 may reside on a server different from the information processing device 140 that includes the input image acquisition unit 141, the sound information acquisition unit 142, the operation detection unit 143, the recognition control unit 144, and the output control unit 146.
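One way to picture a voice recognition unit that may live either in the same device or on another server is to hide it behind a common interface. The following is a minimal sketch under that assumption; all class and method names are hypothetical, the recognizers return placeholder strings rather than real recognition results, and the transport to the remote server is represented by a caller-supplied function.

```python
from typing import Callable, Protocol


class SpeechRecognizer(Protocol):
    """Interface assumed for any voice recognition unit."""

    def recognize(self, sound: bytes) -> str: ...


class LocalRecognizer:
    """Stand-in for a voice recognition unit inside the same device."""

    def recognize(self, sound: bytes) -> str:
        # Placeholder result; a real unit would run recognition here.
        return f"local:{len(sound)} bytes"


class RemoteRecognizer:
    """Stand-in for a voice recognition unit on a different server."""

    def __init__(self, send: Callable[[bytes], str]) -> None:
        # `send` transports the sound information and returns the result.
        self._send = send

    def recognize(self, sound: bytes) -> str:
        return self._send(sound)


class RecognitionController:
    """Starts recognition once the start condition is satisfied."""

    def __init__(self, recognizer: SpeechRecognizer) -> None:
        self._recognizer = recognizer

    def on_start_condition_satisfied(self, sound: bytes) -> str:
        return self._recognizer.recognize(sound)
```

Because the controller depends only on the interface, swapping `LocalRecognizer` for `RemoteRecognizer` does not change the controlling code.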

The effects described in this specification are merely explanatory or illustrative and are not limiting. That is, the technology according to the present disclosure can exhibit, together with or in place of the above effects, other effects that are apparent to those skilled in the art from the description of this specification.

The following configurations also belong to the technical scope of the present disclosure.
(1)
An output control unit that causes an output unit to output a start condition of voice recognition processing performed by a voice recognition unit on sound information input from a sound collection unit;
The output control unit dynamically changes the start condition of the voice recognition processing to be output to the output unit;
Information processing system.
(2)
The output control unit causes the output unit to output the start condition when an activation trigger of the voice recognition process is detected.
The information processing system according to (1).
(3)
The information processing system includes a recognition control unit that causes the voice recognition unit to start the voice recognition processing when the start condition is satisfied.
The information processing system according to (1) or (2).
(4)
The output control unit stops the output of the start condition when the start condition is satisfied or when an output stop operation of the start condition is detected,
The information processing system according to any one of (1) to (3).
(5)
The output control unit dynamically changes the start condition of the voice recognition processing to be output to the output unit based on predetermined information.
The information processing system according to (2).
(6)
The output control unit dynamically changes the start condition to be output to the output unit based on sound information input from the sound collection unit after the activation trigger is detected.
The information processing system according to (5) above.
(7)
The output control unit dynamically changes the start condition to be output to the output unit based on the first type of sound information included in the sound information input from the sound collection unit.
The information processing system according to (6).
(8)
The output control unit changes the start condition to information related to a user operation necessary to start the voice recognition process when the volume of the first type of sound information exceeds a first threshold.
The information processing system according to (7).
(9)
The output control unit changes the start condition to a remaining time until the voice recognition process is started when a volume of the first type of sound information is lower than the first threshold.
The information processing system according to (8).
(10)
The output control unit omits outputting the start condition to the output unit when the volume of the sound information of the first type is lower than a second threshold value that is smaller than the first threshold value;
The information processing system according to (8) or (9).
(11)
The first type of sound information includes at least noise.
The information processing system according to any one of (7) to (10).
(12)
The output control unit dynamically changes the start condition to be output to the output unit based on past sound information collected during a predetermined time from when an activation trigger was detected in the past until voice recognition processing was started,
The information processing system according to (6).
(13)
The output control unit dynamically changes the start condition to be output to the output unit based on the second type of sound information included in the past sound information.
The information processing system according to (12).
(14)
When the volume of the second type of sound information exceeds a threshold, the output control unit makes the remaining time until the voice recognition process is started, which is output as the start condition, longer than at the time of the previous voice recognition process,
The information processing system according to (13).
(15)
When the volume of the second type of sound information is lower than the threshold, the output control unit makes the remaining time until the voice recognition process is started, which is output as the start condition, shorter than at the time of the previous voice recognition process,
The information processing system according to (14).
(16)
The second type of sound information includes at least noise.
The information processing system according to any one of (13) to (15).
(17)
The output control unit dynamically changes the start condition to be output to the output unit based on the number of directions from which sound information whose speech likelihood exceeds a threshold arrives at the sound collection unit.
The information processing system according to (1).
(18)
The output control unit causes the output unit to output at least one of predetermined display information and predetermined audio information as the start condition.
The information processing system according to any one of (1) to (17).
(19)
The recognition control unit causes voice recognition processing to start before the start condition is satisfied and, when the start condition is satisfied and the result of that voice recognition processing includes a filler, causes the voice recognition unit to start voice recognition processing on the sound information from which the portion corresponding to the filler has been excluded.
The information processing system according to (3).
(20)
Causing an output unit to output a start condition of voice recognition processing performed by a voice recognition unit on sound information input from a sound collection unit; and
dynamically changing, by a processor, the start condition of the voice recognition processing that the output unit is caused to output;
Information processing method.
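Configurations (8) to (10) describe how the volume of the first type of sound information (for example, noise) selects what is presented as the start condition. The sketch below is one possible reading of that logic, not the implementation of the disclosure: the names, the default countdown of 3 seconds, and the handling of a volume exactly equal to the first threshold (treated here as the countdown case) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StartCondition:
    kind: str                                  # "user_operation" or "countdown"
    remaining_seconds: Optional[float] = None  # set only for "countdown"


def select_start_condition(
    noise_volume: float,
    first_threshold: float,
    second_threshold: float,
    countdown_seconds: float = 3.0,  # assumed default, not from the source
) -> Optional[StartCondition]:
    """Choose the start condition to output, or None to omit the output."""
    if second_threshold >= first_threshold:
        raise ValueError("second_threshold must be smaller than first_threshold")
    if noise_volume > first_threshold:
        # Configuration (8): noisy, so ask for an explicit user operation.
        return StartCondition("user_operation")
    if noise_volume < second_threshold:
        # Configuration (10): quiet enough to skip presenting a condition.
        return None
    # Configuration (9): present the remaining time until recognition starts.
    return StartCondition("countdown", countdown_seconds)
```

For instance, with a first threshold of 60 and a second threshold of 30 (arbitrary units), a volume of 80 yields a user-operation condition, 45 yields a countdown, and 10 yields no output.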

DESCRIPTION OF SYMBOLS
120 Sound collection unit
10 Information processing system
110 Image input unit
115 Operation input unit
130 Output unit
140 Information processing device (control unit)
141 Input image acquisition unit
142 Sound information acquisition unit
143 Operation detection unit
144 Recognition control unit
145 Voice recognition unit
146 Output control unit
G10 Initial screen
G11 Recognized character string display field
G12 Delete-all operation object
G13 Confirm operation object
G15 Forward movement operation object
G16 Backward movement operation object
G17 Delete operation object

Claims

An output control unit that causes an output unit to output a start condition of voice recognition processing performed by a voice recognition unit on sound information input from a sound collection unit;
The output control unit dynamically changes the start condition of the voice recognition processing to be output to the output unit;
Information processing system.

The output control unit causes the output unit to output the start condition when an activation trigger of the voice recognition process is detected.
The information processing system according to claim 1.

The information processing system includes a recognition control unit that causes the voice recognition unit to start the voice recognition processing when the start condition is satisfied.
The information processing system according to claim 1.

The output control unit stops the output of the start condition when the start condition is satisfied or when an output stop operation of the start condition is detected,
The information processing system according to claim 1.

The output control unit dynamically changes the start condition of the voice recognition processing to be output to the output unit based on predetermined information.
The information processing system according to claim 2.

The output control unit dynamically changes the start condition to be output to the output unit based on sound information input from the sound collection unit after the activation trigger is detected.
The information processing system according to claim 5.

The output control unit dynamically changes the start condition to be output to the output unit based on the first type of sound information included in the sound information input from the sound collection unit.
The information processing system according to claim 6.

The output control unit changes the start condition to information related to a user operation necessary to start the voice recognition process when the volume of the first type of sound information exceeds a first threshold.
The information processing system according to claim 7.

The output control unit changes the start condition to a remaining time until the voice recognition process is started when a volume of the first type of sound information is lower than the first threshold.
The information processing system according to claim 8.

The output control unit omits outputting the start condition to the output unit when the volume of the sound information of the first type is lower than a second threshold value that is smaller than the first threshold value;
The information processing system according to claim 8.

The first type of sound information includes at least noise.
The information processing system according to claim 7.

The output control unit dynamically changes the start condition to be output to the output unit based on past sound information collected during a predetermined time from when an activation trigger was detected in the past until voice recognition processing was started,
The information processing system according to claim 6.

The output control unit dynamically changes the start condition to be output to the output unit based on the second type of sound information included in the past sound information.
The information processing system according to claim 12.

When the volume of the second type of sound information exceeds a threshold, the output control unit makes the remaining time until the voice recognition process is started, which is output as the start condition, longer than at the time of the previous voice recognition process,
The information processing system according to claim 13.

When the volume of the second type of sound information is lower than the threshold, the output control unit makes the remaining time until the voice recognition process is started, which is output as the start condition, shorter than at the time of the previous voice recognition process,
The information processing system according to claim 14.

The second type of sound information includes at least noise.
The information processing system according to claim 13.

The output control unit dynamically changes the start condition to be output to the output unit based on the number of directions from which sound information whose speech likelihood exceeds a threshold arrives at the sound collection unit.
The information processing system according to claim 1.

The output control unit causes the output unit to output at least one of predetermined display information and predetermined audio information as the start condition.
The information processing system according to claim 1.

The recognition control unit causes voice recognition processing to start before the start condition is satisfied and, when the start condition is satisfied and the result of that voice recognition processing includes a filler, causes the voice recognition unit to start voice recognition processing on the sound information from which the portion corresponding to the filler has been excluded.
The information processing system according to claim 3.

Causing an output unit to output a start condition of voice recognition processing performed by a voice recognition unit on sound information input from a sound collection unit; and
dynamically changing, by a processor, the start condition of the voice recognition processing that the output unit is caused to output;
Information processing method.
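Claims 14 and 15 lengthen or shorten the remaining time until voice recognition starts, relative to the previous voice recognition process, depending on whether the volume of the second type of sound information in the past recording is above or below a threshold. The sketch below illustrates that adjustment under stated assumptions: the step size and the minimum and maximum bounds are hypothetical parameters that the claims do not specify, and a volume exactly equal to the threshold is assumed to leave the time unchanged.

```python
def next_countdown_seconds(
    previous_seconds: float,
    past_noise_volume: float,
    threshold: float,
    step: float = 1.0,      # assumed adjustment per recognition session
    minimum: float = 1.0,   # assumed lower bound on the countdown
    maximum: float = 10.0,  # assumed upper bound on the countdown
) -> float:
    """Return the countdown to present for the next recognition session."""
    if past_noise_volume > threshold:
        # Claim 14: noisy last time, so give the user more time.
        candidate = previous_seconds + step
    elif past_noise_volume < threshold:
        # Claim 15: quiet last time, so shorten the wait.
        candidate = previous_seconds - step
    else:
        candidate = previous_seconds
    # Clamp to the assumed bounds; at a bound the value may stay the same.
    return max(minimum, min(maximum, candidate))
```

Clamping is a design choice added here to keep the countdown usable; with it, a countdown already at the minimum is not shortened further even when the past volume is below the threshold.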