JP7246337B2

JP7246337B2 - Computer system and work estimation method

Info

Publication number: JP7246337B2
Application number: JP2020043080A
Authority: JP
Inventors: 正明山本; 健本間
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2020-03-12
Filing date: 2020-03-12
Publication date: 2023-03-27
Anticipated expiration: 2040-03-12
Also published as: JP2021144156A

Description

The present invention relates to a method for estimating work based on a worker's utterances.

Systems are known that implement an algorithm (model) for understanding a user's utterance intention from speech data acquired from the user (see, for example, Patent Document 1 and Non-Patent Document 1). In this specification, such a system is referred to as an utterance intention estimation system.

Patent Document 1 states: "An utterance intention estimation device comprises speech acquisition means for acquiring speech data of a user utterance, feature acquisition means for acquiring an acoustic feature of the utterance, and intention estimation means for estimating the intention of the user utterance from the acoustic feature. The intention estimation means may also be configured to estimate the intention of the user utterance from the text of the utterance; preferably, when the text of the user utterance can be extracted from the speech data, the intention is estimated from the text, and when the text cannot be extracted from the speech data, or the intention cannot be estimated from the text, the intention is estimated from the acoustic feature."

Non-Patent Document 1 describes correcting the current intent understanding result in light of past utterances.

Patent Document 1: JP 2018-169494 A

Non-Patent Document 1: Koichiro Yoshino, "Understanding User's Intention from Spoken Utterance Sequence", Journal of the Institute of Electronics, Information and Communication Engineers, Vol. 101, No. 9, pp. 896-901, September 2018

An utterance intention estimation system could be used in factories or at work sites, where work such as product manufacturing or equipment inspection is carried out, to estimate the content and result of the work a worker is performing.

The content and result of a worker's task are difficult to estimate from the intention of a single utterance. Therefore, as described in Non-Patent Document 1, the utterance intention must be estimated from a sequence of utterances.

The combination of utterances a worker makes during work (the combination of utterance content and utterance order) can differ even for the same work. Because it is generally difficult to build a model that handles varied combinations of utterances, conventional models are fixed to specific combinations.

The present invention provides an utterance intention estimation system that can estimate, with high accuracy, the content and result of the work performed by a worker from inputs consisting of varied combinations of utterances.

A representative example of the invention disclosed in the present application is as follows. A computer system for estimating the work a worker is performing based on utterances comprises at least one computer having a processor and a memory connected to the processor. The processor converts speech data of utterances made by the worker into text and stores the text in the memory; calculates utterance intervals between the utterances from a plurality of chronologically consecutive pieces of the speech data and stores the utterance intervals in the memory; generates time-series data of utterances, in which the texts are arranged in chronological order, and time-series data of utterance intervals, in which the utterance intervals are arranged in chronological order, and stores both in the memory; estimates the content and result of the work performed by the worker based on the time-series data of utterances and the time-series data of utterance intervals; and outputs the result of the estimation.

According to the present invention, the computer system (utterance intention estimation system) can estimate, with high accuracy, the content and result of the work performed by a worker from inputs consisting of varied combinations of utterances. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.

FIG. 1 is a diagram showing an example of the configuration of the utterance intention estimation system of Example 1.
FIG. 2 is a diagram showing the flow of the intent understanding model generation processing executed by the computer of Example 1.
FIG. 3 is a flowchart showing an example of the intent understanding model generation processing executed by the computer of Example 1.
FIG. 4A is a diagram showing an example of time-series data input to the computer of Example 1.
FIG. 4B is a diagram showing an example of time-series data input to the computer of Example 1.
FIG. 4C is a diagram showing an example of time-series data input to the computer of Example 1.
FIG. 5 is a diagram showing an example of the data structure of the learning data generated by the computer of Example 1.
FIG. 6 is a diagram showing the flow of the work estimation processing executed by the computer of Example 1.
FIG. 7 is a flowchart showing an example of the work estimation processing executed by the computer of Example 1.
FIG. 8 is a diagram showing an example of the data structure of the intermediate output information generated by the computer of Example 1.
FIG. 9 is a flowchart showing an example of the intent understanding model generation processing executed by the computer of Example 2.
FIG. 10 is a diagram showing an example of time-series data of attribute labels input to the computer of Example 2.
FIG. 11 is a diagram showing an example of the data structure of the utterance pattern information generated by the computer of Example 2.
FIG. 12 is a diagram showing an example of a method of calculating the utterance interval of an utterance pattern in Example 2.

Embodiments of the present invention are described below with reference to the drawings. The present invention, however, should not be construed as limited to the embodiments described below. Those skilled in the art will readily understand that the specific configuration can be changed without departing from the idea or gist of the present invention. In the configurations described below, the same or similar components or functions are given the same reference numerals, and duplicate descriptions are omitted. Terms such as "first", "second", and "third" in this specification are used to identify components and do not necessarily limit their number or order. The position, size, shape, range, and the like of each component shown in the drawings may not represent the actual position, size, shape, range, and the like, in order to make the invention easier to understand. The present invention is therefore not limited to the position, size, shape, range, and the like disclosed in the drawings.

FIG. 1 is a diagram showing an example of the configuration of the utterance intention estimation system of Example 1.

The utterance intention estimation system comprises a computer 100 and a microphone 101. There may be two or more computers 100 and two or more microphones 101.

The microphone 101 is a device that collects speech uttered by a worker. The microphone 101 generates speech data from the collected speech and transmits the speech data to the computer 100. The microphone 101 may be fixed in the space where the worker is working or may be carried by the worker. The speech data includes a timestamp indicating at least one of the utterance start time and the utterance end time.

The computer 100 uses the speech data to estimate the content and result of the work performed by the worker. The computer 100 comprises a processor 110, a main storage device 111, a secondary storage device 112, and a connection interface 113. These hardware components are connected to one another via an internal bus.

The connection interface 113 is an interface for connecting to external devices, for example a network interface or an I/O interface.

The processor 110 executes programs stored in the main storage device 111. By executing processing according to a program, the processor 110 operates as a functional unit (module) that implements a specific function. In the following description, when processing is described with a functional unit as the subject, it means that the processor 110 is executing the program that implements that functional unit.

The main storage device 111 stores the programs executed by the processor 110 and the data used by those programs. The main storage device 111 also includes a work area used temporarily by the programs. The main storage device 111 is, for example, a DRAM (Dynamic Random Access Memory). The programs stored in the main storage device 111 are described later.

The programs and data stored in the main storage device 111 may instead be stored in the secondary storage device 112. In this case, the processor 110 reads the programs and data from the secondary storage device 112 and stores them in the main storage device 111.

The secondary storage device 112 stores data persistently. The secondary storage device 112 is, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive). The data stored in the secondary storage device 112 is described later.

The programs and data held by the computer 100 are described next.

The secondary storage device 112 stores speech recognition model information 130 and intent understanding model information 131.

The speech recognition model information 130 is definition information for a speech recognition model that estimates the specific content of an utterance from speech data. The speech recognition model of Example 1 is, for example, an NN (Neural Network). Features of the speech data are input to the nodes that make up the network. Using the speech recognition model yields text representing the content of the utterance.

The intent understanding model information 131 is definition information for an intent understanding model that uses time-series data of utterances to estimate the content and result of the work performed by a worker. The intent understanding model of Example 1 is, for example, an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory), or a CNN (Convolutional Neural Network). Features calculated from unit utterances and utterance intervals are input to the nodes that make up the network. A unit utterance is, for example, a sentence or a word. In Example 1, the unit utterance is a sentence. The utterance interval is the time between one utterance and the next. Using the intent understanding model yields a combination of work content and work result.

The main storage device 111 stores programs that implement a speech recognition unit 120, a work estimation unit 121, a learning data generation unit 122, and a learning unit 123.

The speech recognition unit 120 uses the speech recognition model to generate, from speech data, text representing the specific content of an utterance.

The work estimation unit 121 uses the intent understanding model and the time-series data of utterances (text) to estimate the content and result of the work.

The learning data generation unit 122 generates the learning data used by the learning unit 123. In Example 1, the learning data generation unit 122 generates learning data for generating the intent understanding model. The learning data generation unit 122 may also generate learning data for generating the speech recognition model.

The learning unit 123 executes learning processing that generates a model from learning data. In Example 1, the learning unit 123 executes learning processing for generating the intent understanding model. The learning unit 123 may also execute learning processing for generating the speech recognition model.

The functional units of the computer 100 may be reorganized: a plurality of functional units may be combined into one, or one functional unit may be divided into a plurality of functional units, one per function.

FIG. 2 is a diagram showing the flow of the intent understanding model generation processing executed by the computer 100 of Example 1. FIG. 3 is a flowchart showing an example of the intent understanding model generation processing executed by the computer 100 of Example 1. FIG. 4A, FIG. 4B, and FIG. 4C are diagrams showing examples of time-series data input to the computer 100 of Example 1. FIG. 5 is a diagram showing an example of the data structure of the learning data 200 generated by the computer 100 of Example 1.

The computer 100 receives, as input, time-series data of utterances (text) for learning, time-series data of utterance intervals for learning, and time-series data of correct labels (step S101). The computer 100 passes the received data to the learning data generation unit 122.

The time-series data of utterances for learning is a group of utterance data collected on the basis of criteria such as time, work type, worker, and location. It is, for example, the time-series data 400 shown in FIG. 4A. The time-series data 400 stores records, each consisting of an order 401 and an utterance content 402. One record exists for each utterance. The order 401 is a field that stores the utterance order. The utterance content 402 is a field that stores the text.

The time-series data of utterance intervals for learning is a group of data on the time intervals between the utterances included in the time-series data of utterances for learning. It is, for example, the time-series data 410 shown in FIG. 4B. The time-series data 410 stores records, each consisting of an order 411 and an utterance interval 412. The order 411 is the same field as the order 401. The utterance interval 412 is a field that stores the time interval between the utterance corresponding to the record and the utterance immediately preceding it in the time series. Because no utterance precedes the utterance corresponding to the record whose order 411 is "1", the utterance interval 412 of that record is blank.

The time-series data of correct labels is a group of correct labels, each indicating the correct output (class) for an utterance included in the time-series data of utterances for learning. It is, for example, the time-series data 420 shown in FIG. 4C. The time-series data 420 stores records, each consisting of an order 421 and a correct label 422. The order 421 is the same field as the order 401. The correct label 422 is a field that stores the correct output of the intent understanding model. In the correct label 422, a value is set for each class (a combination of work content and work result). As shown in FIG. 4C, the intent understanding model outputs one of five classes: paper jam OK, paper jam NG, remaining paper OK, remaining paper NG, and other. Each class is set to either "0" or "1", and the class set to "1" is the correct class.
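The correct label 422 described above is effectively a one-hot record over the five classes. The following is a minimal Python sketch of that encoding, not the patent's implementation; the English class names and the function names `one_hot` and `decode` are assumptions introduced here for illustration.

```python
# Five classes as described for FIG. 4C; the English spellings are assumptions.
CLASSES = [
    "paper jam OK",
    "paper jam NG",
    "remaining paper OK",
    "remaining paper NG",
    "other",
]


def one_hot(label: str) -> dict[str, int]:
    """Build a correct-label record: 1 for the correct class, 0 for the rest."""
    if label not in CLASSES:
        raise ValueError(f"unknown class: {label}")
    return {name: int(name == label) for name in CLASSES}


def decode(record: dict[str, int]) -> str:
    """Recover the class name from a one-hot record."""
    hits = [name for name, value in record.items() if value == 1]
    if len(hits) != 1:
        raise ValueError("exactly one class must be set to 1")
    return hits[0]
```

Keeping the class list in one place means the label records and the model output can share the same ordering.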

A plurality of data sets are input to the computer 100, each data set grouping together time-series data of utterances for learning, time-series data of utterance intervals for learning, and time-series data of correct labels. A plurality of data sets collected under the same criteria may be input.

If the time-series data of utterances for learning includes time information, the time-series data of utterance intervals for learning need not be given as input. In this case, the time-series data of utterance intervals for learning can be generated from the time-series data of utterances for learning.

The computer 100 generates learning data 200 from the input time-series data (step S102).

Specifically, the learning data generation unit 122 generates learning data 200 containing records that associate each utterance with its utterance interval and correct label. The learning data generation unit 122 generates one set of learning data 200 for each data set.

The learning data 200 is described here. As shown in FIG. 5, the learning data 200 contains records, each consisting of an order 501, an utterance interval 502, an utterance content 503, and a correct label 504. One record corresponds to one utterance. The order 501, the utterance interval 502, the utterance content 503, and the correct label 504 are the same fields as the order 401, the utterance interval 412, the utterance content 402, and the correct label 422, respectively.
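Step S102 amounts to joining the three input time series on their shared order field. The following is a rough Python sketch under the assumption that each time series arrives as a list of (order, value) pairs; the names `LearningRecord` and `build_learning_data` are hypothetical, and the correct label is held as a class name for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LearningRecord:
    order: int                 # order 501
    interval: Optional[float]  # utterance interval 502; None for the first utterance
    text: str                  # utterance content 503
    label: str                 # correct label 504, held here as a class name


def build_learning_data(utterances, intervals, labels):
    """Join three time series, each a list of (order, value) pairs, on the order."""
    interval_by_order = dict(intervals)
    label_by_order = dict(labels)
    records = []
    for order, text in sorted(utterances):
        if order not in label_by_order:
            raise KeyError(f"no correct label for utterance {order}")
        records.append(
            LearningRecord(order, interval_by_order.get(order), text, label_by_order[order])
        )
    return records
```

One call per data set yields one set of learning data, which matches the one-to-one relationship described above.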

A distinguishing feature of the utterance intention estimation system of Example 1 is that it treats the utterance interval as an input to the intent understanding model. As a result, the content and result of the work can be estimated accurately even when the group of utterances includes utterances unrelated to the work.

The description now returns to the intent understanding model generation processing.

Next, the computer 100 executes learning processing using the learning data 200 (step S103).

Specifically, the learning unit 123 generates an intent understanding model that takes the features calculated from the utterance interval 502 and the utterance content 503 in the learning data 200 as inputs to its nodes and produces an output corresponding to the correct label 504. Any known learning method may be used, so a detailed description is omitted.

For example, a vector calculated from the utterance and the utterance interval are input as features to the nodes of the network corresponding to the intent understanding model. The vector can be calculated from the utterance using, for example, Word2vec.
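As an illustration of how an utterance vector and an utterance interval might be combined into one input, here is a small Python sketch. It is only a sketch under stated assumptions: a hashed bag-of-words vector stands in for a Word2vec-style embedding so that the example has no dependencies, and the extra flag marking a missing interval (the first utterance) is a design choice made here, not something the source specifies. `DIM`, `text_vector`, and `features` are hypothetical names.

```python
import zlib
from typing import Optional

DIM = 8  # assumed vector size, chosen only for illustration


def text_vector(text: str, dim: int = DIM) -> list[float]:
    """Stand-in for a Word2vec-style sentence vector: hashed bag of words, normalised."""
    vec = [0.0] * dim
    words = text.lower().split()
    for word in words:
        # crc32 is used because it is deterministic across runs.
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    total = len(words)
    return [v / total for v in vec] if total else vec


def features(text: str, interval: Optional[float]) -> list[float]:
    """Concatenate the text vector, the interval in seconds, and a missing-interval flag."""
    missing = interval is None
    return text_vector(text) + [
        0.0 if missing else float(interval),
        1.0 if missing else 0.0,
    ]
```

In practice the text vector would come from a trained embedding model, and the interval would likely be scaled before being fed to the network.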

Next, the computer 100 saves the learning result (the intent understanding model) in the intent understanding model information 131 (step S104) and ends the intent understanding model generation processing.

FIG. 6 is a diagram showing the flow of the work estimation processing executed by the computer 100 of Example 1. FIG. 7 is a flowchart showing an example of the work estimation processing executed by the computer 100 of Example 1. FIG. 8 is a diagram showing an example of the data structure of the intermediate output information 800 generated by the computer 100 of Example 1.

The computer 100 acquires speech data from the microphone 101 (step S201). The computer 100 is assumed to accumulate speech data for a certain period.

The computer 100 generates, from the speech data, text representing the specific content of each utterance and calculates the utterance intervals (step S202). Specifically, the following processing is executed.

(S202-1) The speech recognition unit 120 selects target speech data from the accumulated speech data.

(S202-2) The speech recognition unit 120 generates text from the target speech data using the speech recognition model stored in the speech recognition model information 130. The speech recognition unit 120 also calculates the utterance interval between the target speech data and the speech data immediately preceding it in the time series (the comparison speech data).

For example, the speech recognition unit 120 calculates, as the utterance interval, the time from the end of the utterance corresponding to the comparison speech data to the start of the utterance corresponding to the target speech data. Alternatively, the speech recognition unit 120 may calculate, as the utterance interval, the time from the start of the utterance corresponding to the comparison speech data to the start of the utterance corresponding to the target speech data. If there is no comparison speech data, the speech recognition unit 120 sets the utterance interval to NULL.

(S202-3) The speech recognition unit 120 determines whether all the accumulated speech data has been processed. If not, the speech recognition unit 120 returns to S202-1 and repeats the processing. If all the accumulated speech data has been processed, the speech recognition unit 120 ends step S202.

This concludes the description of step S202.
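The interval calculation in S202-2 can be sketched in Python as follows. This is an illustrative sketch rather than the patented implementation: it assumes each utterance carries both a start and an end timestamp in seconds (the source only requires at least one of them), and the names `Utterance`, `utterance_intervals`, and `from_start` are hypothetical. `None` plays the role of NULL for the first utterance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    start: float  # utterance start time, in seconds
    end: float    # utterance end time, in seconds


def utterance_intervals(
    utterances: list[Utterance], from_start: bool = False
) -> list[Optional[float]]:
    """Interval between each utterance and the one before it; None for the first.

    from_start=False: previous end to current start.
    from_start=True:  previous start to current start.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    intervals: list[Optional[float]] = []
    previous: Optional[Utterance] = None
    for current in ordered:
        if previous is None:
            intervals.append(None)  # no comparison speech data
        else:
            reference = previous.start if from_start else previous.end
            intervals.append(current.start - reference)
        previous = current
    return intervals
```

The start-to-start variant only needs start timestamps, which may matter when the speech data carries just one of the two times.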

Next, the computer 100 generates time-series data of utterances for input (step S203).

Specifically, the speech recognition unit 120 generates input data, each item consisting of a text and an utterance interval. The speech recognition unit 120 then arranges the input data in utterance order to form the time-series data of utterances for input.

Next, the computer 100 executes estimation processing using the intent understanding model and the time-series data of utterances for input (step S204). Specifically, the following processing is executed.

(S204-1) The work estimation unit 121 generates intermediate output information 800. The intermediate output information 800 contains records, each consisting of an order 801, an utterance content 802, and an output class 803. One record corresponds to one item of input data.

The order 801 is a field that stores the position of the utterance in the time-series data of utterances for input. The utterance content 802 is a field that stores the text included in the input data. The output class 803 is a field that stores the output of the intent understanding model, with a value set for each class. In FIG. 8, the output class 803 includes five classes: paper jam OK, paper jam NG, remaining paper OK, remaining paper NG, and other. At this point, the value of every class is set to "0".

(S204-2) The work estimation unit 121 inputs the input data included in the time-series data of utterances for input into the intent understanding model in utterance order. The work estimation unit 121 acquires the output for each item of input data from the intent understanding model. The work estimation unit 121 then refers to the output class 803 of the record corresponding to the input data and sets the class corresponding to the output to "1".

In Example 1, the utterance interval is input together with the text of the utterance. Even when utterances unrelated to the work fall between the utterance that identifies the work type and the utterance that identifies the work result, the intent understanding model can take the utterance intervals into account to infer the relationship between utterances, and can therefore estimate the content and result of the work accurately. In the example shown in FIG. 8, utterances unrelated to the paper jam inspection fall between the record whose order 801 is "4" and the record whose order 801 is "8", yet the intent understanding model of Example 1 can output the correct work content and result for record "8".

(S204-3) When outputs have been obtained for all the input data, the work estimation unit 121 ends the estimation processing.

This concludes the description of step S204.
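Steps S204-1 to S204-3 can be sketched in Python as a table that is first initialised to all zeros and then filled in utterance order. This is a rough sketch under assumptions: the model is passed in as any callable that takes a text and an interval and returns a class name (a trained RNN or LSTM would keep its own state between calls), and the dictionary layout and the name `run_estimation` are hypothetical.

```python
# Class names follow the five classes described for FIG. 8; spellings are assumptions.
CLASSES = [
    "paper jam OK",
    "paper jam NG",
    "remaining paper OK",
    "remaining paper NG",
    "other",
]


def run_estimation(inputs, model):
    """inputs: list of (text, interval) pairs in utterance order.
    model: callable (text, interval) -> class name; a placeholder for the trained model.
    """
    # S204-1: one record per input, with every class initialised to 0.
    table = [
        {"order": i, "text": text, "classes": {name: 0 for name in CLASSES}}
        for i, (text, _interval) in enumerate(inputs, start=1)
    ]
    # S204-2: feed the inputs in utterance order and mark the returned class with 1.
    for record, (text, interval) in zip(table, inputs):
        predicted = model(text, interval)
        if predicted not in record["classes"]:
            raise ValueError(f"model returned an unknown class: {predicted}")
        record["classes"][predicted] = 1
    # S204-3: every input has an output, so the estimation ends.
    return table
```

Passing the model as a callable keeps the bookkeeping independent of whichever network architecture is used.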

Next, the computer 100 outputs an estimation result based on the intermediate output information 800 (step S205) and ends the work estimation processing.

例えば、作業推定部１２１は、出力クラス８０３が「その他」以外の出力結果をリスト化した推定結果を生成し、ユーザに提示する。 For example, the work estimation unit 121 generates an estimation result that lists the output results whose output class 803 is other than "other", and presents it to the user.
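The listing step can be sketched as follows. This is a minimal illustration, assuming the intermediate output information 800 is held as a list of records with the hypothetical keys `order`, `text`, and `output_class`; the patent does not specify an in-memory representation, and the sample records are invented.

```python
def list_estimation_results(intermediate_output, other_label="other"):
    """Return the records whose estimated class is not the 'other' class."""
    return [rec for rec in intermediate_output if rec["output_class"] != other_label]


# Hypothetical records standing in for intermediate output information 800.
records = [
    {"order": 4, "text": "checking for a paper jam", "output_class": "paper jam check"},
    {"order": 5, "text": "it is hot today", "output_class": "other"},
    {"order": 8, "text": "no problem", "output_class": "paper jam check: OK"},
]
results = list_estimation_results(records)
```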

実施例１によれば、発話意図推定システムは、発話及び発話間隔を入力とする意図推定モデルを用いることによって、作業種別を特定するため発話と、作業結果を特定するための発話との間に、当該作業とは無関係の発話が存在する場合でも、作業の内容及び結果を精度よく推定することができる。すなわち、様々な発話の組合せに対応した発話意図推定システムを実現できる。 According to the first embodiment, by using an intention estimation model that takes utterances and utterance intervals as input, the utterance intention estimation system can accurately estimate the content and result of the work even when utterances unrelated to the work occur between the utterance for identifying the work type and the utterance for identifying the work result. That is, an utterance intention estimation system that handles various combinations of utterances can be realized.

なお、単位発話が単語の場合、発話間隔は単語間の時間間隔となる。 Note that when the unit utterance is a word, the utterance interval is the time interval between words.
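As a minimal sketch of how such intervals might be computed, assume each unit utterance (a sentence or a word) comes with hypothetical start and end times in seconds. Measuring the interval from the end of the previous unit to the start of the current one, and using 0.0 for the first unit, are assumptions; the description does not fix these details.

```python
def utterance_intervals(segments):
    """Compute the gap before each unit utterance.

    segments: list of (start_sec, end_sec) tuples in chronological order.
    The first unit has no predecessor, so its interval is set to 0.0.
    """
    intervals = []
    prev_end = None
    for start, end in segments:
        gap = 0.0 if prev_end is None else max(0.0, start - prev_end)
        intervals.append(gap)
        prev_end = end
    return intervals


intervals = utterance_intervals([(0.0, 1.5), (2.0, 3.0), (7.5, 8.0)])
```

The same function applies unchanged whether the tuples describe whole utterances or individual words.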

なお、意図理解モデルのノードに、音圧から算出される特徴量を入力してもよい。例えば、作業時の発話の音圧が小さく、作業時以外の発話の音圧が大きい場合、有用な特徴量として利用することができる。意図理解モデル生成処理では、学習用の発話の時系列データ、学習用の発話間隔の時系列データ、正解ラベルの時系列データ、及び学習用の音圧の時系列データから学習データ２００が生成される。作業推定処理では、音声認識部１２０は、音声データから音圧を算出し、また、テキスト、発話間隔、及び音圧から構成される入力データ群を生成する。 Note that a feature amount calculated from sound pressure may also be input to a node of the intention understanding model. For example, when the sound pressure of utterances made during work is low and the sound pressure of other utterances is high, the sound pressure can serve as a useful feature amount. In the intention understanding model generation process, the learning data 200 is generated from time-series data of utterances for learning, time-series data of utterance intervals for learning, time-series data of correct labels, and time-series data of sound pressure for learning. In the work estimation process, the speech recognition unit 120 calculates the sound pressure from the voice data and generates an input data group composed of text, utterance intervals, and sound pressure.
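The description does not say how the sound pressure feature is calculated. One common choice, used here purely as an assumption, is the RMS level of the audio samples in decibels relative to full scale. The function names and the record layout below are hypothetical.

```python
import math


def rms_level_db(samples, eps=1e-12):
    """RMS level in dB relative to full scale for samples in [-1.0, 1.0].

    samples must be non-empty; eps avoids log10(0) for pure silence.
    """
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square + eps)


def build_input_record(text, interval_sec, samples):
    """Bundle text, utterance interval, and sound pressure into one input record."""
    return {
        "text": text,
        "interval": interval_sec,
        "sound_pressure": rms_level_db(samples),
    }


quiet = rms_level_db([0.01, -0.01, 0.01, -0.01])
loud = rms_level_db([0.5, -0.5, 0.5, -0.5])
record = build_input_record("no problem", 4.5, [0.01, -0.01, 0.01, -0.01])
```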

実施例２では、学習データの生成方法が異なる。以下、実施例１との差異を中心に実施例２について説明する。 The second embodiment differs in the method of generating learning data. The second embodiment will be described below, focusing on the differences from the first embodiment.

実施例２の発話意図推定システムの構成は実施例１と同一である。 The configuration of the utterance intention estimation system of the second embodiment is the same as that of the first embodiment.

図９は、実施例２の計算機１００が実行する意図理解モデル生成処理の一例を示すフローチャートである。図１０は、実施例２の計算機１００に入力される属性ラベルの時系列データの一例を示す図である。図１１は、実施例２の計算機１００が生成する発話パターン情報１１００のデータ構造の一例を示す図である。図１２は、実施例２の発話パターンの発話間隔の算出方法の一例を示す図である。 FIG. 9 is a flowchart showing an example of the intention understanding model generation process executed by the computer 100 of the second embodiment. FIG. 10 is a diagram showing an example of the time-series data of attribute labels input to the computer 100 of the second embodiment. FIG. 11 is a diagram showing an example of the data structure of the utterance pattern information 1100 generated by the computer 100 of the second embodiment. FIG. 12 is a diagram showing an example of a method of calculating the utterance intervals of an utterance pattern in the second embodiment.

計算機１００は、分析用の発話の時系列データ、分析用の発話間隔の時系列データ、及び、属性ラベルの時系列データの入力を受け付ける（ステップＳ１５１）。計算機１００は、学習データ生成部１２２に受け付けたデータを出力する。 The computer 100 receives input of time-series data of utterances for analysis, time-series data of utterance intervals for analysis, and time-series data of attribute labels (step S151). The computer 100 outputs the received data to the learning data generation unit 122.

分析用の発話の時系列データは、学習用の発話の時系列データと同一のデータ構造である。分析用の発話間隔の時系列データは、学習用の発話間隔の時系列データと同一のデータ構造である。 Time-series data of utterances for analysis has the same data structure as time-series data of utterances for learning. Time-series data of utterance intervals for analysis has the same data structure as time-series data of utterance intervals for learning.

属性ラベルは、抽象化意図を示す情報である。抽象化意図は、発話意図を抽象化したものである。実施例２では、発話意図「紙詰まりの確認」及び発話意図「用紙残量の確認」の抽象化意図を「作業種別確認」と設定する。発話意図「作業結果ＯＫ」の抽象化意図を「作業結果ＯＫ」と設定し、発話意図「作業結果ＮＧ」の抽象化意図を「作業結果ＮＧ」と設定する。前述のいずれにも該当しない発話意図の抽象化意図は「その他」と設定する。 An attribute label is information indicating an abstraction intent. The abstract intent is an abstraction of the utterance intent. In the second embodiment, the abstract intention of the utterance intention "confirm paper jam" and the utterance intention "confirm remaining amount of paper" is set to "confirm work type". The abstract intention of the utterance intention "work result OK" is set to "work result OK", and the abstract intention of the utterance intention "work result NG" is set to "work result NG". The abstract intention of the utterance intention that does not correspond to any of the above is set as "other".

属性ラベルの時系列データは、例えば、図１０に示すような時系列データ１０００である。時系列データ１０００は、順番１００１及び属性ラベル１００２から構成されるレコードを格納する。順番１００１は順番４０１と同一のフィールドである。 The time-series data of attribute labels is, for example, the time-series data 1000 shown in FIG. 10. The time-series data 1000 stores records each composed of an order 1001 and an attribute label 1002. The order 1001 is the same field as the order 401.

属性ラベル１００２は、作業種別毎の抽象化意図を示す値を格納するフィールドである。Ｉ１は「作業種別確認」を表し、Ｉ２は「作業結果ＯＫ」を表し、Ｉ３は「作業結果ＮＧ」を表し、Ｉ４は「その他」を表す。各抽象化意図には「０」及び「１」のいずれかが設定される。「１」が設定された抽象化意図が、作業種別の抽象化意図であることを示す。全ての作業種別の抽象化意図が「その他」の場合、発話の抽象化意図は「その他」になる。 The attribute label 1002 is a field that stores values indicating the abstract intention for each work type. I1 represents "work type confirmation", I2 represents "work result OK", I3 represents "work result NG", and I4 represents "other". Each abstract intention is set to either "0" or "1". The abstract intention set to "1" is the abstract intention for that work type. If the abstract intention is "other" for every work type, the abstract intention of the utterance is "other".
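The encoding described above can be sketched as a one-hot vector over I1 to I4 for each work type. The dictionary layout and function names are assumptions made for illustration; only the I1 to I4 meanings and the "all other" rule come from the description.

```python
ABSTRACT_INTENTS = ["I1", "I2", "I3", "I4"]  # confirmation, result OK, result NG, other


def one_hot(intent):
    """Encode one abstract intention as a 0/1 vector over I1..I4."""
    return [1 if name == intent else 0 for name in ABSTRACT_INTENTS]


def utterance_abstract_intent(label_by_work_type):
    """Return (work_type, intent) for the first work type whose intent is not I4.

    If every work type is labeled I4 ("other"), the utterance itself is "other".
    """
    for work_type, vector in label_by_work_type.items():
        intent = ABSTRACT_INTENTS[vector.index(1)]
        if intent != "I4":
            return work_type, intent
    return None, "I4"


# Hypothetical attribute label for one utterance.
label = {
    "paper jam check": one_hot("I1"),
    "remaining paper check": one_hot("I4"),
}
all_other = {
    "paper jam check": one_hot("I4"),
    "remaining paper check": one_hot("I4"),
}
```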

実施例２では、データセット（発話の時系列データ）の数が少ないものとする。この場合、生成される学習データ２００の数が少ない。精度の高い意図理解モデルを生成するためには、質の高い学習データを多く用意する必要がある。 In the second embodiment, it is assumed that the number of data sets (time-series data of utterances) is small. In this case, the number of generated learning data 200 is small. In order to generate a highly accurate intent understanding model, it is necessary to prepare a large amount of high-quality training data.

次に、計算機１００は、分析用の発話の時系列データを用いた統計分析を実行し、分析結果に基づいて発話パターン情報１１００を生成する（ステップＳ１５２）。具体的には、以下のような処理が実行される。 Next, the computer 100 performs statistical analysis using the time-series data of utterances for analysis and generates the utterance pattern information 1100 based on the analysis results (step S152). Specifically, the following processing is executed.

（Ｓ１５２－１）学習データ生成部１２２は、各データセットに含まれる発話の時系列データ及び属性ラベルの時系列データに基づいて、作業種別毎に、作業の開始から終了までの発話群を生成する。一つのデータセットに対して、作業種別毎の発話群が生成される。なお、一つのデータセットに含まれる発話の時系列データに、複数回実行された同一作業の発話が含まれる場合、一つの分析用の発話のデータセットから作業種別が同一である発話群が複数生成される。 (S152-1) Based on the time-series data of utterances and the time-series data of attribute labels included in each data set, the learning data generation unit 122 generates, for each work type, an utterance group spanning the start to the end of the work. An utterance group for each work type is generated for one data set. Note that if the time-series data of utterances in one data set includes utterances of the same work performed multiple times, multiple utterance groups of the same work type are generated from that one data set of utterances for analysis.

例えば、図４Ａに示すデータ構造の分析用の発話の時系列データの場合、「紙詰まりの確認作業」について、順番４０１が「２」、「３」のレコードの発話群が生成される。また、「用紙残量の確認作業」について、順番４０１が「４」、「５」のレコードの発話群が生成される。 For example, in the case of time-series data of utterances for analysis of the data structure shown in FIG. 4A, a group of utterances of records with the order 401 of "2" and "3" is generated for "checking paper jam". In addition, regarding the “work to check the remaining amount of paper”, an utterance group of records whose order 401 is “4” and “5” is generated.
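The grouping in S152-1 can be sketched as follows. This assumes that a group opens at a "work type confirmation" utterance and closes at the next "work result OK" or "work result NG" utterance, which is one reading of "from the start to the end of the work". The record keys and the sample rows are hypothetical and are not the contents of FIG. 4A.

```python
RESULT_INTENTS = ("work result OK", "work result NG")


def build_utterance_groups(records):
    """Split chronological records into per-work utterance groups.

    records: list of dicts with 'order', 'work_type', 'abstract_intent'.
    A new confirmation before a result discards the unfinished group
    (a simplification chosen for this sketch).
    """
    groups = []
    current = None
    for rec in records:
        intent = rec["abstract_intent"]
        if intent == "work type confirmation":
            current = {"work_type": rec["work_type"], "orders": [rec["order"]]}
        elif current is not None:
            current["orders"].append(rec["order"])
            if intent in RESULT_INTENTS:
                groups.append(current)
                current = None
    return groups


records = [
    {"order": 1, "work_type": None, "abstract_intent": "other"},
    {"order": 2, "work_type": "paper jam check", "abstract_intent": "work type confirmation"},
    {"order": 3, "work_type": "paper jam check", "abstract_intent": "work result OK"},
    {"order": 4, "work_type": "remaining paper check", "abstract_intent": "work type confirmation"},
    {"order": 5, "work_type": "remaining paper check", "abstract_intent": "work result OK"},
]
groups = build_utterance_groups(records)
```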

（Ｓ１５２－２）学習データ生成部１２２は、ターゲット作業種別を選択する。 (S152-2) The learning data generator 122 selects a target work type.

（Ｓ１５２－３）学習データ生成部１２２は、複数のデータセットのターゲット作業種別の発話群を取得し、発話群に含まれる発話の属性ラベルに基づいて発話パターンを生成する。発話パターンは、抽象化意図の遷移を示す情報である。例えば、順番４０１が「２」、「３」のレコードから「作業種別確認」から「作業結果ＯＫ」への遷移が、発話パターンとして生成される。 (S152-3) The learning data generation unit 122 acquires an utterance group of the target work type from a plurality of data sets, and generates an utterance pattern based on the attribute labels of the utterances included in the utterance group. The utterance pattern is information indicating the transition of abstract intention. For example, a transition from 'work type confirmation' to 'work result OK' is generated as an utterance pattern from records whose order 401 is '2' and '3'.
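A minimal sketch of deriving an utterance pattern from the abstract intentions of one utterance group is shown below. Collapsing consecutive "other" utterances into a single entry such as "other (2)" follows the notation used for FIG. 11; treating the count as a run of consecutive utterances is an assumption.

```python
from itertools import groupby


def utterance_pattern(abstract_intents, other="other"):
    """Turn a chronological list of abstract intentions into a pattern tuple.

    Runs of the 'other' intention are collapsed to "other" or "other (n)";
    all remaining intentions are kept one entry per utterance.
    """
    pattern = []
    for intent, run in groupby(abstract_intents):
        n = len(list(run))
        if intent == other:
            pattern.append(other if n == 1 else f"{other} ({n})")
        else:
            pattern.extend([intent] * n)
    return tuple(pattern)


pattern = utterance_pattern(
    ["work type confirmation", "other", "other", "work result OK"]
)
```

Returning a tuple makes the pattern hashable, so identical patterns from different data sets can be counted directly.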

（Ｓ１５２－４）学習データ生成部１２２は、複数のデータセットのターゲット作業種別の発話群に対応する分析用の発話間隔の時系列データに基づいて、発話パターンの抽象化意図に対応する発話の発話間隔を算出する。学習データ生成部１２２は、発話間隔を時系列順に並べることによって、発話パターンの発話間隔の時系列データを生成する。 (S152-4) Based on the time-series data of utterance intervals for analysis that corresponds to the utterance groups of the target work type in the plurality of data sets, the learning data generation unit 122 calculates the utterance interval of the utterance corresponding to each abstract intention in the utterance pattern. The learning data generation unit 122 generates time-series data of the utterance intervals of the utterance pattern by arranging the utterance intervals in chronological order.

なお、ターゲット作業種別の発話パターンの対話群の数が少ない場合、発話パターンの対話群の数が多い他の作業種別の算出結果を用いてもよい。例えば、学習データ生成部１２２は、図１２に示すように、他の作業種別のある発話パターンについて発話間隔の分散を算出し、算出結果に基づいてターゲット作業種別のある発話パターンの発話間隔を算出する。なお、「作業種別確認」及び「作業結果ＯＫ」のみから発話パターン又は「作業種別確認」及び「作業結果ＮＧ」のみから構成される発話パターン等、特定の発話パターンにのみ上記の処理を適用してもよい。 Note that if the number of dialogue groups for an utterance pattern of the target work type is small, the calculation results of another work type that has a larger number of dialogue groups for the utterance pattern may be used. For example, as shown in FIG. 12, the learning data generation unit 122 calculates the variance of the utterance intervals for a given utterance pattern of the other work type and, based on that result, calculates the utterance intervals of a given utterance pattern of the target work type. This processing may be applied only to specific utterance patterns, such as an utterance pattern consisting only of "work type confirmation" and "work result OK" or one consisting only of "work type confirmation" and "work result NG".

（Ｓ１５２－５）学習データ生成部１２２は、全ての作業種別について処理が完了したか否かを判定する。全ての作業種別について処理が完了していない場合、学習データ生成部１２２は、Ｓ１５２－２に戻り、同様の処理を実行する。 (S152-5) The learning data generation unit 122 determines whether or not processing has been completed for all work types. If the processing has not been completed for all work types, the learning data generation unit 122 returns to S152-2 and performs similar processing.

（Ｓ１５２－６）全ての作業種別について処理が完了した場合、学習データ生成部１２２は、発話パターン情報１１００を初期化する。 (S152-6) When the processing has been completed for all work types, the learning data generation unit 122 initializes the utterance pattern information 1100.

具体的には、学習データ生成部１２２は、発話パターン情報１１００に、生成された発話パターンの種別と同数のレコードを追加する。学習データ生成部１２２は、発話パターン情報１１００の各レコードのＩＤ１１０１に識別情報を設定し、確率１１０３及び発話間隔１１０４の各々に作業種別と同数の列を設定する。学習データ生成部１２２は、発話パターン情報１１００の各レコードの発話パターン１１０２に発話パターンを設定する。図１１に示す発話パターンに含まれる「その他（２）」は、抽象化意図「その他」の発話が二つ含まれることを示す。 Specifically, the learning data generation unit 122 adds to the utterance pattern information 1100 as many records as there are types of generated utterance patterns. The learning data generation unit 122 sets identification information in the ID 1101 of each record of the utterance pattern information 1100 and sets, in each of the probability 1103 and the utterance interval 1104, as many columns as there are work types. The learning data generation unit 122 sets an utterance pattern in the utterance pattern 1102 of each record of the utterance pattern information 1100. "Other (2)" in the utterance patterns shown in FIG. 11 indicates that two utterances with the abstract intention "other" are included.

（Ｓ１５２－７）学習データ生成部１２２は、ターゲット作業種別を選択する。 (S152-7) The learning data generator 122 selects a target work type.

（Ｓ１５２－８）学習データ生成部１２２は、ターゲット作業種別の対話群に基づいて各発話パターンの出現確率を算出する。学習データ生成部１２２は、発話パターン情報１１００の各レコードの確率１１０３のターゲット作業種別の列に出現確率を設定する。 (S152-8) The learning data generator 122 calculates the appearance probability of each utterance pattern based on the dialogue group of the target work type. The learning data generation unit 122 sets the appearance probability in the target work type column of the probability 1103 of each record of the utterance pattern information 1100 .
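The appearance probability in S152-8 can be sketched as a relative frequency over the patterns observed for one work type. Using plain relative frequency is an assumption, since the description does not give the formula, and the sample patterns are invented.

```python
from collections import Counter


def appearance_probabilities(patterns):
    """Relative frequency of each utterance pattern for one work type.

    patterns: list of hashable patterns, one per utterance (dialogue) group.
    """
    counts = Counter(patterns)
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}


direct = ("work type confirmation", "work result OK")
with_other = ("work type confirmation", "other (2)", "work result OK")
probabilities = appearance_probabilities([direct, direct, direct, with_other])
```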

（Ｓ１５２－９）学習データ生成部１２２は、発話パターン情報１１００の各レコードの発話間隔１１０４のターゲット作業種別の列に、ターゲット作業種別の発話パターンの発話間隔の時系列データを設定する。発話間隔はセミコロンで区切られている。 (S152-9) The learning data generation unit 122 sets the time-series data of the utterance intervals of the utterance pattern of the target work type in the target work type column of the utterance interval 1104 of each record of the utterance pattern information 1100. The utterance intervals are separated by semicolons.
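Storing and reading back the semicolon-separated field can be sketched as below. The helper names are hypothetical, and representing intervals as seconds in decimal notation is an assumption; the description only states that semicolons are the separator.

```python
def format_intervals(intervals):
    """Serialize utterance intervals into one semicolon-separated field."""
    return ";".join(str(value) for value in intervals)


def parse_intervals(field):
    """Parse a semicolon-separated field back into a list of floats."""
    return [float(value) for value in field.split(";")] if field else []


field = format_intervals([0.0, 2.5, 4.0])
restored = parse_intervals(field)
```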

（Ｓ１５２－１０）学習データ生成部１２２は、全ての作業種別について処理が完了したか否かを判定する。全ての作業種別について処理が完了していない場合、学習データ生成部１２２は、Ｓ１５２－７に戻り、同様の処理を実行する。全ての作業種別について処理が完了した場合、学習データ生成部１２２は、ステップＳ１５２の処理を終了する。 (S152-10) The learning data generator 122 determines whether or not the processing has been completed for all work types. If the processing has not been completed for all work types, the learning data generation unit 122 returns to S152-7 and performs similar processing. When the processing has been completed for all work types, the learning data generation unit 122 ends the processing of step S152.

以上がステップＳ１５２の処理の説明である。 The above is the description of the processing in step S152.

次に、計算機１００は、作業種別毎の想定発話データの入力を受け付ける（ステップＳ１５３）。なお、想定発話データの入力は、ステップＳ１０１で行われてもよい。 Next, computer 100 receives input of assumed utterance data for each work type (step S153). Note that the input of assumed utterance data may be performed in step S101.

想定発話データは、発話及び発話意図から構成されるレコードを複数含む。なお、発話意図「その他」の発話については各作業種別で共通のものでもよい。 The assumed utterance data includes a plurality of records composed of utterances and utterance intentions. Note that the utterance of the utterance intention "other" may be common to each work type.

次に、計算機１００は、学習データ生成処理を実行する（ステップＳ１５４）。具体的には、以下のような処理が実行される。 Next, computer 100 executes learning data generation processing (step S154). Specifically, the following processing is executed.

（Ｓ１５４－１）学習データ生成部１２２は、ターゲット作業種別を選択する。 (S154-1) The learning data generator 122 selects a target work type.

（Ｓ１５４－２）学習データ生成部１２２は、ターゲット作業種別の発話パターン、発話パターンの発話間隔の時系列データ、及び想定発話データに基づいて、発話の時系列データ、発話間隔の時系列データ、及び正解ラベルの時系列データをひとまとまりとするデータセットを複数生成する。なお、各発話パターンから生成されるデータセットの数は、出現確率の比率と一致するように制御される。 (S154-2) Based on the utterance pattern of the target work type, the utterance interval time-series data of the utterance pattern, and the assumed utterance data, the learning data generation unit 122 generates utterance time-series data, utterance interval time-series data, And a plurality of data sets are generated in which the time-series data of the correct labels are grouped together. The number of data sets generated from each utterance pattern is controlled so as to match the appearance probability ratio.
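One way to keep the number of generated data sets per utterance pattern in line with the appearance probabilities is a largest-remainder allocation, sketched below. The allocation rule and the function name are assumptions; the description only requires that the counts match the probability ratio.

```python
def allocate_counts(probabilities, total):
    """Split `total` data sets across patterns in proportion to probability.

    Each pattern first gets the floor of its share; leftover data sets go to
    the patterns with the largest fractional remainders.
    """
    raw = {pattern: prob * total for pattern, prob in probabilities.items()}
    counts = {pattern: int(value) for pattern, value in raw.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(raw, key=lambda p: raw[p] - counts[p], reverse=True)
    for pattern in by_remainder[:leftover]:
        counts[pattern] += 1
    return counts


counts_8 = allocate_counts({"A": 0.75, "B": 0.25}, 8)
counts_10 = allocate_counts({"A": 0.75, "B": 0.25}, 10)
```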

一つのデータセットの発話の時系列データは、発話パターンにあわせて、想定発話データから発話を選択することによって生成できる。また、一つのデータセットの発話間隔の時系列データは、発話パターンの発話間隔をそのまま用いて生成されてもよいし、発話パターンの発話間隔に摂動を加えた値を用いて生成されてもよい。摂動の幅は、発話間隔の確率に基づいて設定することができる。一つのデータセットの正解ラベルの時系列データは、発話パターンの抽象化意図をターゲット作業種別に対応して発話意図に変換し、発話意図を時系列順に並べることによって生成できる。 The time-series data of utterances for one data set can be generated by selecting utterances from the assumed utterance data in accordance with the utterance pattern. The time-series data of utterance intervals for one data set may be generated using the utterance intervals of the utterance pattern as they are, or using values obtained by adding a perturbation to those utterance intervals. The width of the perturbation can be set based on the probability of the utterance intervals. The time-series data of correct labels for one data set can be generated by converting the abstract intentions of the utterance pattern into the utterance intentions corresponding to the target work type and arranging the utterance intentions in chronological order.
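The generation of one synthetic data set can be sketched as follows. The uniform multiplicative jitter stands in for the perturbation, whose exact form the description leaves open; the function name, parameters, and sample data are all hypothetical. The pattern is assumed to be expanded to one abstract intention per utterance.

```python
import random


def generate_dataset(pattern, intervals, assumed_utterances, intent_map,
                     jitter=0.2, rng=None):
    """Generate (texts, intervals, labels) for one data set.

    pattern: abstract intentions, one per utterance, in chronological order.
    intervals: utterance interval in seconds for each pattern element.
    assumed_utterances: dict mapping utterance intention -> candidate texts.
    intent_map: abstract intention -> utterance intention for the target work
        type; unmapped abstract intentions become "other".
    jitter: relative width of the uniform perturbation (an assumption).
    """
    rng = rng or random.Random()
    texts, out_intervals, labels = [], [], []
    for abstract, interval in zip(pattern, intervals):
        intent = intent_map.get(abstract, "other")
        texts.append(rng.choice(assumed_utterances[intent]))
        labels.append(intent)
        scale = 1.0 + rng.uniform(-jitter, jitter)
        out_intervals.append(max(0.0, interval * scale))
    return texts, out_intervals, labels


assumed = {
    "paper jam check": ["checking for a paper jam", "looking for jammed paper"],
    "work result OK": ["no problem", "looks fine"],
    "other": ["it is hot today", "where is the screwdriver"],
}
mapping = {
    "work type confirmation": "paper jam check",
    "work result OK": "work result OK",
}
base_intervals = [0.0, 2.0, 5.0]
texts, new_intervals, labels = generate_dataset(
    ["work type confirmation", "other", "work result OK"],
    base_intervals, assumed, mapping, rng=random.Random(0),
)
```

Passing a seeded `random.Random` keeps generation reproducible, which helps when comparing models trained on different augmentation settings.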

（Ｓ１５４－３）学習データ生成部１２２は、一つのデータセットを用いて発話、正解ラベル、及び発話間隔から構成されるレコードを複数含む学習データを生成する。当該処理はステップＳ１０２の処理と同一である。 (S154-3) The learning data generating unit 122 generates learning data including a plurality of records composed of utterances, correct labels, and utterance intervals using one data set. The processing is the same as the processing in step S102.

（Ｓ１５４－４）学習データ生成部１２２は、全ての作業種別について処理が完了したか否かを判定する。全ての作業種別について処理が完了していない場合、学習データ生成部１２２は、Ｓ１５４－１に戻り、同様の処理を実行する。全ての作業種別について処理が完了した場合、学習データ生成部１２２はステップＳ１５４の処理を終了する。 (S154-4) The learning data generation unit 122 determines whether or not processing has been completed for all work types. If the processing has not been completed for all work types, the learning data generation unit 122 returns to S154-1 and performs similar processing. When the processing has been completed for all work types, the learning data generation unit 122 ends the processing of step S154.

以上がステップＳ１５４の処理の説明である。 The above is the description of the processing in step S154.

ステップＳ１０３及びステップＳ１０４の処理は、実施例１で説明した処理と同一である。 The processing of steps S103 and S104 is the same as the processing described in the first embodiment.

実施例２の作業推定処理は、実施例１の作業推定処理と同一であるため説明を省略する。 Since the work estimation process of the second embodiment is the same as the work estimation process of the first embodiment, the description thereof is omitted.

実施例２によれば、計算機１００は、少数の学習データから新たな学習データを生成することができる。学習データの数を増やすことによって、学習処理によって生成される意図理解モデルの予測精度を高めることできる。 According to the second embodiment, the computer 100 can generate new learning data from a small number of learning data. By increasing the number of learning data, the prediction accuracy of the intent understanding model generated by the learning process can be improved.

なお、本発明は上記した実施例に限定されるものではなく、様々な変形例が含まれる。また、例えば、上記した実施例は本発明を分かりやすく説明するために構成を詳細に説明したものであり、必ずしも説明した全ての構成を備えるものに限定されるものではない。また、各実施例の構成の一部について、他の構成に追加、削除、置換することが可能である。 In addition, the present invention is not limited to the above-described embodiments, and includes various modifications. Further, for example, the above-described embodiments are detailed descriptions of the configurations for easy understanding of the present invention, and are not necessarily limited to those having all the described configurations. Moreover, it is possible to add, delete, or replace a part of the configuration of each embodiment with another configuration.

また、上記の各構成、機能、処理部、処理手段等は、それらの一部又は全部を、例えば集積回路で設計する等によりハードウェアで実現してもよい。また、本発明は、実施例の機能を実現するソフトウェアのプログラムコードによっても実現できる。この場合、プログラムコードを記録した記憶媒体をコンピュータに提供し、そのコンピュータが備えるプロセッサが記憶媒体に格納されたプログラムコードを読み出す。この場合、記憶媒体から読み出されたプログラムコード自体が前述した実施例の機能を実現することになり、そのプログラムコード自体、及びそれを記憶した記憶媒体は本発明を構成することになる。このようなプログラムコードを供給するための記憶媒体としては、例えば、フレキシブルディスク、ＣＤ－ＲＯＭ、ＤＶＤ－ＲＯＭ、ハードディスク、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）、光ディスク、光磁気ディスク、ＣＤ－Ｒ、磁気テープ、不揮発性のメモリカード、ＲＯＭなどが用いられる。 Further, some or all of the above configurations, functions, processing units, processing means, and the like may be realized in hardware, for example by designing them as an integrated circuit. The present invention can also be realized by software program code that implements the functions of the embodiments. In this case, a storage medium on which the program code is recorded is provided to a computer, and a processor of the computer reads the program code stored in the storage medium. The program code read from the storage medium then itself implements the functions of the above-described embodiments, and the program code and the storage medium storing it constitute the present invention. Storage media used to supply such program code include, for example, flexible disks, CD-ROMs, DVD-ROMs, hard disks, SSDs (Solid State Drives), optical disks, magneto-optical disks, CD-Rs, magnetic tapes, nonvolatile memory cards, and ROMs.

また、本実施例に記載の機能を実現するプログラムコードは、例えば、アセンブラ、Ｃ／Ｃ＋＋、ｐｅｒｌ、Ｓｈｅｌｌ、ＰＨＰ、Ｐｙｔｈｏｎ、Ｊａｖａ（登録商標）等の広範囲のプログラム又はスクリプト言語で実装できる。 Also, the program code that implements the functions described in this embodiment can be implemented in a wide range of programs or scripting languages such as assembler, C/C++, perl, Shell, PHP, Python, and Java (registered trademark).

さらに、実施例の機能を実現するソフトウェアのプログラムコードを、ネットワークを介して配信することによって、それをコンピュータのハードディスクやメモリ等の記憶手段又はＣＤ－ＲＷ、ＣＤ－Ｒ等の記憶媒体に格納し、コンピュータが備えるプロセッサが当該記憶手段や当該記憶媒体に格納されたプログラムコードを読み出して実行するようにしてもよい。 Furthermore, by distributing the program code of the software that implements the functions of the embodiment via a network, it can be stored in storage means such as a hard disk or memory of a computer, or in a storage medium such as a CD-RW or CD-R. Alternatively, a processor provided in the computer may read and execute the program code stored in the storage means or the storage medium.

上述の実施例において、制御線や情報線は、説明上必要と考えられるものを示しており、製品上必ずしも全ての制御線や情報線を示しているとは限らない。全ての構成が相互に接続されていてもよい。 In the above-described embodiments, the control lines and information lines indicate those considered necessary for explanation, and not all control lines and information lines are necessarily indicated on the product. All configurations may be interconnected.

１００計算機 100 computer
１０１マイク 101 microphone
１１０プロセッサ 110 processor
１１１主記憶装置 111 main storage device
１１２副記憶装置 112 secondary storage device
１１３接続インタフェース 113 connection interface
１２０音声認識部 120 speech recognition unit
１２１作業推定部 121 work estimation unit
１２２学習データ生成部 122 learning data generation unit
１２３学習部 123 learning unit
１３０音声認識モデル情報 130 speech recognition model information
１３１意図理解モデル情報 131 intention understanding model information
２００学習データ 200 learning data
４００、４１０、４２０、１０００時系列データ 400, 410, 420, 1000 time-series data
８００中間出力情報 800 intermediate output information
１１００発話パターン情報 1100 utterance pattern information

Claims

A computer system for estimating work performed by a worker based on utterances, comprising:
at least one computer having a processor and a memory connected to the processor,
wherein the processor:
converts voice data of utterances uttered by the worker into text, generates time-series data of utterances in which a plurality of the texts are arranged in chronological order, and stores the time-series data of the utterances in the memory;
calculates utterance intervals between the utterances from a plurality of the voice data that are consecutive in the time series, generates time-series data of utterance intervals by arranging the utterance intervals in chronological order, and stores the time-series data of the utterance intervals in the memory; and
estimates the content and result of the work performed by the worker based on the time-series data of the utterances and the time-series data of the utterance intervals, and outputs the result of the estimation.

The computer system according to claim 1,
wherein the processor:
calculates sound pressure from the voice data, generates time-series data of sound pressure in which the sound pressure corresponding to each of the plurality of texts included in the time-series data of the utterances is arranged in chronological order, and stores the time-series data of the sound pressure in the memory; and
estimates the content and result of the work performed by the worker based on the time-series data of the utterances, the time-series data of the utterance intervals, and the time-series data of the sound pressure.

The computer system according to claim 1,
wherein the processor:
receives, for each work type, time-series data of utterances for analysis, time-series data of utterance intervals for analysis associated with the time-series data of utterances for analysis, and time-series data of labels indicating the utterance intentions of the utterances included in the time-series data of utterances for analysis;
receives utterance data, which is information on assumed utterances for each work type;
performs statistical analysis using the time-series data of utterances for analysis, the time-series data of utterance intervals for analysis, and the time-series data of labels, calculates an utterance pattern indicating a transition of utterance intentions, an appearance probability of the utterance pattern, and time-series data of utterance intervals of the utterance pattern in which the utterance intervals between utterances corresponding to the utterance intentions in the utterance pattern are arranged in chronological order, and stores the utterance pattern, the appearance probability of the utterance pattern, and the time-series data of the utterance intervals of the utterance pattern in the memory;
generates time-series data of utterances for learning using the utterance pattern, the appearance probability of the utterance pattern, and the utterance data, and stores the time-series data of utterances for learning in the memory;
generates time-series data of utterance intervals for learning using the time-series data of the utterance intervals of the utterance pattern, and stores the time-series data of utterance intervals for learning in the memory;
generates time-series data of correct labels indicating combinations of correct work content and results based on the utterance pattern, and stores the time-series data of correct labels in the memory;
generates learning data composed of the time-series data of utterances for learning, the time-series data of utterance intervals for learning, and the time-series data of correct labels, and stores the learning data in the memory; and
executes a learning process for generating a model that takes time-series data of utterances and time-series data of utterance intervals as input and outputs the work content and result corresponding to the time-series data of correct labels, and stores information on the model in the memory.

The computer system according to claim 1,
wherein the utterance interval is at least one of a time interval between a reference utterance and the utterance immediately preceding the reference utterance in chronological order, and a time interval between words included in an utterance.

A work estimation method, executed by a computer system, for estimating work performed by a worker based on utterances,
the computer system including at least one computer having a processor and a memory connected to the processor,
the work estimation method comprising:
a first step in which the processor converts voice data of utterances uttered by the worker into text, generates time-series data of utterances in which a plurality of the texts are arranged in chronological order, and stores the time-series data of the utterances in the memory;
a second step in which the processor calculates utterance intervals between the utterances from a plurality of the voice data that are consecutive in the time series, generates time-series data of utterance intervals by arranging the utterance intervals in chronological order, and stores the time-series data of the utterance intervals in the memory; and
a third step in which the processor estimates the content and result of the work performed by the worker based on the time-series data of the utterances and the time-series data of the utterance intervals, and outputs the result of the estimation.

The work estimation method according to claim 5,
wherein the first step includes a step in which the processor calculates sound pressure from the voice data, generates time-series data of sound pressure in which the sound pressure corresponding to each of the plurality of texts included in the time-series data of the utterances is arranged in chronological order, and stores the time-series data of the sound pressure in the memory, and
the third step includes a step in which the processor estimates the content and result of the work performed by the worker based on the time-series data of the utterances, the time-series data of the utterance intervals, and the time-series data of the sound pressure.

The work estimation method according to claim 5, further comprising:
a step in which the processor receives, for each work type, time-series data of utterances for analysis, time-series data of utterance intervals for analysis associated with the time-series data of utterances for analysis, and time-series data of labels indicating the utterance intentions of the utterances included in the time-series data of utterances for analysis;
a step in which the processor receives utterance data, which is information on assumed utterances for each work type;
a step in which the processor performs statistical analysis using the time-series data of utterances for analysis, the time-series data of utterance intervals for analysis, and the time-series data of labels, calculates an utterance pattern indicating a transition of utterance intentions, an appearance probability of the utterance pattern, and time-series data of utterance intervals of the utterance pattern in which the utterance intervals between utterances corresponding to the utterance intentions in the utterance pattern are arranged in chronological order, and stores the utterance pattern, the appearance probability of the utterance pattern, and the time-series data of the utterance intervals of the utterance pattern in the memory;
a step in which the processor generates time-series data of utterances for learning using the utterance pattern, the appearance probability of the utterance pattern, and the utterance data, and stores the time-series data of utterances for learning in the memory;
a step in which the processor generates time-series data of utterance intervals for learning using the time-series data of the utterance intervals of the utterance pattern, and stores the time-series data of utterance intervals for learning in the memory;
a step in which the processor generates time-series data of correct labels indicating combinations of correct work content and results based on the utterance pattern, and stores the time-series data of correct labels in the memory;
a step in which the processor generates learning data composed of the time-series data of utterances for learning, the time-series data of utterance intervals for learning, and the time-series data of correct labels, and stores the learning data in the memory; and
a step in which the processor executes a learning process for generating a model that takes time-series data of utterances and time-series data of utterance intervals as input and outputs the work content and result corresponding to the time-series data of correct labels, and stores information on the model in the memory.

The work estimation method according to claim 5,
wherein the utterance interval is at least one of a time interval between a reference utterance and the utterance immediately preceding the reference utterance in chronological order, and a time interval between words included in an utterance.