JP6994874B2 - Annotation device and noise measurement system

Info

Publication number: JP6994874B2
Application number: JP2017166535A
Authority: JP
Inventors: 俊也大島; 大介内藤; 学人砂子; 康貴中島
Original assignee: Rion Co Ltd
Current assignee: Rion Co Ltd
Priority date: 2017-08-31
Filing date: 2017-08-31
Publication date: 2022-01-14
Anticipated expiration: 2037-08-31
Also published as: JP2019046018A

Description

Application of Article 30, Paragraph 2 of the Patent Act: Acoustical Society of Japan (General Incorporated Association), Proceedings of the 2017 Spring Meeting of the Acoustical Society of Japan, pp. 795-798, March 1, 2017; 2017 Spring Meeting of the Acoustical Society of Japan, March 15, 2017.

The present invention relates to an annotation device and a noise measurement system.

Noise measurement of environmental sound handles large amounts of data, so automatic sound source identification is required to reduce the labor of separating the target sound sources from other sound sources. One known sound source type identification device includes a determination means having a neural network; it feeds data obtained by dividing the power spectrum distribution of the input sound into subbands to the neural network and generates a sound source type signal as the network's output (see, for example, Patent Document 1).

Japanese Unexamined Patent Publication No. 2001-33304

To identify sound source types accurately with a classifier such as a neural network, machine learning requires a large amount of teacher data (pairs of acoustic data and sound source type). The work of attaching to such input data the correct sound source type (label) corresponding to it is called annotation.

To prepare teacher data for a classifier that identifies sound source types from environmental sound, a person usually plays back a long recording of the environmental sound, listens to it, identifies the sound source type of the reproduced sound at each point in time, and labels it accordingly. When the situation at the site is hard to grasp from the reproduced sound alone and the sound source type cannot be identified in a single listening, the person listens repeatedly until it can be identified. Preparing teacher data for machine learning of such a classifier therefore takes an enormous amount of time and labor.

In particular, when the reproduced sound contains overlapping sounds from several sound sources, auditory information alone may not be enough to identify the types of those sources accurately. Furthermore, in outdoor propagation, acoustic characteristics may change because of reflection and diffraction of sound by surrounding buildings, so annotation based on hearing alone may not label the data accurately.

The present invention has been made in view of the above problems, and its object is to obtain an annotation device and a noise measurement system that reduce the time and labor of annotation for generating teacher data for a classifier that identifies the type of a target sound, such as the sound source type, while securing a sufficient amount of training data.

The annotation device according to the present invention generates teacher data for machine learning of a classifier that outputs, from input data including frequency spectrum data of the acoustic signal of a target sound, output data indicating the classification of that target sound. The device includes: a sound collecting device that collects environmental sound; an input device that detects real-time user operations by which a user listening to the target environmental sound at the sound collection site indicates the sound source type; a sound processing unit that generates the input data of the target sound from the acoustic signal of the environmental sound obtained by the sound collecting device; and a labeling unit that identifies the classification corresponding to the user operation detected by the input device, associates output data indicating the identified classification with the input data as a label, and uses the pair of the input data and the associated output data as teacher data. The annotation device further has the following configuration (A) or (B).

(A) The labeling unit receives, in real time, the input data generated by the sound processing unit and associates the identified classification, as a label, with the input data of the period during which the user operation was detected by the input device.

(B) The device further includes a classifier that outputs, from the input data generated by the sound processing unit, output data indicating the posterior probability of each of a plurality of predetermined classifications, and a display processing unit that, for each classification indicated by the output data of the classifier, causes the display device to display the posterior probability along the time series as a posterior probability waveform and to display a candidate section along each section of the waveform in which the posterior probability exceeds a predetermined threshold. The input device detects user operations on the candidate sections, and the labeling unit confirms the one or more candidate sections operated by the user, identifies the one or more classifications corresponding to the confirmed candidate sections, and associates output data indicating those classifications with the input data as a label.

The noise measurement system according to the present invention includes the annotation device described above, performs machine learning of a classifier with the teacher data generated by the annotation device, and uses the machine-learned classifier to generate, from input data including frequency spectrum data of the acoustic signal of a target sound, output data indicating the sound source type of the target sound.

According to the present invention, the sound source type of the collected environmental sound can be recorded at the sound collection site at the time of listening, so an annotation device and a noise measurement system are obtained that reduce the time and labor of annotation for generating teacher data for a classifier that identifies the classification of a target sound, such as the sound source type.

FIG. 1 is a block diagram showing the configuration of an annotation device according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of the display screen in the first annotation mode.
FIG. 3 is a diagram showing an example of the display screen in the second annotation mode.
FIG. 4 is an enlarged view of part of the display screen of FIG. 3.
FIG. 5 is a diagram showing an example of sound source type identification results from a classifier machine-learned with the teacher data generated by the annotation device according to Embodiment 1.

Embodiments of the present invention are described below with reference to the drawings.

Embodiment 1.

FIG. 1 is a block diagram showing the configuration of an annotation device according to Embodiment 1 of the present invention. The annotation device shown in FIG. 1 generates teacher data for machine learning of a classifier that outputs, from input data including frequency spectrum data of the acoustic signal of a target sound, output data indicating the classification of the target sound. In Embodiment 1, the classification of the target sound is the sound source type.

The annotation device shown in FIG. 1 includes a sound collecting device 1, a display device 2, an input device 3, a communication device 4, a storage device 5, and an arithmetic processing unit 6.

The sound collecting device 1 is a microphone or the like that collects environmental sound.

The display device 2 is a liquid crystal display or the like that displays various information to the user.

The input device 3 detects real-time user operations by which a user who hears the environmental sound at the sound collection site indicates the sound source type at the time of listening. The input device 3 may be a keyboard or keypad with hard keys, or a touch panel or the like that forms soft keys together with key images displayed on the display device 2. Because the target sound sources differ from one sound collection site to another, the sound source type corresponding to each soft key is set in advance for each site.

The communication device 4 transmits teacher data and the like to an external device. A network interface or a peripheral device interface is used as the communication device 4.

The storage device 5 is a non-volatile storage device that stores teacher data and the like. A hard disk drive, flash memory, or the like is used as the storage device 5.

The arithmetic processing unit 6 is a computer including a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. It loads a program stored in the ROM, the storage device 5, or the like into the RAM and executes it on the CPU, thereby operating as various processing units.

Here, the arithmetic processing unit 6 includes a sound processing unit 11, a labeling unit 12, a data output unit 13, a display processing unit 14, a classifier 15, and a learning processing unit 16.

The sound processing unit 11 generates the input data of the teacher data from the acoustic signal of the target sound obtained by the sound collecting device 1. The sound processing unit 11 calculates frequency spectrum data of the acoustic signal and includes it in the input data. For example, the short-time Leq (short-time average sound pressure level) for each predetermined bandwidth (for example, 1/3 octave), as obtained by a sound level meter with a frequency analysis function, is used as the frequency spectrum data.
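As an illustration of the short-time Leq mentioned above, the following minimal sketch computes the level of one short frame of sound pressure samples. It assumes the frame has already been band-limited (the 1/3-octave filtering itself is omitted) and that samples are calibrated in pascals; the function name `short_time_leq` and the 100 ms frame length are hypothetical choices, not taken from the patent.

```python
import math

P_REF = 20e-6  # standard reference sound pressure, 20 micropascals


def short_time_leq(pressure_samples):
    """Return the equivalent continuous sound level (dB) of one short frame.

    Assumes the samples are sound pressures in pascals for a single band.
    """
    if not pressure_samples:
        raise ValueError("frame must contain at least one sample")
    mean_square = sum(p * p for p in pressure_samples) / len(pressure_samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence has no finite level
    return 10.0 * math.log10(mean_square / (P_REF * P_REF))


# Usage: a 1 kHz tone with 1 Pa RMS, 100 ms long, sampled at 48 kHz.
fs = 48000
frame = [math.sqrt(2.0) * math.sin(2.0 * math.pi * 1000.0 * n / fs)
         for n in range(4800)]
leq = short_time_leq(frame)  # about 94 dB
```

Repeating this calculation per band and per frame would give one spectrum vector for each time step, which is the shape of input data the text describes.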

The labeling unit 12 identifies the classification corresponding to the real-time user operation detected by the input device 3, associates output data indicating the identified classification with the input data as a label, and uses the pair of the input data and the output data associated with it as teacher data.

The data output unit 13 transmits the teacher data generated by the labeling unit 12 to the outside through the communication device 4, or stores it in the storage device 5.

The display processing unit 14 causes the display device 2 to display various information, key images of the soft keys, and the like.

The annotation device according to Embodiment 1 has a first annotation mode and a second annotation mode as operation modes, and the user can select either mode with the input device 3.

In the first annotation mode, a plurality of keys of the input device 3, assigned to a plurality of predetermined classifications, are used. The keys may be hard keys or soft keys. In the first annotation mode, the labeling unit 12 identifies the one or more keys operated by the user, identifies the one or more classifications corresponding to those keys, and associates output data indicating those classifications with the input data as a label. While a key is held down, the classification corresponding to that key continues to be identified.

FIG. 2 is a diagram showing an example of the display screen in the first annotation mode.

As shown in FIG. 2, in the first annotation mode, various time-series data covering the span from the current time back to a point a predetermined time (for example, 3 minutes) in the past are displayed on the display device 2 and updated at predetermined intervals (for example, every 1 second). For example, as shown in FIG. 2, time-series data 51 of the noise level, time-series data 52 of the frequency spectrum, and time-series data 53 of the labeling results for the predetermined sound source types are displayed. A soft key array 54 containing soft keys corresponding to the predetermined sound source types is also displayed. In the time-series data 53, within the display area for each sound source type, the periods during which the key for that sound source type was pressed are shown in a specific color.

In the first annotation mode, the labeling unit 12 detects presses of one or more soft keys in the soft key array 54 and associates the sound source type corresponding to each detected soft key, as a label, with the input data of the period during which that soft key is pressed. At predetermined time intervals, the display processing unit 14 repeatedly acquires the noise-level time-series data 51 and the frequency-spectrum time-series data 52 from the sound processing unit 11 and the time-series data 53 of labeling results for the predetermined sound source types from the labeling unit 12, and displays them on the display device 2 as shown in FIG. 2.
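The rolling display described above (a fixed look-back window refreshed at a fixed interval) can be sketched with a bounded queue. This is only one possible arrangement under stated assumptions: the class name `RollingDisplayBuffer`, the row layout, and the 33-element spectrum are hypothetical, while the 3-minute window and 1-second update are the example values given in the text.

```python
from collections import deque

WINDOW_SECONDS = 180  # "for example, 3 minutes" in the text
UPDATE_SECONDS = 1    # "for example, 1 second" in the text


class RollingDisplayBuffer:
    """Keeps only the most recent window of rows for display."""

    def __init__(self, window_seconds=WINDOW_SECONDS,
                 update_seconds=UPDATE_SECONDS):
        # A deque with maxlen discards the oldest row automatically.
        self._rows = deque(maxlen=window_seconds // update_seconds)

    def push(self, noise_level_db, spectrum, active_labels):
        """Append one update: level, spectrum, and currently pressed labels."""
        self._rows.append(
            (noise_level_db, tuple(spectrum), frozenset(active_labels)))

    def snapshot(self):
        """Return rows ordered oldest to newest, ready to be drawn."""
        return list(self._rows)


# Usage: push 200 one-second updates; only the last 180 remain.
buf = RollingDisplayBuffer()
for t in range(200):
    buf.push(60.0 + t * 0.1, [0.0] * 33, {"car"} if t % 2 else set())
rows = buf.snapshot()
```

A bounded deque keeps memory constant no matter how long the measurement runs, which suits a display that only ever shows the latest window.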

In the second annotation mode, on the other hand, the classifier 15 outputs, from the input data generated by the sound processing unit 11, output data indicating the posterior probability (a value from 0 to 1) of each of a plurality of predetermined classifications. For each classification (here, sound source type) indicated by the output data of the classifier 15, the display processing unit 14 causes the display device 2 to display the posterior probability along the time series as a posterior probability waveform and to display a candidate section along each section of the waveform in which the posterior probability exceeds a predetermined threshold (for example, 0.25).

FIG. 3 is a diagram showing an example of the display screen in the second annotation mode. FIG. 4 is an enlarged view of part of the display screen of FIG. 3.

As shown in FIG. 3, in the second annotation mode, various time-series data covering the span from the current time back to a point a predetermined time (for example, 3 minutes) in the past are displayed on the display device 2 and updated at predetermined intervals (for example, every 1 second). For example, as shown in FIG. 3, time-series data 61 of the noise level, time-series data 62 of the frequency spectrum, and a posterior probability waveform 63 for each sound source type are displayed. Furthermore, as shown in FIG. 4, display areas 64a to 64g for candidate sections, one per sound source type, are provided adjacent to the display areas 63a to 63g of the posterior probability waveforms 63, and a candidate section is displayed for each section of a waveform in which the posterior probability exceeds the predetermined threshold. In FIG. 4, candidate section 65 is displayed for the sound source type "automobile", candidate section 66 for "ambulance", candidate section 67 for "train", candidate section 68 for "small bird", and candidate section 69 for "crow".

In the second annotation mode, the input device 3 detects user operations on the candidate sections 65, 66, 67, 68, and 69, for example with a touch panel. The labeling unit 12 identifies the one or more candidate sections operated by the user, identifies the one or more classifications corresponding to them, and associates output data indicating those classifications (sound source types in FIGS. 3 and 4) with the input data as a label.

The classification (here, sound source type) corresponding to an identified candidate section is associated, as a label, with the input data in the time between the start time and the end time of that candidate section.

The classifier 15 has the same configuration as the classifier for whose machine learning the teacher data generated by this annotation device is used (for a deep neural network, the same number of hidden layers and the same number of nodes in each layer). For example, a deep neural network is used as the classifier 15. For example, the deep neural network has two hidden layers: the input layer has 33 nodes corresponding to frequencies, the first hidden layer has 20 nodes, the second hidden layer has 10 nodes, and the output layer has 55 nodes corresponding to sound source types.
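The layer sizes given above can be made concrete with a small forward-pass sketch. Only the node counts (33, 20, 10, 55) come from the text; the sigmoid activations, the random untrained weights, and all function names are assumptions chosen for illustration, since the patent does not state the activation functions or the training procedure.

```python
import math
import random

LAYER_SIZES = [33, 20, 10, 55]  # input, hidden 1, hidden 2, output (from the text)


def sigmoid(x):
    """Numerically stable logistic function; output lies in [0, 1]."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def init_params(layer_sizes, seed=0):
    """Create small random weights and zero biases (untrained placeholder)."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, band_levels):
    """Map one 33-band spectrum to 55 per-class scores in [0, 1]."""
    activations = list(band_levels)
    for weights, biases in params:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


# Usage: one dummy spectrum through the untrained network.
params = init_params(LAYER_SIZES)
posteriors = forward(params, [0.5] * 33)
n_parameters = sum(len(w) * len(w[0]) + len(b) for w, b in params)
```

A per-class sigmoid output was assumed here because the text allows several sound source types to be labeled at the same time; a softmax output would be the alternative if exactly one class were expected per frame.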

The learning processing unit 16 performs machine learning of the classifier 15 based on the teacher data generated by the labeling unit 12.

Next, the operation of the annotation device according to Embodiment 1 will be described.

First, in accordance with a user operation on the input device, each processing unit sets the operation mode to either the first annotation mode or the second annotation mode. The user listens to the target sound at the place where the annotation device is installed and operates the annotation device according to the classification of the target sound that the user has identified.

In the first annotation mode, the labeling unit 12 receives, in real time, the input data (the input-data part of the teacher data) generated by the sound processing unit 11. When it detects a key press on the input device 3, it identifies the classification (here, sound source type) corresponding to that key, identifies the period during which the key was held down (that is, the start time and end time of the key press), and associates the identified classification, as a label, with the input data of that period.

In this way, each pair of input data and output data (that is, the identified classification) forms one teacher data set.
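The key-press labeling of the first annotation mode can be sketched as follows. The `Frame` and `KeyPress` types, the one-second frame spacing, and the half-open interval convention (start inclusive, end exclusive) are hypothetical choices made for this illustration; the text only states that the label is attached to the input data of the period during which the key is held down.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One unit of input data: a timestamp and its spectrum vector."""
    time_s: float
    spectrum: list
    labels: set = field(default_factory=set)


@dataclass
class KeyPress:
    """One key-hold interval reported by the input device."""
    label: str
    start_s: float
    end_s: float


def label_frames(frames, key_presses):
    """Attach each key's label to frames inside its hold interval.

    Returns (input, output) pairs; several labels may apply to one frame
    when keys overlap in time.
    """
    for frame in frames:
        for press in key_presses:
            if press.start_s <= frame.time_s < press.end_s:
                frame.labels.add(press.label)
    return [(f.spectrum, sorted(f.labels)) for f in frames]


# Usage: six one-second frames and two overlapping key presses.
frames = [Frame(time_s=float(t), spectrum=[0.0] * 33) for t in range(6)]
presses = [KeyPress("car", 1.0, 4.0), KeyPress("bird", 3.0, 5.0)]
teacher_data = label_frames(frames, presses)
```

In this example the frame at 3 s receives both labels, which mirrors the case where the user holds two keys at once.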

As shown in FIG. 2, in the first annotation mode the classifications entered by the user are displayed on the display device 2 as the time-series data 53.

In the second annotation mode, on the other hand, the classifier 15 calculates, in real time, the posterior probability of each classification for the input data. As shown in FIGS. 3 and 4, the display processing unit 14 causes the display device 2 to display the posterior probability waveform 63 of each classification, determines whether the posterior probability at each point in time exceeds the predetermined threshold, and displays the candidate sections 65 to 69, which correspond to the periods in which the posterior probability exceeded the threshold, aligned with the posterior probability waveforms 63.
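Extracting candidate sections from one posterior probability waveform amounts to finding the contiguous runs above the threshold. The sketch below assumes one posterior value per frame at a fixed frame period; the function name `candidate_sections` and the returned (start, end) tuple format are hypothetical, while the strict "exceeds" comparison and the example threshold of 0.25 follow the text.

```python
def candidate_sections(posteriors, threshold=0.25, frame_period_s=1.0):
    """Return (start_s, end_s) spans where the posterior exceeds the threshold.

    The end time is exclusive: it is the time of the first frame at or
    below the threshold after a run.
    """
    sections = []
    start = None
    for i, p in enumerate(posteriors):
        if p > threshold and start is None:
            start = i  # a run begins
        elif p <= threshold and start is not None:
            sections.append((start * frame_period_s, i * frame_period_s))
            start = None  # the run has ended
    if start is not None:
        # The run is still open at the end of the waveform.
        sections.append((start * frame_period_s,
                         len(posteriors) * frame_period_s))
    return sections


# Usage: one class's waveform, one value per second.
car_posterior = [0.05, 0.10, 0.40, 0.80, 0.60, 0.20, 0.10, 0.30, 0.25, 0.05]
sections = candidate_sections(car_posterior)
# sections == [(2.0, 5.0), (7.0, 8.0)]
```

Running this once per classification yields the rows of candidate sections that the user can then confirm with a single press.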

The labeling unit 12 receives, in real time, the input data (the input-data part of the teacher data) generated by the sound processing unit 11. When the input device 3 detects that a candidate section has been pressed, the labeling unit 12 confirms the classification (here, sound source type) corresponding to that candidate section, identifies the period from the start point to the end point of the candidate section, and associates the confirmed classification, as a label, with the input data of that period. In this way, compared with the first annotation mode, the second annotation mode makes the user's work easier even for target sounds of short duration.

As described above, according to Embodiment 1, the sound processing unit 11 generates input data (the input-data part of the teacher data) from the acoustic signal of the target sound obtained by the sound collecting device 1. The labeling unit 12 identifies the classification corresponding to the user operation detected by the input device 3, associates output data indicating the identified classification with the input data as a label, and uses the pair of the input data and the output data associated with it as teacher data.

As a result, to create teacher data the user only needs to press keys or candidate sections while listening to the target sound and checking the target sound source, which reduces the time and labor of annotation for generating teacher data for a classifier that identifies sound source types.

FIG. 5 is a diagram showing an example of sound source type identification results from a classifier machine-learned with the teacher data generated by the annotation device according to Embodiment 1. As shown in FIG. 5, the posterior probability of a sound source type is high near each peak of the noise level, showing that the sound source types are identified.

Embodiment 2.

The noise measurement system according to Embodiment 2 of the present invention includes the annotation device according to Embodiment 1, performs machine learning of a classifier with the teacher data generated by the annotation device as described above, and uses the machine-learned classifier to generate, from input data including frequency spectrum data of the acoustic signal of a target sound, output data indicating the sound source type of that target sound.

This makes it possible to identify noise sources and the like.

Various changes and modifications to the embodiments described above will be apparent to those skilled in the art. Such changes and modifications may be made without departing from the spirit and scope of the subject matter and without diminishing its intended advantages. That is, such changes and modifications are intended to be covered by the claims.

For example, in Embodiment 1, the input data may include sound source information such as the direction of the sound source as seen from the sound collecting device 1.

Also, with the same configuration as the noise detection system according to Embodiment 2, the annotation device according to Embodiment 1 may be applied to an abnormal sound detection system by taking the target sound to be sound emitted from a specific device and the classification of the target sound to be the type of cause of the abnormal sound.

Also, in Embodiments 1 and 2, teacher data may be transmitted over a network from the annotation device or the noise measurement system to a server, machine learning of the classifier may be performed on the server, the classifier parameters obtained by the machine learning may be transmitted from the server to the annotation device or the noise measurement system, and those parameters may be applied to the classifier in the annotation device or the noise measurement system.

The present invention is applicable, for example, to the automatic generation of teacher data for a classifier that classifies features of a target sound.

1 Sound collecting device
2 Display device
3 Input device
11 Sound processing unit
12 Labeling unit
14 Display processing unit
15 Classifier
16 Learning processing unit

Claims

An annotation device that generates teacher data for machine learning of a classifier, the classifier outputting, from input data including frequency spectrum data of an acoustic signal of a target sound, output data indicating a classification of the target sound, the annotation device comprising:
a sound collector that collects environmental sound;
an input device that detects, in real time, a user operation by which a user listening to the environmental sound indicates a sound source type;
an acoustic processing unit that generates the input data from the acoustic signal of the target sound obtained by the sound collector; and
a labeling unit that identifies the classification corresponding to the user operation detected by the input device, associates output data indicating the identified classification with the input data as a label, and sets the pair of the input data and the output data as the teacher data,
wherein the labeling unit receives, in real time, the input data generated by the acoustic processing unit, and associates the identified classification, as the label, with the input data of the period during which the user operation is detected by the input device.
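
As an illustration of the real-time labeling recited above, the following minimal Python sketch labels each piece of input data with the classification whose user operation is active at the moment the input data arrives. The class name `RealTimeLabeler`, its method names, and the use of `None` to represent "no operation in progress" are assumptions made for this example and do not come from the specification.

```python
class RealTimeLabeler:
    """Hypothetical labeling unit: pairs input data with the active label."""

    def __init__(self):
        self.active_label = None  # classification currently indicated by the user
        self.teacher_data = []    # list of (input data, label) pairs

    def on_user_operation(self, label):
        """Called by the input device; pass None when the operation ends."""
        self.active_label = label

    def on_input_data(self, spectrum):
        """Called for each frame of input data from the acoustic processing unit."""
        # Only frames received while a user operation is detected are labeled.
        if self.active_label is not None:
            self.teacher_data.append((list(spectrum), self.active_label))


# Usage: frames arriving while the "aircraft" operation is held are labeled.
labeler = RealTimeLabeler()
labeler.on_input_data([0.1])            # no operation yet, not labeled
labeler.on_user_operation("aircraft")   # user starts indicating a source type
labeler.on_input_data([0.2])
labeler.on_input_data([0.3])
labeler.on_user_operation(None)         # user stops
labeler.on_input_data([0.4])            # not labeled
```

The label "aircraft" is a placeholder sound source type.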

The annotation device according to claim 1, wherein
the input device comprises a plurality of keys corresponding to a plurality of predetermined classifications, and
the labeling unit identifies one or more keys operated by the user, identifies one or more classifications corresponding to the identified one or more keys, and associates output data indicating the identified one or more classifications with the input data as a label.
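
The key-to-classification mapping above can be sketched as a lookup that turns the set of currently pressed keys into multi-label output data. In the Python sketch below, the key assignments in `KEY_TO_CLASS`, the class names, and the multi-hot list encoding are hypothetical choices made for illustration; the specification does not fix a particular key layout or label encoding.

```python
# Hypothetical key assignments; one key per predetermined classification.
KEY_TO_CLASS = {"a": "aircraft", "c": "car", "t": "train"}

# Fixed class order used for the output data: ['aircraft', 'car', 'train'].
CLASSES = sorted(set(KEY_TO_CLASS.values()))


def keys_to_output(pressed_keys):
    """Return multi-hot output data for the classifications of the pressed keys."""
    # Keys that are not assigned to any classification are ignored.
    selected = {KEY_TO_CLASS[k] for k in pressed_keys if k in KEY_TO_CLASS}
    return [1 if c in selected else 0 for c in CLASSES]
```

Pressing several keys at once yields a label with several classifications set, which matches the "one or more classifications" wording of the claim.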

An annotation device that generates teacher data for machine learning of a classifier, the classifier outputting, from input data including frequency spectrum data of an acoustic signal of a target sound, output data indicating a classification of the target sound, the annotation device comprising:
a sound collector that collects environmental sound;
an input device that detects, in real time, a user operation by which a user listening to the environmental sound indicates a sound source type;
an acoustic processing unit that generates the input data from the acoustic signal of the target sound obtained by the sound collector;
a labeling unit that identifies the classification corresponding to the user operation detected by the input device, associates output data indicating the identified classification with the input data as a label, and sets the pair of the input data and the output data as the teacher data;
a classifier that outputs, from the input data generated by the acoustic processing unit, output data indicating a posterior probability for each of a plurality of predetermined classifications; and
a display processing unit that, for each classification indicated by the output data output by the classifier, displays the posterior probability on a display device in chronological order as a posterior probability waveform, and displays on the display device, along the posterior probability waveform, a candidate section for each section in which the posterior probability exceeds a predetermined threshold,
wherein the input device detects a user operation on the candidate section, and
the labeling unit determines one or more candidate sections operated on by the user, identifies one or more classifications corresponding to the determined one or more candidate sections, and associates output data indicating the identified one or more classifications with the input data as a label.
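
The candidate sections recited above can be derived by thresholding the posterior probability sequence of one classification. The Python sketch below is a minimal illustration under stated assumptions: the function names `candidate_sections` and `label_selected`, the frame-index representation of a section with an exclusive end index, and the strict "greater than" comparison are choices made for this example, not details given in the specification.

```python
def candidate_sections(posteriors, threshold):
    """Return (start, end) frame-index pairs, end exclusive, where p > threshold."""
    sections = []
    start = None
    for i, p in enumerate(posteriors):
        if p > threshold and start is None:
            start = i                     # a section above the threshold begins
        elif p <= threshold and start is not None:
            sections.append((start, i))   # the section ended at the previous frame
            start = None
    if start is not None:                 # sequence ended while above the threshold
        sections.append((start, len(posteriors)))
    return sections


def label_selected(inputs, section, classification):
    """Pair the input data inside a user-selected section with its classification."""
    start, end = section
    return [(x, classification) for x in inputs[start:end]]
```

For example, after the user selects one of the displayed sections for a classification, `label_selected` would yield the teacher data pairs for the frames in that section.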

The annotation device according to claim 2 or 3, further comprising a learning processing unit that performs machine learning of the classifier based on the teacher data generated by the labeling unit.

A noise measurement system comprising the annotation device according to claim 1 or 3, wherein
machine learning of the classifier is performed using the teacher data generated by the annotation device, and
the machine-learned classifier generates, from input data including frequency spectrum data of an acoustic signal of a target sound, output data indicating the sound source type of the target sound.