JP2021033260A

JP2021033260A - Training method, speaker identification method, and recording medium

Info

Publication number: JP2021033260A
Application number: JP2020077113A
Authority: JP
Inventors: Misaki Doi; Takahiro Kamai; Kosuke Itakura
Original assignee: Panasonic Intellectual Property Corp of America
Current assignee: Panasonic Intellectual Property Corp of America
Priority date: 2019-08-23
Filing date: 2020-04-24
Publication date: 2021-03-01
Also published as: CN112420021A

Abstract

To identify a speaker accurately. SOLUTION: Provided is a training method of training a speaker identification model 20 which, when voice data is input, outputs speaker identification information for identifying a speaker of an utterance included in the voice data. The training method includes: performing voice quality conversion of first voice data of a first speaker to generate second voice data of a second speaker; and performing training of the speaker identification model 20 using, as training data, the first voice data and the second voice data. SELECTED DRAWING: Figure 5

Description

The present disclosure relates to a technique for identifying a speaker.

Techniques for identifying a speaker using a speaker identification model are conventionally known (see, for example, Non-Patent Document 1).

Non-Patent Document 1: David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, Sanjeev Khudanpur, "X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION," ICASSP 2018, pp. 5329-5333.

It is desirable to identify a speaker accurately.

A training method according to one aspect of the present disclosure is a method of training a speaker identification model that, when voice data is input, outputs speaker identification information identifying the speaker of an utterance included in the voice data. The method includes generating second voice data of a second speaker by performing voice quality conversion processing on first voice data of a first speaker, and training the speaker identification model using the first voice data and the second voice data as training data.

In a speaker identification method according to one aspect of the present disclosure, voice data is input to the speaker identification model that has been trained in advance by the above training method, and the speaker identification model is caused to output the speaker identification information.

A program according to one aspect of the present disclosure causes a computer to execute a process of training a speaker identification model that, when voice data is input, outputs speaker identification information identifying the speaker of an utterance included in the voice data. The process includes a first step of generating second voice data of a second speaker by performing voice quality conversion processing on first voice data of a first speaker, and a second step of training the speaker identification model using the first voice data and the second voice data as training data.

Note that these general or specific aspects may be implemented as a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or as any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.

According to the training method and the like of the present disclosure, a speaker can be identified accurately.

FIG. 1 is a block diagram showing a configuration example of a speaker identification device according to an embodiment.
FIG. 2 is a schematic diagram showing an example of how the voice data holding unit according to the embodiment stores voice data and speaker identification information in association with each other.
FIG. 3 is a schematic diagram showing how the voice quality conversion unit according to the embodiment converts the voice data of one speaker into voice data of a plurality of other speakers and outputs it.
FIG. 4 is a block diagram showing a configuration example of the voice quality conversion unit according to the embodiment.
FIG. 5 is a flowchart of the speaker identification model learning process according to the embodiment.
FIG. 6 is a flowchart of the voice quality conversion model learning process according to the embodiment.
FIG. 7 is a flowchart of the speaker identification process according to the embodiment.

(Background leading to one aspect of the present disclosure)
A speaker identification technique is known that identifies a speaker by using a speaker identification model trained in advance, with voice data associated with identification information identifying the speaker as training data.

Conventionally, to increase the amount of training data (hereinafter, increasing the amount of training data is also referred to as "expanding the training data"), noise, reverberation, and the like have been added to the original training voice data. However, expanding the training data by adding noise, reverberation, and the like cannot increase the utterance content or the languages (Japanese, English, etc.) available for a given speaker. As a result, the influence of utterance content and language on the training of the speaker identification model may not be sufficiently reduced.

The inventors therefore conducted extensive studies and experiments aimed at identifying speakers accurately with a speaker identification model. As a result, the inventors arrived at the following training method and the like.

According to the above training method, when expanding the training data used to train the speaker identification model, the amount of voice data of the second speaker can be increased without being limited by utterance content or language. This improves the accuracy with which the speaker identification model identifies speakers.

Therefore, according to the above training method, a speaker can be identified accurately.

The voice quality conversion processing may be processing based on the voice data of the first speaker and the voice data of the second speaker.

The voice quality conversion processing may include inputting the first voice data to a voice quality conversion model that has been trained in advance to output voice data of the second speaker when voice data of the first speaker is input, thereby causing the voice quality conversion model to output the second voice data.

The voice quality conversion model may include a deep neural network that takes voice data in WAV format as input and outputs voice data in WAV format.

The voice quality conversion processing may be processing based on the voice data of the first speaker and voice data of a third speaker.

The speaker identification model may include a deep neural network that takes as input an utterance feature amount indicating features of an utterance included in voice data and outputs a speaker feature amount indicating features of the speaker.

According to the above speaker identification method, when expanding the training data used to train the speaker identification model, the amount of voice data of the second speaker can be increased without being limited by utterance content or language. This improves the accuracy with which the speaker identification model identifies speakers.

Therefore, according to the above speaker identification method, a speaker can be identified accurately.

According to the above program, when expanding the training data used to train the speaker identification model, the amount of voice data of the second speaker can be increased without being limited by utterance content or language. This improves the accuracy with which the speaker identification model identifies speakers.

Therefore, according to the above program, a speaker can be identified accurately.

Note that these comprehensive or specific aspects may be implemented as a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or as any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.

Embodiments of the present disclosure are described below with reference to the drawings. Each embodiment described below shows one specific example of the present disclosure. The numerical values, shapes, components, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present disclosure. The contents of the embodiments may also be combined with one another.

(Embodiment)
A speaker identification device according to an embodiment is described below. The speaker identification device acquires voice data and outputs identification information identifying the speaker of an utterance included in the voice data.

<Configuration>
FIG. 1 is a block diagram showing a configuration example of the speaker identification device 1 according to the embodiment.

As shown in FIG. 1, the speaker identification device 1 includes a voice data expansion unit 10, a speaker identification model 20, a learning unit 30, and an identification target voice data acquisition unit 40.

The voice data expansion unit 10 expands the training data used to train the speaker identification model 20 (that is, it increases the amount of training data). The voice data expansion unit 10 may be implemented, for example, by a computer including a microprocessor, a memory, a communication interface, and the like. In this case, the functions of the voice data expansion unit 10 are realized by the microprocessor executing a program stored in the memory. The voice data expansion unit 10 may also be implemented, for example, by distributed computing or cloud computing using a plurality of computers that communicate with each other.

As shown in FIG. 1, the voice data expansion unit 10 includes a voice data holding unit 11, a first voice data acquisition unit 12, a voice quality conversion unit 13, a noise reverberation adding unit 14, a first feature amount calculation unit 15, a comparison unit 16, a voice data storage unit 17, and an extended voice data holding unit 18.

The learning unit 30 trains the speaker identification model 20 using the training data expanded by the voice data expansion unit 10. The learning unit 30 may be implemented, for example, by a computer including a microprocessor, a memory, a communication interface, and the like. In this case, the functions of the learning unit 30 are realized by the microprocessor executing a program stored in the memory. The learning unit 30 may also be implemented, for example, by distributed computing or cloud computing using a plurality of computers that communicate with each other.

As shown in FIG. 1, the learning unit 30 includes a second voice data acquisition unit 31, a second feature amount calculation unit 32, and a first learning unit 33.

When voice data is input, the speaker identification model 20 outputs speaker identification information identifying the speaker of an utterance included in the voice data. The speaker identification model 20 may be implemented, for example, by a computer including a microprocessor, a memory, a communication interface, and the like. In this case, the functions of the speaker identification model 20 are realized by the microprocessor executing a program stored in the memory. The speaker identification model 20 may also be implemented, for example, by distributed computing or cloud computing using a plurality of computers that communicate with each other.

As shown in FIG. 1, the speaker identification model 20 includes a third feature amount calculation unit 21, a deep neural network (DNN) 22, and a determination unit 23.

The identification target voice data acquisition unit 40 acquires the voice data to be identified in the speaker identification performed by the speaker identification model 20. The identification target voice data acquisition unit 40 may, for example, have a communication interface for communicating with an external device and acquire voice data from the external device via the communication interface. It may also, for example, have an input/output port (for example, a USB port) and acquire voice data from an external storage device (for example, a USB memory) connected to the input/output port. It may also, for example, have a microphone and acquire voice data by converting sound input to the microphone into an electric signal.

The components of the voice data expansion unit 10 are described below.

The voice data holding unit 11 stores voice data in association with the speaker identification information that is linked to that voice data and identifies the speaker of the utterance included in it.

FIG. 2 is a schematic diagram showing an example of how the voice data holding unit 11 stores voice data and speaker identification information in association with each other.

As shown in FIG. 2, the voice data holding unit 11 stores a plurality of items of voice data associated with a plurality of mutually different items of speaker identification information. The voice data and speaker identification information stored in the voice data holding unit 11 are used as training data for the speaker identification model 20.
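As an illustrative sketch only, the association between voice data and speaker identification information could be held in a simple in-memory store. The names `Utterance` and `VoiceDataStore`, the speaker labels, and the file names are hypothetical and do not appear in the disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One item of voice data together with its speaker identification information."""
    wav_path: str    # hypothetical location of the WAV-format voice data
    speaker_id: str  # speaker identification information


class VoiceDataStore:
    """Minimal stand-in for a unit that stores voice data keyed by speaker."""

    def __init__(self):
        self._by_speaker = defaultdict(list)

    def add(self, utterance: Utterance) -> None:
        self._by_speaker[utterance.speaker_id].append(utterance)

    def speakers(self) -> list:
        return sorted(self._by_speaker)

    def all_items(self) -> list:
        return [u for sid in self.speakers() for u in self._by_speaker[sid]]


store = VoiceDataStore()
store.add(Utterance("a_001.wav", "speaker_A"))
store.add(Utterance("a_002.wav", "speaker_A"))
store.add(Utterance("b_001.wav", "speaker_B"))
```

Keying the store by speaker makes it easy to see how many items each speaker contributes, which is the quantity the expansion steps aim to increase.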

Returning to FIG. 1, the description of the speaker identification device 1 continues.

The voice data holding unit 11 may, for example, have a communication interface for communicating with an external device and store voice data acquired from the external device via the communication interface together with the speaker identification information associated with that voice data. The voice data holding unit 11 may also, for example, have an input/output port (for example, a USB port) and store voice data acquired from an external storage device (for example, a USB memory) connected to the input/output port together with the speaker identification information associated with that voice data.

The voice data is described here as being in WAV format. However, the voice data is not necessarily limited to WAV format and may be, for example, in AIFF format, AAC format, or the like.
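For illustration, WAV-format voice data can be read and written with Python's standard `wave` module. The helper names `read_wav_mono16` and `write_wav_mono16` are hypothetical, and the restriction to 16-bit mono PCM is an assumption of this sketch, not something the disclosure specifies.

```python
import os
import struct
import tempfile
import wave


def read_wav_mono16(path):
    """Read a 16-bit mono PCM WAV file; return (sample_rate, samples in [-1, 1))."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return rate, [v / 32768.0 for v in ints]


def write_wav_mono16(path, sample_rate, samples):
    """Write floats in [-1, 1] as 16-bit mono PCM."""
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack("<%dh" % len(ints), *ints))


# Round trip through a temporary file to show both helpers in use.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.wav")
    write_wav_mono16(path, 16000, [0.0, 0.5, -0.5])
    rate, samples = read_wav_mono16(path)
```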

The first voice data acquisition unit 12 acquires voice data and the speaker identification information associated with that voice data from the voice data holding unit 11.

The voice quality conversion unit 13 converts the voice data acquired by the first voice data acquisition unit 12 into voice data uttered by a speaker other than the speaker identified by the speaker identification information associated with that voice data (hereinafter also referred to as "another speaker"), and outputs the converted voice data. More specifically, the voice quality conversion unit 13 generates and outputs voice data uttered by another speaker by changing the frequency components of the utterance included in the voice data.

By converting the voice data of one speaker into voice data of a plurality of other speakers, the voice quality conversion unit 13 can output a plurality of items of voice data that have different speakers but the same utterance content. In addition, when the voice data of the one speaker contains an utterance in Japanese, the voice quality conversion unit 13 can convert it into voice data containing a Japanese utterance by another speaker who cannot necessarily speak Japanese. That is, the voice quality conversion unit 13 can convert the voice data of one speaker into voice data of a plurality of other speakers without being limited by the utterance content or language of the voice data before conversion.

FIG. 3 is a schematic diagram showing how the voice quality conversion unit 13 converts the voice data of one speaker into voice data of a plurality of other speakers and outputs it.

As shown in FIG. 3, the voice quality conversion unit 13 can increase the amount of voice data used as training data for the speaker identification model 20 without being limited by utterance content or language.
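The one-to-many expansion described here can be sketched as a loop over target speakers. This is only an outline under stated assumptions: `expand_by_voice_conversion` and `dummy_convert` are hypothetical names, and `dummy_convert` merely copies the samples where a real system would call a trained voice quality conversion model.

```python
from typing import Callable, List, Sequence, Tuple

Samples = List[float]


def expand_by_voice_conversion(
    samples: Samples,
    source_speaker: str,
    target_speakers: Sequence[str],
    convert: Callable[[Samples, str, str], Samples],
) -> List[Tuple[Samples, str]]:
    """Return one converted copy of `samples` per target speaker.

    Each result is paired with the target speaker it is meant to sound like.
    """
    expanded = []
    for target in target_speakers:
        if target == source_speaker:
            continue  # conversion to the same speaker adds nothing new
        expanded.append((convert(samples, source_speaker, target), target))
    return expanded


def dummy_convert(samples: Samples, source: str, target: str) -> Samples:
    """Placeholder only: a real system would call a trained conversion model."""
    return list(samples)


converted = expand_by_voice_conversion(
    [0.0, 0.1, -0.1],
    "speaker_A",
    ["speaker_A", "speaker_B", "speaker_C"],
    dummy_convert,
)
```

Because the utterance content is carried over unchanged, each target speaker gains an utterance that the speaker never actually recorded.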

The voice quality conversion unit 13 may be implemented, for example, by a widely available conventional voice quality converter. Alternatively, it may be implemented by using a voice quality conversion model trained in advance to output voice data of a second speaker when voice data of a first speaker is input. The description here assumes the latter.

FIG. 4 is a block diagram showing a configuration example of the voice quality conversion unit 13.

As shown in FIG. 4, the voice quality conversion unit 13 includes a voice quality conversion learning data holding unit 131, a second learning unit 132, and a voice quality conversion model 133.

The voice quality conversion model 133 is a deep neural network (DNN) trained in advance so that, for each of a plurality of speaker pairs, it outputs voice data of a second speaker (the other speaker of the pair) when voice data of a first speaker (one speaker of the pair) is input, and outputs voice data of the first speaker when voice data of the second speaker is input. As an example, the voice quality conversion model 133 is described here as a cycleVAE trained in advance so that, for each of the speaker pairs, it outputs WAV-format voice data of the second speaker when WAV-format voice data of the first speaker is input, and outputs WAV-format voice data of the first speaker when WAV-format voice data of the second speaker is input. However, the voice quality conversion model 133 is not necessarily limited to a cycleVAE, as long as it is a DNN trained in advance to perform this bidirectional conversion for each of the speaker pairs.

The voice quality conversion learning data holding unit 131 stores training data for the voice quality conversion model 133. More specifically, it stores voice data (here, WAV-format voice data) of each of the plurality of speakers targeted by the voice quality conversion model 133.

Using the training data stored in the voice quality conversion learning data holding unit 131, the second learning unit 132 trains the voice quality conversion model 133 so that, for each of the plurality of speaker pairs, the model outputs voice data of the second speaker (the other speaker of the pair) when voice data of the first speaker (one speaker of the pair) is input, and outputs voice data of the first speaker when voice data of the second speaker is input.
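To make the pairwise, bidirectional training target concrete, the following sketch enumerates every conversion direction for a set of speakers. It does not train anything; `conversion_directions` is a hypothetical helper, and forming pairs from all combinations of speakers is an assumption, since the disclosure does not say how the speaker pairs are chosen.

```python
from itertools import combinations


def conversion_directions(speakers):
    """List (source, target) directions for every unordered speaker pair, both ways."""
    directions = []
    for first, second in combinations(speakers, 2):
        directions.append((first, second))  # first speaker -> second speaker
        directions.append((second, first))  # second speaker -> first speaker
    return directions


directions = conversion_directions(["A", "B", "C"])
```

With N speakers paired exhaustively this gives N(N-1) directions, so three speakers yield six.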

The noise reverberation adding unit 14 adds noise (for example, four types) and reverberation (for example, one type) to each item of voice data output from the voice quality conversion unit 13, and outputs the voice data after noise addition and the voice data after reverberation addition. In this way, the noise reverberation adding unit 14 can further increase the amount of voice data.
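A minimal sketch of this kind of augmentation follows. The disclosure does not say what the four types of noise or the one type of reverberation are, so the choices here are assumptions made only for illustration: white Gaussian noise at four signal-to-noise ratios, and convolution with a toy exponentially decaying impulse response. All function names are hypothetical.

```python
import math
import random
from typing import List


def add_noise(samples: List[float], snr_db: float, rng: random.Random) -> List[float]:
    """Add white Gaussian noise so the result has roughly the requested SNR."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def add_reverb(samples: List[float], impulse_response: List[float]) -> List[float]:
    """Convolve the signal with an impulse response (direct-form convolution)."""
    out = [0.0] * (len(samples) + len(impulse_response) - 1)
    for i, s in enumerate(samples):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def augment(samples, snr_levels_db, impulse_response, seed=0):
    """Return one noisy copy per SNR level plus one reverberant copy."""
    rng = random.Random(seed)
    noisy = [add_noise(samples, snr, rng) for snr in snr_levels_db]
    return noisy + [add_reverb(samples, impulse_response)]


# A 440 Hz tone sampled at 16 kHz stands in for real voice data.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
decay = [0.5 ** k for k in range(8)]  # toy exponentially decaying impulse response
variants = augment(tone, snr_levels_db=[20, 15, 10, 5], impulse_response=decay)
```

Under these assumptions each input item yields five additional items: four noisy copies and one reverberant copy.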

The first feature amount calculation unit 15 calculates, from each item of voice data output from the voice quality conversion unit 13 and each item of voice data output from the noise reverberation adding unit 14, an utterance feature amount indicating features of the utterance included in that voice data. As an example, the first feature amount calculation unit 15 is described here as calculating MFCCs (Mel-Frequency Cepstrum Coefficients), which represent the vocal tract characteristics of the speaker, as the utterance feature amount. However, the first feature amount calculation unit 15 is not necessarily limited to calculating MFCCs, as long as it can calculate an utterance feature amount indicating features of the speaker. For example, it may calculate, as the utterance feature amount, the result of applying a mel filter bank to the speech signal of the utterance, or a spectrogram of the speech signal of the utterance.
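For reference, a textbook MFCC computation for a single frame can be sketched in plain Python: window the frame, take the power spectrum, apply a mel filter bank, take logarithms, and apply a DCT-II. This is not the implementation used in the disclosure; the frame length (256 samples), sample rate (16 kHz), number of filters (20), and number of coefficients (13) are all assumptions chosen for the example.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum of one Hamming-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    points = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [int((n_fft + 1) * p / sample_rate) for p in points]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):   # rising edge of the triangle
            row[k] = (k - left) / (center - left)
        for k in range(center, right):  # falling edge of the triangle
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=13):
    """MFCCs of a single frame: log mel energies followed by a DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    log_energies = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                    for row in bank]
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, e in enumerate(log_energies))
            for c in range(num_coeffs)]


# A 1 kHz tone stands in for one frame of real voice data.
frame = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(256)]
coeffs = mfcc(frame, sample_rate=16000)
```

Stopping after the `log_energies` step gives the mel filter bank output that the text mentions as an alternative feature.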

For each speaker feature amount output from the first feature amount calculation unit 15 (hereinafter also referred to as a "first speaker feature amount"), the comparison unit 16 compares the first speaker feature amount with the speaker feature amount of the speaker of the utterance included in the voice data from which the first speaker feature amount was calculated (hereinafter also referred to as a "second speaker feature amount").

(1) If the comparison shows that the similarity between the first speaker feature amount and the second speaker feature amount is within a predetermined range, the comparison unit 16 associates the voice data from which the first speaker feature amount was calculated with speaker identification information identifying the speaker of the utterance included in that voice data. In this way, the comparison unit 16 can increase the amount of voice data associated with a single item of speaker identification information. The comparison unit 16 then outputs the voice data and the speaker identification information associated with it.

(2) If the comparison shows that the similarity between the first speaker feature amount and the second speaker feature amount is not within the predetermined range, the comparison unit 16 associates the voice data from which the first speaker feature amount was calculated with identification information identifying a third party different from the speaker of the utterance included in that voice data. In this way, the comparison unit 16 can increase the number of items of speaker identification information associated with voice data; that is, it can increase the number of speakers in the training data for the speaker identification model 20. Increasing the number of speakers suppresses overfitting in the training of the speaker identification model 20 described later, which improves the generalization performance of the speaker identification model 20. The comparison unit 16 then outputs the voice data and the speaker identification information associated with it.
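The two cases above can be sketched as a single labeling function. The disclosure does not state which similarity measure or range is used, so cosine similarity with a lower threshold of 0.8 is purely an assumption here, as are the function names and the `speaker_X1` label for the newly introduced third party.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_speaker_label(converted_feature, reference_feature, speaker_id,
                         new_speaker_id, threshold=0.8):
    """Keep `speaker_id` when the features are similar enough (case 1);
    otherwise label the data as a different, newly introduced speaker (case 2)."""
    if cosine_similarity(converted_feature, reference_feature) >= threshold:
        return speaker_id
    return new_speaker_id


same = assign_speaker_label([1.0, 0.0], [0.9, 0.1], "speaker_B", "speaker_X1")
other = assign_speaker_label([1.0, 0.0], [0.0, 1.0], "speaker_B", "speaker_X1")
```

Either branch produces usable training data: the first adds an utterance to an existing speaker, and the second adds a new speaker.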

Like the voice data holding unit 11, the extended voice data holding unit 18 stores voice data in association with the speaker identification information that is linked to that voice data and identifies the speaker of the utterance included in it.

For each item of voice data output from the comparison unit 16, the voice data storage unit 17 stores the voice data in the extended voice data holding unit 18 in association with the speaker identification information linked to it. Likewise, for each item of voice data acquired by the first voice data acquisition unit 12, the voice data storage unit 17 stores the voice data in the extended voice data holding unit 18 in association with the speaker identification information linked to it. As a result, the extended voice data holding unit 18 stores, as learning data for the learning process of the speaker identification model 20, not only the voice data that the voice data holding unit 11 stores as learning data but also the voice data output from the comparison unit 16.

The components of the speaker identification model 20 are described below.

The third feature amount calculation unit 21 calculates, from the voice data acquired by the identification target voice data acquisition unit 40, an utterance feature amount indicating the features of the utterance included in that voice data. Here, as an example, the third feature amount calculation unit 21 is described as calculating, as the utterance feature amount, an MFCC indicating the vocal tract characteristics of the speaker. However, the third feature amount calculation unit 21 is not limited to calculating an MFCC, as long as it can calculate an utterance feature amount indicating the features of the speaker. For example, it may calculate as the utterance feature amount the result of applying a mel filter bank to the speech signal of the utterance, or a spectrogram of that speech signal.

The deep neural network 22 is a deep neural network (DNN) that has been trained in advance so that, when the utterance feature amount calculated by the third feature amount calculation unit 21 is input, it outputs a speaker characteristic amount indicating the features of the speaker of the utterance included in the voice data from which that utterance feature amount was calculated. Here, as an example, the deep neural network 22 is described as Kaldi trained in advance so that, when an MFCC indicating the vocal tract characteristics of the speaker is input, it outputs as the speaker characteristic amount an x-vector, which is an acoustic feature amount of the utterance obtained by mapping a variable-length utterance to a fixed-dimensional embedding. However, the deep neural network 22 is not limited to Kaldi, as long as it is a DNN trained in advance to output a speaker characteristic amount indicating the features of the speaker when the utterance feature amount calculated by the third feature amount calculation unit 21 is input. Details such as the method of calculating the x-vector are disclosed in Non-Patent Document 1 and are therefore omitted here.

The determination unit 23 determines the speaker of the utterance included in the voice data acquired by the identification target voice data acquisition unit 40, based on the speaker characteristic amount output from the deep neural network 22. More specifically, the determination unit 23 stores the x-vectors of a plurality of speakers, identifies among the stored x-vectors the one most similar to the x-vector output from the deep neural network 22, and determines that the speaker of the identified x-vector is the speaker of the utterance included in the voice data acquired by the identification target voice data acquisition unit 40. The determination unit 23 then outputs speaker identification information identifying the determined speaker.
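The nearest-match decision made by the determination unit 23 can be sketched as below. This is only an illustration under stated assumptions: the text does not say which similarity measure is used, so cosine similarity is assumed here, and the names `identify_speaker` and `enrolled` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_speaker(query_xvector, enrolled):
    """Return the ID of the stored speaker whose x-vector is most similar.

    `enrolled` maps a speaker ID to that speaker's stored x-vector
    (a hypothetical data layout chosen for this sketch).
    """
    return max(
        enrolled,
        key=lambda speaker_id: cosine_similarity(query_xvector, enrolled[speaker_id]),
    )


# Usage: three-dimensional toy vectors stand in for real x-vectors.
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(identify_speaker([0.9, 0.1, 0.0], enrolled))
```

A closed-set decision like this always returns one of the stored speakers; rejecting unknown speakers would need an additional threshold, which the text does not describe.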

The components of the learning unit 30 are described below.

The second voice data acquisition unit 31 acquires voice data and the speaker identification information associated with that voice data from the extended voice data holding unit 18.

The second feature amount calculation unit 32 calculates, from the voice data acquired by the second voice data acquisition unit 31, an utterance feature amount indicating the features of the utterance included in that voice data. Here, as an example, the second feature amount calculation unit 32 is described as calculating, as the utterance feature amount, an MFCC indicating the vocal tract characteristics of the speaker. However, the second feature amount calculation unit 32 is not limited to calculating an MFCC, as long as it can calculate an utterance feature amount indicating the features of the speaker. For example, it may calculate as the utterance feature amount the result of applying a mel filter bank to the speech signal of the utterance, or a spectrogram of that speech signal.

The first learning unit 33 performs the learning process of the speaker identification model 20, using as learning data the utterance feature amount calculated by the second feature amount calculation unit 32 and the speaker identification information identifying the speaker of the utterance included in the voice data from which that utterance feature amount was calculated, so that when voice data is input the model outputs speaker identification information identifying the speaker of the utterance included in that voice data.

More specifically, the first learning unit 33 performs the learning process of the deep neural network 22, using as learning data the MFCC calculated by the second feature amount calculation unit 32 and the speaker identification information corresponding to that MFCC, so that when an MFCC is input the network outputs an x-vector indicating the features of the speaker of the utterance included in the voice data from which the MFCC was calculated.

<Operation>
The speaker identification device 1 configured as described above performs a speaker identification model learning process, a voice quality conversion model learning process, and a speaker identification process.

These processes are described below in order with reference to the drawings.

FIG. 5 is a flowchart of the speaker identification model learning process.

The speaker identification model learning process is the process in which the speaker identification model 20 is trained.

The speaker identification model learning process is started, for example, when a user of the speaker identification device 1 performs an operation on the speaker identification device 1 instructing it to start the speaker identification model learning process.

When the speaker identification model learning process is started, the first voice data acquisition unit 12 acquires, from the voice data holding unit 11, one item of voice data and the one piece of speaker identification information associated with that voice data (step S100).

When the voice data and the speaker identification information have been acquired, the voice data storage unit 17 stores them in the extended voice data holding unit 18 in association with each other (step S110).

Meanwhile, the voice quality conversion unit 13 selects one speaker from among the other speakers, that is, the speakers other than the speaker identified by the acquired speaker identification information (step S120). The voice quality conversion unit 13 then converts the acquired voice data into voice data uttered by the selected speaker (step S130) and outputs it.

When voice data is output from the voice quality conversion unit 13, the noise reverberation addition unit 14 adds noise and reverberation to that voice data (step S140) and outputs one or more items of voice data.
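Step S140 can be illustrated with a small sketch of noise and reverberation augmentation on a sampled signal. The text does not specify the noise type, the reverberation model, or the order of the two operations, so everything here is an assumption: white Gaussian noise at a target signal-to-noise ratio, reverberation by convolution with a short impulse response, and reverberation applied before noise. All function and parameter names are hypothetical.

```python
import math
import random


def add_reverb(signal, impulse_response):
    """Convolve the signal with an impulse response, keeping the original length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break  # no earlier samples to draw from
            acc += h * signal[n - k]
        out.append(acc)
    return out


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise so that the result has roughly the given SNR in dB."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def augment(signal, snrs_db, impulse_response, seed=0):
    """Return one reverberant, noisy variant per requested SNR (one or more outputs)."""
    rng = random.Random(seed)
    reverberant = add_reverb(signal, impulse_response)
    return [add_noise(reverberant, snr_db, rng) for snr_db in snrs_db]
```

Producing several variants per input matches the statement that the unit outputs one or more items of voice data; the number of variants and the SNR values would be design choices.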

When the one or more items of voice data are output from the noise reverberation addition unit 14, the first feature amount calculation unit 15 calculates an utterance feature amount from each of the voice data output from the voice quality conversion unit 13 and the one or more items of voice data output from the noise reverberation addition unit 14 (step S150).

When the utterance feature amounts have been calculated, the comparison unit 16 compares each calculated utterance feature amount with the utterance feature amount of the selected speaker and determines whether the similarity between the two is within a predetermined range (step S160).

When the determination in step S160 is affirmative (step S160: Yes), the comparison unit 16 associates speaker identification information identifying the selected speaker with the voice data from which the affirmatively determined utterance feature amount was calculated (step S170). The comparison unit 16 then outputs that voice data and the speaker identification information associated with it.

When the determination in step S160 is negative (step S160: No), the comparison unit 16 associates identification information identifying a third party different from the selected speaker with the voice data from which the negatively determined utterance feature amount was calculated (step S180). The comparison unit 16 then outputs that voice data and the speaker identification information associated with it.

When the comparison unit 16 has executed step S170 or step S180 for every utterance feature amount compared in step S160, the voice data storage unit 17 stores each item of voice data output from the comparison unit 16 in the extended voice data holding unit 18 in association with its speaker identification information (step S190).

When step S190 is completed, the voice quality conversion unit 13 determines whether the other speakers include a speaker who has not yet been selected in step S120 (hereinafter also referred to as an "unselected speaker") (step S200).

When it is determined in step S200 that there is an unselected speaker (step S200: Yes), the voice quality conversion unit 13 selects one speaker from among the unselected speakers (step S210), and the process proceeds to step S130.

When it is determined in step S200 that there is no unselected speaker (step S200: No), the first voice data acquisition unit 12 determines whether the voice data stored in the voice data holding unit 11 includes voice data that has not yet been acquired (step S220).

When it is determined in step S220 that there is unacquired voice data (step S220: Yes), the first voice data acquisition unit 12 acquires one item of voice data from among the unacquired voice data (step S230), and the process proceeds to step S110.

When it is determined in step S220 that there is no unacquired voice data (step S220: No), the second voice data acquisition unit 31 acquires from the extended voice data holding unit 18, for all the voice data stored there, the voice data and the speaker identification information associated with it (step S240).

When the voice data and the associated speaker identification information have been acquired for all the voice data, the second feature amount calculation unit 32 calculates, for each item of voice data, an utterance feature amount indicating the features of the utterance included in that voice data (step S250).

When the utterance feature amounts have been calculated for all the voice data, the first learning unit 33 performs the learning process of the speaker identification model 20, using as learning data each utterance feature amount and the speaker identification information identifying the speaker of the utterance included in the voice data from which that utterance feature amount was calculated, so that when voice data is input the model outputs speaker identification information identifying the speaker of the utterance included in that voice data (step S260).

When step S260 is completed, the speaker identification device 1 ends the speaker identification model learning process.
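The data-expansion part of FIG. 5 (steps S100 to S230) can be summarized as two nested loops. The sketch below is a hedged outline, not the device's implementation: `convert`, `augment`, and `is_similar` are hypothetical callables standing in for the voice quality conversion unit 13, the noise reverberation addition unit 14, and the feature calculation plus comparison of units 15 and 16, and the format of the third-party label is invented for illustration.

```python
def build_extended_dataset(dataset, speakers, convert, augment, is_similar):
    """Expand (voice, speaker_id) pairs by voice conversion and augmentation.

    dataset    : list of (voice, speaker_id) pairs
    speakers   : list of all speaker IDs the converter can target
    convert    : convert(voice, target_id) -> converted voice
    augment    : augment(voice) -> list of noisy/reverberant variants
    is_similar : is_similar(voice, target_id) -> bool (S150-S160 combined)
    """
    extended = []
    for voice, speaker_id in dataset:                       # S100, S220-S230
        extended.append((voice, speaker_id))                # S110: keep original
        for target_id in speakers:                          # S120, S200-S210
            if target_id == speaker_id:
                continue                                    # only *other* speakers
            converted = convert(voice, target_id)           # S130
            candidates = [converted] + augment(converted)   # S140
            for candidate in candidates:                    # S150-S160
                if is_similar(candidate, target_id):
                    label = target_id                       # S170
                else:
                    # S180: hypothetical naming scheme for a third-party speaker
                    label = f"third-party:{speaker_id}->{target_id}"
                extended.append((candidate, label))         # S190
    return extended
```

The returned list plays the role of the contents of the extended voice data holding unit 18; steps S240 to S260 would then compute utterance feature amounts for every entry and train the model on the resulting pairs.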

FIG. 6 is a flowchart of the voice quality conversion model learning process.

The voice quality conversion model learning process is the process in which the voice quality conversion model 133 is trained.

The voice quality conversion model learning process is started, for example, when a user of the speaker identification device 1 performs an operation on the speaker identification device 1 instructing it to start the voice quality conversion model learning process.

When the voice quality conversion model learning process is started, the second learning unit 132 selects one speaker pair from among the plurality of speakers targeted by the voice quality conversion model 133 (step S300). Then, using the learning data for each of the two speakers of the selected pair from among the learning data held by the voice quality conversion learning data holding unit 131, the second learning unit 132 performs the learning process of the voice quality conversion model 133 for the selected pair so that, when voice data of the first speaker (one speaker of the pair) is input, the model outputs voice data of the second speaker (the other speaker of the pair), and so that, when voice data of the second speaker is input, the model outputs voice data of the first speaker (step S310).

After performing the learning process of the voice quality conversion model 133 for one speaker pair, the second learning unit 132 determines whether the plurality of speakers targeted by the voice quality conversion model 133 include a speaker pair that has not yet been selected (step S320).

When it is determined in step S320 that there is an unselected speaker pair (step S320: Yes), the second learning unit 132 selects one speaker pair from among the unselected speaker pairs (step S330), and the process proceeds to step S310.

When it is determined in step S320 that there is no unselected speaker pair (step S320: No), the speaker identification device 1 ends the voice quality conversion model learning process.
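The loop of FIG. 6 amounts to visiting every unordered speaker pair once and training both conversion directions for it. A minimal sketch follows, assuming a hypothetical `train_direction(source, target)` callback that performs one direction of the training in step S310; the actual model architecture and training procedure are not shown.

```python
from itertools import combinations


def train_voice_conversion(speakers, train_direction):
    """Train both conversion directions for every unordered speaker pair.

    speakers        : list of speaker IDs targeted by the conversion model
    train_direction : hypothetical callback train_direction(source, target)
    Returns the list of pairs that were processed, in order.
    """
    processed = []
    for first, second in combinations(speakers, 2):  # S300, S320-S330
        train_direction(first, second)               # S310: first -> second
        train_direction(second, first)               # S310: second -> first
        processed.append((first, second))
    return processed
```

With N speakers this visits N(N-1)/2 pairs and trains N(N-1) directions, so the cost grows quadratically with the number of speakers.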

FIG. 7 is a flowchart of the speaker identification process.

The speaker identification process identifies the speaker of an utterance included in voice data. More specifically, it is a process in which voice data is input to the speaker identification model 20 trained in advance and the speaker identification model 20 is caused to output speaker identification information.

The speaker identification process is started, for example, when a user of the speaker identification device 1 performs an operation on the speaker identification device 1 instructing it to start the speaker identification process.

When the speaker identification process is started, the identification target voice data acquisition unit 40 acquires the voice data to be identified (step S400).

When the voice data has been acquired, the third feature amount calculation unit 21 calculates from it an utterance feature amount indicating the features of the utterance included in the voice data (step S410) and inputs the calculated utterance feature amount to the deep neural network 22. The deep neural network 22 then outputs a speaker characteristic amount indicating the features of the speaker of the utterance included in the voice data from which the input utterance feature amount was calculated (step S420).

When the speaker characteristic amount is output, the determination unit 23 determines, based on that speaker characteristic amount, the speaker of the utterance included in the voice data acquired by the identification target voice data acquisition unit 40 (step S430). The determination unit 23 then outputs speaker identification information identifying the determined speaker (step S440).

When step S440 is completed, the speaker identification device 1 ends the speaker identification process.

<Discussion>
As described above, the speaker identification device 1 expands the learning data that the voice data holding unit 11 stores for training the speaker identification model 20, without being restricted by utterance content or language, and then performs the learning process of the speaker identification model 20 using the expanded learning data. The speaker identification device 1 can therefore improve the accuracy of speaker identification performed with the speaker identification model 20, and can thus identify speakers accurately.

(Supplement)
Although the speaker identification device according to the embodiment has been described above, the present disclosure is not limited to this embodiment.

For example, each processing unit included in the speaker identification device according to the above embodiment is typically realized as an LSI, which is an integrated circuit. The processing units may each be implemented as a separate chip, or some or all of them may be integrated into a single chip.

Circuit integration is not limited to LSI and may be realized with a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array), which can be programmed after the LSI is manufactured, or a reconfigurable processor, in which the connections and settings of circuit cells inside the LSI can be reconfigured, may also be used.

The present disclosure may also be realized as a learning method for a speaker identification model, or as a speaker identification method, executed by the speaker identification device according to the embodiment.

In the above embodiment, each component may be configured with dedicated hardware or may be realized by executing a software program suited to that component. Each component may be realized by a program execution unit, such as a CPU or a processor, reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory.

The division into functional blocks in the block diagram is only an example. A plurality of functional blocks may be realized as a single functional block, a single functional block may be divided into a plurality of blocks, and some functions may be moved to other functional blocks. The functions of a plurality of functional blocks having similar functions may also be processed by a single piece of hardware or software in parallel or by time division.

The order in which the steps in the flowcharts are executed is given as an example to describe the present disclosure concretely, and the steps may be executed in a different order. Some of the steps may also be executed simultaneously (in parallel) with other steps.

Although the speaker identification device according to one or more aspects has been described above based on the embodiment, the present disclosure is not limited to this embodiment. Forms obtained by applying various modifications conceivable to those skilled in the art to the embodiment, and forms constructed by combining components of different modifications, may also be included within the scope of the one or more aspects, as long as they do not depart from the gist of the present disclosure.

The present disclosure is widely applicable to devices and the like that identify speakers.

1 Speaker identification device
10 Voice data expansion unit
11 Voice data holding unit
12 First voice data acquisition unit
13 Voice quality conversion unit
14 Noise reverberation addition unit
15 First feature amount calculation unit
16 Comparison unit
17 Voice data storage unit
18 Extended voice data holding unit
20 Speaker identification model
21 Third feature amount calculation unit
22 Deep neural network
23 Determination unit
30 Learning unit
31 Second voice data acquisition unit
32 Second feature amount calculation unit
33 First learning unit
40 Identification target voice data acquisition unit
131 Voice quality conversion learning data holding unit
132 Second learning unit
133 Voice quality conversion model

Claims

A learning method for a speaker identification model that, when voice data is input, outputs speaker identification information identifying a speaker of an utterance included in the voice data, the learning method comprising:
generating second voice data of a second speaker by performing voice quality conversion processing on first voice data of a first speaker; and
performing a learning process of the speaker identification model using the first voice data and the second voice data as learning data.

The learning method according to claim 1, wherein the voice quality conversion processing is processing based on voice data of the first speaker and voice data of the second speaker.

The learning method according to claim 2, wherein the voice quality conversion processing includes inputting the first voice data to a voice quality conversion model that has been trained in advance to output voice data of the second speaker when voice data of the first speaker is input, and causing the voice quality conversion model to output the second voice data.

The learning method according to claim 3, wherein the voice quality conversion model includes a deep neural network that takes WAV-format voice data as input and outputs WAV-format voice data.

The learning method according to claim 1, wherein the voice quality conversion processing is processing based on voice data of the first speaker and voice data of a third speaker.

The learning method according to claim 1, wherein the speaker identification model includes a deep neural network that takes as input an utterance feature amount indicating features of an utterance included in voice data and outputs a speaker characteristic amount indicating features of a speaker.

A speaker identification method comprising inputting voice data to the speaker identification model that has been trained in advance by the learning method according to claim 1, and causing the speaker identification model to output the speaker identification information.

A program for causing a computer to execute a process of training a speaker identification model that, when voice data is input, outputs speaker identification information identifying a speaker of an utterance included in the voice data, the process including:
a first step of generating second voice data of a second speaker by performing voice quality conversion processing on first voice data of a first speaker; and
a second step of performing a learning process of the speaker identification model using the first voice data and the second voice data as learning data.