JP2016017980A - Voice imitation voice evaluation device, voice imitation voice evaluation method, and program

Info

Publication number: JP2016017980A
Application number: JP2014138332A
Authority: JP
Inventors: 隆伸大庭; Takanobu Oba; 記良鎌土; Noriyoshi Kamado
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-07-04
Filing date: 2014-07-04
Publication date: 2016-02-01
Anticipated expiration: 2034-07-04
Also published as: JP6316685B2

Abstract

PROBLEM TO BE SOLVED: To estimate, from a voice signal alone, whose voice is being imitated, and to evaluate the degree of similarity.
SOLUTION: A person database storage unit 16 stores, as a pair, the voice of each target speaker and related text characterizing that target speaker. A speaker recognition unit 11 calculates a speaker recognition score indicating the similarity between an input imitation voice and the voice of the target speaker. A speech recognition unit 12 performs speech recognition on the imitation voice to generate recognition result text. A text search unit 13 calculates a text similarity indicating the similarity between the recognition result text and the related text. A similarity summation unit 14 calculates a voice imitation score by combining the speaker recognition score and the text similarity. An estimation evaluation unit 15 outputs a voice imitation evaluation result based on the voice imitation score.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice imitation voice evaluation technique that, when a person imitates another person's voice (a vocal impersonation), estimates from the voice signal alone who is being imitated and evaluates how closely the imitation resembles that person (the degree of similarity).

Non-Patent Document 1 describes an example in which voice imitations by professional impersonators are compared with the voices of the persons being imitated (celebrities) in order to clarify which acoustic features are controlled in voice imitation.

Non-Patent Document 1: Tatsuya Kitamura, “Analysis of Imitation Voice by Imitation Talent”, IEICE Technical Report (SP), vol. 107, no. 282, pp. 49-54, 2007

The first problem is how to evaluate the similarity between an imitation voice and the voice of the person being imitated. Non-Patent Document 1 presupposes an utterance phrase (text) that characterizes the person, and evaluates similarity by having an impersonator imitate that phrase. That is, because the impersonator utters the same phrase as the person being imitated, it is possible to analyze whether the patterns of change in fundamental frequency are similar. Apart from imitations by some professional impersonators, however, most imitations do not sound much like the original voice, and some do not resemble it at all. Even so, humans can somehow determine whose voice is being imitated. The cues they use are presumably not limited to the pattern of change in fundamental frequency; other acoustic features are presumably perceived as well.

The second problem is how to correctly estimate who is being imitated even when the imitation is not very similar. In Non-Patent Document 1, the utterance phrase characterizing the person is considered to have played an important role in evaluating the similarity of fundamental frequency change patterns. From this standpoint, identifying the utterance phrase makes it possible to narrow down who is being imitated, and is also expected to help similarity evaluation based on acoustic features. However, although there have been examples in which the system presents an utterance phrase and the imitation voice is restricted to that phrase, there has been no example that recognizes a freely uttered imitation voice, extracts from it an utterance phrase characterizing the person, and uses that phrase as evaluation information for the voice imitation.

The third problem is high-speed processing. For example, if the targets of imitation are celebrities, the number of celebrities is extremely large, so the person being imitated must be estimated from among a large number of candidates.

An object of the present invention is to provide a voice imitation voice evaluation technique that estimates, from a voice signal alone, whose voice is being imitated and evaluates the degree of similarity.

In order to solve the above problems, a voice imitation voice evaluation device according to a first aspect of the present invention includes: a person database storage unit that stores, as a pair, the voice of a target speaker and related text characterizing the target speaker; a speaker recognition unit that calculates a speaker recognition score indicating the similarity between an input imitation voice and the voice of the target speaker; a speech recognition unit that performs speech recognition on the imitation voice to generate recognition result text; a text search unit that calculates a text similarity indicating the similarity between the recognition result text and the related text; a similarity summation unit that calculates a voice imitation score by combining the speaker recognition score and the text similarity; and an estimation evaluation unit that outputs a voice imitation evaluation result based on the voice imitation score.

A voice imitation voice evaluation device according to a second aspect of the present invention includes: a person database storage unit that stores, as a pair, the voice of a target speaker and related text characterizing the target speaker; a speech recognition unit that performs speech recognition on an input imitation voice to generate recognition result text; a text search unit that calculates a text similarity indicating the similarity between the recognition result text and the related text; a high similarity candidate storage unit that stores candidate information identifying target speakers with a high text similarity; a speaker recognition unit that calculates a speaker recognition score indicating the similarity between the imitation voice and the voice of each target speaker identified by the candidate information; and an estimation evaluation unit that outputs a voice imitation evaluation result based on the speaker recognition score.

The voice imitation voice evaluation technique of the present invention evaluates an imitation voice using a similarity obtained by combining a similarity based on speaker features with the similarity between the text recognized from the imitation voice and text characterizing the person being imitated. Because the evaluation is a composite one based on both the similarity of voice characteristics and the similarity of text, even when the imitation is performed with an arbitrary utterance, the imitated person can be identified, or the similarity of the imitation evaluated, more accurately than with conventional techniques that evaluate based on speaker features alone.

FIG. 1 is a diagram illustrating the functional configuration of the voice imitation voice evaluation device of the first embodiment.
FIG. 2 is a diagram illustrating the processing flow of the voice imitation voice evaluation method of the first embodiment.
FIG. 3 is a diagram illustrating the functional configuration of the voice imitation voice evaluation device of the second embodiment.
FIG. 4 is a diagram illustrating the processing flow of the voice imitation voice evaluation method of the second embodiment.
FIG. 5 is a diagram illustrating the functional configuration of the voice imitation voice evaluation device of the third embodiment.
FIG. 6 is a diagram illustrating the processing flow of the voice imitation voice evaluation method of the third embodiment.

Prior to describing the embodiments, the key points of the present invention are explained in relation to the problems described above.

The point of the invention addressing the first problem is the use of speaker recognition technology. Speaker recognition is used to calculate, for each person to be searched, the similarity with the imitation voice. Specifically, the score output by the speaker recognition technique is used as the evaluation value of similarity.

When humans listen to a voice imitation and judge whether it is similar, they sometimes feel that it captures the person's characteristics even though the voice is clearly different. It is not certain whether a speaker recognition score captures this phenomenon, but since speaker recognition is a technique for determining whether two voices belong to the same person, it is highly valid, among measures that can be computed mechanically, as a measure of how good a voice imitation is.

The point of the invention addressing the second problem is the use of speech recognition technology and text search technology. This exploits the fact that humans pay attention to the content of the utterance when estimating who is being imitated. For example, an imitation of a celebrity often utters one of that celebrity's famous phrases. An imitation of an actor often utters a line from a role the actor played. An imitation of a singer often sings one of the singer's famous songs.

The related text for each person is collected in advance, and the imitation voice is converted to text by speech recognition. Using the speech recognition result text as a query, persons with similar related text are retrieved by text search technology. This makes it possible to predict who is being imitated even when the imitation does not actually sound like that person.

The point of the invention addressing the third problem is the introduction of a speaker recognition method based on similarity calculation in a vector space. This, however, relies on the combined use of text search technology. Specifically, a speaker recognition method based on i-vectors and cosine similarity, developed in recent years in the field of speaker recognition, is used. Details of i-vectors are described in H. Aronowitz and O. Barkan, “Efficient approximated i-vector extraction”, Proceedings of ICASSP, pp. 4789-4792, 2012 (Reference 1). Details of the speaker recognition method based on cosine similarity are described in N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification”, IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788-798, 2011 (Reference 2).

Generalized, this technique can be described as a “speaker recognition method based on similarity calculation in a vector space”. It has a low computational cost and is extremely fast, and its performance is relatively high. However, it is not necessarily capable of detecting the correct speaker from among a very large number of people. As noted above, the imitation itself may not even be similar. Nevertheless, as described for the second problem, focusing on text similarity makes it possible to identify who is being imitated, or to narrow the candidates considerably. Within the narrowed set of candidates, if the imitation is similar, even the method based on i-vectors and cosine similarity can identify who is being imitated. Conversely, if the person can be identified from the text alone, all that remains is to calculate the speaker recognition score with the method based on i-vectors and cosine similarity and use it as the evaluation value of the imitation.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and redundant description is omitted.

[First embodiment]
As shown in FIG. 1, the voice imitation voice evaluation device of the first embodiment includes, for example, a voice input unit 10, a speaker recognition unit 11, a speech recognition unit 12, a text search unit 13, a similarity summation unit 14, an estimation evaluation unit 15, a person database storage unit 16, and a similarity storage unit 17.

The voice imitation voice evaluation device is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: random access memory), and the like. The device executes each process under the control of the central processing unit, for example. Data input to the device and data obtained in each process are stored, for example, in the main storage device, and the data stored there is read out as needed and used in other processes. At least some of the processing units of the device may be implemented in hardware such as an integrated circuit.

Each storage unit of the voice imitation voice evaluation device can be configured, for example, as a main storage device such as RAM (random access memory); an auxiliary storage device such as a hard disk, an optical disc, or a semiconductor memory element such as flash memory; or middleware such as a relational database or a key-value store. The storage units need only be logically separated and may be stored in a single physical storage device.

The person database storage unit 16 stores a person database in which the voice uttered by each person who is a target of imitation (hereinafter, a target speaker) is paired with related text characterizing that target speaker.
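
As an illustration of the pairing described above, one entry of the person database might be laid out as follows. This is only a sketch: the class name `PersonRecord`, its field names, and the sample file paths and phrases are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PersonRecord:
    """One target speaker in the person database (hypothetical layout)."""
    name: str                # name of the target speaker
    voice_path: str          # location of the stored reference voice (assumed to be a file)
    related_texts: List[str] = field(default_factory=list)  # texts characterizing the speaker


# A toy database with two target speakers; all values are placeholders.
person_db = [
    PersonRecord("speaker_a", "voices/speaker_a.wav", ["a famous catchphrase"]),
    PersonRecord("speaker_b", "voices/speaker_b.wav", ["a line from a well-known role"]),
]
```

Keeping the voice and the related text in the same record lets the speaker recognition unit and the text search unit look up the same target speaker by a shared key.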

The voice imitation voice evaluation method of the first embodiment is described with reference to FIG. 2.

In step S10, an imitation voice is input to the voice input unit 10. An imitation voice is speech uttered by a speaker imitating a target speaker. The imitation voice may be input from a prior recording or picked up in real time with a microphone or the like. The input imitation voice is sent to the speaker recognition unit 11 and the speech recognition unit 12.

In step S11, the speaker recognition unit 11 calculates a speaker recognition score indicating the similarity between the input imitation voice and the voice of each target speaker stored in the person database storage unit 16. The calculated speaker recognition scores are sent to the similarity summation unit 14.

Any well-known speaker recognition technique can be used to calculate the speaker recognition score. A first form is the speaker recognition method based on a Gaussian mixture model (GMM) described in Shigeo Morishima et al., “New Video Technology ‘Dive Into the Movie’”, IEICE Journal, Vol. 94, No. 3, pp. 250-268, March 2011 (Reference 3). A second form is the speaker recognition method based on dynamic time warping (DTW) described in Reference 3. A third form is the speaker recognition method based on i-vectors and probabilistic linear discriminant analysis described in Pavel Matejka, Ondrej Glembek, Fabio Castaldo, Md. Jahangir Alam, Oldrich Plchot, Patrick Kenny, Lukas Burget, Jan Cernocky, “Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification”, ICASSP 2011, pp. 4828-4831 (Reference 4). A fourth form is a combination of the speaker recognition methods described in Reference 3.
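
To make the DTW-based form more concrete, the following is a minimal sketch of the classic dynamic time warping distance between two sequences of acoustic feature frames. It does not reproduce the method of Reference 3; the function names are hypothetical, and turning the distance into a score by negating it is an assumption made here for illustration.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Classic DTW distance between two sequences of feature frames.

    Each frame is a list of floats; the local cost is the Euclidean distance.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # frame of seq_a repeated
                                     cost[i][j - 1],      # frame of seq_b repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def dtw_speaker_score(imitation_frames, target_frames):
    """Assumed conversion: a smaller distance gives a higher (less negative) score."""
    return -dtw_distance(imitation_frames, target_frames)
```

Because DTW compares frame sequences directly, its cost grows with the product of the two sequence lengths, which is one reason vector-based methods are attractive when the number of target speakers is large.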

In step S12, the speech recognition unit 12 performs speech recognition on the input imitation voice to generate recognition result text. Any well-known speech recognition method may be used. The generated recognition result text is sent to the text search unit 13.

In step S13, the text search unit 13 calculates a text similarity indicating the similarity between the input recognition result text and the related text stored in the person database storage unit 16. Any well-known text search method may be used. The calculated text similarity is sent to the similarity summation unit 14.
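
The specification leaves the text search method open. As one possible stand-in, the sketch below scores two texts by the cosine similarity of their character-bigram counts, which needs no word segmentation and therefore also works on Japanese text. The choice of bigrams and the function names are assumptions made for illustration, not the method of the specification.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Count character bigrams after removing whitespace."""
    text = "".join(text.split())
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def text_similarity(recognized_text, related_text):
    """Cosine similarity between bigram count vectors, in the range [0, 1]."""
    a, b = char_bigrams(recognized_text), char_bigrams(related_text)
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    # Texts shorter than two characters have no bigrams; treat them as dissimilar.
    return dot / norm if norm else 0.0
```

A character-level measure of this kind tolerates some speech recognition errors, since a single misrecognized character disturbs only the bigrams that contain it.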

In step S14, the similarity summation unit 14 combines the speaker recognition score and the text similarity according to a predetermined method to calculate a voice imitation score. The calculated voice imitation score is stored in the similarity storage unit 17.

Methods for combining the speaker recognition score and the text similarity include addition, multiplication, addition in the logarithmic domain, and weighted versions of these operations. The weights may be set manually, through preliminary experiments or the like, to values considered optimal in terms of search accuracy and similar criteria.
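
The combination methods listed above can be sketched as a single function. The function name, the `method` labels, and the default weight of 0.5 are hypothetical; in particular, the specification says only that weights are tuned by preliminary experiments and gives no values.

```python
import math


def fuse_scores(speaker_score, text_score, method="weighted_sum", weight=0.5):
    """Combine a speaker recognition score and a text similarity.

    weight is the share given to the speaker recognition score (assumed
    convention); it is used only by the "weighted_sum" method.
    """
    if method == "sum":
        return speaker_score + text_score
    if method == "product":
        return speaker_score * text_score
    if method == "log_sum":
        # Addition in the logarithmic domain; both scores must be positive.
        return math.log(speaker_score) + math.log(text_score)
    if method == "weighted_sum":
        return weight * speaker_score + (1.0 - weight) * text_score
    raise ValueError(f"unknown method: {method}")
```

Note that the logarithmic form is undefined for zero or negative inputs, so scores such as a raw cosine similarity would first have to be mapped to a positive range.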

In step S15, the estimation evaluation unit 15 outputs a voice imitation evaluation result based on the voice imitation score. When estimating who is being imitated, the result is, for example, the names of a predetermined number of target speakers in descending order of voice imitation score. When outputting an index of whether the imitation is similar, the result is, for example, the speaker recognition scores of a predetermined number of target speakers in descending order of voice imitation score.
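
The two kinds of output described for step S15 can be sketched as follows. The tuple layout and the function name `evaluate` are assumptions made for illustration.

```python
def evaluate(candidates, n=3):
    """Return the top-n names and their speaker recognition scores.

    candidates: list of (name, speaker_score, imitation_score) tuples
                (hypothetical layout). Ranking uses the voice imitation score.
    """
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)[:n]
    names = [c[0] for c in ranked]            # "who is being imitated"
    speaker_scores = [c[1] for c in ranked]   # "how similar the imitation is"
    return names, speaker_scores


names, scores = evaluate(
    [("a", 0.2, 0.9), ("b", 0.8, 0.5), ("c", 0.4, 0.7)], n=2
)
```

In this toy input, speaker "a" ranks first on the combined score even though its speaker recognition score is the lowest, which is the situation the text similarity is meant to handle.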

[Second embodiment]
In the first embodiment, the speaker recognition unit had to calculate a speaker recognition score for every target speaker in the person database. This is computationally expensive and unsuitable when the person database is large. To avoid this, the text search unit can be used to restrict the candidates to the top-ranked target speakers only.

As shown in FIG. 3, the voice imitation voice evaluation device of the second embodiment includes, for example, a voice input unit 10, a speech recognition unit 12, a text search unit 13, an estimation evaluation unit 15, a person database storage unit 16, and a similarity storage unit 17, and further includes a high similarity candidate storage unit 20 and a speaker recognition unit 21.

The voice imitation voice evaluation method of the second embodiment is described with reference to FIG. 4. The description below focuses on the differences from the first embodiment.

In step S13, the text search unit 13 calculates a text similarity indicating the similarity between the input recognition result text and the related text stored in the person database storage unit 16, and stores, in the high similarity candidate storage unit 20, candidate information identifying the target speakers whose related text has a high text similarity.

In step S21, the speaker recognition unit 21 calculates a speaker recognition score indicating the similarity between the input imitation voice and the voice of each target speaker stored in the person database storage unit 16, restricted to the target speakers identified by the candidate information stored in the high similarity candidate storage unit 20. The calculated speaker recognition scores are sent to the estimation evaluation unit 15.

In step S15, the estimation evaluation unit 15 outputs a voice imitation evaluation result based on the speaker recognition score. When estimating who is being imitated, the result is, for example, the names of a predetermined number of target speakers in descending order of speaker recognition score. When outputting an index of whether the imitation is similar, the result is, for example, a predetermined number of speaker recognition scores in descending order.

In the second embodiment, speaker recognition scores are calculated only for the target speaker candidates narrowed down by text similarity, so the voice imitation evaluation result can be obtained quickly. In addition, the number of candidates can be controlled independently for text similarity and for the speaker recognition score. The drawback is that if the text search unit 13 fails to include the correct target speaker among the candidates, the target speaker cannot be identified correctly no matter how similar the imitation is.
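
The two-stage flow of the second embodiment, including its drawback, can be sketched as follows. The function name, the dictionary input, and the callback `speaker_score_fn` are hypothetical stand-ins for the text search unit and the speaker recognition unit.

```python
def shortlist_then_score(text_scores, speaker_score_fn, k=10):
    """Rank only the k best text matches by speaker recognition score.

    text_scores: dict mapping target speaker name -> text similarity.
    speaker_score_fn: callable(name) -> speaker recognition score
                      (hypothetical; called only for shortlisted speakers).
    """
    # Stage 1: keep the k target speakers with the highest text similarity.
    shortlist = sorted(text_scores, key=text_scores.get, reverse=True)[:k]
    # Stage 2: run the costly speaker scoring on the shortlist only.
    scored = [(name, speaker_score_fn(name)) for name in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

If a speaker whose voice matches very well has a low text similarity, that speaker never reaches stage 2, which is exactly the drawback noted above; raising `k` trades speed for a lower risk of this kind of miss.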

[Third embodiment]
The third embodiment of the present invention is a voice imitation voice evaluation device and method for the case where the person database is large. The device and method of the third embodiment are described below as built on the first embodiment, but they can also be built on the second embodiment.

As shown in FIG. 5, the voice imitation voice evaluation device of the third embodiment includes, for example, as in the first embodiment, a voice input unit 10, a speech recognition unit 12, a text search unit 13, a similarity summation unit 14, an estimation evaluation unit 15, and a similarity storage unit 17, and further includes a speaker feature vector space speaker recognition unit 31 and a person database storage unit 36.

The person database storage unit 36 stores a person database in which related text characterizing each target speaker to be imitated is paired with the speaker feature vector of that target speaker.

The voice imitation voice evaluation method of the third embodiment is described with reference to FIG. 6. The description below focuses on the differences from the first embodiment.

In step S31, the speaker feature vector space speaker recognition unit 31 calculates, as the speaker recognition score, the similarity between a speaker feature vector extracted from the input imitation voice and the speaker feature vector of each target speaker stored in the person database storage unit 36. The calculated speaker recognition scores are sent to the similarity summation unit 14.

A specific configuration of the speaker feature vector space speaker recognition unit 31 is the method based on i-vectors and cosine similarity described in Reference 4 above. The key point of this method is that the speech to be recognized (here, the imitation voice) and the speech of each target speaker are each represented by a single speaker feature vector called an i-vector. The similarity is then calculated using a measure of similarity between two vectors that has a low computational cost, such as cosine similarity. This enables high-speed processing. With this method, the speaker feature vector space speaker recognition unit 31 consists of a process that calculates one speaker feature vector from the imitation voice, and a process that calculates, with a predetermined similarity measure, the similarity (here, the speaker recognition score) between that vector and the single speaker feature vector registered in advance in the person database for each target speaker.

One form of the speaker feature vector is, for example, the feature called an i-vector. Details of i-vectors are described in Reference 1 above. Another form of the speaker feature vector is the vector of speaker-dependent components extracted using joint factor analysis (JFA). Details of the vectors obtained by JFA are described in Reference 2 above.

Both the i-vector and the vector obtained by JFA are produced by applying a predetermined matrix decomposition to a vector (supervector) formed by concatenating the mean vectors of the individual Gaussians of a Gaussian mixture model (GMM) that has been adapted to the speech data. In view of this, another form of the speaker feature vector is a vector obtained by decomposing the GMM supervector with a predetermined method so that the speaker component can be extracted. Yet another form is the GMM supervector itself. The GMM supervector is one option in that it retains the speaker component, but it also contains a large amount of non-speaker components. Depending on the configuration, it can also have far more dimensions than the other vectors, which raises concerns about search speed. In addition, even a vector obtained without going through a GMM supervector falls within the category of speaker feature vectors, as long as it is a single feature vector assigned to an utterance and is effective in identifying the speaker.

Any similarity measure may be used for the similarity between speaker feature vectors. One representative choice is cosine similarity. Another representative choice is the inner product.
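The two similarity measures named above can be sketched as follows. This is an illustrative example only: the function names, the dictionary layout of the database, and the speaker labels are assumptions, and the vectors are toy values rather than real i-vectors.

```python
import math


def inner_product(a, b):
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Inner product normalized by the lengths of both vectors."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return inner_product(a, b) / norm


def rank_speakers(query, database, measure=cosine_similarity):
    """Return (score, speaker) pairs sorted from most to least similar."""
    scored = [(measure(query, vector), name) for name, vector in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical database of target-speaker feature vectors.
database = {
    "speaker_a": [0.0, 1.0],
    "speaker_b": [2.0, 0.0],
}
ranking = rank_speakers([1.0, 0.0], database)
```

Cosine similarity ignores vector length, whereas the inner product grows with it, so the two measures can rank candidates differently when the feature vectors are not length-normalized.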

Needless to say, the present invention is not limited to the embodiments described above and may be modified as appropriate without departing from its spirit. The various processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

[Program, recording medium]
When the various processing functions of each device described in the above embodiments are realized by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer then realizes the various processing functions of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first, for example, temporarily stores in its own storage device the program recorded on the portable recording medium or the program transferred from the server computer. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another execution form, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program successively each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (for example, data that is not a direct instruction to a computer but has the property of defining the computer's processing).

In this embodiment the device is configured by executing a predetermined program on a computer; however, at least part of the processing content may instead be realized in hardware.

Description of symbols
10 Voice input unit
11, 21 Speaker recognition unit
12 Voice recognition unit
13 Text search unit
14 Similarity summation unit
15 Estimation evaluation unit
16, 36 Person database storage unit
17 Similarity storage unit
20 High similarity candidate storage unit
31 Speaker feature vector space speaker recognition unit

Claims

A voice imitation voice evaluation device including:
a person database storage unit that stores, as a pair, a voice of a target speaker and a related text characterizing the target speaker;
a speaker recognition unit that calculates a speaker recognition score indicating the similarity between an input voice imitation voice and the voice of the target speaker;
a voice recognition unit that performs voice recognition on the voice imitation voice and generates a recognition result text;
a text search unit that calculates a text similarity indicating the similarity between the recognition result text and the related text;
a similarity summation unit that calculates a voice imitation score by summing the speaker recognition score and the text similarity; and
an estimation evaluation unit that outputs a voice imitation evaluation result based on the voice imitation score.

A voice imitation voice evaluation device including:
a person database storage unit that stores, as a pair, a voice of a target speaker and a related text characterizing the target speaker;
a voice recognition unit that performs voice recognition on an input voice imitation voice and generates a recognition result text;
a text search unit that calculates a text similarity indicating the similarity between the recognition result text and the related text;
a high similarity candidate storage unit that stores candidate information specifying the target speakers whose text similarity is high;
a speaker recognition unit that calculates a speaker recognition score indicating the similarity between the voice imitation voice and the voice of the target speaker specified by the candidate information; and
an estimation evaluation unit that outputs a voice imitation evaluation result based on the speaker recognition score.

The voice imitation voice evaluation device according to claim 1 or 2, wherein
the person database storage unit stores, as a pair, a speaker feature vector extracted from the voice of the target speaker and the related text characterizing the target speaker, and
the speaker recognition unit calculates the speaker recognition score from a speaker feature vector extracted from the voice imitation voice and the speaker feature vector of the target speaker.

A voice imitation voice evaluation method in which a person database storage unit stores, as a pair, a voice of a target speaker and a related text characterizing the target speaker, the method including:
a speaker recognition step in which a speaker recognition unit calculates a speaker recognition score indicating the similarity between an input voice imitation voice and the voice of the target speaker;
a voice recognition step in which a voice recognition unit performs voice recognition on the voice imitation voice and generates a recognition result text;
a text search step in which a text search unit calculates a text similarity indicating the similarity between the recognition result text and the related text;
a similarity summation step in which a similarity summation unit calculates a voice imitation score by summing the speaker recognition score and the text similarity; and
an estimation evaluation step in which an estimation evaluation unit outputs a voice imitation evaluation result based on the voice imitation score.

A voice imitation voice evaluation method in which a person database storage unit stores, as a pair, a voice of a target speaker and a related text characterizing the target speaker, the method including:
a voice recognition step in which a voice recognition unit performs voice recognition on an input voice imitation voice and generates a recognition result text;
a text search step in which a text search unit calculates a text similarity indicating the similarity between the recognition result text and the related text;
a high similarity candidate storage step of storing, in a high similarity candidate storage unit, candidate information specifying the target speakers whose text similarity is high;
a speaker recognition step in which a speaker recognition unit calculates a speaker recognition score indicating the similarity between the voice imitation voice and the voice of the target speaker specified by the candidate information; and
an estimation evaluation step in which an estimation evaluation unit outputs a voice imitation evaluation result based on the speaker recognition score.

A program for causing a computer to function as the voice imitation voice evaluation device according to any one of claims 1 to 3.