JP2011242637A - Voice data editing device

Info

Publication number: JP2011242637A
Application number: JP2010115192A
Authority: JP
Inventors: Yasuyuki Mitsui; 康行三井; Reishi Kondou; 玲史近藤; Masanori Kato; 正徳加藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-05-19
Filing date: 2010-05-19
Publication date: 2011-12-01

Abstract

PROBLEM TO BE SOLVED: To provide a voice data editing device which detects defects in the data of recorded voice and modifies the defects using synthesized voice data so that additional recording is not needed.

SOLUTION: A voice data editing device 100 includes an editing information generation unit 200, a voice data synthesis unit 300 and a voice data editing unit 400. The editing information generation unit 200 detects defects in recorded voice data 700 and generates voice data synthesis information 500 needed for synthesizing voice data for modifying the defects, and recorded voice data modifying information 600 which includes information on time points at which the defects occur. The voice data synthesis unit 300 synthesizes voice data based on the voice data synthesis information 500, and the voice data editing unit 400 then modifies the defects in the recorded voice data 700 using the synthesized voice data, based on the recorded voice data modifying information 600.

Description

The present invention relates to a voice editing apparatus, and more particularly to a technique for editing recorded voice using synthesized voice.

An example of this type of voice editing apparatus is described in Patent Document 1. The apparatus described in Patent Document 1 is used in automatic voice guidance systems, such as in-vehicle navigation devices and automatic announcement devices in public facilities, and edits voice by replacing the portion of a recorded voice designated by editing information with synthesized voice. Specifically, suppose that recorded voice data corresponding to the text “Ahead, near Nakano, there is a traffic jam” is stored. When editing information instructing that “Nakano” be changed to “Shinagawa” is input, the apparatus generates synthesized voice corresponding to “Shinagawa”, joins it to the recorded voice data on either side of the replaced word, and outputs the voice “Ahead, near Shinagawa, there is a traffic jam”.

In addition, the following techniques are related to the present invention.

Patent Document 2 describes a technique that takes as input an information source in which voice data and text data are mixed, generates synthesized voice from the text data using speech synthesis means, and generates audio content in which the synthesized voice and the voice data are arranged in a predetermined order.

Patent Document 3 describes an apparatus that detects transcription errors made when voice is converted into text, either manually or by a speech recognition device. Specifically, when transcribing a certain voice yields a text such as “From today's issue (号), the weather will take a turn for the worse ...”, the apparatus generates synthesized voice back from that text and compares the synthesized voice with the original voice, thereby detecting the transcription error “号” in the text.

Patent Document 4 describes a speech recognition system that correctly accepts only the restated or intended utterance when a user restates an utterance immediately after misspeaking, or produces the intended utterance after a hesitation. For example, when the user says “From Osaka, no, from Shin-Osaka to Tokyo”, the system detects the specific word “no”, rejects the voice section corresponding to “From Osaka, no”, and treats only the voice section corresponding to “from Shin-Osaka to Tokyo” as the target of speech recognition.

JP 2009-157220 A
WO 2008/001500
JP 2001-134276 A
JP 2007-057844 A

Whether for professional or consumer use, there are many situations in which a speaker's voice is recorded and used. Professional examples include television and radio program production and the recording of lectures and talks; consumer examples include answering machines and home video recording.

In such situations, the recorded voice may contain misstatements, hesitations, superimposed noise, and the like. If the speaker restated the utterance immediately afterwards, or if the superimposed noise is very brief, the recording can be corrected manually with commercially available audio editing applications or equipment, or automatically by applying the technique described in Patent Document 4. However, if the speaker continued without correcting the misstatement, or if the noise is loud, the voice has to be recorded again in order to correct it, which places a heavy burden on the user.

Patent Document 1 describes a technique for replacing part of a recorded voice with synthesized voice. However, Patent Document 1 does not assume that the recorded voice contains misstatements, hesitations, superimposed noise, and the like. That is, it does not describe the idea of replacing linguistically or acoustically defective portions of a recorded voice with synthesized voice. Patent Document 2 concerns arranging synthesized voice and voice data in a predetermined order, and is not a technique for replacing part of a recorded voice with synthesized voice. Patent Document 3 concerns detecting transcription errors, and is not a technique for detecting misstatements, hesitations, superimposed noise, and the like in a recorded voice.

An object of the present invention is to provide a voice editing apparatus that solves the problem described above, namely that voice must be recorded again in order to correct a defective portion in a recorded voice.

A voice editing apparatus according to one aspect of the present invention includes: editing information generation means for detecting a defective portion of a recorded voice and generating speech synthesis information needed to generate the synthesized voice used for correcting the defective portion, together with recorded voice change information including position information of the defective portion; speech synthesis means for generating synthesized voice based on the speech synthesis information; and voice editing means for correcting the defective portion of the recorded voice with the synthesized voice based on the recorded voice change information.

Because the present invention is configured as described above, voice does not need to be recorded again in order to correct a defective portion in a recorded voice, and the burden on the user of correcting the recorded voice can be reduced.

A block diagram of a first embodiment of the present invention.
A flowchart showing the flow of operation of the first embodiment of the present invention.
A block diagram of a second embodiment of the present invention.
A block diagram of the linguistic defect detection unit in the second embodiment of the present invention.
A diagram showing an example of a recorded voice and its correct content in the second embodiment of the present invention.
An explanatory diagram of a specific method of generating speech synthesis information in the second embodiment of the present invention.
An explanatory diagram of a specific editing method of the voice editing unit in the second embodiment of the present invention.
A block diagram of a third embodiment of the present invention.
A diagram showing an example of a detection result of the linguistic defect detection unit in the third embodiment of the present invention.
A diagram showing an example of a user interface screen in the third embodiment of the present invention.

Next, embodiments of the present invention will be described in detail with reference to the drawings.

[First Embodiment]
Referring to FIG. 1, the voice editing apparatus 100 according to the first embodiment of the present invention has a function of receiving a recorded voice 700, editing it, and outputting an edited voice 800. The voice editing apparatus 100 includes an editing information generation unit 200, a speech synthesis unit 300, and a voice editing unit 400.

The editing information generation unit 200 has a function of analyzing the recorded voice 700 and detecting defective portions in it. Specifically, the unit detects acoustically defective portions or linguistically defective portions in the recorded voice 700. To detect acoustically defective portions, the unit uses voice features obtained by analyzing the recorded voice 700 and detects, as acoustically defective, any portion where a predetermined acoustic parameter such as tempo, fundamental frequency, power, or S/N ratio changes so abruptly that the change exceeds a predetermined threshold. To detect linguistically defective portions, the unit converts the recorded voice 700 into text, for example with a speech recognition device, and detects in that text portions that are grammatically or semantically incorrect, such as misreadings and misstatements. The unit also detects, as defective, portions that are grammatically and semantically undesirable, such as fillers and hesitations.

The editing information generation unit 200 further has a function of generating the speech synthesis information needed to synthesize voice for correcting the detected defective portions, and recorded voice change information including position information of the defective portions. The unit outputs the generated information as speech synthesis information 500 and recorded voice change information 600. Alternatively, the unit displays the generated speech synthesis information and recorded voice change information on a display device (not shown), modifies them according to user instructions entered through an input device (not shown), and outputs the modified information as the speech synthesis information 500 and the recorded voice change information 600.

The speech synthesis unit 300 has a function of synthesizing, according to the speech synthesis information 500, voice that is identical or similar to that of the speaker of the recorded voice 700. The speech synthesis unit 300 outputs the synthesized voice to the voice editing unit 400.

The voice editing unit 400 has a function of correcting the defective portions of the recorded voice 700 with the synthesized voice, based on the recorded voice change information 600.

Next, the operation of this embodiment will be described. FIG. 2 is a flowchart showing the flow of operation of this embodiment.

When the recorded voice 700 is input (S101), the editing information generation unit 200 of the voice editing apparatus 100 detects acoustically or linguistically defective portions in the recorded voice 700 (S102). Next, the editing information generation unit 200 generates speech synthesis information 500 for synthesizing the voice used to correct the detected defective portions, and recorded voice change information 600 including position information of those defective portions (S103). The editing information generation unit 200 then outputs the speech synthesis information 500 to the speech synthesis unit 300 and the recorded voice change information 600 to the voice editing unit 400.

The speech synthesis unit 300 synthesizes voice according to the speech synthesis information 500 and outputs the synthesized voice to the voice editing unit 400 (S104).

The voice editing unit 400 generates and outputs the edited voice 800 by cutting out the portions of the recorded voice 700 indicated by the recorded voice change information 600, or by replacing them with the synthesized voice generated by the speech synthesis unit 300 (S105).
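The cut-or-replace step S105 can be illustrated with a minimal Python sketch. It assumes that the recorded voice is a sequence of samples and that each entry of the recorded voice change information is a sample range plus an optional synthesized replacement. The `Change` class and `apply_changes` function are hypothetical names introduced only for this illustration; the patent does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Change:
    """One entry of recorded voice change information (hypothetical layout)."""
    start: int                                     # first sample of the defective portion
    end: int                                       # one past its last sample
    replacement: Optional[Sequence[float]] = None  # synthesized samples, or None to cut


def apply_changes(recorded: Sequence[float], changes: List[Change]) -> List[float]:
    """Return the edited voice: each defective span is either cut or replaced."""
    edited: List[float] = []
    cursor = 0
    for change in sorted(changes, key=lambda c: c.start):
        if change.start < cursor or change.end < change.start:
            raise ValueError("changes must be well-formed and non-overlapping")
        edited.extend(recorded[cursor:change.start])  # keep the good audio
        if change.replacement is not None:
            edited.extend(change.replacement)         # splice in the synthesized voice
        cursor = change.end                           # skip the defective span
    edited.extend(recorded[cursor:])                  # keep the tail after the last change
    return edited
```

A real editor would probably also smooth the joins, for example with a short cross-fade, which this sketch leaves out.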

As described above, according to this embodiment, voice does not need to be recorded again in order to correct a defective portion in a recorded voice, and the burden on the user of correcting the recorded voice can be reduced.

[Second Embodiment]
Referring to FIG. 3, the voice editing apparatus 101 according to the second embodiment of the present invention has a function of receiving a recorded voice 701, editing it, and outputting an edited voice 801. The voice editing apparatus 101 includes an editing information generation unit 201, a speech synthesis unit 301, and a voice editing unit 401.

The editing information generation unit 201 has a function of analyzing the recorded voice 701 to detect acoustically defective portions and linguistically defective portions in it, and a function of generating speech synthesis information 501 needed to synthesize voice for correcting the detected defective portions, together with recorded voice change information 601 including position information of the defective portions. The editing information generation unit 201 includes a speech recognition unit 210, a voice analysis unit 220, a change location determination unit 230, a synthesis information generation unit 240, and a recorded voice change information generation unit 250.

The speech recognition unit 210 receives the recorded voice 701, converts it into text by speech recognition, and outputs the text to the change location determination unit 230.

The voice analysis unit 220 receives the recorded voice 701, extracts acoustic features from it by voice analysis, and outputs the acoustic features to the change location determination unit 230 and the synthesis information generation unit 240. Acoustic features that may be extracted from the recorded voice 701 include tempo (overall average, local values), fundamental frequency (overall average, local values, pitch pattern, and so on), power (overall average, local values), spectral information (overall average, local values), and S/N ratio (overall average, local values). Possible extraction methods include cepstrum analysis, LPC analysis, and LSP analysis. The types of acoustic features output from the voice analysis unit 220 to the change location determination unit 230 may be the same as or different from those output to the synthesis information generation unit 240.

The change location determination unit 230 includes an acoustic defect detection unit 231 that detects acoustically defective portions in the recorded voice 701, a linguistic defect detection unit 232 that detects linguistically defective portions in the recorded voice 701, and an integration unit 233.

The acoustic defect detection unit 231 has a function of detecting acoustically defective portions in the recorded voice 701 based on the acoustic features extracted from it. An acoustically defective portion is a portion where, for example, tempo, fundamental frequency, power, or S/N ratio changes abruptly, so that problems such as the voice being hard to understand are likely to arise when it is heard. The unit detects as acoustically defective, for example, a portion where at least one acoustic feature such as tempo, fundamental frequency, power, or S/N ratio differs from, for example, its overall average by at least a threshold amount. When the unit detects one or more acoustically defective portions, it outputs to the integration unit 233, for each defective portion, information identifying that portion. Such information may be, for example, a position in the text obtained from the recorded voice 701, or a position in the recorded voice 701 itself.
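The comparison against the overall average can be sketched as follows. This is only an illustration under stated assumptions: the input is a per-frame series of one acoustic feature (for example power), the defect test is an absolute deviation from the overall mean of at least `threshold`, and the function name `find_acoustic_defects` is hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def find_acoustic_defects(values: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges that deviate from the overall mean by >= threshold."""
    if not values:
        return []
    overall = mean(values)
    spans: List[Tuple[int, int]] = []
    start = None                       # start frame of the span currently being collected
    for i, value in enumerate(values):
        defective = abs(value - overall) >= threshold
        if defective and start is None:
            start = i                  # a defective run begins here
        elif not defective and start is not None:
            spans.append((start, i))   # the run ended on the previous frame
            start = None
    if start is not None:
        spans.append((start, len(values)))  # run reaches the end of the recording
    return spans
```

Running the same function once per feature (tempo, fundamental frequency, power, S/N ratio) and merging the resulting ranges would approximate the "at least one feature" condition in the text.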

The linguistic defect detection unit 232 has a function of detecting linguistically defective portions in the recorded voice 701 and estimating the correct text, based on the text obtained from the recorded voice 701. A linguistically defective portion is a portion that is grammatically or semantically incorrect, such as a misreading or misstatement, or a portion that is grammatically and semantically undesirable, such as a filler or hesitation. The linguistic defect detection unit 232 detects such portions in the text obtained from the recorded voice 701 and estimates the correct text.

FIG. 4 shows an example of the configuration of the linguistic defect detection unit 232. Referring to FIG. 4, the linguistic defect detection unit 232 includes a text change unit 2321, which receives the text of the recorded voice 701 generated by the speech recognition unit 210 as pre-change text 2323 and outputs post-change text 2324, and a difference extraction unit 2322, which extracts the differences between the pre-change text 2323 and the post-change text 2324.

The text change unit 2321 detects, in the pre-change text 2323, grammatically incorrect portions such as misreadings and misstatements and grammatically undesirable portions such as fillers and hesitations, and outputs as the post-change text 2324 a text in which these portions have been changed to grammatically correct or grammatically preferable content. For fillers, for example, the text change unit 2321 finds them by matching the pre-change text 2323 against a dictionary of filler candidates and deletes them. For hesitations, misreadings, and misstatements, the text change unit 2321 estimates the erroneous portion and the correct content using a method that estimates the correct content from the similarity of phoneme sequences and from the context of the surrounding words. The text change unit 2321 then deletes hesitations and replaces misreadings and misstatements with the correct content.
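A rough Python sketch of these two operations is given below. It is a simplification and not the method of the patent: fillers are removed by exact dictionary lookup on romanized tokens, and an unknown token is replaced by the most similar entry of a hypothetical lexicon, using `difflib.SequenceMatcher` on the romanized strings as a stand-in for phoneme-sequence similarity. The context of surrounding words, which the text also mentions, is not modeled. The names `change_text`, `similarity`, and `min_similarity` are assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Set


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], used here in place of phoneme similarity."""
    return SequenceMatcher(None, a, b).ratio()


def change_text(tokens: Sequence[str], fillers: Set[str],
                lexicon: Sequence[str], min_similarity: float = 0.6) -> List[str]:
    """Delete fillers and replace unknown tokens by their closest lexicon entry."""
    changed: List[str] = []
    for token in tokens:
        if token in fillers:
            continue                    # filler: delete it
        if token in lexicon:
            changed.append(token)       # known word: keep as is
            continue
        best = max(lexicon, key=lambda word: similarity(token, word), default=None)
        if best is not None and similarity(token, best) >= min_similarity:
            changed.append(best)        # likely misreading: use the estimated correct word
        else:
            changed.append(token)       # no plausible correction: leave unchanged
    return changed
```

For instance, with fillers `{"eeto"}` and lexicon `["sousai", "suru"]`, the tokens `["eeto", "sousatsu", "suru"]` become `["sousai", "suru"]`.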

The difference extraction unit 2322 extracts the differences between the pre-change text 2323 and the post-change text 2324. When it detects one or more differences, it outputs to the integration unit 233, for each difference, the position in the pre-change text 2323 and the correct text.
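One possible way to obtain "position in the pre-change text plus correct text" for each difference is a sequence diff. The sketch below uses Python's standard `difflib.SequenceMatcher` on token lists; the function name `extract_differences` and the tuple layout are assumptions for illustration, and the patent does not specify a particular diff algorithm.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def extract_differences(before: Sequence[str],
                        after: Sequence[str]) -> List[Tuple[int, int, List[str]]]:
    """Return (start, end, corrected_tokens) for each changed span of `before`.

    `start` and `end` index into `before`; an empty corrected list means deletion.
    """
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":                       # 'replace', 'delete' or 'insert'
            differences.append((i1, i2, list(after[j1:j2])))
    return differences
```

With pre-change tokens `["eeto", "kyou", "wa", "sousatsu", "suru"]` and post-change tokens `["kyou", "wa", "sousai", "suru"]`, this yields one deletion at position 0 and one replacement at position 3.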

Referring again to FIG. 3, the integration unit 233 generates changed location phoneme information 261 and changed location position information 262 based on the detection results of the acoustic defect detection unit 231 and the linguistic defect detection unit 232, and outputs them to the synthesis information generation unit 240 and the recorded voice change information generation unit 250.

For example, when the acoustic defect detection unit 231 detects an acoustic defect in the portion “Company A” of the text obtained from the recorded voice 701, the integration unit 233 outputs, for that defect, the phoneme information of “Company A” as the changed location phoneme information 261 and the position information of the defective portion as the changed location position information 262. The position of a changed location is expressed, for example, as the number of morae from the beginning of the sentence to the changed location. Alternatively, the integration unit 233 generates changed location phoneme information 261 indicating the phoneme information of a predetermined range extending from several words before the acoustically defective portion to several words after it, and changed location position information 262 indicating that range.

Likewise, when the linguistic defect detection unit 232 detects, for example, the filler “um” (ええと) in the text obtained from the recorded voice 701, the integration unit 233 generates, for that linguistic defect, changed location position information 262 indicating the location of “um” and does not generate corresponding changed location phoneme information 261. Alternatively, the integration unit 233 generates changed location phoneme information 261 indicating the phoneme information of a predetermined range, extending from several words before “um” to several words after it, with the phonemes of “um” removed, and changed location position information 262 indicating that range.

Similarly, when the text obtained from the recorded voice 701 contains the misreading “sousatsu” (そうさつ) and “sousai” (そうさい) has been estimated as the correct text, the integration unit 233 generates, for that linguistic defect, for example, changed location phoneme information 261 indicating the phoneme information of “sousai” and changed location position information 262 indicating the location of “sousatsu”. Alternatively, the integration unit 233 generates changed location phoneme information 261 indicating the phoneme information of a predetermined range, extending from several words before “sousatsu” to several words after it, in which the “sousatsu” portion has been replaced with the phonemes of “sousai”, and changed location position information 262 indicating that range.

Furthermore, where an acoustically defective portion and a linguistically defective portion overlap, the integration unit 233 ignores the acoustic defect and processes only the linguistic defect. This is because the acoustic defect is resolved as a matter of course when the linguistic defect is corrected. For example, when the text obtained from the recorded voice 701 contains the misreading “sousatsu” (そうさつ), “sousai” (そうさい) has been estimated as the correct text, and an acoustic defect has also been detected in the “sousatsu” portion, the integration unit 233 generates, for example, changed location phoneme information 261 indicating the phoneme information of “sousai” and changed location position information 262 indicating the location of “sousatsu”.
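The overlap rule can be written compactly. The sketch below assumes that both detectors report defects as half-open `(start, end)` ranges on the same position axis (for example mora or token positions); the names `overlaps` and `integrate` are hypothetical.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open range [start, end) on a shared position axis


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def integrate(acoustic: Sequence[Span], linguistic: Sequence[Span]) -> List[Span]:
    """Keep every linguistic defect; keep an acoustic defect only if it overlaps none of them."""
    kept = [a for a in acoustic if not any(overlaps(a, l) for l in linguistic)]
    return sorted(kept + list(linguistic))
```

With acoustic defects `[(2, 4), (10, 12)]` and a linguistic defect `[(3, 5)]`, the acoustic range `(2, 4)` is dropped and the result is `[(3, 5), (10, 12)]`.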

Note that the integration unit 233 may vary the editing method and the editing range according to, for example, the position of the changed location within the sentence. For instance, a wide range may be changed in the middle of a sentence, whereas at the end of a sentence the changed range is kept as narrow as possible.

The synthesis information generation unit 240 generates speech synthesis information 501 for speech synthesis according to the changed location phoneme information 261 and outputs it to the speech synthesis unit 301. Here, the speech synthesis information is the information required for speech synthesis in the speech synthesis unit 301: the phoneme information of the synthesized voice to be generated, and feature information of the synthesized voice such as tempo, fundamental frequency, spectral information, and duration information. The phoneme information given by the changed location phoneme information 261 is used as the phoneme information of the synthesized voice. The phoneme information may be included in the speech synthesis information 501 as a phonetic symbol string or as mixed kanji-kana text. Feature information such as the tempo and fundamental frequency of the synthesized voice is generated from the acoustic features supplied by the voice analysis unit 220. That is, by making features such as tempo and fundamental frequency identical or close to those of the recorded voice 701, the recorded voice and the synthesized voice are joined smoothly, without sounding unnatural.
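As a hedged illustration of how such information might be assembled, the sketch below pairs the corrected phoneme string with a tempo and fundamental frequency taken from the frames adjacent to the changed location. Averaging the neighboring frames is an assumption made here to keep the synthesized voice close to the surrounding recording; the text only states that the features should be identical or close to those of the recorded voice. `SpeechSynthesisInfo`, `neighbourhood_mean`, and `build_synthesis_info` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class SpeechSynthesisInfo:
    """Hypothetical container for a subset of the speech synthesis information."""
    phonemes: str                 # phonetic symbol string for the replacement
    tempo: float                  # target tempo for the synthesized voice
    fundamental_frequency: float  # target fundamental frequency


def neighbourhood_mean(frames: Sequence[float], start: int, end: int, context: int) -> float:
    """Mean of up to `context` frames on each side of [start, end), excluding the span itself."""
    neighbours = list(frames[max(0, start - context):start]) + list(frames[end:end + context])
    # Fall back to the overall mean when the span has no neighbours at all.
    return float(mean(neighbours)) if neighbours else float(mean(frames))


def build_synthesis_info(phonemes: str, tempo_frames: Sequence[float],
                         f0_frames: Sequence[float], start: int, end: int,
                         context: int = 2) -> SpeechSynthesisInfo:
    """Combine the corrected phonemes with prosody targets taken from nearby frames."""
    return SpeechSynthesisInfo(
        phonemes=phonemes,
        tempo=neighbourhood_mean(tempo_frames, start, end, context),
        fundamental_frequency=neighbourhood_mean(f0_frames, start, end, context),
    )
```

Excluding the defective span itself from the average avoids copying an abnormal tempo or pitch into the replacement.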

収録音声変更情報生成部２５０は、変更箇所位置情報２６２に従って、収録音声変更情報６０１を生成する。収録音声変更情報６０１は、少なくとも変更箇所位置情報２６２を含み、さらに加えて、変更箇所の変更前音節情報、変更箇所の前後に係る音節情報等を含めてもよい。なお、収録音声変更情報６０１として、変更箇所位置情報２６２のみを利用する場合は、変更箇所位置情報２６２をそのまま収録音声変更情報６０１とすればよい。この場合、収録音声変更情報生成部２５０は省略することができる。 The recorded audio change information generation unit 250 generates the recorded audio change information 601 according to the changed location position information 262. The recorded voice change information 601 includes at least the changed part position information 262, and may further include pre-change syllable information of the changed part, syllable information related to before and after the changed part, and the like. When only the changed part position information 262 is used as the recorded voice change information 601, the changed part position information 262 may be used as the recorded voice change information 601 as it is. In this case, the recorded audio change information generation unit 250 can be omitted.

The speech synthesis unit 301 generates the synthesized voice for the edited part from the speech synthesis information 501. When the speech synthesis information 501 is mixed kanji-kana text, the speech synthesis unit 301 first generates, by morphological analysis, a phonetic symbol string annotated with reading and accent information, and then synthesizes the voice using a speech synthesis database. The speech synthesis database stores the voice waveforms from which voice is synthesized, the syllable or phoneme data corresponding to those waveforms, the prosodic feature parameters of each syllable or phoneme, and the like. If a speech synthesis database built from the same speaker as the recorded voice has been prepared in advance, that database is used. If no such database has been prepared and the amount of data in the recorded voice 701 is sufficiently large, a speech synthesis database may be created from the recorded voice 701 and used. Alternatively, the synthesized voice may be generated using the speech synthesis database of another speaker and its voice quality then converted so as to be close to that of the speaker of the recorded voice 701.

音声編集部４０１は、収録音声７０１、音声合成部３０１によって生成された合成音声、および収録音声変更情報６０１を入力し、収録音声変更情報６０１で指示された通りに収録音声と合成音声を結合、編集し、編集済み音声８０１として出力する。 The voice editing unit 401 inputs the recorded voice 701, the synthesized voice generated by the voice synthesizing unit 301, and the recorded voice change information 601, and combines the recorded voice and the synthesized voice as instructed by the recorded voice change information 601. Edit and output as edited voice 801.

次に本実施形態の動作を説明する。 Next, the operation of this embodiment will be described.

First, the recorded voice 701 is input to the voice recognition unit 210 and the voice analysis unit 220 of the voice editing device 201. The recorded voice 701 is voice recorded with a microphone or a telephone, and is stored in a recording device (hard disk drive, memory, etc.) installed in a personal computer, server, or the like, in an IC recorder, or on a recording medium such as a CD or cassette tape.

As an example, as shown in FIG. 5, assume that recorded voice data of the utterance 「Ａ社はええと赤字は株式、ば売却益でそうさつ可能であると発表しまった」 (hereinafter, voice data A) is stored in the recording device of a personal computer. As also shown in FIG. 5, voice data A should have been uttered as 「Ａ社は、赤字は株式売却益で相殺可能であると発表しました」 ("Company A announced that the deficit can be offset by gains on the sale of shares"). In other words, voice data A contains the filler 「ええと」 ("um"), the hesitation 「株式、ば売却益」, the misreading 「そうさつ」 (for 「相殺」, correctly read 「そうさい」), and the slip of the tongue 「発表しまった」 (for 「発表しました」).

The voice recognition unit 210 converts the input recorded voice into text using voice recognition technology and outputs the text to the change location determination unit 230. Here, assume that voice data A is converted, exactly as uttered, into the text 「Ａ社はええと赤字は株式、ば売却益でそうさつ可能であると発表しまった」 (hereinafter, text A).

The voice analysis unit 220 analyzes the input recorded voice, extracts acoustic features, and outputs them to the change location determination unit 230 and the synthesis information generation unit 240. Here, assume that the voice analysis unit 220 analyzes voice data A and extracts the overall average speech rate (T_m), the overall average fundamental frequency (P_m), the speech rate of each syllable (T_on), the fundamental frequency of each syllable (P_on), the power of each syllable, and the S/N ratio of each syllable. Assume further that the voice analysis unit 220 outputs the per-syllable fundamental frequency (P_on), the per-syllable power, and the per-syllable S/N ratio to the change location determination unit 230, and outputs the overall average speech rate (T_m), the overall average fundamental frequency (P_m), the per-syllable speech rate (T_on), and the per-syllable fundamental frequency (P_on) to the synthesis information generation unit 240.

The acoustic defect detection unit 231 detects acoustically defective parts from text A and the acoustic features. From the input acoustic features it detects, for example, parts where the voice suddenly becomes loud (where the power of the voice waveform rises and the S/N ratio is low), parts where the voice suddenly cracks (where the fundamental frequency becomes high), and parts where noise such as the sound of a telephone is mixed in (where the S/N ratio becomes large). Here, assume that the part 「Ａ社」 is detected as an acoustically defective part and that the detection result is output to the integration unit 233.
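The detection of sudden loudness and voice cracks described above can be sketched roughly as follows. This is only an illustrative sketch, not the method of the embodiment: the function name `detect_acoustic_defects`, the comparison against the utterance median, and both thresholds are assumptions, and the S/N-based cues are left out.

```python
from statistics import median


def detect_acoustic_defects(power_db, f0_hz, power_jump_db=6.0, f0_ratio=1.5):
    """Flag syllables whose power or F0 stands out from the utterance median.

    power_db and f0_hz hold one value per syllable. The thresholds are
    illustrative assumptions, not values taken from the document.
    """
    p_med = median(power_db)
    f_med = median(f0_hz)
    flagged = {}
    for i, (p, f) in enumerate(zip(power_db, f0_hz)):
        reasons = []
        if p - p_med > power_jump_db:   # voice suddenly becomes loud
            reasons.append("loud")
        if f > f0_ratio * f_med:        # voice cracks (F0 jumps up)
            reasons.append("pitch_break")
        if reasons:
            flagged[i] = reasons
    return flagged


# Hypothetical per-syllable features for a five-syllable utterance.
power = [60.0, 61.0, 59.0, 72.0, 60.0]
f0 = [120.0, 118.0, 240.0, 122.0, 119.0]
result = detect_acoustic_defects(power, f0)
```

The returned indices would then be mapped back to positions in the recorded voice to form the detection result passed to the integration unit.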

The linguistic defect detection unit 232 receives text A as the pre-change text 2323. The text change unit 2321 of the linguistic defect detection unit 232 estimates the erroneous parts of the pre-change text 2323 and generates a text with the correct content (hereinafter, text B) as the post-change text 2324. Text B is the correct content shown in FIG. 5, 「Ａ社は、赤字は株式売却益で相殺可能であると発表しました」. Next, the difference extraction unit 2322 of the linguistic defect detection unit 232 extracts the differences between text A and text B. Here, the parts of text A corresponding to 「ええと」, 「、ば」, 「そうさつ」, and 「しまった」 are extracted as differences. The difference extraction unit 2322 then outputs a detection result for each individual difference to the integration unit 233. Each detection result contains information on the defective part and, when a correct text exists, the correct text.
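The difference extraction between text A and text B can be illustrated with a character-level diff. The sketch below uses Python's standard `difflib.SequenceMatcher`; the function name `extract_differences` is an assumption, and a character-level comparison is only one possible choice. It can produce finer spans than the word-level spans listed above, so a practical implementation would probably compare morphemes or readings instead.

```python
from difflib import SequenceMatcher


def extract_differences(before, after):
    """Return (tag, start, end, before_text, after_text) for each changed span.

    start and end are character offsets into `before`; tag is one of
    "replace", "delete", or "insert" as reported by SequenceMatcher.
    """
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, i1, i2, before[i1:i2], after[j1:j2]))
    return diffs


text_a = "A社はええと赤字は株式、ば売却益でそうさつ可能であると発表しまった"
text_b = "A社は、赤字は株式売却益で相殺可能であると発表しました"
diffs = extract_differences(text_a, text_b)
for d in diffs:
    print(d)
```

Each tuple carries both the position in the pre-change text and the corresponding post-change text, which is the information the integration unit needs.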

For the acoustically defective part 「Ａ社」, the integration unit 233 generates, for example, a pair consisting of changed location phoneme information 261 containing the phoneme information of 「Ａ社」 and changed location position information 262 indicating the position of 「Ａ社」. For the linguistically defective part 「ええと」, the integration unit 233 generates, for example, changed location position information 262 indicating the location of 「ええと」; because no correct text accompanies it, no corresponding changed location phoneme information 261 is generated. Likewise, for the linguistically defective part 「、ば」, it generates changed location position information 262 indicating the location of 「、ば」 and, because no correct text accompanies it, no corresponding changed location phoneme information 261. For the linguistically defective part 「そうさつ」, it generates, for example, a pair consisting of changed location phoneme information 261 indicating the phoneme information of 「そうさい」 and changed location position information 262 indicating the location of 「そうさつ」. For the linguistically defective part 「しまった」, it generates, for example, a pair consisting of changed location phoneme information 261 indicating the phoneme information of 「ました」 and changed location position information 262 indicating the location of 「まった」.
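One way to picture the output of the integration unit is as a list of records, each holding position information and optional phoneme information. The sketch below is an assumption made for illustration: the `ChangeSpec` class, the `integrate` function, and the character offsets are hypothetical, and the document does not prescribe any particular data layout.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ChangeSpec:
    position: Tuple[int, int]   # changed location position information (character offsets, assumed)
    phonemes: Optional[str]     # changed location phoneme information; None means "delete only"


def integrate(detections: List[dict]) -> List[ChangeSpec]:
    """Merge acoustic and linguistic detection results into change specs."""
    return [
        ChangeSpec(position=(d["start"], d["end"]), phonemes=d.get("correct"))
        for d in sorted(detections, key=lambda d: d["start"])
    ]


# Hypothetical detection results for text A.
detections = [
    {"start": 0, "end": 2, "correct": "A社"},         # acoustic defect: re-synthesize as is
    {"start": 3, "end": 6},                           # filler, no correct text
    {"start": 11, "end": 13},                         # hesitation, no correct text
    {"start": 17, "end": 21, "correct": "そうさい"},  # misreading
    {"start": 30, "end": 33, "correct": "ました"},    # slip of the tongue
]
changes = integrate(detections)
to_synthesize = [c.phonemes for c in changes if c.phonemes is not None]
to_delete = [c.position for c in changes if c.phonemes is None]
```

Records without phoneme information correspond to parts that are simply cut out, while the others are passed on for speech synthesis.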

The synthesis information generation unit 240 generates speech synthesis information for the speech synthesis process (hereinafter, synthesis data SD) from the changed location phoneme information 261 and the voice feature information of the recorded voice 701, and outputs it to the speech synthesis unit 301 as speech synthesis information 501. In this example, the synthesis information generation unit 240 generates no speech synthesis information for 「ええと」 and 「、ば」, because they have no corresponding changed location phoneme information 261. For 「Ａ社」, 「そうさつ」, and 「まった」, on the other hand, it generates speech synthesis information, because the corresponding changed location phoneme information 261 exists as 「Ａ社」, 「そうさい」, and 「ました」.

Taking the 「まった」 part as an example, a specific method of generating the speech synthesis information is described below. As a premise, as shown in FIG. 6(a), assume that the average speech rate T_m, the average fundamental frequency P_m, the per-syllable speech rates T_on(1) to T_on(8), and the per-syllable fundamental frequencies P_on(1) to P_on(9) have been extracted as the features of voice data A. The curve in FIG. 6(a) shows the fundamental frequency pattern. In this case, as shown in FIG. 6(b), the speech synthesis information consists of 「ました」 as the syllable string information, T_on(6) to T_on(8) as the per-syllable speech rates, and P_on(1) to P_on(6) as the per-syllable fundamental frequencies.

In the above example, the speech synthesis information holds information only for the part to be changed, 「ました」, but it may also hold information on the vicinity of the part to be changed, for example the whole of 「発表しました」. It may also hold information on the entire sentence.

Also, in the above example, the features T_m, P_m, T_on, and P_on of the recorded voice are used unchanged as the speech synthesis information, but features T'_m, P'_m, T'_on, and P'_on that have been modified or estimated using rules or the like corresponding to the syllable change may be used instead. In particular, when many syllables are to be changed, it is desirable that the features be modified or estimated.

収録音声変更情報生成部２５０は、入力された変更箇所位置情報２６２から収録音声変更情報６０１を生成する。前述したように、収録音声変更情報６０１は、少なくとも変更箇所位置情報を含み、加えて、変更箇所の変更前音節情報、変更箇所の前後に係る音節情報等を含めることが考えられる。 The recorded audio change information generation unit 250 generates the recorded audio change information 601 from the input changed location position information 262. As described above, the recorded voice change information 601 includes at least changed part position information, and may include pre-change syllable information of the changed part, syllable information before and after the changed part, and the like.

The speech synthesis unit 301 generates the synthesized voice for the edited part from the input speech synthesis information. Here, assume that the speech synthesis process is performed using a speech synthesis database built from the same speaker as the recorded voice 701. As a result, for the part corresponding to 「（発表）しまった」, for example, a synthesized voice with the content 「ました」 (hereinafter, synthesized voice SV), in a voice identical or close to that of the speaker of the recorded voice 701, is generated based on the corresponding synthesis data SD.

The recorded voice 701, the synthesized voice, and the recorded voice change information 601 are input to the voice editing unit 401, which joins and edits the recorded voice 701 and the synthesized voice as indicated by the recorded voice change information 601 to generate the edited voice 801. When no synthesized voice corresponds to an item of recorded voice change information 601, the voice editing unit 401 cuts out the voice at the change position indicated by that recorded voice change information 601 from the recorded voice 701. As a result, the filler 「ええと」 and the hesitation 「、ば」 in the recorded voice 701 are removed.

When a synthesized voice corresponds to an item of recorded voice change information 601, the voice editing unit 401 replaces the voice at the change position indicated by that recorded voice change information 601 in the recorded voice 701 with the synthesized voice. FIG. 7 shows the specific editing method for the 「（発表）しまった」 part. The synthesized voice SV generated by the speech synthesis unit 301 replaces the 「まった」 part of 「発表しまった」 in voice data A according to the changed location position information 262 paired with it, so that 「発表し」 from voice data A is joined with 「ました」 from the synthesized voice SV. The 「そうさつ」 part is edited in the same way with the synthesized voice 「そうさい」.
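The cut-and-replace behavior of the voice editing unit can be sketched on a plain list of samples. This is a minimal illustration under stated assumptions: the function name `apply_edits`, the representation of positions as sample indices, and the requirement that edits do not overlap are all hypothetical, and joint smoothing is not included.

```python
def apply_edits(recorded, edits):
    """Apply non-overlapping edits to a list of samples.

    Each edit is (start, end, replacement): recorded[start:end] is removed,
    and, if replacement is not None, the replacement samples are spliced in.
    """
    out = []
    cursor = 0
    for start, end, replacement in sorted(edits, key=lambda e: e[0]):
        out.extend(recorded[cursor:start])   # keep the untouched recording
        if replacement is not None:
            out.extend(replacement)          # splice in the synthesized voice
        cursor = end                         # skip the defective span
    out.extend(recorded[cursor:])
    return out


recorded = list(range(10))                   # stand-in for recorded samples
edits = [
    (2, 4, None),                            # e.g. a filler: cut only
    (6, 8, [100, 101, 102]),                 # e.g. a slip: replace with synthesis
]
edited = apply_edits(recorded, edits)
```

Because the replacement may differ in length from the removed span, the edited voice can be shorter or longer than the original recording.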

When joining the voice data and the synthesized voice, it is desirable to smooth the waveform in order to suppress abnormal sounds caused by waveform discontinuities. Possible smoothing methods include matching the waveform amplitudes at the joint, linearly interpolating the waveform, and superimposing the waveforms of the synthesized voice and the recorded voice and then adjusting the amplitude. In the corrected part, the fundamental frequency pattern naturally differs when the phoneme information differs, but the fundamental frequency pattern can be estimated by using a spline function or the like. Furthermore, when cutting out the voice data corresponding to 「ええと」 or 「、ば」, it is likewise desirable to smooth the waveform after the cut.
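As one illustration of smoothing at a joint, the sketch below overlaps the end of one waveform with the start of the next and mixes them with linearly changing weights (a linear cross-fade). This is an assumed technique chosen for the example, loosely corresponding to the "superimpose and adjust amplitude" option above; the function name `crossfade_join` and the weighting scheme are hypothetical.

```python
def crossfade_join(left, right, overlap):
    """Join two sample sequences, cross-fading over `overlap` samples."""
    if overlap <= 0:
        return list(left) + list(right)
    if overlap > min(len(left), len(right)):
        raise ValueError("overlap is longer than one of the segments")
    head = list(left[:-overlap])             # part of `left` before the joint
    tail = list(right[overlap:])             # part of `right` after the joint
    mixed = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)          # weight of `right` rises linearly
        l_sample = left[len(left) - overlap + k]
        mixed.append((1 - w) * l_sample + w * right[k])
    return head + mixed + tail


joined = crossfade_join([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=3)
```

The output is shorter than the simple concatenation by `overlap` samples, so the positions of later edits would need to be adjusted accordingly.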

以上の例では、音声合成の単位として音節（ＣＶ単位）を用いているが、音素単位、半音素単位、ＣＶＣ単位、ＶＣＶ単位等を用いても構わない。 In the above example, a syllable (CV unit) is used as a unit of speech synthesis. However, a phoneme unit, a semiphoneme unit, a CVC unit, a VCV unit, or the like may be used.

以上の例では、音声合成処理を音声編集処理の前段で行っているが、音声合成処理を音声編集処理と平行して行うことも可能である。 In the above example, the speech synthesis process is performed before the speech editing process. However, the speech synthesis process may be performed in parallel with the speech editing process.

In the above example, the recorded voice and the synthesized voice are joined. However, when the recorded voice change information generation unit 250 determines that the entire sentence is to be changed, a synthesized voice whose entire content is that of the post-change text is output as the edited voice.

以上の例では、収録音声７０１を自動的にテキスト化するために音声認識部２１０を構成に加えているが、音声を聴取しての書き起こし等、収録音声７０１に対して手動でテキスト化を行ってもよい。 In the above example, the voice recognition unit 210 is added to the configuration in order to automatically convert the recorded voice 701 into text. However, the recorded voice 701 is manually converted into text, such as a transcript by listening to the voice. You may go.

以上の例では、収録音声７０１の言語的不具合箇所を検出するための基準テキスト（変更後テキスト）を自動的に生成するためにテキスト変更部２３２１を構成に加えているが、原稿や台本等、予め用意された正解テキストを基準テキスト（変更後テキスト）として用いてもよい。 In the above example, the text change unit 2321 is added to the configuration in order to automatically generate the reference text (text after change) for detecting the linguistically defective part of the recorded sound 701. A correct text prepared in advance may be used as the reference text (changed text).

このように本実施形態によれば、収録音声中の不具合な箇所を修正するために再度音声を収録する必要がなく、また、収録音声中の不具合箇所の検出、合成音声の生成、編集がすべて自動化されているため、収録音声の修正に要する利用者の負担を大幅に軽減することができる。 As described above, according to the present embodiment, it is not necessary to record the sound again in order to correct the troubled part in the recorded sound, and all of the detection of the troubled part in the recorded sound, the generation of the synthesized sound, and the editing are all performed. Since it is automated, the burden on the user for correcting the recorded audio can be greatly reduced.

[Third embodiment]
Referring to FIG. 8, the voice editing device 102 according to the third embodiment of the present invention differs from the voice editing device 101 according to the second embodiment shown in FIG. 3 in that it further includes an output unit 901 and an input unit 902, and in that it includes a change location determination unit 230A in place of the change location determination unit 230.

出力部９０１は、液晶ディスプレイ等で構成され、音声編集装置１０２から利用者に対してユーザインターフェース画面を提示する機能を有する。入力部９０２は、キーボードやマウス等で構成され、利用者から音声編集装置１０２に対して編集情報や指示を入力する機能を有する。 The output unit 901 is configured by a liquid crystal display or the like, and has a function of presenting a user interface screen from the voice editing device 102 to the user. The input unit 902 includes a keyboard, a mouse, and the like, and has a function of inputting editing information and instructions from the user to the voice editing device 102.

変更箇所決定部２３０Ａは、収録音声７０１中の音響的な不具合箇所を検出する音響的不具合検出部２３１Ａと、収録音声７０１中の言語的な不具合箇所を検出する言語的不具合検出部２３２Ａと、対話処理部２３４とから構成される。 The change location determination unit 230A has an acoustic failure detection unit 231A that detects an acoustic failure location in the recorded audio 701, a linguistic failure detection unit 232A that detects a language failure location in the recorded audio 701, and a dialogue And a processing unit 234.

音響的不具合検出部２３１Ａは、音響的不具合検出部２３１と同様に、音声分析部２２０から与えられる収録音声７０１の音響的な特徴量に基づいて、収録音声７０１中の音響的な不具合箇所を検出する機能を有する。音響的不具合検出部２３１Ａは、１以上の音響的な不具合箇所を検出した場合、それぞれの不具合箇所ごとに、その不具合箇所の位置情報を対話処理部２３４へ出力する。 Similarly to the acoustic defect detection unit 231, the acoustic defect detection unit 231A detects an acoustic defect part in the recorded sound 701 based on the acoustic feature amount of the recorded sound 701 given from the sound analysis unit 220. It has the function to do. When the acoustic defect detection unit 231A detects one or more acoustic defect points, the acoustic defect detection unit 231A outputs position information of the defect points to the dialogue processing unit 234 for each defect point.

The linguistic defect detection unit 232A has a function of detecting linguistically defective parts in the recorded voice 701 and estimating their correct text, based on the text of the recorded voice 701 supplied by the voice recognition unit 210. For example, it detects, from the pre-change text 2323, grammatically incorrect parts such as misreadings and slips of the tongue and grammatically undesirable parts such as fillers and hesitations, and outputs them as change location candidates. At the same time, for each change location candidate, it estimates and outputs change text candidates for changing the part to grammatically correct or grammatically preferable content. Fillers, for example, are extracted by matching the pre-change text 2323 against a dictionary of filler candidates, and the unit generates the location of each filler together with a change candidate indicating, for example, that the filler is to be deleted. For hesitations, misreadings, and slips of the tongue, the unit estimates the erroneous part and the correct text using a method that estimates the correct answer from the similarity of phoneme sequences and the context surrounding the word.
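The dictionary matching for fillers can be sketched as a plain substring search. The filler list, the function name `find_fillers`, and the use of character offsets are assumptions for illustration; naive substring matching can also hit ordinary words that happen to contain a filler string, so a practical system would likely match on morpheme boundaries.

```python
FILLERS = ("ええと", "えーと", "あのー")   # hypothetical filler dictionary


def find_fillers(text, fillers=FILLERS):
    """Return sorted (start, end, filler) tuples for every dictionary hit."""
    hits = []
    for filler in fillers:
        start = text.find(filler)
        while start != -1:
            hits.append((start, start + len(filler), filler))
            start = text.find(filler, start + len(filler))
    return sorted(hits)


text_a = "A社はええと赤字は株式、ば売却益でそうさつ可能であると発表しまった"
hits = find_fillers(text_a)
```

Each hit would become a change location candidate whose change candidate is "delete".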

第２の実施形態で例に挙げたテキストＡの場合、言語的不具合検出部２３２Ａは、例えば図９に示すような検出結果を対話処理部２３４へ出力する。図９の例では、例えば変更箇所候補「そうさつ」に対して、「相殺」、「総裁」、「惣菜」、「（変更しない）」の４つの変更テキスト候補が推定されている。 In the case of the text A exemplified in the second embodiment, the linguistic defect detection unit 232A outputs a detection result as illustrated in FIG. 9 to the dialogue processing unit 234, for example. In the example of FIG. 9, for example, four change text candidates of “offset”, “governor”, “prepared food”, and “(do not change)” are estimated with respect to the change location candidate “Sosatsu”.

対話処理部２３４は、音声認識部２１０から与えられる収録音声７０１のテキスト、音声分析部２２０から与えられる音声特徴量、音響的不具合検出部２３１Ａの検出結果、および言語的不具合検出部２３２Ａの検出結果から、ユーザインターフェース画面を生成して出力部９０１を通じて利用者に提示する機能と、入力部９０２を通じて利用者から入力される指示に応じて、不具合箇所の変更、修正に用いる合成音声の変更などを行う機能とを備えている。そして、対話処理部２３４は、利用者との対話処理により最終的に決定した変更箇所音韻情報２６１および変更箇所位置情報２６２を合成情報生成部２４０および収録音声変更情報生成部２５０へ出力する。 The dialogue processing unit 234 includes the text of the recorded voice 701 given from the voice recognition unit 210, the voice feature amount given from the voice analysis unit 220, the detection result of the acoustic defect detection unit 231A, and the detection result of the linguistic defect detection unit 232A. Then, in accordance with a function to generate a user interface screen and present it to the user through the output unit 901 and an instruction input from the user through the input unit 902, a change of a defective part, a change of synthesized speech used for correction, etc. With the ability to do. Then, the dialogue processing unit 234 outputs the changed location phoneme information 261 and the changed location position information 262 finally determined by the dialogue processing with the user to the synthesis information generation unit 240 and the recorded audio change information generation unit 250.

FIG. 10 shows an example of the user interface screen generated by the dialogue processing unit 234. The screen presents to the user the pre-change text, the editing information candidates, the reading of the pre-change text, the accent phrase boundary positions, the accent positions, the parts where synthesized voice is used, the average speech rate, and the average fundamental frequency, and lets the user change these items. The screen also has a 「読みつけ」 (assign reading) button that converts text into its reading using a technique such as morphological analysis, so that changes to the text can be reflected in the reading. In the "Recorded/Synthesized" item, white bands indicate that the recorded voice is used and black bands indicate that the synthesized voice is used; moving the boundary between the white and black bands changes the range to be replaced with synthesized voice. FIG. 10 shows an example in which 「そうさつ」 is changed to 「相殺」 and 「（発表し）まった」 is changed to 「（発表し）ました」.

図１０の例では、「読みつけ」ボタンでテキストを読みに変換するようにしているが、テキストが変換された際に、自動的に読みを更新するようにしても構わない。さらに、より詳細に音声を編集するために、母音の無声化、各音節の話速、複数の制御点を持つ基本周波数パターン、音声のパワー等を編集可能とすることも考えられる。この場合は、夫々の情報をグラフィカルユーザインターフェース（ＧＵＩ）で可視化することが望ましい。 In the example of FIG. 10, the text is converted to reading by the “reading” button, but the reading may be automatically updated when the text is converted. Furthermore, in order to edit the voice in more detail, it may be possible to edit the vowel devoicing, the speech speed of each syllable, the basic frequency pattern having a plurality of control points, the power of the voice, and the like. In this case, it is desirable to visualize each information with a graphical user interface (GUI).

本発明は、例えば、テレビ番組やラジオ番組の制作システム、ホームビデオの編集システム、留守番電話システム等、音声を編集する装置やシステム全般に適用することができる。 The present invention can be applied to all devices and systems for editing audio, such as a television program and radio program production system, a home video editing system, and an answering machine system.

Description of Symbols
100 Voice editing device
101 Voice editing device
102 Voice editing device
200 Editing information generation unit
201 Voice editing device
201 Editing information generation unit
210 Voice recognition unit
220 Voice analysis unit
230 Change location determination unit
230A Change location determination unit
231 Acoustic defect detection unit
231A Acoustic defect detection unit
232 Linguistic defect detection unit
232A Linguistic defect detection unit
233 Integration unit
234 Dialogue processing unit
240 Synthesis information generation unit
250 Recorded voice change information generation unit
261 Changed location phoneme information
262 Changed location position information
300 Speech synthesis unit
301 Speech synthesis unit
400 Voice editing unit
401 Voice editing unit
500 Speech synthesis information
501 Speech synthesis information
600 Recorded voice change information
601 Recorded voice change information
700 Recorded voice
701 Recorded voice
800 Voice
801 Voice
901 Output unit
902 Input unit
2321 Text change unit
2322 Difference extraction unit
2323 Pre-change text
2324 Post-change text

Claims

1. A voice editing device comprising:
editing information generating means for detecting a defective part of a recorded voice and generating speech synthesis information necessary for generating a synthesized voice used to correct the defective part, and recorded voice change information including position information of the defective part;
speech synthesis means for generating the synthesized voice based on the speech synthesis information; and
voice editing means for correcting the defective part of the recorded voice with the synthesized voice based on the recorded voice change information.

2. The voice editing device according to claim 1, wherein the editing information generating means detects an acoustically defective part from voice features obtained by analyzing the recorded voice, and generates the speech synthesis information necessary for generating, as the synthesized voice used to correct the acoustically defective part, a synthesized voice having the same phonemes as the detected defective part.

3. The voice editing device according to claim 1 or 2, wherein the editing information generating means detects a linguistically defective part and estimates its correct text based on text obtained by converting the recorded voice into text, and generates the speech synthesis information necessary for generating, as the synthesized voice used to correct the linguistically defective part, a synthesized voice having the same phonemes as the estimated correct text.

4. The voice editing device according to claim 1, wherein the editing information generating means compares text obtained by converting the recorded voice into text with a reference text to detect a linguistically defective part and estimate its correct text, and generates the speech synthesis information necessary for generating, as the synthesized voice used to correct the linguistically defective part, a synthesized voice having the same phonemes as the estimated correct text.

5. The voice editing device according to claim 1, wherein the editing information generating means displays the detected defective part and the speech synthesis information for generating the synthesized voice used to correct it, in association with the text of the recorded voice, and changes the speech synthesis information and the position information of the defective part in accordance with a user instruction input from an input device.

6. The speech editing apparatus according to claim 1, wherein the speech synthesizing unit generates the synthesized speech using a speech synthesis database of the same speaker as the recorded speech.

7. The voice editing device wherein the voice editing means smoothly joins the recorded voice and the synthesized voice based on voice features obtained by analyzing the recorded voice.

8. A voice editing method comprising:
detecting a defective part of a recorded voice and generating speech synthesis information necessary for generating a synthesized voice used to correct the defective part, and recorded voice change information including position information of the defective part;
generating the synthesized voice based on the speech synthesis information; and
correcting the defective part of the recorded voice with the synthesized voice based on the recorded voice change information.

9. A program for causing a computer to function as:
editing information generating means for detecting a defective part of a recorded voice and generating speech synthesis information necessary for generating a synthesized voice used to correct the defective part, and recorded voice change information including position information of the defective part;
speech synthesis means for generating the synthesized voice based on the speech synthesis information; and
voice editing means for correcting the defective part of the recorded voice with the synthesized voice based on the recorded voice change information.