JP3432443B2

JP3432443B2 - Audio speed conversion device, audio speed conversion method, and recording medium storing program for executing audio speed conversion method

Info

Publication number: JP3432443B2
Application number: JP04392099A
Authority: JP
Inventors: 伸哉植垣; 宏之西; 亮造布川; 弘行松井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-02-22
Filing date: 1999-02-22
Publication date: 2003-08-04
Anticipated expiration: 2019-02-22
Also published as: JP2000242300A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to an apparatus that changes the speed of a person's speech in audio to produce speech that is faster or slower than the original.

[0002]

[Prior Art] Many means have been proposed for changing the playback speed of digitally stored speech. Formerly, changing the playback speed also changed the pitch of the speech, but methods that convert the playback speed without changing the pitch have recently been invented. As a result, fast-listening functions, which play speech faster than the original so that it can be heard efficiently, and slow-listening functions, which play speech slower than the original for elderly or hearing-impaired listeners, have come into common use in digital devices. These fast-listening and slow-listening functions have been realized by thinning out or repeating the waveform at a uniform rate over the entire speech.

[0003]

[Problem to Be Solved by the Invention] However, conventional speech speed conversion methods thin out or repeat the waveform at a uniform rate over the entire speech without considering the speech rate of the original. Consequently, when the original speech is fast, the converted speech is faster than intended, and when the original speech is slow, the converted speech is slower than intended. This has prevented users from playing speech back at their preferred speed.

[0004] The present invention solves this problem. Its object is to provide an audio speed conversion device that sets the rate at which the speech waveform is thinned out or repeated by using the voiced/silent discrimination result for the original speech, the lengths of the silent sections, and the speech rate.

[0005]

[Means for Solving the Problem] The audio speed conversion device according to the invention of claim 1 comprises: speech input means for capturing speech; speech rate calculation means for calculating the speech rate of the speech captured by the speech input means; target speed setting means for setting a target playback speed; voiced/silent discrimination means for discriminating between voiced sections and silent sections of the speech; voiced section expansion/contraction ratio setting means for setting a voiced section expansion/contraction ratio, which is the rate at which the speech waveform of a voiced section discriminated by the voiced/silent discrimination means is thinned out or repeated; silent section expansion/contraction ratio setting means for setting a silent section expansion/contraction ratio, which is the rate at which the speech waveform of a silent section discriminated by the voiced/silent discrimination means is thinned out or repeated; waveform generation means for generating an expanded/contracted waveform of the voiced section by thinning out or repeating the speech waveform of the voiced section based on the voiced section expansion/contraction ratio, and for generating an expanded/contracted waveform of the silent section by thinning out or repeating the speech waveform of the silent section based on the silent section expansion/contraction ratio; a first control unit that compares the target playback speed set by the target speed setting means with the speech rate calculated by the speech rate calculation means and causes the silent section expansion/contraction ratio setting means to variably set the silent section expansion/contraction ratio so that the playback speed of the expanded/contracted waveform of the silent section is approximately equal to the target playback speed; a second control unit that causes the speech rate calculation means to calculate a per-voiced-section speech rate for each voiced section discriminated by the voiced/silent discrimination means; a third control unit that compares the target playback speed set by the target speed setting means with the per-voiced-section speech rate calculated by the speech rate calculation means and causes the voiced section expansion/contraction ratio setting means to variably set the voiced section expansion/contraction ratio for each voiced section so that the playback speed of the expanded/contracted waveform of the voiced section is approximately equal to the target playback speed; and speech output means that receives the expanded/contracted waveforms of the voiced and silent sections generated by the waveform generation means and outputs expanded/contracted speech. The purpose of the device is to expand and contract speech.

[0006] The audio speed conversion device according to the invention of claim 2 is the device of claim 1 in which the third control unit further performs control that causes the silent section expansion/contraction ratio setting means to variably set the silent section expansion/contraction ratio for each silent section so that the playback time of the expanded/contracted waveform of the silent section is a fixed time. Its purpose is to expand and contract speech.

[0007] The audio speed conversion method according to the invention of claim 3 comprises: capturing speech; calculating the speech rate of the speech; discriminating between voiced sections and silent sections of the speech; setting a target playback speed for the speech; comparing the target playback speed with the speech rate and variably setting a silent section expansion/contraction ratio, which is the rate at which the speech waveform of the silent section is thinned out or repeated, so that the playback speed of the expanded/contracted waveform of the silent section is approximately equal to the target playback speed; calculating a per-voiced-section speech rate for each discriminated voiced section; comparing the target playback speed with the per-voiced-section speech rate and variably setting, for each voiced section, a voiced section expansion/contraction ratio, which is the rate at which the speech waveform of that voiced section is thinned out or repeated, so that the playback speed of the expanded/contracted waveform of the voiced section is approximately equal to the target playback speed; generating an expanded/contracted waveform of the voiced section by thinning out or repeating the speech waveform of the voiced section based on the voiced section expansion/contraction ratio; generating an expanded/contracted waveform of the silent section by thinning out or repeating the speech waveform of the silent section based on the silent section expansion/contraction ratio; and outputting expanded/contracted speech from the expanded/contracted waveforms of the voiced and silent sections. Its purpose is to expand and contract speech.

[0008] The audio speed conversion method according to the invention of claim 4 is the method of claim 3 in which, further, the silent section expansion/contraction ratio is variably set in the silent section expansion/contraction ratio setting means for each silent section so that the playback time of the expanded/contracted waveform of the silent section is a fixed time. Its purpose is to expand and contract speech.

[0009] The recording medium according to the invention of claim 5 stores a program for executing an audio speed conversion method that comprises: capturing speech; calculating the speech rate of the speech; discriminating between voiced sections and silent sections of the speech; setting a target playback speed for the speech; comparing the target playback speed with the speech rate and variably setting a silent section expansion/contraction ratio, which is the rate at which the speech waveform of the silent section is thinned out or repeated, so that the playback speed of the expanded/contracted waveform of the silent section is approximately equal to the target playback speed; calculating a per-voiced-section speech rate for each discriminated voiced section; comparing the target playback speed with the per-voiced-section speech rate and variably setting, for each voiced section, a voiced section expansion/contraction ratio, which is the rate at which the speech waveform of that voiced section is thinned out or repeated, so that the playback speed of the expanded/contracted waveform of the voiced section is approximately equal to the target playback speed; generating an expanded/contracted waveform of the voiced section by thinning out or repeating the speech waveform of the voiced section based on the voiced section expansion/contraction ratio; generating an expanded/contracted waveform of the silent section by thinning out or repeating the speech waveform of the silent section based on the silent section expansion/contraction ratio; and outputting expanded/contracted speech from the expanded/contracted waveforms of the voiced and silent sections. Its purpose is to expand and contract speech.

[0010] The recording medium according to the invention of claim 6, which stores a program for executing an audio speed conversion method, is the recording medium of claim 5 in which, further, the silent section expansion/contraction ratio is variably set in the silent section expansion/contraction ratio setting means for each silent section so that the playback time of the expanded/contracted waveform of the silent section is a fixed time. Its purpose is to expand and contract speech.

[0011]

[0012] As described above, the present invention calculates the speech rate of the speech and variably sets the rates at which the speech is thinned out or repeated according to that speech rate. Unlike conventional methods, this makes it possible to produce speech at the listener's preferred speed regardless of the speech rate of the original.

[0013]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an audio speed conversion device according to a first embodiment of the present invention. The device comprises a speech input unit 101 that inputs speech, a speech rate calculation unit 102 that calculates the speech rate of the input speech, a speech expansion/contraction ratio setting unit 103 that sets a speech expansion/contraction ratio, a waveform generation unit 104 that thins out or repeats the waveform of the input speech based on the speech expansion/contraction ratio, and a speech output unit 105 that outputs the waveform generated by the waveform generation unit 104 as speech.

[0014] The connections in the above configuration are as follows. The output of the speech input unit 101 is input to the speech rate calculation unit 102 and the waveform generation unit 104. The output of the speech rate calculation unit 102 is input to the speech expansion/contraction ratio setting unit 103, whose output is input to the waveform generation unit 104. The output of the waveform generation unit 104 is input to the speech output unit 105.
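
The data flow among units 101 to 105 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `convert_speed` and its three callable parameters are hypothetical stand-ins for the speech rate calculation unit 102, the ratio setting unit 103, and the waveform generation unit 104.

```python
from typing import Callable, List


def convert_speed(
    samples: List[float],
    calc_rate: Callable[[List[float]], float],               # stands in for unit 102
    set_ratio: Callable[[float], float],                      # stands in for unit 103
    generate: Callable[[List[float], float], List[float]],   # stands in for unit 104
) -> List[float]:
    """Wire the units together in the order described for FIG. 1 (hypothetical helper)."""
    rate = calc_rate(samples)        # input 101 feeds the rate calculation 102
    ratio = set_ratio(rate)          # 102 feeds the ratio setting 103
    return generate(samples, ratio)  # 101 and 103 feed 104; the result goes to output 105
```

Passing the stages in as callables keeps the sketch independent of any particular rate estimator or waveform processing method.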

[0015] Next, the operation of this embodiment is described with reference to the flowchart in FIG. 6. In the following text, S100 and similar labels denote steps in the flowchart. First, speech is input to the speech input unit 101 (S100). Next, the input speech waveform is sent to the speech rate calculation unit 102, which calculates the speech rate (S101). One conceivable way to calculate the speech rate is the method using a dynamic measure disclosed in Japanese Unexamined Patent Publication No. 5-289691, "Speech Rate Measuring Device."

[0016] Next, the calculated speech rate is passed to the speech expansion/contraction ratio setting unit 103, which sets a speech expansion/contraction ratio according to that speech rate (S102). As one example of this setting, the speech rate is divided into three classes: fast, intermediate, and slow. If the calculated speech rate falls in the fast class, the ratio is set so that the speech waveform is repeated; if it falls in the slow class, the ratio is set so that the waveform is thinned out; and if it falls in the intermediate class, the ratio is set so that no waveform processing is performed.
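
The three-class setting in S102 can be sketched as below. The thresholds, the unit (morae per second), and the ratio values are all hypothetical, since the text gives no numbers. The sketch also assumes a convention in which the ratio is output duration divided by input duration, so a value above 1 means the waveform is repeated and a value below 1 means it is thinned out.

```python
from enum import Enum


class RateClass(Enum):
    FAST = "fast"
    MEDIUM = "medium"
    SLOW = "slow"


# Hypothetical class boundaries in morae per second (not from the patent).
SLOW_BELOW = 6.0
FAST_ABOVE = 9.0

# Hypothetical ratios: output duration / input duration.
RATIO_BY_CLASS = {
    RateClass.FAST: 1.25,   # repeat the waveform, which slows fast speech down
    RateClass.MEDIUM: 1.0,  # no waveform processing
    RateClass.SLOW: 0.8,    # thin out the waveform, which speeds slow speech up
}


def classify_rate(speech_rate: float) -> RateClass:
    """Place a speech rate into one of the three classes."""
    if speech_rate > FAST_ABOVE:
        return RateClass.FAST
    if speech_rate < SLOW_BELOW:
        return RateClass.SLOW
    return RateClass.MEDIUM


def set_stretch_ratio(speech_rate: float) -> float:
    """Return the expansion/contraction ratio for the class of this speech rate."""
    return RATIO_BY_CLASS[classify_rate(speech_rate)]
```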

[0017] Next, the set speech expansion/contraction ratio is sent to the waveform generation unit 104, which thins out or repeats the speech waveform based on this ratio, thereby expanding or contracting the speech waveform (S103). One conceivable approach to this waveform processing is the PICOLA method, which processes the waveform in pitch units, so that the thinning is performed without changing the speaker's pitch. Finally, the expanded/contracted speech waveform is sent to the speech output unit 105, where it is converted to sound and output (S104).
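
To make the idea of thinning out and repeating concrete, the sketch below drops or duplicates fixed-length frames. It is deliberately not PICOLA: PICOLA works on pitch periods and cross-fades adjacent periods, whereas this simplification uses a fixed frame length and no cross-fade, so it would produce audible discontinuities on real speech. The function name, the `frame_len` parameter, and the ratio convention (output duration divided by input duration) are assumptions made for illustration.

```python
from typing import List


def stretch_frames(samples: List[float], ratio: float, frame_len: int = 4) -> List[float]:
    """Repeat frames when ratio > 1 and drop frames when ratio < 1 (naive, no cross-fade)."""
    if ratio <= 0 or frame_len <= 0:
        raise ValueError("ratio and frame_len must be positive")
    # Split the waveform into consecutive frames.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    n_out = round(len(frames) * ratio)
    out: List[float] = []
    for k in range(n_out):
        # Map each output frame back to the input frame it is copied from.
        src = min(int(k / ratio), len(frames) - 1)
        out.extend(frames[src])
    return out
```

With a ratio of 2.0 every frame is emitted twice, and with a ratio of 0.5 every second frame is skipped.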

[0018] FIG. 2 is a block diagram of an audio speed conversion device according to a second embodiment of the present invention. In the following description, identical components are given identical reference numerals and their description is omitted. In addition to the components of the first embodiment, the device of the second embodiment comprises a target speed setting unit 117 that sets a target speech playback speed and a control unit 116 that receives the output of the target speed setting unit 117 and the output of the speech rate calculation unit 102. The output of the control unit 116 is input to the speech expansion/contraction ratio setting unit 103. The control unit 116 compares the speech rate calculated by the speech rate calculation unit 102 with the target playback speed set by the target speed setting unit 117 and performs control that variably sets the speech expansion/contraction ratio in the speech expansion/contraction ratio setting unit 103 so that the playback speed is approximately equal to the target playback speed.

[0019] FIG. 7 is a flowchart showing the operation of this device. The input of speech by the speech input unit 101 (S100) and the calculation of the speech rate by the speech rate calculation unit 102 (S101) are the same as in the first embodiment. In parallel with S100 and S101, the target playback speed is set in the target speed setting unit 117 (S114).

[0020] Next, the control unit 116 compares the speech rate calculated by the speech rate calculation unit 102 with the target playback speed set by the target speed setting unit 117 and sets the speech expansion/contraction ratio in the speech expansion/contraction ratio setting unit 103 so that the playback speed is approximately equal to the target playback speed (S112). That is, the control unit 116 compares the speech rate with the target playback speed; when the speech rate is fast, it sets a ratio that causes the speech waveform to be repeated, and when the speech rate is slow, it sets a ratio that causes the waveform to be thinned out. The steps of expanding or contracting the speech based on this ratio (S103) and outputting the speech (S104) are the same as in the first embodiment.
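
One way to realize the comparison in S112 is sketched below. The text does not give a formula, so this is an assumption: if speech containing N units lasts T seconds, its rate is N/T, and stretching it to r·T seconds gives a rate of N/(r·T). Setting that equal to the target rate yields r = speech rate / target rate. The function name and the ratio convention (output duration divided by input duration) are hypothetical.

```python
def stretch_ratio(speech_rate: float, target_rate: float) -> float:
    """Return output/input duration ratio that brings speech_rate to target_rate.

    A result above 1 means the waveform is repeated (speech is faster than the
    target); a result below 1 means it is thinned out (speech is slower).
    """
    if speech_rate <= 0 or target_rate <= 0:
        raise ValueError("rates must be positive")
    return speech_rate / target_rate
```

For example, speech at 10 units per second with a target of 8 gives a ratio of 1.25, so the waveform is lengthened by a quarter.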

[0021] FIG. 3 is a block diagram of a third embodiment of the present invention. In addition to the configuration of the second embodiment, the third embodiment is provided with a voiced/silent discrimination unit 128 that discriminates between voiced and silent portions of the speech, a voiced section expansion/contraction ratio setting unit 129 that sets a voiced section expansion/contraction ratio, and a silent section expansion/contraction ratio setting unit 130 that sets a silent section expansion/contraction ratio. The voiced/silent discrimination unit 128 discriminates between voiced and silent portions of the speech and divides the speech waveform into voiced sections and silent sections. The voiced section expansion/contraction ratio is the rate at which the speech waveform of a voiced section is thinned out or repeated, and the silent section expansion/contraction ratio is the rate at which the waveform of a silent section is thinned out or repeated.

[0022] The connections of the voiced/silent discrimination unit 128, the voiced section expansion/contraction ratio setting unit 129, and the silent section expansion/contraction ratio setting unit 130 are as follows. The output of the speech input unit 101 is input to the voiced/silent discrimination unit 128, whose output is input to the control unit 126. The output of the control unit 126 is input to the voiced section expansion/contraction ratio setting unit 129 and the silent section expansion/contraction ratio setting unit 130, and the outputs of both ratio setting units are input to the waveform generation unit 124.

[0023] The control unit 126 performs control that variably sets the speech expansion/contraction ratio separately for the voiced sections and the silent sections discriminated by the voiced/silent discrimination unit 128. That is, the control unit 126 separately performs control that variably sets the voiced section expansion/contraction ratio in the voiced section expansion/contraction ratio setting unit 129 and control that variably sets the silent section expansion/contraction ratio in the silent section expansion/contraction ratio setting unit 130. The waveform generation unit 124 then thins out or repeats the speech waveform input from the speech input unit 101 using the voiced section and silent section expansion/contraction ratios that have been set.

[0024] FIG. 8 is a flowchart showing the operation of this device. The steps of inputting speech (S100), calculating the speech rate (S101), and setting the target playback speed (S114) are the same as in the second embodiment. In parallel with the calculation of the speech rate (S101), the voiced/silent discrimination unit 128 discriminates between voiced and silent portions of the speech and divides the input speech waveform into voiced sections and silent sections (S125).
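
The text does not say how the voiced/silent discrimination in S125 is performed. As one possible illustration, the sketch below labels fixed-length frames by mean energy and merges adjacent frames with the same label into sections. The energy criterion, the threshold, the frame length, and the function name are all assumptions, not details taken from the patent.

```python
from typing import List, Tuple

# (label, start sample index, end sample index, exclusive)
Segment = Tuple[str, int, int]


def segment_voiced_silent(
    samples: List[float], frame_len: int = 4, threshold: float = 0.01
) -> List[Segment]:
    """Divide a waveform into voiced and silent sections by frame energy (assumed method)."""
    segments: List[Segment] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)  # mean squared amplitude
        label = "voiced" if energy >= threshold else "silent"
        end = start + len(frame)
        if segments and segments[-1][0] == label:
            # Extend the current section instead of starting a new one.
            segments[-1] = (label, segments[-1][1], end)
        else:
            segments.append((label, start, end))
    return segments
```

A practical discriminator would typically add smoothing so that very short pauses inside a word are not split off as separate silent sections.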

[0025] The control unit 126 calculates the voiced section expansion/contraction ratio from the speech waveform classified as voiced sections and sets this ratio in the voiced section expansion/contraction ratio setting unit 129 (S126). It also calculates the silent section expansion/contraction ratio from the speech waveform classified as silent sections and sets this ratio in the silent section expansion/contraction ratio setting unit 130 (S127).

[0026] Both the voiced section expansion/contraction ratio and the silent section expansion/contraction ratio are set so that the waveform is repeated when the speech rate is slower than the target playback speed and thinned out when the speech rate is faster than the target playback speed. Comparing the two ratios, however, the silent section expansion/contraction ratio repeats the waveform to a lesser degree and thins it out to a greater degree, so the silent sections can be shortened more substantially.

[0027] The waveform generation unit 124 expands or contracts the speech waveform of the voiced sections based on the voiced section expansion/contraction ratio and expands or contracts the speech waveform of the silent sections based on the silent section expansion/contraction ratio (S123). The expansion/contraction method is the same as in S103 of the second embodiment. Step S104, in which the expanded/contracted speech is output from the speech output unit 105, is also the same as S104 in the second embodiment.

[0028] FIG. 4 is a block diagram of a fourth embodiment of the present invention. At the block diagram level, the fourth embodiment has the same configuration as the third embodiment, but the control performed inside the control unit 136 differs. The difference is that the silent section expansion/contraction ratio is calculated so that each silent section has a fixed duration. That is, where the silence is divided into multiple sections, an individual silent section expansion/contraction ratio is calculated for each silent section.

[0029] FIG. 9 is a flowchart showing the operation of the fourth embodiment. In S137, the setting described above is made so that each silent section has a fixed duration. That is, the control unit 136 calculates a silent section expansion/contraction ratio that gives the silent section a fixed duration, and the calculated ratio is set in the silent section expansion/contraction ratio setting unit 130. In more detail, for each section classified as silent in the voiced/silent determination (S125), the control unit 136 calculates a silent section expansion/contraction ratio for that section so that it has the fixed duration, and these ratios are set in the silent section expansion/contraction ratio setting unit 130. The other operations are the same as in the third embodiment.
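
The per-section calculation in S137 can be sketched as follows, again assuming that the ratio means output duration divided by input duration. The function name and the choice of seconds as the unit are hypothetical; the fixed duration itself is a parameter because the text does not state its value.

```python
def silent_ratio(section_duration: float, fixed_duration: float) -> float:
    """Return the ratio that maps one silent section onto the fixed duration."""
    if section_duration <= 0 or fixed_duration <= 0:
        raise ValueError("durations must be positive")
    # Long pauses get a ratio below 1 (thinned out); short pauses get a ratio above 1.
    return fixed_duration / section_duration
```

With a fixed duration of 0.2 s, a 0.8 s pause gets a ratio of 0.25 and a 0.1 s pause gets a ratio of 2.0, so both come out at 0.2 s.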

[0030] FIG. 5 is a block diagram showing a fifth embodiment of the present invention. At the block diagram level, the fifth embodiment has the same configuration as the fourth embodiment, but the control performed inside the control unit 146 differs. The difference is that, where the voiced portion is divided into multiple sections, an individual voiced section expansion/contraction ratio is calculated for each voiced section.

[0031] FIG. 10 is a flowchart showing the operation of the fifth embodiment. The voiced/silent discrimination unit 128 performs the voiced/silent determination (S125), and the speech rate calculation unit 102 calculates the speech rate for each section classified as voiced by that determination (S141). The control unit 146 then calculates a voiced section expansion/contraction ratio for each voiced section according to the speech rate of that section, and these ratios are set in the voiced section expansion/contraction ratio setting unit 129.

[0032] In addition, for each section classified as silent in the voiced/silent determination (S125), the control unit 146 calculates a silent section expansion/contraction ratio for that section so that it has a fixed duration, and these ratios are set in the silent section expansion/contraction ratio setting unit 130.

[0033] The waveform generation unit 124 then expands or contracts the speech waveform of each voiced section and each silent section at the expansion/contraction ratio set for that section. The other operations are the same as in the fourth embodiment.

[0034] FIG. 11 shows an example of speech waveform conversion according to the fifth embodiment. In voiced sections where the speech rate is fast, the waveform is shortened only slightly, whereas in voiced sections where the speech rate is slow, it is shortened to a greater degree. Every silent section is converted to the same fixed duration.
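
The behaviour shown in FIG. 11 can be mimicked numerically with the short Python sketch below. All values are invented for illustration, and the names `output_durations`, `target_rate`, and `silent_duration_s`, as well as the rule "output duration = input duration x speech rate / target rate", are assumptions rather than anything stated in the figure.

```python
def output_durations(sections, target_rate, silent_duration_s):
    """Return the converted duration (seconds) of each section.

    sections: list of (kind, duration_s, speech_rate) tuples, where kind is
              "voiced" or "silent" and speech_rate is None for silent sections.
    """
    result = []
    for kind, duration_s, speech_rate in sections:
        if kind == "voiced":
            # Slow speech (low rate) is shortened more than fast speech.
            result.append(duration_s * speech_rate / target_rate)
        else:
            # Every silent section is replaced by the same fixed duration.
            result.append(silent_duration_s)
    return result


sections = [
    ("voiced", 2.0, 9.0),    # fast speaker: shortened to 75 % of its length
    ("silent", 0.8, None),
    ("voiced", 3.0, 6.0),    # slow speaker: shortened to 50 % of its length
    ("silent", 0.3, None),
]
converted = output_durations(sections, target_rate=12.0, silent_duration_s=0.25)
```

In this made-up case both voiced sections contain the same amount of speech, so after conversion they take the same time, and both pauses have the same length.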

【００３５】[0035]

[Effects of the Invention] As described above, the present invention calculates the speech rate of the speech and variably sets the rate at which the speech is thinned out or repeated so that the playback speed becomes constant. Unlike conventional methods, this makes it possible to produce speech at a constant speed regardless of the speech rate of the original speech. The present invention is effective for applications such as listening to answering-machine messages and voice mail, preparing minutes and translating speech, assisting English conversation lessons, listening to news and book readings, and listening to stored audio accompanied by video.

[0036] Further, the present invention calculates the speech rate of the speech, compares that speech rate with a target playback speed, and variably sets the rate at which the speech is thinned out or repeated so that the playback speed becomes approximately the same as the target playback speed. Unlike conventional methods, this makes it possible to produce speech at the speed the listener prefers, regardless of the speech rate of the original speech.

[0037] Further, because silent and voiced portions are distinguished and the expansion/contraction rate of the redundant silent portions is set differently from that of the voiced portions, expansion and contraction can be performed more efficiently. Shortening the silent sections to the shortest length to which they can be reduced makes the process more efficient still. Setting a voiced-section expansion/contraction rate individually for each voiced section improves efficiency further.

[Brief description of drawings]

FIG. 1 is a block diagram of a first embodiment of the present invention.

FIG. 2 is a block diagram of a second embodiment of the present invention.

FIG. 3 is a block diagram of a third embodiment of the present invention.

FIG. 4 is a block diagram of a fourth embodiment of the present invention.

FIG. 5 is a block diagram of a fifth embodiment of the present invention.

FIG. 6 is a flowchart of the first embodiment.

FIG. 7 is a flowchart of the second embodiment.

FIG. 8 is a flowchart of the third embodiment.

FIG. 9 is a flowchart of the fourth embodiment.

FIG. 10 is a flowchart of the fifth embodiment.

FIG. 11 is a diagram showing an example of speech waveform conversion according to the fifth embodiment.

[Explanation of symbols]

101 Voice input unit (voice input means)
102 Speech rate calculation unit (speech rate calculation means)
103 Speech expansion/contraction rate setting unit (speech expansion/contraction rate setting means)
104, 124 Waveform generation unit (waveform generation means)
105 Voice output unit (voice output means)
116, 126, 136, 146 Control unit
117 Target speed setting unit
128 Voiced/silent discrimination unit
129 Voiced-section expansion/contraction rate setting unit
130 Silent-section expansion/contraction rate setting unit

Continuation of the front page
(72) Inventor: Hiroyuki Matsui, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo
(56) References: JP-A-10-70790 (JP, A); JP-A-9-325798 (JP, A); JP-A-8-254992 (JP, A); JP-A-5-257490 (JP, A)
(58) Fields searched (Int. Cl.⁷, DB name): G10L 21/04; G10L 13/08

Claims

(57) [Claims]

1. A voice speed conversion device comprising:
voice input means for capturing speech;
speech rate calculation means for calculating the speech rate of the speech captured by the voice input means;
target speed setting means for setting a target playback speed;
voiced/silent discrimination means for discriminating between voiced sections and silent sections of the speech;
voiced-section expansion/contraction rate setting means for setting a voiced-section expansion/contraction rate, which is the rate at which the speech waveform in a voiced section discriminated by the voiced/silent discrimination means is thinned out or repeated;
silent-section expansion/contraction rate setting means for setting a silent-section expansion/contraction rate, which is the rate at which the speech waveform in a silent section discriminated by the voiced/silent discrimination means is thinned out or repeated;
waveform generation means for generating an expanded/contracted waveform of the voiced section by thinning out or repeating the speech waveform of the voiced section based on the voiced-section expansion/contraction rate, and for generating an expanded/contracted waveform of the silent section by thinning out or repeating the speech waveform of the silent section based on the silent-section expansion/contraction rate;
a first control unit that compares the target playback speed set by the target speed setting means with the speech rate calculated by the speech rate calculation means, and variably sets the silent-section expansion/contraction rate in the silent-section expansion/contraction rate setting means so that the playback speed of the expanded/contracted waveform of the silent section becomes approximately the same as the target playback speed;
a third control unit that causes the speech rate calculation means to calculate a speech rate for each voiced section discriminated by the voiced/silent discrimination means, compares the target playback speed set by the target speed setting means with the speech rate of each voiced section calculated by the speech rate calculation means, and variably sets, for each voiced section, the voiced-section expansion/contraction rate in the voiced-section expansion/contraction rate setting means so that the playback speed of the expanded/contracted waveform of the voiced section becomes approximately the same as the target playback speed; and
voice output means for receiving the expanded/contracted waveforms of the voiced section and the silent section generated by the waveform generation means and outputting the expanded/contracted speech.

2. The voice speed conversion device according to claim 1, wherein the third control unit performs control to variably set, for each silent section, the silent-section expansion/contraction rate in the silent-section expansion/contraction rate setting means so that the playback time of the expanded/contracted waveform of each silent section is constant.

3. A voice speed conversion method comprising:
capturing speech;
calculating the speech rate of the speech;
discriminating between voiced sections and silent sections of the speech;
setting a target playback speed for the speech;
comparing the target playback speed with the speech rate, and variably setting a silent-section expansion/contraction rate, which is the rate at which the speech waveform of the silent section is thinned out or repeated, so that the playback speed of the expanded/contracted waveform of the silent section becomes approximately the same as the target playback speed;
calculating a speech rate for each discriminated voiced section;
comparing the target playback speed with the speech rate of each voiced section, and variably setting, for each voiced section, a voiced-section expansion/contraction rate, which is the rate at which the speech waveform of the voiced section is thinned out or repeated, so that the playback speed of the expanded/contracted waveform of the voiced section becomes approximately the same as the target playback speed;
generating an expanded/contracted waveform of the voiced section by thinning out or repeating the speech waveform of the voiced section based on the voiced-section expansion/contraction rate;
generating an expanded/contracted waveform of the silent section by thinning out or repeating the speech waveform of the silent section based on the silent-section expansion/contraction rate; and
outputting expanded/contracted speech from the expanded/contracted waveforms of the voiced section and the silent section.

4. The voice speed conversion method according to claim 1, wherein the silent-section expansion/contraction rate is variably set in the silent-section expansion/contraction rate setting means for each silent section so that the playback time of the expanded/contracted waveform of the silent section is constant.

5. A recording medium storing a program for executing a voice speed conversion method comprising:
capturing speech;
calculating the speech rate of the speech;
discriminating between voiced sections and silent sections of the speech;
setting a target playback speed for the speech;
comparing the target playback speed with the speech rate, and variably setting a silent-section expansion/contraction rate, which is the rate at which the speech waveform of the silent section is thinned out or repeated, so that the playback speed of the expanded/contracted waveform of the silent section becomes approximately the same as the target playback speed;
calculating a speech rate for each discriminated voiced section;
comparing the target playback speed with the speech rate of each voiced section, and variably setting, for each voiced section, a voiced-section expansion/contraction rate, which is the rate at which the speech waveform of the voiced section is thinned out or repeated, so that the playback speed of the expanded/contracted waveform of the voiced section becomes approximately the same as the target playback speed;
generating an expanded/contracted waveform of the voiced section by thinning out or repeating the speech waveform of the voiced section based on the voiced-section expansion/contraction rate;
generating an expanded/contracted waveform of the silent section by thinning out or repeating the speech waveform of the silent section based on the silent-section expansion/contraction rate; and
outputting expanded/contracted speech from the expanded/contracted waveforms of the voiced section and the silent section.

6. A recording medium storing a program for executing the voice speed conversion method according to claim 1, wherein the silent-section expansion/contraction rate is variably set in the silent-section expansion/contraction rate setting means for each silent section so that the playback time of the expanded/contracted waveform of the silent section is constant.