JP4886907B2

JP4886907B2 - Audio signal correction apparatus and audio signal correction method

Info

Publication number: JP4886907B2
Application number: JP2011132362A
Authority: JP
Inventors: 裕米久保; 広和竹内
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2011-06-14
Filing date: 2011-06-14
Publication date: 2012-02-29
Anticipated expiration: 2029-09-18
Also published as: JP2011203753A

Description

The present invention relates to an audio signal correction technique that adaptively applies sound quality correction processing to the speech signal and the music signal contained in an audio signal.

As is well known, broadcast receivers that receive television broadcasts, and information playback devices that reproduce recorded information from an information recording medium, apply sound quality correction processing to the audio signal when reproducing it from the received broadcast signal or from the signal read from the recording medium, in order to further improve sound quality.

In this case, the content of the sound quality correction processing applied to the audio signal differs depending on whether the audio signal is a speech signal, such as a human speaking voice, or a music (non-speech) signal, such as a musical piece. For a speech signal, as in talk scenes or live sports commentary, sound quality improves when the correction emphasizes and clarifies the center-localized component. For a music signal, sound quality improves when the correction emphasizes the stereo impression and adds a sense of spaciousness.

For this reason, it has been proposed to determine whether the acquired audio signal is a speech signal or a music signal and to apply the corresponding sound quality correction processing according to the result. In actual audio signals, however, speech and music are often mixed, which makes the determination difficult. As a result, it cannot currently be said that appropriate sound quality correction processing is applied to audio signals.

Patent Document 1 discloses a configuration that determines whether an audio signal is speech or non-speech according to its degree of speech-likeness and degree of music-likeness, and that further optimizes the speech/non-speech determination according to whether the audio signal is a monaural signal or a stereo signal.

Patent Document 1: JP 2007-67858 A

However, with the configuration of Patent Document 1, it is difficult to determine the signal content appropriately when the audio signal is a dual monaural signal, or when it is a stereo signal that is transmitted as monaural.

An object of the present invention is to provide an audio signal correction apparatus and an audio signal correction method that evaluate the content of an input audio signal and apply adaptive sound quality correction processing.

An audio signal correction apparatus according to an embodiment of the present invention includes: feature extraction means for determining, based on channel information, whether an input audio signal is a monaural signal or a stereo signal, and for extracting a plurality of feature parameters used to classify the input audio signal as either a speech signal or a music signal; signal type determination means for calculating, based on the plurality of feature parameters extracted by the feature extraction means, a speech/music identification score indicating whether the input audio signal is closer to a speech signal or to a music signal; level calculation means for calculating, using the speech/music identification score, output levels for the degree of speech and the degree of music of the input audio signal; and sound quality correction means for applying sound quality correction processing to the input audio signal based on the output levels calculated by the level calculation means.

According to the present invention, it is possible to provide an audio signal correction apparatus that evaluates the content of an input audio signal and applies adaptive sound quality correction processing.

FIG. 1 is a block diagram showing the schematic configuration of a digital television broadcast receiver according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the schematic configuration of an audio processing module according to the embodiment of the present invention.

FIG. 3 is a flowchart explaining the feature extraction processing according to the embodiment of the present invention.

FIG. 4 is a flowchart explaining the signal type determination processing according to the embodiment of the present invention.

FIG. 5 is a flowchart explaining the level calculation processing according to the embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 shows the main signal processing system of a digital television broadcast receiver 11. A satellite digital television broadcast signal received by an antenna 43 for BS/CS (broadcasting satellite/communication satellite) digital broadcast reception is supplied through an input terminal 44 to a tuner 45 for satellite digital broadcasting, which selects the broadcast signal of the desired channel.

The broadcast signal selected by the tuner 45 is supplied in turn to a PSK (phase shift keying) demodulation module 46 and a TS (transport stream) decoding module 47, demodulated into a digital video signal and a digital audio signal, and then output to a signal processing module 48.

A terrestrial digital television broadcast signal received by an antenna 49 for terrestrial broadcast reception is supplied through an input terminal 50 to a tuner 51 for terrestrial digital broadcasting, which selects the broadcast signal of the desired channel.

The broadcast signal selected by the tuner 51 is supplied in turn to, for example in Japan, an OFDM (orthogonal frequency division multiplexing) demodulation module 52 and a TS decoding module 53, demodulated into a digital video signal and a digital audio signal, and then output to the signal processing module 48.

A terrestrial analog television broadcast signal received by the antenna 49 is supplied through the input terminal 50 to a tuner 54 for terrestrial analog broadcasting, which selects the broadcast signal of the desired channel. The broadcast signal selected by the tuner 54 is supplied to an analog demodulation module 55, demodulated into an analog video signal and an analog audio signal, and then output to the signal processing module 48.

The signal processing module 48 selectively applies predetermined digital signal processing to the digital video and audio signals supplied from the TS decoding modules 47 and 53, and outputs the results to a graphic processing module 56 and an audio processing module 57.

A plurality of input terminals (four in the illustrated example) 58a, 58b, 58c, and 58d are connected to the signal processing module 48. Each of the input terminals 58a to 58d allows analog video and audio signals to be input from outside the digital television broadcast receiver 11.

The signal processing module 48 selectively digitizes the analog video and audio signals supplied from the analog demodulation module 55 and from the input terminals 58a to 58d, applies predetermined digital signal processing to the digitized video and audio signals, and then outputs them to the graphic processing module 56 and the audio processing module 57.

The graphic processing module 56 has a function of superimposing an OSD signal generated by an OSD (on screen display) signal generation module 59 on the digital video signal supplied from the signal processing module 48 and outputting the result. The graphic processing module 56 can also selectively output either the output video signal of the signal processing module 48 or the output OSD signal of the OSD signal generation module 59, and can combine the two outputs so that each occupies half of the screen.

The digital video signal output from the graphic processing module 56 is supplied to a video processing module 60. The video processing module 60 converts the input digital video signal into an analog video signal in a format that the video display 14 can display, outputs it to the video display 14 for display, and also delivers it to the outside through an output terminal 61.

The audio processing module 57 applies the sound quality correction processing described later to the input digital audio signal and then converts it into an analog audio signal in a format that the speaker 15 can reproduce. This analog audio signal is output to the speaker 15 for audio playback and is also delivered to the outside through an output terminal 62.

All operations of the digital television broadcast receiver 11, including the various reception operations described above, are centrally controlled by a control module 63. The control module 63 incorporates a CPU (central processing unit) 64. It receives operation information from the operation module 16, or operation information sent from the remote controller 17 and received by the light receiving module 18, and controls each part so that the requested operation is reflected.

In doing so, the control module 63 mainly uses a ROM (read only memory) 65 that stores the control program executed by the CPU 64, a RAM (random access memory) 66 that provides a work area for the CPU 64, and a non-volatile memory 67 that stores various setting information, control information, and the like.

FIG. 2 shows a configuration in which the audio processing module 57 includes a signal characteristic analysis module 70 and a sound quality correction module 80. The signal characteristic analysis module 70 includes a feature extraction module 72, a signal type determination module 74, and a level calculation module 76. The feature extraction module 72 in turn includes a first feature extraction module 72a and a second feature extraction module 72b, and the signal type determination module 74 includes a first signal type determination module 74a and a second signal type determination module 74b. An input audio signal is supplied to an input terminal 71. The control module 63 supplies the input audio signal to the feature extraction module 72, and supplies the channel information of the input audio signal (monaural/stereo signal information) to each module of the signal characteristic analysis module 70.

When the input audio signal is a stereo signal, the first feature extraction module 72a calculates various feature parameters for determining whether the input audio signal is a speech signal or a music signal. When the input audio signal is a monaural signal, the second feature extraction module 72b calculates the corresponding feature parameters. The feature extraction module 72 switches between the first feature extraction module 72a and the second feature extraction module 72b according to whether the input audio signal is a stereo signal or a monaural signal.

The first signal type determination module 74a determines whether the input audio signal (a stereo signal) is a speech signal or a music signal. Similarly, the second signal type determination module 74b determines whether the input audio signal (a monaural signal) is a speech signal or a music signal. The signal type determination module 74 switches between the first signal type determination module 74a and the second signal type determination module 74b according to whether the input audio signal is a stereo signal or a monaural signal.

The level calculation module 76 calculates, for the speech signal or music signal, a speech/music level that includes confidence information for fine control of the sound quality. The level calculation module 76 outputs this speech/music level information to the sound quality correction module 80.

In this embodiment, the first feature extraction module 72a and the second feature extraction module 72b are separate modules, as are the first signal type determination module 74a and the second signal type determination module 74b, but each pair may instead be integrated into a single module.

The sound quality correction module 80 applies sound quality correction processing based on the speech/music level information calculated by the signal characteristic analysis module 70, and supplies the corrected output audio signal to an output terminal 77.

In other words, the signal characteristic analysis module 70 and the sound quality correction module 80 provide scene-adaptive sound quality correction: during broadcast reception or playback of content from a recording medium, they identify music sections and speech sections without processing delay and apply sound quality correction suited to the content of the scene to the input audio signal, thereby improving sound quality.

Next, the operation of the first feature extraction module 72a and the second feature extraction module 72b is described. FIG. 3 is a flowchart explaining the feature extraction processing. First, the feature extraction module 72 cuts the input audio signal into frames of roughly several hundred msec each, and further divides each frame into subframes of roughly several tens of msec each (step S101). For example, one subframe is 20 msec long.
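The framing in step S101 can be sketched as follows. This is a minimal illustration rather than the implementation described in the patent: the function name `split_into_frames`, the 400 msec frame length, and the choice to drop any trailing partial frame are assumptions made for the example; only the 20 msec subframe length comes from the text.

```python
def split_into_frames(samples, sample_rate, frame_ms=400, subframe_ms=20):
    """Split a sample sequence into frames, each a list of equal-length subframes.

    frame_ms=400 is an assumed value standing in for "several hundred msec";
    subframe_ms=20 follows the example given in the text.
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    sub_len = sample_rate * subframe_ms // 1000      # samples per subframe
    subs_per_frame = frame_ms // subframe_ms         # subframes per frame
    frame_len = sub_len * subs_per_frame             # samples per frame
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        frames.append([frame[i:i + sub_len] for i in range(0, frame_len, sub_len)])
    return frames
```

With one second of 48 kHz audio, this yields two complete 400 msec frames of twenty 960-sample subframes each.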

Based on the channel information of the input audio signal, the feature extraction module 72 determines whether the number of channels of the input audio signal is 2 (that is, whether the signal is a monaural signal or a stereo signal) (step S102). It is assumed here that when, for example, the input audio signal demodulated from the broadcast signal selected by the tuner 51 is a multi-channel signal, the signal processing module 48 has already downmixed it from multi-channel to a 2-channel stereo signal. The signal processing module 48 supplies the 2-channel stereo signal to the input terminal 71 as the input audio signal.

When the number of channels is 2 (step S102, YES), the feature extraction module 72 determines whether the input audio signal is an ordinary stereo signal rather than a dual monaural signal (step S103). Even though a dual monaural signal has two channels, the sounds carried on the main and sub channels are inherently separate monaural signals.

When the input audio signal is an ordinary stereo signal and not a dual monaural signal (step S103, YES), the feature extraction module 72 calculates, for each subframe, the power ratio of the left and right (LR) signals of the 2-channel stereo input (the LR power ratio). Even when the format of the input audio signal is stereo, the content is sometimes actually transmitted as monaural. In that case the L and R channels carry almost identical signals, and the feature extraction module 72 cannot tell this from the number of channels alone. The feature extraction module 72 therefore calculates the LR power ratio by dividing the difference component of the L and R channels by their sum component, and determines whether this LR power ratio is larger than a preset threshold thPw (step S104).
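The check in step S104 can be illustrated with the sketch below. The text only states that the difference component of the LR channels is divided by the sum component; using the summed squares (power) of L−R and L+R over a subframe, the small `eps` guard against division by zero, and the default threshold value are all assumptions of this example.

```python
def lr_power_ratio(left, right, eps=1e-12):
    """Power of the difference signal (L - R) divided by power of the sum signal (L + R).

    Assumed formulation: both components are measured as sums of squares
    over one subframe. eps avoids division by zero on silent input.
    """
    diff_power = sum((l - r) ** 2 for l, r in zip(left, right))
    sum_power = sum((l + r) ** 2 for l, r in zip(left, right))
    return diff_power / (sum_power + eps)


def is_stereo_like(left, right, th_pw=0.01):
    """Step S104: True when the LR power ratio exceeds the threshold thPw.

    th_pw=0.01 is a placeholder; the patent does not give a value.
    """
    return lr_power_ratio(left, right) > th_pw
```

Identical L and R channels give a ratio of zero, so a stereo-format signal that actually carries monaural content falls through to the monaural branch.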

When the LR power ratio is larger than the threshold thPw (step S104, YES), the first feature extraction module 72a extracts stereo-oriented discrimination information from the stereo signal (step S105). In this embodiment, "stereo signal" means a 2-channel signal that is not a dual monaural signal and whose LR power ratio is at or above a certain level, that is, a signal with a strong stereo character.

The first feature extraction module 72a calculates discrimination information for each subframe, such as the LR power ratio (based on the sum of squares of the signal amplitude), the zero-crossing frequency (the number of times the time waveform of the input audio signal crosses zero in the amplitude direction), and the variation of spectral components of the input audio signal in the frequency domain. The discrimination information is not limited to these items, and others may be added.
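Two of the per-subframe quantities named above can be sketched as follows. The function names are hypothetical, and treating a sample of exactly zero as non-negative when counting sign changes is an assumption of this example; the spectral variation measure is omitted because the text does not define it.

```python
def zero_crossings(subframe):
    """Count how many times the waveform changes sign within one subframe.

    A sample equal to zero is treated as non-negative (an assumption).
    """
    count = 0
    for prev, cur in zip(subframe, subframe[1:]):
        if (prev < 0 <= cur) or (prev >= 0 > cur):
            count += 1
    return count


def subframe_power(subframe):
    """Sum of squared amplitudes over one subframe."""
    return sum(s * s for s in subframe)
```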

The first feature extraction module 72a sets the variable paramSet=stereo for the input audio signal to indicate stereo-oriented discrimination information (step S106). The feature extraction module 72 then merges the subframes back into frames of roughly several hundred msec each (step S107). Next, the feature extraction module 72 computes per-frame statistical features (for example, mean, variance, maximum, and minimum) from the stereo-oriented or monaural-oriented discrimination information and generates a feature parameter set (step S108). The feature extraction module 72 then ends the feature extraction processing.
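Steps S107 and S108 can be sketched as below. This is only an illustration: the dictionary layout of the per-subframe values, the use of the population variance, and the ordering of the resulting parameter vector are assumptions, not details given in the text.

```python
def frame_statistics(values):
    """Mean, population variance, maximum, and minimum of one feature over a frame."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return {"mean": mean, "var": variance, "max": max(values), "min": min(values)}


def build_parameter_set(subframe_features):
    """Flatten per-frame statistics into one feature parameter vector.

    subframe_features maps a feature name (hypothetical, e.g. "zcr") to the
    list of its per-subframe values within a single frame. Features are
    visited in sorted name order so the vector layout is deterministic.
    """
    params = []
    for name in sorted(subframe_features):
        stats = frame_statistics(subframe_features[name])
        params.extend(stats[key] for key in ("mean", "var", "max", "min"))
    return params
```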

When the input audio signal is a dual monaural signal and not an ordinary stereo signal (step S103, NO), the second feature extraction module 72b receives the main/sub selection made by the user and determines which channel to focus on for detection (step S109). The second feature extraction module 72b then extracts monaural-oriented discrimination information for the selected main or sub channel (step S110). Similarly, when the number of channels is not 2 (that is, when it is 1) (step S102, NO), the second feature extraction module 72b extracts monaural-oriented discrimination information (step S110). Likewise, when the LR power ratio is less than or equal to the threshold thPw (step S104, NO), the second feature extraction module 72b extracts monaural-oriented discrimination information (step S110).

The second feature extraction module 72b calculates discrimination information for each subframe, such as the LR power ratio, the zero-crossing frequency, and the spectral component variation. The discrimination information is not limited to these items, and others may be added.

The second feature extraction module 72b sets the variable paramSet=mono for the input audio signal to indicate monaural-oriented discrimination information (step S111). Processing then continues from step S107.
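The branching in steps S102 to S104 that decides between paramSet=stereo and paramSet=mono can be summarized in a short sketch. The function name and argument names are hypothetical; the branch conditions follow the flow described above.

```python
def select_param_set(num_channels, is_dual_mono, lr_ratio, th_pw):
    """Return "stereo" or "mono" following the flow of steps S102 to S104."""
    if num_channels != 2:      # step S102, NO: single-channel input
        return "mono"
    if is_dual_mono:           # step S103, NO: two channels, but dual monaural
        return "mono"
    if lr_ratio > th_pw:       # step S104, YES: strong stereo character
        return "stereo"
    return "mono"              # step S104, NO: stereo format, monaural content
```

Only a 2-channel, non-dual-monaural signal whose LR power ratio exceeds thPw is treated as stereo; every other case takes the monaural path.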

Some items of discrimination information are common to the stereo-oriented and monaural-oriented sets, while others are specific to one of them. One feature parameter specific to the stereo-oriented discrimination information is the LR power ratio, which tends to be large in music sections and small in speech sections.

As described above, the feature extraction module 72 extracts stereo-oriented or monaural-oriented discrimination information according to both the channel information and the content of the input audio signal, and generates a feature parameter set from the extracted discrimination information. The feature extraction module 72 can therefore select the discrimination information best suited to determining whether the input audio signal is a speech signal or a music signal. The feature parameter sets generated by the feature extraction module 72 are supplied to the signal type determination module 74.

Next, the operation of the signal type determination module 74 is described. FIG. 4 is a flowchart explaining the signal type determination processing, which uses the feature parameter set and the channel information. First, the signal type determination module 74 determines whether paramSet=stereo has been set for the input audio signal (step S201). When paramSet=stereo is set (step S201, YES), the first signal type determination module 74a calculates the stereo-oriented linear discriminant as follows (step S202).

The stereo-oriented linear discriminant is used to calculate the speech/music identification score S1, which the signal type determination module 74 uses to determine whether the input audio signal is a speech signal or a music signal. For the feature parameter set generated by the feature extraction module 72, the signal type determination module 74 assigns each feature parameter a weighting coefficient according to its importance and takes the linear sum of the weighted values, which gives the speech/music identification score S1 representing the likelihood that the signal belongs to music or to speech. The weighting coefficients are determined by learning from data whose expected sound type (music or speech) is known in advance.

A feature parameter that is more effective for discriminating the signal type is given a larger weighting coefficient. As an example, the signal type determination module 74 uses the stereo-oriented linear discriminant described below. For the speech/music identification score S1, the weighting coefficients are calculated by supplying a large number of known speech signals and music signals, prepared in advance, as reference data and learning from the feature parameters of that reference data.

Let the feature parameter set of the k-th frame of the reference data used for learning be represented by a vector x, and let the signal section {speech, music} to which the input audio signal belongs be represented by y, as follows.

Here, the elements of equation (1) correspond to the n extracted feature parameters. The values -1 and +1 in equation (2) correspond to the speech section and the music section, respectively; they are binary labels assigned manually in advance to the sections of the speech/music training data that serve as the correct signal type. Because the assignment of -1 and +1 in equation (2) is only a convention, it may be reversed. From equation (2), the following linear discriminant function is then set up.

For k = 1 to N (where N is the number of input frames of reference data), the vector x is extracted, and the weighting coefficients β_i (i = 0 to n) for the feature parameters are determined by solving the normal equations that minimize equation (4), the sum of squared errors between the evaluation value of equation (3) and the correct signal type of equation (2).
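Equations (1) to (4) themselves are not reproduced in this text. A form consistent with the surrounding description (n feature parameters per frame, labels of -1 and +1, coefficients β_0 to β_n, and a sum of squared errors over N frames) would be the following. This is a reconstruction offered as a reading aid; in particular, treating β_0 as a constant offset term is an assumption.

```latex
\begin{align}
  x^{k} &= \left(x^{k}_{1},\, x^{k}_{2},\, \dots,\, x^{k}_{n}\right) \tag{1} \\
  y^{k} &\in \{-1,\, +1\} \tag{2} \\
  f\!\left(x^{k}\right) &= \beta_{0} + \sum_{i=1}^{n} \beta_{i}\, x^{k}_{i} \tag{3} \\
  E &= \sum_{k=1}^{N} \left(y^{k} - f\!\left(x^{k}\right)\right)^{2} \tag{4}
\end{align}
```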

When paramSet=stereo is not set (that is, when paramSet=mono is set) (step S201, NO), the second signal type determination module 74b calculates the monaural-oriented linear discriminant using equations (1) to (4) in the same way as above (step S202). Unlike the stereo-oriented linear discriminant, the second signal type determination module 74b calculates the monaural-oriented linear discriminant from m feature parameters.

For either the stereo or the monaural linear discriminant, the signal type determination module 74 uses the weighting coefficients determined by learning to calculate, from equation (3), the evaluation value of the input audio signal actually being identified, frame by frame (step S204). The value f(x) corresponds to the speech/music identification score S1.
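
The learning and scoring steps can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the implementation in the embodiment: the function names `solve_linear_system`, `learn_weights`, and `score` are hypothetical, a constant 1 is prepended to each feature vector to carry β_0, and the normal equations are solved by plain Gauss-Jordan elimination.

```python
def solve_linear_system(a, b):
    """Solve a * w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def learn_weights(features, labels):
    """Return [beta_0, ..., beta_n] minimising the sum of squared errors."""
    rows = [[1.0] + list(f) for f in features]  # assumption: 1 carries beta_0
    dim = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(dim)]
           for i in range(dim)]
    xty = [sum(r[i] * y for r, y in zip(rows, labels)) for i in range(dim)]
    return solve_linear_system(xtx, xty)


def score(beta, feature):
    """Evaluation value f(x) of the linear discriminant for one frame."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], feature))
```

With labelled reference frames, `learn_weights` would run once offline, and `score` would then be evaluated for every frame of the signal being identified.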

Note that the calculation of the speech/music identification score S1 is not limited to the method described above, in which the feature parameters are multiplied by weighting coefficients obtained by offline learning with the linear discriminant function. For example, an empirical threshold may be set for the calculated value of each feature parameter, a weighted point value assigned to each feature parameter according to the result of comparison with that threshold, and the score calculated from those points.

The signal type determination module 74 determines whether S1 < 0 (step S205). It judges the frame to be a music section if S1 < 0 and a speech section if f(x) > 0. The signal type determination module 74 thus assigns each frame exclusively to either a speech section or a music section.

When S1 < 0 does not hold (that is, the frame is a speech section) (step S205, NO), the signal type determination module 74 increments the variable cntSp (step S206). When S1 < 0 (that is, the frame is a music section) (step S205, YES), the signal type determination module 74 increments the variable cntMs.

The speech/music identification score S1 calculated by the signal type determination module 74 and the incremented variable are supplied to the level calculation module 76. The signal type determination module 74 then ends the signal type determination.
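
The per-frame judgment of steps S205 and S206 might look like the following Python sketch. The names `RunCounters` and `judge_frame` are hypothetical. Resetting the opposite counter is an assumption made here so that each counter reflects a consecutive run; the text above only states that the counters are incremented.

```python
from dataclasses import dataclass


@dataclass
class RunCounters:
    cnt_sp: int = 0  # cntSp: frames judged as speech
    cnt_ms: int = 0  # cntMs: frames judged as music


def judge_frame(s1, counters):
    """Classify one frame from its score S1 and update the counters.

    Returns "music" when S1 < 0 (step S205, YES) and "speech" otherwise.
    """
    if s1 < 0:
        counters.cnt_ms += 1
        counters.cnt_sp = 0  # assumption: restart the opposite run
        return "music"
    counters.cnt_sp += 1     # step S206
    counters.cnt_ms = 0      # assumption: restart the opposite run
    return "speech"
```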

As described, the signal type determination module 74 selects a different feature parameter set depending on whether the input audio signal, as determined from the channel information, is a stereo signal or a monaural signal. The benefit of selecting the feature parameter set in this way is explained below.

For example, the number n of feature parameters in the stereo feature parameter set differs from the number m of feature parameters in the monaural feature parameter set. As described above, when the input audio signal is a stereo signal, the signal type determination module 74 uses a feature parameter set that includes a statistical feature calculated from the LR power ratio, which serves as discrimination information, so improved detection accuracy of the speech/music identification score S1 can be expected. When the input audio signal is a monaural signal, on the other hand, no improvement in the detection accuracy of S1 can be expected even if the module uses a feature parameter set that includes the statistical feature calculated from the LR power ratio. On the contrary, the detection accuracy may even decrease.

Equation (5) is an example in which the first signal type determination module 74a determines the weighting coefficients β_i according to the importance of each feature parameter and applies them to equation (3). Here χn is the feature parameter for the LR power ratio.

As equation (2) indicates, a negative value of the linear discriminant function means that the input audio signal is highly music-like. In an ordinary stereo music signal, different musical sounds are placed in the L and R channels, so the LR power ratio tends to be large.

This tendency holds in general for any stereo music. As a result of learning, the weighting coefficient for the LR power ratio feature parameter tends to be relatively large compared with the weighting coefficients with which the other feature parameters indicate the music/speech discrimination. In other words, the LR power ratio feature parameter contributes more strongly to the discrimination between music and speech sections than the other feature parameters do. The value of the linear discriminant function therefore also tends to be a large negative value.

On the other hand, if the input audio signal is a monaural signal, the feature parameter χn is omitted even when the signal is music. Ordinarily, the second signal type determination module 74b would calculate the value of the linear discriminant function with 0 substituted for χn. The LR power ratio term then no longer contributes to the music/speech determination, and the detection accuracy of the second signal type determination module 74b for music and speech sections drops. The second signal type determination module 74b determines the weight of each weighting coefficient in consideration of how much the corresponding feature parameter contributes to the music/speech determination, and the contribution of the LR power ratio feature parameter is relatively large compared with the others. When its term is dropped from the linear discriminant function, the second signal type determination module 74b therefore has difficulty determining music and speech sections.

The second signal type determination module 74b therefore obtains the weighting coefficients from equations (1) to (4) using a different feature parameter set that excludes the LR power ratio term (a set made up of feature parameters expected to be effective for both monaural and stereo signals and feature parameters specific to monaural signals).

Because the LR power ratio feature parameter is absent, the second signal type determination module 74b gives certain of the remaining feature parameters coefficient values that indicate musicality more strongly than the weighting coefficient values shown in equation (5). The second signal type determination module 74b can thereby suppress the decrease in detection accuracy for music and speech sections.

As described above, the signal type determination module 74 can prepare optimal weighting coefficients for stereo signals and for monaural signals, and can switch between the linear discriminants according to the channel information of the input audio signal.
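
Switching between the two discriminants could be sketched as follows. The function name `identification_score` and the table `EXAMPLE_WEIGHTS` are hypothetical, and the coefficient values are placeholders chosen for illustration only; they are not the learned values of equation (5).

```python
# Placeholder coefficients [beta_0, beta_1, ...]; NOT values from the patent.
EXAMPLE_WEIGHTS = {
    # n = 2 features; the last one stands in for the LR power ratio feature.
    "stereo": [0.5, 1.0, -2.0],
    # m = 1 feature; there is no LR power ratio term.
    "mono": [0.5, 1.5],
}


def identification_score(features, param_set, weights=EXAMPLE_WEIGHTS):
    """Evaluate the linear discriminant selected by param_set.

    param_set is "stereo" or "mono", mirroring paramSet in the text.
    """
    beta = weights[param_set]
    if len(features) != len(beta) - 1:
        raise ValueError("feature count does not match the selected discriminant")
    return beta[0] + sum(b * x for b, x in zip(beta[1:], features))
```

Keeping a separate coefficient list per parameter set, rather than zeroing the LR power ratio term, reflects the point made above that the monaural discriminant is learned on its own feature set.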

Next, the operation of the level calculation module 76 is described. FIG. 5 is a flowchart explaining the level calculation process. The level calculation module 76 could simply judge a speech section when the value of the linear discriminant function obtained from equation (5) is positive and a music section when it is negative. However, so that the control module 63 can finely control the quality of the sound output from the speaker 15, it is desirable for the level calculation module 76 to express the value of the linear discriminant function as graded confidence information. In addition, with monaural signals the characteristics of music do not show up in the feature parameters as clearly as with stereo signals, so the musicality score given by the linear discriminant function value S1 tends to be relatively small. The determination by the level calculation module 76 may therefore become unstable depending on the piece of music. For this reason, the level calculation module 76 calculates speech/music levels that also stabilize the score, for example as follows.

The level calculation module 76 calculates confidence information for the music section and for the speech section based on the linear discriminant function value S1 obtained from the linear discriminant. Here Sm1 is the music score variable and Ss1 is the speech score variable. The level calculation module 76 sets Sm1 = −S1 and Ss1 = S1 (step S301). The sign of S1 is inverted for Sm1 because it is easier to handle both speech and music as positive-valued levels.

For Sm1 (> 0), while the speech/music identification score S1 is calculated for each frame, the level calculation module 76 counts the number of frames cntMs that have been continuously judged as music up to that point. The level calculation module 76 determines whether cntMs has reached or exceeded the specified count thNms (step S302).

When cntMs has reached thNms (step S302, YES), the level calculation module 76 increases the correction score Sm2 (> 0), which is added to Sm1, by step_m (> 0), and decreases the correction score Ss2 (> 0), which is subtracted from Ss1, by step_s (> 0). The level calculation module 76 then clips the values of Sm2 and Ss2 to an appropriate range (for example, min = 0 and max = 1) (step S303).

As a result, even when the music score variable Sm1 is relatively small, the corrected music score variable becomes stable as time passes.

The level calculation module 76 adds the correction score Sm2 to the music score variable Sm1 as in equation (6) (step S304).

The level calculation module 76 subtracts the correction score Ss2 from the speech score variable Ss1 as in equation (7) (step S305).

When cntMs has not reached thNms (step S302, NO), the level calculation module 76 counts, for Ss1 (> 0), the number of frames cntSp that have been continuously judged as speech up to that point. The level calculation module 76 determines whether cntSp has reached or exceeded the specified count thNsp (step S306).

When cntSp has reached thNsp (step S306, YES), the level calculation module 76 decreases the correction score Sm2 (> 0), which is added to Sm1, by step_m (> 0), and increases the correction score Ss2 (> 0), which is subtracted from Ss1, by step_s (> 0). The level calculation module 76 then clips the values of Sm2 and Ss2 to an appropriate range (for example, min = 0 and max = 1) (step S307).

Because the level calculation module 76 decreases the correction score Sm2 in steps, abrupt changes in the corrected sound quality at a transition from a music section to a speech section are softened.

The level calculation module 76 subtracts the correction score Sm2 from the music score variable Sm1 as in equation (8) (step S308).

The level calculation module 76 adds the correction score Ss2 to the speech score variable Ss1 as in equation (9) (step S309). By adding the correction score Ss2 as the same determination continues, the level calculation module 76 can stabilize the speech/music levels.
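
Steps S301 to S309 can be followed literally in a short Python sketch. This is one reading of the text, not a confirmed implementation: the names `CorrectionState` and `correct_scores` are hypothetical, the default thresholds and step sizes are placeholders, and leaving the scores uncorrected when neither counter has reached its threshold is an assumption, since that case is not described.

```python
from dataclasses import dataclass


def clip(value, lo=0.0, hi=1.0):
    """Limit value to the range [lo, hi]."""
    return max(lo, min(hi, value))


@dataclass
class CorrectionState:
    sm2: float = 0.0  # correction score Sm2 for music
    ss2: float = 0.0  # correction score Ss2 for speech


def correct_scores(s1, cnt_ms, cnt_sp, state,
                   th_nms=3, th_nsp=3, step_m=0.1, step_s=0.1):
    """Return the corrected pair (Sm1', Ss1') for one frame."""
    sm1, ss1 = -s1, s1                             # step S301
    if cnt_ms >= th_nms:                           # step S302, YES
        state.sm2 = clip(state.sm2 + step_m)       # step S303
        state.ss2 = clip(state.ss2 - step_s)
        return sm1 + state.sm2, ss1 - state.ss2    # steps S304, S305
    if cnt_sp >= th_nsp:                           # step S306, YES
        state.sm2 = clip(state.sm2 - step_m)       # step S307
        state.ss2 = clip(state.ss2 + step_s)
        return sm1 - state.sm2, ss1 + state.ss2    # steps S308, S309
    return sm1, ss1  # assumption: no correction in the remaining case
```

The state object persists across frames, so a long run of music frames gradually raises Sm2 toward its upper clip limit, and a following run of speech frames lowers it again one step at a time.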

Next, the level calculation module 76 clips Ss1′ and Sm1′ to the range 0 to 1 so that they can be converted into a form that is easy to handle in later stages (step S310). The level calculation module 76 then converts Ss1′ and Sm1′ into levels of the desired resolution (step S311), for example into the music level Lms and the speech level Lsp as integer values in N steps such as 0 to 255.
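
Steps S310 and S311 amount to a clip followed by a quantization, roughly as below. The function name `to_level` is hypothetical, and rounding to the nearest integer is an assumption; the text only specifies the target range.

```python
def to_level(score, steps=256):
    """Clip a corrected score to [0, 1] and map it to an integer level.

    With steps=256 the result lies in 0..255, as in the example in the text.
    """
    clipped = max(0.0, min(1.0, score))   # step S310
    return round(clipped * (steps - 1))   # step S311 (rounding is assumed)
```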

The level calculation module 76 performs smoothing in the course of the level conversion (step S312) in order to suppress abrupt changes in the speech/music levels between frames. Specifically, when smoothing over the past num_fr frames, the level calculation module 76 multiplies the speech/music levels of those num_fr frames by respective weighting coefficients, takes the moving average, and uses the result as the final output levels (music level Lms and speech level Lsp). In doing so, the level calculation module 76, for example, makes the weighting coefficient larger for more recent frames.
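
The weighted moving average of step S312 might be sketched as follows. The class name `LevelSmoother` is hypothetical. Linearly increasing weights and normalization by the sum of the weights are assumptions; the text only requires that more recent frames receive larger weights.

```python
from collections import deque


class LevelSmoother:
    """Weighted moving average over the most recent num_fr levels."""

    def __init__(self, num_fr=4):
        # Weights ordered oldest to newest: 1, 2, ..., num_fr (assumed shape).
        self.weights = [i + 1 for i in range(num_fr)]
        self.history = deque(maxlen=num_fr)

    def update(self, level):
        """Add the newest level and return the smoothed output level."""
        self.history.append(level)
        # Until the window is full, use only the weights for the newest slots.
        used = self.weights[-len(self.history):]
        total = sum(w * v for w, v in zip(used, self.history))
        return total / sum(used)
```

One smoother instance would be kept for the music level Lms and another for the speech level Lsp, so that the two levels are smoothed independently.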

Through the score correction and smoothing described above, the level calculation module 76 can obtain stable speech/music levels with low delay and low processing load. The signal type determination module 74 produces a mutually exclusive music/speech result from the binary determination based on equation (3). The level calculation module 76, however, applies score correction and smoothing to the speech level and the music level independently, so over time the two levels can be calculated as independent values that are not mutually exclusive. In a section such as background music (BGM), for example, the level calculation module 76 outputs the music and speech levels as confidences corresponding to the respective sound components.

Furthermore, the level calculation module 76 may control the speech/music levels according to the content of the input audio signal to which detection is applied or the type of content to which the input audio signal belongs. For example, if the input audio signal is a monaural signal, for which the effect of music correction is relatively hard to obtain compared with a stereo signal, the level calculation module 76 sets the maximum value of the speech/music levels lower than for a stereo signal.

Alternatively, in programs such as dramas and variety shows, as opposed to music programs in which talk scenes and music scenes are relatively clearly separated, various sound effects are often inserted for dramatic effect, and marked switches between music sections and speech sections occur frequently within a short time. To avoid the abrupt sound quality changes that such fluctuations would cause, the level calculation module 76 refers to genre information such as the EPG and sets the output speech/music levels lower for specific content.

The sound quality correction module 80 can flexibly control sound quality correction according to whether the input audio signal is a music signal or a speech signal and whether it is a stereo signal or a monaural signal. That is, the sound quality correction module 80 uses the music/speech level information calculated above to apply sound quality correction suited to the content of the signal.

For example, if the input audio signal is a stereo signal and the music level is high, the sound quality correction module 80 applies a correction that emphasizes spaciousness, such as a surround effect. If the input audio signal is a monaural signal and the music level is high, the sound quality correction module 80 applies a correction centered on equalization. If the input audio signal is a monaural signal and the speech level is high, the sound quality correction module 80 applies contour enhancement with strengthened center localization. If the input audio signal is a stereo signal and the speech level is high, the sound quality correction module 80 applies a softer speech enhancement. The sound quality correction module 80 is thus easy to control according to the number of channels of the input audio signal and the height and stability of the speech/music levels.
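
The four cases above can be summarized as a small decision function. This is only an illustrative sketch: the name `choose_correction`, the returned labels, the threshold of 128 for a "high" level, and the rule that music takes precedence when both levels are high are all assumptions not stated in the text.

```python
def choose_correction(is_stereo, music_level, speech_level, threshold=128):
    """Pick a correction type from the channel format and the 0..255 levels."""
    # Assumption: music wins when it is high and at least as high as speech.
    if music_level >= threshold and music_level >= speech_level:
        return "surround" if is_stereo else "equalizer"
    if speech_level >= threshold:
        if is_stereo:
            return "soft_speech_enhancement"
        return "center_contour_enhancement"
    return "none"  # assumption: no correction when neither level is high
```

In practice the levels could also scale the strength of the chosen correction rather than only select it, which is closer to the graded control described for the level calculation module 76.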

According to the present embodiment, the signal characteristic analysis module 70 makes it possible to switch sound quality correction flexibly according to the characteristics of the input audio signal. The signal characteristic analysis module 70 can perform detection accurately not only on stereo signals but also on monaural signals. It can also perform detection optimally on input audio signals that are in a stereo format but have monaural properties, and on dual monaural input audio signals. The signal characteristic analysis module 70 can express the confidence of music and speech as level information after stabilizing momentary, local fluctuations in the determination. Furthermore, the signal characteristic analysis module 70 can calculate the speech/music levels from a single discriminant with low delay and low processing load, and can obtain them as information that is stabilized according to how long a determination continues and that is independent for speech and music. As a result, sound quality correction of the input audio signal can be switched flexibly according to the monaural/stereo and speech/music classifications.

The modules described above may be implemented in hardware, or in software using the CPU 64 or the like.

The present invention is not limited to the above embodiment, and can be modified in various ways at the implementation stage without departing from its gist. The embodiments may also be combined as appropriate wherever possible, in which case the combined effects are obtained. Furthermore, the above embodiment includes inventions at various stages, and various inventions can be extracted by appropriately combining the disclosed constituent elements. For example, even if some constituent elements are deleted from all those shown in the embodiment, the configuration from which they are deleted can be extracted as an invention, provided that the problem described in the section on the problem to be solved by the invention can be solved and the effects described in the section on the effects of the invention are obtained.

DESCRIPTION OF SYMBOLS: 11 ... Digital television broadcast receiver, 15 ... Speaker, 72 ... Feature extraction module, 74 ... Signal type determination module, 76 ... Level calculation module, 80 ... Sound quality correction module.

Claims

Discrimination means for discriminating whether the input audio signal is a monaural signal or a stereo signal based on channel information;
Feature extraction means for extracting a feature parameter set that includes a plurality of feature parameters for discriminating the input audio signal as either a speech signal or a music signal, the feature parameter set differing between the monaural signal and the stereo signal discriminated by the discrimination means;
Signal type determination means for calculating, using a discriminant that differs according to the feature parameter set extracted by the feature extraction means for the monaural signal or the stereo signal, a speech/music identification score indicating whether the input audio signal is closer to a speech signal or to a music signal;
Level calculation means for calculating output levels, namely a speech level and a music level of the input audio signal, using the speech/music identification score;
Sound quality correction means for performing sound quality correction processing on the input audio signal based on the output level calculated by the level calculation means;
An audio signal correction apparatus comprising:

The audio signal correction apparatus according to claim 1, wherein, when the input audio signal is a dual monaural signal, the discrimination means takes whichever of the main channel and the sub channel has been selected as the detection target and determines the input audio signal for that selected channel to be the monaural signal, or, when the input audio signal is in the stereo signal format and the LR power ratio of the input audio signal is smaller than a predetermined value, the discrimination means determines the input audio signal to be the monaural signal.

The audio signal correction apparatus according to claim 1, wherein the feature extraction unit extracts an LR power ratio as one of the plurality of feature amount parameters when the input audio signal is the stereo signal.

The audio signal correction apparatus according to claim 1, wherein a plurality of weighting coefficients are calculated for the respective feature parameters by learning with speech signals and music signals prepared in advance as reference data, and the sum of the products of each of the plurality of feature parameters and the corresponding weighting coefficient is calculated as the speech/music identification score.

The audio signal correction apparatus according to claim 1, wherein the feature extraction means divides the input audio signal into a plurality of frames of a predetermined unit and extracts the plurality of feature parameters for each of the divided frames.

The audio signal correction apparatus according to claim 5, wherein, when the speech/music identification score calculated by the signal type determination means for each of the divided frames has been determined to indicate the music signal a predetermined number of times in succession, the level calculation means adds a correction score to the speech/music identification score so as to increase the correction strength for music, and, when the speech/music identification score has been determined to indicate the speech signal a predetermined number of times in succession, the level calculation means adds a correction score to the speech/music identification score so as to increase the correction strength for speech.

The audio signal correction apparatus according to claim 6, wherein the level calculation means calculates the smoothed output level by taking a moving average of the corrected speech/music identification scores over a plurality of the divided frames.

The audio signal correction apparatus according to claim 7, wherein the level calculation means reduces the maximum value of the output level when the audio signal is the monaural signal, and changes the maximum value of the output level according to the genre of the audio signal.

Discriminating, based on channel information, whether the input audio signal is a monaural signal or a stereo signal;
Extracting a feature parameter set that includes a plurality of feature parameters for discriminating the input audio signal as either a speech signal or a music signal, a different feature parameter set being extracted for the discriminated monaural signal or stereo signal;
Calculating, using a discriminant that differs according to the feature parameter set for the monaural signal or the stereo signal, a speech/music identification score indicating whether the input audio signal is closer to a speech signal or to a music signal;
Calculating output levels, namely a speech level and a music level of the input audio signal, using the speech/music identification score;
Performing sound quality correction processing on the input audio signal based on the output levels;
An audio signal correction method comprising the above steps.