JP2010118978A

JP2010118978A - Controller of localization of sound, and method of controlling localization of sound

Info

Publication number: JP2010118978A
Application number: JP2008291800A
Authority: JP
Inventors: Takao Yamabe; 孝朗山邊
Original assignee: Victor Company of Japan Ltd
Current assignee: Victor Company of Japan Ltd
Priority date: 2008-11-14
Filing date: 2008-11-14
Publication date: 2010-05-27

Abstract

PROBLEM TO BE SOLVED: To improve audibility through clear sound regardless of the type of content, even when monaural sound such as dialogue or announcements is included in the L and R channels.

SOLUTION: A controller 150 of localization of sound is equipped with: a broadcast receiving unit 210 which can receive television broadcasts; an audio coded signal extraction unit 216; a surround determination unit 218; a degree-of-correlation derivation unit 222 for deriving a degree of correlation between the L and R channels based on a flag indicating whether or not coding using a correlation between the L and R channels has been performed on the audio coded signal; a storage unit 224 for storing a table 268 of coefficient groups in which degrees of correlation and coefficient groups of head-related transfer functions are associated with each other; a table extraction unit 226 for extracting a coefficient group of a head-related transfer function from the table of coefficient groups according to the degree of correlation derived by the degree-of-correlation derivation unit; and a localization of sound processing unit 228 for executing localization of sound processing using the head-related transfer function which reflects the extracted coefficient group.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a sound image localization control device and a sound image localization control method that apply sound image localization processing to television broadcasts that support surround sound.

Surround systems have long been used as a technique for improving the sense of presence of sound, and have been used to reproduce spacious sound fields in movies and the like. For example, a 5.1 surround system reproduces sound with five channels of audio signals emitted from five speakers (left and right speakers, left and right rear speakers, and a center speaker) plus one channel of low-frequency signal.

In contrast to such surround systems, sound image localization techniques such as virtual surround have been developed. These reproduce a virtual surround space with, for example, two speakers placed in front of the listener, without placing any speakers behind the listener. Virtual surround used to require complicated calculation of head-related transfer functions and the like to localize sound images, but recent advances in computing have made that calculation easy. Because no speakers need to be installed behind the listener, virtual surround has also become widely known for suiting housing constraints and simplifying wiring.

In recent years television broadcasting has also become increasingly digital. Whereas monaural and stereo were the mainstream in conventional analog broadcasting, surround-compatible content is increasing in digital broadcasting. Surround sound can therefore be enjoyed easily not only in movies but also in ordinary television broadcasts such as news programs and sports broadcasts, and surround-compatible television broadcasts are expected to become even more common.

Such television broadcasts carry a wide variety of content. If the same sound image localization processing is applied uniformly to all of it, however, the desired sound image localization effect may not be obtained. A technique has therefore been disclosed that controls the sound image localization effect according to the content, based on content information (for example, Patent Document 1).
Patent Document 1: Publication No. 2006-148664

Among the various kinds of content, a movie, for example, lets the listener perceive multiple sound images through the virtual surround system described above and enjoy sound with a sense of presence. By contrast, speech such as dialogue and announcements uttered by commentators in news programs, sports broadcasts, and the like (hereinafter referred to as "single voice") calls for clarity rather than a sense of presence. When such content supports surround, the commentator's single voice is therefore often assigned only to the center channel among the multiple channels.

The technique of Patent Document 1 likewise assumes that, for a news program, the voice is assigned only to the center channel. On that premise it reduces the number of taps in the center channel filter processing to suppress reverberation and raise the clarity of the voice.

However, digital broadcast news programs and sports broadcasts differ from DVDs and the like, where the role of each channel is defined according to the producer's intent. As a holdover from conventional stereo recording practice, the L/R channels in such broadcasts often also contain surround-processed single voice, in addition to the center channel. Consequently, when a virtual surround system applies spacious sound image localization processing to the L/R channels, the single voice contained in the L/R channels remains and clarity deteriorates, even if the technique of Patent Document 1 succeeds in suppressing the reverberation of the center channel.

In view of these problems, an object of the present invention is to provide a sound image localization control device that can improve audibility through clear sound regardless of the type of content, even when single voice such as dialogue or announcements is included in the L/R channels.

To solve the above problems, a representative configuration of the sound image localization control device of the present invention comprises: a broadcast receiving unit capable of receiving a television broadcast; an audio encoded signal extraction unit that extracts the audio data of the television broadcast received by the broadcast receiving unit; a surround determination unit that determines whether the audio encoded signal supports surround; a correlation degree deriving unit that, when the signal is determined to support surround, extracts from the auxiliary information of the audio encoded signal a flag indicating whether correlation encoding (encoding that uses the correlation between the L/R channels) has been applied to the audio encoded signal, and derives the degree of correlation between the L/R channels based on the extracted flag; a storage unit that stores a coefficient group table associating degrees of correlation with coefficient groups of a head-related transfer function; a table extraction unit that extracts a coefficient group of the head-related transfer function from the coefficient group table according to the degree of correlation derived by the correlation degree deriving unit; and a sound image localization processing unit that executes sound image localization processing using the head-related transfer function reflecting the extracted coefficient group.

In movies and the like, sound images can be assumed in any number and at any position, and such multiple sound images improve the listener's sense of presence. By contrast, single voice in news programs, sports broadcasts, and the like need only assume, and indeed should assume, a single sound image at one central position. The present invention derives the degree of correlation between the L/R channels, regards sound with a high degree of correlation as single voice emitted from a single central sound image, and outputs that sound using a head-related transfer function based on a coefficient group that makes the voice clear. With this configuration, audibility can be improved through clear sound regardless of the type of content, even when single voice such as dialogue or announcements is included in the L/R channels. Furthermore, when correlation encoding is employed in the encoding process, the present invention uses the flag indicating whether correlation encoding has been applied. The degree of correlation can therefore be derived easily, without separate complicated calculation such as computing the correlation coefficient of the two signals, and the processing load can be reduced.

When correlation encoding is applied separately to each of a plurality of frequency bands of the audio encoded signal, the correlation degree deriving unit may derive the degree of correlation between the L/R channels based on the ratio of the frequency bands to which correlation encoding has been applied to all frequency bands.

In this case, the device does not simply make a yes/no decision on whether the sound is single voice. It can change the coefficient group of the head-related transfer function according to the likelihood that the sound is single voice, that is, according to the ratio of the frequency bands to which correlation encoding has been applied to all frequency bands. This configuration further improves audibility through clear sound.

The audio encoded signal may be encoded by the AAC (Advanced Audio Coding) method, and the correlation encoding method may be the MS (Mid Side) stereo method.

AAC is widely used as an audio encoding method. In AAC, the MS stereo method is used as the encoding method that exploits the correlation between channels, and whether this encoding has been applied can easily be determined from a flag. This configuration reduces the processing load of the sound image localization control device.
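As a rough illustration of why MS stereo signals high L/R correlation, the sketch below applies the usual mid/side transform, M = (L + R) / 2 and S = (L - R) / 2, to a voice mixed identically into both channels. The function names and sample values are made up for this example, and a real encoder's per-band decision to enable MS coding depends on the encoder.

```python
def ms_encode(left, right):
    """Convert L/R samples to mid/side: M = (L + R) / 2, S = (L - R) / 2."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Invert the transform: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Hypothetical samples of an announcer mixed equally into L and R.
voice = [0.5, -0.25, 0.75, 0.0]
mid, side = ms_encode(voice, voice)
# The side signal carries no energy, so coding M/S is cheaper than coding L/R.
side_energy = sum(s * s for s in side)
```

When the side signal is close to zero like this, an encoder has good reason to choose MS coding, which is why the presence of the MS flag can serve as a cheap proxy for high correlation.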

The correlation degree deriving unit may derive the degree of correlation by weighting a predetermined frequency band more heavily than the other frequency bands. The predetermined frequency band may be 200 to 4000 Hz.

An object of the present invention is to make the sound clear when the target sound is single voice. Weighting 200 to 4000 Hz, the frequency band of the human voice, more heavily than the other frequency bands therefore improves the accuracy with which the degree of correlation is derived.
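One possible reading of this weighting is a weighted ratio of flagged bands, sketched below. The 200 to 4000 Hz range comes from the text; the weight of 2.0, the band edges, and the function name are assumptions for illustration only.

```python
VOICE_BAND_HZ = (200.0, 4000.0)  # voice band stated in the text
VOICE_WEIGHT = 2.0               # assumed value; the source gives no number


def weighted_ms_ratio(band_edges_hz, ms_used, voice_weight=VOICE_WEIGHT):
    """Return the weighted share of bands flagged as MS-coded.

    band_edges_hz: list of (low_hz, high_hz) tuples, one per band.
    ms_used: list of 0/1 flags, one per band.
    Bands overlapping the voice band count `voice_weight` times.
    """
    total = 0.0
    flagged = 0.0
    for (low, high), used in zip(band_edges_hz, ms_used):
        overlaps_voice = low < VOICE_BAND_HZ[1] and high > VOICE_BAND_HZ[0]
        weight = voice_weight if overlaps_voice else 1.0
        total += weight
        flagged += weight * used
    return flagged / total if total else 0.0


# Hypothetical band layout: only the two voice-range bands are MS-coded.
bands = [(0, 200), (200, 1000), (1000, 4000), (4000, 8000)]
flags = [0, 1, 1, 0]
ratio_weighted = weighted_ms_ratio(bands, flags)
ratio_plain = weighted_ms_ratio(bands, flags, voice_weight=1.0)
```

With these made-up inputs the weighted ratio is higher than the plain ratio, because the flagged bands lie in the voice range.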

Here, the head-related transfer function can adjust the direct sound and the reflected sound of the audio signal independently of each other by means of the coefficient group.

When the impulse response through the head-related transfer function is viewed in the time domain, the response waveform can be divided into direct sound and reflected sound. The present invention adjusts the coefficients for the direct sound and the reflected sound in the head-related transfer function and lowers the ratio of reflected sound to direct sound, thereby suppressing reverberation caused by reflection and producing clear sound.
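A minimal sketch of this adjustment follows, assuming the impulse response is a list of filter taps and that the first `direct_len` taps are treated as direct sound. The split point, the gain value, and the function names are hypothetical; the source does not say how the boundary is chosen.

```python
def attenuate_reflections(impulse_response, direct_len, reflection_gain):
    """Scale every tap after the first `direct_len` taps by `reflection_gain`.

    Taps before `direct_len` are treated as direct sound and left unchanged.
    """
    if not 0 <= direct_len <= len(impulse_response):
        raise ValueError("direct_len must lie within the impulse response")
    direct = list(impulse_response[:direct_len])
    reflected = [reflection_gain * tap for tap in impulse_response[direct_len:]]
    return direct + reflected


def convolve(signal, taps):
    """Plain FIR convolution of a signal with the filter taps."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(taps):
            out[i + j] += x * h
    return out


# Hypothetical response: 2 direct taps followed by 3 reflection taps.
hrtf_taps = [1.0, 0.5, 0.0, 0.4, 0.2]
clear_taps = attenuate_reflections(hrtf_taps, direct_len=2, reflection_gain=0.5)
impulse_out = convolve([1.0], clear_taps)
```

Halving only the later taps lowers the reflected-to-direct ratio while leaving the direct sound, and therefore the localization cue it carries, untouched.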

The storage unit may include coefficient group tables for a plurality of head-related transfer functions corresponding to different speaker arrangements, and the table extraction unit may extract the coefficient group of the head-related transfer function selected from among them according to the speaker arrangement.

The head-related transfer function changes with the positional relationship between the listener's head (both ears) and the speakers. Selecting the head-related transfer function best suited to the speaker positions therefore lets the user obtain clear sound more reliably.
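The table extraction described above can be pictured as a lookup keyed by speaker arrangement and degree of correlation. The sketch below is only an illustration: the arrangement names, the gain values, and the dictionary layout are placeholders, not values from the source.

```python
# Placeholder reflection gains: higher correlation -> weaker reflections.
REFLECTION_GAIN_BY_DEGREE = {1: 1.0, 2: 0.8, 3: 0.6, 4: 0.4, 5: 0.2}
SPEAKER_ARRANGEMENTS = ("narrow", "wide")  # hypothetical arrangement names

# One coefficient group per (arrangement, correlation degree) pair.
COEFFICIENT_TABLE = {
    (arrangement, degree): {
        "arrangement": arrangement,
        "direct_gain": 1.0,
        "reflection_gain": gain,
    }
    for arrangement in SPEAKER_ARRANGEMENTS
    for degree, gain in REFLECTION_GAIN_BY_DEGREE.items()
}


def extract_coefficients(arrangement, correlation_degree):
    """Look up the coefficient group for a speaker arrangement and degree."""
    try:
        return COEFFICIENT_TABLE[(arrangement, correlation_degree)]
    except KeyError:
        raise ValueError(
            f"no coefficient group for {arrangement!r}, degree {correlation_degree}"
        ) from None


coeffs_voice = extract_coefficients("wide", 5)   # likely single voice
coeffs_movie = extract_coefficients("wide", 1)   # likely multi-source content
```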

To solve the above problems, a representative configuration of the sound image localization control method of the present invention comprises: receiving a television broadcast; extracting an audio encoded signal, which is the audio data of the received television broadcast; determining whether the audio encoded signal supports surround; when the signal is determined to support surround, extracting from the auxiliary information of the audio encoded signal a flag indicating whether correlation encoding (encoding that uses the correlation between the L/R channels) has been applied to the audio encoded signal; deriving the degree of correlation between the L/R channels based on the extracted flag; extracting, according to the derived degree of correlation, a coefficient group of a head-related transfer function from a coefficient group table associating degrees of correlation with coefficient groups of the head-related transfer function; and executing sound image localization processing using the head-related transfer function reflecting the extracted coefficient group.

The components based on the technical idea of the sound image localization control device described above, and their descriptions, also apply to this sound image localization control method.

According to the present invention, audibility can be improved through clear sound regardless of the type of content, even when single voice such as dialogue or announcements is included in the L/R channels.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. The dimensions, materials, and other specific numerical values shown in the embodiments are merely examples to facilitate understanding of the invention and do not limit the present invention unless otherwise stated. In this specification and the drawings, elements having substantially the same function and configuration are given the same reference numerals and are not described again, and elements not directly related to the present invention are omitted from the drawings.

In recent years, coinciding with the start of digital broadcasting, television broadcasts using surround sound have come on the air. As a result, the surround format has been extended uniformly to content other than movies, such as news programs and sports broadcasts, and single voice such as dialogue and announcements may be assigned to the L/R channels as well as the center channel. However, unlike a movie, where one of the aims is to let the listener perceive multiple sound images and enjoy sound with a sense of presence, the single voice uttered by commentators in these programs calls for clarity rather than a sense of presence.

The present embodiment determines whether the audio signal contains single voice and applies a head-related transfer function that depends on the degree to which single voice is contained. Audibility can thereby be improved through clear sound regardless of the type of content, even when single voice such as dialogue or announcements is included in the L/R channels. The head-related transfer function here is the impulse response obtained by measuring, at the entrance of the listener's ear canal, an impulse signal emitted from an arbitrarily placed speaker. To facilitate understanding, a virtual surround system is described first, followed by a detailed description of the sound image localization control device that constitutes it.

(Surround System 100, Virtual Surround System 110)
FIG. 1 is a schematic diagram showing the configuration of a surround system 100 based on the 5.1 surround system. The surround system 100 includes a surround control device 148 that receives a television broadcast and extracts video and audio, a monitor 152 that displays the extracted video, and speakers 154 that output the extracted audio.

In the 5.1 surround system, a plurality of speakers 154 are arranged to surround the listener 160. For example, a center speaker 154a is placed at the front center of the listener 160, an L speaker 154b and an R speaker 154c to its left and right, and an SL speaker 154d and an SR speaker 154e behind the listener to the left and right. A subwoofer (LFE: Low Frequency Effect) 154f that outputs low-frequency sound is also placed at an arbitrary position. The surround system 100 thereby improves the sense of presence of the sound and reproduces a spacious sound field.

FIG. 2 is a schematic diagram showing the configuration of a virtual surround system 110. The virtual surround system 110 includes a sound image localization control device 150, a monitor 152, and two speakers 154 that output sound.

The virtual surround system 110 reproduces virtual surround sound with two speakers (L speaker 154b and R speaker 154c) placed in front of the listener 160, without placing any speakers behind the listener 160. It employs a sound image localization technique that uses head-related transfer functions to localize sound images.

In the virtual surround system 110, the sound image localization control device 150 can be formed integrally with or separately from the monitor 152 and the speakers 154. The specific configuration and operation of the sound image localization control device 150 are described below.

(Sound Image Localization Control Device 150)
FIG. 3 is a functional block diagram showing the hardware configuration of the sound image localization control device 150. The sound image localization control device 150 includes a broadcast receiving unit 210, an antenna 212, a video processing unit 214, an audio encoded signal extraction unit 216, a surround determination unit 218, a decoding unit 220, a correlation degree deriving unit 222, a storage unit 224, a table extraction unit 226, a sound image localization processing unit 228, and an amplification unit 230.

The broadcast receiving unit 210 receives television broadcast radio waves through the antenna 212 and separates the received signal into a compressed video signal and a compressed audio signal.

The video processing unit 214 converts the compressed video signal received by the broadcast receiving unit 210 into a video signal and outputs it to the monitor 152.

The audio encoded signal extraction unit 216 extracts the audio encoded signal, which is the audio data, from the television signal received by the broadcast receiving unit 210. The audio encoding method is assumed here to be MPEG-2 AAC (ISO/IEC 13818-7), an international standard adopted by the digital broadcasting system in Japan.

The surround determination unit 218 determines whether the audio encoded signal from the audio encoded signal extraction unit 216 supports surround (a multi-channel signal of three or more channels).

The decoding unit 220 decodes the audio encoded signal and generates a linear PCM signal. To facilitate understanding, the linear PCM signal is hereinafter referred to simply as the audio signal.

When the surround determination unit 218 determines that the audio supports surround broadcasting, the correlation degree deriving unit 222 extracts, from the auxiliary information of the audio encoded signal, a flag indicating whether correlation encoding (encoding that uses the correlation between the L/R channels) has been applied to the audio encoded signal, and derives the degree of correlation between the L/R channels based on the extracted flag. In the present embodiment the degree of correlation is an index of how strong the correlation is. It is expressed, for example, as an integer from 1 to 5, where a lower value indicates lower correlation and a higher value indicates higher correlation.

In movies and the like, sound images can normally be assumed in any number and at any position, and such multiple sound images improve the listener's sense of presence. By contrast, single voice in news programs, sports broadcasts, and the like need only assume, and indeed should assume, a single sound image at one central position.

Here, the correlation degree deriving unit 222 derives the degree of correlation between the L/R channels. Sound with a high degree of correlation is regarded as single voice emitted from a single central sound image and is then output using a head-related transfer function based on a coefficient group that makes the voice clear. Conversely, sound with a low degree of correlation is regarded as not being single voice and is output using the normal head-related transfer function, which gives a sense of presence. With this configuration, audibility can be improved through clear sound regardless of the type of content, even when single voice such as dialogue or announcements is included in the L/R channels.

When correlation encoding is applied separately to each of a plurality of frequency bands of the audio encoded signal, the correlation degree deriving unit 222 may derive the degree of correlation between the L/R channels based on the ratio of the frequency bands to which correlation encoding has been applied to all frequency bands.

With this processing, the device does not simply make a yes/no decision on whether the sound is single voice. It can change the coefficient group of the head-related transfer function according to the likelihood that the sound is single voice, that is, according to the ratio of the frequency bands to which correlation encoding has been applied to all frequency bands. This configuration further improves audibility through clear sound.

Although the correlation degree deriving unit 222 here derives the degree of correlation between the L/R channels from the ratio of correlation-encoded frequency bands to all frequency bands, it is not limited to this. It may instead derive the degree of correlation from the ratio of correlation-encoded frames to a predetermined number of frames along the time axis. To keep the degree of correlation from switching too frequently, the correlation degree deriving unit 222 can also be given a low-pass filter or hysteresis characteristics.
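The frame-based ratio and the protection against frequent switching might look like the sketch below. It uses a moving average over recent frames as a simple low-pass filter and a hold-off rule (the output changes only after several consecutive frames request the same new degree) as a stand-in for hysteresis. The class name, window length, and hold count are assumptions; the source does not specify how the filter or hysteresis is realized.

```python
from collections import deque


class CorrelationSmoother:
    """Smooths per-frame correlation information (illustrative only)."""

    def __init__(self, window=8, hold=3):
        self.flags = deque(maxlen=window)  # recent per-frame MS flags
        self.hold = hold                   # frames needed before switching
        self.current = None                # degree currently output
        self.candidate = None              # degree waiting to take over
        self.count = 0                     # consecutive frames for candidate

    def frame_ratio(self, frame_has_ms):
        """Moving-average share of recent frames that used MS coding."""
        self.flags.append(1 if frame_has_ms else 0)
        return sum(self.flags) / len(self.flags)

    def update(self, raw_degree):
        """Return the held degree; switch only after `hold` equal requests."""
        if self.current is None:
            self.current = raw_degree
        elif raw_degree == self.current:
            self.candidate, self.count = None, 0
        else:
            if raw_degree == self.candidate:
                self.count += 1
            else:
                self.candidate, self.count = raw_degree, 1
            if self.count >= self.hold:
                self.current = raw_degree
                self.candidate, self.count = None, 0
        return self.current


smoother = CorrelationSmoother(window=4, hold=3)
held = [smoother.update(d) for d in [5, 5, 1, 5, 1, 1, 1, 1]]
ratios = [smoother.frame_ratio(f) for f in [True, True, False, False, False]]
```

In this run the single stray frame with degree 1 is ignored, and the output switches to 1 only after three consecutive frames request it.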

The correlation encoding method described above is assumed to be the MS stereo method. In AAC, the MS stereo method is used as the encoding method that exploits the correlation between channels, and whether this encoding has been applied can easily be determined from a flag. The degree of correlation can therefore be derived easily, without separate complicated calculation such as computing the correlation coefficient of the two signals. This configuration reduces the processing load of the sound image localization control device 150.

Two bitstream formats are specified for the encoded signal that carries the auxiliary information described above: AAC-ADIF (Audio Data Interchange Format) and AAC-ADTS (Audio Data Transport Stream frame). Both formats contain a flag indicating whether encoding that uses the correlation between channels has been applied. The following description uses the flag in AAC-ADTS as an example.

FIG. 4 is an explanatory diagram showing the frame structure of AAC-ADTS. One AAC-ADTS frame 240 shown in (A) consists of three parts, as shown in (B): an ADTS header (adts_header) 242, an error detection word (crc: cyclic redundancy check) 244, and a raw data block (raw_data_block) 246. The ADTS header 242 holds various frame information such as the synchronization signal and the sampling frequency. The error detection word 244 is used to check the ADTS frame for errors. The raw data block 246 holds the compressed audio data and an identifier indicating the type of the compressed audio data.
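For orientation, the sketch below reads a few fields from the 7-byte ADTS header. The bit offsets follow the ADTS header layout as commonly documented for ISO/IEC 13818-7 and should be checked against the standard before use. The function name and the returned dictionary keys are choices made for this example, and the sample bytes are hand-built rather than taken from a real broadcast.

```python
# Sampling frequencies selected by sampling_frequency_index 0..11.
SAMPLING_FREQUENCIES = (96000, 88200, 64000, 48000, 44100, 32000,
                        24000, 22050, 16000, 12000, 11025, 8000)


def parse_adts_header(data: bytes) -> dict:
    """Extract selected fields from the first 7 bytes of an ADTS frame."""
    if len(data) < 7:
        raise ValueError("an ADTS header needs at least 7 bytes")
    bits = int.from_bytes(data[:7], "big")  # 56 header bits, MSB first

    def field(offset: int, width: int) -> int:
        # `offset` counts bits from the start (MSB) of the header.
        return (bits >> (56 - offset - width)) & ((1 << width) - 1)

    if field(0, 12) != 0xFFF:
        raise ValueError("syncword 0xFFF not found")
    protection_absent = field(15, 1)
    sf_index = field(18, 4)
    return {
        "has_crc": protection_absent == 0,  # crc word follows the header
        "sampling_frequency": (SAMPLING_FREQUENCIES[sf_index]
                               if sf_index < len(SAMPLING_FREQUENCIES) else None),
        "channel_configuration": field(23, 3),
        "frame_length": field(30, 13),       # whole frame, in bytes
        "raw_data_blocks": field(54, 2) + 1,  # stored as count minus one
    }


# Hand-built header: CRC present, 48 kHz, 2 channels, 265-byte frame.
header = parse_adts_header(bytes.fromhex("fff84c80213ffc"))
```

The `has_crc` entry reflects that the error detection word is present only when the header's protection_absent bit is 0.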

The compression method for the compressed audio data differs according to the type of data. As shown in (C), when the identifier indicating the type of compressed audio data in the raw data block is CPE (Channel Pair Element) 248, the raw data block has the internal structure shown in (D) and (E). The CPE 248 is used when two channels that may be highly correlated, such as the L/R channels, are encoded together.

The raw data block 246 consists of the following: an element instance tag (element_instance_tag) 250, which carries a unique number for distinguishing this element when channels other than these two use the same identifier (here, CPE 248); a common window (common_window) 252, a flag indicating whether the two channels share the block length used for the frequency transform and the window type used for windowing; ICS info (ics_info) 254, an area that stores values such as the number of subbands (number of band divisions); MS mask present (ms_mask_present) 256, a 2-bit flag indicating whether MS stereo encoding has been applied; MS used (ms_used) 258, a flag set for each frequency division band that likewise indicates whether MS stereo encoding has been applied; and an individual channel stream (individual_channel_stream) 260, which contains the quantized data.

(D) shows the case where the common window 252 is 1 and the MS mask present 256 is 10 (binary), which indicates that MS stereo encoding is enabled for all bands. Because MS stereo encoding is used across the entire band, the correlation between the L/R channels can be presumed to be high, so in this embodiment the correlation degree is set to the maximum value (for example, 5).

(E) shows the case where the common window 252 is 1 and the MS mask present 256 is 01 (binary), which indicates that MS stereo encoding is partially enabled. In this case, MS used 258 is set for each frequency division band, and a value of 1 indicates that MS stereo encoding is enabled in that band. In this embodiment, the correlation degree is derived from the proportion of frequency division bands, out of all frequency bands, in which MS used 258 is 1. MS used 258 is written as MS used [g][sfb], where g is the window number and sfb is a number identifying the frequency division band.

FIG. 5 is an explanatory diagram showing an example of the flag correlation degree table 262 stored in the storage unit 224. It shows the relationship between the ratio 264 of MS used 258 and the correlation degree 266 when the MS mask present 256 is 01 (binary), that is, when MS stereo encoding is partially enabled.

When the ratio 264 of frequency division bands in which MS used 258 is 1 is 100 to 80%, the correlation degree 266 is 5, the maximum value. Likewise, the correlation degree 266 is 4 for a ratio 264 of 80 to 60%, 3 for 60 to 40%, 2 for 40 to 20%, and 1 for 20 to 0%.

When the MS mask present 256 is neither 01 nor 10, it can be presumed that there is no correlation between the channels, so the correlation degree 266 is set to 1, which indicates no correlation.
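The mapping from the flags to the correlation degree described above can be sketched as follows. This is an illustrative reading of the description, not the embodiment's implementation: the function names are hypothetical, and because the ranges in the description share their end points (for example, 80% appears in two ranges), the sketch assumes that a boundary value takes the higher correlation degree.

```python
def ms_used_ratio(ms_used):
    """Percentage of frequency division bands whose ms_used flag is 1.

    `ms_used` is a list of windows, each a list of 0/1 flags (ms_used[g][sfb]).
    """
    flags = [flag for window in ms_used for flag in window]
    if not flags:
        return 0.0
    return 100.0 * sum(flags) / len(flags)


def degree_from_ratio(percent):
    """Table lookup per the 20%-wide ranges in the description.

    Assumption: a value on a shared boundary takes the higher degree.
    """
    if percent >= 80.0:
        return 5
    if percent >= 60.0:
        return 4
    if percent >= 40.0:
        return 3
    if percent >= 20.0:
        return 2
    return 1


def correlation_degree(common_window, ms_mask_present, ms_used=None):
    """Derive the correlation degree (1..5) from the channel pair element flags."""
    if common_window == 1 and ms_mask_present == 0b10:
        return 5  # MS stereo enabled for all bands
    if common_window == 1 and ms_mask_present == 0b01:
        return degree_from_ratio(ms_used_ratio(ms_used or []))
    return 1      # no MS stereo: treated as uncorrelated
```

For instance, a frame with flags set in 7 of 10 bands falls in the 80 to 60% range and yields a correlation degree of 4.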

Furthermore, when deriving the correlation degree 266 described above, the correlation degree deriving unit 222 either restricts the derivation to a predetermined frequency band or weights that band more heavily than frequencies outside it. Here, the predetermined frequency band is 200 to 4000 Hz.

FIG. 6 is an explanatory diagram schematically showing an example of the weighting applied to frequency bands when deriving the correlation degree. The purpose of this embodiment is to make the sound clearer when the target sound is monaural. To that end, the band of 200 to 4000 Hz, which is the frequency band of the human voice, is weighted more heavily than other frequency bands. Most of the sound pressure level of the human voice falls within 200 to 4000 Hz. Accordingly, the correlation degree 266 is derived by multiplying by a frequency-dependent weighting characteristic such as that shown in FIG. 6, so that the calculation is restricted to approximately 200 to 4000 Hz.

For example, when the MS mask present 256 is 01, the FIG. 6 weights of the frequency division bands in which MS used 258 is 1 are summed. This value is then divided by the sum of the FIG. 6 weights over all frequency division bands. The quotient is expressed as a percentage, and the corresponding correlation degree 266 is derived by referring to FIG. 5.
This processing eliminates or reduces components other than speech, such as background sound, and can also improve the accuracy of a correlation degree derivation that is limited to speech.
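The weighted ratio described above can be illustrated with the following minimal sketch. The actual weighting curve of FIG. 6 is not reproduced in the text, so the sketch assumes a simple rectangular weight (1 inside 200 to 4000 Hz, a configurable value outside); the function names and the use of band center frequencies are likewise assumptions made for this example.

```python
def speech_weight(freq_hz, low=200.0, high=4000.0, outside=0.0):
    """Assumed stand-in for the FIG. 6 curve: full weight inside the voice band."""
    return 1.0 if low <= freq_hz <= high else outside


def weighted_ms_ratio(ms_used, band_centers_hz, weight=speech_weight):
    """Weighted percentage of bands that use MS stereo.

    `ms_used[i]` is the 0/1 flag of band i and `band_centers_hz[i]` is an
    assumed center frequency of that band.
    """
    numerator = sum(weight(f) * u for u, f in zip(ms_used, band_centers_hz))
    denominator = sum(weight(f) for f in band_centers_hz)
    if denominator <= 0.0:
        return 0.0
    return 100.0 * numerator / denominator
```

Setting outside to 0 restricts the calculation to the voice band, while a value between 0 and 1 merely de-emphasizes the other bands; both options are mentioned in the description.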

The storage unit 224 is composed of ROM, RAM, E²PROM, nonvolatile RAM, flash memory, an HDD, or the like, and stores programs processed by the control unit, audio data, and so on. The storage unit 224 also stores the flag correlation degree table 262, which associates the ratio 264 of the frequency bands in which MS used 258 is 1 to all frequency bands with the correlation degree 266, and a coefficient group table, described later, which associates the correlation degree 266 with coefficient groups of the head-related transfer function.

This embodiment does not merely decide, from the correlation degree 266, whether the sound is monaural. It raises clarity and lowers the sense of presence in proportion to the likelihood that the sound is monaural, that is, in proportion to the value of the correlation degree 266. The coefficient group of the head-related transfer function is therefore changed according to the value of the correlation degree 266. Ideally, each coefficient in the group would be a function of the correlation degree 266, but deriving such a function takes excessive effort, and a mapping with a few discrete levels is sufficient to achieve the object of the present application. Here, therefore, the correlation degree 266 is associated with coefficient groups of the head-related transfer function through a multi-level mapping in the coefficient group table.

FIG. 7 is an explanatory diagram showing an example of the coefficient group table 268 stored in the storage unit 224. Here, each correlation degree 266 in the left column is associated with a coefficient group 270 of the head-related transfer function. The head-related transfer functions (1), (2), and (3) shown in the coefficient group table 268 correspond to the direct sound gain, the early reflection gain, and the reverberation gain, which are described later with reference to FIG. 10. The coefficient group 270 of the head-related transfer function is described in detail later.

The table extraction unit 226 extracts the coefficient group 270 of the head-related transfer function from the coefficient group table 268 according to the correlation degree 266 derived by the correlation degree deriving unit 222. In this way, the head-related transfer function is uniquely determined from the correlation degree 266.
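The table lookup performed by the table extraction unit can be pictured with the following sketch. The text does not give the contents of FIG. 7, so every gain value below is a made-up placeholder, chosen only so that the reflected components shrink as the correlation degree rises; the type and function names are also hypothetical.

```python
from typing import NamedTuple


class CoefficientGroup(NamedTuple):
    direct_gain: float            # head-related transfer function (1)
    early_reflection_gain: float  # head-related transfer function (2)
    reverb_gain: float            # head-related transfer function (3)


# Placeholder values only: the real table of FIG. 7 is not given in the text.
COEFFICIENT_TABLE = {
    1: CoefficientGroup(1.0, 1.0, 1.0),
    2: CoefficientGroup(1.0, 0.8, 0.8),
    3: CoefficientGroup(1.0, 0.6, 0.6),
    4: CoefficientGroup(1.0, 0.4, 0.4),
    5: CoefficientGroup(1.0, 0.2, 0.2),
}


def extract_coefficients(degree: int) -> CoefficientGroup:
    """Return the coefficient group associated with a correlation degree."""
    try:
        return COEFFICIENT_TABLE[degree]
    except KeyError:
        raise ValueError(f"correlation degree out of range: {degree}") from None
```

Because each correlation degree maps to exactly one entry, the lookup determines the coefficient group uniquely, which is the property the description relies on.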

When the speakers 154 are separate from the sound image localization control device 150 and the monitor 152 and can be repositioned, the listener 160 hears the sound with an arbitrary speaker arrangement, as in FIG. 2. The head-related transfer function described above changes with the positional relationship between the listener's head (both ears) and the speakers. A plurality of head-related transfer functions are therefore prepared for arbitrary arrangements of the speakers, and the storage unit 224 stores these head-related transfer functions, each corresponding to a speaker arrangement, together with the corresponding coefficient group tables 268.

From the plurality of head-related transfer functions stored in the storage unit 224, the table extraction unit 226 extracts the coefficient group 270 of the head-related transfer function selected by an input from the listener 160. By selecting the head-related transfer function best suited to the position of the speakers 154, the listener 160 can obtain clear sound more reliably.

The sound image localization processing unit 228 performs sound image localization processing using a head-related transfer function that reflects the coefficient group 270 extracted by the table extraction unit 226. Here, the channels of the 5.1 surround format are converted into a virtual-sound-format audio signal so that they can be output to a plurality of speakers, for example two speakers.

FIG. 8 is a control block diagram showing a configuration example of the sound image localization processing unit 228. In FIG. 8, an input audio signal representing a 5.1ch surround source is passed to the head-related transfer function blocks 272, convolved there, summed by the adder 274, and finally output as a 2ch virtual surround audio signal. The head-related transfer functions (1), (2), and (3) correspond to the direct sound gain, the early reflection gain, and the reverberation gain, which are described later with reference to FIG. 10. In this embodiment a separate head-related transfer function block 272 is provided for each of the direct sound, the early reflections, and the reverberation, but the processing can also be performed in a single block that combines them.

FIG. 9 is a control block diagram showing the internal configuration of the head-related transfer function block 272 of FIG. 8. In FIG. 9, the input audio signal representing a 5ch surround source is first delayed by the delay 280 and then filtered by the FIR filters 282 for the L channel and the R channel respectively. The results are summed by the adder 274 and output as a 2ch virtual surround audio signal.

The delay 280 in the head-related transfer function block 272 corresponds to the early reflection delay and the reverberation delay described later with reference to FIG. 10. Because the direct sound has no delay, the delay 280 can be omitted from the head-related transfer function block 272 for the direct sound. This embodiment uses FIR (Finite Impulse Response) filters for the head-related transfer function, but IIR (Infinite Impulse Response) filters can also be used.
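The delay, FIR filtering, and summation performed in one head-related transfer function block can be sketched as below. This is a simplified illustration under stated assumptions: signals are plain lists of samples, one delay value is shared by all channels, and the filter taps are supplied by the caller; the function names are hypothetical and no real head-related transfer function data is included.

```python
def delay(signal, samples):
    """Delay a signal by prepending `samples` zeros."""
    return [0.0] * samples + list(signal)


def fir(signal, taps):
    """Direct-form FIR filter: full convolution of `signal` with `taps`."""
    if not signal or not taps:
        return []
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(taps):
            out[i + j] += x * h
    return out


def mix(signals):
    """Adder: sample-wise sum of signals that may differ in length."""
    length = max((len(s) for s in signals), default=0)
    return [sum(s[i] for s in signals if i < len(s)) for i in range(length)]


def hrtf_block(channels, delay_samples, taps_left, taps_right):
    """Delay each input channel, filter it for each ear, and sum per ear."""
    left_parts, right_parts = [], []
    for channel, tl, tr in zip(channels, taps_left, taps_right):
        delayed = delay(channel, delay_samples)
        left_parts.append(fir(delayed, tl))
        right_parts.append(fir(delayed, tr))
    return mix(left_parts), mix(right_parts)
```

For the direct sound, delay_samples would be 0, which matches the remark that the delay can be omitted for that block.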

To reproduce a 5.1 surround audio signal with 2ch virtual surround, the sound emitted from the two speakers 154 placed in front of the listener 160, as shown in FIG. 2, must be generated. This requires 5ch × 2 head-related transfer functions (target head-related transfer functions) and 4 (two ears × 2) head-related transfer functions corresponding to the speaker arrangement (reproduction head-related transfer functions). The reproduction head-related transfer functions serve as cancel signals that keep the listener 160 from perceiving the positions of the speakers (sound sources).

Through the coefficient group 270, the head-related transfer function can adjust the direct sound and the reflected sound of the audio signal independently. The sound image localization processing unit 228 can therefore control the direct sound and the reflected sound independently by using the coefficient group 270 extracted by the table extraction unit 226.

FIG. 10 is an audio signal time waveform diagram showing the impulse response of a head-related transfer function. The impulse response was measured in a concert hall. The peak in the first part is the direct sound. It is followed by early reflections (reflected sound) that bounce off the walls, ceiling, floor, and so on, and then by secondary reflections called reverberation. The sense of presence is perceived mainly through the effects of the early reflections and the reverberation.

Viewed in the time domain, the impulse response of the head-related transfer function can thus be separated into direct sound and reflected sound. In this embodiment, the coefficients for the direct sound and the reflected sound in the head-related transfer function are adjusted so that the ratio of reflected sound to direct sound is lowered, which suppresses reverberation caused by reflections. Specifically, the direct sound and the early reflections (reflected sound) are distinguished, and the following are adjusted independently: the direct sound gain; the gain of the early reflections and their delay relative to the direct sound; and the gain of the reverberation that follows the early reflections and its delay relative to the early reflections.

The gain adjustment of the early reflections and reverberation described above is affected by the measurement conditions of the head-related transfer function, the size of the space to be reproduced, and so on. In general, however, relative to the ratio of direct sound to reflected sound at the time of measurement, lowering the proportion of reflected sound makes the sound clearer, and raising it increases the sense of presence. As a specific example of the processing, the direct sound component is left unchanged, and the early reflections and reverberation, which are the reflected components, are multiplied by a gain coefficient equal to 100% minus the ratio 264 of frequency division bands in which MS used 258 is 1 to all frequency bands, the same ratio used to derive the correlation degree 266. In this way, for monaural sound in which the ratio 264 is close to 100%, the reflected components are suppressed and an audio signal with the clarity of predominantly direct sound is obtained.
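The gain rule in this example can be written out as a short sketch. It assumes the impulse response has already been split into direct, early reflection, and reverberation segments; how that split is made is not specified here, and the function names are hypothetical.

```python
def reflection_gain(ratio_percent):
    """Gain for reflected components: 100% minus the MS-used ratio, as a fraction."""
    clamped = min(max(ratio_percent, 0.0), 100.0)
    return 1.0 - clamped / 100.0


def adjust_impulse_response(direct, early, reverb, ratio_percent):
    """Keep the direct sound unchanged and scale the reflected components."""
    gain = reflection_gain(ratio_percent)
    return (list(direct)
            + [gain * x for x in early]
            + [gain * x for x in reverb])
```

With a ratio of 75%, for example, the reflected components are scaled to one quarter of their measured level, and at 100% they vanish so that only the direct sound remains.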

Here the early reflections and the reverberation are processed separately, but they can also be processed as one continuous waveform, and the reverberation portion can be omitted to reduce the processing load. Sound image localization processing can be performed either by convolving the time-domain signal directly or by convolving in the frequency domain. Convolution in the frequency domain generally requires less processing than convolution in the time domain, but either method achieves the aim of this embodiment, and the choice between them is arbitrary. Furthermore, the head-related transfer function used for the convolution may use the coefficient group 270 selected from the coefficient group table 268 according to the correlation degree 266, as described above, or it may use a coefficient group adjusted by multiplying the early reflections and reverberation, which are the reflected components, by a gain coefficient equal to 100% minus the ratio 264 of frequency division bands in which MS used 258 is 1 to all frequency bands. The former reduces the processing load of deriving the coefficient group 270 because it only refers to the coefficient group table 268. The latter saves memory because the coefficient group table 268 does not need to be stored.

The amplification unit 230 amplifies the virtual-sound-format audio signal converted by the sound image localization processing unit 228 to a signal level that can be output to the speakers 154.

The sound image localization control device 150 described above improves audibility through clear sound regardless of the type of content, even when monaural sound such as dialogue or announcements is included in the L/R channels.

(Sound image localization control method)
Next, a sound image localization control method that uses the sound image localization control device 150 described above to perform sound image localization processing on a surround-compatible television broadcast is described in detail.

FIG. 11 is a flowchart explaining the processing flow of the sound image localization control method. When the broadcast receiving unit 210 of the sound image localization control device 150 receives a television broadcast (YES in S300), the surround determination unit 218 determines whether the audio signal of the received television broadcast is compatible with surround (S302). If it is determined to be compatible (YES in S302), the correlation degree deriving unit 222 extracts the L/R channels of the audio signal and then extracts, from the auxiliary information, the flag indicating whether MS stereo encoding has been applied (S304). If MS stereo encoding has not been applied in any frequency band (YES in S306), the correlation degree 266 is set to 1 (S308). If MS stereo encoding has been applied (NO in S306) and it covers all frequency bands (YES in S310), the correlation degree 266 is set to 5 (S312).

If MS stereo encoding has been applied in some frequency bands and not in others (NO in S310), the correlation degree 266 is determined from the ratio 264 of frequency bands subjected to correlation encoding to all frequency bands and from the flag correlation degree table 262 stored in the storage unit 224 (S314).

The table extraction unit 226 then extracts the coefficient group 270 of the head-related transfer function from the coefficient group table 268 stored in the storage unit 224, according to the correlation degree 266 derived by the correlation degree deriving unit 222 (S316), and the sound image localization processing unit 228 performs sound image localization processing using a head-related transfer function that reflects the extracted coefficient group 270 (S318). The audio signal that has undergone the sound image localization processing is delivered to the listener 160 through the speakers 154 (S320).

This sound image localization control method likewise improves audibility through clear sound regardless of the type of content, even when monaural sound such as dialogue or announcements is included in the L/R channels.

A preferred embodiment of the present invention has been described above with reference to the accompanying drawings, but the present invention is of course not limited to this embodiment. It is clear that a person skilled in the art could conceive of various changes and modifications within the scope of the claims, and these are naturally understood to belong to the technical scope of the present invention.

In the embodiment described above, only the correlation degree deriving unit 222 determines whether a program requires clarity, but the invention is not limited to this. For example, information on the type of content may be received as data, and when the received type is a program that requires clarity, such as a news program, the content of the sound image localization processing may be changed regardless of the correlation degree derived by the correlation degree deriving unit 222.

The steps of the sound image localization control method in this specification do not necessarily have to be processed in time series in the order shown in the flowchart, and may include parallel processing or processing by subroutines.

The present invention can be used for a sound image localization control device and a sound image localization control method that perform sound image localization processing on a surround-compatible television broadcast.

FIG. 1 is a schematic diagram showing the configuration of a surround system using the 5.1 surround format.
FIG. 2 is a schematic diagram showing the configuration of a virtual surround system.
FIG. 3 is a functional block diagram showing the hardware configuration of the sound image localization control device.
FIG. 4 is an explanatory diagram showing the frame structure of AAC-ADTS.
FIG. 5 is an explanatory diagram showing an example of the flag correlation degree table stored in the storage unit.
FIG. 6 is an explanatory diagram schematically showing an example of the weighting applied to frequency bands when deriving the correlation degree.
FIG. 7 is an explanatory diagram showing an example of the coefficient group table stored in the storage unit.
FIG. 8 is a control block diagram showing a configuration example of the sound image localization processing unit.
FIG. 9 is a control block diagram showing the internal configuration of the head-related transfer function block of FIG. 8.
FIG. 10 is an audio signal time waveform diagram showing the impulse response of a head-related transfer function.
FIG. 11 is a flowchart explaining the processing flow of the sound image localization control method.

Explanation of symbols

150 … Sound image localization control device
152 … Monitor
154 … Speaker
210 … Broadcast receiving unit
216 … Audio encoded signal extraction unit
218 … Surround determination unit
222 … Correlation degree deriving unit
224 … Storage unit
226 … Table extraction unit
228 … Sound image localization processing unit
266 … Correlation degree
268 … Coefficient group table
270 … Coefficient group

Claims

A broadcast receiver capable of receiving television broadcasts;
An audio encoded signal extracting unit for extracting an audio encoded signal, which is audio data of the television broadcast received by the broadcast receiving unit;
A surround determination unit that determines whether or not the audio encoded signal is compatible with surround;
A correlation degree deriving unit that, when it is determined that the audio encoded signal is compatible with surround, extracts from auxiliary information of the audio encoded signal a flag indicating whether correlation encoding, which is encoding using correlation between L/R channels, has been applied to the audio encoded signal, and derives a degree of correlation between the L/R channels based on the extracted flag;
A storage unit for storing a coefficient group table in which the degree of correlation and the coefficient group of the head related transfer function are associated;
A table extracting unit that extracts a coefficient group of a head related transfer function from the coefficient group table according to the degree of correlation derived by the correlation degree deriving unit;
A sound image localization processing unit that performs sound image localization processing by a head-related transfer function reflecting the extracted coefficient group;
A sound image localization control device comprising:

The sound image localization control apparatus according to claim 1, wherein, when the correlation encoding is performed for each of a plurality of frequency bands of the audio encoded signal, the correlation degree deriving unit derives the degree of correlation between the L/R channels based on the ratio of the frequency bands subjected to the correlation encoding to the entire frequency band.

The sound image localization control apparatus according to claim 1 or 2, wherein the audio encoded signal is encoded by an AAC method, and the correlation encoding method is an MS stereo method.

The sound image localization control apparatus according to any one of claims 1 to 3, wherein the correlation degree deriving unit derives the correlation degree by weighting a predetermined frequency band more heavily than other frequency bands.

The sound image localization control apparatus according to claim 4, wherein the predetermined frequency band is 200 to 4000 Hz.

6. The sound image localization control device according to claim 1, wherein the head-related transfer function can independently adjust a direct sound and a reflected sound of the audio signal by the coefficient group.

The storage unit includes a coefficient group table corresponding to a plurality of head related transfer functions corresponding to the arrangement of speakers,
The table extraction unit extracts a coefficient group of a head-related transfer function selected and determined according to a speaker arrangement from the plurality of head-related transfer functions. The sound image localization control apparatus according to the item.

Receiving a television broadcast,
Extracting an audio encoded signal that is audio data of the received television broadcast,
Determining whether the audio encoded signal is compatible with surround,
When it is determined that the audio encoded signal is compatible with surround, extracting from auxiliary information of the audio encoded signal a flag indicating whether correlation encoding, which is encoding using correlation between L/R channels, has been applied to the audio encoded signal, and deriving a degree of correlation between the L/R channels based on the extracted flag,
Extracting, according to the derived degree of correlation, a coefficient group of a head-related transfer function from a coefficient group table that associates the degree of correlation with coefficient groups of the head-related transfer function,
A sound image localization control method, wherein sound image localization processing is executed by a head-related transfer function reflecting the extracted coefficient group.