JP2005142640A

JP2005142640A - Terminal apparatus

Info

Publication number: JP2005142640A
Application number: JP2003374311A
Authority: JP
Inventors: Noriyuki Ashigahara; 範之芦ヶ原
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2003-11-04
Filing date: 2003-11-04
Publication date: 2005-06-02

Abstract

PROBLEM TO BE SOLVED: To enable easy detection of the position of a speaking person in a terminal apparatus used for video conferencing via a network.

SOLUTION: The apparatus comprises: a specific sound adding means for adding the sound signal of a specific sound to a sound signal input from the network; a plurality of speakers, spaced a predetermined distance apart, for outputting the sound signal to which the specific sound has been added; a receiving means for receiving a predetermined transmission signal that a sound collecting terminal corresponding to the speaking person transmits on detecting the specific sound in the sound output from the plurality of speakers; a position detecting means for detecting the positional relation between the video conference terminal and the sound collecting terminal on the basis of the arrival time difference between the plurality of predetermined transmission signals received by the receiving means, each relating to a detection result of the specific sound in the plurality of sound signals; and a control means for controlling the direction of a television camera on the basis of the detection result of the position detecting means.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a terminal device that performs a video conference via a network.

FIG. 6 is a block diagram of a video conference apparatus including the position detection device described in Patent Document 1. In a video conference terminal 501 connected to a wide area communication network 500, the data received by the receiving means 502 is separated by the AV separation means 503 into video data, audio data, and a control signal. The video data is decoded by the video decoding means 504, and the display control means 505 then displays the video on the monitor 506 under synchronization control by the control signal from the AV separation means 503. The audio data, meanwhile, is decoded by the audio decoding means 507, and the audio output control means 508 then outputs the audio from the speaker 509 under synchronization control by the control signal from the AV separation means 503.

Audio is input from a highly directional microphone 510, which collects sound using a plurality of microphones and a reflector, and is received by the audio signal receiving means 511. This audio data is encoded by the audio encoding means 514. At the same time, the voice recognition means 512 identifies the direction of the speaker by reference to the voice data of the conference participants registered in advance in the voice data storage means 513, and on the basis of that information the pan head control means 515 controls the video camera 516 so that it faces the speaker.

The video data captured by the video camera 516 is encoded by the video encoding means 517, multiplexed by the AV multiplexing means 518 with the audio data encoded by the audio encoding means 514, and transmitted to the wide area communication network 500 by the transmitting means 519.

Patent Document 2 also describes a method in which a specific lamp is attached to the speaker's microphone and the speaker's position is detected by image recognition.
Patent Document 1: JP-A-6-351015
Patent Document 2: JP-A-9-307870

However, the video conference apparatus including the position detection device of Patent Document 1 requires an expensive dedicated sensor for position detection. In the image recognition approach of Patent Document 2, the processing becomes more complex in proportion to the position detection accuracy, so a large-scale circuit, or a high-performance processor with a large amount of memory, is required.

In short, an apparatus that includes an existing position detection device needs elaborate position detection hardware, which makes the apparatus expensive.

An object of the present invention is to solve this problem and to make it possible to detect the position of a speaker easily.

The present invention is a conference terminal device used in a video conference system that performs a video conference via a network. It comprises: specific sound adding means for adding a sound signal of a specific sound to a sound signal input from the network; a plurality of speakers, spaced at a predetermined interval from each other, that output the sound signal to which the specific sound has been added; receiving means for receiving a predetermined transmission signal that a sound collecting terminal corresponding to a speaker transmits in response to detecting the specific sound in the sound output from the plurality of speakers; position detecting means for detecting the positional relationship between the video conference terminal and the sound collecting terminal based on the arrival time difference when the receiving means receives a plurality of the predetermined transmission signals relating to the detection results of the specific sound in the plurality of sound signals; and control means for controlling the direction of a television camera based on the detection result of the position detecting means.

According to the present invention, the position of a speaker can be detected easily.

An embodiment of the video conference apparatus according to the present invention is described below.

FIG. 1 is a block diagram of an embodiment of a video conference apparatus including a position detection device that satisfies claim 1 of the present invention. In a video conference terminal 101 connected to a wide area communication network 100, the data received by the receiving means 102 is separated by the AV separation means 103 into video data, audio data, and a control signal. The video data is decoded by the video decoding means 104, and the display control means 105 then displays the video on the monitor 106 under synchronization control by the control signal from the AV separation means 103. The audio data, meanwhile, is decoded by the audio decoding means 107, a predetermined specific sound is added to it by the specific sound adding means 108, and the audio output control means 109 then outputs the audio from the speakers 110 under synchronization control by the control signal from the AV separation means 103. In the wireless sound collecting terminal 111 held by a speaker, the audio input means 112 collects the ambient sound and the voice uttered by the speaker, and the audio signal transmitting means 113 transmits this audio data to the video conference terminal 101.

On the video conference terminal 101 side, this audio data is received by the audio signal receiving means 114. The specific speaker determination means 117 determines a specific speaker from a predetermined priority and the volume of the audio data. For that speaker's audio data, the echo removal means 115 removes the decoded audio data of the audio decoding means 107, using the specific sound transmission time from the specific sound adding means 108 and the specific sound reception time, supplied by the audio signal receiving means 114, in the audio data from the specific speaker's sound collecting terminal. The result is then encoded by the audio encoding means 116.
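The text states only that the specific speaker is determined from a predetermined priority and the audio volume; how the two are combined is not given. The following is a minimal sketch under one assumed rule: ignore terminals below a level threshold, then prefer the higher priority, breaking ties by loudness. The names `TerminalAudio`, `select_speaker`, and `min_level` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass
class TerminalAudio:
    terminal_id: str
    priority: int             # assumption: a larger value means a higher priority
    samples: Sequence[float]  # one frame of PCM samples in [-1.0, 1.0]


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one frame (0.0 for an empty frame)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_speaker(frames: Iterable[TerminalAudio],
                   min_level: float = 0.05) -> Optional[str]:
    """Return the terminal id of the selected speaker, or None if all are quiet.

    Assumed rule (not taken from the source): among terminals whose level
    reaches min_level, pick the highest priority, then the loudest.
    """
    candidates = []
    for frame in frames:
        level = rms(frame.samples)
        if level >= min_level:
            candidates.append((frame.priority, level, frame.terminal_id))
    if not candidates:
        return None
    return max(candidates)[2]
```

With this rule a quiet high-priority terminal does not block a lower-priority participant who is actually talking; other combinations, such as a weighted score, would fit the description equally well.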

The speaker position calculation means 118 calculates the speaker's position from the specific sound transmission time supplied by the specific sound adding means 108 and the specific sound reception time, supplied by the audio signal receiving means 114, in the audio data from the specific speaker's sound collecting terminal. Using this result, the pan head control means 119 controls the pan head so that the video camera 120 faces the speaker.
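As an illustration of the last step, a calculated position can be turned into a pan angle for the pan head. This sketch assumes a coordinate system that the source does not define: the camera sits at the origin and faces along the +y axis, with x increasing to the camera's right. The function name `pan_angle_deg` and the `max_pan` limit are hypothetical.

```python
import math


def pan_angle_deg(x: float, y: float, max_pan: float = 90.0) -> float:
    """Pan angle in degrees that points the camera at (x, y).

    Assumptions: camera at the origin, 0 degrees is straight ahead (+y),
    positive angles turn towards +x. The result is clamped to the
    mechanical range [-max_pan, +max_pan] of the pan head.
    """
    angle = math.degrees(math.atan2(x, y))
    return max(-max_pan, min(max_pan, angle))
```

For example, a speaker directly in front of the camera yields 0 degrees, and a speaker at equal distances ahead and to the right yields about 45 degrees.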

The video data captured by the video camera 120 is encoded by the video encoding means 121, multiplexed by the AV multiplexing means 122 with the audio data encoded by the audio encoding means 116, and transmitted to the wide area communication network 100 by the transmitting means 123.

The operation of determining the speaker's position is described with reference to FIGS. 2 and 3.

The specific sound added by the specific sound adding means 108 consists of two predetermined audio signals below the limit of human hearing, as shown in FIG. 2. The conference participants therefore never hear the specific sound, and there is no need to remove it from the audio data received by the audio signal receiving means. The specific sound adding means 108 adds this specific sound to the audio data output by the audio decoding means 107, records the time of addition, and outputs the result from the two speakers 110.
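The adding step can be sketched as mixing a low-level sine tone into each output channel and recording when that happened. The assumptions here are not from the source: the two predetermined signals are modelled as one sine tone per speaker, samples are floats in [-1.0, 1.0], and the names `add_marker_tone` and `mark_stereo_frame` are hypothetical. The tone frequencies are left as parameters because the source does not give values.

```python
import math
import time


def add_marker_tone(samples, freq_hz, sample_rate=48000, amplitude=0.01):
    """Return a copy of samples with a sine tone of freq_hz mixed in."""
    out = []
    for n, s in enumerate(samples):
        tone = amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
        # Clamp so the mixed signal stays inside the valid PCM range.
        out.append(max(-1.0, min(1.0, s + tone)))
    return out


def mark_stereo_frame(left, right, freq_left_hz, freq_right_hz,
                      sample_rate=48000, clock=time.monotonic):
    """Add one marker tone per channel and record the time of addition.

    Returns (left_out, right_out, sent_at). Using a different tone for each
    channel is an assumption that lets a receiver tell the speakers apart.
    """
    sent_at = clock()
    left_out = add_marker_tone(left, freq_left_hz, sample_rate)
    right_out = add_marker_tone(right, freq_right_hz, sample_rate)
    return left_out, right_out, sent_at
```

The recorded `sent_at` value plays the role of the specific sound transmission time that the position calculation later compares against the reception time.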

As shown in FIG. 3, the two specific sounds output from the two speakers are collected by the audio input means 112 of the wireless sound collecting terminal 111 held by speaker A, who has been selected by the specific speaker determination means 117, and are transmitted by the audio signal transmitting means 113. The time taken for them to reach the audio signal receiving means 114 is measured. The speaker's position is then calculated from the measurement result using a conversion table that takes into account the predetermined distance d0 from speaker L to speaker R, the distance d1 from speaker L to the wireless sound collecting terminal, the distance d2 from speaker R to the wireless sound collecting terminal, the distance d3 between the wireless sound collecting terminal and the video conference terminal, and the known signal propagation speed.
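The source performs this step with a conversion table. As a rough illustration of the geometry such a table would encode, the sketch below computes the position in closed form instead. Its assumptions are not from the source: speaker L is at (-d0/2, 0) and speaker R at (+d0/2, 0), the terminal is in front of the speakers (y >= 0), sound travels at 343 m/s, and the wireless leg plus processing adds a fixed, known `link_latency`. Here d1 and d2 denote the distances from speaker L and speaker R to the terminal. The function name `locate_terminal` is hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C (assumed)


def locate_terminal(t_left, t_right, d0, link_latency=0.0, v=SPEED_OF_SOUND):
    """Estimate the (x, y) position of the sound collecting terminal.

    t_left / t_right: measured time in seconds from emitting the marker at
    speaker L / R until the terminal's signal arrives back at the receiver.
    d0: distance between the two speakers in metres.
    """
    d1 = v * (t_left - link_latency)   # distance from speaker L
    d2 = v * (t_right - link_latency)  # distance from speaker R
    if d1 <= 0.0 or d2 <= 0.0:
        raise ValueError("measured times are shorter than the link latency")
    # Subtracting the two circle equations
    #   d1^2 = (x + d0/2)^2 + y^2  and  d2^2 = (x - d0/2)^2 + y^2
    # gives d1^2 - d2^2 = 2 * x * d0.
    x = (d1 * d1 - d2 * d2) / (2.0 * d0)
    y_squared = d1 * d1 - (x + d0 / 2.0) ** 2
    if y_squared < 0.0:
        raise ValueError("measurements are inconsistent with the speaker spacing")
    return x, math.sqrt(y_squared)
```

A lookup table indexed by the two measured times could be filled in advance by evaluating a function of this kind over the expected range of values.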

By repeating the above operation periodically, once per unit time, the apparatus tracks the position of a moving speaker.

Another embodiment of the present invention is described below.

FIG. 4 is a block diagram of another embodiment of the present invention. In a broadcast receiving terminal 301 connected to a tuner 300, the data received by the broadcast signal receiving means 302 is separated by the TS separation means 303 into video data, audio data, and a control signal. The video data is decoded by the video decoding means 304, and the display control means 305 then displays the video on the monitor 306 under synchronization control by the control signal from the TS separation means 303. The audio data, meanwhile, is decoded by the audio decoding means 307. The audio processing means 309 then applies processing that produces a pseudo-surround effect, as if the sound were being output from five speakers surrounding the user position calculated by the user position calculation means 308. A predetermined specific sound is added by the specific sound adding means 310, and the audio output control means 311 outputs the audio from the two speakers 312 under synchronization control by the control signal from the TS separation means 303. In the remote control terminal 313, the audio data collected by the audio input means 314, including the voice uttered by the user, and the commands entered by the user through the operation input means 315 are transmitted to the broadcast receiving terminal 301 by the remote control signal transmitting means 316.

On the broadcast receiving terminal 301 side, this remote control signal is received by the remote control signal receiving means 317 and passed to the remote control signal control means 318. The voice data uttered by the user is sent to the voice recognition means 319 and converted into operation commands, which are then sent, together with the user's operation commands from the operation input means 315, to the broadcast signal receiving means 302, the display control means 305, and the audio output control means 311.

The user position calculation means 308 calculates the user's position from the specific sound transmission time supplied by the specific sound adding means 310 and the specific sound reception time supplied by the remote control signal control means 318, and outputs the result to the audio processing means 309.
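The pseudo-surround processing itself is not described in detail, so the sketch below is not that processing. It only illustrates one simple way audio processing can depend on the calculated user position: delaying and attenuating the nearer speaker so that sound from both speakers arrives at the user at the same time and at a similar level. The speaker coordinates, the 1/r level model, and the name `alignment_for_user` are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air (assumed)


def alignment_for_user(x, y, d0, sample_rate=48000, v=SPEED_OF_SOUND):
    """Per-speaker (delay_in_samples, gain) for a user at (x, y).

    Assumptions: speaker L at (-d0/2, 0), speaker R at (+d0/2, 0), and
    amplitude falling off as 1/distance. The farther speaker is the
    reference (delay 0, gain 1.0); the nearer one is delayed and attenuated.
    """
    d_left = math.hypot(x + d0 / 2.0, y)
    d_right = math.hypot(x - d0 / 2.0, y)
    far = max(d_left, d_right)
    result = {}
    for name, d in (("left", d_left), ("right", d_right)):
        delay_samples = round((far - d) / v * sample_rate)
        gain = d / far
        result[name] = (delay_samples, gain)
    return result
```

A user seated on the centre line gets no correction at all, while a user sitting off to one side gets a small delay and a gain below 1.0 on the nearer speaker.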

The operation of determining the user's position is described with reference to FIGS. 2 and 5.

As in the first embodiment, the specific sound added by the specific sound adding means 310 consists of two predetermined audio signals below the limit of human hearing, as shown in FIG. 2. The user therefore never hears the specific sound. The specific sound adding means 310 adds this specific sound to the audio data output by the audio processing means 309, records the time of addition, and outputs the result from the two speakers 312.

As shown in FIG. 5, the two specific sounds output from the two speakers are collected by the audio input means 314 of the remote control terminal 313 and transmitted by the remote control signal transmitting means 316. The time taken for them to reach the remote control signal receiving means 317 is measured. The user's position is then calculated from the measurement result using a conversion table that takes into account the predetermined distance d0 from speaker L to speaker R, the distance d1 from speaker L to the remote control terminal, the distance d2 from speaker R to the remote control terminal, the distance d3 between the remote control terminal and the broadcast receiving terminal, and the known signal propagation speed.
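Measuring the arrival time requires recognising the specific sound in the collected audio, and the source does not say how this is done. One common technique for detecting a single known frequency, assumed here purely for illustration, is the Goertzel algorithm applied block by block. The names `goertzel_power` and `first_detection_time`, the block size, and the threshold are hypothetical.

```python
import math


def goertzel_power(block, freq_hz, sample_rate):
    """Squared magnitude of the freq_hz component of block (Goertzel algorithm)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in block:
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def first_detection_time(samples, freq_hz, sample_rate,
                         block_size=480, threshold=0.005):
    """Start time in seconds of the first block containing the tone, or None.

    The power is normalised so that a sine of amplitude A filling a whole
    block with complete cycles gives about A**2, which is then compared
    with threshold. The time resolution is one block.
    """
    scale = (block_size / 2.0) ** 2
    for start in range(0, len(samples) - block_size + 1, block_size):
        block = samples[start:start + block_size]
        if goertzel_power(block, freq_hz, sample_rate) / scale >= threshold:
            return start / sample_rate
    return None
```

Subtracting the recorded transmission time from the detection time (on a shared clock) gives the elapsed time that the position calculation uses. A smaller block size gives finer time resolution at the cost of poorer frequency selectivity.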

By repeating the above operation periodically, once per unit time, the apparatus tracks the position of a moving user.

FIG. 1 is a block diagram of a video conference apparatus including a position detection device according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram of the specific sound used for the position determination operation in the first and second embodiments of the present invention.
FIG. 3 is an explanatory diagram of the position determination operation in the first embodiment of the present invention.
FIG. 4 is a block diagram of a broadcast receiving apparatus including a position detection device according to a second embodiment of the present invention.
FIG. 5 is an explanatory diagram of the position determination operation in the second embodiment of the present invention.
FIG. 6 is a block diagram of a video conference apparatus including a position detection device in a conventional example.

Claims

1. A conference terminal device used in a video conference system that performs a video conference via a network, comprising:
specific sound adding means for adding a sound signal of a specific sound to a sound signal input from the network;
a plurality of speakers that are spaced at a predetermined interval from each other and output the sound signal to which the specific sound has been added;
receiving means for receiving a predetermined transmission signal transmitted by a sound collecting terminal corresponding to a speaker in response to detecting the specific sound in the sound output from the plurality of speakers;
position detecting means for detecting the positional relationship between the video conference terminal and the sound collecting terminal based on the arrival time difference when the receiving means receives a plurality of the predetermined transmission signals relating to the detection results of the specific sound in the plurality of sound signals; and
control means for controlling the direction of a television camera based on the detection result of the position detecting means.

2. The terminal device according to claim 1, wherein the sound collecting terminal further collects the voice of the speaker and transmits it to the receiving means, and the receiving means receives the voice transmitted from the sound collecting terminal.

3. The terminal apparatus according to claim 2, further comprising an encoding processing unit that encodes an audio signal received by the receiving unit and an image signal captured by the television camera and transmits the encoded image signal via the network.

4. The terminal apparatus according to claim 3, further comprising a removing unit that removes the specific sound from the audio signal received by the receiving unit, wherein the encoding processing unit encodes the audio signal output from the removing unit.

5. The terminal device according to claim 1, further comprising voice processing means for applying an acoustic effect according to a result of the position detection means to voice received from the network.

6. The terminal device according to claim 1, wherein the sound collection terminal has a remote commander function for controlling a function of the terminal device.

7. The terminal device according to claim 1, wherein the frequency of the specific sound is in an inaudible range.