JP2000217091A

JP2000217091A - Video conference system

Info

Publication number: JP2000217091A
Application number: JP11012046A
Authority: JP
Inventors: Takanori Ikegami; 貴則池上
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1999-01-20
Filing date: 1999-01-20
Publication date: 2000-08-04

Abstract

PROBLEM TO BE SOLVED: To provide a video conference system in which a video conference can proceed more naturally and smoothly. SOLUTION: In this video conference system, the video images of conference participants at other terminals are displayed side by side on the horizontally long screen of a video output device 17, and a plurality of video cameras 11a-11z are placed above the images of the respective participants displayed on that screen. Non-verbal interpretation processing is applied to the motion of the speaker on the basis of the plural sets of video information obtained by photographing the speaker from different directions with the video cameras 11a-11z, so as to identify the person whom the speaker is addressing. This enables control such as automatically transferring the right to speak from the speaker to that person, so that the video conference can proceed more naturally and smoothly.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Technical field of the invention]

[0001] The present invention relates to a video conference system.

[Prior art]

[0002] In recent years, video conference systems have been realized in which terminals installed at remote sites, such as offices and branches, are connected via a communication network so that conference participants can hold a video conference by data communication while remaining at their own sites.

[0003] One example of a terminal installed at each site of a conventional video conference system, shown in FIG. 5, has a monitor 51 that displays the video of conference participants sent from other terminals at other sites, for example by dividing the screen into four, and a single camera 52 that is mounted on top of the monitor 51 and captures the conference participant at the own terminal, who talks while watching the screen of the monitor 51.

[0004] Because this type of video conference system uses hardware equivalent to a general personal computer (hereinafter, PC), the monitor 51 has a screen with a diagonal size of, for example, about 15 to 19 inches. Consequently, when a conference is held among about five people including oneself, the video of the other conference participants is displayed on the screen of the monitor 51 split into four frames arranged in two rows and two columns.

[0005] In this case, the gaze of the conference participant at the own terminal, who is watching the screen of the monitor 51, is always confined to the area of that screen, that is, effectively to a single point.

[Problem to be solved by the invention]

[0006] However, this arrangement hinders the non-verbal communication toward a conversation partner that is important to the progress of a conference, such as gaze movement, changes in pupil size, changes in facial expression, and changes in body and hand gestures. As a result, the dialogue becomes unnatural, particularly in conferences with many participants.

[0007] The present invention has been made to solve this problem. It aims to provide a video conference system that allows a video conference to proceed smoothly and more naturally by taking into account the non-verbal communication involved in a speaker's conversational behavior, such as gaze movement, changes in pupil size, changes in facial expression, changes in body gestures, and hand gestures.

[Means for solving the problem]

[0008] To achieve the above object, the video conference system according to claim 1 is a video conference system in which an own terminal and a plurality of other terminals located at different sites are connected via a communication network. Each terminal comprises display means for displaying the video of a plurality of conference participants sent from the other terminals, arranged side by side with at least enough separation that the conference participant at the own terminal changes his or her conversational behavior depending on which one of them he or she is addressing, and a plurality of photographing means, each arranged at the position of, or near, the video of an individual conference participant displayed by the display means, for photographing the conference participant at the own terminal. Each terminal, or a processing device on the communication network, comprises non-verbal interpretation processing means that performs non-verbal interpretation processing on the basis of the plural pieces of video information captured by the plurality of photographing means and identifies the action by which the speaker selects one of the plurality of conference participants and addresses that person.

[0009] The video conference system according to claim 2 is a video conference system in which an own terminal and a plurality of other terminals located at different sites are connected via a communication network. Each terminal comprises display means for displaying the video of a plurality of conference participants sent from the other terminals, arranged side by side with at least enough separation that the conference participant at the own terminal changes his or her conversational behavior depending on which one of them he or she is addressing, and a plurality of photographing means, each arranged at the position of, or near, the video of an individual conference participant displayed by the display means, for photographing the conference participant at the own terminal. Each terminal, or a processing device on the communication network, comprises non-verbal interpretation processing means that performs non-verbal interpretation processing on the basis of the plural pieces of video information captured by the plurality of photographing means and identifies the action by which the speaker selects one of the plurality of conference participants and addresses that person, and control means that performs switching control to transfer the right to speak to the terminal of the one conference participant identified as a result of that non-verbal interpretation processing.

[0010] The video conference system according to claim 3 is the video conference system according to claim 1 or claim 2, wherein the display means has a screen that displays the video of the plurality of conference participants arranged in order in the horizontal direction.

[0011] In the invention according to claim 1, each terminal has display means that displays the video of a plurality of conference participants sent from the other terminals, arranged side by side with at least enough separation that the conference participant at the own terminal changes his or her conversational behavior depending on which one of them he or she is addressing, and a plurality of photographing means, each arranged at or near the position of an individual participant's video displayed by the display means, for photographing the conference participant at the own terminal. Each terminal, or a processing device on the communication network, performs non-verbal interpretation processing on the basis of the plural pieces of video information captured by the photographing means and identifies the action by which the speaker selects one of the conference participants and addresses that person. The system therefore knows which conference participant the speaker is addressing and can perform various kinds of control aimed at running the conference more smoothly.

[0012] In the invention according to claim 2, in addition to the invention according to claim 1, the video of the one person whom the speaker selected from among the conference participants and addressed is identified, and control is performed to automatically transfer the right to speak to that person. This widens the freedom of non-verbal communication and lets the speaker speak naturally.

[0013] In the invention according to claim 3, the display means has a screen that displays the video of the plurality of conference participants arranged in order in the horizontal direction. At least the speaker therefore comes to change his or her behavior, for example by turning his or her gaze toward the conversation partner, and a conference that incorporates non-verbal communication can be held.

[0014] In other words, in this invention the video of conference participants at other terminals is displayed side by side, and a plurality of photographing means are installed at the positions of the respective participants' video. Each photographing means photographs the speaker from a different direction, the speaker's behavior is subjected to non-verbal interpretation processing for each of the plural pieces of video information obtained, and the person whom the speaker addressed is identified. This enables control such as automatically transferring the right to speak from the speaker to the addressed person, so that a more natural and smooth video conference can be realized.

[Embodiments of the invention]

[0015] Embodiments of the present invention are described below in detail with reference to the drawings. FIG. 1 is a diagram showing one embodiment of a video conference system according to the present invention.

[0016] As shown in FIG. 1, this video conference system comprises video conference terminals 1a to 1z, which are arranged at a plurality of sites scattered in different places, and a central processing unit (MCU) 3 that connects these terminals 1a to 1z via a communication network 2. The central processing unit (MCU) 3 controls the conference as a whole, for example by transmitting video data and other data received from the video conference terminals 1a to 1z at the sites to designated sites and by executing processing based on control data from each site. The video conference terminals 1a to 1z at the sites and the central processing unit (MCU) 3 are connected by the communication network 2, for example in a star topology centered on the central processing unit (MCU) 3.
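
The hub-style relaying described in the paragraph above can be pictured with a minimal Python sketch. The function name `relay`, the string terminal labels, and the rule that a sender never receives its own data are illustrative assumptions; the specification does not define how the central processing unit (MCU) 3 selects or addresses destinations.

```python
def relay(sender, payload, destinations, connected):
    """Return the (terminal, payload) deliveries a star-topology hub would make.

    Hypothetical helper: terminals are plain strings such as "1a", and
    `connected` is the set of terminals currently attached to the hub.
    """
    deliveries = []
    for terminal in destinations:
        # Skip unknown terminals and never echo data back to its sender (assumed rule).
        if terminal in connected and terminal != sender:
            deliveries.append((terminal, payload))
    return deliveries


connected = {"1a", "1b", "1c"}
deliveries = relay("1a", b"video-frame", ["1a", "1b", "1c", "1z"], connected)
```

In this sketch every transfer passes through the hub, which mirrors the star connection in which the terminals do not talk to each other directly.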

[0017] As shown in FIG. 2, each of the video conference terminals 1a to 1z comprises: a plurality of video cameras 11a to 11z; a video input processing device 12 that receives the video information captured by the video cameras 11a to 11z, performs non-verbal interpretation processing on it, and identifies the video of the conversation partner from the action by which the speaker selects one of the plurality of conference participants and addresses that person; an audio input processing device 14 to which a microphone 13 is connected; a transmission processing device 15 that multiplexes and transmits the data; a reception processing device 16 that receives and demultiplexes the data; a video output processing device 18 to which a video output device 17, such as a large-screen liquid crystal projector with a horizontally long (landscape) screen, is connected; an audio output processing device 20 to which a loudspeaker 19 is connected; and a control device 21 that controls the terminal as a whole. The video cameras 11a to 11z output video signals of the conference participant at the site to the video input processing device 12. The video input processing device 12 performs non-verbal interpretation processing, compression encoding, addition of the delivery destination, and so on for each video signal input from the video cameras 11a to 11z; its configuration is described later. The audio signal from the microphone 13 is output to the audio input processing device 14, which performs compression encoding and the like. The transmission processing device 15 multiplexes and transmits the signals from the video input processing device 12, the audio input processing device 14, and the control device 21. The reception processing device 16 receives the multiplexed video, audio, and control signals, separates them, and passes them to the video output processing device 18, the audio output processing device 20, and the control device 21. The video output processing device 18 decodes the compressed data and reconstructs the video of each of the other sites on the video output device 17. The loudspeaker 19 produces sound. The audio output processing device 20 decodes the compressed data and outputs it to the loudspeaker 19. The control device 21 controls the site as a whole.
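
The multiplexing done by the transmission processing device 15 and the separation done by the reception processing device 16 can be illustrated with a small Python sketch. The framing used here (a one-byte stream identifier followed by a four-byte length) and the names `multiplex`, `demultiplex`, `VIDEO`, `AUDIO`, and `CONTROL` are assumptions made for illustration only; the specification does not describe a wire format.

```python
import struct

# Hypothetical stream identifiers; not taken from the specification.
VIDEO, AUDIO, CONTROL = 1, 2, 3
_HEADER = struct.Struct("!BI")  # 1-byte stream id, 4-byte payload length, network byte order


def multiplex(chunks):
    """Concatenate (stream_id, payload) pairs into one byte string."""
    return b"".join(_HEADER.pack(sid, len(payload)) + payload for sid, payload in chunks)


def demultiplex(data):
    """Split a multiplexed byte string back into per-stream payload lists."""
    streams = {VIDEO: [], AUDIO: [], CONTROL: []}
    offset = 0
    while offset < len(data):
        sid, length = _HEADER.unpack_from(data, offset)
        offset += _HEADER.size
        streams.setdefault(sid, []).append(data[offset:offset + length])
        offset += length
    return streams


muxed = multiplex([(VIDEO, b"frame0"), (AUDIO, b"pcm"), (CONTROL, b"addressee=41d"), (VIDEO, b"frame1")])
streams = demultiplex(muxed)
```

After demultiplexing, the video list would go to the video output processing device 18, the audio list to the audio output processing device 20, and the control list to the control device 21.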

[0018] As shown in FIG. 3, the video input processing device 12 includes analog-to-digital converters (hereinafter, ADCs) 24a to 24z, video compression encoding units 25a to 25z, non-verbal interpretation processing units 26a to 26z, and so on.

[0019] The ADCs 24a to 24z convert the video signals (analog signals) input from the video cameras 11a to 11z into digital signals. The video compression encoding units 25a to 25z are encoders that compress and encode moving-image data. The non-verbal interpretation processing units 26a to 26z perform non-verbal interpretation processing on the video input from the ADCs 24a to 24z, based on gaze movement, pupil size, facial expression, body gestures, hand gestures, and the like, and pass the results to the control device 21. The control device 21 sends the information from the non-verbal interpretation processing units 26a to 26z (non-verbal interpretation data) to the transmission processing device 15, which transmits it to the central processing unit (MCU) 3. Based on the non-verbal interpretation data transmitted from the terminals at the sites, the central processing unit (MCU) 3 performs control for transferring the right to speak according to gaze movement and the like, that is, speaker switching control for the conference, as well as control for promoting communication. In addition, it may, for example, translate non-verbal communication between different cultures.

[0020] As shown in FIG. 4, the video output device 17 provided in each of the video conference terminals 1a to 1z displays the video of a plurality of conference participants sent from terminals other than the own terminal. For video of four people, for example, it displays them in regions 41a to 41d into which the landscape screen 40 is divided in the horizontal direction. On the top surface of the video output device 17, above the regions 41a to 41d, a plurality of video cameras 11a to 11d are arranged close to the positions of the individual participants' video. These cameras capture the upper body of the conference participant at the own terminal, and the captured video is sent to the conference participants displayed in the regions 41a to 41d.

[0021] Next, the specific operation of this video conference system is described. Suppose, for example, that a video conference is being held between the conference participant at the own terminal and four conference participants at a plurality of other terminals, and that the participant at the own terminal holds the right to speak as the speaker. If the speaker addresses, for example, the person at the right end of the landscape screen 40 in FIG. 4, that is, the conference participant displayed in region 41d, the speaker's gaze is directed toward region 41d. The video of the speaker captured by the video camera 11d is therefore a frontal view of the speaker's face; in particular, the direction in which the speaker's eyes are looking, that is, the gaze, is directed straight ahead.

[0022] At the same moment, the video captured by the other video cameras 11a to 11c shows the speaker's face from a direction other than the front, that is, in profile or oblique profile.

[0023] This video information is input to the ADCs 24a to 24z in the form of video signals (analog signals) and converted into digital signals, which are branched and output to the corresponding video compression encoding units 25a to 25z and non-verbal interpretation processing units 26a to 26z.

[0024] In the video compression encoding units 25a to 25z, the input digital signals are compressed and encoded into moving-image data such as MPEG1, MPEG2, or H.261 and output to the transmission processing device 15.

[0025] The non-verbal interpretation processing units 26a to 26z perform non-verbal interpretation processing on the video information input from the ADCs 24a to 24z, based on gaze movement, changes in pupil size, changes in facial expression, and changes in body and hand gestures, and interpret which of the video cameras 11a to 11d the speaker is talking toward in the video obtained from those cameras.

[0026] In this case, the non-verbal interpretation processing unit 26d receives a frontal view of the speaker's face as the video from the video camera 11d, toward which the speaker's gaze is directed; the non-verbal interpretation processing unit 26b receives a profile view of the speaker as the video from the video camera 11b; and the non-verbal interpretation processing unit 26c receives an oblique profile view of the speaker as the video from the video camera 11c.

[0027] From these videos, information such as the orientation of the face in each video is output to the control device 21 as the respective interpretation result.

[0028] From these interpretation results, the control device 21 determines that the conference participant shown in region 41d, where the video camera 11d is installed, is the person whom the current speaker is addressing. This determination result, for example the video number of the addressed person, is sent to the transmission processing device 15 as non-verbal interpretation result data.
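
One way to picture the determination made by the control device 21 is the following Python sketch: each interpretation result carries a face orientation, and the region whose camera sees the most frontal face is taken as the addressee. The `Interpretation` record, the use of a yaw angle in degrees, and the 15-degree threshold are assumptions made for illustration; the specification only says that face orientation information and the like is output and does not give a decision rule.

```python
from dataclasses import dataclass


@dataclass
class Interpretation:
    camera_id: str       # e.g. "11d"
    face_yaw_deg: float  # assumed metric: 0 means the face looks straight into this camera


# Camera-to-region layout following the four-region example of FIG. 4.
CAMERA_TO_REGION = {"11a": "41a", "11b": "41b", "11c": "41c", "11d": "41d"}


def identify_addressee(results, max_frontal_yaw_deg=15.0):
    """Return the region whose camera sees the most frontal face, or None."""
    if not results:
        return None
    best = min(results, key=lambda r: abs(r.face_yaw_deg))
    if abs(best.face_yaw_deg) > max_frontal_yaw_deg:
        return None  # no camera sees the face squarely enough (assumed cut-off)
    return CAMERA_TO_REGION.get(best.camera_id)


results = [
    Interpretation("11a", 60.0),
    Interpretation("11b", 42.0),  # profile view
    Interpretation("11c", 21.0),  # oblique profile view
    Interpretation("11d", 2.0),   # frontal view
]
addressee = identify_addressee(results)
```

A real system would also weigh gaze, expression, and gestures, which this sketch leaves out.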

[0029] The transmission processing device 15 transmits the non-verbal interpretation result data, together with the input moving-image data, to the central processing unit (MCU) 3.

[0030] Based on the non-verbal interpretation result data transmitted from the terminals at the sites, the central processing unit (MCU) 3 performs control for transferring the right to speak, that is, speaker switching control for the conference. As a result, the conference participant shown in region 41d is permitted to speak and can do so. Communication promotion control, such as presenting a prompt like "Please speak" to the addressed person, may also be performed.
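
The transfer of the right to speak can be sketched as a small state holder on the MCU side. The class name `FloorControl`, its method, and the rule that only the current holder of the right to speak can hand it over are hypothetical; the specification states the outcome of the control but not its internal logic.

```python
class FloorControl:
    """Tracks which terminal currently holds the right to speak (illustrative only)."""

    def __init__(self, initial_holder):
        self.holder = initial_holder
        self.prompts = []  # (terminal, message) pairs standing in for on-screen prompts

    def on_interpretation(self, reporting_terminal, addressee_terminal):
        """Handle non-verbal interpretation result data reported by one terminal.

        Returns True if the right to speak was transferred.
        """
        # Assumed rule: only the current speaker's terminal can hand over the floor.
        if reporting_terminal != self.holder:
            return False
        if addressee_terminal is None or addressee_terminal == self.holder:
            return False
        self.holder = addressee_terminal
        # Communication promotion: prompt the addressed participant to speak.
        self.prompts.append((addressee_terminal, "Please speak"))
        return True


floor = FloorControl("1a")
ignored = floor.on_interpretation("1b", "1c")      # 1b does not hold the floor
transferred = floor.on_interpretation("1a", "1d")  # speaker at 1a addressed the participant at 1d
```

Keeping this state in one place matches the description of the central processing unit (MCU) 3 controlling the conference as a whole.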

[0031] In this way, the speaker in front of the terminal can choose the person he or she wishes to talk to by moving his or her gaze toward a conference participant displayed on the screen or by turning his or her face, just as in an actual meeting. This widens the freedom of non-verbal communication and allows the speaker to speak naturally.

[0032] As described above, in the video conference system of this embodiment, the video of conference participants at other terminals is displayed side by side on a horizontally long monitor, and a plurality of video cameras 11a to 11z are installed at the positions of the respective participants' video so that the speaker is photographed from each direction. The speaker's behavior is subjected to non-verbal interpretation processing for each of the plural pieces of video information obtained from the video cameras 11a to 11z, and the system identifies the conversation partner. A more natural and smooth video conference can therefore be realized.

[0033] The present invention is not limited to the above embodiment. In the above embodiment, as many video cameras are arranged as there are participant videos. However, when there are only two conference participants at other terminals besides the participant at the own terminal, the video of each of them may be displayed on one of two separate monitors placed a predetermined distance apart. A single camera placed between the two monitors is then enough to obtain video information sufficient to identify the behavior of a speaker who turns toward one of the videos to speak.
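
For the two-monitor, single-camera variation, the decision could reduce to the sign of the speaker's head rotation as seen by the one camera. The sketch below assumes a yaw estimate in degrees is available, that negative values mean the face is turned toward the left monitor, and that a small dead zone around zero means no clear choice; none of these details come from the specification.

```python
def addressee_from_single_camera(face_yaw_deg, dead_zone_deg=5.0):
    """Return "left", "right", or None from one camera placed between two monitors.

    Assumed sign convention: negative yaw = turned toward the left monitor.
    """
    if face_yaw_deg <= -dead_zone_deg:
        return "left"
    if face_yaw_deg >= dead_zone_deg:
        return "right"
    return None  # facing the camera itself; no monitor clearly selected


choice = addressee_from_single_camera(-18.0)
```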

[0034] The above embodiment uses, as an example, a single large-screen liquid crystal projector or the like having the landscape screen 40. Alternatively, for example, one monitor per conference participant may be prepared and the monitors simply arranged at predetermined intervals.

[0035] To make the video conference more realistic, a plurality of loudspeakers may be provided around the conference participant at each terminal, and the acoustic effect of these loudspeakers may be used to form a three-dimensional sound field for the participant, so that the listener can recognize who the speaker is.
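
As a much simplified illustration of placing a voice in space, the sketch below uses constant-power stereo panning so that a participant shown further to the right on the screen is also heard further to the right. Reducing the three-dimensional sound field to two loudspeakers, and the helper names `pan_gains` and `region_position`, are assumptions made for this example; the specification does not describe how the sound field is formed.

```python
import math


def pan_gains(position):
    """Constant-power (left, right) gains for a position in [0, 1]; 0 = far left, 1 = far right."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    theta = position * math.pi / 2
    return math.cos(theta), math.sin(theta)


def region_position(index, region_count):
    """Horizontal centre of screen region `index` (0-based, left to right) as a pan position."""
    return (index + 0.5) / region_count


# Voice of the participant shown in the rightmost of four regions.
left_gain, right_gain = pan_gains(region_position(3, 4))
```

Because the squared gains always sum to one, the perceived loudness stays roughly constant as the voice moves across the screen.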

[Effects of the invention]

[0036] As described above, according to the present invention, the video of conference participants at other terminals is displayed side by side, and a plurality of photographing means are installed at the positions of the respective participants' video. Each photographing means photographs the speaker from a different direction, the speaker's behavior is subjected to non-verbal interpretation processing for each of the plural pieces of video information obtained, and the person whom the speaker addressed is identified. This enables control such as automatically transferring the right to speak from the speaker to the addressed person, so that a more natural and smooth video conference can be realized.

[Brief description of the drawings]

FIG. 1 is a diagram showing the configuration of a video conference system according to one embodiment of the present invention.

FIG. 2 is a block diagram showing an example of the terminal installed at each site of the video conference system.

FIG. 3 is a diagram showing the configuration of the video input processing device of the video conference system.

FIG. 4 is a diagram showing a specific example of the arrangement of video cameras in the video conference system.

FIG. 5 is a diagram showing an example of the arrangement of a video camera in a conventional video conference system.

[Explanation of symbols]

11a to 11z: video cameras; 12: video input processing device; 13: microphone; 14: audio input processing device; 15: transmission processing device; 16: reception processing device; 17: video output device; 18: video output processing device; 19: loudspeaker; 20: audio output processing device; 21: control device; 1a to 1z: terminals; 2: communication network; 3: central processing unit (MCU); 24a to 24z: ADCs; 25a to 25z: video compression encoding units; 26a to 26z: non-verbal interpretation processing units.

Claims

[Claims]

1. A video conference system in which an own terminal and a plurality of other terminals located at different sites are connected via a communication network, wherein each terminal comprises: display means for displaying the video of a plurality of conference participants sent from the other terminals, arranged side by side with at least enough separation that the conference participant at the own terminal changes his or her conversational behavior depending on which one of them he or she is addressing; and a plurality of photographing means, each arranged at the position of, or near, the video of an individual conference participant displayed by the display means, for photographing the conference participant at the own terminal; and wherein each terminal, or a processing device on the communication network, comprises non-verbal interpretation processing means for performing non-verbal interpretation processing on the basis of the plural pieces of video information captured by the plurality of photographing means and identifying the action by which the speaker selects one of the plurality of conference participants and addresses that person.

2. A video conference system in which an own terminal and a plurality of other terminals located at different sites are connected via a communication network, wherein each terminal comprises: display means for displaying the video of a plurality of conference participants sent from the other terminals, arranged side by side with at least enough separation that the conference participant at the own terminal changes his or her conversational behavior depending on which one of them he or she is addressing; and a plurality of photographing means, each arranged at the position of, or near, the video of an individual conference participant displayed by the display means, for photographing the conference participant at the own terminal; and wherein each terminal, or a processing device on the communication network, comprises: non-verbal interpretation processing means for performing non-verbal interpretation processing on the basis of the plural pieces of video information captured by the plurality of photographing means and identifying the action by which the speaker selects one of the plurality of conference participants and addresses that person; and control means for performing switching control to transfer the right to speak to the terminal of the one conference participant identified as a result of the non-verbal interpretation processing by the non-verbal interpretation processing means.

3. The video conference system according to claim 1 or claim 2, wherein the display means has a screen that displays the video of the plurality of conference participants arranged in order in the horizontal direction.