JP2008113164A - Communication apparatus

Info

Publication number: JP2008113164A
Application number: JP2006294269A
Authority: JP
Inventors: Akira Takayama; 明高山; Takuya Tamaru; 卓也田丸
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-10-30
Filing date: 2006-10-30
Publication date: 2008-05-15

Abstract

PROBLEM TO BE SOLVED: To provide a technology for efficiently transmitting necessary video information within a limited communication band in data transfer between conference terminals.

SOLUTION: The communication apparatus includes a plurality of Web cameras 107, set up so that their shooting areas differ, each of which generates a video signal; a microphone array 106 which collects the sound emitted from a sound source and generates a sound signal; and a sound source direction information generating means which generates sound source direction information indicating the direction of the sound source based on the sound signal generated by the microphone array 106. The communication apparatus captures the image of the speaker with the Web camera 107 whose shooting area includes the direction indicated by the sound source direction information, and outputs the voice and image of the speaker to the network.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a communication apparatus that transmits images together with audio.

In recent years, conference systems that hold a conference using a plurality of conference terminals connected via a communication network have become widespread. Patent Document 1 discloses a technology for supporting the operation of a remote video conference held among participants at remote locations. The system disclosed in this document includes a plurality of video conference terminals and a multipoint video conference relay device that mediates the exchange of audio information and video information among those terminals. The relay device superimposes conference operation information, such as the reference page of the material used in the conference and the time remaining until the conference ends, on the audio information and video information passing through it as appropriate.
Patent Document 1: Japanese Patent Laid-Open No. 05-145918

The video information that participants need about the other party varies with the phase of the conference. For example, immediately after the conference starts, participants need to know the overall state of the other party's conference room and the positional relationship of its participants; during the discussion, they need to know who is speaking and with what expression. However, if the technique of Patent Document 1 is used to convey this video information to the other party in detail, the amount of data to be transmitted becomes enormous and cannot be accommodated within a limited communication bandwidth.

The present invention has been made in view of the above problem, and its object is to provide a technique for efficiently transmitting necessary video information, for example in data transfer between conference terminals.

A first embodiment of the communication apparatus according to the present invention comprises: a plurality of photographing means, each having its own set photographing range and each outputting a video signal representing the video within that range; audio signal generating means for generating an audio signal of collected sound; sound source direction specifying means for specifying the direction of a sound source based on the audio signal generated by the audio signal generating means and outputting sound source direction information indicating the specified direction; selection means for selecting the photographing means corresponding to the sound source direction information; and output means for outputting the video signal output by the photographing means selected by the selection means, together with the audio signal generated by the audio signal generating means, to another terminal device via a network.

A second embodiment of the communication apparatus according to the present invention comprises: a plurality of photographing means, each having its own set photographing range and each outputting a video signal representing the video within that range; audio signal generating means for generating an audio signal of collected sound; sound source direction specifying means for specifying the direction of a sound source based on the audio signal generated by the audio signal generating means and outputting sound source direction information indicating the specified direction; selection means for selecting the photographing means corresponding to the sound source direction information; audio extracting means for extracting, from the audio signals generated by the audio signal generating means, the audio signal corresponding to the sound source direction information; and output means for outputting the video signal output by the photographing means selected by the selection means, together with the audio signal extracted by the audio extracting means, to another terminal device via a network.

In a third embodiment of the communication apparatus according to the present invention, based on the second embodiment, the audio signal generating means has a microphone array in which a plurality of microphones are arranged; the audio extracting means has a sound collection direction control device that controls the sound collection direction of the microphone array; and the sound source direction specifying means has a direction specifying device that specifies the direction from the sound collection direction controlled by the sound collection direction control device and the volume level of the audio signal output by the microphone array.

In a fourth embodiment of the communication apparatus according to the present invention, based on the first or second embodiment, the audio signal generating means has a plurality of microphones whose positions can be set individually; the sound source direction specifying means selects, from the volume levels of the audio signals output by the microphones, one microphone to be recognized as the sound source direction, and outputs information indicating the selected microphone as the sound source direction information; and the audio signal extracting means extracts the audio signal of the microphone selected by the sound source direction specifying means.

In a fifth embodiment of the communication apparatus according to the present invention, based on any one of the first to fourth embodiments, the output means sets a specific area, among all the areas constituting one screen, as the display area for the video signal output by the photographing means selected by the photographing means selecting means, and outputs a video signal composited so that other video signals are displayed in the other areas.

In a sixth embodiment of the communication apparatus according to the present invention, based on the fifth embodiment, the apparatus has a plurality of pieces of division data, each indicating how all the areas constituting one screen of the video signal output by the output means are divided and designating the specific area among the divided areas, and division data selecting means for selecting one piece of division data; the output means divides one screen according to the division data selected by the division data selecting means and recognizes the area indicated by that division data as the specific area.

In a seventh embodiment of the communication apparatus according to the present invention, based on any one of the first to sixth embodiments, the apparatus has: a table that stores a plurality of pattern images showing a plurality of arrangements according to sound source positions in the region to be photographed, together with the correspondence between the sound source positions in each pattern image and the photographing range of each photographing means; pattern image selecting means for selecting one of the pattern images; and photographing range control means that refers to the table to determine the photographing ranges corresponding to the template selected by the pattern image selecting means and controls each photographing means so as to match the determined photographing ranges.

The conference terminal according to the present invention has the effect of providing a technique for efficiently transmitting necessary video information in data transfer between conference terminals.

Hereinafter, a conference terminal according to an embodiment of the present invention will be described with reference to the drawings.
(A: Configuration)
FIG. 1 is a block diagram showing the configuration of a conference system 1 including conference terminals according to an embodiment of the present invention. The conference system 1 includes a conference terminal 10A, a conference terminal 10B, and a communication network 20; the conference terminals 10A and 10B are each connected to the communication network 20 by wire. The conference terminals 10A and 10B have the same configuration, and when there is no need to distinguish between them, they are collectively referred to below as the conference terminal 10.
Although the case where two conference terminals are connected to the communication network 20 is illustrated here, three or more conference terminals may be connected.

In this embodiment, the following communication protocols are used. As the application layer protocol, the Real-time Transport Protocol (hereinafter "RTP") is used for transferring audio data and video data. RTP is a communication protocol for providing a service that transmits and receives audio and video data end to end in real time; its details are specified in RFC 1889. In RTP, data is exchanged between communication terminals by generating, sending, and receiving RTP packets. HTTP (Hypertext Transfer Protocol), on the other hand, is used for transferring image data. UDP is used as the transport layer protocol, and IP is used as the network layer protocol. The conference terminals 10A and 10B are each assigned an IP address and are thereby uniquely identified on the network.
Since HTTP, UDP, and IP are widely used communication protocols, their description is omitted.

Next, the hardware configuration of the conference terminal 10 will be described with reference to FIG. 3. In the following description, when it is necessary to distinguish which conference terminal a component of the conference terminal 10 belongs to, a letter is appended to its reference numeral; for example, the control unit 101 of the conference terminal 10A is written as the control unit 101A.

The control unit 101 shown in the figure is, for example, a CPU (Central Processing Unit), and controls the operation of each unit of the conference terminal 10 by executing various programs stored in the storage unit 103 described later.

The storage unit 103 has a ROM (Read Only Memory) 103a and a RAM (Random Access Memory) 103b.
The ROM 103a stores data and control programs for causing the control unit 101 to realize the functions characteristic of the present invention. Examples of this data include a Web camera selection table, test data, and a transmission rate management table. These are described in detail later.
The RAM 103b is used as a work area by the control unit 101 operating according to the various programs, and also stores the audio data, video data, and other data received from the microphone array 106 and the Web cameras 107.

The operation unit 104 is, for example, a keyboard and a mouse. When the operator of the conference terminal 10 performs an input operation on the operation unit 104, data representing that operation is conveyed to the control unit 101.

The microphone array 106 includes a plurality of microphone units (eight in this embodiment; not shown) and an analog/digital (hereinafter "A/D") converter. The microphone array 106 functions as a directional microphone and collects sound while scanning the direction from which sound is collected. The control unit 101 analyzes the audio data generated from sound arriving from these various directions and identifies the direction with a high volume level as the direction of the sound source (that is, of the speaker, if the received sound is a human voice). When sound is input from several directions at once, for example because several participants speak simultaneously, the control unit 101 compares the volume levels of the sound from those directions and takes the direction with the highest volume level as the direction of the sound source.
FIG. 5 shows an example of the relative arrangement of the microphone array 106 and the participants 2 in the conference room. Specifically, the information on the sound source direction identified by the microphone array 106 is generated as the direction of the sound source as seen from the center of the microphone array 106 (the angular coordinate Φ in polar coordinates) and is output to the RAM 103b together with the audio data. In FIG. 5, the speaker is the participant 2a, and the sound source direction is π/6.
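The direction-selection rule described above can be sketched as follows. This is only an illustrative sketch in Python: the function name `loudest_direction`, the dictionary representation of the scan results, and the sample volume levels are assumptions for illustration and do not appear in the specification.

```python
import math


def loudest_direction(levels_by_angle):
    """Return the scanned angle (radians) whose measured volume level is highest.

    levels_by_angle maps each scanned sound collection angle to the volume
    level measured in that direction (hypothetical representation).
    """
    if not levels_by_angle:
        raise ValueError("no scan results")
    # Compare the volume levels of all directions and keep the loudest one.
    return max(levels_by_angle, key=levels_by_angle.get)


# Hypothetical scan: two participants speak at once, the one at pi/6 is louder.
scan = {math.pi / 6: 0.82, math.pi / 2: 0.35, 5 * math.pi / 6: 0.10}
phi = loudest_direction(scan)
```

With the sample values above, `phi` is the angle π/6, matching the situation illustrated for participant 2a.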

The Web camera 107 outputs the input from a C-MOS image sensor as Motion-JPEG video. Motion-JPEG is a video data generation scheme in which each captured frame is JPEG-compressed and the frames are recorded in succession. The Web camera 107 captures video at a predetermined image size and frame rate (fps; frames per second), applies JPEG image compression, and outputs the result to the RAM 103b. The resolution is a value preset in the Web camera, and the frame rate is controlled by the control unit 101 as appropriate. The image compression ratio can be set under the control of the control unit 101 within the range available to JPEG image compression (compression ratios of 1/5 to 1/60).
In this embodiment, five Web cameras 107 are attached to the conference terminal 10, and the participants 2 can change their orientation manually.

Data output by the Web camera 107 is first written to the RAM 103b, and the control unit 101 generates RTP packets from the data written there. As shown in FIG. 2, an RTP packet consists of a payload section to which a header section is attached, like a packet (the data transfer unit in IP) or a segment (the data transfer unit in TCP).

In the payload section, for a data transmission message for example, audio or video data for a predetermined period (20 milliseconds in this embodiment) is written. For a reception notification message, the sequence numbers of the received packets are written.
Four kinds of data are set in the header section: a time stamp, a payload type, a sequence number, and section information. The time stamp is data indicating the time (the time elapsed since the start of voice communication was instructed). The payload type is data that lets the destination of a communication message identify the type of that message. Five message types are used in this embodiment: the speaker video data transmission message, the conference room video data transmission message, the audio data transmission message, the material data transmission message, and the reception notification message. For these messages, the values "1", "2", "3", "4", and "5" are written in the payload type, respectively. The sequence number is an identifier for uniquely identifying each packet; for example, when one piece of audio data is divided into a plurality of RTP packets for transmission, sequence numbers such as 1, 2, 3, ... are assigned to the packets. The section information specifies in which region of the video display unit 105, described later, the image contained in the RTP packet is to be displayed. The position of the image is written as coordinates, for example (0, 0)-(256, 192).

The communication IF unit 102 is, for example, a NIC (Network Interface Card) and is connected to the communication network 20. The communication IF unit 102 sends to the communication network 20 the IP packets obtained by successively encapsulating the RTP packets delivered from the control unit 101 according to the lower-layer communication protocols. Encapsulation here means generating a UDP segment with the RTP packet written in its payload section, and then generating an IP packet with that UDP segment written in its payload section. The communication IF unit 102 also receives IP packets via the communication network 20 and, by performing the reverse of the encapsulation, reads out the RTP packet encapsulated in each IP packet and delivers it to the control unit 101.

The video display unit 105 is a 1024 × 768 pixel monitor. It displays images and materials based on the data received via the communication IF unit 102.

The speaker 108 reproduces the sound represented by the audio data delivered from the control unit 101, and includes a speaker unit and a D/A converter. The D/A converter applies D/A conversion to the audio data delivered from the control unit 101, converting it into an analog audio signal that it outputs to the speaker unit. The speaker unit then reproduces sound according to the audio signal received from the D/A converter.
The above is the hardware configuration of the conference terminal 10.

The installation of the conference terminal 10 in a conference room is described next. FIG. 6(A) shows a conference room in which the conference terminal 10 is installed (it corresponds to FIG. 5). A desk 3 is placed in the conference room, and the participants 2a, 2b, 2c, and 2d are seated on chairs placed around the desk. The conference terminal 10 is installed beside the desk, with the video display unit 105, the microphone array 106, the five Web cameras 107, and the speakers 108 facing the participants. When the five Web cameras 107 need to be distinguished from one another below, they are written as Web cameras 107a, 107b, 107c, 107d, and 107e, as shown in FIG. 6(A).

The video display unit 105 is placed where all the participants can see it. The microphone array 106 is placed below the video display unit 105 so that its microphone units are lined up horizontally in a row. The speakers 108 are placed at two locations, left and right, on the conference terminal 10 so as to sandwich the microphone array 106.

The Web cameras 107 are arranged in a horizontal row below the video display unit 105, and each is connected to the conference terminal 10 via a numbered line. Specifically, Web camera 107a is connected to line 1, Web camera 107b to line 2, Web camera 107c to line 3, Web camera 107d to line 4, and Web camera 107e to line 5.

FIG. 6(B) shows the conference room of FIG. 6(A) as seen in an overhead view from the conference terminal 10 side. In FIG. 6(B), the regions a, b, c, d, and e are the regions photographed by the Web cameras 107a, 107b, 107c, 107d, and 107e of FIG. 6(A), respectively. The photographing range of each Web camera 107 is set so that region e covers the entire conference room and regions a, b, c, and d contain the participants 2a, 2b, 2c, and 2d, respectively.

Next, the characteristic functions of the conference terminal 10 are described. They fall broadly into three: (1) a video selection function, (2) an available bandwidth measurement function, and (3) a data transmission rate control function. These functions are realized by the control unit 101 executing the control program stored in the ROM 103a.

First, the video selection function is described. The video selection function selects the Web camera 107 pointed at the participant who is speaking at that moment and displays the video captured by the selected Web camera 107 on the counterpart conference terminal 10.
The Web camera selection table stored in the ROM 103a is described here. FIG. 7 shows an example of the Web camera selection table. The table associates ranges of the angle Φ, which is the sound source direction information, with the numbers of the lines to which the Web cameras 107 are connected.

The control unit 101 reads the sound source direction information written in the RAM 103b and identifies a line number by checking the value of the angle Φ against the Web camera selection table. For example, when the participant 2a at Φ = π/6 speaks, as shown in FIG. 5, this value falls in the range Φ = 0 to π/4 in the Web camera selection table, so the Web camera 107 connected to line 1, that is, Web camera 107a, is selected. The control unit 101 then outputs to the Web camera 107 connected to the selected line a signal instructing it to generate and output video data. On receiving this signal, the Web camera 107 generates video data and outputs it to the RAM 103b.
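The table lookup described above can be sketched as follows in Python. Only the first row (Φ = 0 to π/4 mapped to line 1) is taken from the text; the remaining rows, the half-open treatment of the range boundaries, and the names `CAMERA_TABLE` and `select_line` are assumptions for illustration, since the contents of FIG. 7 are not reproduced here.

```python
import math

# (lower bound, upper bound, line number). Ranges are treated as half-open
# [lower, upper); this boundary convention is an assumption.
CAMERA_TABLE = [
    (0.0, math.pi / 4, 1),              # from the text: 0 to pi/4 -> line 1
    (math.pi / 4, math.pi / 2, 2),      # assumed row
    (math.pi / 2, 3 * math.pi / 4, 3),  # assumed row
    (3 * math.pi / 4, math.pi, 4),      # assumed row
]


def select_line(phi):
    """Return the line number whose angle range contains phi, or None."""
    for lower, upper, line in CAMERA_TABLE:
        if lower <= phi < upper:
            return line
    return None  # direction outside every range in the table


# Participant 2a speaks at phi = pi/6, so line 1 (Web camera 107a) is chosen.
line = select_line(math.pi / 6)
```

How a direction outside every range should be handled is not stated in the text, so the sketch simply returns `None` in that case.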

At the same time as the video data of the speaker captured as above, the control unit 101 transmits the video data generated by Web camera 107e to the counterpart conference terminal 10. In doing so, the control unit 101 writes "1" in the payload type and "(0, 0)-(512, 384)" in the section information of the RTP packets containing the video data of the speaker, and writes "2" in the payload type and "(0, 384)-(512, 768)" in the section information of the video data generated by Web camera 107e. Material data is also transmitted in parallel with this video data. For the material data, "4" is written in the payload type and "(512, 0)-(1024, 768)" in the section information.

The counterpart conference terminal 10 receives both sets of video data. The video display unit 105 reads the section information of the received data and displays the content of the data in the corresponding region of the video display unit 105. As shown in FIG. 4, coordinates on the display surface of the video display unit 105 are set with the upper left of the figure as the origin. For example, the section information "(0, 0)-(512, 384)" represents the region enclosed by the four points (0, 0), (512, 0), (512, 384), and (0, 384). Accordingly, the video data from Web camera 107a is displayed in the speaker display section 105a, the video data from Web camera 107e in the conference room display section 105b, and the material data in the material display section 105c.
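The relationship between payload type, section information, and display region can be sketched as follows in Python. The rectangles and payload type values are those given in the text; the names `LAYOUT`, `corners`, and `region_size` and the tuple representation are assumptions for illustration.

```python
# Section information per payload type, as (x0, y0, x1, y1) with the origin
# at the upper left of the 1024 x 768 display.
LAYOUT = {
    1: (0, 0, 512, 384),     # speaker video -> speaker display section 105a
    2: (0, 384, 512, 768),   # room video -> conference room display section 105b
    4: (512, 0, 1024, 768),  # material data -> material display section 105c
}


def corners(rect):
    """Return the four corner points of a section, clockwise from the upper left."""
    x0, y0, x1, y1 = rect
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


def region_size(rect):
    """Return (width, height) of a section in pixels."""
    x0, y0, x1, y1 = rect
    return (x1 - x0, y1 - y0)


speaker_corners = corners(LAYOUT[1])
total_pixels = sum(w * h for w, h in map(region_size, LAYOUT.values()))
```

In this layout the three sections together cover exactly the 1024 × 768 display area, which is a quick consistency check on the coordinates.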

Next, the available bandwidth measurement function is described with reference to the flowchart shown in FIG. 8. The available bandwidth measurement function measures the maximum communication bandwidth that can be used on the communication network 20 when exchanging data with the counterpart conference terminal via the communication network 20. First, the control unit 101 determines the transmission interval to use when sending packets (step SC100). The first time the available bandwidth measurement process is performed, a predetermined transmission interval is selected. Next, the control unit 101 generates a plurality of packets from the test data stored in the ROM 103a and transmits them to the counterpart conference terminal at the selected transmission interval (step SC110). At this time, the control unit 101 writes the sequence number of each transmitted packet to the RAM 103b. The test data is Motion-JPEG video data prepared in advance, like the video data generated by the Web camera 107, and its content may be anything.

The control unit 101 of the counterpart conference terminal 10 receives the test data, writes the sequence number of each received packet into a reception notification message, and returns that message to the transmitting conference terminal. The control unit 101 of the transmitting conference terminal 10 receives the reception notification message returned from the counterpart conference terminal (step SC120) and calculates the packet loss rate for the transmission of the test data (number of packets not received / number of packets transmitted) (step SC130).
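As a minimal sketch of the calculation in step SC130, the loss rate can be derived by comparing the sequence numbers recorded at transmission with those listed in the reception notification message. The function name and the use of plain integer sequence numbers are assumptions made for illustration.

```python
from typing import Iterable


def packet_loss_rate(sent_seq: Iterable[int], acked_seq: Iterable[int]) -> float:
    """Return (packets not received) / (packets transmitted)."""
    sent = set(sent_seq)
    if not sent:
        raise ValueError("no packets were transmitted")
    # Packets whose sequence numbers are missing from the notification are lost.
    lost = sent - set(acked_seq)
    return len(lost) / len(sent)
```

A result of 0.0 corresponds to the “No” branch of step SC130; any positive value corresponds to “Yes”.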

If no packet loss occurred when the test data was transmitted at the predetermined transmission interval (step SC130; “No”), the control unit 101 performs the processing from step SC100 again. In that case, a transmission interval shorter by a predetermined ratio than the one used in the previous pass through steps SC100 to SC130 is selected in step SC100. As the transmission interval becomes shorter, the amount of data transmitted per unit time, that is, the transmission rate, becomes higher. It is generally known that the packet loss rate is low when the transmission rate is small relative to the available bandwidth, and that the packet loss rate rises as the transmission rate grows relative to the available bandwidth. Therefore, if packet loss occurs during the transmission of the test data, the transmission rate used at that time exceeded the available communication bandwidth.
The control unit 101 therefore repeats steps SC100 to SC130 while successively shortening the transmission interval between packets. When packet loss occurs (step SC130; “Yes”), it takes the transmission rate (amount of test data / time taken for transmission) of the test transmission immediately preceding the one in which the loss occurred as the bandwidth available at that point (unit: BPS; Byte/second) (step SC140).
The above is the available bandwidth measurement process.
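The loop of steps SC100 to SC140 can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: the `probe` callable stands in for sending the test data and reading the reception notification, and `shrink_ratio` and `max_rounds` are hypothetical parameters (the text only says the interval is shortened by a predetermined ratio).

```python
from typing import Callable, Optional, Tuple

# Hypothetical probe: sends the test data at the given packet interval (seconds)
# and returns (bytes_sent, seconds_taken, packets_lost).
Probe = Callable[[float], Tuple[int, float, int]]


def measure_available_bandwidth(probe: Probe,
                                initial_interval: float,
                                shrink_ratio: float = 0.8,
                                max_rounds: int = 50) -> Optional[float]:
    """Return the last loss-free transmission rate in bytes per second."""
    if not 0.0 < shrink_ratio < 1.0:
        raise ValueError("shrink_ratio must be between 0 and 1")
    interval = initial_interval            # SC100, first pass
    last_good_rate: Optional[float] = None
    for _ in range(max_rounds):
        sent, elapsed, lost = probe(interval)   # SC110 to SC130
        if lost > 0:                            # SC130 "Yes"
            return last_good_rate               # SC140: rate of the previous pass
        last_good_rate = sent / elapsed
        interval *= shrink_ratio                # SC100, next pass: shorter interval
    # Assumption: if no loss is ever seen, report the highest rate tried.
    return last_good_rate
```

If loss already occurs on the first pass there is no earlier rate to report, so the sketch returns `None`; the text does not say how the embodiment handles that case.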

Next, the data transmission rate control function will be described.
The data transmission rate control function executes processing that controls the data transmission rate of the audio data and video data to be output, in accordance with the available communication bandwidth measured by the available bandwidth measurement function.

FIG. 9 shows an example of the transmission rate management table. The transmission rate management table defines the number of frames per unit time (frames per second) to be set in the Web cameras 107 and the compression rate of JPEG images in the Motion-JPEG format, each in correspondence with an available communication bandwidth. The control unit 101 checks the available communication bandwidth measured by the available bandwidth measurement function described above against the transmission rate management table stored in the ROM 103a, and thereby selects the number of frames per unit time and the JPEG image compression rate that correspond to the available bandwidth.

The control unit 101 sets the number of frames per unit time and the JPEG image compression rate to the selected values for all of the Web cameras 107. When data communication starts, each Web camera 107 generates video data at the set number of frames per unit time, and the control unit 101 compresses the generated video data at the selected JPEG image compression rate.
The above is the data transmission rate control function.
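The table lookup might look like the sketch below. The contents of FIG. 9 are not reproduced in this text, so every value in `RATE_TABLE` is invented for illustration, as are the names `RateEntry` and `select_rate`. The selection rule follows the operation description: choose the entry with the largest bandwidth that is smaller than the measured value.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional, Sequence


class RateEntry(NamedTuple):
    bandwidth: int        # available bandwidth in bytes per second
    fps: int              # frames per unit time set in the Web cameras
    compression: float    # JPEG compression rate (hypothetical scale)


# Hypothetical table, sorted by bandwidth in ascending order.
RATE_TABLE = [
    RateEntry(50_000, 5, 0.9),
    RateEntry(200_000, 10, 0.8),
    RateEntry(1_000_000, 30, 0.5),
]


def select_rate(measured: float,
                table: Sequence[RateEntry] = RATE_TABLE) -> Optional[RateEntry]:
    """Return the entry with the largest bandwidth strictly below `measured`."""
    thresholds = [entry.bandwidth for entry in table]
    # bisect_left counts the thresholds that are strictly smaller than `measured`.
    count = bisect_left(thresholds, measured)
    return table[count - 1] if count > 0 else None
```

When the measured bandwidth is below every entry, the sketch returns `None`; what the embodiment does in that case is not stated.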

(B: Operation)
Next, the operation performed by the conference terminals 10 when participants on the conference terminal 10A side and participants on the conference terminal 10B side hold a remote conference will be described. It is assumed that, at the start of the remote conference, each of the Web cameras 107 has already been installed so that it can shoot the corresponding area shown in FIG. 6(B). It is also assumed that, as shown in FIG. 7, the Web camera selection table has been written with the direction of the sound source associated with the line number to which each Web camera 107 is connected.

First, the control unit 101A performs parameter adjustment processing immediately after the start of data communication and at regular intervals thereafter. FIG. 10 is a flowchart showing the flow of the parameter adjustment processing performed by the data transmitting side (conference terminal 10A).

The control unit 101A first performs the available bandwidth measurement process (step SA100). Next, the control unit 101A checks the measured value against the transmission rate management table (see FIG. 9) stored in the ROM 103a and selects the number of frames and the JPEG image compression rate associated with the largest available bandwidth in the table that is smaller than the measured value. The control unit 101A sets the number of frames shot per unit time by all of the Web cameras 107 to the selected value (step SA110) and sets the JPEG image compression rate to the selected value (step SA120). In the subsequent step SA130, it determines whether a fixed time has elapsed since the parameter adjustment processing was started. While the determination result of step SA130 is “No”, step SA130 is repeated until the fixed time has elapsed. Once the fixed time has elapsed, the determination result of step SA130 becomes “Yes” and step SA140 is performed. In step SA140, the control unit 101A determines whether the data communication has ended. If the determination result of step SA140 is “No”, the processing from step SA100 is performed again. If the determination result of step SA140 is “Yes”, the control unit 101A ends the parameter adjustment processing.
Through the above processing, the control unit 101A performs the available bandwidth measurement process at regular intervals after the start of data communication, and the various parameters relating to data transmission are set each time in accordance with the measured available bandwidth. As a result, data can be transmitted in accordance with an available communication bandwidth that changes from moment to moment, so that data can be transmitted efficiently and without disruption.
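The flow of FIG. 10 can be sketched as a simple loop. All five callables are hypothetical stand-ins injected by the caller (bandwidth measurement, table lookup, camera and encoder configuration, end-of-communication check, and waiting); the sketch only shows the order of steps SA100 to SA140.

```python
import time
from typing import Callable, Optional, Tuple


def run_parameter_adjustment(
        measure: Callable[[], float],
        select: Callable[[float], Optional[Tuple[int, float]]],
        apply: Callable[[int, float], None],
        finished: Callable[[], bool],
        period_s: float,
        sleep: Callable[[float], None] = time.sleep) -> int:
    """Repeat measure, select, apply until communication ends; return the round count."""
    rounds = 0
    while True:
        bandwidth = measure()                 # SA100
        settings = select(bandwidth)          # look up the rate table
        if settings is not None:
            fps, compression = settings
            apply(fps, compression)           # SA110 and SA120
        rounds += 1
        sleep(period_s)                       # SA130: wait for the fixed time
        if finished():                        # SA140
            return rounds
```

Passing `sleep` as a parameter is a design choice made here so that the loop can be exercised without real delays.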

The following describes the operation performed by the conference terminal 10A when the participant 2a on the conference terminal 10A side speaks during the remote conference and the participants on the conference terminal 10B side listen.
FIG. 11 is a flowchart showing the flow of processing performed by the conference terminal 10 during data communication. When the participant 2a speaks, the microphone array 106A picks up the voice and generates audio data (step SB100). The control unit 101A generates sound source direction information on the basis of the generated audio data (step SB110). In this case, since the participant 2a is located at the position shown in FIG. 5, the sound source direction information is π/6, which is the direction (polar angle) of the participant 2a. The audio data generated by the microphone array 106A and the sound source direction information are written into the RAM 103bA.

The control unit 101A checks the sound source direction information written in the RAM 103bA against the Web camera selection table and thereby identifies one of the line numbers to which the Web cameras 107 are connected (line 1 in this case) (step SB120). Next, the control unit 101A causes the Web camera 107aA identified in step SB120 to generate video data representing a detailed image of the participant 2a (step SB130).
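Step SB120 amounts to a range lookup from polar angle to line number. The sketch below assumes four speaker cameras covering equal quarter sectors of a half plane; the actual contents of the camera table are not given in this text, so the ranges in `CAMERA_TABLE` are purely illustrative, chosen only so that π/6 maps to line 1 as in the example.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical table: (lower bound, upper bound, line number), angles in radians.
CAMERA_TABLE: List[Tuple[float, float, int]] = [
    (0.0, math.pi / 4, 1),
    (math.pi / 4, math.pi / 2, 2),
    (math.pi / 2, 3 * math.pi / 4, 3),
    (3 * math.pi / 4, math.pi, 4),
]


def select_camera_line(azimuth: float) -> Optional[int]:
    """Return the line whose angular range contains `azimuth`, if any."""
    for lower, upper, line in CAMERA_TABLE:
        if lower <= azimuth < upper:   # half-open ranges avoid double matches
            return line
    return None
```

A direction outside every range yields `None`, in which case a caller might fall back to the conference room camera; that fallback is an assumption, not something the text specifies.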

The control unit 101A transmits the speaker video data generated by the Web camera 107aA, the conference room video data generated by the Web camera 107eA, the audio data, and the material data to the conference terminal 10B (step SB140).

The conference terminal 10B receives the video data representing the speaker video and the conference room video, the audio data, and the material data. The audio data is reproduced from the speaker 108B. For the video data and the material data, the control unit 101B reads the section information carried by each piece of data. The section information specifies, in coordinates, the area of the video display unit 105B in which the respective data is to be displayed. The video display unit 105B displays the content of each piece of data in the designated area on the basis of the section information.

Through the above processing, a participant using the counterpart conference terminal 10B can listen to the voice of the speaker using the conference terminal 10A from the speaker 108B while viewing the material data presented by the speaker, the speaker's facial expression, and so on.
(C: Modifications)
An embodiment of the present invention has been described above, but the present invention can also be implemented in the various forms described below.

(1) In the above embodiment, the video selection function, the available bandwidth measurement function, and the data transmission rate control function are provided in a conference terminal, but they are of course not limited to conference terminals. For example, they may be applied to a server device that generates video data and audio data and provides them to client devices.

(2) In the above embodiment, the functions characteristic of the conference terminal according to the present invention are implemented as software modules, but the conference terminal according to the present invention may instead be configured by combining hardware modules that perform those functions.

(3) In the above embodiment, RTP is used as the application layer communication protocol for communicating video data and audio data, but other communication protocols may of course be used. In short, any communication protocol may be used as long as it writes video data or audio data, a predetermined length of time at a time, into the payload portion of a data block having a predetermined header portion and a payload portion, and transmits that data block. Likewise, although UDP is used as the transport layer communication protocol in the above embodiment, TCP may be used instead. Similarly, the network layer communication protocol is not limited to IP.

(4) In the above embodiment, video data, audio data, and material data are transmitted and received in parallel, but the types of data are not limited to these. The material data need not be transmitted and received.

(5) In the above embodiment, the conference terminal 10A and the conference terminal 10B are connected to the communication network 20 by wire, but the communication network 20 may of course be a wireless packet communication network such as a wireless LAN (Local Area Network), with the conference terminals 10A and 10B connected to that wireless packet communication network. Also, although the communication network 20 is the Internet in the above embodiment, it may of course be a LAN. In short, any communication network may be used as long as it has the function of mediating communication performed in accordance with a predetermined communication protocol.

(6) In the above embodiment, the control program that causes the control unit 101 to realize the functions characteristic of the communication apparatus according to the present invention is written in the ROM 103a in advance. However, the control program may be distributed recorded on a computer-readable recording medium such as a CD-ROM or DVD, or may of course be distributed by download via a telecommunication line such as the Internet.

(7) In the above embodiment, the video data of the entire conference room is controlled with the same resolution, number of frames per unit time, and JPEG image compression rate as the video data of the speaker. However, the video of the entire conference room may, as necessary, be sent with a lower number of frames per unit time or a higher JPEG image compression rate, or may be sent as still images at regular intervals.

(8) In the above embodiment, the video data of the speaker is a moving image, but still images may be transmitted as necessary or depending on the line conditions of the communication network.

(9) In the above embodiment, the speaker is identified by using a microphone array. However, the method of identifying the speaker is not limited to a microphone array. For example, a plurality of microphones (preferably as many as there are participants) may be placed in front of the participants in the conference room; the audio data with the highest volume level among the sounds picked up by those microphones is selected, the participant corresponding to the microphone that generated the selected audio data is judged to be the speaker, and the camera directed at the identified speaker is selected.
Alternatively, as many buttons as there are participants may be provided as the operation unit 104, with each participant pressing his or her button before speaking. In that case, the control unit 101 receives the signal output by the pressed button, identifies the speaker, and selects the Web camera 107 directed at that speaker.
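The microphone-per-participant variant can be sketched as picking the microphone with the highest level in the current block of samples. The text does not define how the volume level is measured; root mean square (RMS) amplitude is used here as one common choice, and the function names are hypothetical.

```python
import math
from typing import Mapping, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of one block of samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loudest_microphone(frames: Mapping[int, Sequence[float]]) -> int:
    """Return the id of the microphone whose current block is loudest."""
    if not frames:
        raise ValueError("no microphone input")
    return max(frames, key=lambda mic: rms(frames[mic]))
```

The returned microphone id would then be mapped to the line number of the camera pointed at that participant, for example through a dictionary maintained alongside the seating arrangement.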

(10) In the above embodiment, the Web cameras generate video data in the Motion-JPEG format. However, the video recording format is not limited to Motion-JPEG, and other formats such as MPEG may be used.

(11) In the above embodiment, five Web cameras 107 are installed, but the number of Web cameras 107 is not limited to five and may be determined as appropriate according to, for example, the number of participants. The same applies to the number of lines (terminals) provided on the conference terminal 10 for connecting the Web cameras 107; it suffices that more lines are provided than the number of Web cameras 107 installed.

(12) In the above embodiment, the participants manually adjust the shooting areas of the Web cameras 107 in advance. However, the shooting areas of the Web cameras 107 may instead be adjusted under the control of the control unit 101. In that case, an embodiment such as the following is possible.
The conference terminal 10 is provided with drive means for adjusting the shooting area by rotating each Web camera horizontally and vertically, and the drive means can be controlled by the control unit 101. In addition, an “arrangement template” is stored in the ROM 103a in advance. FIG. 12 shows an example of the arrangement template. A plurality of patterns A, B, and C are set in the arrangement template. For each pattern, the positions of the participants and the Web camera shooting directions are defined in association with each other. Here, the positions of the participants are given as a diagram schematically showing the position of each participant relative to the microphone array 106, while the Web camera shooting directions are the shooting directions of the Web cameras 107 connected to the respective lines. The shooting directions are expressed as polar angles in the same polar coordinate system that the microphone array uses when generating the sound source direction information.
When a conference is to be started, the video display unit 105 displays the template images written in the “participant positions” field of the arrangement template. A participant using the conference terminal selects, from the plurality of patterns displayed on the video display unit 105, the one that matches the number of participants and the seating positions in the conference. When a particular pattern is selected, the control unit 101 drives the drive means so that each of the Web cameras 107 connected to the respective lines is pointed in the Web camera shooting direction defined in the selected pattern.
With the embodiment described above, once the conference terminal 10 has been installed in a conference room and templates matching the installation position have been created, participants are spared the trouble of manually resetting the positions of the Web cameras every time a conference is subsequently held.
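Applying a selected pattern can be sketched as a lookup followed by one drive command per line. The angles in `ARRANGEMENT_TEMPLATES` are invented, since FIG. 12 is not reproduced in this text, and `rotate` is a hypothetical stand-in for the drive means. Only the horizontal polar angle is modeled; vertical rotation is omitted for brevity.

```python
import math
from typing import Callable, Dict, Mapping

# Hypothetical patterns: line number -> shooting direction (polar angle, radians).
ARRANGEMENT_TEMPLATES: Dict[str, Dict[int, float]] = {
    "A": {1: math.pi / 6, 2: math.pi / 2, 3: 5 * math.pi / 6},
    "B": {1: math.pi / 4, 2: 3 * math.pi / 4},
    "C": {1: math.pi / 2},
}


def apply_arrangement(
        pattern: str,
        rotate: Callable[[int, float], None],
        templates: Mapping[str, Mapping[int, float]] = ARRANGEMENT_TEMPLATES) -> int:
    """Point every camera of the chosen pattern; return how many were driven."""
    if pattern not in templates:
        raise KeyError(f"unknown arrangement pattern: {pattern!r}")
    directions = templates[pattern]
    for line in sorted(directions):
        rotate(line, directions[line])   # drive the camera on this line
    return len(directions)
```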

(13) In the above embodiment, the display surface of the video display unit 105 is partitioned as shown in FIG. 4. However, the partitioning of the video display unit 105 is not limited to that form. For example, the partition templates shown in FIG. 13 may be stored in the ROM 103a. The partition templates comprise a plurality of templates, each schematically showing a way of partitioning the display surface of the video display unit 105, defined according to the number of people taking part in the conference. In each template, a section containing a number displays the video of the Web camera 107 connected to the line with that number (the video of an individual participant). A section labeled, for example, 1/2/3 displays the video of any one of the Web cameras 107 connected to lines 1 to 3. A section labeled “conference room” displays the video of the Web camera 107e connected to line 5, like the conference room display unit 105b in the above embodiment. A section labeled “material” is the same as the material display unit 105c in the above embodiment.
When a conference starts, a participant selects a particular template according to the number of people taking part. When a template is selected, the video display unit 105 is partitioned as defined in the selected template, and the video of each Web camera 107 and the material are displayed in the respective sections as defined in the template.
In this way, the information that the participants need can be displayed on the video display unit 105 in a layout suited to the situation at the time.
In addition, after a default template has been selected as described above, the template may be switched for a short period whenever a speaker begins to speak. That is, a speaker-switching template is selected separately in advance; when the speaker changes, the control unit 101 switches the partitioning of the video display unit 105 from the default template to the speaker-switching template for a predetermined time, and returns it to the default template once that time has elapsed. For example, suppose there are three participants, template A has been selected as the default template, and template C as the speaker-switching template. When participant A starts speaking, the video of participant A, the speaker, is displayed large in the center of the video display unit 105 as shown in template C; after that, the video of all participants, the video of the entire conference room, and the material are displayed as shown in template A. Similarly, when participant B speaks, the video of participant B is displayed large in the center of the screen for the predetermined time after he or she starts speaking. Switching templates during the conference in this way makes it possible to signal that the speaker has changed and to inform the participants which participant has started speaking.
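The temporary template switch can be modeled as a small state holder that records when the speaker last changed. The class name, the `hold_s` parameter, and the choice to pass timestamps in explicitly are assumptions made for this sketch.

```python
from typing import Optional


class LayoutSwitcher:
    """Show the speaker-switching template for `hold_s` seconds after a change."""

    def __init__(self, default: str, speaker_template: str, hold_s: float) -> None:
        self.default = default
        self.speaker_template = speaker_template
        self.hold_s = hold_s
        self._speaker: Optional[str] = None
        self._switched_at: Optional[float] = None

    def on_speaker(self, speaker: str, now: float) -> None:
        # Only a change of speaker restarts the hold period.
        if speaker != self._speaker:
            self._speaker = speaker
            self._switched_at = now

    def current_template(self, now: float) -> str:
        if self._switched_at is not None and now - self._switched_at < self.hold_s:
            return self.speaker_template
        return self.default
```

With template A as the default and template C for speaker changes, `current_template` returns "C" during the hold period after a new speaker is detected and "A" otherwise.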

(14) In the above embodiment, the video display unit 105 displays the speaker and conference room video data and the material data in the partitioning shown in FIG. 4. However, the partitioning of the video display unit 105 is not limited to that form, and the size and relative positions of the display areas for the respective data may be varied in many other ways.

(15) In the above embodiment, the area in which each piece of video data is displayed on the video display unit 105 is specified by including section information in the header portion of the RTP packet. However, the conference terminal 10 on the receiving side may instead control in which area of the video display unit 105 each piece of video data is displayed. In that case, the counterpart conference terminal 10 partitions its video display unit 105 into a plurality of areas in advance and associates a type of video (speaker video, conference room video) with each area. On receiving video data, it reads the payload type of the packet, determines the type of video from the payload type, and outputs the video to the corresponding area of the video display unit 105.
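A receiver-side dispatch of this kind can be sketched as below. The payload type is the low 7 bits of the second byte of the fixed RTP header; the mapping from payload type values to display areas is hypothetical (96 and 97 are simply values from the dynamic range), as are the function and area names.

```python
from typing import Dict, Optional

# Hypothetical assignment of payload types to display areas.
REGION_BY_PAYLOAD_TYPE: Dict[int, str] = {
    96: "speaker",
    97: "conference_room",
}


def rtp_payload_type(packet: bytes) -> int:
    """Extract the 7-bit payload type from the fixed RTP header."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    if packet[0] >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    return packet[1] & 0x7F   # mask off the marker bit


def region_for_packet(packet: bytes) -> Optional[str]:
    """Return the display area for this packet, or None if the type is unknown."""
    return REGION_BY_PAYLOAD_TYPE.get(rtp_payload_type(packet))
```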

(16) In the above embodiment, the display surface of the video display unit 105 is partitioned and a plurality of pieces of data are displayed. However, the display surface of the video display unit 105 need not be partitioned, for example when the speaker's video is displayed over the entire display surface.

FIG. 1 is a block diagram showing the configuration of a conference system including a conference terminal according to the present invention.
FIG. 2 is a diagram showing the structure of an RTP packet.
FIG. 3 is a block diagram showing the configuration of the conference terminal 10.
FIG. 4 is a diagram showing an example of the video display unit.
FIG. 5 is a diagram explaining the detection of the sound source direction by the microphone array.
FIG. 6(A) is a diagram showing the positional relationship between the conference terminal and the participants in a conference room.
FIG. 6(B) is a diagram showing the positional relationship of the participants as seen from the conference terminal side.
FIG. 7 is a diagram showing an example of the Web camera selection table.
FIG. 8 is a flowchart showing the flow of the available bandwidth measurement process.
FIG. 9 is a diagram showing an example of the transmission rate management table.
FIG. 10 is a flowchart showing the flow of the parameter adjustment processing.
FIG. 11 is a flowchart showing the flow of data transfer.
FIG. 12 is a diagram showing an example of the arrangement template.
FIG. 13 is a diagram showing an example of the partition template.

Explanation of symbols

1 … conference system, 2 … participant, 3 … desk, 10, 10A, 10B … conference terminal, 20 … communication network, 101 … control unit, 102 … communication IF unit, 103 … storage unit (103a … ROM, 103b … RAM), 104 … operation unit, 105 … video display unit (105a … speaker display unit, 105b … conference room display unit, 105c … material display unit), 106 … microphone array, 107 … Web camera, 108 … speaker, 109 … bus

Claims

A communication apparatus comprising:
a plurality of photographing means, each of which has a photographing range set for it and outputs a video signal representing the image within that photographing range;
audio signal generation means for generating an audio signal from collected sound;
sound source direction specifying means for specifying the direction of a sound source on the basis of the audio signal generated by the audio signal generation means and outputting sound source direction information indicating the specified direction;
selection means for selecting the photographing means corresponding to the sound source direction information; and
output means for outputting the video signal output by the photographing means selected by the selection means and the audio signal generated by the audio signal generation means to another terminal device via a network.

A communication apparatus comprising:
a plurality of photographing means, each having a set photographing range and each outputting a video signal representing an image within that range;
audio signal generation means for generating an audio signal of collected sound;
sound source direction specifying means for specifying the direction of a sound source based on the audio signal generated by the audio signal generation means, and outputting sound source direction information indicating the specified direction;
selecting means for selecting the photographing means corresponding to the sound source direction information;
audio extraction means for extracting, from the audio signal generated by the audio signal generation means, the audio signal corresponding to the sound source direction information; and
output means for outputting, to another terminal device via a network, the video signal output by the photographing means selected by the selecting means and the audio signal extracted by the audio extraction means.

The communication apparatus according to claim 2, wherein the audio signal generation means includes a microphone array in which a plurality of microphones are arranged; the audio extraction means includes a sound collection direction control device that controls the sound collection direction of the microphone array; and the sound source direction specifying means includes a direction specifying device that specifies the direction from the sound collection direction controlled by the sound collection direction control device and the volume level of the audio signal output from the microphone array.
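
One way to picture the direction specifying device is as a scan: the sound collection direction is steered to each candidate direction in turn, the volume level of the array output is measured, and the loudest direction is taken as the sound source direction. The sketch below assumes a hypothetical `capture` callback that returns the array output for a given steering direction, and it uses RMS as the volume level; neither is specified by the claim.

```python
import math
from typing import Callable, Iterable, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def specify_direction(candidates_deg: Iterable[float],
                      capture: Callable[[float], Sequence[float]]) -> float:
    """Steer to each candidate direction and return the one with the highest level."""
    return max(candidates_deg, key=lambda d: rms(capture(d)))


def fake_capture(direction_deg: float) -> Sequence[float]:
    """Stand-in for the microphone array: a tone that is loudest when steered to 90 degrees."""
    gain = max(0.0, 1.0 - abs(direction_deg - 90.0) / 90.0)
    return [gain * math.sin(2.0 * math.pi * k / 16.0) for k in range(64)]


estimated = specify_direction(range(0, 181, 30), fake_capture)
```

In a real terminal the `capture` step would be performed by the sound collection direction control device; here it is simulated so the scan logic can be run on its own.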

The communication apparatus as described in …, wherein the audio signal generation means has a plurality of microphones whose positions can be set individually; the sound source direction specifying means selects, from the volume levels of the audio signals output by the respective microphones, the one microphone recognized as lying in the sound source direction, and outputs information indicating the selected microphone as the sound source direction information; and the audio extraction means extracts the audio signal of the microphone selected by the sound source direction specifying means.

The communication apparatus according to claim 1, wherein the output means sets a specific area, among all the areas constituting one screen, as the display area for the video signal output by the photographing means selected by the selecting means, and outputs a video signal synthesized so that other video signals are displayed in the remaining areas.

The communication apparatus according to claim 5, further comprising:
a plurality of pieces of division data, each indicating how to divide the entire area constituting one screen of the video signal output by the output means and which of the divided areas is set as the specific area; and
division data selection means for selecting the division data,
wherein the output means divides one screen according to the division data selected by the division data selection means and recognizes the area indicated by that division data as the specific area.
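
The role of the division data can be sketched as follows: each piece lists the areas of one screen and marks which of them is the specific area. The `DivisionData` structure, the use of fractional coordinates, and the two example layouts are assumptions for illustration and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[int, int, int, int]               # x, y, width, height in pixels
FracRect = Tuple[float, float, float, float]   # the same, as fractions of the screen


@dataclass(frozen=True)
class DivisionData:
    """Hypothetical division data: how to split one screen, and which area is specific."""
    areas: Tuple[FracRect, ...]
    specific_index: int


def divide_screen(division: DivisionData, width: int, height: int) -> Tuple[Rect, List[Rect]]:
    """Return (specific area, remaining areas) in pixels for a screen of the given size."""
    rects = [(round(x * width), round(y * height), round(w * width), round(h * height))
             for x, y, w, h in division.areas]
    specific = rects[division.specific_index]
    others = [r for i, r in enumerate(rects) if i != division.specific_index]
    return specific, others


# Assumed examples of selectable division data.
DIVISIONS = {
    "large_left": DivisionData(((0.0, 0.0, 0.75, 1.0),
                                (0.75, 0.0, 0.25, 0.5),
                                (0.75, 0.5, 0.25, 0.5)), specific_index=0),
    "equal_halves": DivisionData(((0.0, 0.0, 0.5, 1.0),
                                  (0.5, 0.0, 0.5, 1.0)), specific_index=1),
}

specific, others = divide_screen(DIVISIONS["large_left"], 640, 480)
```

The video from the selected photographing means would be drawn into `specific`, and the other video signals into `others`.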

The communication apparatus according to any one of claims 1 to 6, further comprising:
a table storing a plurality of pattern images, each showing one of a plurality of arrangements of sound source positions in the region to be photographed, and storing the correspondence between the sound source positions in each pattern image and the photographing range of each photographing means;
pattern image selection means for selecting the pattern image; and
photographing range control means for determining, with reference to the table, each photographing range corresponding to the pattern image selected by the pattern image selection means, and controlling each photographing means so as to match the determined photographing range.
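
A minimal sketch of the photographing range control, assuming the table is keyed by a pattern name and maps each camera to an angular range. The pattern names, the angle values, and the `Camera.set_range` method are hypothetical; the claim states only that the table relates each pattern image to the photographing range of each photographing means.

```python
from typing import Dict, Iterable, Optional, Tuple

Range = Tuple[float, float]  # start and end of a photographing range, in degrees

# Assumed table: pattern name -> {camera id -> photographing range}.
RANGE_TABLE: Dict[str, Dict[int, Range]] = {
    "participants_in_a_row": {1: (0.0, 60.0), 2: (60.0, 120.0), 3: (120.0, 180.0)},
    "participants_at_both_ends": {1: (0.0, 45.0), 2: (45.0, 135.0), 3: (135.0, 180.0)},
}


class Camera:
    """Hypothetical stand-in for one photographing means with an adjustable range."""

    def __init__(self, camera_id: int) -> None:
        self.camera_id = camera_id
        self.range_deg: Optional[Range] = None

    def set_range(self, range_deg: Range) -> None:
        self.range_deg = range_deg


def apply_pattern(pattern: str, cameras: Iterable[Camera],
                  table: Dict[str, Dict[int, Range]] = RANGE_TABLE) -> None:
    """Look up the selected pattern and set every camera to its tabulated range."""
    ranges = table[pattern]  # raises KeyError for an unknown pattern
    for cam in cameras:
        cam.set_range(ranges[cam.camera_id])


cameras = [Camera(i) for i in (1, 2, 3)]
apply_pattern("participants_at_both_ends", cameras)
```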