JP4716962B2

JP4716962B2 - CONFERENCE SYSTEM, CONFERENCE SERVER, AND CONFERENCE SYSTEM DISTRIBUTION VOICE CONTROL METHOD

Info

Publication number: JP4716962B2
Application number: JP2006256065A
Authority: JP
Inventors: 浩久保木; 佐藤　　達也
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2006-09-21
Filing date: 2006-09-21
Publication date: 2011-07-06
Anticipated expiration: 2026-09-21
Also published as: JP2008079024A

Description

The present invention relates to a conference system, a conference server, and a distributed-audio control method for a conference system, and can be applied, for example, to the audio signal processing of an audio conference system or a video conference system.

In general, in an audio conference system, if each participant is sent a mix of the voices of everyone other than that participant, individual voices become hard to make out. For this reason, even when a conference has many participants, it has been proposed that the conference server preferentially mix and distribute only a few voices. The voices to be mixed are determined, for example, as follows. The voices of the few participants with the highest audio levels, measured over the audio samples of a fixed period (the samples carried in a packet), are extracted, and in the next period only those voices are mixed and delivered to the conference participants (see Patent Document 1).
Patent Document 1: JP 2005-504450 (published Japanese translation of a PCT application)

However, in a scheme that determines from the samples of one period which audio signals to mix and distribute in the next period, a participant who suddenly speaks up loudly is not a mixing target in the current period if his or her voice was quiet in the preceding period, so the beginning of the remark is not mixed.

To prevent audio from going unmixed immediately after such an abrupt change in level, the period could be shortened so that the unmixed interval becomes shorter. Doing so, however, increases the processing load on the conference server and the conference clients. Moreover, because the voices to be mixed are re-evaluated at short intervals, the combination of mixed voices switches frequently, which may well strike participants as unnatural or be perceived as degraded sound quality.

Further, when mixing targets are determined from audio level alone, an audio signal concerning the progress of the conference, such as a remark by the chairperson, is excluded from mixing if its level is low, so participants may not be adequately informed of the conference's progress.

What is desired, therefore, is a conference system, a conference server, and a distributed-audio control method for a conference system that, when only some of the audio signals are mixed, can reliably include high-priority audio signals among the mixing targets while still ensuring sufficient quality for the mixing targets.

A first aspect of the present invention is a conference server that synthesizes only a predetermined number of the audio signals from a plurality of conference terminals, transmits the synthesized audio signal to all the conference terminals, and determines, based on the level of each audio signal in the current period, the predetermined number of audio signals to be synthesized in the next period. The conference server comprises (1) synthesis target review means for determining, in the current period, for each audio signal that is not a synthesis target, whether it needs to be changed to a synthesis target, and (2) synthesis target adding means for adding, when an audio signal has been determined to need such a change, that audio signal to the synthesis targets for the remainder of the current period following the determination.

A second aspect of the present invention is a conference system including a plurality of conference terminals that transmit audio signals and receive a synthesized audio signal, and a conference server that synthesizes only a predetermined number of the audio signals from the plurality of conference terminals and transmits the synthesized audio signal to all the conference terminals, wherein the conference server of the first aspect is used as the conference server.

A third aspect of the present invention is a distributed-audio control method for a conference system in which a conference server synthesizes only a predetermined number of the audio signals from a plurality of conference terminals, transmits the synthesized audio signal to all the conference terminals, and determines, based on the level of each audio signal in the current period, the predetermined number of audio signals to be synthesized in the next period. In this method the conference server determines, in the current period, for each audio signal that is not a synthesis target, whether it needs to be changed to a synthesis target and, when an audio signal has been determined to need such a change, adds that audio signal to the synthesis targets for the remainder of the current period following the determination.

According to the present invention, when only some of the audio signals are mixed, high-priority audio signals can be reliably included among the mixing targets, and sufficient quality can be ensured for the mixing targets.

(A) First Embodiment
A first embodiment, in which the conference system, conference server, and distributed-audio control method for a conference system according to the present invention are applied to an audio conference system, is described below with reference to the drawings.

(A-1) Configuration of the First Embodiment
FIG. 2 is a block diagram showing the overall configuration of the audio conference system of the first embodiment.

In FIG. 2, the audio conference system 1 of the first embodiment has a plurality of conference clients 10-1 to 10-5 (FIG. 2 shows the case of five) and a conference server 20. Audio signals from all the conference clients 10-1 to 10-5 (transmitted, for example, in RTP packets) arrive at the conference server 20 via the network 2. The conference server 20 selects and mixes a predetermined number of audio signals, smaller than the total number of participants, and the mixed audio signal (likewise transmitted, for example, in RTP packets) is distributed to each of the conference clients 10-1 to 10-5 via the network 2.

FIG. 2 shows an example in which the audio signals B, E, and F from the conference clients 10-2, 10-4, and 10-5 have been selected for mixing.

A conference client 10 (10-1 to 10-5) may be, for example, a dedicated conference terminal, or it may be an information processing device (for example, a personal computer) to which headphones can be attached, or one equipped with a microphone and a speaker, on which audio conferencing software is installed. Its configuration may be the same as that of an existing conference client. Functionally, the conference client 10 has the configuration illustrated in FIG. 3.

In FIG. 3, the microphone 11 captures the voice of a conference participant and outputs an audio signal, and the analog/digital converter (A/D) 12 converts the captured audio signal (an analog signal) into a digital signal.

The communication control unit 13 divides the audio signal (audio stream) from the analog/digital converter 12 into blocks of a predetermined number of samples, inserts each block into the payload of an RTP packet, and transmits the RTP packets one after another to the conference server 20. The communication control unit 13 also extracts the audio signal carried in RTP packets arriving from the conference server 20, restores it to an audio stream, and passes it to the digital/analog converter (D/A) 14.
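As a rough illustration of the splitting and reassembly described above, the following Python sketch cuts a sample stream into fixed-size payload blocks and joins received blocks back into a stream. The function names and the choice to hold back a trailing partial block are assumptions made for this example; real RTP header construction and network I/O are omitted.

```python
from typing import Iterator, List, Sequence


def split_into_payloads(samples: Sequence[int],
                        samples_per_packet: int) -> Iterator[List[int]]:
    """Yield successive blocks of exactly `samples_per_packet` samples.

    A trailing partial block is not emitted (assumption: the sender waits
    until enough samples have accumulated to fill the next packet).
    """
    if samples_per_packet <= 0:
        raise ValueError("samples_per_packet must be positive")
    usable = len(samples) - len(samples) % samples_per_packet
    for start in range(0, usable, samples_per_packet):
        yield list(samples[start:start + samples_per_packet])


def reassemble(payloads: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate received payload blocks back into one sample stream."""
    stream: List[int] = []
    for payload in payloads:
        stream.extend(payload)
    return stream


blocks = list(split_into_payloads(list(range(10)), samples_per_packet=4))
restored = reassemble(blocks)
```

With ten input samples and four samples per packet, two full blocks are produced and the last two samples stay pending.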

The digital/analog converter 14 converts the audio signal it is given (a digital signal) into an analog signal, and the speaker 15 outputs the audio signal from the digital/analog converter 14 as sound.

The conference control unit 16 controls joining and leaving a conference in response to, for example, commands from a key input unit (not shown). Under this control the communication control unit 13, for example, enters a state in which it exchanges audio signals (RTP packets) with the conference server 20, or ends that state.

In the first embodiment, the communication control units 13 of the conference clients 10-1 to 10-5 are arranged, for example through negotiation with the conference server 20 at the time of joining the conference, to transmit RTP packets containing audio signals of the same time span. In other words, all the conference clients 10-1 to 10-5 transmit their RTP packets in synchronization.

The conference server 20 may be realized by any server on the network. For example, when an ISP provides an audio conference service, a provider server may function as the conference server. Functionally, the conference server 20 has the configuration shown in FIG. 1.

In FIG. 1, each of the audio receiving units 21-1 to 21-5 receives RTP packets from the corresponding conference client 10-1 to 10-5 and extracts the audio signal they carry. Each audio receiving unit 21-1 to 21-5 has a built-in audio buffer, temporarily stores the extracted audio signal in it, and supplies the signal at predetermined timing to the corresponding audio level measurement unit 22-1 to 22-5 and to the mixing unit 24.

Each of the audio level measurement units 22-1 to 22-5 measures the long-term and short-term audio levels (audio power) of the audio signal received by the corresponding audio receiving unit 21-1 to 21-5. As the long-term audio level, for example, the sum of squares of all the audio samples in one RTP packet can be used. The short-term audio level can be regarded as the level of the audio being supplied to the mixing unit 24 at the present moment. For example, the sum of squares of a very small number of audio samples (about 10, say), ending with the sample currently about to be supplied to the mixing unit 24, can be used.
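The two level measures described above can be sketched in Python as follows. The function names and the default window of 10 samples are illustrative assumptions; the text only states that a sum of squares over one packet, or over a handful of the most recent samples, may be used.

```python
from typing import Sequence


def long_term_level(packet_samples: Sequence[int]) -> int:
    """Sum of squares of every audio sample carried in one RTP packet."""
    return sum(s * s for s in packet_samples)


def short_term_level(samples: Sequence[int], current_index: int,
                     window: int = 10) -> int:
    """Sum of squares of the last `window` samples up to and including
    the sample at `current_index` (fewer near the start of the buffer)."""
    start = max(0, current_index - window + 1)
    return sum(s * s for s in samples[start:current_index + 1])


packet = [0, 0, 3, 4]
whole_packet_power = long_term_level(packet)
recent_power = short_term_level(packet, current_index=3, window=2)
```

Because the two sums cover different numbers of samples, an implementation that compares them against a common threshold would probably normalize each by its sample count; that normalization is not described in the text and is left out here.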

Based on the long-term audio levels from the audio level measurement units 22-1 to 22-5, the mixing target determination unit 23 determines the predetermined number of audio signals to be mixed by the mixing unit 24 in the next RTP packet period. Based on the short-term audio levels, it also reviews whether any audio signal of the current RTP packet period should be mixed. The predetermined number is taken to be 3 in the following description, but it is not limited to this value, and the number itself may vary with the long-term audio levels (for example, every audio signal whose long-term audio level exceeds a threshold may be made a mixing target).

Of the five long-term audio levels from the audio level measurement units 22-1 to 22-5, the mixing target determination unit 23 selects the three audio signals with the highest levels for mixing. For the remaining audio signals, which are not being mixed, it monitors the short-term audio level, and when it judges that a short-term audio level has exceeded a threshold it switches that audio signal to a mixing target from the moment of that judgment. More than one audio signal may be switched to a mixing target if the condition is met.

An audio signal that has been switched to a mixing target may be mixed in addition to the three audio signals mixed so far, or it may replace the one of those three with the lowest long-term audio level (the lowest mixing rank) and become a mixing target itself (in which case it takes the highest mixing rank). The threshold against which the short-term audio level is compared may be a fixed value, or it may be varied adaptively according to the long-term audio levels of the three audio signals being mixed. For example, the threshold may be set to α times the average of the three long-term audio levels (α is a constant, for example greater than 1).
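The selection and review rules above can be sketched in Python as follows. All names, the stream labels, and the value α = 2.0 are assumptions for illustration, and the levels are assumed to be on a comparable scale (for example, normalized per sample), which the text does not specify.

```python
from typing import Dict, List


def select_top(long_levels: Dict[str, float], count: int = 3) -> List[str]:
    """Return the `count` stream ids with the highest long-term level,
    ordered from highest to lowest mixing rank."""
    return sorted(long_levels, key=lambda sid: long_levels[sid],
                  reverse=True)[:count]


def adaptive_threshold(long_levels: Dict[str, float], targets: List[str],
                       alpha: float = 2.0) -> float:
    """alpha times the mean long-term level of the current mixing targets."""
    return alpha * sum(long_levels[t] for t in targets) / len(targets)


def review_targets(targets: List[str], short_levels: Dict[str, float],
                   threshold: float, replace_lowest: bool = False) -> List[str]:
    """Promote every non-target stream whose short-term level exceeds
    the threshold, either in addition to the current targets or in place
    of the lowest-ranked one (the promoted stream then ranks first)."""
    original = set(targets)
    updated = list(targets)
    for sid, level in short_levels.items():
        if sid in original or level <= threshold:
            continue
        if replace_lowest and updated:
            updated.pop()           # drop the lowest-ranked target
            updated.insert(0, sid)  # promoted stream takes the top rank
        else:
            updated.append(sid)
    return updated


levels = {"A": 5.0, "B": 40.0, "C": 1.0, "E": 30.0, "F": 20.0}
targets = select_top(levels)
threshold = adaptive_threshold(levels, targets, alpha=2.0)
added = review_targets(targets, {"A": 100.0, "C": 2.0}, threshold)
replaced = review_targets(targets, {"A": 100.0, "C": 2.0}, threshold,
                          replace_lowest=True)
```

Here streams B, E, and F are the initial targets, and stream A is promoted because its short-term level exceeds the adaptive threshold while stream C is not.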

The mixing unit 24 mixes the plural audio signals that the mixing target determination unit 23 has determined to be mixing targets.
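The text does not say how the mixing itself is carried out. One common approach, shown here purely as an assumption, is to add the selected streams sample by sample and clip the result to the sample range (16-bit signed PCM is assumed in this sketch).

```python
from typing import List, Sequence


def mix(streams: Sequence[Sequence[int]],
        lo: int = -32768, hi: int = 32767) -> List[int]:
    """Add the streams sample by sample and clip to [lo, hi].

    The output is as long as the shortest input stream.
    """
    if not streams:
        return []
    length = min(len(s) for s in streams)
    return [max(lo, min(hi, sum(s[i] for s in streams)))
            for i in range(length)]


mixed = mix([[1000, 30000, -20000], [2000, 30000, -20000]])
```

The second and third output samples in this usage example saturate at the upper and lower limits instead of wrapping around.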

The mixed audio transmission unit 25 assembles an RTP packet whose payload contains the mixed audio signal output by the mixing unit 24 and broadcasts that RTP packet to all the conference clients 10-1 to 10-5.

Although omitted from FIG. 1, the conference server 20 also has a part that performs the control associated with the conference clients 10-1 to 10-5 joining and leaving the conference.

(A-2) Operation of the First Embodiment
Next, the operation of the audio conference system 1 according to the first embodiment is described.

Each of the conference clients 10-1 to 10-5 transmits RTP packets containing the audio signal of its own conference participant to the conference server 20, in synchronization with the other clients.

In the conference server 20, the RTP packets from the conference clients 10-1 to 10-5 are received by the corresponding audio receiving units 21-1 to 21-5, and the audio signals carried in the RTP packets are extracted.

The audio signals extracted by the audio receiving units 21-1 to 21-5 are supplied to the corresponding audio level measurement units 22-1 to 22-5, where the long-term and short-term audio levels (audio power) are measured and passed to the mixing target determination unit 23. The mixing target determination unit 23 determines that the three audio signals with the highest long-term audio levels are to be mixed when the next RTP packets are received. The mixing target determination unit 23 also monitors the short-term audio levels of the audio signals that are not being mixed, and when a short-term audio level exceeds the threshold it switches that audio signal to a mixing target from the moment of that judgment.

That is, as long as no short-term audio level exceeds the threshold, the mixing unit 24 mixes the predetermined audio signals that were determined from the long-term audio levels when the previous RTP packets were received. When the short-term audio level of an audio signal that is not being mixed exceeds the threshold, that audio signal is also mixed by the mixing unit 24.
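The behavior within a single RTP packet period can be sketched in Python as follows. The function name, the fixed threshold, the window size, and the use of plain addition without clipping are assumptions for this illustration; promoted streams are added to the existing targets (the additive variant) rather than replacing one.

```python
from typing import Dict, List, Tuple


def mix_period(packets: Dict[str, List[int]], targets: List[str],
               threshold: float, window: int = 10) -> Tuple[List[int], List[str]]:
    """Mix one packet period.

    `packets` maps each stream id to its samples for the current period
    (all the same length). `targets` are the streams chosen from the
    previous period. Returns the mixed samples and the targets chosen
    for the next period from the long-term (whole-packet) levels.
    """
    active = list(targets)
    length = len(next(iter(packets.values())))
    mixed: List[int] = []
    for i in range(length):
        start = max(0, i - window + 1)
        for sid, samples in packets.items():
            if sid in active:
                continue
            # Short-term level: sum of squares of the most recent samples.
            if sum(s * s for s in samples[start:i + 1]) > threshold:
                active.append(sid)  # mixed from this sample onward
        mixed.append(sum(packets[sid][i] for sid in active))
    long_levels = {sid: sum(s * s for s in samples)
                   for sid, samples in packets.items()}
    next_targets = sorted(long_levels, key=lambda sid: long_levels[sid],
                          reverse=True)[:len(targets)]
    return mixed, next_targets


period = {"A": [0, 0, 10, 10], "B": [1, 1, 1, 1]}
mixed, next_targets = mix_period(period, targets=["B"],
                                 threshold=50, window=2)
```

Stream A is silent at first and is not a target, but it enters the mix at the third sample, the moment its short-term level crosses the threshold, and it becomes the target for the next period on the strength of its long-term level.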

The mixed audio signal output by the mixing unit 24 is inserted into the payload of an RTP packet by the mixed audio transmission unit 25, and that RTP packet is broadcast to all the conference clients 10-1 to 10-5.

Each of the conference clients 10-1 to 10-5 extracts the audio signal (the mixed audio signal) from the received RTP packet and outputs it as sound.

(A-3) Effects of the First Embodiment
According to the first embodiment, for audio signals that are not mixing targets under the determination based on past long-term audio levels, the current audio level is monitored, and when that level rises abruptly the signal is made a mixing target from that moment. The audio of a participant who suddenly raises his or her voice to make a point can therefore be mixed from almost the moment the remark begins, and very little additional structure is needed to achieve this. The voice of such a participant can thus be mixed without being missed and without shortening the level measurement period.

In addition, because the mixing targets are basically determined from past long-term audio levels, they can be determined stably except in the case described above, and sufficient quality can be ensured for the mixing targets.

(B) Second Embodiment
Next, a second embodiment, in which the conference system, conference server, and distributed-audio control method for a conference system according to the present invention are applied to an audio conference system, is described with reference to the drawings.

The audio conference system of the second embodiment differs from that of the first embodiment in its conference server 20A. FIG. 4 is a block diagram showing the functional configuration of the conference server 20A of the second embodiment.

In FIG. 4, the conference server 20A of the second embodiment has specific pattern recognition units 26-1 to 26-5 in addition to the configuration of the first embodiment, and the determination method of its mixing target determination unit 23A is also somewhat changed from that of the first embodiment.

Each of the specific pattern recognition units 26-1 to 26-5 recognizes a specific pattern if one is contained in the current audio signal that the corresponding audio receiving unit 21-1 to 21-5 has extracted from a received RTP packet, and passes the recognition result to the mixing target determination unit 23A. Here, the specific patterns are phrases that indicate remarks by the chairperson and the like, such as "start of conference", "end of conference", "start of break", and "end of break".

For example, the specific pattern recognition units 26-1 to 26-5 store a reference audio signal waveform for each specific pattern and recognize the pattern by matching the reference waveform against the waveform of the audio signal being processed. As another example, the specific pattern recognition units 26-1 to 26-5 convert the input audio signal into a text string and recognize the specific pattern by morphological analysis.
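The text does not specify how the waveform matching is done. As one possible reading, the Python sketch below slides a stored reference waveform over the incoming signal and reports a match when the normalized correlation at some offset reaches a score threshold. The function names and the 0.9 threshold are assumptions, and matching real speech would likely need something more robust than raw waveform correlation.

```python
import math
from typing import Sequence


def normalized_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length sample windows."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def contains_pattern(signal: Sequence[float], reference: Sequence[float],
                     min_score: float = 0.9) -> bool:
    """True if the reference waveform matches the signal at any offset."""
    m = len(reference)
    if m == 0 or len(signal) < m:
        return False
    return any(
        normalized_correlation(signal[i:i + m], reference) >= min_score
        for i in range(len(signal) - m + 1)
    )


found = contains_pattern([0, 0, 1, 2, 3, 0], [1, 2, 3])
not_found = contains_pattern([1, 0, -1, 0, 1, 0], [1, 2, 3])
```

A match reported by such a check would be passed to the mixing target determination unit as the recognition result.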

A specific pattern need not be recognized from the audio signal of a single received RTP packet. It may be recognized from the time series of audio signals extracted from a plurality of received RTP packets, with the recognition result passed to the mixing target determination unit 23A at the moment of recognition.

The mixing target determination unit 23A of the second embodiment determines the mixing targets in the same way as in the first embodiment and, in addition, reviews whether any audio signal of the current RTP packet period should be mixed on the basis of the recognition results of the specific pattern recognition units 26-1 to 26-5. Even for an audio signal that is not a mixing target under the determination based on past long-term audio levels, the mixing target determination unit 23A switches it to a mixing target when a specific pattern is recognized in its audio signal for the current RTP packet period.

The second embodiment provides the same effects as the first embodiment and, in addition, the following effect.

That is, because audio signals carrying the specific patterns of a participant with a particular role in the conference are recognized and switched to mixing targets, a natural conference can be realized, and the effect on system load is small.

(C) Other Embodiments
The conference server in each of the above embodiments may realize its characteristic control in software or in hardware. When it is realized in hardware, implementing it as an IC chip further reduces the processing load and increases the speed.

The second embodiment above performs both the review of mixing targets based on the short-term audio level and the review of mixing targets based on specific patterns, but the conference server may be built to perform only the latter.

Each of the above embodiments provides the components needed to review the mixing targets (for example, the short-term audio level measurement components and the specific pattern recognition units) in a number equal to the maximum number of participants that can join a conference. Instead, only as many of these components as the difference between that maximum and the number of mixing targets may be provided, together with a switch that routes the audio signals of the participants who are not mixing targets to those components.
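The switching arrangement described above can be pictured with a small Python sketch: a pool of review components, sized as the maximum number of participants minus the number of mixing targets, is assigned to whichever streams are currently not mixing targets. The function name and the dictionary representation of the switch are assumptions made for this example.

```python
from typing import Dict, List


def assign_monitors(all_streams: List[str], targets: List[str],
                    monitor_count: int) -> Dict[int, str]:
    """Map each review component (by index) to one non-target stream."""
    non_targets = [sid for sid in all_streams if sid not in targets]
    if len(non_targets) > monitor_count:
        raise ValueError("not enough review components for non-target streams")
    return dict(enumerate(non_targets))


max_participants = 5
mixing_targets = ["B", "E", "F"]
routing = assign_monitors(["A", "B", "C", "E", "F"], mixing_targets,
                          monitor_count=max_participants - len(mixing_targets))
```

With five participants and three mixing targets, two review components suffice, and the routing is recomputed whenever the set of mixing targets changes.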

FIG. 1 is a block diagram showing the functional configuration of the conference server according to the first embodiment.
FIG. 2 is a block diagram showing the overall configuration of the audio conference system according to the first embodiment.
FIG. 3 is a block diagram showing the functional configuration of the conference client according to the first embodiment.
FIG. 4 is a block diagram showing the functional configuration of the conference server according to the second embodiment.

Explanation of symbols

1 ... audio conference system, 10-1 to 10-5 ... conference client, 20 ... conference server, 21-1 to 21-5 ... audio receiving unit, 22-1 to 22-5 ... audio level measurement unit, 23, 23A ... mixing target determination unit, 24 ... mixing unit, 25 ... mixed audio transmission unit, 26-1 to 26-5 ... specific pattern recognition unit.

Claims

A conference server that synthesizes only a predetermined number of the audio signals from a plurality of conference terminals, transmits the synthesized audio signal to all the conference terminals, and determines, based on the level of each audio signal in the current period, the predetermined number of audio signals to be synthesized in the next period, the conference server comprising:
synthesis target review means for determining, in the current period, for each audio signal that is not a synthesis target, whether it needs to be changed to a synthesis target; and
synthesis target adding means for adding, when an audio signal has been determined to need to be changed to a synthesis target, that audio signal to the synthesis targets in the current period from the determination onward.

The conference server according to claim 1, wherein the synthesis target review means checks, for each audio signal that is not a synthesis target, whether a short-term level measured over an interval sufficiently shorter than the period exceeds a threshold, and determines that an audio signal exceeding the threshold needs to be changed to a synthesis target.

The conference server according to claim 1, wherein the synthesis target review means checks, for each audio signal that is not a synthesis target, whether the waveform pattern of the audio signal matches a specific pattern stored in advance, and determines that an audio signal matching the specific pattern needs to be changed to a synthesis target.

A conference system including a plurality of conference terminals that transmit audio signals and receive a synthesized audio signal, and a conference server that synthesizes only a predetermined number of the audio signals from the plurality of conference terminals and transmits the synthesized audio signal to all the conference terminals,
wherein the conference server according to claim 1 is used as the conference server.

A distributed-audio control method for a conference system in which a conference server synthesizes only a predetermined number of the audio signals from a plurality of conference terminals, transmits the synthesized audio signal to all the conference terminals, and determines, based on the level of each audio signal in the current period, the predetermined number of audio signals to be synthesized in the next period,
wherein the conference server:
determines, in the current period, for each audio signal that is not a synthesis target, whether it needs to be changed to a synthesis target; and
adds, when an audio signal has been determined to need to be changed to a synthesis target, that audio signal to the synthesis targets in the current period from the determination onward.