JP2005080110A

JP2005080110A - Audio conference system, audio conference terminal, and program

Info

Publication number: JP2005080110A
Application number: JP2003310445A
Authority: JP
Inventors: Yukio Tada; 幸生多田
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2003-09-02
Filing date: 2003-09-02
Publication date: 2005-03-24

Abstract

PROBLEM TO BE SOLVED: To provide an audio conference system in which a plurality of persons can participate in an audio conference from one point without using a plurality of lines, and in which the speaker is easy to identify.

SOLUTION: The audio conference terminal has a microphone that collects the voice of a speaker and outputs voice information indicating that voice; an identification means that is provided near the microphone and outputs identification information based on owner information read from a recording medium on which the owner information for specifying the speaker is recorded; an identification information addition means that adds the identification information to the voice information; and a transmission means that transmits the voice information to which the identification information has been added.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to an audio conference system, and more particularly to an audio conference system in which a listener can easily distinguish between speakers.

In recent years, with the globalization and acceleration of business, communication conference systems that allow conferences to be held in real time between geographically distant points have become increasingly important.
When such a communication conference system is a so-called video conference system, which transmits images (moving images) together with voice, the listener can easily identify the speaker because the conference proceeds while the listener watches the faces of the other party. In an audio conference system without images, however, it has been difficult for the listener to identify the speaker.

As a technique for solving this problem, an audio conference system using ISDN (Integrated Services Digital Network) has been proposed that makes it easy to identify the speaker by means of (1) channel control means that allocates a communication channel to each speaker, and (2) changing the voice output position for each speaker, changing the audio signal for each speaker, or displaying the speaker's name (see, for example, Patent Document 1).
Patent Document 1: JP-A-8-125738

However, the conventional technique using ISDN lines requires as many ISDN lines as there are conference attendees. That is, when a plurality of people participate in an audio conference from one point (for example, one conference room in an office), that conference room must be provided with one ISDN line per person, which is difficult to implement.

The present invention has been made in view of the above circumstances, and its object is to provide an audio conference system in which a plurality of people can participate in an audio conference from one point without using a plurality of lines, and in which the speaker is easy to identify.

To solve the above problem, the present invention provides an audio conference terminal device having: a microphone that collects a speaker's voice and outputs voice information indicating that voice; identification means that is provided near the microphone and outputs identification information based on owner information read from a recording medium on which the owner information specifying the speaker is recorded; identification information addition means that adds the identification information to the voice information; and transmission means that transmits the voice information to which the identification information has been added.
According to this audio conference terminal device, the voice information is transmitted with identification information that identifies the speaker who uttered the voice, so the receiving side can identify the speaker based on the identification information added to the voice information.

The present invention also provides an audio conference terminal device having: a microphone that collects a speaker's voice and outputs voice information indicating that voice; voice recognition means that recognizes the speaker's voice; identification means that determines identification information based on the recognition result of the voice recognition means; identification information addition means that adds the identification information to the voice information; and transmission means that transmits the voice information to which the identification information has been added.
According to this audio conference terminal device, identification information that identifies the speaker who uttered the voice is obtained from the voice recognition result and added to the voice information, without using a recording medium on which the speaker's identification information is recorded.

The present invention further provides an audio conference terminal device having: receiving means that receives voice information to which identification information identifying a speaker has been added; feature addition means that adds a feature to the voice indicated by the voice information based on the identification information; and voice output means that outputs the voice to which the feature has been added by the feature addition means.
According to this audio conference terminal device, voice information carrying identification information that identifies the speaker who uttered the voice is received, and voice characterized differently for each speaker is reproduced, so the listener can easily identify the speaker.

The present invention further provides a program that causes a computer to execute: a voice information output process that collects a speaker's voice and outputs voice information indicating that voice; an identification process that outputs identification information based on owner information read from a recording medium that is provided near the microphone and on which the owner information specifying the speaker is recorded; an identification information addition process that adds the identification information to the voice information; and a transmission process that transmits the voice information to which the identification information has been added.

The present invention further provides a program that causes a computer to execute: a voice output process that collects a speaker's voice and outputs voice information indicating that voice; a voice recognition process that recognizes the speaker's voice; an identification process that determines identification information based on the recognition result of the voice recognition process; an identification information addition process that adds the identification information to the voice information; and a transmission process that transmits the voice information to which the identification information has been added.

The present invention further provides a program that causes a computer to execute: a reception process that receives voice information to which identification information identifying a speaker has been added; a feature addition process that adds a feature to the voice indicated by the voice information based on the identification information; and a voice output process that outputs the voice to which the feature has been added by the feature addition process.

According to the present invention, it is possible to realize an audio conference system in which a listener can easily identify the speaker in an audio conference in which a plurality of people participate from one point. Further, according to the present invention, the speaker can be easily identified even when audio data recorded in the audio conference is reused.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[1 First Embodiment]
First, a first embodiment of the present invention will be described. The main components of the audio conference system according to the present invention are identification means that outputs identification information identifying a speaker, identification information addition means that adds that identification information to audio data, and feature addition means that performs a feature addition process giving the audio data a different feature for each speaker. In the first embodiment, an RFID (Radio Frequency IDentification) tag and an RFID tag reader are used as the identification means, and an equalizer is used as the feature addition means.

[1-1 Configuration of the audio conference system]
FIG. 1 is a block diagram showing the configuration of an audio conference system 100 in the present embodiment. As shown in FIG. 1, the audio conference system 100 enables an audio conference between conference rooms A and B, which are located geographically apart from each other.
The main units 200 installed in the conference rooms are connected via a communication network 300 and have a function of transmitting and receiving audio data between the conference rooms. When audio data to be transmitted to another conference room is generated, the main unit 200 identifies the speaker who is its source and adds that speaker's identification information to the audio data. When the main unit 200 receives audio data with identification information from the main unit 200 of another conference room, it performs a feature addition process based on the identification information. Details of the feature addition process will be described later.
The communication network 300 is, for example, the Internet. The communication network 300 is not limited to the Internet and may be another communication network capable of packet communication, such as a wireless communication network or a dedicated line.

Conference rooms A and B are provided with microphones 10. Each microphone 10 is a device that collects the voices of conference attendees. In the present embodiment, the microphone 10 is directional and collects only the voice of the attendee sitting in front of it. Also in the present embodiment, at least as many microphones 10 are provided as the capacity of the conference room, so that each attendee can use one microphone. Hereinafter, the state in which directional microphones are used and each attendee can use one microphone is referred to as the "on-mic state". In the on-mic state, each microphone 10 collects the voice of only one attendee.

An RFID tag reader 20 is attached to each microphone 10. The RFID tag reader 20 is a device that reads information from the RFID tag carried by the attendee seated in front of the microphone 10. RFID is an individual recognition technology that uses radio waves to recognize the owner without contact. An RFID system consists of an "RFID tag", which is an information recording medium, and an "RFID tag reader", which reads information written to the RFID tag in advance by an RFID tag writer. In this system, the information in the RFID tag is read when the RFID tag and the RFID tag reader are brought within a certain distance of each other. The present invention can also be implemented using technologies such as barcodes or magnetic cards instead of an RFID system. Compared with technologies such as barcodes, however, an RFID system allows information to be updated and added, allows multiple tags to be recognized at once, and can read through obstructions (a tag can be recognized even inside a bag), which makes it well suited as the identification means of the present invention.

The microphones 10 and the RFID tag readers 20 are all connected to the main unit 200. Based on the data sent from the RFID tag readers 20, the main unit 200 performs an identification process that identifies the speaker who is the source of the audio signal sent from a microphone 10, performs a speaker information addition process that adds information specifying the speaker to the audio signal, and transmits the result to the communication partner via the communication network 300.

When receiving audio data, the main unit 200 uses an equalizer 270 to correct the audio data transmitted from the main unit 200 of another conference room via the communication network 300. It then converts the corrected audio data into an audio signal and outputs the audio signal to speakers 30. The speakers 30 are devices that reproduce the supplied audio signal as sound and are arranged appropriately in each conference room.

FIG. 2 is a block diagram showing the configuration of the main unit 200 installed in each conference room. The main unit 200 is connected to the communication network 300 shown in FIG. 1 via a communication I/F 240 and can communicate with other communication devices via the communication network 300. A CPU 210 is a processor that controls this communication and performs various arithmetic processes. A memory 250 functions as a work area for the CPU 210 and also stores an attendee table TBL1 and a communication partner table TBL2, which will be described later. An audio signal receiving unit 230 has a function of receiving an analog audio signal (hereinafter simply "audio signal") from the microphones 10 shown in FIG. 1. A CODEC 220 has a function of converting the audio signal output from a microphone 10 into digital audio data (hereinafter simply "audio data") and of converting audio data received via the communication network 300 into an audio signal. The equalizer 270 has a function of correcting the sound quality of the voice obtained when audio data is reproduced, by increasing or decreasing specific frequency components of the audio data received via the communication network 300. The audio data corrected by the equalizer 270 is converted into an audio signal by the CODEC 220 and then output as sound from the speakers 30 shown in FIG. 1 via an audio signal output unit 260.

[1-2 Operation of the audio conference system]
The operation of the audio conference system in the present embodiment will be described below with reference to FIGS. 2 and 3. In the present embodiment, as shown in FIG. 3, an audio conference is held between conference room A at office A of a certain company and conference room B at office B. The attendees at office A are five employees belonging to department C (employees C1, C2, C3, C4, and C5). The attendees at office B are two employees belonging to department D (employees D1 and D2) and two employees belonging to department E (employees E1 and E2). In this company, each employee uses an employee ID card fitted with an RFID tag on which that person's employee number is recorded, and all conference attendees are assumed to be wearing their ID cards on their chest pockets.

As shown in FIG. 3, conference rooms A and B each have a capacity of six people, so six microphones 10 and six RFID tag readers 20 are provided in each room. In each conference room, an ID number is assigned to each microphone 10 in advance so that the main unit 200 can distinguish the individual microphones 10 in that room.
The main units 200 of both conference rooms also each store in advance a sufficient number (for example, six) of correction patterns (hereinafter "equalizing patterns") that increase or decrease specific frequency components of the audio data. Each equalizing pattern is assigned an equalizing pattern number by which it is referenced.

First, one of the attendees in conference room A operates the operation panel of the main unit 200 and enters the conference room number assigned to conference room B. The main unit 200 in conference room A then transmits a connection request to the main unit 200 in conference room B via the communication network 300. On receiving the connection request, the main unit 200 in conference room B opens a communication line with conference room A. Once it has been confirmed that all attendees on both sides are seated in front of the microphones 10 (each attendee is assumed to be seated as shown in FIG. 3), one of the attendees presses the "Attendee Registration" button on the operation panel of the main unit 200. This operation puts the audio conference system 100 into attendee registration mode, in which the conference attendees are registered.

In attendee registration mode, the main units 200 in both conference room A and conference room B instruct the RFID tag readers 20 attached to the microphones 10 to read RFID tag data. Each RFID tag reader 20 reads the employee number from the RFID tag of the attendee sitting in front of it and transmits that employee number to the main unit 200. The RFID tag reader 20 is configured to accept only signals whose strength is at or above a predetermined threshold, so it does not read information from the RFID tags of anyone other than the participant seated in front of the microphone 10 to which it is attached. An RFID tag reader 20 at an empty seat transmits a signal indicating an empty seat to the main unit 200. On receiving the employee numbers from the RFID tag readers 20, the main unit 200 associates each employee number with the ID number of the microphone 10 to which the sending RFID tag reader 20 is attached, and stores the result in the memory 250 as its own room's attendee table TBL1.
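The registration step above can be sketched roughly as follows. This is only an illustration, not the patent's implementation: the threshold value, the `(employee_number, signal_strength)` read format, the rule of keeping the strongest qualifying tag, and the function names `read_seat` and `build_attendee_table` are all assumptions introduced here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical value; the text only says "a predetermined threshold".
SIGNAL_THRESHOLD = 0.5

def read_seat(tag_reads: List[Tuple[str, float]]) -> Optional[str]:
    """Return the employee number for one reader, or None for an empty seat.

    tag_reads holds (employee_number, signal_strength) pairs (assumed format).
    Reads below the threshold are ignored; keeping the strongest remaining
    read is an assumption, not something the text specifies.
    """
    strong = [(strength, emp) for emp, strength in tag_reads
              if strength >= SIGNAL_THRESHOLD]
    if not strong:
        return None  # corresponds to the "empty seat" signal
    return max(strong)[1]

def build_attendee_table(
    reads_by_mic: Dict[str, List[Tuple[str, float]]]
) -> Dict[str, Optional[str]]:
    """Build TBL1: microphone ID number -> employee number (None if empty)."""
    return {mic_id: read_seat(reads) for mic_id, reads in reads_by_mic.items()}
```

For example, a reader at microphone 01 that sees employee C1 strongly and a neighbor's tag weakly registers only C1, while a reader that sees nothing above the threshold reports an empty seat.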

The following description covers the mode that uses RFID tags, but the method of identifying a speaker is not limited to RFID. Instead of a contactless RFID tag, the speaker's identification information may be entered using a contact-type magnetic card, a barcode, or the like. In that case, the main unit 200 needs to be provided with a magnetic card reader or a barcode reader. Alternatively, the main unit 200 may be provided with an input device such as a keyboard so that attendees enter their employee numbers or names manually.

Next, the main unit 200 of each conference room transmits its own room's attendee table TBL1 to the main unit 200 of the communication partner. The main unit 200 that receives the partner's attendee table TBL1 assigns a different equalizing pattern number to each employee number listed in it, and stores the employee numbers and their equalizing pattern numbers in the memory 250 as the attendee table of the communication partner (hereinafter "communication partner table TBL2"). This completes the preparation for the audio conference. When preparation is complete, the main unit 200 notifies the attendees, for example by lighting a lamp on the operation panel. An attendee confirms this and presses the "Start Conference" button on the operation panel of the main unit 200. This operation puts the main unit 200 into conference mode, and the conference can begin.
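The assignment of equalizing pattern numbers described above might look like the following sketch. The function name `build_partner_table`, the in-order assignment, and the error raised when patterns run out are assumptions for illustration; the text only requires that each employee number receive a different pattern number.

```python
from typing import Dict, List, Optional

def build_partner_table(
    partner_tbl1: Dict[str, Optional[str]],
    pattern_numbers: List[int],
) -> Dict[str, int]:
    """Build TBL2: employee number -> equalizing pattern number.

    partner_tbl1 is the partner room's TBL1 (microphone ID -> employee number,
    None for an empty seat). Patterns are handed out in list order, which is
    an assumption; any one-to-one assignment would satisfy the description.
    """
    employees = [emp for emp in partner_tbl1.values() if emp is not None]
    if len(employees) > len(pattern_numbers):
        raise ValueError("not enough equalizing patterns for all attendees")
    return dict(zip(employees, pattern_numbers))
```

With six stored patterns and at most six seats per room, as in the example above, every attendee in the partner room can receive a distinct pattern.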

Now consider the case where employee C1 speaks in conference mode. Because the on-mic state is realized in the present embodiment, there is no risk that the voices of multiple speakers will be mixed at the point where speech is collected. The microphone 10 with ID number 01, which has collected the voice of employee C1, sends that voice to the main unit 200 as an audio signal. The audio signal sent from the microphone 10 is converted into audio data by the CODEC 220 in the main unit 200. The CPU 210 of the main unit 200 extracts from the attendee table TBL1 the employee number associated with the microphone 10 with ID number 01 (the employee number of employee C1) and adds it to the audio data as a header. The audio data with the employee number added is transmitted via the communication network 300 to the main unit 200 in conference room B, the communication partner. Because the present embodiment uses the Internet, multiple streams of audio data can be transmitted over a single communication line.
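One possible way to attach the employee number to audio data as a header is sketched below. The patent does not define a header layout, so the format here (a 2-byte big-endian length, the UTF-8 employee number, then the audio payload) and the function names are purely hypothetical.

```python
import struct
from typing import Tuple

def add_speaker_header(employee_number: str, audio: bytes) -> bytes:
    """Prefix audio data with the speaker's employee number.

    Assumed layout: 2-byte big-endian length of the identifier,
    the identifier as UTF-8, then the unmodified audio payload.
    """
    ident = employee_number.encode("utf-8")
    return struct.pack("!H", len(ident)) + ident + audio

def split_speaker_header(packet: bytes) -> Tuple[str, bytes]:
    """Inverse of add_speaker_header: return (employee_number, audio)."""
    (length,) = struct.unpack_from("!H", packet, 0)
    ident = packet[2:2 + length].decode("utf-8")
    return ident, packet[2 + length:]
```

Because every packet carries its own speaker identifier, packets from several microphones can share one connection and still be attributed to the right speaker on the receiving side.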

When the CPU 210 of the main unit 200 in conference room B receives audio data with an employee number added, it extracts the employee number from the data. It then searches the communication partner table TBL2 in the memory 250 and extracts the equalizing pattern number corresponding to that employee number. The audio data is sent to the equalizer 270 together with the equalizing pattern number. The equalizer 270 reads the equalizing pattern associated with that number and corrects the audio data using it.
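The receiving-side lookup and correction can be illustrated with a toy equalizer. Everything concrete here is an assumption: the pattern format `(first_bin, last_bin, gain)`, the `PATTERNS` values, and the naive DFT-based filtering, which is chosen only to keep the sketch dependency-free and is far too slow for real audio.

```python
import cmath
import math
from typing import Dict, List, Tuple

# Hypothetical equalizing patterns: number -> (first_bin, last_bin, gain).
PATTERNS: Dict[int, Tuple[int, int, float]] = {1: (1, 2, 2.0), 2: (3, 4, 0.5)}

def dft(x: List[float]) -> List[complex]:
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum: List[complex]) -> List[float]:
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def equalize(samples: List[float], pattern_number: int) -> List[float]:
    """Scale the frequency bins selected by the pattern by its gain."""
    lo, hi, gain = PATTERNS[pattern_number]
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        band = min(k, n - k)  # treat mirrored negative-frequency bins alike
        if lo <= band <= hi:
            spectrum[k] *= gain
    return idft(spectrum)

def process_received(employee_number: str, samples: List[float],
                     tbl2: Dict[str, int]) -> List[float]:
    """Look up the speaker's pattern number in TBL2 and apply it."""
    return equalize(samples, tbl2[employee_number])
```

In this sketch, a tone that falls in bins 1 to 2 is doubled in amplitude for a speaker mapped to pattern 1 and left unchanged for a speaker mapped to pattern 2, so the same voice would sound different depending on who is speaking.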

The main unit 200 outputs the audio data processed in this way to the plurality of speakers 30 as an audio signal via the CODEC 220 and the audio signal output unit 260, and the speakers 30 reproduce the audio signal corrected by the equalizer 270. Because the reproduced voice has specific frequency components altered differently for each speaker, the listener can easily distinguish between speakers.
For example, suppose that the voices of employee C1 and employee C2 are very similar and hard to tell apart. Even if the discussion becomes heated and employee C1 says "I agree" while employee C2 says "I disagree" at almost the same time, each voice is collected separately by the microphone 10 assigned to that speaker, has the employee number of C1 or C2 added as a header, and is transmitted to the main unit 200 in conference room B. In conference room B, the voices of employee C1 and employee C2 are corrected with separate equalizing patterns and reproduced. The attendees in conference room B can therefore easily recognize which of employees C1 and C2 agreed and which disagreed.

In the embodiment described above, the main unit 200 of each conference room associates employee numbers with microphone IDs before the audio conference starts and, during the conference, determines the speaker's employee number from the microphone ID of the microphone that collected the speaker's voice. There is, however, also a mode in which microphone IDs and employee numbers need not be used in this way. In this mode, when an attendee in a conference room speaks, the RFID tag reader 20 simultaneously reads the RFID from the speaker's RFID tag. The RFID that was read is transmitted from the main unit 200 to the main unit 200 of the other conference room. The main unit 200 of the other conference room, on receiving the RFID, searches the attendee table TBL1 using the RFID as a key and extracts the corresponding equalizing pattern number. In this mode, information is read from the RFID tag each time someone speaks, so the speaker is recognized correctly even if participants change seats during the conference.

[2 Second Embodiment]
Next, a second embodiment of the present invention will be described. FIG. 4 is a block diagram showing the configuration of the audio conference system in the second embodiment. In the present embodiment, a recording server 320 is connected to the communication network 300 as a device for recording the content of the conference. Also in the present embodiment, the information used to identify a speaker is a URI (Uniform Resource Identifier) that specifies a resource storing various information about that speaker. An information server 310 is connected to the communication network 300 as the server that provides this information. Here, a URI is a character string in a unified format for identifying a resource. In the present embodiment, it specifically refers to a URL (Uniform Resource Locator).

[2-1 Configuration of the audio conference system]
In this embodiment, the devices provided in each conference room are configured as in the first embodiment, so their description is omitted.
As shown in FIG. 5, the information server 310 is a server having a directory database in which information about attendees is recorded in advance, such as a face photograph, department, employee number, telephone number, e-mail address, and a URI indicating the location where information about the employee is stored. The recording server 320 is a server for recording audio data transmitted and received during the conference. The information server 310 and the recording server 320 are each a server device having a CPU, a memory, an external storage device such as a hard disk, and means for communicating via a network. Although FIG. 4 shows a single information server shared by conference room A and conference room B, a separate information server may be provided for each of conference rooms A and B. The information server 310 and the recording server 320 may be the same device. The main unit 200 itself may also have the functions of the information server 310 and/or the recording server 320.

[2-2 Operation of the audio conference system]
The operation of the audio conference system according to this embodiment is described below with reference to FIGS. 2 and 4. As in the first embodiment, an audio conference is held between conference room A at office A and conference room B at office B of a company, and the attendees and other circumstances are the same as in the first embodiment. When all attendees are present in both conference rooms and the "attendee registration" button on the operation panel of the main unit 200 is pressed, the main unit 200 enters the participant registration mode. From there, the operation up to storing the attendee table TBL1 in the memory 250 is the same as in the first embodiment.

After storing the attendee table TBL1 in the memory 250, the main unit 200 transmits a URI request and the attendee table TBL1 to the information server 310. On receiving the URI request, the information server 310 searches its directory database and extracts the URI corresponding to each employee number listed in the attendee table TBL1. Each extracted URI is associated with its employee number and added to the attendee table TBL1. The attendee table TBL1 with the URIs added is sent back to the main unit 200 that issued the URI request. The main unit 200 stores the received table in the memory 250 as the new attendee table TBL1. Next, the main units 200 in both conference rooms transmit their attendee tables TBL1 to each other. A main unit 200 that has received the attendee table TBL1 from its communication partner associates each URI in the received table with an equalizing pattern number described in the first embodiment and stores the result in the memory 250 as the communication partner table TBL2. This completes the preparation for the audio conference.
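
The table handling in this registration step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the `Attendee` record, the contents of `DIRECTORY`, the example URLs, and the rule of assigning equalizing pattern numbers in table order are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Attendee:
    mic_id: str                 # ID number of the microphone 10
    employee_number: str
    uri: Optional[str] = None   # filled in by the information server


# Hypothetical directory database: employee number -> URI.
DIRECTORY: Dict[str, str] = {
    "1001": "http://info.example/employees/1001",
    "1002": "http://info.example/employees/1002",
}


def add_uris(tbl1: List[Attendee], directory: Dict[str, str]) -> List[Attendee]:
    """Information server side: attach the URI for each employee number."""
    return [replace(a, uri=directory[a.employee_number]) for a in tbl1]


def build_tbl2(partner_tbl1: List[Attendee]) -> Dict[str, int]:
    """Receiving main unit: map each partner URI to an equalizing pattern number.

    Assumption: numbers are assigned 1, 2, 3, ... in table order.
    """
    return {a.uri: n for n, a in enumerate(partner_tbl1, start=1)}
```

Keying TBL2 by URI means the receiving side needs only the header of each incoming audio frame to select the pattern, without knowing the sender's microphone layout.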

The attendees confirm that preparation is complete and start the conference. When the "conference start" button on the operation panel of the main unit 200 is pressed, the main unit 200 transmits a signal indicating that attendee registration is complete to the recording server 320 via the communication network 300. On receiving the signal, the recording server 320 creates a recording file and automatically gives it a file name that distinguishes it from other files, using information such as the date and time of the conference.
This recording file has a format that supports multitrack recording and playback. One track is assigned to each conference participant, and each track records the audio data of one speaker together with headers, so that a particular utterance by a particular speaker can be extracted. The recording file is not limited to a multitrack format; any format may be used as long as speakers and utterances remain identifiable.
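
One way to picture such a recording file is a container with one list of utterances per speaker. The sketch below is only an in-memory model under assumptions: the class name `RecordingFile`, the file name pattern, and the use of a start offset in seconds are hypothetical and are not specified in the text.

```python
from collections import defaultdict
from datetime import datetime
from typing import DefaultDict, List, Tuple

Utterance = Tuple[float, bytes]  # (start offset in seconds, audio data)


class RecordingFile:
    def __init__(self, started_at: datetime) -> None:
        # File name derived from the conference date and time (assumed pattern).
        self.file_name = started_at.strftime("conference_%Y%m%d_%H%M%S.rec")
        # One track per speaker, keyed by the identifier carried in the header.
        self.tracks: DefaultDict[str, List[Utterance]] = defaultdict(list)

    def record(self, speaker_id: str, offset_s: float, audio: bytes) -> None:
        """Append one utterance to the track of the given speaker."""
        self.tracks[speaker_id].append((offset_s, audio))

    def utterances(self, speaker_id: str) -> List[Utterance]:
        """Return every utterance of one speaker, in recording order."""
        return list(self.tracks.get(speaker_id, []))
```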

When an attendee speaks, the utterance is sent to the main unit 200 as an audio signal through the microphone 10 assigned to that attendee, as in the first embodiment. The audio signal input from the microphone 10 is converted into audio data by the CODEC 220 in the main unit 200. The main unit 200 extracts from the attendee table TBL1 the URI associated with the microphone 10 whose ID number is 01 and adds the extracted URI to the audio data as a header.
The main unit 200 then transmits the audio data with the URI added to the main unit 200 in conference room B and to the recording server 320 via the communication network 300. Audio playback in conference room B is the same as in the first embodiment, except that the "employee number" of the first embodiment is replaced by the "URI".
On receiving audio data, the recording server 320 records it in the recording file, together with the URI added as a header. In this way, the contents of the conference are recorded in the recording file on the recording server 320.
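
The text does not specify how the URI header is laid out in the audio data. As a hedged illustration, the sketch below assumes a simple frame consisting of a two-byte length, the UTF-8 encoded URI, and then the audio payload; the function names and this layout are hypothetical.

```python
import struct
from typing import Dict, Tuple


def pack_frame(uri: str, audio: bytes) -> bytes:
    """Prefix the audio data with a header holding the speaker's URI."""
    header = uri.encode("utf-8")
    return struct.pack("!H", len(header)) + header + audio


def unpack_frame(frame: bytes) -> Tuple[str, bytes]:
    """Split a frame back into (URI, audio data)."""
    (header_len,) = struct.unpack_from("!H", frame, 0)
    uri = frame[2:2 + header_len].decode("utf-8")
    return uri, frame[2 + header_len:]


def tag_audio(tbl1: Dict[str, str], mic_id: str, audio: bytes) -> bytes:
    """Look up the URI for the microphone that picked up the voice and attach it."""
    return pack_frame(tbl1[mic_id], audio)
```

Both the partner main unit and the recording server can call `unpack_frame` on the same bytes, which matches the description that the identical tagged data is sent to both destinations.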

Now consider the case where, after the conference, one of the attendees listens again to the file in which the conference was recorded, for example to prepare the minutes. The employee preparing the minutes first accesses the recording server 320 from his or her terminal (not shown) via the communication network 300 and opens the file in which the conference was recorded. Playback software for files recorded by this audio conference system is installed on the terminal in advance.
When playing back the recording file, the playback software, on detecting audio data in a track, accesses the resource indicated by the URI added as a header (for example, the file on the information server 310 that holds information about employee C1) and obtains the employee's name. The playback software plays the audio data while characterizing each speaker by displaying on the screen the speaker's name obtained from the information server 310. The way of characterizing each speaker during playback is not limited to displaying the name: the audio waveform may be displayed in a different color for each speaker; a face photograph of the speaker may be displayed instead of the name; information such as the speaker's employee number or e-mail address may be displayed; an equalizer that boosts or cuts different frequency components for each speaker may be applied, as during the audio conference; or the audio may be played through loudspeakers or headphones with a different sound image localization for each speaker.
According to this embodiment, a user listening to the recording file of an audio conference can follow the recording without wondering who was speaking and without mistaking one speaker for another.
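
The playback behavior can be sketched as a merge of the per-speaker tracks into a single time-ordered list in which each utterance is labeled with the speaker's name. This is an illustrative sketch only: the track layout, the `resolve_name` callback (standing in for the access to the information server 310), and the function name are assumptions.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def playback_schedule(
    tracks: Dict[str, List[Tuple[float, bytes]]],
    resolve_name: Callable[[str], str],
) -> List[Tuple[float, str, bytes]]:
    """Merge all tracks into one time-ordered list labeled with speaker names.

    Each track must already be sorted by start offset.
    """
    names: Dict[str, str] = {}  # cache so each URI is resolved only once
    streams = [
        [(offset, uri, audio) for offset, audio in utterances]
        for uri, utterances in tracks.items()
    ]
    schedule = []
    for offset, uri, audio in heapq.merge(*streams):
        if uri not in names:
            names[uri] = resolve_name(uri)
        schedule.append((offset, names[uri], audio))
    return schedule
```

Caching the resolved names keeps the number of server accesses equal to the number of speakers rather than the number of utterances.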

[3 Third Embodiment]
Next, a third embodiment of the present invention will be described. This embodiment uses speech recognition as the identification means and, as the feature addition means, a technique that uses a plurality of loudspeakers to vary the sound image localization for each speaker. In the following description, components identical to those of the first and second embodiments are given the same reference numerals.

[3-1 Configuration of the audio conference system]
FIG. 6 is a block diagram showing the configuration of the audio conference system according to the third embodiment. In this embodiment, a main unit 500 is used in place of the main unit 200 of the first and second embodiments.
FIG. 7 is a block diagram showing the configuration of the main unit 500. In FIG. 7, the speech recognition unit 550 and the sound image localization processing unit 570 are components specific to this embodiment. The speech recognition unit 550 performs speech recognition on the audio signal input from the microphone 10 shown in FIG. 6. The sound image localization processing unit 570 changes the sound image localization of the audio reproduced from the loudspeakers 30. The position at which a sound image is localized is determined by sound image localization information stored in the sound image localization processing unit 570. The unit stores in advance a sufficient number (six in this embodiment) of pieces of sound image localization information (for example, a distance and an angle from a reference point), each of which is assigned a sound image localization number used to refer to it. Suppose that sound images are to be localized at the positions shown in FIG. 10. In this case, the sound image localization processing unit 570 holds six pieces of sound image localization information, each a combination of a distance and an angle from the reference point, numbered 1 to 6. Given audio data and a sound image localization number, the sound image localization processing unit 570 localizes the sound image by changing the phase and intensity of the audio output from each of the loudspeakers 30 shown in FIG. 6, based on the sound image localization information corresponding to that number.
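
As a rough illustration of how a localization number could be turned into loudspeaker signals, the sketch below uses constant-power intensity panning between two loudspeakers. It is a simplification under assumptions: the six (distance, angle) values are placeholders rather than the positions of FIG. 10, only intensity is changed (the phase adjustment mentioned in the text is omitted), and the two-loudspeaker setup and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Sound image localization number -> (distance in m, angle in degrees).
# Placeholder values; the actual positions are those of FIG. 10.
LOCALIZATIONS: Dict[int, Tuple[float, float]] = {
    1: (1.0, -75.0), 2: (1.0, -45.0), 3: (1.0, -15.0),
    4: (1.0, 15.0), 5: (1.0, 45.0), 6: (1.0, 75.0),
}


def stereo_gains(number: int) -> Tuple[float, float]:
    """Return (left, right) gains for a localization number.

    Constant-power panning: -90 degrees is fully left, +90 fully right.
    Sources farther than 1 m are attenuated in proportion to 1/distance.
    """
    distance, angle = LOCALIZATIONS[number]
    pan = (angle + 90.0) / 180.0 * (math.pi / 2.0)
    attenuation = 1.0 / max(distance, 1.0)
    return attenuation * math.cos(pan), attenuation * math.sin(pan)


def localize(samples: Sequence[float], number: int) -> Tuple[List[float], List[float]]:
    """Apply the gains to a mono signal, giving (left, right) sample lists."""
    left, right = stereo_gains(number)
    return [left * s for s in samples], [right * s for s in samples]
```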

[3-2 Operation of the audio conference system]
The operation of the audio conference system according to this embodiment is described below with reference to FIGS. 6 and 7. To start the conference, one of the attendees presses the "attendee registration" button on the operation panel of the main unit 500, which puts the main unit 500 into the attendee registration mode. In this embodiment, each attendee states his or her name into the microphone 10, and the main unit 500 identifies the speaker by performing speech recognition on that audio to obtain the attendee's name. The details are as follows.

When employee C1 states his name, "Ichiro Suzuki", into the microphone 10, the voice is sent to the main unit 500 as an audio signal through the microphone 10. The main unit 500 passes the received audio signal to the speech recognition unit 550. The speech recognition unit 550 performs speech recognition on the audio signal and extracts the text data of the name, "Suzuki Ichiro". The main unit 500 transmits the extracted name text data and a URI request to the information server 310 via the communication network 300.
On receiving the name text data and the URI request, the information server 310 searches its directory database (FIG. 5) using the name text data as a key. When the information server 310 finds the employee record corresponding to the text data "Suzuki Ichiro", it extracts from the directory database the URI indicating where the data on that employee is stored and returns it to the main unit 500. If several employees share the same name, the attendee may add information other than the name when stating it, for example "Ichiro Suzuki, Department C", so that the database is searched by name and department. Alternatively, when employees with the same name exist, the information server 310 may send a list of their names and employee numbers to the main unit 200, and employee C1 may select his own data from the list shown on the display of the main unit 200.
On receiving the URI, the main unit 500 associates the received URI with the ID number of employee C1's microphone 10 and stores the pair in the attendee table TBL1.
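
The directory search, including the handling of employees who share a name, might look like the following sketch. The record fields, the function name `lookup_uri`, and the choice to return either a URI or a candidate list are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple, Union

Record = Dict[str, str]  # keys: "name", "department", "employee_number", "uri"


def lookup_uri(
    directory: List[Record], name: str, department: Optional[str] = None
) -> Union[str, List[Tuple[str, str]]]:
    """Return the URI for a unique match, or a list of candidates to choose from.

    Raises LookupError when nobody in the directory matches.
    """
    matches = [
        r for r in directory
        if r["name"] == name
        and (department is None or r["department"] == department)
    ]
    if not matches:
        raise LookupError(f"no employee named {name!r}")
    if len(matches) == 1:
        return matches[0]["uri"]
    # Several employees share the name: return (name, employee number) pairs
    # so that the attendee can pick the right entry on a display.
    return [(r["name"], r["employee_number"]) for r in matches]
```

Passing the department narrows the search as in the first alternative of the text; omitting it and receiving a list corresponds to the second alternative, where the attendee selects an entry.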

By carrying out the above processing for every attendee, the main unit 500 in each conference room stores an attendee table TBL1 that associates, for each attendee, the ID number of the microphone 10 at the attendee's seat with the URI indicating where information about that attendee is stored. Once each main unit 500 has finished creating the attendee table TBL1 for the attendees in its own conference room, the main units 500 in both rooms transmit their attendee tables TBL1 to each other. A main unit 500 that has received its partner's attendee table TBL1 associates a sound image localization number with each URI recorded in the received table and stores the result as the communication partner table TBL2. This ends the attendee registration mode, after which the system shifts to the normal conference mode. In the normal conference mode, the audio conference system operates as follows.

When employee C1 speaks, the utterance is sent to the main unit 500 as an audio signal through the microphone 10. The received audio signal is converted into audio data by the CODEC 220 in the main unit 500, and the URI associated with the ID number of the microphone that picked up the signal (here, the URI indicating where information about employee C1 is stored) is added to the audio data as a header. The audio data with the URI added is transmitted to the partner main unit 500 via the communication network 300. The receiving main unit 500 searches the communication partner table TBL2 stored in its memory for that URI and extracts the sound image localization information stored in association with it.

The main unit 500 adds the sound image localization information to the header of the audio data and passes the audio data to the sound image localization processing unit 570 in the main unit 500. The sound image localization processing unit 570 determines the sound image localization of the audio from the sound image localization information in the header and corrects the audio signal so that the audio is reproduced with that localization. The corrected audio signal is reproduced as sound from the loudspeakers 30. According to this embodiment, audio is reproduced from the plurality of loudspeakers 30 with a different sound image localization for each speaker, so the conference attendees can easily tell the speakers apart.
In this embodiment, audio is reproduced from the plurality of loudspeakers 30 with a different sound image localization for each speaker. Alternatively, as many loudspeakers 30 as there are conference attendees may be provided, with one loudspeaker assigned to each speaker so that each speaker's voice is reproduced from his or her own loudspeaker. A display may also be provided on the main unit 200 to show the employee number or name of the person currently speaking.

[4 Fourth Embodiment]
Next, a fourth embodiment of the present invention will be described. FIG. 8 is a diagram showing the configuration of the audio conference system according to the fourth embodiment. This system differs from the first to third embodiments in that the number of microphones 10 in a conference room is smaller than the room's seating capacity (hereinafter, the "off-microphone state"). In this embodiment the microphones 10 are omnidirectional, and the utterances of several speakers are picked up by the plurality of microphones 10 as mixed audio. The audio picked up by the microphones 10 is first separated by sound source, and speaker recognition is then performed on the separated audio. In the following description, components identical to those of the first to third embodiments are given the same reference numerals.

[4-1 Configuration of the audio conference system]
As shown in FIG. 8, the audio conference system 700 of this embodiment comprises the communication network 300, main units 800, the information server 310, microphones 10, and loudspeakers 30. In this embodiment, the directory database stored in the information server 310 holds, recorded in advance for each employee, feature quantities extracted from an audio signal of the employee stating his or her own name.

FIG. 9 is a block diagram showing the configuration of the main unit 800. The sound image localization measuring unit 910 measures the sound image localization of the source of the audio input from the three microphones 10 shown in FIG. 8. The main unit 800 stores the sound image localization information of the sound source measured by the sound image localization measuring unit 910. The speaker recognition unit 950 performs speaker recognition on the per-speaker audio data separated by the sound source separation unit 900 described below and adds information identifying the speaker to that audio data. The sound source separation unit 900 performs sound source separation on the audio signals input via the plurality of microphones 10, based on the sound image localization information of the sound sources, and separates them into the audio of each speaker. Specifically, the sound source separation unit 900 has synchronous addition units, each consisting of three delay elements that shift the phases of the audio signals input from the three microphones 10 shown in FIG. 8; there is one unit per seat in the conference room, that is, six in total. One sound source is assigned to each synchronous addition unit. Each synchronous addition unit corrects the phase differences estimated from the sound image localization information (angle) of its sound source, brings the audio signals from that source into phase, and adds them. The audio signals from that particular source are thereby brought into phase while signals from other sources are not, so only the audio from that source is emphasized and the sources can be separated. The sound source separation method is not limited to this one; other techniques may be used, such as blind source separation (BSS) based on independent component analysis (ICA) or a separation method that assumes the harmonic structure of speech.
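
The synchronous addition described above corresponds to what is commonly called delay-and-sum beamforming. The following sketch illustrates one synchronous addition unit under simplifying assumptions that are not taken from the text: the microphones lie on a straight line at known positions, the source is far enough away for its wavefront to be treated as a plane, delays are rounded to whole samples, and the angle is measured from the direction perpendicular to the line.

```python
import math
from typing import List, Sequence

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(
    mic_x: Sequence[float], angle_deg: float, sample_rate: float
) -> List[int]:
    """Arrival delay at each microphone in samples, relative to the earliest one.

    A positive angle points toward the +x end of the microphone line, so
    microphones with larger x receive the wavefront earlier.
    """
    s = math.sin(math.radians(angle_deg))
    arrivals = [-x * s / SPEED_OF_SOUND * sample_rate for x in mic_x]
    earliest = min(arrivals)
    return [round(a - earliest) for a in arrivals]


def delay_and_sum(
    channels: Sequence[Sequence[float]], delays: Sequence[int]
) -> List[float]:
    """Align the channels on the target source and average them.

    Each channel is advanced by its arrival delay, so the target adds in
    phase while sound from other directions stays misaligned.
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        total = 0.0
        for channel, delay in zip(channels, delays):
            if t + delay < n:
                total += channel[t + delay]
        out.append(total / len(channels))
    return out
```

With three microphones, a source in the steered direction keeps its full amplitude after averaging, while an impulse from another direction is spread over three separate samples at one third of the amplitude each.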

[4-2 Operation of the audio conference system]
The operation of the audio conference system according to this embodiment is described below with reference to FIGS. 8 and 9. First, the attendees are registered in the attendee registration mode. The attendees state their names in turn. Each attendee's voice is sent to the main unit 800 as audio signals through the plurality of microphones 10. The sound image localization measuring unit 910 calculates the angle of the sound source (the speaker) relative to the microphones 10 from the phase differences among the audio signals output by the microphones 10. An ID number is assigned in advance to each synchronous addition unit of the sound source separation unit 900.
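
The text does not say how the angle is computed from the phase differences. One common approach, shown here only as an assumed illustration, estimates the time lag between two microphone signals by cross-correlation and converts it to an angle with the far-field relation sin(angle) = c * lag / (sample rate * spacing). The function names and the two-microphone setup are hypothetical.

```python
import math
from typing import Sequence

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def best_lag(a: Sequence[float], b: Sequence[float], max_lag: int) -> int:
    """Lag in samples at which b best matches a (positive: b trails a)."""
    def correlation(lag: int) -> float:
        return sum(
            a[i] * b[i + lag] for i in range(len(a)) if 0 <= i + lag < len(b)
        )
    return max(range(-max_lag, max_lag + 1), key=correlation)


def arrival_angle(lag: int, sample_rate: float, spacing_m: float) -> float:
    """Angle from broadside in degrees, under a far-field (plane wave) assumption.

    A positive angle means the source is on the side of microphone a.
    """
    ratio = SPEED_OF_SOUND * lag / sample_rate / spacing_m
    # Clamp to the valid asin range to absorb rounding of the lag.
    return math.degrees(math.asin(max(-1.0, min(1.0, ratio))))
```

For example, a lag of 7 samples at 48 kHz between microphones 0.1 m apart gives a ratio of about 0.5, that is, an angle of roughly 30 degrees.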

The speaker recognition unit 950 extracts the characteristics of the voice from the audio data, for example by analyzing the spectrum and formants to extract feature quantities. The speaker recognition unit 950 then transmits a speaker identification request and the extracted feature quantities to the information server 310 via the communication network 300. On receiving the speaker identification request, the information server 310 searches its directory database using the received feature quantities and identifies the speaker, here as employee C1. Having identified the speaker, the information server 310 transmits to the main unit 800 information identifying the speaker, such as employee C1's employee number or the URI indicating where information about employee C1 is stored. The main unit 800 associates the received URI, the speaker's sound image localization information (angle) calculated earlier, and the ID number of a free synchronous addition unit, and stores them in the attendee table TBL1. Each synchronous addition unit performs synchronous addition based on the sound image localization information associated with its own ID number. The main units 800 in both conference rooms then transmit their attendee tables TBL1 to each other and create the communication partner tables TBL2, as in any of the first to third embodiments. This ends the attendee registration mode.

In the conference mode, when an attendee (for example, employee C1) speaks, the voice is sent to the main unit 800 as a plurality of audio signals through the plurality of microphones 10. The main unit 800 passes the received audio signals to its sound source separation unit 900, which, as described above, can extract audio data for each speaker. For each piece of extracted audio data, the sound source separation unit 900 extracts from the attendee table TBL1 the URI associated with the ID number of the synchronous addition unit that output the data. The extracted URI is added to the audio data as a header. The audio data, now carrying information that identifies the speaker, is transmitted to the partner main unit 800 via the communication network 300. Playback in the partner conference room and reuse of the audio data after the conference are the same as in any of the first to third embodiments.
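
In this embodiment the attendee table TBL1 is keyed by the ID number of a synchronous addition unit rather than by a microphone ID. The bookkeeping for registration and for the conference-mode lookup can be sketched as below; the class and method names are hypothetical, and taking the lowest free ID first is an assumption.

```python
from typing import Dict, List, Tuple


class SynchronousAdderTable:
    """Attendee table TBL1 of the fourth embodiment, keyed by adder ID number."""

    def __init__(self, adder_count: int = 6) -> None:
        self._free: List[int] = list(range(1, adder_count + 1))
        self._rows: Dict[int, Tuple[str, float]] = {}

    def register(self, uri: str, angle_deg: float) -> int:
        """Assign a free synchronous addition unit to a newly identified speaker."""
        if not self._free:
            raise RuntimeError("no free synchronous addition unit")
        adder_id = self._free.pop(0)
        self._rows[adder_id] = (uri, angle_deg)
        return adder_id

    def angle_for(self, adder_id: int) -> float:
        """Angle the given unit steers to during synchronous addition."""
        return self._rows[adder_id][1]

    def uri_for(self, adder_id: int) -> str:
        """URI to attach as a header to audio output by the given unit."""
        return self._rows[adder_id][0]
```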

[5 Modifications]
The embodiments described above are examples of the present invention, and various modifications can be made to them without departing from the gist of the invention.

[5-1 First Modification]
As described above, the present invention consists broadly of the following three elements.
(1) Identification means
(2) Means for adding speaker identification information to audio data
(3) Feature addition means
The first to fourth embodiments illustrated specific forms and variations as combinations of elements (1) to (3), but the combinations are not limited to those described in each embodiment; any combination, including the variations, is possible.

[5-2 Second Modification]
Each of the above embodiments describes an audio conference between two sites, but the audio conference may instead be held among three or more sites. In this case, in the attendee registration mode, the main unit transmits the attendee table TBL1 to a plurality of communication partners and stores the communication partner tables TBL2 of those partners.
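
The multi-site table exchange can be sketched as below. This is only an illustration under stated assumptions: the `MainUnit` class, its method names, the site labels, and the example URIs are hypothetical, and direct method calls stand in for transmission over the communication network.

```python
from typing import Dict, Iterable


class MainUnit:
    """Hypothetical model of a main unit taking part in a multi-site conference."""

    def __init__(self, site: str, attendee_table: Dict[int, str]) -> None:
        self.site = site
        self.attendee_table = attendee_table  # TBL1 for this site
        # One communication partner table (TBL2) per remote site.
        self.partner_tables: Dict[str, Dict[int, str]] = {}

    def receive_table(self, sender_site: str, table: Dict[int, str]) -> None:
        """Store a copy of a partner's attendee table as that site's TBL2."""
        self.partner_tables[sender_site] = dict(table)

    def register_attendees(self, partners: Iterable["MainUnit"]) -> None:
        """Attendee registration mode: send TBL1 to every other site."""
        for partner in partners:
            if partner is not self:
                partner.receive_table(self.site, self.attendee_table)


room_a = MainUnit("A", {1: "http://example.com/staff/c1"})
room_b = MainUnit("B", {1: "http://example.com/staff/d1"})
room_c = MainUnit("C", {1: "http://example.com/staff/e1"})
units = [room_a, room_b, room_c]
for unit in units:
    unit.register_attendees(units)
```

After registration, each unit holds one partner table per remote site, so an incoming header can be resolved regardless of which site sent the audio.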

[5-3 Third Modification]
In the fourth embodiment, speaker recognition is performed after sound source separation in the off-mic state. The RFID tags may additionally be used as an aid to narrow down the database that is searched during speaker recognition. In this case, the audio conference system operates as follows.

An RFID tag reader 20 is attached to each microphone 10. When a conference attendee presses the "attendee registration" button on the operation panel of the main unit 800, the main unit 800 instructs the RFID tag reader 20 to read the employee numbers of the conference attendees. Because an RFID tag can be read from some distance away, the RFID tag reader 20 attached to the microphone 10 in conference room A reads the employee number data from the RFID tags attached to the employee ID cards of the employees present in conference room A (employee C1, employee C2, employee C3, employee C4, and employee C5). The employee number data that has been read is transmitted to the main unit 800. The main unit 800 transmits a voice feature quantity request to the information server 310 together with the employee number data. On receiving the request, the information server 310 searches its own speaker information database using the received employee numbers as keys and extracts the voice feature quantities of employees C1, C2, C3, C4, and C5. The information server 310 transmits the extracted voice feature quantities to the main unit 800. On receiving them, the main unit 800 stores the employee number and voice feature quantity of each attendee in association with each other as a voice recognition table. This completes the preparation for the conference.

During the conference, the speaker recognition unit 950, on receiving audio data, extracts the features of the voice by, for example, analyzing the spectrum and formants of the audio data to extract feature quantities, as described in the fourth embodiment. The speaker recognition unit 950 then searches the voice recognition table using the extracted feature quantities as a search key and retrieves the speaker's employee number. The processing from adding the retrieved employee number to the audio data onward is the same as in the fourth embodiment.
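
The narrowing described in this modification can be sketched as follows. The text does not specify how feature quantities are compared, so the nearest-neighbour match by Euclidean distance, the three-element feature vectors, and the employee numbers below are all assumptions chosen for illustration.

```python
import math
from typing import Dict, Iterable, List

# Hypothetical speaker information database held by the information server:
# employee number -> voice feature vector (made-up values).
SPEAKER_DB: Dict[str, List[float]] = {
    "E001": [1.0, 0.0, 0.2],
    "E002": [0.0, 1.0, 0.1],
    "E003": [0.5, 0.5, 0.9],
    "E999": [0.95, 0.05, 0.2],  # similar voice, but not in the room
}


def build_recognition_table(
    present_ids: Iterable[str], db: Dict[str, List[float]]
) -> Dict[str, List[float]]:
    """Keep only the employees whose RFID tags were read in the room."""
    return {emp: db[emp] for emp in present_ids if emp in db}


def identify_speaker(features: List[float], table: Dict[str, List[float]]) -> str:
    """Return the employee number whose stored features are closest.
    Euclidean distance is an assumed similarity measure."""
    return min(table, key=lambda emp: math.dist(features, table[emp]))


# Employee numbers read by the RFID tag reader during attendee registration.
table = build_recognition_table(["E001", "E002", "E003"], SPEAKER_DB)
utterance = [0.95, 0.05, 0.2]
speaker = identify_speaker(utterance, table)
```

With these example values, searching the whole database would match the absent employee `E999`, whereas the narrowed table returns `E001`, which is the kind of misidentification the RFID-based narrowing is meant to reduce.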

[5-4 Fourth Modification]
In the fourth embodiment, voice recognition is performed for each attendee in the attendee registration mode, and the resulting sound image localization information is recorded in the attendee table TBL1 in association with each attendee's URI. Alternatively, attendees need not be associated with sound image localization information in the attendee registration mode; instead, in the conference mode, speaker recognition processing may be performed for each utterance and the speaker's identification information (URI or employee number) added to the audio data.

A block diagram showing the configuration of the audio conference system in the first embodiment of the present invention.
A block diagram showing the configuration of the main unit in the same embodiment.
A schematic diagram showing the configuration of the audio conference system in the same embodiment.
A block diagram showing the configuration of the audio conference system in the second embodiment of the present invention.
A diagram illustrating the name list database recorded on the information server in the same embodiment.
A block diagram showing the configuration of the audio conference system in the third embodiment of the present invention.
A block diagram showing the configuration of the main unit in the same embodiment.
A diagram showing the configuration of the audio conference system in the fourth embodiment of the present invention.
A block diagram showing the configuration of the main unit in the same embodiment.
A diagram illustrating sound image localization positions in the third embodiment.

Explanation of symbols

10 ... Microphone, 20 ... RFID tag reader, 30 ... Speaker, 100 ... Audio conference system, 200 ... Main unit, 210 ... CPU, 220 ... CODEC, 230 ... Audio signal receiving unit, 240 ... Communication I/F, 250 ... Memory, 260 ... Audio signal output unit, 270 ... Equalizer, 300 ... Communication network, 310 ... Information server, 320 ... Recording server, 400 ... Terminal, 500 ... Main unit, 550 ... Voice recognition unit, 570 ... Sound image localization processing unit, 700 ... Audio conference system, 800 ... Main unit, 900 ... Sound source separation unit, 910 ... Sound image localization measurement unit, 950 ... Speaker recognition unit

Claims

A microphone that collects the voice of the speaker and outputs voice information indicating the voice of the speaker;
An identification unit that is provided in the vicinity of the microphone and outputs identification information based on the owner information read from a recording medium in which owner information for identifying the speaker is recorded;
Identification information adding means for adding the identification information to the voice information;
A voice conference terminal device comprising: transmitting means for transmitting the voice information to which the identification information is added.

A microphone that collects the voice of the speaker and outputs voice information indicating the voice of the speaker;
Voice recognition means for recognizing the voice of the speaker;
Identification means for determining identification information based on a recognition result of the voice recognition means;
Identification information adding means for adding the identification information to the voice information;
A voice conference terminal device comprising: transmitting means for transmitting the voice information to which the identification information is added.

Receiving means for receiving voice information to which identification information for identifying a speaker is added;
A feature adding means for adding a feature to the voice indicated by the voice information based on the identification information;
A voice conference terminal device comprising: voice output means for outputting the voice with the feature added by the feature adding means.

Voice information output processing for collecting the voice of the speaker and outputting voice information indicating the voice of the speaker;
An identification process that is provided in the vicinity of the microphone and outputs identification information based on the owner information read from the recording medium in which the owner information for identifying the speaker is recorded;
An identification information adding process for adding the identification information to the voice information;
A program for causing a computer to execute transmission processing for transmitting the audio information to which the identification information is added.

Voice output processing for collecting the voice of the speaker and outputting voice information indicating the voice of the speaker;
A voice recognition process for recognizing the voice of the speaker;
An identification process for determining identification information based on a recognition result of the voice recognition process;
An identification information adding process for adding the identification information to the voice information;
A program for causing a computer to execute transmission processing for transmitting the audio information to which the identification information is added.

A receiving process for receiving voice information to which identification information for identifying a speaker is added;
A feature addition process for adding a feature to the voice indicated by the voice information based on the identification information;
A program for causing a computer to execute a sound output process for outputting a sound with a feature added by the feature addition process.