WO2002025908A2

WO2002025908A2 - Packet-based conferencing

Info

Publication number: WO2002025908A2
Application number: PCT/CA2001/001298
Authority: WO
Inventors: Philip K. Edholm; Frederic F. Simard; Nina K. Burns
Original assignee: Nortel Networks Limited
Priority date: 2000-09-18
Filing date: 2001-09-13
Publication date: 2002-03-28
Also published as: AU2001293542A1; WO2002025908A3; CA2422448A1; EP1323286A2

Abstract

Numerous packet-based terminals coupled within a packet-based network can establish a voice conference without the use of a conference bridge if the packet-based terminals can support specific operations. These specific operations include receiving voice data packets from each of the other packet-based terminals within the voice conference, determining a set of talkers within the voice conference and processing the received media data packets appropriately for the selected set of talkers so as to output uncompressed voice signals corresponding to the talkers to a speaker coupled to the packet-based terminal. The removal of the conference bridge can allow the packet-based apparatus to become independent from the packet-based network administration. Further, the removal of the conference bridge allows a reduction in transcoding and hence, allows a better quality signal to be received at the individual apparatus.

Description

APPARATUS AND METHOD FOR PACKET-BASED MEDIA COMMUNICATIONS

FIELD OF THE INVENTION

This invention relates generally to packet-based media communications and more specifically to media conferencing within a packet-based communication network.

BACKGROUND OF THE INVENTION

Prior to the use of packet-based voice communications, telephone conferences were a service option available within standard non-packet-based telephone networks such as Pulse Code Modulation (PCM) telephone networks. As depicted in FIGURE 1, a standard telephone switch 20 is coupled to a plurality of telephone terminals 22 to be included within a conference session as well as a central conference bridge 24. It is noted that these telephone terminals 22 are coupled to the telephone switch 20 via numerous other telephone switches (not shown). The telephone switch 20 forwards any voice communications received from the terminals 22 to the central conference bridge 24, which then utilizes a standard algorithm to control the conference session.

One such algorithm used to control a conference session, referred to as a "party line" approach, comprises the steps of mixing the voice communications received from each telephone terminal 22 within the conference session and further distributing the result to each of the telephone terminals 22 for broadcasting. A problem with this algorithm is the amount of noise that is combined during the mixing step, this noise comprising a background noise source corresponding to each of the telephone terminals 22 within the conference session. An improved algorithm for controlling a conference session is disclosed within U.S. patent application 08/987216 entitled "Method of Providing Conferencing in Telephony" by Dal Farra et al., filed on December 9, 1997, assigned to the assignee of the present invention, and herein incorporated by reference. This algorithm comprises the steps of selecting primary and secondary talkers, mixing the voice communications from these two talkers and forwarding the result of the mixing to all the participants within the conference session except for the primary and secondary talkers. The primary and secondary talkers receive the voice communications corresponding to the secondary and primary talkers respectively. The selection and mixing of only two talkers at any one time can reduce the background noise level within the conference session when compared to the "party line" approach described above.
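The distribution rule of the primary/secondary algorithm described above can be illustrated with a minimal sketch. It assumes 16-bit PCM samples and a simple clamped sum as the mixing operation; the function names `mix` and `route_frames` and the terminal labels are hypothetical and are not taken from the referenced application.

```python
from typing import Dict, List


def mix(a: List[int], b: List[int]) -> List[int]:
    """Sum two equal-length PCM frames, clamping each sample to 16 bits.

    A clamped sum is an assumed mixing rule, chosen only for illustration.
    """
    return [max(-32768, min(32767, x + y)) for x, y in zip(a, b)]


def route_frames(
    frames: Dict[str, List[int]], primary: str, secondary: str
) -> Dict[str, List[int]]:
    """Decide what each terminal hears for one frame interval.

    The primary talker hears the secondary talker, the secondary talker
    hears the primary talker, and every other terminal hears the mix.
    """
    mixed = mix(frames[primary], frames[secondary])
    out: Dict[str, List[int]] = {}
    for terminal in frames:
        if terminal == primary:
            out[terminal] = frames[secondary]
        elif terminal == secondary:
            out[terminal] = frames[primary]
        else:
            out[terminal] = mixed
    return out
```

Because only two sources are ever summed, the background noise of the non-selected terminals never enters the mix, which is the advantage over the "party line" approach noted above.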

In a standard PCM telephone network as is depicted in FIGURE 1, all of the voice communications are in PCM format when being received at the central conference bridge 24 and when being sent to the individual telephone terminals 22. Hence, in this situation, the mixing of the voice communications corresponding to the primary and secondary talkers is relatively simple with no conversions of format required. Currently, packet-based voice communications are being utilized more frequently as Voice-over-Internet Protocol (VoIP) becomes increasingly popular. In these standard VoIP communications, voice data in PCM form is being encapsulated with a header and footer to form voice data packets; the header in these packets has, among other things, a Real-time Transport Protocol (RTP) header that contains a time stamp corresponding to when the packet was generated. One area that requires considerable improvement is the use of packet-based voice communications to provide telephone conferencing capabilities. As depicted within FIGURE 2, a plurality of packet-based voice communication terminals, VoIP handsets 26 in this case, are coupled to a packet-based network, an IP network 28 in this case. Currently, in order for the users of these VoIP terminals 26 to communicate within a voice conference, a packet-based voice communication central bridge, in this case a VoIP central conference bridge 30, must be coupled to the IP network 28. This VoIP central conference bridge 30 has a number of problems. These problems include the latency inherently created within the conference bridge 30, the considerable amount of signal processing power required, the cost of the conference bridge, the provisioning of the conference bridges within a network and the maintenance and management of the conference bridges that are required. 
It should be noted that the high signal processing power required is partially due to the conference bridge 30 having to compensate for a variety of problems that typically exist within current IP networks. These problems include possible variable delays, out-of-sequence packets, lost packets, and/or unbounded latency. FIGURE 3A is a logical block diagram of a well-known VoIP central conference bridge design while FIGURE 3B is a logical block diagram of a well-known VoIP terminal design. In the design of FIGURE 3A, the conference bridge 30 comprises an inputting block 32, a talker selection and mixing block 34, and an outputting block 36. Typically all three of these blocks are implemented in software. The inputting block 32 comprises, for each participant within the voice conference, a protocol stack (P.S.) 38 coupled in series with a jitter buffer (J.B.) 40 and a decompression block (DECOMP.) 42, each of the decompression blocks 42 further being coupled to the talker selection and mixing block 34. The protocol stacks 38 in this design perform numerous functions including receiving packets comprising compressed voice signals, hereinafter referred to as voice data packets; stripping off the packet overhead required for transmitting the voice data packet through the IP network 28; and outputting the compressed voice signals contained within the packets to the respective jitter buffer 40. The jitter buffers 40 receive these compressed voice signals, ensure that the compressed voice signals are within the proper sequence (i.e. time ordering signals), buffer the compressed voice signals to ensure smooth playback, and ideally implement packet loss concealment. The output of each of the jitter buffers 40 is a series of compressed voice signals within the proper order that are then fed into the respective decompression block 42. 
The decompression blocks 42 receive these compressed voice signals, convert them into standard PCM format and output the resulting voice signals (that are in Pulse Code Modulation) to the talker selection and mixing block 34. The talker selection and mixing block 34 preferably performs almost identical functionality to the central conference bridge 24 within FIGURE 1. The key to the design of a VoIP central conference bridge 30 as depicted in FIGURE 3A is the inputting block 32 transforming the packet-based voice communications into PCM voice communications so the well-known conferencing algorithms can be utilized within the block 34. As described previously, in one conferencing algorithm, primary and secondary talkers are selected for transmission to the participants in the conference session to reduce the background noise level from participants who are not talking and to simplify the mixing algorithm required. Hence, the resulting output from the talker selection and mixing block 34 is a voice communication consisting of a mix between the voice communications received from a primary talker and a secondary talker, the primary and secondary talkers being determined within the block 34. Further outputs from the talker selection and mixing block 34 include the unmixed voice communications of the primary and secondary talkers that are to be forwarded, as described previously, to the secondary and primary talkers respectively. The outputting block 36 comprises three compression blocks 44 and a plurality of transmitters 46. The compression blocks 44 receive respective ones of the three outputs from the talker selection and mixing block 34, compress the received voice signals, and independently output the results to the appropriate transmitters 46. In this case, the mixed voice signals, after being compressed, are forwarded to all the transmitters 46 with the exception of the transmitters directed to the primary and secondary talkers. 
The transmitters directed to the primary and secondary talkers receive the appropriate unmixed voice signals. Each of the transmitters 46, after receiving a compressed voice signal, subsequently performs a protocol stack operation on the compressed voice signal, encapsulates the compressed voice signal within the packet-based format required for transmission on the IP network 28 and transmits a voice data packet comprising the compressed voice signal to the appropriate VoIP terminal 26 within the conference session.

The well-known terminals 26, as depicted in FIGURE 3B, each comprise a protocol stack 47 coupled in series with a jitter buffer 48 and a decompression block 49, these blocks typically being implemented in software. Voice data packets sent from the central conference bridge 30 are received at the protocol stack 47 which subsequently removes the packet overhead from the received voice data packets, leaving only the compressed voice signal sent from the packet-based central conference bridge 30. The jitter buffer 48 next performs numerous functions similar to those performed by the jitter buffers 40 including ensuring that the compressed voice signals are within the proper sequence, buffering the compressed voice signals to ensure smooth playback, and ideally implementing packet loss concealment. Subsequently, the decompression block 49 receives the compressed voice signals, decompresses them into PCM format, and forwards the voice signals to the speaker within the particular terminal 26 for broadcasting the voice signals audibly.
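The reordering and buffering functions attributed to the jitter buffers above can be sketched as follows. This is an illustrative model only: the class name `JitterBuffer`, the fixed `depth` threshold and the use of the time stamp as the ordering key are assumptions, time stamp wraparound is ignored, and packet loss concealment is not modelled.

```python
import heapq
from typing import List, Optional, Set, Tuple


class JitterBuffer:
    """Release compressed frames in time stamp order after a short delay."""

    def __init__(self, depth: int = 3) -> None:
        self.depth = depth                  # frames held back before playout (assumed)
        self._heap: List[Tuple[int, bytes]] = []
        self._pending: Set[int] = set()     # time stamps currently buffered
        self._last: Optional[int] = None    # time stamp of the last released frame

    def push(self, timestamp: int, frame: bytes) -> None:
        # Drop frames that arrive too late to be played in order.
        if self._last is not None and timestamp <= self._last:
            return
        # Drop duplicates of a frame that is still buffered.
        if timestamp in self._pending:
            return
        self._pending.add(timestamp)
        heapq.heappush(self._heap, (timestamp, frame))

    def pop(self) -> Optional[Tuple[int, bytes]]:
        # Hold frames until enough are buffered to absorb reordering.
        if len(self._heap) < self.depth:
            return None
        timestamp, frame = heapq.heappop(self._heap)
        self._pending.discard(timestamp)
        self._last = timestamp
        return timestamp, frame
```

A larger `depth` tolerates more network reordering at the cost of added latency, which is the trade-off behind the latency concerns raised for this design.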

One key problem with the setup depicted within FIGURES 3A and 3B is the degradation of the voice signals as the voice signals are converted from PCM format to compressed format and vice versa, these conversions together being referred to generally as transcoding. A further problem results from the considerable latency that the processing within the VoIP central conference bridge 30 and the processing within the individual terminals 26 create. The combined latency of this processing can result in a significant delay between when the talker(s) speaks and when the other participants in the conference session hear the speech. This delay can be noticeable to the participants if it is beyond the perceived real-time limits of human hearing. This could result in participants talking while not realizing that another participant is speaking. Yet another key problem with the design depicted in FIGURES 3A and 3B is the considerable amount of signal processing power that is required to implement the conference bridge 30. As stated previously, each of the components shown within FIGURE 3A is normally simply a software algorithm being run on DSP component(s). This considerable amount of required signal processing power is expensive.

Hence, a new design within a packet-based voice communication network is required to implement voice conferencing functionality. In this new design, a reduction in transcoding, latency, and/or required signal processing power within the conferencing network is needed.

SUMMARY OF THE INVENTION

The present invention is directed to methods and apparatus that can be utilized within a packet-based media communication system for media conferences. Packet-based apparatus are described that can be coupled within a packet-based network such that a media conference can be established without the use of a conference bridge. These packet-based apparatus can receive media data packets from each of the other packet-based apparatus within the media conference, determine a set of talkers within the media conference and process the received media data packets appropriately for the selected set of talkers so as to output media signals corresponding to the talkers. The removal of the conference bridge can allow the packet-based apparatus to become independent from the packet-based network administration. Further, the removal of the conference bridge allows a reduction in transcoding and hence, allows a better quality signal to be received at the individual apparatus.

The present invention, according to a first broad aspect, is a packet-based apparatus including a receiver capable of being coupled to a network, an energy detection and talker selection unit and an output unit. The receiver operates to receive a media data packet from at least two sources forming a media conference, each media data packet defining a media signal. The energy detection and talker selection unit operates to process the media signals including selecting a set of the sources within the media conference as talkers. Finally, the output unit operates to output media signals that correspond to the talkers. According to a second broad aspect, the present invention is a method for outputting media signals within a media conference. In this aspect, the method initially receives a media data packet from at least two sources forming a media conference, each media data packet defining a media signal. Next, the method includes processing the received media data packets including selecting a set of the sources within the media conference as talkers. Finally, the method includes outputting media signals that correspond to the talkers. According to a third broad aspect, the present invention is a packet-based network comprising a plurality of packet-based apparatus. In this aspect, at least two of the plurality of packet-based apparatus operate to output media data packets comprising media signals. These packet-based apparatus together form a media conference. Further, within this aspect, at least one of the packet-based apparatus within the media conference operates to receive the media data packets from the packet-based apparatus within the media conference; to process the media signals corresponding to the received media data packets including selecting a set of the packet-based apparatus within the media conference as talkers; and to output media signals that correspond to the talkers.

In yet further aspects of the present invention, media data packets or media signals corresponding to the packet-based apparatus are received at the energy detection and talker selection unit and are considered as a source when selecting a set of the sources within the media conference as talkers. In some embodiments of the present invention, the packet-based apparatus according to the above described aspects is a packet-based terminal while, in other embodiments, the packet-based apparatus is a packet-based network interface arranged to be coupled, via a non-packet-based network, to a non-packet-based terminal.

Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

The preferred embodiments of the present invention are described with reference to the following figures, in which:

FIGURE 1 is a simplified block diagram illustrating a well-known circuit switched network with a voice conferencing capability;

FIGURE 2 is a simplified block diagram illustrating a well-known packet-based network with a voice conferencing capability;

FIGURES 3A and 3B are logical block diagrams illustrating a well-known packet-based central conference bridge and a well-known packet-based terminal respectively implemented within the packet-based network of FIGURE 2;

FIGURE 4 is a simplified functional block diagram illustrating a packet-based terminal according to an embodiment of the present invention;

FIGURE 5 is a flow chart illustrating the operations performed by a packet receipt block and an energy detection and talker selection block implemented within the packet-based terminal of FIGURE 4;

FIGURE 6 is a flow chart illustrating the operations performed by an output generator implemented within the packet-based terminal of FIGURE 4;

FIGURE 7 is a more detailed functional block diagram of the block diagram of FIGURE 4 during a sample operation;

FIGURE 8 is a detailed functional block diagram illustrating an alternative embodiment of the packet-based terminal of FIGURE 4 during a sample operation; and

FIGURE 9 is a simplified block diagram illustrating a well-known packet-based network coupled to a well-known PCM telephone network with a voice conferencing capability.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

One skilled in the art would understand that there are two main aspects for the operations of a telephone session. These aspects include a control plane that performs administrative functions such as access approval and buildup/tear-down of telephone sessions and/or conference sessions and a media plane which performs the signal processing required on media (voice or video) streams such as format conversions and mixing operations. As described below, the present invention is applicable to modifications within the media plane which could be implemented with a variety of different control planes while remaining within the scope of the present invention.

Embodiments of the present invention described herein below are directed to packet-based apparatus coupled within a packet-based network that enable media conferences between numerous sources of media signals. These sources of media signals can be any device in which a person can output media data for transmission to the packet-based apparatus. In some embodiments, the packet-based apparatus are packet-based terminals coupled together within a packet-based network, each of the packet-based terminals being a source for media signals for the other packet-based apparatus.

In other embodiments, one or more of the packet-based apparatus are packet-based network interfaces which couple standard non-packet-based terminals, such as PCM or analog telephone terminals, to a packet-based network, each of the non-packet-based terminals being a source for media signals for the packet-based apparatus. This situation is illustrated within FIGURE 9 in which a non-packet-based telephone network, in this case PCM telephone network 150, is coupled to a packet-based network, in this case IP network 28, via a packet-based network interface, in this case IP Gateway 152. As shown in FIGURE 9, a number of standard PCM telephone handsets 154 are coupled to the PCM telephone network 150, these PCM telephone terminals 154 possibly being considered as sources of media signals within embodiments of the present invention. Further, sources of media signals could be other devices that allow for the outputting of media data, this media data being in the form of media data packets when it is received at the packet-based apparatus described for preferred embodiments of the present invention.

In the following description, it should be understood that despite referring to the sources of media signals as packet-based terminals within the packet-based network throughout this document, such references could alternatively be directed to another form of media signal source. Further, although the packet-based apparatus described for the preferred embodiments are the packet-based terminals that also serve as the source for media signals, it should be understood that, alternatively, the packet-based apparatus could be packet-based network interfaces. Yet further, although the following description of the preferred embodiments of the present invention is specific to voice data packets that contain compressed voice signals and generally to voice conferencing, this should not limit the scope of the present invention as is described in further detail herein below.

A packet-based network, according to some embodiments of the present invention, that is capable of establishing voice conferences is now described with reference to FIGURES 4 through 8. In this design, conference sessions are initiated and maintained without the use of a central conference bridge, with each participant within the voice conference forwarding voice data packets generated at its particular packet-based terminal to all of the other participants within the voice conference. This forwarding of voice data packets from one point to multipoint could be done with a plurality of unicast transmissions or with a single multicast transmission to which each participant within the voice conference tunes in. FIGURE 4 illustrates a simplified block diagram of a packet-based terminal according to some embodiments of the present invention. This packet-based terminal preferably replaces, within FIGURE 2, the well-known packet-based terminal depicted within FIGURE 3B. There are a number of differences between the packet-based terminal depicted in FIGURE 4 and that of FIGURE 3B as will be described herein below. These differences allow for voice conferences to be established within the packet-based network 28 without the use of a packet-based central conference bridge. As depicted in FIGURE 4, the packet-based terminal comprises a packet receipt block 50, an energy detection and talker selection block 60 and an output generator 70. Although the blocks within FIGURE 4 are depicted as separate components, these blocks are meant to be logical representations of algorithms which are hereinafter referred to collectively as conference processing logic. Preferably, some or all of the conference processing logic is essentially software algorithms operating within a single control component such as a DSP. In alternative embodiments, some or all of the conference processing logic is comprised of hard logic and/or discrete components. 
The operations of the packet receipt block 50 and the energy detection and talker selection block 60 will be described with reference to FIGURE 5. The operation of the output generator 70 will be described with reference to FIGURE 6.

FIGURE 5 is a flow chart that depicts the steps performed by the packet receipt block 50 and the energy detection and talker selection block 60. This flow chart depicts the processing that occurs for a single voice data packet received by the packet-based terminal. It should be understood that multiple packets could proceed through this procedure at any one time which could possibly result in more than one packet being processed at the same step at the same time. Since these steps are preferably software operations, the situation in which a multiple number of packets operate at a common step within the procedure simply indicates that the software is being used by different packets in parallel.

The first step 80, as depicted in FIGURE 5, has the packet receipt block 50 receive a voice data packet from the packet-based network coupled to the packet-based terminal.

This packet may be an IP packet or a packet of another format that can be transported on the packet-based network. The packet is sent from another packet-based terminal being used within a voice conference (more generally referred to as a source for media signals) and contains a compressed voice signal that corresponds to a participant that is speaking at the particular terminal.

Next, as seen at step 81, the packet receipt block 50 removes the packet overhead from the received voice data packet. This overhead may include the actual packet header and footer utilized, as well as any other transport protocol wrapper. The removal of the packet overhead results in only the compressed voice signal within the received packet being forwarded on for further processing. It is noted though that information contained within the packet overhead, such as the source address, is still preferably used by the control plane to identify the source terminal and the voice conference to which this particular voice signal corresponds. Further, it is noted that a time stamp within an RTP header of the packet header is preferably extracted and used in later processing within the media plane as described below.
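As an illustration of step 81, the sketch below strips an RTP header and keeps the time stamp for later use. It assumes that the lower-layer wrappers (for example IP and UDP) have already been removed, that the packet follows the RTP version 2 fixed header layout of RFC 3550, and that the RTP SSRC field can stand in for the source identifier; the function name `strip_rtp_header` is hypothetical, and RTP padding and any packet footer are not handled.

```python
import struct
from typing import Tuple

RTP_FIXED_HEADER_LEN = 12  # bytes in the RTP fixed header (RFC 3550)


def strip_rtp_header(packet: bytes) -> Tuple[int, int, bytes]:
    """Return (ssrc, timestamp, payload) for one RTP version 2 packet."""
    if len(packet) < RTP_FIXED_HEADER_LEN:
        raise ValueError("packet shorter than the RTP fixed header")
    first, _marker_pt, _seq, timestamp, ssrc = struct.unpack(
        "!BBHII", packet[:RTP_FIXED_HEADER_LEN]
    )
    if first >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    # Skip the optional CSRC list: 4 bytes per entry, count in the low 4 bits.
    offset = RTP_FIXED_HEADER_LEN + 4 * (first & 0x0F)
    if first & 0x10:  # X bit set: one header extension follows the CSRC list
        if len(packet) < offset + 4:
            raise ValueError("truncated header extension")
        (ext_words,) = struct.unpack("!H", packet[offset + 2 : offset + 4])
        offset += 4 + 4 * ext_words
    if offset > len(packet):
        raise ValueError("header longer than packet")
    # What remains is the compressed voice signal.
    return ssrc, timestamp, packet[offset:]
```

The returned payload corresponds to the compressed voice signal forwarded for further processing, while the time stamp is retained for the later media plane steps.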

The compressed voice signal is subsequently processed by the energy detection and talker selection block 60 as depicted at steps 82 through 90. Firstly within this processing, the block 60 determines if the compressed voice signal contains speech at step 82 by performing an energy detection operation. A compressed voice signal containing speech indicates that the source of the corresponding voice data packet has a local participant who is speaking.

This energy detection operation can be performed in a number of different manners. Some possible energy detection operations are disclosed within U.S. patent application 09/475,047 entitled "Apparatus and Method for Packet-Based Media Communications" by the inventors of the present invention, filed on December 30, 2000, assigned to the assignee of the present invention and herein incorporated by reference.

In one embodiment as disclosed within U.S. patent application 09/475,047, a Voice Activity Detection (VAD) operation is enabled at the packet-based terminal that sent the voice data packet. The VAD operation alternatively is enabled at the packet-based network interface if the source of media signals is a non-packet-based telephone terminal. In this embodiment, packets (and therefore compressed voice signals) that can contain speech can be distinguished from packets that do not by the number of bytes contained within the packet. In other words, the size of the compressed voice signal can determine whether it contains speech. For example, in the case that the G.723.1 VoIP standard is utilized, voice data packets containing voice would contain a compressed voice signal of 24 bytes while voice data packets containing essentially silence would contain a compressed voice signal of 4 bytes.
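The size-based test described above can be sketched in a few lines. The 24-byte and 4-byte values are the G.723.1 figures quoted in the text; the constant and function names are hypothetical, and treating any other size as non-speech is an assumption made for this illustration.

```python
# Payload sizes quoted above for G.723.1 with VAD enabled at the sender.
SPEECH_PAYLOAD_SIZES = frozenset({24})  # bytes; other codec rates would add entries
SILENCE_PAYLOAD_SIZE = 4                # bytes


def contains_speech_by_size(compressed_signal: bytes) -> bool:
    """Classify a compressed voice signal by its length alone."""
    return len(compressed_signal) in SPEECH_PAYLOAD_SIZES
```

Because only the length is inspected, this check requires no decompression of the voice signal at the receiving terminal.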

In another embodiment as disclosed within U.S. patent application 09/475,047, in which a VAD operation is not enabled at the packet-based terminal (or packet-based network interface) sending the voice data packet, the block 60 determines if there is speech within the compressed voice signal by monitoring a pitch-related sector within the corresponding voice data packet. For example, within the G.723.1 VoIP standard, the pitch sector is an 18-bit field that contains pitch lag information for all subframes. In this particular embodiment, the block 60 uses the pitch sector to generate a pitch value for each subframe. If the pitch value is within a particular predetermined range, the corresponding compressed voice signal is said to contain speech. If not, the compressed voice signal is said to not contain speech. This predetermined range can be determined by experimentation or alternatively calculated mathematically. It is noted that many current VoIP standard codecs include pitch information as part of the transmitted packet and a similar comparison of pitch values with a predetermined range can be used with these standards. It is further noted that the energy determination operations which determine whether a particular compressed voice signal contains speech should not be limited to the above described embodiments. If the compressed voice signal at step 82 is deemed to not contain speech, the particular signal is discarded at step 83. The frequency with which signals are discarded from a signal source based upon their lack of speech affects the deselection of talkers for the voice conference as will be described herein below. 
If the compressed voice signal at step 82 does contain speech, the energy detection and talker selection block 60 proceeds to determine at step 84 whether the compressed voice signal is from a packet-based terminal (more generally a source of media data packets) selected to be a talker, voice signals from talkers being the only voice signals heard by the participant(s) at the particular packet-based terminal.
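The pitch-based check described above for senders without VAD can be sketched as follows. The sketch assumes the per-subframe pitch values have already been derived from the pitch sector; the range bounds are placeholder values rather than figures from the source, and requiring every subframe to fall within the range is one possible reading of the text.

```python
from typing import Iterable

# Hypothetical bounds; the text states that the range is determined by
# experimentation or calculated mathematically.
PITCH_MIN = 40
PITCH_MAX = 120


def contains_speech_by_pitch(
    pitch_values: Iterable[int], lo: int = PITCH_MIN, hi: int = PITCH_MAX
) -> bool:
    """Return True if every subframe pitch value lies within [lo, hi]."""
    values = list(pitch_values)
    if not values:
        return False  # no pitch information, so treat as non-speech
    return all(lo <= value <= hi for value in values)
```

A signal for which this returns False would be discarded at step 83, while one for which it returns True would continue to the talker check at step 84.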

The selection and de-selection of terminals as talkers is performed by a talker selection algorithm within the block 60. Although it is the terminal that is referenced as the source for the voice data packets containing speech, for simplicity herein below, the description will refer to the talker selection algorithm determining which participants are speaking rather than referring to which terminals have participants that are speaking. It should be recognized that a reference to a participant speaking indicates that the voice data packet received from the terminal corresponding to the particular participant has been deemed to contain speech. There are preferably three main situations which would result in different operations for the talker selection algorithm, these situations being no participants speaking, only one participant speaking, and two or more participants speaking at once. For the first case in which there are no participants speaking, the talker selection algorithm preferably has no terminals selected as talkers, thus removing the need for any further processing to take place. When considering the second case in which only one participant is speaking, the talker selection algorithm preferably has only one terminal selected as a talker, that terminal being the one corresponding to the speaking participant. In this situation, the single talker is hereinafter referred to as a "lone talker".

In the third case in which two or more participants at different terminals are speaking at the same time, the talker selection algorithm preferably has one terminal selected as a "primary talker" and a second terminal selected as a "secondary talker" for the voice conference. When considering this situation, the talker selection algorithm selects the primary and secondary talkers using a predetermined selection parameter. In one preferred embodiment, this selection parameter is the order in which the participants began to speak. In another embodiment, the selection parameter takes into consideration the volume level of the participants (i.e. comparing the energy levels of the talkers). In yet another embodiment, a control mechanism is in place that automatically selects a participant to be the primary or secondary talker. This control mechanism could be utilized in cases that there is a moderator and/or a scheduled speaker for the voice conference. In yet a further embodiment, a control mechanism is in place that allows a user of a packet-based terminal to customize his/her personal settings in order to block out a particular participant or always select a particular participant as a talker.
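The three situations above, combined with the order-of-speaking selection parameter and a maximum of two talkers, can be sketched as a small state holder. The class and method names are hypothetical, de-selection is deliberately left out, and representing a lone talker as a primary talker with no secondary is an implementation choice made for this illustration.

```python
from typing import Optional


class TalkerSelector:
    """First-come talker selection with at most two talkers."""

    def __init__(self) -> None:
        self.primary: Optional[str] = None    # also holds the lone talker
        self.secondary: Optional[str] = None

    def on_speech(self, source: str) -> Optional[str]:
        """Classify a speech-bearing signal from `source`.

        Returns "lone", "primary" or "secondary" if the signal should be
        processed further, or None if it should be discarded.
        """
        if source == self.primary:
            return "lone" if self.secondary is None else "primary"
        if source == self.secondary:
            return "secondary"
        if self.primary is not None and self.secondary is not None:
            return None                       # two talkers already selected
        if self.primary is None:
            self.primary = source             # first speaker becomes the lone talker
            return "lone"
        self.secondary = source               # lone talker is now the primary talker
        return "secondary"
```

With no participants speaking, `on_speech` is never called and no terminal is selected, which matches the first case.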

The above described selection parameters are not meant to limit the scope of the present invention. In fact, the key to this portion of the preferable packet-based apparatus is the selection of talkers while the parameter used for this selection and the number of talkers selected is not directly relevant to the present invention.

Preferably, the talker selection algorithm comprises a software algorithm that is continuously operating during a voice conference with the determination of those speaking and the selection of no talkers, a lone talker, or primary and secondary talkers being dynamic during the receiving of voice data packets as will be described with reference to steps 84 through 90. As well, the talker selection algorithm preferably performs operations to deselect talkers continuously during the voice conference. These de-selection operations preferably include the steps of determining the length of time between voice data packets containing speech coming from the talker(s) and de-selecting any talker if the length of time between voice data packets containing speech exceeds a threshold level. Of course, other de-selection techniques could be utilized as the actual de-selection operation being used is not critical to the present invention.

Referring back to FIGURE 5, the above described talker selection algorithm, for the case that the talker selection parameter is the order in which the participants begin to speak and a maximum of two talkers are selected at once, is implemented in steps 84 through 90. As mentioned previously at step 84, the energy detection and talker selection block 60 determines if the compressed voice signal is from a participant selected as a talker. If the compressed signal is from a talker, the talker selection algorithm determines, as depicted at step 85, if the talker is a lone talker, a primary talker, or a secondary talker. As will be described herein below, the output generator 70 processes the compressed voice signal differently depending on the "type" of talker it corresponds to.
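The time-based de-selection operation described above can be sketched as follows. This is only one possible reading: the class name `TalkerTimeouts`, the threshold expressed in seconds, and the use of caller-supplied timestamps are all assumptions introduced for the example, and the gap is measured from the last speech-bearing packet to the current time.

```python
from dataclasses import dataclass, field


@dataclass
class TalkerTimeouts:
    """Tracks when each selected talker last sent a packet deemed to contain speech."""

    threshold_s: float  # hypothetical de-selection threshold, in seconds
    last_speech: dict = field(default_factory=dict)  # talker id -> timestamp

    def note_speech(self, talker, now):
        # Record the arrival time of a speech-bearing packet from a talker.
        self.last_speech[talker] = now

    def deselect_idle(self, now):
        # De-select every talker whose gap since the last speech packet
        # exceeds the threshold; return the ids that were de-selected.
        idle = [t for t, seen in self.last_speech.items()
                if now - seen > self.threshold_s]
        for t in idle:
            del self.last_speech[t]
        return idle
```

A terminal could call `deselect_idle` periodically, or on every received packet, so that a silent talker frees its slot for another participant.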

If, at step 84, the compressed voice signal does not correspond to a talker selected by the talker selection algorithm, the talker selection algorithm proceeds to determine if there are currently two talkers selected at step 86. If there are two talkers already selected, the compressed voice signal is discarded at step 83. If there are not two talkers already selected at step 86, the talker selection algorithm determines if there is currently a lone talker selected at step 87. If there is not a lone talker already selected at step 87, the talker selection algorithm selects the participant corresponding to the particular compressed voice signal as the lone talker at step 88. If there is a lone talker currently selected at step 87, the talker selection algorithm proceeds to set the participant corresponding to the particular compressed voice signal as the secondary talker at step 89 and to set the lone talker as the primary talker at step 90.

The procedure that occurs within the output generator 70 if the compressed voice signal corresponds to one of a lone talker, a primary talker, and a secondary talker will now be described with reference to FIGURE 6. Firstly, at step 94, if the compressed voice signal corresponds to the secondary talker, the output generator 70 proceeds to perform jitter buffer operations on the compressed voice signal, hereinafter referred to as a secondary voice signal, as were previously described for jitter buffers 38, 47 within FIGURES 3A and 3B respectively. These jitter buffer operations preferably include ensuring that the voice signals are within the proper sequence (i.e. time ordering signals) and buffering the signals to ensure smooth playback. Next, at step 96, the output generator determines whether a replacement for the secondary voice signal has previously been generated by monitoring the time stamp associated with the secondary voice signal and comparing it to the time stamps associated with previously received secondary voice signals. If it is found that a replacement for the voice signal was previously generated, the secondary voice signal is discarded at step 98 and the conference processing logic returns to step 80. If it is found that no replacement for the voice signal has previously been generated, the secondary voice signal, as depicted at step 100, is decompressed (converting it into a decompressed voice signal that is preferably a PCM signal) and preferably temporarily saved within the output generator 70 in both compressed and decompressed formats. Alternatively, the secondary voice signal is saved within only one of the compressed and decompressed formats. Saving in only the compressed format would result in the need for a decompression operation at a subsequent step.
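The selection flow of steps 84 through 90, for the case of first-come selection with at most two talkers, can be sketched as a small state machine. The class and method names are assumptions for illustration, and the sketch deliberately omits de-selection and the output generator processing.

```python
class TalkerSelector:
    """First-come talker selection with at most two talkers (steps 84-90)."""

    def __init__(self):
        self.primary = None    # also holds the lone talker while secondary is None
        self.secondary = None

    def classify(self, source):
        """Return 'lone', 'primary', 'secondary', or None (signal discarded).

        Call only for signals already deemed to contain speech.
        """
        # Steps 84/85: the signal comes from an already selected talker.
        if source == self.primary:
            return "lone" if self.secondary is None else "primary"
        if source == self.secondary:
            return "secondary"
        # Step 86: two talkers are already selected, so discard (step 83).
        if self.primary is not None and self.secondary is not None:
            return None
        # Steps 87/88: nobody is selected, so the source becomes the lone talker.
        if self.primary is None:
            self.primary = source
            return "lone"
        # Steps 89/90: the source becomes secondary and the lone talker primary.
        self.secondary = source
        return "secondary"
```

For example, if participant B speaks first and participant C joins in, B is classified as the lone talker and then as the primary talker, C as the secondary talker, and a third simultaneous speaker is discarded.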

If it is determined that the compressed voice signal corresponds to the primary talker, the output generator 70, as shown at step 102, proceeds to perform jitter buffer operations on the compressed voice signal, hereinafter referred to as a primary voice signal, in similar fashion to that described above for step 94. Subsequently, at step 104, it is determined whether there is a secondary voice signal currently saved within the output generator 70 with a corresponding time stamp.

If there is no corresponding secondary voice signal currently saved, it is determined at step 106 whether a predetermined time T has expired. This predetermined time T is a waiting period in which the output generator 70 will not utilize the primary voice signal as the procedure returns to step 104. This compensates for minor delays caused in the network by providing the voice data packets arriving from the secondary talker a limited amount of leeway after the arrival of a voice data packet corresponding to the primary talker. Preferably, if no voice data packets arrive from the secondary talker after the time T expires, the voice data packets corresponding to the primary talker are not subsequently delayed by this delay mechanism. If the predetermined time T has expired at step 106, a voice signal is generated for the secondary talker at step 108 with the use of a packet loss concealment algorithm. This generated voice signal is an approximation of what the secondary talker is saying based upon previous secondary voice data packets that were received. One such packet loss concealment algorithm is disclosed within U.S. patent application 09/353906 entitled "Apparatus and Method of Regenerating a Lost Audio Segment" by Gunduzhan, filed on July 15, 1999, assigned to the assignee of the present invention and herein incorporated by reference.
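The decision made at steps 104 through 108 can be sketched as a single function. All names and parameters here are hypothetical, and the concealment shown (repeating the last received secondary signal, or substituting silence) is a deliberately naive stand-in, not the packet loss concealment algorithm of the referenced application.

```python
def secondary_for_mix(saved, timestamp, waited_s, wait_limit_s,
                      last_secondary, frame_len):
    """Decide which secondary signal to mix with a primary signal.

    saved: dict mapping a time stamp to a saved secondary signal (step 104).
    waited_s: how long the primary signal has already been held back.
    wait_limit_s: the predetermined time T (step 106).
    last_secondary: most recent secondary signal received, or None.
    frame_len: number of samples in a frame, used for the silence fallback.
    Returns (action, signal) where action is 'mix', 'wait', or 'conceal'.
    """
    # Step 104: a secondary signal with a corresponding time stamp is saved.
    if timestamp in saved:
        return "mix", saved[timestamp]
    # Step 106: time T has not expired, so keep waiting.
    if waited_s < wait_limit_s:
        return "wait", None
    # Step 108 (placeholder): approximate the missing secondary signal.
    if last_secondary is not None:
        return "conceal", list(last_secondary)
    return "conceal", [0] * frame_len
```

A caller would loop back to this check while the action is `'wait'`, mirroring the return to step 104 until time T expires.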

After the generation of a secondary voice signal at step 108 or if there was a corresponding secondary voice signal currently saved at step 104, a number of operations, as depicted at step 110, are preferably performed by the output generator 70. These operations include decompressing the compressed primary voice signal (and secondary voice signal if previously not done), hence converting it into an uncompressed voice signal that is preferably a PCM signal; mixing the primary voice signal with the secondary voice signal using a well-known mixing algorithm as is currently used for combining two uncompressed voice signals such as PCM signals, the primary and secondary voice signals being combined into a single uncompressed voice signal (preferably a PCM signal); and sending the result of the mixing operation to a speaker within the terminal for conversion into an audible form. In the alternative case in which the packet-based apparatus is a packet-based network interface, it should be understood that the result of the mixing operation would in fact be forwarded via the non-packet-based network, such as PCM telephone network 150, to a non-packet-based terminal, such as PCM terminal 154, for broadcasting on a speaker. At this point, the conference processing logic returns to step 80 within FIGURE 5.

If the compressed voice signal was determined to correspond to a lone talker, the output generator 70 preferably, as depicted at step 112, performs jitter buffer operations in the same manner as is done in steps 94 and 102. Next at step 114, the compressed voice signal is decompressed, hence converting it into an uncompressed voice signal that is preferably a PCM signal, and the result is sent to a speaker within the terminal for conversion into an audible form. Similar to step 110 described above, in the alternative case in which the packet-based apparatus is a packet-based network interface, the uncompressed voice signal would be forwarded, via a non-packet-based network, to a non-packet-based terminal for broadcasting on a speaker. Yet again, at this stage, the conference processing logic returns to step 80 within FIGURE 5.
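The mixing operation of step 110 is not specified beyond being a well-known algorithm for combining uncompressed signals. One common choice, shown here purely as an assumption, is sample-by-sample addition of 16-bit linear PCM with saturation at the limits of the sample range; the function name `mix_pcm16` is hypothetical.

```python
def mix_pcm16(primary, secondary):
    """Mix two equal-length sequences of 16-bit PCM samples by saturating addition."""
    if len(primary) != len(secondary):
        raise ValueError("frames must have the same number of samples")
    # Sum corresponding samples and clamp the result to the signed 16-bit range.
    return [max(-32768, min(32767, p + s)) for p, s in zip(primary, secondary)]
```

Saturation avoids the wrap-around distortion that plain integer addition would produce when both talkers are loud at the same instant.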

FIGURE 7 is a more detailed functional block diagram of the block diagram of FIGURE 4 for the case that the talker selection algorithm determines that there are two or more speakers and further selects primary and secondary talkers. As depicted in FIGURE 7, the terminal of FIGURE 4 logically comprises protocol stacks 52 for receiving voice data packets from each of the other participants within a voice conference (in this case participants B through Z), energy detection blocks 62 that are each coupled to one of the protocol stacks 52 and a talker selection block 64 coupled to all of the energy detection blocks 62. As can be seen in FIGURE 7, voice data packets from each of the participants, participants A through Z in this case, are input to a respective protocol stack 52. In this embodiment, these protocol stacks 52 are the only logical component within the packet receipt block 50. The protocol stacks 52 remove the packet overhead from the received voice data packets and output voice signals in compressed format. Preferably, the protocol stacks 52 together comprise a single software algorithm that is run for each received packet. In these embodiments, the software algorithm is possibly run multiple times in parallel as numerous packets from different participants can be received at one time.

In the detailed functional block diagram of FIGURE 7, it can be seen that the compressed voice signal output from each of the protocol stacks 52 is subsequently received by a corresponding energy detection block 62. These energy detection blocks 62 are one of the logical components within the energy detection and talker selection block 60 of FIGURE 4, with the energy detection blocks 62 together comprising a single software algorithm that is run for each compressed voice signal. It is determined for each of the voice signals within the received voice data packets whether the voice signal contains speech with use of the energy detection blocks 62, these determinations being forwarded to the talker selection block 64.
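The speech determination made for each compressed voice signal can be sketched using one of the speech parameters recited in the claims, namely the number of bytes within the compressed signal. The sketch assumes a codec that emits smaller frames during silence; the function names and the threshold value are hypothetical and would depend on the codec in use.

```python
def contains_speech(compressed_frame, min_speech_bytes=10):
    """Deem a compressed frame to contain speech if it is large enough.

    min_speech_bytes is a hypothetical threshold chosen for illustration.
    """
    return len(compressed_frame) >= min_speech_bytes


def speaking_sources(frames, min_speech_bytes=10):
    """Return the ids of sources whose latest frame is deemed to contain speech.

    frames: dict mapping a source id to its latest compressed frame (bytes).
    """
    return [src for src, frame in frames.items()
            if contains_speech(frame, min_speech_bytes)]
```

The list returned by `speaking_sources` corresponds to the determinations that the energy detection blocks forward to the talker selection block.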

The talker selection block 64 preferably receives the determinations of which of the received voice signals contain speech and, in the case of two or more speakers, determines which participants are the primary and secondary talkers. FIGURE 7 depicts the case that there are at least two current talkers in the voice conference and the talker selection block 64 has selected two participants to be the primary and secondary talkers.

Further comprised within the depiction of a packet-based terminal of FIGURE 7 are two jitter buffers 72 independently coupled to the talker selection block 64, two decompression blocks 74 coupled to respective ones of the jitter buffers 72, and a mixer 76 coupled to both the decompression blocks 74. In this setup, the primary and secondary compressed voice packets output from the talker selection block 64 are received at respective ones of the jitter buffers 72. The jitter buffers 72 operate, as described in steps 94 and 102, to ensure that the voice signals are within the proper sequence (i.e. time ordering voice signals) and to buffer the voice signals to ensure smooth playback. Next, assuming that the compressed voice signal corresponding to the secondary talker arrives within the predetermined time T of the voice signal corresponding to the primary talker, the primary and secondary compressed voice signals are decompressed such that they are preferably in PCM format at decompression blocks 74 and mixed together at mixer 76. The mixer 76 then subsequently sends the mixed signal to a speaker (not shown) within the terminal that converts the voice signal into an audible form. As discussed previously, alternatively, if the packet-based apparatus is a packet-based network interface rather than a packet-based terminal, the mixed signal will be sent via a non-packet-based network to a non-packet-based terminal for broadcasting on a speaker.

Although some embodiments of the present invention are as described above with reference to FIGURES 4 through 7, this description is not meant to limit the scope of the present invention. Numerous alternatives are possible such as the removal of the predetermined time T at step 106. This would result in the immediate generation of a secondary voice signal in the case that no such signal was previously saved.
Further, although the embodiments described above include the mixing of only the primary and secondary talkers, other embodiments could have the selection of more than two talkers and the subsequent mixing of all the selected voice signals. As well, other alternative embodiments only allow for a single talker at any one time and so no mixing stage is necessary at all.

Another alternative embodiment, as depicted in FIGURE 8, has any voice signals generated at the particular terminal possibly affecting the selection of primary and secondary talkers. As shown, voice signals in uncompressed format such as PCM format are output from a microphone (not shown) and received at a compression block 120 which compresses the voice signals and outputs them to an energy detection block 122 and a transmitter 124. The energy detection block 122, which is preferably simply a software algorithm the same as blocks 62, determines the energy level and outputs this energy information to the talker selection block 64. Hence in this alternative, the participant at the particular terminal is considered a source of media signals and could be selected as the primary or secondary talker for the terminal. If the participant at the terminal in question is selected as either the primary or secondary talker, the output generator 70 in this alternative treats the other participant selected to be a talker, if any, as a lone talker. If the participant at the terminal in question is selected to be a lone talker, the terminal discards all received voice signals and no voice signals are sent to the speaker. It should be understood that this alternative embodiment could also apply to the case in which the packet-based apparatus is a packet-based network interface.

Further, rather than having compressed voice signals from the compression block 120 being input to the energy detection and talker selection block 64, the compressed voice signal could be encapsulated within transmitter 124 and then subsequently received at the packet receipt block 50 of FIGURE 4. In this case, the voice data packet output from the transmitter 124 would be received at a protocol stack 52 and would be treated in similar fashion to packets from other packet-based terminals within the voice conference. It should be understood that the compression block 120 and the transmitter 124 combined can be considered a media data packet generation unit.

An advantage of these alternatives is that, assuming the talker selection algorithms within all of the terminals operate in a similar fashion, all of the terminals in the voice conference will select the same primary and secondary talkers. Therefore, all of the participants will hear the same two talkers. In embodiments without one of these alternatives, it is possible that a participant that is selected as the primary or secondary speaker on most other terminals within the voice conference may be hearing two other speakers. This may allow the particular participant to hear another participant that is effectively muted by some terminals. This inconsistency could cause confusion; for instance, if the particular participant replies to a comment made by the muted participant.

It is noted that despite not being shown on FIGURES 3B or 7, the compression block 120 and the transmitter 124 would preferably be included in these packet-based terminals. They are left off these figures for simplicity.

Yet another alternative embodiment within the packet-based terminal is the moving of the jitter buffer and/or the decompression operations to another position within the conference processing logic. The advantage of having the jitter buffer and decompression operations after the talker selection block 64 is the reduced number of jitter buffer and decompression operations that are required to be performed as they only must be performed on the voice signals corresponding to the primary and secondary talkers. In one of these alternative embodiments, the jitter buffer and/or decompression operations occur within the packet receipt block 50 directly after the protocol stack operation. In this case, the jitter buffer and/or decompression operations are required to be performed for every one of the participants in the voice conference. If the decompression operation is moved to the packet receipt block 50, the alternative depicted within FIGURE 8 could still be implemented with a slight modification. In this case, the compression block 120 is not necessary and uncompressed voice signals output from the microphone (not shown) would be received at the energy detection and talker selection block 60 along with the uncompressed voice signals output from the packet receipt block 50.

The packet-based terminal of embodiments as described herein above is not specific to any one packet-based voice communications standard (such as VoIP G.711, G.729, G.723, etc.), as it can be modified such that it can be used for numerous different standards. In one alternative embodiment, the packet-based terminal is a multi-mode terminal that allows for voice conferences of a number of different standards to utilize the single packet-based terminal. This implementation can have significant advantages due to a possible decrease in required transcoding. In typical conference bridge implementations, all received voice signals from selected talkers must be converted into a single packet-based voice communications standard utilized by the conference bridge prior to mixing, followed by subsequent conversions of the resulting mixed signals into the standards required by each packet-based terminal prior to outputting. With the implementation of multi-mode packet-based terminals as described above, at most a single conversion of standards may be necessary, and possibly none.

It should be noted that, although the network described above for preferred embodiments of the present invention was specific to networks used for voice conferencing, this should not limit the scope of the present invention. For instance, the network of packet-based terminals could be used for point-to-point communications as well as voice conferencing. In the case of a point-to-point voice communication, both terminals would select the other participant as a lone talker. This allows a point-to-point conversation to be expanded to a larger voice conference with no major configuration modifications.

There are numerous possible advantages of using a network of packet-based terminals according to the present invention over previous voice conferencing techniques. For one, the lack of a conference bridge can allow a user of the packet-based terminal relative independence from the packet-based network administration. This is an important advantage over central conference bridges which have limited bandwidth. Since a packet-based terminal will likely only be a part of a single voice conference at one time, bandwidth limitations with respect to the hardware used should not be a problem assuming the packet-based terminal is designed to handle as many participants as is necessary.

Another possible advantage of the present invention is the reduced number of compression and decompression operations that are required. Only a single decompression operation is required in the packet-based terminal of FIGURE 4 with no compression operations. Hence, no transcoding is required and an improved signal quality is possible. On the other hand, the traditional voice conferencing techniques have a decompression and compression operation within the central conference bridge as well as a further decompression operation within the individual terminals.

Yet another possible advantage of the present invention is the increased bandwidth distribution within the packet-based network due to the lack of a central point at which all voice data packets within a voice conference must meet, that central point traditionally being the conference bridge. The preferable implementation described above entails having a conference processing logic that is distributed amongst the packet-based terminals of the voice conference.

Even another possible advantage of the present invention is a possible reduction in latency due to a possible reduction in equipment that voice data packets must traverse. In traditional conference bridge implementations, a voice data packet from a talker must traverse a first set of equipment to reach the conference bridge and, after being processed by the conference bridge, must traverse a second set of equipment to reach other packet-based terminals within the voice conference. With the implementation of the present invention, if any of the equipment of the first and second set are the same, it may be possible to reduce the amount of equipment a voice data packet traverses, hence reducing its latency. This advantage is especially important over implementations in which the conference bridge is either physically remote from the packet-based terminals of the voice conference or implemented on a separate network from the packet-based terminals.

Although the embodiments of the present invention described above are specific to packet-based networks comprising a plurality of packet-based terminals (or packet-based apparatus in general) that each perform talker selection operations, the present invention should not be limited to such embodiments. In an alternative embodiment, less than all of the packet-based terminals within a packet-based network perform talker selection and, if necessary, mixing operations. In one such alternative, only one packet-based terminal performs talker selection and mixing operations, this packet-based terminal acting as a conference bridge for the other packet-based terminals. In this case, the packet-based terminal performing the talker selection and mixing operations outputs compressed voice signals respective of the selected talker(s) to the other packet-based terminals similar to the operation of a conference bridge. Although not gaining all of the advantages of the present invention described above, there are still advantages to this alternative embodiment since the decrease in transcoding and possible decrease in latency still apply to the particular packet-based terminal that performs the talker selection.

There are a number of features that can be added to embodiments of the present invention that have not previously been discussed in detail. For one, a modified control plane could be used such that a number of operations could be controlled with the transmission of control packets between participants and possibly a moderator. One such operation could have a moderator established as a permanent talker throughout the voice conference, possibly as a permanent secondary talker or possibly as a third selected talker.

Another operation that could be controlled through use of a modified control plane is the manual selection of primary and/or secondary talkers. This may be useful in cases where a particular participant is scheduled to speak. Yet another possible operation that could be maintained with use of a modified control plane is a sidebar operation. In a sidebar operation, at least two of the participants within a voice conference can form a subset of participants smaller than the set that defines the entire voice conference. With this setup, one participant within the subset can choose to communicate with the entire voice conference or with only the members of the subset.

Another feature that could be added to the present invention described herein above is the sending of video streams via video data packets within the packet-based network. In this embodiment, the video data packets would replace or supplement the voice data packets within the above described implementations. An embodiment with this feature would operate the same as described herein above, with these video signals preferably corresponding to the primary talker. Alternatively, a manual control within the control plane could be added so that each participant or a moderator could select which video stream to view. Further, a picture-in-picture feature could be used such that two or more video streams could be shown at once. In the case of there being primary and secondary talkers, the picture-in-picture operation could be equivalent to the mixing of the corresponding voice signals.

In general, although the operation of the present invention was described herein above with use of the terms voice data packets and voice signals, these packets and signals can be referred to broadly as media data packets and media signals respectively. In this case, media data packets are any data packets that are transmitted via the media plane, these media data packets preferably being either audio or audio/video data packets. It is noted that use of the term voice data packets above is specific to the preferred embodiments in which the audio signals are voice. Further, it should be understood that video data packets may incorporate audio data packets.

Although the present invention herein above described has a single voice conference being established with the use of a network of packet-based apparatus, it should be understood that in some embodiments it could be possible that one or more of the packet-based apparatus could be capable of handling a plurality of voice conferences simultaneously.

Persons skilled in the art will appreciate that there are yet more alternative implementations and modifications possible for implementing the present invention, and that the above implementation is only an illustration of this embodiment of the invention. The scope of the invention, therefore, is only to be limited by the claims appended hereto.

Claims

WE CLAIM:

1. A packet-based apparatus comprising: a receiver capable of being coupled to a network, said receiver to receive at least one media data packet from at least two sources forming a media conference, each media data packet defining a media signal; an energy detection and talker selection unit to process the media signals including selecting a set of the sources within the media conference as talkers; and an output unit to output the media signals that correspond to the talkers.

2. A packet-based apparatus according to claim 1 further comprising a speaker coupled to the output unit to receive the media signals that correspond to the talkers and broadcast audio signals corresponding to the received uncompressed media signals.

3. A packet-based network interface comprising a packet-based apparatus according to claim 1, wherein the media signals that correspond to the talkers are arranged to be output, via a non-packet-based network, to a non-packet-based telephone terminal.

4. A packet-based apparatus according to claim 1, wherein the media data packets are audio data packets and the media signals defined by the media data packets are audio signals.

5. A packet-based apparatus according to claim 4, wherein to process the media signals, the energy detection and talker selection unit operates to determine at least one speech parameter associated with each of the media signals and select a set of the sources within the media conference as talkers based upon the determined speech parameters.

6. A packet-based apparatus according to claim 5, wherein the speech parameter corresponding to each of the media signals is an energy level corresponding to each of the media signals.

7. A packet-based apparatus according to claim 5, wherein to select a set of the sources within the media conference as talkers, the energy detection and talker selection unit operates, for each of the media signals, to: determine whether the media signal contains speech based on the corresponding speech parameter; if determined that the media signal contains speech, determine whether the media signal corresponds to a previously selected talker; and if determined that the media signal does not correspond to a previously selected talker, determine whether a maximum number of talkers parameter is met, discard the media signal in the case that the maximum number of talkers parameter is met and select the source corresponding to the media signal as a talker within the media conference in the case that the maximum number of talkers parameter is not met.

8. A packet-based apparatus according to claim 1, wherein the media data packets are audio/video data packets and the media signals defined by the media data packets are audio/video signals.

9. A packet-based apparatus according to claim 1, wherein the media signals defined by the media data packets are compressed media signals; and wherein the output unit further operates to process the compressed media signals including decompressing at least the compressed media signals received from the talkers to generate uncompressed media signals and, to output media signals that correspond to the talkers, the output unit outputs uncompressed media signals that correspond to the talkers.

10. A packet-based apparatus according to claim 9, wherein the media data packets are audio data packets and the compressed media signals defined by the media data packets are compressed audio signals; and wherein to process the media signals, the energy detection and talker selection unit operates to determine at least one speech parameter associated with each of the compressed media signals and select a set of the sources within the media conference as talkers based upon the determined speech parameters.

11. A packet-based apparatus according to claim 10, wherein the speech parameter corresponding to each of the compressed media signals is a number of bytes within each of the compressed media signals.

12. A packet-based apparatus according to claim 10, wherein the speech parameter corresponding to each of the compressed media signals is a pitch value within the corresponding media data packets.

13. A packet-based apparatus according to claim 10, wherein the speech parameter corresponding to each of the compressed media signals is an energy level corresponding to each of the compressed media signals.

14. A packet-based apparatus according to claim 9, wherein the set of the sources within the media conference selected as talkers comprises one of first and second sources selected within the media conference as primary and secondary talkers respectively, one of the sources selected within the media conference as a lone talker, and none of the sources selected within the media conference as a talker.

15. A packet-based apparatus according to claim 14, wherein to process the compressed media signals and output uncompressed media signals that correspond to the talkers, the output unit operates, for each of the compressed media signals, to: determine whether the compressed media signal corresponds to the lone talker within the media conference; and if determined that the compressed media signal corresponds to the lone talker, decompress the compressed media signal to generate an uncompressed media signal and output the uncompressed media signal.

16. A packet-based apparatus according to claim 14, wherein to process the compressed media signals, the output unit operates, for each of the compressed media signals, to: determine whether the compressed media signal corresponds to the secondary talker within the media conference; and if determined that the compressed media signal corresponds to the secondary talker, determine whether the compressed media signal has been generated for previously; save the compressed media signal if not previously generated for; and discard the compressed media signal if previously generated for.

17. A packet-based apparatus according to claim 14, wherein to process the compressed media signals, the output unit operates, for each of the compressed media signals, to: determine whether the compressed media signal corresponds to the secondary talker within the media conference; and if determined that the compressed media signal corresponds to the secondary talker, determine whether the compressed media signal has been generated for previously; if not previously generated for, decompress the compressed media signal, resulting in a secondary media signal, and save the secondary media signal; and discard the compressed media signal if previously generated for.

18. A packet-based apparatus according to claim 14, wherein to process the compressed media signals and output uncompressed media signals that correspond to the talkers, the output unit operates, for each of the compressed media signals, to: determine whether the compressed media signal corresponds to the primary talker within the media conference; and if determined that the compressed media signal corresponds to the primary talker, decompress the compressed media signal, resulting in a primary media signal; determine whether a corresponding secondary media signal is saved; if a corresponding secondary media signal is not saved, generate a secondary media signal; mix the primary and secondary media signals into a single combined uncompressed media signal; and output the combined uncompressed media signal.

19. A packet-based apparatus according to claim 18, wherein the output unit further operates to decompress the secondary media signal prior to mixing it with the primary media signal if the secondary media signal is saved only in compressed form.

20. A packet-based apparatus according to claim 18, wherein the output unit further operates to buffer each of the primary and secondary media signals for jitter prior to the mixing of the signals.
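The per-signal handling of claims 15, 16 and 18 can be sketched as follows. The sketch makes several assumptions that are not stated in the claims: packets carry a sequence number, the "generated" secondary media signal is silence, and `decompress` is a hypothetical stand-in codec that treats each byte as one signed sample.

```python
from typing import Dict, List, Optional, Set


def decompress(payload: bytes) -> List[int]:
    """Hypothetical stand-in codec: one signed 8-bit sample per byte."""
    return [b - 256 if b > 127 else b for b in payload]


def mix(a: List[int], b: List[int]) -> List[int]:
    """Sum two signals sample by sample, clipping to the 16-bit range."""
    return [max(-32768, min(32767, x + y)) for x, y in zip(a, b)]


class OutputUnit:
    def __init__(self) -> None:
        self.saved_secondary: Dict[int, bytes] = {}  # seq -> compressed signal
        self.generated_for: Set[int] = set()         # seqs already substituted

    def process(self, role: str, seq: int, payload: bytes) -> Optional[List[int]]:
        if role == "lone":
            # Claim 15: decompress and output directly.
            return decompress(payload)
        if role == "secondary":
            # Claim 16: save unless a substitute was already generated.
            if seq not in self.generated_for:
                self.saved_secondary[seq] = payload
            return None
        if role == "primary":
            # Claim 18: decompress, find or generate the secondary, then mix.
            primary = decompress(payload)
            saved = self.saved_secondary.pop(seq, None)
            if saved is None:
                secondary = [0] * len(primary)  # assumed: silence
                self.generated_for.add(seq)
            else:
                secondary = decompress(saved)   # claim 19: saved in compressed form
            return mix(primary, secondary)
        return None  # signals from non-talkers are not output
```

In this sketch only signals from selected talkers are ever decompressed, and a secondary packet arriving after its substitute has been generated is discarded.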

21. A packet-based apparatus according to claim 14, wherein to process the compressed media signals and output uncompressed media signals that correspond to the talkers, the output unit operates, for each of the compressed media signals, to: determine whether the compressed media signal corresponds to the primary talker within the media conference; and if determined that the compressed media signal corresponds to the primary talker, decompress the compressed media signal, resulting in a primary media signal; determine whether a corresponding secondary media signal is saved; if a corresponding secondary media signal is not saved, monitor for receipt of a media data packet from the secondary talker for a predetermined time period; if the predetermined time period expires and no media data packet corresponding to the secondary talker has been received, generate a secondary media signal; mix the primary and secondary media signals into a single combined uncompressed media signal; and output the combined uncompressed media signal.

22. A packet-based apparatus according to claim 21, wherein the output unit further operates to buffer each of the primary and secondary media signals for jitter prior to the mixing of the signals.
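The monitoring step of claim 21 can be illustrated with a short sketch built on the Python standard library `queue` module. It assumes that packets from the secondary talker arrive on a queue and that the generated secondary media signal is a frame of zero bytes. The function name `await_secondary` and its parameters are hypothetical.

```python
import queue


def await_secondary(packets: "queue.Queue[bytes]", timeout_s: float, frame_len: int) -> bytes:
    """Wait up to timeout_s for a secondary-talker packet.

    Returns the received payload, or a generated frame of zeros (an
    assumed substitute) when the predetermined time period expires.
    """
    try:
        return packets.get(timeout=timeout_s)
    except queue.Empty:
        return bytes(frame_len)
```

A longer time period tolerates more network delay from the secondary talker at the cost of added latency on the mixed output.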

23. A packet-based apparatus according to claim 9, wherein to process the compressed media signals, the output unit further includes buffering at least the compressed media signals received from the talkers for jitter.

24. A packet-based apparatus according to claim 1, wherein the media signals defined by the media data packets are compressed media signals; wherein the receiver further operates to decompress each of the compressed media signals to generate uncompressed media signals; and wherein, to output media signals that correspond to the talkers, the output unit outputs uncompressed media signals that correspond to the talkers.

25. A packet-based apparatus according to claim 24, wherein the set of the sources within the media conference selected as talkers comprises one of first and second sources selected within the media conference as primary and secondary talkers respectively, one of the sources selected within the media conference as a lone talker, and none of the sources selected within the media conference as a talker.

26. A packet-based apparatus according to claim 25, wherein to output the uncompressed media signals that correspond to the talkers, the output unit operates, for each of the uncompressed media signals, to: determine whether the uncompressed media signal corresponds to the lone talker within the media conference; and if determined that the uncompressed media signal corresponds to the lone talker, output the uncompressed media signal.

27. A packet-based apparatus according to claim 25, wherein the output unit further operates, for each of the uncompressed media signals, to: determine whether the uncompressed media signal corresponds to the secondary talker within the media conference; and if determined that the uncompressed media signal corresponds to the secondary talker, determine whether the uncompressed media signal has been generated for previously; save the uncompressed media signal if not previously generated for; and discard the uncompressed media signal if previously generated for.

28. A packet-based apparatus according to claim 25, wherein to output the uncompressed media signals that correspond to the talkers, the output unit operates, for each of the uncompressed media signals, to: determine whether the uncompressed media signal corresponds to the primary talker within the media conference; and if determined that the uncompressed media signal corresponds to the primary talker, determine whether a corresponding uncompressed media signal is saved for the secondary talker; if a corresponding uncompressed media signal is not saved for the secondary talker, generate an uncompressed media signal for the secondary talker; mix the uncompressed media signals for the primary and secondary talkers into a single combined uncompressed media signal; and output the combined uncompressed media signal.

29. A packet-based apparatus according to claim 25, wherein to output the uncompressed media signals that correspond to the talkers, the output unit operates, for each of the uncompressed media signals, to: determine whether the uncompressed media signal corresponds to the primary talker within the media conference; and if determined that the uncompressed media signal corresponds to the primary talker, determine whether a corresponding uncompressed media signal is saved for the secondary talker; if a corresponding uncompressed media signal is not saved for the secondary talker, monitor for receipt of a media data packet from the secondary talker for a predetermined time period; if the predetermined time period expires and no media data packet corresponding to the secondary talker has been received, generate an uncompressed media signal for the secondary talker; mix the uncompressed media signals for the primary and secondary talkers into a single combined uncompressed media signal; and output the combined uncompressed media signal.

30. A packet-based apparatus according to claim 1, wherein the receiver further operates to buffer each of the media signals for jitter.
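Buffering for jitter, as recited in claim 30, can be sketched as a small reordering buffer. The sketch assumes that each media data packet carries a sequence number and that a fixed number of packets is held back before release. The class name `JitterBuffer` and the `depth` parameter are hypothetical.

```python
import heapq
from typing import List, Tuple


class JitterBuffer:
    """Hold back up to `depth` packets and release the rest in sequence order."""

    def __init__(self, depth: int = 3) -> None:
        self.depth = depth
        self._heap: List[Tuple[int, bytes]] = []

    def push(self, seq: int, payload: bytes) -> None:
        # Packets may arrive out of order; the heap keeps the lowest seq on top.
        heapq.heappush(self._heap, (seq, payload))

    def pop_ready(self) -> List[Tuple[int, bytes]]:
        # Release packets only while more than `depth` are buffered.
        ready = []
        while len(self._heap) > self.depth:
            ready.append(heapq.heappop(self._heap))
        return ready
```

A larger `depth` absorbs more variation in packet arrival time but delays playout by the same number of packets.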

31. A packet-based apparatus according to claim 1, wherein the energy detection and talker selection unit further operates to receive at least one media signal from a source of media signals corresponding to the packet-based apparatus; and wherein the packet-based apparatus comprises one of the sources forming the media conference.

32. A packet-based apparatus according to claim 31 further comprising a microphone, coupled to at least one of the receiver and the energy detection and talker selection unit, to receive audio signals and output media signals that correspond to the received audio signals, said microphone comprising the source for media signals corresponding to the packet-based apparatus.

33. A packet-based network interface comprising a packet-based apparatus according to claim 31, wherein the source for media signals corresponding to the packet-based apparatus is a non-packet-based telephone terminal arranged to be coupled, via a non-packet-based network, to the packet-based network interface.

34. A packet-based apparatus according to claim 1, wherein the output unit further operates to encapsulate the media signals that correspond to the talkers and to output the encapsulated media signals that correspond to the talkers to at least one of the sources within the media conference.

35. A method for outputting media signals within a media conference, the method comprising: receiving at least one media data packet from at least two sources forming a media conference, each media data packet defining a media signal; processing the received media data packets including selecting a set of the sources within the media conference as talkers; and outputting media signals that correspond to the talkers.

36. A packet-based apparatus comprising: means for receiving at least one media data packet from at least two sources forming a media conference, each media data packet defining a media signal; means for processing the received media data packets including selecting a set of the sources within the media conference as talkers; and means for outputting media signals that correspond to the talkers.

37. A packet-based network comprising a plurality of packet-based apparatus; wherein at least two of the plurality of packet-based apparatus operates to output media data packets comprising media signals, these packet-based apparatus together forming a media conference; and wherein at least one of the packet-based apparatus within the media conference operates to receive the media data packets from the packet-based apparatus within the media conference; to process the media signals corresponding to the received media data packets including selecting a set of the packet-based apparatus within the media conference as talkers; and to output media signals that correspond to the talkers.

38. A packet-based apparatus comprising: a receiver capable of being coupled to a network, said receiver to receive at least one media data packet from at least one source forming a media conference with the packet-based apparatus, each media data packet defining a media signal; an energy detection and talker selection unit to receive at least one media signal from a source for media signals corresponding to the packet-based apparatus; to process the media signals from the receiver and from the source for media signals corresponding to the packet-based apparatus including selecting a set of the sources within the media conference as talkers; and an output unit to output media signals that correspond to the talkers.

39. A packet-based apparatus according to claim 38 further comprising a microphone, coupled to at least one of the receiver and the energy detection and talker selection unit, to receive audio signals and output media signals corresponding to the received audio signals, said microphone comprising the source for media signals corresponding to the packet-based apparatus.

40. A packet-based network interface comprising a packet-based apparatus according to claim 38, wherein the source for media signals corresponding to the packet-based apparatus is a non-packet-based telephone terminal arranged to be coupled, via a non-packet-based network, to the packet-based network interface.

41. A packet-based apparatus according to claim 38, wherein the media signals defined by the media data packets are compressed media signals and the media signals received by the energy detection and talker selection unit are compressed media signals; and wherein the output unit further operates to process the compressed media signals including decompressing at least the compressed media signals received from the talkers to generate uncompressed media signals and, to output media signals that correspond to the talkers, the output unit outputs uncompressed media signals that correspond to the talkers.

42. A packet-based apparatus according to claim 38, wherein the media signals defined by the media data packets are compressed media signals and the media signals received by the energy detection and talker selection unit are uncompressed media signals; wherein the receiver further operates to decompress each of the compressed media signals to generate uncompressed media signals; and wherein, to output media signals that correspond to the talkers, the output unit outputs uncompressed media signals that correspond to the talkers.

43. A packet-based apparatus according to claim 38, wherein the output unit further operates to encapsulate the media signals that correspond to the talkers and to output the encapsulated media signals that correspond to the talkers to at least one of the sources within the media conference.

44. A method for outputting uncompressed media signals within a media conference, the method comprising: receiving at least one media data packet from at least one source forming a media conference with the packet-based apparatus, each media data packet defining a media signal; receiving at least one media signal from a source for media signals corresponding to the packet-based apparatus; processing the received media signals including selecting a set of the sources within the media conference as talkers; and outputting media signals that correspond to the talkers.

45. A packet-based apparatus comprising: means for receiving at least one media data packet from at least one source forming a media conference with the packet-based apparatus, each media data packet defining a media signal; means for receiving at least one media signal from a source for media signals corresponding to the packet-based apparatus; means for processing the received media signals including selecting a set of the sources within the media conference as talkers; and means for outputting media signals that correspond to the talkers.