EP4037339A1 - Selection of audio channels based on prioritization - Google Patents
Selection of audio channels based on prioritization
- Publication number
- EP4037339A1 (application EP21154652.8A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- audio
- audio channels
- channels
- channel
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- H—ELECTRICITY
  - H04—ELECTRIC COMMUNICATION TECHNIQUE
    - H04S—STEREOPHONIC SYSTEMS
      - H04S3/00—Systems employing more than two channels, e.g. quadraphonic
        - H04S3/008—Systems employing more than two channels, e.g. quadraphonic in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
      - H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
        - H04S2400/03—Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
        - H04S2400/11—Positioning of individual sound objects, e.g. moving airplane, within a sound field
Definitions
- Embodiments of the present disclosure relate to audio. Some enable the distribution of common content for rendering to both advanced audio output devices and less advanced audio output devices.
- Advanced audio output devices are capable of rendering multiple received audio channels as different spatially positioned audio sources.
- the spatial separation of audio sources can aid hearing when the sources simultaneously provide sound.
- Content that is suitable for rendering spatial audio via an advanced audio output device may be unsuitable for a less advanced audio output device, and content that is suitable for rendering by a less advanced audio output device may under-utilize the spatial audio capabilities of an advanced audio output device.
- an apparatus comprising means for:
- the apparatus comprises means for: automatically controlling mixing of the N audio channels to produce at least the output audio channel, in dependence upon time-variation of content of one or more of the N audio channels.
- the N audio channels are N spatial audio channels where each of the N spatial audio channels can be rendered as a differently positioned audio source.
- N is at least two and M is one, the output audio channel being a monophonic audio output channel.
- the apparatus comprises means for analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels.
- prioritization depends upon one or more of:
- controlling mixing of the N audio channels to produce at least an output audio channel comprises:
- the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels in response to a communication bandwidth for receiving the audio channels or for providing output audio signals falling beneath a threshold value.
- the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels when there is a conflict between a first audio channel of the N audio channels and a second audio channel of the N audio channels, wherein the first audio channel is included within the M audio channels and the second audio channel is not included within the M audio channels, wherein over-talking is an example of conflict.
- the audio channels of the N audio channels that are not the selected M audio channels are available for later rendering.
- the apparatus comprises a user input interface for controlling prioritization of the N audio channels.
- the apparatus comprises a user input interface, wherein the user input interface provides a spatial representation of the N audio channels and indicates which of the N audio channels are comprised in the sub-set of M audio channels.
- a multi-party, live communication system that enables live audio communication between multiple remote participants using at least the N audio channels wherein different ones of the multiple remote participants provide audio input for different ones of the N audio channels, wherein the system comprises the apparatus.
- the set of N audio channels is referenced using reference number 20.
- Each audio channel of the set of N audio channels is referenced using reference number 20 i , where i is 1, 2, ..., N-1, N.
- the apparatus 10 comprises means for receiving at least N audio channels 20 where each of the N audio channels 20 i can be rendered as a different audio source.
- the apparatus 10 comprises means 40, 50 for controlling selection and mixing of the N audio channels 20 to produce at least an output audio channel 52.
- a selector 40 selects for mixing (to produce the output audio channel 52) a sub-set 30 of M audio channels from the N audio channels 20.
- the selection is dependent upon prioritization 32 of the N audio channels 20.
- the prioritization 32 is adaptive depending at least upon a changing content 34 of one or more of the N audio channels 20.
- the sub-set 30 of M audio channels is referenced using reference number 30.
- Each audio channel of the sub-set of M audio channels is referenced using reference number 20 j , where j is any M of the N values of i.
- the sub-set 30 can, for example, be varied by changing the value of M and/or by changing which audio channels 20 j are used to comprise the M audio channels of the sub-set 30.
- different sub-sets 30 can, in some examples, be differentiated using the same reference 30 with different numeric subscripts.
- a mixer 50 mixes the sub-set 30 of M audio channels to produce the output audio channel 52 which is suitable for rendering.
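By way of illustration only, the behavior of the selector 40 and mixer 50 can be sketched in a few lines of Python. This is a hypothetical sketch, not the disclosed implementation: it assumes each audio channel is a NumPy array of PCM samples, that a priority score per channel is already available, and it uses a simple equal-gain downmix.

```python
import numpy as np

def select_and_mix(channels, priorities, m):
    """Select the m highest-priority channels (the sub-set 30) from the
    given channels (the N audio channels 20) and downmix them into a
    single output channel (the output audio channel 52).

    channels: list of equal-length NumPy arrays of PCM samples.
    priorities: one prioritization score per channel (higher wins).
    m: size of the sub-set, with m < len(channels).
    """
    order = np.argsort(priorities)[::-1]       # rank channels by priority
    subset = [channels[i] for i in order[:m]]  # keep the top m
    # Equal-gain mix, scaled to avoid clipping the mono output.
    return np.sum(subset, axis=0) / max(len(subset), 1)
```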
- An advanced spatial audio output device (an example is illustrated at FIG 11A ) can render the N audio channels 20 as multiple different spatially positioned audio sources.
- a less advanced audio output device (an example is illustrated at FIG 11B ) can render the output audio channel 52.
- the apparatus 10 therefore allows a common content, the N audio channels 20, to provide audio output at both the advanced spatial audio output device and the less advanced audio output device.
- FIG. 1 illustrates an example of an apparatus 10 for providing an output audio channel 52 for rendering.
- the rendering of the output audio channel 52 can occur at the apparatus 10 or can occur at some other device.
- the apparatus 10 receives at least N audio channels 20.
- An audio channel 20 i of the N audio channels 20 can be rendered as a distinct audio source.
- the apparatus 10 comprises a mixer 50 for mixing a sub-set 30 of M audio channels to produce at least an output audio channel 52.
- a selector 40 selects for mixing, at mixer 50, the sub-set 30 of M audio channels from the N audio channels 20.
- the selection, by the selector 40 is dependent upon prioritization 32 of the N audio channels 20.
- the prioritization 32 is adaptive depending at least upon a changing content 34 of one or more of the N audio channels 20.
- the apparatus 10 provides, from the mixer 50, the output audio channel 52 for rendering.
- the sub-set 30 of M audio channels has fewer audio channels than the N audio channels 20, that is, M is less than N.
- N is at least two and in at least some examples is greater than 2.
- M is one and the output audio channel 52 is a monophonic audio output channel.
- the prioritization 32 is adaptive. The prioritization 32 depends at least on a changing content 34 of one or more of the N audio channels 20.
- the apparatus 10 is configured to automatically control the mixing of the N audio channels 20 to produce at least the output audio channel 52, in dependence upon time-variation of content 34 of one or more of the N audio channels 20.
- FIG. 2 illustrates an example of an apparatus 10 in which an analyzer 60 is configured to analyze the N audio channels 20 to adapt the prioritization 32 of the N audio channels 20 in dependence upon, at least, changing content 34 of one or more of the N audio channels 20.
- the analysis can be performed before (or simultaneously with) the aforementioned selection.
- the analyzer 60 is configured to process metadata associated with the N audio channels 20. Additionally or alternatively, in some examples, the analyzer 60 is configured to process the audio content of the audio channels 20. This processing could, for example, comprise voice activity detection, voice recognition processing, spectral analysis, semantic processing of speech or other processing including machine learning and artificial intelligence processing used to identify characteristics of the content 34 of one or more of the N audio channels 20.
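As an illustration of the simplest such analysis, an energy-based activity detector is sketched below. It is a toy under stated assumptions (frame-wise energy only, fixed threshold), not the analyzer 60 itself; real voice activity detection would be more elaborate.

```python
import numpy as np

def frame_energy_db(frame):
    """Mean power of one audio frame, in dB relative to full scale."""
    power = np.mean(frame.astype(np.float64) ** 2)
    return 10.0 * np.log10(power + 1e-12)  # small floor avoids log(0)

def is_active(frame, threshold_db=-50.0):
    """Treat a frame as active content 34 when its energy exceeds a
    fixed threshold. The -50 dB figure is an illustrative choice."""
    return frame_energy_db(frame) > threshold_db
```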
- the prioritization 32 can depend upon one or more parameters of the content 34.
- the prioritization 32 depends upon timing of content 34 i of an audio channel 20 i relative to timing of content 34 j of an audio channel 20 j .
- the audio channel 20 that first satisfies a trigger condition has temporal priority.
- the trigger condition may be that the audio channel 20 has activity above a threshold, and/or has activity above a threshold in a particular spectral range and/or has voice activity and/or has voice activity associated with a specific person and/or the voice activity comprises semantic content including a particular keyword or phrase.
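A minimal sketch of "first to satisfy the trigger condition has temporal priority" follows; the frame layout and the pluggable trigger callable are illustrative assumptions.

```python
def temporal_priority(frames_over_time, trigger):
    """Scan forward in time and return the index of the channel whose
    frame first satisfies the trigger condition, or None if no channel
    triggers. frames_over_time yields, per time instant, a list with
    one frame per channel."""
    for frames in frames_over_time:
        for channel_index, frame in enumerate(frames):
            if trigger(frame):
                return channel_index  # this channel gains temporal priority
    return None

# Example usage with the energy detector sketched earlier:
# winner = temporal_priority(frames, lambda f: is_active(f))
```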
- An initial prioritization 32 can cause an initial selection of a first sub-set 30 1 of audio channels 20 that are mixed to form the output audio channel 52.
- a change in prioritization 32 can cause a new selection of a second different sub-set 30 2 of audio channels 20 that are mixed to form a new, different output audio channel 52.
- the first sub-set 30 1 and the second sub-set 30 2 are not equal sets.
- apparatus 10 can prioritize one or more of the N audio channels 20 as a sub-set 30 until a new selection by the selector 40 based on a new prioritization 32 changes the sub-set 30.
- For example, if a person is speaking in a first audio channel, that channel may be prioritized ahead of a second audio channel. However, if the person speaking in the first audio channel stops speaking then the prioritization 32 of the audio channels can change and there can be a consequential reselection at the selector 40 of the sub-set 30 of M audio channels provided for mixing to produce the output audio channel 52.
- the apparatus 10 can flag at least one input audio channel 20 corresponding to a first active talker, or generally active content 34, during a selection period and prioritize this selection over other audio channels 20.
- the apparatus 10 can determine whether the active talker continues before introducing content 34 from non-prioritized channels to the mixed output audio channel 52. The introduction of such additional content 34 from non-prioritized channels is controlled by the selector 40 during a following selection period.
- non-prioritized audio channels 20 can be completely omitted from the mixed output audio channel 52 and thus the mixed output audio channel 52 will contain only the prioritized channel(s).
- the non-prioritized channels can be mixed with a lower gain or higher attenuation than the prioritized channel and/or with other suitable processing to produce the output audio channel 52.
- a history of content 34 of at least one of the N audio channels 20 can be used to control the prioritization 32.
- the selector 40, in selecting which of the N audio channels 20 to mix to produce the output audio channel 52, can, for example, use decision thresholds for selection.
- a decision threshold can be changed over time and can be dependent upon a history of the content 34.
- different decision thresholds can be used for different audio channels 20.
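One way to make a decision threshold depend on history is hysteresis: the currently selected channel keeps priority against an easier threshold than a challenger must meet. The sketch below is one hypothetical realization; the dB values are arbitrary.

```python
class HysteresisSelector:
    """Per-channel decision thresholds that depend on selection history."""

    def __init__(self, enter_db=-40.0, stay_db=-55.0):
        self.enter_db = enter_db  # threshold a new channel must exceed
        self.stay_db = stay_db    # easier threshold for the incumbent
        self.selected = None      # index of the currently selected channel

    def update(self, energies_db):
        """energies_db: current per-channel energies in dB. Returns the
        index of the selected channel, or None if nothing is active."""
        if self.selected is not None and energies_db[self.selected] > self.stay_db:
            return self.selected  # incumbent keeps priority
        candidates = [i for i, e in enumerate(energies_db) if e > self.enter_db]
        if candidates:
            self.selected = max(candidates, key=lambda i: energies_db[i])
        return self.selected
```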
- the prioritization 32 can be dependent upon mapping to a particular person an identified voice in content 34 of at least one of the N audio channels 20.
- the analyzer 60 can for example perform voice recognition based upon the content 34 of one or more of the N audio channels 20.
- the analyzer 60 can identify a particular person based upon metadata comprised within the content 34 of at least one of the N audio channels 20. It may therefore be possible to identify a particular one of the N audio channels 20 as relating to a person whose contribution it is particularly important to hear such as, for example, a chairman of a meeting.
- the analyzer 60 is configured to adapt the prioritization 32 when the presence of voice content is detected within the content 34 of at least one of the N audio channels 20.
- the analyzer 60 is able to prioritize the spoken word within the output audio channel 52. It is also possible to adapt the analyzer 60 to prioritize other types of content.
- the analyzer 60 is configured to adapt the prioritization 32 based upon detection that content 34 of at least one of the N audio channels 20 comprises an identified keyword.
- the analyzer 60 can, for example, listen to the content 34 and identify within the stream of content a keyword or identify semantic meaning within the stream of content. This can be used to modify the prioritization 32. For example, it may be desirable for a consumer of the output audio channel 52 to have that output audio channel 52 personalized so that if one of the N audio channels 20 comprises content 34 that includes the consumer's name or other information associated with the consumer then that audio channel 20 is prioritized by the analyzer 60.
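A sketch of such keyword-driven personalization is given below; it assumes transcripts are already produced by some external speech-to-text stage, and the additive boost value is an arbitrary illustrative choice.

```python
def keyword_boost(priorities, transcripts, keywords, boost=10.0):
    """Raise the priority of any channel whose recent transcript mentions
    a watched keyword, e.g. the consumer's name.

    priorities: one prioritization score per channel.
    transcripts: recent speech-to-text output, one string per channel.
    keywords: words or phrases associated with the consumer.
    """
    boosted = list(priorities)
    for i, text in enumerate(transcripts):
        if any(keyword.lower() in text.lower() for keyword in keywords):
            boosted[i] += boost  # channel becomes more likely to be selected
    return boosted
```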
- the N audio channels 20 can represent live content.
- the analysis by the analyzer 60, the selection by the selector 40 and the mixing by the mixer 50 can occur in real time such that the output audio channel 52 is also live.
- FIG. 3 illustrates an example of the apparatus of FIG. 1 in more detail.
- the mixing is a weighted mixing in which different sub-sets of the sub-set 30 of selected audio channels are weighted with different attenuation/gain before being finally mixed to produce the output audio channel 52.
- the selector 40 selects a first sub-set SS1 of the M audio channels to be mixed to provide background audio B and selects a second sub-set SS2 of the M audio channels 20 to be mixed to provide foreground audio F that is for rendering at greater loudness than the background audio B.
- the selection of the first sub-set SS1 and the selection of the second sub-set SS2 is dependent upon the prioritization 32 of the N audio channels 20.
- the first sub-set SS1 of audio channels 20 is mixed 50 1 to provide background audio B which is then amplified/attenuated G1 to adjust the loudness of the background audio before it is provided to the mixer 50 3 for mixing to produce the output audio channel 52.
- the second sub-set SS2 of audio channels 20 is mixed 50 2 to provide foreground audio F which is then amplified/attenuated G2 to adjust the loudness of the foreground audio before it is provided to the mixer 50 3 for mixing to produce the output audio channel 52.
- the gain/attenuation G2 applied to the foreground audio F makes it significantly louder than the background audio B in the output audio channel 52. In some situations, the foreground audio F is naturally louder than background audio B. Thus, it can be but need not be that G2 > G1.
- the gain/attenuation G1, G2 can, in some examples, vary with frequency.
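The foreground/background weighting of FIG. 3 can be sketched as below. This is a simplification under stated assumptions: frequency-independent scalar gains G1 and G2, NumPy arrays for the already-selected sub-sets SS1 and SS2, and peak normalization only when the mix would clip.

```python
import numpy as np

def weighted_mix(foreground, background, g2=1.0, g1=0.25):
    """Mix a foreground sub-set SS2 (gain G2) and a background sub-set
    SS1 (gain G1) into one output channel, as in FIG. 3. The gain
    values are illustrative, with G2 > G1 making the foreground louder."""
    f = np.sum(foreground, axis=0) if foreground else 0.0
    b = np.sum(background, axis=0) if background else 0.0
    out = g2 * f + g1 * b
    peak = np.max(np.abs(out)) if np.ndim(out) else 0.0
    return out / peak if peak > 1.0 else out  # normalize only if clipping
```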
- FIG. 4 illustrates an example of a multi-party, live communication system 200 that enables live audio communication between multiple remote participants A i , B, C, D i using at least the N audio channels 20. Different ones of the multiple remote participants A i , B, C, D i provide audio input for different ones of the N audio channels 20.
- the system 200 comprises input end-points 206 for capturing audio channels 20.
- the system 200 comprises output end-points 204 for rendering audio channels.
- One or more output end-points 204 s are configured for rendering spatial audio as distinct rendered audio sources.
- One or more output end-points 204 m are not configured for rendering spatial audio.
- the N audio channels 20 are N spatial audio channels where each of the N spatial audio channels is captured as a differently positioned captured audio source, and can be rendered using spatial audio as a differently positioned rendered audio source.
- the captured audio source at an input end-point 206 can be fixed or can move.
- the rendered audio source can either be fixed or can move, for example, in a manner corresponding to the moving input end-point 206.
- the system 200 is for enabling immersive teleconferencing or telepresence for remote terminals.
- the different terminals have varying device capabilities and different (and possibly variable) network conditions.
- Spatial/immersive audio refers to audio that typically has a three-dimensional space representation or is presented (rendered) to a participant with the intention of the participant being able to hear a specific audio source from a specific direction.
- Some of the participants share a room. For example, participants A 1 , A 2 , A 3 , A 4 share the room A and the participants D 1 , D 2 , D 3 , D 4 , D 5 share the room D.
- Some of the terminals can be characterized as "advanced spatial audio output devices" that have an output end-point 204 s that is configured for spatial audio. However, some of the terminals are less advanced audio output devices that have an output end-point 204 m that is not configured for spatial audio.
- the voices of the participants A i , B, C, D i are spatially separated.
- the voices may, for example, have fixed spatial positions relative to each other or the directions may be adaptive, for example, according to participant movements, conference bridge settings or based upon inputs by participants.
- a similar experience is available to the participants who are using the output end-points 204 s and they have the ability to interact much more naturally than traditional voice calls and voice conferencing. For example, they can talk at the same time and still understand each other thanks to effects such as the well-known cocktail party effect.
- each of the respective participants A i , D i has a personal input end-point 206 which captures a personal captured audio source as a personal audio channel 20.
- the personal input end-point 206 can, for example, be provided by a directional microphone or by a Lavalier microphone.
- the participants B and C each have a single personal input end-point 206 which captures a personal audio channel 20.
- the output end-points 204 s are configured for spatial audio.
- each room can have a surround sound system as an output end-point 204 s .
- An output end-point 204 s is configured to render each captured sound source represented by an audio channel 20 as a rendered sound source.
- each participant A i , B, C has a personal output audio channel 20.
- Each personal output audio channel 20 is rendered from a different location as a different rendered audio source.
- the collection of rendered audio sources associated with the participants A i creates a virtual room A.
- each participant D i , B, C has a personal output audio channel 20.
- Each personal output audio channel 20 is rendered from a different location as a different rendered sound source.
- the collection of the rendered audio sources associated with the participants D i creates a virtual room D.
- the output end-point 204 s is configured for spatial audio.
- An output end-point 204 s is configured to render each captured sound source represented by an audio channel 20 as a rendered sound source.
- the participant C has an output end-point 204 s that is configured for spatial audio.
- the participant C is using a headset configured for binaural spatial audio that is suitable for virtual reality (VR).
- Binauralization methods can be used to render personal audio channels 20 as spatially positioned rendered audio sources.
- Each participant A i , D i , B has a personal output audio channel 20.
- Each personal output audio channel 20 is or can be rendered from a different location as a different rendered sound source.
- the participant B has an output end-point 204 m that is not configured for spatial audio. In this example it is a monophonic output end-point.
- the participant B is using a mobile device (e.g. a mobile phone) to provide the input end-point 206 and the output end-point 204 m .
- the mobile device has a single output end-point 204 m which provides the output audio channel 52 as previously described.
- the processing to produce the output audio channel 52 can be performed at the mobile device of the participant B or at the server 202.
- the mono-capability limitation of participant B can, for example, be caused by the device (for example, it is only configured for decoding of mono audio) or by the available audio output facilities, such as a mono-only earpiece or headset.
- Each of the input end-points 206 is rendered in spatial audio as a spatially distinct rendered audio source. However, in other examples multiple ones of the input end-points 206 may be mixed together to produce a single rendered audio source. This can be used to reduce the number of rendered audio sources using spatial audio. Therefore, in some examples, a spatial audio device may render multiple ones of output audio channels 52.
- In FIG. 4, a star topology similar to that illustrated in FIG. 5A is used.
- the central server 202 interconnects the input end-points 206 and the output end-points 204.
- the input end-points 206 provide the N audio channels 20 to a central server 202 which produces the output audio channel 52 as previously described to the output end-point 204 m .
- the apparatus 10 is located in the central server 202, however, in other examples the apparatus 10 is located at the output end-point 204 m .
- FIG. 5B illustrates an alternative topology in which there is no centralized architecture but a peer-to-peer architecture.
- the apparatus 10 is located at the output end-point 204 m .
- the 3GPP IVAS codec is an example of a voice and audio communications codec for spatial audio.
- the IVAS codec is an extension of the 3GPP EVS codec and is intended for new immersive voice and audio services over 4G and 5G.
- Such immersive services include, for example, immersive voice and audio for virtual reality (VR).
- the multi-purpose audio codec is expected to handle encoding, decoding and rendering of speech, music and generic audio. It is expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
- the audio channels 20 can, for example, be coded/decoded using the 3GPP IVAS codec.
- the spatial audio channels 20 can, for example, be provided as metadata-assisted spatial audio (MASA), object-based audio, channel-based audio (5.1, 7.1+4), non-parametric scene-based audio (e.g. First Order Ambisonics, High Order Ambisonics) and any combination of these formats. These audio formats can be binauralized for headset listening such that a participant can hear the audio sources outside their head.
- the apparatus 10 provides a better experience, including improved intelligibility for a mono user participating in a spatial audio teleconference with several potentially overlapping spatial audio inputs.
- the apparatus 10 means that it is not necessary, in some cases, to simplify the spatial audio conference experience for the spatial audio users due to having a mono-audio participant.
- a mono user can participate in a spatial audio conference without compromising the experience of the other users.
- FIGS 6, 7, 8 and 9A illustrate examples of an apparatus 10 that comprises a controller 70.
- the controller 70 receives N audio channels 20 and performs control processing to select the sub-set 30 of M audio channels.
- the controller 70 comprises the selector 40 and, optionally, the analyzer 60.
- the mixer 50 is present but not illustrated.
- the controller 70 is configured to control mixing of the N audio channels 20 to produce the sub-set 30 of M audio channels when a conflict between a first audio channel of the N audio channels 20 and a second audio channel of the N audio channels occurs.
- the control can cause the first audio channel 20 to be included within the sub-set 30 of M audio channels and cause the second audio channel 20 not to be included within the sub-set 30 of M audio channels.
- in other examples, the second audio channel is included within the sub-set 30 of M audio channels.
- One example of when there is conflict between audio channels is when there is simultaneous activity from different prioritized sound sources, for example over-talking (simultaneous speech).
- the prioritization 32 used for the selection of audio channels to form the sub-set 30 of M audio channels depends upon timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20.
- the later speech by participants 4 and 5 is not selected for inclusion within the sub-set 30 of audio channels used to form the output audio channel 52.
- the audio channel 20 3 preferentially remains prioritized and remains included within the output audio channel 52, while there is voice activity in the audio channel 20 3 , whereas the audio channels 20 4 , 20 5 are excluded. If voice activity is no longer detected in the audio channel 20 3 then in some examples a selection process may immediately change the identity of the audio channel 20 selected for inclusion within the output audio channel 52. However, in other examples there can be a selection grace period. During this grace period, there can be a greater likelihood of selection/reselection of the original selected audio channel 20 3 . Thus, during the grace period prioritization 32 is biased in favor of the previously selected audio channel.
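The grace-period bias described above could be realized as in the following sketch; the frame count and bias magnitude are invented placeholders, not disclosed values.

```python
class GracePeriodSelector:
    """Bias reselection toward the previously selected channel for a
    limited grace period after its activity ends."""

    def __init__(self, grace_frames=50, bias=5.0):
        self.grace_frames = grace_frames  # length of the grace period
        self.bias = bias                  # priority bonus for the incumbent
        self.selected = None
        self.frames_inactive = 0

    def update(self, priorities, active):
        """priorities: score per channel; active: per-channel booleans
        (e.g. voice activity). Returns the selected channel index."""
        if self.selected is not None and active[self.selected]:
            self.frames_inactive = 0
            return self.selected          # keep while activity continues
        self.frames_inactive += 1
        scores = list(priorities)
        if self.selected is not None and self.frames_inactive <= self.grace_frames:
            scores[self.selected] += self.bias  # grace-period bias
        self.selected = max(range(len(scores)), key=lambda i: scores[i])
        return self.selected
```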
- prioritization 32 used for the selection depends upon a history of content 34 of at least one of the N audio channels 20.
- the prioritization 32 used for the selection can depend upon mapping to a particular person (an identifiable human), an identified voice in content 34 of at least one of the N audio channels 20.
- a voice can be identified using metadata or by analysis of the content 34. The prioritization 32 would more favorably select the particular person's audio channel 20 for inclusion within the output audio channel 52.
- the particular person could, for example, be based upon service policy.
- a teleconference service may have a moderator or chairman role and this participant may for example be made audible to all participants or may be able to force themselves to be audible to all participants.
- the particular person could for example be indicated by a user consuming the output audio channel 52. That consumer could for example indicate which of the other participants' content 34 or audio channels 20 they wish to consume. This audio channel 20 could then be included, or be more likely to be included, within the output audio channel 52.
- the inclusion of the user-selected audio channel 20 can for example be dependent upon voice activity within the audio channel 20, that is, the user-selected audio channel 20 is only included if there is active voice activity within that audio channel 20.
- the prioritization 32 used for the selection therefore strongly favors the user-selected audio channel 20.
- the selection by the consumer of the output audio channel 52 of a particular audio channel 20 can for example be based upon an identity of the participant who is speaking or should speak in that audio channel. Alternatively, it could be based upon a user-selection of that audio channel because of the content 34 rendered within that audio channel.
- FIG. 7 illustrates an example similar to FIG. 6 .
- the audio channels 20 include a mixture of different audio types.
- the audio channel 20 3 associated with participant3 is predominantly a voice channel.
- the audio channels 20 4 , 20 5 associated with participants 4 and 5 are predominantly instrumental/music channels.
- the selection of which of the audio channels 20 is to be included within the output audio channel 52 can be based upon the audio type present within the audio channel 20.
- the detection of the audio type within the audio channel 20 can for example be achieved using metadata or, alternatively, by analyzing the content 34 of the audio channel 20.
- the prioritization 32 used for selection can be dependent upon detection that content 34 of at least one of the N audio channels 20 is voice content.
- the output audio channel 52 can switch between the inclusion of different audio channels 20 in dependence upon which of them includes active voice content. In this way priority can be given to spoken language.
- the other channels, for example the music channels 20 4 , 20 5 , may optionally be included, for example as background audio as previously described in relation to FIG. 3 .
- the apparatus 10 deliberately loses information by excluding (or diminishing) audio channels 20 with respect to the output audio channel 52.
- Information is generally lost by the selective downmixing which is required to maintain or guarantee intelligibility. It is, however, possible for there to be two simultaneously important audio channels 20, only one of which is selected for inclusion in the output audio channel 52.
- the apparatus illustrated in FIG. 8 addresses this issue.
- the apparatus 10 illustrated is similar to that illustrated in FIGS 6 and 7 . However, it additionally comprises a memory 82 for storage of a further sub-set 80 of the N audio channels 20 that is different to the sub-set 30 of M audio channels.
- the later rendering may be at a faster playback rate and that playback may be fixed or may be adaptive.
- the sub-set 80 of audio channels is mixed to form an alternative audio output channel for storage in the memory 82.
- At least some of the audio channels of the N audio channels that are not selected to be in the sub-set 30 of M audio channels are stored in memory 82 for later rendering.
- there is selection of a first sub-set 30 of M audio channels from the N audio channels 20 based upon prioritization 32 of the N audio channels.
- the first sub-set 30 of M audio channels are mixed to produce a first output audio channel 52.
- the second sub-set 80 of audio channels are mixed to produce a second output audio channel for storage.
- the audio channel 20 3 includes content 34 comprising voice content from a single participant, and it is selected for inclusion within the sub-set 30 of audio channels. It is used to produce the output audio channel 52.
- the audio channels 20 4 , 20 5 which have not been included within the output audio channel 52, or included only as background (as described with reference to FIG. 3 ), are selected for mixing to produce the second output audio signal that is stored in memory 82.
- FIG. 10 illustrates an example of how such an indication may be provided to the consumer of the output audio channel 52. FIG. 10 is described in detail later.
- An apparatus 10 may switch to the stored audio channel and play that back at a higher speed. For example, the apparatus 10 can monitor the typical length of inactivity in the preferred output audio channel 52 and adjust the speed of playback for the stored audio channel such that the relevant portions can be played back during a typical inactive period.
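The speed adjustment might be estimated as in this small sketch (an assumption, not a disclosed formula): fit the stored backlog into the typically observed inactive period, capped so speech stays intelligible.

```python
def catchup_rate(stored_seconds, typical_gap_seconds, max_rate=2.0):
    """Playback rate that lets stored audio of length stored_seconds fit
    into a typical inactive period of the preferred output channel.
    The 2x cap is an illustrative intelligibility limit."""
    if typical_gap_seconds <= 0:
        return max_rate
    return min(max_rate, max(1.0, stored_seconds / typical_gap_seconds))
```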
- FIG. 9A illustrates an example in which the apparatus 10 detects that content 34 of at least one of the N audio channels 20 comprises an identified keyword and adapts the prioritization 32 accordingly.
- the prioritization 32 in turn controls selection of which of the audio channels 20 are included in the sub-set 30 and the output audio channel 52 (and, if implemented, the stored alternative audio channel).
- the participant 'User 3' is speaking first and has priority. Therefore, the audio channel 20 3 associated with the User 3 is initially selected as the priority audio channel and is included within the sub-set 30 used to produce the output audio channel 52. Even though the participant 'User 5' begins to talk, the prioritization is not changed and the audio channel 20 3 remains the priority audio channel included within the sub-set 30 and the output audio channel 52.
- however, when an identified keyword is detected in the audio channel 20 5 , this event causes a switch in the prioritization of the audio channels 20 3 , 20 5 such that the audio channel 20 5 becomes prioritized and included in the sub-set 30 and the output audio channel 52 and the audio channel 20 3 becomes de-prioritized and excluded from the sub-set 30 and the output audio channel 52.
- the consumer of the output audio channel 52 can via user input settings control the likelihood of a switch when a keyword is mentioned within an audio channel 20.
- the consumer of the output audio channel 52 can, for example, require a switch if a keyword is detected.
- the likelihood of a switch can be increased.
- the occurrence of a keyword can increase the prioritization of an audio channel 20 such that it is stored, for example as described in relation to FIG. 8 .
- the detection of a keyword may provide an option to the consumer of the output audio channel 52, to enable the consumer to cause a change in the audio channel 20 included within the sub-set 30 and the output audio channel 52. For example, if the name of the consumer of the output audio channel 52 is included within an audio channel 20 that is not being rendered, as a priority, within the output audio channel 52 then the consumer of the output audio channel 52 can be presented with an option to change prioritization 32 and switch to using a sub-set 30 and output audio channel 52 that includes the audio channel 20 in which their name was detected.
- the new output audio channel 52 based on the detected keyword may be played back from the occurrence of the detected keyword.
- the playback is at a faster rate to allow a catch-up with real time.
- FIG. 10 illustrates an example in which a consumer of the output audio channel 52 is provided with information to allow that consumer to make an informed decision to switch audio channels 20 included within the sub-set 30 and the output audio channel 52.
- some form of indication is given to indicate a change in activity status. For example, if a particular participant begins to talk or there is a second separate discussion ongoing, the consumer of the original output audio channel 52 is made aware of this.
- a suitable indicator could for example be an audible indicator that is added to the output audio channel 52.
- each participant may have an associated different tone and a beep with a particular tone may indicate which participant has begun to speak.
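A per-participant tone could be generated as in this sketch; the base frequency, spacing, duration and amplitude are all invented values for illustration.

```python
import numpy as np

def participant_beep(participant_index, sample_rate=48000, duration=0.15):
    """Short sine beep whose pitch identifies a participant:
    participant 0 -> 600 Hz, each subsequent participant 100 Hz higher."""
    freq = 600.0 + 100.0 * participant_index
    t = np.arange(int(sample_rate * duration)) / sample_rate
    # Linear fade-in/out to avoid clicks at the beep edges.
    envelope = np.minimum(1.0, 20.0 * np.minimum(t, duration - t))
    return 0.2 * envelope * np.sin(2.0 * np.pi * freq * t)
```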
- an indicator could be a visual indicator in a user input interface.
- the background audio is adapted to provide an audible indication.
- the consumer listening to the output audio channel 52 hears the audio channel 20 1 associated with a first participant's voice (User A voice).
- if a second audio channel 20 is mixed with the audio channel 20 1 , then it may, for example, be an audio channel 20 2 that captures the ambient audio of the first participant (User A ambience).
- a second participant, User B begins to talk. This does not initiate a switch of prioritization 32 sufficient to change the sub-set 30.
- the primary audio channel 20 in the sub-set 30 and the output audio channel 52 remains the audio channel 20 1 .
- an indication is provided to indicate to the consumer of the output audio channel 52 that there is an alternative, available, audio channel 20 3 .
- the indication is provided by mixing the primary audio channel 20 1 with an additional audio channel 20 associated with the User B.
- the additional audio channel 20 can be an attenuated version of the audio channel 20 3 or can be an ambient audio channel 20 4 for the User B (User B ambience).
- the second audio channel 20 2 is replaced by the additional audio channel 20 4 .
- the consumer of the output audio channel 52 can then decide whether or not they wish to cause a change in the prioritization 32 to prioritize the audio channel 20 3 associated with the User B above the audio channel 20 1 associated with the User A. If this change in prioritization occurs then there is a switch in the primary audio channel within the sub-set 30 and the output audio channel 52 from being the audio channel 20 1 to being the audio channel 20 3 . In the example illustrated, the consumer does not make this switch. The switch does however occur automatically when the User A stops talking at time T2.
- the background audio B can be included and/or varied as an indication to the consumer of the output audio channel 52 that an alternative audio channel 20 is available for selection.
- FIG. 11A schematically illustrates audio rendered to a participant (User 5) at an output end-point 204 s of the system 200 (not illustrated) that is configured for rendering spatial audio.
- the audio output at the end-point 204 s has multiple rendered sound sources associated with audio channels 20 1 , 20 2 , 20 3 , 20 4 at different locations.
- FIG. 11A illustrates that even with the presence in the system 200 (not illustrated) of an output end-point 204 m ( FIG 11B ) that is not configured for spatial audio rendering, there may be no need to reduce the immersive capabilities or experience at the output end-points 204 s of the system 200 that are configured for rendering spatial audio.
- FIG. 11B schematically illustrates audio rendered to a participant (User 1) at an output end-point 204 m of the system 200 (not illustrated) that is not configured for rendering spatial audio.
- the audio output at the end-point 204 m provided by the output audio channel 52 has a single monophonic output audio channel 52 that is based on the sub-set 30 of selected audio channels 20 and has good intelligibility.
- the audio channel 20 2 is the primary audio channel that is included in the sub-set 30 and the output audio channel 52.
- the apparatus 10 can be configured to automatically switch the composition of the audio channels 20 mixed to form the output audio channel 52 in dependence upon an adaptive prioritization 32. Additionally or alternatively, in some examples, the switching can be effected manually by the consumer at the end-point 204 m using a user interface which includes a user input interface 90.
- the device at the output end-point 204 m , which in some examples may be the apparatus 10, comprises a user input interface 90 for controlling prioritization 32 of the N audio channels 20.
- the user input interface 90 can be configured to highlight or label selected ones of the N audio channels 20 for selection.
- the user input interface 90 can be used to control if and to what extent manual or automatic switching occurs to produce the output audio channel 52 from selected ones of the audio channels 20.
- An adaptation of the prioritization 32 can cause an automatic switching or can cause a prompt to a consumer for manual switching.
- the user input interface 90 can control if and the extent to which prioritization 32 depends upon one or more of timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20; history of content 34 of at least one of the N audio channels 20; mapping to a particular person an identified voice in content 34 of at least one of the N audio channels 20; detection that content 34 of at least one of the N audio channels 20 is voice content; and/or detection that content 34 of at least one of the N audio channels comprises an identified word.
- an option 91 4 that allows the participant, User 1, to select the audio channel 20 4 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 20 2 .
- an option 91 3 that allows User 1 to select the audio channel 20 3 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 20 2 .
- the user input interface 90 can provide a visual spatial representation of the N audio channels 20 and indicate which of the N audio channels 20 are comprised in the sub-set 30 of M audio channels.
- the user input interface 90 can also indicate which of the N audio channels are not comprised in the sub-set 30 of M audio channels and which, if any, of these are active.
- the user input interface 90 may provide textual information about an audio channel 20 that is active and available for selection.
- speech-to-text algorithms may be utilized to convert speech within that audio channel 20 into an alert displayed at the user input interface 90.
- the apparatus 10 may be configured to cause the user input interface 90 to provide an option to a consumer of the output audio channel 52 that enables that consumer to switch audio channels 20 included within the sub-set 30 and output audio channel 52.
- the keyword is "Dave” and the textual output provided by the user input interface 90 could, for example, say "option to switch to User 5 who addressed you and said: 'In our last teleco Dave made an interesting'".
- the sub-set 30 and the output audio channel 52 then includes the audio channel 20 5 from the User 5 and starts from the position "In our last teleco Dave made an interesting".
- a memory 82 could be used to store the audio channel 20 5 from the User 5.
- the apparatus 10 can be permanently operational to perform the selection of the sub-set 30 of audio channels 20 used to produce the output audio channel 52.
- the apparatus 10 has a state in which it is operational in this way and a state in which it is not operational in this way, and it can transition between these states, for example when a trigger event is or is not detected.
- the apparatus 10 can be configured to control mixing, by a mixer 50, of the N audio channels 20 to produce M audio channels in response to a trigger event.
- One example of a trigger event is conflict between audio channels 20.
- An example of detecting conflict would be when there is overlapping speech in audio channels 20.
- a trigger event is a reduction in communication bandwidth for receiving the audio channels 20 below a threshold value.
- the value of M can be dependent upon the available bandwidth.
- a trigger event is a reduction in communication bandwidth for providing the output audio channel 52 beneath a threshold value.
- the value of M can be dependent upon the available bandwidth.
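A hypothetical mapping from available bandwidth to the value of M (the per-channel bitrate and limits are illustrative assumptions):

```python
def channels_for_bandwidth(available_kbps, per_channel_kbps=24, n=8, m_min=1):
    """Choose M, the number of audio channels to keep, from the available
    bandwidth and an assumed per-channel bitrate. Clamped to [m_min, n];
    beneath the threshold of one full channel this falls back to M = 1."""
    m = int(available_kbps // per_channel_kbps)
    return max(m_min, min(n, m))
```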
- apparatus 10 can also be configured to control the transmission of audio channels 20 to it, reducing the number of audio channels received from N to M (a reduction of N-M channels), wherein only the M audio channels that may be required for mixing to produce the output audio channel 52 are received.
- FIG. 12 illustrates an example of a method 100 that can for example be performed by the apparatus 10.
- the method comprises, at block 102, receiving at least N audio channels 20 where each of the N audio channels 20 can be rendered as a different audio source.
- the method 100 comprises, at block 104, controlling mixing of the N audio channels 20 to produce at least an output audio channel 52, wherein the mixer 50 selects a sub-set 30 of at least M audio channels from the N audio channels 20 in dependence upon prioritization 32 of the N audio channels 20, wherein the prioritization 32 is adaptive and depends at least upon a content 34 of one or more of the N audio channels 20.
- the method 100 further comprises, at block 106, causing rendering of at least the output audio channel 52.
- FIG. 13 illustrates a method 110 for producing the output audio channel 52. This method broadly corresponds to the method previously described with reference to FIG. 6 .
- the method 110 comprises obtaining spatial audio signals from at least two sources as distinct audio channels 20.
- the method 110 comprises determining temporal activity of each of the spatial audio signals (of the two audio channels 20) and selecting at least one spatial audio signal (audio channel 20) for mono downmix (for inclusion within the sub-set 30 and the output audio channel 52) for duration of its activity.
- the method 110 comprises determining a content-based priority for at least one of the spatial audio signals (audio channels 20) for temporarily altering a previous selection.
- the method 110 comprises determining a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the prioritized spatial audio signals (audio channels 20).
- the output audio channel 52 is based upon the selected sub-set 30 of M audio channels, which is in turn based upon the prioritization 32. Then at block 120, the method 110 provides the first mono downmix (the output audio channel 52) to the participant for listening. That is, it provides the output audio channel 52 for rendering.
- the prioritization 32 determined at block 116 is used to adaptively adjust selection of the sub-set 30 of M audio channels 20 used to produce the output audio channel 52.
- FIG. 14 illustrates an example in which the audio channel 20 3 is first selected, based on prioritization, as the primary audio channel in the output audio channel 52.
- the output audio channel 52 does not comprise the audio channel 20 4 or 20 5 .
- the audio channel 20 3 remains prioritized. There is no change to the selection of the sub-set 30 of M audio channels until the activity in the audio channel 20 3 ends.
- a new selection process can occur based upon the prioritization 32 of other channels. In this example there is a selection grace period after the end of activity in the audio channel 20 3 .
- the audio channel 20 3 will be re-selected as the primary channel to be included in the sub-set 30 and the output audio channel 52.
- the audio channel 20 3 can have a higher prioritization and be selected if it becomes active. After the selection grace period expires, the prioritization of the audio channel 20 3 can be decreased.
- FIG. 15 illustrates an example of a method 130 that broadly corresponds to the method previously described in relation to FIG. 8 .
- the method 130 comprises obtaining spatial audio signals (audio channels 20) from at least two sources. This corresponds to the receiving of at least two audio channels 20.
- the method 130 determines a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the spatial audio signals (audio channels 20).
- the method 130 comprises determining at least one second mono downmix (sub-set 80 and additional audio channel) based on at least one of the spatial audio signals (audio channels 20) not present in the first mono downmix.
- the first mono downmix is provided to a participant for listening as the output audio channel 52.
- the second mono downmix is provided to a memory for storage.
- if an audio channel 20 associated with a particular input end-point 206 is selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52 at a particular output end-point 204, then this information may be provided as a feedback at an output end-point 204 associated with that included input end-point 206.
- an audio channel 20 associated with a particular input end-point 206 is not selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52 at a particular output end point 204, then this information may be provided as a feedback at an output end-point 204 associated with that excluded input end-point 206.
- the information can for example identify the input end-points 206 not selected for inclusion for rendering at a particular identified output end-point 204.
- FIG. 16 illustrates an example of a controller 70.
- Implementation of a controller 70 may be as controller circuitry.
- the controller 70 may be implemented in hardware alone, have certain aspects in software including firmware alone or can be a combination of hardware and software (including firmware).
- the controller 70 may be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 76 in a general-purpose or special-purpose processor 72 that may be stored on a computer readable storage medium (disk, memory etc) to be executed by such a processor 72.
- the processor 72 is configured to read from and write to the memory 74.
- the processor 72 may also comprise an output interface via which data and/or commands are output by the processor 72 and an input interface via which data and/or commands are input to the processor 72.
- the memory 74 stores a computer program 76 comprising computer program instructions (computer program code) that controls the operation of the apparatus when loaded into the processor 72.
- the computer program instructions, of the computer program 76, provide the logic and routines that enable the apparatus to perform the methods previously illustrated and/or described.
- the processor 72 by reading the memory 74 is able to load and execute the computer program 76.
- the apparatus 10 therefore comprises:
- the computer program 76 may arrive at the apparatus 10 via any suitable delivery mechanism 78.
- the delivery mechanism 78 may be, for example, a machine readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid state memory, an article of manufacture that comprises or tangibly embodies the computer program 76.
- the delivery mechanism may be a signal configured to reliably transfer the computer program 76.
- the apparatus 10 may propagate or transmit the computer program 76 as a computer data signal.
- Computer program instructions for causing an apparatus to perform at least the following or for performing at least the following:
- the computer program instructions may be comprised in a computer program, a non-transitory computer readable medium, a computer program product, a machine readable medium. In some but not necessarily all examples, the computer program instructions may be distributed over more than one computer program.
- memory 74 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/ dynamic/cached storage.
- processor 72 is illustrated as a single component/circuitry it may be implemented as one or more separate components/circuitry some or all of which may be integrated/removable.
- the processor 72 may be a single core or multi-core processor.
- references to 'computer-readable storage medium', 'computer program product', 'tangibly embodied computer program' etc. or a 'controller', 'computer', 'processor' etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processing devices and other processing circuitry.
- References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
- circuitry may refer to one or more or all of the following:
- the blocks illustrated in the preceding Figs may represent steps in a method and/or sections of code in the computer program 76.
- the illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks, and the order and arrangement of the blocks may be varied. Furthermore, it may be possible for some blocks to be omitted.
- the above described examples find application as enabling components of: automotive systems; telecommunication systems; electronic systems including consumer electronic products; distributed computing systems; media systems for generating or rendering media content including audio, visual and audio visual content and mixed, mediated, virtual and/or augmented reality; personal systems including personal health systems or personal fitness systems; navigation systems; user interfaces also known as human machine interfaces; networks including cellular, non-cellular, and optical networks; ad-hoc networks; the internet; the internet of things; virtualized networks; and related software and services.
- a property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.
- 'a' or 'the' is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising a/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use 'a' or 'the' with an exclusive meaning then it will be made clear in the context. In some circumstances the use of 'at least one' or 'one or more' may be used to emphasize an inclusive meaning but the absence of these terms should not be taken to infer any exclusive meaning.
- the presence of a feature (or combination of features) in a claim is a reference to that feature or (combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features).
- the equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way.
- the equivalent features include, for example, features that perform substantially the same function, in substantially the same way to achieve substantially the same result.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Stereophonic System (AREA)
Abstract
An apparatus comprising means for:
receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon prioritization of the N audio channels, and wherein the prioritization is adaptive depending at least upon a changing content of one or more of the N audio channels; and
providing for rendering at least the output audio channel.
Description
- Embodiments of the present disclosure relate to audio. Some enable the distribution of common content for rendering to both advanced audio output devices and less advanced audio output devices.
- Advanced audio output devices are capable of rendering multiple received audio channels as different spatially positioned audio sources. The spatial separation of audio sources (spatial audio) can aid hearing when the sources simultaneously provide sound.
- Less advanced audio output devices may only be capable of rendering one monophonic audio channel. They cannot render multiple received audio channels as different spatially positioned audio sources.
- Content that is suitable for rendering spatial audio via an advanced audio output device may be unsuitable for a less advanced audio output device and content that is suitable for rendering by a less advanced audio output device may under-utilize the spatial audio capabilities of an advanced audio output device.
- According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for:
- receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
- controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects, for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon prioritization of the N audio channels, and wherein the prioritization is adaptive depending at least upon a changing content of one or more of the N audio channels; and
- providing for rendering at least the output audio channel.
- In some but not necessarily all examples, the apparatus comprises means for: automatically controlling mixing of the N audio channels to produce at least the output audio channel, in dependence upon time-variation of content of one or more of the N audio channels.
- In some but not necessarily all examples, the N audio channels are N spatial audio channels where each of the N spatial audio channels can be rendered as a differently positioned audio source.
- In some but not necessarily all examples, N is at least two and M is one, the output audio channel being a monophonic audio output channel.
- In some but not necessarily all examples, the apparatus comprises means for analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels.
- In some but not necessarily all examples, prioritization depends upon one or more of:
- timing of content of at least one of the N audio channels relative to timing of content of at least another one of the N audio channels;
- history of content of at least one of the N audio channels;
- mapping, to a particular person, an identified voice in content of at least one of the N audio channels;
- detection that content of at least one of the N audio channels is voice content;
- detection that content of at least one of the N audio channels comprises an identified word.
- In some but not necessarily all examples, controlling mixing of the N audio channels to produce at least an output audio channel comprises:
- selecting a first sub-set of the N audio channels to be mixed to provide background audio;
- selecting a second sub-set of the N audio channels to be mixed to provide foreground audio that is for rendering at greater loudness than the background audio, wherein the selection of the first sub-set and selection of the second sub-set is dependent upon the prioritization of the N audio channels; and
- mixing the background audio and the foreground audio to produce the output audio channel.
- In some but not necessarily all examples, the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels in response to a communication bandwidth for receiving the audio channels or for providing output audio signals falling beneath a threshold value.
- In some but not necessarily all examples, the apparatus comprises means for controlling mixing of the N audio channels to produce M audio channels when there is conflict between a first audio channel of the N audio channels and a second audio channel of the N audio channels, wherein the first audio channel is included within the M audio channels and the second audio channel is not included within the M audio channels, wherein over-talking is an example of conflict.
- In some but not necessarily all examples, the audio channels of the N audio channels that are not the selected M audio channels are available for later rendering.
- In some but not necessarily all examples, the apparatus comprises a user input interface for controlling prioritization of the N audio channels.
- In some but not necessarily all examples, the apparatus comprises a user input interface, wherein the user input interface provides a spatial representation of the N audio channels and indicates which of the N audio channels are comprised in the sub-set of M audio channels.
- According to various, but not necessarily all, embodiments there is provided a multi-party, live communication system that enables live audio communication between multiple remote participants using at least the N audio channels wherein different ones of the multiple remote participants provide audio input for different ones of the N audio channels, wherein the system comprises the apparatus.
- According to various, but not necessarily all, embodiments there is provided a method comprising:
- receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
- controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels; and
- rendering at least the output audio channel.
- According to various, but not necessarily all, embodiments there is provided a computer program that when run on one or more processors enables:
- control of mixing of N received audio channels, where each of the N audio channels can be rendered as a different audio source, to produce at least an output audio channel for rendering,
- wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels.
- According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for:
- receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
- adapting a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels; and
- controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects, for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon the prioritization; and
- providing for rendering at least the output audio channel.
- According to various, but not necessarily all, embodiments there is provided an apparatus comprising means for:
- receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
- analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels; and
- controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects, for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon the prioritization; and
- providing for rendering at least the output audio channel.
- According to various, but not necessarily all, embodiments there is provided examples as claimed in the appended claims.
- Some examples will now be described with reference to the accompanying drawings in which:
- FIG. 1 illustrates an example of an apparatus for providing an output audio channel for rendering;
- FIG. 2 illustrates an example of an apparatus in which an analyzer is configured to analyze the N audio channels to adapt the prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels;
- FIG. 3 illustrates another example of the apparatus;
- FIG. 4 illustrates an example of a multi-party, live communication system comprising the apparatus;
- FIG. 5A and 5B illustrate alternative topologies of the system;
- FIG. 6 illustrates an example of prioritization based on timing of content;
- FIG. 7 illustrates an example of prioritization based on content type;
- FIG. 8 illustrates an example of storage of unselected audio channels;
- FIG. 9A, 9B, 9C illustrate examples of prioritization based on keywords in content;
- FIG. 10 illustrates an example of informing a consumer of the output audio channel of an option to change the audio channels included within the output audio channel;
- FIG. 11A illustrates an example of spatial audio rendered, based on the N audio channels, at an output end-point configured for rendering spatial audio;
- FIG. 11B illustrates an example of audio rendered, based on the output audio channel, at an output end-point that is not configured for rendering spatial audio;
- FIG. 12, 13, 15 illustrate examples of a method;
- FIG. 14 illustrates an example of changing prioritization based on timing of content;
- FIG. 16 illustrates an example of a controller; and
- FIG. 17 illustrates an example of a computer program.
- The following description and the attached drawings describe various examples of an
apparatus 10 that receives at least N audio channels 20 and enables the rendering of one or more output audio channels 52. - The set of N audio channels is referenced using reference number 20. Each audio channel of the set of N audio channels is referenced using reference number 20i, where i is 1, 2, ..., N-1, N. - The apparatus 10 comprises means for receiving at least N audio channels 20 where each of the N audio channels 20i can be rendered as a different audio source. - The apparatus 10 comprises means for controlling mixing of the N audio channels 20 to produce at least an output audio channel 52. - A selector 40 selects for mixing (to produce the output audio channel 52) a sub-set 30 of M audio channels from the N audio channels 20. The selection is dependent upon prioritization 32 of the N audio channels 20. The prioritization 32 is adaptive, depending at least upon a changing content 34 of one or more of the N audio channels 20. - The sub-set 30 of M audio channels is referenced using reference number 30. Each audio channel of the sub-set of M audio channels is referenced using reference number 20j, where j is any M of the N values of i. The sub-set 30 can, for example, be varied by changing the value of M and/or by changing which audio channels 20j are used to comprise the M audio channels of the sub-set 30. In the description, different sub-sets 30 can, in some examples, be differentiated using the same reference 30 with different numeric subscripts. - A mixer 50 mixes the sub-set 30 of M audio channels to produce the output audio channel 52 which is suitable for rendering. - An advanced spatial audio output device (an example is illustrated at FIG 11A) can render the N audio channels 20 as multiple different spatially positioned audio sources. A less advanced audio output device (an example is illustrated at FIG 11B) can render the output audio channel 52. - The apparatus 10 therefore allows a common content, the N audio channels 20, to provide audio output at both the advanced spatial audio output device and the less advanced audio output device. -
FIG. 1 illustrates an example of an apparatus 10 for providing an output audio channel 52 for rendering. The rendering of the output audio channel 52 can occur at the apparatus 10 or can occur at some other device. - The apparatus 10 receives at least N audio channels 20. An audio channel 20i of the N audio channels 20 can be rendered as a distinct audio source. - The apparatus 10 comprises a mixer 50 for mixing a sub-set 30 of M audio channels from the N audio channels 20 to produce at least an output audio channel 52. - A selector 40 selects for mixing, at mixer 50, the sub-set 30 of M audio channels from the N audio channels 20. The selection, by the selector 40, is dependent upon prioritization 32 of the N audio channels 20. The prioritization 32 is adaptive, depending at least upon a changing content 34 of one or more of the N audio channels 20. The apparatus 10 provides, from the mixer 50, the output audio channel 52 for rendering. - The sub-set 30 of M audio channels has fewer audio channels than the N audio channels 20, that is, M is less than N. N is at least two and in at least some examples is greater than two. In at least some examples M is one and the output audio channel 52 is a monophonic audio output channel. - The prioritization 32 is adaptive. The prioritization 32 depends at least on a changing content 34 of one or more of the N audio channels 20. - In some but not necessarily all examples, the apparatus 10 is configured to automatically control the mixing of the N audio channels 20 to produce at least the output audio channel 52, in dependence upon time-variation of content 34 of one or more of the N audio channels 20.
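- By way of illustration only, the following sketch (not part of the patent disclosure; the function names, the equal-gain downmix and the numeric scores are hypothetical placeholders for the adaptive prioritization 32) shows one way the behaviour described above, selecting a sub-set of M audio channels by prioritization and mixing them to a single output channel, could be realized:

    import heapq

    def select_and_downmix(channels, priorities, M):
        # channels: list of N equal-length lists of samples, one per audio channel.
        # priorities: list of N numeric scores (higher = more important).
        # Select the indices of the M highest-priority channels.
        selected = heapq.nlargest(M, range(len(channels)), key=lambda i: priorities[i])
        # Equal-gain downmix of the selected channels into one output channel.
        n_samples = len(channels[0])
        return [sum(channels[i][s] for i in selected) / M for s in range(n_samples)]

    # Example: N=3 channels, M=1 -> only the highest-priority channel is output.
    chans = [[0.1, 0.2], [0.9, 0.8], [0.0, 0.0]]
    mono = select_and_downmix(chans, priorities=[0.2, 0.9, 0.1], M=1)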
- FIG. 2 illustrates an example of an apparatus 10 in which an analyzer 60 is configured to analyze the N audio channels 20 to adapt the prioritization 32 of the N audio channels 20 in dependence upon, at least, changing content 34 of one or more of the N audio channels 20. - The analysis can be performed before (or simultaneously with) the above-mentioned selection. - In some examples, the analyzer 60 is configured to process metadata associated with the N audio channels 20. Additionally or alternatively, in some examples, the analyzer 60 is configured to process the audio content of the audio channels 20. This processing could, for example, comprise voice activity detection, voice recognition processing, spectral analysis, semantic processing of speech or other processing, including machine learning and artificial intelligence processing, used to identify characteristics of the content 34 of one or more of the N audio channels 20. - The prioritization 32 can depend upon one or more parameters of the content 34. - In one example, the prioritization 32 depends upon timing of content 34i of an audio channel 20i relative to timing of content 34j of an audio channel 20j. Thus, the audio channel 20 that first satisfies a trigger condition has temporal priority. In some examples the trigger condition may be that the audio channel 20 has activity above a threshold, and/or has activity above a threshold in a particular spectral range, and/or has voice activity, and/or has voice activity associated with a specific person, and/or the voice activity comprises semantic content including a particular keyword or phrase.
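- A minimal sketch of this temporal-priority rule (illustrative only; the frame-aligned boolean activity flags and the bookkeeping structure are assumptions, and any of the trigger conditions listed above could stand in for plain activity) is:

    def update_temporal_priority(active_now, activation_time, t):
        # active_now: one boolean per channel for frame t (True = trigger condition met).
        # activation_time: dict mapping channel index -> frame at which it became active.
        for i, active in enumerate(active_now):
            if active and i not in activation_time:
                activation_time[i] = t        # channel has just become active
            elif not active:
                activation_time.pop(i, None)  # channel has gone quiet
        active_channels = [i for i, a in enumerate(active_now) if a]
        if not active_channels:
            return None
        # The channel that satisfied the trigger condition first has temporal priority.
        return min(active_channels, key=lambda i: activation_time[i])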
- An initial prioritization 32 can cause an initial selection of a first sub-set 301 of audio channels 20 that are mixed to form the output audio channel 52. A change in prioritization 32 can cause a new selection of a second, different sub-set 302 of audio channels 20 that are mixed to form a new, different output audio channel 52. The first sub-set 301 and the second sub-set 302 are not equal sets. Thus, apparatus 10 can prioritize one or more of the N audio channels 20 as a sub-set 30 until a new selection by the selector 40 based on a new prioritization 32 changes the sub-set 30. - If a person is speaking in a particular audio channel 20 first, that channel may be prioritized ahead of a second audio channel. However, if the person speaking in the first audio channel stops speaking then the prioritization 32 of the audio channels can change and there can be a consequential reselection, at the selector 40, of the sub-set 30 of M audio channels provided for mixing to produce the output audio channel 52. - The apparatus 10 can flag at least one input audio channel 20 corresponding to a first active talker, or generally active content 34, during a selection period and prioritize this selection over other audio channels 20. The apparatus 10 can determine whether the active talker continues before introducing content 34 from non-prioritized channels to the mixed output audio channel 52. The introduction of such additional content 34 from non-prioritized channels is controlled by the selector 40 during a following selection period. - In some examples, non-prioritized audio channels 20 can be completely omitted from the mixed output audio channel 52 and thus the mixed output audio channel 52 will contain only the prioritized channel(s). However, in other examples, the non-prioritized channels can be mixed with a lower gain or higher attenuation than the prioritized channel and/or with other suitable processing to produce the output audio channel 52. - It will therefore be appreciated that in at least some examples, a history of content 34 of at least one of the N audio channels 20 can be used to control the prioritization 32. For example, it may be possible to vary the "inertia" of the system, that is, control how quickly the prioritization is allowed to change, as sketched below. It is therefore possible to make the apparatus 10 more or less responsive to short term variations in the content 34 of one or more of the N audio channels 20.
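- One plausible realization of this "inertia" (a sketch, not the patented method; the smoothing constant is an assumption) is to low-pass filter the instantaneous priority scores so that short bursts of content do not immediately change the selection:

    def smooth_priorities(raw, smoothed, alpha=0.1):
        # Exponential smoothing of per-channel priority scores.
        # Small alpha -> more inertia (slow to react); large alpha -> more responsive.
        return [(1 - alpha) * s + alpha * r for s, r in zip(smoothed, raw)]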
- The selector 40, in making a selection of which of the N audio channels 20 to select for mixing to produce the output audio channel 52, can, for example, use decision thresholds for selection. A decision threshold can be changed over time and can be dependent upon a history of the content 34. In addition, different decision thresholds can be used for different audio channels 20. - In some examples, the prioritization 32 can be dependent upon mapping, to a particular person, an identified voice in content 34 of at least one of the N audio channels 20. The analyzer 60 can for example perform voice recognition based upon the content 34 of one or more of the N audio channels 20. Alternatively, the analyzer 60 can identify a particular person based upon metadata comprised within the content 34 of at least one of the N audio channels 20. It may therefore be possible to identify a particular one of the N audio channels 20 as relating to a person whose contribution it is particularly important to hear such as, for example, a chairman of a meeting. - In some examples, the analyzer 60 is configured to adapt the prioritization 32 when the presence of voice content is detected within the content 34 of at least one of the N audio channels 20. Thus, the analyzer 60 is able to prioritize the spoken word within the output audio channel 52. It is also possible to adapt the analyzer 60 to prioritize other types of content. - In some, but not necessarily all, examples, the analyzer 60 is configured to adapt the prioritization 32 based upon detection that content 34 of at least one of the N audio channels 20 comprises an identified keyword. The analyzer 60 can, for example, listen to the content 34 and identify within the stream of content a keyword, or identify semantic meaning within the stream of content. This can be used to modify the prioritization 32. For example, it may be desirable for a consumer of the output audio channel 52 to have that output audio channel 52 personalized so that if one of the N audio channels 20 comprises content 34 that includes the consumer's name or other information associated with the consumer then that audio channel 20 is prioritized by the analyzer 60. - In some, but not necessarily all, examples, the N audio channels 20 can represent live content. In this example, the analysis by the analyzer 60, the selection by the selector 40 and the mixing by the mixer 50 can occur in real time such that the output audio channel 52 is also live. -
FIG. 3 illustrates an example of the apparatus of FIG. 1 in more detail. In this example one possible operation of the mixer 50 is illustrated in more detail. In this example, the mixing is a weighted mixing in which different sub-sets of the sub-set 30 of selected audio channels are weighted with different attenuation/gain before being finally mixed to produce the output audio channel 52. - In the illustrated example, the selector 40, based upon the prioritization 32, selects a first sub-set SS1 of the M audio channels to be mixed to provide background audio B and selects a second sub-set SS2 of the M audio channels 20 to be mixed to provide foreground audio F that is for rendering at greater loudness than the background audio B. The selection of the first sub-set SS1 and the selection of the second sub-set SS2 is dependent upon the prioritization 32 of the N audio channels 20. The first sub-set SS1 of audio channels 20 is mixed 501 to provide background audio B which is then amplified/attenuated G1 to adjust the loudness of the background audio before it is provided to the mixer 503 for mixing to produce the output audio channel 52. The second sub-set SS2 of audio channels 20 is mixed 502 to provide foreground audio F which is then amplified/attenuated G2 to adjust the loudness of the foreground audio before it is provided to the mixer 503 for mixing to produce the output audio channel 52. - The gain/attenuation G2 applied to the foreground audio F makes it significantly louder than the background audio B in the
output audio channel 52. In some situations, the foreground audio F is naturally louder than background audio B. Thus, it can be but need not be that G2 > G1. - The gain/attenuation G1, G2 can, in some examples, vary with frequency.
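- The foreground/background mixing of FIG. 3 could be sketched as follows (illustrative only; the gain values G1 and G2 and the equal-gain bus sums are assumptions, and as noted the gains may in practice also vary with frequency):

    def mix_foreground_background(fg_channels, bg_channels, g2=1.0, g1=0.25):
        # fg_channels: second sub-set SS2 (foreground); bg_channels: first sub-set SS1.
        # Assumes at least one foreground channel; all channels have equal length.
        n = len(fg_channels[0])
        def bus(channels):
            # Equal-gain sum of one sub-set into a single bus signal.
            return [sum(ch[s] for ch in channels) for s in range(n)]
        foreground, background = bus(fg_channels), bus(bg_channels)
        # G2 > G1 keeps the prioritized foreground louder than the background bed.
        return [g2 * f + g1 * b for f, b in zip(foreground, background)]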
-
FIG. 4 illustrates an example of a multi-party, live communication system 200 that enables live audio communication between multiple remote participants Ai, B, C, Di using at least the N audio channels 20. Different ones of the multiple remote participants Ai, B, C, Di provide audio input for different ones of the N audio channels 20. - The system 200 comprises input end-points 206 for capturing audio channels 20. The system 200 comprises output end-points 204 for rendering audio channels. One or more output end-points 204s (spatial output end-points) are configured for rendering spatial audio as distinct rendered audio sources. One or more output end-points 204m (mono output end-points) are not configured for rendering spatial audio. - The N audio channels 20 are N spatial audio channels where each of the N spatial audio channels is captured as a differently positioned captured audio source, and can be rendered using spatial audio as a differently positioned rendered audio source. In some examples the captured audio source (input end-point 206) has a fixed and stationary position. However, in other examples it can vary in position. When such an input end-point 206 is rendered as a rendered audio source at an output end-point 204 using spatial audio, then the rendered audio source can either be fixed or can move, for example, in a manner corresponding to the moving input end-point 206. - In this example, the
system 200 is for enabling immersive teleconferencing or telepresence for remote terminals. The different terminals have varying device capabilities and different (and possibly variable) network conditions. - Spatial/immersive audio refers to audio that typically has a three-dimensional space representation or is presented (rendered) to a participant with the intention of the participant being able to hear a specific audio source from a specific direction. In the specific example illustrated there is a multi-participant audio/visual conference call between remote participants. Some of the participants share a room. For example, participants A1, A2, A3, A4 share the room A and the participants D1, D2, D3, D4, D5 share the room D.
- Some of the terminals can be characterized as "advanced spatial audio output devices" that have an output end-
point 204s that is configured for spatial audio. However, some of the terminals are less advanced audio output devices that have an output end-point 204m that is not configured for spatial audio. - In a spatial audio experience, the voices of the participants Ai, B, C, Di are spatially separated. The voices may, for example, have fixed spatial positions relative to each other or the directions may be adaptive, for example, according to participant movements, conference bridge settings or based upon inputs by participants. A similar experience is available to the participants who are using the output end-
points 204s and they have the ability to interact much more naturally than traditional voice calls and voice conferencing. For example, they can talk at the same time and still understand each other thanks to effects such as the well-known cocktail party effect. - In rooms A and D, each of the respective participants Ai, Di has a personal input end-
point 206 which captures a personal captured audio source as a personal audio channel 20. The personal input end-point 206 can, for example, be provided by a directional microphone or by a Lavalier microphone. - The participants B and C each have a single personal input end-point 206 which captures a personal audio channel 20. - In rooms A and D, the output end-points 204s are configured for spatial audio. For example, each room can have a surround sound system as an output end-point 204s. An output end-point 204s is configured to render each captured sound source represented by an audio channel 20 as a rendered sound source. - In room D, each participant Ai, B, C has a personal output audio channel 20. Each personal output audio channel 20 is rendered from a different location as a different rendered audio source. The collection of rendered audio sources associated with the participants Ai creates a virtual room A. - In room A, each participant Di, B, C has a personal output audio channel 20. Each personal output audio channel 20 is rendered from a different location as a different rendered sound source. The collection of the rendered audio sources associated with the participants Di creates a virtual room D. - For participant C, the output end-point 204s is configured for spatial audio. An output end-point 204s is configured to render each captured sound source represented by an audio channel 20 as a rendered sound source. - The participant C has an output end-point 204s that is configured for spatial audio. In this example, the participant C is using a headset configured for binaural spatial audio that is suitable for virtual reality (VR). Binauralization methods can be used to render personal audio channels 20 as spatially positioned rendered audio sources. Each participant Ai, Di, B has a personal output audio channel 20. Each personal output audio channel 20 is or can be rendered from a different location as a different rendered sound source. - The participant B has an output end-
point 204m that is not configured for spatial audio. In this example it is a monophonic output end-point. In the example illustrated, the participant B is using a mobile device (e.g. a mobile phone) to provide the input end-point 206 and the output end-point 204m. The mobile device has a single output end-point 204m which provides the output audio channel 52 as previously described. The processing to produce the output audio channel 52 can be performed at the mobile device of the participant B or at the server 202. - The mono-capability limitation of participant B can, for example, be caused by the device, for example because it is only configured for decoding of mono audio, or because of the available audio output facilities such as a mono-only earpiece or headset.
- In the preceding examples the spatial audio has been described at a high resolution. Each of the input end-
points 206 is rendered in spatial audio as a spatially distinct rendered audio source. However, in other examples multiple ones of the input end-points 206 may be mixed together to produce a single rendered audio source. This can be used to reduce the number of rendered audio sources using spatial audio. Therefore, in some examples, a spatial audio device may render multiple ones of the output audio channels 52. - In the example illustrated in
FIG. 4, a star topology similar to that illustrated in FIG. 5A is used. The central server 202 interconnects the input end-points 206 and the output end-points 204. In the example of FIG. 5A, the input end-points 206 provide the N audio channels 20 to a central server 202 which provides the output audio channel 52, produced as previously described, to the output end-point 204m. In this example, the apparatus 10 is located in the central server 202; however, in other examples the apparatus 10 is located at the output end-point 204m. - FIG. 5B illustrates an alternative topology in which there is no centralized architecture but a peer-to-peer architecture. In this example, the apparatus 10 is located at the output end-point 204m. - The 3GPP IVAS codec is an example of a voice and audio communications codec for spatial audio. The IVAS codec is an extension of the 3GPP EVS codec and is intended for new immersive voice and audio services over 4G and 5G. Such immersive services include, for example, immersive voice and audio for virtual reality (VR). The multi-purpose audio codec is expected to handle encoding, decoding and rendering of speech, music and generic audio. It is expected to support channel-based audio and scene-based audio inputs including spatial information about the sound field and sound sources. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions. The audio channels 20 can, for example, be coded/decoded using the 3GPP IVAS codec. - The spatial audio channels 20 can, for example, be provided as metadata-assisted spatial audio (MASA), object-based audio, channel-based audio (5.1, 7.1+4), non-parametric scene-based audio (e.g. First Order Ambisonics, Higher Order Ambisonics) and any combination of these formats. These audio formats can be binauralized for headset listening such that a participant can hear the audio sources outside their head. - It will therefore be appreciated from the foregoing that the apparatus 10 provides a better experience, including improved intelligibility, for a mono user participating in a spatial audio teleconference with several potentially overlapping spatial audio inputs. The apparatus 10 means that it is not necessary, in some cases, to simplify the spatial audio conference experience for the spatial audio users due to having a mono-audio participant. Thus, a mono user can participate in a spatial audio conference without compromising the experience of the other users. -
FIGS 6, 7, 8 and 9A illustrate examples of an apparatus 10 that comprises a controller 70. The controller 70 receives N audio channels 20 and performs control processing to select the sub-set 30 of M audio channels. In the examples previously described, the controller 70 comprises the selector 40 and, optionally, the analyzer 60. In these examples, the mixer 50 is present but not illustrated. - In at least some of these examples, the controller 70 is configured to control mixing of the N audio channels 20 to produce the sub-set 30 of M audio channels when a conflict between a first audio channel of the N audio channels 20 and a second audio channel of the N audio channels occurs. For example, the control can cause the first audio channel 20 to be included within the sub-set 30 of M audio channels and cause the second audio channel 20 not to be included within the sub-set 30 of M audio channels. - In some examples, at a later time, when there is no longer conflict between the first audio channel and the second audio channel, the second audio channel is included within the sub-set 30 of M audio channels. - One example of when there is conflict between audio channels is when there is simultaneous activity from different prioritized sound sources. For example, over-talking (simultaneous speech) associated with different audio channels 20 can be an example of conflict, as sketched below.
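- Over-talking could, for example, be detected from per-channel voice-activity flags (a sketch under the assumption that frame-aligned boolean flags are available; the minimum overlap length is arbitrary):

    def detect_overtalk(voice_active, min_overlap=5):
        # voice_active: one list of per-frame booleans per channel.
        # Conflict is flagged when two or more channels are simultaneously
        # voice-active for at least min_overlap consecutive frames.
        overlap = 0
        for frame in zip(*voice_active):
            overlap = overlap + 1 if sum(frame) >= 2 else 0
            if overlap >= min_overlap:
                return True
        return False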
- In the example illustrated in FIG. 6, the prioritization 32 used for the selection of audio channels to form the sub-set 30 of M audio channels depends upon timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20. - In this example, the participant 3 speaks first and the audio channel 203 associated with the participant 3 is selected as a 'priority' for inclusion within the sub-set 30 of M=1 audio channels used to form the output audio channel 52. The later speech by the other participants is not initially selected for inclusion within the sub-set 30 of audio channels used to form the output audio channel 52. - The audio channel 203 preferentially remains prioritized and remains included within the output audio channel 52 while there is voice activity in the audio channel 203, whereas the other audio channels are not included. If there is a pause in voice activity in the audio channel 203 then in some examples a selection process may immediately change the identity of the audio channel 20 selected for inclusion within the output audio channel 52. However, in other examples there can be a selection grace period. During this grace period, there can be a greater likelihood of selection/reselection of the original selected audio channel 203. Thus, during the grace period, prioritization 32 is biased in favor of the previously selected audio channel, as sketched below.
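- The selection grace period might be realized as follows (illustrative only; the frame counts are assumptions):

    def reselect_with_grace(current, candidate, quiet_frames, grace=50):
        # current: index of the currently prioritized channel.
        # candidate: best other active channel, or None.
        # quiet_frames: how long the current channel has been silent.
        if quiet_frames < grace:
            # Within the grace period the previous selection is kept, so a
            # short pause does not switch the downmix away from the talker.
            return current
        return candidate if candidate is not None else current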
- It will therefore be appreciated that, in at least some examples, prioritization 32 used for the selection depends upon a history of content 34 of at least one of the N audio channels 20. - In some examples, the prioritization 32 used for the selection can depend upon mapping, to a particular person (an identifiable human), an identified voice in content 34 of at least one of the N audio channels 20. A voice can be identified using metadata or by analysis of the content 34. The prioritization 32 would more favorably select the particular person's audio channel 20 for inclusion within the output audio channel 52. - The particular person could, for example, be based upon service policy. A teleconference service may have a moderator or chairman role and this participant may for example be made audible to all participants or may be able to force themselves to be audible to all participants. In other examples, the particular person could for example be indicated by a user consuming the output audio channel 52. That consumer could for example indicate which of the other participants' content 34 or audio channels 20 they wish to consume. This audio channel 20 could then be included, or be more likely to be included, within the output audio channel 52. The inclusion of the user-selected audio channel 20 can for example be dependent upon voice activity within the audio channel 20, that is, the user-selected audio channel 20 is only included if there is active voice activity within that audio channel 20. The prioritization 32 used for the selection therefore strongly favors the user-selected audio channel 20. The selection by the consumer of the output audio channel 52 of a particular audio channel 20 can for example be based upon an identity of the participant who is speaking or should speak in that audio channel. Alternatively, it could be based upon a user-selection of that audio channel because of the content 34 rendered within that audio channel.
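- A user-indicated (or policy-indicated) channel could be favored by boosting its score, gated on voice activity as described above (a sketch; the boost value is arbitrary):

    def apply_user_preference(priorities, pinned, voice_active, boost=10.0):
        # pinned: index of the channel the consumer asked to follow, or None.
        # The boost applies only while the pinned channel has voice activity.
        out = list(priorities)
        if pinned is not None and voice_active[pinned]:
            out[pinned] += boost
        return out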
- FIG. 7 illustrates an example similar to FIG. 6. In this example, the audio channels 20 include a mixture of different audio types. The audio channel 203 associated with participant 3 is predominantly a voice channel. The audio channels associated with the other participants are predominantly music channels. The selection of which of the audio channels 20 is to be included within the output audio channel 52 can be based upon the audio type present within the audio channel 20. The detection of the audio type within the audio channel 20 can for example be achieved using metadata or, alternatively, by analyzing the content 34 of the audio channel 20. Thus, the prioritization 32 used for selection can be dependent upon detection that content 34 of at least one of the N audio channels 20 is voice content. In such a voice-centric case, natural pauses in the active content 34 allow for changes in the mono downmix. That is, the output audio channel 52 can switch between the inclusion of different audio channels 20 in dependence upon which of them includes active voice content. In this way priority can be given to spoken language. The other channels, for example the music channels, can be included as background audio as described with reference to FIG. 3. - In the examples illustrated in
FIGS 6 and 7, the apparatus 10 deliberately loses information by excluding (or diminishing) audio channels 20 with respect to the output audio channel 52. Information is generally lost by the selective downmixing which is required to maintain or guarantee intelligibility. It is, however, possible for there to be two simultaneously important audio channels 20, only one of which is selected for inclusion in the output audio channel 52. The apparatus illustrated in FIG. 8 addresses this issue. - The apparatus 10 illustrated is similar to that illustrated in FIGS 6 and 7. However, it additionally comprises a memory 82 for storage of a further sub-set 80 of the N audio channels 20 that is different to the sub-set 30 of N audio channels 20. Thus, in this example at least some of the audio channels of the N audio channels 20 that are not selected for inclusion in the sub-set 30 of M audio channels are stored as sub-set 80 and are available for later rendering. In some examples, the later rendering may be at a faster playback rate and that playback rate may be fixed or may be adaptive. In some examples, the sub-set 80 of audio channels is mixed to form an alternative audio output channel for storage in the memory 82. - In the specific example illustrated, at least some of the audio channels of the N audio channels that are not selected to be in the sub-set 30 of M audio channels are stored in memory 82 for later rendering. - In the particular illustrated example, there is selection of a first sub-set 30 of M audio channels from the N audio channels based upon prioritization 32 of the N audio channels. The first sub-set 30 of M audio channels is mixed to produce a first output audio channel 52. There is selection of a different second sub-set 80 of audio channels from the N audio channels based upon prioritization 32 of the N audio channels. The second sub-set 80 of audio channels is mixed to produce a second output audio channel for storage. - In the example illustrated in FIG. 8, the audio channel 203 includes content 34 comprising voice content from a single participant, and it is selected for inclusion within the sub-set 30 of audio channels. It is used to produce the output audio channel 52. The audio channels that are not included within the output audio channel 52, or are included only as background (as described with reference to FIG. 3), are selected for mixing to produce the second output audio signal that is stored in memory 82. - When there is storage of a
second sub-set 80 of audio channels as a second audio signal, it is desirable to let the consumer of the output audio channel 52 know of the existence of the stored audio signal. This can for example facilitate user control of switching from rendering the output audio channel 52 to rendering the stored audio channel. - FIG. 10 illustrates an example of how such an indication may be provided to the consumer of the output audio channel 52. FIG. 10 is described in detail later. - In some examples, it may be possible to automatically switch from rendering the output audio channel 52 to rendering the stored audio channel. For example, there may be automatic switching during periods of inactivity of the output audio channel 52. An apparatus 10 may switch to the stored audio channel and play that back at a higher speed. For example, the apparatus 10 can monitor the typical length of inactivity in the preferred output audio channel 52 and adjust the speed of playback for the stored audio channel such that the relevant portions can be played back during a typical inactive period.
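- The catch-up playback rate could be chosen, for example, so that the stored audio fits a typical inactive period (a sketch; the 2x cap is an assumption):

    def catchup_speed(stored_seconds, typical_gap_seconds, max_speed=2.0):
        # Pick a playback rate so the stored (unselected) audio can be
        # replayed within a typical pause in the live output channel.
        if typical_gap_seconds <= 0:
            return max_speed
        return min(max_speed, max(1.0, stored_seconds / typical_gap_seconds))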
- FIG. 9A illustrates an example in which the apparatus 10 detects that content 34 of at least one of the N audio channels 20 comprises an identified keyword and adapts the prioritization 32 accordingly. The prioritization 32 in turn controls selection of which of the audio channels 20 are included in the sub-set 30 and the output audio channel 52 (and, if implemented, the stored alternative audio channel). - In the example illustrated in FIG. 9B, the participant 'User 3' is speaking first and has priority. Therefore, the audio channel 203 associated with the User 3 is initially selected as the priority audio channel and is included within the output audio channel 52. Even though the participant 'User 5' begins to talk, the prioritization is not changed and the audio channel 203 remains the priority audio channel included within the output audio channel 52. At time T1 it is detected that User 5 says a keyword, in this example the name of the consumer of the output audio channel 52 (Dave). While this event increases the likelihood of a switch in the prioritization of the audio channels, such that the audio channel 205 would become prioritized and included in the output audio channel 52, in this example there is insufficient cause to change the prioritization 32 and consequently change which of the audio channels 20 is included within the output audio channel 52. - In the example illustrated in FIG. 9C, the participant 'User 3' is speaking first and has priority. Therefore, the audio channel 203 associated with the User 3 is initially selected as the priority audio channel and is included within the sub-set 30 used to produce the output audio channel 52. Even though the participant 'User 5' begins to talk, the prioritization is not changed and the audio channel 203 remains the priority audio channel included within the sub-set 30 and the output audio channel 52. At time T1 it is detected that User 5 says a keyword, in this example the name of the consumer of the output audio channel 52 (Dave). This event causes a switch in the prioritization of the audio channels: the audio channel 205 becomes prioritized and included in the sub-set 30 and the output audio channel 52, and the audio channel 203 becomes de-prioritized and excluded from the sub-set 30 and the output audio channel 52. - In some examples, the consumer of the output audio channel 52 can, via user input settings, control the likelihood of a switch when a keyword is mentioned within an audio channel 20. For example, the consumer of the output audio channel 52 can require a switch if a keyword is detected. Alternatively, the likelihood of a switch can be increased. - In other examples, the occurrence of a keyword can increase the prioritization of an audio channel 20 such that it is stored, for example as described in relation to FIG. 8. - In other examples, the detection of a keyword may provide an option to the consumer of the output audio channel 52, to enable the consumer to cause a change in the audio channel 20 included within the sub-set 30 and the output audio channel 52. For example, if the name of the consumer of the output audio channel 52 is included within an audio channel 20 that is not being rendered, as a priority, within the output audio channel 52 then the consumer of the output audio channel 52 can be presented with an option to change prioritization 32 and switch to using a sub-set 30 and output audio channel 52 that includes the audio channel 20 in which their name was detected. - Where a detected keyword causes a switch in the audio channels included in the sub-set 30 and output audio channel 52, the new output audio channel 52 based on the detected keyword may be played back from the occurrence of the detected keyword. In some examples the playback is at a faster rate to allow a catch-up with real time.
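- Keyword-driven re-prioritization could be sketched as below (illustrative only; it assumes a speech-to-text transcript per channel and a lowercase keyword set, e.g. containing the consumer's name):

    def keyword_priority(transcript_words, keywords, base_priority, boost=5.0):
        # Raise a channel's priority when its transcript contains a watched
        # keyword; whether the boost forces a switch is a separate policy.
        hit = any(word.lower() in keywords for word in transcript_words)
        return base_priority + boost if hit else base_priority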
- FIG. 10 illustrates an example in which a consumer of the output audio channel 52 is provided with information to allow that consumer to make an informed decision to switch audio channels 20 included within the sub-set 30 and the output audio channel 52. - In some examples, some form of indication is given to indicate a change in activity status. For example, if a particular participant begins to talk or there is a second separate discussion ongoing, the consumer of the original output audio channel 52 is made aware of this. - A suitable indicator could for example be an audible indicator that is added to the output audio channel 52. In some examples, each participant may have an associated different tone and a beep with a particular tone may indicate which participant has begun to speak. Alternatively, an indicator could be a visual indicator in an input user interface. - In the example illustrated in FIG. 10, the background audio is adapted to provide an audible indication. Initially, the consumer listening to the output audio channel 52 hears the audio channel 201 associated with a first participant's voice (User A voice). If a second audio channel 20 is mixed with the audio channel 201, then it may, for example, be an audio channel 202 that captures the ambient audio of the first participant (User A ambience). At time T1 a second participant, User B, begins to talk. This does not initiate a switch of prioritization 32 sufficient to change the sub-set 30. The primary audio channel 20 in the sub-set 30 and the output audio channel 52 remains the audio channel 201. However, an indication is provided to indicate to the consumer of the output audio channel 52 that there is an alternative, available, audio channel 203. The indication is provided by mixing the primary audio channel 201 with an additional audio channel 20 associated with the User B. For example, the additional audio channel 20 can be an attenuated version of the audio channel 203 or can be an ambient audio channel 204 for the User B (User B ambience). In this example, the second audio channel 202 is replaced by the additional audio channel 204. - The consumer of the output audio channel 52 can then decide whether or not they wish to cause a change in the prioritization 32 to prioritize the audio channel 203 associated with the User B above the audio channel 201 associated with the User A. If this change in prioritization occurs then there is a switch in the primary audio channel within the sub-set 30 and the output audio channel 52 from being the audio channel 201 to being the audio channel 203. In the example illustrated, the consumer does not make this switch. The switch does however occur automatically when the User A stops talking at time T2. - In the example of FIG. 10, referring back to the example of FIG. 3, the background audio B can be included and/or varied as an indication to the consumer of the output audio channel 52 that an alternative audio channel 20 is available for selection.
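- The audible indication of FIG. 10 could be sketched as mixing the primary channel with an attenuated alternative channel or its ambience (illustrative only; the indication gain is an assumption):

    def mix_with_indication(primary, indication, g_ind=0.15):
        # Mix an attenuated version of an alternative channel (or its
        # ambience) under the primary channel so the listener can hear
        # that other activity is available for selection.
        return [p + g_ind * a for p, a in zip(primary, indication)]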
- FIG. 11A schematically illustrates audio rendered to a participant (User 5) at an output end-point 204s of the system 200 (not illustrated) that is configured for rendering spatial audio. In accordance with the preceding examples, the audio output at the end-point 204s has multiple rendered sound sources associated with audio channels 201, 202, 203, 204 at different locations. FIG. 11A illustrates that even with the presence in the system 200 (not illustrated) of an output end-point 204m (FIG 11B) that is not configured for spatial audio rendering, there may be no need to reduce the immersive capabilities or experience at the output end-points 204s of the system 200 that are configured for rendering spatial audio. - FIG. 11B schematically illustrates audio rendered to a participant (User 1) at an output end-point 204m of the system 200 (not illustrated) that is not configured for rendering spatial audio. In accordance with the preceding examples, the audio output at the end-point 204m provided by the output audio channel 52 has a single monophonic output audio channel 52 that is based on the sub-set 30 of selected audio channels 20 and has good intelligibility. In the example illustrated, the audio channel 202 is the primary audio channel that is included in the sub-set 30 and the output audio channel 52. - The
apparatus 10 can be configured to automatically switch the composition of the audio channels 20 mixed to form the output audio channel 52 in dependence upon an adaptive prioritization 32. Additionally or alternatively, in some examples, the switching can be effected manually by the consumer at the end-point 204m using a user interface which includes a user input interface 90. - In the example illustrated in FIG. 11B, the device at the output end-point 204m, which in some examples may be the apparatus 10, comprises a user input interface 90 for controlling prioritization 32 of the N audio channels 20. For example, the user input interface 90 can be configured to highlight or label selected ones of the N audio channels 20 for selection. The user input interface 90 can be used to control if, and to what extent, manual or automatic switching occurs to produce the output audio channel 52 from selected ones of the audio channels 20. An adaptation of the prioritization 32 can cause an automatic switching or can cause a prompt to a consumer for manual switching. - In some examples, the user input interface 90 can control if, and the extent to which, prioritization 32 depends upon one or more of: timing of content 34 of at least one of the N audio channels 20 relative to timing of content 34 of at least another one of the N audio channels 20; history of content 34 of at least one of the N audio channels 20; mapping, to a particular person, an identified voice in content 34 of at least one of the N audio channels 20; detection that content 34 of at least one of the N audio channels 20 is voice content; and/or detection that content 34 of at least one of the N audio channels comprises an identified word. - In the example illustrated, within the user input interface 90, there is an option 914 that allows the participant, User 1, to select the audio channel 204 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 202. There is also an option 913 that allows User 1 to select the audio channel 203 as a replacement primary audio channel that is included in the sub-set 30 and the output audio channel 52 instead of the audio channel 202. - In some but not necessarily all examples, the user input interface 90 can provide a visual spatial representation of the N audio channels 20 and indicate which of the N audio channels 20 are comprised in the sub-set 30 of M audio channels. - The user input interface 90 can also indicate which of the N audio channels are not comprised in the sub-set 30 of M audio channels and which, if any, of these are active. - In some, but not necessarily all, examples, the user input interface 90 may provide textual information about an audio channel 20 that is active and available for selection. For example, speech-to-text algorithms may be utilized to convert speech within that audio channel 20 into an alert displayed at the user input interface 90. Referring back to the example illustrated in FIG. 9A, the apparatus 10 may be configured to cause the user input interface 90 to provide an option to a consumer of the output audio channel 52 that enables that consumer to switch audio channels 20 included within the sub-set 30 and output audio channel 52. In this example, the keyword is "Dave" and the textual output provided by the user input interface 90 could, for example, say "option to switch to User 5 who addressed you and said: 'In our last teleco Dave made an interesting'". If the consumer, Dave, then selects the option to switch, the sub-set 30 and the output audio channel 52 then includes the audio channel 205 from the User 5 and starts from the position "In our last teleco Dave made an interesting...". A memory 82 (not illustrated in the FIG) could be used to store the audio channel 205 from the User 5. - In the preceding examples, the
apparatus 10 can be permanently operational to perform the selection of the sub-set 30 of audio channels 20 used to produce the output audio channel 52. However, in other examples the apparatus 10 has a state in which it is operational in this way and a state in which it is not operational in this way, and it can transition between these states, for example when a trigger event is or is not detected. The apparatus 10 can be configured to control a mixer 50 mixing the N audio channels 20 to produce M audio channels in response to a trigger event. - One example of a trigger event is conflict between audio channels 20. An example of detecting conflict would be when there is overlapping speech in audio channels 20. - Another example of a trigger event is a reduction in communication bandwidth for receiving the audio channels 20 below a threshold value. In this example, the value of M can be dependent upon the available bandwidth. - Another example of a trigger event is a reduction in communication bandwidth for providing the output audio channel 52 beneath a threshold value. In this example, the value of M can be dependent upon the available bandwidth. - In some examples, the apparatus 10 can also be configured to control the transmission of audio channels 20 to it, and reduce the number of audio channels received by N-M, from N to M, wherein only the M audio channels that may be required for mixing to produce the output audio channel 52 are received.
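- The trigger-event gating described above might be sketched as follows (illustrative only; the bandwidth accounting is an assumption):

    def should_downmix(bandwidth_bps, threshold_bps, conflict):
        # Enter the selective-downmix state on over-talking or when the
        # communication bandwidth falls beneath a threshold value.
        return conflict or bandwidth_bps < threshold_bps

    def channels_for_bandwidth(bandwidth_bps, per_channel_bps, n):
        # M can depend on the available bandwidth, capped at N, floor 1.
        return max(1, min(n, bandwidth_bps // per_channel_bps))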
- FIG. 12 illustrates an example of a method 100 that can for example be performed by the apparatus 10. The method comprises, at block 102, receiving at least N audio channels 20 where each of the N audio channels 20 can be rendered as a different audio source. - The method 100 comprises, at block 104, controlling mixing of the N audio channels 20 to produce at least an output audio channel 52, wherein the mixer 50 selects a sub-set 30 of at least M audio channels from the N audio channels 20 in dependence upon prioritization 32 of the N audio channels 20, wherein the prioritization 32 is adaptive and depends at least upon a content 34 of one or more of the N audio channels 20. The method 100 further comprises, at block 106, causing rendering of at least the output audio channel 52. - FIG. 13 illustrates a method 110 for producing the output audio channel 52. This method broadly corresponds to the method previously described with reference to FIG. 6. - At block 112, the method 110 comprises obtaining spatial audio signals from at least two sources as distinct audio channels 20. At block 114, the method 110 comprises determining temporal activity of each of the spatial audio signals (of the two audio channels 20) and selecting at least one spatial audio signal (audio channel 20) for mono downmix (for inclusion within the sub-set 30 and the output audio channel 52) for the duration of its activity. At block 116, the method 110 comprises determining a content-based priority for at least one of the spatial audio signals (audio channels 20) for temporarily altering a previous selection. At block 118, the method 110 comprises determining a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the prioritized spatial audio signals (audio channels 20). The output audio channel 52 is based upon the selected sub-set M which is in turn based upon the prioritization 32. Then at block 120, the method 110 provides the first mono downmix (the output audio channel 52) to the participant for listening. That is, it provides the output audio channel 52 for rendering. - It will therefore be appreciated that the prioritization 32 determined at block 116 is used to adaptively adjust selection of the sub-set 30 of M audio channels 20 used to produce the output audio channel 52. - FIG. 14 illustrates an example in which the audio channel 203 is first selected, based on prioritization, as the primary audio channel in the output audio channel 52. In this example, at this time, the output audio channel 52 does not comprise the other audio channels. Until the activity in the audio channel 203 ends, the audio channel 203 remains prioritized. There is no change to the selection of the sub-set 30 of M audio channels until the activity in the audio channel 203 ends. When the activity in the audio channel 203 ends then a new selection process can occur based upon the prioritization 32 of other channels. In this example there is a selection grace period after the end of activity in the audio channel 203. If there is resumed activity in the audio channel 203 during this selection grace period then the audio channel 203 will be re-selected as the primary channel to be included in the sub-set 30 and the output audio channel 52. Thus during the selection grace period the audio channel 203 can have a higher prioritization and be selected if it becomes active. After the selection grace period expires, the prioritization of the audio channel 203 can be decreased. -
- FIG. 15 illustrates an example of a method 130 that broadly corresponds to the method previously described in relation to FIG. 8. At block 132, the method 130 comprises obtaining spatial audio signals (audio channels 20) from at least two sources. This corresponds to the receiving of at least two audio channels 20. At block 134, the method 130 determines a first mono downmix (sub-set 30 and output audio channel 52) based on at least one of the spatial audio signals (audio channels 20). Next, at block 136, the method 130 comprises determining at least one second mono downmix (sub-set 80 and additional audio channel) based on at least one of the spatial audio signals (audio channels 20) not present in the first mono downmix. At block 138, the first mono downmix is provided to a participant for listening as the output audio channel 52. At block 140, the second mono downmix is provided to a memory for storage.
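A possible shape for method 130 is sketched below; the equal-gain mono_mix helper and the in-memory store are invented for the example, and the block numbering follows the description above.

```python
# Hedged sketch of blocks 132-140: render a first downmix of the
# prioritized channel, store a second downmix of the remaining channels.

def method_130(frames: dict[str, list[float]],
               priority: dict[str, float],
               store: list[list[float]]) -> list[float] | None:
    """Blocks 132-140: first downmix to the listener, second to storage."""
    if not frames:
        return None
    ordered = sorted(frames, key=lambda cid: priority.get(cid, 0.0), reverse=True)
    first_mix = mono_mix([frames[cid] for cid in ordered[:1]])    # block 134
    if ordered[1:]:                                               # block 136
        second_mix = mono_mix([frames[cid] for cid in ordered[1:]])
        store.append(second_mix)                                  # block 140
    return first_mix                                              # block 138

def mono_mix(channels: list[list[float]]) -> list[float]:
    """Equal-gain mono downmix of one frame from each channel."""
    n = max(len(c) for c in channels)
    return [sum(c[i] for c in channels if i < len(c)) / len(channels)
            for i in range(n)]
```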
- In any of the examples, when an audio channel 20 associated with a particular input end-point 206 is selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52, this information may be provided as feedback at an output end-point 204 associated with that included input end-point 206.
- In any of the examples, when an audio channel 20 associated with a particular input end-point 206 is not selected for inclusion within the sub-set 30 of audio channels used to create the output audio channel 52 at a particular output end-point 204, this information may be provided as feedback at an output end-point 204 associated with that excluded input end-point 206. The information can, for example, identify the input end-points 206 not selected for inclusion for rendering at a particular identified output end-point 204.
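The feedback described in the two preceding paragraphs could, for example, be carried by messages of the following hedged form; the message shape and field names are assumptions introduced here, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class SelectionFeedback:
    output_endpoint: str   # end-point 204 receiving the feedback
    input_endpoint: str    # end-point 206 the feedback is about
    selected: bool         # True if included in the sub-set 30

def feedback_messages(output_endpoint: str,
                      all_inputs: set[str],
                      selected_inputs: set[str]) -> list[SelectionFeedback]:
    """Report, for one output end-point, which input end-points were
    (not) selected for inclusion in the output audio channel 52."""
    return [SelectionFeedback(output_endpoint, ep, ep in selected_inputs)
            for ep in sorted(all_inputs)]
```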
- FIG. 16 illustrates an example of a controller 70. Implementation of a controller 70 may be as controller circuitry. The controller 70 may be implemented in hardware alone, have certain aspects in software including firmware alone, or can be a combination of hardware and software (including firmware).
- As illustrated in FIG. 16, the controller 70 may be implemented using instructions that enable hardware functionality, for example, by using executable instructions of a computer program 76 in a general-purpose or special-purpose processor 72; those instructions may be stored on a computer-readable storage medium (disk, memory, etc.) to be executed by such a processor 72.
- The processor 72 is configured to read from and write to the memory 74. The processor 72 may also comprise an output interface via which data and/or commands are output by the processor 72 and an input interface via which data and/or commands are input to the processor 72.
- The memory 74 stores a computer program 76 comprising computer program instructions (computer program code) that control the operation of the apparatus when loaded into the processor 72. The computer program instructions, of the computer program 76, provide the logic and routines that enable the apparatus to perform the methods previously illustrated and/or described. The processor 72, by reading the memory 74, is able to load and execute the computer program 76.
- The apparatus 10 therefore comprises:
- at least one processor 72; and
- at least one memory 74 including computer program code;
- the at least one memory 74 and the computer program code configured to, with the at least one processor 72, cause the apparatus 10 at least to perform:
- receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
- controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels;
- causing rendering of at least the output audio channel.
- As illustrated in FIG. 17, the computer program 76 may arrive at the apparatus 10 via any suitable delivery mechanism 78. The delivery mechanism 78 may be, for example, a machine-readable medium, a computer-readable medium, a non-transitory computer-readable storage medium, a computer program product, a memory device, a record medium such as a Compact Disc Read-Only Memory (CD-ROM) or a Digital Versatile Disc (DVD) or a solid-state memory, or an article of manufacture that comprises or tangibly embodies the computer program 76. The delivery mechanism may be a signal configured to reliably transfer the computer program 76. The apparatus 10 may propagate or transmit the computer program 76 as a computer data signal.
- Computer program instructions for causing an apparatus to perform at least the following or for performing at least the following:
- receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
- controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels;
- causing rendering of at least the output audio channel.
- The computer program instructions may be comprised in a computer program, a non-transitory computer-readable medium, a computer program product, or a machine-readable medium. In some but not necessarily all examples, the computer program instructions may be distributed over more than one computer program.
- Although the memory 74 is illustrated as a single component/circuitry, it may be implemented as one or more separate components/circuitry, some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
- Although the processor 72 is illustrated as a single component/circuitry, it may be implemented as one or more separate components/circuitry, some or all of which may be integrated/removable. The processor 72 may be a single-core or multi-core processor.
- References to 'computer-readable storage medium', 'computer program product', 'tangibly embodied computer program' etc. or a 'controller', 'computer', 'processor' etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential (Von Neumann)/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device, whether instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device etc.
- As used in this application, the term 'circuitry' may refer to one or more or all of the following:
- (a) hardware-only circuitry implementations (such as implementations in only analog and/or digital circuitry) and
- (b) combinations of hardware circuits and software, such as (as applicable):
- (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
- (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions and
- (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g. firmware) for operation, but the software may not be present when it is not needed for operation.
- The blocks illustrated in the preceding Figs may represent steps in a method and/or sections of code in the
computer program 76. The illustration of a particular order to the blocks does not necessarily imply that there is a required or preferred order for the blocks, and the order and arrangement of the blocks may be varied. Furthermore, it may be possible for some blocks to be omitted.
- Where a structural feature has been described, it may be replaced by means for performing one or more of the functions of the structural feature whether that function or those functions are explicitly or implicitly described.
- The above-described examples find application as enabling components of:
automotive systems; telecommunication systems; electronic systems including consumer electronic products; distributed computing systems; media systems for generating or rendering media content including audio, visual and audiovisual content and mixed, mediated, virtual and/or augmented reality; personal systems including personal health systems or personal fitness systems; navigation systems; user interfaces also known as human-machine interfaces; networks including cellular, non-cellular, and optical networks; ad-hoc networks; the internet; the internet of things; virtualized networks; and related software and services. - The term 'comprise' is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising Y indicates that X may comprise only one Y or may comprise more than one Y. If it is intended to use 'comprise' with an exclusive meaning then it will be made clear in the context by referring to "comprising only one..." or by using "consisting".
- In this description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term 'example' or 'for example' or 'can' or 'may' in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus 'example', 'for example', 'can' or 'may' refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a feature described with reference to one example but not with reference to another example, can where possible be used in that other example as part of a working combination but does not necessarily have to be used in that other example.
- Although examples have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the claims.
- Features described in the preceding description may be used in combinations other than the combinations explicitly described above.
- Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.
- Although features have been described with reference to certain examples, those features may also be present in other examples whether described or not.
- The term 'a' or 'the' is used in this document with an inclusive not an exclusive meaning. That is, any reference to X comprising a/the Y indicates that X may comprise only one Y or may comprise more than one Y unless the context clearly indicates the contrary. If it is intended to use 'a' or 'the' with an exclusive meaning then it will be made clear in the context. In some circumstances 'at least one' or 'one or more' may be used to emphasize an inclusive meaning, but the absence of these terms should not be taken to imply any exclusive meaning.
- The presence of a feature (or combination of features) in a claim is a reference to that feature (or combination of features) itself and also to features that achieve substantially the same technical effect (equivalent features). The equivalent features include, for example, features that are variants and achieve substantially the same result in substantially the same way. The equivalent features include, for example, features that perform substantially the same function, in substantially the same way, to achieve substantially the same result.
- In this description, reference has been made to various examples using adjectives or adjectival phrases to describe characteristics of the examples. Such a description of a characteristic in relation to an example indicates that the characteristic is present in some examples exactly as described and is present in other examples substantially as described.
- Whilst endeavoring in the foregoing specification to draw attention to those features believed to be of importance it should be understood that the Applicant may seek protection via the claims in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not emphasis has been placed thereon.
Claims (15)
- An apparatus comprising means for:
receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects, for mixing to produce the output audio channel, a sub-set of M audio channels from the N audio channels, wherein the selection is in dependence upon prioritization of the N audio channels, and wherein the prioritization is adaptive depending at least upon a changing content of one or more of the N audio channels; and
providing for rendering at least the output audio channel.
- An apparatus as claimed in claim 1, comprising means for: automatically controlling mixing of the N audio channels to produce at least the output audio channel, in dependence upon time-variation of content of one or more of the N audio channels.
- An apparatus as claimed in claim 1 or 2, wherein the N audio channels are N spatial audio channels where each of the N spatial audio channels can be rendered as a differently positioned audio source.
- An apparatus as claimed in any preceding claim, wherein N is at least two and wherein M is one, the output audio channel being a monophonic audio output channel.
- An apparatus as claimed in any preceding claim, comprising means for analyzing the N audio channels to adapt a prioritization of the N audio channels in dependence upon, at least, changing content of one or more of the N audio channels.
- An apparatus as claimed in any preceding claim, wherein prioritization depends upon one or more of:
timing of content of at least one of the N audio channels relative to timing of content of at least another one of the N audio channels;
history of content of at least one of the N audio channels;
mapping, to a particular person, an identified voice in content of at least one of the N audio channels;
detection that content of at least one of the N audio channels is voice content;
detection that content of at least one of the N audio channels comprises an identified word.
- An apparatus as claimed in any preceding claim, wherein controlling mixing of the N audio channels to produce at least an output audio channel comprises:
selecting a first sub-set of the N audio channels to be mixed to provide background audio;
selecting a second sub-set of the N audio channels to be mixed to provide foreground audio that is for rendering at greater loudness than the background audio, wherein the selection of the first sub-set and the selection of the second sub-set are dependent upon the prioritization of the N audio channels; and
mixing the background audio and the foreground audio to produce the output audio channel.
- An apparatus as claimed in any preceding claim, comprising means for controlling mixing of the N audio channels to produce M audio channels in response to a communication bandwidth for receiving the audio channels or for providing output audio signals falling beneath a threshold value.
- An apparatus as claimed in any preceding claim, comprising means for controlling mixing of the N audio channels to produce M audio channels when there is conflict between a first audio channel of the N audio channels and a second audio channel of the N audio channels, wherein the first audio channel is included within the M audio channels and the second audio channel is not included within the M audio channels, wherein over-talking is an example of conflict.
- An apparatus as claimed in any preceding claim, wherein the audio channels of the N audio channels that are not the selected M audio channels are available for later rendering.
- An apparatus as claimed in any preceding claim, comprising a user input interface for controlling prioritization of the N audio channels.
- An apparatus as claimed in any preceding claim, comprising a user input interface, wherein the user input interface provides a spatial representation of the N audio channels and indicates which of the N audio channels are comprised in the sub-set of M audio channels.
- A multi-party, live communication system that enables live audio communication between multiple remote participants using at least the N audio channels wherein different ones of the multiple remote participants provide audio input for different ones of the N audio channels, wherein the system comprises the apparatus as claimed in any of claims 1 to 12.
- A method comprising:
receiving at least N audio channels where each of the N audio channels can be rendered as a different audio source;
controlling mixing of the N audio channels to produce at least an output audio channel, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels; and
rendering at least the output audio channel.
- A computer program that, when run on one or more processors, enables:
controlling mixing of N received audio channels, where each of the N audio channels can be rendered as a different audio source, to produce at least an output audio channel for rendering, wherein the mixing selects a sub-set of at least M audio channels from the N audio channels in dependence upon prioritization of the N audio channels, wherein the prioritization is adaptive and depends at least upon a content of one or more of the N audio channels.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
EP21154652.8A | 2021-02-02 | 2021-02-02 | Selecton of audio channels based on prioritization
Publications (1)

Publication Number | Publication Date
---|---
EP4037339A1 (en) | 2022-08-03
Family
ID=74505017
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
EP21154652.8A (withdrawn) | Selecton of audio channels based on prioritization | 2021-02-02 | 2021-02-02

Country Status (1)

Country | Link
---|---
EP | EP4037339A1 (en)
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---
US20110040397A1 (en) * | 2009-08-14 | 2011-02-17 | Srs Labs, Inc. | System for creating audio objects for streaming |
US20150049868A1 (en) * | 2012-03-23 | 2015-02-19 | Dolby Laboratories Licensing Corporation | Clustering of Audio Streams in a 2D / 3D Conference Scene |
US20180190300A1 (en) * | 2017-01-03 | 2018-07-05 | Nokia Technologies Oy | Adapting A Distributed Audio Recording For End User Free Viewpoint Monitoring |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED
| AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN
20230204 | 18D | Application deemed to be withdrawn | Effective date: 20230204