WO2018162803A1 - Method and arrangement for parametric analysis and processing of ambisonically encoded spatial sound scenes
- Publication number: WO2018162803A1 (application PCT/FI2018/050172)
- Authority: WIPO (PCT)
Classifications
- G10L21/0272—Voice signal separating (speech enhancement, e.g. noise reduction or echo cancellation)
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
- H04R27/00—Public address systems
- H04R2227/003—Digital PA systems using, e.g., LAN or internet
- H04S3/02—Systems employing more than two channels, of the matrix type, i.e. in which input signals are combined algebraically, e.g. after having been phase shifted with respect to each other
- H04S7/302—Electronic adaptation of stereophonic sound system to listener position or orientation
- H04S7/303—Tracking of listener position or orientation
- H04S7/304—Tracking of listener position or orientation, for headphones
- H04S7/305—Electronic adaptation of stereophonic audio signals to reverberation of the listening space
- H04S7/306—Adaptation to reverberation of the listening space, for headphones
- H04S2400/13—Aspects of volume control, not necessarily automatic, in stereophonic sound systems
- H04S2400/15—Aspects of sound capture and related signal processing for recording or reproduction
- H04S2420/01—Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
- H04S2420/11—Application of ambisonics in stereophonic audio systems
Definitions
- an objective of the present invention is to at least alleviate one or more of the above problems and challenges associated with prior art solutions in the context of audio encoding or decoding involving spherical harmonic digital representation of sound scenes, or specifically, as termed above, Ambisonics.
- an electronic arrangement for cultivating a spherical harmonic digital representation of a sound scene comprises at least one data interface for transferring data, at least one processing unit for processing instructions and other data, and memory for storing the instructions and other data, said at least one processing unit being configured, in accordance with the stored instructions, to cause: obtaining the spherical harmonic digital representation of the sound scene, preferably comprising an ambisonically encoded digital representation; determining through analysis of said spherical harmonic digital representation a number of related spatial parameters indicative of at least dominant sound sources in the sound scene, their directions-of-arrival (DOA) and associated powers, wherein time-frequency decomposition of said spherical harmonic digital representation is preferably utilized to divide the representation into a plurality of frequency bands analysed, said bands optionally reflecting human auditory frequency resolution; and providing said spherical harmonic digital representation, preferably as divided into said plurality of frequency bands, and said number of spatial parameters to spatial filtering in order to produce an output signal for audio rendering and/or upmixing the representation to higher order.
- an electronic arrangement for processing a spherical harmonic digital representation of a sound scene comprises at least one data interface for transferring data, at least one processing unit for processing instructions and other data, and memory for storing the instructions and other data, said at least one processing unit being configured, in accordance with the stored instructions, to cause: obtaining the spherical harmonic digital representation of the sound scene, preferably being divided into a plurality of frequency bands, and a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting the spherical harmonic digital representation to spatial filtering and audio rendering or spatial filtering and upmixing, wherein corresponding matrices for decomposition of the spherical harmonic digital representation and rendering to audio signals associated with respective playback channels or upmixing to higher order representation, both for the dominant sound sources and ambient component, are determined based on the spatial parameters; and respectively providing the resulting, rendered signals forward for audio playback via a number of transducers associated with the playback channels, optionally speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene for storage or transmission.
- an electronic arrangement for processing a low-bandwidth indication of a spherical harmonic digital representation of a sound scene comprises at least one data interface for transferring data, at least one processing unit for processing instructions and other data, and memory for storing the instructions and other data, said at least one processing unit being configured, in accordance with the stored instructions, to cause: obtaining a number of dominant sound source signals and a monophonic ambient signal resulting from decomposing the spherical harmonic digital representation, preferably divided into a plurality of frequency bands, and further receiving a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting said dominant sound source signals and said ambient signal to audio rendering, utilizing said spatial parameters and involving distribution of the dominant sound source signals and said ambient signal among a number of playback channels, or to upmixing involving re-encoding the signals to a higher order spherical harmonic representation; and respectively providing the resulting, rendered signals forward for audio playback via a number of transducers associated with the playback channels, preferably speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene for storage or transmission.
- a method for cultivating a spherical harmonic digital representation of a sound scene to be performed by an electronic arrangement comprises: obtaining the spherical harmonic digital representation of the sound scene; determining through analysis of said spherical harmonic digital representation a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers, wherein time-frequency decomposition of said spherical harmonic digital representation is utilized to divide the representation into a plurality of frequency bands analyzed, said bands optionally reflecting human auditory frequency resolution; and providing said spherical harmonic digital representation, preferably as divided into said plurality of frequency bands, and said number of spatial parameters to spatial filtering in order to produce an output signal for audio rendering or upmixing the representation to higher order.
- a method for processing a spherical harmonic digital representation of a sound scene comprises: obtaining the spherical harmonic digital representation of the sound scene, preferably being divided into a plurality of frequency bands, and a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting the spherical harmonic digital representation to combined spatial filtering and audio rendering or combined spatial filtering and upmixing, wherein corresponding matrices for decomposition of the spherical harmonic digital representation and rendering to audio signals associated with respective playback channels or upmixing to higher order representation, both for the dominant sound sources and ambient component, are determined based on the spatial parameters; and respectively providing the resulting rendered signals forward for audio playback via a number of transducers associated with the playback channels, optionally speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene for storage or transmission.
- a method for processing a low-bandwidth indication of a spherical harmonic digital representation of a sound scene to be performed by an electronic arrangement comprises: obtaining a number of dominant sound source signals and a preferably monophonic ambient signal resulting from decomposing the spherical harmonic digital representation, preferably divided into a plurality of frequency bands, and further receiving a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting said dominant sound source signals and said ambient signal to audio rendering, utilizing said spatial parameters and involving distribution of the dominant sound source signals and said ambient signal among a number of playback channels, or to upmixing involving re-encoding the signals to a higher order spherical harmonic representation; and respectively providing the resulting, rendered signals forward for audio playback via a number of transducers associated with the playback channels, preferably speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene for storage or transmission.
- various embodiments of the present invention provide enhanced playback and flexible manipulation of HOA signals using parametric information.
- it uniquely applies an acoustic model of the sound scene that considers multiple directional sounds and a diffuse ambient signal.
- the solution focuses on spatially transformed HOA signals and aims at playback of the whole sound scene, without rejecting ambience and reverberation.
- the underlying model utilizes acoustical array techniques of DoA estimation and beamforming.
- the HOA signals that serve as input to the method may originate either from a microphone array recording or from a software source such as mixing software.
- the analysis and synthesis can be performed in a suitable time-frequency transform domain, such as the short-time Fourier transform, or a perceptually optimized filter bank.
- the analysis/synthesis may proceed in time frames analyzed and processed in a number of frequency bands.
- the applied time-frequency processing improves estimation accuracy due to e.g. improved separability and sparsity of the source and diffuse signals, being hence in better agreement with the assumed model.
- various embodiments of the present invention yield increased spatial resolution at reproduction for e.g. loudspeakers or headphones; instead of direct transformation of the ambisonic signals for spatial modifications and decoding, the solution suggested herein analyzes the statistics of the ambisonic signals and extracts spatial parameters describing the dominant sources in the scene, such as their directions-of-arrival (DoA) and their powers. It then estimates their signals, along with the residual signals not modeled by the dominant sources, which represent the remaining sounds in the scene corresponding to reverberation and ambience. As the solution knows the DoAs of the estimated sources, it can use them to spatialize the sources at the playback system with the highest spatial resolution that system can offer.
- the ambience component may be enhanced to achieve maximally diffuse properties by processing it separately through a network of decorrelators, something which cannot be done with e.g. direct ambisonic decoding.
- This kind of diffuse processing basically restores spaciousness of the ambient and reverberant sound, which is otherwise degraded by the high correlation between the playback signals in direct ambisonic decoding.
- the suggested solution is flexible in terms of playback setup.
- panning functions that suit better the target playback system, such as amplitude or vector-base amplitude panning (VBAP) functions, can be utilized.
- various embodiments of the present invention enable making more advanced and flexible spatial modifications of the captured or constructed sound scene.
- the embodiments can be harnessed to introduce a wide variety of such novel, meaningful modifications based on the analyzed, obtained parameters. Since the parameters are rather intuitive (DoAs of sources, levels of sources and ambience), they can be conveniently utilized by a sound effect developer to design interesting effects for manipulation of the sound scene. Examples of applicable modifications include selective attenuation of certain detected sources in the scene, control of the level of the ambient component, spatial re-mapping of certain sources in the scene, and visualization of the source parameters. These can be especially useful in editing the spatial sound scene for combination and alignment with immersive video content, for example.
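As a toy illustration of how such parameter-domain editing could look in practice, the sketch below operates on plain arrays of analyzed DoAs and powers. The function name, the azimuth-sector convention and the gain handling are illustrative assumptions, not the patent's interface.

```python
import numpy as np

def edit_scene(doas, src_pow, amb_pow, mute_sector=None, amb_gain_db=0.0):
    """doas: list of (azimuth, elevation) pairs; src_pow: (K,) source powers.

    mute_sector: optional (lo, hi) azimuth range in radians whose sources are
    suppressed; amb_gain_db adjusts the source/ambience level difference.
    """
    src_gain = np.ones(len(doas))
    if mute_sector is not None:
        lo, hi = mute_sector
        src_gain[[lo <= a < hi for a, _ in doas]] = 0.0   # suppress sector sources
    return src_pow * src_gain, amb_pow * 10.0 ** (amb_gain_db / 10.0)
```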
- the solution also suits lower-order ambisonic recordings and mixes, e.g. of first or second order.
- The expression "a number of" refers herein to any positive integer starting from one (1), e.g. one, two, or three.
- The expression "a plurality of" refers herein to any positive integer starting from two (2), e.g. two, three, or four.
- Figure 1 illustrates two common use scenarios involving Ambisonics
- Figure 2 illustrates a potential context and use scenario of various embodiments of the present invention, related entities and electronic arrangements
- Figure 3 illustrates an embodiment of the present invention for cultivating spherical harmonic representations of sound scenes
- Figure 4 illustrates one other embodiment of the present invention for cultivating spherical harmonic representations of sound scenes
- Figure 5 illustrates an embodiment of an audio analyser applicable e.g. in connection with the embodiments of Figs. 3 and 4;
- Figure 6 illustrates an embodiment of establishing audio modification/separation matrices for use e.g. with the embodiment of Fig. 3 and based on spatial parameters provided by an analyser such as the analyser of Fig. 5;
- Figure 7 illustrates embodiments of spatial filtering and rendering as well as upmixing, applicable e.g. in connection with the embodiment of Fig. 3;
- Figure 8 illustrates embodiments of source panning and diffuse rendering as well as upmixing, applicable e.g. in connection with the embodiment of Fig. 4.
- Figure 9 is a high-level flow diagram depicting the internals of various embodiments of a method in accordance with the invention.
- User devices (UE) 204a, 204b, 204c of users may refer to wired or wireless terminals, e.g. desktop or laptop computers, smartphones, tablets, which may be functionally connected to a communications network 210 by a suitable wired or wireless transceiver, for example, to access the network 210 and remote entities operatively reachable therethrough.
- the network 210 may include or be at least operatively connected to various private and/or public networks, e.g. the internet.
- One of such user devices 204a, 204b, 204c and/or a number of further devices, such as a server 220, also connected to the network, may be configured to host at least a portion of one or more embodiments of an electronic arrangement described herein and/or to execute related method(s) suggested.
- further device(s) 218 included in the environment that do not contribute to encoding, cultivating or decoding the spherical harmonic digital representations of sound scenes but participate in storing or transferring them, for example.
- the arrangement typically includes at least one processing unit 222, such as a microprocessor or a digital signal processor, for processing instructions and other data, a memory 228, such as one or more memory chips optionally integrated with the processing unit 222, for storing instructions and other data, and a (data) communication interface 224, such as a transceiver, transmitter and/or receiver, for transferring data e.g. via the network 210.
- the interface 224 may comprise wired and/or wireless means of communication.
- the arrangement may comprise or be at least operatively connected to a microphone such as an ambisonic microphone 230 for capturing a sound scene and sounds 229 for ambisonic encoding instead of or in addition to encoding a locally or externally synthetically produced sound scene.
- the arrangement may comprise or be at least operatively connected to a speaker 231 such as one or more loudspeakers, a related playback system, or e.g. headphones for audio output 232 of a reproduced sound scene.
- the arrangement may thus be configured to locally obtain or receive an externally created spherical harmonic digital representation of a sound scene, which may be captured via microphone(s) and/or be of synthetic origin (e.g., created directly digitally with a computer).
- one or more embodiments of the arrangement in accordance with the present invention may be implemented and/or related method(s) executed by any one or more of the user devices 204a, 204b, 204c and/or other computing devices such as the server 220.
- Intermediate device(s) 218 may be utilized for storing and/or transferring related data, for instance.
- One or more of the devices 204a, 204b, 204c, 220 may be configured to analyze the digital representation of a sound scene, whereas the same or different devices 204a, 204b, 204c, 220 may take care of the decoding, rendering and/or upmixing activities.
- the suggested solution may take as input an ambisonic stream, and produce signals for a) loudspeakers with arbitrary setups (stereo, 5.1, 7.1, hexagons, octagons, cubic, 13.1, 22.2, arbitrary ones), b) headphones employing head-related transfer functions for effective 3D sound rendering personalized to the user, and with head-tracking support, and/or c) ambisonic signals of a higher order than the original (upmixing).
- the method can work with ambisonic signals of basically any resolution, such as the common first-order ambisonics (FOA) of 4 channels, or higher-order ambisonics (HOA) of 9, 16, 25, or 36 channels for example.
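For bookkeeping, the channel counts quoted above follow directly from the hierarchical format: an order-N representation carries Q = (N + 1)² channels. A minimal check:

```python
# Q = (N + 1)^2 ambisonic channels for order N: 1, 4 (FOA), 9, 16, 25, 36, ...
for order in range(6):
    print(order, (order + 1) ** 2)
```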
- variants 300, 400 of the solution generally suggested herein are discussed in detail hereinafter.
- various embodiments of the present solution may rely e.g. on a time-frequency decomposition 302, 402 of the FOA/HOA signals 301 , 401 using an appropriate time-frequency transform or a filter bank.
- the created frequency channel(s) 303, 403 may be then handled separately.
- Both variants 300, 400 may utilize the same or at least similar analysis 304, 404 stage, which at each time step extracts spatial parameters for the sound scene, incorporating e.g. estimation of the number of dominant sources, their respective directions-of-arrival (DoA), and preferably also their powers.
- the two variants 300, 400 differ e.g. on where the signal decomposition happens and on the resulting bandwidth requirements.
- the first variant 300, termed here the high-bandwidth version, may be configured to decompose the sound scene and render or upmix the associated components in one stage 308.
- This version preferably utilizes all the FOA/HOA channels that are to be transmitted to the decoder, along with the spatial parameters, and is considered advantageous when e.g. maximum quality is desired, or when encoding and decoding may naturally occur in one stage (e.g. one application performing both in the same machine).
- This version preserves the original sound scene impression as much as possible.
- the decoder obtains/receives 360 the FOA/HOA channels 303 as well as spatial parameters 536, 538 used to form mixing matrices 632, 634 to enable decomposition of the sound scene (spatial filtering 732) and rendering 750, 752 to loudspeakers/headphones 310 or upmixing 754, 756.
- the derived matrices 632, 634 may be further adapted based on the information 730 about the loudspeaker setup, headphone characteristics (such as headphone calibration filters) and e.g. the user's headphone spatialization filters (e.g. head-related transfer functions), if available.
- the matrices may combine the decomposition of the sound scene and rendering to speakers/upmixing in one stage, for example. Accordingly, at least one matrix 632 may be determined to decompose and render the source components in the sound scene, and at least one other one 634 to decompose and render the ambient component. Furthermore, if upmixing 312 to higher-order Ambisonics is desired, the matrices may be configured to take into account the desired target order 742.
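A compact way to picture this one-stage operation, under the assumption that the separation filters and rendering gains are available per band, is a plain matrix product; the shapes and placeholder values below are illustrative only.

```python
import numpy as np

K, Q, L = 3, 16, 8                # sources, SH channels, loudspeakers (example)
W_sep = np.random.randn(K, Q)     # placeholder for the source separation filters
G_pan = np.random.randn(L, K)     # placeholder for panning gains at the source DoAs
M_dir = G_pan @ W_sep             # L x Q: decomposition and rendering in one matrix
# per time-frequency tile: out = M_dir @ a_band, with a_band the ambisonic signals
```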
- the second variant 400, termed here the low-bandwidth version, results in a smaller number of channels to be used by the decoder than the number of input FOA/HOA channels. Hence this version is more suitable for efficient transmission and compression of the sound scene, without significantly compromising quality during rendering/upmixing.
- the sound scene is decomposed at the encoder stage by spatial filtering block 414 into a number of sound source signals 416, variable at each time step, and a monophonic ambient signal 418.
- the total number of source signals plus the ambient signal is smaller than or equal to half the number of FOA/HOA channels plus one.
- the decomposed channels are then stored, transmitted or otherwise provided 460 to the decoder along with the spatial parameters such as DOA 836 and power parameters 838, which may correspond to output 536, 538 of the analysis 404 or result from modification in adjustment 406 executed based on e.g. user input/preferences. Further input may include information 830 on the playback setup.
- the source signals are rendered 850 to speakers 410 of the playback setup using e.g. amplitude panning in the case of loudspeakers or e.g. head-related transfer functions for headphones.
- the ambient signal may be distributed to the loudspeakers or headphones through rendering 852 involving decorrelation filters, to be perceived as diffuse and surrounding by the listener.
- the source signals are re-encoded 854 to higher-order Ambisonics based on their analyzed directions-of-arrival and desired target order 842.
- spatial parameters concerning e.g. the associated powers may be utilized.
- Both variants 300, 400 enable and support modification 306, 406, 630 of the spatial parameters in the sound scene by the user, such as a) modification of the level difference between the source signals and the ambient signal, b) change of directions-of-arrival of the source signals towards other user-set ones, c) suppression of source signals coming from certain user-defined directions.
- the modification parameters 636 can be defined through an appropriate graphical user interface (GUI), for example, and be subsequently sent or otherwise provided to the decoder stage.
- In the high-bandwidth version 300 they contribute 306, 630 to the formulation of the separation and rendering matrices 632, 634, while in the low-bandwidth version 400 they contribute directly to the panning and diffuse rendering stage 408.
- a source number estimator 532 may utilize eigendecomposition of the spatial covariance matrix (SCM) of the ambisonic signals and perform analysis of its eigenvalues and eigenvectors. Based on the estimated number of sources, a source signal space and an ambient signal space may be formed from the eigenvectors. These subspaces may be then utilized to estimate the directions-of-arrival 534 of the source signals, using any subspace method, such as the MUSIC method, or the weighted subspace fitting method.
- the total sound scene power is preferably computed as the sum of powers of the ambisonic input signals. Using the directions-of-arrival, the power of each source component and subsequently the power of the ambient component may be estimated 535, the latter as the difference between the total and the sum of the source powers.
- Ambisonics are based on the formalism of a continuous spatially band-limited amplitude distribution describing the incident sound field, and capturing all contributions of the sound scene, such as multiple sources, reflections, and reverberation.
- the band-limitation refers to the spatial, or angular, variability of the sound field distribution, and a low-order representation approximates only coarsely sharp directional distributions and sounds incident with high spatial concentration.
- the spatio-temporal distribution describing the sound field at time t is expressed by a(t, γ), where γ is a unit vector at azimuth θ and elevation φ, respectively.
- the vector y(γ) contains the spherical harmonic (SH) functions Y_nm(γ) of order n and degree m.
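For concreteness, a real spherical-harmonic vector y(γ) can be sketched from SciPy's complex spherical harmonics. The ACN channel ordering and the orthonormal real-SH construction below are common conventions assumed for illustration, not prescribed by the patent.

```python
import numpy as np
from scipy.special import sph_harm

def real_sh_vector(order, azi, pol):
    """Real SHs Y_nm for n = 0..order, m = -n..n (ACN order), length (order+1)**2.

    azi: azimuth in radians; pol: polar (zenith) angle in radians.
    """
    y = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            c = sph_harm(abs(m), n, azi, pol)   # complex SH (Condon-Shortley phase)
            if m > 0:
                y.append(np.sqrt(2.0) * (-1) ** m * c.real)
            elif m < 0:
                y.append(np.sqrt(2.0) * (-1) ** m * c.imag)
            else:
                y.append(c.real)
    return np.asarray(y)
```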
- the HOA signals can be obtained either by synthesizing a sound scene, i.e. directly encoding sound signals at desired directions along with reverberation, or by capturing a real sound scene.
- Ambisonic recording obtains the ambisonic signals by sampling the sound field at some finite region around the origin with a microphone array.
- the encoding then aims at optimally achieving the SHT of Eq. 1 based on the microphone recordings. Physical constraints severely limit broadband recording of the higher-order terms, with the limitation dependent on the array arrangement, number of microphones and overall array size. It is assumed here that we have access directly to the ambisonic signals, after encoding, with the best possible performance.
- Recording the ambisonic signals is done with microphone arrays, most commonly spherical ones for practical and theoretical convenience. Recording begins by sampling the sound field pressure over a surface or volume around the origin, expressed through a number of M microphone signals x. These signals are transformed to spherical harmonic coefficients of order N ≤ M − 1, hence expressing the field in the array region, and then extrapolated to the plane wave density coefficients a, which are in theory independent of the array. Due to physical limitations, the frequency region over which this extrapolation is valid, and hence the acquisition of the ambisonic signals, depends on the size, geometric properties and diffraction characteristics of the array. The recording process can however be described in the compact form a(f) = E(f) x(f),
- where E(f) is the Q×M matrix of encoding filters, which is derived by a constrained inversion of either the theoretical or measured directional response of the array.
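A hedged sketch of one such constrained inversion: given the array's directional response H(f) sampled on a dense direction grid and the SH matrix of that grid, the encoder is a Tikhonov-regularized least-squares fit so that E(f) H(f) ≈ Y_grid. The symbol names and the regularization constant are assumptions for illustration.

```python
import numpy as np

def encoding_matrix(H, Y_grid, beta=1e-2):
    """H: (M, J) array response over J grid directions; Y_grid: (Q, J) SHs there.

    Returns the Q x M encoder E with E @ H ~ Y_grid (regularized inversion).
    """
    M = H.shape[0]
    return Y_grid @ H.conj().T @ np.linalg.inv(H @ H.conj().T + beta * np.eye(M))
```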
- Ambisonic decoding defines a linear mapping of the ambisonic signals a to L output channels of the reproduction system, defined through the ambisonic decoding matrix D of size L × Q. It is derived according to the spatial properties of the reproduction system and can be either frequency-independent (a matrix of gains), or frequency-dependent (a matrix of filters).
- a second case of interest is isotropic diffuse sound, coming with equal power from all directions, which is a useful simplification of late reverberant sound.
- This can be modeled as an amplitude distribution d(t, γ) with the angular correlation of Eq. 15 and a per-component power P_diff that is independent of direction due to the isotropy assumption.
- the ambisonic signals due to such diffuse sound are given by integrating the distribution against the SH vector over all directions, a_diff(t) = ∫ d(t, γ) y(γ) dγ.
- This correlation matrix forms the basis of the parametric analysis and synthesis. According to the assumed field model, the total field power is the sum of the source powers and the diffuse power.
- the analysis and synthesis is performed in a suitable time-frequency transform domain, such as the short-time Fourier transform, or a perceptually optimized filterbank. All quantities defined before are used in their time-frequency counterpart at time index l and frequency index k, while correlations now determine subband correlations and are frequency-dependent.
- the time-frequency processing improves estimation due to better separability and sparsity of the source and diffuse signals, and hence better agreement with the assumed model.
- Dominance of directional or diffuse components in the sound scene is reflected in the structure of the spatial statistics of the signals, as captured in the correlation matrix of Eq. 21 with K sources and diffuse sound. Detection of these conditions is based on the subspace principle of sensor array processing.
- the eigenvalue decomposition (EVD) of the correlation matrix of Eq. 21 has the form C = V Λ V^H,
- where λ_1 > … > λ_q > … > λ_Q > 0 are the sorted eigenvalues of the EVD
- and v_q are the respective eigenvectors.
- all the lowest eigenvalues, for K < q ≤ Q, should be equal and close to the diffuse power P_diff.
- All the eigenvalues with 1 ≤ q ≤ K are associated with the powers of both sources and the diffuse field, with λ_q > P_diff.
- the distribution of the eigenvalues reveals information about how many sources exist in the scene, and sources with significant direct-to-diffuse ratio (DDR) will be associated with eigenvalues significantly higher than the lower ones corresponding to the diffuse field. This information will be used in order to detect diffuse conditions and get an estimate of the number of significant sources in the sound scene.
- both detection and estimation use a frequency-averaged covariance matrix across multiple bins, in frequency ranges that are perceptually motivated and reflect human auditory frequency resolution, such as equivalent rectangular bandwidth (ERB) bands.
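A sketch of that frequency-averaged covariance, assuming an STFT front end and illustrative band edges (a true ERB partition would replace them):

```python
import numpy as np
from scipy.signal import stft

def banded_scm(sig, fs, band_edges_hz, nperseg=1024):
    """sig: (Q, T) ambisonic signals; returns one Q x Q SCM per band."""
    f, t, Z = stft(sig, fs=fs, nperseg=nperseg)        # Z has shape (Q, F, frames)
    scms = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        bins = (f >= lo) & (f < hi)                    # bins of this (ERB-like) band
        Zb = Z[:, bins, :].reshape(sig.shape[0], -1)   # stack band bins and frames
        scms.append((Zb @ Zb.conj().T) / Zb.shape[1])  # averaged covariance
    return scms                                        # sliding windows in practice
```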
- Estimation of the number of sources in the sound scene is based on analysis of the subspace decomposition.
- Various approaches from the array processing literature could be applied for this task. They can be based, for example, on analysis of dominant eigenvalues, eigenvalue ratios, eigenvalue statistics (such as the SORTE criterion), analysis of the eigenvectors, or information theoretic criteria.
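As a stand-in for such criteria (the SORTE statistic itself is not reproduced here), a simple eigenvalue-gap heuristic conveys the idea; the ratio threshold is an assumed tuning value.

```python
import numpy as np

def estimate_num_sources(C, max_k=None, ratio_thresh=2.0):
    w = np.linalg.eigvalsh(C)[::-1]              # eigenvalues, descending
    max_k = max_k or len(w) - 1
    gaps = w[:max_k] / (w[1:max_k + 1] + 1e-12)  # relative drop between neighbours
    k = int(np.argmax(gaps)) + 1                 # largest gap separates signal/noise
    return k if gaps[k - 1] > ratio_thresh else 0
```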
- DoA estimation can be performed by a variety of established methods from array signal processing. They vary widely on their complexity and performance, and they should be chosen according to the sound scene and the application requirements.
- DoA estimation can be done by narrowband DoA methods, which require scanning over a grid of directions and finding the associated maxima or minima. That can be done through analysis of power maps of beamformers, such as the MVDR, or by subspace methods, such as MUSIC. We present an example based on MUSIC.
- the source DoAs are found at the grid directions for which the minima of Eq. 35 occur.
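A minimal SH-domain MUSIC sketch, reusing real_sh_vector() from the earlier sketch and a frequency-averaged SCM C; the peaks of the pseudo-spectrum (equivalently, minima of the null spectrum of Eq. 35) indicate the DoAs.

```python
import numpy as np

def music_spectrum(C, n_src, order, grid):
    """grid: iterable of (azimuth, polar) pairs to scan; returns pseudo-spectrum."""
    w, V = np.linalg.eigh(C)                 # eigenvalues in ascending order
    En = V[:, : C.shape[0] - n_src]          # noise subspace (smallest eigenvalues)
    p = []
    for azi, pol in grid:
        y = real_sh_vector(order, azi, pol)  # from the earlier sketch
        p.append(1.0 / np.real(y @ En @ En.conj().T @ y))
    return np.asarray(p)                     # source DoAs at the peaks
```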
- the powers of the individual components may be estimated.
- the source powers can be computed by considering a beamformer with nulls to all estimated DoAs apart from the source of interest.
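A sketch of that idea, again reusing real_sh_vector(): least-squares beamformers place nulls on all other estimated DoAs, and the ambient power falls out as the remainder of the total. This is a labeled simplification of the patent's exact formulas.

```python
import numpy as np

def component_powers(C, doas, order):
    A = np.column_stack([real_sh_vector(order, a, p) for a, p in doas])  # Q x K
    W = A @ np.linalg.inv(A.T @ A)          # columns w_k satisfy w_k^T a_j = delta_kj
    p_src = np.einsum('qk,qr,rk->k', W, C, W).real    # beamformer output powers
    p_amb = max(np.trace(C).real - p_src.sum(), 0.0)  # ambient power as the difference
    return p_src, p_amb
```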
- the diffuse sound can be estimated in various ways, and the final choice should depend on the application scenario. If the application has no strict bandwidth or transmission requirements, for example in a standalone high-resolution spatial sound reproduction system, where all the spherical harmonic signals are available at the decoder, then it is advantageous to retain a diffuse signal in the spherical harmonic domain (SHD), with a directional distribution which can deviate from isotropic and which can be reproduced without the need of decorrelation, as will be detailed at the synthesis stage.
- the diffuse field power in this case is computed similarly to Eq. 20
- the directional sounds are distributed to the output channels with maximum directional concentration from their analyzed directions. It is suitable to consider such distribution functions as synthesis steering vectors, which may include panning laws, head-related-transfer functions, ambisonic panning functions, virtual steering vectors for transcoding into virtual recording arrays or other spatial formats, and others.
- g(γ) = [g_1(γ), …, g_L(γ)]^T.
- the design of the spatialization vectors depends on the target system. Three major cases of interest are:
- Loudspeaker rendering A common solution is vector-base amplitude panning (VBAP), which adapts to any loudspeaker setup and provides perceptually maximum directional concentration for any direction, and is hence a suitable choice for fully directional sounds.
- Alternatively, smooth panning functions can be used, such as ambisonic panning [citation], which have increased localization blur but provide a more even perceived source width, if such a characteristic is preferred over directional sharpness.
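A minimal 3-D VBAP sketch, under the assumption that a triangulation of the loudspeaker setup is already available; gain computation and the all-non-negative triplet test follow the standard VBAP formulation.

```python
import numpy as np

def vbap_gains(src_dir, spk_dirs, triplets):
    """src_dir: (3,) unit vector; spk_dirs: (L, 3) unit vectors; triplets: index triples."""
    for t in triplets:
        Lt = spk_dirs[list(t)]                 # 3 x 3, rows are the triplet's directions
        g = np.linalg.solve(Lt.T, src_dir)     # solve src_dir = Lt^T @ g
        if np.all(g >= -1e-9):                 # source lies inside this triangle
            gains = np.zeros(len(spk_dirs))
            gains[list(t)] = np.clip(g, 0.0, None) / np.linalg.norm(g)  # power-normalized
            return gains
    raise ValueError("direction not enclosed by any loudspeaker triplet")
```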
- Headphone rendering HRTF interpolation should be employed for arbitrary DoAs, or, in the case of a dense grid of HRTF measurements, quantization to the closest HRTF direction can be adequate.
- Ambisonic upmixing In the case of ambisonic upmixing, new synthetic ambisonic signals are generated from the lower-order signals that are analyzed. Let us assume that the target order is N′ > N; the re-encoding gains are then the target-order spherical harmonic vectors for the analyzed DoAs.
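Sketch of the re-encoding step, reusing real_sh_vector(): the separated source signals are encoded at the target order N′, and the ambient part is carried over zero-padded; signal shapes are illustrative.

```python
import numpy as np

def upmix(src_sigs, doas, amb_sigs, target_order):
    """src_sigs: (K, T); doas: list of (azi, pol); amb_sigs: (Q_low, T)."""
    Y = np.column_stack([real_sh_vector(target_order, a, p) for a, p in doas])
    out = Y @ src_sigs                       # (Q_hi, T) re-encoded directional part
    out[: amb_sigs.shape[0]] += amb_sigs     # ambient residual kept at original order
    return out
```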
- a total omnidirectional monophonic diffuse signal can be estimated and transmitted, by keeping only the first component of Eq. 33, in a low-bandwidth application scenario.
- This diffuse component can then be distributed to the output channels at the decoding stage through a network of decorrelators to achieve spatially diffuse reproduction.
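One plausible decorrelator bank, offered as an assumption-laden sketch: each output channel convolves the monophonic diffuse signal with a distinct flat-magnitude, random-phase FIR, which leaves spectra intact while destroying inter-channel correlation.

```python
import numpy as np
from scipy.signal import fftconvolve

def diffuse_render(amb, n_out, fir_len=1024, seed=0):
    """amb: (T,) mono diffuse signal; returns (n_out, T) mutually decorrelated outputs."""
    rng = np.random.default_rng(seed)
    outs = []
    for _ in range(n_out):
        phase = rng.uniform(0.0, 2.0 * np.pi, fir_len // 2 - 1)
        spec = np.concatenate(([1.0], np.exp(1j * phase), [1.0]))  # flat magnitude
        h = np.fft.irfft(spec, n=fir_len)                          # random-phase FIR
        outs.append(fftconvolve(amb, h)[: len(amb)])
    return np.stack(outs) / np.sqrt(n_out)    # rough overall power preservation
```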
- all the ambisonic signals model the ambient/diffuse residual, as shown in Eq. 33.
- Distribution of the diffuse signals of a diff can be performed in two stages, first a non-parametric ambisonic rendering, and second an optional parametric enhancement stage.
- the non-parametric decoding stage relies on well-designed ambisonic decoding matrices.
- the output channels should be uncorrelated between them, preferably with the original output signal powers to preserve the directional power distribution of the original sound field.
- the enhanced correlation matrix C_b,enh of these signals would be diagonal, with entries diag[C_b].
- the directional rendering matrix is updated by recursive smoothing over time: a_dir(k, l) = λ a_dir(k, l − 1) + (1 − λ) â_dir(k, l) (54), where â_dir(k, l) denotes the instantaneous directional rendering matrix.
- the final diffuse rendering matrix B from the ambisonic signals to the output signals is, similarly to the directional rendering matrix, given by smoothing Eq. 38 with the same time-constant as for the directional sounds
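The smoothing itself is a one-pole recursive average applied per band and frame; a two-line sketch, with the smoothing coefficient λ assumed application-dependent:

```python
def smooth_matrix(prev, current, lam=0.9):
    # recursive temporal averaging as in Eq. 54; same form for a_dir and B
    return lam * prev + (1.0 - lam) * current
```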
- Figure 9 illustrates, at 900, items that may be performed in various embodiments of a method according to the present invention.
- the executing arrangement may be configured with the necessary hardware and software (with reference to a number of computing devices such as user terminal devices and/or servers, for example) and provided in or at least connected to a target network in cases where such connectivity is preferred for e.g. transfer (transmission and/or receipt, from a standpoint of a single device) of sound scene related data.
- Item 901 encompasses some items the method may, depending on the embodiment, comprise.
- a number of feasible microphones such as an ambisonic microphone may be obtained and configured for capturing a sound scene.
- it may further be connected to, or comprise, recording equipment that stores the captured signals in a target format, which may in some embodiments already be e.g. a selected ambisonic encoding format or other preferred spherical harmonic representation of the scene.
- synthesis equipment with reference to e.g. a computer provided with necessary synthesis software may be utilized.
- item 920 refers to actual capturing (microphone) or creation (synthetic production) of the sound scene and item 922 refers to establishing the associated spherical harmonic representation.
- the representation is obtained, either as ready-made by an external entity such as a computer functionally connected to the arrangement or self-established as deliberated above.
- obtaining may herein refer to e.g. receiving, fetching, reading, capturing, synthetic production, etc.
- the representation is subjected to analysis.
- analysis of said spherical harmonic digital representation preferably contains determination of a number of related spatial parameters indicative of at least dominant sound sources in the sound scene, their directions-of-arrival (DOA) and associated powers.
- Time-frequency decomposition of said spherical harmonic digital representation, such as a selected time-frequency transform (e.g. a selected variant of the Fourier transform) or a filter bank, may be utilized to divide the representation into a plurality of frequency bands analysed.
- the bands may be selected so as to reflect characteristic(s) of human auditory system such as frequency resolution thereof.
- spherical harmonic digital representation data or source/diffuse signals decomposed therefrom may be spatially filtered 908, potentially in the aforementioned bands, and rendered 912 for audio playback and/or upmixing 914 optionally based on sound modification input that is translated into changes 910 in the spatial parameters (either through direct manipulation of parameter data or of the process via which the parameters are configured to affect audio rendering/upmixing) as discussed hereinbefore.
- modification of spatial parameters may be executed prior to or in connection with spatial filtering, preferably upon creation of separation/mixing matrices, whereas in the embodiment of Fig. 4 such modifications typically take place after spatial filtering, whereupon the embodiment-dependent execution order of items 908 and 910 has been highlighted in the figure by a curved bidirectional arrow between them.
- the dotted two horizontal lines are indicative of two options of a potential division between analysis and decoding/rendering side activities, further illustrating the fact that in some embodiments spatial filtering 908 may be executed at the decoding/rendering phase (Fig. 3) while in some other embodiments it may be already executed in connection with the analysis (Fig. 4). Some embodiments of the present invention may correspondingly concentrate on analysis side activities only, some others on decoding/rendering, while there may also be "complete system" type embodiments, executing both analysis and decoding/rendering side tasks at least selectively.
- the execution is ended at 916.
Abstract
Arrangement (204a, 204b, 204c, 220) for cultivating a spherical harmonic digital representation of a sound scene (102, 103, 229, 230, 301, 401), being configured to obtain the spherical harmonic digital representation (301, 401) of the sound scene, determine through analysis (304, 404, 530, 532, 534, 535) of said spherical harmonic digital representation a number of related spatial parameters (536, 538) indicative of at least dominant sound sources in the sound scene, their directions-of-arrival (DOA) and associated powers, wherein time-frequency decomposition of said spherical harmonic digital representation is preferably utilized to divide the representation into a plurality of frequency bands analyzed (302, 402), and provide (360) said spherical harmonic digital representation, preferably as divided into said plurality of frequency bands, and said number of spatial parameters to spatial filtering (308, 414) in order to produce an output signal for audio rendering (231, 232, 310, 410) or upmixing (312, 412) the representation to higher order. A corresponding method is presented, as well as related arrangements and methods for audio playback or upmixing.
Description
METHOD AND ARRANGEMENT FOR PARAMETRIC ANALYSIS AND PROCESSING OF AMBISONICALLY ENCODED SPATIAL SOUND SCENES
TECHNICAL FIELD OF THE INVENTION
The present invention generally pertains to audio processing. In particular, however not exclusively, the invention relates to spatial audio processing and enhancing ambisonically encoded sound scenes through analysis of related data.
BACKGROUND
Spatial sound recording, processing and reproduction are generally moving away from playback setup-based channel formats, e.g. stereo or 5.1 surround, to systems that are flexible and able to appropriately render spatial sound scenes to arbitrary playback systems. Such systems can be intended exclusively for synthetically produced sound scenes, where parametric spatial information, such as position and orientation, is attached to all sounds in the scene, and each individual sound is transmitted to the client for rendering and playback. This approach is termed object-based.
Alternatively, all the sound objects can be encoded, using appropriate mixing tools, into a number of audio signals that describe the whole sound scene, an approach sometimes termed scene-based, or spatial-transform-based.
An advantage of scene-based encoding and reproduction over object-based is the reduction of bandwidth requirements to a fixed number of channels, instead of a potentially large number of object channels, and the important possibility to represent recorded, real sound scenes, captured with appropriate spatial audio recording devices employing microphone arrays.
The most popular scene-based approach is the one researched and popularized under the name Ambisonics, which uses spherical harmonics as spatial basis functions to represent a sound scene. Ambisonics basically encode the spatial properties of the sound scene as level differences between audio channels, without additional metadata. As alluded to hereinbefore, a
sound scene may refer either to a synthesized spatial sound scene 102 created e.g. in a studio or workstation with appropriate software/hardware and ambisonic encoding 104, or to a real sound scene captured with an ambisonic microphone 103 and encoded 105 as shown at 100 in Fig. 1.
In practice, Ambisonics define a hierarchical spatial audio format, wherein increasing orders define an increasing spatial resolution, with a corresponding increasing number of audio channels describing the scene. Due to technological limitations, Ambisonics were limited to the first-order format in the past, described by four audio channels, termed here first-order Ambisonics (FOA). Nowadays, when higher-order formats are employed, termed here higher-order Ambisonics (HOA), generally a sharper, more accurate impression can be delivered to a listener thanks to the increased spatial resolution at the expense of an increased bandwidth due to an increased number of channels.
Typically, Ambisonics, or ambisonically encoded signals, are decoded to the playback system for listening through appropriate mixing of the channels, which depends only on the target playback system and the available order of Ambisonics, FOA or HOA. The process is called ambisonic decoding. Furthermore, in an additional mixing stage before decoding, useful transformations such as rotations can be applied to the sound scene. Rotations are useful e.g. in headphone playback since they can stabilize the perceived playback sound scene of the listener when combined with head-tracking headphones, or an external head-tracking system.
Even though ambisonic decoding is flexible with respect to different playback systems, its perceived quality is limited by the available order of channels. For practical orders of one (four channels), two (nine channels), or three (sixteen channels), the playback can suffer from significant directional blurring of the sounds, making their localization by the listener unstable. These effects are reduced using third-order HOA, but are still present, while they are more severe for the first and second order.
Since ambisonic decoding is signal-independent, it does not consider the content of the sound scene at all. To improve upon the limitations of Ambisonics, some signal-dependent methods have been developed that perform a time-frequency analysis of the ambisonic signals and aim to extract parameters that are then used to sharpen reproduction and improve overall perceptual quality. One such solution is Directional Audio Coding (DirAC), which aims to improve FOA signals by extracting one direction-of-arrival (DoA) and one diffuseness parameter, essentially breaking the sound scene into a directional sound stream and a non-directional/diffuse sound stream. Another such method is HARPEX, which works only with FOA signals and assumes two directional sounds active at each time-frequency point, hence extracting two directional streams. These methods rely solely on the low-resolution FOA model and have turned out to be sensitive to very complex sound scenes with multiple sources, resulting in reproduction that can be rather far removed from the original.
To alleviate such problems, some further developments have taken place with reference to e.g. an extension of DirAC to HOA, involving segmenting the sound scene into spatially separated sectors and estimating the DirAC parameters of DoA and diffuseness for each one of them. Another example concerns enhancement of HOA signals based on an assumption that a number of directional sounds exist and that they are spatially sparse; then using iterative numerical optimization methods, very sharply localized directional streams can be extracted from the HOA signals.
Notwithstanding their various benefits, current solutions based on Ambisonics still suffer from various drawbacks and challenges.
For example, due to the parametric information and estimation of source and ambience signals, the resolution at reproduction is in theory limited only by the capabilities of the playback system. Low-order ambisonic decoding fails, however, to utilize the full resolution of arbitrary playback systems due to the inability of the decoding stage to duly approximate the appropriate spatialization functions of the playback system, such as panning functions for loudspeakers or head-related transfer functions (HRTFs) for headphones.
Most recorded ambisonic sound scenes exist in a lower-order format only, mainly FOA, due to the lack of HOA microphones in the market. In contrast, synthetic sound scenes may be more conveniently produced directly in a higher-order format, e.g. third or fourth, to preserve the directional sharpness of the material.
Further, low-order Ambisonics struggle to accommodate loudspeaker setups with uneven speaker placement between front and back, or up and down.
That includes e.g. most surround sound setups (such as 5.1 and 7.1) and large cinema setups that have been proposed recently (e.g. 22.2 surround sound). In such cases, preserving e.g. the loudness of a sound at different directions has proven problematic.
Still, ambisonic modifications to a sound scene are, despite their usefulness on some occasions, limited to global spatial modifications, such as the aforesaid rotations of the scene, directional blurring, and warping, meaning that sounds from certain directions are pushed closer together, while others are stretched further apart. These can be thought of as modifications of a 360° picture, in which the image can be rotated, blurred or distorted across certain directions.
Still further, higher-order Ambisonics, even though being capable of efficiently encoding the whole sound scene into a few audio channels, can prove too demanding in terms of bandwidth and transmission of the established channels.
The aforementioned more recent developments around Ambisonics, each trying to tackle e.g. some of the above drawbacks with mutually rather different solutions, also have their issues, such as computationally very demanding and time-consuming processing, or limited general applicability, e.g. to FOA signals only.
SUMMARY OF THE INVENTION
In the light of foregoing, an objective of the present invention is to at least alleviate one or more of the above problems and challenges associated with prior art solutions in the context of audio encoding or decoding involving spherical harmonic digital representation of sound scenes, or specifically, as termed above, Ambisonics.
The objective is generally met by embodiments of an arrangement and method in accordance with the present invention.
Accordingly, in one embodiment an electronic arrangement for cultivating a spherical harmonic digital representation of a sound scene, comprises at least one data interface for transferring data, at least one processing unit for
processing instructions and other data, and memory for storing the instructions and other data, said at least one processing unit being configured, in accordance with the stored instructions, to cause: obtaining the spherical harmonic digital representation of the sound scene, preferably comprising an ambisonically encoded digital representation, determining through analysis of said spherical harmonic digital representation a number of related spatial parameters indicative of at least dominant sound sources in the sound scene, their directions-of-arrival (DOA) and associated powers, wherein time-frequency decomposition of said spherical harmonic digital representation is preferably utilized to divide the representation into a plurality of frequency bands analysed, said bands optionally reflecting human auditory frequency resolution, and providing said spherical harmonic digital representation, preferably as divided into said plurality of frequency bands, and said number of spatial parameters to spatial filtering in order to produce an output signal for audio rendering and/or upmixing the representation to higher order.
In another embodiment, an electronic arrangement for processing a spherical harmonic digital representation of a sound scene, comprises at least one data interface for transferring data, at least one processing unit for processing instructions and other data, and memory for storing the instructions and other data, said at least one processing unit being configured, in accordance with the stored instructions, to cause: obtaining the spherical harmonic digital representation of the sound scene, preferably being divided into a plurality of frequency bands, and a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting the spherical harmonic digital representation to spatial filtering and audio rendering or spatial filtering and upmixing, wherein corresponding matrices for decomposition of the spherical harmonic digital representation and rendering to audio signals associated with respective playback channels or upmixing to higher order representation, for both the dominant sound sources and the ambient component, are determined based on the spatial parameters; and
respectively providing the resulting, rendered signals forward for audio playback via a number of transducers associated with the playback channels, optionally speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene for storage, transmission, or further processing, optionally decoding.
In a further embodiment, an electronic arrangement for processing a low-bandwidth indication of a spherical harmonic digital representation of a sound scene, comprises at least one data interface for transferring data, at least one processing unit for processing instructions and other data, and memory for storing the instructions and other data, said at least one processing unit being configured, in accordance with the stored instructions, to cause: obtaining a number of dominant sound source signals and a monophonic ambient signal resulting from decomposing the spherical harmonic digital representation preferably divided into a plurality of frequency bands, and further receiving a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting said dominant sound source signals and said ambient signal to audio rendering, utilizing said spatial parameters and involving distribution of the dominant sound source signals and said ambient signal among a number of playback channels, or to upmixing involving re-encoding the signals to a higher order spherical harmonic representation; and respectively providing the resulting, rendered signals forward for audio playback via a number of transducers associated with the playback channels, preferably speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene, for storage, transmission or further processing, optionally decoding.
In a further embodiment, a method for cultivating a spherical harmonic digital representation of a sound scene to be performed by an electronic arrangement comprises: obtaining the spherical harmonic digital representation of the sound scene; determining through analysis of said spherical harmonic digital representation a number of related spatial parameters indicative of at least dominant sound
sources in the scene, their directions-of-arrival (DOA) and associated powers, wherein time-frequency decomposition of said spherical harmonic digital representation is utilized to divide the representation into a plurality of frequency bands analyzed, said bands optionally reflecting human auditory frequency resolution; and providing said spherical harmonic digital representation, preferably as divided into said plurality of frequency bands, and said number of spatial parameters to spatial filtering in order to produce an output signal for audio rendering or upmixing the representation to higher order.
Still in a further embodiment, a method for processing a spherical harmonic digital representation of a sound scene, to be performed by an electronic arrangement, comprises: obtaining the spherical harmonic digital representation of the sound scene, preferably being divided into a plurality of frequency bands, and a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting the spherical harmonic digital representation to combined spatial filtering and audio rendering or combined spatial filtering and upmixing, wherein corresponding matrices for decomposition of the spherical harmonic digital representation and rendering to audio signals associated with respective playback channels or upmixing to higher order representation, for both the dominant sound sources and the ambient component, are determined based on the spatial parameters; and respectively providing the resulting rendered signals forward for audio playback via a number of transducers associated with the playback channels, optionally speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene for storage, transmission, or further processing, optionally decoding.
Yet in a further embodiment, a method for processing a low-bandwidth indication of a spherical harmonic digital representation of a sound scene to be performed by an electronic arrangement, comprises: obtaining a number of dominant sound source signals and a preferably monophonic ambient signal resulting from decomposing the spherical
harmonic digital representation preferably divided into a plurality of frequency bands, and further receiving a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting said dominant sound source signals and said ambient signal to audio rendering, utilizing said spatial parameters and involving distribution of the dominant sound source signals and said ambient signal among a number of playback channels, or to upmixing involving re-encoding the signals to a higher order spherical harmonic representation; and respectively providing the resulting, rendered signals forward for audio playback via a number of transducers associated with the playback channels, preferably speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene, for storage, transmission or further processing, optionally decoding.
Various supplementary or alternative other embodiments of the present invention are disclosed in the following detailed description as well as dependent claims.
The utility of the present invention arises from multiple factors depending on each particular embodiment thereof.
First of all, various embodiments of the present invention provide enhanced playback and flexible manipulation of HOA signals using parametric information. For the task, it uniquely applies an acoustic model of the sound scene that considers multiple directional sounds and a diffuse ambient signal. Indeed, the solution focuses on spatially transformed HOA signals and aims at playback of the whole sound scene, without rejecting ambience and reverberation. The underlying model utilizes acoustical array techniques of DoA estimation and beamforming. Furthermore, the HOA signals that serve as input to the method may originate either from a microphone array recording or from a software source such as mixing software.
Yet, the analysis and synthesis can be performed in a suitable time-frequency transform domain, such as the short-time Fourier transform, or a perceptually optimized filter bank. The analysis/synthesis may proceed in time frames analyzed and processed in a number of frequency bands. The applied time-frequency processing improves estimation accuracy due to e.g. improved
separability and sparsity of the source and diffuse signals, being hence in better agreement with the assumed model.
In more detail, various embodiments of the present invention yield increased spatial resolution at reproduction for e.g. loudspeakers or headphones; instead of direct transformation of the ambisonic signals for spatial modifications and decoding, the solution suggested herein analyzes the statistics of the ambisonic signals and extracts spatial parameters describing the dominant sources in the scene, such as their directions-of-arrival (DoA) and their powers. It then estimates their signals along with a residual that is not modeled by the dominant source signals and hence represents the remaining sounds in the scene, corresponding to reverberation and ambience. As the solution knows the DoAs of the estimated sources, it can use them to spatialize the sources on the playback system with the highest spatial resolution that it can offer. Furthermore, the ambience component may be enhanced to achieve maximally diffuse properties by processing it separately through a network of decorrelators, something which cannot be done with e.g. direct ambisonic decoding. This kind of diffuse processing essentially restores the spaciousness of the ambient and reverberant sound, which is otherwise degraded by the high correlation between the playback signals in direct ambisonic decoding.
In addition, the suggested solution is flexible in terms of playback setup. For example, panning functions that better suit the target playback system, such as amplitude or vector-base amplitude panning (VBAP) functions, can be utilized.
Further, various embodiments of the present invention enable making more advanced and flexible spatial modifications of the captured or constructed sound scene. The embodiments can be harnessed into introducing a wide variety of such novel, meaningful modifications based on the analyzed, obtained parameters. Since the parameters are rather intuitive (DoAs of sources, levels of sources and ambience), they can be conveniently utilized by a sound effect developer to design interesting effects for manipulation of the sound scene. Examples of applicable modifications include selective attenuation of certain detected sources in the scene, control of the level of the ambient component, spatial re-mapping of certain sources in the scene, and visualization of the source parameters. These can be especially useful when editing the spatial sound scene for combination and alignment with immersive
video content, for example.
Also upmixing from lower- to higher-order ambisonics is made possible. With associated embodiments of the present solution, lower-order ambisonic recordings and mixes, e.g. first or second order, can be upmixed (or upscaled) to an arbitrary higher-order format. This operation brings the higher spatial resolution of the parametric method to regular ambisonic decoders and, by upmixing lower-order material to a reference/common higher order, avoids the use of multiple ambisonic decoders for playing back content at different orders together.
Even further, with reference to compression of higher-order ambisonic signals, the solution presented herein can indeed be utilized to achieve compression of the HOA sound scene, since for most frames during the parametric analysis, a much smaller number of sources than the number of channels is detected. Coupled with a single-channel diffuse signal, and parametric information, the requirements for bandwidth and transmission are lowered considerably without significant loss of perceived quality.
The exemplary embodiments presented in this text are not to be interpreted to pose limitations to the applicability of the appended claims. The verb "to comprise" is used in this text as an open limitation that does not exclude the existence of unrecited features. The features recited in dependent claims are mutually freely combinable unless otherwise explicitly stated.
The novel features which are considered as characteristic of the invention are set forth in particular in the appended claims. The invention itself, however, both as to its construction and its method of operation, together with additional objects and advantages thereof, may be best understood from the following description of specific, still merely exemplary, embodiments when read in connection with the accompanying drawings.
Finally, numerous considerations provided herein concerning different embodiments of the arrangement may be flexibly applied to the embodiments of the method, mutatis mutandis and vice versa, as being appreciated by a skilled person.
The expression "a number of" refers herein to any positive integer starting from one (1 ), e.g. one, two, or three.
The expression "a plurality of" refers herein to any positive integer starting from two (2), e.g. two, three, four.
BRIEF DESCRIPTION OF THE DRAWINGS
Next the invention will be described in greater detail with reference to exemplary embodiments in accordance with the accompanying drawings, in which:
Figure 1 illustrates two common use scenarios involving Ambisonics;
Figure 2 illustrates a potential context and use scenario of various embodiments of the present invention, related entities and electronic arrangements;
Figure 3 illustrates an embodiment of the present invention for cultivating spherical harmonic representations of sound scenes;
Figure 4 illustrates one other embodiment of the present invention for cultivating spherical harmonic representations of sound scenes;
Figure 5 illustrates an embodiment of an audio analyser applicable e.g. in connection with the embodiments of Figs. 3 and 4;
Figure 6 illustrates an embodiment of establishing audio modification/separation matrices for use e.g. with the embodiment of Fig. 3 and based on spatial parameters provided by an analyser such as the analyser of Fig. 5;
Figure 7 illustrates embodiments of spatial filtering and rendering as well as upmixing, applicable e.g. in connection with the embodiment of Fig. 3;
Figure 8 illustrates embodiments of source panning and diffuse rendering as well as upmixing, applicable e.g. in connection with the embodiment of Fig. 4.
Figure 9 is a high-level flow diagram depicting the internals of various embodiments of a method in accordance with the invention.
DETAILED DESCRIPTION
With reference to the scenario 200 of Figure 2, a merely exemplary context and related potential use scenario is illustrated, in connection with which various embodiments of the present invention could be carried out.
User devices (UE) 204a, 204b, 204c of users (e.g. private users or collections of users, such as companies) 202a, 202b, 202c, respectively, may refer to wired or wireless terminals, e.g. desktop or laptop computers, smartphones, or tablets, which may be functionally connected to a communications network 210 by a suitable wired or wireless transceiver, for example, to access the network 210 and remote entities operatively reachable therethrough. The network 210 may include or be at least operatively connected to various private and/or public networks, e.g. the internet. One of such user devices 204a, 204b, 204c and/or a number of further devices, such as a server 220, also connected to the network, may be configured to host at least a portion of one or more embodiments of an electronic arrangement described herein and/or to execute the related method(s) suggested. Yet, there may be further device(s) 218 included in the environment that do not contribute to encoding, cultivating or decoding the spherical harmonic digital representations of sound scenes but participate in storing or transferring them, for example.
The arrangement, thus realized by means of one or more at least operatively connected devices such as electronic devices, typically includes at least one processing unit 222, such as a microprocessor or a digital signal processor, for processing instructions and other data, a memory 228, such as one or more memory chips optionally integrated with the processing unit 222, for storing instructions and other data, and a (data) communication interface 224, such as a transceiver, transmitter and/or receiver, for transferring data e.g. via the network 210. The interface 224 may comprise wired and/or wireless means of communication.
In some embodiments, the arrangement may comprise or be at least operatively connected to a microphone such as an ambisonic microphone 230 for capturing a sound scene and sounds 229 for ambisonic encoding instead of or in addition to encoding a locally or externally synthetically produced sound scene.
In some embodiments, the arrangement may comprise or be at least operatively connected to a speaker 231 such as one or more loudspeakers, a related playback system, or e.g. headphones for audio output 232 of a reproduced sound scene.
The arrangement may thus be configured to locally obtain or receive an externally created spherical harmonic digital representation of a sound scene, which may be captured via microphone(s) and/or be of synthetic origin (e.g., created directly digitally with a computer).
For instance, as mentioned hereinbefore in the shown scenario of Fig. 2, one or more embodiments of the arrangement in accordance with the present invention may be implemented and/or the related method(s) executed by any one or more of the user devices 204a, 204b, 204c and/or other computing devices such as the server 220. Intermediate device(s) 218 may be utilized for storing and/or transferring related data, for instance. One or more of the devices 204a, 204b, 204c, 220 may be configured to analyze the digital representation of a sound scene, whereas the same or different devices 204a, 204b, 204c, 220 may take care of the subsequent decoding, rendering and/or upmixing activities.
In various embodiments of the present invention, the suggested solution may take as input an ambisonic stream, and produce signals for a) loudspeakers with arbitrary setups (stereo, 5.1, 7.1, hexagons, octagons, cubic, 13.1, 22.2, arbitrary ones), b) headphones employing head-related transfer functions for effective 3D sound rendering personalized to the user, and with head-tracking support, and/or c) ambisonic signals of a higher order than the original (upmixing). The method can work with ambisonic signals of basically any resolution, such as the common first-order ambisonics (FOA) of 4 channels, or higher-order ambisonics (HOA) of 9, 16, 25, or 36 channels, for example. Since the method performs sharp directional analysis and synthesis which adapts to the sound scene content, there is usually no benefit from very high ambisonic orders with a large number of channels, and 16 channels of third-order ambisonics have commonly been found adequate for all sound scenes and transparent rendering.
With reference to Figs. 3-8, two, still merely exemplary, variants 300, 400 of the solution generally suggested herein are discussed in detail hereinafter. Generally, various embodiments of the present solution may rely e.g. on a time-frequency decomposition 302, 402 of the FOA/HOA signals 301, 401
using an appropriate time-frequency transform or a filter bank. The created frequency channel(s) 303, 403 may then be handled separately. Both variants 300, 400 may utilize the same or at least similar analysis stage 304, 404, which at each time step extracts spatial parameters for the sound scene, incorporating e.g. estimation of the number of dominant sources, their respective directions-of-arrival (DoA), and preferably also their powers. The two variants 300, 400 differ e.g. in where the signal decomposition happens and in the resulting bandwidth requirements.
The first variant 300, termed here high-bandwidth version, may be configured to decompose the sound scene and render or upmix the associated components in one stage 308. This version preferably utilizes all the FOA/HOA channels that are to be transmitted to the decoder, along with the spatial parameters, and is considered advantageous when e.g. maximum quality is desired, or when encoding and decoding may naturally occur in one stage (e.g. one application performing both in the same machine). This version preserves the original sound scene impression as much as possible.
Further with reference to Figs. 6 and 7, the decoder obtains/receives 360 the FOA/HOA channels 303 as well as the spatial parameters 536, 538 used to form mixing matrices 632, 634 to enable decomposition of the sound scene (spatial filtering 732) and rendering 750, 752 to loudspeakers/headphones 310 or upmixing 754, 756. The derived matrices 632, 634 may be further adapted based on the information 730 about the loudspeaker setup, headphone characteristics (such as headphone calibration filters) and e.g. the user's headphone spatialization filters (e.g. head-related transfer functions), if available. Thus the matrices may combine the decomposition of the sound scene and the rendering to speakers/upmixing in one stage, for example. Accordingly, at least one matrix 632 may be determined to decompose and render the source components in the sound scene, and at least one other matrix 634 to decompose and render the ambient component. Furthermore, if upmixing 312 to higher-order Ambisonics is desired, the matrices may be configured to take into account the desired target order 742.
The second variant 400, termed here the low-bandwidth version, results in a smaller number of channels to be used by the decoder than the number of input FOA/HOA channels. Hence this version is more suitable for efficient transmission and compression of the sound scene, without significantly compromising quality during rendering/upmixing. In the low-bandwidth
version, the sound scene is decomposed at the encoder stage by a spatial filtering block 414 into a number of sound source signals 416, variable at each time step, and a monophonic ambient signal 418. The total number of source signals plus the ambient signal is smaller than or equal to half the number of FOA/HOA channels plus one. The decomposed channels are then stored, transmitted or otherwise provided 460 to the decoder along with the spatial parameters, such as DOA 836 and power parameters 838, which may correspond to the output 536, 538 of the analysis 404 or result from modification in adjustment 406 executed based on e.g. user input/preferences. Further input may include information 830 on the playback setup. At the decoder stage 408 the source signals are rendered 850 to speakers 410 of the playback setup using e.g. amplitude panning in the case of loudspeakers or head-related transfer functions for headphones. The ambient signal may be distributed to the loudspeakers or headphones through rendering 852 involving decorrelation filters, to be perceived as diffuse and surrounding by the listener. For obtaining an upmixed ambisonic stream 412, the source signals are re-encoded 854 to higher-order Ambisonics based on their analyzed directions-of-arrival and the desired target order 842. For diffuse signal upmixing 856, spatial parameter(s) on e.g. the associated power may be utilized.
Both variants 300, 400 enable and support modification 306, 406, 630 of the spatial parameters in the sound scene by the user, such as a) modification of the level difference between the source signals and the ambient signal, b) change of directions-of-arrival of the source signals towards other user-set ones, c) suppression of source signals coming from certain user-defined directions. The modification parameters 636 can be defined through an appropriate graphical user interface (GUI), for example, and be subsequently sent or otherwise provided to the decoder stage. In the high-bandwidth version 300, they contribute 306, 630 to the formulation of the separation and rendering matrices 632, 634, while in the low-bandwidth version 400 they contribute directly to the panning and diffuse rendering stage 408.
Reverting to the analysis phases 304, 404 yielding spatial parameters 536, 538 and shown in more detail in Fig. 5, e.g. for each time step and frequency channel of the FOA/HOA signals, related selected second-order statistics may be computed 530, such as the coherence between each pair of channels and the power of each channel, building the spatial covariance matrix (SCM) of the input signals.
A source number estimator 532 may utilize eigendecomposition of the SCM and perform analysis of its eigenvalues and eigenvectors. Based on the estimated number of sources, a source signal space and an ambient signal space may be formed from the eigenvectors. These subspaces may be then utilized to estimate the directions-of-arrival 534 of the source signals, using any subspace method, such as the MUSIC method, or the weighted subspace fitting method.
The total sound scene power is preferably computed from the sum of powers of the ambisonic input signals. Using the directions-of-arrival, the power of each source component, and subsequently the power of the ambient component, may be estimated 535, the latter as the difference between the total power and the sum of the source powers.
Switching over to a more detailed review of various signal processing and data manipulation tasks potentially executed in connection with different embodiments of the present invention, and first having regard to the general motivation and related theoretical springboard underlying Ambisonics: Ambisonics are based on the formalism of a continuous, spatially band-limited amplitude distribution describing the incident sound field and capturing all contributions of the sound scene, such as multiple sources, reflections, and reverberation. The band-limitation refers to the spatial, or angular, variability of the sound field distribution; a low-order representation approximates only coarsely sharp directional distributions and sounds incident with high spatial concentration.
It may be assumed that the spatio-temporal amplitude distribution describing the sound field at time $t$ is expressed by $a(t, \boldsymbol{\gamma})$, where $\boldsymbol{\gamma}$ is a unit vector at azimuth $\phi$ and elevation $\theta$. The vector of sound field coefficients $\mathbf{a}(t)$ is given by the spherical harmonic transform (SHT) of the sound field distribution

$$\mathbf{a}(t) = \mathrm{SHT}\{a(t, \boldsymbol{\gamma})\} = \int a(t, \boldsymbol{\gamma})\, \mathbf{y}(\boldsymbol{\gamma})\, \mathrm{d}\gamma, \qquad (1)$$

where integration occurs over the unit sphere with the differential surface element $\mathrm{d}\gamma = \cos\theta\, \mathrm{d}\theta\, \mathrm{d}\phi$. The vector $\mathbf{y}(\boldsymbol{\gamma})$ contains the spherical harmonic (SH) functions $Y_{nm}(\boldsymbol{\gamma})$ of order $n$ and degree $m$. For a representation band-limited to order $N$ there are $Q = (N+1)^2$ transformed signals and SHs in the vectors above. Ordering of the components in the vectors $\mathbf{y}$ and $\mathbf{a}$ is as

$$[\mathbf{y}]_q = Y_{nm}, \quad [\mathbf{a}]_q = a_{nm}, \quad \text{with } q = n^2 + n + m + 1 \text{ and } q = 1, 2, \ldots, Q. \qquad (2)$$
This channel indexing scheme is known in the Ambisonics literature as Ambisonic Channel Numbering (ACN), and the signals of sound field coefficients $\mathbf{a}$ as ambisonic signals. Following further ambisonic conventions, the real form of SHs is used in this work, defined as

$$Y_{nm}(\boldsymbol{\gamma}) = \sqrt{(2 - \delta_{m0})(2n+1)\frac{(n-|m|)!}{(n+|m|)!}}\; P_n^{|m|}(\sin\theta)\, y_m(\phi), \qquad (3)$$

with $y_m(\phi) = \cos(m\phi)$ for $m \ge 0$ and $y_m(\phi) = \sin(|m|\phi)$ for $m < 0$, and with the associated Legendre functions $P_n^m$ defined in terms of the regular Legendre polynomials $P_n$ as

$$P_n^m(x) = (1 - x^2)^{m/2}\, \frac{\mathrm{d}^m}{\mathrm{d}x^m} P_n(x). \qquad (4)$$

The SHs are orthonormal with

$$\int \mathbf{y}(\boldsymbol{\gamma})\, \mathbf{y}^{\mathrm{T}}(\boldsymbol{\gamma})\, \mathrm{d}\gamma = 4\pi \mathbf{I}, \qquad (5)$$

where $\mathbf{I}$ is the $Q \times Q$ identity matrix. Using this power normalization, the zeroth order ambisonic signal $[\mathbf{a}]_1 = a_{00}$ is equivalent to an omnidirectional signal at the origin. Furthermore, the norm of the SH vectors for any direction is

$$\|\mathbf{y}(\boldsymbol{\gamma})\|^2 = \mathbf{y}^{\mathrm{T}}(\boldsymbol{\gamma})\, \mathbf{y}(\boldsymbol{\gamma}) = Q \quad \forall\, \boldsymbol{\gamma}. \qquad (6)$$
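By way of illustration only, the real, ACN-ordered SH vector of Eqs. 2-6 can be evaluated numerically, for instance as in the following Python sketch (the helper name `real_sh_vector` and the use of numpy/scipy are illustrative assumptions, not part of the formulation above):

```python
import numpy as np
from scipy.special import lpmv, factorial

def real_sh_vector(order, azi, elev):
    """Real SHs of Eq. 3 in ACN order (Eq. 2), normalized so that
    Eq. 5 (4*pi orthonormality) and Eq. 6 (||y||^2 = Q) hold."""
    y = np.zeros((order + 1) ** 2)
    for n in range(order + 1):
        for m in range(-n, n + 1):
            am = abs(m)
            norm = np.sqrt((2 - (m == 0)) * (2 * n + 1)
                           * factorial(n - am) / factorial(n + am))
            # scipy's lpmv includes the Condon-Shortley phase (-1)^m,
            # which the ambisonic convention of Eq. 4 omits; undo it.
            leg = (-1.0) ** am * lpmv(am, n, np.sin(elev))
            trig = np.cos(m * azi) if m >= 0 else np.sin(am * azi)
            y[n * n + n + m] = norm * leg * trig  # 0-based ACN index of Eq. 2
    return y

y1 = real_sh_vector(3, 0.7, 0.2)
assert np.isclose(y1 @ y1, 16.0)   # Eq. 6 with Q = (3+1)^2 = 16
```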
The HOA signals can be obtained either by synthesizing a sound scene, encoding directly sound signals at desired directions along with reverberation, or by capturing a real sound scene. Ambisonic recording obtains the ambisonic signals by sampling the sound field at some finite region around the origin with a microphone array. The encoding then aims at achieving optimally the SHT of Eq. 1 based on the microphone recordings. Physical constraints limit severely broadband recording of the higher-order terms, with the limitation dependent on the array arrangement, number of microphones and overall
array size. It is assumed here that we have access directly to the ambisonic signals, after encoding, with the best possible performance.
Encoding directly a source signal $s(t)$ at direction $\boldsymbol{\gamma}_s$ can be done simply by multiplying it with the appropriate gains, as

$$\mathbf{a}(t) = s(t)\, \mathbf{y}(\boldsymbol{\gamma}_s). \qquad (7)$$

Multiple $K$ sources $\mathbf{s}(t) = [s_1(t), \ldots, s_K(t)]^{\mathrm{T}}$ can be encoded at the desired directions $\Gamma_{\mathrm{src}} = [\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_K]$ as

$$\mathbf{a}(t) = \mathbf{Y}_s\, \mathbf{s}(t), \qquad (8)$$

where $\mathbf{Y}_s = [\mathbf{y}(\boldsymbol{\gamma}_1), \ldots, \mathbf{y}(\boldsymbol{\gamma}_K)]$ is the $Q \times K$ matrix of all SHs for the set of directions in $\Gamma_{\mathrm{src}}$.
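Assuming the `real_sh_vector` helper sketched above, Eq. 8 amounts to a single matrix product; e.g. (source directions and signals are placeholders):

```python
import numpy as np

# Hypothetical scene: K = 3 sources encoded at order N = 3 (Eq. 8).
N = 3
doas = [(0.0, 0.0), (1.6, 0.3), (-2.0, -0.5)]  # (azimuth, elevation) in rad
Ys = np.column_stack([real_sh_vector(N, az, el) for az, el in doas])  # Q x K
s = np.random.randn(3, 48000)                  # placeholder source signals
a = Ys @ s                                     # ambisonic signals, Q x samples
```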
Recording the ambisonic signals is done with microphone arrays, most commonly spherical ones for practical and theoretical convenience. Recording begins by sampling the sound field pressure over a surface or volume around the origin, expressed through a number of $M$ microphone signals $\mathbf{x}$. These signals are transformed to spherical harmonic coefficients of order $N \le \sqrt{M} - 1$, hence expressing the field in the array region, and then extrapolated to the plane wave density coefficients $\mathbf{a}$, which are in theory independent of the array. Due to physical limitations, the frequency region over which this extrapolation is valid, and hence the acquisition of the ambisonic signals, depends on the size, geometric properties and diffraction characteristics of the array. The recording process however can be described in a compact form as

$$\mathbf{a}(f) = \mathbf{E}(f)\, \mathbf{x}(f), \qquad (9)$$

where $\mathbf{E}(f)$ is the $Q \times M$ matrix of encoding filters, which is derived by a constrained inversion of either the theoretical or measured directional response of the array.
After encoding or recording of the ambisonic signals, the ambisonic processing pipeline can be formulated as

$$\mathbf{b}(t) = \mathbf{D}\, \mathbf{T}\, \mathbf{a}(t), \qquad (10)$$

where $\mathbf{b}$ are the output signals for the target reproduction system, $\mathbf{D}$ is the ambisonic decoding matrix, and $\mathbf{T}$ is an optional spatial transformation matrix that can modify spatial properties of the sound scene in a desired way directly in the SHD (spherical harmonic decomposition). Examples of such transformations are rotations, directional warping of the sound distribution, or directional loudness modifications. Alternatively, such spatial transformations can be performed in the parametric analysis stage. It is assumed here that no transformation occurs in the sound scene, $\mathbf{T} = \mathbf{I}$.
Ambisonic decoding defines a linear mapping of the ambisonic signals $\mathbf{a}$ to $L$ output channels of the reproduction system, defined through the ambisonic decoding matrix $\mathbf{D}$ of size $L \times Q$. It is derived according to the spatial properties of the reproduction system and can be either frequency-independent (a matrix of gains) or frequency-dependent (a matrix of filters). The output signal vector $\mathbf{b} = [b_1, \ldots, b_L]^{\mathrm{T}}$ is given as

$$\mathbf{b}(t) = \mathbf{D}\, \mathbf{a}(t). \qquad (11)$$
In the case of $K$ multiple incident source signals, their second-order statistics are given by their correlation matrix

$$\mathbf{C}_s = \mathrm{E}[\mathbf{s}(t)\, \mathbf{s}^{\mathrm{H}}(t)], \qquad (12)$$

where $\mathrm{E}[\cdot]$ denotes mathematical expectation. Assuming that the source signals are mutually uncorrelated, the correlation matrix becomes diagonal with entries that correspond to the vector of source powers $\mathbf{p}_s = [P_1, \ldots, P_K]^{\mathrm{T}} = \mathrm{diag}[\mathbf{C}_s]$, with $P_k = \mathrm{E}[|s_k(t)|^2]$. The total sound field power due to all sources is $P_{\mathrm{dir}} = \sum_{k=1}^{K} P_k$, and is also equal to the power of the omnidirectional ambisonic signal, $P_{00} = P_{\mathrm{dir}}$, due to incoherent summation. The ambisonic signal correlation matrix due to such a sound scene is

$$\mathbf{C}_a = \mathrm{E}[\mathbf{a}(t)\, \mathbf{a}^{\mathrm{H}}(t)] = \mathbf{Y}_s \mathbf{C}_s \mathbf{Y}_s^{\mathrm{T}} = \sum_{k=1}^{K} P_k \mathbf{A}_k, \qquad (13)$$

where $\mathbf{A}_k = \mathbf{y}(\boldsymbol{\gamma}_k)\, \mathbf{y}^{\mathrm{T}}(\boldsymbol{\gamma}_k)$. Furthermore, the total power of the ambisonic signals is

$$P_a = \mathrm{E}[\mathbf{a}^{\mathrm{H}}(t)\, \mathbf{a}(t)] = \mathrm{E}[\|\mathbf{a}(t)\|^2] = Q \sum_{k=1}^{K} P_k = Q P_{\mathrm{dir}}. \qquad (14)$$
A second case of interest is isotropic diffuse sound, coming with equal power from all directions, which is a useful simplification of late reverberant sound.
This can be modeled as an amplitude distribution $d(t, \boldsymbol{\gamma})$ with angular correlation

$$\mathrm{E}[d(t, \boldsymbol{\gamma}_1)\, d^{*}(t, \boldsymbol{\gamma}_2)] = \frac{P_{\mathrm{diff}}}{4\pi}\, \delta(\boldsymbol{\gamma}_1 - \boldsymbol{\gamma}_2), \qquad (15)$$

with $P_{\mathrm{diff}}$ the power of the diffuse component, which is independent of direction due to the isotropy assumption. The ambisonic signals due to such diffuse sound are given by

$$\mathbf{z}(t) = \mathrm{SHT}\{d(t, \boldsymbol{\gamma})\}, \qquad (16)$$

and their correlation matrix as

$$\mathbf{C}_a = \mathrm{E}[\mathbf{z}(t)\, \mathbf{z}^{\mathrm{H}}(t)] = P_{\mathrm{diff}}\, \mathbf{I}. \qquad (17)$$

The last relation shows that in a perfectly diffuse field the ambisonic signals have zero correlation between them and all have power equal to the diffuse sound power. It also indicates the means to encode a signal as a diffuse component directly in the ambisonic signals: by decorrelating a reference signal $Q$ times for all ambisonic channels.
Considering the more generic case of a mixed sound field, with a number $K < Q$ of source signals and an additional diffuse component, which captures more faithfully realistic conditions of multiple sources in reverberation, we can express the ambisonic signals as

$$\mathbf{a}(t) = \mathbf{Y}_s\, \mathbf{s}(t) + \mathbf{z}(t), \qquad (18)$$

and their respective correlation matrix as

$$\mathbf{C}_a = \mathrm{E}[\mathbf{a}(t)\, \mathbf{a}^{\mathrm{T}}(t)] = \mathbf{Y}_s \mathbf{C}_s \mathbf{Y}_s^{\mathrm{T}} + P_{\mathrm{diff}}\, \mathbf{I} = \mathbf{C}_{\mathrm{dir}} + \mathbf{C}_{\mathrm{diff}}. \qquad (19)$$

This correlation matrix forms the basis of the parametric estimation for the parametric analysis and synthesis. According to the assumed field model, the total field power is

$$P_a = \mathrm{E}[\|\mathbf{a}(t)\|^2] = Q \left( \sum_{k=1}^{K} P_k + P_{\mathrm{diff}} \right). \qquad (20)$$

Note that even for a perfectly diffuse signal, ambisonic decoding with a decoding matrix $\mathbf{D}$ will introduce correlation between the output signals, determined by

$$\mathbf{C}_b = \mathrm{E}[\mathbf{b}(t)\, \mathbf{b}^{\mathrm{H}}(t)] = P_{\mathrm{diff}}\, \mathbf{D} \mathbf{D}^{\mathrm{T}}. \qquad (21)$$
Reverting next to directional decomposition of sound scenes in accordance with various embodiments of the present invention: the analysis and synthesis is performed in a suitable time-frequency transform domain, such as the short-time Fourier transform, or a perceptually optimized filterbank. All quantities defined before are used in their time-frequency counterparts at the time index $l$ and frequency index $k$, while correlations now denote subband correlations and are frequency-dependent. The time-frequency processing improves estimation due to better separability and sparsity of the source and diffuse signals, and hence better agreement with the assumed model.
Dominance of directional or diffuse components in the sound scene is reflected in the structure of the spatial statistics of the signals, as captured in the correlation matrix of Eq. 19 with $K$ sources and diffuse sound. Detection of these conditions is based on the subspace principle of sensor array processing. The eigenvalue decomposition (EVD) of the correlation matrix has the form

$$\mathbf{C}_a = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\mathrm{H}} = \sum_{q=1}^{Q} \lambda_q\, \mathbf{v}_q \mathbf{v}_q^{\mathrm{H}}, \qquad (22)$$

where $\lambda_1 \ge \ldots \ge \lambda_q \ge \ldots \ge \lambda_Q \ge 0$ are the sorted eigenvalues of the EVD, and $\mathbf{v}_q$ are the respective eigenvectors. Following the above assumptions, all the lowest eigenvalues with $K < q \le Q$ should be equal and close to the diffuse power $P_{\mathrm{diff}}$. All the eigenvalues with $1 \le q \le K$ are associated with the powers of both the sources and the diffuse field, with $\lambda_q > P_{\mathrm{diff}}$. The distribution of the eigenvalues reveals information about how many sources exist in the scene, and sources with significant direct-to-diffuse ratio (DDR) will be associated with eigenvalues significantly higher than the lower ones corresponding to the diffuse field. This information will be used in order to detect diffuse conditions and to get an estimate of the number of significant sources in the sound scene.
In practice, both detection and estimation use a frequency-averaged covariance matrix across multiple bins, in frequency ranges that are perceptually motivated and reflect human auditory frequency resolution, such as equivalent rectangular bandwidth (ERB) bands. The frequency averaging for the $j$th ERB band is thus performed as

$$\bar{\mathbf{C}}_a(j) = \frac{1}{k_j - k_{j-1}} \sum_{k = k_{j-1}+1}^{k_j} \mathbf{C}_a(k), \qquad (23)$$

where $k_j$ is the upper frequency index of the band, and $k_0 = 0$.
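In an implementation, Eqs. 22-23 might look like the following sketch (band edges and the dummy STFT data are placeholders):

```python
import numpy as np

def band_scms(A_tf, band_edges):
    """Frequency-averaged SCMs per Eq. 23.
    A_tf: (Q, n_bins) complex STFT coefficients of the ambisonic channels
    band_edges: ascending bin indices [k_0 = 0, k_1, ..., k_J]"""
    scms = []
    for k_prev, k_j in zip(band_edges[:-1], band_edges[1:]):
        X = A_tf[:, k_prev + 1 : k_j + 1]         # bins of band j
        scms.append(X @ X.conj().T / (k_j - k_prev))
    return scms

rng = np.random.default_rng(0)
A_tf = rng.standard_normal((16, 513)) + 1j * rng.standard_normal((16, 513))
C = band_scms(A_tf, [0, 64, 128, 256, 512])[1]
evals, evecs = np.linalg.eigh(C)              # EVD of Eq. 22 (ascending)
evals, evecs = evals[::-1], evecs[:, ::-1]    # re-sort descending as in text
```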
Estimation of the number of sources in the sound scene is based on analysis of the subspace decomposition. Various approaches from the array processing literature could be applied for this task. They can be based for example on analysis of dominant eigenvalues, eigenvalue ratios, eigenvalue statistics, analysis of the eigenvectors, or information theoretic criteria. Here we present a method based on eigenvalue statistics, termed SORTE, which has been shown to be robust and which omits manually adjusted thresholds. It is based on the differences of the sorted eigenvalues

$$\nabla \lambda_i = \lambda_i - \lambda_{i+1}, \quad \text{for } i = 1, \ldots, Q-1. \qquad (24)$$

The number of sources according to SORTE is given by

$$\hat{K} = \underset{k}{\arg\min}\, f(k), \quad \text{for } k = 1, \ldots, Q-3, \qquad (25)$$

with

$$f(k) = \frac{\sigma_{k+1}^2}{\sigma_k^2}, \quad \text{for } k = 1, \ldots, Q-2, \qquad (26)$$

where $\sigma_k^2$ denotes the variance of the eigenvalue differences $\{\nabla \lambda_i\}_{i=k}^{Q-1}$.
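A direct transcription of Eqs. 24-26 into Python could look as follows (the variance definition of $\sigma_k^2$ follows the reading given above; the example eigenvalues are fabricated for illustration):

```python
import numpy as np

def sorte(evals):
    """Source-number estimate from eigenvalues sorted in descending
    order, per Eqs. 24-26."""
    d = evals[:-1] - evals[1:]              # eigenvalue differences, Eq. 24
    Q = len(evals)
    var = lambda k: np.var(d[k - 1:])       # sigma_k^2 over i = k..Q-1
    f = [var(k + 1) / var(k) if var(k) > 0 else np.inf
         for k in range(1, Q - 2)]          # f(k) for k = 1..Q-3 (Eq. 25)
    return int(np.argmin(f)) + 1            # estimated K

evals = np.array([9.0, 6.5, 0.23, 0.21, 0.2, 0.18, 0.17, 0.15, 0.14])
print(sorte(evals))                         # -> 2 dominant sources
```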
DoA (direction-of-arrival) estimation can be performed with a variety of established methods from array signal processing. They vary widely in their complexity and performance, and they should be chosen according to the sound scene and the application requirements. In the multiple directional component case, DoA estimation can be done by narrowband DoA methods, which require scanning on a grid of directions and finding the associated maxima or minima. That can be done through analysis of power maps of beamformers, such as the MVDR, or by subspace methods, such as MUSIC. We present an example based on MUSIC. We define a dense grid of $G$ directions $\Gamma_G = [\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_G]$ and the associated SH matrix $\mathbf{Y}_G = [\mathbf{y}(\boldsymbol{\gamma}_1), \ldots, \mathbf{y}(\boldsymbol{\gamma}_G)]$. For $\hat{K}$ directional components in the scene, we construct the noise subspace

$$\mathbf{V}_n = [\mathbf{v}_{\hat{K}+1}, \ldots, \mathbf{v}_Q] \qquad (27)$$

from the eigenvectors corresponding to the lowest $Q - \hat{K}$ eigenvalues. The MUSIC spectrum is then given by

$$\mathbf{p}_{\mathrm{MUSIC}} = \mathrm{diag}[\mathbf{Y}_G^{\mathrm{T}} \mathbf{V}_n \mathbf{V}_n^{\mathrm{H}} \mathbf{Y}_G]. \qquad (28)$$

The source DoAs $\hat{\Gamma}_s \in \Gamma_G$ are found at the grid directions for which the minima of Eq. 28 occur.
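A corresponding sketch of Eqs. 27-28 (the minima picking is kept deliberately naive; a practical implementation would search for $\hat{K}$ distinct local minima over the grid):

```python
import numpy as np

def music_doas(C, K, Y_grid, grid_dirs):
    """Grid-based MUSIC per Eqs. 27-28.
    C: Q x Q band SCM; K: estimated source count;
    Y_grid: Q x G SH matrix of the grid; grid_dirs: list of G (azi, elev)."""
    evals, V = np.linalg.eigh(C)               # eigenvalues ascending
    Vn = V[:, : C.shape[0] - K]                # noise subspace, Eq. 27
    p = np.sum(np.abs(Vn.conj().T @ Y_grid) ** 2, axis=0)   # Eq. 28
    idx = np.argsort(p)[:K]                    # K smallest grid values
    return [grid_dirs[i] for i in idx], p
```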
Based on knowledge of the number of sources and their DoAs, the powers of the individual components may be estimated. The source powers can be computed by considering a beamformer with nulls towards all estimated DoAs apart from the source of interest. This $Q \times K$ matrix of beamforming weights $\mathbf{W}_{\mathrm{dir}}$ is given by the solution to the linear constraints

$$\mathbf{W}_{\mathrm{dir}}^{\mathrm{T}}\, \hat{\mathbf{Y}}_s = \mathbf{I}_K, \qquad (29)$$

which corresponds to the pseudo-inverse of the SH matrix for the estimated directions $\hat{\mathbf{Y}}_s = [\mathbf{y}(\hat{\boldsymbol{\gamma}}_1), \ldots, \mathbf{y}(\hat{\boldsymbol{\gamma}}_K)]$:

$$\mathbf{W}_{\mathrm{dir}} = \hat{\mathbf{Y}}_s (\hat{\mathbf{Y}}_s^{\mathrm{T}} \hat{\mathbf{Y}}_s)^{-1}. \qquad (30)$$

The estimated amplitudes $\hat{\mathbf{s}}$ of the source signals are then

$$\hat{\mathbf{s}} = \mathbf{W}_{\mathrm{dir}}^{\mathrm{T}}\, \mathbf{a}, \qquad (31)$$

and the vector $\hat{\mathbf{p}}_s = [\hat{P}_1, \ldots, \hat{P}_K]^{\mathrm{T}}$ of source powers is

$$\hat{\mathbf{p}}_s = \mathrm{diag}[\hat{\mathbf{C}}_s] = \mathrm{diag}[\mathbf{W}_{\mathrm{dir}}^{\mathrm{H}}\, \mathbf{C}_a\, \mathbf{W}_{\mathrm{dir}}], \qquad (32)$$

with $\hat{\mathbf{C}}_s = \mathrm{E}[\hat{\mathbf{s}}\, \hat{\mathbf{s}}^{\mathrm{H}}]$.
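Eqs. 30-32 translate directly into a few lines; e.g., continuing with the `Ys` matrix and signals `a` of the encoding sketch standing in for the estimated quantities:

```python
import numpy as np

W_dir = Ys @ np.linalg.inv(Ys.T @ Ys)   # Eq. 30; W_dir.T @ Ys == I_K (Eq. 29)
s_hat = W_dir.T @ a                      # Eq. 31, extracted source amplitudes
Ca = a @ a.T / a.shape[1]                # SCM estimate for the current band
p_hat = np.real(np.diag(W_dir.conj().T @ Ca @ W_dir))   # Eq. 32, powers
```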
The diffuse sound can be estimated in various ways, and the final choice should depend on the application scenario. If the application has no strict bandwidth or transmission requirements, for example in a standalone high-resolution spatial sound reproduction system where all the spherical harmonic signals are available at the decoder, then it is advantageous to retain a diffuse signal in the SHD, with a directional distribution which can deviate from isotropic and which can be reproduced without need of decorrelation, as will be detailed at the synthesis stage. This ambient sound component is simply the ambisonic signals after the estimated source signals have been extracted from them, and hence a residual which is expected to contain mostly reverberant and diffuse components. Its ambisonic representation is

$$\mathbf{a}_{\mathrm{diff}} = \mathbf{a} - \hat{\mathbf{Y}}_s\, \hat{\mathbf{s}} = \mathbf{a} - \hat{\mathbf{Y}}_s\, \mathbf{W}_{\mathrm{dir}}^{\mathrm{T}}\, \mathbf{a} = \mathbf{W}_{\mathrm{diff}}\, \mathbf{a}, \qquad (33)$$

where the square beamforming matrix $\mathbf{W}_{\mathrm{diff}}$ is given by

$$\mathbf{W}_{\mathrm{diff}} = \mathbf{I}_Q - \hat{\mathbf{Y}}_s \mathbf{W}_{\mathrm{dir}}^{\mathrm{T}} = \mathbf{I}_Q - \hat{\mathbf{Y}}_s (\hat{\mathbf{Y}}_s^{\mathrm{T}} \hat{\mathbf{Y}}_s)^{-1} \hat{\mathbf{Y}}_s^{\mathrm{T}}, \qquad (34)$$

and which defines an orthogonal projection onto the nullspace of $\hat{\mathbf{Y}}_s^{\mathrm{H}}$. Finally, the diffuse field power in this case is computed similarly to Eq. 20 as

$$\hat{P}_{\mathrm{diff}} = \frac{1}{Q}\, \mathrm{E}[\|\mathbf{a}_{\mathrm{diff}}\|^2]. \qquad (35)$$

If minimum transmission and bandwidth requirements are considered, it is advantageous to transmit the reduced number of source signals along with their parameters, and a single diffuse signal that can be further decorrelated for distribution to the playback system channels. This signal can be computed by keeping only the omnidirectional first channel of the estimated diffuse signals of Eq. 33. The total estimated field power is finally given as

$$\hat{P}_{\mathrm{tot}} = \sum_{k=1}^{\hat{K}} \hat{P}_k + \hat{P}_{\mathrm{diff}}. \qquad (36)$$
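The residual extraction of Eqs. 33-36 then reads, in the same notation as the previous sketch:

```python
import numpy as np

Q = Ys.shape[0]
W_diff = np.eye(Q) - Ys @ np.linalg.inv(Ys.T @ Ys) @ Ys.T   # Eq. 34
a_diff = W_diff @ a                     # Eq. 33, ambient/diffuse residual
P_diff_hat = np.mean(np.sum(np.abs(a_diff) ** 2, axis=0)) / Q   # Eq. 35
a_diff_mono = a_diff[0]                 # omni channel only, low-bandwidth case
P_tot_hat = p_hat.sum() + P_diff_hat    # Eq. 36
```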
Next switching over to synthesis of the sound scene: due to the decomposition of the scene into its constituent spatial components, along with their parameters, it is possible to reconstruct it for any target setup, using the most perceptually effective tools for the capabilities of the given system. In general, no matter the setup, directional sounds should be reproduced with appropriate panning or directional encoding functions, and diffuse sound should be distributed in a diffuse manner to the output channels.
With reference to synthesis of directional components, the directional sounds are distributed to the output channels with maximum directional concentration from their analyzed directions. It is suitable to consider such distribution functions as synthesis steering vectors, which may include panning laws, head-related transfer functions, ambisonic panning functions, virtual steering vectors for transcoding into virtual recording arrays or other spatial formats, and others. Assuming $L$ output channels, let us denote a vector of such real or complex spatialization gains as $\mathbf{g}(\boldsymbol{\gamma}) = [g_1(\boldsymbol{\gamma}), \ldots, g_L(\boldsymbol{\gamma})]^{\mathrm{T}}$. Then, having estimated the source signal amplitudes during analysis, associated with their DoAs, the source signals are spatialized as

$$\mathbf{b}_{\mathrm{dir}} = \mathbf{G}_s\, \hat{\mathbf{s}} = \mathbf{G}_s\, \mathbf{W}_{\mathrm{dir}}^{\mathrm{T}}\, \mathbf{a}, \qquad (37)$$

where $\mathbf{G}_s = [\mathbf{g}(\hat{\boldsymbol{\gamma}}_1), \ldots, \mathbf{g}(\hat{\boldsymbol{\gamma}}_K)]$ is the $L \times K$ matrix of spatialization gains for the estimated directions. The design of the spatialization vectors depends on the target system. Three major cases of interest are:
1. Loudspeaker rendering: A common solution is vector-base amplitude panning (VBAP), which adapts to any loudspeaker setup and provides perceptually maximum directional concentration for any direction, and is hence a suitable choice for fully directional sounds. Alternatively, smooth panning functions can be used, such as ambisonic panning, which have increased localization blur but provide a more even perceived source width, if such a characteristic is preferred over directional sharpness.
2. Headphone rendering: Directional rendering for headphones is similar to the loudspeaker case, with the difference that the real panning gains are replaced with frequency-transformed or filterbank coefficients of head-related transfer functions (HRTFs), and hence the spatialization vector corresponds to the left and right HRTF values for the analyzed DoA, $\mathbf{g}(\boldsymbol{\gamma}, k) = [h_L(\boldsymbol{\gamma}, k), h_R(\boldsymbol{\gamma}, k)]^{\mathrm{T}}$. Suitable HRTF interpolation should be employed for arbitrary DoAs, or, in the case of a dense grid of HRTF measurements, quantization to the closest HRTF direction can be adequate.
3. Ambisonic upmixing: In the case of ambisonic upmixing, new synthetic ambisonic signals are generated from the lower-order signals that are analyzed. Let us assume that the target order is $N' > N$; then the re-encoding gains are the target-order spherical harmonic vectors for the analyzed DoAs, $\mathbf{g}(\boldsymbol{\gamma}) = \mathbf{y}_{N'}(\boldsymbol{\gamma})$, as illustrated in the sketch after this list.
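As a minimal sketch of case 3 (the target order $N' = 5$, the stand-in DoA list, and the reuse of `real_sh_vector` and `s_hat` from earlier sketches are illustrative assumptions):

```python
import numpy as np

# Re-encode the extracted sources at target order N' > N, i.e. Eq. 37
# with g(gamma) = y_{N'}(gamma); est_doas holds the analyzed (azi, elev).
N_tgt = 5
est_doas = doas                              # stand-in for analyzed DoAs
G = np.column_stack([real_sh_vector(N_tgt, az, el) for az, el in est_doas])
b_dir = G @ s_hat                            # upmixed direct part, (N'+1)^2 x T
```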
Having regard to synthesis of diffuse components: as was mentioned in the analysis stage, a total omnidirectional monophonic diffuse signal can be estimated and transmitted in a low-bandwidth application scenario, by keeping only the first component of Eq. 33. This diffuse component can then be distributed to the output channels at the decoding stage through a network of decorrelators to achieve spatially diffuse reproduction. In the high-bandwidth case, all the ambisonic signals of Eq. 33 model the ambient/diffuse residual. We focus here on this high-quality, high-bandwidth case.
Distribution of the diffuse signals $\mathbf{a}_{\mathrm{diff}}$ can be performed in two stages: first a non-parametric ambisonic rendering, and second an optional parametric enhancement stage. The non-parametric decoding stage relies on well-designed ambisonic decoding matrices. The output diffuse signals are given in this case simply by

$$\mathbf{b}_{\mathrm{diff}} = \mathbf{D}\, \mathbf{a}_{\mathrm{diff}}. \qquad (38)$$

In a majority of application scenarios, diffuse rendering through ambisonic decoding will be adequate, especially if orders higher than the first two are available.
If the order is too low to reproduce convincingly diffuse sound properties, as can be the case with first-order signals due to very high correlation between output channels even for a completely diffuse case, then the output signals can be enhanced by means of decorrelation. An example is given for the loudspeaker case. The correlation matrix of the output signals after the ambisonic decoding is given, as in Eq. 21, by

$$\mathbf{C}_b = \mathrm{E}[\mathbf{b}_{\mathrm{diff}}\, \mathbf{b}_{\mathrm{diff}}^{\mathrm{H}}] = \mathbf{D}_{\mathrm{ls}}\, \mathbf{C}_{a_{\mathrm{diff}}}\, \mathbf{D}_{\mathrm{ls}}^{\mathrm{T}}. \qquad (39)$$

If an enhanced or maximum diffuse reproduction is desired for loudspeaker rendering, then the output channels should be mutually uncorrelated, preferably with the original output signal powers, to preserve the directional power distribution of the original sound field. Hence, the enhanced correlation matrix $\mathbf{C}_{b,\mathrm{enh}}$ of these signals would be diagonal with entries $\mathrm{diag}[\mathbf{C}_b]$. By forcing a decorrelation operation $\mathcal{D}[\cdot]$ on the output signals, with a power-preserving decorrelation process, we get

$$\mathbf{b}_{\mathrm{diff,enh}} = \mathcal{D}[\mathbf{D}\, \mathbf{a}_{\mathrm{diff}}], \qquad (40)$$

where the enhanced $\mathbf{b}_{\mathrm{diff,enh}}$ has the desired correlation properties.
Binaural reproduction of diffuse sound would use, instead of the loudspeaker ambisonic decoding matrices, frequency-dependent $2 \times Q$ ambisonic-to-binaural decoding matrices $\mathbf{D}_{\mathrm{bin}}$. Rendering then happens as in the loudspeaker case, with the operation of Eq. 38. If diffuse binaural enhancement is required, then a process similar to Eq. 40 can be followed, but with an additional mixing of the decorrelated signals to achieve the inter-aural correlation that would occur in a diffuse field, which can be significant at low frequencies.
Finally, diffuse rendering for upmixing of ambisonic signals to higher orders dismisses completely the spatialization operation of a decoding matrix $\mathbf{D}$, since the diffuse field is already expressed as ambisonic signals. Hence, in this case $\mathbf{D} = [\mathbf{I}_Q, \mathbf{0}]^{\mathrm{T}}$, with the zero padding $\mathbf{0}$ being of size $(N+1)^2 \times \left[(N'+1)^2 - (N+1)^2\right]$. That means that even though the directional components are upmixed to the higher-order ones through Eq. 37, the diffuse components are embedded into the higher-order signals with their original lower-order directionality. If upmixing of the diffuse components is desired too, then the diffuse signals of orders $n = N+1, \ldots, N'$ can be generated from decorrelated copies of the omnidirectional diffuse signal $\mathcal{D}[a_{\mathrm{diff},00}]$, taken from $a_{\mathrm{diff},00} = [\mathbf{a}_{\mathrm{diff}}]_1$.
Yet, direct and diffuse rendering, or the results thereof, may be combined. To avoid musical noise and artifacts in the output signals due to sharp discontinuities of the time-frequency gain factors of Eq. 37 and Eq. 38, based on the instantaneous DoA estimates, temporal smoothing or interpolation is employed on these synthesis matrices. Smoothing can be performed across a fixed number of frames, or with a one-pole recursive smoothing controlled by the smoothing coefficient $\beta$ as in

$$\mathbf{A}_{\mathrm{dir}}(k, l) = \beta\, \mathbf{A}_{\mathrm{dir}}(k, l-1) + (1 - \beta)\, \mathbf{G}_s \mathbf{W}_{\mathrm{dir}}^{\mathrm{T}}. \qquad (54)$$

The coefficient $\beta$ is linked to the decay time constant $\tau$ of the smoothing by $\beta = e^{-R/(\tau f_s)}$, with $R$ the hop size of the windowed transform or the decimation factor of the filter bank, and $f_s$ the sample rate.

The final diffuse rendering matrix $\mathbf{B}_{\mathrm{diff}}$ from the ambisonic signals to the output signals is, similarly to the directional rendering matrix, given by smoothing Eq. 38 with the same time constant as for the directional sounds:

$$\mathbf{B}_{\mathrm{diff}}(k, l) = \beta\, \mathbf{B}_{\mathrm{diff}}(k, l-1) + (1 - \beta)\, \mathbf{D} \mathbf{W}_{\mathrm{diff}}. \qquad (55)$$

Assuming that no decorrelation is required in the diffuse rendering, the full output signals are given by

$$\mathbf{b} = \mathbf{A}_{\mathrm{dir}}\, \mathbf{a} + \mathbf{B}_{\mathrm{diff}}\, \mathbf{a}. \qquad (56)$$
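The recursions of Eqs. 54-56 can be sketched as follows (the parameter values are illustrative, and the matrices `Gs`, `D`, `W_dir`, `W_diff` and signals `a` are assumed given by the earlier stages):

```python
import numpy as np

def smoothed(prev, target, beta):
    """One-pole recursion of Eqs. 54-55."""
    return beta * prev + (1.0 - beta) * target

R, fs, tau = 512, 48000, 0.05            # hop size, sample rate, decay time (s)
beta = np.exp(-R / (tau * fs))           # smoothing coefficient
A_dir = np.zeros_like(Gs @ W_dir.T)      # previous-frame matrices (zero init)
B_diff = np.zeros_like(D @ W_diff)
# per band k and frame l:
A_dir = smoothed(A_dir, Gs @ W_dir.T, beta)   # Eq. 54
B_diff = smoothed(B_diff, D @ W_diff, beta)   # Eq. 55
b = A_dir @ a + B_diff @ a               # Eq. 56, combined output signals
```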
Figure 9 illustrates, at 900, items that may be performed in various embodiments of a method according to the present invention. At start-up 902,
different initial preparatory tasks may be executed. The executing arrangement may be configured with the necessary hardware and software (with reference to a number of computing devices such as user terminal devices and/or servers, for example) and provided in or at least connected to a target network in cases where such connectivity is preferred for e.g. transfer (transmission and/or receipt, from a standpoint of a single device) of sound scene related data.
Item 901 encompasses some items the method may, depending on the embodiment, comprise.
At 918, a number of feasible microphones, such as an ambisonic microphone, may be obtained and configured for capturing a sound scene. For the purpose, it may further be connected to or comprise recording equipment that stores the captured signals in a target format, which may in some embodiments already be e.g. a selected ambisonic encoding format or other preferred spherical harmonic representation of the scene. Alternatively or additionally (the latter referring to a hybrid type of sound scene incorporating both captured and synthetically created sounds), synthesis equipment, with reference to e.g. a computer provided with the necessary synthesis software, may be utilized.
Accordingly, item 920 refers to actual capturing (microphone) or creation (synthetic production) of the sound scene and item 922 refers to establishing the associated spherical harmonic representation.
At 904, the representation is obtained, either as ready-made by an external entity such as a computer functionally connected to the arrangement or self- established as deliberated above. Thus "obtaining" may herein refer to e.g. receiving, fetching, reading, capturing, synthetic production, etc.
At 906, the representation is subjected to analysis. As thoroughly explained hereinbefore, such analysis of said spherical harmonic digital representation preferably contains determination of a number of related spatial parameters indicative of at least dominant sound sources in the sound scene, their directions-of-arrival (DOA) and associated powers. Time-frequency decomposition of said spherical harmonic digital representation, such as a selected time-frequency transform (e.g. a selected variant of the Fourier transform) or a filter bank, may be utilized to divide the representation into a
plurality of frequency bands analysed. The bands may be selected so as to reflect characteristic(s) of human auditory system such as frequency resolution thereof.
Next, depending on the embodiment with reference to the alternatives of Figs 3 and 4, spherical harmonic digital representation data or source/diffuse signals decomposed therefrom may be spatially filtered 908, potentially in the aforementioned bands, and rendered 912 for audio playback and/or upmixing 914 optionally based on sound modification input that is translated into changes 910 in the spatial parameters (either through direct manipulation of parameter data or of the process via which the parameters are configured to affect audio rendering/upmixing) as discussed hereinbefore.
For example, in the embodiment of Fig. 3 modification of spatial parameters may be executed prior to or in connection with spatial filtering, preferably upon creation of separation/mixing matrices, whereas in the embodiment of Fig. 4 such modifications typically take place after spatial filtering, whereupon the embodiment-dependent execution order of items 908 and 910 has been highlighted in the figure by a curved bidirectional arrow between them.
The two dotted horizontal lines are indicative of two options for a potential division between analysis and decoding/rendering side activities, further illustrating the fact that in some embodiments spatial filtering 908 may be executed at the decoding/rendering phase (Fig. 3) while in some other embodiments it may already be executed in connection with the analysis (Fig. 4). Some embodiments of the present invention may correspondingly concentrate on analysis side activities only, some others on decoding/rendering, while there may also be "complete system" type embodiments, executing both analysis and decoding/rendering side tasks at least selectively.
Prior to spatial filtering (embodiment of Fig. 3) or between spatial filtering and audio rendering/upmixing (embodiment of Fig. 4), potential intermediate data storage/transfer phase(s) may take place with additional reference to items 360, 460 of Figs 3 and 4.
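On the Fig. 4 side, after such transfer, the decomposed signals are synthesised at the receiving end: each dominant source signal is amplitude-panned towards its DOA, while the monophonic ambience is spread over the playback channels through decorrelation filters. In the rough, non-limiting sketch below, cosine panning and random-phase FIR filters are simplifying stand-ins for proper VBAP/ambisonic panning and purpose-designed decorrelators:

```python
import numpy as np

def render_decomposed(source_sigs, doas, ambient, ls_azimuths, seed=0):
    """Render decomposed source/ambient signals to a loudspeaker ring.

    source_sigs: (num_sources, samples); doas: source azimuths in radians;
    ambient: (samples,) monophonic ambience; ls_azimuths: (num_ls,) radians.
    """
    rng = np.random.default_rng(seed)
    num_ls = len(ls_azimuths)
    out = np.zeros((num_ls, source_sigs.shape[1]))

    for sig, az in zip(source_sigs, doas):
        # Crude cosine panning: weight loudspeakers by angular proximity.
        d = np.cos(ls_azimuths - az).clip(min=0.0)
        g = d / (np.linalg.norm(d) + 1e-12)
        out += np.outer(g, sig)

    # Ambience: one short random-phase FIR decorrelator per channel.
    for ch in range(num_ls):
        h = rng.standard_normal(256)
        h /= np.linalg.norm(h)
        out[ch] += np.convolve(ambient, h)[:out.shape[1]] / np.sqrt(num_ls)
    return out
```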
The execution is ended at 916.
The dotted loopback arrow depicts the potentially repetitive nature of the execution of various method items.
The present invention has been explained above with reference to a number of embodiments, and several advantages of the invention have been demonstrated. It is clear, however, that the invention is not restricted to these embodiments, but comprises further embodiments within the spirit and scope of the inventive thought and especially the following patent claims.
The features recited in the dependent claims are mutually freely combinable unless otherwise explicitly stated, or unless their mutual incompatibility is clear to a person skilled in the art.
Claims
1. An electronic arrangement (204a, 204b, 204c, 220) for cultivating a spherical harmonic digital representation of a sound scene (102, 103, 229, 230, 301, 401), comprising at least one data interface (224) for transferring data, at least one processing unit (222) for processing instructions (226) and other data, and memory (228) for storing the instructions and other data, said at least one processing unit being configured, in accordance with the stored instructions, to cause: obtaining the spherical harmonic digital representation (301, 401) of the sound scene, determining through analysis (304, 404, 530, 532, 534, 535) of said spherical harmonic digital representation a number of related spatial parameters (536, 538) indicative of at least dominant sound sources in the sound scene, their directions-of-arrival (DOA) and associated powers, wherein time-frequency decomposition of said spherical harmonic digital representation is preferably utilized to divide the representation into a plurality of frequency bands analysed (302, 402), and providing (360) said spherical harmonic digital representation, preferably as divided into said plurality of frequency bands, and said number of spatial parameters to spatial filtering (308, 414) in order to produce an output signal for audio rendering (231, 232, 310, 410) or upmixing (312, 412) the representation to higher order.
2. The arrangement of claim 1, configured to determine a preferably parametric indication of a remaining diffuse, ambient component of said spherical harmonic digital representation, including the power of such component.
3. The arrangement of claim 2, wherein determination of the indication of the remaining diffuse, ambient component comprises extracting the portion of the dominant sound sources from the spherical harmonic digital representation.
4. The arrangement of any preceding claim, configured to establish a spatial covariance matrix of the spherical harmonic digital representation, preferably for each frequency band of said plurality.
5. The arrangement of claim 4, wherein the spatial covariance matrix is subjected to eigenvalue decomposition to determine the number of dominant sound sources, related source signal space and remaining ambient signal space, wherein higher eigenvalues are associated with the dominant sound sources and the remaining lower values are associated with the ambient component.
6. The arrangement of claim 5, configured to utilize at least one subspace decomposition approach selected from the group consisting of: analysis of dominant eigenvalues, analysis of eigenvalue ratios, analysis of eigenvalue statistics, SORTE method, analysis of the eigenvectors, and application of information theoretic criteria.
7. The arrangement of any preceding claim, wherein the directions-of-arrival of the dominant sound sources are determined utilizing a narrowband DOA technique and preferably subspace decomposition of the spatial covariance matrix of the spherical harmonic digital representation, wherein a grid of directions is scanned and subjected to minima or maxima finding, preferably through analysis of power maps of a beamformer, optionally of minimum variance distortionless response type, and/or through utilization of a subspace method, optionally including MUSIC (Multiple Signal Classification) or a selected weighted subspace fitting method.
8. The arrangement of any preceding claim, wherein the associated powers are determined from the spherical harmonic digital representation utilizing the number and DOA of dominant sound sources.
9. The arrangement of any preceding claim, configured to obtain modification input optionally provided by a user, and modify one or more of the spatial parameters indicative of DOA, powers of the determined dominant sound sources and/or a parameter indicative of the power of the ambient component, based on the modification input.
10. The arrangement of claim 9, wherein the modification input is translated into at least one modification selected from the group consisting of: change in the level difference between the dominant sound sources and the ambient signal, change in the direction-of-arrival of one or more dominant sound sources optionally towards the direction indicated in the input, and suppression of one or more dominant source signals.
11. The arrangement of any preceding claim, configured to subject the provided spherical harmonic digital representation to spatial filtering (414), wherein the spherical harmonic digital representation is decomposed into at least a number of dominant sound source signals (416) and a preferably monophonic ambient signal (418) based on the spatial parameters, and optionally further configured to transmit such signals and the spatial parameters towards a remote entity such as a decoder.
12. The arrangement of any preceding claim, configured to subject the provided spherical harmonic digital representation to combined spatial filtering and audio rendering (308, 310) or combined spatial filtering and upmixing (308, 312), wherein corresponding matrices, combining decomposition of the spherical harmonic digital representation and rendering to audio or upmixing to higher order representation, both for dominant sound sources and ambient component, are determined based on the spherical harmonic digital representation and said spatial parameters.
13. The arrangement of any preceding claim, configured to subject the provided spherical harmonic representation to spatial filtering and loudspeaker rendering, utilizing information on the loudspeaker setup optionally in terms of vector-based amplitude panning or ambisonic panning.
14. The arrangement of any preceding claim, configured to subject the provided spherical harmonic representation to spatial filtering and headphones rendering, utilizing information on the headphone setup, such as head-related transfer functions and/or headphone calibration filters.
15. The arrangement of any preceding claim, configured to subject the provided spherical harmonic representation to spatial filtering and upmixing, utilizing information on a target order of upmixed representation.
16. The arrangement of claim 11, configured to subject said dominant sound source signals and said ambient signal to audio rendering (408) based on the spatial parameters and involving distribution of the dominant sound source signals and said ambient signal among a number of playback channels, or to upmixing involving re-encoding the signals to a higher order spherical harmonic representation.
17. The arrangement of any preceding claim, wherein said time-frequency decomposition is obtained utilizing a selected time-frequency transform or filter bank.
18. The arrangement of any preceding claim, wherein the obtained spherical harmonic digital representation comprises a first-order or a higher-order representation of the sound scene.
19. The arrangement of any preceding claim, wherein the obtained spherical harmonic digital representation carries a microphone-based representation of a real-life sound scene and/or an artificially created, synthetic sound scene.
20. The arrangement of any preceding claim, comprising a computer device or a plurality of operatively connected computer devices, optionally servers.
21. The arrangement of claim 20, comprising a system encompassing at least one first computer device and at least one operably connected second computer device, wherein said at least one first device is allocated at least said analysis and determination tasks, while said second device is allocated at least said audio rendering or upmixing tasks.
22. An electronic arrangement (204a, 204b, 204c, 220) for processing (300) a spherical harmonic digital representation of a sound scene (102, 103, 229, 230, 301, 303), comprising at least one data interface (224) for transferring data, at least one processing unit (222) for processing instructions (226) and other data, and memory (228) for storing the instructions and other data, said at least one processing unit being configured, in accordance with the stored instructions, to cause: obtaining the spherical harmonic digital representation (301, 303) of the sound scene, preferably being divided into a plurality of frequency bands, and a number of related spatial parameters (536, 538) indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting the spherical harmonic digital representation to spatial filtering and audio rendering or spatial filtering and upmixing (308), wherein corresponding matrices (632, 634) for decomposition of the spherical harmonic digital representation and rendering to audio signals associated with respective playback channels or upmixing to higher order representation, both for the dominant sound sources (632) and ambient component (634), are determined based on the spatial parameters; and respectively providing the resulting, rendered signals forward for audio playback (310) via a number of transducers associated with the playback channels, optionally speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene (312) for storage, transmission, or further processing, optionally decoding.
23. The arrangement of claim 22, configured to obtain modification input, optionally provided by a user, and modify one or more of the spatial parameters indicative of DOA, powers of the determined dominant sound sources and/or a parameter indicative of the power of ambient component, based on the modification input, wherein said modification input is preferably configured to contribute to the formulation of said matrices.
24. The arrangement of any of claims 22-23, configured to subject the spherical harmonic representation to spatial filtering and loudspeaker rendering, utilizing information on the loudspeaker setup optionally in terms of vector-based amplitude panning or ambisonic panning.
25. The arrangement of any of claims 22-24, configured to subject the spherical harmonic representation to spatial filtering and headphones rendering, utilizing information on the headphone setup, such as head-related transfer functions and/or headphone calibration filters.
26. The arrangement of any of claims 22-25, configured to subject the spherical harmonic representation to spatial filtering and upmixing, utilizing information on a target order of upmixed representation.
27. An electronic arrangement (204a, 204b, 204c, 220) for processing a low-bandwidth indication of a spherical harmonic digital representation of a sound scene (102, 103, 229, 230, 401, 403), comprising at least one data interface (224) for transferring data, at least one processing unit (222) for processing instructions (226) and other data, and memory (228) for storing the instructions and other data, said at least one processing unit being configured, in accordance with the stored instructions, to cause: obtaining a number of dominant sound source signals (416) and a monophonic ambient signal (418) resulting from decomposing the spherical harmonic digital representation preferably divided into a plurality of frequency bands, and further receiving a number of related spatial parameters (536, 538, 836, 838) indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting (408) said dominant sound source signals and said ambient signal to audio rendering (850, 852), utilizing said spatial parameters and involving distribution of the dominant sound source signals and said ambient signal among a number of playback channels, or to upmixing (854, 856) involving re-encoding the signals to a higher order spherical harmonic representation; and respectively providing (410, 412) the resulting, rendered signals forward for audio playback via a number of transducers associated with the playback channels, preferably speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene, for storage, transmission or further processing, optionally decoding.
28. The arrangement of claim 27, configured to obtain modification input, optionally from a user, and modify one or more of the spatial parameters indicative of DOA, powers of the determined dominant sound sources and/or a parameter indicative of the power of ambient component, based on the modification input.
29. The arrangement of any of claims 27-28, wherein audio rendering of the ambient signal comprises directing the ambient signal through a number of decorrelation filters.
30. A system comprising the arrangement of any of claims 1-11 for analysis and the arrangement of any of claims 22-29 for audio rendering or upmixing of the spherical harmonic digital representation.
31. A method (900) for cultivating a spherical harmonic digital representation of a sound scene, to be performed by an electronic arrangement, comprising:
-obtaining (901, 904) the spherical harmonic digital representation of the sound scene;
-determining (906) through analysis of said spherical harmonic digital representation a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers, wherein time-frequency decomposition of said spherical harmonic digital representation is utilized to divide the representation into a plurality of frequency bands analysed, said bands optionally reflecting human auditory frequency resolution; and
-providing said spherical harmonic digital representation, preferably as divided into said plurality of frequency bands, and said number of spatial parameters to spatial filtering (908) in order to produce an output signal for audio rendering or upmixing (912) the representation to higher order.
32. A method (900) for processing a spherical harmonic digital representation of a sound scene, to be performed by an electronic arrangement, comprising: obtaining (904) the spherical harmonic digital representation of the sound scene, preferably being divided into a plurality of frequency bands, and a number of related spatial parameters indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting (908, 910, 912) the spherical harmonic digital representation to combined spatial filtering and audio rendering or combined spatial filtering and upmixing (308), wherein corresponding matrices for decomposition of the spherical harmonic digital representation and rendering to audio signals associated with respective playback channels or upmixing to higher order representation, both for the dominant sound sources and ambient component, are determined based on the spatial parameters; and
respectively providing (914) the resulting rendered signals forward for audio playback via a number of transducers associated with the playback channels, optionally speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene for storage, transmission, or further processing, optionally decoding.
33. A method (900) for processing a low-bandwidth indication of a spherical harmonic digital representation of a sound scene, to be performed by an electronic arrangement, comprising: obtaining a number of dominant sound source signals (416) and a preferably monophonic ambient signal (418) resulting from decomposing (414) the spherical harmonic digital representation preferably divided into a plurality of frequency bands, and further receiving a number of related spatial parameters (836, 838) indicative of at least dominant sound sources in the scene, their directions-of-arrival (DOA) and associated powers; subjecting (408, 910, 912) said dominant sound source signals and said ambient signal to audio rendering, utilizing said spatial parameters and involving distribution of the dominant sound source signals and said ambient signal among a number of playback channels, or to upmixing involving re-encoding the signals to a higher order spherical harmonic representation; and respectively providing the resulting, rendered signals forward for audio playback via a number of transducers associated with the playback channels, preferably speakers such as loudspeakers or headphones, or the upmixed, higher-order spherical harmonic digital representation of the sound scene, for storage, transmission or further processing, optionally decoding.
34. A computer program product, embodied in a non-transitory computer readable carrier medium, comprising instructions causing a computer to execute the method items of any of claims 31-33.
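As a purely illustrative, non-limiting companion to claims 4-8, the following sketch runs the per-band analysis chain for a first-order input: spatial covariance (claim 4), eigendecomposition into source and ambient subspaces (claim 5), a crude eigenvalue-ratio test standing in for SORTE and the other criteria of claim 6, a MUSIC scan over a direction grid (claim 7), and eigenvalue-based power estimates in the spirit of claim 8. Thresholds, grid resolution and the horizontal-only scan are assumptions made for brevity:

```python
import numpy as np

def sh_vector_foa(azimuth, elevation):
    """First-order real spherical harmonics, ACN/SN3D order (W, Y, Z, X)."""
    return np.array([1.0,
                     np.sin(azimuth) * np.cos(elevation),
                     np.sin(elevation),
                     np.cos(azimuth) * np.cos(elevation)])

def analyse_band(X_band):
    """Estimate source count, DOAs and powers for one frequency band.

    X_band: (4, frames) complex time-frequency data of a first-order scene.
    """
    # Spatial covariance matrix of the band.
    C = (X_band @ X_band.conj().T) / X_band.shape[1]

    # Eigenvalue decomposition; sort eigenvalues in descending order.
    evals, evecs = np.linalg.eigh(C)
    evals, evecs = evals[::-1], evecs[:, ::-1]

    # Source count: eigenvalues clearly above the ambient floor.
    num_src = int(np.clip(np.sum(evals > 3.0 * evals[-1]), 1, 3))

    # MUSIC scan over a horizontal azimuth grid.
    noise_sub = evecs[:, num_src:]
    grid = np.deg2rad(np.arange(-180.0, 180.0, 2.0))
    spectrum = np.array([
        1.0 / max(np.linalg.norm(noise_sub.conj().T
                                 @ sh_vector_foa(az, 0.0)) ** 2, 1e-12)
        for az in grid
    ])
    # Top peaks; a real implementation would pick separated local maxima.
    doas = grid[np.argsort(spectrum)[::-1][:num_src]]

    # Source and ambient powers from the eigenvalues.
    ambient_power = float(np.mean(evals[num_src:]))
    source_powers = np.clip(evals[:num_src] - ambient_power, 0.0, None)
    return doas, source_powers, ambient_power
```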