US20220262373A1 - Layered coding of audio with discrete objects - Google Patents
- Publication number
- US20220262373A1 (U.S. Application No. 17/739,901)
- Authority
- US
- United States
- Prior art keywords
- audio
- ambisonic
- components
- audio components
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/008—Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/18—Vocoders using multiple modes
- G10L19/24—Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L19/0204—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using subband decomposition
- G10L19/0208—Subband vocoders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L19/167—Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/26—Pre-filtering or post-filtering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S3/00—Systems employing more than two channels, e.g. quadraphonic
- H04S3/008—Systems employing more than two channels, e.g. quadraphonic in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
- G11B27/22—Means responsive to presence or absence of recorded information signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/23439—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements for generating different versions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/236—Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
- H04N21/2368—Multiplexing of audio and video streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/60—Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
- H04N21/65—Transmission of management data between client and server
- H04N21/658—Transmission by the client directed to the server
- H04N21/6587—Control parameters, e.g. trick play commands, viewpoint selection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8106—Monomedia components thereof involving special audio data, e.g. different tracks for different languages
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/01—Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2400/00—Details of stereophonic systems covered by H04S but not provided for in its groups
- H04S2400/11—Positioning of individual sound objects, e.g. moving airplane, within a sound field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/11—Application of ambisonics in stereophonic audio systems
Definitions
- One aspect of the disclosure relates to layered coding of audio with discrete objects.
- Audio signals can have different formats. Traditional channel-based audio is recorded with a listening device in mind, for example, 5.1 home theater with five speakers and one subwoofer. Object-based audio encodes audio sources as objects with metadata that describes spatial information about each object.
- Trading off spatial resolution with layered coding of audio has challenges. Channel-based audio does not lend itself to being layered, because if the channels are treated as layers, the absence of a layer would be noticeable and distracting: an entire speaker could be turned off or muted when a corresponding layer is not processed. Similarly, when multiple objects (e.g., sound sources) constitute a sound field and the objects are treated as layers, then without additional measures the absence of one or more of the objects could result in a misrepresentation of the sound field.
- Ambisonics has an inherently hierarchical format. Each increasing order (e.g., first order, second order, third order, and so on) adds spatial resolution when played back to a listener. Ambisonics can be formatted with just the lower order Ambisonics, such as with first order, W, X, Y, and Z. This format, although having a relatively low bandwidth footprint, provides low spatial resolution. Much higher order Ambisonic components are typically required for high resolution immersive spatial audio experience.
- Objects can be converted to Ambisonics and the natural hierarchy of Ambisonics can then allow greater spatial resolution and detail of the objects as the order of the Ambisonics signal is increased. Regardless of how many components are included, this approach alone lacks flexibility in rendering different sound sources (objects) because those sound sources are hard-coded in the Ambisonic audio signals. Being able to access objects individually allows for a playback device to provide high resolution rendering of these objects as well as being able to manipulate each object independently, for example, an object can be virtually moved around a sound field, added and removed at will, and/or have its level adjusted independently of other sounds in the audio experience.
- Different playback devices may also have different playback capabilities. For example, a playback device can have the ability to render near-field audio to a user, and it may be advantageous for such a playback device to receive a bit stream with an object-based signal to render in the near field. A second playback device, however, might not be able to render near-field audio. In this case, if the object-based signal is transmitted to it, the signal can go unutilized, which can result in wasted bandwidth.
- A hybrid audio processing technique that layers Ambisonic audio components and object-based audio is lacking. Such a hybrid technique is beneficial because objects allow for near-field effects, precise localization, and interactivity: an object can be virtually moved around as seen fit, have its level changed, and/or be added to or removed from an audio scene. Ambisonics, in turn, can provide a compelling reproduction of the spatial ambience.
- In one aspect of the present disclosure, a hybrid audio processing technique includes generating a base layer having a first set of Ambisonic audio components (e.g., first order only, or first and second order only) based on ambience and one or more object-based audio signals. In this first set, the object-based audio signals can be converted into Ambisonic components and then combined with ambience that is also in Ambisonic format. This base layer can fully represent a desired audio scene in that it contains sounds captured as ambience as well as sounds from individual sound sources that have been converted into Ambisonic components. The first set of Ambisonic audio components, along with an optional number of objects (and associated metadata) that have not been converted to Ambisonics, can be included in a base (or first) layer that is encoded into a bit stream.
- At least one of the object-based audio signals can be included in a second layer (or 'enhancement layer'). It should be understood that object-based audio signals include associated metadata, for example, direction, depth, width, and diffusion. Additional enhancement layers can also be added, having additional object-based audio and/or additional Ambisonic audio components. The additional Ambisonic components could be higher-order coefficients, or coefficients of the same order that were not included in lower layers (for example, the first-order Z coefficient if the previous layers included only the first-order X and Y and the zeroth-order W).
- Metadata can be included in the bit stream that provides a playback device and/or decoding device with configuration details of each and every layer. The receiving device can then make an informed selection as to which layers, other than the base layer, shall be received. Such a selection can be determined based on bandwidth, which can vary over time, and/or the receiving device's capabilities. For example, the receiving device may not want to receive a layer or a stack of layers that contains near-field object-based audio if it cannot render and play back the object; the device may instead select a different layer or stack of layers that has the same sound source represented by the object-based audio, but embedded into the Ambisonic components.
- The device can select which layer to receive based on the direction that the user of the device is pointing toward or facing (in 3D space), or on one or more directions that the user of the device is interested in (indicated by some user-based interactivity). Image sensing with a camera and computer vision, and/or an inertial measurement unit, can be used to determine the direction the user is pointed toward or facing; other techniques can also be employed to determine user position. Objects located in these directions can be picked by receiving the layers that contain these objects. For example, if audio content at the encoder contains birds chirping at a location associated with the right-hand side of a user, and the user turns to face her right-hand side, the decoder can request an enhancement layer associated with the direction the user has turned to face; the encoder can then include in the transmitted bit stream the sounds at the location corresponding to the user's right-hand side, such as the chirping birds. Similarly, the user can indicate, in settings or a user interface, that the user would like to hear sounds at a location corresponding to the user's right-hand side, or from above or below, and the decoder can then send a request to receive enhancement layers associated with such locations.
- In one aspect, a hybrid audio processing technique provides layered coding using Ambisonics and object-based audio. The base layer includes lower-order Ambisonics (e.g., first order, or first and second order) carrying sounds from all object-based signals that might be relevant to the audio scene, as well as ambience. Additional layers carry additional Ambisonic components (e.g., of increasing order) and/or additional object-based audio signals.
- At the decoder, the object's contribution can be subtracted from the Ambisonics representation of the previous layers, allowing the object to be rendered independently of the Ambisonics renderer. This is possible because the decoder has knowledge of how each object was added to the Ambisonics mix at the encoder, for example through configuration information present in metadata. This can prevent double rendering of the object, as described further in other sections.
- Different devices can traverse the layers differently. For example, if a legacy device does not have the computational power to render an object independently (using near-field techniques, for example), the device can choose not to select a layer that contains such an object. Instead, the device can traverse a different layer stack that allows less computationally complex rendering, at the cost of quality of service and spatial resolution.
- FIG. 1 shows a process and system for encoding layered audio with Ambisonics and object-based audio, according to one aspect.
- FIG. 2 shows a process and system for encoding layered audio with Ambisonics and object-based audio, according to one aspect.
- FIG. 3 shows a process and system for encoding layered audio with Ambisonics and object-based audio with multiple layered stack options, according to one aspect.
- FIG. 4 shows a process and system for encoding layered audio with parametric coding and Ambisonics, according to one aspect.
- FIG. 5 illustrates an example of audio system hardware, according to one aspect.
- Ambisonics relates to a technique for recording, mixing, and playing back three-dimensional 360-degree audio in the horizontal and/or vertical plane.
- Ambisonics treats an audio scene as a 360-degree sphere of sound coming from different directions around a center.
- An example of an Ambisonics format is B-format, which can include first order Ambisonics consisting of four audio components—W, X, Y and Z. Each component can represent a different spherical harmonic component, or a different microphone polar pattern, pointing in a specific direction, each polar pattern being conjoined at a center point of the sphere.
- Ambisonic audio can be extended to higher orders, increasing the quality of localization. With increasing order, additional Ambisonic components will be introduced, for example, 5 new components for second order, 7 new components for third order, and so on. This can cause the footprint or size of the audio information to grow, which can quickly run up against bandwidth limitations.
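- As a quick illustration of how this growth works (simple arithmetic that follows from the spherical-harmonic structure of Ambisonics, not anything specific to this disclosure), order n adds 2n+1 new components and brings the running total to (n+1)²:

```python
# Ambisonic component counts by order: 2n+1 new components at order n,
# (n+1)^2 components in total through order n.
for order in range(5):
    new = 2 * order + 1
    total = (order + 1) ** 2
    print(f"order {order}: {new} new component(s), {total} total")
# order 1 -> 4 total (W, X, Y, Z); order 2 -> 9 total; order 3 -> 16 total; ...
```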
- Layered coding provides a means to transmit hierarchical layers.
- a receiver selects which layers it would like to decode based on available bandwidth and/or device level capabilities.
- A layered approach is different from canonical approaches in which there are multiple independent manifests of the same content at different bitrates. Layers typically build on each other, with each additional layer providing an improved quality of audio experience to a consumer. Spatial resolution is typically traded off against the number of layers decoded: consuming fewer layers can result in lower spatial resolution, but can also help different devices adapt to different bandwidth constraints.
- audio content 120 can be encoded by an encoder 122 into a hybrid and layered audio asset that can be transmitted in a bit stream over a network 96 to downstream devices 132 .
- the encoder can generate a first set of Ambisonic audio components based on ambience and one or more object-based audio signals and include the first set into base layer 126 .
- the base layer can have both ambience and sounds from object-based audio signals contained in the first set of Ambisonic components as well as some objects that are not mixed into the Ambisonics signal.
- the first set of Ambisonic audio components can include only low order Ambisonic components, for example, only zeroth order, zeroth order and first order, or only zeroth, first and second order.
- the encoder can encode at least one of the object-based audio signals into a first enhancement layer 128 .
- First enhancement layer 128 can include Ambisonic audio components of the same or higher order as those in the base layer.
- Second enhancement layer 130 and additional enhancement layers, e.g., layer 131 can provide additional object-based audio signals and/or additional Ambisonic audio components.
- Metadata 124 can include configuration information (e.g., which layers are available, and Ambisonic components and/or object-based audio in each available layer) that allows devices 132 to determine which layers should be transmitted over network 96 . Such a determination can be made by the encoder or by the receiving device based on bandwidth and/or device capabilities as well as listener position/preference, as mentioned.
- the Ambisonic audio components here can represent ambience, which should be understood as any sound in a scene other than the discrete objects, for example, ocean, vehicle traffic, wind, voices, music, etc.
- content 120 might contain numerous object-based audio signals that each represent a discrete sound source, such as a speaker's voice, a bird chirping, a door slamming, a pot whistling, a car honking, etc.
- the encoder can combine all the discrete sound sources with the ambience Ambisonic audio components, and include only a lower order first set of those Ambisonic audio components into the base layer 126 .
- This set can include only an omnidirectional pattern (W) audio component, a first bi-directional polar pattern audio component aligned in a first (e.g., front-to-back) direction X, a second bi-directional polar pattern audio component aligned in a second (e.g., left-to-right) direction Y, and a third bi-directional polar pattern audio component aligned in a third (e.g., up-down) direction Z.
- base layer 126 can contain Ambisonics components with objects mixed in, as well as audio signals of objects that are not mixed with the Ambisonics components (e.g., discrete).
- the encoder can include an object-based audio signal of at least one of the object-based audio assets (and associated metadata such as direction, depth, width, diffusion, and/or location) and/or additional Ambisonic audio components (for example, one or more second order Ambisonic audio components).
- object-based audio signals in the enhancement layers are not mixed into the Ambisonics signals of the enhancement layers.
- If a downstream device senses that network bandwidth is small (e.g., 50-128 kbps), then the downstream device can communicate to the encoder to only transmit the base layer. If, moments later, the network bandwidth increases, for example, if the downstream device moves to a location with increased bandwidth (e.g., a stronger wireless connection, or from a wireless connection to a wired connection), then the downstream device can request that the encoder transmit enhancement layers 128-131.
- The downstream device can examine metadata 124 that includes configuration data indicating a) one or more selectable layers (e.g., enhancement layers 128-131) that can be encoded and transmitted in the bit stream, and b) the Ambisonic audio components or audio objects in each of the selectable layers.
- Ambisonic audio components can be indicated by their respective Ambisonics coefficients, which indicates a particular polar pattern (e.g., omni-directional, figure-of-eight, cardioid, etc.) and direction.
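- For example, under the commonly used ACN channel-numbering convention (an assumption for illustration; the disclosure does not commit to a particular ordering), a coefficient of order n and degree m maps to a single channel index:

```python
def acn_index(n: int, m: int) -> int:
    """ACN channel index for an Ambisonic coefficient of order n, degree m."""
    assert -n <= m <= n
    return n * (n + 1) + m

# First-order set under ACN: W=0, Y=1, Z=2, X=3.
print([acn_index(0, 0), acn_index(1, -1), acn_index(1, 0), acn_index(1, 1)])
```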
- Base layer 126 includes selectable sublayers 101, 102, and 103. This can allow for further customization and utilization of bandwidth by a downstream device, as shown in FIG. 4 and described in other sections.
- FIG. 2 shows an audio encoding and decoding process and system according to one aspect.
- An audio asset 78 can include Ambisonic components 82 that represent ambience sound and one or more audio objects 80 .
- the ambience and audio objects can be combined to form a first or base layer of audio.
- all of the audio objects, or a select subset of audio objects can be converted at block 84 to Ambisonics components.
- the audio objects 80 can be converted, at block 84 , into N Ambisonic components that contain sounds from the audio objects at their respective virtual locations.
- the audio objects are converted into a single Ambisonic component, e.g., the omni-directional W component.
- Audio objects can be converted to Ambisonic format by applying spherical harmonic functions and direction of each object to an audio signal of the object.
- Spherical harmonic functions Y_mn(θ, ϕ) define the Ambisonics encoding functions, where θ represents the azimuth angle and ϕ represents the elevation angle of a location in spherical coordinates. A sound source S(t) can be converted to an Ambisonic component by applying Y_mn(θ_s, ϕ_s) to S(t), where θ_s and ϕ_s describe the direction (azimuth and elevation angle) of the sound source relative to the center of the sphere (which can represent a recording and/or listener location).
- Ambisonics components for a given sound source S(t) can be calculated for lower order Ambisonics components as follows:
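- The equations themselves are not reproduced in the text above; as a hedged reconstruction, the standard first-order (B-format) encoding functions for a source signal S(t) at azimuth θ_s and elevation ϕ_s are as follows (normalization conventions vary, e.g., some conventions scale W by 1/√2):

```latex
\begin{aligned}
W(t) &= S(t)\\
X(t) &= S(t)\,\cos\theta_s\,\cos\phi_s\\
Y(t) &= S(t)\,\sin\theta_s\,\cos\phi_s\\
Z(t) &= S(t)\,\sin\phi_s
\end{aligned}
```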
- Block 84 generates Ambisonic components with sounds from the audio objects 80 .
- These Ambisonic components are combined at block 85 with the ambience Ambisonics 82 .
- the combined ambience and object-based Ambisonic audio components can be truncated at block 86 to remove Ambisonic audio components of an order greater than a threshold (e.g., beyond a first order, or beyond a second order).
- For example, any Ambisonic component greater than B₁₀ can be removed at block 86.
- The remaining Ambisonic audio components (e.g., W, X, Y, and Z) are the first set of Ambisonic audio components (having ambience and the object-based audio information) that are encoded into the first layer of the bit stream at block 90.
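- A minimal sketch of this encode-side flow (blocks 84, 85, 86, and 90), assuming the first-order encoding functions above and ambience that is already available as first-order signals; the function names (`encode_first_order`, `build_base_layer`) are illustrative, not taken from the disclosure:

```python
import numpy as np

def encode_first_order(s: np.ndarray, az: float, el: float) -> np.ndarray:
    """Block 84: convert a mono object signal into first-order W, X, Y, Z
    for a source at azimuth `az` and elevation `el` (radians)."""
    w = s
    x = s * np.cos(az) * np.cos(el)
    y = s * np.sin(az) * np.cos(el)
    z = s * np.sin(el)
    return np.stack([w, x, y, z])              # shape (4, num_samples)

def build_base_layer(ambience_foa: np.ndarray, objects: list) -> np.ndarray:
    """Blocks 85/86: mix object Ambisonics with ambience, then truncate to the
    first-order set (W, X, Y, Z) that goes into the base layer."""
    mix = ambience_foa.copy()                  # ambience already in FOA format
    for signal, az, el in objects:
        mix += encode_first_order(signal, az, el)
    return mix[:4]                             # keep only W, X, Y, Z

# Toy example: one second of low-level ambience plus one object on the left.
fs = 48_000
ambience = 0.01 * np.random.randn(4, fs)
bird = (np.sin(2 * np.pi * 4000 * np.arange(fs) / fs), np.pi / 2, 0.0)
base_layer = build_base_layer(ambience, [bird])
print(base_layer.shape)                        # (4, 48000)
```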
- At least one of the audio objects 80 can be encoded into a second layer at encoder 92 .
- additional Ambisonic components that contain ambience can also be encoded in the second layer.
- additional layers 93 can contain additional Ambisonic components (e.g., in ascending order) as well as additional audio objects.
- the base layer includes the sounds in the audio objects and the ambience combined into Ambisonic components (with some optional objects sent discretely and not mixed into Ambisonic components), but in the enhancement layers (the second and additional layers) the audio objects are encoded discretely, separate from the Ambisonic components (if also included in the same respective enhancement layer).
- Metadata 94 describes how the different information (audio objects and Ambisonic components) is bundled into the different layers, allowing a downstream device to select (e.g., through a request to the server with live or offline encoded bitstreams) which enhancement layers the downstream device wishes to receive. As mentioned, this selection can be based on different factors, e.g., the bandwidth of the network 96, listener position/preference, and/or capabilities of the downstream device. Listener position can indicate which way a listener is facing or focusing on. Listener preferences can be stored based on user settings or device settings, and/or adjusted through a user interface. A preference can indicate a preferred position of the listener, for example, in a listener environment.
- a decoding device can receive the first (base) layer and, if selected, additional layers (e.g., the second layer).
- the downstream device can decode, at block 98 , the first layer having the first set of Ambisonic components that were generated by the encoder based on ambience 82 and object-based audio 80 .
- the second layer having at least one of the object-based audio signals (e.g., audio object 81 ) is decoded. If the second layer also has additional Ambisonic audio components (e.g., a second set of Ambisonic audio components), then these components can be concatenated to the first set that were decoded from the first layer at block 102 . It should be understood that each layer contains unique non-repeating Ambisonic audio components.
- For example, an Ambisonic component corresponding to a particular set of Ambisonic coefficients (e.g., B₁₀) will not be repeated in any of the layers. Each Ambisonic component builds on the others, and they can be concatenated with relative ease.
- At the decoder, the object-based audio extracted from the second layer is subtracted from the received Ambisonic audio components (which can include the first set of Ambisonic audio components received in the first layer and any additional Ambisonic audio components received in the other layers). To do this, the object-based audio can be converted to Ambisonic format at block 104, and the converted audio is then subtracted from the received Ambisonic audio components.
- The resulting set of Ambisonic audio components is rendered by an Ambisonic renderer, at block 106, into a first set of playback channels.
- The object-based audio received from the second layer is rendered spatially at block 108, e.g., by applying transfer functions or impulse responses to each object-based audio signal, to generate a second set of playback channels.
- The first set of playback channels and the second set of playback channels are combined into a plurality of speaker channels that are used to drive a plurality of speakers 109.
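- A sketch of these decode-side steps (blocks 100-108) under the same assumptions; simple gain matrices stand in for the Ambisonic renderer and the object spatializer, which in practice would use proper decoding matrices or HRTF/impulse-response processing:

```python
import numpy as np

def object_to_foa(sig, az, el):
    """Re-encode a discretely received object to first-order Ambisonics (block 104)."""
    return np.stack([sig,
                     sig * np.cos(az) * np.cos(el),
                     sig * np.sin(az) * np.cos(el),
                     sig * np.sin(el)])

def render(received_foa, objects, gains_foa, gains_obj):
    """Subtract each object's contribution (block 100), render the residual
    Ambisonics (block 106) and the objects (block 108), then sum the channels."""
    residual = received_foa.copy()
    for sig, az, el in objects:
        residual -= object_to_foa(sig, az, el)          # avoid double rendering
    ambi_channels = gains_foa @ residual                 # (num_spk, 4) @ (4, T)
    obj_channels = sum(np.outer(g, sig)
                       for g, (sig, _, _) in zip(gains_obj, objects))
    return ambi_channels + obj_channels                  # drives speakers 109

# Toy usage: 2 output channels, one discrete object, 100 samples of received FOA.
foa = np.random.randn(4, 100)
objs = [(np.random.randn(100), 0.0, 0.0)]
out = render(foa, objs, np.random.randn(2, 4), [np.array([0.7, 0.3])])
print(out.shape)                                         # (2, 100)
```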
- Speakers 109 can be integrated into cabinet loudspeakers, one or more speaker arrays, and/or head worn speakers (e.g., in-ear, on-ear, or extra-aural).
- additional layers can be encoded and decoded in the same manner, having additional object-based audio and additional Ambisonic audio components.
- the decoding device can communicate a request to the encoding device/server to include a third layer of data in the bit stream.
- the encoder can then encode the third layer with additional audio objects and/or Ambisonic audio components, different from those already included in the second layer and first layer.
- FIG. 3 illustrates aspects of the present disclosure with one or more different layer stacks, each having different layered audio information.
- A first stack option 144 and a second stack option 142 can be supported by an encoder, and configuration data for each enhancement layer can be included in metadata 252. These stack options share base layer 254.
- The first stack option can have a first enhancement layer 258 that contains additional Ambisonics coefficients 5, 6, and 7. The second stack option can include different enhancement layers than those in the first stack option: a first enhancement layer 136 includes a first object-based signal, a second enhancement layer 138 includes object 2 and Ambisonics coefficients 5, 6, and 7, and additional enhancement layers 140 contain higher orders of Ambisonics coefficients and additional audio objects.
- each additional layer (or enhancement layer) can build on the previous layer, having a same or higher order of Ambisonic components than the previous layer.
- Ambisonic components are removed if they are deemed to not contain useful information.
- For example, Ambisonic audio components corresponding to Ambisonics coefficients 8 and 9 can be removed by the encoder if the encoder detects that these components contain little to no audio information. This makes better use of available bandwidth by preventing the encoding and transmission of audio signals that carry no useful content.
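- A sketch of one way such pruning could be implemented, using a simple per-component energy threshold (the threshold value and the RMS-based measure are assumptions, not specified in the disclosure):

```python
import numpy as np

def prune_silent_components(components, coeff_ids, threshold_db=-60.0):
    """Drop Ambisonic channels whose level falls below `threshold_db`,
    returning the surviving channels plus their coefficient ids for metadata."""
    rms = np.sqrt(np.mean(components ** 2, axis=1) + 1e-12)
    level_db = 20 * np.log10(rms)
    keep = level_db > threshold_db
    return components[keep], [cid for cid, k in zip(coeff_ids, keep) if k]

# Coefficients 8 and 9 are made nearly silent here and are therefore dropped.
sig = 0.1 * np.random.randn(9, 480)
sig[7:9] *= 1e-5
kept, kept_ids = prune_silent_components(sig, list(range(1, 10)))
print(kept_ids)                     # -> [1, 2, 3, 4, 5, 6, 7]
```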
- A first device 150 (e.g., a decoder and/or playback device) and a second device 160 can each choose a stack option; for example, the second device 160 might choose to select the second stack option based on audio capabilities of the second device.
- The second device can select which of the layers in the second stack option is to be received based on available bandwidth. For example, if the second device has a bandwidth below 128 kbps, then the second device can request that only the base layer be sent. At a later time, if bandwidth increases, then the second device can request enhancement layers.
- A graph 161 shows layer consumption of the second device at different time intervals T1-T4. At T1, bandwidth might be especially low, so the device requests that only the first sublayer of the base layer be sent. As bandwidth gradually increases (e.g., the device's communication connection grows stronger), additional sublayers and enhancement layers are requested and received at T2, T3, and T4.
- Similarly, a graph 251 can show layer consumption of the first device 150. The first device might have a high bandwidth at time T1 but a low bandwidth at T2: at T1 the device can request and consume all of the base layer and enhancement layer 1, while at T2 the device requests that only the first sublayer of the base layer be sent. As the bandwidth increases again at times T3 and T4, the first device can request and consume additional sublayers 2 and 3 at time T3 and enhancement layer 1 at time T4.
- a receiving device can select different layers and sublayers at different times, from one audio frame to another audio frame.
- The decoder (a downstream device) can select (e.g., by communicating a request to the encoder) any of the enhancement layers based on various factors, for example, bandwidth, computing capabilities, or near-field playback capabilities. Additionally or alternatively, the decoder can select enhancement layers based on spatial rendering capabilities. For example, a playback device might have the capability to render sound horizontally but not the capability to render sound vertically at varying heights. Similarly, a playback device may not have the capability to render certain surround information, for example, if it is limited to a specific rendering format (5.1 surround sound, 7.2 surround sound, etc.).
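- A sketch of how a receiving device might use per-layer configuration metadata to make this selection; the metadata fields (`bitrate_kbps`, `needs_near_field`, `needs_height`) are hypothetical stand-ins for the configuration details described above:

```python
def select_layers(layers, bandwidth_kbps, can_near_field, can_render_height):
    """Greedily accept layers (base layer first) that fit the bandwidth budget
    and that the device can actually render."""
    chosen, budget = [], bandwidth_kbps
    for layer in layers:                         # assumed ordered: base, enh1, ...
        if layer["bitrate_kbps"] > budget:
            continue
        if layer.get("needs_near_field") and not can_near_field:
            continue
        if layer.get("needs_height") and not can_render_height:
            continue
        chosen.append(layer["name"])
        budget -= layer["bitrate_kbps"]
    return chosen

metadata = [
    {"name": "base", "bitrate_kbps": 64},
    {"name": "enh1 (near-field object)", "bitrate_kbps": 32, "needs_near_field": True},
    {"name": "enh2 (2nd-order components)", "bitrate_kbps": 48},
]
print(select_layers(metadata, bandwidth_kbps=128,
                    can_near_field=False, can_render_height=True))
# -> ['base', 'enh2 (2nd-order components)']
```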
- FIG. 4 shows an encoding and decoding process and system that uses parametric coding, according to one aspect of the present disclosure.
- The available bandwidth (e.g., bitrate) can vary over time. When the bandwidth drops, various tradeoffs can be made to optimize the user experience in terms of audio quality degradation; for example, the user experience could suffer if the codec can only deliver stereo or mono audio once the bitrate drops below a certain threshold.
- a layered audio process and system includes a parametric approach that can maintain immersive audio for first order Ambisonics (FOA) components at significantly lower bitrates compared to non-parametric audio coding methods.
- the encoder can encode a first layer 170 , a second layer 172 , and a third layer 174 .
- these layers correspond, respectively, to the first sub layer, second sub layer, and third sub layer of the base layer described in other sections and shown in FIGS. 1-3 .
- the first layer can be encoded to contain the omni-directional FOA component W.
- The second layer can be encoded, using parametric coding, to include a summation signal of X, Y, and Z.
- the resulting FOA signal can be rendered for any speaker layout.
- a third layer can be encoded to include two of the three components X, Y, and Z without parametric coding.
- the third component that is not included in the third layer can be derived by the decoder based on subtracting the two components from the summation of the second layer.
- Ambisonic components 358 can include first order Ambisonics W, X, Y and Z.
- W, X, Y and Z each contain sounds from converted object-based audio signals as well as ambience, as shown in FIG. 2 and described in other sections.
- the blocks shown performed on the encoder side of FIG. 4 can correspond to block 90 of FIG. 2 and the blocks shown performed on the decoder side of FIG. 4 can correspond to block 98 of FIG. 2 .
- The encoder and decoder blocks shown in FIG. 4 can be performed independently of other enhancement layers and do not require other aspects described in FIGS. 1-3.
- The first layer 170 contains only the Ambisonic component W, which has an omni-directional polar pattern.
- An encoder block 164 can encode W using known audio codecs. W can be played back as a mono signal.
- three Ambisonic audio components are combined at block 162 , the components including a) a first bi-directional polar pattern audio component aligned in a first direction (e.g., front-back or ‘X’), b) a second bi-directional polar pattern audio component aligned in a second direction (e.g., left-right or ‘Y’), and c) a third bi-directional polar pattern audio component aligned in a third direction (e.g., up-down or ‘Z’), resulting in a combined or downmixed channel S.
- the combined channel S is encoded at block 164 .
- a filter bank 260 can be used to filter each component to extract sub-bands of each component.
- the parameter generation block 166 can generate parameters for each of the sub-bands of each component.
- Filter bank 260 can use critical band filters to extract only the sub-bands that are audible.
- Parameters can define correlation between the three Ambisonic audio components, level differences between the three Ambisonic audio components, and/or phase differences between the three Ambisonic audio components.
- Other control parameters can be generated to improve spatial reproduction quality.
- the parameters can be associated for different time frames. For example, for a uniform time resolution, each time frame can have a constant duration of 20 ms.
- encoder blocks 164 encode only two of the three Ambisonic components (for example, X and Y, Y and Z, or X and Z) that are in the summed channel.
- the remaining Ambisonic component can be derived based on a) the summed channel (received in the second layer) and b) the two Ambisonic components (received in the third layer).
- weighting coefficients a, b, and c can be applied to each of the three Ambisonic audio components X, Y and Z to optimize the downmix of the three components into the S signal.
- the application of the coefficients can improve alignment in level and/or reduce signal cancelation.
- the weighting coefficients can be applied in the sub-band domain.
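- A much-simplified sketch of this second-layer encoding: a weighted downmix S = aX + bY + cZ plus per-band level parameters computed on 20 ms frames. The specific weights, the use of an FFT in place of the filter bank, and energy-ratio parameters are assumptions for illustration; correlation and phase parameters mentioned above are omitted:

```python
import numpy as np

def parametric_encode(x, y, z, fs=48_000, frame_ms=20, a=1.0, b=1.0, c=1.0, n_bands=8):
    """Downmix X, Y, Z into S and, per 20 ms frame and per band, store the
    energy ratio of each component relative to S (the side parameters)."""
    s = a * x + b * y + c * z
    hop = int(fs * frame_ms / 1000)
    params = []
    for start in range(0, len(s) - hop + 1, hop):
        frame = slice(start, start + hop)
        spectra = [np.fft.rfft(sig[frame]) for sig in (x, y, z, s)]
        edges = np.linspace(0, len(spectra[0]), n_bands + 1, dtype=int)
        ratios = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            e = [np.sum(np.abs(sp[lo:hi]) ** 2) + 1e-12 for sp in spectra]
            ratios.append([e[0] / e[3], e[1] / e[3], e[2] / e[3]])
        params.append(ratios)
    return s, np.array(params)      # transmit coded S plus the small parameter set

# Toy usage on 100 ms of noise-like X, Y, Z components.
rng = np.random.default_rng(0)
x, y, z = rng.standard_normal((3, 4800))
s, p = parametric_encode(x, y, z)
print(s.shape, p.shape)             # (4800,) (5, 8, 3)
```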
- a decoder can select how many of the layers to consume, based on bandwidth. With only the first layer, a listener can still hear ambience and sound sources of a sound scene, although spatial resolution is minimal. As bandwidth increases, the decoder can additionally request and receive the second layer 172 and then the third layer 174 .
- a decoder can decode the summed channel and Ambisonic components at blocks 168 , which can ultimately be rendered by an Ambisonic renderer to generate channels that can be used to drive speakers (e.g., an array of loudspeakers or a headphone set).
- the first layer of the received bit stream can be decoded at block 168 to extract Ambisonic component W′.
- the summed signal S′ can be extracted at block 168 and a parameter applicator 170 can apply the one or more parameters (generated at block 166 ) to S′ to generate Ambisonic audio components X′, Y′, and Z′.
- These Ambisonic audio components can be compressed versions of X, Y, and Z.
- the trade-off is that transmitting the summed channel and the parameters requires less bandwidth than transmitting X, Y, and Z.
- the summed signal S of the second layer provides a compressed version of X, Y, and Z.
- If the third layer is received, two of the three components that were summed into S can be decoded at block 168.
- the decoder can, at block 176 , subtract these two components from the summed channel. In such a case where the decoder has received the third layer, the decoder can simply ignore the parameters and skip the parameter application block 170 , to reduce processing overhead. Rather, the decoder need only decode the two received components, and then extract the third Ambisonic component from the summed channel through subtraction.
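- A small sketch of this third-layer decode path, assuming the encoder's downmix weights are known to the decoder (e.g., signalled in metadata or fixed by convention):

```python
def recover_third_component(s_prime, x_prime, y_prime, a=1.0, b=1.0, c=1.0):
    """Given S' = a*X + b*Y + c*Z (second layer) and decoded X', Y' (third layer),
    recover Z' by subtraction instead of applying the parametric upmix."""
    return (s_prime - a * x_prime - b * y_prime) / c

# Quick check with unit weights:
x, y, z = 0.2, -0.5, 0.7
print(abs(recover_third_component(x + y + z, x, y) - z) < 1e-12)   # True
```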
- encoders and decoders described in the present disclosure can implement existing and known codecs for encoding and decoding audio signals.
- the third layer can jointly encode the two components (X and Y) with a single codec that can take advantage of correlation between the two components.
- layer three can contain X and Z, or Y and Z.
- If layer three contains Y and Z, then X can be derived from the summed signal S′. Quantization errors may accumulate during the recovery of X from S′, so the recovered X signal may have more distortion than the Y and Z components. Therefore, some implementations may prefer to recover the Z component from S′ instead.
- Z is typically associated with sound sources above or below the listener where spatial perception is less accurate than in the horizontal plane.
- the Z component might also carry less energy on average as most audio sources are typically placed in the horizontal plane, for example consistent with the video scene.
- the parameter applicator block 170 applies the parameters to the summed signal S′ in the sub-band domain.
- A filter bank 169 can be used to convert S′ into sub-bands.
- inverse blocks 172 can be used to convert those sub-bands back to full-bands. Further, if weighting coefficients are applied at the encoder, inverse coefficients can be applied at the decoder as shown, to reconstruct the Ambisonic components X, Y and Z.
- the layered structure shown in FIG. 4 can beneficially reduce bitrate when needed, but still allow for improved spatial quality when greater bandwidth is present.
- a decoder can receive only the first layer 170 and play back the W signal in mono if bitrate reduction is critical.
- the decoder also can receive the second layer, in which case the decoder can reuse the W signal received in the first layer and reconstruct X′, Y′, and Z′ based on a summed signal S′.
- Bit rate in this case is still improved in comparison to transmitting X, Y, and Z, because the data footprint of S′ and the parameters is significantly smaller than the bit rate required for X, Y, and Z, with the footprint of the parameters being negligible.
- the decoder can receive all three layers 170 , 172 , and 174 , in which case, only two of the components need be sent in the third layer, the third component being reconstructed as discussed above.
- FIG. 5 shows a block diagram of audio processing system hardware, in one aspect, which may be used with any of the aspects described herein (e.g., headphone set, mobile device, media player, or television).
- This audio processing system can represent a general purpose computer system or a special purpose computer system.
- While FIG. 5 illustrates the various components of an audio processing system that may be incorporated into headphones, speaker systems, microphone arrays, and entertainment systems, it is merely one example of a particular implementation and merely illustrates the types of components that may be present in the audio processing system.
- FIG. 5 is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to the aspects herein. It will also be appreciated that other types of audio processing systems that have fewer or more components than shown in FIG. 5 can also be used. Accordingly, the processes described herein are not limited to use with the hardware and software of FIG. 5.
- the audio processing system 150 (for example, a laptop computer, a desktop computer, a mobile phone, a smart phone, a tablet computer, a smart speaker, a head mounted display (HMD), a headphone set, or an infotainment system for an automobile or other vehicle) includes one or more buses 162 that serve to interconnect the various components of the system.
- One or more processors 152 are coupled to bus 162 as is known in the art.
- the processor(s) may be microprocessors or special purpose processors, system on chip (SOC), a central processing unit, a graphics processing unit, a processor created through an Application Specific Integrated Circuit (ASIC), or combinations thereof.
- Memory 151 can include Read Only Memory (ROM), volatile memory, and non-volatile memory, or combinations thereof, coupled to the bus using techniques known in the art.
- Camera 158 and display 160 can be coupled to the bus.
- Memory 151 can be connected to the bus and can include DRAM, a hard disk drive or a flash memory or a magnetic optical drive or magnetic memory or an optical drive or other types of memory systems that maintain data even after power is removed from the system.
- the processor 152 retrieves computer program instructions stored in a machine readable storage medium (memory) and executes those instructions to perform operations described herein.
- Audio hardware, although not shown, can be coupled to the one or more buses 162 in order to receive audio signals to be processed and output by speakers 156.
- Audio hardware can include digital to analog and/or analog to digital converters. Audio hardware can also include audio amplifiers and filters. The audio hardware can also interface with microphones 154 (e.g., microphone arrays) to receive audio signals (whether analog or digital), digitize them if necessary, and communicate the signals to the bus 162 .
- Communication module 164 can communicate with remote devices and networks.
- communication module 164 can communicate over known technologies such as Wi-Fi, 3G, 4G, 5G, Bluetooth, ZigBee, or other equivalent technologies.
- the communication module can include wired or wireless transmitters and receivers that can communicate (e.g., receive and transmit data) with networked devices such as servers (e.g., the cloud) and/or other devices such as remote speakers and remote microphones.
- the aspects disclosed herein can utilize memory that is remote from the system, such as a network storage device which is coupled to the audio processing system through a network interface such as a modem or Ethernet interface.
- the buses 162 can be connected to each other through various bridges, controllers and/or adapters as is well known in the art.
- one or more network device(s) can be coupled to the bus 162 .
- the network device(s) can be wired network devices (e.g., Ethernet) or wireless network devices (e.g., WI-FI, Bluetooth).
- various aspects described e.g., simulation, analysis, estimation, modeling, object detection, etc., can be performed by a networked server in communication with the capture device.
- aspects described herein may be embodied, at least in part, in software. That is, the techniques may be carried out in an audio processing system in response to its processor executing a sequence of instructions contained in a storage medium, such as a non-transitory machine-readable storage medium (e.g. DRAM or flash memory).
- hardwired circuitry may be used in combination with software instructions to implement the techniques described herein.
- the techniques are not limited to any specific combination of hardware circuitry and software, or to any particular source for the instructions executed by the audio processing system.
- the terms “analyzer”, “separator”, “renderer”, “encoder”, “decoder”, “truncator”, “estimator”, “combiner”, “synthesizer”, “controller”, “localizer”, “spatializer”, “component,” “unit,” “module,” “logic”, “extractor”, “subtractor”, “generator”, “applicator”, “optimizer”, “processor”, “mixer”, “detector”, “calculator”, and “simulator” are representative of hardware and/or software configured to perform one or more processes or functions.
- examples of “hardware” include, but are not limited or restricted to an integrated circuit such as a processor (e.g., a digital signal processor, microprocessor, application specific integrated circuit, a micro-controller, etc.).
- different combinations of hardware and/or software can be implemented to perform the processes or functions described by the above terms, as understood by one skilled in the art.
- the hardware may be alternatively implemented as a finite state machine or even combinatorial logic.
- An example of “software” includes executable code in the form of an application, an applet, a routine or even a series of instructions. As mentioned above, the software may be stored in any type of machine-readable medium.
- any of the processing blocks may be re-ordered, combined or removed, performed in parallel or in serial, as necessary, to achieve the results set forth above.
- The processing blocks associated with implementing the audio processing system may be performed by one or more programmable processors executing one or more computer programs stored on a non-transitory computer readable storage medium to perform the functions of the system. All or part of the audio processing system may be implemented as special purpose logic circuitry (e.g., an FPGA (field-programmable gate array) and/or an ASIC (application-specific integrated circuit)).
- All or part of the audio system may be implemented using electronic hardware circuitry that includes electronic devices such as, for example, at least one of a processor, a memory, a programmable logic device, or a logic gate. Further, processes can be implemented in any combination of hardware devices and software components.
- personally identifiable information should follow privacy policies and practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining the privacy of users.
- personally identifiable information data should be managed and handled so as to minimize risks of unintentional or unauthorized access or use, and the nature of authorized use should be clearly indicated to users.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Mathematical Physics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Quality & Reliability (AREA)
- Stereophonic System (AREA)
Abstract
Description
- One aspect of the disclosure relates to layered coding of audio with discrete objects.
- Audio signals can have different formatting. Traditional channel-based audio is recorded with a listening device in mind, for example, 5.1 home theater with five speakers and one subwoofer. Object-based audio encodes audio sources as objects with meta-data that describes spatial information about the object.
- Trading off spatial resolution with layered coding of audio has challenges. Traditional audio is channel-based, for example, 5.1 or 4.1. Channel-based audio does not lend itself to being layered because if the channels are treated as layers, then the absence of a layer would be noticeable and distracting because an entire speaker could be turned off or mute if a corresponding layer is not processed. Similarly, when multiple objects (e.g., sound sources) constitute a sound field, if the objects are treated as layers, without any additional measures, then absence of one or more of the objects could result in a misrepresentation of the sound field.
- Ambisonics has an inherently hierarchical format. Each increasing order (e.g., first order, second order, third order, and so on) adds spatial resolution when played back to a listener. Ambisonics can be formatted with just the lower order Ambisonics, such as with first order, W, X, Y, and Z. This format, although having a relatively low bandwidth footprint, provides low spatial resolution. Much higher order Ambisonic components are typically required for high resolution immersive spatial audio experience.
- Objects can be converted to Ambisonics and the natural hierarchy of Ambisonics can then allow greater spatial resolution and detail of the objects as the order of the Ambisonics signal is increased. Regardless of how many components are included, this approach alone lacks flexibility in rendering different sound sources (objects) because those sound sources are hard-coded in the Ambisonic audio signals. Being able to access objects individually allows for a playback device to provide high resolution rendering of these objects as well as being able to manipulate each object independently, for example, an object can be virtually moved around a sound field, added and removed at will, and/or have its level adjusted independently of other sounds in the audio experience.
- Different playback devices may also have different playback capabilities. For example, a playback device can have the ability to render near-field audio to a user. It may be advantageous for such a playback device to receive a bit stream with an object-based signal to render in the near field. A second playback device, however, might not be able to render near-field audio. In this case, if the object-based signal is transmitted from one device to another, this signal can go unutilized, which can result in a waste of bandwidth.
- A hybrid audio processing technique that layers Ambisonic audio components and object-based audio is lacking. Such a hybrid technique is beneficial as objects allow for near-field effects, precise localization, as well as interactivity. The object can be virtually moved around as seen fit, and/or have its level changed, and/or be added or removed from an audio scene as seen fit. Ambisonics can further provide a compelling spatial ambience reproduction.
- In one aspect of the present disclosure, a hybrid audio processing technique is described. The process includes generating a base layer having a first set of Ambisonic audio components (e.g., first order only, or first order and second order only) based on ambience and one or more object-based audio signals. In this first set, the object-based audio signals can be converted into Ambisonic components and then combined with ambience that is also in Ambisonic format. This base layer can fully represent a desired audio scene in that it has sounds captured in ambience as well as sounds from individual sound sources that are converted into Ambisonic components. The first set of Ambisonic audio components, along with an optional number of objects (and associated metadata) that have not been converted to Ambisonics, can be included in a base (or first) layer that is encoded into a bit stream.
- At least one of the object-based audio signals can be included in a second layer (or an ‘enhancement layer’). It should be understood that object-based audio signals include associated metadata, for example, direction, depth, width, diffusion. Additional enhancement layers can also be added having additional object-based audio and/or additional Ambisonic audio components. The additional Ambisonic components could be higher order coefficients or coefficients of the same order but not included in lower levels (for example the first order Z coefficient if the previous levels only included the first order X and Y and the zeroth order W).
- Metadata can be included in the bit stream that provides a playback device and/or decoding device with configuration details of each and every layer. The receiving device can make an informed selection as to which layers, other than the base layer, shall be received. Such a selection can be determined based on bandwidth, which can vary over time, and/or the receiving device's capabilities. For example, the receiving device may not want to receive a layer or a stack of layers that contains near-field object based audio, if it cannot render and playback the object. The device may select a different layer or stack of layers that has the same sound source represented by the object-based audio, but embedded into the Ambisonics components.
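For illustration only, a minimal sketch of such a selection follows (not part of the original disclosure); the metadata fields, bit rates, and selection rule are assumptions introduced here.

```python
# Hypothetical sketch of a receiving device choosing layers from configuration
# metadata. The field names, bit rates, and thresholds are illustrative assumptions.

def select_layers(config, available_kbps, can_render_objects):
    """Return the ids of the layers the device should request, base layer first."""
    selected = []
    budget = available_kbps
    for layer in config:                              # base layer, then enhancement layers
        if layer["has_objects"] and not can_render_objects:
            continue                                  # skip layers the device cannot render
        if layer["kbps"] > budget:
            break                                     # stop once the bandwidth budget is spent
        selected.append(layer["id"])
        budget -= layer["kbps"]
    return selected

config = [
    {"id": 0, "kbps": 64, "has_objects": False},      # base layer (always listed first)
    {"id": 1, "kbps": 48, "has_objects": True},       # enhancement layer with near-field objects
    {"id": 2, "kbps": 96, "has_objects": False},      # enhancement layer with higher-order Ambisonics
]
print(select_layers(config, available_kbps=180, can_render_objects=False))   # [0, 2]
```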
- The device can select which layer to receive based on either the direction that the user of the device is pointing towards or facing (in 3D space) or one or more directions that the user of the device is interested in (indicated through some user-based interactivity). Image sensing with a camera and computer vision and/or an inertial measurement unit can be used to determine a direction that the user is pointed towards or facing. Other techniques can also be employed to determine user position. Objects that are located in these directions can be picked up by receiving the layers that contain them. For example, if audio content at the encoder contains birds chirping at a location associated with the right-hand side of a user, and the user turns to face her right-hand side, the decoder can request an enhancement layer that is associated with the direction that the user has turned to face. The encoder can then include, in the bit stream transmitted to the user, the layer that contains sounds in the location corresponding to the user's right-hand side, such as the chirping birds. Similarly, the user can indicate, in settings or a user interface, that the user would like to hear sounds at a location corresponding to the user's right-hand side, or from above or below. The decoder can then send a request to receive enhancement layers associated with such locations.
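A simplified sketch of such direction-driven layer selection is shown below for illustration; the metadata fields, angle convention, and 60-degree window are assumptions introduced here, not part of the disclosure.

```python
# Hypothetical sketch: request enhancement layers whose objects lie near the
# direction the user is facing. Azimuths are in degrees; the metadata fields
# and the 60-degree window are illustrative assumptions.

def angular_distance(a_deg: float, b_deg: float) -> float:
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def layers_for_direction(layer_metadata, facing_deg, window_deg=60.0):
    """layer_metadata: list of {"id": int, "object_azimuths": [deg, ...]}."""
    wanted = []
    for layer in layer_metadata:
        if any(angular_distance(az, facing_deg) <= window_deg
               for az in layer["object_azimuths"]):
            wanted.append(layer["id"])
    return wanted

metadata = [
    {"id": 1, "object_azimuths": [-90.0]},   # e.g., birds chirping to the user's right
    {"id": 2, "object_azimuths": [170.0]},   # a source behind the user
]
print(layers_for_direction(metadata, facing_deg=-80.0))   # [1]
```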
- In such a manner, a hybrid audio processing technique provides layered coding using Ambisonics and object-based audio. The base layer includes lower order Ambisonics (e.g., first order or first and second order) having sounds from all object-based signals that might be relevant to the audio scene, as well as ambience. Additional layers have additional Ambisonics components (e.g., of increasing order) and/or additional object-based audio signals. When an audio object is transmitted as part of any of the non-base layers, the object's contribution can be subtracted from the Ambisonics representation of the previous layers, allowing the object to be rendered independently of the Ambisonics renderer. This is possible as the decoder will have knowledge of how each object was added to the Ambisonics mix at the encoder, through for example, configuration information present in metadata. This can prevent double rendering of the object, as described further in other sections.
- Further, different devices can traverse layers differently. For example, if a legacy device does not have the computational power to render an object independently (using near-field techniques, for example), the device can choose not to select a layer that contains such an object. Instead, the device can traverse a different layer stack that allows less computationally complex rendering, at the cost of quality of service and spatial resolution.
- The above summary does not include an exhaustive list of all aspects of the present disclosure. It is contemplated that the disclosure includes all systems and methods that can be practiced from all suitable combinations of the various aspects summarized above, as well as those disclosed in the Detailed Description below and particularly pointed out in the Claims section. Such combinations may have particular advantages not specifically recited in the above summary.
- Several aspects of the disclosure here are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” aspect in this disclosure are not necessarily to the same aspect, and they mean at least one. Also, in the interest of conciseness and reducing the total number of figures, a given figure may be used to illustrate the features of more than one aspect of the disclosure, and not all elements in the figure may be required for a given aspect.
-
FIG. 1 shows a process and system for encoding layered audio with Ambisonics and object-based audio, according to one aspect. -
FIG. 2 shows a process and system for encoding layered audio with Ambisonics and object-based audio, according to one aspect. -
FIG. 3 shows a process and system for encoding layered audio with Ambisonics and object-based audio with multiple layered stack options, according to one aspect. -
FIG. 4 shows a process and system for encoding layered audio with parametric coding and Ambisonics, according to one aspect. -
FIG. 5 illustrates an example of audio system hardware, according to one aspect. - Several aspects of the disclosure with reference to the appended drawings are now explained. Whenever the shapes, relative positions and other aspects of the parts described are not explicitly defined, the scope of the invention is not limited only to the parts shown, which are meant merely for the purpose of illustration. Also, while numerous details are set forth, it is understood that some aspects of the disclosure may be practiced without these details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
- Ambisonics relates to a technique for recording, mixing, and playing back three-dimensional 360-degree audio both in the horizontal and/or vertical plane. Ambisonics treats an audio scene as a 360-degree sphere of sound coming from different directions around a center. An example of an Ambisonics format is B-format, which can include first order Ambisonics consisting of four audio components—W, X, Y and Z. Each component can represent a different spherical harmonic component, or a different microphone polar pattern, pointing in a specific direction, each polar pattern being conjoined at a center point of the sphere.
- Ambisonic audio can be extended to higher orders, increasing the quality of localization. With increasing order, additional Ambisonic components will be introduced, for example, 5 new components for second order, 7 new components for third order, and so on. This can cause the footprint or size of the audio information to grow, which can quickly run up against bandwidth limitations.
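The growth in component count can be made concrete with a short calculation; the snippet below is an illustrative sketch only, not part of the original disclosure.

```python
# Illustrative sketch: an order-M Ambisonic signal carries (M + 1)^2 components in
# total, and each new order m adds 2*m + 1 components on top of the previous order.

def total_components(order: int) -> int:
    return (order + 1) ** 2

def new_components(order: int) -> int:
    return 2 * order + 1

for m in range(5):
    print(f"order {m}: {total_components(m)} total, {new_components(m)} added by this order")
# First order gives 4 components (W, X, Y, Z); second order adds 5; third order adds 7.
```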
- Layered coding provides a means to transmit hierarchical layers. A receiver selects which layers it would like to decode based on available bandwidth and/or device-level capabilities. A layered approach is different from canonical approaches where there are multiple independent manifests of the same content at different bitrates. Layers typically build on each other, with each additional layer providing an improved quality of audio experience to a consumer. Spatial resolution is typically traded off against the number of layers that are decoded: consuming fewer layers can result in lower spatial resolution, but it can also help different devices adapt to different bandwidth constraints.
- Referring now to
FIG. 1, audio content 120 can be encoded by an encoder 122 into a hybrid and layered audio asset that can be transmitted in a bit stream over a network 96 to downstream devices 132. The encoder can generate a first set of Ambisonic audio components based on ambience and one or more object-based audio signals and include the first set into base layer 126. Thus, the base layer can have both ambience and sounds from object-based audio signals contained in the first set of Ambisonic components, as well as some objects that are not mixed into the Ambisonics signal. The first set of Ambisonic audio components can include only low order Ambisonic components, for example, only zeroth order, zeroth order and first order, or only zeroth, first and second order. - The encoder can encode at least one of the object-based audio signals into a
first enhancement layer 128. First enhancement layer 128 can include Ambisonic audio components of the same or higher order as those in the base layer. Second enhancement layer 130 and additional enhancement layers, e.g., layer 131, can provide additional object-based audio signals and/or additional Ambisonic audio components. Metadata 124 can include configuration information (e.g., which layers are available, and which Ambisonic components and/or object-based audio are in each available layer) that allows devices 132 to determine which layers should be transmitted over network 96. Such a determination can be made by the encoder or by the receiving device based on bandwidth and/or device capabilities as well as listener position/preference, as mentioned. - For example,
content 120 can include an audio scene with M Ambisonic orders, resulting in N=(M+1)² Ambisonic audio components. The Ambisonic audio components here can represent ambience, which should be understood as any sound in a scene other than the discrete objects, for example, ocean, vehicle traffic, wind, voices, music, etc. In addition, content 120 might contain numerous object-based audio signals that each represent a discrete sound source, such as a speaker's voice, a bird chirping, a door slamming, a pot whistling, a car honking, etc. - In such a case, the encoder can combine all the discrete sound sources with the ambience Ambisonic audio components, and include only a lower order first set of those Ambisonic audio components into the base layer 126. This set can include only an omnidirectional pattern (W) audio component, a first bi-directional polar pattern audio component aligned in a first (e.g., front-to-back) direction X, a second bi-directional polar pattern audio component aligned in a second (e.g., left-to-right) direction Y, and a third bi-directional polar pattern audio component aligned in a third (e.g., up-down) direction Z. Some discrete sound sources (objects) may be included outside of this process, i.e., sent discretely alongside the aforementioned Ambisonics signal but without being mixed into the Ambisonics signal. For example, as shown in
FIG. 1, base layer 126 can contain Ambisonics components with objects mixed in, as well as audio signals of objects that are not mixed with the Ambisonics components (e.g., discrete). - In
first enhancement layer 128, the encoder can include an object-based audio signal of at least one of the object-based audio assets (and associated metadata such as direction, depth, width, diffusion, and/or location) and/or additional Ambisonic audio components (for example, one or more second order Ambisonic audio components). Each successive layer that builds on the base layer can have additional Ambisonic audio components thereby increasing spatial resolution and providing additional discrete audio objects that can be rendered spatially by a downstream device. The object-based audio signals in the enhancement layers are not mixed into the Ambisonics signals of the enhancement layers. - With such a layered encoding of audio, if a downstream device senses that network bandwidth is small (e.g., 50-128 kbps), then the downstream device can communicate to the encoder to only transmit the base layer. If, moments later, the network bandwidth increases, for example, if the downstream device moves to a location with increased bandwidth (e.g., a stronger wireless connection, or from a wireless connection to a wired connection), then the downstream device can request that the encoder transmit enhancement layers 128-131. The downstream device can examine
metadata 124 that includes configuration data indicating a) one or more selectable layers (e.g., enhancement layers 128-131) that can be encoded and transmitted in the bit stream, and b) Ambisonic audio components or audio objects in each of the selectable layers. As is known, Ambisonic audio components can be indicated by their respective Ambisonics coefficients, which indicate a particular polar pattern (e.g., omni-directional, figure-of-eight, cardioid, etc.) and direction. - In one aspect, base layer 126 includes
selectable sublayers, which are shown in FIG. 4 and described in other sections. -
FIG. 2 shows an audio encoding and decoding process and system according to one aspect. An audio asset 78 can include Ambisonic components 82 that represent ambience sound and one or more audio objects 80. The ambience and audio objects can be combined to form a first or base layer of audio. For example, all of the audio objects, or a select subset of audio objects, can be converted at block 84 to Ambisonics components. In one aspect, if there are N ambience Ambisonic components, then the audio objects 80 can be converted, at block 84, into N Ambisonic components that contain sounds from the audio objects at their respective virtual locations. In another aspect, at block 84, the audio objects are converted into a single Ambisonic component, e.g., the omni-directional W component. - Audio objects can be converted to Ambisonic format by applying spherical harmonic functions and direction of each object to an audio signal of the object. Spherical harmonic functions Ymn(θ,ϕ) define Ambisonics encoding functions where θ represents azimuth angle and ϕ represents elevation angle of a location in spherical coordinates.
- A sound source S(t) can be converted to an Ambisonic component by applying Ymn(θs,ϕs) to S(t), where θs and ϕs describe the direction (azimuth and elevation angle) of the sound source relative to the center of the sphere (which can represent a recording and/or listener location). For example, Ambisonics components for a given sound source S(t) can be calculated for lower order Ambisonics components as follows:
-
W = B00 = S·Y00(θS, ϕS) = S -
X = B11 = S·Y11(θS, ϕS) = S·√3·cos θS·cos ϕS -
Y = B1-1 = S·Y1-1(θS, ϕS) = S·√3·sin θS·cos ϕS -
Z = B10 = S·Y10(θS, ϕS) = S·√3·sin ϕS - Other Ambisonic components (e.g., in higher orders) can be determined similarly, but with other known spherical harmonic functions Ymn. In such a manner, block 84 generates Ambisonic components with sounds from the audio objects 80. These Ambisonic components are combined at
block 85 with the ambience Ambisonics 82. The combined ambience and object-based Ambisonic audio components can be truncated at block 86 to remove Ambisonic audio components of an order greater than a threshold (e.g., beyond a first order, or beyond a second order). - For example, if the base layer is to only contain first order Ambisonics, and the combined Ambisonic components contain up to 8 orders of Ambisonic components, then any Ambisonic component that is greater than B10 can be removed at
block 86. The remaining Ambisonic audio components (e.g., W, X, Y, and Z) are the first set of Ambisonic audio components (having ambience and the object-based audio information) that are encoded in the first layer of the bit stream at block 90. - At least one of the audio objects 80 (e.g., audio object 81) can be encoded into a second layer at
encoder 92. As mentioned, additional Ambisonic components that contain ambience (such as one or more of the components that were removed from the Ambisonic components prior to encoding the first layer) can also be encoded in the second layer. Similarly, additional layers 93 can contain additional Ambisonic components (e.g., in ascending order) as well as additional audio objects. It should be understood that while the base layer includes the sounds of the audio objects and the ambience combined into Ambisonic components (with some optional objects sent discretely and not mixed into Ambisonic components), in the enhancement layers (the second and additional layers) the audio objects are encoded discretely, separate from the Ambisonic components (if also included in the same respective enhancement layer).
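For illustration, a minimal numerical sketch of the encoder-side conversion and mixing described above is given below, using the first-order equations listed earlier. The function names, the stand-in signals, and the use of NumPy are assumptions introduced here; a real encoder would also handle higher orders, the truncation threshold, and codec details.

```python
import numpy as np

# Minimal sketch of the first-order encoding equations above: a mono object
# signal S is panned into W, X, Y, Z using its azimuth/elevation, then mixed
# with the first-order ambience components. The sqrt(3) gains follow the
# equations above; a real encoder may use a different normalization convention.

def object_to_foa(s, azimuth, elevation):
    """s: mono samples (numpy array); angles in radians. Returns a (4, N) array."""
    w = s
    x = s * np.sqrt(3.0) * np.cos(azimuth) * np.cos(elevation)
    y = s * np.sqrt(3.0) * np.sin(azimuth) * np.cos(elevation)
    z = s * np.sqrt(3.0) * np.sin(elevation)
    return np.stack([w, x, y, z])

def build_base_layer(ambience_foa, objects):
    """ambience_foa: (4, N) ambience in B-format; objects: list of (signal, az, el)."""
    mix = ambience_foa.copy()
    for s, az, el in objects:
        mix += object_to_foa(s, az, el)      # objects are mixed into the Ambisonics bed
    return mix                               # higher-order components would be truncated here

fs = 48000
t = np.arange(fs) / fs
ambience = 0.01 * np.random.randn(4, fs)     # stand-in ambience bed
chirp = 0.1 * np.sin(2 * np.pi * 2000 * t)   # stand-in discrete sound source
base = build_base_layer(ambience, [(chirp, np.deg2rad(-90), 0.0)])
print(base.shape)                            # (4, 48000)
```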
Metadata 94 describes how the different information (audio objects and Ambisonic components) is bundled into different layers, allowing a downstream device to select (e.g., through a request to the server with live or offline encoded bitstreams) which enhancement layers the downstream device wishes to receive. As mentioned, this selection can be based on different factors, e.g., a bandwidth of the network 96, listener position/preference, and/or capabilities of the downstream device. Listener position can indicate which way a listener is facing or focusing on. Listener preferences can be stored based on user settings or device settings, and/or adjusted through a user interface. A preference can indicate a preferred position of the listener, for example, in a listener environment. - A decoding device can receive the first (base) layer and, if selected, additional layers (e.g., the second layer). The downstream device can decode, at
block 98, the first layer having the first set of Ambisonic components that were generated by the encoder based on ambience 82 and object-based audio 80. At block 100, the second layer having at least one of the object-based audio signals (e.g., audio object 81) is decoded. If the second layer also has additional Ambisonic audio components (e.g., a second set of Ambisonic audio components), then these components can be concatenated to the first set that were decoded from the first layer at block 102. It should be understood that each layer contains unique non-repeating Ambisonic audio components. In other words, an Ambisonic component corresponding to a particular set of Ambisonic coefficients (e.g., B10) will not be repeated in any of the layers. Thus, each Ambisonic component builds on another and they can be concatenated with relative ease. - At
block 103, the object-based audio extracted from the second layer is subtracted from the received Ambisonic audio components. The received Ambisonic audio components can include the first set of Ambisonic audio components received in the first layer, and any additional Ambisonic audio components received in the other layers. Prior to subtraction, the object-based audio can be converted to Ambisonic format at block 104, and the converted audio is then subtracted from the received Ambisonic audio components. The resulting set of Ambisonic audio components is rendered by an Ambisonic renderer at block 106 into a first set of playback channels. Subtracting the object-based audio from the Ambisonic audio components at block 103 prevents audio artifacts that might be noticeable if the same sound source were rendered by both the spatial renderer and the Ambisonics renderer. - The object-based audio received from the second layer is rendered spatially at
block 108, e.g., by applying transfer functions or impulse responses to each object-based audio signal, to generate a second set of playback channels. At block 105, the first set of playback channels and the second set of playback channels are combined into a plurality of speaker channels that are used to drive a plurality of speakers 109. Speakers 109 can be integrated into cabinet loudspeakers, one or more speaker arrays, and/or head-worn speakers (e.g., in-ear, on-ear, or extra-aural). - It should be understood that although blocks are shown in the encoding and decoding of only the second layer, additional layers can be encoded and decoded in the same manner, having additional object-based audio and additional Ambisonic audio components. For example, the decoding device can communicate a request to the encoding device/server to include a third layer of data in the bit stream. The encoder can then encode the third layer with additional audio objects and/or Ambisonic audio components, different from those already included in the second layer and first layer.
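A corresponding decode-side sketch is shown below for illustration, reusing object_to_foa from the earlier sketch. The toy stereo renderers are assumptions introduced here; an actual implementation would use a proper Ambisonic renderer and an object/near-field renderer.

```python
import numpy as np

# Minimal sketch of the decode path described above (reuses object_to_foa from
# the earlier sketch). The two "renderers" are toy stereo stand-ins only.

def render_foa_stereo(foa):
    """Toy stereo decode: left/right cardioid-like mixes in the horizontal plane."""
    w, x, y, z = foa
    return np.stack([0.5 * (w + y), 0.5 * (w - y)])

def render_object_stereo(s, azimuth):
    """Toy amplitude pan of a mono object (positive azimuth = left)."""
    left = 0.5 * (1.0 + np.sin(azimuth)) * s
    right = 0.5 * (1.0 - np.sin(azimuth)) * s
    return np.stack([left, right])

def decode_and_render(foa_bed, discrete_objects):
    bed = foa_bed.copy()
    for s, az, el in discrete_objects:
        bed -= object_to_foa(s, az, el)      # avoid double rendering (blocks 103/104)
    out = render_foa_stereo(bed)             # Ambisonic render (block 106)
    for s, az, _el in discrete_objects:
        out += render_object_stereo(s, az)   # discrete object render (block 108)
    return out                               # combined playback channels (block 105)
```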
-
FIG. 3 illustrates aspects of the present disclosure with one or more different layer stacks, each having different layered audio information. For example, a first stack option 144 and a second stack option 142 can be supported by an encoder, and configuration data for each enhancement layer can be included in metadata 252. These stack options share base layer 254. The first stack option can have a first enhancement layer 258 that contains additional Ambisonics coefficients 5, 6, and 7. The second stack option can include different enhancement layers than those in the first stack option. For example, a first enhancement layer 136 includes a first object-based signal; a second enhancement layer 138 includes an object 2 and Ambisonics coefficients 5, 6, 7; and additional enhancement layers 140 contain higher orders of Ambisonics coefficients and additional audio objects. In one aspect, each additional layer (or enhancement layer) can build on the previous layer, having a same or higher order of Ambisonic components than the previous layer.
- A first device 150 (e.g., a decoder and/or playback device) can examine
metadata 252 to determine which stack option is most suitable to be received. Seeing that stack option 2 contains object-based audio signals, the first device can select the first stack option if it does not have the computational or near-field capabilities to render the object-based audio. This can prevent wasted bandwidth and encoder-side computation that would otherwise be spent encoding audio information of the second stack option and then transmitting it over a communication protocol. - Similarly, a
second device 160 might choose to select the second stack option based on audio capabilities of the second device. The second device can select which of the layers in the second stack option is to be received based on available bandwidth. For example, if the second device has a bandwidth below 128 kbps, then the second device can request that only the base layer be sent. At a later time, if bandwidth increases, then the second device can request enhancement layers. To further illustrate this, a graph 161 shows layer consumption of the second device at different time intervals T1-T4. At T1, bandwidth might be especially low, so the device requests that only the first sublayer of the base layer be sent. As bandwidth gradually increases (e.g., the device's communication connection grows stronger), additional sublayers and enhancement layers are requested and received at T2, T3, and T4. - Similarly, a graph 251 can show layer consumption of the
first device 150. Here, the device might have a high bandwidth at time T1, but then a low bandwidth at T2. At T1, the device can request and consume all of the base layer and enhancement layer 1. At T2, the device requests that only the first sublayer of the first layer be sent. The bandwidth, however, can increase at times T3 and T4. Thus, the first device can request and consume additional sublayers and enhancement layer 1 at time T4. A receiving device can select different layers and sublayers at different times, from one audio frame to another audio frame.
-
FIG. 4 shows an encoding and decoding process and system that uses parametric coding, according to one aspect of the present disclosure. As discussed, in low-bitrate applications of spatial audio coding, the available bandwidth (e.g., bitrate) of a given data channel may vary during playback. When the bandwidth drops, various tradeoffs can be made to optimize the user experience in terms of audio quality degradation. For a user listening to immersive audio at higher bitrates, the experience could suffer if the codec can only deliver stereo or mono audio once the bitrate drops below a certain threshold.
- The encoder can encode a
first layer 170, a second layer 172, and a third layer 174. In one aspect, these layers correspond, respectively, to the first sublayer, second sublayer, and third sublayer of the base layer described in other sections and shown in FIGS. 1-3. The first layer can be encoded to contain the omni-directional FOA component W. The second layer can be encoded using parametric coding to include a summation signal of X, Y, and Z. The resulting FOA signal can be rendered for any speaker layout. The third layer can be encoded to include two of the three components X, Y, and Z without parametric coding. The third component that is not included in the third layer can be derived by the decoder by subtracting the two components from the summation of the second layer, as sketched below.
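For illustration only, the following sketch shows the layer split described above; the data layout and function name are assumptions introduced here, and the actual entropy/transform coding of each stream is omitted.

```python
import numpy as np

# Illustrative sketch of the three-layer split: layer 1 carries W, layer 2 carries
# the downmix S of X, Y, and Z (plus parameters, generated separately), and layer 3
# carries two of the three components. Codec details are omitted.

def split_layers(foa, keep=("X", "Y")):
    """foa: (4, N) array ordered W, X, Y, Z; keep: the two components sent in layer 3."""
    w, x, y, z = foa
    s = x + y + z                                    # summation signal for the second layer
    components = {"X": x, "Y": y, "Z": z}
    layer3 = {name: components[name] for name in keep}
    return {"layer1": w, "layer2": s, "layer3": layer3}

foa = np.random.randn(4, 960)                        # one 20 ms frame at 48 kHz
layers = split_layers(foa)
print(sorted(layers["layer3"]))                      # ['X', 'Y']
```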
Ambisonic components 358 can include first order Ambisonics W, X, Y and Z. In one aspect, W, X, Y and Z each contain sounds from converted object-based audio signals as well as ambience, as shown in FIG. 2 and described in other sections. In one aspect, the blocks shown on the encoder side of FIG. 4 can correspond to block 90 of FIG. 2, and the blocks shown on the decoder side of FIG. 4 can correspond to block 98 of FIG. 2. In another aspect, the encoder and decoder blocks shown in FIG. 4 can be performed independently of other enhancement layers and do not require the other aspects described in FIGS. 1-3. - The
first layer 170 contains only Ambisonic component W, which has an omni-directional polar pattern. An encoder block 164 can encode W using known audio codecs. W can be played back as a mono signal. - In the
second layer 172, three Ambisonic audio components are combined at block 162, the components including a) a first bi-directional polar pattern audio component aligned in a first direction (e.g., front-back or ‘X’), b) a second bi-directional polar pattern audio component aligned in a second direction (e.g., left-right or ‘Y’), and c) a third bi-directional polar pattern audio component aligned in a third direction (e.g., up-down or ‘Z’), resulting in a combined or downmixed channel S. The combined channel S is encoded at block 164. - A
filter bank 260 can be used to filter each component to extract sub-bands of each component. Theparameter generation block 166 can generate parameters for each of the sub-bands of each component. In one aspect,filter bank 260 can use critical band filters to extract only the sub-bands that are audible (e.g., using critical band filters). Parameters can define correlation between the three Ambisonic audio components, level differences between the three Ambisonic audio components, and/or phase differences between the three Ambisonic audio components. Other control parameters can be generated to improve spatial reproduction quality. The parameters can be associated for different time frames. For example, for a uniform time resolution, each time frame can have a constant duration of 20 ms. - In the third layer 174, encoder blocks 164 encode only two of the three Ambisonic components (for example, X and Y, Y and Z, or X and Z) that are in the summed channel. At the decoder side, the remaining Ambisonic component (not included in the third layer) can be derived based on a) the summed channel (received in the second layer) and b) the two Ambisonic components (received in the third layer).
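A simplified sketch of per-frame, per-band parameter generation follows; the choice of an FFT-based band split and energy-ratio parameters is an assumption introduced here for illustration (the disclosure also mentions correlation and phase-difference parameters, which are omitted in this sketch).

```python
import numpy as np

# Illustrative sketch of per-band parameter generation on 20 ms frames. Here the
# parameters are simple energy ratios of X, Y, Z relative to the downmix S.

def band_energies(frame, n_bands=8):
    spec = np.abs(np.fft.rfft(frame)) ** 2
    edges = np.linspace(0, spec.size, n_bands + 1, dtype=int)
    return np.array([spec[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])

def frame_parameters(x, y, z, eps=1e-12):
    s = x + y + z                                    # downmix carried in the second layer
    e_s = band_energies(s) + eps
    return {name: band_energies(sig) / e_s for name, sig in (("X", x), ("Y", y), ("Z", z))}

fs = 48000
n = int(0.020 * fs)                                  # one 20 ms frame
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 440 * t); y = 0.5 * x; z = 0.1 * np.random.randn(n)
params = frame_parameters(x, y, z)
print({k: v.round(2) for k, v in params.items()})
```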
- In one aspect, optional weighting coefficients a, b, and c can be applied to each of the three Ambisonic audio components X, Y and Z to optimize the downmix of the three components into the S signal. The application of the coefficients can improve alignment in level and/or reduce signal cancelation. In one aspect, the weighting coefficients can be applied in the sub-band domain.
- A decoder can select how many of the layers to consume, based on bandwidth. With only the first layer, a listener can still hear ambience and sound sources of a sound scene, although spatial resolution is minimal. As bandwidth increases, the decoder can additionally request and receive the
second layer 172 and then the third layer 174. - A decoder can decode the summed channel and Ambisonic components at
blocks 168, which can ultimately be rendered by an Ambisonic renderer to generate channels that can be used to drive speakers (e.g., an array of loudspeakers or a headphone set). For example, the first layer of the received bit stream can be decoded at block 168 to extract Ambisonic component W′. At the second layer 172, the summed signal S′ can be extracted at block 168, and a parameter applicator 170 can apply the one or more parameters (generated at block 166) to S′ to generate Ambisonic audio components X′, Y′, and Z′. These Ambisonic audio components can be compressed versions of X, Y, and Z. The trade-off, however, is that transmitting the summed channel and the parameters requires less bandwidth than transmitting X, Y, and Z. Thus, the summed signal S of the second layer provides a compressed version of X, Y, and Z. - In the third layer, two of the three summed components (for example, X and Y, Y and Z, or X and Z) can be decoded at
block 168. The decoder can, at block 176, subtract these two components from the summed channel. In the case where the decoder has received the third layer, the decoder can simply ignore the parameters and skip the parameter application block 170 to reduce processing overhead. Rather, the decoder need only decode the two received components, and then extract the third Ambisonic component from the summed channel through subtraction, as sketched below.
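A minimal sketch of the two decoder paths follows; the broadband gain stand-in for the parameter applicator and the dictionary layout are assumptions introduced here.

```python
import numpy as np

# Minimal sketch of the two decoder paths described above. With only the second
# layer, X', Y', Z' are approximated from the downmix S' (here by a crude
# broadband scaling; block 170 would apply per-band parameters). With the third
# layer, the missing component is recovered by subtraction (block 176).

def decode_from_parameters(s_prime, broadband_gains):
    """broadband_gains: dict like {"X": gx, "Y": gy, "Z": gz} (assumed parameters)."""
    return {name: g * s_prime for name, g in broadband_gains.items()}

def decode_with_third_layer(s_prime, received):
    """received: dict holding two of "X", "Y", "Z"; the third is derived."""
    missing = ({"X", "Y", "Z"} - set(received)).pop()
    derived = s_prime - sum(received.values())
    return {**received, missing: derived}

s = np.array([1.0, 2.0, 3.0])
xyz = decode_with_third_layer(s, {"X": np.array([0.2, 0.4, 0.6]),
                                  "Y": np.array([0.3, 0.6, 0.9])})
print(xyz["Z"])   # [0.5 1.  1.5]
```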
- In one aspect, the
parameter applicator block 170 applies the parameters to the summed signal S′ in the sub-band domain. A filter bank 169 can be used to convert S′ into sub-bands. After the parameters are applied, inverse blocks 172 can be used to convert those sub-bands back to full-band signals. Further, if weighting coefficients are applied at the encoder, inverse coefficients can be applied at the decoder as shown, to reconstruct the Ambisonic components X, Y and Z. - The layered structure shown in
FIG. 4 can beneficially reduce bitrate when needed, but still allow for improved spatial quality when greater bandwidth is present. For example, a decoder can receive only the first layer 170 and play back the W signal in mono if bitrate reduction is critical. Alternatively, the decoder can also receive the second layer, in which case the decoder can reuse the W signal received in the first layer and reconstruct X′, Y′, and Z′ based on the summed signal S′. Bit rate in this case is still improved in comparison to transmitting X, Y, and Z, because the data footprint for S′ and the parameters is significantly less than the bit rate required for X, Y, and Z, the footprint of the parameters being negligible. Alternatively, the decoder can receive all three layers 170, 172, and 174. -
FIG. 5 shows a block diagram of audio processing system hardware, in one aspect, which may be used with any of the aspects described herein (e.g., headphone set, mobile device, media player, or television). This audio processing system can represent a general purpose computer system or a special purpose computer system. Note that while FIG. 5 illustrates the various components of an audio processing system that may be incorporated into headphones, speaker systems, microphone arrays and entertainment systems, it is merely one example of a particular implementation and is intended only to illustrate the types of components that may be present in the audio processing system. FIG. 5 is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to the aspects herein. It will also be appreciated that other types of audio processing systems that have fewer components than shown or more components than shown in FIG. 5 can also be used. Accordingly, the processes described herein are not limited to use with the hardware and software of FIG. 5. - As shown in
FIG. 5, the audio processing system 150 (for example, a laptop computer, a desktop computer, a mobile phone, a smart phone, a tablet computer, a smart speaker, a head mounted display (HMD), a headphone set, or an infotainment system for an automobile or other vehicle) includes one or more buses 162 that serve to interconnect the various components of the system. One or more processors 152 are coupled to bus 162 as is known in the art. The processor(s) may be microprocessors or special purpose processors, a system on chip (SOC), a central processing unit, a graphics processing unit, a processor created through an Application Specific Integrated Circuit (ASIC), or combinations thereof. Memory 151 can include Read Only Memory (ROM), volatile memory, and non-volatile memory, or combinations thereof, coupled to the bus using techniques known in the art. Camera 158 and display 160 can be coupled to the bus. -
Memory 151 can be connected to the bus and can include DRAM, a hard disk drive or a flash memory or a magnetic optical drive or magnetic memory or an optical drive or other types of memory systems that maintain data even after power is removed from the system. In one aspect, the processor 152 retrieves computer program instructions stored in a machine readable storage medium (memory) and executes those instructions to perform operations described herein. - Audio hardware, although not shown, can be coupled to the one or
more buses 162 in order to receive audio signals to be processed and output by speakers 156. Audio hardware can include digital to analog and/or analog to digital converters. Audio hardware can also include audio amplifiers and filters. The audio hardware can also interface with microphones 154 (e.g., microphone arrays) to receive audio signals (whether analog or digital), digitize them if necessary, and communicate the signals to the bus 162. -
Communication module 164 can communicate with remote devices and networks. For example, communication module 164 can communicate over known technologies such as Wi-Fi, 3G, 4G, 5G, Bluetooth, ZigBee, or other equivalent technologies. The communication module can include wired or wireless transmitters and receivers that can communicate (e.g., receive and transmit data) with networked devices such as servers (e.g., the cloud) and/or other devices such as remote speakers and remote microphones. - It will be appreciated that the aspects disclosed herein can utilize memory that is remote from the system, such as a network storage device which is coupled to the audio processing system through a network interface such as a modem or Ethernet interface. The
buses 162 can be connected to each other through various bridges, controllers and/or adapters as is well known in the art. In one aspect, one or more network device(s) can be coupled to thebus 162. The network device(s) can be wired network devices (e.g., Ethernet) or wireless network devices (e.g., WI-FI, Bluetooth). In some aspects, various aspects described (e.g., simulation, analysis, estimation, modeling, object detection, etc.,) can be performed by a networked server in communication with the capture device. - Various aspects described herein may be embodied, at least in part, in software. That is, the techniques may be carried out in an audio processing system in response to its processor executing a sequence of instructions contained in a storage medium, such as a non-transitory machine-readable storage medium (e.g. DRAM or flash memory). In various aspects, hardwired circuitry may be used in combination with software instructions to implement the techniques described herein. Thus the techniques are not limited to any specific combination of hardware circuitry and software, or to any particular source for the instructions executed by the audio processing system.
- In the description, certain terminology is used to describe features of various aspects. For example, in certain situations, the terms “analyzer”, “separator”, “renderer”, “encoder”, “decoder”, “truncator”, “estimator”, “combiner”, “synthesizer”, “controller”, “localizer”, “spatializer”, “component,” “unit,” “module,” “logic”, “extractor”, “subtractor”, “generator”, “applicator”, “optimizer”, “processor”, “mixer”, “detector”, “calculator”, and “simulator” are representative of hardware and/or software configured to perform one or more processes or functions. For instance, examples of “hardware” include, but are not limited or restricted to an integrated circuit such as a processor (e.g., a digital signal processor, microprocessor, application specific integrated circuit, a micro-controller, etc.). Thus, different combinations of hardware and/or software can be implemented to perform the processes or functions described by the above terms, as understood by one skilled in the art. Of course, the hardware may be alternatively implemented as a finite state machine or even combinatorial logic. An example of “software” includes executable code in the form of an application, an applet, a routine or even a series of instructions. As mentioned above, the software may be stored in any type of machine-readable medium.
- Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the audio processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as those set forth in the claims below, refer to the action and processes of an audio processing system, or similar electronic device, that manipulates and transforms data represented as physical (electronic) quantities within the system's registers and memories into other data similarly represented as physical quantities within the system memories or registers or other such information storage, transmission or display devices.
- The processes and blocks described herein are not limited to the specific examples described and are not limited to the specific orders used as examples herein. Rather, any of the processing blocks may be re-ordered, combined or removed, performed in parallel or in serial, as necessary, to achieve the results set forth above. The processing blocks associated with implementing the audio processing system may be performed by one or more programmable processors executing one or more computer programs stored on a non-transitory computer readable storage medium to perform the functions of the system. All or part of the audio processing system may be implemented as, special purpose logic circuitry (e.g., an FPGA (field-programmable gate array) and/or an ASIC (application-specific integrated circuit)). All or part of the audio system may be implemented using electronic hardware circuitry that include electronic devices such as, for example, at least one of a processor, a memory, a programmable logic device or a logic gate. Further, processes can be implemented in any combination hardware devices and software components.
- While certain aspects have been described and shown in the accompanying drawings, it is to be understood that such aspects are merely illustrative of and not restrictive on the broad invention, and the invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those of ordinary skill in the art. The description is thus to be regarded as illustrative instead of limiting.
- To aid the Patent Office and any readers of any patent issued on this application in interpreting the claims appended hereto, applicants wish to note that they do not intend any of the appended claims or claim elements to invoke 35 U.S.C. 112(f) unless the words “means for” or “step for” are explicitly used in the particular claim.
- It is well understood that the use of personally identifiable information should follow privacy policies and practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining the privacy of users. In particular, personally identifiable information data should be managed and handled so as to minimize risks of unintentional or unauthorized access or use, and the nature of authorized use should be clearly indicated to users.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/739,901 US20220262373A1 (en) | 2019-09-26 | 2022-05-09 | Layered coding of audio with discrete objects |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/584,706 US11430451B2 (en) | 2019-09-26 | 2019-09-26 | Layered coding of audio with discrete objects |
US17/739,901 US20220262373A1 (en) | 2019-09-26 | 2022-05-09 | Layered coding of audio with discrete objects |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/584,706 Division US11430451B2 (en) | 2019-09-26 | 2019-09-26 | Layered coding of audio with discrete objects |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220262373A1 true US20220262373A1 (en) | 2022-08-18 |
Family
ID=75040952
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/584,706 Active 2039-11-28 US11430451B2 (en) | 2019-09-26 | 2019-09-26 | Layered coding of audio with discrete objects |
US17/739,901 Pending US20220262373A1 (en) | 2019-09-26 | 2022-05-09 | Layered coding of audio with discrete objects |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/584,706 Active 2039-11-28 US11430451B2 (en) | 2019-09-26 | 2019-09-26 | Layered coding of audio with discrete objects |
Country Status (2)
Country | Link |
---|---|
US (2) | US11430451B2 (en) |
CN (1) | CN112562696B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210409886A1 (en) * | 2020-06-29 | 2021-12-30 | Qualcomm Incorporated | Sound field adjustment |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220391167A1 (en) * | 2021-06-02 | 2022-12-08 | Tencent America LLC | Adaptive audio delivery and rendering |
US20240098439A1 (en) * | 2022-09-15 | 2024-03-21 | Sony Interactive Entertainment Inc. | Multi-order optimized ambisonics encoding |
CN117953924A (en) * | 2023-12-13 | 2024-04-30 | 方博科技(深圳)有限公司 | Method for detecting discrete sound contained in noise |
Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080004729A1 (en) * | 2006-06-30 | 2008-01-03 | Nokia Corporation | Direct encoding into a directional audio coding format |
US20120078642A1 (en) * | 2009-06-10 | 2012-03-29 | Jeong Il Seo | Encoding method and encoding device, decoding method and decoding device and transcoding method and transcoder for multi-object audio signals |
US20120155653A1 (en) * | 2010-12-21 | 2012-06-21 | Thomson Licensing | Method and apparatus for encoding and decoding successive frames of an ambisonics representation of a 2- or 3-dimensional sound field |
US20120314878A1 (en) * | 2010-02-26 | 2012-12-13 | France Telecom | Multichannel audio stream compression |
US20130216070A1 (en) * | 2010-11-05 | 2013-08-22 | Florian Keiler | Data structure for higher order ambisonics audio data |
US20140350944A1 (en) * | 2011-03-16 | 2014-11-27 | Dts, Inc. | Encoding and reproduction of three dimensional audio soundtracks |
US20150271621A1 (en) * | 2014-03-21 | 2015-09-24 | Qualcomm Incorporated | Inserting audio channels into descriptions of soundfields |
US20160029140A1 (en) * | 2013-04-03 | 2016-01-28 | Dolby International Ab | Methods and systems for generating and interactively rendering object based audio |
US20160104494A1 (en) * | 2014-10-10 | 2016-04-14 | Qualcomm Incorporated | Signaling channels for scalable coding of higher order ambisonic audio data |
US20160125890A1 (en) * | 2013-06-05 | 2016-05-05 | Thomson Licensing | Method for encoding audio signals, apparatus for encoding audio signals, method for decoding audio signals and apparatus for decoding audio signals |
US20160225377A1 (en) * | 2013-10-17 | 2016-08-04 | Socionext Inc. | Audio encoding device and audio decoding device |
US20160302005A1 (en) * | 2015-04-10 | 2016-10-13 | B<>Com | Method for processing data for the estimation of mixing parameters of audio signals, mixing method, devices, and associated computers programs |
US20170366912A1 (en) * | 2016-06-17 | 2017-12-21 | Dts, Inc. | Ambisonic audio rendering with depth decoding |
US20180295466A1 (en) * | 2017-04-06 | 2018-10-11 | General Electric Company | Healthcare asset beacon |
US20180338212A1 (en) * | 2017-05-18 | 2018-11-22 | Qualcomm Incorporated | Layered intermediate compression for higher order ambisonic audio data |
WO2019068638A1 (en) * | 2017-10-04 | 2019-04-11 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus, method and computer program for encoding, decoding, scene processing and other procedures related to dirac based spatial audio coding |
US20190214026A1 (en) * | 2014-03-21 | 2019-07-11 | Dolby Laboratories Licensing Corporation | Methods and apparatus for decompressing a compressed hoa signal |
US20200068336A1 (en) * | 2017-04-13 | 2020-02-27 | Sony Corporation | Signal processing apparatus and method as well as program |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7313430B2 (en) * | 2003-08-28 | 2007-12-25 | Medtronic Navigation, Inc. | Method and apparatus for performing stereotactic surgery |
JP5635097B2 (en) * | 2009-08-14 | 2014-12-03 | ディーティーエス・エルエルシーDts Llc | System for adaptively streaming audio objects |
US9584912B2 (en) * | 2012-01-19 | 2017-02-28 | Koninklijke Philips N.V. | Spatial audio rendering and encoding |
CN104364842A (en) * | 2012-04-18 | 2015-02-18 | 诺基亚公司 | Stereo audio signal encoder |
WO2014013070A1 (en) * | 2012-07-19 | 2014-01-23 | Thomson Licensing | Method and device for improving the rendering of multi-channel audio signals |
WO2014046916A1 (en) * | 2012-09-21 | 2014-03-27 | Dolby Laboratories Licensing Corporation | Layered approach to spatial audio coding |
WO2014165806A1 (en) * | 2013-04-05 | 2014-10-09 | Dts Llc | Layered audio coding and transmission |
US9716959B2 (en) * | 2013-05-29 | 2017-07-25 | Qualcomm Incorporated | Compensating for error in decomposed representations of sound fields |
EP2922057A1 (en) | 2014-03-21 | 2015-09-23 | Thomson Licensing | Method for compressing a Higher Order Ambisonics (HOA) signal, method for decompressing a compressed HOA signal, apparatus for compressing a HOA signal, and apparatus for decompressing a compressed HOA signal |
WO2015145782A1 (en) * | 2014-03-26 | 2015-10-01 | Panasonic Corporation | Apparatus and method for surround audio signal processing |
US10140996B2 (en) * | 2014-10-10 | 2018-11-27 | Qualcomm Incorporated | Signaling layers for scalable coding of higher order ambisonic audio data |
US20180213202A1 (en) * | 2017-01-23 | 2018-07-26 | Jaunt Inc. | Generating a Video Stream from a 360-Degree Video |
WO2019004524A1 (en) * | 2017-06-27 | 2019-01-03 | 엘지전자 주식회사 | Audio playback method and audio playback apparatus in six degrees of freedom environment |
AR112504A1 (en) | 2017-07-14 | 2019-11-06 | Fraunhofer Ges Forschung | CONCEPT TO GENERATE AN ENHANCED SOUND FIELD DESCRIPTION OR A MODIFIED SOUND FIELD USING A MULTI-LAYER DESCRIPTION |
US11798569B2 (en) * | 2018-10-02 | 2023-10-24 | Qualcomm Incorporated | Flexible rendering of audio data |
WO2020159602A1 (en) * | 2019-01-28 | 2020-08-06 | Embody Vr, Inc | Spatial audio is received from an audio server over a first communication link. the spatial audio is converted by a cloud spatial audio processing system into binaural audio. the binauralized audio is streamed from the cloud spatial audio processing system to a mobile station over a second communication link to cause the mobile station to play the binaural audio on the personal audio delivery device |
-
2019
- 2019-09-26 US US16/584,706 patent/US11430451B2/en active Active
-
2020
- 2020-08-17 CN CN202010824443.8A patent/CN112562696B/en active Active
-
2022
- 2022-05-09 US US17/739,901 patent/US20220262373A1/en active Pending
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080004729A1 (en) * | 2006-06-30 | 2008-01-03 | Nokia Corporation | Direct encoding into a directional audio coding format |
US20120078642A1 (en) * | 2009-06-10 | 2012-03-29 | Jeong Il Seo | Encoding method and encoding device, decoding method and decoding device and transcoding method and transcoder for multi-object audio signals |
US20120314878A1 (en) * | 2010-02-26 | 2012-12-13 | France Telecom | Multichannel audio stream compression |
US20130216070A1 (en) * | 2010-11-05 | 2013-08-22 | Florian Keiler | Data structure for higher order ambisonics audio data |
US20120155653A1 (en) * | 2010-12-21 | 2012-06-21 | Thomson Licensing | Method and apparatus for encoding and decoding successive frames of an ambisonics representation of a 2- or 3-dimensional sound field |
US20140350944A1 (en) * | 2011-03-16 | 2014-11-27 | Dts, Inc. | Encoding and reproduction of three dimensional audio soundtracks |
US20160029140A1 (en) * | 2013-04-03 | 2016-01-28 | Dolby International Ab | Methods and systems for generating and interactively rendering object based audio |
US20160125890A1 (en) * | 2013-06-05 | 2016-05-05 | Thomson Licensing | Method for encoding audio signals, apparatus for encoding audio signals, method for decoding audio signals and apparatus for decoding audio signals |
US20160225377A1 (en) * | 2013-10-17 | 2016-08-04 | Socionext Inc. | Audio encoding device and audio decoding device |
US20190214026A1 (en) * | 2014-03-21 | 2019-07-11 | Dolby Laboratories Licensing Corporation | Methods and apparatus for decompressing a compressed hoa signal |
US20150271621A1 (en) * | 2014-03-21 | 2015-09-24 | Qualcomm Incorporated | Inserting audio channels into descriptions of soundfields |
US20160104494A1 (en) * | 2014-10-10 | 2016-04-14 | Qualcomm Incorporated | Signaling channels for scalable coding of higher order ambisonic audio data |
US20160302005A1 (en) * | 2015-04-10 | 2016-10-13 | B<>Com | Method for processing data for the estimation of mixing parameters of audio signals, mixing method, devices, and associated computers programs |
US20170366912A1 (en) * | 2016-06-17 | 2017-12-21 | Dts, Inc. | Ambisonic audio rendering with depth decoding |
US20180295466A1 (en) * | 2017-04-06 | 2018-10-11 | General Electric Company | Healthcare asset beacon |
US20200068336A1 (en) * | 2017-04-13 | 2020-02-27 | Sony Corporation | Signal processing apparatus and method as well as program |
US20180338212A1 (en) * | 2017-05-18 | 2018-11-22 | Qualcomm Incorporated | Layered intermediate compression for higher order ambisonic audio data |
WO2019068638A1 (en) * | 2017-10-04 | 2019-04-11 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus, method and computer program for encoding, decoding, scene processing and other procedures related to dirac based spatial audio coding |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210409886A1 (en) * | 2020-06-29 | 2021-12-30 | Qualcomm Incorporated | Sound field adjustment |
US11558707B2 (en) * | 2020-06-29 | 2023-01-17 | Qualcomm Incorporated | Sound field adjustment |
US12120497B2 (en) | 2020-06-29 | 2024-10-15 | Qualcomm Incorporated | Sound field adjustment |
US12126982B2 (en) | 2020-06-29 | 2024-10-22 | Qualcomm Incorporated | Sound field adjustment |
Also Published As
Publication number | Publication date |
---|---|
US11430451B2 (en) | 2022-08-30 |
US20210098004A1 (en) | 2021-04-01 |
CN112562696A (en) | 2021-03-26 |
CN112562696B (en) | 2024-08-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI744341B (en) | Distance panning using near / far-field rendering | |
US11064310B2 (en) | Method, apparatus or systems for processing audio objects | |
US10187739B2 (en) | System and method for capturing, encoding, distributing, and decoding immersive audio | |
US20220262373A1 (en) | Layered coding of audio with discrete objects | |
CN112262585B (en) | Ambient stereo depth extraction | |
CN105247612B (en) | Spatial concealment is executed relative to spherical harmonics coefficient | |
US9479886B2 (en) | Scalable downmix design with feedback for object-based surround codec | |
CN114600188A (en) | Apparatus and method for audio coding | |
CN111630593A (en) | Method and apparatus for decoding sound field representation signals | |
GB2574667A (en) | Spatial audio capture, transmission and reproduction | |
CN115346537A (en) | Audio coding and decoding method and device | |
US20230410823A1 (en) | Spatial audio parameter encoding and associated decoding | |
KR20240001226A (en) | 3D audio signal coding method, device, and encoder | |
CN115472170A (en) | Three-dimensional audio signal processing method and device | |
US20240096335A1 (en) | Object Audio Coding | |
CN113994425B (en) | Apparatus and method for encoding and decoding scene-based audio data | |
EP3987513B1 (en) | Quantizing spatial components based on bit allocations determined for psychoacoustic audio coding | |
KR20240152893A (en) | Parametric spatial audio rendering | |
CN118800247A (en) | Method and device for decoding scene audio signals | |
CN118800244A (en) | Scene audio coding method and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: ADVISORY ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |