
WO2022257824A1 - Method and apparatus for processing a three-dimensional audio signal - Google Patents


Info

Publication number
WO2022257824A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal group
bit allocation
virtual
virtual loudspeaker
proportion
Prior art date
Application number
PCT/CN2022/096546
Other languages
English (en)
French (fr)
Inventor
刘帅
高原
夏丙寅
王宾
王喆
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP22819422.1A (EP4354430A4)
Priority to KR1020237044825A (KR20240013221A)
Publication of WO2022257824A1
Priority to US18/532,085 (US20240112684A1)


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/002 Dynamic bit allocation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/167 Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15 Aspects of sound capture and related signal processing for recording or reproduction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/03 Application of parametric coding in stereophonic audio systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/11 Application of ambisonics in stereophonic audio systems

Definitions

  • The present application relates to the technical field of audio processing, and in particular to a method and device for processing three-dimensional audio signals.
  • Three-dimensional audio technology has been widely used in wireless communication voice, virtual reality/augmented reality and media audio.
  • Three-dimensional audio technology is an audio technology for acquiring, processing, transmitting and rendering playback of sound events and three-dimensional sound field information in the real world.
  • Three-dimensional audio technology gives sound a strong sense of space, envelopment and immersion, providing listeners with an "immersive sound" auditory experience.
  • Higher order ambisonics (HOA) technology is independent of the speaker layout in the recording, encoding and playback stages, and HOA-format data can be rotated during playback, which gives it greater flexibility in three-dimensional audio playback. It has therefore received extensive attention and research.
  • The acquisition device (such as a microphone) collects a large amount of data to record the three-dimensional sound field information, and transmits the three-dimensional audio signal to the playback device (such as a speaker or earphone), so that the playback device can play the three-dimensional audio signal. Because the three-dimensional sound field information involves a large amount of data, a large amount of storage space is required to store it, and the bandwidth required to transmit the three-dimensional audio signal is relatively high. To solve these problems, the three-dimensional audio signal can be compressed, and the compressed data can be stored or transmitted.
  • The encoder can use multiple pre-configured virtual speakers to encode the 3D audio signal, but how to allocate bits to the resulting signals after encoding remains an unresolved problem.
  • Embodiments of the present application provide a method and device for processing a three-dimensional audio signal, which are used to implement bit allocation for the signal.
  • An embodiment of the present application provides a method for processing a three-dimensional audio signal, including: spatially encoding the three-dimensional audio signal to be encoded to obtain a transmission channel signal and transmission channel attribute information, wherein the transmission channel signal includes at least one virtual speaker signal group and at least one residual signal group; and determining the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group according to the transmission channel attribute information.
  • The transmission channel signal and the transmission channel attribute information are obtained by encoding the three-dimensional audio signal.
  • The transmission channel signal may include at least one virtual speaker signal group and at least one residual signal group.
  • The transmission channel attribute information can be used to determine the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group respectively, which solves the problem that the bit allocation of the signals could not be determined.
  • The transmission channel attribute information includes the coding efficiency of a virtual speaker; performing spatial encoding on the three-dimensional audio signal to be encoded to obtain the transmission channel attribute information includes: performing signal reconstruction on the three-dimensional audio signal to be encoded by using a virtual speaker to obtain a reconstructed three-dimensional audio signal; obtaining an energy representative value of the reconstructed three-dimensional audio signal and an energy representative value of the three-dimensional audio signal to be encoded; and obtaining the coding efficiency of the virtual speaker according to the energy representative value of the reconstructed three-dimensional audio signal and the energy representative value of the three-dimensional audio signal to be encoded.
  • The encoder first performs signal reconstruction using a virtual speaker to obtain a reconstructed three-dimensional audio signal.
  • The encoder can calculate the energy representative value of the signal of each transmission channel; for example, it can obtain the energy representative value of the reconstructed three-dimensional audio signal and the energy representative value of the three-dimensional audio signal to be encoded.
  • The energy representative value of the three-dimensional audio signal differs before and after signal reconstruction, so the coding efficiency of the virtual speaker can be calculated from the change in the energy representative value before and after signal reconstruction.
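The relationship between the two energy representative values and the coding efficiency can be sketched as follows. This is a minimal illustration under stated assumptions, not the formula of the application: it assumes that the energy representative value is the sum of squared samples over all channels, and that the coding efficiency is the ratio of reconstructed energy to original energy. The function names `energy_value` and `coding_efficiency` are hypothetical.

```python
def energy_value(channels):
    """Energy representative value of a multichannel signal.

    Assumption: the sum of squared samples over all channels.
    """
    return sum(sample * sample for channel in channels for sample in channel)


def coding_efficiency(original, reconstructed):
    """Ratio of reconstructed energy to original energy (assumed definition)."""
    original_energy = energy_value(original)
    if original_energy == 0:
        return 0.0  # silent input: no meaningful efficiency, report 0
    return energy_value(reconstructed) / original_energy


original = [[1.0, 2.0], [3.0, 4.0]]        # two channels, two samples each
reconstructed = [[0.5, 1.0], [1.5, 2.0]]   # reconstruction at half amplitude
efficiency = coding_efficiency(original, reconstructed)
```

In this toy input the reconstruction keeps a quarter of the original energy, so a low value would suggest that the virtual speakers capture the sound field poorly.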
  • The transmission channel attribute information includes the energy ratio of the virtual speaker signal group; the method further includes: obtaining the energy representative value of the virtual speaker signal group according to the energy representative value of each virtual speaker signal in the virtual speaker signal group; obtaining the energy representative value of the residual signal group according to the energy representative value of each residual signal in the residual signal group; and obtaining the energy ratio of the virtual speaker signal group according to the energy representative value of the virtual speaker signal group and the energy representative value of the residual signal group.
  • The encoder first obtains the energy representative value of each virtual speaker signal in the virtual speaker signal group, and then adds the energy representative values of all virtual speaker signals in the same group to obtain the energy representative value of the virtual speaker signal group.
  • The energy representative value of each virtual speaker signal group can be calculated in the above manner.
  • The encoder can obtain the energy representative value of the residual signal group according to the energy representative value of each residual signal in the residual signal group.
  • The encoder can obtain the energy ratio of the virtual speaker signal group according to the energy representative value of the virtual speaker signal group and the energy representative value of the residual signal group.
  • The energy ratio of the virtual speaker signal group indicates the share of the virtual speaker signal group in the total transmission channel signal energy. If this energy ratio is relatively low, the virtual speaker signal group is not dominant (that is, it is weaker) in the total transmission channel signal energy.
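The energy ratio described above can be sketched as follows. This is an illustrative sketch only: the text states that per-signal energy values are added to form a group value, but the per-signal energy measure (sum of squared samples) and the exact ratio formula (group energy divided by total transmission channel energy) are assumptions, and all function names are hypothetical.

```python
def signal_energy(samples):
    """Energy representative value of one signal (assumed: sum of squares)."""
    return sum(x * x for x in samples)


def group_energy(signals):
    """Group value: the per-signal energy values of the group added together."""
    return sum(signal_energy(signal) for signal in signals)


def virtual_speaker_energy_ratio(virtual_speaker_group, residual_group):
    """Share of the virtual speaker group in the total transmission channel energy."""
    speaker_energy = group_energy(virtual_speaker_group)
    residual_energy = group_energy(residual_group)
    total = speaker_energy + residual_energy
    if total == 0:
        return 0.0  # all-silent frame: treat the group as non-dominant
    return speaker_energy / total


ratio = virtual_speaker_energy_ratio(
    [[1.0, 1.0], [2.0, 0.0]],  # two virtual speaker signals, energy 2 + 4
    [[1.0, 1.0]],              # one residual signal, energy 2
)
```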
  • The transmission channel attribute information includes a virtual speaker coding identifier, which is used to indicate whether the bit allocation of the virtual speaker signal group is dominant; performing spatial encoding on the three-dimensional audio signal to be encoded to obtain the transmission channel attribute information includes: performing spatial encoding on the three-dimensional audio signal to be encoded to obtain the number of heterogeneous sound sources of the transmission channel signal and the virtual speaker coding efficiency; and obtaining the virtual speaker coding identifier according to the number of heterogeneous sound sources of the transmission channel signal and the virtual speaker coding efficiency.
  • After obtaining the number of heterogeneous sound sources of the transmission channel signal and the virtual speaker coding efficiency, the encoder determines the specific value of the virtual speaker coding identifier according to the judgment condition that these two quantities satisfy.
  • Obtaining the virtual speaker coding identifier according to the number of heterogeneous sound sources of the transmission channel signal and the virtual speaker coding efficiency includes: when the number of heterogeneous sound sources of the transmission channel signal is less than or equal to a preset threshold for the number of heterogeneous sound sources, and the virtual speaker coding efficiency is greater than or equal to a preset first virtual speaker coding efficiency threshold, determining that the virtual speaker coding identifier is dominant; or, when the number of heterogeneous sound sources of the transmission channel signal is greater than the preset threshold for the number of heterogeneous sound sources, or the virtual speaker coding efficiency is less than the preset first virtual speaker coding efficiency threshold, determining that the virtual speaker coding identifier is not dominant.
  • The encoder can determine the virtual speaker coding identifier by checking the number of heterogeneous sound sources and the virtual speaker coding efficiency against the above judgment conditions, so that the identifier can be used to determine the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group.
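The judgment condition above translates directly into a small predicate. The sketch below follows the two comparisons stated in the text; the default threshold values (2 sources, efficiency 0.5) are placeholders chosen only for illustration and do not come from the application, and the function name is hypothetical.

```python
def is_virtual_speaker_dominant(num_sources, efficiency,
                                source_threshold=2, efficiency_threshold=0.5):
    """Return True when the virtual speaker coding identifier is 'dominant'.

    Dominant: the number of heterogeneous sound sources is at most the source
    threshold AND the coding efficiency is at least the efficiency threshold.
    Not dominant: either comparison fails.
    The default thresholds are hypothetical placeholders.
    """
    return num_sources <= source_threshold and efficiency >= efficiency_threshold


few_sources_good_fit = is_virtual_speaker_dominant(1, 0.9)
many_sources = is_virtual_speaker_dominant(3, 0.9)
poor_fit = is_virtual_speaker_dominant(1, 0.1)
```

Because the non-dominant case is the logical negation of the dominant case, a single boolean expression covers both branches of the condition.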
  • The encoder can further subdivide the case where the virtual speaker coding identifier is dominant into two cases: sub-dominant and strongly dominant. If the virtual speaker coding identifier is strongly dominant, more bits need to be allocated to the virtual speaker signal group; for example, after the initial bit ratio of the virtual speaker signal group is determined, the bit ratio can be increased. If the virtual speaker coding identifier is sub-dominant, the virtual speaker signal group is allocated fewer bits than in the strongly dominant case, but it still needs to be allocated more bits than the residual signal group; for example, after the initial bit ratio of the virtual speaker signal group is determined, the bit ratio can be increased. The increase in the bit ratio is greater in the strongly dominant case than in the sub-dominant case.
  • The transmission channel attribute information includes the energy ratio of the virtual speaker signal group and/or a virtual speaker coding identifier; determining the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group according to the transmission channel attribute information includes: when the energy ratio of the virtual speaker signal group is greater than or equal to a preset first energy ratio threshold, and/or the virtual speaker coding identifier is strongly dominant, determining the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group according to a preset first signal group bit allocation algorithm; when the energy ratio of the virtual speaker signal group is greater than or equal to a preset second energy ratio threshold and less than the preset first energy ratio threshold, and/or the virtual speaker coding identifier is sub-dominant, determining the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group according to a preset second signal group bit allocation algorithm; wherein the second energy ratio threshold is smaller than the first energy ratio threshold.
  • The encoder can preset multiple signal group bit allocation algorithms and use a different algorithm when the transmission channel attribute information meets different conditions, so that the virtual speaker signal group and the residual signal group are assigned bit allocation ratios suited to the condition that is met. This improves the coding efficiency of the three-dimensional audio signal at the encoder.
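The choice among preset signal group bit allocation algorithms can be sketched as a threshold dispatch. For simplicity this sketch uses only the energy ratio and ignores the "and/or coding identifier" alternative; treating everything below the second threshold as a third algorithm is an assumption, and the threshold values 0.7 and 0.4 in the usage lines are hypothetical.

```python
def select_bit_allocation_algorithm(energy_ratio, first_threshold, second_threshold):
    """Pick which preset signal group bit allocation algorithm to apply.

    Uses only the energy ratio of the virtual speaker signal group; the
    coding identifier alternative is deliberately left out of this sketch.
    """
    if second_threshold >= first_threshold:
        raise ValueError("the second threshold must be smaller than the first")
    if energy_ratio >= first_threshold:
        return "first"   # virtual speaker group clearly dominates
    if energy_ratio >= second_threshold:
        return "second"  # intermediate case
    return "third"       # assumed fallback when the group is weak


strong = select_bit_allocation_algorithm(0.8, first_threshold=0.7, second_threshold=0.4)
medium = select_bit_allocation_algorithm(0.5, first_threshold=0.7, second_threshold=0.4)
weak = select_bit_allocation_algorithm(0.2, first_threshold=0.7, second_threshold=0.4)
```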
  • S is the number of heterogeneous sound sources
  • η represents the coding efficiency of the virtual speaker
  • maxdirectionalNrgRatio is the preset maximum bit allocation ratio of the virtual speaker signal group
  • the transmission channel signal includes a virtual speaker signal group and a residual signal group. After obtaining the bit allocation ratio Ratio1_1 of the virtual speaker signal group, the bit allocation ratio of the residual signal group can be obtained through the calculation formula of Ratio2 above.
  • the proportion of the bit allocation of each residual signal group in all residual signal groups can be determined according to the number of transmission channels of each residual signal group.
  • R_i/C represents the transmission channel ratio of the i-th residual signal group to all residual signal groups
  • the bit allocation ratio of the i-th residual signal group can be obtained through (R_i/C) and Ratio2.
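When there are several residual signal groups, the split of Ratio2 by channel count can be sketched as follows. The text only says that the ratio of the i-th group is obtained from R_i/C and Ratio2; multiplying the two, and reading C as the total number of residual transmission channels, are assumptions made for this sketch. The function name is hypothetical.

```python
def residual_group_ratios(ratio2, channels_per_group):
    """Split the residual bit allocation ratio across residual signal groups.

    Assumption: group i receives ratio2 * R_i / C, where R_i is its number of
    transmission channels and C is the channel count of all residual groups.
    """
    total_channels = sum(channels_per_group)
    if total_channels == 0:
        raise ValueError("at least one residual transmission channel is required")
    return [ratio2 * count / total_channels for count in channels_per_group]


# Two residual groups with 1 and 3 channels sharing a residual ratio of 0.4.
ratios = residual_group_ratios(0.4, [1, 3])
```

Under this reading the per-group ratios always sum back to Ratio2, so the split does not change the balance between virtual speaker and residual signals.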
  • Determining the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group according to the third signal group bit allocation algorithm includes: when directionalNrgRatio < TH3 is satisfied, or S > TH0 is satisfied, or η …
  • … the bit allocation ratio of the above, where TH3 is the second energy ratio threshold, TH4 is the first virtual speaker coding efficiency threshold, S is the number of heterogeneous sound sources, and η represents the coding efficiency of the virtual speaker.
  • The method further includes: determining the number of bits of the virtual speaker signal group and the number of bits of the residual signal group according to the bit allocation ratio of the virtual speaker signal group, the bit allocation ratio of the residual signal group, and the total number of transmission channel bits; and allocating bits to the virtual speaker signal group according to the number of bits of the virtual speaker signal group, and allocating bits to the residual signal group according to the number of bits of the residual signal group.
  • The encoder allocates bits to the virtual speaker signal group according to the number of bits of the virtual speaker signal group, and allocates bits to the residual signal group according to the number of bits of the residual signal group, which solves the problem that the encoder could not allocate bits to the virtual speaker signals and the residual signals.
  • The number of bits of the virtual speaker signal group and the number of bits of the residual signal group are determined according to the bit allocation ratio of the virtual speaker signal group, the bit allocation ratio of the residual signal group, and the total number of transmission channel bits, where Ratio1 is the bit allocation ratio of the virtual speaker signal group and C_bitnum is the total number of transmission channel bits.
  • The encoder can predetermine the total number of transmission channel bits, whose value is not limited. Using the above calculation formula, the encoder can obtain the number of bits of the virtual speaker signal group and the number of bits of the residual signal group, and thereby allocate bits to the virtual speaker signals and the residual signals.
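Turning the two ratios into bit counts can be sketched as below. The exact formula is not reproduced in this text, so the sketch assumes the natural reading: each group receives its ratio multiplied by C_bitnum, rounded down to a whole number of bits. The function name `allocate_group_bits` is hypothetical.

```python
import math


def allocate_group_bits(ratio1, ratio2, c_bitnum):
    """Convert bit allocation ratios into per-group bit counts.

    Assumption: bits = floor(ratio * C_bitnum) for each group, where ratio1
    belongs to the virtual speaker signal group and ratio2 to the residual
    signal group.
    """
    if ratio1 < 0 or ratio2 < 0 or ratio1 + ratio2 > 1:
        raise ValueError("ratios must be non-negative and sum to at most 1")
    speaker_bits = math.floor(ratio1 * c_bitnum)
    residual_bits = math.floor(ratio2 * c_bitnum)
    return speaker_bits, residual_bits


# 1000 transmission channel bits split 75 % / 25 %.
speaker_bits, residual_bits = allocate_group_bits(0.75, 0.25, 1000)
```

Rounding down keeps the two counts within the total budget; any bits left over by rounding would need a policy of their own, which this sketch does not address.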
  • The method further includes: encoding the transmission channel signal, the bit allocation ratio of the virtual speaker signal group, and the bit allocation ratio of the residual signal group, and writing them into a code stream.
  • The bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group can be encoded into the code stream, which the encoder sends to the decoder. By parsing the code stream, the decoder obtains the two bit allocation ratios, from which it derives the number of bits allocated to the virtual speaker signal group and the number of bits allocated to the residual signal group, so that the code stream can be decoded to obtain a three-dimensional audio signal.
  • An embodiment of the present application also provides a method for processing a three-dimensional audio signal, including: receiving a code stream; decoding the code stream to obtain the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group; and decoding the virtual speaker signal and the residual signal in the code stream according to the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group to obtain a decoded three-dimensional audio signal.
  • Decoding the virtual speaker signal and the residual signal in the code stream according to the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group includes: determining the number of available bits according to the code stream; determining the number of bits of the virtual speaker signal group according to the number of available bits and the bit allocation ratio of the virtual speaker signal group; decoding the virtual speaker signal in the code stream according to the number of bits of the virtual speaker signal group; determining the number of bits of the residual signal group according to the number of available bits and the bit allocation ratio of the residual signal group; and decoding the residual signal in the code stream according to the number of bits of the residual signal group.
  • An embodiment of the present application further provides a processing device for a three-dimensional audio signal, including: an encoding module, configured to perform spatial encoding on the three-dimensional audio signal to be encoded to obtain a transmission channel signal and transmission channel attribute information, wherein the transmission channel signal includes at least one virtual speaker signal group and at least one residual signal group; and a bit allocation ratio determining module, configured to determine the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group according to the transmission channel attribute information.
  • The constituent modules of the three-dimensional audio signal processing device can also perform the steps described in the aforementioned first aspect and its various possible implementations; for details, see the description of the first aspect and its various possible implementations.
  • An embodiment of the present application also provides a three-dimensional audio signal processing device, including: a receiving module, configured to receive a code stream; a decoding module, configured to decode the code stream to obtain the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group; and a signal generation module, configured to decode the virtual speaker signal and the residual signal in the code stream according to the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group to obtain a decoded three-dimensional audio signal.
  • the components of the three-dimensional audio signal processing device can also perform the steps described in the aforementioned second aspect and various possible implementations.
  • An embodiment of the present application provides a computer-readable storage medium that stores instructions; when the instructions are run on a computer, the computer executes the method described in the first aspect or the second aspect above.
  • the embodiment of the present application provides a computer program product containing instructions, which, when run on a computer, causes the computer to execute the method described in the first aspect or the second aspect above.
  • the embodiment of the present application provides a computer-readable storage medium, including the code stream generated by the method described in the foregoing first aspect.
  • the embodiment of the present application provides a communication device, which may include entities such as terminal equipment or chips, and the communication device includes: a processor and a memory; the memory is used to store instructions; the processor is used to Executing the instructions in the memory causes the communication device to execute the method as described in any one of the aforementioned first aspect or second aspect.
  • The present application provides a chip system, which includes a processor configured to support an audio encoder or an audio decoder in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
  • the chip system further includes a memory, and the memory is used for storing necessary program instructions and data of the audio encoder or audio decoder.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • spatial encoding is first performed on the three-dimensional audio signal to be encoded to obtain transmission channel signals and transmission channel attribute information, wherein the transmission channel signals include: at least one virtual speaker signal group and at least one residual signal group; Then, the bit allocation ratio of the virtual loudspeaker signal group and the bit allocation ratio of the residual signal group are determined according to the attribute information of the transmission channel.
  • the transmission channel signal and transmission channel attribute information are obtained through three-dimensional audio signal encoding.
  • the transmission channel signal may include at least one virtual speaker signal group and at least one residual signal group.
  • The transmission channel attribute information can be used to determine the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group respectively, which solves the problem that the bit allocation of the signals could not be determined.
  • FIG. 1 is a schematic diagram of the composition and structure of an audio processing system provided by an embodiment of the present application
  • FIG. 2a is a schematic diagram of an audio encoder and an audio decoder provided in an embodiment of the present application applied to a terminal device;
  • FIG. 2b is a schematic diagram of an audio encoder provided by an embodiment of the present application applied to a wireless device or a core network device;
  • FIG. 2c is a schematic diagram of an audio decoder provided by an embodiment of the present application applied to a wireless device or a core network device;
  • FIG. 3a is a schematic diagram of a multi-channel encoder and a multi-channel decoder provided in an embodiment of the present application applied to a terminal device;
  • FIG. 3b is a schematic diagram of a multi-channel encoder provided by an embodiment of the present application applied to a wireless device or a core network device;
  • FIG. 3c is a schematic diagram of a multi-channel decoder provided in an embodiment of the present application applied to a wireless device or a core network device;
  • FIG. 4 is a schematic diagram of a method for processing a three-dimensional audio signal provided in an embodiment of the present application
  • FIG. 5 is a schematic diagram of a method for processing a three-dimensional audio signal provided in an embodiment of the present application
  • FIG. 6 is a schematic diagram of an application scenario of a three-dimensional audio signal provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the composition and structure of an audio encoding device provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the composition and structure of an audio decoding device provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the composition and structure of another audio encoding device provided by the embodiment of the present application.
  • FIG. 10 is a schematic diagram of the composition and structure of another audio decoding device provided by an embodiment of the present application.
  • Sound is a continuous wave produced by the vibration of an object. Objects that vibrate to emit sound waves are called sound sources. When sound waves propagate through a medium (such as air, solid or liquid), the auditory organs of humans or animals can perceive sound.
  • Characteristics of sound waves include pitch, intensity, and timbre.
  • Pitch indicates how high or low a sound is.
  • Sound intensity indicates how loud a sound is.
  • Sound intensity can also be called loudness or volume.
  • The unit of sound intensity is the decibel (dB). Timbre is also called tone quality.
  • the frequency of sound waves determines the pitch of the sound. The higher the frequency, the higher the pitch.
  • the number of times an object vibrates within one second is called frequency, and the unit of frequency is hertz (Hz).
  • the frequency of sound that can be recognized by the human ear is between 20Hz and 20000Hz.
  • the amplitude of the sound wave determines the intensity of the sound. The greater the amplitude, the greater the sound intensity. The closer the distance to the sound source, the greater the sound intensity.
  • the waveform of the sound wave determines the timbre.
  • the waveforms of sound waves include square waves, sawtooth waves, sine waves, and pulse waves.
  • sounds can be divided into regular sounds and irregular sounds.
  • An irregular sound refers to a sound produced by a sound source vibrating irregularly. Irregular sounds are, for example, noises that affect people's work, study, and rest.
  • a regular sound refers to a sound produced by a sound source vibrating regularly. Regular sounds include speech and musical tones.
  • regular sound is an analog signal that changes continuously in the time-frequency domain. The analog signals may be referred to as audio signals (acoustic signals).
  • An audio signal is an information carrier that carries speech, music and sound effects.
  • Because human hearing can distinguish the location and distribution of sound sources in space, a listener who hears a sound in a space perceives not only its pitch, intensity, and timbre but also its direction.
  • Three-dimensional audio technology assumes that the space outside the human ear is a system, and that the signal received at the eardrum is the three-dimensional audio signal output after the sound from the sound source is filtered by this system outside the ear.
  • a system other than the human ear can be defined as a system impulse response h(n)
  • any sound source can be defined as x(n)
  • the signal received at the eardrum is the convolution result of x(n) and h(n) .
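  • The relation between x(n), h(n), and the signal received at the eardrum can be illustrated with a minimal sketch. The function name `convolve` and the sample values are assumptions chosen only for illustration; they are not taken from the application.

```python
def convolve(x, h):
    """Return the full discrete convolution y(n) = sum_k x(k) * h(n - k)."""
    if not x or not h:
        return []
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


# Hypothetical values, chosen only for illustration.
x = [1.0, 0.5, 0.25]  # sound source x(n)
h = [1.0, -1.0]       # impulse response h(n) of the system outside the ear
eardrum_signal = convolve(x, h)
```

  • In practice h(n) depends on the direction of the sound source, which is why the filtered signal carries spatial information.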
  • the three-dimensional audio signal described in the embodiment of the present application may refer to a higher order ambisonics (higher order ambisonics, HOA) signal or a first order ambisonics (first order ambisonics, FOA) signal.
  • Three-dimensional audio can also be called spatial audio, three-dimensional sound field reconstruction, virtual 3D audio, or binaural audio.
  • the sound pressure p satisfies formula (1), where ∇² is the Laplacian operator.
  • Assume that the space system outside the human ear is a sphere and that the listener is at the center of the sphere. Sound arriving from outside the sphere has a projection on the sphere, and the sound outside the sphere is filtered out.
  • Assume that the sound sources are distributed on the sphere. The sound field generated by the sound sources on the sphere is used to fit the sound field generated by the original sound source; that is, three-dimensional audio technology is a method of fitting a sound field.
  • Formula (1) is solved in the spherical coordinate system; in a source-free spherical region, the solution of formula (1) is the following formula (2).
  • r represents the radius of the ball
  • represents the horizontal angle
  • k represents the wave number
  • s represents the amplitude of the ideal plane wave
  • m represents the order number of the three-dimensional audio signal (or the order number of the HOA signal).
  • the two spherical harmonic terms represent, respectively, the spherical harmonics of the direction being evaluated and the spherical harmonics of the direction of the sound source.
  • the three-dimensional audio signal coefficients satisfy formula (3).
  • formula (3) can be transformed into formula (4).
  • N is an integer greater than or equal to 1.
  • For example, the value of N is an integer ranging from 2 to 6.
  • the coefficients of the 3D audio signal described in the embodiments of the present application may refer to HOA coefficients or ambisonic coefficients.
  • the three-dimensional audio signal is an information carrier carrying the spatial position information of the sound source in the sound field, and describes the sound field of the listener in the space.
  • Formula (4) shows that the sound field can be expanded on the spherical surface according to the spherical harmonic function, that is, the sound field can be decomposed into the superposition of multiple plane waves. Therefore, the sound field described by the three-dimensional audio signal can be expressed by the superposition of multiple plane waves, and the sound field can be reconstructed through the coefficients of the three-dimensional audio signal.
  • the HOA signal includes a large amount of data for describing the spatial information of the sound field. If the acquisition device (such as a microphone) transmits the three-dimensional audio signal to a playback device (such as a speaker), a large bandwidth needs to be consumed.
  • the encoder can compress and encode the three-dimensional audio signal to obtain a code stream by using the spatial squeezed surround audio coding (S3AC) method, the directional audio coding (DirAC) method, or an encoding method based on virtual speaker selection, and then transmit the code stream to a playback device. The encoding method based on virtual speaker selection may also be referred to as the match projection (MP) encoding method, and it is used as the example in the following description.
  • the playback device decodes the code stream, reconstructs the three-dimensional audio signal, and plays the reconstructed three-dimensional audio signal. Therefore, the amount of data transmitted to the playback device and the bandwidth occupation of the three-dimensional audio signal are reduced.
  • the sound field classification of the 3D audio signal can be realized through the linear decomposition of the 3D audio signal, so that the sound field classification of the 3D audio signal can be accurately realized, and the sound field classification result of the current frame can be obtained.
  • the embodiment of the present application provides an audio coding technology, in particular a three-dimensional audio coding technology for three-dimensional audio signals, and specifically a coding technology that uses fewer channels to represent a three-dimensional audio signal, so as to improve traditional audio coding systems.
  • Audio coding (or coding for short) includes two parts: audio encoding and audio decoding. Audio encoding is performed on the source side and involves processing (e.g., compressing) the raw audio to reduce the amount of data needed to represent it, for more efficient storage and/or transmission. Audio decoding is performed on the destination side and involves inverse processing relative to the encoder to reconstruct the original audio. The encoding part and the decoding part are also collectively referred to as coding (codec).
  • the implementation of the embodiment of the present application will be described in detail below with reference to the accompanying drawings.
  • the technical solution of the embodiment of the present application can be applied to various audio processing systems, as shown in FIG. 1 , which is a schematic diagram of the composition and structure of the audio processing system provided by the embodiment of the present application.
  • the audio processing system 100 may include: an audio encoding device 101 and an audio decoding device 102 .
  • the audio encoding device 101 can be used to generate a code stream, which can then be transmitted to the audio decoding device 102 through an audio transmission channel. The audio decoding device 102 receives the code stream, performs its audio decoding function, and finally obtains the reconstructed signal.
  • the audio encoding device can be applied to various terminal devices that require audio communication, and to wireless devices and core network devices that require transcoding.
  • For example, the audio encoding device can be an audio encoder of the above-mentioned terminal device, wireless device, or core network device.
  • Similarly, the audio decoding device can be applied to various terminal devices that require audio communication, and to wireless devices and core network devices that require transcoding. For example, the audio decoding device can be an audio decoder of the above-mentioned terminal device, wireless device, or core network device.
  • the audio encoder may be included in a radio access network, a media gateway of the core network, a transcoding device, a media resource server, a mobile terminal, a fixed network terminal, and the like, and the audio encoder may also be an audio encoder in a virtual reality (VR) streaming service.
  • the end-to-end audio signal processing flow is as follows: after acquisition by the acquisition module, audio signal A undergoes preprocessing (audio preprocessing), which includes filtering out the low-frequency part of the signal, with 20 Hz or 50 Hz as the dividing point, and extracting the orientation information in the signal. The signal is then encoded (audio encoding), encapsulated (file/segment encapsulation), and delivered (delivery) to the decoding end. The decoding end first decapsulates (file/segment decapsulation) and then decodes (audio decoding) the stream, performs binaural rendering (audio rendering) on the decoded signal, and maps the rendered signal onto the listener's headphones, which may be standalone headphones or headphones on a glasses device.
  • FIG. 2a is a schematic diagram of an audio encoder and an audio decoder provided in the embodiment of the present application applied to a terminal device.
  • Each terminal device may include: an audio encoder, a channel encoder, an audio decoder, and a channel decoder.
  • the channel encoder is used for channel coding the audio signal
  • the channel decoder is used for channel decoding the audio signal.
  • the first terminal device 20 may include: a first audio encoder 201 , a first channel encoder 202 , a first audio decoder 203 , and a first channel decoder 204 .
  • the second terminal device 21 may include: a second audio decoder 211 , a second channel decoder 212 , a second audio encoder 213 , and a second channel encoder 214 .
  • the first terminal device 20 is connected to a wireless or wired first network communication device 22, the first network communication device 22 is connected to a wireless or wired second network communication device 23 through a digital channel, and the second terminal device 21 is connected to the wireless or wired second network communication device 23.
  • the foregoing wireless or wired network communication devices may generally refer to signal transmission equipment, such as communication base stations, data exchange equipment, and the like.
  • the terminal device as the sending end first collects audio, performs audio coding on the collected audio signal, and then performs channel coding, and then transmits in a digital channel through a wireless network or a core network.
  • the terminal device as the receiving end performs channel decoding on the received signal to obtain the code stream, recovers the audio signal through audio decoding, and then plays back the audio.
  • the wireless device or the core network device 25 includes: a channel decoder 251, another audio decoder 252, the audio encoder 253 provided in the embodiment of the present application, and a channel encoder 254, where the other audio decoder 252 refers to an audio decoder other than the audio decoder provided in the embodiment of the present application.
  • the channel decoder 251 first performs channel decoding on the signal entering the device, the other audio decoder 252 then performs audio decoding, and the audio encoder 253 provided by the embodiment of the present application then performs audio encoding.
  • finally, the channel encoder 254 performs channel coding on the audio signal, and the signal is transmitted after channel coding is completed.
  • the other audio decoder 252 performs audio decoding on the code stream decoded by the channel decoder 251 .
  • FIG. 2c is a schematic diagram of an audio decoder provided by the embodiment of the present application applied to a wireless device or a core network device.
  • the wireless device or the core network device 25 includes: a channel decoder 251, the audio decoder 255 provided in the embodiment of the present application, another audio encoder 256, and a channel encoder 254, where the other audio encoder 256 refers to an audio encoder other than the audio encoder provided in the embodiment of the present application.
  • the signal entering the device is first channel-decoded by the channel decoder 251, the received audio code stream is then decoded by the audio decoder 255, audio encoding is then performed by the other audio encoder 256, and finally the channel encoder 254 performs channel coding on the audio signal, which is transmitted after channel coding is completed.
  • the wireless device refers to equipment related to radio frequency in communication
  • the core network device refers to equipment related to core network in communication.
  • the audio encoding device can be applied to various terminal devices that require audio communication, and to wireless devices and core network devices that require transcoding.
  • For example, the audio encoding device can be a multi-channel encoder of the above-mentioned terminal device, wireless device, or core network device.
  • Similarly, the audio decoding device can be applied to various terminal devices that require audio communication, and to wireless devices and core network devices that require transcoding.
  • For example, the audio decoding device can be a multi-channel decoder of the above-mentioned terminal device, wireless device, or core network device.
  • FIG. 3a is a schematic diagram of the multi-channel encoder and the multi-channel decoder provided by the embodiment of the present application applied to terminal devices. Each terminal device may include: a multi-channel encoder, a channel encoder, a multi-channel decoder, and a channel decoder.
  • the multi-channel encoder may execute the audio encoding method provided in the embodiment of the present application
  • the multi-channel decoder may execute the audio decoding method provided in the embodiment of the present application.
  • the channel encoder is used to perform channel coding on the multi-channel signal
  • the channel decoder is used to perform channel decoding on the multi-channel signal.
  • the first terminal device 30 may include: a first multi-channel encoder 301 , a first channel encoder 302 , a first multi-channel decoder 303 , and a first channel decoder 304 .
  • the second terminal device 31 may include: a second multi-channel decoder 311 , a second channel decoder 312 , a second multi-channel encoder 313 , and a second channel encoder 314 .
  • the first terminal device 30 is connected to a wireless or wired first network communication device 32, the first network communication device 32 is connected to a wireless or wired second network communication device 33 through a digital channel, and the second terminal device 31 is connected to the wireless or wired second network communication device 33.
  • the foregoing wireless or wired network communication equipment may generally refer to signal transmission equipment, such as communication base stations, data exchange equipment, and the like.
  • the terminal device as the sending end performs multi-channel coding on the collected multi-channel signal, and then performs channel coding, and then transmits it in a digital channel through a wireless network or a core network.
  • the terminal device as the receiving end performs channel decoding according to the received signal to obtain the coded stream of the multi-channel signal, and then restores the multi-channel signal through multi-channel decoding, and the terminal device as the receiving end plays it back.
  • FIG. 3b is a schematic diagram of a multi-channel encoder provided by the embodiment of the present application applied to a wireless device or a core network device. The wireless device or the core network device 35 includes: a channel decoder 351, another audio decoder 352, a multi-channel encoder 353, and a channel encoder 354. These are similar to the corresponding components in FIG. 2b and are not described again here.
  • FIG. 3c is a schematic diagram of a multi-channel decoder provided by the embodiment of the present application applied to a wireless device or a core network device. The wireless device or the core network device 35 includes: a channel decoder 351, a multi-channel decoder 355, another audio encoder 356, and a channel encoder 354. These are similar to the corresponding components in FIG. 2c and are not described again here.
  • the audio encoding process can be a part of the multi-channel encoder, and the audio decoding process can be a part of the multi-channel decoder.
  • For example, performing multi-channel encoding on the collected multi-channel signal may consist of processing the collected multi-channel signal to obtain an audio signal and then encoding the obtained audio signal according to the method provided in the embodiment of the present application. The decoding end decodes the code stream of the multi-channel signal to obtain the audio signal, and recovers the multi-channel signal after up-mixing. Therefore, the embodiments of the present application may also be applied to multi-channel encoders and multi-channel decoders in terminal devices, wireless devices, and core network devices. In a wireless device or core network device, if transcoding needs to be implemented, the corresponding multi-channel encoding processing needs to be performed.
  • the method can be executed by a terminal device; for example, the terminal device can be an audio encoding device (hereinafter referred to as the encoding end or the encoder).
  • the terminal device may also be a three-dimensional audio signal processing device.
  • the processing method of the three-dimensional audio signal mainly includes the following:
  • the encoding end may acquire a three-dimensional audio signal
  • the three-dimensional audio signal may be a scene audio signal.
  • the three-dimensional audio signal may be a time-domain signal or a frequency-domain signal.
  • the 3D audio signal may also be a down-sampled signal.
  • the virtual speaker signals corresponding to these virtual speakers can be obtained, and these virtual speaker signals are then grouped to obtain the at least one virtual speaker signal group. Alternatively, after the virtual speakers for encoding the three-dimensional audio signal are determined from the set of candidate virtual speakers, these virtual speakers can be grouped to obtain at least one virtual speaker group, and the virtual speaker signals corresponding to the virtual speakers in each virtual speaker group can then be obtained, so as to obtain the at least one virtual speaker signal group.
  • the three-dimensional audio signal includes: a high-order ambisonic HOA signal, or a first-order ambisonic FOA signal.
  • the three-dimensional audio signal may also be other types of signals, and this is only an example of the present application, and is not intended to limit the embodiment of the present application.
  • the 3D audio signal may be a time-domain HOA signal or a frequency-domain HOA signal.
  • the 3D audio signal may include all channels of the HOA signal, or may include some HOA channels (for example, FOA channels).
  • the three-dimensional audio signal may consist of all sample points of the HOA signal, or of the sample points obtained by down-sampling the HOA signal to be analyzed at a rate of 1/Q, where Q is the down-sampling interval and 1/Q is the down-sampling rate.
  • the 3D audio signal includes multiple frames. The following takes the processing of one frame of the 3D audio signal as an example. For example, if this frame is the current frame, there is a previous frame before the current frame and a next frame after the current frame.
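  • As a minimal sketch of down-sampling with interval Q, the following keeps every Q-th sample point of each channel. The function name `downsample` and the frame layout (a list of channels, each a list of samples) are assumptions for illustration, and the sketch uses plain decimation without any anti-aliasing filter, which the application does not specify.

```python
def downsample(frame, q):
    """Keep every q-th sample of each channel; q is the down-sampling interval."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    return [channel[::q] for channel in frame]


# Hypothetical two-channel frame with eight sample points per channel.
frame = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [10, 11, 12, 13, 14, 15, 16, 17],
]
reduced = downsample(frame, 4)  # down-sampling rate 1/4
```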
  • the processing method of other frames of the 3D audio signal except the current frame in the embodiment of the present application is similar to the processing method of the current frame, and the processing of the current frame will be used as an example in the following.
  • the three-dimensional audio signal is spatially encoded to obtain the transmission channel signal and transmission channel attribute information.
  • the specific process of spatial encoding, including how the virtual loudspeaker signal and the residual signal are output after spatial encoding, is not described further here.
  • after the encoding end acquires the 3D audio signal to be encoded, it can perform spatial encoding on the 3D audio signal and output the transmission channel signal and the transmission channel attribute information.
  • the transmission channel signal includes virtual speaker signals and residual signals. For example, the virtual speaker signals are grouped to obtain at least one virtual speaker signal group,
  • and the residual signals are grouped to obtain at least one residual signal group.
  • the transmission channel attribute information corresponding to the transmission channel signal can also be output through spatial coding.
  • the transmission channel attribute information is used to indicate the attribute of the transmission channel signal.
  • the transmission channel attribute information includes: the coding efficiency of the virtual speaker. The coding efficiency of the virtual speaker indicates how efficiently the virtual speaker reconstructs the 3D audio signal.
  • the transmission channel attribute information output by the encoder (which may also be the encoding end) through spatial encoding includes the coding efficiency of the virtual loudspeaker, and the calculation method of the coding efficiency of the virtual loudspeaker is described next.
  • In step 401, performing spatial encoding on the 3D audio signal to be encoded to obtain the transmission channel attribute information includes:
  • the virtual speaker used for signal reconstruction of the 3D audio signal to be encoded may be the virtual speaker, determined from the set of candidate virtual speakers as described above, that is used for encoding the 3D audio signal.
  • the coding efficiency of the virtual speaker is obtained.
  • the encoding end first performs signal reconstruction using a virtual speaker, and obtains a reconstructed three-dimensional audio signal.
  • the encoding end can calculate the energy characterization value of the signal of each transmission channel, for example, the energy characterization value of the reconstructed 3D audio signal and the energy characterization value of the 3D audio signal to be encoded can be obtained.
  • the energy characterization value of the 3D audio signal differs before and after signal reconstruction, so the coding efficiency of the virtual loudspeaker can be calculated from the change in the energy characterization value before and after signal reconstruction.
  • the encoding end calculates the energy characterization value of each transmission channel of the reconstructed HOA signal, which can be expressed as R1, R2, ..., Rt, and calculates the energy characterization value of each transmission channel of the original HOA signal, which can be expressed as N1, N2, ..., Nt.
  • the virtual loudspeaker coding efficiency η is: η = sum(R)/sum(N), where sum(R) represents the sum of R1 to Rt, and sum(N) represents the sum of N1 to Nt.
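  • The ratio sum(R)/sum(N) can be sketched as follows. The application does not state how the energy characterization value of a channel is computed, so the sum of squared samples used in `channel_energy` is an assumption, as are the function names and the sample values.

```python
def channel_energy(samples):
    """Energy characterization value of one channel.

    Assumption: the sum of squared samples; the source does not fix this choice.
    """
    return sum(s * s for s in samples)


def coding_efficiency(reconstructed, original):
    """Return sum(R) / sum(N) over the per-channel energy characterization values."""
    r = [channel_energy(ch) for ch in reconstructed]  # R1..Rt
    n = [channel_energy(ch) for ch in original]       # N1..Nt
    total_n = sum(n)
    if total_n == 0:
        return 0.0  # silent input: define the efficiency as zero to avoid division by zero
    return sum(r) / total_n


# Hypothetical two-channel signals, chosen only for illustration.
original = [[2.0, 0.0], [0.0, 2.0]]       # N = [4, 4]
reconstructed = [[1.0, 1.0], [1.0, 1.0]]  # R = [2, 2]
eta = coding_efficiency(reconstructed, original)
```

  • A value close to 1 means that the reconstructed signal retains most of the energy of the original signal.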
  • the transmission channel attribute information includes: the energy proportion of the virtual speaker signal group. The energy proportion of the virtual speaker signal group refers to the proportion of the energy of all virtual speaker signals in the virtual speaker signal group in the total energy of all transmission channel signals.
  • the methods performed by the encoding side also include:
  • the energy proportion of the virtual loudspeaker signal group is obtained.
  • the encoding end first obtains the energy representation value of each virtual speaker signal in the virtual speaker signal group, and then adds the energy representation values of all virtual speaker signals in the same group to obtain the energy representation of the virtual speaker signal group value. If there are multiple virtual loudspeaker signal groups, each group can calculate the energy representative value of the virtual loudspeaker signal group in the above manner.
  • the encoding end can obtain the energy representative value of the residual signal group according to the energy representative value of each residual signal in the residual signal group.
  • the encoding end can obtain the energy ratio of the virtual loudspeaker signal group according to the energy representative value of the virtual loudspeaker signal group and the energy representative value of the residual signal group.
  • the energy proportion of the virtual loudspeaker signal group can indicate the proportion of the virtual loudspeaker signal group in the total transmission channel signal energy. If the energy ratio of the virtual loudspeaker signal group is relatively low, it means that the virtual loudspeaker signal group is not dominant (that is, weaker) in the total transmission channel signal energy.
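  • The energy proportion described above can be sketched as follows. The sketch assumes one virtual speaker signal group, takes the total transmission channel energy as the energy of that group plus the energy of all residual signal groups, and uses the sum of squared samples as the energy representative value; all three are assumptions, and the function names are hypothetical.

```python
def group_energy(group):
    """Add up the energy representative values of all signals in one group."""
    return sum(sum(s * s for s in signal) for signal in group)


def virtual_speaker_energy_proportion(vs_group, residual_groups):
    """Energy of the virtual speaker signal group divided by the total energy."""
    vs_energy = group_energy(vs_group)
    total = vs_energy + sum(group_energy(g) for g in residual_groups)
    return vs_energy / total if total > 0 else 0.0


# Hypothetical signals: one virtual speaker signal group and two residual signal groups.
vs_group = [[3.0, 0.0], [0.0, 1.0]]     # energy 9 + 1 = 10
residual_groups = [
    [[1.0, 1.0]],                       # energy 2
    [[2.0, 2.0], [0.0, 0.0]],           # energy 8
]
proportion = virtual_speaker_energy_proportion(vs_group, residual_groups)
```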
  • the transmission channel attribute information includes: a virtual speaker coding identifier, where the virtual speaker coding identifier is used to indicate whether the bit allocation of the virtual speaker signal group is dominant.
  • the virtual speaker coding identifier is used to indicate whether the bit allocation of at least one virtual speaker signal group is dominant. For example, the virtual speaker coding identifier can be expressed as a flag, and different values of the flag indicate that the bit allocation of the virtual speaker signal group is dominant or not dominant.
  • dominance can further be divided into strong dominance and sub-dominance (i.e., slight dominance).
  • Performing spatial encoding on the 3D audio signal to be encoded to obtain the transmission channel attribute information includes:
  • performing spatial encoding on the 3D audio signal to be encoded to obtain the number of heterogeneous sound sources of the transmission channel signal and the coding efficiency of the virtual speaker;
  • obtaining the virtual speaker coding identifier according to the number of heterogeneous sound sources of the transmission channel signal and the coding efficiency of the virtual speaker.
  • the encoding end can classify the sound field of the transmission channel signal through spatial encoding and generate a sound field classification result.
  • the sound field classification result can include the number of heterogeneous sound sources.
  • the specific calculation process for the number of heterogeneous sound sources is not limited here.
  • After obtaining the number of heterogeneous sound sources of the transmission channel signal and the coding efficiency of the virtual speaker, the encoding end obtains the specific value of the virtual speaker coding identifier according to the judgment conditions that these two values satisfy. In the embodiment of the present application, there are many ways to obtain the virtual speaker coding identifier; refer to the examples in the subsequent embodiments for details.
  • the coding identification of the virtual speaker is obtained, including:
  • when the virtual speaker coding efficiency is smaller than a preset first virtual speaker coding efficiency threshold, it is determined that the virtual speaker coding identifier is not dominant.
  • the threshold of the number of heterogeneous sound sources and the first virtual speaker coding efficiency threshold may be set according to the application scenario, and are not limited here.
  • the threshold of the number of heterogeneous sound sources may be represented as TH0
  • the threshold of coding efficiency of the first virtual loudspeaker may be represented as TH4.
  • when the virtual loudspeaker coding identifier is dominant, the virtual loudspeaker signal group is dominant in the total transmission channel signal, so more bits need to be allocated to the virtual loudspeaker signal group. For example, after the initial bit proportion of the virtual loudspeaker signal group is determined, the bit proportion can be increased.
  • when the virtual loudspeaker coding identifier is not dominant, the virtual loudspeaker signal group is not dominant in the total transmission channel signal, and fewer bits may be allocated to the virtual loudspeaker signal group. For example, after the initial bit proportion of the virtual loudspeaker signal group is determined, the bit proportion may be reduced.
  • the encoding end can determine the virtual loudspeaker coding identifier by comparing the number of heterogeneous sound sources and the coding efficiency of the virtual loudspeaker against the above-mentioned judgment conditions, so that the virtual loudspeaker coding identifier can be used to determine the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group.
  • dominance includes sub-dominance or strong dominance; determining that the virtual speaker coding identifier is dominant includes:
  • when the virtual speaker coding efficiency is greater than or equal to the first virtual speaker coding efficiency threshold and less than or equal to a preset second virtual speaker coding efficiency threshold, it is determined that the virtual speaker coding identifier is sub-dominant; or,
  • the second virtual speaker coding efficiency threshold is greater than the first virtual speaker coding efficiency threshold.
  • the encoding end can further subdivide the case in which the virtual speaker coding identifier is dominant into two cases: sub-dominant and strongly dominant. It can be understood that, if the virtual speaker coding identifier is strongly dominant, more bits need to be allocated to the virtual speaker signal group; for example, after the initial bit proportion of the virtual speaker signal group is determined, the bit proportion can be increased.
  • if the virtual speaker coding identifier is sub-dominant, the virtual speaker signal group needs fewer bits than in the strongly dominant case, but it still needs a relatively large share of the bits. For example, after the initial bit proportion of the virtual speaker signal group is determined, the bit proportion can be increased; the increase in the strongly dominant case is greater than the increase in the sub-dominant case.
  • the second virtual loudspeaker coding efficiency threshold may be denoted as TH2.
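  • The threshold comparison on the coding efficiency can be sketched as follows. The text above states the not-dominant condition (efficiency below TH4) and the sub-dominant condition (efficiency between TH4 and TH2 inclusive); treating an efficiency above TH2 as strongly dominant is an assumption. The condition on the number of heterogeneous sound sources (threshold TH0) is left out because it is not spelled out here. The names `CodingIdentifier` and `classify_coding_identifier` and the threshold values are hypothetical.

```python
from enum import Enum


class CodingIdentifier(Enum):
    NOT_DOMINANT = 0
    SUB_DOMINANT = 1
    STRONGLY_DOMINANT = 2


def classify_coding_identifier(efficiency, th4, th2):
    """Map the virtual speaker coding efficiency to a coding identifier.

    th4: first coding efficiency threshold; th2: second threshold, th2 > th4.
    """
    if th2 <= th4:
        raise ValueError("th2 must be greater than th4")
    if efficiency < th4:
        return CodingIdentifier.NOT_DOMINANT
    if efficiency <= th2:
        return CodingIdentifier.SUB_DOMINANT
    # Assumption: anything above th2 counts as strongly dominant.
    return CodingIdentifier.STRONGLY_DOMINANT


# Hypothetical threshold values, chosen only for illustration.
TH4, TH2 = 0.5, 0.8
identifier = classify_coding_identifier(0.9, TH4, TH2)
```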
  • the transmission channel attribute information can be used to perform bit allocation for the virtual loudspeaker signal group and, likewise, for the residual signal group.
  • the encoding end determines the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group according to the attribute information of the transmission channel.
  • the bit allocation proportion refers to the ratio of the number of bits allocated to a signal group to the total number of bits of the transmission channel signal; it may also be referred to as the "bit allocation ratio".
  • the transmission channel signal includes at least one virtual speaker signal group and at least one residual signal group, so the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group can be obtained.
  • the process of determining the bit allocation proportions of one virtual loudspeaker signal group and two residual signal groups is taken as an example for illustration.
  • the spatial encoder can output the transmission channel signal and the transmission channel attribute information
  • the core encoder obtains the transmission channel signal and the transmission channel attribute information
  • from the transmission channel signal and the transmission channel attribute information, the core encoder can then obtain the bit allocation ratio of the virtual loudspeaker signal group and the bit allocation ratio of the residual signal group.
  • the transmission channel attribute information includes: the energy ratio of the virtual loudspeaker signal group, and/or the code identifier of the virtual loudspeaker;
  • the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group are determined according to the preset first signal group bit allocation algorithm;
  • the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group are determined according to the preset second signal group bit allocation algorithm, wherein the second energy proportion threshold is smaller than the first energy proportion threshold;
  • when the energy proportion of the virtual loudspeaker signal group is less than the preset first energy proportion threshold, or the virtual loudspeaker coding flag is not dominant, the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group are determined according to the preset third signal group bit allocation algorithm.
  • the encoder can preset multiple signal group bit allocation algorithms and use a different one depending on which condition the transmission channel attribute information satisfies, so that the virtual loudspeaker signal group and the residual signal group are assigned bit allocation ratios suited to that condition. The coding efficiency of the three-dimensional audio signal at the encoding end can thereby be improved.
  • the first energy proportion threshold may be denoted as TH1
  • the second energy proportion threshold may be denoted as TH3.
  • determining the bit allocation ratio of the virtual loudspeaker signal group and the bit allocation ratio of the residual signal group according to the preset first signal group bit allocation algorithm includes:
  • the bit allocation ratio of the virtual loudspeaker signal group is calculated as follows:
  • Ratio1_1 = FAC1*directionalNrgRatio + (1-FAC1)*maxdirectionalNrgRatio;
  • directionalNrgRatio represents the energy ratio of the virtual speaker signal group
  • S is the number of heterogeneous sound sources
  • represents the coding efficiency of the virtual speaker
  • maxdirectionalNrgRatio is the preset maximum virtual speaker signal group bit allocation ratio
  • FAC1 is the preset first adjustment factor
  • Ratio1_1 is the bit allocation ratio of the virtual speaker signal group
  • * means multiplication operation
  • TH1 is the first energy ratio threshold
  • TH0 is the threshold of the number of different sound sources
  • TH2 is the second virtual speaker coding efficiency threshold
  • the bit allocation ratio of the residual signal group is calculated as follows:
  • Ratio2 = 1 - Ratio1_1;
  • Ratio1_1 is the bit allocation ratio of the virtual loudspeaker signal group
  • Ratio2 is the bit allocation ratio of the residual signal group
  • the transmission channel signal includes a virtual speaker signal group and a residual signal group. After obtaining the bit allocation ratio Ratio1_1 of the virtual speaker signal group, the bit allocation ratio of the residual signal group can be obtained through the calculation formula of Ratio2 above.
  • FAC1 may be flexibly determined according to a specific application scenario, and is not limited here.
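  • The first signal group bit allocation algorithm above is a weighted blend of the measured energy ratio and the preset maximum ratio. A minimal sketch, assuming the extraction dropped the "=" signs from the two formulas; the function name `first_algorithm` and the sample values are hypothetical:

```python
def first_algorithm(directional_nrg_ratio, max_directional_nrg_ratio, fac1):
    """Blend the energy ratio with the preset maximum ratio.

    Returns (Ratio1_1, Ratio2) for the virtual loudspeaker signal group
    and the residual signal group respectively.
    """
    ratio1_1 = (fac1 * directional_nrg_ratio
                + (1.0 - fac1) * max_directional_nrg_ratio)
    ratio2 = 1.0 - ratio1_1  # the residual group receives the remainder
    return ratio1_1, ratio2


# Hypothetical inputs: energy ratio 0.9, preset maximum 0.8, FAC1 = 0.25.
ratio1_1, ratio2 = first_algorithm(0.9, 0.8, 0.25)
print(round(ratio1_1, 3), round(ratio2, 3))
```

  • With a small FAC1 the result stays close to the preset maximum, so the measured energy ratio only nudges the allocation.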
  • the method performed by the encoding end further includes:
  • the bit allocation ratio of the virtual loudspeaker signal group is updated in the following manner:
  • Ratio1_2 = min(Ratio1_1, maxdirectionalNrgRatio + FAC2*Ratio1_1)
  • Ratio1_2 represents the bit allocation ratio of the updated virtual speaker signal group
  • FAC2 is the preset second adjustment factor
  • maxdirectionalNrgRatio is the preset maximum virtual speaker signal group bit distribution ratio
  • Ratio1_1 is the bit allocation ratio of the virtual speaker signal group before the update
  • * indicates the multiplication operation
  • min is the minimum value operation.
  • FAC2 may be flexibly determined according to a specific application scenario, which is not limited here.
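  • The update step above acts as an upper bound on the virtual loudspeaker group's ratio. The sketch below shows when the bound takes effect; the function name `cap_ratio` and the numeric inputs are hypothetical, and the "=" in the formula is assumed to have been lost in extraction:

```python
def cap_ratio(ratio1_1, max_directional_nrg_ratio, fac2):
    """Ratio1_2 = min(Ratio1_1, maxdirectionalNrgRatio + FAC2*Ratio1_1)."""
    return min(ratio1_1, max_directional_nrg_ratio + fac2 * ratio1_1)


# The bound applies only when Ratio1_1 exceeds the preset maximum by more
# than FAC2*Ratio1_1; otherwise the ratio passes through unchanged.
print(cap_ratio(0.95, 0.8, 0.1))
print(cap_ratio(0.825, 0.8, 0.1))
```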
  • determining the bit allocation ratio of the virtual loudspeaker signal group and the bit allocation ratio of the residual signal group according to the preset second signal group bit allocation algorithm, wherein the second energy ratio threshold is smaller than the first energy ratio threshold, includes:
  • Ratio1_1 is calculated as follows:
  • Ratio1_1 = FAC3*directionalNrgRatio + (1-FAC3)*maxdirectionalNrgRatio;
  • maxdirectionalNrgRatio is the proportion of bit allocation of the preset virtual speaker signal group
  • FAC3 is the third preset adjustment factor
  • directionalNrgRatio represents the energy ratio of the virtual speaker signal group
  • S is the number of heterogeneous sound sources
  • represents the coding efficiency of the virtual speaker
  • Ratio1_1 is the bit allocation ratio of the virtual speaker signal group
  • * means the multiplication operation
  • TH0 is the threshold of the number of heterogeneous sound sources
  • TH1 is the first energy ratio threshold
  • TH2 is the second virtual speaker coding efficiency threshold
  • TH3 is the second energy ratio threshold
  • TH4 is the first virtual speaker coding efficiency threshold
  • the bit allocation ratio of the residual signal group is calculated as follows:
  • Ratio2 = 1 - Ratio1_1;
  • Ratio1_1 is the bit allocation ratio of the virtual loudspeaker signal group
  • Ratio2 is the bit allocation ratio of the residual signal group
  • FAC3 may be flexibly determined according to a specific application scenario, and is not limited here. For example, FAC3 lies between 0 and 0.5, with FAC3 > FAC1.
  • the transmission channel signal includes a virtual speaker signal group and a residual signal group. After obtaining the bit allocation ratio Ratio1_1 of the virtual speaker signal group, the bit allocation ratio of the residual signal group can be obtained through the calculation formula of Ratio2 above.
  • the method provided in the embodiment of the present application further includes:
  • the bit allocation ratio of the virtual loudspeaker signal group is updated in the following manner:
  • Ratio1_2 = min(Ratio1_1, maxdirectionalNrgRatio + FAC4*Ratio1_1).
  • Ratio1_2 represents the bit allocation ratio of the updated virtual speaker signal group
  • FAC4 is the preset fourth adjustment factor
  • maxdirectionalNrgRatio is the preset maximum virtual speaker signal group bit allocation ratio
  • Ratio1_1 is the bit allocation ratio of the virtual speaker signal group before the update
  • * indicates the multiplication operation
  • min is the minimum value operation.
  • FAC4 may be flexibly determined according to a specific application scenario, and is not limited here.
  • the method provided in the embodiment of the present application further includes:
  • Ratio2_i = Ratio2*(R_i/C);
  • R_i represents the number of transmission channels included in the i-th residual signal group
  • C is the total number of transmission channels of all residual signal groups
  • Ratio2_i is the bit allocation ratio of the i-th residual signal group
  • * represents the multiplication operation
  • Ratio2 is the bit allocation proportion of all residual signal groups.
  • the proportion of the bit allocation of each residual signal group in all residual signal groups may be determined according to the number of transmission channels of each residual signal group.
  • R_i/C represents the transmission channel ratio of the i-th residual signal group to all residual signal groups, and the bit allocation ratio of the i-th residual signal group can be obtained through (R_i/C) and Ratio2.
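  • The channel-count split above can be sketched as follows. The function name `split_residual` is hypothetical; the group sizes 4 and 5 match the example grouping given later in the description, and the value of Ratio2 is made up for illustration:

```python
def split_residual(ratio2, channels_per_group):
    """Ratio2_i = Ratio2 * (R_i / C), with C the total residual channel count."""
    total_channels = sum(channels_per_group)
    if total_channels == 0:
        raise ValueError("at least one residual channel is required")
    return [ratio2 * r_i / total_channels for r_i in channels_per_group]


# Two residual groups with 4 and 5 channels sharing Ratio2 = 0.18.
parts = split_residual(0.18, [4, 5])
print([round(p, 3) for p in parts])
```

  • Because each share is proportional to R_i, the shares always sum back to Ratio2.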
  • determining the bit allocation proportion of the virtual speaker signal group and the bit allocation proportion of the residual signal group according to the preset third signal group bit allocation algorithm includes:
  • Ratio1_1 = directionalNrgRatio
  • directionalNrgRatio represents the energy ratio of the virtual speaker signal group
  • Ratio1_1 is the bit allocation ratio of the virtual speaker signal group
  • TH3 is the second energy ratio threshold
  • TH4 is the first virtual speaker coding efficiency threshold
  • S is the number of heterogeneous sound sources
  • represents the coding efficiency of virtual loudspeaker
  • TH0 is the threshold value of the number of heterogeneous sound sources
  • the bit allocation ratio of the residual signal group is calculated as follows:
  • Ratio2_1 = D/(F+D);
  • Ratio2_1 is the bit allocation ratio of the residual signal group
  • F represents the energy representative value of the virtual loudspeaker signal group
  • D is the energy representative value of the residual signal group.
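  • In the third algorithm the ratios follow the group energies directly. The sketch below assumes that directionalNrgRatio equals F/(F+D); the text only calls it the energy ratio of the virtual speaker signal group, so that definition, the function name `third_algorithm` and the sample energies are assumptions:

```python
def third_algorithm(f_energy, d_energy):
    """Energy-proportional ratios for the non-dominant case.

    f_energy: energy representative value F of the virtual loudspeaker group.
    d_energy: energy representative value D of the residual group.
    """
    total = f_energy + d_energy
    if total <= 0:
        raise ValueError("total energy must be positive")
    # Assumption: directionalNrgRatio = F / (F + D).
    ratio1_1 = f_energy / total
    ratio2_1 = d_energy / total  # Ratio2_1 = D / (F + D), from the text
    return ratio1_1, ratio2_1


print(third_algorithm(3.0, 1.0))
```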
  • the method provided in the embodiment of the present application further includes:
  • Ratio1_1 groupBitsRatio1
  • Ratio1_2 groupBitsRatio1
  • Ratio1_2 = FAC5*groupBitsRatio1 + (1-FAC5)*Ratio1_1;
  • Ratio1_2 represents the bit allocation ratio of the updated virtual speaker signal group
  • FAC5 is the preset fifth adjustment factor
  • Ratio1_1 is the bit distribution ratio of the virtual speaker signal group before the update
  • * represents the multiplication operation
  • groupBitsRatio1 is the preset virtual loudspeaker signal group bit allocation ratio
  • Ratio2_1 groupBitsRatio2
  • Ratio2_2 groupBitsRatio2
  • Ratio2_2 indicates the bit allocation ratio of the updated residual signal group
  • FAC6 is the preset sixth adjustment factor
  • Ratio2_1 is the bit allocation ratio of the residual signal group before the update
  • * indicates the multiplication operation
  • groupBitsRatio2 is the preset residual signal group bit allocation ratio.
  • FAC5 may be flexibly determined according to a specific application scenario, which is not limited here.
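  • The weighted update toward the preset ratio can be sketched as below. The conditions under which the update is applied are not recoverable from the text and are therefore not modelled; the function name `blend_with_preset` and the inputs are hypothetical:

```python
def blend_with_preset(ratio, preset_ratio, fac):
    """Updated ratio = FAC*groupBitsRatio + (1 - FAC)*ratio."""
    return fac * preset_ratio + (1.0 - fac) * ratio


# With FAC5 = 0.75 the result lies three quarters of the way from the
# computed ratio (0.4) toward the preset ratio (0.6).
print(round(blend_with_preset(0.4, 0.6, 0.75), 3))
```

  • A factor close to 1 keeps the result near the preset value, which limits how far a single frame's energy estimate can move the allocation.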
  • the method provided in the embodiment of the present application further includes the following steps:
  • according to the bit allocation proportion of the virtual loudspeaker signal group, the bit allocation proportion of the residual signal group and the total number of transmission channel bits, respectively determining the number of bits of the virtual loudspeaker signal group and the number of bits of the residual signal group;
  • Bit allocation is performed on the virtual loudspeaker signal group according to the bit number of the virtual loudspeaker signal group, and bit allocation is performed on the residual signal group according to the bit number of the residual signal group.
  • the encoding end can perform bit allocation for the virtual speaker signal group and the residual signal group respectively, so as to determine the bit allocation result for each. For example, the encoding end obtains the bit allocation proportion of the virtual speaker signal group and that of the residual signal group, and then combines them with the total number of transmission channel bits to determine the number of bits of the virtual speaker signal group and the number of bits of the residual signal group.
  • the number of bits of the virtual loudspeaker signal group indicates the actual number of bits that the encoder can allocate to the virtual speaker signal group
  • the number of bits of the residual signal group indicates the actual number of bits that the encoder can allocate to the residual signal group.
  • determining the number of bits of the virtual loudspeaker signal group and the number of bits of the residual signal group includes:
  • the number of bits of the virtual loudspeaker signal group is calculated as follows:
  • F_bitnum = Ratio1*C_bitnum
  • F_bitnum is the number of bits of the virtual speaker signal group
  • Ratio1 is the bit allocation ratio of the virtual speaker signal group
  • C_bitnum is the total number of transmission channel bits
  • the number of bits of the residual signal group is calculated as follows:
  • D_bitnum = Ratio2*C_bitnum
  • D_bitnum is the number of bits of the residual signal group
  • Ratio2 is the bit allocation ratio of the residual signal group
  • C_bitnum is the total number of transmission channel bits.
  • the encoding end can predetermine the total number of transmission channel bits; its value is not limited here.
  • with the above formulas, the encoding end can calculate the number of bits of the virtual loudspeaker signal group and the number of bits of the residual signal group, which solves the problem of allocating bits to the virtual speaker signal and the residual signal at the encoding end.
  • the above calculation formula is only an achievable way and is not a limitation to the embodiment of the present application.
  • the number of bits of the virtual loudspeaker signal group and the number of bits of the residual signal group can be calculated by the above formulas, or the values so obtained can be further adjusted by a preset adjustment factor to obtain the final values; the calculation process is not limited here.
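  • Converting ratios into bit counts can be sketched as follows. The residual formula D_bitnum = Ratio2*C_bitnum comes from the text, and the virtual loudspeaker group is assumed to follow the same pattern with Ratio1. Rounding is not specified in the description; truncating the first group and giving the remainder to the residual group is an assumption made here so that the total is conserved. The function name `group_bit_counts` is hypothetical:

```python
def group_bit_counts(ratio1, c_bitnum):
    """Return (F_bitnum, D_bitnum) for a total budget of c_bitnum bits."""
    if not 0.0 <= ratio1 <= 1.0:
        raise ValueError("ratio1 must lie in [0, 1]")
    # Assumption: truncate toward zero; the description gives no rounding rule.
    f_bitnum = int(ratio1 * c_bitnum)
    # With Ratio2 = 1 - Ratio1 this equals Ratio2*C_bitnum up to rounding.
    d_bitnum = c_bitnum - f_bitnum
    return f_bitnum, d_bitnum


print(group_bit_counts(0.75, 7000))
```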
  • the method performed at the encoding end may also include the following steps:
  • the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group can be encoded into the code stream, and the encoding end sends the code stream to the decoding end. By parsing the code stream, the decoding end obtains the two proportions, derives from them the number of bits allocated to the virtual loudspeaker signal group and the number of bits allocated to the residual signal group, and can then decode the code stream to obtain the three-dimensional audio signal.
  • encoding the transmission channel signal, the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group may specifically include encoding the transmission channel signal directly, or first processing the transmission channel signal to obtain the virtual speaker signal and the residual signal and then encoding those signals.
  • the encoding end can be a core encoder, which encodes the transmission channel signal together with the bit allocation ratio of the virtual loudspeaker signal group and the bit allocation ratio of the residual signal group to obtain a code stream.
  • the code stream may also be referred to as an audio signal coded code stream.
  • the processing method of the three-dimensional audio signal provided by the embodiment of the present application may include an audio encoding method and an audio decoding method, wherein the audio encoding method is performed by an audio encoding device, the audio decoding method is performed by an audio decoding device, and the audio encoding device and the audio decoding device can communicate with each other.
  • the method shown in the aforementioned FIG. 4 is executed by the audio encoding device.
  • the processing method of the three-dimensional audio signal performed by the audio decoding device (hereinafter referred to as the decoding end) is shown in FIG. 5 and mainly includes the following steps:
  • the decoding end receives the code stream from the encoding end.
  • the bit stream carries the bit allocation ratio of the virtual loudspeaker signal group and the bit allocation ratio of the residual signal group.
  • the decoding end parses the code stream and obtains from it the bit allocation proportion of the virtual speaker signal group and the bit allocation proportion of the residual signal group; these proportions are obtained by the encoding end according to the embodiment shown in FIG. 4 above.
  • the decoding end uses the bit allocation proportion of the virtual speaker signal group and the bit allocation proportion of the residual signal group to decode the code stream and obtain the decoded three-dimensional audio signal.
  • the decoding end can determine the number of bits allocated to the virtual speaker signal and the number of bits allocated to the residual signal from the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group.
  • the code stream is then decoded with the decoding method corresponding to the encoding method of the encoding end, so as to obtain the three-dimensional audio signal sent by the encoding end and complete its transmission from the encoding end to the decoding end.
  • because the decoding end can determine these bit numbers from the bit allocation ratios transmitted in the code stream, the problem that the decoding end cannot determine the number of bits allocated to each signal is solved.
  • step 503 decodes the virtual speaker signal and the residual signal in the code stream according to the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group, including:
  • the number of bits in the residual signal group is determined according to the number of available bits and the bit allocation ratio of the residual signal group; and the residual signal in the code stream is decoded according to the number of bits in the residual signal group.
  • the decoding end first determines the number of available bits, which is the total number of bits that can be allocated to the transmission channel.
  • the decoding end can obtain the bit allocation ratio of the virtual speaker signal group by parsing the code stream, and can therefore determine the number of bits of the virtual speaker signal group from the number of available bits and that ratio.
  • this number is the number of bits used by the encoding end to encode the virtual speaker signal group, so the decoding end can use it to decode the virtual speaker signal from the code stream.
  • the decoding end can likewise obtain the bit allocation ratio of the residual signal group by parsing the code stream, and can therefore determine the number of bits of the residual signal group from the number of available bits and that ratio.
  • this number is the number of bits used by the encoding end to encode the residual signal group, so the decoding end can use it to decode the residual signal from the code stream.
  • the bit allocation ratio parameters between groups include: the bit allocation ratio of the virtual loudspeaker signal group and the bit allocation ratio of the residual signal group. bitsRatio occupies 4 bits and indicates the parameter of the bit allocation ratio within a group.
  • the parameter of the bit allocation ratio within the group includes: the bit allocation ratio of each virtual speaker signal group in all virtual speaker signal groups, and the ratio of each residual signal group in all residual The bit allocation ratio within the signal group.
  • the decoding end may include a bit allocation module.
  • the main function of the bit allocation module is to allocate the available bits remaining after other side information has been removed to each transmission channel, according to the bit allocation ratio parameters decoded from the code stream; the encoding of the other side information also takes up bits.
  • the number of available bits remaining in the current frame after deducting the other side information is recorded as availableBits.
  • the general algorithm for calculating availableBits is expressed as follows:
  • bitsPerFrame is the initial number of bits per frame
  • bitsUsed is the number of bits occupied before the bit allocation.
  • It may represent the bit distribution ratio of the virtual loudspeaker signal group in all transmission channel signals, or may represent the bit distribution ratio of the residual signal group in all transmission channel signals.
  • groupBytes represents the total allocated bits of the virtual speaker signal group.
  • groupBytes represents the total number of allocated bits of the residual signal group.
  • the number of bits of each group of channels can be calculated.
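  • A minimal sketch of the decoder-side bit allocation module follows. The formula for availableBits did not survive in the text; the sketch assumes availableBits = bitsPerFrame - bitsUsed, which matches the two definitions given. The function name `decoder_group_bits`, the rounding rule and the handling of leftover bits are likewise assumptions:

```python
def decoder_group_bits(bits_per_frame, bits_used, group_ratios):
    """Return (availableBits, per-group bit counts) for one frame."""
    # Assumption: available bits are what remains after side information.
    available_bits = bits_per_frame - bits_used
    if available_bits < 0:
        raise ValueError("side information exceeds the frame budget")
    # Assumption: truncate each share, then give any leftover to group 0.
    bits = [int(ratio * available_bits) for ratio in group_ratios]
    bits[0] += available_bits - sum(bits)
    return available_bits, bits


# Hypothetical frame: 7680 bits, 180 already used, ratios 0.75 and 0.25.
print(decoder_group_bits(7680, 180, [0.75, 0.25]))
```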
  • the decoding end can also calculate the bit allocation ratio of the virtual loudspeaker signal group and the bit allocation ratio of the residual signal group in a manner similar to the encoding end, for example using the aforementioned calculation of Ratio1 and Ratio2, which is not repeated here.
  • the three-dimensional audio signal is taken as an example of the HOA signal.
  • the embodiment of the present application provides a bit allocation method for the virtual speaker signal and the residual signal. First, the virtual speaker signals and the residual signals are grouped; then the bit allocation ratio between groups is obtained according to the signal characteristics and the sound field characteristics; finally, the channel bit allocation is performed.
  • the purpose of the embodiment of the present application is to obtain the bit allocation result of the transmission channel signal, and the transmission channel signal is composed of a virtual loudspeaker signal and a residual signal.
  • the transmission channel signals are divided into groups of virtual loudspeaker signals and residual signal groups.
  • the bit allocation ratio between groups is obtained, and then the bit number of the virtual loudspeaker signal group and the bit number of the residual signal group are obtained through the total number of bits.
  • when the encoder encodes at a given bit rate, the total number of bits allocated to each frame is determined.
  • bits are allocated within the number of bits available in the frame. For example, in constant bitrate (CBR) encoding mode with a code rate of 384 kbps, the number of bits per frame is about 7680; the number of bits actually available is less than 7680, and it is these available bits that are allocated.
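  • The per-frame budget in the CBR example can be checked with a short calculation. The frame duration is not stated in the text; 20 ms is an assumption chosen because it reproduces the 7680-bit figure at 384 kbps. The function name `bits_per_frame` is hypothetical:

```python
def bits_per_frame(bitrate_bps, frame_ms):
    """Bits available per frame at a constant bit rate (integer arithmetic)."""
    return bitrate_bps * frame_ms // 1000


# 384 kbps with an assumed 20 ms frame gives the figure quoted in the text.
print(bits_per_frame(384_000, 20))
```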
  • when the coding efficiency of the virtual loudspeaker is high, for example when the number of heterogeneous sound sources is less than or equal to the number of transmission channels of the virtual loudspeaker signal, the number of coding bits of the virtual loudspeaker signal is increased by raising the inter-group bit allocation ratio of the virtual loudspeaker signal group.
  • the number of coding bits of the virtual loudspeaker signal and of the residual signal can thus match the sound field classification of the current frame, which solves the problem of determining the number of coding bits of the virtual loudspeaker signal and the number of coding bits of the residual signal when encoding the current frame.
  • the embodiment of the present application is in the core codec, and the execution flow of the core codec will be described next.
  • the HOA signal to be encoded is subjected to HOA space encoding to obtain the transmission channel signal and attribute information.
  • the transmission channel signal includes: a virtual loudspeaker signal and a residual signal
  • the attribute information is the aforementioned transmission channel attribute information, including the sound field classification result and the virtual speaker coding efficiency.
  • the sound field classification result includes the number of different sound sources, or the sound field classification result includes the number of different sound sources and the type of sound field;
  • the virtual speaker coding efficiency represents the efficiency of reconstructing the HOA signal using virtual speakers in the current frame.
  • norm() is a norm operation
  • SNt is the MDCT coefficient of the t-th channel of the original HOA signal
  • t is (HOA order + 1)^2.
  • virtual loudspeaker coding efficiency = sum(R)/sum(N); sum(R) means the summation of R1 to Rt, and sum(N) means the summation of N1 to Nt.
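  • The efficiency ratio sum(R)/sum(N) can be sketched as below. The definitions of R and N are incomplete in the text; the sketch assumes N_t is the norm of the original HOA signal's MDCT coefficients for channel t and R_t is the norm of the corresponding reconstructed coefficients, and it assumes an L2 norm. The function names and the sample data are hypothetical:

```python
import math


def l2_norm(coeffs):
    """Euclidean norm of one channel's coefficients (assumed norm type)."""
    return math.sqrt(sum(c * c for c in coeffs))


def coding_efficiency(reconstructed, original):
    """sum(R)/sum(N) over all channels; returns 0.0 for a silent original."""
    sum_r = sum(l2_norm(channel) for channel in reconstructed)
    sum_n = sum(l2_norm(channel) for channel in original)
    if sum_n == 0:
        return 0.0
    return sum_r / sum_n


# Two channels; the reconstruction captures only the first one.
original = [[3.0, 4.0], [0.0, 5.0]]
reconstructed = [[3.0, 4.0], [0.0, 0.0]]
print(coding_efficiency(reconstructed, original))
```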
  • the transmission channel signals are grouped, assuming that the transmission channel signals are composed of M virtual speaker signals and N residual signals. Further, the N residual signals may be divided into K groups. If the M virtual speaker signals are divided into one group, the transmission channels are divided into K+1 groups. The number of channels in each group may be the same or different, and the grouping of each frame may be the same or different, which will not affect the subsequent process of the embodiment of the present application.
  • K equal to 2 is taken as an example. The value of K may also be 3 or another value, which is not limited here.
  • the number of virtual speakers included in the virtual speaker signal group is equal to 2
  • the number of residual signals included in residual signal group 1 is equal to 4
  • the number of residual signals included in residual signal group 2 is equal to 5.
  • step S2 the following steps S21 to S23 are included.
  • the method in S1 can be used to calculate the energy characterization value of each channel; the channel energy characterization values in each group are then added to obtain the energy characterization value of that group. For example, the energy characterization value of the virtual speaker signal group is F, that of residual signal group 1 is D1, and that of residual signal group 2 is D2.
  • the bit allocation ratio between the transmission channel groups is then determined. Assume that the bit allocation ratio of the virtual loudspeaker signal group is Ratio1, that of residual signal group 1 is Ratio2, and that of residual signal group 2 is Ratio3.
  • when the virtual speaker signal group energy ratio directionalNrgRatio and/or the virtual speaker coding efficiency indicate that the bit allocation of the virtual speaker signal group is dominant in the current frame, the bit allocation proportion of the virtual speaker signal group needs to be increased and that of the residual signal groups reduced. Different adjustment methods can be selected to increase the bit allocation proportion of the virtual loudspeaker signal group under different preset conditions.
  • the judging condition includes the virtual loudspeaker signal group energy ratio directionalNrgRatio and/or the virtual loudspeaker coding flag Flag.
  • the virtual speaker encoding flag is obtained by the following method:
  • Flag = strongly dominant (High).
  • the judging conditions may include the following conditions 1 to 6.
  • Ratio1 = FAC1*directionalNrgRatio + (1-FAC1)*maxdirectionalNrgRatio.
  • maxdirectionalNrgRatio is a preset maximum virtual loudspeaker signal group bit allocation ratio
  • FAC1 is a preset first adjustment factor lying between 0 and 0.5.
  • a safety limit is then applied to Ratio1, for example:
  • Ratio1 = min(Ratio1, maxdirectionalNrgRatio + FAC2*Ratio1).
  • FAC2 is a preset second adjustment factor lying between 0 and 0.5.
  • Ratio2 = (1-Ratio1) * (number of channels in residual signal group 1) / (number of channels in residual signal group 1 + number of channels in residual signal group 2);
  • Ratio3 = (1-Ratio1) * (number of channels in residual signal group 2) / (number of channels in residual signal group 1 + number of channels in residual signal group 2).
  • TH0 is the number of codec matching virtual speakers or the number of codec virtual speaker signals.
  • TH0 = 2.
  • TH1 lies between 0.8 and 1; for example, TH2 = 0.875. In this case the bit allocation of the virtual loudspeaker signal group can be considered strongly dominant, and the bit allocation ratio between the transmission channel groups is adjusted as follows:
  • Ratio1 = FAC3*directionalNrgRatio + (1-FAC3)*maxdirectionalNrgRatio.
  • maxdirectionalNrgRatio is the preset bit allocation ratio of the virtual loudspeaker signal group
  • FAC3 is a preset third adjustment factor lying between 0 and 0.5, with FAC3 > FAC1.
  • a safety limit is then applied to Ratio1, for example:
  • Ratio1 = min(Ratio1, maxdirectionalNrgRatio + FAC4*Ratio1).
  • FAC4 is a preset fourth adjustment factor, 0 ⁇ FAC4 ⁇ 0.5, FAC4 ⁇ FAC2;
  • Ratio2 = (1-Ratio1) * (number of channels in residual signal group 1) / (number of channels in residual signal group 1 + number of channels in residual signal group 2);
  • Ratio3 = (1-Ratio1) * (number of channels in residual signal group 2) / (number of channels in residual signal group 1 + number of channels in residual signal group 2).
  • Ratio1 = directionalNrgRatio.
  • Ratio2 = D1/(F+D1+D2).
  • Ratio3 = D2/(F+D1+D2).
  • a safety limit is then applied to Ratio1, Ratio2 and Ratio3, for example:
  • Ratio1 = FAC5*groupBitsRatio1 + (1-FAC5)*Ratio1;
  • Ratio2 = FAC6*groupBitsRatio2 + (1-FAC6)*Ratio2;
  • Ratio3 = FAC7*groupBitsRatio3 + (1-FAC7)*Ratio3;
  • groupBitsRatio1, groupBitsRatio2 and groupBitsRatio3 are respectively the preset bit allocation proportions of the virtual speaker signal group, residual signal group 1 and residual signal group 2. FAC5, FAC6 and FAC7 are the preset fifth, sixth and seventh adjustment factors, each lying between 0.5 and 1; they may or may not be equal.
  • Ratio1, Ratio2 and Ratio3 After the above-mentioned Ratio1, Ratio2 and Ratio3 are obtained, Ratio1, Ratio2 and Ratio3 can be quantized and written into the code stream.
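A minimal Python sketch of the energy-based case follows. It assumes that F, D1 and D2 are the energy representation values of the virtual loudspeaker signal group and of residual signal groups 1 and 2, and that directionalNrgRatio equals F/(F+D1+D2); both readings are assumptions, and the function name and sample values are hypothetical.

```python
def non_dominant_ratios(f, d1, d2, presets, factors):
    """Return [Ratio1, Ratio2, Ratio3] pulled toward preset proportions.

    f, d1, d2: group energy values (assumed positive in total).
    presets:   [groupBitsRatio1, groupBitsRatio2, groupBitsRatio3].
    factors:   [FAC5, FAC6, FAC7].
    """
    total = f + d1 + d2
    # Energy-based proportions: Ratio1 = F/total (assumed), Ratio2 = D1/total, Ratio3 = D2/total.
    raw = [f / total, d1 / total, d2 / total]
    # Ratio_k = FAC*groupBitsRatio_k + (1 - FAC)*Ratio_k
    return [fac * preset + (1.0 - fac) * ratio
            for fac, preset, ratio in zip(factors, presets, raw)]


ratios = non_dominant_ratios(2.0, 1.0, 1.0, [0.6, 0.2, 0.2], [0.5, 0.5, 0.5])
```

With equal factors and presets that sum to 1, the blended proportions also sum to 1; with unequal factors this is no longer guaranteed, and the text above does not say whether a renormalization step follows.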
  • step S3 is an optional step, and the execution sequence of step S3 may be before step S2 or after step S2.
  • The number of bits of each group is determined by the inter-group bit allocation proportions from step S2 and the total number of available bits, for example:
  • Number of bits of the virtual loudspeaker signal group = Ratio1 * total number of available bits.
  • Number of bits of residual signal group 1 = Ratio2 * total number of available bits.
  • Number of bits of residual signal group 2 = Ratio3 * total number of available bits.
  • The number of bits of each channel can be determined in various ways, for example by allocating bits according to the energy proportion of each channel.
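The two levels of allocation above (group bits from the proportions, then channel bits from channel energies) can be sketched as follows. The rounding policy, truncating each share and giving the leftover bits to the first entry, is an assumption made for this example; the text does not specify one. Function names are hypothetical.

```python
def split_bits(total_bits, ratios):
    """Integer bit count per group from the group proportions."""
    bits = [int(total_bits * r) for r in ratios]
    # Assumed policy: bits lost to truncation go to the first group.
    bits[0] += total_bits - sum(bits)
    return bits


def per_channel_bits(group_bits, channel_energies):
    """Split a group's bits across its channels in proportion to channel energy."""
    total = sum(channel_energies)  # assumed to be positive
    bits = [int(group_bits * e / total) for e in channel_energies]
    bits[0] += group_bits - sum(bits)
    return bits


group_bits = split_bits(1000, [0.5, 0.25, 0.25])
speaker_channel_bits = per_channel_bits(group_bits[0], [3.0, 1.0])
```

Whatever rounding rule is used, the encoder and decoder need to apply the same one so that both derive identical bit counts.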
  • The decoding end receives the bit stream sent by the encoding end, parses Ratio1, Ratio2 and Ratio3 from the bit stream, and can then perform bit allocation for the transmission channel signals, for example by using the method for obtaining the number of bits of each channel in the aforementioned step S4.
  • the encoding end of the embodiment of the present application can group the transmission channels, and determine the group bit allocation ratio according to the energy of the virtual loudspeaker signal group, the number of different sound sources, and the reconstructed HOA signal.
  • the adjustment of the allocation ratio between groups can be realized through the above-mentioned various conditions. Therefore, in the embodiment of the present application, the bit allocation efficiency of the transmission channel can be effectively improved.
  • A processing device for a three-dimensional audio signal, for example an audio encoding device 700, may include an encoding module 701 and a bit allocation proportion determining module 702, wherein:
  • An encoding module configured to spatially encode the three-dimensional audio signal to be encoded to obtain transmission channel signals and transmission channel attribute information, wherein the transmission channel signals include: at least one virtual speaker signal group and at least one residual signal group;
  • a bit allocation proportion determining module configured to determine the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group according to the transmission channel attribute information.
  • A processing device for a three-dimensional audio signal provided by an embodiment of the present application, for example an audio decoding device 800, may include a receiving module 801, a decoding module 802 and a signal generating module 803, wherein:
  • a receiving module configured to receive a code stream;
  • a decoding module configured to decode the code stream to obtain the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group;
  • a signal generating module configured to decode the virtual speaker signal and the residual signal in the code stream according to the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group, and obtain the decoded 3D audio signal.
  • the embodiment of the present application also provides a computer storage medium, wherein the computer storage medium stores a program, and the program executes some or all of the steps described in the above method embodiments.
  • the audio coding device 900 includes:
  • a receiver 901 , a transmitter 902 , a processor 903 and a memory 904 (the number of processors 903 in the audio encoding device 900 can be one or more, one processor is taken as an example in FIG. 9 ).
  • the receiver 901 , the transmitter 902 , the processor 903 and the memory 904 may be connected through a bus or in other ways, wherein connection through a bus is taken as an example in FIG. 9 .
  • the memory 904 may include read-only memory and random-access memory, and provides instructions and data to the processor 903 .
  • A part of the memory 904 may also include non-volatile random access memory (NVRAM).
  • the memory 904 stores operating systems and operating instructions, executable modules or data structures, or their subsets, or their extended sets, wherein the operating instructions may include various operating instructions for implementing various operations.
  • the operating system may include various system programs for implementing various basic services and processing hardware-based tasks.
  • The processor 903 controls the operation of the audio encoding device, and may also be called a central processing unit (CPU).
  • various components of the audio encoding device are coupled together through a bus system, wherein the bus system may include a power bus, a control bus, and a status signal bus, etc. in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the foregoing embodiments of the present application may be applied to the processor 903 or implemented by the processor 903 .
  • the processor 903 may be an integrated circuit chip, which has a signal processing capability. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 903 or instructions in the form of software.
  • The above-mentioned processor 903 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 904, and the processor 903 reads the information in the memory 904, and completes the steps of the above method in combination with its hardware.
  • the receiver 901 can be used to receive input digital or character information, and generate signal input related to the relevant settings and function control of the audio encoding device.
  • The transmitter 902 can include a display device such as a display screen, and can be used to output numeric or character information through an external interface.
  • the processor 903 is configured to execute the method performed by the audio encoding device shown in FIG. 4 of the foregoing embodiment.
  • the audio decoding device 1000 includes:
  • a receiver 1001 , a transmitter 1002 , a processor 1003 and a memory 1004 (the number of processors 1003 in the audio decoding device 1000 can be one or more, one processor is taken as an example in FIG. 10 ).
  • the receiver 1001 , the transmitter 1002 , the processor 1003 and the memory 1004 may be connected through a bus or in other ways, wherein connection through a bus is taken as an example in FIG. 10 .
  • the memory 1004 may include read-only memory and random-access memory, and provides instructions and data to the processor 1003 . A portion of memory 1004 may also include NVRAM.
  • the memory 1004 stores operating systems and operating instructions, executable modules or data structures, or their subsets, or their extended sets, wherein the operating instructions may include various operating instructions for implementing various operations.
  • the operating system may include various system programs for implementing various basic services and processing hardware-based tasks.
  • the processor 1003 controls the operation of the audio decoding device, and the processor 1003 may also be referred to as a CPU.
  • various components of the audio decoding device are coupled together through a bus system, wherein the bus system may include a power bus, a control bus, and a status signal bus, etc. in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the foregoing embodiments of the present application may be applied to the processor 1003 or implemented by the processor 1003 .
  • the processor 1003 may be an integrated circuit chip, which has a signal processing capability. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 1003 or instructions in the form of software.
  • the aforementioned processor 1003 may be a general processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 1004, and the processor 1003 reads the information in the memory 1004, and completes the steps of the above method in combination with its hardware.
  • the processor 1003 is configured to execute the method performed by the audio decoding device shown in FIG. 5 of the foregoing embodiment.
  • When the audio encoding device or the audio decoding device is a chip in the terminal, the chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit.
  • the processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the terminal executes the audio encoding method of any one of the above-mentioned first aspect, or the audio decoding method of any one of the second aspect.
  • The storage unit is a storage unit in the chip, such as a register or a cache.
  • The storage unit may also be a storage unit in the terminal located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
  • the processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling program execution of the method of the first aspect or the second aspect.
  • The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between the modules indicates that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines.
  • The essence of the technical solution of this application, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk or optical disc, and includes several instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) to execute the method described in each embodiment of the present application.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
  • the computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • The available medium may be a magnetic medium (such as a floppy disk, a hard disk or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

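Embodiments of this application disclose a method and an apparatus for processing a three-dimensional audio signal, used to implement bit allocation for signals. An embodiment of this application provides a method for processing a three-dimensional audio signal, including: performing spatial encoding on a to-be-encoded three-dimensional audio signal to obtain transmission channel signals and transmission channel attribute information, where the transmission channel signals include at least one virtual loudspeaker signal group and at least one residual signal group; and determining a bit allocation proportion of the virtual loudspeaker signal group and a bit allocation proportion of the residual signal group based on the transmission channel attribute information.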

Description

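Method and apparatus for processing a three-dimensional audio signal
This application claims priority to Chinese Patent Application No. 202110657283.7, filed with the Chinese Patent Office on June 11, 2021 and entitled "Method and apparatus for processing a three-dimensional audio signal", which is incorporated herein by reference in its entirety.
This application claims priority to Chinese Patent Application No. 202110700570.1, filed with the Chinese Patent Office on June 23, 2021 and entitled "Method and apparatus for processing a three-dimensional audio signal", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of audio processing technologies, and in particular to a method and an apparatus for processing a three-dimensional audio signal.
Background
Three-dimensional audio technology is widely used in wireless communication voice, virtual reality/augmented reality, media audio and other areas. It is an audio technology for acquiring, processing, transmitting, and rendering for playback the sound events and three-dimensional sound field information of the real world. It gives sound a strong sense of space, envelopment and immersion, and provides listeners with an extraordinary "being there" auditory experience. Higher order ambisonics (HOA) technology is independent of the loudspeaker layout in the recording, encoding and playback stages, and HOA-format data can be rotated at playback; it therefore offers greater flexibility for three-dimensional audio playback and has attracted increasingly wide attention and research.
A capture device (such as a microphone) captures a large amount of data to record three-dimensional sound field information, and transmits the three-dimensional audio signal to a playback device (such as a loudspeaker or headphones) so that the playback device can play it. Because the amount of three-dimensional sound field data is large, a large amount of storage space is needed to store it, and the bandwidth required to transmit the three-dimensional audio signal is high. To address this, the three-dimensional audio signal can be compressed, and the compressed data stored or transmitted.
Currently, an encoder can encode a three-dimensional audio signal by using a plurality of preconfigured virtual loudspeakers. However, how to perform bit allocation for the signals after the encoder has encoded the three-dimensional audio signal remains an unsolved problem.
Summary
Embodiments of this application provide a method and an apparatus for processing a three-dimensional audio signal, used to implement bit allocation for signals.
To solve the foregoing technical problem, embodiments of this application provide the following technical solutions: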
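According to a first aspect, an embodiment of this application provides a method for processing a three-dimensional audio signal, including: performing spatial encoding on a to-be-encoded three-dimensional audio signal to obtain transmission channel signals and transmission channel attribute information, where the transmission channel signals include at least one virtual loudspeaker signal group and at least one residual signal group; and determining a bit allocation proportion of the virtual loudspeaker signal group and a bit allocation proportion of the residual signal group based on the transmission channel attribute information. In this solution, encoding the three-dimensional audio signal yields the transmission channel signals and the transmission channel attribute information. The transmission channel signals may include at least one virtual loudspeaker signal group and at least one residual signal group, and the transmission channel attribute information can be used to determine the bit allocation proportion of the virtual loudspeaker signal group and that of the residual signal group separately, which solves the problem that the bit allocation of the signals could not be determined.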
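In a possible implementation, the transmission channel attribute information includes a virtual loudspeaker coding efficiency, and the performing spatial encoding on the to-be-encoded three-dimensional audio signal to obtain the transmission channel attribute information includes: reconstructing the to-be-encoded three-dimensional audio signal by using virtual loudspeakers to obtain a reconstructed three-dimensional audio signal; obtaining an energy representation value of the reconstructed three-dimensional audio signal and an energy representation value of the to-be-encoded three-dimensional audio signal; and obtaining the virtual loudspeaker coding efficiency based on these two energy representation values. In this solution, the encoding end first performs signal reconstruction by using the virtual loudspeakers to obtain the reconstructed three-dimensional audio signal. The encoding end can compute the energy representation value of the signal of each transmission channel; for example, it can obtain the energy representation value of the reconstructed three-dimensional audio signal and that of the to-be-encoded three-dimensional audio signal. Because the energy representation value of a three-dimensional audio signal differs before and after reconstruction, the virtual loudspeaker coding efficiency can be computed from how the energy representation value changes across reconstruction.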
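The text above does not give a formula for the virtual loudspeaker coding efficiency. The Python sketch below is therefore only one plausible reading: it assumes that the energy representation value is the sum of squared samples over all channels and that the efficiency is the ratio of reconstructed energy to original energy. Both choices, and all names, are assumptions made for illustration.

```python
def energy(channels):
    """Assumed energy representation value: sum of squared samples over all channels."""
    return sum(sample * sample for channel in channels for sample in channel)


def coding_efficiency(original, reconstructed):
    """Assumed efficiency: reconstructed energy divided by original energy."""
    e_orig = energy(original)
    if e_orig == 0.0:
        return 0.0  # silent frame: nothing to reconstruct
    return energy(reconstructed) / e_orig


# Two-channel toy frame; the reconstruction recovers only the first channel.
eta = coding_efficiency([[1.0, 2.0], [2.0, 0.0]], [[1.0, 2.0], [0.0, 0.0]])
```

Under this reading, a value near 1 means the virtual loudspeakers capture most of the signal energy, which is the situation in which the virtual loudspeaker signal group would be expected to dominate.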
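In a possible implementation, the transmission channel attribute information includes an energy proportion of the virtual loudspeaker signal group, and the method further includes: obtaining an energy representation value of the virtual loudspeaker signal group based on the energy representation value of each virtual loudspeaker signal in the group; obtaining an energy representation value of the residual signal group based on the energy representation value of each residual signal in the group; and obtaining the energy proportion of the virtual loudspeaker signal group based on the energy representation value of the virtual loudspeaker signal group and the energy representation value of the residual signal group. In this solution, the encoding end first obtains the energy representation value of each virtual loudspeaker signal in the virtual loudspeaker signal group, and then adds up the energy representation values of all virtual loudspeaker signals in the same group to obtain the energy representation value of that group. If there are multiple virtual loudspeaker signal groups, the energy representation value of each group can be computed in the same way. Likewise, the encoding end can obtain the energy representation value of the residual signal group from the energy representation value of each residual signal in it. Finally, the encoding end obtains the energy proportion of the virtual loudspeaker signal group from the two group energy representation values. This energy proportion indicates the share of the virtual loudspeaker signal group in the total transmission channel signal energy: a high value means that the virtual loudspeaker signal group dominates the total transmission channel signal energy, and a low value means that it does not (that is, it is weak).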
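The group energy summation and the resulting energy proportion can be sketched as follows. The summation within a group is taken from the text; the final ratio F/(F+D) is an assumption, chosen to be consistent with the residual proportion D/(F+D) given later in this description. Names are hypothetical.

```python
def group_energy(signal_energies):
    """Energy representation value of a group: sum over its signals."""
    return sum(signal_energies)


def directional_nrg_ratio(speaker_energies, residual_energies):
    """Assumed energy proportion of the virtual loudspeaker signal group: F / (F + D)."""
    f = group_energy(speaker_energies)
    d = group_energy(residual_energies)
    total = f + d
    return f / total if total > 0 else 0.0


ratio = directional_nrg_ratio([6.0, 2.0], [1.0, 1.0])
```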
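In a possible implementation, the transmission channel attribute information includes a virtual loudspeaker coding flag, which indicates whether the bit allocation of the virtual loudspeaker signal group is dominant, and the performing spatial encoding on the to-be-encoded three-dimensional audio signal to obtain the transmission channel attribute information includes: performing spatial encoding on the to-be-encoded three-dimensional audio signal to obtain the number of dissimilar sound sources of the transmission channel signals and the virtual loudspeaker coding efficiency; and obtaining the virtual loudspeaker coding flag based on the number of dissimilar sound sources of the transmission channel signals and the virtual loudspeaker coding efficiency. In this solution, after obtaining both quantities, the encoding end determines the specific value of the virtual loudspeaker coding flag according to which decision condition they satisfy.
In a possible implementation, the obtaining the virtual loudspeaker coding flag based on the number of dissimilar sound sources of the transmission channel signals and the virtual loudspeaker coding efficiency includes: when the number of dissimilar sound sources of the transmission channel signals is less than or equal to a preset dissimilar-sound-source number threshold and the virtual loudspeaker coding efficiency is greater than or equal to a preset first virtual loudspeaker coding efficiency threshold, determining that the virtual loudspeaker coding flag is dominant; or, when the number of dissimilar sound sources of the transmission channel signals is greater than the preset dissimilar-sound-source number threshold, or the virtual loudspeaker coding efficiency is less than the preset first virtual loudspeaker coding efficiency threshold, determining that the virtual loudspeaker coding flag is non-dominant. In this solution, by comparing the number of dissimilar sound sources and the virtual loudspeaker coding efficiency against these decision conditions, the encoding end can determine the virtual loudspeaker coding flag, and can then use the flag to determine the bit allocation proportion of the virtual loudspeaker signal group and that of the residual signal group.
In a possible implementation, dominant includes sub-dominant or strongly dominant, and the determining that the virtual loudspeaker coding flag is dominant includes: when the virtual loudspeaker coding efficiency is greater than or equal to the first virtual loudspeaker coding efficiency threshold and less than or equal to a preset second virtual loudspeaker coding efficiency threshold, determining that the flag is sub-dominant; or, when the virtual loudspeaker coding efficiency is greater than or equal to the first virtual loudspeaker coding efficiency threshold and greater than the preset second virtual loudspeaker coding efficiency threshold, determining that the flag is strongly dominant; where the second virtual loudspeaker coding efficiency threshold is greater than the first. In this solution, the encoding end can further subdivide the dominant case into a sub-dominant case and a strongly dominant case. If the flag is strongly dominant, the virtual loudspeaker signal group needs to be allocated more bits; for example, after the initial bit proportion of the group is determined, that proportion can be increased. If the flag is sub-dominant, the group needs fewer bits than in the strongly dominant case, but still more than in the non-dominant case; for example, after the initial bit proportion of the group is determined, that proportion can also be increased. By comparison, the increase in the strongly dominant case is larger than the increase in the sub-dominant case.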
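The decision conditions of the last two implementations can be sketched as a small classifier. The threshold names follow the notation used later in this description (TH0 for the source-count threshold, TH4 and TH2 for the first and second efficiency thresholds); the string labels and the numeric threshold values are hypothetical.

```python
def coding_flag(num_sources, efficiency, th0, th4, th2):
    """Classify the virtual loudspeaker coding flag.

    th0: dissimilar-sound-source number threshold.
    th4: first coding efficiency threshold; th2: second one, with th2 > th4.
    """
    # Non-dominant: too many dissimilar sources, or efficiency below the first threshold.
    if num_sources > th0 or efficiency < th4:
        return "non_dominant"
    # Dominant: sub-dominant up to and including the second threshold.
    if efficiency <= th2:
        return "sub_dominant"
    return "strongly_dominant"


# Hypothetical thresholds for illustration only.
flag = coding_flag(num_sources=1, efficiency=0.9, th0=2, th4=0.5, th2=0.875)
```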
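In a possible implementation, the transmission channel attribute information includes the energy proportion of the virtual loudspeaker signal group and/or the virtual loudspeaker coding flag, and the determining the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group based on the transmission channel attribute information includes: when the energy proportion of the virtual loudspeaker signal group is greater than or equal to a preset first energy proportion threshold, and/or the virtual loudspeaker coding flag is strongly dominant, determining the two bit allocation proportions according to a preset first signal-group bit allocation algorithm; when the energy proportion of the virtual loudspeaker signal group is greater than or equal to a preset second energy proportion threshold and less than the preset first energy proportion threshold, and/or the virtual loudspeaker coding flag is sub-dominant, determining the two bit allocation proportions according to a preset second signal-group bit allocation algorithm, where the second energy proportion threshold is less than the first energy proportion threshold; or, when the energy proportion of the virtual loudspeaker signal group is less than the preset first energy proportion threshold, or the virtual loudspeaker coding flag is non-dominant, determining the two bit allocation proportions according to a preset third signal-group bit allocation algorithm. In this solution, the encoding end can preset multiple signal-group bit allocation algorithms and use different ones when the transmission channel attribute information satisfies different conditions, so that the virtual loudspeaker signal group and the residual signal group are given bit allocation proportions suited to the condition that holds. This improves the efficiency with which the encoding end encodes the three-dimensional audio signal.
In a possible implementation, the determining the two bit allocation proportions according to the preset first signal-group bit allocation algorithm, when the energy proportion of the virtual loudspeaker signal group is greater than or equal to the preset first energy proportion threshold and/or the virtual loudspeaker coding flag is strongly dominant, includes: when directionalNrgRatio ≥ TH1, and/or S ≤ TH0 and η > TH2, computing the bit allocation proportion of the virtual loudspeaker signal group as: Ratio1_1 = FAC1*directionalNrgRatio + (1 - FAC1)*maxdirectionalNrgRatio; where directionalNrgRatio denotes the energy proportion of the virtual loudspeaker signal group, S is the number of dissimilar sound sources, η denotes the virtual loudspeaker coding efficiency, maxdirectionalNrgRatio is the preset maximum bit allocation proportion of the virtual loudspeaker signal group, FAC1 is a preset first adjustment factor, Ratio1_1 is the bit allocation proportion of the virtual loudspeaker signal group, * denotes multiplication, TH1 is the first energy proportion threshold, TH0 is the dissimilar-sound-source number threshold, and TH2 is the second virtual loudspeaker coding efficiency threshold; and computing the bit allocation proportion of the residual signal group as: Ratio2 = 1 - Ratio1_1; where Ratio1_1 is the bit allocation proportion of the virtual loudspeaker signal group and Ratio2 is the bit allocation proportion of the residual signal group. In this solution, the computation of Ratio1_1 shows that the bit allocation proportion of the virtual loudspeaker signal group is increased, so the encoding end can allocate more bits to the virtual loudspeaker signal group. Because the transmission channel signals consist of the virtual loudspeaker signal group and the residual signal group, once Ratio1_1 is obtained, the bit allocation proportion of the residual signal group follows from the formula for Ratio2.
In a possible implementation, after the bit allocation proportion of the virtual loudspeaker signal group is obtained, the method further includes updating it as: Ratio1_2 = min(Ratio1_1, maxdirectionalNrgRatio + FAC2*Ratio1_1); where Ratio1_2 denotes the updated bit allocation proportion of the virtual loudspeaker signal group, FAC2 is a preset second adjustment factor, maxdirectionalNrgRatio is the preset maximum bit allocation proportion of the virtual loudspeaker signal group, Ratio1_1 is the bit allocation proportion of the virtual loudspeaker signal group before the update, * denotes multiplication, and min takes the minimum. In this solution, the computation of Ratio1_2 applies a safety limit to the bit allocation proportion of the virtual loudspeaker signal group, keeping Ratio1_2 within a safe range so that the encoding end can allocate bits to the virtual loudspeaker signal group safely.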
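In a possible implementation, the determining the two bit allocation proportions according to the preset second signal-group bit allocation algorithm, when the energy proportion of the virtual loudspeaker signal group is greater than or equal to the preset second energy proportion threshold and less than the preset first energy proportion threshold and/or the virtual loudspeaker coding flag is sub-dominant, where the second energy proportion threshold is less than the first energy proportion threshold, includes: when TH3 ≤ directionalNrgRatio < TH1, and/or S ≤ TH0 and TH4 ≤ η ≤ TH2, computing Ratio1_1 as: Ratio1_1 = FAC3*directionalNrgRatio + (1 - FAC3)*maxdirectionalNrgRatio; where maxdirectionalNrgRatio is the preset bit allocation proportion of the virtual loudspeaker signal group, FAC3 is a preset third adjustment factor, directionalNrgRatio denotes the energy proportion of the virtual loudspeaker signal group, S is the number of dissimilar sound sources, η denotes the virtual loudspeaker coding efficiency, Ratio1_1 is the bit allocation proportion of the virtual loudspeaker signal group, * denotes multiplication, TH0 is the dissimilar-sound-source number threshold, TH1 is the first energy proportion threshold, TH2 is the second virtual loudspeaker coding efficiency threshold, TH3 is the second energy proportion threshold, and TH4 is the first virtual loudspeaker coding efficiency threshold; and computing the bit allocation proportion of the residual signal group as: Ratio2 = 1 - Ratio1_1; where Ratio1_1 is the bit allocation proportion of the virtual loudspeaker signal group and Ratio2 is the bit allocation proportion of the residual signal group. In this solution, the computation of Ratio1_1 shows that the bit allocation proportion of the virtual loudspeaker signal group is increased, so the encoding end can allocate more bits to the virtual loudspeaker signal group. Because the transmission channel signals consist of the virtual loudspeaker signal group and the residual signal group, once Ratio1_1 is obtained, the bit allocation proportion of the residual signal group follows from the formula for Ratio2.
In a possible implementation, after the bit allocation proportion of the virtual loudspeaker signal group is obtained, the method further includes updating it as: Ratio1_2 = min(Ratio1_1, maxdirectionalNrgRatio + FAC4*Ratio1_1); where Ratio1_2 denotes the updated bit allocation proportion of the virtual loudspeaker signal group, FAC4 is a preset fourth adjustment factor, maxdirectionalNrgRatio is the preset maximum bit allocation proportion of the virtual loudspeaker signal group, Ratio1_1 is the bit allocation proportion of the virtual loudspeaker signal group before the update, * denotes multiplication, and min takes the minimum. In this solution, the computation of Ratio1_2 applies a safety limit to the bit allocation proportion of the virtual loudspeaker signal group, keeping Ratio1_2 within a safe range so that the encoding end can allocate bits to the virtual loudspeaker signal group safely.
In a possible implementation, the method further includes: when there are multiple residual signal groups, computing the bit allocation proportion of the i-th residual signal group as: Ratio2_i = Ratio2*(R_i/C); where R_i denotes the number of transmission channels in the i-th residual signal group, C is the total number of transmission channels of all residual signal groups, Ratio2_i is the bit allocation proportion of the i-th residual signal group, * denotes multiplication, and Ratio2 is the bit allocation proportion of all residual signal groups. In this solution, when there are multiple residual signal groups, the share of each group among all residual signal groups can be determined from its number of transmission channels. For example, R_i/C is the ratio of the transmission channels of the i-th residual signal group to those of all residual signal groups, and the bit allocation proportion of the i-th residual signal group is obtained from (R_i/C) and Ratio2.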
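In a possible implementation, the determining the two bit allocation proportions according to the preset third signal-group bit allocation algorithm, when the energy proportion of the virtual loudspeaker signal group is less than the preset first energy proportion threshold or the virtual loudspeaker coding flag is non-dominant, includes: when directionalNrgRatio < TH3, or S > TH0, or η < TH4, computing the bit allocation proportion of the virtual loudspeaker signal group as: Ratio1_1 = directionalNrgRatio; where directionalNrgRatio denotes the energy proportion of the virtual loudspeaker signal group, Ratio1_1 is the bit allocation proportion of the virtual loudspeaker signal group, TH3 is the second energy proportion threshold, TH4 is the first virtual loudspeaker coding efficiency threshold, S is the number of dissimilar sound sources, η denotes the virtual loudspeaker coding efficiency, and TH0 is the dissimilar-sound-source number threshold; and computing the bit allocation proportion of the residual signal group as: Ratio2_1 = D/(F+D); where Ratio2_1 is the bit allocation proportion of the residual signal group, F denotes the energy representation value of the virtual loudspeaker signal group, and D is the energy representation value of the residual signal group. In this solution, the computation of Ratio1_1 shows that the bit allocation proportion of the virtual loudspeaker signal group equals its energy proportion. Therefore, when the bit allocation of the virtual loudspeaker signal group is non-dominant, the encoding end does not allocate extra bits to that group, which keeps the bit allocation at the encoding end reasonable.
In a possible implementation, the method further includes: after the bit allocation proportion of the virtual loudspeaker signal group is obtained, updating it as follows: when Ratio1_1 < groupBitsRatio1, Ratio1_2 = groupBitsRatio1; when Ratio1_1 ≥ groupBitsRatio1, Ratio1_2 = FAC5*groupBitsRatio1 + (1 - FAC5)*Ratio1_1; where Ratio1_2 denotes the updated bit allocation proportion of the virtual loudspeaker signal group, FAC5 is a preset fifth adjustment factor, Ratio1_1 is the bit allocation proportion of the virtual loudspeaker signal group before the update, * denotes multiplication, and groupBitsRatio1 is the preset bit allocation proportion of the virtual loudspeaker signal group; and, after the bit allocation proportion of the residual signal group is obtained, updating it as follows: when Ratio2_1 < groupBitsRatio2, Ratio2_2 = groupBitsRatio2; when Ratio2_1 ≥ groupBitsRatio2, Ratio2_2 = FAC6*groupBitsRatio2 + (1 - FAC6)*Ratio2_1; where Ratio2_2 denotes the updated bit allocation proportion of the residual signal group, FAC6 is a preset sixth adjustment factor, Ratio2_1 is the bit allocation proportion of the residual signal group before the update, * denotes multiplication, and groupBitsRatio2 is the preset bit allocation proportion of the residual signal group. In this solution, the computation of Ratio1_2 applies a safety limit to the bit allocation proportion of the virtual loudspeaker signal group, keeping Ratio1_2 within a safe range so that the encoding end can allocate bits to the virtual loudspeaker signal group safely. Likewise, the computation of Ratio2_2 applies a safety limit to the bit allocation proportion of the residual signal group, keeping Ratio2_2 within a safe range so that the encoding end can allocate bits to the residual signal group safely.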
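The update rule above has the same shape for both groups: a floor at the preset proportion, and otherwise a blend toward it. A minimal Python sketch, with a hypothetical function name and sample values:

```python
def apply_preset_limit(ratio, preset, fac):
    """Update a group's bit allocation proportion against its preset value.

    ratio:  proportion before the update (Ratio1_1 or Ratio2_1).
    preset: groupBitsRatio1 or groupBitsRatio2.
    fac:    FAC5 or FAC6.
    """
    if ratio < preset:
        # Below the preset: raise the proportion to the preset.
        return preset
    # At or above the preset: blend toward it.
    return fac * preset + (1.0 - fac) * ratio


ratio1_2 = apply_preset_limit(0.6, 0.2, 0.5)   # blended value
ratio2_2 = apply_preset_limit(0.1, 0.2, 0.75)  # raised to the preset
```

Since each group is updated independently, the updated proportions need not sum to 1. The text above does not state whether they are renormalized afterwards, so this sketch leaves them as computed.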
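In a possible implementation, the method further includes: determining the number of bits of the virtual loudspeaker signal group and the number of bits of the residual signal group based on the bit allocation proportion of the virtual loudspeaker signal group, the bit allocation proportion of the residual signal group and the total number of transmission channel bits; and performing bit allocation for the virtual loudspeaker signal group based on its number of bits, and for the residual signal group based on its number of bits. In this solution, the encoding end allocates bits to the virtual loudspeaker signal group and to the residual signal group according to their respective numbers of bits, which solves the problem that the encoding end could not allocate bits to the virtual loudspeaker signals and residual signals.
In a possible implementation, the determining the number of bits of the virtual loudspeaker signal group and the number of bits of the residual signal group based on the two bit allocation proportions and the total number of transmission channel bits includes: computing the number of bits of the virtual loudspeaker signal group as: F_bitnum = Ratio1*C_bitnum; where F_bitnum is the number of bits of the virtual loudspeaker signal group, Ratio1 is the bit allocation proportion of the virtual loudspeaker signal group, and C_bitnum is the total number of transmission channel bits; and computing the number of bits of the residual signal group as: D_bitnum = Ratio2*C_bitnum; where D_bitnum is the number of bits of the residual signal group, Ratio2 is the bit allocation proportion of the residual signal group, and C_bitnum is the total number of transmission channel bits. In this solution, the encoding end can determine the total number of transmission channel bits in advance (its value is not limited) and can use the formulas above to compute the number of bits of the virtual loudspeaker signal group and of the residual signal group, thereby implementing bit allocation for the virtual loudspeaker signals and residual signals.
In a possible implementation, the method further includes: encoding the transmission channel signals, the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group, and writing them into a bitstream. In this solution, the two bit allocation proportions can be encoded into the bitstream. After the encoding end sends the bitstream to the decoding end, the decoding end parses the bitstream to obtain the two bit allocation proportions, uses them to obtain the number of bits allocated to the virtual loudspeaker signal group and the number of bits allocated to the residual signals, and can then decode the bitstream to obtain the three-dimensional audio signal.
According to a second aspect, an embodiment of this application further provides a method for processing a three-dimensional audio signal, including: receiving a bitstream; decoding the bitstream to obtain a bit allocation proportion of a virtual loudspeaker signal group and a bit allocation proportion of a residual signal group; and decoding the virtual loudspeaker signals and residual signals in the bitstream based on the two bit allocation proportions to obtain a decoded three-dimensional audio signal. In this solution, the two bit allocation proportions can be encoded into the bitstream. After the encoding end sends the bitstream to the decoding end, the decoding end parses the bitstream to obtain the two bit allocation proportions, uses them to obtain the number of bits allocated to the virtual loudspeaker signal group and the number of bits allocated to the residual signals, and can then decode the bitstream to obtain the three-dimensional audio signal.
In a possible implementation, the decoding the virtual loudspeaker signals and residual signals in the bitstream based on the two bit allocation proportions includes: determining the number of available bits based on the bitstream; determining the number of bits of the virtual loudspeaker signal group based on the number of available bits and the bit allocation proportion of the virtual loudspeaker signal group; decoding the virtual loudspeaker signals in the bitstream based on the number of bits of the virtual loudspeaker signal group; determining the number of bits of the residual signal group based on the number of available bits and the bit allocation proportion of the residual signal group; and decoding the residual signals in the bitstream based on the number of bits of the residual signal group.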
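The description says that the proportions are written into the bitstream and parsed by the decoder, but it does not specify how they are quantized. The sketch below assumes a uniform 4-bit quantizer purely for illustration; the quantizer, its bit width and all function names are hypothetical. The group bit counts are computed from the quantization indices with integer arithmetic, so an encoder and a decoder running the same code obtain identical results.

```python
def quantize_ratio(ratio, num_bits=4):
    """Map a proportion in [0, 1] to an integer index (assumed uniform quantizer)."""
    levels = (1 << num_bits) - 1
    return max(0, min(levels, round(ratio * levels)))


def group_bits_from_indices(available_bits, indices, num_bits=4):
    """Bits per group = proportion * available bits, using the parsed indices.

    Integer arithmetic (index / levels applied via floor division) avoids
    floating-point differences between encoder and decoder.
    """
    levels = (1 << num_bits) - 1
    return [available_bits * index // levels for index in indices]


# Encoder side: quantize Ratio1 and Ratio2 before writing them.
indices = [quantize_ratio(0.8), quantize_ratio(0.2)]
# Decoder side: derive the group bit counts from the parsed indices.
speaker_bits, residual_bits = group_bits_from_indices(1500, indices)
```

One consequence of this design is that the encoder should also allocate bits from the quantized proportions rather than the unquantized ones; otherwise the two ends could disagree on the bit budget of each group.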
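According to a third aspect, an embodiment of this application further provides an apparatus for processing a three-dimensional audio signal, including: an encoding module, configured to perform spatial encoding on a to-be-encoded three-dimensional audio signal to obtain transmission channel signals and transmission channel attribute information, where the transmission channel signals include at least one virtual loudspeaker signal group and at least one residual signal group; and a bit allocation proportion determining module, configured to determine a bit allocation proportion of the virtual loudspeaker signal group and a bit allocation proportion of the residual signal group based on the transmission channel attribute information.
In the third aspect of this application, the modules of the apparatus for processing a three-dimensional audio signal may further perform the steps described in the first aspect and its possible implementations. For details, see the foregoing description of the first aspect and its possible implementations.
According to a fourth aspect, an embodiment of this application further provides an apparatus for processing a three-dimensional audio signal, including: a receiving module, configured to receive a bitstream; a decoding module, configured to decode the bitstream to obtain a bit allocation proportion of a virtual loudspeaker signal group and a bit allocation proportion of a residual signal group; and a signal generating module, configured to decode the virtual loudspeaker signals and residual signals in the bitstream based on the two bit allocation proportions to obtain a decoded three-dimensional audio signal.
In the fourth aspect of this application, the modules of the apparatus for processing a three-dimensional audio signal may further perform the steps described in the second aspect and its possible implementations. For details, see the foregoing description of the second aspect and its possible implementations.
According to a fifth aspect, an embodiment of this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the method according to the first aspect or the second aspect.
According to a sixth aspect, an embodiment of this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the method according to the first aspect or the second aspect.
According to a seventh aspect, an embodiment of this application provides a computer-readable storage medium including a bitstream generated by the method according to the first aspect.
According to an eighth aspect, an embodiment of this application provides a communication apparatus, which may include an entity such as a terminal device or a chip. The communication apparatus includes a processor and a memory; the memory is configured to store instructions, and the processor is configured to execute the instructions in the memory, so that the communication apparatus performs the method according to any one of the first aspect or the second aspect.
According to a ninth aspect, this application provides a chip system. The chip system includes a processor, configured to support an audio encoder or an audio decoder in implementing the functions involved in the foregoing aspects, for example, sending or processing the data and/or information involved in the foregoing methods. In a possible design, the chip system further includes a memory, configured to store the program instructions and data necessary for the audio encoder or the audio decoder. The chip system may consist of a chip, or may include a chip and other discrete devices.
It can be seen from the foregoing technical solutions that the embodiments of this application have the following advantages:
In the embodiments of this application, spatial encoding is first performed on a to-be-encoded three-dimensional audio signal to obtain transmission channel signals and transmission channel attribute information, where the transmission channel signals include at least one virtual loudspeaker signal group and at least one residual signal group; then the bit allocation proportion of the virtual loudspeaker signal group and the bit allocation proportion of the residual signal group are determined based on the transmission channel attribute information. Encoding the three-dimensional audio signal yields the transmission channel signals and the transmission channel attribute information. The transmission channel signals may include at least one virtual loudspeaker signal group and at least one residual signal group, and the transmission channel attribute information can be used to determine the two bit allocation proportions separately, which solves the problem that the bit allocation of the signals could not be determined.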
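Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of an audio processing system according to an embodiment of this application;
FIG. 2a is a schematic diagram of an audio encoder and an audio decoder applied to a terminal device according to an embodiment of this application;
FIG. 2b is a schematic diagram of an audio encoder applied to a wireless device or a core network device according to an embodiment of this application;
FIG. 2c is a schematic diagram of an audio decoder applied to a wireless device or a core network device according to an embodiment of this application;
FIG. 3a is a schematic diagram of a multi-channel encoder and a multi-channel decoder applied to a terminal device according to an embodiment of this application;
FIG. 3b is a schematic diagram of a multi-channel encoder applied to a wireless device or a core network device according to an embodiment of this application;
FIG. 3c is a schematic diagram of a multi-channel decoder applied to a wireless device or a core network device according to an embodiment of this application;
FIG. 4 is a schematic diagram of a method for processing a three-dimensional audio signal according to an embodiment of this application;
FIG. 5 is a schematic diagram of a method for processing a three-dimensional audio signal according to an embodiment of this application;
FIG. 6 is a schematic diagram of an application scenario of a three-dimensional audio signal according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of an audio encoding apparatus according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of an audio decoding apparatus according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of another audio encoding apparatus according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of another audio decoding apparatus according to an embodiment of this application.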
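Detailed Description
The embodiments of this application are described below with reference to the accompanying drawings.
The terms "first", "second" and the like in the specification, claims and accompanying drawings of this application are used to distinguish between similar objects, and do not necessarily describe a particular order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the manner of distinguishing objects with the same attribute when describing the embodiments of this application. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product or device that contains a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product or device.
Sound is a continuous wave produced by the vibration of an object. An object that vibrates and emits sound waves is called a sound source. As sound waves propagate through a medium (such as air, a solid or a liquid), the auditory organs of humans or animals can perceive the sound.
The characteristics of a sound wave include pitch, intensity and timbre. Pitch indicates how high or low a sound is. Intensity indicates how loud a sound is; it may also be called loudness or volume, and its unit is the decibel (dB). Timbre is also called tone quality.
The frequency of a sound wave determines its pitch: the higher the frequency, the higher the pitch. The number of times an object vibrates in one second is called the frequency, and its unit is the hertz (Hz). The human ear can recognize sounds with frequencies between 20 Hz and 20000 Hz.
The amplitude of a sound wave determines its intensity: the larger the amplitude, the greater the intensity. The closer to the sound source, the greater the intensity.
The waveform of a sound wave determines its timbre. Waveforms include square waves, sawtooth waves, sine waves, pulse waves and so on.
Based on the characteristics of sound waves, sounds can be divided into regular and irregular sounds. An irregular sound is one emitted by a source that vibrates irregularly, for example noise that disturbs people's work, study or rest. A regular sound is one emitted by a source that vibrates regularly; regular sounds include speech and musical tones. When sound is represented electrically, a regular sound is an analog signal that varies continuously in the time-frequency domain. This analog signal may be called an audio signal (acoustic signals). An audio signal is an information carrier that carries speech, music and sound effects.
Because human hearing can discern the spatial distribution of sound sources, a listener who hears a sound in space perceives not only its pitch, intensity and timbre but also its direction.
As people pay increasing attention to, and demand higher quality from, the listening experience, three-dimensional audio technology has emerged to enhance the sense of depth, presence and space of sound. The listener not only perceives sounds from sources in front, behind, to the left and to the right, but also feels surrounded by the spatial sound field ("sound field" for short) produced by these sources and feels the sound spreading in all directions, which creates an immersive acoustic effect as if the listener were in a cinema or a concert hall.
Three-dimensional audio technology treats the space outside the human ear as a system; the signal received at the eardrum is the three-dimensional audio signal output after the sound emitted by a source is filtered by that system. For example, the system outside the ear can be defined by an impulse response h(n) and any sound source by x(n); the signal received at the eardrum is then the convolution of x(n) and h(n). The three-dimensional audio signal in the embodiments of this application may be a higher order ambisonics (HOA) signal or a first order ambisonics (FOA) signal. Three-dimensional audio may also be called 3D sound effects, spatial audio, 3D sound field reconstruction, virtual 3D audio, binaural audio, and so on.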
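The statement that the eardrum signal is the convolution of x(n) and h(n) can be illustrated with a direct discrete convolution in plain Python. The sample values are arbitrary and the function name is hypothetical; a practical system would normally use an optimized or FFT-based routine.

```python
def convolve(x, h):
    """Direct discrete convolution y(n) = sum over k of x(k) * h(n - k)."""
    if not x or not h:
        return []
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            # Each input sample contributes a scaled, shifted copy of h.
            y[i + j] += xi * hj
    return y


source = [1.0, 2.0, 3.0]      # x(n): a short source signal
impulse_response = [1.0, 0.5]  # h(n): direct sound plus one weaker echo
received = convolve(source, impulse_response)
```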
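Sound waves propagate in an ideal medium, with wave number k = w/c and angular frequency w = 2πf, where f is the sound wave frequency and c is the speed of sound. The sound pressure p satisfies formula (1), where $\nabla^2$ is the Laplace operator:
$$\nabla^2 p + k^2 p = 0 \qquad (1)$$
Assume that the spatial system outside the human ear is a sphere with the listener at its center. Sound arriving from outside the sphere has a projection on the sphere surface; sound outside the sphere is filtered out, the sound sources are assumed to be distributed on the sphere surface, and the sound field generated by the sources on the surface is used to fit the sound field generated by the original sources. That is, three-dimensional audio technology is a method of fitting a sound field. Specifically, formula (1) is solved in spherical coordinates; within a source-free spherical region, the solution of formula (1) is formula (2) below.
[Formula (2): image not reproduced, Figure PCTCN2022096546-appb-000003]
Here r denotes the sphere radius, θ denotes the horizontal angle, $\varphi$ denotes the elevation angle, k denotes the wave number, s denotes the amplitude of an ideal plane wave, and m denotes the order index of the three-dimensional audio signal (also called the order index of the HOA signal). [Symbol: Figure PCTCN2022096546-appb-000005] denotes the spherical Bessel function, also called the radial basis function, where the first j denotes the imaginary unit; [symbol: Figure PCTCN2022096546-appb-000006] does not vary with angle. [Symbol: Figure PCTCN2022096546-appb-000007] denotes the spherical harmonic function in the direction (θ, $\varphi$), and [symbol: Figure PCTCN2022096546-appb-000009] denotes the spherical harmonic function in the direction of the sound source. The three-dimensional audio signal coefficients satisfy formula (3).
[Formula (3): image not reproduced, Figure PCTCN2022096546-appb-000010]
Substituting formula (3) into formula (2) transforms formula (2) into formula (4).
[Formula (4): image not reproduced, Figure PCTCN2022096546-appb-000011]
Here [symbol: Figure PCTCN2022096546-appb-000012] denotes the N-th order three-dimensional audio signal coefficients, which are used to approximately describe the sound field. A sound field is a region of a medium in which sound waves exist. N is an integer greater than or equal to 1; for example, N is an integer in the range 2 to 6. The coefficients of the three-dimensional audio signal in the embodiments of this application may be HOA coefficients or ambisonic coefficients.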
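A three-dimensional audio signal is an information carrier that carries the spatial position information of the sound sources in a sound field, and describes the sound field of a listener in space. Formula (4) shows that the sound field can be expanded on the sphere surface in terms of spherical harmonic functions, that is, the sound field can be decomposed into a superposition of multiple plane waves. Therefore, the sound field described by a three-dimensional audio signal can be expressed as a superposition of multiple plane waves and reconstructed from the three-dimensional audio signal coefficients.
Compared with a 5.1-channel or 7.1-channel audio signal, an N-th order HOA signal has (N+1)² channels, so the HOA signal contains a large amount of data describing the spatial information of the sound field. Transmitting the three-dimensional audio signal from a capture device (for example, a microphone) to a playback device (for example, a loudspeaker) consumes considerable bandwidth. Currently, an encoder can compress and encode a three-dimensional audio signal into a bitstream by using spatial squeezed surround audio coding (S3AC), directional audio coding (DirAC) or a coding method based on virtual loudspeaker selection, and transmit the bitstream to the playback device. The coding method based on virtual loudspeaker selection may also be called the match projection (MP) coding method, and is used as the example in the following description. The playback device decodes the bitstream, reconstructs the three-dimensional audio signal and plays the reconstructed signal. This reduces the amount of data, and the bandwidth, needed to transmit the three-dimensional audio signal to the playback device.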
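To make the channel-count comparison concrete, the short Python sketch below evaluates (N+1)² for a few orders. The function name is hypothetical; the channel counts 6 and 8 for 5.1 and 7.1 layouts follow from the layout names (five or seven full-range channels plus one low-frequency channel).

```python
def hoa_channel_count(order):
    """Number of channels of an N-th order HOA signal: (N + 1) squared."""
    if order < 0:
        raise ValueError("HOA order must be non-negative")
    return (order + 1) ** 2


# Orders 1 (FOA) through 6, next to 6 channels for 5.1 and 8 for 7.1.
for n in range(1, 7):
    print(f"order {n}: {hoa_channel_count(n)} channels")
```

A third-order signal already has 16 channels, twice as many as a 7.1 layout, which is why the description emphasizes compression before transmission.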
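For the foregoing three-dimensional audio signal, it is currently not possible to classify its sound field. How to classify the sound field of a three-dimensional audio signal is one technical problem to be solved by the embodiments of this application. In the embodiments of this application, sound field classification can be implemented through linear decomposition of the three-dimensional audio signal, so that the sound field can be classified accurately and the sound field classification result of the current frame can be obtained.
In addition, when a current encoder compresses and encodes a three-dimensional audio signal, it cannot achieve a high compression ratio. How to improve the compression ratio when compressing and encoding three-dimensional audio signals of different sound fields is therefore another problem addressed by the embodiments of this application.
The embodiments of this application provide an audio coding technology, in particular a three-dimensional audio coding technology for three-dimensional audio signals, and specifically a coding technology that represents a three-dimensional audio signal with fewer channels, so as to improve conventional audio coding systems. Audio coding (or simply coding) comprises two parts: audio encoding and audio decoding. Audio encoding is performed on the source side and includes processing (for example, compressing) the original audio to reduce the amount of data needed to represent it, so that it can be stored and/or transmitted more efficiently. Audio decoding is performed on the destination side and includes processing inverse to that of the encoder, so as to reconstruct the original audio. The encoding part and the decoding part are also collectively referred to as coding. The implementations of the embodiments of this application are described in detail below with reference to the accompanying drawings.
The technical solutions of the embodiments of this application can be applied to various audio processing systems. FIG. 1 is a schematic structural diagram of an audio processing system according to an embodiment of this application. The audio processing system 100 may include an audio encoding apparatus 101 and an audio decoding apparatus 102. The audio encoding apparatus 101 may be used to generate a bitstream, which can then be transmitted to the audio decoding apparatus 102 over an audio transmission channel. The audio decoding apparatus 102 receives the bitstream, performs its audio decoding function and finally obtains the reconstructed signal.
In the embodiments of this application, the audio encoding apparatus can be applied to various terminal devices that need audio communication and to wireless devices and core network devices that need transcoding; for example, the audio encoding apparatus may be the audio encoder of such a terminal device, wireless device or core network device. Similarly, the audio decoding apparatus can be applied to various terminal devices that need audio communication and to wireless devices and core network devices that need transcoding; for example, the audio decoding apparatus may be the audio decoder of such a terminal device, wireless device or core network device. For example, the audio encoder may be an audio encoder in a radio access network, a media gateway of a core network, a transcoding device, a media resource server, a mobile terminal, a fixed network terminal and so on; the audio encoder may also be one used in a virtual reality (VR) streaming service.
In the embodiments of this application, taking the audio coding modules (audio encoding and audio decoding) used in a virtual reality streaming (VR streaming) service as an example, the end-to-end processing of an audio signal is as follows. After passing through the acquisition module (acquisition), audio signal A undergoes preprocessing (audio preprocessing), which includes filtering out the low-frequency part of the signal, with a cutoff of for example 20 Hz or 50 Hz, and extracting the direction information in the signal. The signal is then encoded (audio encoding), packaged (file/segment encapsulation) and delivered (delivery) to the decoding end. The decoding end first unpacks (file/segment decapsulation) and then decodes (audio decoding) the signal, and performs binaural rendering (audio rendering) on the decoded signal. The rendered signal is mapped to the listener's headphones (headphones), which may be standalone headphones or the headphones of a glasses-type device.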
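FIG. 2a is a schematic diagram of an audio encoder and an audio decoder applied to a terminal device according to an embodiment of this application. Each terminal device may include an audio encoder, a channel encoder, an audio decoder and a channel decoder. Specifically, the channel encoder performs channel encoding on the audio signal, and the channel decoder performs channel decoding on the audio signal. For example, the first terminal device 20 may include a first audio encoder 201, a first channel encoder 202, a first audio decoder 203 and a first channel decoder 204. The second terminal device 21 may include a second audio decoder 211, a second channel decoder 212, a second audio encoder 213 and a second channel encoder 214. The first terminal device 20 is connected to a wireless or wired first network communication device 22, the first network communication device 22 is connected to a wireless or wired second network communication device 23 through a digital channel, and the second terminal device 21 is connected to the second network communication device 23. The wireless or wired network communication devices may refer generally to signal transmission devices, such as communication base stations and data switching devices.
In audio communication, the terminal device acting as the sending end first captures audio, performs audio encoding and then channel encoding on the captured audio signal, and transmits it in a digital channel through a wireless network or a core network. The terminal device acting as the receiving end performs channel decoding on the received signal to obtain the bitstream, recovers the audio signal through audio decoding, and plays back the audio.
FIG. 2b is a schematic diagram of an audio encoder applied to a wireless device or a core network device according to an embodiment of this application. The wireless device or core network device 25 includes a channel decoder 251, another audio decoder 252, the audio encoder 253 provided in the embodiments of this application, and a channel encoder 254, where the other audio decoder 252 is an audio decoder other than the audio decoder provided in the embodiments of this application. In the wireless device or core network device 25, the channel decoder 251 first performs channel decoding on the signal entering the device; the other audio decoder 252 then performs audio decoding on the bitstream decoded by the channel decoder 251; the audio encoder 253 provided in the embodiments of this application then performs audio encoding; and finally the channel encoder 254 performs channel encoding on the audio signal, which is transmitted after channel encoding is complete.
FIG. 2c is a schematic diagram of an audio decoder applied to a wireless device or a core network device according to an embodiment of this application. The wireless device or core network device 25 includes a channel decoder 251, the audio decoder 255 provided in the embodiments of this application, another audio encoder 256 and a channel encoder 254, where the other audio encoder 256 is an audio encoder other than the audio encoder provided in the embodiments of this application. In the wireless device or core network device 25, the channel decoder 251 first performs channel decoding on the signal entering the device; the audio decoder 255 then decodes the received audio bitstream; the other audio encoder 256 then performs audio encoding; and finally the channel encoder 254 performs channel encoding on the audio signal, which is transmitted after channel encoding is complete. In a wireless device or a core network device, if transcoding is required, the corresponding audio coding processing needs to be performed. A wireless device is a radio-frequency-related device in communication, and a core network device is a core-network-related device in communication.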
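In some embodiments of this application, the audio encoding apparatus can be applied to various terminal devices that need audio communication and to wireless devices and core network devices that need transcoding; for example, the audio encoding apparatus may be the multi-channel encoder of such a terminal device, wireless device or core network device. Similarly, the audio decoding apparatus can be applied to various terminal devices that need audio communication and to wireless devices and core network devices that need transcoding; for example, the audio decoding apparatus may be the multi-channel decoder of such a terminal device, wireless device or core network device.
FIG. 3a is a schematic diagram of a multi-channel encoder and a multi-channel decoder applied to a terminal device according to an embodiment of this application. Each terminal device may include a multi-channel encoder, a channel encoder, a multi-channel decoder and a channel decoder. The multi-channel encoder can perform the audio encoding method provided in the embodiments of this application, and the multi-channel decoder can perform the audio decoding method provided in the embodiments of this application. Specifically, the channel encoder performs channel encoding on the multi-channel signal, and the channel decoder performs channel decoding on the multi-channel signal. For example, the first terminal device 30 may include a first multi-channel encoder 301, a first channel encoder 302, a first multi-channel decoder 303 and a first channel decoder 304. The second terminal device 31 may include a second multi-channel decoder 311, a second channel decoder 312, a second multi-channel encoder 313 and a second channel encoder 314. The first terminal device 30 is connected to a wireless or wired first network communication device 32, the first network communication device 32 is connected to a wireless or wired second network communication device 33 through a digital channel, and the second terminal device 31 is connected to the second network communication device 33. The wireless or wired network communication devices may refer generally to signal transmission devices, such as communication base stations and data switching devices. In audio communication, the terminal device acting as the sending end performs multi-channel encoding and then channel encoding on the captured multi-channel signal, and transmits it in a digital channel through a wireless network or a core network. The terminal device acting as the receiving end performs channel decoding on the received signal to obtain the multi-channel signal bitstream, recovers the multi-channel signal through multi-channel decoding, and plays it back.
FIG. 3b is a schematic diagram of a multi-channel encoder applied to a wireless device or a core network device according to an embodiment of this application. The wireless device or core network device 35 includes a channel decoder 351, another audio decoder 352, a multi-channel encoder 353 and a channel encoder 354. This is similar to FIG. 2b, and details are not repeated here.
FIG. 3c is a schematic diagram of a multi-channel decoder applied to a wireless device or a core network device according to an embodiment of this application. The wireless device or core network device 35 includes a channel decoder 351, a multi-channel decoder 355, another audio encoder 356 and a channel encoder 354. This is similar to FIG. 2c, and details are not repeated here.
The audio encoding processing may be part of a multi-channel encoder, and the audio decoding processing may be part of a multi-channel decoder. For example, performing multi-channel encoding on a captured multi-channel signal may consist of processing the captured multi-channel signal to obtain an audio signal and then encoding that audio signal according to the method provided in the embodiments of this application; the decoding end decodes the multi-channel signal bitstream to obtain the audio signal and recovers the multi-channel signal after upmixing. Therefore, the embodiments of this application can also be applied to multi-channel encoders and multi-channel decoders in terminal devices, wireless devices and core network devices. In a wireless device or a core network device, if transcoding is required, the corresponding multi-channel coding processing needs to be performed.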
首先介绍本申请实施例提供的一种三维音频信号的处理方法,该方法可以由终端设备执行,例如该终端设备可以是一种音频编码装置(如下简称编码端或者编码器)。不限定的是,该终端设备还可以是一种三维音频信号的处理装置。如图4所示,三维音频信号的处理方法主要包括如下:
401、对待编码的三维音频信号进行空间编码,以得到传输通道信号和传输通道属性信息,其中,传输通道信号包括:至少一个虚拟扬声器信号组和至少一个残差信号组。
其中,编码端可以获取三维音频信号,例如该三维音频信号可以是场景音频信号。具体的,该三维音频信号可以是时域信号,或者频域信号。另外,该三维音频信号还可以是经过下采样的信号。
其中,本发明实施例中,虚拟扬声器信号和虚拟扬声器是一一对应的。在从候选虚拟扬声器集合中确定了对三维音频信号进行编码的虚拟扬声器后,可以获得这些虚拟扬声器对应的虚拟扬声器信号,然后对这些虚拟扬声器信号进行分组以获得所述的至少一个虚拟扬声器信号组;或者,在从候选虚拟扬声器集合中确定了对三维音频信号进行编码的虚拟扬声器后,可以将这些虚拟扬声器进行分组以获得至少一个虚拟扬声器组,然后分别获得所述至少一个虚拟扬声器组中各个虚拟扬声器对应的虚拟扬声器信号,以获得所述至少一个虚拟扬声器信号组。
在本申请的一些实施例中,三维音频信号包括:高阶立体混响HOA信号,或者一阶立体混响FOA信号。不限定的是,三维音频信号还可以是其它类型的信号,此处只是本申请的一种举例,不作为对本申请实施例的限定。
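For example, the three-dimensional audio signal may be a time-domain HOA signal or a frequency-domain HOA signal. As another example, the three-dimensional audio signal may contain all channels of the HOA signal or only some HOA channels (for example, the FOA channels). In addition, the three-dimensional audio signal may consist of all samples of the HOA signal, or of the 1/Q samples obtained by downsampling the HOA signal to be analyzed, where Q is the downsampling interval and 1/Q is the downsampling rate.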
本申请实施例中,三维音频信号中包括多个帧,接下来以对三维音频信号中的一个帧的处理为例,例如该帧为当前帧,则在三维音频信号中在当前帧之前还存在前一帧,在当前帧之后还存在后一帧。另外,本申请实施例中三维音频信号的除当前帧之外的其它帧的处理方法,与当前帧的处理方法相类似,后续以当前帧的处理为例。
本申请实施例中,在获取到三维音频信号之后,先对三维音频信号进行空间编码,以得到传输通道信号和传输通道属性信息。对于空间编码的具体过程,此处不再展开说明。对于空间编码后输出虚拟扬声器信号和残差信号的过程不再说明。
本申请实施例中,编码端在获取到待编码的三维音频信号之后,可以对该三维音频信号进行空间编码,可以输出传输通道信号和传输通道属性信息,该传输通道信号包括虚拟扬声器信号和残差信号,例如对虚拟扬声器信号进行分组,得到至少一个虚拟扬声器信号组。又如,对残差信号进行分组,得到至少一个残差信号组。本申请实施例中对于传输通道信号中虚拟扬声器信号组的个数和残差信号组的个数不做限定。
本申请实施例中,通过空间编码还可以输出传输通道信号对应的传输通道属性信息,该传输通道属性信息用于指示传输通道信号的属性,传输通道属性信息的实现方式有多种,详见后续实施例的举例说明。
在本申请的一些实施例中,传输通道属性信息包括:虚拟扬声器编码效率;虚拟扬声器编码效率表示对三维音频信号采用虚拟扬声器重建三维音频信号的效率。编码器(也可以为编码端)通过空间编码输出的传输通道属性信息包括虚拟扬声器编码效率,接下来说明该虚拟扬声器编码效率的计算方法。
步骤401对待编码的三维音频信号进行空间编码,以得到传输通道属性信息,包括:
采用虚拟扬声器对待编码的三维音频信号进行信号重建,以得到重建后的三维音频信号;其中,对待编码的三维音频信号进行信号重建的虚拟扬声器可以是前述的从所述候选虚拟扬声器集合中确定的用于对三维音频信号进行编码的虚拟扬声器。
获取重建后的三维音频信号的能量表征值,以及待编码的三维音频信号的能量表征值;
根据重建后的三维音频信号的能量表征值,以及待编码的三维音频信号的能量表征值,获取虚拟扬声器编码效率。
其中,编码端首先进行采用虚拟扬声器进行信号重建,得到了重建后的三维音频信号。编码端可以计算每个传输通道的信号的能量表征值,例如可以获取重建后的三维音频信号的能量表征值,以及待编码的三维音频信号的能量表征值,三维音频信号的能量表征值在信号重建前后是不同的,因此通过信号重建前后的能量表征值的变换情况,可以计算出虚拟扬声器编码效率。
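The following example shows how to calculate the virtual speaker coding efficiency, taking an HOA signal as the three-dimensional audio signal. The encoder side calculates the energy representation value of each transmission channel of the reconstructed HOA signal, denoted R1, R2, …, Rt, and the energy representation value of each transmission channel of the original HOA signal, denoted N1, N2, …, Nt. The virtual speaker coding efficiency η is then η = sum(R)/sum(N), where sum(R) is the sum of R1 to Rt and sum(N) is the sum of N1 to Nt.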
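The ratio η = sum(R)/sum(N) can be illustrated with a minimal Python sketch. The function name, the sample energy values, and the handling of a silent frame are assumptions made for illustration; they are not taken from this application.

```python
def coding_efficiency(reconstructed_energy, original_energy):
    """Return eta = sum(R) / sum(N) over all transmission channels."""
    total_original = sum(original_energy)
    if total_original == 0:
        # Assumption: a silent frame is treated as zero efficiency.
        return 0.0
    return sum(reconstructed_energy) / total_original


# Hypothetical per-channel energy representation values R1..R4 and N1..N4.
eta = coding_efficiency([3.0, 2.0, 1.0, 1.0], [4.0, 2.0, 1.0, 1.0])
print(eta)
```

A value of η close to 1 means the virtual speakers reproduce most of the energy of the original signal.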
在本申请的一些实施例中,传输通道属性信息包括:虚拟扬声器信号组的能量占比;虚拟扬声器信号组的能量占比是指虚拟扬声器信号组中所有虚拟扬声器信号的能量在所有传输通道信号的总能量中的占比。接下来说明该虚拟扬声器信号组的能量占比的计算方法。
编码端执行的方法还包括:
根据虚拟扬声器信号组中每个虚拟扬声器信号的能量表征值获取虚拟扬声器信号组的能量表征值;
根据残差信号组中每个残差信号的能量表征值获取残差信号组的能量表征值;
根据虚拟扬声器信号组的能量表征值和残差信号组的能量表征值,获取虚拟扬声器信号组的能量占比。
其中,编码端首先获取虚拟扬声器信号组中每个虚拟扬声器信号的能量表征值,再将同一个组内的所有虚拟扬声器信号的能量表征值进行相加,以得到该虚拟扬声器信号组的能量表征值。若虚拟扬声器信号组有多个时,每个组都可以按照上述方式计算得到该虚拟扬声器信号组的能量表征值。
同样的方式,编码端可以根据残差信号组中每个残差信号的能量表征值获取残差信号组的能量表征值。最后编码端可以根据虚拟扬声器信号组的能量表征值和残差信号组的能量表征值,获取虚拟扬声器信号组的能量占比。虚拟扬声器信号组的能量占比可以说明该虚拟扬声器信号组在总的传输通道信号能量中的占比,若该虚拟扬声器信号组的能量占比较高,则说明虚拟扬声器信号组在总的传输通道信号能量中占优,若该虚拟扬声器信号组的能量占比较低,则说明虚拟扬声器信号组在总的传输通道信号能量中不占优(即较弱)。
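The group energy and energy ratio described above can be sketched as follows. This is only an illustration: the function names and the sample energies are hypothetical, and a single virtual speaker signal group is assumed.

```python
def group_energy(channel_energies):
    """Energy representation value of a group: sum over its channels."""
    return sum(channel_energies)


def speaker_group_energy_ratio(speaker_group, residual_groups):
    """Share of the virtual speaker group in the total transmission channel energy."""
    speaker_total = group_energy(speaker_group)
    residual_total = sum(group_energy(g) for g in residual_groups)
    total = speaker_total + residual_total
    if total == 0:
        return 0.0  # assumption: silent frame
    return speaker_total / total


# Hypothetical energies: one speaker group with 2 channels, two residual groups.
ratio = speaker_group_energy_ratio([6.0, 3.0], [[1.0, 1.0], [0.5, 0.5]])
print(ratio)
```

Here the speaker group holds 9 of the 12 energy units, so it is the dominant part of the transmission channel signal.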
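In some embodiments of this application, the transmission channel attribute information includes a virtual speaker coding flag, which indicates whether the bit allocation of the virtual speaker signal group is dominant. Specifically, the virtual speaker coding flag indicates whether the bit allocation of at least one virtual speaker signal group is dominant. For example, the flag may be denoted flag, and its different values indicate that the bit allocation of the virtual speaker signal group is dominant or not dominant. The dominant case can be further divided into strongly dominant and weakly dominant (that is, slightly dominant).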
对待编码的三维音频信号进行空间编码,以得到传输通道属性信息,包括:
对待编码的三维音频信号进行空间编码,以得到传输通道信号的相异性声源数量和虚拟扬声器编码效率;
根据传输通道信号的相异性声源数量和虚拟扬声器编码效率获取虚拟扬声器编码标识。
其中,编码端通过空间编码,可以对传输通道信号进行声场分类,并生成声场分类结果,该声场分类结果可以包括相异性声源数量,对于相异性声源数量的具体计算过程,此处不做限定。对于虚拟扬声器编码效率的确定方式详见前述实施例,此处不做赘述。编码端在获取到传输通道信号的相异性声源数量和虚拟扬声器编码效率之后,根据传输通道信号的相异性声源数量和虚拟扬声器编码效率所符合的判决条件获取虚拟扬声器编码标识的具体取值,本申请实施例中虚拟扬声器编码标识的获取方式有多种实现方式,详见后续实施例的举例说明。
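In some embodiments of this application, obtaining the virtual speaker coding flag based on the number of heterogeneous sound sources in the transmission channel signal and the virtual speaker coding efficiency further includes: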
当传输通道信号的相异性声源数量小于或等于预设的相异性声源数量阈值,且虚拟扬声器编码效率大于或等于预设的第一虚拟扬声器编码效率阈值时,确定虚拟扬声器编码标识为占优;或,
当传输通道信号的相异性声源数量大于预设的相异性声源数量阈值,或虚拟扬声器编码效率小于预设的第一虚拟扬声器编码效率阈值时,确定虚拟扬声器编码标识为不占优。
其中,本申请实施例中对于相异性声源数量阈值、第一虚拟扬声器编码效率阈值的具体实现方式可以结合应用场景,此处不做限定。例如,相异性声源数量阈值可以表示为TH0,第一虚拟扬声器编码效率阈值可以表示为TH4。
具体的,虚拟扬声器编码标识为占优,表示虚拟扬声器信号组在总的传输通道信号中占优,因此该虚拟扬声器信号组需要分配更多的比特,例如在确定虚拟扬声器信号组的初始比特占比之后,可以增加该比特占比。又如,虚拟扬声器编码标识为不占优,表示虚拟扬声器信号组在总的传输通道信号中不占优,此时可以为该虚拟扬声器信号组分配较少的比特。例如在确定虚拟扬声器信号组的初始比特占比之后,可以减少该比特占比。本申请实施例中,编码端通过相异性声源数量、虚拟扬声器编码效率与上述判决条件的比较,可以确定虚拟扬声器编码标识,从而可以使用虚拟扬声器编码标识来确定虚拟扬声器信号组的比特分配占比,以及残差信号组的比特分配占比。
进一步的,在本申请的一些实施例中,所述占优包括次占优或强占优;确定虚拟扬声器编码标识为占优,包括:
当虚拟扬声器编码效率大于或等于第一虚拟扬声器编码效率阈值、且虚拟扬声器编码效率小于或等于预设的第二虚拟扬声器编码效率阈值时,确定虚拟扬声器编码标识为次占优;或,
当所述虚拟扬声器编码效率大于或等于所述第一虚拟扬声器编码效率阈值、且虚拟扬声器编码效率大于预设的第二虚拟扬声器编码效率阈值时,确定虚拟扬声器编码标识为强占优;
其中,第二虚拟扬声器编码效率阈值大于第一虚拟扬声器编码效率阈值。
具体的,当传输通道信号的相异性声源数量小于或等于预设的相异性声源数量阈值,且虚拟扬声器编码效率大于或等于预设的第一虚拟扬声器编码效率阈值时,确定虚拟扬声器编码标识为占优,编码端还可以进一步的针对虚拟扬声器编码标识为占优的情况进行划分,即可以得到虚拟扬声器编码标识次占优和强占优这两种情况。可以理解的是,若虚拟扬声器编码标识为强占优,因此该虚拟扬声器信号组需要分配更多的比特,例如在确定虚拟扬声器信号组的初始比特占比之后,可以增加该比特占比。若虚拟扬声器编码标识为次占优,因此该虚拟扬声器信号组需要分配少于虚拟扬声器编码标识为强占优时的比特,但是虚拟扬声器信号组需要分配的比特仍需要大于虚拟扬声器编码标识为不占优时的比特,例如在确定虚拟扬声器信号组的初始比特占比之后,可以增加该比特占比。相比较的话,在强占优的情况下,所增加的比特占比要大于在次占优情况下所增加的比特占比。
例如,第二虚拟扬声器编码效率阈值可以表示为TH2。
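The decision rules for the virtual speaker coding flag can be summarized in a short Python sketch. The function name and the string labels are assumptions; the threshold values in the usage lines are example values that appear in the worked scenario later in this description, not required values.

```python
def speaker_coding_flag(num_sources, eta, th0, th4, th2):
    """Classify the frame as strongly dominant, weakly dominant, or not dominant.

    num_sources: number of heterogeneous sound sources (S)
    eta: virtual speaker coding efficiency
    th0: threshold for the number of heterogeneous sound sources
    th4: first coding efficiency threshold
    th2: second coding efficiency threshold (th2 > th4)
    """
    if num_sources <= th0 and eta > th2:
        return "High"    # strongly dominant
    if num_sources <= th0 and th4 <= eta <= th2:
        return "Middle"  # weakly dominant
    return "Low"         # not dominant: too many sources or efficiency below th4


print(speaker_coding_flag(1, 0.9, th0=2, th4=0.6875, th2=0.875))
print(speaker_coding_flag(3, 0.95, th0=2, th4=0.6875, th2=0.875))
```

Because the two dominant cases both require S ≤ TH0, any frame with more sources than TH0 falls through to the not-dominant case regardless of η.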
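402. Determine the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group based on the transmission channel attribute information.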
其中,编码端在获取到传输通道信号和传输通道属性信息之后,由于传输通道属性信息中携带有传输通道信号的属性参数,因此使用该传输通道属性信息可以为虚拟扬声器信号组进行比特分配,另外,使用该传输通道属性信息可以为残差信号组进行比特分配。例如,编码端根据传输通道属性信息确定虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比。比特分配占比是指为一个信号组分配的比特数所占传输通道信号的总的比特数的比例值,比特分配占比也可以称为“比特分配比例”。本申请实施例中传输通道信号包括至少一个虚拟扬声器信号组和至少一个残差信号组,因此可以获取到虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比。后续实施例中以一个虚拟扬声器信号组和两个残差信号组的比特分配占比的确定过程为例进行说明。
举例说明如下,本申请实施例中空间编码可以输出传输通道信号和传输通道属性信息,由核心编码器获取到该传输通道信号和传输通道属性信息,核心编码器再通过传输通道信号和传输通道属性信息,可以获取到虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比。
在本申请的一些实施例中,传输通道属性信息包括:虚拟扬声器信号组的能量占比,和/或虚拟扬声器编码标识;
根据传输通道属性信息确定虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比,包括:
当虚拟扬声器信号组的能量占比大于或等于预设的第一能量占比阈值,和/或虚拟扬声器编码标识为强占优时,按照预设的第一信号组比特分配算法确定虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比;
当虚拟扬声器信号组的能量占比大于或等于预设的第二能量占比阈值且小于预设的第一能量占比阈值,和/或虚拟扬声器编码标识为次占优时,按照预设的第二信号组比特分配算法确定虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比;其中,第二能量占比阈值小于第一能量占比阈值;
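When the energy ratio of the virtual speaker signal group is less than the preset second energy ratio threshold, or the virtual speaker coding flag is not dominant, the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group are determined according to a preset third signal group bit allocation algorithm.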
其中,本申请实施例中编码端可以预设多种信号组比特分配算法,在传输通道属性信息满足不同的条件下,可以使用不同的信号组比特分配算法,从而可以在传输通道属性信息满足一定的条件时为虚拟扬声器信号组和残差信号组分配与这种条件相适配的比特分配占比,因此能够提高编码端对三维音频信号的编码效率。
例如,第一能量占比阈值可以表示为TH1,第二能量占比阈值可以表示为TH3。
在本申请的一些实施例中,当虚拟扬声器信号组的能量占比大于或等于预设的第一能量占比阈值,和/或虚拟扬声器编码标识为强占优时,按照预设的第一信号组比特分配算法确定虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比,包括:
当满足directionalNrgRatio≥TH1,和/或,S≤TH0且η>TH2时,通过如下方式计算虚拟扬声器信号组的比特分配占比:
Ratio1_1=FAC1*directionalNrgRatio+(1–FAC1)*maxdirectionalNrgRatio;
其中,directionalNrgRatio表示虚拟扬声器信号组的能量占比,S为相异性声源数量,η表示虚拟扬声器编码效率,maxdirectionalNrgRatio为预设的最大虚拟扬声器信号组比特分配占比,FAC1为预设的第一调整因子,Ratio1_1为虚拟扬声器信号组的比特分配占比,*表示相乘运算,TH1为第一能量占比阈值,TH0为相异性声源数量阈值,TH2为第二虚拟扬声器编码效率阈值;
通过如下方式计算残差信号组的比特分配占比:
Ratio2=1-Ratio1_1;
其中,Ratio1_1为虚拟扬声器信号组的比特分配占比,Ratio2为残差信号组的比特分配占比。
通过上述Ratio1_1的计算流程可知,虚拟扬声器信号组的比特分配占比是增大的,因此编码端可以分配更多的比特给虚拟扬声器信号组。
传输通道信号包括虚拟扬声器信号组和残差信号组,在获取到虚拟扬声器信号组的比特分配占比Ratio1_1之后,可以通过上述Ratio2的计算公式得到残差信号组的比特分配占比。
需要说明的是,本申请实施例中,FAC1可以根据具体的应用场景灵活确定,此处不做限定。
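The first signal group bit allocation algorithm is a weighted blend of the measured energy ratio and the preset maximum ratio. A minimal sketch follows; the function name and the numeric values of maxdirectionalNrgRatio and FAC1 are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def first_algorithm(directional_nrg_ratio, max_directional_nrg_ratio, fac1):
    """Return (Ratio1_1, Ratio2) for the strongly dominant case."""
    ratio1_1 = (fac1 * directional_nrg_ratio
                + (1 - fac1) * max_directional_nrg_ratio)
    ratio2 = 1 - ratio1_1  # everything not given to the speaker group
    return ratio1_1, ratio2


# Hypothetical inputs: energy ratio 0.9375, preset maximum 0.96875, FAC1 = 0.5.
ratio1_1, ratio2 = first_algorithm(0.9375, 0.96875, 0.5)
print(ratio1_1, ratio2)
```

Ratio1_1 lands between the energy ratio and the preset maximum, so the speaker group gains bits whenever the preset maximum exceeds the measured energy ratio; a smaller FAC1 pulls the result closer to the maximum.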
在本申请的一些实施例中,获取虚拟扬声器信号组的比特分配占比之后,编码端执行的方法还包括:
通过如下方式对虚拟扬声器信号组的比特分配占比进行更新:
Ratio1_2=min(Ratio1_1,maxdirectionalNrgRatio+FAC2*Ratio1_1)
其中,Ratio1_2表示更新后的虚拟扬声器信号组的比特分配占比,FAC2为预设的第二调整因子,maxdirectionalNrgRatio为预设的最大虚拟扬声器信号组比特分配占比,Ratio1_1为更新前的虚拟扬声器信号组的比特分配占比,*表示相乘运算,min为取最小值运算。
需要说明的是,本申请实施例中,FAC2可以根据具体的应用场景灵活确定,此处不做限定。
通过上述Ratio1_2的计算流程可知,可以对虚拟扬声器信号组的比特分配占比进行安全限制,将Ratio1_2限制在安全比特范围内,从而使得编码端可以安全可用的进行虚拟扬声器信号组的比特分配。
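The update Ratio1_2 = min(Ratio1_1, maxdirectionalNrgRatio + FAC2*Ratio1_1) acts as an upper bound on the speaker group's share. The sketch below uses a hypothetical function name and hypothetical values.

```python
def limit_speaker_ratio(ratio1_1, max_directional_nrg_ratio, fac2):
    """Apply the safety limit to the virtual speaker group's bit allocation ratio."""
    upper_bound = max_directional_nrg_ratio + fac2 * ratio1_1
    return min(ratio1_1, upper_bound)


# Hypothetical values: preset maximum 0.8 and FAC2 = 0.1.
print(limit_speaker_ratio(0.95, 0.8, 0.1))  # the bound applies
print(limit_speaker_ratio(0.5, 0.8, 0.1))   # the ratio is already within the bound
```

With a non-negative FAC2, the bound only takes effect when Ratio1_1 exceeds the preset maximum by more than FAC2*Ratio1_1.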
在本申请的一些实施例中,当虚拟扬声器信号组的能量占比大于或等于预设的第二能量占比阈值且小于预设的第一能量占比阈值,和/或虚拟扬声器编码标识为次占优时,按照预设的第二信号组比特分配算法确定虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比;其中,第二能量占比阈值小于第一能量占比阈值,包括:
当满足TH3≤directionalNrgRatio<TH1,和/或,满足S≤TH0且TH4≤η≤TH2时,通过如下方式计算Ratio1_1:
Ratio1_1=FAC3*directionalNrgRatio+(1–FAC3)*maxdirectionalNrgRatio;
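where maxdirectionalNrgRatio is the preset maximum bit allocation ratio of the virtual speaker signal group, FAC3 is the preset third adjustment factor, directionalNrgRatio is the energy ratio of the virtual speaker signal group, S is the number of heterogeneous sound sources, η is the virtual speaker coding efficiency, Ratio1_1 is the bit allocation ratio of the virtual speaker signal group, * denotes multiplication, TH0 is the threshold for the number of heterogeneous sound sources, TH1 is the first energy ratio threshold, TH2 is the second virtual speaker coding efficiency threshold, TH3 is the second energy ratio threshold, and TH4 is the first virtual speaker coding efficiency threshold;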
通过如下方式计算残差信号组的比特分配占比:
Ratio2=1-Ratio1_1;
其中,Ratio1_1为虚拟扬声器信号组的比特分配占比,Ratio2为残差信号组的比特分配占比。
需要说明的是,本申请实施例中,FAC3可以根据具体的应用场景灵活确定,此处不做限定。例如,0≤FAC3≤0.5,FAC3>FAC1。
通过上述Ratio1_1的计算流程可知,虚拟扬声器信号组的比特分配占比是增大的,因此编码端可以分配更多的比特给虚拟扬声器信号组。
传输通道信号包括虚拟扬声器信号组和残差信号组,在获取到虚拟扬声器信号组的比特分配占比Ratio1_1之后,可以通过上述Ratio2的计算公式得到残差信号组的比特分配占比。
在本申请的一些实施例中,获取虚拟扬声器信号组的比特分配占比之后,本申请实施例提供的方法还包括:
通过如下方式对虚拟扬声器信号组的比特分配占比进行更新:
Ratio1_2=min(Ratio1_1,maxdirectionalNrgRatio+FAC4*Ratio1_1)。
其中,Ratio1_2表示更新后的虚拟扬声器信号组的比特分配占比,FAC4为预设的第四调整因子,maxdirectionalNrgRatio为预设的最大虚拟扬声器信号组比特分配占比,Ratio1_1为更新前的虚拟扬声器信号组的比特分配占比,*表示相乘运算,min为取最小值运算。
需要说明的是,本申请实施例中,FAC4可以根据具体的应用场景灵活确定,此处不做限定。
通过上述Ratio1_2的计算流程可知,可以对虚拟扬声器信号组的比特分配占比进行安全限制,将Ratio1_2限制在安全比特范围内,从而使得编码端可以安全可用的进行虚拟扬声器信号组的比特分配。
在本申请的一些实施例中,本申请实施例提供的方法还包括:
残差信号组为多个,通过如下方式计算第i个残差信号组的比特分配占比:
Ratio2_i=Ratio2*(R_i/C);
其中,R_i表示第i个残差信号组包括的传输通道个数,C为所有残差信号组的总传输通道个数,Ratio2_i为第i个残差信号组的比特分配占比,*表示相乘运算,Ratio2为所有残差信号组的比特分配占比。
当残差信号组为多个时,可以根据每个残差信号组的传输通道个数确定每个残差信号组的比特分配在所有残差信号组中的占比。例如R_i/C表示第i个残差信号组与所有残差信号组的传输通道比例,通过(R_i/C)和Ratio2可以获取第i个残差信号组的比特分配占比。
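Splitting the residual share by channel count, Ratio2_i = Ratio2*(R_i/C), can be sketched as follows. The function name and the sample values are illustrative assumptions.

```python
def residual_group_ratios(ratio2, channel_counts):
    """Split the total residual ratio across groups in proportion to channel count."""
    total_channels = sum(channel_counts)  # C
    return [ratio2 * count / total_channels for count in channel_counts]


# Hypothetical case: residual share 0.25, two residual groups with 4 and 5 channels.
ratios = residual_group_ratios(0.25, [4, 5])
print(ratios)
```

The per-group ratios always add up to Ratio2, so the split does not change the share already assigned to the virtual speaker signal group.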
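In some embodiments of this application, determining the bit allocation ratio of the virtual speaker signal group and the bit allocation ratio of the residual signal group according to the preset third signal group bit allocation algorithm, when the energy ratio of the virtual speaker signal group is less than the preset second energy ratio threshold or the virtual speaker coding flag is not dominant, includes: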
当满足directionalNrgRatio<TH3,或,满足S>TH0,或η<TH4时,通过如下方式计算虚拟扬声器信号组的比特分配占比:
Ratio1_1=directionalNrgRatio;
其中,directionalNrgRatio表示虚拟扬声器信号组的能量占比,Ratio1_1为虚拟扬声器信号组的比特分配占比,TH3为第二能量占比阈值,TH4为第一虚拟扬声器编码效率阈值,S为相异性声源数量,η表示虚拟扬声器编码效率,TH0为相异性声源数量阈值;
通过如下方式计算残差信号组的比特分配占比:
Ratio2_1=D/(F+D);
其中,Ratio2_1为残差信号组的比特分配占比,F表示虚拟扬声器信号组的能量表征值,D为残差信号组的能量表征值。
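As the Ratio1_1 calculation above shows, the bit allocation ratio of the virtual speaker signal group equals its energy ratio. When the bit allocation of the virtual speaker signal group is not dominant, the encoder side therefore does not allocate extra bits to that group, which keeps the encoder side's bit allocation reasonable.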
在本申请的一些实施例中,本申请实施例提供的方法还包括:
获取虚拟扬声器信号组的比特分配占比之后,通过如下方式对虚拟扬声器信号组的比特分配占比进行更新:
当Ratio1_1<groupBitsRatio1时,Ratio1_2=groupBitsRatio1;
当Ratio1_1≥groupBitsRatio1时,Ratio1_2=FAC5*groupBitsRatio1+(1–FAC5)*Ratio1_1;
其中,Ratio1_2表示更新后的虚拟扬声器信号组的比特分配占比,FAC5为预设的第五调整因子,Ratio1_1为更新前的虚拟扬声器信号组的比特分配占比,*表示相乘运算,groupBitsRatio1为预设的虚拟扬声器信号组比特分配占比;
获取残差信号组的比特分配占比之后,通过如下方式对残差信号组的比特分配占比进行更新:
当Ratio2_1<groupBitsRatio2时,Ratio2_2=groupBitsRatio2;
当Ratio2_1≥groupBitsRatio2时,Ratio2_2=FAC6*groupBitsRatio2+(1–FAC6)*Ratio2_1;
其中,Ratio2_2表示更新后的残差信号组的比特分配占比,FAC6为预设的第六调整因子,Ratio2_1为更新前的残差信号组的比特分配占比,*表示相乘运算,groupBitsRatio2为预设的残差信号组比特分配占比。
需要说明的是,本申请实施例中,FAC5可以根据具体的应用场景灵活确定,此处不做限定。
通过上述Ratio1_2的计算流程可知,可以对虚拟扬声器信号组的比特分配占比进行安全限制,将Ratio1_2限制在安全比特范围内,从而使得编码端可以安全可用的进行虚拟扬声器信号组的比特分配。
通过上述Ratio2_2的计算流程可知,可以对残差信号组的比特分配占比进行安全限制,将Ratio2_2限制在安全比特范围内,从而使得编码端可以安全可用的进行残差信号组的比特分配。
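The two-branch update used for both Ratio1_2 and Ratio2_2 has the same shape: a floor at the preset ratio, and otherwise a blend toward it. The sketch below uses one hypothetical helper for both groups; the preset ratio and adjustment factor values are assumptions.

```python
def apply_preset_limit(ratio, group_bits_ratio, fac):
    """Raise a ratio to the preset value, or blend it toward the preset value.

    ratio: Ratio1_1 or Ratio2_1 before the update
    group_bits_ratio: groupBitsRatio1 or groupBitsRatio2 (preset)
    fac: FAC5 or FAC6 (preset adjustment factor)
    """
    if ratio < group_bits_ratio:
        return group_bits_ratio
    return fac * group_bits_ratio + (1 - fac) * ratio


# Hypothetical preset ratio 0.25 and adjustment factor 0.75.
print(apply_preset_limit(0.1, 0.25, 0.75))  # below the preset: raised to 0.25
print(apply_preset_limit(0.5, 0.25, 0.75))  # above the preset: pulled toward 0.25
```

Note that the text above does not state whether the updated ratios are renormalized to sum to 1, so the sketch leaves them as computed.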
在本申请的一些实施例中,本申请实施例中编码端除了执行前述的方法之外,本申请实施例提供的方法还包括如下步骤:
根据虚拟扬声器信号组的比特分配占比、残差信号组的比特分配占比和总的传输通道比特数,分别确定虚拟扬声器信号组的比特数、残差信号组的比特数;
根据虚拟扬声器信号组的比特数对虚拟扬声器信号组进行比特分配,以及根据残差信号组的比特数对残差信号组进行比特分配。
其中,编码端在获取到虚拟扬声器信号组的比特分配占比、残差信号组的比特分配占比之后,编码端可以进行为虚拟扬声器信号组和残差信号组分别进行比特分配,以确定出虚拟扬声器信号组的比特分配结果和残差信号组的比特分配结果。例如,编码端获取到虚拟扬声器信号组的比特分配占比、残差信号组的比特分配占比,再结合总的传输通道比特数,分别确定虚拟扬声器信号组的比特数、残差信号组的比特数,虚拟扬声器信号组的比特数表示编码端可以为虚拟扬声器信号组分配的实际比特个数,残差信号组的比特数表示编码端可以为残差信号组分配的实际比特个数。最后编码端根据虚拟扬声器信号组的比特数对虚拟扬声器信号组进行比特分配,以及根据残差信号组的比特数对残差信号组进行比特分配,解决了编码端无法为虚拟扬声器信号和残差信号进行比特分配的问题。
进一步的,在本申请的一些实施例中,根据虚拟扬声器信号组的比特分配占比、残差信号组的比特分配占比和总的传输通道比特数,分别确定虚拟扬声器信号组的比特数、残差信号组的比特数,包括:
通过如下方式计算虚拟扬声器信号组的比特数:
F_bitnum=Ratio1*C_bitnum;
其中,F_bitnum为虚拟扬声器信号组的比特数,Ratio1为虚拟扬声器信号组的比特分配占比,C_bitnum为总的传输通道比特数;
通过如下方式计算残差信号组的比特数:
D_bitnum=Ratio2*C_bitnum;
其中,D_bitnum为残差信号组的比特数,Ratio2为残差信号组的比特分配占比,C_bitnum为总的传输通道比特数。
具体的,编码端可以预先确定总的传输通道比特数,对于总的传输通道比特数的取值不做限定,编码端可以通过上述计算公式计算出虚拟扬声器信号组的比特数和残差信号组的比特数,实现了编码端针对虚拟扬声器信号和残差信号的比特分配问题。
不限定的是,上述计算公式只是一种可实现的方式,不作为对本申请实施例的限定,例如通过上述公式计算出虚拟扬声器信号组的比特数和残差信号组的比特数,还可以通过预设的调整因子对虚拟扬声器信号组的比特数和残差信号组的比特数的取值进行调整,以得到最终的取值,对于上述计算过程,不做限定。
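The conversion from ratios to bit counts, F_bitnum = Ratio1*C_bitnum and D_bitnum = Ratio2*C_bitnum, can be sketched as below. Rounding down to whole bits is an assumption, since the formulas above do not specify how fractional bits are handled; the function name and values are also hypothetical.

```python
def group_bit_counts(ratio1, ratio2, c_bitnum):
    """Return (F_bitnum, D_bitnum) for the speaker and residual groups."""
    f_bitnum = int(ratio1 * c_bitnum)  # assumption: truncate to whole bits
    d_bitnum = int(ratio2 * c_bitnum)
    return f_bitnum, d_bitnum


# Hypothetical case: 7000 bits for the transmission channels, split 0.75 / 0.25.
f_bits, d_bits = group_bit_counts(0.75, 0.25, 7000)
print(f_bits, d_bits)
```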
在本申请的一些实施例中,编码端除了执行前述步骤,编码端执行的方法还可以包括如下步骤:
对传输通道信号、虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比进行编码,并写入码流。
其中,虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比可以被编码到码流中,编码端将该码流发送至解码端之后,从而解码端通过解析码流,解码端可以通过码流获取到虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比,解码端通过虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比可以获取到虚拟扬声器信号组分配的比特数和残差信号分配的比特数,从而可以对码流进行解码,以得到三维音频信号。
在本申请的一些实施例中,对传输通道信号、虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比进行编码,具体可以包括直接对传输通道信号进行编码,或者先对传输通道信号进行处理,在获取到虚拟扬声器信号和残差信号之后,对虚拟扬声器信号和残差信号进行编码,例如编码端具体可以是核心编码器,核心编码器对虚拟扬声器信号、残差信号和虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比进行编码,以得到码流。该码流也可以称为音频信号编码码流。
本申请实施例提供的三维音频信号的处理方法可以包括:音频编码方法和音频解码方法,其中,音频编码方法由音频编码装置执行,音频解码方法由音频解码装置执行,音频编码装置和音频解码装置之间可以进行通信。前述图4由音频编码装置执行,接下来介绍本申请实施例提供中音频解码装置(后续简称为解码端)执行的三维音频信号的处理方法,如图5所示,主要包括如下步骤:
501、接收码流。
其中,解码端接收来自编码端的码流。该码流中携带虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比。
502、解码码流以获得虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比。
解码端解析码流,从该码流中获得虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比,该虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比由编码端按照前述图4所示的实施例得到。
503、根据虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比对码流中的虚拟扬声器信号和残差信号进行解码,获得解码后的三维音频信号。
解码端获取到该虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比之后,解码端使用该虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比解析码流,得到解码后的三维音频信号,本申请实施例中对于码流中虚拟扬声器信号和残差信号的解码过程不做限定。本申请实施例中解码端可以通过虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比确定虚拟扬声器信号分配的比特数和残差信号分配的比特数,解码端采用与编码端的编码方式相对应的解码方式进行解码,从而得到编码端发送的三维音频信号,实现三维音频信号从编码端到解码端的传输。
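For example, the decoder side can determine the number of bits allocated to the virtual speaker signals and the number of bits allocated to the residual signals from the bit allocation ratios of the virtual speaker signal group and the residual signal group carried in the bitstream, which solves the problem that the decoder side cannot determine how many bits were allocated to each signal.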
在本申请的一些实施例中,步骤503根据虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比对码流中的虚拟扬声器信号和残差信号进行解码,包括:
根据所述码流确定可用比特数;
根据可用比特数和虚拟扬声器信号组的比特分配占比确定虚拟扬声器信号组的比特数;根据虚拟扬声器信号组的比特数对码流中的虚拟扬声器信号进行解码;
根据可用比特数和残差信号组的比特分配占比确定残差信号组的比特数;根据残差信号组的比特数对码流中的残差信号进行解码。
其中,解码端首先确定可用比特数,该可用比特数是能够分配给传输通道的总比特数。解码端通过解析码流可以得到虚拟扬声器信号组的比特分配占比,从而可以根据可用比特数和虚拟扬声器信号组的比特分配占比确定虚拟扬声器信号组的比特数,该虚拟扬声器信号组的比特数为编码端编码虚拟扬声器信号组时所使用的比特数,解码端也可以根据虚拟扬声器信号组的比特数对码流中的虚拟扬声器信号进行解码,从而解码端可以从码流中解码出虚拟扬声器信号。
同样的,解码端通过解析码流可以得到残差信号组的比特分配占比,从而可以根据可用比特数和残差信号组的比特分配占比确定残差信号组的比特数,该残差信号组的比特数为编码端编码残差信号组时所使用的比特数,解码端也可以根据残差信号组的比特数对码流中的残差信号进行解码,从而解码端可以从码流中解码出残差信号。
举例说明如下,在解码端执行的解码过程中,可以从码流中解析以下两个参数:groupBitsRatio和bitsRatio,其中,groupBitsRatio占用4比特,表示组间比特分配比例参数,组间比特分配比例参数包括:虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比。bitsRatio占用4比特,表示组内比特分配比例参数,组内比特分配比例参数包括:每个虚拟扬声器信号组在所有虚拟扬声器信号组内的比特分配占比,每个残差信号组在所有残差信号组内的比特分配占比。
例如,解码端可以包括比特分配模块,该比特分配模块的主要作用是根据码流中解码获得的比特分配比例参数,将去除其他边信息后的剩余可用比特数分配给各个传输通道,其中,其它边信息的编码也会占用比特数。
首先,需要计算当前帧扣除其他边信息后剩余的可用比特数,记为availableBits。计算availableBits的通用算法表示为如下方式:
availableBits=bitsPerFrame-bitsUsed;
其中,bitsPerFrame为每帧初始比特数,bitsUsed为比特分配前已占用的比特数。
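The HOA bit allocation HoaSplitBytesGroup() is calculated as follows.
First, the number of bits of each channel group, groupBytes, is calculated from the total number of available bits availableBits and groupBitsRatio, using the following formula:
[Formula image PCTCN2022096546-appb-000013, not reproduced in this text]
where the term shown in [formula image PCTCN2022096546-appb-000014, not reproduced in this text] can represent the bit allocation ratio of the virtual speaker signal group among all transmission channel signals, or the bit allocation ratio of the residual signal group among all transmission channel signals.
Then, the number of bits of each channel, bytesChannels, is calculated from bitsRatio, using the following formula:
[Formula image PCTCN2022096546-appb-000015, not reproduced in this text]
For example, when groupBytes represents the total number of bits allocated to the virtual speaker signal groups, the term shown in [formula image PCTCN2022096546-appb-000016, not reproduced in this text] represents the bit allocation ratio of each virtual speaker signal group within all virtual speaker signal groups, and bytesChannels represents the number of bits of each virtual speaker signal group.
As another example, when groupBytes represents the total number of bits allocated to the residual signal groups, the term shown in [formula image PCTCN2022096546-appb-000017, not reproduced in this text] represents the bit allocation ratio of each residual signal group within all residual signal groups, and bytesChannels represents the number of bits of each residual signal group.
Through the above process, the number of bits of each channel group can be calculated.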
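The formulas for groupBytes and bytesChannels are not reproduced in this text, so the following Python sketch is only one plausible reading, not the actual HoaSplitBytesGroup() procedure. It assumes that each 4-bit ratio parameter is an integer code and that a group's share is its code divided by the sum of all codes. The function name, the sample codes, and the value of bitsUsed are hypothetical.

```python
def split_bits(total_bits, ratio_codes):
    """Split total_bits in proportion to integer ratio codes (assumed normalization)."""
    total_ratio = sum(ratio_codes)
    if total_ratio == 0:
        raise ValueError("at least one ratio code must be non-zero")
    shares = [total_bits * code // total_ratio for code in ratio_codes]
    # Assumption: bits lost to integer rounding go to the last entry.
    shares[-1] += total_bits - sum(shares)
    return shares


bits_per_frame, bits_used = 7680, 480          # bitsUsed is a made-up value
available_bits = bits_per_frame - bits_used    # availableBits

# Inter-group split: one speaker group and two residual groups (hypothetical codes).
group_bits = split_bits(available_bits, [12, 2, 2])
# Intra-group split of the speaker group's bits across its two channels.
speaker_channel_bits = split_bits(group_bits[0], [8, 8])
print(available_bits, group_bits, speaker_channel_bits)
```

The same helper is reused for the inter-group and intra-group steps because both are proportional splits of a fixed bit budget.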
需要说明的是,解码端也可以和编码端类似的方法计算出虚拟扬声器信号组的比特分配占比,残差信号组的比特分配占比,例如采用前述Ratio1、Ratio2的计算流程,此处不再赘述。
为便于更好的理解和实施本申请实施例的上述方案,下面举例相应的应用场景来进行具体说明。
本申请实施例中以三维音频信号为HOA信号为例,本申请实施例提供一种虚拟扬声器信号和残差信号的比特分配方法,首先将虚拟扬声器信号和残差信号进行分组,然后根据信号特征和声场特征获取组间比特分配比值,最后实现通道比特分配。
本申请实施例的目的是得到传输通道信号的比特分配结果,传输通道信号由虚拟扬声器信号和残差信号组成。本申请实施例首先将传输通道信号分组,分为虚拟扬声器信号组和残差信号组。
根据信号特征和声场特征获取组间比特分配比值,进而通过总的比特数获得虚拟扬声器信号组比特数和残差信号组比特数。编码器在某一速率下编码时,每一帧被分配的总比特数是确定的,本申请实施例中在这一帧的可用比特数下对比特进行分配。例如,在固定速率编码模式下(constant bitrate,CBR),码率为384kbps,此时每一帧比特数约为7680比特,实际可用比特数小于7680比特,本申请实施例中可以对这小于7680比特进行分配。
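The figure of about 7680 bits per frame at 384 kbps corresponds to a frame duration of 20 ms, since 7680 / 384000 = 0.02 s. The frame duration is inferred from these two numbers and is not stated explicitly above; the function name is hypothetical.

```python
def bits_per_frame(bitrate_bps, frame_ms):
    """Bits available per frame in constant bitrate (CBR) mode."""
    return bitrate_bps * frame_ms // 1000


print(bits_per_frame(384_000, 20))
```

The bits actually available for the transmission channels are fewer than this total, because side information is encoded first.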
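When the virtual speaker coding efficiency is high, for example when the number of heterogeneous sound sources is less than or equal to the number of transmission channels carrying virtual speaker signals, the number of coding bits for the virtual speaker signals should be increased. This is achieved by raising the inter-group bit allocation ratio of the virtual speaker signal group.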
在上述计算方式中,虚拟扬声器信号的编码比特数、残差信号的编码比特数能够符合当前帧的声场分类的实际情况,解决了对当前帧进行编码时需要确定虚拟扬声器信号的编码比特数、残差信号的编码比特数的问题。
本申请实施例在核心编解码器中,接下来对核心编解码器的执行流程进行说明。
请参阅图6所示,以下给出具体实施步骤:
S1.待编码HOA信号经过HOA空间编码得到传输通道信号和属性信息。
其中,传输通道信号包括:虚拟扬声器信号和残差信号;
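The attribute information is the transmission channel attribute information described above and includes the sound field classification result and the virtual speaker coding efficiency η.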
在本申请的一些实施例中,声场分类结果包括相异性声源数量,或者声场分类结果包括相异性声源数量和声场类型;虚拟扬声器编码效率η表示当前帧采用虚拟扬声器重建HOA信号的效率。
接下来给出一种计算虚拟扬声器编码效率的方法:
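Calculate the energy representation value of each channel of the reconstructed HOA signal, R1, R2, …, Rt, where Rt = norm(SRt), norm() is the norm operation, SRt is the modified discrete cosine transform (MDCT) coefficients of the t-th channel of the reconstructed HOA signal, and t = (HOA order + 1)^2, the number of HOA channels.
Calculate the energy representation value of each channel of the original HOA signal, N1, N2, …, Nt, where Nt = norm(SNt), norm() is the norm operation, SNt is the MDCT coefficients of the t-th channel of the original HOA signal, and t = (HOA order + 1)^2.
The virtual speaker coding efficiency is η = sum(R)/sum(N), where sum(R) is the sum of R1 to Rt and sum(N) is the sum of N1 to Nt.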
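A per-channel energy representation value computed from MDCT coefficients can be sketched as follows. The text only says that norm() is a norm operation; using the L2 (Euclidean) norm here is an assumption, as are the function names and the sample coefficients.

```python
import math


def num_hoa_channels(hoa_order):
    """Number of channels of an HOA signal of the given order: (order + 1)^2."""
    return (hoa_order + 1) ** 2


def channel_energy(mdct_coeffs):
    """Energy representation value of one channel (assumed to be the L2 norm)."""
    return math.sqrt(sum(c * c for c in mdct_coeffs))


print(num_hoa_channels(3))
print(channel_energy([3.0, 4.0]))
```

For a third-order HOA signal this gives 16 channels, so R and N would each hold 16 energy values per frame.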
S2.获取传输通道分组比特分配占比。
首先,对传输通道信号进行分组,假设传输通道信号由M个虚拟扬声器信号和N个残差信号组成。进一步可以将N个残差信号分为K组,若M个虚拟扬声器信号分为1个组,因此传输通道被分为K+1组。每组通道数量可以相同也可以不同,每帧分组可以相同也可以不相同,均不影响本申请实施例后续流程。
后续以K等于2为例,不限定的是,K的取值还可以是3或者其它数值,此处不做限定。
以传输通道数量为11为例,其中虚拟扬声器信号组包含的虚拟扬声器数量等于2,残差信号组1包含残差信号数量等于4,残差信号组2包含残差信号数量等于5。
在步骤S2中,包括如下步骤S21至S23。
S21.计算每组能量表征值。
可以采用S1中的方法计算各个通道的能量表征值,然后将每组内的通道能量表征值相加得到每组能量表征值,例如虚拟扬声器信号组能量表征值为F,残差信号组1能量表征值为D1,残差信号组2能量表征值为D2。
S22.计算虚拟扬声器信号组能量占比directionalNrgRatio。
directionalNrgRatio=F/(F+D1+D2)。
S23.确定传输通道组间比特分配占比。
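The inter-group bit allocation ratios of the transmission channels are determined based on at least one of the energy ratio of the virtual speaker signal group, directionalNrgRatio, and the virtual speaker coding flag Flag. Let the bit allocation ratio of the virtual speaker signal group be Ratio1, that of residual signal group 1 be Ratio2, and that of residual signal group 2 be Ratio3. When directionalNrgRatio and/or the virtual speaker coding efficiency η indicates that the bit allocation of the virtual speaker signal group is dominant in the current frame, the bit allocation ratio of the virtual speaker signal group is increased and the bit allocation ratios of the residual signal groups are decreased. Different adjustment methods can be chosen under different preset conditions to increase the bit allocation ratio of the virtual speaker signal group.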
其中,判断条件包括扬声器信号组能量占比directionalNrgRatio,和/或虚拟扬声器编码标识Flag。
其中,虚拟扬声器编码标识Flag通过以下方法获取:
当满足相异性声源数量≤TH0且虚拟扬声器编码效率η>TH2时,Flag=强占优(High)。
当满足相异性声源数量≤TH0且虚拟扬声器编码效率TH4≤η≤TH2时,Flag=次占优(Middle)。否则,Flag=不占优(Low)。
接下来对上述判断条件进行举例说明,例如判断条件可以包括如下条件1至条件6。
条件1:当满足directionalNrgRatio≥TH1时,0.9≤TH1≤1,例如TH1=0.9375。
首先,计算虚拟扬声器信号组比特分配占比Ratio1:
Ratio1=FAC1*directionalNrgRatio+(1–FAC1)*maxdirectionalNrgRatio。
其中,maxdirectionalNrgRatio为预设最大虚拟扬声器信号组比特分配占比,FAC1为预设的第一调整因子,0≤FAC1≤0.5。
可选的,给Ratio1限制安全比特,例如:
Ratio1=min(Ratio1,maxdirectionalNrgRatio+FAC2*Ratio1)。
其中,FAC2为预设的第二调整因子,0≤FAC2≤0.5。
然后,计算残差信号组1比特分配占比Ratio2,残差信号组2比特分配占比Ratio3:
Ratio2=(1-Ratio1)*残差信号组1通道个数/(残差信号组1通道个数+残差信号组2通道个数);
Ratio3=(1-Ratio1)*残差信号组2通道个数/(残差信号组1通道个数+残差信号组2通道个数)。
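Condition 2: the number of heterogeneous sound sources ≤ TH0 and the virtual speaker coding efficiency η > TH2, that is, Flag = High. TH0 is the number of virtual speakers matched by the codec, or the number of virtual speaker signals in the codec, for example TH0 = 2. 0.8 ≤ TH2 ≤ 1, for example TH2 = 0.875. In this case the bit allocation of the virtual speaker signal group can be considered strongly dominant, and the inter-group bit allocation ratios of the transmission channels are adjusted as follows: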
计算Ratio1,Ratio2,Ratio3步骤与条件1相同。
条件3:当满足TH3≤directionalNrgRatio<TH1时,0.5≤TH3<0.9,例如TH3=0.75。
首先,计算虚拟扬声器信号组比特分配占比Ratio1:
Ratio1=FAC3*directionalNrgRatio+(1–FAC3)*maxdirectionalNrgRatio。
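where maxdirectionalNrgRatio is the preset maximum bit allocation ratio of the virtual speaker signal group, and FAC3 is the preset third adjustment factor, with 0 ≤ FAC3 ≤ 0.5 and FAC3 > FAC1.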
可选的,给Ratio1限制安全比特,例如:
Ratio1=min(Ratio1,maxdirectionalNrgRatio+FAC4*Ratio1)。
其中,FAC4为预设的第四调整因子,0≤FAC4≤0.5,FAC4<FAC2;
然后,计算残差信号组1比特分配占比Ratio2,残差信号组2比特分配占比Ratio3:
Ratio2=(1-Ratio1)*残差信号组1通道个数/(残差信号组1通道个数+残差信号组2通道个数);
Ratio3=(1-Ratio1)*残差信号组2通道个数/(残差信号组1通道个数+残差信号组2通道个数)。
条件4:当满足相异性声源数量≤TH0且虚拟扬声器编码效率TH4≤η≤TH2时,即Flag=Middle时,0.5≤TH4<0.8,例如TH4=0.6875。可以认为虚拟扬声器信号组比特分配略占优,此时对传输通道组间比特分配占比进行如下调整:
计算Ratio1,Ratio2,Ratio3步骤与条件3相同。
条件5:当满足directionalNrgRatio<TH3时,可以认为残差组比特分配占优,此时对传输通道组间比特分配占比进行如下调整:
Ratio1=directionalNrgRatio。
Ratio2=D1/(F+D1+D2)。
Ratio3=D2/(F+D1+D2)。
可选的,给Ratio1,Ratio2,Ratio3限制安全比特,例如:
当Ratio1<groupBitsRatio1时,Ratio1=groupBitsRatio1;
当Ratio1≥groupBitsRatio1时,Ratio1=FAC5*groupBitsRatio1+(1–FAC5)*Ratio1;
当Ratio2<groupBitsRatio2时,Ratio2=groupBitsRatio2;
当Ratio2≥groupBitsRatio2时,Ratio2=FAC6*groupBitsRatio2+(1–FAC6)*Ratio2;
当Ratio3<groupBitsRatio3时,Ratio3=groupBitsRatio3;
当Ratio3≥groupBitsRatio3时,Ratio3=FAC7*groupBitsRatio3+(1–FAC7)*Ratio3;
其中,groupBitsRatio1,groupBitsRatio2,groupBitsRatio3分别为预设虚拟扬声器信号组比特分配占比,预设残差信号组1比特分配占比,预设残差信号组2比特分配占比,FAC5为预设的第五调整因子,0.5<FAC5≤1,FAC6为预设的第六调整因子,0.5<FAC6≤1,FAC7为预设的第七调整因子,0.5<FAC7≤1,FAC5、FAC6、FAC7可以相等也可以不相等。
条件6:当满足相异性声源数量>TH0,或,虚拟扬声器编码效率η<TH4时,即Flag=Low时,可以认为残差组比特分配占优,此时对传输通道组间比特分配占比进行如下调整:
计算Ratio1,Ratio2,Ratio3步骤与条件5相同。
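Conditions 1 to 6 can be combined into one dispatch function, sketched below for one virtual speaker signal group and two residual signal groups. Several points are assumptions: the order in which conflicting conditions are checked is not specified above, so the sketch tests the strongly dominant case first; the optional floor step of condition 5 is omitted; and the value of maxdirectionalNrgRatio and all adjustment factors are hypothetical, chosen to respect FAC3 > FAC1 and FAC4 < FAC2.

```python
def inter_group_ratios(f, d1, d2, n1, n2, flag, params):
    """Return (Ratio1, Ratio2, Ratio3) for the speaker group and two residual groups.

    f, d1, d2: group energy representation values F, D1, D2
    n1, n2: channel counts of residual groups 1 and 2
    flag: "High", "Middle", or "Low"
    """
    total = f + d1 + d2
    if total <= 0:
        raise ValueError("group energies must sum to a positive value")
    nrg_ratio = f / total  # directionalNrgRatio from step S22

    if nrg_ratio >= params["TH1"] or flag == "High":      # conditions 1 and 2
        fac, cap_fac = params["FAC1"], params["FAC2"]
    elif nrg_ratio >= params["TH3"] or flag == "Middle":  # conditions 3 and 4
        fac, cap_fac = params["FAC3"], params["FAC4"]
    else:                                                  # conditions 5 and 6
        return nrg_ratio, d1 / total, d2 / total

    max_ratio = params["MAX"]  # maxdirectionalNrgRatio
    ratio1 = fac * nrg_ratio + (1 - fac) * max_ratio
    ratio1 = min(ratio1, max_ratio + cap_fac * ratio1)     # optional safety limit
    rest = 1 - ratio1
    return ratio1, rest * n1 / (n1 + n2), rest * n2 / (n1 + n2)


PARAMS = {"TH1": 0.9375, "TH3": 0.75, "MAX": 0.96875,
          "FAC1": 0.25, "FAC2": 0.5, "FAC3": 0.5, "FAC4": 0.25}

strong = inter_group_ratios(15.0, 0.5, 0.5, 4, 5, "High", PARAMS)
low = inter_group_ratios(2.0, 1.0, 1.0, 4, 5, "Low", PARAMS)
print(strong, low)
```

In the dominant branches the residual share is divided by channel count, while in the non-dominant branch each group simply receives its energy share.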
在获取到上述Ratio1,Ratio2,Ratio3之后,可以将Ratio1,Ratio2,Ratio3量化后写入码流。
S3.对传输通道信号下混。
传输通道信号下混的具体过程不再说明,将原始通道信号采用下混算法计算得到下混通道,再进行比特分配。本步骤S3为可选步骤,且步骤S3的执行顺序可以在步骤S2之前,或者步骤S2之后。
S4.对传输通道信号进行比特分配。
首先,由步骤S2中的组间比特分配占比和总的可用比特数确定各组比特数,例如:
虚拟扬声器信号组比特数=Ratio1*总的可用比特数。
残差信号组1比特数=Ratio2*总的可用比特数。
残差信号组2比特数=Ratio3*总的可用比特数。
然后,确定各个通道比特数,可以有多种实现方式,例如根据各个通道能量占比进行比特分配。
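Allocating a group's bits to its channels by energy share, one of the possible implementations mentioned above, can be sketched as follows. The function name, the rounding rule, the equal split for a silent group, and the sample values are all assumptions.

```python
def allocate_channel_bits(group_bits, channel_energies):
    """Distribute a group's bits across its channels in proportion to energy."""
    if not channel_energies:
        return []
    total = sum(channel_energies)
    n = len(channel_energies)
    if total == 0:
        bits = [group_bits // n] * n  # assumption: equal split for a silent group
    else:
        bits = [int(group_bits * e / total) for e in channel_energies]
    # Assumption: bits lost to rounding are given to the first channel.
    bits[0] += group_bits - sum(bits)
    return bits


# Hypothetical speaker group: 5400 bits, channel energies 6.0 and 3.0.
print(allocate_channel_bits(5400, [6.0, 3.0]))
```

Returning the rounding remainder to one channel keeps the per-channel total equal to the group budget.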
接下来对解码端执行的信号解码流程进行说明。
解码端接收编码端发送的码流,然后从码流中解析Ratio1,Ratio2,Ratio3,然后可以对传输通道信号进行比特分配,例如对传输通道信号进行比特分配可以是前述步骤S4中得到各个通道比特数的方法。
通过前述的举例说明,本申请实施例编码端可以将传输通道分组,根据虚拟扬声器信号组能量,相异性声源数量和重建HOA信号判断分组比特分配占比。本申请实施例中通过上述多种条件可以实现组间分配占比调整。因此本申请实施例中可以有效提高传输通道比特分配效率。
本申请实施例中对于解码端执行的解码流程不再详细说明。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
为便于更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的相关装置。
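Referring to FIG. 7, an embodiment of this application provides a three-dimensional audio signal processing apparatus. For example, the apparatus is specifically an audio encoding apparatus 700, which may include an encoding module 701 and a bit allocation ratio determining module 702, where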
编码模块,用于对待编码的三维音频信号进行空间编码,以得到传输通道信号和传输通道属性信息,其中,所述传输通道信号包括:至少一个虚拟扬声器信号组和至少一个残差信号组;
比特分配占比确定模块,用于根据所述传输通道属性信息确定所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比。
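Referring to FIG. 8, an embodiment of this application provides a three-dimensional audio signal processing apparatus. For example, the apparatus is specifically an audio decoding apparatus 800, which may include a receiving module 801, a decoding module 802, and a signal generation module 803, where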
接收模块,用于接收码流;
解码模块,用于解码所述码流以获得虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比;
信号生成模块,用于根据所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比对所述码流中的虚拟扬声器信号和残差信号进行解码,获得解码后的三维音频信号。
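It should be noted that the information exchange and execution processes among the modules/units of the above apparatus are based on the same concept as the method embodiments of this application and bring the same technical effects. For details, refer to the description in the foregoing method embodiments; they are not repeated here.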
本申请实施例还提供一种计算机存储介质,其中,该计算机存储介质存储有程序,该程序执行包括上述方法实施例中记载的部分或全部步骤。
接下来介绍本申请实施例提供的另一种音频编码装置,请参阅图9所示,音频编码装置900包括:
接收器901、发射器902、处理器903和存储器904(其中音频编码装置900中的处理器903的数量可以一个或多个,图9中以一个处理器为例)。在本申请的一些实施例中,接收器901、发射器902、处理器903和存储器904可通过总线或其它方式连接,其中,图9中以通过总线连接为例。
存储器904可以包括只读存储器和随机存取存储器,并向处理器903提供指令和数据。存储器904的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器904存储有操作系统和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。操作系统可包括各种系统程序,用于实现各种基础业务以及处理基于硬件的任务。
处理器903控制音频编码装置的操作,处理器903还可以称为中央处理单元(central processing unit,CPU)。具体的应用中,音频编码装置的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
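The methods disclosed in the foregoing embodiments of this application may be applied to, or implemented by, the processor 903. The processor 903 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing methods may be completed by integrated logic circuits in hardware in the processor 903 or by instructions in the form of software. The processor 903 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed with reference to the embodiments of this application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 904; the processor 903 reads the information in the memory 904 and completes the steps of the foregoing methods in combination with its hardware.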
接收器901可用于接收输入的数字或字符信息,以及产生与音频编码装置的相关设置以及功能控制有关的信号输入,发射器902可包括显示屏等显示设备,发射器902可用于通过外接接口输出数字或字符信息。
本申请实施例中,处理器903用于执行前述实施例图4所示的由音频编码装置执行的方法。
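Next, another audio decoding apparatus provided in an embodiment of this application is described. Referring to FIG. 10, the audio decoding apparatus 1000 includes: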
接收器1001、发射器1002、处理器1003和存储器1004(其中音频解码装置1000中的处理器1003的数量可以一个或多个,图10中以一个处理器为例)。在本申请的一些实施例中,接收器1001、发射器1002、处理器1003和存储器1004可通过总线或其它方式连接,其中,图10中以通过总线连接为例。
存储器1004可以包括只读存储器和随机存取存储器,并向处理器1003提供指令和数据。存储器1004的一部分还可以包括NVRAM。存储器1004存储有操作系统和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。操作系统可包括各种系统程序,用于实现各种基础业务以及处理基于硬件的任务。
处理器1003控制音频解码装置的操作,处理器1003还可以称为CPU。具体的应用中,音频解码装置的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器1003中,或者由处理器1003实现。处理器1003可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1003中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1003可以是通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1004,处理器1003读取存储器1004中的信息,结合其硬件完成上述方法的步骤。
本申请实施例中,处理器1003,用于执行前述实施例图5所示的由音频解码装置执行的方法。
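In another possible design, when the audio encoding apparatus or the audio decoding apparatus is a chip in a terminal, the chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute computer-executable instructions stored in a storage unit, so that the chip in the terminal performs the audio encoding method of any implementation of the first aspect or the audio decoding method of any implementation of the second aspect. Optionally, the storage unit is a storage unit inside the chip, such as a register or a cache; the storage unit may also be a storage unit in the terminal located outside the chip, such as a read-only memory (ROM), another type of static storage device that can store static information and instructions, or a random access memory (RAM).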
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述第一方面或第二方面方法的程序执行的集成电路。
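It should also be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in this application, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines.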
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (27)

  1. 一种三维音频信号的处理方法,其特征在于,包括:
    对待编码的三维音频信号进行空间编码,以得到传输通道信号和传输通道属性信息,其中,所述传输通道信号包括:至少一个虚拟扬声器信号组和至少一个残差信号组;
    根据所述传输通道属性信息确定所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比。
  2. 根据权利要求1所述的方法,其特征在于,所述传输通道属性信息包括:虚拟扬声器编码效率;
    所述对待编码的三维音频信号进行空间编码,以得到传输通道属性信息,包括:
    采用虚拟扬声器对所述待编码的三维音频信号进行信号重建,以得到重建后的三维音频信号;
    获取所述重建后的三维音频信号的能量表征值,以及所述待编码的三维音频信号的能量表征值;
    根据所述重建后的三维音频信号的能量表征值,以及所述待编码的三维音频信号的能量表征值,获取所述虚拟扬声器编码效率。
  3. 根据权利要求1或2所述的方法,其特征在于,所述传输通道属性信息包括:所述虚拟扬声器信号组的能量占比;
    所述方法还包括:
    根据所述虚拟扬声器信号组中每个虚拟扬声器信号的能量表征值获取所述虚拟扬声器信号组的能量表征值;
    根据所述残差信号组中每个残差信号的能量表征值获取所述残差信号组的能量表征值;
    根据所述虚拟扬声器信号组的能量表征值和所述残差信号组的能量表征值,获取所述虚拟扬声器信号组的能量占比。
  4. 根据权利要求1所述的方法,其特征在于,所述传输通道属性信息包括:虚拟扬声器编码标识,所述虚拟扬声器编码标识用于指示所述虚拟扬声器信号组的比特分配是否占优;
    所述对待编码的三维音频信号进行空间编码,以得到传输通道属性信息,包括:
    所述对待编码的三维音频信号进行空间编码,以得到所述传输通道信号的相异性声源数量和虚拟扬声器编码效率;
    根据所述传输通道信号的相异性声源数量和所述虚拟扬声器编码效率获取所述虚拟扬声器编码标识。
  5. 根据权利要求4所述的方法,其特征在于,所述根据所述传输通道信号的相异性声源数量和所述虚拟扬声器编码效率获取所述虚拟扬声器编码标识,包括:
    当所述传输通道信号的相异性声源数量小于或等于预设的相异性声源数量阈值,且所述虚拟扬声器编码效率大于或等于预设的第一虚拟扬声器编码效率阈值时,确定所述虚拟扬声器编码标识为占优;或
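    when the number of heterogeneous sound sources in the transmission channel signal is greater than the preset threshold for the number of heterogeneous sound sources, or the virtual speaker coding efficiency is less than the preset first virtual speaker coding efficiency threshold, determining that the virtual speaker coding flag is not dominant.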
  6. 根据权利要求5所述的方法,其特征在于,所述占优包括次占优或强占优;
    所述确定所述虚拟扬声器编码标识为占优,包括:
    当所述虚拟扬声器编码效率大于或等于所述第一虚拟扬声器编码效率阈值、且所述虚拟扬声器编码效率小于或等于预设的第二虚拟扬声器编码效率阈值时,确定所述虚拟扬声器编码标识为次占优;或
    当所述虚拟扬声器编码效率大于或等于所述第一虚拟扬声器编码效率阈值、且所述虚拟扬声器编码效率大于预设的第二虚拟扬声器编码效率阈值时,确定所述虚拟扬声器编码标识为强占优;
    其中,所述第二虚拟扬声器编码效率阈值大于所述第一虚拟扬声器编码效率阈值。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述传输通道属性信息包括:所述虚拟扬声器信号组的能量占比,和/或虚拟扬声器编码标识;
    所述根据所述传输通道属性信息确定所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比,包括:
    当所述虚拟扬声器信号组的能量占比大于或等于预设的第一能量占比阈值,和/或所述虚拟扬声器编码标识为强占优时,按照预设的第一信号组比特分配算法确定所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比;
    当所述虚拟扬声器信号组的能量占比大于或等于预设的第二能量占比阈值且小于预设的第一能量占比阈值,和/或所述虚拟扬声器编码标识为次占优时,按照预设的第二信号组比特分配算法确定所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比;其中,所述第二能量占比阈值小于所述第一能量占比阈值;或
    当所述虚拟扬声器信号组的能量占比小于预设的第一能量占比阈值,或所述虚拟扬声器编码标识为不占优时,按照预设的第三信号组比特分配算法确定所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比。
  8. 根据权利要求7所述的方法,其特征在于,所述当所述虚拟扬声器信号组的能量占比大于或等于预设的第一能量占比阈值,和/或所述虚拟扬声器编码标识为强占优时,按照预设的第一信号组比特分配算法确定所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比,包括:
    当满足directionalNrgRatio≥TH1,和/或,S≤TH0且η>TH2时,通过如下方式计算所述虚拟扬声器信号组的比特分配占比:
    Ratio1_1=FAC1*directionalNrgRatio+(1–FAC1)*maxdirectionalNrgRatio;
    其中,所述directionalNrgRatio表示所述虚拟扬声器信号组的能量占比,所述S为所述相异性声源数量,所述η表示所述虚拟扬声器编码效率,所述maxdirectionalNrgRatio为预设的最大虚拟扬声器信号组比特分配占比,所述FAC1为预设的第一调整因子,所述Ratio1_1为所述虚拟扬声器信号组的比特分配占比,所述*表示相乘运算,所述TH1为所述第一能量占比阈值,所述TH0为所述相异性声源数量阈值,所述TH2为所述第二虚拟扬声器编码效率阈值;
    通过如下方式计算所述残差信号组的比特分配占比:
    Ratio2=1-Ratio1_1;
    其中,所述Ratio1_1为所述虚拟扬声器信号组的比特分配占比,所述Ratio2为所述残差信号组的比特分配占比。
  9. 根据权利要求8所述的方法,其特征在于,获取所述虚拟扬声器信号组的比特分配占比之后,所述方法还包括:
    通过如下方式对所述虚拟扬声器信号组的比特分配占比进行更新:
    Ratio1_2=min(Ratio1_1,maxdirectionalNrgRatio+FAC2*Ratio1_1)
    其中,所述Ratio1_2表示更新后的虚拟扬声器信号组的比特分配占比,所述FAC2为预设的第二调整因子,所述maxdirectionalNrgRatio为预设的最大虚拟扬声器信号组比特分配占比,所述Ratio1_1为更新前的虚拟扬声器信号组的比特分配占比,所述*表示相乘运算,所述min为取最小值运算。
  10. 根据权利要求7所述的方法,其特征在于,所述当所述虚拟扬声器信号组的能量占比大于或等于预设的第二能量占比阈值且小于预设的第一能量占比阈值,和/或所述虚拟扬声器编码标识为次占优时,按照预设的第二信号组比特分配算法确定所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比;其中,所述第二能量占比阈值小于所述第一能量占比阈值,包括:
    当满足TH3≤directionalNrgRatio<TH1,和/或,满足S≤TH0且TH4≤η≤TH2时,通过如下方式计算Ratio1_1:
    Ratio1_1=FAC3*directionalNrgRatio+(1–FAC3)*maxdirectionalNrgRatio;
    其中,所述maxdirectionalNrgRatio为预设虚拟扬声器信号组比特分配占比,所述FAC3为预设的第三调整因子,所述directionalNrgRatio表示所述虚拟扬声器信号组的能量占比,所述S为所述相异性声源数量,所述η表示所述虚拟扬声器编码效率,所述Ratio1_1为所述虚拟扬声器信号组的比特分配占比,所述*表示相乘运算,所述TH0为所述相异性声源数量阈值,所述TH1为所述第一能量占比阈值,所述TH2为所述第二虚拟扬声器编码效率阈值,所述TH3为所述第二能量占比阈值,所述TH4为所述第一虚拟扬声器编码效率阈值;
    通过如下方式计算所述残差信号组的比特分配占比:
    Ratio2=1-Ratio1_1;
    其中,所述Ratio1_1为所述虚拟扬声器信号组的比特分配占比,所述Ratio2为所述残差信号组的比特分配占比。
  11. 根据权利要求10所述的方法,其特征在于,获取所述虚拟扬声器信号组的比特分配占比之后,所述方法还包括:
    通过如下方式对所述虚拟扬声器信号组的比特分配占比进行更新:
    Ratio1_2=min(Ratio1_1,maxdirectionalNrgRatio+FAC4*Ratio1_1)
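    where Ratio1_2 is the updated bit allocation ratio of the virtual speaker signal group, FAC4 is a preset fourth adjustment factor, maxdirectionalNrgRatio is the preset maximum bit allocation ratio of the virtual speaker signal group, Ratio1_1 is the bit allocation ratio of the virtual speaker signal group before the update, * denotes multiplication, and min is the minimum operation.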
  12. 根据权利要求8至11中任一项所述的方法,其特征在于,所述方法还包括:
    所述残差信号组为多个,通过如下方式计算第i个残差信号组的比特分配占比:
    Ratio2_i=Ratio2*(R_i/C);
    其中,所述R_i表示第i个残差信号组包括的传输通道个数,所述C为所有残差信号组的总传输通道个数,所述Ratio2_i为所述第i个残差信号组的比特分配占比,所述*表示相乘运算,所述Ratio2为所有残差信号组的比特分配占比。
  13. 根据权利要求7所述的方法,其特征在于,所述当所述虚拟扬声器信号组的能量占比小于预设的第一能量占比阈值,或所述虚拟扬声器编码标识为不占优时,按照预设的第三信号组比特分配算法确定所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比,包括:
    当满足directionalNrgRatio<TH3,或,满足S>TH0,或η<TH4时,通过如下方式计算所述虚拟扬声器信号组的比特分配占比:
    Ratio1_1=directionalNrgRatio;
    其中,所述directionalNrgRatio表示所述虚拟扬声器信号组的能量占比,所述Ratio1_1为所述虚拟扬声器信号组的比特分配占比,所述TH3为所述第二能量占比阈值,所述TH4为所述第一虚拟扬声器编码效率阈值,所述S为所述相异性声源数量,所述η表示所述虚拟扬声器编码效率,所述TH0为所述相异性声源数量阈值;
    通过如下方式计算所述残差信号组的比特分配占比:
    Ratio2_1=D/(F+D);
    其中,所述Ratio2_1为所述残差信号组的比特分配占比,所述F表示所述虚拟扬声器信号组的能量表征值,所述D为所述残差信号组的能量表征值。
  14. 根据权利要求13所述的方法,其特征在于,所述方法还包括:
    获取所述虚拟扬声器信号组的比特分配占比之后,通过如下方式对所述虚拟扬声器信号组的比特分配占比进行更新:
    当Ratio1_1<groupBitsRatio1时,Ratio1_2=groupBitsRatio1;
    当Ratio1_1≥groupBitsRatio1时,Ratio1_2=FAC5*groupBitsRatio1+(1–FAC5)*Ratio1_1;
    其中,所述Ratio1_2表示更新后的虚拟扬声器信号组的比特分配占比,所述FAC5为预设的第五调整因子,所述Ratio1_1为更新前的虚拟扬声器信号组的比特分配占比,所述*表示相乘运算,所述groupBitsRatio1为预设的虚拟扬声器信号组比特分配占比;
    获取所述残差信号组的比特分配占比之后,通过如下方式对所述残差信号组的比特分配占比进行更新:
    当Ratio2_1<groupBitsRatio2时,Ratio2_2=groupBitsRatio2;
    当Ratio2_1≥groupBitsRatio2时,Ratio2_2=FAC6*groupBitsRatio2+(1–FAC6)*Ratio2_1;
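    where Ratio2_2 is the updated bit allocation ratio of the residual signal group, FAC6 is a preset sixth adjustment factor, Ratio2_1 is the bit allocation ratio of the residual signal group before the update, * denotes multiplication, and groupBitsRatio2 is the preset bit allocation ratio of the residual signal group.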
  15. 根据权利要求1至14中任一项所述的方法,其特征在于,所述方法还包括:
    根据所述虚拟扬声器信号组的比特分配占比、所述残差信号组的比特分配占比和总的传输通道比特数,分别确定所述虚拟扬声器信号组的比特数、所述残差信号组的比特数;
    根据所述虚拟扬声器信号组的比特数对所述虚拟扬声器信号组进行比特分配,以及根据所述残差信号组的比特数对所述残差信号组进行比特分配。
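  16. The method according to claim 15, wherein determining the number of bits of the virtual speaker signal group and the number of bits of the residual signal group based on the bit allocation ratio of the virtual speaker signal group, the bit allocation ratio of the residual signal group, and the total number of transmission channel bits comprises: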
    通过如下方式计算虚拟扬声器信号组的比特数:
    F_bitnum=Ratio1*C_bitnum;
    其中,所述F_bitnum为所述虚拟扬声器信号组的比特数,所述Ratio1为所述虚拟扬声器信号组的比特分配占比,所述C_bitnum为总的传输通道比特数;
    通过如下方式计算所述残差信号组的比特数:
    D_bitnum=Ratio2*C_bitnum;
    其中,所述D_bitnum为所述残差信号组的比特数,所述Ratio2为所述残差信号组的比特分配占比,所述C_bitnum为总的传输通道比特数。
  17. 根据权利要求1至16中任一项所述的方法,其特征在于,所述方法还包括:
    对所述传输通道信号、所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比进行编码,并写入码流。
  18. 一种三维音频信号的处理方法,其特征在于,包括:
    接收码流;
    解码所述码流以获得虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比;
    根据所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比对所述码流中的虚拟扬声器信号和残差信号进行解码,获得解码后的三维音频信号。
  19. 根据权利要求18所述的方法,其特征在于,所述根据所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比对所述码流中的虚拟扬声器信号和残差信号进行解码,包括:
    根据所述码流确定可用比特数;
    根据所述可用比特数和所述虚拟扬声器信号组的比特分配占比确定所述虚拟扬声器信号组的比特数;根据所述虚拟扬声器信号组的比特数对所述码流中的虚拟扬声器信号进行解码;
    根据所述可用比特数和所述残差信号组的比特分配占比确定所述残差信号组的比特数;根据所述残差信号组的比特数对所述码流中的残差信号进行解码。
  20. 一种三维音频信号的处理装置,其特征在于,包括:
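    an encoding module, configured to perform spatial encoding on a three-dimensional audio signal to be encoded, to obtain a transmission channel signal and transmission channel attribute information, wherein the transmission channel signal includes at least one virtual speaker signal group and at least one residual signal group;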
    比特分配占比确定模块,用于根据所述传输通道属性信息确定所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比。
  21. 一种三维音频信号的处理装置,其特征在于,包括:
    接收模块,用于接收码流;
    解码模块,用于解码所述码流以获得虚拟扬声器信号组的比特分配占比和残差信号组的比特分配占比;
    信号生成模块,用于根据所述虚拟扬声器信号组的比特分配占比和所述残差信号组的比特分配占比对所述码流中的虚拟扬声器信号和残差信号进行解码,获得解码后的三维音频信号。
  22. 一种三维音频信号的处理装置,其特征在于,所述三维音频信号的处理装置包括至少一个处理器,所述至少一个处理器用于与存储器耦合,读取并执行所述存储器中的指令,以实现如权利要求1至17中任一项所述的方法。
  23. 根据权利要求22所述的三维音频信号的处理装置,其特征在于,所述三维音频信号的处理装置还包括:所述存储器。
  24. 一种三维音频信号的处理装置,其特征在于,所述三维音频信号的处理装置包括至少一个处理器,所述至少一个处理器用于与存储器耦合,读取并执行所述存储器中的指令,以实现如权利要求18至19中任一项所述的方法。
  25. 根据权利要求24所述的三维音频信号的处理装置,其特征在于,所述音频解码装置还包括:所述存储器。
  26. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如权利要求1至17、或者18至19中任意一项所述的方法。
  27. 一种计算机可读存储介质,包括如权利要求1至17任一项所述的方法所生成的码流。
PCT/CN2022/096546 2021-06-11 2022-06-01 一种三维音频信号的处理方法和装置 WO2022257824A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP22819422.1A EP4354430A4 (en) 2021-06-11 2022-06-01 METHOD AND DEVICE FOR PROCESSING THREE-DIMENSIONAL AUDIO SIGNALS
KR1020237044825A KR20240013221A (ko) 2021-06-11 2022-06-01 3차원 오디오 신호 처리 방법 및 장치
US18/532,085 US20240112684A1 (en) 2021-06-11 2023-12-07 Three-dimensional audio signal processing method and apparatus

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110657283 2021-06-11
CN202110657283.7 2021-06-11
CN202110700570.1A CN115472170A (zh) 2021-06-11 2021-06-23 一种三维音频信号的处理方法和装置
CN202110700570.1 2021-06-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/532,085 Continuation US20240112684A1 (en) 2021-06-11 2023-12-07 Three-dimensional audio signal processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2022257824A1 true WO2022257824A1 (zh) 2022-12-15

Family

ID=84363426

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/096546 WO2022257824A1 (zh) 2021-06-11 2022-06-01 Three-dimensional audio signal processing method and apparatus

Country Status (5)

Country Link
US (1) US20240112684A1 (zh)
EP (1) EP4354430A4 (zh)
KR (1) KR20240013221A (zh)
CN (1) CN115472170A (zh)
WO (1) WO2022257824A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118800257A (zh) * 2023-04-13 2024-10-18 华为技术有限公司 Scene audio decoding method and electronic device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
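CN1264533A (zh) * 1997-07-16 2000-08-23 多尔拜实验特许公司 Multi-channel low-bit-rate encoding and decoding method and device
CN101030379A (zh) * 2007-03-26 2007-09-05 北京中星微电子有限公司 Method and apparatus for bit allocation of a digital audio signal
CN102859584A (zh) * 2009-12-17 2013-01-02 弗劳恩霍弗实用研究促进协会 Apparatus and method for converting a first parametric spatial audio signal into a second parametric spatial audio signal
CN103489450A (zh) * 2013-04-07 2014-01-01 杭州微纳科技有限公司 Wireless audio compression and decompression method based on time-domain aliasing cancellation, and device thereof
CN105637582A (zh) * 2013-10-17 2016-06-01 株式会社索思未来 Audio encoding device and audio decoding device
CN107493542A (zh) * 2012-08-31 2017-12-19 杜比实验室特许公司 Loudspeaker system for playing audio content in a listening environment
CN110728986A (zh) * 2018-06-29 2020-01-24 华为技术有限公司 Stereo signal encoding method, decoding method, encoding apparatus, and decoding apparatus
CN112513980A (zh) * 2018-05-31 2021-03-16 诺基亚技术有限公司 Spatial audio parameter signaling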

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140128565A (ko) * 2013-04-27 2014-11-06 인텔렉추얼디스커버리 주식회사 Audio signal processing method and apparatus

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
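CN1264533A (zh) * 1997-07-16 2000-08-23 多尔拜实验特许公司 Multi-channel low-bit-rate encoding and decoding method and device
CN101030379A (zh) * 2007-03-26 2007-09-05 北京中星微电子有限公司 Method and apparatus for bit allocation of a digital audio signal
CN102859584A (zh) * 2009-12-17 2013-01-02 弗劳恩霍弗实用研究促进协会 Apparatus and method for converting a first parametric spatial audio signal into a second parametric spatial audio signal
CN107493542A (zh) * 2012-08-31 2017-12-19 杜比实验室特许公司 Loudspeaker system for playing audio content in a listening environment
CN103489450A (zh) * 2013-04-07 2014-01-01 杭州微纳科技有限公司 Wireless audio compression and decompression method based on time-domain aliasing cancellation, and device thereof
CN105637582A (zh) * 2013-10-17 2016-06-01 株式会社索思未来 Audio encoding device and audio decoding device
CN112513980A (zh) * 2018-05-31 2021-03-16 诺基亚技术有限公司 Spatial audio parameter signaling
CN110728986A (zh) * 2018-06-29 2020-01-24 华为技术有限公司 Stereo signal encoding method, decoding method, encoding apparatus, and decoding apparatus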

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4354430A4

Also Published As

Publication number Publication date
EP4354430A4 (en) 2024-07-24
CN115472170A (zh) 2022-12-13
EP4354430A1 (en) 2024-04-17
US20240112684A1 (en) 2024-04-04
KR20240013221A (ko) 2024-01-30

Similar Documents

Publication Publication Date Title
US20230298600A1 (en) Audio encoding and decoding method and apparatus
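WO2022262576A1 (zh) Three-dimensional audio signal encoding method and apparatus, encoder, and system
WO2022237851A1 (zh) Audio encoding and decoding method and apparatus
WO2022257824A1 (zh) Three-dimensional audio signal processing method and apparatus
US20240087580A1 (en) Three-dimensional audio signal coding method and apparatus, and encoder
US20240105187A1 (en) Three-dimensional audio signal processing method and apparatus
WO2024146408A1 (zh) Scene audio decoding method and electronic device
CN115376529B (zh) Three-dimensional audio signal encoding method and apparatus, and encoder
WO2024212895A1 (zh) Scene audio signal decoding method and apparatus
WO2024212898A1 (zh) Scene audio signal encoding method and apparatus
WO2022242481A1 (zh) Three-dimensional audio signal encoding method and apparatus, and encoder
WO2022242479A1 (zh) Three-dimensional audio signal encoding method and apparatus, and encoder
WO2024212894A1 (zh) Scene audio signal decoding method and apparatus
WO2024212638A1 (zh) Scene audio decoding method and electronic device
WO2024212896A1 (zh) Scene audio signal decoding method and apparatus
WO2024212897A1 (zh) Scene audio signal decoding method and apparatus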

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22819422; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase. Ref document number: 202337083725; Country of ref document: IN
WWE WIPO information: entry into national phase. Ref document number: 2022819422; Country of ref document: EP
ENP Entry into the national phase. Ref document number: 20237044825; Country of ref document: KR; Kind code of ref document: A
WWE WIPO information: entry into national phase. Ref document number: 1020237044825; Country of ref document: KR
ENP Entry into the national phase. Ref document number: 2022819422; Country of ref document: EP; Effective date: 20231220
NENP Non-entry into the national phase. Ref country code: DE