WO2022110723A1 - Audio encoding and decoding method and apparatus - Google Patents

Audio encoding and decoding method and apparatus

Info

Publication number
WO2022110723A1
WO2022110723A1 PCT/CN2021/096841 CN2021096841W WO2022110723A1 WO 2022110723 A1 WO2022110723 A1 WO 2022110723A1 CN 2021096841 W CN2021096841 W CN 2021096841W WO 2022110723 A1 WO2022110723 A1 WO 2022110723A1
Authority
WO
WIPO (PCT)
Prior art keywords
virtual speaker
signal
target virtual
target
hoa
Prior art date
Application number
PCT/CN2021/096841
Other languages
English (en)
French (fr)
Inventor
高原
刘帅
王宾
王喆
曲天书
徐佳浩
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to JP2023532579A (JP2023551040A)
Priority to MX2023006299A (MX2023006299A)
Priority to CA3200632A (CA3200632A1)
Priority to EP21896233.0A (EP4246510A4)
Publication of WO2022110723A1 (in Chinese)
Priority to US18/202,553 (US20230298600A1)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/03 Application of parametric coding in stereophonic audio systems

Definitions

  • the present application relates to the technical field of audio coding and decoding, and in particular, to an audio coding and decoding method and apparatus.
  • 3D audio technology is an audio technology that acquires, processes, transmits, and renders playback of sound events and 3D sound field information in the real world.
  • three-dimensional audio technology gives sound a strong sense of space, envelopment, and immersion, offering listeners an extraordinary "immersive sound" experience.
  • Higher order ambisonics (HOA) technology is independent of the speaker layout in the recording, encoding, and playback stages, and HOA-format data can be rotated at playback, which gives it greater flexibility in three-dimensional audio playback. It has therefore received extensive attention and research.
  • HOA technology requires a large amount of data for recording more detailed sound scene information. Although this kind of scene-based 3D audio signal sampling and storage is more conducive to the preservation and transmission of audio signal spatial information, more data will be generated with the increase of the HOA order, and a large amount of data will cause difficulties in transmission and storage.
  • the HOA signal needs to be encoded and decoded.
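
To make the growth in data concrete, the following is a minimal sketch of how the channel count and the uncompressed data rate scale with the HOA order. It relies on the standard ambisonics property that a full three-dimensional signal of order N has (N + 1)^2 channels, which the text above does not state explicitly. The function names, the 48 kHz sample rate, and the 16-bit sample size are assumptions chosen for illustration.

```python
def hoa_channel_count(order: int) -> int:
    """Number of channels of a full 3D HOA signal of the given order."""
    return (order + 1) ** 2


def raw_bitrate_kbps(order: int, sample_rate_hz: int = 48_000, bits_per_sample: int = 16) -> float:
    """Uncompressed PCM bit rate in kbit/s (sample rate and bit depth are assumed values)."""
    return hoa_channel_count(order) * sample_rate_hz * bits_per_sample / 1000


# Print order, channel count, and raw bit rate for orders 1 to 4.
for n in range(1, 5):
    print(n, hoa_channel_count(n), raw_bitrate_kbps(n))
```

Under these assumptions a third-order signal already has 16 channels, so the raw rate grows quadratically with the order.
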
  • one existing multi-channel data encoding and decoding method includes: at the encoding end, directly encoding each channel of the original scene audio signal through a core encoder (such as a 16-channel encoder), and then outputting a code stream.
  • the code stream is decoded by a core decoder (for example, a 16-channel decoder) to obtain each channel of the decoded scene audio signal.
  • the above-mentioned multi-channel encoding and decoding method needs to adapt the corresponding codec according to the number of channels of the audio signal of the original scene, and with the increase of the number of channels, the compressed code stream has the problems of large amount of data and high bandwidth occupation.
  • Embodiments of the present application provide an audio encoding and decoding method and apparatus, which are used to reduce the amount of data in encoding and decoding, so as to improve encoding and decoding efficiency.
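  • an embodiment of the present application provides an audio encoding method, including:
  • selecting a first target virtual speaker from a preset virtual speaker set according to the current scene audio signal;
  • generating a first virtual speaker signal according to the current scene audio signal and the attribute information of the first target virtual speaker; and
  • encoding the first virtual speaker signal to obtain a code stream.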
  • in this embodiment of the present application, the first target virtual speaker is selected from the preset virtual speaker set according to the current scene audio signal; the first virtual speaker signal is generated according to the current scene audio signal and the attribute information of the first target virtual speaker; and the first virtual speaker signal is encoded to obtain a code stream. Because the first virtual speaker signal can be generated according to the first scene audio signal and the attribute information of the first target virtual speaker, the audio encoding end encodes the first virtual speaker signal instead of directly encoding the first scene audio signal.
  • a first target virtual speaker is selected according to the first scene audio signal, and the first virtual speaker signal generated based on the first target virtual speaker can represent the sound field at the listener's position in space. The sound field at this position is as close as possible to the original sound field when the first scene audio signal was recorded, which ensures the encoding quality of the audio encoding end. The first virtual speaker signal and the residual signal are encoded to obtain a code stream.
  • the amount of encoded data of the first virtual speaker signal is related to the first target virtual speaker, but is independent of the number of channels of the first scene audio signal, which reduces the amount of encoded data and improves encoding efficiency.
  • the method further includes:
  • the selecting the first target virtual speaker from the preset virtual speaker set according to the current scene audio signal includes:
  • the first target virtual speaker is selected from the virtual speaker set according to the main sound field components.
  • each virtual speaker in the virtual speaker set corresponds to a sound field component
  • the first target virtual speaker is selected from the virtual speaker set according to the main sound field components.
  • the virtual speaker corresponding to the main sound field component is the one selected by the encoder.
  • the encoding end can select the first target virtual speaker through the main sound field components, which solves the problem that the encoding end needs to determine the first target virtual speaker.
  • the selecting the first target virtual speaker from the virtual speaker set according to the main sound field components includes:
  • the HOA coefficients corresponding to the main sound field components are selected from the higher order ambisonics (HOA) coefficient set, where the HOA coefficients in the HOA coefficient set are in one-to-one correspondence with the virtual speakers in the virtual speaker set;
  • a virtual speaker in the virtual speaker set corresponding to the HOA coefficient corresponding to the main sound field component is determined as the first target virtual speaker.
  • the HOA coefficient set is pre-configured in the encoder according to the virtual speaker set, and there is a one-to-one correspondence between the HOA coefficients in the HOA coefficient set and the virtual speakers in the virtual speaker set, so an HOA coefficient can be selected according to the main sound field components.
  • the target virtual speaker corresponding to the HOA coefficient that corresponds to the main sound field component is then looked up in the virtual speaker set, and the virtual speaker found is the first target virtual speaker, which solves the problem of how the encoder determines the first target virtual speaker.
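
The selection described above can be sketched as follows. This is one plausible reading, not the method claimed in the application: it assumes that the sound field component of a virtual speaker is the energy of the projection of the HOA frame onto that speaker's HOA coefficients, and that the main component is the largest one. The first-order coefficient formula uses the common ACN channel order (W, Y, Z, X) with SN3D normalization; all function names are hypothetical.

```python
import math


def first_order_hoa_coefficients(azimuth: float, elevation: float) -> list:
    """First-order HOA coefficients (ACN order W, Y, Z, X; SN3D) for a direction in radians."""
    return [
        1.0,
        math.sin(azimuth) * math.cos(elevation),
        math.sin(elevation),
        math.cos(azimuth) * math.cos(elevation),
    ]


def sound_field_components(hoa_frame, coefficient_set):
    """Energy of the projection of the HOA frame onto each virtual speaker's coefficients.

    hoa_frame is a list of channels, each channel a list of samples.
    """
    components = []
    for coeffs in coefficient_set:
        energy = 0.0
        for t in range(len(hoa_frame[0])):
            projection = sum(c * channel[t] for c, channel in zip(coeffs, hoa_frame))
            energy += projection * projection
        components.append(energy)
    return components


def select_target_speaker(hoa_frame, coefficient_set) -> int:
    """Index of the virtual speaker with the largest sound field component (assumed criterion)."""
    components = sound_field_components(hoa_frame, coefficient_set)
    return max(range(len(components)), key=components.__getitem__)


# Hypothetical virtual speaker set: four horizontal speakers at 0, 90, 180 and 270 degrees.
speaker_set = [first_order_hoa_coefficients(math.radians(a), 0.0) for a in (0, 90, 180, 270)]
# A scene containing a single source located at 90 degrees.
source = [0.5, -0.25, 1.0, 0.75]
frame = [[c * s for s in source] for c in first_order_hoa_coefficients(math.radians(90), 0.0)]
print(select_target_speaker(frame, speaker_set))
```

Because the coefficient set and the speaker set share the same indexing, the returned index identifies both the selected HOA coefficient and the corresponding target virtual speaker.
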
  • the selecting the first target virtual speaker from the virtual speaker set according to the main sound field components includes:
  • a virtual speaker corresponding to the HOA coefficient corresponding to the first target virtual speaker in the virtual speaker set is determined as the target virtual speaker.
  • the encoder can use the main sound field component to determine the configuration parameters of the first target virtual speaker.
  • the main sound field component is the one or several sound field components with the largest value among the multiple sound field components, or the main sound field component may be the one or several sound field components with a dominant direction among the multiple sound field components. The main sound field component can be used to determine the first target virtual speaker matching the audio signal of the current scene.
  • the first target virtual speaker is configured with corresponding attribute information, and the HOA coefficient of the first target virtual speaker can be generated by using the configuration parameters of the first target virtual speaker.
  • each virtual speaker in the virtual speaker set has a corresponding HOA coefficient, so the first target virtual speaker can be selected from the virtual speaker set according to the HOA coefficient corresponding to each virtual speaker, which solves the problem of how the encoder determines the first target virtual speaker.
  • the acquiring the configuration parameters of the first target virtual speaker according to the main sound field components includes:
  • the configuration parameter of the first target virtual speaker is selected from the configuration parameters of the plurality of virtual speakers according to the main sound field components.
  • the respective configuration parameters of a plurality of virtual speakers may be pre-stored in the audio encoder, and the configuration parameters of each virtual speaker may be determined by the configuration information of the audio encoder.
  • the configuration information of the audio encoder includes but is not limited to: HOA order, encoding bit rate, etc.
  • the configuration information of the audio encoder can be used to determine the number of virtual speakers and the position parameters of each virtual speaker, which solves the problem of how the encoder determines the configuration parameters of the virtual speakers.
  • for example, if the encoding bit rate is low, a smaller number of virtual speakers can be configured, and if the encoding bit rate is high, a larger number of virtual speakers can be configured.
  • the HOA order of the virtual speaker may be equal to the HOA order of the audio encoder. Without limitation, in this embodiment of the present application, in addition to being determined according to the configuration information of the audio encoder, the respective configuration parameters of the multiple virtual speakers may also be determined according to user-defined information. For example, the user can customize the position of the virtual speakers, the HOA order, the number of virtual speakers, etc.
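
As an illustration of deriving virtual speaker configuration parameters from the encoder configuration, the sketch below maps the encoding bit rate to a speaker count and spreads the speakers over the sphere. Everything specific here is an assumption: the bit rate thresholds and speaker counts are made up, and the Fibonacci lattice is a swapped-in, commonly used way to distribute points roughly evenly on a sphere; the text does not say how positions are chosen.

```python
import math


def speaker_count_for_bitrate(bitrate_kbps: float) -> int:
    """Map an encoding bit rate to a number of virtual speakers (thresholds are made up)."""
    if bitrate_kbps < 128:
        return 64
    if bitrate_kbps < 384:
        return 256
    return 1024


def virtual_speaker_positions(count: int) -> list:
    """Spread `count` points over the sphere with a Fibonacci lattice.

    Returns (azimuth, elevation) pairs in radians.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    positions = []
    for i in range(count):
        z = 1.0 - (2.0 * i + 1.0) / count  # evenly spaced heights strictly inside (-1, 1)
        azimuth = (i * golden_angle) % (2.0 * math.pi)
        positions.append((azimuth, math.asin(z)))
    return positions


def configure_virtual_speakers(hoa_order: int, bitrate_kbps: float) -> list:
    """Configuration parameters (position and HOA order) for every virtual speaker."""
    count = speaker_count_for_bitrate(bitrate_kbps)
    return [
        {"azimuth": az, "elevation": el, "hoa_order": hoa_order}
        for az, el in virtual_speaker_positions(count)
    ]
```

A user-defined configuration, as mentioned above, would simply replace the output of `configure_virtual_speakers` with a hand-written list of the same shape.
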
  • the configuration parameters of the first target virtual speaker include: position information and HOA order information of the first target virtual speaker;
  • the generating the HOA coefficient corresponding to the first target virtual speaker according to the configuration parameters of the first target virtual speaker includes:
  • the HOA coefficient corresponding to the first target virtual speaker is determined according to the position information and HOA order information of the first target virtual speaker.
  • the HOA coefficient of each virtual speaker can be generated by using the position information and HOA order information of that virtual speaker, which solves the problem of how the encoder determines the HOA coefficient of the first target virtual speaker.
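
One common way to turn a position and an HOA order into HOA coefficients is to evaluate the real spherical harmonics in the direction of the speaker. The sketch below does this with ACN channel ordering and SN3D normalization; both conventions are assumptions, since the text does not specify which normalization or ordering is used, and the function names are hypothetical.

```python
import math


def associated_legendre(n: int, m: int, x: float) -> float:
    """Associated Legendre function P_n^m(x), m >= 0, without the Condon-Shortley phase."""
    p_mm = 1.0
    for k in range(1, m + 1):
        p_mm *= (2 * k - 1) * math.sqrt(max(0.0, 1.0 - x * x))
    if n == m:
        return p_mm
    p_prev, p_curr = p_mm, x * (2 * m + 1) * p_mm
    # Upward recurrence in the degree, starting from P_m^m and P_{m+1}^m.
    for k in range(m + 2, n + 1):
        p_prev, p_curr = p_curr, ((2 * k - 1) * x * p_curr - (k + m - 1) * p_prev) / (k - m)
    return p_curr


def hoa_coefficients(azimuth: float, elevation: float, order: int) -> list:
    """Real spherical harmonics up to `order` (ACN channel order, SN3D normalization)."""
    coeffs = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            am = abs(m)
            norm = math.sqrt((2.0 if m else 1.0) * math.factorial(n - am) / math.factorial(n + am))
            trig = math.cos(am * azimuth) if m >= 0 else math.sin(am * azimuth)
            coeffs.append(norm * associated_legendre(n, am, math.sin(elevation)) * trig)
    return coeffs


# A third-order virtual speaker straight ahead (azimuth 0, elevation 0).
print(len(hoa_coefficients(0.0, 0.0, 3)))
```

The length of the returned list is (order + 1)^2, so it matches the channel count of an HOA signal of the same order.
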
  • the method further includes:
  • the attribute information of the first target virtual speaker is encoded and written into the code stream.
  • in addition to encoding the virtual speaker signal, the encoding end can also encode the attribute information of the first target virtual speaker and write the encoded attribute information of the first target virtual speaker into the code stream.
  • the obtained code stream may then include: the encoded virtual speaker signal and the encoded attribute information of the first target virtual speaker.
  • the encoded attribute information of the first target virtual speaker can be carried in the code stream, so that the decoding end can determine the attribute information of the first target virtual speaker by decoding the code stream, which is convenient for audio decoding at the decoding end.
  • the current scene audio signal includes: a higher order ambisonics (HOA) signal to be encoded; the attribute information of the first target virtual speaker includes the HOA coefficient of the first target virtual speaker;
  • the generating a first virtual speaker signal according to the current scene audio signal and the attribute information of the first target virtual speaker includes:
  • the encoding end first determines the HOA coefficient of the first target virtual speaker. For example, the encoding end selects an HOA coefficient from the HOA coefficient set according to the main sound field components, and the selected HOA coefficient is the HOA coefficient of the first target virtual speaker.
  • after the encoding end obtains the HOA signal to be encoded and the HOA coefficient of the first target virtual speaker, the first virtual speaker signal can be generated according to the HOA signal to be encoded and the HOA coefficient of the first target virtual speaker, wherein the HOA signal to be encoded can be expressed as a linear combination of the HOA coefficients of the first target virtual speaker, and solving for the first virtual speaker signal can be converted into solving this linear combination.
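
A minimal sketch of solving that linear combination for a single target virtual speaker follows. It assumes a least-squares solution: for each sample, the weight w that best satisfies hoa_signal[:, t] ≈ speaker_coeffs * w is the projection of the HOA sample vector onto the coefficient vector, divided by the squared norm of the coefficients. The least-squares criterion and the function name are assumptions; the text only states that a linear combination is solved.

```python
def virtual_speaker_signal(hoa_signal, speaker_coeffs):
    """Least-squares virtual speaker signal for one target virtual speaker.

    hoa_signal is a list of channels (each a list of samples); speaker_coeffs holds
    one HOA coefficient per channel. Each output sample is a linear combination of
    the HOA channels with weights speaker_coeffs / ||speaker_coeffs||^2.
    """
    norm = sum(c * c for c in speaker_coeffs)
    num_samples = len(hoa_signal[0])
    return [
        sum(c * channel[t] for c, channel in zip(speaker_coeffs, hoa_signal)) / norm
        for t in range(num_samples)
    ]


# A first-order scene produced by a single source in the direction of the speaker.
coeffs = [1.0, 0.0, 0.0, 1.0]
source = [0.5, -0.25, 1.0]
hoa = [[c * s for s in source] for c in coeffs]
print(virtual_speaker_signal(hoa, coeffs))
```

When the scene consists of a single source in the direction of the target virtual speaker, the projection returns the source signal itself, so one channel stands in for all HOA channels.
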
  • the current scene audio signal includes: a higher order ambisonics (HOA) signal to be encoded; the attribute information of the first target virtual speaker includes position information of the first target virtual speaker;
  • the generating a first virtual speaker signal according to the current scene audio signal and the attribute information of the first target virtual speaker includes:
  • the attribute information of the first target virtual speaker may include position information of the first target virtual speaker. The encoding end pre-stores the HOA coefficient of each virtual speaker in the virtual speaker set, and also stores the position information of each virtual speaker. There is a correspondence between the position information of a virtual speaker and the HOA coefficient of that virtual speaker, so the encoder can determine the HOA coefficient of the first target virtual speaker through the position information of the first target virtual speaker. If the attribute information includes the HOA coefficient, the encoder can obtain the HOA coefficient of the first target virtual speaker by decoding the attribute information of the first target virtual speaker.
  • the method further includes:
  • the second virtual speaker signal is encoded and written into the code stream.
  • the second target virtual speaker is another target virtual speaker selected by the encoding end that is different from the first target virtual speaker.
  • the first scene audio signal is the original scene audio signal to be encoded
  • the second target virtual speaker may be a certain virtual speaker in the virtual speaker set.
  • a preconfigured target virtual speaker selection strategy may be used to select the second target virtual speaker from a preset virtual speaker set.
  • the target virtual speaker selection strategy is a strategy for selecting, from the virtual speaker set, target virtual speakers matching the first scene audio signal, for example, selecting the second target virtual speaker according to the sound field components obtained by each virtual speaker from the first scene audio signal.
  • the method further includes:
  • the encoding the second virtual speaker signal includes:
  • the encoding the first virtual speaker signal includes:
  • the aligned first virtual speaker signal is encoded.
  • the encoding end can encode the aligned first virtual speaker signal. Adjusting the alignment enhances the inter-channel correlation, which is beneficial to the encoding processing of the first virtual speaker signal by the core encoder.
  • the method further includes:
  • the encoding the first virtual speaker signal includes:
  • a downmix signal and side information are obtained from the first virtual speaker signal and the second virtual speaker signal, and the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal;
  • the downmix signal and the side information are encoded.
  • the encoding end can also perform downmix processing according to the first virtual speaker signal and the second virtual speaker signal to generate a downmix signal, for example, by performing downmix processing on the amplitudes of the first virtual speaker signal and the second virtual speaker signal.
  • side information can be generated according to the first virtual speaker signal and the second virtual speaker signal, and the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal.
  • the side information can be used by the decoding end to perform up-mixing on the downmix signal, so as to recover the first virtual speaker signal and the second virtual speaker signal.
  • the side information includes signal information loss analysis parameters, so that the decoding end recovers the first virtual speaker signal and the second virtual speaker signal through the signal information loss analysis parameters.
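
The downmix and side information idea can be sketched with a deliberately simple parametric scheme. This is an assumption made for illustration, not the scheme of the application: the downmix is the average of the two virtual speaker signals, and the side information is one least-squares gain per signal, standing in for the "signal information loss analysis parameters" mentioned above. The decoder-side `upmix` multiplies the downmix by those gains.

```python
def downmix_with_side_info(first, second):
    """Return (downmix, side_info); side_info holds one gain per original signal."""
    downmix = [(a + b) / 2.0 for a, b in zip(first, second)]
    energy = sum(d * d for d in downmix)
    if energy == 0.0:
        return downmix, (0.0, 0.0)
    # Least-squares gains so that first ~= gain_first * downmix, and likewise for second.
    gain_first = sum(a * d for a, d in zip(first, downmix)) / energy
    gain_second = sum(b * d for b, d in zip(second, downmix)) / energy
    return downmix, (gain_first, gain_second)


def upmix(downmix, side_info):
    """Decoder side: approximate both virtual speaker signals from the downmix."""
    gain_first, gain_second = side_info
    return [gain_first * d for d in downmix], [gain_second * d for d in downmix]


first_signal = [1.0, -2.0, 4.0]
second_signal = [0.5 * x for x in first_signal]
mixed, info = downmix_with_side_info(first_signal, second_signal)
print(upmix(mixed, info))
```

With this scheme the recovery is exact only when the two signals are scaled copies of each other; in general the upmix is an approximation, which is why a real codec would carry richer side information.
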
  • the method further includes:
  • the obtaining the downmix signal and the side information according to the first virtual speaker signal and the second virtual speaker signal includes:
  • the side information is used to indicate the relationship between the aligned first virtual speaker signal and the aligned second virtual speaker signal.
  • the encoding end may perform an alignment operation of the virtual speaker signal, and then generate the downmix signal and side information after the alignment operation is completed.
  • the correlation between the channels is enhanced, which is beneficial to the encoding processing of the first virtual speaker signal by the core encoder.
  • the method before selecting a second target virtual speaker from the virtual speaker set according to the current scene audio signal, the method further includes:
  • a second target virtual speaker is selected from the virtual speaker set according to the current scene audio signal.
  • the encoder can also perform signal selection to determine whether to acquire the second target virtual speaker.
  • when the second target virtual speaker needs to be acquired, the encoder can generate the second virtual speaker signal.
  • when the second target virtual speaker does not need to be acquired, the encoding end may not generate the second virtual speaker signal.
  • the encoder may make a decision according to the configuration information of the audio encoder and/or the signal type information of the first scene audio signal, so as to determine whether another target virtual speaker needs to be selected in addition to the first target virtual speaker. For example, if the encoding rate is higher than a preset threshold, it is determined that the target virtual speakers corresponding to two main sound field components need to be obtained, and the second target virtual speaker can be determined in addition to the first target virtual speaker.
  • for another example, if it is determined according to the signal type information of the first scene audio signal that the target virtual speakers corresponding to two main sound field components with a dominant sound source direction need to be obtained, the second target virtual speaker can be determined in addition to the first target virtual speaker.
  • conversely, if it is determined according to the encoding rate and/or the signal type information of the first scene audio signal that only one target virtual speaker needs to be acquired, then after the first target virtual speaker is determined, no target virtual speaker other than the first target virtual speaker is acquired.
  • the amount of data encoded by the encoding end can be reduced, and the encoding efficiency can be improved.
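
The decision described above can be sketched as a small predicate followed by a top-k selection. The threshold value, the use of a count of dominant sources as the "signal type information", and the function names are all assumptions made for illustration.

```python
def needs_second_target_speaker(bitrate_kbps, dominant_source_count, bitrate_threshold_kbps=256.0):
    """Decide whether a second target virtual speaker should be selected.

    The threshold and the dominant-source count are hypothetical stand-ins for the
    encoding rate check and the signal type information.
    """
    return bitrate_kbps > bitrate_threshold_kbps or dominant_source_count >= 2


def choose_target_speakers(components, bitrate_kbps, dominant_source_count):
    """Return indices of the selected virtual speakers, strongest component first."""
    count = 2 if needs_second_target_speaker(bitrate_kbps, dominant_source_count) else 1
    ranked = sorted(range(len(components)), key=lambda i: components[i], reverse=True)
    return ranked[:count]


# Sound field components of three hypothetical virtual speakers.
print(choose_target_speakers([0.1, 3.0, 2.0], bitrate_kbps=128, dominant_source_count=1))
print(choose_target_speakers([0.1, 3.0, 2.0], bitrate_kbps=512, dominant_source_count=1))
```

Skipping the second speaker at low bit rates means only one virtual speaker signal is generated and encoded, which is where the data reduction comes from.
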
  • an embodiment of the present application also provides an audio decoding method, including:
  • receiving a code stream;
  • decoding the code stream to obtain a virtual speaker signal; and
  • obtaining a reconstructed scene audio signal according to the attribute information of the target virtual speaker and the virtual speaker signal.
  • the code stream is first received, then decoded to obtain a virtual speaker signal, and finally a reconstructed scene audio signal is obtained according to the attribute information of the target virtual speaker and the virtual speaker signal.
  • the virtual speaker signal can be decoded from the code stream, and the reconstructed scene audio signal is obtained through the attribute information of the target virtual speaker and the virtual speaker signal.
  • the obtained code stream carries the virtual speaker signal and the residual signal, which reduces the amount of decoded data and improves the decoding efficiency.
  • the method further includes:
  • in addition to encoding the virtual speaker signal, the encoding end can also encode the attribute information of the target virtual speaker and write the encoded attribute information of the target virtual speaker into the code stream, so that, for example, the attribute information of the first target virtual speaker can be obtained from the code stream.
  • the encoded attribute information of the first target virtual speaker can be carried in the code stream, so that the decoding end can determine the attribute information of the first target virtual speaker by decoding the code stream, which is convenient for audio decoding at the decoding end.
  • the attribute information of the target virtual speaker includes a higher order ambisonics (HOA) coefficient of the target virtual speaker;
  • the obtaining the reconstructed scene audio signal according to the attribute information of the target virtual speaker and the virtual speaker signal includes:
  • the virtual speaker signal and the HOA coefficients of the target virtual speaker are synthesized to obtain the reconstructed scene audio signal.
  • the decoding end first determines the HOA coefficient of the target virtual speaker.
  • the decoding end can store the HOA coefficient of the target virtual speaker in advance. After the decoding end obtains the virtual speaker signal and the HOA coefficient of the target virtual speaker, it can obtain the reconstructed scene audio signal according to the virtual speaker signal and the HOA coefficient of the target virtual speaker, thereby improving the quality of the reconstructed scene audio signal.
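
The synthesis step at the decoding end can be sketched as the reverse of the encoder-side projection: each reconstructed HOA channel is the sum, over the target virtual speakers, of that speaker's HOA coefficient for the channel multiplied by the decoded virtual speaker signal. Treating synthesis as this plain weighted sum is an assumption, and the function name is hypothetical.

```python
def reconstruct_hoa(speaker_signals, speaker_coeffs):
    """Synthesize HOA channels from virtual speaker signals and their HOA coefficients.

    speaker_signals: one list of samples per target virtual speaker.
    speaker_coeffs: one list of HOA coefficients (one per HOA channel) per speaker.
    """
    num_channels = len(speaker_coeffs[0])
    num_samples = len(speaker_signals[0])
    hoa = [[0.0] * num_samples for _ in range(num_channels)]
    for signal, coeffs in zip(speaker_signals, speaker_coeffs):
        for ch, c in enumerate(coeffs):
            for t, s in enumerate(signal):
                hoa[ch][t] += c * s
    return hoa


# One decoded virtual speaker signal and the first-order coefficients of its speaker.
print(reconstruct_hoa([[0.5, -1.0]], [[1.0, 0.0, 0.0, 1.0]]))
```

The number of reconstructed channels depends only on the length of the coefficient vectors, not on how many virtual speaker signals were transmitted.
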
  • the attribute information of the target virtual speaker includes position information of the target virtual speaker
  • the obtaining the reconstructed scene audio signal according to the attribute information of the target virtual speaker and the virtual speaker signal includes:
  • the virtual speaker signal and the HOA coefficients of the target virtual speaker are synthesized to obtain the reconstructed scene audio signal.
  • the attribute information of the target virtual speaker may include: position information of the target virtual speaker.
  • the decoding end pre-stores the HOA coefficient of each virtual speaker in the virtual speaker set, and the decoding end also stores the position information of each virtual speaker.
  • the decoding end can determine, according to the correspondence between the position information of a virtual speaker and the HOA coefficient of that virtual speaker, the HOA coefficient corresponding to the position information of the target virtual speaker, or the decoding end can calculate the HOA coefficient of the target virtual speaker according to the position information of the target virtual speaker. Therefore, the decoding end can determine the HOA coefficient of the target virtual speaker through the position information of the target virtual speaker, which solves the problem of how the decoding end determines the HOA coefficient of the target virtual speaker.
  • the virtual speaker signal is a downmix signal obtained by downmixing the first virtual speaker signal and the second virtual speaker signal, and the method further includes:
  • the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal;
  • obtaining the reconstructed scene audio signal according to the attribute information of the target virtual speaker and the virtual speaker signal includes:
  • the reconstructed scene audio signal is obtained according to the attribute information of the target virtual speaker, the first virtual speaker signal and the second virtual speaker signal.
  • the encoder generates a downmix signal when performing downmix processing according to the first virtual speaker signal and the second virtual speaker signal. The encoder can also perform signal compensation for the downmix signal to generate side information, which can be written into the code stream. The decoding end can obtain the side information from the code stream and perform signal compensation according to the side information to obtain the first virtual speaker signal and the second virtual speaker signal. Signal reconstruction can therefore use the first virtual speaker signal, the second virtual speaker signal, and the aforementioned attribute information of the target virtual speaker, thereby improving the quality of the decoded signal at the decoding end.
  • an audio encoding device including:
  • an acquisition module configured to select the first target virtual speaker from the preset virtual speaker set according to the current scene audio signal
  • a signal generation module configured to generate a first virtual speaker signal according to the current scene audio signal and the attribute information of the first target virtual speaker
  • an encoding module configured to encode the first virtual speaker signal to obtain a code stream.
  • the acquiring module is configured to acquire main sound field components from the audio signal of the current scene according to the virtual speaker set; and select the first target virtual speaker from the virtual speaker set according to the main sound field components.
  • the component modules of the audio coding apparatus may also perform the steps described in the foregoing first aspect and various possible implementation manners. For details, see the description of the foregoing first aspect and various possible implementation manners.
  • the acquisition module is configured to select the HOA coefficient corresponding to the main sound field component from a higher order ambisonics (HOA) coefficient set according to the main sound field component, where the HOA coefficients in the HOA coefficient set are in one-to-one correspondence with the virtual speakers in the virtual speaker set; and determine the virtual speaker in the virtual speaker set corresponding to the HOA coefficient that corresponds to the main sound field component as the first target virtual speaker.
  • the acquiring module is configured to acquire configuration parameters of the first target virtual speaker according to the main sound field components; generate the HOA coefficient corresponding to the first target virtual speaker according to the configuration parameters of the first target virtual speaker; and determine the virtual speaker corresponding to the HOA coefficient corresponding to the first target virtual speaker in the virtual speaker set as the target virtual speaker.
  • the obtaining module is configured to determine the configuration parameters of multiple virtual speakers in the virtual speaker set according to the configuration information of the audio encoder; and select the configuration parameters of the first target virtual speaker from the configuration parameters of the multiple virtual speakers according to the main sound field components.
  • the configuration parameters of the first target virtual speaker include: position information and HOA order information of the first target virtual speaker;
  • the obtaining module is configured to determine the HOA coefficient corresponding to the first target virtual speaker according to the position information and HOA order information of the first target virtual speaker.
  • the encoding module is further configured to encode the attribute information of the first target virtual speaker and write it into the code stream.
  • the current scene audio signal includes: the HOA signal to be encoded; the attribute information of the first target virtual speaker includes the HOA coefficient of the first target virtual speaker;
  • the signal generating module is configured to linearly combine the to-be-coded HOA signal and the HOA coefficient to obtain the first virtual speaker signal.
  • the current scene audio signal includes: a higher order ambisonics (HOA) signal to be encoded; the attribute information of the first target virtual speaker includes position information of the first target virtual speaker;
  • the signal generation module is used to obtain the HOA coefficient corresponding to the first target virtual speaker according to the position information of the first target virtual speaker; linearly combine the to-be-coded HOA signal and the HOA coefficient to obtain the first virtual speaker signal.
  • the acquiring module is configured to select a second target virtual speaker from the virtual speaker set according to the current scene audio signal
  • the signal generation module is configured to generate a second virtual speaker signal according to the current scene audio signal and the attribute information of the second target virtual speaker;
  • the encoding module is configured to encode the second virtual speaker signal and write it into the code stream.
  • the signal generation module is configured to perform alignment processing on the first virtual speaker signal and the second virtual speaker signal, so as to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
  • the encoding module is configured to encode the aligned second virtual speaker signal
  • the encoding module is configured to encode the aligned first virtual speaker signal.
  • the acquiring module is configured to select a second target virtual speaker from the virtual speaker set according to the current scene audio signal
  • the signal generation module is configured to generate a second virtual speaker signal according to the current scene audio signal and the attribute information of the second target virtual speaker;
  • the encoding module is configured to obtain a downmix signal and side information according to the first virtual speaker signal and the second virtual speaker signal, where the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal; and encode the downmix signal and the side information.
  • the signal generation module is configured to perform alignment processing on the first virtual speaker signal and the second virtual speaker signal, so as to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
  • the encoding module is configured to obtain the downmix signal and the side information according to the aligned first virtual speaker signal and the aligned second virtual speaker signal;
  • the side information is used to indicate the relationship between the aligned first virtual speaker signal and the aligned second virtual speaker signal.
  • the obtaining module is configured to, before selecting the second target virtual speaker from the virtual speaker set according to the current scene audio signal, determine, according to the encoding rate and/or the signal type information of the current scene audio signal, whether target virtual speakers other than the first target virtual speaker need to be obtained; and, if target virtual speakers other than the first target virtual speaker need to be obtained, select the second target virtual speaker from the virtual speaker set according to the current scene audio signal.
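  • As a rough illustration of the downmix signal and side information described above, the following Python sketch averages two aligned virtual speaker signals into one downmix channel and keeps two least-squares gains as side information. The function names, the averaging downmix, and the gain-based side information are all assumptions made for illustration; the application does not specify how the downmix or the side information is computed.

```python
def downmix_with_side_info(first_signal, second_signal):
    """Average two aligned signals and describe each one by a gain on the downmix.

    Hypothetical scheme: the side information is a pair of least-squares
    gains (g1, g2) such that first ~= g1 * downmix and second ~= g2 * downmix.
    """
    downmix = [(a + b) / 2.0 for a, b in zip(first_signal, second_signal)]
    energy = sum(x * x for x in downmix)
    if energy == 0.0:
        # A silent downmix carries no usable relationship.
        return downmix, (0.0, 0.0)
    g1 = sum(a * x for a, x in zip(first_signal, downmix)) / energy
    g2 = sum(b * x for b, x in zip(second_signal, downmix)) / energy
    return downmix, (g1, g2)


def recover_from_downmix(downmix, side_info):
    """Approximate both virtual speaker signals from the downmix and the gains."""
    g1, g2 = side_info
    return [g1 * x for x in downmix], [g2 * x for x in downmix]
```

  • In this sketch, when the second signal is a scaled copy of the first, the two gains recover both signals exactly; for less correlated signals the recovery is only approximate.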
  • an audio decoding apparatus including:
  • the receiving module is used to receive the code stream
  • a decoding module for decoding the code stream to obtain a virtual speaker signal
  • the reconstruction module is configured to obtain the reconstructed scene audio signal according to the attribute information of the target virtual speaker and the virtual speaker signal.
  • the decoding module is further configured to decode the code stream to obtain attribute information of the target virtual speaker.
  • the attribute information of the target virtual speaker includes a higher order ambisonics (HOA) coefficient of the target virtual speaker
  • the reconstruction module is configured to perform synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
  • the attribute information of the target virtual speaker includes position information of the target virtual speaker
  • the reconstruction module is configured to determine the HOA coefficient of the target virtual speaker according to the position information of the target virtual speaker, and to perform synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
  • the virtual speaker signal is a downmix signal obtained by downmixing the first virtual speaker signal and the second virtual speaker signal
  • the apparatus further includes: a signal compensation module, wherein:
  • the decoding module is configured to decode the code stream to obtain side information, where the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal;
  • the signal compensation module is configured to obtain the first virtual speaker signal and the second virtual speaker signal according to the side information and the downmix signal
  • the reconstruction module is configured to obtain the reconstructed scene audio signal according to the attribute information of the target virtual speaker, the first virtual speaker signal and the second virtual speaker signal.
  • the component modules of the audio decoding apparatus may also perform the steps described in the foregoing second aspect and its various possible implementations; for details, refer to the description of the foregoing second aspect and its various possible implementations.
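  • The synthesis performed by the reconstruction module (determine the HOA coefficient from the position information, then combine it with the virtual speaker signal) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses first-order ambisonics with the ACN channel order (W, Y, Z, X) and SN3D normalization for brevity, whereas the application works with higher orders, and the function names `foa_coefficients` and `reconstruct_scene` are hypothetical.

```python
import math


def foa_coefficients(azimuth, elevation):
    """First-order ambisonic coefficients for a direction (ACN order, SN3D).

    Angles are in radians. First order is an illustrative simplification.
    """
    return [
        1.0,                                      # W
        math.sin(azimuth) * math.cos(elevation),  # Y
        math.sin(elevation),                      # Z
        math.cos(azimuth) * math.cos(elevation),  # X
    ]


def reconstruct_scene(virtual_speaker_signals, positions):
    """Sum each virtual speaker signal weighted by its coefficients.

    positions: list of (azimuth, elevation), one per virtual speaker signal.
    Returns a list of 4 channels, each a list of samples.
    """
    num_samples = len(virtual_speaker_signals[0])
    scene = [[0.0] * num_samples for _ in range(4)]
    for signal, (azimuth, elevation) in zip(virtual_speaker_signals, positions):
        coeffs = foa_coefficients(azimuth, elevation)
        for channel in range(4):
            for n in range(num_samples):
                scene[channel][n] += coeffs[channel] * signal[n]
    return scene
```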
  • an embodiment of the present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer is caused to execute the method described in the first aspect or the second aspect.
  • an embodiment of the present application provides a computer program product containing instructions, which, when run on a computer, causes the computer to execute the method described in the first aspect or the second aspect.
  • an embodiment of the present application provides a communication apparatus.
  • the communication apparatus may include an entity such as a terminal device or a chip.
  • the communication apparatus includes: a processor.
  • the communication apparatus further includes a memory; the memory for storing instructions; the processor is configured to execute the instructions in the memory, causing the communication device to perform the method according to any one of the foregoing first or second aspects.
  • the present application provides a chip system
  • the chip system includes a processor for supporting an audio encoding device or an audio decoding device in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
  • the chip system further includes a memory for storing necessary program instructions and data of the audio encoding device or the audio decoding device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the present application provides a computer-readable storage medium, comprising a code stream generated by the method according to any one of the foregoing first aspects.
  • FIG. 1 is a schematic diagram of the composition and structure of an audio processing system provided by an embodiment of the present application.
  • FIG. 2a is a schematic diagram of an audio encoder and an audio decoder provided by an embodiment of the application applied to a terminal device;
  • FIG. 2b is a schematic diagram of an audio encoder provided by an embodiment of the present application applied to a wireless device or a core network device;
  • FIG. 2c is a schematic diagram of an audio decoder provided by an embodiment of the present application applied to a wireless device or a core network device;
  • FIG. 3a is a schematic diagram of the application of the multi-channel encoder and the multi-channel decoder provided by an embodiment of the application to a terminal device;
  • FIG. 3b is a schematic diagram of a multi-channel encoder provided by an embodiment of the present application applied to a wireless device or a core network device;
  • FIG. 3c is a schematic diagram of a multi-channel decoder provided by an embodiment of the present application applied to a wireless device or a core network device;
  • FIG. 4 is a schematic diagram of an interaction flow between an audio encoding device and an audio decoding device in an embodiment of the present application
  • FIG. 5 is a schematic structural diagram of an encoding end provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a decoding end provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an encoding end provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of virtual speakers approximately uniformly distributed on a spherical surface provided by an embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of an encoding end provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the composition and structure of an audio encoding apparatus provided by an embodiment of the application.
  • FIG. 11 is a schematic diagram of the composition and structure of an audio decoding apparatus provided by an embodiment of the application.
  • FIG. 12 is a schematic diagram of the composition and structure of another audio coding apparatus provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of the composition and structure of another audio decoding apparatus provided by an embodiment of the present application.
  • Embodiments of the present application provide an audio encoding and decoding method and apparatus, which are used to reduce the amount of data of an encoded scene audio signal and improve encoding and decoding efficiency.
  • the technical solutions of the embodiments of the present application can be applied to various audio processing systems.
  • FIG. 1 shows a schematic diagram of the composition and structure of the audio processing system provided by the embodiments of the present application.
  • the audio processing system 100 may include: an audio encoding device 101 and an audio decoding device 102 .
  • the audio encoding device 101 can be used to generate a code stream, and the encoded audio code stream can then be transmitted to the audio decoding device 102 through an audio transmission channel; the audio decoding device 102 receives the code stream, executes its audio decoding function, and finally obtains the reconstructed signal.
  • the audio encoding apparatus can be applied to various terminal devices that need audio communication, and to wireless devices and core network devices that need transcoding.
  • for example, the audio encoding apparatus can be the audio encoder of the above-mentioned terminal device, wireless device, or core network device.
  • similarly, the audio decoding apparatus can be applied to various terminal devices that need audio communication, and to wireless devices and core network devices that need transcoding.
  • for example, the audio decoding apparatus can be the audio decoder of the above-mentioned terminal device, wireless device, or core network device.
  • for example, the audio encoder may be used in a radio access network, a media gateway of a core network, a transcoding device, a media resource server, a mobile terminal, a fixed network terminal, and the like.
  • the audio encoder may also be an audio codec applied in virtual reality (VR) streaming services.
  • taking a VR streaming service as an example, the end-to-end processing flow of the audio signal is as follows: after audio signal A passes through the acquisition module (acquisition), a preprocessing operation (audio preprocessing) is performed.
  • the preprocessing operation includes filtering out the low-frequency part of the signal, using 20 Hz or 50 Hz as the cutoff point, and extracting the orientation information in the signal.
  • the signal is then encoded (audio encoding), packaged (file/segment encapsulation), and sent (delivery) to the decoding end; the decoding end first unpacks (file/segment decapsulation), then decodes (audio decoding), and performs binaural rendering (audio rendering) on the decoded signal.
  • the rendered signal is mapped to the listener's headphones (headphones), which may be standalone headphones or headphones on a glasses device.
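  • The low-frequency removal in the preprocessing step can be illustrated with a simple high-pass filter. The sketch below is only an assumption about how such a step could look: the application states the 20 Hz or 50 Hz cutoff but does not name a filter design, so the first-order RC-style filter and the function name `high_pass` are illustrative choices.

```python
import math


def high_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order RC-style high-pass filter (an illustrative filter choice).

    Implements y[n] = alpha * (y[n-1] + x[n] - x[n-1]),
    with alpha = RC / (RC + dt) and RC = 1 / (2 * pi * cutoff_hz).
    """
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[n] - samples[n - 1]))
    return out
```

  • With a 50 Hz cutoff at a 48 kHz sampling rate, a constant (0 Hz) input decays toward zero, while content well above the cutoff passes almost unchanged.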
  • FIG. 2a is a schematic diagram of the application of the audio encoder and the audio decoder provided by the embodiment of the present application to a terminal device.
  • Each terminal device may include: audio encoder, channel encoder, audio decoder, channel decoder.
  • the channel encoder is used for channel coding the audio signal
  • the channel decoder is used for channel decoding the audio signal.
  • the first terminal device 20 may include: a first audio encoder 201 , a first channel encoder 202 , a first audio decoder 203 , and a first channel decoder 204 .
  • the second terminal device 21 may include: a second audio decoder 211 , a second channel decoder 212 , a second audio encoder 213 , and a second channel encoder 214 .
  • the first terminal device 20 is connected to the wireless or wired first network communication device 22, the first network communication device 22 and the wireless or wired second network communication device 23 are connected through a digital channel, and the second terminal device 21 is connected to the wireless or wired second network communication device 23.
  • the above-mentioned wireless or wired network communication devices may generally refer to signal transmission devices, such as communication base stations, data exchange devices, and the like.
  • the terminal device as the transmitting end first performs audio acquisition, performs audio coding on the acquired audio signal, and then performs channel coding, and then transmits it in a digital channel through a wireless network or a core network.
  • the terminal device as the receiving end performs channel decoding according to the received signal to obtain the code stream, and then recovers the audio signal through audio decoding, which is played back by the terminal device at the receiving end.
  • FIG. 2b is a schematic diagram of applying the audio encoder provided by the embodiment of the present application to a wireless device or a core network device.
  • the wireless device or the core network device 25 includes: a channel decoder 251, another audio decoder 252, the audio encoder 253 provided by the embodiments of the present application, and a channel encoder 254, where the other audio decoder 252 refers to an audio decoder other than the audio decoder provided by the embodiments of the present application.
  • in the wireless device or the core network device 25, the channel decoder 251 first performs channel decoding on the signal entering the device, the other audio decoder 252 then performs audio decoding, the audio encoder 253 provided by the embodiment of the present application then performs audio encoding, and finally the channel encoder 254 performs channel coding on the audio signal, which is transmitted after the channel coding is completed.
  • the other audio decoder 252 performs audio decoding on the code stream decoded by the channel decoder 251 .
  • FIG. 2c is a schematic diagram of applying the audio decoder provided by the embodiment of the present application to a wireless device or a core network device.
  • the wireless device or the core network device 25 includes: a channel decoder 251, the audio decoder 255 provided by the embodiment of the present application, another audio encoder 256, and a channel encoder 254, where the other audio encoder 256 refers to an audio encoder other than the audio encoder provided by the embodiments of the present application.
  • in the wireless device or the core network device 25, the channel decoder 251 first performs channel decoding on the signal entering the device, the audio decoder 255 then decodes the received audio code stream, the other audio encoder 256 then performs audio encoding, and finally the channel encoder 254 performs channel coding on the audio signal, which is transmitted after the channel coding is completed.
  • in a wireless device or a core network device, if transcoding needs to be implemented, corresponding audio codec processing needs to be performed.
  • the wireless devices refer to radio frequency related devices in communication
  • the core network devices refer to core network related devices in communication.
  • the audio encoding apparatus can be applied to various terminal devices that need audio communication, and to wireless devices and core network devices that need transcoding.
  • for example, the audio encoding apparatus can be the multi-channel encoder of the above-mentioned terminal device, wireless device, or core network device.
  • similarly, the audio decoding apparatus can be applied to various terminal devices that need audio communication, and to wireless devices and core network devices that need transcoding.
  • for example, the audio decoding apparatus can be the multi-channel decoder of the above-mentioned terminal device, wireless device, or core network device.
  • FIG. 3a is a schematic diagram of the application of the multi-channel encoder and the multi-channel decoder provided by the embodiment of the application to a terminal device.
  • Each terminal device may include: a multi-channel encoder, a channel encoder, a multi-channel decoder, and a channel decoder.
  • the multi-channel encoder may execute the audio encoding method provided by the embodiment of the present application
  • the multi-channel decoder may execute the audio decoding method provided by the embodiment of the present application.
  • the channel encoder is used for channel coding the multi-channel signal
  • the channel decoder is used for channel decoding the multi-channel signal.
  • the first terminal device 30 may include: a first multi-channel encoder 301 , a first channel encoder 302 , a first multi-channel decoder 303 , and a first channel decoder 304 .
  • the second terminal device 31 may include: a second multi-channel decoder 311 , a second channel decoder 312 , a second multi-channel encoder 313 , and a second channel encoder 314 .
  • the first terminal device 30 is connected to the wireless or wired first network communication device 32, the first network communication device 32 and the wireless or wired second network communication device 33 are connected through a digital channel, and the second terminal device 31 is connected to the wireless or wired second network communication device 33.
  • the above-mentioned wireless or wired network communication devices may generally refer to signal transmission devices, such as communication base stations, data exchange devices, and the like.
  • the terminal device as the sending end performs multi-channel encoding on the collected multi-channel signal, and then performs channel encoding, and then transmits it in a digital channel through a wireless network or a core network.
  • the terminal device as the receiving end performs channel decoding according to the received signal to obtain a multi-channel signal encoded stream, and then restores the multi-channel signal through multi-channel decoding, which is played back by the terminal device as the receiving end.
  • FIG. 3b is a schematic diagram of applying the multi-channel encoder provided by this embodiment of the application to a wireless device or a core network device, wherein the wireless device or core network device 35 includes: a channel decoder 351, another audio decoder 352, a multi-channel encoder 353, and a channel encoder 354; this is similar to the aforementioned FIG. 2b and is not repeated here.
  • FIG. 3c is a schematic diagram of applying the multi-channel decoder provided by this embodiment of the application to a wireless device or a core network device, where the wireless device or core network device 35 includes: a channel decoder 351, a multi-channel decoder 355, another audio encoder 356, and a channel encoder 354; this is similar to the aforementioned FIG. 2c and is not repeated here.
  • the audio encoding process may be a part of the multi-channel encoder, and the audio decoding process may be a part of the multi-channel decoder.
  • for example, performing multi-channel encoding on the collected multi-channel signal may consist of processing the collected multi-channel signal to obtain an audio signal and then encoding the obtained audio signal according to the method provided in the embodiments of the present application; the decoding end decodes the multi-channel signal code stream to obtain the audio signal and then recovers the multi-channel signal.
  • therefore, the embodiments of the present application can also be applied to multi-channel encoders and multi-channel decoders in terminal devices, wireless devices, and core network devices. In wireless or core network devices, if transcoding is required, corresponding multi-channel encoding and decoding processing is required.
  • the audio encoding and decoding methods provided in the embodiments of the present application may include an audio encoding method and an audio decoding method, wherein the audio encoding method is performed by an audio encoding apparatus, the audio decoding method is performed by an audio decoding apparatus, and the audio encoding apparatus and the audio decoding apparatus can communicate with each other.
  • the audio encoding method and the audio decoding method provided by the embodiments of the present application are described below based on the aforementioned system architecture, audio encoding device, and audio decoding device.
  • FIG. 4 is a schematic diagram of an interaction flow between an audio encoding device and an audio decoding device in an embodiment of the present application, wherein the following steps 401 to 403 may be performed by the audio encoding device (hereinafter referred to as the encoding end), and the following steps 411 to 413 may be performed by the audio decoding device (hereinafter referred to as the decoding end). The flow mainly includes the following processes:
  • the encoding end obtains the audio signal of the current scene
  • the audio signal of the current scene refers to the audio signal obtained by collecting the sound field at the position of the microphone in the space.
  • the audio signal of the current scene may also be called the original scene audio signal.
  • the audio signal of the current scene may be an audio signal obtained by a higher order ambisonics (higher order ambisonics, HOA) technology.
  • a virtual speaker set may be pre-configured at the encoding end, and the virtual speaker set may include multiple virtual speakers.
  • the scene audio signal may be played back through headphones, or through a plurality of speakers arranged in a room.
  • when speakers are used for playback, the basic method is to superimpose the signals of multiple speakers, so that, under a certain criterion, the sound field at a certain point in space (where the listener is) is as close as possible to the original sound field at the time the scene audio signal was recorded.
  • the virtual speaker is used to calculate the playback signal corresponding to the scene audio signal, the playback signal is used as the transmission signal, and the compressed signal is further generated.
  • a virtual speaker is a speaker that exists virtually in the spatial sound field, and the virtual speaker can realize the playback of the scene audio signal at the encoding end.
  • the virtual speaker set includes multiple virtual speakers, and each virtual speaker in the multiple virtual speakers corresponds to a virtual speaker configuration parameter (configuration parameter for short).
  • the virtual speaker configuration parameters include, but are not limited to, information such as the number of virtual speakers, the HOA order of the virtual speakers, and the position coordinates of the virtual speakers.
  • a preconfigured target virtual speaker selection strategy may be used to select the first target virtual speaker from the preset virtual speaker set.
  • the target virtual speaker selection strategy is a strategy for selecting, from the virtual speaker set, a target virtual speaker matching the current scene audio signal, for example, selecting the first target virtual speaker according to the sound field component that each virtual speaker obtains from the current scene audio signal.
  • for another example, the first target virtual speaker is selected according to the position information of each virtual speaker.
  • the first target virtual speaker is a virtual speaker in the virtual speaker set that is used for playing back the audio signal of the current scene, that is, the encoding end can select from the virtual speaker set a target virtual speaker that can play back the audio signal of the current scene.
  • after the first target virtual speaker is selected, subsequent processing for the first target virtual speaker may be performed, such as subsequent steps 402 to 403.
  • in the embodiments of the present application, not only the first target virtual speaker but also more target virtual speakers can be selected; for example, a second target virtual speaker can also be selected.
  • for the second target virtual speaker, a process similar to subsequent steps 402 to 403 needs to be performed; for details, refer to the description of the subsequent embodiments.
  • the encoding end may also obtain attribute information of the first target virtual speaker, and the attribute information of the first target virtual speaker includes attributes related to the first target virtual speaker.
  • the attribute information can be set according to a specific application scenario, for example, the attribute information of the first target virtual speaker includes: the position information of the first target virtual speaker, or the HOA coefficient of the first target virtual speaker.
  • the position information of the first target virtual speaker may be the spatial distribution position of the first target virtual speaker, or may be the position of the first target virtual speaker relative to other virtual speakers in the virtual speaker set, which is not limited here.
  • Each virtual speaker in the virtual speaker set corresponds to an HOA coefficient, and the HOA coefficient may also be called an ambisonic coefficient. Next, the HOA coefficient corresponding to the virtual speaker will be described.
  • for example, the HOA order can be any order from the 2nd to the 10th, the signal sampling rate when recording the audio signal is 48 to 192 kilohertz (kHz), and the sampling depth is 16 or 24 bits (bit).
  • an HOA signal can be generated from the HOA coefficient of the virtual speaker and the scene audio signal.
  • the HOA signal is characterized by carrying the spatial information of the sound field.
  • the HOA signal describes, with a certain accuracy, the sound field signal at a certain point in space. It is therefore possible to use another representation to describe the sound field signal at that point; if this representation can describe the signal at the spatial position with the same accuracy using less data, the purpose of signal compression is achieved.
  • the spatial sound field can be decomposed into the superposition of multiple plane waves. Therefore, in theory, the sound field expressed by the HOA signal can be re-expressed by the superposition of multiple plane waves, and each plane wave is represented by an audio signal of one channel and a direction vector.
  • the plane wave superposition representation can accurately express the original sound field with a smaller number of channels to achieve the purpose of signal compression.
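  • The plane wave argument above can be written compactly as follows. The notation is an assumption introduced here for illustration and does not appear in the application: \(\mathbf{b}(t)\) is the HOA signal of order \(N\), \(s_k(t)\) is the single-channel audio signal of the \(k\)-th plane wave, and \(\mathbf{y}(\theta_k, \varphi_k)\) is the vector of HOA coefficients for its direction.

```latex
\[
  \mathbf{b}(t) \;\approx\; \sum_{k=1}^{K} \mathbf{y}(\theta_k, \varphi_k)\, s_k(t),
  \qquad
  \mathbf{b}(t) \in \mathbb{R}^{(N+1)^2}
\]
% An order-N HOA signal has (N+1)^2 channels. For N = 3 this is 16 channels,
% whereas K = 2 plane waves need only 2 audio channels plus 2 directions.
```

  • The saving comes from \(K\) being much smaller than \((N+1)^2\) when a few directions dominate the sound field.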
  • the audio encoding method provided by the embodiments of the present application further includes the following steps:
  • A1. Acquire the main sound field components from the audio signal of the current scene according to the virtual speaker set.
  • the main sound field component in step A1 may also be referred to as the first main sound field component.
  • correspondingly, the aforementioned step 401 of selecting the first target virtual speaker from the preset virtual speaker set according to the audio signal of the current scene includes:
  • B1. Select the first target virtual speaker from the virtual speaker set according to the main sound field components.
  • the encoding end obtains a virtual speaker set, and the encoding end uses the virtual speaker set to perform signal decomposition on the audio signal of the current scene, so as to obtain the main sound field components corresponding to the audio signal of the current scene.
  • the main sound field component represents the audio signal corresponding to the main sound field in the audio signal of the current scene.
  • the virtual speaker set includes multiple virtual speakers. According to the multiple virtual speakers, multiple sound field components can be obtained from the audio signal of the current scene. That is, each virtual speaker can obtain one sound field component from the audio signal of the current scene.
  • the main sound field component is selected from the sound field components.
  • the main sound field component may be one or several sound field components with the largest value among the multiple sound field components, or the main sound field component may be one or several sound field components with a dominant direction among the multiple sound field components.
  • Each virtual speaker in the virtual speaker set corresponds to a sound field component, then the first target virtual speaker is selected from the virtual speaker set according to the main sound field components.
  • the virtual speaker corresponding to the main sound field component is the first target virtual speaker selected by the encoding end.
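  • The selection described above (one sound field component per virtual speaker, then the speaker whose component is largest) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses first-order ambisonics (ACN order W, Y, Z, X; SN3D) instead of the higher orders in the application, measures "largest" by frame energy, and the function names are hypothetical.

```python
import math


def foa_coefficients(azimuth, elevation):
    """First-order ambisonic coefficients for a direction (ACN order, SN3D)."""
    return [
        1.0,
        math.sin(azimuth) * math.cos(elevation),
        math.sin(elevation),
        math.cos(azimuth) * math.cos(elevation),
    ]


def select_target_virtual_speaker(hoa_frame, speaker_positions):
    """Pick the virtual speaker whose sound field component has the most energy.

    hoa_frame: 4 channels, each a list of samples.
    speaker_positions: list of (azimuth, elevation) in radians.
    Returns (index of the selected speaker, its virtual speaker signal).
    """
    num_samples = len(hoa_frame[0])
    best_index, best_energy, best_signal = -1, -1.0, []
    for index, (azimuth, elevation) in enumerate(speaker_positions):
        coeffs = foa_coefficients(azimuth, elevation)
        # Linear combination of the HOA channels with this speaker's coefficients.
        component = [
            sum(coeffs[ch] * hoa_frame[ch][n] for ch in range(4))
            for n in range(num_samples)
        ]
        energy = sum(x * x for x in component)
        if energy > best_energy:
            best_index, best_energy, best_signal = index, energy, component
    return best_index, best_signal
```

  • The returned signal is an unnormalized projection, so its scale depends on the coefficient convention; a practical encoder would normalize it.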
  • the encoding end can select the first target virtual speaker through the main sound field components, which solves the problem that the encoding end needs to determine the first target virtual speaker.
  • the encoding end has a variety of ways to select the first target virtual speaker.
  • for example, the encoding end may preset a virtual speaker at a specified position as the first target virtual speaker, that is, according to the position of each virtual speaker in the virtual speaker set, the virtual speaker matching the specified position is selected as the first target virtual speaker.
  • the aforementioned step B1 of selecting the first target virtual speaker from the virtual speaker set according to the main sound field components includes:
  • the HOA coefficient corresponding to the main sound field components is selected from a higher order ambisonics (HOA) coefficient set, where the HOA coefficients in the HOA coefficient set are in one-to-one correspondence with the virtual speakers in the virtual speaker set;
  • the HOA coefficient set is pre-configured in the encoding end according to the virtual speaker set, and there is a one-to-one correspondence between the HOA coefficients in the HOA coefficient set and the virtual speakers in the virtual speaker set. Therefore, after the HOA coefficient is selected according to the main sound field components, the virtual speaker corresponding to that HOA coefficient is looked up in the virtual speaker set according to the one-to-one correspondence, and the virtual speaker found is the first target virtual speaker, which solves the problem of how the encoding end determines the first target virtual speaker.
  • for example, the HOA coefficient set includes HOA coefficient 1, HOA coefficient 2, and HOA coefficient 3, and the virtual speaker set includes virtual speaker 1, virtual speaker 2, and virtual speaker 3, wherein the HOA coefficients in the HOA coefficient set are in one-to-one correspondence with the virtual speakers in the virtual speaker set: HOA coefficient 1 corresponds to virtual speaker 1, HOA coefficient 2 corresponds to virtual speaker 2, and HOA coefficient 3 corresponds to virtual speaker 3. If HOA coefficient 3 is selected from the HOA coefficient set according to the main sound field components, the first target virtual speaker is determined to be virtual speaker 3.
  • the aforementioned step B1 of selecting the first target virtual speaker from the virtual speaker set according to the main sound field components further includes:
  • the encoding end can use the main sound field components to determine the configuration parameters of the first target virtual speaker. For example, the main sound field component is one or several sound field components with the largest value among the multiple sound field components, or one or several sound field components with a dominant direction among the multiple sound field components; the main sound field components can be used to determine the first target virtual speaker matching the audio signal of the current scene, and the first target virtual speaker is configured with corresponding configuration parameters.
  • the HOA coefficient of the first target virtual speaker can be generated from the configuration parameters of the first target virtual speaker; the generation process can be implemented by an HOA algorithm and is not described in detail here.
  • Each virtual speaker in the virtual speaker set has a corresponding HOA coefficient, so the first target virtual speaker can be selected from the virtual speaker set according to the HOA coefficient corresponding to each virtual speaker, which solves the problem of how the encoding end determines the first target virtual speaker.
  • step C1 obtains the configuration parameters of the first target virtual speaker according to the main sound field components, including:
  • the configuration parameter of the first target virtual speaker is selected from the configuration parameters of the plurality of virtual speakers according to the main sound field components.
  • the configuration parameters of multiple virtual speakers can be pre-stored in the audio encoder, and the configuration parameters of each virtual speaker can be determined by the configuration information of the audio encoder.
  • the audio encoder refers to the aforementioned encoding end.
  • The configuration information of the audio encoder includes but is not limited to: the HOA order, the encoding bit rate, and so on.
  • The configuration information of the audio encoder can be used to determine the number of virtual speakers and the position parameters of each virtual speaker, which solves the problem of how the encoder determines the configuration parameters of the virtual speakers. For example, if the encoding bit rate is low, a smaller number of virtual speakers can be configured, and if the encoding bit rate is high, a larger number of virtual speakers can be configured.
  • The HOA order of the virtual speaker may be equal to the HOA order of the audio encoder. Without limitation, in this embodiment of the present application, in addition to being determined according to the configuration information of the audio encoder, the respective configuration parameters of the multiple virtual speakers may also be determined according to user-defined information. For example, a user can customize the positions of the virtual speakers, the HOA order, the number of virtual speakers, and so on.
  • the encoding end obtains the configuration parameters of multiple virtual speakers from the virtual speaker set.
  • each virtual speaker configuration parameter includes but is not limited to: the HOA order of the virtual speaker , the position coordinates of the virtual speaker, etc.
  • the HOA coefficient of each virtual speaker can be generated by using the configuration parameters of the virtual speaker, and the generation process of the HOA coefficient can be realized by the HOA algorithm, which will not be described in detail here.
  • An HOA coefficient is generated for each virtual speaker in the virtual speaker set, and the HOA coefficients configured for all virtual speakers in the virtual speaker set constitute the HOA coefficient set. This solves the problem of how the encoder determines the HOA coefficient of each virtual speaker in the virtual speaker set.
  • the configuration parameters of the first target virtual speaker include: position information and HOA order information of the first target virtual speaker;
  • the aforementioned step C2 generates the HOA coefficient corresponding to the first target virtual speaker according to the configuration parameters of the first target virtual speaker, including:
  • the HOA coefficient corresponding to the first target virtual speaker is determined according to the position information and HOA order information of the first target virtual speaker.
  • the configuration parameters of each virtual speaker in the virtual speaker set may include position information of the virtual speaker and HOA order information of the virtual speaker.
  • the configuration parameters of the first target virtual speaker include: position information and HOA order information of the first target virtual speaker.
  • The position information of each virtual speaker in the virtual speaker set can be determined according to a locally equidistant spatial distribution of the virtual speakers.
  • The spatial distribution, for example the locally equidistant distribution, may be uniform or non-uniform.
  • From the position information and HOA order information of each virtual speaker, the HOA coefficient of the virtual speaker can be generated.
  • The HOA coefficient generation process can be implemented by an HOA algorithm, which solves the problem of how the encoder determines the HOA coefficient of the first target virtual speaker.
  • a set of HOA coefficients is respectively generated for each virtual speaker in the virtual speaker set, and multiple sets of HOA coefficients constitute the aforementioned set of HOA coefficients.
  • the HOA coefficients respectively configured for all virtual speakers in the virtual speaker set constitute the HOA coefficient set, which solves the problem that the encoder needs to determine the HOA coefficient of each virtual speaker in the virtual speaker set.
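  • The application leaves the HOA algorithm unspecified. As a hedged sketch only, the following generates first-order HOA coefficients from a speaker position, assuming the common ACN channel order (W, Y, Z, X) with SN3D normalization; this convention, the function name, and the speaker positions are assumptions made for illustration and are not taken from the application.

```python
import numpy as np


def first_order_hoa_coefficients(azimuth, elevation):
    """First-order real spherical-harmonic gains for one direction (radians).

    Assumed convention: ACN channel order (W, Y, Z, X), SN3D normalization.
    """
    return np.array([
        1.0,                                   # W (order 0)
        np.sin(azimuth) * np.cos(elevation),   # Y
        np.sin(elevation),                     # Z
        np.cos(azimuth) * np.cos(elevation),   # X
    ])


# Hypothetical virtual speaker positions as (azimuth, elevation) in radians.
positions = [(0.0, 0.0), (np.pi / 2, 0.0), (np.pi, 0.0), (0.0, np.pi / 2)]

# One column of HOA coefficients per virtual speaker: shape (M, K) = (4, 4).
hoa_coefficient_set = np.stack(
    [first_order_hoa_coefficients(az, el) for az, el in positions], axis=1
)
```

  • Higher HOA orders would need higher-degree spherical harmonics; the first-order case is used here only because its closed form is short.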
  • The encoding end can play back the current scene audio signal: it generates a first virtual speaker signal according to the current scene audio signal and the attribute information of the first target virtual speaker, and the first virtual speaker signal is the playback signal of the current scene audio signal.
  • The attribute information of the first target virtual speaker describes information related to the attributes of the first target virtual speaker.
  • The first target virtual speaker is a virtual speaker selected by the encoding end that can play back the current scene audio signal. Playing back the current scene audio signal using the attribute information of the first target virtual speaker yields the first virtual speaker signal.
  • The data size of the first virtual speaker signal is independent of the number of channels of the current scene audio signal and depends on the first target virtual speaker.
  • For example, the first virtual speaker signal is represented by fewer channels than the current scene audio signal.
  • For example, the current scene audio signal is a third-order HOA signal, and the HOA signal has 16 channels.
  • The 16 channels may be compressed into 2 channels, that is, the virtual speaker signal generated by the encoding end has 2 channels.
  • For example, the virtual speaker signal generated by the encoding end may include the aforementioned first virtual speaker signal and a second virtual speaker signal. The number of channels of the virtual speaker signal generated by the encoding end is independent of the number of channels of the first scene audio signal. As the subsequent steps show, the code stream can carry a 2-channel first virtual speaker signal.
  • The decoding end receives the code stream, and the virtual speaker signal obtained by decoding the code stream has 2 channels.
  • the 2-channel virtual speaker signal can reconstruct the 16-channel scene audio signal, and ensures that the reconstructed scene audio signal has the same subjective and objective quality when compared with the original scene audio signal.
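  • The channel counts in this example follow from the standard relation that an Nth-order HOA signal has (N + 1)² channels. A minimal sketch of the arithmetic, in which the function name is hypothetical:

```python
def hoa_channel_count(order):
    """Number of channels of an HOA signal of the given order: (order + 1) squared."""
    return (order + 1) ** 2


scene_channels = hoa_channel_count(3)      # third-order HOA signal
virtual_speaker_channels = 2               # channels carried in the code stream
channel_reduction = scene_channels / virtual_speaker_channels
```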
  • Steps 401 and 402 may be specifically implemented by a spatial encoder, for example a moving picture experts group (MPEG) spatial encoder.
  • the current scene audio signal may include: the HOA signal to be encoded; the attribute information of the first target virtual speaker includes the HOA coefficient of the first target virtual speaker;
  • Step 402 generates a first virtual speaker signal according to the current scene audio signal and the attribute information of the first target virtual speaker, including:
  • a first virtual speaker signal is obtained by linearly combining the HOA signal to be encoded and the HOA coefficients of the first target virtual speaker.
  • The encoding end first determines the HOA coefficient of the first target virtual speaker. For example, the encoding end selects an HOA coefficient from the HOA coefficient set according to the main sound field component, and the selected HOA coefficient is the HOA coefficient of the first target virtual speaker. After the encoding end obtains the HOA signal to be encoded and the HOA coefficient of the first target virtual speaker, the first virtual speaker signal can be generated from them. The HOA signal to be encoded can be expressed as a linear combination of the HOA coefficients of the first target virtual speaker, so solving for the first virtual speaker signal reduces to solving that linear combination.
  • the attribute information of the first target virtual speaker may include: HOA coefficients of the first target virtual speaker.
  • the encoding end can obtain the HOA coefficient of the first target virtual speaker by decoding the attribute information of the first target virtual speaker.
  • The encoding end linearly combines the HOA signal to be encoded and the HOA coefficient of the first target virtual speaker, that is, it combines them to obtain a linear combination matrix. The encoding end can then find an optimal solution for the linear combination matrix, and the optimal solution obtained is the first virtual speaker signal.
  • The optimal solution depends on the algorithm used to solve the linear combination matrix.
  • The embodiment of the present application thus solves the problem of how the encoding end generates the first virtual speaker signal.
  • the current scene audio signal includes: a higher-order ambisonics (HOA) signal to be encoded; the attribute information of the first target virtual speaker includes position information of the first target virtual speaker;
  • Step 402 generates a first virtual speaker signal according to the current scene audio signal and the attribute information of the first target virtual speaker, including:
  • The attribute information of the first target virtual speaker may include the position information of the first target virtual speaker. The encoding end pre-stores the HOA coefficient and the position information of each virtual speaker in the virtual speaker set, and there is a correspondence between the position information of a virtual speaker and its HOA coefficient, so the encoding end can determine the HOA coefficient of the first target virtual speaker from the position information of the first target virtual speaker. If the attribute information includes the HOA coefficient, the encoding end can obtain the HOA coefficient of the first target virtual speaker by decoding the attribute information of the first target virtual speaker.
  • After obtaining the HOA signal to be encoded and the HOA coefficient of the first target virtual speaker, the encoding end linearly combines them, that is, it combines the HOA signal to be encoded with the HOA coefficient of the first target virtual speaker to obtain a linear combination matrix. The encoding end can then find an optimal solution for the linear combination matrix, and the optimal solution obtained is the first virtual speaker signal.
  • The HOA coefficients of the first target virtual speaker are represented by a matrix $A$, and the matrix $A$ can be used to linearly combine the HOA signal to be encoded.
  • The least squares method can be used to obtain the theoretical optimal solution $w$, which is the first virtual speaker signal. For example, it can be calculated as follows:
  • $$w = A^{-1} X$$
  • where $A^{-1}$ denotes the inverse matrix of the matrix $A$; the size of the matrix $A$ is $(M \times C)$, $C$ is the number of first target virtual speakers, $M$ is the number of channels of the Nth-order HOA coefficients, and $a$ denotes the HOA coefficients of the first target virtual speaker, for example,
  • $$A = \begin{bmatrix} a_{11} & \cdots & a_{1C} \\ \vdots & \ddots & \vdots \\ a_{M1} & \cdots & a_{MC} \end{bmatrix}$$
  • $X$ denotes the HOA signal to be encoded; the size of the matrix $X$ is $(M \times L)$, $M$ is the number of channels of the Nth-order HOA coefficients, $L$ is the number of sampling points, and $x$ denotes the coefficients of the HOA signal to be encoded, for example,
  • $$X = \begin{bmatrix} x_{11} & \cdots & x_{1L} \\ \vdots & \ddots & \vdots \\ x_{M1} & \cdots & x_{ML} \end{bmatrix}$$
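  • The least-squares solution described above can be sketched numerically as follows. Because the matrix A is generally not square (M is not equal to C), the sketch reads the inverse of A in the least-squares sense, that is, as the Moore-Penrose pseudo-inverse; this reading, the random test data, and the frame length of 960 samples are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

M, C, L = 16, 2, 960                 # HOA channels, target speakers, samples per frame
A = rng.standard_normal((M, C))      # HOA coefficients, one column per target speaker
w_true = rng.standard_normal((C, L))
X = A @ w_true                       # synthetic HOA signal that A can represent exactly

# Least-squares solution of A @ w = X; w has shape (C, L).
w, residuals, rank, singular_values = np.linalg.lstsq(A, X, rcond=None)

# Equivalent closed form using the pseudo-inverse of A.
w_pinv = np.linalg.pinv(A) @ X

# Reconstructing the M-channel scene signal from the C-channel speaker signal.
X_reconstructed = A @ w
```

  • In this synthetic case the reconstruction is exact because X was built from A; for a real scene signal the least-squares solution only minimizes the reconstruction error.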
  • the encoding end may encode the first virtual speaker signal to obtain a code stream.
  • the encoding end may specifically be a core encoder, and the core encoder encodes the first virtual speaker signal to obtain a code stream.
  • the code stream can also be referred to as an audio signal encoding code stream.
  • the encoding end encodes the first virtual speaker signal, but no longer encodes the scene audio signal.
  • The first target virtual speaker is selected so that the sound field at the listener's position in space is as close as possible to the original sound field when the scene audio signal was recorded, which ensures the encoding quality of the encoding end. In addition, the amount of encoded data of the first virtual speaker signal is independent of the number of channels of the scene audio signal, which reduces the amount of data of the encoded scene audio signal and improves encoding and decoding efficiency.
  • the audio encoding method provided by the embodiments of the present application further includes the following steps:
  • the attribute information of the first target virtual speaker is encoded and written into the code stream.
  • the encoding end can also encode the attribute information of the first target virtual speaker, and write the encoded attribute information of the first target virtual speaker into the code stream.
  • The code stream may include: the encoded first virtual speaker signal and the encoded attribute information of the first target virtual speaker.
  • the encoded attribute information of the first target virtual speaker can be carried in the code stream, so that the decoding end can determine the attribute information of the first target virtual speaker by decoding the code stream, which is convenient for audio decoding at the decoding end.
  • The foregoing describes the process of generating the first virtual speaker signal based on the first target virtual speaker and encoding the first virtual speaker signal. Without limitation, in this embodiment of the present application the encoding end may select not only the first target virtual speaker but also more target virtual speakers, for example a second target virtual speaker. For the second target virtual speaker, a process similar to the aforementioned steps 402 to 403 also needs to be performed, as described in detail below.
  • the audio encoding method provided by the embodiments of the present application further includes:
  • D1 select the second target virtual speaker from the virtual speaker set according to the first scene audio signal
  • Step D1 is similar to the foregoing step 401, and the second target virtual speaker is another target virtual speaker selected by the encoder that is different from the first target virtual speaker.
  • the first scene audio signal is the original scene audio signal to be encoded
  • the second target virtual speaker may be a certain virtual speaker in the virtual speaker set.
  • A preconfigured target virtual speaker selection strategy may be used to select the second target virtual speaker from a preset virtual speaker set.
  • the target virtual speaker selection strategy is a strategy for selecting a target virtual speaker matching the audio signal of the first scene from the virtual speaker set, for example, selecting the second target virtual speaker according to the sound field components obtained by each virtual speaker from the audio signal of the first scene .
  • the audio coding method provided by the embodiments of the present application further includes the following steps:
  • step D1 selects the second target virtual speaker from the preset virtual speaker set according to the audio signal of the first scenario, including:
  • the encoding end obtains a virtual speaker set, and the encoding end uses the virtual speaker set to perform signal decomposition on the audio signal of the first scene, so as to obtain the second main sound field component corresponding to the audio signal of the first scene.
  • the second main sound field component represents the audio signal corresponding to the main sound field in the audio signal of the first scene.
  • The virtual speaker set includes multiple virtual speakers. Multiple sound field components can be obtained from the audio signal of the first scene according to the multiple virtual speakers, that is, each virtual speaker can obtain one sound field component from the audio signal of the first scene. Next, the second main sound field component is selected from the multiple sound field components.
  • The second main sound field component may be the one or several sound field components with the largest values among the multiple sound field components, or the one or several sound field components whose direction is dominant among the multiple sound field components.
  • the second target virtual speaker is selected from the virtual speaker set according to the second main sound field component.
  • the virtual speaker corresponding to the second main sound field component is the second target virtual speaker selected by the encoding end.
  • the encoding end can select the second target virtual speaker through the main sound field components, which solves the problem that the encoding end needs to determine the second target virtual speaker.
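  • The selection of main sound field components can be sketched as follows. The application does not fix the decomposition or the measure of "largest value"; this sketch assumes that a sound field component is the inner product of a speaker's HOA coefficients with the HOA signal and that components are ranked by energy. The function name and the toy data are hypothetical.

```python
import numpy as np


def select_target_speakers(hoa_signal, hoa_coefficient_set, count=1):
    """Return indices of the `count` virtual speakers with the strongest components.

    hoa_signal:          (M, L) array, M HOA channels by L samples.
    hoa_coefficient_set: (M, K) array, one column of HOA coefficients per speaker.
    """
    # Assumed decomposition: project the signal onto each speaker's coefficients.
    components = hoa_coefficient_set.T @ hoa_signal      # (K, L) sound field components
    energy = np.sum(components ** 2, axis=1)             # assumed measure of "largest value"
    ranked = np.argsort(energy)[::-1]                    # strongest component first
    return ranked[:count].tolist()


# Toy data: 4 hypothetical speakers with orthogonal coefficients; the scene is
# dominated by speaker 2, with a weaker contribution from speaker 0.
coefficients = np.eye(4)
tone = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))
scene = np.outer(coefficients[:, 2], tone) + 0.1 * np.outer(coefficients[:, 0], tone)

main_speakers = select_target_speakers(scene, coefficients, count=2)
```

  • With `count=2`, the first index plays the role of the first target virtual speaker and the second index the role of the second target virtual speaker.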
  • the aforementioned step F1 selects the second target virtual speaker from the virtual speaker set according to the second main sound field component, including:
  • the HOA coefficient corresponding to the second main sound field component is selected from the HOA coefficient set, and the HOA coefficient in the HOA coefficient set is in one-to-one correspondence with the virtual speakers in the virtual speaker set;
  • the aforementioned step F1 selects the second target virtual speaker from the virtual speaker set according to the second main sound field component, and further includes:
  • G1 obtain the configuration parameters of the second target virtual speaker according to the second main sound field component
  • G2 generate the HOA coefficient corresponding to the second target virtual speaker according to the configuration parameter of the second target virtual speaker;
  • G3 Determine the virtual speaker corresponding to the HOA coefficient corresponding to the second target virtual speaker in the virtual speaker set as the second target virtual speaker.
  • step G1 obtains the configuration parameters of the second target virtual speaker according to the second main sound field components, including:
  • the configuration parameter of the second target virtual speaker is selected from the configuration parameters of the plurality of virtual speakers according to the second main sound field component.
  • the configuration parameters of the second target virtual speaker include: position information and HOA order information of the second target virtual speaker;
  • the aforementioned step G2 generates the HOA coefficient corresponding to the second target virtual speaker according to the configuration parameters of the second target virtual speaker, including:
  • the HOA coefficient corresponding to the second target virtual speaker is determined according to the position information and HOA order information of the second target virtual speaker.
  • the audio signal of the first scene includes: the HOA signal to be encoded; the attribute information of the second target virtual speaker includes the HOA coefficient of the second target virtual speaker;
  • Step D2 generates a second virtual speaker signal according to the first scene audio signal and the attribute information of the second target virtual speaker, including:
  • the first scene audio signal includes: a higher-order ambisonics (HOA) signal to be encoded; the attribute information of the second target virtual speaker includes position information of the second target virtual speaker;
  • Step D2 generates a second virtual speaker signal according to the first scene audio signal and the attribute information of the second target virtual speaker, including:
  • the encoding end may further perform step D3 to encode the second virtual speaker signal and write the code stream.
  • the encoding method adopted by the encoding end is similar to step 403, so that the code stream can carry the encoding result of the second virtual speaker signal.
  • the audio coding method performed by the coding end may further include the following steps:
  • I1. Perform alignment processing on the first virtual speaker signal and the second virtual speaker signal to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal.
  • step D3 to encode the second virtual speaker signal includes:
  • step 403 encodes the first virtual speaker signal, including:
  • the aligned first virtual speaker signal is encoded.
  • The encoding end can generate a first virtual speaker signal and a second virtual speaker signal, and can perform alignment processing on them to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal.
  • An example is as follows. There are two virtual speaker signals. The channel order of the virtual speaker signals of the current frame is 1, 2, corresponding to the virtual speaker signals generated by the target virtual speakers P1 and P2 respectively. The channel order of the virtual speaker signals of the previous frame is 1, 2, corresponding to the virtual speaker signals generated by the target virtual speakers P2 and P1 respectively.
  • The channel order of the virtual speaker signals of the current frame can then be adjusted according to the order of the target virtual speakers of the previous frame, for example to 2, 1, so that the virtual speaker signals generated by the same target virtual speaker are on the same channel.
  • After the aligned first virtual speaker signal is obtained, the encoding end can encode it.
  • Alignment enhances the inter-channel correlation, which benefits the encoding of the first virtual speaker signal by the core encoder.
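  • The alignment example with the target virtual speakers P1 and P2 can be sketched as follows. The function name and the string placeholders for the signals are hypothetical, and the sketch covers only the simple case in which speakers shared with the previous frame are placed first, in the previous frame's order.

```python
def align_channels(previous_speakers, current_speakers, current_signals):
    """Reorder the current frame to follow the previous frame's speaker order.

    Speakers shared with the previous frame come first, in the previous frame's
    order; speakers that are new in the current frame are appended afterwards.
    """
    signal_of = dict(zip(current_speakers, current_signals))
    kept = [s for s in previous_speakers if s in signal_of]
    new = [s for s in current_speakers if s not in previous_speakers]
    order = kept + new
    return order, [signal_of[s] for s in order]


# Previous frame: channels 1, 2 carry P2, P1. Current frame: channels 1, 2 carry P1, P2.
previous = ["P2", "P1"]
current = ["P1", "P2"]
signals = ["signal from P1", "signal from P2"]   # placeholders for the channel data

aligned_speakers, aligned_signals = align_channels(previous, current, signals)
```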
  • the audio encoding method provided by the embodiments of the present application further includes:
  • D1 select the second target virtual speaker from the virtual speaker set according to the first scene audio signal
  • step 403 encodes the first virtual speaker signal, including:
  • J1 obtain a downmix signal and side information according to the first virtual speaker signal and the second virtual speaker signal, and the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal;
  • J2. Encode the downmix signal and side information.
  • The encoding end may further perform downmix processing on the first virtual speaker signal and the second virtual speaker signal to generate a downmix signal, for example by performing amplitude downmix processing on the first virtual speaker signal and the second virtual speaker signal.
  • In addition, side information can be generated according to the first virtual speaker signal and the second virtual speaker signal. The side information indicates the relationship between the first virtual speaker signal and the second virtual speaker signal, and can be used by the decoding end to upmix the downmix signal so as to recover the first virtual speaker signal and the second virtual speaker signal.
  • the side information includes signal information loss analysis parameters, so that the decoding end recovers the first virtual speaker signal and the second virtual speaker signal through the signal information loss analysis parameters.
  • The side information may specifically be a correlation parameter between the first virtual speaker signal and the second virtual speaker signal, for example an energy ratio parameter of the first virtual speaker signal and the second virtual speaker signal, so that the decoding end recovers the first virtual speaker signal and the second virtual speaker signal through the correlation parameter or energy ratio parameter.
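  • A hedged sketch of a downmix with an energy ratio parameter as side information follows. The application does not define the downmix or upmix rules; averaging the two signals and redistributing the downmix by square-root gains are assumptions made for illustration, and all names are hypothetical.

```python
import numpy as np


def downmix_with_side_info(first, second, eps=1e-12):
    """Average two virtual speaker signals and compute their energy ratio."""
    downmix = 0.5 * (first + second)               # assumed amplitude downmix
    e1 = float(np.sum(first ** 2))
    e2 = float(np.sum(second ** 2))
    energy_ratio = e1 / (e1 + e2 + eps)            # share of energy in the first signal
    return downmix, energy_ratio


def upmix(downmix, energy_ratio):
    """Split the downmix into two signals whose energies follow the transmitted ratio."""
    gain_first = np.sqrt(energy_ratio)
    gain_second = np.sqrt(1.0 - energy_ratio)
    return gain_first * downmix, gain_second * downmix


first = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))
second = 0.5 * first
downmix, energy_ratio = downmix_with_side_info(first, second)
recovered_first, recovered_second = upmix(downmix, energy_ratio)
```

  • This upmix restores only the energy split between the two signals, not their waveforms or absolute levels; a practical codec would carry richer side information.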
  • the encoding end may also perform the following steps:
  • I1. Perform alignment processing on the first virtual speaker signal and the second virtual speaker signal to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal.
  • step J1 obtains the downmix signal and side information according to the first virtual speaker signal and the second virtual speaker signal, including:
  • the side information is used to indicate the relationship between the aligned first virtual speaker signal and the aligned second virtual speaker signal.
  • the encoding end may first perform an alignment operation of the virtual speaker signal, and then generate the downmix signal and side information after completing the alignment operation.
  • the inter-channel correlation is enhanced, which is beneficial to the encoding processing of the first virtual speaker signal by the core encoder.
  • The audio signal of the second scene can be obtained from the first virtual speaker signal and the second virtual speaker signal before alignment, or from the aligned first virtual speaker signal and the aligned second virtual speaker signal. Which of the two is used depends on the application scenario and is not limited here.
  • Before step D1 selects the second target virtual speaker from the virtual speaker set according to the audio signal of the first scene, the audio signal encoding method provided by the embodiment of the present application further includes:
  • K1 according to the coding rate and/or the signal type information of the audio signal of the first scene, determine whether it is necessary to acquire target virtual speakers other than the first target virtual speaker;
  • the second target virtual speaker is selected from the virtual speaker set according to the audio signal of the first scene.
  • The encoder can also perform signal selection to determine whether to acquire the second target virtual speaker.
  • When the second target virtual speaker needs to be acquired, the encoder can generate the second virtual speaker signal.
  • When the second target virtual speaker does not need to be acquired, the encoding end may not generate the second virtual speaker signal.
  • The encoder may decide, according to the configuration information of the audio encoder and/or the signal type information of the audio signal of the first scene, whether another target virtual speaker needs to be selected in addition to the first target virtual speaker. For example, if the encoding rate is higher than a preset threshold, it is determined that the target virtual speakers corresponding to two main sound field components need to be acquired, so a second target virtual speaker is determined in addition to the first target virtual speaker.
  • For another example, if it is determined from the signal type information of the audio signal of the first scene that the target virtual speakers corresponding to two main sound field components with dominant sound source directions need to be acquired, a second target virtual speaker is determined in addition to the first target virtual speaker.
  • Conversely, if it is determined according to the encoding rate and/or the signal type information of the audio signal of the first scene that only one target virtual speaker needs to be acquired, then after the first target virtual speaker is determined, no target virtual speaker other than the first target virtual speaker is acquired.
  • the amount of data encoded by the encoding end can be reduced, and the encoding efficiency can be improved.
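  • The decision in step K1 can be sketched as follows. The threshold of 128 kbit/s, the use of a dominant-source count as a stand-in for the signal type information, and the function name are all hypothetical; the application states only that the decision depends on the encoding rate and/or the signal type information.

```python
def needs_second_target_speaker(bit_rate_bps, dominant_source_count,
                                rate_threshold_bps=128_000):
    """Decide whether a target virtual speaker beyond the first should be acquired.

    bit_rate_bps:          encoding rate in bits per second.
    dominant_source_count: hypothetical stand-in for the signal type information,
                           the number of main sound field components with a
                           dominant sound source direction.
    """
    rate_allows_two = bit_rate_bps > rate_threshold_bps
    signal_needs_two = dominant_source_count >= 2
    return rate_allows_two or signal_needs_two


high_rate_decision = needs_second_target_speaker(256_000, dominant_source_count=1)
low_rate_decision = needs_second_target_speaker(64_000, dominant_source_count=1)
two_source_decision = needs_second_target_speaker(64_000, dominant_source_count=2)
```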
  • When the encoding end performs signal selection, it can determine whether the second virtual speaker signal needs to be generated. Because signal selection at the encoding end causes information loss, signal compensation needs to be performed for the virtual speaker signal that is not transmitted.
  • Signal compensation may use, without limitation, information loss analysis, energy compensation, envelope compensation, noise compensation, and so on.
  • the compensation method can choose linear compensation or nonlinear compensation. After signal compensation, side information can be generated, and the side information can be written into the code stream, so that the decoding end can obtain the side information through the code stream, and the decoding end can perform signal compensation according to the side information, thereby improving the quality of the decoded signal at the decoding end.
  • A first virtual speaker signal may be generated according to the first scene audio signal and the attribute information of the first target virtual speaker, and the audio encoding end encodes the first virtual speaker signal and no longer directly encodes the first scene audio signal.
  • The first target virtual speaker is selected according to the first scene audio signal, and the first virtual speaker signal generated based on the first target virtual speaker can represent the sound field at the listener's position in space, which is as close as possible to the original sound field when the first scene audio signal was recorded. This ensures the encoding quality of the audio encoding end. The first virtual speaker signal and the residual signal are encoded to obtain the code stream.
  • the amount of encoded data of the first virtual speaker signal is related to the first target virtual speaker, but not related to the number of channels of the audio signal of the first scene, which reduces the amount of encoded data and improves encoding efficiency.
  • the encoding end encodes the virtual speaker signal to generate a code stream. Then the encoding end can output the code stream and send it to the decoding end through the audio transmission channel. The decoding end performs subsequent steps 411 to 413 .
  • the decoding end receives the code stream from the encoding end.
  • The code stream may carry the encoded first virtual speaker signal. Without limitation, the code stream may also carry the encoded attribute information of the first target virtual speaker. It should be noted that the attribute information of the first target virtual speaker may not be carried in the code stream, in which case the decoding end can determine the attribute information of the first target virtual speaker through pre-configuration.
  • When the encoding end generates the second virtual speaker signal, the code stream may also carry the second virtual speaker signal. Without limitation, the code stream may also carry the encoded attribute information of the second target virtual speaker. It should be noted that the attribute information of the second target virtual speaker may not be carried in the code stream, in which case the decoding end can determine the attribute information of the second target virtual speaker through pre-configuration.
  • the decoding end decodes the code stream after receiving the code stream from the encoding end, and obtains the virtual speaker signal from the code stream.
  • the virtual speaker signal may specifically be the aforementioned first virtual speaker signal, and may also be the aforementioned first virtual speaker signal and second virtual speaker signal, which is not limited here.
  • the audio decoding method provided by the embodiments of the present application further includes the following steps:
  • the encoding end can also encode the attribute information of the target virtual speaker, and write the encoded attribute information of the target virtual speaker into the code stream.
  • Decode the code stream to obtain the attribute information of the target virtual speaker. In the embodiments of the present application, the encoded attribute information of the first target virtual speaker can be carried in the code stream, so that the decoding end can determine the attribute information of the first target virtual speaker by decoding the code stream, which facilitates audio decoding at the decoding end.
  • the decoding end may acquire attribute information of a target virtual speaker, where the target virtual speaker is a virtual speaker in the virtual speaker set used for playing back the reconstructed scene audio signal.
  • the attribute information of the target virtual speaker may include position information of the target virtual speaker and HOA coefficients of the target virtual speaker.
  • the attribute information of the target virtual speaker includes the HOA coefficient of the target virtual speaker
  • Step 413 obtains the reconstructed scene audio signal according to the attribute information of the target virtual speaker and the virtual speaker signal, including:
  • the virtual speaker signal and the HOA coefficients of the target virtual speaker are synthesized to obtain the reconstructed scene audio signal.
  • The decoding end first determines the HOA coefficients of the target virtual speaker; for example, the decoding end can store the HOA coefficients of the target virtual speaker in advance. After obtaining the virtual speaker signal and the HOA coefficients of the target virtual speaker, the decoding end can obtain the reconstructed scene audio signal from them, thereby improving the quality of the reconstructed scene audio signal.
  • The HOA coefficients of the target virtual speakers are represented by a matrix A' of size (M × C), where C is the number of target virtual speakers and M is the number of channels of the Nth-order HOA coefficients.
  • The virtual speaker signal is represented by a matrix W' of size (C × L), where L is the number of signal sampling points.
  • The reconstructed HOA signal is obtained by the following formula:
    $$H = A'W'$$
  • The H obtained by the above formula is the reconstructed HOA signal.
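  • The synthesis step can be sketched as a plain matrix product, reading the (M × C) and (C × L) shapes above as H = A'W'. This is a minimal illustration, not the codec's implementation; the function name `reconstruct_hoa` and the toy first-order values (M = 4, C = 2, L = 3) are assumptions made for the example.

```python
# Hypothetical helper: reconstruct an HOA signal as H = A' W'.
def reconstruct_hoa(a_prime, w_prime):
    """Multiply an (M x C) coefficient matrix by a (C x L) speaker-signal matrix."""
    num_speakers = len(w_prime)
    if any(len(row) != num_speakers for row in a_prime):
        raise ValueError("A' must have one column per target virtual speaker")
    num_samples = len(w_prime[0])
    return [
        [sum(row[c] * w_prime[c][l] for c in range(num_speakers))
         for l in range(num_samples)]
        for row in a_prime
    ]


# Toy values (assumed): M = 4 HOA channels, C = 2 speakers, L = 3 samples.
A_PRIME = [[1.0, 1.0], [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_PRIME = [[1.0, -0.5, 0.25], [0.2, 0.4, -0.1]]
H = reconstruct_hoa(A_PRIME, W_PRIME)  # M rows, L columns
```

  • Each row of H is one reconstructed HOA channel, so the decoder recovers M channels from only C transmitted speaker signals.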
  • the attribute information of the target virtual speaker includes location information of the target virtual speaker
  • Step 413 obtains the reconstructed scene audio signal according to the attribute information of the target virtual speaker and the virtual speaker signal, including:
  • The HOA coefficients of the target virtual speaker are determined according to the position information of the target virtual speaker, and the virtual speaker signal and the HOA coefficients of the target virtual speaker are synthesized to obtain the reconstructed scene audio signal.
  • The attribute information of the target virtual speaker may include the position information of the target virtual speaker.
  • The decoding end pre-stores the HOA coefficients of each virtual speaker in the virtual speaker set, and also stores the position information of each virtual speaker. The decoding end can then use the correspondence between position information and HOA coefficients to determine the HOA coefficients corresponding to the position information of the target virtual speaker, or it can calculate the HOA coefficients of the target virtual speaker from that position information. Either way, the decoding end can determine the HOA coefficients of the target virtual speaker from its position information, which solves the problem of how the decoding end obtains the HOA coefficients of the target virtual speaker.
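  • The two options described above (look up pre-stored coefficients by position, or compute them from the position) can be sketched as follows. The sketch is limited to first order and assumes the common ACN channel order (W, Y, Z, X) with SN3D scaling, which the text does not specify; `first_order_hoa`, `STORED` and `hoa_for_position` are hypothetical names.

```python
import math


def first_order_hoa(azimuth, elevation):
    """First-order coefficients for a direction given in radians.

    Assumed convention: ACN order (W, Y, Z, X) with SN3D scaling.
    """
    return [
        1.0,
        math.sin(azimuth) * math.cos(elevation),
        math.sin(elevation),
        math.cos(azimuth) * math.cos(elevation),
    ]


# Hypothetical pre-stored table: (azimuth, elevation) in degrees -> coefficients.
STORED = {
    (0, 0): first_order_hoa(0.0, 0.0),
    (90, 0): first_order_hoa(math.pi / 2, 0.0),
}


def hoa_for_position(position_deg, table=STORED):
    """Use the stored correspondence if present, otherwise compute the coefficients."""
    if position_deg in table:
        return table[position_deg]
    azimuth, elevation = (math.radians(v) for v in position_deg)
    return first_order_hoa(azimuth, elevation)
```

  • With a table lookup the code stream only needs to carry a position (or an index), not the coefficients themselves.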
  • the virtual speaker signal is a downmix signal obtained by downmixing the first virtual speaker signal and the second virtual speaker signal.
  • the audio decoding method provided by the embodiment of the present application further includes:
  • decoding the code stream to obtain side information, where the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal;
  • the first virtual speaker signal and the second virtual speaker signal are obtained according to the side information and the downmix signal.
  • The relationship between the first virtual speaker signal and the second virtual speaker signal may be a direct relationship or an indirect relationship.
  • For example, when the relationship is direct, the side information may include a correlation parameter between the first virtual speaker signal and the second virtual speaker signal, such as the energy ratio of the two signals.
  • When the relationship is indirect, the side information may include a correlation parameter between the first virtual speaker signal and the downmix signal and a correlation parameter between the second virtual speaker signal and the downmix signal, such as the energy ratio between the first virtual speaker signal and the downmix signal and the energy ratio between the second virtual speaker signal and the downmix signal.
  • When the relationship is direct, the decoder may determine the first virtual speaker signal and the second virtual speaker signal according to the downmix signal and the direct relationship; when the relationship is indirect, the decoder may determine them according to the downmix signal and the indirect relationship.
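  • A minimal sketch of the indirect case, in which the side information holds the energy ratio of each virtual speaker signal to the downmix. The averaging downmix and the square-root rescaling at the decoder are assumptions chosen for illustration; the text does not fix either, and all function names are hypothetical.

```python
import math


def energy(signal):
    return sum(x * x for x in signal)


def downmix_with_side_info(first, second):
    """Average two speaker signals; side info = energy ratio of each to the downmix."""
    downmix = [(a + b) / 2.0 for a, b in zip(first, second)]
    e_down = energy(downmix) or 1e-12  # guard against an all-zero downmix
    side_info = (energy(first) / e_down, energy(second) / e_down)
    return downmix, side_info


def upmix(downmix, side_info):
    """Approximate both signals by rescaling the downmix to the signalled energies."""
    return tuple([math.sqrt(ratio) * x for x in downmix] for ratio in side_info)


# Toy frame (assumed values): the second signal is a scaled copy of the first.
FIRST = [1.0, -1.0, 0.5]
SECOND = [0.5, -0.5, 0.25]
DOWNMIX, SIDE_INFO = downmix_with_side_info(FIRST, SECOND)
REC_FIRST, REC_SECOND = upmix(DOWNMIX, SIDE_INFO)
```

  • For fully correlated inputs such as this toy frame the rescaling recovers both waveforms; for weakly correlated inputs it restores only the energies, which is why real codecs pair it with further parameters.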
  • step 413 obtains the reconstructed scene audio signal according to the attribute information of the target virtual speaker and the virtual speaker signal, including:
  • the reconstructed scene audio signal is obtained according to the attribute information of the target virtual speaker, the first virtual speaker signal and the second virtual speaker signal.
  • When the encoder performs downmix processing on the first virtual speaker signal and the second virtual speaker signal, it generates a downmix signal. The encoder can also perform signal compensation for the downmix signal to generate side information, which is written into the code stream. The decoding end can obtain the side information from the code stream and perform signal compensation according to it to obtain the first virtual speaker signal and the second virtual speaker signal. Signal reconstruction can therefore use the first virtual speaker signal, the second virtual speaker signal, and the attribute information of the target virtual speaker.
  • The virtual speaker signal can be decoded from the code stream and used as the playback signal of the scene audio signal, and the reconstructed scene audio signal is obtained from the attribute information of the target virtual speaker and the virtual speaker signal.
  • The code stream carries the virtual speaker signal and the residual signal, which reduces the amount of data to be decoded and improves decoding efficiency.
  • the first virtual speaker signal is represented by fewer channels than the audio signal of the first scene.
  • For example, if the first scene audio signal is a third-order HOA signal, it has 16 channels. In this embodiment of the present application the 16 channels may be compressed into 2 channels, that is, the virtual speaker signal generated by the encoder has 2 channels.
  • the virtual speaker signal generated by the encoder may include the aforementioned first channel.
  • the number of channels of the virtual speaker signal generated by the encoding end is independent of the number of channels of the audio signal of the first scene. It can be seen from the description of the subsequent steps that the code stream can carry virtual speaker signals of 2 channels.
  • the decoding end receives the code stream, and the virtual speaker signal obtained by decoding the code stream is 2 channels.
  • From the 2-channel virtual speaker signal, the decoding end can reconstruct the 16-channel scene audio signal, and the reconstructed scene audio signal is ensured to have the same subjective and objective quality as the original scene audio signal.
  • Taking an HOA signal as the scene audio signal as an example, assume that the sound wave propagates in an ideal medium, with wave number $k = w/c$ and angular frequency $w = 2\pi f$, where $f$ is the sound wave frequency and $c$ is the speed of sound.
  • The sound pressure $p$ then satisfies the following formula, where $\nabla^2$ is the Laplace operator:
    $$\nabla^2 p + k^2 p = 0$$
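  • Solving this equation in spherical coordinates for a source-free spherical region gives:
    $$p(r, \theta, \varphi, k) = s \sum_{m=0}^{\infty} (2m+1)\, j^m j_m(kr) \sum_{0 \le n \le m,\ \sigma = \pm 1} Y_{m,n}^{\sigma}(\theta_s, \varphi_s)\, Y_{m,n}^{\sigma}(\theta, \varphi)$$
  • Here $r$ is the radius of the sphere, $\theta$ the horizontal angle, $\varphi$ the elevation angle, $k$ the wave number, $s$ the amplitude of the ideal plane wave, and $m$ the HOA order index.
  • $j^m j_m(kr)$ is the spherical Bessel term, also known as the radial basis function, where the first $j$ is the imaginary unit. The factor $(2m+1)\, j^m j_m(kr)$ does not vary with angle.
  • $Y_{m,n}^{\sigma}(\theta, \varphi)$ is the spherical harmonic in the direction $(\theta, \varphi)$, and $Y_{m,n}^{\sigma}(\theta_s, \varphi_s)$ is the spherical harmonic of the sound source direction.
  • The HOA coefficient can be expressed as:
    $$B_{m,n}^{\sigma} = s \cdot Y_{m,n}^{\sigma}(\theta_s, \varphi_s)$$
  • Substituting this into the expansion gives:
    $$p(r, \theta, \varphi, k) = \sum_{m=0}^{\infty} (2m+1)\, j^m j_m(kr) \sum_{0 \le n \le m,\ \sigma = \pm 1} B_{m,n}^{\sigma}\, Y_{m,n}^{\sigma}(\theta, \varphi)$$
  • The above formula shows that the sound field can be expanded on the spherical surface in spherical harmonics and expressed by the coefficients $B_{m,n}^{\sigma}$. Conversely, if the coefficients $B_{m,n}^{\sigma}$ are known, the sound field can be reconstructed.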
  • When the expansion is truncated at order N, the coefficients are called Nth-order HOA coefficients; HOA coefficients can also be called ambisonic coefficients.
  • The Nth-order HOA coefficients have $(N+1)^2$ channels in total.
  • An ambisonic signal above the first order is also called an HOA signal.
  • The HOA order can be 2 to 6, the signal sampling rate for scene audio recording is 48 to 192 kHz, and the sampling depth is 16 or 24 bits.
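  • To make the data volume concrete, the channel count $(N+1)^2$ and the resulting uncompressed PCM bit rate can be computed directly from the figures above. The function names are illustrative only.

```python
def hoa_channel_count(order):
    """Number of channels of an Nth-order HOA signal: (N + 1) squared."""
    return (order + 1) ** 2


def raw_pcm_bitrate(order, sample_rate_hz, bits_per_sample):
    """Uncompressed bit rate in bits per second for an Nth-order HOA signal."""
    return hoa_channel_count(order) * sample_rate_hz * bits_per_sample


# Third order at 48 kHz and 16 bits: 16 channels, 12,288,000 bit/s.
THIRD_ORDER_CHANNELS = hoa_channel_count(3)
THIRD_ORDER_BITRATE = raw_pcm_bitrate(3, 48_000, 16)
```

  • At the upper end of the stated ranges (order 6, 192 kHz, 24 bits) the same arithmetic gives 49 channels and about 226 Mbit/s, which is the bandwidth problem the scheme addresses.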
  • An HOA signal carries the spatial information of the sound field: it describes, with a certain accuracy, the sound field signal at a point in space. It is therefore possible to consider another representation of the sound field signal at that point. If that representation describes the signal at the point with the same accuracy using less data, signal compression is achieved.
  • The spatial sound field can be decomposed into a superposition of multiple plane waves. The sound field expressed by the HOA signal can therefore be re-expressed as a superposition of plane waves, each represented by one channel of audio signal and a direction vector. If the plane wave superposition expresses the original sound field well with fewer channels, signal compression is achieved.
  • The basic method is to superimpose the sound fields of multiple speakers so that, under some criterion, the sound field at a certain point in space (the listener's position) is as close as possible to the original sound field when the HOA signal was recorded.
  • the embodiment of the present application assumes a virtual speaker array, then calculates a playback signal of the virtual speaker array, uses the playback signal as a transmission signal, and then generates a compressed signal.
  • the decoding end obtains the playback signal by decoding the code stream, and reconstructs the scene audio signal from the playback signal.
  • the embodiments of the present application provide an encoding end suitable for scene audio signal encoding, and a decoding end suitable for scene audio signal decoding.
  • the encoding end encodes the original HOA signal into a compressed code stream, the encoding end sends the compressed code stream to the decoding end, and then the decoding end restores the compressed code stream to the reconstructed HOA signal.
  • the amount of data compressed by the encoding end is as small as possible, or the quality of the HOA signal obtained by the decoding end after reconstruction is higher at the same code rate.
  • The embodiments of the present application can solve the problems of large data volume, high bandwidth occupation, low compression efficiency, and low encoding quality when encoding HOA signals. Since an Nth-order HOA signal has $(N+1)^2$ channels, transmitting the HOA signal directly requires a large bandwidth, so an effective multi-channel coding scheme is needed.
  • The embodiments of the present application adopt a different channel extraction method that places no restriction on the sound source and does not rely on the assumption of a single sound source in the time-frequency domain, so complex scenes such as multi-source signals can be processed more effectively.
  • The codec of the embodiments of the present application provides a spatial encoding and decoding method that represents the original HOA signal with fewer channels. As shown in the corresponding figure, the encoding end includes a spatial encoder and a core encoder: the spatial encoder performs channel extraction on the HOA signal to be encoded to generate a virtual speaker signal, and the core encoder encodes the virtual speaker signal to obtain the code stream, which the encoding end sends to the decoding end.
  • As shown in the corresponding figure, the decoding end includes a core decoder and a spatial decoder: the core decoder first receives the code stream from the encoding end and decodes the virtual speaker signal from it, and the spatial decoder then reconstructs the HOA signal from the virtual speaker signal.
  • The encoding end may include: a virtual speaker configuration unit, an encoding analysis unit, a virtual speaker set generation unit, a virtual speaker selection unit, a virtual speaker signal generation unit, and a core encoder processing unit.
  • The encoder shown in FIG. 7 can generate one virtual speaker signal or multiple virtual speaker signals. Multiple virtual speaker signals can be generated by running the encoder structure shown in FIG. 7 multiple times; the generation of a single virtual speaker signal is described next as an example.
  • the virtual speaker configuration unit is used to configure the virtual speakers in the virtual speaker set to obtain multiple virtual speakers.
  • the virtual speaker configuration unit outputs virtual speaker configuration parameters according to the encoder configuration information.
  • the encoder configuration information includes but is not limited to: HOA order, encoding bit rate, user-defined information, etc.
  • The virtual speaker configuration parameters include but are not limited to: the number of virtual speakers, the HOA order of the virtual speakers, and the position coordinates of the virtual speakers.
  • the virtual speaker configuration parameters output by the virtual speaker configuration unit are used as input to the virtual speaker set generation unit.
  • The encoding analysis unit is used to perform encoding analysis on the HOA signal to be encoded, for example analyzing its sound field distribution, including characteristics such as the number of sound sources, directivity, and dispersion. The result serves as one of the criteria for deciding how to select the target virtual speaker.
  • the encoding end may not include an encoding analysis unit, that is, the encoding end may not analyze the input signal, and a default configuration is used to determine how to select the target virtual speaker.
  • The encoding end obtains the HOA signal to be encoded; for example, an HOA signal recorded by an actual acquisition device or an HOA signal synthesized from artificial audio objects can be used as the input of the encoder. The HOA signal input to the encoder can be a time-domain HOA signal or a frequency-domain HOA signal.
  • The virtual speaker set generation unit is configured to generate a virtual speaker set. The virtual speaker set may include a plurality of virtual speakers, and the virtual speakers in the set may also be referred to as "candidate virtual speakers".
  • The virtual speaker set generation unit generates the HOA coefficients of the specified candidate virtual speakers. Generating the HOA coefficients of a candidate virtual speaker requires its coordinates (that is, its position coordinates or position information) and its HOA order.
  • Methods for determining the coordinates of the candidate virtual speakers include but are not limited to generating K virtual speakers according to an equidistance rule, or generating K non-uniformly distributed candidate virtual speakers according to the principles of auditory perception. The following is an example of a method for generating a fixed number of uniformly distributed virtual speakers.
  • The coordinates of uniformly distributed candidate virtual speakers are generated according to the number of candidate virtual speakers; for example, a numerical iterative method is used to produce an approximately uniform speaker arrangement.
  • FIG. 8 is a schematic diagram of virtual speakers approximately uniformly distributed on a sphere. Assume that some particles are distributed on the unit sphere and that an inverse-square repulsive force acts between them, similar to the electrostatic repulsion between like charges. If the particles move freely under this repulsion, their distribution can be expected to become uniform once a steady state is reached. In the calculation, the physical laws are simplified so that the distance a particle moves is directly equal to the force on it. For the i-th particle, the distance moved in one iteration step, that is, the virtual force it receives, is given by:
    $$\vec{D}_i = \vec{F}_i = \sum_{j \ne i} \frac{k}{r_{ij}^2}\, \hat{r}_{ij}$$
  • Here $\vec{D}_i$ is the step vector, $\vec{F}_i$ the force vector, $r_{ij}$ the distance between the i-th and j-th particles, and $\hat{r}_{ij}$ the unit direction vector from the j-th particle to the i-th particle. The parameter $k$ controls the size of a single step, and the initial positions of the particles can be specified randomly.
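  • The repulsion iteration can be sketched as below. Each particle is moved by k times the sum of inverse-square repulsions from the others and then projected back onto the unit sphere. The re-projection step, the iteration count, the value of k and the seeded random start are assumptions for this sketch and are not taken from the text.

```python
import math
import random


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def repel_step(points, k):
    """Move every point by k * (sum of inverse-square repulsions), then re-project."""
    moved = []
    for i, p in enumerate(points):
        force = [0.0, 0.0, 0.0]
        for j, q in enumerate(points):
            if i == j:
                continue
            diff = [a - b for a, b in zip(p, q)]
            dist = math.sqrt(sum(d * d for d in diff))
            # Unit direction (diff / dist) scaled by 1 / dist**2.
            for axis in range(3):
                force[axis] += diff[axis] / dist ** 3
        moved.append(normalize([a + k * f for a, f in zip(p, force)]))
    return moved


def min_distance(points):
    return min(math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])


def approximately_uniform(count, iterations=500, k=0.05, seed=0):
    """Return (random start, relaxed layout) for `count` points on the unit sphere."""
    rng = random.Random(seed)
    start = [normalize([rng.gauss(0, 1) for _ in range(3)]) for _ in range(count)]
    points = start
    for _ in range(iterations):
        points = repel_step(points, k)
    return start, points
```

  • Calling `approximately_uniform(6)` returns the random start and the relaxed layout; the smallest pairwise distance grows as the points spread out, which is a simple way to monitor convergence.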
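  • Next, the HOA coefficients of the candidate virtual speakers are generated.
  • An ideal plane wave with amplitude $s$ and speaker position coordinates $(\theta_s, \varphi_s)$, expanded using spherical harmonics, takes the following form:
    $$p(r, \theta, \varphi, k) = s \sum_{m=0}^{\infty} (2m+1)\, j^m j_m(kr) \sum_{0 \le n \le m,\ \sigma = \pm 1} Y_{m,n}^{\sigma}(\theta_s, \varphi_s)\, Y_{m,n}^{\sigma}(\theta, \varphi)$$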
  • the HOA coefficients of the candidate virtual speakers output by the virtual speaker set generation unit are used as inputs to the virtual speaker selection unit.
  • The virtual speaker selection unit is configured to select a target virtual speaker from the plurality of candidate virtual speakers in the virtual speaker set according to the HOA signal to be encoded. The target virtual speaker may be referred to as a "virtual speaker matching the HOA signal to be encoded", or a matching virtual speaker for short.
  • the virtual speaker selection unit matches the HOA signal to be encoded with the candidate virtual speaker HOA coefficients output by the virtual speaker set generation unit, and selects the specified matched virtual speaker.
  • The method for selecting a virtual speaker is illustrated with an example.
  • The HOA signal to be encoded is matched against the HOA coefficients of the candidate virtual speakers output by the virtual speaker set generation unit, to find the best match of the HOA signal to be encoded on the candidate virtual speakers. The goal is to represent the HOA signal to be encoded as a combination of candidate virtual speaker HOA coefficients.
  • The inner product of each candidate virtual speaker's HOA coefficients and the HOA signal to be encoded is computed, and the candidate virtual speaker with the largest absolute inner product is selected as the target virtual speaker, that is, the matching virtual speaker. The projection of the HOA signal to be encoded on that candidate virtual speaker is added to the linear combination of candidate virtual speaker HOA coefficients, and the projection vector is then subtracted from the HOA signal to be encoded to obtain a difference. The process is repeated on the difference to iterate.
  • Each iteration produces one matching virtual speaker, and the coordinates and HOA coefficients of the matching virtual speakers are output. It can be understood that multiple matching virtual speakers will be selected, one per iteration.
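  • The iterative selection described above resembles a matching pursuit and can be sketched as follows. The frame is stored as L sample vectors of length M, and a candidate's score is the sum of absolute inner products over the frame; both choices, the function names and the six axis-aligned first-order candidates are assumptions made for this illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def select_target_speakers(candidates, frame, count):
    """Greedy matching of `count` candidates against a frame of HOA sample vectors."""
    residual = [list(sample) for sample in frame]
    selected = []
    for _ in range(count):
        # Score each candidate by its summed absolute inner product with the residual.
        scores = [sum(abs(dot(c, x)) for x in residual) for c in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        selected.append(best)
        c = candidates[best]
        norm_sq = dot(c, c)
        # Subtract the projection on the chosen candidate to form the difference.
        residual = [
            [xi - ci * dot(c, x) / norm_sq for xi, ci in zip(x, c)]
            for x in residual
        ]
    return selected


# Assumed first-order candidates (W, Y, Z, X) pointing at +x, -x, +y, -y, +z, -z.
CANDIDATES = [
    [1, 0, 0, 1], [1, 0, 0, -1],
    [1, 1, 0, 0], [1, -1, 0, 0],
    [1, 0, 1, 0], [1, 0, -1, 0],
]
S1 = [1.0, -0.5, 0.25]   # louder source arriving from +x
S2 = [0.2, 0.4, -0.1]    # quieter source arriving from +z
FRAME = [
    [s1 * ax + s2 * az for ax, az in zip(CANDIDATES[0], CANDIDATES[4])]
    for s1, s2 in zip(S1, S2)
]
SELECTED = select_target_speakers(CANDIDATES, FRAME, count=2)
```

  • For this toy frame the first iteration picks the +x candidate and the second picks +z, so the two matching virtual speakers coincide with the two source directions.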
  • the coordinates of the target virtual speaker and the HOA coefficient of the target virtual speaker output by the virtual speaker selection unit are used as inputs to the virtual speaker signal generation unit.
  • The encoding end may further include a side information generation unit. This is only an example; without limitation, the encoding end may also omit the side information generation unit.
  • the coordinates of the target virtual speaker and/or the HOA coefficient of the target virtual speaker output by the virtual speaker selection unit are used as inputs to the side information generation unit.
  • the side information generating unit converts the HOA coefficients of the target virtual speaker or the coordinates of the target virtual speaker into side information, which is beneficial to the processing and transmission of the core encoder.
  • the output of the side information generation unit serves as the input to the core encoder processing unit.
  • the virtual speaker signal generating unit is configured to generate a virtual speaker signal according to the HOA signal to be encoded and the attribute information of the target virtual speaker.
  • the virtual speaker signal generation unit calculates the virtual speaker signal by using the HOA signal to be encoded and the HOA coefficient of the target virtual speaker.
  • The HOA coefficients of the matching virtual speakers are represented by a matrix A, and the HOA signal to be encoded can be expressed as a linear combination using the matrix A.
  • The least squares method can be used to obtain the theoretical optimal solution w, which is the virtual speaker signal, for example with the following formula:
    $$w = A^{-1}X$$
  • Here $A^{-1}$ represents the inverse matrix of A. The size of A is (M × C), where C is the number of target virtual speakers and M is the number of channels of the Nth-order HOA coefficients, so when A is not square $A^{-1}$ is to be understood as the least-squares (generalized) inverse. The entries $a$ are the HOA coefficients of the target virtual speakers, for example:
    $$A = \begin{bmatrix} a_{11} & \cdots & a_{1C} \\ \vdots & \ddots & \vdots \\ a_{M1} & \cdots & a_{MC} \end{bmatrix}$$
  • X represents the HOA signal to be encoded. The size of X is (M × L), where L is the number of sampling points, and the entries $x$ are the coefficients of the HOA signal to be encoded, for example:
    $$X = \begin{bmatrix} x_{11} & \cdots & x_{1L} \\ \vdots & \ddots & \vdots \\ x_{M1} & \cdots & x_{ML} \end{bmatrix}$$
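  • Because A is (M × C) and generally not square, one way to realize the least-squares solution is through the normal equations $(A^{T}A)\,W = A^{T}X$. The sketch below does this with a small Gauss-Jordan solver; this route, the helper names and the toy matrices are assumptions for illustration, and a production encoder would more likely call a numerical library.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(square, rhs):
    """Solve square @ out = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(square)
    aug = [list(square[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("speaker coefficient vectors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def virtual_speaker_signals(a, x):
    """Least-squares W (C x L) minimising ||A W - X|| via the normal equations."""
    at = transpose(a)
    return solve(matmul(at, a), matmul(at, x))


# Toy check (assumed values): build X from known speaker signals, then recover them.
A = [[1.0, 1.0], [0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # M = 4, C = 2
W_TRUE = [[1.0, -0.5, 0.25], [0.2, 0.4, -0.1]]         # C = 2, L = 3
X = matmul(A, W_TRUE)
W_EST = virtual_speaker_signals(A, X)
```

  • When X lies exactly in the span of the columns of A, as in this toy check, the recovered signals match the originals; otherwise the remainder is what a residual signal would have to carry.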
  • the virtual speaker signal output by the virtual speaker signal generation unit is used as the input of the core encoder processing unit.
  • The encoding end may further include a signal alignment unit. This is only an example; without limitation, the encoding end may also omit the signal alignment unit.
  • the virtual speaker signal output by the virtual speaker signal generation unit is used as the input of the signal alignment unit.
  • the signal alignment unit is used to readjust the channels of the virtual speaker signal to enhance the correlation between channels, which is beneficial to the processing of the core encoder.
  • the aligned virtual speaker signal output by the signal alignment unit is the input of the core encoder processing unit.
  • the core encoder processing unit is used to perform core encoder processing on the side information and the aligned virtual speaker signal to obtain a transmission code stream.
  • the core encoder processing includes but is not limited to transformation, quantization, psychoacoustic model, code stream generation, etc. It can process the frequency domain channel or the time domain channel, which is not limited here.
  • the decoding end may include: a core decoder processing unit and a HOA signal reconstruction unit.
  • the core decoder processing unit is used to perform core decoder processing on the transport code stream to obtain a virtual speaker signal.
  • the decoding end also needs to include: a side information decoding unit.
  • The side information decoding unit is used to parse the decoded side information output by the core decoder processing unit.
  • the core decoder processing may include transformation, code stream analysis, inverse quantization, etc., and may process the frequency domain channel or the time domain channel, which is not limited here.
  • the virtual speaker signal output by the core decoder processing unit is the input of the HOA signal reconstruction unit, and the decoded side information output by the core decoder processing unit is the input of the side information decoding unit.
  • the side information decoding unit converts the decoded side information into HOA coefficients of the target virtual speaker.
  • the HOA coefficient of the target virtual speaker output by the side information decoding unit is the input of the HOA signal reconstruction unit.
  • the HOA signal reconstruction unit is used for reconstructing the HOA signal by using the virtual speaker signal and the HOA coefficient of the target virtual speaker.
  • The HOA coefficients of the target virtual speakers are represented by a matrix A' of size (M × C), where C is the number of target virtual speakers and M is the number of channels of the Nth-order HOA coefficients.
  • The virtual speaker signal forms a (C × L) matrix, denoted W', where L is the number of signal sampling points. The reconstructed HOA signal H is obtained by the following formula:
    $$H = A'W'$$
  • the reconstructed HOA signal output by the HOA signal reconstruction unit is the output of the decoding end.
  • the encoding end may use the spatial encoder to represent the original HOA signal with fewer channels.
  • For example, an original third-order HOA signal has 16 channels, which the spatial encoder of the embodiments of the present application can compress to 4 channels while ensuring no obvious difference in subjective listening.
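  • As a worked check of the matrix shapes for this case, assume M = 16 HOA channels (third order), C = 4 target virtual speakers, and a frame of L samples, with reconstruction read as the product of the (M × C) coefficient matrix A' and the (C × L) signal matrix W':

```latex
\underbrace{H}_{16 \times L}
  = \underbrace{A'}_{16 \times 4}\;\underbrace{W'}_{4 \times L},
\qquad
\frac{\text{transmitted channels}}{\text{original channels}}
  = \frac{C}{M} = \frac{4}{16} = \frac{1}{4}
```

  • The ratio counts waveform channels only, before core coding, and does not include any side information carried in the code stream.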
  • the subjective listening test is an evaluation standard in audio coding and decoding, and no obvious difference is a grade of subjective evaluation.
  • The virtual speaker selection unit at the encoding end selects the target virtual speaker from the virtual speaker set. Alternatively, virtual speakers with specified orientations may be used as the target virtual speakers, and the virtual speaker signal generation unit obtains the virtual speaker signal by projecting directly onto each target virtual speaker.
  • In this way, the virtual speaker selection process is simplified and the encoding and decoding speed is improved.
  • the encoder end may not include a signal alignment unit, and in this case, the output of the virtual speaker signal generation unit is directly subjected to encoding processing by the core encoder. In the above manner, the signal alignment processing is reduced, and the complexity of the encoder side is reduced.
  • As the above examples show, the selected target virtual speaker is applied to HOA signal encoding and decoding in the embodiments of the present application. The embodiments can localize the sound sources of the HOA signal accurately, so the direction of the reconstructed HOA signal is more accurate and the coding efficiency is higher. The complexity of the decoding end is also very low, which benefits mobile applications and improves encoding and decoding performance.
  • an audio encoding apparatus 1000 provided by an embodiment of the present application may include: an acquisition module 1001, a signal generation module 1002, and an encoding module 1003, wherein,
  • an acquisition module configured to select the first target virtual speaker from the preset virtual speaker set according to the current scene audio signal
  • a signal generation module configured to generate a first virtual speaker signal according to the current scene audio signal and the attribute information of the first target virtual speaker
  • an encoding module configured to encode the first virtual speaker signal to obtain a code stream.
  • The acquisition module is configured to acquire main sound field components from the current scene audio signal according to the virtual speaker set, and to select the first target virtual speaker from the virtual speaker set according to the main sound field components.
  • The acquisition module is configured to select, according to the main sound field components, the HOA coefficients corresponding to the main sound field components from a set of higher-order ambisonics (HOA) coefficients, where the HOA coefficients in the HOA coefficient set correspond one-to-one with the virtual speakers in the virtual speaker set; and to determine the virtual speaker in the virtual speaker set that corresponds to those HOA coefficients as the first target virtual speaker.
  • The acquisition module is configured to acquire configuration parameters of the first target virtual speaker according to the main sound field components; generate the HOA coefficients corresponding to the first target virtual speaker according to those configuration parameters; and determine the virtual speaker in the virtual speaker set that corresponds to those HOA coefficients as the first target virtual speaker.
  • The acquisition module is configured to determine the configuration parameters of the multiple virtual speakers in the virtual speaker set according to the configuration information of the audio encoder, and to select the configuration parameters of the first target virtual speaker from the configuration parameters of the multiple virtual speakers.
  • the configuration parameters of the first target virtual speaker include: position information and HOA order information of the first target virtual speaker;
  • The acquisition module is configured to determine the HOA coefficients corresponding to the first target virtual speaker according to the position information and HOA order information of the first target virtual speaker.
  • The encoding module is further configured to encode the attribute information of the first target virtual speaker and write it into the code stream.
  • the current scene audio signal includes: the HOA signal to be encoded; the attribute information of the first target virtual speaker includes the HOA coefficient of the first target virtual speaker;
  • the signal generating module is configured to linearly combine the to-be-coded HOA signal and the HOA coefficient to obtain the first virtual speaker signal.
  • the current scene audio signal includes: a higher order ambisonics (HOA) signal to be encoded; the attribute information of the first target virtual speaker includes position information of the first target virtual speaker;
  • the signal generation module is used to obtain the HOA coefficient corresponding to the first target virtual speaker according to the position information of the first target virtual speaker; linearly combine the to-be-coded HOA signal and the HOA coefficient to obtain the first virtual speaker signal.
  • the acquiring module is configured to select a second target virtual speaker from the virtual speaker set according to the current scene audio signal
  • the signal generation module is configured to generate a second virtual speaker signal according to the current scene audio signal and the attribute information of the second target virtual speaker;
  • the encoding module is configured to encode the second virtual speaker signal and write the code stream.
  • the signal generation module is configured to perform alignment processing on the first virtual speaker signal and the second virtual speaker signal, so as to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
  • the encoding module is configured to encode the aligned second virtual speaker signal
  • the encoding module is configured to encode the aligned first virtual speaker signal.
  • the acquiring module is configured to select a second target virtual speaker from the virtual speaker set according to the current scene audio signal
  • the signal generation module is configured to generate a second virtual speaker signal according to the current scene audio signal and the attribute information of the second target virtual speaker;
  • the encoding module is configured to obtain a downmix signal and side information according to the first virtual speaker signal and the second virtual speaker signal, where the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal; and to encode the downmix signal and the side information.
  • the signal generation module is configured to perform alignment processing on the first virtual speaker signal and the second virtual speaker signal, so as to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
  • the encoding module is configured to obtain the downmix signal and the side information according to the aligned first virtual speaker signal and the aligned second virtual speaker signal;
  • the side information is used to indicate the relationship between the aligned first virtual speaker signal and the aligned second virtual speaker signal.
  • the obtaining module is configured to, before selecting the second target virtual speaker from the virtual speaker set according to the current scene audio signal, determine, according to the encoding rate and/or the signal type information of the current scene audio signal, whether a target virtual speaker other than the first target virtual speaker needs to be obtained; and to select the second target virtual speaker from the virtual speaker set according to the current scene audio signal only if a target virtual speaker other than the first target virtual speaker needs to be obtained.
  • an audio decoding apparatus 1100 may include: a receiving module 1101, a decoding module 1102, and a reconstruction module 1103, wherein:
  • the receiving module is used to receive the code stream
  • a decoding module for decoding the code stream to obtain a virtual speaker signal
  • the reconstruction module is configured to obtain the reconstructed scene audio signal according to the attribute information of the target virtual speaker and the virtual speaker signal.
  • the decoding module is further configured to decode the code stream to obtain attribute information of the target virtual speaker.
  • the attribute information of the target virtual speaker includes a higher order ambisonics (HOA) coefficient of the target virtual speaker
  • the reconstruction module is configured to perform synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
  • the attribute information of the target virtual speaker includes location information of the target virtual speaker
  • the reconstruction module is configured to determine the HOA coefficient of the target virtual speaker according to the position information of the target virtual speaker; perform synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
  • the virtual speaker signal is a downmix signal obtained by downmixing the first virtual speaker signal and the second virtual speaker signal
  • the apparatus further includes: a signal compensation module, wherein:
  • the decoding module is configured to decode the code stream to obtain side information, where the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal;
  • the signal compensation module is configured to obtain the first virtual speaker signal and the second virtual speaker signal according to the side information and the downmix signal
  • the reconstruction module is configured to obtain the reconstructed scene audio signal according to the attribute information of the target virtual speaker, the first virtual speaker signal and the second virtual speaker signal
  • Embodiments of the present application further provide a computer storage medium, wherein the computer storage medium stores a program, and the program executes some or all of the steps described in the above method embodiments.
  • the audio encoding apparatus 1200 includes:
  • the receiver 1201, the transmitter 1202, the processor 1203 and the memory 1204 (wherein the number of the processors 1203 in the audio coding apparatus 1200 may be one or more, and one processor is taken as an example in FIG. 12).
  • the receiver 1201 , the transmitter 1202 , the processor 1203 , and the memory 1204 may be connected by a bus or in other ways, wherein the connection by a bus is taken as an example in FIG. 12 .
  • Memory 1204 may include read-only memory and random access memory, and provides instructions and data to processor 1203 .
  • a portion of memory 1204 may also include non-volatile random access memory (NVRAM).
  • the memory 1204 stores operating system and operation instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operation instructions may include various operation instructions for implementing various operations.
  • the operating system may include various system programs for implementing various basic services and handling hardware-based tasks.
  • the processor 1203 controls the operation of the audio encoding apparatus, and the processor 1203 may also be referred to as a central processing unit (CPU).
  • various components of the audio coding apparatus are coupled together through a bus system, wherein the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the above embodiments of the present application may be applied to the processor 1203 or implemented by the processor 1203 .
  • the processor 1203 may be an integrated circuit chip, which has signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 1203 or an instruction in the form of software.
  • the above-mentioned processor 1203 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 1204, and the processor 1203 reads the information in the memory 1204, and completes the steps of the above method in combination with its hardware.
  • the receiver 1201 can be used to receive input digital or character information, and generate signal input related to the relevant settings and function control of the audio coding device.
  • the transmitter 1202 can include a display device such as a display screen, and the transmitter 1202 can be used to output numeric or character information through an external interface.
  • the processor 1203 is configured to execute the audio encoding method performed by the audio encoding apparatus shown in FIG. 4 in the foregoing embodiment.
  • the audio decoding apparatus 1300 includes:
  • the receiver 1301, the transmitter 1302, the processor 1303, and the memory 1304 (wherein the number of the processors 1303 in the audio decoding apparatus 1300 may be one or more, and one processor is taken as an example in FIG. 13).
  • the receiver 1301 , the transmitter 1302 , the processor 1303 and the memory 1304 may be connected by a bus or in other ways, wherein the connection by a bus is taken as an example in FIG. 13 .
  • Memory 1304 may include read-only memory and random access memory, and provides instructions and data to processor 1303 . A portion of memory 1304 may also include NVRAM.
  • the memory 1304 stores operating system and operation instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operation instructions may include various operation instructions for implementing various operations.
  • the operating system may include various system programs for implementing various basic services and handling hardware-based tasks.
  • the processor 1303 controls the operation of the audio decoding apparatus, and the processor 1303 may also be referred to as a CPU.
  • various components of the audio decoding device are coupled together through a bus system, where the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the above embodiments of the present application may be applied to the processor 1303 or implemented by the processor 1303 .
  • the processor 1303 may be an integrated circuit chip, which has signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 1303 or an instruction in the form of software.
  • the above-mentioned processor 1303 may be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components.
  • the methods, steps, and logic block diagrams disclosed in the embodiments of this application can be implemented or executed.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 1304, and the processor 1303 reads the information in the memory 1304, and completes the steps of the above method in combination with its hardware.
  • the processor 1303 is configured to execute the audio decoding method performed by the audio decoding apparatus shown in FIG. 4 in the foregoing embodiment.
  • when the audio encoding device or the audio decoding device is a chip in the terminal, the chip includes: a processing unit and a communication unit, where the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit.
  • the processing unit can execute the computer-executable instructions stored in the storage unit, so that the chip in the terminal executes the audio encoding method of any one of the above-mentioned first aspect, or the audio decoding method of any one of the second aspect.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit in the terminal located outside the chip, such as a read-only memory (ROM) or a random access memory (RAM).
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the program of the method of the first aspect or the second aspect.
  • the device embodiments described above are only schematic: the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, which can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk
  • a computer device which may be a personal computer, server, or network device, etc.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave).
  • the computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a server, data center, etc., which includes one or more available media integrated.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

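An audio encoding and decoding method, an apparatus, and a readable storage medium. The encoding method includes: selecting a first target virtual speaker from a preset virtual speaker set according to a current scene audio signal (401); generating a first virtual speaker signal according to the current scene audio signal and attribute information of the first target virtual speaker (402); and encoding the first virtual speaker signal to obtain a code stream (403). The encoding method reduces the amount of encoded data and thereby improves encoding efficiency.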

Description

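Audio encoding and decoding method and apparatus
This application claims priority to Chinese Patent Application No. 202011377320.0, filed with the Chinese Patent Office on November 30, 2020 and entitled "Audio encoding and decoding method and apparatus", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of audio encoding and decoding technologies, and in particular, to an audio encoding and decoding method and apparatus.
Background
Three-dimensional audio technology is an audio technology for acquiring, processing, transmitting, and rendering for playback the sound events and three-dimensional sound field information of the real world. It gives sound a strong sense of space, envelopment, and immersion, providing an extraordinary "being there" listening experience. Higher order ambisonics (HOA) technology is independent of the speaker layout during recording, encoding, and playback, and HOA-format data can be rotated for playback, so it offers greater flexibility for three-dimensional audio playback and has therefore received wider attention and research.
To achieve a better listening effect, HOA technology requires a large amount of data to record more detailed sound scene information. Although such scene-based sampling and storage of three-dimensional audio signals facilitates the preservation and transmission of the spatial information of the audio signal, the amount of data grows as the HOA order increases, and the large amount of data makes transmission and storage difficult. The HOA signal therefore needs to be encoded and decoded.
Currently, there is a multi-channel data encoding and decoding method in which, on the encoder side, a core encoder (for example, a 16-channel encoder) directly encodes each channel of the original scene audio signal and then outputs a code stream. On the decoder side, a core decoder (for example, a 16-channel decoder) decodes the code stream to obtain each channel of the decoded scene audio signal.
This multi-channel method requires a codec adapted to the number of channels of the original scene audio signal, and as the number of channels increases, the compressed code stream suffers from a large data volume and high bandwidth usage.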
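Summary
Embodiments of this application provide an audio encoding and decoding method and apparatus, to reduce the amount of encoded and decoded data and thereby improve encoding and decoding efficiency.
To resolve the foregoing technical problem, embodiments of this application provide the following technical solutions:
According to a first aspect, an embodiment of this application provides an audio encoding method, including:
selecting a first target virtual speaker from a preset virtual speaker set according to a current scene audio signal;
generating a first virtual speaker signal according to the current scene audio signal and attribute information of the first target virtual speaker; and
encoding the first virtual speaker signal to obtain a code stream.
In this embodiment of this application, a first target virtual speaker is selected from a preset virtual speaker set according to a current scene audio signal; a first virtual speaker signal is generated according to the current scene audio signal and attribute information of the first target virtual speaker; and the first virtual speaker signal is encoded to obtain a code stream. Because the first virtual speaker signal can be generated according to the first scene audio signal and the attribute information of the first target virtual speaker, the audio encoder side encodes the first virtual speaker signal instead of directly encoding the first scene audio signal. The first target virtual speaker is selected according to the first scene audio signal, and the first virtual speaker signal generated based on it can represent the sound field at the listener's position in space, which is as close as possible to the original sound field when the first scene audio signal was recorded; this guarantees the encoding quality of the audio encoder side. The first virtual speaker signal and a residual signal are encoded to obtain the code stream. The amount of encoded data of the first virtual speaker signal depends on the first target virtual speaker and is independent of the number of channels of the first scene audio signal, which reduces the amount of encoded data and improves encoding efficiency.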
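In a possible implementation, the method further includes:
obtaining a main sound field component from the current scene audio signal according to the virtual speaker set;
the selecting a first target virtual speaker from a preset virtual speaker set according to a current scene audio signal includes:
selecting the first target virtual speaker from the virtual speaker set according to the main sound field component.
In the foregoing solution, each virtual speaker in the virtual speaker set corresponds to one sound field component, and the first target virtual speaker is selected from the virtual speaker set according to the main sound field component; for example, the virtual speaker corresponding to the main sound field component is the first target virtual speaker selected by the encoder side. In this embodiment of this application, the encoder side can select the first target virtual speaker by using the main sound field component, which resolves the problem that the encoder side needs to determine the first target virtual speaker.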
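In a possible implementation, the selecting the first target virtual speaker from the virtual speaker set according to the main sound field component includes:
selecting, according to the main sound field component, the HOA coefficient corresponding to the main sound field component from a higher order ambisonics (HOA) coefficient set, where the HOA coefficients in the HOA coefficient set correspond one-to-one with the virtual speakers in the virtual speaker set; and
determining the virtual speaker in the virtual speaker set that corresponds to the HOA coefficient corresponding to the main sound field component as the first target virtual speaker.
In the foregoing solution, the encoder side preconfigures the HOA coefficient set according to the virtual speaker set, and the HOA coefficients in the HOA coefficient set correspond one-to-one with the virtual speakers in the virtual speaker set. Therefore, after the HOA coefficient is selected according to the main sound field component, the target virtual speaker corresponding to that HOA coefficient is looked up in the virtual speaker set by using the one-to-one correspondence. The target virtual speaker found in this way is the first target virtual speaker, which resolves the problem that the encoder side needs to determine the first target virtual speaker.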
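In a possible implementation, the selecting the first target virtual speaker from the virtual speaker set according to the main sound field component includes:
obtaining configuration parameters of the first target virtual speaker according to the main sound field component;
generating the HOA coefficient corresponding to the first target virtual speaker according to the configuration parameters of the first target virtual speaker; and
determining the virtual speaker in the virtual speaker set that corresponds to the HOA coefficient corresponding to the first target virtual speaker as the target virtual speaker.
In the foregoing solution, after obtaining the main sound field component, the encoder side can use it to determine the configuration parameters of the first target virtual speaker. For example, the main sound field component is the one or several sound field components with the largest values among multiple sound field components, or the one or several sound field components that are directionally dominant among multiple sound field components. The main sound field component can be used to determine the first target virtual speaker that matches the current scene audio signal. The first target virtual speaker is configured with corresponding attribute information, and its HOA coefficient can be generated from its configuration parameters; the generation of the HOA coefficient can be implemented by an HOA algorithm and is not described in detail here. Each virtual speaker in the virtual speaker set has a corresponding HOA coefficient, so the first target virtual speaker can be selected from the virtual speaker set according to the HOA coefficient corresponding to each virtual speaker, which resolves the problem that the encoder side needs to determine the first target virtual speaker.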
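The selection by "largest-valued" sound field component can be illustrated with a minimal sketch. It assumes, purely for illustration, that a speaker's sound field component is the projection of the HOA frame onto that speaker's HOA coefficient vector and that "largest" means highest energy; the function names `project` and `select_target_speaker` are hypothetical and do not come from the application.

```python
def project(hoa_frame, coeffs):
    """Project an HOA frame (list of channels, each a list of samples)
    onto one virtual speaker's HOA coefficient vector, sample by sample."""
    n_samples = len(hoa_frame[0])
    return [sum(c * ch[t] for c, ch in zip(coeffs, hoa_frame))
            for t in range(n_samples)]


def select_target_speaker(hoa_frame, coeff_set):
    """Return the index of the virtual speaker whose projected component
    has the highest energy (assumed meaning of 'main sound field component')."""
    energies = []
    for coeffs in coeff_set:
        component = project(hoa_frame, coeffs)
        energies.append(sum(v * v for v in component))
    return max(range(len(energies)), key=energies.__getitem__)
```

Because the coefficient set and the speaker set correspond one-to-one, the returned index identifies both the selected HOA coefficient vector and the target virtual speaker.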
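In a possible implementation, the obtaining configuration parameters of the first target virtual speaker according to the main sound field component includes:
determining the configuration parameters of multiple virtual speakers in the virtual speaker set according to configuration information of the audio encoder; and
selecting, according to the main sound field component, the configuration parameters of the first target virtual speaker from the configuration parameters of the multiple virtual speakers.
In the foregoing solution, the audio encoder may prestore the configuration parameters of each of the multiple virtual speakers, and the configuration parameters of each virtual speaker may be determined from the configuration information of the audio encoder. The audio encoder refers to the encoder side described above, and its configuration information includes but is not limited to the HOA order and the encoding bit rate. The configuration information of the audio encoder can be used to determine the number of virtual speakers and the position parameters of each virtual speaker, which resolves the problem that the encoder side needs to determine the configuration parameters of the virtual speakers. For example, a smaller number of virtual speakers may be configured when the encoding bit rate is low, and a larger number when the encoding bit rate is high. For another example, the HOA order of a virtual speaker may be equal to the HOA order of the audio encoder. Without limitation, in addition to determining the configuration parameters of the multiple virtual speakers from the configuration information of the audio encoder, the configuration parameters may also be determined from user-defined information; for example, a user may define the positions of the virtual speakers, the HOA order, and the number of virtual speakers.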
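As a rough illustration of deriving speaker configuration parameters from encoder configuration, the sketch below maps a bit rate to a speaker count and places that many speakers approximately uniformly on a sphere. The thresholds and counts are hypothetical placeholders, and the Fibonacci lattice is a substituted, well-known placement technique; the application does not specify either.

```python
import math


def speaker_count_for_bitrate(bitrate_kbps):
    """Fewer virtual speakers at low bit rates, more at high bit rates.
    The thresholds and counts here are hypothetical."""
    if bitrate_kbps < 128:
        return 64
    if bitrate_kbps < 384:
        return 256
    return 1024


def fibonacci_sphere(count):
    """Return `count` (azimuth, elevation) pairs in radians, spread
    approximately uniformly over the unit sphere (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    positions = []
    for i in range(count):
        z = 1.0 - (2.0 * i + 1.0) / count  # sin(elevation), strictly inside (-1, 1)
        elevation = math.asin(z)
        azimuth = (i * golden_angle) % (2.0 * math.pi)
        positions.append((azimuth, elevation))
    return positions
```

A caller might combine the two as `fibonacci_sphere(speaker_count_for_bitrate(96))` and pair each position with the encoder's HOA order to form per-speaker configuration parameters.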
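In a possible implementation, the configuration parameters of the first target virtual speaker include: position information and HOA order information of the first target virtual speaker;
the generating the HOA coefficient corresponding to the first target virtual speaker according to the configuration parameters of the first target virtual speaker includes:
determining the HOA coefficient corresponding to the first target virtual speaker according to the position information and the HOA order information of the first target virtual speaker.
In the foregoing solution, the HOA coefficient of each virtual speaker can be generated from the position information and HOA order information of that virtual speaker. The generation of the HOA coefficient can be implemented by an HOA algorithm, which resolves the problem that the encoder side needs to determine the HOA coefficient of the first target virtual speaker.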
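A minimal sketch of computing HOA coefficients from a position follows. It covers only first order and assumes the common ACN channel ordering (W, Y, Z, X) with SN3D normalization; the application does not state which convention or algorithm it uses, so treat the function names and the convention as assumptions. The channel count (N + 1)^2 for order N is a general property of ambisonics.

```python
import math


def num_hoa_channels(order):
    """An HOA signal of order N has (N + 1)^2 channels."""
    return (order + 1) ** 2


def hoa_coefficients_order1(azimuth, elevation):
    """First-order real spherical harmonics for a direction given in radians.
    Assumed convention: ACN order (W, Y, Z, X), SN3D normalization."""
    cos_el = math.cos(elevation)
    return [
        1.0,                         # W: omnidirectional
        math.sin(azimuth) * cos_el,  # Y: left-right
        math.sin(elevation),         # Z: up-down
        math.cos(azimuth) * cos_el,  # X: front-back
    ]
```

For example, third order gives `num_hoa_channels(3) == 16`, which matches the 16-channel core codec mentioned in the background.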
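In a possible implementation, the method further includes:
encoding the attribute information of the first target virtual speaker and writing it into the code stream.
In the foregoing solution, in addition to encoding the virtual speaker signal, the encoder side may also encode the attribute information of the first target virtual speaker and write the encoded attribute information into the code stream. The resulting code stream may then include the encoded virtual speaker signal and the encoded attribute information of the first target virtual speaker. Because the code stream can carry the encoded attribute information of the first target virtual speaker, the decoder side can determine that attribute information by decoding the code stream, which facilitates audio decoding on the decoder side.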
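To make the idea of carrying attribute information in the code stream concrete, the sketch below packs a speaker index and a position into a fixed six-byte field and parses it back. The field layout (unsigned 16-bit index, signed 16-bit azimuth and elevation in degrees, big-endian) is entirely hypothetical; the application does not define a code stream syntax.

```python
import struct

# Hypothetical layout: uint16 speaker index, int16 azimuth (degrees),
# int16 elevation (degrees), big-endian byte order.
_ATTRIBUTE_FORMAT = ">Hhh"


def pack_speaker_attributes(speaker_index, azimuth_deg, elevation_deg):
    """Encoder side: serialize the target speaker's attribute information."""
    return struct.pack(_ATTRIBUTE_FORMAT, speaker_index, azimuth_deg, elevation_deg)


def unpack_speaker_attributes(payload):
    """Decoder side: recover (speaker_index, azimuth_deg, elevation_deg)."""
    return struct.unpack(_ATTRIBUTE_FORMAT, payload)
```

Sending only an index or a position, rather than the full HOA coefficient vector, keeps this side channel small; the decoder can then look up or recompute the coefficients.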
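In a possible implementation, the current scene audio signal includes: a higher order ambisonics (HOA) signal to be encoded; the attribute information of the first target virtual speaker includes the HOA coefficient of the first target virtual speaker;
the generating a first virtual speaker signal according to the current scene audio signal and attribute information of the first target virtual speaker includes:
linearly combining the HOA signal to be encoded and the HOA coefficient to obtain the first virtual speaker signal.
In the foregoing solution, taking the case in which the current scene audio signal is an HOA signal to be encoded as an example, the encoder side first determines the HOA coefficient of the first target virtual speaker. For example, the encoder side selects an HOA coefficient from the HOA coefficient set according to the main sound field component, and the selected HOA coefficient is the HOA coefficient of the first target virtual speaker. After obtaining the HOA signal to be encoded and the HOA coefficient of the first target virtual speaker, the encoder side can generate the first virtual speaker signal from them. The HOA signal to be encoded can be expressed as a linear combination using the HOA coefficient of the first target virtual speaker, so solving for the first virtual speaker signal can be converted into solving a linear combination problem.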
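The linear combination can be sketched for the single-speaker case. The sketch assumes the model "HOA channel n is approximately coefficient n times the speaker signal" and solves it in the least-squares sense, which reduces to a normalized inner product; the least-squares choice and the name `virtual_speaker_signal` are assumptions, not something the application specifies.

```python
def virtual_speaker_signal(hoa_frame, coeffs):
    """Least-squares speaker signal w(t) for the model hoa[n][t] ~ coeffs[n] * w(t).
    hoa_frame is a list of channels, each a list of samples."""
    norm = sum(c * c for c in coeffs)
    n_samples = len(hoa_frame[0])
    return [sum(c * ch[t] for c, ch in zip(coeffs, hoa_frame)) / norm
            for t in range(n_samples)]
```

Whatever the HOA order, the output is one channel per selected speaker, which is why the encoded data volume depends on the number of target speakers rather than on the number of HOA channels.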
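In a possible implementation, the current scene audio signal includes: a higher order ambisonics (HOA) signal to be encoded; the attribute information of the first target virtual speaker includes position information of the first target virtual speaker;
the generating a first virtual speaker signal according to the current scene audio signal and attribute information of the first target virtual speaker includes:
obtaining the HOA coefficient corresponding to the first target virtual speaker according to the position information of the first target virtual speaker; and
linearly combining the HOA signal to be encoded and the HOA coefficient to obtain the first virtual speaker signal.
In the foregoing solution, the attribute information of the first target virtual speaker may include the position information of the first target virtual speaker. The encoder side prestores the HOA coefficient of each virtual speaker in the virtual speaker set and also stores the position information of each virtual speaker. Because the position information of a virtual speaker corresponds to the HOA coefficient of that virtual speaker, the encoder side can determine the HOA coefficient of the first target virtual speaker from its position information. If the attribute information includes the HOA coefficient, the encoder side can obtain the HOA coefficient of the first target virtual speaker by decoding the attribute information of the first target virtual speaker.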
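In a possible implementation, the method further includes:
selecting a second target virtual speaker from the virtual speaker set according to the current scene audio signal;
generating a second virtual speaker signal according to the current scene audio signal and attribute information of the second target virtual speaker; and
encoding the second virtual speaker signal and writing it into the code stream.
In the foregoing solution, the second target virtual speaker is another target virtual speaker selected by the encoder side, different from the first target virtual speaker. The first scene audio signal is the original scene audio signal to be encoded. The second target virtual speaker may be one of the virtual speakers in the virtual speaker set; for example, a preconfigured target virtual speaker selection policy may be used to select the second target virtual speaker from the preset virtual speaker set. The target virtual speaker selection policy is a policy for selecting, from the virtual speaker set, a target virtual speaker that matches the first scene audio signal, for example, selecting the second target virtual speaker according to the sound field component that each virtual speaker obtains from the first scene audio signal.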
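In a possible implementation, the method further includes:
performing alignment processing on the first virtual speaker signal and the second virtual speaker signal to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
correspondingly, the encoding the second virtual speaker signal includes:
encoding the aligned second virtual speaker signal;
correspondingly, the encoding the first virtual speaker signal includes:
encoding the aligned first virtual speaker signal.
In the foregoing solution, after obtaining the aligned first virtual speaker signal, the encoder side can encode it. In this embodiment of this application, realigning the channels of the first virtual speaker signal enhances inter-channel correlation, which benefits the core encoder's encoding of the first virtual speaker signal.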
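In a possible implementation, the method further includes:
selecting a second target virtual speaker from the virtual speaker set according to the current scene audio signal;
generating a second virtual speaker signal according to the current scene audio signal and attribute information of the second target virtual speaker;
correspondingly, the encoding the first virtual speaker signal includes:
obtaining a downmix signal and side information according to the first virtual speaker signal and the second virtual speaker signal, where the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal; and
encoding the downmix signal and the side information.
In the foregoing solution, after obtaining the first virtual speaker signal and the second virtual speaker signal, the encoder side may also perform downmix processing on them to generate a downmix signal, for example, an amplitude downmix of the first virtual speaker signal and the second virtual speaker signal. In addition, side information may be generated from the first virtual speaker signal and the second virtual speaker signal. The side information indicates the relationship between the two signals, and this relationship can be implemented in multiple ways. The decoder side can use the side information to upmix the downmix signal and recover the first virtual speaker signal and the second virtual speaker signal. For example, the side information includes a signal information loss analysis parameter, so that the decoder side recovers the first virtual speaker signal and the second virtual speaker signal by using that parameter.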
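One way to picture the downmix and its side information is sketched below. It assumes an averaging amplitude downmix and, as the side information, two least-squares gains that best predict each original signal from the downmix. This is only one of the "multiple ways" the relationship could be expressed; the gains are a stand-in chosen for illustration and are not the signal information loss analysis parameter of the application.

```python
def downmix_with_side_info(sig1, sig2):
    """Return (downmix, (gain1, gain2)) for two equal-length signals.
    downmix is the sample-wise average; gain_k minimizes the squared
    error between sig_k and gain_k * downmix."""
    downmix = [(a + b) / 2.0 for a, b in zip(sig1, sig2)]
    energy = sum(d * d for d in downmix)
    if energy == 0.0:
        # Silent downmix: no meaningful gains can be estimated.
        return downmix, (0.0, 0.0)
    gain1 = sum(a * d for a, d in zip(sig1, downmix)) / energy
    gain2 = sum(b * d for b, d in zip(sig2, downmix)) / energy
    return downmix, (gain1, gain2)
```

With this construction the two gains always sum to 2 for a non-silent downmix, so a real codec could transmit just one of them; that observation is a property of this sketch, not a claim about the application.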
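In a possible implementation, the method further includes:
performing alignment processing on the first virtual speaker signal and the second virtual speaker signal to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
correspondingly, the obtaining a downmix signal and side information according to the first virtual speaker signal and the second virtual speaker signal includes:
obtaining the downmix signal and the side information according to the aligned first virtual speaker signal and the aligned second virtual speaker signal;
correspondingly, the side information is used to indicate the relationship between the aligned first virtual speaker signal and the aligned second virtual speaker signal.
In the foregoing solution, before generating the downmix signal, the encoder side may first align the virtual speaker signals, and then generate the downmix signal and the side information after the alignment is complete. In this embodiment of this application, realigning the channels of the first virtual speaker signal and the second virtual speaker signal enhances inter-channel correlation, which benefits the core encoder's encoding of the first virtual speaker signal.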
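In a possible implementation, before the selecting a second target virtual speaker from the virtual speaker set according to the current scene audio signal, the method further includes:
determining, according to an encoding rate and/or signal type information of the current scene audio signal, whether a target virtual speaker other than the first target virtual speaker needs to be obtained; and
selecting the second target virtual speaker from the virtual speaker set according to the current scene audio signal only if a target virtual speaker other than the first target virtual speaker needs to be obtained.
In the foregoing solution, the encoder side may also perform signal selection to determine whether the second target virtual speaker needs to be obtained. If it does, the encoder side may generate the second virtual speaker signal; if it does not, the encoder side may skip generating the second virtual speaker signal. The encoder may make this decision according to the configuration information of the audio encoder and/or the signal type information of the first scene audio signal, to determine whether another target virtual speaker needs to be selected in addition to the first target virtual speaker. For example, if the encoding rate is higher than a preset threshold, it is determined that target virtual speakers corresponding to two main sound field components need to be obtained, and the second target virtual speaker may be determined in addition to the first target virtual speaker. For another example, if it is determined according to the signal type information of the first scene audio signal that target virtual speakers corresponding to two main sound field components with dominant sound source directions need to be obtained, the second target virtual speaker may likewise be determined in addition to the first. Conversely, if it is determined according to the encoding rate and/or the signal type information of the first scene audio signal that only one target virtual speaker needs to be obtained, no target virtual speaker other than the first target virtual speaker is obtained after the first target virtual speaker is determined. In this embodiment of this application, signal selection reduces the amount of data encoded by the encoder side and improves encoding efficiency.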
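The decision described above can be reduced to a small predicate. In this sketch the threshold value, the parameter names, and the representation of "signal type information" as a count of directionally dominant sources are all hypothetical; the application only says that the encoding rate and/or signal type information drive the decision.

```python
def needs_second_target_speaker(bitrate_kbps, dominant_source_count,
                                bitrate_threshold_kbps=128):
    """Decide whether to select a second target virtual speaker.

    bitrate_kbps: encoding rate of the audio encoder.
    dominant_source_count: hypothetical summary of the signal type
        information (number of directionally dominant sound sources).
    bitrate_threshold_kbps: hypothetical preset threshold.
    """
    if bitrate_kbps > bitrate_threshold_kbps:
        return True
    return dominant_source_count >= 2
```

When the predicate returns False, the encoder skips generating and encoding the second virtual speaker signal, which is where the data saving comes from.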
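According to a second aspect, an embodiment of this application further provides an audio decoding method, including:
receiving a code stream;
decoding the code stream to obtain a virtual speaker signal; and
obtaining a reconstructed scene audio signal according to attribute information of a target virtual speaker and the virtual speaker signal.
In this embodiment of this application, a code stream is first received, the code stream is then decoded to obtain a virtual speaker signal, and finally a reconstructed scene audio signal is obtained according to attribute information of a target virtual speaker and the virtual speaker signal. The virtual speaker signal can be decoded from the code stream, and the reconstructed scene audio signal is obtained from the attribute information of the target virtual speaker and the virtual speaker signal. Because the obtained code stream carries the virtual speaker signal and a residual signal, the amount of decoded data is reduced and decoding efficiency is improved.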
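In a possible implementation, the method further includes:
decoding the code stream to obtain the attribute information of the target virtual speaker.
In the foregoing solution, in addition to encoding the virtual speaker signal, the encoder side may also encode the attribute information of the target virtual speaker and write the encoded attribute information into the code stream; for example, the attribute information of the first target virtual speaker can be obtained from the code stream. Because the code stream can carry the encoded attribute information of the first target virtual speaker, the decoder side can determine that attribute information by decoding the code stream, which facilitates audio decoding on the decoder side.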
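In a possible implementation, the attribute information of the target virtual speaker includes a higher order ambisonics (HOA) coefficient of the target virtual speaker;
the obtaining a reconstructed scene audio signal according to attribute information of a target virtual speaker and the virtual speaker signal includes:
performing synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
In the foregoing solution, the decoder side first determines the HOA coefficient of the target virtual speaker; for example, the decoder side may prestore the HOA coefficient of the target virtual speaker. After obtaining the virtual speaker signal and the HOA coefficient of the target virtual speaker, the decoder side can obtain the reconstructed scene audio signal from them, thereby improving the quality of the reconstructed scene audio signal.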
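The synthesis step can be sketched as the reverse of the encoder's linear combination: each reconstructed HOA channel is the sum, over target speakers, of that speaker's coefficient for the channel times its signal. The function name `reconstruct_hoa` and the list-of-lists signal layout are assumptions made for this sketch.

```python
def reconstruct_hoa(speaker_signals, coeff_vectors):
    """Rebuild an HOA frame from virtual speaker signals.

    speaker_signals: one list of samples per target virtual speaker.
    coeff_vectors: one HOA coefficient vector per target virtual speaker.
    Returns a list of HOA channels, each a list of samples.
    """
    n_channels = len(coeff_vectors[0])
    n_samples = len(speaker_signals[0])
    frame = [[0.0] * n_samples for _ in range(n_channels)]
    for signal, coeffs in zip(speaker_signals, coeff_vectors):
        for ch, c in enumerate(coeffs):
            for t, v in enumerate(signal):
                frame[ch][t] += c * v
    return frame
```

The decoder therefore expands a small number of speaker channels back to the full (N + 1)^2 HOA channels using only the coefficient vectors.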
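In a possible implementation, the attribute information of the target virtual speaker includes position information of the target virtual speaker;
the obtaining a reconstructed scene audio signal according to attribute information of a target virtual speaker and the virtual speaker signal includes:
determining the HOA coefficient of the target virtual speaker according to the position information of the target virtual speaker; and
performing synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
In the foregoing solution, the attribute information of the target virtual speaker may include the position information of the target virtual speaker. The decoder side prestores the HOA coefficient of each virtual speaker in the virtual speaker set and also stores the position information of each virtual speaker. For example, the decoder side may determine the HOA coefficient corresponding to the position information of the target virtual speaker according to the correspondence between the position information of a virtual speaker and the HOA coefficient of that virtual speaker, or the decoder side may calculate the HOA coefficient of the target virtual speaker from its position information. The decoder side can therefore determine the HOA coefficient of the target virtual speaker from its position information, which resolves the problem that the decoder side needs to determine the HOA coefficient of the target virtual speaker.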
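The two options described above (look up a stored coefficient vector, or compute it from the position) can be combined in a small helper. This sketch keys the stored table by an (azimuth, elevation) tuple in radians and falls back to a first-order computation using the ACN/SN3D convention; the table layout, the convention, and the function names are assumptions.

```python
import math


def first_order_coeffs(azimuth, elevation):
    """First-order HOA coefficients (assumed ACN order W, Y, Z, X; SN3D)."""
    cos_el = math.cos(elevation)
    return [1.0, math.sin(azimuth) * cos_el, math.sin(elevation),
            math.cos(azimuth) * cos_el]


def coeffs_for_position(position, stored_table):
    """Return the HOA coefficients for a speaker position.
    Prefer the prestored correspondence; compute when it has no entry."""
    if position in stored_table:
        return stored_table[position]
    return first_order_coeffs(*position)
```

The lookup path requires encoder and decoder to hold identical tables, while the computed path only requires them to agree on the convention.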
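In a possible implementation, the virtual speaker signal is a downmix signal obtained by downmixing a first virtual speaker signal and a second virtual speaker signal, and the method further includes:
decoding the code stream to obtain side information, where the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal; and
obtaining the first virtual speaker signal and the second virtual speaker signal according to the side information and the downmix signal;
correspondingly, the obtaining a reconstructed scene audio signal according to attribute information of a target virtual speaker and the virtual speaker signal includes:
obtaining the reconstructed scene audio signal according to the attribute information of the target virtual speaker, the first virtual speaker signal, and the second virtual speaker signal.
In the foregoing solution, the encoder side generates the downmix signal when downmixing the first virtual speaker signal and the second virtual speaker signal, and may also perform signal compensation for the downmix signal to generate side information, which can be written into the code stream. The decoder side can obtain the side information from the code stream and perform signal compensation according to it to obtain the first virtual speaker signal and the second virtual speaker signal. During signal reconstruction, the first virtual speaker signal, the second virtual speaker signal, and the attribute information of the target virtual speaker can therefore be used, which improves the quality of the decoded signal on the decoder side.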
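A decoder-side upmix can be sketched under the assumption that the side information is a pair of gains, one per original signal, to be applied to the downmix. This gain-pair form is a hypothetical simplification of the signal compensation described above, and the function name `upmix` is not from the application.

```python
def upmix(downmix, side_info):
    """Recover estimates of the two virtual speaker signals from the
    downmix and a (gain1, gain2) pair carried as side information."""
    gain1, gain2 = side_info
    first = [gain1 * d for d in downmix]
    second = [gain2 * d for d in downmix]
    return first, second
```

Recovery is exact only when the two original signals are scaled copies of each other; otherwise the outputs are approximations, which is why a practical codec may carry richer side information or a residual.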
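According to a third aspect, an embodiment of this application provides an audio encoding apparatus, including:
an obtaining module, configured to select a first target virtual speaker from a preset virtual speaker set according to a current scene audio signal;
a signal generation module, configured to generate a first virtual speaker signal according to the current scene audio signal and attribute information of the first target virtual speaker; and
an encoding module, configured to encode the first virtual speaker signal to obtain a code stream.
In a possible implementation, the obtaining module is configured to obtain a main sound field component from the current scene audio signal according to the virtual speaker set, and select the first target virtual speaker from the virtual speaker set according to the main sound field component.
In the third aspect of this application, the component modules of the audio encoding apparatus may further perform the steps described in the first aspect and its possible implementations. For details, see the foregoing description of the first aspect and its possible implementations.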
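In a possible implementation, the obtaining module is configured to select, according to the main sound field component, the HOA coefficient corresponding to the main sound field component from a higher order ambisonics (HOA) coefficient set, where the HOA coefficients in the HOA coefficient set correspond one-to-one with the virtual speakers in the virtual speaker set; and determine the virtual speaker in the virtual speaker set that corresponds to the HOA coefficient corresponding to the main sound field component as the first target virtual speaker.
In a possible implementation, the obtaining module is configured to obtain configuration parameters of the first target virtual speaker according to the main sound field component; generate the HOA coefficient corresponding to the first target virtual speaker according to the configuration parameters of the first target virtual speaker; and determine the virtual speaker in the virtual speaker set that corresponds to the HOA coefficient corresponding to the first target virtual speaker as the target virtual speaker.
In a possible implementation, the obtaining module is configured to determine the configuration parameters of multiple virtual speakers in the virtual speaker set according to configuration information of the audio encoder, and select, according to the main sound field component, the configuration parameters of the first target virtual speaker from the configuration parameters of the multiple virtual speakers.
In a possible implementation, the configuration parameters of the first target virtual speaker include: position information and HOA order information of the first target virtual speaker;
the obtaining module is configured to determine the HOA coefficient corresponding to the first target virtual speaker according to the position information and the HOA order information of the first target virtual speaker.
In a possible implementation, the encoding module is further configured to encode the attribute information of the first target virtual speaker and write it into the code stream.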
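In a possible implementation, the current scene audio signal includes: an HOA signal to be encoded; the attribute information of the first target virtual speaker includes the HOA coefficient of the first target virtual speaker;
the signal generation module is configured to linearly combine the HOA signal to be encoded and the HOA coefficient to obtain the first virtual speaker signal.
In a possible implementation, the current scene audio signal includes: a higher order ambisonics (HOA) signal to be encoded; the attribute information of the first target virtual speaker includes position information of the first target virtual speaker;
the signal generation module is configured to obtain the HOA coefficient corresponding to the first target virtual speaker according to the position information of the first target virtual speaker, and linearly combine the HOA signal to be encoded and the HOA coefficient to obtain the first virtual speaker signal.
In a possible implementation, the obtaining module is configured to select a second target virtual speaker from the virtual speaker set according to the current scene audio signal;
the signal generation module is configured to generate a second virtual speaker signal according to the current scene audio signal and attribute information of the second target virtual speaker;
the encoding module is configured to encode the second virtual speaker signal and write it into the code stream.
In a possible implementation, the signal generation module is configured to perform alignment processing on the first virtual speaker signal and the second virtual speaker signal to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
correspondingly, the encoding module is configured to encode the aligned second virtual speaker signal;
correspondingly, the encoding module is configured to encode the aligned first virtual speaker signal.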
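In a possible implementation, the obtaining module is configured to select a second target virtual speaker from the virtual speaker set according to the current scene audio signal;
the signal generation module is configured to generate a second virtual speaker signal according to the current scene audio signal and attribute information of the second target virtual speaker;
correspondingly, the encoding module is configured to obtain a downmix signal and side information according to the first virtual speaker signal and the second virtual speaker signal, where the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal; and encode the downmix signal and the side information.
In a possible implementation, the signal generation module is configured to perform alignment processing on the first virtual speaker signal and the second virtual speaker signal to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
correspondingly, the encoding module is configured to obtain the downmix signal and the side information according to the aligned first virtual speaker signal and the aligned second virtual speaker signal;
correspondingly, the side information is used to indicate the relationship between the aligned first virtual speaker signal and the aligned second virtual speaker signal.
In a possible implementation, the obtaining module is configured to, before selecting the second target virtual speaker from the virtual speaker set according to the current scene audio signal, determine, according to an encoding rate and/or signal type information of the current scene audio signal, whether a target virtual speaker other than the first target virtual speaker needs to be obtained; and select the second target virtual speaker from the virtual speaker set according to the current scene audio signal only if a target virtual speaker other than the first target virtual speaker needs to be obtained.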
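According to a fourth aspect, an embodiment of this application provides an audio decoding apparatus, including:
a receiving module, configured to receive a code stream;
a decoding module, configured to decode the code stream to obtain a virtual speaker signal; and
a reconstruction module, configured to obtain a reconstructed scene audio signal according to attribute information of a target virtual speaker and the virtual speaker signal.
In a possible implementation, the decoding module is further configured to decode the code stream to obtain the attribute information of the target virtual speaker.
In a possible implementation, the attribute information of the target virtual speaker includes a higher order ambisonics (HOA) coefficient of the target virtual speaker;
the reconstruction module is configured to perform synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
In a possible implementation, the attribute information of the target virtual speaker includes position information of the target virtual speaker;
the reconstruction module is configured to determine the HOA coefficient of the target virtual speaker according to the position information of the target virtual speaker, and perform synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
In a possible implementation, the virtual speaker signal is a downmix signal obtained by downmixing a first virtual speaker signal and a second virtual speaker signal, and the apparatus further includes a signal compensation module, where
the decoding module is configured to decode the code stream to obtain side information, where the side information is used to indicate the relationship between the first virtual speaker signal and the second virtual speaker signal;
the signal compensation module is configured to obtain the first virtual speaker signal and the second virtual speaker signal according to the side information and the downmix signal;
correspondingly, the reconstruction module is configured to obtain the reconstructed scene audio signal according to the attribute information of the target virtual speaker, the first virtual speaker signal, and the second virtual speaker signal.
In the fourth aspect of this application, the component modules of the audio decoding apparatus may further perform the steps described in the second aspect and its possible implementations. For details, see the foregoing description of the second aspect and its possible implementations.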
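According to a fifth aspect, an embodiment of this application provides a computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to perform the method according to the first aspect or the second aspect.
According to a sixth aspect, an embodiment of this application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method according to the first aspect or the second aspect.
According to a seventh aspect, an embodiment of this application provides a communication apparatus, which may include an entity such as a terminal device or a chip. The communication apparatus includes a processor and, optionally, a memory. The memory is configured to store instructions, and the processor is configured to execute the instructions in the memory, so that the communication apparatus performs the method according to any one of the first aspect or the second aspect.
According to an eighth aspect, this application provides a chip system, including a processor configured to support an audio encoding apparatus or an audio decoding apparatus in implementing the functions involved in the foregoing aspects, for example, sending or processing the data and/or information involved in the foregoing methods. In a possible design, the chip system further includes a memory, configured to store the program instructions and data necessary for the audio encoding apparatus or the audio decoding apparatus. The chip system may consist of a chip, or may include a chip and other discrete devices.
According to a ninth aspect, this application provides a computer-readable storage medium, including the code stream generated by the method according to any one of the first aspect.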
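Brief Description of Drawings
FIG. 1 is a schematic diagram of the composition of an audio processing system according to an embodiment of this application;
FIG. 2a is a schematic diagram of an audio encoder and an audio decoder applied to a terminal device according to an embodiment of this application;
FIG. 2b is a schematic diagram of an audio encoder applied to a wireless device or a core network device according to an embodiment of this application;
FIG. 2c is a schematic diagram of an audio decoder applied to a wireless device or a core network device according to an embodiment of this application;
FIG. 3a is a schematic diagram of a multi-channel encoder and a multi-channel decoder applied to a terminal device according to an embodiment of this application;
FIG. 3b is a schematic diagram of a multi-channel encoder applied to a wireless device or a core network device according to an embodiment of this application;
FIG. 3c is a schematic diagram of a multi-channel decoder applied to a wireless device or a core network device according to an embodiment of this application;
FIG. 4 is a schematic diagram of an interaction flow between an audio encoding apparatus and an audio decoding apparatus according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of an encoder side according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a decoder side according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of an encoder side according to an embodiment of this application;
FIG. 8 is a schematic diagram of virtual speakers distributed approximately uniformly on a sphere according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of an encoder side according to an embodiment of this application;
FIG. 10 is a schematic diagram of the composition of an audio encoding apparatus according to an embodiment of this application;
FIG. 11 is a schematic diagram of the composition of an audio decoding apparatus according to an embodiment of this application;
FIG. 12 is a schematic diagram of the composition of another audio encoding apparatus according to an embodiment of this application;
FIG. 13 is a schematic diagram of the composition of another audio decoding apparatus according to an embodiment of this application.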
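Detailed Description
Embodiments of this application provide an audio encoding and decoding method and apparatus, to reduce the amount of data used to encode a scene audio signal and improve encoding and decoding efficiency.
The following describes the embodiments of this application with reference to the accompanying drawings.
The terms "first", "second", and the like in the specification, the claims, and the accompanying drawings of this application are used to distinguish between similar objects and do not necessarily describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the manner of distinguishing objects with the same attributes when describing the embodiments of this application. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not clearly listed or that are inherent to the process, method, product, or device.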
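The technical solutions in the embodiments of this application can be applied to various audio processing systems. FIG. 1 is a schematic diagram of the composition of an audio processing system according to an embodiment of this application. The audio processing system 100 may include an audio encoding apparatus 101 and an audio decoding apparatus 102. The audio encoding apparatus 101 can be used to generate a code stream, which can then be transmitted to the audio decoding apparatus 102 through an audio transmission channel. The audio decoding apparatus 102 receives the code stream, performs its audio decoding function, and finally obtains the reconstructed signal.
In the embodiments of this application, the audio encoding apparatus can be applied to various terminal devices that require audio communication and to wireless devices and core network devices that require transcoding; for example, the audio encoding apparatus may be the audio encoder of such a terminal device, wireless device, or core network device. Similarly, the audio decoding apparatus can be applied to various terminal devices that require audio communication and to wireless devices and core network devices that require transcoding; for example, the audio decoding apparatus may be the audio decoder of such a terminal device, wireless device, or core network device. For example, the audio encoder may be included in a radio access network, a media gateway of a core network, a transcoding device, a media resource server, a mobile terminal, a fixed network terminal, and the like. The audio encoder may also be an audio codec used in a virtual reality (VR) streaming service.
In this embodiment of this application, taking audio encoding and decoding modules (audio encoding and audio decoding) suitable for a virtual reality streaming (VR streaming) service as an example, the end-to-end processing flow of an audio signal is as follows: audio signal A passes through an acquisition module (acquisition) and then undergoes preprocessing (audio preprocessing), which includes filtering out the low-frequency part of the signal, with 20 Hz or 50 Hz as the cutoff point, and extracting the orientation information in the signal. The signal is then encoded (audio encoding), packaged (file/segment encapsulation), and delivered (delivery) to the decoder side. The decoder side first unpacks (file/segment decapsulation) and then decodes (audio decoding) the signal, performs binaural rendering (audio rendering) on the decoded signal, and maps the rendered signal to the listener's headphones (headphones), which may be standalone headphones or headphones on a glasses device.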
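The low-frequency removal in the preprocessing stage can be illustrated with a first-order high-pass filter. The application gives only the cutoff points (20 Hz or 50 Hz), not the filter design, so the single-pole RC-style filter below is an assumed stand-in, and the function name `high_pass` is hypothetical.

```python
import math


def high_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order (single-pole) high-pass filter with zero initial state.
    Implements y[n] = alpha * (y[n-1] + x[n] - x[n-1])."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out
```

With a 50 Hz cutoff at a 48 kHz sample rate, a constant (0 Hz) input decays toward zero while content well above the cutoff passes almost unchanged; a production preprocessor would likely use a steeper filter.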
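FIG. 2a is a schematic diagram of an audio encoder and an audio decoder applied to a terminal device according to an embodiment of this application. Each terminal device may include an audio encoder, a channel encoder, an audio decoder, and a channel decoder. Specifically, the channel encoder is used to perform channel encoding on the audio signal, and the channel decoder is used to perform channel decoding on the audio signal. For example, the first terminal device 20 may include a first audio encoder 201, a first channel encoder 202, a first audio decoder 203, and a first channel decoder 204. The second terminal device 21 may include a second audio decoder 211, a second channel decoder 212, a second audio encoder 213, and a second channel encoder 214. The first terminal device 20 is connected to a wireless or wired first network communication device 22, the first network communication device 22 is connected to a wireless or wired second network communication device 23 through a digital channel, and the second terminal device 21 is connected to the wireless or wired second network communication device 23. The wireless or wired network communication devices may generally refer to signal transmission devices, such as communication base stations and data switching devices.
In audio communication, the terminal device acting as the sending end first collects audio, performs audio encoding on the collected audio signal, then performs channel encoding, and transmits the result in a digital channel through a wireless network or a core network. The terminal device acting as the receiving end performs channel decoding on the received signal to obtain the code stream, then recovers the audio signal through audio decoding, and plays back the audio.
FIG. 2b is a schematic diagram of an audio encoder applied to a wireless device or a core network device according to an embodiment of this application. The wireless device or core network device 25 includes a channel decoder 251, another audio decoder 252, the audio encoder 253 provided in this embodiment of this application, and a channel encoder 254, where the other audio decoder 252 refers to an audio decoder other than the audio decoder provided in the embodiments of this application. In the wireless device or core network device 25, the channel decoder 251 first performs channel decoding on the signal entering the device, the other audio decoder 252 then performs audio decoding, the audio encoder 253 provided in this embodiment of this application then performs audio encoding, and finally the channel encoder 254 performs channel encoding on the audio signal, which is transmitted after channel encoding is complete. The other audio decoder 252 performs audio decoding on the code stream decoded by the channel decoder 251.
FIG. 2c is a schematic diagram of an audio decoder applied to a wireless device or a core network device according to an embodiment of this application. The wireless device or core network device 25 includes a channel decoder 251, the audio decoder 255 provided in this embodiment of this application, another audio encoder 256, and a channel encoder 254, where the other audio encoder 256 refers to an audio encoder other than the audio encoder provided in the embodiments of this application. In the wireless device or core network device 25, the channel decoder 251 first performs channel decoding on the signal entering the device, the audio decoder 255 then decodes the received audio code stream, the other audio encoder 256 then performs audio encoding, and finally the channel encoder 254 performs channel encoding on the audio signal, which is transmitted after channel encoding is complete. In a wireless device or core network device, if transcoding is required, the corresponding audio encoding and decoding processing needs to be performed. A wireless device refers to a radio-frequency-related device in communication, and a core network device refers to a core-network-related device in communication.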
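In some embodiments of this application, the audio encoding apparatus can be applied to various terminal devices that require audio communication and to wireless devices and core network devices that require transcoding; for example, the audio encoding apparatus may be the multi-channel encoder of such a terminal device, wireless device, or core network device. Similarly, the audio decoding apparatus can be applied to various terminal devices that require audio communication and to wireless devices and core network devices that require transcoding; for example, the audio decoding apparatus may be the multi-channel decoder of such a terminal device, wireless device, or core network device.
FIG. 3a is a schematic diagram of a multi-channel encoder and a multi-channel decoder applied to a terminal device according to an embodiment of this application. Each terminal device may include a multi-channel encoder, a channel encoder, a multi-channel decoder, and a channel decoder. The multi-channel encoder can perform the audio encoding method provided in the embodiments of this application, and the multi-channel decoder can perform the audio decoding method provided in the embodiments of this application. Specifically, the channel encoder is used to perform channel encoding on the multi-channel signal, and the channel decoder is used to perform channel decoding on the multi-channel signal. For example, the first terminal device 30 may include a first multi-channel encoder 301, a first channel encoder 302, a first multi-channel decoder 303, and a first channel decoder 304. The second terminal device 31 may include a second multi-channel decoder 311, a second channel decoder 312, a second multi-channel encoder 313, and a second channel encoder 314. The first terminal device 30 is connected to a wireless or wired first network communication device 32, the first network communication device 32 is connected to a wireless or wired second network communication device 33 through a digital channel, and the second terminal device 31 is connected to the wireless or wired second network communication device 33. The wireless or wired network communication devices may generally refer to signal transmission devices, such as communication base stations and data switching devices. In audio communication, the terminal device acting as the sending end performs multi-channel encoding on the collected multi-channel signal, then performs channel encoding, and transmits the result in a digital channel through a wireless network or a core network. The terminal device acting as the receiving end performs channel decoding on the received signal to obtain the encoded code stream of the multi-channel signal, then recovers the multi-channel signal through multi-channel decoding, and plays it back.
FIG. 3b is a schematic diagram of a multi-channel encoder applied to a wireless device or a core network device according to an embodiment of this application. The wireless device or core network device 35 includes a channel decoder 351, another audio decoder 352, a multi-channel encoder 353, and a channel encoder 354. This is similar to FIG. 2b and is not described again here.
FIG. 3c is a schematic diagram of a multi-channel decoder applied to a wireless device or a core network device according to an embodiment of this application. The wireless device or core network device 35 includes a channel decoder 351, a multi-channel decoder 355, another audio encoder 356, and a channel encoder 354. This is similar to FIG. 2c and is not described again here.
The audio encoding processing may be a part of the multi-channel encoder, and the audio decoding processing may be a part of the multi-channel decoder. For example, performing multi-channel encoding on the collected multi-channel signal may mean processing the collected multi-channel signal to obtain an audio signal and then encoding the obtained audio signal according to the method provided in the embodiments of this application; the decoder side decodes the encoded code stream of the multi-channel signal to obtain the audio signal and recovers the multi-channel signal after upmix processing. Therefore, the embodiments of this application can also be applied to multi-channel encoders and multi-channel decoders in terminal devices, wireless devices, and core network devices. In a wireless device or core network device, if transcoding is required, the corresponding multi-channel encoding and decoding processing needs to be performed.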
本申请实施例提供的音频编解码方法可以包括:音频编码方法和音频解码方法,其中,音频编码方法由音频编码装置执行,音频解码方法由音频解码装置执行,音频编码装置和音频解码装置之间可以进行通信。接下来基于前述的系统架构以及音频编码装置和音频解 码装置,对本申请实施例提供的音频编码方法和音频解码方法进行说明。如图4所示,为本申请实施例中音频编码装置和音频解码装置之间的一种交互流程示意图,其中,下述步骤401至步骤403可以由音频编码装置(如下简称编码端)执行,下述步骤411至步骤413可以由音频解码装置(如下简称解码端)执行,主要包括如下过程:
401、根据当前场景音频信号从预设的虚拟扬声器集合中选择出第一目标虚拟扬声器。
其中,编码端获取当前场景音频信号,该当前场景音频信号是指对空间中麦克风所在位置的声场进行采集得到的音频信号,当前场景音频信号也可以称为原始场景音频信号。例如当前场景音频信号可以是通过高阶立体混响(higher order ambisonics,HOA)技术得到的音频信号。
本申请实施例中,编码端可以预先配置虚拟扬声器集合,该虚拟扬声器集合中可以包括多个虚拟扬声器,场景音频信号在实际回放时,可以通过耳机回放,也可以通过布置在房间中的多个扬声器回放。使用扬声器回放时,基本方法是通过多个扬声器的信号进行叠加,使得空间中某点(听音人所在的位置)声场在某个标准下尽可能的接近录制场景音频信号时的原始声场。本申请实施例中使用虚拟扬声器计算场景音频信号对应的回放信号,使用该回放信号作为传输信号,并进而生成压缩后的信号。虚拟扬声器表示的是在空间声场中虚拟存在的扬声器,该虚拟扬声器可以实现在编码端的场景音频信号的回放。
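In this embodiment of this application, the virtual speaker set includes a plurality of virtual speakers, and each of them has corresponding virtual speaker configuration parameters (configuration parameters for short). The configuration parameters include, but are not limited to, the number of virtual speakers, the HOA order of a virtual speaker, and the position coordinates of a virtual speaker. After obtaining the virtual speaker set, the encoder side selects the first target virtual speaker from the preset virtual speaker set based on the current scene audio signal, which is the original scene audio signal to be encoded. The first target virtual speaker may be any one virtual speaker in the set; for example, it may be selected using a preconfigured target virtual speaker selection policy. Such a policy selects, from the virtual speaker set, a target virtual speaker that matches the current scene audio signal, for example based on the sound field component that each virtual speaker obtains from the current scene audio signal, or based on the position information of each virtual speaker. The first target virtual speaker is the virtual speaker in the set that is used to play back the current scene audio signal; that is, the encoder side may select from the set a target virtual speaker capable of playing back the current scene audio signal.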
不限定的是,本申请实施例中,通过步骤401选择出第一目标虚拟扬声器之后,可以执行后续针对第一目标虚拟扬声器的处理过程,如后续步骤402至步骤403。本申请实施例不仅可以选择出第一目标虚拟扬声器,还可以选择出更多的目标虚拟扬声器,例如还可以选择出第二目标虚拟扬声器,针对第二目标虚拟扬声器,同样需要执行与后续步骤402至403相类似的过程,详见后续实施例的说明。
在本申请实施例中,编码端选择出第一目标虚拟扬声器之后,编码端还可以获取第一目标虚拟扬声器的属性信息,第一目标虚拟扬声器的属性信息包括与第一目标虚拟扬声器的属性相关的信息,该属性信息可以根据具体应用场景设置,例如第一目标虚拟扬声器的属性信息包括:该第一目标虚拟扬声器的位置信息,或者该第一目标虚拟扬声器的HOA系 数。其中,第一目标虚拟扬声器的位置信息可以是该第一目标虚拟扬声器在空间的分布位置,也可以是该第一目标虚拟扬声器在虚拟扬声器集合中相对于其它虚拟扬声器的位置的信息,具体此处不做限定。虚拟扬声器集合中每个虚拟扬声器都对应有HOA系数,该HOA系数也可以称为Ambisonic系数,接下来对虚拟扬声器对应的HOA系数进行说明。
例如,HOA阶数可以为2阶至10阶中的其中1个阶数,录制音频信号时的信号采样率为48至192千赫兹(kHz),采样深度为16或者24比特(bit),通过虚拟扬声器的HOA系数和场景音频信号可以生成HOA信号,HOA信号的特点是带有声场的空间信息,HOA信号是描述空间某点声场信号一定精度的信息。因此,可以考虑使用另一种表示形式描述某一位置点的声场信号,这种描述方法能够使用更少的数据量对空间位置点的信号达到同样精确度的描述,从而能达到信号压缩的目的。空间声场可以分解为多个平面波的叠加。因此,理论上可以将HOA信号表达的声场,重新使用多个平面波的叠加来表达,每个平面波使用一个声道的音频信号和一个方向向量表示。平面波叠加的表示形式能够使用更少的声道数目准确的表达原始声场,以达到信号压缩的目的。
在本申请的一些实施例中,编码端除了执行前述步骤401,本申请实施例提供的音频编码方法还包括如下步骤:
A1、根据虚拟扬声器集合从当前场景音频信号中获取主要声场成分。
其中,步骤A1中的主要声场成分也可以称为第一主要声场成分。
在执行步骤A1的场景下,前述步骤401根据当前场景音频信号从预设的虚拟扬声器集合中选择出第一目标虚拟扬声器,包括:
B1、根据主要声场成分从虚拟扬声器集合中选择出第一目标虚拟扬声器。
其中,编码端获取虚拟扬声器集合,编码端使用该虚拟扬声器集合对当前场景音频信号进行信号分解,以得到当前场景音频信号对应的主要声场成分。其中,主要声场成分表示的是当前场景音频信号中的主要声场所对应的音频信号。例如虚拟扬声器集合中包括多个虚拟扬声器,根据多个虚拟扬声器可以从当前场景音频信号中获取多个声场成分,即每个虚拟扬声器可以从当前场景音频信号中获取一个声场成分,接下来从多个声场成分中选择出主要声场成分,例如主要声场成分可以是多个声场成分中取值最大的一个或几个声场成分,或主要声场成分可以是多个声场成分中方向占优的一个或几个声场成分。虚拟扬声器集合中的每个虚拟扬声器对应一个声场成分,则根据主要声场成分从虚拟扬声器集合中选择出第一目标虚拟扬声器,例如主要声场成分对应的虚拟扬声器就是编码端选择出的第一目标虚拟扬声器。本申请实施例中,编码端通过主要声场成分可以选择出第一目标虚拟扬声器,解决了编码端需要确定第一目标虚拟扬声器的问题。
不限定的是,本申请实施例中,编码端具有多种方式选择出第一目标虚拟扬声器,例如编码端可以预设指定位置的虚拟扬声器作为第一目标虚拟扬声器,即按照虚拟扬声器集合中每个虚拟扬声器的位置选择出符合指定位置的虚拟扬声器作为第一目标虚拟扬声器。
其中,在本申请的一些实施例中,前述步骤B1根据主要声场成分从虚拟扬声器集合中选择出第一目标虚拟扬声器,包括:
根据主要声场成分从高阶立体混响HOA系数集合中选择出与主要声场成分对应的HOA系数,HOA系数集合中的HOA系数与虚拟扬声器集合中的虚拟扬声器一一对应;
确定虚拟扬声器集合中与主要声场成分对应的HOA系数对应的虚拟扬声器为第一目标虚拟扬声器。
其中,编码端中根据虚拟扬声器集合预先配置HOA系数集合,HOA系数集合中的HOA系数与虚拟扬声器集合中的虚拟扬声器之间的一一对应关系,因此根据主要声场成分选择出HOA系数之后,再根据上述一一对应关系从虚拟扬声器集合中查找与主要声场成分对应的HOA系数对应的目标虚拟扬声器,该查找出的目标虚拟扬声器即为第一目标虚拟扬声器,解决了编码端需要确定第一目标虚拟扬声器的问题。举例说明如下,HOA系数集合中包括HOA系数1、HOA系数2、HOA系数3,虚拟扬声器集合中包括虚拟扬声器1、虚拟扬声器2、虚拟扬声器3,其中,HOA系数集合中的HOA系数与虚拟扬声器集合中的虚拟扬声器一一对应,例如:HOA系数1与虚拟扬声器1对应,HOA系数2与虚拟扬声器2对应,HOA系数3与虚拟扬声器3对应。若根据主要声场成分从HOA系数集合中选择出HOA系数3,则可以确定第一目标虚拟扬声器为虚拟扬声器3。
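The one-to-one lookup described above can be sketched as follows. This is a minimal illustration, assuming that each virtual speaker yields one scalar sound field component and that the major component is the one with the largest absolute value; the function name `select_target_speaker` and the sample values are hypothetical and do not come from the application.

```python
def select_target_speaker(field_components, hoa_coefficient_set, speakers):
    """Return the speaker (and its HOA coefficient) whose component is largest.

    Assumption: field_components[i], hoa_coefficient_set[i] and speakers[i]
    all refer to the same virtual speaker (the one-to-one correspondence).
    """
    if not (len(field_components) == len(hoa_coefficient_set) == len(speakers)):
        raise ValueError("expected one component and one HOA coefficient per speaker")
    # Index of the major sound field component (largest absolute value).
    best = max(range(len(field_components)), key=lambda i: abs(field_components[i]))
    return speakers[best], hoa_coefficient_set[best]


speakers = ["virtual speaker 1", "virtual speaker 2", "virtual speaker 3"]
hoa_set = ["HOA coefficient 1", "HOA coefficient 2", "HOA coefficient 3"]
# Hypothetical component values; the third one dominates.
target, coefficient = select_target_speaker([0.2, -0.1, 0.9], hoa_set, speakers)
```

With these sample values, HOA coefficient 3 is selected, so virtual speaker 3 becomes the first target virtual speaker, matching the example in the text.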
其中,在本申请的一些实施例中,前述步骤B1根据主要声场成分从虚拟扬声器集合中选择出第一目标虚拟扬声器,还包括:
C1、根据主要声场成分获取第一目标虚拟扬声器的配置参数;
C2、根据第一目标虚拟扬声器的配置参数生成第一目标虚拟扬声器对应的HOA系数;
C3、确定虚拟扬声器集合中第一目标虚拟扬声器对应的HOA系数对应的虚拟扬声器为第一目标虚拟扬声器。
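After obtaining the major sound field component, the encoder side may use it to determine the configuration parameters of the first target virtual speaker. For example, the major sound field component may be the one or several sound field components with the largest values among the plurality of sound field components, or the one or several sound field components whose direction is dominant. The major sound field component can be used to determine the first target virtual speaker that matches the current scene audio signal. The first target virtual speaker is configured with corresponding attribute information, and its HOA coefficient can be generated from its configuration parameters. The generation of the HOA coefficient may be implemented using an HOA algorithm and is not described in detail here. Because each virtual speaker in the virtual speaker set has a corresponding HOA coefficient, the first target virtual speaker can be selected from the set based on the HOA coefficient of each virtual speaker, which solves the problem that the encoder side needs to determine the first target virtual speaker.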
其中,在本申请的一些实施例中,步骤C1根据主要声场成分获取第一目标虚拟扬声器的配置参数,包括:
根据音频编码器的配置信息确定虚拟扬声器集合中的多个虚拟扬声器的配置参数;
根据主要声场成分从多个虚拟扬声器的配置参数中选择出第一目标虚拟扬声器的配置参数。
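The audio encoder may prestore the configuration parameters of each of the plurality of virtual speakers. The configuration parameters of each virtual speaker may be determined from the configuration information of the audio encoder, where the audio encoder is the encoder side described above. The configuration information of the audio encoder includes, but is not limited to, the HOA order and the encoding bit rate. It can be used to determine the number of virtual speakers and the position parameters of each virtual speaker, which solves the problem that the encoder side needs to determine the configuration parameters of the virtual speakers. For example, a smaller number of virtual speakers may be configured when the encoding bit rate is low, and a larger number when the encoding bit rate is high. As another example, the HOA order of a virtual speaker may be equal to the HOA order of the audio encoder. Without limitation, in addition to determining the configuration parameters of each virtual speaker from the configuration information of the audio encoder, the configuration parameters may also be determined from user-defined information; for example, a user may define the positions, the HOA order, and the number of virtual speakers.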
编码端从虚拟扬声器集合中获取多个虚拟扬声器的配置参数,对于每个虚拟扬声器而言,都存在相应的虚拟扬声器配置参数,每个虚拟扬声器配置参数包括且不限于:虚拟扬声器的HOA阶数、虚拟扬声器的位置坐标等信息。使用每个虚拟扬声器的配置参数都可以生成该虚拟扬声器的HOA系数,HOA系数的生成过程可以通过HOA算法来实现,此处不再详细说明。针对虚拟扬声器集合中的每个虚拟扬声器分别生成一个HOA系数,虚拟扬声器集合中所有虚拟扬声器分别配置的HOA系数构成HOA系数集合,解决了编码端需要确定虚拟扬声器集合中每个虚拟扬声器的HOA系数的问题。
其中,在本申请的一些实施例中,第一目标虚拟扬声器的配置参数包括:第一目标虚拟扬声器的位置信息和HOA阶数信息;
前述步骤C2根据第一目标虚拟扬声器的配置参数生成第一目标虚拟扬声器对应的HOA系数,包括:
根据第一目标虚拟扬声器的位置信息和HOA阶数信息确定第一目标虚拟扬声器对应的HOA系数。
其中,虚拟扬声器集合中的每个虚拟扬声器的配置参数都可以包括该虚拟扬声器的位置信息以及该虚拟扬声器的HOA阶数信息。同样的,第一目标虚拟扬声器的配置参数包括:第一目标虚拟扬声器的位置信息和HOA阶数信息。例如可以按照局部等距的虚拟扬声器空间分布方式确定虚拟扬声器集合中每个虚拟扬声器的位置信息,局部等距的虚拟扬声器空间分布方式是指多个虚拟扬声器在空间中按照局部等距的方式进行分布,例如局部等距可以包括:均匀分布或者不均匀分布。使用每个虚拟扬声器的位置信息和HOA阶数信息都可以生成该虚拟扬声器的HOA系数,HOA系数的生成过程可以通过HOA算法来实现,解决了编码端需要确定第一目标虚拟扬声器的HOA系数的问题。
另外,本申请实施例中针对虚拟扬声器集合中的每个虚拟扬声器分别生成一组HOA系数,多组HOA系数构成前述的HOA系数集合。虚拟扬声器集合中所有虚拟扬声器分别配置的HOA系数构成HOA系数集合,解决了编码端需要确定虚拟扬声器集合中每个虚拟扬声器的HOA系数的问题。
402、根据当前场景音频信号和第一目标虚拟扬声器的属性信息生成第一虚拟扬声器信号。
其中,编码端获取到当前场景音频信号和第一目标虚拟扬声器的属性信息之后,编码端可以进行当前场景音频信号的回放,编码端根据当前场景音频信号和第一目标虚拟扬声器的属性信息生成第一虚拟扬声器信号,该第一虚拟扬声器信号即为当前场景音频信号的回放信号。第一目标虚拟扬声器的属性信息描述了与第一目标虚拟扬声器的属性相关的信息,该第一目标虚拟扬声器是编码端选择出的可回放当前场景音频信号的虚拟扬声器,因此通过第一目标虚拟扬声器的属性信息对当前场景音频信号进行回放,可以得到第一虚拟扬声器信号。该第一虚拟扬声器信号的数据量大小与当前场景音频信号的声道数无关,该 第一虚拟扬声器信号的数据量大小与第一目标虚拟扬声器有关。例如,本申请实施例中,第一虚拟扬声器信号相比于当前场景音频信号,采用较少的声道进行表示,例如当前场景音频信号为3阶HOA信号,该HOA信号为16个声道,本申请实施例中可以将16个声道压缩为2个声道,即编码端生成的虚拟扬声器信号为2个声道,例如编码端生成的虚拟扬声器信号可以包括前述的第一虚拟扬声器信号和第二虚拟扬声器信号等,编码端生成的虚拟扬声器信号的声道数与第一场景音频信号的声道数无关。通过后续步骤描述可知,码流中可以携带2个声道的第一虚拟扬声器信号,相应的,解码端接收该码流,解码该码流得到的虚拟扬声器信号为2个声道,解码端通过2个声道的虚拟扬声器信号可以重建出16个声道的场景音频信号,且保证了重建的场景音频信号与原始的场景音频信号相比时,具有主观和客观质量相当的效果。
It can be understood that steps 401 and 402 may be implemented by a spatial encoder, for example a moving picture experts group (MPEG) spatial encoder.
在本申请的一些实施例中,当前场景音频信号可以包括:待编码HOA信号;第一目标虚拟扬声器的属性信息包括第一目标虚拟扬声器的HOA系数;
步骤402根据当前场景音频信号和第一目标虚拟扬声器的属性信息生成第一虚拟扬声器信号,包括:
对待编码HOA信号和第一目标虚拟扬声器的HOA系数进行线性组合,以得到第一虚拟扬声器信号。
其中,以当前场景音频信号为待编码HOA信号为例,编码端首先确定第一目标虚拟扬声器的HOA系数,例如编码端根据主要声场成分从HOA系数集合中选择出HOA系数,该选择出的HOA系数就是第一目标虚拟扬声器的HOA系数,编码端获取到待编码HOA信号和第一目标虚拟扬声器的HOA系数之后,根据待编码HOA信号和第一目标虚拟扬声器的HOA系数可以生成第一虚拟扬声器信号,其中,待编码HOA信号可以采用第一目标虚拟扬声器的HOA系数进行线性组合得到,第一虚拟扬声器信号的求解可以被转换为对线性组合的求解问题。
例如,第一目标虚拟扬声器的属性信息可以包括:第一目标虚拟扬声器的HOA系数。编码端通过解码第一目标虚拟扬声器的属性信息可以获取到第一目标虚拟扬声器的HOA系数。编码端对待编码HOA信号和第一目标虚拟扬声器的HOA系数进行线性组合,即编码端将待编码HOA信号和第一目标虚拟扬声器的HOA系数组合在一起,可以得到线性组合矩阵,接下来编码端可以对线性组合矩阵进行求最优解,得到的最优解就是第一虚拟扬声器信号。其中,该最优解与对线性组合矩阵进行求解时采用的算法有关。本申请实施例解决了编码端需要生成第一虚拟扬声器信号的问题。
在本申请的一些实施例中,当前场景音频信号包括:待编码高阶立体混响HOA信号;第一目标虚拟扬声器的属性信息包括第一目标虚拟扬声器的位置信息;
步骤402根据当前场景音频信号和第一目标虚拟扬声器的属性信息生成第一虚拟扬声器信号,包括:
根据第一目标虚拟扬声器的位置信息获取第一目标虚拟扬声器对应的HOA系数;
对待编码HOA信号和第一目标虚拟扬声器对应的HOA系数进行线性组合,以得到第一 虚拟扬声器信号。
其中,第一目标虚拟扬声器的属性信息可以包括:第一目标虚拟扬声器的位置信息,编码端预先存储虚拟扬声器集合中每个虚拟扬声器的HOA系数,编码端还存储有每个虚拟扬声器的位置信息,虚拟扬声器的位置信息和该虚拟扬声器的HOA系数之间存在对应关系,因此编码端可以通过第一目标虚拟扬声器的位置信息确定第一目标虚拟扬声器的HOA系数。若属性信息包括HOA系数时,编码端通过解码第一目标虚拟扬声器的属性信息可以获取到第一目标虚拟扬声器的HOA系数。
编码端获取待编码HOA信号以及第一目标虚拟扬声器的HOA系数之后,编码端对待编码HOA信号和第一目标虚拟扬声器的HOA系数进行线性组合,即编码端将待编码HOA信号和第一目标虚拟扬声器的HOA系数组合在一起,可以得到线性组合矩阵,接下来编码端可以对线性组合矩阵进行求最优解,得到的最优解就是第一虚拟扬声器信号。
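For example, the HOA coefficients of the first target virtual speaker are represented by a matrix A, and the to-be-encoded HOA signal can be obtained as a linear combination using the matrix A. The theoretical optimal solution w, which is the first virtual speaker signal, can be found by the least squares method, for example using the following formula:

$$w = A^{-1}X,$$

where $A^{-1}$ denotes the inverse of the matrix A (when A is not square, the least squares solution corresponds to its pseudo-inverse). The size of the matrix A is (M×C), C is the number of first target virtual speakers, M is the number of channels of the N-th order HOA coefficients, and a denotes an HOA coefficient of the first target virtual speaker, for example,

$$A = \begin{bmatrix} a_{11} & \cdots & a_{1C} \\ \vdots & \ddots & \vdots \\ a_{M1} & \cdots & a_{MC} \end{bmatrix},$$

where X denotes the to-be-encoded HOA signal. The size of the matrix X is (M×L), M is the number of channels of the N-th order HOA coefficients, L is the number of sampling points, and x denotes a coefficient of the to-be-encoded HOA signal, for example,

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1L} \\ \vdots & \ddots & \vdots \\ x_{M1} & \cdots & x_{ML} \end{bmatrix}.$$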
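The least squares step above can be sketched in plain Python. This is an illustrative sketch only: it assumes real-valued matrices stored as lists of rows, assumes that A has full column rank, and solves the normal equations $(A^{T}A)w = A^{T}X$ directly. The helper names are hypothetical, and a production encoder would normally use a numerically more robust solver.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def solve(g, r):
    """Solve G w = R by Gauss-Jordan elimination with partial pivoting."""
    n = len(g)
    aug = [list(g[i]) + list(r[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("A does not have full column rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for i in range(n):
            if i != col:
                f = aug[i][col]
                aug[i] = [vi - f * vc for vi, vc in zip(aug[i], aug[col])]
    return [row[n:] for row in aug]


def virtual_speaker_signals(A, X):
    """Least squares solution w (C x L) of A w = X, with A (M x C), X (M x L)."""
    At = transpose(A)
    return solve(matmul(At, A), matmul(At, X))


# Hypothetical toy data: M = 4 HOA channels, C = 2 target speakers, L = 3 samples.
A = [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
w_true = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
X = matmul(A, w_true)          # HOA signal that A can represent exactly
w = virtual_speaker_signals(A, X)
```

Because X in this toy example lies exactly in the span of the columns of A, the recovered w equals `w_true` up to rounding; for a real HOA signal the result is only the best approximation in the least squares sense.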
403、对虚拟扬声器信号进行编码,以得到码流。
本申请实施例中,编码端在生成第一虚拟扬声器信号之后,编码端可以对第一虚拟扬声器信号进行编码,以得到码流。例如编码端具体可以是核心编码器,核心编码器对第一虚拟扬声器信号进行编码,以得到码流。该码流也可以称为音频信号编码码流。本申请实施例编码端对该第一虚拟扬声器信号进行编码,而不再对场景音频信号进行编码,通过选择出的第一目标虚拟扬声器,使得空间中听音人所在的位置声场尽可能的接近录制场景音频信号时的原始声场,保证了编码端的编码质量,且第一虚拟扬声器信号的编码数据量与场景音频信号的声道数无关,减少编码场景音频信号的数据量,提高编解码效率。
在本申请的一些实施例中,编码端执行上述的步骤401至步骤403之后,本申请实施例提供的音频编码方法还包括如下步骤:
对第一目标虚拟扬声器的属性信息进行编码,并写入码流。
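In addition to encoding the virtual speaker signal, the encoder side may encode the attribute information of the first target virtual speaker and write the encoded attribute information into the bitstream. The resulting bitstream may then include the encoded virtual speaker signal and the encoded attribute information of the first target virtual speaker. Because the bitstream can carry the encoded attribute information of the first target virtual speaker, the decoder side can determine that attribute information by decoding the bitstream, which facilitates audio decoding at the decoder side.
It should be noted that steps 401 to 403 describe the process in which, when the first target virtual speaker is selected from the virtual speaker set, the first virtual speaker signal is generated based on the first target virtual speaker and then encoded. Without limitation, the encoder side may select not only the first target virtual speaker but also more target virtual speakers, for example a second target virtual speaker. For the second target virtual speaker, a process similar to steps 402 to 403 also needs to be performed, as described in detail below.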
在本申请的一些实施例中,编码端除了执行前述步骤之外,本申请实施例提供的音频编码方法还包括:
D1、根据第一场景音频信号从虚拟扬声器集合中选择出第二目标虚拟扬声器;
D2、根据第一场景音频信号和第二目标虚拟扬声器的属性信息生成第二虚拟扬声器信号;
D3、对第二虚拟扬声器信号进行编码,并写入码流。
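Step D1 is implemented similarly to step 401. The second target virtual speaker is another target virtual speaker selected by the encoder side that is different from the first target virtual speaker. The first scene audio signal is the original scene audio signal to be encoded, and the second target virtual speaker may be any one virtual speaker in the virtual speaker set; for example, it may be selected from the preset virtual speaker set using a preconfigured target virtual speaker selection policy. Such a policy selects, from the virtual speaker set, a target virtual speaker that matches the first scene audio signal, for example based on the sound field component that each virtual speaker obtains from the first scene audio signal.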
在本申请的一些实施例中,本申请实施例提供的音频编码方法还包括如下步骤:
E1、根据虚拟扬声器集合从第一场景音频信号中获取第二主要声场成分。
在执行步骤E1的场景下,前述步骤D1根据第一场景音频信号从预设的虚拟扬声器集合中选择出第二目标虚拟扬声器,包括:
F1、根据第二主要声场成分从虚拟扬声器集合中选择出第二目标虚拟扬声器。
其中,编码端获取虚拟扬声器集合,编码端使用该虚拟扬声器集合对第一场景音频信号进行信号分解,以得到第一场景音频信号对应的第二主要声场成分。其中,第二主要声场成分表示的是第一场景音频信号中的主要声场所对应的音频信号。例如虚拟扬声器集合中包括多个虚拟扬声器,根据多个虚拟扬声器可以从第一场景音频信号中获取多个声场成分,即每个虚拟扬声器可以从第一场景音频信号中获取一个声场成分,接下来从多个声场成分中选择出第二主要声场成分,例如第二主要声场成分可以是多个声场成分中取值最大 的一个或几个声场成分,或第二主要声场成分可以是多个声场成分中方向占优的一个或几个声场成分。根据第二主要声场成分从虚拟扬声器集合中选择出第二目标虚拟扬声器,例如第二主要声场成分对应的虚拟扬声器就是编码端选择出的第二目标虚拟扬声器。本申请实施例中,编码端通过主要声场成分可以选择出第二目标虚拟扬声器,解决了编码端需要确定第二目标虚拟扬声器的问题。
其中,在本申请的一些实施例中,前述步骤F1根据第二主要声场成分从虚拟扬声器集合中选择出第二目标虚拟扬声器,包括:
根据第二主要声场成分从HOA系数集合中选择出与第二主要声场成分对应的HOA系数,HOA系数集合中的HOA系数与虚拟扬声器集合中的虚拟扬声器一一对应;
确定虚拟扬声器集合中与第二主要声场成分对应的HOA系数对应的虚拟扬声器为第二目标虚拟扬声器。
其中,上述实现与前述实施例中确定第一目标虚拟扬声器的过程相类似,此处不再赘述。
其中,在本申请的一些实施例中,前述步骤F1根据第二主要声场成分从虚拟扬声器集合中选择出第二目标虚拟扬声器,还包括:
G1、根据第二主要声场成分获取第二目标虚拟扬声器的配置参数;
G2、根据第二目标虚拟扬声器的配置参数生成第二目标虚拟扬声器对应的HOA系数;
G3、确定虚拟扬声器集合中第二目标虚拟扬声器对应的HOA系数对应的虚拟扬声器为第二目标虚拟扬声器。
其中,上述实现与前述实施例中确定第一目标虚拟扬声器的过程相类似,此处不再赘述。
其中,在本申请的一些实施例中,步骤G1根据第二主要声场成分获取第二目标虚拟扬声器的配置参数,包括:
根据音频编码器的配置信息确定虚拟扬声器集合中的多个虚拟扬声器的配置参数;
根据第二主要声场成分从多个虚拟扬声器的配置参数中选择出第二目标虚拟扬声器的配置参数。
其中,上述实现与前述实施例中确定第一目标虚拟扬声器的配置参数的过程相类似,此处不再赘述。
其中,在本申请的一些实施例中,第二目标虚拟扬声器的配置参数包括:第二目标虚拟扬声器的位置信息和HOA阶数信息;
前述步骤G2根据第二目标虚拟扬声器的配置参数生成第二目标虚拟扬声器对应的HOA系数,包括:
根据第二目标虚拟扬声器的位置信息和HOA阶数信息确定第二目标虚拟扬声器对应的HOA系数。
其中,上述实现与前述实施例中确定第一目标虚拟扬声器对应的HOA系数的过程相类似,此处不再赘述。
在本申请的一些实施例中,第一场景音频信号,包括:待编码HOA信号;第二目标虚拟扬声器的属性信息包括第二目标虚拟扬声器的HOA系数;
步骤D2根据第一场景音频信号和第二目标虚拟扬声器的属性信息生成第二虚拟扬声器信号,包括:
对待编码HOA信号和第二目标虚拟扬声器的HOA系数进行线性组合,以得到第二虚拟扬声器信号。
在本申请的一些实施例中,第一场景音频信号包括:待编码高阶立体混响HOA信号;第二目标虚拟扬声器的属性信息包括第二目标虚拟扬声器的位置信息;
步骤D2根据第一场景音频信号和第二目标虚拟扬声器的属性信息生成第二虚拟扬声器信号,包括:
根据第二目标虚拟扬声器的位置信息获取第二目标虚拟扬声器对应的HOA系数;
对待编码HOA信号和第二目标虚拟扬声器对应的HOA系数进行线性组合,以得到第二虚拟扬声器信号。
其中,上述实现与前述实施例中确定第一虚拟扬声器信号的过程相类似,此处不再赘述。
在本申请实施例中,编码端生成第二虚拟扬声器信号之后,编码端还可以执行步骤D3,对第二虚拟扬声器信号进行编码,并写入码流。其中,编码端所采用的编码方法与步骤403相类似,使得码流可以携带第二虚拟扬声器信号的编码结果。
其中,在本申请的一些实施例中,编码端执行的音频编码方法还可以包括如下步骤:
I1、对第一虚拟扬声器信号和第二虚拟扬声器信号进行对齐处理,以得到对齐后的第一虚拟扬声器信号和对齐后的第二虚拟扬声器信号。
在执行步骤I1的场景下,相应地,步骤D3对第二虚拟扬声器信号进行编码包括:
对对齐后的第二虚拟扬声器信号进行编码;
相应地,步骤403对第一虚拟扬声器信号进行编码,包括:
对对齐后的第一虚拟扬声器信号进行编码。
其中,编码端可以生成第一虚拟扬声器信号和第二虚拟扬声器信号,编码端可以对第一虚拟扬声器信号和第二虚拟扬声器信号进行对齐处理,以得到对齐后的第一虚拟扬声器信号和对齐后的第二虚拟扬声器信号,举例说明如下,有两个虚拟扬声器信号,当前帧的虚拟扬声器信号的声道顺序为1、2,分别对应由目标虚拟扬声器P1、P2产生的虚拟扬声器信号,前一帧的虚拟扬声器信号的声道顺序为1、2,分别对应由目标虚拟扬声器P2、P1产生的虚拟扬声器信号,则可以按照前一帧目标虚拟扬声器的顺序对当前帧虚拟扬声器信号的声道顺序进行调整,例如将当前帧的虚拟扬声器信号的声道顺序调整为2、1,使得相同的目标虚拟扬声器产生的虚拟扬声器信号处于同一声道上。
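The channel alignment in the P1/P2 example above can be sketched as follows. This is a minimal sketch, assuming each frame carries a list of target virtual speaker identifiers alongside its channel signals; the function name `align_channels` and the rule for speakers that did not appear in the previous frame (they fill the remaining channels in order) are assumptions, not part of the application.

```python
def align_channels(prev_speakers, cur_speakers, cur_signals):
    """Reorder the current frame so that each speaker keeps its previous channel.

    prev_speakers: speaker identifiers of the previous frame, in channel order.
    cur_speakers:  speaker identifiers of the current frame, in channel order.
    cur_signals:   per-channel sample lists of the current frame.
    """
    n = len(cur_speakers)
    aligned_speakers = [None] * n
    aligned_signals = [None] * n
    leftovers = []
    for speaker, signal in zip(cur_speakers, cur_signals):
        slot = prev_speakers.index(speaker) if speaker in prev_speakers else -1
        if 0 <= slot < n and aligned_speakers[slot] is None:
            # Same speaker as in the previous frame: reuse its channel.
            aligned_speakers[slot], aligned_signals[slot] = speaker, signal
        else:
            leftovers.append((speaker, signal))
    # Assumption: speakers that are new in this frame fill the free channels.
    free_slots = [i for i in range(n) if aligned_speakers[i] is None]
    for slot, (speaker, signal) in zip(free_slots, leftovers):
        aligned_speakers[slot], aligned_signals[slot] = speaker, signal
    return aligned_speakers, aligned_signals


# Previous frame: channels 1, 2 carry P2, P1. Current frame: channels 1, 2 carry P1, P2.
order, signals = align_channels(["P2", "P1"], ["P1", "P2"],
                                [[0.1, 0.2], [0.3, 0.4]])
```

After alignment the current frame carries P2 on channel 1 and P1 on channel 2, so signals produced by the same target virtual speaker stay on the same channel across frames.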
编码端获取到对齐后的第一虚拟扬声器信号之后,可以对对齐后的第一虚拟扬声器信号进行编码,本申请实施例中通过将第一虚拟扬声器信号的各声道间重新调整对齐,增强了声道间相关性,有利于核心编码器对第一虚拟扬声器信号的编码处理。
在本申请的一些实施例中,编码端除了执行前述步骤之外,本申请实施例提供的音频编码方法还包括:
D1、根据第一场景音频信号从虚拟扬声器集合中选择出第二目标虚拟扬声器;
D2、根据第一场景音频信号和第二目标虚拟扬声器的属性信息生成第二虚拟扬声器信号。
相应地,在编码端执行步骤D1至D2的场景下,步骤403对第一虚拟扬声器信号进行编码,包括:
J1、根据第一虚拟扬声器信号和第二虚拟扬声器信号获得下混信号和边信息,边信息用于指示第一虚拟扬声器信号和第二虚拟扬声器信号之间的关系;
J2、对下混信号以及边信息进行编码。
其中,编码端在获取到第一虚拟扬声器信号和第二虚拟扬声器信号之后,编码端还可以根据第一虚拟扬声器信号和第二虚拟扬声器信号进行下混处理,以生成下混信号,例如对第一虚拟扬声器信号和第二虚拟扬声器信号进行幅度上的下混处理,以得到下混信号。另外还可以根据第一虚拟扬声器信号和第二虚拟扬声器信号生成边信息,边信息用于指示第一虚拟扬声器信号和第二虚拟扬声器信号之间的关系,该关系具有多种实现方式,该边信息可以用于解码端针对下混信号进行上混,以恢复出第一虚拟扬声器信号和第二虚拟扬声器信号。例如边信息包括信号信息丢失分析参数,以使得解码端通过信号信息丢失分析参数恢复出第一虚拟扬声器信号和第二虚拟扬声器信号。又如边信息具体可以是第一虚拟扬声器信号和第二虚拟扬声器信号的相关性参数,例如,可以是第一虚拟扬声器信号和第二虚拟扬声器信号的能量比例参数。以使得解码端通过上述相关性参数或者能量比例参数恢复出第一虚拟扬声器信号和第二虚拟扬声器信号。
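One possible reading of the downmix and side information step is sketched below. It assumes an amplitude downmix that averages the two virtual speaker signals and a single energy ratio parameter as side information; the upmix shown only restores the energy split between the two signals, not their waveforms. All function names and the gain rule are assumptions made for illustration.

```python
import math


def downmix_with_side_info(first, second):
    """Return (downmixed signal, energy ratio of the first signal)."""
    if len(first) != len(second):
        raise ValueError("both virtual speaker signals must have the same length")
    downmix = [(a + b) / 2.0 for a, b in zip(first, second)]
    e1 = sum(a * a for a in first)
    e2 = sum(b * b for b in second)
    total = e1 + e2
    # Side information: share of the total energy carried by the first signal.
    ratio = e1 / total if total > 0.0 else 0.5
    return downmix, ratio


def upmix(downmix, ratio):
    """Approximate both signals from the downmix using the energy ratio only."""
    g1 = math.sqrt(2.0 * ratio)
    g2 = math.sqrt(2.0 * (1.0 - ratio))
    return [g1 * d for d in downmix], [g2 * d for d in downmix]


# Hypothetical sample data.
first = [3.0, 0.0, -3.0]
second = [1.0, 0.0, -1.0]
downmix, energy_ratio = downmix_with_side_info(first, second)
rec_first, rec_second = upmix(downmix, energy_ratio)
```

The reconstructed pair has the same energy ratio as the original pair, which is the property the side information is meant to convey in this sketch; a real codec would typically compute such parameters per frequency band.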
其中,在本申请的一些实施例中,在编码端执行步骤D1至D2的场景下,编码端还可以执行如下步骤:
I1、对第一虚拟扬声器信号和第二虚拟扬声器信号进行对齐处理,以得到对齐后的第一虚拟扬声器信号和对齐后的第二虚拟扬声器信号。
在执行步骤I1的场景下,相应地,步骤J1根据第一虚拟扬声器信号和第二虚拟扬声器信号获得下混信号和边信息,包括:
根据对齐后的第一虚拟扬声器信号和对齐后的第二虚拟扬声器信号获得下混信号和边信息;
相应的,边信息用于指示对齐后的第一虚拟扬声器信号和对齐后的第二虚拟扬声器信号之间的关系。
其中,编码端在生成下混信号之前,可以先执行虚拟扬声器信号的对齐操作,在完成对齐操作之后,再生成下混信号和边信息。本申请实施例中通过将第一虚拟扬声器信号和第二虚拟扬声器的各声道间重新调整对齐,增强了声道间相关性,有利于核心编码器对第一虚拟扬声器信号的编码处理。
需要说明的是,在本申请的上述实施例中,第二场景音频信号可以根据对齐前的第一虚拟扬声器信号和对齐前的第二虚拟扬声器信号获取,也可以根据对齐后的第一虚拟扬声器信号和对齐后的第二虚拟扬声器信号获取,具体实现方式取决于应用场景,此处不做限定。
在本申请的一些实施例中,在步骤D1根据第一场景音频信号从虚拟扬声器集合中选择 出第二目标虚拟扬声器前,本申请实施例提供的音频信号编码方法还包括:
K1、根据编码速率和/或第一场景音频信号的信号类型信息确定是否需要获取除第一目标虚拟扬声器以外的目标虚拟扬声器;
K2、若需要获取除第一目标虚拟扬声器以外的目标虚拟扬声器,才根据第一场景音频信号从虚拟扬声器集合中选择出第二目标虚拟扬声器。
其中,编码端还可以进行信号选择,以确定是否需要获取第二目标虚拟扬声器,在需要获取第二目标虚拟扬声器的情况下,编码端可以生成第二虚拟扬声器信号,在不需要获取第二目标虚拟扬声器的情况下,编码端可以不生成第二虚拟扬声器信号。其中,编码器可以根据音频编码器的配置信息和/或第一场景音频信号的信号类型信息进行决策,以确定在选择出第一目标虚拟扬声器之外是否还需要选择别的目标虚拟扬声器。例如,若编码速率高于预设的阈值,则确定需要获取两个主要声场成分对应的目标虚拟扬声器,则在确定出第一目标虚拟扬声器之外,还可以继续确定第二目标虚拟扬声器。又如,根据第一场景音频信号的信号类型信息确定需要获取包含声源方向占优的两个主要声场成分对应的目标虚拟扬声器,则在确定出第一目标虚拟扬声器之外,还可以继续确定第二目标虚拟扬声器。相反的,若根据编码速率和/或第一场景音频信号的信号类型信息确定只需要获取一个目标虚拟扬声器,则在确定第一目标虚拟扬声器之后,就确定不再获取除第一目标虚拟扬声器以外的目标虚拟扬声器。本申请实施例中通过信号选择,可以减少编码端进行编码的数据量,提高编码效率。
其中,编码端进行信号选择时,可以确定是否需要生成第二虚拟扬声器信号。由于编码端进行信号选择,会产生信息丢失,因此需要对不传输的虚拟扬声器信号进行信号补偿。信号补偿可以选择且不限于信息丢失分析,能量补偿,包络补偿,噪声补偿等。补偿的方法可以选择线性补偿或非线性补偿等。信号补偿之后可以生成边信息,该边信息可以被写入码流中,从而解码端可以通过码流得到边信息,解码端可以根据边信息进行信号补偿,从而提高解码端的解码信号质量。
通过前述实施例的举例说明,本申请实施例中可以根据第一场景音频信号和第一目标虚拟扬声器的属性信息生成第一虚拟扬声器信号,音频编码端对该第一虚拟扬声器信号进行编码,而不再直接对第一场景音频信号进行编码,本申请实施例中根据第一场景音频信号选择出第一目标虚拟扬声器,基于该第一目标虚拟扬声器生成的第一虚拟扬声器信号可以表示空间中听音人所在的位置声场,该位置声场尽可能的接近录制第一场景音频信号时的原始声场,保证了音频编码端的编码质量,且对第一虚拟扬声器信号和残差信号进行编码以得到码流,该第一虚拟扬声器信号的编码数据量与第一目标虚拟扬声器有关,而与第一场景音频信号的声道个数无关,减少了编码数据量,提高编码效率。
在申请实施例中,编码端对虚拟扬声器信号进行编码,生成码流。然后编码端可以将该码流输出,并经过音频传输通道,发送至解码端。解码端执行后续步骤411至步骤413。
411、接收码流。
其中,解码端从编码端接收码流。该码流可以携带编码后的第一虚拟扬声器信号。不限定的是,该码流还可以携带编码后的第一目标虚拟扬声器的属性信息。需要说明的是,码流中可以不携带第一目标虚拟扬声器的属性信息,此时解码端可以通过预先配置确定出 第一目标虚拟扬声器的属性信息。
另外,在本申请的一些实施例中,在编码端生成第二虚拟扬声器信号的情况下,该码流还可以携带第二虚拟扬声器信号。不限定的是,该码流还可以携带编码后的第二目标虚拟扬声器的属性信息。需要说明的是,码流中可以不携带第二目标虚拟扬声器的属性信息,此时解码端可以通过预先配置确定出第二目标虚拟扬声器的属性信息。
412、解码码流以获得虚拟扬声器信号。
其中,解码端接收到来自编码端的码流之后,对该码流进行解码,从该码流中得到虚拟扬声器信号。
需要说明的是,该虚拟扬声器信号具体可以是前述的第一虚拟扬声器信号,还可以是前述的第一虚拟扬声器信号和第二虚拟扬声器信号,此处不做限定。
在本申请的一些实施例中,解码端执行上述的步骤411至步骤412之后,本申请实施例提供的音频解码方法还包括如下步骤:
解码码流以获得目标虚拟扬声器的属性信息。
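In addition to encoding the virtual speaker signal, the encoder side may encode the attribute information of the target virtual speaker and write the encoded attribute information into the bitstream; for example, the attribute information of the first target virtual speaker can be obtained from the bitstream. Because the bitstream can carry the encoded attribute information of the first target virtual speaker, the decoder side can determine that attribute information by decoding the bitstream, which facilitates audio decoding at the decoder side.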
413、根据目标虚拟扬声器的属性信息和虚拟扬声器信号获得重建的场景音频信号。
其中,解码端可以获取目标虚拟扬声器的属性信息,该目标虚拟扬声器为虚拟扬声器集合中用于回放重建的场景音频信号的虚拟扬声器。目标虚拟扬声器的属性信息可以包括目标虚拟扬声器的位置信息和目标虚拟扬声器的HOA系数。解码端获取到虚拟扬声器信号之后,解码端使用目标虚拟扬声器的属性信息进行信号重建,通过信号重建可以输出重建的场景音频信号。
在本申请的一些实施例中,目标虚拟扬声器的属性信息包括目标虚拟扬声器的HOA系数;
步骤413根据目标虚拟扬声器的属性信息以及虚拟扬声器信号获得重建的场景音频信号,包括:
对虚拟扬声器信号和目标虚拟扬声器的HOA系数进行合成处理,得到重建的场景音频信号。
其中,解码端首先确定目标虚拟扬声器的HOA系数,例如解码端中可以预先存储目标虚拟扬声器的HOA系数,解码端获取到虚拟扬声器信号和目标虚拟扬声器的HOA系数之后,根据虚拟扬声器信号和目标虚拟扬声器的HOA系数可以得到重建的场景音频信号。从而提高重建的场景音频信号的质量。
举例说明如下,目标虚拟扬声器的HOA系数用矩阵A’表示,矩阵A’的大小为(M×C),C为目标虚拟扬声器个数,M为N阶的HOA系数的声道个数。虚拟扬声器信号用矩阵W’表示,矩阵W’的大小为(C×L),其中,L为信号采样点个数。通过如下计算式得到重建的HOA信号:
H=A’W’,
通过上述计算式得到的H即为重建的HOA信号。
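The synthesis H = A'W' is a plain matrix product. The sketch below, with hypothetical toy values, assumes A' is stored as M rows of C coefficients and W' as C rows of L samples.

```python
def reconstruct_hoa(a_prime, w_prime):
    """Return H = A'W', with A' of size M x C and W' of size C x L."""
    if any(len(row) != len(w_prime) for row in a_prime):
        raise ValueError("columns of A' must match rows of W'")
    num_samples = len(w_prime[0])
    return [
        [sum(row[c] * w_prime[c][l] for c in range(len(w_prime)))
         for l in range(num_samples)]
        for row in a_prime
    ]


# Hypothetical toy data: M = 4 HOA channels, C = 2 target speakers, L = 3 samples.
a_prime = [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
w_prime = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
H = reconstruct_hoa(a_prime, w_prime)
```

Here two transmitted channels are expanded back into four HOA channels; with a 3rd-order signal the same product would expand the transmitted channels into 16.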
在本申请的一些实施例中,目标虚拟扬声器的属性信息包括目标虚拟扬声器的位置信息;
步骤413根据目标虚拟扬声器的属性信息以及虚拟扬声器信号获得重建的场景音频信号,包括:
根据目标虚拟扬声器的位置信息确定目标虚拟扬声器的HOA系数;
对虚拟扬声器信号和目标虚拟扬声器的HOA系数进行合成处理,得到重建的场景音频信号。
其中,目标虚拟扬声器的属性信息可以包括:目标虚拟扬声器的位置信息。解码端预先存储虚拟扬声器集合中每个虚拟扬声器的HOA系数,解码端还存储有每个虚拟扬声器的位置信息,例如解码端可以根据虚拟扬声器的位置信息和该虚拟扬声器的HOA系数之间的对应关系确定出目标虚拟扬声器的位置信息对应的HOA系数,或者解码端可以根据目标虚拟扬声器的位置信息计算出目标虚拟扬声器的HOA系数。因此解码端可以通过目标虚拟扬声器的位置信息确定目标虚拟扬声器的HOA系数。解决了解码端需要确定目标虚拟扬声器的HOA系数的问题。
在本申请的一些实施例中,通过编码端的方法说明可知,虚拟扬声器信号是根据第一虚拟扬声器信号和第二虚拟扬声器信号下混获得的下混信号。在这种实现场景下,本申请实施例提供的音频解码方法还包括:
解码码流以获得边信息,边信息用于指示第一虚拟扬声器信号和第二虚拟扬声器信号之间的关系;
根据边信息和下混信号获得第一虚拟扬声器信号和第二虚拟扬声器信号。
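In this embodiment, the relationship between the first virtual speaker signal and the second virtual speaker signal may be a direct relationship or an indirect relationship. For example, when the relationship is direct, the side information may include a correlation parameter between the first virtual speaker signal and the second virtual speaker signal, such as an energy ratio parameter between the two signals. When the relationship is indirect, the side information may include a correlation parameter between the first virtual speaker signal and the downmixed signal and a correlation parameter between the second virtual speaker signal and the downmixed signal, for example an energy ratio parameter between the first virtual speaker signal and the downmixed signal and an energy ratio parameter between the second virtual speaker signal and the downmixed signal.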
在所述第一虚拟扬声器信号和第二虚拟扬声器信号之间的关系可以是直接关系时,解码器可以根据下混信号,下混信号的获取方式以及该直接关系确定出第一虚拟扬声器信号和第二虚拟扬声器信号;在所述第一虚拟扬声器信号和第二虚拟扬声器信号之间的关系可以是间接关系时,解码器可以根据下混信号及该间接关系确定出第一虚拟扬声器信号和第二虚拟扬声器信号。
相应的,步骤413根据目标虚拟扬声器的属性信息以及虚拟扬声器信号获得重建的场景音频信号,包括:
根据目标虚拟扬声器的属性信息、第一虚拟扬声器信号和第二虚拟扬声器信号获得重 建的场景音频信号。
其中,编码端根据第一虚拟扬声器信号和第二虚拟扬声器信号进行下混处理时生成下混信号,编码端还可以针对下混信号进行信号补偿,以生成边信息,该边信息可以被写入码流中,解码端可以通过码流得到边信息,解码端可以根据边信息进行信号补偿,以得到第一虚拟扬声器信号和第二虚拟扬声器信号,因此在进行信号重建时,可以使用第一虚拟扬声器信号和第二虚拟扬声器信号,以及前述的目标虚拟扬声器的属性信息,从而提高解码端的解码信号质量。
通过前述实施例的举例说明,本申请实施例中可以从码流中解码得到虚拟扬声器信号,虚拟扬声器信号作为场景音频信号的回放信号,通过目标虚拟扬声器的属性信息和虚拟扬声器信号得到了重建的场景音频信号,本申请实施例中,获取到的码流中携带虚拟扬声器信号和残差信号,减少了解码的数据量,提高了解码效率。
举例说明如下,本申请实施例中,第一虚拟扬声器信号相比于第一场景音频信号,采用较少的声道进行表示,例如第一场景音频信号为3阶HOA信号,该HOA信号为16个声道,本申请实施例中可以将16个声道压缩为2个声道,即编码端生成的虚拟扬声器信号为2个声道,例如编码端生成的虚拟扬声器信号可以包括前述的第一虚拟扬声器信号和第二虚拟扬声器信号等,编码端生成的虚拟扬声器信号的声道数与第一场景音频信号的声道数无关。通过后续步骤描述可知,码流中可以携带2个声道的虚拟扬声器信号,相应的,解码端接收该码流,解码该码流得到的虚拟扬声器信号为2个声道,解码端通过2个声道的虚拟扬声器信号可以重建出16个声道的场景音频信号,且保证了重建的场景音频信号与原始的场景音频信号相比时,具有主观和客观质量相当的效果。
为便于更好的理解和实施本申请实施例的上述方案,下面举例相应的应用场景来进行具体说明。
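In this embodiment of this application, the scene audio signal being an HOA signal is used as an example. When a sound wave propagates in an ideal medium, the wave number is $k = w/c$ and the angular frequency is $w = 2\pi f$, where f is the sound wave frequency and c is the speed of sound. The sound pressure p then satisfies the following formula, where $\nabla^{2}$ is the Laplace operator:

$$\nabla^{2}p + k^{2}p = 0$$

Solving the above equation in spherical coordinates, the solution within a source-free spherical region is given by the following formula:

$$p(r,\theta,\varphi,k) = s\sum_{m=0}^{\infty}(2m+1)\,j^{m}\,j_{m}^{kr}(kr)\sum_{0\le n\le m,\ \sigma=\pm 1} Y_{m,n}^{\sigma}(\theta_{s},\varphi_{s})\,Y_{m,n}^{\sigma}(\theta,\varphi)$$

In the above formula, r denotes the sphere radius, θ denotes the horizontal angle, φ denotes the elevation angle, k denotes the wave number, s is the amplitude of an ideal plane wave, and m is the HOA order index. $j^{m}j_{m}^{kr}(kr)$ is the spherical Bessel function, also called the radial basis function, where the first j is the imaginary unit. $(2m+1)\,j^{m}j_{m}^{kr}(kr)$ does not vary with angle. $Y_{m,n}^{\sigma}(\theta,\varphi)$ is the spherical harmonic function in the direction θ, φ, and $Y_{m,n}^{\sigma}(\theta_{s},\varphi_{s})$ is the spherical harmonic function in the direction of the sound source.

The HOA coefficient can be expressed as:

$$B_{m,n}^{\sigma} = s\cdot Y_{m,n}^{\sigma}(\theta_{s},\varphi_{s})$$

which gives the following formula:

$$p(r,\theta,\varphi,k) = \sum_{m=0}^{\infty}(2m+1)\,j^{m}\,j_{m}^{kr}(kr)\sum_{0\le n\le m,\ \sigma=\pm 1} B_{m,n}^{\sigma}\,Y_{m,n}^{\sigma}(\theta,\varphi)$$

The above formula shows that the sound field can be expanded on the sphere in terms of spherical harmonic functions and represented by the coefficients $B_{m,n}^{\sigma}$. Conversely, the sound field can be reconstructed if the coefficients $B_{m,n}^{\sigma}$ are known. When the above formula is truncated at the N-th term and the coefficients $B_{m,n}^{\sigma}$ are used as an approximate description of the sound field, they are called N-th order HOA coefficients; the HOA coefficients may also be called Ambisonic coefficients. N-th order HOA coefficients have $(N+1)^{2}$ channels in total. An Ambisonic signal of order higher than one is also called an HOA signal. By superposing the spherical harmonic functions weighted by the coefficients corresponding to one sampling point of the HOA signal, the spatial sound field at the moment corresponding to that sampling point can be reconstructed.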
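The channel count of an N-th order HOA signal follows directly from the formula above. A small sketch (the function name is hypothetical):

```python
def hoa_channel_count(order):
    """Number of channels of an N-th order HOA signal: (N + 1) squared."""
    if order < 0:
        raise ValueError("HOA order must be non-negative")
    return (order + 1) ** 2


# Channel counts for orders 1 to 6.
counts = {n: hoa_channel_count(n) for n in range(1, 7)}
```

For order 3 this gives 16 channels, which is the figure used for the 3rd-order HOA example elsewhere in this description.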
例如,在一种配置下,HOA阶数可以为2至6阶,对场景音频录制时信号采样率为48至192kHz,采样深度为16或24Bit。HOA信号的特点是带有声场的空间信息,是空间某点声场信号一定精度的描述。因此,可以考虑使用另一种表示形式描述该点的声场信号,如果这种描述方法能够使用更少的数据量对该点信号达到同样精确度的描述,就能达到信号压缩的目的。
空间声场可以分解为多个平面波的叠加。因此,可以将HOA信号表达的声场,重新使用多个平面波的叠加来表达,每个平面波使用一个声道的音频信号和一个方向向量表示。如果平面波叠加的表示形式能够使用更少的声道数目较好的表达原始声场,则可以达到信号压缩的目的。
HOA信号在实际回放时,可以通过耳机回放,也可以通过布置在房间中的多个扬声器回放。使用扬声器回放时,基本方法是通过多个扬声器的声场的叠加,使得空间中某点(听音人所在的位置)声场在某个标准下尽可能的接近录制HOA信号时的原始声场。本申请实施例假设一个虚拟扬声器阵列,然后计算该虚拟扬声器阵列的回放信号,使用该回放信号作为传输信号,并进而生成压缩后的信号。解码端通过对码流进行解码,得到该回放信号,并通过该回放信号重建出场景音频信号。
本申请实施例提供适用于场景音频信号编码的编码端,和适用于场景音频信号解码的解码端。其中,编码端将原始HOA信号编码为压缩码流,编码端向解码端发送该压缩码流,然后解码端将压缩码流恢复为重建HOA信号。本申请实施例中,编码端进行压缩后的数据量尽可能小,或在同等码率下解码端重建后得到的HOA信号的质量更高。
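This embodiment of this application can address the problems of a large data volume, high bandwidth usage, low compression efficiency, and low encoding quality when encoding an HOA signal. Because an N-th order HOA signal has $(N+1)^{2}$ channels, transmitting the HOA signal directly consumes a large bandwidth, so an effective multi-channel encoding scheme is needed.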
本申请实施例采取了不同的声道提取方法,且本申请实施例中对声源的假设不做限定,不依赖时频域点单声源假设,可以更有效的处理多声源信号等复杂场景。本申请实施例的编解码器提供一种采用较少的声道用于表示原始HOA信号的空间编解码方法。如图5所示,为本申请实施例提供的编码端的一种结构示意图,编码端包括空间编码器和核心编码器,其中,空间编码器可以对待编码HOA信号进行声道提取,以生成虚拟扬声器信号,核心编码器可以对虚拟扬声器信号进行编码,以得到码流,编码端向解码端发送码流。如图6所示,为本申请实施例提供的解码端的一种结构示意图,解码端包括:核心解码器和空间解码器,其中,核心解码器先接收到来自编码端的码流,然后从该码流中解码出虚拟扬声器信号,接下来空间解码器对该虚拟扬声器信号进行重建,以得到重建的HOA信号。
接下来分别从编码端和解码端进行举例说明。
如图7所示,首先对本申请实施例提供的编码端进行说明,该编码端可以包括:虚拟扬声器配置单元、编码分析单元、虚拟扬声器集合生成单元、虚拟扬声器选择单元、虚拟扬声器信号生成单元和核心编码器处理单元。接下来分别对编码端的各个组成单元的功能进行说明。本申请实施例中,图7所示的编码端可以生成一个虚拟扬声器信号,也可以生成多个虚拟扬声器信号,其中,多个虚拟扬声器信号的生成流程可以是根据图7所示的编码器结构进行多次生成,接下来以一个虚拟扬声器信号的生成流程为例。
虚拟扬声器配置单元,用于对虚拟扬声器集合中的虚拟扬声器进行配置,以得到多个虚拟扬声器。
虚拟扬声器配置单元根据编码器配置信息输出虚拟扬声器配置参数。编码器配置信息包括且不限于:HOA阶数,编码比特率,用户自定义信息等,虚拟扬声器配置参数包括且不限于:虚拟扬声器的个数,虚拟扬声器的HOA阶数、虚拟扬声器的位置坐标等。
虚拟扬声器配置单元输出的虚拟扬声器配置参数作为虚拟扬声器集合生成单元的输入。
编码分析单元,用于对待编码HOA信号进行编码分析,例如分析待编码HOA信号的声场分布,包括待编码HOA信号的声源个数、方向性、弥散度等特征,作为决定如何选择目标虚拟扬声器的判断条件之一。
不限定的是,本申请实施例中,编码端中还可以不包括编码分析单元,即编码端可以不对输入信号进行分析,则采用一种默认配置决定如何选择目标虚拟扬声器。
其中,编码端获取待编码HOA信号,例如可以将从实际采集设备记录的HOA信号或采用人工音频对象合成的HOA信号作为编码器的输入,同时编码器输入的待编码HOA信号可以是时域HOA信号也可以是频域HOA信号。
虚拟扬声器集合生成单元,用于生成虚拟扬声器集合,该虚拟扬声器集合中可以包括:多个虚拟扬声器,虚拟扬声器集合中的虚拟扬声器也可以称为“候选虚拟扬声器”。
虚拟扬声器集合生成单元生成指定的候选虚拟扬声器HOA系数。生成候选虚拟扬声器HOA系数需要候选虚拟扬声器的坐标(即位置坐标或者位置信息)和候选虚拟扬声器的HOA阶数,候选虚拟扬声器的坐标确定方法包括且不限于按等距规则产生K个虚拟扬声器、根据听觉感知原理生成非均匀分布的K个候选虚拟扬声器,以下举例一种产生均匀分布固定个数虚拟扬声器的方法。
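The coordinates of uniformly distributed candidate virtual speakers are generated based on the number of candidate virtual speakers, for example by using a numerical iterative method to obtain an approximately uniform speaker layout. FIG. 8 is a schematic diagram of virtual speakers approximately uniformly distributed on a sphere. Imagine some particles distributed on the unit sphere, with an inverse-square repulsive force set between them, similar to the electrostatic repulsion between like charges. The particles are allowed to move freely under the repulsive force, and their distribution can be expected to become approximately uniform once a steady state is reached. In the calculation, the actual physical law is simplified by directly setting the displacement of a particle equal to the force acting on it. For the i-th particle, the displacement in one step of the iteration, which is also the virtual force acting on it, is given by the following formula:

$$\vec{D} = \vec{F} = \sum_{j\ne i}\frac{k}{r_{ij}^{2}}\,\vec{d}_{ij}$$

where $\vec{D}$ denotes the displacement vector, $\vec{F}$ denotes the force vector, $r_{ij}$ denotes the distance between the i-th particle and the j-th particle, and $\vec{d}_{ij}$ denotes the direction vector pointing from the j-th particle to the i-th particle. The parameter k controls the size of a single step, and the initial positions of the particles may be assigned randomly.

After moving by the displacement vector $\vec{D}$, a particle generally leaves the unit sphere. Before the next iteration, the distance between the particle and the center of the sphere is normalized to move the particle back onto the unit sphere. This yields the virtual speaker distribution shown in FIG. 8, in which a plurality of virtual speakers are approximately uniformly distributed on the sphere.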
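The repulsion iteration described above can be sketched as follows. This is an illustrative sketch only: the step size `k`, the number of iterations, the random seed, and the simultaneous update of all particles are assumptions chosen for the example, and the function names are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a 3-D vector back onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def distribute_on_sphere(num_points, steps=200, k=0.01, seed=0):
    """Spread points on the unit sphere using inverse-square repulsion."""
    rng = random.Random(seed)
    # Random initial positions on the unit sphere.
    points = [normalize([rng.gauss(0.0, 1.0) for _ in range(3)])
              for _ in range(num_points)]
    for _ in range(steps):
        moved = []
        for i, p in enumerate(points):
            displacement = [0.0, 0.0, 0.0]
            for j, q in enumerate(points):
                if i == j:
                    continue
                diff = [a - b for a, b in zip(p, q)]  # points from j to i
                r = math.sqrt(sum(c * c for c in diff))
                if r < 1e-9:
                    continue  # skip coincident points to avoid division by zero
                # Unit direction diff / r times magnitude k / r^2.
                for axis in range(3):
                    displacement[axis] += k * diff[axis] / r ** 3
            # Displacement equals the virtual force; then project back.
            moved.append(normalize([a + b for a, b in zip(p, displacement)]))
        points = moved
    return points


speaker_positions = distribute_on_sphere(16)
```

Each returned position lies on the unit sphere and can be converted to a horizontal angle and an elevation angle to serve as candidate virtual speaker coordinates.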
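Next, the HOA coefficients of the candidate virtual speakers are generated. For an ideal plane wave with amplitude s and speaker position coordinates $(\theta_{s},\varphi_{s})$, the form after expansion using spherical harmonic functions is given by the following formula:

$$p(r,\theta,\varphi,k) = s\sum_{m=0}^{\infty}(2m+1)\,j^{m}\,j_{m}^{kr}(kr)\sum_{0\le n\le m,\ \sigma=\pm 1} Y_{m,n}^{\sigma}(\theta_{s},\varphi_{s})\,Y_{m,n}^{\sigma}(\theta,\varphi)$$

The HOA coefficients of the plane wave are $B_{m,n}^{\sigma}$, which satisfy the following formula:

$$B_{m,n}^{\sigma} = s\cdot Y_{m,n}^{\sigma}(\theta_{s},\varphi_{s})$$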
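As a concrete illustration of computing plane-wave HOA coefficients from a speaker position, the sketch below evaluates only the first-order case (four channels). It assumes real spherical harmonics in ACN channel order with SN3D normalization, which is one common Ambisonics convention; the application does not state which ordering or normalization it uses, so the function and its convention are assumptions.

```python
import math


def first_order_hoa_coefficients(azimuth, elevation, amplitude=1.0):
    """First-order coefficients B = s * Y(theta_s, phi_s) for a plane wave.

    Assumed convention: ACN order (W, Y, Z, X), SN3D normalization,
    angles in radians, azimuth measured in the horizontal plane.
    """
    cos_el = math.cos(elevation)
    return [
        amplitude * 1.0,                          # W: omnidirectional term
        amplitude * math.sin(azimuth) * cos_el,   # Y: left-right
        amplitude * math.sin(elevation),          # Z: up-down
        amplitude * math.cos(azimuth) * cos_el,   # X: front-back
    ]


front = first_order_hoa_coefficients(0.0, 0.0)
above = first_order_hoa_coefficients(0.0, math.pi / 2)
```

Higher orders follow the same pattern with higher-degree spherical harmonics, giving $(N+1)^{2}$ coefficients per candidate virtual speaker.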
虚拟扬声器集合生成单元输出的候选虚拟扬声器的HOA系数作为虚拟扬声器选择单元的输入。
虚拟扬声器选择单元,用于根据待编码HOA信号从虚拟扬声器集合中的多个候选虚拟扬声器中选择出目标虚拟扬声器,该目标虚拟扬声器可以称为“与待编码HOA信号匹配的虚拟扬声器”,或者简称为匹配虚拟扬声器。
虚拟扬声器选择单元将待编码HOA信号与虚拟扬声器集合生成单元输出的候选虚拟扬声器HOA系数匹配,选择出指定的匹配虚拟扬声器。
接下来对虚拟扬声器的选择方法进行举例说明,一种实施例中,得到候选虚拟扬声器后,将待编码HOA信号与虚拟扬声器集合生成单元输出的候选虚拟扬声器HOA系数进行匹配,寻找待编码HOA信号在候选虚拟扬声器上的最佳匹配,目标是使用候选虚拟扬声器HOA系数匹配组合待编码HOA信号。一种实施例中,使用候选虚拟扬声器HOA系数与待编码HOA信号做内积,选取内积绝对值最大的候选虚拟扬声器为目标虚拟扬声器,即匹配虚拟扬声器,并将待编码HOA信号在该候选虚拟扬声器的投影叠加到该候选虚拟扬声器HOA系数的线性组合上,然后将投影向量从待编码HOA信号中减去得到差值,对差值重复上述过程实现迭代计算,每迭代一次产生一个匹配虚拟扬声器,输出匹配虚拟扬声器坐标和匹配虚拟扬声器HOA系数。可以理解的是,匹配虚拟扬声器会选取多个,每迭代一次产生一个匹配虚拟扬声器。
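The iterative inner-product matching described above resembles a matching pursuit. The sketch below is a simplified illustration under stated assumptions: the HOA signal is reduced to a single vector of M coefficients (one sampling point), candidates are compared by the raw inner product as in the text, and the projection gain is computed by dividing by the squared norm of the candidate. The function names and sample values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def match_speakers(hoa_signal, candidates, num_targets):
    """Pick num_targets candidates by repeated inner-product matching.

    Returns a list of (candidate index, projection gain) pairs and the
    residual left after subtracting all projections.
    """
    residual = list(hoa_signal)
    selected = []
    for _ in range(num_targets):
        # Candidate with the largest absolute inner product with the residual.
        best = max(range(len(candidates)),
                   key=lambda i: abs(dot(candidates[i], residual)))
        coeffs = candidates[best]
        gain = dot(coeffs, residual) / dot(coeffs, coeffs)
        # Subtract the projection and continue with the difference.
        residual = [r - gain * c for r, c in zip(residual, coeffs)]
        selected.append((best, gain))
    return selected, residual


# Hypothetical candidate HOA coefficient vectors (M = 4) and one HOA sample.
candidates = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
selected, residual = match_speakers([0.5, 2.0, -1.0, 0.0], candidates, 2)
```

Each iteration yields one matching virtual speaker; with these sample values the second and third candidates are selected in that order.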
虚拟扬声器选择单元输出的目标虚拟扬声器的坐标和目标虚拟扬声器的HOA系数作为虚拟扬声器信号生成单元的输入。
在本申请的一些实施例中,编码端除了包括图7所示的组成单元之外,还可以包括边信息生成单元。不限定的是,编码端还可以不包括边信息生成单元,此处仅为举例。
虚拟扬声器选择单元输出的目标虚拟扬声器的坐标和/或目标虚拟扬声器的HOA系数作为边信息生成单元的输入。
边信息生成单元将目标虚拟扬声器的HOA系数或目标虚拟扬声器的坐标转换为边信息, 利于核心编码器的处理和传输。
边信息生成单元的输出作为核心编码器处理单元的输入。
虚拟扬声器信号生成单元,用于根据待编码HOA信号和目标虚拟扬声器的属性信息生成虚拟扬声器信号。
虚拟扬声器信号生成单元通过待编码HOA信号和目标虚拟扬声器的HOA系数计算虚拟扬声器信号。
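The HOA coefficients of the matching virtual speakers are represented by a matrix A, and the to-be-encoded HOA signal can be obtained as a linear combination using the matrix A. The theoretical optimal solution w, which is the virtual speaker signal, can be found by the least squares method, for example using the following formula:

$$w = A^{-1}X,$$

where $A^{-1}$ denotes the inverse of the matrix A (when A is not square, the least squares solution corresponds to its pseudo-inverse). The size of the matrix A is (M×C), C is the number of target virtual speakers, M is the number of channels of the N-th order HOA coefficients, and a denotes an HOA coefficient of the target virtual speaker, for example,

$$A = \begin{bmatrix} a_{11} & \cdots & a_{1C} \\ \vdots & \ddots & \vdots \\ a_{M1} & \cdots & a_{MC} \end{bmatrix},$$

where X denotes the to-be-encoded HOA signal. The size of the matrix X is (M×L), M is the number of channels of the N-th order HOA coefficients, L is the number of sampling points, and x denotes a coefficient of the to-be-encoded HOA signal, for example,

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1L} \\ \vdots & \ddots & \vdots \\ x_{M1} & \cdots & x_{ML} \end{bmatrix}.$$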
虚拟扬声器信号生成单元输出的虚拟扬声器信号作为核心编码器处理单元的输入。
在本申请的一些实施例中,编码端除了包括图7所示的组成单元之外,还可以包括信号对齐单元。不限定的是,编码端还可以不包括信号对齐单元,此处仅为举例。
虚拟扬声器信号生成单元输出的虚拟扬声器信号作为信号对齐单元的输入。
信号对齐单元,用于将虚拟扬声器信号各声道间重新调整,增强声道间相关性,利于核心编码器处理。
信号对齐单元输出的对齐后的虚拟扬声器信号为核心编码器处理单元的输入。
核心编码器处理单元,用于对边信息和对齐后的虚拟扬声器信号进行核心编码器处理,得到传输码流。
核心编码器处理包括且不限于变换、量化、心理声学模型、码流产生等,可以对频域声道进行处理也可以对时域声道进行处理,此处不做限定。
如图9所示,本申请实施例提供的解码端可包含:核心解码器处理单元和HOA信号重建单元。
核心解码器处理单元,用于对传输码流进行核心解码器处理,得到虚拟扬声器信号。
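Without limitation, if the encoder side carries side information in the bitstream, the decoder side further needs to include a side information decoding unit.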
边信息解码单元,用于对核心解码器处理单元输出的解码边信息进行解码,以得到解码后的边信息。
核心解码器处理可以包括变换、码流解析、反量化等,可以对频域声道进行处理也可以对时域声道进行处理,此处不做限定。
核心解码器处理单元输出的虚拟扬声器信号为HOA信号重建单元的输入,核心解码器处理单元输出的解码边信息为边信息解码单元的输入。
边信息解码单元将解码边信息转为目标虚拟扬声器的HOA系数。
边信息解码单元输出的目标虚拟扬声器的HOA系数为HOA信号重建单元的输入。
HOA信号重建单元,用于通过虚拟扬声器信号和目标虚拟扬声器的HOA系数对HOA信号进行重建。
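The HOA coefficients of the target virtual speakers are represented by a matrix A' of size (M×C), where C is the number of target virtual speakers and M is the number of channels of the N-th order HOA coefficients. The virtual speaker signals form a (C×L) matrix denoted W', where L is the number of signal sampling points. The reconstructed HOA signal H is obtained using the following formula:

$$H = A'W',$$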
其中,HOA信号重建单元输出的重建的HOA信号为解码端的输出。
本申请实施例中,编码端可以利用空间编码器,将原始HOA信号采用较少的声道进行表示,例如原始3阶HOA信号,采用本申请实施例的空间编码器可以将16个声道压缩为4个声道,且保证了主观听力无明显差别。其中,主观听力测试是音频编解码中的一种评价标准,无明显差别是主观评价的一种等级。
在本申请的另一些实施例中,编码端的虚拟扬声器选择单元从虚拟扬声器集合中选择出目标虚拟扬声器,还可以采用指定方位的虚拟扬声器作为目标虚拟扬声器,虚拟扬声器信号生成单元直接在各个目标虚拟扬声器上做投影得到虚拟扬声器信号。
在上述方式中,通过指定方位的虚拟扬声器作为目标虚拟扬声器,可以简化虚拟扬声器选择过程,可以提高编解码速度。
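In some other embodiments of this application, the encoder side may not include the signal alignment unit. In this case, the output of the virtual speaker signal generation unit is fed directly to the core encoder for encoding. This omits the signal alignment processing and reduces the complexity of the encoder side.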
通过前述的举例说明可知,本申请实施例将选择出的目标虚拟扬声器应用于HOA信号编解码上,本申请实施例能够得到准确的HOA信号声源定位,重建HOA信号方向更为准确,编码效率更高,且解码端复杂度非常低,利于移动端应用,可以提升编解码的性能。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
为便于更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的相关装置。
请参阅图10所示,本申请实施例提供的一种音频编码装置1000,可以包括:获取模块1001、信号生成模块1002和编码模块1003,其中,
获取模块,用于根据当前场景音频信号从预设的虚拟扬声器集合中选择出第一目标虚拟扬声器;
信号生成模块,用于根据所述当前场景音频信号和所述第一目标虚拟扬声器的属性信息生成第一虚拟扬声器信号;
编码模块,用于对所述第一虚拟扬声器信号进行编码,以得到码流。
在本申请的一些实施例中,所述获取模块,用于根据所述虚拟扬声器集合从所述当前场景音频信号中获取主要声场成分;根据所述主要声场成分从所述虚拟扬声器集合中选择出所述第一目标虚拟扬声器。
在本申请的一些实施例中,所述获取模块,用于根据所述主要声场成分从高阶立体混响HOA系数集合中选择出与所述主要声场成分对应的HOA系数,所述HOA系数集合中的HOA系数与所述虚拟扬声器集合中的虚拟扬声器一一对应;确定所述虚拟扬声器集合中与所述主要声场成分对应的HOA系数对应的虚拟扬声器为所述第一目标虚拟扬声器。
在本申请的一些实施例中,所述获取模块,用于根据所述主要声场成分获取所述第一目标虚拟扬声器的配置参数;根据所述第一目标虚拟扬声器的配置参数生成所述第一目标虚拟扬声器对应的HOA系数;确定所述虚拟扬声器集合中所述第一目标虚拟扬声器对应的HOA系数对应的虚拟扬声器为所述目标虚拟扬声器。
在本申请的一些实施例中,所述获取模块,用于根据音频编码器的配置信息确定所述虚拟扬声器集合中的多个虚拟扬声器的配置参数;根据所述主要声场成分从所述多个虚拟扬声器的配置参数中选择出所述第一目标虚拟扬声器的配置参数。
在本申请的一些实施例中,所述第一目标虚拟扬声器的配置参数包括:所述第一目标虚拟扬声器的位置信息和HOA阶数信息;
所述获取模块,用于根据所述第一目标虚拟扬声器的位置信息和HOA阶数信息确定所述第一目标虚拟扬声器对应的HOA系数。
在本申请的一些实施例中,所述编码模块,还用于对所述第一目标虚拟扬声器的属性信息进行编码,并写入所述码流。
在本申请的一些实施例中,所述当前场景音频信号,包括:待编码HOA信号;所述第一目标虚拟扬声器的属性信息包括所述第一目标虚拟扬声器的HOA系数;
所述信号生成模块,用于对所述待编码HOA信号和所述HOA系数进行线性组合,以得到所述第一虚拟扬声器信号。
在本申请的一些实施例中,所述当前场景音频信号包括:待编码高阶立体混响HOA信号;所述第一目标虚拟扬声器的属性信息包括所述第一目标虚拟扬声器的位置信息;
所述信号生成模块,用于根据所述第一目标虚拟扬声器的位置信息获取所述第一目标虚拟扬声器对应的HOA系数;对所述待编码HOA信号和所述HOA系数进行线性组合,以得到所述第一虚拟扬声器信号。
在本申请的一些实施例中,所述获取模块,用于根据所述当前场景音频信号从所述虚拟扬声器集合中选择出第二目标虚拟扬声器;
所述信号生成模块,用于根据所述当前场景音频信号和所述第二目标虚拟扬声器的属性信息生成第二虚拟扬声器信号;
所述编码模块,用于对所述第二虚拟扬声器信号进行编码,并写入所述码流。
在本申请的一些实施例中,所述信号生成模块,用于对所述第一虚拟扬声器信号和所述第二虚拟扬声器信号进行对齐处理,以得到对齐后的第一虚拟扬声器信号和对齐后的第二虚拟扬声器信号;
相应地,所述编码模块,用于对所述对齐后的第二虚拟扬声器信号进行编码;
相应地,所述编码模块,用于对所述对齐后的第一虚拟扬声器信号进行编码。
在本申请的一些实施例中,所述获取模块,用于根据所述当前场景音频信号从所述虚拟扬声器集合中选择出第二目标虚拟扬声器;
所述信号生成模块,用于根据所述当前场景音频信号和所述第二目标虚拟扬声器的属性信息生成第二虚拟扬声器信号;
相应地,所述编码模块,用于根据所述第一虚拟扬声器信号和所述第二虚拟扬声器信号获得下混信号和边信息,所述边信息用于指示所述第一虚拟扬声器信号和所述第二虚拟扬声器信号之间的关系;对所述下混信号以及所述边信息进行编码。
在本申请的一些实施例中,所述信号生成模块,用于对所述第一虚拟扬声器信号和所述第二虚拟扬声器信号进行对齐处理,以得到对齐后的第一虚拟扬声器信号和对齐后的第二虚拟扬声器信号;
相应的,所述编码模块,用于根据所述对齐后的第一虚拟扬声器信号和所述对齐后的第二虚拟扬声器信号获得所述下混信号和所述边信息;
相应的,所述边信息用于指示所述对齐后的第一虚拟扬声器信号和所述对齐后的第二虚拟扬声器信号之间的关系。
在本申请的一些实施例中,所述获取模块,用于在根据所述当前场景音频信号从所述虚拟扬声器集合中选择出第二目标虚拟扬声器前,根据编码速率和/或所述当前场景音频信号的信号类型信息确定是否需要获取除所述第一目标虚拟扬声器以外的目标虚拟扬声器;若需要获取除所述第一目标虚拟扬声器以外的目标虚拟扬声器,才根据所述当前场景音频信号从所述虚拟扬声器集合中选择出第二目标虚拟扬声器。
请参阅图11所示,本申请实施例提供的一种音频解码装置1100,可以包括:接收模块1101、解码模块1102、重建模块1103,其中,
接收模块,用于接收码流;
解码模块,用于解码所述码流以获得虚拟扬声器信号;
重建模块,用于根据目标虚拟扬声器的属性信息以及所述虚拟扬声器信号获得重建的场景音频信号。
在本申请的一些实施例中,所述解码模块,还用于解码所述码流以获得所述目标虚拟扬声器的属性信息。
在本申请的一些实施例中,所述目标虚拟扬声器的属性信息包括所述目标虚拟扬声器的高阶立体混响HOA系数;
所述重建模块,用于对所述虚拟扬声器信号和所述目标虚拟扬声器的HOA系数进行合 成处理,以获得所述重建的场景音频信号。
在本申请的一些实施例中,所述目标虚拟扬声器的属性信息包括所述目标虚拟扬声器的位置信息;
所述重建模块,用于根据所述目标虚拟扬声器的位置信息确定所述目标虚拟扬声器的HOA系数;对所述虚拟扬声器信号和所述目标虚拟扬声器的HOA系数进行合成处理,以获得所述重建的场景音频信号。
在本申请的一些实施例中,所述虚拟扬声器信号是根据第一虚拟扬声器信号和第二虚拟扬声器信号下混获得的下混信号,所述装置还包括:信号补偿模块,其中,
所述解码模块,用于解码所述码流以获得边信息,所述边信息用于指示所述第一虚拟扬声器信号和所述第二虚拟扬声器信号之间的关系;
所述信号补偿模块,用于根据所述边信息和所述下混信号获得所述第一虚拟扬声器信号和所述第二虚拟扬声器信号;
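Correspondingly, the reconstruction module is configured to obtain the reconstructed scene audio signal based on the attribute information of the target virtual speaker, the first virtual speaker signal, and the second virtual speaker signal.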
需要说明的是,上述装置各模块/单元之间的信息交互、执行过程等内容,由于与本申请方法实施例基于同一构思,其带来的技术效果与本申请方法实施例相同,具体内容可参见本申请前述所示的方法实施例中的叙述,此处不再赘述。
本申请实施例还提供一种计算机存储介质,其中,该计算机存储介质存储有程序,该程序执行包括上述方法实施例中记载的部分或全部步骤。
接下来介绍本申请实施例提供的另一种音频编码装置,请参阅图12所示,音频编码装置1200包括:
接收器1201、发射器1202、处理器1203和存储器1204(其中音频编码装置1200中的处理器1203的数量可以一个或多个,图12中以一个处理器为例)。在本申请的一些实施例中,接收器1201、发射器1202、处理器1203和存储器1204可通过总线或其它方式连接,其中,图12中以通过总线连接为例。
存储器1204可以包括只读存储器和随机存取存储器,并向处理器1203提供指令和数据。存储器1204的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器1204存储有操作系统和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。操作系统可包括各种系统程序,用于实现各种基础业务以及处理基于硬件的任务。
处理器1203控制音频编码装置的操作,处理器1203还可以称为中央处理单元(central processing unit,CPU)。具体的应用中,音频编码装置的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器1203中,或者由处理器1203实现。处理器1203可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1203中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处 理器1203可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1204,处理器1203读取存储器1204中的信息,结合其硬件完成上述方法的步骤。
接收器1201可用于接收输入的数字或字符信息,以及产生与音频编码装置的相关设置以及功能控制有关的信号输入,发射器1202可包括显示屏等显示设备,发射器1202可用于通过外接接口输出数字或字符信息。
本申请实施例中,处理器1203用于执行前述实施例图4所示的由音频编码装置执行的音频编码方法。
接下来介绍本申请实施例提供的另一种音频解码装置,请参阅图13所示,音频解码装置1300包括:
接收器1301、发射器1302、处理器1303和存储器1304(其中音频解码装置1300中的处理器1303的数量可以一个或多个,图13中以一个处理器为例)。在本申请的一些实施例中,接收器1301、发射器1302、处理器1303和存储器1304可通过总线或其它方式连接,其中,图13中以通过总线连接为例。
存储器1304可以包括只读存储器和随机存取存储器,并向处理器1303提供指令和数据。存储器1304的一部分还可以包括NVRAM。存储器1304存储有操作系统和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。操作系统可包括各种系统程序,用于实现各种基础业务以及处理基于硬件的任务。
处理器1303控制音频解码装置的操作,处理器1303还可以称为CPU。具体的应用中,音频解码装置的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器1303中,或者由处理器1303实现。处理器1303可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1303中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1303可以是通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读 存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1304,处理器1303读取存储器1304中的信息,结合其硬件完成上述方法的步骤。
本申请实施例中,处理器1303,用于执行前述实施例图4所示的由音频解码装置执行的音频解码方法。
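In another possible design, when the audio encoding apparatus or the audio decoding apparatus is a chip in a terminal, the chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute computer-executable instructions stored in a storage unit, so that the chip in the terminal performs the audio encoding method according to any one of the implementations of the first aspect or the audio decoding method according to any one of the implementations of the second aspect. Optionally, the storage unit is a storage unit inside the chip, such as a register or a cache. The storage unit may alternatively be a storage unit that is in the terminal but outside the chip, such as a read-only memory (ROM), another type of static storage device that can store static information and instructions, or a random access memory (RAM).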
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述第一方面或第二方面方法的程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例 如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (44)

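  1. An audio encoding method, comprising:
    selecting a first target virtual speaker from a preset virtual speaker set based on a current scene audio signal;
    generating a first virtual speaker signal based on the current scene audio signal and attribute information of the first target virtual speaker; and
    encoding the first virtual speaker signal to obtain a bitstream.
  2. The method according to claim 1, further comprising:
    obtaining a main sound field component from the current scene audio signal based on the virtual speaker set;
    wherein the selecting a first target virtual speaker from a preset virtual speaker set based on a current scene audio signal comprises:
    selecting the first target virtual speaker from the virtual speaker set based on the main sound field component.
  3. The method according to claim 2, wherein the selecting the first target virtual speaker from the virtual speaker set based on the main sound field component comprises:
    selecting, based on the main sound field component, a higher order ambisonics (HOA) coefficient corresponding to the main sound field component from an HOA coefficient set, wherein HOA coefficients in the HOA coefficient set are in a one-to-one correspondence with virtual speakers in the virtual speaker set; and
    determining that a virtual speaker in the virtual speaker set that corresponds to the HOA coefficient corresponding to the main sound field component is the first target virtual speaker.
  4. The method according to claim 2, wherein the selecting the first target virtual speaker from the virtual speaker set based on the main sound field component comprises:
    obtaining a configuration parameter of the first target virtual speaker based on the main sound field component;
    generating, based on the configuration parameter of the first target virtual speaker, an HOA coefficient corresponding to the first target virtual speaker; and
    determining that a virtual speaker in the virtual speaker set that corresponds to the HOA coefficient corresponding to the first target virtual speaker is the target virtual speaker.
  5. The method according to claim 4, wherein the obtaining a configuration parameter of the first target virtual speaker based on the main sound field component comprises:
    determining configuration parameters of a plurality of virtual speakers in the virtual speaker set based on configuration information of an audio encoder; and
    selecting the configuration parameter of the first target virtual speaker from the configuration parameters of the plurality of virtual speakers based on the main sound field component.
  6. The method according to claim 4 or 5, wherein the configuration parameter of the first target virtual speaker comprises location information and HOA order information of the first target virtual speaker; and
    the generating, based on the configuration parameter of the first target virtual speaker, an HOA coefficient corresponding to the first target virtual speaker comprises:
    determining, based on the location information and the HOA order information of the first target virtual speaker, the HOA coefficient corresponding to the first target virtual speaker.
  7. The method according to any one of claims 1 to 6, further comprising:
    encoding the attribute information of the first target virtual speaker, and writing the encoded attribute information into the bitstream.
  8. The method according to any one of claims 1 to 7, wherein the current scene audio signal comprises a to-be-encoded higher order ambisonics (HOA) signal, and the attribute information of the first target virtual speaker comprises an HOA coefficient of the first target virtual speaker; and
    the generating a first virtual speaker signal based on the current scene audio signal and attribute information of the first target virtual speaker comprises:
    performing linear combination on the to-be-encoded HOA signal and the HOA coefficient to obtain the first virtual speaker signal.
  9. The method according to any one of claims 1 to 7, wherein the current scene audio signal comprises a to-be-encoded higher order ambisonics (HOA) signal, and the attribute information of the first target virtual speaker comprises location information of the first target virtual speaker; and
    the generating a first virtual speaker signal based on the current scene audio signal and attribute information of the first target virtual speaker comprises:
    obtaining, based on the location information of the first target virtual speaker, an HOA coefficient corresponding to the first target virtual speaker; and
    performing linear combination on the to-be-encoded HOA signal and the HOA coefficient to obtain the first virtual speaker signal.
  10. The method according to any one of claims 1 to 9, further comprising:
    selecting a second target virtual speaker from the virtual speaker set based on the current scene audio signal;
    generating a second virtual speaker signal based on the current scene audio signal and attribute information of the second target virtual speaker; and
    encoding the second virtual speaker signal, and writing the encoded second virtual speaker signal into the bitstream.
  11. The method according to claim 10, further comprising:
    performing alignment processing on the first virtual speaker signal and the second virtual speaker signal to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
    correspondingly, the encoding the second virtual speaker signal comprises:
    encoding the aligned second virtual speaker signal; and
    correspondingly, the encoding the first virtual speaker signal comprises:
    encoding the aligned first virtual speaker signal.
  12. The method according to any one of claims 1 to 9, further comprising:
    selecting a second target virtual speaker from the virtual speaker set based on the current scene audio signal; and
    generating a second virtual speaker signal based on the current scene audio signal and attribute information of the second target virtual speaker;
    correspondingly, the encoding the first virtual speaker signal comprises:
    obtaining a downmixed signal and side information based on the first virtual speaker signal and the second virtual speaker signal, wherein the side information indicates a relationship between the first virtual speaker signal and the second virtual speaker signal; and
    encoding the downmixed signal and the side information.
  13. The method according to claim 12, further comprising:
    performing alignment processing on the first virtual speaker signal and the second virtual speaker signal to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
    correspondingly, the obtaining a downmixed signal and side information based on the first virtual speaker signal and the second virtual speaker signal comprises:
    obtaining the downmixed signal and the side information based on the aligned first virtual speaker signal and the aligned second virtual speaker signal; and
    correspondingly, the side information indicates a relationship between the aligned first virtual speaker signal and the aligned second virtual speaker signal.
  14. The method according to any one of claims 10 to 13, wherein before the selecting a second target virtual speaker from the virtual speaker set based on the current scene audio signal, the method further comprises:
    determining, based on an encoding rate and/or signal type information of the current scene audio signal, whether a target virtual speaker other than the first target virtual speaker needs to be obtained; and
    selecting the second target virtual speaker from the virtual speaker set based on the current scene audio signal only if a target virtual speaker other than the first target virtual speaker needs to be obtained.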
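  15. An audio decoding method, comprising:
    receiving a bitstream;
    decoding the bitstream to obtain a virtual speaker signal; and
    obtaining a reconstructed scene audio signal based on attribute information of a target virtual speaker and the virtual speaker signal.
  16. The method according to claim 15, further comprising:
    decoding the bitstream to obtain the attribute information of the target virtual speaker.
  17. The method according to claim 16, wherein the attribute information of the target virtual speaker comprises a higher order ambisonics (HOA) coefficient of the target virtual speaker; and
    the obtaining a reconstructed scene audio signal based on attribute information of a target virtual speaker and the virtual speaker signal comprises:
    performing synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
  18. The method according to claim 16, wherein the attribute information of the target virtual speaker comprises location information of the target virtual speaker; and
    the obtaining a reconstructed scene audio signal based on attribute information of a target virtual speaker and the virtual speaker signal comprises:
    determining an HOA coefficient of the target virtual speaker based on the location information of the target virtual speaker; and
    performing synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
  19. The method according to any one of claims 15 to 18, wherein the virtual speaker signal is a downmixed signal obtained by downmixing a first virtual speaker signal and a second virtual speaker signal, and the method further comprises:
    decoding the bitstream to obtain side information, wherein the side information indicates a relationship between the first virtual speaker signal and the second virtual speaker signal; and
    obtaining the first virtual speaker signal and the second virtual speaker signal based on the side information and the downmixed signal;
    correspondingly, the obtaining a reconstructed scene audio signal based on attribute information of a target virtual speaker and the virtual speaker signal comprises:
    obtaining the reconstructed scene audio signal based on the attribute information of the target virtual speaker, the first virtual speaker signal, and the second virtual speaker signal.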
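  20. An audio encoding apparatus, comprising:
    an obtaining module, configured to select a first target virtual speaker from a preset virtual speaker set based on a current scene audio signal;
    a signal generation module, configured to generate a first virtual speaker signal based on the current scene audio signal and attribute information of the first target virtual speaker; and
    an encoding module, configured to encode the first virtual speaker signal to obtain a bitstream.
  21. The apparatus according to claim 20, wherein the obtaining module is configured to: obtain a main sound field component from the current scene audio signal based on the virtual speaker set; and select the first target virtual speaker from the virtual speaker set based on the main sound field component.
  22. The apparatus according to claim 21, wherein the obtaining module is configured to: select, based on the main sound field component, a higher order ambisonics (HOA) coefficient corresponding to the main sound field component from an HOA coefficient set, wherein HOA coefficients in the HOA coefficient set are in a one-to-one correspondence with virtual speakers in the virtual speaker set; and determine that a virtual speaker in the virtual speaker set that corresponds to the HOA coefficient corresponding to the main sound field component is the first target virtual speaker.
  23. The apparatus according to claim 21, wherein the obtaining module is configured to: obtain a configuration parameter of the first target virtual speaker based on the main sound field component; generate, based on the configuration parameter of the first target virtual speaker, an HOA coefficient corresponding to the first target virtual speaker; and determine that a virtual speaker in the virtual speaker set that corresponds to the HOA coefficient corresponding to the first target virtual speaker is the target virtual speaker.
  24. The apparatus according to claim 23, wherein the obtaining module is configured to: determine configuration parameters of a plurality of virtual speakers in the virtual speaker set based on configuration information of an audio encoder; and select the configuration parameter of the first target virtual speaker from the configuration parameters of the plurality of virtual speakers based on the main sound field component.
  25. The apparatus according to claim 23 or 24, wherein the configuration parameter of the first target virtual speaker comprises location information and HOA order information of the first target virtual speaker; and
    the obtaining module is configured to determine, based on the location information and the HOA order information of the first target virtual speaker, the HOA coefficient corresponding to the first target virtual speaker.
  26. The apparatus according to any one of claims 20 to 25, wherein the encoding module is further configured to encode the attribute information of the first target virtual speaker and write the encoded attribute information into the bitstream.
  27. The apparatus according to any one of claims 20 to 26, wherein the current scene audio signal comprises a to-be-encoded HOA signal, and the attribute information of the first target virtual speaker comprises an HOA coefficient of the first target virtual speaker; and
    the signal generation module is configured to perform linear combination on the to-be-encoded HOA signal and the HOA coefficient to obtain the first virtual speaker signal.
  28. The apparatus according to any one of claims 20 to 26, wherein the current scene audio signal comprises a to-be-encoded higher order ambisonics (HOA) signal, and the attribute information of the first target virtual speaker comprises location information of the first target virtual speaker; and
    the signal generation module is configured to: obtain, based on the location information of the first target virtual speaker, an HOA coefficient corresponding to the first target virtual speaker; and perform linear combination on the to-be-encoded HOA signal and the HOA coefficient to obtain the first virtual speaker signal.
  29. The apparatus according to any one of claims 20 to 28, wherein
    the obtaining module is configured to select a second target virtual speaker from the virtual speaker set based on the current scene audio signal;
    the signal generation module is configured to generate a second virtual speaker signal based on the current scene audio signal and attribute information of the second target virtual speaker; and
    the encoding module is configured to encode the second virtual speaker signal and write the encoded second virtual speaker signal into the bitstream.
  30. The apparatus according to claim 29, wherein
    the signal generation module is configured to perform alignment processing on the first virtual speaker signal and the second virtual speaker signal to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
    correspondingly, the encoding module is configured to encode the aligned second virtual speaker signal; and
    correspondingly, the encoding module is configured to encode the aligned first virtual speaker signal.
  31. The apparatus according to any one of claims 20 to 28, wherein
    the obtaining module is configured to select a second target virtual speaker from the virtual speaker set based on the current scene audio signal;
    the signal generation module is configured to generate a second virtual speaker signal based on the current scene audio signal and attribute information of the second target virtual speaker; and
    correspondingly, the encoding module is configured to: obtain a downmixed signal and side information based on the first virtual speaker signal and the second virtual speaker signal, wherein the side information indicates a relationship between the first virtual speaker signal and the second virtual speaker signal; and encode the downmixed signal and the side information.
  32. The apparatus according to claim 31, wherein
    the signal generation module is configured to perform alignment processing on the first virtual speaker signal and the second virtual speaker signal to obtain an aligned first virtual speaker signal and an aligned second virtual speaker signal;
    correspondingly, the encoding module is configured to obtain the downmixed signal and the side information based on the aligned first virtual speaker signal and the aligned second virtual speaker signal; and
    correspondingly, the side information indicates a relationship between the aligned first virtual speaker signal and the aligned second virtual speaker signal.
  33. The apparatus according to any one of claims 20 to 32, wherein the obtaining module is configured to: before selecting the second target virtual speaker from the virtual speaker set based on the current scene audio signal, determine, based on an encoding rate and/or signal type information of the current scene audio signal, whether a target virtual speaker other than the first target virtual speaker needs to be obtained; and select the second target virtual speaker from the virtual speaker set based on the current scene audio signal only if a target virtual speaker other than the first target virtual speaker needs to be obtained.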
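  34. An audio decoding apparatus, comprising:
    a receiving module, configured to receive a bitstream;
    a decoding module, configured to decode the bitstream to obtain a virtual speaker signal; and
    a reconstruction module, configured to obtain a reconstructed scene audio signal based on attribute information of a target virtual speaker and the virtual speaker signal.
  35. The apparatus according to claim 34, wherein the decoding module is further configured to decode the bitstream to obtain the attribute information of the target virtual speaker.
  36. The apparatus according to claim 35, wherein the attribute information of the target virtual speaker comprises a higher order ambisonics (HOA) coefficient of the target virtual speaker; and
    the reconstruction module is configured to perform synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
  37. The apparatus according to claim 35, wherein the attribute information of the target virtual speaker comprises location information of the target virtual speaker; and
    the reconstruction module is configured to: determine an HOA coefficient of the target virtual speaker based on the location information of the target virtual speaker; and perform synthesis processing on the virtual speaker signal and the HOA coefficient of the target virtual speaker to obtain the reconstructed scene audio signal.
  38. The apparatus according to any one of claims 34 to 37, wherein the virtual speaker signal is a downmixed signal obtained by downmixing a first virtual speaker signal and a second virtual speaker signal, and the apparatus further comprises a signal compensation module, wherein
    the decoding module is configured to decode the bitstream to obtain side information, wherein the side information indicates a relationship between the first virtual speaker signal and the second virtual speaker signal;
    the signal compensation module is configured to obtain the first virtual speaker signal and the second virtual speaker signal based on the side information and the downmixed signal; and
    correspondingly, the reconstruction module is configured to obtain the reconstructed scene audio signal based on the attribute information of the target virtual speaker, the first virtual speaker signal, and the second virtual speaker signal.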
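  39. An audio encoding apparatus, comprising at least one processor, wherein the at least one processor is configured to be coupled to a memory, and to read and execute instructions in the memory, to implement the method according to any one of claims 1 to 14.
  40. The audio encoding apparatus according to claim 39, further comprising the memory.
  41. An audio decoding apparatus, comprising at least one processor, wherein the at least one processor is configured to be coupled to a memory, and to read and execute instructions in the memory, to implement the method according to any one of claims 15 to 19.
  42. The audio decoding apparatus according to claim 41, further comprising the memory.
  43. A computer-readable storage medium, comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 14 or claims 15 to 19.
  44. A computer-readable storage medium, comprising a bitstream generated by the method according to any one of claims 1 to 14.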
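
Illustrative sketch (not part of the claims). The data flow of claims 1, 8, 15 and 17 can be pictured with a small Python example: pick one target virtual speaker from a preset set, form the virtual speaker signal as a linear combination of the HOA channels and that speaker's HOA coefficients, then synthesize the scene again from the signal and the coefficients. Everything specific here is an assumption made for illustration and is not taken from the application: first-order HOA only, ACN channel order with SN3D normalization, selection by largest projection energy, a least-squares gain normalization, and the function names `hoa_coefficients`, `project`, `select_target_speaker`, `encode` and `decode`. The core encoder that would turn the signal into an actual bitstream is omitted.

```python
import math

def hoa_coefficients(azimuth, elevation):
    """First-order HOA coefficients for one direction (assumed ACN order W, Y, Z, X; SN3D)."""
    return [
        1.0,
        math.sin(azimuth) * math.cos(elevation),
        math.sin(elevation),
        math.cos(azimuth) * math.cos(elevation),
    ]

def project(hoa_frame, coeffs):
    """Linear combination of the HOA channels, weighted by a speaker's coefficients."""
    n = len(hoa_frame[0])
    return [sum(c * ch[t] for c, ch in zip(coeffs, hoa_frame)) for t in range(n)]

def select_target_speaker(hoa_frame, speaker_set):
    """Index of the candidate whose projection carries the most energy (assumed criterion)."""
    def energy(idx):
        return sum(s * s for s in project(hoa_frame, speaker_set[idx]))
    return max(range(len(speaker_set)), key=energy)

def encode(hoa_frame, speaker_set):
    """Return (speaker index, virtual speaker signal); entropy coding is left out."""
    idx = select_target_speaker(hoa_frame, speaker_set)
    coeffs = speaker_set[idx]
    norm = sum(c * c for c in coeffs)  # least-squares gain for a single speaker
    signal = [s / norm for s in project(hoa_frame, coeffs)]
    return idx, signal

def decode(idx, signal, speaker_set):
    """Synthesize an HOA frame from the virtual speaker signal and its coefficients."""
    return [[c * s for s in signal] for c in speaker_set[idx]]

# Usage: four horizontal candidate speakers and one source arriving from 90 degrees.
speakers = [hoa_coefficients(math.radians(a), 0.0) for a in (0, 90, 180, 270)]
source = [1.0, -0.5, 0.25]
frame = [[c * x for x in source] for c in hoa_coefficients(math.radians(90), 0.0)]
index, virtual_signal = encode(frame, speakers)
reconstructed = decode(index, virtual_signal, speakers)
print(index, virtual_signal)
```

In this toy setup a single transmitted channel plus a speaker index stands in for the four HOA channels, which is the redundancy reduction the claims aim at; a real sound field with several sources would need more than one target virtual speaker, as in claims 10 to 14.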
PCT/CN2021/096841 2020-11-30 2021-05-28 Audio encoding and decoding method and apparatus WO2022110723A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
JP2023532579A JP2023551040A (ja) 2020-11-30 2021-05-28 Audio encoding and decoding method and apparatus
MX2023006299A MX2023006299A (es) 2020-11-30 2021-05-28 Audio encoding and decoding method and apparatus
CA3200632A CA3200632A1 (en) 2020-11-30 2021-05-28 Audio encoding and decoding method and apparatus
EP21896233.0A EP4246510A4 (en) 2020-11-30 2021-05-28 METHOD AND APPARATUS FOR AUDIO ENCODING AND DECODING
US18/202,553 US20230298600A1 (en) 2020-11-30 2023-05-26 Audio encoding and decoding method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011377320.0A CN114582356A (zh) 2020-11-30 2020-11-30 Audio encoding and decoding method and apparatus
CN202011377320.0 2020-11-30

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/202,553 Continuation US20230298600A1 (en) 2020-11-30 2023-05-26 Audio encoding and decoding method and apparatus

Publications (1)

Publication Number Publication Date
WO2022110723A1 (zh) 2022-06-02

Family

ID=81753927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096841 WO2022110723A1 (zh) 2020-11-30 2021-05-28 Audio encoding and decoding method and apparatus

Country Status (7)

Country Link
US (1) US20230298600A1 (zh)
EP (1) EP4246510A4 (zh)
JP (1) JP2023551040A (zh)
CN (1) CN114582356A (zh)
CA (1) CA3200632A1 (zh)
MX (1) MX2023006299A (zh)
WO (1) WO2022110723A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
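CN115376527A (zh) * 2021-05-17 2022-11-22 Huawei Technologies Co., Ltd. Three-dimensional audio signal encoding method and apparatus, and encoder
CN118138980A (zh) * 2022-12-02 2024-06-04 Huawei Technologies Co., Ltd. Scene audio decoding method and electronic device
CN118136027A (zh) * 2022-12-02 2024-06-04 Huawei Technologies Co., Ltd. Scene audio encoding method and electronic device
CN118314908A (zh) * 2023-01-06 2024-07-09 Huawei Technologies Co., Ltd. Scene audio decoding method and electronic device
CN118800250A (zh) * 2023-04-13 2024-10-18 Huawei Technologies Co., Ltd. Scene audio decoding method and electronic device
CN118800254A (zh) * 2023-04-13 2024-10-18 Huawei Technologies Co., Ltd. Scene audio decoding method and electronic device
CN118800257A (zh) * 2023-04-13 2024-10-18 Huawei Technologies Co., Ltd. Scene audio decoding method and electronic device
CN118800252A (zh) * 2023-04-13 2024-10-18 Huawei Technologies Co., Ltd. Scene audio encoding method and electronic device
CN118800248A (zh) * 2023-04-13 2024-10-18 Huawei Technologies Co., Ltd. Scene audio decoding method and electronic device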

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019040827A1 (en) * 2017-08-25 2019-02-28 Google Llc QUICK AND EFFICIENT ENCODING OF MEMORY OF SOUND OBJECTS USING SPHERICAL HARMONIC SYMMETRIES
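CN109618276A (zh) * 2018-11-23 2019-04-12 Wuhan Polytechnic University Sound field reconstruction method, device, storage medium and apparatus based on a non-central point
CN109891503A (zh) * 2016-10-25 2019-06-14 Huawei Technologies Co., Ltd. Acoustic scene playback method and apparatus
WO2019241345A1 (en) * 2018-06-12 2019-12-19 Magic Leap, Inc. Efficient rendering of virtual soundfields
CN110771182A (zh) * 2017-05-03 2020-02-07 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio processor, system, method and computer program for audio rendering
CN111670583A (zh) * 2018-02-01 2020-09-15 Qualcomm Incorporated Scalable unified audio renderer
CN111819627A (zh) * 2018-07-02 2020-10-23 Dolby Laboratories Licensing Corporation Method and apparatus for encoding and/or decoding immersive audio signals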

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9881628B2 (en) * 2016-01-05 2018-01-30 Qualcomm Incorporated Mixed domain coding of audio

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
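CN109891503A (zh) * 2016-10-25 2019-06-14 Huawei Technologies Co., Ltd. Acoustic scene playback method and apparatus
CN110771182A (zh) * 2017-05-03 2020-02-07 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio processor, system, method and computer program for audio rendering
WO2019040827A1 (en) * 2017-08-25 2019-02-28 Google Llc QUICK AND EFFICIENT ENCODING OF MEMORY OF SOUND OBJECTS USING SPHERICAL HARMONIC SYMMETRIES
CN111670583A (zh) * 2018-02-01 2020-09-15 Qualcomm Incorporated Scalable unified audio renderer
WO2019241345A1 (en) * 2018-06-12 2019-12-19 Magic Leap, Inc. Efficient rendering of virtual soundfields
CN111819627A (zh) * 2018-07-02 2020-10-23 Dolby Laboratories Licensing Corporation Method and apparatus for encoding and/or decoding immersive audio signals
CN109618276A (zh) * 2018-11-23 2019-04-12 Wuhan Polytechnic University Sound field reconstruction method, device, storage medium and apparatus based on a non-central point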

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4246510A4

Also Published As

Publication number Publication date
CA3200632A1 (en) 2022-06-02
CN114582356A (zh) 2022-06-03
US20230298600A1 (en) 2023-09-21
EP4246510A4 (en) 2024-04-17
JP2023551040A (ja) 2023-12-06
EP4246510A1 (en) 2023-09-20
MX2023006299A (es) 2023-08-21

Similar Documents

Publication Publication Date Title
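WO2022110723A1 (zh) Audio encoding and decoding method and apparatus
CN107533843B (zh) System and method for capturing, encoding, distributing, and decoding immersive audio
WO2022110722A1 (zh) Audio encoding and decoding method and apparatus
TWI834760B (zh) Apparatus, method and computer program for encoding, decoding, scene processing and other procedures related to spatial audio coding based on directional audio coding
TWI819344B (zh) Audio signal rendering method, apparatus, device and computer-readable storage medium
WO2022237851A1 (zh) Audio encoding and decoding method and apparatus
WO2022262576A1 (zh) Three-dimensional audio signal encoding method and apparatus, encoder, and system
WO2022262758A1 (zh) Audio rendering system and method, and electronic device
WO2022257824A1 (zh) Three-dimensional audio signal processing method and apparatus
WO2022184097A1 (zh) Virtual speaker set determining method and apparatus
KR20240001226A (ko) Three-dimensional audio signal coding method, apparatus, and encoder
WO2024212897A1 (zh) Scene audio signal decoding method and apparatus
WO2024146408A1 (zh) Scene audio decoding method and electronic device
WO2024212895A1 (zh) Scene audio signal decoding method and apparatus
WO2024212894A1 (zh) Scene audio signal decoding method and apparatus
WO2024212898A1 (zh) Scene audio signal encoding method and apparatus
WO2024212896A1 (zh) Scene audio signal decoding method and apparatus
WO2022242483A1 (zh) Three-dimensional audio signal encoding method and apparatus, and encoder
WO2022242479A1 (zh) Three-dimensional audio signal encoding method and apparatus, and encoder
WO2024114373A1 (zh) Scene audio encoding method and electronic device
WO2024114372A1 (zh) Scene audio decoding method and electronic device
WO2022152960A1 (en) Transforming spatial audio parameters
CN115376528A (zh) Three-dimensional audio signal encoding method and apparatus, and encoder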

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21896233; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2023532579; Country of ref document: JP)
ENP Entry into the national phase (Ref document number: 3200632; Country of ref document: CA)
WWE Wipo information: entry into national phase (Ref document number: 202317038571; Country of ref document: IN)
REG Reference to national code (Ref country code: BR; Ref legal event code: B01A; Ref document number: 112023010465; Country of ref document: BR)
ENP Entry into the national phase (Ref document number: 2021896233; Country of ref document: EP; Effective date: 20230612)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 112023010465; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20230529)