CN118645110A - Audio processing method and medium - Google Patents

Info

Publication number
CN118645110A
Authority
CN
China
Prior art keywords
sub
noise estimation
audio
band
estimation result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310273204.1A
Other languages
Chinese (zh)
Inventor
梁俊斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310273204.1A
Publication of CN118645110A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0017 Lossless audio signal coding; Perfect reconstruction of coded audio signal by transmission of coding error
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of the present application disclose an audio processing method and medium. The method comprises: acquiring a first audio signal of a listener; performing noise estimation on the first audio signal to determine a noise estimation result, and sending the noise estimation result to a speaker, the noise estimation result indicating the intensity of noise in the first audio signal; and receiving an encoded signal sent by the speaker, decoding the encoded signal to obtain a decoded signal, and playing the decoded signal. The encoded signal is obtained by the speaker, after it acquires a second audio signal of the speaker, by audio-encoding the second audio signal with encoding parameters determined from the noise estimation result. The encoding parameters can therefore be adjusted dynamically based on the listener's background environmental noise, which makes parameter setting more flexible, adapts to changes in that noise, and balances playback sound quality against the listening environment.

Description

Audio processing method and medium

Technical Field

The present application relates to the field of computer technology, and in particular to an audio processing method and medium.

Background Art

In audio coding, the encoding parameters used to encode an audio signal usually affect the resulting sound quality; different encoding parameters produce different sound quality. At present, in actual services (such as call services), the encoding parameters are usually set at call startup and then left unchanged, so the selected parameters stay fixed for the whole audio encoding process. Encoding parameters set this way are inflexible and cannot adapt well to a real-time call environment.

Summary of the Invention

The embodiments of the present application provide an audio processing method and medium that can dynamically adjust encoding parameters based on the listener's background environmental noise. This makes encoding parameter setting more flexible and tracks changes in the listener's background noise, so that encoding adapts better to the real-time call (listening) environment and balances playback sound quality against the listening environment.

In a first aspect, an embodiment of the present application provides an audio processing method, comprising:

acquiring a first audio signal of a listener, the listener having established a communication connection with a speaker;

performing noise estimation on the first audio signal to determine a noise estimation result of the first audio signal, and sending the noise estimation result to the speaker, the noise estimation result indicating the intensity of noise in the first audio signal;

receiving an encoded signal sent by the speaker, performing audio decoding on the encoded signal to obtain a decoded signal, and playing the decoded signal; the encoded signal is obtained by the speaker, after it acquires a second audio signal of the speaker, by audio-encoding the second audio signal with encoding parameters determined from the noise estimation result.

In a second aspect, an embodiment of the present application provides another audio processing method, comprising:

receiving a noise estimation result sent by a listener; the noise estimation result is obtained by the listener, after it acquires a first audio signal of the listener, by performing noise estimation on the first audio signal, the listener having established a communication connection with a speaker;

determining, based on the noise estimation result, encoding parameters for audio encoding;

if a second audio signal of the speaker is acquired, audio-encoding the second audio signal with the encoding parameters to obtain an encoded signal corresponding to the second audio signal, and sending the encoded signal to the listener.

In a third aspect, an embodiment of the present application provides an audio processing device, comprising:

an acquisition unit, configured to acquire a first audio signal of a listener, the listener having established a communication connection with a speaker;

an estimation unit, configured to perform noise estimation on the first audio signal to determine a noise estimation result of the first audio signal, and to send the noise estimation result to the speaker, the noise estimation result indicating the intensity of noise in the first audio signal;

a decoding unit, configured to receive an encoded signal sent by the speaker, perform audio decoding on the encoded signal to obtain a decoded signal, and play the decoded signal; the encoded signal is obtained by the speaker, after it acquires a second audio signal of the speaker, by audio-encoding the second audio signal with encoding parameters determined from the noise estimation result.

In a fourth aspect, an embodiment of the present application provides another audio processing device, comprising:

a receiving unit, configured to receive a noise estimation result sent by a listener; the noise estimation result is obtained by the listener, after it acquires a first audio signal of the listener, by performing noise estimation on the first audio signal, the listener having established a communication connection with a speaker;

a determination unit, configured to determine, based on the noise estimation result, encoding parameters for audio encoding;

an encoding unit, configured to, if a second audio signal of the speaker is acquired, audio-encode the second audio signal with the encoding parameters to obtain an encoded signal corresponding to the second audio signal, and send the encoded signal to the listener.

In a fifth aspect, an embodiment of the present application provides a computer device comprising a processor and a memory, wherein the processor is configured to execute the method described in the first aspect and/or the second aspect above.

In a sixth aspect, an embodiment of the present application further provides a computer-readable storage medium storing program instructions which, when executed, implement the method described in the first aspect and/or the second aspect above.

In a seventh aspect, an embodiment of the present application further provides a computer program product or computer program comprising program instructions which, when executed by a processor, implement the method described in the first aspect and/or the second aspect above.

In the embodiments of the present application, a first audio signal of the listener is acquired, the listener having established a communication connection with the speaker. Noise estimation is then performed on the first audio signal to determine its noise estimation result, which indicates the intensity of noise in the first audio signal, and the result is sent to the speaker. The listener then receives an encoded signal sent by the speaker, decodes it to obtain a decoded signal, and plays the decoded signal; the encoded signal is obtained by the speaker, after it acquires a second audio signal of the speaker, by audio-encoding the second audio signal with encoding parameters determined from the noise estimation result. In this way the listener's background environmental noise is estimated and fed back to the speaker, and the speaker adjusts its encoding parameters according to the feedback. Encoding thereby follows changes in the listener's background noise, so that playback sound quality matches the listening environment as closely as possible.
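The feedback loop summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class and function names, the RMS-based noise estimate, the -30 dBFS threshold, and the bitrate values are all hypothetical, and `encode`/`decode` are placeholders rather than a real codec. In particular, the direction of the rule (noisier environment, lower bitrate) is an assumption; the text only states that the parameters are determined from the noise estimation result.

```python
import math

def estimate_noise_dbfs(samples):
    """Listener side: RMS level of the first audio signal in dBFS (assumed metric)."""
    if not samples:
        return -100.0
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, 1e-10))

class Speaker:
    """Speaker side: keeps the current encoding parameter and encodes with it."""
    def __init__(self, default_bitrate_kbps=32):
        self.bitrate_kbps = default_bitrate_kbps

    def on_noise_result(self, noise_dbfs):
        # Hypothetical rule: threshold and direction are illustrative only.
        self.bitrate_kbps = 16 if noise_dbfs > -30.0 else 64

    def encode(self, samples):
        # Placeholder "codec": tags the payload with the parameter in use.
        return {"bitrate_kbps": self.bitrate_kbps, "payload": list(samples)}

class Listener:
    """Listener side: estimates noise, reports it, and decodes what it receives."""
    def report_noise(self, mic_samples, speaker):
        noise = estimate_noise_dbfs(mic_samples)
        speaker.on_noise_result(noise)  # stands in for sending over the network
        return noise

    def decode(self, encoded):
        return encoded["payload"]  # placeholder decode

listener, speaker = Listener(), Speaker()
noisy_room = [0.5, -0.5] * 80          # loud background, about -6 dBFS
listener.report_noise(noisy_room, speaker)
encoded = speaker.encode([0.1, 0.2, 0.3])
decoded = listener.decode(encoded)
```

In a real deployment the direct method call in `report_noise` would be replaced by a message on the existing communication connection.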

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of the architecture of an audio processing system provided in an embodiment of the present application;

FIG. 2 is a flow chart of an audio processing method provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of the interfaces of service scenarios provided in an embodiment of the present application;

FIG. 4 is a flow chart of another audio processing method provided in an embodiment of the present application;

FIG. 5 is a flow chart of yet another audio processing method provided in an embodiment of the present application;

FIG. 6 is a curve representing a mapping relationship provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of the structure of an audio processing device provided in an embodiment of the present application;

FIG. 8 is a schematic diagram of the structure of an audio processing device provided in an embodiment of the present application;

FIG. 9 is a schematic diagram of the structure of a computer device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application.

Cloud technology is a hosting technology that unifies hardware, software, network, and other resources within a wide area network or local area network to compute, store, process, and share data.

Cloud technology is a general term for the network, information, integration, management platform, and application technologies that build on the cloud computing business model. These resources can form a pool that is used on demand, flexibly and conveniently, and cloud computing technology is becoming an important foundation for it. The backend services of technical network systems, such as video websites, image websites, and portal websites, require large amounts of computing and storage resources. As the Internet industry develops, every item may in the future carry its own identifier that must be transmitted to a backend system for logical processing; data of different levels will be processed separately, and all kinds of industry data will need strong backend support, which cloud computing can provide.

Cloud computing is a computing model that distributes computing tasks over a resource pool made up of a large number of computers, allowing application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is called the "cloud". From the user's perspective, the resources in the "cloud" can be expanded without limit, obtained at any time, used on demand, and paid for by use.

The present application can store the data required for audio processing in the "cloud" and retrieve or extend that data at any time as needed. For example, a noise estimation result can be stored in the "cloud" in association with encoding parameters. When a later noise estimation produces a noise estimation result, the encoding parameters corresponding to that result can be fetched from the "cloud" and used for audio encoding.
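The association between noise estimation results and encoding parameters described above could be stored and looked up roughly as follows. This is a sketch only: `ParamStore` is an in-memory stand-in for whatever cloud storage is actually used, and the 10 dB bucketing, the key format, and the parameter values are assumptions made for illustration.

```python
class ParamStore:
    """In-memory stand-in for a cloud key-value store (hypothetical)."""
    def __init__(self):
        self._table = {}

    def put(self, noise_bucket, params):
        self._table[noise_bucket] = params

    def get(self, noise_bucket, default=None):
        return self._table.get(noise_bucket, default)

def bucket(noise_db, step_db=10):
    """Quantize a noise level so that nearby results share one stored entry."""
    return int(noise_db // step_db) * step_db

store = ParamStore()
# Store parameters in association with a noise estimation result.
store.put(bucket(-37.5), {"bitrate_kbps": 48, "sample_rate_hz": 32000})

# A later estimate in the same bucket retrieves the same parameters.
params = store.get(bucket(-33.0))
missing = store.get(bucket(-5.0))   # nothing stored for this bucket yet
```

Bucketing is one way to make continuous noise levels usable as lookup keys; an exact-match key would almost never hit twice.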

Please refer to FIG. 1, which is an architecture diagram of an audio processing system provided by an embodiment of the present application. The system involves at least one speaker and at least one listener. Any speaker can establish a communication connection with any listener so that the two can communicate with each other. The connection may be an audio/video connection; that is, the speaker and the listener can hold an audio or video call.

The listener and the speaker may be terminals, such as smartphones, tablet computers, laptop computers, or desktop computers. For example, the listener and the speaker can make a phone call using smartphones. Alternatively, the listener and the speaker may be clients installed on a terminal, such as clients with audio/video call functions (for example, game clients or instant messaging clients). For example, the listener and the speaker can hold an audio call through an instant messaging client.

As shown in FIG. 1, after the speaker and the listener establish a communication connection, the listener can capture the background audio of the communication environment it is in. This background audio may contain background (environmental) noise, such as white noise, babble noise (many overlapping voices), car horns, or gray noise. After capturing the background audio, the listener can estimate the noise level from it and feed the resulting noise estimation result (also called the noise level result) back to the speaker.

After obtaining the noise estimation result, the speaker can adjust (that is, modify or update) the encoding parameters used for audio encoding, such as one or more of the encoding bitrate, the sampling rate, and the number of channels. Note that default encoding parameters can be configured at the start of communication. In one embodiment, the speaker determines encoding parameters from the received noise estimation result and uses them to replace the defaults. The noise estimation result thus dynamically adjusts the speaker's encoding parameters to follow changes in the listener's background noise, so that playback sound quality matches the listening environment as closely as possible.
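Adjusting default encoding parameters from a reported noise level might look like the sketch below. The three parameters (bitrate, sampling rate, channel count) come from the text; everything else, including the `EncodingParams` type, the default values, the dB thresholds, and the direction of the adjustment, is an assumption chosen only to make the example concrete.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EncodingParams:
    bitrate_kbps: int
    sample_rate_hz: int
    channels: int

# Defaults configured at the start of communication (illustrative values).
DEFAULT_PARAMS = EncodingParams(bitrate_kbps=32, sample_rate_hz=16000, channels=1)

def adjust_params(current, noise_db):
    """Return parameters updated for the reported noise level (assumed rule)."""
    if noise_db >= -30.0:       # noisy listening environment
        return replace(current, bitrate_kbps=16)
    if noise_db <= -60.0:       # quiet listening environment
        return replace(current, bitrate_kbps=64, sample_rate_hz=48000, channels=2)
    return current              # otherwise keep what is already configured

noisy = adjust_params(DEFAULT_PARAMS, -20.0)
quiet = adjust_params(DEFAULT_PARAMS, -70.0)
unchanged = adjust_params(DEFAULT_PARAMS, -45.0)
```

Using an immutable parameter object means each adjustment yields a new value, which makes it easy to log or compare successive settings during a call.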

When the speaker acquires an input audio signal, it can audio-encode the signal with the encoding parameters to obtain an encoded signal. The speaker then sends the encoded signal to the listener, which performs audio decoding on it to obtain a decoded signal; the decoded signal corresponds to the originally input audio signal. In one embodiment, the listener also outputs the audio signal, for example by playing it through a loudspeaker on the listener so that the user of the listener can hear it.

In one embodiment, taking a Voice over Internet Protocol (VoIP) scenario as an example, default encoding parameters are configured at the start of the call. During the call, the listener estimates its background environmental noise and feeds the noise estimation result back to the speaker. The speaker adjusts its encoding parameters according to the feedback, for example by replacing the default parameters with the parameters corresponding to the noise estimation result. Encoding thereby follows changes in the listener's background noise, so that playback sound quality matches the listening environment as closely as possible.

In specific applications, the audio processing system provided in the embodiments of the present application is suitable for any scenario that requires audio encoding of audio signals, such as audio/video calls, multi-person audio/video, conferences, karaoke, live streaming, and games.

The implementation details of the technical solutions of the embodiments of the present application are described below.

Please refer to FIG. 2, which is a flow chart of an audio processing method provided in an embodiment of the present application. The method involves a listener and a speaker, and this embodiment mainly describes the interaction between them. The method includes the following steps:

S201: The listener acquires a first audio signal of the listener.

The listener has established a communication connection with the speaker. The first audio signal may include the background audio of the communication environment the listener is in, and may also include voice audio that the user is currently producing through the listener.

In one implementation, the listener acquires its first audio signal through a first audio acquisition device. The first audio acquisition device may be any device with an audio acquisition function, such as a microphone on the listener.

Optionally, there may be one or more audio acquisition devices. If there are several, the first audio signal may be the audio signal acquired by any one or more of them. In one embodiment, when the listener has several audio acquisition devices (such as microphones), the device whose signal is used as the first audio signal can be chosen according to the call scenario each device is suited to. Here, the call scenario a device is suited to is the scenario in which that device acquires audio signals best. For example, some of the devices may be suited to audio acquisition in hands-free and video call scenarios, while others are suited to non-hands-free, non-video call scenarios.

Accordingly, the current call scenario can be obtained first, the audio acquisition device matching that scenario determined, and the audio signal acquired by the matching device used as the first audio signal. The matching device can be determined as follows: first obtain a mapping between reference call scenarios and reference audio acquisition devices, which may be set in advance; then look up the device that matches the current call scenario in that mapping.
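The scenario-to-device lookup described above can be illustrated with a small preset table. The scenario names, device names, and the fallback behaviour in this sketch are all hypothetical; the text only specifies that a preset mapping from reference call scenarios to reference audio acquisition devices is consulted.

```python
# Hypothetical preset mapping from reference call scenarios to capture devices.
SCENE_TO_DEVICE = {
    "hands_free": "far_field_mic",
    "video_call": "far_field_mic",
    "handset": "near_field_mic",
}

def select_device(scene, captured):
    """Pick the signal from the device mapped to the current call scenario.

    `captured` maps device names to the audio each device acquired.
    """
    device = SCENE_TO_DEVICE.get(scene)
    if device is None or device not in captured:
        # Fallback (an assumption): use the first available device.
        device = next(iter(captured))
    return device, captured[device]

captured = {
    "near_field_mic": [0.01, 0.02],
    "far_field_mic": [0.03, 0.04],
}
device, first_audio_signal = select_device("video_call", captured)
fallback_device, _ = select_device("unknown_scene", captured)
```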

Note that the speaker's later acquisition of its second audio signal can be implemented in the same way.

In one implementation, once the listener and the speaker have established a communication connection, they can communicate with each other, for example in audio/video calls, multi-person audio/video, conferences, karaoke, live streaming, or games. As shown in FIG. 3, interface 31 shows an audio/video call, interface 32 shows a multi-person audio/video call, interface 33 shows a live stream, and interface 34 shows a game. For example, in a live streaming scenario, the streamer broadcasts through the speaker and viewers watch through listeners; here the speaker and the listeners may be terminals or clients with a live streaming function. While watching, a viewer can send a mic-link request through the listener. After the speaker accepts the request, a communication connection between the speaker and the listener is established, enabling a voice call between the streamer and the viewer.

In one implementation, the listener checks in real time whether the current time is a preset noise estimation time. If it is, the listener acquires its first audio signal and uses it for noise estimation.

The preset noise estimation time is set in advance, and there may be one or more such times. If there is only one, it can be set at any point after the communication connection is established, usually as close as possible to the reference time, that is, the time at which the connection was established. This allows the encoding parameters to be adjusted promptly to the listener's environment, which improves the listening experience.

If there are several preset noise estimation times, they can be periodic; that is, the listener performs noise estimation periodically, so that the encoding parameters the speaker uses for audio encoding can be adjusted dynamically from the periodic, real-time noise estimation results to follow changes in the listener's background noise and match playback sound quality to the listening environment as closely as possible. The listener's audio signal can therefore be acquired once per reference period, where the reference period is the time interval between acquisitions. For example, with a reference period of 20 ms, the listener's audio signal is acquired every 20 ms.
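As a worked example of the 20 ms reference period, the sketch below converts the period into a sample count and splits a captured buffer into one block per period. The 20 ms figure comes from the text; the 16 kHz and 48 kHz sampling rates and the function names are assumptions for illustration. At 16 kHz, one period holds 16000 × 20 / 1000 = 320 samples.

```python
def samples_per_period(sample_rate_hz, period_ms=20):
    """Number of samples captured in one reference period."""
    return sample_rate_hz * period_ms // 1000

def split_into_periods(samples, sample_rate_hz, period_ms=20):
    """Split a buffer into complete period-sized blocks; a trailing partial block is dropped."""
    n = samples_per_period(sample_rate_hz, period_ms)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

buffer = [0.0] * 1000                      # 62.5 ms of audio at 16 kHz
blocks = split_into_periods(buffer, 16000)
```

Each block would then be passed to the noise estimator, giving one noise estimation result per period.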

In one implementation, the listener's first audio signal may be acquired when a communication scene switch is detected, so as to reduce computational overhead. For example, the communication scene switch may be a switch between a voice call and a video call: when the call changes from a voice call to a video call, or from a video call to a voice call, acquisition of the listener's first audio signal is triggered. As another example, the communication scene switch may be a switch between a two-party call and a multi-party call (more than two parties): when the call changes from a two-party voice call (interface 31 in FIG. 3 shows a two-party audio/video call scene) to a four-party voice call (interface 32 in FIG. 3 shows a four-party audio/video call scene), or from a four-party voice call to a two-party voice call, acquisition of the listener's first audio signal is triggered. As a further example, the communication scene switch may be a change in the type of audio being played: when playback changes from speech audio to music audio, or from music audio to speech audio, acquisition of the listener's first audio signal is triggered.
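By way of illustration only, the two kinds of acquisition trigger described above (a periodic preset time and a communication scene switch) might be sketched in Python as follows. The names `CallScene`, `periodic_due`, and `scene_switched` are hypothetical and do not appear in the application; only the 20 ms reference period and the three kinds of switch are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallScene:
    media: str        # "voice" or "video"
    parties: int      # number of participants in the call
    audio_type: str   # "speech" or "music"


def periodic_due(now_ms: int, last_ms: int, period_ms: int = 20) -> bool:
    """True when at least one reference period has elapsed since the last acquisition."""
    return now_ms - last_ms >= period_ms


def scene_switched(prev: CallScene, cur: CallScene) -> bool:
    """True on any of the three kinds of scene switch named in the text."""
    media_switch = prev.media != cur.media                    # voice <-> video
    party_switch = (prev.parties > 2) != (cur.parties > 2)    # two-party <-> multi-party
    type_switch = prev.audio_type != cur.audio_type           # speech <-> music
    return media_switch or party_switch or type_switch
```

Either predicate returning true would trigger acquisition of the listener's first audio signal; whether the two triggers are combined in one device is a design choice the text leaves open.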

S202: The listener performs noise estimation on the first audio signal and determines a noise estimation result for the first audio signal.

The noise estimation result indicates the intensity of the noise in the first audio signal.

The party that ultimately benefits from an audio call is the listener, and the listener's audio experience depends not only on the sound quality of the played signal itself but also strongly on the listener's acoustic environment. For example, if the listener is in a relatively quiet environment (such as a conference room), the listener may easily notice even a slight quality problem in the played signal, which degrades the listening experience. Conversely, if the listener is in a relatively noisy environment (such as a shopping mall or a bar), then wherever the played audio signal and the ambient noise overlap in the frequency spectrum, the acoustic masking effect causes the ambient noise to mask flaws in the played audio signal, so that subtle differences in its sound quality are difficult for the listener to perceive. In this case, differences in the encoding parameters of the audio encoding (such as the encoding bit rate) have very little effect on the perceived sound quality. It follows that the listener's ambient noise can be analyzed and its noise level used to adjust the encoding parameters required for audio encoding, so that the encoding parameters adapt to changes in the listener's ambient noise and the playback sound quality is best matched to the listening environment.

In the embodiments of the present application, the encoding parameters adjusted according to the noise level of the listener's ambient noise may be one or more of the encoding bit rate, the sampling rate, the number of channels, and the like.

Take the encoding bit rate as an example. The encoding bit rate is an important encoding parameter of an audio encoder and is strongly related to the compression ratio of the input audio signal. Configuring a lower encoding bit rate for the input audio signal yields greater data compression, but the playback sound quality usually decreases. A higher encoding bit rate yields better encoding quality and therefore a higher-quality decoded output signal, that is, better playback sound quality; however, the compression ratio is lower, the encoded file occupies more storage, and the encoded stream occupies more bandwidth. Therefore, to improve communication quality, a suitable encoding bit rate is usually configured for encoding the audio signal. Accordingly, the encoding bit rate required for audio encoding can be adjusted in real time according to the fed-back noise level of the listener, so as to make effective use of the encoding capacity and the transmission bandwidth and to achieve the best balance between the playback sound quality of the audio signal and the operating cost (such as the cost of audio transmission bandwidth and storage).

As another example, take the sampling rate. The sampling rate may be positively correlated with the encoding quality: the higher the sampling rate, the better the encoding quality and hence the playback sound quality of the audio signal; correspondingly, the lower the sampling rate, the worse the encoding quality and the playback sound quality. A lower sampling rate means fewer samples per second and therefore less audio data, whereas a higher sampling rate requires more storage space and more processing power. Accordingly, the sampling rate required for audio encoding can be adjusted in real time according to the fed-back noise level of the listener, so as to achieve the best balance between the playback sound quality of the audio signal and the amount of sampled data. Because the amount of sampled data directly affects the required storage space and processing power, this also achieves the best balance between the playback sound quality and the storage space and processing power.

As a further example, take the number of channels (audio channels). The number of channels may be positively correlated with the playback sound quality of the audio signal (or the perceived playback quality): in general, the more channels there are, the stronger the stereo effect and sense of presence of the audio signal. That is, the larger the number of channels, the better the playback sound quality; correspondingly, the smaller the number of channels, the worse the playback sound quality. The number of channels here may include a first number of channels and a second number of channels, where the first number of channels can be understood as a single channel (mono) and the second number of channels as two channels (stereo). Channels usually refer to mutually independent audio signals captured at different spatial positions during recording, so the number of channels can also be understood as the number of sound sources at recording time; the larger the number of channels, the higher the complexity of acquiring the audio signal. Accordingly, the number of channels required for audio encoding can be adjusted in real time according to the fed-back noise level of the listener, so as to achieve the best balance between the playback sound quality of the audio signal and the complexity of signal acquisition.
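As a worked illustration of how the three encoding parameters discussed above interact, the following Python sketch computes the uncompressed PCM data rate from the sampling rate and the number of channels, and the compression ratio implied by a given encoding bit rate. The 16-bit sample depth and the specific rates are illustrative assumptions, not values from the application.

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def compression_ratio(raw_bps: int, encoded_bps: int) -> float:
    """How many times smaller the encoded stream is than the raw stream."""
    return raw_bps / encoded_bps


# Assumed example: 48 kHz, 16-bit, stereo capture.
raw = pcm_bit_rate(48_000, 16, 2)               # 1_536_000 bit/s
low_rate = compression_ratio(raw, 24_000)       # 64.0: strong compression, lower quality
high_rate = compression_ratio(raw, 96_000)      # 16.0: weaker compression, higher quality

# Halving the channel count or the sampling rate halves the raw data rate.
mono = pcm_bit_rate(48_000, 16, 1)              # 768_000 bit/s
```

The lower encoding bit rate gives the larger compression ratio, which is the trade-off between bandwidth or storage cost and playback quality that the noise-adaptive adjustment is meant to balance.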

S203: The listener sends the noise estimation result of the first audio signal to the speaker.

In one implementation, the listener may send the noise estimation result of the first audio signal to the speaker, so that the speaker can adjust the encoding parameters required for audio encoding in real time based on the noise estimation result.

S204: The speaker receives the noise estimation result sent by the listener.

As described above, the noise estimation result is obtained by the listener by acquiring its first audio signal and performing noise estimation on that signal.

S205: The speaker determines, based on the noise estimation result, the encoding parameters to be used for audio encoding.

Audio encoding analyzes and removes time-domain and frequency-domain redundancy in the originally captured lossless audio signal and compresses it, while preserving most or all of the information in the audio. Compression reduces the amount of audio data, thereby reducing the audio transmission bandwidth and storage space and hence the cost of audio transmission and storage, while still maintaining good audio quality. The encoding parameters involved in audio encoding may include the sampling rate, the number of channels, the encoding bit rate, and so on. The embodiments of the present application use the noise level of the listener to adjust the encoding parameters required for audio encoding, so that suitable encoding parameters are used to encode the audio signal and the listening experience of the listener is improved.

In one implementation, a mapping between reference noise estimation results and reference encoding parameters may be set in advance; the mapping may represent well-matched pairs of noise estimation results and encoding parameters obtained through extensive testing. After receiving the noise estimation result sent by the listener, the speaker may obtain the mapping and use it to determine the encoding parameters corresponding to the noise estimation result. For example, the speaker may look up in the mapping the reference noise estimation result that matches the received noise estimation result, and take the reference encoding parameters corresponding to that reference noise estimation result as the encoding parameters.
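A minimal Python sketch of such a preset mapping is given below, assuming that the noise estimation result is a single level in dB and that the reference results are expressed as noise-level bands. The band boundaries and parameter values are hypothetical placeholders; the application states only that the mapping is obtained through testing.

```python
from bisect import bisect_right

# Hypothetical band boundaries (dB): below 40 is "quiet", 40 to 60 is "moderate",
# 60 and above is "noisy".
NOISE_UPPER_DB = [40.0, 60.0]

# Hypothetical reference encoding parameters, one entry per band.
REFERENCE_PARAMS = [
    {"bitrate_bps": 64_000, "sample_rate_hz": 48_000, "channels": 2},  # quiet
    {"bitrate_bps": 32_000, "sample_rate_hz": 32_000, "channels": 1},  # moderate
    {"bitrate_bps": 16_000, "sample_rate_hz": 16_000, "channels": 1},  # noisy
]


def lookup_params(noise_db: float) -> dict:
    """Return the reference parameters of the band that contains noise_db."""
    return REFERENCE_PARAMS[bisect_right(NOISE_UPPER_DB, noise_db)]
```

For instance, `lookup_params(35.0)` selects the first (quiet) entry and `lookup_params(75.0)` the last (noisy) entry. A noisier listening environment maps to cheaper parameters because masking hides the resulting loss of quality.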

Default encoding parameters are usually configured at the start of a communication session. In the embodiments of the present application, as long as the speaker has not adjusted the default encoding parameters, the default encoding parameters may be used for audio encoding. In one embodiment, after the speaker obtains the encoding parameters, it may directly replace the default encoding parameters with them. In another embodiment, both the default encoding parameters and the obtained encoding parameters may be retained, and an estimation time may be recorded for the obtained encoding parameters, namely the time at which they were obtained. In subsequent audio encoding, the most recent encoding parameters can then be identified from the recorded estimation times and used for audio encoding.
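The second embodiment above (retaining the default parameters and tagging each obtained parameter set with its estimation time) could be sketched in Python as follows. The class name `ParamStore` and its methods are hypothetical and serve only to illustrate selecting the most recent parameters.

```python
from dataclasses import dataclass, field


@dataclass
class ParamStore:
    default: dict
    # Each entry is (estimation_time, params), appended as results arrive.
    history: list = field(default_factory=list)

    def add(self, estimation_time: float, params: dict) -> None:
        """Record encoding parameters together with the time they were obtained."""
        self.history.append((estimation_time, params))

    def current(self) -> dict:
        """Latest parameters by estimation time, or the default if none were obtained."""
        if not self.history:
            return self.default
        return max(self.history, key=lambda entry: entry[0])[1]
```

Selecting by timestamp rather than by insertion order keeps the result correct even if feedback arrives out of order.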

S206: If the speaker acquires a second audio signal of the speaker, it performs audio encoding on the second audio signal using the encoding parameters to obtain an encoded signal corresponding to the second audio signal.

A user of the speaker device can speak through it in order to communicate with a user of the listener device. If the speaker acquires its own audio signal, it can encode that signal using the encoding parameters. For convenience of description, this audio signal is referred to as the second audio signal; it is the speech uttered by the user through the speaker. In one implementation, the speaker may acquire its second audio signal through a second audio acquisition device, that is, a device with an audio capture function, such as a microphone on the speaker.

The encoding parameters are derived from the communication environment in which the listener is located. Rather than encoding all audio signals with a single default (fixed) set of encoding parameters for the entire communication connection, the embodiments of the present application adjust the default encoding parameters based on the noise level of the listener, so that audio signals are encoded with parameters adapted to that noise level.

In one implementation, after the second audio signal is acquired, it may be audio-encoded using the encoding parameters so as to compress it. The compressed second audio signal is referred to as the encoded signal.

S207: The speaker sends the encoded signal to the listener.

S208: The listener receives the encoded signal sent by the speaker, performs audio decoding on the encoded signal to obtain a decoded signal, and plays the decoded signal.

Audio decoding decodes the compressed audio signal (that is, the encoded signal) so as to restore the original audio signal; the restored signal is referred to as the decoded signal. As described above, the encoded signal is obtained by the speaker by acquiring its second audio signal and encoding it with the encoding parameters determined from the noise estimation result.

In one implementation, once the listener has obtained the decoded signal, it can play the decoded signal so that the user hears the speech uttered at the speaker. For example, the decoded signal may be played through a playback-capable device on the listener; if the listener is a mobile phone, this device may be the phone's loudspeaker.

In one implementation, when there are multiple listeners, each listener may estimate its own background ambient noise and feed its noise estimation result back to the speaker. The speaker may then adjust the default encoding parameters based on each listener's noise estimation result, obtaining encoding parameters suited to each listener, so that the audio signal output by the codec is adapted to the different listeners and the multi-party voice experience is improved.

In the embodiments of the present application, the listener estimates its background ambient noise and feeds the noise estimation result back to the speaker, so that the speaker can adjust the encoding parameters according to the fed-back result. The encoding thereby adapts to changes in the listener's background noise environment, and the playback sound quality is best matched to the listening environment.

Referring to FIG. 4, FIG. 4 is a schematic flowchart of another audio processing method provided in an embodiment of the present application. This embodiment mainly describes how the listener side performs noise estimation on the first audio signal. The audio processing method described in this embodiment includes the following steps:

S401: Acquire a first audio signal of the listener.

For the specific implementation of this step, refer to the description of step S201 above; it is not repeated here.

S402: Perform noise estimation on the first audio signal, determine a noise estimation result of the first audio signal, and send the noise estimation result of the first audio signal to the speaker.

The noise estimation result indicates the intensity of the noise in the first audio signal.

In one implementation, the listener may estimate its background ambient noise and feed the noise estimation result back to the speaker, so that the speaker can adjust the encoding parameters required for audio encoding according to the fed-back noise estimation result.

In one implementation, the first audio signal may first be divided into M audio frames, and noise estimation may then be performed on each audio frame to obtain a noise estimation result for that frame. Each time the noise estimation result of an audio frame is obtained, it may be fed back to the speaker so that the speaker can adjust the encoding parameters promptly. Accordingly, the M audio frames may be traversed in turn: when the m-th audio frame (m∈[1,M]) is reached, noise estimation is performed on it to obtain the noise estimation result of the m-th audio frame, which is taken as the noise estimation result of the first audio signal and fed back to the speaker. In other words, the noise estimation result of the first audio signal is updated in real time as the noise estimation result of each audio frame becomes available.

The value of M is not limited; for example, it may be 20. The frame length of an audio frame is not limited either; for example, it may be 15 ms. When framing, two adjacent audio frames may overlap in time (for example, by 30%) or may not overlap; this is not limited.
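The framing just described might be sketched in Python as below. The function name `split_frames`, the rounding of the hop size, and the choice to discard a trailing partial frame are assumptions made for this illustration; the application does not specify them.

```python
def split_frames(samples, frame_len, overlap=0.0):
    """Split a sample sequence into frames of frame_len samples.

    overlap is the fraction of each frame shared with the next one
    (0.3 means adjacent frames overlap by 30%). A trailing partial
    frame is dropped (an assumption of this sketch).
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]
```

For example, at an assumed 16 kHz sampling rate a 15 ms frame is 240 samples, and a 30% overlap gives a hop of 168 samples between frame starts.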

In one implementation, because noise estimation computes its result from the power spectrum of the signal, the m-th audio frame is first converted from the time domain to the frequency domain. Specifically, performing noise estimation on the m-th audio frame to obtain its noise estimation result may be implemented as follows. First, a Fourier transform is applied to the m-th audio frame to obtain the complex value of each frequency bin in the frequency domain, from which the power spectrum of the m-th audio frame is obtained. Then, noise estimation is performed on the m-th audio frame based on its power spectrum to determine the noise estimation result of the m-th audio frame.
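As a minimal illustration of this step, the following Python sketch computes the complex frequency-domain values of one frame with a direct discrete Fourier transform and takes the squared magnitude of each bin as the power spectrum. The direct O(N²) transform is used only to keep the sketch self-contained; a practical implementation would presumably use an FFT, and any windowing of the frame is omitted here.

```python
import cmath


def dft(frame):
    """Direct discrete Fourier transform: one complex value per frequency bin."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * z * t / n) for t, x in enumerate(frame))
        for z in range(n)
    ]


def power_spectrum(frame):
    """Squared magnitude of each frequency-domain value of the frame."""
    return [abs(value) ** 2 for value in dft(frame)]
```

A unit impulse, for instance, has equal power in every bin, while a constant frame concentrates all of its power in bin 0.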

In one implementation, when performing noise estimation on the m-th audio frame, the power spectrum of the m-th audio frame may also be divided into subbands, and the noise estimation result of the m-th audio frame is determined from the noise estimates of the individual subbands. Accordingly, performing noise estimation on the m-th audio frame based on its power spectrum may be implemented as the following steps S01-S03:

S01: Divide the power spectrum of the m-th audio frame into K subbands, and determine the subband power spectrum of each of the K subbands.

The value of K is not specifically limited; for example, it may be 10, 15, and so on.

In one implementation, the subband division may use equal bandwidths or unequal bandwidths. With equal-bandwidth division, all subbands have the same bandwidth. With unequal-bandwidth division, the subbands may have different bandwidths; in this case, the subband width may be made to increase with the subband index k, so that subbands in the low-frequency range are narrow and subbands in the high-frequency range are wide.
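The two division schemes might be sketched in Python as follows, returning for each subband an inclusive (start, end) pair of frequency-bin indices. The linear growth of widths in `widening_bands` (proportional to 1, 2, ..., K) is purely an assumption for illustration; the application says only that higher subbands are wider.

```python
def equal_bands(n_bins, k):
    """K subbands of (nearly) equal width as inclusive (start, end) bin indices."""
    edges = [round(i * n_bins / k) for i in range(k + 1)]
    return [(edges[i], edges[i + 1] - 1) for i in range(k)]


def widening_bands(n_bins, k):
    """K subbands whose widths grow in proportion to 1, 2, ..., K (assumed rule)."""
    total = k * (k + 1) // 2
    edges = [round(n_bins * (i * (i + 1) // 2) / total) for i in range(k + 1)]
    return [(edges[i], edges[i + 1] - 1) for i in range(k)]
```

With 20 bins and K = 4, `widening_bands` yields widths of 2, 4, 6, and 8 bins. Both helpers assume that the number of bins is large enough relative to K for every subband to contain at least one bin.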

In one implementation, because the subband power spectrum is determined in the same way for each of the K subbands, the determination is described below using the k-th subband as an example. For the k-th subband, the number of frequency bins in the subband is obtained first; for example, the start and end frequency-bin indices of the k-th subband may be obtained, and the number of bins determined from them. The subband power spectrum of the k-th subband is then determined from the number of bins in the subband and the frequency-domain value of each bin in the subband. In this implementation, the computer device may calculate the subband power spectrum of the k-th subband of the m-th audio frame by the following formula (1):

$$S(m,k) = \frac{1}{freq_2(k) - freq_1(k) + 1} \sum_{z=freq_1(k)}^{freq_2(k)} \left| X(m,z) \right|^2 \tag{1}$$

Here, m is the index of the audio frame, k is the index of the subband, and z is the frequency-bin index within the subband. freq_1(k) denotes the start frequency-bin index of the k-th subband, freq_2(k) denotes its end frequency-bin index, and freq_2(k)-freq_1(k)+1 is the number of frequency bins in the k-th subband. X(m,z) is the frequency-domain value (a complex value) of the z-th frequency bin of the m-th audio frame after the Fourier transform.
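The subband power computation described above, read here as the mean squared magnitude of the frequency-domain values over the bins of the subband, can be sketched in Python as follows. The function name `subband_power` is hypothetical; `freq1` and `freq2` are the inclusive start and end bin indices.

```python
def subband_power(spectrum, freq1, freq2):
    """Mean of |X(m, z)|^2 over the bins freq1..freq2 (inclusive) of one frame.

    spectrum holds the complex frequency-domain values X(m, z) of the frame.
    """
    n_bins = freq2 - freq1 + 1
    return sum(abs(x) ** 2 for x in spectrum[freq1:freq2 + 1]) / n_bins
```

Dividing by the number of bins makes subbands of different widths directly comparable, which matters when the unequal-bandwidth division is used.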

S02: Perform noise estimation on each subband based on its subband power spectrum, and determine the noise estimation result of that subband.

Because the noise estimation procedure is the same for every subband of the m-th audio frame, it is described below using the k-th of the K subbands as an example; see steps S11-S14. Steps S11-S14 constitute the subband noise power spectrum estimation, which yields a relatively accurate estimate of the noise power spectrum.

S11: For the k-th of the K subbands, perform time-frequency smoothing on the subband power spectrum of the k-th subband to obtain the smoothed subband power spectrum of the k-th subband.

In one implementation, the subband power spectrum of the k-th subband is first smoothed in the frequency domain across adjacent subbands, and the frequency-domain smoothing result is then smoothed in the time domain across historical audio frames. The time-domain smoothing result is taken as the smoothed subband power spectrum of the k-th subband. The details are described below.

In one implementation, the frequency-domain smoothing may proceed as follows.

First, the subband power spectrum of each of the r subbands adjacent to the k-th subband is obtained.

Here, "adjacent" may mean forward adjacent, backward adjacent, or both. That is, the r adjacent subbands may be subbands forward adjacent to the k-th subband, such as the (k-1)-th subband; or subbands backward adjacent to the k-th subband, such as the (k+1)-th subband; or subbands both forward and backward adjacent to the k-th subband, such as the (k-1)-th and (k+1)-th subbands. The value of r is not limited; for example, it may be 1 or 2. When both forward and backward adjacent subbands are used, the numbers of forward and backward adjacent subbands are usually equal; for example, if there are 2 forward adjacent subbands, there are also 2 backward adjacent subbands.

Then, frequency-domain smoothing is performed on the subband power spectrum of the k-th subband based on the subband power spectra of the adjacent subbands, yielding the frequency-smoothed subband power spectrum of the k-th subband. Optionally, before the frequency-domain smoothing, a set of frequency-domain smoothing factors may be obtained, containing one factor for each adjacent subband and one for the k-th subband. The frequency-domain smoothing can then be understood as a weighted sum of the subband power spectra of these subbands, each weighted by its frequency-domain smoothing factor, which yields the frequency-smoothed subband power spectrum of the k-th subband.

In this implementation, the computer device may calculate the frequency-smoothed subband power spectrum of the k-th subband of the m-th audio frame by the following formula (2):

$$S_f(m,k) = \sum_{j=-w}^{w} b(j+w)\, S(m,k+j) \tag{2}$$

Here, S_f(m,k) (the symbol used in this text for the quantity) denotes the frequency-smoothed subband power spectrum of the k-th subband of the m-th audio frame, w denotes the number of adjacent subbands on one side (forward or backward), and 2w denotes the total number of adjacent subbands (that is, r). b(j+w) denotes the set of frequency-domain smoothing factors, whose elements are, in order, the factors of the adjacent subbands and of the k-th subband. For example, if w is 2, then b(j+w)=[b(0), b(1), b(2), b(3), b(4)], where b(0), b(1), b(2), b(3), b(4) are the frequency-domain smoothing factors of the (k-2)-th, (k-1)-th, k-th, (k+1)-th, and (k+2)-th subbands respectively; for instance, b(j+w)=[0.1, 0.2, 0.4, 0.2, 0.1]. S(m,k+j) denotes the subband power spectrum of the (k+j)-th subband of the m-th audio frame.
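The weighted sum over adjacent subbands described above might be sketched in Python as follows. The application does not say how subbands at the edges of the spectrum (where k+j falls outside the valid range) are handled; clamping the index to the nearest valid subband is an assumption of this sketch, as is the function name `smooth_frequency`.

```python
def smooth_frequency(subband_powers, k, b):
    """Frequency-domain smoothing of subband k for one frame.

    subband_powers: list of S(m, 0..K-1) for the current frame m
    b: smoothing factors of length 2*w + 1, centred on subband k
    """
    w = (len(b) - 1) // 2
    last = len(subband_powers) - 1
    total = 0.0
    for j in range(-w, w + 1):
        idx = min(max(k + j, 0), last)  # assumption: clamp at the spectrum edges
        total += b[j + w] * subband_powers[idx]
    return total
```

With the example factors [0.1, 0.2, 0.4, 0.2, 0.1], which sum to 1, the smoothing preserves the overall power level while reducing fluctuations between neighbouring subbands.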

In one implementation, the time-domain smoothing may proceed as follows:

First, the smoothed subband power spectrum of the k-th subband of the (m-1)-th audio frame may be obtained. It can be understood that calculating the noise estimation results of the M audio frames is an iterative process: starting from the 1st audio frame, the noise estimation result of each audio frame is calculated in turn. Therefore, when the smoothed subband power spectrum of the k-th subband of the m-th audio frame is being calculated, that of the k-th subband of the (m-1)-th audio frame has already been calculated and can be obtained directly.

Then, time-domain smoothing may be performed based on the smoothed subband power spectrum of the k-th subband of the (m-1)-th audio frame and the frequency-domain smoothed subband power spectrum of the k-th subband, to obtain the time-domain smoothed subband power spectrum of the k-th subband, which may be used as the smoothed subband power spectrum of the k-th subband.

Optionally, before the time-domain smoothing, a time-domain smoothing factor may be obtained, and the smoothing is then performed, based on that factor, on the smoothed subband power spectrum of the k-th subband of the (m-1)-th audio frame and the frequency-domain smoothed subband power spectrum of the k-th subband of the m-th audio frame, to obtain the time-domain smoothed subband power spectrum of the k-th subband of the m-th audio frame. In one embodiment, the time-domain smoothing factor may be multiplied by the smoothed subband power spectrum of the k-th subband of the (m-1)-th audio frame, and the difference between 1 and the time-domain smoothing factor may be multiplied by the frequency-domain smoothed subband power spectrum of the k-th subband of the m-th audio frame; the two products are then summed, and the sum is used as the time-domain smoothed subband power spectrum of the k-th subband of the m-th audio frame, that is, the smoothed subband power spectrum of the k-th subband of the m-th audio frame. In this implementation, the computer device may calculate the smoothed subband power spectrum of the k-th subband of the m-th audio frame by the following formula (3):

$$\bar{S}(m,k)=c_0\,\bar{S}(m-1,k)+(1-c_0)\,S_f(m,k) \tag{3}$$

where $\bar{S}(m,k)$ denotes the smoothed subband power spectrum of the k-th subband of the m-th audio frame, $\bar{S}(m-1,k)$ denotes the smoothed subband power spectrum of the k-th subband of the (m-1)-th audio frame, $S_f(m,k)$ denotes the frequency-domain smoothed subband power spectrum of the k-th subband of the m-th audio frame, and $c_0$ ($c_0<1$) denotes the time-domain smoothing factor, for example, $c_0=0.9$.
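The recursion described above is a first-order recursive average over frames. A minimal Python sketch follows; the function names are hypothetical, and initialising the first frame with its own frequency-domain smoothed value is an assumption, since the text does not state how the recursion is started.

```python
def smooth_in_time(prev_smoothed, freq_smoothed, c0=0.9):
    """One step of the recursion: c0 * previous + (1 - c0) * current."""
    return c0 * prev_smoothed + (1.0 - c0) * freq_smoothed


def smooth_frames(freq_smoothed_frames, c0=0.9):
    """Apply the recursion to one subband across consecutive frames."""
    smoothed = []
    for value in freq_smoothed_frames:
        # Assumption: the first frame uses its own value as the "previous"
        # smoothed value, because no earlier frame exists.
        prev = smoothed[-1] if smoothed else value
        smoothed.append(smooth_in_time(prev, value, c0))
    return smoothed


print(smooth_frames([1.0, 1.0, 5.0, 1.0]))
```

With $c_0$ close to 1 the output follows the history closely, so a single loud frame moves the smoothed value only slightly.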

S12: Obtain a preset period, and determine the minimum subband power spectrum for the k-th subband of the m-th audio frame based on the smoothed subband power spectrum of the k-th subband and the preset period.

This step mainly finds the minimum subband power spectrum within a time window. The time window is the preset period mentioned here, which may be denoted T. The value T indicates that the minimum subband power spectrum is searched for every T audio frames, that is, a minimum search is performed within T audio frames, and the result is used as a rough estimate of the noise in the k-th subband. For example, if T is 5, the minimum subband power spectrum is searched for every 5 audio frames.

In one implementation, for the k-th subband of the m-th audio frame, the remainder of k divided by T may be computed to obtain a processing result. The processing result is either a specified value or a non-specified value, where the specified value is 0 and a non-specified value is any value other than 0.

If the processing result is the specified value, the minimum subband power spectrum of the previous period may be obtained, and the smaller of the minimum subband power spectrum of the previous period and the smoothed subband power spectrum of the k-th subband of the m-th audio frame may be used as the minimum subband power spectrum for the k-th subband of the m-th audio frame.

If the processing result is not the specified value, the minimum subband power spectrum for the k-th subband of the (m-1)-th audio frame may be obtained, and the smaller of the smoothed subband power spectrum of the k-th subband and the minimum subband power spectrum for the k-th subband of the (m-1)-th audio frame may be used as the minimum subband power spectrum for the k-th subband of the m-th audio frame.

For example, assume T is 5, so that the minimum subband power spectrum is searched for every 5 audio frames. If k is 7, the remainder of k divided by T is a non-specified value (not 0); the minimum subband power spectrum for the 6th subband of the m-th audio frame may then be obtained, and the smaller of that value and the smoothed subband power spectrum of the 7th subband is used as the minimum subband power spectrum for the 7th subband of the m-th audio frame. If k is 10, the remainder of k divided by T is the specified value (0); the minimum subband power spectrum over the 6th to 9th subbands of the m-th audio frame may then be obtained, and the smaller of that value and the smoothed subband power spectrum of the 10th subband is used as the minimum subband power spectrum for the 10th subband of the m-th audio frame.

The above way of determining the minimum subband power spectrum for the k-th subband of the m-th audio frame may be referred to as a minimum search method, whose processing logic has two branches.

In that logic, T denotes the preset period, $S_{\min}(m,k)$ denotes the minimum subband power spectrum of the k-th subband of the m-th audio frame, and $S_{tmp}(m,k)$ denotes an intermediate variable. The assignment made when the processing result is 0 means: the smaller of the minimum subband power spectrum of the previous period and the smoothed subband power spectrum of the k-th subband of the m-th audio frame is used as the minimum subband power spectrum for the k-th subband of the m-th audio frame. The assignment made otherwise means: the smaller of the minimum subband power spectrum for the k-th subband of the (m-1)-th audio frame and the smoothed subband power spectrum of the k-th subband of the m-th audio frame is used as the minimum subband power spectrum for the k-th subband of the m-th audio frame.
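The two branches of the minimum search can be sketched in Python as below. This is an illustration under stated assumptions rather than the original listing: the function name `update_minimum` is hypothetical; the remainder is taken on the frame index m, on the assumption that the period T counts audio frames as described for step S12, whereas the text above applies the remainder to k; and the way the intermediate variable is refreshed follows the usual windowed minimum tracking, which the text does not spell out.

```python
def update_minimum(m, T, smoothed, s_min_prev, s_tmp_prev):
    """Return (s_min, s_tmp) for one subband at frame m.

    smoothed:   smoothed subband power spectrum of the current frame.
    s_min_prev: minimum subband power spectrum from frame m - 1.
    s_tmp_prev: intermediate variable from frame m - 1; at a window
                boundary it holds the minimum of the previous period.
    """
    if m % T == 0:
        # Window boundary: compare with the previous period's minimum,
        # then restart the intermediate variable (assumed behaviour).
        s_min = min(s_tmp_prev, smoothed)
        s_tmp = smoothed
    else:
        # Inside a window: keep tracking the running minimum.
        s_min = min(s_min_prev, smoothed)
        s_tmp = min(s_tmp_prev, smoothed)
    return s_min, s_tmp


# Track one subband over six frames with T = 5.
s_min, s_tmp = float("inf"), float("inf")
for m, value in enumerate([4.0, 3.0, 5.0, 2.0, 6.0, 7.0], start=1):
    s_min, s_tmp = update_minimum(m, 5, value, s_min, s_tmp)
print(s_min, s_tmp)
```

Restarting the intermediate variable at each boundary lets the tracked minimum rise again once an old, very low value has fallen out of the search window.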

S13: Determine, based on the smoothed subband power spectrum and the minimum subband power spectrum, the speech probability that speech is present in the k-th subband of the m-th audio frame.

First, a signal-to-noise ratio may be determined based on the smoothed subband power spectrum and the minimum subband power spectrum. For example, the ratio of the smoothed subband power spectrum to the minimum subband power spectrum may be used as the signal-to-noise ratio. In this implementation, the computer device may calculate the signal-to-noise ratio by the following formula (4):

$$S_r(m,k)=\frac{\bar{S}(m,k)}{S_{\min}(m,k)} \tag{4}$$

where $S_r(m,k)$ denotes the signal-to-noise ratio, $\bar{S}(m,k)$ denotes the smoothed subband power spectrum of the k-th subband of the m-th audio frame, and $S_{\min}(m,k)$ denotes the corresponding minimum subband power spectrum.

Then, the signal-to-noise ratio may be compared with a threshold, and the initial speech probability that speech is present in the k-th subband of the m-th audio frame may be determined based on the comparison result. The threshold is a ratio threshold used to decide whether speech is present. It may be preset, and its specific value is not limited; it may be an empirical value. Generally, the quieter the communication environment, the larger the threshold may be set, and the noisier the communication environment, the smaller the threshold may be set. In a specific implementation, if the signal-to-noise ratio is greater than the threshold, the initial speech probability for the k-th subband of the m-th audio frame may be determined to be a first probability; if the signal-to-noise ratio is less than or equal to the threshold, the initial speech probability is determined to be a second probability. The first probability may be 1, and the second probability may be 0. In this implementation, the computer device may calculate the initial speech probability that speech is present in the k-th subband of the m-th audio frame by the following formula (5):

$$p(m,k)=\begin{cases}1, & S_r(m,k)>\delta\\ 0, & S_r(m,k)\le\delta\end{cases} \tag{5}$$

where p(m,k) denotes the initial speech probability that speech is present in the k-th subband of the m-th audio frame, $S_r(m,k)$ denotes the signal-to-noise ratio, and δ denotes the threshold.
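The ratio test above can be sketched in Python as follows. The function names, the threshold value used in the example, and the small `eps` guard against division by zero are assumptions made for illustration; the text fixes neither a threshold value nor any handling of a zero minimum.

```python
def snr(smoothed, s_min, eps=1e-12):
    """Ratio of the smoothed subband power to the tracked minimum."""
    # eps is an illustrative guard against a zero minimum (assumption).
    return smoothed / max(s_min, eps)


def initial_speech_probability(smoothed, s_min, delta):
    """Return 1.0 if the ratio exceeds the threshold delta, otherwise 0.0."""
    return 1.0 if snr(smoothed, s_min) > delta else 0.0


# delta = 5.0 is a hypothetical threshold chosen only for this example.
print(initial_speech_probability(10.0, 1.0, delta=5.0))
print(initial_speech_probability(4.0, 1.0, delta=5.0))
```

A ratio exactly equal to the threshold falls in the "no speech" branch, matching the "less than or equal to" wording above.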

Finally, the speech probability that speech is present in the k-th subband of the (m-1)-th audio frame may be obtained, and the speech probability that speech is present in the k-th subband of the m-th audio frame may be determined based on the speech probability of the k-th subband of the (m-1)-th audio frame and the initial speech probability. In one embodiment, a smoothing factor for the speech presence probability may also be obtained. The smoothing factor can be understood as a weight between the speech presence probability of the current audio frame and that of the previous audio frame. Accordingly, the speech probability that speech is present in the k-th subband of the m-th audio frame may be determined based on the smoothing factor, the speech probability of the k-th subband of the (m-1)-th audio frame, and the initial speech probability.

In this implementation, the computer device may calculate the speech probability that speech is present in the k-th subband of the m-th audio frame by formula (6).

In formula (6), the result is the speech probability that speech is present in the k-th subband of the m-th audio frame, $\alpha_p$ ($\alpha_p<1$) denotes the smoothing factor for the speech presence probability, and the recursive term is the speech probability that speech is present in the k-th subband of the (m-1)-th audio frame.
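One plausible form of this recursion, written by analogy with the time-domain smoothing of formula (3), is sketched below in Python. Which term the smoothing factor multiplies is an assumption (here it weights the previous frame), the function name is hypothetical, and the default value of `alpha_p` is illustrative only, since the text gives no value.

```python
def speech_probability(prev_prob, initial_prob, alpha_p=0.2):
    """Smooth the hard 0/1 decision p(m, k) across frames.

    Assumed form: alpha_p * previous + (1 - alpha_p) * initial.
    """
    return alpha_p * prev_prob + (1.0 - alpha_p) * initial_prob


# A subband switching from "no speech" to "speech".
prob = 0.0
for initial in [0.0, 1.0, 1.0, 1.0]:
    prob = speech_probability(prob, initial)
print(prob)
```

Smoothing turns the binary decision into a value between 0 and 1, so a single frame crossing the threshold does not flip the estimate outright.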

S14: Determine the noise estimation result of the k-th subband of the m-th audio frame based on the speech probability, the subband power spectrum of the k-th subband of the m-th audio frame, and the noise estimation result of the k-th subband of the (m-1)-th audio frame.

The noise estimation result of the k-th subband is the estimated noise power spectrum of the k-th subband. In this implementation, the computer device may calculate the noise estimation result of the k-th subband of the m-th audio frame by formula (7).

In formula (7), the result is the noise estimation result of the k-th subband of the m-th audio frame, and the recursive input is the noise estimation result of the k-th subband of the (m-1)-th audio frame.
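Since formula (7) is not reproduced in the text, the following Python sketch only illustrates one common way to combine the three inputs named in step S14: a probability-controlled recursive average, as used in minima-controlled noise estimators. The function name, the parameter `alpha_d`, and its default value are all assumptions and should not be read as the formula of this application.

```python
def update_subband_noise(prev_noise, power, speech_prob, alpha_d=0.95):
    """Probability-controlled recursive noise update (assumed form).

    prev_noise:  noise estimate of subband k at frame m - 1.
    power:       subband power spectrum S(m, k) of the current frame.
    speech_prob: speech probability of subband k at frame m, in [0, 1].
    alpha_d:     hypothetical base smoothing factor (not from the text).
    """
    # The more likely speech is, the closer alpha gets to 1 and the less
    # the current frame is allowed to change the noise estimate.
    alpha = alpha_d + (1.0 - alpha_d) * speech_prob
    return alpha * prev_noise + (1.0 - alpha) * power


print(update_subband_noise(2.0, 4.0, speech_prob=0.0))
print(update_subband_noise(2.0, 4.0, speech_prob=1.0))
```

Under this assumed form the estimate is frozen while speech is certainly present and adapts fastest when speech is certainly absent.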

S03: Perform noise estimation on the m-th audio frame based on the noise estimation results of the subbands contained in the m-th audio frame, and determine the noise estimation result of the m-th audio frame.

In one implementation, after the noise estimation results of the subbands contained in the m-th audio frame are obtained, an overall noise estimate of the m-th audio frame can be made from them, that is, the overall noise estimation result of the m-th audio frame is determined. The noise estimation result of the m-th audio frame refers to this overall result. It can be used to characterize the noise intensity (noise level) of the m-th audio frame and may be a specific numerical value, in which case it may be called a noise estimation value, a noise intensity value, a noise level value, and so on.

In one implementation, noise estimation may be performed on the m-th audio frame based on the noise estimation results of all subbands contained in the m-th audio frame, to determine the noise estimation result of the m-th audio frame. Alternatively, noise estimation may be performed based on the noise estimation results of only some of the subbands contained in the m-th audio frame.

The embodiments of this application describe in detail the case in which some of the subbands are used for noise estimation on the m-th audio frame. Noise estimation using the noise estimation results of all subbands is implemented similarly, except that the subband screening operation mentioned below is not needed. For example, only the subband range that overlaps the main spectral range of the human voice may be counted; this subband range may be, for example, 100 Hz to 6 kHz. That is, only the noise estimation results of the subbands within this range are used for noise estimation on the m-th audio frame.

In this case, the K subbands of the m-th audio frame may first be screened based on the subband range, so that the noise estimation results of the subbands within the range are used for noise estimation on the m-th audio frame, and the noise estimation result of the m-th audio frame is then determined.

In a specific implementation, a preset subband range may first be obtained, and the subbands whose frequency range lies within the preset subband range are determined from the K subbands contained in the m-th audio frame. The preset subband range here is the subband range described above. The number of subbands within the preset subband range may be N, and these subbands may be called reference subbands; that is, N reference subbands are determined. After the N reference subbands are determined, the noise estimation result of the (m-1)-th audio frame may further be obtained, so that the noise estimation result of the current audio frame is determined in combination with that of the historical audio frame. In other words, noise estimation may be performed on the m-th audio frame based on the noise estimation result of the (m-1)-th audio frame and the noise estimation result of each of the N reference subbands, to determine the noise estimation result of the m-th audio frame.
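The screening step can be sketched in Python as follows. The representation of each subband as a `(low_hz, high_hz)` pair, the function name, and the rule that a subband counts as "within" the range only when it lies entirely inside it are assumptions made for illustration.

```python
def select_reference_subbands(subband_edges_hz, low_hz=100.0, high_hz=6000.0):
    """Return the indices of subbands lying inside [low_hz, high_hz].

    subband_edges_hz: list of (low_hz, high_hz) pairs, one per subband
                      (a hypothetical representation of the K subbands).
    """
    return [
        index
        for index, (lo, hi) in enumerate(subband_edges_hz)
        # Assumption: a subband qualifies only if it is fully contained.
        if lo >= low_hz and hi <= high_hz
    ]


edges = [(0.0, 100.0), (100.0, 500.0), (500.0, 6000.0), (6000.0, 8000.0)]
print(select_reference_subbands(edges))
```

The returned indices play the role of the N reference subbands; their first and last entries correspond to the bounds used when summing over the reference subbands.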

In one implementation, performing noise estimation on the m-th audio frame based on the noise estimation result of the (m-1)-th audio frame and the noise estimation result of each of the N reference subbands, to determine the noise estimation result of the m-th audio frame, may include the following steps S21 to S23:

S21: Determine the noise estimation result of each of the N reference subbands.

Since the noise estimation result is determined in the same way for every reference subband, the following takes the n-th of the N reference subbands as an example. In a specific implementation, for the n-th reference subband, the number of frequency bins in the n-th reference subband may first be obtained, and the noise estimation result of the n-th reference subband is determined based on that number and the per-subband noise estimation result determined above for the n-th reference subband. In one embodiment, the number of frequency bins in the n-th reference subband may be multiplied by the per-subband noise estimation result of the n-th reference subband, and the product is used as the noise estimation result of the n-th reference subband.

In this way, the noise estimation result of each of the N reference subbands can be determined.

S22: Determine an initial noise estimation result of the m-th audio frame based on the noise estimation results of the N reference subbands.

In one embodiment, the noise estimation results of the reference subbands may be summed, and the sum is used as the initial noise estimation result of the m-th audio frame.

S23: Perform noise estimation on the m-th audio frame based on the noise estimation result of the (m-1)-th audio frame and the initial noise estimation result of the m-th audio frame, and determine the noise estimation result of the m-th audio frame.

In one embodiment, the noise estimation result of the (m-1)-th audio frame and the initial noise estimation result of the m-th audio frame may be summed, and the sum is used as the noise estimation result of the m-th audio frame.

In another embodiment, a noise estimation smoothing factor may also be obtained, and the noise estimation result of the (m-1)-th audio frame and the initial noise estimation result of the m-th audio frame are smoothed based on that factor to obtain the noise estimation result of the m-th audio frame. In one embodiment, the noise estimation smoothing factor may be multiplied by the noise estimation result of the (m-1)-th audio frame, and the difference between 1 and the noise estimation smoothing factor may be multiplied by the initial noise estimation result of the m-th audio frame; the two products are then summed, and the sum is used as the noise estimation result of the m-th audio frame. In this implementation, the computer device may calculate the noise estimation result of the m-th audio frame by the following formula (8):

$$N(m)=\beta\,N(m-1)+(1-\beta)\sum_{k=K_0}^{K_1} L_k\,\hat{N}(m,k) \tag{8}$$

where $N(m)$ denotes the noise estimation result of the m-th audio frame, $N(m-1)$ denotes the noise estimation result of the (m-1)-th audio frame, and β (β<1) denotes the noise estimation smoothing factor. $L_k$ denotes the number of frequency bins in the k-th subband, and $\hat{N}(m,k)$ denotes the per-subband noise estimation result of the k-th subband of the m-th audio frame. $K_0$ and $K_1$ are the subband indices of the reference subbands within the preset subband range mentioned above (for example, the 100 Hz to 6 kHz spectral range): $K_0$ is the index of the 1st of the N reference subbands, and $K_1$ is the index of the N-th.
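Steps S21 to S23 with the smoothing factor can be sketched in Python as follows. The function and argument names are hypothetical, and the default value of `beta` is illustrative only; the text requires just that it be less than 1.

```python
def frame_noise_estimate(prev_frame_noise, subband_noise, bins_per_subband,
                         k0, k1, beta=0.9):
    """Frame-level noise estimate from the reference subbands k0..k1.

    subband_noise:    per-subband noise estimates of the current frame.
    bins_per_subband: number of frequency bins in each subband.
    """
    # S21 and S22: weight each reference subband by its number of
    # frequency bins, then sum to get the initial frame-level estimate.
    initial = sum(
        bins_per_subband[k] * subband_noise[k] for k in range(k0, k1 + 1)
    )
    # S23: smooth against the previous frame's estimate.
    return beta * prev_frame_noise + (1.0 - beta) * initial


noise = frame_noise_estimate(
    prev_frame_noise=10.0,
    subband_noise=[9.0, 1.0, 2.0, 9.0],
    bins_per_subband=[4, 2, 3, 4],
    k0=1, k1=2, beta=0.5,
)
print(noise)
```

Weighting by the number of frequency bins makes wide subbands contribute in proportion to the bandwidth they cover.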

S403: Receive the encoded signal sent by the speaking party, perform audio decoding on the encoded signal to obtain a decoded signal, and play the decoded signal.

For the specific implementation of this step, refer to the description of step S203 above; details are not repeated here.

In the embodiments of this application, the listening party may estimate its own background environmental noise and feed the noise estimation result back to the speaking party, so that the speaking party can adjust the default encoding parameters according to the fed-back noise estimation result. The encoding thereby adapts to changes in the listening party's background noise environment, so that the playback sound quality best matches the listening environment.

Refer to FIG. 5, which is a schematic flowchart of yet another audio processing method provided in an embodiment of this application. This embodiment mainly describes the specific implementation on the speaking party's side. The audio processing method described in this embodiment includes the following steps:

S501: Receive the noise estimation result sent by the listening party.

As described above, the noise estimation result is obtained by the listening party by acquiring a first audio signal of the listening party and performing noise estimation on that signal; the listening party and the speaking party have established a communication connection. The noise estimation result may be a specific estimated value, in which case it may also be called a noise estimation value. It can be understood that the noise estimation result can be used to characterize the intensity, or level, of the noise in the audio signal, so it may also be called a noise intensity value or a noise level value.

S502: Determine, based on the noise estimation result, the encoding parameters to be used for audio encoding.

The encoding parameters may include one or more of the encoding bit rate, the number of channels, the sampling rate, and the like.

It can be understood that the settings of the encoding parameters in audio encoding affect the audio quality of the codec output, where audio quality may refer to the playback sound quality. For example, the encoding bit rate affects the audio quality of the codec output as well as the transmission bandwidth and storage consumed by the encoded data. The encoding parameters can therefore be adjusted dynamically according to the actual service and the actual listening needs.

As described above, the ultimate beneficiary of an audio call is the listening party, and the listening party's audio experience depends not only on the sound quality of the playback signal (that is, the audio signal being played) but also strongly on the listening party's acoustic environment. If the listening party is in a relatively quiet environment, that is, its noise level is low, even a slight sound quality problem in the playback signal may be easily noticed and degrade the listening experience. In this case, to keep the listening party from perceiving sound quality problems as far as possible, the encoding parameters on the encoding side may be set so that the audio output by the codec has good playback sound quality. For example, one or more of the encoding bit rate, the sampling rate, and the number of channels may be set higher.

Conversely, if the listening party is in a relatively noisy environment, that is, its noise level is high, the background noise in that environment can mask flaws in the playback signal, so subtle differences in the sound quality of the playback signal are hard for the listening party to distinguish. In this case, differences in the encoding parameters have very little effect on the perceived sound quality, so the encoding parameters on the encoding side may be set to lower values. This reduces the data volume, transmission bandwidth, and storage space while affecting the sound quality as little as possible, which lowers the cost of audio transmission and storage and achieves the best balance between the playback sound quality experience and operating costs (such as audio transmission bandwidth and storage costs).

Accordingly, the basic principle for adjusting the encoding parameters based on the listening party's noise level may be: the higher the listening party's noise level, the further the encoding parameters on the encoding side may be adjusted in the direction of lower playback sound quality; conversely, the lower the listening party's noise level, the further the encoding parameters may be adjusted in the direction of higher playback sound quality.

In one embodiment, taking the encoding bit rate as the encoding parameter, the basic principle for adjusting the encoding bit rate based on the listening party's noise level may be: the higher the listening party's noise level, the further the encoding bit rate on the encoding side may be adjusted downward; conversely, the lower the listening party's noise level, the further it may be adjusted upward. The adjustment here generally means an upward or downward adjustment relative to the default encoding bit rate. In short, the higher the listening party's noise level, the lower the encoding bit rate may be, and the lower the noise level, the higher the encoding bit rate may be.

In one embodiment, taking the sampling rate as the encoding parameter, the basic principle for adjusting the sampling rate based on the listening party's noise level may be: the higher the listening party's noise level, the further the sampling rate on the encoding side may be adjusted downward; conversely, the lower the listening party's noise level, the further it may be adjusted upward. The adjustment here generally means an upward or downward adjustment relative to the default sampling rate. In short, the higher the listening party's noise level, the lower the sampling rate may be, and the lower the noise level, the higher the sampling rate may be.

In one embodiment, taking the number of channels as the encoding parameter, the basic principle for adjusting the number of channels based on the listening party's noise level may be: the higher the listening party's noise level, the further the number of channels on the encoding side may be adjusted downward; conversely, the lower the listening party's noise level, the further it may be adjusted upward. The adjustment here generally means an upward or downward adjustment relative to the default number of channels. In short, the higher the listening party's noise level, the smaller the number of channels may be, and the lower the noise level, the larger the number of channels may be.

In one implementation, a mapping relationship between reference noise estimation results and reference encoding parameters may be preset. The mapping relationship may record the best-matching pairs of noise estimation result and encoding parameter found through a large number of actual measurements. After receiving the noise estimation result sent by the listener, the speaker may obtain the mapping relationship and use it to determine the encoding parameter corresponding to the noise estimation result.

Optionally, where the encoding parameters include the encoding bit rate, determining the encoding bit rate used for audio encoding based on the noise estimation result may be implemented as follows.

In one embodiment, a mapping relationship between reference noise estimation results and reference encoding bit rates may be obtained; the encoding bit rate used for audio encoding may then be determined based on the noise estimation result and this mapping relationship.

For example, the mapping relationship may be represented by a two-dimensional mapping table whose entries are filled in from actual measurement results, as shown in Table 1:

Table 1

Reference encoding bit rate    Reference noise estimation value
A1                             B1
A2                             B2
A3                             B3

It should be noted that the values of B1 to B3 increase while those of A1 to A3 decrease: the noisier the listener's environment, the more the audio is encoded with parameters (here, the encoding bit rate) that lower the playback sound quality. With Table 1, once the noise estimation result of the m-th audio frame has been obtained, the corresponding encoding bit rate can be determined from the table. For example, if the noise estimation value of the m-th audio frame is B2, Table 1 gives the corresponding encoding bit rate A2.
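
The table lookup described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the numeric values standing in for A1 to A3 and B1 to B3 are placeholders, the name `lookup_bitrate` is hypothetical, and choosing the nearest reference entry for a noise estimate that is not listed in the table is an assumption that the source does not specify.

```python
# Hypothetical reference table: (reference noise estimate, reference bit rate in bit/s).
# The numbers are placeholders for B1..B3 and A1..A3, not values from the source.
BITRATE_TABLE = [
    (10.0, 64000),  # B1 -> A1
    (20.0, 32000),  # B2 -> A2
    (30.0, 16000),  # B3 -> A3
]


def lookup_bitrate(noise_estimate, table=BITRATE_TABLE):
    """Return the bit rate of the reference entry closest to noise_estimate.

    Nearest-entry matching is an assumption made for this sketch.
    """
    _, bitrate = min(table, key=lambda row: abs(row[0] - noise_estimate))
    return bitrate
```

A noise estimate equal to the placeholder for B2 returns the placeholder for A2, matching the example in the text.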

As another example, the mapping relationship may be represented by a mapping formula. In one embodiment, the mapping formula may be obtained by fitting the data in the two-dimensional mapping table above (such as Table 1). For example, the mapping formula may be as shown in formula (9):

Here, bitrate(m) denotes the encoding bit rate of the m-th audio frame and bitrate0 denotes the default encoding bit rate; a, b1, b2, and c are constants; the remaining variable in the formula is the noise estimation value of the m-th audio frame.

According to formula (9), once the noise estimation value of the m-th audio frame has been determined, it can be substituted into formula (9), and the result of the calculation can be used as the encoding bit rate.

To show the mapping between reference noise estimation results and reference encoding bit rates more intuitively, the mapping formula may be plotted as a curve, for example as shown in Figure 6, where the horizontal axis is the noise level value (noise estimation value) and the vertical axis is the encoding bit rate value.

Optionally, where the encoding parameters include the sampling rate, determining the sampling rate used for audio encoding based on the noise estimation result may be implemented as follows.

A mapping relationship between reference noise estimation results and reference sampling rates may be obtained; the sampling rate used for audio encoding may then be determined based on the noise estimation result and this mapping relationship.

For example, the mapping relationship may be represented by a two-dimensional mapping table whose entries are filled in from actual measurement results, as shown in Table 2:

Table 2

Reference sampling rate    Reference noise estimation result
D1                         C1~C2
D2                         C2~C3
D3                         C3~C4

Here, one range of reference noise estimation values corresponds to one sampling rate level. As shown in Table 2, reference noise estimation values between C1 and C2 correspond to the reference sampling rate D1. It should be noted that the values of C1 to C4 increase while those of D1 to D3 decrease: the noisier the listener's environment, the more the audio is encoded with parameters (here, the sampling rate) that lower the playback sound quality. With Table 2, once the noise estimation result of the m-th audio frame has been obtained, the corresponding sampling rate can be determined from the table. For example, if the noise estimation value of the m-th audio frame is C0 and C0 lies between C2 and C3, Table 2 gives the corresponding sampling rate D2.
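
The range-based lookup in Table 2 can be sketched with a sorted list of range boundaries. The boundary and sampling rate values below are placeholders for C1 to C4 and D1 to D3, and the function name is hypothetical. Two further assumptions are not stated in the source: a value that falls exactly on a boundary is assigned to the higher range, and values outside C1~C4 are clamped to the nearest range.

```python
from bisect import bisect_right

# Placeholder boundaries C1..C4 (ascending) and sampling rates D1..D3 (descending).
NOISE_BOUNDS = [10.0, 20.0, 30.0, 40.0]
SAMPLE_RATES = [48000, 32000, 16000]


def lookup_sample_rate(noise_estimate):
    """Map a noise estimate to the sampling rate of the range that contains it."""
    # Index of the range [C_i, C_{i+1}) that holds the estimate.
    idx = bisect_right(NOISE_BOUNDS, noise_estimate) - 1
    # Clamp estimates below C1 or above C4 to the first or last range (assumption).
    idx = max(0, min(idx, len(SAMPLE_RATES) - 1))
    return SAMPLE_RATES[idx]
```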

Optionally, where the encoding parameters include the number of channels, determining the number of channels used for audio encoding based on the noise estimation result may be implemented as follows.

A mapping relationship between reference noise estimation results and reference numbers of channels may be obtained; the number of channels used for audio encoding may then be determined based on the noise estimation result and this mapping relationship.

For example, the mapping relationship may be represented by a two-dimensional mapping table whose entries are filled in from actual measurement results, as shown in Table 3:

Table 3

Reference number of channels    Reference noise estimation result
Second number of channels       E≤θ
First number of channels        E>θ

Here, θ denotes a preset noise estimation result (preset noise estimation value), which may be set in advance and whose specific value is not limited; E denotes the reference noise estimation result. That is, the noise estimation result may be compared with the preset noise estimation result, and the final number of channels determined from the comparison. If the noise estimation result is less than or equal to the preset noise estimation result, the encoding parameter may be determined to be the second number of channels; if the noise estimation result is greater than the preset noise estimation result, the encoding parameter may be determined to be the first number of channels.

Based on this, determining the number of channels may also be described as follows:

First, the initial number of channels of the speaker may be obtained, and the noise estimation result compared with the preset noise estimation result. If the noise estimation result is less than or equal to the preset noise estimation result, the listener's noise is not high, so the audio should be encoded with parameters that improve the playback sound quality. In this case, if the initial number of channels is the first number of channels (mono), it may be adjusted to the second number of channels (stereo); if the initial number of channels is already the second number of channels, no adjustment is needed. If the noise estimation result is greater than the preset noise estimation result, the listener's noise is high, so there is no need to encode the audio with parameters that improve the playback sound quality. In this case, if the initial number of channels is the second number of channels, it may be adjusted to the first number of channels; if the initial number of channels is already the first number of channels, no adjustment is needed.
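
The threshold comparison above reduces to a few lines. The sketch below is illustrative only: the function name and the returned `changed` flag are hypothetical, and the threshold value passed in stands for θ, whose value the source leaves open.

```python
MONO, STEREO = 1, 2  # first and second number of channels, per the text above


def select_channels(noise_estimate, threshold, current_channels):
    """Return (target_channels, changed) for the given noise estimate.

    Quiet listener (estimate <= threshold) -> stereo; noisy listener -> mono.
    """
    target = STEREO if noise_estimate <= threshold else MONO
    # An adjustment is needed only when the current setting differs from the target.
    changed = target != current_channels
    return target, changed
```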

Optionally, where the encoding parameters include more than one of the encoding bit rate, the sampling rate, and the number of channels, in one embodiment multiple mapping relationships may be obtained after the noise estimation result is obtained, and the corresponding encoding parameters determined from the noise estimation result and these mapping relationships. For example, if the encoding parameters include the encoding bit rate, the sampling rate, and the number of channels, three mapping relationships may be obtained: between reference noise estimation results and reference encoding bit rates, between reference noise estimation results and reference sampling rates, and between reference noise estimation results and reference numbers of channels. The encoding bit rate, the sampling rate, and the number of channels may then each be determined from the noise estimation result and the respective mapping relationship.

In another embodiment, after the noise estimation result is obtained, a single reference mapping relationship may be obtained and used together with the noise estimation result to determine the corresponding encoding parameters. The reference mapping relationship contains the mapping between reference noise estimation results and reference encoding parameters. For example, if the encoding parameters include the encoding bit rate, the sampling rate, and the number of channels, the reference encoding parameters in the reference mapping relationship may include the reference encoding bit rate, the reference sampling rate, and the reference number of channels. In this case, a single mapping relationship suffices to determine the corresponding encoding parameters.

In one implementation, different mapping relationships between reference noise estimation results and reference encoding parameters may be configured for different service scenarios, that is, each service scenario has its own mapping relationship. The division into service scenarios is not specifically limited. For example, service scenarios may be divided by communication mode into telephone call scenarios, audio and video call scenarios, audio and video conference scenarios, karaoke scenarios, live streaming scenarios, game scenarios, and so on.

Based on this, in a specific implementation, the service scenario to which the established communication connection belongs may be obtained first. Once the service scenario is determined, the mapping relationship between reference noise estimation results and reference encoding parameters adapted to that scenario may be obtained, and the encoding parameters used for audio encoding may be determined based on the noise estimation result and that mapping relationship.
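
Selecting a scenario-specific mapping before the lookup can be sketched as a dictionary keyed by scenario. The scenario names, table values, and function name below are hypothetical placeholders, and nearest-entry matching is again an assumption rather than something the source prescribes.

```python
# Hypothetical per-scenario tables of (reference noise estimate, reference bit rate).
SCENARIO_TABLES = {
    "voice_call": [(10.0, 24000), (20.0, 16000), (30.0, 8000)],
    "live_stream": [(10.0, 128000), (20.0, 96000), (30.0, 64000)],
}


def bitrate_for(scenario, noise_estimate):
    """Pick the table for the scenario, then the entry nearest the noise estimate."""
    table = SCENARIO_TABLES[scenario]
    _, bitrate = min(table, key=lambda row: abs(row[0] - noise_estimate))
    return bitrate
```

The same noise estimate therefore yields different encoding bit rates depending on the scenario of the connection.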

The encoding parameters here may include one or more of the encoding bit rate, the sampling rate, and the number of channels; for each encoding parameter, only the corresponding mapping relationship needs to be obtained.

In one implementation, different mapping relationships between reference noise estimation results and reference encoding parameters may be configured for different audio playback types, that is, each audio playback type has its own mapping relationship. Audio playback types may be divided into a type in which the speaker's audio signal is speech (the speech playback type) and a type in which the speaker's audio signal is music (the music playback type). The default encoding parameters set for audio encoding differ between audio playback types: the encoding parameters required for the music playback type usually give a better encoding result than those required for the speech playback type. The reference noise estimation results and reference encoding parameters in the mapping relationship may therefore differ between audio playback types, and different mapping relationships may be set for different audio playback types in order to best match the playback sound quality to the listening environment.

Based on this, in a specific implementation, after the second audio signal of the speaker is obtained, the audio playback type of the second audio signal may also be determined, so as to obtain the mapping relationship between reference noise estimation results and reference encoding parameters adapted to that audio playback type; the encoding parameters used for audio encoding are then determined based on the noise estimation result and that mapping relationship.

In one implementation, different mapping relationships between reference noise estimation results and reference encoding parameters may be configured for different audio timbre types, that is, each audio timbre type has its own mapping relationship. Audio timbre types may be divided into a type in which the speaker's audio signal is a male voice (the male timbre type) and a type in which the speaker's audio signal is a female voice (the female timbre type). Based on this, in a specific implementation, after the second audio signal of the speaker is obtained, the audio timbre type of the second audio signal may also be determined, so as to obtain the mapping relationship between reference noise estimation results and reference encoding parameters adapted to that audio timbre type; the encoding parameters used for audio encoding are then determined based on the noise estimation result and that mapping relationship.

In one implementation, as described above, the encoding parameters adjusted using the noise estimation result may include one or more of the encoding bit rate, the sampling rate, and the number of channels. In practice, the encoding parameters that actually need to be adjusted may also be determined from the service requirement, for example adjusting only the encoding bit rate or only the sampling rate.

Optionally, the service requirement may be obtained, and the encoding parameter to be adjusted determined from it. For example, if the service requirement is a first service requirement, namely balancing transmission bandwidth against playback sound quality, the encoding parameter may be determined to be the encoding bit rate. As another example, if the service requirement is a second service requirement, namely balancing the amount of sampled signal data against playback sound quality, the encoding parameter may be determined to be the sampling rate. As a further example, if the service requirement is a third service requirement, namely balancing signal acquisition complexity against playback sound quality, the encoding parameter may be determined to be the number of channels. The service requirement may be specified by the relevant developers when the default encoding parameters are set, or may be set in other ways, which is not specifically limited.

It should be noted that the mapping relationships between reference noise estimation results and reference encoding parameters mentioned above may be obtained from prior measurement results. The following describes how the mapping between one reference noise estimation result and one reference encoding bit rate is determined. First, for one reference noise estimation result (that is, one reference noise estimation value), L candidate encoding bit rates may be configured. In one embodiment, the L candidate encoding bit rates may be set according to the magnitude of the reference noise estimation value: the larger the reference noise estimation value, the smaller the candidate encoding bit rates may be, and the smaller the reference noise estimation value, the larger the candidate encoding bit rates may be.

Then, each of the L candidate encoding bit rates may be used to encode and decode the same audio signal, yielding L decoded signals, one per candidate encoding bit rate. Finally, the L decoded signals may be played back and the listening quality evaluated subjectively, and the reference encoding bit rate corresponding to the reference noise estimation result may be chosen from the L candidates based on the listening quality. For example, the candidate encoding bit rate used by the decoded signal with the best listening quality may be chosen as the reference encoding bit rate. Alternatively, if several decoded signals have good listening quality and differ only slightly, the smallest candidate encoding bit rate among those used by these decoded signals may be chosen as the reference encoding bit rate, so that subsequent audio encoding uses as low an encoding bit rate as possible while preserving listening quality, thereby reducing bandwidth and storage costs.
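
The selection rule at the end of this calibration procedure can be sketched as follows. The subjective scores are assumed here to be numeric ratings supplied by listeners (higher is better), and the `tolerance` that defines "differ only slightly" is a hypothetical parameter; the source specifies neither a rating scale nor a tolerance.

```python
def pick_reference_bitrate(scores, tolerance=0.2):
    """Choose the reference bit rate from subjective listening scores.

    scores: dict mapping candidate bit rate -> subjective score (higher is better).
    Among candidates whose score is within `tolerance` of the best score,
    the smallest bit rate is returned to save bandwidth and storage.
    """
    best = max(scores.values())
    near_best = [rate for rate, score in scores.items() if best - score <= tolerance]
    return min(near_best)
```

With `tolerance=0`, the rule degenerates to picking the candidate with the best listening quality, which is the first option described above (ties go to the smaller bit rate).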

S503: If a second audio signal of the speaker is obtained, perform audio encoding on the second audio signal using the encoding parameters to obtain an encoded signal corresponding to the second audio signal, and send the encoded signal to the listener.

For the specific implementation of this step, refer to the description of step S206 above, which is not repeated here.

As described above, while the communication connection between the listener and the speaker is established, the listener may perform noise estimation periodically and send the resulting noise estimation results to the speaker. The speaker may then dynamically adjust the encoding parameters used for audio encoding based on these periodic noise estimation results, so as to adapt to changes in the listener's background noise and best match the playback sound quality to the listening environment.

It should be noted that default audio encoding parameters (such as the encoding bit rate) are configured at the start of communication. That is, when the user starts speaking through the speaker, the acquired audio signal is first encoded with the default encoding parameters. When an audio signal is acquired again, it may first be determined whether an updated value exists for the default encoding parameters; if so, the updated value may be used to encode the currently acquired audio signal. Likewise, subsequently acquired audio signals may always be encoded with the latest encoding parameters.
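
The "default first, latest update afterwards" behaviour can be sketched as a small state holder on the speaker side. The class and method names are hypothetical, and only the encoding bit rate is tracked for brevity; the source does not describe any particular data structure.

```python
class EncoderConfig:
    """Holds the default encoding bit rate and the most recent update, if any."""

    def __init__(self, default_bitrate):
        self.default_bitrate = default_bitrate
        self._update = None  # no noise feedback received yet

    def on_noise_feedback(self, new_bitrate):
        # Called whenever a new bit rate has been derived from listener feedback.
        self._update = new_bitrate

    def current_bitrate(self):
        # Use the latest update when one exists, otherwise the default.
        return self.default_bitrate if self._update is None else self._update
```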

In the embodiments of the present application, the speaker can adjust the default encoding parameters according to the noise estimation result fed back by the listener, so as to adapt to changes in the listener's background noise and best match the playback sound quality to the listening environment.

Referring to FIG. 7, which is a schematic structural diagram of an audio processing apparatus provided in an embodiment of the present application, the audio processing apparatus described in this embodiment includes:

an acquisition unit 701, configured to acquire a first audio signal of a listener, the listener having established a communication connection with a speaker;

an estimation unit 702, configured to perform noise estimation on the first audio signal, determine a noise estimation result of the first audio signal, and send the noise estimation result of the first audio signal to the speaker, the noise estimation result indicating the intensity of noise in the first audio signal;

a decoding unit 703, configured to receive an encoded signal sent by the speaker, perform audio decoding on the encoded signal to obtain a decoded signal, and play back the decoded signal, the encoded signal being obtained by the speaker, after acquiring a second audio signal of the speaker, by performing audio encoding on the second audio signal using encoding parameters determined from the noise estimation result.

In one implementation, the estimation unit 702 is specifically configured to:

divide the first audio signal into frames to obtain M audio frames;

traverse the M audio frames and, on reaching the m-th of the M audio frames, perform a Fourier transform on the m-th audio frame to obtain the power spectrum of the m-th audio frame;

perform noise estimation on the m-th audio frame based on the power spectrum of the m-th audio frame, determine the noise estimation result of the m-th audio frame, and use the noise estimation result of the m-th audio frame as the noise estimation result of the first audio signal.
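
Framing and the per-frame power spectrum can be sketched as below. This is a simplified illustration: frames are taken without overlap or windowing, a direct discrete Fourier transform from the standard library is used instead of a fast Fourier transform, and the function names are hypothetical. Whether the source uses overlapping or windowed frames is not stated in this section.

```python
import cmath


def split_frames(signal, frame_len):
    """Split a signal into non-overlapping frames, dropping a trailing partial frame."""
    count = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(count)]


def power_spectrum(frame):
    """Return |X[k]|^2 for the non-negative frequency bins of one frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        # Direct DFT of bin k; adequate for a sketch, slow for long frames.
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum
```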

In one implementation, the estimation unit 702 is specifically configured to:

divide the power spectrum of the m-th audio frame into K sub-bands, and determine the sub-band power spectrum of each of the K sub-bands;

perform noise estimation on each sub-band based on its sub-band power spectrum, and determine the noise estimation result of that sub-band;

perform noise estimation on the m-th audio frame based on the noise estimation results of the sub-bands contained in the m-th audio frame, and determine the noise estimation result of the m-th audio frame.
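
A rough sketch of the sub-band split and the frame-level aggregation follows. Several details are assumptions made for illustration only: the bins are split into K contiguous groups of near-equal size, the sub-band power is the sum of its bins, and the frame-level noise estimate is the mean of the sub-band noise estimates. This section of the source does not state how the sub-bands are laid out or how they are combined.

```python
def subband_powers(power_spec, num_bands):
    """Sum the power-spectrum bins of each of num_bands contiguous sub-bands."""
    n = len(power_spec)
    # Band edges chosen so that the bands have near-equal width (assumption).
    edges = [round(i * n / num_bands) for i in range(num_bands + 1)]
    return [sum(power_spec[edges[i]:edges[i + 1]]) for i in range(num_bands)]


def frame_noise(subband_noise):
    """Combine per-sub-band noise estimates into one frame-level value (mean, assumed)."""
    return sum(subband_noise) / len(subband_noise)
```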

In one implementation, the estimation unit 702 is specifically configured to:

for the k-th of the K sub-bands, perform time-frequency smoothing on the sub-band power spectrum of the k-th sub-band to obtain the smoothed sub-band power spectrum of the k-th sub-band;

obtain a preset period, and determine the minimum sub-band power spectrum of the k-th sub-band for the m-th audio frame based on the smoothed sub-band power spectrum of the k-th sub-band and the preset period;

determine, based on the smoothed sub-band power spectrum and the minimum sub-band power spectrum, the speech probability that speech is present in the k-th sub-band of the m-th audio frame;

determine the noise estimation result of the k-th sub-band of the m-th audio frame based on the speech probability, the sub-band power spectrum of the k-th sub-band of the m-th audio frame, and the noise estimation result of the k-th sub-band of the (m-1)-th audio frame.
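
The last step combines the three listed inputs. One common way to do so, used in minima-controlled recursive averaging style estimators, is a recursive average whose smoothing factor grows with the speech probability, so that the noise estimate is frozen while speech is present. The sketch below follows that pattern as an assumption; the function name and the constant `alpha` are hypothetical, and the exact update rule of the source is not reproduced in this section.

```python
def update_noise(prev_noise, subband_power, speech_prob, alpha=0.95):
    """Recursive noise update for one sub-band.

    prev_noise:    noise estimate of this sub-band in frame m-1
    subband_power: sub-band power of this sub-band in frame m
    speech_prob:   probability in [0, 1] that speech is present
    """
    # The effective smoothing factor approaches 1 when speech is likely,
    # which keeps speech energy from leaking into the noise estimate.
    alpha_eff = alpha + (1.0 - alpha) * speech_prob
    return alpha_eff * prev_noise + (1.0 - alpha_eff) * subband_power
```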

In one implementation, the estimation unit 702 is specifically configured to:

obtain the sub-band power spectrum of each of the r sub-bands adjacent to the k-th sub-band;

perform frequency-domain smoothing on the sub-band power spectrum of the k-th sub-band based on the sub-band power spectra of the adjacent sub-bands, to obtain the frequency-domain smoothed sub-band power spectrum of the k-th sub-band;

obtain the smoothed sub-band power spectrum of the k-th sub-band of the (m-1)-th audio frame;

perform time-domain smoothing based on the smoothed sub-band power spectrum of the k-th sub-band of the (m-1)-th audio frame and the frequency-domain smoothed sub-band power spectrum of the k-th sub-band, to obtain the time-domain smoothed sub-band power spectrum of the k-th sub-band, and use it as the smoothed sub-band power spectrum of the k-th sub-band.
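
The two smoothing stages can be sketched as follows. The choices here are assumptions for illustration: the frequency smoothing is an unweighted average over the k-th sub-band and up to `r` neighbours on each side (the source may use a weighted window and may count neighbours differently), the time smoothing is a first-order recursion, and the constant `beta` and the function names are hypothetical.

```python
def smooth_frequency(powers, k, r=1):
    """Average sub-band k with its neighbours within distance r (clipped at the edges)."""
    lo, hi = max(0, k - r), min(len(powers), k + r + 1)
    window = powers[lo:hi]
    return sum(window) / len(window)


def smooth_time(prev_smoothed, freq_smoothed, beta=0.8):
    """Blend the previous frame's smoothed value with the current frequency-smoothed one."""
    return beta * prev_smoothed + (1.0 - beta) * freq_smoothed
```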

In one implementation, the estimating unit 702 is specifically configured to:

take the remainder of k divided by the preset period T, and determine the result;

if the result equals a specified value, obtain the minimum subband power spectrum of the previous period, and use the smaller of the minimum subband power spectrum of the previous period and the smoothed subband power spectrum of the kth subband of the mth audio frame as the minimum subband power spectrum of the kth subband for the mth audio frame;

if the result does not equal the specified value, obtain the minimum subband power spectrum of the kth subband for the (m-1)th audio frame, and use the smaller of the smoothed subband power spectrum of the kth subband and the minimum subband power spectrum of the kth subband for the (m-1)th audio frame as the minimum subband power spectrum of the kth subband for the mth audio frame.
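The periodic minimum tracking described above can be sketched as a small state holder per subband. The text takes the remainder of k and T; this sketch assumes that the value being reduced modulo T is a counter that advances once per frame, and it keeps a second running minimum (`period_min`) to stand in for "the minimum of the previous period". The class name, that auxiliary variable, and the reset value 0 are all assumptions.

```python
class MinimumTracker:
    """Tracks the minimum smoothed power of one subband over a period T."""

    def __init__(self, period, reset_value=0):
        self.period = period            # preset period T, in frames
        self.reset_value = reset_value  # the "specified value" of the remainder
        self.current_min = None         # minimum reported for the latest frame
        self.period_min = None          # minimum collected in the running period

    def update(self, frame_index, smoothed_power):
        if self.current_min is None:
            # First frame: nothing to compare against yet.
            self.current_min = self.period_min = smoothed_power
        elif frame_index % self.period == self.reset_value:
            # Period boundary: restart from the minimum collected during
            # the period that just ended, then begin a new collection.
            self.current_min = min(self.period_min, smoothed_power)
            self.period_min = smoothed_power
        else:
            # Inside a period: compare with the previous frame's minimum.
            self.current_min = min(self.current_min, smoothed_power)
            self.period_min = min(self.period_min, smoothed_power)
        return self.current_min
```

Restarting at each period boundary lets the tracked minimum rise again after the noise floor increases, instead of staying at an old, lower value indefinitely.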

In one implementation, the estimating unit 702 is specifically configured to:

determine a signal-to-noise ratio based on the smoothed subband power spectrum and the minimum subband power spectrum;

compare the signal-to-noise ratio with a threshold, and determine, based on the comparison result, an initial speech probability that speech is present in the kth subband of the mth audio frame;

obtain the speech probability that speech is present in the kth subband of the (m-1)th audio frame, and determine the speech probability that speech is present in the kth subband of the mth audio frame based on the speech probability of the kth subband of the (m-1)th audio frame and the initial speech probability.

In one implementation, the estimating unit 702 is specifically configured to:

if the comparison result is that the signal-to-noise ratio is greater than the threshold, set the initial speech probability that speech is present in the kth subband of the mth audio frame to a first probability;

if the comparison result is that the signal-to-noise ratio is less than or equal to the threshold, set the initial speech probability that speech is present in the kth subband of the mth audio frame to a second probability.
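The threshold decision and the recursion over frames can be sketched together. The sketch assumes that the signal-to-noise ratio is the ratio of the smoothed power to the tracked minimum, that the first and second probabilities are 1 and 0, and that the two probabilities are blended with a fixed factor `alpha_p`. None of these values is stated in this passage, so all of them are placeholders.

```python
def speech_probability(smoothed, minimum, prev_prob,
                       threshold=5.0, alpha_p=0.2,
                       first_prob=1.0, second_prob=0.0):
    """Speech presence probability of one subband in the current frame.

    smoothed:  smoothed subband power spectrum (frame m)
    minimum:   minimum subband power spectrum (frame m)
    prev_prob: speech probability of the same subband in frame m-1
    """
    # Guard against division by zero for silent input.
    ratio = smoothed / max(minimum, 1e-12)
    # Hard decision: first probability above the threshold, else second.
    initial = first_prob if ratio > threshold else second_prob
    # Blend with the previous frame so the probability changes gradually.
    return alpha_p * prev_prob + (1.0 - alpha_p) * initial
```

Blending with the previous frame avoids a probability that flips between the two values on consecutive frames when the ratio hovers around the threshold.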

In one implementation, the estimating unit 702 is specifically configured to:

obtain a preset subband range, and determine, from the K subbands included in the mth audio frame, N reference subbands whose frequency range lies within the preset subband range;

obtain a noise estimation result of the (m-1)th audio frame, perform noise estimation on the mth audio frame based on the noise estimation result of the (m-1)th audio frame and the noise estimation result of each of the N reference subbands, and determine a noise estimation result of the mth audio frame.

In one implementation, the estimating unit 702 is specifically configured to:

for an nth reference subband among the N reference subbands, obtain the number of frequency bins in the nth reference subband, and determine a noise estimation result of the nth reference subband based on the number of frequency bins in the nth reference subband and the noise estimation result corresponding to the nth reference subband;

determine an initial noise estimation result of the mth audio frame based on the noise estimation result of each of the N reference subbands;

perform noise estimation on the mth audio frame based on the noise estimation result of the (m-1)th audio frame and the initial noise estimation result of the mth audio frame, and determine the noise estimation result of the mth audio frame.
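The frame-level estimate can be sketched in three steps: pick the reference subbands, combine their noise estimates weighted by bin count, and smooth the result against the previous frame. The band representation as `(low_hz, high_hz)` pairs, the bin-count weighting, and the factor `beta` are assumptions made for this sketch.

```python
def select_reference_bands(band_ranges_hz, low_hz, high_hz):
    """Indices of subbands whose frequency range lies inside [low_hz, high_hz]."""
    return [i for i, (lo, hi) in enumerate(band_ranges_hz)
            if lo >= low_hz and hi <= high_hz]


def frame_noise(prev_frame_noise, ref_noise, ref_bins, beta=0.9):
    """Noise estimate of frame m from N reference subbands.

    prev_frame_noise: noise estimate of frame m-1
    ref_noise:        per-subband noise estimates (mean power per bin)
    ref_bins:         number of frequency bins in each reference subband
    """
    # Scale each subband's per-bin estimate by its bin count so that
    # wide subbands contribute in proportion to their width.
    band_totals = [noise * bins for noise, bins in zip(ref_noise, ref_bins)]
    initial = sum(band_totals) / sum(ref_bins)
    # Smooth against the previous frame's estimate.
    return beta * prev_frame_noise + (1.0 - beta) * initial
```

For example, with subbands at 0-500, 500-1000, 1000-4000 and 4000-8000 Hz and a preset range of 300-4000 Hz, the second and third subbands would be used as reference subbands.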

In one implementation, the estimating unit 702 is specifically configured to:

for a kth subband among the K subbands, obtain the number of frequency bins in the kth subband;

determine the subband power spectrum of each of the K subbands based on the number of frequency bins in the kth subband and the frequency-domain value of each frequency bin in the kth subband.
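A minimal sketch of the subband power computation follows, assuming that the subband power spectrum is the mean squared magnitude of the frequency-domain values in the subband. The half-open `(start_bin, end_bin)` band layout is a hypothetical representation.

```python
def subband_power_spectra(spectrum, band_edges):
    """Mean power per bin for each subband.

    spectrum:   complex frequency-domain values of one frame
    band_edges: list of (start_bin, end_bin) pairs, end exclusive
    """
    powers = []
    for start, end in band_edges:
        bin_count = end - start
        energy = sum(abs(value) ** 2 for value in spectrum[start:end])
        powers.append(energy / bin_count)
    return powers
```

Dividing by the bin count makes subbands of different widths comparable, which matters when the subbands are wider at high frequencies.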

Please refer to FIG. 8, which is a schematic structural diagram of another audio processing apparatus provided in an embodiment of the present application. The audio processing apparatus described in this embodiment includes:

a receiving unit 801, configured to receive a noise estimation result sent by a listener, where the noise estimation result is obtained by the listener performing noise estimation on a first audio signal of the listener after acquiring the first audio signal, and the listener has established a communication connection with a speaker;

a determining unit 802, configured to determine, based on the noise estimation result, encoding parameters for audio encoding;

an encoding unit 803, configured to, if a second audio signal of the speaker is acquired, perform audio encoding on the second audio signal using the encoding parameters to obtain an encoded signal corresponding to the second audio signal, and send the encoded signal to the listener.

In one implementation, the encoding parameters include one or more of an encoding bit rate and a sampling rate; the determining unit 802 is specifically configured to:

obtain a mapping relationship between reference noise estimation results and reference encoding parameters;

determine the encoding parameters for audio encoding based on the noise estimation result and the mapping relationship between the reference noise estimation results and the reference encoding parameters.
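The mapping lookup can be sketched as a small table scanned in order. Every number in the table, the use of decibels, and the direction of the mapping (lower bit rate and sampling rate at higher noise levels) are placeholders chosen for illustration; this passage does not specify them.

```python
# Hypothetical mapping: (upper bound of noise level in dB,
#                        bit rate in bit/s, sampling rate in Hz)
REFERENCE_MAPPING = [
    (40.0, 32000, 48000),
    (60.0, 24000, 32000),
    (float("inf"), 16000, 16000),
]


def select_encoding_params(noise_level_db, mapping=REFERENCE_MAPPING):
    """Return the encoding parameters of the first row that covers the level."""
    for upper_bound, bit_rate, sample_rate in mapping:
        if noise_level_db <= upper_bound:
            return {"bit_rate": bit_rate, "sample_rate": sample_rate}
    raise ValueError("mapping does not cover the given noise level")
```

A table keeps the policy separate from the encoder, so the rows can be tuned without touching the encoding code.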

In one implementation, the encoding parameters include a number of channels; the determining unit 802 is specifically configured to:

compare the noise estimation result with a preset noise estimation result;

if the comparison result is that the noise estimation result is greater than the preset noise estimation result, set the encoding parameter to a first number of channels;

if the comparison result is that the noise estimation result is less than or equal to the preset noise estimation result, set the encoding parameter to a second number of channels.
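The channel decision reduces to a single comparison. In the sketch below, the first number of channels is assumed to be 1 (mono) and the second 2 (stereo); these defaults are hypothetical and are not stated in this passage.

```python
def select_channel_count(noise_level, preset_level,
                         first_count=1, second_count=2):
    """Pick the number of channels from the listener's noise level."""
    # Strictly greater than the preset selects the first count;
    # less than or equal selects the second.
    return first_count if noise_level > preset_level else second_count
```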

Please refer to FIG. 9, which is a schematic structural diagram of a computer device provided in an embodiment of the present application. The computer device may be the speaker and/or the listener described above, or may perform some or all of the steps performed by the speaker and/or the listener. The computer device described in this embodiment includes a processor 901, a memory 902, and a network interface 903. The processor 901, the memory 902, and the network interface 903 can exchange data with one another.

The processor 901 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory 902 may include a read-only memory and a random access memory, and provides program instructions and data to the processor 901. A portion of the memory 902 may also include a non-volatile random access memory.

Optionally, in some embodiments, the computer device may be the listener, or may perform some or all of the steps performed by the listener. For example, when calling the program instructions, the processor 901 is configured to:

acquire a first audio signal of the listener, where the listener has established a communication connection with a speaker;

perform noise estimation on the first audio signal, determine a noise estimation result of the first audio signal, and send the noise estimation result of the first audio signal to the speaker, where the noise estimation result indicates the intensity of noise in the first audio signal;

receive an encoded signal sent by the speaker, perform audio decoding on the encoded signal to obtain a decoded signal, and play the decoded signal, where the encoded signal is obtained by the speaker, after acquiring a second audio signal of the speaker, performing audio encoding on the second audio signal using encoding parameters determined from the noise estimation result.
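The listener-side steps above can be sketched as one function that wires six operations together. All six callables are hypothetical stand-ins supplied by the caller; the sketch only shows the order of the steps, not how capture, transport, or decoding are implemented.

```python
def listener_step(capture, estimate_noise, send_to_speaker,
                  receive_from_speaker, decode, play):
    """Run one round of the listener-side flow and return the noise level."""
    first_signal = capture()                    # listener's own audio
    noise_level = estimate_noise(first_signal)  # noise intensity
    send_to_speaker(noise_level)                # feedback to the speaker
    encoded = receive_from_speaker()            # encoded with adapted parameters
    decoded = decode(encoded)
    play(decoded)
    return noise_level
```

Passing the operations in as arguments keeps the flow testable without audio hardware or a network connection.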

In one implementation, the processor 901 is specifically configured to:

divide the first audio signal into frames to obtain M audio frames;

traverse the M audio frames, and on reaching the mth audio frame among the M audio frames, perform a Fourier transform on the mth audio frame to obtain a power spectrum of the mth audio frame;

perform noise estimation on the mth audio frame based on the power spectrum of the mth audio frame, determine a noise estimation result of the mth audio frame, and use the noise estimation result of the mth audio frame as the noise estimation result of the first audio signal.
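Framing and the per-frame power spectrum can be sketched with the standard library alone. The frame length, the hop size, the absence of a window function, and the direct (non-fast) Fourier transform are simplifications assumed here for readability; a practical implementation would likely use an FFT routine.

```python
import cmath


def split_frames(signal, frame_len, hop):
    """Split a signal into frames of frame_len samples, hop samples apart."""
    last_start = len(signal) - frame_len
    return [signal[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def power_spectrum(frame):
    """Squared magnitude of the DFT for bins 0 .. n/2 of a real frame."""
    n = len(frame)
    spectrum = []
    for f in range(n // 2 + 1):
        value = sum(frame[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t in range(n))
        spectrum.append(abs(value) ** 2)
    return spectrum
```

Each frame's power spectrum would then feed the subband-level estimation, and the estimate of the most recently processed frame serves as the estimate for the signal.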

In one implementation, the processor 901 is specifically configured to:

divide the power spectrum of the mth audio frame into K subbands, and determine the subband power spectrum of each of the K subbands;

perform noise estimation on each subband based on its subband power spectrum, and determine a noise estimation result of that subband;

perform noise estimation on the mth audio frame based on the noise estimation results of the subbands included in the mth audio frame, and determine the noise estimation result of the mth audio frame.

In one implementation, the processor 901 is specifically configured to:

for a kth subband among the K subbands, perform time-frequency smoothing on the subband power spectrum of the kth subband to obtain a smoothed subband power spectrum of the kth subband;

obtain a preset period, and determine, based on the smoothed subband power spectrum of the kth subband and the preset period, a minimum subband power spectrum of the kth subband for the mth audio frame;

determine, based on the smoothed subband power spectrum and the minimum subband power spectrum, a speech probability that speech is present in the kth subband of the mth audio frame;

determine a noise estimation result of the kth subband of the mth audio frame based on the speech probability, the subband power spectrum of the kth subband of the mth audio frame, and the noise estimation result of the kth subband of the (m-1)th audio frame.

In one implementation, the processor 901 is specifically configured to:

obtain the subband power spectrum of each of r adjacent subbands adjacent to the kth subband;

perform frequency-domain smoothing on the subband power spectrum of the kth subband based on the subband power spectra of the adjacent subbands, to obtain a frequency-domain smoothed subband power spectrum of the kth subband;

obtain the smoothed subband power spectrum of the kth subband of the (m-1)th audio frame;

perform time-domain smoothing based on the smoothed subband power spectrum of the kth subband of the (m-1)th audio frame and the frequency-domain smoothed subband power spectrum of the kth subband, to obtain a time-domain smoothed subband power spectrum of the kth subband, and use the time-domain smoothed subband power spectrum as the smoothed subband power spectrum of the kth subband.

In one implementation, the processor 901 is specifically configured to:

take the remainder of k divided by the preset period T, and determine the result;

if the result equals a specified value, obtain the minimum subband power spectrum of the previous period, and use the smaller of the minimum subband power spectrum of the previous period and the smoothed subband power spectrum of the kth subband of the mth audio frame as the minimum subband power spectrum of the kth subband for the mth audio frame;

if the result does not equal the specified value, obtain the minimum subband power spectrum of the kth subband for the (m-1)th audio frame, and use the smaller of the smoothed subband power spectrum of the kth subband and the minimum subband power spectrum of the kth subband for the (m-1)th audio frame as the minimum subband power spectrum of the kth subband for the mth audio frame.

In one implementation, the processor 901 is specifically configured to:

determine a signal-to-noise ratio based on the smoothed subband power spectrum and the minimum subband power spectrum;

compare the signal-to-noise ratio with a threshold, and determine, based on the comparison result, an initial speech probability that speech is present in the kth subband of the mth audio frame;

obtain the speech probability that speech is present in the kth subband of the (m-1)th audio frame, and determine the speech probability that speech is present in the kth subband of the mth audio frame based on the speech probability of the kth subband of the (m-1)th audio frame and the initial speech probability.

In one implementation, the processor 901 is specifically configured to:

if the comparison result is that the signal-to-noise ratio is greater than the threshold, set the initial speech probability that speech is present in the kth subband of the mth audio frame to a first probability;

if the comparison result is that the signal-to-noise ratio is less than or equal to the threshold, set the initial speech probability that speech is present in the kth subband of the mth audio frame to a second probability.

In one implementation, the processor 901 is specifically configured to:

obtain a preset subband range, and determine, from the K subbands included in the mth audio frame, N reference subbands whose frequency range lies within the preset subband range;

obtain a noise estimation result of the (m-1)th audio frame, perform noise estimation on the mth audio frame based on the noise estimation result of the (m-1)th audio frame and the noise estimation result of each of the N reference subbands, and determine a noise estimation result of the mth audio frame.

In one implementation, the processor 901 is specifically configured to:

for an nth reference subband among the N reference subbands, obtain the number of frequency bins in the nth reference subband, and determine a noise estimation result of the nth reference subband based on the number of frequency bins in the nth reference subband and the noise estimation result corresponding to the nth reference subband;

determine an initial noise estimation result of the mth audio frame based on the noise estimation result of each of the N reference subbands;

perform noise estimation on the mth audio frame based on the noise estimation result of the (m-1)th audio frame and the initial noise estimation result of the mth audio frame, and determine the noise estimation result of the mth audio frame.

In one implementation, the processor 901 is specifically configured to:

for a kth subband among the K subbands, obtain the number of frequency bins in the kth subband;

determine the subband power spectrum of each of the K subbands based on the number of frequency bins in the kth subband and the frequency-domain value of each frequency bin in the kth subband.

Optionally, in some embodiments, the computer device may be the speaker, or may perform some or all of the steps performed by the speaker. For example, when calling the program instructions, the processor 901 is configured to:

receive a noise estimation result sent by a listener, where the noise estimation result is obtained by the listener performing noise estimation on a first audio signal of the listener after acquiring the first audio signal, and the listener has established a communication connection with the speaker;

determine, based on the noise estimation result, encoding parameters for audio encoding;

if a second audio signal of the speaker is acquired, perform audio encoding on the second audio signal using the encoding parameters to obtain an encoded signal corresponding to the second audio signal, and send the encoded signal to the listener.

In one implementation, the encoding parameters include one or more of an encoding bit rate and a sampling rate; the processor 901 is specifically configured to:

obtain a mapping relationship between reference noise estimation results and reference encoding parameters;

determine the encoding parameters for audio encoding based on the noise estimation result and the mapping relationship between the reference noise estimation results and the reference encoding parameters.

In one implementation, the encoding parameters include a number of channels; the processor 901 is specifically configured to:

compare the noise estimation result with a preset noise estimation result;

if the comparison result is that the noise estimation result is greater than the preset noise estimation result, set the encoding parameter to a first number of channels;

if the comparison result is that the noise estimation result is less than or equal to the preset noise estimation result, set the encoding parameter to a second number of channels.

An embodiment of the present application further provides a computer storage medium that stores program instructions. When the program instructions are executed, some or all of the steps of the audio processing method in the embodiments corresponding to FIG. 2, FIG. 4, or FIG. 5 may be performed.

It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.

A person of ordinary skill in the art will understand that all or some of the steps in the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.

An embodiment of the present application further provides a computer program product or computer program that includes program instructions. When executed by a processor, the program instructions can implement some or all of the steps of the above methods. For example, the program instructions are stored in a computer-readable storage medium. A processor of a computer device reads the program instructions from the computer-readable storage medium and executes them, so that the computer device performs the steps performed in the above method embodiments.

The audio processing method and medium provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. For a person of ordinary skill in the art, the specific implementations and the scope of application may vary in accordance with the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (15)

1. An audio processing method, the method comprising:
acquiring a first audio signal of a listener, wherein the listener and a speaker have established a communication connection;
performing noise estimation on the first audio signal, determining a noise estimation result of the first audio signal, and sending the noise estimation result of the first audio signal to the speaker, wherein the noise estimation result indicates the intensity of noise in the first audio signal;
receiving an encoded signal sent by the speaker, performing audio decoding on the encoded signal to obtain a decoded signal, and playing the decoded signal, wherein the encoded signal is obtained by the speaker, after acquiring a second audio signal of the speaker, performing audio encoding on the second audio signal using encoding parameters determined from the noise estimation result.
2. The method of claim 1, wherein the performing noise estimation on the first audio signal and determining a noise estimation result of the first audio signal comprises:
dividing the first audio signal into frames to obtain M audio frames;
traversing the M audio frames, and on reaching an mth audio frame among the M audio frames, performing a Fourier transform on the mth audio frame to obtain a power spectrum of the mth audio frame;
performing noise estimation on the mth audio frame based on the power spectrum of the mth audio frame, determining a noise estimation result of the mth audio frame, and using the noise estimation result of the mth audio frame as the noise estimation result of the first audio signal.
3. The method of claim 2, wherein the performing noise estimation on the mth audio frame based on the power spectrum of the mth audio frame and determining the noise estimation result of the mth audio frame comprises:
dividing the power spectrum of the mth audio frame into K subbands, and determining a subband power spectrum of each of the K subbands;
performing noise estimation on each subband based on its subband power spectrum, and determining a noise estimation result of that subband;
performing noise estimation on the mth audio frame based on the noise estimation results of the subbands included in the mth audio frame, and determining the noise estimation result of the mth audio frame.
4. The method of claim 3, wherein the performing noise estimation on each subband based on its subband power spectrum and determining a noise estimation result of that subband comprises:
for a kth subband among the K subbands, performing time-frequency smoothing on the subband power spectrum of the kth subband to obtain a smoothed subband power spectrum of the kth subband;
acquiring a preset period, and determining, based on the smoothed subband power spectrum of the kth subband and the preset period, a minimum subband power spectrum of the kth subband for the mth audio frame;
determining, based on the smoothed subband power spectrum and the minimum subband power spectrum, a speech probability that speech is present in the kth subband of the mth audio frame;
determining the noise estimation result of the kth subband of the mth audio frame based on the speech probability, the subband power spectrum of the kth subband of the mth audio frame, and the noise estimation result of the kth subband of the (m-1)th audio frame.
5. The method of claim 4, wherein the performing time-frequency smoothing on the subband power spectrum of the kth subband to obtain a smoothed subband power spectrum of the kth subband comprises:
acquiring the subband power spectrum of each of r adjacent subbands adjacent to the kth subband;
performing frequency-domain smoothing on the subband power spectrum of the kth subband based on the subband power spectra of the adjacent subbands, to obtain a frequency-domain smoothed subband power spectrum of the kth subband;
acquiring the smoothed subband power spectrum of the kth subband of the (m-1)th audio frame;
performing time-domain smoothing based on the smoothed subband power spectrum of the kth subband of the (m-1)th audio frame and the frequency-domain smoothed subband power spectrum of the kth subband, to obtain a time-domain smoothed subband power spectrum of the kth subband, and using the time-domain smoothed subband power spectrum as the smoothed subband power spectrum of the kth subband.
6. The method of claim 4, wherein the preset period is T, and the determining a minimum subband power spectrum of the kth subband for the mth audio frame based on the smoothed subband power spectrum of the kth subband and the preset period comprises:
taking the remainder of k divided by T, and determining the result;
if the result equals a specified value, acquiring the minimum subband power spectrum of the previous period, and using the smaller of the minimum subband power spectrum of the previous period and the smoothed subband power spectrum of the kth subband of the mth audio frame as the minimum subband power spectrum of the kth subband for the mth audio frame;
if the result does not equal the specified value, acquiring the minimum subband power spectrum of the kth subband for the (m-1)th audio frame, and using the smaller of the smoothed subband power spectrum of the kth subband and the minimum subband power spectrum of the kth subband for the (m-1)th audio frame as the minimum subband power spectrum of the kth subband for the mth audio frame.
7. The method of claim 4, wherein the determining a speech probability that speech is present in the kth subband of the mth audio frame based on the smoothed subband power spectrum and the minimum subband power spectrum comprises:
determining a signal-to-noise ratio based on the smoothed subband power spectrum and the minimum subband power spectrum;
comparing the signal-to-noise ratio with a threshold, and determining, based on the comparison result, an initial speech probability that speech is present in the kth subband of the mth audio frame;
acquiring the speech probability that speech is present in the kth subband of the (m-1)th audio frame, and determining the speech probability that speech is present in the kth subband of the mth audio frame based on the speech probability of the kth subband of the (m-1)th audio frame and the initial speech probability.
8. The method of claim 7, wherein determining an initial speech probability for speech to be present in a kth sub-band of the mth audio frame based on the comparison result comprises:
if the comparison result is that the signal-to-noise ratio is greater than the threshold value, determining the initial speech probability of speech being present in the kth sub-band of the mth audio frame as a first probability;
and if the comparison result is that the signal-to-noise ratio is less than or equal to the threshold value, determining the initial speech probability of speech being present in the kth sub-band of the mth audio frame as a second probability.
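A minimal sketch of claims 7 and 8 taken together is given below. The claims do not fix the formulas, so several choices here are assumptions: the signal-to-noise ratio is taken as the ratio of smoothed power to minimum power, the first and second probabilities are set to 1.0 and 0.0, and the current and previous probabilities are combined by a recursive average with a hypothetical factor `beta`.

```python
def speech_probability(smoothed, minimum, prev_prob, threshold=5.0,
                       beta=0.2, first_prob=1.0, second_prob=0.0,
                       eps=1e-12):
    """Estimate the speech presence probability of one sub-band."""
    # Assumed SNR definition: smoothed power over tracked minimum power.
    snr = smoothed / max(minimum, eps)
    # Claim 8: a hard decision against the threshold gives the initial value.
    initial = first_prob if snr > threshold else second_prob
    # Claim 7: combine with the previous frame's probability (assumed form).
    return beta * prev_prob + (1.0 - beta) * initial
```

For example, with `beta=0.5`, a sub-band whose power is ten times its tracked minimum and whose previous probability was 0.0 receives a probability of 0.5.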
9. The method of claim 3, wherein the determining the noise estimation result of the mth audio frame based on the noise estimation result corresponding to each sub-band included in the mth audio frame comprises:
acquiring a preset sub-band range, and determining N reference sub-bands with frequency ranges within the preset sub-band range from K sub-bands contained in the mth audio frame;
and acquiring a noise estimation result of the (m-1)th audio frame, performing noise estimation on the mth audio frame based on the noise estimation result of the (m-1)th audio frame and the noise estimation results corresponding to the respective reference sub-bands among the N reference sub-bands, and determining the noise estimation result of the mth audio frame.
10. The method of claim 9, wherein the determining the noise estimation result of the mth audio frame based on the noise estimation result of the (m-1)th audio frame and the noise estimation results corresponding to the respective reference sub-bands among the N reference sub-bands comprises:
for an nth reference sub-band among the N reference sub-bands, acquiring the number of frequency points in the nth reference sub-band, and determining a noise estimation result of the nth reference sub-band based on the number of frequency points in the nth reference sub-band and the noise estimation result corresponding to the nth reference sub-band;
determining an initial noise estimation result of the mth audio frame based on the noise estimation results of the respective reference sub-bands among the N reference sub-bands;
and performing noise estimation on the mth audio frame based on the noise estimation result of the (m-1)th audio frame and the initial noise estimation result of the mth audio frame, and determining the noise estimation result of the mth audio frame.
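The frame-level estimate of claims 9 and 10 can be sketched as below. The claims leave the arithmetic open, so this sketch assumes that each reference sub-band's estimate is weighted by its number of frequency points, that the weighted values are summed to form the initial frame estimate, and that the result is recursively averaged with the previous frame's estimate. All function names, parameter names, and the factor `gamma` are hypothetical.

```python
def select_reference_subbands(band_ranges, low_hz, high_hz):
    """Return indices of sub-bands lying entirely inside [low_hz, high_hz].

    band_ranges is a list of (start_hz, end_hz) pairs, one per sub-band.
    """
    return [i for i, (lo, hi) in enumerate(band_ranges)
            if lo >= low_hz and hi <= high_hz]


def frame_noise_estimate(prev_frame_noise, ref_noise, ref_bin_counts,
                         gamma=0.9):
    """Combine per-sub-band noise estimates into one frame-level value."""
    if len(ref_noise) != len(ref_bin_counts):
        raise ValueError("one bin count is needed per reference sub-band")
    # Assumed weighting: estimate times number of frequency points, summed.
    initial = sum(n * c for n, c in zip(ref_noise, ref_bin_counts))
    # Assumed smoothing against the (m-1)th frame's estimate.
    return gamma * prev_frame_noise + (1.0 - gamma) * initial
```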
11. The method of claim 2, wherein said determining a subband power spectrum for each of said K subbands comprises:
for a kth sub-band among the K sub-bands, acquiring the number of frequency points in the kth sub-band;
and determining the sub-band power spectrum of the kth sub-band based on the number of frequency points in the kth sub-band and the frequency-domain value corresponding to each frequency point in the kth sub-band.
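One plausible reading of claim 11 is that the sub-band power is the mean squared magnitude of the frequency-domain values in the sub-band. The claim names only the inputs, so the averaging itself is an assumption of this sketch, as is the function name.

```python
def subband_power(freq_values):
    """Mean of |X|^2 over the frequency points of one sub-band.

    freq_values holds the (possibly complex) frequency-domain value of
    each frequency point in the sub-band.
    """
    if not freq_values:
        raise ValueError("a sub-band needs at least one frequency point")
    return sum(abs(x) ** 2 for x in freq_values) / len(freq_values)
```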
12. A method of audio processing, the method comprising:
receiving a noise estimation result sent by a listening party, the noise estimation result being obtained by the listening party performing noise estimation on a first audio signal of the listening party after acquiring the first audio signal, and the listening party being communicatively connected to a sounding party;
determining coding parameters for audio coding based on the noise estimation result;
and if a second audio signal of the sounding party is acquired, performing audio coding on the second audio signal using the coding parameters to obtain an encoded signal corresponding to the second audio signal, and transmitting the encoded signal to the listening party.
13. The method of claim 12, wherein the coding parameters include one or more of a coding rate and a sampling rate; the determining coding parameters for audio coding based on the noise estimation result comprises:
obtaining a mapping relationship between reference noise estimation results and reference coding parameters;
and determining the coding parameters for audio coding based on the noise estimation result and the mapping relationship between the reference noise estimation results and the reference coding parameters.
14. The method of claim 12, wherein the coding parameters include a number of channels; the determining coding parameters for audio coding based on the noise estimation result comprises:
comparing the noise estimation result with a preset noise estimation result;
if the comparison result is that the noise estimation result is greater than the preset noise estimation result, determining the coding parameter as a first number of channels;
and if the comparison result is that the noise estimation result is less than or equal to the preset noise estimation result, determining the coding parameter as a second number of channels.
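The parameter selection of claims 13 and 14 can be illustrated with a lookup table and a threshold test. Every number below is hypothetical, as are the function and constant names; the claims state only that a mapping and a preset value exist. The sketch also assumes that a noisier listening environment maps to a lower coding rate and to fewer channels, which the claims do not specify.

```python
import bisect

# Hypothetical mapping: upper noise bounds in dB and, for each interval,
# a (coding rate in bit/s, sampling rate in Hz) pair.
NOISE_BOUNDS_DB = [40.0, 60.0]
CODING_PARAMS = [(64000, 48000), (32000, 32000), (16000, 16000)]


def select_rate_params(noise_db):
    """Claim 13 sketch: look up coding rate and sampling rate by noise level."""
    idx = bisect.bisect_left(NOISE_BOUNDS_DB, noise_db)
    return CODING_PARAMS[idx]


def select_channels(noise_db, preset_db=50.0, first=1, second=2):
    """Claim 14 sketch: pick the first channel count above the preset
    noise level, otherwise the second."""
    return first if noise_db > preset_db else second
```

Under these assumed values, a reported noise level of 70 dB would select a 16 kbit/s, 16 kHz, single-channel encoding for the sounding party's signal.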
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program instructions which, when executed, carry out the method according to any one of claims 1-14.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310273204.1A CN118645110A (en) 2023-03-13 2023-03-13 Audio processing method and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310273204.1A CN118645110A (en) 2023-03-13 2023-03-13 Audio processing method and medium

Publications (1)

Publication Number Publication Date
CN118645110A true CN118645110A (en) 2024-09-13

Family

ID=92663819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310273204.1A Pending CN118645110A (en) 2023-03-13 2023-03-13 Audio processing method and medium

Country Status (1)

Country Link
CN (1) CN118645110A (en)

Legal Events

Date Code Title Description
PB01 Publication