
CN118609536A - Audio generation method, device, equipment and storage medium - Google Patents


Info

Publication number
CN118609536A
CN118609536A
Authority
CN
China
Prior art keywords
target user
timbre
features
tone
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410777954.7A
Other languages
Chinese (zh)
Inventor
张毅
陈博
付振
王明月
何金鑫
孙宇嘉
梁小明
王紫烟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Faw Nanjing Technology Development Co ltd
FAW Group Corp
Original Assignee
Faw Nanjing Technology Development Co ltd
FAW Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Faw Nanjing Technology Development Co ltd, FAW Group Corp filed Critical Faw Nanjing Technology Development Co ltd
Priority to CN202410777954.7A
Publication of CN118609536A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an audio generation method, apparatus, device, and storage medium, relating to the field of audio technology. The method comprises the following steps: in response to received text information, acquiring historical audio information of a target user, the text information being sent by the target user through a target terminal; extracting timbre features from the historical audio information; clustering the extracted timbre features to determine the timbre features of the cluster centers; determining the current timbre feature of the target user according to the timbre features of the cluster centers; and inputting the current timbre feature of the target user and the text information into a speech generation model to generate target audio having the current timbre feature of the target user. According to the technical solution provided by the embodiments of the invention, personalized speech can be generated, thereby improving the user experience.

Description

Audio generation method, device, equipment and storage medium

Technical Field

Embodiments of the present invention relate to the field of audio technology, and in particular to an audio generation method, apparatus, device, and storage medium.

Background Art

In daily life and work, people often need to receive and send text messages. However, in certain scenarios, such as while driving, exercising, or living with a visual impairment, users may not be able to view or read a text message directly. In such cases, being able to listen to audio corresponding to the content of the text message would greatly improve the convenience and safety of receiving information.

Although existing text-to-speech technology can convert text into speech, it typically relies on preset or generic voice libraries, so the result lacks personalization and realism, which degrades the user experience.

Therefore, there is an urgent need for a new method to solve the above problems.

Summary of the Invention

The present invention provides an audio generation method, apparatus, device, and storage medium that can generate personalized speech, thereby improving the user experience.

In a first aspect, an embodiment of the present invention provides an audio generation method, comprising:

in response to received text information, acquiring historical audio information of a target user, wherein the text information is sent by the target user through a target terminal;

extracting timbre features from the historical audio information;

clustering the extracted timbre features to determine the timbre features of the cluster centers;

determining the current timbre feature of the target user according to the timbre features of the cluster centers;

inputting the current timbre feature of the target user and the text information into a speech generation model to generate target audio having the current timbre feature of the target user.
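The five steps of the claimed method can be sketched as a single pipeline. This is a minimal illustrative sketch, not the patented implementation; every helper function below (`fetch_history`, `extract_timbre`, `cluster_centers`, `pick_current_timbre`, `synthesize`) is a hypothetical placeholder standing in for the components detailed in the embodiments.

```python
# Illustrative end-to-end sketch of the claimed method; all helper
# functions are hypothetical stand-ins, returning placeholder values.

def fetch_history(user_id):
    # Step 1: look up stored historical audio for the sending user.
    return ["clip_a.wav", "clip_b.wav"]

def extract_timbre(clips):
    # Step 2: one timbre feature vector per clip (placeholder values).
    return [[0.1, 0.2], [0.11, 0.19], [0.9, 0.8]]

def cluster_centers(features):
    # Step 3: cluster the vectors; return one representative per cluster.
    return [[0.105, 0.195], [0.9, 0.8]]

def pick_current_timbre(centers):
    # Step 4: choose the center that best represents the user.
    return centers[0]

def synthesize(timbre, text):
    # Step 5: condition a speech generation model on the timbre (stubbed).
    return {"timbre": timbre, "text": text}

def generate_audio(user_id, text):
    history = fetch_history(user_id)
    features = extract_timbre(history)
    centers = cluster_centers(features)
    timbre = pick_current_timbre(centers)
    return synthesize(timbre, text)

audio = generate_audio("user_a", "hello")
```

Each stub is replaced by a real component in the steps described below; the control flow, however, mirrors the five claimed operations in order.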

In the technical solution of the present invention, historical audio information of a target user is first acquired in response to received text information, the text information being sent by the target user through a target terminal; timbre features are then extracted from the historical audio information; the extracted timbre features are clustered to determine the timbre features of the cluster centers; the current timbre feature of the target user is determined according to the timbre features of the cluster centers; and finally the current timbre feature of the target user and the text information are input into a speech generation model to generate target audio having the current timbre feature of the target user.

Acquiring the historical audio information of the target user provides the data basis for the subsequent extraction of timbre features. Clustering the extracted timbre features groups similar features into one class and yields the timbre features of the cluster centers; this simplifies the data, reduces the complexity of subsequent processing, improves efficiency, and also provides the data basis for determining the current timbre feature of the target user. Determining the current timbre feature from the representative features of the clusters, that is, the cluster-center timbre features, makes the final result more targeted and personalized, giving the user a more natural and realistic timbre experience and improving satisfaction. Finally, inputting the current timbre feature of the target user and the text information into the speech generation model produces target audio that matches the user's personalized timbre, so that the audio sounds as if spoken by the target user personally, greatly enhancing the personalization and realism of the audio content and thereby improving the user experience.

By contrast, although the prior art can convert text into speech, it typically relies on preset or generic voice libraries and therefore lacks personalization and realism, which degrades the user experience. The present invention extracts timbre features from historical audio information, clusters them to obtain the cluster-center timbre features, determines the current timbre feature of the target user from those centers, and inputs that feature together with the text information into a speech generation model to generate target audio having the current timbre feature of the target user. This ensures that the generated audio is closer in timbre to the target user's real voice, giving the user a more natural and realistic experience. The present invention can therefore solve the problem that audio generated by existing text-to-speech technology lacks personalization and realism.

In a second aspect, an embodiment of the present invention further provides an audio generation apparatus, comprising:

an acquisition module, configured to acquire historical audio information of a target user in response to received text information, wherein the text information is sent by the target user through a target terminal;

an extraction module, configured to extract timbre features from the historical audio information;

a clustering module, configured to cluster the extracted timbre features to determine the timbre features of the cluster centers;

a determination module, configured to determine the timbre feature of the target user according to the timbre features of the cluster centers;

a generation module, configured to input the timbre feature of the target user and the text information into a speech generation model to generate target audio having the timbre feature of the target user.

In a third aspect, an embodiment of the present invention further provides an electronic device, comprising:

at least one processor; and a memory communicatively coupled to the at least one processor;

wherein the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to implement the audio generation method of any one of the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions,

wherein the computer-executable instructions, when executed by a computer processor, implement the audio generation method of any one of the first aspect.

It should be noted that the above computer instructions may be stored in whole or in part on a computer-readable storage medium. The computer-readable storage medium may be packaged together with the processor of the audio generation apparatus, or may be packaged separately from it; this application does not limit this.

For the descriptions of the second, third, and fourth aspects of this application, reference may be made to the detailed description of the first aspect; for their beneficial effects, reference may be made to the analysis of the beneficial effects of the first aspect, which will not be repeated here.

In this application, the name of the above audio generation apparatus does not limit the devices or functional modules themselves; in actual implementations, these devices or functional modules may appear under other names. As long as the functions of the devices or functional modules are similar to those of this application, they fall within the scope of the claims of this application and their equivalent technologies.

These and other aspects of the present application will become more apparent from the following description.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of an audio generation method provided by an embodiment of the present invention;

FIG. 2 is a flow chart of another audio generation method provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an audio generation apparatus provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

DETAILED DESCRIPTION

The present invention is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the complete structures.

The term "and/or" in this document merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone.

The terms "first" and "second" in the specification and drawings of this application are used to distinguish different objects, or to distinguish different processing of the same object, rather than to describe a specific order of objects.

In addition, the terms "including" and "having" and any variants thereof in the description of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device comprising a series of steps or units is not limited to the listed steps or units, but may optionally include other steps or units that are not listed, or other steps or units inherent to the process, method, product, or device.

Before the exemplary embodiments are discussed in more detail, it should be mentioned that some of them are described as processes or methods depicted as flow charts. Although a flow chart describes the operations (or steps) as sequential processing, many of the operations may be performed in parallel, concurrently, or simultaneously, and the order of the operations may be rearranged. The processing may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. The processing may correspond to a method, function, procedure, subroutine, subprogram, and the like. In addition, the embodiments of the present invention and the features in the embodiments may be combined with one another where no conflict arises.

It should be noted that in the embodiments of this application, words such as "exemplary" or "for example" indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" should not be interpreted as preferred over or more advantageous than other embodiments or designs; rather, such words are intended to present the related concepts in a concrete way.

In the description of this application, unless otherwise specified, "a plurality of" means two or more.

FIG. 1 is a flow chart of an audio generation method provided by an embodiment of the present invention. This embodiment is applicable to the case where a user receives a new text message but cannot conveniently operate the terminal by hand; the terminal may be a smartphone, a tablet computer, or the like. The method may be executed by an audio generation apparatus, which may be implemented in software and/or hardware; for example, the apparatus may be a computer. Referring to FIG. 1, the audio generation method of this embodiment may specifically include the following steps.

Step 110: In response to the received text information, obtain historical audio information of the target user.

Specifically, the text information is sent by the target user through a target terminal; the target user is the user who sends the text information. The historical audio information of the target user refers to audio information or recordings related to the target user that are stored on the device receiving the text information; the receiving device may be a smartphone, a tablet computer, a laptop, or the like, which the embodiments of the present invention do not limit. For example, if user A sends text information to user B's terminal B1 through terminal A1, then the target user is A, the target terminal is A1, the user receiving the text information is B, and the receiving device is B1.

In a specific implementation, before responding to received text information, a listening mechanism for receiving text messages must first be set up. For example, the communication method used to receive text messages may be chosen according to the actual situation or requirements; it may be a RESTful application programming interface, a network socket, a message queue, a real-time communication protocol, or an SMS/email service. A corresponding listening service is then deployed for the chosen communication method: a Web service for a RESTful interface, or a socket service for a network socket. To keep the listening service secure, a security mechanism such as authentication or authorization may also be configured. Text information sent by the target user through the target terminal is then received via this listening mechanism, after which a parsing algorithm may be chosen according to the actual situation or requirements, for example a regular expression, an XML parser, or a template-matching algorithm. The received text message is parsed with the chosen algorithm to extract the parameters relevant to retrieving the target user's historical audio information, for example an identifier of the target user. To ensure the extracted parameters are valid, their validity may be verified, for example by checking whether the user identifier exists. The database is then queried with the extracted parameters to obtain the target user's historical audio information, for example by constructing a database query statement from the extracted parameters and executing it. The resulting audio information can then be processed further according to business needs, for example by storing it or extracting timbre features from it.
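The receive-parse-validate-query flow described above can be sketched as follows. This is an illustrative sketch only: the message format (`from=...;text=...`), the regular expression, and the single-table SQLite schema are all assumptions made for the example, not details specified by the patent.

```python
# Sketch of the receive-parse-query flow, using an in-memory SQLite
# table as the audio store. The table layout and message format are
# illustrative assumptions.
import re
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE history (user_id TEXT, audio_path TEXT)")
db.execute("INSERT INTO history VALUES ('user_a', '/audio/a1.wav')")
db.execute("INSERT INTO history VALUES ('user_a', '/audio/a2.wav')")

def handle_message(raw):
    # Parse the sender identifier out of the incoming message; a regular
    # expression is one of the parsing options mentioned in the text.
    m = re.match(r"from=(?P<user>\w+);text=(?P<text>.*)", raw)
    if m is None:
        raise ValueError("unparseable message")
    user = m.group("user")
    # Validate the extracted parameter before using it.
    count = db.execute(
        "SELECT COUNT(*) FROM history WHERE user_id = ?", (user,)
    ).fetchone()[0]
    if count == 0:
        raise LookupError("unknown user identifier")
    # Query the store for the sender's historical audio.
    paths = [row[0] for row in db.execute(
        "SELECT audio_path FROM history WHERE user_id = ?", (user,))]
    return m.group("text"), paths

text, clips = handle_message("from=user_a;text=hello there")
```

The parameterized query stands in for the "construct and execute a database query statement" step; a production system would sit behind whichever listening service (Web, socket, message queue) was deployed.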

In addition, before acquiring the target user's historical audio information, in order to protect the user's privacy, it must be ensured that the device performing the acquisition has permission to access the target user's historical audio information.

In this embodiment, acquiring the historical audio information of the target user provides the data basis for the subsequent extraction of timbre features.

Step 120: Extract timbre features from the historical audio information.

Specifically, just as different musical instruments have different sound-production characteristics, different speakers also have different vocal characteristics, which make the sounds they produce highly recognizable. In other words, a speech signal carries features that mark the speaker's vocal characteristics; these are called timbre features.

In a specific implementation, after the target user's historical audio information is obtained, it may first be preprocessed, including but not limited to noise reduction, vocal separation, and segmentation, to extract the vocal audio information. After preprocessing, a suitable timbre feature extractor may be selected according to the actual situation and application requirements. For example, an audio feature extractor may be used to extract audio features from the raw audio, and the timbre feature extractor and a timbre classifier may then be jointly trained using timbre labels; finally, the audio feature extractor extracts vocal audio features from the vocal audio information, and these are fed into the trained timbre feature extractor to obtain the timbre features of the historical audio information. When computing power is limited, for example on an edge device, audio features may be extracted with methods such as the Fourier transform or the Mel spectrum; when computing power is ample, for example on a cloud device, a pretrained large audio model may be used, further improving the quality of the extracted timbre features.
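As a crude stand-in for the spectral feature extraction mentioned above, the sketch below computes an averaged log-magnitude FFT spectrum in a few frequency bands. This is an assumption-laden simplification: a real system would use a Mel spectrogram or a trained extractor as the text describes; the band count, sample rate, and synthetic test tones are all choices made for this example.

```python
import numpy as np

def timbre_features(signal, n_bands=8):
    """Rough spectral-envelope feature: average log magnitude of the FFT
    spectrum in n_bands equal-width frequency bands. A stand-in for the
    Mel-spectrum / pretrained-model extractors mentioned in the text."""
    spectrum = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spectrum, n_bands)
    return np.array([np.log1p(band.mean()) for band in bands])

# Two synthetic "voices": a low-pitched and a high-pitched pure tone,
# one second each at a 16 kHz sample rate.
t = np.arange(16000) / 16000.0
low = np.sin(2 * np.pi * 120 * t)     # energy concentrated in band 0
high = np.sin(2 * np.pi * 3000 * t)   # energy concentrated in band 2

f_low = timbre_features(low)
f_high = timbre_features(high)
```

The point of the sketch is only the shape of the operation: each audio clip maps to a fixed-length vector, and clips with different spectral content map to visibly different vectors, which is what the later clustering step relies on.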

In addition, after the timbre features are extracted, they may be post-processed, for example by dimensionality reduction or normalization, for subsequent use.
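A minimal sketch of the normalization post-processing mentioned here, assuming z-score normalization per feature dimension (one of several reasonable choices; the text does not fix a particular scheme):

```python
import numpy as np

def normalize_features(features):
    """Z-score normalize an (n_samples, n_dims) array of timbre features
    so each dimension has zero mean and unit variance. A small epsilon
    guards against constant (zero-variance) dimensions."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + 1e-8)

normed = normalize_features([[1.0, 10.0], [3.0, 30.0]])
```

Normalizing before clustering keeps dimensions with large numeric ranges from dominating the distance computations used in the next step.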

In this embodiment, timbre features are thus extracted from the historical audio information.

Step 130: Cluster the extracted timbre features to determine the timbre features of the cluster centers.

Specifically, clustering is an unsupervised learning technique that groups similar objects together while keeping objects in different groups as dissimilar as possible. In this embodiment, clustering means grouping similar timbre features among those extracted. A cluster center is a central or representative point that the clustering algorithm computes for each cluster; in cluster analysis, it is typically used to describe the overall characteristics or position of the cluster. In this embodiment, the timbre feature of a cluster center represents the typical or average feature of one class of timbre features.

具体实现中,在提取到音色特征之后,可以根据实际情况和应用需求选择合适的聚类算法来对提取的音色特征进行聚类,例如:聚类算法可以是K均值聚类算法、层次聚类、基于密度的带噪声应用空间聚类、谱聚类等。进而应用所选的聚类算法对提取的音色特征进行聚类。所选的聚类算法为基于密度的带噪声应用空间聚类时,可以先将提取到的音频特征整理成特征向量数据集,每个特征向量代表一个音色特征,再根据实际情况或历史经验确定该算法的两个参数,即邻域大小和密度阈值,之后对于数据集中的每个点,计算它与其他所有点之间的距离,即对于数据集中的每个特征向量,计算它与其他所有特征向量之间的距离,例如:计算的距离可以是欧氏距离,然后根据前述确定的领域大小的值确定每个点的邻域,即与该点距离小于或等于领域大小的所有点的集合。之后遍历数据集中的每个点,如果该点的邻域内至少有前述确定的密度阈值个点,则将该点标记为核心点。如果该点不是核心点,但它的邻域内包含至少一个核心点,则将该点标记为边界点。如果该点既不是核心点也不是边界点,则将其标记为噪声点。再从任意一个未被访问的核心点开始,找出与其密度可达的所有点,这些点形成一个聚类。重复这个过程,直到所有核心点都被访问过,并且每个聚类都已被形成。最后可以通过计算每个聚类中所有点的平均特征向量来表示各个聚类所对应的聚类中心,这个计算得到的平均特征向量就是聚类中心的音色特征。In a specific implementation, after the timbre features are extracted, a suitable clustering algorithm can be selected according to the actual situation and application requirements to cluster the extracted timbre features, for example, the clustering algorithm can be a K-means clustering algorithm, hierarchical clustering, density-based noisy application space clustering, spectral clustering, etc. Then the selected clustering algorithm is applied to cluster the extracted timbre features. When the selected clustering algorithm is density-based noisy application space clustering, the extracted audio features can be first organized into a feature vector data set, each feature vector represents a timbre feature, and then the two parameters of the algorithm, i.e., the neighborhood size and the density threshold, are determined according to the actual situation or historical experience, and then for each point in the data set, the distance between it and all other points is calculated, that is, for each feature vector in the data set, the distance between it and all other feature vectors is calculated, for example, the calculated distance can be the Euclidean distance, and then the neighborhood of each point is determined according to the value of the aforementioned determined domain size, that is, the set of all points whose distance to the point is less than or equal to the domain size. 
Then traverse each point in the data set. If the neighborhood of a point contains at least the density-threshold number of points, mark the point as a core point. If the point is not a core point but its neighborhood contains at least one core point, mark it as a boundary point. If the point is neither a core point nor a boundary point, mark it as a noise point. Then, starting from any unvisited core point, find all points that are density-reachable from it; these points form one cluster. Repeat this process until all core points have been visited and every cluster has been formed. Finally, the cluster center of each cluster can be represented by the average feature vector of all points in that cluster; this average feature vector is the timbre feature of the cluster center.
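As an illustration of the DBSCAN procedure described above, the following minimal pure-Python sketch clusters timbre feature vectors and returns the mean vector of each cluster as its center. The `eps` (neighborhood size) and `min_pts` (density threshold) names and values are illustrative assumptions; a production system would use an optimized library implementation.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan_centers(features, eps, min_pts):
    """Cluster feature vectors with a minimal DBSCAN; return the labels and
    the mean vector (cluster center) of each cluster. Label -1 marks noise."""
    n = len(features)
    # Neighborhood: all points within distance eps (the point itself included).
    neighbors = [
        [j for j in range(n) if euclidean(features[i], features[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n          # None = unvisited, -1 = noise
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1       # provisional noise (may become a boundary point)
            continue
        # Expand a new cluster from core point i via density-reachability.
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id           # boundary point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:     # j is also a core point
                queue.extend(neighbors[j])
        cluster_id += 1
    # Cluster center = mean feature vector of each cluster's members.
    centers = []
    for c in range(cluster_id):
        members = [features[k] for k in range(n) if labels[k] == c]
        centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return labels, centers
```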

本实施例中,对提取的音色特征进行聚类,可以将相似的音色特征归为一类,即可以将大量的音色特征数据简化为几个聚类中心,大大降低了后续处理的复杂度,提高了工作效率,同时也为之后确定目标用户的当前音色特征提供了数据基础。In this embodiment, the extracted timbre features are clustered so that similar timbre features can be classified into one category, that is, a large amount of timbre feature data can be simplified into several clustering centers, which greatly reduces the complexity of subsequent processing and improves work efficiency. It also provides a data basis for determining the current timbre features of the target user later.

步骤140、根据聚类中心的音色特征确定目标用户的当前音色特征。Step 140: Determine the current timbre feature of the target user according to the timbre feature of the cluster center.

具体地,目标用户的当前音色特征指的是可以描述目标用户音色属性的特征,即可以代表目标用户的个性化或专属音色。Specifically, the current timbre feature of the target user refers to a feature that can describe the timbre attribute of the target user, that is, it can represent the personalized or exclusive timbre of the target user.

具体实现中，若存在目标用户的历史音色特征，则计算目标用户的历史音色特征与每个聚类中心的音色特征之间的相似度或距离。然后基于计算出的相似度或距离，找到与目标用户音色特征最接近的聚类中心的音色特征，将其确定为目标用户的当前候选音色特征，使用其更新目标用户的历史音色特征，并将更新后的历史音色特征确定为目标用户的当前音色特征。例如：可以计算目标用户的历史音色特征与每个聚类中心的音色特征之间的欧氏距离，然后将距离最小的聚类中心的音色特征确定为目标用户的当前候选音色特征。若不存在目标用户的历史音色特征，则可以根据聚类中心的数量来确定目标用户的当前音色特征，当聚类中心的数量为一时，直接将该聚类中心的音色特征确定为目标用户的当前音色特征，当聚类中心的数量不为一时，可以对各聚类中心的音色特征进行优先级排序，In a specific implementation, if historical timbre features of the target user exist, the similarity or distance between the historical timbre features and the timbre features of each cluster center is calculated. Based on the calculated similarity or distance, the timbre features of the cluster center closest to the target user's timbre features are found and determined as the current candidate timbre features of the target user; the historical timbre features are then updated with these candidate features, and the updated historical timbre features are determined as the current timbre features of the target user. For example, the Euclidean distance between the historical timbre features and the timbre features of each cluster center can be calculated, and the timbre features of the cluster center with the smallest distance determined as the current candidate timbre features. If no historical timbre features of the target user exist, the current timbre features can be determined based on the number of cluster centers: when there is exactly one cluster center, its timbre features are directly determined as the current timbre features of the target user; when there is more than one, the timbre features of the cluster centers can be prioritized.
For example, the timbre features of each cluster center can be prioritized using multi-attribute decision analysis technology, and the timbre features of the cluster center with the highest priority can be determined as the current timbre features of the target user.
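The nearest-center lookup and history update described above can be sketched as follows. The blending rule (`alpha`) in `update_history` is a hypothetical choice; the embodiment only states that the historical features are updated using the candidate features.

```python
import math

def nearest_center(history, centers):
    """Return the index of the cluster center whose timbre features are
    closest (Euclidean distance) to the user's historical timbre features."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centers)), key=lambda i: dist(history, centers[i]))

def update_history(history, candidate, alpha=0.5):
    """Hypothetical update rule: blend the historical feature vector toward
    the candidate feature vector by a factor alpha."""
    return tuple((1 - alpha) * h + alpha * c for h, c in zip(history, candidate))
```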

本实施例中,通过聚类的代表性特征,即聚类中心的音色特征,来确定目标用户的当前音色特征,可以使最终确定的目标用户的当前音色特征更具针对性和个性化,从而为用户提供更加自然、真实的音色体验,提高用户的满意度。In this embodiment, the current timbre characteristics of the target user are determined by the representative characteristics of the cluster, that is, the timbre characteristics of the cluster center, so that the current timbre characteristics of the target user finally determined can be more targeted and personalized, thereby providing the user with a more natural and realistic timbre experience and improving user satisfaction.

步骤150、将目标用户的当前音色特征和文本信息输入语音生成模型,以生成具有目标用户的当前音色特征的目标音频。Step 150: Input the current timbre characteristics and text information of the target user into a speech generation model to generate a target audio with the current timbre characteristics of the target user.

具体地,语音生成模型是一种能够将文本信息转换成语音信号的机器学习模型。目标音频指的是通过语音生成模型生成的,具有目标用户的当前音色特征的语音输出。在本实施例中,目标音频听起来就像是目标用户本人在朗读其发送的文本信息,具有高度的个性化和自然度。Specifically, the speech generation model is a machine learning model that can convert text information into speech signals. The target audio refers to the speech output generated by the speech generation model and has the current timbre characteristics of the target user. In this embodiment, the target audio sounds like the target user himself reading the text message he sent, which is highly personalized and natural.

具体实现中,在将目标用户的当前音色特征和文本信息输入语音生成模型之前,可以先基于实际情况或需求选择适合的语音生成模型,然后可以使用提前收集到的文本和其对应的音频的数据集来训练所选的模型,从而得到最终使用的语音生成模型。之后将目标用户的当前音色特征和文本信息输入到训练好的语音生成模型中,模型将根据输入的音色特征和文本信息,生成对应的音频波形。在得到音频后,可以对生成的音频进行后处理,即可以通过噪声消除、音量均衡、动态压缩等方式对生成的音频进行后处理,以提高其质量和自然度。最后播放后处理后的音频。In a specific implementation, before inputting the current timbre characteristics and text information of the target user into the speech generation model, a suitable speech generation model can be selected based on the actual situation or needs, and then the selected model can be trained using the data set of the text and its corresponding audio collected in advance, so as to obtain the speech generation model used in the end. After that, the current timbre characteristics and text information of the target user are input into the trained speech generation model, and the model will generate the corresponding audio waveform according to the input timbre characteristics and text information. After obtaining the audio, the generated audio can be post-processed, that is, the generated audio can be post-processed by noise elimination, volume equalization, dynamic compression, etc. to improve its quality and naturalness. Finally, the post-processed audio is played.

本实施例中,通过将目标用户的当前音色特征和文本信息输入语音生成模型,能够生成出符合目标用户个性化音色特征的目标音频,使得目标音频听起来就像是由目标用户本人亲自说出的一样,极大地增强了音频内容的个性化和真实感,从而提高了用户体验。In this embodiment, by inputting the current timbre characteristics and text information of the target user into the speech generation model, a target audio that conforms to the personalized timbre characteristics of the target user can be generated, so that the target audio sounds as if it is spoken by the target user himself, greatly enhancing the personalization and authenticity of the audio content, thereby improving the user experience.

本发明实施例中,通过获取目标用户的历史音频信息,为之后提取音色特征提供数据基础。然后从历史音频信息中提取音色特征,并对提取的音色特征进行聚类,可以将相似的音色特征归为一类,得到聚类中心的音色特征,简化了数据,从而降低了后续处理的复杂度,提高了工作效率,同时也为之后确定目标用户的当前音色特征提供了数据基础。然后通过聚类的代表性特征,即聚类中心的音色特征,来确定目标用户的当前音色特征,可以使最终确定的目标用户的当前音色特征更具针对性和个性化,从而为用户提供更加自然、真实的音色体验,提高用户的满意度。最后将目标用户的当前音色特征和文本信息输入语音生成模型,以生成符合目标用户个性化音色特征的目标音频,使得目标音频听起来就像是由目标用户本人亲自说出的一样,极大地增强了音频内容的个性化和真实感,从而提高了用户体验。相比于现有技术虽然可以将文字转换为语音,但通常只能使用预设或通用的语音库来将文字转换为语音,缺乏个性化和真实感,从而影响用户的体验。本发明通过接收到的文本信息来获取目标用户的历史音频信息,再从历史音频信息中提取音色特征,并对提取的音色特征进行聚类,得到聚类中心的音色特征;之后根据聚类中心的音色特征确定目标用户的当前音色特征;最后将目标用户的当前音色特征和文本信息输入语音生成模型,以生成具有目标用户的当前音色特征的目标音频,可以确保生成的音频在音色上更加贴近目标用户的真实声音,为用户带来更加自然、真实的音色体验。因此,本发明可以解决利用现有的文字生成语音技术所生成的音频缺乏个性化和真实感的问题。In the embodiment of the present invention, by obtaining the historical audio information of the target user, a data basis is provided for the subsequent extraction of timbre features. Then, the timbre features are extracted from the historical audio information, and the extracted timbre features are clustered, so that similar timbre features can be classified into one category to obtain the timbre features of the cluster center, simplifying the data, thereby reducing the complexity of subsequent processing, improving work efficiency, and also providing a data basis for determining the current timbre features of the target user later. Then, the current timbre features of the target user are determined by the representative features of the cluster, that is, the timbre features of the cluster center, so that the current timbre features of the target user finally determined can be more targeted and personalized, thereby providing the user with a more natural and real timbre experience and improving the user's satisfaction. 
Finally, the current timbre features and text information of the target user are input into the speech generation model to generate a target audio that meets the personalized timbre features of the target user, so that the target audio sounds like it is spoken by the target user himself, which greatly enhances the personalization and authenticity of the audio content, thereby improving the user experience. Compared with the prior art, although text can be converted into speech, usually only a preset or general voice library can be used to convert text into speech, which lacks personalization and authenticity, thereby affecting the user experience. The present invention obtains the historical audio information of the target user through the received text information, then extracts the timbre features from the historical audio information, and clusters the extracted timbre features to obtain the timbre features of the cluster center; then determines the current timbre features of the target user according to the timbre features of the cluster center; finally, inputs the current timbre features of the target user and the text information into the speech generation model to generate the target audio with the current timbre features of the target user, which can ensure that the generated audio is closer to the real voice of the target user in timbre, and brings a more natural and real timbre experience to the user. Therefore, the present invention can solve the problem that the audio generated by the existing text-to-speech technology lacks personalization and realism.

图2为本发明实施例提供的另一种音频生成方法的流程图,本实施例是在上述实施例的基础上进行具体化。在本实施例中,该方法还可以包括:FIG2 is a flow chart of another audio generation method provided by an embodiment of the present invention. This embodiment is specific based on the above embodiment. In this embodiment, the method may also include:

步骤211、根据接收到的文本信息确定目标用户的用户标识。Step 211: Determine the user ID of the target user according to the received text information.

具体地,用户标识指的是用于唯一识别用户的标记或代码。目标用户的用户标识指的是用于唯一识别目标用户的标记或代码。Specifically, the user identification refers to a mark or code used to uniquely identify the user. The user identification of the target user refers to a mark or code used to uniquely identify the target user.

具体实现中,可以先确定接收文本信息的渠道,例如:渠道可以是即时通讯应用或移动应用内的聊天窗口等。然后在指定的接收渠道上设置监听器或拦截器,以便捕获目标用户发送的文本信息,再通过指定的渠道接收文本信息,之后可以根据实际情况或需求来确定解析算法,例如:解析算法可以是机器学习算法。再根据所选的解析算法对接收到的文本信息进行解析,以提取目标用户的用户标识。在提取出目标用户的用户标识后,可以对其进行验证和清洗操作以确保数据的准确性和有效性。例如:可以通过检查用户标识是否符合预期的格式和规则来对其进行验证,通过处理大小写敏感性问题来对其进行清洗。然后将处理后的目标用户的用户标识存储起来,以便后续处理和使用。In a specific implementation, the channel for receiving text information can be determined first, for example, the channel can be an instant messaging application or a chat window in a mobile application. Then, a listener or interceptor is set on the specified receiving channel to capture the text information sent by the target user, and then the text information is received through the specified channel. After that, the parsing algorithm can be determined according to the actual situation or needs, for example, the parsing algorithm can be a machine learning algorithm. Then, the received text information is parsed according to the selected parsing algorithm to extract the user ID of the target user. After the user ID of the target user is extracted, it can be verified and cleaned to ensure the accuracy and validity of the data. For example, the user ID can be verified by checking whether it conforms to the expected format and rules, and it can be cleaned by handling case sensitivity issues. The processed user ID of the target user is then stored for subsequent processing and use.
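The parse, validate and clean steps above can be sketched as follows. The `user_id=<value>` payload format and the 8-to-16-alphanumeric ID rule are illustrative assumptions, not part of the embodiment, which leaves the parsing algorithm open.

```python
import re

def extract_user_id(message):
    """Parse a user ID out of an incoming text payload, then validate and
    clean it. Returns None if no valid ID is found."""
    match = re.search(r"user_id=([A-Za-z0-9]+)", message)
    if not match:
        return None
    user_id = match.group(1).lower()      # cleaning: normalize case sensitivity
    if not (8 <= len(user_id) <= 16):     # validation: expected format and rules
        return None
    return user_id
```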

本实施例中,根据接收到的文本信息确定目标用户的用户标识,为之后得到目标用户的历史音频信息提供了数据基础。In this embodiment, the user identification of the target user is determined according to the received text information, which provides a data basis for obtaining the historical audio information of the target user later.

步骤212、基于用户标识生成目标用户的历史音频信息的获取请求,并得到请求结果。Step 212: Generate a request for obtaining the historical audio information of the target user based on the user identifier, and obtain the request result.

具体地，获取请求是向服务器发出的指令，用于请求访问或检索特定资源。在本实施例中，获取请求是为了请求目标设备提供目标用户的历史音频信息，目标设备指的是存放目标用户的历史音频信息的设备，例如：目标设备可以是接收文本信息的设备或存储信息的服务器。获取请求可以包含用户标识、时间范围、音频类型等。请求结果是服务器对获取请求的响应，在本实施例中，请求结果是一个同意或拒绝获取历史音频信息的响应。如果请求结果为同意获取，那么可以根据目标用户的用户标识获取目标用户的历史音频信息。Specifically, the acquisition request is an instruction sent to the server to request access to or retrieval of a specific resource. In the present embodiment, the acquisition request asks the target device to provide the historical audio information of the target user; the target device refers to the device that stores the historical audio information of the target user, for example, a device that receives text messages or a server that stores information. The acquisition request may include a user ID, a time range, an audio type, etc. The request result is the server's response to the acquisition request; in the present embodiment, it is a response that approves or refuses the acquisition of the historical audio information. If the request result approves the acquisition, the historical audio information of the target user can be obtained based on the user ID of the target user.

具体实现中,为了获取目标用户的历史音频信息,需要先确定一个合适的接口,例如:接口可以是应用程序编程接口,然后使用超文本传输协议或安全超文本传输协议的客户端库来构建请求,例如:在使用超文本传输协议的客户端库来构建请求时,先确定请求类型,如GET、POST、PUT、DELETE等,然后设置请求的统一资源定位,统一资源定位包括路径参数,之后设置请求头,请求头包括认证信息、内容类型、字符编码等。再准备请求体,请求体包含要发送的数据,如目标用户的用户标识,最后调用客户端库的函数或方法,传入统一资源定位、请求方法、请求头和请求体,得到请求。然后使用超文本传输协议客户端库将构建好的请求发送给目标设备,之后会得到目标设备返回的一个响应,即请求结果。具体而言,可以根据响应的状态码来确定具体的请求结果,例如:响应码为200时可以代表请求结果为同意获取。In the specific implementation, in order to obtain the historical audio information of the target user, it is necessary to first determine a suitable interface, for example, the interface can be an application programming interface, and then use the client library of the hypertext transfer protocol or the secure hypertext transfer protocol to construct a request. For example, when using the client library of the hypertext transfer protocol to construct a request, first determine the request type, such as GET, POST, PUT, DELETE, etc., and then set the uniform resource location of the request, the uniform resource location includes the path parameter, and then set the request header, the request header includes the authentication information, content type, character encoding, etc. Then prepare the request body, the request body contains the data to be sent, such as the user ID of the target user, and finally call the function or method of the client library, pass in the uniform resource location, request method, request header and request body, and obtain the request. Then use the hypertext transfer protocol client library to send the constructed request to the target device, and then get a response returned by the target device, that is, the request result. Specifically, the specific request result can be determined according to the status code of the response. For example, when the response code is 200, it can represent that the request result is consent to obtain.
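The request construction described above can be sketched with Python's standard `urllib`. The endpoint path, query parameter names and bearer-token header are illustrative assumptions; the embodiment only requires that the request carry the user ID, a time range, the audio type and authentication information.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_history_request(base_url, user_id, start, end, token):
    """Construct the GET request for a target user's historical audio.
    All endpoint and parameter names here are illustrative."""
    # Request body/query: the data to send, including the target user's ID.
    query = urlencode({"user_id": user_id, "start": start, "end": end,
                       "type": "voice"})
    # Request headers: authentication information, content type, charset.
    return Request(
        url=f"{base_url}/audio/history?{query}",
        method="GET",
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/json; charset=utf-8"},
    )
```

Sending the built request with `urllib.request.urlopen` and checking `response.status == 200` corresponds to interpreting the status code as a request result of consent to obtain, as described above.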

步骤213、在请求结果为同意获取的情况下,根据用户标识获取目标用户的历史音频信息。Step 213: When the request result is that the acquisition is approved, the historical audio information of the target user is acquired according to the user identifier.

具体实现中,在请求结果为同意获取的情况下,使用提取出的目标用户的用户标识来构建数据库查询语句,通过执行查询语句,从存放音频信息的数据库中获取目标用户的历史音频信息。之后可以根据业务需求,对得到的音频信息进行进一步的处理,例如:存储或者提取音色特征。In the specific implementation, if the request result is consent to obtain, the extracted user ID of the target user is used to construct a database query statement, and the historical audio information of the target user is obtained from the database storing the audio information by executing the query statement. After that, the obtained audio information can be further processed according to business needs, for example: storing or extracting timbre features.

另一方面,在获取目标用户的历史音频信息时,为了更有效地利用计算资源并提高处理效率,可以利用时间戳来进行高效提取,例如:在每次提取目标用户的历史音频数据时,可以记录所提取的音频数据的结束时间戳。在后续的提取过程中,可以利用这个记录的时间戳来定位缓存中自上次提取之后新增的音频数据,然后只提取新增的音频数据,这样可以避免重复处理已经分析过的音频,从而节省算力。也可以限制每次提取的音频条数来进行高效提取,例如:在每次提取目标用户的历史音频数据时,只提取最新的一定数量的音频记录,如只提取最近100条的音频记录,通过这种方式,可以确保始终关注并处理最新的音频数据,而不是无谓地处理过时的或不再重要的数据。On the other hand, when obtaining the historical audio information of the target user, in order to more effectively utilize computing resources and improve processing efficiency, timestamps can be used for efficient extraction. For example, each time the historical audio data of the target user is extracted, the end timestamp of the extracted audio data can be recorded. In the subsequent extraction process, this recorded timestamp can be used to locate the newly added audio data in the cache since the last extraction, and then only the newly added audio data is extracted. This can avoid repeated processing of the audio that has been analyzed, thereby saving computing power. The number of audio items extracted each time can also be limited for efficient extraction. For example, each time the historical audio data of the target user is extracted, only a certain number of the latest audio records are extracted, such as only the most recent 100 audio records. In this way, it can be ensured that the latest audio data is always paid attention to and processed, rather than processing outdated or no longer important data unnecessarily.
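The two efficiency strategies above, timestamp-based incremental extraction and capping to the most recent records, can be combined in a short sketch. The `(timestamp, audio)` record layout is an assumption for illustration.

```python
def extract_new_records(records, last_timestamp, max_records=100):
    """Incremental extraction: keep only audio records newer than the end
    timestamp saved at the previous extraction, then cap the result at the
    most recent max_records entries (e.g. the latest 100 records)."""
    fresh = [r for r in records if r[0] > last_timestamp]
    fresh.sort(key=lambda r: r[0])        # oldest to newest
    return fresh[-max_records:]           # newest max_records entries only
```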

本实施例中,为了保护用户的隐私权,在获取目标用户的历史音频信息之前,基于用户标识生成目标用户的历史音频信息的获取请求,并得到请求结果。在请求结果为同意获取的情况下,才根据用户标识获取目标用户的历史音频信息,保证了当前执行获取目标用户的历史音频信息操作的设备具备获取目标用户的历史音频信息的权限。In this embodiment, in order to protect the privacy of the user, before obtaining the historical audio information of the target user, a request for obtaining the historical audio information of the target user is generated based on the user identifier, and a request result is obtained. If the request result is consent to obtain, the historical audio information of the target user is obtained according to the user identifier, ensuring that the device currently executing the operation of obtaining the historical audio information of the target user has the authority to obtain the historical audio information of the target user.

步骤214、从历史音频信息中滤除杂音,得到初始人声音频信息。Step 214: filter out noise from the historical audio information to obtain initial human voice audio information.

具体地,杂音指的是历史音频信息中不希望存在的声音元素,例如:风声、车流声、录音设备的机械噪音等。初始人声音频信息指的是经过滤除杂音处理后,从历史音频信息中提取出的人声部分。Specifically, noise refers to unwanted sound elements in historical audio information, such as wind noise, traffic noise, mechanical noise of recording equipment, etc. The initial human voice audio information refers to the human voice part extracted from the historical audio information after filtering out the noise.

具体实现中,可以根据实际情况或需求选取降噪算法,例如:降噪算法可以是统计滤波器、自适应滤波、深度学习算法、谱减法等。进而应用所选的降噪算法从历史音频信息中滤除杂音,得到初始人声音频信息。所选的降噪算法为谱减法时,可以先对历史音频信息中的音频信号进行预处理,然后通过对预处理后的音频信号中的无语音段进行频谱分析,得到背景噪声的频谱,之后对预处理后的音频信号进行快速傅里叶变换,将其从时域转换到频域,得到带噪语音的频谱。再从带噪语音的频谱中减去之前得到的背景噪声的频谱,得到中间频谱。之后对中间频谱进行逆快速傅里叶变换,将其从频域转换回时域,得到过滤后的音频信号,即初始人声音频信息。In a specific implementation, a noise reduction algorithm can be selected according to actual conditions or requirements. For example, the noise reduction algorithm can be a statistical filter, an adaptive filter, a deep learning algorithm, a spectral subtraction method, etc. Then, the selected noise reduction algorithm is applied to filter out noise from the historical audio information to obtain the initial human voice audio information. When the selected noise reduction algorithm is spectral subtraction, the audio signal in the historical audio information can be preprocessed first, and then the spectrum of the background noise can be obtained by performing a spectral analysis on the speech-free segment in the preprocessed audio signal. Then, the preprocessed audio signal is fast Fourier transformed to convert it from the time domain to the frequency domain to obtain the spectrum of the noisy speech. Then, the spectrum of the background noise obtained before is subtracted from the spectrum of the noisy speech to obtain an intermediate spectrum. Then, the intermediate spectrum is inversely fast Fourier transformed to convert it from the frequency domain back to the time domain to obtain the filtered audio signal, that is, the initial human voice audio information.
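The spectral-subtraction steps above can be sketched with a naive DFT (pure Python for readability; a real implementation would use an FFT library and process overlapping frames). It is assumed for illustration that the speech-free (noise-only) segment has the same length as the noisy frame, so the spectra subtract bin by bin.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def spectral_subtraction(noisy, noise_only):
    """Estimate the background-noise magnitude spectrum from a speech-free
    segment, subtract it from the noisy spectrum (floored at zero), keep the
    noisy phase, and transform back to the time domain."""
    noise_mag = [abs(v) for v in dft(noise_only)]
    cleaned = []
    for value, nmag in zip(dft(noisy), noise_mag):
        mag = max(abs(value) - nmag, 0.0)
        cleaned.append(cmath.rect(mag, cmath.phase(value)))
    return idft(cleaned)
```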

本实施例中,通过从历史音频信息中滤除杂音,得到初始人声音频信息,为之后得到目标人声音频信息提供了数据基础。In this embodiment, initial human voice audio information is obtained by filtering out noise from historical audio information, which provides a data basis for obtaining target human voice audio information later.

步骤215、将初始人声音频信息按照单人语音分离和/或分段,得到目标人声音频信息。Step 215: Separate and/or segment the initial human voice audio information according to the single person's voice to obtain the target human voice audio information.

具体地,分离指的是当一个初始人声音频信息中存在多个人同时说话时,需要对该初始人声音频信息按照单人语音进行分离,得到分离音频信息,以此来保证每个分离音频信息只有一个音色特征。分段指的是当一个初始人声音频信息中存在多个人先后说话时,需要按照不同人声对该初始人声音频信息进行分段,得到分段音频信息,以此来保证每个分段音频信息只有一个音色特征。目标人声音频信息指的是进行分离和/或分段处理后的所得到的音频信息中只存在一个人说话的声音,即只有一个音色特征。Specifically, separation means that when there are multiple people speaking at the same time in an initial human voice audio information, the initial human voice audio information needs to be separated according to the single voice to obtain separated audio information, so as to ensure that each separated audio information has only one timbre feature. Segmentation means that when there are multiple people speaking successively in an initial human voice audio information, the initial human voice audio information needs to be segmented according to different human voices to obtain segmented audio information, so as to ensure that each segmented audio information has only one timbre feature. The target human voice audio information refers to the audio information obtained after separation and/or segmentation processing, in which there is only the sound of one person speaking, that is, there is only one timbre feature.

具体实现中，在将初始人声音频信息按照单人语音分离和/或分段之前，需要先判断初始人声音频信息中是否存在多个人同时说话和/或多个人先后说话，若不存在，则不需要进行将初始人声音频信息按照单人语音分离和/或分段。若存在，则根据具体存在的类型，即多个人同时说话和/或多个人先后说话，来确定具体的操作，即分离和/或分段。具体而言，可以先通过音频信号的能量变量、频谱分析技术或音频分割技术等方法来确定各初始人声音频信息中是否存在多个人同时说话和/或多个人先后说话，In a specific implementation, before separating and/or segmenting the initial human voice audio information according to a single person's voice, it is necessary to first determine whether there are multiple people speaking at the same time and/or multiple people speaking successively in the initial human voice audio information. If not, there is no need to separate and/or segment the initial human voice audio information by single-person voice. If so, the specific operation, i.e., separation and/or segmentation, is determined according to the specific situation present, i.e., multiple people speaking at the same time and/or multiple people speaking successively. Specifically, methods such as the energy of the audio signal, spectrum analysis or audio segmentation techniques can first be used to determine whether each piece of initial human voice audio information contains multiple people speaking at the same time and/or successively.
For example, if the energy variable of the audio signal is used to determine whether there are multiple people speaking at the same time and/or multiple people speaking successively in each initial human voice audio information, then when the energy of the audio signal in an initial human voice audio information shows obvious fluctuations in time, and the fluctuation period matches the pronunciation period of the speech, it means that there are multiple people speaking successively in the audio information, and at this time, the audio information needs to be segmented. When the energy of the audio signal in an initial human voice audio information presents a continuous high energy state in time and there is no obvious fluctuation, it means that there are multiple people speaking at the same time in the audio information, and at this time, the audio information needs to be separated. When the energy of the audio signal in an initial human voice audio information is relatively stable in time and there is only one main energy peak, it means that there is only one person speaking in the audio information.

然后对存在多个人同时说话的各初始人声音频信息按照单人语音分离,即可以利用音分轨技术分别对每个初始人声音频信息进行单人语音分离,具体而言,在多人同时说话的音频信息中,可以通过音分轨技术识别并提取出每个说话人的声音所在的轨道,从而将其与其他声音分离开来,得到目标人声音频信息。Then, each initial human voice audio information in which multiple people speak at the same time is separated according to the single person's voice, that is, the audio separation track technology can be used to separate the single person's voice for each initial human voice audio information. Specifically, in the audio information in which multiple people speak at the same time, the audio separation track technology can be used to identify and extract the track where the voice of each speaker is located, thereby separating it from other sounds and obtaining the target human voice audio information.

然后对存在多个人先后说话的各初始人声音频信息按照单人语音分段,即可以分别识别每个初始人声音频信息中的静音或低能量区域,这些区域通常表示不同说话人之间的间隔,然后根据静音或低能量区域将对应的音频信息切割成多个片段,使每个片段只包含一个人的语音,从而得到目标人声音频信息。Then, the initial human voice audio information in which multiple people speak successively is segmented according to single-person speech, that is, the silent or low-energy areas in each initial human voice audio information can be identified respectively. These areas usually represent the intervals between different speakers. Then, the corresponding audio information is cut into multiple segments according to the silent or low-energy areas, so that each segment contains only one person's speech, thereby obtaining the target human voice audio information.
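The silence-based segmentation described above can be sketched as follows. The frame length and energy threshold are illustrative values that would be tuned to the recording conditions.

```python
def segment_by_silence(samples, frame_len=160, energy_threshold=0.01):
    """Split a sample sequence into single-speaker segments at low-energy
    (silence) frames, which usually mark the gaps between speakers."""
    segments, current = [], []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / len(frame)   # mean frame energy
        if energy < energy_threshold:     # silence: close the current segment
            if current:
                segments.append(current)
                current = []
        else:                             # speech: extend the current segment
            current.extend(frame)
    if current:
        segments.append(current)
    return segments
```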

本实施例中,通过将初始人声音频信息按照单人语音分离和/或分段,得到目标人声音频信息,为之后提取每个目标人声音频信息的音色特征提供了数据基础。In this embodiment, the target human voice audio information is obtained by separating and/or segmenting the initial human voice audio information according to the single person's voice, which provides a data basis for subsequently extracting the timbre features of each target human voice audio information.

步骤216、提取每个目标人声音频信息的音色特征。Step 216: extract the timbre features of each target human voice audio information.

具体实现中，在获取到每个目标人声音频信息之后，可以先对每个目标人声音频信息中的音频信号进行预处理，包括但不限于降噪、人声分离、分段等操作，从而提取出人声音频信息。在预处理之后，可以根据实际情况和应用需求选择合适的音色特征提取器来进行音色特征提取。例如，可以使用音频特征提取器从原始音频中提取音频特征，然后结合音色信息来联合训练音色特征提取器和音色分类器，最后先使用音频特征提取器从上述人声音频信息中提取人声音频特征，后送入上述训练好的音色特征提取器得到历史音频信息中的音色特征。当算力较紧张，如部署在端侧设备上时，音频特征提取可以使用傅里叶变换、梅尔谱等方法；当算力较充裕，如部署在云侧设备上时，音频特征提取可以使用预训练的音频大模型，从而进一步提高音色特征提取的效果。In a specific implementation, after each target human voice audio information is obtained, the audio signal in it can first be preprocessed, including but not limited to noise reduction, voice separation, segmentation and other operations, so as to extract the human voice audio information. After preprocessing, a suitable timbre feature extractor can be selected for timbre feature extraction according to the actual situation and application requirements. For example, an audio feature extractor can be used to extract audio features from the original audio, and a timbre feature extractor and a timbre classifier can then be jointly trained in combination with the timbre information; finally, the audio feature extractor extracts voice audio features from the above human voice audio information, and these features are fed into the trained timbre feature extractor to obtain the timbre features of the historical audio information. When computing power is tight, such as when deployed on an end-side device, audio feature extraction can use methods such as the Fourier transform or the Mel spectrum; when computing power is ample, such as when deployed on a cloud-side device, a pre-trained large audio model can be used, further improving the quality of timbre feature extraction.
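For the resource-constrained (end-side) case, a lightweight Fourier-based frame feature of the kind mentioned above can be sketched as follows; the naive DFT and the log-magnitude representation are illustrative choices, not the embodiment's prescribed extractor.

```python
import cmath
import math

def frame_spectral_feature(frame):
    """Compute a simple log-magnitude spectrum for one audio frame, keeping
    only the non-redundant half of the spectrum for a real-valued signal."""
    n = len(frame)
    spectrum = [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)) for k in range(n // 2 + 1)]
    # Small epsilon avoids log(0) for empty frequency bins.
    return [math.log(abs(v) + 1e-10) for v in spectrum]
```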

本实施例中,通过提取每个目标人声音频信息的音色特征,为之后确定目标用户的当前音色特征提供了数据基础。In this embodiment, by extracting the timbre features of each target human voice audio information, a data basis is provided for subsequently determining the current timbre features of the target user.

步骤217、对提取的音色特征进行聚类,以确定聚类中心的音色特征。Step 217: cluster the extracted timbre features to determine the timbre features at the cluster center.

进一步的,对提取的音色特征进行聚类,以确定聚类中心的音色特征,包括:对提取的音色特征进行聚类;将每个聚类内音色特征的平均值确定为当前聚类中心的音色特征。Furthermore, the extracted timbre features are clustered to determine the timbre features of the cluster center, including: clustering the extracted timbre features; and determining the average value of the timbre features in each cluster as the timbre features of the current cluster center.

具体实现中,在提取到每个目标人声音频信息的音色特征之后,可以根据实际情况和应用需求选择合适的聚类算法来对其进行聚类,例如:聚类算法可以是K均值聚类算法、层次聚类、基于密度的带噪声应用空间聚类、谱聚类等。进而应用所选的聚类算法对提取的音色特征进行聚类,再分别计算每个聚类内部的音色特征的平均值,将得到的各平均值确定为各自所对应聚类的聚类中心的音色特征。In a specific implementation, after extracting the timbre features of each target human voice audio information, a suitable clustering algorithm can be selected to cluster it according to the actual situation and application requirements, for example, the clustering algorithm can be a K-means clustering algorithm, hierarchical clustering, density-based spatial clustering with noise, spectral clustering, etc. Then, the selected clustering algorithm is applied to cluster the extracted timbre features, and then the average value of the timbre features within each cluster is calculated respectively, and each average value obtained is determined as the timbre feature of the cluster center of the corresponding cluster.
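Determining each cluster center as the mean of its members' timbre features can be sketched as follows, for labels produced by any of the clustering algorithms listed. Skipping a noise label of -1 follows the DBSCAN convention and is an assumption about the label format.

```python
def cluster_centers(features, labels):
    """Average the timbre feature vectors inside each cluster; the mean
    vector of each cluster is taken as that cluster center's timbre feature."""
    groups = {}
    for vec, label in zip(features, labels):
        if label != -1:                    # exclude noise points, if any
            groups.setdefault(label, []).append(vec)
    return {label: tuple(sum(col) / len(vecs) for col in zip(*vecs))
            for label, vecs in groups.items()}
```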

本实施例中,对提取的音色特征进行聚类,并将每个聚类内音色特征的平均值确定为当前聚类中心的音色特征,可以更准确地得到每个聚类的整体特性,从而提高音色识别的准确性。In this embodiment, the extracted timbre features are clustered, and the average value of the timbre features in each cluster is determined as the timbre feature of the current cluster center, so that the overall characteristics of each cluster can be obtained more accurately, thereby improving the accuracy of timbre recognition.

步骤218、判断是否存在目标用户的历史音色特征。Step 218: Determine whether there is a historical timbre feature of the target user.

若存在,则执行步骤219;若不存在,则执行步骤220。If it exists, execute step 219; if it does not exist, execute step 220.

具体地,目标用户的历史音色特征指的是过去得到的目标用户的音色特征。Specifically, the historical timbre features of the target user refer to the timbre features of the target user obtained in the past.

具体实现中,在得到聚类中心的音色特征之后,判断是否存在目标用户的历史音色特征。若存在,则将目标用户的历史音色特征与提取的音色特征进行匹配,或将目标用户的历史音色特征与各聚类中心的音色特征进行匹配,根据匹配结果确定目标用户的当前音色特征;若不存在,则获取聚类中心的数量,得到中心数量。In a specific implementation, after obtaining the timbre features of the cluster center, it is determined whether there are historical timbre features of the target user. If so, the historical timbre features of the target user are matched with the extracted timbre features, or the historical timbre features of the target user are matched with the timbre features of each cluster center, and the current timbre features of the target user are determined according to the matching results; if not, the number of cluster centers is obtained to obtain the number of centers.

在一种实施例中,若存在目标用户的历史音色特征,则可以直接将目标用户的历史音色特征确定为目标用户的当前音色特征。In one embodiment, if there are historical timbre features of the target user, the historical timbre features of the target user may be directly determined as the current timbre features of the target user.

本实施例中,通过判断是否存在目标用户的历史音色特征,来适应不同场景下的需求。In this embodiment, the needs in different scenarios are met by judging whether there are historical timbre features of the target user.

步骤219、将目标用户的历史音色特征与提取的音色特征进行匹配,或将目标用户的历史音色特征与各聚类中心的音色特征进行匹配,根据匹配结果确定目标用户的当前音色特征。Step 219: Match the historical timbre features of the target user with the extracted timbre features, or match the historical timbre features of the target user with the timbre features of each cluster center, and determine the current timbre features of the target user according to the matching results.

具体实现中，在确定存在目标用户的历史音色特征之后，可以将目标用户的历史音色特征与提取的音色特征进行匹配或者将目标用户的历史音色特征与各聚类中心的音色特征进行匹配，得到相应的匹配结果，即可以根据实际情况选择合适的相似性度量方法来进行匹配，从而得到匹配结果，例如：若将目标用户的历史音色特征与提取的音色特征进行匹配，且选择的相似性度量方法为余弦相似度，则通过计算提取的音色特征与目标用户的历史音色特征之间的余弦值来度量它们之间的匹配程度，具体而言，可以对每一个提取的音色特征，计算其与目标用户的历史音色特征之间的相似度得分，然后根据相似度得分确定匹配结果，例如：可以根据实际情况设定一个相似度阈值，当提取的音色特征与目标用户的历史音色特征之间的相似度得分大于或等于相似度阈值时，可以确定匹配结果为匹配，否则确定匹配结果为不匹配。也可以对每一个提取的音色特征，计算其与目标用户的历史音色特征之间的距离，并与预先设定的阈值进行比较。其中距离可以采用余弦距离、欧式距离等度量方法，此处不做限制，只要适用于度量向量之间的距离即可。若距离小于阈值，则可以确定匹配结果为匹配，否则确定匹配结果为不匹配。In a specific implementation, after determining that there are historical timbre features of the target user, the historical timbre features of the target user can be matched with the extracted timbre features or the historical timbre features of the target user can be matched with the timbre features of each cluster center to obtain a corresponding matching result, that is, a suitable similarity measurement method can be selected according to the actual situation to perform matching, thereby obtaining a matching result. For example, if the historical timbre features of the target user are matched with the extracted timbre features, and the selected similarity measurement method is cosine similarity, then the degree of matching between the extracted timbre features and the historical timbre features of the target user is measured by calculating the cosine value between them. Specifically, for each extracted timbre feature, the similarity score between it and the historical timbre features of the target user can be calculated, and then the matching result can be determined according to the similarity score. For example, a similarity threshold can be set according to the actual situation. When the similarity score between the extracted timbre features and the historical timbre features of the target user is greater than or equal to the similarity threshold, the matching result can be determined to be a match, otherwise the matching result can be determined to be a mismatch.
For each extracted timbre feature, the distance between it and the historical timbre features of the target user can also be calculated and compared with a preset threshold. The distance can be measured by cosine distance, Euclidean distance, etc., which are not limited here, as long as they are applicable to measuring the distance between vectors. If the distance is less than the threshold, the matching result can be determined to be a match, otherwise the matching result is determined to be a mismatch.
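The cosine-similarity matching just described can be sketched as follows. This is only an illustration under assumed names; the threshold value 0.9 is hypothetical, to be set according to the actual situation as the text notes:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two timbre feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_against_history(extracted_features, historical_feature, threshold=0.9):
    """Score each extracted timbre feature against the historical one.

    Returns (matched_indices, scores): a feature counts as a match when
    its similarity score reaches the threshold, mirroring the rule above
    (score >= threshold -> match, otherwise mismatch).
    """
    scores = [cosine_similarity(f, historical_feature) for f in extracted_features]
    matched = [i for i, s in enumerate(scores) if s >= threshold]
    return matched, scores
```

The distance-based variant mentioned above would instead compare, e.g., the Euclidean distance `np.linalg.norm(f - historical_feature)` against a preset threshold, declaring a match when the distance falls below it.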

本实施例中,将目标用户的历史音色特征与提取的音色特征进行匹配,或将目标用户的历史音色特征与各聚类中心的音色特征进行匹配,根据匹配结果确定目标用户的当前音色特征,可以使确定的目标用户的当前音色特征更具个性化,从而为用户提供更加个性化的服务。In this embodiment, the historical timbre features of the target user are matched with the extracted timbre features, or the historical timbre features of the target user are matched with the timbre features of each cluster center, and the current timbre features of the target user are determined based on the matching results. This can make the current timbre features of the determined target user more personalized, thereby providing users with more personalized services.

进一步的,根据匹配结果确定目标用户的当前音色特征,包括:若匹配结果为不匹配,则不进行操作;若匹配结果为匹配,则使用匹配的音色特征对目标用户的历史音色特征进行更新,完成音色特征的匹配与更新后,将目标用户的历史音色特征确定为目标用户的当前音色特征。Furthermore, the current timbre features of the target user are determined based on the matching result, including: if the matching result is a mismatch, no operation is performed; if the matching result is a match, the historical timbre features of the target user are updated using the matched timbre features, and after completing the matching and updating of the timbre features, the historical timbre features of the target user are determined as the current timbre features of the target user.

具体实现中，在得到匹配结果之后，可以根据得到的匹配结果确定目标用户的当前音色特征。具体而言，若匹配结果为不匹配，则不进行操作，即可以直接将目标用户的历史音色特征确定为目标用户的当前音色特征；若匹配结果为匹配，则可以使用提取的音色特征中与目标用户的历史音色特征匹配的音色特征，或者各聚类中心的音色特征中与目标用户的历史音色特征匹配的音色特征，对目标用户的历史音色特征进行更新，并在完成音色特征的匹配与更新后，将目标用户的历史音色特征确定为目标用户的当前音色特征。例如：可以将与目标用户的历史音色特征之间的相似度得分大于或等于相似度阈值的提取的音色特征或各聚类中心的音色特征确定为匹配音色特征，若大于或等于相似度阈值的提取的音色特征或各聚类中心的音色特征的个数大于一，则可以将大于或等于相似度阈值的提取的音色特征或各聚类中心的音色特征中相似度得分最高的音色特征确定为匹配音色特征。然后基于匹配音色特征对历史音色特征进行更新，并将更新后的历史音色特征确定为目标用户的当前音色特征。其中，特征更新方法可以为移动平均方法等。In a specific implementation, after obtaining the matching result, the current timbre features of the target user can be determined according to the obtained matching result. Specifically, if the matching result is not a match, no operation is performed, that is, the historical timbre features of the target user can be directly determined as the current timbre features of the target user; if the matching result is a match, the historical timbre features of the target user can be updated using the timbre features among the extracted timbre features that match the historical timbre features of the target user, or the timbre features among those of each cluster center that match the historical timbre features of the target user, and after completing the matching and updating of the timbre features, the historical timbre features of the target user are determined as the current timbre features of the target user. For example, the extracted timbre features or timbre features of each cluster center whose similarity scores with the historical timbre features of the target user are greater than or equal to the similarity threshold can be determined as matching timbre features. 
If the number of extracted timbre features or timbre features of each cluster center that are greater than or equal to the similarity threshold is greater than one, the timbre features with the highest similarity score among the extracted timbre features or timbre features of each cluster center that are greater than or equal to the similarity threshold can be determined as matching timbre features. Then, the historical timbre features are updated based on the matching timbre features, and the updated historical timbre features are determined as the current timbre features of the target user. The feature update method can be a moving average method, etc.
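The moving-average update mentioned above can be sketched as an exponential moving average. This is an illustrative sketch only; the blending weight `alpha` is a hypothetical tuning parameter, not a value specified by this embodiment:

```python
import numpy as np

def update_historical_timbre(historical, matched, alpha=0.1):
    """Blend the matched timbre feature into the stored historical one.

    A small alpha lets the stored feature drift gradually toward the
    user's current voice, avoiding the abrupt timbre changes the text
    warns about.
    """
    historical = np.asarray(historical, dtype=float)
    matched = np.asarray(matched, dtype=float)
    return (1.0 - alpha) * historical + alpha * matched
```

Each time a match occurs, the updated vector replaces the stored historical feature and then serves as the current timbre feature.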

本实施例中,通过上述步骤,可以更加健壮地处理各种音色特征匹配的情况。即使在某些情况下出现不匹配,也能够通过选择历史音色特征来保持功能正常运行,从而避免了由于无法识别音色而导致的误操作。同时该方式减少了用户需要手动调整或确认音色特征的情况,简化了用户的操作流程。在匹配时,通过更新历史音色特征,实现了音色特征随目标用户的真实发音情况动态更新,避免了音色突变。In this embodiment, through the above steps, various timbre feature matching situations can be handled more robustly. Even if mismatch occurs in some cases, the function can be kept running normally by selecting the historical timbre features, thereby avoiding erroneous operations caused by the inability to recognize the timbre. At the same time, this method reduces the need for users to manually adjust or confirm the timbre features, simplifying the user's operation process. During matching, by updating the historical timbre features, the timbre features are dynamically updated according to the actual pronunciation of the target user, avoiding sudden changes in timbre.

步骤220、获取聚类中心的数量,得到中心数量。Step 220: Obtain the number of cluster centers to obtain the number of centers.

具体地,聚类中心的数量指的是在对提取的音色特征进行聚类之后,得到的每个聚类所对应的聚类中心的个数,也就是中心数量。Specifically, the number of cluster centers refers to the number of cluster centers corresponding to each cluster obtained after clustering the extracted timbre features, that is, the number of centers.

具体实现中,在使用聚类算法对提取的音色特征进行聚类,得到聚类中心的音色特征的同时也得到了聚类中心的数量,即中心数量,该数量是聚类过程中聚类算法根据数据分布和设定的聚类参数自动计算得出的。In the specific implementation, the extracted timbre features are clustered using a clustering algorithm to obtain the timbre features of the cluster centers and the number of cluster centers, i.e., the number of centers, which is automatically calculated by the clustering algorithm according to the data distribution and the set clustering parameters during the clustering process.

步骤221、判断聚类中心的数量是否等于一。Step 221: Determine whether the number of cluster centers is equal to one.

若等于,则执行步骤222;若不等于,则执行步骤223。If they are equal, execute step 222; if they are not equal, execute step 223.

步骤222、将聚类中心的音色特征确定为目标用户的当前音色特征。Step 222: Determine the timbre feature of the cluster center as the current timbre feature of the target user.

具体实现中,当聚类中心的数量为一时,说明所有提取的音色特征在聚类过程中都紧密地聚集于这一个中心点附近,意味着这些音色特征具有很高的相似性,所以可以直接将该聚类中心的音色特征确定为目标用户的当前音色特征。In the specific implementation, when the number of cluster centers is one, it means that all the extracted timbre features are tightly clustered around this center point during the clustering process, which means that these timbre features have high similarity, so the timbre features of the cluster center can be directly determined as the current timbre features of the target user.

本实施例中,当聚类中心的数量为一时,代表这些特征形成了一个稳定的、具有代表性的音色类别。因此,将这个聚类中心的音色特征作为目标用户的当前音色特征,不仅符合数据分布的实际情况,而且可以准确地反映目标用户的音色特点。In this embodiment, when the number of cluster centers is one, it means that these features form a stable and representative timbre category. Therefore, taking the timbre feature of this cluster center as the current timbre feature of the target user not only conforms to the actual data distribution, but also can accurately reflect the timbre characteristics of the target user.

步骤223、判断是否存在目标聚类中心。Step 223: Determine whether there is a target cluster center.

若存在，则执行步骤224；若不存在，则执行步骤225。If it exists, execute step 224; if it does not exist, execute step 225.

具体地,目标聚类中心下的音色特征数量与其他聚类中心下的音色特征数量的差值超过预设数量阈值。预设数量阈值指的是根据实际情况或需求而设置的,用于判断不同聚类中心之间音色特征数量差异是否显著的界限值。例如:当各聚类中心下的音色特征数量的范围在100到500时,预设数量阈值可以为200。聚类中心下的音色特征数量指的是经过聚类后,每个聚类下所包含的数据量,即属于该聚类下所有音色特征点的数量。Specifically, the difference between the number of timbre features under the target cluster center and the number of timbre features under other cluster centers exceeds a preset quantity threshold. The preset quantity threshold refers to a boundary value set according to actual conditions or needs, which is used to determine whether the difference in the number of timbre features between different cluster centers is significant. For example: when the number of timbre features under each cluster center ranges from 100 to 500, the preset quantity threshold may be 200. The number of timbre features under a cluster center refers to the amount of data contained in each cluster after clustering, that is, the number of all timbre feature points belonging to the cluster.

具体实现中,当聚类中心的数量大于一时,代表着提取的音色特征在音色特征的数据集中形成了多个明显的簇。所以为了确定目标用户的当前音色特征,可以先计算每个聚类中心下的音色特征数量,即统计属于各个聚类中心的音色特征点的数量,以此来判断是否存在目标聚类中心。In the specific implementation, when the number of cluster centers is greater than one, it means that the extracted timbre features form multiple obvious clusters in the timbre feature data set. Therefore, in order to determine the current timbre features of the target user, the number of timbre features under each cluster center can be calculated first, that is, the number of timbre feature points belonging to each cluster center can be counted to determine whether there is a target cluster center.

步骤224、确定目标聚类中心的音色特征为目标用户的当前音色特征。Step 224: Determine the timbre feature of the target cluster center as the current timbre feature of the target user.

具体实现中，在确定存在目标聚类中心之后，代表该目标聚类中心下的音色特征数量显著多于其他聚类中心，即该目标聚类中心下的音色特征数量远大于其他聚类中心，说明这些占据多数的音色特征更有可能来自于目标用户，则可以将音色特征数量最多的该目标聚类中心所对应的音色特征确定为目标用户的当前音色特征。In the specific implementation, after determining the existence of a target cluster center, it means that the number of timbre features under the target cluster center is significantly greater than that of other cluster centers, that is, the number of timbre features under the target cluster center is much larger than that of other cluster centers, indicating that these timbre features that occupy the majority are more likely to come from the target user. Then, the timbre features corresponding to the target cluster center with the largest number of timbre features can be determined as the current timbre features of the target user.

步骤225、将语音合成系统默认音色特征确定为目标用户的当前音色特征。Step 225: Determine the default timbre feature of the speech synthesis system as the current timbre feature of the target user.

具体实现中,在确定不存在目标聚类中心之后,说明无法确定目标用户的当前音色特征,此时将语音合成系统默认音色特征确定为目标用户的当前音色特征。In a specific implementation, after determining that there is no target cluster center, it means that the current timbre feature of the target user cannot be determined. At this time, the default timbre feature of the speech synthesis system is determined as the current timbre feature of the target user.

本实施例中,当聚类中心的数量大于一时,通过计算每个聚类中心下的音色特征数量,来判断是否存在目标聚类中心,若存在,则代表存在更符合目标用户实际音色的音色特征,所以可以确定目标聚类中心的音色特征为目标用户的当前音色特征。若不存在,则代表此时各音色特征分布散乱、无法确定目标用户当前音色特征,所以可以回退为使用系统默认音色特征进行合成,避免了使用错误音色特征进行合成可能带来的误解,更能改善用户体验。In this embodiment, when the number of cluster centers is greater than one, the number of timbre features under each cluster center is calculated to determine whether there is a target cluster center. If there is, it means that there is a timbre feature that is more consistent with the actual timbre of the target user, so the timbre feature of the target cluster center can be determined as the current timbre feature of the target user. If it does not exist, it means that the distribution of the timbre features is scattered and the current timbre feature of the target user cannot be determined, so it can be returned to the system default timbre feature for synthesis, avoiding the misunderstanding caused by using the wrong timbre feature for synthesis, and improving the user experience.
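Steps 220 through 225 amount to the following decision rule. This is an illustrative sketch only: the cluster labels come from whatever clustering algorithm was chosen earlier, and the quantity threshold of 200 is just the example value given above:

```python
import numpy as np

def select_current_timbre(centers, labels, default_timbre, gap_threshold=200):
    """Pick the target user's current timbre feature from clustering output.

    - One cluster center: use it directly (step 222).
    - Several centers: use the largest cluster only if its feature count
      exceeds every other cluster's count by more than gap_threshold
      (steps 223-224); otherwise fall back to the speech synthesis
      system's default timbre feature (step 225).
    """
    if len(centers) == 1:
        return centers[0]
    counts = np.bincount(labels, minlength=len(centers))
    order = np.argsort(counts)[::-1]          # cluster indices, largest first
    top, runner_up = counts[order[0]], counts[order[1]]
    if top - runner_up > gap_threshold:       # a target cluster center exists
        return centers[order[0]]
    return default_timbre                     # distribution too scattered
```

Comparing the largest count against the runner-up is enough to guarantee it exceeds every other cluster's count by the threshold.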

步骤226、将目标用户的当前音色特征和文本信息输入语音生成模型,以生成具有目标用户的当前音色特征的目标音频。Step 226: Input the current timbre characteristics and text information of the target user into the speech generation model to generate target audio with the current timbre characteristics of the target user.

因此，本发明的技术方法根据接收到的文本信息确定目标用户的用户标识，为之后得到目标用户的历史音频信息提供了数据基础。之后为了保护用户的隐私权，在获取目标用户的历史音频信息之前，基于用户标识生成目标用户的历史音频信息的获取请求，并得到请求结果。在请求结果为同意获取的情况下，才根据用户标识获取目标用户的历史音频信息，保证了当前执行获取目标用户的历史音频信息操作的设备具备获取目标用户的历史音频信息的权限。然后通过从历史音频信息中滤除杂音，得到初始人声音频信息，为之后得到目标人声音频信息提供了数据基础。再将初始人声音频信息按照单人语音分离和/或分段，得到目标人声音频信息，为之后提取每个目标人声音频信息的音色特征提供了数据基础。之后通过提取每个目标人声音频信息的音色特征，为之后确定目标用户的当前音色特征提供了数据基础。然后对提取的音色特征进行聚类，可以将相似的音色特征归为一类，得到聚类中心的音色特征，简化了数据，从而降低了后续处理的复杂度，提高了工作效率。之后判断是否存在目标用户的历史音色特征，以此来适应不同场景下的需求。当存在目标用户的历史音色特征时，将目标用户的历史音色特征与提取的音色特征进行匹配，或将目标用户的历史音色特征与各聚类中心的音色特征进行匹配，根据匹配结果确定目标用户的当前音色特征，可以使确定的目标用户的当前音色特征更具个性化，从而为用户提供更加个性化的服务。当不存在目标用户的历史音色特征时，获取聚类中心的数量，得到中心数量。判断聚类中心的数量是否等于一。当聚类中心的数量为一时，将聚类中心的音色特征确定为目标用户的当前音色特征，准确地反映了目标用户的音色特点。当聚类中心的数量大于一时，判断是否存在目标聚类中心，若存在，则代表存在更符合目标用户实际音色的音色特征，所以可以确定目标聚类中心的音色特征为目标用户的当前音色特征。若不存在，则代表此时各音色特征分布散乱、无法确定目标用户当前音色特征，所以可以回退为使用系统默认音色特征进行合成，避免了使用错误音色特征进行合成可能带来的误解，更能改善用户体验。最后将目标用户的当前音色特征和文本信息输入语音生成模型，以生成符合目标用户个性化音色特征的目标音频，使得目标音频听起来就像是由目标用户本人亲自说出的一样，极大地增强了音频内容的个性化和真实感，从而提高了用户体验。相比于现有技术虽然可以将文字转换为语音，但通常只能使用预设或通用的语音库来将文字转换为语音，缺乏个性化和真实感，从而影响用户的体验。本发明根据接收到的文本信息来获取目标用户的历史音频信息，再从历史音频信息中提取音色特征，之后根据提取的音色特征来确定目标用户的当前音色特征，然后将目标用户的当前音色特征和文本信息输入语音生成模型，以生成具有目标用户的当前音色特征的目标音频，可以确保生成的音频在音色上更加贴近目标用户的真实声音，为用户带来更加自然、真实的音色体验。因此，本发明可以解决利用现有的文字生成语音技术所生成的音频缺乏个性化和真实感的问题。Therefore, the technical method of the present invention determines the user identification of the target user according to the received text information, and provides a data basis for obtaining the historical audio information of the target user. In order to protect the privacy of the user, before obtaining the historical audio information of the target user, a request for obtaining the historical audio information of the target user is generated based on the user identification, and the request result is obtained. 
When the request result is to agree to obtain, the historical audio information of the target user is obtained according to the user identification, which ensures that the device currently performing the operation of obtaining the historical audio information of the target user has the authority to obtain the historical audio information of the target user. Then, by filtering out noise from the historical audio information, the initial human voice audio information is obtained, which provides a data basis for obtaining the target human voice audio information. Then, the initial human voice audio information is separated and/or segmented according to the single voice to obtain the target human voice audio information, which provides a data basis for extracting the timbre features of each target human voice audio information. Then, by extracting the timbre features of each target human voice audio information, a data basis is provided for determining the current timbre features of the target user. Then, the extracted timbre features are clustered, and similar timbre features can be classified into one category to obtain the timbre features of the cluster center, which simplifies the data, thereby reducing the complexity of subsequent processing and improving work efficiency. Then, it is determined whether there are historical timbre features of the target user, so as to adapt to the needs in different scenarios. When the historical timbre features of the target user exist, the historical timbre features of the target user are matched with the extracted timbre features, or the historical timbre features of the target user are matched with the timbre features of each cluster center, and the current timbre features of the target user are determined according to the matching results, so that the current timbre features of the determined target user can be more personalized, thereby providing users with more personalized services. 
When the historical timbre features of the target user do not exist, the number of cluster centers is obtained to obtain the number of centers. It is determined whether the number of cluster centers is equal to one. When the number of cluster centers is one, the timbre features of the cluster center are determined as the current timbre features of the target user, accurately reflecting the timbre characteristics of the target user. When the number of cluster centers is greater than one, it is determined whether there is a target cluster center. If so, it means that there is a timbre feature that is more in line with the actual timbre of the target user, so the timbre features of the target cluster center can be determined as the current timbre features of the target user. If it does not exist, it means that the distribution of each timbre feature is scattered and the current timbre feature of the target user cannot be determined. Therefore, it can be reverted to the system default timbre feature for synthesis, avoiding the misunderstanding caused by using the wrong timbre feature for synthesis, and improving the user experience. Finally, the current timbre feature and text information of the target user are input into the speech generation model to generate a target audio that meets the personalized timbre feature of the target user, so that the target audio sounds like it is spoken by the target user himself, which greatly enhances the personalization and authenticity of the audio content, thereby improving the user experience. Compared with the prior art, although text can be converted into speech, usually only a preset or general voice library can be used to convert text into speech, which lacks personalization and authenticity, thereby affecting the user experience. 
The present invention obtains the historical audio information of the target user according to the received text information, and then extracts the timbre feature from the historical audio information, and then determines the current timbre feature of the target user according to the extracted timbre feature, and then inputs the current timbre feature and text information of the target user into the speech generation model to generate a target audio with the current timbre feature of the target user, which can ensure that the generated audio is closer to the real voice of the target user in timbre, and brings a more natural and real timbre experience to the user. Therefore, the present invention can solve the problem that the audio generated by the existing text-to-speech technology lacks personalization and realism.

另外,可以在确定匹配结果为匹配后,先确定匹配音色特征,然后基于匹配音色特征对历史音色特征进行更新,并将更新后的历史音色特征确定为目标用户的当前音色特征,这样不仅可以实现音色特征随目标用户的真实发音情况动态更新,还可以避免音色突变,从而为用户提供更加自然、真实的音色体验,提高用户的满意度。In addition, after determining that the matching result is a match, the matching timbre features can be determined first, and then the historical timbre features can be updated based on the matching timbre features, and the updated historical timbre features can be determined as the current timbre features of the target user. This not only allows the timbre features to be dynamically updated according to the actual pronunciation of the target user, but also avoids sudden changes in timbre, thereby providing users with a more natural and realistic timbre experience and improving user satisfaction.

图3为本发明实施例提供的一种音频生成装置的结构示意图,该装置与上述各实施例的音频生成方法属于同一个发明构思,在音频生成装置的实施例中未详尽描述的细节内容,可以参考上述音频生成方法的实施例。Figure 3 is a structural schematic diagram of an audio generating device provided in an embodiment of the present invention. The device and the audio generating methods of the above-mentioned embodiments belong to the same inventive concept. For details not described in detail in the embodiment of the audio generating device, reference can be made to the embodiment of the above-mentioned audio generating method.

如图3所示,该装置包括:As shown in FIG3 , the device comprises:

获取模块310,用于响应接收到的文本信息,获取目标用户的历史音频信息,所述文本信息由所述目标用户通过目标终端发送;The acquisition module 310 is used to acquire historical audio information of a target user in response to the received text information, wherein the text information is sent by the target user through a target terminal;

提取模块320,用于从所述历史音频信息中提取音色特征;An extraction module 320, for extracting timbre features from the historical audio information;

聚类模块330,用于对提取的音色特征进行聚类,以确定聚类中心的音色特征;A clustering module 330, for clustering the extracted timbre features to determine the timbre features of the cluster center;

确定模块340,用于根据所述聚类中心的音色特征确定所述目标用户的当前音色特征;A determination module 340, configured to determine the current timbre feature of the target user according to the timbre feature of the cluster center;

生成模块350,用于将所述目标用户的当前音色特征和所述文本信息输入语音生成模型,以生成具有所述目标用户的当前音色特征的目标音频。The generation module 350 is used to input the current timbre characteristics of the target user and the text information into a speech generation model to generate a target audio with the current timbre characteristics of the target user.

在上述实施例的基础上,获取模块310具体用于:Based on the above embodiment, the acquisition module 310 is specifically used for:

根据接收到的文本信息确定所述目标用户的用户标识;Determining a user identifier of the target user according to the received text information;

基于所述用户标识生成所述目标用户的历史音频信息的获取请求,并得到请求结果;Generate a request for obtaining the historical audio information of the target user based on the user identifier, and obtain a request result;

在所述请求结果为同意获取的情况下,根据所述用户标识获取所述目标用户的历史音频信息。When the request result is that the acquisition is approved, the historical audio information of the target user is acquired according to the user identifier.

在上述实施例的基础上,提取模块320具体用于:Based on the above embodiment, the extraction module 320 is specifically used for:

从所述历史音频信息中滤除杂音,得到初始人声音频信息;Filtering out noise from the historical audio information to obtain initial human voice audio information;

将所述初始人声音频信息按照单人语音分离和/或分段,得到目标人声音频信息;Separating and/or segmenting the initial human voice audio information according to the individual voices to obtain target human voice audio information;

提取每个所述目标人声音频信息的音色特征。The timbre features of each target human voice audio information are extracted.

在上述实施例的基础上,聚类模块330具体用于:Based on the above embodiment, the clustering module 330 is specifically used for:

对提取的音色特征进行聚类;Clustering the extracted timbre features;

将每个聚类内音色特征的平均值确定为当前聚类中心的音色特征。The average value of the timbre features in each cluster is determined as the timbre feature of the current cluster center.

在上述实施例的基础上,该装置还包括:Based on the above embodiment, the device further includes:

判断模块,用于根据所述聚类中心的音色特征确定所述目标用户的当前音色特征之前判断是否存在所述目标用户的历史音色特征;若不存在所述目标用户的历史音色特征,则触发执行根据所述聚类中心的音色特征确定所述目标用户的当前音色特征的步骤。A judgment module is used to judge whether there is a historical timbre feature of the target user before determining the current timbre feature of the target user based on the timbre feature of the cluster center; if the historical timbre feature of the target user does not exist, triggering the step of determining the current timbre feature of the target user based on the timbre feature of the cluster center.

在上述实施例的基础上,该装置还包括:Based on the above embodiment, the device further includes:

匹配模块，用于在判断是否存在所述目标用户的历史音色特征之后，若存在所述目标用户的历史音色特征，则将所述目标用户的历史音色特征与提取的音色特征进行匹配，或将所述目标用户的历史音色特征与各聚类中心的音色特征进行匹配，根据匹配结果确定所述目标用户的当前音色特征。A matching module is used to, after determining whether the historical timbre features of the target user exist, match the historical timbre features of the target user with the extracted timbre features if they do exist, or match the historical timbre features of the target user with the timbre features of each cluster center, and determine the current timbre features of the target user according to the matching results.

在上述实施例的基础上,匹配模块根据匹配结果确定所述目标用户的当前音色特征,包括:Based on the above embodiment, the matching module determines the current timbre characteristics of the target user according to the matching result, including:

若所述匹配结果为不匹配,则不进行操作;If the matching result is no match, no operation is performed;

若所述匹配结果为匹配,则使用匹配的音色特征对所述目标用户的历史音色特征进行更新,完成音色特征的匹配与更新后,将所述目标用户的历史音色特征确定为所述目标用户的当前音色特征。If the matching result is a match, the historical timbre feature of the target user is updated using the matched timbre feature. After the timbre feature is matched and updated, the historical timbre feature of the target user is determined as the current timbre feature of the target user.

在上述实施例的基础上,确定模块340具体用于:Based on the above embodiment, the determination module 340 is specifically used for:

获取所述聚类中心的数量,得到中心数量;Obtaining the number of cluster centers to obtain the number of centers;

当所述中心数量等于一时,将所述聚类中心的音色特征确定为所述目标用户的当前音色特征;When the number of centers is equal to one, determining the timbre feature of the cluster center as the current timbre feature of the target user;

当所述中心数量大于一时,判断是否存在目标聚类中心,所述目标聚类中心下的音色特征数量与其他聚类中心下的音色特征数量的差值超过预设数量阈值;当存在所述目标聚类中心时,确定该所述目标聚类中心的音色特征为所述目标用户的当前音色特征;否则,无法确定所述目标用户的当前音色特征,将语音合成系统默认音色特征确定为所述目标用户的当前音色特征。When the number of centers is greater than one, determine whether there is a target cluster center, and the difference between the number of timbre features under the target cluster center and the number of timbre features under other cluster centers exceeds a preset quantity threshold; when the target cluster center exists, determine the timbre features of the target cluster center as the current timbre features of the target user; otherwise, the current timbre features of the target user cannot be determined, and the default timbre features of the speech synthesis system are determined as the current timbre features of the target user.

本发明实施例所提供的音频生成装置可执行本发明任意实施例所提供的音频生成方法,具备执行方法相应的功能模块和有益效果。The audio generating device provided in the embodiment of the present invention can execute the audio generating method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the execution method.

值得注意的是,上述音频生成装置的实施例中,所包括的各个单元和模块只是按照功能逻辑进行划分的,但并不局限于上述的划分,只要能够实现相应的功能即可;另外,各功能单元的具体名称也只是为了便于相互区分,并不用于限制本发明的保护范围。It is worth noting that in the embodiment of the above-mentioned audio generating device, the various units and modules included are only divided according to functional logic, but are not limited to the above-mentioned division, as long as the corresponding functions can be achieved; in addition, the specific names of the functional units are only for the convenience of distinguishing each other, and are not used to limit the scope of protection of the present invention.

图4为本发明实施例提供的一种电子设备的结构示意图。图4示出了适于用来实现本发明实施方式的示例性电子设备4的框图。图4显示的电子设备4仅仅是一个示例,不应对本发明实施例的功能和使用范围带来任何限制。FIG4 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present invention. FIG4 shows a block diagram of an exemplary electronic device 4 suitable for implementing an embodiment of the present invention. The electronic device 4 shown in FIG4 is only an example and should not bring any limitation to the functions and scope of use of the embodiment of the present invention.

如图4所示,电子设备4以通用计算电子设备的形式表现。电子设备4的组件可以包括但不限于:一个或者多个处理器或者处理单元16,系统存储器28,连接不同系统组件(包括系统存储器28和处理单元16)的总线18。As shown in Fig. 4, electronic device 4 is in the form of a general purpose computing electronic device. Components of electronic device 4 may include, but are not limited to: one or more processors or processing units 16, system memory 28, and bus 18 connecting different system components (including system memory 28 and processing unit 16).

总线18表示几类总线结构中的一种或多种,包括存储器总线或者存储器控制器,外围总线,图形加速端口,处理器或者使用多种总线结构中的任意总线结构的局域总线。举例来说,这些体系结构包括但不限于工业标准体系结构(ISA)总线,微通道体系结构(MAC)总线,增强型ISA总线、视频电子标准协会(VESA)局域总线以及外围组件互连(PCI)总线。Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor or a local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MAC) bus, an Enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.

电子设备4典型地包括多种计算机系统可读介质。这些介质可以是任何能够被电子设备4访问的可用介质,包括易失性和非易失性介质,可移动的和不可移动的介质。The electronic device 4 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the electronic device 4, including volatile and non-volatile media, removable and non-removable media.

The system memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The electronic device 4 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read and write a non-removable, non-volatile magnetic medium (not shown in FIG. 4, commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 via one or more data media interfaces. The system memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the various embodiments of the present invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the system memory 28; such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described herein.

The electronic device 4 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device 4, and/or with any device (e.g., a network card, a modem, etc.) that enables the electronic device 4 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface 22. Furthermore, the electronic device 4 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 20. As shown in FIG. 4, the network adapter 20 communicates with the other modules of the electronic device 4 via the bus 18. It should be understood that, although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with the electronic device 4, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

The processing unit 16 executes various functional applications and page display by running programs stored in the system memory 28, for example implementing the audio generation method provided in the embodiments of the present invention, which includes:

in response to received text information, acquiring historical audio information of a target user, wherein the text information is sent by the target user through a target terminal;

extracting timbre features from the historical audio information;

clustering the extracted timbre features to determine timbre features of cluster centers;

determining a current timbre feature of the target user according to the timbre features of the cluster centers;

inputting the current timbre feature of the target user and the text information into a speech generation model to generate target audio having the current timbre feature of the target user.
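The clustering steps above can be sketched as follows. This is a minimal illustration only, not the patent's implementation: plain k-means in NumPy stands in for whatever clustering algorithm the embodiment actually uses, the synthetic 4-dimensional vectors stand in for real timbre embeddings, and all function names here are hypothetical.

```python
import numpy as np

def cluster_timbre_features(features: np.ndarray, k: int = 2,
                            iters: int = 50, seed: int = 0):
    """Plain k-means over per-utterance timbre feature vectors.

    Returns (centers, labels): each center is the mean of the timbre
    features assigned to it, matching the description's definition of a
    cluster-center timbre feature.
    """
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each feature vector to its nearest center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members.
        new_centers = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

def current_timbre_feature(features: np.ndarray, k: int = 2) -> np.ndarray:
    """Pick the dominant cluster's centroid as the user's current timbre."""
    centers, labels = cluster_timbre_features(features, k=k)
    counts = np.bincount(labels, minlength=k)
    return centers[counts.argmax()]
```

With eight utterances from one speaker and three from another, the dominant cluster's mean is returned as the current timbre feature.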

Of course, those skilled in the art will understand that the processor may also implement the technical solution of the audio generation method provided in any embodiment of the present invention.

An embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements, for example, the audio generation method provided in the embodiments of the present invention, which includes:

in response to received text information, acquiring historical audio information of a target user, wherein the text information is sent by the target user through a target terminal;

extracting timbre features from the historical audio information;

clustering the extracted timbre features to determine timbre features of cluster centers;

determining a current timbre feature of the target user according to the timbre features of the cluster centers;

inputting the current timbre feature of the target user and the text information into a speech generation model to generate target audio having the current timbre feature of the target user.

The computer storage medium of the embodiments of the present invention may adopt any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium containing or storing a program for use by, or in connection with, an instruction execution system, apparatus, or device.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.

Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for performing the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partially on the user's computer, as a stand-alone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

Those of ordinary skill in the art should understand that the above modules or steps of the present invention may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be made into individual integrated circuit modules, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

In addition, the acquisition, storage, use, and processing of data in the technical solution of the present invention comply with the relevant provisions of national laws and regulations.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to the above embodiments and may include further equivalent embodiments without departing from the concept of the present invention; the scope of the present invention is determined by the scope of the appended claims.

Claims (11)

1. A method of audio generation, the method comprising the steps of:
in response to received text information, acquiring historical audio information of a target user, wherein the text information is sent by the target user through a target terminal;
extracting timbre features from the historical audio information;
clustering the extracted timbre features to determine timbre features of a cluster center;
determining a current timbre feature of the target user according to the timbre features of the cluster center;
inputting the current timbre feature of the target user and the text information into a speech generation model to generate target audio having the current timbre feature of the target user.
2. The audio generation method according to claim 1, wherein acquiring the historical audio information of the target user in response to the received text information comprises:
determining a user identification of the target user according to the received text information;
generating an acquisition request for the historical audio information of the target user based on the user identification, and obtaining a request result;
and acquiring the historical audio information of the target user according to the user identification when the request result indicates that acquisition is consented to.
3. The audio generation method of claim 1, wherein extracting timbre features from the historical audio information comprises:
filtering noise from the historical audio information to obtain initial voice audio information;
separating and/or segmenting the initial voice audio information by individual speaker to obtain target voice audio information;
and extracting a timbre feature from each piece of target voice audio information.
4. The audio generation method of claim 1, wherein clustering the extracted timbre features to determine timbre features of a cluster center comprises:
clustering the extracted timbre features;
and determining the average value of the timbre features in each cluster as the timbre feature of the corresponding cluster center.
5. The audio generation method of claim 1, further comprising, before determining the current timbre feature of the target user from the timbre features of the cluster center:
judging whether a historical timbre feature of the target user exists;
and if no historical timbre feature of the target user exists, triggering execution of the step of determining the current timbre feature of the target user according to the timbre features of the cluster center.
6. The audio generation method of claim 5, further comprising, after judging whether a historical timbre feature of the target user exists:
if a historical timbre feature of the target user exists, matching the historical timbre feature of the target user with the extracted timbre features, or matching the historical timbre feature of the target user with the timbre feature of each cluster center, and determining the current timbre feature of the target user according to the matching result.
7. The audio generation method of claim 6, wherein determining the current timbre feature of the target user based on the matching result comprises:
if the matching result is a non-match, performing no operation;
and if the matching result is a match, updating the historical timbre feature of the target user with the matched timbre feature, and after the matching and updating of the timbre feature are completed, determining the historical timbre feature of the target user as the current timbre feature of the target user.
8. The audio generation method of claim 1, wherein determining the current timbre feature of the target user from the timbre features of the cluster center comprises:
acquiring the number of cluster centers to obtain a center count;
when the center count is equal to one, determining the timbre feature of that cluster center as the current timbre feature of the target user;
when the center count is greater than one, judging whether a target cluster center exists for which the difference between the number of timbre features under the target cluster center and the number of timbre features under every other cluster center exceeds a preset count threshold; when the target cluster center exists, determining the timbre feature of the target cluster center as the current timbre feature of the target user; otherwise, the current timbre feature of the target user cannot be determined, and the default timbre feature of the speech synthesis system is determined as the current timbre feature of the target user.
9. An audio generating apparatus, comprising:
an acquisition module, configured to acquire historical audio information of a target user in response to received text information, wherein the text information is sent by the target user through a target terminal;
an extraction module, configured to extract timbre features from the historical audio information;
a clustering module, configured to cluster the extracted timbre features to determine timbre features of a cluster center;
a determining module, configured to determine a current timbre feature of the target user according to the timbre features of the cluster center;
and a generation module, configured to input the current timbre feature of the target user and the text information into a speech generation model to generate target audio having the current timbre feature of the target user.
10. An electronic device, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the audio generation method of any one of claims 1-8.
11. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the audio generation method of any one of claims 1-8.
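The selection rule of claim 8 can be sketched as follows. This is a hedged illustration only: the `count_threshold` value and the zero-vector default timbre are placeholder assumptions, and the helper name is hypothetical, not part of the claimed method.

```python
import numpy as np

def select_current_timbre(centers, labels, count_threshold=2, default=None):
    """Sketch of claim 8's decision rule for picking the current timbre.

    - Exactly one cluster center: use its timbre feature directly.
    - Several centers: use the center whose cluster holds more timbre
      features than every other cluster by more than `count_threshold`;
      if no such dominant cluster exists, fall back to the speech
      synthesis system's default timbre feature.
    """
    centers = np.asarray(centers)
    if default is None:
        default = np.zeros(centers.shape[1])  # stand-in for the system default
    if len(centers) == 1:
        return centers[0]
    counts = np.bincount(labels, minlength=len(centers))
    order = np.argsort(counts)[::-1]          # clusters by size, largest first
    if counts[order[0]] - counts[order[1]] > count_threshold:
        return centers[order[0]]              # dominant cluster found
    return default                            # no clear winner: default timbre
```

For example, with clusters of 8 and 3 timbre features and a threshold of 2, the larger cluster's center is selected; with clusters of 5 and 4, no cluster dominates and the default timbre is used instead.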
CN202410777954.7A 2024-06-17 2024-06-17 Audio generation method, device, equipment and storage medium Pending CN118609536A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410777954.7A CN118609536A (en) 2024-06-17 2024-06-17 Audio generation method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN118609536A true CN118609536A (en) 2024-09-06

Family

ID=92555002


Country Status (1)

Country Link
CN (1) CN118609536A (en)

Similar Documents

Publication Publication Date Title
WO2021139425A1 (en) Voice activity detection method, apparatus and device, and storage medium
KR100636317B1 (en) Distributed speech recognition system and method
WO2020211354A1 (en) Speaker identity recognition method and device based on speech content, and storage medium
CN111128223A (en) Text information-based auxiliary speaker separation method and related device
CN105679310A (en) Method and system for speech recognition
CN107886951B (en) Voice detection method, device and equipment
CN111145763A (en) GRU-based voice recognition method and system in audio
CN110956965A (en) A personalized smart home security control system and method based on voiceprint recognition
CN110019848A (en) Conversation interaction method and device and robot
CN112992153B (en) Audio processing method, voiceprint recognition device and computer equipment
CN112735381B (en) Model updating method and device
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
WO2021051533A1 (en) Address information-based blacklist identification method, apparatus, device, and storage medium
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN113314103A (en) Illegal information identification method and device based on real-time speech emotion analysis
CN117877510A (en) Voice automatic test method, device, electronic equipment and storage medium
CN118609536A (en) Audio generation method, device, equipment and storage medium
CN114155845A (en) Service determination method and device, electronic equipment and storage medium
CN114283818A (en) Voice interaction method, device, equipment, storage medium and program product
WO2022041177A1 (en) Communication message processing method, device, and instant messaging client
CN114400009B (en) Voiceprint recognition method and device and electronic equipment
CN115240657B (en) A voice processing method, device, equipment and storage medium
CN114678040B (en) Voice consistency detection method, device, equipment and storage medium
CN112820274B (en) Voice information recognition correction method and system
CN116915894A (en) Incoming call identity recognition method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination