
CN118093930A - Audio resource retrieval method, device, equipment and storage medium - Google Patents

Audio resource retrieval method, device, equipment and storage medium Download PDF

Info

Publication number
CN118093930A
CN118093930A
Authority
CN
China
Prior art keywords
audio
text
preset
feature vector
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410281641.2A
Other languages
Chinese (zh)
Inventor
黄晓荣
赵鸿含
梁伟健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202410281641.2A priority Critical patent/CN118093930A/en
Publication of CN118093930A publication Critical patent/CN118093930A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • G06F16/635Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an audio resource retrieval method, apparatus, device, and storage medium, relating to the technical field of audio processing. The method includes: acquiring input retrieval information; extracting the feature vector of the retrieval information with a preset text-audio model to obtain a retrieval-information feature vector; and matching the retrieval-information feature vector against the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result, where the preset retrieval database stores the audio feature vector of each audio, extracted in advance with the same preset text-audio model. According to the feature matching result, the audio corresponding to the retrieval-information feature vector is determined as a first target audio, and a first retrieval result of the retrieval information is generated. By matching the retrieval-information feature vector against the audio feature vector of each audio stored in the preset retrieval database, the first target audio can be determined accurately.

Description

Audio resource retrieval method, device, equipment and storage medium
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for audio resource retrieval.
Background
In current game sound-effect design, managing and retrieving sound effects relies on manually written audio descriptions and audio tags. Because descriptions and tags must be written by hand, teams must agree on a unified description standard, the accuracy of each description requires manual secondary correction, and maintenance costs are high. In addition, the retrieval mode is fixed: the retrieval process can rely only on text keyword matching, and audio cannot be retrieved from the perspective of how it sounds. Prior-art retrieval of sound effects is therefore inefficient, and if an audio description is incomplete or its wording is inaccurate, the desired resource cannot be retrieved.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing an audio resource retrieval method, apparatus, device, and storage medium, so that the first target audio corresponding to a retrieval-information feature vector can be determined accurately by matching that vector against the audio feature vector of each audio stored in a preset retrieval database, and the retrieved audio better fits the audio application scene described by the user's input text.
In order to achieve the above purpose, the technical scheme adopted by the embodiment of the application is as follows:
In a first aspect, an embodiment of the present application provides an audio resource retrieval method, including:
acquiring input retrieval information;
extracting feature vectors of the retrieval information by adopting a preset text audio model to obtain retrieval information feature vectors;
Matching the retrieval information feature vector with an audio feature vector of each audio in a preset retrieval database to obtain a feature matching result; wherein, the preset search database stores: the audio feature vector of each audio is obtained by extracting the feature vector of each audio by adopting the preset text audio model in advance;
According to the feature matching result, determining the audio corresponding to the retrieval information feature vector as a first target audio;
and generating a first search result of the search information according to the first target audio.
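As a minimal sketch of the five steps above, the following Python uses a hypothetical `embed` stand-in for the preset text-audio model (a hash-based unit vector, not a real encoder) and cosine similarity as the matching rule; all names and the threshold are illustrative assumptions, not part of the application:

```python
import numpy as np

def embed(item, dim=8):
    # Hypothetical stand-in for the preset text-audio model's encoder:
    # it maps any text or audio identifier to a deterministic unit vector.
    # A real system would use a trained joint text-audio network instead.
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve(query, database, threshold=0.5):
    # Steps 2-5 of the method: embed the query, match it against every
    # stored audio feature vector (cosine similarity of unit vectors),
    # keep the audios above the threshold, and rank them as the result.
    q = embed(query)
    scores = {aid: float(q @ v) for aid, v in database.items()}
    hits = [aid for aid, s in scores.items() if s >= threshold]
    return sorted(hits, key=lambda a: scores[a], reverse=True)

# Offline step: the preset retrieval database stores one feature vector per audio.
database = {name: embed(name) for name in ["rain", "footsteps", "sword_swing"]}
result = retrieve("rain", database)
```

Because the query and the stored audio share one embedding space, the same `retrieve` call serves text queries and audio queries alike.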
In an alternative embodiment, if the retrieval information is an input text, extracting the feature vector of the retrieval information with a preset text-audio model to obtain the retrieval-information feature vector includes:
extracting the feature vector of the input text with the preset text-audio model to obtain an input-text feature vector as the retrieval-information feature vector;
and matching the retrieval-information feature vector with the audio feature vector of each audio in the preset retrieval database to obtain a feature matching result includes:
matching the input-text feature vector with the audio feature vector of each audio to obtain the feature matching result.
In an alternative embodiment, if the retrieval information is an input audio, extracting the feature vector of the retrieval information with the preset text-audio model to obtain the retrieval-information feature vector includes:
extracting the feature vector of the input audio with the preset text-audio model to obtain an input-audio feature vector as the retrieval-information feature vector;
and matching the retrieval-information feature vector with the audio feature vector of each audio in the preset retrieval database to obtain a feature matching result includes:
matching the input-audio feature vector with the audio feature vector of each audio to obtain the feature matching result.
In an alternative embodiment, the preset retrieval database further stores a tag of each audio, and the method further includes:
if the retrieval information is an input tag, matching the input tag with the tag of each audio in the preset retrieval database to obtain a tag matching result;
determining, according to the tag matching result, the audio corresponding to the input tag as a second target audio;
and generating a second retrieval result of the input tag according to the second target audio.
In an optional embodiment, before the matching is performed on the input tag and the tag of each audio in the preset search database to obtain a tag matching result, the method further includes:
extracting feature vectors of the preset standard audio classified text by adopting the preset text audio model to obtain first classified text feature vectors;
matching the first classified text feature vector with the audio feature vector of each audio to obtain a first matching result;
According to the first matching result, determining the audio corresponding to the preset standard audio classification text as a third target audio;
And determining the label corresponding to the preset standard audio classification text as the label of the third target audio.
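The tagging embodiment above can be sketched as a zero-shot labeling loop: embed each preset standard classification text, and give every stored audio the tag of the classification text it matches best. The `embed` stand-in, the classification texts, and the threshold are all illustrative assumptions:

```python
import numpy as np

def embed(item, dim=8):
    # Hypothetical stand-in for the preset text-audio model's encoder.
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Preset standard audio classification texts and the tag each carries
# (illustrative values, not from the application).
class_vecs = {"footsteps": embed("sound of footsteps on the ground"),
              "rain": embed("sound of falling rain")}

def auto_tag(audio_vectors, class_vecs, threshold=0.3):
    # For each stored audio, find the classification text whose feature
    # vector matches it best; if the match clears the threshold, the
    # audio becomes a "third target audio" and inherits that text's tag.
    tags = {}
    for aid, v in audio_vectors.items():
        tag, score = max(((t, float(v @ c)) for t, c in class_vecs.items()),
                         key=lambda p: p[1])
        if score >= threshold:
            tags[aid] = tag
    return tags

audio_vectors = {"walk_loop.wav": class_vecs["footsteps"]}  # perfectly matching audio
tags = auto_tag(audio_vectors, class_vecs)
```

The same loop also covers the user-supplied classification-text embodiment: only the source of `class_vecs` changes.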
In an optional embodiment, before the matching is performed on the input tag and the tag of each audio in the preset search database to obtain a tag matching result, the method further includes:
Receiving an input audio classified text and a label corresponding to the input audio classified text;
extracting feature vectors of the input audio classified text by adopting the preset text audio model to obtain second classified text feature vectors;
matching the second classified text feature vector with the audio feature vector of each audio to obtain a second matching result;
According to the second matching result, determining that the audio corresponding to the input audio classification text is fourth target audio;
And determining the label corresponding to the input audio classified text as the label of the fourth target audio.
In an optional embodiment, before extracting the feature vector of the input text by using a preset text audio model to obtain the feature vector of the input text as the feature vector of the search information, the method further includes:
acquiring sample audio and a description text corresponding to the sample audio;
extracting feature vectors of the sample audio and the description text respectively by adopting a preset initial text audio model to obtain sample audio feature vectors and corresponding sample text feature vectors;
Training the initial text audio model according to the sample audio feature vector and the corresponding sample text feature vector to obtain the preset text audio model.
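The training step above pairs sample audio with its description text; one common objective for such joint text-audio models is a symmetric contrastive (InfoNCE) loss, as in CLIP/CLAP-style training. The application does not name a loss, so the following NumPy sketch is an assumption about how that training could look:

```python
import numpy as np

def contrastive_loss(text_vecs, audio_vecs, temperature=0.07):
    # Symmetric InfoNCE objective: row i of each matrix is the embedding of
    # the i-th (description text, sample audio) pair, so matching pairs sit
    # on the diagonal of the pairwise similarity matrix.
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    a = audio_vecs / np.linalg.norm(audio_vecs, axis=1, keepdims=True)
    logits = (t @ a.T) / temperature
    n = len(logits)
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()
    # Average the text-to-audio and audio-to-text cross-entropies.
    return (xent(logits) + xent(logits.T)) / 2
```

Minimizing this loss pulls each text embedding toward the embedding of its paired audio and pushes it away from the other audios in the batch, which is exactly the property the retrieval steps rely on.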
In an optional implementation manner, before the matching is performed on the input text feature vector and the audio feature vector of each audio in the preset search database to obtain a feature matching result, the method further includes:
extracting the feature vector of each audio in a preset audio resource library with the preset text-audio model to obtain the audio feature vector of each audio;
and constructing the preset retrieval database from each audio and its audio feature vector, where the preset audio resource library is a preset audio resource library of a software project, or a network audio resource library.
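The database-construction step above runs once, offline. A minimal sketch, again using a hypothetical hash-based `embed_audio` in place of the preset text-audio model's audio encoder:

```python
import numpy as np

def embed_audio(samples, dim=8):
    # Stand-in for the preset text-audio model's audio encoder; it hashes
    # the raw sample bytes into a deterministic unit vector (illustrative only).
    rng = np.random.default_rng(abs(hash(samples)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def build_retrieval_database(resource_library):
    # resource_library: {audio_id: raw sample bytes}. Embedding every audio
    # once, offline, yields the {audio_id: feature_vector} mapping the preset
    # retrieval database stores, so no model call is needed per query.
    return {aid: embed_audio(samples) for aid, samples in resource_library.items()}

library = {"steps.wav": b"\x01\x02\x03", "rain.wav": b"\x04\x05\x06"}
db = build_retrieval_database(library)
```

In practice the mapping would be persisted (e.g. in a vector index) rather than held in a dict, but the contract is the same: one precomputed feature vector per audio.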
In an optional embodiment, before the generating, according to the first target audio, a first search result of the search information, the method further includes:
and if there are multiple first target audios, randomizing the order of the multiple first target audios to generate the first retrieval result.
In a second aspect, an embodiment of the present application further provides an audio resource retrieval apparatus, where the apparatus includes:
The acquisition module is used for acquiring the input retrieval information;
the extraction module is used for extracting the feature vector of the retrieval information by adopting a preset text audio model to obtain the feature vector of the retrieval information;
The matching module is used for matching the retrieval information feature vector with the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result; wherein, the preset search database stores: the audio feature vector of each audio is obtained by extracting the feature vector of each audio by adopting the preset text audio model in advance;
The determining module is used for determining the audio corresponding to the retrieval information feature vector as a first target audio according to the feature matching result;
and the generation module is used for generating a first search result of the search information according to the first target audio.
In a third aspect, an embodiment of the present application further provides a computer apparatus, including: a processor, a storage medium and a bus, the storage medium storing program instructions executable by the processor, the processor and the storage medium communicating over the bus when the computer device is running, the processor executing the program instructions to perform the steps of the audio resource retrieval method as described in any of the first aspects.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor performs the steps of the audio resource retrieval method according to any of the first aspects.
The beneficial effects of the application are as follows:
Embodiments of the application provide an audio resource retrieval method, apparatus, device, and storage medium. The method includes: acquiring input retrieval information; extracting the feature vector of the retrieval information with a preset text-audio model to obtain a retrieval-information feature vector; matching the retrieval-information feature vector against the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result, where the preset retrieval database stores the audio feature vector of each audio, extracted in advance with the same preset text-audio model; determining, according to the feature matching result, the audio corresponding to the retrieval-information feature vector as a first target audio; and finally generating a first retrieval result of the retrieval information from the first target audio. Because the retrieval-information feature vector is matched against the audio feature vector of each audio stored in the preset retrieval database, the first target audio can be determined accurately, the reuse rate of audio resources is improved, and the retrieved audio better fits the audio application scene described by the retrieval information.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an audio resource searching method according to an embodiment of the present application;
FIG. 2 is a second flowchart of an audio resource searching method according to an embodiment of the present application;
FIG. 3 is a third flowchart of an audio resource searching method according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for retrieving audio resources according to an embodiment of the present application;
FIG. 5 is a flowchart of an audio resource searching method according to an embodiment of the present application;
FIG. 6 is a flowchart of an audio resource searching method according to an embodiment of the present application;
FIG. 7 is a flowchart of an audio resource searching method according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating an audio resource searching method according to an embodiment of the present application;
Fig. 9 is a schematic functional block diagram of an audio resource retrieval device according to an embodiment of the present application;
fig. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention.
Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Furthermore, the terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
In order to retrieve the corresponding target audio from input retrieval information and generate a retrieval result, an embodiment of the application provides an audio resource retrieval method. The method acquires the input retrieval information, extracts its feature vector with a preset text-audio model, matches the retrieval-information feature vector against the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result, and determines the first target audio corresponding to the retrieval-information feature vector according to that result. A first retrieval result is then generated from the first target audio, so the user can retrieve audio that closely matches the retrieval information and meets the user's needs.
The audio resource retrieval method provided by the embodiment of the application is explained in detail below through specific examples with reference to the accompanying drawings. The method may be implemented by a computer device preset with an audio resource retrieval algorithm or retrieval software, by running that algorithm or software. The computer device may be, for example, a server, or a terminal such as a user computer. Fig. 1 is a schematic flow chart of the audio resource retrieval method provided by an embodiment of the present application. As shown in fig. 1, the method includes:
S101, acquiring input search information.
In this embodiment, the retrieval information is entered by the user on a retrieval interface of the client device, so the server can obtain the input retrieval information. The retrieval information may be of several types, such as a text type, an audio type, and a tag type. The text type means the user may enter, in the retrieval interface, open-ended audio description text for the audio to be retrieved, such as its material, action, scene, speed, or profession; the audio type means the user may upload an audio file similar to the audio to be retrieved, or clip and upload audio from the client at will; the tag type means the user may enter a tag for the audio to be retrieved.
S102, extracting feature vectors of the retrieval information by adopting a preset text audio model to obtain feature vectors of the retrieval information.
Specifically, the preset text audio model is a model obtained by training according to the sample audio and the descriptive text corresponding to the sample audio in advance, and the preset text audio model can extract feature vectors of the retrieval information to obtain feature vectors of the retrieval information.
S103, matching the retrieval information feature vector with the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result.
Wherein, the preset search database stores: the audio feature vector of each audio is obtained by extracting the feature vector of each audio by adopting a preset text audio model in advance.
And matching the retrieval information feature vector with the audio feature vector of each audio through a matching algorithm such as an inverted index, a vector space model, a cosine distance and the like, so as to obtain a feature matching result, wherein the feature matching result indicates the similarity between the retrieval information feature vector and the audio feature vector of each audio.
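Of the matching algorithms mentioned, cosine similarity is the simplest to show concretely. The sketch below computes the feature matching result as a per-audio similarity score; the vectors are toy values, not real model output:

```python
import numpy as np

def cosine_match(query_vec, audio_vecs):
    # Produces the "feature matching result": for every audio in the preset
    # retrieval database, the cosine similarity between its feature vector
    # and the retrieval-information feature vector (higher = closer match).
    q = query_vec / np.linalg.norm(query_vec)
    return {aid: float(q @ (v / np.linalg.norm(v)))
            for aid, v in audio_vecs.items()}

vecs = {"a": np.array([1.0, 0.0]),
        "b": np.array([0.0, 1.0]),
        "c": np.array([1.0, 1.0])}
result = cosine_match(np.array([2.0, 0.0]), vecs)   # scale of the query is irrelevant
```

Because both sides are normalized, only the direction of each feature vector matters, which is why the same score works for text-to-audio and audio-to-audio matching.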
S104, determining the audio corresponding to the retrieval information feature vector as the first target audio according to the feature matching result.
S105, generating a first search result of the search information according to the first target audio.
According to the feature matching result, namely the similarity between the retrieval-information feature vector and the audio feature vector of each audio, the audio whose feature vector meets a preset similarity threshold is determined as a first target audio. Because the first target audio is determined by similarity, there may be one or more first target audios; if there is exactly one, that single first target audio is the first retrieval result.
Optionally, if the number of the first target audios is multiple, the ranking of the multiple first target audios is randomly processed, and a first search result is generated.
Specifically, the first target audios may initially be ranked by their similarity to the retrieval-information feature vector. If there are multiple first target audios, this ranking is randomized, for example by shuffling the similarity-based order, so that the first target audios are presented in random order rather than strictly by similarity, and the first retrieval result is generated from them. Finally, the first retrieval result is sent to the client device so that its result display interface shows it. Randomizing the order of multiple first target audios effectively avoids hot-spot audios and prevents audio homogenization.
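The randomization step is a plain shuffle of the candidate list. A minimal sketch (the seed parameter is an illustrative addition for reproducible tests, not part of the application):

```python
import random

def first_retrieval_result(target_audios, seed=None):
    # When several first target audios clear the similarity threshold,
    # shuffle their similarity-based order so the same "hot spot" audios
    # are not always listed first; pass a seed only for reproducibility.
    shuffled = list(target_audios)
    random.Random(seed).shuffle(shuffled)
    return shuffled

result = first_retrieval_result(["hit_01.wav", "hit_02.wav", "hit_03.wav"])
```

Copying the input list first keeps the original similarity ranking intact in case the caller still needs it.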
In summary, the embodiment of the present application provides an audio resource retrieval method, including: acquiring input retrieval information, extracting feature vectors of the retrieval information by adopting a preset text audio model to obtain retrieval information feature vectors, and then matching the retrieval information feature vectors with audio feature vectors of all audios in a preset retrieval database to obtain feature matching results, wherein the preset retrieval database stores: the audio feature vector of each audio is obtained by extracting the feature vector of each audio by a preset text audio model in advance, then determining the audio corresponding to the retrieval information feature vector as a first target audio according to the feature matching result, and finally generating a first retrieval result of the retrieval information according to the first target audio. According to the method, the feature vector of the search information is extracted through the preset text audio model, the feature vector of the search information is obtained, the first target audio corresponding to the feature vector of the search information can be accurately determined through matching the feature vector of the search information with the audio feature vector of each audio stored in the preset search database, the multiplexing rate of audio resources is improved, and a user can attach the audio application scene described by the search information to the audio retrieved by the search information.
The foregoing embodiment describes one implementation of the audio resource retrieval method. An embodiment of the present application further provides another possible implementation for the case where the retrieval information is an input text. Fig. 2 is a second flowchart of the audio resource retrieval method provided by an embodiment of the present application. As shown in fig. 2, extracting the feature vector of the retrieval information with a preset text-audio model to obtain the retrieval-information feature vector includes:
S201, extracting feature vectors of the input text by adopting a preset text audio model, and obtaining the feature vectors of the input text as the feature vectors of the retrieval information.
In this embodiment, the retrieval information is an input text, i.e. the user enters descriptive text for the audio to be retrieved in the retrieval interface of the client device, for example "footstep grass" for the audio of walking on grass. The text encoder in the preset text-audio model extracts the feature vector of the input text to obtain the input-text feature vector.
Based on the matching of the retrieval information feature vector and the audio feature vector of each audio in the preset retrieval database, a feature matching result is obtained, and the method comprises the following steps:
S202, matching the input text feature vector with the audio feature vector of each audio to obtain a feature matching result.
And matching the input text feature vector with the audio feature vector of each audio through a matching algorithm, so as to obtain a feature matching result, wherein the feature matching result indicates the similarity between the input text feature vector and the audio feature vector of each audio.
In the method provided by the embodiment of the application, if the acquired search information is the input text, the feature vector of the input text is extracted through the preset text audio model to obtain the feature vector of the input text, and the feature matching result can be accurately obtained through matching the feature vector of the input text with the audio feature vector of each audio stored in the preset search database.
The foregoing embodiment describes the case where the retrieval information is an input text. An embodiment of the present application further provides another possible implementation for the case where the retrieval information is an input audio. Fig. 3 is a third flowchart of the audio resource retrieval method provided by an embodiment of the present application. As shown in fig. 3, extracting the feature vector of the retrieval information with a preset text-audio model to obtain the retrieval-information feature vector includes:
s301, extracting feature vectors of input audio by adopting a preset text audio model, and obtaining the input audio feature vectors as retrieval information feature vectors.
In this embodiment, the retrieval information is an input audio, i.e. the user provides, in the retrieval interface of the client device, audio similar to the audio to be retrieved. Features are first extracted from the input audio with an audio processing library, such as the speech-signal processing library Librosa or the audio analysis library Essentia, and the preset text-audio model then extracts the input-audio feature vector from those features.
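To keep the sketch self-contained (and because Librosa/Essentia may not be installed), the following stands in for that front end with two classic frame-level features computed in plain NumPy; real feature extraction would be far richer:

```python
import numpy as np

def basic_audio_features(samples, frame=1024):
    # Minimal stand-in for a feature front end such as Librosa or Essentia:
    # per-frame RMS energy and zero-crossing rate of a mono sample array.
    # A real system would hand much richer features (or raw audio) to the
    # preset text-audio model's audio encoder.
    n = len(samples) // frame * frame
    frames = np.asarray(samples[:n]).reshape(-1, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return np.stack([rms, zcr], axis=1)   # shape: (num_frames, 2)

tone = np.sin(2 * np.pi * np.arange(2048) / 64)   # sine wave, 64-sample period
features = basic_audio_features(tone)
```

For the pure sine above, each frame's RMS is about 0.707 and the zero-crossing rate is low and constant, which is the kind of compact per-frame summary a front end feeds onward.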
Based on the matching of the retrieval information feature vector and the audio feature vector of each audio in the preset retrieval database, a feature matching result is obtained, and the method comprises the following steps:
s302, matching the input audio feature vector with the audio feature vector of each audio to obtain a feature matching result.
Since the preset retrieval database stores the audio feature vector of each audio, the input-audio feature vector is matched against the audio feature vector of each audio through a matching algorithm to obtain the feature matching result, which indicates the similarity between the input-audio feature vector and the audio feature vector of each audio.
According to the method provided by the embodiment of the application, feature vector extraction is performed on the search information, namely the input audio, through the preset text audio model to obtain the input audio feature vector. The input audio feature vector is then matched with the audio feature vector of each audio in the preset search database through a matching algorithm, and the feature matching result determines the similarity between the input audio feature vector and each audio feature vector. Since the preset text audio model is trained in advance according to sample audio and the description text corresponding to the sample audio, the input audio feature vector of the input audio can be obtained directly by adopting the preset text audio model, and the preset text audio model does not need to be trained separately for data in different fields. This improves the efficiency of audio retrieval, makes the cross-field robustness of the input audio feature vector stronger, and makes the generalization effect of the preset text audio model better. Meanwhile, the audio in the feature matching result is similar to the input audio, which can provide more inspiration for users and makes the sound effect style of the whole audio application scene more uniform.
The above embodiment explains an implementation manner of the audio resource searching method provided by the embodiment of the present application when the search information is an input audio. The embodiment of the present application further provides a possible implementation manner of the audio resource searching method when the search information is an input label, where the preset search database further stores the label of each audio. Fig. 4 is a flowchart of an audio resource searching method according to an embodiment of the present application. As shown in fig. 4, the method further includes:
S401, if the search information is an input label, matching the input label with the label of each audio in the preset search database to obtain a label matching result.
In this embodiment, the search information is an input label, i.e. the user inputs, in the retrieval interface of the client device, the label of the audio to be retrieved, for example: FOLY, SNOW. Since the preset search database stores the label of each audio, the input label is matched with the label of each audio through a matching algorithm to obtain a label matching result, and the label matching result indicates the similarity between the input label and the label of each audio.
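One simple matching strategy consistent with the FOLY/SNOW example is to require every input label to appear in an audio's label set. This is a sketch under that assumption; the stored layout (`{audio_id: set_of_labels}`) is hypothetical.

```python
def match_tags(input_tags, database_tags):
    """Return the ids of audios whose label set contains every input label.

    database_tags: {audio_id: set_of_labels} -- assumed layout for the labels
    stored in the preset search database.
    """
    query = set(input_tags)
    return [audio_id for audio_id, tags in database_tags.items()
            if query <= set(tags)]
```

A similarity-based variant (partial overlap scored and thresholded) would follow the same shape.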
S402, determining the audio corresponding to the input tag as second target audio according to the tag matching result.
If the input labels are FOLY and SNOW, the labels of the second target audio in the label matching result include both FOLY and SNOW.
S403, generating a second search result of the input label according to the second target audio.
According to the label matching result, namely the similarity between the input label and the label of each audio, the audio whose label meets the preset similarity threshold is determined as the second target audio. Since the second target audio is determined according to the similarity between the input label and the label of each audio, the number of the second target audios may be one or more.
If the number of the second target audios is multiple, the ordering of the multiple second target audios is randomly processed to generate the second search result, and finally the second search result is sent to the client device, so that the search result display interface of the client device displays the second search result generated from the second target audios.
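The random ordering step amounts to a shuffle before the results are sent to the client; a minimal sketch (the helper name is illustrative):

```python
import random

def build_search_result(target_audios):
    """Randomize the ordering of multiple matched target audios."""
    result = list(target_audios)   # copy so the caller's list is untouched
    random.shuffle(result)         # in-place shuffle of the copy
    return result
```

Shuffling rather than ranking means equally matched audios are surfaced with equal chance, which suits the "inspiration" use case described above.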
In the method provided by the embodiment of the application, the input label is matched with the label of each audio in the preset search database through a matching algorithm, the similarity between the input label and the label of each audio is determined, the second target audio corresponding to the input label is thereby determined, and the second search result of the input label is generated.
Based on the implementation manner of the audio resource searching method provided in the foregoing embodiment when the search information is an input label, the embodiment of the present application further provides another possible implementation manner of the audio resource searching method, in which the label of the third target audio is determined. Fig. 5 is a schematic flow chart of an audio resource searching method provided by an embodiment of the present application. As shown in fig. 5, before matching the input label with the label of each audio in the preset search database to obtain the label matching result, the method further includes:
S501, extracting feature vectors of a preset standard audio classified text by adopting a preset text audio model to obtain a first classified text feature vector.
In this embodiment, the preset standard audio classification text is a standard audio classification text provided by the Universal Category System (UCS), and the Universal Category System includes a plurality of standard audio classification texts, each of which has a corresponding label. Feature vector extraction is performed on the preset standard audio classification text by the text encoder in the preset text audio model to obtain the first classified text feature vector, i.e. the first classified text vector.
S502, matching the first classified text feature vector with the audio feature vector of each audio to obtain a first matching result.
Since the preset search database stores the audio feature vector of each audio, the first classified text feature vector is matched with the audio feature vector of each audio through a matching algorithm to obtain a first matching result, and the first matching result indicates the similarity between the first classified text feature vector and the audio feature vector of each audio.
S503, according to the first matching result, determining that the audio corresponding to the preset standard audio classification text is the third target audio.
S504, determining the label corresponding to the preset standard audio classification text as the label of the third target audio.
According to the first matching result, namely the similarity between the first classified text feature vector and the audio feature vector of each audio, the audio corresponding to an audio feature vector meeting the preset similarity threshold is determined as the third target audio. Since the third target audio is determined according to the similarity between the first classified text feature vector and the audio feature vector of each audio, the number of the third target audios may be one or more. The label corresponding to the preset standard audio classification text is therefore configured for each third target audio, and the label of the third target audio is stored in the preset search database, so that the user can determine, through the input label, the label of each audio matched with the input label from the preset search database, obtain the label matching result, and determine the second target audio according to the label matching result.
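The threshold-based label configuration step can be sketched as below. The threshold value and the helper name are assumptions; the similarities would come from matching the classified text feature vector against each audio feature vector.

```python
def assign_tag(similarities, tag, threshold=0.8):
    """Attach `tag` to every audio whose similarity to the classification
    text feature vector meets the preset threshold (0.8 assumed here).

    similarities: {audio_id: similarity} from the first matching result.
    Returns {audio_id: tag} for the target audios, ready to be written
    into the preset search database.
    """
    return {audio_id: tag for audio_id, sim in similarities.items()
            if sim >= threshold}
```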
In the method provided by the embodiment of the application, the preset text audio model extracts the feature vector of the preset standard audio classified text to obtain the first classified text feature vector, then the first classified text feature vector is matched with the audio feature vector of each audio to obtain a first matching result, the third target audio corresponding to the preset standard audio classified text is determined according to the first matching result, and the label corresponding to the preset standard audio classified text is determined as the label of the third target audio. Therefore, each standard audio classified text in the preset standard audio classified text can be matched with the corresponding third target audio, and the corresponding label is configured for the third target audio, so that the third target audio matched with the input label can be accurately determined.
In the embodiment of the present application, another possible implementation manner of the audio resource searching method is provided, in which the label of the fourth target audio is determined. Fig. 6 is a flowchart of an audio resource searching method provided by an embodiment of the present application. As shown in fig. 6, before matching the input label with the label of each audio in the preset search database to obtain the label matching result, the method further includes:
S601, receiving the input audio classified text and the label corresponding to the input audio classified text.
In this embodiment, a user may customize the label of each audio in the preset search database. Specifically, the user inputs an audio classification text and the label corresponding to the input audio classification text at the client device, so that the server receives the input audio classification text and the corresponding label. For example, the label corresponding to the input audio classification text is animal call, and the input audio classification text includes: dog: audio classification text description 1; cat: audio classification text description 2; tiger: audio classification text description 3; lion: audio classification text description 4.
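The payload the server receives might be structured as below — a hypothetical layout mirroring the animal-call example, not a format prescribed by the embodiment.

```python
# Hypothetical layout of a user-supplied classification text and its label.
input_classification = {
    "tag": "animal call",
    "texts": {
        "dog": "audio classification text description 1",
        "cat": "audio classification text description 2",
        "tiger": "audio classification text description 3",
        "lion": "audio classification text description 4",
    },
}
```

Each description under `texts` is what the text encoder would turn into a second classified text feature vector in step S602.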
S602, extracting feature vectors of the classified text of the input audio by adopting a preset text audio model to obtain second classified text feature vectors.
Feature vector extraction is performed on the input audio classified text by the text encoder in the preset text audio model to obtain the second classified text feature vector, i.e. the second classified text vector.
S603, matching the second classified text feature vector with the audio feature vector of each audio to obtain a second matching result.
Since the preset search database stores the audio feature vector of each audio, the second classified text feature vector is matched with the audio feature vector of each audio through a matching algorithm to obtain a second matching result, and the second matching result indicates the similarity between the second classified text feature vector and the audio feature vector of each audio.
S604, determining fourth target audio corresponding to the input audio classified text according to the second matching result.
S605, determining the label corresponding to the input audio classified text as the label of the fourth target audio.
According to the second matching result, namely the similarity between the second classified text feature vector and the audio feature vector of each audio, the audio corresponding to an audio feature vector meeting the preset similarity threshold is determined as the fourth target audio. Since the fourth target audio is determined according to the similarity between the second classified text feature vector and the audio feature vector of each audio, the number of the fourth target audios may be one or more. The label corresponding to the input audio classified text is therefore configured for each fourth target audio, and the label of the fourth target audio is stored in the preset search database, so that the user can determine, through the input label, the label of each audio matched with the input label from the preset search database, obtain the label matching result, and determine the second target audio according to the label matching result.
In the method provided by the embodiment of the application, a user can input an audio classified text and the label corresponding to the input audio classified text, and customize the labels of the audios in the preset search database, so that all the audio labels required by the user are obtained. Feature vector extraction is performed on the input audio classified text through the preset text audio model to obtain the second classified text feature vector, the second classified text feature vector is matched with the audio feature vector of each audio to obtain the second matching result, the fourth target audio corresponding to the input audio classified text is determined according to the second matching result, and the label corresponding to the input audio classified text is determined as the label of the fourth target audio. Therefore, the input audio classified text is matched with the corresponding fourth target audio, and the corresponding label is automatically configured for the fourth target audio, which makes it convenient to accurately determine the fourth target audio matched with the input label, allows the label of the fourth target audio to be flexibly defined, and does not require retraining the preset text audio model.
The embodiment of the application also provides another possible implementation manner of the audio resource retrieval method, in which the preset text audio model is trained. Fig. 7 is a seventh flow chart of an audio resource retrieval method provided by an embodiment of the present application. As shown in fig. 7, before extracting feature vectors of the input text by using the preset text audio model to obtain the input text feature vector, the method further includes:
S701, acquiring sample audio and description text corresponding to the sample audio.
S702, extracting feature vectors of sample audio and description text by adopting a preset initial text audio model to obtain sample audio feature vectors and corresponding sample text feature vectors.
S703, training the initial text audio model according to the sample audio feature vector and the corresponding sample text feature vector to obtain a preset text audio model.
In this embodiment, the preset initial text audio model is a cross-modal audio-text model, and is used for processing the input sample audio or description text to obtain a sample audio feature vector and a corresponding sample text feature vector, so that a text feature vector with highest similarity is determined according to the sample audio feature vector, and an audio feature vector with highest similarity is determined according to the sample text feature vector.
Specifically, the preset initial text audio model uses an audio processing library to perform feature vector extraction on the sample audio to obtain the sample audio feature vector, and uses a text encoder to perform feature vector extraction on the description text of the sample audio to obtain the sample text feature vector.
An audio-text matching pair is constructed from the sample audio feature vector and the corresponding sample text feature vector, and the distance between the sample audio feature vector and the corresponding sample text feature vector is calculated through the cosine distance, so that the initial text audio model is trained to obtain the preset text audio model.
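The cosine-distance computation driving the training objective can be sketched as follows. This is only the distance term: the actual contrastive training loop (optimizer, batching, loss over positive and negative pairs) is not specified by the embodiment and is omitted.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: small for a well-matched audio-text pair."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

# After training, a matching pair should sit closer than a mismatched pair:
audio_vec = [1.0, 0.0]
text_vec_match = [0.9, 0.1]   # description of the same audio
text_vec_other = [0.0, 1.0]   # description of an unrelated audio
```

Minimizing this distance for matching pairs (and keeping it large for mismatched pairs) is what aligns the two modalities in a shared vector space.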
If the audio is input to the preset text audio model, the preset text audio model can extract the input audio feature vector of the input audio and output the text feature vector closest to the input audio feature vector, namely, output the text feature vector with the highest similarity with the input audio feature vector; similarly, if a text is input to the preset text audio model, the preset text audio model may extract an input text feature vector of the input text, and output an audio feature vector closest to the input text feature vector, that is, output an audio feature vector having the highest similarity to the input text feature vector.
According to the method provided by the embodiment of the application, the sample audio and the description text corresponding to the sample audio are obtained, and feature vector extraction is performed on the sample audio and the description text respectively by adopting the preset initial text audio model to obtain the sample audio feature vector and the corresponding sample text feature vector. The initial text audio model is then trained with a contrastive learning loss function according to the sample audio feature vector and the corresponding sample text feature vector to obtain the preset text audio model, which enables the preset text audio model to extract feature vectors from the input retrieval information.
The embodiment of the application also provides another possible implementation manner of the audio resource searching method. Fig. 8 is a schematic flow diagram of an audio resource searching method provided by an embodiment of the present application. As shown in fig. 8, before matching the input text feature vector with the audio feature vector of each audio in the preset search database to obtain the feature matching result, the method further includes:
S801, extracting feature vectors of all the audios in a preset audio resource library by adopting a preset text audio model to obtain audio feature vectors of all the audios.
In this embodiment, the preset audio resource library includes a plurality of audios and the text descriptions corresponding to the plurality of audios. Feature vector extraction is performed on each audio in the preset audio resource library through the preset text audio model to obtain the audio feature vector of each audio, and feature vector extraction is performed on the text description corresponding to each audio through the preset text audio model to obtain the text feature vector corresponding to each audio.
S802, constructing a preset search database according to each audio and the audio feature vector of each audio.
The preset audio resource library is the audio resource library of a preset software project or a network audio resource library. A corresponding preset search database is constructed according to the different preset audio resource libraries, and the preset search database includes each audio in the preset audio resource library, and the audio feature vector and the text feature vector corresponding to each audio.
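Building the preset search database from a resource library can be sketched as below. The layouts and encoder arguments are assumptions for illustration; the encoders stand in for the preset text audio model.

```python
def build_search_database(audio_library, encode_audio, encode_text):
    """Build the preset search database from a preset audio resource library.

    audio_library: {audio_id: (samples, text_description)} -- assumed layout.
    encode_audio / encode_text: stand-ins for the preset text audio model's
    audio and text encoders.
    """
    return {
        audio_id: {
            "audio_vec": encode_audio(samples),
            "text_vec": encode_text(description),
        }
        for audio_id, (samples, description) in audio_library.items()
    }
```

Because both vectors are precomputed offline, a query at retrieval time only needs one encoder pass plus similarity lookups.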
It should be noted that, if the labels are configured for each audio according to the preset standard audio classification text and/or the input audio classification text, the preset search database further includes: labels for each audio.
In the method provided by the embodiment of the application, the preset text audio model is adopted to perform feature vector extraction on each audio in the preset audio resource library to obtain the audio feature vector of each audio, and the preset search database is then constructed according to each audio and the audio feature vector of each audio, wherein the preset audio resource library is the audio resource library of a preset software project or a network audio resource library. Therefore, the target audio corresponding to the input search information is determined by comparing the feature vector of the input search information with the audio feature vector of each audio in the preset search database.
The following further explains the audio resource retrieving apparatus, the computer device and the computer readable storage medium provided by any of the above embodiments of the present application, and the specific implementation process and the technical effects thereof are the same as those of the corresponding method embodiments, and for brevity, reference may be made to corresponding contents in the method embodiments for the parts not mentioned in this embodiment.
Fig. 9 is a schematic functional block diagram of an audio resource retrieval device according to an embodiment of the present application. As shown in fig. 9, the audio resource retrieval apparatus 100 includes:
An acquisition module 110 for acquiring the input retrieval information;
the extracting module 120 is configured to extract feature vectors of the search information by using a preset text audio model, so as to obtain feature vectors of the search information;
The matching module 130 is configured to match the feature vector of the search information with an audio feature vector of each audio in a preset search database, so as to obtain a feature matching result; wherein, the preset search database stores: the audio feature vector of each audio is obtained by extracting the feature vector of each audio by adopting a preset text audio model in advance;
The determining module 140 is configured to determine, according to the feature matching result, that the audio corresponding to the feature vector of the search information is the first target audio;
the generating module 150 is configured to generate a first search result of the search information according to the first target audio.
Optionally, if the search information is an input text, the extraction module 120 is further configured to extract feature vectors of the input text by using a preset text audio model, so as to obtain the input text feature vector as the search information feature vector;
The matching module 130 is further configured to match the input text feature vector with the audio feature vector of each audio to obtain a feature matching result.
Optionally, if the search information is an input audio, the extraction module 120 is further configured to extract feature vectors of the input audio by using a preset text audio model, so as to obtain the input audio feature vector as the search information feature vector;
The matching module 130 is further configured to match the input audio feature vector with the audio feature vector of each audio to obtain a feature matching result.
Optionally, the preset search database further stores the label of each audio; the matching module 130 is further configured to, if the search information is an input label, match the input label with the label of each audio in the preset search database to obtain a label matching result;
the determining module 140 is further configured to determine, according to the tag matching result, that the audio corresponding to the input tag is a second target audio;
the generating module 150 is further configured to generate a second search result of the input tag according to the second target audio.
Optionally, the extracting module 120 is further configured to extract feature vectors of the preset standard audio classified text by using a preset text audio model, so as to obtain a first classified text feature vector;
the matching module 130 is further configured to match the first classified text feature vector with an audio feature vector of each audio to obtain a first matching result;
the determining module 140 is further configured to determine, according to the first matching result, a third target audio corresponding to the preset standard audio classification text;
the generating module 150 is further configured to determine a tag corresponding to the preset standard audio classification text as a tag of the third target audio.
Optionally, the obtaining module 110 is further configured to receive an input audio classified text and a tag corresponding to the input audio classified text;
The extracting module 120 is further configured to extract feature vectors of the classified text of the input audio by using a preset text audio model, so as to obtain a second classified text feature vector;
The matching module 130 is further configured to match the second classified text feature vector with the audio feature vector of each audio to obtain a second matching result;
The determining module 140 is further configured to determine, according to the second matching result, a fourth target audio corresponding to the input audio classification text; and determining the label corresponding to the input audio classified text as the label of the fourth target audio.
Optionally, the obtaining module 110 is further configured to obtain the sample audio and a description text corresponding to the sample audio;
the extracting module 120 is further configured to extract feature vectors of the sample audio and the description text by using a preset initial text audio model, so as to obtain a sample audio feature vector and a corresponding sample text feature vector;
And the training module is used for training the initial text audio model according to the sample audio feature vector and the corresponding sample text feature vector to obtain a preset text audio model.
Optionally, the extracting module 120 is further configured to extract feature vectors of each audio in the preset audio resource library by using a preset text audio model to obtain an audio feature vector of each audio;
The construction module is used for constructing a preset search database according to each audio and the audio feature vector of each audio, wherein the preset audio resource library is the audio resource library of a preset software project or a network audio resource library.
Optionally, the generating module 150 is further configured to, if the number of the first target audios is multiple, perform a random process on the ordering of the multiple first target audios, and generate a first search result.
The foregoing apparatus is used for executing the method provided in the foregoing embodiment, and its implementation principle and technical effects are similar, and are not described herein again.
The above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application-specific integrated circuits (ASIC), or one or more microprocessors, or one or more field programmable gate arrays (FPGA), etc. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
Fig. 10 is a schematic diagram of a computer device according to an embodiment of the present application, where the computer device may be used for audio resource retrieval. As shown in fig. 10, the computer device 200 includes: a processor 210, a storage medium 220, and a bus 230.
The storage medium 220 stores machine-readable instructions executable by the processor 210. When the computer device is running, the processor 210 communicates with the storage medium 220 via the bus 230, and the processor 210 executes the machine-readable instructions to perform the method, the method comprising:
Acquiring input retrieval information; extracting feature vectors of the retrieval information by adopting a preset text audio model to obtain retrieval information feature vectors; matching the retrieval information feature vector with the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result; wherein, the preset search database stores: the audio feature vector of each audio is obtained by extracting the feature vector of each audio by adopting a preset text audio model in advance; according to the feature matching result, determining the audio corresponding to the retrieval information feature vector as a first target audio; and generating a first retrieval result of the retrieval information according to the first target audio.
Optionally, if the search information is an input text, extracting feature vectors of the search information by adopting a preset text audio model to obtain the search information feature vector includes:
extracting feature vectors of the input text by adopting a preset text audio model to obtain the feature vectors of the input text as the feature vectors of the retrieval information;
matching the retrieval information feature vector with the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result, wherein the method comprises the following steps:
and matching the input text feature vector with the audio feature vector of each audio to obtain a feature matching result.
Optionally, if the search information is an input audio, extracting feature vectors of the search information by adopting a preset text audio model to obtain the search information feature vector includes:
extracting feature vectors of the input audio by adopting a preset text audio model to obtain the input audio feature vectors as retrieval information feature vectors;
matching the retrieval information feature vector with the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result, wherein the method comprises the following steps:
And matching the input audio feature vector with the audio feature vector of each audio to obtain a feature matching result.
Optionally, the preset search database further stores the label of each audio; the method further includes:
If the search information is an input label, matching the input label with the label of each audio in the preset search database to obtain a label matching result; determining the second target audio corresponding to the input label according to the label matching result; and generating a second search result of the input label according to the second target audio.
Optionally, before matching the input label with the label of each audio in the preset search database to obtain the label matching result, the method further includes:
Extracting feature vectors of a preset standard audio classified text by adopting a preset text audio model to obtain a first classified text feature vector; matching the first classified text feature vector with the audio feature vector of each audio to obtain a first matching result; determining a third target audio corresponding to the preset standard audio classification text according to the first matching result; and determining the label corresponding to the preset standard audio classification text as the label of the third target audio.
Optionally, before matching the input label with the label of each audio in the preset search database to obtain the label matching result, the method further includes:
Receiving an input audio classified text and a label corresponding to the input audio classified text; extracting feature vectors of the classified text of the input audio by adopting a preset text audio model to obtain second classified text feature vectors; matching the second classified text feature vector with the audio feature vector of each audio to obtain a second matching result; determining fourth target audio corresponding to the input audio classified text according to the second matching result; and determining the label corresponding to the input audio classified text as the label of the fourth target audio.
Optionally, before extracting the feature vector of the input text by adopting a preset text audio model to obtain the feature vector of the input text as the feature vector of the retrieval information, the method further comprises:
Acquiring sample audio and description text corresponding to the sample audio; respectively extracting feature vectors of sample audio and description text by adopting a preset initial text audio model to obtain sample audio feature vectors and corresponding sample text feature vectors; training the initial text audio model according to the sample audio feature vector and the corresponding sample text feature vector to obtain a preset text audio model.
Optionally, matching the input text feature vector with the audio feature vector of each audio in the preset search database, and before obtaining the feature matching result, the method further includes:
Extracting feature vectors of each audio in a preset audio resource library by adopting a preset text audio model to obtain the audio feature vector of each audio; and constructing a preset search database according to each audio and the audio feature vector of each audio, wherein the preset audio resource library is the audio resource library of a preset software project or a network audio resource library.
Optionally, before generating the first search result of the search information according to the first target audio, the method further includes:
If there are a plurality of first target audios, the ordering of the plurality of first target audios is randomized, and the first search result is generated.
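The tie-breaking step above can be sketched as a seeded shuffle of the tied candidates (the function and file names are illustrative, not from the patent):

```python
import random

def make_first_search_result(first_target_audios, seed=None):
    """Return the matched target audios in a randomized order.

    A single hit is returned as-is; ties are shuffled so repeated
    identical queries do not always surface the same audio first.
    """
    if len(first_target_audios) <= 1:
        return list(first_target_audios)
    return random.Random(seed).sample(first_target_audios, k=len(first_target_audios))

result = make_first_search_result(["rain.wav", "storm.wav", "drizzle.wav"], seed=7)
```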
The present application also provides a storage medium 220 on which a computer program is stored; when executed by a processor, the program performs the steps of the above-described method embodiments.
Wherein the program, when executed by the processor, performs a method that may include:
Acquiring input retrieval information; extracting feature vectors of the retrieval information by adopting a preset text audio model to obtain retrieval information feature vectors; matching the retrieval information feature vector with the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result; wherein, the preset search database stores: the audio feature vector of each audio is obtained by extracting the feature vector of each audio by adopting a preset text audio model in advance; according to the feature matching result, determining the audio corresponding to the retrieval information feature vector as a first target audio; and generating a first retrieval result of the retrieval information according to the first target audio.
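The retrieval flow above — embed the query, compare it against the precomputed audio feature vectors, and take the best match as the first target audio — can be sketched with cosine similarity. The hand-picked vectors stand in for model outputs; all names are illustrative:

```python
import numpy as np

def cosine_scores(query_vec, database):
    """Cosine similarity of one query vector against every stored vector."""
    ids = list(database)
    mat = np.stack([database[i] for i in ids])
    q = query_vec / np.linalg.norm(query_vec)
    m = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return dict(zip(ids, m @ q))

def retrieve_first_target(query_vec, database, top_k=1):
    """Rank audios by feature-vector match; return the best top_k ids."""
    scores = cosine_scores(query_vec, database)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy preset search database: audio id -> precomputed feature vector.
db = {
    "birds.wav":  np.array([1.0, 0.1, 0.0]),
    "engine.wav": np.array([0.0, 1.0, 0.2]),
    "waves.wav":  np.array([0.1, 0.0, 1.0]),
}
# The query vector stands in for the encoded input text (or input audio).
query = np.array([0.9, 0.2, 0.0])
first_target = retrieve_first_target(query, db)
```

Because text and audio are embedded into the same space by the shared model, the identical matching code serves both the input-text and input-audio cases.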
Optionally, if the retrieval information is an input text, extracting the feature vector of the retrieval information by adopting the preset text audio model to obtain the retrieval information feature vector comprises:
extracting feature vectors of the input text by adopting a preset text audio model to obtain the feature vectors of the input text as the feature vectors of the retrieval information;
and matching the retrieval information feature vector with the audio feature vector of each audio in the preset retrieval database to obtain the feature matching result comprises:
and matching the input text feature vector with the audio feature vector of each audio to obtain a feature matching result.
Optionally, if the retrieval information is an input audio, extracting the feature vector of the retrieval information by adopting the preset text audio model to obtain the retrieval information feature vector comprises:
extracting feature vectors of the input audio by adopting a preset text audio model to obtain the input audio feature vectors as retrieval information feature vectors;
and matching the retrieval information feature vector with the audio feature vector of each audio in the preset retrieval database to obtain the feature matching result comprises:
And matching the input audio feature vector with the audio feature vector of each audio to obtain a feature matching result.
Optionally, the preset search database further stores: a label for each audio; the method further comprises:
If the retrieval information is an input label: matching the input label with the label of each audio in the preset search database to obtain a label matching result; determining a second target audio corresponding to the input label according to the label matching result; and generating a second search result of the input label according to the second target audio.
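When the retrieval information is a label, matching reduces to a lookup over the per-audio labels stored in the search database. A minimal sketch, with an illustrative label index:

```python
def search_by_label(input_label, label_index):
    """Return ids of all audios whose stored label matches the input label.

    label_index maps audio id -> label, mirroring the labels stored
    alongside the feature vectors in the preset search database.
    """
    return [audio_id for audio_id, label in label_index.items() if label == input_label]

labels = {"a1.wav": "footsteps", "a2.wav": "rain", "a3.wav": "footsteps"}
second_target = search_by_label("footsteps", labels)
```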
Optionally, before the input tag is matched with the tag of each audio in the preset search database to obtain a tag matching result, the method further comprises:
Extracting feature vectors of a preset standard audio classified text by adopting a preset text audio model to obtain a first classified text feature vector; matching the first classified text feature vector with the audio feature vector of each audio to obtain a first matching result; determining a third target audio corresponding to the preset standard audio classification text according to the first matching result; and determining the label corresponding to the preset standard audio classification text as the label of the third target audio.
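The auto-labeling step above — embed each standard audio classification text, find the audio it matches best, and give that audio the text's label — can be sketched as follows, with tiny hand-made vectors standing in for model embeddings (all names are illustrative):

```python
import numpy as np

def tag_best_matching_audio(class_text_vecs, audio_vecs):
    """For each classification text, label the audio it matches best.

    class_text_vecs: label -> feature vector of a standard audio
    classification text; audio_vecs: audio id -> audio feature vector.
    """
    unit = lambda v: v / np.linalg.norm(v)
    labels = {}
    for label, text_vec in class_text_vecs.items():
        t = unit(text_vec)
        best_audio = max(audio_vecs, key=lambda aid: float(unit(audio_vecs[aid]) @ t))
        labels[best_audio] = label  # the target audio inherits this label
    return labels

class_vecs = {"speech": np.array([1.0, 0.0]), "music": np.array([0.0, 1.0])}
audio_vecs = {"clip1": np.array([0.9, 0.2]), "clip2": np.array([0.1, 0.8])}
auto_labels = tag_best_matching_audio(class_vecs, audio_vecs)
```

The same routine serves the user-supplied classification texts of the following paragraph: only the source of the (text, label) pairs changes.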
Optionally, before the input tag is matched with the tag of each audio in the preset search database to obtain a tag matching result, the method further comprises:
Receiving an input audio classified text and a label corresponding to the input audio classified text; extracting feature vectors of the classified text of the input audio by adopting a preset text audio model to obtain second classified text feature vectors; matching the second classified text feature vector with the audio feature vector of each audio to obtain a second matching result; determining fourth target audio corresponding to the input audio classified text according to the second matching result; and determining the label corresponding to the input audio classified text as the label of the fourth target audio.
Optionally, before the feature vector of the input text is extracted by adopting the preset text audio model to obtain the input text feature vector as the retrieval information feature vector, the method further comprises:
Acquiring sample audio and description text corresponding to the sample audio; respectively extracting feature vectors of sample audio and description text by adopting a preset initial text audio model to obtain sample audio feature vectors and corresponding sample text feature vectors; training the initial text audio model according to the sample audio feature vector and the corresponding sample text feature vector to obtain a preset text audio model.
Optionally, before the input text feature vector is matched with the audio feature vector of each audio in the preset search database to obtain the feature matching result, the method further includes:
Extracting the feature vector of each audio in a preset audio resource library by adopting the preset text audio model to obtain the audio feature vector of each audio; and constructing the preset search database according to each audio and the audio feature vector of each audio, wherein the preset audio resource library is: an audio resource library of a preset software item, or a network audio resource library.
Optionally, before generating the first search result of the search information according to the first target audio, the method further includes:
If there are a plurality of first target audios, the ordering of the plurality of first target audios is randomized, and the first search result is generated.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a logical function division, and there may be other manners of division in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections via some interfaces, devices, or units, and may be in electrical, mechanical, or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform some of the steps of the methods according to the embodiments of the invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, etc.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto. Any person skilled in the art may readily conceive of variations or alternatives within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (12)

1. An audio resource retrieval method, comprising:
acquiring input retrieval information;
extracting feature vectors of the retrieval information by adopting a preset text audio model to obtain retrieval information feature vectors;
Matching the retrieval information feature vector with an audio feature vector of each audio in a preset retrieval database to obtain a feature matching result; wherein, the preset search database stores: the audio feature vector of each audio is obtained by extracting the feature vector of each audio by adopting the preset text audio model in advance;
According to the feature matching result, determining the audio corresponding to the retrieval information feature vector as a first target audio;
and generating a first search result of the search information according to the first target audio.
2. The method of claim 1, wherein if the search information is an input text, extracting the feature vector of the search information by adopting the preset text audio model to obtain the search information feature vector comprises:
Extracting feature vectors of the input text by adopting a preset text audio model to obtain the input text feature vectors as the retrieval information feature vectors;
The step of matching the retrieval information feature vector with the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result comprises the following steps:
and matching the input text feature vector with the audio feature vector of each audio to obtain the feature matching result.
3. The method of claim 1, wherein if the search information is an input audio, extracting the feature vector of the search information by adopting the preset text audio model to obtain the search information feature vector comprises:
Extracting feature vectors of the input audio by adopting the preset text audio model to obtain input audio feature vectors serving as the retrieval information feature vectors;
The step of matching the retrieval information feature vector with the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result comprises the following steps:
And matching the input audio feature vector with the audio feature vector of each audio to obtain the feature matching result.
4. The method according to claim 1, wherein the preset search database further stores: a label for each of the audios; the method further comprises:
if the search information is an input label, matching the input label with the label of each audio in the preset search database to obtain a label matching result;
According to the label matching result, determining the audio corresponding to the input label as a second target audio;
And generating a second search result of the input tag according to the second target audio.
5. The method of claim 4, wherein the matching the input tag with the tag of each audio in the preset search database, before obtaining a tag matching result, further comprises:
extracting feature vectors of the preset standard audio classified text by adopting the preset text audio model to obtain first classified text feature vectors;
matching the first classified text feature vector with the audio feature vector of each audio to obtain a first matching result;
According to the first matching result, determining the audio corresponding to the preset standard audio classification text as a third target audio;
And determining the label corresponding to the preset standard audio classification text as the label of the third target audio.
6. The method of claim 4, wherein the matching the input tag with the tag of each audio in the preset search database, before obtaining a tag matching result, further comprises:
Receiving an input audio classified text and a label corresponding to the input audio classified text;
extracting feature vectors of the input audio classified text by adopting the preset text audio model to obtain second classified text feature vectors;
matching the second classified text feature vector with the audio feature vector of each audio to obtain a second matching result;
According to the second matching result, determining that the audio corresponding to the input audio classification text is fourth target audio;
And determining the label corresponding to the input audio classified text as the label of the fourth target audio.
7. The method of claim 2, wherein the extracting the feature vector from the input text using the preset text-to-audio model, before obtaining the input text feature vector as the search information feature vector, further comprises:
acquiring sample audio and a description text corresponding to the sample audio;
extracting feature vectors of the sample audio and the description text respectively by adopting a preset initial text audio model to obtain sample audio feature vectors and corresponding sample text feature vectors;
Training the initial text audio model according to the sample audio feature vector and the corresponding sample text feature vector to obtain the preset text audio model.
8. The method of claim 2, wherein said matching said input text feature vector with said audio feature vector for each audio frequency, prior to obtaining said feature matching result, further comprises:
Extracting feature vectors of all the audios in a preset audio resource library by adopting the preset text audio model to obtain audio feature vectors of all the audios;
According to each of the audios and the audio feature vector of each of the audios, constructing the preset retrieval database, wherein the preset audio resource library is: an audio resource library of a preset software item, or a network audio resource library.
9. The method of claim 1, wherein prior to generating the first search result for the search information based on the first target audio, the method further comprises:
And if there are a plurality of the first target audios, randomizing the ordering of the plurality of the first target audios to generate the first search result.
10. An audio resource retrieval device, the device comprising:
The acquisition module is used for acquiring the input retrieval information;
the extraction module is used for extracting the feature vector of the retrieval information by adopting a preset text audio model to obtain the feature vector of the retrieval information;
The matching module is used for matching the retrieval information feature vector with the audio feature vector of each audio in a preset retrieval database to obtain a feature matching result; wherein, the preset search database stores: the audio feature vector of each audio is obtained by extracting the feature vector of each audio by adopting the preset text audio model in advance;
The determining module is used for determining the audio corresponding to the retrieval information feature vector as a first target audio according to the feature matching result;
and the generation module is used for generating a first search result of the search information according to the first target audio.
11. A computer device, comprising: a processor, a storage medium and a bus, the storage medium storing program instructions executable by the processor, the processor and the storage medium communicating via the bus when the computer device is running, the processor executing the program instructions to perform the steps of the audio resource retrieval method according to any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the audio resource retrieval method according to any of claims 1 to 9.
CN202410281641.2A 2024-03-12 2024-03-12 Audio resource retrieval method, device, equipment and storage medium Pending CN118093930A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410281641.2A CN118093930A (en) 2024-03-12 2024-03-12 Audio resource retrieval method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410281641.2A CN118093930A (en) 2024-03-12 2024-03-12 Audio resource retrieval method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118093930A true CN118093930A (en) 2024-05-28

Family

ID=91163167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410281641.2A Pending CN118093930A (en) 2024-03-12 2024-03-12 Audio resource retrieval method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118093930A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118585528A (en) * 2024-08-06 2024-09-03 杭州古珀医疗科技有限公司 Data query method and device based on dynamic configuration tag inverted index


Similar Documents

Publication Publication Date Title
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN109885773B (en) Personalized article recommendation method, system, medium and equipment
CN107657048B (en) User identification method and device
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
CN108363821A (en) A kind of information-pushing method, device, terminal device and storage medium
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN102456054B (en) A kind of searching method and system
CN101305368A (en) Semantic visual search engine
CN110209809B (en) Text clustering method and device, storage medium and electronic device
CN111008321A (en) Recommendation method and device based on logistic regression, computing equipment and readable storage medium
CN112559684A (en) Keyword extraction and information retrieval method
CN111563382A (en) Text information acquisition method and device, storage medium and computer equipment
CN118093930A (en) Audio resource retrieval method, device, equipment and storage medium
CN115809371A (en) Learning demand determination method and system based on data analysis
CN114490923A (en) Training method, device and equipment for similar text matching model and storage medium
CN110895587A (en) Method and device for determining target user
CN106570116B (en) Search result aggregation method and device based on artificial intelligence
CN110895538A (en) Data retrieval method, device, storage medium and processor
KR20190092086A (en) Apparatus for providing contents information and method thereof
CN113139056A (en) Network data clustering method, clustering device, electronic device and medium
CN113590838A (en) Customer service enabling method and system based on knowledge graph and storage medium
CN113449094A (en) Corpus obtaining method and device, electronic equipment and storage medium
CN113704422A (en) Text recommendation method and device, computer equipment and storage medium
CN114595313A (en) Information retrieval result processing method and device, server and storage medium
CN104484414A (en) Processing method and device of favourite information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination