CN111143604B - Similarity matching method and device for audio frequency and storage medium - Google Patents
- Publication number: CN111143604B
- Application number: CN201911353609.6A
- Authority
- CN
- China
- Prior art keywords
- audio
- similarity
- sample
- similar
- network model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/63—Querying
- G06F16/635—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The embodiments of the invention disclose an audio similarity matching method, an audio similarity matching device, and a storage medium. The scheme determines a similar user group from a plurality of users according to the users' audio lists; determines a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio; calculates the similarity between sample audio pairs in a training set according to the characteristic audio set, and determines positive samples and negative samples in the training set according to that similarity; trains a twin network model using the positive and negative samples, where the twin network model comprises two base networks for obtaining feature vectors of audio, the two base networks having the same structure and sharing weights; and performs audio similarity matching based on the trained twin network model. The accuracy of audio similarity matching is thereby improved.
Description
Technical Field
The invention relates to the technical field of audio processing, in particular to an audio similarity matching method, an audio similarity matching device and a storage medium.
Background
With the popularity of networks and the increasing convenience of song production, thousands of new songs emerge every day, and the number of songs will grow rapidly in the coming years. As the song catalog grows, users also show clearly personalized music preferences: a user who likes a certain song often wishes to keep listening to songs of that type. Given a song a user prefers, how to recommend similar songs to the user from a massive library has become a pressing problem.
Many conventional song recommendation methods use collaborative filtering, and many collaborative filtering methods search for similar songs based on song attribute information such as artist, genre, and language; for example, songs sharing the same genre and language labels are identified as similar songs and recommended together.
However, such collaborative filtering depends heavily on song attribute information: similarity can only be judged, and a song can only enter the recommendation pool, when enough attribute information is available. In practice, song attribute information is often missing or unreliable, and songs lacking it are difficult to match as similar songs of other songs, so the matching accuracy of similar songs is low.
Disclosure of Invention
The embodiment of the invention provides an audio similarity matching method, an audio similarity matching device and a storage medium, aiming at improving the accuracy of audio similarity matching.
The embodiment of the invention provides an audio similarity matching method, which comprises the following steps:
determining a similar user group from a plurality of users according to the audio list of the users;
determining a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio;
Calculating the similarity between sample audio pairs in a training set according to the characteristic audio set, and determining positive samples and negative samples in the training set according to the similarity between the sample audio pairs;
training a twin network model using the positive and negative samples, wherein the twin network model comprises two base networks for acquiring feature vectors of audio, the two base networks being identical in structure and sharing weights;
and performing similarity matching of the audio based on the trained twin network model.
The embodiment of the invention also provides an audio similarity matching device, which comprises:
a user grouping unit for determining a similar user group from a plurality of users according to the audio list of the users;
the set determining unit is used for determining a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio;
the sample acquisition unit is used for calculating the similarity between the sample audio pairs in the training set according to the characteristic audio set, and determining positive samples and negative samples in the training set according to the similarity between the sample audio pairs;
a model training unit, configured to train a twin network model using the positive sample and the negative sample, where the twin network model includes two base networks for acquiring feature vectors of audio, the two base networks have the same structure and share weights;
And the audio matching unit is used for matching the similarity of the audios based on the trained twin network model.
The embodiment of the invention also provides a storage medium which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the similarity matching method of any audio provided by the embodiment of the invention.
According to the audio similarity matching scheme provided by the embodiment of the invention, a similar user group is determined from a plurality of users according to the audio list of the users; determining a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio; calculating the similarity between the sample audio pairs in the training set according to the characteristic audio set, and determining positive samples and negative samples in the training set according to the similarity between the sample audio pairs; training a twin network model by using the positive sample and the negative sample, wherein the twin network model comprises two basic networks for acquiring the feature vector of the audio, and the two basic networks have the same structure and share weight; and performing similarity matching of the audios based on the trained twin network model. According to the scheme, the similarity between the audios is preliminarily determined according to the similar user groups so as to train the twin network model, when similar songs of a certain song are searched, the twin network model is used for matching the similarity between the songs, attribute information of the songs is not needed, matching of the similar songs can be achieved only according to the songs, and matching accuracy of the similar songs is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1a is a schematic flow chart of a method for matching similarity of audio according to an embodiment of the present invention;
fig. 1b is a schematic structural diagram of a twin network model in an audio similarity matching method according to an embodiment of the present invention;
fig. 2a is a second flow chart of a similarity matching method of audio according to an embodiment of the present invention;
fig. 2b is a third flow chart of the audio similarity matching method according to the embodiment of the present invention;
fig. 3a is a schematic diagram of a first structure of an audio similarity matching device according to an embodiment of the present invention;
fig. 3b is a schematic diagram of a second structure of an audio similarity matching device according to an embodiment of the present invention;
fig. 3c is a schematic diagram of a third structure of an audio similarity matching device according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The embodiment of the invention provides an audio similarity matching method, and an execution subject of the audio similarity matching method can be the audio similarity matching device provided by the embodiment of the invention or electronic equipment integrated with the audio similarity matching device, wherein the audio similarity matching device can be realized in a hardware or software mode. Wherein the electronic device may be a server.
Referring to fig. 1a, fig. 1a is a first flowchart illustrating a method for matching audio similarity according to an embodiment of the present invention. The specific flow of the audio similarity matching method can be as follows:
101. a similar user group is determined from a plurality of users based on the audio list of the users.
The audio in this embodiment may take various forms, such as songs or audio tracks in videos. A song may have both lyrics and melody, or may be pure instrumental music with melody only. The following description takes songs as an example. The scheme can be applied to the server of a music application or music website: the server maintains a song library storing a large number of songs, and some of those songs are similar to one another. Once the similarity between songs is matched, the server can recommend similar songs according to a user's listening habits or in response to a recommendation request sent by a client, improving music operation efficiency.
This scheme does not require song attribute information; instead, it determines the similarity between songs from the user behavior data generated when users listen to songs and from the song content itself. The user behavior data includes statistics such as the play count, collection count, and comment count of each song. Specifically, the scheme determines similarity between songs through a twin network model, which comprises two base networks for obtaining the feature vectors of songs; the two base networks have the same structure and share weights.
The training method of the twin network model is described below. Training the twin network model requires preparing a training set. When using the music application, each user creates song lists according to personal listening habits and preferences and collects audio into those lists, so the similarity between two users can be determined from their song collection lists.
In some embodiments, "determining a group of similar users from among a plurality of users based on an audio list of users" may include: acquiring audio lists of users, and calculating Jaccard coefficients of the audio lists of each two users as the similarity between the two users; dividing the plurality of users into a plurality of similar user groups according to the similarity between the users, wherein the similarity between any two users in one similar user group is larger than a first preset threshold.
In this embodiment, the training set may be constructed from the song-related data of all or some registered users of the music application, from which the similarity of the positive and negative sample song pairs in the training set is determined; the number of users may be set as desired. For every two of these users, the similarity between them is calculated from their song collection lists, for example by computing the set similarity of the two song collections, such as the Jaccard coefficient (also called the Jaccard similarity coefficient). The Jaccard coefficient of user A's song collection list S_A and user B's song collection list S_B can be calculated as J_AB = |S_A ∩ S_B| / |S_A ∪ S_B|.
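The Jaccard computation over two users' collection lists can be sketched in a few lines of Python; the song identifiers below are hypothetical:

```python
def jaccard(list_a, list_b):
    """Jaccard coefficient of two song-collection lists: |A∩B| / |A∪B|."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Hypothetical song IDs collected by two users.
user_a = ["song1", "song2", "song3", "song4"]
user_b = ["song2", "song3", "song5"]
print(jaccard(user_a, user_b))  # 2 shared songs out of 5 distinct -> 0.4
```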
After determining the similarity between every two users, users whose pairwise similarity is greater than the first preset threshold are divided into a similar user group; in this way, all selected users can be divided into n similar user groups U_i, i ∈ (1, n). It will be appreciated that, to improve the reliability of the data, similar user groups with too few members are filtered out and only groups with a larger number of users are used; for example, similar user groups with fewer than 1000 users are filtered out.
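The grouping step can be sketched as a greedy partition in which a user joins a group only if its similarity to every existing member exceeds the threshold. This is a minimal illustration, not necessarily the exact partitioning the patent uses; `min_size` stands in for the 1000-user filter mentioned above, set to 2 here so the toy data survives:

```python
def group_users(collections, threshold, min_size=2):
    """Greedily partition users into groups where every pair's Jaccard
    similarity exceeds `threshold`; drop groups below `min_size`."""
    def jac(a, b):
        u = a | b
        return len(a & b) / len(u) if u else 0.0

    groups = []  # each group is a list of user ids
    for user, songs in collections.items():
        for g in groups:
            if all(jac(songs, collections[m]) > threshold for m in g):
                g.append(user)
                break
        else:
            groups.append([user])
    return [g for g in groups if len(g) >= min_size]

# Hypothetical collections: u1 and u2 overlap heavily, u3 is dissimilar.
cols = {
    "u1": {"s1", "s2", "s3"},
    "u2": {"s1", "s2", "s3", "s4"},
    "u3": {"s9"},
}
print(group_users(cols, threshold=0.5))  # [['u1', 'u2']]
```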
102. And determining a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio.
After a plurality of similar user groups are determined, taking all collected songs of all users in the similar user groups as candidate song sets of the similar user groups, and selecting songs meeting certain conditions from the candidate song sets as characteristic audio sets of the similar user groups.
In some embodiments, the user behavior data is the collection count, and "determining a characteristic audio set for a group of similar users based on the audio-corresponding user behavior data" may include: determining, from the audio corresponding to the similar user group, the audio whose collection count is greater than a second preset threshold, to form the characteristic audio set of the similar user group.
In this embodiment, the set of characteristic songs (i.e., the characteristic audio set) corresponding to a group of similar users is determined based on collection counts. For a similar user group, the collection counts of all songs collected by its users are tallied, and the songs whose collection count exceeds a second preset threshold form the characteristic song set. For example, suppose a similar user group U_i contains ten thousand users and the second preset threshold is one thousand: when a song has been collected by one thousand of those users, it is added to the characteristic song set of the group. The second preset threshold here is only an example; in other embodiments it may be set to other values depending on the required model accuracy and the number of songs in the library.
Alternatively, in other embodiments, the user behavior data is the play count, and the characteristic song set corresponding to the similar user group is selected according to play counts.
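The characteristic-set selection can be sketched as a vote count over a group's collections; the group data and the threshold of 2 below are hypothetical stand-ins for the much larger production values:

```python
from collections import Counter

def characteristic_set(group_collections, min_collect_count):
    """Songs collected by more than `min_collect_count` users in the
    group form the group's characteristic audio set."""
    counts = Counter()
    for songs in group_collections:
        counts.update(set(songs))  # one vote per user per song
    return {s for s, c in counts.items() if c > min_collect_count}

# Hypothetical group of four users' collections.
group = [{"s1", "s2"}, {"s1", "s3"}, {"s1", "s2"}, {"s2"}]
print(sorted(characteristic_set(group, 2)))  # ['s1', 's2']
```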
103. And calculating the similarity between the sample audio pairs in the training set according to the characteristic audio set, and determining positive samples and negative samples in the training set according to the similarity between the sample audio pairs.
After the similar user group and the characteristic song sets corresponding to the similar user group are determined, the similarity between the sample audio pairs in the training set is calculated according to the characteristic song sets, and positive samples and negative samples are determined. Wherein the sample audio pair is a sample song pair.
Wherein, all songs of the user selected in 101 can be used as songs in the training set, or all or part of songs can be selected from all characteristic song sets to form the training set. Any two songs in the training set form a sample song pair.
The similarity between two songs in the sample song pair is calculated as follows. "computing similarity between pairs of sample audio in a training set from a set of feature audio" may include: for any sample audio pair in the training set, calculating the number of feature audio sets simultaneously having two audio in the sample audio pair; dividing the number by the total number of feature audio sets to obtain the similarity of the sample audio pairs.
Assume the sample song pair contains song C and song D. Define P as the probability that song C and song D appear in the same characteristic song set: P = (the number of characteristic song sets containing both song C and song D) / (the total number of characteristic song sets). Since songs within one characteristic song set of a similar user group can be considered to have a certain similarity, P can be taken as the similarity between song C and song D.
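The co-occurrence probability P can be sketched directly from that definition; the characteristic sets below are hypothetical:

```python
def pair_similarity(song_c, song_d, feature_sets):
    """Fraction of characteristic audio sets containing both songs."""
    if not feature_sets:
        return 0.0
    both = sum(1 for fs in feature_sets if song_c in fs and song_d in fs)
    return both / len(feature_sets)

# Hypothetical characteristic sets from four similar user groups.
sets_ = [{"c", "d"}, {"c", "d", "e"}, {"c"}, {"d", "e"}]
print(pair_similarity("c", "d", sets_))  # 2 of 4 sets contain both -> 0.5
```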
In some embodiments, the sample audio pair is a sample song pair, "determining positive and negative samples in the training set based on similarity between sample audio pairs" may include:
and taking the sample song pairs with the similarity larger than the fourth preset threshold value as positive samples, and taking the sample song pairs with the similarity not larger than the fifth preset threshold value as negative samples, wherein the fourth preset threshold value is larger than or equal to the fifth preset threshold value. For example, the fourth preset threshold is 0.6, and the fifth preset threshold is 0.3; for another example, the fourth preset threshold and the fifth preset threshold are both 0.5.
Because the twin network model is used in this embodiment, the positive sample and the negative sample are both in the form of song pairs, that is, one positive sample contains two songs with similarity greater than the fourth preset threshold value, and one negative sample contains two songs with similarity not greater than the fifth preset threshold value.
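The positive/negative labeling described above can be sketched as a simple threshold split; pair keys and similarity values below are hypothetical, with the example thresholds 0.6 and 0.3 taken from the text:

```python
def label_pairs(pair_sims, pos_threshold, neg_threshold):
    """Split sample song pairs into positives and negatives by similarity.
    Pairs falling between the two thresholds are left out of training."""
    positives = [p for p, s in pair_sims.items() if s > pos_threshold]
    negatives = [p for p, s in pair_sims.items() if s <= neg_threshold]
    return positives, negatives

sims = {("a", "b"): 0.8, ("a", "c"): 0.1, ("b", "c"): 0.45}
pos, neg = label_pairs(sims, pos_threshold=0.6, neg_threshold=0.3)
print(pos, neg)  # [('a', 'b')] [('a', 'c')]
```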
104. Training a twin network model using the positive and negative samples, wherein the twin network model comprises two base networks for obtaining feature vectors of the audio, the two base networks being identical in structure and sharing weights.
After positive and negative samples in the training set are determined, the twin network model is trained using the positive and negative samples.
In some embodiments, "training the twin network model using positive and negative samples" may include: extracting audio characteristics of audio in the positive sample and the negative sample; based on the audio features and similarities of the positive samples and the audio features and similarities of the negative samples, the twin network model is trained until the loss value of the loss function is minimized.
The audio features of the songs in the positive and negative samples, such as mel spectrum features, are extracted and used as the input data of the twin network model, with the similarity between the sample pair calculated above as the target output. The model is then trained iteratively according to its loss function: in each training round, the loss value of the twin network model is obtained from the loss function and the model's output, and the model parameters are adjusted according to that loss value. Training stops when the loss reaches its minimum; the model parameters are then fixed, yielding the trained twin network model.
The loss function L(F1, F2, Y) is formulated as:

L(F1, F2, Y) = Y · D(F1, F2)² + (1 − Y) · max(m − D(F1, F2), 0)²

where F1 and F2 are the feature vectors corresponding to the audio features output by the two base networks respectively, Y is the similarity between the two songs, m is a margin constant, and D(F1, F2) is the distance function used to calculate the distance between the feature vectors. The distance function may be a cosine distance, a Euclidean distance, or the like; this embodiment does not particularly limit it, and any function capable of calculating the distance between two vectors may be used.
Referring to fig. 1b, fig. 1b is a schematic structural diagram of a twin network model in an audio similarity matching method according to an embodiment of the present invention. The twin network model used in this embodiment includes two base networks of identical structure, and the two base networks share weights. The same structure means that each network comprises the same layers and the same number of neurons of each layer.
During training, the two base networks learn together; because each network sees different content, they complement each other. The final training objective is to make the distance between two similar inputs as small as possible and the distance between two dissimilar inputs as large as possible. Because the two base networks learn jointly, the resulting twin network model achieves higher accuracy and better performance than a single network, and the feature vectors it outputs allow the similarity between two audios to be calculated more accurately.
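The weight sharing between the two base networks can be illustrated with a toy linear layer: a single weight matrix serves both branches, so both inputs are embedded by the same function. The matrix values are arbitrary placeholders:

```python
def forward(weights, x):
    """One linear layer: `weights` is a list of rows; returns W @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

# One weight matrix acts as BOTH base networks: updating it changes the
# embedding of every input identically, which is the weight sharing the
# twin architecture requires.
shared_w = [[0.5, -0.2], [0.1, 0.9]]
emb_a = forward(shared_w, [1.0, 2.0])  # branch 1
emb_b = forward(shared_w, [1.0, 2.0])  # branch 2, same weights
print(emb_a == emb_b)  # True: same input -> same embedding
```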
In addition, this embodiment does not particularly limit the base network; it may be any network capable of learning audio features, such as a multi-layer convolutional neural network, a recurrent neural network, a Transformer network, a VGG (Visual Geometry Group) network, or a ResNet (deep residual network).
In some embodiments, "extracting audio features of audio in positive and negative samples" may include: for any audio of the positive and negative samples, dividing the audio into a plurality of audio pieces; performing short-time Fourier transform on each audio fragment to obtain a frequency domain signal; performing Mel scale transformation on the frequency domain signal to obtain Mel spectrum characteristics of the audio fragment; and combining the Mel spectrum characteristics of the plurality of audio fragments to obtain the audio characteristics of the audio.
In this embodiment, a complete song is split into multiple short audio segments, and a short-time Fourier transform followed by a mel-scale transform is applied to each segment to obtain its mel spectrum features. The mel spectrum features of each segment form one row, and the rows of all segments are stacked vertically to form the mel spectrum features of the complete song. It will be appreciated that the scheme is not limited to obtaining the mel spectrum as the audio feature in this way; in other embodiments, the audio features of a song may be extracted in other ways, as long as they characterize the audio content.
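The split-transform-stack pipeline can be sketched in miniature. This is a deliberately crude stand-in: a naive DFT replaces a windowed STFT, and adjacent-bin averaging replaces a real mel filterbank, to show only the shape of the feature matrix (one row per segment):

```python
import math

def frame_signal(signal, frame_len):
    """Split a waveform into non-overlapping frames (audio segments)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def dft_magnitudes(frame):
    """Naive DFT magnitude spectrum (stand-in for one STFT column)."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(x * math.cos(-2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(-2 * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def toy_mel(mags, n_bands=4):
    """Crude stand-in for mel-scale filtering: average adjacent bins."""
    step = max(1, len(mags) // n_bands)
    return [sum(mags[i:i + step]) / step for i in range(0, step * n_bands, step)]

# One row of "mel" features per frame, stacked into a feature matrix.
signal = [math.sin(2 * math.pi * 4 * t / 64) for t in range(128)]
features = [toy_mel(dft_magnitudes(f)) for f in frame_signal(signal, 64)]
print(len(features), len(features[0]))  # 2 frames x 4 bands
```

In production one would use a proper windowed STFT and triangular mel filterbank (e.g. a library such as librosa) rather than this toy version.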
105. And performing similarity matching of the audios based on the trained twin network model.
After the twin network model is trained, song similarity matching can be performed with it, so that similar songs can be recommended to users. There are various ways to perform song similarity matching; three of them are described below.
In the first mode, when the similarity of the song E and the song F is calculated, the audio characteristics of the song E and the song F are respectively extracted, a trained twin network model is input, and the similarity of the song E and the song F can be calculated and output by the twin network model.
In the second mode, all songs in the library are combined pairwise, and the twin network model is used to calculate the similarity between every two songs, forming a similarity matrix in which each row (or column) holds the similarities between one song and all other songs in the library. When querying for songs similar to a given song, songs whose similarity exceeds a certain threshold are looked up in the similarity matrix.
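The second mode can be sketched with cosine similarity over hypothetical embeddings (the patent leaves the similarity computation to the twin model; cosine is used here purely for illustration):

```python
import math

def cosine(f1, f2):
    dot = sum(a * b for a, b in zip(f1, f2))
    na = math.sqrt(sum(a * a for a in f1))
    nb = math.sqrt(sum(b * b for b in f2))
    return dot / (na * nb) if na and nb else 0.0

def similarity_matrix(vectors):
    """Pairwise similarity of every song pair in the library."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Hypothetical embeddings produced by the base network.
lib = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
m = similarity_matrix(lib)
print(m[0][2])  # identical vectors -> 1.0
```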
In the third mode, the twin network model is used to calculate the feature vector of each song in the library, forming a feature vector library. When querying for songs similar to a given song, an approximate vector search is performed in the feature vector library based on that song's feature vector, for example using perceptual hashing, LSH (locality-sensitive hashing), or the Annoy algorithm, so that vectors similar to a given vector can be found quickly among a huge number of vectors.
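Random-hyperplane LSH, one of the techniques named above, can be sketched as follows: each vector is reduced to the sign bits of its dot products with random hyperplanes, and vectors pointing in similar directions tend to land in the same bucket. Dimensions and seed are arbitrary:

```python
import random

def lsh_signature(vec, hyperplanes):
    """Sign bits of dot products with random hyperplanes; vectors with a
    small angle between them tend to share signatures (buckets)."""
    return tuple(int(sum(h * v for h, v in zip(plane, vec)) >= 0)
                 for plane in hyperplanes)

random.seed(7)  # fixed seed so the sketch is reproducible
planes = [[random.gauss(0, 1) for _ in range(3)] for _ in range(8)]

vec = [1.0, -0.5, 0.3]
scaled = [2.0, -1.0, 0.6]  # same direction -> identical signature
print(lsh_signature(vec, planes) == lsh_signature(scaled, planes))  # True
```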
In practical applications, similar songs may be found in different ways according to the number of songs in the library. For example, when the number of songs stored in the library is small, the recommendation of similar songs may be performed using the scheme of the first or second mode, and when the number of songs stored in the library is large, the recommendation of similar songs may be performed using the scheme of the third mode. Alternatively, in other embodiments, similar song lookups may be performed in other ways based on the twinning network model.
In addition, it will be appreciated that the training set may be periodically updated to retrain the twin network model and update the parameters of the model to ensure that the model has a higher accuracy.
In particular, the present application is not limited by the order of execution of the steps described, and certain steps may be performed in other orders or concurrently without conflict.
In the above-mentioned manner, the audio similarity matching method according to the embodiment of the present invention determines a similar user group from a plurality of users according to the audio list of the users; determining a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio; calculating the similarity between the sample audio pairs in the training set according to the characteristic audio set, and determining positive samples and negative samples in the training set according to the similarity between the sample audio pairs; training a twin network model by using the positive sample and the negative sample, wherein the twin network model comprises two basic networks for acquiring the feature vector of the audio, and the two basic networks have the same structure and share weight; and performing similarity matching of the audios based on the trained twin network model. According to the scheme, the similarity between the audios is preliminarily determined according to the similar user groups so as to train the twin network model, when similar songs of a certain song are searched, the twin network model is used for matching the similarity between the songs, attribute information of the songs is not needed, matching of the similar songs can be achieved only according to the songs, and matching accuracy of the similar songs is improved.
The method described in the previous examples is described in further detail below by way of example.
Referring to fig. 2a, fig. 2a is a second flowchart of a method for matching similarity of audio according to an embodiment of the invention. The method comprises the following steps:
201. an audio recommendation request is received, wherein the audio recommendation request indicates a query for audio similar to a target audio.
202. A first feature vector of the target audio is obtained.
The audio recommendation request may be a song recommendation request; the user may send it to the server through the music application on the client, and the request carries identification information of the target song. On receiving the audio recommendation request, the server can acquire the first feature vector of the target song, for example, by fetching the corresponding first feature vector from a feature vector library according to the identification information of the target audio, or by extracting the audio features of the target audio and calculating its first feature vector from those features and the twin network model.
203. Searching a feature vector library for a first preset number of second feature vectors with the highest similarity to the first feature vector, wherein the feature vector library is formed from the feature vectors of the audios in the library calculated according to the trained twin network model, and the twin network model is trained according to the users' collected songs.
According to the twin network model obtained by training in the above embodiment, the feature vector of each song in the library is calculated, that is, the output of the base network in the twin network model is obtained for each song. The feature vectors of all songs form a feature vector library, in which each feature vector corresponds to one song. For the training method of the twin network model, refer to the above embodiments; it is not repeated here.
After the first feature vector is obtained, an approximate vector search is performed in the feature vector library based on the first feature vector to find a plurality of second feature vectors closest to the first feature vector, where the number of second feature vectors can be set as required. For example, the search may use perceptual hashing, an LSH algorithm, or an Annoy algorithm, which can quickly find the vectors most similar to a given vector among a large number of vectors.
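As an illustration of the approximate vector search described above, the following sketch implements a minimal random-hyperplane LSH index with NumPy. It is a simplified stand-in for production libraries such as Annoy; all function names and parameters here are illustrative assumptions, not part of the claimed method.

```python
import numpy as np
from collections import defaultdict

def build_lsh_index(vectors, n_planes=8, seed=42):
    """Hash each feature vector into a bucket keyed by the signs of its
    projections onto a set of random hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_planes, vectors.shape[1]))
    index = defaultdict(list)
    for i, vec in enumerate(vectors):
        key = tuple(bool(b) for b in planes @ vec > 0)
        index[key].append(i)
    return planes, index

def query_lsh(query, vectors, planes, index, top_k=3):
    """Return up to top_k indices from the query's bucket, ranked by
    cosine similarity to the query vector."""
    key = tuple(bool(b) for b in planes @ query > 0)
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    candidates = index.get(key, [])
    return sorted(candidates, key=lambda i: -cosine(query, vectors[i]))[:top_k]
```

Because an identical query vector always hashes into its own bucket, a song and directionally close songs are retrieved without scanning the whole library, which is the property the embodiment relies on for large libraries.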
204. And responding to the audio recommendation request, and pushing the audio corresponding to the second feature vector to the terminal corresponding to the audio recommendation request.
After the second feature vectors are found, the songs corresponding to them are pushed, in response to the audio recommendation request, to the terminal corresponding to the request. For example, the song name, singer, and other information corresponding to each second feature vector are pushed to the terminal.
As described above, the audio similarity matching method provided by the embodiment of the present invention calculates the feature vector of each song in the song library in advance according to the trained twin network model to form a feature vector library. When a song recommendation request indicating a search for songs similar to a target song is received, an approximate vector search is performed in the feature vector library using the first feature vector of the target song, a plurality of second feature vectors most similar to the first feature vector are obtained, and the songs corresponding to the second feature vectors are pushed to the user.
Referring to fig. 2b, fig. 2b is a third flow chart of the audio similarity matching method according to the embodiment of the invention. The method comprises the following steps:
205. and calculating the similarity between every two audios in the music library according to the twin network model to obtain a similarity matrix corresponding to the music library, wherein the twin network model is obtained by training according to collected songs of a user.
In this embodiment, after the trained twin network model is obtained, the similarity between every two songs may be calculated for all songs in the server's library. For example, for any two songs, the audio features of the two songs are extracted and input into the trained twin network model, which outputs the similarity of the two songs. From the pairwise similarities, a similarity matrix for the whole song library can be obtained. Assuming the song library contains 1000 songs, a 1000×1000 similarity matrix is finally obtained, and for any two songs in the song library, their similarity can be looked up in the matrix.
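The pairwise similarity matrix described above can be sketched as follows. The patent's model outputs the similarity directly; cosine similarity over the base network's feature vectors is used here only as an illustrative stand-in.

```python
import numpy as np

def similarity_matrix(features):
    """Pairwise cosine similarity between every two feature vectors;
    entry [i, j] is the similarity between songs i and j."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit.T
```

For a library of 1000 songs this yields the 1000×1000 matrix mentioned above; the matrix is symmetric with ones on the diagonal, so only the upper triangle actually needs to be computed and stored.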
206. A similar audio list of each audio in the music library is obtained according to the similarity matrix, wherein when the similarity between two audios is greater than a third preset threshold, the two audios are judged to be similar to each other.
After the similarity matrix corresponding to the song library is obtained, each row or column of the matrix gives the similarities between one song and all other songs in the library. A similar song list for each song can therefore be derived from the matrix. For example, for song A in the library, the songs whose similarity in song A's row (or column) is greater than the third preset threshold are taken as song A's similar song list, and the list is stored in association with song A. In this way, a similar song list is obtained for every song in the library.
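Deriving a similar audio list from one row of the similarity matrix can be sketched as follows; the function name and argument handling are illustrative assumptions.

```python
import numpy as np

def similar_audio_list(sim_matrix, song_idx, threshold):
    """Indices of songs whose similarity to song_idx exceeds the
    threshold, sorted most-similar first; the song itself is excluded."""
    row = sim_matrix[song_idx]
    candidates = [j for j in range(len(row))
                  if j != song_idx and row[j] > threshold]
    return sorted(candidates, key=lambda j: -row[j])
```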
207. An audio recommendation request is received, wherein the audio recommendation request indicates a query for audio similar to a target audio.
208. And obtaining a similar audio list corresponding to the target audio, and responding to the audio recommendation request according to the similar audio list.
When the audio recommendation request is received, the server can acquire the similar song list of the target song and, as required, push the whole list or the part with the highest similarity to the terminal corresponding to the audio recommendation request.
As described above, the audio similarity matching method provided by the embodiment of the invention calculates the similarity between every two songs in the song library in advance according to the trained twin network model to form a similarity matrix; based on this matrix, similarity matching of any two songs can be achieved.
In order to implement the above method, the embodiment of the invention also provides an audio similarity matching device, which can be integrated in a terminal device such as a mobile phone or a tablet computer.
For example, referring to fig. 3a, fig. 3a is a schematic structural diagram of an audio similarity matching device according to an embodiment of the invention. The similarity matching apparatus of audio may include a user grouping unit 301, a set determining unit 302, a sample acquiring unit 303, a model training unit 304, and an audio matching unit 305, as follows:
a user grouping unit 301, configured to determine a similar user group from a plurality of users according to an audio list of the users;
a set determining unit 302, configured to determine a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio;
A sample acquiring unit 303, configured to calculate a similarity between pairs of sample audio in a training set according to the feature audio set, and determine a positive sample and a negative sample in the training set according to the similarity between pairs of sample audio;
a model training unit 304, configured to train a twin network model using the positive sample and the negative sample, where the twin network model includes two base networks for acquiring feature vectors of audio, the two base networks have the same structure and share weights;
an audio matching unit 305, configured to perform similarity matching of audio based on the trained twin network model.
Referring to fig. 3c, fig. 3c is a schematic diagram illustrating a third structure of an audio similarity matching device according to an embodiment of the invention.
In some embodiments, the user grouping unit 301 is further configured to: acquire the audio lists of the users and calculate the Jaccard coefficient of the audio lists of each two users as the similarity between the two users; and divide the plurality of users into a plurality of similar user groups according to the similarity between users, wherein the similarity between any two users in one similar user group is greater than a first preset threshold.
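The Jaccard coefficient of two audio lists, as used by the user grouping unit, can be computed as in the following sketch (names are illustrative):

```python
def jaccard(list_a, list_b):
    """Jaccard coefficient of two audio lists:
    |A ∩ B| / |A ∪ B|, with 0.0 returned for two empty lists."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Users whose pairwise coefficient exceeds the first preset threshold would then be placed in the same similar user group.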
In some embodiments, the sample acquiring unit 303 is further configured to: for any sample audio pair in the training set, calculate the number of feature audio sets that contain both audios of the sample audio pair; and divide that number by the total number of feature audio sets to obtain the similarity of the sample audio pair.
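A minimal sketch of this sample-pair similarity (the fraction of feature audio sets containing both audios) might look like the following; the function name is an illustrative assumption:

```python
def pair_similarity(audio_a, audio_b, feature_audio_sets):
    """Fraction of feature audio sets that contain both audios of the
    pair, used as the soft similarity label of a training sample pair."""
    containing_both = sum(1 for s in feature_audio_sets
                          if audio_a in s and audio_b in s)
    return containing_both / len(feature_audio_sets)
```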
In some embodiments, model training unit 304 is further to: extracting audio characteristics of audio in the positive sample and the negative sample; based on the audio features and similarities of the positive samples and the audio features and similarities of the negative samples, the twin network model is trained until the loss value of the loss function is minimized.
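For illustration, the training objective of minimizing a loss over positive and negative pairs can be sketched with the contrastive loss commonly used for twin (Siamese) networks. The exact form used here, shown per sample pair, is an assumption consistent with the variables the patent describes (pair similarity Y, feature distance D, margin m):

```python
import numpy as np

def contrastive_loss(feat_1, feat_2, y, margin=1.0):
    """Per-pair contrastive loss: similar pairs (y near 1) are pulled
    together; dissimilar pairs (y near 0) are pushed at least `margin`
    apart in feature space."""
    d = np.linalg.norm(np.asarray(feat_1) - np.asarray(feat_2))
    return y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
```

Summed over the training batch, minimizing this quantity drives the shared-weight base networks to embed similar songs close together.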
In some embodiments, model training unit 304 is further to:
for any audio of the positive and negative samples, splitting the audio into a plurality of audio segments;
performing short-time Fourier transform on each audio fragment to obtain a frequency domain signal;
performing Mel scale transformation on the frequency domain signal to obtain Mel spectrum characteristics of the audio fragment;
and combining the Mel spectrum characteristics of the plurality of audio fragments to obtain the audio characteristics of the audio.
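The four feature-extraction steps above (split into segments, short-time Fourier transform, mel-scale transformation, combination) can be sketched in NumPy as follows. Window size, hop size, and the filterbank construction are illustrative assumptions; production code would typically use an audio library such as librosa instead:

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=256):
    """Short-time Fourier transform magnitudes with a Hann window."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))  # (frames, n_fft//2+1)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    freqs = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * freqs / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, center, hi = bins[i - 1], bins[i], bins[i + 1]
        for j in range(lo, center):
            fb[i - 1, j] = (j - lo) / max(center - lo, 1)
        for j in range(center, min(hi, fb.shape[1])):
            fb[i - 1, j] = (hi - j) / max(hi - center, 1)
    return fb

def mel_features(audio, sr=16000, segment_len=16000, n_mels=40, n_fft=512):
    """Split audio into segments, take the mel spectrum of each segment,
    and stack the results into one feature array."""
    fb = mel_filterbank(n_mels, n_fft, sr)
    segments = [audio[i:i + segment_len]
                for i in range(0, len(audio) - segment_len + 1, segment_len)]
    return np.stack([fb @ stft_magnitude(seg, n_fft).T for seg in segments])
```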
In some embodiments, the user behavior data is a collection (favorite) count; the set determining unit 302 is further configured to: determine, from the audios corresponding to the similar user group, the audios whose collection count is greater than a second preset threshold, to form the characteristic audio set of the similar user group.
Referring to fig. 3b, fig. 3b is a schematic diagram illustrating a second structure of an audio similarity matching device according to an embodiment of the invention.
In some embodiments, the similarity matching device of audio may further include a first recommending unit 306, where the first recommending unit 306 is configured to:
calculating the feature vectors of the audios in the library according to the trained twin network model to form a feature vector library; receiving an audio recommendation request, wherein the audio recommendation request indicates a query for audio similar to a target audio; acquiring a first feature vector of the target audio; searching the feature vector library for a first preset number of second feature vectors with the highest similarity to the first feature vector; and, in response to the audio recommendation request, pushing the audio corresponding to the second feature vectors to a terminal corresponding to the audio recommendation request.
In some embodiments, the first recommendation unit 306 is further configured to: acquiring a corresponding first feature vector from the feature vector library according to the identification information of the target audio; or extracting the audio characteristics of the target audio, and calculating a first characteristic vector of the target audio according to the audio characteristics and the twin network model.
In some embodiments, the similarity matching device of audio may further include a second recommendation unit 307, where the second recommendation unit 307 is configured to:
calculating the similarity between every two audios in a music library according to the twin network model to obtain a similarity matrix corresponding to the music library; and obtaining a similar audio list of each audio in the music library according to the similarity matrix, wherein when the similarity between two audios is larger than a third preset threshold value, the two audios are judged to be similar to each other.
In some embodiments, the second recommendation unit 307 is further configured to:
receiving an audio recommendation request, wherein the audio recommendation request indicates a query for audio similar to a target audio; and obtaining a similar audio list corresponding to the target audio, and responding to the audio recommendation request according to the similar audio list.
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
It should be noted that, the similarity matching device for audio provided by the embodiment of the present invention belongs to the same concept as the similarity matching method for audio in the above embodiment, and any method provided in the similarity matching method embodiment for audio may be run on the similarity matching device for audio, and the specific implementation process is detailed in the similarity matching method embodiment for audio, which is not described herein again.
According to the audio similarity matching device provided by the embodiment of the invention, the user grouping unit 301 determines similar user groups from a plurality of users according to the users' audio lists; the set determining unit 302 determines a characteristic audio set of each similar user group based on the user behavior data corresponding to the audio; the sample acquiring unit 303 calculates the similarity between sample audio pairs in the training set according to the characteristic audio set and determines positive and negative samples in the training set according to that similarity; the model training unit 304 trains a twin network model using the positive and negative samples, where the twin network model comprises two base networks for acquiring audio feature vectors, the two base networks having the same structure and sharing weights; and the audio matching unit 305 performs similarity matching of audio based on the trained twin network model. In this scheme, the similarity between audios is preliminarily determined from the similar user groups in order to train the twin network model. When similar songs of a given song are searched, the twin network model matches the similarity between songs directly from the audio itself, without requiring attribute information of the songs, which improves the matching accuracy of similar songs.
The embodiment of the invention also provides an electronic device. Referring to fig. 4, fig. 4 is a schematic structural diagram of the electronic device according to the embodiment of the invention. Specifically:
The electronic device may include a processor 401 having one or more processing cores, a memory 402 of one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function or an image playing function), and the like, and the storage data area may store data created according to the use of the electronic device, and the like. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 404, which input unit 404 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
determining a similar user group from a plurality of users according to the audio list of the users;
determining a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio;
calculating the similarity between sample audio pairs in a training set according to the characteristic audio set, and determining positive samples and negative samples in the training set according to the similarity between the sample audio pairs;
training a twin network model using the positive and negative samples, wherein the twin network model comprises two base networks for acquiring feature vectors of audio, the two base networks being identical in structure and sharing weights;
And performing similarity matching of the audio based on the trained twin network model.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
acquiring audio lists of users, and calculating Jaccard coefficients of the audio lists of each two users as the similarity between the two users;
dividing the plurality of users into a plurality of similar user groups according to the similarity between the users, wherein the similarity between any two users in one similar user group is larger than a first preset threshold.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
for any sample audio pair in the training set, calculating the number of feature audio sets that contain both audios of the sample audio pair;
dividing that number by the total number of feature audio sets to obtain the similarity of the sample audio pair.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
extracting audio characteristics of audio in the positive sample and the negative sample;
based on the audio features and similarities of the positive samples and the audio features and similarities of the negative samples, the twin network model is trained until the loss value of the loss function is minimized.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
for any audio of the positive and negative samples, splitting the audio into a plurality of audio segments;
performing short-time Fourier transform on each audio fragment to obtain a frequency domain signal;
performing Mel scale transformation on the frequency domain signal to obtain Mel spectrum characteristics of the audio fragment;
and combining the Mel spectrum characteristics of the plurality of audio fragments to obtain the audio characteristics of the audio.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
determining, from the audios corresponding to the similar user group, the audios whose collection count is greater than a second preset threshold, to form a characteristic audio set of the similar user group.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
calculating the characteristic vector of the audio in the library according to the trained twin network model to form a characteristic vector library;
receiving an audio recommendation request, wherein the audio recommendation request indicates a query for audio similar to a target audio;
Acquiring a first feature vector of the target audio;
searching a first preset number of second feature vectors with highest similarity with the first feature vectors from the feature vector library;
and responding to the audio recommendation request, and pushing the audio corresponding to the second feature vector to a terminal corresponding to the audio recommendation request.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
acquiring a corresponding first feature vector from the feature vector library according to the identification information of the target audio;
or extracting the audio characteristics of the target audio, and calculating a first characteristic vector of the target audio according to the audio characteristics and the twin network model.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
calculating the similarity between every two audios in a music library according to the twin network model to obtain a similarity matrix corresponding to the music library;
and obtaining a similar audio list of each audio in the music library according to the similarity matrix, wherein when the similarity between two audios is larger than a third preset threshold value, the two audios are judged to be similar to each other.
In some embodiments, the processor 401 runs an application program stored in the memory 402, and may also implement the following functions:
receiving an audio recommendation request, wherein the audio recommendation request indicates a query for audio similar to a target audio;
and obtaining a similar audio list corresponding to the target audio, and responding to the audio recommendation request according to the similar audio list.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
In summary, the electronic device according to the embodiment of the present invention determines a similar user group from a plurality of users according to the users' audio lists; determines a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio; calculates the similarity between sample audio pairs in the training set according to the characteristic audio set, and determines positive and negative samples in the training set according to that similarity; trains a twin network model using the positive and negative samples, where the twin network model comprises two base networks for acquiring audio feature vectors, the two base networks having the same structure and sharing weights; and performs similarity matching of audio based on the trained twin network model. In this scheme, the similarity between audios is preliminarily determined from the similar user groups in order to train the twin network model. When similar songs of a given song are searched, the twin network model matches the similarity between songs directly from the audio itself, without requiring attribute information of the songs, which improves the matching accuracy of similar songs.
To this end, an embodiment of the present invention provides a storage medium in which a plurality of instructions are stored, where the instructions can be loaded by a processor to perform any of the audio similarity matching methods provided in the embodiment of the present invention. For example, the instructions may perform:
determining a similar user group from a plurality of users according to the audio list of the users;
determining a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio;
calculating the similarity between sample audio pairs in a training set according to the characteristic audio set, and determining positive samples and negative samples in the training set according to the similarity between the sample audio pairs;
training a twin network model using the positive and negative samples, wherein the twin network model comprises two base networks for acquiring feature vectors of audio, the two base networks being identical in structure and sharing weights;
and performing similarity matching of the audio based on the trained twin network model.
The specific implementation of the above operations may be referred to the previous embodiments, and will not be described herein.
Wherein the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Because the instructions stored in the storage medium can execute any audio similarity matching method provided by the embodiments of the present invention, they can achieve the beneficial effects achievable by any of those methods; see the foregoing embodiments for details, which are not repeated here. The foregoing describes in detail the audio similarity matching method, device, and storage medium provided by the embodiments of the present invention. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the above description of the embodiments is intended only to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope according to the ideas of the present invention; in summary, the contents of this description should not be construed as limiting the present invention.
Claims (12)
1. A method for similarity matching of audio frequencies, comprising:
determining a similar user group from a plurality of users according to the audio list of the users;
determining a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio;
calculating the similarity between sample audio pairs in a training set according to the characteristic audio set, and determining positive samples and negative samples in the training set according to the similarity between the sample audio pairs, wherein the similarity between the sample audio pairs characterizes the probability that the sample audio pairs are in the same characteristic audio set;
Extracting audio characteristics of audio in the positive sample and the negative sample;
training a twin network model based on the audio features and the similarity of the positive samples and the audio features and the similarity of the negative samples until the loss value of a loss function is minimized, wherein the twin network model comprises two base networks for acquiring the feature vectors of audio, the two base networks having the same structure and sharing weights; the loss function L(F1, F2, Y) has the following formula:
L(F1, F2, Y) = Y · D(F1, F2)² + (1 − Y) · max(m − D(F1, F2), 0)²
wherein F1 and F2 are respectively the audio features of the two audios in a sample audio pair taken from the positive and negative samples, Y is the similarity between the audios of the sample audio pair, D is the distance between the features, and m is a preset margin value;
and performing similarity matching of the audio based on the trained twin network model.
2. The method for matching similarity of audio according to claim 1, wherein said determining a group of similar users from among a plurality of users based on the audio list of the users comprises:
acquiring audio lists of users, and calculating Jaccard coefficients of the audio lists of each two users as the similarity between the two users;
dividing the plurality of users into a plurality of similar user groups according to the similarity between the users, wherein the similarity between any two users in one similar user group is larger than a first preset threshold.
3. The method of similarity matching of audio according to claim 1, wherein said calculating the similarity between pairs of sample audio in a training set from said set of characteristic audio comprises:
for any sample audio pair in the training set, calculating the number of characteristic audio sets that contain both audios of the sample audio pair;
dividing the number by the total number of characteristic audio sets to obtain the similarity of the sample audio pair.
4. The method of similarity matching of audio according to claim 1, wherein said extracting audio features of audio in said positive and negative samples comprises:
for any audio of the positive and negative samples, splitting the audio into a plurality of audio segments;
performing a short-time Fourier transform on each audio segment to obtain a frequency domain signal;
performing a Mel scale transformation on the frequency domain signal to obtain the Mel spectrum features of the audio segment;
and combining the Mel spectrum features of the plurality of audio segments to obtain the audio features of the audio.
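The STFT-plus-Mel pipeline of claim 4 can be sketched with plain numpy as below; in practice a library such as librosa would typically be used, and the window, hop, and filterbank parameters here are illustrative defaults, not values taken from the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(signal, sr=22050, n_fft=1024, hop=512, n_mels=40):
    """Short-time Fourier transform followed by a triangular mel filterbank.
    Returns an (n_mels, n_frames) mel-weighted power spectrum."""
    # frame the signal with a Hann window and take the magnitude spectrum
    window = np.hanning(n_fft)
    n_frames = 1 + max(0, (len(signal) - n_fft) // hop)
    spec = np.empty((n_fft // 2 + 1, n_frames))
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + n_fft] * window
        spec[:, i] = np.abs(np.fft.rfft(frame))
    # build triangular mel filters between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for j in range(1, n_mels + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        if c > l:
            fb[j - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[j - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb @ (spec ** 2)
```

Per the claim, this would be run on each audio segment, and the per-segment outputs concatenated to form the audio features fed to the twin network.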
5. The audio similarity matching method of claim 1, wherein the user behavior data is a collection count; the determining the characteristic audio set of the similar user group based on the user behavior data corresponding to the audio comprises:
determining, from the audio corresponding to the similar user group, the audio whose collection count is larger than a second preset threshold, to form the characteristic audio set of the similar user group.
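A minimal sketch of claim 5's threshold filtering, assuming the behavior data arrives as one event per user-collects-audio action; `characteristic_audio_set` is a hypothetical name:

```python
from collections import Counter

def characteristic_audio_set(group_audio_events, threshold):
    """Audios whose collection count within a similar user group exceeds
    the second preset threshold form the group's characteristic audio set."""
    counts = Counter(group_audio_events)  # one event per collection action
    return {audio for audio, n in counts.items() if n > threshold}
```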
6. The audio similarity matching method according to any one of claims 1 to 5, wherein the performing audio similarity matching based on the trained twin network model comprises:
calculating the feature vectors of the audios in the music library according to the trained twin network model to form a feature vector library;
receiving an audio recommendation request, wherein the audio recommendation request is used for requesting a query for audio similar to a target audio;
acquiring a first feature vector of the target audio;
searching a first preset number of second feature vectors with highest similarity with the first feature vectors from the feature vector library;
and responding to the audio recommendation request, and pushing the audio corresponding to the second feature vector to a terminal corresponding to the audio recommendation request.
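The retrieval step of claim 6 amounts to a top-k nearest-neighbor search over the feature vector library; a minimal numpy sketch, using cosine similarity as an assumed metric (the patent only specifies "highest similarity" without naming one):

```python
import numpy as np

def top_k_similar(query_vec, vector_library, k=5):
    """Return the indices and similarities of the k feature vectors in the
    library most similar to the query vector, by cosine similarity."""
    lib = vector_library / np.linalg.norm(vector_library, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = lib @ q                      # cosine similarity to every library vector
    idx = np.argsort(-sims)[:k]        # first preset number of best matches
    return idx, sims[idx]
```

At library scale an approximate-nearest-neighbor index would replace the brute-force matrix product, but the interface is the same.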
7. The method of similarity matching of audio of claim 6, wherein said obtaining a first feature vector of said target audio comprises:
acquiring a corresponding first feature vector from the feature vector library according to the identification information of the target audio;
or extracting the audio features of the target audio, and calculating the first feature vector of the target audio according to the audio features and the twin network model.
8. The audio similarity matching method according to any one of claims 1 to 5, wherein the performing audio similarity matching based on the trained twin network model comprises:
calculating the similarity between every two audios in a music library according to the twin network model to obtain a similarity matrix corresponding to the music library;
and obtaining a similar audio list of each audio in the music library according to the similarity matrix, wherein when the similarity between two audios is larger than a third preset threshold value, the two audios are judged to be similar to each other.
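Claim 8's offline variant precomputes the full pairwise similarity matrix; a sketch, again assuming cosine similarity over the twin network's feature vectors (`similarity_matrix` is a hypothetical name):

```python
import numpy as np

def similarity_matrix(features, threshold):
    """Pairwise cosine similarity over the whole music library, plus the
    per-audio similar lists derived from the third preset threshold."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                          # (n, n) similarity matrix
    similar_lists = [
        [j for j in range(len(f)) if j != i and sim[i, j] > threshold]
        for i in range(len(f))
    ]
    return sim, similar_lists
```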
9. The method for matching the similarity of audio according to claim 8, further comprising, after said obtaining a list of similar audio for each audio in said library according to said similarity matrix:
receiving an audio recommendation request, wherein the audio recommendation request is used for requesting a query for audio similar to a target audio;
and obtaining a similar audio list corresponding to the target audio, and responding to the audio recommendation request according to the similar audio list.
10. An audio similarity matching apparatus, comprising:
a user grouping unit for determining a similar user group from a plurality of users according to the audio list of the users;
the set determining unit is used for determining a characteristic audio set of the similar user group based on the user behavior data corresponding to the audio;
the sample acquisition unit is used for calculating the similarity between sample audio pairs in the training set according to the characteristic audio set, determining positive samples and negative samples in the training set according to the similarity between the sample audio pairs, and representing the probability that the sample audio pairs are in the same characteristic audio set by the similarity between the sample audio pairs;
the model training unit is used for extracting the audio features of the audio in the positive sample and the negative sample, and for training a twin network model based on the audio features and the similarity of the positive samples and the audio features and the similarity of the negative samples until the loss value of the loss function is minimized, wherein the twin network model comprises two basic networks for acquiring the feature vectors of the audio, and the two basic networks have the same structure and share weights; the formula of the loss function L(F₁, F₂, Y) is as follows:
wherein F₁ and F₂ are respectively the audio features of the audio in the positive sample and the audio features of the audio in the negative sample, Y is the similarity between the sample audio pair formed by the positive sample and the negative sample, D is the distance between the features, and m is a preset boundary value;
and the audio matching unit is used for matching the similarity of the audios based on the trained twin network model.
11. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the method of similarity matching of audio of any one of claims 1 to 9.
12. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps in the audio similarity matching method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911353609.6A CN111143604B (en) | 2019-12-25 | 2019-12-25 | Similarity matching method and device for audio frequency and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111143604A CN111143604A (en) | 2020-05-12 |
CN111143604B true CN111143604B (en) | 2024-02-02 |
Family
ID=70519810
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911353609.6A Active CN111143604B (en) | 2019-12-25 | 2019-12-25 | Similarity matching method and device for audio frequency and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111143604B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11461649B2 (en) * | 2020-03-19 | 2022-10-04 | Adobe Inc. | Searching for music |
CN111785287B (en) | 2020-07-06 | 2022-06-07 | 北京世纪好未来教育科技有限公司 | Speaker recognition method, speaker recognition device, electronic equipment and storage medium |
CN112347786A (en) * | 2020-10-27 | 2021-02-09 | 阳光保险集团股份有限公司 | Artificial intelligence scoring training method and device |
CN112487236A (en) * | 2020-12-01 | 2021-03-12 | 腾讯音乐娱乐科技(深圳)有限公司 | Method, device, equipment and storage medium for determining associated song list |
CN112651429B (en) * | 2020-12-09 | 2022-07-12 | 歌尔股份有限公司 | Audio signal time sequence alignment method and device |
CN112784130B (en) * | 2021-01-27 | 2022-05-27 | 杭州网易云音乐科技有限公司 | Twin network model training and measuring method, device, medium and equipment |
CN112884040B (en) * | 2021-02-19 | 2024-04-30 | 北京小米松果电子有限公司 | Training sample data optimization method, system, storage medium and electronic equipment |
CN113836346B (en) * | 2021-09-08 | 2023-08-08 | 网易(杭州)网络有限公司 | Method, device, computing equipment and storage medium for generating abstract for audio file |
CN115497633B (en) * | 2022-10-19 | 2024-01-30 | 联仁健康医疗大数据科技股份有限公司 | Data processing method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102811207A (en) * | 2011-06-02 | 2012-12-05 | 腾讯科技(深圳)有限公司 | Network information pushing method and system |
CN109784182A (en) * | 2018-12-17 | 2019-05-21 | 北京飞搜科技有限公司 | Pedestrian recognition methods and device again |
CN110309359A (en) * | 2019-05-20 | 2019-10-08 | 北京大学 | Video correlation prediction technique, device, equipment and storage medium |
CN110334204A (en) * | 2019-05-27 | 2019-10-15 | 湖南大学 | A kind of exercise similarity calculation recommended method based on user record |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111143604B (en) | Similarity matching method and device for audio frequency and storage medium | |
US11461388B2 (en) | Generating a playlist | |
Moscato et al. | An emotional recommender system for music | |
US20210256056A1 (en) | Automatically Predicting Relevant Contexts For Media Items | |
Kaminskas et al. | Location-aware music recommendation using auto-tagging and hybrid matching | |
CN107885745B (en) | Song recommendation method and device | |
Manco et al. | Contrastive audio-language learning for music | |
US20170300567A1 (en) | Media content items sequencing | |
US20140279756A1 (en) | Cross media recommendation | |
Chang et al. | Music recommender using deep embedding-based features and behavior-based reinforcement learning | |
TW201104466A (en) | Digital data processing method for personalized information retrieval and computer readable storage medium and information retrieval system thereof | |
KR101942459B1 (en) | Method and system for generating playlist using sound source content and meta information | |
WO2020225338A1 (en) | Methods and systems for determining compact semantic representations of digital audio signals | |
US20220222294A1 (en) | Densification in Music Search and Recommendation | |
CN108170845B (en) | Multimedia data processing method, device and storage medium | |
Álvarez et al. | RIADA: A machine-learning based infrastructure for recognising the emotions of spotify songs | |
Chen et al. | Music recommendation based on multiple contextual similarity information | |
CN108280165A (en) | Reward value music recommendation algorithm based on state transfer | |
Sánchez-Moreno et al. | Dynamic inference of user context through social tag embedding for music recommendation | |
Gupta et al. | Songs recommendation using context-based semantic similarity between lyrics | |
Yeh et al. | Popular music representation: chorus detection & emotion recognition | |
Wishwanath et al. | A personalized and context aware music recommendation system | |
Jannach et al. | Music Recommendations | |
Burns et al. | Automated music recommendations using similarity learning | |
Li et al. | Query-document-dependent fusion: A case study of multimodal music retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||