CN110517671B - Audio information evaluation method and device and storage medium - Google Patents
Audio information evaluation method and device and storage medium
- Publication number
- CN110517671B (publication) CN201910819121.1A (application)
- Authority
- CN
- China
- Prior art keywords
- audio
- trained
- evaluated
- information
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications (all under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING)
- G10L15/063—Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice; training
- G10L15/08—Speech recognition; speech classification or search
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique, using neural networks
- G10L25/51—Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques specially adapted for comparison or discrimination, for measuring the quality of voice signals
Abstract
The embodiments of the application disclose an audio information evaluation method, an audio information evaluation device, and a storage medium. Audio to be trained is obtained and assigned corresponding training label information; training feature information corresponding to the audio to be trained is extracted; the training feature information and the corresponding training label information are input into a preset model for training to obtain a trained preset model; and audio to be evaluated is evaluated based on the trained preset model. Because the preset model is trained on the training label information and training feature information of the audio to be trained, and the audio to be evaluated is then evaluated automatically by the trained model, manual labor is saved, the evaluation of audio information is accelerated, cost is greatly reduced, and the evaluation efficiency of audio information is improved.
Description
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to an audio information evaluation method, apparatus, and storage medium.
Background
Digital audio, as the name implies, is audio stored on network servers as a digital signal and streamed across the network. It has the advantage of speed: audio can be downloaded immediately on demand. Because digital audio does not rely on traditional music carriers such as tapes or CDs, it avoids wear and preserves audio quality.
In the prior art, the process of producing digital audio, owing to differing recording environments, transcoding methods, and the like, generates a large amount of digital audio with similar content but uneven quality. Some low-quality digital audio even suffers from monotony, disordered beats, discontinuity of sound, and sudden rhythm interruptions. Such low-quality digital audio spreads through the network, interferes with users, and seriously degrades the audio experience.
In the course of research and practice on the prior art, the inventors of the present application found that, although the prior art provides methods for manually evaluating the quality of digital audio, for the huge volume of digital audio involved, manual evaluation is too slow, its cost is too high, and its evaluation efficiency is low.
Disclosure of Invention
The embodiment of the application provides an audio information evaluation method, an audio information evaluation device and a storage medium, and aims to reduce cost and improve the evaluation efficiency of audio information.
In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:
a method of evaluating audio information, comprising:
acquiring audio to be trained, and distributing corresponding training label information to the audio to be trained;
extracting training characteristic information corresponding to the audio to be trained;
inputting the training characteristic information and the corresponding training label information into a preset model for training to obtain a trained preset model;
and evaluating the audio to be evaluated based on the trained preset model.
An apparatus for evaluating audio information, comprising:
the acquisition unit is used for acquiring the audio to be trained and distributing corresponding training label information to the audio to be trained;
the extraction unit is used for extracting training characteristic information corresponding to the audio to be trained;
the training unit is used for inputting the training characteristic information and the corresponding training label information into a preset model for training to obtain a trained preset model;
and the evaluation unit is used for evaluating the audio to be evaluated based on the trained preset model.
In some embodiments, the obtaining unit includes:
the acquisition subunit is used for acquiring the audio to be trained and extracting the audio fingerprint of the audio to be trained;
the determining subunit is used for determining the audio to be trained with the audio fingerprint similarity larger than a preset threshold as the same audio group to be trained;
and the distribution subunit is used for distributing corresponding training label information to the audio to be trained.
In some embodiments, the evaluation unit is specifically configured to:
extracting evaluation characteristic information corresponding to the audio to be evaluated;
and inputting the evaluation characteristic information into a trained preset model to obtain evaluation information corresponding to the audio to be evaluated.
In some embodiments, the first extraction unit includes:
the formatting subunit is configured to perform normalization processing on the format of the audio to be trained to obtain a target audio to be trained with a normalized format;
the calculating subunit is used for calculating a hash value corresponding to the target audio to be trained;
the de-duplication subunit is used for carrying out de-duplication processing on the target audio to be trained with the same hash value;
and the extraction subunit is used for extracting training characteristic information corresponding to the target audio to be trained after the duplication removal processing.
In some embodiments, the extraction subunit is specifically configured to:
and extracting the sampling rate, bit depth, code rate, spectrum contrast, flatness, spectrum entropy, spectrum centroid and spectrum roll-off coefficient of the target audio to be trained after the duplication removal processing.
In some embodiments, the training unit is specifically configured to:
and inputting the sampling rate, the bit depth, the code rate, the spectrum contrast, the flatness, the spectrum entropy, the spectrum centroid, the spectrum roll-off coefficient and the corresponding training label information of the target audio to be trained after the duplication removal processing into a preset model for training to obtain the trained preset model.
In some embodiments, the evaluation unit is specifically configured to:
calculating a hash value corresponding to the audio to be evaluated;
acquiring the audio to be evaluated with the same hash value and carrying out corresponding marking;
extracting the sampling rate, bit depth, code rate, spectrum contrast, flatness, spectrum entropy, spectrum centroid and spectrum roll-off coefficient of one to-be-evaluated audio and an unmarked to-be-evaluated audio in the to-be-evaluated audios with the same mark;
and inputting the sampling rate, the bit depth, the code rate, the spectrum contrast, the flatness, the spectrum entropy, the spectrum centroid and the spectrum roll-off coefficient of one to-be-evaluated audio and an unmarked to-be-evaluated audio in the same marked to-be-evaluated audio into the trained preset model to obtain first evaluation information corresponding to the one to-be-evaluated audio and the unmarked to-be-evaluated audio in the same marked to-be-evaluated audio.
In some embodiments, the apparatus further comprises:
the first determining unit is used for determining a target audio to be evaluated corresponding to the first evaluation information;
the second extraction unit is used for extracting the spectral height and the spectral notch coefficient of the target audio to be evaluated;
the weighting parameter determining unit is used for determining a first weighting parameter and a second weighting parameter according to the spectral height and the spectral notch coefficient;
and the weighting unit is used for weighting the first evaluation information according to the first weighting parameter and the second weighting parameter to obtain target evaluation information.
In some embodiments, the apparatus further comprises:
the second determining unit is used for determining the target audio to be evaluated corresponding to the target evaluation information;
and the setting unit is used for determining the audio to be evaluated which is the same as the mark when the mark of the target audio to be evaluated is detected, and setting the evaluation information of the audio to be evaluated which is the same as the mark as the target evaluation information.
In a third aspect, an embodiment of the present application provides a storage medium having a computer program stored thereon which, when run on a computer, causes the computer to perform the audio information evaluation method provided in any embodiment of the present application.
According to the embodiments of the application, the audio to be trained is obtained and assigned corresponding training label information; training feature information corresponding to the audio to be trained is extracted; the training feature information and the corresponding training label information are input into a preset model for training to obtain a trained preset model; and the audio to be evaluated is evaluated based on the trained preset model. Compared with schemes in which a large amount of audio information must be manually analyzed for quality evaluation, this greatly reduces cost and improves the evaluation efficiency of audio information.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present application, and a person skilled in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic view of a scenario of an audio information evaluation system provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating an evaluation method of audio information provided by an embodiment of the present application;
FIG. 3 is a schematic flow chart of another method for evaluating audio information according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a scenario of an evaluation method for audio information according to an embodiment of the present application;
FIG. 5a is a schematic structural diagram of an apparatus for evaluating audio information according to an embodiment of the present application;
FIG. 5b is a schematic structural diagram of an apparatus for evaluating audio information according to an embodiment of the present application;
FIG. 5c is a schematic structural diagram of an apparatus for evaluating audio information according to an embodiment of the present application;
FIG. 5d is a schematic structural diagram of an apparatus for evaluating audio information according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides an audio information evaluation method, an audio information evaluation device and a storage medium.
Referring to fig. 1, fig. 1 is a schematic view of a scene of an audio information evaluation system according to an embodiment of the present application. The system includes a terminal A and a server (the evaluation system may also include terminals other than terminal A; the specific number of terminals is not limited here). Terminal A and the server may be connected through a communication network, which may include wireless and wired networks; the wireless network includes one or more of a wireless wide area network, a wireless local area network, a wireless metropolitan area network, and a wireless personal area network. The network includes network entities such as routers and gateways, which are not shown in the figure. Terminal A may exchange information with the server through the communication network. For example, when searching for certain audio information, terminal A may automatically generate an audio search instruction indicating the corresponding audio information, such as the audio name "if there is one day", and upload the instruction to the server; the server may then perform an overall evaluation of the audio information indicated by the audio search instruction.
The audio information evaluation system may include an audio information evaluation device, which may be integrated in a server. It should be noted that in this embodiment of the application the evaluation device is integrated in the server; in other embodiments it may instead be integrated in a terminal. In fig. 1, the server is mainly configured to: receive an audio search instruction sent by terminal A and acquire the audio information it indicates; before evaluating that audio information, acquire audio to be trained and assign it corresponding training label information; extract training feature information corresponding to the audio to be trained, and input the training feature information together with the corresponding training label information into a preset model for training, so that the trained preset model acquires the ability to evaluate subsequent audio information; on that basis, determine the multiple audios to be evaluated corresponding to the audio information indicated by the search instruction, evaluate them with the trained preset model, and obtain an audio information list sorted by evaluation result; and send that sorted list to terminal A. On receiving the list, the user can quickly see which audios have good quality and which have poor quality, which saves the user screening time and improves the user experience.
The audio information evaluation system can also comprise a terminal A, wherein the terminal A can be provided with various applications required by the user, such as a music application, a browser application, an instant messaging application and the like, and can generate an audio search instruction to be uploaded to a server when the user searches music audio through the music application.
It should be noted that the scene schematic diagram of the audio information evaluation system shown in fig. 1 is merely an example, and the audio information evaluation system and the scene described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not form a limitation on the technical solution provided in the embodiment of the present application.
The following are detailed below. The numbers in the following examples are not intended to limit the order of preference of the examples.
Example I,
In the present embodiment, the description is given from the viewpoint of an audio information evaluation device, which may be integrated in a server that has computing capability and is equipped with a storage unit and a microprocessor.
A method of evaluating audio information, comprising: acquiring audio to be trained, and distributing corresponding training label information for the audio to be trained; extracting training characteristic information corresponding to the audio to be trained; inputting training characteristic information and corresponding training label information into a preset model for training to obtain a trained preset model; and evaluating the audio to be evaluated based on the trained preset model.
Referring to fig. 2, fig. 2 is a flowchart illustrating an audio information evaluation method according to an embodiment of the present disclosure. The audio information evaluation method comprises the following steps:
in step 101, an audio to be trained is obtained, and corresponding training label information is assigned to the audio to be trained.
The number of audios to be trained may be large, and they may come in multiple groups, for example 1000 groups of audio to be trained. The format of each audio to be trained may be the same or different; for example, an audio to be trained may be in MPEG-1 Audio Layer III (MP3), Free Lossless Audio Codec (FLAC), or Ogg Vorbis (OGG) format, among others.
Further, the training label information is a manually assigned score: the higher the score, the better the quality of the corresponding audio, and the lower the score, the worse the quality. Each audio to be trained is assigned corresponding training label information for use in the subsequent training process.
In some embodiments, the step of acquiring the audio to be trained and assigning corresponding training label information to the audio to be trained may include:
(1) acquiring an audio to be trained, and extracting an audio fingerprint of the audio to be trained;
(2) determining the audio to be trained with the audio fingerprint similarity larger than a preset threshold as the same audio group to be trained;
(3) and distributing corresponding training label information to the audio to be trained.
An audio fingerprint is a unique digital feature extracted from audio by a specific audio processing algorithm; it represents the audio information, is unique to it, and is used to quickly retrieve and locate a sound sample to be detected in a massive database. On this basis, the audio fingerprint corresponding to each audio to be trained can be extracted, and the fingerprints matched pairwise; audios to be trained whose matching degree exceeds a preset threshold are classified into the same audio group to be trained. The matching degree is the similarity of the audio fingerprint features and ranges from 0 to 1: a matching degree of 0 means two audios to be trained are completely unmatched, and 1 means they are completely matched. The preset threshold of this embodiment may be set to 0.95, i.e., audios to be trained whose fingerprint matching degree exceeds 0.95 are considered to satisfy the fingerprint matching requirement and are classified into the same group to be trained. In some embodiments, the preset threshold may be set to other values, such as 0.9 or 0.7; the larger the preset threshold, the more similar the matched audios to be trained, and vice versa.
Furthermore, similar audios to be trained are determined to belong to the same audio group according to fingerprint similarity; that is, the audios to be trained are divided into several groups, and the audios within each group are similar to one another, like multiple recordings of the same song. Corresponding training label information is then assigned to each audio to be trained. Because similar audios are placed in the same group to be trained, subsequent processing is more efficient.
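To make the grouping step concrete, here is a minimal Python sketch. The `fingerprint` and `similarity` functions are hypothetical placeholders (e.g., wrappers around a chromaprint-style library); neither the function names nor the greedy clustering strategy come from the patent itself.

```python
def group_by_fingerprint(audios, fingerprint, similarity, threshold=0.95):
    """Greedily cluster audios whose pairwise fingerprint similarity
    exceeds the preset threshold (0.95 in this embodiment)."""
    fps = {a: fingerprint(a) for a in audios}
    groups = []  # each group holds audios assumed to share content
    for audio in audios:
        for group in groups:
            # compare against the group's first member as its representative
            if similarity(fps[audio], fps[group[0]]) > threshold:
                group.append(audio)
                break
        else:
            groups.append([audio])
    return groups
```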
In step 102, training feature information corresponding to the audio to be trained is extracted.
It can be understood that, in the process of producing audio, changes in the recording environment or transcoding method can generate a large amount of audio with similar content but different quality. "Similar content" means that the duration and content of the audios are close but not identical; "different quality" means that the sound quality differs, with some audios noticeably degraded or poor-sounding.
In an embodiment, different audios to be trained generally have different quality, and the differences are mainly reflected in the training feature information, which represents the corresponding features of each audio to be trained. The training feature information may include, but is not limited to: sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, spectral roll-off coefficient, spectral height, and/or spectral notch coefficient. The training feature information corresponding to each audio to be trained can therefore be extracted in turn.
It should be noted that the sampling rate is the number of times per second the recording device samples the sound signal; the higher the sampling rate, the more realistic and natural the restored sound. Bit depth, also called sampling bit depth, determines the dynamic range of the audio. The code rate, also called audio bit rate, is the amount of information carried per second in the data stream; it can be understood as the amount of data, in bits, transmitted per second and is used to indicate quality: in principle, the higher the audio bit rate, the better. These three parameters, sampling rate, bit depth, and code rate, are stored in the audio container and can be obtained when the data is read. Spectral contrast is obtained by computing the difference between higher- and lower-frequency intervals. Flatness is obtained by computing the rate of change of the energy fluctuation in the frequency domain. Spectral entropy is obtained from the distribution probability of frequency-domain energy at different frequencies. The spectral centroid is the center of the distribution of frequency-domain energy. The spectral roll-off coefficient is the difference between the frequencies below which 99% and 90% of the frequency-domain energy lies, respectively. The spectral height is the highest frequency reached in the frequency domain; the greater the spectral height, the more high-frequency information. The spectral notch coefficient measures how strongly the frequency-domain energy is notched as frequency increases; the smaller the spectral notch coefficient, the smaller the sound-quality loss and the higher the sound quality.
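A minimal sketch of the spectral feature extraction, assuming librosa and NumPy (the patent names no library). The sampling rate, bit depth, and code rate live in the audio container and would be read separately (e.g., with soundfile or mutagen), so they are omitted here.

```python
import numpy as np
import librosa

def extract_features(path):
    """Extract the per-track spectral features named above (one value each,
    averaged over frames). Container parameters are read elsewhere."""
    y, sr = librosa.load(path, sr=None, mono=True)
    S = np.abs(librosa.stft(y))
    p = S / (S.sum(axis=0, keepdims=True) + 1e-10)  # per-frame energy distribution
    return {
        "spectral_contrast": float(librosa.feature.spectral_contrast(y=y, sr=sr).mean()),
        "flatness": float(librosa.feature.spectral_flatness(y=y).mean()),
        "spectral_entropy": float((-p * np.log2(p + 1e-10)).sum(axis=0).mean()),
        "spectral_centroid": float(librosa.feature.spectral_centroid(y=y, sr=sr).mean()),
        # roll-off coefficient: difference between the 99% and 90% roll-off frequencies
        "rolloff_coeff": float(
            librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.99).mean()
            - librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.90).mean()
        ),
    }
```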
In some embodiments, the step of extracting training feature information corresponding to the audio to be trained includes:
(1) unifying the format of the audio to be trained to obtain a target audio to be trained with a unified format;
(2) calculating the corresponding hash value of the target audio to be trained;
(3) carrying out duplicate removal processing on the target audio to be trained with the same hash value;
(4) and extracting training characteristic information corresponding to the target audio to be trained after the duplication removal processing.
Because the audio to be trained may come in various formats, it must be normalized for subsequent unified processing: audios to be trained in different formats are all converted into the same format, such as WAV, yielding target audio to be trained with a unified format, so that features of different dimensions can be extracted subsequently.
Further, a hash value corresponding to the target audio to be trained can be computed by a preset algorithm; a hash value is usually represented as a short string of seemingly random letters and digits. In an embodiment, each target audio to be trained may be processed in turn with the MD5 message-digest algorithm (MD5 Message-Digest Algorithm) to obtain its hash value. Audios with the same hash value are completely identical, so only one of them needs to be computed, and its final result applies equally to every audio identical to it. It should be noted that audios whose fingerprint matching degree is 1 are not necessarily identical, whereas audios with the same hash value are necessarily identical, and identical audios necessarily have a fingerprint matching degree of 1.
Therefore, to improve the efficiency of subsequent training, the target audios to be trained with the same hash value can be deduplicated: of the repeated target audios sharing a hash value, only one is kept. The training feature information corresponding to the deduplicated target audio to be trained is then extracted.
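A short sketch of the hash-based deduplication, using Python's standard hashlib MD5: byte-identical files share a digest, so one representative per digest is kept.

```python
import hashlib

def md5_of(path, chunk=1 << 20):
    """MD5 digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def deduplicate(paths):
    seen = {}
    for p in paths:
        seen.setdefault(md5_of(p), p)  # first file wins; duplicates dropped
    return list(seen.values())
```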
In an embodiment, the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, and spectral roll-off coefficient of the target audio to be trained after the deduplication processing may be extracted.
In step 103, the training feature information and the corresponding training label information are input into a preset model for training, so as to obtain a trained preset model.
Each audio to be trained comprises corresponding training label information, and the training label information is manually marked, so that the training characteristic information and the corresponding training label information are input into a preset model together for training, the distribution rule between the training characteristic information and the training label information can be learned by the preset model, namely, the capability of marking the characteristic information is learned, and the preset model after training with the evaluation capability is obtained.
In an embodiment, the preset model may be a neural-network learning model, such as a convolutional neural network model, or a support vector machine (SVM) learning model. Taking the SVM learning model as the example: the support vector machine is a widely used generalized linear classifier that performs binary classification of data in a supervised-learning manner, and the trainer used here is an SVM-based ranking algorithm, i.e., a Ranking SVM. The training feature information and the corresponding training label information are input into the SVM learning model for training, and the model derives statistical rules from the correspondence between the distribution of the training feature information and the training label information, so that the trained SVM learning model has an evaluation capability close to manual evaluation.
In some embodiments, the step of inputting the training feature information and the corresponding training label information into a preset model for training to obtain a trained preset model includes inputting a sampling rate, a bit depth, a code rate, a spectral contrast, a flatness, a spectral entropy, a spectral centroid, a spectral roll-off coefficient, and corresponding training label information of the target audio to be trained after the deduplication processing into the preset model for training to obtain the trained preset model.
The sampling rate, the bit depth, the code rate, the spectrum contrast, the flatness, the spectrum entropy, the spectrum centroid and the spectrum roll-off coefficient of the target audio to be trained after the duplication removal processing and corresponding training label information are input into a preset model for training, so that the preset model learns the regular distribution among the sampling rate, the bit depth, the code rate, the spectrum contrast, the flatness, the spectrum entropy, the spectrum centroid and the spectrum roll-off coefficient and the score, and the preset model after the training has the capability of scoring and evaluating corresponding to the sampling rate, the bit depth, the code rate, the spectrum contrast, the flatness, the spectrum entropy, the spectrum centroid and the spectrum roll-off coefficient.
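A hedged sketch of Ranking-SVM training using the standard pairwise transform: within each group, the feature difference of two differently labelled audios becomes a binary classification sample for a linear SVM. scikit-learn is an assumption here; the patent names only a Ranking SVM, not a library.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def train_ranking_svm(groups):
    """groups: list of (X, y) pairs, where X is an (n, d) feature array for
    one audio group and y holds the manual scores (label information)."""
    diffs, signs = [], []
    for X, y in groups:
        for i, j in combinations(range(len(y)), 2):
            if y[i] == y[j]:
                continue  # equal labels carry no ranking information
            diffs.append(X[i] - X[j])
            signs.append(1 if y[i] > y[j] else -1)
    model = LinearSVC()
    model.fit(np.asarray(diffs), np.asarray(signs))
    return model  # model.decision_function(x) then acts as a quality score
```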
In step 104, the audio to be evaluated is evaluated based on the trained preset model.
The trained preset model has the ability to evaluate and score feature information. Accordingly, evaluation feature information can be extracted from the audio to be evaluated and input into the trained preset model, which outputs corresponding score evaluation information based on that feature information: the higher the score, the better the quality of the corresponding audio to be evaluated, and the lower the score, the worse the quality.
In some embodiments, after the trained preset model evaluates the audio to be evaluated and obtains the evaluation result, a feedback message of the user may be received, and if the user feels that the evaluation result is wrong, an error may be reported, and the evaluation result may be finely adjusted based on the feedback message, so that the evaluation result fits the use habit of the user as much as possible in actual use.
In some embodiments, the step of evaluating the audio to be evaluated based on the trained preset model may include:
(1) calculating a hash value corresponding to the audio to be evaluated;
(2) acquiring the audio to be evaluated with the same hash value and carrying out corresponding marking;
(3) extracting the sampling rate, bit depth, code rate, spectrum contrast, flatness, spectrum entropy, spectrum centroid and spectrum roll-off coefficient of one to-be-evaluated audio and an unmarked to-be-evaluated audio in the to-be-evaluated audios with the same mark;
(4) and inputting the sampling rate, the bit depth, the code rate, the spectrum contrast, the flatness, the spectrum entropy, the spectrum centroid and the spectrum roll-off coefficient of one to-be-evaluated audio and an unmarked to-be-evaluated audio in the same marked to-be-evaluated audio into the trained preset model to obtain first evaluation information corresponding to the one to-be-evaluated audio and the unmarked to-be-evaluated audio in the same marked to-be-evaluated audio.
The audio to be evaluated may be a group of audios whose fingerprints all match one another. Since audios with the same hash value are identical, only one of them needs to be evaluated, and the evaluation results of the others are identical to its result. Accordingly, the hash value of each audio to be evaluated is computed first, and audios to be evaluated with the same hash value are identified and marked correspondingly.
Furthermore, the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, and spectral roll-off coefficient are extracted only for one audio out of each set of identically marked audios to be evaluated, plus every unmarked audio to be evaluated, and input into the trained preset model. This yields first evaluation information for that one audio of each identically marked set and for the unmarked audios; the first evaluation information of a marked audio can then be synchronized to the other audios bearing the same mark, which further improves efficiency.
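A sketch of this evaluation flow, reusing the hypothetical extract_features and md5_of helpers from the sketches above: features are extracted once per distinct hash mark, and the resulting score is copied to every duplicate.

```python
def evaluate(paths, model):
    """Score each audio; duplicates (same MD5 mark) reuse the representative's
    score. Feature order must match the order used at training time."""
    scores, rep_for_mark = {}, {}
    for p in paths:
        mark = md5_of(p)
        if mark in rep_for_mark:  # duplicate: reuse representative's score
            scores[p] = scores[rep_for_mark[mark]]
            continue
        feats = extract_features(p)
        x = [[feats[k] for k in sorted(feats)]]
        scores[p] = float(model.decision_function(x)[0])
        rep_for_mark[mark] = p
    return scores
```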
As can be seen from the above, in the embodiment of the present application, the audio to be trained is obtained and assigned corresponding training label information; training feature information corresponding to the audio to be trained is extracted; the training feature information and the corresponding training label information are input into a preset model for training to obtain a trained preset model; and the audio to be evaluated is evaluated based on the trained preset model. Compared with schemes in which a large amount of audio information must be manually analyzed for quality evaluation, this greatly reduces cost and improves the evaluation efficiency of audio information.
Example II,
The method described in the first embodiment is further illustrated by way of example.
In this embodiment, an example will be described in which the audio information evaluation device is specifically integrated in a server.
Referring to fig. 3, fig. 3 is another flow chart illustrating an audio information evaluation method according to an embodiment of the present application. The method flow can comprise the following steps:
in step 201, the server obtains the audio to be trained, extracts the audio fingerprint of the audio to be trained, determines the audio to be trained with the audio fingerprint similarity greater than the preset threshold as the same audio group to be trained, and allocates corresponding training label information to the audio to be trained.
The server acquires multiple audios to be trained and extracts their audio fingerprints in turn. With the preset threshold at 0.95, the fingerprints of the audios to be trained are matched pairwise, and audios whose similarity exceeds 0.95 are determined to belong to the same group to be trained. The number of audios to be trained in each group may be the same or different. In this way similar audios to be trained are classified together, which improves subsequent processing efficiency.
Further, the server assigns corresponding training label information to each audio to be trained. The training label information can be marked manually: the higher the audio quality, the higher the corresponding label value, and the lower the quality, the lower the label value.
In step 202, the server unifies the format of the audio to be trained to obtain a target audio to be trained with a unified format.
The server unifies the format of the audio to be trained; for example, audio to be trained in MP3, FLAC, and OGG formats is all converted into target audio to be trained in WAV format, to facilitate subsequent feature extraction.
In step 203, the server calculates a hash value corresponding to the target audio to be trained, and performs deduplication processing on the target audio to be trained with the same hash value.
The server can sequentially compute the hash value of each target audio to be trained with the MD5 message-digest algorithm. Since identical hash values indicate completely identical audios, the target audios to be trained with the same hash value can be deduplicated to improve later training efficiency, keeping only one audio per hash value.
In step 204, the server extracts the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid and spectral roll-off coefficient of the target audio to be trained after the deduplication processing.
The server can extract the sampling rate, bit depth, code rate, spectrum contrast, flatness, spectrum entropy, spectrum centroid and spectrum roll-off coefficient of the target audio to be trained after the deduplication processing as the characteristic information of the later training.
Further, after the server extracts the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, and spectral roll-off coefficient of the deduplicated target audio to be trained, it may combine these features with the corresponding label information to generate a corresponding set to be trained. For example, the set to be trained is a list, List = [G1, G2, G3, …, GN], of N groups, where G1 is the first group to be trained, G2 the second, and so on. Each group to be trained contains the feature information of several audios to be trained together with their label set; for example, group GN may take the form GN = [ID, feature1, feature2, …, featureN], where ID identifies an audio to be trained, feature1 is, say, its sampling rate, and so on, and featureN is the label information.
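A sketch of assembling that List-of-groups structure, reusing the hypothetical extract_features helper from Example I; the tuple layout is illustrative, not the patent's exact notation.

```python
def build_training_list(groups_of_paths, labels):
    """groups_of_paths: audio paths already grouped by fingerprint;
    labels: mapping from path to its manual score (label information)."""
    training_list = []
    for group in groups_of_paths:
        X = [list(extract_features(p).values()) for p in group]
        y = [labels[p] for p in group]
        training_list.append((X, y))  # one (features, labels) set per group
    return training_list
```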
In step 205, the server inputs the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, spectral roll-off coefficient, and corresponding training label information of the target audio to be trained after the deduplication processing into a preset model for training, so as to obtain a trained preset model.
The preset model is a support vector machine learning model. The server can input the deduplicated set to be trained, List, into the SVM learning model for training; the model learns the correspondence between the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, and spectral roll-off coefficient and the corresponding training label information, and the statistical rules relating them, so that the trained SVM learning model has an evaluation capability close to manual evaluation.
In step 206, the server calculates the hash value corresponding to the audio to be evaluated, obtains the audio to be evaluated with the same hash value, and performs corresponding marking.
After the trained support vector machine learning model is obtained, the server can receive an evaluation request sent by the terminal. As shown in display interface 10 in fig. 4, a user may enter the audio name "if there is one day" on a music playing interface. In the prior art, as shown in display interface 11, after the user clicks search, the server directly feeds back to the terminal's display interface the search results corresponding to the audio name "if there is one day"; the results are sorted only by provider name, so the user can only click one of the audios at random to play it. Because the sound quality is unknown, the user may well pick an audio with very poor sound quality, giving a very poor experience.
Therefore, the server needs to evaluate the audio in advance and filter out audio with extremely poor sound quality, avoiding wasted storage space and manual management cost. As shown in fig. 4, after receiving the audio name "if there is one day", the server may first determine the 4 audios to be evaluated corresponding to that name, compute the hash value of each of the 4 audios, and mark identical audios correspondingly, for example marking provider 2 and provider 1 as corresponding.
In step 207, the server extracts a sampling rate, a bit depth, a code rate, a spectral contrast, a flatness, a spectral entropy, a spectral centroid and a spectral roll-off coefficient of one of the audio to be evaluated with the same mark and the audio to be evaluated without the mark.
As shown in fig. 4, the server only needs to extract the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, and spectral roll-off coefficient of the target audios to be evaluated corresponding to provider 1, provider 3, and provider 4; because feature extraction for the audio to be evaluated corresponding to provider 2 is omitted, feature-extraction time is saved.
In step 208, the server inputs the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, and spectral roll-off coefficient of one of the identically labeled audio to be evaluated and the unlabeled audio to be evaluated into the trained preset model, so as to obtain first evaluation information corresponding to the one of the identically labeled audio to be evaluated and the unlabeled audio to be evaluated.
As shown in fig. 4, the server inputs the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, and spectral roll-off coefficient of the target audios to be evaluated corresponding to provider 1, provider 3, and provider 4 into the trained support vector machine learning model, and obtains their first evaluation information: 51 points for the audio to be evaluated corresponding to provider 1, 83 points for provider 3, and 82 points for provider 4.
In step 209, the server determines a target audio to be evaluated corresponding to the first evaluation information, and extracts a spectral height and a spectral notch coefficient of the target audio to be evaluated.
As shown in fig. 4, the server determines target audio to be evaluated corresponding to the provider 1, the provider 3, and the provider 4, and extracts the spectral height and the spectral notch coefficient of the target audio to be evaluated corresponding to the provider 1, the provider 3, and the provider 4, respectively.
In step 210, the server determines a first weighting parameter and a second weighting parameter according to the spectral height and the spectral notch coefficient, and weights the first evaluation information according to the first weighting parameter and the second weighting parameter to obtain target evaluation information.
The server can also weight the target audio to be evaluated according to the spectral height and the spectral notch coefficient. The spectral height determines the first weighting parameter: the greater the spectral height, the more high-frequency information, and the larger the first weighting parameter assigned. The spectral notch coefficient correspondingly determines the second weighting parameter: the smaller the notch coefficient, the smaller the sound-quality loss, and the larger the second weighting parameter assigned.
Further, after determining the first and second weighting parameters corresponding to the target audio to be evaluated, the server weights the first evaluation information of the corresponding target audio according to those parameters to obtain the target evaluation information. Because the target evaluation information integrates the spectral height and the spectral notch coefficient, it is more accurate than the first evaluation information.
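The patent specifies which direction the weights move in but not the combination formula, so the following sketch assumes a simple linear form; the reference constants h_ref, n_ref, and alpha are illustrative only.

```python
def weight_score(first_score, spectral_height, notch_coeff,
                 h_ref=20000.0, n_ref=1.0, alpha=0.05):
    """Assumed weighting: scale the first evaluation score by weights derived
    from spectral height (higher -> larger w1) and spectral notch coefficient
    (smaller -> larger w2)."""
    w1 = min(spectral_height / h_ref, 1.0)    # more high-frequency information
    w2 = max(1.0 - notch_coeff / n_ref, 0.0)  # less sound-quality loss
    return first_score * (1.0 + alpha * (w1 + w2))
```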
For example, as shown in fig. 4, assume the spectral height and spectral notch coefficient of the target audio to be evaluated corresponding to provider 1 are both the lowest, those of provider 3 both the highest, and those of provider 4 intermediate. Then the target audio corresponding to provider 1 may be assigned a first weighting parameter of 2 and a second weighting parameter of 2; provider 3 a first weighting parameter of 5 and a second weighting parameter of 5; and provider 4 a first weighting parameter of 3 and a second weighting parameter of 3. Weighting with these parameters, the first evaluation information of the target audio corresponding to provider 1 yields target evaluation information of 55, that of provider 3 yields 93, and that of provider 4 yields 88.
In step 211, the server determines the target audio to be evaluated corresponding to the target evaluation information; when a mark on the target audio to be evaluated is detected, it determines the audios to be evaluated bearing the same mark and sets their evaluation information to the target evaluation information.
After the server determines the target evaluation information, the other audios to be evaluated bearing the same mark have not yet been evaluated. The server therefore traverses the target audios to be evaluated corresponding to the target evaluation information; when it detects that a target audio carries a mark, it determines the other not-yet-evaluated audios with the same mark and sets their evaluation information to the target evaluation information, so that every audio ends up with its corresponding target evaluation information.
As shown in fig. 4, in display interface 12, when the server detects that the target audio to be evaluated corresponding to provider 1 carries a mark, it determines that the audio to be evaluated corresponding to provider 2 carries the same mark and sets its evaluation information to the target evaluation information 55. The audios corresponding to providers 1, 2, 3, and 4 are then sorted in descending order of target evaluation information and fed back to the terminal's display interface, each audio carrying its corresponding target evaluation information, i.e., its score. The user can thus directly select the highest-quality audio to play according to the scores, giving the best experience.
In an embodiment, after determining the target evaluation information, the server may automatically screen out the audios whose target evaluation information is below a preset threshold to generate a low-quality list, and may also automatically delete the audios in that list, saving the server's storage cost and manual management cost.
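A minimal sketch of that screening step; the threshold value is hypothetical.

```python
def screen_low_quality(scores, threshold=60.0):
    """Collect audios scoring below the preset threshold into a low-quality
    list, which can then be purged from storage."""
    return [p for p, s in scores.items() if s < threshold]
```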
As can be seen from the above description, in the embodiments of the present application: audios to be trained are obtained and grouped by audio fingerprint, and each audio to be trained is assigned corresponding training label information; the formats of all audios to be trained are unified, and duplicate audios to be trained are deleted according to hash values; the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, spectral roll-off coefficient, and label information of the deduplicated audios to be trained are extracted and input into a support vector machine learning model for training, yielding a trained model; the hash values of the audios to be evaluated are computed, and audios with the same hash value are marked; the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, and spectral roll-off coefficient of only one audio among each set of identically marked audios, plus the unmarked audios, are extracted and input into the trained model to obtain first evaluation information; the target audio to be evaluated corresponding to the first evaluation information is determined, its spectral height and spectral notch coefficient are extracted to determine the first and second weighting parameters, and weighting yields the final target evaluation information, which is then propagated to the other audios bearing the same mark. Compared with schemes in which a large amount of audio information must be manually analyzed for quality evaluation, this saves training and evaluation time, further reduces cost, and improves the evaluation efficiency of audio information.
Example III,
In order to better implement the audio information evaluation method provided by the embodiments of the present application, the embodiments of the present application further provide a device based on the audio information evaluation method. The terms used here have the same meanings as in the audio information evaluation method above, and implementation details can be found in the description of the method embodiments.
Referring to fig. 5a, fig. 5a is a schematic structural diagram of an apparatus for evaluating audio information according to an embodiment of the present disclosure, where the apparatus for evaluating audio information may include an obtaining unit 301, a first extracting unit 302, a training unit 303, an evaluating unit 304, and the like.
The obtaining unit 301 is configured to obtain an audio to be trained, and assign corresponding training label information to the audio to be trained.
There may be multiple audios to be trained, or multiple groups of them, such as 1000 groups of audio to be trained, and the format of each audio to be trained may be the same or different, such as the MP3, FLAC, or OGG format, and so on.
Further, the training label information is a manually assigned score: the higher the score, the better the quality of the corresponding audio, and the lower the score, the worse the quality. The obtaining unit 301 assigns corresponding training label information to each audio to be trained for the subsequent training process.
In some embodiments, as shown in fig. 5b, the acquisition unit 301 may include an acquisition subunit 3011, a determination subunit 3012, and an assignment subunit 3013, as follows:
the obtaining subunit 3011 is configured to obtain an audio to be trained, and extract an audio fingerprint of the audio to be trained.
And the determining subunit 3012 is configured to determine, as the same audio group to be trained, the audio to be trained whose audio fingerprint similarity is greater than a preset threshold.
And the allocating subunit 3013 is configured to allocate corresponding training label information to the audio to be trained.
The obtaining subunit 3011 may extract the audio fingerprint corresponding to each audio to be trained and match the audio fingerprints pairwise, and the determining subunit 3012 classifies the audios to be trained whose matching degree is greater than a preset threshold as the same audio group to be trained. The matching degree is the similarity of the audio fingerprint features and ranges from 0 to 1: a matching degree of 0 means two audios to be trained are completely unmatched, and a matching degree of 1 means they are completely matched. The preset threshold of this embodiment may be set to 0.95; that is, audios to be trained whose matching degree is greater than 0.95 are considered to satisfy the audio fingerprint matching requirement and are classified as the same audio group to be trained. In some embodiments, the preset threshold may also be set to other values, such as 0.9 or 0.7; the larger the preset threshold, the more similar the matched audios to be trained, and vice versa.
Further, the determining subunit 3012 determines similar audios to be trained as the same audio group to be trained according to the similarity of the audio fingerprints; that is, the audios to be trained are divided into multiple audio groups to be trained, and the audios in each group are similar to each other, such as multiple recordings corresponding to one song. The allocating subunit 3013 then allocates corresponding training label information to each audio to be trained.
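A minimal grouping sketch, under the assumption that a similarity(a, b) function returning the matching degree in [0, 1] is available (the fingerprint extraction itself is outside this sketch); union-find is used so that transitively matching audios land in one group:

```python
from itertools import combinations

def group_by_fingerprint(audios, similarity, threshold=0.95):
    """Group audios whose pairwise fingerprint matching degree exceeds
    `threshold` (0.95 here, as in the embodiment above).

    audios: any hashable identifiers; similarity(a, b) is assumed to return
    a matching degree in [0, 1].
    """
    parent = {a: a for a in audios}

    def find(x):
        # path-halving find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(audios, 2):
        if similarity(a, b) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for a in audios:
        groups.setdefault(find(a), []).append(a)
    return list(groups.values())

# Toy check with a made-up similarity (identical first letter = same song):
demo = ["a1.mp3", "a2.flac", "b1.ogg"]
sim = lambda x, y: 1.0 if x[0] == y[0] else 0.0
print(group_by_fingerprint(demo, sim))  # [['a1.mp3', 'a2.flac'], ['b1.ogg']]
```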
A first extracting unit 302, configured to extract training feature information corresponding to the audio to be trained.
In an embodiment, different audios to be trained generally have different qualities, and these differences are mainly reflected in the training feature information, which represents the corresponding features of each audio to be trained. The training feature information may include, but is not limited to: sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, spectral roll-off coefficient, spectral height, and/or spectral notch coefficient, and the like. Therefore, the first extracting unit 302 may sequentially extract the training feature information corresponding to the audio to be trained.
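As a sketch only, most of these features can be approximated with librosa and soundfile; the frame parameters, the averaging of frame-level features into scalars, and the bit-depth fallback are assumptions rather than the patent's exact definitions, and the input is assumed to be WAV after the format unification described below:

```python
import os
import numpy as np
import librosa
import soundfile as sf

def training_features(path):
    """Eight-feature vector for one audio file (illustrative estimators only).

    Frame-level spectral features are averaged into one scalar each, so every
    audio yields a fixed-length vector.  Assumes `path` is a WAV file produced
    by the format unification step.
    """
    y, sr = librosa.load(path, sr=None, mono=True)
    info = sf.info(path)

    S = np.abs(librosa.stft(y))
    p = S / (S.sum(axis=0, keepdims=True) + 1e-12)  # per-frame spectral pmf
    spectral_entropy = float(np.mean(-(p * np.log2(p + 1e-12)).sum(axis=0)))

    duration = len(y) / sr
    code_rate = os.path.getsize(path) * 8 / duration  # average bits per second

    # Bit depth read from the PCM subtype, e.g. "PCM_16" -> 16; the fallback
    # for float or compressed subtypes is an assumption.
    digits = "".join(ch for ch in (info.subtype or "") if ch.isdigit())
    bit_depth = int(digits) if digits else 16

    return {
        "sampling_rate": sr,
        "bit_depth": bit_depth,
        "code_rate": code_rate,
        "spectral_contrast": float(np.mean(librosa.feature.spectral_contrast(y=y, sr=sr))),
        "flatness": float(np.mean(librosa.feature.spectral_flatness(y=y))),
        "spectral_entropy": spectral_entropy,
        "spectral_centroid": float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))),
        "spectral_rolloff": float(np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr))),
    }
```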
In some embodiments, as shown in fig. 5c, the first extraction unit 302 may include a formatting subunit 3021, a calculating subunit 3022, a de-weighting subunit 3023, and an extraction subunit 3024, as follows:
a formatting subunit 3021, configured to perform normalization processing on the format of the audio to be trained, so as to obtain a target audio to be trained with a normalized format.
And the computing subunit 3022 is configured to compute a hash value corresponding to the target audio to be trained.
And the deduplication subunit 3023 is configured to perform deduplication processing on the target audio to be trained with the same hash value.
And the extracting subunit 3024 is configured to extract training feature information corresponding to the target audio to be trained after the deduplication processing.
Since the audio to be trained may come in multiple formats, the formatting subunit 3021 needs to unify them for subsequent processing, converting audios to be trained of different formats into one common format, such as the WAV format, to obtain target audios to be trained with a unified format, so that features of different dimensions can be extracted subsequently.
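One way to sketch the format unification is with pydub, which shells out to ffmpeg for MP3/FLAC/OGG decoding; the output directory and file naming here are illustrative assumptions:

```python
from pathlib import Path
from pydub import AudioSegment  # decoding MP3/FLAC/OGG requires ffmpeg

def unify_format(src_path, out_dir="unified"):
    """Convert any supported input format to WAV so that all later feature
    extraction runs on one uniform format."""
    Path(out_dir).mkdir(exist_ok=True)
    audio = AudioSegment.from_file(src_path)
    target = Path(out_dir) / (Path(src_path).stem + ".wav")
    audio.export(target, format="wav")
    return str(target)
```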
Further, the calculating subunit 3022 may calculate the hash value corresponding to the target audio to be trained through a preset algorithm. In an embodiment, the calculating subunit 3022 may sequentially process the target audios to be trained through the MD5 message digest algorithm to obtain the hash value corresponding to each target audio to be trained. When the hash values are the same, the audios are completely identical, so only one of them needs to be processed, and its final result applies equally to the audios identical to it. It should be noted that audios whose audio fingerprint matching degree is 1 are not necessarily identical, whereas audios with the same hash value are necessarily identical.
Therefore, to improve the training efficiency on the subsequent audio to be trained, the deduplication subunit 3023 may perform deduplication processing on the target audios to be trained with the same hash value, that is, remove the repeated target audios to be trained with the same hash value, keeping only one; the extracting subunit 3024 then extracts the training feature information corresponding to the deduplicated target audio to be trained.
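A minimal deduplication sketch using Python's standard hashlib MD5 implementation; file-level hashing is assumed here, which matches the byte-identical interpretation above:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """MD5 message digest of a file, streamed in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate(paths):
    """Keep exactly one file per MD5 value; byte-identical copies are dropped."""
    kept = {}
    for p in paths:
        kept.setdefault(md5_of_file(p), p)
    return list(kept.values())
```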
In some embodiments, the extracting subunit 3024 is specifically configured to extract a sampling rate, a bit depth, a code rate, a spectral contrast, a flatness, a spectral entropy, a spectral centroid, and a spectral roll-off coefficient of the target audio to be trained after the deduplication processing.
The training unit 303 is configured to input the training feature information and the corresponding training label information into a preset model for training, so as to obtain a trained preset model.
Each audio to be trained carries corresponding training label information, which is a manually assigned score. The training unit 303 inputs the training feature information and the corresponding training label information into a preset model for training, so that the preset model can learn the distribution rule between the training feature information and the training label information, that is, the ability to score feature information; after training is finished, a trained preset model with this evaluation ability is obtained.
In some embodiments, the training unit 303 is specifically configured to input the sampling rate, the bit depth, the code rate, the spectral contrast, the flatness, the spectral entropy, the spectral centroid, the spectral roll-off coefficient, and the corresponding training label information of the target audio to be trained after the deduplication processing into a preset model for training, so as to obtain a trained preset model.
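For illustration, the training step can be sketched with scikit-learn; the choice of SVR, the RBF kernel, and the hyperparameters are assumptions, since the embodiments only state that a support vector machine learning model is trained on the eight features plus the manual score labels:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def train_preset_model(feature_vectors, scores):
    """Fit the 'preset model' as a support vector regressor.

    feature_vectors: shape (n_audios, 8), in the fixed order sampling rate,
    bit depth, code rate, spectral contrast, flatness, spectral entropy,
    spectral centroid, spectral roll-off coefficient.
    scores: the manually assigned training label information.
    Feature scaling matters for SVMs, hence the StandardScaler stage.
    """
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
    model.fit(np.asarray(feature_vectors, dtype=float),
              np.asarray(scores, dtype=float))
    return model
```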
And the evaluation unit 304 is configured to evaluate the audio to be evaluated based on the trained preset model.
Since the trained preset model has the ability to evaluate and score feature information, the evaluation unit 304 may correspondingly extract the evaluation feature information of the audio to be evaluated and input it into the trained preset model for evaluation, so that the trained preset model outputs corresponding score evaluation information according to the evaluation feature information: the higher the score, the better the quality of the corresponding audio to be evaluated, and the lower the score, the worse the quality.
In some embodiments, after the trained preset model evaluates the audio to be evaluated and obtains an evaluation result, the evaluation unit 304 may further receive a feedback message from the user; if the user finds that the evaluation result is wrong, an error can be reported, and the evaluation result can be fine-tuned based on the feedback message so that, in actual use, it fits the user's usage habits as closely as possible.
In some embodiments, the evaluation unit 304 is specifically configured to extract evaluation feature information corresponding to the audio to be evaluated; and inputting the evaluation characteristic information into a trained preset model to obtain evaluation information corresponding to the audio to be evaluated.
In some embodiments, the evaluation unit 304 is further configured to: calculate the hash value corresponding to the audio to be evaluated; acquire the audios to be evaluated with the same hash value and mark them correspondingly; extract the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, and spectral roll-off coefficient of one audio among the identically marked audios to be evaluated and of the unmarked audios to be evaluated; and input these features into the trained preset model to obtain the first evaluation information corresponding to that one audio and to the unmarked audios to be evaluated.
The audios to be evaluated may be a group of audios whose audio fingerprints all match each other. Since audios with the same hash value are identical, only one of them needs to be evaluated, and the evaluation results of the others are identical to it. Accordingly, the evaluation unit 304 calculates the hash value corresponding to each audio to be evaluated, obtains the audios to be evaluated with the same hash value, and marks them correspondingly.
Furthermore, the evaluation unit 304 only needs to extract the sampling rate, bit depth, code rate, spectral contrast, flatness, spectral entropy, spectral centroid, and spectral roll-off coefficient of one audio among the identically marked audios to be evaluated and of the unmarked audios to be evaluated, and input them into the trained preset model to obtain the corresponding first evaluation information. The evaluation unit 304 can then synchronize the first evaluation information of a marked audio to the other identically marked audios to be evaluated, which further improves the evaluation efficiency.
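A sketch of this mark-and-reuse evaluation flow, assuming a fitted model with a scikit-learn-style predict method and injected features/file_hash helpers (all names are illustrative):

```python
def evaluate_with_marks(model, audios, features, file_hash):
    """Score each audio once per hash group and copy the score to its
    duplicates, mirroring the marking scheme described above.

    audios: file paths; features(path) -> 8-value feature list;
    file_hash(path) -> hash string (e.g., md5_of_file above).
    """
    scores, representative = {}, {}
    for path in audios:
        h = file_hash(path)
        if h in representative:               # already marked: reuse the score
            scores[path] = scores[representative[h]]
        else:                                 # first of its mark, or unmarked
            representative[h] = path
            scores[path] = float(model.predict([features(path)])[0])
    return scores
```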
In some embodiments, as shown in fig. 5d, the apparatus for evaluating audio information further comprises:
the first determining unit 305 is configured to determine a target audio to be evaluated corresponding to the first evaluation information.
A second extracting unit 306, configured to extract the spectral height and spectral notch coefficient of the target audio to be evaluated.
A weighting parameter determining unit 307, configured to determine a first weighting parameter and a second weighting parameter according to the spectral height and the spectral notch coefficient.
The weighting unit 308 is configured to weight the first evaluation information according to the first weighting parameter and the second weighting parameter, so as to obtain target evaluation information.
The second determining unit 309 is configured to determine a target audio to be evaluated corresponding to the target evaluation information.
A setting unit 310, configured to, when it is detected that the target audio to be evaluated carries a mark, determine the audios to be evaluated that carry the same mark, and set the evaluation information of those audios to be evaluated as the target evaluation information.
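The mapping from spectral height and spectral notch coefficient to the two weighting parameters, and the combination rule itself, are not spelled out in this part of the description; the sketch below therefore rests on loudly labeled assumptions (linear normalization against reference values and a convex combination):

```python
def weighting_parameters(spectral_height, spectral_notch,
                         height_ref=1.0, notch_ref=1.0):
    """Map the two extra features to weighting parameters in [0, 1].

    The reference values and the linear mapping are assumptions; the claims
    only require that the first and second weighting parameters be determined
    from the spectral height and the spectral notch coefficient.
    """
    w1 = min(max(spectral_height / height_ref, 0.0), 1.0)
    w2 = 1.0 - min(max(spectral_notch / notch_ref, 0.0), 1.0)
    return w1, w2

def target_evaluation(first_score, w1, w2):
    """Weight the first evaluation information into the target evaluation
    information.  The convex combination below is illustrative only."""
    return first_score * (w1 + w2) / 2.0

# E.g. a high spectral height and a low notch coefficient keep most of the score:
print(target_evaluation(55.0, *weighting_parameters(0.9, 0.1)))  # -> 49.5
```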
The specific implementation of each unit can refer to the previous embodiment, and is not described herein again.
As can be seen from the above, in the embodiments of the present application, the obtaining unit 301 acquires the audio to be trained and allocates corresponding training label information to it; the first extracting unit 302 extracts the training feature information corresponding to the audio to be trained; the training unit 303 inputs the training feature information and the corresponding training label information into a preset model for training to obtain a trained preset model; and the evaluation unit 304 evaluates the audio to be evaluated based on the trained preset model. Compared with a scheme that requires a large amount of audio information to be analyzed manually for quality evaluation, this greatly reduces the cost and improves the evaluation efficiency of the audio information.
Example IV,
The embodiment of the present application further provides a server, as shown in fig. 6, which shows a schematic structural diagram of the server according to the embodiment of the present application, specifically:
the server may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the server architecture shown in FIG. 6 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the server. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the server, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The server further includes a power supply 403 for supplying power to each component, and preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The server may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the server may further include a display unit and the like, which will not be described in detail herein. Specifically, in this embodiment, the processor 401 in the server loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring audio to be trained, and distributing corresponding training label information to the audio to be trained; extracting training characteristic information corresponding to the audio to be trained; inputting the training characteristic information and corresponding training label information into a preset model for training to obtain a trained preset model; and evaluating the audio to be evaluated based on the trained preset model.
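Tying the illustrative helpers from Example III together, an end-to-end run of these four steps might look as follows; the paths and scores are placeholders, and all helper names (unify_format, deduplicate, training_features, train_preset_model, evaluate_with_marks, md5_of_file) come from the sketches above, not from the patent:

```python
# End-to-end sketch using the illustrative helpers above; paths and scores
# are placeholders, not data from the patent.
raw_training_paths = ["song_a.mp3", "song_b.flac"]   # audio to be trained
manual_scores = [80.0, 55.0]                         # training label information
eval_paths = ["candidate_1.wav", "candidate_2.wav"]  # audio to be evaluated

wavs = deduplicate([unify_format(p) for p in raw_training_paths])
X = [list(training_features(p).values()) for p in wavs]
model = train_preset_model(X, manual_scores[: len(wavs)])  # crude re-alignment

scores = evaluate_with_marks(
    model, eval_paths,
    features=lambda p: list(training_features(p).values()),
    file_hash=md5_of_file,
)
print(scores)
```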
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and parts that are not described in detail in a certain embodiment may refer to the above detailed description of the audio information evaluation method, and are not described herein again.
As can be seen from the above, the server according to the embodiment of the present application may allocate corresponding training label information to the audio to be trained by acquiring the audio to be trained; extracting training characteristic information corresponding to the audio to be trained; inputting training characteristic information and corresponding training label information into a preset model for training to obtain a trained preset model; and evaluating the audio to be evaluated based on the trained preset model. Compared with a scheme that a large amount of audio information needs to be manually analyzed for quality evaluation, the method greatly reduces the cost and improves the evaluation efficiency of the audio information.
Example V,
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present application provides a storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the methods for evaluating audio information provided by the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring audio to be trained, and distributing corresponding training label information to the audio to be trained; extracting training characteristic information corresponding to the audio to be trained; inputting the training characteristic information and corresponding training label information into a preset model for training to obtain a trained preset model; and evaluating the audio to be evaluated based on the trained preset model.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any audio information evaluation method provided in the embodiments of the present application, the beneficial effects that can be achieved by any audio information evaluation method provided in the embodiments of the present application can be achieved, and detailed descriptions are omitted here for the foregoing embodiments.
The foregoing describes in detail an audio information evaluation method, apparatus, and storage medium provided in an embodiment of the present application, and a specific example is applied in the present application to explain the principles and implementations of the present application, and the description of the foregoing embodiment is only used to help understand the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
Claims (7)
1. A method for evaluating audio information, comprising:
acquiring audio to be trained, and extracting an audio fingerprint of each audio to be trained in the audio to be trained;
matching the audios to be trained pairwise according to the audio fingerprint of each audio to be trained, and determining the audios to be trained whose audio fingerprint similarity is greater than a preset threshold as the same audio group to be trained;
distributing corresponding training label information for the audio to be trained;
extracting training characteristic information corresponding to the audio to be trained;
inputting the training characteristic information and the corresponding training label information into a preset model for training to obtain a trained preset model;
calculating a hash value corresponding to the audio to be evaluated;
acquiring the audio to be evaluated with the same hash value and carrying out corresponding marking;
extracting feature information of one to-be-evaluated audio and feature information of unmarked to-be-evaluated audio in the same marked to-be-evaluated audio;
inputting the feature information of one to-be-evaluated audio in the same marked to-be-evaluated audio and the feature information of the unmarked to-be-evaluated audio into the trained preset model to obtain first evaluation information corresponding to the one to-be-evaluated audio in the same marked to-be-evaluated audio and the unmarked to-be-evaluated audio.
2. The evaluation method according to claim 1, wherein the step of extracting the training feature information corresponding to the audio to be trained comprises:
unifying the format of the audio to be trained to obtain a target audio to be trained with a unified format;
calculating a hash value corresponding to the target audio to be trained;
carrying out duplicate removal processing on the target audio to be trained with the same hash value;
and extracting training characteristic information corresponding to the target audio to be trained after the duplication removal processing.
3. The evaluation method according to claim 2, wherein the step of extracting training feature information corresponding to the audio to be trained of the target after the deduplication processing comprises:
extracting the sampling rate, bit depth, code rate, spectrum contrast, flatness, spectrum entropy, spectrum centroid and spectrum roll-off coefficient of the target audio to be trained after the duplicate removal processing;
the step of inputting the training feature information and the corresponding training label information into a preset model for training to obtain a trained preset model includes:
and inputting the sampling rate, the bit depth, the code rate, the spectrum contrast, the flatness, the spectrum entropy, the spectrum centroid, the spectrum roll-off coefficient and the corresponding training label information of the target audio to be trained after the duplication removal processing into a preset model for training to obtain the trained preset model.
4. The evaluation method according to claim 1, wherein, after the step of obtaining the first evaluation information corresponding to one audio among the identically marked audios to be evaluated and the unmarked audios to be evaluated, the method further comprises:
determining a target audio to be evaluated corresponding to the first evaluation information;
extracting the spectral height and spectral notch coefficient of the target audio to be evaluated;
determining a first weighting parameter and a second weighting parameter according to the spectral height and the spectral notch coefficient;
and weighting the first evaluation information according to the first weighting parameter and the second weighting parameter to obtain target evaluation information.
5. The evaluation method according to claim 4, wherein, after the step of obtaining the target evaluation information, the method further comprises:
determining a target audio to be evaluated corresponding to the target evaluation information;
when the target audio to be evaluated is detected to have a mark, determining the audio to be evaluated which is the same as the mark, and setting the evaluation information of the audio to be evaluated which is the same as the mark as the target evaluation information.
6. An apparatus for evaluating audio information, comprising:
an obtaining unit, configured to obtain audio to be trained and extract an audio fingerprint of each audio to be trained; match the audios to be trained pairwise according to the audio fingerprint of each audio to be trained, and determine the audios to be trained whose audio fingerprint similarity is greater than a preset threshold as the same audio group to be trained; and allocate corresponding training label information to the audio to be trained;
the first extraction unit is used for extracting training characteristic information corresponding to the audio to be trained;
the training unit is used for inputting the training characteristic information and the corresponding training label information into a preset model for training to obtain a trained preset model;
the evaluation unit is used for calculating a hash value corresponding to the audio to be evaluated;
acquiring the audio to be evaluated with the same hash value and carrying out corresponding marking;
extracting feature information of one to-be-evaluated audio and feature information of unmarked to-be-evaluated audio in the same marked to-be-evaluated audio;
inputting the feature information of one to-be-evaluated audio in the same marked to-be-evaluated audio and the feature information of the unmarked to-be-evaluated audio into the trained preset model to obtain first evaluation information corresponding to the one to-be-evaluated audio in the same marked to-be-evaluated audio and the unmarked to-be-evaluated audio.
7. A storage medium on which a computer program is stored, which, when run on a computer, causes the computer to carry out the method of evaluating audio information according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910819121.1A | 2019-08-30 | 2019-08-30 | Audio information evaluation method and device and storage medium
Publications (2)
Publication Number | Publication Date
---|---
CN110517671A | 2019-11-29
CN110517671B | 2022-04-05
Family
ID=68630001
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201910819121.1A (Active) | Audio information evaluation method and device and storage medium | 2019-08-30 | 2019-08-30
Country Status (1)
Country | Link
---|---
CN | CN110517671B
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN111863033B * | 2020-07-30 | 2023-12-12 | 北京达佳互联信息技术有限公司 | Training method, device, server and storage medium for audio quality recognition model
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN1932819A * | 2006-09-25 | 2007-03-21 | 北京搜狗科技发展有限公司 | Clustering method, searching method and system for interconnection network audio file
CN108763492A * | 2018-05-29 | 2018-11-06 | 四川远鉴科技有限公司 | A kind of audio template extracting method and device
CN110047515A * | 2019-04-04 | 2019-07-23 | 腾讯音乐娱乐科技(深圳)有限公司 | A kind of audio identification methods, device, equipment and storage medium
CN110047514A * | 2019-05-30 | 2019-07-23 | 腾讯音乐娱乐科技(深圳)有限公司 | A kind of accompaniment degree of purity appraisal procedure and relevant device
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
KR100803206B1 | 2005-11-11 | 2008-02-14 | 삼성전자주식회사 | Apparatus and method for generating audio fingerprint and searching audio data
CN106057206B | 2016-06-01 | 2019-05-03 | 腾讯科技(深圳)有限公司 | Sound-groove model training method, method for recognizing sound-groove and device
CN108257614A | 2016-12-29 | 2018-07-06 | 北京酷我科技有限公司 | The method and its system of audio data mark
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant