CN104882152A - Method and apparatus for generating lyric file

Publication number: CN104882152A (granted as CN104882152B)
Application number: CN201510257914.0A
Original language: Chinese (zh)
Inventors: 武大伟, 赵普, 任思豪, 龚维
Assignee: Guangzhou Kugou Computer Technology Co Ltd
Legal status: Active (granted)

Abstract

The invention discloses a method and an apparatus for generating a lyric file, belonging to the technical field of audio processing. The method includes: obtaining a reference audio file corresponding to a target audio file to be processed, wherein the reference audio file and the target audio file are different versions of the same song; calculating the time offset between the reference audio file and the target audio file; and correcting the timestamps of the lyric file of the reference audio file according to the time offset, the corrected lyric file serving as the lyric file of the target audio file. The method and apparatus solve the problems of low efficiency and high cost in generating lyric files manually, improve the efficiency of lyric-file generation, and lower its cost.

Description

Method and device for generating lyric file
Technical Field
The invention relates to the technical field of audio processing, in particular to a method and a device for generating a lyric file.
Background
As users' expectations of the audio-visual experience grow, music applications are expected to display lyrics when users view, listen to, or sing along with musical works.
In order to meet the requirements of users, developers of application programs need to generate lyric files matched with different song files. In the related art, a lyric file matched with a song file is manually generated for the song file.
However, generating lyric files manually is not only inefficient but also costly, and as the music library keeps growing, these drawbacks of the manual approach become increasingly severe.
Disclosure of Invention
In order to solve the problems of low efficiency and high cost in the manual generation of lyric files in the related art, the embodiment of the invention provides a method and a device for generating lyric files. The technical scheme is as follows:
in a first aspect, a method for generating a lyric file is provided, the method comprising:
acquiring a reference audio file corresponding to a target audio file to be processed, wherein the reference audio file and the target audio file belong to different versions of the same song;
calculating a time offset between the reference audio file and the target audio file;
and correcting the time stamp corresponding to the lyric file of the reference audio file according to the time deviation, and taking the corrected lyric file as the lyric file of the target audio file.
Optionally, the obtaining of the reference audio file corresponding to the target audio file to be processed includes:
acquiring at least one candidate reference audio file corresponding to the target audio file, wherein each candidate reference audio file and the target audio file belong to different versions of the same song;
sorting the at least one candidate reference audio file according to a preset sorting rule;
sequentially selecting the candidate reference audio files one by one according to the sorting result;
detecting whether the selected candidate reference audio file has strong correlation with the target audio file;
and when the first candidate reference audio file having a strong correlation with the target audio file is obtained, stopping the selection of further candidate reference audio files and taking that candidate reference audio file as the reference audio file.
Optionally, the detecting whether there is a strong correlation between the selected candidate reference audio file and the target audio file includes:
calculating a cross-correlation coefficient sequence between the selected candidate reference audio file and the target audio file, wherein the cross-correlation coefficient sequence comprises at least one cross-correlation coefficient;
selecting the maximum cross-correlation coefficient $p_0$ from the cross-correlation coefficient sequence;
obtaining the position offset $m_0$ corresponding to the maximum value $p_0$;
selecting, according to the position offset $m_0$, the maximum value $p_1$ from the correlation coefficients in a first offset interval $[m_0+m_{min},\, m_0+m_{max}]$ and a second offset interval $[m_0-m_{max},\, m_0-m_{min}]$, where $1 \le m_{min} < m_{max}$;
detecting whether the ratio $p_0/p_1$ between the maximum value $p_0$ and the maximum value $p_1$ is greater than a preset threshold;
and if the ratio $p_0/p_1$ is greater than the preset threshold, determining that the selected candidate reference audio file has a strong correlation with the target audio file.
Optionally, the calculating a cross-correlation coefficient sequence between the selected candidate reference audio file and the target audio file includes:
sampling from the selected candidate reference audio file at a preset sampling rate to obtain a candidate audio sampling sequence, and sampling from the target audio file at the preset sampling rate to obtain a target audio sampling sequence;
extracting audio data with preset length from the same position of the candidate audio sampling sequence and the target audio sampling sequence to respectively obtain a candidate audio data sequence and a target audio data sequence;
a cross-correlation coefficient sequence between the candidate audio data sequence and the target audio data sequence is calculated.
Optionally, the calculating a cross-correlation coefficient sequence between the candidate audio data sequence and the target audio data sequence includes:
calculating a cross-correlation coefficient sequence $R_{xy}(m)$ between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ according to the following formula:
$$R_{xy}(m) = \sum_{n=0}^{N-1} x(n+m)\, y(n);$$
where $m \in [-(N-1),\, N-1]$, $0 \le n \le N-1$, $0 \le n+m \le N-1$, and $N$ is a positive integer.
Optionally, the calculating a cross-correlation coefficient sequence between the candidate audio data sequence and the target audio data sequence includes:
extracting audio data from the candidate audio data sequence $x(n)$ at a preset interval to obtain a candidate audio data extraction sequence $x'(n)$, and extracting audio data from the target audio data sequence $y(n)$ at the preset interval to obtain a target audio data extraction sequence $y'(n)$; where $x'(n) = x(kn)$, $y'(n) = y(kn)$, the preset interval is $k$ audio samples, and $k$ is a positive integer;
calculating a coarse cross-correlation coefficient sequence $R_{xy'}(m)$ between the candidate audio data extraction sequence $x'(n)$ and the target audio data extraction sequence $y'(n)$ according to the following formula:
$$R_{xy'}(m) = \sum_{n=0}^{(N-1)/k} x'(n+m)\, y'(n);$$
where $m \in [-(N-1)/k,\, (N-1)/k]$, $0 \le n \le (N-1)/k$, $0 \le n+m \le (N-1)/k$, and $N$ is a positive integer;
obtaining the position offset $m_1$ corresponding to the maximum value in the coarse cross-correlation coefficient sequence $R_{xy'}(m)$;
with the position offset between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ set to $k \times m_1$, truncating a candidate audio data truncated sequence $x''(n)$ and a target audio data truncated sequence $y''(n)$ of a target length from the corresponding positions of $x(n)$ and $y(n)$, respectively;
calculating a precise cross-correlation coefficient sequence $R_{xy''}(m)$ between the candidate audio data truncated sequence $x''(n)$ and the target audio data truncated sequence $y''(n)$ according to the following formula:
$$R_{xy''}(m) = \sum_{n=0}^{N_0} x''(n+m)\, y''(n);$$
where $m \in [k \times m_1 - a,\, k \times m_1 + a]$, $a \ge k$, and $N_0$ denotes the target length and is a preset value; the position offset $m_2$ corresponding to the maximum value in the precise cross-correlation coefficient sequence $R_{xy''}(m)$ is the precise position offset.
Optionally, the obtaining at least one candidate reference audio file corresponding to the target audio file includes:
obtaining the classification to which the target audio file belongs, wherein the classification is any one of a single-song class, a live class, an accompaniment class, and a silenced (vocal-removed) class;
determining a target classification for searching the candidate reference audio file according to the classification to which the target audio file belongs;
searching, in the target classification, for audio files that meet a preset selection condition as the candidate reference audio files; wherein the preset selection condition includes at least one of the following: the audio file has a manually bound lyric file, and the audio file is a high-quality audio file.
Optionally, the determining a target classification for searching for the candidate reference audio file according to the classification to which the target audio file belongs includes:
when the classification to which the target audio file belongs is the single-song class, determining the single-song class as the target classification; or,
when the classification to which the target audio file belongs is the live class, determining the live class as the target classification; or,
when the classification to which the target audio file belongs is the accompaniment class, determining the accompaniment class, the single-song class, and the live class as the target classification; or,
when the classification to which the target audio file belongs is the silenced class, determining the silenced class, the single-song class, and the live class as the target classification.
In a second aspect, an apparatus for generating a lyric file is provided, the apparatus comprising:
an acquisition module, configured to acquire a reference audio file corresponding to a target audio file to be processed, wherein the reference audio file and the target audio file belong to different versions of the same song;
a calculation module for calculating a time offset between the reference audio file and the target audio file;
and the correction module is used for correcting the time stamp corresponding to the lyric file of the reference audio file according to the time deviation and taking the corrected lyric file as the lyric file of the target audio file.
Optionally, the obtaining module includes: an obtaining submodule, a sorting submodule, a selecting submodule, a detecting submodule, and a determining submodule;
the obtaining submodule is used for obtaining at least one candidate reference audio file corresponding to the target audio file, and each candidate reference audio file and the target audio file belong to different versions of the same song;
the sorting submodule is used for sorting the at least one candidate reference audio file according to a preset sorting rule;
the selection submodule is used for sequentially selecting the candidate reference audio files one by one according to the sorting result;
the detection submodule is used for detecting whether the selected candidate reference audio file has strong correlation with the target audio file;
the determining sub-module is configured to, when a candidate reference audio file having a strong correlation with the target audio file is obtained, stop selecting a next candidate reference audio file, and use the candidate reference audio file having the strong correlation with the target audio file as the reference audio file.
Optionally, the detection submodule includes: a calculation unit, a first selection unit, an acquisition unit, a second selection unit, a detection unit, and a determination unit;
the computing unit is used for computing a cross-correlation coefficient sequence between the selected candidate reference audio file and the target audio file, wherein the cross-correlation coefficient sequence comprises at least one cross-correlation coefficient;
the first selecting unit is used for selecting the maximum value p of the cross correlation coefficient from the cross correlation coefficient sequence0
The obtaining unit is used for obtaining the maximum value p0Corresponding position deviation m0
The second selection unit is used for selecting the position deviation m0In a first position deviation interval m0+mmin,m0+mmax]And a second position deviation interval [ m0-mmax,m0-mmin]Selecting maximum value p from corresponding correlation coefficients1,1≤mmin<mmax
The detection unit is used for detecting the maximum value p0And the maximumValue p1Ratio p between0/p1Whether the threshold value is greater than a preset threshold value;
the determination unit is used for determining the ratio p0/p1And when the correlation value is larger than the preset threshold value, determining that the selected candidate reference audio file and the target audio file have strong correlation.
Optionally, the computing unit includes: a sampling subunit, an extraction subunit, and a calculation subunit;
the sampling subunit is configured to sample the selected candidate reference audio file at a preset sampling rate to obtain a candidate audio sampling sequence, and sample the target audio file at the preset sampling rate to obtain a target audio sampling sequence;
the extraction subunit is configured to extract audio data with a preset length from the same position of the candidate audio sample sequence and the target audio sample sequence, so as to obtain a candidate audio data sequence and a target audio data sequence respectively;
the calculating subunit is configured to calculate a cross-correlation coefficient sequence between the candidate audio data sequence and the target audio data sequence.
Optionally, the calculating subunit is specifically configured to:
calculating a cross-correlation coefficient sequence $R_{xy}(m)$ between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ according to the following formula:
$$R_{xy}(m) = \sum_{n=0}^{N-1} x(n+m)\, y(n);$$
where $m \in [-(N-1),\, N-1]$, $0 \le n \le N-1$, $0 \le n+m \le N-1$, and $N$ is a positive integer.
Optionally, the calculating subunit is specifically configured to:
extracting audio data from the candidate audio data sequence $x(n)$ at a preset interval to obtain a candidate audio data extraction sequence $x'(n)$, and extracting audio data from the target audio data sequence $y(n)$ at the preset interval to obtain a target audio data extraction sequence $y'(n)$; where $x'(n) = x(kn)$, $y'(n) = y(kn)$, the preset interval is $k$ audio samples, and $k$ is a positive integer;
calculating a coarse cross-correlation coefficient sequence $R_{xy'}(m)$ between the candidate audio data extraction sequence $x'(n)$ and the target audio data extraction sequence $y'(n)$ according to the following formula:
$$R_{xy'}(m) = \sum_{n=0}^{(N-1)/k} x'(n+m)\, y'(n);$$
where $m \in [-(N-1)/k,\, (N-1)/k]$, $0 \le n \le (N-1)/k$, $0 \le n+m \le (N-1)/k$, and $N$ is a positive integer;
obtaining the position offset $m_1$ corresponding to the maximum value in the coarse cross-correlation coefficient sequence $R_{xy'}(m)$;
with the position offset between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ set to $k \times m_1$, truncating a candidate audio data truncated sequence $x''(n)$ and a target audio data truncated sequence $y''(n)$ of a target length from the corresponding positions of $x(n)$ and $y(n)$, respectively;
calculating a precise cross-correlation coefficient sequence $R_{xy''}(m)$ between the candidate audio data truncated sequence $x''(n)$ and the target audio data truncated sequence $y''(n)$ according to the following formula:
$$R_{xy''}(m) = \sum_{n=0}^{N_0} x''(n+m)\, y''(n);$$
where $m \in [k \times m_1 - a,\, k \times m_1 + a]$, $a \ge k$, and $N_0$ denotes the target length and is a preset value; the position offset $m_2$ corresponding to the maximum value in the precise cross-correlation coefficient sequence $R_{xy''}(m)$ is the precise position offset.
Optionally, the obtaining sub-module includes: a classification acquisition unit, a classification determination unit, and a file search unit;
the classification acquisition unit is configured to acquire the classification to which the target audio file belongs, wherein the classification is any one of the single-song class, the live class, the accompaniment class, and the silenced class;
the classification determining unit is used for determining a target classification for searching the candidate reference audio file according to the classification to which the target audio file belongs;
the file searching unit is configured to search, in the target classification, for audio files that meet a preset selection condition as the candidate reference audio files; wherein the preset selection condition includes at least one of the following: the audio file has a manually bound lyric file, and the audio file is a high-quality audio file.
Optionally, the classification determining unit includes:
a first classification determining subunit, configured to determine the single-song class as the target classification when the classification to which the target audio file belongs is the single-song class; and/or,
a second classification determining subunit, configured to determine the live class as the target classification when the classification to which the target audio file belongs is the live class; and/or,
a third classification determining subunit, configured to determine the accompaniment class, the single-song class, and the live class as the target classification when the classification to which the target audio file belongs is the accompaniment class; and/or,
a fourth classification determining subunit, configured to determine the silenced class, the single-song class, and the live class as the target classification when the classification to which the target audio file belongs is the silenced class.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
calculating the time deviation between the reference audio file and the target audio file, and then correcting a time stamp corresponding to the lyric file of the reference audio file according to the time deviation to obtain the lyric file of the target audio file; the problems of low efficiency and high cost of manually generating the lyric file in the related technology are solved; the technical effects of improving the efficiency of generating the lyric file and reducing the cost are achieved.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram of a method for generating a lyric file, provided by an embodiment of the present invention;
FIG. 2A is a flow chart of a method for generating a lyric file according to another embodiment of the present invention;
FIG. 2B is a flow chart of step 201 according to another embodiment of the present invention;
FIG. 2C is a flow chart of step 204 in accordance with another embodiment of the present invention;
FIG. 2D is a flowchart of step 204a according to another embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus for generating a lyric file according to an embodiment of the present invention;
FIG. 4 is a block diagram of an apparatus for generating a lyric file according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The method for generating the lyric file provided by the embodiment of the invention can be applied to any electronic equipment with computing and processing capabilities. For example, the electronic device may be a server, or may be a terminal such as a mobile phone, a multimedia player, or a computer.
Referring to fig. 1, a flowchart of a method for generating a lyric file according to an embodiment of the present invention is shown, where the method may include the following steps:
step 102, obtaining a reference audio file corresponding to a target audio file to be processed, wherein the reference audio file and the target audio file belong to different versions of the same song.
Step 104, calculating the time offset between the reference audio file and the target audio file.
And step 106, correcting the time stamp corresponding to the lyric file of the reference audio file according to the time deviation, and taking the corrected lyric file as the lyric file of the target audio file.
In summary, in the method provided in this embodiment, the time offset between the reference audio file and the target audio file is calculated, and then the time stamp corresponding to the lyric file of the reference audio file is corrected according to the time offset, so as to obtain the lyric file of the target audio file; the problems of low efficiency and high cost of manually generating the lyric file in the related technology are solved; the technical effects of improving the efficiency of generating the lyric file and reducing the cost are achieved.
Referring to fig. 2A, a flowchart of a method for generating a lyric file according to another embodiment of the present invention is shown, where the method may include the following steps:
step 201, at least one candidate reference audio file corresponding to the target audio file is obtained, and each candidate reference audio file and the target audio file belong to different versions of the same song.
In a song library, there are usually multiple different versions of the same song (or entry), such as a single-song version, a live version, an accompaniment version, etc. There may also be multiple different versions of the same type for the same song; for example, multiple single-song versions sung by different singers, or multiple live versions sung by the same singer at different concerts. When generating the corresponding lyric file for the target audio file, the song library is searched for other audio files that belong to the same song as the target audio file, and these are acquired as candidate reference audio files.
Alternatively, as shown in fig. 2B, this step may include several sub-steps as follows:
step 201a, obtaining the classification of the target audio file.
The classification to which the target audio file belongs includes, but is not limited to, any one of the following: the single-song class, the live class, the accompaniment class, and the silenced (vocal-removed) class.
Step 201b, determining a target classification for searching candidate reference audio files according to the classification to which the target audio file belongs.
In one possible implementation, the class to which the target audio file belongs is directly determined as the target class for finding the candidate reference audio file. That is, after the classification to which the target audio file belongs is obtained, the candidate reference audio file is directly searched under the classification.
In another possible embodiment, step 201b may include the following cases:
1) when the classification to which the target audio file belongs is the single-song class, determining the single-song class as the target classification;
2) when the classification to which the target audio file belongs is the live class, determining the live class as the target classification;
3) when the classification to which the target audio file belongs is the accompaniment class, determining the accompaniment class, the single-song class, and the live class as the target classifications;
4) when the classification to which the target audio file belongs is the silenced class, determining the silenced class, the single-song class, and the live class as the target classifications.
Of course, the above two possible embodiments are only exemplary and explanatory, and the present embodiment does not limit other possible embodiments.
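As an illustration only, the second embodiment's case analysis can be written as a small lookup table. The following sketch is a hypothetical Python representation; the class names and function name are assumptions and do not come from the patent:

```python
# Hypothetical lookup table for step 201b (second embodiment).
# Class names are illustrative, not from the patent.
SINGLE, LIVE, ACCOMPANIMENT, SILENCED = "single", "live", "accompaniment", "silenced"

TARGET_CLASSES = {
    SINGLE: [SINGLE],                              # case 1)
    LIVE: [LIVE],                                  # case 2)
    ACCOMPANIMENT: [ACCOMPANIMENT, SINGLE, LIVE],  # case 3)
    SILENCED: [SILENCED, SINGLE, LIVE],            # case 4)
}

def target_classifications(target_class: str) -> list[str]:
    """Return the classes in which candidate reference audio files are searched."""
    return TARGET_CLASSES[target_class]
```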
Step 201c, searching audio files meeting the preset selection condition in the target classification as candidate reference audio files.
Wherein the preset selection condition includes at least one of the following: the audio file has a manually bound lyric file, and the audio file is a high-quality audio file. In this embodiment, selecting audio files that meet the preset selection condition as candidate reference audio files ensures both that a candidate has a precisely matched lyric file and that the candidate is of adequate quality, which improves the accuracy of the subsequent calculation and correction.
Step 202, at least one candidate reference audio file is sorted according to a preset sorting rule.
After at least one candidate reference audio file corresponding to the target audio file is obtained, the candidates are sorted according to a preset sorting rule, which includes at least one of the following: candidates in the same class as the target audio file are prioritized, high-quality candidates are prioritized, and high-popularity candidates are prioritized. Setting such a sorting rule ensures that candidates more similar to the target audio file and of higher quality are tried first in the subsequent matching calculation, which improves selection efficiency and saves computation and processing overhead.
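A minimal sketch of this sorting step, assuming each candidate carries `audio_class`, `high_quality`, and `popularity` attributes (these field names are assumptions, not from the patent):

```python
def sort_candidates(candidates, target_class):
    # Preset sorting rule from step 202: candidates in the same class as
    # the target come first, then high-quality files, then more popular
    # files. False (0) sorts before True (1) in each key component.
    return sorted(
        candidates,
        key=lambda c: (
            c.audio_class != target_class,
            not c.high_quality,
            -c.popularity,
        ),
    )
```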
Step 203, sequentially selecting candidate reference audio files one by one according to the sorting result.
Step 204, detecting whether the selected candidate reference audio file has a strong correlation with the target audio file.
In this embodiment, in order to ensure the correction accuracy of the lyric file, when a reference audio file is selected from candidate reference audio files, it is necessary to detect and analyze the correlation between the candidate reference audio file and the target audio file, and select the candidate reference audio file having a strong correlation with the target audio file as a final reference audio file.
Alternatively, as shown in fig. 2C, this step may include several sub-steps as follows:
step 204a, calculating a cross-correlation coefficient sequence between the selected candidate reference audio file and the target audio file, wherein the cross-correlation coefficient sequence comprises at least one cross-correlation coefficient.
As shown in fig. 2D, in one possible implementation, in order to reduce the amount of computation and improve the computation efficiency, step 204a may include the following sub-steps:
step 204a1, sampling the candidate audio sample sequence from the selected candidate reference audio file at a preset sampling rate, and sampling the target audio sample sequence from the target audio file at the preset sampling rate.
To conveniently handle audio files with different bit rates and to reduce computation time, this embodiment down-samples both the selected candidate reference audio file and the target audio file to a preset sampling rate. The preset sampling rate can be chosen according to actual requirements, for example 8 kHz or 4 kHz.
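The patent does not prescribe a particular resampling algorithm; as one possibility, the sketch below uses SciPy's polyphase resampler to bring a mono signal to the preset rate:

```python
import numpy as np
from scipy.signal import resample_poly

def downsample(samples: np.ndarray, src_rate: int, dst_rate: int = 8000) -> np.ndarray:
    """Down-sample a mono PCM signal to the preset sampling rate (e.g. 8 kHz)."""
    g = np.gcd(src_rate, dst_rate)
    return resample_poly(samples, dst_rate // g, src_rate // g)
```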
Step 204a2, extracting audio data with preset length from the same position of the candidate audio sample sequence and the target audio sample sequence, and respectively obtaining a candidate audio data sequence and a target audio data sequence.
The preset length may be preset according to actual requirements, for example, the preset length is 10 s.
Step 204a3, a cross-correlation coefficient sequence between the candidate audio data sequence and the target audio data sequence is calculated.
Optionally, the cross-correlation coefficient sequence $R_{xy}(m)$ between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ is calculated according to the following formula:
$$R_{xy}(m) = \sum_{n=0}^{N-1} x(n+m)\, y(n);$$
where $m \in [-(N-1),\, N-1]$, $0 \le n \le N-1$, $0 \le n+m \le N-1$, and $N$ is a positive integer.
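For reference, this formula matches NumPy's full-mode correlation, where out-of-range terms contribute zero; a minimal sketch (not from the patent):

```python
import numpy as np

def cross_correlation(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return R_xy(m) for m = -(N-1) .. N-1, as defined above.

    np.correlate with mode="full" evaluates sum_n x(n+m) * y(n), summing
    only over 0 <= n, n+m <= N-1; output index i corresponds to the
    position offset m = i - (N - 1).
    """
    assert len(x) == len(y)
    return np.correlate(x, y, mode="full")
```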
Step 204b, selecting the maximum cross-correlation coefficient $p_0$ from the cross-correlation coefficient sequence.
Step 204c, obtaining the position offset $m_0$ corresponding to the maximum value $p_0$.
Step 204d, selecting, according to the position offset $m_0$, the maximum value $p_1$ from the correlation coefficients in a first offset interval $[m_0+m_{min},\, m_0+m_{max}]$ and a second offset interval $[m_0-m_{max},\, m_0-m_{min}]$, where $1 \le m_{min} < m_{max}$.
Step 204e, detecting whether the ratio $p_0/p_1$ between the maximum value $p_0$ and the maximum value $p_1$ is greater than a preset threshold.
Step 204f, if the ratio $p_0/p_1$ is greater than the preset threshold, determining that the selected candidate reference audio file has a strong correlation with the target audio file.
The points to be explained are: the analysis of the cross-correlation coefficients in steps 204b to 204f is crucial in this embodiment, because it determines the performance of the whole system, i.e., the accuracy of the subsequent time-offset calculation and correction. Although a maximum value $p_0$ can always be found in the calculated cross-correlation coefficient sequence $R_{xy}(m)$, the time offset derived from its corresponding position offset $m_0$ is not necessarily trustworthy. When $p_0$ is large in absolute terms but not sufficiently larger than the other cross-correlation coefficients in $R_{xy}(m)$, the correlation between the selected candidate reference audio file and the target audio file is not strong. Therefore, in this embodiment, the ratio $p_0/p_1$ is compared against a preset threshold to determine whether the selected candidate reference audio file has a strong correlation with the target audio file. If a strong correlation is found, the selected candidate is taken as the final reference audio file; otherwise, the next candidate is selected, until a candidate reference audio file having a strong correlation with the target audio file is obtained.
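A sketch of this peak-to-sidelobe test (steps 204b to 204f), operating on the full-mode correlation sequence from the previous sketch; the threshold is left as a parameter, since the patent treats it as a preset value:

```python
import numpy as np

def has_strong_correlation(r: np.ndarray, m_min: int, m_max: int,
                           threshold: float) -> tuple[bool, int]:
    """Return (strong?, m0) given R_xy(m) with index i <-> m = i - (N-1)."""
    i0 = int(np.argmax(r))                     # position of p0
    p0 = float(r[i0])
    # Side intervals [m0+m_min, m0+m_max] and [m0-m_max, m0-m_min],
    # clipped to the bounds of the sequence.
    right = r[min(len(r), i0 + m_min): i0 + m_max + 1]
    left = r[max(0, i0 - m_max): max(0, i0 - m_min + 1)]
    p1 = max(np.max(right, initial=-np.inf), np.max(left, initial=-np.inf))
    n = (len(r) + 1) // 2                      # original sequence length N
    return p0 / p1 > threshold, i0 - (n - 1)
```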
Step 205, when the first candidate reference audio file having a strong correlation with the target audio file is obtained, stopping the selection of further candidate reference audio files and taking that candidate reference audio file as the reference audio file.
The reference audio file is thus an audio file that has a strong correlation with the target audio file and carries a manually bound lyric file, i.e., the reference audio file can be considered to have a precisely matched lyric file.
In step 206, the time offset between the reference audio file and the target audio file is calculated.
After the reference audio file is selected, a time offset between the reference audio file and the target audio file is calculated based on a correlation coefficient between the two.
Optionally, the time offset is $\tau = m_0 / k_0$, where $m_0$ denotes the position offset corresponding to the maximum cross-correlation coefficient $p_0$, and $k_0$ denotes the preset sampling rate.
And step 207, correcting the time stamp corresponding to the lyric file of the reference audio file according to the time deviation, and taking the corrected lyric file as the lyric file of the target audio file.
And after the time deviation tau is calculated, correcting the time stamp corresponding to the lyric file of the reference audio file by using the time deviation tau. In this embodiment, the time deviation τ is used to perform an overall correction on the time stamp corresponding to the lyric file, that is, the correction amplitude of the time stamp corresponding to each lyric in the lyric file is the time deviation τ. The corrected lyric file is the lyric file of the target audio file.
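As an illustration of this overall correction, the sketch below shifts every timestamp of an LRC-style lyric file by τ. Both the LRC format and the sign convention (adding τ) are assumptions, since the patent names neither:

```python
import re

def shift_lyrics(lrc_text: str, tau: float) -> str:
    """Add the time offset tau (seconds) to every [mm:ss.xx] timestamp."""
    def shift(match: re.Match) -> str:
        t = int(match.group(1)) * 60 + float(match.group(2)) + tau
        t = max(t, 0.0)                       # clamp instead of negative times
        return "[%02d:%05.2f]" % (t // 60, t % 60)

    return re.sub(r"\[(\d+):(\d+(?:\.\d+)?)\]", shift, lrc_text)
```

For example, `shift_lyrics("[00:12.50]hello", 1.0)` returns `"[00:13.50]hello"`, applying the same correction amplitude τ to every lyric line.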
In summary, in the method provided in this embodiment, the time offset between the reference audio file and the target audio file is calculated, and then the time stamp corresponding to the lyric file of the reference audio file is corrected according to the time offset, so as to obtain the lyric file of the target audio file; the problems of low efficiency and high cost of manually generating the lyric file in the related technology are solved; the technical effects of improving the efficiency of generating the lyric file and reducing the cost are achieved.
In addition, according to the method provided by the embodiment, the candidate reference audio file with strong correlation with the target audio file is selected as the reference audio file to perform subsequent time deviation calculation and correction, so that the accuracy of the finally generated lyric file is fully improved, and the system performance is ensured.
The points to be supplemented are: in order to further improve the efficiency of the cross-correlation coefficient calculation and save the calculation and processing overhead, the following two ways of calculating the cross-correlation coefficient are provided in the embodiment of the present invention.
In the first mode, the following steps can be included:
1. Extracting audio data from the candidate audio data sequence $x(n)$ at a preset interval to obtain a candidate audio data extraction sequence $x'(n)$, and extracting audio data from the target audio data sequence $y(n)$ at the preset interval to obtain a target audio data extraction sequence $y'(n)$; where $x'(n) = x(kn)$, $y'(n) = y(kn)$, the preset interval is $k$ audio samples, and $k$ is a positive integer.
The value of the preset interval k can be set after the factors of two aspects of calculation precision and calculation efficiency are comprehensively considered. For example, k may be set to 4.
2. A coarse cross-correlation coefficient sequence $R_{xy'}(m)$ between the candidate audio data extraction sequence $x'(n)$ and the target audio data extraction sequence $y'(n)$ is calculated according to the following formula:
$$R_{xy'}(m) = \sum_{n=0}^{(N-1)/k} x'(n+m)\, y'(n);$$
where $m \in [-(N-1)/k,\, (N-1)/k]$, $0 \le n \le (N-1)/k$, $0 \le n+m \le (N-1)/k$, and $N$ is a positive integer.
3. The position offset $m_1$ corresponding to the maximum value in the coarse cross-correlation coefficient sequence $R_{xy'}(m)$ is obtained.
The position offset $m_1$ is a coarse position offset.
4. With the position offset between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ set to $k \times m_1$, a candidate audio data truncated sequence $x''(n)$ and a target audio data truncated sequence $y''(n)$ of a target length are truncated from the corresponding positions of $x(n)$ and $y(n)$, respectively.
5. The precise cross-correlation coefficient sequence $R_{xy''}(m)$ between the candidate audio data truncated sequence $x''(n)$ and the target audio data truncated sequence $y''(n)$ is calculated according to the following formula:
$$R_{xy''}(m) = \sum_{n=0}^{N_0} x''(n+m)\, y''(n);$$
where $m \in [k \times m_1 - a,\, k \times m_1 + a]$, $a \ge k$, and $N_0$ denotes the target length and is a preset value; the position offset $m_2$ corresponding to the maximum value in the precise cross-correlation coefficient sequence $R_{xy''}(m)$ is the precise position offset.
In the first mode, a coarse position offset is computed first and then refined into a precise position offset. As the formula for the cross-correlation coefficient sequence shows, the computational complexity for two audio data sequences of length $N$ is $O(N^2)$. The first mode therefore reduces the computation time to roughly $1/k^2$ of the original; for example, when $k = 4$, the computation time is reduced to about $1/16$.
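A compact sketch of the first mode, under the assumptions that both sequences have equal length and that the refinement radius is $a = k$; the decimation factor `k` and the excerpt length `n0` (the patent's $N_0$) are illustrative presets:

```python
import numpy as np

def coarse_to_fine_offset(x: np.ndarray, y: np.ndarray, k: int = 4,
                          n0: int = 4096) -> int:
    """Two-stage search: coarse offset on decimated sequences, then a
    precise offset within [k*m1 - a, k*m1 + a] with a = k."""
    xd, yd = x[::k], y[::k]                    # x'(n) = x(kn), y'(n) = y(kn)
    rc = np.correlate(xd, yd, mode="full")     # coarse R_xy'(m)
    m1 = int(np.argmax(rc)) - (len(yd) - 1)    # coarse position offset

    # Truncate excerpts of x and y aligned at the offset k*m1.
    xs = x[max(0, k * m1): max(0, k * m1) + n0]
    ys = y[max(0, -k * m1): max(0, -k * m1) + n0]
    n = min(len(xs), len(ys))
    xs, ys = xs[:n], ys[:n]

    a = k
    best_m, best_r = k * m1, -np.inf
    for d in range(-a, a + 1):                 # refine: m = k*m1 + d
        if d >= 0:
            r = float(np.dot(xs[d:], ys[:n - d]))
        else:
            r = float(np.dot(xs[:n + d], ys[-d:]))
        if r > best_r:
            best_r, best_m = r, k * m1 + d
    return best_m                              # precise position offset m2
```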
In the second approach, FFT (Fast Fourier transform) is used to calculate the cross-correlation coefficients:
the method for calculating the cross-correlation coefficient sequence by using FFT can be derived from the calculation formula of the cross-correlation coefficient sequence:
R_xy=IFFT(conj(FFT(y))×FFT(x));
where conj () denotes the conjugate operation.
The second mode can be implemented with existing mature and efficient FFT modules, such as the FFTW ("Fastest Fourier Transform in the West") library. Using the FFT to calculate the cross-correlation coefficients improves the efficiency of the calculation and saves computation and processing overhead.
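A sketch of the second mode using NumPy's real FFT; the zero-padding length and the lag reordering are implementation details not specified by the patent:

```python
import numpy as np

def xcorr_fft(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """FFT-based evaluation of R_xy = IFFT(conj(FFT(y)) * FFT(x)).

    Zero-padding to at least 2N-1 points makes the circular correlation
    equal to the linear one; the result is reordered so that index i
    corresponds to offset m = i - (N - 1), matching np.correlate(x, y,
    "full").
    """
    N = len(x)
    L = 1 << (2 * N - 1).bit_length()          # next power of two >= 2N-1
    X = np.fft.rfft(x, L)
    Y = np.fft.rfft(y, L)
    r = np.fft.irfft(np.conj(Y) * X, L)
    return np.concatenate([r[L - (N - 1):], r[:N]])  # lags -(N-1) .. N-1
```

This replaces the $O(N^2)$ direct sum with $O(N \log N)$ work, which matters for the sequence lengths produced by, say, a 10 s excerpt at 8 kHz.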
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Referring to fig. 3, a block diagram of an apparatus for generating a lyric file according to an embodiment of the present invention is shown. The device can be applied to any electronic equipment with computing processing capability. The apparatus may include: an acquisition module 310, a calculation module 320, and a correction module 330.
The obtaining module 310 is configured to obtain a reference audio file corresponding to a target audio file to be processed, where the reference audio file and the target audio file belong to different versions of the same song.
A calculating module 320 for calculating a time offset between the reference audio file and the target audio file.
And the correcting module 330 is configured to correct the timestamp corresponding to the lyric file of the reference audio file according to the time offset, and use the corrected lyric file as the lyric file of the target audio file.
In summary, in the apparatus provided in this embodiment, the time offset between the reference audio file and the target audio file is calculated, and then the time stamp corresponding to the lyric file of the reference audio file is corrected according to the time offset, so as to obtain the lyric file of the target audio file; the problems of low efficiency and high cost of manually generating the lyric file in the related technology are solved; the technical effects of improving the efficiency of generating the lyric file and reducing the cost are achieved.
Referring to fig. 4, a block diagram of an apparatus for generating a lyric file according to another embodiment of the present invention is shown. The device can be applied to any electronic equipment with computing processing capability. The apparatus may include: an acquisition module 310, a calculation module 320, and a correction module 330.
The obtaining module 310 is configured to obtain a reference audio file corresponding to a target audio file to be processed, where the reference audio file and the target audio file belong to different versions of the same song.
A calculating module 320 for calculating a time offset between the reference audio file and the target audio file.
And the correcting module 330 is configured to correct the timestamp corresponding to the lyric file of the reference audio file according to the time offset, and use the corrected lyric file as the lyric file of the target audio file.
Optionally, the obtaining module 310 includes: an obtaining sub-module 310a, a sorting sub-module 310b, a selecting sub-module 310c, a detecting sub-module 310d, and a determining sub-module 310e.
The obtaining sub-module 310a is configured to obtain at least one candidate reference audio file corresponding to the target audio file, where each candidate reference audio file and the target audio file belong to different versions of the same song.
The sorting sub-module 310b is configured to sort the at least one candidate reference audio file according to a preset sorting rule.
The selecting sub-module 310c is configured to sequentially select the candidate reference audio files one by one according to the sorting result.
The detection sub-module 310d is configured to detect whether there is a strong correlation between the selected candidate reference audio file and the target audio file.
The determining sub-module 310e is configured to, when the first candidate reference audio file having a strong correlation with the target audio file is obtained, stop selecting further candidate reference audio files and use that candidate reference audio file as the reference audio file.
Optionally, the detection sub-module 310d includes: a calculating unit 310d1, a first selecting unit 310d2, an obtaining unit 310d3, a second selecting unit 310d4, a detecting unit 310d5, and a determining unit 310d6.
The calculating unit 310d1 is configured to calculate a cross-correlation coefficient sequence between the selected candidate reference audio file and the target audio file, where the cross-correlation coefficient sequence includes at least one cross-correlation coefficient.
The first selecting unit 310d2 is configured to select the maximum cross-correlation coefficient $p_0$ from the cross-correlation coefficient sequence.
The obtaining unit 310d3 is configured to obtain the position offset $m_0$ corresponding to the maximum value $p_0$.
The second selecting unit 310d4 is configured to select, according to the position offset $m_0$, the maximum value $p_1$ from the correlation coefficients in a first offset interval $[m_0+m_{min},\, m_0+m_{max}]$ and a second offset interval $[m_0-m_{max},\, m_0-m_{min}]$, where $1 \le m_{min} < m_{max}$.
The detection unit 310d5 is configured to detect whether the ratio $p_0/p_1$ between the maximum value $p_0$ and the maximum value $p_1$ is greater than a preset threshold.
The determining unit 310d6 is configured to determine that the selected candidate reference audio file has a strong correlation with the target audio file when the ratio $p_0/p_1$ is greater than the preset threshold.
Optionally, the calculating unit 310d1 includes: a sampling sub-unit 310d11, an extracting sub-unit 310d12, and a calculating sub-unit 310d13.
The sampling sub-unit 310d11 is configured to sample the selected candidate reference audio file at a preset sampling rate to obtain a candidate audio sample sequence, and sample the target audio sample sequence from the target audio file at the preset sampling rate.
The extracting sub-unit 310d12 is configured to extract audio data with a preset length from the same position of the candidate audio sample sequence and the target audio sample sequence, so as to obtain a candidate audio data sequence and a target audio data sequence, respectively.
The calculating subunit 310d13 is configured to calculate a cross-correlation coefficient sequence between the candidate audio data sequence and the target audio data sequence.
Optionally, the computing subunit 310d13 is specifically configured to:
calculating a cross-correlation coefficient sequence $R_{xy}(m)$ between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ according to the following formula:
$$R_{xy}(m) = \sum_{n=0}^{N-1} x(n+m)\, y(n);$$
where $m \in [-(N-1),\, N-1]$, $0 \le n \le N-1$, $0 \le n+m \le N-1$, and $N$ is a positive integer.
Optionally, the computing subunit 310d13 is specifically configured to:
extracting audio data from the candidate audio data sequence $x(n)$ at a preset interval to obtain a candidate audio data extraction sequence $x'(n)$, and extracting audio data from the target audio data sequence $y(n)$ at the preset interval to obtain a target audio data extraction sequence $y'(n)$; where $x'(n) = x(kn)$, $y'(n) = y(kn)$, the preset interval is $k$ audio samples, and $k$ is a positive integer;
calculating a coarse cross-correlation coefficient sequence $R_{xy'}(m)$ between the candidate audio data extraction sequence $x'(n)$ and the target audio data extraction sequence $y'(n)$ according to the following formula:
$$R_{xy'}(m) = \sum_{n=0}^{(N-1)/k} x'(n+m)\, y'(n);$$
where $m \in [-(N-1)/k,\, (N-1)/k]$, $0 \le n \le (N-1)/k$, $0 \le n+m \le (N-1)/k$, and $N$ is a positive integer;
obtaining the position offset $m_1$ corresponding to the maximum value in the coarse cross-correlation coefficient sequence $R_{xy'}(m)$;
with the position offset between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ set to $k \times m_1$, truncating a candidate audio data truncated sequence $x''(n)$ and a target audio data truncated sequence $y''(n)$ of a target length from the corresponding positions of $x(n)$ and $y(n)$, respectively;
calculating a precise cross-correlation coefficient sequence $R_{xy''}(m)$ between the candidate audio data truncated sequence $x''(n)$ and the target audio data truncated sequence $y''(n)$ according to the following formula:
$$R_{xy''}(m) = \sum_{n=0}^{N_0} x''(n+m)\, y''(n);$$
where $m \in [k \times m_1 - a,\, k \times m_1 + a]$, $a \ge k$, and $N_0$ denotes the target length and is a preset value; the position offset $m_2$ corresponding to the maximum value in the precise cross-correlation coefficient sequence $R_{xy''}(m)$ is the precise position offset.
Optionally, the obtaining sub-module 310a includes: a classification obtaining unit 310a1, a classification determining unit 310a2, and a file searching unit 310a3.
The classification obtaining unit 310a1 is configured to obtain the classification to which the target audio file belongs, wherein the classification is any one of the single-song class, the live class, the accompaniment class, and the silenced class.
The classification determining unit 310a2 is configured to determine a target classification for finding the candidate reference audio file according to the classification to which the target audio file belongs.
The file searching unit 310a3 is configured to search, in the target classification, for audio files that meet a preset selection condition as the candidate reference audio files; wherein the preset selection condition includes at least one of the following: the audio file has a manually bound lyric file, and the audio file is a high-quality audio file.
Optionally, the classification determining unit 310a2 includes:

a first classification determining subunit 310a21, configured to determine the single-song class as the target classification when the classification to which the target audio file belongs is the single-song class; and/or

a second classification determining subunit 310a22, configured to determine the live class as the target classification when the classification to which the target audio file belongs is the live class; and/or

a third classification determining subunit 310a23, configured to determine the accompaniment class, the single-song class, and the live class as the target classification when the classification to which the target audio file belongs is the accompaniment class; and/or

a fourth classification determining subunit 310a24, configured to determine the silenced class, the single-song class, and the live class as the target classification when the classification to which the target audio file belongs is the silenced class.
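By way of illustration only, the four cases above amount to a small lookup table; the class names in this sketch are illustrative stand-ins, not terms from the implementation:

```python
# Illustrative mapping from the classification of the target audio file
# to the classifications searched for candidate reference audio files.
TARGET_CLASSIFICATIONS = {
    "single_song":   ["single_song"],
    "live":          ["live"],
    "accompaniment": ["accompaniment", "single_song", "live"],
    "silenced":      ["silenced", "single_song", "live"],
}

def target_classifications(target_class: str) -> list[str]:
    """Return the classifications to search for a given target class."""
    return TARGET_CLASSIFICATIONS[target_class]
```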
In summary, the apparatus provided in this embodiment calculates the time offset between the reference audio file and the target audio file, and then corrects the timestamps of the lyric file of the reference audio file according to the time offset to obtain the lyric file of the target audio file. This solves the problems of low efficiency and high cost when lyric files are generated manually, improving the efficiency of lyric file generation and reducing cost.
In addition, the apparatus provided in this embodiment selects a candidate reference audio file having a strong correlation with the target audio file as the reference audio file for the subsequent time offset calculation and correction, which substantially improves the accuracy of the finally generated lyric file and helps ensure system performance.
In addition, when calculating the cross-correlation coefficient sequence, the apparatus provided in this embodiment adopts a coarse-to-fine calculation mode, which further improves the efficiency of the cross-correlation calculation and saves computation and processing overhead.
It should be noted that when the apparatus for generating a lyric file provided in the above embodiment generates a lyric file, the division into the above functional modules is merely illustrative; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for generating a lyric file provided in the above embodiment and the embodiments of the method for generating a lyric file belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.
It should be understood that, as used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein includes any and all possible combinations of one or more of the associated listed items.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (16)

1. A method of generating a lyric file, the method comprising:
acquiring a reference audio file corresponding to a target audio file to be processed, wherein the reference audio file and the target audio file belong to different versions of the same song;
calculating a time offset between the reference audio file and the target audio file;
and correcting the time stamp corresponding to the lyric file of the reference audio file according to the time deviation, and taking the corrected lyric file as the lyric file of the target audio file.
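By way of illustration only (not part of the claims), the correction step in the last limitation can be pictured as shifting every timestamp of the reference lyric file by the computed time offset. The sketch below assumes a plain LRC-style file with [mm:ss.xx] line tags, which is a simplification of real lyric formats:

```python
import re

# Shift every [mm:ss.xx] tag by offset_s seconds; a positive offset delays
# the lyrics, a negative offset advances them (clamped at zero).
TAG = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\]")

def shift_lyrics(lrc_text: str, offset_s: float) -> str:
    def repl(match):
        t = int(match.group(1)) * 60 + float(match.group(2)) + offset_s
        t = max(t, 0.0)
        return "[%02d:%05.2f]" % (int(t // 60), t % 60)
    return TAG.sub(repl, lrc_text)

print(shift_lyrics("[00:12.30]some lyric line", 1.5))
# -> [00:13.80]some lyric line
```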
2. The method according to claim 1, wherein the obtaining of the reference audio file corresponding to the target audio file to be processed comprises:
acquiring at least one candidate reference audio file corresponding to the target audio file, wherein each candidate reference audio file and the target audio file belong to different versions of the same song;
sorting the at least one candidate reference audio file according to a preset sorting rule;
sequentially selecting the candidate reference audio files one by one according to the sorting result;
detecting whether the selected candidate reference audio file has strong correlation with the target audio file;
and when the first candidate reference audio file having a strong correlation with the target audio file is obtained, stopping selecting a next candidate reference audio file, and taking the first candidate reference audio file having the strong correlation with the target audio file as the reference audio file.
3. The method of claim 2, wherein the detecting whether the selected candidate reference audio file has a strong correlation with the target audio file comprises:
calculating a cross-correlation coefficient sequence between the selected candidate reference audio file and the target audio file, wherein the cross-correlation coefficient sequence comprises at least one cross-correlation coefficient;
selecting the maximum cross-correlation coefficient value $p_0$ from the cross-correlation coefficient sequence;

obtaining the position deviation $m_0$ corresponding to the maximum value $p_0$;

according to the position deviation $m_0$, selecting the maximum value $p_1$ from the cross-correlation coefficients corresponding to a first position deviation interval $[m_0+m_{\min},\,m_0+m_{\max}]$ and a second position deviation interval $[m_0-m_{\max},\,m_0-m_{\min}]$, where $1 \le m_{\min} < m_{\max}$;

detecting whether the ratio $p_0/p_1$ between the maximum value $p_0$ and the maximum value $p_1$ is greater than a preset threshold;

if the ratio $p_0/p_1$ is greater than the preset threshold, determining that the selected candidate reference audio file has a strong correlation with the target audio file.
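By way of illustration only (not part of the claims), this limitation is a peak-to-sidelobe test: the peak must dominate the best correlation value found in two side intervals around it. A minimal NumPy sketch, assuming the correlation values near the peak are positive and using illustrative parameter defaults:

```python
import numpy as np

def has_strong_correlation(r, m_min=5, m_max=50, threshold=2.0):
    """Return True if the peak p0 of the cross-correlation sequence r is
    more than `threshold` times the largest side value p1."""
    i0 = int(np.argmax(r))   # index of the peak, i.e. position deviation m0
    p0 = r[i0]
    # Values in [m0 + m_min, m0 + m_max] and [m0 - m_max, m0 - m_min];
    # Python slicing silently clips intervals that run past the ends of r.
    side = np.concatenate([r[i0 + m_min : i0 + m_max + 1],
                           r[max(0, i0 - m_max) : max(0, i0 - m_min + 1)]])
    if side.size == 0 or side.max() <= 0:
        return True          # no meaningful sidelobe to compare against
    return p0 / side.max() > threshold
```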
4. The method of claim 3, wherein the calculating the sequence of cross-correlation coefficients between the selected candidate reference audio file and the target audio file comprises:
sampling from the selected candidate reference audio file at a preset sampling rate to obtain a candidate audio sampling sequence, and sampling from the target audio file at the preset sampling rate to obtain a target audio sampling sequence;
extracting audio data with preset length from the same position of the candidate audio sampling sequence and the target audio sampling sequence to respectively obtain a candidate audio data sequence and a target audio data sequence;
a cross-correlation coefficient sequence between the candidate audio data sequence and the target audio data sequence is calculated.
5. The method of claim 4, wherein the calculating the sequence of cross-correlation coefficients between the candidate audio data sequence and the target audio data sequence comprises:
calculating a cross-correlation coefficient sequence $R\_xy(m)$ between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ according to the following formula:

$$R\_xy(m)=\sum_{n=0}^{N-1} x(n+m)\,y(n);$$

where $m \in [-(N-1),\,N-1]$, $0 \le n \le N-1$, $0 \le n+m \le N-1$, and $N$ is a positive integer.
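By way of illustration only (not part of the claims), the sum above is exactly NumPy's full-mode cross-correlation of two length-N sequences, so the whole claim reduces to a few lines:

```python
import numpy as np

def cross_correlation_sequence(x, y):
    """Compute R_xy(m) = sum_n x(n+m) * y(n) for m in [-(N-1), N-1]."""
    r = np.correlate(x, y, mode="full")     # length 2N - 1
    m = np.arange(-(len(y) - 1), len(x))    # position deviation per entry
    return m, r

# Usage: the deviation at the peak recovers the shift between the versions.
# m, r = cross_correlation_sequence(x, y); m0 = m[np.argmax(r)]
```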
6. The method of claim 4, wherein the calculating the sequence of cross-correlation coefficients between the candidate audio data sequence and the target audio data sequence comprises:
extracting audio data from the candidate audio data sequence $x(n)$ at a preset interval to obtain a candidate audio data extraction sequence $x'(n)$, and extracting audio data from the target audio data sequence $y(n)$ at the preset interval to obtain a target audio data extraction sequence $y'(n)$; where $x'(n)=x(k \times n)$, $y'(n)=y(k \times n)$, the preset interval is $k$ audio data samples, and $k$ is a positive integer;

calculating a coarse cross-correlation coefficient sequence $R\_xy'(m)$ between the candidate audio data extraction sequence $x'(n)$ and the target audio data extraction sequence $y'(n)$ according to the following formula:

$$R\_xy'(m)=\sum_{n=0}^{(N-1)/k} x'(n+m)\,y'(n);$$

where $m \in [-(N-1)/k,\,(N-1)/k]$, $0 \le n \le (N-1)/k$, $0 \le n+m \le (N-1)/k$, and $N$ is a positive integer;

obtaining the position deviation $m_1$ corresponding to the maximum value in the coarse cross-correlation coefficient sequence $R\_xy'(m)$;

with the position deviation between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ set to $k \times m_1$, truncating a candidate audio data truncation sequence $x''(n)$ and a target audio data truncation sequence $y''(n)$ of a target length from the corresponding positions of $x(n)$ and $y(n)$;

calculating an exact cross-correlation coefficient sequence $R\_xy''(m)$ between the candidate audio data truncation sequence $x''(n)$ and the target audio data truncation sequence $y''(n)$ according to the following formula:

$$R\_xy''(m)=\sum_{n=0}^{N_0} x''(n+m)\,y''(n);$$

where $m \in [k \times m_1 - a,\, k \times m_1 + a]$, $a \ge k$, $N_0$ denotes the target length and is a preset value; the position deviation $m_2$ corresponding to the maximum value in the exact cross-correlation coefficient sequence $R\_xy''(m)$ is the exact position deviation.
7. The method according to any one of claims 2 to 6, wherein the obtaining at least one candidate reference audio file corresponding to the target audio file comprises:
obtaining the classification to which the target audio file belongs, wherein the classification is any one of a single-song class, a live class, an accompaniment class, and a silenced class;
determining a target classification for searching the candidate reference audio file according to the classification to which the target audio file belongs;
searching, in the target classification, for audio files meeting a preset selection condition as the candidate reference audio files; wherein the preset selection condition includes at least one of the following: the audio file has a manually bound lyric file, and the audio file is a high-sound-quality audio file.
8. The method of claim 7, wherein determining a target classification for finding the candidate reference audio file according to the classification to which the target audio file belongs comprises:
when the classification to which the target audio file belongs is the single-song class, determining the single-song class as the target classification; or,

when the classification to which the target audio file belongs is the live class, determining the live class as the target classification; or,

when the classification to which the target audio file belongs is the accompaniment class, determining the accompaniment class, the single-song class, and the live class as the target classification; or,

when the classification to which the target audio file belongs is the silenced class, determining the silenced class, the single-song class, and the live class as the target classification.
9. An apparatus for generating a lyric file, the apparatus comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a reference audio file corresponding to a target audio file to be processed, and the reference audio file and the target audio file belong to different versions of the same song;
a calculation module for calculating a time offset between the reference audio file and the target audio file;
and the correction module is used for correcting the time stamp corresponding to the lyric file of the reference audio file according to the time deviation and taking the corrected lyric file as the lyric file of the target audio file.
10. The apparatus of claim 9, wherein the obtaining module comprises: an obtaining submodule, a sorting submodule, a selection submodule, a detection submodule, and a determining submodule;
the obtaining submodule is used for obtaining at least one candidate reference audio file corresponding to the target audio file, and each candidate reference audio file and the target audio file belong to different versions of the same song;
the sorting submodule is used for sorting the at least one candidate reference audio file according to a preset sorting rule;
the selection submodule is used for sequentially selecting the candidate reference audio files one by one according to the sorting result;
the detection submodule is used for detecting whether the selected candidate reference audio file has strong correlation with the target audio file;
the determining sub-module is configured to, when a candidate reference audio file having a strong correlation with the target audio file is obtained, stop selecting a next candidate reference audio file, and use the candidate reference audio file having the strong correlation with the target audio file as the reference audio file.
11. The apparatus of claim 10, wherein the detection submodule comprises: a calculation unit, a first selection unit, an acquisition unit, a second selection unit, a detection unit, and a determination unit;

The calculation unit is used for calculating a cross-correlation coefficient sequence between the selected candidate reference audio file and the target audio file, wherein the cross-correlation coefficient sequence includes at least one cross-correlation coefficient;
the first selecting unit is used for selecting the maximum value p of the cross correlation coefficient from the cross correlation coefficient sequence0
The obtaining unit is used for obtaining the maximum value p0Corresponding position deviation m0
The second selection unit is used for selecting the position deviation m0In a first position deviation interval m0+mmin,m0+mmax]And a second position deviation interval [ m0-mmax,m0-mmin]Selecting maximum value p from corresponding correlation coefficients1,1≤mmin<mmax
The detection unit is used for detecting the maximum value p0And the maximum value p1Ratio p between0/p1Whether the threshold value is greater than a preset threshold value;
the determination unit is used for determining the ratio p0/p1And when the correlation value is larger than the preset threshold value, determining that the selected candidate reference audio file and the target audio file have strong correlation.
12. The apparatus of claim 11, wherein the calculation unit comprises: a sampling subunit, an extraction subunit, and a calculation subunit;
the sampling subunit is configured to sample the selected candidate reference audio file at a preset sampling rate to obtain a candidate audio sampling sequence, and sample the target audio file at the preset sampling rate to obtain a target audio sampling sequence;
the extraction subunit is configured to extract audio data with a preset length from the same position of the candidate audio sample sequence and the target audio sample sequence, so as to obtain a candidate audio data sequence and a target audio data sequence respectively;
the calculating subunit is configured to calculate a cross-correlation coefficient sequence between the candidate audio data sequence and the target audio data sequence.
13. The apparatus according to claim 12, wherein the calculation subunit is specifically configured to:
calculate a cross-correlation coefficient sequence $R\_xy(m)$ between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ according to the following formula:

$$R\_xy(m)=\sum_{n=0}^{N-1} x(n+m)\,y(n);$$

where $m \in [-(N-1),\,N-1]$, $0 \le n \le N-1$, $0 \le n+m \le N-1$, and $N$ is a positive integer.
14. The apparatus according to claim 12, wherein the calculation subunit is specifically configured to:
extract audio data from the candidate audio data sequence $x(n)$ at a preset interval to obtain a candidate audio data extraction sequence $x'(n)$, and extract audio data from the target audio data sequence $y(n)$ at the preset interval to obtain a target audio data extraction sequence $y'(n)$; where $x'(n)=x(k \times n)$, $y'(n)=y(k \times n)$, the preset interval is $k$ audio data samples, and $k$ is a positive integer;

calculate a coarse cross-correlation coefficient sequence $R\_xy'(m)$ between the candidate audio data extraction sequence $x'(n)$ and the target audio data extraction sequence $y'(n)$ according to the following formula:

$$R\_xy'(m)=\sum_{n=0}^{(N-1)/k} x'(n+m)\,y'(n);$$

where $m \in [-(N-1)/k,\,(N-1)/k]$, $0 \le n \le (N-1)/k$, $0 \le n+m \le (N-1)/k$, and $N$ is a positive integer;

obtain the position deviation $m_1$ corresponding to the maximum value in the coarse cross-correlation coefficient sequence $R\_xy'(m)$;

with the position deviation between the candidate audio data sequence $x(n)$ and the target audio data sequence $y(n)$ set to $k \times m_1$, truncate a candidate audio data truncation sequence $x''(n)$ and a target audio data truncation sequence $y''(n)$ of a target length from the corresponding positions of $x(n)$ and $y(n)$;

calculate an exact cross-correlation coefficient sequence $R\_xy''(m)$ between the candidate audio data truncation sequence $x''(n)$ and the target audio data truncation sequence $y''(n)$ according to the following formula:

$$R\_xy''(m)=\sum_{n=0}^{N_0} x''(n+m)\,y''(n);$$

where $m \in [k \times m_1 - a,\, k \times m_1 + a]$, $a \ge k$, $N_0$ denotes the target length and is a preset value; the position deviation $m_2$ corresponding to the maximum value in the exact cross-correlation coefficient sequence $R\_xy''(m)$ is the exact position deviation.
15. The apparatus of any one of claims 10 to 14, wherein the obtaining submodule comprises: a classification acquisition unit, a classification determining unit, and a file searching unit;

The classification acquisition unit is configured to acquire the classification to which the target audio file belongs, wherein the classification is any one of a single-song class, a live class, an accompaniment class, and a silenced class;

The classification determining unit is configured to determine, according to the classification to which the target audio file belongs, a target classification in which to search for the candidate reference audio files;

The file searching unit is configured to search, in the target classification, for audio files meeting a preset selection condition as the candidate reference audio files; wherein the preset selection condition includes at least one of the following: the audio file has a manually bound lyric file, and the audio file is a high-sound-quality audio file.
16. The apparatus of claim 15, wherein the classification determining unit comprises:
a first classification determining subunit, configured to determine the single-song class as the target classification when the classification to which the target audio file belongs is the single-song class; and/or

a second classification determining subunit, configured to determine the live class as the target classification when the classification to which the target audio file belongs is the live class; and/or

a third classification determining subunit, configured to determine the accompaniment class, the single-song class, and the live class as the target classification when the classification to which the target audio file belongs is the accompaniment class; and/or

a fourth classification determining subunit, configured to determine the silenced class, the single-song class, and the live class as the target classification when the classification to which the target audio file belongs is the silenced class.