1 Introduction
Coherence is often lacking in existing lyrics generation systems. This reduces the comprehensibility, engagement, and artistic expression of the content. This article focuses on enhancing the coherence of generated lyrics, defined as the quality of being logical and consistent, forming a unified whole. In lyrics writing, coherence means using a broader context and establishing semantic relationships between words and sentences to enhance understanding and interpretation [59].
Coherence in lyrics is manifested at four levels [59]. Word-level coherence: words form logical concepts or actions with their surrounding words. Sentence-level coherence: each line supplements, contrasts with, or expands on the previous line. Full-text-level coherence: the entire text revolves around a single theme. Musicality-level coherence: the lyrics match the rhythm, sentiment, and style of the accompanying music. For example, Figure 1(a) shows high-quality human-composed lyrics with appropriate collocations. Each sentence extends the previous one, and together the sentences describe the theme “natural scenery during a day,” creating a peaceful atmosphere that aligns with the music’s sentiment. In contrast, example (b), machine-generated lyrics, exhibits poor word choices that obscure meaning (e.g., “spider fixes toast”), weak sentence connections, a lack of a central theme, and a bizarre style that conflicts with the musical style. Additional generated examples can be found in Figure B1 of Appendix B.
Current models for lyrics generation typically achieve the proficiency demonstrated by the lower lines in the previous example: grammar and rhythm are acceptable, but coherence across all four levels remains problematic [45, 65]. Although ChatGPT generates coherent and rhyming lyrics, it has notable limitations: it cannot process music input, meet syllable requirements, or avoid the risk of plagiarism. Customizing lyrics for specific applications, such as language learning, introduces even greater challenges. While research has shown that singing songs with appropriate lyrics can help language learners acquire vocabulary [18, 19, 22, 24, 48, 50, 62, 75], human-composed songs that appeal to learners’ musical tastes [51] often lack the necessary target vocabulary and may be subject to copyright restrictions. Existing lyrics generation systems are also inadequate in this context, as they struggle to incorporate user-specified keywords while maintaining four-level coherence.
A recent study proposes using a keyword skeleton as a prompt for lyrics generators to enhance coherence [70]. A “keyword skeleton” is defined as a structured list of terms, each corresponding sequentially to a sentence or line in the lyrics. Nevertheless, this approach has limitations: it does not thoroughly investigate and clearly define lyric coherence, fails to analyze the root causes of incoherence in generated lyrics, and does not fully leverage the keyword skeleton framework. We identified the main issues as follows: (1) The system employs the YAKE keyword extraction method [70], which generates skeletons based on word frequency; this does not effectively capture the narrative structure of lyrics. (2) The system may overlook essential user-input keywords during skeleton generation. (3) The system does not account for musical features, relying heavily on the rhythm and syllable templates of the original lyrics, which limits its generalizability. (4) The system uses keywords as prompts without incorporating targeted, coherence-oriented innovations in its language model. As a result, the coherence of the generated lyrics remains less satisfactory than that of human compositions. It is crucial to integrate a deeper understanding of musical elements and user inputs into the generation process to improve musicality and personalization.
In response to the above challenges, we introduce the KeYric system, which enhances coherence, musical association, and personalization in generated lyrics. As illustrated in Figure 2, the KeYric system generates coherent lyrics by first taking a MIDI file [63] and user-specified seed words as inputs. The Keyword Skeleton Extractor module uses unsupervised learning to extract semantically significant keywords from a lyrics database, forming keyword skeletons. This extends the original \(\{\)MIDI, lyrics\(\}\) tuple into a \(\{\)MIDI, keyword skeleton, lyrics\(\}\) triplet dataset. The cross-modal tier, which includes a graph network and a lyrics generator, then couples MIDI with skeletons and skeletons with lyrics. The Music-skeleton Coupled Graph Network selects a keyword for each musical phrase, linking them to form a skeleton, and the Coherent Lyrics Generator creates lyrics based on this skeleton. During inference, the system predicts a skeleton from the given MIDI file and seed words, and then generates personalized lyrics suitable for language learning through singing [45].
Our approach differs from traditional keyword extraction methods that rely on word frequency or manual annotation. Instead, we conceptualize our keyword skeletons as an interpretable latent space, capable of compressing the lyric space and sampling it with comprehensible keywords. For instance, the example lyrics (a) in
Figure 1 can be summarized into the keyword skeleton [“see,” “sun,” “birds,” and “nightly”]. We propose a novel unsupervised method for extracting keyword skeletons from human-composed lyrics, identifying the most representative words through a process of text compression and reconstruction. Furthermore, we leverage deep graph networks to establish pairwise connections between musical phrases and keywords within a skeleton. This graph model predicts a skeleton that includes the user-specified keywords, aligning with the sentiment and style of the music, thereby enhancing musicality coherence. Additionally, we present a coherent lyrics generation model that uses a skeleton and melody as prompts. This model incorporates three levels of coherence mechanisms, which enhance coherence at the word, sentence, and full-text levels throughout the generation process.
Objective and subjective evaluations demonstrate that the KeYric system achieves 5% and 19% improvements in lyric quality, respectively, over the compared models. Experts, including linguists, songwriters, and language teachers, have verified that KeYric generates lyrics that facilitate language learning through singing. Furthermore, we investigate the impact of genres and musical elements on lyric coherence. The findings reveal that pop and country songs produce the most coherent skeletons and lyrics. Additionally, our analysis indicates that, compared with bar boundaries, genre-relevant information such as pitch and note duration plays a more significant role in maintaining lyric coherence.
In summary, this study makes three major contributions:
—
We address coherent lyrics generation and propose a solution that improves coherence at the word level (smoothness between adjacent words), sentence level (continuity in describing the same subject across adjacent sentences), full-text level (the entire lyrics center on the same theme), and musicality level (the lyrics match the style and sentiment of the music).
—
We propose KeYric, which generates lyrics from a keyword skeleton. It introduces a novel unsupervised method to extract keyword skeletons from lyrics, expands the input music and the seed words for language learning into a skeleton, and incorporates the skeleton into the lyrics generator.
—
Our analysis of the experiment results reveals that incorporating genre-relevant musical components (i.e., pitch and note duration) in data encoding substantially enhances the coherence of the generated lyrics.
3 Keyword Skeleton Extraction and Expansion
3.1 Motivation
We commissioned lyricists to write lyrics suitable for language learning based on given seed words and music. By monitoring their creation process, we observed common procedures among human lyricists writing for linguistic pedagogy. As illustrated in Figure 3, lyricists typically begin with a set of keywords; these keywords evolve into lyrical cues and are extended into sentences, taking into account pivotal terms, rhyme, melodic alignment, and the seed words to be learned. An initial assessment of the semantic relevance [61] between these keyword skeletons and song titles showed a 57.4% improvement compared to randomly selected keywords (0.203 vs. 0.129, as shown in Figure 3). There are typically four methods to achieve textual coherence: repeating key nouns, using pronouns, employing transition signals, and maintaining logical order [57]. Compared to previous automated models, human lyricists use these components more effectively, enhancing coherence between consecutive lyric lines.
However, this process is labor-intensive, requiring over 20 minutes per song lyric to align with the provided music and keywords. In lyrics-based language learning, this effort increases as lyricists tailor compositions to learners’ backgrounds (e.g., linguistic proficiency, vocabulary). To improve efficiency, we propose KeYric, which emulates human lyricists’ writing processes through deep learning.
The core idea is to extract salient vocabulary from each lyric sentence using unsupervised learning. We build a keyword skeleton from the words receiving the highest attention weights during compression and reconstruction, capturing the coherent connections between lyric sentences. A keyword skeleton expander, trained on the extracted keyword skeletons and corresponding MIDI files, then predicts a suitable keyword skeleton for unseen input songs. Meanwhile, a lyrics generator, trained on full lyrics with an expanded keyword skeleton as prompts, employs multi-layer coherence mechanisms to select commonly used conjunctions, pronouns, clauses, and cohesive terms. This approach generates coherent lyrics by stringing together the keywords, ultimately achieving overall coherence.
3.2 Keyword Skeleton Extraction
Given a vocabulary set \(V\), the keyword skeleton is defined as a sequence of word tokens \(K=\{k_{1},k_{2},\ldots,k_{n}\},K\neq\emptyset\), where each element \(k_{t}\in V\) corresponds sequentially to \(s_{t}\), the \(t\)th line in specific lyrics. As a condensed version of the entire text, a keyword skeleton should present coherence akin to a storyline, showcasing the narrative development and central theme. The selected keywords should meet the following criteria: (1) each keyword represents a lyric sentence and conveys its stylistic information concisely; (2) keywords should link coherently to nearby keywords; (3) repeated keywords are allowed, to reflect the lyric structure. Thus, the skeleton can serve as a synopsis and developmental framework for the lyrics.
We propose an unsupervised learning model to interpretably select the best keyword from each lyric line. As shown in Figure 4(a), the keyword extractor compresses lyrics into a latent space and reconstructs them using a hierarchical Transformer-Variational AutoEncoder (VAE). The word-to-sentence encoder’s attention scores determine, for each line, the words that contribute the most semantic and stylistic information to the latent variables. These chosen words form the keyword skeleton. The use of a VAE improves generalizability and robustness by capturing the underlying semantic structure of lyrics; it accommodates variations and different versions of the same song while maintaining consistent keyword extraction. This probabilistic approach ensures effective handling of diverse lyric representations. A hierarchical architecture separates sentence and word attention computation, making word token attention values more representative of their contributions to a sentence.
3.2.1 Lyrics Compression.
In Figure 5, the green blocks illustrate the compression process. The VAE has a hierarchical Transformer encoder \(q_{\theta}(z|x)\), decoder \(p_{\phi}(x|z)\), and latent variable \(z\in\mathbb{R}^{d_{z}}\) [31]. \(q\) and \(p\) are parameterized by \(\theta\) and \(\phi\), respectively.
The hierarchical Transformer encoder [71] aggregates word tokens \(X=\{x_{1},x_{2},\ldots,x_{n}\}\) into sentence-level representations \(Z=\{z_{1},z_{2},\ldots,z_{u}\}\) stored in “hub” vectors, i.e., the foremost vector. Each sentence is prefixed with a prepositive virtual [CLS] token. A sentence’s aggregated latent vector is a Gaussian sample of its Transformer encoding at the hub vector’s location. We choose an isotropic Gaussian distribution with unit variance, \(p(z)=\mathcal{N}(0,I)\), as our prior. This simplifies the latent space structure, ensuring each latent variable contributes equally and independently. It aids efficient learning of diverse lyric representations while maintaining consistency and reducing complexity in the hierarchical Transformer encoder. We incorporate positional, part-of-speech (POS), and dependency embeddings [17] to include word tokens’ syntactic features.
The lyric-level encoder computes cross-sentence information and averages the outputs through a pooling layer to obtain the compressed latent vector of the entire lyrics, \(Z_{T}\in\mathbb{R}^{d_{z}}\). In this formulation, \(Emb(\cdot)\) is the word embedding layer and \(Emb^{+}(\cdot)\) is the sum of positional and semantic embeddings; \(s_{t}\) (\(t\in[1,v]\)) is the \(t\)th sentence; \(||\) denotes concatenation; \(f(\cdot)\) and \(F(\cdot)\) are the word-to-sentence and sentence-to-lyrics Transformer encoders, respectively; and \(M_{Hub}\) is the masking operation that retains only the hub vector.
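To make the compression step concrete, the following PyTorch sketch illustrates the hub-based word-to-sentence encoding, per-line Gaussian sampling, and lyric-level pooling described above; the class name, layer sizes, and the omission of the POS and dependency embeddings are simplifying assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class HierarchicalLyricsEncoder(nn.Module):
    """Sketch of the hierarchical Transformer-VAE encoder described above.
    Dimensions and layer counts are illustrative, not the paper's exact setup."""
    def __init__(self, vocab_size, d_z=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_z)
        word_layer = nn.TransformerEncoderLayer(d_z, n_heads, batch_first=True)
        sent_layer = nn.TransformerEncoderLayer(d_z, n_heads, batch_first=True)
        self.word_to_sentence = nn.TransformerEncoder(word_layer, n_layers)    # f(.)
        self.sentence_to_lyrics = nn.TransformerEncoder(sent_layer, n_layers)  # F(.)
        self.to_mu = nn.Linear(d_z, d_z)
        self.to_logvar = nn.Linear(d_z, d_z)

    def forward(self, sentences):
        # sentences: (num_lines, tokens_per_line), each line starting with a [CLS] token id
        tok = self.embed(sentences)                # positional/POS/dependency embeddings omitted here
        word_enc = self.word_to_sentence(tok)      # per-line word-to-sentence encoding
        hub = word_enc[:, 0, :]                    # M_Hub: keep only each line's hub ([CLS]) vector
        mu, logvar = self.to_mu(hub), self.to_logvar(hub)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # Gaussian sample per line
        lyric_enc = self.sentence_to_lyrics(z.unsqueeze(0))      # cross-sentence encoder over line latents
        z_T = lyric_enc.mean(dim=1).squeeze(0)                   # pooled lyric-level latent Z_T
        return z, z_T, mu, logvar
```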
3.2.2 Lyrics Reconstruction.
Unlike previous research [33, 40, 87], our lyric reconstruction uses a symmetric Transformer-based encoder and decoder, as the model compresses and reconstructs lyrics rather than generating them from random latent variables.
We sample the lyric’s latent vector, \(z^{\prime}_{T}\), from the approximate posterior \(q_{\theta}(z|x)=\mathcal{N}(\mu_{z},\sigma_{z})\). To match the decoder’s input shape, we expand \(z^{\prime}_{T}\) into the set \(C^{\prime}=\{c^{\prime}_{1},c^{\prime}_{2},\ldots,c^{\prime}_{u}\}\), where \(c^{\prime}_{i}\in\mathbb{R}^{d_{c}}\). We decode \(z^{\prime}_{T}\) into sentence-level latent variables \(Z^{\prime}=\{z^{\prime}_{1},z^{\prime}_{2},\ldots,z^{\prime}_{u}\}\) and project each \(z^{\prime}_{i}\in\mathbb{R}^{d_{z}}\) into the word-level decoder’s input shape to regenerate lyric tokens.
A multi-layer perceptron (MLP) predicts whether the current sentence is the end of the lyric reconstruction: the MLP assigns each \(z^{\prime}_{i}\) a probability \(P_{stop}\), indicating whether the current sentence should be the last. Here, \(W_{z}\) denotes the linear projection used to sequentially expand \(z^{\prime}_{T}\), and \(G(\cdot)\) and \(g(\cdot)\) are the lyric-to-sentence and sentence-to-word decoders, respectively.
3.2.3 Loss Design.
The loss function of the keyword extractor has three weighted terms:
\(\mathcal{L}=\alpha\,\mathcal{L}_{rec}+\beta\,\mathcal{L}_{sent}+\gamma\,D_{KL}\big(q_{\theta}(z|x)\,\|\,p(z)\big).\)
The first term, the Reconstruction Loss \(\mathcal{L}_{rec}\), compares the generated lyrics to the ground truth. The second term, the Sentence Loss \(\mathcal{L}_{sent}\) on the stopping distribution \(P_{stop}\), encourages the model to select an appropriate length for the generated lyrics [33]. The third term, the Kullback–Leibler divergence, penalizes deviations of the latent variable distribution from the Gaussian prior with unit variance; \(\alpha\), \(\beta\), and \(\gamma\) are the corresponding weights. We employ the “reparameterization trick” [31] to sample latent variables in a differentiable manner by predicting the mean and variance parameters of the Gaussian distribution.
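A minimal sketch of this objective, assuming \(\alpha\), \(\beta\), and \(\gamma\) (Section 5.2) weight the reconstruction, sentence, and KL terms in that order and that standard reductions are used:

```python
import torch
import torch.nn.functional as F

def keyword_extractor_loss(token_logits, token_targets, p_stop, stop_targets,
                           mu, logvar, alpha=1.0, beta=4.0, gamma=0.2):
    """Weighted sum of the reconstruction, sentence-stopping, and KL terms."""
    recon = F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                            token_targets.view(-1))                  # reconstruction loss
    sentence = F.binary_cross_entropy(p_stop, stop_targets)          # P_stop vs. true last-line labels
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # KL(q(z|x) || N(0, I))
    return alpha * recon + beta * sentence + gamma * kl
```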
3.2.4 Keyword Selection.
After compressing and reconstructing the lyrics, we examine the accumulated self-attention matrices of all word-level Transformer encoder blocks to identify each line’s keyword. For each lyric line, each token’s attention scores across all \(\psi\) layers of the word-to-sentence encoder, \(W_{i}^{Att}\in\mathbb{R}^{n\times n}\), are multiplied along the propagation path to the hub vector’s attention score, \(W_{Hub_{t}}^{Att}\in\mathbb{R}^{n\times 1}\). As illustrated by the red arrows in Figure 5, we select \(k_{t}\), the token with the highest product, as the keyword for its lyric line, since this product indicates the token’s contribution to the sentence encoding. We then concatenate all selected keywords \(k_{t}\) to form the skeleton of the lyrics, \(K=\{k_{1},k_{2},\ldots,k_{n}\}\).
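The selection rule can be sketched as an argmax over each line’s accumulated attention toward the hub vector; the single-head treatment and the lack of attention-rollout normalization below are simplifications:

```python
import numpy as np

def select_keyword(attn_per_layer, tokens, hub_idx=0):
    """attn_per_layer: list of (n_tokens, n_tokens) attention matrices, one per
    word-to-sentence encoder layer; tokens: the line's tokens, with the [CLS]
    hub at position hub_idx. Returns the token whose multiplied attention
    contribution toward the hub vector is largest."""
    contribution = np.ones(len(tokens))
    for attn in attn_per_layer:
        contribution *= attn[hub_idx]   # how strongly the hub attends to each token at this layer
    contribution[hub_idx] = 0.0         # never pick the virtual [CLS] token itself
    return tokens[int(np.argmax(contribution))]
```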
3.3 Keyword Skeleton Expansion
The keyword skeleton extractor creates matched triplets of \(\{\)keyword skeletons, lyrics, MIDI files\(\}\). It also generates two static matrices recording keyword co-occurrence and adjacency statistics, forming a keyword relationship graph. By adding the input music phrases as nodes and connecting them to the keyword graph, the expander uses a graph Transformer [27] to learn the cross-modal relevance of keywords and music (Figure 4(b)).
During inference, the expander generates a keyword skeleton, serving as the lyric storyline, from user-input seed words \(K_{seed}\) and input MIDI music \(m\). The expander augments the seed words by predicting additional keywords from the music and arranging them to form a keyword skeleton. After graph propagation and neighbor feature aggregation, each music phrase node predicts a keyword matching its musical features. The skeleton is the concatenation of the input seed words and the keywords predicted from all music nodes.
3.3.1 Graph Building.
As shown in Figure 4(b), a bipartite graph connects the keyword textual modality with the symbolic music modality, consisting of two sub-graphs: the keyword graph and the music graph.
The keyword graph contains word nodes of the entire vocabulary, identified by their token IDs. These nodes are connected by bidirectional edges representing co-occurrence and adjacency frequencies based on extractor statistics. For instance, the probability of the keywords “seasons” and “spring” appearing together in a skeleton is 0.6, and the probability of “spring” following “seasons” in skeletons is 0.4. Thus, the edge from “seasons” to “spring” in the keyword graph has features [0.6, 0.4].
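These edge features can be estimated as simple relative frequencies over the extracted skeletons, as in the “seasons”/“spring” example above; the sketch below omits smoothing and vocabulary truncation:

```python
from collections import defaultdict

def build_keyword_edges(skeletons):
    """Return {(w, v): [co-occurrence prob., adjacency prob.]} estimated from
    extracted keyword skeletons (relative frequencies, no smoothing)."""
    cooc, adj, count = defaultdict(int), defaultdict(int), defaultdict(int)
    for skel in skeletons:
        for i, w in enumerate(skel):
            count[w] += 1
            for v in set(skel) - {w}:
                cooc[(w, v)] += 1             # w and v appear in the same skeleton
            if i + 1 < len(skel):
                adj[(w, skel[i + 1])] += 1    # v directly follows w
    return {(w, v): [c / count[w], adj[(w, v)] / count[w]] for (w, v), c in cooc.items()}

# e.g. build_keyword_edges([["seasons", "spring", "rain"], ["seasons", "spring"]])
```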
The music graph includes music nodes represented by MidiBERT embeddings [11] of all music phrases’ REMI encodings [28]. First, a MIDI file is split into phrases following [45]. Then, as demonstrated in Figure 6(a), REMI, a music event representation, converts each music phrase’s MIDI score into discrete tokens, providing metrical context for rhythmic patterns and segmenting the music encoding into distinct nodes aligned with lyric phrase divisions. Next, MidiBERT, a large-scale pre-trained model for symbolic music understanding, uses masked language modeling (MLM) to learn high-level features by masking and reconstructing input REMI tokens (Figure 6(b)), capturing intricate musical patterns, harmonies, and structures.
In the music graph, music nodes are connected by directed edges indicating performance order. Keyword nodes are bidirectionally connected to all music nodes, forming a bipartite graph that models cross-modal relationships (Figure 4(b)); that is, at graph initialization, every music phrase is connected to every keyword. By integrating the text and music modalities in a graph network, this model predicts keywords from paired music phrases, generating a keyword skeleton aligned with the musical context to provide a coherent storyline for lyrics generation.
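A minimal sketch of this graph assembly, where phrase_embeddings stands in for the MidiBERT embeddings of the REMI-encoded phrases and the plain node/edge lists are illustrative rather than a specific graph-library API:

```python
def build_bipartite_graph(phrase_embeddings, keywords, keyword_edges):
    """Combine the music sub-graph (performance-order edges), the keyword
    sub-graph (co-occurrence/adjacency edges), and full bidirectional
    cross-modal links between every music phrase and every keyword."""
    nodes = [("music", i) for i in range(len(phrase_embeddings))] + \
            [("word", w) for w in keywords]
    edges = []
    for i in range(len(phrase_embeddings) - 1):
        edges.append((("music", i), ("music", i + 1), None))      # performance order
    for (w, v), feat in keyword_edges.items():
        edges.append((("word", w), ("word", v), feat))            # [co-occurrence, adjacency]
    for i in range(len(phrase_embeddings)):
        for w in keywords:                                        # cross-modal links, both directions
            edges.append((("music", i), ("word", w), None))
            edges.append((("word", w), ("music", i), None))
    return nodes, edges
```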
3.3.2 Keyword Skeleton Expansion.
The graph Transformer computes hidden states for all nodes, propagating keyword information throughout the graph. Unlike sequential or grid models, a graph network (1) de-emphasizes autoregressive generation, enabling parallel keyword expansion; (2) captures topological long-term keyword dependencies; and (3) unifies music and text as graph nodes for cross-modal relevance. Hidden states for nodes and edges are represented uniformly as vectors in \(\mathbb{R}^{d_{g}}\). After information propagation, an MLP predicts and samples a keyword for each music node, representing the music phrase. In this procedure, \(h\) and \(e\) denote the graph node and edge hidden states, whose superscripts distinguish music and word nodes; \(R(\cdot)\) denotes the REMI encoding, while \(Emb_{word}\) and \(Emb_{MIDI}\) represent the word and MidiBERT embeddings, respectively; \(\mathcal{N}(i)\) is the set of node \(i\)’s incoming neighbors; and \(W_{h}\) and \(W_{e}\) are the graph Transformer’s trainable parameters.
The expander is trained using the Cross-Entropy loss between the keyword \(P(k_{i})\) predicted from each music node and the keyword \(\hat{k_{i}}\) extracted from the lyrics. In this way, the expander produces a coherent keyword skeleton that fits the music during inference.
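The per-node prediction head and its training objective can be sketched as follows; the graph Transformer layers themselves are omitted, and the MLP depth is an assumption:

```python
import torch
import torch.nn as nn

class KeywordPredictionHead(nn.Module):
    """Map each music node's hidden state to a distribution over the keyword vocabulary."""
    def __init__(self, d_g=256, vocab_size=10000):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_g, d_g), nn.ReLU(), nn.Linear(d_g, vocab_size))

    def forward(self, music_node_states):   # (num_phrases, d_g)
        return self.mlp(music_node_states)  # keyword logits, one row per phrase

def expander_loss(keyword_logits, extracted_keyword_ids):
    """Cross-entropy against the keyword extracted from the paired lyric line."""
    return nn.functional.cross_entropy(keyword_logits, extracted_keyword_ids)
```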
3.3.3 Specified Seed Words Insertion.
During lyrics generation inference, users typically input several seed words to indicate the words they wish to learn through singing. The seed words provided by users are usually insufficient to form a complete keyword skeleton. Therefore, we employ the keyword expander to predict additional keywords from the input MIDI file and organize both seed words and expanded keywords into a keyword skeleton in a specified order. We utilize melody identification from [44] and the musical snippet segmentation techniques from AI-Lyricist [45] to estimate an appropriate sentence number \(L\) (equal to the number of phrases in the input music). After predicting keywords for the first \(l_{exp}=L-l_{seed}\) music nodes, we insert the remaining \(l_{seed}\) specified seed words into the keyword skeleton, ensuring that the total number of keywords in the skeleton equals the number of musical phrases. We use average co-occurrence and adjacency probabilities to determine the seed words’ positions within the skeleton: each seed word is inserted sequentially at the position that maximizes these probabilities for the entire keyword skeleton, thus improving the storyline’s coherence. The expanded keyword skeleton is finalized after all seed words are inserted.
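One straightforward reading of this insertion step is a greedy search over positions that maximizes the skeleton’s average edge score, reusing the co-occurrence and adjacency features of the keyword graph; the exact scoring details below are assumptions:

```python
def insert_seed_words(predicted_keywords, seed_words, keyword_edges):
    """Insert each seed word at the position that maximizes the average
    co-occurrence/adjacency score over consecutive keyword pairs."""
    def score(skeleton):
        pair_feats = [keyword_edges.get((a, b), [0.0, 0.0])
                      for a, b in zip(skeleton, skeleton[1:])]
        return sum(sum(f) for f in pair_feats) / max(len(pair_feats), 1)

    skeleton = list(predicted_keywords)
    for seed in seed_words:
        best_pos = max(range(len(skeleton) + 1),
                       key=lambda p: score(skeleton[:p] + [seed] + skeleton[p:]))
        skeleton.insert(best_pos, seed)
    return skeleton
```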
Compared to Plan2Lyrics [70] and AI-Lyricist, inserting seed words during skeleton expansion avoids conflicts with surrounding words. Maximizing co-occurrence and adjacency probabilities ensures sentence-level coherence. The graph model also establishes cross-modal coherence between music and lyrics.
4 Coherent Lyrics Generation
After training, the keyword skeleton expander can produce a keyword skeleton from unseen input MIDI music. Thus, the lyrics generation module takes any MIDI music and a keyword skeleton produced from the MIDI as input to generate coherent lyrics. We propose a three-layer mechanism to ensure coherence in its generation, utilizing three stacked GPT-2-based sub-modules: prepositive topic guider, main-body lyrics generator, and inverse prompts. These sub-modules enforce coherence before, during, and after lyric probability computation. (1) The expanded keyword skeleton prompts the main-body lyrics generator. (2) The prepositive topic guider uses the song name and previously generated words to constrain the next word selection. (3) Beam search with inverse prompts evaluates lyric candidates based on their alignment with the keyword skeleton.
GPT-2 is chosen as the foundational model for all three sub-modules due to its power, reproducibility, interpretability, and computational affordability. While more advanced models might perform better, GPT-2 enables us to explore coherence-enhancing factors and techniques within generally acceptable resource constraints. A textual lyric dataset pre-trains these models for poetic lyric adaptation, followed by fine-tuning for specific tasks. The three-layer mechanisms work together to generate fluent, coherent, and musically relevant lyrics.
4.1 Main-Body Lyrics Generator
The main-body lyrics generator produces subsequent tokens autoregressively. After pre-training on a lyric dataset, it is fine-tuned to generate lyrics using a specified number of syllables and a keyword from the skeleton as prompts. As illustrated in Figure 7(b), fixed-length keyword and syllable prompts precede each lyric sentence. A syllable plan \(SL=\{sl_{i}^{1},sl_{i}^{2},...,sl_{i}^{T}\}\), a list of predicted remaining syllable counts, is added to each token’s word embedding to indicate the syllables remaining in the current sentence. Training with keyword-lyric associations improves word-level coherence by familiarizing the generator with specific keywords. Using the keyword \(k_{i}\) and the remaining syllable number \(sl_{i}^{t}\) as prompts, the generator maximizes \(P_{w}(x_{t}=k_{i}|x_{\lt t},sl_{i}^{t})\times P_{w}(x_{\gt t}|x_{\lt t},sl_{i}^{t+1},x_ {t}=k_{i})\) to compute probabilities for subsequent tokens; for ease of computation, this product is approximated during generation.
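The remaining-syllable plan \(sl_{i}^{t}\) can be illustrated with a simple counter; the vowel-group heuristic and the resulting counts below are assumptions, as the paper does not specify its syllable counter:

```python
import re

def count_syllables(word):
    """Crude vowel-group heuristic, used only for illustration."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def remaining_syllable_plan(line_tokens, syllable_budget):
    """Compute sl_i^t, the syllables still available after each generated token;
    these counts are added to the token embeddings during generation."""
    plan = []
    for token in line_tokens:
        syllable_budget -= count_syllables(token)
        plan.append(max(syllable_budget, 0))
    return plan

# e.g. remaining_syllable_plan(["watch", "the", "sunrise", "glow"], syllable_budget=7)
# -> [6, 5, 2, 1]
```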
4.2 Prepositive Topic Guider
To generate lyrics that match desired attributes, discriminators \(P_{d}(\cdot)\) typically measure how well the generated lyrics align with a given attribute, and this judgment is combined with the language model’s token probabilities during generation. However, the subjective and multi-faceted nature of song lyrics makes them difficult to describe. Instead of the attribute classes used in [32], we propose using the song name to guide generation, as song names often summarize themes, sentiments, and content.
We enhance the main-body generator with a prepositive topic guider based on GEDI [32]. The guider is termed “prepositive” because it influences the token selection process with the song name and the previously generated lyrics before the main-body generator makes its final prediction, ensuring alignment with the desired song name from the outset. This guider computes the probability that each candidate token \(x_{t}\) matches the desired features in the song name prompt \(y\) (i.e., \(P_{d}(y|x_{t},x_{\lt t})\)), replacing the ineffective roll-out and reward processes of conventional discriminators. As shown in Figure 7(a), the topic guider computes the probability of each candidate token given the song name prompt, the corresponding anti-prompt (“<SONGNAME> <FALSE>”), and the previous tokens at each step. This probability is multiplied with the main-body generator’s prediction to constrain token selection.
GEDI can enhance sentence-level coherence in lyrics generation. Its optimization objective ensures the current sentence is judged as a continuation of the same topic as the previous sentence, thus increasing the probability of selecting more relevant candidate words.
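The way the guider reshapes the generator’s distribution can be sketched as a log-space combination of the two models; the weighting factor omega and the uniform class prior are assumptions:

```python
import torch

def guided_next_token_logprobs(lm_logprobs, disc_logprob_pos, disc_logprob_neg, omega=1.0):
    """lm_logprobs:      log P_w(x_t | x_<t) from the main-body generator.
    disc_logprob_pos: log-likelihood of each candidate under the song-name prompt.
    disc_logprob_neg: log-likelihood under the anti-prompt (<SONGNAME> <FALSE>).
    Bayes' rule over the two classes gives log P_d(y | x_<t, x_t), which rescales
    the generator's distribution before the next token is selected."""
    guidance = disc_logprob_pos - torch.logsumexp(
        torch.stack([disc_logprob_pos, disc_logprob_neg]), dim=0)
    return lm_logprobs + omega * guidance
```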
4.3 Inverse Prompts
When generating long texts, models often deviate from the prompt and include irrelevant content. To address this, we use the inverse prompt mechanism [89], a beam scoring function that evaluates the log likelihood in reverse. Traditional beam search calculates beam scores using the log likelihood of generating the lyrics from the prompts: \(BeamScore(X|K)=logP_{w}(X|K)\). In contrast, the inverse prompt assumes that if the prompts can be generated back from the lyrics, the two must be closely related, formulated as \(BeamScore_{IP}(X|K)=logP_{w}(K|X)\). The traditional prompting strategy is “\(K\) results in \(X\),” whereas the inverse prompt is “\(K\) is inferred back from \(X\).”
However, reversing the order of prompts and lyrics can produce unnatural texts [89]. A more natural inverse prompt predicts the original prompts from the generated text. Here, the inverse prompt summarizes the generated lines \(X^{\prime}\) back into a keyword skeleton \(K^{\prime}\), and beams are rated by \(BeamScore_{IP}(X|K)=logP_{w}(K^{\prime}|X^{\prime})\).
An example is shown in Figure 7(c). Given the keyword “Dream” and the previously generated text “… say I’m a” in the search beams, the inverse prompt is constructed as “… say I’m a \(\{\)dreamer/painter/human\(\}\) can be summarized as <KEYWORD>.” A GPT-2 model optimized for inverse prompting predicts the <KEYWORD> for each beam and scores the beams by how closely the prediction matches “Dream.” In the example, beams ending in “dreamer” and “painter” receive higher scores and remain in the search, while the other results are eliminated. To ensure the inclusion of seed words, the proposed scorer returns 0 if a candidate beam does not contain the specified seed word for learning.
The skeleton extracted by the KeYric system is a compressed representation of the lyrics in a latent space, serving as the song’s theme. The essence of Inverse Prompt is to have the model generate lines during beam search that can be summarized as the core idea, ensuring the lyrics maintain a consistent theme and enhancing full-text coherence.
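A sketch of the beam-rating step, assuming a callable score_fn that returns \(\log P(\text{target}|\text{prompt})\) under the inverse-prompt GPT-2; the template string merely echoes Figure 7(c), and returning 0 for beams missing the seed word follows the rule above:

```python
import math

def inverse_prompt_beam_score(beam_text, keyword, score_fn, seed_word=None):
    """score_fn(prompt, target) -> log P(target | prompt) under the inverse-prompt model.
    Returns a probability-scale score so that beams missing the seed word can be
    eliminated with a hard 0."""
    if seed_word is not None and seed_word.lower() not in beam_text.lower().split():
        return 0.0                                     # seed word must appear in the beam
    prompt = f'"{beam_text}" can be summarized as'     # illustrative template
    return math.exp(score_fn(prompt, keyword))         # proportional to P_w(K' | X')
```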
5 Objective Experiment
5.1 Dataset
We used the Netease API to extract English lyrics with the 100 most frequent tags from a lyric dataset [84], creating the Netease-lyrics dataset of 160,171 \(\{songname,lyric\}\) pairs for training the keyword skeleton extraction model. For training the keyword skeleton expansion model, we built the LMD-lyrics dataset from the Lakh MIDI dataset [60], which contains 7,211 \(\{songname,lyric,MIDIfile\}\) triplets. All lyrics are segmented by lines. Both datasets are split into train, validation, and test subsets with an 8:1:1 ratio.
5.2 Configurations
The keyword skeleton extractor employs standard encoder-decoder blocks [71] with the hidden state size of \(z\in Z\) set to 256 (\(d_{z}=d_{c}=256\)). Preliminary experiments determined \(\alpha=1.0\), \(\beta=4.0\), and \(\gamma=0.2\) [33]. The keyword skeleton expander’s graph network has an embedding size of 256 (\(d_{g}=256\)), 7 propagation layers to accommodate an average of two verses and two choruses, and includes the 10,000 most frequent words. Graph propagation uses sub-graphs of size 1,024 in batches. We pre-trained three GPT-2 models (the prepositive topic guider, the main-body generator, and the inverse prompt scorer) on the Netease-lyrics dataset with an MLM task [15] and fine-tuned them on the LMD-lyrics dataset with their respective prompt templates.
5.3 Compared Methods
We compared our keyword skeleton extraction (Proposed-K) and keyword expansion (Proposed-G, with G representing “graph”) models against various unsupervised keyword extraction techniques, including graph-based algorithms (TextRank, TopicRank, MultipartiteRank, PositionRank), embedding-based algorithms (EmbedRank, SIFRank), and attention-based algorithms (AttentionRank, UkeRank) [38].
We also compared our KeYric system (Proposed) with the SOTA lyrics generation model [70], referred to as Plan2Lyrics in this article; with AI-Lyricist [45], which is based on SeqGAN; and with SongMASS [65], which uses a Transformer to generate lyrics from a melody line.
An ablation study assessed the impact of each coherence mechanism in our lyrics generator. We evaluated three versions: (1) a vanilla GPT-2 generator with the keyword skeleton as prompts (Proposed-Lite, a simplified generator without coherence mechanisms), (2) a generator with a prepositive topic guider (Proposed-Pre), and (3) a generator with only inverse prompts (Proposed-IP). This allowed us to determine each mechanism’s contribution to enhancing lyrics’ coherence and musicality.
5.4 Objective Measures
We objectively evaluated the keyword skeletons and the lyrics generator’s applicability to language learning.
The keyword extractor and expander were evaluated on five metrics. The first metric, “representativeness” [70], assesses how well the keyword skeleton represents the semantic content and linguistic characteristics of the original lyrics [20]. It is measured by the average cosine similarity between each lyric sentence and its keyword embedding, indicating their interchangeability. The second metric, “coherence,” evaluates the skeleton’s topic transitions [29]. It is the average log probability of keyword graph edges, reflecting the frequency of consecutive keyword pairs in adjacent lines and thus measuring the coherence of the storyline. It is formulated as
\(\mathrm{Coherence}(K)=\frac{1}{|E|}\sum_{e_{i,j}\in E}\log P(e_{i,j}),\)
where each \(e_{i,j}\in E\) denotes an edge in the keyword graph defined in Section 3.3.1 that connects two subsequent keywords \(k_{i}\) and \(k_{j}\) in the keyword skeleton \(K\), and \(P(e_{i,j})\) represents the probability associated with edge \(e_{i,j}\) in the keyword graph, indicating how frequently keyword \(k_{j}\) follows keyword \(k_{i}\).
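Implemented directly from this definition, the score averages the log adjacency probabilities of consecutive keywords; the epsilon guarding \(\log(0)\) is an added assumption:

```python
import math

def skeleton_coherence(skeleton, keyword_edges, eps=1e-8):
    """Average log probability of the edges linking consecutive keywords, using
    the adjacency feature of the keyword graph built in Section 3.3.1."""
    pairs = list(zip(skeleton, skeleton[1:]))
    log_probs = [math.log(keyword_edges.get((a, b), [0.0, 0.0])[1] + eps) for a, b in pairs]
    return sum(log_probs) / max(len(log_probs), 1)
```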
We propose the third metric, “uniformity,” which evaluates the distribution of keywords across lines, the ideal being one keyword per lyric sentence. It is computed as the ratio of lyric sentences without keywords, multiplied by the number of keywords in the skeleton as a balancing coefficient. This encourages each sentence to contribute a keyword, supporting the storyline cohesively without missing key points; a high value indicates that keywords are concentrated in a few sentences, potentially losing content from the other lines. The fourth metric, “cross-modal relevance,” measures the correlation between textual and musical features, computed as the normalized dot product of their feature vectors [45]. The fifth metric, “diversity,” assesses the diversity of word choice across the lyrics dataset [20, 83]. It is the average pairwise difference between keyword skeletons, computed over the skeletons \(S\) extracted from the test-set lyrics and normalized by the size of the keyword sets. Diversity is beneficial, but excessive diversity can result in random keywords.
Following previous studies on coherent text generation, we evaluate the lyrics generators using two metrics: local [25, 35] and global coherence scores [26]. Local coherence is measured by topic-switching detection, calculating the probability that two consecutive sentences share the same topic [3]. Global coherence is evaluated by a model that predicts the document’s overall coherence through supervised regression [1]. Additionally, we performed POS tagging on the generated lyrics and computed the proportion of elements that significantly contribute to coherence, including conjunctions, subordinate clause indicators, and pronouns.
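This proportion can be computed with an off-the-shelf POS tagger such as spaCy; the tag set below (coordinating and subordinating conjunctions plus pronouns) approximates the categories named above:

```python
import spacy

def coherence_element_ratio(lyrics_text, nlp=None):
    """Share of tokens tagged as conjunctions or pronouns (requires the
    en_core_web_sm model to be installed)."""
    nlp = nlp or spacy.load("en_core_web_sm")
    doc = nlp(lyrics_text)
    hits = sum(1 for tok in doc if tok.pos_ in {"CCONJ", "SCONJ", "PRON"})
    return hits / max(len(doc), 1)
```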
5.5 Objective Experiment Results
The keyword skeleton evaluation results are shown in Table 1(a). Our proposed model outperforms the others on all five metrics. Notably, our one-keyword-per-line strategy avoids distribution bias and enhances episodic coherence, improving uniformity by 26% compared to the second-best method. The VAE captures essential keywords, increasing diversity by 15% and cross-modal relevance by 14%. Our keyword expander improves coherence and cross-modal relevance by 20% and 14%, respectively, demonstrating the effectiveness of deep graph networks in correlating music and lyrics.
In contrast, AttentionRank employs an attention mechanism but lacks a lyric reconstruction process, leading to the selection of articles and auxiliary words that misrepresent the lyrics and lower its representativeness score. TopicRank and TextRank build text graphs and select keywords without considering sentence order, resulting in weaker coherence. PositionRank, relying on keyword frequency and previous occurrences, produces undiversified keyword skeletons. EmbedRank, which uses embeddings for keyword extraction, is the second most competitive method: it selects heavily modified words, such as nouns surrounded by many adjectives, creating information-dense keywords. However, ignoring sentiment and style modifiers weakens EmbedRank’s cross-modal relevance with the music.
The evaluation results for the coherent lyrics generated by our proposed model are presented in Table 1(b). Our model shows a 5% improvement over the SOTA Plan2Lyrics, demonstrating that our compression-reconstruction skeleton extraction method produces a more effective latent space. It also surpasses AI-Lyricist by 9% in overall coherence, validating the effectiveness of our three-layer mechanisms. Additionally, our model outperforms SongMASS by 7%, indicating that incorporating human knowledge, such as syllable templates and keyword skeleton input, is more effective than relying solely on automatic cross-modal relevance capture.
Lyrics’ coherence improvement is calculated by averaging the percentage improvements for each metric in Table 1(b). For example, the improvement in local coherence compared to AI-Lyricist is (0.88 \(-\) 0.83)/0.83 \(\approx\) 0.06, and the improvement in global coherence is (1.69 \(-\) 1.52)/1.52 \(\approx\) 0.11. The overall improvement is then (0.06 + 0.11)/2 \(\approx\) 0.09 (9%). The +5% and +7% improvements over Plan2Lyrics and SongMASS are calculated similarly.
Although Plan2Lyrics increases the use of conjunctions and referential words in generated lyrics, the skeleton quality largely determines coherence improvement. Plan2Lyrics employs the YAKE method, which relies on text word frequency for keyword extraction. While these skeleton keywords show high coherence, they lack diversity and fail to adequately represent the original lyrics, resulting in an ineffective compression space. Additionally, the absence of musical input leads to significant deviations in musicality.
Our objective experiments revealed a strong positive correlation between the coherence of lyrics and the proportion of coherent elements they contain (
Table 1(b)). This suggests that using these elements more extensively in lyrics generation enhances coherence.
6 Subjective Evaluation
6.1 Experiment Participant and Procedures
We recruited 32 participants from the university via e-mail, requiring English as their first language. After completing the experiment and passing a manipulation check, participants received S\(\$\)30. The participants included three professional lyricists and one language teacher.
Participants underwent training before the main experiments. They were shown a sample keyword skeleton and lyric paired with music and then asked to rate keyword skeletons and lyrics. They also reviewed rating standards with examples for marks 1–5. The main experiment had two sections: In section 1, participants rated 100 keyword skeletons paired with music in random order; in section 2, they rated 40 lyrics paired with music in random order.
To maximize validity, we used several approaches: (1) a within-subjects experiment with randomized display order; (2) online training for participants with detailed example ratings and reasons to ensure consistency. For singability, we provided a clear rating question with examples for scores 1–5 to avoid ambiguity: “Please listen to the synthesized singing of the lyrics and rate the following aspects of the lyrics on a scale of 1 to 5, where 1 represents the lowest rating and 5 represents the highest rating. Singability: How well do the syllables of the generated lyrics align with the melody notes of the input music? It is a 5 score if all syllables and music notes are perfectly matched so that you can sing the lyrics naturally, without syllables needing elongations or compressions into more/less music notes, and without a word’s syllables separated by a downbeat. You should subtract 1 point for every mismatch that you feel.” (3) We balanced participant recruitment to include a variety of other spoken languages to mitigate linguistic biases, ensuring all participants’ first language was English. (4) We used manipulation checks to verify genuine engagement.
6.2 Subjective Metrics
We conducted user rating surveys to evaluate the quality of the extracted keyword skeletons, expanded keyword skeletons, and generated lyrics, focusing on key aspects. We assessed keyword skeletons based on coherence, faithfulness, musicality, and sentiment. Coherence examines storyline progression [9], faithfulness assesses accurate summarization of the original lyrics, musicality checks the match with the music style [45], and sentiment evaluates emotional expression. We refined the previously defined four levels of coherence in lyrics into six evaluation criteria for the subjective experiments: fluency, local coherence, global coherence, learnability, singability, and musicality. Fluency [70] ensures natural English [88], local coherence ensures smooth sentence transitions, global coherence maintains a consistent theme [88], and learnability requires seed words to be integrated seamlessly; participants judged the presence and contextual relevance of user-specified words in the lyrics. Singability ensures syllables align with melody notes [30, 37], and musicality evaluates cross-modal relevance with the genre, sentiment, and style of the paired music [45].
6.3 Subjective Experiment Results
As shown in Table 2(a), our model’s keyword skeletons assist users in envisioning storylines aligned with the music’s style and sentiment. Participants noted that our keyword extractor summarizes lyrics more accurately than the compared models. Overall, our keyword extractor and expander outperform the compared methods by 15% and 8%, respectively, based on the average improvement over the second-best model in each metric.
As shown in Table 2(b), our lyrics generator with the three-layer coherence mechanism surpasses its competitors in text quality, local and global coherence, and cross-modal relevance with the expanded keyword skeleton. The KeYric system improves coherent lyrics generation by 19%, based on the average improvement over the second-best model in each metric. It excels particularly in coherence at the word, sentence, and whole-piece levels, validating our design motivation.
The subjective experiments show that lyrics generated with a skeleton are perceived as more coherent and fluent by singers. Compared to the Plan2Lyrics model, our method improves local coherence by 7.6% and global coherence by 10%, highlighting the importance of coherence constraints in lyrics generation. Additionally, our model outperforms Plan2Lyrics in musicality by 17%, demonstrating the need for cross-modal associations between music and text.
One important function of the KeYric system is to help generate personalized lyrics for language learning through singing [45, 51]. In this method, users enhance their understanding and memory of words by singing songs that include the keywords they wish to learn. Our lyrics generator enhances personalized language learning by integrating the input seed words naturally, showing a 28% improvement in learnability compared to AI-Lyricist. Specifically, the seed words that users want to learn are successfully incorporated into the generated lyrics, and the surrounding words and sentences help users understand the meanings of these seed words. This integration effectively aids vocabulary comprehension and language acquisition. In summary, our generated lyrics are engaging, coherent, pleasant, and artistic, making them ideal for language learning.
7 Multi-Faceted Analysis: The Impacts of Musical Factors on Lyric Coherence
To analyze the effects of musical factors on lyric coherence, we divided the LMD-lyrics dataset into 17 genres and independently trained and evaluated their expanded keyword skeletons. This approach helped us understand how music genres influence lyric coherence. We extracted songs from the database with identified genre attributes, covering 17 genres: bluegrass (0.66%), blues (4.76%), Christian-gospel (1.38%), classical (1.16%), country (8.46%), dance-electric (5.48%), disco (0.32%), folk (2.08%), hip-hop (3.28%), jazz (3.54%), metal (19.92%), new age (1.98%), pop (11.96%), punk (6.84%), reggae (0.6%), R&B (0.96%), and rock (26.62%).
Table 3(a) shows the coherence rankings, revealing that pop and country music produce the most coherent results. The coherence and narrativity of human songwriting in these genres may be the primary reason. In contrast, classical music often lacks lyrics, and gospel songs rely on chanting and exclamations, leading to less coherent keyword expansion and lyrics generation.
Our case studies show that classical songs’ lyrics are often incomplete, and punk and folk songs lack a fixed arrangement. Large language models (LLMs) like ChatGPT struggle to detect patterns in these genres. Thus, we trained our generation model on genres with average coherence scores of 0.53 and above (e.g., metal). The trained model also performs well on unseen genres.
We further investigated which musical elements influence cross-modal lyric coherence. By excluding individual elements from the REMI melody representation, we observed the resulting changes in coherence. As shown in Table 3(b), removing pitches significantly decreases keyword and whole-piece coherence by 13% and 7%, respectively. Surprisingly, removing bar line information increases keyword coherence by 2% but reduces line-to-line coherence by 9%, as lyric lines do not always match bar lines. These findings help us select musical elements for establishing cross-modal relevance between text and music in future work.
7.1 Ablation Study and Case Study
The analysis of lyrics from our model and its variants (Proposed-Lite, Proposed-Pre, and Proposed-IP) shows that Proposed-Pre produces more sentimentally and thematically coherent lyrics than Proposed-Lite (+2.9%). Most sentences generated by Proposed-Pre maintain a consistent tone and focus on a shared topic, influenced by the prepositive topic guider.
Lyrics from Proposed-IP exhibit tone and subject shifts but use more consecutive and supplementary words to link these shifts. This variant also shows an increase in longer compound and complex sentences with attributive and adverbial clauses. The Proposed model integrates features of both Proposed-Pre and Proposed-IP, achieving smooth topic transitions between paragraphs and coherence within paragraphs.