research-article
Open access

KeYric: Unsupervised Keywords Extraction and Expansion from Music for Coherent Lyrics Generation

Published: 23 December 2024

Abstract

We address the challenge of enhancing coherence in generated lyrics from symbolic music, particularly for creating singing-based language learning materials. Coherence, defined as the quality of being logical and consistent, forming a unified whole, is crucial for lyrics at multiple levels—word, sentence, and full-text. Additionally, it involves lyrics’ musicality—matching of style and sentiment of the music. To tackle this, we introduce KeYric, a novel system that leverages keyword skeletons to strengthen both coherence and musicality in lyrics generation. KeYric employs an innovative approach with an unsupervised keyword skeleton extractor and a graph-based skeleton expander, designed to produce a style-appropriate keyword skeleton from input music. This framework integrates the skeleton with the input music via a three-layer coherence mechanism, significantly enhancing lyric coherence by 5% in objective evaluations. Subjective assessments confirm that KeYric-generated lyrics are perceived as 19% more coherent and suitable for language learning through singing compared to existing models. Our analyses indicate that integrating genre-relevant elements, such as pitch, into music encoding is crucial, as musical genres significantly affect lyric coherence.

1 Introduction

Coherence is often lacking in existing lyrics generation systems. This reduces the comprehensibility, engagement, and artistic expression of the content. This article focuses on enhancing the coherence of generated lyrics, defined as the quality of being logical and consistent, forming a unified whole. In lyrics writing, coherence means using a broader context and establishing semantic relationships between words and sentences to enhance understanding and interpretation [59].
Coherence in lyrics is manifested at four levels [59]. Word-level coherence: words form logical concepts or actions with their surrounding words. Sentence-level coherence: each line supplements, contrasts, or expands on the previous line. Full-text-level coherence: the entire text revolves around a single theme. Musicality-level coherence: the lyrics match the rhythm, sentiment, and style of the accompanying music. For example, Figure 1(a) shows high-quality human-composed lyrics with appropriate collocation. Each sentence extends the previous one, and together the sentences describe the theme “natural scenery during a day,” creating a peaceful atmosphere that aligns with the music’s sentiment. In contrast, the machine-generated example in Figure 1(b) exhibits poor word choices that lead to unclear meaning (e.g., “spider fixes toast”), weak sentence connections, a lack of a central theme, and a bizarre style that conflicts with the musical style. Additional generated examples can be found in Figure B1 of Appendix B.
Fig. 1. Examples of (a) coherent human-composed lyrics (top) and (b) incoherent automatically generated lyrics (bottom) using the music from John Lennon’s song Imagine. The human composition (a) exhibits good coherence due to appropriate collocation: subsequent lines extend the preceding ones, all sentences revolve around the theme of “natural scenery during a day,” and the content is harmonious with the musical style. In contrast, the generated sample (b) suffers from poor coherence due to inappropriate word combinations, off-topic sentences, and a style incongruous with the music.
Current models for lyrics generation typically achieve the proficiency demonstrated by the lower lines in the previous example, where grammar and rhythm are acceptable, but coherence across all four levels remains problematic [45, 65]. Although ChatGPT1 generates coherent and rhyming lyrics, it has notable limitations: It cannot process music input, meet syllable requirements, or avoid the risk of plagiarism. Customizing lyrics for specific applications, such as language learning, introduces even greater challenges. While research has shown that singing songs with appropriate lyrics can help language learners acquire vocabulary [18, 19, 22, 24, 48, 50, 62, 75], human-composed songs that appeal to learners’ musical tastes [51] often lack the necessary target vocabulary and may be subject to copyright. Existing lyrics generation systems are also inadequate in this context, as they struggle to incorporate user-specified keywords while maintaining four-level coherence.
A recent study proposes using a keyword skeleton as a prompt for lyrics generators to enhance coherence [70]. A “keyword skeleton” is defined as a structured list of terms, each corresponding sequentially to a sentence or line in the lyrics. Nevertheless, this approach has limitations. It does not thoroughly investigate and clearly define lyric coherence, fails to analyze the root causes of incoherence in generated lyrics, and does not fully leverage the keyword skeleton framework. We identified the main issues as follows: (1) The system employs the YAKE keyword extraction method [70], which generates skeletons based on word frequency. This approach does not effectively capture the narrative structure of lyrics. (2) The system may overlook essential user-input keywords during the skeleton generation process. (3) The system does not account for musical features, relying heavily on the rhythm and syllable templates of original lyrics, thus limiting its generalizability. (4) The system utilizes keywords as prompts without incorporating targeted innovations in its language model to enhance coherence. As a result, the coherence of the generated lyrics remains less satisfactory compared with human compositions. It is crucial to integrate a deeper understanding of musical elements and user inputs into the generation process to improve musicality and personalization.
In response to the above challenges, we introduce the KeYric system, which enhances coherence, musical association, and personalization in generated lyrics. As illustrated in Figure 2, the KeYric system generates coherent lyrics by first taking a MIDI file [63] and user-specified seed words as inputs. The Keyword Skeleton Extractor module uses unsupervised learning to extract semantically significant keywords from a lyrics database, forming keyword skeletons. This extends the original \(\{\)MIDI, lyrics\(\}\) tuple into a \(\{\)MIDI, keyword skeleton, lyrics\(\}\) triplet dataset. The cross-modal tier, which includes a graph network and a lyrics generator, then couples MIDI with skeletons and skeletons with lyrics. The Music-skeleton Coupled Graph Network selects a keyword for each musical phrase, linking them to form a skeleton, and the Coherent Lyrics Generator creates lyrics based on this skeleton. During inference, the system predicts a skeleton from the given MIDI file and seed words, and then generates personalized lyrics suitable for language learning through singing [45].
Fig. 2. The KeYric system extracts semantically significant keywords from a song’s lyrics, forms a keyword skeleton that aligns with the music, and uses this skeleton to fit human-composed lyrics during training. During inference, the system takes a MIDI file and seed words as inputs to produce personalized coherent lyrics.
Our approach differs from traditional keyword extraction methods that rely on word frequency or manual annotation. Instead, we conceptualize our keyword skeletons as an interpretable latent space, capable of compressing the lyric space and sampling it with comprehensible keywords. For instance, the example lyrics (a) in Figure 1 can be summarized into the keyword skeleton [“see,” “sun,” “birds,” and “nightly”]. We propose a novel unsupervised method for extracting keyword skeletons from human-composed lyrics, identifying the most representative words through a process of text compression and reconstruction. Furthermore, we leverage deep graph networks to establish pairwise connections between musical phrases and keywords within a skeleton. This graph model predicts a skeleton that includes the user-specified keywords, aligning with the sentiment and style of the music, thereby enhancing musicality coherence. Additionally, we present a coherent lyrics generation model that uses a skeleton and melody as prompts. This model incorporates three levels of coherence mechanisms, which enhance coherence at the word, sentence, and full-text levels throughout the generation process.
Objective and subjective evaluations demonstrate that the KeYric system achieves a 5% and 19% improvement in lyric quality over the compared models, respectively. Experts, including linguists, songwriters, and language teachers, have verified that KeYric generates lyrics that facilitate language learning through singing. Furthermore, we investigate the impact of genres and musical elements on lyric coherence. The findings reveal that pop and country songs produce the most coherent skeletons and lyrics. Additionally, our analysis indicates that, compared with bar boundaries, genre-relevant musical information such as pitch and note duration plays a more significant role in maintaining lyric coherence.
In summary, this study makes three major contributions:
We address coherent lyrics generation and propose a solution that improves coherence at the word level (smoothness between adjacent words), the sentence level (continuity in describing the same subject across adjacent sentences), and the full-text level (the entire lyrics center on the same theme), as well as musicality (lyrics match the style and sentiment of the music).
We propose KeYric to generate lyrics from a keyword skeleton. This novel unsupervised method extracts keyword skeletons from lyrics, expands input music and seed words for language learning into a skeleton, and incorporates the skeleton in the lyrics generator.
Our analysis of the experiment results reveals that incorporating genre-relevant musical components (i.e., pitch and note duration) in data encoding substantially enhances the coherence of the generated lyrics.

2 Related Work

2.1 Automatic Lyrics Generation

With the rise of neural network technology, Recurrent Neural Networks [21, 45, 46, 58, 66, 79] and Transformers [7, 36, 41, 54, 56, 65, 81] dominate automatic lyrics generation. These autoregressive models select the next lyric token or line until the desired length is reached. Considering the musical nature and unique requirements of writing lyrics, many studies have focused on using prompts or conditional embeddings to improve the generated lyrics’ topical words [76, 85], content [54], rhyme [36, 81], matching of syllables [43, 53, 77], audio features [7, 72–74], and realism [47]. To produce well-balanced lyrics, SongNet [36] and ChipSong [41] both constrain several of the above characteristics. Other studies use adversarial learning to reward results with desirable lyric characteristics from a consequentialist perspective [8, 12, 14, 45]. Researchers find that, to make lyrics singable, syllable patterns must match the melody [42]. Accordingly, some studies attempt to produce lyrics from melody input [34, 65, 77].
Unfortunately, few studies have examined coherence in lyrics generation thoroughly. Although state-of-the-art (SOTA) work attempts to use keyword skeletons as prompts to improve the coherence of lyrics generation [70], it does not define or explain what coherence in lyrics means in detail. Additionally, this system lacks an in-depth discussion on how to extract keyword skeletons and how to use the keyword framework to enhance coherence in the lyrics generator. It also overlooks the coherence between lyrics and music, failing to generate lyrics that match the prosody and style of the input music. These issues result in a significant gap between automatic lyrics generation and human songwriting.

2.2 Coherence in Text Generation

Existing methods for coherent text generation fall into three categories. The first generates text from prerequisite keywords, topics, or sentences. Early approaches used the hidden state of the previous sentence as context for the next [78], while later work suggested condensing sentences to condition future generations [26]. Other studies expanded phrase-based storyline plans into coherent stories. However, these methods often rely on (1) high-frequency words, leading to homogenized storylines [83], (2) predicate-argument structures, which ignore sentiment and style [20], or (3) human annotations, which lack uniform standards [80]. Improved methods for extracting keyword skeletons from lyrics are needed.
The second branch of methods enhances coherence by selecting the most coherent candidate. One study includes an independent model’s coherence judgment score in the generation loss [68]. SeqGAN with coherence and cohesion discriminators evaluates candidates’ probabilities of co-occurrence and adjacency with previous text [10]. Phrase-level reward replaces full-sentence reward in the roll-out process to improve efficiency [12]. The GEDI model uses an independent language model to compute priors for candidate words under a given topic code and previous words, approximating the posteriors given by the discriminators [32]. Lin and Riedl suggest adding a topic transition planner to GEDI for gradual topic transitions [39]. A recent poem generation study used prompt templates requiring candidates to predict the title, previous sentence, or topic, keeping only the winners in the beam search [89]. These “inverse prompts” predict back from the current generation (e.g., “current generation is from a <STYLE> style poem titled <TITLE>”). Since lyrics, as artistic texts, cannot be generalized into a few limited themes like technology, society, or economy, as in the case of GEDI, and song titles often fail to comprehensively summarize the lyrics’ content, we believe that using a single coherence mechanism alone is unlikely to improve lyric coherence. Therefore, it is necessary to design a new architecture that includes multiple coherence mechanisms.
The third branch suggests using a keyword skeleton as prompts to improve coherence in long texts [26]. However, applying this to lyrics generation is complex. It requires subjective human annotation of keyword skeletons, with one word per lyric line capturing salient semantic, sentimental, and narrative information. Current automatic plot planning and keyword skeleton extraction techniques often overlook words expressing sentiment and music style [20, 52, 80, 82, 83], making them less suitable for coherent lyrics generation. The lyrics generation in this study uses the keyword skeleton as prompts. Through an integrated model architecture, it combines GEDI, a lyrics generator, and inverse prompt techniques to enhance the coherence of the generated lyrics before, during, and after word selection.

2.3 Unsupervised Keyword Extraction

To create keyword skeletons that enhance lyric coherence and maintain the style and sentiment without human annotation bias, we reviewed existing unsupervised keyword extraction studies.
Unsupervised keyword extraction has evolved significantly. Initially, researchers used statistical, linguistic, machine learning, and graph theory methods to extract keywords from text [55]. With advancements in deep learning and language models, text-embedding models have gained prominence [2, 67]. Techniques like PageRank [13] and TextRank [49] construct document graphs to evaluate vertex importance. Improvements include clustering similar phrases into topics and weighting them by semantic relations [6], PositionRank, which considers word position and frequency [23], and multi-partite graphs that ensure topical diversity [5]. To address the limitations of graph-based methods, deep learning-based embedding methods like EmbedRank [4] and SIFRank [69] use high-dimensional vectors. UkeRank [38] and AttentionRank [16] improve accuracy with global and local contexts and a hybrid attention model with BERT [15], respectively. Recent research ranks all phrases from a corpus by relevance to new documents [64] and uses autoencoding variational Bayes to build a latent topic tree [86].
However, these methods are inadequate for lyric keyword skeleton extraction: (1) They struggle with poetic and metaphorical language, such as “time is a thief” (without simile indicators “like” or “as”), which requires understanding abstract concepts. (2) They overlook plot and topic transitions, complicating the generation of coherent texts from keywords. (3) Lyrics feature more repetitions and fewer explicit topic markers, making accurate keyword identification challenging. Thus, lyric keyword skeleton extraction necessitates specialized methods.

3 Keyword Skeleton Extraction and Expansion

3.1 Motivation

We commissioned lyricists to write lyrics suitable for language learning based on given seed words and music. By monitoring their creation process, we observed common procedures among human lyricists writing for linguistic pedagogy. As illustrated in Figure 3, these keywords evolve into lyrical cues and are extended into sentences, considering pivotal terms, rhyme, melodic alignment, and the seed words to be learned. An initial assessment of the semantic relevance [61] between these keyword skeletons and song titles showed a 57.4% improvement compared to randomly selected keywords (0.203 vs. 0.129, as shown in Figure 3). There are typically four methods to achieve textual coherence: repeating key nouns, using pronouns, employing transition signals, and maintaining logical order [57]. Compared to previous automated models, human lyricists use these components more effectively, enhancing coherence between consecutive lyric lines.
Fig. 3. Human composed lyrics given the mandatory seed words know and see based on John Lennon’s song Imagine. Seed words are denoted by green text, skeleton keywords by orange text, clauses by an arrow \(\rightarrow\), and conjunctions by underscore.
However, this process is labor-intensive, requiring over 20 minutes per song lyric to align with the provided music and keywords. In lyrics-based language learning, this effort increases as lyricists tailor compositions to learners’ backgrounds (e.g., linguistic proficiency, vocabulary). To improve efficiency, we propose KeYric, which emulates human lyricists’ writing processes through deep learning.
The core idea is to extract salient vocabulary from each lyric sentence using unsupervised learning. We build a keyword skeleton from the words that gain the highest attention weights during compression and reconstruction, capturing coherent connections between lyric sentences. A keyword skeleton expander, trained on extracted keyword skeletons and corresponding MIDI files, then predicts a suitable keyword skeleton for unseen input songs. Simultaneously, a lyrics generator, trained on full lyrics with an expanded keyword skeleton as prompts, employs multi-layer coherence mechanisms to select prevalent conjunctions, pronouns, clauses, and cohesive terms. This approach generates coherent lyrics by stringing together the keywords, ultimately achieving overall coherence.

3.2 Keyword Skeleton Extraction

Given a vocabulary set \(V\), the keyword skeleton is defined as a sequence of word tokens \(K=\{k_{1},k_{2},\ldots,k_{n}\},K\neq\emptyset\), where each element \(k_{t}\in V\) corresponds sequentially to \(s_{t}\), the \(t\mathrm{th}\) line in specific lyrics. As a condensed version of the entire text, a keyword skeleton should present coherence akin to a storyline, showcasing the narrative development and central theme. The selected keywords should meet the following criteria: (1) Each keyword represents a lyric sentence and conveys its stylistic information concisely. (2) Keywords should link coherently to nearby keywords. (3) Repeated keywords are allowed to reflect the lyric structure. Thus, the skeleton can serve as a synopsis and developmental framework for the lyrics.
We propose an unsupervised learning model to interpretably select the best keyword from each lyric line. As shown in Figure 4(a), the keyword extractor compresses lyrics into a latent space and reconstructs them using a hierarchical Transformer-Variational AutoEncoder (VAE). The word-to-sentence encoder’s attention scores identify the words that contribute the most semantic and stylistic information to each line’s latent variables. These chosen words form the keyword skeleton. The use of a VAE improves generalizability and robustness by capturing the underlying semantic structure of lyrics. This accommodates variations and different versions of the same song while maintaining keyword extraction consistency. This probabilistic approach ensures effective handling of diverse lyric representations. A hierarchical architecture separates sentence and word attention computation, making word token attention values more representative of their contributions to a sentence.
Fig. 4. KeYric architecture. (a) The keyword skeleton extractor creates keyword skeletons from lyrics. The skeletons are then packaged along with the lyrics text and their corresponding MIDI files for model training. Specifically, paired \(\{\)keyword skeleton, MIDI file\(\}\) datasets are used to train a cross-modal graph model, i.e., the keyword skeleton expander, that communicates between music and keywords. Paired \(\{\)keyword skeleton, lyrics\(\}\) datasets are employed to train a coherent lyrics generator, with a keyword skeleton serving as heuristic prompts. (b) The keyword skeleton expander trains on these pairs to build a keyword skeleton from an input MIDI file and seed words. The bipartite graph consists of two sub-graphs, the word sub-graph and the music sub-graph. (c) The coherent lyrics generator takes a keyword skeleton, melody, and a song name to generate coherent lyrics supporting language learning.

3.2.1 Lyrics Compression.

In Figure 5, the green blocks illustrate the compression process. The VAE has a hierarchical Transformer encoder \(q_{\theta}(z|x)\), decoder \(p_{\phi}(x|z)\), and latent variable \(z\in\mathbb{R}^{d_{z}}\) [31]. \(q\) and \(p\) are parameterized by \(\theta\) and \(\phi\), respectively.
Fig. 5. Keyword skeleton extractor network made of symmetric hierarchical Transformer-VAE encoder and decoder.
The hierarchical Transformer encoder [71] aggregates word tokens \(X=\{x_{1},x_{2},\ldots,x_{n}\}\) into sentence-level representations \(Z=\{z_{1},z_{2},\ldots,z_{u}\}\) stored in “hub” vectors, i.e., the foremost vector. Each sentence is prefixed with a prepositive virtual [CLS] token. A sentence’s aggregated latent vector is a Gaussian sample of its Transformer encoding at the hub vector’s location. We choose an isotropic Gaussian distribution with unit variance, \(p(z)=\mathcal{N}(0,I)\), as our prior. This simplifies the latent space structure, ensuring each latent variable contributes equally and independently. It aids efficient learning of diverse lyric representations while maintaining consistency and reducing complexity in the hierarchical Transformer encoder. We incorporate positional, part-of-speech (POS), and dependency embeddings [17] to include word tokens’ syntactic features.
The lyric-level encoder computes cross-sentence information and averages the outputs through a pooling layer to obtain the compressed latent vector of the entire lyrics, \(Z_{T}\in\mathbb{R}^{d_{z}}\):
\begin{align}\begin{split}z_{t}&=M_{Hub}(f(Emb([CLS]||s_{t})+Emb^{+}([CLS]||s_{t}))),\\Z_{T}&=Pool(F(Z)),\\\mu_{t},\sigma_{t}&=\mathrm{MLP}(z_{t}),\quad\mu_{T},\sigma_{T}=\mathrm{MLP}(Z_{T}),\end{split}\end{align}
(1)
where \(Emb(\cdot)\) is the word embedding layer and \(Emb^{+}(\cdot)\) is the sum of positional and semantic embeddings. \(s_{t}\) (\(t\in[1,u]\)) is the \(t\)th sentence. \(||\) denotes concatenation, and \(f(\cdot)\) and \(F(\cdot)\) are the word-to-sentence and sentence-to-lyrics Transformer encoders, respectively. \(M_{Hub}\) is the masking operation that retains only the hub vector.
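To make the compression step concrete, the following is a minimal PyTorch sketch of the word-to-sentence encoder in Equation (1), written under simplifying assumptions: the [CLS] token id, the layer sizes, and the single positional embedding stand in for the paper's full positional, POS, and dependency embeddings, so this is illustrative rather than the authors' exact implementation.

```python
# Minimal sketch of the word-to-sentence compression in Eq. (1); hyperparameters
# and the single positional embedding are simplifications of the paper's setup.
import torch
import torch.nn as nn

class SentenceCompressor(nn.Module):
    def __init__(self, vocab_size, d_z=256, n_heads=4, n_layers=2, max_len=64, cls_id=0):
        super().__init__()
        self.cls_id = cls_id                                   # hypothetical [CLS] token id
        self.emb = nn.Embedding(vocab_size, d_z)               # Emb(.)
        self.pos_emb = nn.Embedding(max_len, d_z)              # stand-in for Emb+(.)
        layer = nn.TransformerEncoderLayer(d_z, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # f(.)
        self.to_gauss = nn.Linear(d_z, 2 * d_z)                # MLP -> (mu_t, log sigma_t)

    def forward(self, token_ids):                              # token_ids: (batch, seq_len)
        cls = token_ids.new_full((token_ids.size(0), 1), self.cls_id)
        x = torch.cat([cls, token_ids], dim=1)                 # [CLS] || s_t
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0).expand_as(x)
        h = self.encoder(self.emb(x) + self.pos_emb(pos))
        z_t = h[:, 0]                                          # M_Hub: keep only the hub vector
        mu_t, log_sigma_t = self.to_gauss(z_t).chunk(2, dim=-1)
        return z_t, mu_t, log_sigma_t

# Toy usage: two lyric lines of 12 tokens each.
z, mu, log_sigma = SentenceCompressor(vocab_size=10000)(torch.randint(1, 10000, (2, 12)))
```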

3.2.2 Lyrics Reconstruction.

Unlike previous research [33, 40, 87], our lyric reconstruction uses a symmetric Transformer-based encoder and decoder, as the model compresses and reconstructs lyrics rather than generating them from random latent variables.
We sample the lyric’s latent vector, \(z^{\prime}_{T}\), from the approximate posterior \(q_{\theta}(z|x)=\mathcal{N}(\mu_{z},\sigma_{z})\). To match the decoder’s input shape, we expand \(z^{\prime}_{T}\) into the set \(C^{\prime}=\{c^{\prime}_{1},c^{\prime}_{2},\ldots,c^{\prime}_{u}\}\), where \(c^{\prime}_{i}\in\mathbb{R}^{d_{c}}\). We decode \(z^{\prime}_{T}\) into sentence-level latent variables \(Z^{\prime}=\{z^{\prime}_{1},z^{\prime}_{2},\ldots,z^{\prime}_{u}\}\) and project each \(z^{\prime}_{i}\in\mathbb{R}^{d_{z}}\) into the word-level decoder’s input shape to regenerate lyric tokens.
A multi-layer perceptron (MLP) predicts whether the current sentence is the end of the lyric reconstruction. The MLP assigns each \(z^{\prime}_{i}\) a probability \(P_{stop}\), indicating whether the current sentence should be the last:
\begin{align}C^{\prime}=W_{z}\cdot z^{\prime}_{T};\quad Z^{\prime}=G(C^{\prime});\quad X^{\prime}=g(C^{\prime});\quad P_{stop}=\mathrm{MLP}(Z^{\prime}),\end{align}
(2)
where \(W_{z}\) is a linear projection to sequentially expand \(z^{\prime}_{T}\); \(G(\cdot)\) and \(g(\cdot)\) are the lyric-to-sentence and sentence-to-word decoders.
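A compact sketch of the reconstruction path in Equation (2) is given below, again assuming PyTorch. The projection \(W_{z}\), the two decoders \(G(\cdot)\) and \(g(\cdot)\), and the stop classifier are stand-ins with simplified shapes; following the prose above, the toy word decoder operates on the sentence-level latents rather than reproducing the paper's exact modules.

```python
# Hedged sketch of Eq. (2): expand z'_T into C', decode sentence latents Z',
# regenerate token logits, and predict a stop probability per sentence.
import torch
import torch.nn as nn

class LyricsReconstructor(nn.Module):
    def __init__(self, vocab_size, d_z=256, n_sent=8, seq_len=16):
        super().__init__()
        self.n_sent, self.seq_len = n_sent, seq_len
        self.expand = nn.Linear(d_z, n_sent * d_z)               # W_z: z'_T -> C'
        block = nn.TransformerEncoderLayer(d_z, 4, batch_first=True)
        self.lyric_to_sent = nn.TransformerEncoder(block, 2)     # G(.) (simplified)
        self.sent_to_word = nn.Linear(d_z, seq_len * d_z)        # g(.) (simplified)
        self.word_head = nn.Linear(d_z, vocab_size)              # token logits for X'
        self.stop_head = nn.Linear(d_z, 1)                       # P_stop per sentence

    def forward(self, z_T):                                      # z_T: (batch, d_z)
        b, d = z_T.shape
        C = self.expand(z_T).view(b, self.n_sent, d)             # C' = W_z . z'_T
        Z = self.lyric_to_sent(C)                                 # Z'
        words = self.sent_to_word(Z).view(b, self.n_sent, self.seq_len, d)
        return self.word_head(words), torch.sigmoid(self.stop_head(Z)).squeeze(-1)
```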

3.2.3 Loss Design.

The loss function of the keyword extractor is formulated as follows:
\begin{align}\mathcal{L}_\mathrm{VAE}=\alpha\mathbb{E}_{q_{\theta}(z|x)}\left[\log p_{\phi}(x|z)\right]+\beta\mathcal{L}_{Stop}(P_{stop})-\gamma D_\mathrm{KL}\left[q_{\theta}(z|x)\,||\,p(z)\right].\end{align}
(3)
The loss function has three weighted terms. The first term, Reconstruction Loss, compares generated lyrics to the ground truth. The second term, Sentence Loss on the stopping distribution \(P_{stop}\), encourages the model to select an appropriate length for the generated lyrics [33]. The third term, the Kullback–Leibler Divergence, penalizes deviations of the latent variable distribution from a Gaussian prior with unit variance. We employ the “reparameterization trick” [31] to sample latent variables in a differentiable manner by predicting the mean and variance parameters of the Gaussian distribution.
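As a worked illustration, a hedged PyTorch sketch of Equation (3) and the reparameterization trick follows, written as a minimization (i.e., the negative ELBO plus the stop loss). The weights match those reported in Section 5.2; the concrete reconstruction and stop losses here are standard cross-entropy placeholders rather than the authors' exact code.

```python
# Sketch of the keyword extractor loss (Eq. 3) with reparameterized sampling.
import torch
import torch.nn.functional as F

def keyword_extractor_loss(token_logits, target_ids, p_stop, stop_labels,
                           mu, log_sigma, alpha=1.0, beta=4.0, gamma=0.2):
    # Reconstruction loss: compare regenerated tokens with the ground-truth lyrics.
    rec = F.cross_entropy(token_logits.flatten(0, -2), target_ids.flatten())
    # Sentence (stop) loss: encourage an appropriate number of generated lines.
    stop = F.binary_cross_entropy(p_stop, stop_labels)
    # KL divergence of N(mu, sigma^2) from the unit-variance Gaussian prior N(0, I).
    kl = -0.5 * torch.mean(1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp())
    return alpha * rec + beta * stop + gamma * kl

def reparameterize(mu, log_sigma):
    # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and log_sigma.
    return mu + torch.randn_like(mu) * log_sigma.exp()
```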

3.2.4 Keyword Selection.

After compressing and reconstructing the lyrics, we examine the accumulative self-attention matrices of all word-level Transformer encoder blocks to identify each line’s keyword. For each lyric line, each token’s attention scores across all layers, \(W_{i}^{Att}\in\mathbb{R}^{n\times n}\), are multiplied along the propagation path to the hub vector’s attention score, \(W_{Hub_{t}}^{Att}\in\mathbb{R}^{n\times 1}\). As illustrated by the red arrows in Figure 5, we select \(k_{t}\), the token with the highest product, as the keyword for its lyric line since this product indicates the token’s contribution to the sentence encoding. We then concatenate all selected keywords to form the skeleton of the lyrics
\begin{align}k_{t}=\arg\max\prod\nolimits_{i=1}^{\psi-1}W_{i}^{Att}\cdot W_{Hub_{t}}^{Att},\end{align}
(4)
where \(k_{t}\) is the extracted keyword for sentence \(t\) and \(\psi\) is the layer number of the word-to-sentence encoder.
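The selection rule in Equation (4) can be sketched as follows, assuming the per-layer self-attention matrices (averaged over heads) of the word-to-sentence encoder are available; the hub index and shapes are illustrative.

```python
# Sketch of Eq. (4): accumulate attention across layers and pick the token
# that contributes most to the hub ([CLS]) encoding of the line.
import torch

def select_keyword(attn_per_layer, hub_index=0):
    """attn_per_layer: list of (n, n) attention matrices, one per encoder layer."""
    flow = attn_per_layer[0]
    for attn in attn_per_layer[1:]:
        flow = attn @ flow                       # compose attention along the propagation path
    hub_contrib = flow[hub_index].clone()        # each token's contribution to the hub vector
    hub_contrib[hub_index] = float("-inf")       # never select the [CLS] token itself
    return int(torch.argmax(hub_contrib))        # index of the line's keyword token
```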

3.3 Keyword Skeleton Expansion

The keyword skeleton extractor creates matched triplets of \(\{\)keyword skeletons, lyrics, MIDI files\(\}\). It also generates two static matrices showing keyword co-occurrence and adjacency statistics, forming a keyword relationship graph. By adding input music phrases as nodes and connecting them to the keyword graph, the expander uses a graph Transformer [27] to learn the cross-modal relevance of keywords and music (Figure 4(b)).
During inference, the expander generates a keyword skeleton as lyric storylines from user-input seed words \(K_{seed}\) and input MIDI music \(m\). The expander augments seed words by predicting additional keywords from the music and rearranging them to form a keyword skeleton. Each music phrase node predicts a keyword matching its musical features after graph propagation and neighbor feature aggregation. The skeleton is the concatenation of the input seed words and predicted keywords from all music nodes.

3.3.1 Graph Building.

As shown in Figure 4(b), a bipartite graph connects the keyword textual modality with the symbolic music modality, consisting of two sub-graphs: the keyword graph and the music graph.
The keyword graph contains word nodes of the entire vocabulary, identified by their token IDs. These nodes are connected by bidirectional edges representing co-occurrence and adjacency frequencies based on extractor statistics. For instance, the probability of the keywords “seasons” and “spring” appearing together in a skeleton is 0.6, and the probability of “spring” following “seasons” in skeletons is 0.4. Thus, the edge from “seasons” to “spring” in the keyword graph has features [0.6, 0.4].
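A small sketch of how such edge features could be derived from the extractor's statistics is shown below; the normalization (dividing counts by how often the first keyword appears in a skeleton) is an assumption for illustration, not necessarily the authors' exact estimator.

```python
# Toy construction of [co-occurrence, adjacency] edge features for the keyword graph.
from collections import Counter
from itertools import combinations

def keyword_edge_features(skeletons):
    """skeletons: list of keyword skeletons, each a list of keyword tokens."""
    cooc, adj, appear = Counter(), Counter(), Counter()
    for sk in skeletons:
        for w in set(sk):
            appear[w] += 1
        for a, b in combinations(set(sk), 2):     # unordered co-occurrence in one skeleton
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
        for a, b in zip(sk, sk[1:]):              # ordered adjacency (b follows a)
            adj[(a, b)] += 1
    edges = {}
    for (a, b), c in cooc.items():
        edges[(a, b)] = [c / appear[a], adj[(a, b)] / appear[a]]  # [P(co-occur), P(adjacent)]
    return edges

print(keyword_edge_features([["seasons", "spring", "sun"], ["seasons", "rain"]]))
```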
The music graph includes music nodes represented by MidiBERT embeddings [11] of all music phrases’ REMI encodings [28]. First, a MIDI file is split into phrases following [45]. Then, as demonstrated in Figure 6(a), REMI, a music event representation, converts each music phrase’s MIDI score into discrete tokens, providing metrical context for rhythmic patterns and segmenting the music encoding into distinct nodes aligned with lyric phrase divisions. Next, MidiBERT, a large-scale pre-trained model for symbolic music understanding, uses masked language modeling (MLM) to learn high-level features by masking and reconstructing input REMI tokens (Figure 6(b)), capturing intricate musical patterns, harmonies, and structures.
Fig. 6. Illustration of REMI encoding and MidiBERT embedding. (a) The REMI encoding of a MIDI file is a music event representation that converts MIDI scores into discrete tokens with metrical context, aligning with lyric phrase divisions. (b) The MidiBERT embedding, a large-scale pre-trained model for symbolic music understanding, is trained by masked language modeling (MLM) task which masks and reconstructs the tokens in REMI encodings.
In the music graph, music nodes are connected by directed edges indicating performance order. Keyword nodes are bidirectionally connected to all music nodes, forming a bipartite graph to model cross-modal relationships (Figure 4(b)); that is, at graph initialization, every music phrase node is connected to every keyword node. By integrating the text and music modalities in one graph network, this model predicts a keyword for each paired music phrase, generating a keyword skeleton aligned with the musical context to provide a coherent storyline for lyrics generation.

3.3.2 Keyword Skeleton Expansion.

The graph Transformer computes hidden states for all nodes, propagating keyword information throughout the graph. Unlike sequential or grid models, a graph network (1) de-emphasizes autoregressive generation, enabling parallel keyword expansion; (2) captures topological long-term keyword dependencies; and (3) unifies music and text as graph nodes for cross-modal relevance. Hidden states for nodes and edges are represented uniformly as \(\mathbb{R}^{d_{g}}\). After information propagation, an MLP predicts and samples a keyword for each music node, representing the music phrase. This procedure is formulated as
\begin{align}h_{i}^{x}=Emb_{word}(x);\quad h_{i}^{m}=Emb_{MIDI}(R(m))\end{align}
(5)
\begin{align}h_{i}=\sigma\left(\sum_{j\in\mathcal{N}(i)}\alpha_{i,j}W_{h}h_{j}\right);\quad e_{i,j}=\sigma\left(\sum_{k\in\mathcal{N}(i)}\alpha_{i,k}W_{e}e_{i,k}\right)\end{align}
(6)
\begin{align}P(k_{i})=Softmax(\mathrm{MLP}(h^{m}_{i}));\quad Loss=CE(P(k_{i}),\hat{k}_{i}),\end{align}
(7)
where \(h\) and \(e\) are graph node and edge hidden states, whose superscripts distinguish music and word nodes. \(R(\cdot)\) denotes REMI encoding while \(Emb_{word}\) and \(Emb_{MIDI}\) represent word and MidiBERT Embedding, respectively. \(\mathcal{N}(i)\) are node \(i\)’s incoming neighbors, and \(W_{h}\) and \(W_{e}\) are graph Transformer model’s trainable parameters.
The expander is trained using the cross-entropy loss between the keyword \(P(k_{i})\) predicted from the music node and the keyword \(\hat{k}_{i}\) extracted from the lyrics. The expander thereby learns to produce a coherent keyword skeleton that fits the music during inference.
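To make Equations (5) through (7) concrete, here is a condensed toy version, assuming PyTorch and precomputed node embeddings. The real expander runs a graph Transformer over the full bipartite graph with edge features, whereas this sketch performs a single round of attention-weighted aggregation from keyword neighbors into each music node before the cross-entropy objective.

```python
# Toy keyword expander: aggregate keyword-node features into each music node
# (Eq. 6, simplified) and predict a keyword per music phrase (Eq. 7).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyKeywordExpander(nn.Module):
    def __init__(self, d_g=256, vocab_size=10000):
        super().__init__()
        self.W_h = nn.Linear(d_g, d_g)
        self.attn = nn.Linear(2 * d_g, 1)
        self.head = nn.Sequential(nn.Linear(d_g, d_g), nn.ReLU(),
                                  nn.Linear(d_g, vocab_size))

    def forward(self, music_h, word_h):
        # music_h: (n_phrases, d_g) phrase embeddings (MidiBERT-style, Eq. 5)
        # word_h:  (n_words, d_g)  keyword-node embeddings (incoming neighbors)
        n_m, n_w = music_h.size(0), word_h.size(0)
        pair = torch.cat([music_h.unsqueeze(1).expand(n_m, n_w, -1),
                          word_h.unsqueeze(0).expand(n_m, n_w, -1)], dim=-1)
        alpha = F.softmax(self.attn(pair).squeeze(-1), dim=-1)   # attention over neighbors
        h_m = torch.sigmoid(alpha @ self.W_h(word_h))            # aggregated music-node state
        return self.head(h_m)                                    # logits over the keyword vocabulary

# Training signal: cross-entropy against the keyword extracted for each phrase.
logits = ToyKeywordExpander()(torch.randn(4, 256), torch.randn(50, 256))
loss = F.cross_entropy(logits, torch.randint(0, 10000, (4,)))
```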

3.3.3 Specified Seed Words Insertion.

During lyrics generation inference, users typically input several seed words to indicate the words they wish to learn through singing. The seed words provided by users are usually insufficient to form a complete keyword skeleton. Therefore, we employ the keyword expander to predict additional keywords from the input MIDI file and organize both seed words and expanded keywords into a keyword skeleton in a specified order. We utilize melody identification from [44] and the musical snippet segmentation technique from AI-Lyricist [45] to estimate an appropriate sentence number \(L\) (equal to the number of phrases in the input music). After predicting keywords for the first \(l_{exp}=L-l_{seed}\) music nodes, we insert the remaining \(l_{seed}\) specified seed words into the keyword skeleton, ensuring that the total number of keywords in the skeleton equals the number of musical phrases. We use average co-occurrence and adjacency probabilities to determine the seed words’ positions within the skeleton. Each seed word is inserted sequentially so as to maximize these probabilities for the entire keyword skeleton, thus improving the storyline’s coherence. The expanded keyword skeleton is finalized after all seed words are inserted.
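A sketch of this greedy insertion heuristic is shown below, assuming the edge-feature dictionary from Section 3.3.1; the function names and the averaging of the two probabilities are illustrative rather than the paper's exact procedure.

```python
# Greedy seed-word insertion: place each seed at the position that maximizes the
# skeleton's average co-occurrence/adjacency score (a sketch, not the exact code).
def skeleton_score(skeleton, edges):
    pair_scores = [sum(edges.get((a, b), [0.0, 0.0])) / 2.0
                   for a, b in zip(skeleton, skeleton[1:])]
    return sum(pair_scores) / max(len(pair_scores), 1)

def insert_seed_words(expanded_keywords, seed_words, edges):
    skeleton = list(expanded_keywords)
    for seed in seed_words:                       # insert seeds one at a time
        best = max(range(len(skeleton) + 1),
                   key=lambda i: skeleton_score(skeleton[:i] + [seed] + skeleton[i:], edges))
        skeleton.insert(best, seed)
    return skeleton
```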
Compared to Plan2Lyrics [70] and AI-Lyricist, inserting seed words during skeleton expansion avoids conflicts with surrounding words. Maximizing co-occurrence and adjacency probabilities ensures sentence-level coherence. The graph model also establishes cross-modal coherence between music and lyrics.

4 Coherent Lyrics Generation

After training, the keyword skeleton expander can produce a keyword skeleton from unseen input MIDI music. Thus, the lyrics generation module takes any MIDI music and a keyword skeleton produced from the MIDI as input to generate coherent lyrics. We propose a three-layer mechanism to ensure coherence in its generation, utilizing three stacked GPT-2-based sub-modules: prepositive topic guider, main-body lyrics generator, and inverse prompts. These sub-modules enforce coherence before, during, and after lyric probability computation. (1) The expanded keyword skeleton prompts the main-body lyrics generator. (2) The prepositive topic guider uses the song name and previously generated words to constrain the next word selection. (3) Beam search with inverse prompts evaluates lyric candidates based on their alignment with the keyword skeleton.
GPT-2 is chosen as the foundational model for all three sub-modules due to its power, reproducibility, interpretability, and computational affordability. While more advanced models might perform better, GPT-2 enables us to explore coherence-enhancing factors and techniques within generally acceptable resource constraints. A textual lyric dataset pre-trains these models for poetic lyric adaptation, followed by fine-tuning for specific tasks. The three-layer mechanisms work together to generate fluent, coherent, and musically relevant lyrics.

4.1 Main-Body Lyrics Generator

The main-body lyrics generator produces subsequent tokens autoregressively. After pre-training on a lyric dataset, it is fine-tuned to generate lyrics based on a specified number of syllables and a keyword from the skeleton as prompts. As illustrated in Figure 7(b), fixed-length keyword and syllable prompts precede each lyric sentence. A syllable plan \(SL=\{sl_{i}^{1},sl_{i}^{2},\ldots,sl_{i}^{T}\}\), a list of predicted remaining syllable counts, is added to each token’s word embedding to indicate the syllables remaining in the current sentence. Training with keyword and lyric associations improves word-level coherence by familiarizing the generator with specific keywords. Using the keyword \(k_{i}\) and remaining syllable number \(sl_{i}^{t}\) as prompts, the generator maximizes \(P_{w}(x_{t}=k_{i}|x_{\lt t},sl_{i}^{t})\times P_{w}(x_{\gt t}|x_{\lt t},sl_{i}^{t+1},x_{t}=k_{i})\) to generate probabilities for subsequent tokens. For ease of computation, this process is approximated as
\begin{align}P_{w}(x_{t}|x_{\lt t},sl_{i}^{t},k_{i}).\end{align}
(8)
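The following sketch illustrates one way the remaining-syllable plan and keyword prompt of Equation (8) can be fed to a GPT-2 decoder using the Hugging Face transformers library; the prompt format, the toy countdown plan, and the way the syllable embedding is simply added to the input embeddings are assumptions for illustration, not the paper's exact interface.

```python
# Conditioning a GPT-2 decoder on a keyword prompt and a remaining-syllable plan (Eq. 8).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
syl_emb = nn.Embedding(32, gpt2.config.n_embd)        # remaining-syllable embedding table

prompt = "<KEYWORD> dream <SYL> 7 :"                  # illustrative keyword + syllable prompt
ids = tok(prompt + " say I'm a", return_tensors="pt").input_ids
remaining = torch.clamp(torch.arange(ids.size(1) - 1, -1, -1), max=31)  # toy countdown plan sl_i^t

inputs_embeds = gpt2.transformer.wte(ids) + syl_emb(remaining.unsqueeze(0))
next_token_logits = gpt2(inputs_embeds=inputs_embeds).logits[:, -1]     # P_w(x_t | x_<t, sl, k)
```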
Fig. 7. Coherent lyrics generator’s network architecture. (a) The prepositive topic guider. (b) The main-body lyrics generator. (c) Post-beam search driven by inverse prompts. (d) Final output of generated lyrics.

4.2 Prepositive Topic Guider

To generate lyrics that match desired attributes, discriminators \(P_{d}(\cdot)\) typically measure how well the generated lyrics align with a given attribute. The entire generation and discrimination process is formulated as follows:
\begin{align}P_{y}(x_{t}|x_{\lt t},y)\propto P_{w}(x_{t}|x_{\lt t})P_{d}(y|x_{t},x_{\lt t}).\end{align}
(9)
However, the subjective and multi-faceted nature of song lyrics makes them difficult to describe. Instead of attribute classes used in [32], we propose using the song name to guide generation, as song names often summarize themes, sentiments, and content.
We enhance the main-body generator with a prepositive topic guider based on GEDI [32]. GEDI acts as a “prepositive” topic guider because it influences token selection with the song name and the previously generated lyrics before the main-body generator makes its final prediction, ensuring alignment with the desired song name from the outset. This guider computes the probability that each candidate token \(x_{t}\) matches the desired features in the song name prompt \(y\) (i.e., \(P_{d}(y|x_{t},x_{\lt t})\)), replacing the ineffective roll-out and reward processes of conventional discriminators. As shown in Figure 7(a), the topic guider computes the probability of each candidate token given the song name prompt, the corresponding anti-prompt (“<SONGNAME> <FALSE>”), and the previous tokens at each step. This probability is multiplied with the main-body generator’s prediction to constrain token selection:
\begin{align}P_{d}(y_{pos}|x_{1:t})=\frac{P_{y}(x_{1:t}|y_{pos})}{P_{y}(x_{1:t}|y_{neg})+P_ {y}(x_{1:t}|y_{pos})}.\end{align}
(10)
GEDI can enhance sentence-level coherence in lyrics generation. Its optimization objective ensures the current sentence is judged as a continuation of the same topic as the previous sentence, thus increasing the probability of selecting more relevant candidate words.
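The reweighting in Equations (9) and (10) reduces to a per-step rescaling of the generator's distribution; a minimal sketch follows, assuming the guider's sequence log-likelihoods under the song-name prompt and its anti-prompt have already been computed for every candidate token.

```python
# Sketch of the prepositive topic guider: rescale P_w by the topic posterior of Eq. (10).
import torch

def guided_next_token_probs(p_w, logp_pos, logp_neg):
    """p_w:              (vocab,) main-body probabilities P_w(x_t | x_<t)
    logp_pos, logp_neg:  (vocab,) log P_y(x_1:t | y_pos) and log P_y(x_1:t | y_neg)
                         for each candidate continuation x_t."""
    # Eq. (10) in a numerically stable form: exp(a) / (exp(a) + exp(b)) == sigmoid(a - b).
    p_d = torch.sigmoid(logp_pos - logp_neg)
    p_y = p_w * p_d                               # Eq. (9): P_y proportional to P_w * P_d
    return p_y / p_y.sum()                        # renormalize over the vocabulary
```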

4.3 Inverse Prompts

When generating long texts, models often deviate from the prompt and include irrelevant content. To address this, we use the inverse prompt mechanism [89], a beam scoring function that evaluates the log likelihood in reverse. Traditional beam search calculates beam scores using the log likelihood of generating lyrics from prompts: \(BeamScore(X|K)=\log P_{w}(X|K)\). In contrast, the inverse prompt assumes that if the prompts can be generated back from the lyrics, they must be closely related, formulated as \(BeamScore_{IP}(X|K)=\log P_{w}(K|X)\). The traditional prompting strategy asks whether “K results in X,” whereas the inverse prompt asks whether K can be inferred back from X.
However, reversing the order of prompts and lyrics can produce unnatural texts [89]. A more natural inverse prompt predicts the original prompts from the generated text. Here, the inverse prompt summarizes generated lines \(X^{\prime}\) back into a keyword skeleton \(K^{\prime}\), and beams are rated by \(BeamScore_{IP}(X|K)=logP_{w}(K^{\prime}|X^{\prime})\).
An example is shown in Figure 7(c). Given the keyword “Dream” and previously generated text “… say I’m a” in the search beams, the inverse prompt is constructed as “… say I’m a \(\{\) dreamer/painter/human\(\}\) can be summarized as <KEYWORD>.” A GPT-2 model optimized for inverse prompt predicts the <KEYWORD> for each beam and scores them based on how closely it matches “Dream.” In the example, beams ending in “dreamer” and “painter” receive higher scores and remain in the search while other results are eliminated. To ensure the inclusion of seed words, the proposed scorer will return 0 if a candidate beam does not contain the specified seed word for learning.
The skeleton extracted by the KeYric system is a compressed representation of the lyrics in a latent space, serving as the song’s theme. The essence of Inverse Prompt is to have the model generate lines during beam search that can be summarized as the core idea, ensuring the lyrics maintain a consistent theme and enhancing full-text coherence.
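As an illustration of the scoring step, the sketch below rates a candidate line by how well a GPT-2 scorer (here the untuned public checkpoint, standing in for the fine-tuned inverse-prompt model) predicts the template “<line> can be summarized as <keyword>”, and zeroes out beams that drop a required seed word. Scoring the whole template with the model's average log-likelihood is a simplification of the paper's procedure.

```python
# Hedged sketch of inverse-prompt beam scoring with a seed-word check.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
scorer = GPT2LMHeadModel.from_pretrained("gpt2")

def inverse_prompt_score(candidate_line, keyword, seed_words=()):
    if any(w not in candidate_line for w in seed_words):
        return 0.0                                    # enforce inclusion of seed words
    text = f"{candidate_line} can be summarized as {keyword}"
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = scorer(ids, labels=ids)                 # mean negative log-likelihood of the template
    return float(torch.exp(-out.loss))                # higher = keyword better inferred from the line

print(inverse_prompt_score("... say I'm a dreamer", "Dream"))
```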

5 Objective Experiment

5.1 Dataset

We used the Netease API to extract English lyrics with the 100 most frequent tags from a lyric dataset [84], creating the Netease-lyrics dataset of 160,171 \(\{songname,lyric\}\) pairs for training the keyword skeleton extraction model. For training the keyword skeleton expansion model, we built the LMD-lyrics dataset from the Lakh MIDI dataset [60], which contains 7,211 \(\{songname,lyric,MIDIfile\}\) triplets. All lyrics are segmented by lines. Both datasets are split into 8:1:1 train, validation, and test subsets.

5.2 Configurations

The keyword skeleton extractor employs a standard encoder-decoder block [71] with the hidden state size of \(z\in Z\) set to 256 (\(d_{z}\) \(=\) \(d_{c}\) \(=\) 256). Preliminary experiments determined \(\alpha\) \(=\) 1.0, \(\beta\) \(=\) 4.0, and \(\gamma\) \(=\) 0.2 [33]. The keyword skeleton expander’s graph network has an embedding size of 256 (\(d_{g}\) \(=\) 256), 7 propagation layers to accommodate an average of two verses and two choruses, and includes the 10,000 most frequent words. Graph propagation uses sub-graphs of size 1,024 in batches. We pre-trained three GPT-2 models (the prepositive topic guider, the main-body generator, and the inverse prompt scorer) on the Netease-lyrics dataset with an MLM task [15] and fine-tuned them on the LMD-lyrics dataset with their respective prompt templates.

5.3 Compared Methods

We compared our keyword skeleton extraction (Proposed-K) and keyword expansion (Proposed-G, with G representing “graph”) models against various unsupervised keyword extraction techniques, including graph-based algorithms (TextRank, TopicRank, MultipartiteRank, PositionRank), embedding-based algorithms (EmbedRank, SIFRank), and attention-based algorithms (AttentionRank, UkeRank) [38].
We also compared our KeYric system (Proposed) with the SOTA lyrics generation model [70], referred to as Plan2Lyrics in this article, and with AI-Lyricist [45], based on SeqGAN, and SongMASS [65], which uses a Transformer to generate lyrics from a melody line.
An ablation study assessed the impact of each coherence mechanism in our lyrics generator. We evaluated three versions: (1) a vanilla GPT-2 generator with the keyword skeleton as prompts (Proposed-Lite, a simplified generator without coherence mechanisms), (2) a generator with a prepositive topic guider (Proposed-Pre), and (3) a generator with only inverse prompts (Proposed-IP). This allowed us to determine each mechanism’s contribution to enhancing lyrics’ coherence and musicality.

5.4 Objective Measures

We objectively evaluated the keyword skeletons and the lyrics generator’s applicability to language learning.
The keyword extractor and expander were evaluated on five metrics. The first metric, “representativeness” [70], assesses how well the keyword skeleton represents the semantic content and linguistic characteristics of the original lyrics [20]. This is measured by the average cosine similarity between each lyric sentence and its keyword embedding, indicating their interchangeability. The second metric, “coherence,” evaluates the skeleton’s topic transitions [29]. It is the average log probability of keyword graph edges, reflecting the frequency of consecutive keyword pairs in adjacent lines and thus measuring the coherence of the storyline. It is formulated as
\begin{align}C_{K}=\frac{1}{|E|}\sum_{e_{i,j}\in E}\log P(e_{i,j}),\end{align}
(11)
where \(e_{i,j}\in E\) denotes an edge in the keyword graph defined in Section 3.3.1 connecting two consecutive keywords in the skeleton \(K\), and \(P(e_{i,j})\) represents the probability associated with edge \(e_{i,j}\), indicating how frequently the keywords \(k_{i}\) and \(k_{j}\) occur in consecutive lines.
We propose the third metric, “uniformity,” which evaluates the distribution of keywords, aiming for one keyword per lyric sentence. It is computed as the ratio of lyric sentences without a keyword, multiplied by the number of keywords in the skeleton as a balancing coefficient. This ensures each sentence contributes a keyword, supporting the storyline cohesively without missing key points. A high uniformity score indicates that keywords are concentrated in a few lines, potentially losing content from the other lines. The fourth metric, “cross-modal relevance,” measures the correlation between textual and musical features, computed as the normalized dot product of their feature vectors [45]. The fifth metric, “diversity,” assesses the diversity of word choices across the lyrics dataset [20, 83]. It is the average pairwise difference between two keyword skeletons, formulated as
\begin{align}D=\frac{1}{|S|(|S|-1)}\sum_{i=1}^{|S|}\sum_{j=i+1}^{|S|}|(S_{i}\cup S_{j})-(S_ {i}\cap S_{j})|,\end{align}
(12)
where \(S\) is the set of keyword skeletons extracted from the lyrics in the test set and \(|\cdot|\) denotes the size of a set. Diversity is beneficial, but excessive diversity can result in random keywords.
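For reference, the two skeleton-level formulas above can be re-implemented directly; the sketch below assumes the edge-probability dictionary from Section 3.3.1 and follows Equations (11) and (12) literally, so it is illustrative rather than the authors' evaluation code.

```python
# Sketch of the skeleton "coherence" (Eq. 11) and "diversity" (Eq. 12) metrics.
import math

def skeleton_coherence(skeleton, edge_probs, eps=1e-8):
    # C_K: average log probability of the edges between consecutive keywords.
    pairs = list(zip(skeleton, skeleton[1:]))
    return sum(math.log(edge_probs.get(p, eps)) for p in pairs) / max(len(pairs), 1)

def skeleton_diversity(skeletons):
    # D: average pairwise symmetric difference between keyword skeletons (Eq. 12).
    n, total = len(skeletons), 0
    for i in range(n):
        for j in range(i + 1, n):
            total += len(set(skeletons[i]) ^ set(skeletons[j]))
    return total / (n * (n - 1)) if n > 1 else 0.0
```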
Following previous studies on coherent text generation, we evaluate lyrics generators using two metrics: local [25, 35] and global coherence scores [26]. Local coherence is measured by topic switching detection, calculating the probability that two consecutive sentences share the same topic [3]. Global coherence is evaluated by a model predicting the document’s overall coherence through supervised regression [1]. Additionally, we performed POS tagging on the generated lyrics and computed the proportion of elements that significantly contribute to coherence, including conjunctions, subordinate clause indicators, and pronouns.

5.5 Objective Experiment Results

The keyword skeleton evaluation results are shown in Table 1(a). Our proposed model outperforms others in all five metrics. Notably, our one keyword per line strategy avoids distribution bias and enhances episodic coherence. It improves uniformity by 26% compared to the second-best method. The VAE captures essential keywords, increasing diversity by 15% and cross-modal relevance by 14%. Our keyword expander improves coherence and cross-modal relevance by 20% and 14%, respectively, demonstrating the effectiveness of deep graph networks in correlating music and lyrics.
Table 1. Results of the Objective Experiments

(a) Keyword Skeleton Generation

Model             REP\(\uparrow\)  CO\(\downarrow\)  UNI\(\downarrow\)  CMR\(\uparrow\)  DIV\(\uparrow\)
SIFRank           0.5    6.69   5.30   0.52   6.2
MultiPartiteRank  0.56   6.73   4.11   0.55   13.59
TopicRank         0.55   6.7    4.06   0.56   12.48
AttentionRank     0.42   6.66   5.91   0.39   5.27
TextRank          0.54   6.7    4.2    0.48   11.19
PositionRank      0.45   6.5    6.7    0.45   3.86
EmbedRank         0.57   6.48   2.75   0.51   13.46
UKERank           0.44   6.43   3.98   0.42   4.62
Plan2Lyrics       0.29   6.42   2.91   0.33   4.33
Proposed-K        0.6    6.33   2.04   0.64   15.68
Proposed-G        -      5.13   -      0.64   13.82

(b) Lyrics Generation

Model          Local Coherence\(\uparrow\)  Global Coherence\(\uparrow\)  PCT\(\uparrow\)
Original       0.88    1.66   0.28
AI-Lyricist    0.83    1.52   0.16
SongMASS       0.85    1.54   0.19
Plan2Lyrics    0.85    1.59   0.25
Proposed       0.881   1.69   0.27
Proposed-Lite  0.854   1.57   0.22
Proposed-Pre   0.86    1.65   0.24
Proposed-IP    0.876   1.61   0.24

(a) Objective experiment results of keyword-skeleton-level evaluation. (b) Objective experiment results of lyrics evaluation. The bold red values represent the best performing values, while the underlined values represent the second best performing. We omit the values of the keyword expander (Proposed-G) on representativeness, uniformity, and faithfulness, as the expander is independent of the original lyrics.
CMR, cross-modal relevance; CO, coherence; DIV, diversity; PCT, proportion of coherent terms; REP, representativeness; UNI, uniformity.
In contrast, AttentionRank employs an attention mechanism but lacks a lyric reconstruction process, leading to the selection of articles and auxiliary words that misrepresent lyrics, lowering its representativeness score. TopicRank and TextRank build text graphs and select keywords without considering sentence order, resulting in weaker coherence. PositionRank, relying on keyword frequency and previous occurrences, produces undiversified keyword skeletons. EmbedRank, which uses embeddings for keyword extraction, ranks second in competitiveness by selecting heavily modified words, like nouns surrounded by many adjectives, creating information-dense keywords. However, ignoring sentiment and style modifiers weakens EmbedRank’s cross-modal relevance with music.
The evaluation results for the coherent lyrics generated by our proposed model are presented in Table 1(b). Our model shows a 5% improvement over the SOTA Plan2Lyrics, demonstrating that our compression-reconstruction skeleton extraction method produces a more effective latent space. It also surpasses AI-Lyricist by 9% in overall coherence, validating the effectiveness of our three-layer mechanisms. Additionally, our model outperforms SongMASS by 7%, indicating that incorporating human knowledge, such as syllable templates and keyword skeleton input, is more effective than relying solely on automatic cross-modal relevance capture.
Lyrics’ coherence improvement is calculated by averaging the percentage improvements for each metric in Table 1(b). For example, the improvement in local coherence compared to AI-Lyricist is (0.88 \(-\) 0.83)/0.83 \(=\) 0.06, and the improvement in global coherence is (1.69 \(-\) 1.52)/1.52 \(=\) 0.112. The overall improvement is then (0.06 + 0.112)/2 \(=\) 0.09 (9%). The +5% and +7% improvements over Plan2Lyrics and SongMASS are calculated similarly.
Although Plan2Lyrics increases the use of conjunctions and referential words in generated lyrics, the skeleton quality largely determines coherence improvement. Plan2Lyrics employs the YAKE method, which relies on text word frequency for keyword extraction. While these skeleton keywords show high coherence, they lack diversity and fail to adequately represent the original lyrics, resulting in an ineffective compression space. Additionally, the absence of musical input leads to significant deviations in musicality.
Our objective experiments revealed a strong positive correlation between the coherence of lyrics and the proportion of coherent elements they contain (Table 1(b)). This suggests that using these elements more extensively in lyrics generation enhances coherence.

6 Subjective Evaluation

6.1 Experiment Participant and Procedures

We recruited 32 participants from the university via e-mail, requiring English as their first language. After completing the experiment and passing a manipulation check, participants received S\(\$\)30. The participants included three professional lyricists and one language teacher.
Participants underwent training before the main experiments. They were shown a sample keyword skeleton and lyric paired with music and then asked to rate keyword skeletons and lyrics. They also reviewed rating standards with examples for marks 1–5. The main experiment had two sections: In section 1, participants rated 100 keyword skeletons paired with music in random order; in section 2, they rated 40 lyrics paired with music in random order.
To maximize validity, we used several approaches: (1) a within-subjects experiment with randomized display order; (2) online training for participants with detailed example ratings and reasons to ensure consistency. For singability, we provided a clear rating question with examples for scores 1–5 to avoid ambiguity: “Please listen to the synthesized singing of the lyrics and rate the following aspects of the lyrics on a scale of 1 to 5, where 1 represents the lowest rating and 5 represents the highest rating. Singability: How well do the syllables of the generated lyrics align with the melody notes of the input music? It is a 5 score if all syllables and music notes are perfectly matched so that you can sing the lyrics naturally, without syllables needing elongations or compressions into more/less music notes, and without a word’s syllables separated by a downbeat. You should subtract 1 point for every mismatch that you feel.” (3) We balanced participant recruitment to include a variety of other spoken languages to mitigate linguistic biases, ensuring all participants’ first language was English. (4) We used manipulation checks to verify genuine engagement.

6.2 Subjective Metrics

We conducted user rating surveys to evaluate the quality of extracted keyword skeletons, expanded keyword skeletons, and generated lyrics, focusing on key aspects. We assessed keyword skeletons based on coherence, faithfulness, musicality, and sentiment. Coherence examines storyline progression [9], faithfulness assesses accurate summarization of the original lyrics, musicality checks the match with the music style [45], and sentiment evaluates emotional expression. We refined the previously defined four levels of coherence in lyrics into six evaluation criteria for subjective experiments: fluency, local coherence, global coherence, learnability, singability, and musicality. Fluency [70] ensures natural English [88], local coherence ensures smooth sentence transitions, global coherence maintains a consistent theme [88], and learnability integrates seed words seamlessly. Participants judged the presence and contextual relevance of user-specified words in the lyrics. Singability ensures syllables align with melody notes [30, 37], and musicality evaluates cross-modal relevance with the genre, sentiment, and style of the paired music [45].

6.3 Subjective Experiment Results

As shown in Table 2(a), our model’s keyword skeletons assist users in envisioning storylines aligned with the music’s style and sentiment. Participants noted that our keyword extractor summarizes lyrics more accurately than compared models. Overall, our keyword extractor and expander outperform compared methods by 15% and 8%, respectively, based on the average improvement over the second-best model in each metric.
Table 2. Results of the Subjective Experiments

(a) Keyword Skeleton Generation

Model             CO\(\uparrow\)  FA\(\uparrow\)  MUS\(\uparrow\)  SEN\(\uparrow\)
SIFRank           3.24   2.91   2.88   2.72
MultiPartiteRank  3.16   2.93   2.91   3.22
TopicRank         3.22   2.93   2.90   1.60
AttentionRank     1.75   1.60   1.73   2.90
TextRank          3.25   3.09   2.90   2.51
PositionRank      2.83   2.68   2.62   2.48
EmbedRank         3.38   3.18   3.04   2.94
UKERank           2.48   2.33   2.35   1.55
Plan2Lyrics       3.38   2.79   2.25   2.29
Proposed-K        3.59   3.69   3.59   3.69
Proposed-G        3.60   -      3.55   3.23

(b) Lyrics Generation

Model        FLU\(\uparrow\)  Local CO\(\uparrow\)  Global CO\(\uparrow\)  LRN\(\uparrow\)  SIN\(\uparrow\)  MUS\(\uparrow\)
Original     3.99   3.90   3.84   1.40   3.45   3.24
AI-Lyricist  1.79   1.66   1.51   2.33   2.27   1.83
SongMASS     2.42   2.16   2.02   1.03   2.45   2.07
Plan2Lyrics  2.92   3.00   2.79   0.38   2.63   2.04
Proposed     3.69   3.23   3.07   \(\Delta\)2.98   3.08   2.61

(a) Subjective experiment results of keyword-skeleton-level evaluation. (b) Subjective experiment results of lyrics evaluation. The bold red values represent the best performing values, while the underlined values represent the second best performing. CO, coherence; FA, faithfulness; FLU, fluency; LRN, learnability; MUS, musicality; SEN, sentiment; SIN, singability.
As shown in Table 2(b), our lyrics generator with a three-layer coherence mechanism surpasses competitors in text quality, local and global coherence, and cross-modal relevance with the expanded keyword skeleton. The KeYric system improves coherent lyrics generation by 19%, based on the average improvement over the second-best model in each metric. It excels particularly in coherence at the word, sentence, and whole-piece levels, validating our design motivation.
Subjective experiments show that lyrics generated with a skeleton are perceived as more coherent and fluent by singers. Compared to the Plan2Lyrics model, our method improves coherence by 7.6% in local coherence and 10% in global coherence, highlighting the importance of coherence constraints in lyrics generation. Additionally, our model outperforms Plan2Lyrics in musicality by 17%, demonstrating the need for cross-modal associations between music and text.
One important function of the KeYric system is to help generate personalized lyrics for language learning through singing [45, 51]. In this method, users enhance their understanding and memory of words by singing songs that include the keywords they wish to learn. Our lyrics generator enhances personalized language learning by integrating input seed words naturally, showing a 28% improvement in learnability compared to AI-Lyricist. Specifically, the seed words that users want to learn are successfully incorporated into the generated lyrics, and the surrounding words and sentences help users understand the meanings of these seed words. This integration aids vocabulary comprehension and language acquisition effectively. In summary, our generated lyrics are engaging, coherent, pleasant, and artistic, making them ideal for language learning.

7 Multi-Faceted Analysis: The Impacts of Musical Factors on Lyric Coherence

To analyze the effects of musical factors on lyric coherence, we divided the LMD-lyrics dataset into 17 genres and independently trained and evaluated expanded keyword skeletons for each genre. We extracted songs from the database with identified genre attributes, covering 17 genres: bluegrass (0.66%), blues (4.76%), Christian-gospel (1.38%), classical (1.16%), country (8.46%), dance-electric (5.48%), disco (0.32%), folk (2.08%), hip-hop (3.28%), jazz (3.54%), metal (19.92%), new age (1.98%), pop (11.96%), punk (6.84%), reggae (0.6%), R&B (0.96%), and rock (26.62%). Table 3(a) shows the coherence rankings, revealing that pop and country music produce the most coherent results; the strong coherence and narrativity of human songwriting in these genres may be the primary reason. In contrast, classical pieces often lack lyrics, and gospel songs rely on chanting and exclamations, leading to less coherent keyword expansion and lyrics generation.
Table 3.
(a) Results of Subdatasets of Different Music Genres.
Genres                 | Keyword Coherence↑ | Local Lyrics Coherence↑ | Global Lyrics Coherence↑ | Normalized Average↑
Pop (11.96%)           | 5.48 | 0.91 | 1.67 | 0.83
Bluegrass (0.66%)      | 5.01 | 0.92 | 1.64 | 0.75
New age (1.98%)        | 5.08 | 0.91 | 1.61 | 0.70
Jazz (3.54%)           | 4.95 | 0.91 | 1.63 | 0.69
Reggae (0.6%)          | 4.66 | 0.91 | 1.66 | 0.69
Country (8.46%)        | 5.14 | 0.93 | 1.62 | 0.68
Hip-hop (3.28%)        | 5.05 | 0.91 | 1.62 | 0.67
Disco (0.32%)          | 4.68 | 0.92 | 1.58 | 0.59
Rock (26.62%)          | 4.55 | 0.90 | 1.63 | 0.58
R&B (0.96%)            | 5.17 | 0.89 | 1.58 | 0.54
Metal (19.92%)         | 4.6  | 0.89 | 1.58 | 0.53
Blues (4.76%)          | 5.08 | 0.90 | 1.55 | 0.52
Dance-electric (5.48%) | 4.06 | 0.91 | 1.61 | 0.50
Christ-Gospel (1.38%)  | 3.36 | 0.85 | 1.50 | 0.45
Punk (6.84%)           | 4.79 | 0.90 | 1.50 | 0.37
Folk (2.08%)           | 4.47 | 0.90 | 1.50 | 0.31
Classical (1.16%)      | 4.33 | 0.88 | 1.49 | 0.15
(b) Results of Subdatasets of Different REMI Encoded Elements.
REMI Element | Keyword Coherence↑ | Local Lyrics Coherence↑ | Global Lyrics Coherence↑
Full         | 5.13  | 0.881 | 1.69
w/o Bar Line | 5.23  | 0.80  | 1.61
w/o Position | 5.10  | 0.83  | 1.66
w/o Pitch    | 4.44  | 0.86  | 1.57
w/o Duration | 4.94  | 0.89  | 1.61
Table 3. Multifaceted Analysis of Keyword and Lyric Coherence on Datasets Divided by (a) Music Genres and (b) Encoded Elements
The bold red values represent the best performing values, while the bold blue values indicate the least ideal performing values. The underlined values mean the second best performing values.
Our case studies show that the lyrics of classical songs are often incomplete, and punk and folk songs lack a fixed arrangement. Large language models (LLMs) like ChatGPT struggle to detect patterns in these genres. Thus, we trained our generation model on genres with normalized average coherence scores of 0.53 and above (e.g., metal). The trained model also performs well on unseen genres.
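This genre screening reduces to a simple threshold filter over the normalized average column of Table 3(a); a minimal sketch, with a subset of genre scores copied from the table:

# Normalized average coherence per genre, copied from Table 3(a) (subset shown).
genre_scores = {
    "pop": 0.83, "bluegrass": 0.75, "country": 0.68, "metal": 0.53,
    "blues": 0.52, "punk": 0.37, "folk": 0.31, "classical": 0.15,
}

THRESHOLD = 0.53  # genres at or above this score are kept for training
training_genres = [g for g, s in genre_scores.items() if s >= THRESHOLD]
print(training_genres)  # ['pop', 'bluegrass', 'country', 'metal']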
We further investigated which musical elements influence cross-modal lyric coherence. By excluding individual elements from the REMI melody representation, we observed changes in coherence. As shown in Table 3(b), removing pitches significantly decreases keyword and whole-piece coherence by 13% and 7%, respectively. Surprisingly, removing bar line information increases keyword coherence by 2% but reduces line-to-line coherence by 9%, as lyric lines do not always align with bar lines. These findings help us select musical elements for establishing cross-modal relevance between text and music in future work.
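Conceptually, each "w/o <element>" configuration in Table 3(b) removes one family of tokens from the REMI melody sequence before encoding. The sketch below illustrates this with hypothetical string tokens; the actual REMI vocabulary [28] and our encoder interface differ in detail.

def ablate_remi(tokens: list[str], drop: str) -> list[str]:
    """Remove one family of REMI tokens (e.g., 'Pitch') from a melody sequence,
    mirroring the 'w/o <element>' rows of Table 3(b)."""
    return [t for t in tokens if not t.startswith(drop + "_")]

# Illustrative token strings, not the exact REMI vocabulary.
melody = ["Bar_None", "Position_1/16", "Pitch_64", "Duration_8",
          "Position_5/16", "Pitch_67", "Duration_4"]
print(ablate_remi(melody, "Pitch"))
# ['Bar_None', 'Position_1/16', 'Duration_8', 'Position_5/16', 'Duration_4']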

7.1 Ablation Study and Case Study

The analysis of lyrics from our model and its variants (Proposed-Lite, Proposed-Pre, and Proposed-IP) shows that Proposed-Pre produces more sentimentally and thematically coherent lyrics than Proposed-Lite (+2.9%). Most sentences generated by Proposed-Pre maintain a consistent tone and focus on a shared topic, influenced by the prepositive topic guider.
Lyrics from Proposed-IP exhibit tone and subject shifts but use more connective and supplementary words to bridge these shifts. This variant also shows an increased use of longer compound and complex sentences with attributive and adverbial clauses. The Proposed model integrates features of both Proposed-Pre and Proposed-IP, achieving smooth topic transitions between paragraphs and coherence within paragraphs.

8 Discussion and Future Work

Despite the success of LLMs like ChatGPT in natural language processing, our findings suggest that their proficiency in generating lyrics for language learning is inferior to that of our proposed system, KeYric. As shown in Appendix A, ChatGPT struggles to generate original lyrics without mimicking existing structures and often fails to meet specific syllabic requirements. In contrast, KeYric adapts to new musical inputs and consistently produces accurate, flexible results. To fully exploit the potential of LLMs in lyrics generation, future research could enhance LLMs' ability to understand lyric prosody, incorporate explicit mandatory control instructions, and explicitly avoid plagiarism. Developing models that can process various musical inputs and integrating more comprehensive paired datasets could improve the applicability of LLMs to music-related tasks.
Our research focuses on the English language, but extending our model to other languages is a promising future direction. While this extension is beyond the current article’s scope, it could significantly enhance our system’s applicability and impact. Future research should explore adapting the model for multi-lingual lyrics generation by addressing language-specific nuances and integrating lyric datasets in diverse languages.

9 Conclusion

This article examines lyric coherence at the word, sentence, full-text, and cross-modal (musicality) levels. We address the shortcomings of existing keyword extraction methods for improving lyric coherence, proposing unsupervised keyword extraction from lyrics and keyword expansion from music. Additionally, we suggest using multiple coherence mechanisms with a keyword skeleton to enhance coherence before, during, and after lyric token prediction. Users rate the KeYric system as outperforming state-of-the-art models by 19%. Finally, we offer insights into how genres and musical elements influence coherent lyrics generation.

Acknowledgment

The authors would like to thank anonymous reviewers for their valuable suggestions.

References

[1]
Tushar Abhishek, Daksh Rawat, Manish Gupta, and Vasudeva Varma. 2021. Transformer models for text coherence assessment. arXiv:2109.02176. Retrieved from https://arxiv.org/abs/2109.02176
[2]
Lahbib Ajallouda, Fatima Zahra Fagroud, and Ahmed Zellou. 2023. Automatic keyphrases extraction: An overview of deep learning approaches. Bulletin of Electrical Engineering and Informatics 12, 1 (2023), 303–313.
[3]
Dennis Aumiller, Satya Almasian, Sebastian Lackner, and Michael Gertz. 2021. Structural text segmentation of legal documents. In Proceedings of the 18th International Conference on Artificial Intelligence and Law, 2–11.
[4]
Kamil Bennani-Smires, Claudiu Musat, Andreea Hossmann, Michael Baeriswyl, and Martin Jaggi. 2018. Simple unsupervised keyphrase extraction using sentence embeddings. arXiv:1801.04470. Retrieved from https://arxiv.org/abs/1801.04470
[5]
Florian Boudin. 2018. Unsupervised keyphrase extraction with multipartite graphs. arXiv:1803.08721. Retrieved from https://arxiv.org/abs/1803.08721
[6]
Adrien Bougouin, Florian Boudin, and Béatrice Daille. 2013. TopicRank: Graph-based topic ranking for keyphrase extraction. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP ’13), 543–551.
[7]
Jia-Wei Chang, Jason C. Hung, and Kuan-Cheng Lin. 2021. Singability-enhanced lyric generator with music style transfer. Computer Communications 168 (2021), 33–53.
[8]
Yihao Chen and Alexander Lerch. 2020. Melody-conditioned lyrics generation with seqgans. In Proceedings of the IEEE International Symposium on Multimedia (ISM ’20). IEEE, 189–196.
[9]
Yun-Nung Chen, Yu Huang, Hung-Yi Lee, and Lin-Shan Lee. 2012. Unsupervised two-stage keyword extraction from spoken documents by topic coherence and support vector machine. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’12). IEEE, 5041–5044.
[10]
Woon Sang Cho, Pengchuan Zhang, Yizhe Zhang, Xiujun Li, Michel Galley, Chris Brockett, Mengdi Wang, and Jianfeng Gao. 2018. Towards coherent and cohesive long-form text generation. arXiv:1811.00511. Retrieved from https://arxiv.org/abs/1811.00511
[11]
Yi-Hui Chou, I-Chun Chen, Chin-Jui Chang, Joann Ching, and Yi-Hsuan Yang. 2021. MidiBERT-piano: Large-scale pre-training for symbolic music understanding. arXiv:2107.05223. Retrieved from https://arxiv.org/abs/2107.05223
[12]
Yun-Yen Chuang, Hung-Min Hsu, Ray-I Chang, and Hung-Yi Lee. 2022. Adversarial rap lyric generation. In Proceedings of the 4th International Conference on Natural Language Processing (ICNLP ’22). IEEE, 414–419.
[13]
Fan Chung. 2014. A brief survey of pagerank algorithms. IEEE Transactions on Network Science and Engineering 1, 1 (2014), 38–42.
[14]
Aidan Cookson, Auguste Hirth, and Krish Kabra. 2020. SloGAN: Character level adversarial lyric generation, 1–6. Retrieved from https://krishk97.github.io/files/SloGAN.pdf
[15]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https://arxiv.org/abs/1810.04805
[16]
Haoran Ding and Xiao Luo. 2021. AttentionRank: Unsupervised keyphrase extraction using self and cross attentions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 1919–1928.
[17]
Sufeng Duan, Hai Zhao, Junru Zhou, and Rui Wang. 2019. Syntax-aware transformer encoder for neural machine translation. In Proceedings of the International Conference on Asian Language Processing (IALP ’19). IEEE, 396–401.
[18]
Dwayne Engh. 2013. Why use music in English language learning? A survey of the literature. English Language Teaching 6, 2 (2013), 113–127.
[19]
Judith Weaver Failoni. 1993. Music as means to enhance cultural awareness and literacy in the foreign language classroom. Mid-Atlantic Journal of Foreign Language Pedagogy 1 (1993), 97–108.
[20]
Angela Fan, Mike Lewis, and Yann Dauphin. 2019. Strategies for structuring story generation. arXiv:1902.01109. Retrieved from https://arxiv.org/abs/1902.01109
[21]
Haoshen Fan, Jie Wang, Bojin Zhuang, Shaojun Wang, and Jing Xiao. 2019. A hierarchical attention based Seq2Seq model for Chinese lyrics generation. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence. Springer, 279–288.
[22]
Douglas Fisher. 2001. Early language learning with and without music. Reading Horizons: A Journal of Literacy and Language Arts 42, 1 (2001), 8.
[23]
Corina Florescu and Cornelia Caragea. 2017. Positionrank: An unsupervised approach to keyphrase extraction from scholarly documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1105–1115.
[24]
Arla J. Good, Frank A. Russo, and Jennifer Sullivan. 2015. The efficacy of singing in foreign-language learning. Psychology of Music 43, 5 (2015), 627–640.
[25]
Jian Guan, Xiaoxi Mao, Changjie Fan, Zitao Liu, Wenbiao Ding, and Minlie Huang. 2021. Long text generation by modeling sentence-level and discourse-level coherence. arXiv:2105.08963. Retrieved from https://arxiv.org/abs/2105.08963
[26]
Zhe Hu, Hou Pong Chan, Jiachen Liu, Xinyan Xiao, Hua Wu, and Lifu Huang. 2022. Planet: Dynamic content planning in autoregressive transformers for long-form text generation. arXiv:2203.09100. Retrieved from https://arxiv.org/abs/2203.09100
[27]
Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. 2020. Heterogeneous graph transformer. In Proceedings of the Web Conference 2020, 2704–2710.
[28]
Yu-Siang Huang and Yi-Hsuan Yang. 2020. Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions. In Proceedings of the 28th ACM International Conference on Multimedia, 1180–1188.
[29]
Haixin Jiang, Rui Zhou, Limeng Zhang, Hua Wang, and Yanchun Zhang. 2019. Sentence level topic models for associated topics extraction. World Wide Web 22 (2019), 2545–2560.
[30]
Haven Kim, Kento Watanabe, Masataka Goto, and Juhan Nam. 2023. A computational evaluation framework for singable lyric translation. arXiv:2308.13715. Retrieved from https://arxiv.org/abs/2308.13715
[31]
Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv:1312.6114. Retrieved from https://arxiv.org/abs/1312.6114
[32]
Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. 2020. GeDi: Generative discriminator guided sequence generation. arXiv:2009.06367. Retrieved from https://arxiv.org/abs/2009.06367
[33]
Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. 2017. A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 317–325.
[34]
Sankar Kuppan and Sobha Lalitha Devi. 2009. Automatic generation of Tamil lyrics for melodies. In Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, 40–46.
[35]
Junyi Li, Wayne Xin Zhao, Zhicheng Wei, Nicholas Jing Yuan, and Ji-Rong Wen. 2021. Knowledge-based review generation by coherence enhanced text planning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 183–192.
[36]
Piji Li, Haisong Zhang, Xiaojiang Liu, and Shuming Shi. 2020. SongNet: Rigid formats controlled text generation. arXiv:2004.08022. Retrieved from https://arxiv.org/abs/2004.08022
[37]
Qihao Liang, Xichu Ma, Finale Doshi-Velez, Brian Lim, and Ye Wang. 2024. XAI-Lyricist: Improving the singability of AI-Generated lyrics with prosody explanations. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24. Kate Larson (Ed.), IJCAI, Human-Centred AI. 7877–7885. DOI:
[38]
Xinnian Liang, Shuangzhi Wu, Mu Li, and Zhoujun Li. 2021. Unsupervised keyphrase extraction by jointly modeling local and global context. arXiv:2109.07293. Retrieved from https://arxiv.org/abs/2109.07293
[39]
Zhiyu Lin and Mark O. Riedl. 2021. Plug-and-blend: A framework for plug-and-play controllable story generation with sketches. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 17, 58–65.
[40]
Danyang Liu and Gongshen Liu. 2019. A transformer-based variational autoencoder for sentence generation. In Proceedings of the International Joint Conference on Neural Networks (IJCNN ’19). IEEE, 1–7.
[41]
Nayu Liu, Wenjing Han, Guangcan Liu, Da Peng, Ran Zhang, Xiaorui Wang, and Huabin Ruan. 2022. ChipSong: A controllable lyric generation system for Chinese popular song. In Proceedings of the 1st Workshop on Intelligent and Interactive Writing Assistants (In2Writing ’22), 85–95.
[42]
Peter Low. 2003. Singable translations of songs. Perspectives: Studies in Translatology 11, 2 (2003), 87–103.
[43]
Xu Lu, Jie Wang, Bojin Zhuang, Shaojun Wang, and Jing Xiao. 2019. A syllable-structured, contextually-based conditionally generation of Chinese lyrics. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence. Springer, 257–265.
[44]
Xichu Ma, Xiao Liu, Bowen Zhang, and Ye Wang. 2022. Robust melody track identification in symbolic music. In Proceedings of the 23rd ISMIR Conference (ISMIR ’22), 842–849.
[45]
Xichu Ma, Ye Wang, Min-Yen Kan, and Wee Sun Lee. 2021. AI-Lyricist: Generating music and vocabulary constrained lyrics. In Proceedings of the 29th ACM International Conference on Multimedia, 1002–1011.
[46]
Eric Malmi, Pyry Takala, Hannu Toivonen, Tapani Raiko, and Aristides Gionis. 2016. DopeLearning: A computational approach to rap lyrics generation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 195–204.
[47]
Enrique Manjavacas, Mike Kestemont, and Folgert Karsdorp. 2019. Generation of hip-hop lyrics with hierarchical modeling and conditional templates. In Proceedings of the 12th International Conference on Natural Language Generation, 301–310.
[48]
Suzanne L Medina. 1990. The Effects of Music upon Second Language Vocabulary Acquisition. Annual Meeting of the Teachers of English to Speakers of Other Languages. ERIC, 1–26.
[49]
Rada Mihalcea and Paul Tarau. 2004. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404–411.
[50]
Susan Bergman Miyake. 2004. Pronunciation and music. Sophia Junior College Faculty Bulletin 20, 3 (2004), 80.
[51]
Dania Murad, Riwu Wang, Douglas Turnbull, and Ye Wang. 2018. SLIONS: A karaoke application to enhance foreign language learning. In Proceedings of the 26th ACM International Conference on Multimedia, 1679–1687.
[52]
Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simões, Vitaly Nikolaev, and Ryan McDonald. 2021. Planning with learned entity prompts for abstractive summarization. Transactions of the Association for Computational Linguistics 9 (2021), 1475–1492.
[53]
Hieu Nguyen and Brian Sa. 2009. Rap lyric generator. Stanford University, Spring, California, USA (2009), 1–3.
[54]
Nikola I. Nikolov, Eric Malmi, Curtis G. Northcutt, and Loreto Parisi. 2020. Conditional rap lyrics generation with denoising autoencoders. arXiv:2004.03965. Retrieved from https://arxiv.org/abs/2004.03965
[55]
Aytuğ Onan, Serdar Korukoğlu, and Hasan Bulut. 2016. Ensemble of keyword extraction methods and classifiers in text classification. Expert Systems with Applications 57 (2016), 232–247.
[56]
Aitor Ormazabal, Mikel Artetxe, Manex Agirrezabal, Aitor Soroa, and Eneko Agirre. 2022. PoeLM: A meter-and rhyme-controllable language model for unsupervised poetry generation. arXiv:2205.12206. Retrieved from https://arxiv.org/abs/2205.12206
[57]
Alice Oshima and Ann Hogue. 2006. Writing Academic English. Pearson.
[58]
Peter Potash, Alexey Romanov, and Anna Rumshisky. 2015. GhostWriter: Using an LSTM for automatic rap lyric generation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1919–1924. DOI:
[59]
Dian Nauwala Putri. 2021. Cohesiveness in informal written text, song lyrics. In Proceedings of the 6th Annual Seminar on English Language Studies (ASELS ’21), 18–22.
[60]
C. Raffel. 2016. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Ph.D. Dissertation. Columbia University.
[61]
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv:1908.10084. Retrieved from https://arxiv.org/abs/1908.10084
[62]
Andrés Roberto Rengifo. 2009. Improving pronunciation through the use of karaoke in an adult English class. Profile Issues in Teachers Professional Development 11 (2009), 91–106.
[63]
Joseph Rothstein. 1995. MIDI: A Comprehensive Introduction, Vol. 7. AR Editions, Inc.
[64]
Xianjie Shen, Yinghan Wang, Rui Meng, and Jingbo Shang. 2022. Unsupervised deep keyphrase generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 11303–11311.
[65]
Zhonghao Sheng, Kaitao Song, Xu Tan, Yi Ren, Wei Ye, Shikun Zhang, and Tao Qin. 2021. SongMASS: Automatic song writing with pre-training and alignment constraint. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 13798–13805.
[66]
Sung-Hwan Son, Hyun-Young Lee, Gyu-Hyeon Nam, and Seung-Shik Kang. 2019. Korean song-lyrics generation by deep learning. In Proceedings of the 2019 4th International Conference on Intelligent Information Technology, 96–100.
[67]
Mingyang Song, Yi Feng, and Liping Jing. 2023. A survey on recent advances in keyphrase extraction from pre-trained language models. In Findings of the Association for Computational Linguistics (EACL ’23), 2108–2119.
[68]
Ruixiao Sun, Jie Yang, and Mehrdad Yousefzadeh. 2020. Improving language generation with sentence coherence objective. arXiv:2009.06358. Retrieved from https://arxiv.org/abs/2009.06358
[69]
Yi Sun, Hangping Qiu, Yu Zheng, Zhongwei Wang, and Chaoran Zhang. 2020. SIFRank: A new baseline for unsupervised keyphrase extraction based on pre-trained language model. IEEE Access 8 (2020), 10896–10906.
[70]
Yufei Tian, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Gunnar Sigurdsson, Chenyang Tao, Wenbo Zhao, Yiwen Chen, Tagyoung Chung, Jing Huang, et al. 2023. Unsupervised melody-to-lyrics generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 9235–9254.
[71]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 30.
[72]
Olga Vechtomova and Gaurav Sahu. 2023. LyricJam Sonic: A generative system for real-time composition and musical improvisation. In Proceedings of the International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar). Springer, 292–307.
[73]
Olga Vechtomova, Gaurav Sahu, and Dhruv Kumar. 2020. Generation of lyrics lines conditioned on music audio clips. arXiv:2009.14375. Retrieved from https://arxiv.org/abs/2009.14375
[74]
Olga Vechtomova, Gaurav Sahu, and Dhruv Kumar. 2021. LyricJam: A system for generating lyrics for live instrumental music. arXiv:2106.01960. Retrieved from https://arxiv.org/abs/2106.01960
[75]
Wanda T. Wallace. 1994. Memory for music: Effect of melody on recall of text. Journal of Experimental Psychology: Learning, Memory, and Cognition 20, 6 (1994), 1471.
[76]
Jie Wang and Xinyan Zhao. 2019. Theme-aware generation model for Chinese lyrics. arXiv:1906.02134. Retrieved from https://arxiv.org/abs/1906.02134
[77]
Kento Watanabe, Yuichiroh Matsubayashi, Satoru Fukayama, Masataka Goto, Kentaro Inui, and Tomoyasu Nakano. 2018. A melody-conditioned lyrics language model. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 163–172.
[78]
Kento Watanabe, Yuichiroh Matsubayashi, Kentaro Inui, and Masataka Goto. 2014. Modeling structural topic transitions for automatic lyrics generation. In Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing, 422–431.
[79]
Xing Wu, Zhikang Du, Yike Guo, and Hamido Fujita. 2019. Hierarchical attention based long short-term memory for Chinese lyric generation. Applied Intelligence 49, 1 (2019), 44–52.
[80]
Jingjing Xu, Xuancheng Ren, Yi Zhang, Qi Zeng, Xiaoyan Cai, and Xu Sun. 2018. A skeleton-based model for promoting coherence among sentences in narrative story generation. arXiv:1808.06945. Retrieved from https://arxiv.org/abs/1808.06945
[81]
Lanqing Xue, Kaitao Song, Duocai Wu, Xu Tan, Nevin L Zhang, Tao Qin, Wei-Qiang Zhang, and Tie-Yan Liu. 2021. DeepRapper: Neural rap generation with rhyme and rhythm modeling. arXiv:2107.01875. Retrieved from https://arxiv.org/abs/2107.01875
[82]
Kevin Yang, Nanyun Peng, Yuandong Tian, and Dan Klein. 2022. Re3: Generating longer stories with recursive reprompting and revision. arXiv:2210.06774. Retrieved from https://arxiv.org/abs/2210.06774
[83]
Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 7378–7385.
[84]
Yi Yu, Abhishek Srivastava, and Simon Canales. 2021. Conditional LSTM-GAN for melody generation from lyrics. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 1 (2021), 1–20.
[85]
Rongsheng Zhang, Xiaoxi Mao, Le Li, Lin Jiang, Lin Chen, Zhiwei Hu, Yadong Xi, Changjie Fan, and Minlie Huang. 2022. Youling: An AI-assisted lyrics creation system. arXiv:2201.06724. Retrieved from https://arxiv.org/abs/2201.06724
[86]
Yuxiang Zhang, Tao Jiang, Tianyu Yang, Xiaoli Li, and Suge Wang. 2022. HTKG: Deep keyphrase generation with neural hierarchical topic guidance. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1044–1054.
[87]
Kun Zhao, Hongwei Ding, Kai Ye, and Xiaohui Cui. 2021. A transformer-based hierarchical variational autoencoder combined hidden Markov model for long text generation. Entropy 23, 10 (2021), 1277.
[88]
Wei Zhao, Michael Strube, and Steffen Eger. 2022. Discoscore: Evaluating text generation with BERT and discourse coherence. arXiv:2201.11176. Retrieved from https://arxiv.org/abs/2201.11176
[89]
Xu Zou, Da Yin, Qingyang Zhong, Hongxia Yang, Zhilin Yang, and Jie Tang. 2021. Controllable generation from pre-trained language models via inverse prompting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2450–2460.

A Lyrics Generation by ChatGPT

A.1 Generated from Existing Lyrics

The conversation in Figure A1 shows ChatGPT generating lyrics for songs that already exist in its training data. With a query template like “Generate lyrics from the song called <SONG NAME>.”, ChatGPT’s outputs are of high quality but often contain traces of adaptation from the original lyrics. This could raise copyright issues and is not comparable to our task of generating lyrics without reference lyrics. ChatGPT also copies the song’s structure, whereas our KeYric model can be flexibly applied to new MIDI file inputs.
Fig. A1.
Fig. A1. The conversation with ChatGPT generating lyrics from existing songs.

A.2 Generated from Syllable Templates

The conversation in Figure A2 shows ChatGPT generating lyrics from a syllable template, i.e., a sequence specifying the number of syllables in each line. The query used in ChatGPT is of the form “Generate lyrics of sentences having syllable numbers of 8,7,…,6.” Despite being given lyric examples annotated with syllable counts, ChatGPT does not follow this specific requirement and still produces lyrics with incorrect syllable counts (a simple automated check of this constraint is sketched after Figure A2). ChatGPT also always uses the same rigid verse-chorus structure, possibly because it favors common lyric structures.
Fig. A2.
Fig. A2. The conversation with ChatGPT generating lyrics from syllable templates.
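A syllable-template check of this kind can itself be automated. The sketch below uses a rough vowel-group heuristic to count syllables, a simplification of our own (a pronunciation dictionary such as CMUdict would be more accurate), purely to illustrate the constraint that ChatGPT fails to satisfy.

import re

def count_syllables(word: str) -> int:
    """Very rough heuristic: count groups of consecutive vowels;
    a pronunciation-dictionary lookup would be more accurate in practice."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def check_template(lines: list[str], template: list[int]) -> list[bool]:
    """True where a line's (approximate) syllable count matches the template."""
    counts = [sum(count_syllables(w) for w in re.findall(r"[a-zA-Z']+", line))
              for line in lines]
    return [c == t for c, t in zip(counts, template)]

print(check_template(["The morning sun is rising slow"], [8]))  # [True]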

B Lyrics Generation Samples from Compared Models

Fig. B1.
Fig. B1. Lyrics generated by different lyrics generation models from John Lennon’s song “Imagine,” with seed words sophisticated, know, and see. Seed words are denoted by green text, clauses by an arrow (→), conjunctions by underscores, and comments by a dotted box. Words that tie together related concepts, behaviors, and attributes are denoted by italic bold formatting.
