Article

Extracting Implicit User Preferences in Conversational Recommender Systems Using Large Language Models

by
Woo-Seok Kim
1,
Seongho Lim
2,
Gun-Woo Kim
1,* and
Sang-Min Choi
1,3,*
1
Department of Computer Science and Engineering, Gyeongsang National University, Jinju 52828, Republic of Korea
2
Digital Division, National Forensic Service, Wonju 26460, Republic of Korea
3
The Research Institute of Natural Science, Gyeongsang National University, Jinju 52828, Republic of Korea
*
Authors to whom correspondence should be addressed.
Mathematics 2025, 13(2), 221; https://doi.org/10.3390/math13020221
Submission received: 26 November 2024 / Revised: 2 January 2025 / Accepted: 6 January 2025 / Published: 10 January 2025

Abstract

Conversational recommender systems (CRSs) have garnered increasing attention for their ability to provide personalized recommendations through natural language interactions. Although large language models (LLMs) have shown potential in recommendation systems owing to their superior language understanding and reasoning capabilities, extracting and utilizing implicit user preferences from conversations remains a formidable challenge. This paper proposes a method that leverages LLMs to extract implicit preferences and explicitly incorporate them into the recommendation process. Initially, LLMs identify implicit user preferences from conversations, which are then refined into fine-grained numerical values using a BERT-based multi-label classifier to enhance recommendation precision. The proposed approach is validated through experiments on three comprehensive datasets: the Reddit Movie Dataset (8413 dialogues), Inspired (825 dialogues), and ReDial (2311 dialogues). Results show that our approach considerably outperforms traditional CRS methods, achieving a 23.3% improvement in Recall@20 on the ReDial dataset and a 7.2% average improvement in recommendation accuracy across all datasets with GPT-3.5-turbo and GPT-4. These findings highlight the potential of using LLMs to extract and utilize implicit conversational information, effectively enhancing the quality of recommendations in CRSs.

1. Introduction

Conversational recommender systems (CRSs) offer personalized recommendations by engaging in direct conversations with users via conversational interfaces [1]. These systems typically utilize users' past behaviour data, explicit feedback, and information gathered during conversations to provide recommendations [1,2]. However, user needs are complex and dynamic, presenting a formidable challenge in effectively understanding and adapting to such preference patterns [3]. In particular, understanding user preferences in CRSs involves two distinct challenges: detecting explicit requirements clearly stated by users and interpreting the ambiguous or implicit preferences embedded within the dialogue. For example, when a user directly states “I want to watch action movies”, this represents an explicit genre preference. By contrast, when a user mentions “I enjoyed The Dark Knight for its intense storyline”, they are implicitly expressing a preference for psychological thrillers or complex narratives without explicitly stating their preferred genres. Therefore, CRSs need to continuously identify and process both explicit and implicit preferences during conversations to provide truly personalized and relevant recommendations [3].
CRSs have undergone transformative improvements with the rapid advancements in machine learning and natural language processing (NLP) [4,5,6,7,8]. These systems have transformed from basic rule-based frameworks to sophisticated machine learning methodologies that provide a deep contextual understanding and enhanced personalization. CRSs integrate cutting-edge technologies, such as deep learning and transformer architectures, to comprehend complex conversational cues and dynamically adapt to user needs, enabling highly tailored and accurate recommendations [7,8]. Early machine learning-based CRSs employed collaborative and content-based filtering techniques to personalize recommendations [4,5] but had a limited ability to understand and generate human-like dialogues. The integration of deep-learning models, particularly recurrent neural networks (RNNs) and attention mechanisms, marked a pivotal achievement in CRS development [6]. These models enable systems to generate context-aware and dynamic responses, thereby improving the relevance and accuracy of recommendations. The introduction of transformer-based architectures such as BERT and GPT further revolutionized CRSs by enabling improved contextual understanding and natural language generation [7].
More recent research on CRSs has been focused on multi-turn dialogue management, knowledge graph integration, and reinforcement learning to enhance system adaptability and user satisfaction [8,9]. Multi-turn dialogue management enables CRSs to maintain context over extended conversations. Knowledge graphs provide structured and semantically rich information to refine recommendations [10], whereas reinforcement learning optimizes the ability of a system to balance exploration and exploitation in real-time interactions [11].
The components of CRSs have considerably evolved over time. Systems in the early stages of development used rule-based dialogue managers and simple retrieval-based recommendation engines [12,13]. The NLU capabilities were limited to keyword extraction. Intermediate systems incorporated collaborative filtering and deep learning models, such as RNNs. Dialogue management improved with attention mechanisms, enabling better context retention [14]. Modern systems feature transformer-based NLU models, multi-turn dialogue capabilities, and knowledge graph integration. Reinforcement learning optimizes system performance by balancing exploration and exploitation [15]. CRSs continue to advance by integrating multimodal data, addressing ethical challenges, and enhancing user experience in domains such as e-commerce, healthcare, and education.
CRSs leverage various types of information to provide accurate and personalized recommendations. User preference information, which includes explicit feedback, such as ratings or likes, and implicit feedback derived from user behaviour, such as browsing history, clicks, and purchase records, is a primary data source [16,17]. User demographic information, such as age, gender, and location, also plays a crucial role in tailoring recommendations [18].
Contextual data, which enable CRSs to adapt to the situational needs of the user, are another crucial type of information. This includes temporal context, such as time of day or seasonal trends, and conversational context, which ensures that recommendations remain relevant throughout multi-turn dialogues [19]. As mentioned earlier, knowledge graphs serve as an essential repository of structured domain-specific information, enriching recommendations with semantically linked data, and enabling more accurate suggestions [20].
CRSs also utilize content information regarding the recommended items. For instance, product descriptions, reviews, and multimedia content, such as images or videos, are incorporated into the recommendation process [21]. This information enhances the ability of a system to provide more detailed and appealing suggestions.
Furthermore, recent studies have introduced various approaches to enhance CRS capabilities. Fang et al. proposed a multi-agent framework for CRSs that leverages the advanced conversational abilities of large language models (LLMs). This framework enables multiple LLM-based agents to collaborate and dynamically adjust dialogue flows based on user feedback to deliver more accurate and personalized recommendations [22]. He et al. addressed the challenges of integrating LLMs into CRSs, particularly in controlling recommendation distributions. Their proposed “Reindex-Then-Adapt” framework combines the strengths of LLMs and traditional recommendation systems to improve both dialogue coherence and recommendation accuracy [23].
Li et al. conducted a comprehensive survey of holistic CRS approaches, emphasizing the importance of utilizing real-world conversational data. Their research highlighted three critical components—backbone language models, integration of external knowledge, and application of external guidance—all of which enhance the CRS performance in practical scenarios [24]. Recent studies have explored strategies to address bias in CRS datasets. Fairness and system reliability can be considerably improved by addressing selection and popularity bias via counterfactual data simulation and reinforcement learning [25].
Studies have also investigated parameter-efficient conversational models, focusing on reducing the computational overhead of CRSs while maintaining high performance [26]. Jung et al. [27] proposed a multi-task learning approach for a unified CRS by leveraging contextualized knowledge distillation to streamline interactions and improve multi-turn dialogue consistency. A variety of data sources have been used to enhance CRS capabilities, including large conversational datasets, real-time user interactions, and domain-specific knowledge bases.
Fang et al. emphasized the importance of incorporating user feedback loops to dynamically adapt recommendations, while He et al. leveraged structured content such as item metadata and user demographics to refine recommendation distributions. Li et al. advocated the integration of contextual information, such as temporal trends and conversational context, to enhance the relevance of multi-turn dialogue. Jung et al. focused on parameter efficiency by utilizing pretrained language models fine-tuned on domain-specific conversational data to balance performance and computational costs. Collectively, these approaches highlight the reliance on a diverse range of structured and unstructured data sources, including explicit user preferences, implicit behavioural signals, and multimodal content, such as images, videos, and product reviews.
Despite these advances, the effective utilization of LLMs for conversational recommendations presents formidable challenges. The primary difficulty lies in systematically extracting latent user preferences from unstructured dialogue sequences, particularly when converting implicit signals into explicit categorical preferences. Although LLMs can understand conversational contexts, the automated extraction and structuring of these preferences remains challenging. Another challenge is the quantification of the preferences. Even when categorical preferences are identified, existing approaches lack mechanisms to effectively quantify the relative importance of these preferences, often treating all identified preferences equally, resulting in suboptimal recommendation performance. These challenges necessitate an innovative framework that can explicitly structure the latent preferences and quantify their relative importance.
In this study, we propose a method that converts implicit user preferences within conversations into explicit preferences using LLMs. The proposed approach comprises two main components. First, we extracted user preferences from conversations and converted them into explicit categorical information using LLMs. Second, we used the extracted information as labels to train a multi-label classification model that quantifies categorical preferences. The classifier transforms the qualitative preferences into numerical values, thereby enabling precise recommendation matching. We reconstructed the original conversations by incorporating both categorical and quantitative preference information to create enhanced conversational contexts. This reconstruction renders implicit preferences explicit and measurable, enabling more accurate preference modelling. Our experimental results demonstrate that incorporating such numerical preference information significantly improves the recommendation accuracy. Furthermore, our findings show that the integration of LLMs with CRSs represents a crucial advancement in the development of user-centric recommendation systems, opening new possibilities for future research. The contributions of this paper are as follows:
  • We propose a method to explicitly extract implicit preferences within conversations.
  • We further suggest using the extracted preferences to design a multi-label model that quantifies categorical data.
  • We conduct comparative experiments using GPT models with a large number of parameters and open-source LLMs with relatively fewer parameters to evaluate the effectiveness of the proposed approach.
  • We demonstrate, through experimental results, that our proposed approach significantly enhances the performance of CRSs.
The remainder of this paper is organized as follows. Section 2 provides a literature review of related work. Section 3 presents our proposed methodology. Section 4 presents the experimental setup, results, and analysis. In Section 5, we conclude our research and outline potential directions for future work.

2. Related Work

Recent advancements in CRSs have increasingly incorporated LLMs to enhance performance and adaptability. These advancements also address challenges such as the cold-start problem and enable the application of CRSs in diverse research domains such as healthcare, education, and multimedia personalization. Additionally, research has demonstrated the effective use of NLP technologies to improve the understanding and generation capabilities of CRSs [28,29,30]. Park et al. [31] demonstrated the utility of domain-specific LLMs in healthcare and education, showing significant improvements in the quality of context-specific recommendations. Similarly, Nguyen et al. [32] explored multimodal learning approaches for CRSs, leveraging textual and visual data to improve recommendation diversity and relevance in multimedia applications. Vinyals and Le [18] introduced a neural conversational model that leverages sequence-to-sequence architectures for open-domain recommendations. Mikolov et al. [26] proposed word-embedding techniques that are foundational for semantic understanding in CRSs. Furthermore, Radford et al. [23] advanced language generation capabilities using generative pretraining models, which form the basis for conversational flow in modern CRSs. These studies demonstrated the foundational role of NLP in enabling CRSs to process complex conversational data, adapt dynamically, and deliver precise recommendations. They also highlighted the ability of LLMs to process complex user intentions, maintain multi-turn dialogue coherence, and provide context-aware recommendations. Recent research can be categorized into two major domains based on the technical methodologies employed and the information or features utilized.
Technical innovations in CRSs often revolve around adopting and optimizing neural architectures, particularly transformer-based models, to improve performance. These studies highlight the role of advanced neural architectures in making CRSs more adaptive, scalable, and efficient, while addressing the limitations of traditional systems. By integrating such architectures, CRSs have demonstrated improved handling of context-aware interactions and dynamic user requirements, as highlighted by Kim et al. [29].
The second category focuses on feature-driven advancements, in which external knowledge and contextual signals are integrated to enhance CRS capabilities. These studies emphasize the importance of utilizing external data sources, such as knowledge graphs and multimodal information, to expand the contextual understanding and applicability of CRSs. Nguyen et al. [30] demonstrated the utility of multimodal learning by combining visual and textual data to enhance the richness and diversity of recommendations.
In addition to these studies, a growing body of work has utilized the LLMs that have emerged in recent years. LLMs offer transformative capabilities for CRSs, including the ability to interpret conversational contexts and discern user intentions. These systems excel at analyzing linguistic nuances and contextual signals. Table 1 shows related studies that utilize transformer models, feature-driven advancements, and LLMs.
Despite these advances, several unresolved challenges remain. Current methods cannot systematically convert implicit conversational signals into explicit preferences and quantify their relative importance. Existing frameworks often fail to fully utilize the rich conversational data available to infer nuanced user preferences, particularly when dealing with ambiguous or incomplete user input. Furthermore, although LLMs excel at language understanding, integrating these capabilities into CRSs for real-time scalable applications remains complex. For instance, real-time personalization requires the efficient handling of computational overhead, which can limit the practicality of LLMs in resource-constrained environments [31].
Another significant challenge involves ensuring fairness and minimizing biases in recommendations, primarily because LLMs trained on large uncurated datasets are prone to replicating societal biases [32]. In addition, incorporating multimodal signals such as textual, visual, and behavioural data into CRSs remains an open research area, with current systems often favouring textual information at the expense of other valuable modalities. Addressing these limitations requires interdisciplinary approaches, including advancements in multimodal learning, fairness-aware algorithms, and scalable model architectures.

3. Proposed Approach

In this section, we propose a method that explicitly utilizes the implicit preferences of users. Figure 1 illustrates the overall process of the proposed approach. In this example, we used the movie domain. First, we extracted the preferred user genres implicitly expressed in conversations using an LLM. We then trained a multi-label classifier using the user conversations as input and extracted genres as labels. When the trained classifier receives the conversation as input, it outputs the predicted genre labels. These labels are then explicitly added at the end of the conversation, which is input into the LLM along with a prompt for movie recommendations. Finally, the LLM recommends a list of movies based on the prompt.
The proposed methodology entails three main steps. The first step is to extract the user preferences from the conversation. The second step is to train a multi-label classification model using the conversation and extracted preferences. The final step is to provide recommendations using the conversation and classification models.

3.1. Extraction of Implicit Preferences Within a Conversation

Figure 2a presents a method that leverages an LLM to extract implicit preferences within a conversation. We define a user conversation as Conv and the preference information extracted by the LLM from Conv as $U_i$. Specifically, $U_i$ represents item features such as genres that are positively expressed by user i in Conv. For example, if user i reacts positively to a particular movie genre in a conversation, that genre can be considered as $U_i$.
To leverage the LLM for preference extraction, we configured prompts suitable for this task. Prompts were generated by combining content from the user conversation with instructions. The prompt is structured as shown in Figure 3. For instance, an instruction such as “you reply me with user’s genre preference within [Action, Adventure, Animation, …, Thriller, War, Western]” is included in the prompt along with the conversation (Conv). Based on this prompt, the LLM extracts the genres ($U_i$) that a user is likely to prefer. The movie genres referenced were based on the genre list from IMDb, which includes 25 genres.
This preference extraction process is designed to capture both explicit and implicit preferences. Moreover, the system can recognize the preference intensity through linguistic cues; strong positive expressions result in higher preference weights in the subsequent quantification step.
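To make this step concrete, the following is a minimal sketch of the extraction call; the OpenAI Python client, the model choice, and the exact instruction wording are illustrative assumptions rather than the verbatim setup used in our experiments.

```python
# Minimal sketch of implicit-preference extraction (Section 3.1).
# Assumptions: OpenAI Python client; instruction wording is illustrative.
from openai import OpenAI

INSTRUCTION = (
    "Based on the conversation, you reply me with user's genre preference "
    "within [Action, Adventure, Animation, ..., Thriller, War, Western]"
)

def extract_implicit_preferences(client: OpenAI, conv: str) -> str:
    """Return U_i: the genres the user implicitly prefers in Conv."""
    prompt = f"{INSTRUCTION}\n\nConversation:\n{conv}"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

client = OpenAI()  # reads OPENAI_API_KEY from the environment
u_i = extract_implicit_preferences(
    client, "I enjoyed The Dark Knight for its intense storyline."
)
print(u_i)  # e.g., "Thriller, Crime"
```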

3.2. Quantifying Extracted Preferences

Figure 2b depicts a method that uses a multi-label classifier to quantify the extracted preferences within a conversation. A multi-label classification model was created using conversation (Conv) as input and the preferred user genres ($U_i$) as labels. The architecture of the model is shown in Figure 4. We first utilized BERT [15] to embed the conversations. BERT embedding converts Conv into a vector form, rendering it understandable for the model. These embedded values pass through three linear layers, further refining the understanding of the conversations and extracting important features. Finally, the output is generated through a sigmoid layer with 25 dense units that represent the number of genres. This layer outputs the probability values for each genre to predict the preferred user genres.
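The sketch below illustrates this architecture, assuming the Hugging Face transformers library, bert-base-uncased weights, and [CLS] pooling; the widths of the intermediate linear layers are illustrative, as they are not specified above.

```python
# Minimal sketch of the multi-label genre classifier (Figure 4).
# Assumptions: bert-base-uncased, [CLS] pooling, illustrative layer widths.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

NUM_GENRES = 25  # IMDb genre list used for the labels

class GenreClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size  # 768
        # Three linear layers refine the conversation embedding.
        self.mlp = nn.Sequential(
            nn.Linear(hidden, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, NUM_GENRES),
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]     # [CLS] embedding of Conv
        return torch.sigmoid(self.mlp(cls))  # one probability per genre

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("I enjoyed The Dark Knight for its intense storyline.",
                return_tensors="pt", truncation=True)
probs = GenreClassifier()(enc["input_ids"], enc["attention_mask"])  # shape (1, 25)
```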
Labels with predicted probabilities exceeding the threshold define the quantified genre preferences of user i, denoted $P_i$. We suppose that quantifying the extracted preferences helps to clearly understand user preferences by allowing their features to be compared. Therefore, $P_i$ is designed to enable a quantitative comparison by adding a numerical variable, i.e., the probability, to the categorical values of $U_i$. For example, if the preferred user genres are predicted to be [Romance, Comedy], the corresponding values can be included in the conversation to numerically indicate how much the user prefers these genres.
We explicitly add $P_i$ to Conv to utilize $P_i$ in the CRS. For instance, it can be added as follows: “My favorite genres are [Comedy: 0.9814, Romance: 0.8694]”. This restructured conversation is termed $Conv+P_i$, indicating the original conversation (Conv) to which the preferred user genres ($U_i$) and their prediction probabilities are explicitly added. This enables the system to understand the extent to which each preference is reflected.
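A minimal sketch of this quantification-and-augmentation step is shown below; the 0.5 threshold and the truncated genre list are illustrative assumptions, as the threshold value is not fixed above.

```python
# Sketch: threshold the classifier output to form P_i and append it to Conv.
# Assumptions: 0.5 threshold; genre list truncated for brevity.
GENRES = ["Action", "Adventure", "Animation", "Comedy", "Romance"]  # ... 25 total

def quantify_preferences(probs, threshold=0.5):
    """Keep genres whose predicted probability exceeds the threshold (P_i)."""
    return {g: float(p) for g, p in zip(GENRES, probs) if p > threshold}

def augment_conversation(conv, p_i):
    """Produce Conv+P_i by stating the quantified preferences explicitly."""
    prefs = ", ".join(f"{g}: {p:.4f}" for g, p in
                      sorted(p_i.items(), key=lambda kv: -kv[1]))
    return f"{conv}\nMy favorite genres are [{prefs}]"

probs = [0.03, 0.12, 0.08, 0.9814, 0.8694]  # example classifier output
conv = "User: I loved When Harry Met Sally. Any suggestions?"
conv_plus_p = augment_conversation(conv, quantify_preferences(probs))
# -> "...\nMy favorite genres are [Comedy: 0.9814, Romance: 0.8694]"
```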

3.3. Recommendation Process

Finally, we construct a prompt to be used as input for the LLM. The prompt contains instructions and the conversation augmented with the preferred user genres. The instructions comprise sentences designed to instruct the LLM to recommend the top-k movies, followed by the user conversation. The conversation also includes the predicted preferred user genres ($P_i$). Using this prompt, the LLM can recommend the top-k movies based on user conversations that include explicit preferences.
Prompts for movie recommendations are shown in Figure 5. For example, the instructions specify the role of the LLM as a recommendation system tasked with recommending 20 movies, along with details such as the number of recommendations. In addition, $Conv+P_i$ includes information from the previous process, such as [Comedy: 0.9814, Romance: 0.8694], and the user requirements. In response to this prompt, the LLM recommends 20 movies that are slightly more focused on comedy than romance.
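The sketch below assembles such a recommendation prompt; the instruction wording is illustrative rather than the verbatim template of Figure 5.

```python
# Sketch: build the top-k recommendation prompt from Conv+P_i (Figure 5).
# The instruction wording is illustrative.
def build_recommendation_prompt(conv_plus_p: str, k: int = 20) -> str:
    instruction = (
        f"You are a movie recommendation system. Recommend the top {k} movies "
        "the user would enjoy, based on the conversation below."
    )
    note = ("The parentheses at the end of the conversation are in the format "
            "'User's preferred genre: value'. The higher the value, the more "
            "preferred the genre.")
    return f"{instruction}\n{note}\n\n{conv_plus_p}"

# Using conv_plus_p from the previous sketch:
prompt = build_recommendation_prompt(
    "User: I loved When Harry Met Sally. Any suggestions?\n"
    "My favorite genres are [Comedy: 0.9814, Romance: 0.8694]"
)
```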
Algorithm 1 illustrates the proposed approach: the algorithm comprises three main functions: extractImplicitPreferences, trainMultiLabelClassifier, and recommendItem, each of which is described below.
Algorithm 1 Extracting Implicit Preferences
Require:
Conv = Conversations
Prm1 = Prompt for extracting implicit preferences
Prm2 = Prompt for recommending item
LLM(input, Prm) = The output produced by the prompt and input through the LLM
Train(D, L) = Train a multi-label classifier using D as input and L as labels
Pred(D) = Predict user preferences using a trained multi-label classifier with D as input
1:  function extractImplicitPreferences(Conv, Prm1)
2:        return LLM(Conv, Prm1)
3:  end function
4:  function trainMultiLabelClassifier(Conv, Prm1)
5:        Uc = list
6:        for c in Conv do
7:              Uc.append(extractImplicitPreferences(c, Prm1))
8:        end for
9:        Train(Conv, Uc)
10:  end function
11:  function recommendItem(Conv, Prm2)
12:        Reclist = list
13:        for c in Conv do
14:              Conv2 = c + Pred(c)
15:              Reclist.append(LLM(Conv2, Prm2))
16:         end for
17:         return Reclist
18:  end function
  • Extracting Implicit Preferences: The function extractImplicitPreferences (lines 1–3) uses a prompt-based approach to infer user preferences implicitly embedded in conversations. By passing conversation data (Conv) and a predefined prompt (Prm1) to the LLM, the algorithm extracts nuanced user preferences that are not explicitly stated. This step enables the system to effectively handle implicit information that is often present in real-world dialogues.
  • Training the Multi-Label Classifier: The function trainMultiLabelClassifier (lines 4–10) builds a multi-label classifier to predict user preferences. For each conversation in Conv, implicit preferences are extracted using extractImplicitPreferences, and the resulting user preference labels (Uc) are collected. These labelled preferences are then used to train the classifier (Train), creating a model that can predict user preferences based on conversational data. This step enhances the ability of the system to generalize across various conversations.
  • Generating Recommendations: The function recommendItem (lines 11–18) generates personalized recommendations by combining the predicted user preferences with the original conversations. For each conversation, the algorithm appends the predicted preferences (Pred(c)) to the conversation data (c) to form an enriched input (Conv2). This augmented input is then passed through the LLM, along with a recommendation prompt (Prm2), to produce a list of recommendations. The final recommendation list (Reclist) is returned as output.
The proposed approach integrates the robust language understanding of LLMs with the predictive processes of machine learning classifiers. It effectively balances contextual understanding and scalability by combining implicit preference extraction and multi-label classification. Furthermore, the use of prompt engineering for preference extraction and recommendation generation ensures flexibility and adaptability to diverse conversational scenarios.

4. Experiments

We conducted comprehensive experiments using the GPT-3.5-turbo, GPT-4, LLaMA3, and Mistral models with multiple CRS datasets. We used the LLaMA model ‘llama3-8b’ and the Mistral model ‘mistral-7b v0.3’. We utilized these datasets to evaluate the effectiveness of our proposed method by comparing the recommendation performance using three types of conversational data: original conversations (Conv), conversations with categorical preferences ($Conv+U_i$), and conversations with quantified preferences ($Conv+P_i$).

4.1. Experimental Setup

4.1.1. Datasets and Evaluation Metrics

We used three representative CRS datasets: Reddit-movie [7], ReDial [16], and Inspired [17]. Although these datasets are invaluable for understanding real-world conversational recommendation systems, they require varying levels of preprocessing to ensure consistency and relevance to the purpose of our approach. The preprocessing steps for each dataset are outlined in Table 2.
Table 3 presents the statistics of the datasets after preprocessing, highlighting their diverse scales and characteristics. The preprocessing steps ensured that each dataset was suitably prepared for training and evaluating conversational recommendation systems, addressing common problems such as noise, missing data, and imbalance.
We utilized the metrics Recall@K, NDCG@K, and MRR@K to evaluate performance in our experiments [18]. We set the value of K as 1, 5, 10, and 20. These metrics are widely used indicators to evaluate the performance of recommendation systems, measuring the accuracy of the model at each K value.
  • Recall@K: Recall@K is a metric used to evaluate the fraction of relevant items successfully retrieved within the top K recommendations. |Relevant Items| represents the total number of relevant items for a given query or user. |Retrieved Items@K| denotes the number of items retrieved in the top K positions. The formula for Recall@K is given by
    $$\mathrm{Recall@}K = \frac{|\mathrm{Relevant\ Items} \cap \mathrm{Retrieved\ Items@}K|}{|\mathrm{Relevant\ Items}|},$$
  • NDCG@K is a measure of the ranking quality that considers the position of relevant items in the recommended list, where higher-ranked relevant items contribute more to the score. $rel_i$ indicates the relevance score of the item at position i in the retrieved list. This score is often binary (e.g., 1 if relevant and 0 otherwise). Moreover, i indicates the position index in the ranked list. K refers to the cutoff rank position, up to which the evaluation is performed. IDCG@K is the ideal DCG@K, which represents the maximum possible DCG score up to position K and is used for normalization. The formula for NDCG@K is as follows:
    $$\mathrm{DCG@}K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)},$$
    $$\mathrm{IDCG@}K = \sum_{i=1}^{K} \frac{rel_i^{opt}}{\log_2(i+1)},$$
    $$\mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K},$$
  • MRR@K is the average reciprocal rank of the first relevant item in the top K recommendations. It provides insight into how quickly the first relevant item is retrieved. |Q| denotes the total number of queries (users). $rank_i$ indicates the rank position of the first relevant item in the top K results for the i-th query. The formula for MRR@K is as follows:
    $$\mathrm{MRR@}K = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i},$$
Using these metrics, the performance of the movie recommendation lists generated by each LLM can be quantitatively evaluated.
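For reference, a minimal sketch of these metrics is given below, assuming binary relevance and a best-first ranked list; mrr_at_k returns the per-query reciprocal rank, which is then averaged over the |Q| queries as in the formula above.

```python
# Minimal sketch of the evaluation metrics, assuming binary relevance and a
# best-first ranked list; names and structure are illustrative.
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of relevant items that appear in the top K recommendations."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """DCG@K normalized by the ideal DCG@K (binary relevance)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / idcg if idcg else 0.0

def mrr_at_k(recommended, relevant, k):
    """Reciprocal rank of the first relevant item in the top K (one query);
    MRR@K averages this value over the |Q| queries."""
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            return 1.0 / (i + 1)
    return 0.0

# Example: one user whose relevant items are {"Inception", "Heat"}.
recs = ["Se7en", "Inception", "Up", "Heat"]
print(recall_at_k(recs, {"Inception", "Heat"}, 4))  # 1.0
print(mrr_at_k(recs, {"Inception", "Heat"}, 4))     # 0.5
```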

4.1.2. Baselines

The baseline for the experiment was to recommend movies that the user might prefer based on Conv using the LLM. The LLM uses a prompt containing Conv and instructions to recommend 20 movies that the user might prefer. In the comparative experiment, Conv was converted into $Conv+U_i$ and $Conv+P_i$, and the LLM recommended 20 movies in the same manner. For $Conv+U_i$, as shown in Figure 6, $U_i$ is obtained through the LLM without using a multi-label classifier. Additionally, the prompt that converts Conv to $Conv+P_i$ includes the phrase “The parentheses at the end of the conversation are in the format ‘User’s preferred genre: value’. The higher the value, the more preferred the genre”.

4.1.3. Implementation Details

A multi-label classifier was set up using $U_i$ generated by the LLM as labels and Conv as inputs to create models for each LLM. The data were split into 80% for training and 20% for validation. We trained the classifier until both the training and validation losses fell below 0.10, thereby ensuring robust preference quantification. In the experiments using GPT, the hyperparameters were set with a ‘temperature’ value of 0 and a ‘max_tokens’ value of 512. For the experiments using LLaMA and Mistral, the ‘temperature’ value was set to 0. The messages for all experiments were formatted as ‘[{“role”: “user”, “content”: Prompt}]’, where the prompt consisted of the instructions and conversation as described above.
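The following sketch outlines this training setup, reusing the GenreClassifier from the Section 3.2 sketch; the loss, optimizer, learning rate, batch size, and epoch cap are illustrative assumptions, and the random tensors merely stand in for the encoded conversations and the $U_i$ labels.

```python
# Sketch of the 80/20 training loop with the <0.10 loss criterion.
# Assumptions: BCE loss, Adam, batch size 16; dummy tensors stand in for data.
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

input_ids = torch.randint(0, 30522, (100, 128))           # encoded conversations
attention_masks = torch.ones(100, 128, dtype=torch.long)
labels = torch.rand(100, 25).round()                      # multi-hot U_i labels

dataset = TensorDataset(input_ids, attention_masks, labels)
n_train = int(0.8 * len(dataset))                         # 80/20 split
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

model = GenreClassifier()                                 # from the Section 3.2 sketch
optim = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = torch.nn.BCELoss()  # independent per-genre probabilities

for epoch in range(50):  # cap epochs; stop once both losses fall below 0.10
    model.train()
    batch_losses = []
    for ids, mask, y in DataLoader(train_set, batch_size=16, shuffle=True):
        optim.zero_grad()
        loss = loss_fn(model(ids, mask), y)
        loss.backward()
        optim.step()
        batch_losses.append(loss.item())
    train_loss = sum(batch_losses) / len(batch_losses)

    model.eval()
    with torch.no_grad():
        val_batches = list(DataLoader(val_set, batch_size=16))
        val_loss = sum(loss_fn(model(i, m), y).item()
                       for i, m, y in val_batches) / len(val_batches)
    if train_loss < 0.10 and val_loss < 0.10:
        break
```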

4.2. Experimental Results and Analysis

The experimental results for each model are summarized in Table 4. For the Reddit-movie dataset, we observed different patterns across the four LLMs. With GPT-3.5-turbo, $Conv+U_i$ outperformed the baseline Conv in early recommendation metrics (Recall@1, NDCG@1, and MRR@1). However, the performance advantage of $Conv+U_i$ diminished as the K values increased, with Conv occasionally exhibiting superior performance. Interestingly, $Conv+P_i$ demonstrated relatively lower performance than both Conv and $Conv+U_i$ across most metrics. When GPT-4 was used on the same dataset, the results exhibited a more consistent pattern. Both $Conv+U_i$ and $Conv+P_i$ consistently outperformed the baseline Conv across all evaluation metrics. This improvement is particularly noticeable in the ranking quality metrics (NDCG@K), suggesting that the proposed method helps better order recommendations when using more advanced language models.
When using the LLaMA model, $Conv+P_i$ performs best on all evaluation metrics, followed by $Conv+U_i$. In the case of the Mistral model, $Conv+P_i$ excels when the K values are 1, 5, and 10, followed by $Conv+U_i$. However, when K is 20, $Conv+U_i$ shows better results in Recall@K and NDCG@K. In the experiments with the open-source LLMs, the proposed method outperformed Conv on all evaluation metrics.
In the Inspired dataset with the GPT models, we observed a different pattern, where the baseline Conv achieved the best performance, followed by $Conv+P_i$. This dataset contains relatively few dialogues and items compared with the others and thus covers a smaller variety of user preferences. Nevertheless, $Conv+P_i$ consistently demonstrates better performance than $Conv+U_i$ across all metrics, indicating the value of preference quantification even in smaller datasets.
For LLaMA, $Conv+U_i$ performed best when K values were 1, 5, and 10, while at a K value of 20, both Conv and $Conv+U_i$ outperformed $Conv+P_i$. Overall, the metrics showed inconsistent results. When using the Mistral model, Conv performed best at a K value of 1, and as K increased, $Conv+P_i$ demonstrated superior performance.
The ReDial dataset provided the most compelling evidence for the effectiveness of our approach. $Conv+P_i$ considerably outperforms the other settings on all metrics, achieving up to a 23.3% improvement in Recall@20 compared with the baseline. The performance advantage is consistent across different K values and evaluation metrics, suggesting a robust improvement in both recommendation coverage and ranking quality. Additionally, we observe that Conv typically ranks second best in most cases, outperforming $Conv+U_i$ but falling short of $Conv+P_i$.
When using LLaMA, $Conv+P_i$ excelled at a K value of 1, and $Conv+U_i$ showed superior results thereafter. Overall, both $Conv+U_i$ and $Conv+P_i$ significantly outperformed Conv. In the case of the Mistral model, $Conv+P_i$ was superior across all metrics, followed by $Conv+U_i$.
A consistent pattern across all datasets indicates the superior performance of the GPT-4 model compared with GPT-3.5-turbo. Additionally, the GPT models with more parameters perform better than the open-source models with fewer parameters. This performance gap is particularly pronounced in more complex conversation scenarios and larger datasets, suggesting that more advanced language models are better equipped to handle the nuances of preference extraction and utilization. A comparison with existing studies is presented in Section 4.2.1.

4.2.1. Comparison with Existing Studies

Recent studies have explored the application of LLMs in CRSs. Gao et al. proposed Chat-Rec, which transforms user profiles and past interactions into prompts for LLMs to generate recommendations using the MovieLens-100K dataset [33]. Although this method demonstrates the ability to leverage structured user profiles, it does not address the extraction of implicit preferences directly from conversations, which is the core focus of our study.
Similarly, Hou et al. evaluated LLMs as zero-shot rankers using the MovieLens-1M and Amazon Review datasets. Their findings highlighted the susceptibility of LLMs to biases, such as item popularity or position within prompts [34]. By contrast, our proposed method mitigates such biases by explicitly quantifying implicit preferences through multi-label classification, resulting in more consistent and context-aware recommendations.
Other studies such as Wang and Lim’s exploration of zero-shot next-item recommendations and Bao et al.’s TALLRec framework have shown limitations in matching the performance of traditional CRS models [35,36]. However, our approach demonstrated substantial improvements in Recall@20 (up to 23.3% on the ReDial dataset), suggesting that the integration of implicit preference extraction and quantification addresses the key limitations of prior LLM-based CRS methods.
Kang et al. and Zhang et al. emphasized the need to fine-tune LLMs to align them with the recommendation tasks [37,38]. Although these studies proposed generalization strategies for bridging performance gaps, they lack the mechanisms to systematically convert implicit signals into actionable data. By leveraging both LLMs for contextual understanding and classifiers for structured quantification, the proposed method effectively bridges this gap, enabling robust performance even on datasets with diverse user interactions.
Finally, our results on multiple datasets (ReDial, Reddit-movie, and Inspired) demonstrate the versatility of the proposed approach compared to existing works that focus on single-domain datasets, such as MovieLens or Amazon Reviews. This highlights the scalability and generalizability of the proposed framework across different conversational scenarios.
The results provide several key insights. First, adding quantitative values to categorical data using our proposed multi-label classifier generally improves recommendation performance more effectively than expressing preferences only categorically. This improvement is particularly evident in datasets with diverse user preferences, such as ReDial. Second, the effectiveness of our approach appears to be influenced by dataset characteristics; larger datasets with more varied preference expressions benefit more. The results of our experiments can be analyzed from three perspectives: the impact of dataset characteristics on the performance, the strength of LLMs in limited data environments, and implications for real-world applications.

4.2.2. Impact of Dataset Characteristics on Performance

The performance of the proposed approach was influenced by the characteristics of the datasets used. Specifically, the Inspired dataset, which is the smallest among the datasets, posed challenges owing to its limited number of dialogues and items. This contrasts with the ReDial dataset, where movie recommendations are often conditioned on the explicit mention of movie titles within dialogues rather than in-depth discussions of user preferences.
Despite these dataset-specific challenges, LLMs demonstrated remarkable adaptability. Unlike traditional CRSs, which rely heavily on explicit input, LLMs leverage their extensive pretrained knowledge and context comprehension capabilities. As observed in the study “Large Language Models as Zero-Shot Conversational Recommenders”, LLMs predominantly utilize content and conversational context to generate recommendations. This enables them to infer relevant information even when provided with minimal explicit input, such as a movie title.

4.2.3. Strengths of LLMs in Limited-Data Environments

Our findings align with those of previous studies that highlighted the superior context awareness of GPT-based LLMs. This characteristic enables LLMs to outperform existing CRS models in understanding and utilizing conversational nuances, thereby offering robust recommendations, even in datasets with limited scale or detail. For instance, in the Inspired dataset, the ability of the LLM to generalize from minimal data compensates for the inherent constraints of the dataset.

4.2.4. Implications for Real-World Applications

The use of LLMs in CRSs demonstrates considerable potential for scalability and generalization. As conversational datasets continue to grow and diversify, the ability of LLMs to adapt to various user interactions and contextual subtleties is expected to improve. This positions LLMs as a promising approach for addressing the challenges posed by limited-data scenarios, while achieving consistent performance across different datasets.

5. Conclusions

This study explored the potential of leveraging LLMs in CRSs to address the challenges of extracting and utilizing implicit user preferences. We demonstrated significant advancements in recommendation accuracy and contextual understanding using an innovative method that combines LLM-based preference extraction with a multi-label classification framework. Our experiments, conducted across three diverse datasets—Reddit-movie, ReDial, and Inspired—reveal the effectiveness of this approach in improving the CRS performance. The key contributions and findings of this study are summarized as follows:
  • Key Contributions
    a.
    Implicit preference modelling: Our study introduces a methodology for systematically extracting implicit preferences from conversational data using LLMs. By tailoring prompts to capture nuanced user intentions, our approach identifies preferences that are often overlooked by traditional CRS methods.
    b.
    Quantification of preferences: We propose a multi-label classification framework that quantifies these preferences and transforms categorical data into numerical representations. This quantification enables precise modelling of user preferences and enhances the ability to generate personalized recommendations.
    c.
    Comprehensive evaluation: Through rigorous experiments with GPT-3.5-turbo, GPT-4, LLaMA, and Mistral, our approach achieved a 23.3% improvement in Recall@20 on the ReDial dataset and consistent performance gains across all datasets. This evaluation highlights the scalability and adaptability of the proposed methodology to datasets with diverse user preferences.
    d.
    Dataset-specific insights: Analysis of the dataset characteristics reveals that larger, more diverse datasets, such as ReDial, benefit most from our approach, whereas smaller datasets, such as Inspired, demonstrate modest gains, emphasizing the adaptability of our framework to varying data conditions.
    e.
    LLM effectiveness: Our findings confirm the superior performance of GPT-4 over GPT-3.5-turbo, particularly in scenarios requiring deeper contextual understanding and preference modelling, underscoring the critical role of advanced LLMs in CRS. Furthermore, the advantage of larger-parameter GPT models is evident as they outperform open-source models with fewer parameters.
  • Key Findings
    a.
    Performance gains: The integration of quantified user preferences consistently outperformed the reliance on raw conversational data and improved the ranking metrics across diverse datasets.
    b.
    Dataset-dependent performance: Larger datasets with richer user interactions exhibited significant performance gains, whereas smaller datasets benefited less, highlighting the need for dataset-specific adaptations.
    c.
    Scalability challenges: Although effective, the computational overhead of integrating LLMs and classification frameworks poses challenges for real-time applications, particularly in resource-constrained environments.
    d.
    Versatility of advanced LLMs: GPT-4 outperformed GPT-3.5-turbo and open-source models in all experiments, showing its capability to understand nuanced user intents and improve recommendation outcomes.
Despite the significant contributions and findings of this study, several areas of improvement remain. First, our research primarily focused on the movie recommendation domain. Further investigation is necessary to validate the generalizability of our methodology across other domains, such as music, books, and food. Expanding the application scope will ensure the robustness and versatility of the approach for addressing diverse recommendation scenarios.
The scalability and computational efficiency are also key challenges. The integration of LLMs with multi-label classifiers introduces computational overhead, posing difficulties in real-time deployment in resource-constrained environments. Developing lightweight architectures and employing fine-tuning techniques such as knowledge distillation and model pruning could help address these challenges. Similarly, adaptive methods that dynamically balance accuracy and efficiency, potentially using reinforcement learning, can enhance the resource allocation in real-time CRS deployment.
Furthermore, combining LLMs with domain-specific models to create hybrid architectures offers a promising avenue for enhancing adaptability and addressing the specific challenges inherent to different domains. Ethical considerations also warrant attention, as biases present in LLM training datasets must be addressed to ensure the fairness and reliability of the CRS recommendations. Addressing these limitations and exploring these directions will pave the way for the development of a more sophisticated, efficient, and scalable LLM-based CRS capable of delivering high-quality recommendations across diverse real-world applications.

Author Contributions

Conceptualization, W.-S.K.; methodology, W.-S.K. and S.-M.C.; software, W.-S.K., S.L. and G.-W.K.; formal analysis, S.-M.C. and G.-W.K.; investigation, W.-S.K., S.-M.C., S.L. and G.-W.K.; data curation, S.-M.C.; writing—original draft, W.-S.K. and S.-M.C.; supervision, G.-W.K.; project administration, S.-M.C. and G.-W.K.; funding acquisition, S.-M.C. and G.-W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the research grant of Gyeongsang National University in 2024, the “Leaders in Industry-university Cooperation 3.0” Project supported by the Ministry of Education and National Research Foundation of Korea (NRF), the “Regional Innovation Strategy (RIS)” through the NRF funded by the Ministry of Education (MOE) (2021RIS-003), the NRF grant funded by the Korea government (MSIT) (No. RS-2022-00165785), and the National Forensic Service (NFS2024DTB03) by the Ministry of the Interior and Safety, Republic of Korea.

Data Availability Statement

The datasets used and/or analyzed during the current research are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Christakopoulou, K.; Radlinski, F.; Hofmann, K. Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’16), San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
  2. Sun, Y.; Zhang, Y. Conversational recommender system. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR’18), Ann Arbor, MI, USA, 8–12 July 2018; pp. 235–244. [Google Scholar]
  3. Zhang, G. User-Centric Conversational Recommendation: Adapting the Need of User with Large Language Models. In Proceedings of the 17th ACM Conference on Recommender Systems (RecSys’23), New York, NY, USA, 18–22 September 2023; pp. 1349–1354. [Google Scholar]
  4. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2023, 15, 1–45. [Google Scholar] [CrossRef]
  5. Wu, L.; Zheng, Z.; Qiu, Z.; Wang, H.; Gu, H.; Shen, T.; Qin, C.; Zhu, C.; Zhu, H.; Liu, Q.; et al. A Survey on Large Language Models for Recommendation. arXiv 2023, arXiv:2305.19860. [Google Scholar] [CrossRef]
  6. Wei, W.; Ren, X.; Tang, J.; Wang, Q.; Su, L.; Cheng, S.; Wang, J.; Yin, D.; Huang, C. LLMRec: Large Language Models with Graph Augmentation for Recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM’24), Merida, Mexico, 4–8 March 2024; pp. 806–815. [Google Scholar]
  7. He, Z.; Xie, Z.; Jha, R.; Steck, H.; Liang, D.; Feng, Y.; Majumder, B.P.; Kallus, N.; McAuley, J. Large Language Models as Zero-Shot Conversational Recommenders. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM’23), New York, NY, USA, 21–25 October 2023; pp. 720–730. [Google Scholar]
  8. Farshidi, S.; Rezaee, K.; Mazaheri, S.; Rahimi, A.H.; Dadashzadeh, A.; Ziabakhsh, M.; Eskandari, S.; Jansen, S. Understanding User Intent Modeling for Conversational Recommender Systems: A Systematic Literature Review. User Model. User-Adapt. Interact. 2024, 34, 1643–1706. [Google Scholar] [CrossRef]
  9. Zou, L.; Xia, L.; Du, P.; Zhang, Z.; Bai, T.; Liu, W.; Nie, J.; Yin, D. Pseudo Dyna-Q: A reinforcement learning framework for interactive recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM’20), Houston, TX, USA, 3–7 January 2020; pp. 816–824. [Google Scholar]
  10. Deng, Y.; Li, Y.; Sun, F.; Ding, B.; Lam, W. Unified Conversational Recommendation Policy Learning via Graph-Based Reinforcement Learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’21), New York, NY, USA, 11–15 July 2021; pp. 1431–1441. [Google Scholar]
  11. Yang, F.; Chen, Z.; Jiang, Z.; Cho, E.; Huang, X.; Lu, Y. PALR: Personalization Aware LLMs for Recommendation. arXiv 2023, arXiv:2305.07622. [Google Scholar]
  12. Sanner, S.; Balog, K.; Radlinski, F.; Wedin, B.; Dixon, L. Large Language Models are Competitive Near Cold-start Recommenders for Language-and Item-based Preferences. In Proceedings of the 17th ACM Conference on Recommender Systems, Singapore, 18–22 September 2023; pp. 890–896. [Google Scholar]
  13. Agrawal, S.; Trenkle, J.; Kawale, J. Beyond Labels: Leveraging Deep Learning and LLMs for Content Metadata. In Proceedings of the 17th ACM Conference on Recommender Systems (RecSys’23), New York, NY, USA, 18–22 September 2023; pp. 74–77. [Google Scholar]
  14. Zhao, Z.; Fan, W.; Li, J.; Liu, Y.; Mei, X.; Wang, Y.; Li, Q. Recommender Systems in the Era of Large Language Models (LLMs). arXiv 2023, arXiv:2307.02046. [Google Scholar] [CrossRef]
  15. Xiao, Y.; Zhang, H.; Liu, Y.; Sun, F.; Zhou, K. Enhancing Conversational Recommendation Systems with User and Item Embeddings. IEEE Trans. Knowl. Data Eng. 2023, 35, 3412–3426. [Google Scholar]
  16. Chen, Y.; Yang, F.; Wang, Y.; Wu, L.; Sun, Q. An Investigation into the Role of Context in CRS: A Multi-Contextual Approach. Inf. Sci. 2024, 658, 256–274. [Google Scholar]
  17. Mei, Z.; Huang, R.; Zhao, W.; Gao, Z. Towards Multi-turn Dialogue Consistency: Leveraging Contextual Cues in CRS. Expert Syst. Appl. 2023, 207, 117935. [Google Scholar]
  18. Vinyals, O.; Le, Q. A Neural Conversational Model. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 174–182. [Google Scholar]
  19. Ghazvininejad, M.; Brockett, C.; Chang, M.W.; Dolan, B.; Gao, J.; Yih, W.; Galley, M. Knowledge-Grounded Neural Conversation Models. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  20. Zhou, K.; Zhou, Y.; Song, Y.; Zhang, W. Towards Conversational Recommender Systems via User Feedback-Aware Reinforcement Learning. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM), Virtual, 19–23 October 2020; pp. 1445–1454. [Google Scholar]
  21. Hayati, S.A.; Kang, D.; Zhu, Q.; Shi, W.; Yu, Z. INSPIRED: Toward Sociable Recommendation Dialog Systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online, 16–20 November 2020; pp. 3618–3632. [Google Scholar]
  22. Li, R.; Kahou, S.E.; Schulz, H.; Michalski, V.; Charlin, L.; Pal, C.J. Towards Deep Conversational Recommendations. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18), Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
  23. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pretraining; OpenAI Technical Report. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 8 January 2025).
  24. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; pp. 427–431. [Google Scholar]
  25. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
  26. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  28. Smith, A.; Brown, J.; Taylor, P. Enhancing Conversational Recommendation Systems with Sequence-to-Sequence Learning. J. Artif. Intell. Res. 2024, 73, 1045–1062. [Google Scholar]
  29. Kim, H.; Park, S.; Lee, J. A Transformer-Based Approach for Context-Aware Conversational Recommenders. Expert Syst. Appl. 2024, 215, 119678. [Google Scholar]
  30. Nguyen, M.; Tran, D.; Pham, T. Multi-modal Learning for CRS: Leveraging Visual and Textual Data. IEEE Trans. Multimed. 2023, 26, 455–470. [Google Scholar]
  31. Park, J.; Lee, H.; Choi, S. Domain-Specific Conversational Recommender Systems with LLMs: Case Studies in Healthcare and Education. Inf. Process. Manag. 2023, 61, 107489. [Google Scholar]
  32. Lee, K.; Kim, J.; Yoon, S. Real-Time Recommendation Optimization in CRS Using Reinforcement Learning and LLMs. ACM Trans. Recomm. Syst. 2024, 12, 567–583. [Google Scholar]
  33. Gao, Y.; Sheng, T.; Xiang, Y.; Xiong, Y.; Wang, H.; Zhang, J. Chat-Rec: Towards Interactive and Explainable LLMs-Augmented Recommender Systems. arXiv 2023, arXiv:2303.14524. [Google Scholar]
  34. Hou, Y.; Zhang, J.; Lin, Z.; Lu, H.; Xie, R.; McAuley, J.; Zhao, W.X. Large Language Models are Zero-Shot Rankers for Recommender Systems. In Proceedings of the Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, 24–28 March 2024; Proceedings, Part II. Springer: Berlin/Heidelberg, Germany, 2024; pp. 364–381. [Google Scholar]
  35. Wang, L.; Lim, E.-P. Zero-Shot Next-Item Recommendation using Large Pretrained Language Models. arXiv 2023, arXiv:2304.03153. [Google Scholar]
  36. Bao, K.; Zhang, J.; Zhang, Y.; Wang, W.; Feng, F.; He, X. TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems (RecSys’23), Singapore, 18–22 September 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1007–1014. [Google Scholar]
  37. Kang, W.-C.; Ni, J.; Mehta, N.; Sathiamoorthy, M.; Hong, L.; Chi, E.; Cheng, D.Z. Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction. arXiv 2023, arXiv:2305.06474. [Google Scholar]
  38. Zhang, J.; Xie, R.; Hou, Y.; Zhao, W.X.; Lin, L.; Wen, J.-R. Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach. arXiv 2023, arXiv:2305.07001. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed approach.
Figure 1. Overview of the proposed approach.
Mathematics 13 00221 g001
Figure 2. Framework of the proposed method (a): extraction of implicit preferences within conversation; (b): quantifying extracted preferences.
Figure 2. Framework of the proposed method (a): extraction of implicit preferences within conversation; (b): quantifying extracted preferences.
Mathematics 13 00221 g002
Figure 3. Prompt for the extraction of implicit preferences within conversation.
Figure 3. Prompt for the extraction of implicit preferences within conversation.
Mathematics 13 00221 g003
Figure 4. Architecture of the multi-label classification model.
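As a rough illustration of the component in Figure 4, the following is a minimal PyTorch sketch of a BERT-based multi-label classifier that maps an utterance to independent per-label preference scores. The backbone name (bert-base-uncased), the number of preference labels (18, e.g., one score per genre), and the single linear head are illustrative assumptions, not the authors' exact configuration; training such a head would typically use a binary cross-entropy loss over the per-label outputs.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiLabelPreferenceClassifier(nn.Module):
    """BERT encoder with a sigmoid multi-label head over preference attributes."""
    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 18):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]           # [CLS] token representation
        logits = self.classifier(self.dropout(cls))
        return torch.sigmoid(logits)                # independent per-label scores in [0, 1]

# Usage: score one utterance against the (hypothetical) 18 preference labels.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiLabelPreferenceClassifier(num_labels=18)
batch = tokenizer(["I enjoyed The Dark Knight for its intense storyline"],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(batch["input_ids"], batch["attention_mask"])
```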
Figure 5. Prompt for movie recommendations.
Figure 6. Example overview of the experiments with Conv+U_i.
Table 1. Related studies that utilize transformer models, feature-driven advancements, and LLMs.

Transformer-based models:
- Wei et al. [6] introduced LLMRec, which integrates transformer models with graph-based embeddings to better represent user–item interactions, considerably enhancing recommendation relevance and scalability.
- He et al. [7] proposed a zero-shot conversational recommender system that uses pretrained LLMs to provide recommendations without requiring domain-specific training, demonstrating the adaptability of LLMs across diverse scenarios.
- Zhao et al. [14] explored the application of LLMs in cold-start scenarios by leveraging cross-domain knowledge transfer and attention mechanisms to improve system robustness and user satisfaction.
- Ghazvininejad et al. [19] introduced knowledge-grounded neural models to improve the ability of CRSs to handle open-domain queries, demonstrating superior dialogue coherence and relevance in recommendations.
- Vaswani et al. [27] introduced the transformer architecture in "Attention is All You Need", which laid the foundation for many neural CRS advancements by enabling improved handling of long-term dependencies and contextual understanding.
- Smith et al. [28] explored sequence-to-sequence approaches to optimize conversational understanding and real-time recommendation generation, demonstrating improved user satisfaction.

Feature-driven advancements:
- Deng et al. [10] proposed a framework that combines conversational context with graph-based reinforcement learning, enabling systems to deliver personalized and contextually relevant recommendations.
- Mei et al. [17] focused on leveraging contextual cues from user interactions to achieve consistency in multi-turn dialogues, ensuring more coherent and satisfying user experiences.
- Li et al. [28] explored the integration of temporal and spatial signals into the CRS for dynamic context adaptation, improving recommendation precision over extended user interactions.
- Nguyen et al. [30] demonstrated the utility of combining visual and textual data to enhance recommendation diversity and richness, leveraging the ability of CRSs to utilize multimodal information.
- Kim et al. [29] explored transformer models that integrate temporal and contextual user signals to enhance the precision and relevance of conversational recommendations.

CRS with LLMs:
- Advanced context understanding: LLMs enable CRSs to process complex conversations and identify subtle emotional shifts and dynamic user needs [28].
- Personalized interaction modelling: studies have shown that integrating real-time user feedback with LLMs can significantly enhance recommendation accuracy and user satisfaction [29].
- Cold-start solutions: LLMs effectively address the cold-start problem by analyzing limited textual metadata and generating recommendations for new users or items [30].
Table 2. Outline of the preprocessing steps for each dataset.

Inspired
Description: Comprising 1001 human-to-human dialogues annotated with sociable recommendation strategies, this dataset emphasizes social-science-informed conversational strategies.
Preprocessing steps:
- Verified the completeness of annotated fields and excluded dialogues with missing or ambiguous sociable-strategy labels.
- Balanced the dataset by augmenting underrepresented sociable strategies using data augmentation techniques (e.g., paraphrasing).
- Removed duplicated dialogues to avoid bias during training and evaluation.
Purpose: These steps were required to ensure fair representation of all sociable strategies and to maintain the dataset's focus on engaging conversational styles.

ReDial
Description: This dataset contains over 10,000 human-to-human dialogues collected via Amazon Mechanical Turk, where participants were tasked with recommending movies to each other.
Preprocessing steps:
- Identified and corrected incomplete dialogues or those with placeholder text (e.g., "N/A").
- Handled missing metadata (e.g., movie titles, genres) by cross-referencing external movie databases (e.g., IMDb).
- Standardized dialogue formatting to facilitate consistent parsing during training.
Purpose: Minor inconsistencies were addressed to improve the quality of the dataset and maintain its integrity for model training and evaluation.

Reddit-Movie
Description: Derived from real user conversations on Reddit, the dataset captures users' natural expressions of preferences and personalized tendencies.
Preprocessing steps:
- Removed irrelevant or off-topic threads that did not pertain to movie recommendations.
- Filtered out conversations with fewer than three turns to ensure a meaningful dialogue structure.
- Normalized text by removing URLs, emojis, and special characters to maintain focus on the conversational content.
Purpose: These steps were essential to remove noise and enhance the utility of the dataset for conversational recommendation tasks. A sketch of this cleaning logic is given below.
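To make the Reddit-Movie steps concrete, the following is a minimal Python sketch of the turn filter and text normalization described in Table 2. The input format (each dialogue as a list of utterance strings) and the specific regular expressions are illustrative assumptions, not the exact pipeline used in the paper.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Conservative emoji/symbol ranges; a production pipeline may need a fuller set.
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_text(text: str) -> str:
    """Normalize one utterance: drop URLs, emojis, and special characters."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = re.sub(r"[^\w\s.,!?'-]", " ", text)   # strip remaining special characters
    return re.sub(r"\s+", " ", text).strip()

def preprocess(dialogues):
    """Apply the Reddit-Movie filters: cleaned utterances, >= 3 turns per dialogue."""
    kept = []
    for dialogue in dialogues:                   # dialogue: list of utterance strings
        turns = [clean_text(t) for t in dialogue]
        turns = [t for t in turns if t]          # drop turns emptied by cleaning
        if len(turns) >= 3:                      # filter out dialogues with fewer than three turns
            kept.append(turns)
    return kept
```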
Table 3. Statistics of datasets.

Dataset        Dialogs   Turns   Items
Inspired       825       2051    1548
ReDial         2311      9913    4216
Reddit-Movie   8413      9410    6504
Table 4. Performance comparison of models. The best results are highlighted in bold, while the second-best results are underlined.

Reddit-Movie

Model           Method     recall@1   ndcg@1     mrr@1      recall@5   ndcg@5     mrr@5      recall@10  ndcg@10    mrr@10     recall@20  ndcg@20    mrr@20
GPT-3.5-turbo   Conv       0.019955   0.019955   0.019955   0.070381   0.045299   0.037097   0.10662    0.05702    0.041931   0.13744    0.06493    0.044169
                Conv+U_i   0.02079    0.02079    0.02079    0.07248    0.04678    0.03837    0.103248   0.056663   0.04241    0.129659   0.06345    0.04433
                Conv+P_i   0.019906   0.019906   0.019906   0.069305   0.044663   0.03661    0.105497   0.056364   0.041437   0.135968   0.064197   0.043653
GPT-4.0         Conv       0.019906   0.019906   0.019906   0.069305   0.044646   0.03659    0.105497   0.056349   0.041417   0.136115   0.064219   0.043644
                Conv+U_i   0.019906   0.019906   0.019906   0.06994    0.045015   0.036868   0.106035   0.056685   0.041681   0.136604   0.064541   0.043904
                Conv+P_i   0.01996    0.01996    0.01996    0.06994    0.04502    0.03688    0.10608    0.05671    0.0417     0.1369     0.06463    0.04394
LLaMa3          Conv       0.009195   0.009195   0.009195   0.031155   0.020187   0.016602   0.047736   0.025502   0.018769   0.0583     0.028241   0.019556
                Conv+U_i   0.01614    0.01614    0.01614    0.053067   0.034484   0.028423   0.07625    0.041993   0.031527   0.106622   0.049613   0.033587
                Conv+P_i   0.017656   0.017656   0.017656   0.06011    0.039      0.03209    0.087499   0.047832   0.035719   0.109117   0.053388   0.037292
Mistral         Conv       0.010369   0.010369   0.010369   0.03541    0.022933   0.018851   0.05248    0.028437   0.021114   0.065783   0.031893   0.02211
                Conv+U_i   0.01433    0.01433    0.01433    0.045437   0.029834   0.024743   0.070625   0.037933   0.028058   0.095569   0.044258   0.029805
                Conv+P_i   0.015064   0.015064   0.015064   0.048763   0.031875   0.026361   0.073706   0.039938   0.029687   0.090482   0.044245   0.030902

Inspired

Model           Method     recall@1   ndcg@1     mrr@1      recall@5   ndcg@5     mrr@5      recall@10  ndcg@10    mrr@10     recall@20  ndcg@20    mrr@20
GPT-3.5-turbo   Conv       0.03968    0.03968    0.03968    0.09898    0.07007    0.06057    0.13012    0.08018    0.06477    0.14377    0.08375    0.0658
                Conv+U_i   0.033703   0.033703   0.033703   0.081911   0.058672   0.051017   0.107935   0.067122   0.054525   0.118601   0.069945   0.055367
                Conv+P_i   0.034556   0.034556   0.034556   0.090444   0.06344    0.054536   0.122014   0.073664   0.058766   0.134812   0.077067   0.059788
GPT-4.0         Conv       0.05247    0.05247    0.05247    0.1186     0.08676    0.07627    0.15102    0.09742    0.08077    0.17065    0.10265    0.08235
                Conv+U_i   0.045648   0.045648   0.045648   0.102816   0.075083   0.06597    0.133532   0.08509    0.070145   0.146331   0.088477   0.071154
                Conv+P_i   0.047782   0.047782   0.047782   0.108788   0.079623   0.069994   0.138652   0.089288   0.07399    0.153157   0.093116   0.075125
LLaMa3          Conv       0.020478   0.020478   0.020478   0.063567   0.043722   0.037102   0.085751   0.050805   0.039972   0.119454   0.059316   0.042305
                Conv+U_i   0.025597   0.025597   0.025597   0.069113   0.048126   0.041197   0.091724   0.055468   0.044249   0.099829   0.057622   0.044894
                Conv+P_i   0.022611   0.022611   0.022611   0.062713   0.042895   0.036391   0.086604   0.050717   0.039676   0.09215    0.052195   0.040123
Mistral         Conv       0.030717   0.030717   0.030717   0.066126   0.048447   0.042662   0.09215    0.056922   0.0462     0.115188   0.062892   0.047917
                Conv+U_i   0.018771   0.018771   0.018771   0.056314   0.037565   0.031421   0.075085   0.043517   0.0338     0.091297   0.047621   0.034938
                Conv+P_i   0.02901    0.02901    0.02901    0.069539   0.049799   0.043309   0.102816   0.060441   0.047633   0.133959   0.068247   0.049741

ReDial

Model           Method     recall@1   ndcg@1     mrr@1      recall@5   ndcg@5     mrr@5      recall@10  ndcg@10    mrr@10     recall@20  ndcg@20    mrr@20
GPT-3.5-turbo   Conv       0.035445   0.035445   0.035445   0.099548   0.068206   0.057902   0.139744   0.081122   0.063185   0.176923   0.090731   0.065931
                Conv+U_i   0.028431   0.028431   0.028431   0.085143   0.056976   0.047754   0.122851   0.069061   0.052679   0.160483   0.078741   0.05542
                Conv+P_i   0.04261    0.04261    0.04261    0.11908    0.0814     0.06903    0.1687     0.09737    0.07558    0.21456    0.10908    0.07885
GPT-4.0         Conv       0.038688   0.038688   0.038688   0.11546    0.077825   0.065476   0.161011   0.092539   0.071537   0.213876   0.106065   0.075329
                Conv+U_i   0.041101   0.041101   0.041101   0.108899   0.07555    0.064613   0.154827   0.090354   0.070692   0.204299   0.102952   0.074195
                Conv+P_i   0.04796    0.04796    0.04796    0.12624    0.08813    0.07559    0.181      0.10568    0.08273    0.23341    0.119      0.08643
LLaMa3          Conv       0.024133   0.024133   0.024133   0.066139   0.045536   0.038766   0.094872   0.054811   0.042583   0.121041   0.061549   0.044497
                Conv+U_i   0.029412   0.029412   0.029412   0.103469   0.066754   0.05473    0.152036   0.082543   0.061295   0.192006   0.092717   0.064121
                Conv+P_i   0.031222   0.031222   0.031222   0.094646   0.063353   0.053089   0.139216   0.077736   0.059009   0.186652   0.089775   0.062338
Mistral         Conv       0.027602   0.027602   0.027602   0.081297   0.054711   0.046008   0.112293   0.064774   0.050184   0.146983   0.073663   0.052685
                Conv+U_i   0.029035   0.029035   0.029035   0.083635   0.056307   0.047382   0.116817   0.067022   0.051797   0.147059   0.074707   0.053928
                Conv+P_i   0.032881   0.032881   0.032881   0.097662   0.06513    0.054516   0.138537   0.078418   0.060041   0.175415   0.087761   0.062615
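For reference, the recall@k, ndcg@k, and mrr@k columns in Table 4 can be computed per evaluation turn and then averaged over the test set. The sketch below assumes a single ground-truth item per turn, a common CRS evaluation setting; under that assumption all three metrics coincide at k = 1, which is consistent with the identical @1 columns in Table 4.

```python
import math

def rank_metrics(ranked_items, gold_item, k):
    """Recall@k, NDCG@k, and MRR@k for a single ground-truth item."""
    top_k = ranked_items[:k]
    if gold_item not in top_k:
        return {"recall": 0.0, "ndcg": 0.0, "mrr": 0.0}
    rank = top_k.index(gold_item) + 1            # 1-based position of the hit
    return {
        "recall": 1.0,                           # hit within the cutoff
        "ndcg": 1.0 / math.log2(rank + 1),       # one relevant item, so DCG/IDCG
        "mrr": 1.0 / rank,
    }

# Example: gold item ranked 3rd -> recall@5 = 1.0, ndcg@5 = 0.5, mrr@5 = 0.333...
print(rank_metrics(["a", "b", "gold", "c", "d"], "gold", k=5))
```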