Publisher’s Note
The corresponding author was changed to Elisabeth Lex and the statement referring to the TU Graz Open Access Publishing Fund was added on 14/04/2020.
1. Introduction
Music recommender systems play a pivotal role in popular streaming platforms such as Last.fm, Pandora, or Spotify to help users find music that suits their taste. Existing music recommender systems typically employ collaborative filtering algorithms based on the users’ interactions with music items (i.e., listening behavior or ratings), sometimes in combination with content features (e.g., acoustic features of songs) in the form of hybrid music recommender systems.
Problem. While music recommender systems can provide quality recommendations to listeners of popular music, related research has shown that they tend to fail listeners who prefer niche artists and genres. One reason is the scarcity of usage data for such music, as music consumption patterns are biased towards popular artists. In this paper, we introduce a novel user modeling and genre prediction approach for users with different music consumption patterns and listening habits. We focus on three user groups: (i) LowMS, i.e., listeners of niche music, (ii) HighMS, i.e., listeners of mainstream (MS) music, and (iii) MedMS, i.e., listeners of music that lies in-between. The main problem we address in this work is how to exploit variations in listening habits to improve personalization for all three user groups. We investigate this problem by predicting the music genres a user is going to listen to in the future.
Approach and methods. We model the users’ listening behavior in terms of fine-grained music genre preferences. To that end, we use behavioral data in the form of listening events, i.e., the history of which genres a user has listened to in the past. Our approach is based on the Base-Level Learning (BLL) equation from the cognitive architecture ACT-R, which accounts for the time-dependent decay of item exposure in human memory. It quantifies the usefulness of a piece of information based on how frequently and recently a user accessed it in the past. This time-dependent decay takes the shape of a power-law distribution. Related work has employed the BLL equation to recommend Web links, scientific talks at conferences, tags in social bookmarking systems, and hashtags.
In this work, we build upon these results and adopt the BLL equation to model the listening habits of users in our three groups to predict their music genre preferences. We demonstrate the efficacy of our approach on the LFM-1b dataset, which contains listening histories of more than 120,000 Last.fm users, amounting to 1.1 billion individual listening events over nine years. The music in this dataset is categorized according to a fine-grained taxonomy that consists of 1,998 music genres and styles. Additionally, the dataset contains demographic data such as age and gender as well as a “mainstreaminess” factor that relates the listening preferences of each user to the aggregated preferences of all Last.fm users in the dataset. Based on this factor, we assign the users in our dataset to one of the three groups, i.e., (i) LowMS, (ii) MedMS, and (iii) HighMS. This allows us to evaluate our proposed BLLu approach for different types of users.
Contributions and findings. The contributions of our work are two-fold. Firstly, we propose the BLLu approach for modeling popularity and temporal drift of music genre preferences. Secondly, we evaluate BLLu on three different groups of Last.fm users, which we separate based on the distance of their listening behavior to the mainstream: (i) LowMS, (ii) MedMS, and (iii) HighMS.
We find that for all three groups, BLLu provides the highest accuracy for predicting music genre preference, compared to five baselines: (i) group-based modeling (i.e., TOP), (ii) user-based collaborative filtering (i.e., CFu), (iii) item-based collaborative filtering (i.e., CFi), (iv) frequency-based modeling (i.e., POPu), and (v) recency-based modeling (i.e., TIMEu). Moreover, BLLu gives the highest accuracy improvements for the LowMS group. Finally, we also validate our findings in a cold-start setting, in which we only evaluate users with a small number of listening events. Here, we also find that our BLLu approach provides the best prediction accuracy results.
Structure of this paper. This paper is organized as follows: In Section 2, we review related work, and in Section 3, we describe the dataset as well as statistical analyses about genre mainstreaminess, popularity, and temporal drift of music genre preferences. Also, this section includes the methodology and the proposed approach for modeling music genre preferences. In Section 4, we present the experimental setup as well as the evaluation results. Finally, Section 5 concludes this paper and gives an outlook into future work.
2. Related Work
At present, we identify three strands of related research: (i) research on music preferences in light of psychology, (ii) temporal dynamics of music preferences, and (iii) personalization for music recommendation.
Research on music preferences in light of psychology. Research in music psychology has shown that a range of factors impact music preferences, such as emotional state, a user’s current activity, their self-view and self-esteem, the cognitive functions of music (e.g., music as a way to communicate and to self-reflect), as well as personality.
For instance, prior work showed that the Big Five personality traits (i.e., openness to experience, agreeableness, extraversion, neuroticism, and conscientiousness) influence genre preferences in music and that music preferences can be categorized along specific dimensions (e.g., reflective & complex, intense & rebellious, upbeat & conventional, and energetic & rhythmic music); the structure of music preferences has also been discussed in related work. Another study found that a person’s cognitive approach (i.e., their tendency towards empathy versus systemizing versus balancing both) impacts their music genre preferences. A user’s music preference is also impacted by familiarity. This has been attributed to the so-called mere exposure effect, which means that prior exposure can positively influence music liking. In our work, we also incorporate prior exposure (in this case, to a music genre) into our model.
Temporal dynamics of music preferences. Music preferences are often dynamic due to variations in user taste or evolving music taste. One can distinguish between research on long-term temporal dynamics of listening behavior and research on short-term dynamics. Studies of long-term dynamics examine, for example, how music preferences of children and young adults evolve, or how user tastes change over time and how artists develop.
Studies investigating short-term dynamics typically assess users’ listening behaviors on a fine-granular basis (e.g., time of the day) to detect patterns and periodicity in listening behavior or, in one case, to study the relationship between music preferences and seasons of the year. The latter approaches are typically intended to help create predictive models of music preferences, for example to generate playlist recommendations for music streaming services. As we describe in detail in Section 3, in our data, we observe interesting temporal dynamics in users’ genre listening histories. Specifically, the time-dependent decay of the number of plays per genre follows a power-law distribution, so our users tend to listen to genres to which they have recently listened.
Personalization for music recommendation. A number of aspects make personalization in music recommender systems challenging, such as the variability of listening intent and purpose of music consumption, insufficient ratings and usage data, users’ tendency to appreciate recommendations of items that have been previously recommended, and the dependence of music preferences on the user’s personality traits or emotional state. In this vein, one approach extracted the user’s emotional context from social media messages as well as their current time context and incorporated both to generate personalized music recommendations. Another study used a specific personality-enriched dataset that provided links to users’ listening histories on Last.fm to leverage personality traits to predict a user’s genre preferences. A further approach proposed a tag-aware dynamic music recommendation framework that represents musical tracks via user-generated tags and generates time-sensitive recommendations. Yet another approach incorporated a temporal analysis of user ratings assigned to music pieces and item popularity trends into a matrix factorization approach to mitigate the issue of insufficient item ratings. The latter is a common problem that causes (music) recommender systems to suffer from bias towards popular items. Due to insufficient amounts of usage data for less popular items, many recommendation algorithms cannot provide useful recommendations for consumers of less popular and niche items. Recent work has, however, provided evidence that deep-learning-based methods (i.e., recurrent neural networks) seem to be less biased towards popular items.
In our work, we use only listening histories as a data source to model user preferences and to generate recommendations. As we show in Section 3, we observe that all users in our dataset tend to consume items they have listened to frequently and recently in the past, where the time-dependent decay of this item consumption count follows a power-law distribution. Correspondingly, the Base-Level Learning (BLL) equation from the cognitive architecture ACT-R describes a time-dependent decay of item exposure in human memory in the form of a power-law distribution. Leveraging these similarities between characteristics of music consumption patterns and cognition models (i.e., ACT-R in our case), we propose here to use the BLL equation to describe listeners’ behavioral music consumption traces.
3. Data and Method
In this section, we present the dataset we use for our study and statistical analyses we carry out. We outline the approach of this work and the baselines, which we employ to validate our proposed method.
3.1 Dataset and Statistical Analyses
First, we describe the Last.fm dataset, as well as the selected genre mapping procedure. We report statistical analyses for (i) music genre popularity, (ii) average pairwise user similarity, (iii) popularity of music genre preferences, and (iv) temporal drifts of music genre preferences.
Dataset description and availability. For our study, we use a dataset gathered from the online music service Last.fm, namely the LFM-1b dataset. LFM-1b contains listening histories of more than 120,000 users, totaling about 1.1 billion individual listening events accrued between January 2005 and August 2014. Each listening event is characterized by a user identifier, artist, album, track name, and a timestamp. In addition, the LFM-1b dataset contains user-specific demographic data such as country, age, and gender, as well as additional features such as mainstreaminess, which is defined as the overlap between the user’s listening history and the aggregated listening history of all Last.fm users in the dataset. More precisely, the mainstreaminess of a user corresponds to the average distance between all artists’ relative frequencies in the user’s listening profile and the artists’ relative frequencies among all users in the dataset.
Mapping listening events to music genres. Since we are interested in modeling and predicting music genre preferences, we enhance the listening events in the LFM-1b dataset with additional genre information. To this end, we use an extension of the LFM-1b dataset, termed the LFM-1b User-Genre-Profile (i.e., LFM-1b UGP) dataset, which describes the genres of an artist in a listening event by exploiting social tags from Last.fm.
Among others, LFM-1b UGP contains a weighted mapping of 1,998 music genres and styles available in the online database Freebase to Last.fm artists. In part, this taxonomy includes particular descriptors such as “Progressive Psytrance” or “Melodic Black Metal”, and therefore allows for a fine-grained representation of musical styles. The weightings correspond to the relative frequency of tags assigned to artists in Last.fm. For example, for the artist “Metallica” the top tags and their corresponding relative frequencies are “thrash metal” (1.0), “metal” (.91), “heavy metal” (.74), “hard rock” (.41), “rock” (.34) and “seen live” (.3). This means that the tag “thrash metal” is the most popular genre tag assigned to “Metallica” and thus, its weighting is 1.0. From this list, we remove all tags that are not part of the 1,998 Freebase genres (i.e., “seen live” in our example) as well as all tags with a relative frequency smaller than .5 (i.e., “hard rock” and “rock” in our example). Thus, for “Metallica”, we end up with three genres, namely “thrash metal”, “metal” and “heavy metal” that we assign to all listening events of the artist “Metallica”. Overall, this process gives us, on average, 2–3 genres per artist (i.e., mean = 2.466). Furthermore, 96.25% of the genres are assigned to more than one artist.
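The tag filtering described above can be illustrated with a minimal Python sketch. The function name `filter_artist_genres` and the small `freebase_genres` set are hypothetical stand-ins (the real taxonomy has 1,998 entries); the tag weights for “Metallica” are the ones quoted in the text.

```python
def filter_artist_genres(tag_weights, freebase_genres, min_weight=0.5):
    """Keep tags that are known genres and have a relative frequency >= min_weight."""
    return [
        tag
        for tag, weight in tag_weights
        if tag in freebase_genres and weight >= min_weight
    ]


# Top tags for "Metallica" with their relative frequencies, as listed in the text.
metallica_tags = [
    ("thrash metal", 1.0), ("metal", 0.91), ("heavy metal", 0.74),
    ("hard rock", 0.41), ("rock", 0.34), ("seen live", 0.3),
]

# Hypothetical stand-in for the Freebase genre list; only entries needed here.
freebase_genres = {"thrash metal", "metal", "heavy metal", "hard rock", "rock"}

metallica_genres = filter_artist_genres(metallica_tags, freebase_genres)
print(metallica_genres)  # ['thrash metal', 'metal', 'heavy metal']
```

“seen live” is dropped because it is not a genre, and “hard rock” and “rock” are dropped because their weights fall below .5, which leaves the three genres named in the text.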
User groups based on mainstreaminess. The LFM-1b dataset contains a mainstreaminess value for each user, which defines the distance from this user’s music genre preferences to the music genre preferences of the (Last.fm) mainstream. To study different types of users, we split the dataset into three equally sized groups based on their mainstreaminess (i.e., low, medium, and high). We sort the users in the dataset based on their mainstreaminess value and assign the 1,000 users with the lowest values to the LowMS group, the 1,000 users with the highest values to the HighMS group, and the 1,000 users with a value that lies around the average mainstreaminess (=.379) to the MedMS group.
Here, we consider only users with at least 6,000 and at most 12,000 listening events, a choice we made based on the average number of listening events per user in the dataset (i.e., 9,043) as well as the kernel density distribution of the data. With this method, on the one hand, we exclude users with too little data available for training our algorithms (i.e., users with <6,000 listening events), and on the other hand, we exclude so-called power listeners (i.e., users with >12,000 listening events) who might distort our results.
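One way to read the group construction is sketched below in Python. Several details are assumptions made for illustration and are not stated in the text: that the listening-event filter is applied before sorting, that “around the average” means the users closest to the mean mainstreaminess of the eligible users, and the tuple layout of a user record. The function name `select_groups` is hypothetical.

```python
def select_groups(users, group_size=1000, min_le=6000, max_le=12000):
    """users: list of (user_id, mainstreaminess, number_of_listening_events).

    Returns (low, med, high) lists of user records.
    """
    # Assumption: users outside the listening-event range are removed first.
    eligible = [u for u in users if min_le <= u[2] <= max_le]
    by_ms = sorted(eligible, key=lambda u: u[1])

    low = by_ms[:group_size]
    high = by_ms[-group_size:]

    # Assumption: MedMS = users closest to the mean mainstreaminess
    # that are not already part of LowMS or HighMS.
    taken = {u[0] for u in low + high}
    mean_ms = sum(u[1] for u in by_ms) / len(by_ms)
    remaining = [u for u in by_ms if u[0] not in taken]
    med = sorted(remaining, key=lambda u: abs(u[1] - mean_ms))[:group_size]
    return low, med, high


# Toy data: ten eligible users plus one user with too few listening events.
users = [(i, i / 10, 7000) for i in range(10)] + [(99, 0.45, 100)]
low, med, high = select_groups(users, group_size=2)
print([u[0] for u in low], sorted(u[0] for u in med), [u[0] for u in high])
```

With a group size of 2, the toy example puts the two least mainstream users into the low group, the two most mainstream users into the high group, and the two users nearest the mean into the medium group.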
Furthermore, this high average number of listening events per user also means that we have enough listening events (i.e., between 6.9 and 8.2 million) to train and test the music genre preference modeling and prediction approaches, even if we only consider 1,000 users per group. Table 1 summarizes the statistics and characteristics of these three groups.
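Table 1: Statistics of the three user groups (|U| = users, |A| = distinct artists, |G| = distinct genres, |LE| = listening events, |GA| = genre assignments).
User group | |U| | |A| | |G| | |LE| | |GA| | |GA|/|LE| | Avg. distinct genres per user | Avg. mainstreaminess | Avg. age |
LowMS | 1,000 | 82,417 | 931 | 6,915,352 | 14,573,028 | 2.107 | 85.771 | .125 | 24.582 |
MedMS | 1,000 | 86,249 | 933 | 7,900,726 | 20,264,870 | 2.565 | 126.439 | .379 | 25.352 |
HighMS | 1,000 | 92,690 | 973 | 8,251,022 | 22,498,370 | 2.727 | 186.010 | .688 | 21.486 |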
(i) LowMS. The LowMS group represents the |U| = 1,000 least mainstream users. They have an average mainstreaminess value of .125. This group contains |A| = 82,417 distinct artists, |LE| = 6,915,352 listening events, |G| = 931 genres and |GA| = 14,573,028 genre assignments.
(ii) MedMS. The MedMS group represents the |U| = 1,000 users whose mainstreaminess values are between the ones of the LowMS and HighMS groups (i.e., their mainstreaminess values lie around the average). This group has an average mainstreaminess value of .379. Most statistics of this group lie between those of the LowMS and HighMS users (for example, the number of genre assignments per listening event |GA|/|LE| = 2.565), except for the average age, which is the highest for the MedMS users (25.352 years).
(iii) HighMS. This group represents the |U| = 1,000 most mainstream users in the LFM-1b dataset (average mainstreaminess = .688). These users are not only the youngest ones (average age = 21.486 years) but also listen to the highest number of distinct genres on average (186.010). Also, this user group exhibits the highest number of distinct genres (|G| = 973).
Average pairwise user similarity. Finally, the boxplots in Figure 1 show the average pairwise user similarity in the three user groups. We calculate these scores based on the genre distributions of the users and using the cosine similarity metric. We see that users in the LowMS group have a very individual listening behavior (mean user similarity = .118), while users in the HighMS group tend to listen to similar music genres (mean user similarity = .691). Again, the users in the MedMS group lie in between (mean user similarity = .392). Given these results, we expect a collaborative filtering approach based on user similarities to deliver good genre prediction results for the HighMS group.
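The similarity scores discussed above can be approximated with the following Python sketch. It assumes that each user is represented by a dictionary of genre counts and that a user’s score is the mean cosine similarity to all other users of the same group; the exact aggregation behind Figure 1 is not specified in the text, and the function names are illustrative.

```python
import math
from itertools import combinations


def cosine(p, q):
    """Cosine similarity between two sparse genre-count dictionaries."""
    dot = sum(count * q.get(genre, 0) for genre, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def average_pairwise_similarity(profiles):
    """Per user, the mean cosine similarity to every other user (needs >= 2 users)."""
    sums = {user: 0.0 for user in profiles}
    for u, v in combinations(profiles, 2):
        s = cosine(profiles[u], profiles[v])
        sums[u] += s
        sums[v] += s
    n_others = len(profiles) - 1
    return {user: total / n_others for user, total in sums.items()}


# Toy group: "a" and "b" share a genre, "c" listens to something else entirely.
profiles = {"a": {"rock": 1}, "b": {"rock": 1}, "c": {"jazz": 1}}
avg = average_pairwise_similarity(profiles)
print(avg)
```

In the toy group, users with overlapping genre profiles obtain a higher average similarity than the user with an individual profile, which mirrors the contrast between the HighMS and LowMS groups.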
Popularity of music genre preferences. In Figure 2, we compare the music genre popularity distributions of the LowMS, MedMS, and HighMS groups. To this end, we plot the number of listening events for the groups’ top-30 genres. We find that there are some dominating genres with more than 2 million LE counts in the HighMS group, while the genres are much more evenly distributed in the LowMS group, with an LE count of around 500,000 for the most popular genres. We can describe the genre distribution of the MedMS group as intermediate between the LowMS and HighMS distributions. We also analyze the actual top-30 genres in these groups: while the most popular genres, Rock and Pop, dominate the other genres in the HighMS group (LE count of Rock = 2,269,861), Rock is not as dominant in the LowMS group (LE count of Rock = 685,998). Furthermore, we find several genres that are not popular in the MedMS and HighMS groups but are popular in the LowMS group, such as Ambient and Black Metal.
Based on the dataset characteristics, we expect that a group-based modeling approach, which models a user’s music genre preferences utilizing the most-frequently listened genres of all users in the group, performs fine for HighMS in relation to other modeling techniques, while for the LowMS group, a personalized modeling technique would be preferable. In the MedMS group, we expect both modeling approaches to work well due to the group being an intermediate of the HighMS and LowMS groups.
Temporal drift of music genre preferences. Next, we investigate the temporal drift of music genre preferences. The plots (a), (b), and (c) of Figure 3 show the effect of time on the genre listening behavior of our LowMS, MedMS, and HighMS user groups. We plot the relistening count of music genres over the time (in hours) since the last listening events of these genres on a log-log scale. For example, if a user u has listened to artists with genre g twice in a time interval of 1 hour, then the relistening count for “1 hour” is incremented by 1. We repeat this process for all listening events, which gives us a relistening count for each hour. We observe similar results for all three groups, which means that the shorter the time since the last listening event of a genre g, the higher its relistening count. In all three plots, we see a peak after 24 hours, which indicates that people tend to listen to similar music genres daily at the same time. However, we also see that when people have not listened to a genre for a longer period, i.e., one month (around 750 hours), the relistening count of this genre drastically drops.
Finally, we also plot the linear regression lines of the empirical data in the plots of Figure 3. In the log-log-scaled plots, we can observe a good fit of the data, which indicates that the data likely follows a power-law distribution. This claim is supported by the high R² values of the fits, which are between .870 and .895. Concerning the slopes α of the lines, which describe how strongly temporal listening drifts influence the user groups, we observe values between –1.480 and –1.587. We can use the magnitudes of these slopes as the d parameter of the BLL equation (cf. Equation 6).
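The relistening analysis can be sketched in Python as follows. This is an illustration under assumptions, not the original analysis code: timestamps are taken to be in hours, gaps are rounded to whole hours with a minimum of one hour, and the regression is an ordinary least-squares fit on base-10 logarithms. The names `relistening_counts` and `loglog_fit` are hypothetical.

```python
import math
from collections import Counter


def relistening_counts(events):
    """events: iterable of (user, genre, timestamp_in_hours).

    Returns a Counter mapping the gap in whole hours since the previous
    listening event of the same genre by the same user to its count.
    """
    counts = Counter()
    last_seen = {}
    for user, genre, t in sorted(events, key=lambda e: e[2]):
        key = (user, genre)
        if key in last_seen:
            gap = max(1, round(t - last_seen[key]))  # assumed hourly buckets
            counts[gap] += 1
        last_seen[key] = t
    return counts


def loglog_fit(counts):
    """Least-squares line through (log10 gap, log10 count): slope, intercept, R^2."""
    xs = [math.log10(gap) for gap in counts]
    ys = [math.log10(count) for count in counts.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r2


events = [("u", "rock", 0), ("u", "rock", 1), ("u", "rock", 25), ("u", "jazz", 3)]
counts = relistening_counts(events)

# Synthetic counts that follow count = 4096 * gap^(-1.5) exactly.
synthetic = Counter({1: 4096, 4: 512, 16: 64, 64: 8})
slope, intercept, r2 = loglog_fit(synthetic)
print(dict(counts), round(slope, 3), round(r2, 3))
```

For the synthetic power-law counts, the fitted slope recovers the exponent of –1.5, which is the kind of value whose magnitude would then serve as the decay parameter d.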
Taken together, we observe interesting temporal effects in all three user groups: Last.fm users tend to listen to genres they have listened to recently. Moreover, we find that this temporal drift of music genre preferences follows a power-law distribution. Correspondingly, we can model this drift with the BLL equation.
3.2 Modeling and Prediction of Music Genre Preferences
In this section, we describe five baseline approaches (i.e., TOP, CFu, CFi, POPu, and TIMEu) as well as our approach based on the BLL equation for modeling and predicting music genre preferences (i.e., BLLu).
Group-based baseline: TOP. Motivated by our analysis in Figure 2, the TOP approach models a user u’s music genre preferences using the overall top-k (e.g., top-30) genres of all users in the user group UGu (i.e., LowMS, MedMS, HighMS) to which u belongs. This is given by:
$$\widetilde{G}_u^k = \underset{g}{\arg\max^k} \left( |GA_{g,UG_u}| \right) \tag{1}$$
where argmaxk refers to the “arguments of the maxima” function for the top-k genres with maximum values, $\widetilde{G}_u^k$ denotes the set of k predicted genres for user u, and |GAg,UGu| corresponds to the number of times g occurs in all genre assignments GA of UGu. Thus, we describe this approach as a group-based modeling technique since it reflects the preferences of the whole user group LowMS, MedMS or HighMS. As our analysis in Figure 2 shows that the genre distribution in the HighMS group is the least evenly distributed one, we expect the TOP approach to provide good prediction accuracy results for the HighMS group while performing worse for the LowMS group in relation to other modeling techniques.
User-based collaborative filtering baseline: CFu. User-based collaborative filtering approaches aim to find similar users for a target user u, i.e., the set of neighbors Nu. Nu is calculated using the cosine similarity between u’s genre distribution and the genre distributions of all other users. The 20 most similar users are then defined as Nu. Finally, CFu predicts the genres these similar users in Nu have listened to, which is formally given by:
$$\widetilde{G}_u^k = \underset{g}{\arg\max^k} \left( \sum_{v \in N_u} sim(G_u, G_v) \cdot |GA_{g,v}| \right) \tag{2}$$
where sim(Gu, Gv) is the cosine similarity between the genre distributions of user u and neighbor v, and |GAg,v| indicates how often v has listened to genre g. Since CFu relies on user similarities, we expect it to provide good results for the HighMS group compared to other modeling approaches (see also Figure 1).
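A minimal Python sketch of the CFu scoring described above is given below: each neighbor’s genre counts are weighted by that neighbor’s cosine similarity to the target user and summed per genre. The data layout (a dictionary of genre counts per user) and the function name `predict_cfu` are assumptions for illustration; the actual implementation in the evaluation framework may differ.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse genre-count dictionaries."""
    dot = sum(count * q.get(genre, 0) for genre, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def predict_cfu(target, profiles, k=10, n_neighbors=20):
    """profiles: user -> {genre: listening count}. Returns the top-k genres."""
    sims = {
        v: cosine(profiles[target], profile)
        for v, profile in profiles.items()
        if v != target
    }
    # The n_neighbors most similar users form the neighborhood N_u.
    neighbors = sorted(sims, key=sims.get, reverse=True)[:n_neighbors]

    scores = {}
    for v in neighbors:
        for genre, count in profiles[v].items():
            scores[genre] = scores.get(genre, 0.0) + sims[v] * count
    return sorted(scores, key=scores.get, reverse=True)[:k]


profiles = {
    "u": {"rock": 3, "pop": 1},
    "v": {"rock": 2, "metal": 4},
    "w": {"jazz": 5},
}
cfu_prediction = predict_cfu("u", profiles, k=2)
print(cfu_prediction)  # ['metal', 'rock']
```

In the toy data, only user "v" has a non-zero similarity to "u", so the prediction is driven entirely by the genres "v" has listened to.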
Item-based collaborative filtering baseline: CFi. Similar to CFu, CFi is a collaborative filtering-based approach, but instead of finding similar users for the target user u, it aims to find similar items (i.e., music artists). It then predicts the genres that are assigned to these similar artists, as given by:
$$\widetilde{G}_u^k = \underset{g}{\arg\max^k} \left( \sum_{a \in A_u} \sum_{s \in S_a} sim(G_a, G_s) \cdot |GA_{g,s}| \right) \tag{3}$$
Here, Au is the set of artists u has listened to, Sa is the set of similar artists for an artist a, sim(Ga, Gs) is the cosine similarity between the genres assigned to a and the genres assigned to a similar artist s, and |GAg,s| indicates how often genre g was assigned to the similar artist s (hence, in our case either 0 or 1). Again, a neighborhood size |SAu| = 20 leads to the best genre prediction results, and we also set Au to the set of the 20 artists that u has listened to most frequently.
Frequency-based baseline: POPu. The POPu approach is a personalized music genre preference modeling technique, which predicts the k most frequently listened to (i.e., most popular) genres in the listening history of a user u. POPu is given by the following equation:
$$\widetilde{G}_u^k = \underset{g \in G_u}{\arg\max^k} \left( |GA_{g,u}| \right) \tag{4}$$
where Gu is the set of genres u has listened to and |GAg,u| denotes the number of times u has listened to tracks with genre g (i.e., the frequency). Thus, it ranks the genres u has listened to in the past by popularity. Therefore, in relation to other modeling algorithms, we expect POPu to generate good genre predictions for all users in our three user groups, but especially for HighMS, in which the popularity feature is the most important one (see Figure 2).
Recency-based baseline: TIMEu. Our analysis presented in Figure 3, where we find that people tend to listen to genres to which they have listened very recently, motivates a personalized and recency-based modeling of music genre preferences. Thus, TIMEu predicts the most recently listened to genres that are present in the listening history of a user u, which is given by:
$$\widetilde{G}_u^k = \underset{g \in G_u}{\arg\min^k} \left( t_{u,g,n} \right) \tag{5}$$
where tu,g,n is the time since the last (i.e., the nth) listening event of g by u. Since we find that the temporal drift of music genre preferences is an important feature for all our three user groups, TIMEu should provide good prediction accuracy results for LowMS, MedMS, and HighMS in relation to other modeling approaches.
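The three counting-based baselines (TOP, POPu, and TIMEu) can be sketched in a few lines of Python. The representation of a listening history as a list of (genre, timestamp) pairs and the function names are assumptions made for this illustration.

```python
from collections import Counter


def predict_top(group_histories, k):
    """TOP: the k most frequent genres over all users of the group."""
    counts = Counter(
        genre for history in group_histories.values() for genre, _ in history
    )
    return [genre for genre, _ in counts.most_common(k)]


def predict_pop_u(history, k):
    """POPu: the k genres the user has listened to most frequently."""
    counts = Counter(genre for genre, _ in history)
    return [genre for genre, _ in counts.most_common(k)]


def predict_time_u(history, k):
    """TIMEu: the k genres with the most recent listening event."""
    last = {}
    for genre, t in history:
        last[genre] = max(t, last.get(genre, t))
    # Most recent timestamp first, i.e. smallest time since the last event.
    return sorted(last, key=last.get, reverse=True)[:k]


history_u = [("rock", 1), ("rock", 2), ("rock", 3),
             ("pop", 4), ("jazz", 5), ("pop", 6)]
history_v = [("metal", 1), ("metal", 2), ("rock", 3)]

top_prediction = predict_top({"u": history_u, "v": history_v}, k=1)
pop_prediction = predict_pop_u(history_u, k=2)
time_prediction = predict_time_u(history_u, k=2)
print(top_prediction, pop_prediction, time_prediction)
# ['rock'] ['rock', 'pop'] ['pop', 'jazz']
```

The toy history shows how the two personalized baselines disagree: POPu favors the frequently played but older genre, while TIMEu favors whatever was played last.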
Our approach based on the BLL equation: BLLu. To combine the frequency-based modeling method POPu with the recency-based modeling method TIMEu, we utilize the BLL equation from the declarative memory module of the cognitive architecture ACT-R. The BLL equation quantifies the importance of information in human memory (e.g., a word or a music genre) by considering how recently (i.e., temporal drift) and frequently (i.e., popularity) it was used in the past. In our setting, we define it as follows:
$$B_{u,g} = \ln \left( \sum_{j=1}^{n} t_{u,g,j}^{-d} \right) \tag{6}$$
Here, g is a genre user u has listened to in the past, and n is the number of times u has listened to g. Further, tu,g,j is the time since the jth listening event of g by u, and d is the power-law decay factor that accounts for the feature of the temporal drift of music genre preferences.
We set d to the magnitudes of the slopes α identified in the analysis of Figure 3 (i.e., 1.480 for LowMS, 1.574 for MedMS, and 1.587 for HighMS). The resulting base-level activation values Bu,g are normalized using a simple softmax function in order to map them onto a range of [0,1] where they sum to 1:
$$B'_{u,g} = \frac{\exp(B_{u,g})}{\sum_{g' \in G_u} \exp(B_{u,g'})} \tag{7}$$
Again, Gu is the set of distinct genres listened to by u. Finally, BLLu predicts the top-k genres $\widetilde{G}_u^k$ with the highest B′u,g values for u:
$$\widetilde{G}_u^k = \underset{g \in G_u}{\arg\max^k} \left( B'_{u,g} \right) \tag{8}$$
Comparison of approaches. Table 2 shows how the five baselines, as well as BLLu, cover our four features of interest, i.e., (i) personalization, (ii) collaboration, (iii) popularity, and (iv) temporal drift.
Here, our BLLu approach is the only one that covers the features of personalization, popularity, and temporal drifts. Moreover, TOP, CFu, and CFi are the only approaches that consider collaboration among users and, thus, investigate the listening events of all users. We further examine which feature combination works best for predicting genres in our setting in the next section of this paper.
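The BLLu steps described in this section (base-level activation, softmax normalization, and top-k selection) can be sketched in Python as follows. The time unit (hours), the reference time `now`, and the function names are assumptions made for illustration; the sketch is not the implementation used in the experiments.

```python
import math


def bll_activations(history, now, d):
    """history: list of (genre, timestamp) with timestamp < now.

    Returns B_{u,g} = ln(sum_j t_j^(-d)), where t_j is the time since
    the j-th listening event of genre g.
    """
    sums = {}
    for genre, t in history:
        sums[genre] = sums.get(genre, 0.0) + (now - t) ** (-d)
    return {genre: math.log(total) for genre, total in sums.items()}


def softmax(activations):
    """Map activation values onto [0, 1] so that they sum to 1."""
    shift = max(activations.values())  # shifting keeps exp() numerically stable
    exps = {genre: math.exp(b - shift) for genre, b in activations.items()}
    norm = sum(exps.values())
    return {genre: value / norm for genre, value in exps.items()}


def predict_bll_u(history, now, d, k):
    """Top-k genres by normalized base-level activation."""
    scores = softmax(bll_activations(history, now, d))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy history (timestamps in hours): rock is frequent but old,
# pop was played once very recently, jazz once long ago.
history = [("rock", 1), ("rock", 2), ("rock", 3), ("jazz", 10), ("pop", 99)]
normalized = softmax(bll_activations(history, now=100, d=1.480))
bll_prediction = predict_bll_u(history, now=100, d=1.480, k=2)
print(bll_prediction)  # ['pop', 'rock']
```

The example uses the decay value reported for the LowMS group. The single recent play of pop outweighs three old plays of rock, and rock in turn outranks jazz because it combines a similar age with a higher frequency, which is the interplay of recency and popularity that the equation is meant to capture.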
4. Experiments and Results
In this section, we outline the experimental setup (see Section 4.1), and in Section 4.2, we present the results of our study evaluating the usefulness of the BLL equation for modeling music genre preferences.
4.1 Experimental Setup
To measure the accuracy of our music genre preference modeling approaches, we conduct a study, in which we predict the genres assigned to the artists a user is going to listen to in the future.
Evaluation protocol. We split the datasets into train and test sets and make sure that our evaluation protocol preserves the temporal order of the listening events, which simulates a real-world scenario in which we predict (genres of) future listening events based on past ones. This also means that a classic k-fold cross-validation evaluation protocol with random splits is not useful.
Therefore, we put the most recent 1% of the listening events of each user into the test set and keep the remaining listening events for training. We do not use a classic 80/20 or 90/10 split as the number of listening events per user is large (i.e., on average 7,689 per user). Furthermore, although we only use the most recent 1% of listening events per user, this process leads to three large test sets with 69,153 listening events for LowMS, 79,007 listening events for MedMS, and 82,510 listening events for HighMS. On average, there are 76 listening events per user for which we predict the assigned genres.
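A time-ordered split of this kind can be sketched in Python as shown below. The rounding rule for the 1% share and the guarantee of at least one test event per user are assumptions, since the text does not state them; the function name `temporal_split` is hypothetical.

```python
def temporal_split(events_by_user, test_fraction=0.01):
    """events_by_user: user -> list of (timestamp, artist).

    Puts each user's most recent events into the test set and keeps
    the earlier ones for training.
    """
    train, test = {}, {}
    for user, events in events_by_user.items():
        ordered = sorted(events, key=lambda e: e[0])
        # Assumed rounding; at least one event is held out per user.
        n_test = max(1, round(len(ordered) * test_fraction))
        train[user] = ordered[:-n_test]
        test[user] = ordered[-n_test:]
    return train, test


# Toy user with 200 listening events given in reverse chronological order.
events = {"u": [(t, f"artist{t}") for t in range(200, 0, -1)]}
train, test = temporal_split(events)
print(len(train["u"]), [t for t, _ in test["u"]])  # 198 [199, 200]
```

Because the events are sorted per user before splitting, every training event precedes every test event of the same user, which is the property the evaluation protocol requires.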
In Figure 4, we present boxplots showing the average duration in days per user we have available in our three test sets. We see that the average duration per user is evenly distributed across all three user groups with a median value of 11.8 days, which is also around 1% of the median value of the overall average duration per user (i.e., the sum of training and test durations). This corresponds to the 1% of the listening events per user we use for the test sets. Thus, we are going to predict the genres a user is going to listen to in this period.
Following this evaluation protocol, our goal is to validate whether our BLL-based approach (i.e., BLLu) provides better prediction accuracy results than the five baseline approaches (i.e., TOP, CFu, CFi, POPu, and TIMEu). When investigating the numbers shown in Table 1, we also see that our prediction task is not trivial since |GA|/|LE|, i.e., the number of genre assignments per listening event (=what should be predicted), is much smaller than the average number of distinct genres a user u has listened to (=what could be predicted).
Evaluation metrics. To measure the prediction quality of the approaches, we use the following six state-of-the-art metrics:
(i) Recall: R@k. Recall is calculated as the number of correctly predicted genres divided by the number of relevant genres (i.e., from the test set). It is a measure of the completeness of the predictions.
(ii) Precision: P@k. Precision is calculated as the number of correctly predicted genres divided by the number of predictions k and is a measure of the accuracy of the predictions. We report recall and precision for k = 1 … 10 predicted genres in the form of recall/precision plots.
(iii) F1-score: F1@5. F1-score is the harmonic mean of recall and precision. If 10 genres are predicted, the F1-score typically reaches its highest value for k = 5. Thus, we report it for k = 5.
(iv) Mean Reciprocal Rank: MRR@10. MRR is the mean of reciprocal ranks of all relevant genres in the list of predicted genres.
(v) Mean Average Precision: MAP@10. MAP is the mean of the average precision scores at all ranks where relevant genres are predicted. With this, it also takes the ranking of the correctly predicted genres into account.
(vi) Normalized Discounted Cumulative Gain: nDCG@10. nDCG is another ranking-dependent metric. It is based on the Discounted Cumulative Gain (DCG) measure.
We report MRR, MAP, and nDCG for k = 10 predicted music genres, where these metrics reach their highest values.
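The six metrics above can be illustrated for a single user with the following Python sketch. The exact definitions used by the evaluation framework are not given in the text, so several choices here are assumptions: binary relevance, normalization of the reciprocal-rank and average-precision sums by the number of relevant genres, and a log2 discount for DCG. Reported scores would be the mean of these per-user values over all test users.

```python
import math


def precision_recall_f1(predicted, relevant, k):
    """Precision, recall, and F1 for the first k predicted genres."""
    hits = sum(1 for genre in predicted[:k] if genre in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def reciprocal_rank(predicted, relevant, k):
    """Sum of 1/rank over correctly predicted genres, per relevant genre (assumed)."""
    total = sum(
        1.0 / rank
        for rank, genre in enumerate(predicted[:k], start=1)
        if genre in relevant
    )
    return total / len(relevant) if relevant else 0.0


def average_precision(predicted, relevant, k):
    """Mean of precision values at the ranks where a relevant genre appears."""
    hits, total = 0, 0.0
    for rank, genre in enumerate(predicted[:k], start=1):
        if genre in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg(predicted, relevant, k):
    """Binary-relevance DCG divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, genre in enumerate(predicted[:k], start=1)
        if genre in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(k, len(relevant)) + 1)
    )
    return dcg / ideal if ideal else 0.0


predicted = ["rock", "pop", "jazz", "metal"]
relevant = {"rock", "jazz"}
p, r, f1 = precision_recall_f1(predicted, relevant, k=4)
rr = reciprocal_rank(predicted, relevant, k=4)
ap = average_precision(predicted, relevant, k=4)
gain = ndcg(predicted, relevant, k=4)
print(p, r, round(f1, 3), round(rr, 3), round(ap, 3), round(gain, 3))
```

In the toy example, both relevant genres are found (recall of 1), but one of them only at rank 3, which lowers the rank-sensitive scores relative to a perfect ranking.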
Evaluation framework. For reasons of reproducibility, we conduct the prediction study using our recommendation benchmarking framework TagRec, which provides the evaluation protocol and metrics described in this section. Furthermore, we also implement the modeling approaches described in Section 3.2 using TagRec. It is freely available via our GitHub repository.
4.2 Results and Discussion
In this section, we report and discuss our prediction accuracy results on evaluating the usefulness of our BLL-based music genre preference modeling approach (i.e., BLLu) compared to five baseline approaches: (i) group-based modeling (i.e., TOP), (ii) user-based collaborative filtering (CFu), (iii) item-based collaborative filtering (CFi), (iv) frequency-based modeling (i.e., POPu), and (v) recency-based modeling (i.e., TIMEu).
Table 3 summarizes our evaluation results for the three user groups (i.e., LowMS, MedMS, and HighMS), the four evaluation metrics (i.e., F1@5, MRR@10, MAP@10, and nDCG@10) as well as the six approaches (i.e., TOP, CFu, CFi, POPu, TIMEu, and BLLu). Additionally, in Figure 5, we show the recall/precision plots of the approaches for k = 1…10 predicted genres (i.e., R@k and P@k).
Based on the features introduced in Table 2, we discuss these results concerning the influence of (i) personalization, (ii) collaboration, (iii) popularity, and (iv) temporal drift. Furthermore, we compare the results of our BLLu approach for our user groups and different numbers of predicted genres in Figure 6 as well as show the performance of the approaches in a cold-start setting in Figure 7. Finally, we also discuss the implications of our findings for personalized music recommendation.
Feature | TOP | CFu | CFi | POPu | TIMEu | BLLu |
Personalization | | ✔ | ✔ | ✔ | ✔ | ✔ |
Collaboration | ✔ | ✔ | ✔ | | | |
Popularity | ✔ | ✔ | ✔ | ✔ | | ✔ |
Temporal drifts | | | | | ✔ | ✔ |
User group | Evaluation metric | TOP | CFu | CFi | POPu | TIMEu | BLLu |
LowMS | F1@5 | .108 | .311 | .341 | .356 | .368 | .397*** |
LowMS | MRR@10 | .101 | .389 | .425 | .443 | .445 | .492*** |
LowMS | MAP@10 | .112 | .461 | .505 | .533 | .550 | .601*** |
LowMS | nDCG@10 | .180 | .541 | .590 | .618 | .625 | .679*** |
MedMS | F1@5 | .196 | .271 | .284 | .292 | .293 | .338*** |
MedMS | MRR@10 | .146 | .248 | .264 | .274 | .272 | .320*** |
MedMS | MAP@10 | .187 | .319 | .336 | .351 | .365 | .419*** |
MedMS | nDCG@10 | .277 | .419 | .441 | .460 | .452 | .523*** |
HighMS | F1@5 | .247 | .273 | .266 | .282 | .228 | .304*** |
HighMS | MRR@10 | .188 | .232 | .229 | .242 | .201 | .266*** |
HighMS | MAP@10 | .246 | .304 | .298 | .314 | .267 | .348*** |
HighMS | nDCG@10 | .354 | .413 | .402 | .429 | .357 | .462*** |
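To put the differences in Table 3 into perspective, the following sketch computes, for F1@5, the strongest baseline per user group and the relative gain of BLLu over it. The values are copied from Table 3; the helper name is illustrative, and the relative gain is only a descriptive summary, not a replacement for the significance tests.

```python
# F1@5 values copied from Table 3.
f1_at_5 = {
    "LowMS":  {"TOP": .108, "CFu": .311, "CFi": .341, "POPu": .356, "TIMEu": .368, "BLLu": .397},
    "MedMS":  {"TOP": .196, "CFu": .271, "CFi": .284, "POPu": .292, "TIMEu": .293, "BLLu": .338},
    "HighMS": {"TOP": .247, "CFu": .273, "CFi": .266, "POPu": .282, "TIMEu": .228, "BLLu": .304},
}

def best_baseline_and_gain(scores, target="BLLu"):
    """Return the strongest non-target approach and the target's relative gain over it."""
    baselines = {name: value for name, value in scores.items() if name != target}
    best = max(baselines, key=baselines.get)
    gain = (scores[target] - baselines[best]) / baselines[best]
    return best, gain

summary = {group: best_baseline_and_gain(scores) for group, scores in f1_at_5.items()}
for group, (best, gain) in summary.items():
    print(f"{group}: best baseline {best}, relative gain of BLLu {gain:.1%}")
```

With these numbers, the strongest F1@5 baseline is TIMEu for LowMS and MedMS and POPu for HighMS.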
Influence of personalization. The personalized approaches (i.e., POPu, CFu, CFi, TIMEu, and BLLu) outperform the group-based TOP approach in the LowMS setting. This is in line with our analysis presented in Figure 2, where we found that genre popularity is most evenly distributed in the LowMS group, so a single group-level genre ranking captures little of an individual user's preferences.
The same is true for the MedMS group, in which we observe a very similar performance of CFu, CFi, POPu, and TIMEu. However, in the HighMS setting, only the four personalized approaches that utilize the popularity feature (i.e., POPu, CFu, CFi, and BLLu) outperform TOP on all four metrics; TIMEu falls below TOP on F1@5 (.228 vs. .247). This shows that personalization becomes more important for prediction accuracy as the mainstreaminess of the users decreases (i.e., it matters most in the LowMS setting).
Influence of collaboration. We investigate the genre prediction accuracy of three approaches (i.e., TOP, CFu, and CFi) that consider collaboration among users, i.e., that analyze the listening events of all users. Here, the personalized CFu and CFi approaches provide better results than the non-personalized TOP approach for all three user groups.
Furthermore, CFu compares most favorably with CFi in the HighMS group. This is in line with our analysis presented in Figure 1, which shows that the average pairwise user similarity is the highest for high-mainstream users. This also explains why CFu outperforms CFi in the HighMS setting, whereas CFi outperforms CFu in the LowMS and MedMS settings.
Influence of popularity. We evaluate four popularity-based approaches. The first approach provides non-personalized genre predictions based on the preferences of all users (i.e., TOP), and the second offers personalized predictions based on user similarities (i.e., CFu). The third approach provides personalized predictions using item similarities (i.e., CFi), and the fourth produces personalized genre predictions based on the preferences of the individual user (i.e., POPu). While the prediction accuracy of TOP increases with the level of mainstreaminess, the prediction accuracy of POPu decreases with the level of mainstreaminess. The prediction accuracies of CFu and CFi are relatively stable over all three user groups, with the only exception that CFu provides better results than CFi in the HighMS setting.
Thus, in the HighMS group, TOP provides a higher prediction accuracy than in the other two groups. These results are in line with our analysis presented in Figure 2, where we find that there are some dominating genres in the HighMS group, which explains the good results of TOP, CFu, and POPu in this setting. When further comparing CFu with CFi, we see that CFi outperforms CFu in the LowMS and MedMS settings.
Influence of temporal drift. Our analysis in Figure 3 reveals that users in Last.fm tend to listen to genres which they have listened to very recently. In other words, time is important for all three user groups. However, as shown in Table 3 and Figure 5, TIMEu provides the weakest results among the personalized approaches for HighMS, whereas it achieves good prediction accuracy for LowMS and MedMS. Thus, for HighMS, popularity is a more important feature than recency.
BLLu outperforms TIMEu in all experiments. This means that our personalized modeling approach, which considers both the popularity and the temporal drift feature, provides more accurate genre predictions than the other modeling techniques for all three groups.
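The interplay of popularity (frequency) and temporal drift (recency) can be illustrated with a minimal sketch of a base-level activation computation. It assumes the standard ACT-R form B = ln(Σ_j (t_ref − t_j)^(−d)); the decay exponent d = 0.5 (a common ACT-R default), the timestamps, and all names are placeholders and not necessarily the settings used in our experiments.

```python
import math

def base_level_activation(listen_times, t_ref, d=0.5):
    """Log of the summed power-law-decayed recencies of past listening events.
    d is an assumed decay exponent; times use any consistent unit (here: hours)."""
    past = [t for t in listen_times if t < t_ref]
    if not past:
        return float("-inf")  # genre never listened to before t_ref
    return math.log(sum((t_ref - t) ** (-d) for t in past))

t_ref = 100.0
# Hypothetical listening timestamps per genre for one user.
history = {
    "frequent_but_old": [1.0, 2.0, 3.0, 4.0, 5.0],
    "single_but_recent": [99.0],
    "frequent_and_recent": [90.0, 95.0, 97.0, 98.0, 99.0],
}
activations = {genre: base_level_activation(times, t_ref) for genre, times in history.items()}
ranking = sorted(activations, key=activations.get, reverse=True)
```

In this toy example, the genre that is both frequent and recent ranks first, and a single recent listening event outweighs five old ones. A purely frequency-based model such as POPu would instead tie the two frequent genres, and a purely recency-based model such as TIMEu would tie the two recent ones.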
Accuracy of BLLu for different values of k. In Figure 6, we show the recall/precision results of BLLu for k = 1…10 predicted genres for the three user groups. We observe apparent differences in the accuracy value ranges when comparing the three groups. While BLLu outperforms the five baselines in all three settings (with significant differences between BLLu and all other approaches according to a t-test with α = .001), the accuracy estimates are much higher in the LowMS group (i.e., R@10 = .827 and P@1 = .559) than in the MedMS group (i.e., R@10 = .674 and P@1 = .419) and the HighMS group (i.e., R@10 = .603 and P@1 = .377). This shows that our approach is especially useful to predict the genre preferences of users with low inclination to listen to mainstream music.
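For readers who want to reproduce such recall/precision curves, the following sketch computes P@k and R@k for a single user and k = 1…5. It assumes the usual definitions (hits divided by k and by the number of relevant genres, respectively); the genre lists are invented for illustration.

```python
def precision_recall_at_k(predicted, relevant, k):
    """Return (P@k, R@k) for one ranked prediction list and a set of relevant genres."""
    top_k = predicted[:k]
    hits = sum(1 for genre in top_k if genre in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

predicted = ["rock", "indie", "pop", "jazz", "metal"]
relevant = {"rock", "pop", "folk"}
curve = [precision_recall_at_k(predicted, relevant, k) for k in range(1, 6)]
```

Averaging these pairs over all users in a group gives one point per k. Recall can only grow or stay constant as k increases, whereas precision typically drops, which is why R@10 and P@1 are natural summary values for such a plot.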
Performance in cold-start setting. Since recommender systems are often faced with situations in which users only have a few interactions available to train the underlying recommendation algorithms, we also evaluate our BLLu approach in a cold-start setting. For this, we extract the 1,000 users with the lowest number of LEs from the LFM-1b dataset. As we need to make sure that we have at least 1 LE per user available for training the algorithms, this procedure leads to 1,000 users with a minimum of 2 LEs and a maximum of 46 LEs per user. For these users, we have precisely 1 LE in the test set, for which we predict the assigned genres.
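The user selection described above can be sketched as follows. The sketch assumes that the held-out test LE is the most recent one of each user (a time-based split, which is an assumption here) and represents LEs as hypothetical (user_id, timestamp, genres) tuples; it is not the TagRec implementation.

```python
from collections import defaultdict

def cold_start_split(listening_events, n_users=1000, min_events=2):
    """Select the n_users with the fewest LEs (at least min_events each) and
    hold out each user's most recent LE for testing.
    Returns {user_id: (train_events, test_event)}."""
    by_user = defaultdict(list)
    for user_id, timestamp, genres in listening_events:
        by_user[user_id].append((timestamp, genres))
    # Keep only users with enough LEs for one training and one test event.
    eligible = {user: sorted(events, key=lambda event: event[0])
                for user, events in by_user.items() if len(events) >= min_events}
    # Fewest LEs first; ties are broken by user id for determinism.
    coldest = sorted(eligible, key=lambda user: (len(eligible[user]), user))[:n_users]
    return {user: (eligible[user][:-1], eligible[user][-1]) for user in coldest}

events = [
    ("u1", 1, ["rock"]), ("u1", 2, ["pop"]), ("u1", 3, ["jazz"]),
    ("u2", 5, ["metal"]),                      # only 1 LE: excluded
    ("u3", 1, ["folk"]), ("u3", 4, ["rock", "indie"]),
]
split = cold_start_split(events, n_users=1)
```

In the toy data, user u2 is dropped because no LE would remain for training, and u3 is selected as the user with the fewest remaining LEs.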
Our results for this experiment are shown in the recall/precision plot of Figure 7. Here, we observe very similar results to the ones of our LowMS, MedMS, and HighMS settings (see Figure 6). Thus, again BLLu provides the best accuracy results followed by TIMEu, POPu, CFi, and CFu. As expected, the non-personalized TOP approach provides the worst results in this setting. These results show that BLLu is also capable of effectively predicting music genre preferences in cold-start settings where users only have a few listening events available for training.
Implications for personalized music recommendation. In this section, so far, we have shown that BLLu outperforms the baseline approaches concerning prediction accuracy in different settings (i.e., LowMS, MedMS, HighMS, and cold-start). As Figure 6 shows, this is especially true for the LowMS group, in which users do not follow the preferences of the mainstream, and thus, a personalization technique, as given by the BLL equation, is critical. If we relate this to music recommender systems, which exploit the listening histories of users to suggest other music that they might also like, our findings lead to interesting implications. Related research has shown that standard recommendation algorithms such as collaborative filtering cannot provide suitable music recommendations for users with low mainstreaminess. The results presented in this section support this. In other words, such users need different music recommendation algorithms that account for their highly individual listening preferences.
One way to achieve this could be to combine state-of-the-art music recommendation algorithms (see Section 2) with our music genre preference modeling approach based on the BLL equation presented in this paper. We could use the calculated B′u,g values given by our approach as an input for these algorithms or to rerank recommendation results based on the importance of a genre for a user. We elaborate on these ideas as well as other plans for future work in Section 5.
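One possible reading of the reranking idea is sketched below: candidates produced by an arbitrary recommender are rescored by blending the recommender's score with the user's normalized genre preference. The softmax normalization of the raw activations, the blending weight alpha, and all names and values are assumptions made for illustration and are not part of the approach evaluated in this paper.

```python
import math

def softmax(values):
    """Normalize raw activations into preference weights that sum to 1."""
    peak = max(values.values())  # subtract the maximum for numerical stability
    exps = {key: math.exp(value - peak) for key, value in values.items()}
    total = sum(exps.values())
    return {key: value / total for key, value in exps.items()}

def rerank(candidates, genre_preference, alpha=0.5):
    """candidates: list of (track, recommender_score, genre) tuples.
    Sorts by a linear blend of recommender score and genre preference."""
    def blended(item):
        _, score, genre = item
        return (1 - alpha) * score + alpha * genre_preference.get(genre, 0.0)
    return sorted(candidates, key=blended, reverse=True)

# Hypothetical raw genre activations for one user.
raw_activations = {"ambient": 1.2, "pop": -0.4, "rock": 0.1}
preference = softmax(raw_activations)
candidates = [
    ("track_a", 0.60, "pop"),
    ("track_b", 0.55, "ambient"),
    ("track_c", 0.50, "rock"),
]
reranked = [track for track, _, _ in rerank(candidates, preference)]
```

In this toy example, the ambient track moves to the top even though the underlying recommender scored it lower than the pop track, because the user's genre preference is concentrated on ambient. How strongly such preferences should influence the final ranking (here controlled by alpha) would need to be evaluated empirically.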
5. Conclusion and Future Work
In this paper, we presented BLLu, an approach that utilizes the features of popularity and temporal drifts to model and predict music genre preferences via fine-grained genres. We leveraged the LFM-1b dataset of more than one billion music listening events, created by approximately 120,000 users of the online music service Last.fm. We divided the users into three groups based on the proximity of their music genre preferences to the mainstream: (i) LowMS, i.e., listeners of niche music, (ii) HighMS, i.e., listeners of mainstream music, and (iii) MedMS, i.e., listeners of music that lies in-between. To take into account the popularity and temporal drift of music genre preferences, we proposed to use the Base-Level Learning (BLL) equation from the cognitive architecture ACT-R, which quantifies the importance of information in human memory (e.g., a music genre) by considering how frequently (i.e., popularity) and recently (i.e., temporal drift) it was used in the past. A comparison between BLLu and a group-based baseline (i.e., TOP), a user-based collaborative filtering baseline (i.e., CFu), an item-based collaborative filtering baseline (i.e., CFi), a frequency-based baseline (i.e., POPu) as well as a recency-based baseline (i.e., TIMEu) showed that BLLu outperforms all other approaches for all three user groups in terms of prediction accuracy.
Furthermore, our results indicate that BLLu is especially useful to predict the music genre preferences of users with interest in low-mainstream music (i.e., the LowMS user group), which opens up interesting possibilities for future work in the research area of personalized music recommender systems.
Limitations and future work. So far, we limited our approach to the BLL equation of the declarative memory module of ACT-R. The BLL equation is only one part of the more comprehensive ACT-R framework and, on its own, does not consider contextual information; one needs to keep this limitation in mind when utilizing our approach. For example, when we model music genre preferences exclusively via past listening behavior, phenomena such as over-personalization or filter-bubble effects could occur. To overcome this, we plan to extend our model to the full activation equation of ACT-R, which also considers contextual information via its associative activation. Moreover, we plan to extend our model by other components of ACT-R, for example, to investigate further context dimensions such as the mood or the current activity of the user. We could achieve this by defining and implementing so-called production rules from ACT-R’s procedural memory module as, for instance, done in the SNIF-ACT model.
Another limitation of our work is that we employed a rather simple definition for the mainstreaminess of a user. We, therefore, plan to extend our analysis to include more sophisticated mainstreaminess measures, e.g., based on rank-order correlation or Kullback-Leibler divergence. As part of future work, we plan to integrate our findings into music recommendation algorithms, with particular attention to addressing the low mainstreaminess group, since standard collaborative filtering approaches tend to fail to provide suitable music recommendations for this user group. For example, we plan to integrate the preference values we obtain for a specific user and a particular genre via our approach as a context dimension into a matrix factorization-based approach or a deep learning-based approach.
Furthermore, we aim to apply our approach to the problem of music playlist continuation, which was also the task of the ACM RecSys Challenge 2018. We believe that our findings concerning the temporal relistening patterns of music genres (see Section 3.1) could help identify genres that users commonly listen to consecutively. We could then, for example, incorporate such genre sequences into a previously proposed two-stage convolutional neural network (CNN) model for automatic playlist continuation. Finally, we would like to highlight that researchers and practitioners could easily leverage our approach for other related tasks (e.g., recommending music artists) and not only for genre prediction. Thus, we hope that our insights will inform future work in the areas of user modeling and music recommendation.
Reproducibility
To foster the reproducibility of our research, we use the publicly available LFM-1b Last.fm dataset (see Section 3.1). Furthermore, we provide our evaluation framework TagRec (see Section 4.1) freely for academic purposes. We hope that the approach presented in this paper and its implementation in TagRec, as well as the dataset, will attract further research on music preference modeling and recommender systems.