This article introduces a novel architecture for two objectives, recommendation and interpretability, in a unified model. We leverage textual content as a source of interpretability in content-aware recommender systems. The goal is to characterize user preferences with a set of human-understandable attributes, each described by a single word, enabling comprehension of the user interests behind item adoptions. This is achieved via a dedicated architecture, which is interpretable by design, involving two components for recommendation and interpretation. In particular, we seek an interpreter, which accepts a holistic user representation from a recommender and outputs a set of activated attributes describing user preferences. Besides encoding interpretability properties such as fidelity, conciseness, and diversity, the proposed memory network-based interpreter generalizes the user representation by discovering relevant attributes that go beyond the textual content of the user's adopted items. We design experiments involving both human- and functionally-grounded evaluations of interpretability. Results on four real-world datasets show that our proposed model not only discovers highly relevant attributes for interpreting user preferences, but also enjoys comparable or better recommendation accuracy than a series of baselines.
1 Introduction
Recommender systems are prevalent in various domains including e-commerce, news, and social media. The methodologies range from matrix factorization [25] to attention-based historical aggregation [9], autoencoder-based models [33, 53, 69], and graph-based models [18, 66]. Two issues commonly plague recommender systems: sparsity and lack of interpretability. The former arises from the paucity of observations relative to the large number of users and items, making it difficult to build models that generalize well, particularly for long-tail instances. The latter is due to the abstract nature of the latent representations of users and items derived from representation learning methods.
Side information such as textual content could provide another pathway for establishing similarities across users or items to improve recommendation accuracy. Representative content-aware recommender system models, including [28, 32, 42, 62, 74], mainly employ textual content to resolve sparsity. Going beyond, this work employs textual content as a source of interpretability of user preferences. We consider two objectives, recommendation and interpretability, in a unified model, realized by two respective components. The recommender focuses on learning to recommend items, while the interpreter accepts the user representation from the recommender as input and outputs an interpretation of the user's interests.
To implement our idea, we design a novel architecture inspired by Supervised Learning with Interpretation (SLI) [48]. We realize the recommender as an autoencoder, learning a non-linear representation of the user at the hidden layer and predicting user-item interactions at the output layer. In the scope of this article, we focus on two variants of autoencoder-based recommender, namely AutoRec [53] and CDAE [69]. The autoencoder is chosen to help reduce the learning complexity, as described in Section 3.3. Our architecture is flexible enough to plug in other neural recommenders; we verify this applicability by examining a non-autoencoder recommender called DirectAU. Pertaining to the interpreter, a key-value memory network [46, 56] lies at its core. It stores two matrices of the same size, namely the key matrix and the value matrix. The key matrix is a vocabulary-sized dictionary, each element of which stores the representation of a word, also called an attribute. The definition in [13] refers to a single word as a cognitive chunk, i.e., a unit of interpretability. Unless specified otherwise, in this article we use the terms single word, attribute, and cognitive chunk interchangeably. The value matrix, on the other hand, stores another representation for each word. Generally, the difference is that the key matrix stores representations derived from textual content signals, i.e., item-word relationships, while the value matrix stores representations derived from collaborative filtering signals. The key matrix acts as a 'translator': multiplying the user representation from the recommender with the key matrix translates the user representation into word space, and high-similarity words capture the user's preferences well. The value matrix stores building blocks for reconstructing the user representation from the generated words. The scores produced by multiplying the key matrix and the user representation are the weights used to aggregate building blocks from the value matrix into the interpreter-based user vector. The key-value memory network brings two pertinent advantages. Firstly, it is flexible: one can store n-grams as cognitive chunks, although a larger dictionary requires more memory and may slow the learning process. Secondly, by storing all words in the vocabulary, the interpreter can generalize the user representation by attending to relevant words that go beyond the texts of the user's adopted items. We empirically demonstrate that this also benefits recommendation performance (see Section 4).
For a sense of the kind of interpretable representation we seek, Table 1 shows how given a user’s historical adoptions, in this case titles of academic articles (left column), we arrive at a list of inferred natural language words (right) underlying the given user’s preferences, presented as a word cloud. This is not merely keyword extraction, as some of these words may not necessarily have occurred within the adopted titles.
Table 1. An Example User's Adopted Item Titles (Left) and the Inferred Words Interpreting the User's Preferences (Right)
Our work is distinct from existing work on explainable recommendation and post-hoc explanation. The former concerns the underlying reasons behind a single user-item interaction, while ours makes sense of the user preferences holistically underlying their interactions with a set of items. Our model is interpretable by design, distinguishing itself from post-hoc explanation, which has been criticized for the lack of faithfulness of its interpretations [52].
Contributions. In this work, we make the following contributions. Firstly, we present a novel architecture, called Interec (Section 3), which stands for Memory-based INTErpretable representation for user-oriented content-aware RECommendation, a dedicated and unified architecture for both recommendation and interpretability. To the best of our knowledge, this is the first work that incorporates a textual content-based interpreter of user preferences into a recommendation model. Secondly, we innovatively use a key-value memory network as the means of interpretation. The proposed architecture is flexible, so various neural recommenders as well as various types of attributes can be leveraged for interpretation. Thirdly, we investigate a technique to promote conciseness of interpretation, which also brings recommendation performance gains. Lastly, we empirically demonstrate a significant advancement over comparable baselines on four datasets in accuracy and interpretability, both quantitatively and qualitatively (Section 4).
2 Related Work
Content-Aware Recommender Systems. The line of research that incorporates item textual content into recommendation models includes CTR [60] with text modeling based on LDA [2], CDL [62] based on a stacked denoising autoencoder, ConvMF [28] based on convolutional neural networks, and CVAE [32] based on a variational autoencoder. Though they vary in text modeling, they share a regularization framework that encourages the text representation of an item to be close to its collaborative filtering representation. These works mainly aim to resolve the sparsity of user-item interaction data, leading to better recommendation performance. Subsequent works include JSR [74], which jointly predicts user-item interactions and reconstructs item textual descriptions, and GATE [42], which leverages an attention network to model the textual content of items and a gating mechanism to combine collaborative filtering and content-based representations; these works also aim at achieving higher accuracy. Our work is distinct in a couple of ways. For one, existing works mainly employ textual content to resolve sparsity, while we focus on both interpretation of user preferences and sparsity alleviation. For another, our model generates a personalized set of words describing user interests, achieving a higher level of interpretability, whereas existing works mainly employ textual content on the item side. For parity, we compare against baselines in both item-oriented and user-oriented fashions.
Our work is also related to the use of heterogeneous side information to resolve data sparsity and cold-start problems, leading to better recommendation accuracy as well as improved interpretability. For instance, a knowledge graph (KG) provides rich item attributes to characterize items and enhance user-item relationships. Notable works include path-based models [22, 67], regularization-based models [36, 76], and GNN models [63, 65]. On the other hand, social connections provide useful information to characterize user preferences based on friendships on social platforms. Representative approaches include fusing user representations from the social domain and the item preference domain [6, 24, 70] or leveraging graph neural networks to model user-user and/or user-item connections [14, 39, 68]. Recently, thanks to advances in learning from heterogeneous information networks (HIN) [72], researchers have designed novel mechanisms to model the heterogeneity of users, items, and their associated information, e.g., user social connections and item relations, from a heterogeneous network [3, 4, 10]. Our model Interec distinguishes itself on a couple of points. For one, our motivation stems from the interpretability perspective: the textual content of items is employed as the source of interpretability in our dedicated interpreter, which discovers related words capturing user preferences. Although a KG can serve as a source of interpretability like textual content, it is costly to construct, and not all benchmark datasets come with a KG. For another, Interec is able to generate relevant text-based interpretations of user preferences relying solely on item textual information, achieving lower model complexity than approaches using both item and user side information.
Explainable Recommendation. Recently, there are also active efforts [77] in making recommendations more explainable to consumers. The gist is in accompanying a recommendation of an item to a user with an explanation, which could be in various forms such as text [64, 78], rules [43], social graph [50], and visual imagery [37]. Our focus in this article is in interpreting the preferences of a user as a whole in terms of keywords to provide some interpretability to the workings of the content-aware recommendations. It is not our intention to explain individual item-wise recommendation instances.
User Profiling. Several works seek to profile users for recommender systems. The works [15, 40] infer a user's topics of interest, both static and dynamic. In [16], the authors profile a user as a hierarchy of interactions at the item level and the category level. In contrast, we focus on words as units of interpretation. Outside of recommendation, user profiling has also been investigated on Twitter [35], streaming short texts [34], and Question Answering [41].
Interpretable AI. Generally, studies on interpretable AI can be broadly categorized into two groups [38]. The first group relies on internal structure to interpret the workings of a machine learning system [5, 31, 58]. The second group, post-hoc interpretation, including [51, 54, 75], treats the machine learning model as a black box and attempts to explain the model outputs. Our work fits into the former group, being interpretable by design.
Studies on interpretability in recommender systems include [55], which uses a CNN-based attention network to model user preferences globally and locally from reviews; [23], which exploits an attention network to model the content features of movies; and [47], which projects item representations into an interpretable space to infer user preferences on item features. Our work is distinct in deriving top-k words as the "interpretation" of a user's latent representation.
Pertaining to dictionaries of attribute-based interpretability, representative works include [12, 26, 30]. These are not comparable with ours since their dictionary of attributes is assumed to be available in advance. FLINT [48] is dedicated to multi-class image classification while ours is applied to recommendation; hence, there is a wide difference in how the dictionary of attributes is constructed and how interpretability is visualized. Moreover, we use a memory network for interpretation while FLINT employs a softmax function over the attribute dictionary.
Neural Attention-Based Recommender Systems. ACF [9] leverages attention to model multi-media contents in collaborative filtering. [11] leverages memory network [56] to model dynamic user preferences to improve sequential recommendation. LRML [57] employs memory network to generate relation vector between user and item in metric learning so as to improve recommendation. Our novelty comes from the employment of key-value memory network [46, 56], in which attention lies at the heart, to build up interpreter from textual content.
3 Methodology
Our proposed architecture is illustrated in Figure 1 and the list of notations is presented in Table 2. The input includes a binary interaction matrix \({\bf R} \in \lbrace 0, 1\rbrace ^{M \times N}\), where \(M, N\) are the numbers of users and items, respectively. Each uth row, \({\bf r}_u\), of matrix \({\bf R}\) denotes the interaction vector of the user indexed by u, and \({\bf r}_{ui} = 1\) indicates an interaction between user u and item i. Furthermore, items have side information, i.e., textual content, denoted by matrix \({\bf X} \in \mathbb {R}^{N \times K}\), where K is the number of words in the vocabulary. Each ith row \({\bf x}_i\) of \({\bf X}\) is the tf-idf representation of the textual content of item i. For the text-aware recommendation task, Supervised Learning with Interpretation (SLI) involves two explicit empirical losses, one for recommendation and one for interpretability:
\[ \min _{f \in \mathcal {F},\, g \in \mathcal {G}}\ \mathcal {L}^{rec}(f) + \mathcal {L}^{int}(f, g), \qquad (1) \]
in which \(\mathcal {F}\) is the model space of the recommender, \(\mathcal {G}\) is the model space of the interpreter, \(\mathcal {L}^{rec}(\cdot)\) is the recommendation loss, and \(\mathcal {L}^{int}(\cdot)\) is designed for the interpretability objective. Next, we describe our proposed realization of SLI, called Interec, including details of the recommender f, the interpreter g, and the learning objectives.
Table 2. List of Notations

| Notation | Description |
|---|---|
| \(u, i, j\) | User index, item index, and word index |
| \(M, N, K\) | The number of users, items, and words, respectively |
| \(d\) | Dimensionality of the user, item, and word embedding vectors |
| \(\epsilon\) | Exponential weight of the normalization term in the memory-based representation (Equation (8)) |
| \({\bf X}, {\bf x}_i\) | Tf-idf item textual content matrix and textual content vector of item i |
| \({\bf R}, {\bf r}_u\) | User-item interaction matrix and binary rating vector of user u |
| \({\bf z}_u, \hat{{\bf z}}_u, \tilde{{\bf z}}_u\) | Latent representation, memory-based representation, and combined representation of user u |
| \({\bf V}^{text}, {\bf V}\) | Text-based and collaborative filtering-based item embedding matrices |
| \({\bf M}, {\bf K}\) | Memory (value) matrix and key matrix in the memory network |
| \({\bf a}^T, {\bf A}^T\) | Transpose of a vector (bold lowercase letter) and a matrix (bold uppercase letter) |
Fig. 1. Overview of the proposed Interec architecture.
3.1 Interec
3.1.1 Recommender.
We set \(\mathcal {F}\) as the class of deep neural networks to learn a recommendation model. Each realization \(f \in \mathcal {F}\) is parameterized by \(\Theta _f\). f should satisfy the following properties: (i) f takes user information, e.g., the rating vector \({\bf r}_u\) or the user's ID, as input and outputs a list of recommended items for user u, and (ii) the output of the hidden layer of f abstractly encodes the user's preferences. In this work, we focus on examining two autoencoder-structured variants for f: a vanilla autoencoder (AE) and a denoising autoencoder (DAE). While other neural recommenders are feasible, the choice of autoencoder reduces model complexity, which will be elaborated in Section 3.3. We denote the corresponding recommenders as \(f^{AE}\) and \(f^{DAE}\), respectively.
Encoder \(f^{AE}_{enc}\) of recommender \(f^{AE}\), which is based on AutoRec [53], and encoder \(f^{DAE}_{enc}\) of recommender \(f^{DAE}\), which is based on CDAE [69], are
\[ {\bf z}_u = e\big({\bf r}_u{\bf W}^{enc} + {\bf b}^{enc}\big), \qquad (2) \]
\[ {\bf z}_u = e\big({\bf r}^c_u{\bf W}^{enc} + {\bf Q}_u + {\bf b}^{enc}\big), \qquad (3) \]
and the decoder shared by both variants produces the predicted interaction vector
\[ {\bf o}_u = s\big(\tilde{{\bf z}}_u{\bf W}^{dec} + {\bf b}^{dec}\big), \qquad (4) \]
in which \({\bf r}^c_u\) is a corrupted version of \({\bf r}_u\), obtained by randomly zeroing out some elements, and e and s are non-linearity functions. e is set to tanh for both variants, while s is set to sigmoid for \(f^{AE}\) and softmax for \(f^{DAE}\). These activation functions result in different loss functions for the two examined recommendation models, allowing us to test the proposed architecture under different learning scenarios. The differences between \(f^{AE}_{enc}\) and \(f^{DAE}_{enc}\) are that \(f^{DAE}_{enc}\) accepts a corrupted version of the user's rating vector and uses a per-user bias vector \({\bf Q}_u\), a model parameter, for the user representation. In Equation (4), \(\tilde{{\bf z}}_u = {\bf z}_u + \hat{{\bf z}}_u\), where \(\hat{{\bf z}}_u\) is the interpreter-based representation of user u; Section 3.1.2 describes how to derive \(\hat{{\bf z}}_u\). The parameters of the recommender are \(\Theta _f = \lbrace {\bf W}^{enc} \in \mathbb {R}^{N \times d}, {\bf b}^{enc} \in \mathbb {R}^d, {\bf W}^{dec} \in \mathbb {R}^{d \times N}, {\bf b}^{dec} \in \mathbb {R}^N, {\bf Q}_u \in {\bf Q} \in \mathbb {R}^{M \times d}\rbrace\), where d is the dimensionality. We denote Interec with the vanilla autoencoder recommender as Interec-AE and Interec with the denoising autoencoder recommender as Interec-DAE.
Both \(f^{AE}\) and \(f^{DAE}\) satisfy the two aforementioned properties. Firstly, the output of the decoder can be seen as the predicted probability of items that the user is likely to interact with. Secondly, \({\bf z}_u\) can be seen as a compact representation of the user's preferences. Its individual dimensions, however, are abstract and not immediately interpretable; therefore, the interpreter is required to associate these latent features with human-understandable natural language words. Two salient notions in implementing the recommender f, \(f^{AE}\) or \(f^{DAE}\), are
–
To incorporate textual content into the recommender, \({\bf W}^{dec}\) is implemented as \({\bf W}^{dec} = ({\bf V}^{text} + {\bf V})^T\). By doing so, each item is represented by two components: \({\bf V} \in \mathbb {R}^{N \times d}\) is a free matrix learned during training to capture item features from collaborative filtering signals, while \({\bf V}^{text} \in \mathbb {R}^{N \times d}\) captures text-based item features and is obtained by stacking the outputs of the hidden layer in Equation (7). Note that \({\bf V}^{text}\) is left unchanged during training so that its textual semantics are not overwritten by collaborative filtering signals, which we empirically find useful for interpretability, as evidenced in Section 4.3. Unlike existing, more restrictive models that treat the text-based item representation as a regularizer or train it using collaborative filtering signals, this design enables the user representation to capture both collaborative filtering and textual signals more effectively, as shown in the experimental results in Section 4 (see also the sketch after this list).
–
For encoder, tanh non-linearity is used to model likes and dislikes with positive and negative values, respectively.
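To make this design concrete, the following PyTorch sketch shows one possible forward pass of the Interec-AE recommender, with the frozen \({\bf V}^{text}\) added to the trainable \({\bf V}\) in the decoder. It is a minimal illustration under our own naming (`InterecAERecommender`, `interpreter`), not the authors' released code.

```python
import torch
import torch.nn as nn

class InterecAERecommender(nn.Module):
    def __init__(self, n_items, d, V_text, interpreter):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_items, d) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d))
        # V: free item matrix capturing collaborative filtering signals.
        self.V = nn.Parameter(torch.randn(n_items, d) * 0.01)
        # V_text: text-based item matrix, frozen to preserve textual semantics.
        self.register_buffer("V_text", V_text)
        self.b_dec = nn.Parameter(torch.zeros(n_items))
        self.interpreter = interpreter  # maps z_u to z_hat_u (Section 3.1.2)

    def forward(self, r_u):  # r_u: (batch, n_items) binary rating vectors
        z_u = torch.tanh(r_u @ self.W_enc + self.b_enc)  # likes/dislikes as +/- values
        z_tilde = z_u + self.interpreter(z_u)            # combined representation
        W_dec = (self.V_text + self.V).T                 # ties content and CF signals
        return torch.sigmoid(z_tilde @ W_dec + self.b_dec)  # predicted o_u, Eq. (4)
```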
Extension. To verify the wider applicability of our proposed method, we examine a recently developed non-autoencoder recommendation model called DirectAU [61]. Under the encoder-decoder framework, the encoder of DirectAU is simply a look-up table \({\bf U} \in \mathbb {R}^{M \times d}\) of M rows, each being the vector representation of one user; user u's representation is produced as \({\bf z}_u = {\bf U}_u \in \mathbb {R}^d\). Similarly, item representations are stored in a look-up table \({\bf V} \in \mathbb {R}^{N \times d}\) of N rows, one per item; item i's representation is produced as \({\bf z}_i = {\bf V}_i \in \mathbb {R}^d\). DirectAU distinguishes itself by its learning objective, which will be elaborated in Section 3.3. We name our model variant extending DirectAU as Interec-DirectAU. Interec-DirectAU predicts the interaction score between user u and item i as
\[ {\bf o}_{ui} = \tilde{{\bf z}}_u \big({\bf V}^{text}_i + {\bf V}_i\big)^T = \tilde{{\bf z}}_u {\bf W}^{dec}_i. \qquad (5) \]
Similar to Interec-AE and Interec-DAE, the user representation in Interec-DirectAU also contains two terms: \({\bf z}_u\) and \(\hat{{\bf z}}_u\), the output of the interpreter, as elaborated in the next section. The interpretation of the combined item representation \({\bf W}^{dec}_i = {\bf V}^{text}_i + {\bf V}_i\) is identical to that of Interec-AE and Interec-DAE. For Interec-DirectAU, we empirically found that normalizing each row of \({\bf V}^{text}\) to unit length helps the model converge faster and achieve higher accuracy. As before, \({\bf V}^{text}\) in Interec-DirectAU is left unchanged during training.
3.1.2 Interpreter.
Unlike existing interpretability models [1, 8, 51, 73] that aim at interpreting a model prediction given an image or a sentence as input, our target is interpreting user preferences. In recommender systems, the input is a list of the user's adopted items, oftentimes described by their IDs and followed by an embedding layer; it is therefore difficult to understand user preferences based solely on item IDs. As such, the task of the interpreter is to generate a set of attributes capturing user preferences. Following [13], our interpretability is formulated as
–
Understanding user’s preferences behind their adoptions.
–
The interpretability is evaluated by how well the generated attributes capture user preferences, using both human-grounded metrics and functionally-grounded evaluation.
–
The scope of interpretability is local interpretability, i.e., understanding preferences of a single user.
–
Single words from item textual content are treated as cognitive chunks or attributes, i.e., units of interpretability.
Given user representation \({\bf z}_u\), which is abstract and not interpretable, the interpreter g computes the activation score of the user representation with each attribute j, i.e.,
\[ g({\bf z}_u)_j = \sigma \big(\tau \cdot {\bf z}_u{\bf K}_j^T\big) \in \mathbb {R}^+, \qquad (6) \]
where \(\sigma\) denotes the sigmoid function and \(\tau\) is a temperature hyper-parameter explained below. \({\bf K} \in \mathbb {R}^{K \times d}\) is the key matrix, also called the dictionary of attributes in this article. Each row of \({\bf K}\) stores the d-dimensional representation of a word, i.e., a cognitive chunk or attribute. Several notions are implemented here.
–
The interpreter g accepts the user representation from the recommender as input, enabling interpretation of the user's interests.
–
Dictionary \({\bf K}\) stores all K words in the vocabulary. Consequently, \(g({\bf z}_u)\) is defined over the word space, potentially attending to words outside the user's own corpus and resulting in a more generalized user representation \(\tilde{{\bf z}}_u\) in Equation (4).
–
Sigmoid non-linearity is used instead of softmax as in [46, 56]. The reason is that softmax acts as an \(L_1\)-normalization over attributes, overly punishing the attention scores in Equation (6) for active users, who are associated with many attributes because of their interactions with a wide range of items. Sigmoid allows independent attention over attributes, meaning that many words can have attention scores close to 1. In addition, a temperature hyper-parameter, \(\tau\), is introduced to strengthen the gap between positive and negative elements in \({\bf z}_u{\bf K}^T\). Our finding is consistent with other works [19, 71], which also study methods to relax softmax, i.e., so the output does not sum to 1, to improve recommendation performance.
Dictionary of Attributes. A natural question at this point is how to derive matrix \({\bf K}\). We seek a function \(\phi : \mathcal {A} \rightarrow \mathbb {R}^d\) that maps each attribute from attribute space \(\mathcal {A}\) to a d-dimensional vector. As in [48], \(\phi\) should encode patterns related to the input, which is the list of items in our case. Intuitively, a solution that jointly derives representations of words and items is desirable. We therefore implement \(\phi\) as a denoising autoencoder (DAE) [59], i.e.,
\[ \hat{{\bf x}}_i = \tanh \big({\bf x}^c_i{\bf K} + {\bf b}^{denc}\big){\bf K}^T + {\bf b}^{ddec}, \qquad (7) \]
where \({\bf x}^c_i\) is the corrupted version¹ of \({\bf x}_i\), the tf-idf textual content of item i. The parameters \(\Theta _1 = \lbrace {\bf K} \in \mathbb {R}^{K \times d}, {\bf b}^{denc} \in \mathbb {R}^d, {\bf b}^{ddec} \in \mathbb {R}^K\rbrace\) are randomly initialized and refined during training. Hence, elements in \({\bf K}\) capture relationships between words, i.e., cognitive chunks/attributes, and items. The importance of \({\bf K}\) to the quality of user preference interpretation is analyzed in Section 4.3. Other choices such as CNN [28] or attention [42] are also viable. Finally, \(\tanh({\bf x}_i{\bf K} + {\bf b}^{denc})\) composes each row of \({\bf V}^{text}\) used in Equation (4).
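As a concrete illustration, the following is a minimal sketch of such a denoising autoencoder over tf-idf item texts; the dropout-style corruption and the tied-weight decoder are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeDictionaryDAE(nn.Module):
    def __init__(self, vocab_size, d):
        super().__init__()
        # K: dictionary of attributes, one d-dimensional row per word.
        self.K = nn.Parameter(torch.randn(vocab_size, d) * 0.01)
        self.b_denc = nn.Parameter(torch.zeros(d))
        self.b_ddec = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, x, corruption=0.3):
        x_c = F.dropout(x, p=corruption, training=self.training)  # corrupt tf-idf input
        h = torch.tanh(x_c @ self.K + self.b_denc)  # rows of V_text come from this layer
        return h @ self.K.T + self.b_ddec           # tied-weight reconstruction of x

# After training, V_text is obtained by stacking tanh(x_i @ K + b_denc) over all items.
```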
Interpretation in Interec. We are interested in interpreting the preferences of a single user; this scope is local interpretability. The following definition guides our model to output interpretations of user preferences.

Definition 3.1 (Interpretation of User Preferences). Given user u with latent representation \({\bf z}_u\), the interpretation of u's preferences is the set of k attributes (words) with the highest activation scores \(g({\bf z}_u)_j\), \(j \in \lbrace 1, \ldots, K\rbrace\).

Note that when k, a pre-chosen number, gets larger, the interpretation covers user preferences better. From a human perspective, however, a large number of words makes it difficult to quickly grasp user preferences.
Interpretability-Based Representation. Since \(g({\bf z}_u)\) is defined over the word space, it is possible that \(g(\cdot)\) gives higher scores to words outside the user's own associated texts. Intuitively, we can thus generalize the user representation beyond their interacted items, enabling the delivery of more items of interest to the user. We seek a function \(l: \mathbb {R}^K \rightarrow \mathbb {R}^d\) to produce this generalized representation:
\[ \hat{{\bf z}}_u = l\big(g({\bf z}_u)\big) = K^{-\epsilon }\, g({\bf z}_u)\, {\bf M} = K^{-\epsilon } \sum _{j=1}^{K} g({\bf z}_u)_j\, {\bf M}_j. \qquad (8) \]
Here, each row of the value matrix \({\bf M}\), \({\bf M}_j \in \mathbb {R}^d\), stores the representation of word j, which captures collaborative filtering supervision signals; \({\bf M}\) is trained during model learning. \(K^{-\epsilon }\) is used to promote conciseness, one property of interpretability mentioned in [48], which expects a small number of attributes for interpretation. We give a detailed explanation of \(\epsilon\) in Section 3.2, and an empirical study on \(\epsilon\) is presented in Section 4.3. The parameters of the interpreter g are \(\Theta _g = \lbrace {\bf K}, {\bf M}\rbrace\). After obtaining \(\hat{{\bf z}}_u\), we plug it into Equation (4) or Equation (5).
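Putting Equations (6) and (8) together, a minimal sketch of the interpreter might look as follows; the initialization scale and the default hyper-parameter values are assumptions.

```python
import torch
import torch.nn as nn

class Interpreter(nn.Module):
    def __init__(self, K, d, tau=1.0, eps=0.5):
        super().__init__()
        self.register_buffer("K", K)  # key matrix, fixed after DAE pre-training
        self.M = nn.Parameter(torch.randn(K.shape[0], d) * 0.01)  # value matrix
        self.tau, self.eps = tau, eps

    def scores(self, z_u):
        # Equation (6): independent sigmoid attention with temperature tau,
        # so many words may score near 1 (unlike softmax).
        return torch.sigmoid(self.tau * (z_u @ self.K.T))  # (batch, vocab)

    def forward(self, z_u):
        g = self.scores(z_u)
        # Equation (8): K^{-eps} normalization promotes conciseness.
        return (self.K.shape[0] ** -self.eps) * (g @ self.M)  # z_hat_u: (batch, d)

# Definition 3.1 in code: the top-k attributes interpreting user u are
# interpreter.scores(z_u).topk(k).indices
```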
3.2 Model Analysis
Expanding Equation (4) and Equation (5), and omitting the non-linearity and bias for simplicity, the predicted score between user u and item i is
\[ {\bf o}_{ui} = \big({\bf z}_u + \hat{{\bf z}}_u\big){\bf W}^{dec}_i = {\bf z}_u{\bf W}^{dec}_i + \hat{{\bf z}}_u{\bf W}^{dec}_i. \qquad (9) \]
Without the second term \(\hat{{\bf z}}_u{\bf W}^{dec}_i\) in Equation (9), Interec reduces to a form of content-aware recommendation, decoding the adoption-based user encoding \({{\bf z}}_u\) with the item encoding \({\bf W}^{dec}_i\) informed by both content and collaborative filtering signals. Furthermore, by expanding the first term \({\bf z}_u{\bf W}^{dec}_i\), we obtain:
\[ {\bf z}_u{\bf W}^{dec}_i = {\bf z}_u \big({\bf V}^{text}_i\big)^T + {\bf z}_u {\bf V}_i^T. \qquad (10) \]
In Equation (10), the first term forces the user latent vector \({\bf z}_u\) to capture user preferences from textual content signals, while the second term forces \({\bf z}_u\) to capture user preferences from collaborative filtering signals. By leaving \({\bf V}^{text}_i\) unchanged during training, the textual semantics are preserved; if \({\bf V}^{text}_i\) were updated during training, the collaborative filtering signals would potentially alter the textual semantics underlying it. By separating \({\bf V}^{text}\) and \({\bf V}\), our model fully exploits the representational power of both content-based and collaborative filtering-based representations. In the experiments, we empirically show that this plays an important role in achieving both of our goals in this article.
In Equation (9), we interpret the second term \(\hat{{\bf z}}_u{\bf W}^{dec}_i\) as a retrieval function, in which the inferred words act as a query to retrieve relevant items for each user. Expanding \(\hat{{\bf z}}_u{\bf W}^{dec}_i\) and omitting the non-linear activation for simplicity, we have
\[ \hat{{\bf z}}_u{\bf W}^{dec}_i = K^{-\epsilon } \sum _{j=1}^{K} g({\bf z}_u)_j\, {\bf M}_j\, {\bf W}^{dec}_i. \qquad (11) \]
\(g({{\bf z}_u})_j{\bf M}_j\) can be interpreted as the jth word representation weighted by the user's preference for this word, and \(g({{\bf z}_u})_j{\bf M}_j{\bf W}^{dec}_i\) measures the similarity between item i and word j w.r.t. user u. The output of Equation (11) is thus the similarity between item i and user u based on the words inferred for u. If \(g({\bf z}_u)_j\) outputs high scores for words outside the user's adopted item texts, Equation (8) potentially results in retrieving more relevant items for the user.
To understand the role of the normalization term \(K^{-\epsilon }\), we examine two extreme cases. When \(\epsilon = 1\), the predicted score is averaged over all words in the vocabulary; an item i gets a high score only if it has a high inner product with nearly all words in the vocabulary. This is unrealistic since each item possesses only a certain number of features, described by its textual content. When \(\epsilon = 0\), the predicted score is the sum of the inner products of all features with item i, so item i could get a high score even if it matches only a few words. This may result in retrieving more irrelevant items than relevant ones since too few words are insufficient to retrieve relevant items. We believe an appropriate value of \(\epsilon\) lies somewhere between 0 and 1, which is confirmed by the empirical evidence in Section 4.3.
3.3 Learning Objectives
This section elaborates our model’s learning objectives for both recommendation and interpretability. We discuss several properties needed to output relevant interpretation as presented in [48].
Objective Function. For Interec-AE, we use a weighted binary cross-entropy loss for optimization:
\[ \mathcal {L}^{AE} = -\frac{1}{\mathcal {B}} \sum _{u} \sum _{i} {\bf C}_{ui} \big [ {\bf r}_{ui} \log {\bf o}_{ui} + (1 - {\bf r}_{ui}) \log (1 - {\bf o}_{ui}) \big ], \qquad (12) \]
where \({\bf C}_{ui} = 1\) if \({\bf r}_{ui} = 1\) and \({\bf C}_{ui} = 0.01\) otherwise for all datasets, following [32, 62], and \(\mathcal {B}\) is the batch size. For Interec-DAE, whose decoder uses softmax, we adopt the multinomial cross-entropy loss
\[ \mathcal {L}^{DAE} = -\frac{1}{\mathcal {B}} \sum _{u} \sum _{i} {\bf r}_{ui} \log {\bf o}_{ui}. \qquad (13) \]
For Interec-DirectAU, the objective includes two terms, alignment and uniformity. While alignment encourages the representation of a user and the representation of her adopted item to be close, uniformity encourages discrimination among user and item representations. With \(\gamma\) a hyper-parameter controlling the influence of uniformity,
\[ \mathcal {L}^{DirectAU} = \mathbb {E}_{(u,i) \sim p_{pos}} \big \Vert f(u) - f(i) \big \Vert ^2 + \frac{\gamma }{2} \Big (\log \mathbb {E}_{u, u^{\prime } \sim p_{user}} e^{-2\Vert f(u) - f(u^{\prime })\Vert ^2} + \log \mathbb {E}_{i, i^{\prime } \sim p_{item}} e^{-2\Vert f(i) - f(i^{\prime })\Vert ^2}\Big ), \qquad (14) \]
in which \(f(u)\) and \(f(i)\) are unit-length normalizations of the user representation \(\tilde{{\bf z}}_u\) and the item representation \({\bf W}^{dec}_i\), respectively. \(p_{pos}, p_{user}, p_{item}\) are the distributions of positive (observed) user-item interactions, users, and items, respectively. \(u^{\prime }\) and \(i^{\prime }\) are other users and items in the same batch as u and i.
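For concreteness, below are hedged sketches of the weighted binary cross-entropy loss and the DirectAU-style alignment/uniformity loss; the exact reductions and batching used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def weighted_bce(o_u, r_u, c_neg=0.01):
    """Equation (12): weighted BCE with C_ui = 1 for observed, 0.01 otherwise."""
    C = torch.where(r_u > 0, torch.ones_like(r_u), torch.full_like(r_u, c_neg))
    return F.binary_cross_entropy(o_u, r_u, weight=C, reduction="sum") / r_u.shape[0]

def directau_loss(z_user, z_item, gamma=1.0):
    """Equation (14): alignment + gamma * uniformity over a batch of (u, i) pairs."""
    u = F.normalize(z_user, dim=-1)           # f(u): unit-length user vectors
    i = F.normalize(z_item, dim=-1)           # f(i): unit-length item vectors
    align = (u - i).pow(2).sum(-1).mean()     # E ||f(u) - f(i)||^2
    uniform = (torch.pdist(u).pow(2).mul(-2).exp().mean().log()
               + torch.pdist(i).pow(2).mul(-2).exp().mean().log()) / 2
    return align + gamma * uniform
```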
Examining multiple variants of the recommender and their associated learning objectives gives us a broader view of the behavior of our proposed architecture. Minimizing these losses is equivalent to forcing the predicted rating \({\bf o}_u\) for user u to be close to the ground truth values \({\bf r}_{u}\). Compared to the SLI framework in Equation (1), our objective includes only one term for both recommendation and interpretability. We now explain this design and examine several other properties discussed in [48] that are encoded in Equation (12), Equation (13), and Equation (14).
Fidelity to Output. This property requires the interpreter g to be close to the recommender f. In [48], the authors impose a regularization for this property by minimizing the cross-entropy between the outputs of g and f, leading to another term in the loss function. Here, we implicitly impose this property in Equation (12), Equation (13), and Equation (14): recall from Equation (4) that the predicted rating \({\bf o}_u\) is composed of user representations from both the recommender f, i.e., \(f^{AE}\) or \(f^{DAE}\), and the interpreter g. Therefore, minimizing Equation (12), Equation (13), or Equation (14) forces f and g to converge to the same objective, namely the observed rating \({\bf r}_u\).
Conciseness of Interpretation. A small number of attributes is expected for interpretation because it is easier for a human to grasp user preferences from fewer generated words. In addition, focusing on a smaller number of words implicitly forces the model to learn to choose illustrative words rather than less representative ones. Our model encodes this property in Equation (8).
Diversity of Interpretation. Diversity encourages different attributes to be generated for different randomly selected input samples. After learning the DAE described in Equation (7), \({\bf K}\) is fixed. Therefore, in Equation (6), \(g({\bf z}_u) \ne g({\bf z}_{u^{\prime }})\) for \({\bf r}_u \ne {\bf r}_{u^{\prime }}\); in other words, users with different sets of interacted items have different local interpretations. We empirically found that this idea works well and leave optimization-based methods as in [48] for future work. Note that optimization-based methods introduce a new term into the loss function, making convergence harder, whereas our design keeps a single objective function.
Fidelity to Input requires the attributes stored in \({\bf K}\) to be related to the input, i.e., the list of adopted items. As elaborated in Section 3.1.2, training the denoising autoencoder in Equation (7) inherently captures relationships between words, i.e., attributes, and items. For the recommender f, the input is a list of adopted items, hence inherently related to the words stored in \({\bf K}\). Furthermore, since we leverage an autoencoder structure for the recommender f, enforcing Fidelity to Output is equivalent to imposing Fidelity to Input, because the input and output of the recommender f are both the list of adopted items per user.
Note that FLINT [48] introduces a new loss term for each of the above properties with corresponding hyper-parameters, making tuning considerably more involved. Finally, let the parameters of Interec be \(\Theta = \lbrace \Theta _1, \Theta _2\rbrace\), in which \(\Theta _1 = \lbrace {\bf K}, {\bf b}^{denc}, {\bf b}^{ddec}\rbrace\) are the parameters of the denoising autoencoder in Equation (7), while \(\Theta _2 = \lbrace \Theta _f, \Theta _g\rbrace\) are the parameters of the recommender f and the interpreter g. Learning the model boils down to learning its parameters \(\Theta\). Our training procedure includes two stages. In the first stage, we train the denoising autoencoder in Equation (7) to obtain \(\Theta _1\), and then fix these parameters for the second stage. In the second stage, we train \(\Theta _2\) with the loss defined in Equation (12) (or (13), (14), depending on the variant). Algorithm 1 presents the training procedure in more detail.
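A condensed sketch of this two-stage procedure (in the spirit of Algorithm 1) follows; the optimizer settings, the stage-1 reconstruction loss, and the helper names (`AttributeDictionaryDAE`-style `dae`, `weighted_bce` from the earlier sketches) are our assumptions.

```python
import torch
import torch.nn.functional as F

def train_interec(dae, X_tfidf, build_recommender, loader, epochs1=100, epochs2=200):
    # Stage 1: learn the attribute dictionary K (Theta_1) from item texts.
    opt1 = torch.optim.Adam(dae.parameters(), lr=1e-3)
    for _ in range(epochs1):
        loss = F.mse_loss(dae(X_tfidf), X_tfidf)  # reconstruction proxy for Eq. (7)
        opt1.zero_grad(); loss.backward(); opt1.step()
    with torch.no_grad():
        V_text = torch.tanh(X_tfidf @ dae.K + dae.b_denc)  # frozen text features

    # Stage 2: train Theta_2 = {recommender, interpreter}; K and V_text stay fixed.
    model = build_recommender(dae.K.detach(), V_text)
    opt2 = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs2):
        for r_u in loader:                         # batches of user rating vectors
            loss = weighted_bce(model(r_u), r_u)   # Eq. (12) for the AE variant
            opt2.zero_grad(); loss.backward(); opt2.step()
    return model
```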
4 Experiments
Our experiments seek to evaluate both recommendation accuracy and the interpretability of user preferences. We aim to answer the following research questions:
–
(RQ1) How does the proposed model Interec, which consists of recommender and interpreter, perform recommendation compared to existing baselines, including collaborative filtering and textual-aware models?
–
(RQ2) Is the interpreter in Interec able to generate textual units which well capture user’s preferences? How does interpreter affect the recommendation performance?
–
(RQ3) What are the effects of key components in recommender and interpreter on recommendation performance as well as the interpretation of user’s preferences?
Datasets. We consider CDs & Vinyl, Cell Phones, and Toys & Games, three categories of the Amazon dataset [17, 44]. We use the provided 5-core data², where each user and item has at least five reviews. For each item, we concatenate its title and description; the resulting text is referred to as the item's textual description/content. To branch out to a non-product dataset, we also use Citeulike-a³ [62], which associates users with academic articles; the item's textual description is the concatenation of the title and abstract of each article.
Preprocessing. We first remove HTML code (if any), then employ spaCy [21] to tokenize the text into single words. For each dataset, we only keep words with frequency higher than 5 that appear in less than 50% of textual descriptions, remove all stop words, and retain the most frequent words as the vocabulary, sized proportionately to the dataset. All items whose textual description is empty, i.e., contains no in-vocabulary word, are discarded together with their interactions with users. Table 3 shows the statistics of our data after preprocessing.
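A rough sketch of this pipeline using spaCy and scikit-learn is shown below; the frequency thresholds follow the text, while the spaCy model name and the remaining settings are assumptions.

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenize(text):
    # Keep alphabetic, non-stop-word tokens, lowercased.
    return [t.text.lower() for t in nlp(text) if t.is_alpha and not t.is_stop]

def build_tfidf(item_texts, vocab_size):
    vec = TfidfVectorizer(
        tokenizer=tokenize, token_pattern=None,
        min_df=5,                 # drop rare words (frequency threshold)
        max_df=0.5,               # drop words in over 50% of descriptions
        max_features=vocab_size,  # vocabulary sized to the dataset
    )
    X = vec.fit_transform(item_texts)  # (n_items, vocab_size) tf-idf matrix
    return X, vec.get_feature_names_out()
```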
Table 3. Statistics of the Datasets used in our Experiments

| Data | #users | #items | #interactions | #words |
|---|---|---|---|---|
| Cell Phones | 4,775 | 4,883 | 31,749 | 4,000 |
| Toys & Games | 9,466 | 7,964 | 80,919 | 8,000 |
| CDs & Vinyl | 38,406 | 39,260 | 524,459 | 10,000 |
| Citeulike-a | 5,551 | 16,980 | 204,986 | 8,000 |
Data Split. We construct the training, validation, and testing sets using the leave-one-out strategy following [20]. For the Amazon datasets, which provide timestamps, we first sort each user's interactions chronologically; the latest item of each user is added to the test set, the penultimate to the validation set, and the remaining items to the training set. For the Citeulike-a dataset, where the timestamp of each user-article interaction is not available, we randomly select one article per user for the validation set and one for the test set.
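A minimal sketch of this split, assuming each user has at least three interactions (which the 5-core data guarantees):

```python
import random

def leave_one_out(user_interactions, has_timestamps=True, seed=0):
    """user_interactions: {user: [(item_id, timestamp), ...]}"""
    rng = random.Random(seed)
    train, val, test = {}, {}, {}
    for u, items in user_interactions.items():
        if has_timestamps:
            items = sorted(items, key=lambda x: x[1])  # chronological order
        else:
            items = list(items); rng.shuffle(items)    # Citeulike-a: random picks
        ids = [i for i, _ in items]
        test[u], val[u], train[u] = ids[-1], ids[-2], ids[:-2]
    return train, val, test
```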
Baselines. We compare Interec against a series of baselines, including collaborative filtering and text-aware recommendation baselines on implicit feedback data.
Collaborative filtering models:
–
AutoRec [53] uses autoencoder for collaborative filtering. Interec-AE’s recommender is based on AutoRec.
–
CDAE [69] studies the recommendation problem through an autoencoder view and learns to recommend items from corrupted inputs. The recommender in Interec-DAE is based on CDAE.
–
NeuMF [20] combines generalized matrix factorization, which linearly models user/item latent feature interactions, and multi-layer perceptron to learn the interaction function between users and items.
–
LightGCN [18] improves Graph Convolutional Network for collaborative filtering by using linear propagation and weighted sum of multi-layered embeddings.
–
ENMF [7] proposes to learn a neural matrix factorization-based recommendation model without sampling via reformulating the loss function.
–
DirectAU [61] improves collaborative filtering by optimizing uniformity and alignment of user and item representations. The recommender in Interec-DirectAU is based on DirectAU.
Text-aware recommendation models:
–
CDL [62] proposes a probabilistic model that jointly learns a Stacked Denoising Autoencoder (SDAE) for text modeling and collaborative filtering.
–
CVAE [32] presents an approach similar to CDL but replaces the SDAE with a Variational Autoencoder (VAE).
–
GATE [42] leverages attention to model textual content and neighbor information to enrich item representation.
–
JSR [74] jointly predicts user-item interactions and reconstructs textual content.
As we aim at interpreting user preferences, we include user-oriented baselines, i.e., baselines incorporating textual content from the user's corpus on the user side. For recommendation, we compare Interec-AE, Interec-DAE, and Interec-DirectAU with both user-oriented and item-oriented competitors. Regarding interpretability, only user-oriented baselines are comparable with ours because they are able to generate a set of words representing user interests. For fair comparison, all models use the same word vocabulary as our proposed model.
Model Training. We use Nvidia Quadro RTX 8000 GPU machines for training with the Adam optimizer [29]. The learning rate is chosen from \(\lbrace\)0.0003, 0.001, 0.003, 0.005\(\rbrace\) and the dropout rate is chosen from \(\lbrace\)0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6\(\rbrace\). In Interec-AE, a dropout layer is added over \(\tilde{{\bf z}}_u\) before the decoding step; for the baselines, we follow the original architectures to place dropout layers. For the proposed models, the maximum number of training epochs is set to 200, and training stops after 12 epochs without improving HR@20 on the validation set. The average results over 10 runs with different random seeds are reported.
Hyper-parameters are first chosen on the validation set; we then re-train the models with the chosen values and report results on the test set. We set \(d = 50\) for the Citeulike-a and Cell Phones datasets and \(d = 100\) for the others. \(a = 1\) and \(b = 0.01\) are the weights for observed and unobserved interactions, respectively, in CDL, CVAE, and Interec-AE. For fair comparison, we initialize the word embedding matrix in JSR⁴ as \({\bf K}\). Table 4 presents the search space of hyper-parameters in more detail. Given this search space, we employ grid search for the baselines and Interec to choose the set of hyper-parameters achieving the best recommendation accuracy, i.e., HR@20, on the validation set.
Table 4. Search Space for Hyper-parameters in Baselines and the Proposed Interec
For Interec, more extensive hyper-parameter analysis is presented in Section 4.3.
Metrics. Hit Ratio at top-K (HR@K) and Normalized Discounted Cumulative Gain at top-K (NDCG@K) are employed for recommendation and retrieval-based interpretability evaluation:
\[ DCG@K = \sum _{r=1}^{K} \frac{rel_r}{\log _2(r+1)}, \qquad NDCG@K = \frac{DCG@K}{IDCG@K}, \]
where IDCG@K is the largest possible value of DCG@K, obtained by treating the sorted test set as a length-K prediction, and \(rel_r = 1\) indicates that the item at rank r is relevant, otherwise 0.
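Under the leave-one-out protocol above, each user has a single held-out relevant item, so IDCG@K = 1 and both metrics reduce to a simple computation; a sketch:

```python
import math

def hr_ndcg_at_k(ranked_items, held_out_item, k=20):
    """HR@K and NDCG@K for a single user with one held-out test item."""
    topk = ranked_items[:k]
    if held_out_item not in topk:
        return 0.0, 0.0
    rank = topk.index(held_out_item) + 1    # 1-indexed rank of the hit
    return 1.0, 1.0 / math.log2(rank + 1)   # HR@K = 1; NDCG@K = 1/log2(rank+1)
```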
4.1 Recommendation Evaluation
Table 5 reports top-N recommendation performance. We have the following observations.
Table 5. Recommendation Performance Comparison

| Dataset | Metric | AutoRec | CDAE | NeuMF | LightGCN | ENMF | DirectAU | CDL (I) | CVAE (I) | JSR (I) | GATE (I) | CDL (U) | CVAE (U) | JSR (U) | GATE (U) | Interec-AE | Interec-DAE | Interec-DirectAU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cell Phones | H@20 | 8.55 | 8.40 | 6.99 | 9.43 | 8.20 | 9.65 | 9.52 | 8.73 | 8.79 | 7.37 | 9.58 | 8.43 | 9.14 | 8.17 | \({\bf 10.57}^{\dagger }\) | 10.34 | 9.86 |
| | N@20 | 3.84 | 3.78 | 3.05 | 4.10 | 3.69 | 4.03 | 4.06 | 3.75 | 3.93 | 3.12 | 4.12 | 3.63 | 4.00 | 3.56 | \({\bf 4.46}^{\dagger }\) | 4.35 | 4.10 |
| | H@50 | 14.34 | 13.89 | 11.79 | 15.58 | 13.36 | 16.54 | 15.80 | 14.37 | 14.77 | 12.67 | 15.76 | 14.23 | 15.05 | 14.02 | \({\bf 17.78}^{\dagger }\) | 17.05 | 17.39 |
| | N@50 | 4.98 | 4.87 | 4.00 | 5.31 | 4.71 | 5.39 | 5.30 | 4.87 | 5.11 | 4.17 | 5.34 | 4.78 | 5.16 | 4.70 | \({\bf 5.88}^{\dagger }\) | 5.68 | 5.58 |
| Toys & Games | H@20 | 6.80 | 6.86 | 6.03 | 7.25 | 6.72 | 7.39 | 7.15 | 7.01 | 6.99 | 5.49 | 7.16 | 6.87 | 6.69 | 5.36 | 9.35 | 9.24 | \({\bf 9.49}^{\dagger }\) |
| | N@20 | 2.96 | 2.98 | 2.61 | 3.07 | 2.87 | 3.13 | 3.01 | 3.00 | 2.91 | 2.37 | 3.02 | 2.91 | 2.79 | 2.29 | \(\underline{3.85}\) | 3.90 | \({\bf 3.96}^{\dagger }\) |
| | H@50 | 11.13 | 11.08 | 9.87 | 11.99 | 10.66 | 12.30 | 11.66 | 11.44 | 11.25 | 8.98 | 11.78 | 11.35 | 11.10 | 9.02 | 15.39 | 15.38 | \({\bf 15.41}^{\dagger }\) |
| | N@50 | 3.82 | 3.81 | 3.37 | 4.00 | 3.65 | 4.10 | 3.90 | 3.88 | 3.75 | 3.06 | 3.93 | 3.69 | 3.66 | 3.01 | \(\underline{5.05}\) | 5.11 | \({\bf 5.13}^{\dagger }\) |
| CDs & Vinyl | H@20 | 8.80 | 9.62 | 7.00 | 9.02 | 8.66 | 10.30 | 7.54 | 9.10 | 5.88 | 7.70 | 7.70 | 8.83 | 5.86 | 7.16 | 9.36 | 9.56 | \({\bf 10.55}^{\dagger }\) |
| | N@20 | 3.71 | 4.12 | 2.92 | 3.83 | 3.65 | 4.39 | 3.17 | 3.81 | 2.40 | 3.25 | 3.25 | 3.75 | 2.38 | 3.01 | 3.97 | 4.10 | \({\bf 4.52}^{\dagger }\) |
| | H@50 | 14.48 | 15.59 | 11.96 | 14.80 | 14.20 | 16.63 | 12.67 | 15.06 | 10.30 | 12.79 | 12.69 | 14.54 | 10.61 | 11.95 | 15.27 | 15.56 | \({\bf 16.83}^{\dagger }\) |
| | N@50 | 4.84 | 5.30 | 3.90 | 4.97 | 4.75 | 5.65 | 4.18 | 4.99 | 3.27 | 4.25 | 4.23 | 4.88 | 3.32 | 3.95 | 5.14 | 5.29 | \({\bf 5.76}^{\dagger }\) |
| Citeulike-a | H@20 | 19.78 | 25.94 | 20.47 | 23.07 | 20.33 | 27.48 | 20.10 | 21.11 | 19.94 | 23.23 | 20.24 | 20.93 | 19.75 | 22.55 | 24.94 | \({\bf 28.47}^{\dagger }\) | 28.31 |
| | N@20 | 8.80 | 12.33 | 9.06 | 10.49 | 9.17 | 13.04 | 9.07 | 9.38 | 8.17 | 10.50 | 8.95 | 9.23 | 8.06 | 10.50 | 11.29 | \({\bf 13.83}^{\dagger }\) | 13.39 |
| | H@50 | 31.34 | 37.70 | 32.46 | 35.56 | 31.85 | 40.63 | 32.46 | 33.53 | 32.74 | 35.27 | 32.59 | 33.18 | 32.82 | 33.39 | \(\underline{38.10}\) | 41.01 | \({\bf 41.40}^{\dagger }\) |
| | N@50 | 11.08 | 14.66 | 11.44 | 12.95 | 11.45 | 15.64 | 11.51 | 11.84 | 10.70 | 12.89 | 11.38 | 11.66 | 10.65 | 12.65 | 13.89 | \({\bf 16.31}^{\dagger }\) | 15.98 |

Item-oriented (I) baselines equip textual content on the item side, while user-oriented (U) baselines incorporate texts on the user side. Among baseline models, the highest number is double underlined. Regarding our proposed variants, the highest number is boldfaced, the first runner-up is boldfaced and underlined, while the second runner-up is underlined. \(^{\dagger }\) denotes statistical significance between the boldfaced and the double underlined on a paired t-test with p-value < 0.01. H@K and N@K stand for Hit Ratio and Normalized Discounted Cumulative Gain at top K. The number unit is percentage (%).
–
The three variants of Interec, Interec-AE, Interec-DAE, and Interec-DirectAU, enjoy significantly better recommendation performance than all baselines on the four chosen datasets. This supports our design of jointly learning recommender and interpreter: the design of the decoder (Section 3.1) as well as the incorporation of the interpreter (Section 3.1.2) do not sacrifice recommendation accuracy. Rather, this approach brings performance gains on the chosen datasets.
–
For the autoencoder-based variants, the performance of Interec tracks the power of its recommender. Contrasting the numbers of Interec with those of AutoRec and CDAE, it can be seen that the relative performance between AutoRec and CDAE is reflected in a similar relation between Interec-AE and Interec-DAE. For example, on Cell Phones, AutoRec is better than CDAE and Interec-AE is likewise better than Interec-DAE. Regarding Interec-DirectAU, this also applies on Toys & Games and CDs & Vinyl, and for some specific metrics on the other two datasets.
–
Textual content is an important factor in improving recommendation performance on the chosen datasets. Firstly, this is evidenced by contrasting the numbers of AutoRec, CDAE, and DirectAU with those of Interec, which extends them by incorporating textual content into the item representations (in the decoder for the autoencoder-based variants) and the interpreter. Secondly, our variants Interec-AE and Interec-DAE achieve higher recommendation accuracy than LightGCN and CDL, even though AutoRec and CDAE, on which the recommenders are based, are worse than LightGCN and CDL. Similarly, Interec-AE and Interec-DAE are also better than DirectAU on 3 out of 4 datasets, despite AutoRec and CDAE being worse than DirectAU. Thirdly, the text-aware recommendation models CDL, CVAE, and JSR, whether item-oriented or user-oriented, generally work better than collaborative filtering counterparts, e.g., AutoRec, NeuMF, ENMF. GATE does not work well on Cell Phones and Toys & Games, which we conjecture stems from the noisy and very long text sequences that GATE processes.
–
On CDs & Vinyl, the importance of textual content is model-dependent. While textual content helps improve AutoRec and DirectAU, it shows a slightly negative effect when incorporated into CDAE. This suggests careful design and inspection when applying our proposed approach to different recommenders.
–
Among the baselines, DirectAU stands out, achieving higher recommendation accuracy than all other baselines w.r.t. the chosen metrics on the four datasets, except NDCG@20 on Cell Phones. This is explained by the design of DirectAU's learning objective, which has been shown to be powerful for collaborative filtering. Next, LightGCN generally performs well across datasets, which is attributed to its high-order connection modeling. While there is a small gap between the performance of CDAE and AutoRec on Cell Phones, Toys & Games, and CDs & Vinyl, this gap is much bigger on Citeulike-a. We empirically found that the cross-entropy loss used when training CDAE helps achieve favorable performance on some specific datasets.
In what follows, we further analyze the performance change w.r.t. the presence of \({\bf V}^{text}\) and interpreter in Table 6.
Table 6. Recommendation Performance when Removing the Interpreter and \({\bf V}^{text}\) from Interec

| Dataset | Metric | Interec-AE full | AE w/o interpreter | AE w/o \({\bf V}^{text}\) | Interec-DAE full | DAE w/o interpreter | DAE w/o \({\bf V}^{text}\) | Interec-DirectAU full | DirectAU w/o interpreter | DirectAU w/o \({\bf V}^{text}\) |
|---|---|---|---|---|---|---|---|---|---|---|
| Cell Phones | H@20 | 10.57 | 10.33 | 8.96 | 10.34 | 10.00 | 8.73 | 9.86 | 9.58 | 10.10 |
| | N@20 | 4.46 | 4.40 | 3.99 | 4.35 | 4.21 | 3.91 | 4.10 | 3.83 | 4.30 |
| | H@50 | 17.78 | 17.54 | 14.80 | 17.05 | 16.77 | 14.43 | 17.39 | 16.90 | 17.34 |
| | N@50 | 5.88 | 5.83 | 5.14 | 5.68 | 5.54 | 5.03 | 5.58 | 5.27 | 5.72 |
| Toys & Games | H@20 | 9.35 | 9.09 | 6.80 | 9.24 | 9.37 | 6.50 | 9.49 | 9.50 | 7.43 |
| | N@20 | 3.85 | 3.78 | 2.95 | 3.90 | 3.94 | 2.84 | 3.96 | 3.97 | 3.14 |
| | H@50 | 15.39 | 15.13 | 10.98 | 15.38 | 15.03 | 10.65 | 15.41 | 15.44 | 12.00 |
| | N@50 | 5.05 | 4.98 | 3.77 | 5.11 | 5.06 | 3.66 | 5.13 | 5.14 | 4.04 |
| CDs & Vinyl | H@20 | 9.36 | 9.09 | 9.10 | 9.56 | 9.45 | 9.56 | 10.55 | 10.56 | 10.49 |
| | N@20 | 3.97 | 3.82 | 3.87 | 4.10 | 4.03 | 4.07 | 4.52 | 4.52 | 4.50 |
| | H@50 | 15.27 | 15.05 | 14.89 | 15.56 | 15.45 | 15.50 | 16.83 | 16.85 | 16.65 |
| | N@50 | 5.14 | 5.00 | 5.01 | 5.29 | 5.21 | 5.25 | 5.76 | 5.76 | 5.72 |
| Citeulike-a | H@20 | 24.94 | 22.60 | 23.45 | 28.47 | 28.28 | 26.27 | 28.31 | 28.13 | 27.69 |
| | N@20 | 11.29 | 10.10 | 10.69 | 13.83 | 13.61 | 12.53 | 13.39 | 13.37 | 13.22 |
| | H@50 | 38.10 | 35.72 | 35.30 | 41.01 | 40.88 | 37.60 | 41.40 | 41.05 | 40.85 |
| | N@50 | 13.89 | 12.69 | 13.04 | 16.31 | 16.10 | 14.78 | 15.98 | 15.94 | 15.83 |

The unit of the numbers is percentage (%).
–
For the autoencoder-based variants, Interec-AE and Interec-DAE, a significant performance degradation is observed on Cell Phones, Toys & Games, and Citeulike-a when \({\bf V}^{text}\) is not present. This suggests textual content is the key factor in resolving sparsity to achieve better recommendation accuracy. On CDs & Vinyl, where textual content is not an important factor for user-item interactions, textual content has only a slight positive effect on Interec-AE and Interec-DAE. For Interec-DirectAU, we observe the same trend on Toys & Games, CDs & Vinyl, and Citeulike-a. Interestingly, when removing \({\bf V}^{text}\) on Cell Phones, we observe a slight increase in model performance. This might stem from the learning objective of Interec-DirectAU, where the normalized item textual representation does not align well with the normalized collaborative filtering item representation.
–
When removing the interpreter from Interec-AE, we observe a small performance drop on Cell Phones, Toys & Games, and CDs & Vinyl, and a significant degradation on Citeulike-a. This shows that the interpreter is able to generalize the user representation to bring performance gains, particularly on the Citeulike-a dataset. For Interec-DAE, we also observe a negative effect when removing the interpreter on the majority of datasets. A special case is Toys & Games, where removing the interpreter has a positive effect on top-20 but a negative effect on top-50 metrics; this suggests that the interpreter's output helps discover more relevant items for the user but ranks them lower in the list. Regarding Interec-DirectAU, while the interpreter has a tiny influence on recommendation accuracy on Toys & Games and CDs & Vinyl, it clearly boosts the recommendation performance on Cell Phones and Citeulike-a.
–
Contrasting the numbers of Interec without \({\bf V}^{text}\) in Table 6 (i.e., the model still includes the interpreter) with those of AutoRec, CDAE, and DirectAU in Table 5, we find that the interpreter brings performance gains even without \({\bf V}^{text}\) on Cell Phones, Citeulike-a, and CDs & Vinyl. This is supporting evidence for the generalization brought by the interpreter on specific datasets. When both the interpreter and \({\bf V}^{text}\) are present in our unified Interec, the gap between Interec and AutoRec, CDAE, and DirectAU is further enlarged, showing that textual content is not only helpful for recommendation performance but also beneficial for the interpreter, i.e., it assists the interpreter in discovering relevant words for higher recommendation performance.
–
The intuition behind the interpreter is as follows. Existing works on interpretability, e.g., FLINT [48], often face a tradeoff between accuracy and interpretability since they force the output of the recommender (or predictor) to be close to that of the interpreter; this design is less effective as the interpreter might not be good at performing the recommender's target task. To resolve this, we design the interpreter to predict the same target as the recommender, which reinforces the interpreter's ability on the target task (recommendation in our case). Hence, the interpreter is good not only at interpretability but also at recommendation. Additionally, our use of a key-value memory network generalizes the user representation by attending to words outside the content of the user's interacted items. These out-of-corpus words can retrieve relevant items that the user is interested in yet has not interacted with, leading to better accuracy.
4.2 Interpretability Evaluation
For the interpretability evaluation, we focus on Interec-AE, as this variant works well across datasets and keeps the article focused. We closely follow [13] to design the interpretability evaluation. Two types of evaluation are applicable to our case, namely human-grounded metrics and functionally-grounded evaluation.
4.2.1 Human-grounded Evaluation.
The goal is to conduct a simpler experiment that maintains the essence of the target application [13], which is the recommender system in this article. Since we leverage words as the means of interpretation, the experiment should reflect how humans comprehend a user's list of adopted items and match their comprehension with words. Therefore, we engage 10 participants who are not the authors of this article and are not aware of the research objectives; some have a Computer Science (CS) background while others are non-CS majors. In this article, we refer to participants as the humans who help us judge the quality of generated words, while users, unless specified otherwise, are from the chosen datasets. We randomly select 20 users from the 4 chosen datasets. For each user, we collect the list of their adopted items' titles, which are short sentences describing the main content/features of the items. A set of 30 words, the outputs of Interec-AE, GATE, and JSR⁵, is coupled with each user's associated list of titles. Each participant goes through the list of titles and the set of generated words for each user, and chooses any word(s) that match their comprehension of the list of titles. We regard participants' choices as ground truth and the generated words from each model as predictions. Let \({\bf D}^{g}\) and \({\bf D}^p\) be the lists of ground truth words and predicted words for each user, respectively. The metrics are Precision (PR), Recall (RE), and Mean Reciprocal Rank (MR):
\[ PR = \frac{|{\bf D}^g \cap {\bf D}^p|}{|{\bf D}^p|}, \qquad RE = \frac{|{\bf D}^g \cap {\bf D}^p|}{|{\bf D}^g|}, \qquad MR = \frac{1}{|{\bf D}^g|} \sum _{w \in {\bf D}^g} \frac{\mathbb {1}[w \in {\bf D}^p]}{rank^{{\bf D}^p}(w)}, \]
in which \(|\cdot |\) is the cardinality, \(rank^{L}(w)\) returns the rank of element w in list L, and \(\mathbb {1}[x]\) returns 1 if x is true and 0 otherwise. The reported numbers are calculated per participant and averaged over 20 samples, as shown in Table 7. Evidently, our proposed model outperforms GATE and JSR convincingly, which is also the consensus among the majority of participants.
Table 7. Human-grounded Interpretability Evaluation. Reported numbers per participant are averaged over 20 samples. Bold numbers are the best results while the runner-up is underlined. \(^{\dagger }\) denotes statistical significance w.r.t. the second best number on a paired t-test with p-value < 0.05. The unit of reported numbers is percentage (%).
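Under our reading of these definitions, the metrics could be computed per user as in the following sketch (`D_g`: participant-chosen words, `D_p`: a model's word list ordered by score):

```python
def precision_recall_mrr(D_g, D_p):
    """Human-grounded metrics for one user; D_p is ordered by model score."""
    hits = [w for w in D_p if w in set(D_g)]
    pr = len(hits) / len(D_p)                 # precision of predicted words
    re = len(hits) / len(D_g)                 # recall of ground-truth words
    # Reciprocal ranks of ground-truth words within the predicted list.
    mr = sum(1.0 / (D_p.index(w) + 1) for w in hits) / len(D_g)
    return pr, re, mr
```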
4.2.2 Functionally-grounded Evaluation.
Since human-grounded evaluation is costly, we also seek a functionally-grounded evaluation. This method requires a formal definition of interpretability as a proxy for quality evaluation [13]. Since we characterize the interpretability of user preferences using words from items' content, these words can intuitively be employed as a means to retrieve items that fit the user's needs. Hence, we formalize the proxy as a retrieval task: the generated words form a query to retrieve items based on the similarity between the query and items' textual content. The top items with the highest similarity scores are presented for each user.
The retrieval task involves a query q, a document D, and a retrieval function h. Query q consists of 10 cognitive chunks, i.e., single words, generated by each model, while the textual description of each item is treated as document D. For user-oriented CDL, CVAE, and JSR, the 10 words with the highest predicted scores from the user's text modeling component are taken to form q; these words intuitively reflect user preferences since their predicted scores are based on the textual representation, which regularizes the user representation. For user-oriented GATE, q is created from the 10 words with the highest attention scores, following the original article. For Interec, we follow Definition 3.1 to create q with \(k = 10\); more values of k are studied in Table 9. The retrieval function is \(h({\bf q}, {\bf D}) = \text{agg}_{w \in {\bf D}} \max _{w^{\prime } \in {\bf q}} \frac{\langle {\bf e}^{w^{\prime }}, {\bf e}^w \rangle }{|{\bf e}^{w^{\prime }}| \cdot |{\bf e}^w|}\), where w and w' are words in document D and query q, respectively, and \({\bf e}^{w}\) and \({\bf e}^{w^{\prime }}\) are the embeddings of words w and w'. \(\max _{w^{\prime } \in {\bf q}} \frac{\langle {\bf e}^{w^{\prime }}, {\bf e}^w \rangle }{|{\bf e}^{w^{\prime }}| \cdot |{\bf e}^w|}\) follows Equation (2) in [27], which measures the semantic similarity of term w with respect to the short text q. By leveraging distributed vector representations [45], the vocabulary mismatch problem is alleviated. agg is an aggregation function.⁶ Since our method and the competitors have different notions of word embeddings, we leverage Word2Vec in Gensim [49] to obtain word embeddings; the output of the retrieval function is therefore not biased towards any competitor or Interec. We train Word2Vec for 500 epochs on the corpus of items' textual descriptions with a window size of 5, an embedding dimension of 100, and 5 negative samples.
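A sketch of this retrieval scoring with Gensim Word2Vec embeddings follows; mean aggregation for `agg` is our assumption, as the aggregation function is left unspecified in the footnote.

```python
import numpy as np
from gensim.models import Word2Vec

def retrieval_score(query_words, doc_words, w2v):
    """h(q, D): aggregate, over document words, each word's max cosine
    similarity to any query word."""
    q = [w2v.wv[w] for w in query_words if w in w2v.wv]
    scores = []
    for w in doc_words:
        if w in w2v.wv and q:
            e_w = w2v.wv[w]
            sims = [np.dot(e, e_w) / (np.linalg.norm(e) * np.linalg.norm(e_w))
                    for e in q]
            scores.append(max(sims))  # best-matching query word for w
    return float(np.mean(scores)) if scores else 0.0

# Training setup from the text (Gensim 4 API):
# w2v = Word2Vec(tokenized_item_texts, vector_size=100, window=5, negative=5, epochs=500)
```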
Table 8. Retrieval-based Functionally-grounded Interpretability Evaluation with k = 10, i.e., the Query Contains 10 Words

| Dataset | Metric | CDL | CVAE | JSR | GATE | Interec-AE |
| --- | --- | --- | --- | --- | --- | --- |
| Cell Phones | HR@20 | 32.57 | 23.06 | 39.50 | 31.27 | \({\bf 42.49}^{\dagger}\) |
| | NDCG@20 | 11.88 | 7.49 | 15.63 | 11.88 | \({\bf 17.04}^{\dagger}\) |
| | HR@50 | 66.30 | 52.98 | 70.83 | 62.80 | \({\bf 72.09}^{\dagger}\) |
| | NDCG@50 | 18.52 | 13.36 | 21.82 | 18.08 | \({\bf 22.89}^{\dagger}\) |
| Toys & Games | HR@20 | 40.31 | 26.81 | 42.37 | 37.20 | \({\bf 46.19}^{\dagger}\) |
| | NDCG@20 | 17.96 | 9.28 | 18.95 | 15.98 | \({\bf 21.79}^{\dagger}\) |
| | HR@50 | 67.38 | 55.41 | 68.74 | 65.87 | \({\bf 71.74}^{\dagger}\) |
| | NDCG@50 | 23.28 | 14.90 | 24.14 | 21.62 | \({\bf 26.82}^{\dagger}\) |
| CDs & Vinyl | HR@20 | 28.71 | 29.35 | 37.66 | 31.25 | \({\bf 47.68}^{\dagger}\) |
| | NDCG@20 | 10.55 | 10.82 | 13.57 | 10.76 | \({\bf 18.10}^{\dagger}\) |
| | HR@50 | 63.14 | 63.01 | 72.28 | 66.25 | \({\bf 78.08}^{\dagger}\) |
| | NDCG@50 | 16.57 | 17.41 | 20.41 | 17.65 | \({\bf 24.13}^{\dagger}\) |
| Citeulike-a | HR@20 | 71.13 | 19.27 | 80.72 | 53.08 | \({\bf 83.05}^{\dagger}\) |
| | NDCG@20 | 34.11 | 6.58 | 42.23 | 22.83 | \({\bf 44.47}^{\dagger}\) |
| | HR@50 | 91.10 | 50.35 | 94.93 | 82.29 | \({\bf 95.52}^{\dagger}\) |
| | NDCG@50 | 38.11 | 12.65 | 45.08 | 28.63 | \({\bf 46.99}^{\dagger}\) |
The boldfaced and underlined numbers are the highest and the runner-up across models. Statistically significant numbers based on a paired t-test are marked by \(^{\dagger}\) (p-value < 0.05). Unit of reported numbers is percentage (%).
Table 9. Retrieval-based Functionally-grounded Interpretability Evaluation with Varying Number of Words k in the Query

| Dataset | Number of words k | CDL | CVAE | JSR | GATE | Interec-AE |
| --- | --- | --- | --- | --- | --- | --- |
| Cell Phones | 5 | 33.79 | 23.48 | 38.91 | 29.62 | \({\bf 40.44}^{\dagger}\) |
| | 10 | 32.57 | 23.06 | 39.50 | 31.27 | \({\bf 42.49}^{\dagger}\) |
| | 15 | 32.39 | 23.37 | 39.74 | 33.19 | \({\bf 42.96}^{\dagger}\) |
| | 20 | 32.17 | 23.71 | 39.97 | 33.96 | \({\bf 43.24}^{\dagger}\) |
| Toys & Games | 5 | 39.16 | 26.73 | 41.73 | 34.37 | \({\bf 45.20}^{\dagger}\) |
| | 10 | 40.31 | 26.81 | 42.37 | 37.20 | \({\bf 46.19}^{\dagger}\) |
| | 15 | 40.39 | 27.25 | 42.69 | 38.44 | \({\bf 46.80}^{\dagger}\) |
| | 20 | 40.41 | 27.30 | 42.87 | 39.28 | \({\bf 47.20}^{\dagger}\) |
| CDs & Vinyl | 5 | 29.12 | 29.94 | 37.18 | 30.35 | \({\bf 45.85}^{\dagger}\) |
| | 10 | 28.71 | 29.35 | 37.66 | 31.25 | \({\bf 47.68}^{\dagger}\) |
| | 15 | 28.98 | 29.30 | 37.74 | 33.07 | \({\bf 48.69}^{\dagger}\) |
| | 20 | 28.99 | 29.16 | 37.79 | 34.28 | \({\bf 49.47}^{\dagger}\) |
| Citeulike-a | 5 | 70.95 | 19.41 | 80.17 | 46.97 | \({\bf 80.91}^{\dagger}\) |
| | 10 | 71.13 | 19.27 | 80.72 | 53.08 | \({\bf 83.05}^{\dagger}\) |
| | 15 | 71.50 | 19.91 | 82.29 | 56.13 | \({\bf 83.80}^{\dagger}\) |
| | 20 | 71.47 | 20.10 | 82.45 | 57.93 | \({\bf 83.79}^{\dagger}\) |
We report HR@20; the same trend holds for other metrics. The boldfaced and underlined numbers are the highest and the runner-up across models. Statistically significant numbers based on a paired t-test are marked by \(^{\dagger}\) (p-value < 0.05). Unit of reported numbers is percentage (%).
The results are reported in Table 8. Firstly, Interec performs competitively and obtains the best performance on all datasets. This affirms the role of a dedicated interpreter in generating highly relevant words for interpreting user preferences. The high retrieval performance of our model is attributed to the design of the objective function in Section 3.3: the output of the interpreter participates in predicting the ground-truth rating of the user. As such, the generated words describe user preferences better than those of CDL, CVAE, and JSR, since these models do not explicitly align generated words with actual user ratings. Secondly, among the baseline models, JSR performs consistently well, better than CDL and CVAE on all datasets. This suggests that jointly modeling user interactions and textual content is a better choice for interpretability than the regularization approach in CDL and CVAE, which requires carefully designing a tradeoff between recommendation and text reconstruction.
Thirdly, the performance of GATE, which is lower than ours, may stem from two causes: noise in the text of the Amazon datasets, and the fact that GATE processes text as a sequence, which becomes very long when a user's adopted item texts are concatenated. On the Citeulike-a dataset, where textual content is less noisy, GATE performs better. We further examine retrieval performance with respect to varying numbers of words in the query in Table 9, reporting HR@20; other metrics show the same trend. We make two observations. First, our proposed model achieves the best retrieval performance across all query sizes. Second, increasing the number of words in the query generally improves retrieval performance, since more words relevant to the user's preferences are used. In conclusion, the empirical evidence showcases the power of our interpreter in discovering human-comprehensible attributes, i.e., words, to interpret users' interests behind their interactions with items.
We argue that the described functionally-grounded evaluation already reflects fidelity, a widely used interpretability criterion [8, 48, 51]. Existing models [8, 48, 51] interpret model predictions, so they measure how well the interpreter approximates the black-box model's predictions. Similarly, as our target is to interpret user preferences, we measure the quality of the generated words in capturing user preferences through retrieving relevant items for the user in a retrieval task.
We further examine the effect of architecture design on interpretability in Table 10. The key observation is that removing \({\bf V}^{text}\) or the interpreter significantly degrades performance. \({\bf V}^{text}\) encourages \({\bf z}_u\) to capture user preferences from textual signals, resulting in better interpretation, while training the interpreter to predict user-item interactions reinforces it to choose words that capture user interests well.
Table 10. Retrieval-based Functionally-grounded Interpretability Evaluation when Removing the Interpreter and Removing \({\bf V}^{text}\)

| Dataset | Metric | Interec-AE (full model) | Without interpreter | Without \({\bf V}^{text}\) |
| --- | --- | --- | --- | --- |
| Cell Phones | HR@20 | 42.49 | 37.70 | 18.72 |
| | NDCG@20 | 17.04 | 14.80 | 6.41 |
| | HR@50 | 72.09 | 67.19 | 48.55 |
| | NDCG@50 | 22.89 | 20.61 | 12.23 |
| Toys & Games | HR@20 | 46.19 | 43.95 | 19.25 |
| | NDCG@20 | 21.79 | 20.00 | 6.69 |
| | HR@50 | 71.74 | 70.09 | 49.12 |
| | NDCG@50 | 26.82 | 25.15 | 12.52 |
| CDs & Vinyl | HR@20 | 47.68 | 41.32 | 16.96 |
| | NDCG@20 | 18.10 | 14.91 | 5.58 |
| | HR@50 | 78.08 | 73.71 | 50.16 |
| | NDCG@50 | 24.13 | 21.32 | 12.05 |
| Citeulike-a | HR@20 | 83.05 | 80.81 | 18.78 |
| | NDCG@20 | 44.47 | 41.45 | 6.56 |
| | HR@50 | 95.52 | 94.70 | 48.12 |
| | NDCG@50 | 46.99 | 44.25 | 12.28 |
Unit of reported numbers is percentage (%).
4.2.3 Qualitative Analysis of Interpretability.
We present a qualitative analysis of Interec-AE, GATE and JSR based on the words these models infer for interpreting user preferences in Table 11. We show two users, namely 1873 and A1FT98A06ZE4EQ, listing in the first column the titles of the items used in the training phase, followed by the top-10 words generated by the considered models. The top words with the highest \(tf-idf\) scores are included for contrast.
Table 11. Examples of Inferred Words for a User in the Citeulike-a Dataset (ID: 1873) in the First Row and a User in the Cell Phones Dataset (ID: A1FT98A06ZE4EQ) in the Second Row

| Titles of Adopted Items | Interec | JSR | GATE | \({\bf Tf-idf}\) |
| --- | --- | --- | --- | --- |
| 1. A Brief Survey of Web Data Extraction Tools 2. A Tutorial on Support Vector Machines for Pattern Recognition 3. Adaptive information extraction 4. Automatic web news extraction using tree edit distance 5. A Survey of Web Information Extraction Systems 6. Relational Learning of Pattern-Match Rules for Information Extraction 7. BoosTexter: A Boosting-based System for Text Categorization 8. Pattern Recognition and Machine Learning (Information Science and Statistics) | learning machine mining web extraction text training classification semantic recognition | learning web semantic training machine extraction task search tasks classification | calculus good effort shallow hope structures correctly article unfortunately based | extraction ie web data learning dimension text machine categorization pattern |
| 1. Generiks TM iPhone 4&4S ANTI-FINGERPRINT/ANTI-GLARE Screen Protectors 2. Generiks TM iPhone 4/4S *CLEAR* Screen Protectors 3. Snap-on Rubber Coated Case for Apple iPhone 4 4S 4GS 4G AT&T/Verizon, Pink/Black 4. Deluxe AT&T Verizon White For Iphone 4 4S 4G Case Cover with Kickstand 5. 3d Hello Kitty Pink Ribbon Case/cover/protector Fits All Models of Iphone 4 & 4s 6. Leegoal Lightweight Hybrid Bumper Skin Back Case Cover for iPhone 5 5G Pink | apple iphone kitty anti glare pink hello screen protectors matte | iphone amp apple pink hot verizon white sprint back stand | iphone pink protectors cover name deluxe verizon at&t g ribbon | — |

Underlined words are outside the user's adopted items' texts.
– For these two users, some words produced by GATE are quite general, e.g., unfortunately, based or adds, which makes it difficult to understand user preferences. In contrast, the words from JSR and \(tf-idf\) are somewhat more relevant.
– In some cases Interec-AE identifies relevant words not discovered by JSR. For example, Interec-AE discovers text and mining, which reflect one of the interests of the first user (1873). For the second user (A1FT98A06ZE4EQ), it seems that she bought items to protect her phone; Interec discovers related words, e.g., screen and protectors.
– Interec and JSR are more generalizable than \(tf-idf\) through their ability to discover preference words beyond the user's adopted items' texts, e.g., classification or matte.
We further show the generated words as word clouds in Figures 2 and 3. In these figures, the bigger a word, the higher its score predicted by Interec-AE; position and color are set randomly. A human can easily grasp the user's topic of interest from each word cloud. This analysis, based on a few examples, is not meant to be a formal comparison per se. Rather, it illustrates some of the qualitative differences underlying the quantitative comparison of interpretability presented in the previous tables.
Fig. 2 and Fig. 3. Word clouds of generated words for sample users; word size is proportional to the score predicted by Interec-AE.
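For reference, a word cloud of this kind could be produced as follows; the `wordcloud` package, the file name, and the scores shown are illustrative assumptions (only the words are taken from Table 11), not the authors' plotting code.

```python
from wordcloud import WordCloud

# Hypothetical interpreter scores for words of user 1873 (Table 11);
# word size follows the score, position and color are left random.
scores = {"learning": 0.94, "machine": 0.91, "mining": 0.85,
          "web": 0.80, "extraction": 0.78, "text": 0.74}

cloud = WordCloud(width=600, height=400, background_color="white")
cloud.generate_from_frequencies(scores)
cloud.to_file("user_1873_wordcloud.png")  # assumed output path
```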
4.2.4 Diversity of Interpretability.
To verify that our proposed approach achieves diversity of interpretability, we visualize the activated word scores of randomly selected users in Figure 4. Each user exhibits her own activation pattern; concretely, the top-scoring words differ from one user to another. This confirms our model's ability to produce a diverse set of words to interpret each user's preferences.
Fig. 4. Activated word scores of randomly selected users.
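One simple way to quantify the diversity visualized in Figure 4 is the pairwise Jaccard overlap between users' top-k word sets; the sketch below is an illustrative check with hypothetical word lists (the first two echo Table 11), not a procedure from the article.

```python
import itertools

def jaccard(a, b):
    """Jaccard overlap between two word sets; low values indicate diversity."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

top_words = {  # hypothetical top-5 word lists for three sampled users
    "u1": ["learning", "web", "extraction", "text", "mining"],
    "u2": ["iphone", "screen", "protectors", "pink", "case"],
    "u3": ["guitar", "vinyl", "album", "jazz", "acoustic"],
}
for u, v in itertools.combinations(top_words, 2):
    print(u, v, round(jaccard(top_words[u], top_words[v]), 2))
```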
4.3 Architecture and Hyper-Parameter Analysis
We investigate the impact of architecture and hyper-parameter choices on the recommendation and interpretability objectives of Interec-AE, which achieves competitive performance on the four chosen datasets. Table 12 reports recommendation accuracy and the functionally-grounded evaluation of interpretability.
Table 12. Results of our Ablation Analysis

| Dataset | Metric | Interp. (1) | Interp. (2) | Interp. (3) | Interp. (4) | Interp. (5) | Rec. (1) | Rec. (2) | Rec. (3) | Rec. (4) | Rec. (5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cell Phones | HR@20 | 42.49 | 39.89 | 23.81 | 42.56 | 30.57 | 10.57 | 10.39 | 10.66 | 10.40 | 10.33 |
| | NDCG@20 | 17.04 | 15.84 | 8.66 | 16.99 | 11.61 | 4.46 | 4.45 | 4.52 | 4.44 | 4.43 |
| | HR@50 | 72.09 | 69.36 | 54.07 | 72.03 | 61.71 | 17.78 | 17.68 | 17.96 | 17.53 | 17.52 |
| | NDCG@50 | 22.89 | 21.65 | 14.58 | 22.82 | 17.73 | 5.88 | 5.89 | 5.96 | 5.84 | 5.85 |
| Toys & Games | HR@20 | 46.19 | 44.67 | 27.93 | 46.36 | 31.21 | 9.35 | 9.09 | 9.24 | 8.76 | 8.38 |
| | NDCG@20 | 21.79 | 20.56 | 10.69 | 21.69 | 12.66 | 3.85 | 3.78 | 3.84 | 3.70 | 3.56 |
| | HR@50 | 71.74 | 70.50 | 58.43 | 71.61 | 60.70 | 15.39 | 15.23 | 15.05 | 14.69 | 13.97 |
| | NDCG@50 | 26.82 | 25.64 | 16.67 | 26.67 | 18.44 | 5.05 | 4.99 | 4.99 | 4.88 | 4.67 |
| CDs & Vinyl | HR@20 | 47.68 | 43.55 | 17.64 | 47.33 | 19.77 | 9.36 | 9.15 | 9.33 | 9.33 | 9.26 |
| | NDCG@20 | 18.10 | 15.95 | 5.76 | 17.96 | 6.44 | 3.97 | 3.86 | 3.97 | 3.97 | 3.95 |
| | HR@50 | 78.08 | 75.33 | 51.47 | 78.00 | 54.50 | 15.27 | 15.11 | 15.36 | 15.24 | 15.20 |
| | NDCG@50 | 24.13 | 22.25 | 12.35 | 24.05 | 13.22 | 5.14 | 5.03 | 5.17 | 5.13 | 5.12 |
| Citeulike-a | HR@20 | 83.05 | 82.40 | 36.28 | 82.57 | 47.86 | 24.94 | 23.75 | 24.93 | 24.96 | 25.17 |
| | NDCG@20 | 44.47 | 43.20 | 15.05 | 44.10 | 21.92 | 11.29 | 10.67 | 11.30 | 11.41 | 11.40 |
| | HR@50 | 95.52 | 95.34 | 65.83 | 95.49 | 74.49 | 38.10 | 36.76 | 38.07 | 37.56 | 37.63 |
| | NDCG@50 | 46.99 | 45.80 | 20.86 | 46.71 | 27.18 | 13.89 | 13.24 | 13.91 | 13.90 | 13.87 |
"Interp." columns report interpretability (retrieval) and "Rec." columns report recommendation. Each column from (2)–(5) is a variant of Interec-AE: (1) Interec-AE; (2) sigmoid (in Equation (6)) replaced by standard softmax; (3) fine-tuning K; (4) fine-tuning \({\bf V}^{text}\); (5) fine-tuning both K and \({\bf V}^{text}\) (fine-tuning means that \({\bf K}\) and \({\bf V}^{text}\) are allowed to be updated in the second stage of Algorithm 1). Unit of reported numbers is percentage (%).
Sigmoid vs. Softmax in Equation (6). Column (2) in Table 12 shows that replacing sigmoid with standard softmax in Equation (6) causes a 1%–3% drop in retrieval performance on the Amazon subsets, larger than the drop on Citeulike-a; recommendation behaves similarly. Empirically, the interpreter equipped with sigmoid is more effective than standard softmax in our setting.
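The contrast can be made concrete with a small numerical example; the scores below are hypothetical, and Equation (6) itself is not reproduced here.

```python
import numpy as np

# Hypothetical pre-activation word scores for one user.
s = np.array([2.0, 1.8, 1.5, 0.2, -1.0])

sigmoid = 1.0 / (1.0 + np.exp(-s))     # each word activates independently
softmax = np.exp(s) / np.exp(s).sum()  # words compete for one unit of mass

print(np.round(sigmoid, 2))  # [0.88 0.86 0.82 0.55 0.27]: several strong words
print(np.round(softmax, 2))  # [0.38 0.31 0.23 0.06 0.02]: mass concentrates
```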
Effects of \(\epsilon\). Recall that \(\epsilon\) controls the conciseness of interpretability. We vary \(\epsilon\) in Equation (8) and report model performance in Figure 5. For the recommendation objective, the four datasets agree that the optimal value of \(\epsilon\) falls in the range \([0.5, 0.7]\). For the interpretability objective, on Cell Phones and CDs & Vinyl the optimal \(\epsilon\) is consistent with that for recommendation, while on Toys & Games and Citeulike-a retrieval performance peaks at \(\epsilon\) around 0.1 and 0.2, respectively. Overall, the results support our hypothesis in Section 3.2 that \(\epsilon\) lies between 0 and 1.
Fig. 5. Recommendation and retrieval performance when varying \(\epsilon\).
Fixing vs. Updating K and \({\bf V}^{text}\) in the second-stage training of Algorithm 1. We first fine-tune K and keep everything else at its default. Next, we freeze K and fine-tune \({\bf V}^{text}\). Finally, we fine-tune both K and \({\bf V}^{text}\). The results of these experiments are shown in columns (3)–(5) of Table 12. Generally, fine-tuning one or both embeddings degrades performance. We intuit that these embeddings carry useful independent textual signals that would be overridden by collaborative filtering signals if allowed to float.
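In PyTorch terms, the "fixing" option corresponds to freezing the two matrices before the second stage; the sketch below uses hypothetical sizes and only illustrates the freezing mechanics, not the authors' training code.

```python
import torch

# Hypothetical vocabulary size and embedding dimension; Algorithm 1 itself
# is not reproduced here.
K = torch.nn.Parameter(torch.randn(5000, 100))       # key matrix
V_text = torch.nn.Parameter(torch.randn(5000, 100))  # textual value matrix

# Fixed: excluded from gradient updates in the second training stage.
for emb in (K, V_text):
    emb.requires_grad_(False)

# The fine-tuning variants in columns (3)-(5) of Table 12 correspond to
# re-enabling gradients on K, on V_text, or on both.
```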
As analyzed in previous sections, the performance of Interec depends on that of the recommender. Therefore, in future applications of our proposed architecture, one should carefully choose the above design options and hyper-parameter values with respect to the chosen recommender.
Run Time Analysis. We briefly discuss the run time of Interec-AE vis-à-vis the user-oriented text-aware baselines. Figure 6 reports the number of seconds taken to run a single epoch for the examined models. Firstly, on the small dataset, Cell Phones, the running time of Interec-AE is close to those of CDL and CVAE and nearly half that of GATE; recall that GATE requires extra time to process neighbor information. Secondly, on the large dataset, CDs & Vinyl, Interec-AE maintains its time efficiency, whereas CDL and CVAE require extra time since they explicitly store a huge number of user representations; GATE performs similarly to CDL and faster than CVAE. Lastly, on both datasets, JSR is not time-efficient due to negative sampling, which increases the number of ratings and hence the running time.
Fig. 6. Run time (seconds per epoch) of Interec-AE and the user-oriented text-aware baselines.
5 Conclusion
We propose Interec, a novel unified architecture for jointly learning a neural recommender and an interpreter. Our work takes a new angle on existing content-aware recommendation models by employing textual content for interpreting user preferences in addition to alleviating sparsity. In particular, our model provides local interpretability of the user's preferences underlying her adoptions in terms of human-comprehensible attributes described by natural-language words. The means of doing so is a dedicated interpreter that relies on the user representation from the recommender. A key-value memory network implements the interpreter, leading to a generalized user representation that discovers words going beyond the user's interacted items' contents.
There are several research directions for future work to build upon Interec. The first is investigating the proposed architecture with a multi-interest recommender, which represents a user by multiple embedding vectors. The second is organizing textual content into a structure, e.g., via topic modeling, and leveraging structured units to interpret user preferences. Last but not least, other types of side information, e.g., knowledge graphs, are worth exploring to gain better insights into the user preferences underlying item adoptions.
Footnotes
1. Randomly zeroing an element with probability of 0.3.
– In the original article, the authors leverage relevance-based word embeddings, which are not available in our work.
5. These models represent major approaches in existing content-aware recommendation models, namely regularization (JSR), attention mechanism (GATE), and memory network-based interpreter (ours).
6. In our case, we consider sum and mean. We choose the aggregation function based on performance on the validation set and report numbers on the test set.
References
Tameem Adel, Zoubin Ghahramani, and Adrian Weller. 2018. Discovering interpretable representations for both deep generative and discriminative models. In Proceedings of the 35th International Conference on Machine Learning. 50–59.
Desheng Cai, Shengsheng Qian, Quan Fang, Jun Hu, and Changsheng Xu. 2023. User cold-start recommendation via inductive heterogeneous graph neural network. ACM Transactions on Information Systems 41, 3 (2023), 27 pages.
Chaofan Chen, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, and Cynthia Rudin. 2019. This looks like that: Deep learning for interpretable image recognition. In Proceedings of the 33rd International Conference on Neural Information Processing Systems.
Chong Chen, Min Zhang, Chenyang Wang, Weizhi Ma, Minming Li, Yiqun Liu, and Shaoping Ma. 2019. An efficient adaptive transfer neural network for social-aware recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 225–234.
Chong Chen, Min Zhang, Yongfeng Zhang, Yiqun Liu, and Shaoping Ma. 2020. Efficient neural matrix factorization without sampling for recommendation. ACM Transactions on Information Systems 38, 2 (2020), 28 pages.
Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. 2018. Learning to explain: An information-theoretic perspective on model interpretation. In Proceedings of the 35th International Conference on Machine Learning. 883–892.
Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 335–344.
Mengru Chen, Chao Huang, Lianghao Xia, Wei Wei, Yong Xu, and Ronghua Luo. 2023. Heterogeneous graph contrastive learning for recommendation. In Proceedings of the 16th ACM International Conference on Web Search and Data Mining. 544–552.
Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. 108–116.
Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. 2019. Graph neural networks for social recommendation. In Proceedings of the World Wide Web Conference. 417–426.
Li Gao, Jia Wu, Chuan Zhou, and Yue Hu. 2017. Collaborative dynamic sparse topic regression with user profile evolution for item recommendation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI’17). AAAI Press, 1316–1322.
Yulong Gu, Zhuoye Ding, Shuaiqiang Wang, and Dawei Yin. 2020. Hierarchical user profiling for e-commerce recommender systems. In Proceedings of the 13th International Conference on Web Search and Data Mining. 223–231.
Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web. 507–517.
Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, YongDong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S. Yu. 2018. Leveraging meta-path based context for top- N recommendation with a neural co-attention model. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1531–1540.
Liang Hu, Songlei Jian, Longbing Cao, and Qingkui Chen. 2018. Interpretable recommendation via attraction modeling: Learning multilevel attractiveness over multimodal movie contents. In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence. 3400–3406.
Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 8th IEEE International Conference on Data Mining. 263–272.
Dmitry Kazhdan, Botty Dimanov, Mateja Jamnik, Pietro Liò, and Adrian Weller. 2020. Now you see me (CME): Concept-based model extraction. In Proceedings of the CIKM 2020 Workshops.
Tom Kenter and Maarten de Rijke. 2015. Short text similarity with word embeddings. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 1411–1420.
Dong Hyun Kim, Chanyoung Park, Jinoh Oh, Sungyoung Lee, and Hwanjo Yu. 2016. Convolutional matrix factorization for document context-aware recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems. 233–240.
Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. 2020. Concept bottleneck models. In Proceedings of the 37th International Conference on Machine Learning. 5338–5348.
Oscar Li, Hao Liu, Chaofan Chen, and Cynthia Rudin. 2018. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
Xiaopeng Li and James She. 2017. Collaborative variational autoencoder for recommender systems. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 305–314.
Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference. 689–698.
Shangsong Liang, Xiangliang Zhang, Zhaochun Ren, and Evangelos Kanoulas. 2018. Dynamic embeddings for user profiling in twitter. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1764–1773.
Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. 2181–2187.
Yang Liu, Liang Chen, Xiangnan He, Jiaying Peng, Zibin Zheng, and Jie Tang. 2022. Modelling high-order social relations for item recommendation. IEEE Transactions on Knowledge and Data Engineering 34, 9 (2022), 4385–4397.
Zhongqi Lu, Sinno Jialin Pan, Yong Li, Jie Jiang, and Qiang Yang. 2016. Collaborative evolution for user profiling in recommender systems. In Proceedings of the 25th International Joint Conference on Artificial Intelligence. 3804–3810.
Yupeng Luo, Shangsong Liang, and Zaiqiao Meng. 2019. Constrained co-embedding model for user profiling in question answering communities. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 439–448.
Chen Ma, Peng Kang, Bin Wu, Qinglong Wang, and Xue Liu. 2019. Gated attentive-autoencoder for content-aware recommendation. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining. 519–527.
Weizhi Ma, Min Zhang, Yue Cao, Woojeong Jin, Chenyang Wang, Yiqun Liu, Shaoping Ma, and Xiang Ren. 2019. Jointly learning explainable rules for recommendation with knowledge graph. In Proceedings of the World Wide Web Conference. 1210–1221.
Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 43–52.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems. 3111–3119.
Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-Value Memory Networks for Directly Reading Documents. arXiv:1606.03126.
Deng Pan, Xiangrui Li, Xin Li, and Dongxiao Zhu. 2020. Explainable recommendation via interpretable feature mapping and evaluation of explainability. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI’20). 2690–2696.
Jayneel Parekh, Pavlo Mozharovskyi, and Florence d'Alché-Buc. 2021. A framework to learn with interpretation. In Proceedings of the Advances in Neural Information Processing Systems. 24273–24285.
Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. 45–50.
Zhaochun Ren, Shangsong Liang, Piji Li, Shuaiqiang Wang, and Maarten de Rijke. 2017. Social collaborative viewpoint regression with explainable recommendations. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining. 485–494.
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. 111–112.
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision. 618–626.
Sungyong Seo, Jing Huang, Hao Yang, and Yan Liu. 2017. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In Proceedings of the 11th ACM Conference on Recommender Systems. 297–305.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Proceedings of the Advances in Neural Information Processing Systems. 2440–2448.
Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Latent relational metric learning via memory-based attention for collaborative ranking. In Proceedings of the 2018 World Wide Web Conference. 729–739.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. 6000–6010.
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a Local Denoising Criterion. Journal of Machine Learning Research 11, 110 (2010), 3371–3408.
Chong Wang and David M. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 448–456.
Chenyang Wang, Yuanqing Yu, Weizhi Ma, Min Zhang, Chong Chen, Yiqun Liu, and Shaoping Ma. 2022. Towards representation alignment and uniformity in collaborative filtering. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1816–1825.
Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1235–1244.
Hongwei Wang, Fuzheng Zhang, Mengdi Zhang, Jure Leskovec, Miao Zhao, Wenjie Li, and Zhongyuan Wang. 2019. Knowledge-aware graph neural networks with label smoothness regularization for recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 968–977.
Nan Wang, Hongning Wang, Yiling Jia, and Yue Yin. 2018. Explainable recommendation via multi-task learning in opinionated text data. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval. 165–174.
Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 950–958.
Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 165–174.
Le Wu, Peijie Sun, Yanjie Fu, Richang Hong, Xiting Wang, and Meng Wang. 2019. A neural influence diffusion model for social recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 235–244.
Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining. 153–162.
Wenyi Xiao, Huan Zhao, Haojie Pan, Yangqiu Song, Vincent W. Zheng, and Qiang Yang. 2019. Beyond personalization: Social content recommendation for creator equality and consumer satisfaction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 235–245.
Xin Xin, Xiangnan He, Yongfeng Zhang, Yongdong Zhang, and Joemon Jose. 2019. Relational collaborative filtering: Modeling multiple item relations for recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 125–134.
Carl Yang, Yuxin Xiao, Yu Zhang, Yizhou Sun, and Jiawei Han. 2022. Heterogeneous network representation learning: A unified framework with survey and benchmark. IEEE Transactions on Knowledge and Data Engineering 34, 10 (2022), 4854–4873.
Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2019. INVASE: Instance-wise variable selection using neural networks. In Proceedings of the International Conference on Learning Representations.
Hamed Zamani and W. Bruce Croft. 2020. Learning a joint search and recommendation model from user-item interactions. In Proceedings of the 13th International Conference on Web Search and Data Mining. 717–725.
Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 353–362.
Yongfeng Zhang and Xu Chen. 2020. Explainable recommendation: A survey and new perspectives. Foundations and Trends in Information Retrieval 14, 1 (2020), 1–101.
Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval. 83–92.