
Coarse-to-Fine Knowledge-Enhanced Multi-Interest Learning Framework for Multi-Behavior Recommendation

Published: 18 August 2023

Abstract

Multiple types of behaviors (e.g., clicking, carting, purchasing) widely exist in most real-world recommendation scenarios and are beneficial for learning users’ multi-faceted preferences. As dependencies are explicitly exhibited across the multiple types of behaviors, effectively modeling complex behavior dependencies is crucial for multi-behavior prediction. State-of-the-art multi-behavior models learn behavior dependencies indistinguishably with all historical interactions as input. However, different behaviors may reflect different aspects of user preference, which means that some irrelevant interactions may act as noise for the target behavior to be predicted. To address the aforementioned limitations, we introduce multi-interest learning to multi-behavior recommendation. More specifically, we propose a novel Coarse-to-fine Knowledge-enhanced Multi-interest Learning (CKML) framework to learn shared and behavior-specific interests for different behaviors. CKML introduces two advanced modules, namely Coarse-grained Interest Extracting (CIE) and Fine-grained Behavioral Correlation (FBC), which work jointly to capture fine-grained behavioral dependencies. CIE uses knowledge-aware information to extract initial representations of each interest. FBC incorporates a dynamic routing scheme to further assign each interaction among interests. Empirical results on three real-world datasets verify the effectiveness and efficiency of our model in exploiting multi-behavior data.

1 Introduction

Collaborative filtering (CF) [33] is widely used in industry to probe the latent information behind users’ behaviors. CF first learns representations for both users and items from their historical interactions and then leverages these representations to make predictions. Most of the existing CF methods [17, 21, 22, 29, 39, 47] are designed to model a single behavior. However, users usually interact with items through various behaviors in real-world applications, such as viewing, tagging as favorites, carting, and buying in the e-commerce scenario. As various behaviors may express users’ complementary interests in items, utilizing the multi-behavior data simultaneously is necessary.
Many research efforts have been devoted to this problem to better capture collaborative signals from multi-behavior data, and they can be divided into two categories [18]. The first category tackles the multi-behavior recommendation problem with advanced neural networks such as attention networks [14], transformer networks [44, 45], and graph neural networks [11, 15, 37, 42, 45, 48]. The second category utilizes multi-behavior data with multi-task learning (MTL) [6]. These methods leverage all behaviors of users as prediction targets to improve the learning of users and items through different means, such as knowledge transfer [9, 12, 42] and graph neural networks [8].
However, these multi-behavior recommendation methods ignore the multi-faceted interests behind different behaviors. As shown in Figure 1(a), we consider carting and buying in a toy example. In this example, the user interacts with an item based on an interest (e.g., the user buys a hamburger based on his interest in “Junk Food”). Meanwhile, each behavior type is potentially attached to multiple interests (e.g., Buying is attached to “Junk Food” and “Electronic Goods”). Furthermore, we can observe that different interests attached to the same behavior (Cart) may have different effects on the prediction of the target behavior (Buy). Specifically, the interests “Electronic Goods” and “Luxury Goods” are both under the carting behavior. However, for predicting whether the user will buy the computer, the interest “Electronic Goods” is more effective and meaningful, because the user adds the computer to the cart based on “Electronic Goods” and adds the less relevant item (watch) to the cart based on “Luxury Goods”. Here, we define the interest “Electronic Goods” as a shared interest across different behaviors, as it is shared by carting and buying. Obviously, it is very meaningful to model shared interests when correlating the interactive information of multiple behaviors. Besides, we define the interest “Luxury Goods”, which is specific only to carting, as a behavior-specific interest (the same holds for the interest “Junk Food”), and these specific interests may be noise for the prediction of other behaviors. Fine-grained decoupling of behaviors into interest-level representations makes full use of the potential dependency information in a delicate way, thus achieving better interpretability and possibly superior performance. It is therefore vital to explore the relationships among multiple behaviors at the interest level.
Fig. 1.
Fig. 1. (a) An example of multi-faceted interests behind multiple behaviors in an e-commerce scenario. Red and green represent cart and buy behaviors, respectively, and their specific interests are represented in the same colors. Blue represents the shared interests. (b) An example of behavior correlation at different granularities. Black and gray arrows represent different interest divisions, and dashed boxes represent correlations of multi-behavioral information at the interest level.
Recent works have attempted to leverage multi-interest learning for recommendation. Some approaches implicitly cluster historical user interactions by using powerful encoders, such as dynamic routing [7, 24, 41] and self-attention [7], while others seek to leverage the auxiliary semantic information of knowledge graphs to model multiple interests [5, 40]. Despite the effectiveness of these methods, they are all designed for a single behavior and share two common limitations when applied to multi-behavior recommendation:
Inadequate Correlation Modeling. Existing multi-interest methods are designed for a single behavior, and all of them adopt a group of unified interests for each user. However, if we unify all interests for each behavior to model behavior correlation, noise is inevitably introduced, as some behavior-specific interests will negatively affect the learning of interest representations under other behaviors. We name this correlation modeling strategy the unified interest form, as shown in Figure 1(b). The unified interest form roughly correlates the divided shared interests of one behavior with those of other behaviors, which leads to inadequate modeling of correlation. Hence, it is necessary to design an interest extraction strategy that fully considers the relationship between different behaviors.
Difficulties in Interest Learning. The learning of interests can be regarded as the grouping of users’ historical behaviors through clustering, where items from one cluster are expected to be closely related and collectively represent a specific aspect of user interests [24]. In clustering theory, the final results are sensitive to the initialization of clustering centers, as shown in many works such as K-Means++ [3], K-Means\(\Vert\) [4], and Canopy [27]. Existing multi-interest methods like MIND [24] and DGCF [41] initialize clustering centers with random vectors, which leads to sub-optimal results as the generated centers may be very close to each other. To solve this problem, methods like KGIN [40] and KTUP [5] utilize the semantic information from the knowledge graph to learn interest representations. However, they overlook the rich collaborative signals that can be used for interest representation learning. As a result, a more flexible method is needed, one that keeps the initial interest centers as far apart as possible by using the semantic information obtained from knowledge-aware relations and that makes full use of collaborative signals during the clustering process.
To tackle these two limitations, we propose a Coarse-to-fine Knowledge-enhanced Multi-interest Learning framework (CKML). To handle the inadequate modeling of behavioral correlations, CKML decouples interests into shared and behavior-specific parts for each behavior, and then models the behavior correlation under the decoupled interest form, as shown in Figure 1(b).
To tackle the difficulties of the interest learning process, CKML leverages a coarse-to-fine strategy to initialize the interest centers and then allocates different interactions to different interests through collaborative signals. Concretely, CKML consists of two modules: the Coarse-grained Interest Extracting (CIE) module and the Fine-grained Behavioral Correlation (FBC) module. The first module captures knowledge-aware item-item relations: it first learns representations for the relations under the paradigm of graph neural networks, and then uses these knowledge-aware relation representations to initialize shared and behavior-specific interests for every behavior, which keeps the initial interest centers as far apart as possible. For the second module, to adequately utilize the high-order user-item collaborative signals, we design a GNN-based framework with a dynamic routing mechanism [30] to further finely allocate each interaction to different interests. In this module, we then generate fine-grained representations for all interests by graph propagation on separate interest-level graphs. Finally, we correlate only the information of the shared interests of different behaviors through a self-attention mechanism, modeling the decoupled interest form of correlation between different behaviors for multi-behavior prediction.
To summarize, this work makes the following contributions:
We propose a novel CKML framework for multi-behavior recommendation, which learns shared and behavior-specific user interests for different behaviors. To the best of our knowledge, this is the first attempt to introduce multi-interest learning into the multi-behavior recommendation.
We propose a multi-interest learning mechanism that models interests with a coarse-to-fine process. It contains a CIE module and an FBC module, which together better model the complex dependencies among multiple behaviors.
We conduct extensive experiments on three public datasets with vastly different types of behaviors. The experimental results show the performance superiority and interpretability of our proposed framework.

2 Related Work

In this section, we review existing multi-behavior recommendation methods according to how they model the representations of users and items, and introduce existing multi-interest methods according to how they model interests.

2.1 Multi-Behavior Recommendation

The existing multi-behavior recommendation methods can be classified into two categories [18]. One is multi-behavior representation modeling based on advanced neural networks, such as transformers and graph neural networks. For example, DIPN [14] proposes a hierarchical attention network, which uses both intra-view and inter-view attention to learn the relationships between different behaviors. MATN [44] uses a transformer to encode the interactions of multiple behaviors, and proposes a memory-augmented attention network which maps the context signals of different behaviors into representations of different spaces. NMTR [12] proposes a neural network model to exploit users’ multi-behavior data. Further, GNN-based methods like MGNN [49], MBGCN [19], and GHCF [8] propose to leverage message passing on graphs to model high-order multi-behavioral interactive information. Besides, MBGMN [46] utilizes a meta network with GNN to capture high-order collaborative signals. Moreover, KHGT [45] combines GNN and transformer to model the global behavioral information, which not only captures the higher-order behavior between nodes but also addresses the dynamics of behavior. CML [42] combines meta contrastive learning and GNN to mine the higher-order information between nodes, effectively modeling individualized multi-behavior correlations.
The other category models different behaviors with MTL. DIPN [14], MGNN [49], and GHCF [8] regard the aggregated representations of different behaviors as shared input and use the aggregated representations to predict each behavior individually. NMTR [12], EHCF [9], MBGMN [46], and CML [42] use a transfer learning paradigm to fully interact and aggregate different behavioral representations and then make predictions separately. All in all, these multi-behavior methods try to capture the correlation between different behaviors, but they do not take into account the potential fine-grained interests behind each behavioral interaction. In contrast, our method makes full use of interest-level behavioral correlation information and alleviates the noise caused by coarse-grained modeling.

2.2 Multi-Interest Recommendation

Existing methods for multi-interest learning can be divided into two paradigms. One paradigm utilizes collaborative behavioral signals to learn multi-interest representations. For example, MIND [24] applies a dynamic routing mechanism to assign each interaction to interests and uses label-aware attention to help learn user representations. On this basis, ComiRec [7] leverages a self-attention mechanism to extract user interests. To better learn interest representations, SINE [34] and Octopus [26] propose to model interests explicitly. They first build interest pools and then use attention mechanisms to explicitly activate some of the interests of users in the pool through historical user interactions. DGCF [41] introduces the dynamic routing mechanism into graphs and models the independence among interests for multi-interest learning.
The other paradigm leverages structured relational information to construct multi-interest representations. For instance, KGIN [40] exploits the knowledge graph’s structural information to learn the representations of different interests and aggregates information using GNN-based methods. KTUP [5] proposes a translation-based model, which leverages implicit interests to capture the relationship between users and items. To sum up, both paradigms have their own drawbacks: the former does not consider the importance of knowledge-aware information in the initialization of interest clustering centers, and the latter does not consider the importance of collaborative signals in the process of interest clustering. Our method not only initializes the interest clustering centers well but also sufficiently utilizes the collaborative signals to assist the interest clustering process.

3 Problem Definition

3.1 List of Notations

The notations we used in this article are shown in Table 1.
Table 1.
Notation | Description
\(u, i\) | The user and item.
\(\mathcal {U}, \mathcal {I}\) | The sets of users and items.
\(\mathcal {V}\) | The set of nodes on \(\mathcal {G}_{u-i}\).
\(\mathcal {G}_{u-i}, \mathcal {G}_{i-i}\) | The user-item and item-item graphs.
\(\mathcal {E}_{u-i}, \mathcal {E}_{i-i}\) | The sets of edges on \(\mathcal {G}_{u-i}\) and \(\mathcal {G}_{i-i}\).
\(\mathcal {A}_{u-i}, \mathcal {A}_{i-i}\) | The sets of adjacency matrices of \(\mathcal {G}_{u-i}\) and \(\mathcal {G}_{i-i}\).
\(\mathcal {R}_{u-i}, \mathcal {R}_{i-i}\) | The sets of all possible behavior/relation types of \(\mathcal {G}_{u-i}\) and \(\mathcal {G}_{i-i}\).
\(\mathbf {A}_{u-i}^{k}, \mathbf {A}_{i-i}^{r}\) | The adjacency matrices of behavior \(k\) and relation \(r\).
\(\mathbf {x}_u, \mathbf {y}_i\) | The initial embeddings for user \(u\) and item \(i\).
\(\mathbf {z}_{i}^{r}\) | The learned representation of the \(i\)th item under the \(r\)th relation in CIE.
\(\mathbf {s}_{i}^{k}, \mathbf {h}_{i}^{k}\) | The extracted behavior-specific and shared interest embeddings of item \(i\) under behavior \(k\) in CIE.
\(\mathbf {g}_{i}^{k}\) | The output of CIE under behavior \(k\); \(\mathbf {g}_{i}^{k} = \mathbf {s}_{i}^{k}\Vert \mathbf {h}_{i}^{k}\).
\(N_{spe}, N_{sha}\) | The numbers of specific and shared interests for each behavior; \(N_{*}=N_{spe}+N_{sha}\).
\(d, d^{*}\) | The sizes of the original embedding and the interest embedding; \(d^{*} = {d\over {N_{spe}}} = {d\over {N_{sha}}}\).
\(\mathbf {t}_{u-i}^k\) | The time embedding for pair \((u,i)\) under behavior \(k\).
\(\mathbf {a}_{t}^{k}\) | The weights of edges on graph \(\mathcal {G}_{u-i}^{k}\) at the \(t\)th iteration.
\(\mathbf {e}_{u}^{k,l},\mathbf {e}_{i}^{k,l}\) | The input embeddings at the \((l+1)\)th layer (i.e., the output embeddings at the \(l\)th layer) in FBC for user \(u\) and item \(i\).
\(\lambda _{k, k^{\prime }}^{u, h},\lambda _{k, k^{\prime }}^{i, h}\) | The relevance scores between the \(k\)th and \(k^{\prime }\)th behaviors of the \(h\)th head for user \(u\) and item \(i\).
\(\mathbf {f}_{u,sha}^{k},\mathbf {f}_{i,sha}^{k}\) | The shared interest embeddings for user \(u\) and item \(i\) under behavior \(k\) before behavioral correlation.
\(\mathbf {f}_{u,spe}^{k,l}, \tilde{\mathbf {f}}_{u,sha}^{k,l}\) | The final specific and shared interest embeddings for user \(u\) under behavior \(k\) at the \(l\)th layer.
\(\mathbf {f}_{i,spe}^{k,l}, \tilde{\mathbf {f}}_{i,sha}^{k,l}\) | The final specific and shared interest embeddings for item \(i\) under behavior \(k\) at the \(l\)th layer.
\(\hat{{o}}_{u,i}^{k}\) | The prediction score for the pair \((u,i)\) under behavior \(k\).
\(\hat{{o}}_{i,i^{\prime }}^{r}\) | The prediction score for the pair \((i,i^{\prime })\) under relation \(r\).
Table 1. Notations and Corresponding Descriptions

3.2 Multi-Behavior Interaction Graph

Let \(\mathcal {U}=\lbrace u_1, u_2, \ldots , u_M\rbrace\) represent the set of users and \(\mathcal {I}=\lbrace i_1, i_2, \ldots , i_N\rbrace\) represent the set of items, where \(M\) and \(N\) are the numbers of users and items, respectively. In real-world recommendation scenarios, users can interact with items in multiple behaviors. Suppose there are \(K\) types of behaviors, we denote the user-item interaction data of different behaviors as \(\mathcal {Y}_{u-i} = \lbrace \mathbf {Y}_{u-i}^1,\mathbf {Y}_{u-i}^2, \ldots ,\mathbf {Y}_{u-i}^K\rbrace\), where \(\mathbf {Y}_{u-i}^k\) represents the interaction matrix of behavior \(k\), \(y_{ui}^k = 1\) denotes that user \(u\) interacts with item \(i\) under behavior \(k\), otherwise \(y_{ui}^k = 0\). The user-item interaction data can also be regarded as a user-item bipartite graph \(\mathcal {G}_{u-i}=(\mathcal {V}, \mathcal {E}_{u-i}, \mathcal {A}_{u-i}, \mathcal {R}_{u-i})\), where \(\mathcal {V} = \mathcal {U}\cup \mathcal {I}\) is the node set containing all users and items, \(\mathcal {E}_{u-i} = \cup _{k \in \mathcal {R}_{u-i}}\mathcal {E}_{u-i}^{k}\) is the edge set including all behavior records between users and items. Here \(k\) denotes a specific type of behavior and \(\mathcal {R}_{u-i}\) is the set of all possible behavior types. \(\mathcal {A}_{u-i} = \cup _{k \in \mathcal {R}_{u-i}}\mathbf {A}_{u-i}^{k}\) is the adjacency matrix set with \(\mathbf {A}_{u-i}^{k}\) denoting adjacency matrix of a specific behavior graph \(\mathcal {G}_{u-i}^{k}=(\mathcal {V},\mathcal {E}_{u-i}^{k}, \mathbf {A}_{u-i}^{k})\).
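As an illustration of the data structures defined in this section, the sketch below builds the per-behavior interaction matrices \(\mathbf {Y}_{u-i}^{k}\) from raw interaction records; the function name, the `interactions` dictionary, and the use of SciPy sparse matrices are illustrative assumptions of this sketch, not the paper's released implementation.

```python
import numpy as np
import scipy.sparse as sp

def build_behavior_matrices(interactions, num_users, num_items):
    """interactions: dict mapping a behavior name k to a list of (user_idx, item_idx) pairs."""
    Y = {}
    for k, pairs in interactions.items():
        rows = [u for u, _ in pairs]
        cols = [i for _, i in pairs]
        # Y^k: |U| x |I| binary interaction matrix of behavior k (y_ui^k = 1 iff u interacted with i under k)
        Y[k] = sp.csr_matrix((np.ones(len(pairs)), (rows, cols)),
                             shape=(num_users, num_items))
    return Y

# Two hypothetical behaviors over 3 users and 2 items
Y = build_behavior_matrices({"view": [(0, 1), (2, 0)], "buy": [(0, 1)]},
                            num_users=3, num_items=2)
```

Each \(\mathbf {Y}_{u-i}^{k}\) produced this way corresponds to one behavior-specific sub-graph \(\mathcal {G}_{u-i}^{k}\) of the bipartite graph, which is the form later consumed in Section 4.3.1.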

3.3 Knowledge-Aware Relation Graph

To explore the rich semantics of items, we define the graph \(\mathcal {G}_{i-i}=(\mathcal {I}, \mathcal {E}_{i-i}, \mathcal {A}_{i-i}, \mathcal {R}_{i-i})\) to leverage side information, such as attributes and external knowledge, to depict the multi-faceted characteristics of items. The definitions of \(\mathcal {E}_{i-i}\) and \(\mathcal {R}_{i-i}\) are similar to those of \(\mathcal {E}_{u-i}\) and \(\mathcal {R}_{u-i}\), respectively. We denote the item-item relation matrix set as \(\mathcal {A}_{i-i} = \lbrace \mathbf {A}_{i-i}^1,\mathbf {A}_{i-i}^2, \ldots ,\mathbf {A}_{i-i}^{|\mathcal {R}_{i-i}|}\rbrace\), which can be constructed for different reasons, such as items belonging to the same category, coming from the same restaurant, or being interacted with by similar users.

3.4 Task Description

Generally, there is a target behavior to be optimized (e.g., purchase), which we denote as \(\mathbf {Y}_{u-i}^K\), and other behaviors \(\lbrace \mathbf {Y}_{u-i}^1,\mathbf {Y}_{u-i}^2, \ldots ,\mathbf {Y}_{u-i}^{K-1}\rbrace\) (e.g., view and tag as favorite) are treated as auxiliary behaviors for assisting the prediction of target behavior. The goal is to predict the probability that user \(u\) will interact with item \(i\) under target behavior \(K\).

4 Methodology

We now present the model details of our proposed CKML, which is illustrated in Figure 2. It consists of two core modules: (1) CIE module, which utilizes knowledge-aware relations to extract shared and behavior-specific interests for multiple behaviors; and (2) FBC module, which allocates different interactions to different interests under each behavior, then models the complex behavior dependencies with interest-aware correlations.
Fig. 2.
Fig. 2. Illustration of the proposed CKML. For brevity, only two behaviors (view and buy) are represented here. The green and red rectangles represent the behavior-specific interests, while the grey rectangle represents the shared interests. The orange circles represent items, while the blue circles represent users. (\(\oplus\)) denotes the element-wise addition operation.

4.1 Embedding Layer

In industrial applications, users and items are often denoted as high-dimensional one-hot vectors. Generally, given a user-item pair \((u, i)\), we apply the embedding lookup operation for user \(u\) and item \(i\) to obtain the embedding vectors:
\begin{equation} \mathbf {x}_u = \mathbf {E}_u^{T} \cdot \mathbf {p}_u, \ \mathbf {y}_i = \mathbf {E}_i^{T} \cdot \mathbf {p}_i , \end{equation}
(1)
where \(\mathbf {E}_u \in \mathbb {R}^{M \times d}\) and \(\mathbf {E}_i \in \mathbb {R}^{N \times d}\) are the created embedding tables for users and items, \(\mathbf {p}_u \in \mathbb {R}^{M}\) and \(\mathbf {p}_i \in \mathbb {R}^{N}\) denote the one-hot IDs of user \(u\) and item \(i\), and \(d\) is the embedding size.
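A minimal sketch of the lookup in Equation (1); in practice the one-hot multiplication reduces to row indexing of the embedding tables, and the sizes below are arbitrary.

```python
import numpy as np

M, N, d = 100, 200, 16
E_u, E_i = np.random.randn(M, d), np.random.randn(N, d)   # user / item embedding tables
u, i = 3, 42                                               # integer IDs instead of explicit one-hot vectors
x_u, y_i = E_u[u], E_i[i]                                  # x_u = E_u^T p_u, y_i = E_i^T p_i
```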

4.2 Coarse-Grained Interest Extracting

Knowledge-aware item-item relations are widely used to supplement semantic information and assist representation learning [5, 40, 45]. Inspired by the strong semantics of relations in the knowledge-aware relation graph [38, 40, 45], we propose a CIE module to extract the users’ interests that motivate their interactions of multiple behaviors. In this way, we obtain the initial interest clustering centers. To further verify that the initial interest centers obtained by CIE are better than randomly initialized ones, we design experiments to visualize the output embeddings of CIE in Section 5.6.1. There are two main components in CIE: the first is knowledge-aware relation modeling, designed to capture the semantic information from the knowledge-aware item-item relation graph; the second is behavior-aware interest extraction, designed to utilize the semantic information obtained in the previous component to extract interests.

4.2.1 Knowledge-Aware Relation Modeling.

Most existing multi-interest methods initialize interests with randomly generated vectors [24, 41], which fails to endow interests with semantics and may lead to a chaotic interest division. Since we have emphasized the importance of initializing interest clustering centers in Section 1, and inspired by knowledge graph based methods [5, 40, 45], we use knowledge-aware information to initialize interest representations. Thanks to the high capability of graph neural networks in modeling relational data and their great performance in representation learning, we utilize knowledge-aware relations for interest extraction under the graph neural network paradigm in this component. Specifically, we first partition the knowledge-aware relation graph \(\mathcal {G}_{i-i}\) into several relation-specific sub-graphs \(\mathcal {G}_{i-i}^{1}, \mathcal {G}_{i-i}^{2}, \ldots ,\mathcal {G}_{i-i}^{|\mathcal {R}_{i-i}|}\), with corresponding adjacency matrices \(\mathbf {A}_{i-i}^{1},\mathbf {A}_{i-i}^{2}, \ldots ,\mathbf {A}_{i-i}^{|\mathcal {R}_{i-i}|}\). For message propagation, we adopt state-of-the-art GCN models, such as LightGCN [17], LR-GCCF [10], GCN [21], and NGCF [39], for graph information aggregation. The neighbor propagation process in each layer of each sub-graph can be formulated as
\begin{equation} \mathbf {z}_{i}^{r, l}=\mathop {Agg}\limits _{j \in N_{i}}\left(\mathbf {z}_{j}^{r, l-1}, \mathbf {A}_{i-i}^{r}\right) , \end{equation}
(2)
where \(r\) denotes the type of relation, \(l\) denotes the layer of GNN, \(N_{i}\) denotes the neighbors of item \(i\), and \(\mathbf {z}_{i}^{r, 0} = \mathbf {y}_{i}\) is the initial embedding for item \(i\). After the propagation, we average the generated representations from all layers to get the final representations:
\begin{equation} \mathbf {z}_{i}^{r} = {\sum \limits _{l=0}^{L_{i-i}}{\mathbf {z}_{i}^{r, l}}}/{(L_{i-i}+1)} , \end{equation}
(3)
where \(\mathbf {z}_{i}^{r} \in \mathbb {R}^{1 \times d}\) and \(L_{i-i}\) is the number of GNN layers used for modeling the knowledge-aware relation graph. We use the same number of layers for all relations here for simplicity.
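The sketch below illustrates Equations (2)-(3) with a LightGCN-style parameter-free aggregator (one of the GCN options mentioned above); the dense NumPy adjacency and the symmetric normalization are simplifying assumptions of this sketch.

```python
import numpy as np

def propagate_relation(item_emb, A_r, num_layers):
    """item_emb: [N, d] initial item embeddings; A_r: [N, N] adjacency of relation r."""
    deg = A_r.sum(axis=1)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    A_norm = d_inv_sqrt[:, None] * A_r * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2 normalization

    z = item_emb                        # z^{r,0} = y_i, Equation (2) base case
    layer_outputs = [z]
    for _ in range(num_layers):
        z = A_norm @ z                  # parameter-free neighbor aggregation per layer
        layer_outputs.append(z)
    return np.mean(layer_outputs, axis=0)   # Equation (3): average over all layers

# z_r[i] is the relation-r representation of item i (toy 5-item graph)
z_r = propagate_relation(np.random.randn(5, 16),
                         np.random.randint(0, 2, (5, 5)).astype(float),
                         num_layers=2)
```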

4.2.2 Behavior-Aware Interest Extraction.

Since we have obtained representations for all relations, how to effectively extract interests for different behaviors remains a challenge. As shown in Figure 1, different behaviors exhibit diverse interest patterns. Some interests are shared across multiple behaviors, while others are unique to specific behaviors. This is similar to the shared and task-specific expert information in multi-task learning. Motivated by the customized gate presented in PLE [35], which achieves great performance in multi-task learning, we propose to introduce shared interests and behavior-specific interests for multi-interest learning. The shared interests are designed to correlate with other types of behaviors at the interest level, which better leverages the potential complementary information of the same interest across multiple behaviors. The specific interests decouple and retain the independence of the corresponding behaviors, thus alleviating the influence of noise. We first combine the representations of all relations into a unified vector:
\begin{equation} \mathbf {z}_{i}^{*} = \mathop {Concatenate}\limits _{r \in \mathcal {R}_{i-i}}{\mathbf {z}_{i}^{r}}, \end{equation}
(4)
After that, we use a non-linear transformation, which is commonly used to model combinations among relations, to convert the relations into multiple interests. For the specific interests, we have:
\begin{equation} \mathbf {s}_{i}^{k}=\mathop {Concatenate}\limits _{s=1}^{N_{spe}}\left(\mathop {LeakyReLU}\left(\mathbf {z}_{i}^{*}\cdot \mathbf {W}_{s}^{k}+\mathbf {b}_{s}^{k}\right)\right) , \end{equation}
(5)
where \(N_{spe}\) is the number of specific interests for each behavior, \(s\) indexes the \(s\)th interest, \(\mathbf {W}_{s}^{k} \in \mathbb {R}^{(|\mathcal {R}_{i-i}| * d) \times ({d\over {N_{spe}}})}\) and \(\mathbf {b}_{s}^{k} \in \mathbb {R}^{1 \times ({d\over {N_{spe}}})}\) are the transformation matrix and bias vector, and \(\mathbf {s}_{i}^{k}\) denotes the extracted behavior-specific interests for behavior \(k\). Notice that we use \(1\over {N_{spe}}\) of the original item embedding size as the interest size to keep space usage similar to single-interest models, and we apply the same compression to shared interests. For the behavioral shared interests, we have:
\begin{equation} \mathbf {h}_{i}^{k}=\mathop {Concatenate}\limits _{s=1}^{N_{sha}}\left(\mathop {LeakyReLU}\left(\mathbf {z}_{i}^{*}\cdot \mathbf {W}_{s}+\mathbf {b}_{s}\right)\right) , \end{equation}
(6)
where \(N_{sha}\) is the number of shared interests, \(s\) indexes the \(s\)th interest, \(\mathbf {W}_{s} \in \mathbb {R}^{(|\mathcal {R}_{i-i}| * d) \times ({d\over {N_{sha}}})}\) and \(\mathbf {b}_{s} \in \mathbb {R}^{1 \times ({d\over {N_{sha}}})}\) are the transformation matrix and bias vector, and \(\mathbf {h}_{i}^{k}\) denotes the extracted shared interests for behavior \(k\). Since these parameters are shared across behaviors, the shared representations are identical for different \(k\) in this equation.
Finally, we union the representations of shared and specific interests as the output of CIE:
\begin{equation} \mathbf {g}_{i}^{k}=\mathbf {s}_{i}^{k}\Vert \mathbf {h}_{i}^{k} , \end{equation}
(7)
where \((\Vert)\) is the concatenation operation between two vectors. For convenience, we set \({d\over {N_{spe}}} = {d\over {N_{sha}}} = d^{*}\), \(N_{*} = N_{spe}+N_{sha}\).
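To make Equations (4)-(7) concrete, the following sketch projects the concatenated relation representations of one item into behavior-specific and shared interests; the weight shapes follow the notation above, while the random initialization and variable names are illustrative.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def extract_interests(z_relations, W_spe, b_spe, W_sha, b_sha):
    """z_relations: list of [d] relation vectors for one item (under one behavior k)."""
    z_star = np.concatenate(z_relations)                        # Equation (4): z_i^*
    s = np.concatenate([leaky_relu(z_star @ W + b)              # Equation (5): specific interests
                        for W, b in zip(W_spe, b_spe)])
    h = np.concatenate([leaky_relu(z_star @ W + b)              # Equation (6): shared interests
                        for W, b in zip(W_sha, b_sha)])
    return np.concatenate([s, h])                               # Equation (7): g_i^k = s || h

d, n_rel, n_spe, n_sha = 16, 3, 2, 2
d_star = d // n_spe
W_spe = [np.random.randn(n_rel * d, d_star) for _ in range(n_spe)]
b_spe = [np.zeros(d_star) for _ in range(n_spe)]
W_sha = [np.random.randn(n_rel * d, d_star) for _ in range(n_sha)]
b_sha = [np.zeros(d_star) for _ in range(n_sha)]
g_ik = extract_interests([np.random.randn(d) for _ in range(n_rel)],
                         W_spe, b_spe, W_sha, b_sha)
```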

4.3 Fine-Grained Behavioral Correlation

Existing multi-behavior methods model the dependencies among multiple behaviors without distinguishing the diverse interests on which different interactions are based, which may inevitably introduce noise if the interactions are due to different interests.
In the previous part, we have preliminarily extracted the potential interests of items based on the knowledge-aware relations. However, this is only a node-wise partitioning and does not divide specific interactions (i.e., edges on the graph) into interests. Here, “node-wise” refers to the level of users and items, while the corresponding “edge-wise” denotes a finer-grained level that considers each interaction between users and items. To address this problem, we propose an FBC layer to further allocate each interaction to different interests and model the dependencies between behaviors at the interest level. FBC is composed of two key components: the first is interest-aware behavior allocation, which is designed to further allocate each interaction to different interests; the second is interest-aware dependence modeling, which is designed to capture inter-behavioral correlations and adequately leverage this information at each layer.

4.3.1 Interest-Aware Behavior Allocation.

To allocate the edges on the graph \(\mathcal {G}_{u-i}\) to different interests under each behavior, we apply disentangled representation learning [7, 24, 41] for behavior allocation. We first partition the provided multi-behavior user-item graph \(\mathcal {G}_{u-i}\) into behavior-specific sub-graphs \(\mathcal {G}_{u-i}^{1}, \mathcal {G}_{u-i}^{2}, \ldots ,\mathcal {G}_{u-i}^{K}\), whose corresponding adjacency matrices are \(\mathbf {A}_{u-i}^{1},\mathbf {A}_{u-i}^{2}, \ldots ,\mathbf {A}_{u-i}^{K}\), which can be formulated as
\begin{equation} \mathbf {A}_{u-i}^{k}=\left(\!\begin{array}{cc} 0 & \mathbf {Y}_{u-i}^{k} \\ \left(\mathbf {Y}_{u-i}^{k}\right)^{T} & 0 \end{array}\!\right) , \end{equation}
(8)
where \(\mathbf {Y}_{u-i}^{k}\) is the user-item interaction matrix of behavior \(k\), \(\mathbf {A}_{u-i}^{k} \in \mathbb {R}^{(M+N)\times (M+N)}\), and \(M\) and \(N\) denote the number of users and items, respectively. For the processing of time, we simply follow KHGT [45]: for each edge in \(\mathcal {E}_{u-i}^k\) between user \(u\) and item \(i\) under behavior \(k\), we map the corresponding interaction timestamp \(t_{u-i}^k\) into a time slot \(\tau (t_{u-i}^k)\), and then generate a time embedding \(\mathbf {t}_{u-i}^{k} \in \mathbb {R}^{1\times d^{*}}\) for the interaction. Specifically, we have:
\begin{equation} \left\lbrace \begin{array}{c} \begin{aligned}\mathbf {\hat{t}}_{u-i}^{k,(2 n)} &= \sin \left(\frac{\tau (t_{u-i}^k)}{10000^{\frac{2 n}{d}}}\right)\\ \mathbf {\hat{t}}_{u-i}^{k,(2 n+1)} &= \cos \left(\frac{\tau (t_{u-i}^k)}{10000^{\frac{2 n+1}{d}}}\right)\\ \mathbf {t}_{u-i}^{k} &= \mathbf {\hat{t}}_{u-i}^{k}\cdot \mathbf {W}_{t} \end{aligned} \end{array}\right. , \end{equation}
(9)
where the even and odd position indices in the temporal information embedding are represented as \(2n\) and \(2n+1\), respectively, and \(\mathbf {W}_{t} \in \mathbb {R}^{2d\times d^{*}}\) is the transformation weight matrix for the \(k\)th type of interactions.
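A small sketch of the sinusoidal time encoding in Equation (9); the time-slot value and the projection dimensions here are simplified assumptions (the actual model uses the \(\tau(\cdot)\) slot mapping and the \(\mathbf{W}_t\) described above).

```python
import numpy as np

def time_embedding(time_slot, d, W_t):
    """time_slot: scalar tau(t); W_t: [d, d_star] projection (a simplified shape for this sketch)."""
    n = np.arange(d)
    angle = time_slot / (10000 ** (n / d))
    t_hat = np.where(n % 2 == 0, np.sin(angle), np.cos(angle))  # even indices -> sin, odd -> cos
    return t_hat @ W_t                                           # project to interest size d*

t_emb = time_embedding(time_slot=42, d=16, W_t=np.random.randn(16, 4))
```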
To better illustrate the allocation of interests at each layer, we take the \(k\)th behavior as an example. As shown in Algorithm 1, we use \(\mathcal {E}_{u-i}^{k} = \lbrace (p,q)|\mathbf {A}_{u-i}^{k}[p,q] \ne 0 \rbrace\) to represent the set of edges on graph \(\mathcal {G}_{u-i}^{k}\). Meanwhile, we set \(\mathbf {a}_{0}\) as the initial weight for each edge on \(\mathcal {G}_{u-i}^{k}\) and initialize the embedding of each user and item. We leverage a Kronecker product \((\otimes)\) to replicate the vector \(\mathbf {t}_{u-i}^k\) \(N_*\) times along the row direction and add it to \(\mathbf {e}_{u}^{k}\) and \(\mathbf {e}_{i}^{k}\), thus obtaining \(\mathbf {f}_{u,0}^{k}\) and \(\mathbf {f}_{i,0}^{k}\) (Step 1). Here, for simplicity, we denote \(\mathbf {e}_{u}^{k}\) and \(\mathbf {e}_{i}^{k}\) as the output of the previous layer. Next, we start the iterative process. In the \(t\)th iteration, in order to obtain distributions across all interests, we use the softmax function to normalize these coefficients (Step 2):
\begin{equation} \mathbf {a}_{t}^{k}[s]=\frac{\exp \left(\mathbf {a}_{t-1}^{k}[s]/{\tau }\right)}{\sum _{s^{\prime }=1}^{N_{*}} \exp \left(\mathbf {a}_{t-1}^{k}[s^{\prime }]/{\tau }\right)} , \end{equation}
(10)
where \(\mathbf {a}_{t}^{k}\) denotes the vector of weight coefficients of each edge of graph \(\mathcal {G}_{u-i}^{k}\) in the \(t\)th iteration, \(\tau\) is the temperature coefficient, and \(s\) denotes the \(s\)th interest. Furthermore, in each iteration, we assign all the edges on graph \(\mathcal {G}_{u-i}^{k}\) to each interest of the users and items on the graph (Step 3). At this step, \(\mathbf {f}_{u,t}^{k}[s]\) and \(\mathbf {f}_{i,t}^{k}[s]\) represent the \(s\)th interest for user \(u\) and item \(i\) after the allocation of the edge weights, respectively. Last but not least, we calculate the affinity between each pair of nodes on graph \(\mathcal {G}_{u-i}^{k}\) to update the weight of each edge (Step 4). Here, \(\mathbf {a}_{t}^{k}[s]\) denotes the updated weight of edges at the \(t\)th iteration for the \(s\)th interest under behavior \(k\). After all iterations, we take the representation generated by the last iteration as the final output and aggregate it with GCN models (Step 5), in the same way as the aggregators in Section 4.2.1.
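The sketch below illustrates the iterative allocation (Steps 2-4) for a single user-item edge; it is a strong simplification of Algorithm 1, which operates on the whole behavior graph, and all shapes and initializations are illustrative.

```python
import numpy as np

def softmax(x, tau=1.0):
    x = x / tau
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def allocate_edge(f_u, f_i, num_iters=2, tau=1.0):
    """f_u, f_i: [N_*, d_star] interest-wise embeddings of the two endpoints of one edge."""
    n_interests = f_u.shape[0]
    a = np.zeros(n_interests)                       # initial edge weights a_0
    for _ in range(num_iters):
        w = softmax(a, tau)                         # Step 2 / Equation (10): normalize over interests
        # Step 3: pass the item's interest-wise message to the user, weighted per interest
        f_u_new = f_u + w[:, None] * f_i
        # Step 4: update edge weights with the affinity between the two endpoints
        a = a + np.sum(f_u_new * f_i, axis=1)
    return w, f_u_new

w, f_u_new = allocate_edge(np.random.randn(4, 4), np.random.randn(4, 4))
```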

4.3.2 Interest-Aware Behavioral Correlation.

After the allocation of interests for every node at each layer, we have obtained \(\mathbf {f}_{u}^{k} = \mathbf {f}_{u,spe}^{k} \Vert \mathbf {f}_{u,sha}^{k}\) and \(\mathbf {f}_{i}^{k} = \mathbf {f}_{i,spe}^{k} \Vert \mathbf {f}_{i,sha}^{k}\). Furthermore, we need to correlate information between behaviors at the interest level. We correlate only the representations of the shared interests of each behavior with a self-attention network [36], because the behavior-specific interests contain little useful information for the target behavior and may introduce noise. For instance, in the Yelp dataset, there are behaviors (Dislike) that are contrary to the target behavior (Like), which may interfere with the learning of the target behavior. For better convergence, we apply a residual connection to the output of self-attention [16], which can be formulated as
\begin{equation} \left\lbrace \begin{array}{c} \begin{aligned}\tilde{\mathbf {f}}_{u,sha}^{k} &=\mathbf {M H}-\operatorname{Att}\left(\mathbf {f}_{u,sha}^{k}\right)+\sum \limits _{k^{\prime }=1}^{K}\mathbf {f}_{u,sha}^{k^{\prime }} \\ \mathbf {M H}-\operatorname{Att}\left(\mathbf {f}_{u,sha}^{k}\right)&=\mathop {Concatenate}\limits _{h=1}^{H} \left(\sum _{k^{\prime }=1}^{K} \lambda _{k, k^{\prime }}^{u, h} \cdot \tilde{\mathbf {V}}^{h} \cdot \mathbf {f}_{u,sha}^{k^{\prime }}\right) \\ \lambda _{k, k^{\prime }}^{u, h} &= \mathop {Softmax}(\bar{\lambda }_{k, k^{\prime }}^{u, h})\\ \bar{\lambda }_{k, k^{\prime }}^{u, h}&=\frac{\left(\tilde{\mathbf {Q}}^{h} \cdot \mathbf {f}_{u,sha}^{k}\right)^{\top }\left(\tilde{\mathbf {K}}^{h} \cdot \mathbf {f}_{u,sha}^{k^{\prime }}\right)}{\sqrt {d^{*}/H}} \end{aligned} \end{array}\right. , \end{equation}
(11)
where \(\tilde{\mathbf {Q}}^{h}\), \(\tilde{\mathbf {K}}^{h}\), \(\tilde{\mathbf {V}}^{h}\) \(\in \mathbb {R}^{{d^{*}\over {H}}\times {d^{*}\over {H}}}\) are learnable projection matrices of the \(h\)th head, and \(\lambda _{k, k^{\prime }}^{u, h}\) represents the relevance score between the \(k\)th and \(k^{\prime }\)th behaviors of the \(h\)th head for user \(u\). Similar operations are applied for item \(i\).
Finally, for the information propagation of the \(k\)\(th\) behavior, we have:
\begin{equation} \left\lbrace \begin{array}{c} \begin{aligned}\mathbf {e}_{u}^{k,l} &= \mathbf {f}_{u,spe}^{k,l} \Vert \tilde{\mathbf {f}}_{u,sha}^{k,l}+\mathbf {e}_{u}^{k,l-1}, \forall u \in \mathcal {U}\\ \mathbf {e}_{i}^{k,l} &= \mathbf {f}_{i,spe}^{k,l} \Vert \tilde{\mathbf {f}}_{i,sha}^{k,l}+\mathbf {e}_{i}^{k,l-1}, \forall i \in \mathcal {I} \end{aligned} \end{array}\right. , \end{equation}
(12)
where \(l \in [1, \ldots ,L_{u-i}]\), \(L_{u-i}\) is the number of GNN layers, and \((\Vert)\) is the concatenation operation for two vectors. \(\mathbf {e}_{u}^{k, 0} = \mathbf {x}_{u}^{k} = \mathbf {x}_{u}\) and \(\mathbf {e}_{i}^{k, 0} = \mathbf {g}_{i}^{k}\).
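The following sketch shows the behavior-level self-attention of Equation (11) for a single head and a single user; the residual term and the scaled dot-product scores follow the equation, while batching, multi-head splitting, and treating each behavior's shared interests as one vector are simplifications of this sketch.

```python
import numpy as np

def correlate_shared_interests(F_sha, Q, K, V):
    """F_sha: [K_behaviors, d_star] shared-interest embeddings, one row per behavior."""
    d_star = F_sha.shape[1]
    queries, keys, values = F_sha @ Q.T, F_sha @ K.T, F_sha @ V.T
    scores = queries @ keys.T / np.sqrt(d_star)           # lambda_bar_{k,k'}
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)         # softmax over behaviors k'
    mh_att = attn @ values                                 # attention-weighted sum over behaviors
    return mh_att + F_sha.sum(axis=0, keepdims=True)       # residual: sum of shared interests

d_star = 8
out = correlate_shared_interests(np.random.randn(3, d_star),
                                 *[np.random.randn(d_star, d_star) for _ in range(3)])
```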

4.4 Joint Optimization

4.4.1 The Prediction of the U-I Interaction.

In the above parts, we have obtained the shared and behavior-specific representations \(\mathbf {f}_{u,spe}^{k,l}\) and \(\tilde{\mathbf {f}}_{u,sha}^{k,l}\), \(\forall l \in [1,2, \ldots ,L_{u-i}], \forall k \in [1,2, \ldots ,K], \forall u \in \mathcal {U}\); similar representations are obtained for each item \(i\). To aggregate the information of each layer, we follow KHGT [45] and simply add them up. Thus we have:
\begin{equation} \left\lbrace \begin{array}{c} \begin{aligned}\mathbf {f}_{u}^{k,*} &= \sum \limits _{l=1}^{L_{u-i}} (\mathbf {f}_{u,spe}^{k,l} \Vert \tilde{\mathbf {f}}_{u,sha}^{k,l}), \forall u \in \mathcal {U} \\ \mathbf {f}_{i}^{k,*} &= \sum \limits _{l=1}^{L_{u-i}} (\mathbf {f}_{i,spe}^{k,l} \Vert \tilde{\mathbf {f}}_{i,sha}^{k,l}), \forall i \in \mathcal {I} \end{aligned} \end{array}\right. , \end{equation}
(13)
where \(\mathbf {f}_{u}^{k,*}, \mathbf {f}_{i}^{k,*} \in \mathbb {R}^{N_{*}\times d^{*}}\) and \(k\) represents the \(k\)th behavior. Inspired by ComiRec [7], we make separate predictions for each interest under each behavior and take the maximum of all the predictions under each behavior, which can be formulated as
\begin{equation} \hat{{o}}_{u,i}^{k} = \max \limits _{s=1}^{N_{*}}\left(\sum \limits _{j}^{d^{*}}(\mathbf {f}_{u}^{k,*}[s] \circ \mathbf {f}_{i}^{k,*}[s])[j]\right), \end{equation}
(14)
where \(s \in [1,2, \ldots ,N_{*}]\) denotes the \(s\)th interest and (\(\circ\)) is the Hadamard product operation.
Finally, to perform the model optimization, we follow KHGT [45] and use a margin-based pair-wise Bayesian Personalized Ranking (BPR) loss, minimizing the following objective:
\begin{equation} \mathcal {L}_{u-i}=\sum _{k=1}^{K} \sum _{(u,p,q)\in \mathcal {O}_{u-i,k}} \alpha ^{k}*\max \left(0,1-\hat{{o}}_{u,p}^{k}+\hat{{o}}_{u,q}^{k}\right) , \end{equation}
(15)
where \(\alpha ^{k} \in [0,1]\) denotes the loss coefficient for the \(k\)th behavior, and \(\mathcal {O}_{u-i,k} = \lbrace (u,p,q)|(u,p)\in \mathcal {O}_{u-i,k}^{+},\) \((u,q) \in \mathcal {O}_{u-i,k}^{-}\rbrace\) denotes the training dataset. \(\mathcal {O}_{u-i,k}^+\) indicates observed positive user-item interactions under behavior \(k\), and \(\mathcal {O}_{u-i,k}^-\) indicates unobserved user-item interactions under behavior \(k\).
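A compact sketch of the per-interest scoring in Equation (14) and the margin-based ranking loss in Equation (15) for one training triplet; the sampled triplet and the \(\alpha^{k}\) value are illustrative.

```python
import numpy as np

def score(f_u, f_i):
    """f_u, f_i: [N_*, d_star]; take the max over per-interest inner products (Equation (14))."""
    return np.max(np.sum(f_u * f_i, axis=1))

def margin_bpr_loss(f_u, f_pos, f_neg, alpha_k=1.0):
    """One (u, p, q) triplet of Equation (15): hinge loss with margin 1, weighted by alpha_k."""
    o_pos, o_neg = score(f_u, f_pos), score(f_u, f_neg)
    return alpha_k * max(0.0, 1.0 - o_pos + o_neg)

loss = margin_bpr_loss(np.random.randn(4, 4), np.random.randn(4, 4), np.random.randn(4, 4))
```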

4.4.2 The Prediction of the Knowledge-Aware Item-Item Relation.

Inspired by self-supervised learning on graphs [43], we use the information of item-item relations to reconstruct the item-item graphs, which can be considered as a self-supervised relation reconstruction (SRR) task to enhance the learning of interest representations.
In detail, since we have obtained the representation \(\mathbf {z}_{i}^{r}\) of each relation \(r\) for every item \(i \in \mathcal {I}\) in Section 4.2.1, we calculate prediction scores for each relation between items:
\begin{equation} \hat{{o}}_{i,i^{\prime }}^{r} = \sum \limits _{j}^{d}(\mathbf {z}_{i}^{r} \circ \mathbf {z}_{i^{\prime }}^{r})[j], \end{equation}
(16)
where \(r\) represents the \(r\)th relation. We then use the BPR loss to reconstruct the graph \(\mathcal {G}_{i-i}^{r}\), which can be formulated as
\begin{equation} \mathcal {L}_{i-i}=-\sum _{r=1}^{|\mathcal {R}_{i-i}|} \sum _{(i, p, q) \in O_{i-i,r}} \ln \sigma \left(\hat{o}_{i, p}^{r}-\hat{o}_{i, q}^{r}\right) , \end{equation}
(17)
where \(\mathcal {O}_{i-i,r} = \lbrace (i,p,q)|(i,p)\in \mathcal {O}_{i-i,r}^{+}, (i,q) \in \mathcal {O}_{i-i,r}^{-}\rbrace\) denotes the training dataset of the item-item relation graph reconstruction task, defined similarly to that in Section 4.4.1. Finally, for the total loss, we have:
\begin{equation} \mathcal {L}_{total} = \mathcal {L}_{u-i}+\beta \mathcal {L}_{i-i}+\lambda \Vert \Theta \Vert _{\mathrm{F}}^{2}, \end{equation}
(18)
where \(\Theta\) represents the set of all trainable parameters, \(\lambda\) is the weight for the regularization term, \(\beta \in [0,1]\) is the weight of \(\mathcal {L}_{i-i}\).
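The sketch below combines the relation-reconstruction loss of Equations (16)-(17) for one triplet with the total objective of Equation (18); the parameter list and coefficient values are illustrative.

```python
import numpy as np

def srr_loss(z_i, z_pos, z_neg):
    """z_*: [d] relation-r item representations for one (i, p, q) triplet."""
    o_pos, o_neg = np.dot(z_i, z_pos), np.dot(z_i, z_neg)      # Equation (16): inner-product scores
    return -np.log(1.0 / (1.0 + np.exp(-(o_pos - o_neg))))      # Equation (17): -ln sigma(o_pos - o_neg)

def total_loss(loss_ui, loss_ii, params, beta=0.5, lam=1e-4):
    reg = sum(np.sum(p ** 2) for p in params)                    # squared Frobenius regularization
    return loss_ui + beta * loss_ii + lam * reg                  # Equation (18)

l = total_loss(0.3,
               srr_loss(*[np.random.randn(16) for _ in range(3)]),
               params=[np.random.randn(16, 16)])
```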

4.5 Complexity Analysis

4.5.1 Time Complexity.

In CIE, we spend \(\mathcal {O}(L_{i-i}|\mathcal {E}_{i-i}| d)\) on message propagation in the knowledge-aware item-item graph, where \(L_{i-i}\) denotes the number of GNN layers used for item-item relations, \(|\mathcal {E}_{i-i}|\) is the number of edges on \(\mathcal {G}_{i-i}\), and \(d\) is the embedding size. After that, the time spent extracting interests from the item-item relations is \(\mathcal {O}(|\mathcal {R}_{i-i}| d^{2})\), where \(|\mathcal {R}_{i-i}|\) is the number of relations. In FBC, it takes \(\mathcal {O}(L_{u-i}|\mathcal {E}_{u-i}| d)\) to propagate embeddings in the user-item bipartite graph, where \(L_{u-i}\) is the number of GNN layers used for user-item relations and \(|\mathcal {E}_{u-i}|\) denotes the number of edges on \(\mathcal {G}_{u-i}\). Besides, the computational complexity of the self-attention mechanism is \(\mathcal {O}(K L_{u-i} d^{2})\), where \(K\) is the number of behaviors. In summary, the overall time complexity of CKML mainly comes from the GNN part. The time complexity of our model is comparable to that of other GNN-based methods, and we perform experiments to validate this in Section 5.2.3.

4.5.2 Space Complexity.

Most of the parameters that the model needs to learn are the embeddings of users and items, which cost \(\mathcal {O}((M+N)*d)\). The space costs of the transformation matrices for extracting shared interests and specific interests are \(\mathcal {O}(|\mathcal {R}_{i-i}| d^{2}+d)\) and \(\mathcal {O}(K|\mathcal {R}_{i-i}| d^{2}+Kd)\), respectively, where \(K\) is the number of behaviors. The space cost of \(\tilde{\mathbf {Q}}^{h}\), \(\tilde{\mathbf {K}}^{h}\), and \(\tilde{\mathbf {V}}^{h}\) in the attention mechanism is \(\mathcal {O}(L_{u-i}*{(d^{*})^{2}\over {H}})\), where \(d^{*} = {d\over {N_{spe}}} = {d\over {N_{sha}}}\) and \(H\) is the number of attention heads. All in all, CKML has limited additional parameters beyond the embeddings of users and items.

5 Experiments

We conduct experiments to answer the following questions:
RQ1: How does CKML perform in terms of effectiveness and efficiency against various baselines?
RQ2: How do different components of CKML affect the performance?
RQ3: Can the design of shared and behavior-specific interests bring benefits to multi-behavior recommendation?
RQ4: How do different hyper-parameters affect the performance of CKML?
RQ5: How is the interest interpretability of CKML? Are the cluster centers of different interests really farther apart? Can the shared and specific interest patterns captured by CKML be represented in an explainable way?

5.1 Experimental Setting

5.1.1 Dataset Description.

We evaluate our model on three public datasets (i.e., Yelp, Online Retail, and Tmall) with the same parameter settings and preprocessing as the compared baseline models. The behavior types and statistics of the three datasets are shown in Table 2.
Table 2.
Dataset | #User | #Item | #Interaction | #Target Interaction | #Interactive Behavior Type
Yelp | 19,800 | 22,734 | \(1.4 \times 10^6\) | 677,343 | {Tip, Dislike, Neutral, Like}
Online Retail | 147,894 | 99,037 | \(7.7 \times 10^6\) | 642,916 | {Page View, Favorite, Cart, Purchase}
Tmall | 31,882 | 31,232 | \(1.5 \times 10^6\) | 167,862 | {Page View, Favorite, Cart, Purchase}
Table 2. Statistics of Evaluation Datasets
Yelp: We fully align the experiment protocol with KHGT. Following the partition strategy in References [25, 28], KHGT differentiates the explicit user-item interactive behavior into three types in terms of user rating scores (ranging from 1 (worst) to 5 (best) stars with 0.5-star increments): dislike (\(r_{scores}\in [0,2]\)), neutral (\(r_{scores}\in (2,4)\)), and like (\(r_{scores}\in [4,5]\)). In addition, users offer tips about venues, which are considered as the tip behavior.
Online Retail: Following KHGT, we also tested our CKML on a real-world online retail dataset containing explicit user-item interactions of multiple types, which includes page view, add-to-cart, favorite, and purchase.
Tmall: The dataset is collected from Tmall, one of China’s largest e-commerce platforms. It contains various user interactions, including page views, adding items to favorites or carts, and making purchases. Following the approach taken in CML [42], we only include users with at least three purchases in our training and testing datasets.
Following the setting of KHGT and CML, like is regarded as the target behavior, i.e., the behavior to be predicted, for Yelp, while purchase is the target behavior of Online Retail and Tmall.

5.1.2 Evaluation Protocols.

We apply two widely used metrics, i.e., Hit Ratio (HR@\(N\)) and Normalized Discounted Cumulative Gain (NDCG@\(N\)), to evaluate the performance. HR@\(N\) is a recall-based metric which measures the average proportion of correct items in the top-\(N\) recommendation lists. NDCG@\(N\) evaluates the ranking quality of the top-\(N\) recommendation lists in a position-wise manner. To fairly compare our models and the baselines, we follow the evaluation settings of KHGT and set \(N=10\) by default in all experiments. Following the setting of KHGT, the last interacted item of the behavior to be predicted is used as a positive example in the test data, while 99 randomly selected items the user has not interacted with are taken as negative examples. Besides, we also provide an all-item ranking [23, 31] to evaluate the performance of the recent recommender algorithms.
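As an illustration of the two metrics under the leave-one-out protocol described above (one held-out positive per user, ranked against sampled or all negatives), a minimal sketch:

```python
import numpy as np

def hr_ndcg_at_n(rank_of_positive, n=10):
    """rank_of_positive: 1-based rank of the held-out positive item among the candidate items."""
    hr = 1.0 if rank_of_positive <= n else 0.0
    ndcg = 1.0 / np.log2(rank_of_positive + 1) if rank_of_positive <= n else 0.0
    return hr, ndcg

# Average over users: e.g., positives ranked 1st, 4th, and 15th in their candidate lists
ranks = [1, 4, 15]
hr, ndcg = np.mean([hr_ndcg_at_n(r) for r in ranks], axis=0)
```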

5.1.3 Baseline Models.

To verify the effectiveness of our CKML model, we compare it with various baseline models, which can be categorized into four groups: (A) Single-behavior non-graph models (BPR [29], AutoRec [32], MIND [24], ComiRec [7]); (B) Single-behavior graph models (NGCF [39], DGCF [41], KGAT [38]); (C) Multi-behavior non-graph models (NMTR [12], DIPN [14], MATN [44]); (D) Multi-behavior graph models (DGCF\(_{M}\) [41], NGCF\(_{M}\) [39], LightGCN\(_{M}\) [17], MBGCN [19], CML [42], KHGT [45]). Among them, MIND, ComiRec, and DGCF are multi-interest-based models, MATN and KHGT are transformer-based models, and CML is a contrastive-learning-based model. As DGCF, NGCF, and LightGCN are originally designed for a single behavior, we use the multi-behavior data as input to train these models and name them DGCF\(_{M}\), NGCF\(_{M}\), and LightGCN\(_{M}\).
Single-behavior Non-graph Models:
BPR [29] It is a conventional approach to collaborative filtering that utilizes pairwise ranking loss to personalize item recommendations and generate item rankings.
AutoRec [32] It encodes vectors of users and items through reconstruction functions based on the autoencoder framework.
MIND [24] It designs a multi-interest extractor layer with a variant dynamic routing to extract users’ diverse interests and uses a label-aware attention scheme to learn these interests.
ComiRec [7] It captures multiple interests from interactions of users, retrieving candidate items from the large-scale item pool. Besides, this method leverages a controllable factor to balance the recommendation accuracy and diversity.
Single-behavior Graph Models:
NGCF [39] This approach exploits higher-order connectivity of user-item bipartite graphs via GNN.
DGCF [41] This model disentangles user-item interaction graphs by modeling the interests behind the interactions, aiming to learn the representations of different interests.
KGAT [38] It uses the GAT framework to capture higher-order connectivity between users and items in a collaborative knowledge graph, which combines user-item interaction graphs and knowledge graphs.
Multi-behavior Non-graph Models:
NMTR [12] It captures cascading relationships between users’ multi-behavioral interactions using multi-task learning.
DIPN [14] This method leverages multiple behavioral interactions to predict user purchase intention via recurrent neural network and attention mechanism.
MATN [44] It explores the dependencies between multiple behaviors and their contributions to the target behavior.
Multi-behavior Graph Models:
\(\mathbf {DGCF_{M}}\) [41] It takes multi-behavioral interactive information as input and correlates the information of different behaviors at the interest level through an attention mechanism.
\(\mathbf {NGCF_{M}}\) [39] It utilizes multiple behaviors by modeling the relationships between them, following KHGT.
\(\mathbf {LightGCN_{M}}\) [17] It removes feature transformation and nonlinear activation from GCN. Each category of behavior has the same influence on the target behavior.
MBGCN [19] This method uses graph convolutional networks on multi-behavior user-item interaction graphs, which learn the weights of multiple behaviors during embedding propagation.
CML [42] This approach proposes meta-learning and contrastive meta-learning paradigms to distill transferable knowledge across different types of behaviors.
KHGT [45] It encodes multi-behavioral interactive information between users and items using a graph transformer network and infers the influence of multi-behavior interactions on the target behavior.

5.1.4 Parameter Settings.

Our proposed CKML is implemented in TensorFlow [2]. We fix the embedding size to 16, in line with KHGT, for a fair comparison. The batch size is searched in {16, 32, 64}. We initialize the parameters using Xavier [13]. The parameters are optimized by Adam [20], while the learning rate and decay rate are set to \(10^{-3}\) and 0.96, respectively. We search the number of GNN layers in {1, 2, 3, 4} for the knowledge-aware item-item graph and the user-item bipartite graph, respectively. We set the number of self-attention heads to 2. The number of shared interests, as well as the number of specific interests, is varied in {1, 2, 4}, which is investigated in Section 5.5.1. The temperature coefficient used in the interest-aware behavior allocation is tuned in {0.1, 1, 5, 10, 20}, and the corresponding number of iterations is set to 2. We conduct a grid search of the loss coefficient for each behavior in {0, 0.2, 0.4, 0.6, 0.8, 1}. All experiments are run 5 times, and the average results are reported.

5.2 Performance Comparison (RQ1)

5.2.1 Effectiveness Comparison under the Setting of 99 Negative Samples.

Table 3 shows the performance of different methods on three datasets with respect to HR@10 and NDCG@10. We have the following findings:
Table 3.
Model | Yelp HR | Yelp NDCG | Retail HR | Retail NDCG | Tmall HR | Tmall NDCG
BPR | 0.744 | 0.450 | 0.261 | 0.165 | 0.244 | 0.150
AutoRec | 0.765 | 0.472 | 0.313 | 0.190 | 0.321 | 0.156
MIND | 0.789 | 0.514 | 0.307 | 0.191 | 0.314 | 0.185
ComiRec | 0.774 | 0.488 | 0.314 | 0.196 | 0.291 | 0.184
NGCF | 0.789 | 0.500 | 0.302 | 0.185 | 0.314 | 0.173
DGCF | 0.861 | 0.587 | 0.304 | 0.169 | 0.322 | 0.184
KGAT | 0.835 | 0.543 | 0.377 | 0.214 | 0.395 | 0.243
NMTR | 0.790 | 0.478 | 0.332 | 0.179 | 0.362 | 0.215
DIPN | 0.791 | 0.500 | 0.317 | 0.178 | 0.323 | 0.207
MATN | 0.826 | 0.530 | 0.354 | 0.209 | 0.406 | 0.225
DGCF\(_{M}\) | 0.863 | 0.591 | 0.467 | 0.282 | 0.448 | 0.280
NGCF\(_{M}\) | 0.793 | 0.492 | 0.374 | 0.221 | 0.322 | 0.182
LightGCN\(_{M}\) | 0.873 | 0.573 | 0.472 | 0.277 | 0.455 | 0.282
MBGCN | 0.796 | 0.502 | 0.369 | 0.222 | 0.381 | 0.213
CML | 0.785 | 0.471 | 0.499 | 0.289 | 0.513 | 0.302
KHGT | 0.880 | 0.603 | 0.464 | 0.278 | 0.391 | 0.232
CKML | 0.896* | 0.624* | 0.527* | 0.323* | 0.527* | 0.321*
Rel Impr. | 1.82% | 3.48% | 5.61% | 11.76% | 2.73% | 6.29%
Table 3. The Overall Performance Comparison for Sampling-Item Test
Boldface denotes the highest score and underline indicates the results of the best baselines. \(\star\) represents significance level \(p\)-value \(\lt 0.05\) of comparing CKML with the best baseline.
The effectiveness of CKML model. Our proposed CKML consistently achieves the best results on all datasets. More specifically, CKML improves the strongest baselines by 1.82%, 5.61% and 2.73% in terms of HR (3.48%, 11.76%, and 6.29% in terms of NDCG) on Yelp, Retail, and Tmall datasets, respectively. The great improvements over baselines demonstrate the effectiveness of CKML for multi-behavior recommendation.
Both GNN-based and multi-behavior-based methods improve model performance. Despite the various architectures among different baseline models, we can find that GNN-based models consistently perform much better than non-graph models. For example, by incorporating neighbor information into representations, MBGCN and NGCF outperform DIPN and BPR on most datasets and metrics under the multi-behavior and single-behavior settings, respectively. Besides, the multi-behavior models KHGT and MBGCN achieve much better performance than the single-behavior models KGAT and NGCF, which further verifies the effectiveness of adding multi-behavior information for learning.
CKML consistently outperforms GNN-based multi-behavior baseline models. Our proposed CKML surpasses the performance of DGCF\(_{M}\), NGCF\(_{M}\), LightGCN\(_{M}\), MBGCN, and the state-of-the-art multi-behavior models KHGT and CML. By empowering multi-behavior recommendation with multi-interest learning, CKML is capable of modeling the complex dependencies among multiple behaviors with multi-grained representations to infer user preference, whereas existing multi-behavior models only consider the observed user-item interactions as unified representations. Notice that CML performs well on the Retail and Tmall datasets but significantly worse on Yelp. A probable reason is that some behaviors in Yelp are mutually exclusive (e.g., Dislike and Like), while CML assumes that different behaviors of the same user are similar for contrastive learning, which does not hold in this case.

5.2.2 Effectiveness Comparison under the Setting of All-Item Ranking.

All-item ranking is another evaluation protocol which is widely used for testing [23, 31]. For comprehensive comparison, we compare our CKML with advanced methods under this setting. Specifically, we take the last item in the test data that interacts with the behavior to be predicted as a positive example, and all of the items that users do not interact with as the negative examples. As shown in Table 4, we can find that our CKML still performs best under this setting. Specifically, CKML improves the strongest baselines by 30.37%, 18.45%, and 26.43% in terms of HR (24.41%, 19.61%, and 30.16% in terms of NDCG) on Yelp, Retail, and Tmall datasets, respectively. The results show that our model has good robustness under different ranking settings.
Table 4.
Model | Yelp HR | Yelp NDCG | Retail HR | Retail NDCG | Tmall HR | Tmall NDCG
MIND | 0.0171 | 0.0087 | 0.0074 | 0.0037 | 0.0093 | 0.0047
ComiRec | 0.0320 | 0.0156 | 0.0073 | 0.0039 | 0.0090 | 0.0042
NGCF | 0.0230 | 0.0108 | 0.0033 | 0.0018 | 0.0086 | 0.0043
NGCF\(_{M}\) | 0.0317 | 0.0146 | 0.0061 | 0.0029 | 0.0100 | 0.0048
CML | 0.0320 | 0.0150 | 0.0103 | 0.0049 | 0.0140 | 0.0063
KHGT | 0.0428 | 0.0213 | 0.0099 | 0.0051 | 0.0102 | 0.0053
CKML | 0.0558* | 0.0265* | 0.0122* | 0.0061* | 0.0177* | 0.0082*
Rel Impr. | 30.37% | 24.41% | 18.45% | 19.61% | 26.43% | 30.16%
Table 4. The Overall Performance Comparison for All-Item Test
Boldface denotes the highest score and underline indicates the results of the best baselines. \(\star\) represents significance level \(p\)-value \(\lt 0.05\) of comparing CKML with the best baseline.

5.2.3 Efficiency Comparison.

In addition to effectiveness, efficiency is also important. We conduct experiments to evaluate the time cost of training and testing. Each result is obtained by training the models on a single cluster, where each node contains a 16-core Intel(R) Xeon(R) Silver 4216 CPU (2.10 GHz) and one NVIDIA GeForce RTX 3090. The details are as follows.
Training Efficiency. Table 5 shows the average training time of our proposed CKML and KHGT for each epoch. For the sake of fairness, we keep the parameters related to training efficiency consistent, such as the batch size and the number of GNN layers. We can find that CKML is faster, with 13.63%, 19.42%, and 20.11% time reductions on the three datasets. One probable reason is that we split the complete graph into several smaller graphs under interests and then perform computation separately on these smaller graphs, which can be accelerated by parallel computation.
Testing Efficiency. For the sake of fairness, we set the parameters related to testing efficiency consistent, like batch size and GNN layer. As shown in Table 6, we can find that our proposed CKML is 12.59%, 9.42%, and 21.00% faster than KHGT on the three datasets for the testing time. The results show that our proposed CKML has higher efficiency when tested on the three datasets, which further demonstrates our views.
In summary, we claim that CKML has the best overall training and testing efficiency.
Table 5.
Table 5. Training Time Comparison (Seconds Per Epoch) of Different Methods on All Three Datasets
Table 6.
Table 6. Testing Time Comparison (Seconds Per Epoch) of Different Methods on All Three Datasets
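To illustrate the source of this speed-up, the following sketch splits a user-item adjacency matrix into per-interest subgraphs and propagates over each one independently. The hard edge-to-interest assignment and the plain, unnormalized LightGCN-style propagation are simplifying assumptions for illustration; they are not the exact CKML implementation.

```python
import numpy as np
import scipy.sparse as sp

def split_by_interest(adj, edge_interest, n_interests):
    """Split one user-item adjacency matrix into per-interest adjacency matrices.
    edge_interest maps (user, item) -> interest id (assumed hard assignment)."""
    coo = adj.tocoo()
    subgraphs = []
    for k in range(n_interests):
        mask = np.array([edge_interest[(u, i)] == k
                         for u, i in zip(coo.row, coo.col)])
        sub = sp.csr_matrix((coo.data[mask], (coo.row[mask], coo.col[mask])),
                            shape=adj.shape)
        subgraphs.append(sub)
    return subgraphs

def propagate(subgraphs, item_emb):
    """One propagation step per interest: each sparse matmul touches only a
    fraction of the edges and is independent, so the loop can run in parallel."""
    return [g @ item_emb for g in subgraphs]
```

Because every per-interest multiplication involves far fewer edges than the full graph, the per-epoch cost drops even before any explicit parallelism is applied.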

5.3 Ablation Study (RQ2)

CKML is built with several important designs, including multi-interest learning (MI), CIE, and FBC. To analyze the rationale of each design, we compare CKML with several model variants.
CKML w/o CIE: We remove the coarse-grained interest extracting module and express each interest with randomly initialized vectors.
CKML w/o FBC: We replace the fine-grained behavioral correlation module with a combination of the best-performing GCN methods (LightGCN for Yelp, GCCF for Retail and Tmall) and a summation operation.
CKML w/o MI: To evaluate the effectiveness of multi-interest, we remove the above two modules simultaneously and use unified vectors as user and item representations.
The performance of CKML and its variants is summarized in Table 7, from which we draw the following conclusions:
Table 7.
Model | Yelp HR | Yelp NDCG | Retail HR | Retail NDCG | Tmall HR | Tmall NDCG
CKML | 0.896* | 0.624* | 0.527* | 0.323* | 0.527* | 0.321*
CKML w/o CIE | 0.893 | 0.619 | 0.510 | 0.310 | 0.507 | 0.308
CKML w/o FBC | 0.887 | 0.610 | 0.491 | 0.290 | 0.508 | 0.311
CKML w/o MI | 0.839 | 0.524 | 0.444 | 0.246 | 0.387 | 0.227
Table 7. Performance of Different CKML Variants
\(\star\) represents significance level \(p\)-value \(\lt 0.05\) of comparing CKML with other variants.
Boldface denotes the highest score.
Comparing CKML with its first two variants, we can find that removing or replacing either key component degrades performance. This demonstrates the rationality and effectiveness of the two key designs.
It is worth noticing that CKML w/o MI achieves the worst performance on all three datasets compared with the other variants, which retain multi-interest learning. In particular, this variant suffers performance declines of 6.36%, 15.75%, and 26.57% in terms of HR (16.03%, 23.84%, and 29.28% in terms of NDCG) on the Yelp, Retail, and Tmall datasets, respectively. This further demonstrates the effectiveness of multi-interest learning for modeling the complex dependencies among multiple behaviors.

5.4 Study of Interests (RQ3)

We propose to explicitly separate interests into shared and specific interests to alleviate the negative impact of irrelevant interactions. To demonstrate the superiority of this correlation modeling strategy, we replace it with two variants, namely, only shared interests and only specific interests. We keep the number of interests fixed and apply them as the basis of CKML for multi-behavior recommendation. The resulting variants are named CKML-Shared and CKML-Specific, respectively. The results are reported in Table 8, and we make the following observations:
Table 8.
Model | Yelp HR | Yelp NDCG | Retail HR | Retail NDCG | Tmall HR | Tmall NDCG
CKML-Shared | 0.896 | 0.620 | 0.513 | 0.311 | 0.518 | 0.318
CKML-Specific | 0.814 | 0.497 | 0.271 | 0.140 | 0.379 | 0.227
CKML | 0.896 | 0.623 | 0.527 | 0.323 | 0.527 | 0.321
Table 8. Impact of Shared Interests and Specific Interests
Boldface denotes the highest score.
CKML-Specific performs worse on the Yelp, Retail, and Tmall datasets. This is because CKML-Specific fails to utilize information from other behaviors to assist the recommendation of the target behavior, as it neglects the interests shared among multiple behaviors (e.g., Tip and Like on Yelp, as well as Add-to-cart and Purchase on Retail and Tmall).
CKML, which considers both shared and specific interests, achieves the best performance on all three datasets. This suggests that modeling both kinds of interests alleviates the effect of irrelevant interactions and improves the robustness of the model.

5.5 Hyper-Parameter Study (RQ4)

5.5.1 Impact of the Number of Interests.

To investigate how the number of interests affects the performance of CKML, we vary the number of interests in the range {2, 4}. For simplicity, we set the numbers of shared and specific interests to be the same. The results are presented in Figure 3. When the embedding size is set to 16, in line with KHGT, the model with 2 interests achieves the best results on all three datasets, and performance drops considerably when the number of interests increases from 2 to 4. A possible reason is that the embedding size per interest becomes too small (only 8) to learn good representations. We further extend the embedding size to 16 and 32 and observe significant performance improvements for both 2 and 4 interests, which verifies the above assumption. As the embedding size grows, KHGT performs consistently worse than our model, which shows the superiority of the proposed CKML. Moreover, KHGT suffers a performance drop on the Yelp, Retail, and Tmall datasets when a larger embedding size is applied, possibly because KHGT is more prone to overfitting as it overlooks multi-interest modeling.
Fig. 3.
Fig. 3. Impact of the number of interests. The solid line and the dotted line represent HR and NDCG, respectively.

5.5.2 Impact of Temperature Coefficient.

The interaction between a user and an item may stem from a single interest or from a combination of multiple interests. To investigate this, we vary the temperature coefficient used for behavior allocation, and the results are reported in Figure 4. We can see that a moderate temperature coefficient is needed for CKML to achieve the best performance. When the temperature coefficient is set too small, the performance deteriorates rapidly. One possible reason is that the probability distribution over interests then approaches a one-hot vector, which makes it challenging to learn. The performance also degrades when the temperature coefficient is set too large, possibly because the weights of the multiple interests become similar and the model fails to identify the interest behind each interaction. This again illustrates the importance of exploring multiple interests.
Fig. 4.
Fig. 4. Impact of temperature coefficient.
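To make the effect of the temperature coefficient concrete, the sketch below applies a temperature to a softmax over interest logits. The logits are made-up numbers; in CKML the allocation weights are learned through dynamic routing, so this is only an illustration of the limiting behaviors discussed above.

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    """Distribute one interaction over interests; tau controls the sharpness."""
    z = (logits - logits.max()) / tau  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5, 0.2])        # assumed interest affinities
print(softmax_with_temperature(logits, 0.1))   # tiny tau: nearly one-hot, hard to learn
print(softmax_with_temperature(logits, 1.0))   # moderate tau: informative distribution
print(softmax_with_temperature(logits, 10.0))  # large tau: nearly uniform, interests blur
```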

5.5.3 Impact of GCN Aggregators.

We investigate the impact of different GCN aggregators, i.e., GCN [21], NGCF [39], LR-GCCF [10], and LightGCN [17]. The models with different aggregators are compared in Figure 5. LightGCN performs the best on Yelp among the four aggregators; the reason might be that removing the transformation matrix and nonlinear functions eases training and alleviates overfitting. CKML with LR-GCCF achieves the best performance on Retail and Tmall, probably because these two datasets contain multiple types of closely correlated behaviors, which places high demands on the fitting ability of the model, so the richer aggregation of LR-GCCF (with feature transformation and residual design) better facilitates fitting Retail and Tmall.
Fig. 5.
Fig. 5. Impact of GCN aggregators.
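For reference, the sketch below contrasts single propagation layers of the two aggregators that perform best in this comparison, written with plain NumPy over a normalized adjacency matrix. The residual form shown for LR-GCCF is our simplified reading of [10], not a verbatim reproduction of either paper's code.

```python
import numpy as np

def lightgcn_layer(adj_norm, emb):
    """LightGCN [17]: pure neighborhood averaging, no feature transform, no nonlinearity."""
    return adj_norm @ emb

def lr_gccf_layer(adj_norm, emb, weight):
    """LR-GCCF [10] (simplified): linear propagation with a feature transform and a
    residual connection, still without a nonlinear activation."""
    return adj_norm @ emb @ weight + emb
```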

5.6 Case Study (RQ5)

5.6.1 The Visualized Analysis of Interest Initialization.

We have claimed in Section 1 that initializing the clustering centers to be far apart is significant for interest learning. To illustrate that the initialization process of CIE keeps the initial interest centers as far apart as possible, and does so better than random initialization, we compute the average Euclidean distance between the interests of each item for both CIE and Random, and plot the distance distributions in Figure 6. Specifically, we calculate and average the Euclidean distance between all pairs of interests in \(\mathbf {g}_{i}^{k}\) for each item \(i\):
\begin{equation} {Distance}(i)=\sum _{s=1}^{N_*}\sum _{s^{\prime }=1 \atop s^{\prime }\ne s}^{N_*} \frac{\sqrt {\sum _{j=1}^{d^*}\left(\mathbf {g}_{i}^{k}[s,j]-\mathbf {g}_{i}^{k}[s^{\prime },j]\right)^2}}{N_*^{2}-N_*} , \end{equation}
(19)
where \(s\) denotes the \(s\)-th interest, \(N_*\) is the number of interests, and \(d^*\) is the interest embedding size. \(k\) represents the \(k\)-th behavior, and here \(k\) is set to the target behavior.
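The sketch below is a direct transcription of Equation (19) in code; interest_emb is assumed to hold the \(N_*\) interest vectors of one item under the target behavior, one row per interest.

```python
import numpy as np

def average_interest_distance(interest_emb):
    """Equation (19): mean pairwise Euclidean distance between the N_* interest
    vectors of one item; interest_emb has shape [N_*, d^*]."""
    n = interest_emb.shape[0]
    total = 0.0
    for s in range(n):
        for t in range(n):
            if s != t:
                total += np.linalg.norm(interest_emb[s] - interest_emb[t])
    return total / (n * n - n)

# toy usage with 2 interests of dimension 16
print(average_interest_distance(np.random.randn(2, 16)))
```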
Fig. 6.
Fig. 6. The distribution of average distances.
We can observe that, across all three datasets, the distribution of average interest distances obtained by CIE is shifted toward larger values, which means the clustering centers initialized by CIE are farther apart than those initialized randomly. This suggests that CIE better initializes the interest centers, enabling the model to identify the interests behind interactions efficiently.

5.6.2 The Visualized Analysis of Shared and Specific Interests.

We randomly select five users and the items they have interacted with under the target behavior. In Figure 7, we visualize the representations of items under shared interest and specific interest obtained from CKML, as well as the representations obtained by KHGT.
Fig. 7.
Fig. 7. Visualization of items representations via t-SNE. Points of the same color represent items being interacted with by the same user. Each star is the center of points with the same color.
Comparing the points with the same color in Figure 7(a)–(c), we can find that items under the shared interest and under KHGT are more clustered than those under the specific interest. A probable reason is that Yelp has few interactions of the target behavior, which makes it hard to mine the interest-related information behind the interactions. Besides, the shared interest and KHGT introduce additional interaction information from other behaviors, which helps learn better item representations.
We further analyze Retail and Tmall, and the results for the same (target) behavior on the two datasets are shown in Figures 7(d)–(f) and 7(g)–(i), respectively. Items under the shared interest are again more clustered than those under the specific interest. One possible reason is the strong correlation among the four behaviors (e.g., page view, favorite, cart, and purchase), which brings additional interaction information to assist the learning of the target behavior. Moreover, items under the shared interest are more closely distributed than those of KHGT. This is because CKML can better extract shared interests while excluding the interference of behavior-specific interests.
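The plots in Figure 7 can be reproduced with a standard t-SNE projection. Below is a minimal sketch, assuming the item representations and the id of the user each item belongs to have already been gathered into NumPy arrays; the star markers correspond to the per-user cluster centers in the figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_item_embeddings(item_emb, user_ids):
    """Project item representations to 2-D and color points by the interacting user."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(item_emb)
    for uid in np.unique(user_ids):
        pts = coords[user_ids == uid]
        plt.scatter(pts[:, 0], pts[:, 1], s=10, label=f"user {uid}")
        plt.scatter(*pts.mean(axis=0), marker="*", s=200)  # cluster center (star in Figure 7)
    plt.legend()
    plt.show()
```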

5.6.3 The Visualized Analysis of Behavioral Correlation.

Figure 8 depicts the explicit relevance scores (\(\lambda _{k, k^{\prime }}^{u, h}\) and \(\lambda _{k, k^{\prime }}^{i, h}\)) learned by our CKML model for predicting purchases on the Retail and Tmall datasets. The visualization reveals a hierarchical and explainable correlation among the four types of user-item interactions. The darkness of the colors indicates the strength of the behavioral relevance, with darker colors representing higher relevance. In each row of the figure, the squares represent the cross-type behavioral dependencies learned through our Fine-grained Behavioral Correlation. For instance, in the Retail dataset, the “purchase” behavior demonstrates higher relevance with “page view” and “cart”, while exhibiting lower relevance with “favorite”. Similar observations can be drawn from the Tmall dataset. Moreover, we find that calculating the relevance between behaviors based on information aggregated from the item side yields better discrimination. This may stem from our CKML model’s ability to extract coarse-grained interests from item-item information, enabling the learning of more comprehensive behavioral correlations.
Fig. 8.
Fig. 8. Visualization of the explicit relevance learned by CKML. For each behavior on the vertical axis, we individually analyze the relevances of all behaviors with it, and express the relevances with the darkness of the square color.
Furthermore, we analyze the label correlations to explain the above results. Figure 9 shows the behavioral label correlations with Venn diagrams, where different overlaps represent different label correlations. We can find that the total proportion of X1X1 patterns (X = 0/1) is only 0.57% and 0.73% on the Retail and Tmall datasets, respectively, whereas the total proportion of X0X1 patterns (X = 0/1) is 9.04% and 12.91%. Hence, the overlap between the target behavior (purchase) and “favorite” is limited in both datasets, suggesting a weak correlation between these two behaviors. This observation aligns with the behavioral correlations learned by our model, and the same analysis holds for the other behaviors.
Fig. 9.
Fig. 9. Venn diagram of label correlations on the two datasets. 1/0 indicates whether a user has or does not have this type of behavior. E.g., 0110 represents users who only have favorite and cart behaviors with items.
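The proportions above can be obtained by simply counting binary behavior patterns over user-item pairs. Below is a minimal sketch, assuming a 0/1 matrix whose columns follow the order (page view, favorite, cart, purchase); the data here is random and only illustrates the counting.

```python
import numpy as np
from collections import Counter

def pattern_proportions(labels):
    """labels: [n_pairs, 4] binary matrix with columns (page view, favorite, cart, purchase).
    Returns the share of each 4-bit pattern, e.g., '0110' = only favorite and cart."""
    patterns = ["".join(map(str, row)) for row in labels.astype(int)]
    counts = Counter(patterns)
    total = len(patterns)
    return {p: c / total for p, c in counts.items()}

props = pattern_proportions(np.random.randint(0, 2, size=(1000, 4)))
# X1X1: favorite = 1 and purchase = 1, regardless of the other two behaviors
x1x1 = sum(v for p, v in props.items() if p[1] == "1" and p[3] == "1")
print(x1x1)
```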

6 Conclusion

In this article, we propose the CKML framework for multi-behavior recommendation. To make full use of knowledge-aware information for extracting shared and behavior-specific interest representations, we propose the CIE module. To further learn the interest representations of each user and item under different behaviors and exchange information across behaviors at fine granularity, we propose the GNN-based FBC module, which allocates edge weights via dynamic routing and exchanges information via a self-attention mechanism. We conduct comprehensive experiments on three real-world datasets and show that the proposed CKML outperforms all state-of-the-art methods on all of them. In addition, the visualization experiments demonstrate the superiority of our well-designed shared and behavior-specific interests.

References

[1]
2020. MindSpore. Retrieved from https://www.mindspore.cn.
[2]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2–4, 2016, Kimberly Keeton and Timothy Roscoe (Eds.). USENIX Association, 265–283. Retrieved from https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.
[3]
David Arthur and Sergei Vassilvitskii. 2007. k-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7–9, 2007, Nikhil Bansal, Kirk Pruhs, and Clifford Stein (Eds.). SIAM, 1027–1035. Retrieved from http://dl.acm.org/citation.cfm?id=1283383.1283494.
[4]
Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. 2012. Scalable K-Means++. Proceedings of the VLDB Endowment 5, 7 (2012), 622–633. DOI:DOI:
[5]
Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. 2019. Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences. In Proceedings of the World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13–17, 2019, Ling Liu, Ryen W. White, Amin Mantrach, Fabrizio Silvestri, Julian J. McAuley, Ricardo Baeza-Yates, and Leila Zia (Eds.). ACM, 151–161. DOI:DOI:
[6]
Rich Caruana. 1997. Multitask learning. Machine Learning 28, 1 (1997), 41–75. DOI:DOI:
[7]
Yukuo Cen, Jianwei Zhang, Xu Zou, Chang Zhou, Hongxia Yang, and Jie Tang. 2020. Controllable multi-interest framework for recommendation. In Proceedings of the KDD’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23–27, 2020, Rajesh Gupta, Yan Liu, Jiliang Tang, and B. Aditya Prakash (Eds.). ACM, 2942–2951. DOI:DOI:
[8]
Chong Chen, Weizhi Ma, Min Zhang, Zhaowei Wang, Xiuqiang He, Chenyang Wang, Yiqun Liu, and Shaoping Ma. 2021. Graph heterogeneous multi-relational recommendation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, AAAI 2021, 33rd Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The 11th Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2–9, 2021. AAAI Press, 3958–3966. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/16515.
[9]
Chong Chen, Min Zhang, Yongfeng Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2020. Efficient heterogeneous collaborative filtering without negative sampling for recommendation. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI 2020, The 32nd Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The 10th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020. AAAI Press, 19–26. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/5329.
[10]
Lei Chen, Le Wu, Richang Hong, Kun Zhang, and Meng Wang. 2020. Revisiting graph based collaborative filtering: A linear residual graph convolutional network approach. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI 2020, The 32nd Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The 10th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020. AAAI Press, 27–34. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/5330.
[11]
Zhiyong Cheng, Sai Han, Fan Liu, Lei Zhu, Zan Gao, and Yuxin Peng. 2023. Multi-behavior recommendation with cascading graph convolution networks. In Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023-4 May 2023, Ying Ding, Jie Tang, Juan F. Sequeda, Lora Aroyo, Carlos Castillo, and Geert-Jan Houben (Eds.). ACM, 1181–1189. DOI:DOI:
[12]
Chen Gao, Xiangnan He, Dahua Gan, Xiangning Chen, Fuli Feng, Yong Li, Tat-Seng Chua, and Depeng Jin. 2019. Neural multi-task recommendation from multi-behavior data. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8–11, 2019. IEEE, 1554–1557. DOI:DOI:
[13]
Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13–15, 2010(JMLR Proceedings, Vol. 9), Yee Whye Teh and D. Mike Titterington (Eds.). JMLR.org, 249–256. Retrieved from http://proceedings.mlr.press/v9/glorot10a.html.
[14]
Long Guo, Lifeng Hua, Rongfei Jia, Binqiang Zhao, Xiaobo Wang, and Bin Cui. 2019. Buying or browsing?: Predicting real-time purchasing intent using attention-based deep network with multiple behavior. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4–8, 2019, Ankur Teredesai, Vipin Kumar, Ying Li, Rómer Rosales, Evimaria Terzi, and George Karypis (Eds.). ACM, 1984–1992. DOI:DOI:
[15]
Wei Guo, Chang Meng, Enming Yuan, Zhicheng He, Huifeng Guo, Yingxue Zhang, Bo Chen, Yaochen Hu, Ruiming Tang, Xiu Li, and Rui Zhang. 2023. Compressed interaction graph based framework for multi-behavior recommendation. In Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023–4 May 2023, Ying Ding, Jie Tang, Juan F. Sequeda, Lora Aroyo, Carlos Castillo, and Geert-Jan Houben (Eds.). ACM, 960–970. DOI:DOI:
[16]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016. IEEE Computer Society, 770–778. DOI:DOI:
[17]
Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yong-Dong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25–30, 2020, Jimmy X. Huang, Yi Chang, Xueqi Cheng, Jaap Kamps, Vanessa Murdock, Ji-Rong Wen, and Yiqun Liu (Eds.). ACM, 639–648. DOI:DOI:
[18]
Chao Huang. 2021. Recent advances in heterogeneous relation learning for recommendation. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event/Montreal, Canada, 19–27 August 2021, Zhi-Hua Zhou (Ed.). ijcai.org, 4442–4449. DOI:DOI:
[19]
Bowen Jin, Chen Gao, Xiangnan He, Depeng Jin, and Yong Li. 2020. Multi-behavior recommendation with graph convolutional networks. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25–30, 2020, Jimmy X. Huang, Yi Chang, Xueqi Cheng, Jaap Kamps, Vanessa Murdock, Ji-Rong Wen, and Yiqun Liu (Eds.). ACM, 659–668. DOI:DOI:
[20]
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). Retrieved from http://arxiv.org/abs/1412.6980.
[21]
Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net. Retrieved from https://openreview.net/forum?id=SJU4ayYgl.
[22]
Yehuda Koren, Robert M. Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37. DOI:DOI:
[23]
Walid Krichene and Steffen Rendle. 2020. On sampled metrics for item recommendation. In Proceedings of the KDD’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23–27, 2020, Rajesh Gupta, Yan Liu, Jiliang Tang, and B. Aditya Prakash (Eds.). ACM, 1748–1757. DOI:DOI:
[24]
Chao Li, Zhiyuan Liu, Mengmeng Wu, Yuchi Xu, Huan Zhao, Pipei Huang, Guoliang Kang, Qiwei Chen, Wei Li, and Dik Lun Lee. 2019. Multi-interest network with dynamic routing for recommendation at Tmall. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3–7, 2019, Wenwu Zhu, Dacheng Tao, Xueqi Cheng, Peng Cui, Elke A. Rundensteiner, David Carmel, Qi He, and Jeffrey Xu Yu (Eds.). ACM, 2615–2623. DOI:DOI:
[25]
Daryl Lim, Julian J. McAuley, and Gert R. G. Lanckriet. 2015. Top-N recommendation with missing implicit feedback. In Proceedings of the 9th ACM Conference on Recommender Systems, RecSys 2015, Vienna, Austria, September 16–20, 2015, Hannes Werthner, Markus Zanker, Jennifer Golbeck, and Giovanni Semeraro (Eds.). ACM, 309–312. Retrieved from https://dl.acm.org/citation.cfm?id=2799671.
[26]
Zheng Liu, Jianxun Lian, Junhan Yang, Defu Lian, and Xing Xie. 2020. Octopus: Comprehensive and elastic user representation for the generation of recommendation candidates. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25–30, 2020, Jimmy X. Huang, Yi Chang, Xueqi Cheng, Jaap Kamps, Vanessa Murdock, Ji-Rong Wen, and Yiqun Liu (Eds.). ACM, 289–298. DOI:DOI:
[27]
Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD International conference on Knowledge Discovery and Data Mining, Boston, MA, USA, August 20–23, 2000, Raghu Ramakrishnan, Salvatore J. Stolfo, Roberto J. Bayardo, and Ismail Parsa (Eds.). ACM, 169–178. DOI:DOI:
[28]
Vito Claudio Ostuni, Tommaso Di Noia, Eugenio Di Sciascio, and Roberto Mirizzi. 2013. Top-N recommendations from implicit feedback leveraging linked open data. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys’13, Hong Kong, China, October 12–16, 2013, Qiang Yang, Irwin King, Qing Li, Pearl Pu, and George Karypis (Eds.). ACM, 85–92. DOI:DOI:
[29]
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18–21, 2009, Jeff A. Bilmes and Andrew Y. Ng (Eds.). AUAI Press, 452–461. Retrieved from https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=1630&proceeding_id=25.
[30]
Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 3856–3866. Retrieved from https://proceedings.neurips.cc/paper/2017/hash/2cad8fa47bbef282badbb8de5374b894-Abstract.html.
[31]
Noveen Sachdeva, Carole-Jean Wu, and Julian J. McAuley. 2022. On sampling collaborative filtering datasets. In WSDM’22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event/Tempe, AZ, USA, February 21–25, 2022, K. Selcuk Candan, Huan Liu, Leman Akoglu, Xin Luna Dong, and Jiliang Tang (Eds.). ACM, 842–850. DOI:
[32]
Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web Companion, WWW 2015, Florence, Italy, May 18–22, 2015 - Companion Volume, Aldo Gangemi, Stefano Leonardi, and Alessandro Panconesi (Eds.). ACM, 111–112. DOI:DOI:
[33]
Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Adv. in Artif. Intell. 2009, Article 4 (Jan 2009), 1 pages.
[34]
Qiaoyu Tan, Jianwei Zhang, Jiangchao Yao, Ninghao Liu, Jingren Zhou, Hongxia Yang, and Xia Hu. 2021. Sparse-interest network for sequential recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual Event, Israel, March 8–12, 2021, Liane Lewin-Eytan, David Carmel, Elad Yom-Tov, Eugene Agichtein, and Evgeniy Gabrilovich (Eds.). ACM, 598–606. DOI:DOI:
[35]
Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive layered extraction (PLE): A novel multi-task learning (MTL) model for personalized recommendations. In Proceedings of the RecSys 2020: 14th ACM Conference on Recommender Systems, Virtual Event, Brazil, September 22-26, 2020, Rodrygo L. T. Santos, Leandro Balby Marinho, Elizabeth M. Daly, Li Chen, Kim Falk, Noam Koenigstein, and Edleno Silva de Moura (Eds.). ACM, 269–278. DOI:DOI:
[36]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5998–6008. Retrieved from https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[37]
Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. Retrieved from https://openreview.net/forum?id=rJXMpikCZ.
[38]
Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4–8, 2019, Ankur Teredesai, Vipin Kumar, Ying Li, Rómer Rosales, Evimaria Terzi, and George Karypis (Eds.). ACM, 950–958. DOI:DOI:
[39]
Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21–25, 2019, Benjamin Piwowarski, Max Chevalier, Éric Gaussier, Yoelle Maarek, Jian-Yun Nie, and Falk Scholer (Eds.). ACM, 165–174. DOI:DOI:
[40]
Xiang Wang, Tinglin Huang, Dingxian Wang, Yancheng Yuan, Zhenguang Liu, Xiangnan He, and Tat-Seng Chua. 2021. Learning intents behind interactions with knowledge graph for recommendation. In Proceedings of the WWW’21: The Web Conference 2021, Virtual Event/Ljubljana, Slovenia, April 19–23, 2021, Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, and Leila Zia (Eds.). ACM/IW3C2, 878–887. DOI:DOI:
[41]
Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. 2020. Disentangled graph collaborative filtering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25–30, 2020, Jimmy X. Huang, Yi Chang, Xueqi Cheng, Jaap Kamps, Vanessa Murdock, Ji-Rong Wen, and Yiqun Liu (Eds.). ACM, 1001–1010. DOI:DOI:
[42]
Wei Wei, Chao Huang, Lianghao Xia, Yong Xu, Jiashu Zhao, and Dawei Yin. 2022. Contrastive meta learning with behavior multiplicity for recommendation. In Proceedings of the WSDM’22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event/Tempe, AZ, USA, February 21–25, 2022, K. Selcuk Candan, Huan Liu, Leman Akoglu, Xin Luna Dong, and Jiliang Tang (Eds.). ACM, 1120–1128. DOI:DOI:
[43]
Lirong Wu, Haitao Lin, Cheng Tan, Zhangyang Gao, and Stan Z. Li. 2023. Self-supervised learning on graphs: Contrastive, generative, or predictive. IEEE Trans. Knowl. Data Eng. 35, 4 (2023), 4216–4235.
[44]
Lianghao Xia, Chao Huang, Yong Xu, Peng Dai, Bo Zhang, and Liefeng Bo. 2020. Multiplex behavioral relation learning for recommendation via memory augmented transformer network. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25–30, 2020, Jimmy X. Huang, Yi Chang, Xueqi Cheng, Jaap Kamps, Vanessa Murdock, Ji-Rong Wen, and Yiqun Liu (Eds.). ACM, 2397–2406. DOI:DOI:
[45]
Lianghao Xia, Chao Huang, Yong Xu, Peng Dai, Xiyue Zhang, Hongsheng Yang, Jian Pei, and Liefeng Bo. 2021. Knowledge-enhanced hierarchical graph transformer network for multi-behavior recommendation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, AAAI 2021, 33rd Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The 11th Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2–9, 2021. AAAI Press, 4486–4493. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/16576.
[46]
Lianghao Xia, Yong Xu, Chao Huang, Peng Dai, and Liefeng Bo. 2021. Graph meta network for multi-behavior recommendation. In Proceedings of the SIGIR’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11–15, 2021, Fernando Diaz, Chirag Shah, Torsten Suel, Pablo Castells, Rosie Jones, and Tetsuya Sakai (Eds.). ACM, 757–766. DOI:DOI:
[47]
Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19–25, 2017, Carles Sierra (Ed.). ijcai.org, 3203–3209. DOI:DOI:
[48]
Mingshi Yan, Zhiyong Cheng, Chen Gao, Jing Sun, Fan Liu, Fuming Sun, and Haojie Li. 2023. Cascading residual graph convolutional network for multi-behavior recommendation. ACM Trans. Inf. Syst. (Mar 2023).
[49]
Weifeng Zhang, Jingwen Mao, Yi Cao, and Congfu Xu. 2020. Multiplex graph neural networks for multi-behavior recommendation. In Proceedings of the CIKM’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19–23, 2020. Mathieu d’Aquin, Stefan Dietze, Claudia Hauff, Edward Curry, and Philippe Cudré-Mauroux (Eds.). ACM, 2313–2316. DOI:DOI:
