Abstract
Many applications use data that are best represented in binary matrix form, such as click-stream data, market basket data, document-term data, user-permission data in access control, and others. Matrix factorization methods are widely used tools for the analysis of high-dimensional data, as they automatically extract sparse and meaningful features from data vectors. However, existing matrix factorization methods do not work well for binary data. One crucial limitation is interpretability, as many matrix factorization methods decompose an input matrix into matrices with fractional or even negative components, which are hard to interpret in many real settings. Some matrix factorization methods, like binary matrix factorization, do limit the decomposed matrices to binary values. However, these models are not flexible enough to accommodate some data analysis tasks, like trading off summary size against quality and discriminating between different types of approximation errors. To address these issues, this article presents weighted rank-one binary matrix factorization, which approximates a binary matrix by the product of two binary vectors, with parameters controlling the different types of approximation errors. By systematically running weighted rank-one binary matrix factorization, one can effectively perform various binary data analysis tasks, like compression, clustering, and pattern discovery. Theoretical properties of weighted rank-one binary matrix factorization are investigated and its connections to problems in other research domains are examined. As weighted rank-one binary matrix factorization is in general NP-hard, efficient and effective algorithms are presented. Extensive studies on applications of weighted rank-one binary matrix factorization are also conducted.
Keywords: Discrete data, clustering, compression, pattern discovery
1. INTRODUCTION
Significant advances in database and internet technologies have led to the drastic rise of data collection, storage, and analysis. This is the era of big data, where analyzing large datasets has become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus [Manyika et al. 2011]. One of the primary concerns regarding the digital economy is the necessity of appropriately exploiting data sources—so as to move along the data → information → decision chain. Data mining, the computational process of discovering patterns from large datasets, has attracted much attention from both academia and industry. Data mining involves methods at the intersection of artificial intelligence, machine learning, statistics, and database systems, among others. However, the advancement of data mining technologies still lags behind the exponential rate at which data are growing. The increasing business demand for big data analysis has resulted in a research focus on how to effectively reduce datasets while preserving important underlying patterns. Through such reduction, complicated data analysis tasks can be efficiently performed on reduced datasets that still approximate the underlying data and provide meaningful results.
Many data reduction and pattern extraction methods have been proposed in the literature. Among these, the matrix decomposition approach plays a very important role, due to its mathematical rigor, its effectiveness, and the fact that many datasets can be represented in matrix form. Important matrix decomposition methods include singular value decomposition (SVD) [Golub and Loan 1996], which is the foundation of many low-rank matrix approximation and statistical analysis methods; non-negative matrix factorization (NMF) [Lee and Seung 1999], a variant of SVD for non-negative datasets; Boolean matrix factorization (BMF) [Lu et al. 2014], which decomposes a binary matrix into the Boolean product of two other binary matrices, with applications in role mining [Kuhlmann et al. 2003] and discrete pattern discovery [Pauli et al. 2008]; and rank-one BMF [Koyuturk et al. 2005], which can be viewed as a special case of BMF.
However, existing matrix decomposition methods for data reduction and pattern extraction have limitations in the binary setting. The reason that we pay special attention to such datasets is that binary data are found in numerous application settings. For example, a market basket dataset, which is important for extracting user shopping patterns [Strehl and Ghosh 2003], can be represented as a user-item matrix, with each binary cell denoting whether the user has purchased the item. An access control list from an information system, used to prevent fraud and privacy exposure in process information flow [Bai et al. 2012], can be described by a user-permission binary matrix, where each cell shows whether the user is assigned the permission. An online movie rating dataset, important for personal movie recommendation [Dellarocas 2003], can be represented by a user-movie binary matrix, with 1 meaning “like” and 0 “dislike.” One crucial limitation of many existing matrix decomposition methods for binary datasets is interpretability (or usability). For instance, SVD, like many other matrix decomposition methods, decomposes an input matrix into the product of several matrices with negative and fractional element values. The decomposed matrices can be viewed as patterns or a summary of the input matrix. However, it is difficult to interpret a data summary or pattern with negative and fractional values in many domains. Consider Example 1.1. The matrix A on the right side represents market basket data, where the element at the ith row and jth column indicates whether transaction i includes item j. The decomposed matrices U and V on the left side are low-rank and minimize ‖A − UV‖2. Although the product of the decomposed matrices approximates the input data well, it is hard to interpret features like (−1.38, −2.23, −2.23, −2.23, −0.86) with negative and fractional values. In fact, making machine learning models/results interpretable has been a recent research focus [Ishibuchi and Nojima 2007; Vellido et al. 2012].
Example 1.1.
(1)
To tackle the interpretability issue, non-negative matrix factorization (NMF) [Lee and Seung 1999] was proposed. It decomposes a real-valued matrix into the product of two matrices with the constraint that all entries must be non-negative. NMF has found a wide variety of applications, including document clustering [Shahnaz et al. 2006] and bioinformatics [Gao and Church 2005]. However, NMF still does not fit binary data well, because the decomposed matrices contain fractional values. Binary matrix factorization (BMF) was therefore introduced; it decomposes a binary matrix into the product (or Boolean matrix product) of two binary matrices. BMF has found applications in access control, pattern recognition, and gene expression analysis, among others [Lu et al. n.d.; Pauli et al. 2008; Zhang et al. 2010].
For instance, how can we interpret a discovered pattern stating that Mike prefers to buy “−3.4” apples? In fact, a trend toward interpretability in computational intelligence systems has been observed recently. A benchmark method is NMF, which decomposes a matrix into the product of two non-negative matrices. However, it is still difficult to interpret the decomposition solutions in certain settings. For instance, if NMF suggests that a user playing a pharmacist role has “0.4” permissions to access customer medical records, then should we associate the pharmacist role with the right to access customer medical records? In an access control system, the permission assignment should be definitive, either “yes” or “no.”
BMF, which aims to decompose a binary matrix into the Boolean matrix product of two binary matrices, is an intuitive way to address the binary interpretability issue and has been actively studied lately. One of the decomposed binary matrices can be viewed as patterns and the other describes how observations are represented by those patterns. BMF solutions can be directly used in pattern discovery, overlapping clustering, and even access control in security. Although BMF has many modeling advantages, it is extremely difficult to find a good BMF solution, which greatly limits its practical use [Lu et al. n.d.; Pauli et al. 2008; Zhang et al. 2010]. Due to this computational challenge, rank-one binary matrix factorization [Koyutürk and Grama 2003] was introduced as an alternative. It optimizes minX,Y ‖A − XYT‖1, where X and Y are binary vectors, and is relatively easy to solve. By iteratively performing rank-one BMF, one can conduct data analysis tasks, like clustering, pattern recognition, and data summarization, on large-scale, discrete-attributed datasets.
Example 1.2.
(2)
However, BMF and rank-one BMF both suffer from several issues. The first issue is that the extracted pattern set may not be able to represent the original dataset. According to the objective function minX,Y ‖A − XYT‖1, a row vector Ai. is considered to belong to pattern Y when ‖Ai. − Y‖1 < ‖Ai.‖1. In other words, the number of approximation errors could be as large as ‖Ai.‖1. Example 1.2 illustrates this issue. The right side of the equation is the rank-one BMF solution, which results in four approximation errors, marked in bold. Since all entries in the presence vector are 1, the row vectors in the binary matrix cannot be divided any further, while there are two obvious patterns in the original data. If one only uses the pattern (1 1 1 1 0) to represent the entire binary matrix, then much information will be lost. In particular, the pattern (1 1 1 1 0) significantly differs from the observation (0 1 1 1 1), as two of five components do not match.
The second issue with rank-one BMF is that it does not support discrimination between different error types. The two error types occurring in binary data approximation are 1-becoming-0 and 0-becoming-1. These two error types are treated equally by rank-one BMF. However, they need to be discriminated in some application domains. For instance, in market basket analysis, the binary matrix representation of market basket transactions is often very sparse. In such a scenario, 1-entries carry more information than 0-entries, and so 1-becoming-0 errors should be avoided. Interestingly, in some other applications, 0-becoming-1 errors need to be discriminated against. For example, binary matrix factorization has been used to discover roles for implementing role-based access control, a de facto access control mechanism widely used in organizations and institutions [Lu et al. 2014]. In that scenario, 0-becoming-1 errors are problematic, as they may lead to roles with over-assignments and cause severe security problems, as will be illustrated later in the experimental evaluation.
This research is motivated by the issues identified above. Specifically, our contributions can be summarized as follows. (1) We propose a novel notion, weighted rank-one binary matrix factorization, which introduces parameters to control the two different types of approximation errors and to trade off approximation accuracy against extracted pattern quality. By selecting appropriate parameters and systematically employing weighted rank-one BMF, one can effectively perform data analysis tasks, including clustering, role mining, and information retrieval, over high-dimensional datasets. (2) To solve the optimal weighted rank-one BMF problem, we examine its theoretical properties, including NP-hardness and approximation algorithms. We also relate weighted rank-one BMF to research problems in other domains, like maximum edge biclique and maximum tiling, and provide various optimization formulations for weighted rank-one BMF. (3) We develop fast and effective heuristic algorithms, including local search and random algorithms, to make the proposed method more practical. (4) We conduct extensive experiments across various tasks, including clustering, pattern discovery, and data compression, to illustrate the use and value of our model.
2. RELATED WORK
Pattern discovery, sampling, clustering, and data compression are the focus of conventional approaches used by management scientists to analyze datasets. Among numerous data analysis tools, matrix decomposition has always been favored by researchers and practitioners due to its effectiveness and mathematical rigor. The extensive literature on matrix decomposition can be categorized by the value domain of the decomposed matrices. Conventional matrix decomposition methods focus on the continuous domain. SVD is one of the most popular matrix decomposition methods and is used extensively in applications ranging from natural language text processing to image compression. Latent Semantic Indexing (LSI) [Berry et al. 1995], a common algorithm in information retrieval, exploits the low-rank approximation property of SVD to summarize large datasets. The limitation of such matrix decomposition methods is the difficulty of interpreting patterns with negative and fractional values. Many research efforts have therefore focused on interpretable matrix decomposition. A benchmark algorithm is NMF [Lee and Seung 1999], which decomposes a real-valued matrix into the product of two matrices with the constraint that all entries must be non-negative. NMF has found a wide variety of applications, including document clustering [Shahnaz et al. 2006] and bioinformatics [Gao and Church 2005].
Due to the prevalence of binary data, much research has been devoted to the factorization of binary matrices. Schein et al. [2003] introduce a generalized linear model for principal component analysis of binary data, where each observation is assumed to arise from a single latent factor among multiple latent factors, which are inferred by alternating least squares. Bingham et al. [2009] also assume that each observation is sampled from a single factor and infer the factors using the expectation-maximization algorithm. Li [2005] proposes a method to factorize a binary matrix into two binary matrices, where the binary indicators suggest membership. Singliar and Hauskrecht [2006] introduce a probabilistic view of binary data, where each observation is viewed as a sample from multiple binary Bernoulli factors. Frank et al. [2012] introduce another probabilistic view of binary data, where each observation is assumed to be generated either by “signal” or by “noise,” both Bernoulli distributions. Kabán and Bingham [2008] introduce a mixture model, in which a single observation is generated by multiple latent factors with a Beta prior. Zhang et al. [2007] introduce a variant of NMF, where a binary matrix is decomposed into two matrices with entries bounded between 0 and 1. Pauli et al. [2008] and Geerts et al. [2004] view each observation as the union of some binary patterns and factorize the binary data using “cover”-based discrete optimization methods, which are referred to as Boolean matrix factorization (BMF). SDD [Kolda and O’Leary 2000] decomposes a regular matrix into the standard matrix product of a binary matrix and a matrix in {−1, 0, 1}. EBMD [Lu et al. 2012] decomposes a binary matrix into the Boolean matrix product of a binary matrix and a matrix in {−1, 0, 1}. Shen et al. [2009] and Koyuturk et al. [2005] study rank-one BMF, which aims to discover the dominant discrete binary pattern by factorizing a binary matrix into the product of two binary vectors.
The weighted rank-one BMF proposed in this article extends rank-one BMF by adding weights to the different error types. Our approach to the problem is influenced by research on the set cover problem, a fundamental problem in management science. Optimization techniques are commonly used to find exact solutions for the set cover problem, e.g., an integer programming formulation [Bellmore and Ratliff 1971]. We therefore build several optimization formulations for the weighted rank-one BMF problem, including a binary integer program, a linear relaxation program, and a semidefinite program. Approximation algorithms are another important approach to the set cover problem, so we also introduce approximation algorithms for several special cases of weighted rank-one BMF. Heuristic solutions, e.g., the greedy heuristic [Chvatal 1979] and the Lagrangian heuristic [Caprara et al. 1999], are a common approach to large set cover problems, so we propose several fast heuristics for our problem as well. Our research is also influenced by the recent trend of using optimization and heuristics to solve data mining problems. For instance, mathematical programming has been used to formulate many data mining problems, like clustering and classification [Bradley et al. 1999]. The preliminary study of this work was presented at the 2011 SIAM Conference on Data Mining [Lu et al. 2011]. While the problem studied in this article is the same, a significant part of the theoretical results, all of the presented algorithms, and the entire experimental study are new.
3. WEIGHTED RANK-ONE BMF
In this section, we introduce the weighted rank-one BMF model. Our model is derived from rank-one BMF, which optimizes minX,Y ‖A − XYT‖1, where Y can be interpreted as a hidden pattern and the binary values in X indicate whether an observation (row vector in A) belongs to the pattern. As shown in Example 1.2, by minimizing ‖A − XYT‖1, an observation is considered to belong to a pattern when the number of matching 1-components is greater than the number of mismatching components. As a result, observations may be mistakenly clustered: in Example 1.2, all row vectors are considered to belong to the same pattern and hence are clustered together. The other issue with minX,Y ‖A − XYT‖1 is that it treats the two different types of errors, namely 0-becoming-1 and 1-becoming-0, identically, although they should be discriminated in many applications.
So we propose the weighted rank-one BMF model by introducing weight parameters for the different types of errors. Let N denote the set {1, 2, … , n}. An n-dimensional binary vector X can be represented by a set X ⊆ N, i.e., i ∈ X if X(i) = 1. From the set perspective, the comparison of two binary vectors X and Y of the same dimension leads to four quantities: f11(X,Y) = |X ∩ Y|, f10(X,Y) = |X\Y|, f01(X,Y) = |Y\X|, and f00(X,Y) = n − |X ∪ Y|, where X and Y here denote the set representations of X and Y.
To determine the presence of a pattern in an observation, we propose a new notion, pattern influence. The influence of a pattern Y ∈ {0, 1}n×1 in an observation X ∈ {0, 1}n×1, denoted by I(X,Y), is computed as:
I(X,Y) = max{0, f11(X,Y) − w10 f10(X,Y) − w01 f01(X,Y)},    (3)
where w10 and w01 are weight parameters. A pattern Y is considered present in an observation X if I(X,Y) > 0. Note that the formula does not include f00(X,Y), because we care about what X and Y both have, not what they both lack. Another reason is that binary datasets are typically sparse; if we considered f00(X,Y), then all records would look similar.
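To make the computation concrete, the following C sketch (ours, for illustration; the function name and the driver are not from the article) evaluates the pattern influence of Equation (3) for two binary vectors stored as 0/1 arrays.

#include <stdio.h>

/* Pattern influence of pattern y in observation x (Equation (3)).
   x, y: binary vectors of length n; w10, w01: error weights.
   Returns max{0, f11 - w10*f10 - w01*f01}. */
double pattern_influence(const int *x, const int *y, int n,
                         double w10, double w01)
{
    int f11 = 0, f10 = 0, f01 = 0;
    for (int i = 0; i < n; i++) {
        if (x[i] == 1 && y[i] == 1) f11++;      /* matching 1s */
        else if (x[i] == 1)         f10++;      /* 1-becoming-0 error */
        else if (y[i] == 1)         f01++;      /* 0-becoming-1 error */
        /* f00 is deliberately ignored, as discussed above */
    }
    double v = f11 - w10 * f10 - w01 * f01;
    return v > 0 ? v : 0;
}

int main(void)
{
    int x[] = {0, 1, 1, 1, 1};  /* observation from Example 1.2 */
    int y[] = {1, 1, 1, 1, 0};  /* candidate pattern */
    /* Here f11 = 3, f10 = 1, f01 = 1: with w10 = w01 = 1 the influence
       is 1 (pattern present); with w10 = w01 = 2 it is 0 (not present). */
    printf("%.1f %.1f\n", pattern_influence(x, y, 5, 1.0, 1.0),
                          pattern_influence(x, y, 5, 2.0, 2.0));
    return 0;
}

The worked numbers in the driver illustrate why larger weights separate the two overlapping patterns of Example 1.2, as discussed below.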
The weighted rank-one BMF problem aims to find the dominant pattern in a set of observations.
Definition 3.1 (Weighted Rank-One BMF Problem). Given a matrix A ∈ {0, 1}m×n and weight parameters w01 and w10, find the pattern Y ∈ {0, 1}n×1 that maximizes Σi I(Ai., Y).
Given the dominant pattern, it is easy to determine the presence vector. Weighted rank-one BMF can effectively address the limitations of conventional rank-one BMF. Reconsider the binary matrix employed in Example 1.2. To decrease approximation errors, we penalize them more heavily by setting w10 = w01 = 2. By performing weighted rank-one BMF, the dominant pattern (1 1 1 1 0) is discovered and the observations are successfully separated, as indicated by the line in the matrix. If we perform weighted rank-one BMF on the part not belonging to the pattern (1 1 1 1 0), then another pattern (0 1 1 1 1) is revealed. These two extracted patterns can represent all observations without introducing any approximation error.
Example 3.2.
(4)
Weighted rank-one BMF also supports discriminating between different error types. Suppose one wants to eliminate 0-becoming-1 errors. To achieve that, the user can simply penalize 0-becoming-1 errors more severely by setting w10 = 1 and w01 = 3. After performing weighted rank-one BMF, the dominant pattern (1 1 1 0 0) is revealed. The resultant approximation has three 1-becoming-0 errors, as marked in bold, and no 0-becoming-1 errors.
Example 3.3.
(5)
Weight parameters w10 and w01 play an important role in weighted rank-one BMF. In summary, they have two main functions. First, the weight parameters control an upper bound on the total approximation error: when w10 and w01 are large, qualifying an observation as a member of a pattern requires the number of matching 1-components to be much larger than the number of mismatching components, so the maximum number of errors is bounded. Second, the weight parameters enable one to control the distribution of the two error types: when w10 is larger than w01, the 1-becoming-0 error type is penalized more severely, and vice versa.
By selecting appropriate weight parameters and systematically performing iterative weighted rank-one BMF, one can carry out many data analysis tasks, like clustering, data summarization, and role mining, among others. To realize the practical value of the approach, a fast and effective algorithm for weighted rank-one BMF is needed. To this end, we first examine its theoretical properties and then investigate algorithm design.
4. THEORETICAL STUDY
In this section, we study the theoretical properties of weighted rank-one BMF. Since most of these properties are derived by relating weighted rank-one BMF to problems in computer science and operations research, we first examine those connections.
4.1. Relation with Other Problems
The weighted rank-one BMF problem can be related to the maximum edge biclique problem [Peeters 2003], the maximum tile problem [Geerts et al. 2004], and the maximum edge weight biclique problem [Dawande et al. 1997].
A weighted rank-one BMF instance is defined as follows: Given a binary matrix A, weight parameters w01 and w10, and an integer t, does there exist a pattern Y such that Σi I(Ai., Y) ≥ t?
Definition 4.1 (Maximum Edge Biclique). Given a bipartite graph G = (V1 ∪ V2, E) and a positive integer k, does G contain a biclique (a kind of bipartite graph where every vertex of the first set is connected to every vertex of the second set) with at least k edges [Peeters 2003]?
Statement 1. For any maximum edge biclique instance {(V1 ∪ V2, E), k}, we can construct an equivalent weighted rank-one BMF instance {A, w01, w10, t} such that the answer to the maximum edge biclique instance is true if and only if the answer to the weighted rank-one BMF instance is true.
The construction is as follows: (i) Let A be an m × n binary matrix, such that m = |V1|, n = |V2|, and Aij = 1 if there is an edge in E connecting vertices V1(i) and V2(j); (ii) let w10 = 0 and w01 ≥ maxi ‖Ai.‖1; (iii) let t be k. The mapping makes A a binary matrix representation of the bipartite graph (V1 ∪ V2, E): the vertices in V1 correspond to rows of A, the vertices in V2 correspond to columns of A, and each edge in E corresponds to a 1-entry in A. As w01 ≥ maxi ‖Ai.‖1, 0-becoming-1 errors are penalized so severely that a pattern cannot cover 0-entries. So a pattern present in A naturally induces a biclique in (V1 ∪ V2, E). As w10 = 0, 1-becoming-0 errors are not counted toward the pattern influence measure; in other words, the pattern influence measure equals the number of edges of the induced biclique. Therefore, the statement is true.
Figure 1 illustrates the relation. The bipartite graph is the graph representation of the binary matrix on the left side of Example 1.1. The nodes on the left correspond to matrix rows, the nodes on the right correspond to matrix columns, and the edges (both solid and dashed) correspond to 1-entries in the matrix. The solid edges form the maximum edge biclique, which also corresponds to the optimal solution of weighted rank-one BMF with w10 being 0 and w01 being a sufficiently large number.
Definition 4.2 (Maximum Tile). Given a database D as a binary matrix Am×n, find the tile with the largest area in D, where a tile corresponds to a row index subset R and a column index set C, such that A(i, j) = 1, ∀i ∈ R, j ∈ C [Geerts et al. 2004].
The maximum tile problem is identical to the maximum edge biclique problem, except that one is defined in terms of a matrix, whereas the other one is defined on an equivalent graph representation. Therefore the maximum tile problem can be reduced to the weighted rank-one BMF problem as well.
Definition 4.3 (Maximum Edge Weight Biclique Problem). Given a bipartite graph {V1 ∪ V2, E} and weights {wij} associated with edges {(V1(i),V2(j))}, respectively, find a biclique C, where the sum of the edge weights is maximum [Dawande et al. 1997].
Statement 2. The weighted rank-one BMF problem with w10 of 0 is a special case of the maximum edge weight biclique problem.
For each instance of the weighted rank-one BMF problem with w10 of 0, one can construct an equivalent instance of the maximum edge weight biclique problem. An instance of the weighted rank-one BMF problem can be denoted by {Am×n, w01}. With w10 being 0, the pattern influence measure becomes I(X,Y) = max{0, f11(X,Y) − w01 f01(X,Y)}. So the weighted rank-one BMF problem becomes finding a pattern that covers the most 1-entries, with a penalty of w01 for covering each 0-entry. An instance of the maximum edge weight biclique problem can be denoted by {V1 ∪ V2, E, W}, where W contains the weights of the edges. An equivalent instance of the maximum edge weight biclique problem can be constructed as follows: (i) Let V1 have m vertices and V2 have n vertices; (ii) let {V1 ∪ V2, E} be a complete bipartite graph, i.e., ∀V1(i) ∈ V1, V2(j) ∈ V2, there is an edge connecting them; (iii) let W(i, j), the weight of the edge connecting nodes V1(i) and V2(j), be 1 if A(i, j) is 1 and −w01 otherwise. The construction maps each entry of Am×n to an edge of the complete bipartite graph. As W(i, j) is 1 for A(i, j) of 1 and −w01 for the other edges, the maximum total edge weight of a biclique in the constructed instance equals the maximum pattern influence value of the BMF instance. Therefore, the two instances are equivalent.
4.2. Optimization Models
In this section, we build optimization models, which allow us to take advantage of the latest advancement of optimization techniques.
Unconstrained Quadratic Binary Programming.
The weighted rank-one BMF problem is to decompose a binary matrix A into the product of two binary vectors XYT, such that Y is the dominant pattern with respect to the overall pattern influence Σi I(Ai., Y). If a dominant pattern Y is given, then it is easy to determine the presence vector X. But in the quadratic binary programming formulation, both the presence vector X and the pattern vector Y are treated as variables. Since I(Ai., Y) = max{0, f11(Ai., Y) − w10 f10(Ai., Y) − w01 f01(Ai., Y)} and each Xi is a free binary variable, maximizing over Xi realizes the max{0, ·} term row by row. To obtain the quadratic programming formulation of maxY Σi I(Ai., Y), we reformulate it as follows:
maxX,Y Σi Xi [f11(Ai., Y) − w10 f10(Ai., Y) − w01 f01(Ai., Y)]    (6)
= maxX,Y Σi Σj Xi [Aij Yj − w10 Aij (1 − Yj) − w01 (1 − Aij) Yj]    (7)
= maxX,Y Σi Σj Xi [(Aij + w10 Aij + w01 Aij − w01) Yj − w10 Aij]    (8)
= maxX,Y Σi Σj (Uij Xi Yj − Vij Xi)    (9)
s.t. Xi ∈ {0, 1}, i = 1, … , m    (10)
Yj ∈ {0, 1}, j = 1, … , n    (11)
where Uij is Aij + w10Aij + w01Aij − w01 and Vij is w10Aij.
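As a small illustration (our own sketch; the function name is hypothetical), these coefficients can be tabulated directly from A and the weights before the objective maxX,Y Σi Σj (Uij Xi Yj − Vij Xi) is handed to an unconstrained binary quadratic programming solver:

/* Build the quadratic programming coefficients from an m x n 0/1 matrix A
   (row-major). U and V are caller-allocated m x n arrays:
   Uij = (1 + w10 + w01)*Aij - w01 and Vij = w10*Aij, as above. */
void build_qp_coefficients(int m, int n, const int *A,
                           double w10, double w01,
                           double *U, double *V)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            int a = A[i * n + j];
            U[i * n + j] = (1.0 + w10 + w01) * a - w01;
            V[i * n + j] = w10 * a;
        }
}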
Integer Linear Programming.
The unconstrained quadratic binary programming formulation maxX,Y Σi Σj (Uij Xi Yj − Vij Xi) can be linearized as an integer linear programming problem.
maxX,Y,Z Σi Σj (Uij Zij − Vij Xi)    (12)
s.t. −Xi − Yj + 2Zij ≤ 0, ∀ i, j    (13)
Xi + Yj − Zij ≤ 1, ∀ i, j    (14)
Xi, Yj, Zij ∈ {0, 1}, ∀ i, j    (15)
The linearization is done by introducing auxiliary binary variable Zij to replace XiYj. Constraints (13) and (14) enforce Zij to be equal to XiYj.
Linear Programming Relaxation.
The linear programming relaxation for the weighted rank-one BMF problem can be obtained by relaxing constraints (15) to {Xi, Yj, Zij ∈ [0, 1]}. The relaxation expands the feasible region to a polytope and makes the problem easier to solve. But the optimal solution of the relaxed problem is not necessarily a feasible solution of the original problem, because its components could be fractional. So this linear programming relaxation is not very useful.
If we relax constraints (13) and (14) further, then we obtain a linear programming relaxation that guarantees an integral optimal solution through simplex methods, which is then also a feasible solution of the original weighted rank-one BMF problem. The modification enforces the constraint −Xi − Yj + 2Zij ≤ 0 only for entries (i, j) with Aij = 1 and the constraint Xi + Yj − Zij ≤ 1 only for entries (i, j) with Aij = 0. The complete formulation is as follows:
maxX,Y,Z Σi Σj (Uij Zij − Vij Xi)    (16)
s.t. −Xi − Yj + 2Zij ≤ 0, ∀ (i, j) with Aij = 1    (17)
Xi + Yj − Zij ≤ 1, ∀ (i, j) with Aij = 0    (18)
Xi, Yj, Zij ∈ [0, 1], ∀ i, j    (19)
Theorem 4.4. The optimal solution of the linear programming problem (16)–(19) via simplex algorithm is integral.
Proof. We prove that any basic feasible solution of the linear programming problem (16)–(19) obtained via the simplex algorithm is integral, and hence so is the optimal solution. Any linear constraint set can be transformed into the standard form {AX = b, X ≥ 0}. A basic feasible solution of the simplex algorithm corresponds to a partition {XB, XN} of the variables X. Accordingly, the constraint set can be rewritten as BXB + NXN = b. A basic feasible solution is {XN = 0, XB = B−1b}, when B−1b ≥ 0. According to Cramer’s rule [Golub and Loan 1996], the ith component of XB is det(B′)/det(B), where det(·) denotes the determinant of a matrix and B′ is B with its ith column replaced by b. Constraints (17)–(19) are the same as the constraints of the linear programming problem studied in Shen et al. [2009], which proves that the coefficient matrix of constraints (17)–(19) is totally unimodular (a totally unimodular matrix is one for which the determinant of every square non-singular submatrix is 1 or −1). So det(B) in our case is 1 or −1. Because all parameters in constraints (17)–(19) are integral, det(B′) is integral. Therefore, det(B′)/det(B) is integral, as are all basic feasible solutions, including the optimal solution.
4.3. Computational Hardness
The weighted rank-one BMF problem is NP-hard, because the NP-hard maximum edge biclique problem [Peeters 2003] can be reduced to it, as shown in Statement 1.
4.4. Approximability
There exist 2-approximation algorithms for special cases of the weighted rank-one BMF problem.
Case 1: w10 = 1 and w01 ≥ maxi ‖Ai.‖1
The 2-approximation algorithm for such a case is based on Lemma 4.5.
Lemma 4.5. Any integer linear programming problem with n variables subject to m linear constraints with at most two variables per inequality, and with all variables bounded between 0 and U, has a 2-approximation algorithm that runs in polynomial time O(mnU2 log(Un2m)) [Hochbaum et al. 1992].
The weighted rank-one BMF problem with w10 = 1 and w01 ≥ maxi ‖Ai.‖1 can be formulated as the integer linear programming problem (20)–(24), with two variables per inequality, as follows:
maxX,Y,Z Σ(i,j): Aij=1 Zij    (20)
s.t. Zij − Xi ≤ 0, ∀ (i, j) with Aij = 1    (21)
Zij − Yj ≤ 0, ∀ (i, j) with Aij = 1    (22)
Xi + Yj ≤ 1, ∀ (i, j) with Aij = 0    (23)
Xi, Yj, Zij ∈ {0, 1}, ∀ i, j    (24)
When w10 = 1 and w01 ≥ maxi ‖Ai.‖1, the weighted rank-one BMF problem is to find a pattern that covers the most 1-entries without covering any 0-entry. So the objective is to maximize Σ(i,j): Aij=1 Zij, where Zij = XiYj. Constraints (21) and (22) are used to enforce Zij = XiYj. As explained earlier, the constraints −Xi − Yj + 2Zij ≤ 0 and Xi + Yj − Zij ≤ 1 enforce Zij = XiYj. But when the objective is to maximize Σ(i,j): Aij=1 Zij, the constraint Xi + Yj − Zij ≤ 1 becomes redundant and can be dropped. Constraints Zij − Xi ≤ 0 and Zij − Yj ≤ 0 are together equivalent to the constraint −Xi − Yj + 2Zij ≤ 0. Constraints (23) enforce that no 0-entry is covered by the approximation.
Case 2: w10 = 0 and w01 = 1
The weighted rank-one BMF problem with w10 = 0 and w01 = 1 is equivalent to the conventional rank-one BMF problem. Conventional rank-one BMF considers observation X to belong to pattern Y if ‖X − Y‖1 < ‖X‖1. As ‖X‖1 = f10(X,Y) + f11(X,Y) and ‖X − Y‖1 = f10(X,Y) + f01(X,Y), the condition ‖X − Y‖1 < ‖X‖1 is equivalent to f11(X,Y) − f01(X,Y) > 0, which is exactly the presence condition of the weighted rank-one BMF problem with w10 = 0 and w01 = 1. A 2-approximation algorithm via linear programming relaxation is given in Shen et al. [2009].
4.5. Necessary Conditions for Observations of the Same Pattern
Weighted rank-one BMF essentially identifies the dominant pattern with respect to the given weights. Because of its NP-hardness, it is difficult to find an efficient algorithm for weighted rank-one BMF on large datasets. However, if we can separate observations (records) that are determined not to belong to the same pattern into disjoint groups, then we can perform weighted rank-one BMF on smaller datasets. To apply such a divide-and-conquer technique, we need an effective and efficient way to separate observations that cannot belong to the same pattern. The theorem below gives a necessary condition for two records to belong to the same pattern; in other words, if the condition is not satisfied, then the two records can never belong to the same pattern. The theorem also provides the foundation for the practical iterative framework described later in Algorithm 1 in Section 5.
Theorem 4.6. If A and B, the set representations of two binary vectors, belong to the same pattern, then the following are true:
(25)
(26)
(27)
(28)
where {z1, z2, z3} denote {|A| − |A ∩ B|, |A ∩ B|, |B| − |A ∩ B|}, respectively.
Note that the conditions above are inexpensive to check, as they involve only the sizes |A|, |B|, and |A ∩ B|, which makes them practical for the divide-and-conquer strategy outlined above.
To prove the theorem, assume observations A and B belong to pattern C. Figure 2 illustrates the set intersections of the three sets A, B, and C, with the disjoint regions denoted d1 through d7.
According to the pattern presence definition, the following are true:
(29)
(30)
The above inequalities can be simplified as below, by letting d2 and d7 be 0:
(31)
(32)
As such, the set intersection diagram of A, B, and C is updated as shown in Figure 3.
Since {z1, z2, z3} denote {|A| − |A ∩ B|, |A ∩ B|, |B| − |A ∩ B|}, {z1, z2, z3} equal {d1 + d4, d5, d3 + d6}. So {d4, d5, d6} equal {z1 − d1, z2, z3 − d3}. Plugging them into inequalities (31) and (32), we obtain
(33)
(34)
Inequality (33) leads to
(35)
Plug it into inequality (34) and obtain
(36)
As d1 is not less than 0, the above inequality can be satisfied only if the corresponding expression in z1, z2, z3, and the weights is less than 0. Thus, the necessary condition (25) is proven.
Inequality (34) leads to
(37)
Plug it into inequality (33) and obtain
(38)
As d3 is not less than 0, the above inequality can be satisfied only if the corresponding expression is less than 0. The necessary condition (26) is proven.
Reconsider inequalities (31) and (32). {d1,d3,d5} equal {z1 − d4,z3 − d6,z2}. Plug them into inequalities (31) and (32) and obtain
(39)
(40)
Inequality (39) leads to
(41)
Plug it into inequality (40) and obtain
(42)
As d6 is not less than 0, the above inequality can be satisfied only if the corresponding expression is less than 0. The necessary condition (27) is proven.
Inequality (39) leads to
(43)
Plug it into inequality (40) and obtain
(44)
As d4 is not less than 0, the above inequality can be satisfied only if the corresponding expression is less than 0. The necessary condition (28) is proven.
5. FAST SOLUTIONS
Due to the NP-hard nature of the problem, fast solutions are necessary for practical applications of the weighted rank-one BMF framework. This section introduces a practical iterative framework and a quick-and-dirty solution.
5.1. Practical Iterative Framework
The practical iterative framework, described in Algorithm 1, reduces a large weighted rank-one BMF problem to a number of small problems by taking advantage of the fact that some observations can never belong to the same pattern. At each iteration, the algorithm selects one observation from the remaining dataset and then applies the necessary conditions (25)–(28) to identify observations that may belong to the same pattern as the selected observation. It then employs either a heuristic or an exact algorithm (if the identified group is very small) to execute weighted rank-one BMF on the identified group of observations. The observations associated with the obtained pattern are deleted from the remaining dataset. The procedure continues until no observation remains.
ALGORITHM 1: Practical Iterative Framework
while the remaining dataset is not empty do
    Select one random observation from the remaining dataset;
    Use necessary conditions (25)–(28) to identify the remaining observations that may belong to the same pattern as the selected observation;
    Apply an algorithm to discover a dominant pattern from the identified observations;
    Delete the observations corresponding to the discovered pattern from the remaining dataset;
end
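A C-style sketch of the framework follows; it is our own illustration under stated assumptions. The helpers may_share_pattern (the necessary-condition test (25)–(28)), find_dominant_pattern (any exact or heuristic weighted rank-one BMF routine, e.g., the local search of Section 5.1), and pattern_influence (Equation (3)) are hypothetical names, left as prototypes.

#include <stdlib.h>

/* Hypothetical helpers; see the lead-in above. */
int    may_share_pattern(const int *a, const int *b, int n,
                         double w10, double w01);
void   find_dominant_pattern(int **group, int k, int n,
                             double w10, double w01, int *pattern_out);
double pattern_influence(const int *x, const int *y, int n,
                         double w10, double w01);

/* Practical iterative framework (Algorithm 1): peel off one dominant
   pattern per iteration until no observation remains. */
void iterative_framework(int **rows, int m, int n,
                         double w10, double w01,
                         void (*emit)(const int *pattern, int n))
{
    char *done    = calloc(m, 1);
    int  *cand    = malloc(m * sizeof(int));
    int  *pattern = malloc(n * sizeof(int));
    int **group   = malloc(m * sizeof(int *));
    int left = m;

    while (left > 0) {
        int seed;                 /* pick a random remaining observation */
        do { seed = rand() % m; } while (done[seed]);

        int k = 0;                /* candidates that may share its pattern */
        for (int i = 0; i < m; i++)
            if (!done[i] && (i == seed ||
                may_share_pattern(rows[seed], rows[i], n, w10, w01)))
                cand[k++] = i;
        for (int i = 0; i < k; i++) group[i] = rows[cand[i]];

        /* solve the reduced weighted rank-one BMF problem */
        find_dominant_pattern(group, k, n, w10, w01, pattern);
        emit(pattern, n);

        /* delete observations in which the pattern is present; retire the
           seed unconditionally so the loop always makes progress */
        for (int i = 0; i < k; i++)
            if (!done[cand[i]] &&
                pattern_influence(rows[cand[i]], pattern, n, w10, w01) > 0) {
                done[cand[i]] = 1; left--;
            }
        if (!done[seed]) { done[seed] = 1; left--; }
    }
    free(done); free(cand); free(pattern); free(group);
}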
However, we still need to solve a weighted rank-one BMF problem of reduced size at each iteration. Our experimental study finds that, for a large dataset, the original weighted rank-one BMF problem is indeed reduced to a number of problems of much smaller size. But due to the notorious difficulty of binary optimization, it is still not practical to employ an exact algorithm to solve the reduced problems. Thus, efficient heuristics for weighted rank-one BMF are required.
Weighted rank-one BMF can be modeled as unconstrained binary quadratic programming. There have been a considerable number of studies on heuristics for the unconstrained binary quadratic programming model, including greedy and local search [Merz and Freisleben 2002], tabu search [Glover et al. 1998], simulated annealing [Beasley 1998], genetic algorithms [Merz and Freisleben 1999], and exact algorithms [Barahona et al. 1989]. As the purpose of weighted rank-one BMF is to facilitate large-scale data analysis, we implemented and tested most mainstream heuristics. Experimental results show that the lightweight local search algorithm in fact has the best overall performance in terms of running time and accuracy. Local search algorithms are heuristics that search the neighborhood of the current solution for a better one until no further improvement can be made. Algorithm 2 outlines a 1-opt local search algorithm, in which the neighborhood of a solution is defined by the solutions that can be obtained by flipping a single component of the binary vector. The algorithm iteratively examines all variables: if flipping a variable's value improves the objective function, then the move is made; otherwise, it is skipped.
ALGORITHM 2: 1-Opt Local Search
Initialize Y;
for i = 1 → n do                      /* examine all components */
    Yi = 1 − Yi;                      /* make the move for component i */
    if no gain then
        Yi = 1 − Yi;                  /* cancel the move if no improvement */
    end
end
The local search algorithm can be implemented efficiently. First, if the input matrix Am×n is stored in sparse form such that each row of Am×n is represented by a subset of {1, … ,n}, then the running time is reduced considerably for instances with low density. Second, as much of the running time of the heuristics is for evaluating the gain of the next move, the overall running time can be decreased significantly by evaluating a move’s gain on the basis of the current solution. Let the current solution be Y. Its influence on each observation Ai. is as follows:
I(Ai., Y) = max{0, f11(Ai., Y) − w10 f10(Ai., Y) − w01 f01(Ai., Y)}.    (45)
Consider the move of flipping the value of the kth element of Y from 0 to 1. To evaluate the gain of the move, we only need to check whether Ai. contains element k. If it does, then the move increases f11(Ai., Y) by one and decreases f10(Ai., Y) by one. If it does not, then the move increases f01(Ai., Y) by one. Suppose instead the move flips the value of the kth element of Y from 1 to 0. If Ai. contains element k, then the move decreases f11(Ai., Y) by one and increases f10(Ai., Y) by one. Otherwise, the move decreases f01(Ai., Y) by one. It takes O(log2(n)) time to check the existence of an element in an ordered list of n elements by binary search. So evaluating a move's gain takes O(m log2(n)).
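Below is a minimal dense-matrix C sketch of Algorithm 2 with this incremental gain evaluation (our own code, not the article's implementation). With a dense 0/1 matrix the membership test is O(1), so one flip is evaluated in O(m); the sparse ordered-row variant described above would use binary search instead. The per-row cache score[i] holds f11 − w10 f10 − w01 f01 for the current Y, so the objective Σi max{0, score[i]} never has to be recomputed from scratch.

#include <stdlib.h>

/* 1-opt local search (Algorithm 2) with incremental gain evaluation.
   A: m x n 0/1 matrix, row-major; Y: current pattern, updated in place. */
void local_search(const int *A, int m, int n,
                  double w10, double w01, int *Y)
{
    double *score = malloc(m * sizeof(double));

    /* initialize per-row scores f11 - w10*f10 - w01*f01 under Y */
    for (int i = 0; i < m; i++) {
        double s = 0;
        for (int j = 0; j < n; j++) {
            int a = A[i * n + j], y = Y[j];
            if (a && y)       s += 1.0;   /* f11 */
            else if (a && !y) s -= w10;   /* f10 */
            else if (!a && y) s -= w01;   /* f01 */
        }
        score[i] = s;
    }

    int improved = 1;
    while (improved) {                    /* loop until a local optimum */
        improved = 0;
        for (int k = 0; k < n; k++) {     /* consider flipping component k */
            double gain = 0;
            for (int i = 0; i < m; i++) {
                /* a 1-entry in column k trades f11 against f10;
                   a 0-entry toggles an f01 error */
                double d = A[i * n + k]
                         ? (Y[k] ? -(1.0 + w10) : (1.0 + w10))
                         : (Y[k] ? w01 : -w01);
                double s = score[i] + d;
                gain += (s > 0 ? s : 0) - (score[i] > 0 ? score[i] : 0);
            }
            if (gain > 0) {               /* keep the move */
                for (int i = 0; i < m; i++)
                    score[i] += A[i * n + k]
                              ? (Y[k] ? -(1.0 + w10) : (1.0 + w10))
                              : (Y[k] ? w01 : -w01);
                Y[k] = 1 - Y[k];
                improved = 1;
            }                             /* otherwise skip (cancel) */
        }
    }
    free(score);
}

Since every accepted move strictly increases the objective and the objective takes finitely many values, the loop is guaranteed to terminate at a local optimum.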
5.2. Quick-and-Dirty
A quick-and-dirty solution simply selects one random observation from the remaining dataset and takes it as a tentative pattern at each iteration. Then, based on the penalty weights, it identifies the observations belonging to the tentative pattern. Finally, it takes the rounded mean of the group of identified observations as a pattern. As such, rank-one BMF is avoided, while the penalty weights are still utilized. This random algorithm can easily be upgraded to be more robust: in this variant, at each iteration we randomly select log(n) observations as tentative patterns, where n is the total number of observations, and pick the one that has the most observations belonging to it. The variant incurs more computing time than the simple random algorithm but less than the sophisticated algorithm employing BMF. The experimental results presented later show that the quick-and-dirty solutions have decent performance in terms of data compression, clustering, and pattern discovery and can be the preferred choice for extremely large datasets.
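A C sketch of the simple variant follows (ours; the function names are hypothetical, and pattern_influence is the Equation (3) routine sketched in Section 3). It assumes each row has at least one 1-entry, so the tentative pattern always joins its own group.

#include <stdlib.h>

double pattern_influence(const int *x, const int *y, int n,
                         double w10, double w01);

/* One iteration of the quick-and-dirty solution: pick a random remaining
   observation as a tentative pattern, gather the observations in which it
   is present per the weights, and return the rounded mean of that group. */
void quick_and_dirty(int **rows, const int *remaining, int r, int n,
                     double w10, double w01, int *pattern_out)
{
    const int *tentative = rows[remaining[rand() % r]];
    int *colsum = calloc(n, sizeof(int));
    int members = 0;

    for (int i = 0; i < r; i++) {
        const int *row = rows[remaining[i]];
        if (pattern_influence(row, tentative, n, w10, w01) > 0) {
            members++;
            for (int j = 0; j < n; j++) colsum[j] += row[j];
        }
    }

    /* rounded mean: component j is 1 iff at least half the group has a 1 */
    for (int j = 0; j < n; j++)
        pattern_out[j] = (2 * colsum[j] >= members) ? 1 : 0;
    free(colsum);
}

The upgraded variant would repeat the seed selection log(n) times and keep the tentative pattern with the largest group before averaging.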
6. WEIGHT PARAMETER SELECTION
Weight parameters play a crucial role in the effectiveness of the weighted rank-one BMF model. However, there is no universal way of determining them, as the choice largely depends on the specific application of the model. We now discuss several common scenarios as well as ways to set the weights automatically.
Weighted rank-one BMF can be applied to role-based access control, which we will discuss in detail in the experimental study section. In this scenario, the purpose of BMF is to find binary patterns that cover 1-entries without covering any 0-entries, since a 0-becoming-1 error may lead to a confidentiality or integrity violation caused by permission misuse. To implement this, we can set the weight on 0-becoming-1 errors to a sufficiently large number and the weight on 1-becoming-0 errors to 1, so that a retrieved pattern does not cause any 0-becoming-1 error. By iteratively performing BMF on the remaining binary matrix, we can eventually discover a set of patterns (roles) and the pattern memberships (user-role assignments).
Another application of weighted rank-one BMF is data compression. In this scenario, we can replace each record with its representative pattern, thus reducing a large dataset to a small set of patterns. Given that the patterns approximate the original data well, complicated data analysis tasks, which are difficult to perform on the original dataset, can be conducted on the small pattern set instead. In this scenario, we need to trade off compressed data size against degree of approximation: higher weight parameter values lead to larger compressed data and more accurate approximation. To determine concrete weight parameter values, a straightforward way is to examine the compressed data size and the approximation accuracy as functions of the weight parameters and choose values such that increasing them does not lead to significantly better approximation of the data, while the compressed data size remains acceptable for practical use. If the dataset is so large that it is difficult to try different weight values, then a compromise is to take a small random sample and determine the weight values based on it instead.
Weighted rank-one BMF can also be applied to pattern discovery and clustering. By iteratively performing BMF, we can cluster records into groups and retrieve their representative patterns. Weight parameter values determine the number of clusters and the quality of the patterns. From this perspective, the weight parameter selection problem is similar to determining the number of clusters in clustering. We can use cross-validation, a popular means for determining the number of clusters in a dataset, to facilitate automatic selection of weight parameter values. In this process, the data are partitioned into n parts. Each part is then set aside in turn as a test set, the BMF model is fit on the other n − 1 training parts, and the value of the goal function, i.e., the sum of the squared distances to the corresponding patterns, is calculated for the test set. These n values are calculated and averaged for each alternative set of weight parameter values. We try a finite set of weight parameter values and choose the one that minimizes the test set approximation error, as sketched below.
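The following C sketch outlines the procedure (ours, with hypothetical helper names): learn_patterns stands for iterative weighted rank-one BMF on the training rows, and nearest_pattern_error for the squared Euclidean distance from a held-out row to its closest learned pattern. For concreteness, the grid could be the set {1.0, 1.5, … , 4.0} used in Section 7.

#include <float.h>
#include <stdlib.h>

/* Hypothetical helpers; see the lead-in above. */
int    learn_patterns(int **train, int m, int n, double w10, double w01,
                      int **patterns, int max_k);
double nearest_pattern_error(const int *row, int **patterns, int k, int n);

/* Choose (w10, w01) from a finite grid by 10-fold cross-validation,
   minimizing the total held-out approximation error. */
void select_weights(int **rows, int m, int n,
                    const double *grid, int g,
                    int **patterns, int max_k,
                    double *w10_best, double *w01_best)
{
    int **train = malloc(m * sizeof(int *));
    double best = DBL_MAX;

    for (int a = 0; a < g; a++)
        for (int b = 0; b < g; b++) {
            double err = 0;
            for (int fold = 0; fold < 10; fold++) {
                int tr = 0;        /* rows with i % 10 == fold are held out */
                for (int i = 0; i < m; i++)
                    if (i % 10 != fold) train[tr++] = rows[i];
                int k = learn_patterns(train, tr, n, grid[a], grid[b],
                                       patterns, max_k);
                for (int i = fold; i < m; i += 10)
                    err += nearest_pattern_error(rows[i], patterns, k, n);
            }
            if (err < best) {      /* averaging over 10 folds is implicit */
                best = err;
                *w10_best = grid[a];
                *w01_best = grid[b];
            }
        }
    free(train);
}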
7. EXPERIMENTAL EVALUATION
In this section, we illustrate the desirable properties of weighted rank-one BMF in terms of runtime scalability and effectiveness in clustering, compression, and pattern discovery on both synthetic and real datasets. We implemented our methods in both Matlab and C. However, since our algorithm does not involve vector operations, which is the primary advantage of Matlab, we find the C implementation takes much less time than Matlab. Therefore, we only report the experimental results for the C implementation. The experiments are run on a Dell desktop with dual core CPU 3.10 GHz and 4 GB memory.
7.1. Effectiveness of Clustering and Pattern Discovery on Synthetic Data
This experiment compares weighted rank-one BMF with state-of-the-art approaches for clustering and pattern discovery on binary data, including SVD [Andrews and Patterson 1976], k-means [Tan et al. 2005], conventional rank-one BMF [Koyuturk et al. 2005] (denoted conventional in this section), and the Discrete Basis Problem (DBP) [Pauli et al. 2008], with respect to their effectiveness. We adopt the same synthetic data generator and experimental design employed in Koyuturk et al. [2005]. We generate 200 binary vectors of 80 dimensions by first sampling 5 binary vectors, each with 50 components equal to 1, as underlying patterns, repeating each 40 times, and then flipping the values of 5% of randomly chosen entries in those records. The generated data can be represented by a 200 × 80 binary matrix and are plotted in black and white in Figure 4(a), where a black point indicates an entry with the value 1.
We run each approach mentioned above to extract the underlying patterns and find clusters by grouping records belonging to the same pattern. A good approach should discover all five underlying patterns and their associated clusters. We utilize the Matlab packages to perform SVD and k-means, use the code of Pauli et al. [2008] to execute DBP, and run the algorithm described in Koyuturk et al. [2005] to implement conventional rank-one BMF. To carry out SVD, k-means, and DBP, the number of clusters (or the matrix rank) is needed. Indeed, this experiment is designed to favor those approaches: it assumes the true number of clusters, 5, is already available to them, while in practice a prior step (e.g., some information criterion-based approach) is typically required to learn the number of clusters.
We first apply five-way k-means clustering to discover five clusters, approximate each record by its associated cluster centroid, and plot the resulting approximation in Figure 4(b). The result confirms the issue of k-means raised in Koyuturk et al. [2005]: k-means is not able to capture patterns with significant overlap. In Figure 4(a), the first two patterns overlap heavily; k-means fails to identify them and falsely groups many records of the first two clusters into one big cluster. This example in fact demonstrates the need to weight errors: if we increase the penalty on discrepancies, then the records in the first two clusters would never be considered part of the same cluster. Another issue with k-means discussed in the literature is that it is highly sensitive to initial solutions and hence not very robust.
We run the algorithm described in Pauli et al. [2008] to execute DBP for discovering discrete patterns. To favor DBP, the execution assumes it is known that there are five discrete patterns and that each record is derived from one pattern. DBP decomposes a binary matrix into the Boolean product of two binary matrices; one decomposed binary matrix can be interpreted as a discrete basis (or patterns), while the other describes the data as a formation of the discrete basis. The DBP algorithm first constructs a set of candidate basis vectors, based on the confidence concept in association rule mining; a parameter r controls the confidence level. Then, a small set of candidate basis vectors is selected greedily, with two weight parameters w+ and w− controlling the selection. The algorithm details can be found in Pauli et al. [2008]. We run the algorithm multiple times, varying r from 0.4 to 0.9 and w− from 1 to 10 with w+ fixed at 1, select the best set of five basis vectors, and associate each record with the closest basis vector. We represent each record by its corresponding basis vector and plot the resulting approximation in Figure 4(c). As the figure shows, DBP fails to discover all five underlying patterns, and the resulting approximation differs from the original data by 3,123 errors. One possible reason is that DBP is suited to the pattern discovery problem where each record is generated from multiple patterns, while our synthetic data generator samples each record from only one pattern.
We run SVD to find the rank-5 approximation of the original data and plot the approximation matrix as a grayscale image in Figure 4(d). The SVD approximation is optimal in the least squares sense, with an approximation error of 29.78 in this example. However, SVD clearly has the interpretability issue, as the retrieved patterns contain fractional and negative numbers, while the five true patterns are binary. For a fairer comparison, we quantize the result of SVD. We can quantize either the resulting approximation or the singular vectors. As analyzed and reported in Koyuturk et al. [2005], quantizing singular vectors yields approximations that deviate significantly from the original data, so we only consider quantizing the resulting approximation. We round a value in the SVD approximation to 1 if it is greater than a threshold value r, and to 0 otherwise. We try different values of r and pick the one that minimizes the difference between the quantized result and the original data. The best result is depicted in Figure 4(e). The approximate matrix is very close to the original data without noise. However, it requires 10 patterns to describe the original data, of which five are false patterns.
We run the algorithm described in Koyuturk et al. [2005] to implement conventional rank-one BMF. Conventional rank-one BMF iteratively discovers a binary presence vector X and a binary pattern vector Y that minimize ‖A − XYT‖1. It can be viewed as a special case of our weighted rank-one BMF with error weights w10 of 0 and w01 of 1, as shown in Section 4.4. We apply conventional rank-one BMF to discover patterns, represent each record by its associated pattern, and plot the approximation in Figure 4(f). Two false patterns are retrieved, and the majority of records are falsely clustered into one group. The reason conventional rank-one BMF is unable to discover the true patterns is its tolerance of errors, which prevents it from distinguishing patterns with large overlap, as in this example. If we increase the penalty on differences, then more clusters are revealed.
We finally apply our weighted rank-one BMF. Similarly to conventional rank-one BMF, weighted rank-one BMF iteratively discovers a binary vector with the maximal pattern influence as defined in Equation (3), treats the discovered binary vector as an underlying pattern, and groups the records in which the pattern is present into a cluster. As finding the binary vector with the maximal pattern influence is an NP-hard problem, we implement the weighted rank-one BMF approach with the three different heuristics proposed in Section 5: (1) local, which uses the local search heuristic to discover a pattern guaranteed to be a locally optimal solution at each iteration; (2) random, which at each iteration randomly selects a data record, groups its neighboring records into a cluster, and takes the rounded average of the cluster as the pattern; and (3) upgraded random, which upgrades the random algorithm by trying log(n) random records (n is the total number of records) as tentative patterns and then choosing the one with the most neighboring records. It is expected that local takes the most running time with the best performance, random takes the least running time at the cost of some effectiveness, and upgraded random lies between the two. To execute weighted rank-one BMF, we need the weights w10 and w01. Indeed, k-means, DBP, and SVD also need additional information, i.e., the number of clusters or the rank, before they can be carried out; to favor those approaches, the prior experiments simply assume this information is known to them. According to the many experiments we have conducted with weighted rank-one BMF, we consider it reasonable to assume weight values in [1, 4]. For further convenience, we limit weight values to increments of 0.5, i.e., {1.0, 1.5, … , 4.0}, which leaves 49 possible combinations of weight values. To determine the concrete weight values, we use 10-fold cross-validation, a common approach for model selection, as discussed in Section 6. For each set of weight values, we partition the 200 records into 10 equal-size groups, use 9 groups as training data, apply the weights to learn patterns, associate each record in the remaining group with the retrieved pattern at minimum Euclidean distance, and compute the approximation error (i.e., the sum of Euclidean distances between records and their associated patterns). We repeat the above procedure 10 times, each time retaining a different group as test data, and calculate the average approximation error. The set of weights leading to the minimum average approximation error is chosen: w10 of 2 and w01 of 2 in this case. The induced patterns and clusters are the result of our weighted rank-one BMF approach. We represent each record by its associated pattern and plot the resulting approximations in Figure 4(g)–(i). The results show that all three implementations of weighted rank-one BMF discover the five true patterns and the true underlying clusters. The experiment clearly demonstrates the effectiveness of weighted rank-one BMF in clustering and pattern discovery relative to the other approaches.
7.2. Robustness of Clustering and Pattern Discovery on Synthetic Data
This experiment examines the robustness of the weighted rank-one BMF approach to noise, since robustness is a desirable property for clustering and pattern discovery. The prior experiment demonstrated the limitation of k-means in discerning patterns and clusters with large resemblance and the difficulty SVD has in discovering discrete patterns. So in this experiment, we compare our approach only with conventional rank-one BMF and DBP. We use the same synthetic data generator to construct two synthetic datasets: one with 10% noise added, plotted in Figure 5(a), and the other with 20% noise, plotted in Figure 6(a).
The results of conventional are reported in Figures 5(b) and 6(b). For the data with 10% noise, conventional retrieves three false patterns with no resemblance to any true pattern and groups the majority of records into one big cluster, as shown in Figure 5(b). The result on the data with 20% noise is not good either: Figure 6(b) shows that one true pattern and two false patterns are retrieved, and the five true clusters are not discovered.
The results of DBP are reported in Figures 5(c) and 6(c). For DBP, we again assume the number of true clusters is known. By applying rank-4 binary matrix decomposition, we retrieve four discrete basis vectors and represent each record by the closest one. For the data with 10% noise, Figure 5(c) shows that DBP fails to discover the true patterns and the clustering result is incorrect. For the data with 20% noise, Figure 6(c) shows that DBP performs even worse: the retrieved patterns hardly resemble any true pattern, and the clustering result deviates significantly from the underlying truth.
To evaluate weighted rank-one BMF, we again run the three implementations, namely local, random, and upgraded random. To determine the weights, we again use 10-fold cross-validation as in the previous experiment. For the data with 10% noise, the learned weights are w10 of 1 and w01 of 2; for the data with 20% noise, they are w10 of 1 and w01 of 1. Applying the learned weights, we execute local, random, and upgraded random separately to retrieve patterns and represent each record by its cluster centroid. The approximation results on the data with 10% noise are plotted in Figure 5(d)–(f). As shown in Figure 5(d), local returns all five true underlying patterns and one false pattern. It successfully discovers the five underlying clusters, with the single mistake of separating one record from the others and treating it as a unique cluster. As shown in Figure 5(e), random returns 15 patterns, and the resulting approximation largely resembles the original data without the 10% noise. The 15 clusters induced by the 15 patterns contain the five true clusters, plus 10 false clusters with few members that can be ruled out with some postprocessing step. Figure 5(f) shows that upgraded random returns 10 patterns. Its approximation is more distant from the original data than that of random, but it returns fewer patterns, so in terms of precision in retrieving true patterns, upgraded random performs slightly better than random. The approximation results on the data with 20% noise are plotted in Figure 6(d)–(f). As shown in Figure 6(d), local still performs well: it retrieves all five true patterns and one false pattern. So in terms of pattern discovery, the local implementation of weighted rank-one BMF is robust to data noise. In terms of clustering, local performs slightly worse on the data with 20% noise than on the data with 10% noise: Figure 6(d) shows 6 clusters with one false cluster containing quite a few members, while the false cluster in Figure 5(d) has only one member. The results show that the amount of noise does affect the performance of weighted rank-one BMF, and Figures 6(e) and (f) verify this observation: in Figure 6(e), random returns 32 patterns that include all five true patterns, and in Figure 6(f), upgraded random returns 26 patterns that also include all true patterns. More data noise thus leads to more false patterns. It is also observed that weighted rank-one BMF performs significantly better than the other approaches: for both noisy datasets, all three implementations retrieve all underlying patterns, and the local implementation performs particularly well, returning only one false pattern. Another observation is that all three implementations perform well in terms of clustering, as the cluster centroids reconstruct the original data well, while the approximations returned by conventional and DBP deviate significantly from the original data. The experiment shows that weighted rank-one BMF is robust to data noise and again verifies its advantage in clustering and pattern discovery on binary data.
7.3. Running Time
This experiment studies the running time of the weighted rank-one BMF algorithms. We proposed two solutions. The first, local, is the practical iterative framework combined with a local search heuristic. (Note that the iterative framework can be combined with other heuristics as well.) The second, random, is a quick-and-dirty solution that generates a random pattern at each iteration. The synthetic data generator works as follows. Denote the number of observations by m. First, generate 0.01 × m random binary vectors of size 200, with an average density of 1-entries of 0.01, and use them as patterns. Then construct an m × 200 matrix by associating each row with a random pattern. Finally, randomly select 10% of the entries and flip their values. We generate six synthetic matrices with m taking the following values: 10k, 20k, 40k, 60k, 80k, and 100k. Local and random, with w10 = w01 = 1, are executed against the six matrices. Their running times are reported in Figure 7. The running time of random grows almost linearly with data size; it takes less than 150 s for a dataset as large as 100k. The running time of local grows roughly quadratically with data size; it takes about 350 s for a dataset of 100k. Both algorithms perform well in terms of running time, but for even larger datasets, random is preferred. It is also possible to reduce running time with parallel algorithms, utilizing technologies like Hadoop MapReduce; exploring that direction is a potential avenue for future work.
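The generator is simple enough to sketch in a few lines of NumPy. The function below is our own illustration, and it approximates "flip 10% of the entries" with independent Bernoulli flips:

```python
import numpy as np

def generate_synthetic(m, n=200, density=0.01, noise=0.10, seed=0):
    """Sketch of the synthetic generator described above: 0.01*m random
    binary patterns of length n, one pattern per row, then ~10% bit flips."""
    rng = np.random.default_rng(seed)
    k = max(1, int(0.01 * m))                      # number of patterns
    patterns = (rng.random((k, n)) < density).astype(np.int8)
    rows = rng.integers(0, k, size=m)              # assign each row a pattern
    A = patterns[rows].copy()
    flip = rng.random(A.shape) < noise             # flip roughly 10% of entries
    A[flip] = 1 - A[flip]
    return A

A = generate_synthetic(10_000)
```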
7.4. Application to Data Compression in Association Rule Mining
In this section, we illustrate the application of weighted rank-one BMF to data compression in the context of association rule mining. Data compression is an important topic in computer science and information theory. It involves encoding information using fewer bits than the original representation. Data compression is useful because it reduces resource usage and has found numerous applications, including audio transmission, video storage, and wireless sensor networks, among others. A variety of data compression methods have been proposed; they are classified as either lossy or lossless, depending on whether the original data can be exactly reconstructed from the compressed data. By applying weighted rank-one BMF with appropriately chosen weight parameter values, one can extract patterns from the original binary data and associate patterns with data records. As one pattern typically represents many data records, the relatively few extracted patterns and their associations to the original data records can be viewed as compressed data. Since original data records are not exactly the same as their associated patterns, weighted rank-one BMF serves as a lossy compression method.
Data compression can benefit association rule learning, a popular and well-researched method for identifying strong rules in large databases. For example, the rule {milk, bread} → {butter} found in the sales data of a supermarket would indicate that if a customer buys milk and bread, he or she is likely to also buy butter. Association rules can be used as the basis for decisions about marketing activities such as promotional pricing or product placement. The core of association rule generation is identifying frequent itemsets. Finding all frequent itemsets in a database is difficult, since it involves searching over all possible itemsets. Efficient algorithms, like apriori, exploit the anti-monotonicity property, which guarantees that all subsets of a frequent itemset are also frequent and, consequently, that all supersets of an infrequent itemset must also be infrequent; this allows entire branches of the search space to be pruned. However, even for efficient algorithms, finding frequent itemsets in a large dataset is time-consuming.
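For reference, a minimal apriori sketch (the function below is ours, not the implementation used in the experiments) that grows itemsets level by level and prunes every candidate with an infrequent subset:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal apriori sketch: grow frequent itemsets level by level,
    pruning candidates whose subsets are infrequent (anti-monotonicity)."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # level 1: frequent individual items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # candidate generation: join frequent (k-1)-itemsets
        keys = list(frequent)
        candidates = {a | b for a in keys for b in keys if len(a | b) == k}
        # prune: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(frequent)
        k += 1
    return result
```

For example, `apriori([{'milk', 'bread', 'butter'}, {'milk', 'bread'}, {'milk'}], 0.6)` returns the supports of {milk}, {bread}, and {milk, bread}, while {butter} and its supersets are pruned.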
To cope with the computational challenge of mining association rules from large datasets, data compression techniques have been used [Burdick et al. 2001; Koyuturk et al. 2005; Shenoy et al. 2000]. The general idea is to cut down the data size so the mining algorithm can complete the mining task on large datasets. The conventional rank-one BMF [Koyuturk et al. 2005] is one recent application of data compression to association rule mining. Its idea is to extract binary patterns from a large binary dataset and then perform the mining task on the extracted patterns, which are of much smaller size than the original data. Experimental studies [Koyuturk et al. 2005] show that frequent itemsets mined from compressed data (extracted patterns) are close to those mined from the original data. However, the conventional rank-one BMF-based compression approach has one limitation: the size (number) of extracted patterns is not adjustable. In practice, one might want to control the data compression ratio to balance the efficiency and accuracy of the mining algorithm.
This experiment is designed to demonstrate that weighted rank-one BMF allows one to adjust the data compression ratio while achieving highly accurate mining results from compressed data. As demonstrated in the previous experiments, weighted rank-one BMF can be used to extract binary patterns from a large dataset, so a large dataset can be represented by a small number of extracted patterns and the sizes of their corresponding clusters. One can then reduce the computational overhead of association rule mining on a large dataset by performing the mining task on the compressed data of much smaller size. As weighted rank-one BMF is a lossy data compression method, the frequent itemsets mined from the compressed data are not necessarily the same as those mined from the original data. To validate the effectiveness and flexibility of weighted rank-one BMF as a compression technique supporting association rule mining, we conduct experiments on two FIMI workshop datasets, pumsb and connect.1 The data descriptions are provided in Table 1. For each dataset, we first use the apriori algorithm, which proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets, to retrieve frequent itemsets. Then we apply weighted rank-one BMF to derive a small number of patterns and the sizes of their corresponding clusters (i.e., data compression). As both datasets are large, we use the rand (random) algorithm implementation for weighted rank-one BMF. Finally, we use the slightly modified apriori algorithm described in Koyuturk et al. [2005] to mine frequent itemsets from the compressed dataset, i.e., the small number of patterns and the sizes of their corresponding clusters; a sketch of this weighted counting follows Table 1.
Table 1. Descriptions of the pumsb and connect Datasets
Dataset | Number of Data Records | Number of Items | Number of Non-zeros |
---|---|---|---|
pumsb | 49,047 | 2,113 | 3,629,478 |
connect | 67,558 | 129 | 2,904,994 |
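Mining from the compressed representation amounts to counting supports over patterns, weighted by their cluster sizes. The helper below sketches our reading of that modified counting; the exact procedure in Koyuturk et al. [2005] may differ in details:

```python
def weighted_support(itemset, patterns, cluster_sizes, n_records):
    """Support of an itemset estimated from compressed data: each pattern
    containing the itemset contributes the size of its cluster (our reading
    of the modified apriori counting of Koyuturk et al. [2005])."""
    covered = sum(size for pat, size in zip(patterns, cluster_sizes)
                  if itemset <= pat)
    return covered / n_records

# toy usage: 700 of 1,000 records belong to the {milk, bread} pattern
patterns = [frozenset({'milk', 'bread'}), frozenset({'butter'})]
sizes = [700, 300]
print(weighted_support(frozenset({'milk'}), patterns, sizes, 1000))  # 0.7
```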
We compare the true frequent itemsets mined from the original dataset, denoted by A, with those retrieved from the compressed dataset, denoted by B. To evaluate accuracy, we use the F-measure, which considers both the precision and the recall of the test. Precision is defined as |A ∩ B|/|B|, where |A ∩ B| is the number of true frequent itemsets retrieved from the compressed dataset and |B| is the total number of frequent itemsets retrieved from the compressed dataset. Recall is defined as |A ∩ B|/|A|, where |A| is the number of true frequent itemsets from the original dataset. The F-measure is computed as 2 · precision · recall/(precision + recall), which reaches its best value at 1 and its worst at 0. The compressed data consist of representative patterns and their associations to the original data records. As it takes negligible space to store the associations (one integer pattern label per record), we measure the data compression ratio as the number of extracted patterns divided by the number of original data records.
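These evaluation measures are straightforward to compute; a minimal sketch (the function names are ours):

```python
def f_measure(true_itemsets, retrieved_itemsets):
    """Precision, recall, and F-measure over sets of frequent itemsets,
    following the definitions above."""
    A, B = set(true_itemsets), set(retrieved_itemsets)
    precision = len(A & B) / len(B)
    recall = len(A & B) / len(A)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def compression_ratio(num_patterns, num_records):
    """Patterns retained relative to original records; associations take
    negligible space, as noted above."""
    return num_patterns / num_records
```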
As discussed in Section 6, greater values of the weight parameters lead to lower compression and higher approximation quality, as more representative records are selected. However, there is no deterministic relation among weight parameter values, compression, and approximation. We therefore explored different values and found that weight parameter values between 1 and 3 typically achieve a good balance between data compression and approximation, so we report results for parameters varying from 1 to 3. Note that weighted rank-one BMF with w10 = w01 = 1 is in fact conventional rank-one BMF.

The results for the pumsb transactions set are reported in Tables 2–4. We consider the frequent itemsets with support (the proportion of the transactions in the database that contain the set) over 0.7 and length (the number of items in the set) less than 6 in the original transactions set and compare them with the itemsets satisfying the same criterion in the approximate transactions sets. From Table 2, we observe that the number of patterns required to represent the original transactions set increases as the values of w10 and w01 grow. The benefits of weighted rank-one BMF can be observed in Tables 2–4. The number of itemsets with the same support in the original dataset is 66,741. When w10 and w01 are 1, the data are compressed to a ratio of about 0.003, and 184,372 frequent itemsets are retrieved from the approximate transactions set with an F-measure score of 0.5271, which can be considered inaccurate. By increasing the values of w10 and w01, more data storage is needed to represent the original transactions set, although the sizes remain much smaller than the original data size. In return, the accuracy of the retrieved frequent itemsets is significantly improved. For instance, when w10 and w01 are 2.5, the data are compressed to about one-tenth, while the retrieved frequent itemsets achieve a high F-measure score of 0.9475. When w10 and w01 are greater than 2, the F-measure score is consistently greater than 0.9, as opposed to 0.5271 when w10 and w01 are 1. We also observe that higher values of w10 and w01 do not necessarily lead to a better F-measure score, although the general trend exists. This is partially because the rand algorithm is heuristic in nature.

To confirm the findings, we conduct another experiment on the connect transactions set. We compare the frequent itemsets of support over 0.45 and length less than 5 in the original transactions set with the frequent itemsets of the same characteristics derived from the approximate transactions sets. The number of itemsets with the same support in the original dataset is 57,352. The results reported in Tables 5–7 are even better. As shown in Table 5, when both w01 and w10 increase to 3, the data are compressed to a ratio of about 0.003, while the F-measure of the frequent itemsets derived from the compressed data is as high as 0.9459. The experimental results clearly demonstrate the benefits of weighted rank-one BMF as a data compression method.
Table 2. Compression Ratios on the pumsb Dataset
w10\w01 | 1 | 1.5 | 2 | 2.5 | 3 |
---|---|---|---|---|---|
1 | 0.0026* | 0.0068 | 0.0179 | 0.0325 | 0.0571 |
1.5 | 0.0071 | 0.0183 | 0.0329 | 0.0570 | 0.0742 |
2 | 0.0179 | 0.0330 | 0.0571 | 0.0724 | 0.0930 |
2.5 | 0.0179 | 0.0586 | 0.0730 | 0.0927 | 0.1143 |
3 | 0.0560 | 0.0730 | 0.7045 | 0.1147 | 0.1373 |
*The (1, 1) setting corresponds to conventional rank-one BMF.
Table 3. Numbers of Frequent Itemsets Retrieved from the Compressed pumsb Data
w10\w01 | 1 | 1.5 | 2 | 2.5 | 3 |
---|---|---|---|---|---|
1 | 184,372* | 50,680 | 69,227 | 70,347 | 88,162 |
1.5 | 34,536 | 69,905 | 68,322 | 66,473 | 80,859 |
2 | 84,451 | 30,354 | 70,166 | 76,359 | 68,157 |
2.5 | 77,472 | 80,665 | 61,686 | 73,252 | 64,242 |
3 | 60,944 | 64,443 | 66,237 | 64,486 | 76,459 |
*The (1, 1) setting corresponds to conventional rank-one BMF.
Table 4. F-measure Scores on the pumsb Dataset
w10\w01 | 1 | 1.5 | 2 | 2.5 | 3 |
---|---|---|---|---|---|
1 | 0.5271* | 0.7670 | 0.8220 | 0.8821 | 0.8574 |
1.5 | 0.3943 | 0.7811 | 0.8286 | 0.8234 | 0.9006 |
2 | 0.8222 | 0.5663 | 0.8123 | 0.9010 | 0.9383 |
2.5 | 0.8562 | 0.9010 | 0.8614 | 0.9475 | 0.9252 |
3 | 0.9017 | 0.8962 | 0.9175 | 0.9138 | 0.9321 |
*The (1, 1) setting corresponds to conventional rank-one BMF.
Table 5. Compression Ratios on the connect Dataset
w10\w01 | 1 | 1.5 | 2 | 2.5 | 3 |
---|---|---|---|---|---|
1 | 0.00006* | 0.00010 | 0.00025 | 0.00049 | 0.00086 |
1.5 | 0.00013 | 0.00029 | 0.00044 | 0.00083 | 0.00164 |
2 | 0.00030 | 0.00047 | 0.00092 | 0.00158 | 0.00167 |
2.5 | 0.00047 | 0.00090 | 0.00172 | 0.00163 | 0.00358 |
3 | 0.00086 | 0.00173 | 0.00164 | 0.00357 | 0.00354 |
*The (1, 1) setting corresponds to conventional rank-one BMF.
Table 6. Numbers of Frequent Itemsets Retrieved from the Compressed connect Data
w10\w01 | 1 | 1.5 | 2 | 2.5 | 3 |
---|---|---|---|---|---|
1 | 136,697* | 136,697 | 70,359 | 76,226 | 56,602 |
1.5 | 136,697 | 73,995 | 79,503 | 61,268 | 58,105 |
2 | 77,348 | 65,243 | 58,089 | 57,459 | 60,913 |
2.5 | 62,174 | 67,619 | 57,867 | 58,122 | 55,194 |
3 | 55,437 | 58,856 | 58,178 | 57,428 | 58,157 |
*The (1, 1) setting corresponds to conventional rank-one BMF.
Table 7. F-measure Scores on the connect Dataset
w10\w01 | 1 | 1.5 | 2 | 2.5 | 3 |
---|---|---|---|---|---|
1 | 0.3811* | 0.4371 | 0.7386 | 0.7837 | 0.8845 |
1.5 | 0.3605 | 0.7274 | 0.7168 | 0.8531 | 0.9339 |
3 | 0.8678 | 0.9224 | 0.9249 | 0.9571 | 0.9459 |
*The (1, 1) setting corresponds to conventional rank-one BMF.
Note again that the choice of weights w10 and w01 is application dependent. Although our extensive experimental studies suggest weights between 1 and 3, users can always take a small data sample, try different weight values, and choose the setting that achieves their preferred compression ratio and accuracy.
We also compare our method with two recent algorithms, low-rank structured matrix factorization [Haeffele and Vidal 2019] and low-rank non-negative matrix factorization [Gligorijević et al. 2018]. Low-rank matrix factorization is commonly used for compressing matrix data. By decomposing Am×n ≈ Um×k × Vk×n, where k is significantly smaller than m and n, A is compressed to U and V. To reconstruct A, we compute the product of U and V and round the values to 0/1. We apply the two matrix factorization methods with different ranks to the pumsb dataset. From their reconstructed binary matrices, we find frequent itemsets with support over 0.7 and compute the F-measure accuracy by comparing the returned frequent itemsets with those from the original dataset. We also apply weighted rank-one BMF with weights of (2.5, 2.5). The results are reported in Table 8. They show that the traditional matrix factorization-based compression methods do not preserve association rules in binary data well, although performance improves as the rank increases. Another limitation is that matrix decomposition methods are time-consuming for large matrices.
Table 8. F-measure Comparison on the pumsb Dataset
Weighted Rank-One BMF | Low-Rank Matrix Factorization | Low-Rank NMF | Rank |
---|---|---|---|
0.9475 | 0.2402 | 0.1798 | 5 |
 | 0.4537 | 0.3021 | 10 |
 | 0.7663 | 0.5437 | 20 |
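As a sketch of this baseline procedure, with scikit-learn's NMF standing in for the specific factorization methods cited and with a 0.5 rounding threshold that the text does not specify:

```python
import numpy as np
from sklearn.decomposition import NMF

def low_rank_binary_reconstruction(A, rank, seed=0):
    """Decompose A ~ U V with NMF, then round the real-valued product to 0/1
    (we threshold at 0.5; the paper does not state the cutoff used)."""
    model = NMF(n_components=rank, init='random', random_state=seed, max_iter=500)
    U = model.fit_transform(A)    # m x k
    V = model.components_         # k x n
    return (U @ V >= 0.5).astype(np.int8)
```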
7.5. Application to Clustering in Information Retrieval
In this section, we illustrate the application of weighted rank-one BMF to clustering in the context of information retrieval. Information retrieval is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) [Manning et al. 2008]. Hundreds of millions of people engage in information retrieval every day when they use a web search engine, and it is becoming the dominant form of information access. Due to its importance, information retrieval is an active research area. One challenge in information retrieval is processing a large volume of data. There are a variety of methods to alleviate the computational overhead, such as indexing, data compression, and distributed computing. Clustering can benefit information retrieval in the sense that documents can be grouped into clusters such that documents in the same cluster behave similarly with respect to relevance to information needs. One application of clustering in information retrieval is search result clustering. Consider the search term jaguar, which has several word senses. There could be a large number of web sites relevant to the search term, and a good search engine would provide clustered results, such as cars, club, and cat, to allow users to choose what they are interested in.
Weighted rank-one BMF can be used as a document clustering method in information retrieval. A document collection can be represented as a binary document-term matrix, where a 1 in a cell denotes that the corresponding document contains the corresponding term. By iteratively applying weighted rank-one BMF with appropriately chosen weight parameters, one can derive dominant patterns and associate patterns with documents. As each data record is linked to one pattern, documents are naturally partitioned into clusters of relevant documents. Each pattern consists of a cluster of terms, from which semantic information can be derived and fine-grained classification of terms is enabled. To validate our idea, we conduct experiments on the 20 newsgroup dataset, originally collected by Lang [1995], and compare our method with k-means, the classic text clustering method. The 20 newsgroups collection is a popular dataset for experiments in text classification and text clustering. The data contain 20 different newsgroups, each corresponding to a different topic, e.g., comp.graphics, rec.autos, and sci.crypt. Some of the newsgroups are closely related to each other, while others are not. The data have 18,774 documents and 61,188 words in total.2 Since stop words carry little significance for search queries, we filter out a list of 119 English stop words3 from the dataset.
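Constructing such a binary document-term matrix is routine; for instance, with scikit-learn (using its built-in English stop-word list as a stand-in for the 119-word list used here, and a toy corpus of our own):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the jaguar is a big cat", "the jaguar club meets today"]  # toy corpus
# binary=True yields the 0/1 document-term matrix described above;
# stop_words='english' approximates the stop-word filtering in the text
vectorizer = CountVectorizer(binary=True, stop_words='english')
A = vectorizer.fit_transform(docs).toarray()
```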
The experimental design is as follows. We vary the weight parameters and, for each weight parameter setting, run the rand algorithm to discover clusters. We then apply the k-means algorithm with the same number of clusters as returned by our method. As the newsgroup dataset is quite large, we use a fast implementation of the k-means algorithm [Sculley 2010] based on mini-batch optimization. To compare the clustering results of k-means and our method, we consider two measures, cohesion and normalized mutual information (NMI). Cohesion measures how closely related the objects in a cluster are and is defined as Σi=1..k Σx∈Ci dist(x, ci), where ci denotes the centroid of cluster i, Ci denotes the point set of cluster i, k denotes the number of clusters, and dist(x, ci) denotes the distance between point x and the cluster centroid ci. There are several ways of measuring the distance between two binary vectors; we use L1 distance, L2 distance, and cosine distance. We report the ratios of the cohesion of weighted BMF to that of k-means in Table 9. Normalized mutual information determines the similarity of two different clusterings of a dataset. There are various forms of normalized mutual information; we adopt the form used in Kvalseth [1987]. Let U = {U1, U2, …, UR} and V = {V1, V2, …, VC} be two clustering (partitional) results of a set of N data items. The NMI value of the two clustering results is computed as NMI(U, V) = 2 I(U, V)/(H(U) + H(V)), where I(U, V) = Σi Σj (|Ui ∩ Vj|/N) log(N |Ui ∩ Vj|/(|Ui| |Vj|)), H(U) = −Σi (|Ui|/N) log(|Ui|/N), and H(V) = −Σj (|Vj|/N) log(|Vj|/N). (A code sketch of this computation follows Table 9.) The newsgroup dataset is extremely sparse: the overlapping terms of two documents are far fewer than their non-overlapping terms. We therefore deliberately set the weight parameters to small values, so that the number of resulting clusters is reasonable. Note that there is no deterministic relation between the number of clusters and the weight parameters; the selection of parameters is done by exploratory search. We report the results in Table 9 with weight parameters ranging from 0.01 to 0.07.
Table 9. Comparison of Weighted Rank-One BMF and k-means on the 20 Newsgroups Data
w10 | w01 | #clusters | Cohesion ratio (L1) | Cohesion ratio (L2) | Cohesion ratio (cosine) | NMI (BMF vs. k-means) | NMI (k-means vs. k-means) |
---|---|---|---|---|---|---|---|
0.01 | 0.01 | 2 | 0.9993 | 0.9993 | 0.9994 | 0.0221 | 0.0217 |
0.02 | 0.02 | 7 | 1.0317 | 1.0145 | 1.0111 | 0.0205 | 0.0224 |
0.03 | 0.03 | 28 | 1.0167 | 1.0112 | 1.0065 | 0.0214 | 0.0220 |
0.04 | 0.04 | 58 | 0.9747 | 0.9894 | 0.9978 | 0.0217 | 0.0218 |
0.05 | 0.05 | 96 | 0.9595 | 0.9800 | 0.9945 | 0.0221 | 0.0222 |
0.06 | 0.06 | 125 | 1.0461 | 1.0212 | 1.0061 | 0.0216 | 0.0217 |
0.07 | 0.07 | 203 | 1.0278 | 1.0112 | 0.9911 | 0.0211 | 0.0214 |
Cohesion ratios are the cohesion of weighted BMF divided by that of k-means.
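For concreteness, the NMI computation defined above can be sketched as follows (the function names are ours):

```python
import numpy as np

def nmi(labels_u, labels_v):
    """Normalized mutual information in the Kvalseth [1987] form used above:
    NMI(U, V) = 2 I(U, V) / (H(U) + H(V))."""
    u, v = np.asarray(labels_u), np.asarray(labels_v)
    n = len(u)

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / n
        return -np.sum(p * np.log(p))

    mi = 0.0
    for ui in np.unique(u):
        for vj in np.unique(v):
            n_uv = np.sum((u == ui) & (v == vj))
            if n_uv > 0:
                mi += (n_uv / n) * np.log(n * n_uv / (np.sum(u == ui) * np.sum(v == vj)))
    return 2 * mi / (entropy(u) + entropy(v))

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 for identical partitions
```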
We observe that with respect to cohesion, weighted rank-one BMF is comparable to k-means under every distance measure, as all values in columns 4–6 of Table 9 are around 1. With respect to NMI, the values in column 7 suggest no significant similarity between the BMF and k-means clustering results. We suspect the inherent randomness of both algorithms causes this. So we run the k-means algorithm one more time and study the similarity between the two k-means clustering results, reported in the last column of Table 9. We observe that the values in the last column are close to their counterparts in the second-to-last column. This confirms our suspicion about a limitation of both BMF and k-means: both are random in nature, and their results are susceptible to the random choices made in the course of their heuristics.
We also compare our method with two recent k-means variants, online k-means [Liberty et al. 2016] and local k-means [Gupta et al. 2017]. Online k-means assumes that points arrive one at a time; each arriving point is either assigned to an existing cluster or to a new cluster. Local k-means incorporates an additional "cluster" to hold outliers. We compare our method with online k-means and local k-means with respect to the cohesion ratio. The results, reported in Table 10, show that BMF performs better than both online k-means and local k-means. The main reason is that these two advanced k-means algorithms are not suitable for this application setting: online k-means performs well when data points arrive in an online fashion, and local k-means is good for datasets with many extreme outliers, but neither works well on binary data.
Table 10. Cohesion Ratios of Weighted BMF to Online k-means and Local k-means
w10 | w01 | #clusters | Online k-means (L1) | Online k-means (L2) | Local k-means (L1) | Local k-means (L2) |
---|---|---|---|---|---|---|
0.01 | 0.01 | 2 | 1.2279 | 1.2162 | 1.2303 | 1.2421 |
0.03 | 0.03 | 28 | 1.2454 | 1.2325 | 1.2421 | 1.2456 |
0.05 | 0.05 | 96 | 1.2318 | 1.2401 | 1.2210 | 1.2301 |
0.07 | 0.07 | 203 | 1.2104 | 1.2325 | 1.2221 | 1.2361 |
As numbers do not tell the whole story, we further look into the patterns (centroids) returned by BMF and k-means to see whether they provide any insight into the dataset. We find that BMF does return semantically meaningful patterns that were not discovered by k-means. For instance, BMF discovers a pattern of terms "canada," "russia," "austria," "sweden," "switzerland," and "italy," which exists in a few documents in the rec.sport.hockey category; the pattern can be interpreted as "World Hockey." Another returned pattern includes the terms "playoff," "hawks," "farenebt," "farenell," "binghamton," "moncton," "breton," "candiens," "adirondack," and others; it exists in many documents in the rec.sport.hockey category and can be interpreted as "North American Hockey." BMF also discovers many interesting sub-topics in other categories. These examples illustrate that weighted rank-one BMF is able to classify documents and terms at an adjustable resolution that is much finer than the available categorical information. As an exploratory research task, clustering groups similar objects in an attempt to draw out knowledge, make conceptual distinctions, or posit exploratory relationships. In that sense, weighted rank-one BMF, equipped with fast and effective heuristics, serves as a useful cluster analysis tool.
7.6. Application to Pattern Discovery in Role-based Access Control
In this section, we illustrate the application of weighted rank-one BMF to pattern discovery in the context of role-based access control (RBAC) [Ferraiolo et al. 2001]. RBAC has been regarded as the de facto access control model and has been deployed by many organizations and companies. As opposed to the conventional permission-based access control model (i.e., an access control list specifies which users or system processes are granted access to objects, as well as what operations are allowed on given objects), RBAC groups a large number of permissions into a few roles and associates roles with users. This substantially reduces the complexity of user and permission management. Another advantage of RBAC is that the use of roles as authorization subjects, instead of users, avoids having to revoke and re-grant authorizations whenever users change their positions and/or duties within the organization. Given these benefits, companies and organizations want to migrate from their old permission-based access control systems to the RBAC model. Roles can be defined by carefully analyzing and decomposing business processes into smaller units in a functionally independent manner; these functional units are then associated with permissions on information systems. But such a top-down approach is not viable when an organization involves thousands of users and operations. In contrast, the bottom-up approach utilizes the existing permission assignments and discovers roles by extracting data patterns, which is commonly referred to as role mining [Kuhlmann et al. 2003]. To offer justification for the identified roles and provide insightful analysis, the number of roles has been widely considered as a measure of the goodness/interestingness of a role set. Discovering a minimum role set is referred to as the basic role mining problem (RMP) [Vaidya et al. 2007].
From the pattern discovery perspective, the basic RMP can be viewed as an overlapping clustering problem, because a user can be assigned multiple roles; e.g., a person can hold both STUDENT and TEACHING ASSISTANT roles. Weighted rank-one BMF can be applied to role mining due to its lineage to Boolean matrix decomposition, which has been used to model the basic RMP. Let binary matrices A, X, and Y denote user-permission assignments, user-role assignments, and role-permission assignments, respectively. For example, A(i, j) = 1 denotes that user i has permission j, while 0 indicates the user does not; X(i, j) = 1 means user i is assigned role j, while 0 indicates the user is not. The basic RMP can be formulated as a Boolean matrix decomposition problem, represented by Am×n = Xm×k ⊗ Yk×n, where the ⊗ sign denotes the Boolean matrix product, in which 1 plus 1 is 1, as opposed to 2 in the standard matrix product. The use of the Boolean matrix product reflects the fact that when a user is assigned multiple roles containing a common permission, he/she gets that permission assignment only once. Given an input binary matrix A, there exist many possible decompositions.
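The Boolean product is easy to state in code; a small NumPy illustration (the function name and the toy matrices are ours):

```python
import numpy as np

def boolean_product(X, Y):
    """Boolean matrix product: entry (i, j) is 1 iff some role k has
    X(i, k) = Y(k, j) = 1, so 1 + 1 stays 1 as described above."""
    return (X.astype(int) @ Y.astype(int) > 0).astype(np.int8)

X = np.array([[1, 0], [1, 1]])        # user-role assignments
Y = np.array([[1, 1, 0], [0, 1, 1]])  # role-permission assignments
A = boolean_product(X, Y)             # [[1, 1, 0], [1, 1, 1]]
```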
The basic idea of applying weighted rank-one BMF to role mining is to iteratively perform weighted rank-one BMF to discover dominant patterns (roles) and assign them to records (users). Different from the previously studied applications, where each record belongs to one pattern, the basic RMP allows a user to be assigned multiple roles (patterns). To reflect that, after each iteration we keep all records, including those already assigned to previously identified roles; we mark the permission assignments (1-entries) that have been covered by the identified roles, and subsequent weighted rank-one BMF operations do not count those marked assignments. 0-becoming-1 (over-assignment) errors are strictly forbidden in role mining, as an over-assignment error may lead to abuse of rights and cause severe security problems. Therefore, for weighted rank-one BMF, we set w10, the penalty weight on 0-becoming-1 errors, to a sufficiently large number. Since we do not expect one role to fulfill the complete assignments of a user, we set w01 to 0 and w11 to 1. In other words, at each iteration, an identified role is assigned to a user if the assignment does not introduce 0-becoming-1 errors. When we count the influence of a pattern, we count the number of newly covered assignments; that is, at each iteration we try to discover the pattern that covers the most remaining 1-entries. To implement this, we use the local search algorithm described in Algorithm 2: we generate a random pattern (role) and then search its neighborhood by iteratively flipping its components until no better solution can be found. Note that in this scenario (i) the influence of a pattern is computed as the number of newly covered assignments and (ii) a pattern is considered to exist in a record if the record contains the pattern.
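The loop described above can be sketched as follows; the structure, the restart density, and the iteration caps are our assumptions, and the details of Algorithm 2 may differ:

```python
import numpy as np

def influence(A, covered, y):
    """Number of 1-entries newly covered if role y were assigned to
    every user it fits (i.e., whose permissions contain y)."""
    Ab = A.astype(bool)
    fits = ~((~Ab) & y).any(axis=1)           # users with no 0-becoming-1 error
    return int((np.outer(fits, y) & Ab & ~covered).sum())

def mine_roles(A, max_roles=100, seed=0):
    """Sketch of the iterative role-mining loop: local-search for a role,
    assign it wherever it fits without over-assignment, mark covered 1s."""
    rng = np.random.default_rng(seed)
    Ab = A.astype(bool)
    covered = np.zeros_like(Ab)
    roles, attempts = [], 0
    while covered.sum() < Ab.sum() and len(roles) < max_roles and attempts < 10 * max_roles:
        attempts += 1
        y = rng.random(A.shape[1]) < 0.05     # random starting role
        best = influence(A, covered, y)
        improved = True
        while improved:                       # local search: flip one component
            improved = False
            for j in range(A.shape[1]):
                y[j] ^= True
                val = influence(A, covered, y)
                if val > best:
                    best, improved = val, True
                else:
                    y[j] ^= True              # undo the flip
        if best == 0:
            continue                          # restart covered nothing new
        fits = ~((~Ab) & y).any(axis=1)
        covered |= np.outer(fits, y) & Ab     # mark newly covered 1-entries
        roles.append(y.copy())
    return roles
```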
We test the algorithm on the real access control datasets collected by Ene et al. [2008]. The descriptions of the datasets are provided in Table 11. We compare our algorithm with the benchmark algorithm DBP, which discovers a discrete basis (patterns) from a binary matrix. The DBP implementation allows one to set preferences for 1-becoming-0 and 0-becoming-1 errors. As 0-becoming-1 errors are prohibited in the role mining scenario, for the DBP implementation we set the weight of 1-becoming-0 errors to 1 and the weight of 0-becoming-1 errors to a sufficiently large number. The DBP implementation requires the number of roles (basis vectors) as an input. We set the number of roles to the number of records, as the minimum number of roles cannot be greater than the number of records. The DBP implementation returns a basis matrix (Y) and a usage matrix (X) that describes how to reconstruct the records (A) with the discovered basis. If a column of the usage matrix consists of all zeros, then the corresponding basis vector (role) is not used; we delete that basis vector and its usage from Y and X accordingly and keep only those basis vectors (roles) that are in use. Results are reported in Table 11. Note that the reported errors are essentially 1-becoming-0 errors, because we configured both DBP and BMF to avoid 0-becoming-1 errors, which are strictly prohibited in the role mining setting. The results show that DBP is not able to fully cover the permission assignments for all datasets. For instance, for the emea dataset, 32 roles are discovered, and after assigning these roles to users, 52 original user-permission assignments are still left uncovered. The issue cannot be solved by increasing the number of roles, as the additional roles returned by DBP cannot cover those user-permission assignments either. The underlying reason is that the DBP algorithm has a fixed pool of candidate roles, derived from the associations of attributes (permissions) of records. In contrast, BMF is able to cover all user-permission assignments, as roles are discovered iteratively and there is no fixed candidate role pool. We can also observe that, in addition to full coverage, the number of roles returned by BMF is very close to that of DBP. For the customer dataset, BMF finds a set of 276 roles with full coverage, while DBP returns 275 roles and still leaves 241 permission assignments uncovered. In some cases, the result is even better: BMF finds a set of 456 roles to fully cover the apj dataset, while DBP returns a set of 465 roles that still leave 40 permission assignments uncovered. Thus, the BMF approach outperforms the DBP approach in the role mining setting, too.
Table 11. Role Mining Results on Real Access Control Datasets
Dataset | Users | Permissions | Assignments | BMF Patterns | BMF Errors | DBP Patterns | DBP Errors | NMF Patterns | NMF Errors |
---|---|---|---|---|---|---|---|---|---|
emea | 35 | 3,046 | 7,220 | 34 | 0 | 32 | 52 | 34 | 1,743 |
healthcare | 46 | 46 | 1,486 | 15 | 0 | 9 | 31 | 15 | 352 |
domino | 79 | 231 | 731 | 20 | 0 | 20 | 7 | 20 | 217 |
firewall1 | 365 | 709 | 31,951 | 70 | 0 | 47 | 217 | 70 | 2,217 |
firewall2 | 325 | 590 | 36,428 | 10 | 0 | 8 | 12 | 10 | 5,241 |
americas small | 3,477 | 1,587 | 105,205 | 220 | 0 | 201 | 578 | 220 | 9,582 |
apj | 2,044 | 1,164 | 6,841 | 456 | 0 | 465 | 40 | 456 | 3,514 |
customer | 10,961 | 284 | 45,427 | 276 | 0 | 275 | 241 | 276 | 15,369 |
As mentioned, one main advantage of weighted BMF over traditional matrix decomposition methods is that some real applications require the decomposed values to be binary. If we forcefully apply traditional methods, they return odd results. For instance, we could apply non-negative matrix factorization to the binary setting by forcing the decomposed values to lie within [0, 1] and then rounding them to 0/1. To compare with weighted BMF, we decompose a user-permission binary matrix into two small non-negative matrices with the same rank as weighted BMF, round the decomposed values to 0/1, compute their Boolean product to reconstruct the original binary matrix, and count the errors. The results are reported in the last two columns of Table 11. They show that the reconstructed data are far from the original, which suggests the necessity of binary-valued matrix methods.
8. CONCLUSION
In this article, we proposed and studied weighted rank-one BMF, which can be used for effective compression, clustering, and pattern discovery in high-dimensional binary datasets. Desirable properties of weighted rank-one BMF include the ability to discover interpretable data patterns, flexibility in trading off summary size against quality, and the ability to discriminate between different types of approximation errors, namely 0-becoming-1 and 1-becoming-0 errors. Theoretical characteristics of weighted rank-one BMF were studied, including its relations to existing problems, its computational hardness, optimization models, and approximation algorithms. Efficient, effective, and robust heuristics were also presented for practical use on large datasets. Experimental results demonstrate the many benefits of weighted rank-one BMF. In the future, we plan to study the possibility of using parallel algorithms to reduce running time by utilizing technologies like Hadoop MapReduce.
CCS Concepts:
• Information systems → Clustering; Document representation;
Acknowledgments
Research reported in this publication was supported by the State Grid Corporation of China [5418-201958524A-0-0-00] and the National Institutes of Health under awards R01GM118574 and R35GM134927. Professor Huang would like to acknowledge the partial support of the following NSFC grants (71731009, 2018WZDXM020, 71433001). The content is solely the responsibility of the authors and does not necessarily represent the official views of the agencies funding the research.
Footnotes
1. FIMI workshop datasets are available at http://fimi.cs.helsinki.fi/data/.
2. The Matlab format data can be downloaded at http://qwone.com/~jason/20Newsgroups/.
Contributor Information
HAIBING LU, Santa Clara University, USA.
XI CHEN, GEIRI North America, USA.
JUNMIN SHI, New Jersey Institute of Technology, USA.
JAIDEEP VAIDYA, Rutgers University, USA.
VIJAYALAKSHMI ATLURI, Rutgers University, USA.
YUAN HONG, Illinois Institute of Technology, USA.
WEI HUANG, Southern University of Science & Technology; Xi’an Jiaotong University, China.
REFERENCES
- Andrews Harry C. and Patterson Claude L. 1976. Singular value decomposition and digital image processing. IEEE Trans. Acoust. Speech Sign. Process. 24, 1 (1976), 26–28.
- Bai Xue, Gopal Ram, Nunez Manuel, and Zhdanov Dmitry. 2012. On the prevention of fraud and privacy exposure in process information flow. INFORMS J. Comput. 24, 3 (2012), 416–432.
- Barahona F., Junger M., and Reinelt G. 1989. Experiments in quadratic 0–1 programming. Math. Program. 44, 2 (Jul. 1989), 127–137.
- Beasley J. E. 1998. Heuristic Algorithms for the Unconstrained Binary Quadratic Programming Problem. Technical Report, Management School, Imperial College.
- Bellmore Mandell and Ratliff H. Donald. 1971. Set covering and involutory bases. Manage. Sci. 18, 3 (1971), 194–206.
- Berry M. W., Dumais S. T., and O'Brien G. W. 1995. Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 4 (1995), 573–595.
- Bingham Ella, Kaban Ata, and Fortelius Mikael. 2009. The aspect Bernoulli model: Multiple causes of presences and absences. Pattern Anal. Appl. 12, 1 (Jan. 2009), 55–78.
- Bradley P. S., Fayyad Usama M., and Mangasarian O. L. 1999. Mathematical programming for data mining: Formulations and challenges. INFORMS J. Comput. 11, 3 (1999), 217–238.
- Burdick Douglas, Calimlim Manuel, and Gehrke Johannes. 2001. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the 17th International Conference on Data Engineering. IEEE Computer Society, 443–452.
- Caprara Alberto, Fischetti Matteo, and Toth Paolo. 1999. A heuristic method for the set covering problem. Operat. Res. 47, 5 (1999), 730–743.
- Chvatal V. 1979. A greedy heuristic for the set-covering problem. Math. Operat. Res. 4, 3 (1979), 233–235.
- Dawande M., Keskinocak P., and Tayur S. 1997. On the biclique problem in bipartite graphs. GSIA Working Paper 1996-04, Carnegie Mellon University.
- Dellarocas Chrysanthos. 2003. The digitization of word of mouth: Promise and challenges of online feedback mechanisms. Manage. Sci. 49, 10 (2003), 1407–1424.
- Ene Alina, Horne William, Milosavljevic Nikola, Rao Prasad, Schreiber Robert, and Tarjan Robert E. 2008. Fast exact and heuristic methods for role minimization problems. In Proceedings of the ACM Symposium on Access Control Models and Technologies (SACMAT'08). 1–10.
- Ferraiolo David F., Sandhu Ravi, Gavrila Serban, Kuhn D. Richard, and Chandramouli Ramaswamy. 2001. Proposed NIST standard for role-based access control. ACM Trans. Inf. Syst. Secur. 4, 3 (Aug. 2001), 224–274.
- Frank Mario, Streich Andreas P., Basin David, and Buhmann Joachim M. 2012. Multi-assignment clustering for Boolean data. J. Mach. Learn. Res. 13, 1 (Feb. 2012), 459–489.
- Gao Y. and Church G. 2005. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics 21, 21 (2005), 3970–3975.
- Geerts Floris, Goethals Bart, and Mielikainen Taneli. 2004. Tiling databases. In Discovery Science. Springer, 278–289.
- Gligorijević Vladimir, Panagakis Yannis, and Zafeiriou Stefanos. 2018. Non-negative matrix factorizations for multiplex network analysis. IEEE Trans. Pattern Anal. Mach. Intell. 41, 4 (2018), 928–940.
- Glover Fred, Kochenberger Gary A., and Alidaee Bahram. 1998. Adaptive memory tabu search for binary quadratic programs. Manage. Sci. 44, 3 (1998), 336–345.
- Golub G. H. and Van Loan C. F. 1996. Matrix Computations (3rd ed.). The Johns Hopkins University Press, Baltimore, MD.
- Gupta Shalmoli, Kumar Ravi, Lu Kefu, Moseley Benjamin, and Vassilvitskii Sergei. 2017. Local search methods for k-means with outliers. Proc. VLDB Endow. 10, 7 (2017), 757–768.
- Haeffele Benjamin David and Vidal René. 2019. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 14, 8 (2019).
- Hochbaum Dorit S., Megiddo Nimrod, Naor Joseph, and Tamir Arie. 1993. Tight bounds and 2-approximation algorithms for integer programs with two variables per inequality. Math. Program. 62 (1993), 69–83.
- Ishibuchi Hisao and Nojima Yusuke. 2007. Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int. J. Approx. Reason. 44, 1 (2007), 4–31.
- Kabán Ata and Bingham Ella. 2008. Factorisation and denoising of 0–1 data: A variational approach. Neurocomputing 71, 10–12 (Jun. 2008), 2291–2308.
- Kolda Tamara G. and O'Leary Dianne P. 2000. Algorithm 805: Computation and uses of the semidiscrete matrix decomposition. ACM Trans. Math. Softw. 26, 3 (2000), 415–435.
- Koyutürk Mehmet and Grama Ananth. 2003. PROXIMUS: A framework for analyzing very high dimensional discrete-attributed datasets. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03). 147–156.
- Koyuturk Mehmet, Grama Ananth, and Ramakrishnan Naren. 2005. Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Trans. Knowl. Data Eng. 17, 4 (2005), 447–461.
- Kuhlmann Martin, Shohat Dalia, and Schimpf Gerhard. 2003. Role mining—Revealing business roles for security administration using data mining technology. In Proceedings of the 8th ACM Symposium on Access Control Models and Technologies (SACMAT'03). ACM, New York, NY, 179–186. DOI: 10.1145/775412.775435
- Kvalseth Tarald O. 1987. Entropy and correlation: Some comments. IEEE Trans. Syst. Man Cybernet. 17, 3 (May 1987), 517–519.
- Lang Ken. 1995. NewsWeeder: Learning to filter netnews. In Proceedings of the 12th International Machine Learning Conference.
- Lee D. D. and Seung H. S. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999), 788–791.
- Li Tao. 2005. A general model for clustering binary data. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD'05). ACM, New York, NY, 188–197.
- Liberty Edo, Sriharsha Ram, and Sviridenko Maxim. 2016. An algorithm for online k-means clustering. In Proceedings of the 18th Workshop on Algorithm Engineering and Experiments (ALENEX'16). SIAM, 81–89.
- Lu Haibing, Vaidya Jaideep, and Atluri Vijayalakshmi. 2014. An optimization framework for role mining. J. Comput. Secur. 22, 1 (Jan. 2014), 1–31.
- Lu Haibing, Vaidya Jaideep, Atluri Vijayalakshmi, and Hong Yuan. 2012. Constraint-aware role mining via extended Boolean matrix decomposition. IEEE Trans. Depend. Sec. Comput. 9, 5 (Sep.–Oct. 2012), 655–669.
- Lu Haibing, Vaidya Jaideep, Atluri Vijayalakshmi, Shin Heechang, and Jiang Lili. 2011. Weighted rank-one binary matrix factorization. In Proceedings of the SIAM International Conference on Data Mining (SDM'11). 283–294.
- Manning Christopher D., Raghavan Prabhakar, and Schütze Hinrich. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY.
- Manyika James, Chui Michael, Brown Brad, Bughin Jacques, Dobbs Richard, Roxburgh Charles, and Byers Angela Hung. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. Technical Report, McKinsey Global Institute.
- Merz Peter and Freisleben Bernd. 1999. Genetic algorithms for binary quadratic programming. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'99). Morgan Kaufmann, 417–424.
- Merz Peter and Freisleben Bernd. 2002. Greedy and local search heuristics for unconstrained binary quadratic programming. J. Heurist. 8, 2 (Mar. 2002), 197–213.
- Miettinen Pauli, Mielikainen Taneli, Gionis Aristides, Das Gautam, and Mannila Heikki. 2008. The discrete basis problem. IEEE Trans. Knowl. Data Eng. 20, 10 (2008), 1348–1362.
- Peeters René. 2003. The maximum edge biclique problem is NP-complete. Discr. Appl. Math. 131, 3 (2003), 651–654.
- Schein Andrew I., Saul Lawrence K., and Ungar Lyle H. 2003. A generalized linear model for principal component analysis of binary data. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics.
- Sculley D. 2010. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web (WWW'10). ACM, New York, NY, 1177–1178.
- Shahnaz Farial, Berry Michael W., Pauca V. Paul, and Plemmons Robert J. 2006. Document clustering using nonnegative matrix factorization. Inf. Process. Manage. 42, 2 (Mar. 2006), 373–386.
- Shen Bao-Hong, Ji Shuiwang, and Ye Jieping. 2009. Mining discrete patterns via binary matrix factorization. In Proceedings of the ACM Knowledge Discovery and Data Mining Conference (KDD'09). 757–766.
- Shenoy Pradeep, Haritsa Jayant R., Sudarshan S., Bhalotia Gaurav, Bawa Mayank, and Shah Devavrat. 2000. Turbo-charging vertical mining of large databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00). ACM, New York, NY, 22–33.
- Singliar Tomáš and Hauskrecht Miloš. 2006. Noisy-OR component analysis and its application to link analysis. J. Mach. Learn. Res. 7 (Dec. 2006), 2189–2213.
- Strehl Alexander and Ghosh Joydeep. 2003. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS J. Comput. 15, 2 (2003), 208–230.
- Tan Pang-Ning, Steinbach Michael, and Kumar Vipin. 2005. Introduction to Data Mining (1st ed.). Addison-Wesley Longman, Boston, MA.
- Vaidya Jaideep, Atluri Vijayalakshmi, and Guo Qi. 2007. The role mining problem: Finding a minimal descriptive set of roles. In Proceedings of the ACM Symposium on Access Control Models and Technologies (SACMAT'07). 175–184.
- Vellido Alfredo, Martín-Guerrero José David, and Lisboa Paulo J. G. 2012. Making machine learning models interpretable. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN'12), Vol. 12. 163–172.
- Zhang Zhongyuan, Li Tao, Ding Chris, and Zhang Xiangsun. 2007. Binary matrix factorization with applications. In Proceedings of the 7th IEEE International Conference on Data Mining. 391–400.
- Zhang Zhong-Yuan, Li Tao, Ding Chris, Ren Xian-Wen, and Zhang Xiang-Sun. 2010. Binary matrix factorization for analyzing gene expression data. Data Min. Knowl. Discov. 20, 1 (2010), 28.