[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Algorithms and Applications to Weighted Rank-one Binary Matrix Factorization

Published: 03 May 2020 Publication History

Abstract

Many applications use data that are better represented in the binary matrix form, such as click-stream data, market basket data, document-term data, user-permission data in access control, and others. Matrix factorization methods have been widely used tools for the analysis of high-dimensional data, as they automatically extract sparse and meaningful features from data vectors. However, existing matrix factorization methods do not work well for the binary data. One crucial limitation is interpretability, as many matrix factorization methods decompose an input matrix into matrices with fractional or even negative components, which are hard to interpret in many real settings. Some matrix factorization methods, like binary matrix factorization, do limit decomposed matrices to binary values. However, these models are not flexible to accommodate some data analysis tasks, like trading off summary size with quality and discriminating different types of approximation errors. To address those issues, this article presents weighted rank-one binary matrix factorization, which is to approximate a binary matrix by the product of two binary vectors, with parameters controlling different types of approximation errors. By systematically running weighted rank-one binary matrix factorization, one can effectively perform various binary data analysis tasks, like compression, clustering, and pattern discovery. Theoretical properties on weighted rank-one binary matrix factorization are investigated and its connection to problems in other research domains are examined. As weighted rank-one binary matrix factorization in general is NP-hard, efficient and effective algorithms are presented. Extensive studies on applications of weighted rank-one binary matrix factorization are also conducted.

References

[1]
Harry C. Andrews and Calude L. Patterson. 1976. Singular value decomposition and digital image processing. IEEE Trans. Acoust. Speech Sign. Process. 24, 1 (1976), 26--28.
[2]
Xue Bai, Ram Gopal, Manuel Nunez, and Dmitry Zhdanov. 2012. On the prevention of fraud and privacy exposure in process information flow. INFORMS J. Comput. 24, 3 (2012), 416--432.
[3]
F. Barahona, M. Junger, and G. Reinelt. 1989. Experiments in quadratic 0-1 programming. Math. Program. 44, 2 (Jul. 1989), 127--137.
[4]
J. E. Beasley. 1998. Heuristic Algorithms for the Unconstrained Binary Quadratic Programming Problem. Technical Report. Management School, Imperial College.
[5]
Mandell Bellmore and H. Donald Ratliff. 1971. Set covering and involutory bases. Manage. Sci. 18, 3 (1971), 194--206.
[6]
M. W. Berry, S. T. Dumais, and G. W. O’Brien. 1995. Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 4 (1995), 573--595.
[7]
Ella Bingham, Ata Kaban, and Mikael Fortelius. 2009. The aspect Bernoulli model: Multiple causes of presences and absences. Pattern Anal. Appl. 12, 1 (Jan. 2009), 55--78.
[8]
P. S. Bradley, Usama M. Fayyad, and O. L. Mangasarian. 1999. Mathematical programming for data mining: Formulations and challenges. INFORMS J. Comput. 11, 3 (1999), 217--238.
[9]
Douglas Burdick, Manuel Calimlim, and Johannes Gehrke. 2001. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the 17th International Conference on Data Engineering. IEEE Computer Society, 443--452.
[10]
Alberto Caprara, Matteo Fischetti, and Paolo Toth. 1999. A heuristic method for the set covering problem. Operat. Res. 47, 5 (1999), 730--743.
[11]
V. Chvatal. 1979. A greedy heuristic for the set-covering problem. Math. Operat. Res. 4, 3 (1979), 233--235.
[12]
M. Dawande, P. Keskinocak, and S. Tayur. 1997. On the biclique problem in bipartite graphs. GSIA working paper 1996-04, Carnegie-Mellon University.
[13]
Chrysanthos Dellarocas. 2003. The digitization of word of mouth: Promise and challenges of online feedback mechanisms. Manage. Sci. 49, 10 (2003), 1407--1424.
[14]
Alina Ene, William Horne, Nikola Milosavljevic, Prasad Rao, Robert Schreiber, and Robert E. Tarjan. 2008. Fast exact and heuristic methods for role minimization problems. In Proceedings of the ACM Symposium on Access Control Models and Technologies (SACMAT’08). 1--10.
[15]
David F. Ferraiolo, Ravi Sandhu, Serban Gavrila, D. Richard Kuhn, and Ramaswamy Chandramouli. 2001. Proposed NIST standard for role-based access control. ACM Trans. Inf. Syst. Secur. 4, 3 (Aug. 2001), 224--274.
[16]
Mario Frank, Andreas P. Streich, David Basin, and Joachim M. Buhmann. 2012. Multi-assignment clustering for Boolean data. J. Mach. Learn. Res. 13, 1 (Feb. 2012), 459--489.
[17]
Y. Gao and G. Church. 2005. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics 21, 21 (2005), 3970--3975.
[18]
Floris Geerts, Bart Goethals, and Taneli Mielikainen. 2004. Tiling databases. In Discovery Science. Springer, 278--289.
[19]
Vladimir Gligorijević, Yannis Panagakis, and Stefanos Zafeiriou. 2018. Non-negative matrix factorizations for multiplex network analysis. IEEE Trans. Pattern Anal. Mach. Intell. 41, 4 (2018), 928--940.
[20]
Fred Glover, Gary A. Kochenberger, and Bahram Alidaee. 1998. Adaptive memory tabu search for binary quadratic programs. Manage. Sci. 44, 3 (1998), 336--345.
[21]
G. H. Golub and C. F. Van Loan. 1996. Matrix Computations (3rd ed.). The Johns Hopkins Universtiy Press, Baltimore, MD.
[22]
Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. 2017. Local search methods for k-means with outliers. Proc. VLDB Endow. 10, 7 (2017), 757--768.
[23]
Benjamin David Haeffele and René Vidal. 2019. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 14, 8 (2019).
[24]
Dorit S. Hochbaum, Nimrod Megiddo, Joseph Naor, and Arie Tamir. 1993. Tight bounds and 2-approximation algorithms for integer programs with two variables per inequality. Math. Program. 62 (1993), 69--83.
[25]
Hisao Ishibuchi and Yusuke Nojima. 2007. Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int. J. Approx. Reason. 44, 1 (2007), 4--31.
[26]
Ata Kabán and Ella Bingham. 2008. Factorisation and denoising of 0-1 data: A variational approach. Neurocomputing 71, 10--12 (Jun. 2008), 2291--2308.
[27]
Tamara G. Kolda and Dianne P. O’Leary. 2000. Algorithm 805: Computation and uses of the semidiscrete matrix decomposition. ACM Trans. Math. Softw. 26, 3 (2000), 415--435.
[28]
Mehmet Koyutürk and Ananth Grama. 2003. PROXIMUS: A framework for analyzing very high dimensional discrete-attributed datasets. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’03). 147--156.
[29]
Mehmet Koyuturk, Ananth Grama, and Naren Ramakrishnan. 2005. Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Trans. Knowl. Data Eng. 17, 4 (2005), 447--461.
[30]
Martin Kuhlmann, Dalia Shohat, and Gerhard Schimpf. 2003. Role mining—Revealing business roles for security administration using data mining technology. In Proceedings of the 8th ACM Symposium on Access Control Models and Technologies (SACMAT’03). ACM, New York, NY, 179--186.
[31]
Tarald O. Kvalseth. 1987. Entropy and correlation: Some comments. IEEE Trans. Syst. Man Cybernet. 17, 3 (May 1987), 517--519.
[32]
Ken Lang. 1995. NewsWeeder: Learning to filter netnews. In Proceedings of the 12th International Machine Learning Conference.
[33]
D. D. Lee and H. S. Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999), 788--791.
[34]
Tao Li. 2005. A general model for clustering binary data. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD’05). ACM, New York, NY, 188--197.
[35]
Edo Liberty, Ram Sriharsha, and Maxim Sviridenko. 2016. An algorithm for online k-means clustering. In Proceedings of the 18th Workshop on Algorithm Engineering and Experiments (ALENEX’16). SIAM, 81--89.
[36]
Haibing Lu, Jaideep Vaidya, and Vijayalakshmi Atluri. 2014. An optimization framework for role mining. J. Comput. Secur. 22, 1 (Jan. 2014), 1--31.
[37]
Haibing Lu, J. Vaidya, and V. Atluri. [n.d.] An optimization framework for role mining. (unpublished).
[38]
Haibing Lu, J. Vaidya, V. Atluri, and Yuan Hong. 2012. Constraint-aware role mining via extended Boolean matrix decomposition. IEEE Trans. Depend. Sec. Comput. 9, 5 (Sept.-Oct. 2012), 655--669.
[39]
Haibing Lu, Jaideep Vaidya, Vijayalakshmi Atluri, Heechang Shin, and Lili Jiang. 2011. Weighted rank-one binary matrix factorization. In Proceedings of the SIAM Internation Conference on Data Mining (SDM’11). 283--294.
[40]
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY.
[41]
James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. Technical Report. McKinsey Global Institute.
[42]
Peter Merz and Bernd Freisleben. 1999. Genetic algorithms for binary quadratic programming. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO’99). Morgan Kauffman, 417--424.
[43]
Peter MerZ and Bernd Freisleben. 2002. Greedy and local search heuristics for unconstrained binary quadratic programming. J. Heurist. 8, 2 (Mar. 2002), 197--213.
[44]
Miettinen Pauli, Mielikainen Taneli, Gionis Aristides, Das Gautam, and Mannila Heikki. 2008. The discrete basis problem. IEEE Trans. Knowl. Data Eng. 20, 10 (2008), 1348--1362.
[45]
René Peeters. 2003. The maximum edge biclique problem is NP-complete. Discr. Appl. Math. 131, 3 (2003), 651--654.
[46]
Andrew I. Schein, Lawrence K. Saul, and Lyle H. Ungar. 2003. A generalized linear model for principal component analysis of binary data. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics. 546431.
[47]
D. Sculley. 2010. Web-scale K-means Clustering. In Proceedings of the 19th International Conference on World Wide Web (WWW’10). ACM, New York, NY, 1177--1178.
[48]
Farial Shahnaz, Michael W. Berry, V. Paul Pauca, and Robert J. Plemmons. 2006. Document clustering using nonnegative matrix factorization. Inf. Process. Manage. 42, 2 (Mar. 2006), 373--386.
[49]
Bao-Hong Shen, Shuiwang Ji, and Jieping Ye. 2009. Mining discrete patterns via binary matrix factorization. In Proceedings of the ACM Knowledge Discovery and Data Mining Conference (KDD’09). 757--766.
[50]
Pradeep Shenoy, Jayant R. Haritsa, S. Sudarshan, Gaurav Bhalotia, Mayank Bawa, and Devavrat Shah. 2000. Turbo-charging vertical mining of large databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD’00). ACM, New York, NY, 22--33.
[51]
Tomáš Singliar and Miloš Hauskrecht. 2006. Noisy-OR component analysis and its application to link analysis. J. Mach. Learn. Res. 7 (Dec. 2006), 2189--2213.
[52]
Alexander Strehl and Joydeep Ghosh. 2003. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS J. Comput. 15, 2 (2003), 208--230.
[53]
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. 2005. Introduction to Data Mining (1st ed.). Addison-Wesley Longman, Boston, MA.
[54]
Jaideep Vaidya, Vijayalakshmi Atluri, and Qi Guo. 2007. The role mining problem: Finding a minimal descriptive set of roles. In Proceedings of the ACM Symposium on Access Control Models and Technologies (SACMAT’07). 175--184.
[55]
Alfredo Vellido, José David Martín-Guerrero, and Paulo J. G. Lisboa. 2012. Making machine learning models interpretable. In Proceedings of the European Symposium on Artifical Neural Networks (ESANN’12), Vol. 12. Citeseer, 163--172.
[56]
Zhongyuan Zhang, Tao Li, Chris Ding, and Xiangsun Zhang. 2007. Binary matrix factorization with applications. In Proceedings of the 2007 7th IEEE International Conference on Data Mining. 391--400.
[57]
Zhong-Yuan Zhang, Tao Li, Chris Ding, Xian-Wen Ren, and Xiang-Sun Zhang. 2010. Binary matrix factorization for analyzing gene expression data. Data Min. Knowl. Discov. 20, 1 (2010), 28.

Cited By

View all

Index Terms

  1. Algorithms and Applications to Weighted Rank-one Binary Matrix Factorization

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Transactions on Management Information Systems
      ACM Transactions on Management Information Systems  Volume 11, Issue 2
      Research Commentary
      June 2020
      115 pages
      ISSN:2158-656X
      EISSN:2158-6578
      DOI:10.1145/3398026
      Issue’s Table of Contents
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 03 May 2020
      Accepted: 01 March 2020
      Revised: 01 February 2020
      Received: 01 September 2017
      Published in TMIS Volume 11, Issue 2

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. Discrete data
      2. clustering
      3. compression
      4. pattern discovery

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • NSFC
      • National Institutes of Health
      • State Grid Corporation of China

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)24
      • Downloads (Last 6 weeks)2
      Reflects downloads up to 20 Dec 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Binary matrix factorization via collaborative neurodynamic optimizationNeural Networks10.1016/j.neunet.2024.106348176(106348)Online publication date: Aug-2024
      • (2024)A comprehensive review of model compression techniques in machine learningApplied Intelligence10.1007/s10489-024-05747-w54:22(11804-11844)Online publication date: 1-Nov-2024
      • (2023)Group Recommendation Based on Heterogeneous Graph Algorithm for EBSNsIEEE Access10.1109/ACCESS.2022.322459811(1854-1866)Online publication date: 2023
      • (2022)Generalized Noise Role MiningProceedings of the 27th ACM on Symposium on Access Control Models and Technologies10.1145/3532105.3535024(91-102)Online publication date: 7-Jun-2022
      • (2022)PAS: A Position-Aware Similarity Measurement for Sequential Recommendation2022 International Joint Conference on Neural Networks (IJCNN)10.1109/IJCNN55064.2022.9892267(1-8)Online publication date: 18-Jul-2022
      • (2022)Enrichment Program using PROMETHEE for Decision Support Systems of Prospective Assistance Funds Disabilities2022 1st International Conference on Technology Innovation and Its Applications (ICTIIA)10.1109/ICTIIA54654.2022.9935971(1-5)Online publication date: 23-Sep-2022
      • (2022)Integrated aviation model and metaheuristic algorithm for hub-and-spoke network design and airline fleet planningTransportation Research Part E: Logistics and Transportation Review10.1016/j.tre.2022.102755164(102755)Online publication date: Aug-2022
      • (2022)Operating room scheduling for non-operating room anesthesia with emergency uncertaintyAnnals of Operations Research10.1007/s10479-022-04870-6321:1-2(565-588)Online publication date: 8-Aug-2022
      • (2022)Surgical scheduling by Fuzzy model considering inpatient beds shortage under uncertain surgery durationsAnnals of Operations Research10.1007/s10479-022-04645-z315:1(463-505)Online publication date: 22-Mar-2022

      View Options

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      HTML Format

      View this article in HTML Format.

      HTML Format

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media