DOI: 10.1145/3132847.3132994

Selective Value Coupling Learning for Detecting Outliers in High-Dimensional Categorical Data

Published: 06 November 2017

Abstract

This paper introduces a novel framework, SelectVC, and its instance, POP, for learning selective value couplings (i.e., interactions between the full value set and a set of outlying values) to identify outliers in high-dimensional categorical data. Existing outlier detection methods work on the full data space or on feature subspaces that are identified independently of the subsequent outlier scoring. As a result, they are severely challenged by the overwhelming number of irrelevant features in high-dimensional data, both because of the noise those features introduce and because of the huge search space. In contrast, SelectVC works on a clean and condensed data space spanned by selective value couplings, which it obtains by jointly optimizing outlying value selection and value outlierness scoring. Its instance POP defines a value outlierness scoring function by modeling a partial outlierness propagation process to capture the selective value couplings. POP further defines a top-k outlying value selection method to ensure scalability to the huge search space. We show that POP (i) significantly outperforms five state-of-the-art full space- or subspace-based outlier detectors and their combinations with three feature selection methods on 12 real-world high-dimensional data sets with different levels of irrelevant features; and (ii) obtains good scalability, stable performance w.r.t. k, and a fast convergence rate.
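
The abstract gives no formulas, so the following is only a toy sketch of the general idea it describes: alternating between selecting the top-k outlying values and re-scoring every value using outlierness propagated from the selected values alone. It is not the paper's POP. The function names, the frequency-based initial score, the co-occurrence weighting, and the `damping` parameter are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def toy_selective_value_scores(rows, k=2, iters=10, damping=0.85):
    """Toy alternation of top-k value selection and value scoring.

    Hypothetical illustration only; every formula here is an assumption,
    not the scoring function defined in the paper.
    """
    n = len(rows)
    # A "value" is a (feature index, category) pair.
    freq = Counter((j, v) for row in rows for j, v in enumerate(row))
    # Symmetric co-occurrence counts between values of the same object.
    co = Counter()
    for row in rows:
        for a, b in combinations(list(enumerate(row)), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    values = sorted(freq)
    # Assumed initial outlierness: rarer values score higher.
    base = {v: 1.0 - freq[v] / n for v in values}
    score = dict(base)
    for _ in range(iters):
        # Selection step: keep the k currently most outlying values.
        selected = sorted(values, key=lambda v: (-score[v], v))[:k]
        new_score = {}
        for v in values:
            # Scoring step: v receives outlierness only from selected
            # values, weighted by the conditional frequency P(u | v).
            received = sum(
                score[u] * co[(v, u)] / freq[v] for u in selected if u != v
            )
            new_score[v] = (1 - damping) * base[v] + damping * received / k
        score = new_score
    return score


def object_scores(rows, value_score):
    """Score each object as the sum of the outlierness of its values."""
    return [sum(value_score[(j, v)] for j, v in enumerate(row)) for row in rows]


if __name__ == "__main__":
    # Nine identical objects and one object with two rare values.
    rows = [("a", "x", "p")] * 9 + [("b", "y", "p")]
    value_score = toy_selective_value_scores(rows, k=2)
    scores = object_scores(rows, value_score)
    print(max(range(len(rows)), key=lambda i: scores[i]))
```

In this sketch, `k` caps how many values may propagate outlierness in each round, which keeps the per-iteration cost proportional to the number of values times `k` rather than to all value pairs.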

Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017
2604 pages
ISBN: 9781450349185
DOI: 10.1145/3132847
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. categorical data
  2. coupling learning
  3. feature selection
  4. high-dimensional data
  5. outlier detection

Qualifiers

  • Research-article

Funding Sources

  • Australian Research Council

Conference

CIKM '17

Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2023) Detecting Outliers in Non-IID Data: A Systematic Literature Review. IEEE Access 11, 70333-70352. DOI: 10.1109/ACCESS.2023.3294096. Online publication date: 2023.
  • (2023) Feature selection considering interaction, redundancy and complementarity for outlier detection in categorical data. Knowledge-Based Systems 275:C. DOI: 10.1016/j.knosys.2023.110678. Online publication date: 5-Sep-2023.
  • (2022) Learning to Classify With Incremental New Class. IEEE Transactions on Neural Networks and Learning Systems 33:6, 2429-2443. DOI: 10.1109/TNNLS.2021.3104882. Online publication date: Jun-2022.
  • (2022) An effective strategy for churn prediction and customer profiling. Data & Knowledge Engineering 142, 102100. DOI: 10.1016/j.datak.2022.102100. Online publication date: Nov-2022.
  • (2022) A survey on machine learning methods for churn prediction. International Journal of Data Science and Analytics 14:3, 217-242. DOI: 10.1007/s41060-022-00312-5. Online publication date: 1-Mar-2022.
  • (2022) A density estimation approach for detecting and explaining exceptional values in categorical data. Applied Intelligence 52:15, 17534-17556. DOI: 10.1007/s10489-022-03271-3. Online publication date: 2-Apr-2022.
  • (2022) Factor analysis of mixed data for anomaly detection. Statistical Analysis and Data Mining: The ASA Data Science Journal 15:4, 480-493. DOI: 10.1002/sam.11585. Online publication date: 2-May-2022.
  • (2021) Implanting Domain Knowledge into Feature Selection for Effective Outlier Detection in Network Traffic Data. 2021 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/IOP/SCI), 115-122. DOI: 10.1109/SWC50871.2021.00025. Online publication date: Oct-2021.
  • (2021) Entropy and Autoencoder-Based Outlier Detection in Mixed-Type Network Traffic Data. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 501-508. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00075. Online publication date: Sep-2021.
  • (2019) Anomaly Detection Algorithm Based on Subspace Local Density Estimation. International Journal of Web Services Research 16:3, 44-58. DOI: 10.4018/IJWSR.2019070103. Online publication date: 1-Jul-2019.
