Abstract
Distance concentration is a phantom menace for the labeling of high dimensional data by distance-based classifiers. Filter methods reduce data dimensionality, but they also add their ranking bias indirectly into the classification procedure. In this study, we examine the filtering problem from another perspective, in which multiple filters are aggregated according to classifiers’ constraints. Our approach, named S-Filter, is designed as a top-k skyline (k-skyband) search over multiple rankings by relying on the concept of \(\mathcal {F}\)–dominance for weighted and monotone linear functions. Unlike existing approaches, S-Filter provides a deterministic strategy for joining multiple filters and avoids the semantic problem of breaking top-k ties. S-Filter’s first stage uses labeling-driven measures, e.g., F1-Score, for assessing the quality of each filter with regards to a particular classifier, whereas range-tolerance intervals around the initial quality measures define the partial search weights. Next, S-Filter applies the FSA instance-optimal algorithm for selecting all the dimensions that can be among the top-k for a weight within the range-tolerance intervals. Experiments on high dimensional datasets show that S-Filter outperforms state-of-the-art filters in two scenarios: (i) exploratory analysis on varying k and range-tolerance intervals, and (ii) data reduction to its intrinsic dimensionality.
The authors thank the National Council for Scientific and Technological Development and Faperj (G. E-26/203.215/2016 and I. Sed. 2018) for their financial support.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
The most relevant dimensions for a particular set of points are the most prominent data features. Accordingly, we use the terms dimensions and features alternately.
- 2.
References
Aggarwal, C.C.: Data Mining: The Textbook. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-14142-8
Ciaccia, P., Martinenghi, D.: Reconciling skyline and ranking queries. PVLDB 10(11), 1454–1465 (2017)
Ciaccia, P., Martinenghi, D.: \(FA + TA < FSA\): flexible score aggregation. In: CIKM, pp. 57–66. ACM (2018)
Drotár, P., Gazda, M., Vokorokos, L.: Ensemble feature selection using election methods and ranker clustering. Inf. Sci. 480, 365–380 (2019)
Fagin, R.: Combining fuzzy information from multiple systems. In: PODS, pp. 216–226 (1996)
Fagin, R., Kumar, R., Sivakumar, D.: Efficient similarity search and classification via rank aggregation. In: SIGMOD, pp. 301–312. ACM (2003)
James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning, vol. 112. Springer, New York (2013). https://doi.org/10.1007/978-1-4614-7138-7
Navarro, G., Paredes, R., Reyes, N., Bustos, C.: An empirical evaluation of intrinsic dimension estimators. Inf. Syst. 64, 206–218 (2017)
Pestov, V.: An axiomatic approach to intrinsic dimension of a dataset. Neural Netw. 21(2–3), 204–213 (2008)
Pestov, V.: Lower bounds on performance of metric tree indexing schemes for exact similarity search in high dimensions. Algorithmica 66(2), 310–328 (2013)
Roffo, G., Melzi, S., Castellani, U., Vinciarelli, A.: Infinite latent feature selection: a probabilistic latent graph-based ranking approach. In: CVPR (2017)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Bedo, M., Ciaccia, P., Martinenghi, D., de Oliveira, D. (2019). A k–Skyband Approach for Feature Selection. In: Amato, G., Gennaro, C., Oria, V., Radovanović , M. (eds) Similarity Search and Applications. SISAP 2019. Lecture Notes in Computer Science(), vol 11807. Springer, Cham. https://doi.org/10.1007/978-3-030-32047-8_15
Download citation
DOI: https://doi.org/10.1007/978-3-030-32047-8_15
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32046-1
Online ISBN: 978-3-030-32047-8
eBook Packages: Computer ScienceComputer Science (R0)