The Impact of Data Normalization on KNN Rendering

Hassan I. Abdalla &
Aneela Altaf

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 184)

Included in the following conference series:

International Conference on Advanced Intelligent Systems and Informatics

Abstract

Data normalization is a vital preprocessing technique in which data is scaled or transformed so that all features contribute equally. The success of classifiers such as the K-Nearest Neighbors (KNN) algorithm depends heavily on data quality if the resulting classification models are to generalize. KNN, in turn, is among the simplest and most widely used models for machine learning tasks, including text classification, pattern recognition, plagiarism and intrusion detection, ranking models, and sentiment analysis. While the core of KNN is built on similarity measures, its performance is also highly contingent on the nature and representation of the data. It is commonly accepted in the literature that data must be normalized to secure competitive performance with KNN. This raises a key question: which normalization method leads to the best performance? To answer it, this work investigates data normalization with KNN, a topic that has so far received little attention. We provide a comparative study of the impact of data normalization on KNN performance using six normalization methods, namely Decimal, L2-Norm, Max/Min, Std Norm, TFIDF, and BoW. On eight publicly available datasets, experimental results show that no method dominates the others. However, the L2-Norm, Decimal, and TFIDF methods obtained the best performance on most evaluation metrics (accuracy, precision, and recall). Moreover, run-time analysis shows that KNN runs most efficiently with BoW, followed by TFIDF.
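As a rough illustration of why scaling matters for a distance-based classifier, the sketch below implements textbook versions of four of the numeric methods named in the abstract (Decimal, Max/Min, Std Norm, L2-Norm) in plain Python. These formulations, the function names, and the toy data are assumptions made for illustration; the chapter's exact definitions may differ, and TFIDF and BoW are omitted because they are text representations rather than numeric scalers.

```python
import math

def decimal_scaling(column):
    """Divide by 10**j, with j the smallest integer such that max|v| / 10**j < 1."""
    max_abs = max(abs(v) for v in column)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in column]

def min_max(column):
    """Rescale a feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]

def z_score(column):
    """Standardize a feature column to zero mean and unit (population) variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std if std else 0.0 for v in column]

def l2_normalize(row):
    """Scale one sample (a row) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm if norm else 0.0 for v in row]

def scale_columns(rows, scaler):
    """Apply a column-wise scaler to every feature of a row-major dataset."""
    columns = [scaler(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]

def nearest_index(rows, query_idx):
    """Index of the row closest (Euclidean distance) to rows[query_idx]."""
    query = rows[query_idx]
    candidates = [i for i in range(len(rows)) if i != query_idx]
    return min(candidates, key=lambda i: math.dist(rows[i], query))

# Hypothetical toy data: feature 0 spans 0..10, feature 1 spans 0..10000.
rows = [
    [1.0, 3000.0],
    [9.0, 2000.0],
    [0.0, 0.0],
    [10.0, 10000.0],
    [2.0, 2100.0],   # treated as the query point
]

# Unscaled, the large-range feature dominates the distance.
print(nearest_index(rows, 4))                           # -> 1
# After Max/Min scaling both features contribute comparably.
print(nearest_index(scale_columns(rows, min_max), 4))   # -> 0
```

In this toy case the nearest neighbour of the query changes once both features share the same range, which is the effect the abstract refers to when it says features should make an equal contribution. Note that the first three functions operate on feature columns, whereas `l2_normalize` operates on individual samples.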

Data Availability

The datasets used in this work are publicly available.

Notes

1. https://stats.stackexchange.com/questions/287425/why-do-you-need-to-scale-data-in-knn

Acknowledgement

The authors would like to thank the Research Office of Zayed University for providing the facilities needed to accomplish this work. This research has been supported by the Research Incentive Fund (RIF), Grant Activity Code R22083, Zayed University, UAE.

Funding

This research has been supported by the Research Incentive Fund (RIF), Grant Activity Code R22083, Zayed University, UAE.

Author information

Authors and Affiliations

College of Technological Innovation, Zayed University, Dubai, United Arab Emirates
Hassan I. Abdalla & Aneela Altaf

Authors

Hassan I. Abdalla
Aneela Altaf

Contributions

Both authors are key contributors to the conception and design, the implementation of the approach, the analysis of the results of all experiments, and the preparation, writing, and revision of the manuscript.

Corresponding author

Correspondence to Hassan I. Abdalla.

Editor information

Editors and Affiliations

Faculty of Computers and Information, Cairo University, Giza, Egypt
AboulElla Hassanien
Port Said University, Port Fouad City, Egypt
Rawya Y. Rizk
Department of Operations Research and Statistics, The University of Belgrade, Faculty of Organizational Sciences, Belgrade, Serbia
Dragan Pamucar
Faculty of Science, Helwan University, Cairo, Egypt
Ashraf Darwish
Shulin District, Fujian University of Technology, New Taipei, Taiwan
Kuo-Chi Chang

Ethics declarations

The authors declare that they have no competing interests.

About this paper

Cite this paper

Abdalla, H.I., Altaf, A. (2023). The Impact of Data Normalization on KNN Rendering. In: Hassanien, A., Rizk, R.Y., Pamucar, D., Darwish, A., Chang, KC. (eds) Proceedings of the 9th International Conference on Advanced Intelligent Systems and Informatics 2023. AISI 2023. Lecture Notes on Data Engineering and Communications Technologies, vol 184. Springer, Cham. https://doi.org/10.1007/978-3-031-43247-7_16

DOI: https://doi.org/10.1007/978-3-031-43247-7_16
Published: 18 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43246-0
Online ISBN: 978-3-031-43247-7
eBook Packages: Intelligent Technologies and Robotics (R0)
