Abstract
Preprocessing, also known as data preparation, is the practice of cleaning, altering, and reorganizing raw data prior to processing and analysis [1]. It is an important step that usually entails reformatting, adjusting, and integrating datasets to improve the information they contain. Although data preprocessing can be onerous, it is a necessary precondition for putting data into context and reducing the possibility of bias [2]. An Aberdeen Group study defines data preprocessing as any activity undertaken to improve the quality, usability, accessibility, and portability of data [3]. In a poll published in Forbes, data scientists reported spending 60% of their time on data preprocessing (Fig. 4.1).
References
S. Bandgar, Data Preparation in Data Science. Analytics Vidhya. https://medium.com/analytics-vidhya/data-preparation-in-data-science-16f9311760. Accessed 17 March 2022
B.C. Boehmke, Data Wrangling with R (Use R!) (Springer International Publishing, 2016)
H. Jamshed, M.S.A. Khan, M. Khurram, S. Inayatullah, S. Athar, Data preprocessing: A preliminary step for web data mining. 3C Tecnología_Glosas de innovación aplicadas a la pyme 8(1), 206–221 (2019)
G. Press, Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Forbes. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#1594bda36f63. Accessed 22 June 2022
C. Stedman, E. Burns, M.K. Pratt, What Is Data Preparation? An In-Depth Guide to Data Prep. TechTarget. https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation. Accessed 18 March 2022
D. Pyle, Data Preparation for Data Mining (ITPro Collection) (Elsevier Science, 1999)
M. Spruit, T. Dedding, D. Vijlbrief, Self-service data science for healthcare professionals: A data preparation approach, in Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020) – Volume 5: HEALTHINF (SciTePress, Valletta, 2020), pp. 724–734
J. Brownlee, Data Preparation for Machine Learning: Data Cleaning, Feature Selection, and Data Transforms in Python (Machine Learning Mastery, 2020)
Talend.com. What Is Data Preparation? Talend.com. https://www.talend.com/resources/what-is-data-preparation/. Accessed 8 May 2022
V. Kumar, S. Minz, Feature selection: A literature review. Smart Comput. Rev. 4, 211–229 (2014)
Datameer, Data Preparation. Datameer. https://www.datameer.com/data-preparation/. Accessed 16 March 2022
F. Nargesian, H. Samulowitz, U. Khurana, E. B. Khalil, D. Turaga, Learning feature engineering for classification. 2017, pp. 2529–2535, [Online]. Available: https://doi.org/10.24963/ijcai.2017/352
J. Heaton, An empirical analysis of feature engineering for predictive modeling, in SoutheastCon 2016, 30 March–3 April 2016, pp. 1–6, https://doi.org/10.1109/SECON.2016.7506650
M. Kuhn, K. Johnson, Feature Engineering and Selection: A Practical Approach for Predictive Models (CRC Press, 2019)
M. Anderson, et al., Brainwash: A Data System for Feature Engineering, 21 Nov 2012
T. Bock, What is feature Engineering? Displayr. https://www.displayr.com/what-is-feature-engineering/. Accessed 18 March 2022
H. Patel, What is Feature Engineering — Importance, Tools and Techniques for Machine Learning. Medium. https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10. Accessed 16 March 2022
E. Rencberoglu, Fundamental Techniques of Feature Engineering for Machine Learning. Medium. https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114. Accessed 18 March 2022
R. Chopra, A. England, M.N. Alaudeen, Data Science with Python: Combine Python with Machine Learning Principles to Discover Hidden Patterns in Raw Data (Packt Publishing, 2019)
S. Raschka, Python Machine Learning (Packt Publishing, 2015)
J. Brownlee, Machine Learning Mastery with Weka: Analyze Data, Develop Models, and Work Through Projects (Machine Learning Mastery, 2016)
A. Burkov, The Hundred-Page Machine Learning Book (Andriy Burkov, 2019)
X. Ying, An Overview of Overfitting and its Solutions. J. Phys. Conf. Ser. 1168, 022022 (Feb 2019). https://doi.org/10.1088/1742-6596/1168/2/022022
J.A. Cook, J. Ranstam, Overfitting. Br. J. Surg. 103(13), 1814–1814 (2016). https://doi.org/10.1002/bjs.10244
D.M. Hawkins, The problem of overfitting. J. Chem. Inf. Comput. Sci. 44(1), 1–12 (1 Jan 2004). https://doi.org/10.1021/ci0342472
A. Twin, How Overfitting Works. Investopedia. https://www.investopedia.com/terms/o/overfitting.asp#:~:text=Overfitting%20is%20a%20modeling%20error. Accessed 18 March 2022
IBM Cloud Education. What Is Overfitting? www.ibm.com. https://www.ibm.com/cloud/learn/overfitting. Accessed 18 March 2022
J. Brownlee, Overfitting and Underfitting with Machine Learning Algorithms. Machine Learning Mastery. https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/. Accessed 17 March 2022
H. K. Jabbar, R. Z. Khan, Methods to avoid over-fitting and under-fitting in supervised machine learning (comparative study). (2014)
F. Sidi, P.H.S. Panah, L.S. Affendey, M.A. Jabar, H. Ibrahim, A. Mustapha, Data quality: A survey of data quality dimensions, in 2012 International Conference on Information Retrieval & Knowledge Management, (2012), pp. 300–304
R.Y. Wang, D.M. Strong, Beyond accuracy: What data quality means to data consumers. J. Manag. Inf. Syst. 12(4), 5–33 (1 March 1996). https://doi.org/10.1080/07421222.1996.11518099
D. Firmani, M. Mecella, M. Scannapieco, C. Batini, On the meaningfulness of “big data quality” (invited paper). Data Sci. Eng. 1(1), 6–20 (2016). https://doi.org/10.1007/s41019-015-0004-7
T.C. Redman, Data Quality for the Information Age (Artech House, 1996)
P. Costabel, V. d. Carmen, Data freshness and data accuracy: A state of the art. (2006)
B. Shin, An exploratory investigation of system success factors in data warehousing. J. AIS 4, 0 (1 Jan 2003). https://doi.org/10.17705/1jais.00033
L. Cai, Y. Zhu, The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 14, 2 (22 May 2015). https://doi.org/10.5334/dsj-2015-002
A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (O’Reilly Media, 2019)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
El Morr, C., Jammal, M., Ali-Hassan, H., El-Hallak, W. (2022). Data Preprocessing. In: Machine Learning for Practical Decision Making. International Series in Operations Research & Management Science, vol 334. Springer, Cham. https://doi.org/10.1007/978-3-031-16990-8_4
DOI: https://doi.org/10.1007/978-3-031-16990-8_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16989-2
Online ISBN: 978-3-031-16990-8
eBook Packages: Mathematics and Statistics (R0)