Data Preprocessing

Machine Learning for Practical Decision Making

Abstract

Data preprocessing, also known as data preparation, is the practice of cleaning, altering, and reorganizing raw data prior to processing and analysis [1]. It is an important step that usually entails reformatting, adjusting, and integrating datasets to improve the information they contain. Although data preprocessing can be an onerous task, it is a necessary precondition for putting data into context and reducing the possibility of bias [2]. An Aberdeen Group study describes data preprocessing as any activity undertaken to improve the quality, usability, accessibility, and portability of data [3]. In a poll published in Forbes, data scientists reported spending 60% of their time on data preprocessing (Fig. 4.1).
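
The activities named above (cleaning, reformatting, and integrating datasets) can be illustrated with a minimal sketch. It is not taken from the chapter: the records, field names, accepted date layouts, and helper functions below are all hypothetical, and only the Python standard library is assumed.

```python
from datetime import datetime

# Hypothetical raw records; every field name and value is made up for illustration.
raw_visits = [
    {"patient_id": " 001", "visit_date": "2022-03-17", "weight_kg": "70.5"},
    {"patient_id": "002", "visit_date": "17/03/2022", "weight_kg": ""},
    {"patient_id": "001 ", "visit_date": "2022-03-17", "weight_kg": "70.5"},  # duplicate
]
raw_patients = [
    {"patient_id": "001", "age": "34"},
    {"patient_id": "002", "age": "51"},
]


def parse_date(text):
    """Reformat: accept two assumed date layouts and return an ISO date string."""
    for layout in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, layout).date().isoformat()
        except ValueError:
            continue
    return None  # unrecognised layout is treated as missing


def clean_visit(row):
    """Clean: trim identifiers, normalise dates, map empty strings to None."""
    weight = row["weight_kg"].strip()
    return {
        "patient_id": row["patient_id"].strip(),
        "visit_date": parse_date(row["visit_date"].strip()),
        "weight_kg": float(weight) if weight else None,
    }


def preprocess(visits, patients):
    """Clean each visit, drop duplicates, and integrate the patient table."""
    ages = {p["patient_id"].strip(): int(p["age"]) for p in patients}
    seen = set()
    result = []
    for row in map(clean_visit, visits):
        key = (row["patient_id"], row["visit_date"])
        if key in seen:  # same patient and date already kept
            continue
        seen.add(key)
        row["age"] = ages.get(row["patient_id"])  # integrate: join on patient_id
        result.append(row)
    return result


prepared = preprocess(raw_visits, raw_patients)
for row in prepared:
    print(row)
```

In this sketch a missing weight is kept as `None` rather than imputed, since the choice of how to handle missing values depends on the analysis that follows.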

References

  1. S. Bandgar, Data Preparation in Data Science. Analytics Vidhya. https://medium.com/analytics-vidhya/data-preparation-in-data-science-16f9311760. Accessed 17 March 2022

  2. B.C. Boehmke, Data Wrangling with R (Use R!) (Springer International Publishing, 2016)

  3. H. Jamshed, M.S.A. Khan, M. Khurram, S. Inayatullah, S. Athar, Data preprocessing: A preliminary step for web data mining. 3C Tecnología_Glosas de innovación aplicadas a la pyme 8(1), 206–221 (2019)

  4. G. Press, Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Forbes. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#1594bda36f63. Accessed 22 June 2022

  5. C. Stedman, E. Burns, M. K. Pratt, What Is Data Preparation? An In-Depth Guide to Data Prep. TechTarget. https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation. Accessed 18 March 2022

  6. D. Pyle, Data Preparation for Data Mining (ITPro Collection) (Elsevier Science, 1999)

  7. M. Spruit, T. Dedding, D. Vijlbrief, Self-service data science for healthcare professionals: A data preparation approach, in Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020) – Volume 5: HEALTHINF, (SciTePress, Valletta, 2020), pp. 724–734

  8. J. Brownlee, Data Preparation for Machine Learning: Data Cleaning, Feature Selection, and Data Transforms in Python (Machine Learning Mastery, 2020)

  9. Talend.com. What Is Data Preparation? Talend.com. https://www.talend.com/resources/what-is-data-preparation/. Accessed 8 May 2022

  10. V. Kumar, S. Minz, Feature selection: A literature review. Smart Comput. Rev. 4, 211–229 (2014)

  11. Datameer, Data Preparation. Datameer. https://www.datameer.com/data-preparation/. Accessed 16 March 2022

  12. F. Nargesian, H. Samulowitz, U. Khurana, E. B. Khalil, D. Turaga, Learning feature engineering for classification. 2017, pp. 2529–2535, [Online]. Available: https://doi.org/10.24963/ijcai.2017/352

  13. J. Heaton, An empirical analysis of feature engineering for predictive modeling, in SoutheastCon 2016, 30 March–3 April 2016, pp. 1–6, https://doi.org/10.1109/SECON.2016.7506650

  14. M. Kuhn, K. Johnson, Feature Engineering and Selection: A Practical Approach for Predictive Models (CRC Press, 2019)

  15. M. Anderson, et al., Brainwash: A Data System for Feature Engineering, 21 Nov 2012

  16. T. Bock, What Is Feature Engineering? Displayr. https://www.displayr.com/what-is-feature-engineering/. Accessed 18 March 2022

  17. H. Patel, What is Feature Engineering — Importance, Tools and Techniques for Machine Learning. Medium. https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10. Accessed 16 March 2022

  18. E. Rencberoglu, Fundamental Techniques of Feature Engineering for Machine Learning. Medium. https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114. Accessed 18 March 2022

  19. R. Chopra, A. England, M.N. Alaudeen, Data Science with Python: Combine Python with Machine Learning Principles to Discover Hidden Patterns in Raw Data (Packt Publishing, 2019)

  20. S. Raschka, Python Machine Learning (Packt Publishing, 2015)

  21. J. Brownlee, Machine Learning Mastery with Weka: Analyze Data, Develop Models, and Work Through Projects (Machine Learning Mastery, 2016)

  22. A. Burkov, The Hundred-Page Machine Learning Book (Andriy Burkov, 2019)

  23. X. Ying, An Overview of Overfitting and its Solutions. J. Phys. Conf. Ser. 1168, 022022 (Feb 2019). https://doi.org/10.1088/1742-6596/1168/2/022022

  24. J.A. Cook, J. Ranstam, Overfitting. Br. J. Surg. 103(13), 1814–1814 (2016). https://doi.org/10.1002/bjs.10244

  25. D.M. Hawkins, The problem of overfitting. J. Chem. Inf. Comput. Sci. 44(1), 1–12 (1 Jan 2004). https://doi.org/10.1021/ci0342472

  26. A. Twin, How Overfitting Works. Investopedia. https://www.investopedia.com/terms/o/overfitting.asp#:~:text=Overfitting%20is%20a%20modeling%20error. Accessed 18 March 2022

  27. IBM Cloud Education. What Is Overfitting? www.ibm.com. https://www.ibm.com/cloud/learn/overfitting. Accessed 18 March 2022

  28. J. Brownlee, Overfitting and Underfitting with Machine Learning Algorithms. Machine Learning Mastery. https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/. Accessed 17 March 2022

  29. H. K. Jabbar, R. Z. Khan, Methods to avoid over-fitting and under-fitting in supervised machine learning (comparative study). (2014)

  30. F. Sidi, P.H.S. Panah, L.S. Affendey, M.A. Jabar, H. Ibrahim, A. Mustapha, Data quality: A survey of data quality dimensions, in 2012 International Conference on Information Retrieval & Knowledge Management, (2012), pp. 300–304

  31. R.Y. Wang, D.M. Strong, Beyond accuracy: What data quality means to data consumers. J. Manag. Inf. Syst. 12(4), 5–33 (1 March 1996). https://doi.org/10.1080/07421222.1996.11518099

  32. D. Firmani, M. Mecella, M. Scannapieco, C. Batini, On the meaningfulness of “big data quality” (invited paper). Data Sci. Eng. 1(1), 6–20 (2016). https://doi.org/10.1007/s41019-015-0004-7

  33. T.C. Redman, Data Quality for the Information Age (Artech House, 1996)

  34. P. Costabel, V. d. Carmen, Data freshness and data accuracy: A state of the art. (2006)

  35. B. Shin, An exploratory investigation of system success factors in data warehousing. J. AIS 4, 0 (1 Jan 2003). https://doi.org/10.17705/1jais.00033

  36. L. Cai, Y. Zhu, The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 14, 2 (22 May 2015). https://doi.org/10.5334/dsj-2015-002

  37. A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (O’Reilly Media, 2019)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

El Morr, C., Jammal, M., Ali-Hassan, H., El-Hallak, W. (2022). Data Preprocessing. In: Machine Learning for Practical Decision Making. International Series in Operations Research & Management Science, vol 334. Springer, Cham. https://doi.org/10.1007/978-3-031-16990-8_4
