research-article

Open access

DQSOps: Data Quality Scoring Operations Framework for Data-Driven Applications

Authors:

Bestoun S. Ahmed,

Anton Engman

EASE '23: Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering

Pages 32 - 41

https://doi.org/10.1145/3593434.3593445

Published: 14 June 2023

Abstract

Data quality assessment has become a prominent component in the successful execution of complex data-driven artificial intelligence (AI) software systems. In practice, real-world applications generate huge volumes of data at high speed. These data streams require analysis and preprocessing before being permanently stored or used in a learning task. Therefore, significant attention has been paid to the systematic management and construction of high-quality datasets. Nevertheless, managing voluminous and high-velocity data streams is usually performed manually (i.e., offline), which makes it an impractical strategy in production environments. To address this challenge, DataOps has emerged to achieve life-cycle automation of data processes using DevOps principles. However, determining data quality on a fitness scale remains a complex task within the DataOps framework. This paper presents a novel Data Quality Scoring Operations (DQSOps) framework that yields a quality score for production data in DataOps workflows. The framework incorporates two scoring approaches: an ML prediction-based approach that predicts the data quality score, and a standard-based approach that periodically produces the ground-truth scores by assessing several data quality dimensions. We deploy the DQSOps framework in a real-world industrial use case. The results show that DQSOps achieves significant computational speedups compared to the conventional approach to data quality scoring while maintaining high prediction performance.
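
The interplay between the two scoring approaches can be illustrated with a minimal sketch. It is not the authors' implementation: the two dimensions (completeness and range validity), their equal weights, the valid range, the `period` parameter, and all function and class names are assumptions made for illustration. The "predictor" is a deliberately trivial stand-in (the mean of past ground-truth scores), not a trained ML model; it only shows where a learned model would slot in between periodic standard-based evaluations.

```python
from statistics import mean

# Hypothetical dimensions and weights, chosen for illustration only.
WEIGHTS = {"completeness": 0.5, "validity": 0.5}


def completeness(window):
    """Fraction of values in the window that are not missing (None)."""
    if not window:
        return 0.0
    return mean(0.0 if v is None else 1.0 for v in window)


def validity(window, low=0.0, high=100.0):
    """Fraction of present values that fall inside an assumed valid range."""
    present = [v for v in window if v is not None]
    if not present:
        return 0.0
    return mean(1.0 if low <= v <= high else 0.0 for v in present)


def standard_score(window):
    """Standard-based (ground-truth) score: weighted mean of dimension scores."""
    dims = {"completeness": completeness(window), "validity": validity(window)}
    return sum(WEIGHTS[name] * value for name, value in dims.items())


class HybridScorer:
    """Computes the standard-based score on every `period`-th window and
    returns a cheap predicted score for the windows in between."""

    def __init__(self, period=3):
        self.period = period
        self.seen = 0
        self.history = []  # ground-truth scores collected so far

    def score(self, window):
        self.seen += 1
        if (self.seen - 1) % self.period == 0:
            truth = standard_score(window)
            self.history.append(truth)
            return truth, "standard"
        # Placeholder predictor (assumption): mean of past ground-truth scores.
        return mean(self.history), "predicted"


if __name__ == "__main__":
    scorer = HybridScorer(period=3)
    for window in ([10.0, None, 50.0, 200.0], [1.0, 2.0], [3.0, None], [5.0, 6.0]):
        print(scorer.score(window))
```

In this sketch the expensive dimension-by-dimension assessment runs only on the first window and then on every third one; a real deployment would replace the placeholder with a model trained on the collected ground-truth scores.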

Cited By

Zainuddin Z, Akhir E, Bencheva N. (2024). Improving Data Quality Completeness using Deep Neural Network (DNN) Prediction. 2024 5th International Conference on Communications, Information, Electronic and Energy Systems (CIEES), 1–8. Online publication date: 20-Nov-2024. https://doi.org/10.1109/CIEES62939.2024.10811414

Zainuddin Z, Akhir E. (2024). Systematic Literature Review of Data Quality in Open Government Data: Trend, Methods, and Applications. IEEE Access 12, 148466–148487. Online publication date: 2024. https://doi.org/10.1109/ACCESS.2024.3475577

Bayram F, Ahmed B, Hallin E. (2024). Adaptive data quality scoring operations framework using drift-aware mechanism for industrial applications. Journal of Systems and Software 217, 112184. Online publication date: Nov-2024. https://doi.org/10.1016/j.jss.2024.112184

Index Terms

DQSOps: Data Quality Scoring Operations Framework for Data-Driven Applications
1. Computing methodologies
  1. Machine learning
2. Software and its engineering
  1. Software creation and management
    1. Software development techniques
    2. Software verification and validation

Information & Contributors

Information

Published In

EASE '23: Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering

June 2023

544 pages

ISBN:9798400700446

DOI:10.1145/3593434

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

EASE '23

EASE '23: The International Conference on Evaluation and Assessment in Software Engineering

June 14 - 16, 2023

Oulu, Finland

Acceptance Rates

Overall Acceptance Rate 71 of 232 submissions, 31%

Article Metrics

Total citations: 3
Total downloads: 624
Downloads (last 12 months): 492
Downloads (last 6 weeks): 59

Reflects downloads up to 04 Jan 2025.
