DOI: 10.1145/3593434.3593445
research-article
Open access

DQSOps: Data Quality Scoring Operations Framework for Data-Driven Applications

Published: 14 June 2023

Abstract

Data quality assessment has become a prominent component in the successful execution of complex data-driven artificial intelligence (AI) software systems. In practice, real-world applications generate huge volumes of data at high speed. These data streams require analysis and preprocessing before being permanently stored or used in a learning task. Therefore, significant attention has been paid to the systematic management and construction of high-quality datasets. Nevertheless, managing voluminous and high-velocity data streams is usually performed manually (i.e., offline), which makes it impractical in production environments. To address this challenge, DataOps has emerged to achieve life-cycle automation of data processes using DevOps principles. However, determining data quality on a fitness scale remains a complex task within the DataOps framework. This paper presents a novel Data Quality Scoring Operations (DQSOps) framework that yields a quality score for production data in DataOps workflows. The framework incorporates two scoring approaches: an ML prediction-based approach that predicts the data quality score, and a standard-based approach that periodically produces ground-truth scores by assessing several data quality dimensions. We deploy the DQSOps framework in a real-world industrial use case. The results show that DQSOps achieves a significant computational speedup over the conventional approach to data quality scoring while maintaining high prediction performance.
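
The interplay of the two scoring approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. Everything in it is an assumption made for illustration: the two dimensions chosen (completeness and uniqueness), the unweighted mean used to aggregate them, the fixed `refresh_every` schedule, and all function names. The `predict` argument stands in for any trained model that maps a data window to a score.

```python
from statistics import mean


def completeness(rows):
    """Fraction of cells that are not None (hypothetical dimension metric)."""
    cells = [value for row in rows for value in row]
    return sum(value is not None for value in cells) / len(cells) if cells else 1.0


def uniqueness(rows):
    """Fraction of rows that are distinct (hypothetical dimension metric)."""
    return len(set(rows)) / len(rows) if rows else 1.0


def standard_score(rows):
    """Aggregate the dimension scores; the unweighted mean is an assumption."""
    return mean([completeness(rows), uniqueness(rows)])


def score_stream(windows, predict, refresh_every=3):
    """Score each window of rows.

    Every `refresh_every`-th window gets the slower standard-based score;
    all other windows get the score returned by `predict`.
    """
    scores = []
    for index, rows in enumerate(windows):
        if index % refresh_every == 0:
            scores.append(("standard", standard_score(rows)))
        else:
            scores.append(("predicted", predict(rows)))
    return scores


# Rows are tuples so that duplicate rows can be detected with a set.
window = [(1, 2), (1, 2), (3, None), (4, 5)]
results = score_stream([window] * 4, predict=lambda rows: 0.5)
```

In this sketch the standard-based scores could also serve as fresh labels for retraining the predictor, which is one plausible reading of "ground-truth scores"; the abstract does not spell out that mechanism.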

Cited By

  • (2024) Improving Data Quality Completeness using Deep Neural Network (DNN) Prediction. 2024 5th International Conference on Communications, Information, Electronic and Energy Systems (CIEES), 1–8. DOI: 10.1109/CIEES62939.2024.10811414. Online publication date: 20-Nov-2024
  • (2024) Systematic Literature Review of Data Quality in Open Government Data: Trend, Methods, and Applications. IEEE Access 12, 148466–148487. DOI: 10.1109/ACCESS.2024.3475577. Online publication date: 2024
  • (2024) Adaptive data quality scoring operations framework using drift-aware mechanism for industrial applications. Journal of Systems and Software 217, 112184. DOI: 10.1016/j.jss.2024.112184. Online publication date: Nov-2024

Published In

EASE '23: Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering
June 2023
544 pages
ISBN: 9798400700446
DOI: 10.1145/3593434
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Automated data scoring
  2. DataOps
  3. data assessment
  4. data quality dimensions
  5. mutation testing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EASE '23

Acceptance Rates

Overall Acceptance Rate 71 of 232 submissions, 31%

Article Metrics

  • Downloads (last 12 months): 492
  • Downloads (last 6 weeks): 59

Reflects downloads up to 04 Jan 2025.
