DOI: 10.1145/3365438.3410987
research-article

Detecting quality problems in research data: a model-driven approach

Published: 16 October 2020

Abstract

As scientific progress depends heavily on the quality of research data, the scientific community imposes strict requirements on data quality. A major challenge in data quality assurance is to localise quality problems that are inherent to data. Because digitalisation is advancing rapidly in certain scientific fields, especially the humanities, different database technologies and data formats may be tried out over rather short periods to gain experience. We present a model-driven approach to analysing the quality of research data that abstracts from the underlying database technology. Based on the observation that many quality problems exhibit anti-patterns, a data engineer formulates analysis patterns that are generic with respect to database format and technology. A domain expert chooses a pattern that has been adapted to a specific database technology and concretises it for a domain-specific database format. Data analysts then use the resulting concrete patterns to locate quality problems in their databases. As a proof of concept, we implemented tool support that realises this approach for XML databases. We evaluated the expressiveness and performance of our approach in the domain of cultural heritage, based on a qualitative study of quality problems occurring in cultural heritage data.
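
The staged workflow described in the abstract (generic pattern, adaptation to a database technology, concretisation for a data format, application to data) can be illustrated with a small sketch. This is not the authors' tool or its pattern language; it is a simplified illustration in Python under several assumptions: the anti-pattern "mandatory field is missing or empty" is chosen only as an example, the classes `GenericPattern` and `ConcretePattern` are hypothetical names, and the element names `object` and `title` belong to an invented XML format rather than to any real cultural heritage standard.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericPattern:
    """Technology-independent anti-pattern description (hypothetical class)."""
    name: str


@dataclass(frozen=True)
class ConcretePattern:
    """A generic pattern adapted to XML and bound to one (invented) format."""
    generic: GenericPattern
    record_tag: str  # element that represents one record
    field_tag: str   # child element that must be present and non-empty

    def find_problems(self, root):
        """Return the ids of records whose mandatory field is missing or empty."""
        problems = []
        for record in root.iter(self.record_tag):
            field = record.find(self.field_tag)
            if field is None or not (field.text or "").strip():
                problems.append(record.get("id"))
        return problems


# Step 1 (data engineer): formulate the generic pattern.
MISSING_FIELD = GenericPattern("missing or empty mandatory field")

# Step 2 (domain expert): concretise it for the assumed XML format.
missing_title = ConcretePattern(MISSING_FIELD, record_tag="object", field_tag="title")

# Step 3 (data analyst): apply the concrete pattern to a database excerpt.
SAMPLE = """
<collection>
  <object id="o1"><title>Portrait of a Lady</title></object>
  <object id="o2"><title>   </title></object>
  <object id="o3"><creator>Unknown</creator></object>
</collection>
""".strip()

problem_ids = missing_title.find_problems(ET.fromstring(SAMPLE))
print(problem_ids)  # ['o2', 'o3']
```

In this sketch the record `o2` is reported because its title contains only whitespace, and `o3` because it has no title element at all. Keeping the generic pattern separate from its binding to element names mirrors the division of roles in the abstract, although a real implementation would likely generate queries for the target database technology instead of traversing the document in memory.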

Published In

MODELS '20: Proceedings of the 23rd ACM/IEEE International Conference on Model Driven Engineering Languages and Systems
October 2020
406 pages
ISBN: 9781450370196
DOI: 10.1145/3365438

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data quality
  2. model-driven development
  3. pattern matching

Qualifiers

  • Research-article

Funding Sources

  • German Federal Ministry of Education and Research

Conference

MODELS '20

Acceptance Rates

MODELS '20 Paper Acceptance Rate 35 of 127 submissions, 28%;
Overall Acceptance Rate 144 of 506 submissions, 28%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 209
  • Downloads (Last 12 months): 17
  • Downloads (Last 6 weeks): 2

Reflects downloads up to 28 Dec 2024
