Abstract
In healthcare, an AI solution is generally developed for a specific analysis task on a relevant dataset, with little attention to the reusability and generalizability of its data preparation step. This paper focuses on a different scenario, which can be called context-oriented, in which a set of clinical data sources relevant to a specific context (e.g., a particular disease) is available and can be used for a variety of data analytics tasks, often carried out by different research groups. The aim of this research is to present a systematic method that exploits the Ontology-based Data Management paradigm to enhance data preparation in a context-oriented scenario. The methodology has been applied to a big data project concerning the treatment of diabetes and its complications. The peculiarity and challenge of this project lie in the fact that it deals with real-world data, extracted from Electronic Medical Records over a 13-year timeframe and thus not collected for research purposes. The paper focuses on two main steps of data preparation, namely data modeling and data cleaning, and shows how this approach provides effective techniques for setting up a unified and shared database that serves as an asset in the subsequent data analytics phases.
Data availability
The data that support the findings of this study are the property of the AMD Foundation. Restrictions apply to the availability of these data.
Notes
We clarify that not all the 320 centres constituting the AMD network included their data in the latest AMD annals.
Acknowledgements
This work has been partially supported by MUR under the PRIN 2017 project “HOPE” (prot. 2017MMJJRE), by the EU under the H2020-EU.2.1.1 project TAILOR, grant id. 952215, and by the projects FAIR (PE0000013) and SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU. The authors would like to thank the Associazione Medici Diabetologi (AMD), Fondazione AMD and all the scientists involved in the STITCH-AMD initiative for supporting this work. This work would not have been possible without the invaluable efforts of Dr. Sebastiano Filetti, the expertise of Dr. Antonio Nicolucci and Dr. Giuseppe Lucisano (CORESEARCH S.r.l.), and all the patients who have been cared for over the years in the AMD centers.
Funding
This article is funded by Ministero dell'Università e della Ricerca (2017MMJJRE, PE0000013, PE00000014) and H2020 Industrial Leadership (952215).
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
This article is part of the topical collection “Digital Healthcare and Wellbeing” guest edited by Achilleas Achilleos, George A. Papadopoulos, Edwige Pissaloux and Ramiro Velazquez.
About this article
Cite this article
Croce, F., Valentini, R., Maranghi, M. et al. Ontology-Based Data Preparation in Healthcare: The Case of the AMD-STITCH Project. SN COMPUT. SCI. 5, 437 (2024). https://doi.org/10.1007/s42979-024-02757-w