Martin Dugas, Ellen Hoffmann, Sabine Janko, Silke Hahnewald, Tomas Matis, Karl Überla
Abstract
Medical databases are generally characterized by a high degree of complexity in terms of the number of items, the number of parameter values and the variety of data types (free text, categorical, numerical and other). Substantial domain knowledge is required for adequate formalization of medical entities.
In this context, we developed medical database plot (mdplot), a data mining tool that visualizes both the structure and the quality of data in medical databases in order to identify items suitable for evaluation.
Data models are provided in XML format. Missing data are identified to enable targeted efforts to improve data quality prior to analysis. Database items are classified as either 1:1-related to the patient (i.e. collected once per patient) or 1:n-related (i.e. collected repeatedly per patient).
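The distinction between 1:1-related and 1:n-related classes can be illustrated with a minimal Python sketch. It assumes that each record of a class carries a patient identifier and infers the relation from the data by counting records per patient. The function name `classify_relation` is hypothetical, and mdplot itself may instead take this information from the XML data model.

```python
from collections import Counter


def classify_relation(patient_ids):
    """Classify a class of records by its relation to the patient.

    patient_ids: one patient identifier per record of the class
    (hypothetical input format, chosen for illustration only).
    Returns "1:1" if no patient has more than one record, else "1:n".
    """
    records_per_patient = Counter(patient_ids)
    if all(count <= 1 for count in records_per_patient.values()):
        return "1:1"
    return "1:n"


# A class collected once per patient versus one collected repeatedly.
print(classify_relation(["p1", "p2", "p3"]))
print(classify_relation(["p1", "p1", "p2"]))
```

A data-driven check like this can only confirm a 1:n relation once at least one patient has several records, so a declaration in the data model would be the more reliable source where it exists.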
mdplot provides a list of all classes contained in a database, the number of records in each class and a condensed bar chart giving a semi-quantitative description of completeness for four types of items: categorical, numerical, text and other. All items of a type are grouped together, from left to right. The height of each bar represents the proportion of non-missing values relative to the total number of records in the class, so that the amount of content in a specific class is visualized.
When a specific class is selected, a detailed description of it is provided, including the mean completeness in each item category as well as the number of values per item.
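The completeness measures described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the mdplot implementation: records are assumed to be dictionaries, a value counts as missing if it is absent, `None` or an empty string, and the names `item_completeness`, `class_summary` and `item_types` are hypothetical.

```python
from statistics import mean

# Assumption: these values mark a missing entry; a real database may use others.
MISSING = (None, "")


def item_completeness(records, item):
    """Proportion of non-missing values of `item` among all records of a class."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(item) not in MISSING)
    return filled / len(records)


def class_summary(records, item_types):
    """Return per-item completeness and mean completeness per item type.

    item_types maps each item name to one of
    "categorical", "numerical", "text" or "other".
    """
    per_item = {item: item_completeness(records, item) for item in item_types}
    grouped = {}
    for item, item_type in item_types.items():
        grouped.setdefault(item_type, []).append(per_item[item])
    mean_by_type = {item_type: mean(values) for item_type, values in grouped.items()}
    return per_item, mean_by_type


# Made-up example records for one class (not data from the study).
records = [
    {"sex": "f", "ef": 55, "note": ""},
    {"sex": "m", "ef": None, "note": "ok"},
    {"sex": None, "ef": 60},
    {"sex": "f", "ef": 0},
]
item_types = {"sex": "categorical", "ef": "numerical", "note": "text"}
per_item, mean_by_type = class_summary(records, item_types)
print(per_item)
print(mean_by_type)
```

In this sketch, each value in `per_item` corresponds to the height of one bar, and `mean_by_type` corresponds to the mean completeness per item category shown in the detailed class view. A numerical value of 0 is deliberately counted as present rather than missing.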
The new methodology was applied to a cardiological research database consisting of 619 items on 88 patients.