Background & Summary

Integrating emerging technologies and vast data resources has significantly enhanced various fields, from healthcare to finance, by enabling more accurate predictions, personalized services, and data-driven decision-making. In particular, the combination of large data sets with recent machine learning techniques has made this potential evident. However, the vast amount of patient data generated in healthcare systems is largely inaccessible to the broader research community due to its sensitive nature1. This presents a significant challenge to the advancement of research on observational health data and to the reproducibility of the corresponding results. At the same time, uncovering new knowledge and improving healthcare systems using medical data hold the potential to tackle some of the challenges that 100-year lifespans and the accelerated aging of global populations bring about2,3,4. Research on chronic diseases, which are the leading causes of avoidable premature deaths5,6,7, could address some of these challenges with new data-driven strategies for disease prevention8,9.

Network medicine (NM) adopts a series of principles and approaches from network theory with the aim of preventing and treating diseases by leveraging their interconnected nature10. For instance, comorbidity networks represent phenomenological disease relations that provide insights into chronic disease progression patterns across life and gender11,12. It was found that the likelihood of developing specific diseases often depends on the proximity in the comorbidity network to already present diseases. This fact highlights the potential predictive value of these networks. Typically, a patient’s health is not characterized by a single disease but by multiple coexisting medical conditions. Identifying complex structures and patterns within high-dimensional medical data sets may allow for a better understanding of comorbidities and how they affect each other and of gender differences in disease progression, or may lead to the discovery of new disease predictors. Data-driven methods based on comorbidity networks have also opened a novel way into epidemiology on a population-wide scale by analyzing individual patient trajectories13,14, their diagnosis progression patterns15, and typical clusters of such trajectories16,17. To construct the disease networks, most studies have used data on in-hospital stays18.

This project aims to provide a comprehensive dataset for a wide range of comorbidity networks to foster further research in this direction. Networks derived from medical claims databases from all Austrian hospitals reflect information about 44,619,964 hospital stays and their interrelations in the Austrian population (N = 8,996,916). The underlying database, maintained by the Austrian Ministry of Health, includes data on patients’ age and gender, primary and secondary diagnoses, entry and release dates, release type, hospital region, and patient’s residential region. It covers the years 1997 through 2014. Level-3 ICD-10 codes are used to represent primary and secondary diagnoses19.

Here we present this dataset, which can be used to construct different types of comorbidity networks, for instance, networks of ICD10 diagnoses, ICD blocks, and chronic condition blocks for different sexes, age groups, and time periods. We provide the aggregated hospital data, the exported networks, and the scripts in GitHub and FigShare repositories20. The workflow of the research presented in this article is shown in Fig. 1.

Fig. 1 Workflow of the research presented in this article.

Using this network data, one can validate whether comorbidities predicted by shared pathobiological processes do indeed occur in the population and thereby validate potential disease etiologies21. This network data can also be used to find leverage points for targeted prevention efforts in specific at-risk cohorts using disease trajectories14, in particular to understand type 2 diabetes progression22, to explore and compare multimorbidity profiles in different populations23, as well as to achieve more accurate predictions of the length of hospital stays24. Network data of the kind presented has been used in interactive tools to analyze population-level disease progression over time13.

We aim to present a dataset from which many different types of diagnosis co-occurrence statistics can be derived, providing a flexible platform for the construction of comorbidity networks. Here, we present one way of doing this and discuss how the data can be used to achieve some of the objectives mentioned. Different questions may require different definitions of comorbidity networks, many of which can be explored with the data presented.

Methods

The original dataset comprises highly sensitive medical information from the Austrian Federal Ministry for Health. As part of the collaboration agreement, we can only share the aggregated datasets. In our projects, we make secondary use of hospital claims data collected for billing reasons.

Comorbidity networks, also known as disease-disease networks, express the relationships between individual diseases. These networks are typically constructed from extensive longitudinal health datasets. Numerous statistical approaches exist to derive these networks from raw data; tools from network science are frequently employed to analyze (temporal) disease correlations. Several methods for network reconstruction are reviewed in ref. 25. Typically, nodes represent diagnoses, and links represent statistically significant correlations (of various types) between two diagnoses.

Odds Ratio calculation and statistical significance testing

We employ odds ratios (ORs) to quantify the strength of association between diseases. The OR is a straightforward metric for network construction. In addition, when controlling for potential confounding variables such as age, sex and time, the Cochran-Mantel-Haenszel (CMH) method allows for a more accurate and unbiased estimate of disease associations by stratifying the analysis and calculating weighted averages of ORs across strata.

We start from contingency tables (two-way tables) for each disease pair to assess statistically significant correlations. These tables are used in statistics to summarize relations between categorical variables. An example is shown in Table 1.

Table 1 Example of contingency table: letters a-d are the respective counts for the various combinations.

For the case shown in Table 1, the OR is calculated as \(OR=\frac{a/c}{b/d}\). The lower limit of the OR is zero, while it does not have an upper bound. An odds ratio of one means that the odds of the outcome are the same in both groups, i.e. the two diagnoses are not associated. The logarithmic odds ratio, log(OR), is defined as \(\log(OR)=\log\left(\frac{a/c}{b/d}\right)\). The log(OR) has a range from \(-\infty \) to \(+\infty \). The standard error SE of the log(OR) is \(S{E}_{\log(OR)}=\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\). A 95% confidence interval for the log(OR) is obtained as 1.96 SE on either side of the estimate26.
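
The quantities defined above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the code used to produce the dataset: the function name `odds_ratio_ci` and the example counts are hypothetical, and the interval is additionally mapped back to the OR scale by exponentiation.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Return OR, log(OR), SE of log(OR) and a CI on the OR scale.

    a, b, c, d are the four cell counts of a 2x2 contingency table,
    using the same letters as in the formulas above.
    """
    if min(a, b, c, d) <= 0:
        # log(OR) and its standard error are undefined for empty cells
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a / c) / (b / d)
    log_or = math.log(odds_ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # z * SE on either side of log(OR), then back-transformed
    lower = math.exp(log_or - z * se)
    upper = math.exp(log_or + z * se)
    return odds_ratio, log_or, se, (lower, upper)


# Hypothetical counts, chosen only for illustration
or_value, log_or, se, (ci_low, ci_high) = odds_ratio_ci(20, 10, 10, 20)
```

If the lower bound of the back-transformed interval lies above one, the association is positive at the chosen confidence level.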

To enable researchers to study disease associations with measures other than ORs, our dataset provides, for each pair of the 1,080 diagnoses (ICD 3-digit codes), 96 contingency tables, one for each stratum.

From these variables, most of the association measures used in the comorbidity network literature to date can be readily derived (compare ref. 25). For instance, instead of the OR one might choose the relative risk (RR), see for instance27,28. The RR is the ratio of the probability of an event occurring in the exposed group (e.g. with a diagnosis of obesity) to the probability of it occurring in the unexposed group (without a diagnosis of obesity), \(RR=\frac{\frac{a}{a+b}}{\frac{c}{c+d}}\). It can be readily computed from the provided data. An RR of one means there is no difference between the compared groups.
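
As a minimal sketch of how the RR follows from the same four counts, the formula above can be written in Python as below. The function name `relative_risk` and the example numbers are hypothetical and serve only as illustration.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a 2x2 table, following the RR formula above.

    (a, b) are the counts of the exposed group and (c, d) those of the
    unexposed group; a and c count the individuals with the event.
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("RR is undefined for these counts")
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed have the event
rr = relative_risk(20, 80, 10, 90)
```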

Stratified analysis

The stratified analysis accounts for the confounding variables age and time period. For each sex, 48 strata were created by splitting the dataset into 10-year age groups and six 2-year intervals from 2003 to 2014 (2003-2004, 2005-2006, etc.). A contingency table was created for every pair of diagnoses in every stratum. Odds ratios (ORs) and p-values, which test the null hypothesis that two diagnoses co-occur independently, were computed only from contingency tables with sufficient patient numbers in each grouping (more than 4).
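
The stratification scheme can be sketched in Python as follows. This is an illustration only: the number of age groups (8) is inferred from 48 strata per sex divided by six periods, age groups are represented by an index rather than by labels, and "more than 4 in each grouping" is read here as every cell of the table containing at least 5 patients. All names are hypothetical.

```python
from itertools import product

SEXES = ("female", "male")
# Six 2-year periods: (2003, 2004), (2005, 2006), ..., (2013, 2014)
PERIODS = [(start, start + 1) for start in range(2003, 2015, 2)]
# Assumption: 48 strata per sex / 6 periods = 8 ten-year age groups
N_AGE_GROUPS = 8


def strata():
    """Enumerate all (sex, age-group index, period) strata."""
    return list(product(SEXES, range(N_AGE_GROUPS), PERIODS))


def has_enough_patients(table, min_count=5):
    """Check that every cell (a, b, c, d) holds more than 4 patients."""
    return all(cell >= min_count for cell in table)


all_strata = strata()  # 2 sexes x 8 age groups x 6 periods
```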

We calculated a weighted average of the odds ratio (OR) estimations across the stratified data using the Cochran–Mantel–Haenszel technique (see details below)29. Filtering the resulting correlation matrices by statistical significance alone is not advisable, since this could bias the resulting network towards links between very frequent diseases with low–but still significant–correlations. Hence, we only include comorbidities with an OR greater than 1.5, a p-value less than 0.05 and at least 100 patients with the analysed comorbidity.
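
The three inclusion criteria can be expressed as a simple filter. The following Python sketch assumes that each candidate link is available as a record with an OR, a p-value and the number of patients with both diagnoses; the field names and example values are hypothetical.

```python
def keep_link(odds_ratio, p_value, n_comorbid,
              min_or=1.5, alpha=0.05, min_patients=100):
    """Apply the three criteria: OR > 1.5, p < 0.05, at least 100 patients."""
    return (odds_ratio > min_or
            and p_value < alpha
            and n_comorbid >= min_patients)


# Hypothetical candidate links, for illustration only
candidates = [
    {"pair": ("E11", "I10"), "or": 2.3, "p": 0.001, "n": 540},
    {"pair": ("E11", "J45"), "or": 1.2, "p": 0.001, "n": 900},  # OR too low
    {"pair": ("E66", "G47"), "or": 3.1, "p": 0.200, "n": 150},  # not significant
    {"pair": ("C50", "F32"), "or": 1.8, "p": 0.010, "n": 60},   # too few patients
]
links = [c["pair"] for c in candidates
         if keep_link(c["or"], c["p"], c["n"])]
```

Only the first candidate passes all three thresholds; the second illustrates the case of a frequent pair with a significant but weak association.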

The Cochran-Mantel-Haenszel method

To account for confounding factors in the analysis, we perform a stratified analysis by constructing two-by-two tables for each stratum (or category) of the confounding variable, as illustrated in Fig. 2. The Cochran-Mantel-Haenszel method estimates the OR as a weighted average of the odds ratios of the different strata, \({OR}_{cmh}=\frac{\sum \frac{{a}_{i}{d}_{i}}{{n}_{i}}}{\sum \frac{{b}_{i}{c}_{i}}{{n}_{i}}}\), and the RR as a weighted average of the risk ratios, \({RR}_{cmh}=\frac{\sum \frac{{a}_{i}({c}_{i}+{d}_{i})}{{n}_{i}}}{\sum \frac{{c}_{i}({a}_{i}+{b}_{i})}{{n}_{i}}}\)29, where \({n}_{i}={a}_{i}+{b}_{i}+{c}_{i}+{d}_{i}\) is the total number of observations in the i-th stratum and \({a}_{i}\), \({b}_{i}\), \({c}_{i}\), \({d}_{i}\) refer to the corresponding cells of the contingency table of the i-th stratum.
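
The two pooled estimators can be computed directly from a list of per-stratum tables. The Python sketch below is a minimal illustration of the formulas above, not the authors' implementation; the function name `cmh_estimates` and the example tables are hypothetical, and n_i is taken to be the total count a_i + b_i + c_i + d_i of stratum i.

```python
def cmh_estimates(tables):
    """Cochran-Mantel-Haenszel pooled OR and RR.

    tables is an iterable of (a, b, c, d) tuples, one per stratum.
    """
    or_num = or_den = rr_num = rr_den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        or_num += a * d / n
        or_den += b * c / n
        rr_num += a * (c + d) / n
        rr_den += c * (a + b) / n
    if or_den == 0 or rr_den == 0:
        raise ValueError("pooled estimate is undefined for these tables")
    return or_num / or_den, rr_num / rr_den


# Two hypothetical strata with the same underlying association
or_cmh, rr_cmh = cmh_estimates([(20, 10, 10, 20), (40, 20, 20, 40)])
```

With a single stratum, the pooled OR reduces to ad/(bc), which provides a quick consistency check against the unstratified formula.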

Fig. 2 Flowchart - Stratified Analysis.

Data Records

The dataset is available at30 and it is organised in four groups:

  (i) Prevalence

  (ii) Contingency Tables

  (iii) Adjacency Matrices

  (iv) Graphs - gexf files.

Prevalence data is provided in CSV format, and contingency tables are organized as lists and stored in RDS format. Adjacency Matrices are published in both CSV and RDS formats. Graphs are available in GEXF format. We also provide R and Python scripts for the analysis of available variables.
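
As a hedged illustration of working with the adjacency matrices, the following Python sketch converts a CSV adjacency matrix into an edge list. The layout assumed here (a header row and a first column containing the ICD codes in the same order, with link weights in the cells) is an assumption made for illustration and should be checked against the published files; the function name and the example values are hypothetical.

```python
import csv
import io


def edges_from_adjacency_csv(text):
    """Turn a square CSV adjacency matrix into (source, target, weight) tuples.

    Assumes row i and column i refer to the same code, so only the upper
    triangle is read and each undirected link is reported once.
    """
    rows = list(csv.reader(io.StringIO(text)))
    codes = rows[0][1:]  # header row without the corner cell
    edges = []
    for i, row in enumerate(rows[1:]):
        source = row[0]
        for j, cell in enumerate(row[1:]):
            weight = float(cell) if cell else 0.0
            if j > i and weight > 0:
                edges.append((source, codes[j], weight))
    return edges


# Hypothetical 3x3 matrix, for illustration only
example = ",E11,I10,E66\nE11,0,2.1,3.4\nI10,2.1,0,0\nE66,3.4,0,0\n"
edge_list = edges_from_adjacency_csv(example)
```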

An overview of the data sources and of how the datasets are analysed and organised is shown in Fig. 3.

Fig. 3 Overview of data sources and organisation of data files and code of this project.

Hospital claims data

The dataset under analysis comprises 44,619,964 hospital admissions between 1997 and 2014 and covers roughly the entire Austrian population (N = 8,996,916). As part of a collaboration, the dataset is provided by the Austrian Ministry of Health to the Complexity Science Hub and the Medical University of Vienna. For each hospital stay, the database contains a patient ID, sex (male/female), age group (five-year resolution), primary and secondary diagnoses, admission and discharge dates, the type of discharge (routine discharge, transfer to another facility, etc.), the region of residence of the patient (32 NUTS3 regions), the region of the hospital, and the hospital department17,31,32. The primary diagnosis (one diagnosis per hospital stay) refers to the primary reason for hospitalization; secondary diagnoses (one or more diagnoses per hospital stay) specify additional diseases.

ICD-10 codes of level-3 as provided by the WHO are used to represent primary and secondary diagnoses19. We limited the study to codes between A00 and N99, reducing the number of 3-digit ICD codes from 1,699 to 1,080 diagnoses. We exclude diagnosis codes that cannot be directly related to diseases but encode other reasons for hospitalization, such as O00-O99 (pregnancy, childbirth, and the puerperium) and S00-T88 (injury, poisoning, and certain other consequences of external causes).
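
The restriction to the range A00 to N99 can be sketched as a simple code filter. The Python example below assumes that codes are given as 3-character strings (one letter followed by two digits); the function name `in_study_range` is hypothetical.

```python
import re

# A level-3 ICD-10 code: one capital letter followed by two digits
ICD3_PATTERN = re.compile(r"^[A-Z][0-9]{2}$")


def in_study_range(code):
    """Return True for 3-character ICD-10 codes between A00 and N99."""
    code = code.strip().upper()
    # Lexicographic comparison is sufficient for letter + two digits
    return bool(ICD3_PATTERN.match(code)) and "A00" <= code <= "N99"


codes = ["A00", "E11", "N99", "O80", "S72", "Z00"]
kept = [c for c in codes if in_study_range(c)]
```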

Technical Validation

The goal of this study is to facilitate the secondary use of a population-wide in-hospital database (originally collected for billing purposes33). The LKF framework (Leistungsorientierte Krankenanstaltenfinanzierung) is Austria’s performance-oriented hospital financing system. It was introduced to ensure that hospitals are funded based on the services they provide. While primarily used for billing purposes, this data is also highly valuable for research, offering reliable insights into healthcare utilization, patient outcomes, and disease patterns.

Data collection under the LKF framework adheres to a rigorous standardized process and validation. Hospitals must collect and report detailed structured data, which includes patient demographics, admission and discharge dates, and diagnostic information (ICD codes). The data collection process is subject to regular external audits to ensure that hospitals are reporting accurately34. These audits are critical to identifying and correcting discrepancies, such as missing or inaccurate diagnoses.

Non-systematic errors, such as sporadic missing diagnoses, have been evaluated and their impact on the results of the analyses is minimal due to the large volume of data. To account for these limitations, sensitivity analyses are often performed to assess the robustness of the results, especially when analyzing rare conditions or specific comorbidities.

We performed filtering to prepare the dataset for comorbidity analysis. We limited the scope of our investigation to information collected between 2003 and 2014. We excluded any patient who had at least one hospital visit between 1997 and 2002 to ensure the comparability of the health state of the study population. Hence, we can assume that our cohort is “healthy” at the beginning of the observation period in the sense that its members had no hospital stays between 1997 and 2002. In the early 2000s, the Austrian diagnosis coding system was changed. By restricting the comorbidity network analysis to the period from 2003 onwards, we avoid inaccuracies stemming from changes in diagnosis coding within the hospitals.
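
The cohort selection described above can be sketched on patient-level records as follows. Since only aggregated data are shared, this Python example is purely illustrative: the record layout (patient ID, year of stay), the function name and the example stays are hypothetical.

```python
def select_cohort(stays, washout=(1997, 2002), study=(2003, 2014)):
    """Keep study-period stays of patients with no stay in the washout period.

    stays is a list of (patient_id, year) tuples.
    """
    # Patients with at least one hospital stay between 1997 and 2002
    excluded = {pid for pid, year in stays
                if washout[0] <= year <= washout[1]}
    return [(pid, year) for pid, year in stays
            if pid not in excluded and study[0] <= year <= study[1]]


# Hypothetical stays: patients 1 and 3 have a stay in the washout period
stays = [(1, 1999), (1, 2005), (2, 2003), (2, 2010), (3, 2002)]
cohort_stays = select_cohort(stays)
```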

This database has been used in studies to analyze gender differences among diabetic patients35,36,37, gender differences in cardiovascular diseases38, comorbidities of obesity39, clusters of patients17, and disease trajectories32. These studies have validated the reliability of the LKF dataset in addressing a wide range of research questions, highlighting its robustness despite the known limitations.

Despite the robust structure and auditing, certain limitations remain in the LKF dataset. Diagnoses that do not lead directly to financial compensation, such as alcohol-related disorders or nicotine dependence, may be underreported. In addition, the database lacks outpatient visits, detailed socioeconomic indicators, and medication information. This may prevent the impact of these aspects on comorbidity from being uncovered. These limitations are acknowledged in studies using the dataset and are addressed through careful interpretation of results and, where possible, complementary data sources.

Usage Notes

Table 2 shows the baseline characteristics of the hospital claims data set, which contains 3,378,906 patients (females: 1,688,467, males: 1,690,439) after filtering. The mean age is 44.30 ± 24.89 years. Figure 4 shows the age distribution.

Table 2 Baseline table of the analyzed database, after filtering.
Fig. 4 Age distribution. Histogram of the number of patients across different age groups for males (blue) and females (pink). The x-axis represents age intervals (e.g., 0–9, 10–19, etc.), while the y-axis shows the count of patients.

Prevalence of diagnoses

The most prevalent ICD chapters (based on the first letter of each code) for both females and males over all time periods are cardiovascular diseases (I–Circulatory System) and cancers and neoplasms (C–D–Neoplasms). In males, the third most prevalent chapter is digestive diseases (K–Digestive System), followed by mental disorders (F–Mental and Behavioral Disorders), while in females it is musculoskeletal disorders (M–Musculoskeletal, Connective Tissue), followed by digestive diseases (K–Digestive System). Cardiovascular diagnoses were consistently the most prevalent in males and remained the most common in females up until 2006; after 2006, cancer diagnoses became the most prevalent among females. The prevalence of all ICD chapters over time is presented in Fig. 5 for (a) males and (b) females.

Fig. 5 Absolute prevalence of all ICD chapters (based on the first letter of each code) over time for (a) males, (b) females. Each coloured band corresponds to a specific ICD10 chapter, showing its contribution to the overall growth.

Comorbidity networks

We constructed three versions of networks with different types of node:

  1. ICD10 3-digit codes19, Fig. 6a,b

  2. ICD10 Blocks19, Fig. 6c,d

  3. Chronic conditions40, Fig. 6e,f.

Fig. 6 Examples of different comorbidity networks. Node size represents disease prevalence; colors indicate the ICD chapter (first letter of ICD 10 code). Link weights are proportional to the odds ratios. Online dynamic version available at https://vis.csh.ac.at/comorbidity_networks/.

A comprehensive analysis of the network properties of the (undirected, weighted) ICD10 code comorbidity networks for each age group is shown in Fig. 7. Link weights were normalized to range from 0 to 1 by dividing each link’s weight by the sum of the weights of all links of the target node.
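
The normalization of link weights can be sketched as follows. The text does not specify which endpoint of an undirected link is treated as the target node, so this Python example makes the assumption explicit by returning one normalized value per direction; the function name and the example weights are hypothetical.

```python
from collections import defaultdict


def normalize_by_target(edges):
    """Divide each link weight by the summed weight of all links of its target.

    edges is a list of undirected (u, v, weight) tuples. The result maps
    (source, target) to the normalized weight, once for each direction.
    """
    strength = defaultdict(float)
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
    normalized = {}
    for u, v, w in edges:
        normalized[(u, v)] = w / strength[v]  # v treated as target
        normalized[(v, u)] = w / strength[u]  # u treated as target
    return normalized


# Hypothetical link weights (e.g. odds ratios), for illustration only
weights = normalize_by_target([("A", "B", 2.0), ("B", "C", 6.0)])
```

By construction, the normalized weights of all links pointing to the same target node sum to one, so every value lies between 0 and 1.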

Fig. 7 Network properties of comorbidity networks of ICD10 codes across age groups for females and males: (a) the total number of nodes, (b) average degree, (c) average path length, (d) betweenness, (e) density, (f) closeness, (g) modularity.

These properties reveal a massive topological restructuring of the networks as the underlying patient cohorts age. Figure 7a shows the total number of nodes with at least one connection in the network. Both the number of these nodes and the average degree (the average number of links per node, Fig. 7b) increase with age. For both genders, the average path length decreases with age (Fig. 7c), indicating that the network gets denser and that diseases become more correlated.

Betweenness centrality is a quantity that measures the influence of a node in “connecting” other nodes. The mean value of betweenness for the whole network fluctuates for both genders, with an increase starting around 40–49 years for both females and males (Fig. 7d). This indicates that some diseases in males are critical “bridges” between other diseases in this age range. The networks become increasingly dense with age, except for the youngest age group (Fig. 7e). This is associated with an increase in the betweenness centrality and a decrease in the average path length (Fig. 7d and Fig. 7c, respectively).

Closeness centrality measures how quickly a node can reach other nodes in the network (Fig. 7f). The spike in closeness for males in younger age groups (10–19 years) suggests that diseases in young males are more densely connected by a few diseases serving as hubs compared to the situation in other age groups. However, the values decline with age for both genders, suggesting a reduced influence of individual diseases as the network becomes denser. Both males and females show a decline in modularity with age (Fig. 7g), meaning that diseases are less likely to form separate, distinct clusters as individuals age. Males start with higher modularity but converge to the levels of females in older age.

In summary, to the best of our knowledge, this dataset on comorbidities is the only one of its kind that spans 17 years and covers 9 million individuals, and it is publicly available to the research community. Research on these comorbidity networks and the aggregated hospital claims data can enhance the understanding of comorbidities by identifying disease co-occurrence patterns. This enables more accurate patient classification based on risk profiles and disease trajectory prediction by analyzing the progression of comorbidities. The data also supports medication studies, assessing drug interactions in patients with multiple conditions. It can be used to test hypotheses about disease relationships across age groups, gender differences in comorbidities, and population-specific patterns.

Here we present a series of network measures that quantify properties of the networks and characterize their topology and structure. In particular, we employ the degree (the number of diseases to which a disease is significantly connected), betweenness centrality (which captures which diseases connect many others), average path length (which quantifies how close, in terms of network distance, diseases are on average), modularity (which reflects how easily the network can be partitioned into distinct clusters or communities), and closeness centrality (which captures how quickly a node can access other nodes in the network).
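
Some of these measures can be illustrated with a short, dependency-free Python sketch. It covers only the average degree, the density and the average path length of an unweighted, undirected network, and it averages path lengths over connected pairs of nodes; whether the published analysis uses weighted variants or a different treatment of disconnected components is not assumed here. The function names and the example network are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency mapping from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_degree(adj):
    """Mean number of links per node."""
    return sum(len(nb) for nb in adj.values()) / len(adj) if adj else 0.0


def density(adj):
    """Fraction of possible links that are present."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(nb) for nb in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def average_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical chain of four diagnoses: A - B - C - D
network = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])
```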