Abstract
The ATHLOS cohort is composed of several harmonized datasets from international groups related to health and aging. From this harmonized data, the Healthy Aging index has been constructed based on a selection of variables from 16 individual studies. In this paper, we consider additional variables found in ATHLOS and investigate their use for predicting the Healthy Aging index. Motivated by the volume and diversity of the dataset, we focus on data clustering, where unsupervised learning is used to enhance predictive power, and we show the predictive utility of exploiting hidden data structures. In addition, we demonstrate that the resulting computational bottlenecks can be overcome by using appropriate hierarchical clustering within a clustering-for-ensemble-classification scheme, while retaining the prediction benefits. We propose a complete methodology that is evaluated against baseline methods and the original concept. The results are encouraging, suggesting further developments in this direction along with applications in tasks with similar characteristics. A straightforward open-source implementation in R is also provided (https://github.com/Petros-Barmpas/HCEP).
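The general idea of clustering before prediction can be illustrated with a minimal sketch. It is not the authors' implementation (that is the R code linked above), and every name in it is hypothetical. It assumes a simple 2-means bisection as the divisive splitting rule, a per-cluster mean as a stand-in for whatever learner is trained on each cluster, and synthetic two-group data in place of ATHLOS variables.

```python
import random
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bisect(X, idx, iters=10):
    """Split one cluster in two with a short 2-means run (assumed splitting rule)."""
    c = centroid([X[i] for i in idx])
    a = max(idx, key=lambda i: sq_dist(X[i], c))      # point farthest from the centre
    b = max(idx, key=lambda i: sq_dist(X[i], X[a]))   # point farthest from that one
    ca, cb = X[a], X[b]
    left, right = [], []
    for _ in range(iters):
        new_left = [i for i in idx if sq_dist(X[i], ca) <= sq_dist(X[i], cb)]
        new_right = [i for i in idx if sq_dist(X[i], ca) > sq_dist(X[i], cb)]
        if not new_left or not new_right:
            break  # degenerate split: keep the last valid one (possibly none)
        left, right = new_left, new_right
        ca, cb = centroid([X[i] for i in left]), centroid([X[i] for i in right])
    return left, right


def scatter(X, idx):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid([X[i] for i in idx])
    return sum(sq_dist(X[i], c) for i in idx)


def divisive_clusters(X, k):
    """Top-down hierarchy: keep splitting the most scattered cluster until k remain."""
    clusters = [list(range(len(X)))]
    while len(clusters) < k:
        target = max(clusters, key=lambda idx: scatter(X, idx))
        left, right = bisect(X, target)
        if not left or not right:
            break
        clusters.remove(target)
        clusters += [left, right]
    return clusters


def fit_local_models(X, y, clusters):
    """One (centroid, model) pair per cluster; the 'model' here is just the mean target."""
    return [(centroid([X[i] for i in idx]), mean(y[i] for i in idx)) for idx in clusters]


def predict(models, x):
    """Route a new sample to the nearest cluster centroid and use that cluster's model."""
    _, y_hat = min(models, key=lambda m: sq_dist(m[0], x))
    return y_hat


def rmse(y_true, y_pred):
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred)) ** 0.5


def make_data(seed, n=50):
    """Synthetic data: two well-separated groups with different target levels."""
    rng = random.Random(seed)
    X, y = [], []
    for centre, level in ((0.0, 1.0), (8.0, 5.0)):
        for _ in range(n):
            X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            y.append(level + rng.gauss(0.0, 0.1))
    return X, y


X_train, y_train = make_data(seed=0)
X_test, y_test = make_data(seed=1)
clusters = divisive_clusters(X_train, k=2)
models = fit_local_models(X_train, y_train, clusters)
global_pred = [mean(y_train)] * len(y_test)           # baseline: one global model
local_pred = [predict(models, x) for x in X_test]     # cluster-wise prediction
print("global RMSE:", round(rmse(y_test, global_pred), 3))
print("cluster-wise RMSE:", round(rmse(y_test, local_pred), 3))
```

On data with hidden group structure like this, the cluster-wise predictor should show a lower error than the single global baseline. Each local model also sees only its own cluster's samples, which is the kind of property that can ease computation on large datasets.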
Acknowledgements
This work is supported by the ATHLOS (Aging Trajectories of Health: Longitudinal Opportunities and Synergies) project, funded by the European Union’s Horizon 2020 Research and Innovation Program under Grant Agreement Number 635316.
Cite this article
Barmpas, P., Tasoulis, S., Vrahatis, A.G. et al. A divisive hierarchical clustering methodology for enhancing the ensemble prediction power in large scale population studies: the ATHLOS project. Health Inf Sci Syst 10, 6 (2022). https://doi.org/10.1007/s13755-022-00171-1
DOI: https://doi.org/10.1007/s13755-022-00171-1