Examining unsupervised ensemble learning using spectroscopy data of organic compounds

Journal of Computer-Aided Molecular Design

Abstract

One solution to the challenge of choosing an appropriate clustering algorithm is to combine different clusterings into a single consensus clustering result, known as a cluster ensemble (CE). This ensemble learning strategy can provide more robust and stable solutions across different domains and datasets. Unfortunately, not all clusterings in the ensemble contribute to the final data partition. Cluster ensemble selection (CES) aims to select a subset from a large library of clustering solutions to form a smaller cluster ensemble that performs as well as or better than the set of all available clustering solutions. In this paper, we investigate four CES methods for the categorization of structurally distinct organic compounds using high-dimensional IR and Raman spectroscopy data. The single quality index selection (SQI) method is used with various quality indices to form a subset of the ensemble by selecting the highest quality ensemble members. The Bagging method, usually applied in supervised learning, ranks ensemble members by calculating the normalized mutual information (NMI) between ensemble members and consensus solutions generated from a randomly sampled subset of the full ensemble. The hierarchical cluster and select method (HCAS-SQI) uses the diversity matrix of ensemble members to select a diverse set of ensemble members with the highest quality. Furthermore, a combining strategy can be used to combine subsets selected using multiple quality indices (HCAS-MQI) for the refinement of clustering solutions in the ensemble. The IR + Raman hybrid ensemble library is created by merging two complementary “views” of the organic compounds. This inherently more diverse library gives the best full ensemble consensus results. Overall, the Bagging method is recommended because it provides the most robust results, which are better than or comparable to the full ensemble consensus solutions.
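
The Bagging ranking summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is in the GitHub repository listed under Data availability). The function names `nmi`, `consensus`, and `bagging_rank` are hypothetical, the consensus step is a simple co-association threshold used as a stand-in for consensus functions such as CSPA or HBGF, and the number of rounds, the sample size, and sampling without replacement are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def nmi(a, b):
    """Normalized mutual information of two label lists (arithmetic-mean normalization)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    h_a = -sum(c / n * math.log(c / n) for c in ca.values())
    h_b = -sum(c / n * math.log(c / n) for c in cb.values())
    mi = sum(c / n * math.log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items())
    denom = (h_a + h_b) / 2
    return mi / denom if denom > 0 else 1.0


def consensus(members, threshold=0.5):
    """Stand-in consensus: link objects that co-occur in more than `threshold`
    of the members, then return connected components as cluster labels."""
    n = len(members[0])
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            co = sum(m[i] == m[j] for m in members) / len(members)
            if co > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def bagging_rank(library, n_rounds=20, sample_size=None, seed=0):
    """Rank members by their mean NMI against consensus solutions built
    from random subsets of the library (assumed: sampling without replacement)."""
    rng = random.Random(seed)
    k = sample_size or max(2, len(library) // 2)
    totals = [0.0] * len(library)
    for _ in range(n_rounds):
        reference = consensus(rng.sample(library, k))
        for idx, member in enumerate(library):
            totals[idx] += nmi(member, reference)
    scores = [t / n_rounds for t in totals]
    order = sorted(range(len(library)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy library: four members agree on one partition (up to relabeling), one is noisy.
library = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [2, 2, 2, 5, 5, 5],
    [0, 1, 0, 1, 0, 1],  # noisy member
]
order, scores = bagging_rank(library, sample_size=3)
print(order, [round(s, 3) for s in scores])
```

In this toy library the noisy member agrees least with the sampled consensus solutions, so it is ranked last, and a smaller ensemble could be formed by keeping only the top-ranked members.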

Data availability

A Supporting Document is provided. The datasets and Python source code supporting the conclusions of this article are available in the GitHub repository at https://github.com/nina23bom/Cluster-Ensemble-Selection-Project

Abbreviations

CE: Cluster ensemble
CES: Cluster ensemble selection
CSPA: Cluster-based Similarity Partitioning Algorithm
HBGF: Hybrid Bipartite Graph Formulation
SQI: Single Quality Index Selection
HCAS-SQI: Hierarchical Cluster and Select with Single Quality Index
HCAS-MQI: Hierarchical Cluster and Select with Multiple Quality Indices
DC: Direct combining
WC: Weighted combining
BC: Bagging combining

Acknowledgements

We thank the MERCURY Consortium for computing resources and technical support. Clemson University is acknowledged for a generous allotment of compute time on the Palmetto cluster.

Funding

Computational resources were provided in part by the MERCURY consortium (http://mercuryconsortium.org/) under National Science Foundation grants CHE-1229354, CHE-1662030, and CHE-2018427.

Author information

Contributions

KH conceived and developed the presented idea and was in charge of the overall direction and planning. Both KH and DM performed the computations and the data analysis.

Corresponding author

Correspondence to Kedan He.

Ethics declarations

Conflict of interest

The authors declare no competing financial interest.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 2323 kb)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

He, K., Massena, D.G. Examining unsupervised ensemble learning using spectroscopy data of organic compounds. J Comput Aided Mol Des 37, 17–37 (2023). https://doi.org/10.1007/s10822-022-00488-9
