[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2554850.2555063acmconferencesArticle/Chapter ViewAbstractPublication PagessacConference Proceedingsconference-collections
research-article

Structural clustering of millions of molecular graphs

Published: 24 March 2014 Publication History

Abstract

We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for further cluster assignments. The pre-clustering methodology employs a cluster membership test based on a set-abstraction of graphs. The resulting clusters are further partitioned into a finer level of granularity using a highly parallelized scaffold-based clustering approach that produces overlapping (non-disjoint) and non-exhaustive clusters. In our experiments on very large datasets of chemical structures, we show that our algorithm is able to cluster at least 3,000,000 molecular graph structures within reasonable time. Compared to a recently proposed scaffold-based graph clustering algorithm, our approach is able to cluster an order of magnitude more graph structures in a fraction of the time. This demonstrates that our algorithm is applicable in the context of virtual screening of realistically sized compound libraries and a first step towards structuring chemical space.

References

[1]
C. C. Aggarwal, N. Ta, J. Wang, J. Feng, and M. Zaki. XProj: a framework for projected structural clustering of XML documents. In Proc. 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'07, pages 46--55. ACM, 2007.
[2]
J. Chen, S. J. Swamidass, Y. Dou, and P. Baldi. ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinf., 21: 4133--4139, 2005.
[3]
L. M. Collins and C. W. Dent. Omega: A general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivar. Behav. Res., 23: 231--242, 1988.
[4]
S. Günter and H. Bunke. Validation indices for graph clustering. Pattern Recogn. Lett., 24(8): 1107--1113, 2003.
[5]
M. S. Hossain and R. A. Angryk. GDClust: A graph-based document clustering technique. In Proc. Seventh IEEE International Conference on Data Mining Workshops, ICDMW'07, pages 417--422, 2007.
[6]
T. Kohonen, editor. Self-Organizing maps. Springer-Verlag New York, Inc., 1997.
[7]
M. J. McGregor and P. V. Pallai. Clustering of large databases of compounds: Using the MDL "keys" as structural descriptors. J. Chem. Inform. Comput. Sci., 37(3): 443--448, 1997.
[8]
J. W. Raymond and P. Willett. Effectiveness of graph-based and fingerprint-based similarity measures for virtual screening of 2D chemical structure databases. J. Comput. Aided. Mol. Des., 16(1): 59--71, 2002.
[9]
M. Seeland, S. A. Berger, A. Stamatakis, and S. Kramer. Parallel structural graph clustering. In Proc. 2011 European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD'11, pages 256--272, 2011.
[10]
M. Stahl and H. Mauser. Database clustering with a combination of fingerprint and maximum common substructure methods. J. Chem. Inf. Model., 45: 542--548, 2005.
[11]
J. J. Sutherland, L. A. O'Brien, and D. F. Weaver. Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships. J. Chem. Inform. Comput. Sci., pages 1906--1915, 2003.
[12]
K. Tsuda and T. Kudo. Clustering graphs by weighted substructure mining. In Proc. 23rd international conference on Machine learning, ICML'06, pages 953--960, 2006.
[13]
K. Tsuda and K. Kurihara. Graph mining with variational Dirichlet process mixture models. In Proc. 8th SIAM International Conference on Data Mining, SDM'08, pages 432--442, 2008.
[14]
J. N. Weinstein, K. W. Kohn, M. R. Grever, and V. N. Viswanadhan. Neural computing in cancer drug development: Predicting mechanism of action. Science, 258: 447--451, 1992.
[15]
X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. 2002 IEEE International Conference on Data Mining, ICDM'02, pages 721--724, 2002.
[16]
T. Yoshida, R. Shoda, and H. Motoda. Graph clustering based on structural similarity of fragments. In Jantke, K. P., et al. (eds.) Federation over the Web. LNCS (LNAI), volume 3847, pages 97--114, 2006.
[17]
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A new data clustering algorithm and its applications. Data Min. Knowl. Discov., 1(2).
[18]
Z. Zheng, S. Kramer, and B. Schmidt. DySC: software for greedy clustering of 16S rRNA reads. Bioinf., 28(16), 2012.

Cited By

View all
  • (2023)Graph-Based Methods for Rational Drug DesignAlgorithms for Big Data10.1007/978-3-031-21534-6_5(76-96)Online publication date: 18-Jan-2023
  • (2022)A family of pairwise multi-marginal optimal transports that define a generalized metricMachine Learning10.1007/s10994-022-06280-y112:1(353-384)Online publication date: 20-Dec-2022
  • (2021)Unsupervised behavioural mining and clustering for malware family identificationProceedings of the 36th Annual ACM Symposium on Applied Computing10.1145/3412841.3441919(374-383)Online publication date: 22-Mar-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SAC '14: Proceedings of the 29th Annual ACM Symposium on Applied Computing
March 2014
1890 pages
ISBN:9781450324694
DOI:10.1145/2554850
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 March 2014

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article

Conference

SAC 2014
Sponsor:
SAC 2014: Symposium on Applied Computing
March 24 - 28, 2014
Gyeongju, Republic of Korea

Acceptance Rates

SAC '14 Paper Acceptance Rate 218 of 939 submissions, 23%;
Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Upcoming Conference

SAC '25
The 40th ACM/SIGAPP Symposium on Applied Computing
March 31 - April 4, 2025
Catania , Italy

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)15
  • Downloads (Last 6 weeks)0
Reflects downloads up to 25 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Graph-Based Methods for Rational Drug DesignAlgorithms for Big Data10.1007/978-3-031-21534-6_5(76-96)Online publication date: 18-Jan-2023
  • (2022)A family of pairwise multi-marginal optimal transports that define a generalized metricMachine Learning10.1007/s10994-022-06280-y112:1(353-384)Online publication date: 20-Dec-2022
  • (2021)Unsupervised behavioural mining and clustering for malware family identificationProceedings of the 36th Annual ACM Symposium on Applied Computing10.1145/3412841.3441919(374-383)Online publication date: 22-Mar-2021
  • (2017)StruClus: Scalable Structural Graph Set Clustering with Representative SamplingAdvanced Data Mining and Applications10.1007/978-3-319-69179-4_24(343-359)Online publication date: 14-Oct-2017
  • (2015)Discriminative Chemical Patterns: Automatic and Interactive DesignJournal of Chemical Information and Modeling10.1021/acs.jcim.5b0032355:8(1535-1546)Online publication date: 13-Aug-2015

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media