
Sort & Slice: a simple and superior alternative to hash-based folding for extended-connectivity fingerprints

Abstract

Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods. In contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the L most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, L. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated substructure-pooling methods across distinct prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.


Scientific contribution

A general mathematical framework for the vectorisation of structural fingerprints called substructure pooling; and the technical description and computational evaluation of Sort & Slice, a conceptually simple and bit-collision-free method for the pooling of ECFP substructures that robustly and markedly outperforms classical hash-based folding at molecular property prediction.

Introduction

Extended-connectivity fingerprints and hash-based folding

The use of extended-connectivity fingerprints (ECFPs) is widespread in current cheminformatics and molecular machine learning. A modern and widely recognised technical description of ECFPs was given in 2010 by Rogers and Hahn [1], although the key ideas underlying ECFP-generation were already introduced by Morgan [2] in 1965. ECFPs have for instance been used successfully for ligand-based virtual screening [3], the prediction of the aqueous solubility of molecular compounds [4], the computational detection of cytotoxic substructures of molecules [5], the identification of binding targets of chemical compounds via similarity searching [6], and the prediction of quantum-chemical properties of small molecules [7]. One of the most typical use cases of ECFPs is as a feature-extraction technique for supervised molecular machine learning, i.e., as a method to convert molecules into binary feature vectors for a given downstream molecular property prediction task. For this purpose, ECFPs are popular featurisations thanks to their conceptual simplicity, high interpretability, and low computational cost. Moreover, a considerable corpus of recent literature suggests that ECFPs still regularly match or even surpass the predictive performance of differentiable feature-extraction methods based on state-of-the-art message-passing graph neural networks (GNNs) [8,9,10,11,12,13,14].

At its core, the ECFP algorithm can be formally described as a method, \(\varphi\), that transforms an input molecule, \(\mathcal {M}\), often represented as a SMILES string [15], into a set of hashed integer identifiers:

$$\begin{aligned} \varphi (\mathcal {M}) = \{\mathcal {J}_{1},\ldots , \mathcal {J}_{k}\} \,{=}{:}\,\mathcal {I}. \end{aligned}$$

Here, \(\mathcal {J}_{1},\ldots , \mathcal {J}_{k}\) lie in some large integer hash space such as \(\{1,\ldots ,2^{32}\}\). Each \(\mathcal {J}_i\) is constructed to correspond to a distinct circular chemical substructure detected within \(\mathcal {M}\) (notwithstanding rare hash collisions causing an identifier \(\mathcal {J}_i\) to be associated with multiple substructures [1, 16]). The set \(\mathcal {I}\) can intuitively be thought of as the set of circular chemical substructures present in \(\mathcal {M}\). The computational step:

$$\begin{aligned} \mathcal {M} \mapsto \mathcal {I} \subseteq \{1,\ldots ,2^{32}\} \end{aligned}$$

is what we refer to as substructure enumeration. It depends on two important hyperparameters: the maximal diameter, \(D \in \{0,2,4,6,\ldots \}\), up to which circular substructures are detected; and the chosen list of atomic invariants, A, used to distinguish the atoms in the input compound (such as the atomic number or whether the atom is part of a ring). The number of detected substructures, k, naturally varies with the input compound \(\mathcal {M}\), the maximal substructure diameter D, and the selected atomic invariants A.

For further computational processing, for example by a machine-learning system, the set of substructure identifiers \(\mathcal {I}\) is commonly converted into a high-dimensional binary vector, \(\mathcal {F} \in \{0,1\}^{L}\), via a simple hashing procedure. Let:

$$\begin{aligned} h: \{1,\ldots , 2^{32}\} \rightarrow \{1,\ldots , L\} \end{aligned}$$

be a common (arbitrary) hash function that compresses the hash space \(\{1,\ldots , 2^{32}\}\) into a much smaller space \(\{1,\ldots ,L\}\). Then \(\mathcal {F}\) is defined componentwise via:

$$\begin{aligned} \mathcal {F}_i = \left\{ \begin{array}{ll} 1 & \quad \exists \ \mathcal {J} \in \mathcal {I}: h(\mathcal {J}) = i, \\ 0 & \quad \text {else}. \\ \end{array} \right. \end{aligned}$$

The computational step:

$$\begin{aligned} \mathcal {I} \mapsto \mathcal {F} \in \{0,1\}^L \end{aligned}$$

is what we refer to as hash-based folding. It depends on the chosen fingerprint dimension, \(L \in \mathbb {N}\), as a hyperparameter; common choices for L are 1024 and 2048, and less often 512 or 4096. The binary vectorial components of \(\mathcal {F}\) indicate (notwithstanding hash collisions in \(\{1,\ldots ,L\}\)) the existence of particular substructure identifiers in \(\mathcal {I}\), which in turn indicate (again, with the exception of rare hash collisions [1, 16]) the existence of particular circular chemical substructures in the input compound \(\mathcal {M}\). The hashed feature vector \(\mathcal {F}\) thus encodes valuable structural information about \(\mathcal {M}\) and can, for instance, be fed into a machine-learning system for molecular property prediction.
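As a concrete illustration, both computational steps can be reproduced with RDKit. The following is a minimal sketch (note that RDKit parametrises these fingerprints by the radius D/2 rather than the diameter D, so a diameter of 4 corresponds to radius 2; the example compound is arbitrary):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# Substructure enumeration: the set I of hashed 32-bit integer identifiers.
sparse_fp = AllChem.GetMorganFingerprint(mol, radius=2)          # diameter D = 4
identifiers = set(sparse_fp.GetNonzeroElements().keys())

# Hash-based folding: the L-dimensional binary fingerprint F (here L = 1024).
folded_fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)

print(len(identifiers), folded_fp.GetNumOnBits())
```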

Research aims and contributions

The technical parallels between ECFPs and current message-passing GNNs [4, 7, 17,18,19,20,21,22] are striking. In both cases, a compound is first transformed into an (unordered) set-representation: for ECFPs, into a set of integer substructure identifiers, \(\mathcal {I}\); and for GNNs, into a set of initial and updated atom feature vectors. For both methods, a pooling operation then plays the crucial role of reducing the given set-representation to a compound-wide feature vector.

While considerable work has been done to investigate pooling methods for GNN architectures [23,24,25,26,27], almost no analogous research exists on substructure pooling for ECFPs or other structural fingerprints. Currently, hash-based folding is the default technique to vectorise the set \(\mathcal {I}\) for ECFPs in the literature [1, 3,4,5, 16, 28,29,30,31,32,33,34], although a few alternative hashing procedures have been explored by Probst and Reymond [35]. Despite its widespread use, hash-based folding comes with a considerable downside: distinct substructure identifiers in \(\mathcal {I}\) can be hashed to the same binary component of the output vector \(\mathcal {F}\). Such “bit collisions” necessarily occur when the fingerprint dimension, L, is smaller than the number of substructures detected in a set of training compounds, which is regularly the case. The ambiguity that bit collisions introduce into the hashed fingerprint compromises not only its interpretability but also its predictive performance in machine-learning applications.

In this work, we describe Sort & Slice, a very simple and surprisingly effective alternative to hash-based folding for the pooling of ECFP substructures. In a nutshell, Sort & Slice first sorts all substructures in the training set according to the respective number of training compounds in which they occur, and then “slices” away all but the L most frequent substructures which are subsequently used to generate an L-dimensional bit-collision-free binary fingerprint. The absence of bit collisions substantially improves the interpretability of Sort & Slice ECFPs compared to their hashed counterparts. One can further show that Sort & Slice almost exclusively selects only the most informative substructural features from an entropic point of view [36, 37]. The only other study we discovered in our literature search that explores a technique very similar to Sort & Slice appears to be the noteworthy work of MacDougall [38], which we acknowledge. Our study represents a more extensive and robust investigation of a closely related, but slightly further developed method that exhibits additional technical strengths. In summary, we aim to provide the following contributions:

  • We introduce a general mathematical definition of substructure pooling for structural fingerprints that subsumes hash-based folding, Sort & Slice, and a large number of other potential techniques under a single coherent framework.

  • We give a technical description of Sort & Slice as an easy-to-implement and bit-collision-free alternative to hash-based folding for ECFPs.

  • We show via a series of strict computational experiments that Sort & Slice robustly outperforms hash-based folding [1] as well as two advanced substructure-selection schemes (filtering [16] and mutual-information maximisation [36, 37]) across distinct ECFP-based molecular property prediction tasks, data splitting techniques, machine learning models, fingerprint diameters D, fingerprint dimensions L, and atomic invariants A; and that frequently the performance gains associated with Sort & Slice are surprisingly large.

  • We recommend that due to its simplicity, improved interpretability and superior predictive performance, Sort & Slice should canonically replace hash-based folding as the default method to vectorise ECFPs for supervised molecular machine learning.

Methodology

Substructure pooling: definition and setting

We start by introducing a general mathematical definition of substructure pooling that can be used in combination with ECFPs or other structural fingerprints.

Definition 1

(Substructure pooling) Let:

$$\begin{aligned} \mathfrak {J} = \{\mathcal {J}_{1},\ldots , \mathcal {J}_{m}\} \end{aligned}$$

be a finite (but potentially very large) set of m chemical substructures. The substructures \(\mathcal {J}_1,\ldots , \mathcal {J}_m\) could be specified via hashed integer identifiers, SMILES strings, molecular graphs, or any other computational representation. Now, let the set of all possible subsets of \(\mathfrak {J}\) (i.e., its power set) be denoted by:

$$\begin{aligned} P(\mathfrak {J}) \,{:}{=}\,\{\mathcal {I} \ \vert \ \mathcal {I} \subseteq \mathfrak {J} \}. \end{aligned}$$

A substructure-pooling method of dimension \(L \in \mathbb {N}\) for the chemical substructures in \(\mathfrak {J}\) is a set function:

$$\begin{aligned} \Psi : P(\mathfrak {J}) \rightarrow \mathbb {R}^L, \end{aligned}$$

i.e., a function that maps arbitrary subsets of \(\mathfrak {J}\) to L-dimensional real-valued vectors.

Note that the fact that \(\Psi\) is a set function implicitly presupposes that it is permutation-invariant, i.e., that its output does not depend on any arbitrary ordering of the substructures in the input set. One can further imagine \(\Psi\) to not necessarily be a fixed function; it could also be a trainable deep network.

Substructure pooling naturally appears in supervised molecular machine learning during the vectorisation of structural fingerprints. Consider a supervised molecular property prediction task specified by a training set \(\mathfrak {T}\) of n unique compounds:

$$\begin{aligned} \mathfrak {T} = \{\mathcal {M}_1,\ldots , \mathcal {M}_n\} \end{aligned}$$

and an associated function:

$$\begin{aligned} f_{\text {labels}}: \mathfrak {T} \rightarrow \mathbb {R} \end{aligned}$$

that assigns regression or classification labels to the training set. The training compounds \(\mathcal {M}_1,\ldots , \mathcal {M}_n\) could for instance be represented as SMILES strings or molecular graphs and are assumed to be elements of some larger space of chemical compounds \(\mathfrak {M}\) with:

$$\begin{aligned} \mathfrak {M} \supset \{\mathcal {M}_1,\ldots , \mathcal {M}_n\} = \mathfrak {T}. \end{aligned}$$

Now let:

$$\begin{aligned} \mathfrak {J} = \{\mathcal {J}_{1},\ldots , \mathcal {J}_{m}\} \end{aligned}$$

be a set representing m chemical substructures of interest. \(\mathfrak {J}\) could for instance be the set of hashed integer identifiers of all possible ECFP substructures with chosen atomic invariants A and maximal diameter D, or a set of 166 functional groups represented via their respective SMARTS strings [39] as with MACCS fingerprints [40]. Now let:

$$\begin{aligned} \varphi : \mathfrak {M} \rightarrow P(\mathfrak {J}) \end{aligned}$$

describe the substructure-enumeration algorithm of a structural fingerprint. Then \(\varphi\) maps a compound \(\mathcal {M} \in \mathfrak {M}\) to the subset of substructures in \(\mathfrak {J}\) that are contained in \(\mathcal {M}\). Via \(\varphi\) one can transform each training compound \(\mathcal {M}_i\) into a set-representation, \(\mathcal {I}_i\), consisting of \(k_i\) substructures:

$$\begin{aligned} \varphi (\mathcal {M}_i) = \{\mathcal {J}_{i,1},\ldots , \mathcal {J}_{i, k_i} \} \,{=}{:}\,\mathcal {I}_i \subseteq \mathfrak {J}. \end{aligned}$$

At this stage, one requires a substructure-pooling method:

$$\begin{aligned} \Psi : P(\mathfrak {J}) \rightarrow \mathbb {R}^L \end{aligned}$$

to transform each molecular set-representation \(\mathcal {I}_i\) into a real-valued vector:

$$\begin{aligned} \Psi (\mathcal {I}_i) \,{=}{:}\,\mathcal {F}_i \in \mathbb {R}^L. \end{aligned}$$

The vector representations \(\mathcal {F}_1,\ldots ,\mathcal {F}_n\) can then be fed into a standard machine learning model which can be trained to predict the labels specified by \(f_{\text {labels}}\). Notice that the substructure-pooling operator \(\Psi\) can be constructed by leveraging knowledge from \(\mathfrak {T}\) and \(f_{\text {labels}}\).

The problem of substructure pooling can easily be translated into a mathematical problem that resembles GNN pooling if one employs a substructure embedding of some dimension, w:

$$\begin{aligned} \gamma : \mathfrak {J} \rightarrow \mathbb {R}^{w}. \end{aligned}$$

Using \(\gamma\), one can transform substructure sets into vector sets:

$$\begin{aligned} \mathfrak {J} \supseteq \{\mathcal {J}_{1},\ldots , \mathcal {J}_{k}\} \mapsto \{\gamma (\mathcal {J}_{1}),\ldots , \gamma (\mathcal {J}_{k})\} \subset \mathbb {R}^w. \end{aligned}$$

Many GNN-pooling methods, such as summation, averaging, max-pooling, or the differentiable operator proposed by Navarin et al. [23], correspond to simple (graph-topology-independent) set functions that map vector sets to single vectors. All such functions can be repurposed to vectorise sets of embedded substructures. For example, by composing the sum operator \(\Sigma\) with \(\gamma\) we immediately gain a substructure-pooling method whose output dimension L is equal to the embedding-dimension w:

$$\begin{aligned} & \Psi : P(\mathfrak {J}) \rightarrow \mathbb {R}^w, \\ & \Psi (\{\mathcal {J}_{1},\ldots , \mathcal {J}_{k}\}) = \sum _{i = 1}^k \gamma (\mathcal {J}_{i}). \end{aligned}$$

Perhaps the simplest possible substructure embedding is given via one-hot encoding.

Definition 2

(One-hot encoding) Let \(u^w_{i} \in \mathbb {R}^w\) be the w-dimensional unit vector with 1 in its ith component and 0 everywhere else. Furthermore, let:

$$\begin{aligned} s: \{\mathcal {J}_{1},\ldots , \mathcal {J}_{w}\} \rightarrow \{1,\ldots ,w\} \end{aligned}$$

be a bijective function that imposes a strict total order on a set of w substructures \(\{\mathcal {J}_{1},\ldots , \mathcal {J}_{w}\}\) by assigning a unique rank to each substructure. Then we define the one-hot encoding associated with s as:

$$\begin{aligned} \gamma _s: \{\mathcal {J}_{1},\ldots , \mathcal {J}_{w}\} \rightarrow \mathbb {R}^{w}, \quad \gamma _s(\mathcal {J}) = u^w_{s(\mathcal {J})}. \end{aligned}$$

If the embedding dimension w is equal to the total number of considered substructures \(\vert \mathfrak {J} \vert = m\), it can be extremely large and additional dimensionality-reduction techniques might be required.
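As a small illustrative sketch (with hypothetical integer identifiers of our own choosing), summing the one-hot encodings of the substructures contained in a compound yields exactly the kind of binary indicator vector used throughout this work:

```python
import numpy as np

substructures = [17, 42, 99, 123]                       # the set {J_1, ..., J_w}, here w = 4
s = {J: rank for rank, J in enumerate(substructures)}   # an arbitrary strict total order (0-indexed)

def one_hot(J, w=len(substructures)):
    v = np.zeros(w)
    v[s[J]] = 1.0
    return v

# Sum pooling of the one-hot embeddings of a compound containing substructures {42, 123}:
print(sum(one_hot(J) for J in {42, 123}))               # -> [0. 1. 0. 1.]
```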

Finally, note that while in this work we focus on substructure pooling for binary structural fingerprints due to their simplicity and widespread use, it would be possible to further generalise Definition 1 to fingerprints with substructure counts by extending the domain of \(\Psi\) to multisets (i.e., sets with counts) of substructures instead of classical sets.

Investigated substructure-pooling techniques for ECFPs

We now go on to use the theoretical framework introduced in the previous section to describe four distinct substructure-pooling techniques for ECFPs that we investigate in our study. From here on, let:

$$\begin{aligned} \mathfrak {J} = \{\mathcal {J}_{1},\ldots , \mathcal {J}_{m}\} \subseteq \{1,\ldots ,2^{32}\} \end{aligned}$$

be the set of hashed integer identifiers of all possible ECFP substructures based on a predefined list of atomic invariants, A, and a fixed maximal diameter, \(D \in \{0,2,4,6,\ldots \}\). Further, let:

$$\begin{aligned} \mathfrak {T} = \{\mathcal {M}_1,\ldots , \mathcal {M}_n\} \subset \mathfrak {M} \end{aligned}$$

be a training set of n compounds (for example represented as SMILES strings) drawn from a larger space of chemical compounds, \(\mathfrak {M}\); let:

$$\begin{aligned} f_{\text {labels}}: \mathfrak {T} \rightarrow \mathbb {R} \end{aligned}$$

be a function that assigns regression or classification labels to the training set; and let:

$$\begin{aligned} \varphi : \mathfrak {M} \rightarrow P(\mathfrak {J}), \quad \varphi (\mathcal {M}) = \{ \mathcal {J}_1,\ldots ,\mathcal {J}_k \} \subseteq \mathfrak {J}, \end{aligned}$$

be the ECFP-substructure enumeration algorithm that maps a compound \(\mathcal {M}~\in ~\mathfrak {M}\) to the set of circular substructures \(\{ \mathcal {J}_1,\ldots ,\mathcal {J}_k \}\) in \(\mathfrak {J}\) that are chemically contained in \(\mathcal {M}\). Moreover, for each substructure identifier \(\mathcal {J} \in \mathfrak {J}\), we define its associated binary substructural feature in the training set via:

$$\begin{aligned} f_{\mathcal {J}}: \mathfrak {T} \rightarrow \{0,1\}, \quad f_{\mathcal {J}}(\mathcal {M}) = \left\{ \begin{array}{ll} 1 & \quad \mathcal {J} \in \varphi (\mathcal {M}), \\ 0 & \quad \text {else}, \\ \end{array} \right. \end{aligned}$$

and its support as the set of all training compounds that contain \(\mathcal {J}\),

$$\begin{aligned} \text {supp}_{\mathfrak {T}}(\mathcal {J}) \,{:}{=}\,\{ \mathcal {M} \in \mathfrak {T} \ \vert \ f_{\mathcal {J}}(\mathcal {M}) = 1 \}. \end{aligned}$$

Finally, let the set:

$$\begin{aligned} \mathfrak {J}_{\mathfrak {T}} \,{:}{=}\,\bigcup _{\mathcal {M} \in \mathfrak {T}} \varphi (\mathcal {M}) \end{aligned}$$

with cardinality \(\vert ~\mathfrak {J}_{\mathfrak {T}}~\vert ~\,{=}{:}\,~m_{\mathfrak {T}}\) represent all training substructures, i.e., all substructure identifiers in \(\mathfrak {J}\) that form part of at least one of the n training compounds.

Hash-based folding

The default approach to ECFP-based substructure pooling [1] makes use of a common (arbitrary) hash function:

$$\begin{aligned} h: \{1,\ldots , 2^{32}\} \rightarrow \{1,\ldots , L\}. \end{aligned}$$

Formally, one can employ h to define a substructure-pooling method:

$$\begin{aligned} \Psi : P(\mathfrak {J}) \rightarrow \mathbb {R}^L \end{aligned}$$

by setting its ith output component to:

$$\begin{aligned} &\Psi (\{\mathcal {J}_1,\ldots ,\mathcal {J}_k\})_i \\&\quad =\left\{ \begin{array}{ll} 1 & \quad \exists \ \mathcal {J} \in \{\mathcal {J}_1,\ldots ,\mathcal {J}_k\}: h(\mathcal {J}) = i, \\ 0 & \quad \text {else}. \\ \end{array} \right. \end{aligned}$$

The composite map:

$$\begin{aligned} \Psi \circ \varphi : \mathfrak {M} \rightarrow \mathbb {R}^L \end{aligned}$$

transforms an input compound \(\mathcal {M} \in \mathfrak {M}\) into an L-dimensional binary vector whose ith component is 1 if and only if (at least) one of the substructures in \(\varphi (\mathcal {M}) = \{\mathcal {J}_1,\ldots ,\mathcal {J}_k\}\) gets hashed to the integer i. This substructure-pooling method is straightforward to describe and implement. However, hash collisions in \(\{1,\ldots ,L\}\) necessarily start to degrade its interpretability and predictive performance if L becomes smaller than \(m_{\mathfrak {T}}\). Note that \(\Psi\) is independent of \(\mathfrak {T}\) and \(f_{\text {labels}}\).
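A minimal Python sketch of this pooling operator is given below (the hash h shown here is purely illustrative; RDKit's internal folding scheme differs in detail):

```python
def fold(identifiers, L=1024):
    """Hash-based folding: map a set of 32-bit substructure identifiers
    to an L-dimensional binary fingerprint (bit collisions are possible)."""
    fingerprint = [0] * L
    for J in identifiers:
        fingerprint[J % L] = 1   # illustrative hash h(J) = (J mod L)
    return fingerprint

# Example: with L = 16, identifiers 7 and 23 collide and light up the same bit.
print(sum(fold({3, 7, 23}, L=16)))   # -> 2 instead of 3
```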

Sort & Slice

Let:

$$\begin{aligned} c: \mathfrak {J}_{\mathfrak {T}} \rightarrow \{1,\ldots ,n\}, \quad c(\mathcal {J}) = \vert \text {supp}_{\mathfrak {T}}(\mathcal {J}) \vert , \end{aligned}$$

be a function that assigns to every training substructure \(\mathcal {J} \in \mathfrak {J}_{\mathfrak {T}}\) the number of training compounds in which it is contained. Then we use c to define a strict total order \(\prec\) on \(\mathfrak {J}_{\mathfrak {T}}\). For \(\mathcal {J}, \tilde{\mathcal {J}} \in \mathfrak {J}_{\mathfrak {T}}\), we write \(\mathcal {J} \prec \tilde{\mathcal {J}}\) if:

$$\begin{aligned} c(\mathcal {J})< c(\tilde{\mathcal {J}}) \ \text {or} \ [ c(\mathcal {J}) = c(\tilde{\mathcal {J}}) \ \text {and} \ \mathcal {J} < \tilde{\mathcal {J}} ]. \end{aligned}$$

The order \(\prec\) sorts substructures according to their frequencies in the training set, whereby ties are broken using the (arbitrary) order defined by the integer identifiers themselves. Let:

$$\begin{aligned} s: \mathfrak {J}_{\mathfrak {T}} \rightarrow \{1,\ldots ,m_{\mathfrak {T}}\} \end{aligned}$$

be a bijective sorting function that assigns the ranks determined by \(\prec\), whereby rank 1 is assigned to the most frequent substructure. Now let:

$$\begin{aligned} \gamma _s: \mathfrak {J} \rightarrow \mathbb {R}^{m_{\mathfrak {T}}}, \quad \gamma _s(\mathcal {J}) = \left\{ \begin{array}{ll} u^{m_{\mathfrak {T}}}_{s(\mathcal {J})} & \quad \mathcal {J} \in \mathfrak {J}_{\mathfrak {T}}, \\ (0,\ldots ,0) & \quad \text {else}, \\ \end{array} \right. \end{aligned}$$

be an embedding that assigns a one-hot encoding sorted by s (see Definition 2) to training substructures and a vector entirely composed of zeros to substructures that do not appear in the training set. Based on \(m_{\mathfrak {T}}\) and the desired fingerprint length L, we further define a slicing function:

$$\begin{aligned} &\eta _{m_{\mathfrak {T}},L}: \mathbb {R}^{m_{\mathfrak {T}}} \rightarrow \mathbb {R}^L, \\&\eta _{m_{\mathfrak {T}},L}(v_1,\ldots ,v_{m_{\mathfrak {T}}}) = \left\{ \begin{array}{ll} (v_1,\ldots ,v_L) & L \le m_{\mathfrak {T}}, \\ (v_1,\ldots ,v_{m_{\mathfrak {T}}},0,\ldots ,0) & L> m_{\mathfrak {T}}. \\ \end{array} \right. \end{aligned}$$

Then the Sort & Slice substructure-pooling operator:

$$\begin{aligned} \Psi : P(\mathfrak {J}) \rightarrow \mathbb {R}^L \end{aligned}$$

is given by:

$$\begin{aligned} \Psi (\{\mathcal {J}_1,\ldots ,\mathcal {J}_k\}) = \eta _{m_{\mathfrak {T}},L} \left( \sum _{i = 1}^k \gamma _s(\mathcal {J}_i) \right) . \end{aligned}$$

The summation:

$$\begin{aligned} \sum _{i = 1}^k \gamma _s(\mathcal {J}_i) \in \{0,1\}^{m_{\mathfrak {T}}} \end{aligned}$$

corresponds to a sorted binary vector, each of whose components indicates the existence of a particular training substructure within the input set. Substructures that occur in more training compounds appear earlier in the vector. In the usual case where \(L \le m_{\mathfrak {T}}\), the function \(\eta _{m_{\mathfrak {T}},L}\) trims the dimensionality of the vector to L by 'slicing away' the less frequent substructures. In the unusual case where L is specifically demanded to be larger than \(m_{\mathfrak {T}}\), all training substructures are contained in the final representation and it is simply padded with additional zeros.

The Sort & Slice operator \(\Psi\) is dependent on \(\mathfrak {T}\) but not on \(f_{\text {labels}}\). It can be interpreted as a simple unsupervised feature selection technique. In simplified terms, the composite map:

$$\begin{aligned} \Psi \circ \varphi : \mathfrak {M} \rightarrow \mathbb {R}^L \end{aligned}$$

outputs a binary L-dimensional fingerprint that represents the presence or absence of the L most frequent training substructures in the input compound. In particular, this fingerprint does not suffer from bit collisions; each vectorial component corresponds to a unique ECFP-substructure identifier. This clarity comes at the cost of losing the information contained in the less frequent training substructures that are sliced away.
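To make the procedure concrete, the following is a minimal self-contained sketch of Sort & Slice operating directly on sets of integer identifiers (our repository contains the full calibrated featurisation function for RDKit mol objects; the names and identifier sets used here are illustrative):

```python
from collections import Counter
import numpy as np

def fit_sort_and_slice(training_id_sets, L):
    """Calibrate Sort & Slice on a list of training substructure-identifier sets;
    return the L most frequent training substructures in sorted order."""
    counts = Counter()
    for id_set in training_id_sets:
        counts.update(id_set)                  # each compound counted at most once per substructure
    # sort by descending training frequency; break ties via the integer identifiers themselves
    ranked = sorted(counts, key=lambda J: (-counts[J], -J))
    return ranked[:L]

def sort_and_slice_fingerprint(id_set, selected_substructures):
    """Bit-collision-free binary fingerprint over the selected substructures."""
    return np.array([1 if J in id_set else 0 for J in selected_substructures])

# Hypothetical identifier sets for three training compounds:
train = [{11, 52, 973}, {11, 52}, {11, 405}]
selected = fit_sort_and_slice(train, L=2)                   # -> [11, 52]
print(sort_and_slice_fingerprint({52, 405}, selected))      # -> [0 1]
```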

Note however that in real-world chemical data sets the vast majority of ECFP substructures in the training set occur in only a few compounds, and almost all substructures occur in fewer than half of all compounds. For instance, the well-known lipophilicity data set from MoleculeNet [41] encompasses \(n = 4200\) compounds which together lead to the enumeration of \(m_{\mathfrak {T}} = 16{,}903\) unique ECFP substructures with a maximal diameter of \(D = 4\). Of these 16,903 substructures, 52.4 % occur only in a single compound, 87.5 % occur in 10 or fewer compounds, 98.3 % occur in 100 or fewer compounds, and almost all (99.92 %) occur in fewer than half of all compounds. This frequency distribution is typical and highly stable across many chemical data sets. In a machine-learning context, slicing away infrequent training substructures should thus lead to very little information loss as it corresponds to the removal of almost-constant features.

More specifically, Sort & Slice has a natural interpretation in the context of information theory: the empirical information content (i.e., Shannon entropy [36]) of a binary feature column \((f_{\mathcal {J}}(\mathcal {M}_i))_{i=1}^n\) peaks if the substructure \(\mathcal {J}\) is contained in exactly half of all training compounds. If no substructure occurs in more than half of all training compounds, which is almost perfectly fulfilled for common data sets, then sorting substructures according to their training-set frequencies becomes equivalent to sorting columns of the training feature matrix according to their empirical information content. Sort & Slice therefore automatically generates fingerprints that essentially only encompass the most informative training substructures from an entropic point of view, while being much simpler to describe, implement and interpret than an approach explicitly built on Shannon entropy.
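Concretely, the empirical Shannon entropy of a binary substructural feature that is present in a fraction \(p \in (0,1)\) of the training compounds is:

$$\begin{aligned} H(p) = -p\log _2 p - (1-p)\log _2 (1-p), \end{aligned}$$

which increases strictly monotonically on \((0, 1/2]\) and attains its maximum of one bit at \(p = 1/2\). As long as no substructure occurs in more than half of all training compounds, ranking substructures by training-set frequency therefore coincides with ranking them by empirical information content.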

While theoretically Sort & Slice might still lead to the inclusion of a tiny number of uninformative high-frequency training substructures, we decided against also slicing these away, as this would have unnecessarily complicated our method for negligible expected performance gains. For example, out of the 16,903 ECFP substructures with maximal diameter \(D=4\) in the MoleculeNet lipophilicity data set [41], only 14 substructures occur in more than half of all compounds, and only 3 occur in 90 % or more of all compounds.

Finally, note that we do not dare to claim that the principles behind Sort & Slice were necessarily first discovered and utilised exclusively by us; because of its natural simplicity, variations of Sort & Slice might have already been experimented with by other researchers in the past. Yet, to the best of our knowledge, this study so far represents the only thorough and systematic theoretical and computational investigation of Sort & Slice, with the only exception being the interesting work by MacDougall [38], whom we acknowledge and who appears to propose and investigate a technique almost identical to Sort & Slice. However, according to our understanding, this noteworthy alternative algorithm does not appear to include a mechanism to break ties between substructures that are contained in the same number of training compounds. In [38], infrequent substructures are thus sliced away in chunks, which limits control over the fingerprint dimension L.

A simple, self-contained and fast method for creating a calibrated featurisation function that transforms RDKit mol objects into vectorial ECFPs via Sort & Slice, using only NumPy [42] and RDKit [43], can be found in our public code repository.

Filtering and mutual-information maximisation

Gütlein and Kramer [16] conducted one of the few other existing studies that systematically explore alternative substructure-pooling strategies for ECFPs to circumvent the problem of bit collisions. They propose an advanced supervised substructure-selection scheme as a pooling method to construct what they refer to as filtered fingerprints. Their method was originally implemented in Java; we reimplemented a version of it in Python and included it in our study for comparison. In the first step of filtering, training substructures that occur in only a single training compound are removed. In the second step, a graph-theoretic technique is applied that reduces feature redundancy by removing all training substructures that contain smaller training substructures associated with the same set of training compounds. In the third and last step, the L remaining substructural training features that show the strongest statistical dependence on the training label, as quantified by a \(\chi ^2\) test [44], are selected for the final fingerprint.

In addition to filtering, we further implemented another supervised substructure selection scheme based on mutual-information maximisation (MIM) [36, 37]. In the first step of this alternative algorithm, feature redundancy is reduced by randomly choosing two training substructures that are contained in the same set of training compounds and then randomly removing one of the two substructures. This is repeated until no two training substructures match the same set of training compounds. After this, the L remaining substructural training features that exhibit the highest mutual information with the training label are chosen for the final fingerprint.

Note that since filtering as well as MIM represent supervised feature selection methods, they are dependent on both \(\mathfrak {T}\) and \(f_{\text {labels}}\). A detailed technical description of filtering and MIM can be found in Appendix A.

Categories of substructure-pooling techniques

As has been shown, substructure pooling as outlined in Definition 1 encompasses a large number of potential techniques. In this study, we investigate one data-set agnostic substructure-pooling procedure (classical hash-based folding) and three data-set dependent techniques (Sort & Slice, filtering, MIM). Our data-set dependent techniques are all based on either supervised or unsupervised feature selection. A graphical overview of the investigated methods can be found in Fig. 1.

Fig. 1

Schematic overview of the vectorisation of ECFPs via the four investigated substructure-pooling techniques which in general lead to four different final representations. As before, \(\mathfrak {T} = \{\mathcal {M}_1,\ldots , \mathcal {M}_n\}\) represents a given set of n training compounds and \(f_{\text {labels}}: \mathfrak {T} \rightarrow \mathbb {R}\) an associated labelling function that assigns regression or classification labels to the training set

Note that there exist many more possibilities for substructure pooling, including techniques that are neither based on hashing nor on substructure selection. However, to the best of our knowledge, such methods remain entirely unexplored. For instance, an interesting avenue for future research might be the investigation of differentiable substructure-pooling operators:

$$\begin{aligned} \Psi _{\theta }: P(\mathfrak {J}) \rightarrow \mathbb {R}^L \end{aligned}$$

of the form:

$$\begin{aligned} \Psi _{\theta }(\{\mathcal {J}_1,\ldots ,\mathcal {J}_k\}) = \Phi _{\theta }(\{\gamma (\mathcal {J}_{1}),\ldots , \gamma (\mathcal {J}_{k})\}) \end{aligned}$$

whereby:

$$\begin{aligned} \gamma : \mathfrak {J} \rightarrow \mathbb {R}^{w} \end{aligned}$$

could be a chemically meaningful substructure embedding and:

$$\begin{aligned} \Phi _{\theta }: \{ A \subset \mathbb {R}^w \ \vert \ A \text { is finite} \} \rightarrow \mathbb {R}^L \end{aligned}$$

could be a trainable deep learning architecture designed to operate on finite sets of real-valued vectors [45].

Experimental setting

All experiments were implemented in Python. We computationally evaluated the relative performance of (i) traditional hash-based folding, (ii) Sort & Slice, (iii) filtering and (iv) MIM using five practically relevant molecular property prediction data sets that were chosen to cover a diverse set of chemical regression and classification tasks: (1) the prediction of lipophilicity [41], (2) aqueous solubility [46], (3) Ames mutagenicity [47], and (4) binding affinity for SARS-CoV-2 main protease (experimentally measured via a fluorescence-based assay) [48]. We also included (5) a highly imbalanced LIT-PCBA virtual screening data set for estrogen receptor \(\alpha\) antagonism [49], whereby we used the full raw data rather than the preprocessed unbiased splits provided in [49].

Note that the five selected data sets reflect a variety of important real-world applications. For instance, the included Ames mutagenicity data set by [47] serves as a benchmark for the computational prediction of the outcome of the Ames test, which is a common biological assay to examine the mutagenic effects of chemical compounds. Ames mutagenicity testing is widely used in pharmaceutical and toxicological research; machine learning approaches that can predict the outcome of the Ames test for novel compounds are thus of strong practical interest.

All five data sets consisted of SMILES strings which we algorithmically standardised and desalted using the ChEMBL structure pipeline [50]. This step also removed solvents and isotopic information. SMILES strings that generated error messages upon being turned into an RDKit mol object were deleted. Afterwards, if two standardised SMILES strings were found to be duplicates, one was deleted uniformly at random along with its training label. The only two data sets that contained noteworthy numbers of duplicates were the aqueous solubility and the LIT-PCBA data set; for these two cases, removing duplicates reduced the number of compounds from 9978 to 9821, and from 5045 to 3921 respectively. In the aqueous solubility data set, we further detected a few instances where SMILES strings appeared to encode several disconnected fragments; we also deleted such occurrences which reduced the number of compounds from 9821 to the final size of 9335. The five standardised and cleaned data sets are summarised in Table 1.

Table 1 Overview of the five molecular property prediction data sets used for our computational experiments

As a data splitting strategy, we implemented two-fold cross validation repeated with three random seeds for all data sets. Thus, each model was separately trained and tested \(2 * 3 = 6\) times. We ran two distinct versions of our cross validation scheme for all experiments, based on either scaffold [51] or random data splitting (stratified random splitting for the imbalanced LIT-PCBA data set). Scaffold splitting guarantees that the scaffold of each training-set compound is distinct from the scaffold of each test-set compound. This creates a distributional shift between training and test set that leads to a more challenging prediction task. Performance results were recorded as the mean and standard deviation over the 6 splits, using the mean absolute error (MAE) for the three regression data sets, the area under the receiver operating characteristic curve (AUROC) for the balanced Ames mutagenicity classification data set, and the area under the precision recall curve (AUPRC) for the imbalanced LIT-PCBA virtual screening data set.
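A minimal sketch of how such a scaffold-disjoint split can be generated with RDKit's Bemis-Murcko scaffolds is given below (the grouping heuristic shown is one straightforward choice and may differ in detail from our exact pipeline):

```python
import random
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.5, seed=0):
    """Group compounds by Bemis-Murcko scaffold and assign whole scaffold groups
    to the test set until the desired fraction is reached; the remaining groups
    form the training set, so no scaffold appears in both sets."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(idx)
    scaffolds = list(groups)
    random.Random(seed).shuffle(scaffolds)
    test_idx, train_idx = [], []
    n_test = int(test_fraction * len(smiles_list))
    for scaffold in scaffolds:
        (test_idx if len(test_idx) < n_test else train_idx).extend(groups[scaffold])
    return train_idx, test_idx
```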

As machine-learning models, we selected random forests (RFs) and multilayer perceptrons (MLPs). Note that including at least two distinct machine-learning predictors in our experiments is important to provide evidence that the advantage of Sort & Slice is not overly dependent on any specific machine-learning model. The RFs were implemented in scikit-learn [52] with default hyperparameters, with the exception of max_features for RF regressors, which was set to "sqrt" instead of the default 1.0 to add randomness. The MLPs were implemented in PyTorch [53]; our MLP architecture consisted of five hidden layers, each with 512 neurons, an additive bias vector, ReLU activations, batch normalisation [54], and a dropout rate [55] of 0.25. For MLP training we employed an initial learning rate of \(10^{-3}\), a learning rate decay factor of 0.98 per epoch until a learning rate of \(10^{-5}\) was reached, 250 training epochs, a batch size of 64, and AdamW optimisation [56] with a weight decay factor of 0.1. For MLP regressors we used identity output activations and a mean squared error loss; for MLP classifiers we used sigmoidal output activations and a binary cross-entropy loss. All MLPs were trained on a single NVIDIA GeForce RTX 3060 GPU.
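A minimal PyTorch sketch consistent with the MLP architecture and optimiser settings described above is given below (the exact layer ordering within each hidden block is one plausible arrangement; the training loop is omitted):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Five hidden layers of 512 neurons each, with bias, ReLU, batch norm and dropout 0.25."""
    def __init__(self, in_dim, out_dim=1, hidden_dim=512, n_hidden=5, dropout=0.25):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(dim, hidden_dim, bias=True),
                       nn.BatchNorm1d(hidden_dim),
                       nn.ReLU(),
                       nn.Dropout(dropout)]
            dim = hidden_dim
        layers.append(nn.Linear(dim, out_dim))  # identity output for regression; apply a sigmoid for classification
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLP(in_dim=1024)
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimiser, gamma=0.98)  # stepped once per epoch down to 1e-5
```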

ECFPs as implemented in RDKit [43] by default use a list of six standard atomic invariants, A, including for instance the atomic number and whether the atom is part of a ring. Alternatively, RDKit also provides a list of six binary pharmacophoric invariants, \(\tilde{A}\), which more explicitly reflect the function an atom might play in pharmacological chemistry. These features include, for example, whether the atom is a halogen and whether it is a hydrogen-bond acceptor. When using pharmacophoric invariants \(\tilde{A}\) instead of standard invariants A, one speaks of functional-connectivity fingerprints (FCFPs) instead of ECFPs. The selected diameter D is usually appended to the fingerprint abbreviation: for example, ECFP4 indicates a diameter of \(D = 4\). Both ECFPs and FCFPs optionally allow for the stereochemical distinction between atoms with respect to tetrahedral R/S chirality; we decided to include this additional invariant in our experiments, in part motivated by the fundamental effects chirality is known to have in certain pharmacological settings [57], although it should be noted that chiral annotations may exhibit a certain level of noise in publicly curated data sets.
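In RDKit, switching between these fingerprint variants amounts to toggling two flags of the Morgan-fingerprint function (a minimal sketch; D = 4 corresponds to radius 2 and the example compound is arbitrary):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("C[C@H](N)C(=O)O")  # L-alanine, contains a tetrahedral chiral centre

# ECFP4 with standard atomic invariants A and tetrahedral chirality (L = 2048):
ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048,
                                              useChirality=True)

# FCFP4 with pharmacophoric ("feature") invariants and tetrahedral chirality:
fcfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048,
                                              useFeatures=True, useChirality=True)
```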

For each of 20 modelling scenarios determined by selecting one of the five data sets, a data splitting technique (random or scaffold), and a machine-learning model (RF or MLP), we conducted a thorough and extensive investigation of the ECFP-hyperparameter space. In each of the 20 cases, we used twofold cross validation repeated with 3 random seeds to evaluate 96 distinct binary vectorial fingerprints generated by exhaustively combining three maximal substructure diameters \(D \in \{2, 4, 6\}\), two types of atomic invariants \(A \in \{\text {standard ECFP, pharmacophoric FCFP}\}\), four fingerprint lengths \(L \in \{512, 1024, 2048, 4096\}\), and four substructure-pooling methods (hash-based folding, Sort & Slice, filtering, or MIM). In total, the application of this robust combinatorial methodology resulted in the training of \(20 * 96 * 2 * 3 = 11{,}520\) separate machine-learning models, half of which were deep-learning models.

Results and discussion

Detailed computational results for the lipophilicity prediction task are visualised in Fig. 2. A high-level summary of the results for all five data sets is depicted in Fig. 3. Further detailed results for the remaining four data sets can be found in Figs. 4, 5, 6 and 7 in Appendix B.

Fig. 2

Predictive performance of the four investigated substructure-pooling methods (indicated by colours) for the lipophilicity regression data set using random and scaffold data splitting techniques, machine-learning models and ECFP hyperparameters. Each bar shows the average mean absolute error (MAE) of the associated model across twofold cross validation repeated with three random seeds. The error bar length equals two standard deviations of the performance measured over the \(2 * 3 = 6\) trained models

Fig. 3

Boxplots visualising the predictive performance of the four investigated substructure-pooling methods (indicated by colours) for 20 distinct modelling scenarios differing by data sets, data splitting techniques, and machine-learning models. Each boxplot summarises 24 distinct performance results (each computed as an average over twofold cross validation repeated with 3 random seeds) generated by combining a respective substructure-pooling method with 24 distinct ECFP-hyperparameter settings. The exhaustively explored ECFP-hyperparameter grid is given by three maximal substructure diameters \(D \in \{2, 4, 6\}\), two lists of atomic invariants \(A \in \{\text {standard ECFP, pharmacophoric FCFP}\}\), and four fingerprint dimensions \(L \in \{512, 1024, 2048, 4096\}.\)

Overall, our numerical observations indicate the robust superiority of Sort & Slice over hash-based folding. For instance, the Sort & Slice version of the popular 1024-bit ECFP4 surpasses the predictive performance of the hashed version in all but a few cases. For example, Fig. 2 reveals that simply replacing hash-based folding with Sort & Slice when using a 1024-bit ECFP4 with an MLP for lipophilicity prediction with a random data split leads to a rather remarkable relative MAE-improvement of \(11.37\%\). The results for Ames mutagenicity prediction in Fig. 6 and for the virtual screening task in Fig. 7 also suggest that the advantage of Sort & Slice over hash-based folding remains stable in both balanced and highly imbalanced classification scenarios.

Parts of the detailed results in Figs. 2 and 4–7 suggest that the improvements achieved via Sort & Slice over hash-based folding tend to become more pronounced as the fingerprint dimension L decreases, the maximal substructure diameter D increases, and when standard atomic invariants are used instead of pharmacophoric invariants. Hashed ECFPs tend to exhibit an increasing number of bit collisions for training compounds as L decreases relative to the number \(m_{\mathfrak {T}}\) of substructures detected in the training set. The size of \(m_{\mathfrak {T}}\) in turn tends to increase with D and when switching from abstract pharmacophoric to more distinctive standard atomic invariants. An increased number of bit collisions associated with certain areas of the ECFP-hyperparameter space for a given training data set might thus degrade the performance of hashed ECFPs relative to their Sort & Slice counterparts that by design are always free of colliding bits. In particular, the ratio \(\frac{m_{\mathfrak {T}}}{L}\) may represent an interesting predictor for the advantage of Sort & Slice over hash-based folding. In the limiting case where this ratio tends to 0, hashed ECFPs and Sort & Slice ECFPs tend to become identical for training compounds (up to the arbitrary ordering of vectorial components induced by the hash function, which should be irrelevant for predictive performance). On the other hand, as this ratio grows, emerging bit collisions in the hashed ECFP might gradually start to negatively impact its predictive power while the Sort & Slice ECFP only retains the most frequent substructures without incurring bit collisions.

A natural question to ask in this context is whether the predictive advantage of Sort & Slice is merely a result of the general avoidance of bit collisions; or if, and to what extent, the specific unsupervised substructure-selection scheme underlying Sort & Slice, i.e., the natural removal of low-variance features via the exclusion of substructures with low training-set frequency, contributes to the performance gain. Surprisingly, the median performance metrics represented by the central horizontal lines in the boxplots in Fig. 3 indicate that Sort & Slice not only surpasses hash-based folding, but it also consistently outperforms filtering and MIM, two advanced supervised feature-selection schemes that just like Sort & Slice are entirely free of bit collisions. More specifically, comparing MIM with hash-based folding delivers mixed results, with neither method appearing to be clearly superior to the other; filtering appears to mostly outperform both MIM and hash-based folding, but is in turn outperformed by Sort & Slice. According to our observations, these findings tend to remain stable, with the median performance metrics in Fig. 3 showing low variance as we increase the number of repetitions with distinct random seeds in our experiments.

These observations reveal two points: firstly, the performance gains provided by Sort & Slice over hash-based folding are not purely the result of avoiding bit collisions via substructure selection; but the specific substructure-selection strategy does indeed make an important difference for downstream predictive performance. Secondly, and perhaps remarkably, the extremely simple prevalence-based substructure-selection scheme implemented by Sort & Slice outperforms the more technically advanced selection methods underlying filtering and MIM. This is despite the fact that, unlike MIM and filtering, Sort & Slice is an unsupervised technique that does not utilise any information associated with the training label. It appears surprising that Sort & Slice would beat technically sophisticated supervised feature selection methods such as filtering or MIM that select substructures using task-specific information. While the reasons for this are not obvious, it is conceivable that exploiting the training label when selecting substructures could potentially harm the generalisation abilities of a machine-learning system by contributing to its risk of overfitting to the training data (just like any other aspect of supervised model training). Another reason may be related to the natural frequency distribution of circular substructures in chemical data sets; the tendency for almost all ECFP substructures to appear in less than half of all training compounds causes the simple Sort & Slice algorithm to essentially coincide with a more complex and potentially more powerful feature-selection strategy based on the maximisation of information entropy. Unlike hash-based folding which is data-set agnostic, Sort & Slice exploits structural information from the training compounds and could thus be interpreted as a simple and effective way to calibrate ECFPs to a specific region of chemical space.

Conclusions

We have introduced a general mathematical framework for the vectorisation of structural fingerprints via a formal operation referred to as substructure pooling. For ECFPs, substructure pooling is a natural analogue to node-feature-vector pooling in message-passing GNN architectures. Unlike GNN pooling, ECFP-based substructure pooling is almost always performed using one particular method (hash-based folding); other substructure-pooling strategies for ECFPs remain largely unexplored. Our proposed substructure pooling framework encompasses hash-based folding, but also a broad spectrum of alternative methods that could be explored, including supervised and unsupervised substructure selection strategies. Trainable deep learning architectures operating on sets of chemically meaningful substructure embeddings also fit into the presented methodology; the investigation of such neural techniques, while out of the scope of this study, might form an interesting avenue for future research. Another valuable future research project could be to systematically compare the performance of Sort & Slice ECFPs with other commonly used molecular featurisation methods such as GNNs [8,9,10,11,12,13,14] and molecular descriptors [58,59,60].

As part of our work, we have mathematically described and computationally evaluated Sort & Slice, a straightforward alternative to hash-based folding for the pooling of ECFP substructures that is very easy to implement and interpret. Sort & Slice can be seen as a simple unsupervised feature-selection scheme that generates a binary vectorial fingerprint that is free of bit collisions and based only on the circular substructures that occur most frequently in the training compounds. Under realistic theoretical assumptions that are closely approximated by common chemical data sets, Sort & Slice can be shown to select automatically only the most informative ECFP substructures from an information-theoretic perspective.

An extensive series of strictly conducted computational experiments indicates that Sort & Slice tends to produce better (and sometimes substantially better) downstream performance than hash-based folding at ECFP-based molecular property prediction. Our numerical results suggest that the predictive advantage of Sort & Slice is highly robust and exists across a diverse range of molecular regression and classification tasks, balanced and imbalanced classification settings, distinct data splitting techniques, machine-learning models and ECFP hyperparameters. Perhaps surprisingly, Sort & Slice not only seems to outperform hash-based folding but also two technically sophisticated supervised substructure selection schemes [16, 37]. This indicates that, in spite of its extreme simplicity, sorting ECFP substructures according to their relative prevalence in a given region of chemical space of interest and then discarding infrequent substructures is a (maybe unexpectedly) strong feature selection strategy. Based on the robust predictive advantage of Sort & Slice, its technical simplicity, its ease of implementation, and its ability to improve fingerprint interpretability by avoiding bit collisions, we recommend that it should canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.

Appendix A: Technical descriptions of filtering and mutual-information maximisation

Filtering

We assume that we are given a classification task with binary labels:

$$\begin{aligned} f_{\text {labels}}: \mathfrak {T} \rightarrow \{0,1\}. \end{aligned}$$

If instead we are given a regression task with continuous labels, we binarise them by setting all labels below or above the label median to 0 or 1 respectively. Now let:

$$\begin{aligned} p_{\chi ^2}: \mathfrak {J}_{\mathfrak {T}} \rightarrow [0,1], \end{aligned}$$

be a function that assigns to each training substructure \(\mathcal {J} \in \mathfrak {J}_{\mathfrak {T}}\) its p-value \(p_{\chi ^2}(\mathcal {J}) \in [0,1]\) in a statistical \(\chi ^2\) independence test [44] based on the statistical sample \((f_{\text {labels}}(\mathcal {M}_i), f_{\mathcal {J}}(\mathcal {M}_i))_{i = 1}^n\) derived from the training compounds. The function \(p_{\chi ^2}\) defines a strict total order \(\prec\) on \(\mathfrak {J}_{\mathfrak {T}}\). For \(\mathcal {J}, \tilde{\mathcal {J}} \in \mathfrak {J}_{\mathfrak {T}}\), we say \(\mathcal {J} \prec \tilde{\mathcal {J}}\) if:

$$\begin{aligned} p_{\chi ^2}(\mathcal {J})> p_{\chi ^2}(\tilde{\mathcal {J}}) \ \text {or} \ \left[ p_{\chi ^2}(\mathcal {J}) = p_{\chi ^2}(\tilde{\mathcal {J}}) \ \text {and} \ \mathcal {J} < \tilde{\mathcal {J}} \right] . \end{aligned}$$

The smaller the p-value, the larger the substructure identifier according to \(\prec\), whereby ties are broken using the (arbitrary) order defined by the integer identifiers themselves.

Based on Gütlein and Kramer's method, we go on to select a set of L substructures \(\mathfrak {J}_L\) from \(\mathfrak {J}_{\mathfrak {T}}\) using the following strategy:

  • Initialisation: \(\mathfrak {J}_L \,{:}{=}\,\mathfrak {J}_{\mathfrak {T}}.\)

  • Step 1: A substructure \(\mathcal {J} \in \mathfrak {J}_L\) that fulfils \(\vert \text {supp}_{\mathfrak {T}}(\mathcal {J}) \vert = 1\) is randomly chosen and removed. This is repeated until all substructures in \(\mathfrak {J}_L\) appear in at least two training compounds or until \(\vert \mathfrak {J}_L \vert = L\).

  • Step 2: A substructure \(\mathcal {J} \in \mathfrak {J}_L\) that is non-closed is randomly chosen and removed. This is repeated until all remaining substructures in \(\mathfrak {J}_L\) are closed or until \(\vert \mathfrak {J}_L \vert = L\). A substructure \(\mathcal {J} \in \mathfrak {J}_L\) is called non-closed if there exists another substructure \(\tilde{\mathcal {J}} \in \mathfrak {J}_L\) such that \(\text {supp}_{\mathfrak {T}}(\mathcal {J}) = \text {supp}_{\mathfrak {T}}(\tilde{\mathcal {J}})\) and \(\mathcal {J}\) contains a proper subgraph that is isomorphic to \(\tilde{\mathcal {J}}\).

  • Step 3: The lowest-ranking element in \(\mathfrak {J}_L\) with respect to the order \(\prec\) is chosen and removed. This is repeated until \(\vert \mathfrak {J}_L \vert = L\).

Step 1 removes uninformative substructures that occur in only a single compound. Step 2 represents a graph-theoretic attempt to reduce feature redundancy via the removal of substructures that contain smaller substructures that match the same set of training compounds. Finally, Step 3 selects the L remaining substructural features that show the strongest statistical dependence on the training label as quantified by a \(\chi ^2\) test.
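The \(\chi ^2\)-based ranking used in Step 3 can, for instance, be computed as follows (a minimal sketch using SciPy on binary feature and label columns; the function name is ours and the feature is assumed to be non-constant):

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_p_value(feature_column, label_column):
    """p-value of a chi-squared independence test between a binary substructural
    feature column and the binary training labels (smaller p-value = stronger dependence)."""
    table = np.zeros((2, 2))
    for f, y in zip(feature_column, label_column):
        table[int(f), int(y)] += 1      # 2 x 2 contingency table of feature vs. label
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

# Substructures with smaller p-values rank higher under the order defined above
# and are therefore the ones retained by Step 3.
```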

Using \(\mathfrak {J}_L\) and an arbitrary bijective sorting function:

$$\begin{aligned} s: \mathfrak {J}_L \rightarrow \{1,\ldots ,L\} \end{aligned}$$

one can construct an embedding that generates an L-dimensional one-hot encoding for the selected substructures:

$$\begin{aligned} \gamma _s: \mathfrak {J} \rightarrow \mathbb {R}^{L}, \quad \gamma _s(\mathcal {J}) = \left\{ \begin{array}{ll} u^L_{s(\mathcal {J})} & \quad \mathcal {J} \in \mathfrak {J}_L, \\ (0,\ldots ,0) & \quad \text {else}. \\ \end{array} \right. \end{aligned}$$

Substructure pooling via filtering is then given by:

$$\begin{aligned} \Psi : P(\mathfrak {J}) \rightarrow \mathbb {R}^L, \quad \Psi (\{\mathcal {J}_1,\ldots ,\mathcal {J}_k\}) = \sum _{i = 1}^k \gamma _s(\mathcal {J}_i). \end{aligned}$$

The composite map:

$$\begin{aligned} \Psi \circ \varphi : \mathfrak {M} \rightarrow \mathbb {R}^L \end{aligned}$$

transforms input compounds into binary fingerprints free of bit collisions that only encompass the selected substructures in \(\mathfrak {J}_L\). \(\Psi\) depends on both \(\mathfrak {T}\) and \(f_{\text {labels}}\).

Note that the above algorithm for the construction of \(\mathfrak {J}_L\) implicitly assumes the usual case where \(L \le m_{\mathfrak {T}}\). If for some reason L is demanded to be larger than \(m_{\mathfrak {T}}\), then we simply set \(\mathfrak {J}_L \,{:}{=}\,\mathfrak {J}_{\mathfrak {T}}\) and pad the fingerprints generated by \(\Psi\) with additional zeros up to dimension L.

Mutual-information maximisation

Substructure pooling via mutual-information maximisation (MIM) [36, 37] is performed analogously to filtering, the only difference lying in the set of selected substructures \(\mathfrak {J}_L\). We again assume that we are either given a binary classification task or a regression task that has been binarised around its median:

$$\begin{aligned} f_{\text {labels}}: \mathfrak {T} \rightarrow \{0,1\}. \end{aligned}$$

Using the statistical sample \((f_{\text {labels}}(\mathcal {M}_i), f_{\mathcal {J}}(\mathcal {M}_i))_{i = 1}^n\) of binary training variables, one can compute the empirical mutual information:

$$\begin{aligned} \hat{I}: \mathfrak {J}_{\mathfrak {T}} \rightarrow [0, \infty ) \end{aligned}$$

between the training label and the feature associated with a training substructure \(\mathcal {J} \in \mathfrak {J}_{\mathfrak {T}}\) using simple plug-in entropy estimators [61]. \(\hat{I}(\mathcal {J})\) is a nonnegative, symmetric and nonlinear measure of statistical dependence between \((f_{\text {labels}}(\mathcal {M}_i))_{i = 1}^n\) and \((f_{\mathcal {J}}(\mathcal {M}_i))_{i = 1}^n\). The larger \(\hat{I}(\mathcal {J})\), the more information the presence of substructure \(\mathcal {J}\) in a training compound conveys about its label and vice versa. The function \(\hat{I}\) defines a strict total order \(\prec\) on \(\mathfrak {J}_{\mathfrak {T}}\). For \(\mathcal {J}, \tilde{\mathcal {J}} \in \mathfrak {J}_{\mathfrak {T}}\), we write \(\mathcal {J} \prec \tilde{\mathcal {J}}\) if:

$$\begin{aligned} \hat{I}(\mathcal {J})< \hat{I}(\tilde{\mathcal {J}}) \ \text {or} \ [\hat{I}(\mathcal {J}) = \hat{I}(\tilde{\mathcal {J}}) \ \text {and} \ \mathcal {J} < \tilde{\mathcal {J}} ]. \end{aligned}$$
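For two binary samples, the plug-in estimate of the mutual information reduces to a sum over the empirical \(2 \times 2\) joint distribution. The following self-contained sketch (an illustration under the same binary-label assumption, not the exact code used in our experiments) computes \(\hat{I}(\mathcal {J})\) from a substructure-occurrence vector and the label vector.

```python
import numpy as np

def empirical_mutual_information(x, y):
    """Plug-in estimate (in nats) of the mutual information between two equally long binary samples,
    e.g. the occurrence vector of one substructure and the binary training labels."""
    x, y = np.asarray(x, dtype=int), np.asarray(y, dtype=int)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))          # empirical joint probability
            p_a, p_b = np.mean(x == a), np.mean(y == b)  # empirical marginals
            if p_ab > 0:                                 # terms with p_ab = 0 contribute nothing
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi
```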

We use the order \(\prec\) to select L substructures \(\mathfrak {J}_L\) from \(\mathfrak {J}_{\mathfrak {T}}\) via the following strategy (assuming \(L \le m_{\mathfrak {T}}\)):

  • Initialisation: \(\mathfrak {J}_L \,{:}{=}\,\mathfrak {J}_{\mathfrak {T}}.\)

  • Step 1: If for two substructures \(\mathcal {J}, \tilde{\mathcal {J}} \in \mathfrak {J}_L\) it holds that \(\text {supp}_{\mathfrak {T}}(\mathcal {J}) = \text {supp}_{\mathfrak {T}}(\tilde{\mathcal {J}})\), then either \(\mathcal {J}\) or \(\tilde{\mathcal {J}}\) is chosen uniformly at random and removed from \(\mathfrak {J}_L\). This is repeated until no two substructures have the same support or until \(\vert \mathfrak {J}_L \vert = L\).

  • Step 2: The smallest element of \(\mathfrak {J}_L\) with respect to the order \(\prec\) is chosen and removed. This is repeated until \(\vert \mathfrak {J}_L \vert = L\).

Step 1 aims to reduce feature redundancy via the elimination of pairwise identical substructural features from the training set. Step 2 subsequently selects the L training substructures that exhibit the highest mutual information with the training label. MIM depends on both \(\mathfrak {T}\) and \(f_{\text {labels}}\) and can thus be categorised as a supervised feature selection scheme. As in the case of filtering, in the unusual case where L is required to be larger than \(m_{\mathfrak {T}}\), we simply set \(\mathfrak {J}_L \,{:}{=}\,\mathfrak {J}_{\mathfrak {T}}\) and pad the resulting vectorial fingerprint with additional zeros up to dimension L.
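The two selection steps can be sketched as follows (again an illustrative simplification: the paper removes one member of each duplicated pair at random, whereas this sketch deterministically keeps the first copy; the hypothetical helper empirical_mutual_information from the previous sketch is reused). The resulting identifiers are then vectorised with the same one-hot pooling map \(\Psi\) as in the filtering case.

```python
import numpy as np

def select_substructures_mim(X_train, y_train, substructure_ids, L):
    """Select (at most) L substructure identifiers by mutual-information maximisation."""
    n_cols = X_train.shape[1]
    n_removable = max(0, n_cols - L)
    # Step 1: drop substructures whose support duplicates that of an already kept substructure,
    # without shrinking the pool below the target dimension L.
    seen, keep = set(), []
    for j in range(n_cols):
        key = X_train[:, j].tobytes()
        if key in seen and n_removable > 0:
            n_removable -= 1
            continue
        seen.add(key)
        keep.append(j)
    # Step 2: keep the L substructures with the highest empirical mutual information with the label.
    mi = np.array([empirical_mutual_information(X_train[:, j], y_train) for j in keep])
    order = np.argsort(-mi, kind="stable")
    return [substructure_ids[keep[i]] for i in order[:L]]
```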

Instead of MIM, we initially attempted to use conditional MIM, a more advanced information-theoretic technique that iteratively selects features that maximise the mutual information with the training label conditional on the information contained in any feature already picked. However, we observed that even the fast implementation of conditional MIM described by Fleuret [62] was computationally slow to select thousands of substructures out of an even larger substructure pool. This made the generation of vectorial ECFPs with usual lengths such as \(L = 1024\) or \(L = 2048\) bits impractical. We thus decided to instead investigate classical MIM, which represents a natural simplification of conditional MIM that remains computationally feasible even in very high feature dimensions.

Appendix B: Further computational results

In Figs. 4–7, we present detailed computational results for the four remaining data sets specified in Table 1.

Fig. 4

Predictive performance of the four investigated substructure-pooling methods (indicated by colours) for the aqueous solubility regression data set using varying data splitting techniques, machine-learning models and ECFP hyperparameters. Each bar shows the average mean absolute error (MAE) of the associated model across twofold cross validation repeated with three random seeds. The error bar length equals two standard deviations of the performance measured over the \(2 * 3 = 6\) trained models

Fig. 5

Predictive performance of the four investigated substructure-pooling methods (indicated by colours) for the SARS-CoV-2 main protease binding affinity regression data set using varying data splitting techniques, machine-learning models and ECFP hyperparameters. Each bar shows the average mean absolute error (MAE) of the associated model across twofold cross validation repeated with three random seeds. The error bar length equals two standard deviations of the performance measured over the \(2 * 3 = 6\) trained models

Fig. 6

Predictive performance of the four investigated substructure-pooling methods (indicated by colours) for the balanced Ames mutagenicity classification data set using varying data splitting techniques, machine-learning models and ECFP hyperparameters. Each bar shows the average area under the receiver operating characteristic curve (AUROC) of the associated model across twofold cross validation repeated with three random seeds. The error bar length equals two standard deviations of the performance measured over the \(2 * 3 = 6\) trained models

Fig. 7

Predictive performance of the four investigated substructure-pooling methods (indicated by colours) for the highly imbalanced LIT-PCBA estrogen receptor \(\alpha\) antagonism classification data set using varying data splitting techniques, machine-learning models and ECFP hyperparameters. Each bar shows the average area under the precision recall curve (AUPRC) of the associated model across twofold cross validation repeated with three random seeds. The error bar length equals two standard deviations of the performance measured over the \(2 * 3 = 6\) trained models

Availability of data and materials

All used data sets and the Python code for our computational experiments are available in our public code repository https://github.com/oxpig/ECFP-Sort-and-Slice. This code repository also contains a simple, self-contained and fast method for the creation of a calibrated featurisation function to transform RDKit mol objects into vectorial ECFPs via Sort & Slice using only NumPy [42] and RDKit [43].
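As an illustration of how such a calibrated featurisation function can be assembled (a minimal sketch written for this text, not the repository implementation; the function names, the default radius and the default length are our own assumptions), one may count the ECFP substructure identifiers returned by RDKit across the training compounds, keep the L most frequent identifiers, and map every compound onto the corresponding collision-free binary vector:

```python
from collections import Counter
import numpy as np
from rdkit.Chem import AllChem

def ecfp_substructure_ids(mol, radius=2):
    """Set of integer identifiers of all ECFP substructures detected in an RDKit mol."""
    return set(AllChem.GetMorganFingerprint(mol, radius).GetNonzeroElements().keys())

def train_sort_and_slice_featuriser(training_mols, L=1024, radius=2):
    """Calibrate a Sort & Slice featurisation function on a list of training RDKit mols."""
    counts = Counter()
    for mol in training_mols:
        counts.update(ecfp_substructure_ids(mol, radius))   # prevalence = number of training compounds containing the substructure
    top = [sub_id for sub_id, _ in counts.most_common(L)]   # sort by prevalence, slice off all but the L most frequent
    position = {sub_id: i for i, sub_id in enumerate(top)}

    def featurise(mol):
        fp = np.zeros(L, dtype=np.uint8)
        for sub_id in ecfp_substructure_ids(mol, radius):
            if sub_id in position:                           # substructures outside the top L are discarded
                fp[position[sub_id]] = 1
        return fp

    return featurise
```

A featuriser calibrated on the training split, e.g. featurise = train_sort_and_slice_featuriser(train_mols, L=2048), can then be applied unchanged to validation and test compounds; note that in this sketch prevalence ties are broken arbitrarily by Counter.most_common.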

Notes

  1. To be precise, one uses multisets (i.e., sets with counts) instead of classical sets for GNN architectures to be able to distinguish identical atom feature vectors belonging to distinct atoms. Similarly, one would use multisets instead of sets when dealing with ECFPs with counts instead of binary ECFPs. However, in this work we focus on binary ECFPs owing to their more widespread use and conceptual simplicity.

  2. https://github.com/oxpig/ECFP-Sort-and-Slice.

Abbreviations

AUPRC: Area under the precision recall curve
AUROC: Area under the receiver operating characteristic curve
ECFP: Extended-connectivity fingerprint
FCFP: Functional-connectivity fingerprint
GNN: Graph neural network
MAE: Mean absolute error
MIM: Mutual-information maximisation
MLP: Multilayer perceptron
RF: Random forest
SMILES: Simplified molecular-input line-entry system

References

  1. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754

  2. Morgan HL (1965) The generation of a unique machine description for chemical structures—a technique developed at chemical abstracts service. J Chem Doc 5(2):107–113

  3. Riniker S, Landrum G (2013) Open-source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminf 5(1):26

  4. Duvenaud DK, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, Adams RP (2015) Convolutional networks on graphs for learning molecular fingerprints. Adv Neural Inf Process Syst 28:2224–2232

  5. Webel HE, Kimber TB, Radetzki S, Neuenschwander M, Nazaré M, Volkamer A (2020) Revealing cytotoxic substructures in molecules using deep learning. J Comput Aided Mol Des 34(7):731–746

  6. Alvarsson J, Eklund M, Engkvist O, Spjuth O, Carlsson L, Wikberg JES, Noeske T (2014) Ligand-based target prediction with signature fingerprints. J Chem Inf Model 54(10):2647–2653

  7. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. In: International conference on machine learning, PMLR, pp 1263–1272

  8. Stepišnik T, Škrlj B, Wicker J, Kocev D (2021) A comprehensive comparison of molecular feature representations for use in predictive modeling. Comput Biol Med 130:104197

  9. Mayr A, Klambauer G, Unterthiner T, Steijaert M, Wegner JK, Ceulemans H, Clevert DA, Hochreiter S (2018) Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem Sci 9(24):5441–5451

  10. Menke J, Koch O (2021) Using domain-specific fingerprints generated through neural networks to enhance ligand-based virtual screening. J Chem Inf Model 61(2):664–675

  11. Chithrananda S, Grand G, Ramsundar B (2020) ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885

  12. Winter R, Montanari F, Noé F, Clevert DA (2019) Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci 10(6):1692–1701

  13. Dablander M, Hanser T, Lambiotte R, Morris GM (2023) Exploring QSAR models for activity-cliff prediction. J Cheminf 15(1):47

  14. Dablander M, Hanser T, Lambiotte R, Morris GM (2021) Siamese neural networks work for activity cliff prediction [Poster presentation]. In: 4th RSC-BMCS/RSC-CICAG artificial intelligence in chemistry symposium, virtual. https://doi.org/10.13140/RG.2.2.18137.60000. Accessed 28 Jan 2024

  15. Weininger D (1988) SMILES, a chemical language and information system. J Chem Inf Comput Sci 28(1):31–36

  16. Gütlein M, Kramer S (2016) Filtered circular fingerprints improve either prediction or runtime performance while retaining interpretability. J Cheminf 8(1):1–16

  17. Xu K, Hu W, Leskovec J, Jegelka S (2018) How powerful are graph neural networks? arXiv preprint arXiv:1810.00826

  18. Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907

  19. Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M et al (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59(8):3370–3388

  20. Wu Z, Pan S, Chen F, Long G, Zhang C, Philip SY (2020) A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst 32(1):4–24

  21. Wieder O, Kohlbacher S, Kuenemann M, Garon A, Ducrot P, Seidel T, Langer T (2020) A compact review of molecular property prediction with graph neural networks. Drug Discov Today Technol. https://doi.org/10.1016/j.ddtec.2020.11.009

  22. Liu K, Sun X, Jia L, Ma J, Xing H, Wu J, Gao H, Sun Y, Boulnois F, Fan J (2019) Chemi-Net: a molecular graph convolutional network for accurate drug property prediction. Int J Mol Sci 20(14):3389

  23. Navarin N, Van Tran D, Sperduti A (2019) Universal readout for graph convolutional neural networks. In: Proceedings of international joint conference on neural networks (IJCNN), pp 1–7

  24. Cangea C, Veličković P, Jovanović N, Kipf T, Liò P (2018) Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287

  25. Lee J, Lee I, Kang J (2019) Self-attention graph pooling. In: International conference on machine learning, PMLR, pp 3734–3743

  26. Ranjan E, Sanyal S, Talukdar P (2020) Asap: Adaptive structure aware pooling for learning hierarchical graph representations. In: Proceedings of the AAAI conference on artificial intelligence, vol 34. pp 5470–5477

  27. Ma Z, Xuan J, Wang YG, Li M, Liò P (2020) Path integral based convolution and pooling for graph neural networks. Adv Neural Inf Process Syst 33:16,421-16,433

  28. Zhong S, Guan X (2023) Count-based Morgan fingerprint: A more efficient and interpretable molecular representation in developing machine learning-based predictive regression models for water contaminants’ activities and properties. Environ Sci Technol 57(46):18,193-18,202

  29. Harada S, Akita H, Tsubaki M, Baba Y, Takigawa I, Yamanishi Y, Kashima H (2020) Dual graph convolutional neural network for predicting chemical networks. BMC Bioinf 21:1–13

  30. Ucak UV, Ashyrmamatov I, Ko J, Lee J (2022) Retrosynthetic reaction pathway prediction through neural machine translation of atomic environments. Nat Commun 13(1):1186

  31. Capecchi A, Probst D, Reymond JL (2020) One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminf 12(1):1–15

  32. Le T, Winter R, Noé F, Clevert DA (2020) Neuraldecipher–reverse-engineering extended-connectivity fingerprints (ECFPs) to their molecular structures. Chem Sci 11(38):10,378-10,389

  33. Shen J, Nicolaou CA (2019) Molecular property prediction: recent trends in the era of artificial intelligence. Drug Discov Today Technol 32:29–36

  34. Tripp A, Bacallado S, Singh S, Hernández-Lobato JM (2024) Tanimoto random features for scalable molecular machine learning. Adv Neural Inf Process Syst (NeurIPS 2023) 37:33656–33686. https://doi.org/10.48550/arXiv.2306.14809

  35. Probst D, Reymond JL (2018) A probabilistic molecular fingerprint for big data settings. J Cheminf 10:1–12

  36. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423

  37. Cover TM, Thomas JA et al (1991) Entropy, relative entropy and mutual information. Elem Inf Theory 2(1):12–13

  38. MacDougall T (2022) Reduced collision fingerprints and pairwise molecular comparisons for explainable property prediction using deep learning. M.Sc. thesis, Université de Montréal, https://hdl.handle.net/1866/26533, accessed on 05.10.2023

  39. Sayle R (1997) 1st-class SMARTS patterns. In: EuroMUG 97, https://www.daylight.com/meetings/emug97/Sayle, accessed on 28.01.2024

  40. Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 42(6):1273–1280

  41. Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9(2):513–530

  42. Harris CR, Millman KJ, Van Der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ et al (2020) Array programming with NumPy. Nature 585(7825):357–362

  43. Landrum G (2006) RDKit: open-source cheminformatics. http://www.rdkit.org. Accessed on 05 October 2023

  44. Pearson K (1900) On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond Edinburgh Dublin Philos Mag J Sci 50(302):157–175

  45. Zaheer M, Kottur S, Ravanbakhsh S, Poczos B, Salakhutdinov RR, Smola AJ (2017) Deep sets. In: Advances in neural information processing systems, vol 30

  46. Sorkun MC, Khetan A, Er S (2019) AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Sci Data 6(1):143

  47. Hansen K, Mika S, Schroeter T, Sutter A, Ter Laak A, Steger-Hartmann T, Heinrich N, Muller KR (2009) Benchmark data set for in silico prediction of Ames mutagenicity. J Chem Inf Model 49(9):2077–2081

  48. COVID Moonshot Consortium, Achdout H, Aimon A, Alonzi DS, Arbon R, Bar-David E, Barr H, Ben-Shmuel A, Bennett J, Bilenko VA et al (2020) Open science discovery of potent non-covalent SARS-CoV-2 main protease inhibitors. bioRxiv preprint

  49. Tran-Nguyen VK, Jacquemard C, Rognan D (2020) LIT-PCBA: an unbiased data set for machine learning and virtual screening. J Chem Inf Model 60(9):4263–4273

  50. Bento AP, Hersey A, Félix E, Landrum G, Gaulton A, Atkinson F, Bellis LJ, de Veij M, Leach AR (2020) An open source chemical structure curation pipeline using RDKit. J Cheminf 12(1):1–16

  51. Bemis GW, Murcko MA (1996) The properties of known drugs: molecular frameworks. J Med Chem 39(15):2887–2893

  52. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830

  53. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst (NeurIPS 2019) 33:8026–8037. https://doi.org/10.48550/arXiv.1912.01703

  54. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of machine learning research, pp 448–456

  55. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958

  56. Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  57. Tokunaga E, Yamamoto T, Ito E, Shibata N (2018) Understanding the thalidomide chirality in biological processes by the self-disproportionation of enantiomers. Sci Rep 8(1):17131

  58. Todeschini R, Consonni V (2009) Molecular descriptors for chemoinformatics, vol 41. Wiley, New York

  59. Xue L, Bajorath J (2000) Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening. Comb Chem High Throughput Screen 3(5):363–372

  60. Hong H, Xie Q, Ge W, Qian F, Fang H, Shi L, Su Z, Perkins R, Tong W (2008) \(\text{ Mold}^2\), molecular descriptors from 2D structures for chemoinformatics and toxicoinformatics. J Chem Inf Model 48(7):1337–1344

  61. Zhang Z, Zhang X (2011) A normal law for the plug-in estimator of entropy. IEEE Trans Inf Theory 58(5):2745–2747

  62. Fleuret F (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Research 5(9):1531–1555

Acknowledgements

We would like to thank Greg Landrum and Richard Cooper for their useful feedback.

Funding

This research was supported by the University of Oxford’s UK EPSRC Centre for Doctoral Training in Industrially Focused Mathematical Modelling (EP/L015803/1) and by the not-for-profit organisation and educational charity Lhasa Limited (https://www.lhasalimited.org).

Author information

Authors and Affiliations

Authors

Contributions

The computational study was designed, implemented, conducted and interpreted by the first author, MD, who also proposed the specific version of Sort & Slice investigated in this work. To the best of our knowledge, the only other study that explores a technique almost identical to Sort & Slice is the noteworthy work of MacDougall [38], which we acknowledge. The general mathematical framework and the mathematical definition of substructure pooling introduced in this work were developed by MD. The research was supervised by GMM, TH and RL. The computer code was written by MD. The paper manuscript was written by MD. Feedback was provided by GMM, TH and RL during the writing process. All scientific figures were designed by MD, with input from GMM, TH and RL. When discussing how to name the main investigated substructure-pooling technique, the name “Sort & Slice” was suggested by GMM. All chemical data sets were gathered and cleaned by MD. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Garrett M. Morris.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Cite this article

Dablander, M., Hanser, T., Lambiotte, R. et al. Sort & Slice: a simple and superior alternative to hash-based folding for extended-connectivity fingerprints. J Cheminform 16, 135 (2024). https://doi.org/10.1186/s13321-024-00932-y

Keywords