Introduction

The application of computational methods has become a widely employed strategy in the development of new materials and drugs. A crucial aspect of this approach involves the calculation of quantum chemical (QC) properties of molecular structures1. These quantitative properties are highly dependent on the refined equilibrium conformations of molecules.

In the field of materials and drug design, researchers primarily focus on the quantitative properties of equilibrium conformations. The process to achieve this generally involves two key steps, both of which depend on electronic structure methods such as density functional theory (DFT)2. The initial step entails performing conformation optimization, also known as energy minimization, on the molecular structure to determine the equilibrium conformation. Subsequently, the quantum chemical (QC) properties of this equilibrium conformation are computed. However, the combined process of conformation optimization and property calculation using DFT can be extremely time-consuming and computationally expensive, potentially requiring several hours to evaluate the properties of just a single molecule. This constraint hinders the applicability of DFT in large-scale data screening endeavors. Consequently, it is of paramount importance to develop alternative methods that maintain the requisite accuracy while reducing computational costs.

Recent studies have demonstrated the potential of using deep learning to accelerate QC property calculations3,4,5. This approach involves training a deep neural network model to predict the property using molecular inputs, thereby circumventing the need for computationally-intensive DFT calculations. Prior research has mainly utilized 1D SMILES6,7,8 sequences or 2D molecular graphs4,9,10,11,12,13 as molecular inputs due to their easy obtainability. However, predicting QC properties from 1D SMILES and 2D molecular graphs can be ineffective since most QC properties are highly related to the refined 3D equilibrium conformations.

To address this challenge, we propose a method called Uni-Mol+ in this paper, illustrated in Fig. 1a. In contrast to previous approaches that directly predict QC properties from 1D/2D data, Uni-Mol+ takes advantage of the 3D conformation of the molecule as input, in accordance with physical principles. Uni-Mol+ first generates a raw 3D conformation from 1D/2D data using cheap methods, such as RDKit14. As the raw conformation is inaccurate, Uni-Mol+ then iteratively updates it towards the DFT equilibrium conformation using neural networks and predicts QC properties from the learned conformation. To obtain accurate equilibrium conformation predictions, we use large-scale datasets (e.g., PCQM4MV2 benchmark) to build up millions of pairs of RDKit-generated raw conformation and high-quality DFT equilibrium conformation and learn the update process from this supervised information. With a carefully designed model backbone and training strategy, Uni-Mol+ shows superior performance in various benchmarks.

Fig. 1: Overall architecture of Uni-Mol+.
figure 1

a In contrast to prior methods that directly predict QC properties from 1D/2D data, Uni-Mol+ uses a different approach. It first generates raw 3D conformation from 1D/2D data using cheap tools like RDKit, and then iteratively updates it towards the DFT equilibrium conformation. Finally, it predicts QC properties using the learned conformation. The abbreviation HOMO-LUMO gap represents the Highest Occupied Molecular Orbital— Lowest Unoccupied Molecular Orbital gap. b The Uni-Mol+ backbone consists of L blocks, each of which maintains two tracks of representations—atom and pair, initialized by atom features and 2D graph/3D conformation, respectively. These representations communicate with each other at every block. Based on this backbone model, Uni-Mol+ iteratively updates the raw conformation (i.e., 3D coordinates of atoms) towards the DFT equilibrium conformation for R iterations. The abbreviation FFN represents the Feed-Forward Neural network and QC property represents Quantum Chemical property. c A linear noisy interpolation between raw conformation and DFT conformation is used to generate a pseudo trajectory, effectively augmenting the input conformations. Uni-Mol+ uses a mixture of Bernoulli distribution and Uniform distribution to sample the noise interpolation weight q during training. The symbol q represents the interpolation weight between raw conformation and DFT conformation.

Our main contributions can be summarized as follows:

  • We develop a novel paradigm for QC property prediction by leveraging the conformation optimization from RDKit-generated conformation to DFT equilibrium conformation.

  • We create a new training strategy for 3D conformation optimization by generating a pseudo trajectory and a sampling strategy from it, based on a mixture of Bernoulli distribution and Uniform distribution.

  • The entire framework of Uni-Mol+ holds significant empirical value, as it achieves markedly better performance than all previous works on two widely recognized benchmarks, PCQM4MV215 and Open Catalyst 2020 (OC20)16.

Results

In this section, we initially present a concise overview of the Uni-Mol+ framework, followed by comprehensive benchmarking using two well-recognized public datasets: PCQM4MV215 and OC2016. These datasets enable the assessment of Uni-Mol+ ’s performance in small organic molecules and catalyst systems. Following this, we perform an ablation study to investigate the impact of various model components and training strategies on the overall performance. Lastly, we present a visual analysis to effectively demonstrate the conformation update process within Uni-Mol+. The complete model configuration can be found in the Supplementary Section 2.

Uni-Mol+ overview

As illustrated in Fig. 1a, for any molecule, Uni-Mol+ first obtains a raw 3D conformation generated by cheap methods, such as template-based methods from RDKit and OpenBabel. It then learns the target conformation, i.e., the equilibrium conformation optimized by DFT, by an iterative update process from the raw conformation. In the final step, the QC properties are predicted based on the learned conformation. To achieve this goal, we introduce a new model backbone and a novel training strategy for updating conformation and predicting QC properties.

The Uni-Mol+ ’s model backbone is a two-track transformer, consisting of an atom representation track and a pair representation track, as shown in Fig. 1b. In comparison to the transformer backbone used in the prior study Uni-Mol17, two significant updates have been implemented. i) The pair representation is enhanced by an outer product of the atom representation (referred to OuterProduct) for atom-to-pair communication, and a triangular operator (referred to TriangularUpdate) to bolster the 3D geometric information. These two operators are proven effective in AlphaFold218. ii) An iterative process is employed to continuously update the 3D coordinates towards the equilibrium conformation. We use R to denote the number of rounds for conformation optimization.

For the learning of the conformation update process, we introduce a novel training strategy as shown in Fig. 1c. We sample conformations from the trajectory between the RDKit-generated raw conformation and the DFT equilibrium conformation, and use the sampled conformation as input to predict the equilibrium conformation. It is crucial to note that the actual trajectory is often unknown in many datasets; therefore, we utilize a pseudo trajectory that presumes a linear process between two conformations. Furthermore, we devise a sampling strategy for obtaining conformations from the pseudo trajectory to serve as the model’s input during training. This strategy uses a mixture of Bernoulli distribution and Uniform distribution. The Bernoulli distribution addresses (1) the distributional shift between training and inference and (2) enhances the learning of an accurate mapping from the equilibrium conformation to the QC properties. Meanwhile, the Uniform distribution generates additional intermediate states to serve as model inputs, effectively augmenting the input conformations. The details of Uni-Mol+ can be found in Sec. 4.

Benchmark on small molecule (PCQM4MV2)

The PCQM4Mv2 dataset, derived from the OGB Large-Scale Challenge15, is designed to facilitate the development and evaluation of machine learning models for predicting QC properties of molecules, specifically the target property known as the HOMO-LUMO gap. This property represents the difference between the energies of the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO). The dataset, consisting of approximately 4 million molecules represented by SMILES notations, offers HOMO-LUMO gap labels for the training and validation sets; however, the labels for the test set remain undisclosed. Furthermore, the training set encompasses the DFT equilibrium conformation, which is not included in the validation and test sets. The benchmark’s goal is to utilize SMILES notation, without the DFT equilibrium conformation, to predict the HOMO-LUMO gap during the inference process.

Based on SMILES, we generate 8 initial conformations for each molecule by RDKit, at a per-molecule cost of about 0.01 seconds. Specifically, we use ETKDG 19 method to generate 3D conformations. Subsequent optimization of these conformations is achieved through the MMFF9420 force field. In molecules where the generation of a 3D conformation is unsuccessful, we default to producing a 2D conformation with a flat z-axis using RDKit’s AllChem.Compute2DCoords function instead. During training, we randomly sample 1 conformation as input at each epoch, while during inference, we use the average HOMO-LUMO gap prediction based on 8 conformations.

We incorporate previous submissions to the PCQM4MV2 leaderboard as baselines. In addition to the default 12-layer model, we evaluate the performance of Uni-Mol+ with two variants consisting of 6 and 18 layers, respectively. This aims to explore how model performance changes when varying the model parameter sizes.

The results are summarized in Table 1, and our observations are as follows: (1) Uni-Mol+ surpasses the previous SOTA by a margin of 0.0079 on validation data on single-model performance, a relative improvement of 11.4%. (2) All three variants of Uni-Mol+ demonstrate substantial performance improvements over previous baselines. (3) The 6-layer Uni-Mol+, despite having considerably fewer model parameters, outperforms all prior baselines. (4) Increasing the layers from 6 to 12 results in a significant accuracy enhancement, surpassing all baselines by a considerable margin. (5) The 18-layer Uni-Mol+ exhibits the highest performance, outperforming all baselines by a remarkable margin. These findings underscore the effectiveness of Uni-Mol+. (6) The performance of a single 18-layer Uni-Mol+ model on the leaderboard (test-dev set) is noteworthy, particularly as it surpasses previous state-of-the-art methods without employing an ensemble or additional techniques. In contrast, the previous state-of-the-art GPS++ relied on a 112-model ensemble and included the validation set for training.

Table 1 The benchmark results on PCQM4MV2

Benchmark on catalyst system (OC20)

The Open Catalyst 2020 (OC20) dataset16 is specifically designed to promote the development of machine-learning models for catalyst discovery and optimization. OC20 encompasses three tasks: Structure to Energy and Force (S2EF), Initial Structure to Relaxed Structure (IS2RS), and Initial Structure to Relaxed Energy (IS2RE). In this paper, we focus on the IS2RE task, as it aligns well with the objectives of the proposed methodology. The goal of the IS2RE task is to predict the relaxed energy based on the initial conformation. It comprises approximately 460K training data points. While DFT equilibrium conformations are provided for training, they are not permitted for use during inference. Moreover, in contrast to the PCQM4MV2 dataset, the initial conformation is already supplied in the OC20 IS2RE task, eliminating the need to generate the initial input conformation by ourselves.

We present a performance comparison of various models on the OC20 IS2RE validation and test set, as illustrated in Table 2. The table displays the Mean Absolute Error (MAE) for energy in electron volts (eV) and the percentage of Energies Within a Threshold (EwT) for each model. As evident from the tables, our proposed Uni-Mol+ significantly outperforms all previous baselines in terms of both MAE and EwT. For example, in the test set, Uni-Mol+ exceeds the previous SOTA in Average MAE and Average EwT by margins of 0.0366 (8.8% relative improvement) and 1.73 (26.6% relative improvement), respectively. This demonstrates the exceptional performance of Uni-Mol+. Notably, our method attains the lowest MAEs across all categories, including In-Domain (ID), Out-of-Domain Adsorption (OOD Ads.), Out-of-Domain Catalysis (OOD Cat.), Out-of-Domain Both (OOD Both), and Average (AVG.). Furthermore, in terms of EwT, Uni-Mol+ consistently achieves the highest values in all categories. These findings underscore the robustness of our method in handling both in-domain and out-of-domain data. In conclusion, the results emphasize the efficacy of our approach in capturing intricate interactions in material systems and its potential for extensive applicability in various computational material science tasks.

Table 2 The benchmark results on OC20 IS2RE task

Ablation study

In this subsection, we present a comprehensive ablation study for Uni-Mol+. To fully comprehend the configurations discussed herein, we recommend referring to the “Methods” section and the model specifications detailed in Supplementary Section 2. We conduct the ablation study on the PCQM4Mv2 dataset, employing the default 12-layer Uni-Mol+ configuration. The findings are summarized in Table 3, where No. 1 is the default setting, and No.2–7 focus on the examination of the model backbone, and No. 8 to No. 17 focus on the examination of the training strategies. A detailed analysis follows in the subsequent paragraphs.

Table 3 Ablation study for model backbone and for sampling strategies for q, on PCQM4MV2

As detailed in Sec. 4 and Supplementary Section 1, Uni-Mol+ introduces two novel components, OuterProduct and TriangularUpdate, and iteratively updates the 3D coordinates. An examination of the results (No. 1–7) in Table 3 provides insights into the implications of these modifications.

(1) We first examine the necessity of the new components in the model backbone. Upon examining the first three settings (No. 1 to 3), it becomes evident that both TriangularUpdate and OuterProduct significantly contribute to the model’s performance. A comparison between No. 3 and No. 4 reveals that utilizing pair representation exclusively, without incorporating OuterProduct or TriangularUpdate, does not enhance performance. This result is expected because the pair representation is not communicated with the atom representation (without OuterProduct) and is simply updated by FFN, resulting in a performance that is almost the same as not using pair representation, as there are merely more parameters. However, the proposed OuterProduct and TriangularUpdate can better utilize the pair representation, leading to an overall performance improvement (No.1 and No.2). This makes the pair representation an essential component in the backbone of our approach, even if its standalone effectiveness might appear limited.

(2) We then examine the performance brought by iterative coordinate updates. A comparison of No. 1 with No. 5 and No. 6 leads to the conclusion that omitting the iterative update (No. 5) yields suboptimal results. Note that even without the iterative refinement of 3D conformation (R = 0), Uni-Mol+ ’s score of 0.0715 (No. 5) significantly surpasses the previous SOTA GPS++ (0.0778). However, performing one additional iteration proves highly effective (No. 1), whereas further increasing the number of iterations offers marginal improvements (No. 6).

(3) Lastly, we check the result using the same model backbone as previous work. In particular, when the model retains the same structure as the one employed in previous works3,5,17 and excludes the iterative update (No. 7), its performance is the least favorable. Nonetheless, even with this substandard performance, the model surpasses all prior baselines, thereby highlighting the efficacy of the proposed training strategy. It is important to note that No.7 employs the proposed training strategy as outlined in Sec. 4.2. Although No.7 does not explicitly use conformation optimization (R = 0), the model is still trained to predict the target conformation. Consequently, the Atom Repr. and Pair Repr. of the last layer inherently contain the information required to predict the target conformation. Hence, even without explicitly conformation optimization (R=0), the result of No.7 still supports our primary contribution, namely the accurate prediction of QC properties by leveraging an auxiliary task of conformation optimization.

The training strategy primarily concentrates on sampling q (interpolation weight, details in Sec. 4) to obtain input conformations during training. Formally, q is sampled from a mixture of Bernoulli and Uniform distributions, denoted as \({w}_{1.0}{{\mathbb{I}}}_{\{1.0\}}(q)+{w}_{0.0}{{\mathbb{I}}}_{\{0.0\}}(q)+{w}_{u}{{\mathbb{I}}}_{[a,b]}(q)\), where \({{\mathbb{I}}}_{\{c\}}(q)\) is an indicator function that equals 1 if q = c and 0 otherwise, and \({{\mathbb{I}}}_{[a,b]}(q)\) is an indicator function that equals 1 if a ≤ q ≤ b and 0 otherwise. The weights w1.0, w0.0, and wu must be non-negative and add up to 1, i.e., w1.0 + w0.0 + wu = 1. In this notation, the default sampling strategy employed in Uni-Mol+ can be represented as (w1.0 = 0.1, w0.0 = 0.8, wu = 0.1, [ab] = [0.4, 0.6]). We investigate additional settings for the ablation study, and the results are summarized in Table 3 (No. 8 to No. 17). Except for No. 10 and 12, which use [ab] = [0.0, 1.0], all other settings use [ab] = [0.4, 0.6]. From these results, we make the following observations:

(1) Comparing No. 8, 9, and 10, we find that sampling from only one type of conformation is not effective. For No. 8, it lacks data augmentation and cannot learn an accurate mapping from equilibrium conformation to QC property. For No. 9, it experiences a distributional shift between training and inference. Although No. 10 is better, it has a low probability of sampling 0.0 and 1.0, resulting in suboptimal performance.

(2) By comparing No. 8, 9, and 11, we can deduce that sampling from the mixture of RDKit and target conformations yields a satisfactory result (Valid MAE with 0.0697). However, if only sampling from target and intermediate conformations (No. 12), the result is unsatisfactory (Valid MAE with 0.0753). This result indicates that sampling from w1.0 is necessary, as it reduces the distributional shift between training and inference.

(3) The default strategy that samples from three types of conformations (No. 1) exhibits the best performance.

(4) Altering the weights of the mixture distribution (No. 13–17) does not result in better performance over the default strategy. Furthermore, we notice that with a decreased w0.0, the performance worsens. This suggests that the default weighting scheme is appropriate for this task.

(5) Upon comparing the results of No.18 and No.1, it’s clear that the performance of Noisy Nodes (No.18, Valid MAE with 0.0760) is significantly lower than that of Uni-Mol+ (No.1, Valid MAE with 0.0696). This large performance gap (0.0760 vs. 0.0696) highlights the superior efficacy of the proposed training strategy, as opposed to the one employed previously.

(6) A comparison between No.19 and No.18 shows that the model structure employed in previous works3,5,17 yields worse results than using Uni-Mol+ ’s backbone when using Noisy Nodes strategy. This finding lends additional support to the superiority of Uni-Mol+ ’s backbone over the model architectures previously proposed.

In conclusion, the ablation study demonstrates the effectiveness of the default sampling strategy employed in Uni-Mol+, emphasizing the importance of utilizing a mixture of different conformations to achieve superior performance.

Visualized analysis of conformation learning

In addition to QC property prediction, Uni-Mol+ can also predict equilibrium conformations. Although this study primarily focuses on QC property prediction and the previous experimental results have clearly demonstrated the effectiveness of the proposed Uni-Mol+, visualized results can help to better understand how Uni-Mol+ works. Therefore, we also provide two additional analyzes for the conformation learning of Uni-Mol+ in the PCQM4MV2 dataset.

The First analysis evaluates the predicted conformations. Since the DFT conformations of the validation set (and test set) are not provided by the PCQM4MV2 dataset, we generated DFT conformations ourselves, using the same settings as the PCQM4MV2 source data21. As shown in Fig. 2, Uni-Mol+ can effectively predict equilibrium conformations. Moreover, as the number of update iterations increases, the RMSD is smaller, further demonstrating the effectiveness of the proposed iterative coordinate update. We provide the conformation files used in Fig. 2 in Supplementary Data 1.

Fig. 2: Visualization of Uni-Mol+ ’s predicted conformations.
figure 2

Comparison of RDKit-generated conformation and predicted conformations from first (R = 0) and second (R = 1) iterations, superimposed onto the target DFT conformation. Corresponding RMSDs are provided, demonstrating Uni-Mol+ 's effectiveness in predicting accurate DFT equilibrium conformations. The abbreviations RMSD represents Root Mean Square Deviation. The conformations are provided in the Supplementary Data 1.

The second analysis aims to show that Uni-Mol+ can predict conformations with lower energies, which approaches equilibrium conformations. To demonstrate this, we selected 100 data points and calculated the energies of their initial and predicted conformations and that between their initial conformations and the DFT conformations. Here the DFT conformations is Computed by ourself using the B3LYP functional and 6-31G* basis set, consistent with the settings used in the PCQM4MV2 dataset. As shown in Fig. 3, Uni-Mol+ can predict the conformations with lower energies. Moreover, the energy difference distribution between the initial and predicted conformations closely aligns with that between the initial and equilibrium conformations. This similarity demonstrates Uni-Mol+ ’s effectiveness in predicting equilibrium conformations accurately. We provide the conformation files used in Fig. 3 in Supplementary Data 1.

Fig. 3: Distribution of delta energy.
figure 3

We selected 100 data points and used DFT to calculate the following values: (a) the delta energies between their initial and Uni-Mol+ 's predicted conformations; b the delta energies between their initial conformations and the DFT conformations, where the DFT conformations are calculated by ourselves using DFT tool. Cross-marks indicate data points with increased energies, while circle-marks denote those with decreased energies. This visualization demonstrates that Uni-Mol+ effectively predicts conformations with lower energies. The conformations are provided in the Supplementary Data 1.

The aforementioned results provide additional evidence of the effectiveness of the proposed Uni-Mol+, as it can indeed predict conformations with lower energy and iteratively approach the target DFT conformations.

Discussion

Previous studies have primarily relied on 1D/2D information, such as SMILES or molecular graphs, for making predictions6,7,8,9. Recently, numerous investigations4,9,10,11,12,13 have employed Transformer models for graph tasks, resulting in significant advancements. Given the importance of 3D information in predicting quantum chemistry (QC) properties, several recent studies have incorporated 3D data into their approaches.

Some research has utilized 3D structural information and maximised mutual information between 2D and 3D molecular to augment 2D representations during training3,22,23,24. However, these studies only implicitly embed 3D information into 2D representations, with 2D data utilized exclusively during inference. We represent these models as x2D → (x3Dy), where x2D represents the 2D molecular graph input, x3D represents the 3D conformation input and y denotes a QC property. A crucial shortcoming of these approaches is that they don’t explicitly learn a mapping from the 3D equilibrium conformation x3D to y while y is highly correlated with x3D. Some models, like Transformer-M3, attempt to learn both x2D → y and x3D → y. However, during inference, these models rely solely on x2D, which compromises the prediction performance. Uni-Mol+, on the other hand, employs a strategy \({x}_{{{\rm{3D}}}}^{{\prime} }\to \ldots \to {x}_{{{\rm{3D}}}}\to y\). This process starts with a raw 3D conformation \({x}_{{{\rm{3D}}}}^{{\prime} }\), iteratively refines it towards x3D, and then predicts y. By explicitly learning a mapping from 3D conformation to QC properties, Uni-Mol+ proves to be more effective than previous models.

A few recent works have focused on property prediction using 3D conformations as input. For example, Uni-Mol17 employs the 3D conformation generated by RDKit as input. Uni-Mol is a pre-training method centred on designing pre-text tasks for molecular data, while Uni-Mol+ is a supervised learning approach aimed at predicting QC properties from raw conformations, aided by equilibrium conformation during training. Graphormer-3D5 utilizes the initial 3D conformation provided by the OC20 dataset16 to predict energy at equilibrium. However, it focuses on directly learning the mapping from input to target conformations without considering a training strategy specifically tailored for conformation optimization, as done in our work. The Noisy Nodes approach25 takes corrupted DFT conformations as inputs and aims to predict the uncorrupted ones. When an initial 3D conformation is provided, as in the OC20 dataset, Noisy Nodes generates an interpolated conformation between the initial and target conformations during training, which is similar to the uniform sampling of q in our study. In comparison to Noisy Nodes, our training strategy also incorporates a Bernoulli distribution, which has proven advantageous in addressing distributional shifts and improving QC property predictions. Moreover, both Graphformer-3D and Noisy Nodes necessitate the use of initial conformations provided by the dataset. In contrast, our study is not constrained by this requirement, as it can employ RDKit to generate initial conformations. Several studies26,27,28,29 concentrate on designing new model backbones with rotation and translation equivalence or invariance in 3D space. In contrast, our work emphasizes a novel paradigm for QC property prediction, rather than developing a new model backbone.

Conformation optimization is a critical challenge in computational chemistry. Density Functional Theory (DFT) is the most prevalent method for this task, offering high accuracy but at considerable computational expense. Several deep learning-based potential energy models, such as Deep Potential30, have been proposed to tackle this issue by using neural networks to replace costly potential calculations in DFT, thereby enhancing efficiency. However, deep potential models still necessitate dozens or even hundreds of iterative steps to optimize the conformation based on predicted potentials. In contrast, our approach, Uni-Mol+, requires only a few optimization rounds and can optimize conformations end-to-end, whereas deep potential models cannot.

Although other studies17,31 also optimize RDKit-generated conformations towards DFT conformations, they primarily focus on benchmarking conformation rather than predicting QC property. These works simply employ existing model backbones and learn the mapping between raw and equilibrium conformations. In contrast, Uni-Mol+ adopts a novel training strategy to effectively learn conformation optimization. However, it is important to note that conformation optimization serves merely as an auxiliary task; the primary objective of Uni-Mol+ is to predict QC properties.

The research most closely related to ours is EMPNN32, which utilizes a 2D molecular graph as input for predicting the 3D equilibrium conformation. However, EMPNN learns to map a 2D graph to a 3D equilibrium conformation, which differs from our model that optimizes from an RDKit-generated conformation. Moreover, EMPNN requires an additional model, such as SchNet26, to predict quantum chemistry (QC) properties using the 3D conformation generated by EMPNN as input.

In summary, our study presents a novel method capable of accurately predicting QC properties through an auxiliary task of conformation optimization. This approach has the potential to enhance the efficiency of high-throughput screening and facilitate the design of innovative materials and molecules in future research.

Method

Model backbone

The designed model backbone can predict the equilibrium conformation and QC property simultaneously, denoted as \((y,\, {\hat{{\boldsymbol{r}}}})=f({{\boldsymbol{X}}},\, {{\boldsymbol{E}}},\, {{\boldsymbol{r}}};{{\boldsymbol{\theta }}})\). The model takes three inputs, (i) atom features (\({{\boldsymbol{X}}}\in {{\mathbb{R}}}^{n\times {d}_{f}}\), where n is the number of atoms and df is atom feature dimension), (ii) edge features (\({{\boldsymbol{E}}}\in {{\mathbb{R}}}^{n\times n\times {d}_{e}}\), where de is the edge feature dimension), and (iii) 3D coordinates of atoms (\({{\boldsymbol{r}}}\in {{\mathbb{R}}}^{n\times 3}\)). θ is the set of learnable parameters. And the model predicts a quantum property y and updated 3D coordinates \({\hat{{\boldsymbol{r}}}}\in {{\mathbb{R}}}^{n\times 3}\).

As illustrated in Fig. 1b, the L-block model maintains two distinct representation tracks: atom representation and pair representation. The atom representation is denoted as \({{\boldsymbol{x}}}\in {{\mathbb{R}}}^{n\times {d}_{x}}\), where dx represents the dimension of the atom representation. Similarly, the pair representation is denoted as \({{\boldsymbol{p}}}\in {{\mathbb{R}}}^{n\times n\times {d}_{p}}\), where dp signifies the dimension of the pair representation. The model comprises L blocks, with x(l) and p(l) representing the output representations of the l-th block. Within each block, the atom representation is initially updated through self-attention, incorporating an attention bias derived from the pair representation, followed by an update via a feed-forward network (FFN). Concurrently, the pair representation undergoes a series of updates, beginning with an outer product of the atom representation (referred to OuterProduct), followed by triangular multiplication (referred to TriangularUpdate) as implemented in AlphaFold218, and finally, an update using a FFN. This backbone, in comparison to the one used in Uni-Mol17, enhances the pair representation through two key improvements: (i) employing an outer product for effective atom-to-pair communication, and (ii) utilizing a triangular operator to bolster the 3D geometric information. Next we will introduce each module in detail.

Positional encoding

Similar to previous works4,17, we use pair-wise encoding to encode the 3D spatial and 2D graph positional information. Specifically, for 3D spatial information, we utilize the Gaussian kernel for encoding, as done in previous studies5,17. The encoded 3D spatial positional encoding is denoted by ψ3D.

In addition to the 3D positional encodings, we also incorporate graph positional encodings similar to those used in Graphormer. This includes the shortest-path encoding, represented by \({{{\boldsymbol{\psi }}}}_{i,j}^{{{\rm{SP}}}}={\mathtt{Embedding}}\,({{{\rm{sp}}}}_{ij})\) where spij is the shortest path between atoms (ij) in the molecular graph. Additionally, instead of the time-consuming multi-hop edge encoding method used in Graphormer, we utilize a more efficient one-hop bond encoding, denoted by \({{{\boldsymbol{\psi }}}}^{{{\rm{Bond}}}}={\sum }_{i=1}^{{d}_{e}}{\mathtt{Embedding}}\,({{{\boldsymbol{E}}}}_{i})\), where Ei is the i-th edge feature. Combined above, the positional encoding is denoted as ψ = ψ3D + ψSP + ψBond. And the pair representation p is initialized by ψ, i.e., p(0) = ψ.

Update of atom representation

The atom representation x(0) is initialized by the embeddings of atom features, the same as Graphormer. At l-th block, x(l) is sequentially updated as follow:

$${{{\boldsymbol{x}}}}^{(l)}=\, {{{\boldsymbol{x}}}}^{(l-1)}+{\mathtt{SelfAttentionPairBias}}\,\left({{{\boldsymbol{x}}}}^{(l-1)},\, {{{\boldsymbol{p}}}}^{(l-1)}\right),\\ {{{\boldsymbol{x}}}}^{(l)}=\, {{{\boldsymbol{x}}}}^{(l)}+{\mathtt{FFN}}\,\left({{{\boldsymbol{x}}}}^{(l)}\right).$$
(1)

The SelfAttentionPairBias function is denoted as:

$${{{\boldsymbol{Q}}}}^{(l,h)}=\, {{{\boldsymbol{x}}}}^{(l-1)}{{{\boldsymbol{W}}}}_{Q}^{(l,h)};\,{{{\boldsymbol{K}}}}^{(l,h)}={{{\boldsymbol{x}}}}^{(l-1)}{{{\boldsymbol{W}}}}_{K}^{(l,h)}; \\ {{{\boldsymbol{B}}}}^{(l,h)}=\, {{{\boldsymbol{p}}}}^{(l-1)}{{{\boldsymbol{W}}}}_{B}^{(l,h)};\,{{{\boldsymbol{V}}}}^{(l,h)}={{{\boldsymbol{x}}}}^{(l-1)}{{{\boldsymbol{W}}}}_{V}^{(l,h)};\\ {\mathtt{output}}=\, {\mathtt{softmax}}\left(\frac{{{{\boldsymbol{Q}}}}^{(l,h)}{({{{\boldsymbol{K}}}}^{(l,h)})}^{T}}{\sqrt{{d}_{h}}}+{{{\boldsymbol{B}}}}^{(l,h)}\right){{{\boldsymbol{V}}}}^{(l,h)};\\ $$
(2)

where dh is the head dimension, \({{{\boldsymbol{W}}}}_{Q}^{(l,h)},{{{\boldsymbol{W}}}}_{K}^{(l,h)},{{{\boldsymbol{W}}}}_{V}^{(l,h)}\in {{\mathbb{R}}}^{{d}_{x}\times {d}_{h}}\), \({{{\boldsymbol{W}}}}_{B}^{(l,h)}\in {{\mathbb{R}}}^{{d}_{p}\times 1}\). FFN is a feed-forward network with one hidden layer. For simplicity, layer normalizations are omitted. Compared to the standard Transformer layer, the only difference here is the usage of attention bias term B(lh) to incorporate p(l−1) from the pair representation track.

Update of pair representation

The pair representation p(0) is initialized by the positional encoding ψ. The update process of pair representation begins with an outer product of x(l), followed by a \({{\mathcal{O}}}({n}^{3})\) triangular multiplication, and is then concluded with an FFN layer. Formally, at l-th block, p(l) is sequentially updated as follow:

$${{{\boldsymbol{p}}}}^{(l)}=\, {{{\boldsymbol{p}}}}^{(l-1)}+{\mathtt{OuterProduct}}\,({{{\boldsymbol{x}}}}^{(l)}); \\ {{{\boldsymbol{p}}}}^{(l)}=\, {{{\boldsymbol{p}}}}^{(l)}+{\mathtt{TriangularUpdate}}\,({{{\boldsymbol{p}}}}^{(l)});\\ {{{\boldsymbol{p}}}}^{(l)}=\, {{{\boldsymbol{p}}}}^{(l)}+{\mathtt{FFN}}\,({{{\boldsymbol{p}}}}^{(l)}).$$
(3)

The OuterProduct is used for atom-to-pair communication, denoted as :

$$ {{\boldsymbol{a}}}={{{\boldsymbol{x}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{O1}}}}^{(l)},{{\boldsymbol{b}}}={{{\boldsymbol{x}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{O2}}}}^{(l)};\\ {{{\boldsymbol{o}}}}_{i,j}={\mathtt{flatten}}({{{\boldsymbol{a}}}}_{i}\otimes {{{\boldsymbol{b}}}}_{j});\\ {\mathtt{output}}={{\boldsymbol{o}}}{{{\boldsymbol{W}}}}_{{{\rm{O3}}}}^{(l)},$$
(4)

where \({{{\boldsymbol{W}}}}_{{{\rm{O1}}}}^{(l)},{{{\boldsymbol{W}}}}_{{{\rm{O2}}}}^{(l)}\in {{\mathbb{R}}}^{{d}_{x}\times {d}_{o}}\), do is the hidden dimension of OuterProduct, and \({{{\boldsymbol{W}}}}_{{{\rm{O3}}}}^{(l)}\in {{\mathbb{R}}}^{{d}_{o}^{2}\times {d}_{p}}\), o = [oi, j]. Please note that abo are temporary variables in the OuterProduct function. TriangularUpdate is used to enhance pair representation further, denoted as:

$$ {{\boldsymbol{a}}}={\mathtt{sigmoid}}\left({{{\boldsymbol{p}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{T1}}}}^{(l)}\right)\odot \left({{{\boldsymbol{p}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{T2}}}}^{(l)}\right); \\ {{\boldsymbol{b}}}={\mathtt{sigmoid}}\left({{{\boldsymbol{p}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{T3}}}}^{(l)}\right)\odot \left({{{\boldsymbol{p}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{T4}}}}^{(l)}\right);\\ {{{\boldsymbol{o}}}}_{i,j}=\mathop{\sum}_{k}{{{\boldsymbol{a}}}}_{i,k}\odot {{{\boldsymbol{b}}}}_{j,k}+{\sum}_{k}{{{\boldsymbol{a}}}}_{k,i}\odot {{{\boldsymbol{b}}}}_{k,j};\\ {\mathtt{output}}={\mathtt{sigmoid}}\left({{{\boldsymbol{p}}}}^{(l)}{{{\boldsymbol{W}}}}_{{{\rm{T5}}}}^{(l)}\right)\odot \left({{\boldsymbol{o}}}{{{\boldsymbol{W}}}}_{{{\rm{T6}}}}^{(l)}\right),$$
(5)

where \({{{\boldsymbol{W}}}}_{{{\rm{T1}}}}^{(l)},{{{\boldsymbol{W}}}}_{{{\rm{T2}}}}^{(l)},{{{\boldsymbol{W}}}}_{{{\rm{T3}}}}^{(l)},{{{\boldsymbol{W}}}}_{{{\rm{T4}}}}^{(l)}\in {{\mathbb{R}}}^{{d}_{p}\times {d}_{t}}\), \({{{\boldsymbol{W}}}}_{{{\rm{T5}}}}^{(l)}\in {{\mathbb{R}}}^{{d}_{p}\times {d}_{p}}\), \({{{\boldsymbol{W}}}}_{{{\rm{T6}}}}^{(l)}\in {{\mathbb{R}}}^{{d}_{t}\times {d}_{p}}\), o = [oi, j], and dt is the hidden dimension of TriangularUpdate. abo are temporary variables. The TriangularUpdate is inspired by the Evoformer in AlphaFold218. The difference is that AlphaFold2 uses two modules, “outgoing” (oi, j = ∑k ai,k bj,k) and “incoming” (oi,j = ∑kak,i bk,j) respectively. In Uni-Mol+, we merge the two modules into one to save the computational cost.

Conformation optimization

The conformation optimization process in many practical applications, such as Molecular Dynamics, is iterative. This approach is also employed in the Uni-Mol+. The number of conformation update iterations denoted as R, is a hyperparameter. We use superscripts on r to distinguish the 3D positions of atoms in different iterations. for example, at the i-th iteration, the update can be denoted as (yr(i)) = f(XEr(i−1)θ). It is noteworthy that parameters θ are shared across all iterations. Moreover, please note that the iterative update in Uni-Mol+ involves only a few rounds, such as 1 or 2, instead of dozens or hundreds of steps in Molecular Dynamics.

3D position prediction head

Regarding the 3D position prediction head within Uni-Mol+, we have adopted the 3D prediction head proposed in Graphormer-3D5, as cited in our manuscript. The architecture takes atom representation xL, pair representation pL, and initial coordinates c as inputs. An attention mechanism is initially employed and then the attention weights is multiplied point-wisely with the pairwise delta coordinates derived from the initial coordinates. Similar to SelfAttentionPairBias, the attention mechanism is denoted as:

$${{{\boldsymbol{Q}}}}^{h}=\, {{{\boldsymbol{x}}}}^{L}{{{\boldsymbol{W}}}}_{Q}^{h};\,{{{\boldsymbol{K}}}}^{h}={{{\boldsymbol{x}}}}^{L}{{{\boldsymbol{W}}}}_{K}^{h}; \\ {{{\boldsymbol{B}}}}^{h}=\, {{{\boldsymbol{p}}}}^{L}{{{\boldsymbol{W}}}}_{B}^{h};\,{{{\boldsymbol{V}}}}^{h}={{{\boldsymbol{x}}}}^{L}{{{\boldsymbol{W}}}}_{V}^{h};\\ {{{\boldsymbol{A}}}}_{i,\, j}^{h}=\, {\mathtt{softmax}}\left(\frac{{{{\boldsymbol{Q}}}}_{i}^{h}{({{{\boldsymbol{K}}}}_{j}^{h})}^{T}}{\sqrt{{d}_{h}}}+{{{\boldsymbol{B}}}}_{i,\, j}^{h}\right);\\ {{\Delta }}{{{\boldsymbol{c}}}}_{i,\, j}=\, {{{\boldsymbol{c}}}}_{i}-{{{\boldsymbol{c}}}}_{j};\,{{{\boldsymbol{A}}}}_{i,\, j}^{(h,0)}={{{\boldsymbol{A}}}}_{i,\, j}^{h}\odot {{\Delta }}{{{\boldsymbol{c}}}}_{i,\, j}^{0};\\ {{{\boldsymbol{A}}}}_{i,\, j}^{(h,\, 1)}=\, {{{\boldsymbol{A}}}}_{i,\, j}^{h}\odot {{\Delta }}{{{\boldsymbol{c}}}}_{i,j}^{1};\,{{{\boldsymbol{A}}}}_{i,\, j}^{(h,2)}={{{\boldsymbol{A}}}}_{i,j}^{h}\odot {{\Delta }}{{{\boldsymbol{c}}}}_{i,\, j}^{2};\\ $$
(6)

where dh is the head dimension, \({{{\boldsymbol{W}}}}_{Q}^{h},{{{\boldsymbol{W}}}}_{K}^{h},{{{\boldsymbol{W}}}}_{V}^{h}\in {{\mathbb{R}}}^{{d}_{x}\times {d}_{h}}\), \({{{\boldsymbol{W}}}}_{B}^{h}\in {{\mathbb{R}}}^{{d}_{p}\times 1}\). Ah is the attention weights, Δcij is the delta coordinate between ci and cj where the superscript 0, 1 and 2 represent the X axis, Y axis and Z axis respectively. Then the position prediction head predicts coordinate updates using three linear projections of the attention head values onto the three axes, which is denoted as:

$${{{\boldsymbol{o}}}}^{0}=\, {{\mathtt{Concat}}}_{h}({{{\boldsymbol{A}}}}^{(h,0)}{{{\boldsymbol{V}}}}^{h});\,{{{\boldsymbol{o}}}}^{0}={{\mathtt{Linear}}}_{1}({{{\boldsymbol{o}}}}^{0}); \\ {{{\boldsymbol{o}}}}^{1}=\, {{\mathtt{Concat}}}_{h}({{{\boldsymbol{A}}}}^{(h,1)}{{{\boldsymbol{V}}}}^{h});\,{{{\boldsymbol{o}}}}^{1}={{\mathtt{Linear}}}_{2}({{{\boldsymbol{o}}}}^{1});\\ {{{\boldsymbol{o}}}}^{2}=\, {{\mathtt{Concat}}}_{h}({{{\boldsymbol{A}}}}^{(h,2)}{{{\boldsymbol{V}}}}^{h});\,{{{\boldsymbol{o}}}}^{2}={{\mathtt{Linear}}}_{3}({{{\boldsymbol{o}}}}^{2});\\ {{\Delta }}{{\boldsymbol{c}}}^{\prime}=\, {\mathtt{Concat}}([{{{\boldsymbol{o}}}}^{0},{{{\boldsymbol{o}}}}^{1},{{{\boldsymbol{o}}}}^{2}]);\,{{\boldsymbol{c}}}^{\prime}={{\boldsymbol{c}}}+{{\Delta }}{{\boldsymbol{c}}}^{\prime} ;$$
(7)

where \({{\Delta }}{{\boldsymbol{c}}}^{\prime}\) is the predicted coordinate updates and \({{\boldsymbol{c}}}^{\prime}\) is the predicted coordinates.

As described in the above formula, the coordinate prediction head used in our study does not inherently enforce strict equivariance. This challenge can be addressed through one of two strategies: (1) Strict equivariance of the model can be achieved by sharing the parameters across the three linear layers in Eq. (7)-denoted as linear1linear2, and linear3-and concurrently eliminating the bias terms within these layers; (2) the model’s robustness to spatial transformations can be enhanced by incorporating random rotations into the input coordinates as a form of data augmentation. During our experimental phase, both techniques were rigorously tested. The latter approach-data augmentation via random rotations yielded better accuracy in quantum chemistry property predictions and was thus selected for our model architecture. In this case, empirical evidence suggests that with a sufficiently large training dataset, such as the PCQM4MV2 dataset, the model naturally tends towards an equivariant state. Specifically, our observations indicate that the parameters of the three linear layers tend to converge to the same, and the bias terms asymptotically approach zero, with the discrepancies being marginal (on the order of 1e − 4).

Training strategy

In DFT conformation optimization or Molecular Dynamics simulations, a conformation is optimized step-by-step, resulting in a trajectory from a raw conformation to the equilibrium conformation in Euclidean space. However, saving such a trajectory can be expensive, and publicly available datasets usually provide the equilibrium conformations only. Providing a trajectory would be beneficial as intermediate states can be used as data augmentation to guide the model’s training. Inspired by this, we propose a novel training approach, which generates a pseudo trajectory first, samples a conformation from it, and uses the sampled conformation as input to predict the equilibrium conformation. This approach allows us to better exploit the information in the molecular data, which we found can greatly improve the model’s performance. Specifically, we assume that the trajectory from a raw conformation rinit to a target equilibrium conformation rtgt is a linear process. We generate an intermediate conformation along this trajectory via noisy interpolation, i.e.,

$${{{\boldsymbol{r}}}}^{(0)}=q{{{\boldsymbol{r}}}}^{{{\rm{init}}}}+(1-q)\left({{{\boldsymbol{r}}}}^{{{\rm{tgt}}}}+{{\boldsymbol{c}}}\right),$$
(8)

where scalar q ranges from 0 to 1, the Gaussian noise \({{\boldsymbol{c}}}\in {{\mathbb{R}}}^{n\times 3}\) has a mean of 0 and standard deviation υ (a hyper-parameter). Taking r(0) as input, Uni-Mol+ learns to update towards the target equilibrium conformation rtgt. During inference, q is set to 1.0 by default. However, during training, simply sampling q from a uniform distribution ([0.0, 1.0]) may cause (1) a distributional shift between training and inference, due to the infrequent sampling of q = 1.0 (RDKit-generated conformation), and (2) an inability to learn an accurate mapping from the equilibrium conformation to the QC properties, as q = 0.0 (target conformation) is also not sampled often. Therefore, we employ a mixture of Bernoulli and Uniform distributions to flexibly assign higher sample probabilities to q = 1.0 and q = 0.0, while also sampling from interpolations. The above process is illustrated in Fig. 1c in Supplementary.

The model takes r(0) as input and generates r(R) after R iterations. Then, the model uses r(R) as input and predicts the QC properties. L1 loss is applied to the QC property regression and the 3D coordinate prediction. All loss calculations are performed solely on the final conformer at the last iteration.

Model configuration

Similar to both Graphormer4 and Transformer-M3, Uni-Mol+ comprises 12 layers with an atom representation dimension of dx = 768 and a pair representation dimension of dp = 256. The hidden dimension of FFN in the atom representation track is set to 768, while that of the pair representation track is set to 256. Additionally, the hidden dimension in the OuterProduct is do = 32, and the hidden dimension in the TriangularUpdate is dt = 32 as well. The number of conformation optimization iterations R is set to 1, indicating that the model iterates twice in total (once for conformation optimization and once for quantum chemistry property prediction). For the training strategy, we specified a standard deviation of υ = 0.2 for random noise and employed a particular sampling method for q. Specifically, q was set to 0.0 with probability 0.8, set to 1.0 with probability 0.1, and uniformly sampled from [0.4, 0.6] with probability 0.1. With this setting, the number of parameters of Uni-Mol+ is about 52.4M.

Setting for PCQM4MV2

We used the AdamW optimizer with a learning rate of 2e − 4, a batch size of 1024, (β1β2) set to (0.9, 0.999), and gradient clipping set to 5.0 during training, which lasted for 1.5 million steps, with 150K warmup steps. Additionally, an exponential moving average (EMA) with a decay rate of 0.999 was utilized. The training took approximately 5 days, utilizing 8 NVIDIA A100 GPUs. The inference on the 147k test-dev set took approximately 7 minutes, utilizing 8 NVIDIA V100 GPUs.

Setting for OC20

We use the default 12-layer Uni-Mol+ setting for OC20 experiments. The model configuration deviates slightly from the settings employed in PCQM4MV2. Firstly, since OC20 lacks graph information, graph-related features are excluded from the model. Secondly, due to the greater number of atoms present in OC20 compared to PCQM4MV2, the model capacity is marginally reduced for efficiency reasons. In particular, the pair representation dimension dp is set to 128, while the hidden dimensions in the OuterProduct and TriangularUpdate are set to do= 16 and dt = 16, respectively. Third, the periodic boundary condition needs to be considered; we adopt the solution proposed in5, which pre-expands the neighbor cells and then applies a radius cutoff to reduce the number of atoms. The AdamW optimizer was employed during the training process, which lasted for 1.5 million steps, including 150K warmup steps. The optimizer was configured with a learning rate of 2e − 4, a batch size of 64, (β1β2) values of (0.9, 0.999), and a gradient clipping parameter of 5.0. The training process spanned approximately 7 days and made use of 16 NVIDIA A100 GPUs.