Author manuscript; available in PMC: 2021 Aug 4.

Published in final edited form as: Commun Inf Syst. 2021;21(1):65–83. doi: 10.4310/cis.2021.v21.n1.a4

A Bayes-inspired theory for optimally building an efficient coarse-grained folding force field

Travis Hurst ¹, Dong Zhang ², Yuanzhe Zhou ³, Shi-Jie Chen ⁴

PMCID: PMC8336718 NIHMSID: NIHMS1690260 PMID: 34354546

Abstract

Because of their potential utility in predicting conformational changes and assessing folding dynamics, coarse-grained (CG) RNA folding models are appealing for rapid characterization of RNA molecules. Previously, we reported the iterative simulated RNA reference state (IsRNA) method for parameterizing a CG force field for RNA folding, which consecutively updates the simulation force field to reflect marginal distributions of folding coordinates in the structure database and extract various energy terms. While the IsRNA model was validated by showing close agreement between the IsRNA-simulated and experimentally observed distributions, here, we expand our theoretical understanding of the model and, in doing so, improve the parameterization process to optimize the subset of included folding coordinates, which leads to accelerated simulations. Using statistical mechanical theory, we analyze the underlying Bayesian concept that drives parameterization of the energy function, providing a general method for developing predictive, knowledge-based polymer force fields on the basis of limited data. Furthermore, we propose an optimal parameterization procedure, based on the principle of maximum entropy.

1. Introduction

The dynamics and structure of biological polymers, such as ribonucleic acid (RNA), provide insight into their evolutionary functional mechanics [1]. To grasp the function of existing RNA molecules and to design RNA-based therapeutics, we require deep knowledge of tertiary (3D) structure, kinetics, and thermodynamic stability that can be gained through rapid computational characterization techniques [2]. Recent discoveries highlighting the diversity and biological importance of RNA molecules have spurred rapid advancements in structure prediction [3–7], which use diverse methods, from bioinformatics data fitting to all-atom simulations, to predict RNA structure [8]. Knowledge-based, template-fitting methods can rapidly predict RNA structures with phylogenetic similarities to existing structures, especially if templates are available for the structure of interest [9]. However, they are limited by the availability of known structures and fragments, and most of these predictive methods do not attempt to predict conformational changes or folding dynamics, limiting their usefulness for predicting RNA characteristics. Unlike the template-based methods, all-atom molecular dynamics (MD) simulations do not necessarily rely on existing templates because of their ab initio nature, and they can be used to detail folding dynamics [10–12]. However, for lengthy RNA (>40 nucleotides), the enormous conformational space makes exhaustive, all-atom MD on the scale of milliseconds or seconds impossible with current computational limitations [13, 14].

Taking advantage of the strengths of both template-based, data fitting approaches and physics-based methods, coarse-grained (CG) molecular dynamics (CGMD) simulations constitute an appealing middle-ground between these extremes. By representing groups of atoms with CG pseudobeads, we overcome the limitations of the intrinsically large conformational space of long RNA molecules, smoothing the all-atom, rugged free energy landscape and selecting only the degrees of freedom necessary to capture the conformational dynamics, which accelerates simulations by both reducing the computed degrees of freedom and increasing the diffusion coefficient of the molecule on the energy landscape [15]. An ideal CG folding model would use a simple description that optimizes sampling efficiency while extending the ability to accurately estimate observables to larger RNA systems. CG force fields can be parameterized using physics-based, knowledge-based, or hybrid methods. Physics-based methods generally use clever averaging schemes to coarsely estimate forces from all-atom interactions, while knowledge-based methods calculate interaction occurrence frequencies from solved 3D structures. Hybrid approaches may use knowledge of the solved structures to estimate local, physical potentials from statistical distributions—such as bond stretching, bond angle bending, and torsion angle constraints—or non-local energy functions may be calculated using physics-based methods [16, 17]. Regardless of the particular parameterization strategy, CG folding methods can overcome computational limitations seen in all-atom sampling while providing more information about conformational dynamics than static, template-based prediction methods.

Previously, we reported the IsRNA method for parameterizing a CG force field for RNA folding [18]. While diverse CG models using a variety of resolutions and representations have been developed [5, 9, 16, 17, 19–26], the IsRNA model represents nucleotides using four or five beads and has two main advantages. Firstly, the IsRNA model estimates an n-dimensional joint probability distribution from limited experimental data, using iterative, consecutive inclusion of observed folding coordinate distributions, which allows the model to account for correlation effects, where n is the number of variables required for convergence between the simulated and experimentally observed marginal distributions of the set of considered folding coordinates. Secondly, the IsRNA model accounts for both native and non-native interactions in RNA folding. Provided with 2D constraints, the force field reproduces RNA observables and known 3D structures with alacrity. In combination with efficient experimental data to determine the 2D structure, this method can rapidly fold RNA molecules, finding low-energy candidates that represent the native structure. Efficient data can also be used to sieve experimentally-compatible structures, strengthening the prediction quality [27].

Here, we lay out a cohesive theory for parameterizing a CG force field for predictive polymer folding on the basis of knowledge extracted from a structural database. Using an iterative, Bayes-inspired approach, we construct a CG force field from consecutive simulations and marginal distributions of collective variables extracted from the RNA PDB database. The distributions of these variables are not statistically independent, so their joint distribution cannot be calculated by simply multiplying their marginal distributions together. Furthermore, the data is limited, so a joint distribution that reflects the full system energetics cannot be directly extracted from the structural data. Since the marginal distributions are not independent and we do not have enough data to extract an accurate force field from a joint distribution of folding coordinates, we combine data from reference simulations with observed marginal distributions to update our prior estimate of the joint force field in a Bayes-inspired approach. By adding new marginal distributions into the model one-by-one until all of the considered marginal distributions are reproduced by the energy function, we estimate a joint energy function for CG polymer folding simulations. This paper uses statistical mechanics to clarify the underpinning theory and assumptions in the model. Parameterization of the force field is optimized using the principle of maximum entropy, which informs the model of the optimal marginal distribution to include in the following step. In previous work, the principle of maximum entropy has been used for modeling intrinsically disordered proteins [28] and for combining experimental and simulation data [29]. We show that the principle of maximum entropy is directly related to the model, providing support for the underlying theory. Furthermore, the principle of maximum entropy has a Bayesian interpretation [30], providing a conceptual framework for easily understanding the model theory. 
We also illustrate the predictive power of the IsRNA model by applying it to accurately fold RNA into near-native 3D structures.

2. Theory and Methods

In this section, we describe and expand the theory supporting the iterative simulated reference state method of building a CG polymer force field, showing how to optimize inclusion of new folding coordinates consistent with the principle of maximum entropy. First, we describe the system setup and provide a simple example for how, in theory, we can use statistical mechanics principles to extract a CG energy function that is consistent with observations in the structure database. Second, we generalize and expand this theory into a workable model that uses iterative corrections to avoid growth of numerical instabilities during the practical parameterization process for a general polymer. Finally, we describe the principle of maximum entropy and show how it can be used to optimize the order of inclusion of new folding coordinates. We also show that our procedure is consistent with the principle of maximum entropy to provide further evidence that the method for parameterizing a CG polymer energy function is sound.

2.1. Setting up the system

In the IsRNA model, we use two CG beads (P and S) to represent the phosphate and sugar moieties on the backbone, respectively. For the base, purines (pyrimidines) have three (two) CG beads, and the model uses a total of ten distinct bead types. Non-local base pairing interactions are realized with two pairwise interactions, so we can model hydrogen bonding configurations and energies for various base pairing configurations. See Fig. 1 for details of the mapping from all-atom structure to CG representation.

Figure 1: — Coarse-grained descriptions of canonical A, U, G, C nucleotides. The backbone is represented by beads located at the P and C4’ atoms (denoted S for the related ribose sugar). Purine (A and G) bases are represented by three beads each (R_C, A_C, A_N) and (R_C, G_O, G_N). Pyrimidine (U and C) bases are represented by two beads each (Y_C, U_O) and (Y_C, C_N). CG beads for the bases are centered at the corresponding heavy-atom group center-of-mass.

We begin our approach with the most basic assumptions about RNA polymer interactions. Namely, we assume a non-interacting reference state and extract statistical potentials of bond stretching connectivity and Lennard-Jones (LJ) volume exclusion terms so that the backbone energy is

E_{bk} (b, r) = E_{bond} (b) + E_{LJ} (r)

(1)

and we use this as the starting CG force field, generating the ref₀ conformational ensemble. As shown in Fig. 2A, this information is sufficient to reproduce the θ₂ bond angle distribution through simulation, because that distribution is already determined by the chain connectivity and volume exclusion information.

Figure 2: — Correlated folding coordinates lead to same final energies regardless of parameterization order. A) After incorporating bond length information in our reference force field, information about the S-R_C-G_O bond angle θ₁ still needs to be included because the reference simulation does not capture the θ₁ distribution. However, a correlated G_O-R_C-G_O angle θ₂ is fully captured by the reference force field without being explicitly included in the force field. B) As we include more folding coordinates in our energy function, the 5’S-P-S-R_C3’ torsion angle ϕ becomes more similar to the observed distribution, indicating correlation between ϕ and the previously included structure factors. C) and D) Although the individual factors are correlated, the final energy function yields the same final energies, regardless of parameterization order. In C), the r₁ folding coordinate is incorporated into the force field, followed by r₂. In D), the parameterization order is reversed, but the final total energies and distributions produced by simulation are unchanged.

Our goal is to accurately estimate the folding energy function for RNA interactions, which are captured by bond distance b, bond angle θ, torsion ϕ, and distance-based r non-local base interaction energies. In summary, our total energy function can be written as

E_{tot} (b, r, θ, ϕ) = E_{bk} (b, r) + E_{angle} (θ) + E_{torsion} (ϕ) + E_{pair} (r) + E_{ele} (r)

(2)

Although this is written as a sum of energy functions, the correlations are implicitly included through simulations, which allows us to ascertain what information about a new folding coordinate is already included in the estimated CG force field.

2.2. Supporting the IsRNA theory with statistical mechanics

Here, we provide an example to illustrate how we can use statistical mechanics to build the total force field by consecutively updating our prior energy function to enforce compatibility with an observed folding coordinate distribution. Using $E_{ref}^{(i)}$ as the reference state force field from inclusion of the ith folding coordinate, we can run a CGMD simulation and extract the simulated distribution p_ref(θ_j) = p_sim(θ_j)|_ref of the jth bond angle θ_j. From the PDB database, we also know the observed distribution p_obs(θ_j), and we can calculate the difference between the observed and reference energy functions using

E (θ_{j}) = - ln p_{obs} (θ_{j}) + ln p_{ref} (θ_{j}) = - ln \frac{p_{obs} (θ_{j})}{p_{ref} (θ_{j})}

(3)

where we use the dimensionless k_BT = 1 in this section for simplicity. Accounting for the difference between the observed distribution and the CGMD-generated reference distribution, we define the energy function for the next simulation as

E_{ref}^{(i + 1)} (θ_{j}) \equiv E_{ref}^{(i)} (θ_{j}) + E (θ_{j})

(4)

Physically, this ensures that our energy function is only adding the missing information about θ_j into the force field. Explicitly, we can show that this energy function will reproduce the observed distribution. The marginal reference distribution of θ_j generated via simulation can be written in terms of the reference joint energy function with previously added terms integrated out:

p_{ref}^{(i)} (θ_{j}) = \int e^{- E_{ref}^{(i)} (b, r, θ, ϕ)} d b d ϕ d r d θ_{0} \dots d θ_{j - 1}

(5)

Similarly, the simulated distribution of θ_j after including information about the observed distribution can be written

p_{sim}^{(i + 1)} (θ_{j}) = \int e^{- E_{ref}^{(i + 1)} (b, r, θ, ϕ)} d b d ϕ d r d θ_{0} \dots d θ_{j - 1} = e^{- E_{ref}^{(i)} (θ_{j})} e^{- E (θ_{j})} = p_{ref}^{(i)} (θ_{j}) \frac{p_{obs} (θ_{j})}{p_{ref}^{(i)} (θ_{j})} = p_{obs} (θ_{j})

(6)

which shows that simulations using $E_{ref}^{(i + 1)} (b, r, θ, ϕ)$ as the CGMD force field can reproduce the marginal probability for some bond angle θ_j after information about that coordinate is included in the force field. Indeed, once we include a coordinate, we find that the simulated distribution does match the observed distribution for that folding coordinate (see Fig. 2A–D).
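As an illustration of Eqs. 3–6, the following minimal Python sketch computes the energy correction from two histograms and checks that reweighting the reference histogram by that correction recovers the observed one. The histograms are synthetic Gaussians, and the function names `energy_correction` and `reweight` and the small `eps` regularizer for empty bins are assumptions made for this example. In the actual procedure, the new marginal is obtained from a fresh CGMD simulation rather than from reweighting.

```python
import numpy as np

def energy_correction(p_obs, p_ref, eps=1e-12):
    """Eq. 3 with k_B T = 1: E(theta) = -ln(p_obs / p_ref), bin by bin."""
    p_obs = np.asarray(p_obs, dtype=float)
    p_ref = np.asarray(p_ref, dtype=float)
    # eps (an assumption of this sketch) avoids log(0) in empty bins.
    return -np.log((p_obs + eps) / (p_ref + eps))

def reweight(p_ref, e_corr):
    """Marginal expected after adding e_corr to the reference energy (Eq. 6)."""
    weights = p_ref * np.exp(-e_corr)
    return weights / weights.sum()

# Synthetic histograms over a bond angle in degrees (illustrative only).
theta = np.linspace(60.0, 180.0, 61)
p_ref = np.exp(-0.5 * ((theta - 120.0) / 25.0) ** 2)
p_ref /= p_ref.sum()
p_obs = np.exp(-0.5 * ((theta - 105.0) / 10.0) ** 2)
p_obs /= p_obs.sum()

e_corr = energy_correction(p_obs, p_ref)   # E(theta_j) in Eq. 3
p_new = reweight(p_ref, e_corr)            # p_sim^(i+1)(theta_j) in Eq. 6
max_abs_err = float(np.max(np.abs(p_new - p_obs)))
print(f"largest deviation from p_obs: {max_abs_err:.2e}")
```

The deviation is limited only by the regularizer, which mirrors the analytical statement that the corrected energy function reproduces the observed marginal.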

To emphasize our previous point, although the force field is an additive sum of the different terms, the local variables (b, θ, ϕ) are still correlated through the non-local distance (r), which is a function of (b, θ, ϕ) in a highly coupled form. Similarly, the E(θ) and E(ϕ) energy terms are also coupled through non-local interactions. Without inclusion of the non-local terms, some of the local terms may be physically uncorrelated.

2.3. Generalizing and numerically stabilizing the model

Up to this point, we have presented a simplified version of the theoretical arguments in our previous work [18]. We now expand the IsRNA approach in greater theoretical and practical detail to provide a general method for parameterizing a CG polymer force field. In general, the joint probability distribution cannot be uniquely determined from the set of marginal probability distributions that compose it, which is why we cannot simply multiply together all of the observed marginal distributions we find in the structure database to calculate the joint distribution and the total force field. However, by simulating the conditional probability distributions, we may estimate the joint probability distribution. In theory, we can broaden our method to use any collective variables of the system, and we are not limited to simple folding coordinates, such as distances and angles. However, the more time-consuming a collective variable is to calculate, the less efficient simulations using that variable will be, so simple distances and angles are appealing building blocks for a CG force field. We take the general perspective that the marginal distribution of a collective variable extracted from a reference simulation as described in the previous section can be viewed as the marginal distribution, conditional on the previously included distributions of collective variables. Then, the estimated energy function after each consecutive step is extracted from the estimated joint distribution of all the considered collective variables, conditional on the already included distributions of collective variables.

Our energy function is considered to have converged when the estimated joint distribution after a step is sufficient to reproduce via simulation all of the considered observed marginal distributions, including those marginal distributions that have not been explicitly included in the joint distribution. For instance, assume we have a set of m calculated collective variables (x₀, … , x_m). Assume that so far in our procedure, we have explicitly included j (< m) collective variables to produce an energy function E(x₀, … , x_j). If the marginal distributions observed from the structural database agree with the marginal distributions extracted from the simulation results for all of the considered collective variables, the energy function has converged, and our force field is adequately parameterized, having captured all of the available information in the considered distributions. While here we use marginal probability distributions, the explicit inclusion of some correlation effects by consecutively including some joint distributions of pairs of collective variables may be beneficial to explore, if enough data is available.

Again using statistical mechanics, we show here that the model provides a sound method for parameterizing a predictive force field. After running the simulation with reference energy parameters for backbone bond length and volume exclusion to provide a starting reference state with E = 0, we obtain the simulated distribution for the first collective variable to be added, p_sim(x₀)|₀. We can write the energy function from the observed structures as

{E (x_{0}) |}_{x_{0}} = - ln \frac{p (x_{0})}{{p_{sim} (x_{0}) |}_{0}}

(7)

where p(x₀) is observed from solved structures. Again, we use k_BT = 1 for convenience. This energy function provides constraints for the next simulation step and is analytically sufficient to reproduce the p(x₀) distribution via simulation.

{p_{sim} (x_{0}) |}_{E_{0}} = \frac{Ω e^{- E (x_{0})}}{\sum_{x_{0}} Ω e^{- E (x_{0})}} = p (x_{0})

(8)

which is the result we obtained in Eq. 6. Here, we used the identity g(x₀) = Ω · p_sim(x₀)|₀, where Ω is the total number of conformations generated by the simulation, and g(x₀) is the degenerate distribution function obtained from simulation.
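One way to spell out the intermediate steps behind Eq. 8 is sketched below. It assumes that each Boltzmann factor is weighted by the degeneracy g(x₀) obtained from the reference simulation, and then substitutes the identity g(x₀) = Ω · p_sim(x₀)|₀ and the energy function from Eq. 7. The primed summation index is notation introduced here for clarity.

```latex
\begin{aligned}
{p_{sim}(x_{0})|}_{E_{0}}
  &= \frac{g(x_{0})\, e^{-E(x_{0})}}
          {\sum_{x_{0}'} g(x_{0}')\, e^{-E(x_{0}')}} \\
  &= \frac{\Omega\, {p_{sim}(x_{0})|}_{0}\, \dfrac{p(x_{0})}{{p_{sim}(x_{0})|}_{0}}}
          {\sum_{x_{0}'} \Omega\, {p_{sim}(x_{0}')|}_{0}\, \dfrac{p(x_{0}')}{{p_{sim}(x_{0}')|}_{0}}} \\
  &= \frac{p(x_{0})}{\sum_{x_{0}'} p(x_{0}')} = p(x_{0})
\end{aligned}
```

The factor Ω and the reference distribution cancel between numerator and denominator, and the last equality uses the normalization of the observed distribution.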

Generalizing the estimated joint distribution after consecutive inclusion of j marginal distributions, we find

P (x_{j}) = P (x_{n}) \frac{p (x_{j})}{{p_{sim} (x_{j}) |}_{E_{n}}} \prod_{i = 0}^{n} \frac{p (x_{i})}{{p_{sim} (x_{i}) |}_{E_{n}}}

(9)

where j = n + 1, P (x_j) = p_est(x₀, … , x_j) is the joint distribution estimated from inclusion of the marginals up to x_j, and E_n = E_est(x₀, … , x_n) is the estimated total energy function that provides constraints in the simulation step. The question now is whether we can analytically show that this estimation of the joint probability will provide appropriate energy constraints for simulated reproduction of the marginal probabilities observed in the database. To evaluate this, we consider the case where j = 1 (i.e., the inclusion of the second observed distribution into the energy function). Generating a simulated distribution of x₁ on the basis of constraints provided by Eq. 7, ${p_{sim} (x_{1}) |}_{E_{0}}$ , we include the observed marginal distribution p(x₁) using

P (x_{1}) = P (x_{0}) \frac{p (x_{1})}{{p_{sim} (x_{1}) |}_{E_{0}}} \frac{p (x_{0})}{{p_{sim} (x_{0}) |}_{E_{0}}}

(10)

where ideally

\frac{p (x_{0})}{{p_{sim} (x_{0}) |}_{E_{0}}} = 1

(11)

everywhere, according to Eqs. 6 and 8. However, retaining this in our expression allows us to correct any small numerical errors that would grow with propagation as we build our estimated joint distribution, maintaining compatibility with the previously incorporated marginal distributions at each step. This can also be viewed as including effects from cross conditional probabilities into the estimated joint distribution. At this stage, our joint energy function has the form

E (x_{0}, x_{1}) = - ln \left[ P (x_{0}) \frac{p (x_{1})}{{p_{sim} (x_{1}) |}_{E_{0}}} \frac{p (x_{0})}{{p_{sim} (x_{0}) |}_{E_{0}}} \right] = {E (x_{0}) |}_{x_{0}} + {E (x_{0}) |}_{x_{0}, x_{1}} + {E (x_{1}) |}_{x_{0}, x_{1}}

(12)

where, as we alluded to earlier, the cross-conditional correction to the energy ${E (x_{0}) |}_{x_{0}, x_{1}} \approx 0$ everywhere, which implies that in the next step ${p_{sim} (x_{0}) |}_{E_{1}} = p (x_{0})$ . Since we have explicitly shown our decomposition of the joint into quasi-marginals that account for cross conditionals, the argument we made in Eqs. 6 and 8 holds, and it can be shown that the simulated distribution of x₀ remains true to the observed marginal distribution from the structure database (see Figs. 3 and 4 for examples).
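The role of the correction factors in Eqs. 9–12 can be illustrated with a toy model in which the "simulation" is replaced by exact enumeration of a discrete two-variable joint distribution. The following Python sketch is only an illustration under that assumption: the correlated reference ensemble, the target marginals, and the function names are all hypothetical. In this noise-free setting, repeatedly applying the ratio update behaves like iterative proportional fitting, a standard technique, whereas the actual procedure estimates each simulated marginal from a CGMD run.

```python
import numpy as np

def marginals(joint):
    """Return the (x0, x1) marginals of a 2D joint distribution."""
    return joint.sum(axis=1), joint.sum(axis=0)

def update(joint, p0, p1, include_x1=True):
    """One Eq. 9-style step: multiply by observed/simulated marginal ratios."""
    m0, m1 = marginals(joint)
    r0 = p0 / m0                                   # correction for included x0
    r1 = p1 / m1 if include_x1 else np.ones_like(p1)
    new = joint * r0[:, None] * r1[None, :]
    return new / new.sum()

n = 20
grid = np.arange(n)
i, j = np.meshgrid(grid, grid, indexing="ij")

# Hypothetical correlated reference ensemble (stands in for a simulation).
ref = np.exp(-0.5 * ((i - j) / 4.0) ** 2)
ref /= ref.sum()

# Hypothetical observed marginals (stand in for database histograms).
p0 = np.exp(-0.5 * ((grid - 6.0) / 3.0) ** 2)
p0 /= p0.sum()
p1 = np.exp(-0.5 * ((grid - 12.0) / 3.0) ** 2)
p1 /= p1.sum()

# Eq. 7: include x0 only.
joint = update(ref, p0, p1, include_x1=False)

# Eq. 9: include x1 while continuing to correct x0 at every step.
errors = []
for _ in range(200):
    joint = update(joint, p0, p1)
    m0, m1 = marginals(joint)
    errors.append(max(np.abs(m0 - p0).max(), np.abs(m1 - p1).max()))

final_error = float(errors[-1])
print(f"first-step error {errors[0]:.3e}, final error {final_error:.3e}")
```

Because x₀ and x₁ are correlated in the reference ensemble, including x₁ perturbs the x₀ marginal after the first step. Retaining the x₀ ratio in later steps removes that drift, which is the behavior the retained factor in Eq. 10 is meant to provide.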

Figure 3: — Incorporating the first non-local pairwise folding coordinate. After all the required local folding coordinates have been included in E_ref (r_s), non-local base pairing effects can be accounted for. Once this collective variable is included, the energy function produced by simulation E_sim (r_s) matches the observed distribution in the structure database E_obs(r_s).

Figure 4: — A selection of extracted CG energy functions using the IsRNA method. The left and middle columns show non-local energy functions between nucleotides. The right column shows the energy function for a local bond length b, torsion angle ϕ, and bond angle θ.

2.4. Maximizing entropy to optimize model construction

Although the consecutive procedure detailed above will lead to an estimated joint distribution of our collective variables that is consistent with all of the extracted marginal distributions—regardless of the order that the collective variables are included—the convergence and dimensionality of the energy function can be optimized on the set of all considered collective variables using the principle of maximum entropy, capturing all of the information with the fewest possible variables. In order to speed up convergence of the energy function, the IsRNA model incorporates the collective variable that requires the most change to the reference energy function in order to reproduce the observed distribution of the variable via simulation. This well-defined approach leads to an intuitively reasonable order of collective variable inclusion. First, we extract bond length and volume exclusion terms. After that, we include increasingly non-local terms, starting with bond angles, moving to torsion angles, and ending with pairwise distances. Viewing our problem from a Bayesian perspective, our objective is to optimally update from a prior distribution $Q = {q_{est} (x_{j}) |}_{E_{n}}$ to a posterior distribution $P = {P_{est} (x_{j}) |}_{E_{j}}$ , given that the posterior lies within a certain family of possible distributions, $p = {p_{est} (x_{0}, \dots, x_{j}) |}_{E_{j}}$ . In our case, the posterior depends on which collective variable we choose to include next (x_j), and we want to include the collective variable that maximizes our information gain at each step. The selected posterior P is that which maximizes the entropy S[p, q]. Since prior information is valuable, the functional S[p, q] is chosen so that beliefs are minimally updated to the extent required by the new information.

The relative entropy between a prior distribution q(x) and a posterior distribution p(x) is given by

S [p, q] = - \int p (x) log \frac{p (x)}{q (x)} d x

(13)

This quantity is the negative of the Kullback-Leibler divergence, a measure of surprise that is also called the information gain. For our purposes, the term “information gain” helps us intuitively understand how we can use this concept to parameterize a CG force field. Including this concept in our iterative, consecutive procedure will allow us to reach a convergent energy function with the fewest marginal distributions included in our calculation, which has three main benefits. First, using fewer collective variables to parameterize the force field will accelerate simulations. Second, parameterizing the force field in this fashion requires fewer computational resources. Third, the concept of maximum entropy is well established, and the direct connection between IsRNA and the theory provides support for the validity of IsRNA. In essence, we check after each simulation to see which candidate observed marginal distribution disagrees the most with the simulated marginal distribution that was constrained by the energy function composed of all of the previously incorporated marginal and cross conditional distributions. The one that disagrees the most gets incorporated into the joint distribution, providing the minimum information that the model must include to gain compatibility with all of the considered marginal distributions.

As can be seen in Eq. 12, our energy constraints supplied by new information at step j can be written as

E_{j} = - ln \frac{p (x_{j})}{{p_{sim} (x_{j}) |}_{E_{n}}}

(14)

where j = n + 1. Identifying the candidate p(x_j) as p and the previously known information about the distribution of x_j, ${p_{sim} (x_{j}) |}_{E_{n}}$ , as q, we see that the principle of maximum entropy agrees with the incorporation of x_j into our energy function as long as no other x_j can be found that increases the amount of information we can gain in this step.

More directly, we can take the ratio of the estimated joint probability before and after incorporating the new information to see how this is true. Using Eq. 9

\frac{p}{q} = \frac{P (x_{j})}{P (x_{n})} = \frac{P (x_{n}) \frac{p (x_{j})}{{p_{sim} (x_{j}) |}_{E_{n}}} \prod_{i = 0}^{n} \frac{p (x_{i})}{{p_{sim} (x_{i}) |}_{E_{n}}}}{P (x_{n})}

(15)

the joint probability from the previous step cancels in the ratio and leaves us with our proposed information gain times the product of cross conditional probability corrections, which we can take for the moment to be approximately unity (i.e., not contributing much to the information gain, according to Eqs. 6 and 8). We therefore find

\frac{p}{q} = \frac{p (x_{j})}{{p_{sim} (x_{j}) |}_{E_{n}}}

(16)

which is the main ratio contributing to the maximum entropy (information gain) formula.

Although we may choose any order of collective variables to include in our energy function (see Figs. 2C–D), this approach provides a way to minimize the number of variables required to accurately simulate structures in agreement with the solved structure ensemble. The principle of maximum entropy also optimizes the parameterization process, removing the need to guess and check the parameterization order, and supports our choice of marginal energy function. As we can see above, the principle of maximum entropy is consistent with our statistical mechanics approach to update the estimated joint probability distribution and corresponding energy function.
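The selection rule described in this section can be sketched in a few lines of Python. The example below is a minimal illustration, not the IsRNA implementation: it scores each candidate collective variable by the Kullback-Leibler divergence between its observed and simulated histograms (the ratio in Eq. 16) and picks the largest. The histogram values, the variable names, the `eps` regularizer, and the convergence tolerance `tol` are all hypothetical.

```python
import numpy as np

def information_gain(p_obs, p_sim, eps=1e-12):
    """Kullback-Leibler divergence sum p_obs * ln(p_obs / p_sim), k_B T = 1."""
    p = np.asarray(p_obs, dtype=float) + eps   # eps guards empty bins
    q = np.asarray(p_sim, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def pick_next(observed, simulated, tol=1e-3):
    """Return (name, gain) for the most discrepant candidate variable.

    The name is None when every gain is at or below tol, i.e. when the
    current energy function already reproduces all considered marginals.
    """
    gains = {k: information_gain(observed[k], simulated[k]) for k in observed}
    name = max(gains, key=gains.get)
    if gains[name] <= tol:
        return None, gains[name]
    return name, gains[name]

# Hypothetical three-bin histograms for three candidate coordinates.
observed = {
    "theta_1": [0.10, 0.60, 0.30],
    "theta_2": [0.20, 0.50, 0.30],
    "phi": [0.25, 0.50, 0.25],
}
simulated = {
    "theta_1": [0.40, 0.30, 0.30],
    "theta_2": [0.20, 0.50, 0.30],
    "phi": [0.30, 0.45, 0.25],
}

name, gain = pick_next(observed, simulated)
print(name, round(gain, 3))
```

In this toy input, `theta_2` is already reproduced by the simulation and contributes no information, so it would not need to be added explicitly, which is how the procedure can converge with fewer variables than the full candidate set.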

2.5. Applying the model to fold RNA

Finally, the derived CG force field was applied to study RNA folding behaviors and 3D structure predictions. From coiled states, several RNA molecules were de novo folded by applying the derived IsRNA force field through simulated annealing molecular dynamics (MD) in the modified LAMMPS [31] package. In detail, annealing MD simulations were carried out on the structures, scaling the temperature from 350 K to 200 K over 500 ns of simulation time with an integration time step of Δt = 1 fs. Then, an additional 500 ns of simulation (1 μs simulation time in total for each RNA molecule) was performed at constant temperature (200 K) to sufficiently sample the conformational space, and snapshots were recorded every 100 ps from the trajectories. To obtain the predicted 3D structure, structures in the 10% lowest potential energy within the collected snapshots were clustered and the centroid structure of the largest cluster was chosen. As shown in Fig. 5, for the 19-nucleotide duplex with a cytosine bulge [32] (PDB ID: 1DQF) and the 12-nucleotide CUUG hairpin loop [33] (PDB ID: 1RNG), the 3D structures predicted by the IsRNA model have heavy-atom root-mean-square deviations (RMSDs) from the native structures of 2.2 Å and 2.5 Å, respectively. Additionally, the folding behaviors of these RNA molecules are characterized by funnel-like plots of the RMSDs with potential energies. Furthermore, simulation in the proposed IsRNA model successfully elucidated the folding pathway for an RNA pseudoknot by predicting a series of kinetically important intermediates and incorporating information from a nanopore experiment and master equation analysis [34]. Taken together, these results demonstrate that the derived CG force field in the IsRNA model could serve as a potent tool to study RNA folding behaviors and to predict RNA 3D structures.
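The bookkeeping implied by this protocol can be checked with a short Python sketch. It converts the stated durations into integration steps and snapshot counts, and shows one way to select the lowest-energy 10% of snapshots before clustering. The linear shape of the temperature ramp, the function names, and the synthetic energies are assumptions of this example. The text does not state which phase contributes snapshots to the final selection, so counts are given per 500 ns phase, and the clustering step itself is omitted.

```python
import numpy as np

FS_PER_NS = 1_000_000
FS_PER_PS = 1_000

dt_fs = 1                 # integration time step
anneal_ns = 500           # annealing phase, 350 K -> 200 K
hold_ns = 500             # constant-temperature phase at 200 K
snapshot_ps = 100         # snapshot interval

anneal_steps = anneal_ns * FS_PER_NS // dt_fs
hold_steps = hold_ns * FS_PER_NS // dt_fs
steps_per_snapshot = snapshot_ps * FS_PER_PS // dt_fs
snapshots_per_phase = hold_steps // steps_per_snapshot

def temperature(step, t_start=350.0, t_end=200.0):
    """Temperature at an MD step: linear ramp (assumed), then constant."""
    frac = min(step / anneal_steps, 1.0)
    return t_start + frac * (t_end - t_start)

def lowest_energy_indices(energies, fraction=0.10):
    """Indices of the lowest-energy fraction of snapshots, sorted by energy."""
    energies = np.asarray(energies, dtype=float)
    n_keep = max(1, int(round(fraction * energies.size)))
    return np.argsort(energies)[:n_keep]

# Synthetic potential energies standing in for recorded snapshots.
rng = np.random.default_rng(0)
energies = rng.normal(loc=-100.0, scale=5.0, size=snapshots_per_phase)
selected = lowest_energy_indices(energies)

print(anneal_steps, steps_per_snapshot, snapshots_per_phase, selected.size)
```

With a 1 fs time step, each 500 ns phase corresponds to 5 × 10⁸ integration steps and 5,000 snapshots, so a 10% cut leaves 500 candidate structures per phase for clustering.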

Figure 5: — Application of the IsRNA model in RNA folding simulations and 3D structure predictions for (A) the helix II of Xenopus laevis 5S rRNA with a cytosine bulge (PDB ID: 1DQF) and (B) the CUUG hairpin loop (PDB ID: 1RNG). From left to right: the secondary structure extracted from the native structure, the predicted 3D structure (red color) superposed to the native structure (blue color), and the plot of RMSDs with potential energies in the folding simulation.

3. Conclusion

In this work, we simplify the basic theory underpinning the IsRNA model of parameterizing a CGMD force field. Further, we expand the model—explaining the applicability of the model to a general polymer—and discuss our practical approach to attenuate numerical instabilities and errors that may propagate without proper treatment in the model. Although the method of building the force field by consecutively including marginal distributions of individual distance and angle folding coordinates, one-by-one, works for any order of folding coordinate inclusion, we found that optimizing the parameterization process using the principle of maximum entropy allows us to capture all of the dynamic folding information contained in the solved structure database with the fewest possible collective variables. Using the principle of maximum entropy to guide the parameterization process has two main benefits. First, parameterizing the model requires fewer steps, rapidly converging to reproduce observed distributions of our collective variables, and we don’t have to guess and check to minimize the number of collective variables required to capture all the information contained in the considered coordinate distributions. Second, the force field uses the fewest possible collective variables selected from the set of considered collective variables, which will make simulations performed with the optimized IsRNA model more efficient. We also provide examples, applying the IsRNA model to fold RNA structures of interest with high accuracy. While we have used this method to successfully model 3D RNA folding, the methodology used in the IsRNA model applies to a wide variety of biopolymer folding problems, and we hope the present study will make the IsRNA procedure more approachable to those desiring to build their own CG force field.

Acknowledgments

Research supported by NSF Graduate Research Fellowship Program under Grant 1443129.

Research supported by NIH Grants R35-GM134919 and R01-GM117059.

Contributor Information

Travis Hurst, Department of Physics, University of Missouri-Columbia, Columbia, MO 65211, USA.

Dong Zhang, Department of Physics, University of Missouri-Columbia.

Yuanzhe Zhou, Department of Physics, University of Missouri-Columbia, Columbia, MO 65211, USA.

Shi-Jie Chen, Department of Physics, Department of Biochemistry, MU Institute for Data Science and Informatics, University of Missouri-Columbia, Columbia, MO 65211, USA.

References

[1]. Serganov A; Patel DJ Molecular recognition and function of riboswitches. Curr. Opin. Struct. Biol 2012, 22, 279–286.
[2]. Larsen KP; Choi J; Prabhakar A; Puglisi EV; Puglisi JD Relating Structure and Dynamics in RNA Biology. Cold Spring Harb. Perspect. Biol 2019, 11.
[3]. Das R; Baker D Automated de novo prediction of native-like RNA tertiary structures. Proc. Natl. Acad. Sci 2007, 104, 14664–14669.
[4]. Parisien M; Major F The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature 2008, 452, 51–55.
[5]. Ding F; Sharma S; Chalasani P; Demidov VV; Broude NE; Dokholyan NV Ab initio RNA folding by discrete molecular dynamics: from structure prediction to folding mechanisms. RNA 2008, 14, 1164–1173.
[6]. Cao S; Chen S-J Physics-Based De Novo Prediction of RNA 3D Structures. J. Phys. Chem. B 2011, 115, 4216–4226.
[7]. Popenda M; Szachniuk M; Antczak M; Purzycka KJ; Lukasiak P; Bartol N; Blazewicz J; Adamiak RW Automated 3D structure composition for large RNAs. Nucleic Acids Res 2012, 40, e112.
[8]. Noid WG Perspective: Coarse-grained models for biomolecular systems. J. Chem. Phys 2013, 139, 90901.
[9]. Jonikas MA; Radmer RJ; Laederach A; Das R; Pearlman S; Herschlag D; Altman RB Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA 2009, 15, 189–199.
[10]. Brooks BR et al. CHARMM: the biomolecular simulation program. J. Comput. Chem 2009, 30, 1545–1614.
[11]. Salomon-Ferrer R; Case DA; Walker RC An overview of the Amber biomolecular simulation package. WIREs Comput. Mol. Sci 2013, 3, 198–210.
[12]. Hollingsworth SA; Dror RO Molecular Dynamics Simulation for All. Neuron 2018, 99, 1129–1143.
[13]. Chen AA; García AE High-resolution reversible folding of hyperstable RNA tetraloops using molecular dynamics simulations. Proc. Natl. Acad. Sci 2013, 110, 16820–16825.
[14]. Šponer J; Banáš P; Jurečka P; Zgarbová M; Kührová P; Havrila M; Krepl M; Stadlbauer P; Otyepka M Molecular Dynamics Simulations of Nucleic Acids. From Tetranucleotides to the Ribosome. J. Phys. Chem. Lett 2014, 5, 1771–1782.
[15]. Meinel MK; Müller-Plathe F Loss of Molecular Roughness upon Coarse-Graining Predicts the Artificially Accelerated Mobility of Coarse-Grained Molecular Simulation Models. J. Chem. Theory Comput 2020, 16, 1411–1419.
[16]. He Y; Maciejczyk M; Ołdziej S; Scheraga HA; Liwo A Mean-field interactions between nucleic-acid-base dipoles can drive the formation of a double helix. Phys. Rev. Lett 2013, 110, 98101.
[17]. Cragnolini T; Laurin Y; Derreumaux P; Pasquali S Coarse-Grained HiRE-RNA Model for ab Initio RNA Folding beyond Simple Molecules, Including Noncanonical and Multiple Base Pairings. J. Chem. Theory Comput 2015, 11, 3510–3522.
[18]. Zhang D; Chen S-J IsRNA: An Iterative Simulated Reference State Approach to Modeling Correlated Interactions in RNA Folding. J. Chem. Theory Comput 2018, 14, 2230–2239.
[19]. Xia Z; Gardner DP; Gutell RR; Ren P Coarse-grained model for simulation of RNA three-dimensional structures. J. Phys. Chem. B 2010, 114, 13497–13506.
[20]. Denesyuk NA; Thirumalai D Coarse-Grained Model for Predicting RNA Folding Thermodynamics. J. Phys. Chem. B 2013, 117, 4901–4911.
[21]. Xia Z; Bell DR; Shi Y; Ren P RNA 3D structure prediction by using a coarse-grained model and experimental data. J. Phys. Chem. B 2013, 117, 3135–3144.
[22]. Shi Y-Z; Wang F-H; Wu Y-Y; Tan Z-J A coarse-grained model with implicit salt for RNAs: predicting 3D structure, stability and salt effect. J. Chem. Phys 2014, 141, 105102.
[23]. Šulc P; Romano F; Ouldridge TE; Doye JPK; Louis AA A nucleotide-level coarse-grained model of RNA. J. Chem. Phys 2014, 140, 235102.
[24]. Boniecki MJ; Lach G; Dawson WK; Tomala K; Lukasz P; Soltysinski T; Rother KM; Bujnicki JM SimRNA: a coarse-grained method for RNA folding simulations and 3D structure prediction. Nucleic Acids Res. 2016, 44, e63.
[25]. Bell DR; Cheng SY; Salazar H; Ren P Capturing RNA Folding Free Energy with Coarse-Grained Molecular Dynamics Simulations. Sci. Rep 2017, 7, 45812.
[26]. Uusitalo JJ; Ingólfsson HI; Marrink SJ; Faustino I Martini Coarse-Grained Force Field: Extension to RNA. Biophys. J 2017, 113, 246–256.
[27]. Hurst T; Xu X; Zhao P; Chen S-J Quantitative Understanding of SHAPE Mechanism from RNA Structure and Dynamics Analysis. J. Phys. Chem. B 2018, 122, 4771–4783.
[28]. Latham AP; Zhang B Maximum Entropy Optimized Force Field for Intrinsically Disordered Proteins. J. Chem. Theory Comput 2020, 16, 773–781.
[29]. Boomsma W; Ferkinghoff-Borg J; Lindorff-Larsen K Combining Experiments and Simulations Using the Maximum Entropy Principle. PLOS Comput. Biol 2014, 10, 1–9.
[30]. Caticha A; Mohammad-Djafari A; Bercher J-F; Bessiére P Entropic Inference. AIP Conf. Proc 2011, 1305.
[31]. Plimpton S Fast Parallel Algorithms for Short-Range Molecular Dynamics. J. Comput. Phys 1995, 117, 1–19.
[32]. Xiong Y; Sundaralingam M Two crystal forms of helix II of Xenopus laevis 5S rRNA with a cytosine bulge. RNA 2000, 6, 1316–1324.
[33]. Jucker FM; Pardi A Solution Structure of the CUUG Hairpin Loop: A Novel RNA Tetraloop Motif. Biochemistry 1995, 34, 14416–14427.
[34]. Zhang X; Zhang D; Zhao C; Tian K; Shi R; Du X; Burcke AJ; Wang J; Chen S-J; Gu L-Q Nanopore electric snapshots of an RNA tertiary folding pathway. Nat. Commun 2017, 8, 1458.
