1 Introduction

Reinforcement learning (RL) [36] is an area of Machine Learning (ML) that has been responsible for some of the largest recent AI breakthroughs [3, 32,33,34]. RL develops methods that guide agents in choosing among multiple actions in an environment where the reward is delayed. This fits many settings in Automated Theorem Proving (ATP), where many inferences are often possible in a particular search state, but their relevance only becomes clear when a proof is found.

Several learning-guided ATP systems have been developed that interleave proving with supervised learning from proof searches [4, 10,11,12,13, 17, 19, 23, 39]. In the saturation-style setting used by ATP systems like E [31] and Vampire [21], direct learning-based selection of the most promising given clauses already leads to large improvements [14], without other changes to the proof search procedure.

The situation is different in the connection tableau [22] setting, where choices of actions rarely commute, and backtracking is very common. This setting resembles games like Go, where Monte-Carlo tree search [20] with reinforcement learning used for action selection (policy) and state evaluation (value) has recently achieved superhuman performance. First experiments with the rlCoP system in this setting have been encouraging [19], achieving more than 40% improvement on a test set after training on a large corpus of Mizar problems.

The connection tableau setting is also attractive because of its simplicity, leading to very compact Prolog implementations such as leanCoP [29]. Such implementations are easy to modify and extend in various ways [27, 28]. This is particularly interesting for machine learning research over reasoning corpora, where the automated learning and addition of new prover actions (tactics, inferences, symbolic decision procedures) based on previous proof traces seem to be a large upcoming topic. Finally, the proofs obtained in this setting are easy to verify, which is important whenever automated self-improvement is involved.

The goal of the work described here is to develop a reinforcement learning toolkit for experiments with guiding automated theorem proving in the connection calculus. The core of the toolkit (Sect. 2) is a compact and easy-to-extend Prolog-based automated theorem prover called plCoP. plCoP builds on the leanCoP Prolog implementation and adds learning-guided Monte-Carlo Tree Search as done in the rlCoP [19] system. Other components include a Python interface connecting plCoP to state-of-the-art machine learners, and an external proof checker that verifies the validity of the plCoP proofs. The proof checker has proven useful in discovering bugs during development. Furthermore, it is our long-term goal to add new prover actions automatically, where proof checking becomes essential.

Prolog is traditionally associated with ATP research: it has been used to implement a number of provers [5, 24, 29, 35], as well as for rapid ATP prototyping, with core methods like unification available for free. Prolog is also the basis of Inductive Logic Programming (ILP) [25] style systems and thus a natural choice for combining such symbolic learning methods with machine learning for ATP, which we are currently working on [41]. In more detail, the main contributions are:

1. We provide an open-source Prolog implementation of rlCoP, called plCoP, that uses the SWI-Prolog [40] environment.

2. We extend the guidance of leanCoP to reduction steps involving unification.

3. We extend leanCoP with rewrite steps while keeping the original equality axioms. This demonstrates the benefit of adding a useful but redundant inference rule, with its use controlled by the learned guidance.

4. We provide an external proof checker that certifies the validity of the proofs.

5. The policy model of rlCoP is trained using the Monte Carlo search trees of all proof attempts. We show, however, that this introduces a lot of noise, and that we obtain a significant improvement by limiting the policy training data to successful proof attempts.

6. Policy and value models are trained in rlCoP using a subset of the Monte Carlo search nodes, called bigstep nodes. However, when a proof is found, not all nodes leading to the proof are necessarily bigstep nodes. We make training more efficient by explicitly ensuring that all nodes leading to proofs are included in the training dataset.

7. The system is evaluated in several iterations on two MPTP-based [37] benchmarks, showing large performance increases thanks to learning (Sect. 3). We also improve upon rlCoP by \(12\%\) and \(7\%\) on these benchmarks.

2 Prolog Technology Reinforcement Learning Prover

The toolkit is available at our repository.Footnote 1 Its core is our plCoP connection prover, based on the leanCoP implementation and inspired by rlCoP. leanCoP [29] is a compact theorem prover for first-order logic implementing connection tableau search. The proof search starts with a start clause as a goal and proceeds by building a connection tableau through extension and reduction steps. leanCoP uses iterative deepening to ensure completeness. This is removed in rlCoP and plCoP, and learning-guided Monte-Carlo Tree Search (MCTS) [8] is used instead. Below, we explain the main ideas and parts of the system.

Monte Carlo Tree Search (MCTS) is a search algorithm for sequential decision processes. MCTS builds a tree whose nodes are states of the process and whose edges represent sequential decisions. Each state (node) yields some reward. The aim of the search algorithm is to find trajectories (branches of the search tree) that yield a high accumulated reward. The search starts from a single root node (the starting state), and new nodes are added iteratively. For each node i, we maintain the number of visits \(n_i\), the total reward \(r_i\), and its prior probability \(p_i\), given by a learned policy function. Each iteration, also called a playout, starts with the addition of a new leaf node. This is done by recursively selecting the child that maximizes the standard UCT [20] formula (1), until a leaf is reached. In (1), N is the number of visits of the parent node, and cp is a parameter that balances preferring nodes with high value (exploitation) against rarely visited nodes (exploration). Each leaf is given an initial value, which is typically provided by a learned value function. Next, the ancestors are updated: their visit counts are increased by 1 and their total rewards are increased by the value of the new node. The value and policy functions are learned in AlphaGo/Zero, rlCoP and plCoP.

$$ \mathrm{UCT}(i) = \frac{r_i}{n_i} + cp \cdot p_i \cdot \sqrt{\frac{\ln N}{n_i}} $$
(1)
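
To make the selection and update steps concrete, the following SWI-Prolog sketch computes the UCT score of (1), picks the best child, and backs the value of a new leaf up to its ancestors. The child(TotalReward, Visits, Prior) representation and all predicate names are illustrative assumptions, not plCoP's actual data structures.

    :- use_module(library(apply)).   % maplist/3
    :- use_module(library(lists)).   % max_member/2
    :- use_module(library(pairs)).   % pairs_keys_values/3

    % uct_score(+ParentVisits, +Cp, +Child, -Score)
    % Child = child(TotalReward, Visits, Prior); requires Visits > 0.
    uct_score(ParentVisits, Cp, child(R, N, P), Score) :-
        N > 0,
        Score is R / N + Cp * P * sqrt(log(ParentVisits) / N).

    % select_child(+ParentVisits, +Cp, +Children, -Best)
    % picks the child maximizing the UCT score of formula (1)
    select_child(ParentVisits, Cp, Children, Best) :-
        maplist(uct_score(ParentVisits, Cp), Children, Scores),
        pairs_keys_values(Pairs, Scores, Children),
        max_member(_BestScore-Best, Pairs).

    % backup(+Ancestors, +Value, -Updated)
    % after adding a leaf with the given initial value, increment the visit
    % count of each ancestor and add the value to its total reward
    backup([], _, []).
    backup([child(R, N, P)|Rest], V, [child(R1, N1, P)|Rest1]) :-
        R1 is R + V,
        N1 is N + 1,
        backup(Rest, V, Rest1).

For example, select_child(10, 1.6, [child(3.0,5,0.4), child(1.0,2,0.6)], Best) prefers the second, rarely visited child. As described below, plCoP itself keeps the node statistics in destructive hashtables rather than in rebuilt lists.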

MCTS for Connection Tableau: Both rlCoP and plCoP use the DAgger [30] meta-learning algorithm to learn the policy and value functions. DAgger interleaves ATP runs based on the current policy and value (the data collection phase) with a training phase, in which these functions are updated to fit the collected data. Such iterative interleaving of proving and learning has also been used successfully in ATP systems such as MaLARea [38] and ENIGMA [14]. During the proof search, plCoP builds a Monte Carlo tree for each training problem. Its nodes are proof states (partial tableaux), and its edges represent inferences. A branch leading to a node with a closed tableau is a valid proof. Initially, plCoP uses simple heuristic value and policy functions, later to be replaced with learned guidance. To enforce deeper exploration, rlCoP and plCoP perform a bigstep after a fixed number of playouts: the starting node of exploration is moved one level down towards the child with the highest value (called the bigstep node). Later MCTS steps thus only extend the subtree under the bigstep node.
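
The bigstep itself can be sketched as follows, taking the value estimate of a child to be its mean reward \(r_i/n_i\); the child(TotalReward, Visits, Prior) terms are the same illustrative representation as in the UCT sketch above, not plCoP's internals.

    :- use_module(library(apply)).   % maplist/3
    :- use_module(library(lists)).   % max_member/2

    % child_value(+Child, -Value-Child): value estimate = mean reward
    child_value(child(R, N, P), V-child(R, N, P)) :-
        N > 0,
        V is R / N.

    % bigstep_child(+Children, -Best): after a fixed number of playouts, the
    % child with the highest value estimate becomes the new exploration root
    bigstep_child(Children, Best) :-
        maplist(child_value, Children, Pairs),
        max_member(_Value-Best, Pairs).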

Training data for policy and value learning is extracted from the tableau states of the bigstep nodes. The value model gets features from the current goal and path, while the policy model also receives features of the given action.

We use term walks of length up to 3 as the main features. Both rlCoP and plCoP also add several more specific features.Footnote 2 The resulting sparse feature vectors are compressed to a fixed size (see Appendix E). For value learning, each bigstep node is assigned a label of 1Footnote 3 if it leads to a proof and 0 otherwise. The policy model gets a target probability for each edge, based on the relative visit frequency of the corresponding child. Both rlCoP and plCoP use gradient boosted trees (XGBoost [9]) for guidance. Training concludes one iteration of the DAgger method. See Appendix D for more details about the policy and value functions.
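
As an illustration of the feature extraction, the following sketch collects symbol walks of length up to 3 starting at any subterm and hashes each walk into a fixed-size index space; it is a simplified stand-in for plCoP's feature code, and the predicate names are ours.

    :- use_module(library(lists)).   % member/2

    % the symbol at a position: functor/arity, with variables mapped to a marker
    term_symbol(T, 'VAR') :- var(T), !.
    term_symbol(T, F/A)   :- functor(T, F, A).

    % subterm(+T, -S): S is T itself or any subterm of T
    subterm(T, T).
    subterm(T, S) :-
        nonvar(T),
        T =.. [_|Args],
        member(Arg, Args),
        subterm(Arg, S).

    % walk(+Term, +MaxLen, -Walk): a top-down symbol walk starting at Term
    walk(T, _, [S]) :-
        term_symbol(T, S).
    walk(T, MaxLen, [S|Rest]) :-
        MaxLen > 1,
        nonvar(T),
        term_symbol(T, S),
        T =.. [_|Args],
        member(Arg, Args),
        MaxLen1 is MaxLen - 1,
        walk(Arg, MaxLen1, Rest).

    % term_features(+Term, +Size, -Features): hashed indices (0..Size-1) of all
    % walks of length up to 3; repeated indices act as feature counts
    term_features(Term, Size, Features) :-
        findall(I,
                ( subterm(Term, Sub),
                  walk(Sub, 3, Walk),
                  term_hash(Walk, Hash),
                  I is Hash mod Size ),
                Features).

Here the modulo hashing merely stands in for the fixed-size compression mentioned above; the actual scheme is described in Appendix E.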

To implement MCTS, we modify leanCoP so that the Prolog stack is explicitly maintained and saved in the Prolog database using assertions after each inference step. This is done in the leancop_step.pl submodule, described in Appendix C. This setup makes it possible to interrupt the proof search and later continue at any previously visited state, which is required to interleave prover steps with the Monte Carlo tree steps, also implemented in Prolog. The main MCTS code is explained in Appendix B. The MCTS search tree is stored in destructive Prolog hashtables.Footnote 4 These are necessary for efficient updates of the node statistics. The training data after each proof run is exported from the MCTS trees and saved for the XGBoost learning done in Python.
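
A minimal sketch of this checkpointing idea is given below, assuming a proof state is packaged as a single term (say, state(Goals, Path, Lemmas)); saved_state/2 and the predicate names are illustrative, not the actual representation used in leancop_step.pl.

    :- dynamic saved_state/2.

    % save_state(+NodeId, +State): record the prover state reached at a node
    save_state(NodeId, State) :-
        assertz(saved_state(NodeId, State)).

    % resume_state(+NodeId, -State): fetch a previously visited state so that
    % proof search can continue from it when MCTS selects that node again
    resume_state(NodeId, State) :-
        saved_state(NodeId, State).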

To guide the search, the trained XGBoost policy and value functions are accessed efficiently via the C foreign language interface of SWI-Prolog. This is done in 70 lines of C++ code, using the SWI C++ templates and the XGBoost C++ API. The main foreign predicate xgb:predict takes an XGBoost predictor and a feature list and returns the predicted value. A trained model performs 1000000 predictions in 19 s in SWI. To quantify the total slowdown due to the guidance, we ran plCoP with a 200000 inference step limit and a large time limit (1000 s) on the M2k dataset (see Sect. 3) with and without guidance. The average execution time ratio on problems unsolved in both cases is 2.88, i.e., the XGBoost guidance roughly triples the execution time. Efficient feature collection in plCoP is implemented using declarative association lists (the assoc SWI library), which are implemented as AVL trees: insertion, change, and retrieval are \(O(\log N)\). We also use destructive hashtables for caching the features already computed. Compared to rlCoP, we can rely on SWI’s fast internal hash function term_hash/4, which only considers terms up to a specified depth. This all contributes to plCoP’s smaller size.

Guiding Reduction Steps: The connection tableau calculus has two kinds of steps: 1) extension, which replaces the current goal with a new set of goals, and 2) reduction, which closes off the goal by unifying its negation with a literal on the path. rlCoP applies reduction steps eagerly, which can be harmful, since it may trigger an unwanted unification. Instead, plCoP lets the guidance system learn when to apply reduction. Suppose the current goal is G. An input clause \(\{H, B\}\), s.t. H is a literal that unifies with \(\lnot G\) and B is a clause, yields an extension step, represented as ext(H, B), while a literal P on the path that unifies with \(\lnot G\) yields a reduction step, represented as red(P). The symbols red and ext are then part of the standard feature representation.
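
The following sketch shows how the two kinds of actions could be enumerated and tagged for the guidance, assuming the usual leanCoP literal format (negation written as -L) and input clauses given as lists of literals with the head first; the predicate names are illustrative, not plCoP's internals.

    :- use_module(library(lists)).   % member/2, append/3

    % negated_literal(+L, -NegL)
    negated_literal(-L, L) :- !.
    negated_literal(L, -L).

    % actions(+Goal, +Path, +Clauses, -Actions)
    % reduction candidates are path literals unifying with the negated goal;
    % extension candidates are input clauses [H|B] whose head H unifies with it
    actions(Goal, Path, Clauses, Actions) :-
        negated_literal(Goal, NegGoal),
        findall(red(P),
                ( member(P, Path), unifiable(P, NegGoal, _) ),
                Reds),
        findall(ext(H, B),
                ( member([H|B], Clauses), unifiable(H, NegGoal, _) ),
                Exts),
        append(Reds, Exts, Actions).

The red/1 and ext/2 wrappers are what makes the action type visible to the feature extraction.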

Limited Policy Training: rlCoP extracts policy training data from the child visit frequencies of the bigstep nodes. We argue, however, that node visit frequencies may not be useful when no proof was found, i.e., when no real reward was observed. A frequently selected action that did not lead to a proof should not be reinforced. Hence, plCoP extracts policy training data only when a proof was found. Note that the same is not true for value data: if MCTS was not successful, then the bigstep nodes are given a value of 0, which encourages exploring elsewhere.

Training from all Proofsteps: Policy and value models are trained in rlCoP using bigstep nodes. However, when a proof is found, not all nodes leading to the proof are necessarily bigstep nodes. We make training more efficient by explicitly ensuring that all nodes leading to proofs are included in the training dataset.

(Conditional) Rewrite Steps: plCoP extends leanCoP with rewrite steps that can handle equality predicates more efficiently. Let \(t|_p\) denote the subterm of t at position p and \(t[u]_p\) denote the term obtained from t by replacing the subterm at position p with the term u. Given a goal G and an input clause \(\{X = Y, B\}\), s.t. for some position p there is a substitution \(\sigma \) such that \(G|_p\sigma = X\sigma \), the rewrite step changes G to \(\{G[Y]_p\sigma , \lnot B \sigma \}\). Rewriting is allowed in both directions, i.e., the roles of X and Y can be switched.Footnote 5 This is a valid and well-known inference step, which can make proofs much shorter. On the other hand, rewriting can be simulated by a sequence of extension steps. We add rewriting without removing the original congruence axioms, making the calculus redundant. We find, however, that the increased branching in the search space is compensated by learning, since we only explore branches that are deemed “reasonable” by the guidance.
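
The core of such a rewrite step can be sketched as follows: one subterm of the goal is unified with one side of the equation and replaced by the other side, with the substitution \(\sigma \) applied implicitly through Prolog unification. The predicate names are illustrative, the symmetric direction simply swaps X and Y, and the handling of the side literals \(\lnot B\sigma \) (added as new goals, as in an extension step) is omitted.

    :- use_module(library(lists)).   % append/3

    % rewrite_goal(+Goal, +Equation, -NewGoal): one direction of the rewrite
    % step for an equation X = Y taken from an input clause {X = Y, B}
    rewrite_goal(Goal, X = Y, NewGoal) :-
        replace_subterm(Goal, X, Y, NewGoal).

    % replace_subterm(+T, ?X, ?Y, -T1): unify one non-variable subterm of T
    % with X and replace it by Y; the unification binds shared variables and
    % thereby applies the substitution to the rest of the goal as well
    replace_subterm(T, X, Y, Y) :-
        nonvar(T),
        T = X.
    replace_subterm(T, X, Y, T1) :-
        nonvar(T),
        T =.. [F|Args],
        append(Pre, [A|Post], Args),
        replace_subterm(A, X, Y, A1),
        append(Pre, [A1|Post], Args1),
        T1 =.. [F|Args1].

On backtracking, replace_subterm/4 enumerates the different positions at which the equation applies.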

Proof Checking: After plCoP generates a proof, the leancheck program included in the toolkit performs an independent verification. Proofs are first translated into the standard leanCoP proof format. In the case of rewrite steps, plCoP references the relevant input clause (with an equational literal), its substitution, the goal before and after rewriting, the equation used, and the side literals of the instantiated input clause. Using this information, leancheck replaces the rewrite step with finitely many instances of equational axioms (reflexivity, symmetry, transitivity, and congruence) and proceeds as if there were no rewrite steps.

The converted output from plCoP includes the problem’s input clauses and a list of clauses that contributed to the proof. The proof clauses are either input clauses, instances of input clauses, or extension step clauses. Proof clauses may have remaining uninstantiated (existential) variables. However, for proof checking, we can treat these as new constants, and so each proof clause can be considered ground. To confirm that we have a proof, it suffices to verify two assertions:

1. Each proof clause alleged to be an input clause or an instance of one is subsumed by the corresponding input clause (as identified in the proof by a label).

2. The set of such proof clauses forms a propositionally unsatisfiable set.

Each proof clause alleged to be an instance of an input clause is reported as a clause B, a substitution \(\theta \), and a reference to an input clause C. Ideally, the checker should verify that each literal in \(\theta (C)\) is in B, so that \(\theta (C)\) propositionally subsumes B. In many cases, this is what the checker verifies. However, Prolog may rename variables so that the domain of \(\theta \) no longer corresponds to the free variables of C. In this case, the checker computes a renaming \(\rho \) such that \(\theta (\rho (C))\) propositionally subsumes B.Footnote 6 We could alternatively use first-order matching to check whether C subsumes B. This would guarantee that a correct proof exists, although it would accept proofs for which the reported \(\theta \) gives incorrect information about the intended instantiation. For the second property, we verify the propositional unsatisfiability of this ground clause set using PicoSat [7]. While plCoP proves a theorem given in disjunctive normal form, PicoSat refutes a set of clauses; hence we swap the polarities of literals when translating clauses to PicoSat.Footnote 7 An example is given in Appendix A.
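
A minimal sketch of the literal-by-literal subsumption check and of the polarity swap used for the SAT translation is given below, assuming ground clauses represented as lists of literals with negation written as -L; it only illustrates the checks described above, not leancheck's actual code.

    :- use_module(library(apply)).   % maplist/3
    :- use_module(library(lists)).   % member/2

    % prop_subsumes(+Instance, +Clause): every literal of Instance (theta(C))
    % occurs syntactically in Clause (B), so theta(C) propositionally subsumes B
    prop_subsumes(Instance, Clause) :-
        forall(member(L, Instance),
               ( member(L2, Clause), L == L2 )).

    % swap_polarity(+Lit, -Swapped): flip each literal so that the DNF proof
    % clauses become a CNF refutation problem for the SAT solver
    swap_polarity(-A, A) :- !.
    swap_polarity(A, -A).

    % swap_clause(+Clause, -SwappedClause)
    swap_clause(Clause, Swapped) :-
        maplist(swap_polarity, Clause, Swapped).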

3 Evaluation

We use two datasets for evaluation. The first is the M2k benchmark introduced in [19]. The M2k dataset is a selection of 2003 problems [15] from the larger Mizar40 dataset [16], which consists of 32524 problems from the Mizar Mathematical Library that have been proven by several state-of-the-art ATPs used with many strategies and high time limits in the Mizar40 experiments [18]. Based on the proofs, the axioms were ATP-minimized, i.e., only those axioms were kept that were needed in any of the ATP proofs found. This dataset is by construction biased towards saturation-style ATP systems. To have an unbiased comparison with state-of-the-art saturation-style ATP systems such as E, we also evaluate the systems on the bushy (small) problems of the MPTP2078 benchmark [1], which contains an article-based selection of Mizar problems chosen regardless of their solvability by a particular ATP system.

We report the number of proofs found using a limit of 200000 inference steps. Hyperparameters (described in Appendix E) were selected to be consistent with those of rlCoP: a lot of effort has already been invested in tuning rlCoP, and, furthermore, we wanted to make sure that the effects of our most important additions are not obscured by hyperparameter changes. As Tables 1 and 2 show, our baseline is weaker, due to several subtleties of rlCoP that we do not reproduce. Nevertheless, since our main focus is to make learning more efficient, the improvement with respect to the baseline can be used to evaluate the new features.

M2k Experiments: We first evaluate the features introduced in Sect. 2 on the M2k dataset. Table 1 shows that both limited policy training and training from all proofsteps yield a significant performance increase: together, they improve upon the baseline by \(31\%\) and upon rlCoP by \(3\%\). However, guided reduction does not help. We found that the proofs in this dataset tend to use reduction only on ground goals, i.e., without unification, in which case reduction can indeed be applied eagerly. The rewrite step yields a \(9\%\) increase. Improved training and rewriting together improve upon the baseline by \(42\%\) and upon rlCoP by \(12\%\). Overall, thanks to the changes in training data collection, plCoP shows greater improvement during training and finds more proofs than rlCoP, even without rewriting. Note that rlCoP has been developed and tuned on M2k. Adding ten more iterations to the best performing (combined) version of plCoP results in 1450 problems solved, which is 17.4% better than rlCoP in 20 iterations (1235).

Table 1. Performance on the M2k dataset: original rlCoP, plCoP (baseline), plCoP with guided reduction, plCoP with limited policy training, plCoP trained using all proofsteps, plCoP using the previous two (improved training), plCoP with rewriting, and plCoP with rewriting and improved training combined. incr shows the performance increase in percentages from iteration 0 (unguided) to the best result.

MPTP2078: Using a limit of 200000 inferences, unmodified leanCoP solves 612 of the MPTP2078 bushy problems, while its OCaml version (mlcop), used as the basis for rlCoP, solves 502. E solves 998, 505, 326, and 319 problems in its auto, noauto, restrict, and noorder modes, respectively.Footnote 8 The plCoP and rlCoP results are summarized in Table 2. Improved training and rewriting together yield a \(63\%\) improvement upon the baseline and a \(7\%\) improvement upon rlCoP. Here, it is plCoP that starts better, while rlCoP shows stronger learning. Eventually, plCoP outperforms rlCoP, even without rewriting. An additional ten iterations of the combined version increase the performance to 854 problems. This is 12% more than rlCoP in 20 iterations (763) but still weaker than the strongest E configuration (998). However, it performs better than E with the more limited heuristics, and better than leanCoP with its heuristics.

Table 2. Performance on the MPTP2078 bushy dataset: original rlCoP, baseline plCoP, plCoP using improved training, and combined plCoP. incr shows the performance increase in percentages from iteration 0 (unguided) to the best result.

4 Conclusion and Future Work

We have developed a reinforcement learning toolkit for experiments with guiding automated theorem proving in the connection calculus. Its core is the Prolog-based plCoP, obtained by extending leanCoP with a number of features motivated by rlCoP. New features on top of rlCoP include the guidance of reduction steps, the addition of the rewrite inference rule and its guidance, an external proof checker, and improvements to the training data selection. Altogether, plCoP improves upon rlCoP on the M2k and MPTP2078 datasets by \(12\%\) and \(7\%\) in ten iterations, and by \(17.4\%\) and \(12\%\) in twenty iterations. The system is publicly available to the ML and AI/TP communities for experiments and extensions.

One lesson learned is that, due to the sparse rewards in theorem proving, care is needed when extracting training data from the prover’s search traces. Another lesson is that new sound inference rules can be safely added to the underlying calculus. Thanks to the guidance, the system still learns to focus on promising actions, while the new actions may yield shorter searches and proofs for some problems. An important part of such an extensibility scheme is the independent proof checker provided.

Future work includes, e.g., the addition of neural learners, such as tree and graph neural networks [10, 26]. An important motivation for choosing Prolog is our plan to employ Inductive Logic Programming to learn new prover actions (as Prolog programs) from the plCoP proof traces. As the manual addition of the rewrite step already shows, such new actions can be inserted into the proof search, and guidance can again be trained to use them efficiently. Future work, therefore, involves AI/TP research in combining statistical and symbolic learning in this framework, with the goal of automatically learning more and more complex actions similar to tactics in interactive theorem provers. We believe this may become a very interesting AI/TP research topic facilitated by the toolkit.