Abstract
Recent progress in the runtime analysis of evolutionary algorithms (EAs) has allowed the derivation of upper bounds on the expected runtime of standard steady-state genetic algorithms (GAs). These upper bounds have shown speed-ups of the GAs using crossover and mutation over the same algorithms that only use mutation operators (i.e., steady-state EAs) both for standard unimodal (i.e., OneMax) and multimodal (i.e., Jump) benchmark functions. The bounds suggest that populations are beneficial to the GA as well as higher mutation rates than the default 1/n rate. However, making rigorous claims was not possible because matching lower bounds were not available. Proving lower bounds on crossover-based EAs is a notoriously difficult task as it is hard to capture the progress that a diverse population can make. We use a potential function approach to prove a tight lower bound on the expected runtime of the (2+1) GA for OneMax for all mutation rates c/n with \(c < 1.422\). This provides the last piece of the puzzle that completes the proof that larger population sizes improve the performance of the standard steady-state GA for OneMax for various mutation rates, and it proves that the optimal mutation rate for the (2+1) GA on OneMax is \((\sqrt{97}-5)/(4n) \approx 1.2122/n\).
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
The runtime analysis of randomized search heuristics like evolutionary algorithms (EAs), simulated annealing, ant colony optimization and estimation-of-distribution algorithms is a young and active subfield in algorithm research that has produced remarkable results in the last 20 years [2, 11, 15, 23]. Its main goal is to understand the working principles of the algorithms in different scenarios by deriving runtime bounds depending on problem characteristics, choice of algorithms and parameters. This line of research started with simple evolutionary algorithms using mutation only. Still today, the role of crossover, also called recombination, is less well understood than the one of mutation. In fact, explaining when recombination and mutation based genetic algorithms (GAs)Footnote 1 perform better than more traditional general purpose search heuristics that use mutation alone is regarded as one of the fundamental problems in evolution-inspired computation.
Traditionally proofs showing that crossover is a useful operator relied either on excessively low crossover rates [16, 18] or on some diversity-enforcing mechanism to make recombination effective by increasing the probability that members of the population are different [8, 10, 20, 22, 27]. However, it was never shown whether this enforced diversity was necessary or whether it was an additional requirement for the proofs to hold. Recently some results have appeared proving the superiority of standard steady state GAsFootnote 2 over mutation-only algorithms, without the need of any additional diversity enforcing mechanisms. Dang et al. [7] proved that for sufficiently large population sizes, the (\(\mu \)+1) GA is at least a linear factor faster than the best algorithm using only standard bit mutation for the Jump benchmark function. Hence, they showed that crossover may help algorithms to escape more quickly from local optima. Sutton [28] even proved that for the NP-hard Closest String problem from computational biology, the (\(\mu \)+1) GA with sufficiently large population size and restarts is a fixed parameter tractable (FPT) algorithm while if only standard bit mutation is used (i.e., (\(\mu \)+1) EA) it is not.
Strikingly, recombination has also been proven to be useful on unimodal functions. Lengler [21] has shown that there exist monotone functions for which the (\(\mu \)+1) EA with not too low standard bit mutation rate c/n (i.e., \(c>2.13\)) requires exponential runtime with high probability while the (\(\mu \)+1) GA with sufficiently large population sizes can solve them in \(O(n \log n)\) expected runtime for arbitrary mutation rates i.e., \(\varTheta (1)/n\). Analyses have revealed that the (\(\mu \)+1) GA is faster than the (\(\mu \)+1) EA using any standard bit mutation rate and population size, even on unimodal functions where the latter is particularly efficient i.e., OneMax [5, 6]. Furthermore, if the fitness of offspring that are identical to their parents is not unnecessarily re-evaluated, then the algorithm is faster than any unary unbiased black box algorithm for the problem [19], albeit slower than if the diversity is enforced [3, 27]. To prove these results, precise analyses up to the leading constants are required since for OneMax the algorithms have the same asymptotic expected runtime \(O(n \log n)\) for moderate population sizes.
An important insight from these analyses is that if diversity is enforced as in Sudholt’s work [27], then inevitably there are no advantages of using population sizes greater than \(\mu =2\) for OneMax. On the other hand, the analysis of Corus and Oliveto [6] provides upper bounds that decrease with the population size (up to some sub-logarithmic limit). For large enough population sizes the best derived upper bound is roughly \(1.64 n \ln n\) while, for \(\mu =2\), Corus and Oliveto only provide a larger upper bound of \(\frac{4e^{c} n \ln n}{c(c+4)} + O(n)\) [5]. Due to a mistake in one probability calculation this turns out to actually be \(\frac{9e^{c} n \ln n}{c(2c+9)} + O(n)\).
Indeed, all the positive results summarised above regarding the plain (\(\mu \)+1) GA required sufficiently large population sizes. While the comparative statements with the mutation-based algorithms were possible because of the availability of lower bounds on their expected runtime, rigorously showing whether the suggested population sizes are actually necessary requires lower bounds on the expected runtime of the (\(\mu \)+1) GA. Proving lower bounds for GAs with crossover is a notoriously hard task. The only available analysis concerning a standard GA is the proof that the simple genetic algorithm (SGA [12]) cannot solve OneMax in polynomial time with overwhelming probability due to the ineffectiveness of the fitness proportional selection operator [25, 26]. There have been recent attempts to generalize proof methods like the family tree technique to crossover-based algorithms; however, these only apply in a specific setting without mutation.
Providing lower bounds on the expected runtime of the (\(\mu \)+1) GA for OneMax has turned out to be surprisingly difficult. Sudholt simplified the analysis by considering a “greedy” (2+1) GA that always selects amongst the fittest individuals in the population and is sped-up by automatically achieving the best possible crossover operation between different parents [27]. A less greedy (2+1) GA was considered by Corus and Oliveto where individuals are only immediately crossed over optimally if the Hamming distance between the parents is larger than 2 [5]. These simplified algorithms allow the analysis to ignore the improvements which may occur in standard GAs when one parent is crossed over with another one of different fitness. However, it was never proven that the algorithms are indeed faster than the standard (2+1) GA, hence that the bounds are also valid for the latter algorithm. In this paper we provide a lower bound for the (2+1) GA with no simplifications that matches its upper bound up to the leading constant, hence providing a rigorous proof that larger populations are beneficial to the GA for OneMax. The preciseness of the results also allows us to derive that the value of \(c \in (0, 1.422]\) that yields the optimal mutation rate c/n is approximately \(c=1.21221445\).
A major difficulty in proving rigorous lower bounds for populations with crossover is to find a way to aggregate the state of the algorithm such that it accurately captures the current distance from the optimum, but also the potential improvements of the crossover operator. These advancements could be very big if the parents have a large Hamming distance, and our aim is to show that this rarely happens. We solve the aggregation problem for the (2+1) GA by defining a potential function that captures the current fitness and opportunities for easy improvements through crossover. By showing bounds on the expected increase in the potential, we are able to quantify how the distance to the optimum decreases in one generation. The challenge lies in proving this for every possible population, from those with identical individuals to those with a good amount of diversity. Once the potential is appropriately bounded, we can use standard drift analysis arguments to bound the expected time from below.
1.1 Main Contributions
The expected optimisation time of the (2+1) GA is bounded from above as follows.
Theorem 1
The expected optimisation time of the (2+1) GA with mutation rate \(c/n\), \(c>0\) a constant, on OneMax is at most
For \(c=1\) this is \( \frac{9}{11} \cdot e n \ln (n) + O(n) \approx 2.224 n \ln (n) + O(n) \). The upper bound follows from applying the analytical framework in [5] with a corrected transition probability for \(p_r\), using the value 1/(4e) instead of 5/(24e) (we shall give more details in Sect. 3). It can also be proven with mild adaptations of the proof of [27, Theorem 4]. We provide a self-contained proof in Sect. 3.
Our main contribution is the following lower bound that matches the upper bound proven in Theorem 1 up to small-order terms.
Theorem 2
The expected optimisation time of the (2+1) GA with mutation rate c/n, and \(0<c \le 1.422\) a constant, on OneMax is at least
Since the bounds from Theorems 1 and 2 have the same leading constant \(\frac{9e^c}{c(2c+9)}\), which is minimised for
we identify this as the optimal mutation rate for the (2+1) GA (up to small-order terms) within the range of rates covered by Theorem 2.
Theorem 3
Amongst all mutation rates c/n with \(c \in (0, 1.422]\), the choice \(c = \frac{\sqrt{97}-5}{4}\) is the optimal mutation rate of the (2+1) GA on OneMax, up to small-order terms. Then the expected optimisation time is \(\approx 2.18417 n \ln (n) + O(n)\).
The best identified mutation rate for the (2+1) GA is lower than the one minimising the upper bound for larger population sizes \(\mu \ge 5\) (it is at least 1.425/n and increases with \(\mu \)) always providing upper bounds below \(1.7 n \ln n\) and decreasing with \(\mu \) [6]. This implies that the (2+1) GA with mutation rate \(\frac{\sqrt{97}-5}{4n}\) is at least \(28\%\) slower than any (\(\mu \)+1) GA with \(\mu \ge 5\) and appropriate mutation rates.
Structure of the paper. Section 2 formally defines the (2+1) GA and lists important tools for the analysis, including the drift theorem used for our main result. Section 3 presents the above-mentioned, corrected upper bound from Theorem 1 and its self-contained proof. In Sect. 4, we introduce the potential function that captures the state of the (2+1) GA and is crucial for the drift analysis proving the lower bound in Theorem 2. As determining the drift requires a careful case analysis, we give a roadmap of this analysis in Sect. 5, followed by the technical Sects. 6–8 that analyze the drift in different scenarios. In Sect. 9 we then put all pieces of our analysis together to complete the proof of Theorem 2. In addition, Sect. 10 gives empirical results and results of regression analyses. The latter confirm that our theoretical results are remarkably precise and provide further insight into small-order terms. We finish with some conclusions. To streamline the presentation of the most important cases in the drift analysis, some less insightful proofs have been moved to an appendix.
This paper extends a previous conference paper [24] where most proofs had to be omitted because of space constraints. The experimental results have not been published before either.
2 Preliminaries
The (2+1) GA is defined in Algorithm 1. The algorithm initialises the population with two randomly chosen individuals. At each generation it selects two random parents with replacement to be mated via uniform crossover. The operator assigns each bit to the offspring by selecting the corresponding bit from one parent with probability 1/2 and from the other with the same probability. Standard bit mutation is then applied to the offspring by flipping each of its bits independently with probability c/n. Finally, the worst individual amongst the parents and the offspring is removed to select the new population. Ties are broken uniformly at random.
We will analyse the expected runtime of the algorithm to optimise the function \(f(x)=\mathrm {OneMax}(x)=\sum _{i=1}^n x_i\), which counts the number of 1-bits in a bitstring. Formally, the runtime is defined as the number of fitness function evaluations, which equals the smallest t such that the current population contains an optimum, up to an additive term of at most 2 stemming from evaluating the initial population.
Throughout this article, we will use state-of-the-art analysis techniques for randomized search heuristics, including drift analysis and concentration inequalities. In the following, we list three important tools, starting with the drift theorem that is crucial for the lower bound on the runtime.
Theorem 4
(Multiplicative drift, lower bound, e. g., [29])
Let \(X_t\), \(t\ge 0\), be a stochastic process, adapted to a filtration \({\mathcal {F}}_t\), over a state space \(S\subseteq \mathbb {R}^{\ge x_{\min }}\), where \(x_{\min }>0\). Assume that \(X_t\) is non-increasing, i. e., \(X_{t+1}\le X_t\) for all \(t\ge 0\). Let T be the smallest \(t\ge 0\) such that \(X_t\le x_{\min }\). If there exist positive real numbers \(\beta ,\delta >0\) such that for all \(t<T\) it holds that
-
1.
\(\mathord {{\mathrm {E}}}\mathord {\left( X_{t}-X_{t+1}\mid {\mathcal {F}}_t\right) } \le \delta X_t\)
-
2.
\(\mathord {{\mathrm {P}}}\mathord {\left( X_t-X_{t+1}\ge \beta X_t\right) } \le \beta \delta /\ln (X_t)\)
then
The following Chernoff bound goes back to [4] and is formulated in the style of [9, Theorem 1.10.1].
Theorem 5
Let \(X_1,\dots ,X_n\) be independent random variables taking values in [0, 1]. Let \(X:=\sum _{i=1}^n X_i\). Let \(\delta \ge 0\). Then
The following lemma bounds a sum by an integral.
Lemma 6
For any integrable function \(f :\mathbb {R}\rightarrow \mathbb {R}\), the following statements hold.
-
1.
If f is non-increasing in [a, b], \( \sum _{i=a}^b f(i) \le f(a) + \int _{a}^{b} f(i) \ \mathrm {d}i\).
-
2.
If f is non-decreasing in [a, b], \( \sum _{i=a}^b f(i) \le f(b) + \int _{a}^{b} f(i) \ \mathrm {d}i\).
-
3.
If there is an \(\alpha \in [a, b]\) such that f is non-increasing in \([a, \alpha ]\) and non-decreasing in \([\alpha , b]\) then \( \sum _{i=a}^b f(i) \le f(a) + f(b) + \int _{a}^b f(i) \ \mathrm {d}i\).
Proof
The first statement follows from splitting off the term f(a) and using \(\int _{j-1}^j f(i) \ \mathrm {d}i \ge f(j)\) for all \(j \in [a+1, b]\), which follows from f being non-increasing. This implies \(\sum _{i=a+1}^b f(i) \le \int _{a}^b f(i) \ \mathrm {d}i\). The second statement is proved similarly. For the last statement, we note \(\sum _{i=a}^b f(i) \le \sum _{i=a}^{\lfloor \alpha \rfloor } f(i) + \sum _{i=\lceil \alpha \rceil }^b f(i)\). Then we apply the first statement to \(\sum _{i=a}^{\lfloor \alpha \rfloor } f(i)\) and the second statement to \(\sum _{i=\lceil \alpha \rceil }^b f(i)\), yielding a bound of \(f(a) + f(b) + \int _{a}^{\lfloor \alpha \rfloor } f(i) \ \mathrm {d}i + \int _{\lceil \alpha \rceil }^b f(i) \ \mathrm {d}i \le \sum _{i=a}^b f(i) \le f(a) + f(b) + \int _{a}^b f(i) \ \mathrm {d}i\). \(\square \)
Finally, we frequently need the following estimate of the largest binomial coefficient, which can be found in [9, Inequality 1.4.18].
Lemma 7
For all \(n\in \mathbb {N}\) and \(k\in \{1,\dots ,n\}\) it holds that
3 Upper Bounds for the (2+1) GA
In this section we aim to convince the reader in two different ways that the upper bound for the (2+1) GA is as stated in Theorem 1, with a leading constant of \(\frac{9e^c}{c(2c+9)}\) instead of the leading constant \(\frac{4e^c}{c(c+4)}\) claimed in [5].
Sudholt [27] provided an upper bound for (\(\mu \)+\(\lambda \)) GAs with a diversity mechanism in the tie-breaking rule for the replacement selection. The analysis is based on a fitness-level argument and a simple Markov chain analysis made on each fitness level. Corus and Oliveto [5] analysed the standard (\(\mu \)+1) GA, for which the choice \(\mu =2\) yields the (2+1) GA as defined in Algorithm 1. They observed that the (\(\mu \)+1) GA can lose diversity after creating it, which required a more complex Markov chain analysis of each fitness level. We first explain their framework and argue why the leading constant is \(\frac{9e^c}{c(2c+9)}\). In addition, we give a self-contained bound based on the analysis in [27] with an ad-hoc Markov chain framework similar to the one used in [5], simplified for the fixed value of \(\mu =2\) (and \(\lambda =1\)). This shows that both previous works [5, 27], with appropriate modifications, yield the same upper bound for the (2+1) GA.
3.1 The Framework by Corus and Oliveto
Corus and Oliveto coupled the standard artificial fitness levels method [15] with a Markov chain framework to bound the expected runtime of the (\(\mu \)+1) GA for OneMax and any population size \(\mu \) [5]. More precisely, they divided the search space into the canonical \(n+1\) fitness levels, each containing all search points with i 1-bits, and assumed that the algorithm is in level \(L_i\) if all the individuals of the population have exactly i 1-bits. Since crossover may only speed up the optimisation process if diversity is present in the population, a Markov chain was used on each fitness level to distinguish between populations with and without diversity. The Markov chain at fitness level i consists of two transient states \(S_{1,i}\) and \(S_{2,i}\) (i.e., resp. with/without diversity) and an absorbing state \(S_{3,i}\), that is reached when a solution with better fitness is identified. For each level the analysis was performed by pessimistically assuming that initially on each level the population has no diversity (i.e., all individuals are identical). For this assumption to hold, and a valid upper bound on the expected runtime achieved, once the absorbing state is reached for level \(L_i\), the expected time for the improved individual to take over the population has to be taken into account before the analysis of the absorption time of the Markov chain for level \(L_{i+1}\) may be carried out (i.e., the population is initially in state \(S_{1, i+1}\) for each level \(L_i\)). When the algorithm is on the current level \(L_i\) with no diversity (\(S_{1,i}\)) only two changes of states may occur, both due to mutation: either the absorption state \(S_{3,i}\) is reached by increasing the number of 1-bits in an individual, or a state with some diversity is reached by switching the order of 1-bits and 0-bits in an offspring, hence reaching state \(S_{2,i}\). From this state, \(S_{3,i}\) may be reached more quickly than from state \(S_{1,i}\) if diverse parents are selected for reproduction and at least an extra 1-bit obtained in the offspring via crossover with constant probability if the diversity is not lost before (eg., by selecting two identical parents, creating a copy by not flipping any bits, and removing the diverse individual in the selection for replacement step). Overall, the proof strategy requires summing up the bounds on the absorption times of the \(n+1\) Markov chains and subsequent takeover times over the \(n+1\) levels according to the artificial fitness levels methodology.
The described technique allows to calculate the transition probabilities of the Markov chain, hence to provide upper bounds on the expected runtime of the (\(\mu \)+1) GA, as a function of any population size \(\mu \). In particular, it provides bounds on the expected runtime of the (\(\mu \)+1) GA up to the leading constant for population sizes of \(\mu =o(\log n/ \log \log n)\) that are smaller than those of any steady-state EA using only standard bit mutation (i.e., no crossover) (i.e., for larger population sizes the take over time becomes asymptotically larger than the \(O(n \log n)\) expected runtime of the (1+1) EA). An important insight from the Markov chain analysis is that the probability of losing diversity in state \(S_{2,i}\) is higher for population size \({\mu =2}\) than for larger populations: in the former case the diversity may be completely lost in every generation by either of the two individuals taking over, which is not the case for larger populations. However, in the analysis of this transition probability for the special case \({\mu =2}\), the bound provided by the authors of [5] is by a factor of 2 smaller than the correct one due to a mistake in the calculation of the probability that different parents are selected for reproduction and the variation operators produce an offspring identical to either parent (and the differing parent is removed by selection of replacement). This miscalculation led to an upper bound on the expected runtime of the (2+1) GA of \(\frac{4e^c n \ln n}{c(c+4)} + O(n)\) instead of the correct bound of \(\frac{9e^c n \ln n}{c (2c+9)} + O(n)\) provided in Theorem 1. In the following subsection we provide a self-contained proof of the result and point out exactly where the miscalculation occurred in [5].
3.2 A Self-Contained Upper Bound Proof for the (2+1) GA
Here we give a self-contained proof of the upper bound for the (2+1) GA, Theorem 1.
Proof of Theorem 1
We use straightforward adaptations of the proof of [27, Theorem 4], using the same notation as in [27]. As in said proof, we distinguish between the following cases that are labelled according to the current best fitness in the population, called i. There are cases i.1, i.2, and i.3 explained in the following. We estimate the expected time spent in all these cases, summed up over all possible values of i, to obtain an upper bound on the total expected optimisation time.
Case i.1: the population contains a search point with fitness i and one search point of lower fitness.
This case is left for good if another search point of fitness at least i is created as then both search points will have fitness at least i. A sufficient condition for this is that the parent with fitness i is selected twice as parent (probability 1/4) and no 1-bit is flipped during mutation (probability at least \((1-c/n)^{n-1} = \varOmega (1)\)). Hence the expected waiting time in case i.1 is O(1).
The other two cases are:
Case i.2: the population contains two copies of the same individual, with fitness i.
Case i.3: the population contains two different individuals, both having fitness i.
The algorithm can switch between the two cases as diversity can be gained or lost.
We first bound the expected time spent in Case i.3. This case can be left for good if the two different search point are chosen as parents (probability 1/2) if crossover generates a surplus of 1-bits (estimated from below by 1/4 in [27, page 427]) and if mutation does not flip any bits (probability at least \((1-c/n)^n\)). Together, we obtain a probability of at least \((1-c/n)^n/8\). Hence, the expected time spent in Case i.3 is at most \(8/(1-c/n)^n = O(1)\).
Let \(T_{i, 2}\) be the random time spent in Case i.2. We leave this case for good if mutation only flips a single 0-bit and no 1-bit. This event has probability at least \(p^+ := (n-i) \cdot c/n \cdot (1-c/n)^{n-1}\). We further make a transition to Case i.3 if mutation flips exactly one 0-bit and one 1-bit, leading to a different search point with fitness i, and then choosing one of the identical parents for removal (probability 2/3). The probability of said events is at least \(p^{2 \rightarrow 3} := i(n-i) \cdot (c/n)^2 \cdot (1-c/n)^{n-2} \cdot 2/3\).
From Case i.3 the algorithm can move back to Case i.2, so that we may have to consider multiple visits to Case i.2. A necessary condition for going back to Case i.2 is that the created offspring is identical to one of the search points in the population, and the other search point is being selected for removal (probability 1/3). The probability for creating an identical offspring is maximised when the two search points have Hamming distance 2. Either the same parent is selected twice (probability 1/2) and mutation does not flip any bit (probability \((1-c/n)^n\)), or the same parent is selected twice and mutation creates the other parent (probability \(O(1/n^2)\)), or different parents are selected (probability 1/2), crossover and mutation set the two differing bits identical to one of the parents (probability 1/2 i.e., this is the probability that was wrongly estimated to be 1/4 in [5] leading to an upper bound of 5/(24e) for going back to Case i.2 rather than the correct 1/(4e) bound which we derive now) and mutation does not flip any of the common bits (probability \((1-c/n)^{n-2}\)). Together, the probability of going back to Case i.2 is at most
Hence the conditional probability that, when Case i.3 is left towards Case i.2 or higher fitness, it is left towards Case i.2 is at most
With these pessimistic estimations for transition probabilities, we obtain the following recurrence:
which is equivalent to
and plugging in the above values yields the following upper bound for \(T_{i, 2}\):
Summing up these upper bounds for all i yields a total time bound of
A closer inspection of the function \(f(i) :=\frac{1}{(n-i)(1+i \cdot c/n \cdot 2/9)}\) reveals that it is non-decreasing in \([0, n-1]\) if \(c \le 9/2\). For \(c > 9/2\) it is non-increasing in \([0, n(2c - 9)/(4c)]\) and non-decreasing in \([n(2c - 9)/(4c), n-1]\). In either case, we obtain an upper bound from Lemma 6:
The first two summands are in O(1). The integral simplifies as follows, using e. g. equation 3.3.20 in [1]:
Plugging everything together, the total time bound is
Noting that the times in all Cases i.1 and i.3, summed up for all i, are O(n) proves the claim. \(\square \)
4 A Potential Function Approach
We now turn to the main contribution of this paper, the tight lower bound for the (2+1) GA. Before going into detail, this section describes the main idea behind our approach and clarifies some further notation and fundamental observations that will be used in the remainder.
We write a population \(\{x^1, x^2\}\) in order of monotonically decreasing fitness, that is, \(f(x^1) \ge f(x^2)\). Let \(n_{11}\) be the number of bit positions where both parents have ones and likewise for \(n_{00}\) and the number of zeros. Let \(n_{10}\) be the number of positions where \(x^1\) has a 1 and \(x^2\) has a 0 and likewise for \(n_{01}\). Then we have \(f(x^1) = n_{11}+n_{10}\) and \(f(x^2) = n_{11}+n_{01}\). Since by assumption, \(f(x^1) \ge f(x^2)\), we have \(n_{10} \ge n_{01}\) and \(n_{10}=n_{01}\) is equivalent to the two individuals having equal fitness. In case \(n_{10}=0\), both individuals are identical. Such a population is called monomorphic in population genetics, and we use this term here.
Note that the (2+1) GA is an unbiased algorithm in the sense of Lehre and Witt [19]. In brief, this means that the algorithm treats all bit positions and all bit values symmetrically when generating new search points. Owing to this symmetry of bit positions and the fact that the fitness function is symmetrical itself, i. e., it only depends on the number but not the positions of the one-bits, it suffices to know \(n_{11}, n_{10}\) and \(n_{01}\) to fully characterise the state of the algorithm. Note that \(n_{00}\) can be derived as \(n - n_{11}-n_{10}-n_{01}\).
The following lemma characterises probabilities of setting a bit to 1 in the offspring after a crossover of two different parents and a mutation of the result.
Lemma 8
Consider a crossover of two parents x, y followed by mutation with mutation rate \(p_m\), resulting in an offspring z. For all i,
Proof
If \(x_i=y_i\) then crossover will create an offspring with the same bit value. The statement for \(x_i\ne y_i\) holds because of symmetry, or using the following, alternative argument. The offspring has a 1 if crossover creates a 1 and mutation does not flip bit i, or if crossover creates a 0 and mutation does flip bit i. The probability of the former event is \(1/2 \cdot (1-p_m)\) and the probability of the latter event is \(1/2 \cdot p_m\). Together, this gives 1/2. \(\square \)
Note that differing bits \(x_i \ne y_i\) are set to 1 with probability 1/2, irrespective of the mutation rate. Hence, when two parents are selected, we only need to consider the effect of mutation on the bits where the parents agree. We frequently and tacitly use this fact.
Our lower bound applies when only considering populations where the number of zeros in the fitter parent is at most \(n/\mathrm {polylog}(n)\) and at least \(\mathrm {polylog}(n)\). This implies that all probabilities that involve flipping a 0 to 1 are polylogarithmically small.
The main tool for our lower bound is going to be drift analysis, applied to a potential function that captures the current state and potential easy fitness improvements.
Definition 1
For a population P with values \(n_{11}, n_{10}, n_{01}, n_{00}\) we define the potential of P as
The intuition is that \(n_{11}+n_{10}\) describes the current best fitness in the population. The term \(n_{01}/3\) adds potential to the best fitness as the population has the potential to exploit the diversity given by the \(n_{01}\) 1-bits that only exist in the less fit individual during a successful crossover operation.
The choice of the factor 1/3 is motivated as follows. We know from previous work [5, 27] that the most helpful populations for improvements are those where two search points have the same number of ones and Hamming distance 2, that is, \(n_{10}=n_{01}=1\). (Larger Hamming distances have the potential for larger fitness improvements, but such populations are rarely reached when the number of zeros becomes reasonably small.)
Assume the current state has \(n_{10}=n_{01}=1\), corresponding to a potential of \(n_{11}+n_{10}+1/3\). The most likely transitions (and, when only \(O(n/\mathrm {polylog}(n))\) zeros are left, the only transitions with probability \(\varOmega (1)\)) are (1) collapsing the population to copies of one parent (and potential \(n_{11}+n_{10}\)) and (2) creating a surplus of one 1-bit by crossover and not flipping anything else (potential \(n_{11}+n_{10}+1\)). The probability of the former event is roughlyFootnote 3\((1-p_m)^n/4\), which is the probability of selecting the same parent twice, not flipping any bits and then selecting the other population member for removal plus the probability of selecting different parents, creating one parent by crossover and not flipping any bits. The probability of the latter event is roughly \((1-p_m)^n/8\), which is the probability of selecting different parents, setting both differing bits to 1 and not flipping any bits in the subsequent mutation.
Comparing these terms, the conditional probability of an improvement via crossover is roughly 1/3. In case a monomorphic population is reached, the potential reduces by 1/3 and this happens with conditional probability \(1-1/3\). In the latter event, the potential increases by \(1-1/3\) and this happens with conditional probability 1/3. The net effect of these transitions in the expected change of the potential is \((1-1/3)1/3 - 1/3(1-1/3) = 0\). So the potential balances out the effects of “volatile” states left quickly.
Obviously, our analysis still needs to account for other, less likely transitions. For populations with \(n_{10}=n_{01}> 1\) the conditional transition probabilities change as the probability of creating one of the parents by crossover depends on the Hamming distance \(n_{10}+n_{01}\) between parents. For \(n_{10}=n_{01}> 1\) the likely progress in a successful crossover may be smaller than \(n_{01}/3\). Hence the term \(+n_{01}/3\) in Definition 1 is a precise estimate for the likely progress when \(n_{01}=1\) and for larger \(n_{01}\) it is an overestimation.
It suffices to restrict our considerations to moderate values of \(n_{10}\) and \(n_{01}\). The reason is that the (2+1) GA always has a constant probability of creating a monomorphic population in one generation, regardless of the current population. This means that large values of \(n_{10}\) and \(n_{01}\) are very unlikely.
Lemma 9
Let \(t\ge \log ^2 n\) and \(t=n^{O(1)}\). With probability \(1-n^{-\varOmega (\log n)}\), all populations within the time interval \([\log ^2 n, t]\) have Hamming distance at most \(\log ^2 n\) between their two individuals.
Proof
We call a generation that creates a population of two identical individuals a monomorphic generation. The crucial idea is to show that monomorphic generations are very frequent so that large Hamming distances are unlikely to occur.
The probability of a monomorphic generation happening is at least \((1/4)(1-1/n)^{n}(1/3)=\varOmega (1)\) since it is sufficient to select a fittest parent twice, to clone it and to remove the other parent (which has probability at least 1/3). For a number \(t\ge 0\) of generations after a monomorphic one, let \(D_t\) denote the maximum number of bits in which the two parents ever have differed during these t generations. The crucial idea is that only mutations can increase this D-value. The total number of bits flipped in t generations is the sum of tn Poisson trials with success probability c/n each. Hence, within t generations following a monomorphic one, the D-value is bounded from above by 2ct with probability \(1-2^{-\varOmega (t)}\) according to Chernoff bounds (Theorem 5), and clearly the Hamming distance is no larger than the D-value. We set \(t:=(\log ^2 n)/(2c)\) to bound the D-value by \(\log ^2 n\).
The proof is completed by noting that the probability of not observing a monomorphic generation within \(\log ^2 n\) generations is \((1-\varOmega (1))^{\log ^2 n} = n^{-\varOmega (\log n)}\). Together with the failure bound \(2^{-\varOmega (t)}\), which is \(n^{-\varOmega (\log n)}\) for \(t=(\log ^2 n)/(2c)\), and a union bound, this means that in any polynomial number of generations following the first monomorphic one the Hamming distance never exceeds \(\log ^2 n\) with probability \(1-n^{O(1)}n^{-\varOmega (\log n)} = 1-n^{-\varOmega (\log n)} \). \(\square \)
5 Roadmap for the Analysis
We give a deliberately informal, high-level view of our analysis, where \(\varDelta := \varphi (P_{t+1}) - \varphi (P_t)\) denotes the change in potential in one generation. By the law of total probability, \(\varDelta \) can be split up according to the number of zeros flipped by mutation:
The case that no zeros flip does not increase the potential, hence we aim to bound the first line from above by 0. The second line captures the most important case: one zero flips and subsequent progress is made. The second line will be bounded by the dominant term in our claimed lower bound. The third line involves the probability of flipping at least two zeros. If the number of zeros is small, this is unlikely and thus the third line only contributes a small order term.
The above high-level view is not particularly accurate. Firstly, the above estimations need to account for error terms. Secondly, the notion of “i zeros flip” used above is not well-defined. This is because the number of zeros that can flip during mutation depends on the parent selection. The same parents may be selected twice, and then the number of zeros depends on the fitness of the parent. If two different parents are used, we only consider mutations of bits that agree in both parents, as per Lemma 8.
Hence, we need to distinguish between different events from the parent selection. To formalise this, let \(P_{11}, P_{22}\), and \(P_{12}\) denote the events that parent selection chooses the first parent twice, the second parent twice and both parents, respectively. We further denote by \(F_{00}\) the number of flipping bits amongst the \(n_{00}\) bits and likewise for \(F_{11}\) and \(n_{11}\) bits. We use asterisks to indicate the union of sets: \(F_{0*}\) is the number of flipping bits among \(n_{01}+n_{00}\) bits and \(F_{*0}\) is the number of flipping bits among \(n_{10}+n_{00}\) bits. Variables \(F_{1*}\) and \(F_{*1}\) are defined analogously. Armed with this notation, we express the third line rigorously with a combination of events.
Lemma 10
For all populations with \(n_{11} \ge n - n/\log ^3 n\) and \(n_{10}+n_{01} \le \log ^2 n\),
Proof
We give one common way of bounding the three lines from the statement. For all \(F \in \{F_{0*},F_{*0},F_{00}\}\), the drift can only increase by at most \(n_{01}+F\) as every flipping 0-bit can only increase the potential by at most 1 and crossover can increase the potential on all \(n_{01}\) by at most 1. Also note that \(\mathord {{\mathrm {P}}}\mathord {\left( F_{*0} \ge 2\right) }\) has the largest probability amongst all variables F (as the underlying number of zeros is maximal for \(F_{*0}\)). Thus, all 3 lines are bounded as
Since \(n_{0*} \le n - n_{11} \le n/\log ^3 n\), we get \(\left( \frac{cn_{0*}}{n}\right) ^2 = O(n_{0*}/(n\log ^3 n))\). The sum is bounded by \(n_{01}+2 + \sum _{i=0}^{\infty } i \left( \frac{cn_{0*}}{n}\right) ^{i}\), which is \(n_{01}+O(1) = O(\log ^2 n)\). Together, this implies the claim. \(\square \)
The cases of no zeros flipping and one zero flipping are more difficult to handle. Corresponding drift estimates will be derived in the following sections. More precisely, we derive closed formulas for the drift of the potential function depending on the choice of parents in the following Sect. 6 and then distinguish between generations that flip no zeros (Sect. 7) and one zero (Sect. 8). As mentioned in the introduction, the drift bounds obtained in these major technical sections are finally put together in Sect. 9.
6 Closed Formulas for the Potential Drift
Before proceeding to bound the drift of the potential, we derive closed bounds for the potential under the condition \(P_{12}\) that different parents are being selected. These bounds hold for arbitrary given values of \(F_{00}\), \(F_{11}\), the numbers of flipping common zeros and ones, respectively. These closed formulas will then be used in the subsequent sections to bound the drift for the cases \(F_{00}=0\) and \(F_{00}=1\), respectively.
For the derivation of closed formulas it is important to distinguish between two cases: the parents having equal fitness, \(n_{10}=n_{01}\), or the parents having unequal fitness, \(n_{10}>n_{01}\). This is because, depending on the fitness of the offspring, different cases may occur and the probabilities used during replacement selection may differ. For instance, when both parents have an equal fitness and the offspring happens to have the same fitness as well, there is a 3-way tie that is resolved uniform randomly by selection. When both parents have different fitness, 3-way ties are not possible (there can only be ties between the offspring and one parent). However, the offspring might be strictly worse than the better parent and strictly better than the worse parent. Such a case cannot happen when \(n_{10}=n_{01}\). It therefore makes sense to split the analysis into these two cases and to derive separate closed bounds on the potential drift.
6.1 A Closed Formula for Selecting Different Parents of Equal Fitness (\(n_{10}=n_{01}\))
In the following lemma we shall first derive a closed formula for the potential drift when two different parents are chosen that have the same fitness, \(n_{10}=n_{01}\). The lemma defines a threshold \(\ell \) that reflects the number of bits crossover needs to set to 1 to achieve the same fitness as \(x^1\) and \(x^2\).
Lemma 11
Let \(S \sim \mathrm {Bin}(n_{10}+n_{01}, 1/2)\) and \(\ell := n_{10}-F_{00}+F_{11}\). Then for all populations with \(n_{10}=n_{01}\),
Proof
Consider a step where both parents are selected and \(F_{00}, F_{11}\) are known. The number of bits set to 1 by crossover is given by \(S \sim \mathrm {Bin}(n_{10}+n_{01}, 1/2)\). By the law of total probability,
Since \(n_{10}=n_{01}\), the fitness of both parents is \(n_{11}+n_{10}\) and the fitness of the offspring is \(n_{11} + s + F_{00} - F_{11}\). If \(s < \ell \), the latter is less than \(n_{11}+n_{10}\) and the offspring will be rejected. If \(s > \ell \), the offspring is fitter than the parent, one of the parents will be chosen uniformly at random for removal. If the offspring is as fit as both parents, the offspring is removed with probability 1/3. Thus,
Now we estimate \(\mathord {{\mathrm {E}}}\mathord {\left( \varDelta \mid S = s\right) } \mathord {{\mathrm {P}}}\mathord {\left( S = s\right) }\) for \(s\ge \ell \). Let \(S_{10}\) be the number of bits among the \(n_{10}\) bits that are set to 1 in the offspring and define \(S_{01}\) analogously for the \(n_{01}\) bits. Note that \(S := S_{10} + S_{01}\) where \(S_{10} \sim \mathrm {Bin}(n_{10}, 1/2)\) and \(S_{01} \sim \mathrm {Bin}(n_{01}, 1/2)\) according to Lemma 8.
Given \(S_{10} = s_{10}\) and \(S_{01} = s_{01}\), the potential difference \(\varDelta \) is derived as follows. Among the \(n_{11}\) bits, \(F_{11}\) bits flip to 0, reducing their contribution from 1 to 1/3 each, leading to a contribution of \(-2F_{11}/3\) to the potential difference. All the \(n_{10}\) bits contribute 1 to the potential \(\varphi (P_t)\). In \(P_{t+1}\), \(s_{10}\) bits contribute 1 and the remaining \(n_{10}-s_{10}\) bits contribute 1/3 each. Hence the contribution to the potential difference is \(-2(n_{10}-s_{10})/3\). The \(n_{01}\) bits contribute 1/3 each in \(\varphi (P_t)\) and in \(P_{t+1}\) we have \(s_{01}\) bits contributing 1 each and the other \(n_{01}-s_{01}\) bits contributing 0. Hence the contribution to the potential difference is \(s_{01} - n_{01}/3\). Finally, the contribution of the \(n_{00}\) bits to the potential difference is \(F_{00}\).
In the following we denote by \((X \mid Y)\) the conditional variable X conditional on Y, where Y is a sequence of events and/or random variables. Omitting the implicit conditions \(P_{12}, F_{00}, F_{11}\) for brevity, we obtain a conditional drift \(\varDelta \) of
By the law of total probability,
Now, \((S_{10} \mid S = s)\) follows a hypergeometric distribution with parameters \(n_{10}\) (number of red balls), \(n_{10}+n_{01}\) (number of balls) and s (number of draws). The expectation is thus \(s \cdot n_{10}/(n_{10}+n_{01}) = s/2\). Plugging this in yields
Together, this gives
Using
the first terms simplify as
The last term simplifies as
Together, this proves the claim. \(\square \)
The bound from Lemma 11 depends on the expected surplus \(\mathord {{\mathrm {E}}}\mathord {\left( S \mid S \ge \ell \right) }\) generated by a crossover on the bits that differ between the two parents, where \(\ell \) reflects the fitness threshold above which offspring are accepted. We use the following formula to simplify such expressions. The proof goes back to work by Gruder [13] that is highlighted in a paper by Johnson [17]. It is given in the appendix for the sake of completion.
Lemma 12
Let \(S \sim \mathrm {Bin}(n, 1/2)\), then for all \(\ell \in {\mathbb {N}}\),
Using Lemma 12, we obtain the following simplified formula.
Lemma 13
Let \(S \sim \mathrm {Bin}(n_{10}+n_{01}, 1/2)\). Then for all populations with \(n_{10}=n_{01}\),
Proof
Recall \(\ell := n_{10} - F_{00} + F_{11}\). By Lemma 11 and Lemma 12,
\(\square \)
6.2 A Closed Formula for Selecting Different Parents of Unequal Fitness (\(n_{10}>n_{01}\))
For the case of unequal fitness (\(n_{10}>n_{01}\)), we use similar arguments as before. This scenario has more involved calculations as we need to distinguish different cases: the offspring may be at least as good as the fitter parent, and then the calculations are similar to the equal-fitness scenario. The offspring may also be worse than the fitter parent and better than the worse parent. In this case, the potential is derived from a different formula as the values for \(n_{10}\) and \(n_{01}\) in the next generation are still determined according to the fitter parent. In case the offspring’s fitness is equal to that of the worse parent, there is a tie and the offspring is only accepted with probability 1/2. The lemma defines two thresholds \(\ell _1\) and \(\ell _2\) that reflect the number of bits crossover needs to set to 1 to achieve the fitness of \(x^1\) and \(x^2\), respectively.
Lemma 14
Let \(S \sim \mathrm {Bin}(n_{10}+n_{01}, 1/2)\), \(\ell _1 :=n_{10}-F_{00}+F_{11}\) and \(\ell _2 :=n_{01}-F_{00}+F_{11}\). Then for all populations with \(n_{10}>n_{01}\),
Proof
Consider a step where both parents are selected and \(F_{00}, F_{11}\) are known. The number of bits set to 1 by crossover is given by \(S \sim \mathrm {Bin}(n_{10}+n_{01}, 1/2)\). By the law of total probability,
Note that the fitness of \(x^1\) is \(n_{11}+n_{10}\), that of \(x^2\) is \(n_{11}+n_{01}\) and the fitness of the offspring is \(n_{11} + s + F_{00} - F_{11}\).
We distinguish the following cases.
-
1.
If \(s \ge \ell _1\), the offspring is at least as fit as the fitter parent, and \(x^2\) will be removed. The new values of \(n_{10}, n_{01}\) will be determined by the offspring.
-
2.
If \(\ell _2< s < \ell _1\), the offspring is fitter than the worse parent \(x^2\), and \(x^2\) will be removed. Parent \(x^1\) will remain the fitter parent.
-
3.
If \(s = \ell _2\), the offspring is as fit as \(x^2\) and \(x^2\) will be removed with probability 1/2. Then the potential is computed as in the case \(\ell _2< s < \ell _1\).
-
4.
If \(s < \ell _2\), the offspring is worse than \(x^2\) and will be rejected. Hence the potential does not change.
Let \(\varDelta \) be the change of potential from the current population to the new population \(P_{t+1}\) as described above. Then,
where the last line accounts for the fact that with probability 1/2 the offspring is removed if \(s=\ell _2\).
Now we estimate \(\mathord {{\mathrm {E}}}\mathord {\left( \varDelta \mid S = s\right) } \mathord {{\mathrm {P}}}\mathord {\left( S = s\right) }\), starting with the case for \(s\ge \ell _1\). Let \(S_{10}\) be the number of bits among the \(n_{10}\) bits that are set to 1 in the offspring and define \(S_{01}\) analogously for the \(n_{01}\) bits. Note that \(S := S_{10} + S_{01}\) where \(S_{10} \sim \mathrm {Bin}(n_{10}, 1/2)\) and \(S_{01} \sim \mathrm {Bin}(n_{01}, 1/2)\).
Given \(S_{10} = s_{10}\) and \(S_{01} = s_{01}\), the potential difference \(\varDelta \) is derived as follows. Among the \(n_{11}\) bits, \(F_{11}\) bits flip to 0, reducing their contribution from 1 to 1/3 each, leading to a contribution of \(-2F_{11}/3\) to the potential difference. All the \(n_{10}\) bits contribute 1 to the potential \(\varphi (P_t)\). In \(P_{t+1}\), \(s_{10}\) bits contribute 1 and the remaining \(n_{10}-s_{10}\) bits contribute 1/3 each. Hence the contribution to the potential difference is \(-2(n_{10}-s_{10})/3\). The \(n_{01}\) bits contribute 1/3 each in \(\varphi (P_t)\) and in \(P_{t+1}\) we have \(s_{01}\) bits contributing 1 each and the other \(n_{01}-s_{01}\) bits contributing 0. Hence the contribution to the potential difference is \(s_{01} - n_{01}/3\). Finally, the contribution of the \(n_{00}\) bits to the potential difference is \(F_{00}\). Together,
By the law of total probability, for all \(s \ge \ell _1\),
Now, \((S_{10} \mid S = s)\) follows a hypergeometric distribution with parameters \(n_{10}\) (number of red balls), \(n_{10}+n_{01}\) (number of balls) and s (number of draws). The expectation is thus \(s \cdot n_{10}/(n_{10}+n_{01})\). Plugging this in yields
Now we consider the case \(\ell _2 \le S < \ell _1\). Here \(x^1\) remains the fitter parent and the potential difference \(\varDelta \) is derived as follows. Flipping bits among the \(n_{11}\) and \(n_{10}\) bits do not change the potential. We have \(S_{01}+F_{00}\) bits that each contribute 1/3 to the potential, compared to \(n_{01}\) bits in the previous population. Hence the change in potential is
By the law of total probability, for all \(\ell _2 \le s < \ell _1\),
using the same calculations as before.
Plugging everything together,
By adding terms for \(\ell _1 \le s \le n_{10}+n_{01}\) to the second line and subtracting them from the first line, we get
Applying Lemma 12 twice, to the first and second line, yields the following two equations:
Plugging this in yields
\(\square \)
7 Potential Drift When No Zeros Flip
We now consider the potential drift when no zeros flip. When the distance to the optimum is o(n), this case is by far the most frequent case. This also means that our drift bounds have to be precise, as even a small error term may have a big impact and spoil the analysis.
7.1 Selecting the Same Parent Twice
We start by considering the drift conditional on selecting the same parent twice.
Lemma 15
For all populations with \(n_{10}=n_{01}\),
For all populations with \(n_{10} > n_{01}\),
Proof
First assume \(n_{10}=n_{01}\) and consider the event \(P_{11}\). Given \(F_{0*}=0\), that is, if no 0-bit is flipped, the offspring can only be accepted if no 1-bit is flipped, i. e., \(F_{1*}=0\). These events happen with probability \(\mathord {{\mathrm {P}}}\mathord {\left( F_{0*}=0\right) }\mathord {{\mathrm {P}}}\mathord {\left( F_{1*}=0\right) }=(1-c/n)^n\) and they lead to an offspring that is identical to \(x^1\). Since all search points have equal fitness, \(x^2\) is removed with probability 1/3. This leads to a monomorphic population and the potential decreases by \(n_{01}/3\). Multiplying the above terms proves the claimed equality. The case of \(P_{22}\) follows analogously, considering \(F_{*0}\) and \(F_{*1}\) instead.
For \(n_{10} > n_{01}\), if the fitter parent \(x^1\) is selected twice, a copy of it is created with probability \((1-c/n)^n\) and then \(x^2\) is removed. This decreases the potential by \(n_{01}/3\). Multiplying the above terms yields an expectation of \(-n_{01}/3 \cdot (1-c/n)^n\). Note that other operations cannot increase the potential since no 0-bit is being flipped and flipping 1-bits in \(x^1\) does not decrease the potential.
If \(x^2\) is selected twice as parent, the potential cannot increase since no 0-bits are flipped, and it cannot decrease as any 1-bit being flipped will lead to the offspring being rejected. \(\square \)
Now we consider the event that two different parents are selected. The remainder of this section is split into drift bounds when the parents have equal fitness, \(n_{10}=n_{01}\), and the case where parents have unequal fitness, \(n_{10} > n_{01}\).
7.2 Selecting Different Parents of Equal Fitness (\(n_{10}=n_{01}\))
We use the closed formula from Lemma 13 to show that, to get an upper bound on the drift when \(F_{00}=0\), we only need to consider \(F_{11} \in \{0, 1\}\) as larger values lead to a non-positive drift. This is not obvious, but follows from a lengthy and trite calculation. The proof is placed in the appendix to keep the main part streamlined.
Lemma 16
For all \(n_{10}=n_{01}\) and all \(i \ge 2\),
Now we are able to give an upper bound on the potential drift, assuming that different parents are selected (\(P_{12}\)) and no common zero-bits flip (\(F_{00}=0\)).
Lemma 17
For every population with \(n_{10}=n_{01}\),
Proof
By the law of total probability and Lemma 13,
as the term in brackets is 0 for all values \(i \ge n_{10}\).
For \(n_{10}=0\), the above is 0 and for \(n_{10}=1\), is it \(\mathord {{\mathrm {P}}}\mathord {\left( F_{11}=0\right) }/9\) as claimed. For \(n_{10} \ge 2\), by Lemma 16 all terms for \(i \ge 2\) are non-positive and can be dropped. Thus, we get
Writing the above as
we have \(a_0 = 0\) for \(n_{10}=0\) and \(a_0=1/9\) for \(n_{10}=1\). The maximum value is \(a_0 = 1/8\) for \(n_{10}=2\). This can be seen from evaluating \(a_0\) for \(0 \le n_{10} \le 10\) (shown in Table 1) and from the following analytical arguments for \(n_{10} \ge 11\). We use the bound on the largest binomial coefficient \(\left( {\begin{array}{c}2n_{10}\\ k\end{array}}\right) \le \left( {\begin{array}{c}2n_{10}\\ n_{10}\end{array}}\right) \le 2^{2n_{10}}/(\sqrt{\pi n_{10}})\) for all k according to Lemma 7 and the fact that \(\mathord {{\mathrm {P}}}\mathord {\left( S \ge n_{10}\right) } \ge 1/2\) by symmetry of the binomial distribution. Then \(a_0\) is at most
This upper bound becomes negative for \(n_{10} \ge 11\) (the real value \(a_0\) that is being bounded from above already becomes negative for \(n_{10} \ge 7\)). The term \(a_1\) is 0 for \(n_{10} \le 1\) and at most 1/64 for \(n_{10} \ge 2\). This can be seen by evaluating the above formula for \(2 \le n_{10} \le 10\) (see Table 1) and bounding it above as follows for \(n_{10} \ge 3\), using that \(\mathord {{\mathrm {P}}}\mathord {\left( S \ge n_{10}+1\right) }\) is minimised for \(n_{10}=3\), where it is 5/16.
which is at most \(n_{10}/64\) for \(n_{10} \ge 11\). \(\square \)
7.3 Selecting Different Parents of Unequal Fitness (\(n_{10}>n_{01}\))
The case of \(P_{12}\), different parents being selected, no common 0-bits flipping (\({F_{00}=0}\)) and parents having unequal fitness is easy to handle as in this scenario the potential drift is always non-positive. This is shown in the following lemma. It also states a closed formula for the potential drift as this will be used later on, in the proof of Lemma 22.
Lemma 18
For all \(n_{10} > n_{01}\) and all \(F_{11} \in {\mathbb {N}}_0\),
The proof is not very insightful, hence it is placed in the appendix.
7.4 Combined Drift Bound When No Zero Flips
Assembling the previous drift bounds under various conditions, we get the following drift bounds.
Lemma 19
Assume \(c \le 56/9\), \(n_{10}=n_{01}\) and \(n_{11} \le n-c\).
Proof
For \(n_{10}=n_{01}\), we argue that Lemma 17 implies
This is obvious for \(n_{10} \le 1\); for \(n_{10} \ge 2\) it follows from \(\mathord {{\mathrm {P}}}\mathord {\left( F_{11}=1\right) } = cn_{11}/n \cdot \left( 1 - c/n\right) ^{n-1} = cn_{11}/(n-c) \cdot \left( 1-c/n\right) ^n \le c\left( 1-c/n\right) ^n\) and \(1/8 + c/64 \le 2/9 \le n_{01}/9\) using \(c \le 56/9\).
By Lemmas 15 and 17, along with \(\mathord {{\mathrm {P}}}\mathord {\left( P_{12}\right) } = \mathord {{\mathrm {P}}}\mathord {\left( P_{11} \cup P_{22}\right) } = 1/2\) and \(\mathord {{\mathrm {P}}}\mathord {\left( F_{00}=0\right) }=(1-c/n)^{n_{00}}\), the left-hand side is at most
The bound for the case \(n_{10}>n_{01}\) follows immediately from Lemmas 15 and 18, along with \(\mathord {{\mathrm {P}}}\mathord {\left( P_{11}\right) } = 1/4\). \(\square \)
8 Potential Drift When One Zero Flips
In this section we show that the drift is bounded by a term that yields the leading constant we are aiming for in our main result. Note that here we can afford to include error terms of lower order.
We first consider the case that two equal parents are selected.
Lemma 20
For all populations with \(n_{10}=n_{01}\),
For all populations with \(n_{10} > n_{01}\),
Proof
First assume \(n_{10} = n_{01}\) and consider the event \(P_{11}\). Given \(F_{0*} = 1\), the offspring is accepted with certainty if no 1-bit is flipped. With probability 1/2 the parent survives and then the new potential is at most \(n_{11}+n_{10}+1\). Consequently, the potential changes by at most \(1-n_{01}/3\) due to the loss of diversity. With the remaining probability 1/2, the potential increases by at most 1. The overall expected change in potential in this case is thus at most \(1-n_{01}/6\). If a single 1-bit also flips together with the 0-bit, then the offspring has the same fitness as both parents and the individual to be removed is selected uniformly at random. If the offspring is removed, the potential does not change. If the parent is removed, the potential increases by at most 1/3. If the other population member is removed, the potential increases by at most \(1/3-n_{10}/3\) due to the loss of diversity. The expected change in potential in this case is thus at most \(2/9 - n_{10}/9\). If more than one 1-bits flip, then the offspring will have lower fitness than both other members and it will be rejected. Summing up the terms proves the first claim and the case of \(P_{22}\) follows analogously, considering \(F_{*0}\) and \(F_{*1}\) instead.
For \(n_{10} > n_{01}\), given \(F_{0*} = 1\) and that the fitter parent \(x^1\) is selected twice, we consider separate cases according to the size of \(n_{01}\). If \(n_{01}=0\), then no diversity can be lost. Thus, if no 1-bits flip, the potential increases by 1 because the new offspring has higher fitness than \(x^1\) and \(x^2\) is rejected. If at least one 1-bit flips, then the best fitness does not change and the potential increases by at most 1/3. Overall, for the case when \(n_{01}=0\), the potential changes by \(\mathord {{\mathrm {P}}}\mathord {\left( F_{1*}=0\right) } + 1/3 \cdot \mathord {{\mathrm {P}}}\mathord {\left( F_{1*}>0\right) } = 1/3 + 2/3 \cdot \mathord {{\mathrm {P}}}\mathord {\left( F_{1*}=0\right) }\). If \(n_{01}>0\) and no 1-bits flip, then the potential changes by at most \(1 - n_{01}/3\) since the diversity is lost because \(x^2\) is removed. If instead at least one 1-bit flips, then the potential changes by at most \(1/3 - n_{01}/3\) since the best fitness does not change and the diversity may be lost if \(x^2\) is removed. Since for \(n_{01}>0\) these terms are negative, and the offspring is accepted with probability 1 if \(F_{1*} = 1\) the claim follows by summing up the two terms.
If \(x^2\) is selected twice and no 1-bits are flipped, then the potential increases by at most 1/3 (i.e., if an \(n_{00}\) bit is flipped) since the parent is removed and the diversity is kept. If a single 1-bit is flipped then the potential increases again by at most 1/3. However, since the offspring has the same fitness as its parent, it is necessary that the parent is removed which happens with probability 1/2. If more than one 1-bits are flipped, then the offspring is rejected. Summing up the terms completes the proof. \(\square \)
The proof of Lemma 20 has revealed a counterintuitive effect. A population of individuals with very different fitness values \(f(x^1) \gg f(x^2)\) can have an advantage over a population where both members have the same fitness \(f(x^1)\). This is because, conditioning on a 0-bit flipping, if the fitter parent is chosen twice, a near-arbitrary number of 1-bits can flip at the same time and the outcome may still be accepted. This increases the potential and explains why Lemma 20 contains an unexpectedly large potential drift in the case \(n_{10} > n_{01} = 0\).
Now we consider the case \(P_{12}\) that two different parents are selected.
When \(F_{11} \ge 2\) and the parents have equal fitness, the drift under \(P_{12}\) is non-positive for \(n_{10} \ge 10\), except for \(F_{11}=n_{10-1}\), where it is exponentially small in \(n_{10}\). The proof of the following lemma is given in the appendix.
Lemma 21
For all \(n_{10}=n_{01}\), \(n_{10} \ge 10\) and all \(2 \le i \le n_{10}-1\),
When the two parents have unequal fitness, the following uniform upper bound on the potential drift applies.
Lemma 22
For all \(n_{10}>n_{01}\) and all \(F_{11} \in {\mathbb {N}}_0\),
Proof
We start again with the drift formula from Lemma 14 and plug in \(F_{00}=1\). Then
For \(n_{10}<12\) the claim is verified numerically for all pairs \(n_{01}<n_{10}\), see Table 4. We now turn to an analytical argument for \(n_{10}\ge 12\).
From Lemma 18 we already know that
regardless of \(F_{11}\). Hence,
since \(\mathord {{\mathrm {P}}}\mathord {\left( S\ge \ell _1\right) }\le 1/2\) (using that \(\ell _1\) is strictly greater than the mean of S). Altogether,
no matter how \(\ell _1\) and \(\ell _2\) turn out. \(\square \)
To obtain a combined drift formula, we need to consider probabilities for flipping a certain number of 1-bits. The following simple lemma gives bounds for these.
Lemma 23
For every mutation rate c/n and every i,
Proof
The first bracket is bounded as
\(\square \)
Using Lemmas 20 and 21 and considering the drift under \(P_{12}\) separately for \(F_{11} \in \{0,1\}\) and for all \(F_{11}\) when \(n_{10} \le 10\), we get:
Lemma 24
Assume \(c \le 2.71\), \(n_{10}+n_{01} \le \log ^2 n\), \(n_{11} \ge n - n/\log ^3 n\) and \(n_{00} \ge n_{10}\log n\).
Proof
We consider different cases.
Case \({{\varvec{n}}}_{{{\varvec{10}}}}={{\varvec{n}}}_{{{\varvec{01}}}}\): In case both parents have equal fitness,
and we bound the sum of both terms, using \(\mathord {{\mathrm {P}}}\mathord {\left( P_{11}\right) } = \mathord {{\mathrm {P}}}\mathord {\left( P_{22}\right) } = 1/4\) and Lemma 20 (noting that the formulas for \(P_{11}\) and \(P_{22}\) are identical) as
For \(P_{12}\) we consider probabilities \(\mathord {{\mathrm {P}}}\mathord {\left( F_{11} = \cdot \right) }\), hence we aim to relate this with probabilities \(\mathord {{\mathrm {P}}}\mathord {\left( F_{1*}=\cdot \right) }\) as follows. (Note that, for \(n_{10}=0\), \(F_{1*}=F_{11}\).)
and the term \(\left( 1 - \frac{c}{n}\right) ^{n_{10}}\) is at least \(1 - \frac{cn_{10}}{n}\) and at most 1. Similarly,
and the term \(\frac{n_{11}+n_{10}}{n_{11}} \left( 1 - \frac{c}{n}\right) ^{n_{10}}\) is at least \(1 - \frac{cn_{10}}{n}\) and at most \(1+O(n_{10}/n)\). Likewise,
which is at least \(1-cn_{01}/n\) and at most \(1+O(1/\log n)\) owing to the assumption \(n_{00} \ge n_{10}\log n \ge n_{01}\log n\).
Thus, (6) can be written as
for an error term \(\xi \in [-O(1/\log n), +O(1/\log n)]\), regardless of the signs of the factors the terms \(\mathord {{\mathrm {P}}}\mathord {\left( F_{1*}=0\right) }\) and \(\mathord {{\mathrm {P}}}\mathord {\left( F_{1*}=1\right) }\) are multiplied with. Here we used that these factors are in \([-O(n_{10}), O(n_{10})]\) and the absolute value of the error from replacing each term \(\mathord {{\mathrm {P}}}\mathord {\left( F_{1*} = \cdot \right) }\) with \(\mathord {{\mathrm {P}}}\mathord {\left( F_{11} = \cdot \right) }\) is at most \(O(n_{10}^2/n) = O(1/\log n)\).
The third term from the statement is
We simplify the expectations in the above terms for \(i \in \{0, 1\}\) using Lemma 13 and add them to the terms in (7). This yields an upper bound of
for coefficients \(a_0, a_1\) defined as follows:
It is easy to verify that \(a_0 = 1\) for \(n_{10}=0\) and \(a_0 \le 1\) for \(1 \le n_{10} \le 9\). For \(n_{10} \ge 10\), \(a_0\) becomes negative since the term \(\mathord {{\mathrm {P}}}\mathord {\left( S > n_{10}-1\right) }\left( 1 - \frac{n_{10}}{6}\right) \) is non-positive and using \(\mathord {{\mathrm {P}}}\mathord {\left( S = n_{10}-1\right) } \le 1/(\sqrt{\pi n_{10}})\) according to Lemma 7, we get
Similarly, it is easy to check that \(a_1 = 2/9\) for \(n_{10}=0\) and \(a_1 \le 2/9\) for \(1 \le n_{10} \le 6\). For \(n_{10} \ge 7\), \(a_1\) becomes negative as
By (8), for larger values of \(F_{11}\), coefficients \(a_i\) for \(\mathord {{\mathrm {P}}}\mathord {\left( F_{00}=1, F_{11}=i\right) }\), with \(2 \le i \le n_{10}+1\), can be bounded from considering only
Then we can bound the sought expression as
We bound \(a_i\) by \(\max \{0, a_i\}\) to ensure non-negative terms, which enables us to use the upper bound on \(\mathord {{\mathrm {P}}}\mathord {\left( F_{11}=i\right) }\) provided by Lemma 23:
where the last inequality follows from \(n_{11} \le n-c\).
Thus,
Using
and recalling \(\xi \le O(1/\log n)\) we can write this as
assuming that \(\max \{a_i, 0\} \cdot \frac{c^i}{i!} = \varOmega (1)\) (if it is 0 or o(1), we have proved the claim).
Now we obtain the claimed bound if we can show that
For \(n_{10}=0\) we have \(a_0 = 1\), \(a_1 = 2/9\), hence (9) holds with equality for \(n_{10}=0\). For \(0 \le n_{10} \le 10\) we evaluate the coefficients \(a_0, \dots , a_{n_{10}+1}\) with the closed formulas given above and verify (9) for \(c \le 2.71\). The coefficients and the ranges of c for which (9) holds are shown in Table 2. Since we assume \(c \le 2.71\), the inequality holds for all considered \(n_{10}\).
For \(n_{10} > 10\), we already established above that \(a_0 \le 0\) and \(a_1 \le 0\). By Lemma 21, \(a_i \le 0\) for all \(2 \le i \le n_{10}+1\), except for \(i=n_{10}-1\), where it is at most \(5n_{10}^3 2^{-2n_{10}}/3\). For \(n_{10} \ge 11\), this is at most 0.00053 and thus we get an upper drift bound of
where the last inequality holds for \(c \le 6.09\). This completes the proof for the case \(n_{10}=n_{01}\).
Case \({{\varvec{n}}}_{{{\varvec{10}}}}>{{\varvec{n}}}_{{{\varvec{01}}}}={{\varvec{0}}}\): Now we consider unequal fitness, \(n_{10}>n_{01}\), in the special case \(n_{01}=0\) (no diversity). By Lemma 20
and
Moving from events \(F_{0*}\)/\(F_{*0}\) and \(F_{1*}\)/\(F_{*1}\) to events \(F_{00}\) and \(F_{11}\) at the expense of an additive error term \(\xi ' \in [-O(1/\log n), +O(1/\log n)]\) as before and adding these two expressions shows that the first two lines of the formula from the statement are bounded by
Note that the first of these summands is multiplied by \(\mathord {{\mathrm {P}}}\mathord {\left( F_{00}=1\right) }\), which does not include a statement on \(F_{11}\).
Reusing the above calculations, the third term from the statement is
Using Lemma 14, the expectation simplifies to
Since \(n_{10}+n_{01}=n_{10} \le n_{10}+F_{11}\), the event \(S > n_{10}+F_{11}\) is impossible. The event \(S = n_{10}+F_{11}\) is only possible for \(F_{11}=0\), where \(\mathord {{\mathrm {P}}}\mathord {\left( S = n_{10} + F_{11}\right) } = \mathord {{\mathrm {P}}}\mathord {\left( S = n_{10}\right) } = \mathord {{\mathrm {P}}}\mathord {\left( S = 0\right) }\) by symmetry of the binomial distribution. For \(F_{11}=0\) we thus get
For all \(i > 0\), we get
Now we can bound the sought expression as
where, putting together (10) and the above bounds on \(\mathord {{\mathrm {E}}}\mathord {\left( \varDelta \mid P_{12}, F_{00} = 1, F_{11}=i\right) }\),
Since all coefficients are positive, we can bound \(\mathord {{\mathrm {P}}}\mathord {\left( F_{11} = i\right) }\) as before and obtain a drift bound of
Bounding \(a_0 \le 2/3\) (since \(\mathord {{\mathrm {P}}}\mathord {\left( S = 0\right) } \le 1/2\)), \(a_1 \le 1/24 + 1/3 = 3/8\) and \(a_i \le 1/3\) for \(i \ge 2\), we have
Plugging this in yields an upper drift bound of
Case \({\varvec{n}}_{{\varvec{10}}}>{\varvec{n}}_{{\varvec{01}}}>{\varvec{0}}\): Finally, we deal with the case \(n_{10}> n_{01} > 0\). By Lemma 20
and
We use Lemma 22 to bound the third term in the statement as
Together, we get a drift bound of
\(\square \)
9 Putting Everything Together
Combining results from previous sections, for different numbers of flipping zeros, yields the following unconditional drift bound.
Lemma 25
If \(c \le 1.422\), \(n_{10}+n_{01} \le \log ^2 n\), \(n_{11} \ge n-n/\log ^3 n\) and \(n_{00} \ge \log ^5 n\), then
Proof
This follows from adding drift bounds from Lemmas 10, 19 and 24. For \(n_{10}=n_{01}\), Lemma 24 gives the stated bound. The terms \(O(n_{10}^2/n) = O((\log ^4 n)/n)\) from Lemma 19 and \(O(n_{0*}/(n\log n)) = O(n_{00}/(n \log n))\) can both be absorbed in the \(O(1/\log n)\) term since \(n_{00}/n \ge \log ^5 n/n\).
For \(n_{10}>n_{01}=0\), the bound from Lemma 24 is at most the bound from the same lemma for \(n_{10}=n_{01}\) since
for \(c \le 1.422\). Then the claim follows as above.
For \(n_{10}>n_{01}>0\) note that Lemma 19 yields a negative upper bound of \(-n_{01}/8 \cdot (1-c/n)^n = -\varOmega (1)\). Since \(n_{00} \le n/\log ^3 n\), we obtain a negative drift bound if n is large enough. \(\square \)
To translate the upper bound from Lemma 25 into a lower bound on the expected runtime, we use the lower-bound version of the multiplicative drift theorem from Theorem 4. This theorem requires an upper bound on the drift of the potential function and a sufficiently small probability for large jumps of this value. Such large jumps can occur if the two individuals of the (2+1) GA have a large Hamming distance. Recall that Lemma 9 shows this to be unlikely.
The following lemma shows that a drift of the potential can be translated into a lower bound on the expected optimisation time. This is not immediate since the potential function is a weighted combination of two quantities.
Lemma 26
Let \(N_t\) denote the number of \(n_{00}\)-bits at time t of the (2+1) GA and T the first point in time where \(n_{00}\le \log ^5 n\). Assume that \(n_{01}+n_{10}\le \log ^2 n\) for all points in time before T, and \(N_0\ge \log ^5 n\). If \( \mathord {{\mathrm {E}}}\mathord {\left( \varphi _{t+1}-\varphi _t\mid \varphi _t\right) } \le \delta N_t \) for some \(n^{-O(1)} \le \delta <1\) and all \(t<T\), then
Proof
We introduce a distance function \({\overline{\varphi }}_t :=n - \varphi _t = n_{00} + \frac{2}{3} n_{01}\) as the mirror image of our potential to obtain a function to be minimised, as required in Theorem 4. Moreover, we write \(\varDelta _t={\overline{\varphi }}_t-{\overline{\varphi }}_{t+1}\). The key idea is to show that the statement on \(\mathord {{\mathrm {E}}}\mathord {\left( T\mid \varphi _0\right) }\) holds under the slightly different drift condition
and then to prove that the actual drift condition \(\mathord {{\mathrm {E}}}\mathord {\left( \varphi _{t+1}-\varphi _t\mid \varphi _t\right) } \le \delta N_t\) leads to the same result, up to lower-order terms.
We consider the process from the first point in time where \({\overline{\varphi }}_t\le n/\log ^3 n\), assume (11) to hold for all \(t<T\) and estimate the remaining expected time to minimise the distance \({\overline{\varphi }}_t\). Lemma 9 and the facts that at most \(\log n\) bits flip per generation and that the distance does not drop below \(n/\log ^3 n\) within the first \(\log ^2 n\) generations (each happening with probability \(1-n^{-\varOmega (\log n)}\)), imply that \({\overline{\varphi }}_{0} \ge n/(2\log ^3 n)\) with respect to our time count. We assume this to happen. By Lemma 9, with probability \(1-n^{-\varOmega (\log n)}\) it holds that \(n_{10}+n_{01}\le \log ^2 n\) for any polynomial number of steps. Assuming this for a sufficiently long period obtained from applying Markov’s inequality on \(\mathord {{\mathrm {E}}}\mathord {\left( T\mid {\overline{\varphi }}_0\right) }\), our assumptions only change the bound on the expected value by a \(1-n^{-\varOmega (\log n)}\) factor.
Clearly, since \(n_{10}+n_{01}\le \log ^2 n\), crossover followed by a neutral mutation can change the \({\overline{\varphi }}\)-value by at most \(\log ^2 n\). Moreover, each mutation flips k or more bits with probability at most
in particular it flips at most \(\log ^2 n\) bits with probability \(1-n^{-\varOmega (\log n)}\). Adding up these effects, we arrive at \( \mathord {{\mathrm {P}}}\mathord {\left( {\overline{\varphi }}_t-{\overline{\varphi }}_{t+1} \ge 2\log ^2 n\right) } = n^{-\varOmega (\log n)}. \) The time to minimise the distance function is no smaller than the time to reach a distance of at most \(x_{\min }:=\log ^5 n\) (we stop at this point to fulfill the condition on \(n_{00}\) in Lemma 25). Along with \(\beta :=2/\log n\) and using \(X_t:={\overline{\varphi }}_t\), we verify the second condition of Theorem 4 by estimating
Finally, we estimate this bound as \(n^{-\varOmega (n)} \le \beta \delta /\log n\le \beta \delta /\!\log (X_t)\) since both \(\beta \) and \(\delta \) are at least inversely polynomial in n. Hence, \(\mathord {{\mathrm {P}}}\mathord {\left( X_t-X_{t+1}\ge \beta X_t\right) }\le \beta \delta /\!\log (X_t)\), which satisfies the condition. Applying the theorem and recalling our assumption \({\overline{\varphi }}_{0} \ge n/(2\log n)\),
which is \((1-O(1/\log n))\frac{\ln (n)-O(\ln \ln n)}{\delta }\). Recall that the unconditional expected time is only by a factor \(1-n^{-\varOmega (\log n)}\) smaller.
Finally, we relate \(N_t\) to the distance function \({\overline{\varphi }}_t\) and note first that \({\overline{\varphi }}_t\ge N_t\). Since, for \(t<T\), our assumptions imply \({\overline{\varphi }}_t=n_{00}+(2/3)n_{01} \le (1+1/\log n)N_t\), the prerequisite
along with the fact \(\varDelta _t=\varphi _{t+1}-\varphi _{t}\) imply
Hence, (11) has been established with parameter \(\delta (1-1/\log n)\) so that
which is again \((1-O(1/\log n))\frac{\ln (n/\log ^4 n)}{\delta }\) as claimed. \(\square \)
We are now ready to prove our main result.
Proof of Theorem 2
We apply drift analysis to a “typical” run, that is, a run where events mentioned in the following that happen with overwhelming probability do occur. The contrary is called a failure event and if a failure occurs, we pessimistically assume that the runtime is 0. By the law of total probability, the expected runtime is bounded from below by the expected runtime in a typical run, multiplied by the probability that no failure occurs.
With probability \(1-2^{-\varOmega (n)}\) the initial population has a maximum number of (2/3)n one-bits. The random number of one-bits that crossover creates among the relevant \(n_{10}\) and \(n_{01}\) bits follows a binomial distribution with parameters \(n_{10}+n_{01}\) and 1/2. The expected value of this number equals \((n_{10}+n_{01})/2\le n/2\), and using Chernoff bounds (Theorem 5) with \(\delta =2n^{-1/3}\) and the upper bound n/2 on the expectation, the number of one-bits of every fixed offspring is at most \(U:=(n_{10}+n_{01})/2 + n^{2/3}\) with probability \(1-2^{-\varOmega (n^{1/3})}\). By a union bound, the probability that at least one offspring has more than U one-bits is still \(2^{-\varOmega (n^{1/3})}\). Since the fittest parent has at least \((n_{10}+n_{01})/2\) one-bits, this means that the number of one-bits at the relevant positions, and thereby the number of one-bits of the fitter offspring, grows by at most \(n^{2/3}\) with probability \(1-2^{-\varOmega (n^{1/3})}\). The same holds for each mutation with constant rate c as seen in (12). Hence, the subsequent \(2\log ^2 n\) generations do not increase the OneMax-value to more than (3/4)n with probability \(1-2^{-\varOmega (n^{1/3})}\).
We consider the first point in time when \(n_{11} \ge n - n/\log ^3 n\). By the same arguments as before, we then have \(n_{11} \le n-n/(2\log ^3 n)\) with overwhelming probability. Now Lemma 9 is in force, implying that we can apply Lemma 26 with
according to the drift bound from Lemma 25. Hence, if none of the above-mentioned failure events occur, the expected optimization time is bounded from below by
which is \(\frac{9e^c}{c(2c+9)} \cdot n\ln (n) -O(n \ln \ln n)\) as claimed. This still holds when multiplying the above with the probability of no failure event occurring, as the union of all failure events is superpolynomially small.
10 Experiments
Our lower bound is tight up to lower-order terms. We ran experiments to see how close the results are to the dominant term of \(\frac{9e^c}{c(2c+9)} \cdot n \ln n\), when varying c and n.
We first fixed \(n=1000\) and varied c in steps of 0.1. For each value of c, we ran the (2+1) GA 10000 times to obtain smooth curves. For the range of c-values, \(c \in (0, 1.422]\), covered by Theorem 2, Fig. 1 shows that the empirical averages are very close to the dominant term. The vertical line indicates the c-value of \(\frac{\sqrt{97}-5}{4} \approx 1.2122\) that was determined to be optimal (up to small-order terms) in Theorem 3.
An obvious question arises: what if values of c are chosen that are larger than the limit 1.422 in Theorem 2? Fig. 2 shows results of the previous experiment for a larger range of c-values, \(c \in \{0.1, 0.2, 0.3, \dots , 5.0\}\). The first vertical line in Fig. 2 again indicates the c-value of \(\frac{\sqrt{97}-5}{4} \approx 1.2122\); the second vertical line indicates the limit of 1.422 from Theorem 2.
We see that as c increases beyond 1.422, a gap starts to appear between the dominant term and the average runtime. The average runtime seems to be smaller than \(\frac{9e^c}{c(2c+9)} \cdot n \ln n\). This could indicate that the lower bound does not hold for \(c > 1.422\), or that there are small-order terms that affect the plots for \(n=1000\).
To see how quickly the runtime approaches the dominant term of \(\frac{9e^c}{c(2c+9)} \cdot n \ln n\) as n grows, we ran experiments with n increasing exponentially in powers of 2 from \(n=2^3=8\) to \(n=2^{13}=8192\). Figure 3 shows the normalised average runtimes, which is the runtime divided by \(n \ln n\). One can see that, for both the default and the optimal mutation rates, \(c=1\) and \(c=\frac{\sqrt{97}-5}{4}\), respectively, the curves approach their respective dominant terms. However, the approach is quite slow. Even for \(n=8192\) there is still a significant gap to the leading constant of the dominant term. This might indicate that the average runtime includes a significant, negative small-order term.
Recall that our lower bound (Theorem 2) includes a negative term of \(-O(n \log \log n)\), while the upper bound (Theorem 1) contains an additive term \(+O(n)\). The term \(-O(n \log \log n)\) is an artefact of the fact that we only considered around \(n/\log ^3 n\) fitness levels and we excluded the most difficult \(\log ^5 n\) ones. It is easy to show that optimising these fitness levels takes expected time \(O(n \log \log n)\) for both (1+1) EA and (2+1) GA. We conjecture that the expected running time is close to \(\frac{9e^{c}}{c(2c+9)} \cdot n \ln (n) + bn\) for some negative constant b. This is based on related work on the (1+1) EA whose expected runtime on OneMax is known to be \(en\ln (n) -1.8925\dots n + O(\log n)\) (cf. [14], who even determine the expected runtime up to additive errors of \(O(\log n/n)\)). Intuitively, the negative linear term appears in the bound for the (1+1) EA since the algorithm does not start with the largest possible value of the underlying potential function, which is the number of one-bits. More precisely, the initial number of one-bits \(X_0\) has an expected value of n/2 and the multiplicative drift theorems both for lower and upper bounds involve the term \(\ln (X_0/x_{\min })\). For \(X_0\approx n/2\) this becomes roughly \(\ln (n) - \ln 2\), so that the crucial term \(\ln (X_0/x_{\min })/\delta \), where \(\delta =\varTheta (1/n)\), becomes at most \((\ln n)/\delta - (\ln 2)/\delta = \varTheta (n\ln n)-O(n)\). The expected initial value of our potential function for the (2+1) GA equals \((1/4+1/4+1/12)n = (7/12)n\), so we believe there is a non-negligible negative term in the expression for the expected running as well.
To see how closely the empirical data matches a conjectured bound with a negative linear term, we performed a non-linear regression to fit the average runtimes to a model of \(a n \ln (n) + bn\), for parameters a, b. We used the software R, version 4.0.3, calling the nls command with starting values of \(a=1\) and \(b=0\). The results of the regression were values of a, b that minimised the residual sum of squares.
For the (2+1) GA with \(c=1\), this resulted in a fitted function of \(2.2381 n \ln (n) -0.7995n\). The leading constant 2.2381 is close to the value 2.224 from our theoretical analysis. Likewise, for the optimal value \(c=\frac{\sqrt{97}-5}{4} \approx 1.2122\), the best fit is for \(2.1532 n \ln (n) -0.5883n\) and the leading constant is close to 2.18417 from our analysis and smaller than the leading constant for \(c=1\). In both cases, the linear term has a negative sign. For comparison, the same regression for the (1+1) EA returned the best fit for the function \(2.745 n \ln (n) -2.116n\). The leading constant is close to the theoretically proven one of \(e=2.71828\) and the coefficient of the linear term is also close to the theoretical one of \(-1.8925\dots \) [14].
11 Conclusions
Proving lower bounds for crossover-based GAs is a notoriously hard problem. We have provided such a lower bound for the (2+1) GA on OneMax through a careful analysis of a potential function that captures both the current best fitness and the potential for finding improvements through crossover combining different “building blocks” of good solutions. Our lower bound is tight up to small-order terms. This for the first time proves rigorously that populations are beneficial for standard steady-state genetic algorithms. We also identified the optimal mutation rate for the (2+1) GA as \((\sqrt{97}-5)/(4n)\) for the considered range of mutation rates c/n.
Our lower bound applies for \(c \le 1.422\) and an obvious open question is whether the leading constant in the expected runtime remains at \(9e^c/(c(2c+9))\) when this threshold is exceeded. Our empirical results suggest that the expected runtime is smaller than the stated bound, albeit it is not clear whether these results are affected by negative small-order terms. We conjecture that the runtime is very close to \(\frac{9e^c}{c(2c+9)} \cdot n \ln (n) + bn\), where bn is a linear term with a negative constant b. Our drift estimates subsumed errors that were by a factor of \(O(1/\log n)\) smaller than the dominant term in asymptotic notation, hence tighter analyses would be required to obtain rigorous bounds on the conjectured linear term. Another avenue for future work would be to simplify our approach, possibly by exploiting that states with \(n_{10} > 1\) are rarely reached in the late stages of a run, or to generalise our analysis to larger parent population sizes.
Notes
The term “evolutionary algorithms” is commonly used as an umbrella term for evolutionary algorithms with and without crossover. We use the term “Genetic Algorithm (GA)” to emphasize the use of crossover in an evolutionary algorithm.
Steady state GAs are those that replace at most a proper subset of the population in each generation (usually one new individual is created). Typically they use standard bit mutation which flips each bit independently with probability c/n.
In this informal discussion we ignore events of smaller probability (e. g. picking the same parent twice and creating the other population member by a lucky mutation).
References
Abramowitz, M., Stegun, I.A.: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, tenth GPO printing edition Dover, Ninth Dover Printing, New York (1964)
Auger, A., Doerr, B. (eds.): Theory of Randomized Search Heuristics. World Scientific Publishing, Springer (2011)
Carvalho Pinto, E., Doerr, C.: A simple proof for the usefulness of crossover in black-box optimization. In: Auger, A., Fonseca, C.M., Lourenço, N., Machado, P., Paquete, L., Whitley, D. (eds) Parallel Problem Solving From Nature - PPSN XV, volume 11102 of LNCS, pp. 29–41. Springer (2018)
Chernoff, H.: A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 23, 493–507 (1952)
Corus, D., Oliveto, P.S.: Standard steady state genetic algorithms can hillclimb faster than mutation-only evolutionary algorithms. IEEE Trans. Evol. Comput. 22(5), 720–732 (2018)
Corus, D., Oliveto, P.S.: On the benefits of populations for the exploitation speed of standard steady-state genetic algorithms. Algorithmica 82(12), 3676–3706 (2020)
Dang, D.-C., Friedrich, T., Kötzing, T., Krejca, M.S., Lehre, P.K., Oliveto, P.S., Sudholt, D., Sutton, A.M.: Escaping local optima using crossover with emergent diversity. IEEE Trans. Evol. Comput. 22(3), 484–497 (2018)
Dang, D.-C., Friedrich, T., Krejca, M.S., Kötzing, T., Lehre, P.K., Oliveto, P.S., Sudholt, D., Sutton, A.M.: Escaping local optima with diversity-mechanisms and crossover. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2016), pp. 645–652. ACM Press (2016)
Doerr, B.: Probabilistic tools for the analysis of randomized optimization heuristics. In: Doerr, B., Neumann, F. (eds) Theory of Evolutionary Computation–Recent Developments in Discrete Optimization, Natural Computing Series, pp. 1–87. Springer (2020)
Doerr, B., Happ, E., Klein, C.: Crossover can provably be useful in evolutionary computation. Theoret. Comput. Sci. 425, 17–33 (2012)
Doerr, B., Neumann, F. (eds.): Theory of Evolutionary Computation-Recent Developments in Discrete Optimization. Springer, Natural Computing Series (2020)
Goldberg, D.E.: Genetic Algorithms in Search. Optimization and Machine Learning, Addison-Wesley Longman, LOndon (1989)
Gruder, O.: The theory of risk. In: 9th International Congress of Actuaries, vol 2, pp. 222 (1930)
Hwang, H.-K., Panholzer, A., Rolin, N., Tsai, T.-H., Chen, W.-M.: Probabilistic analysis of the (1+1)-evolutionary algorithm. Evolut. Comput. 26(2) (2018)
Jansen, T.: Analyzing Evolutionary Algorithms—The Computer Science Perspective. Springer, Berlin (2013)
Jansen, T., Wegener, I.: On the analysis of evolutionary algorithms—a proof that crossover really can help. Algorithmica 34(1), 47–66 (2002)
Johnson, N.L.: A note on the mean deviation of the binomial distribution. Biometrika 44(3–4), 532–533 (1957)
Kötzing, T., Sudholt, D., Theile, M.: How crossover helps in pseudo-Boolean optimization. In: Proceedings of the 13th Annual Genetic and Evolutionary Computation Conference (GECCO 2011), pp. 989–996. ACM Press (2011)
Lehre, P.K., Witt, C.: Black-box search by unbiased variation. Algorithmica 64(4), 623–642 (2012)
Lehre, P.K., Yao, X.: Crossover can be constructive when computing unique input–output sequences. Soft. Comput. 15(9), 1675–1687 (2011)
Lengler, J.: A general dichotomy of evolutionary algorithms on monotone functions. IEEE Trans. Evol. Comput. 24(6), 995–1009 (2020)
Neumann, F., Oliveto, P.S., Rudolph, G., Sudholt, D.: On the effectiveness of crossover for migration in parallel evolutionary algorithms. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2011), pp. 1587–1594. ACM Press (2011)
Neumann, F., Witt, C.: Bioinspired Computation in Combinatorial Optimization—Algorithms and Their Computational Complexity. Springer, Berlin (2010)
Oliveto, P.S., Sudholt, D., Witt, C.: A tight lower bound on the expected runtime of standard steady state genetic algorithms. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2020), pp. 1323–1331. ACM (2020)
Oliveto, P.S., Witt, C.: On the runtime analysis of the simple genetic algorithm. Theoret. Comput. Sci. 545, 2–19 (2014)
Oliveto, P.S., Witt, C.: Improved time complexity analysis of the simple genetic algorithm. Theoret. Comput. Sci. 605, 21–41 (2015)
Sudholt, D.: How crossover speeds up building-block assembly in genetic algorithms. Evol. Comput. 25(2), 237–274 (2017)
Sutton, A.M.: Fixed-parameter tractability of crossover: steady-state GAs on the closest string problem. Algorithmica 83(4), 1138–1163 (2021)
Witt, C.: Tight bounds on the optimization time of a randomized search heuristic on linear functions. Comb. Probab. Comput. 22, 294–318 (2013)
Acknowledgements
This work was initiated at the Dagstuhl Seminar 19431 and supported by the EPSRC under Grant EP/M004252/1.
Open Access
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Omitted Proofs
Omitted Proofs
In this appendix we present proofs omitted from the main part.
1.1 Proof of Lemma 12
Recall that Lemma 12 simplified the expected surplus of a binomially distributed random variable. To prove Lemma 12, we use an equation that is attributed to Gruder [13]. We therefore refer to it as Gruder’s equation.
Lemma 27
(Gruder’s equation [13]) For every \(0 \le p \le 1\) and all integers a, b,
Proof
The claim is trivial for \(a > b\), hence we assume \(a \le b\). The following proof is stated in [17], attributed to Gruder [13].
\(\square \)
We use Gruder’s equation to prove Lemma 12.
Proof of Lemma 12
Applying Gruder’s equation (Lemma 27) with \(p = 1/2\), \(a=\ell \), \(b=n\) yields
1.2 Proof of Lemma 16
Recall that Lemma 16 claims that if \(F_{00}=0\) and \(F_{11} \ge 2\) then under \(P_{12}\) the potential drift is non-positive.
Proof of Lemma 16
By Lemma 13,
The above is 0 for \(i \ge n_{10}\), hence the claim is trivial for \(n_{10} \le 2\). We assume \(n_{10} \ge 3\) in the following.
The claim is equivalent to
Multiplying by \(36 \cdot 2^{2n_{10}}\) and spelling out the binomial distributions involving S yields
Dividing both sides by \(\left( {\begin{array}{c}2n_{10}\\ n_{10}+i\end{array}}\right) \) yields
Note that this is equivalent to
which holds trivially for \(i=n_{10}\). For \(i=n_{10}-1\), the above simplifies to
which is true for \(n_{10} \ge 3\). For \(i \le n_{10}-2\), we have
Now the claim follows if we can show that the above fraction is at least 11/6.
where in the last two inequalities we have used \(i \ge 2\) and \(n_{10} \ge 3\).
1.3 Proof of Lemma 18
Proof of Lemma 18
We investigate the formula
that follows from Lemma 14 for \(F_{00}=0\). For \(n_{10}<12\) the claim is verified numerically for all pairs \(n_{01}<n_{10}\), see Table 3. We now turn to an analytical argument for \(n_{10}\ge 12\).
Since the first term of the rhs. in (13) is clearly negative, we omit it in the following. We use our assumption \(n_{10}\ge 12\). The aim is to compare different probabilities occurring in (13) to each other. More precisely, we want to show the claim that \(\mathord {{\mathrm {P}}}\mathord {\left( S > \ell _2\right) } \ge 2\mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1\right) }\). Along with the fact that the third term is clearly negative, we then have
since \(F_{11}\ge 0\).
Hence, we are left with the claim. We first assume \(n_{10}\ge n_{01}+2\) and treat the case \(n_{10}=n_{01}-1\) separately below. Note that nothing is to show if \(\ell _1 > n_{10}+n_{01}\) since then only negative terms have non-zero probability. Hence, we assume \(\ell _1\le n_{10}+n_{01}\) in the following. Since \(n_{01}\le n_{10}-2\) and \(F_{00}=0\), we have \(\ell _1 = n_{10} - F_{00} + F_{11} \ge (n_{10}+n_{01})/2 + 1\) and \(\ell _2 = n_{10} - F_{00} + F_{11} \le \ell _1 -2\). Since S follows a binomial distribution with parameters \(n_{10}+n_{01}\) and 1/2, we have shown \(\ell _1\ge \mathord {{\mathrm {E}}}\mathord {\left( S\right) }+1\). Using the well-known monotonicity of the binomial distribution, we have \(\mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1-1\right) } \ge \mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1\right) }\), and, since \(\mathord {{\mathrm {P}}}\mathord {\left( S>\ell _2\right) } \ge \mathord {{\mathrm {P}}}\mathord {\left( S = \ell _1\right) } + \mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1-1\right) }\), we obtain \(\mathord {{\mathrm {P}}}\mathord {\left( S>\ell _2\right) }\ge 2\mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1\right) }\) as claimed in the case \(n_{10}\ge n_{01}+2\).
If \(n_{10}=n_{01}+1\) we have to argue more carefully. A relatively straightforward subcase (for \(n_{10}\ge 12\)) is \(F_{11}=0\). Then \(\ell _2=n_{01}\) and therefore
by symmetry of the binomial distribution \(\mathrm {Bin}(n_{10}+n_{01}, 1/2)\). Also, for any \(k\in \{0,\dots ,n_{10}+n_{01}\}\), we have
according to Lemma 7 for \(n_{10}\ge 12\), which proves \(\mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1\right) }\le 1/4\) and, therefore, again \(\mathord {{\mathrm {P}}}\mathord {\left( S>\ell _2\right) }\ge 2\mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1\right) }\) as above.
We are left with \(n_{10}=n_{01}+1\) and \(F_{11}>0\). Here we take into account the first line of (13) and obtain due to \(\ell _1=\ell _2+1\) that
We will show that \(\Pr (S=\ell _1+1)/\Pr (S=\ell _1) = \alpha \) for some sufficiently large \(\alpha \) and obtain
The ratio \(\alpha \) will reflect the following trade-off: if \(F_{11}\) is small then \(\ell _1=n_{10}+F_{11}\) is close to the middle value of the binomial distribution with parameters \(N:=n_{10}+n_{01}\) and \(\mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1+1\right) }\) is not much smaller than \(\mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1\right) }\). Then the positive contribution of \(n_{01}/6\) can almost be compensated by the negative \(-\alpha n_{01}/6\). However, if \(F_{11}\) is big then \(\alpha \) must be relatively small due to the tail of the binomial distribution. Then we exploit that the negative terms involving \(F_{11}\) make the drift bound small.
We now consider several ranges for \(F_{11}\) to relate the drift and \(\alpha \). The first range is \(F_{11}\ge n_{01}/2\). Then have (even without bounding \(\alpha \)) (Table 4)
If \(F_{11}<n_{01}/2\) and \(F_{11}\ge 1\) then we have
for some \(k^*\ge 2\). Plugging the lower bound into (15), we have
The bracket equals
which becomes non-positive (and along with it also our drift bound) for
Hence, we are left with establishing this bound on \(\alpha \). Note that, since \(n_{01}=n_{10}-1\), the sum \(N:=n_{10}+n_{01}\) is odd, and it holds that
Using the representation \(\mathord {{\mathrm {P}}}\mathord {\left( S=k\right) }=\left( {\begin{array}{c}N\\ k\end{array}}\right) 2^{-N}\), and the identity
we have \(\mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1+1\right) } \ge \mathord {{\mathrm {P}}}\mathord {\left( S=\ell _1\right) } \frac{N/2-N/(2k^*)-1/2}{N/2+N/(2k^*)+1/2+1}\), hence
where we used \(N\ge 12\). Now,
for \(k^*\ge 1\), which proves \(\alpha \ge \frac{k^*-1}{k^*+5}\) as required. \(\square \)
1.4 Proof of Lemma 21
Lemma 21 is similar to Lemma 16, saying that the drift is non-positive if at least 2 1-bits flip. Strangely enough, this holds for all \(i \ge 2\), except for \(i=n_{10}-1\) where the drift is exponentially small in \(n_{10}\).
Proof of Lemma 21
Using Lemma 13,
Noting that the first fraction is negative and the second one positive because of \(i\ge 2\), the whole expression is non-positive if
We will now bound \(\alpha \) depending on i. It is easy to see that \(\frac{17-13i-17n_{10}}{36-24i-6n_{10}} < 1\) for \(i\ge n_{10}\) while trivially \(\alpha \ge 1\). This proves the statement for \(i\ge n_{10}\). Hence, we now focus on \(2\le i \le n_{10} - 2\) and write \(i=n_{10}/r\) for some \(r\in [n_{10}/(n_{10}-2),n_{10}/2]\). Then
We now use a similar argument as in the proof of Lemma 18. Since \(n_{10}+i = (1+1/r)n_{10}\) and S is based on a binomial distribution with \(2n_{10}\) trials and success probability 1/2, we obtain that
using the identity
and, similarly,
as well as
Hence, since \(i\le n_{10}-2\)
in other words \(\alpha \ge q\), with equality holding for \(i=n_{10}-2\).
We now show \(q - \frac{13+17r-17r/N}{24+6r-36r/N}\ge 0\) for the desired ranges of \(N:=n_{10}\) and i, which implies (17). Simple algebraic manipulations (supported by a computer algebra system) yield that
Clearly, the denominator of the last fraction is positive for \(N\ge 6\). We now prove that its numerator, hereinafter called \(\nu (r,N)\), is positive for \(N\ge 12\) and \(N/(N-2)\le r\le N/2\). To see this, we will compute its first and second partial derivative with respect to r:
and
If we evaluate \(\nu (r,N)\) and its first and second partial derivative at the minimum viable \(r=N/(N-2)\), we have
and
all of which are clearly non-decreasing for \(N\ge 2\) (since the respective lead coefficients are positive and all other coefficients negative) and verified as positive (with values 1144.6272, 626482.9440 and 2117468.1601, respectively) for all \(N\ge 12\), without claiming this to be tight in all three cases. Hence, if \(N\ge 12\), \(\nu (r,N)\) is positive at \(r=N/(N-2)\) and increases with r at this point. If the second partial derivative is positive for all \(r\in [N/(N-2),N/2]\) then \(\nu (r,N)\) is monotone increasing on that interval. Otherwise the second derivative, a quadratic function, will be monotone increasing until its maximum at some \(r\in [N/(N-2),N/2]\) and then monotone decreasing. In particular this means that \(\nu (r,N)\) has at most one root right of \(r=N/(N-2)\). We evaluate \(\nu (N/2,N)\), i. e., at the maximum r-value considered and obtain the polynomial
which is non-decreasing in N. Now, \(\nu (N/2,N) = 1144.6272 > 0\) for \(N=12\). Due to the monotonicity of \(\nu (N/2,N)\), this holds for all greater N as well, and together with the observations regarding monotonicity and roots in the interval \([N/(N-2),N/2]\), this concludes the proof that \(\nu (r,N)\ge 0\) for \(N\ge 12\) and \(N/(N-2)\le r\le N/2\), implying (17).
Finally, for \(i=n_{10}-1\) and \(n_{10} \ge 10\) the formula simplifies to
\(\square \)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Oliveto, P.S., Sudholt, D. & Witt, C. Tight Bounds on the Expected Runtime of a Standard Steady State Genetic Algorithm. Algorithmica 84, 1603–1658 (2022). https://doi.org/10.1007/s00453-021-00893-w
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-021-00893-w