Article

An Effective Optimization Method for Machine Learning Based on ADAM

1 Division of Creative Integrated General Studies, Daegu University College, Kyungsan 38453, Korea
2 Department of Mathematics, College of Natural Sciences, Chungnam National University, Daejeon 34134, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(3), 1073; https://doi.org/10.3390/app10031073
Submission received: 11 December 2019 / Revised: 18 January 2020 / Accepted: 25 January 2020 / Published: 5 February 2020
(This article belongs to the Special Issue Intelligent Processing on Image and Optical Information)

Abstract

A machine is taught by finding the minimum value of the cost function induced by the learning data. Unfortunately, as the amount of learning increases, along with the use of non-linear activation functions in the artificial neural network (ANN) and increasingly complex artificial intelligence structures, the non-convex complexity of the cost function also increases. We know that a non-convex function has local minima, and that the first derivative of the cost function is zero at a local minimum. Therefore, methods based on gradient descent optimization do not undergo further change when they fall into a local minimum, because they are based on the first derivative of the cost function. This paper introduces a novel optimization method to make machine learning more efficient. In other words, we construct an effective optimization method for a non-convex cost function. The proposed method solves the problem of falling into a local minimum by adding the cost function to the parameter update rule of the ADAM method. We prove the convergence of the sequences generated by the proposed method and demonstrate its superiority by numerical comparisons with gradient descent (GD), ADAM, and AdaMax.

1. Introduction

Machine learning is a field of computer science that gives computer systems the ability to learn from data without being explicitly programmed. For a machine to learn from data, a cost function is introduced, and learning amounts to minimizing it. The cost function is mostly made up of the difference between the true values and the values calculated from the Artificial Neural Network (ANN) [1,2,3,4,5,6,7]. Therefore, the cost function varies with the amount of training data, the non-linear activation functions in the ANN, and the structure of the ANN. These changes generate both singularities and local minima within the cost function. At such a point (a singularity or a local minimum), a differentiable cost function has a first derivative of zero.
Gradient-based optimization (gradient descent optimization) is widely used to find the minimum value of the cost function [8,9,10,11,12,13,14,15]. Gradient descent (GD) was the first such method to be introduced; it uses a fixed-point iteration to make the first derivative of the cost function zero. This method works reasonably well; however, it causes many difficulties in complex ANNs. To overcome this difficulty, a method called ADAM (Adaptive moment estimation) [15] was introduced, which adds the momentum idea and controls the size of each step. In other words, this method uses a weighted sum of the gradient values calculated in the past, which is the idea of momentum; this is called the 'first momentum'. The sum of the squares of the gradients is calculated in the same way; this is called the 'second momentum'. The ratio of the first and second momentum values is then computed, and the minimum value is searched for according to this ratio. More detailed information can be found in [16,17,18,19,20,21,22]. This method is still widely used and works well in most areas. There is a more advanced method called AdaMax (Adaptive moment estimation with Maximum). The AdaMax method uses a maximum in the calculation of the second momentum part of ADAM, which provides a more stable method. Various other methods exist. We are particularly interested in GD, ADAM, and AdaMax, because GD was the first to be introduced, ADAM is still widely used, and AdaMax is a modification of ADAM. It has been empirically observed that these algorithms can fail to converge to an optimal solution [23]. We numerically analyze the most basic GD method, the most widely used ADAM method, and the modified AdaMax method. As a result of these numerical analyses, we introduce the proposed method. The existing methods based on gradient descent operate by changing the parameter so that the first derivative of the cost function becomes zero, which results in finding the minimum value of the cost function. For this approach to work, it is assumed that the cost function has a convex property [24,25]. However, the first derivative of the cost function is also zero at a local minimum. Therefore, these existing methods may converge to a local minimum when the cost function has one. If the value of the cost function at the local minimum is not as small as desired, the value of the parameter should keep changing. To solve this problem, we formulate the optimization problem using both the first derivative of the cost function and the cost function itself. In this way, if the value of the cost function is non-zero, the parameter changes even when the first derivative of the cost function becomes zero. This is the reason for adding the cost function. Here, we also use the idea of the ADAM method by using these quantities in the parameter updates, and we also demonstrate the convergence of the generated sequence.
In summary, our research question is why neural networks are often poorly trained by known optimization methods. Our research goal is to find a new optimization method that resolves this phenomenon. To this end, we first prove the convergence of the new method. Next, we use simple cost functions to visualize the movements of our method and the basic methods near a local minimum. Then, we compare the performance of our method and ADAM on practical datasets such as MNIST and CIFAR10.
This paper is organized as follows. In Section 2, the definition of the cost function and the cause of its non-convexity are explained. In Section 3, we briefly describe the known gradient-descent-based algorithms. In particular, we introduce the original GD (Gradient Descent) method, the most widely used ADAM method, and, finally, the improved AdaMax method. In Section 4, we explain the proposed method, its convergence, and other conditions. In Section 5, we present several numerical comparisons between the proposed method and the discussed methods. The first case is a numerical comparison on a one-variable non-convex function. We then perform numerical comparisons on non-convex functions in a two-dimensional space. The third case is a numerical comparison between the four methods on two-dimensional region separation. Finally, we test with MNIST (Modified National Institute of Standards and Technology) and CIFAR10 (The Canadian Institute for Advanced Research), the most basic examples of image analysis. MNIST is a classification task with 10 classes of grayscale images, as seen in Figure 1, and this experiment shows that our method is also efficient at analyzing images (https://en.wikipedia.org/wiki/MNIST_database). CIFAR10 is a dataset with 10 classes like MNIST, but CIFAR10 consists of RGB images; therefore, CIFAR10 requires more computation than MNIST. Through numerical comparisons, we confirm that the proposed method is more efficient than the existing methods described in this paper. In Section 6, we present the conclusions and future work.

2. Cost Function

In this section, we explain basic machine learning among the various modified machine learning algorithms. For convenience, 'machine learning' below refers to this basic machine learning. To understand the working principle of machine learning, we start from a structure with one neuron. Let x be the input data and H(x) be the output data, which is obtained by
H(x) = σ(wx + b),
where w is a weight, b is a bias, and σ is a sigmoid function (it is universally called an activation function, and various functions can be used). Therefore, the result of the function H is a value between 0 and 1. Non-linearity is added to the cost function by using the non-linear activation function. For machine learning, let LS be the set of learning data and let l (> 2) be its size. In other words, when the index of the learning data starts at 1, the learning dataset is LS = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l)}, where each x_s is a real number and each y_s is a value between 0 and 1. From LS, we can define a cost function as follows:
C(w, b) = (1/l) Σ_{s=1}^{l} (y_s − H(x_s))^2.
Machine learning is completed by finding the w and b that attain the minimum value of the cost function. Unfortunately, the cost function has several local minima, because it is not convex. Furthermore, deepening the structure of an ANN means that the activation function is composed many times, which increases the number of local minima of the cost function. More complete interpretations and analysis are currently being carried out in more detail and will be reported soon.
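To make the definitions above concrete, the following short sketch (in Python, assuming NumPy) evaluates H(x) = σ(wx + b) and the cost C(w, b) on a small, made-up learning set; the data values are illustrative only and are not taken from the paper.

```python
import numpy as np

def H(x, w, b):
    """Single-neuron output: sigmoid activation applied to the affine map wx + b."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def cost(w, b, xs, ys):
    """Mean squared error over the learning set LS = {(x_s, y_s)}."""
    return np.mean((ys - H(xs, w, b)) ** 2)

# Illustrative learning set (hypothetical values, not from the paper).
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ys = np.array([0.1, 0.2, 0.5, 0.8, 0.9])

print(cost(0.5, 0.0, xs, ys))
```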

3. Learning Methods

In this section, we provide a brief description of the well-known optimization methods—GD, ADAM, and AdaMax.

3.1. Gradient Descent Method

GD is the most basic method and was the first to be introduced. In GD, a fixed-point iteration is applied using the first derivative of the cost function. The parameter is changed in each iteration as follows:
w_{i+1} = w_i − η ∂C(w_i)/∂w.
The pseudocode version of this method is as follows in Algorithm 1.
Algorithm 1: Pseudocode of Gradient Descent Method.
η: Learning rate
C(w): Cost function with parameters w
w_0: Initial parameter vector
i ← 0 (Initialize time step)
while w not converged do
    i ← i + 1
    w_{i+1} ← w_i − η · C_w(w_i)
end while
return w_i (Resulting parameters)
As the above formula shows, if the gradient is zero, the parameter does not change, so the iteration stops at a local minimum and cannot move beyond it.
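As a minimal illustration of Algorithm 1, the sketch below applies gradient descent to a scalar cost; the central-difference gradient and the fixed iteration count are our own conveniences, not part of the paper.

```python
def gradient_descent(cost, w0, eta=0.1, iters=100, h=1e-6):
    """Plain gradient descent for a scalar parameter with a numerical gradient."""
    w = w0
    for _ in range(iters):
        grad = (cost(w + h) - cost(w - h)) / (2 * h)  # approximate dC/dw at w
        w = w - eta * grad                            # w_{i+1} = w_i - eta * dC/dw
    return w

# Example: the one-variable cost used later in Section 5.1.
C = lambda w: (w + 5) * (w + 3) * (w - 1) * (w - 10) / 800 + 3
print(gradient_descent(C, w0=-9.0, eta=0.2))  # settles near the local minimum around -4
```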

3.2. ADAM Method

The ADAM method is the most widely used method; it is based on the GD method and the momentum method and, additionally, adapts the size of each step. The first momentum is obtained by
m_i = β_1 m_{i−1} + (1 − β_1) ∂C/∂w.
The second momentum is obtained by
v_i = β_2 v_{i−1} + (1 − β_2) (∂C/∂w)^2.
The parameter is then updated by
w_{i+1} = w_i − η m̂_i/(√(v̂_i) + ϵ),
where m̂_i = m_i/(1 − β_1) and v̂_i = v_i/(1 − β_2). The pseudocode version of this method is as follows in Algorithm 2.
Algorithm 2: Pseudocode of ADAM Method.
η: Learning rate
β_1, β_2 ∈ [0, 1): Exponential decay rates for the moment estimates
C(w): Cost function with parameters w
w_0: Initial parameter vector
m_0 ← 0
v_0 ← 0
i ← 0 (Initialize time step)
while w not converged do
    i ← i + 1
    m_i ← β_1 · m_{i−1} + (1 − β_1) · C_w(w_i)
    v_i ← β_2 · v_{i−1} + (1 − β_2) · C_w(w_i)^2
    m̂_i ← m_i / (1 − β_1^i)
    v̂_i ← v_i / (1 − β_2^i)
    w_{i+1} ← w_i − η · m̂_i / (√(v̂_i) + ϵ)
end while
return w_i (Resulting parameters)
The ADAM method is a first-order method; thus, it has low time complexity. As the parameter updates are repeated, the effective step size becomes smaller due to the influence of m̂_i/(√(v̂_i) + ϵ), and the parameter varies slowly around the global minimum.
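A minimal scalar version of Algorithm 2, assuming a user-supplied gradient function grad(w) = dC/dw; this is a sketch for illustration rather than the authors' implementation.

```python
def adam(grad, w0, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
    """ADAM update for a scalar parameter w."""
    w, m, v = w0, 0.0, 0.0
    for i in range(1, iters + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first momentum
        v = beta2 * v + (1 - beta2) * g * g    # second momentum
        m_hat = m / (1 - beta1 ** i)           # bias corrections
        v_hat = v / (1 - beta2 ** i)
        w = w - eta * m_hat / (v_hat ** 0.5 + eps)
    return w
```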

3.3. AdaMax Method

The AdaMax method is based on the ADAM method and uses a maximum in the calculation of the second momentum:
u_i = max(β_2 · u_{i−1}, |C_w(w_i)|),
and the parameter is updated by
w_{i+1} = w_i − η m̂_i/u_i.
The pseudocode version of this method is as follows in Algorithm 3.
Algorithm 3: Pseudocode of AdaMax.
η: Learning rate
β_1, β_2 ∈ [0, 1): Exponential decay rates for the moment estimates
C(w): Cost function with parameters w
w_0: Initial parameter vector
m_0 ← 0
u_0 ← 0
i ← 0 (Initialize time step)
while w not converged do
    i ← i + 1
    m_i ← β_1 · m_{i−1} + (1 − β_1) · C_w(w_i)
    u_i ← max(β_2 · u_{i−1}, |C_w(w_i)|)
    w_{i+1} ← w_i − (η / (1 − β_1^i)) · m_i / u_i
end while
return w_i (Resulting parameters)
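The corresponding scalar sketch of Algorithm 3; only the second-momentum line and the update step change relative to ADAM.

```python
def adamax(grad, w0, eta=0.01, beta1=0.9, beta2=0.999, iters=1000):
    """AdaMax update for a scalar parameter w, given grad(w) = dC/dw."""
    w, m, u = w0, 0.0, 0.0
    for i in range(1, iters + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g   # first momentum
        u = max(beta2 * u, abs(g))        # infinity-norm style second momentum
        # A tiny constant (not in the pseudocode) guards against division by zero.
        w = w - (eta / (1 - beta1 ** i)) * m / (u + 1e-12)
    return w
```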

4. The Proposed Method

The main idea starts from a fixed-point iteration method, the condition on the cost function (0 ≤ C(w, b)), and the condition on the first derivative of the cost function. We define an auxiliary function H as
H(w) = λ C(w) + ∂C(w)/∂w,
where λ is chosen to be positive or negative according to the initial sign of ∂C(w)/∂w. Using H(w), we make w keep changing whenever the value of C(w) is still large, even if the iterate has fallen into a local minimum.

Optimization

The iteration method is
w_{i+1} = w_i − η m̂_i/(√(v̂_i) + ϵ),     (1)
where
m_i = β_1 m_{i−1} + (1 − β_1) H(w_i),   v_i = β_2 v_{i−1} + (1 − β_2) H(w_i)^2,     (2)
m̂_i = m_i/(1 − β_1), and v̂_i = v_i/(1 − β_2).
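A scalar sketch of the proposed update, following (1) and (2): the only change relative to ADAM is that the gradient is replaced by the auxiliary function H(w) = λC(w) + ∂C/∂w, and the constants (1 − β_1) and (1 − β_2) are used for the bias correction as written above. The numerical gradient and the function name are our own conveniences.

```python
def proposed_method(cost, w0, eta=0.1, beta1=0.9, beta2=0.999,
                    lam=-0.5, eps=1e-8, iters=1000, h=1e-6):
    """ADAM-style update driven by H(w) = lam * C(w) + dC/dw (scalar sketch)."""
    w, m, v = w0, 0.0, 0.0
    for _ in range(iters):
        grad = (cost(w + h) - cost(w - h)) / (2 * h)
        Hw = lam * cost(w) + grad              # auxiliary function H(w)
        m = beta1 * m + (1 - beta1) * Hw       # first momentum, Equation (2)
        v = beta2 * v + (1 - beta2) * Hw * Hw  # second momentum, Equation (2)
        m_hat = m / (1 - beta1)                # as defined after Equation (2)
        v_hat = v / (1 - beta2)
        w = w - eta * m_hat / (v_hat ** 0.5 + eps)  # Equation (1)
    return w
```

The sign of λ would be chosen from the initial sign of ∂C/∂w, as described above; the default value here is only a placeholder.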
Theorem 1.
The iteration method in (1) generates a convergent sequence.
Proof. 
Equation (2) can be altered to
m_i = β_1^i m_0 + (1 − β_1) Σ_{k=1}^{i} β_1^{k−1} H(w_{i−k+1}) = (1 − β_1) Σ_{k=1}^{i} β_1^{k−1} H(w_{i−k+1})
and
v_i = β_2^i v_0 + (1 − β_2) Σ_{k=1}^{i} β_2^{k−1} H(w_{i−k+1})^2 = (1 − β_2) Σ_{k=1}^{i} β_2^{k−1} H(w_{i−k+1})^2
under m_0 = 0 and v_0 = 0. Therefore, Equation (1) can be altered to
w_{i+1} = w_i − η (√(1 − β_2)/(1 − β_1)) × [(1 − β_1) Σ_{k=1}^{i} β_1^{k−1} H(w_{i−k+1})] / [√((1 − β_2) Σ_{k=1}^{i} β_2^{k−1} H(w_{i−k+1})^2) + ϵ]
        = w_i − η [Σ_{k=1}^{i} β_1^{k−1} H(w_{i−k+1})] / [√(Σ_{k=1}^{i} β_2^{k−1} H(w_{i−k+1})^2) + ϵ]
        = w_i − η S_i/(√(SS_i) + ϵ),     (3)
where
S_i = Σ_{k=1}^{i} β_1^{k−1} H(w_{i−k+1})  and  SS_i = Σ_{k=1}^{i} β_2^{k−1} H(w_{i−k+1})^2.
Here, ϵ is introduced to exclude the case of dividing by 0, and it is taken to be 0 except when it is needed to avoid division by 0. As a result of a simple calculation, the following relation can be obtained:
(S_i)^2 ≤ [(1 − (β_1^2/β_2)^i)/(1 − β_1^2/β_2)] SS_i.
For more information on the relation between S_i and SS_i, we explain the calculation process in Corollary 1. From Equation (3), we have
w_{i+1} = w_i − η S_i/(√(SS_i) + ϵ).
Using the relation between S_i and SS_i,
|w_{i+1} − w_i| ≤ η √((1 − (β_1^2/β_2)^i)/(1 − β_1^2/β_2)).
Here, since ϵ is only there to prevent division by zero, it can be considered to be zero in the calculation. After a sufficiently large number τ, for a sufficiently small value η, and using Taylor's theorem, we can obtain
H(w_{i+1}) ≤ H(w_i) + |∂H/∂w| |w_{i+1} − w_i| ≤ H(w_i) + η |∂H/∂w| √((1 − (β_1^2/β_2)^i)/(1 − β_1^2/β_2)) ≤ H(w_i) + ξ,
where
ξ = η |∂H/∂w| √((1 − (β_1^2/β_2)^i)/(1 − β_1^2/β_2))
and is negligibly small. This is possible because we set the value of η to be very small.
Looking at the relationship between S_i and S_{i−1}, and using (4),
S_i = H(w_i) + β_1 S_{i−1} ⟹ S_i − H(w_i)/(1 − β_1) = β_1 (S_{i−1} − H(w_{i−1})/(1 − β_1)) − (β_1/(1 − β_1)) ξ,
so that
S_i − H(w_i)/(1 − β_1) + (β_1/(1 − β_1)^2) ξ = β_1 (S_{i−1} − H(w_{i−1})/(1 − β_1) + (β_1/(1 − β_1)^2) ξ).
We have
S_i − H(w_i)/(1 − β_1) + (β_1/(1 − β_1)^2) ξ = β_1^i (S_0 − H(w_0)/(1 − β_1) + (β_1/(1 − β_1)^2) ξ).
Therefore,
S_i = H(w_i)/(1 − β_1) − (β_1/(1 − β_1)^2) ξ + β_1^i (S_0 − H(w_0)/(1 − β_1) + (β_1/(1 − β_1)^2) ξ).
If the initial condition S_0 is defined as a fairly large value, the following relation can be obtained:
S_i/S_{i−1} = [H(w_i)/(1 − β_1) − γ_1 + γ_2 β_1^i] / [H(w_{i−1})/(1 − β_1) − γ_1 + γ_2 β_1^{i−1}] ≈ β_1,
where
γ_1 = (β_1/(1 − β_1)^2) ξ,   γ_2 = S_0 − H(w_0)/(1 − β_1) + (β_1/(1 − β_1)^2) ξ.
Through a similar process, we have
SS_i = H(w_i)^2 + β_2 SS_{i−1} ⟹ SS_i − H(w_i)^2/(1 − β_2) − δ = β_2 (SS_{i−1} − H(w_{i−1})^2/(1 − β_2) − δ),
where δ = (2 ξ H(w_{i−1}) + ξ^2)/(1 − β_2) and can be treated as constant because of the sufficiently small value of ξ. Hence,
SS_i = H(w_i)^2/(1 − β_2) + δ + β_2^i (SS_0 − H(w_0)^2/(1 − β_2) − δ).
If the initial condition SS_0 is defined as a fairly large value, the following relation can be obtained:
SS_i/SS_{i−1} = [H(w_i)^2/(1 − β_2) + δ + γ_3 β_2^i] / [H(w_{i−1})^2/(1 − β_2) + δ + γ_3 β_2^{i−1}] ≈ β_2,
where
γ_3 = SS_0 − H(w_0)^2/(1 − β_2) − δ.
Assuming that the initial values S_0 and SS_0 are sufficiently large, Equation (3) gives
|w_{i+1} − w_i| / |w_i − w_{i−1}| = (S_i/S_{i−1}) (SS_{i−1}/SS_i)^{1/2} ≈ β_1/β_2^{1/2} < 1.
For w_i to converge (to be a Cauchy sequence), the following condition should be satisfied:
|w_{i+1} − w_i| / |w_i − w_{i−1}| ≤ γ < 1.
To satisfy this condition, the larger the value of β_2, the better. However, if β_2 is larger than 1, the value of v_i can become negative and we would have to compute complex values. As a result, β_2 should be as close to 1 as possible. Conversely, the smaller the value of β_1, the better. However, if β_1 is small, w_i converges quickly and changes only slightly, so the convergence value that we want cannot be achieved. As a result, β_1 should also be as close to 1 as possible. Therefore, it is best to choose the parameters in the range β_1^2 ≤ β_2. Generally, β_1 is 0.9 and β_2 is 0.999. After computing, we have |w_{i+1} − w_i| ≤ γ^{i−τ} |w_{τ+1} − w_τ|. As the iteration continues, γ^{i−τ} converges to zero. Therefore, after a sufficiently large number of iterations (greater than τ), w_{i+1} and w_i are equal.  □
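A quick arithmetic check of the condition discussed above: with the default values β_1 = 0.9 and β_2 = 0.999, the contraction ratio β_1/√β_2 from the proof is indeed below one.

```python
beta1, beta2 = 0.9, 0.999
ratio = beta1 / beta2 ** 0.5       # |w_{i+1} - w_i| / |w_i - w_{i-1}| in the proof
print(beta1 ** 2 <= beta2, ratio)  # True 0.90045...
```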
Corollary 1.
The relationship between S_i and SS_i is
(S_i)^2 ≤ [(1 − (β_1^2/β_2)^i)/(1 − β_1^2/β_2)] SS_i.
Proof. 
S_i = Σ_{k=1}^{i} β_1^{k−1} H(w_{i−k+1}) = H(w_i) + β_1 H(w_{i−1}) + ⋯ + β_1^{i−1} H(w_1)
    = H(w_i) + (β_1/β_2^{1/2}) β_2^{1/2} H(w_{i−1}) + ⋯ + (β_1^{i−1}/β_2^{(i−1)/2}) β_2^{(i−1)/2} H(w_1).
Using the general Cauchy–Schwarz inequality, we have
S_i^2 ≤ [1 + (β_1/β_2^{1/2})^2 + ⋯ + (β_1^{i−1}/β_2^{(i−1)/2})^2] × [H(w_i)^2 + β_2 H(w_{i−1})^2 + ⋯ + β_2^{i−1} H(w_1)^2]
     = [(1 − (β_1^2/β_2)^i)/(1 − β_1^2/β_2)] SS_i.
 □
Theorem 2.
The limit of w_i satisfies lim_{i→∞} H(w_i) = 0.
Proof. 
When the limit of w_i is w_*, using (5) and the continuity of H, the following is obtained:
w_* = w_* − η [Σ_{k=τ}^{*} β_1^{k−1} H(w_{*−k+1})] / [√(Σ_{k=τ}^{*} β_2^{k−1} H(w_{*−k+1})^2) + ϵ]
⟹ 0 = [Σ_{k=τ}^{*} β_1^{k−1} H(w_{*−k+1})] / [√(Σ_{k=τ}^{*} β_2^{k−1} H(w_{*−k+1})^2) + ϵ].
The effect of ϵ is to avoid a zero denominator, so the denominator of the above equation is not zero, and we obtain Σ_{k=τ}^{*} β_1^{k−1} H(w_{*−k+1}) = 0. Since β_1 < 1, β_1^2 < β_2, and w_i converges to w_*, assuming that * is a fairly large index, w_* and w_{*−1} are close, and β_1^k can be regarded as 0 for k > κ, where κ is an appropriately large integer. Therefore, we obtain
0 = H(w_*) + β_1 H(w_{*−1}) + β_1^2 H(w_{*−2}) + ⋯ + β_1^κ H(w_{*−κ}) ≈ H(w_*) (1 + β_1 + β_1^2 + ⋯ + β_1^κ) = H(w_*) (1 − β_1^{κ+1})/(1 − β_1).
The extreme value w_* of the sequence w_i is therefore a parameter value that brings the auxiliary function H close to zero.  □

5. Numerical Tests

In these numerical tests, we perform several experiments to show the novelty of the proposed method. The GD method, the ADAM method, the AdaMax method, and the proposed method are compared in each experiment. Please note that some techniques, such as batching, epochs, and dropout, are not included. The β_1 and β_2 used in ADAM and AdaMax are fixed at 0.9 and 0.999, respectively, and ϵ is set to 10^{−8}. These values are the defaults for β_1, β_2, and ϵ.

5.1. One Variable Non-Convex Function Test

Since the cost function has a non-convex property, we test each method with a simple non-convex function in this experiment. The cost function is C(w) = (w + 5)(w + 3)(w − 1)(w − 10)/800 + 3 and has its global minimum at w ≈ 7.1047. The starting point, w_0, is initialized at −9 and the iteration number is 100. The reason this cost function is divided by 800 is that, without this scaling, the degree-4 polynomial takes values that are too large and too far from a realistic problem.
Figure 2 shows the change in the cost function C(w) according to the change in w.
Figure 3 shows the iterations of the four methods over C(w) with w_0 = −9. In this experiment, GD, ADAM, and AdaMax fall into a local minimum, and it is confirmed that there is no further motion. On the other hand, the proposed method is confirmed to settle at the global minimum beyond the local minimum. Although the global minimum is near 7, the other methods stayed near −4.
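A sketch of this experiment, reusing the helper functions gradient_descent, adam, and proposed_method defined in the earlier sketches (they are our own illustrative code, not the authors'); the hyperparameters mirror those stated in the caption of Figure 3.

```python
C = lambda w: (w + 5) * (w + 3) * (w - 1) * (w - 10) / 800 + 3
dC = lambda w, h=1e-6: (C(w + h) - C(w - h)) / (2 * h)

w_gd = gradient_descent(C, w0=-9.0, eta=0.2, iters=100)
w_adam = adam(dC, w0=-9.0, eta=0.2, iters=100)
w_new = proposed_method(C, w0=-9.0, eta=0.2, beta1=0.95, beta2=0.9999,
                        lam=-1e-2, iters=100)
# Per Figure 3, GD and ADAM stay near the local minimum around -4,
# while the proposed method should move toward the global minimum near 7.1.
print(w_gd, w_adam, w_new)
```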

5.2. Two Variables Non-Convex Function Test

In this section, we experiment with three two-variable cost functions. The first and second experiments are to find the global minimum of the Beale function and the Styblinski–Tang function, respectively, and the third experiment is to test whether the proposed method works effectively at the saddle point.

5.2.1. Beale function

The Beale function is defined by C(w_1, w_2) = (1.5 − w_1 + w_1 w_2)^2 + (2.25 − w_1 + w_1 w_2^2)^2 + (2.625 − w_1 + w_1 w_2^3)^2 and has its global minimum at (3, 0.5). Figure 4 shows the Beale function.
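For reference, a sketch of the Beale function and its analytic gradient as used in this subsection; the gradient formulas are standard calculus rather than something quoted from the paper.

```python
def beale(w1, w2):
    return ((1.5 - w1 + w1 * w2) ** 2
            + (2.25 - w1 + w1 * w2 ** 2) ** 2
            + (2.625 - w1 + w1 * w2 ** 3) ** 2)

def beale_grad(w1, w2):
    t1 = 1.5 - w1 + w1 * w2
    t2 = 2.25 - w1 + w1 * w2 ** 2
    t3 = 2.625 - w1 + w1 * w2 ** 3
    dw1 = 2 * t1 * (w2 - 1) + 2 * t2 * (w2 ** 2 - 1) + 2 * t3 * (w2 ** 3 - 1)
    dw2 = 2 * t1 * w1 + 2 * t2 * (2 * w1 * w2) + 2 * t3 * (3 * w1 * w2 ** 2)
    return dw1, dw2

print(beale(3.0, 0.5))  # 0.0 at the global minimum (3, 0.5)
```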
Figure 5 shows the results of each method. GD's learning rate was set to 10^{−4}, which is different from the other methods, because with a larger rate GD has a very large gradient and the weights diverge. We confirm that all methods converge well, because this function is convex around the given starting point.
Table 1 shows the weight change of each method according to the change of iteration. As the table shows, the proposed method approaches the global minimum (3, 0.5) most closely and therefore shows the best performance.
To see the results from another starting point, we experiment with the same cost function using a different starting point (−4, 4), hyperparameter λ = 0.2, and an iteration number of 50,000.
Figure 6 shows the results of each method. In this experiment, we confirm that GD, ADAM, and AdaMax fall into a local minimum and stop there, whereas the proposed method reaches the global minimum effectively.

5.2.2. Styblinski–Tang function

The Styblinski–Tang function is defined by
C(w_1, w_2) = [(w_1^4 − 16 w_1^2 + 5 w_1) + (w_2^4 − 16 w_2^2 + 5 w_2)]/2 + 80
and has its global minimum at (−2.903534, −2.903534).
Figure 7 shows the Styblinski–Tang function. In this experiment, we present a result with the starting point W_0 = (6, 0). Please note that a local minimum point (2.7468, −2.9035) is located between W_0 and the global minimum point.
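A sketch of the shifted Styblinski–Tang function used here; with the +80 offset, the value at the global minimum (−2.903534, −2.903534) is approximately 1.67.

```python
def styblinski_tang(w1, w2):
    f1 = w1 ** 4 - 16 * w1 ** 2 + 5 * w1
    f2 = w2 ** 4 - 16 * w2 ** 2 + 5 * w2
    return (f1 + f2) / 2 + 80   # +80 shift as in the definition above

def styblinski_tang_grad(w1, w2):
    return ((4 * w1 ** 3 - 32 * w1 + 5) / 2,
            (4 * w2 ** 3 - 32 * w2 + 5) / 2)

print(styblinski_tang(-2.903534, -2.903534))  # about 1.67 at the global minimum
print(styblinski_tang(2.7468, -2.9035))       # higher value at the nearby local minimum
```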
Figure 8 shows the results of each method. Only the proposed method finds the global minimum; the other methods could not avoid a local minimum.

5.2.3. Function with a Saddle Point

The cost function C(w_1, w_2) = w_2^2 − w_1^2 + 2 is shown in Figure 9. The Hessian matrix of this cost function is [[−2, 0], [0, 2]], so this cost function has a saddle point at (0, 0).
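A small check of the saddle-point geometry stated above, using standard calculus: the gradient of C vanishes at (0, 0), and the Hessian has one negative and one positive eigenvalue.

```python
import numpy as np

grad = lambda w1, w2: np.array([-2.0 * w1, 2.0 * w2])  # gradient of w2^2 - w1^2 + 2
hessian = np.array([[-2.0, 0.0], [0.0, 2.0]])

print(grad(0.0, 0.0))               # [ 0.  0.] -> stationary point at the origin
print(np.linalg.eigvalsh(hessian))  # [-2.  2.] -> saddle point, not a minimum
```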
In this experiment, we present results based on two different starting points. The first starting point is W_0 = (0.001, 0.2).
Figure 10 and Table 2 show the results of each method. In this experiment, we see that the proposed method changes the parameters more rapidly than the other three methods near the saddle point.
Then, we perform an experiment with the same cost function but with another starting point, (0, 0.01), and with an iteration number of 100. The hyperparameters in this experiment are the same as above; the only thing that was changed is the starting point.
Figure 11 shows the result of this experiment with an initial point of (0, 0.01). In this experiment, since one of the coordinates of the initial value is 0, the other methods do not work.

5.3. Two-Dimensional Region Segmentation Test

To see how each method divides the area in a more complicated situation, we introduce a problem of region separation in the form of a whirlwind shape. The values 0 and 1 are assigned along the shape of the whirlwind, and the regions are divided by learning with each method. Table 3b shows the pseudocode for generating the learning dataset for this experiment, and Table 3a is a visualization of the dataset.
To solve this problem, we use a 2-layer neural network with 25 nodes. First, the neural network is trained by each method. After that, [−2, 2] × [−2, 2] is divided into 60 equal sections in each direction, and each of the 3600 coordinates is checked to decide the part of the region to which it belongs.
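A sketch of this setup under our own assumptions about unspecified details (activation functions, weight initialization, and the reading of "25 nodes"): it builds the whirlwind learning set of Table 3b and a small 2-layer network whose decision over the 60 × 60 grid can then be trained with any of the optimizers above.

```python
import numpy as np

# Whirlwind learning set LS, following the pseudocode in Table 3b.
X, y = [], []
for label, phase in ((1, 0.0), (0, np.pi)):
    for i in range(1, 31):
        r, t = 5 * i / 30, 5 * np.pi * i / 30 + phase
        X.append([r * np.sin(t), r * np.cos(t)])
        y.append(label)
X, y = np.array(X), np.array(y)

# Small 2-layer network with 25 hidden units and sigmoid activations (our assumption).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 25)), np.zeros(25)
W2, b2 = rng.normal(size=(25, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def predict(points):
    """Forward pass; returns values in (0, 1) that are thresholded into the two regions."""
    return sigmoid(sigmoid(points @ W1 + b1) @ W2 + b2).ravel()

# Evaluation grid: 60 x 60 points (3600 coordinates) over [-2, 2] x [-2, 2].
g = np.linspace(-2.0, 2.0, 60)
grid = np.array([[a, b] for a in g for b in g])
regions = (predict(grid) > 0.5).astype(int)  # region assignment for each grid point
```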
The results are presented in Figure 12. Compared with the other schemes, the proposed method divides the region along the direction of the whirlwind more faithfully.
Although the same artificial neural network structure was used, different results were obtained for each method, which is related to how accurately each method learns the locations of the data.

5.4. MNIST with CNN

In the previous sections, we have seen simple classification problems using an ANN. Here, we try a more practical experiment using a CNN (Convolutional Neural Network). As is well known, ANNs with more layers are used to solve more difficult and complex problems. However, as the number of layers increases, the number of parameters also increases, and the memory and running time on the computer likewise increase. For this reason, the CNN was introduced, and its performance was demonstrated at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [11]. Therefore, in this experiment, we want to see how our method works with the CNN structure. TensorFlow was used, and, for the fairness of the experiment, we used the MNIST problem with a CNN, which is one of the basic examples provided in TensorFlow (see https://github.com/tensorflow/models/tree/master/tutorials/image/mnist). The experiment ran for 8600 iterations and recorded the minibatch cost and validation error of each method. The results are shown in Figure 13. As shown in Figure 13, all methods were found to reduce both cost and error.
For further investigation, we magnify the data from Figure 13 from the 6000th iteration, and show it in Figure 14.
Figure 14a shows that GD converges the slowest and the proposed method is more accurate than ADAM. Figure 14b shows the accuracies of the proposed method and the ADAM method, as measured on the test data. From this, we see that the proposed method works better for the classification of images than ADAM.
Figure 15 shows four sample digits from the MNIST test data that the proposed method classifies effectively but that ADAM failed to classify in this experiment.

5.5. CIFAR-10 with RESNET

We showed in Section 5.4 that the proposed method is more effective than ADAM in the CNN structure using the MNIST dataset. Since the MNIST dataset consists of grayscale images, the size of one image is 28 × 28 × 1. In this section, we show that the proposed method is more effective than ADAM when using a dataset larger than MNIST, such as CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html). CIFAR10 is a popular dataset for classifying 32 × 32 × 3 RGB images into 10 categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). This dataset consists of 50,000 training images and 10,000 test images. In this experiment, as in Section 5.4, we used the example code (the same model and the same hyperparameter settings) provided by TensorFlow (see https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator). In Section 5.4, we used a simple network based on a CNN, but in this subsection, we use a more complex and higher-performance network called RESNET (Residual Network), which is based on the CNN. RESNET is a well-known network with good performance in the ILSVRC, and we used RESNET 44 in this experiment.
Figure 16 shows the results of training on the CIFAR10 dataset using the RESNET structure, compared with ADAM. In Figure 16a, the training cost of each method was calculated and plotted every 100 iterations. In Figure 16b, the validation cost of each method was calculated and plotted every 5000 iterations. This shows that the proposed method works effectively for more complex networks as well as simple networks.

6. Conclusions

In this paper, we propose a new optimization method based on ADAM. Many of the existing optimization methods (including ADAM) may not work properly depending on the initial point. However, the proposed method finds the global minimum better than the other methods, even if there is a local minimum near the starting point, and has better overall performance. We tested our method only on models for image datasets such as MNIST and CIFAR10. Our future work is to test our method on various models, such as RNN models for time series prediction and various models for natural language processing.

Author Contributions

Conceptualization, D.Y.; Data curation, S.J.; Formal analysis, D.Y. and J.A.; Funding acquisition, D.Y.; Investigation, D.Y.; Methodology, D.Y. and J.A.; Project administration, D.Y.; Resources, S.J.; Software, S.J.; Supervision, D.Y. and J.A.; Validation, J.A. and S.J.; Visualization, S.J.; Writing—original draft, D.Y.; Writing—review & editing, S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (grant number NRF-2017R1E1A1A03070311). The second author, Ahn, was also supported by the research fund of Chungnam National University.

Acknowledgments

We sincerely thank anonymous reviewers whose suggestions helped improve and clarify this manuscript greatly.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Deng, L.; Li, J.; Huang, J.; Yao, K.; Yu, D.; Seide, F.; Seltzer, M.L.; Zweig, G.; He, X.; Williams, J.; et al. Recent advances in deep learning for speech research at microsoft. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing—ICASSP 2013, Vancouver, BC, Canada, 26–31 May 2013. [Google Scholar]
  2. Graves, A. Generating sequences with recurrent neural networks. arXiv 2013, arXiv:1308.0850. [Google Scholar]
  3. Graves, A.; Mohamed, A.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 International Conference on Acoustics, Speech and Signal Processing—ICASSP 2013, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar]
  4. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580. [Google Scholar]
  5. Tieleman, T.; Hinton, G.E. Lecture 6.5—RMSProp, COURSERA: Neural Networks for Machine Learning; Technical Report; COURSERA: Mountain View, CA, USA, 2012. [Google Scholar]
  6. Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Le, Q.; Mao, M.; Ranzato, M.; Senior, A.; Tucker, P.; et al. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems—NIPS 2012, Lake Tahoe, NV, USA, 3 December 2012; pp. 1223–1231. [Google Scholar]
  7. Jaitly, N.; Nguyen, P.; Senior, A.; Vanhoucke, V. Application of pretrained deep neural networks to large vocabulary speech recognition. In Proceedings of the INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association—ISCA 2012, Portland, OR, USA, 9–13 September 2012; pp. 2578–2581. [Google Scholar]
  8. Amari, S. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
  9. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  10. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Process. Mag. 2012, 29, 82–97. [Google Scholar] [CrossRef]
  11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Neural Information Processing Systems–NIPS 2012, Lake Tahoe, NV, USA, 3 December 2012; pp. 1097–1105. [Google Scholar]
  12. Pascanu, R.; Bengio, Y. Revisiting natural gradient for deep networks. arXiv 2013, arXiv:1301.3584. [Google Scholar]
  13. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G.E. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning—ICML 2013, Atlanta, GA, USA, 16–21 Jun 2013; PMLR 28(3). pp. 1139–1147. [Google Scholar]
  14. Kochenderfer, M.; Wheeler, T. Algorithms for Optimization; The MIT Press Cambridge: London, UK, 2019. [Google Scholar]
  15. Kingma, D.P.; Ba, J. ADAM: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations—ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  16. Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev. 2018, 60, 223–311. [Google Scholar] [CrossRef]
  17. Roux, N.L.; Fitzgibbon, A.W. A fast natural newton method. In Proceedings of the 27th International Conference on Machine Learning—ICML 2010, Haifa, Israel, 21–24 June 2010; pp. 623–630. [Google Scholar]
  18. Sohl-Dickstein, J.; Poole, B.; Ganguli, S. Fast large-scale optimization by unifying stochastic gradient and quasi-newton methods. In Proceedings of the 31st International Conference on Machine Learning—ICML 2014, Beijing, China, 21–24 June 2014; pp. 604–612. [Google Scholar]
  19. Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701. [Google Scholar]
  20. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  21. Becker, S.; LeCun, Y. Improving the Convergence of Back-Propagation Learning with Second Order Methods; Technical Report; Department of Computer Science, University of Toronto: Toronto, ON, Canada, 1988. [Google Scholar]
  22. Kelley, C.T. Iterative methods for linear and nonlinear equations. In Frontiers in Applied Mathematics; SIAM: Philadelphia, PA, USA, 1995; Volume 16. [Google Scholar]
  23. Reddi, S.J.; Kale, S.; Kumar, S. On the Convergence of ADAM and Beyond. arXiv 2019, arXiv:1904.09237. [Google Scholar]
  24. Ruppert, D. Efficient Estimations from a Slowly Convergent Robbins-Monro Process; Technical Report; Cornell University Operations Research and Industrial Engineering: Ithaca, NY, USA, 1988. [Google Scholar]
  25. Zinkevich, M. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Proceedings of the Twentieth International Conference on Machine Learning—ICML-2003, Washington, DC, USA, 21–24 August 2003; pp. 928–935. [Google Scholar]
Figure 1. Examples of the MNIST dataset.
Figure 2. Cost function with one local minimum.
Figure 3. Comparison with the other schemes. The learning rate is 0.2, and the β_1, β_2, and λ used in the proposed method are 0.95, 0.9999, and −10^{−2}, respectively.
Figure 4. Beale function.
Figure 5. Result of the Beale function with an initial point of W_0 = (2, 2); the iteration number is 1000. The learning rate is 0.1, and the β_1, β_2, and λ used in the proposed method are 0.9, 0.999, and −0.5, respectively. GD's learning rate was set to 10^{−4}.
Figure 6. Result of the Beale function with initial point (−4, 4).
Figure 7. Styblinski–Tang function.
Figure 8. Result of the Styblinski–Tang function with initial point (6, 0). The learning rate is 0.1, the β_1, β_2, and λ used in the proposed method are 0.95, 0.9999, and 0.2, and GD's learning rate was set to 10^{−3}.
Figure 9. Cost function C(w_1, w_2) = w_2^2 − w_1^2 + 2.
Figure 10. The results near the saddle point with an initial point of (0.001, 2). The iteration number is 50, the learning rate is 10^{−2}, and the β_1, β_2, and λ used in the proposed method are 0.95, 0.9999, and −5 × 10^{−3}, respectively.
Figure 11. The result at the saddle point with an initial point of (0, 0.01).
Figure 12. Comparing the proposed scheme with other schemes.
Figure 13. Comparing the proposed scheme with other schemes.
Figure 14. Comparing the proposed scheme with other schemes with an iteration number over 6000.
Figure 15. Examples that the proposed method predicted correctly, while the ADAM method did not.
Figure 16. Result on the CIFAR10 dataset using the RESNET 44 model. The following parameters are default values; iterations: 80,000, batch size: 128, weight decay: 2 × 10^{−4}, learning rate: 0.1, batch norm decay: 0.997.
Table 1. Change of weights of each method.

Iteration | GD | ADAM | AdaMax | Proposed Method
0 | [2.0, 2.0] | [2.0, 2.0] | [2.0, 2.0] | [2.0, 2.0]
100 | [1.81, 0.84] | [1.53, 0.24] | [1.75, 0.69] | [2.11, −0.33]
200 | [1.85, 0.63] | [2.20, 0.18] | [1.80, 0.53] | [2.50, 0.32]
300 | [1.89, 0.52] | [2.39, 0.28] | [1.85, 0.42] | [2.72, 0.41]
400 | [1.92, 0.45] | [2.53, 0.34] | [1.90, 0.35] | [2.81, 0.44]
500 | [1.95, 0.39] | [2.64, 0.39] | [1.94, 0.30] | [2.87, 0.46]
600 | [1.98, 0.35] | [2.73, 0.42] | [1.99, 0.26] | [2.91, 0.47]
700 | [2.00, 0.32] | [2.79, 0.44] | [2.03, 0.23] | [2.93, 0.48]
800 | [2.03, 0.30] | [2.84, 0.45] | [2.06, 0.22] | [2.95, 0.48]
900 | [2.05, 0.28] | [2.88, 0.46] | [2.10, 0.21] | [2.96, 0.49]
1000 | [2.06, 0.27] | [2.91, 0.47] | [2.13, 0.21] | [2.97, 0.49]
Table 2. Change of values of each method.

Iteration | GD | ADAM | AdaMax | Proposed Method
0 | [0.001, 0.2] | [0.001, 0.2] | [0.001, 0.2] | [0.001, 0.2]
10 | [0.00121899, 0.16341456] | [0.0943284, 0.10245869] | [0.00157076, 0.11185508] | [0.16787479, 0.03591625]
20 | [0.00148595, 0.13352159] | [0.20284632, 0.02224525] | [0.00235688, 0.05096114] | [0.38849311, 0.05289886]
30 | [0.00181136, 0.10909686] | [0.32453918, 0.02053808] | [0.00347925, 0.0150328] | [0.61448606, 0.01491036]
40 | [0.00220804, 0.08914008] | [0.45704978, 0.0233784] | [0.00511317, 0.00198256] | [0.83303301, 0.02029141]
50 | [0.00269159, 0.07283394] | [0.59806857, 0.0075764] | [0.007516, 0.00722015] | [1.04047722, 0.01258689]
Table 3. Learning dataset.

(a) Image of LS (visualization of the dataset).
(b) Pseudocode of LS:
LS: empty set
for i = 1:30
    r = 5i/30
    t = 5πi/30
    put ([r·sin(t), r·cos(t)], 1) in LS
end for
for i = 1:30
    r = 5i/30
    t = 5πi/30 + π
    put ([r·sin(t), r·cos(t)], 0) in LS
end for
return LS
