
CN116147627A - Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation - Google Patents

Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation

Info

Publication number
CN116147627A
CN116147627A (application CN202310010366.6A)
Authority
CN
China
Prior art keywords
mobile robot
action
network
reinforcement learning
navigation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310010366.6A
Other languages
Chinese (zh)
Inventor
阮晓钢
林晨亮
黄静
李宇凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202310010366.6A priority Critical patent/CN116147627A/en
Publication of CN116147627A publication Critical patent/CN116147627A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a mobile robot autonomous navigation method that combines deep reinforcement learning with intrinsic motivation. A visual sensor acquires information from the environment, a D3QN algorithm selects the optimal action, and a curiosity-based intrinsic motivation module is introduced to address the sparse-reward problem of the navigation environment. An experimental environment is built on the Pygame simulation platform and two groups of experiments, single-target-point navigation and multi-target-point navigation, are carried out; the experimental results show that the model completes navigation tasks more effectively and is suitable for various navigation scenes. By using deep reinforcement learning, the method avoids the trade-off between accuracy and memory consumption inherent in the grid-based map representations used by traditional robot path-planning methods, and realizes collision-free autonomous navigation of the mobile robot.

Description

Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
Technical Field
The invention belongs to the field of artificial intelligence and robot navigation, and particularly relates to a mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation.
Background
With the rise of artificial intelligence, robots are developing towards self-exploring, self-learning and self-adapting intelligence. The purpose of path planning is to enable the robot to select an optimal or sub-optimal collision-free path from the start point to the end point in its environment. The quality of the path planning result directly determines whether the robot can complete its task efficiently and accurately, so research on robot path planning techniques is of great importance.
Deep reinforcement learning combines the strong perception capability of deep learning with the decision-making capability of reinforcement learning and performs well in complex environments and tasks, which makes it well suited to autonomous learning and obstacle-avoidance planning for robots. The learning process of such an end-to-end model can be described as follows: (1) the robot interacts with the environment to obtain environment information and perceives this information with a deep learning method; (2) reinforcement learning estimates a value function from the rewards used to evaluate each action and maps the current state to a corresponding action according to some policy; (3) the robot moves according to the chosen action and obtains the environment information of the next moment. By repeating this cycle, the robot finally obtains the optimal policy for completing its goal.
Intrinsic motivation stems from psychology, where it offers a plausible explanation of human development, and it is increasingly used in deep reinforcement learning reward design to address the exploration problem. Intrinsic motivation can drive living beings to explore an unknown environment autonomously without any external stimulus. Research has therefore formalized heuristic notions derived from intrinsic motivation, such as curiosity and surprise, into intrinsic reward signals in reinforcement learning that drive the agent to explore the environment autonomously and efficiently.
Disclosure of Invention
The invention is based on a dueling double deep Q network (D3QN, Dueling Double DQN). A sensor acquires input data from the environment, a neural network computes Q values and selects an action, and the robot stores the resulting transition tuples in a prioritized experience replay pool and trains the network with small batches sampled from it, which improves the learning and exploration efficiency of the mobile robot. To address the sparse-reward problem of this algorithm model, a curiosity-based intrinsic motivation module (ICM, Intrinsic Curiosity Module) is introduced. The ICM consists of three neural-network parts: an encoder (Encoder), a forward model (Forward Model) and an inverse model (Inverse Model). During training, the ICM takes the current state of the mobile robot, the selected action and the state of the next moment as input; it predicts the next state from the current state and the selected action and compares the prediction with the actual next state. The larger the difference between the two, the harder the future state is to predict and the larger the intrinsic reward. The intrinsic reward is added to the extrinsic reward obtained from interaction with the environment, and the sum is used as the total reward of the action. Through this continuous learning cycle, the robot can finally learn the optimal policy for completing its goal. The invention therefore provides a mobile robot autonomous navigation method combining deep reinforcement learning and intrinsic motivation. The algorithm model consists of two subsystems: a deep reinforcement learning subsystem, responsible for selecting a sequence of actions that maximizes the reward, and an ICM intrinsic motivation subsystem, responsible for generating the intrinsic reward signal.
The deep reinforcement learning model D3QN is based on Double DQN and Dueling DQN; its architecture is shown in Fig. 1 and specifically includes:
(1) Environment perception processing layer: this layer consists of an input layer and three fully connected neural network layers; each fully connected layer has 1024 neurons followed by a ReLU activation function. The current state of the mobile robot is its position at the current moment. The environment perception processing layer takes the current state of the mobile robot and the perception information obtained from the sensor as the input of the deep neural network, completes the mapping through the hidden layers, and sends the result to the action decision control layer, which performs action selection and output.
(2) Action decision control layer: a dueling network is adopted as the basic network of the action decision control layer. It consists of a fully connected layer that estimates the state value and a fully connected layer that evaluates the action advantage function. From these two streams, the Q values of the different actions the mobile robot can take in a given state are obtained, and the robot selects the action with the largest Q value as the best available action in that state, thereby completing the state transition. In the initial stage of learning, the action selected by the action decision control layer is not necessarily optimal, but after many training iterations its output gets closer and closer to the optimal action. A network sketch corresponding to this architecture is given below.
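The following is a minimal PyTorch sketch of a dueling Q network of the kind described above: three fully connected layers of 1024 units with ReLU as the perception part, followed by separate value and advantage streams whose outputs are combined into Q values. The state dimension, the number of discrete actions and the single-layer heads are placeholders or assumptions not specified in the text, and the mean-subtraction combination is the standard dueling formulation rather than a detail stated here.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Sketch of the D3QN network: perception trunk + value/advantage heads."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 1024):
        super().__init__()
        # Environment perception processing layer: three FC layers with ReLU.
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Action decision control layer: dueling value and advantage streams.
        self.value_head = nn.Linear(hidden, 1)
        self.advantage_head = nn.Linear(hidden, num_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        features = self.trunk(state)
        value = self.value_head(features)           # V(s)
        advantage = self.advantage_head(features)   # A(s, a)
        # Standard dueling combination: subtract the mean advantage for identifiability.
        return value + advantage - advantage.mean(dim=1, keepdim=True)

# Usage example: Q values for a batch of 4-dimensional states and 4 discrete actions.
if __name__ == "__main__":
    net = DuelingQNetwork(state_dim=4, num_actions=4)
    q_values = net(torch.randn(2, 4))
    print(q_values.shape)  # torch.Size([2, 4])
```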
The action decision control layer of D3QN is essentially the same as that of Dueling DQN, but the idea of Double DQN is used when computing the target value. In the Dueling DQN algorithm, the target value y_t is computed as shown in Equation 1:
y_t = r_{t+1} + γ max_a Q(s_{t+1}, a; ω_t)   (1)
that is, the target network produces the Q values of all actions in the next state s_{t+1}, and the largest of these Q values is used to compute the target value. This maximization causes the algorithm to overestimate action values, which affects the accuracy of its decisions. In the D3QN algorithm the target value is therefore computed differently, as shown in Equation 2:
y_t = r_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; ω_e); ω_t)   (2)
where ω_e and ω_t denote the parameters of the evaluation network and the target network, respectively. The D3QN model uses the evaluation network to select, in state s_{t+1}, the action with the best action value, and then uses the target network to compute the value of that action, from which the target value is obtained. This interaction between the two networks effectively avoids the overestimation problem.
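As a sketch, the Double DQN target of Equation 2 can be computed for a sampled mini-batch as follows; eval_net and target_net are assumed to be two instances of the dueling network above, and the tensor names and the terminal-state handling are illustrative assumptions.

```python
import torch

@torch.no_grad()
def double_dqn_target(eval_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute y_t = r_{t+1} + gamma * Q_target(s_{t+1}, argmax_a Q_eval(s_{t+1}, a))."""
    # Action selection with the evaluation network (inner argmax of Equation 2).
    best_actions = eval_net(next_states).argmax(dim=1, keepdim=True)
    # Action evaluation with the target network.
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    # Terminal transitions bootstrap with zero.
    return rewards + gamma * next_q * (1.0 - dones)
```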
The mobile robot executes the action selected by the deep reinforcement learning subsystem and, while interacting with the environment, obtains a reward called the extrinsic reward r^e_t. The extrinsic reward function is given in Equation 3. Specifically: to urge the mobile robot to reach the target point in as few steps as possible, a negative step reward r_step = -0.075 is accumulated every time the robot moves one step; when the mobile robot reaches the target point it receives a reward of r_arrive + r_step with a value of 0.1; and when the mobile robot collides with an obstacle it receives a negative reward r_collision as a penalty:
r^e_t = r_arrive + r_step, if the target point is reached; r_collision, if the robot collides with an obstacle; r_step, otherwise.   (3)
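A minimal sketch of the extrinsic reward of Equation 3 is given below; r_step = -0.075 comes from the text, while the concrete values chosen for r_arrive and r_collision are assumptions for illustration only.

```python
def extrinsic_reward(reached_target: bool, collided: bool,
                     r_step: float = -0.075,
                     r_arrive: float = 0.175,    # assumption: chosen so r_arrive + r_step = 0.1
                     r_collision: float = -1.0   # assumption: penalty value not given in the text
                     ) -> float:
    """Extrinsic reward r^e_t as described around Equation 3."""
    if reached_target:
        return r_arrive + r_step
    if collided:
        return r_collision
    return r_step
```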
The curiosity-based intrinsic motivation module is shown in Fig. 2. This subsystem takes as input the current state s_t, the optimal action a_t selected by the deep reinforcement learning subsystem, and the next state s_{t+1}. Specifically, it comprises the following parts:
(1) Encoder: to prevent unpredictable or uncontrollable parts of the input space from disturbing subsequent predictions, the ICM uses a deep neural network to encode the raw state s_t into a feature vector, denoted φ(s_t). The two sub-modules described next, the forward model and the inverse model, are learned on the feature vectors produced by this encoder.
(2) Inverse model: this module is a neural network g that takes as input the feature vectors φ(s_t) and φ(s_{t+1}) produced by the encoder from the current state s_t and the next state s_{t+1}, and predicts the action â_t that took the mobile robot from state s_t to state s_{t+1}, as shown in Equation 4:
â_t = g(φ(s_t), φ(s_{t+1}); θ_I)   (4)
where â_t is the prediction of the actually selected optimal action a_t, and θ_I and θ_E denote the parameters of the inverse model and of the encoder. The goal is to minimize the difference between the predicted action and the action actually taken, i.e. min_{θ_I} L_I(â_t, a_t), where L_I(â_t, a_t) measures the difference between the predicted and the actual action. The inverse prediction error produced by this module is used to train both the inverse model and the encoder.
(3) Forward model: whereas the inverse model predicts the action from the feature vectors of the current and next states, the forward model uses the feature vector φ(s_t) of the current state and the action a_t actually taken by the mobile robot to predict the feature vector φ̂(s_{t+1}) of the next state. If the mobile robot is familiar with its environment, then for any given state and action it should be able to predict the next state accurately; when the predicted state disagrees with the actual next state, that part of the environment is unknown to the robot, which arouses its curiosity and generates an intrinsic motivation reward, so the mobile robot will seek out the regions it is curious about. The feature vector of the next state predicted by the forward model is given in Equation 5:
φ̂(s_{t+1}) = f(φ(s_t), a_t; θ_F)   (5)
where φ̂(s_{t+1}) denotes the predicted next-state feature vector, θ_F denotes the parameters of the forward model, and the function f, also called the forward model, is trained to minimize its prediction loss.
The intrinsic reward produced by the ICM subsystem is given by Equation 6:
r^i_t = (η/2) ||φ̂(s_{t+1}) − φ(s_{t+1})||_2^2   (6)
where η > 0 is a scaling parameter.
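The following PyTorch sketch shows an ICM of the shape described above: an encoder φ, an inverse model g (Equation 4), a forward model f (Equation 5) and the intrinsic reward of Equation 6. The network widths, the feature dimension, the use of a discrete action space with one-hot encoding, and detaching the target features in the forward loss are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Sketch of the Intrinsic Curiosity Module: encoder + inverse model + forward model."""

    def __init__(self, state_dim: int, num_actions: int, feature_dim: int = 64, eta: float = 0.1):
        super().__init__()
        self.num_actions = num_actions
        self.eta = eta
        # Encoder phi: raw state -> feature vector.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )
        # Inverse model g: (phi(s_t), phi(s_{t+1})) -> predicted action logits (Eq. 4).
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feature_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )
        # Forward model f: (phi(s_t), one-hot a_t) -> predicted phi(s_{t+1}) (Eq. 5).
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + num_actions, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def forward(self, state, next_state, action):
        phi_s = self.encoder(state)
        phi_next = self.encoder(next_state)
        a_onehot = F.one_hot(action, self.num_actions).float()

        # Inverse prediction error L_I: cross-entropy between predicted and actual action.
        action_logits = self.inverse_model(torch.cat([phi_s, phi_next], dim=1))
        inverse_loss = F.cross_entropy(action_logits, action)

        # Forward prediction error L_F against the (detached) target features.
        phi_next_pred = self.forward_model(torch.cat([phi_s, a_onehot], dim=1))
        forward_loss = 0.5 * F.mse_loss(phi_next_pred, phi_next.detach())

        # Intrinsic reward r^i_t = (eta / 2) * ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2 (Eq. 6).
        intrinsic_reward = (self.eta / 2.0) * (phi_next_pred - phi_next).pow(2).sum(dim=1).detach()
        return intrinsic_reward, inverse_loss, forward_loss
```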
The overall optimization objective of the curiosity-driven algorithm can be summarized as Equation 7:
min_{θ_P, θ_I, θ_F} [ −λ E_{π(s_t; θ_P)}[Σ_t r_t] + (1 − β) L_I + β L_F ]   (7)
where β and λ are scalars: β weights the loss of the inverse model against the loss of the forward model and satisfies 0 ≤ β ≤ 1, while λ > 0 weighs the importance of the policy gradient loss against learning the intrinsic reward. L_I is the loss function measuring the difference between the predicted action and the actually selected action, and L_F is the loss function measuring the difference between the predicted and the actual next-state feature vectors.
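As a sketch, the weighted module loss (1 − β) L_I + β L_F from Equation 7 can be combined with the learner's temporal-difference loss in one optimization step; realizing the policy term of Equation 7 through the Q-learning update rather than an explicit policy gradient, and the default values of beta and lam, are assumptions for illustration.

```python
def icm_training_loss(inverse_loss, forward_loss, td_loss, beta=0.2, lam=0.1):
    """Joint loss sketch: lam * learner loss + (1 - beta) * L_I + beta * L_F.

    beta and lam defaults are illustrative assumptions, not values from the text.
    """
    module_loss = (1.0 - beta) * inverse_loss + beta * forward_loss
    return lam * td_loss + module_loss
```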
Drawings
FIG. 1 is a structure diagram of the deep reinforcement learning algorithm D3QN.
FIG. 2 is a structure diagram of the ICM module.
FIG. 3 is the general structure diagram of the present invention.
FIG. 4 is the environment map used for training in the present invention.
FIG. 5 shows the training results of the single-target-point navigation experiment of the present invention.
FIG. 6 shows the training results of the multi-target-point navigation experiment of the present invention.
Detailed Description
The invention will be described in detail with reference to the drawings and examples.
The invention is based on a deep reinforcement learning algorithm model into which an intrinsic motivation module is introduced to solve the problems of sparse rewards and slow training of the mobile robot during navigation. FIG. 3 is the general structure diagram of the invention. Starting from the current state s_t of the mobile robot, the deep reinforcement learning module selects the optimal action; interacting with the environment yields the extrinsic reward r^e_t and the next state s_{t+1}. The current state s_t, the next state s_{t+1} and the selected optimal action a_t are fed to the ICM module, which produces an intrinsic reward r^i_t. Finally, the sum of the intrinsic and extrinsic rewards is used as the total reward of the action, and the model is trained in this way.
For the experimental environment, an 8×14 rectangular simulation environment is built with Pygame, in which a circle represents the mobile robot, a diamond represents the target point and squares represent the obstacles in the map. To better verify the effectiveness of the algorithm model, two groups of experiments are set up: (1) a single-target-point navigation experiment; (2) a multi-target-point navigation experiment, in which, whenever the mobile robot reaches the target point in the current environment, the next target point is randomly generated in the map and the robot continues to navigate to the new target point. Both groups of experiments run for at most 400,000 training steps, and the mobile robot may move at most 500 steps within one training episode, after which the episode is terminated. The hyperparameters used by the invention are shown in Table 1:
TABLE 1 Hyperparameter settings

Parameter name        Value
learning_rate         0.00025
Epsilon (initial)     1
Final epsilon         0.1
Replay memory size    100000
Training steps        400000
Batch size            32
Gamma                 0.99
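These hyperparameters can be collected in a small configuration object, as sketched below; the field names are illustrative, and the linear epsilon-decay schedule from 1 to 0.1 is an assumption, since the text only gives the initial and final values.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    """Hyperparameters from Table 1 (the epsilon schedule shape is an assumption)."""
    learning_rate: float = 0.00025
    epsilon_start: float = 1.0
    epsilon_final: float = 0.1
    replay_memory_size: int = 100_000
    training_steps: int = 400_000
    batch_size: int = 32
    gamma: float = 0.99

    def epsilon(self, step: int) -> float:
        # Linear decay from epsilon_start to epsilon_final over training (assumed schedule).
        frac = min(step / self.training_steps, 1.0)
        return self.epsilon_start + frac * (self.epsilon_final - self.epsilon_start)
```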
The training steps of the invention are as follows (a condensed sketch of this loop is given after the list):
(1) In the preparation phase, generate the map required for the experiment, including the target points, obstacles and the mobile robot.
(2) Initialize the neural network parameters, the experience replay pool and the experimental environment, and obtain the current state of the mobile robot.
(3) Judge whether the maximum number of training steps has been reached; if so, stop, otherwise execute step (4).
(4) Take the current state s_t of the mobile robot as the input of the neural network, compute the Q value corresponding to each action, and select the action a_t for the current state s_t with the ε-greedy algorithm.
(5) The mobile robot executes action a_t and interacts with the experimental environment to obtain the new state s_{t+1}, the extrinsic reward r^e_t and the flag indicating whether the episode has ended. If the target point is reached the episode ends and the reward r_arrive is given; if the mobile robot collides, the reward r_collision is assigned; if training has not finished after more than 500 steps, the reward r_timeout is assigned; otherwise the negative step reward of -0.075 is received.
(6) Pass the original state s_t, the selected action a_t and the new state s_{t+1} to the ICM module to compute the intrinsic reward r^i_t. The total reward obtained by this action of the mobile robot is given in Equation 8:
r_t = r^e_t + r^i_t   (8)
The tuple {s_t, s_{t+1}, a_t, r_t, done} is stored in the prioritized experience replay pool D.
(7) If the episode has ended, end the current training episode. Otherwise, sample m transitions from the experience replay pool D, use them to compute the target value y_t of the current Q network, compute the loss function and update the network parameters of the current Q network.
(8) Copy the network parameters of the current Q network to the target Q network at a lower update rate, thereby updating the target Q network, and update the priorities of the tuples in the experience replay pool D.
(9) When the training reaches the specified number of steps, training ends.
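The loop below is a condensed sketch of steps (2) to (9). It assumes the DuelingQNetwork, ICM, double_dqn_target, icm_training_loss and TrainingConfig sketches given earlier, a prioritized replay buffer with add/sample/update_priorities methods, and an environment exposing reset(), step() returning (next_state, reward, done) and sample_action(); all of these interfaces, as well as the soft-update rate, are assumptions rather than the patent's actual implementation.

```python
import random
import torch

def train(env, q_net, target_net, icm, replay, cfg: TrainingConfig):
    optimizer = torch.optim.Adam(list(q_net.parameters()) + list(icm.parameters()),
                                 lr=cfg.learning_rate)
    state = env.reset()
    for step in range(cfg.training_steps):                       # step (3)
        # Step (4): epsilon-greedy action selection on the current Q network.
        if random.random() < cfg.epsilon(step):
            action = env.sample_action()
        else:
            with torch.no_grad():
                action = int(q_net(torch.as_tensor(state).float().unsqueeze(0)).argmax())

        # Step (5): interact with the environment and obtain the extrinsic reward.
        next_state, extrinsic_reward, done = env.step(action)

        # Step (6): intrinsic reward from the ICM, total reward r_t = r^e_t + r^i_t (Eq. 8).
        s = torch.as_tensor(state).float().unsqueeze(0)
        s_next = torch.as_tensor(next_state).float().unsqueeze(0)
        with torch.no_grad():
            r_int, _, _ = icm(s, s_next, torch.tensor([action]))
        replay.add((state, next_state, action, extrinsic_reward + float(r_int), done))

        # Step (7): sample a mini-batch and update the current Q network and the ICM.
        if len(replay) >= cfg.batch_size:
            (states, next_states, actions, rewards, dones), indices, weights = replay.sample(cfg.batch_size)
            q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            y = double_dqn_target(q_net, target_net, rewards, next_states, dones, cfg.gamma)
            td_error = y - q
            td_loss = (weights * td_error.pow(2)).mean()
            _, inv_loss, fwd_loss = icm(states, next_states, actions)
            loss = icm_training_loss(inv_loss, fwd_loss, td_loss)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Step (8): soft-update the target network and refresh replay priorities.
            tau = 0.005  # assumed soft-update rate
            for p_t, p in zip(target_net.parameters(), q_net.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
            replay.update_priorities(indices, td_error.abs().detach())

        state = env.reset() if done else next_state
```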
Analysis of experimental results: three different algorithms are compared, namely the proposed algorithm model (ours), D3QN-ICM and D3QN.
(1) In the single-target-point navigation environment, the proposed algorithm model and the D3QN-ICM model both obtain a reward of about 18 after 400,000 training steps, whereas the D3QN algorithm only obtains a reward of about 11. At the same time, the reward of the proposed model begins to stabilize at about 100,000 steps, while the D3QN-ICM model does not stabilize until about 125,000 steps, so the proposed algorithm model performs better in the single-target-point navigation environment.
(2) In the multi-target-point navigation environment, after about 400,000 training steps the proposed algorithm model obtains 42.73 points, the D3QN-ICM model obtains 24.76 points and the D3QN algorithm only 10.31 points. Compared with the D3QN-ICM algorithm the performance is improved by about 70%, and the advantage is larger than in the simple navigation environment, showing that the proposed algorithm model performs even better in complex navigation tasks.
The foregoing description is only exemplary embodiments of the invention and is not intended to limit the scope of the invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation, based on a deep reinforcement learning algorithm and introducing intrinsic motivation theory, characterized by comprising the following steps:
(1) in the preparation stage, generating the map required for the experiment, the map comprising target points, obstacles and the mobile robot;
(2) in the navigation training stage, training the agent according to the extrinsic rewards obtained by interacting with the environment and the intrinsic rewards generated by the ICM module:
(2.1) initializing the neural network parameters, the experience replay pool and the experimental environment, and obtaining the current state of the mobile robot;
(2.2) judging whether the maximum number of training steps has been reached; if so, ending, otherwise executing step (2.3);
(2.3) taking the current state s_t of the mobile robot as the input of the neural network, computing the Q value corresponding to each action, and selecting the action a_t for the current state s_t with the ε-greedy algorithm;
(2.4) the mobile robot executing action a_t and interacting with the experimental environment to obtain the new state s_{t+1}, the extrinsic reward r^e_t and the flag indicating whether the episode has ended;
(2.5) taking the original state s_t of the robot, the selected action a_t and the new state s_{t+1} as input to the ICM module to compute the intrinsic reward r^i_t, as shown in Equation 1:
r^i_t = (η/2) ||φ̂(s_{t+1}) − φ(s_{t+1})||_2^2   (1)
the total reward obtained by this action of the mobile robot being given in Equation 2:
r_t = r^e_t + r^i_t   (2)
(2.6) storing the tuple {s_t, s_{t+1}, a_t, r_t, done} in the experience replay pool D;
(2.7) if the episode has ended, ending the current training episode; otherwise sampling m transitions from the experience replay pool D and using them to compute the target value y_t of the current Q network, as shown in Equation 3:
y_t = r_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; ω_e); ω_t)   (3)
where ω_e and ω_t denote the parameters of the evaluation network and the target network, respectively; the D3QN model uses the evaluation network to select, in state s_{t+1}, the action with the best action value, and then uses the target network to compute the value of that action, thereby obtaining the target value;
(2.8) computing the loss function L(θ) and updating the network parameters of the current Q network;
(2.9) copying the network parameters of the current Q network to the target Q network at a lower update rate, thereby updating the target Q network;
(2.10) updating the priorities of the tuples in the experience replay pool D;
(2.11) ending the training when it reaches the specified number of steps.
2. The mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation of claim 1, wherein an intrinsic motivation module is introduced on the basis of the deep reinforcement learning algorithm, effectively solving the problem of reward sparsity in the autonomous navigation of the mobile robot.
3. The mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation of claim 1, wherein a prioritized experience replay pool is used to optimize the experience replay process, improving the navigation efficiency of the mobile robot.
CN202310010366.6A 2023-01-04 2023-01-04 Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation Pending CN116147627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310010366.6A CN116147627A (en) 2023-01-04 2023-01-04 Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310010366.6A CN116147627A (en) 2023-01-04 2023-01-04 Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation

Publications (1)

Publication Number Publication Date
CN116147627A true CN116147627A (en) 2023-05-23

Family

ID=86352032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310010366.6A Pending CN116147627A (en) 2023-01-04 2023-01-04 Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation

Country Status (1)

Country Link
CN (1) CN116147627A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116892932A (en) * 2023-05-31 2023-10-17 三峡大学 Navigation decision method combining curiosity mechanism and self-imitation learning
CN116892932B (en) * 2023-05-31 2024-04-30 三峡大学 Navigation decision method combining curiosity mechanism and self-imitation learning
CN117490696A (en) * 2023-10-23 2024-02-02 广州创源机器人有限公司 Method for accelerating navigation efficiency of robot
CN118603105A (en) * 2024-08-08 2024-09-06 青岛理工大学 Air-ground heterogeneous robot navigation method, equipment and medium

Similar Documents

Publication Publication Date Title
CN110262511B (en) Biped robot adaptive walking control method based on deep reinforcement learning
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN111098852B (en) Parking path planning method based on reinforcement learning
CN112132263B (en) Multi-agent autonomous navigation method based on reinforcement learning
CN102402712B (en) Robot reinforced learning initialization method based on neural network
CN116147627A (en) Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
CN112362066A (en) Path planning method based on improved deep reinforcement learning
CN105700526A (en) On-line sequence limit learning machine method possessing autonomous learning capability
CN111783994A (en) Training method and device for reinforcement learning
CN113341972A (en) Robot path optimization planning method based on deep reinforcement learning
CN114518751A (en) Path planning decision optimization method based on least square truncation time domain difference learning
CN117590867A (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
CN114721397B (en) Maze robot path planning method based on reinforcement learning and curiosity
He et al. Decentralized exploration of a structured environment based on multi-agent deep reinforcement learning
Gromniak et al. Deep reinforcement learning for mobile robot navigation
CN114626505A (en) Mobile robot deep reinforcement learning control method
CN117302204B (en) Multi-wind-lattice vehicle track tracking collision avoidance control method and device based on reinforcement learning
CN115016499B (en) SCA-QL-based path planning method
Zhang et al. Route searching based on neural networks and heuristic reinforcement learning
Tan et al. PL-TD3: A Dynamic Path Planning Algorithm of Mobile Robot
CN116841303A (en) Intelligent preferential high-order iterative self-learning control method for underwater robot
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Gross et al. Sensory-based Robot Navigation using Self-organizing Networks and Q-learning
CN114118371A (en) Intelligent agent deep reinforcement learning method and computer readable medium
Bhatia et al. Reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination