
CN116147627A - Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation - Google Patents

Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation

Info

Publication number
CN116147627A
CN116147627A (application CN202310010366.6A)
Authority
CN
China
Prior art keywords
mobile robot
action
network
reinforcement learning
navigation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310010366.6A
Other languages
Chinese (zh)
Inventor
阮晓钢
林晨亮
黄静
李宇凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202310010366.6A priority Critical patent/CN116147627A/en
Publication of CN116147627A publication Critical patent/CN116147627A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a mobile robot autonomous navigation method that combines deep reinforcement learning with intrinsic motivation. A visual sensor acquires information from the environment, a D3QN algorithm selects the optimal action, and a curiosity-based intrinsic motivation module is introduced to address the sparse-reward problem of the navigation environment. An experimental environment is built on the Pygame simulation platform and two groups of experiments, single-target-point navigation and multi-target-point navigation, are carried out; the experimental results show that the model completes navigation tasks more effectively and is suitable for various navigation scenes. By using deep reinforcement learning, the method avoids the trade-off between accuracy and memory consumption inherent in the grid-based map representations used by traditional robot path-planning methods, and realizes collision-free autonomous navigation of the mobile robot.

Description

Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
Technical Field
The invention belongs to the field of artificial intelligence and robot navigation, and particularly relates to a mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation.
Background
With the rise of artificial intelligence, robots are developing towards self-exploring, self-learning and self-adapting intelligence. The purpose of path planning is to enable the robot to select an optimal or sub-optimal collision-free path from the start point to the end point in its environment. The quality of the path planning result directly determines whether the robot can complete its task efficiently and accurately, so research on robot path planning techniques is of great importance.
Deep reinforcement learning combines the strong perception capability of deep learning with the decision-making capability of reinforcement learning and performs well in complex environments and tasks, which makes it well suited to autonomous learning and obstacle-avoidance planning for robots. The learning process of such an end-to-end model can be described as follows: (1) the robot interacts with the environment to obtain environment information and perceives this information with a deep learning method; (2) reinforcement learning estimates a value function from the rewards used to evaluate each action and maps the current state to a corresponding action according to some policy; (3) the robot moves according to the chosen action and obtains the environment information of the next moment. By repeating this cycle, the robot finally obtains the optimal policy for completing its goal.
Intrinsic motivation stems from psychology, where it offers a plausible explanation of human development, and it is increasingly used in deep reinforcement learning reward design to address the exploration problem. Intrinsic motivation can drive living beings to explore an unknown environment autonomously without any external stimulus. Research has therefore formalized heuristic notions derived from intrinsic motivation, such as curiosity and surprise, into intrinsic reward signals in reinforcement learning that drive the agent to explore the environment autonomously and efficiently.
Disclosure of Invention
The invention is based on a dueling double deep Q network (D3QN, Dueling Double DQN). A sensor acquires input data from the environment, a neural network computes Q values and selects an action, and the robot stores the resulting transition tuples in a prioritized experience replay pool and trains the network with small batches sampled from it, which improves the learning and exploration efficiency of the mobile robot. To address the sparse-reward problem of this algorithm model, a curiosity-based intrinsic motivation module (ICM, Intrinsic Curiosity Module) is introduced. The ICM consists of three neural-network parts: an encoder (Encoder), a forward model (Forward Model) and an inverse model (Inverse Model). During training, the ICM takes the current state of the mobile robot, the selected action and the state of the next moment as input; it predicts the next state from the current state and the selected action and compares the prediction with the actual next state. The larger the difference between the two, the harder the future state is to predict and the larger the intrinsic reward. The intrinsic reward is added to the extrinsic reward obtained from interaction with the environment, and the sum is used as the total reward of the action. Through this continuous learning cycle, the robot can finally learn the optimal policy for completing its goal. The invention therefore provides a mobile robot autonomous navigation method combining deep reinforcement learning and intrinsic motivation. The algorithm model consists of two subsystems: a deep reinforcement learning subsystem, responsible for selecting a sequence of actions that maximizes the reward, and an ICM intrinsic motivation subsystem, responsible for generating the intrinsic reward signal.
The deep reinforcement learning model D3QN is based on Double DQN and Dueling DQN; its architecture is shown in Fig. 1 and specifically includes:
(1) Environment perception processing layer: this layer consists of an input layer and three fully connected neural network layers; each fully connected layer has 1024 neurons followed by a ReLU activation function. The current state of the mobile robot is its position at the current moment. The environment perception processing layer takes the current state of the mobile robot and the perception information obtained from the sensor as the input of the deep neural network, completes the mapping through the hidden layers, and sends the result to the action decision control layer, which performs action selection and output.
(2) Action decision control layer: a dueling network is adopted as the basic network of the action decision control layer. It consists of a fully connected layer that estimates the state value and a fully connected layer that evaluates the action advantage function. From these two streams, the Q values of the different actions the mobile robot can take in a given state are obtained, and the robot selects the action with the largest Q value as the best available action in that state, thereby completing the state transition. In the initial stage of learning, the action selected by the action decision control layer is not necessarily optimal, but after many training iterations its output gets closer and closer to the optimal action. A network sketch corresponding to this architecture is given below.
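The following is a minimal PyTorch sketch of a dueling Q network of the kind described above: three fully connected layers of 1024 units with ReLU as the perception part, followed by separate value and advantage streams whose outputs are combined into Q values. The state dimension, the number of discrete actions and the single-layer heads are placeholders or assumptions not specified in the text, and the mean-subtraction combination is the standard dueling formulation rather than a detail stated here.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Sketch of the D3QN network: perception trunk + value/advantage heads."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 1024):
        super().__init__()
        # Environment perception processing layer: three FC layers with ReLU.
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Action decision control layer: dueling value and advantage streams.
        self.value_head = nn.Linear(hidden, 1)
        self.advantage_head = nn.Linear(hidden, num_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        features = self.trunk(state)
        value = self.value_head(features)           # V(s)
        advantage = self.advantage_head(features)   # A(s, a)
        # Standard dueling combination: subtract the mean advantage for identifiability.
        return value + advantage - advantage.mean(dim=1, keepdim=True)

# Usage example: Q values for a batch of 4-dimensional states and 4 discrete actions.
if __name__ == "__main__":
    net = DuelingQNetwork(state_dim=4, num_actions=4)
    q_values = net(torch.randn(2, 4))
    print(q_values.shape)  # torch.Size([2, 4])
```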
The action decision control layer of D3QN is essentially the same as that of Dueling DQN, but the idea of Double DQN is used when computing the target value. In the Dueling DQN algorithm, the target value y_t is computed as shown in Equation 1:
y_t = r_{t+1} + γ max_a Q(s_{t+1}, a; ω_t)   (1)
that is, the target network produces the Q values of all actions in the next state s_{t+1}, and the largest of these Q values is used to compute the target value. This maximization causes the algorithm to overestimate action values, which affects the accuracy of its decisions. In the D3QN algorithm the target value is therefore computed differently, as shown in Equation 2:
y_t = r_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; ω_e); ω_t)   (2)
where ω_e and ω_t denote the parameters of the evaluation network and the target network, respectively. The D3QN model uses the evaluation network to select, in state s_{t+1}, the action with the best action value, and then uses the target network to compute the value of that action, from which the target value is obtained. This interaction between the two networks effectively avoids the overestimation problem.
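As a sketch, the Double DQN target of Equation 2 can be computed for a sampled mini-batch as follows; eval_net and target_net are assumed to be two instances of the dueling network above, and the tensor names and the terminal-state handling are illustrative assumptions.

```python
import torch

@torch.no_grad()
def double_dqn_target(eval_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute y_t = r_{t+1} + gamma * Q_target(s_{t+1}, argmax_a Q_eval(s_{t+1}, a))."""
    # Action selection with the evaluation network (inner argmax of Equation 2).
    best_actions = eval_net(next_states).argmax(dim=1, keepdim=True)
    # Action evaluation with the target network.
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
    # Terminal transitions bootstrap with zero.
    return rewards + gamma * next_q * (1.0 - dones)
```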
The mobile robot executes the action selected by the deep reinforcement learning subsystem and, while interacting with the environment, obtains a reward called the extrinsic reward r^e_t. The extrinsic reward function is given in Equation 3. Specifically: to urge the mobile robot to reach the target point in as few steps as possible, a negative step reward r_step = -0.075 is accumulated every time the robot moves one step; when the mobile robot reaches the target point it receives a reward of r_arrive + r_step with a value of 0.1; and when the mobile robot collides with an obstacle it receives a negative reward r_collision as a penalty:
r^e_t = r_arrive + r_step, if the target point is reached; r_collision, if the robot collides with an obstacle; r_step, otherwise.   (3)
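A minimal sketch of the extrinsic reward of Equation 3 is given below; r_step = -0.075 comes from the text, while the concrete values chosen for r_arrive and r_collision are assumptions for illustration only.

```python
def extrinsic_reward(reached_target: bool, collided: bool,
                     r_step: float = -0.075,
                     r_arrive: float = 0.175,    # assumption: chosen so r_arrive + r_step = 0.1
                     r_collision: float = -1.0   # assumption: penalty value not given in the text
                     ) -> float:
    """Extrinsic reward r^e_t as described around Equation 3."""
    if reached_target:
        return r_arrive + r_step
    if collided:
        return r_collision
    return r_step
```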
The curiosity-based intrinsic motivation module is shown in Fig. 2. This subsystem takes as input the current state s_t, the optimal action a_t selected by the deep reinforcement learning subsystem, and the next state s_{t+1}. Specifically, it comprises the following parts:
(1) Encoder: to prevent unpredictable or uncontrollable parts of the input space from disturbing subsequent predictions, the ICM uses a deep neural network to encode the raw state s_t into a feature vector, denoted φ(s_t). The two sub-modules described next, the forward model and the inverse model, are learned on the feature vectors produced by this encoder.
(2) Inverse model: this module is a neural network g that takes as input the feature vectors φ(s_t) and φ(s_{t+1}) produced by the encoder from the current state s_t and the next state s_{t+1}, and predicts the action â_t that took the mobile robot from state s_t to state s_{t+1}, as shown in Equation 4:
â_t = g(φ(s_t), φ(s_{t+1}); θ_I)   (4)
where â_t is the prediction of the actually selected optimal action a_t, and θ_I and θ_E denote the parameters of the inverse model and of the encoder. The goal is to minimize the difference between the predicted action and the action actually taken, i.e. min_{θ_I} L_I(â_t, a_t), where L_I(â_t, a_t) measures the difference between the predicted and the actual action. The inverse prediction error produced by this module is used to train both the inverse model and the encoder.
(3) Forward model: whereas the inverse model predicts the action from the feature vectors of the current and next states, the forward model uses the feature vector φ(s_t) of the current state and the action a_t actually taken by the mobile robot to predict the feature vector φ̂(s_{t+1}) of the next state. If the mobile robot is familiar with its environment, then for any given state and action it should be able to predict the next state accurately; when the predicted state disagrees with the actual next state, that part of the environment is unknown to the robot, which arouses its curiosity and generates an intrinsic motivation reward, so the mobile robot will seek out the regions it is curious about. The feature vector of the next state predicted by the forward model is given in Equation 5:
φ̂(s_{t+1}) = f(φ(s_t), a_t; θ_F)   (5)
where φ̂(s_{t+1}) denotes the predicted next-state feature vector, θ_F denotes the parameters of the forward model, and the function f, also called the forward model, is trained to minimize its prediction loss.
The intrinsic reward produced by the ICM subsystem is given by Equation 6:
r^i_t = (η/2) ||φ̂(s_{t+1}) − φ(s_{t+1})||_2^2   (6)
where η > 0 is a scaling parameter.
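The following PyTorch sketch shows an ICM of the shape described above: an encoder φ, an inverse model g (Equation 4), a forward model f (Equation 5) and the intrinsic reward of Equation 6. The network widths, the feature dimension, the use of a discrete action space with one-hot encoding, and detaching the target features in the forward loss are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Sketch of the Intrinsic Curiosity Module: encoder + inverse model + forward model."""

    def __init__(self, state_dim: int, num_actions: int, feature_dim: int = 64, eta: float = 0.1):
        super().__init__()
        self.num_actions = num_actions
        self.eta = eta
        # Encoder phi: raw state -> feature vector.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )
        # Inverse model g: (phi(s_t), phi(s_{t+1})) -> predicted action logits (Eq. 4).
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feature_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )
        # Forward model f: (phi(s_t), one-hot a_t) -> predicted phi(s_{t+1}) (Eq. 5).
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + num_actions, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def forward(self, state, next_state, action):
        phi_s = self.encoder(state)
        phi_next = self.encoder(next_state)
        a_onehot = F.one_hot(action, self.num_actions).float()

        # Inverse prediction error L_I: cross-entropy between predicted and actual action.
        action_logits = self.inverse_model(torch.cat([phi_s, phi_next], dim=1))
        inverse_loss = F.cross_entropy(action_logits, action)

        # Forward prediction error L_F against the (detached) target features.
        phi_next_pred = self.forward_model(torch.cat([phi_s, a_onehot], dim=1))
        forward_loss = 0.5 * F.mse_loss(phi_next_pred, phi_next.detach())

        # Intrinsic reward r^i_t = (eta / 2) * ||phi_hat(s_{t+1}) - phi(s_{t+1})||^2 (Eq. 6).
        intrinsic_reward = (self.eta / 2.0) * (phi_next_pred - phi_next).pow(2).sum(dim=1).detach()
        return intrinsic_reward, inverse_loss, forward_loss
```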
The overall optimization objective of the curiosity-driven algorithm can be summarized as Equation 7:
min_{θ_P, θ_I, θ_F} [ −λ E_{π(s_t; θ_P)}[Σ_t r_t] + (1 − β) L_I + β L_F ]   (7)
where β and λ are scalars: β weights the loss of the inverse model against the loss of the forward model and satisfies 0 ≤ β ≤ 1, while λ > 0 weighs the importance of the policy gradient loss against learning the intrinsic reward. L_I is the loss function measuring the difference between the predicted action and the actually selected action, and L_F is the loss function measuring the difference between the predicted and the actual next-state feature vectors.
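As a sketch, the weighted module loss (1 − β) L_I + β L_F from Equation 7 can be combined with the learner's temporal-difference loss in one optimization step; realizing the policy term of Equation 7 through the Q-learning update rather than an explicit policy gradient, and the default values of beta and lam, are assumptions for illustration.

```python
def icm_training_loss(inverse_loss, forward_loss, td_loss, beta=0.2, lam=0.1):
    """Joint loss sketch: lam * learner loss + (1 - beta) * L_I + beta * L_F.

    beta and lam defaults are illustrative assumptions, not values from the text.
    """
    module_loss = (1.0 - beta) * inverse_loss + beta * forward_loss
    return lam * td_loss + module_loss
```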
Drawings
FIG. 1 is a structure diagram of the deep reinforcement learning algorithm D3QN.
FIG. 2 is a structure diagram of the ICM module.
FIG. 3 is the general structure diagram of the present invention.
FIG. 4 is the environment map used for training in the present invention.
FIG. 5 shows the training results of the single-target-point navigation experiment of the present invention.
FIG. 6 shows the training results of the multi-target-point navigation experiment of the present invention.
Detailed Description
The invention will be described in detail with reference to the drawings and examples.
The invention is based on a deep reinforcement learning algorithm model into which an intrinsic motivation module is introduced to solve the problems of sparse rewards and slow training of the mobile robot during navigation. FIG. 3 is the general structure diagram of the invention. Starting from the current state s_t of the mobile robot, the deep reinforcement learning module selects the optimal action; interacting with the environment yields the extrinsic reward r^e_t and the next state s_{t+1}. The current state s_t, the next state s_{t+1} and the selected optimal action a_t are fed to the ICM module, which produces an intrinsic reward r^i_t. Finally, the sum of the intrinsic and extrinsic rewards is used as the total reward of the action, and the model is trained in this way.
For the experimental environment, an 8×14 rectangular simulation environment is built with Pygame, in which a circle represents the mobile robot, a diamond represents the target point and squares represent the obstacles in the map. To better verify the effectiveness of the algorithm model, two groups of experiments are set up: (1) a single-target-point navigation experiment; (2) a multi-target-point navigation experiment, in which, whenever the mobile robot reaches the target point in the current environment, the next target point is randomly generated in the map and the robot continues to navigate to the new target point. Both groups of experiments run for at most 400,000 training steps, and the mobile robot may move at most 500 steps within one training episode, after which the episode is terminated. The hyperparameters used by the invention are shown in Table 1:
TABLE 1 Hyperparameter settings

Parameter name        Value
learning_rate         0.00025
Epsilon (initial)     1
Final epsilon         0.1
Replay memory size    100000
Training steps        400000
Batch size            32
Gamma                 0.99
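These hyperparameters can be collected in a small configuration object, as sketched below; the field names are illustrative, and the linear epsilon-decay schedule from 1 to 0.1 is an assumption, since the text only gives the initial and final values.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    """Hyperparameters from Table 1 (the epsilon schedule shape is an assumption)."""
    learning_rate: float = 0.00025
    epsilon_start: float = 1.0
    epsilon_final: float = 0.1
    replay_memory_size: int = 100_000
    training_steps: int = 400_000
    batch_size: int = 32
    gamma: float = 0.99

    def epsilon(self, step: int) -> float:
        # Linear decay from epsilon_start to epsilon_final over training (assumed schedule).
        frac = min(step / self.training_steps, 1.0)
        return self.epsilon_start + frac * (self.epsilon_final - self.epsilon_start)
```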
The training steps of the invention are as follows (a condensed sketch of this loop is given after the list):
(1) In the preparation phase, generate the map required for the experiment, including the target points, obstacles and the mobile robot.
(2) Initialize the neural network parameters, the experience replay pool and the experimental environment, and obtain the current state of the mobile robot.
(3) Judge whether the maximum number of training steps has been reached; if so, stop, otherwise execute step (4).
(4) Take the current state s_t of the mobile robot as the input of the neural network, compute the Q value corresponding to each action, and select the action a_t for the current state s_t with the ε-greedy algorithm.
(5) The mobile robot executes action a_t and interacts with the experimental environment to obtain the new state s_{t+1}, the extrinsic reward r^e_t and the flag indicating whether the episode has ended. If the target point is reached the episode ends and the reward r_arrive is given; if the mobile robot collides, the reward r_collision is assigned; if training has not finished after more than 500 steps, the reward r_timeout is assigned; otherwise the negative step reward of -0.075 is received.
(6) Pass the original state s_t, the selected action a_t and the new state s_{t+1} to the ICM module to compute the intrinsic reward r^i_t. The total reward obtained by this action of the mobile robot is given in Equation 8:
r_t = r^e_t + r^i_t   (8)
The tuple {s_t, s_{t+1}, a_t, r_t, done} is stored in the prioritized experience replay pool D.
(7) If the episode has ended, end the current training episode. Otherwise, sample m transitions from the experience replay pool D, use them to compute the target value y_t of the current Q network, compute the loss function and update the network parameters of the current Q network.
(8) Copy the network parameters of the current Q network to the target Q network at a lower update rate, thereby updating the target Q network, and update the priorities of the tuples in the experience replay pool D.
(9) When the training reaches the specified number of steps, training ends.
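The loop below is a condensed sketch of steps (2) to (9). It assumes the DuelingQNetwork, ICM, double_dqn_target, icm_training_loss and TrainingConfig sketches given earlier, a prioritized replay buffer with add/sample/update_priorities methods, and an environment exposing reset(), step() returning (next_state, reward, done) and sample_action(); all of these interfaces, as well as the soft-update rate, are assumptions rather than the patent's actual implementation.

```python
import random
import torch

def train(env, q_net, target_net, icm, replay, cfg: TrainingConfig):
    optimizer = torch.optim.Adam(list(q_net.parameters()) + list(icm.parameters()),
                                 lr=cfg.learning_rate)
    state = env.reset()
    for step in range(cfg.training_steps):                       # step (3)
        # Step (4): epsilon-greedy action selection on the current Q network.
        if random.random() < cfg.epsilon(step):
            action = env.sample_action()
        else:
            with torch.no_grad():
                action = int(q_net(torch.as_tensor(state).float().unsqueeze(0)).argmax())

        # Step (5): interact with the environment and obtain the extrinsic reward.
        next_state, extrinsic_reward, done = env.step(action)

        # Step (6): intrinsic reward from the ICM, total reward r_t = r^e_t + r^i_t (Eq. 8).
        s = torch.as_tensor(state).float().unsqueeze(0)
        s_next = torch.as_tensor(next_state).float().unsqueeze(0)
        with torch.no_grad():
            r_int, _, _ = icm(s, s_next, torch.tensor([action]))
        replay.add((state, next_state, action, extrinsic_reward + float(r_int), done))

        # Step (7): sample a mini-batch and update the current Q network and the ICM.
        if len(replay) >= cfg.batch_size:
            (states, next_states, actions, rewards, dones), indices, weights = replay.sample(cfg.batch_size)
            q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            y = double_dqn_target(q_net, target_net, rewards, next_states, dones, cfg.gamma)
            td_error = y - q
            td_loss = (weights * td_error.pow(2)).mean()
            _, inv_loss, fwd_loss = icm(states, next_states, actions)
            loss = icm_training_loss(inv_loss, fwd_loss, td_loss)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Step (8): soft-update the target network and refresh replay priorities.
            tau = 0.005  # assumed soft-update rate
            for p_t, p in zip(target_net.parameters(), q_net.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
            replay.update_priorities(indices, td_error.abs().detach())

        state = env.reset() if done else next_state
```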
Analysis of experimental results: three different algorithms are compared, namely the proposed algorithm model (ours), D3QN-ICM and D3QN.
(1) In the single-target-point navigation environment, the proposed algorithm model and the D3QN-ICM model both obtain a reward of about 18 after 400,000 training steps, whereas the D3QN algorithm only obtains a reward of about 11. At the same time, the reward of the proposed model begins to stabilize at about 100,000 steps, while the D3QN-ICM model does not stabilize until about 125,000 steps, so the proposed algorithm model performs better in the single-target-point navigation environment.
(2) In the multi-target-point navigation environment, after about 400,000 training steps the proposed algorithm model obtains 42.73 points, the D3QN-ICM model obtains 24.76 points and the D3QN algorithm only 10.31 points. Compared with the D3QN-ICM algorithm the performance is improved by about 70%, and the advantage is larger than in the simple navigation environment, showing that the proposed algorithm model performs even better in complex navigation tasks.
The foregoing description is only exemplary embodiments of the invention and is not intended to limit the scope of the invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation, based on a deep reinforcement learning algorithm and introducing intrinsic motivation theory, characterized by comprising the following steps:
(1) in the preparation stage, generating the map required for the experiment, the map comprising target points, obstacles and the mobile robot;
(2) in the navigation training stage, training the agent according to the extrinsic rewards obtained by interacting with the environment and the intrinsic rewards generated by the ICM module:
(2.1) initializing the neural network parameters, the experience replay pool and the experimental environment, and obtaining the current state of the mobile robot;
(2.2) judging whether the maximum number of training steps has been reached; if so, ending, otherwise executing step (2.3);
(2.3) taking the current state s_t of the mobile robot as the input of the neural network, computing the Q value corresponding to each action, and selecting the action a_t for the current state s_t with the ε-greedy algorithm;
(2.4) the mobile robot executing action a_t and interacting with the experimental environment to obtain the new state s_{t+1}, the extrinsic reward r^e_t and the flag indicating whether the episode has ended;
(2.5) taking the original state s_t of the robot, the selected action a_t and the new state s_{t+1} as input to the ICM module to compute the intrinsic reward r^i_t, as shown in Equation 1:
r^i_t = (η/2) ||φ̂(s_{t+1}) − φ(s_{t+1})||_2^2   (1)
the total reward obtained by this action of the mobile robot being given in Equation 2:
r_t = r^e_t + r^i_t   (2)
(2.6) storing the tuple {s_t, s_{t+1}, a_t, r_t, done} in the experience replay pool D;
(2.7) if the episode has ended, ending the current training episode; otherwise sampling m transitions from the experience replay pool D and using them to compute the target value y_t of the current Q network, as shown in Equation 3:
y_t = r_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; ω_e); ω_t)   (3)
where ω_e and ω_t denote the parameters of the evaluation network and the target network, respectively; the D3QN model uses the evaluation network to select, in state s_{t+1}, the action with the best action value, and then uses the target network to compute the value of that action, thereby obtaining the target value;
(2.8) computing the loss function L(θ) and updating the network parameters of the current Q network;
(2.9) copying the network parameters of the current Q network to the target Q network at a lower update rate, thereby updating the target Q network;
(2.10) updating the priorities of the tuples in the experience replay pool D;
(2.11) ending the training when it reaches the specified number of steps.
2. The mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation of claim 1, wherein an intrinsic motivation module is introduced on the basis of the deep reinforcement learning algorithm, effectively solving the problem of reward sparsity in the autonomous navigation of the mobile robot.
3. The mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation of claim 1, wherein a prioritized experience replay pool is used to optimize the experience replay process, improving the navigation efficiency of the mobile robot.
CN202310010366.6A 2023-01-04 2023-01-04 Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation Pending CN116147627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310010366.6A CN116147627A (en) 2023-01-04 2023-01-04 Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310010366.6A CN116147627A (en) 2023-01-04 2023-01-04 Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation

Publications (1)

Publication Number Publication Date
CN116147627A true CN116147627A (en) 2023-05-23

Family

ID=86352032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310010366.6A Pending CN116147627A (en) 2023-01-04 2023-01-04 Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation

Country Status (1)

Country Link
CN (1) CN116147627A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116892932A (en) * 2023-05-31 2023-10-17 三峡大学 Navigation decision method combining curiosity mechanism and self-imitation learning
CN116892932B (en) * 2023-05-31 2024-04-30 三峡大学 Navigation decision method combining curiosity mechanism and self-imitation learning
CN117490696A (en) * 2023-10-23 2024-02-02 广州创源机器人有限公司 Method for accelerating navigation efficiency of robot
CN118603105A (en) * 2024-08-08 2024-09-06 青岛理工大学 Air-ground heterogeneous robot navigation method, equipment and medium

Similar Documents

Publication Publication Date Title
CN110262511B (en) Biped robot adaptive walking control method based on deep reinforcement learning
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN111098852B (en) Parking path planning method based on reinforcement learning
CN112132263B (en) Multi-agent autonomous navigation method based on reinforcement learning
CN102402712B (en) Robot reinforced learning initialization method based on neural network
CN116147627A (en) Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
CN112362066A (en) Path planning method based on improved deep reinforcement learning
CN105700526A (en) On-line sequence limit learning machine method possessing autonomous learning capability
CN111783994A (en) Training method and device for reinforcement learning
CN113341972A (en) Robot path optimization planning method based on deep reinforcement learning
CN114518751A (en) Path planning decision optimization method based on least square truncation time domain difference learning
CN117590867A (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
CN114721397B (en) Maze robot path planning method based on reinforcement learning and curiosity
He et al. Decentralized exploration of a structured environment based on multi-agent deep reinforcement learning
Gromniak et al. Deep reinforcement learning for mobile robot navigation
CN114626505A (en) Mobile robot deep reinforcement learning control method
CN117302204B (en) Multi-wind-lattice vehicle track tracking collision avoidance control method and device based on reinforcement learning
CN115016499B (en) SCA-QL-based path planning method
Zhang et al. Route searching based on neural networks and heuristic reinforcement learning
Tan et al. PL-TD3: A Dynamic Path Planning Algorithm of Mobile Robot
CN116841303A (en) Intelligent preferential high-order iterative self-learning control method for underwater robot
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Gross et al. Sensory-based Robot Navigation using Self-organizing Networks and Q-learning
CN114118371A (en) Intelligent agent deep reinforcement learning method and computer readable medium
Bhatia et al. Reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination