End-to-End AUV Motion Planning Method Based on Soft Actor-Critic
Figure 1. Major components of the AUV.
Figure 2. Markov decision process.
Figure 3. AUV sonar model.
Figure 4. Neural network structure of the SAC algorithm.
Figure 5. Neural network structure of the GAIL algorithm.
Figure 6. Flow chart of obtaining rewards.
Figure 7. Experimental environment.
Figure 8. Real training environment in the Unity interface.
Figure 9. Curve of the reward value.
Figure 10. Curve of the episode length.
Figure 11. AUV motion trajectory.
Figure 12. AUV local trajectory.
Figure 13. Six groups of actual planned trajectory routes.
Figure 14. Curve of surge velocity.
Figure 15. Curve of surge force.
Figure 16. Curve of yaw moment output by AUV.
Figure 17. Curve of angular velocity in the yaw output by AUV.
Figure 18. Angle curve of AUV.
Abstract
1. Introduction
1.1. Background
- AUV motion planning is a difficult navigation problem because of the uncertainty of the marine environment, the dynamic constraints of the AUV itself, and the limited ability of obstacle-avoidance sonar and other sensors to perceive the marine environment.
- Many early methods divide the motion planning task into two parts, path planning and path following, which complicates the design. Both modules depend on the characteristics of the environment and the dynamic constraints of the system, making the overall system sensitive to changes. As a result, the robot can only obtain strategies for a single environment and adapts poorly to new environments.
- Most existing methods have weak exploration ability, easily fall into local optima, and cannot accomplish the AUV motion planning task under multiple constraints.
- DRL does not rely on labeled supervision, and some algorithms require tens of millions of interactions with the environment to learn a successful strategy, so training is slow and costly. This problem is aggravated by the large number of hyperparameters that must be tuned.
- The reward function is difficult to design, and the quality of its design directly affects training success. At present, reward functions are formulated by individual researchers for their specific problems, and no general guiding rules have been established. Furthermore, AUV motion planning is a sparse-reward task: in the early stage of training, the robot rarely obtains a positive reward, which makes training difficult.
1.2. Related Work
1.3. Proposed Solution
- Useful samples and expert demonstrations for an AUV are difficult to collect in IL, and the quality of the demonstrations strongly affects the training result.
- GAIL introduces a survivor bias into the learning process: because it assigns positive rewards based on similarity with the expert, it encourages the agent to survive as long as possible, which directly conflicts with goal-oriented tasks. In this case, the proportion between the GAIL signal and the external reward is difficult to coordinate (a minimal sketch of this reward mixing follows this list).
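As a concrete illustration of this coordination problem, the sketch below shows one common way a GAIL discriminator reward can be mixed with the external environment reward. It is a minimal sketch rather than the authors' implementation; `d_prob` (the discriminator output D(s, a)) and `gail_strength` are illustrative names.

```python
import math

def mixed_reward(ext_reward, d_prob, gail_strength=0.01, eps=1e-8):
    """Combine the environment reward with a GAIL reward signal.

    d_prob is the discriminator output D(s, a) in (0, 1): the estimated
    probability that the state-action pair came from the expert demonstration.
    """
    # A common GAIL surrogate reward, -log(1 - D(s, a)), is always positive,
    # which is precisely the source of the survivor bias described above.
    gail_reward = -math.log(1.0 - d_prob + eps)
    return ext_reward + gail_strength * gail_reward
```

Because the GAIL term is never negative, its weight relative to the sparse external reward (here `gail_strength`) largely determines whether the agent prioritizes imitating the expert or reaching the goal.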
2. AUV Model and System Description
2.1. Preliminaries
2.2. Problem Formulation
3. Method
3.1. State Space and Action Space of AUV
3.2. SAC
- Policy $\pi_\phi$ with the neural network parameter $\phi$;
- Soft action value function $Q_\theta$ with the neural network parameter $\theta$;
- Soft state value function $V_\psi$ parameterized by $\psi$. In principle, there is no need to set a separate function approximator for the state value, because $V$ can be derived from $Q_\theta$ and $\pi_\phi$ according to Equation (22).
Algorithm 1: SAC | |
1: Input: $\theta_1$, $\theta_2$, $\phi$ | Initial parameters |
2: $\bar{\theta}_1 \leftarrow \theta_1$, $\bar{\theta}_2 \leftarrow \theta_2$ | Initialize target network weights |
3: $\mathcal{D} \leftarrow \varnothing$ | Initialize an empty replay buffer |
4: for each iteration do | |
5: for each environment step do | |
6: $a_t \sim \pi_\phi(a_t \mid s_t)$ | Sample action from the policy |
7: $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$ | Sample transition from the environment |
8: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r(s_t, a_t), s_{t+1})\}$ | Store the transition in the replay buffer |
9: end for | |
10: for each gradient step do | |
11: $\theta_i \leftarrow \theta_i - \lambda_Q \hat{\nabla}_{\theta_i} J_Q(\theta_i)$ for $i \in \{1, 2\}$ | Update the Q-function parameters |
12: $\phi \leftarrow \phi - \lambda_\pi \hat{\nabla}_\phi J_\pi(\phi)$ | Update policy weights |
13: $\alpha \leftarrow \alpha - \lambda \hat{\nabla}_\alpha J(\alpha)$ | Adjust temperature |
14: $\bar{\theta}_i \leftarrow \tau \theta_i + (1 - \tau) \bar{\theta}_i$ for $i \in \{1, 2\}$ | Update target network weights |
15: end for | |
16: end for | |
17: Output: $\theta_1$, $\theta_2$, $\phi$ | Optimized parameters |
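For readers who prefer code, the following is a minimal PyTorch sketch of one gradient step of Algorithm 1 (lines 11–14). It assumes twin soft Q-networks with a shared optimizer, a squashed-Gaussian actor exposing a `sample(states)` method that returns reparameterized actions and their log-probabilities, and a learnable `log_alpha`; all of these names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sac_gradient_step(batch, actor, q1, q2, q1_targ, q2_targ,
                      q_opt, pi_opt, alpha_opt, log_alpha,
                      target_entropy, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch              # tensors sampled from the replay buffer
    alpha = log_alpha.exp()

    # Line 11: update the soft Q-functions toward the entropy-regularized target.
    with torch.no_grad():
        a2, logp2 = actor.sample(s2)
        q_next = torch.min(q1_targ(s2, a2), q2_targ(s2, a2)) - alpha * logp2
        y = r + gamma * (1.0 - done) * q_next
    q_loss = F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Line 12: update the policy to maximize soft Q minus the entropy penalty.
    a_new, logp = actor.sample(s)
    pi_loss = (alpha.detach() * logp - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Line 13: adjust the temperature so the policy entropy tracks target_entropy.
    alpha_loss = -(log_alpha * (logp.detach() + target_entropy)).mean()
    alpha_opt.zero_grad(); alpha_loss.backward(); alpha_opt.step()

    # Line 14: Polyak averaging of the target Q-networks.
    with torch.no_grad():
        for net, net_targ in ((q1, q1_targ), (q2, q2_targ)):
            for p, p_targ in zip(net.parameters(), net_targ.parameters()):
                p_targ.mul_(1.0 - tau).add_(tau * p)
```

The defaults `gamma=0.99` and `tau=0.005` match the discount factor and `tau` values listed in the training-parameter table later in the paper.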
3.3. GAIL
3.4. Reward Function
4. Simulation and Results
- The instability of reinforcement learning training affects the experimental results. Even when the number of training steps and other variables are kept consistent, the strategies obtained from repeated training runs differ, so the training results differ. For target locations that are randomly generated more often and reached earlier by the AUV during training, the learned strategy is more complete, and the planned route and time are better;
- According to the reward and episode-length curves, both SAC and SAC-GAIL converge completely within 8 million training steps; however, SAC-GAIL converges significantly faster. When the number of training steps is sufficient and the reward function is properly set, the AUV can discover the terminal reward during training, and SAC can still exploit its advantages. If the total number of training steps were limited to a smaller budget, the planning performance of SAC would be inferior to that of SAC-GAIL;
- The quality of the samples in the demonstration used for IL also greatly affects SAC-GAIL. The demonstration used in this experiment was recorded from the strategy obtained after SAC training; hence, the planning performance does not differ much from that of SAC. The proportion between the GAIL signal and the external rewards also changes the total reward the AUV receives in each episode, which in turn affects the learned strategy;
- In most cases, SAC-GAIL is superior to SAC in only one of the two metrics, planned trajectory distance or planning time. The reason is that AUV motion planning cannot optimize time and distance at the same time: when the AUV sails at a higher speed, it has a larger turning radius and travels a longer distance per unit time, so reaching the target point quickly requires relaxing the distance constraint, and vice versa. GAIL introduces a survivor bias into the learning process, encouraging the agent to survive as long as possible by rewarding similarity with the expert, which directly conflicts with goal-oriented tasks. Therefore, the trajectory planned by SAC-GAIL is shorter in most cases, whereas its planning time is longer than that of SAC.
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
References
Algorithm | Pros | Cons
---|---|---
Geometric model search | Simple and easy to implement | Low flexibility, poor real-time performance, lack of intelligent understanding
Rapidly exploring random tree | Searches high-dimensional spaces quickly and efficiently | The planning result is not optimal
Artificial potential field method | Simple operation and high real-time performance | Local optimum problem
Curve interpolation | Intuitive algorithm; the planned trajectory is very smooth | Large amount of calculation and low real-time performance
Genetic algorithm | Good robustness; easy to combine with other methods | Slow convergence speed
RL: DQN | Solves the dimension-disaster problem when the state and action spaces are too large | Single-step update; only applicable to discrete states and actions
RL: DDPG | Solves motion problems in the continuous action space | Low policy randomness and poor convergence
RL: PPO | Handles both discrete and continuous control | On-policy, so sample efficiency is low
RL: SAC | Off-policy; less sensitive to hyperparameter values; strong exploration ability and high robustness | Learning a policy from scratch is difficult and time-consuming
Imitation learning | Fast convergence with expert guidance | Good expert samples are difficult to obtain
Parameter | Value |
---|---|
m | 45 kg |
L | 1.46 m |
1 | −1.5777 × 10^−3
 | −3.0753 × 10^−2
 | 9.4196 × 10^−4
 | −1.012 × 10^−1
 | −5.9 × 10^−3
 | −1.6687 × 10^−1
 | 0
 | 0
 | 1.258 × 10^−1
 | 0
 | 0
 | 0
 | −1.2432 × 10^−1
Sensors | Parameter | Value |
---|---|---|
DVL | Frequency | 600 kHz |
Accuracy | 1% ± 1 mm/s | |
Maximum Velocity | ±20 knots | |
OAS | Sharp Angle | |
Range | 0.6~120 m | |
Reliable Range | 1~20 m |
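The reliable range of the obstacle-avoidance sonar (1–20 m) suggests a simple preprocessing step before the returns enter the observation vector. The sketch below, with the illustrative helper `normalize_sonar`, shows one way to clip and normalize the ranges to [0, 1]; it is an assumption for illustration, not the paper's stated preprocessing.

```python
def normalize_sonar(ranges_m, r_min=1.0, r_max=20.0):
    """Map raw sonar ranges in metres to [0, 1], where 1.0 means no obstacle
    was detected within the reliable range."""
    normalized = []
    for r in ranges_m:
        r = min(max(r, r_min), r_max)   # clip to the reliable range (1-20 m)
        normalized.append((r - r_min) / (r_max - r_min))
    return normalized

print(normalize_sonar([0.4, 5.0, 35.0]))   # -> [0.0, 0.21, 1.0] (approximately)
```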
Steps | Observation Vector | Observation Vector After Stacking 1 |
---|---|---|
Step 1 | [0.6] | [0.6, 0.0, 0.0] |
Step 2 | [0.4] | [0.6, 0.4, 0.0] |
Step 3 | [0.9] | [0.9, 0.4, 0.6] |
Step 4 | [0.2] | [0.2, 0.9, 0.4] |
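A minimal sketch of this fixed-size stacking with zero padding is given below, assuming the newest observation is placed first, as in Steps 3 and 4 of the table; `make_stacker` is an illustrative helper, not part of the authors' code or the ML-Agents toolkit.

```python
from collections import deque

def make_stacker(stack_size=3):
    """Return a closure that stacks the most recent observations, newest first,
    zero-padded until stack_size observations have been seen."""
    buf = deque(maxlen=stack_size)

    def stack(obs):
        buf.appendleft(obs)
        return list(buf) + [0.0] * (stack_size - len(buf))

    return stack

stacker = make_stacker()
for obs in (0.6, 0.4, 0.9, 0.2):
    stacked = stacker(obs)
# After the last call, stacked == [0.2, 0.9, 0.4], matching Step 4 of the table.
```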
Element | Position | Collision Size | |
---|---|---|---|
AUV | (0,0) | Same size as actual model | |
Goal | Radius = 1 | ||
/ | Length | Width | |
Cruciform Obstacle 1 | (16.5, 16.5) | 9 | 0.6 |
Cruciform Obstacle 2 | (16.5, −16.5) | 9 | 0.6 |
Cruciform Obstacle 3 | (−16.5, −16.5) | 9 | 0.6 |
Cruciform Obstacle 4 | (−16.5, 16.5) | 9 | 0.6 |
Rectangular Obstacle 1 | (15, 45) | 8 | 1 |
Rectangular Obstacle 2 | (45, 15) | 8 | 1 |
Rectangular Obstacle 3 | (45, −15) | 8 | 1 |
Rectangular Obstacle 4 | (15, −45) | 8 | 1 |
Rectangular Obstacle 5 | (−15, −45) | 8 | 1 |
Rectangular Obstacle 6 | (−45, −15) | 8 | 1 |
Rectangular Obstacle 7 | (−45, 15) | 8 | 1 |
Rectangular Obstacle 8 | (−15, 45) | 8 | 1 |
Reward Parameter | Value |
---|---|
 | +20
 | −20
 | 10 × 10^−1
 | 10 × 10^−2
 | /(180 × 25,000)
 | −10 × 10^−5
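Reading the table above together with the reward flow chart, a heavily hedged sketch of a sparse terminal reward with shaping is given below. The mapping of the table's coefficients to the shaping terms (a progress coefficient of 1.0 and a per-step penalty of −1 × 10^−4) and the function itself are illustrative assumptions, not the authors' reward definition in Section 3.4.

```python
def step_reward(reached_goal, collided, prev_dist, curr_dist,
                k_progress=1.0, step_penalty=-1e-4):
    """Sparse terminal reward plus a small shaping term for goal progress."""
    if reached_goal:
        return 20.0                     # terminal reward for reaching the goal
    if collided:
        return -20.0                    # terminal penalty for a collision
    # Positive when the AUV moves closer to the goal, plus a per-step penalty
    # that discourages wandering.
    return k_progress * (prev_dist - curr_dist) + step_penalty
```

With terminal terms this large, the ±20 rewards dominate; the shaping terms mainly ease the sparse-reward exploration problem noted in Section 1.1.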
Parameters | PPO | SAC |
---|---|---|
batch_size (number of experiences in each iteration of gradient descent) | 2048 | 256 |
buffer_size (number of experiences to collect before updating the policy model) | 20,480 | 1,000,000 |
learning rate | 0.0003 | 0.0003 |
learning_rate_schedule (determines how the learning rate changes over time) | Linear decay | Fixed constant |
β (strength of the entropy regularization in PPO) | 0.001 | /
ε (influences how rapidly the policy can evolve during training in PPO) | 0.2 | /
λ (regularization parameter used when calculating the Generalized Advantage Estimate in PPO) | 0.95 | /
num_epoch (number of passes to make through the experience buffer when performing gradient descent optimization in PPO) | 3 | / |
tau (how aggressively to update the target network used for bootstrapping value estimation in SAC) | / | 0.005 |
steps_per_update (average ratio of actions taken to updates made of the agent’s policy in SAC) | / | 10.0 |
reward_signal_steps_per_update (number of steps per mini batch sampled and used for updating the reward signals in SAC) | / | 10.0 |
hidden_units (number of units in the hidden layers of the neural network) | 256 | 256 |
num_layers (the number of hidden layers in the neural network) | 2 | 2 |
γ (discount factor for future rewards) | 0.99 | 0.99
max_steps/per episode | 16,000 | 16,000 |
max_steps | 8,000,000 | 8,000,000 |
Target Coordinates | Index | PPO | SAC | SAC-GAIL |
---|---|---|---|---|
(−15.4, 31.4) | Distance (m) | 68.48 | 39.10 | 38.01 |
Time (s) | 42.34 | 25.50 | 26.42 | |
(16.8, 30.7) | Distance (m) | 49.27 | 35.65 | 35.55 |
Time (s) | 27.90 | 20.52 | 20.32 | |
(25.0, 24.5) | Distance (m) | 51.30 | 40.30 | 40.76 |
Time (s) | 28.72 | 21.72 | 22.08 | |
(−29.2, −19.3) | Distance (m) | 75.10 | 39.96 | 37.60 |
Time (s) | 50.16 | 31.56 | 32.40 | |
(−11.6, −33.0) | Distance (m) | 46.04 | 38.71 | 38.20 |
Time (s) | 26.04 | 23.76 | 23.78 | |
(23.1, −26.3) | Distance (m) | 64.00 | 41.03 | 40.92 |
Time (s) | 43.72 | 22.56 | 22.04 |