CN115542912B - A mobile robot path planning method based on improved Q-learning algorithm - Google Patents
Info
- Publication number
- CN115542912B (application number CN202211213330.XA)
- Authority
- CN
- China
- Prior art keywords
- value
- action
- reward
- improved
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000004422 calculation algorithm Methods 0.000 title claims abstract description 47
- 238000000034 method Methods 0.000 title claims abstract description 41
- 230000009471 action Effects 0.000 claims abstract description 91
- 230000006870 function Effects 0.000 claims description 49
- 230000006399 behavior Effects 0.000 claims description 11
- 230000007613 environmental effect Effects 0.000 claims description 9
- 239000011159 matrix material Substances 0.000 claims description 9
- 230000003542 behavioural effect Effects 0.000 claims description 6
- 238000013461 design Methods 0.000 claims description 6
- 238000005381 potential energy Methods 0.000 claims description 5
- 239000002245 particle Substances 0.000 claims description 4
- 238000004364 calculation method Methods 0.000 claims description 2
- 238000004088 simulation Methods 0.000 abstract 1
- 230000008569 process Effects 0.000 description 8
- 238000004590 computer program Methods 0.000 description 7
- 238000010586 diagram Methods 0.000 description 7
- 238000012545 processing Methods 0.000 description 4
- 230000002787 reinforcement Effects 0.000 description 4
- 230000008859 change Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
Description
Technical Field
The present invention relates to the technical field of robot navigation planning, and in particular to a mobile robot path planning method based on an improved Q-learning algorithm.
Background Art
With the introduction of the "goods-to-person" picking model, mobile robots are widely used in smart warehouses, and their introduction has improved warehouse picking efficiency. Path planning, as one of the core technologies of mobile robots, has therefore received increasing attention. Path planning refers to planning a collision-free, near-optimal path based on the environment in which the mobile robot operates, combined with evaluation criteria such as shortest path, shortest planning time and path smoothness.
Path planning research dates back to the 1960s. Algorithms commonly used for global path search include the Dijkstra algorithm, the A* algorithm and the artificial potential field method, as well as heuristic intelligent search methods such as the ant colony algorithm and the particle swarm algorithm. Traditional methods, however, are complex to operate and inefficient at solving the problem, while heuristic algorithms are difficult to design and understand. In recent years, with deepening research on reinforcement learning, some scholars have begun to apply it to path planning. The most widely used reinforcement learning algorithm in mobile robot path planning is Q-learning, a temporal-difference reinforcement learning algorithm. Its process is as follows: the mobile robot first selects and executes an action a from all possible actions in state s, then evaluates the result of that action based on the immediate reward obtained for action a and the current estimate of the state-action value. By repeating all actions in all states, the mobile robot learns the overall best behaviour by judging the long-term discounted return. As a model-free reinforcement learning method, traditional Q-learning enables the mobile robot to plan a good collision-free path through real-time interaction with the environment using its learning mechanism; it requires no environment model and performs well in complex environments. However, it still suffers from blind exploration in the early stage, long learning time, slow convergence and poor path smoothness.
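For concreteness, the plain tabular Q-learning procedure described above can be sketched as follows; the grid size, hyper-parameters and classic ε-greedy selection are illustrative assumptions rather than the patent's improved version, which is introduced below.

```python
import numpy as np

# Minimal tabular Q-learning sketch for the process described above.
# Grid size, reward values and hyper-parameters are illustrative assumptions.
n_states, n_actions = 100, 4            # e.g. a 10x10 grid with 4 moves
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
rng = np.random.default_rng(0)

Q = np.zeros((n_states, n_actions))

def choose_action(state):
    """Classic epsilon-greedy selection used by plain Q-learning."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore a random action
    return int(np.argmax(Q[state]))           # exploit the current estimate

def q_update(state, action, reward, next_state):
    """Temporal-difference update toward r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```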
Summary of the Invention
In view of this, the object of the present invention is to provide a mobile robot path planning method based on an improved Q-learning algorithm, which obtains the shortest path while improving the convergence speed of the algorithm and the smoothness of the path.
To achieve the above object, the present invention adopts the following technical solution: a mobile robot path planning method based on an improved Q-learning algorithm, comprising the following steps:
Step 1: model the environment map with the grid method and build the obstacle matrix;
Step 2: the mobile robot is treated as a point mass in the two-dimensional environment and can only search and move in four directions, namely up, down, left and right; the side length of each grid cell is one unit, and the robot moves one step length per move;
Step 3: design the reward function and build the reward matrix R and the Q-value table Q;
Step 4: initialize the Q-value table with the improved potential field function, and initialize the parameters of the algorithm, including the number of iterations, the ε-exploration probability ε, the learning rate α, the reward decay factor γ, the behaviour-utility decay coefficient p1, the exploration incentive coefficient p2 and the weight coefficient β;
Step 5: initialize the starting point and the target point;
Step 6: start exploring: select the action to execute according to the improved ε exploration strategy, obtain the immediate reward after executing the action, compute the distance function and update the action utility function;
Step 7: update the Q-value table and the action execution probabilities according to the executed action, where the Q-value table is updated as:
Q(s,a) = Q(s,a) + α[Rt + γ·max_a′ Q(s′,a′) − Q(s,a)]
where α is the learning rate, α ∈ [0,1]; γ is the discount factor, γ ∈ [0,1]; Rt is the immediate reward; and s′, a′ are the next state and the next action;
The update formula for the action execution probability is as follows:
where n is the total number of actions and E is the behaviour utility function;
Step 8: update the position state reached after executing the action to the current position state, and check whether the current position is the goal position or whether the maximum number of steps has been exceeded; if not, return to step 6; if so, go to step 9;
Step 9: record the path of each learning episode and check whether the maximum number of iterations has been reached; if so, output the optimal path; otherwise return to step 5.
In a preferred embodiment, the reward function designed in step 3 is:
R = −r1 if the agent hits an obstacle; R = r2 if the agent reaches the target point; R = 0 otherwise,
where r1 and r2 are both positive numbers: when the agent hits an obstacle it receives the reward −r1; when it reaches the target point it receives the reward r2; at any other position the reward is 0.
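A minimal sketch of this piecewise reward is given below; the concrete values of r1 and r2 are assumptions, since the text only requires them to be positive.

```python
def reward(cell, goal, obstacles, r1=1.0, r2=10.0):
    """Piecewise reward from step 3: -r1 on collision, r2 at the goal, 0 elsewhere.
    The numeric defaults for r1 and r2 are illustrative assumptions."""
    if cell in obstacles:
        return -r1
    if cell == goal:
        return r2
    return 0.0
```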
In a preferred embodiment, the improved potential field function in step 4 is:
where C = L = X + Y, X is the horizontal length of the environment and Y is its vertical length; d1(s) and d2(s) are the vertical and horizontal distances between the agent's current position and the target point, respectively; and d(s) is the Euclidean distance from the agent's current position to the line connecting the target point and the starting point;
The Q-value table is initialized as:
Q(s,a) = R + γV(S)
where R is the initial reward function matrix, γ is the reward decay factor, and V(S) is the value function of all states initialized through the attraction field function. In the Q-value table initialized in this way, the Q value increases as the agent gets closer to the target point, the target point has the maximum Q value, and the Q value at obstacles is 0.
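The potential-field expression itself appears only as an image in the original, so the sketch below substitutes a simple goal-attraction surrogate with the qualitative properties stated in the text (Q grows toward the target, is maximal at the target, and obstacles contribute no potential); the surrogate V and the Manhattan distance term are assumptions, not the patented function.

```python
import numpy as np

def init_q_table(grid, goal, R, gamma=0.9):
    """Q(s, a) = R + gamma * V(s), with V a surrogate attraction potential.
    grid: 2-D array where 1 marks an obstacle and 0 a free cell;
    R: reward matrix of shape (n_states, n_actions)."""
    X, Y = grid.shape
    C = X + Y                              # scale constant, as in C = L = X + Y
    gx, gy = goal
    V = np.zeros(grid.shape)
    for x in range(X):
        for y in range(Y):
            if grid[x, y] == 1:
                continue                   # obstacles keep zero potential
            d = abs(x - gx) + abs(y - gy)  # assumed goal-distance term
            V[x, y] = C - d                # larger nearer the goal, max at the goal
    return R + gamma * V.reshape(-1, 1)    # broadcast the state value over actions
```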
In a preferred embodiment, the improved ε exploration strategy in step 6 proceeds as follows: when a random value, drawn between 0 and 1, is less than the greedy factor, the action with the highest execution probability is selected; when the random value is greater than the greedy factor, the Q values for transitioning from the current state to the next state are updated according to the execution probability of each action, and the action with the highest Q value is executed. The update formula is:
TQ = Q + β × (P1, P2, …, Pi), i ∈ (1, n)
where TQ is the Q value updated according to the new action execution probabilities, β is the weight coefficient, and P1, P2, …, Pi are the probabilities of each action being executed. The improved action selection strategy lets the agent, when choosing the action to execute in each state, perform multi-step exploration using the environment information already explored; it jointly considers the distance information of the agent's previous and current states relative to the target point together with the multi-step action information, and selects the "optimal" action to execute;
The distance function in step 6 is:
where the two terms denote the distances from the previous state and from the current state to the target point, respectively;
The action utility function and its calculation rule are:
where p1 is the decay coefficient, p2 is the exploration incentive coefficient, rt is the immediate reward, and ai denotes the individual actions. The E value of each action is updated according to the magnitude of the immediate reward and whether the same action has been executed consecutively: when the immediate reward is positive and the same action has been executed three consecutive times, E is updated according to the first case of the rule; when the immediate reward is positive and the same action has been executed twice consecutively, E is updated according to the second case; in all other cases the E value is zero.
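The two non-zero branches of the utility rule are given only as images in the original, so the expressions below are placeholders built from p1, p2 and rt; only the case structure (three repeats, two repeats, otherwise zero) follows the text.

```python
def update_utility(E, history, r_t, p1=0.9, p2=1.0):
    """Update the behaviour-utility value of the action just executed (history[-1]).
    E: array of utility values, one per action; history: actions executed so far.
    The two non-zero expressions are assumed placeholders."""
    a = history[-1]
    same3 = len(history) >= 3 and len(set(history[-3:])) == 1
    same2 = len(history) >= 2 and len(set(history[-2:])) == 1
    if r_t > 0 and same3:
        E[a] = p1 * E[a] + p2 * r_t          # assumed form of the three-repeat case
    elif r_t > 0 and same2:
        E[a] = p1 * E[a] + 0.5 * p2 * r_t    # assumed form of the two-repeat case
    else:
        E[a] = 0.0
    return E
```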
In a preferred embodiment, the environmental potential value is introduced as heuristic information to initialize the Q-value table.
In a preferred embodiment, the behaviour utility function is used as the criterion for evaluating executed actions, thereby improving the ε-greedy strategy: the probability of each of the agent's actions being selected is dynamically adjusted by combining the environment information the agent has already explored with the influence of the executed actions on the smoothness of the path segment.
Compared with the prior art, the present invention has the following beneficial effects: the proposed mobile robot path planning method based on an improved Q-learning algorithm addresses the shortcomings of the traditional Q-learning algorithm, namely highly blind early exploration, long learning time, slow convergence and poor path smoothness. A potential field function and a behaviour utility function are introduced to guide the agent to explore toward the target from the very beginning, so that when the agent selects an action in each state it performs multi-step exploration using the environment information already explored, jointly considers the distance of its previous and current states to the target point together with the multi-step action information, and selects the "optimal" action to execute. This improves the running efficiency of the algorithm, accelerates its convergence and improves the smoothness of the planned path.
Brief Description of the Drawings
FIG. 1 is a flow chart of the method according to a preferred embodiment of the present invention.
FIG. 2 shows the convergence curves of the traditional method, an existing method and the present method in a preferred embodiment of the present invention.
FIG. 3 compares the path search results of the traditional method, an existing method and the present method in a preferred embodiment of the present invention.
Detailed Description of Embodiments
The present invention is further described below with reference to the accompanying drawings and embodiments.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present application. Unless otherwise specified, all technical and scientific terms used herein have the same meanings as commonly understood by those of ordinary skill in the art to which the present application belongs.
It should also be noted that the terms used herein are only for describing specific embodiments and are not intended to limit the exemplary embodiments according to the present application; as used herein, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms. Furthermore, it should be understood that when the terms "comprise" and/or "include" are used in this specification, they indicate the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
In the mobile robot path planning method based on the improved Q-learning algorithm, the improvements centre on the Q-value table and the ε exploration strategy. A potential field function is introduced as heuristic information to initialize the Q-value table, guiding the agent to explore toward the target from the very beginning and removing the blindness of the algorithm's early exploration. At the same time, a behaviour utility function is introduced to improve the ε exploration strategy: the probability of each of the agent's actions being selected is dynamically adjusted by combining the environment information the agent has already explored with the influence of the executed actions on the smoothness of the path segment.
As shown in FIG. 1, the method of this embodiment of the present invention specifically comprises the following steps:
Step 1: model the environment map with the grid method and build the obstacle matrix.
Step 2: the mobile robot is treated as a point mass in the two-dimensional environment and can only search and move in four directions, namely up, down, left and right; the side length of each grid cell is one unit, and the robot moves one step length per move.
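A minimal sketch of steps 1 and 2 is given below; the 20×20 map size and the obstacle coordinates are arbitrary examples.

```python
import numpy as np

# Grid-method environment model: 0 marks a free cell, 1 marks an obstacle.
GRID_SIZE = 20
obstacle_matrix = np.zeros((GRID_SIZE, GRID_SIZE), dtype=int)
obstacle_matrix[5, 3:8] = 1        # example wall segments (assumed layout)
obstacle_matrix[10:15, 12] = 1

# The robot is a point mass that moves one cell per step in four directions.
ACTIONS = {0: (-1, 0),   # up
           1: (1, 0),    # down
           2: (0, -1),   # left
           3: (0, 1)}    # right

def move(pos, action):
    """Apply one unit move; stay in place if the move would leave the map."""
    dx, dy = ACTIONS[action]
    x, y = pos[0] + dx, pos[1] + dy
    if 0 <= x < GRID_SIZE and 0 <= y < GRID_SIZE:
        return (x, y)
    return pos
```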
Step 3: design the reward function and build the reward matrix R and the Q-value table Q. The reward function is:
R = −r1 if the agent hits an obstacle; R = r2 if the agent reaches the target point; R = 0 otherwise,
where r1 and r2 are both positive numbers: when the agent hits an obstacle it receives the reward −r1, when it reaches the target point it receives the reward r2, and at any other position the reward is 0.
Step 4: initialize the Q-value table with the improved potential field function and initialize the parameters of the algorithm, including the number of iterations, the ε-exploration probability ε, the learning rate α, the reward decay factor γ, the behaviour-utility decay coefficient p1, the exploration incentive coefficient p2 and the weight coefficient β. A traditional Q-value table is usually initialized to 0 or to a single uniform value, which makes early exploration highly blind and convergence slow. The present invention therefore combines the principle of the artificial potential field and designs a new potential field function, which is introduced as heuristic information and is defined as follows:
where C = L = X + Y, X is the horizontal length of the environment and Y is its vertical length; d1(s) and d2(s) are the vertical and horizontal distances between the agent's current position and the target point, respectively; d(s) is the Euclidean distance from the agent's current position to the line connecting the target point and the starting point.
The Q-value table is initialized as:
Q(s,a) = R + γV(S)
where R is the initial reward function matrix, γ is the reward decay factor, and V(S) is the value function of all states initialized through the attraction field function. In the Q-value table initialized in this way, the Q value increases as the agent gets closer to the target point, the target point has the maximum Q value, and the Q value at obstacles is 0, so the agent explores toward the target point from the very beginning, which speeds up convergence and improves planning efficiency.
Step 5: initialize the starting point and the target point.
Step 6: start exploring: select the action to execute according to the improved ε exploration strategy, obtain the immediate reward after executing the action, compute the distance function and update the action utility function.
With the traditional ε exploration strategy, action selection is highly random, which slows convergence and yields poor path smoothness. The present invention introduces the concept of a behaviour utility equation and designs such an equation to evaluate how good an executed action is and to dynamically adjust the probability of each of the agent's actions being selected. The behaviour utility equation is as follows:
where p1 is the decay coefficient, p2 is the exploration incentive coefficient, rt is the immediate reward, and ai denotes the individual actions. The E value of each action is updated according to the magnitude of the immediate reward and whether the same action has been executed consecutively: when the immediate reward is positive and the same action has been executed three consecutive times, E is updated according to the first case of the rule; when the immediate reward is positive and the same action has been executed twice consecutively, E is updated according to the second case; in all other cases the E value is zero.
The distance function is as follows:
where the two terms denote the distances from the previous state and from the current state to the target point, respectively.
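The distance function itself is shown only as an image; the sketch below uses Euclidean distances to the target and their difference as an assumed reading of "comparing the previous and current states".

```python
import math

def distance_gain(prev_pos, cur_pos, goal):
    """Compare the previous and current states' distances to the target point.
    Euclidean distance and a simple difference are assumptions standing in
    for the patent's (image-only) distance function."""
    d_prev = math.dist(prev_pos, goal)
    d_cur = math.dist(cur_pos, goal)
    return d_prev - d_cur    # positive when the step brought the robot closer
```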
The improved ε exploration strategy proceeds as follows: when a random value (between 0 and 1) is less than the greedy factor, the action with the highest execution probability is selected; when the random value is greater than the greedy factor, the Q values for transitioning from the current state to the next state are updated according to the execution probability of each action, and the action with the highest Q value is executed. The update formula is:
TQ = Q + β × (P1, P2, …, Pi), i ∈ (1, n)
where TQ is the Q value updated according to the new action execution probabilities, β is the weight coefficient, and P1, P2, …, Pi are the probabilities of each action being executed. The improved action selection strategy lets the agent, when choosing the action to execute in each state, perform multi-step exploration using the environment information already explored; it jointly considers the distance information of the agent's previous and current states relative to the target point together with the multi-step action information, and selects the "optimal" action to execute.
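A sketch of this selection rule follows; tie-breaking and the multi-step bookkeeping are left out, and Q_row/P_row are assumed to be the Q values and execution probabilities of the current state.

```python
import numpy as np

def select_action(Q_row, P_row, epsilon, beta, rng):
    """Improved epsilon-exploration: below the greedy factor the most probable
    action is chosen; otherwise T_Q = Q + beta * P is formed and its arg-max
    is executed."""
    if rng.random() < epsilon:
        return int(np.argmax(P_row))      # action with the highest execution probability
    T_Q = Q_row + beta * P_row            # blend Q values with execution probabilities
    return int(np.argmax(T_Q))
```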
Step 7: update the Q-value table, update the action execution probabilities and update the position state. The Q-value table is updated as:
Q(s,a) = Q(s,a) + α[Rt + γ·max_a′ Q(s′,a′) − Q(s,a)]
where α is the learning rate, α ∈ [0,1]; γ is the discount factor, γ ∈ [0,1]; Rt is the immediate reward; and s′, a′ are the next state and the next action.
The update formula for the action execution probability is as follows:
where n is the total number of actions and E is the behaviour utility function.
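The probability-update formula appears only as an image; normalising the behaviour-utility values over the n actions is used below as an assumed stand-in, alongside the Q-value update that is written out in the text.

```python
import numpy as np

def q_table_update(Q, s, a, r_t, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r_t + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s, a] += alpha * (r_t + gamma * np.max(Q[s_next]) - Q[s, a])

def probability_update(E, eps=1e-8):
    """Assumed stand-in for the execution-probability update: each action's
    probability is its share of the total behaviour utility E."""
    E = np.maximum(E, 0.0)
    total = E.sum()
    if total < eps:
        return np.full(E.shape, 1.0 / len(E))   # fall back to a uniform choice
    return E / total
```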
Step 8: update the position state reached after executing the action to the current position state, and check whether the current position is the goal position or whether the maximum number of steps has been exceeded; if not, return to step 6; if so, go to step 9.
Step 9: record the path of each learning episode and check whether the maximum number of iterations has been reached; if so, output the optimal path; otherwise return to step 5.
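Putting steps 5-9 together, a skeleton of the learning loop might look as follows; the env object, its reset()/step() interface, the per-episode handling of E and P, and the reuse of the helper functions from the earlier sketches are all assumptions.

```python
import numpy as np

def train(Q, P, env, episodes, max_steps, epsilon, beta, alpha=0.1, gamma=0.9, seed=0):
    """Skeleton of steps 5-9. Q and P are (n_states x n_actions) arrays; env is
    assumed to expose reset() -> state and step(action) -> (state, reward, done).
    select_action, update_utility, probability_update and q_table_update are the
    helpers sketched earlier."""
    rng = np.random.default_rng(seed)
    best_path = None
    for _ in range(episodes):                          # step 9: iterate learning episodes
        s = env.reset()                                # step 5: start and target points
        E = np.zeros(P.shape[1])                       # behaviour utilities for this episode
        history, path = [], [s]
        for _ in range(max_steps):                     # step 8: cap the episode length
            a = select_action(Q[s], P[s], epsilon, beta, rng)     # step 6
            s_next, r_t, done = env.step(a)
            history.append(a)
            E = update_utility(E, history, r_t)
            q_table_update(Q, s, a, r_t, s_next, alpha, gamma)    # step 7
            P[s] = probability_update(E)
            s = s_next
            path.append(s)
            if done:
                break
        if best_path is None or len(path) < len(best_path):
            best_path = path                           # keep the shortest path found so far
    return best_path
```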
FIG. 2 and FIG. 3 show the difference in path planning performance between the method of the present invention, the traditional Q-learning algorithm and an existing Q-learning algorithm. In FIG. 2 and FIG. 3, (a) and (b) correspond to the traditional Q-learning algorithm and the existing algorithm respectively, and (c) corresponds to the method of the present invention. As can be seen from these figures, compared with the traditional algorithm and the existing Q-learning algorithm, the method of the present invention effectively improves path smoothness and accelerates convergence.
Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present invention and do not limit the present invention in any other form. Any person skilled in the art may use the technical content disclosed above to change it or modify it into equivalent embodiments with equivalent variations. However, any simple modification, equivalent change or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the protection scope of the technical solution of the present invention.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211213330.XA CN115542912B (en) | 2022-09-29 | 2022-09-29 | A mobile robot path planning method based on improved Q-learning algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211213330.XA CN115542912B (en) | 2022-09-29 | 2022-09-29 | A mobile robot path planning method based on improved Q-learning algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115542912A CN115542912A (en) | 2022-12-30 |
CN115542912B (en) | 2024-06-07 |
Family
ID=84732115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211213330.XA Active CN115542912B (en) | 2022-09-29 | 2022-09-29 | A mobile robot path planning method based on improved Q-learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115542912B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116380102B (en) * | 2023-04-10 | 2025-05-30 | Hebei University of Technology | Mobile robot path planning method based on improved Q-learning algorithm |
CN116822765B (en) * | 2023-06-02 | 2024-08-16 | Southeast University | A Q-learning-based path planning method for agent sequential tasks |
CN117664133B (en) * | 2023-12-04 | 2025-06-13 | Information Engineering University of the Chinese People's Liberation Army Cyberspace Force | An automatic path planning method based on improved Q-learning algorithm |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819264A (en) * | 2012-07-30 | 2012-12-12 | Shandong University | Path planning Q-learning initial method of mobile robot |
WO2017161632A1 (en) * | 2016-03-24 | 2017-09-28 | Zhangjiagang Industrial Technology Research Institute of Soochow University | Cleaning robot optimal target path planning method based on model learning |
CN112344944A (en) * | 2020-11-24 | 2021-02-09 | Hubei University of Automotive Technology | A Reinforcement Learning Path Planning Method Using Artificial Potential Field |
Non-Patent Citations (1)
Title |
---|
Reinforcement learning path planning algorithm introducing potential field and trap search; Dong Peifang; Zhang Zhi'an; Mei Xinhu; Zhu Shuo; Computer Engineering and Applications; 2017-09-08 (No. 16); pp. 135-140 *
Also Published As
Publication number | Publication date |
---|---|
CN115542912A (en) | 2022-12-30 |
Similar Documents
Publication | Title |
---|---|
CN115542912B (en) | A mobile robot path planning method based on improved Q-learning algorithm | |
CN112926139B (en) | Improved intelligent sparrow optimization method based on chaotic mapping and golden sine strategy | |
Zhou et al. | An optimized Q-Learning algorithm for mobile robot local path planning | |
CN109945881B (en) | A mobile robot path planning method based on ant colony algorithm | |
CN111563188B (en) | A mobile multi-agent cooperative target search method | |
CN102819264B (en) | Path planning Q-learning initial method of mobile robot | |
CN114460943B (en) | Self-adaptive target navigation method and system for service robot | |
CN110220525A (en) | A kind of paths planning method based on potential field ant group algorithm | |
CN105893694A (en) | Complex system designing method based on resampling particle swarm optimization algorithm | |
CN115993831B (en) | Method for planning path of robot non-target network based on deep reinforcement learning | |
CN110095788A (en) | A kind of RBPF-SLAM improved method based on grey wolf optimization algorithm | |
CN110530373B (en) | A robot path planning method, controller and system | |
CN114161419B (en) | Efficient learning method for robot operation skills guided by scene memory | |
CN116242383A (en) | Unmanned vehicle path planning method based on reinforced Harris eagle algorithm | |
CN115933669B (en) | Mobile robot path planning method based on improved butterfly optimization algorithm | |
CN113219981B (en) | Mobile robot path planning method based on ant colony algorithm | |
CN118296938A (en) | Crowd evacuation simulation method and system based on multi-agent deep reinforcement learning | |
CN113204238A (en) | Path planning method and device for mobile robot and storage medium | |
Cai et al. | Cooperative path planning study of distributed multi-mobile robots based on optimised ACO algorithm | |
CN116380102A (en) | Mobile robot path planning method based on improved Q-learning algorithm | |
Cui et al. | Mobile robot sequential decision making using a deep reinforcement learning hyper-heuristic approach | |
CN114564039B (en) | A trajectory planning method based on deep Q-network and fast search random tree algorithm | |
Yin et al. | Random network distillation based deep reinforcement learning for agv path planning | |
CN109635913A (en) | Q learning algorithm Soccer System emulation mode based on adaptive greediness | |
CN117850411B (en) | Path planning method for inspection robot based on improved ant lion algorithm |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |