JP6189784B2 - Behavior control device, method and program

Info

Publication number: JP6189784B2
Application number: JP2014080002A
Authority: JP
Inventors: 洋川野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-04-09
Filing date: 2014-04-09
Publication date: 2017-08-30
Anticipated expiration: 2034-04-09
Also published as: JP2015201068A

Description

The present invention relates to a technique for controlling the actions of a plurality of control objects. For example, it relates to a cooperative robot control technique that obtains an action plan for each of a plurality of robots so that the robots move cooperatively from a formation at a start position, avoid obstacles, and form a formation at a target position.

In recent years, research on efficiently controlling large numbers of autonomous mobile robots has been active. The missions vary widely and include monitoring places that people cannot enter and transporting goods. Techniques that let many robots form a formation efficiently through cooperative operation are in demand and are being studied intensively (see, for example, Non-Patent Document 1). To achieve efficient formation with many robots, it is important to plan the placement and order of operation of each robot in advance. Such planning must, of course, take full account of obstacles and the shape of the paths in the real environment in which the robots operate.

Dynamic programming and reinforcement learning on a Markov decision process are among the effective methods for such planning calculations, and various studies have been conducted (see, for example, Non-Patent Document 2).

Non-Patent Document 1: M. Shimizu, A. Ishiguro, T. Kawakatsu, Y. Masubuchi, "Coherent Swarming from Local Interaction by Exploiting Molecular Dynamics and Stokesian Dynamics Methods", Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, pp. 1614-1619, October 2003.
Non-Patent Document 2: Y. Wang, C. W. de Silva, "Multi-Robot Box-pushing: Single-Agent Q-Learning vs. Team Q-Learning", Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, pp. 3694-3699, October 2006.

However, the method of Non-Patent Document 1 controls the motion of a robot swarm by incorporating hydrodynamic characteristics into robot motion. It has the advantage of enabling control at a low computational load, but it cannot necessarily form a formation of arbitrary shape.

When such planning is attempted with dynamic programming or reinforcement learning on a Markov decision process, as in the method of Non-Patent Document 2, the computation time and the computer memory required grow exponentially with the number of robots when multiple robots are used instead of a single robot. The main cause is the enormous increase in the number of states in the Markov state space that must be searched. For the reinforcement learning methods it examines, Non-Patent Document 2 offers no solution to this state-space explosion, in which the computational load increases exponentially with the number of robots.

Thus, no method has yet been realized that both enables formation control of arbitrary shape and has a low computational load.

In view of this situation, an object of the present invention is to provide a behavior control device, method, and program that enable formation control of arbitrary shape with a lower computational load than conventional techniques.

A behavior control device according to one aspect of the present invention performs behavior control for moving a plurality of control objects to a set of target positions that includes a predetermined entrance position. Behavior control of the control objects is assumed to be based on a single value function that takes the state variable of one control object as its argument, represents the appropriateness of each action a taken by a control object at its current position L, and is learned by constructing the Markov state space from the state variables of a single control object only. The device includes an action assignment unit, a position update unit that updates the position of each control object based on the determined action, and a control unit that causes the processing of the action assignment unit and the position update unit to be repeated. The action assignment unit includes the following.
(1) A position determination unit that determines whether each control object is located at a target position.
(2) An out-of-target-area action determination unit that, when a control object is determined not to be at a target position, updates the value function based on the current position of the control object, treating movement of the control object toward the entrance position as the ideal state, and determines as the control object's action the action that moves it to the position with the largest updated value-function value among the positions to which it can move.
(3) An in-target-area action determination unit that, when a control object is determined to be at a target position, updates the value function based on the current position of a void, treating movement of the void toward the entrance position as the ideal state. A void is a virtual entity that swaps positions with a control object as that control object moves. Among the void positions from which a void can move to the current position L of the control object, those for which the action maximizing the updated value function is the action of moving to position L are taken as candidate void positions, and the updated value-function value corresponding to that maximizing action is taken as the candidate void Q-function value. The unit determines as the control object's action the action that moves it to the candidate void position to which it can move and whose candidate void Q-function value is smallest.

Formation control of arbitrary shape becomes possible, and the computational load can be made lower than before.

FIG. 1 is a block diagram illustrating an example of the behavior control device.
FIG. 2 is a block diagram illustrating an example of the learning unit.
FIG. 3 is a block diagram illustrating an example of the i-th assignment unit.
FIG. 4 is a block diagram illustrating an example of the out-of-target-area action determination unit.
FIG. 5 is a block diagram illustrating an example of the in-target-area action determination unit.
FIG. 6 is a block diagram illustrating an example of the scheduling unit.
FIG. 7 is a diagram illustrating the problem solved by the present invention.
FIG. 8 is a diagram illustrating action selection by a subsumption structure.
FIG. 9 is a flowchart illustrating an example of the learning step of the behavior control method.
FIG. 10 is a flowchart illustrating an example of the action scheduling step of the behavior control method.

[Theoretical Background]
First, the theoretical background of the behavior control device and method is described. The following description takes as an example the case where the control objects subject to behavior control are robots, but a control object may be anything other than a robot as long as it can be controlled.

The task in which many robots move cooperatively from a formation at a start position and form a formation at a target position is realized, for example, by moving the robots from the start position to the target position in a room partitioned by walls, as illustrated in FIG. 7.

The task is performed by N robots (for example, N ≥ 50), and each robot can move along the X axis and the Y axis of a two-dimensional plane. That is, in this example each robot can move in four directions (up, down, left, and right) on the page of FIG. 7. Each grid cell in FIG. 7 represents a robot position, and only one robot can occupy a cell at a time. Each robot is assumed to stay still when an obstacle or another robot is present in the direction in which it tries to move.

In FIG. 7, a cell marked R is a position where a robot is present, a cell marked O is a position where an obstacle is present, and a cell marked F is a target position. The region enclosed by a thick broken line is the start position, and the region enclosed by a thick solid line is the entrance position described later. In FIG. 7, the formation at the start position is roughly rectangular, and the formation at the target position is roughly star-shaped.

Let the initial position of each robot i (where i is the robot number) be (Xr0[i], Yr0[i]) and its target position be (Xre[i], Yre[i]). Consider the problem of finding an action plan by which the robots placed at the initial positions move to the target positions.

If a Markov state transition model is applied naively to this problem, the Markov state space consists of the position (Xr[i], Yr[i]) and the action a[i] of every robot i, where i is the robot number. Each state (robot position and action) is expressed as a discrete value. When the room is represented as a two-dimensional plane with orthogonal X and Y coordinates, each position is expressed by discretized values of the X and Y axes. That is, as shown in FIG. 7, the room (the two-dimensional plane) is divided into a grid, and each cell corresponds to one position. Whether an obstacle is present is set in advance for each cell. As noted above, cells containing an obstacle are marked O in FIG. 7.

In this example, the acting agents, that is, the control objects, are the robots placed in the room. The action a[i] ∈ D[i] of robot i (where i is the robot number) takes one of five values: staying still, or moving by one cell up, down, left, or right. That is, with D[i] = {0,1,2,3,4}, the actions are defined, for example, as follows.
0: Stay still
1: Move one cell to the right on the two-dimensional plane
2: Move one cell upward on the two-dimensional plane
3: Move one cell to the left on the two-dimensional plane
4: Move one cell downward on the two-dimensional plane
The Markov state space in such a task environment has (number of robots × 2) dimensions, and the number of selectable joint actions is the number of actions per robot (five) raised to the power of the number of robots. For example, with 50 robots and a room of 20 cells in each of the vertical and horizontal directions, the number of states is 20 to the 100th power, and the resources required for the search calculation are enormous. Moreover, each additional robot multiplies the number of states by 400, which is a major problem when using multiple robots.
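
The arithmetic in the preceding paragraph can be checked with a short Python sketch. The function names and parameters are illustrative only and do not appear in the patent.

```python
# Size of the naive joint Markov state space, where every robot's (x, y)
# cell is a separate state variable. All names here are hypothetical.

def joint_state_count(num_robots: int, grid_w: int, grid_h: int) -> int:
    """Number of joint position states: (cells in the room) ** (robots)."""
    return (grid_w * grid_h) ** num_robots

def joint_action_count(num_robots: int, actions_per_robot: int = 5) -> int:
    """Number of joint actions: (actions per robot) ** (robots)."""
    return actions_per_robot ** num_robots

states_50 = joint_state_count(50, 20, 20)   # 400 ** 50, i.e. 20 ** 100
growth = joint_state_count(51, 20, 20) // states_50
single_robot_states = joint_state_count(1, 20, 20)

print(len(str(states_50)), growth, single_robot_states)
```

With one robot's state variables only, the table covers 400 positions regardless of how many robots share it.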

To avoid this state-space explosion, the present invention constructs the Markov state space used for learning from the state variables of a single robot only. That is, the state variable and the action variable are defined as follows.

State variable L = (Xr, Yr), action variable a ∈ {0,1,2,3,4}
All N robots share a single value function Q(L,a) that takes this state variable as an argument. That is, at each time step the N robots each update the same value function with their own experience, so N updates are performed per time step. The update equations are as follows.
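
Equations (1) and (2) are not reproduced in this text. The following is a reconstruction that assumes the standard Q-learning update and a greedy policy, which is consistent with the descriptions of α, γ, Q_max, and π elsewhere in this document. Here L' denotes the position reached by taking action a at L, and r is the reward; both symbols follow the usage in the later description of void control. The exact form in the original publication may differ.

```latex
\begin{align}
Q(L,a) &\leftarrow (1-\alpha)\,Q(L,a)
          + \alpha\Bigl[\, r + \gamma \max_{a'} Q(L',a') \Bigr] \tag{1}\\
\pi(L) &\leftarrow \operatorname*{arg\,max}_{a}\; Q(L,a) \tag{2}
\end{align}
```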

Here, α in Eq. (1) is a predetermined constant called the learning rate, with 0 < α < 1. The symbol ← in Eq. (1) means that the value on the left-hand side is updated with the value on the right-hand side.

For the i-th robot, the position L[i] of the i-th robot is substituted for L and its action variable a[i] is substituted for a in Eqs. (1) and (2), and Eqs. (1) and (2) are evaluated to update the value function and the policy. This is repeated for each i = 1, 2, …, N.
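
A minimal Python sketch of this shared update, assuming that Eq. (1) is the standard Q-learning update and Eq. (2) is the greedy policy. The table layout, function names, and the values of `alpha` and `gamma` are illustrative assumptions, not taken from the patent.

```python
from collections import defaultdict

NUM_ACTIONS = 5  # 0: stay, 1: right, 2: up, 3: left, 4: down

def make_q_table():
    """One table Q[L][a], keyed by a single robot's cell L = (x, y)."""
    return defaultdict(lambda: [0.0] * NUM_ACTIONS)

def update_shared_q(q, pos, action, next_pos, reward, alpha=0.1, gamma=0.9):
    """Assumed form of Eq. (1): standard Q-learning update on the shared table."""
    target = reward + gamma * max(q[next_pos])
    q[pos][action] = (1.0 - alpha) * q[pos][action] + alpha * target

def greedy_policy(q, pos):
    """Assumed form of Eq. (2): the action that maximizes Q at position pos."""
    values = q[pos]
    return max(range(NUM_ACTIONS), key=lambda a: values[a])

def update_all_robots(q, experiences, alpha=0.1, gamma=0.9):
    """One time step: each of the N robots updates the same table once."""
    for pos, action, next_pos, reward in experiences:
        update_shared_q(q, pos, action, next_pos, reward, alpha, gamma)

q = make_q_table()
# (position, action, next position, reward) for two robots in one time step
update_all_robots(q, [((0, 0), 1, (1, 0), 0.0), ((5, 5), 2, (5, 6), 1.0)], alpha=0.5)
print(q[(5, 5)][2], greedy_policy(q, (5, 5)))
```

The table is indexed by one robot's position only, so adding robots adds entries to `experiences` but no dimensions to `q`.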

At action selection time as well, each robot selects its action using the policy function π(L) derived from the single value function Q(L,a). In other words, behavior control of the control objects, robots in this example, is assumed to be based on a single value function that represents the appropriateness of each action a taken by a control object at its current position L. As a result, however many robots there are, the number of states in the state space used for learning equals that of the state space for a single robot, and the size of the state space does not depend on the number of robots.

Each robot is assumed to be able to measure its own position and to know whether another robot or an obstacle is present at each adjacent position.

The following describes the problems that arise when learning uses such a single value function Q(L,a). Suppose, for example, that during learning a robot receives a high reward at each target position. Q(L,a) describes which actions a single robot should select from the start position to reach a target position in the fewest time steps. Robots that follow π(L) therefore tend to pass through the same positions, for example the cell at the corner of an obstacle when avoiding that obstacle on the way to the target position. Many robots thus crowd onto the same path and cause a literal traffic jam. A robot that reaches a target position early may also stay there and block the way of robots that arrive later. As a result, it cannot be guaranteed that every robot reaches a target position properly. One way to avoid this is to take each robot's start position into account and, for example, assign more distant target positions to robots that would reach the target positions early. This, however, requires incorporating the positions of all robots into the Markov state space, which causes a severe increase in the state space when the number of robots is large.

To prevent these problems, two main methods are proposed. The first is an action selection method using a subsumption structure, and the second is void control at the target positions.

FIG. 8 shows a conceptual diagram of an example of the action selection method using a subsumption structure. The modules drawn in FIG. 8 as a circle containing the letter s (hereinafter, subsumption modules) are the key parts of the subsumption structure. A subsumption module outputs the signal received from the upper module unchanged as long as there is no signal input from the lower module. When there is an input from the lower module, it ignores the input from the upper module and outputs the input from the lower module.

Each layer consists of a Qxth module (x = 1, 2, 3, 4) and a Stopper module. The Qxth module of the lowest layer is the Q1st module, that of the second layer is the Q2nd module, that of the third layer is the Q3rd module, and that of the fourth layer is the Q4th module. The top layer consists of the StayCom module.

The Q1st module receives the current robot position (xr[i], yr[i]) as input and outputs, as a candidate for the action a[i] of robot i, the value of a that gives the largest value of Q(L,a) at L = (xr[i], yr[i]). Similarly, the Q2nd module receives the current robot position (xr[i], yr[i]) as input and outputs the value of a that gives the second largest value of Q(L,a) at L = (xr[i], yr[i]). The Q3rd module outputs the value of a that gives the third largest value of Q(L,a) at that position, and the Q4th module outputs the value of a that gives the fourth largest value, each as a candidate for the action a[i] of robot i.

Each Qxth module never includes a[i] = 0 (stay still) among the candidate actions it outputs. The Stopper module checks whether another robot is present at the positions (xr[i]+1, yr[i]), (xr[i], yr[i]+1), (xr[i]-1, yr[i]), and (xr[i], yr[i]-1) adjacent to the robot at position (xr[i], yr[i]). If another robot is present at the position to which the input action would move the robot, the Stopper module outputs no action value. Otherwise, it outputs the input action value unchanged. The StayCom module always outputs the stay action a[i] = 0.

With this action selection method, when the action that maximizes the Q value at position L would move the robot into a cell already occupied by another robot, the robot is not simply made to stay still. Instead, it is instructed to take the next most desirable action, even if that action is not optimal, and to move to a cell not occupied by another robot.

This gives the robots a property like that of a fluid, which does not stop when it hits an obstacle but flows around it in a direction that does not deviate far from the main flow.

The module structure in FIG. 8 has four layers (first to fourth) because, in this example, the robot has four possible actions (a = 1, 2, 3, 4) other than staying still (a = 0). In general, if there are M kinds of actions (including staying still), the module structure in FIG. 8 has M-1 such layers.
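
The layered selection described above can be sketched in Python as follows. This is an illustrative reading of the Qxth, Stopper, and StayCom modules, not the patent's implementation. The function names, the `occupied` set of cells held by other robots, and the tie-breaking by action number are assumptions.

```python
# Offsets for the four moving actions: 1 right, 2 up, 3 left, 4 down.
MOVES = {1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}
STAY = 0

def qxth(q_values, x):
    """Qxth module: the moving action with the x-th largest Q value.

    q_values[a] is Q(L, a) at the robot's position; a = 0 is never output.
    Ties keep ascending action order (an assumption).
    """
    ranked = sorted(MOVES, key=lambda a: q_values[a], reverse=True)
    return ranked[x - 1]

def stopper(action, pos, occupied):
    """Stopper module: pass the action through unless another robot is in the way."""
    dx, dy = MOVES[action]
    dest = (pos[0] + dx, pos[1] + dy)
    return None if dest in occupied else action

def select_action(q_values, pos, occupied):
    """Lower layers take priority; StayCom (stay) is used only if all are blocked."""
    for x in range(1, len(MOVES) + 1):
        out = stopper(qxth(q_values, x), pos, occupied)
        if out is not None:
            return out
    return STAY

q_at_pos = [0.0, 0.9, 0.5, 0.1, 0.3]  # Q(L, a) for a = 0..4
print(select_action(q_at_pos, (2, 2), occupied={(3, 2)}))
```

In this example the best action (move right) is blocked by another robot at (3, 2), so the second-best action (move up) is selected instead of staying still.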

Next, the principle of void control is described. First, target positions are not strictly assigned to individual robots. Instead, the set of all target positions is defined as the target formation area G. That is,
(Xre[i], Yre[i]) ∈ G … (3)
and each robot is free to take any position in G as its target position. In other words, G is treated like a vessel into which a fluid is poured. Each robot can enter G from any position on the boundary of G, but a robot that has entered G cannot take an action that leaves G. For the reward in reinforcement learning, a single entrance point Pe is set on the boundary of G. A high reward r = 1 is given only when a robot enters G through Pe, and r = 0 is given for all other experiences. Pe may be any position that is inside G and on the boundary of G, but choosing a position close to the robots' start positions is effective for smooth robot motion. The position Pe is called the entrance position.

Inside G, a robot's experience is not used directly to update the value function Q(L,a). Instead, it is treated as the motion of a "void" that moves along with the robot's action, and that motion is used to update Q(L,a). A void is a gap that transitions from L' to L at the same moment that a robot transitions from position L to L'. Thus, when one robot enters G, one void leaves G at the same time. Because Pe is set as the entrance point of G, robots tend to enter G through Pe, and voids correspondingly leave G via Pe. In other words, a high reward is given when a void leaves G through Pe. The robot motion that realizes such void movement is necessarily one in which robots enter G through Pe, head for points away from Pe, and disperse within G. The Q-value calculation and action selection method for this purpose are described below. Inside G, when a robot transitions from position L to L', the Q value is updated by the following Eq. (4).
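
Equation (4) is not reproduced in this text. The following reconstruction assumes the standard Q-learning form with the roles of L and L' exchanged, so that the void's transition from L' to L under the inverse action a^{-1} is the experience being learned. It agrees with the statements in this section about Q_max(L), γ, and a^{-1}, but the exact form in the original publication may differ.

```latex
\begin{equation}
Q(L', a^{-1}) \leftarrow (1-\alpha)\,Q(L', a^{-1})
  + \alpha\Bigl[\, r + \gamma\, Q_{\max}(L) \Bigr] \tag{4}
\end{equation}
```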

Q_max(L) is the maximum value at position L of the Q function defined by Eq. (1), that is, Q_max(L) = max_a Q(L,a). γ is a predetermined constant with 0 < γ < 1. The source text calls it a learning rate, the same name it gives α; in standard Q-learning the constant in this role is the discount rate. a^-1 denotes the action in the direction opposite to action a. One example of the relationship between a and a^-1 is as follows.

When a = 0, a^-1 = 0
When a = 1, a^-1 = 3
When a = 2, a^-1 = 4
When a = 3, a^-1 = 1
When a = 4, a^-1 = 2
Eq. (4) is equivalent to Eq. (1), the update equation for Q(L,a) in ordinary Q-learning, with L and L' interchanged so that the flow of time is reversed. To avoid inconsistency between the Q values inside and outside G, Q_max(L) is set to 0 when L is outside G and L' is inside G. The action policy π derived from the Q values updated in this way (the function that returns the action maximizing the Q value) returns the optimal action of a void. The robots move so that the voids can take such actions.
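
The void update can be sketched in Python as follows, assuming that Eq. (4) has the standard Q-learning form applied to the void's reversed transition. The function names, the `target_area` set representing G, and the values of `alpha` and `gamma` are illustrative assumptions.

```python
from collections import defaultdict

# Inverse actions as listed above: right <-> left, up <-> down, stay <-> stay.
INVERSE_ACTION = {0: 0, 1: 3, 2: 4, 3: 1, 4: 2}

def make_q_table():
    """One shared table Q[L][a], keyed by cell L = (x, y)."""
    return defaultdict(lambda: [0.0] * 5)

def void_update(q, pos, action, next_pos, reward, target_area,
                alpha=0.1, gamma=0.9):
    """Update Q for the void when a robot moves pos -> next_pos inside G.

    The void moves next_pos -> pos with the inverse action, so the entry
    updated is Q(next_pos, inverse action). Q_max(pos) is taken as 0 when
    pos lies outside G, i.e. when the void has just left G.
    """
    q_max = max(q[pos]) if pos in target_area else 0.0
    a_inv = INVERSE_ACTION[action]
    q[next_pos][a_inv] = ((1.0 - alpha) * q[next_pos][a_inv]
                          + alpha * (reward + gamma * q_max))

G = {(1, 1), (2, 1)}           # hypothetical two-cell target area
q = make_q_table()
# A robot enters G at the entrance (1, 1) from (0, 1): reward 1.
void_update(q, (0, 1), 1, (1, 1), 1.0, G, alpha=0.5)
# The robot then moves deeper into G: the void moves back toward the entrance.
void_update(q, (1, 1), 1, (2, 1), 0.0, G, alpha=0.5)
print(q[(1, 1)][3], q[(2, 1)][3])
```

After these two updates, the void's "move left" action has a positive value at both cells, so the learned policy points voids toward the entrance.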

Next, action selection inside G is described. Inside G, as outside G, actions are selected by the subsumption structure, but the Qxth modules operate differently. Inside G, a Qxth module first examines the voids at the positions (xr[i]+1, yr[i]), (xr[i], yr[i]+1), (xr[i]-1, yr[i]), and (xr[i], yr[i]-1) adjacent to the robot position L and selects as candidate voids those whose Q-maximizing action moves the void to the current robot position. There may be several candidate voids. It then selects from the candidate voids the one with the x-th smallest value of Q_max as the target void and outputs the action that moves the robot toward the selected target void. In this way, the robots move so as to guide voids to the position Pe, which keeps the entrance free for robots that try to enter G later.
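
The in-area Qxth module can be sketched in Python as follows. This is one possible reading of the description, under the assumptions that a void is any unoccupied cell of G adjacent to the robot and that a module outputs nothing (`None`) when fewer than x candidate voids exist. All names are hypothetical.

```python
# Offsets for the four moving actions: 1 right, 2 up, 3 left, 4 down.
MOVES = {1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}

def qxth_in_area(q, pos, target_area, occupied, x=1):
    """Return the action toward the candidate void with the x-th smallest Q_max.

    q maps a cell to its list of five Q values. A neighbouring void is a
    candidate only if its own Q-maximizing action would move it onto pos.
    """
    candidates = []
    for action, (dx, dy) in MOVES.items():
        void = (pos[0] + dx, pos[1] + dy)
        if void not in target_area or void in occupied:
            continue  # no void at this neighbouring cell
        values = q.get(void, [0.0] * 5)
        best = max(range(5), key=lambda b: values[b])
        if best == 0:
            continue  # the void's best action is to stay
        bdx, bdy = MOVES[best]
        if (void[0] + bdx, void[1] + bdy) == pos:
            candidates.append((values[best], action))
    candidates.sort()  # ascending Q_max
    if x > len(candidates):
        return None    # this layer produces no output
    return candidates[x - 1][1]

G = {(cx, cy) for cx in range(1, 4) for cy in range(1, 4)}
q = {
    (3, 2): [0.0, 0.0, 0.0, 0.8, 0.0],  # void prefers moving left, onto (2, 2)
    (2, 3): [0.0, 0.0, 0.0, 0.0, 0.3],  # void prefers moving down, onto (2, 2)
    (2, 1): [0.0, 0.9, 0.0, 0.0, 0.0],  # void prefers moving right: not a candidate
}
print(qxth_in_area(q, (2, 2), G, occupied={(1, 2)}, x=1))
```

Here the candidate with the smaller Q_max is the void above the robot, so the first layer outputs action 2 (move up). A small Q_max suggests a void far from the entrance in terms of learned value, so the robot moves deeper into G while the void moves one step toward Pe.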

There are several options for the design of the Qxth modules. For example, the target void may instead be selected as the candidate void with the x-th largest value of Q_max, or the action with the smallest action number among the actions that move the robot toward a candidate void may be output as the robot's action.

Although a single point is set as the entrance position of G, the robots trying to enter G do not concentrate at Pe and cause a jam. Thanks to action selection by the subsumption structure, when Pe is occupied by a robot, the robots are controlled so that they enter G from a point near Pe other than Pe.

[Behavior Control Device and Method]
An example of the behavior control device and method is described with reference to the drawings. The behavior control device and method perform behavior control for moving a plurality of control objects to a set of target positions that includes a predetermined entrance position.

As shown in FIG. 1, the behavior control device includes, for example, a learning unit 1, a storage unit 2, and a scheduling unit 3.

学習部１は、図２に示すように、入力部１１、行動割当部１２、位置更新部１３及び制御部１４を例えば備えている。 As illustrated in FIG. 2, the learning unit 1 includes, for example, an input unit 11, an action assignment unit 12, a position update unit 13, and a control unit 14.

スケジューリング部３は、図６に示すように、初期状態入力部３１、行動割当部３２、位置更新部３３、目標位置到達判定部３４を例えば備えている。 As shown in FIG. 6, the scheduling unit 3 includes, for example, an initial state input unit 31, an action assignment unit 32, a position update unit 33, and a target position arrival determination unit 34.

以下では、制御の対象となる制御対象物が、ロボットである場合を例に挙げて説明する。もちろん、制御対象物は、制御の対象となり得るものであれば、ロボット以外であってもよい。 Hereinafter, a case where the control target to be controlled is a robot will be described as an example. Of course, the control object may be other than the robot as long as it can be a control target.

First, the learning step performed by the learning unit 1 of the behavior control device is described. FIG. 9 shows an example of the processing flow of the learning step.

＜Input unit 11＞
The input unit 11 receives the initial position (xr0[i],yr0[i]) and the target position (Xre[i],Yre[i]) of each of the N robots, where i=1,2,…,N. The set of N target positions is stored in the storage unit 2 as G={(Xre[1],Yre[1]),(Xre[2],Yre[2]),…,(Xre[N],Yre[N])}.

For each of the N robots, the input initial position is used to set the initial position L[i]=(xr0[i],yr0[i]) of the i-th robot, which is then stored in the storage unit 2.

The target positions include a predetermined entrance position. Information on this entrance position is also input through the input unit 11 and stored in the storage unit 2.

＜Storage unit 2＞
The storage unit 2 holds initial values of the Q function Q(L,a) for every combination of a position L and an action a∈{0,1,2,3,4}, and of the policy π(L) and Qmax(L) for every position L. L ranges over all coordinates in the region of the two-dimensional plane under consideration. The initial values may be random. However, when L coincides with an obstacle position, Q(L,a)=0 may be set. π(L) and Qmax(L) may likewise be set from the maximum of the initial values Q(L,a) of the Q function for each L.
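The initialization described for the storage unit can be sketched as follows. This is only an illustration under stated assumptions: the grid size, the `obstacles` set, the dictionary layout, and the choice to store the maximizing action in the policy table are hypothetical and are not taken from the patent.

```python
import random

ACTIONS = (0, 1, 2, 3, 4)  # 0 = stay, 1..4 = the four moves


def init_tables(width, height, obstacles, seed=0):
    """Return (Q, policy, qmax) for every cell of a width x height grid."""
    rng = random.Random(seed)
    Q, policy, qmax = {}, {}, {}
    for x in range(width):
        for y in range(height):
            L = (x, y)
            for a in ACTIONS:
                # Obstacle cells may be fixed at 0; other cells start random.
                Q[(L, a)] = 0.0 if L in obstacles else rng.random()
            best = max(ACTIONS, key=lambda a: Q[(L, a)])
            policy[L] = best        # assumption: policy holds the maximizing action
            qmax[L] = Q[(L, best)]  # largest initial Q value at L
    return Q, policy, qmax
```

Keying the tables by `(position, action)` keeps lookups such as Q(L,a) and Qmax(L) one dictionary access each, which matches how the later units read and overwrite individual entries.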

The reward r(L) for each position L is also stored in the storage unit 2. Information on the reward r(L) is input, for example, through the input unit 11.

＜Action assignment unit 12＞
The action assignment unit 12 performs the action assignment process for each robot in turn. It includes, for example, a first assignment unit 12-1, …, an i-th assignment unit 12-i, …, and an N-th assignment unit 12-N.

For i=1,2,…,N, the action assignment process for the i-th robot is performed, for example, by the i-th assignment unit 12-i. FIG. 3 shows an example of the configuration of the i-th assignment unit 12-i.

≪Position determination unit 12-i-1≫
The position determination unit 12-i-1 reads the position (xr[i],yr[i]) of the i-th robot from the storage unit 2 and determines whether that position is included in the set G of target positions. In other words, it determines whether the robot is located at a target position (step A1).

When the position (xr[i],yr[i]) is not included in G, the position determination unit 12-i-1 has the out-of-target-area action determination unit 12-i-2 perform the next process. When the position is included in G, it has the in-target-area action determination unit 12-i-3 perform the next process.

Likewise, when the position is not included in G, the position determination unit 12-i-1 routes the action value output by the out-of-target-area action determination unit 12-i-2 to the position update unit 13. When the position is included in G, it routes the action value output by the in-target-area action determination unit 12-i-3 to the position update unit 13.

≪Out-of-target-area action determination unit 12-i-2≫
FIG. 4 shows an example of the configuration of the out-of-target-area action determination unit 12-i-2.

The out-of-target-area action determination unit 12-i-2 determines an action based on the action selection method of FIG. 8. That is, when the robot is determined not to be at a target position, the unit updates the value function based on the robot's current position, treating the robot heading toward the entrance position as the ideal state, and determines as the robot's action the move to the position with the largest updated value-function value among the positions the robot can move to (step A2).

〔Out-of-region Q function update unit 12-i-21〕
Let L=(xr[i],yr[i]) be the position of the i-th robot one time step earlier and L'=(xr'[i],yr'[i]) its current position. The out-of-region Q function update unit 12-i-21 refers to Q(L,a) and Qmax(L') stored in the storage unit 2, computes Q(L,a) by equation (1) for the action a that the robot took one time step earlier, and overwrites the stored Q(L,a) with the result. It also outputs the values of Q(L,a) before and after the update to the control unit 14.

Using the updated Q(L,a), the out-of-region Q function update unit 12-i-21 also computes the policy π(L) by equation (2) and overwrites the policy π(L) stored in the storage unit 2 with the result.
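As a rough illustration of this update step, the sketch below substitutes the textbook Q-learning update for equation (1), which is defined elsewhere in the document and may differ in form. The learning rate `alpha`, the discount `gamma`, the use of the reward at the new position, and the dictionary layout are all assumptions made for the example.

```python
def update_q(Q, qmax, reward, L, a, L_next, alpha=0.1, gamma=0.9):
    """Update Q[(L, a)] after the robot moved from L to L_next with action a.

    Returns (old_value, new_value), the pair handed to the control unit.
    NOTE: textbook Q-learning form, used here as a stand-in for equation (1).
    """
    old = Q[(L, a)]
    target = reward[L_next] + gamma * qmax[L_next]
    new = (1 - alpha) * old + alpha * target
    Q[(L, a)] = new
    return old, new
```

Returning both the old and the new value mirrors the text, where the pair is passed on so that convergence can be checked later.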

〔First out-of-region action candidate determination unit 12-i-22〕
With L=(xr[i],yr[i]), the first out-of-region action candidate determination unit 12-i-22 outputs, as the first out-of-region action candidate value, the value of a that gives the maximum among Q(L,1), Q(L,2), Q(L,3), and Q(L,4) stored in the storage unit 2. It also overwrites Qmax(L) stored in the storage unit 2 with that maximum Q value.
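A minimal sketch of this first-candidate step, assuming the Q function and Qmax are kept in dictionaries (the names `Q` and `qmax` and their key layout are hypothetical):

```python
def first_candidate(Q, qmax, L):
    """Return the move a in {1,2,3,4} with the largest Q(L, a) and refresh Qmax(L)."""
    best = max((1, 2, 3, 4), key=lambda a: Q[(L, a)])
    qmax[L] = Q[(L, best)]  # overwrite the stored maximum for position L
    return best
```

The stay action a=0 is deliberately left out of the comparison, as in the text, which only ranks Q(L,1) through Q(L,4).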

〔First out-of-region inclusion control unit 12-i-23〕
The first out-of-region inclusion control unit 12-i-23 determines whether another robot is at the position (xr'[i],yr'[i]) that the i-th robot would reach by moving according to the first out-of-region action candidate value determined by the first out-of-region action candidate determination unit 12-i-22. That is, it determines whether there is a j (i≠j) such that (xr'[i],yr'[i])=(xr[j],yr[j]). It also determines whether an obstacle is at the position (xr'[i],yr'[i]).

If another robot or an obstacle is at the destination (xr'[i],yr'[i]), the first out-of-region inclusion control unit 12-i-23 has the second out-of-region action candidate determination unit 12-i-24 perform the next process.

If neither another robot nor an obstacle is at the destination (xr'[i],yr'[i]), the first out-of-region inclusion control unit 12-i-23 outputs the first out-of-region action candidate value as the "action value".

〔Second out-of-region action candidate determination unit 12-i-24〕
With L=(xr[i],yr[i]), the second out-of-region action candidate determination unit 12-i-24 outputs, as the second out-of-region action candidate value, the value of a that gives the second largest value among Q(L,1), Q(L,2), Q(L,3), and Q(L,4) stored in the storage unit 2.

〔Second out-of-region inclusion control unit 12-i-25〕
The second out-of-region inclusion control unit 12-i-25 determines whether another robot is at the position (xr'[i],yr'[i]) that the i-th robot would reach by moving according to the second out-of-region action candidate value determined by the second out-of-region action candidate determination unit 12-i-24. That is, it determines whether there is a j (i≠j) such that (xr'[i],yr'[i])=(xr[j],yr[j]). It also determines whether an obstacle is at the position (xr'[i],yr'[i]).

If another robot or an obstacle is at the destination (xr'[i],yr'[i]), the second out-of-region inclusion control unit 12-i-25 has the third out-of-region action candidate determination unit 12-i-26 perform the next process.

If neither another robot nor an obstacle is at the destination (xr'[i],yr'[i]), the second out-of-region inclusion control unit 12-i-25 outputs the second out-of-region action candidate value as the "action value".

〔Third out-of-region action candidate determination unit 12-i-26〕
With L=(xr[i],yr[i]), the third out-of-region action candidate determination unit 12-i-26 outputs, as the third out-of-region action candidate value, the value of a that gives the third largest value among Q(L,1), Q(L,2), Q(L,3), and Q(L,4) stored in the storage unit 2.

〔Third out-of-region inclusion control unit 12-i-27〕
The third out-of-region inclusion control unit 12-i-27 determines whether another robot is at the position (xr'[i],yr'[i]) that the i-th robot would reach by moving according to the third out-of-region action candidate value determined by the third out-of-region action candidate determination unit 12-i-26. That is, it determines whether there is a j (i≠j) such that (xr'[i],yr'[i])=(xr[j],yr[j]). It also determines whether an obstacle is at the position (xr'[i],yr'[i]).

If another robot or an obstacle is at the destination (xr'[i],yr'[i]), the third out-of-region inclusion control unit 12-i-27 has the fourth out-of-region action candidate determination unit 12-i-28 perform the next process.

If neither another robot nor an obstacle is at the destination (xr'[i],yr'[i]), the third out-of-region inclusion control unit 12-i-27 outputs the third out-of-region action candidate value as the "action value".

〔Fourth out-of-region action candidate determination unit 12-i-28〕
With L=(xr[i],yr[i]), the fourth out-of-region action candidate determination unit 12-i-28 outputs, as the fourth out-of-region action candidate value, the value of a that gives the fourth largest value (that is, the minimum) among Q(L,1), Q(L,2), Q(L,3), and Q(L,4) stored in the storage unit 2.

〔Fourth out-of-region inclusion control unit 12-i-29〕
The fourth out-of-region inclusion control unit 12-i-29 determines whether another robot is at the position (xr'[i],yr'[i]) that the i-th robot would reach by moving according to the fourth out-of-region action candidate value determined by the fourth out-of-region action candidate determination unit 12-i-28. That is, it determines whether there is a j (i≠j) such that (xr'[i],yr'[i])=(xr[j],yr[j]). It also determines whether an obstacle is at the position (xr'[i],yr'[i]).

If another robot or an obstacle is at the destination (xr'[i],yr'[i]), the fourth out-of-region inclusion control unit 12-i-29 outputs a=0 (stay) as the "action value".

If neither another robot nor an obstacle is at the destination (xr'[i],yr'[i]), the fourth out-of-region inclusion control unit 12-i-29 outputs the fourth out-of-region action candidate value as the "action value".
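The four candidate/inclusion stages outside G amount to trying the moves in descending order of Q value and falling back to staying put. The sketch below illustrates that cascade; the move table follows the relationship between a and L' given for the in-region units, while the `occupied` and `obstacles` sets and the dictionary layout of `Q` are hypothetical names chosen for the example.

```python
MOVES = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}


def choose_outside_action(Q, L, occupied, obstacles):
    """Pick the highest-ranked move whose destination is free, else 0 (stay).

    occupied: positions of the other robots; obstacles: obstacle positions.
    """
    ranked = sorted((1, 2, 3, 4), key=lambda a: Q[(L, a)], reverse=True)
    for a in ranked:  # first, second, third, fourth candidate in turn
        dx, dy = MOVES[a]
        dest = (L[0] + dx, L[1] + dy)
        if dest not in occupied and dest not in obstacles:
            return a
    return 0  # every candidate destination is blocked
```

For instance, if the best move leads into a cell held by another robot, the function returns the second-best move, which is the behavior that keeps robots from queuing at a single point.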

≪In-target-area action determination unit 12-i-3≫
FIG. 5 shows the detailed configuration of the in-target-area action determination unit 12-i-3.

Like the out-of-target-area action determination unit 12-i-2, the in-target-area action determination unit 12-i-3 determines an action based on the action selection method of FIG. 8. It differs in that equation (4) is used for the Q function instead of equation (1). In other words, the processing in the in-target-area action determination unit 12-i-3 also incorporates void control.

That is, when the robot is determined to be at a target position, the in-target-area action determination unit 12-i-3 updates the value function based on the current position of a void, treating the void heading toward the entrance position as the ideal state. A void is a virtual entity that swaps positions with a robot as the robot moves. Among the void positions from which a void can move to the robot's current position L, those at which the action maximizing the updated value function is the move to L are taken as candidate void positions, and the updated value-function value for that maximizing action is taken as the candidate void Q function value. The unit then determines as the robot's action the move to the candidate void position that the robot can move to and whose candidate void Q function value is smallest (step A3).

〔In-region Q function update unit 12-i-31〕
Let L be the position (xr[i],yr[i]) of the i-th robot one step earlier, and let the current position of the i-th robot be the void position L'. The in-region Q function update unit 12-i-31 refers to Q(L',a) and Qmax(L) stored in the storage unit 2, obtains a^-1 from the action the robot took one step earlier, computes Q(L',a^-1) by equation (4), and overwrites the stored Q(L',a^-1) with the result.

Here, L' is the position reached when action a is selected at the position L=(xr[i],yr[i]) one step earlier. The relationship between a and L' is, for example, as follows.

If a=1, L'=(xr[i]+1, yr[i])
If a=2, L'=(xr[i], yr[i]+1)
If a=3, L'=(xr[i]-1, yr[i])
If a=4, L'=(xr[i], yr[i]-1)

The in-region Q function update unit 12-i-31 outputs Q(L',a^-1) before the update and Q(L',a^-1) after the update to the control unit 14. It also updates the policy π(L') stored in the storage unit 2 with the maximum value of the Q function at this time.
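The relationship between a robot action and the matching void action can be sketched as below. The move table restates the list above; the `INVERSE` table (each move paired with the opposite move) is an assumption about how a^-1 is defined, since its definition is not repeated in this section.

```python
MOVES = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}
INVERSE = {0: 0, 1: 3, 2: 4, 3: 1, 4: 2}  # assumed definition of a^-1


def step(L, a):
    """Position reached from L by action a."""
    dx, dy = MOVES[a]
    return (L[0] + dx, L[1] + dy)


def void_transition(L, a):
    """Robot moves from L by a; return (void start L', void action a^-1, void end L)."""
    L_next = step(L, a)
    return L_next, INVERSE[a], L
```

Under this assumption, a robot stepping right from (2,3) corresponds to a void stepping left from (3,3) to (2,3), which is the pair (L', a^-1) whose Q value the in-region update rewrites.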

〔Candidate void set generation unit 12-i-32〕
The candidate void set generation unit 12-i-32 performs the following processes (1) to (3).

(1) Each of the positions (xr[i]+1,yr[i]), (xr[i],yr[i]+1), (xr[i]-1,yr[i]), and (xr[i],yr[i]-1) adjacent to the position (xr[i],yr[i]) of the i-th robot is taken as a void position L'. At each position L', the maximum of Q(L',a^-1) over a^-1=0,1,2,3,4 is determined as the "candidate void Q function value", and the value of a^-1 that gives it is determined as the "candidate void action at L'".

(2) From the candidate void actions obtained in (1), the set of positions L' for which a void moving from L' according to its candidate void action would arrive at the position (xr[i],yr[i]) of the i-th robot is obtained as the "candidate void position set".

(3) The set of triples consisting of each candidate void position L' in the "candidate void position set" obtained in (2), the candidate void Q function value at L', and the candidate void action at L' is output to the first in-region action candidate determination unit 12-i-33 as the "candidate void set".
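Steps (1) to (3) can be sketched as a single pass over the four neighboring cells. This is an illustrative sketch only: the dictionary layout of `Q`, the treatment of missing entries as negative infinity, and the helper `step` are assumptions of the example.

```python
MOVES = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}


def step(L, a):
    """Position reached from L by action a."""
    dx, dy = MOVES[a]
    return (L[0] + dx, L[1] + dy)


def candidate_void_set(Q, robot):
    """Return [(void position, candidate void Q value, candidate void action), ...]."""
    candidates = []
    for a in (1, 2, 3, 4):
        Lv = step(robot, a)  # neighboring cell treated as a void position
        # (1) the void action with the largest Q value at Lv
        best = max((0, 1, 2, 3, 4), key=lambda b: Q.get((Lv, b), float("-inf")))
        # (2) keep Lv only if that action moves the void onto the robot's cell
        if step(Lv, best) == robot:
            # (3) record position, Q value, and action together
            candidates.append((Lv, Q[(Lv, best)], best))
    return candidates
```

A neighbor whose best void action is to stay, or to move away from the robot, never appears in the result, so the returned list can have anywhere from zero to four entries.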

〔First in-region action candidate determination unit 12-i-33〕
From the "candidate void set", the first in-region action candidate determination unit 12-i-33 selects as the "first target position" the candidate void position L' whose candidate void Q function value is smallest.

It then outputs, as the first in-region action candidate value, the action that moves the i-th robot from its position (xr[i],yr[i]) to the first target position determined above.

〔First in-region inclusion control unit 12-i-34〕
The first in-region inclusion control unit 12-i-34 determines whether another robot is at the position (xr'[i],yr'[i]) that the i-th robot would reach by moving according to the first in-region action candidate value determined by the first in-region action candidate determination unit 12-i-33. That is, it determines whether there is a j (i≠j) such that (xr'[i],yr'[i])=(xr[j],yr[j]). It also determines whether an obstacle is at the position (xr'[i],yr'[i]) and whether that position is outside G.

If another robot or an obstacle is at the destination (xr'[i],yr'[i]), or the destination is outside G, the first in-region inclusion control unit 12-i-34 has the second in-region action candidate determination unit 12-i-35 perform the next process.

If neither another robot nor an obstacle is at the destination (xr'[i],yr'[i]) and the destination is not outside G, the first in-region inclusion control unit 12-i-34 outputs the first in-region action candidate value as the "action value".

〔Second in-region action candidate determination unit 12-i-35〕
From the "candidate void set", the second in-region action candidate determination unit 12-i-35 selects as the "second target position" the candidate void position L' whose candidate void Q function value is the second smallest.

It then outputs, as the second in-region action candidate value, the action that moves the i-th robot from its position (xr[i],yr[i]) to the second target position determined above.

〔Second in-region inclusion control unit 12-i-36〕
The second in-region inclusion control unit 12-i-36 determines whether another robot is at the position (xr'[i],yr'[i]) that the i-th robot would reach by moving according to the second in-region action candidate value determined by the second in-region action candidate determination unit 12-i-35. That is, it determines whether there is a j (i≠j) such that (xr'[i],yr'[i])=(xr[j],yr[j]). It also determines whether an obstacle is at the position (xr'[i],yr'[i]) and whether that position is outside G.

If another robot or an obstacle is at the destination (xr'[i],yr'[i]), or the destination is outside G, the second in-region inclusion control unit 12-i-36 has the third in-region action candidate determination unit 12-i-37 perform the next process.

If neither another robot nor an obstacle is at the destination (xr'[i],yr'[i]) and the destination is not outside G, the second in-region inclusion control unit 12-i-36 outputs the second in-region action candidate value as the "action value".

〔Third in-region action candidate determination unit 12-i-37〕
From the "candidate void set", the third in-region action candidate determination unit 12-i-37 selects as the "third target position" the candidate void position L' whose candidate void Q function value is the third smallest.

It then outputs, as the third in-region action candidate value, the action that moves the i-th robot from its position (xr[i],yr[i]) to the third target position determined above.

〔Third in-region inclusion control unit 12-i-38〕
The third in-region inclusion control unit 12-i-38 determines whether another robot is at the position (xr'[i],yr'[i]) that the i-th robot would reach by moving according to the third in-region action candidate value determined by the third in-region action candidate determination unit 12-i-37. That is, it determines whether there is a j (i≠j) such that (xr'[i],yr'[i])=(xr[j],yr[j]). It also determines whether an obstacle is at the position (xr'[i],yr'[i]) and whether that position is outside G.

If another robot or an obstacle is at the destination (xr'[i],yr'[i]), or the destination is outside G, the third in-region inclusion control unit 12-i-38 has the fourth in-region action candidate determination unit 12-i-39 perform the next process.

If neither another robot nor an obstacle is at the destination (xr'[i],yr'[i]) and the destination is not outside G, the third in-region inclusion control unit 12-i-38 outputs the third in-region action candidate value as the "action value".

〔Fourth in-region action candidate determination unit 12-i-39〕
From the "candidate void set", the fourth in-region action candidate determination unit 12-i-39 selects as the "fourth target position" the candidate void position L' whose candidate void Q function value is the fourth smallest.

It then outputs, as the fourth in-region action candidate value, the action that moves the i-th robot from its position (xr[i],yr[i]) to the fourth target position determined above.

〔Fourth in-region inclusion control unit 12-i-310〕
The fourth in-region inclusion control unit 12-i-310 determines whether another robot is at the position (xr'[i],yr'[i]) that the i-th robot would reach by moving according to the fourth in-region action candidate value determined by the fourth in-region action candidate determination unit 12-i-39. That is, it determines whether there is a j (i≠j) such that (xr'[i],yr'[i])=(xr[j],yr[j]). It also determines whether an obstacle is at the position (xr'[i],yr'[i]) and whether that position is outside G.

If another robot or an obstacle is at the destination (xr'[i],yr'[i]), or the destination is outside G, the fourth in-region inclusion control unit 12-i-310 outputs a=0 (stay) as the "action value".

If neither another robot nor an obstacle is at the destination (xr'[i],yr'[i]) and the destination is not outside G, the fourth in-region inclusion control unit 12-i-310 outputs the fourth in-region action candidate value as the "action value a[i]".

Through the above processing, the i-th assignment unit 12-i outputs an action value a[i]∈{0,1,2,3,4} corresponding to the action that the i-th robot selects at its current position (xr[i],yr[i]). The action assignment unit 12 therefore outputs the action values a[i] that the N robots select at their respective current positions.
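The in-region cascade can be pictured as trying candidate voids in ascending order of their candidate void Q function value and staying put when none is usable. The sketch below assumes the candidate void set is a list of (position, Q value, void action) triples and that `occupied`, `obstacles`, and `G` are sets of positions; these names and the behavior when fewer than four candidates exist are assumptions of the example.

```python
DELTA_TO_ACTION = {(1, 0): 1, (0, 1): 2, (-1, 0): 3, (0, -1): 4}


def choose_inside_action(candidates, robot, occupied, obstacles, G):
    """Move toward the usable candidate void with the smallest Q value, else 0 (stay)."""
    for Lv, _q, _void_action in sorted(candidates, key=lambda c: c[1]):
        if Lv in occupied or Lv in obstacles or Lv not in G:
            continue  # blocked or outside G: fall through to the next candidate
        delta = (Lv[0] - robot[0], Lv[1] - robot[1])
        return DELTA_TO_ACTION[delta]  # robot action that steps onto the void
    return 0
```

Choosing the smallest value first means the robot swaps with the void that is currently least well placed relative to the entrance, which is how voids get pushed toward Pe.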

＜Position update unit 13＞
For each i=1,2,…,N, the position update unit 13 computes the position (xr'[i],yr'[i]) that the i-th robot reaches after taking, at its current position (xr[i],yr[i]), the action corresponding to the action value a[i] output by the action assignment unit 12, and overwrites the position of the i-th robot stored in the storage unit 2 with the computed (xr'[i],yr'[i]). In other words, the position update unit 13 updates the position of each control object, for example each robot, based on the action determined by the action assignment unit 12 (step A4). The sequence of updated positions {(xr'[1],yr'[1]), (xr'[2],yr'[2]), …, (xr'[N],yr'[N])} is input to the control unit 14.
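A minimal sketch of the position update, assuming positions are held as a list of (x, y) tuples and using the action-to-displacement mapping given earlier for a=1 to 4, with a=0 meaning stay:

```python
MOVES = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}


def update_positions(positions, actions):
    """Apply action a[i] to robot i and return the list of new positions."""
    return [
        (x + MOVES[a][0], y + MOVES[a][1])
        for (x, y), a in zip(positions, actions)
    ]
```

The function returns a new list instead of modifying its input, so the caller decides when the stored positions are overwritten.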

<Control unit 14>
The control unit 14 performs control so that the processing of the action assignment unit and the position update unit is performed repeatedly (step A5).

The control unit 14 performs control so that the processing of the action assignment unit and the position update unit is repeated until a predetermined termination condition is satisfied. For example, for every pair consisting of a pre-update Q function and a post-update Q function output from the out-of-target-area action determination unit 12-i-2 or the in-target-area action determination unit 12-i-3, the control unit 14 has the action assignment unit 12 and the position update unit 13 execute their processing until the difference between the pre-update Q-function value and the post-update Q-function value is equal to or less than a predetermined threshold. The termination condition in this case is that the difference between the pre-update and post-update Q-function values is equal to or less than the predetermined threshold.

Once the difference between the pre-update Q-function value and the post-update Q-function value is equal to or less than the predetermined threshold for every such pair, the learning step performed by the learning unit 1 of the behavior control device ends.
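The termination test for the learning step can be sketched as below. The sketch assumes that each pair is available as two numbers (the Q-function value before and after an update) and that "difference" means the absolute difference; both points, and the name `learning_converged`, are assumptions for illustration.

```python
from typing import Iterable, Tuple


def learning_converged(
    q_pairs: Iterable[Tuple[float, float]], threshold: float
) -> bool:
    """Return True when every (before, after) pair differs by <= threshold.

    Assumption: the difference is taken as an absolute value. An empty
    collection of pairs is treated as converged.
    """
    return all(abs(after - before) <= threshold for before, after in q_pairs)
```

A control loop in the spirit of step A5 would keep running action assignment and position update while this function returns `False`.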

Next, the processing of the action schedule step performed by the scheduling unit 3 of the behavior control device is described. The description focuses on the parts that differ from the learning unit 1; parts that are the same as in the learning unit 1 are not described again.

An example of the processing flow of the action schedule step is shown in FIG. 10.

<Scheduling unit 3>
Using the Q function and the policy obtained by the processing of the learning unit 1 described above, the scheduling unit 3 determines the action plan of each robot by which the N real robots form the target formation from their initial positions. The detailed configuration of the scheduling unit is shown in FIG. 6.

<<Initial state input unit 31>>
The initial position (xr0[i], yr0[i]) [i = 1, 2, ..., N] of each of the N robots is input to the initial state input unit 31.

<<Action assignment unit 32>>
The processing of the action assignment unit 32 is the same as that of the action assignment unit 12 of the learning unit 1. For i = 1, 2, ..., N1, the i-th assignment unit 32-i is the same as the i-th assignment unit 12-i of the action assignment unit 12 of the learning unit 1.

Here, however, the action assignment unit 32 stores the action a[i] determined for each i in the storage unit as the action a_t[i] that the i-th robot selects at the current time t. As a result, the storage unit 2 holds the sequence of actions (action sequence) A[i] = {a_1[i], a_2[i], ..., a_{t-1}[i]} selected by the i-th robot at each time up to time t.

In addition, the action assignment unit 12 of the learning unit 1 not only determines a[i] but also updates the Q-function values and the policy values, whereas the action assignment unit 32 of the scheduling unit 3 does not need to update the Q-function values or the policy values.

When the Q-function values are not updated, the position determination unit 32-i-1 of the action assignment unit 32 determines whether the robot is located at a target position (step B1). If the robot is determined not to be located at a target position, the out-of-target-area action determination unit 32-i-2 of the action assignment unit 32 determines, as the robot's action, the action of moving to the position with the largest value-function value among the positions to which the robot can move (step B2). If the robot is determined to be located at a target position, the in-target-area action determination unit 32-i-3 takes, as candidate void positions, those void positions from which a move to the robot's current position L is possible and for which the action maximizing the value function is the move to position L; takes the value-function value corresponding to that maximizing action as the candidate void Q-function value; and determines, as the robot's action, the action of moving to the candidate void position to which the robot can move and whose candidate void Q-function value is smallest (step B3).
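Steps B2 and B3 can be sketched as below. This is one possible reading of the text, not the patented implementation. It assumes a grid world, a Q table stored as a dictionary keyed by (position, action), the same table being used for robots and voids, an illustrative action-to-direction mapping, and a fallback to "stay still" when no candidate void exists. All names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Pos = Tuple[int, int]
QTable = Dict[Tuple[Pos, int], float]

# ASSUMED action-to-displacement mapping; only a = 0 (stay) is from the text.
MOVES = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}
NEG_INF = float("-inf")


def step(pos: Pos, action: int) -> Pos:
    """Position reached from pos by taking the given action."""
    dx, dy = MOVES[action]
    return (pos[0] + dx, pos[1] + dy)


def choose_outside_action(pos: Pos, q: QTable, feasible: Iterable[int]) -> int:
    """Step B2 (sketch): pick the feasible action with the largest Q value."""
    return max(feasible, key=lambda a: q.get((pos, a), NEG_INF))


def choose_inside_action(pos: Pos, q: QTable, voids: Set[Pos]) -> int:
    """Step B3 (sketch): move into the candidate void with the smallest Q.

    A neighbouring void is a candidate only if the void's own best action
    is the move into the robot's current position.
    """
    best_action, best_value = 0, None  # assumed fallback: stay still
    for action in (1, 2, 3, 4):
        void = step(pos, action)
        if void not in voids:
            continue
        void_action = max(MOVES, key=lambda b: q.get((void, b), NEG_INF))
        if step(void, void_action) != pos:
            continue  # the void's best move does not lead to this robot
        value = q.get((void, void_action), NEG_INF)
        if best_value is None or value < best_value:
            best_action, best_value = action, value
    return best_action
```

Under this reading, a robot inside the target area swaps places with the neighbouring void that most wants to move into it but has the lowest value among such voids; whether ties or missing candidates are handled this way in the actual device is not stated in the text.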

<<Position update unit 33>>
The processing of the position update unit 33 is the same as that of the position update unit 13 of the learning unit 1. That is, the position update unit 33 updates the position of each control object, for example a robot, based on the action determined by the action assignment unit 32 (step B4).

<<Target position arrival determination unit 34>>
For each i = 1, 2, ..., N, the target position arrival determination unit 34 determines whether the updated position output from the position update unit 33 satisfies (xr'[i], yr'[i]) ∈ G. If (xr'[i], yr'[i]) ∈ G holds for all i, it outputs the action sequence A[i] = {a_1[i], a_2[i], ..., a_{t-1}[i], a_t[i]} currently stored in the storage unit 2 as the scheduling result. If (xr'[i], yr'[i]) ∈ G does not hold for at least one i, it performs control so that the action assignment unit 32 and the position update unit 33 are executed again (step B5).
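The overall loop of steps B1 to B5 can be sketched as follows. The action selection of steps B1 to B3 is abstracted into a caller-supplied function; that function, the action-to-direction mapping, and the `max_steps` safety limit are assumptions added for illustration and do not appear in the text.

```python
from typing import Callable, List, Set, Tuple

Pos = Tuple[int, int]

# ASSUMED action-to-displacement mapping; only a = 0 (stay) is from the text.
MOVES = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (0, -1)}


def schedule(
    initial_positions: List[Pos],
    target_region: Set[Pos],
    choose_action: Callable[[int, List[Pos]], int],
    max_steps: int = 1000,  # hypothetical guard against endless loops
) -> List[List[int]]:
    """Return the action sequence A[i] for each robot i."""
    positions = list(initial_positions)
    sequences: List[List[int]] = [[] for _ in positions]
    for _ in range(max_steps):
        # Steps B1-B3: decide a_t[i] for every robot and record it.
        actions = [choose_action(i, positions) for i in range(len(positions))]
        for i, a in enumerate(actions):
            sequences[i].append(a)
        # Step B4: update every robot position.
        positions = [
            (x + MOVES[a][0], y + MOVES[a][1])
            for (x, y), a in zip(positions, actions)
        ]
        # Step B5: finish once every robot is inside G.
        if all(p in target_region for p in positions):
            return sequences
    raise RuntimeError("target formation not reached within max_steps")
```

With a single robot at (0, 0), G = {(2, 0)}, and a selection function that always returns action 1, the loop runs twice and returns `[[1, 1]]`.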

[Modifications, etc.]
The out-of-target-area action determination unit 12-i-2 consists of four layers (first to fourth) because, in the above example, the robot can take four kinds of actions (a = 1, 2, 3, 4) other than staying still (a = 0). In general, if there are M kinds of actions (including staying still), the out-of-target-area action determination unit 12-i-2 has M-1 layers. The same applies to the in-target-area action determination unit 12-i-3, the out-of-target-area action determination unit 32-i-2, and the in-target-area action determination unit 32-i-3.

The present invention is not limited to the above-described embodiment and can be modified as appropriate without departing from the spirit of the invention. In addition, the processing described in the above embodiment need not be executed in time series in the order described; it may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes the processing or as needed.

When the processing functions of the hardware entity described in the above embodiment are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer realizes the processing functions of the hardware entity on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of this processing content may instead be realized in hardware.

DESCRIPTION OF SYMBOLS
1 Learning unit
2 Storage unit
3 Scheduling unit
11 Input unit
12 Action assignment unit
12-i-1 Position determination unit
12-i-2 Out-of-target-area action determination unit
12-i-3 In-target-area action determination unit
13 Position update unit
14 Control unit
31 Initial state input unit
32 Action assignment unit
32-i-1 Position determination unit
32-i-2 Out-of-target-area action determination unit
32-i-3 In-target-area action determination unit
33 Position update unit
34 Target position arrival determination unit

Claims

A behavior control device that performs behavior control for moving a plurality of control objects to a set of target positions including a predetermined entrance position,
wherein the behavior control is performed based on a single value function that represents the appropriateness of a control object taking each action a at the current position L of that control object, the value function having been learned by configuring the plurality of control objects to be represented by a Markov state space that takes the state variable of one control object as its argument,
the behavior control device including:
an action assignment unit including
(1) a position determination unit that determines whether each control object is located at a target position,
(2) an out-of-target-area action determination unit that, when a control object is determined not to be located at a target position, takes the control object heading toward the entrance position as the ideal state, updates the value function based on the current position of each control object, and determines, as the action of each control object, the action of moving to the position with the largest updated value-function value among the positions to which that control object can move, and
(3) an in-target-area action determination unit that, when a control object is determined to be located at a target position, takes as the ideal state a void heading toward the entrance position, the void being a virtual entity whose position is exchanged with that of a control object as the control object moves, updates the value function based on the current position of the void, takes as candidate void positions those void positions from which a move to the current position L of each control object is possible and for which the action maximizing the updated value function is the move to the position L, takes the updated value-function value corresponding to that maximizing action as the candidate void Q-function value, and determines, as the action of each control object, the action of moving to the candidate void position to which that control object can move and whose candidate void Q-function value is smallest;
a position update unit that updates the position of each control object based on the determined action; and
a control unit that performs control so that the processing of the action assignment unit and the position update unit is performed repeatedly.

A behavior control method for performing behavior control for moving a plurality of control objects to a set of target positions including a predetermined entrance position,
wherein the behavior control is performed based on a single value function that represents the appropriateness of a control object taking each action a at the current position L of that control object, the value function having been learned by configuring the plurality of control objects to be represented by a Markov state space that takes the state variable of one control object as its argument,
the behavior control method including:
an action assignment step including
(1) a position determination step in which a position determination unit determines whether each control object is located at a target position,
(2) an out-of-target-area action determination step in which an out-of-target-area action determination unit, when a control object is determined not to be located at a target position, takes the control object heading toward the entrance position as the ideal state, updates the value function based on the current position of each control object, and determines, as the action of each control object, the action of moving to the position with the largest updated value-function value among the positions to which that control object can move, and
(3) an in-target-area action determination step in which an in-target-area action determination unit, when a control object is determined to be located at a target position, takes as the ideal state a void heading toward the entrance position, the void being a virtual entity whose position is exchanged with that of a control object as the control object moves, updates the value function based on the current position of the void, takes as candidate void positions those void positions from which a move to the current position L of each control object is possible and for which the action maximizing the updated value function is the move to the position L, takes the updated value-function value corresponding to that maximizing action as the candidate void Q-function value, and determines, as the action of each control object, the action of moving to the candidate void position to which that control object can move and whose candidate void Q-function value is smallest;
a position update step in which a position update unit updates the position of each control object based on the determined action; and
a control step in which a control unit performs control so that the processing of the action assignment step and the position update step is performed repeatedly.

A program for causing a computer to function as each unit of the behavior control device according to claim 1.