CN113657573A - Robot skill acquisition method based on meta-learning under guidance of contextual memory - Google Patents
- Publication number
- CN113657573A CN113657573A CN202110740838.4A CN202110740838A CN113657573A CN 113657573 A CN113657573 A CN 113657573A CN 202110740838 A CN202110740838 A CN 202110740838A CN 113657573 A CN113657573 A CN 113657573A
- Authority
- CN
- China
- Prior art keywords
- memory
- scene
- robot
- event
- learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06N3/008 — Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
- G06F18/22 — Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- G06N3/044 — Neural networks; Architecture; Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Neural networks; Architecture; Combinations of networks
- G06N3/084 — Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
Abstract
The invention provides a robot skill acquisition method based on meta-learning under the guidance of contextual memory. First, a contextual memory model of the robot learning system is established and a similarity measurement algorithm between robot perception and memory is constructed, enabling retrieval and matching of events against stored scene information as well as updating and recall of events in contextual memory. Then, a contextual-memory-guided meta-learning algorithm for robot operation skills is constructed, which acquires knowledge both from each individual task and from all tasks taken together. The method uses existing experience to guide the robot in learning new skills, improves the efficiency of robot operation-skill learning, and alleviates the problems of excessive data requirements and repeated training on similar tasks during robot operation-skill learning.
Description
Technical Field
The invention belongs to the technical field of intelligent service robots, and relates to a robot operation-skill learning method based on contextual memory and meta-learning.
Background
In recent years, in fields such as industrial production, medical treatment, commerce and household service, current learning methods for intelligent robots have been competent at precise, repetitive tasks but lack the ability to learn new tasks: similar task scenarios require repeated training, and accumulated experience cannot be reused to accelerate the learning of new tasks. In invention patent CN108333941A, Durolon, Zhangleing et al. of South China University disclose a cloud-robot collaborative learning method based on hybrid enhanced intelligence: a general task is decomposed into simple subtasks by a neural task programming meta-learning method, the robot learns the subtasks through learning from demonstration, and the subtasks are then aggregated and shared. Song Rui, Li Fengming et al. of Shandong University, in patent CN111618862A, disclose a robot operation-skill learning system and method guided by prior knowledge: the robot learning system is modularized into physics, evaluation, policy-learning and other modules, a state-action mapping set of the robot is established, and the difficulty of robot skill learning is reduced. However, the above methods have a limited range of applicability. Above all, none of them reuses experience, and they pay little attention to biological learning systems. Second, they are only suitable for learning specific tasks: extended learning of robot operation skills is not possible, the robot lacks the capabilities of autonomous learning and exploration as well as adaptability to the task environment, real-time learning in practical applications cannot be achieved, and the requirement that a robot continuously encounter new tasks and learn new skills is difficult to satisfy. Finally, such robot learning systems have complex frameworks that are difficult to design and build.
Therefore, these methods cannot meet the requirements of rapid learning and generalization of intelligent-robot operation skills.
Disclosure of Invention
The invention mainly addresses the problem of how an intelligent robot can use learned knowledge and experience to solve new tasks encountered during operation and adapt to new task goals. Aiming at the current problems in robot skill learning, namely that large amounts of training data are required, that similar task scenarios must be trained repeatedly, and that experience cannot be accumulated to accelerate the learning of new tasks, the invention provides a meta-learning robot skill-learning method combined with contextual memory. First, in the learning process, a task is learned by a meta-learning method, and the scene observations, the trained network weights and related information are stored as experience in a contextual memory model. Second, memory matching and reading are performed by measuring the similarity between scenes with the cosine distance, and memory writing and updating use the LRUA algorithm. Finally, the robot's perception-and-planning module combines environment perception, target detection and path planning to interact with the target object and complete the task, realizing memory-guided rapid learning of robot operation skills. The method specifically comprises the following steps:
Step 1: establishing a memory model of the robot learning system;
A skill-based event modeling method is used to establish a robot contextual-memory mathematical model M, where M is the memory set formed by a number of contextual memories m. Each contextual memory m mainly comprises: a time-varying sequence of scenario events E, the experience knowledge G learned by the meta-learning network belonging to that scenario, and a key-value feature vector K used to retrieve matching similar events, i.e. m = {E, G, K}. The event sequence E consists of i events, i.e. E = {e1, e2, …, ei}; each event stores information such as the environment observations and actions related to the scenario, and experience knowledge is acquired through event matching in order to guide decision-making behavior.
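The structures of step 1 can be sketched as plain data containers. This is a minimal illustration only; all class and field names are hypothetical choices, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One scenario event e = <o, p_e, a, p_t> (fields per the patent's quadruple)."""
    o: list    # environment state perception from sensors
    p_e: tuple # 3-D coordinates of the end effector
    a: int     # action executed by the manipulator
    p_t: tuple # 3-D coordinates of the target object

@dataclass
class ContextualMemory:
    """One contextual memory m = {E, G, K}."""
    E: list = field(default_factory=list)  # time-varying event sequence
    G: dict = field(default_factory=dict)  # experience knowledge (e.g. network weights)
    K: list = field(default_factory=list)  # key-value feature vector for retrieval

# M is the set (here: a list) of contextual memories
M = []
M.append(ContextualMemory(E=[Event(o=[0.1], p_e=(0, 0, 0), a=0, p_t=(1, 1, 0))]))
```

A real implementation would store tensors rather than lists, but the nesting (M contains m, m contains E, G, K, and E contains events) follows the model as described.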
Step 2: constructing a similarity measurement algorithm for robot perception and memory;
The more similar a new task is to the tasks trained in the meta-learning training stage, the more contextual memories are available. The task encoder encodes the event information at each time t into a key-value feature vector K_st. When retrieving and matching scenarios, an appropriate contextual memory is selected by computing the similarity between the key-value feature vector of the current event and those of the events stored in contextual memory. In the application stage, the task encoder encodes the scene information delivered by the perception system to generate a key-value feature vector K_t(i), and retrieval and matching are performed by computing the similarity measure between the current event's scene information and the event information stored in contextual memory.
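The similarity measure named in the method is the cosine distance between key-value feature vectors. A self-contained sketch of retrieval by key similarity (function names are illustrative, not from the patent):

```python
import math

def cosine_similarity(k1, k2):
    """Cosine similarity between two key-value feature vectors."""
    dot = sum(a * b for a, b in zip(k1, k2))
    n1 = math.sqrt(sum(a * a for a in k1))
    n2 = math.sqrt(sum(b * b for b in k2))
    return dot / (n1 * n2)

def retrieve(memory_keys, query_key):
    """Return the index of the stored event key most similar to the query."""
    scores = [cosine_similarity(query_key, k) for k in memory_keys]
    return max(range(len(scores)), key=scores.__getitem__)

# the query is closest to the first stored key
best = retrieve([[1.0, 0.0], [0.0, 1.0]], [0.9, 0.1])
```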
Step 3: writing the real-time experience into the memory model according to the contextual-memory writing mechanism;
Judge whether the current scene is a new event; if so, record the event, otherwise update the existing event in contextual memory. When the number of stored contextual memories reaches the set maximum of 20, only the reserved memory-storage buffer remains in the memory storage area; the current task memory is then temporarily stored in the buffer, and after the task is completed the memory is updated with the LRUA algorithm.
Step 4: constructing a contextual-memory-guided meta-learning algorithm for robot operation skills;
Meta-learning operates on two levels: the first acquires knowledge quickly within each individual task, and the second extracts information slowly across all tasks. The robot learns skills from the training tasks through the training-set data. A training task is first split into events, with each action executed by the robot corresponding to one event. During training, the robot packages the events and the executed strategies (skills) through the contextual-memory module and establishes the relation between events and skills; in addition, the robot learns across all training tasks through the meta-learning network and packages information such as the network weights into experience knowledge.
Step 5: constructing a generalized learning algorithm for new tasks based on contextual memory.
The robot is guided to learn new tasks appearing in the working environment according to the memories obtained in steps 2, 3 and 4. First, the perception module obtains the environment-state information, similarity measurement is performed between the current perception and the events existing in the memory bank, and appropriate events are selected from memory to guide the current task.
The invention has the advantages that:
The invention effectively addresses the current problems that intelligent-robot operation-skill learning requires large amounts of training data, that similar task scenarios require repeated training, and that experience cannot be accumulated to accelerate the learning of new tasks. By introducing human-like contextual memory into the meta-learning method, experience can guide the robot's skill learning when it faces a new task, realizing skill reuse. The invention can learn from a small number of samples, complete complex and varied tasks by learning and memorizing simple ones, and use prior experience knowledge to master skills quickly with little training, effectively improving the learning efficiency and execution success rate of robot skill learning.
Drawings
FIG. 1 is an overall flow diagram of the process of the present invention;
FIG. 2 is a diagram of a scenario memory model architecture;
FIG. 3 is a diagram of a scene memory update process;
FIG. 4 is a schematic diagram of an LSTM network structure;
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
The flow of the contextual-memory-guided meta-learning-based robot skill acquisition provided in this example is shown in fig. 1. A perception-and-planning module is constructed, and objects are located and recognized through target detection. In the process of establishing and recalling the contextual-memory model, the interaction between contextual memory and the meta-learning network is realized through a task encoder and a task decoder: the encoder encodes a single task of the meta-learning network into an addressable label, and the task decoder decodes contextual experience into usable information that is passed to the meta-learning network. In the meta-learning process, the meta-learner learns and masters the current task at a low level for each task, learns across all tasks at a high level, stores the experience knowledge through the contextual-memory model, and guides the meta-learner on subsequent tasks.
In this embodiment, learning the skill of stacking wooden blocks on a tabletop platform is taken as an example; the block-stacking learning method comprises the following steps:
Step 1: establish the memory model of the robot operation-skill learning system. A robot contextual-memory mathematical model M is established, structured as shown in fig. 2, where each contextual memory m = {E, G, K}: m comprises a time-varying event-sequence combination E, the experience knowledge G learned by the meta-learning network belonging to that scenario, and a key-value feature vector K used to retrieve matching similar events. The event sequence E consists of i events, i.e. E = {e1, e2, …, ei}; each event stores information such as the environment observations and actions related to the scenario and represents the scenes and action sequences the robot has experienced in the task, while the experience knowledge G consists of knowledge such as the skills learned in the task. The robot continuously accumulates experience during learning while storing the important scene information of a task in events. Each event e consists of a quadruple <o, p_e, a, p_t>, where o is the state perception of the environment obtained by the sensors, including the distribution of objects in the image, their positional relations, the joint information of the robot and so on; p_e is the three-dimensional coordinates of the end effector of the manipulator; a is the action executed by the manipulator, representing the robot's action sequence for the current task in the time dimension; and p_t is the three-dimensional coordinates of the target object with which the manipulator interacts, as shown in the overall structure of fig. 2.
Step 2: perform similarity measurement between robot perception and memory. In the learning process, the task encoder encodes the event information at each time t to generate a key-value feature vector K_st. When retrieving and matching scenarios, an appropriate contextual memory is selected by computing the similarity between the key-value feature vectors of the current event and of the events stored in contextual memory. The memory-update process is shown in fig. 3.
Step 3: write the real-time experience into the memory model according to the contextual-memory writing mechanism. When the number of stored contextual memories reaches the set maximum of 20, only the reserved memory-storage buffer remains in the storage area; the current task memory is then temporarily stored in the buffer, and after the task is completed the memory is updated with the LRUA (Least Recently Used Access) algorithm. LRUA, the least-recently-used method, stores information to the memory location with the fewest uses, to protect recently written information, or writes to the memory location that has just been read, to avoid repeatedly storing similar memories. When the memory is updated, the softmax function converts the cosine distance between each time event in the buffered contextual memory and the events in contextual memory into a write weight:

w_t^w(i) = softmax( D(K_s, M_t(i)) )

where D(K_s, M_t(i)) is the cosine distance between the scene and the memorized event at time t, K_s is the key-value feature vector of the state at time t, and M_t(i) is the key-value feature vector of each time event in the buffered contextual memory.
The write weights of the events belonging to the same contextual memory are then summed and averaged to give the coverage weight. According to this result, the new memory is written in one of the following two ways:
a. When there is high similarity between the two scenarios, i.e. the coverage weight is sufficiently large, the scenario is written to the position of the most frequently recalled scenario in the buffer.
b. Otherwise, when the scenario in the buffer is not particularly similar to those in the memory storage area, the position of the contextual memory with the lowest use weight is selected and overwritten, ensuring efficient utilization of the storage area. The use weight is defined as the number of times a contextual memory in the storage area has been matched; each time it is matched, its use weight is incremented by 1.
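The buffered-write decision above can be sketched as a single slot-selection function. Two assumptions are made where the patent elides the exact formula: the write weight is taken as a softmax over cosine similarities (similarity rather than distance, so that larger means closer), and the coverage threshold is a hypothetical constant:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def write_slot(query_key, stored_keys, use_counts, threshold=0.7):
    """Choose a memory slot to overwrite, LRUA-style.

    Case a: if the best (coverage) weight is high, reuse the most similar
    slot. Case b: otherwise evict the least-used slot.
    """
    weights = softmax([cosine(query_key, k) for k in stored_keys])
    best = max(range(len(weights)), key=weights.__getitem__)
    if weights[best] >= threshold:
        return best  # case a: overwrite the most similar slot
    return min(range(len(use_counts)), key=use_counts.__getitem__)  # case b

slot = write_slot([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], use_counts=[3, 1])
```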
Step 4: perform robot operation-skill training with the meta-learning method. Because the gradient-based update mechanism of backpropagation resembles the update of the cell state in an LSTM, and the long-term memory structure of the LSTM network is very close to the idea of meta-learning, an LSTM is adopted to replace the backpropagation meta-learning network. The network structure is shown in fig. 4, where x_t is the input of the current cell, h_t is the hidden-layer output, σ is the sigmoid activation function, tanh is the tanh activation function, ⊗ denotes element-wise multiplication, and ⊕ denotes addition.
Setting the learning rate at time t to α_t, the learner parameters are updated as:

θ_t = θ_{t-1} − α_t ∇_{θ_{t-1}} L_t

where θ_t is the parameter after the t-th update iteration, α_t is the learning rate at time t, ∇_{θ_{t-1}} L_t is the gradient of the loss function with respect to θ_{t-1}, and the subscript t on the loss function denotes the t-th update; the loss and its gradient are taken with respect to the parameter θ_{t-1} from the previous iteration.
This process has the same form as the update of the cell state in the LSTM:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

Let the forget gate f_t = 1, the cell state c_{t-1} = θ_{t-1}, the learning rate i_t = α_t and c̃_t = −∇_{θ_{t-1}} L_t; the two updates then coincide. When the network parameters fall into a 'saddle point', the current parameters need to be shrunk and the previous parameters θ_{t-1} forgotten, so the learning rate i_t and the forget gate f_t are redefined as:

i_t = σ( W_I · [∇_{θ_{t-1}} L_t, L_t, θ_{t-1}, i_{t-1}] + b_I )
f_t = σ( W_F · [∇_{θ_{t-1}} L_t, L_t, θ_{t-1}, f_{t-1}] + b_F )

where σ is the sigmoid function, W_I and W_F are the update weights of the input gate and the forget gate respectively, b_I and b_F are the bias parameters of the input gate and the forget gate respectively, θ_{t-1} is the learner parameter at time t-1, L_t is the loss function after t updates, and ∇_{θ_{t-1}} L_t is the gradient of the loss at time t with respect to θ_{t-1}.
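The gated update above can be sketched numerically with scalar parameters. The gate values here are hand-set constants purely for illustration; in the method itself they are produced by the LSTM from the loss and gradient:

```python
def meta_update(theta_prev, grad, forget_gate=1.0, input_gate=0.1):
    """One LSTM-style parameter update: theta_t = f_t * theta_{t-1} - i_t * grad.

    With forget_gate = 1 and input_gate = alpha this reduces to plain
    gradient descent; near a saddle point the meta-learner can shrink
    forget_gate below 1 to partially forget theta_{t-1}.
    """
    return forget_gate * theta_prev - input_gate * grad

theta = meta_update(2.0, 4.0)                      # plain SGD step: 2.0 - 0.1*4.0 = 1.6
theta_saddle = meta_update(2.0, 0.0, forget_gate=0.5)  # shrink parameters: 1.0
```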
the meta-learner updates the LSTM cell state through the two steps, and the meta-learner can quickly train while avoiding divergence. In the training process, a training task is firstly split into events, each action executed by the robot corresponds to one event, in the training process, the events and executed strategies (skills) are packaged by the robot through a scenario memory module, the relation between the events and the skills is established, in addition, the robot learns all the training tasks through a meta-learning network, and information such as network weight is packaged into experience knowledge.
The mean and variance are collected on each meta-test dataset, so that during meta-training the batch statistics of the training set and the test set are used, while during the meta-test phase the batch statistics of the training set and the running averages of the test set are used during classifier testing, which avoids information loss. For each feature channel of each layer, the corresponding inputs of all samples within the current batch are collected and their mean and variance computed; the mean and variance are then used to normalize the corresponding input of each sample. After normalization, all input features have mean 0 and standard deviation 1. To prevent the normalization from losing feature information, learnable per-feature parameters γ and β are introduced to recover the original input features; with x_i the input and y_i the output, BN_{γ,β}(x_i) denotes the batch-normalization process:

y_i = γ x̂_i + β,  where x̂_i = (x_i − μ_B) / sqrt(σ_B² + ε)
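A per-channel sketch of the normalization just described, in pure Python with the biased batch variance; `gamma` and `beta` stand for the learnable γ and β:

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature channel over a batch, then rescale with gamma/beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([1.0, 2.0, 3.0])
# the normalized channel has mean ~0 and approximately unit spread
```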
A SeLU activation function is adopted in the convolutional-neural-network layers, overcoming the ReLU drawback that, when the gradient of the input is too large, some neurons become inactive after a network-parameter update and stop working. When the differences after activation are too large, the variance is reduced, preventing gradient explosion; on the positive half-axis the gradient is greater than 1, so the variance is increased when it is too small, preventing the gradient from vanishing and keeping the output of each network layer at mean 0 and variance 1. The expression is:

SELU(x) = λ x, for x > 0;  SELU(x) = λ α (e^x − 1), for x ≤ 0

where λ ≈ 1.05 and α ≈ 1.67.
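A direct sketch of the stated expression. The four-decimal constants below are the values commonly used for SELU, an assumption beyond the patent's two decimals:

```python
import math

LAMBDA = 1.0507  # lambda ≈ 1.05
ALPHA = 1.6733   # alpha ≈ 1.67

def selu(x):
    """Scaled exponential linear unit: lambda*x for x > 0, lambda*alpha*(e^x - 1) otherwise."""
    if x > 0:
        return LAMBDA * x
    return LAMBDA * ALPHA * (math.exp(x) - 1.0)

# positive inputs are scaled linearly; negative inputs saturate toward -lambda*alpha
y_pos = selu(1.0)
y_neg = selu(-10.0)
```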
Step 5: learn new robot operation skills based on the trained contextual memory.
In the application process, whether the current perception resembles a previously encoded event or constitutes a new event unlike anything perceived before, the task encoder encodes the scene information delivered by the perception system to generate a key-value feature vector K_t(i). The cosine distance serves as the similarity measurement function, and matching scenarios are retrieved by computing the similarity between the current event's scene information at time t and the event information stored in contextual memory. In the read weight, ξ is an attenuation coefficient: a larger ξ gives the previous event a larger influence on the current state, and ξ = 0 when t = 1. According to the computed read weight, one of the following two scenario-decoding cases is selected to guide the learning of the new task:
(1) When the read weight exceeds a given threshold, the experience information of the scenario to which the matched event belongs is extracted and used as experience for the new task to guide its learning.
(2) If, after traversing memory, the read weights of the events in every stored scenario are all below the given threshold, the current event is defined as a new event, a new scenario is established for the current task, and the scenario with the highest read-weight value is selected to guide the learning of the new task.
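The two read-weight cases can be sketched as a single decision function. The function name, the tuple return and the threshold handling are illustrative choices, not from the patent:

```python
def guide_new_task(read_weights, threshold):
    """Decide how contextual memory guides a new task.

    Case 1: an above-threshold weight reuses the matched scenario's
    experience directly. Case 2: all weights below threshold means the
    event is new; a fresh scenario is created, but the best-scoring
    stored scenario still serves as a soft guide.
    """
    best = max(range(len(read_weights)), key=read_weights.__getitem__)
    if read_weights[best] > threshold:
        return False, best  # case 1: reuse experience of the matched scenario
    return True, best       # case 2: new event; guide with the closest scenario

is_new, idx = guide_new_task([0.2, 0.9, 0.4], threshold=0.5)
```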
Let the event matched from the scenario at the current time step be e_i. At the low level, the scene and action information of the matched event is extracted and passed to the meta-learning network to help the robot make decisions; at the high level, experience information such as the meta-learning-network weights in the contextual memory to which the event belongs is decoded by the task decoder and passed to the meta-learner, giving the meta-learner better-optimized network weights and accelerating convergence.
The contextual perception o_i of the current task is used to judge whether the task is complete: if o_i is the same as the contextual perception at task completion, the current task ends; if they differ, matching continues and the next event is recalled, the skills corresponding to the events in the scenario are combined, and the robot keeps interacting with the environment through closed-loop feedback until the task goal is achieved.
The above description of exemplary embodiments has been presented only to illustrate the technical solution of the invention and is not intended to be exhaustive or to limit the invention to the precise form described. Obviously, many modifications and variations are possible in light of the above teaching to those skilled in the art. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to thereby enable others skilled in the art to understand, implement and utilize the invention in various exemplary embodiments and with various alternatives and modifications. It is intended that the scope of the invention be defined by the following claims and their equivalents.
Claims (3)
1. A robot skill acquisition method based on meta-learning under the guidance of contextual memory is characterized in that a contextual memory module is added on the basis of a meta-learning method, and empirical knowledge learned by a robot in a task is stored, and the method comprises the following steps:
step 1: establishing a robot learning system memory model;
establishing a robot contextual-memory mathematical model M, wherein M is a memory set formed by a number of contextual memories m, and each contextual memory m mainly comprises: a time-varying scenario event-sequence combination E, experience knowledge G learned by the meta-learning network belonging to the scenario, and a key-value feature vector K for retrieving matching similar events, namely m = {E, G, K}; the event-sequence combination E consists of a number of events, i.e. E = {e1, e2, …, ei}; information related to the situation is stored in each event, and experience knowledge is acquired through event matching so as to guide decision-making behavior;
step 2: constructing a robot event perception similarity measurement algorithm;
the task encoder encodes the event information at each time t to generate a key-value feature vector K_st; when retrieving and matching contextual memories, the contextual memory is selected by computing the similarity of the key-value feature vectors of the current event and of the events stored in contextual memory; in the application stage, the task encoder encodes the scene information delivered by the perception system to generate a key-value feature vector K_t(i), and an appropriate contextual memory is selected by computing, with the cosine distance as the similarity measurement function, the similarity between the key-value feature vectors of the current event and of the events stored in contextual memory;
wherein st denotes the state information at time t;
and step 3: writing the real-time experience into a memory model according to a scene memory writing mechanism;
judging whether the current scene is a new event or not, if so, recording the event, and if not, updating the existing event in the scene memory; when the stored scene memory amount reaches the set maximum amount of 20, the memory storage area only remains the reserved memory storage buffer area, the current task memory is temporarily stored in the buffer area at the moment, the memory is updated by utilizing an LRUA algorithm after the task is finished, and the LRUA: the least recently used method is that information is stored to a memory position with less use times to protect the recently written information or the memory position which is just read is written to avoid repeatedly storing similar memory; while updating the memoryConverting cosine distance between each moment event in the buffer scene memory and memory event in the scene memory into writing weight by using softmax function
Wherein, D (K)s,Mt(i) Is the cosine distance of the scene from the memory event at time t, KsKey-value feature vectors of memory events in a context memory for the state at time t, Mt(i) Key value feature vectors of events at each moment in the scene memory in the buffer area;
the write weights of the events belonging to the same scene memory are then summed and averaged to obtain the coverage weight $\bar{w}(i) = \frac{1}{T}\sum_{t=1}^{T} w_t(i)$; according to the calculated $\bar{w}$, the new memory is written either to the location in the memory area of the most similar scene memory or to the location of the least frequently recalled scene memory;
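As a hedged sketch of the write-weight computation described above (the array shapes and names are my assumptions): each buffered event's cosine distances to the stored scene memories pass through a softmax, and the per-scene weights are averaged into a coverage weight:

```python
import numpy as np

def write_weights(sims):
    """Row-wise softmax: sims[t, i] = D(K_s, M_t(i)) for buffered event t and
    stored scene memory i; returns write weights that sum to 1 per event."""
    e = np.exp(sims - sims.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def coverage_weight(w):
    """Sum and average the write weights of the events belonging to the same
    scene memory, giving one coverage weight per stored memory."""
    return w.mean(axis=0)
```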
Step 4: constructing the scene-memory-guided robot motor skill meta-learning algorithm;
meta-learning operates on two levels: the first level rapidly acquires knowledge within each individual task, and the second level slowly extracts information across all tasks, enabling the robot to learn skills from the training tasks using the training-set data. First, each training task is split into subtasks, with each action executed by the robot corresponding to an event. During training, the robot encapsulates event perception and behavior through the scene memory module and establishes the relation between events and behaviors; in addition, the robot learns all the training tasks through the meta-learning network and encapsulates the network weight information as experiential knowledge;
the meta-learning network is constructed with an LSTM in place of the back-propagation learning network; with the learning rate at time t set to $\alpha_t$, the learner parameters are updated as:

$$\theta_t = \theta_{t-1} - \alpha_t \nabla_{\theta_{t-1}} L_t$$
this learner parameter update has the same form as the LSTM cell-state update:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
setting the forget gate $f_t = 1$, the cell state $c_{t-1} = \theta_{t-1}$, the learning rate $i_t = \alpha_t$, and $\tilde{c}_t = -\nabla_{\theta_{t-1}} L_t$ makes the two updates identical. When the network parameters fall into a "saddle point", the current parameters must be shrunk and the previous parameters $\theta_{t-1}$ partially forgotten; the learning rate $i_t$ and forget gate $f_t$ are therefore redefined as:

$$i_t = \sigma\big(W_I \cdot [\nabla_{\theta_{t-1}} L_t,\; L_t,\; \theta_{t-1},\; i_{t-1}] + b_I\big)$$
$$f_t = \sigma\big(W_F \cdot [\nabla_{\theta_{t-1}} L_t,\; L_t,\; \theta_{t-1},\; f_{t-1}] + b_F\big)$$
where $\sigma$ is the sigmoid function, $W_I$ and $W_F$ are the update weights of the input gate and forget gate respectively, $b_I$ and $b_F$ are the bias parameters of the input gate and forget gate respectively, $\theta_{t-1}$ is the learner parameter at time t−1, $L_t$ is the loss function after t updates, and $\nabla_{\theta_{t-1}} L_t$ is the gradient of the time t−1 loss function with respect to $\theta_{t-1}$;
the meta-learner updates the LSTM cell state through these two steps, allowing it to train quickly while avoiding divergence;
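A minimal per-parameter sketch of the gated update in step 4, following the standard LSTM-optimizer formulation; the gate input features and weight shapes are my assumptions, not specified by the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def meta_lstm_step(theta_prev, grad, loss, i_prev, f_prev, W_I, b_I, W_F, b_F):
    """One learner update treated as an LSTM cell-state update:
    theta_t = f_t * theta_{t-1} + i_t * (-grad).
    Gates are computed per parameter from [grad, loss, theta_{t-1}, gate_{t-1}]."""
    loss_vec = np.full_like(theta_prev, loss)
    i_t = sigmoid(np.stack([grad, loss_vec, theta_prev, i_prev], axis=-1) @ W_I + b_I)
    f_t = sigmoid(np.stack([grad, loss_vec, theta_prev, f_prev], axis=-1) @ W_F + b_F)
    theta_t = f_t * theta_prev - i_t * grad  # shrink theta_{t-1}, step along -grad
    return theta_t, i_t, f_t
```

With zero gate weights the input gate is a constant 0.5 and a large forget bias keeps the previous parameters intact, recovering plain gradient descent with a fixed learning rate.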
Step 5: constructing a generalized learning algorithm for new tasks based on the scene memory;
the robot memories obtained in steps 2, 3 and 4 guide the robot in learning new tasks that appear in the working environment. First, the perception module obtains the environment state information, the similarity between the current perception information and the events in the memory bank is measured using the cosine distance as the similarity measurement function, and matching scenes are retrieved by calculating the similarity measurement value between the scene information of the current event and the event information stored in the scene memory:

$$w_t^r(i) = \xi\, w_{t-1}^r(i) + D\big(K_t, M_t(i)\big)$$
where $\xi$ is the attenuation coefficient; a larger $\xi$ means previous events have a greater influence on the current state, and $\xi = 0$ when t = 1; $D(K_t, M_t(i))$ is the cosine measurement between the current event's scene information and the event information stored in the scene memory at time t;
secondly, a suitable scene memory is selected to guide the current task. According to the calculated read weight $w^r$, a guiding experience is selected: if a read weight exceeds the given threshold, the experience information of the scene to which that event belongs is extracted and used as experience to guide learning of the new task; if no event in memory has a read weight above the threshold, the current event is defined as a new event, a new scene is established for the current task, and the scene with the highest read weight is selected to guide learning of the new task.
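The step-5 retrieval logic can be sketched as follows; the exact recurrence for accumulating read weights with the attenuation coefficient ξ is my assumption based on the description above:

```python
import numpy as np

def cos(k, m):
    """Cosine similarity between two key vectors."""
    return float(np.dot(k, m) / (np.linalg.norm(k) * np.linalg.norm(m) + 1e-8))

def read_weights(event_keys, memory_keys, xi=0.5):
    """Accumulate read weights over the event sequence; xi controls how much
    previous events influence the current state (treated as 0 at t = 1)."""
    w = np.zeros(len(memory_keys))
    for t, k in enumerate(event_keys):
        sims = np.array([cos(k, m) for m in memory_keys])
        w = sims if t == 0 else xi * w + sims
    return w

def select_memory(w, threshold):
    """Return (index, is_new_event): guide from an above-threshold memory if
    one exists; otherwise flag a new event and use the best scene as a hint."""
    if w.max() > threshold:
        return int(np.argmax(w)), False
    return int(np.argmax(w)), True
```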
2. The method for robot skill acquisition based on meta-learning under the guidance of scene memory according to claim 1, characterized in that an event $e_i$ consists of the quadruple $\langle o, p_e, a, p_t \rangle$, where $o$ is the state perception of the environment obtained through sensors, including the distribution of objects in the image, the positional relations among objects, and the joint information of the robot; $p_e$ is the three-dimensional coordinates of the end effector of the mechanical arm; $a$ is the action executed by the mechanical arm, representing the robot's action sequence for the current task in the time dimension; and $p_t$ is the three-dimensional coordinates of the target object with which the mechanical arm interacts.
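The event quadruple of claim 2 could be represented as a simple record; the field types here are illustrative assumptions, not specified by the patent:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Event:
    """Event e_i = <o, p_e, a, p_t> as described in claim 2."""
    o: List[float]                    # environment state perception (sensors)
    p_e: Tuple[float, float, float]   # 3-D coordinates of the arm end effector
    a: int                            # action executed at this step
    p_t: Tuple[float, float, float]   # 3-D coordinates of the target object
```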
3. The scene-memory-guided meta-learning-based robot skill acquisition method according to claim 1 or 2, characterized in that the new memory is overwritten either to the location of the most similar scene memory or to the location of the least frequently recalled scene memory in the memory area:
(1) when there is high similarity between the two scenes, i.e. when the coverage weight $\bar{w}$ exceeds the set threshold, the scene is written to the location of the most similar scene memory in the memory area;
(2) if the coverage weight does not exceed the threshold, indicating that the scene in the buffer is not particularly similar to the scenes in the memory storage area, the location of the scene memory with the lowest use weight is selected and overwritten, ensuring efficient utilization of the storage area. The use weight is defined as the number of times a scene memory has been matched in the memory storage area; each time the scene memory is matched, its use weight is incremented by 1.
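The two overwrite cases of claim 3 can be sketched as one decision function; the threshold name `gamma` and the use-count representation are assumptions:

```python
import numpy as np

def overwrite_index(coverage_w, use_counts, gamma):
    """Where a buffered scene memory overwrites the store: case (1) replaces
    the most similar scene when its coverage weight exceeds gamma; case (2)
    replaces the least-used scene otherwise. use_counts[i] is incremented by 1
    each time memory i is matched."""
    best = int(np.argmax(coverage_w))
    if coverage_w[best] > gamma:
        return best
    return int(np.argmin(use_counts))
```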
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110740838.4A CN113657573B (en) | 2021-06-30 | 2021-06-30 | Robot skill acquisition method based on meta learning under scene memory guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113657573A true CN113657573A (en) | 2021-11-16 |
CN113657573B CN113657573B (en) | 2024-06-21 |
Family
ID=78477833
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110740838.4A Active CN113657573B (en) | 2021-06-30 | 2021-06-30 | Robot skill acquisition method based on meta learning under scene memory guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113657573B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180210939A1 (en) * | 2017-01-26 | 2018-07-26 | Hrl Laboratories, Llc | Scalable and efficient episodic memory in cognitive processing for automated systems |
CN109668566A (en) * | 2018-12-05 | 2019-04-23 | 大连理工大学 | Robot scene cognition map construction and navigation method based on mouse brain positioning cells |
CN111474932A (en) * | 2020-04-23 | 2020-07-31 | 大连理工大学 | Mobile robot mapping and navigation method integrating scene experience |
CN112231489A (en) * | 2020-10-19 | 2021-01-15 | 中国科学技术大学 | Knowledge learning and transferring method and system for epidemic prevention robot |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114161419A (en) * | 2021-12-13 | 2022-03-11 | 大连理工大学 | Robot operation skill efficient learning method guided by scene memory |
CN114161419B (en) * | 2021-12-13 | 2023-09-15 | 大连理工大学 | Efficient learning method for robot operation skills guided by scene memory |
CN115082717A (en) * | 2022-08-22 | 2022-09-20 | 成都不烦智能科技有限责任公司 | Dynamic target identification and context memory cognition method and system based on visual perception |
CN115082717B (en) * | 2022-08-22 | 2022-11-08 | 成都不烦智能科技有限责任公司 | Dynamic target identification and context memory cognition method and system based on visual perception |
CN116563638A (en) * | 2023-05-19 | 2023-08-08 | 广东石油化工学院 | Image classification model optimization method and system based on scene memory |
CN116563638B (en) * | 2023-05-19 | 2023-12-05 | 广东石油化工学院 | Image classification model optimization method and system based on scene memory |
CN118536545B (en) * | 2024-06-13 | 2024-11-15 | 东北电力大学 | Scene memory network design method based on synaptic remodeling model |
Also Published As
Publication number | Publication date |
---|---|
CN113657573B (en) | 2024-06-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111203878B (en) | Robot sequence task learning method based on visual simulation | |
CN112605973B (en) | Robot motor skill learning method and system | |
Zhu | An adaptive agent decision model based on deep reinforcement learning and autonomous learning | |
Paxton et al. | Prospection: Interpretable plans from language by predicting the future | |
CN113657573A (en) | Robot skill acquisition method based on meta-learning under guidance of contextual memory | |
CN111300390B (en) | Intelligent mechanical arm control system based on reservoir sampling and double-channel inspection pool | |
CN109940614B (en) | Mechanical arm multi-scene rapid motion planning method integrating memory mechanism | |
EP4121256A1 (en) | Training and/or utilizing machine learning model(s) for use in natural language based robotic control | |
CN112183188B (en) | Method for simulating learning of mechanical arm based on task embedded network | |
Lippi et al. | Enabling visual action planning for object manipulation through latent space roadmap | |
CN115860107B (en) | Multi-machine searching method and system based on multi-agent deep reinforcement learning | |
Waytowich et al. | A narration-based reward shaping approach using grounded natural language commands | |
CN112509392B (en) | Robot behavior teaching method based on meta-learning | |
Li et al. | Curiosity-driven exploration for off-policy reinforcement learning methods | |
CN114161419B (en) | Efficient learning method for robot operation skills guided by scene memory | |
CN118365099B (en) | Multi-AGV scheduling method, device, equipment and storage medium | |
CN117332366A (en) | Information processing method, task execution method, device, equipment and medium | |
Reinhart | Reservoir computing with output feedback | |
CN115016499A (en) | Path planning method based on SCA-QL | |
US20220305647A1 (en) | Future prediction, using stochastic adversarial based sampling, for robotic control and/or other purpose(s) | |
Zhou et al. | Humanoid action imitation learning via boosting sample DQN in virtual demonstrator environment | |
Xiong et al. | Primitives generation policy learning without catastrophic forgetting for robotic manipulation | |
Yu et al. | LSTM learn policy from dynamical system of demonstration motions for robot imitation learning | |
CN117590756B (en) | Motion control method, device, equipment and storage medium for underwater robot | |
Chen et al. | Distributed continuous control with meta learning on robotic arms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||