
CN115070753B - A multi-objective reinforcement learning method for unsupervised image editing - Google Patents


Info

Publication number
CN115070753B
CN115070753B (application CN202210469373.8A)
Authority
CN
China
Prior art keywords
image
editable
robot
task
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210469373.8A
Other languages
Chinese (zh)
Other versions
CN115070753A (en)
Inventor
钱智丰
尤鸣宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202210469373.8A priority Critical patent/CN115070753B/en
Publication of CN115070753A publication Critical patent/CN115070753A/en
Application granted granted Critical
Publication of CN115070753B publication Critical patent/CN115070753B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1602 Programme controls characterised by the control system, structure, architecture
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B25J9/1679 Programme controls characterised by the tasks executed
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Software Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • General Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a multi-objective reinforcement learning method based on unsupervised image editing, comprising: obtaining a multi-goal task dataset of a robot control scene; training a generative adversarial network and a feature-space encoder to decouple the factors in the image that are highly relevant to the task from those that are irrelevant; performing singular value decomposition on the weights of the fully connected layer corresponding to each subspace, taking the several singular vectors with the largest contributions as semantically meaningful editable directions, and training an editable-direction encoder to identify the category and scale of an editable direction; and obtaining an editable representation space of the image from the output of the editable-direction encoder, which serves as the input to the control policy network and as the basis for computing the reward function, while training the robot by controllably sampling various goal tasks in the editable representation space, finally yielding a control policy that can accomplish multiple goals. Compared with the prior art, the present invention can decouple task-related factors without supervision and improves sample efficiency and generalization performance.

Description

A Multi-Objective Reinforcement Learning Method Based on Unsupervised Image Editing

Technical Field

The present invention relates to the technical field of autonomous action learning for intelligent agents, and in particular to a multi-objective reinforcement learning method based on unsupervised image editing.

Background Art

With the rise of artificial intelligence algorithms and the rapid development of hardware, robotics now plays an important role in fields such as medical care, service, assembly, security, rescue, and transportation. However, customized deployment of a robot for a single task is time-consuming and laborious. Enabling a robot to learn one control policy capable of accomplishing multiple goals has therefore long been a pursued objective.

Traditional robot control methods depend heavily on the expertise and programming skill of technicians, and a different robot trajectory distribution must be designed for each specific task objective. When the application scenario changes, previously deployed robot skills cannot be reused. Deep reinforcement learning enables a robot to autonomously learn task-related skills through exploration and exploitation; however, it requires a hand-designed reward function for each target task, and the learned control policy can only accomplish that specific task and cannot generalize to tasks with a similar environment and task structure but different goal positions. Multi-goal reinforcement learning methods feed both the current observation and the task goal into the control policy, giving the robot the ability to accomplish a variety of goal tasks. However, such algorithms take images as input, their training is complex, hard to converge, and sample-inefficient, and they may need hundreds or thousands of hours of data collection and policy training even in simulation, which is impractical in real-world scenarios.

In daily life and work, humans can often decouple the components of their environment and perform various goal-directed actions based on the state changes of the key parts. How to decouple and represent the key objects in an environment image in an unsupervised way remains a challenge.

Summary of the Invention

The purpose of the present invention is to overcome the defects of the above prior art and to provide a multi-objective reinforcement learning method based on unsupervised image editing that improves sample efficiency and generalization performance.

The purpose of the present invention can be achieved by the following technical solution:

A multi-objective reinforcement learning method based on unsupervised image editing, comprising:

Step 1: Obtain a multi-goal task dataset of a robot control scene.

Step 2: Train a generative adversarial network and a feature-space encoder, dividing the latent space into three subspaces that correspond to task-irrelevant factors, the robot, and the manipulated object, respectively, thereby decoupling the parts of the image.

Step 3: Perform singular value decomposition on the weights of the fully connected layer corresponding to each subspace, and take the several singular vectors with the largest contributions as semantically meaningful editable directions; a latent variable can be increased or decreased along an editable direction to edit the image. Train an editable-direction encoder to identify the category and scale of an editable direction.

Step 4: Obtain an editable representation space of the image from the output of the editable-direction encoder; it serves as the input to the control policy network and as the basis for computing the reward function. Train the robot by controllably sampling various goal tasks in the editable representation space, finally obtaining a control policy that can accomplish multiple goals.

Preferably, step 1 is specifically:

Build a virtual robot simulation environment, control the robot to complete multiple goal tasks via a sampling control policy, and create a multi-goal task dataset of the robot control scene.

More preferably, the multi-goal tasks comprise a block-grasping task and an assembly task, and the virtual robot simulation environment comprises a UR5 robotic arm, an operating table, and several objects to be manipulated. The robot is made to interact with the environment by sampling action commands from a random exploration policy, yielding image sequence data for each task; the goal position of each sequence is sampled from a uniform distribution.

Preferably, the generative adversarial network comprises:

a generative network, which aims to generate images realistic enough that the discriminative network cannot tell whether they are real or fake;

a discriminative network, which aims to tell whether an image is real or produced by the generative network.

Latent variables randomly sampled from a Gaussian distribution are mapped through 8 fully connected layers into three W feature subspaces, where W1 corresponds to task-irrelevant factors and the other two subspaces correspond to task-related factors: W2 corresponds to the robot and W3 to the object to be manipulated.

The feature-space encoder encodes images produced by the generative network back into features in the W space.

In step 2, images produced by the generative network serve as the input of the feature-space encoder, and the generative-network input corresponding to each image serves as its output label, so that the feature-space encoder is trained with supervision.

Preferably, the method of decoupling the parts of the image in step 2 is:

exchange the latent variable of one subspace while keeping the latent variables of the other subspaces unchanged, thereby decoupling the parts of the image.

More preferably, the method of decoupling the parts of the image in step 2 is specifically:

take two images from the multi-goal task dataset as one group of data, and denote their W-space features W1,2,3 and Wa,b,c;

exchange the features of one W subspace between the two while keeping the other features unchanged, obtaining two new features W1,b,3 and Wa,2,c;

feed the new features into the generative network to obtain two new images;

encode the two generated images with the feature-space encoder and perform the feature exchange again to obtain new features W′1,2,3 and W′a,b,c, which are supervised with a root-mean-square error against the original features.

Preferably, in step 3 the fully connected layer parameters corresponding to each subspace are denoted Ai, a non-square m×n matrix.

The affine transformation F applied by the fully connected layer to a randomly sampled latent variable z is then:

W = F(z) = Az + b

By adding a semantically meaningful direction N to the corresponding W-space feature, the generated image can be edited in a controllable way, so that it changes along one particular content factor.

Preferably, the singular value decomposition in step 3 is specifically:

singular value decomposition is used to factorize the fully connected layer parameters Ai:

Ai = UΣV^T

where U is an m×m square matrix; Σ is an m×n matrix whose entries are all zero except those on the main diagonal, each of which is called a singular value; and V is an n×n square matrix.

The singular values in Σ are then sorted by magnitude to obtain the several largest, i.e., those with the greatest contribution; the corresponding singular vectors are the semantically meaningful directions N. In this way the editable directions N of the image are learned in an unsupervised manner.

Preferably, the editable-direction encoder is specifically:

the editable-direction encoder encodes how two images differ in the W space of the generative adversarial network, outputting the category Ni and scale α of the editable direction. In step 3, generated images are controllably edited along the editable directions N, yielding image pairs with their corresponding edit categories and scales, with which the editable-direction encoder is trained under supervision.

Preferably, step 4 is specifically:

the control policy network takes the output of the editable-direction encoder as the embedding-space input for the observed image and the task goal, and outputs the robot's joint angles to control the robot to complete the task.

Compared with the prior art, the present invention provides the following beneficial effects:

It improves the sample efficiency and generalization performance of multi-goal reinforcement learning. The method uses the editable-direction category and scale of an image as its encoding for the control policy, which reflects the state changes of each element in the image more explicitly and efficiently. The control policy network is trained with the reinforcement learning algorithm PPO to output appropriate robot actions, so that the robot can accomplish tasks with various goals, greatly improving sample efficiency and generalization.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic flow chart of the multi-objective reinforcement learning method based on unsupervised image editing in an embodiment of the present invention;

FIG. 2 is a schematic diagram of the virtual robot simulation environment in an embodiment of the present invention;

FIG. 3 is a schematic diagram of the structure of the generative adversarial network in an embodiment of the present invention;

FIG. 4 is a schematic diagram of the network structure of the editable-direction encoder in an embodiment of the present invention.

DETAILED DESCRIPTION

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.

This embodiment uses a Universal Robots UR5 robotic arm as an example. Its control workstation runs Ubuntu 16.04 and is equipped with an Intel Core i7-10700K (8 cores, 16 threads, 5.1 GHz turbo), two NVIDIA GTX 1080 GPUs, and 32 GB of DDR4 memory, along with a generic RGB camera observing the whole scene on the operating table.

A multi-objective reinforcement learning method based on unsupervised image editing, whose flow is shown in FIG. 1, comprises:

Step 1: Build a robot simulation environment based on OpenAI Gym and the MuJoCo simulation platform; the environment is illustrated in FIG. 2. In this virtual environment, collect a multi-goal task dataset of the robot control scene covering two tasks: block grasping and assembly. The environment contains a UR5 robotic arm, an operating table, and several objects to be manipulated. The robot interacts with the environment by sampling action commands from a random exploration policy, yielding image sequence data for each task; the goal position of each sequence is sampled from a uniform distribution.
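The data-collection loop of Step 1 can be sketched as follows. This is an illustrative sketch only: the patent uses OpenAI Gym and MuJoCo with a UR5 arm, while here a dummy environment class (`DummyRobotEnv`, invented for this sketch) stands in for the simulator so the loop is self-contained.

```python
import numpy as np

class DummyRobotEnv:
    """Stand-in for the Gym/MuJoCo scene: returns fake RGB observations."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
    def reset(self, goal):
        self.goal = goal
        return self.rng.random((64, 64, 3))   # initial observation image
    def step(self, action):
        return self.rng.random((64, 64, 3))   # next observation image

def collect_multigoal_dataset(env, n_episodes=4, horizon=5, action_dim=6):
    """Random-exploration rollouts; each episode's goal is sampled uniformly."""
    rng = np.random.default_rng(1)
    dataset = []
    for _ in range(n_episodes):
        goal = rng.uniform(-0.3, 0.3, size=3)             # uniform goal position
        frames = [env.reset(goal)]
        for _ in range(horizon):
            action = rng.uniform(-1.0, 1.0, size=action_dim)  # random policy
            frames.append(env.step(action))
        dataset.append({"goal": goal, "frames": frames})
    return dataset

data = collect_multigoal_dataset(DummyRobotEnv())
print(len(data), len(data[0]["frames"]))  # 4 episodes, horizon+1 frames each
```

In the patent's setting the dummy environment would be replaced by the actual UR5 simulation, and the frames would form the image sequence data described above.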

Step 2: Based on the collected dataset, train a generative adversarial network and a feature-space encoder. The generative adversarial network consists of a generative network and a discriminative network. The discriminative network aims to tell whether an image is real or produced by the generative network, while the generative network aims to produce images realistic enough that the discriminative network cannot tell. The feature-space encoder encodes images produced by the generative network back into features in the W space and can be trained under supervision with samples randomly generated by the generative network.

The structure of the generative adversarial network is shown in FIG. 3. The generative network consists of 3 convolutional layers with kernel size 5, stride 0.5, and ReLU activation; the discriminative network consists of 3 convolutional layers with kernel size 5, stride 0.5, and ReLU activation, followed by 2 fully connected layers and an activation layer. Latent variables randomly sampled from a Gaussian distribution are mapped through 8 fully connected layers into three W feature subspaces, where W1 corresponds to task-irrelevant factors such as the lighting of the simulation environment and the colors of the tabletop and walls. The other two subspaces correspond to task-related factors: W2 corresponds to the robot and W3 to the object to be manipulated.

Take two images from the multi-goal task dataset as one group of data, and denote their W-space features W1,2,3 and Wa,b,c;

exchange the features of one W subspace between the two while keeping the other features unchanged, obtaining two new features W1,b,3 and Wa,2,c;

feed the new features into the generative network to obtain two new images;

encode the two generated images with the feature-space encoder and perform the feature exchange again to obtain new features W′1,2,3 and W′a,b,c, which are supervised with a root-mean-square error against the original features.

This ensures that the three W subspaces are decoupled: changing the feature of one subspace does not affect the image content corresponding to the other subspaces.
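The swap-and-re-encode supervision above can be illustrated with plain arrays. This is a hedged sketch: in the real method the generator and feature-space encoder sit between the two swaps, whereas here that round trip is idealized as the identity, so the cycle loss is exactly zero; the `swap` and `rmse` helpers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # per-subspace feature dimension (illustrative)

# Two images' W-space codes, split into the three subspaces
# (task-irrelevant background, robot, manipulated object).
W_123 = {"bg": rng.normal(size=d), "robot": rng.normal(size=d), "obj": rng.normal(size=d)}
W_abc = {"bg": rng.normal(size=d), "robot": rng.normal(size=d), "obj": rng.normal(size=d)}

def swap(u, v, key):
    """Exchange one subspace's feature, keep the others fixed."""
    u2, v2 = dict(u), dict(v)
    u2[key], v2[key] = v[key], u[key]
    return u2, v2

# First swap of the 'robot' subspace: W_{1,b,3} and W_{a,2,c}.
W_1b3, W_a2c = swap(W_123, W_abc, "robot")
# Generator + encoder would run here; with a perfect encoder the codes
# come back unchanged, so a second swap must recover the originals.
W_cyc_123, W_cyc_abc = swap(W_1b3, W_a2c, "robot")

def rmse(u, v):
    diff = np.concatenate([u[k] - v[k] for k in u])
    return float(np.sqrt(np.mean(diff ** 2)))

# Cycle-consistency loss supervising the feature-space encoder.
loss = rmse(W_cyc_123, W_123) + rmse(W_cyc_abc, W_abc)
print(round(loss, 6))  # 0.0 for the idealized encoder
```

In training, the loss would be nonzero and its gradient would push the feature-space encoder toward subspaces that can be swapped independently.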

Step 3: Perform singular value decomposition on the weights of the fully connected layer corresponding to each subspace, and take the singular vectors with the largest contributions as the semantically meaningful editable directions N. Let the fully connected layer parameters of each subspace be a non-square m×n matrix Ai. The affine transformation F applied by the fully connected layer to a randomly sampled latent variable z can then be written as:

W = F(z) = Az + b

By adding a semantically meaningful direction N to the corresponding W-space feature, the generated image can be edited in a controllable way, changing one particular content factor, such as the position or shape of the manipulated object, the ambient lighting, or the colors.

The singular value decomposition of the non-square matrix Ai can be expressed as:

Ai = UΣV^T

where U is an m×m square matrix; Σ is an m×n matrix whose entries are all zero except those on the main diagonal, each of which is called a singular value; and V is an n×n square matrix.

The singular values in Σ are then sorted by magnitude, and the top 10 with the greatest contribution are kept; the corresponding singular vectors are the semantically meaningful directions N, i.e., the editable directions N of the image are learned in an unsupervised manner.
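The direction-extraction step can be sketched with NumPy, whose `svd` already returns singular values in descending order, so "sorting by magnitude" comes for free. Editing in W space via the leading left-singular vectors of Ai is one plausible reading of the text and is an assumption of this sketch, not the patent's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 16, 32                     # A_i is a non-square m*n matrix (illustrative sizes)
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

# A_i = U S V^T; numpy returns singular values sorted in descending order,
# so the top-contributing directions are simply the leading vectors.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
assert np.all(S[:-1] >= S[1:])    # sorted by contribution

k = 10
directions_W = U[:, :k]           # assumed W-space editable directions (unit norm)

z = rng.normal(size=n)
W = A @ z + b                     # affine map W = F(z) = Az + b
alpha = 0.5                       # editing scale
W_edited = W + alpha * directions_W[:, 0]  # move along the top direction
print(directions_W.shape)
```

Feeding `W_edited` through the generator would then change the image along one content factor, as the surrounding text describes.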

In addition, an editable-direction encoder Enc is learned from the learned editable directions N, as shown in FIG. 4. First, a latent variable zi is randomly sampled and passed through 2 fully connected layers FC to obtain the corresponding feature W in the W space, from which the generative network Gen produces the corresponding image Ii. An editable direction Ni is then uniformly sampled from the K editable directions, and an editing scale α ~ U{-ε, ε} is randomly sampled. Their product is added to W, and the generative network produces the image Ij. The images Ii and Ij are fed jointly into the editable-direction encoder Enc, which predicts the corresponding edit category Ni and scale α. The generator Gen produces a large number of samples with ground-truth labels, so that Enc is trained under supervision.
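The supervised-pair generation for Enc can be sketched as follows; `fake_gen` and the orthonormal stand-in directions are hypothetical placeholders for the trained generator Gen and the SVD-derived directions, kept only so the sampling logic is runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
K, eps, dim = 10, 3.0, 16                  # K directions, scale range, W dim
# Orthonormal stand-ins for the learned editable directions (one per row).
directions = np.linalg.qr(rng.normal(size=(dim, K)))[0].T

def fake_gen(w):
    """Stand-in generator: any deterministic map from W to an 'image'."""
    return np.tanh(np.outer(w, w))

def sample_training_pair():
    w = rng.normal(size=dim)               # W-space code of image I_i
    idx = int(rng.integers(K))             # uniformly sampled direction class N_i
    alpha = rng.uniform(-eps, eps)         # editing scale alpha ~ U{-eps, eps}
    img_i = fake_gen(w)
    img_j = fake_gen(w + alpha * directions[idx])
    return (img_i, img_j), (idx, alpha)    # encoder inputs and supervision labels

(pair, (cls, scale)) = sample_training_pair()
print(pair[0].shape, 0 <= cls < K, -eps <= scale <= eps)
```

Enc would be trained to regress `(cls, scale)` from the image pair; generating pairs on the fly from Gen is what makes the supervision free of manual labels.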

Step 4: Through the editable-direction encoder, obtain the edit category Nt and scale αt between the robot's initial observation image obst=0 and the current frame obst, as well as the edit category Ngoal and scale αgoal between obst=0 and the task-goal image obsgoal. The tuple (Nt, αt, Ngoal, αgoal) is the input of the control policy network, which outputs the robot's joint angles to control the robot to complete the task. The control policy network is expressed as:

π(A|·) = π(A|Nt, αt, Ngoal, αgoal)

Based on the output of the editable-direction encoder, the editable representation space of the image is obtained; it serves as the input to the control policy and as the basis for computing the reward function. The robot is trained by controllably sampling various goal tasks in the editable representation space, finally yielding a control policy that can accomplish multiple goals. The present invention uses the editable-direction category and scale of an image as its encoding for the control policy, which reflects the state changes of each element in the image more explicitly and efficiently. The control policy network is trained with the reinforcement learning algorithm PPO to output appropriate robot actions, so that the robot can accomplish tasks with various goals, greatly improving the sample efficiency and generalization performance of multi-goal reinforcement learning.
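Assembling the policy input (Nt, αt, Ngoal, αgoal) can be sketched as below; `dummy_encoder` is a hypothetical stand-in for the trained editable-direction encoder, returning a one-hot direction class and a scalar scale, and the concatenation layout is an assumption of this sketch.

```python
import numpy as np

def policy_input(enc, obs0, obs_t, obs_goal):
    """Editable-representation input (N_t, a_t, N_goal, a_goal) for pi."""
    n_t, a_t = enc(obs0, obs_t)            # edits from initial frame to current
    n_g, a_g = enc(obs0, obs_goal)         # edits from initial frame to goal
    return np.concatenate([n_t, [a_t], n_g, [a_g]])

K = 10
def dummy_encoder(img_a, img_b):
    cls = np.zeros(K); cls[3] = 1.0        # pretend direction 3 was detected
    return cls, float(np.mean(img_b - img_a))

obs = [np.full((4, 4), v) for v in (0.0, 0.2, 0.9)]
x = policy_input(dummy_encoder, *obs)
print(x.shape)  # (2K + 2,) = (22,)
```

A PPO-trained policy network would consume this compact vector instead of raw pixels, which is the source of the sample-efficiency gain claimed above.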

The innovations of the present invention are:

(1) A cyclic-exchange operation on the latent variables of a generative adversarial network is proposed to decouple task-related from task-irrelevant factors in the robot manipulation scene, so that the control policy can focus on the state changes of task-related factors and the redundancy of the robot's observation representation is reduced.

(2) An unsupervised image-editing algorithm is proposed: singular value decomposition of the latent-space parameters of the different factors in the control scene yields a series of semantically meaningful editable directions, enabling controllable generation of each object in the image. From these learned editable directions, an editable-direction encoder is trained to extract a structured representation of each object.

(3) A multi-goal control policy based on image-editing directions is proposed: the encoder's output represents the robot observation and the goal-task image, which greatly shortens training time, and the semantically structured representation improves the sample efficiency and generalization performance of multi-goal reinforcement learning.

The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and such modifications or replacements shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A multi-objective reinforcement learning method based on unsupervised image editing, characterized in that the method comprises:

Step 1: obtaining a multi-objective task dataset of robot control scenes;

Step 2: training a generative adversarial network and a feature-space encoder, dividing the latent-variable space into three subspaces corresponding respectively to task-irrelevant factors, the robot, and the manipulated object, so as to decouple the parts of the image;

Step 3: performing singular value decomposition on the weights of the fully connected layer corresponding to each subspace, taking the several eigenvectors with the largest contributions as semantically meaningful editable directions, and training an editable-direction encoder to identify the category and scale of an editable direction;

Step 4: obtaining an editable representation space of the image from the output of the editable-direction encoder, which serves as the input of the control policy network and is used to compute the reward function; at the same time, training the robot by controllably sampling various goal tasks in the editable representation space, finally obtaining a control policy capable of completing multiple goals;

wherein the decoupling in Step 2 is performed by swapping the latent variable of one subspace while keeping the latent variables of the other subspaces unchanged, specifically:

taking two images from the multi-objective task dataset as one group of data, their corresponding W-space features being denoted W1,2,3 and Wa,b,c;

swapping the features of one W subspace between the two codes while keeping the other features unchanged, obtaining two new features W1,b,3 and Wa,2,c;

feeding the new features into the generator to obtain two new images;

encoding the two generated images with the feature-space encoder and performing the swap operation again to obtain new features W1',2,3 and Wa',b,c, which are supervised with a root-mean-square error;

wherein the singular value decomposition in Step 3 factorizes the fully-connected-layer parameter matrix A_i:

A_i = UΣV^T

where U is an m×m square matrix; Σ is an m×n matrix whose entries are all zero except those on the main diagonal, each diagonal entry being a singular value; and V is an n×n square matrix;

the singular values in Σ are then sorted by magnitude and the several largest ones are retained; the corresponding eigenvectors are the semantically meaningful directions N, i.e., the editable directions N of the image are learned in an unsupervised manner;

wherein the editable-direction encoder encodes the change between two images in the W space of the generative adversarial network, namely the category N_i and the scale α of the editable direction; the generated images are controllably edited by linearly weighting editable directions in W space, yielding image pairs together with their editing categories and scales, with which the editable-direction encoder is trained in a supervised manner.

2. The multi-objective reinforcement learning method based on unsupervised image editing according to claim 1, characterized in that Step 1 is specifically: building a robot virtual simulation environment, controlling the robot via a sampled control policy to complete multiple goal tasks, and thereby creating a multi-objective task dataset of robot control scenes.

3. The multi-objective reinforcement learning method based on unsupervised image editing according to claim 2, characterized in that the multi-objective tasks comprise a block-grasping task and an assembly task; the robot virtual simulation environment comprises a UR5 robotic arm, a workbench, and several manipulable objects; the robot is controlled to interact with the environment by sampling action commands from a random exploration policy, yielding image-sequence data for each task, the goal position of each sequence being sampled from a uniform distribution.

4. The multi-objective reinforcement learning method based on unsupervised image editing according to claim 1, characterized in that the generative adversarial network comprises:

a generator, intended to produce images realistic enough that the discriminator cannot tell whether they are real or fake;

a discriminator, intended to distinguish whether an image is real or produced by the generator;

latent variables randomly sampled from a Gaussian distribution are mapped through 8 fully connected layers into three W feature subspaces, where W1 corresponds to task-irrelevant factors and the other two subspaces correspond to task-relevant factors, W2 corresponding to the robot and W3 to the object to be manipulated;

the feature-space encoder encodes images produced by the generator back into W-space features; the generated images serve as the encoder's inputs and the corresponding generator inputs serve as its output labels, so that the feature-space encoder is trained in a supervised manner.

5. The multi-objective reinforcement learning method based on unsupervised image editing according to claim 1, characterized in that the fully-connected-layer parameter matrix A_i corresponding to each subspace in Step 3 is a non-square matrix; the affine transformation F applied by the fully connected layer to a randomly sampled latent variable z is:

W = F(z) = Az + b

and by adding a semantically meaningful direction N to the corresponding W-space feature, the generated image is controllably edited so that it changes along a single content factor.

6. The multi-objective reinforcement learning method based on unsupervised image editing according to claim 1, characterized in that Step 4 is specifically: the control policy network takes the output of the editable-direction encoder as the embedding-space input for the observation image and the task goal, and outputs the robot's joint angles to control the robot to complete the task.
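The affine mapping and editing operation of claim 5 can be sketched as follows. This is a minimal numpy illustration; the matrix sizes are arbitrary, and taking the top left-singular vector as the W-space direction is an assumption made so that the direction's dimensionality matches W (the patent does not specify which singular vectors are used):

```python
import numpy as np

def edit_w(z, A, b, direction, alpha):
    """Apply the affine map W = Az + b, then shift the result along an
    editable direction with scale alpha: W' = W + alpha * N."""
    w = A @ z + b
    return w, w + alpha * direction

rng = np.random.default_rng(2)
A = rng.normal(size=(16, 8))   # non-square fully connected weight (m=16, n=8)
b = rng.normal(size=16)
z = rng.normal(size=8)

# Editable direction in W space: top left-singular vector of A (an assumption).
N = np.linalg.svd(A)[0][:, 0]

w, w_edit = edit_w(z, A, b, N, alpha=2.0)
```

Pairs (w, w_edit) with known category and scale are exactly the supervision signal described for training the editable-direction encoder.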
CN202210469373.8A 2022-04-28 2022-04-28 A multi-objective reinforcement learning method for unsupervised image editing Active CN115070753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210469373.8A CN115070753B (en) 2022-04-28 2022-04-28 A multi-objective reinforcement learning method for unsupervised image editing


Publications (2)

Publication Number Publication Date
CN115070753A CN115070753A (en) 2022-09-20
CN115070753B true CN115070753B (en) 2024-11-08

Family

ID=83247476


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079561A (en) * 2019-11-26 2020-04-28 华南理工大学 A robot intelligent grasping method based on virtual training
CN112687290A (en) * 2020-12-30 2021-04-20 同济大学 Compressed cough automatic detection method and embedded device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2359989B1 (en) * 2010-02-15 2013-07-24 Honda Research Institute Europe GmbH Robot control with bootstrapping inverse kinematics
KR20190087351A (en) * 2019-07-05 2019-07-24 엘지전자 주식회사 System, method and apparatus for machine learning
CN111274359B (en) * 2020-01-20 2022-06-14 福州大学 Query recommendation method and system based on improved VHRED and reinforcement learning
CN111914617B (en) * 2020-06-10 2024-05-07 华南理工大学 Face attribute editing method based on balanced stack type generation type countermeasure network
CN113011526B (en) * 2021-04-23 2024-04-26 华南理工大学 Robot skill learning method and system based on reinforcement learning and unsupervised learning




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant