CN117975190B - Method and device for processing imitation-learning mixed samples based on a visual pre-training model
Method and device for processing imitation-learning mixed samples based on a visual pre-training model
- Publication number
- CN117975190B (granted from application CN202311868899.4A / CN202311868899A)
- Authority
- CN
- China
- Prior art keywords
- sample
- expert
- network
- sample set
- mixed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000012549 training Methods 0.000 title claims abstract description 74
- 238000000034 method Methods 0.000 title claims abstract description 36
- 238000012545 processing Methods 0.000 title claims description 38
- 230000006870 function Effects 0.000 claims abstract description 103
- 230000000007 visual effect Effects 0.000 claims abstract description 54
- 238000011156 evaluation Methods 0.000 claims abstract description 38
- 238000009826 distribution Methods 0.000 claims abstract description 33
- 238000003672 processing method Methods 0.000 claims abstract description 30
- 230000009471 action Effects 0.000 claims description 27
- 238000005457 optimization Methods 0.000 claims description 16
- 238000012512 characterization method Methods 0.000 claims description 15
- 238000004590 computer program Methods 0.000 claims description 14
- 230000008569 process Effects 0.000 claims description 14
- 238000003709 image segmentation Methods 0.000 claims description 5
- 238000001514 detection method Methods 0.000 claims description 4
- 238000010606 normalization Methods 0.000 claims description 4
- 238000012360 testing method Methods 0.000 claims description 4
- 239000000203 mixture Substances 0.000 claims 4
- 230000008485 antagonism Effects 0.000 claims 1
- 239000000523 sample Substances 0.000 description 275
- 230000000875 corresponding effect Effects 0.000 description 47
- 239000003795 chemical substances by application Substances 0.000 description 17
- 230000001186 cumulative effect Effects 0.000 description 13
- 238000005259 measurement Methods 0.000 description 9
- 238000000605 extraction Methods 0.000 description 7
- 238000010586 diagram Methods 0.000 description 6
- 238000004891 communication Methods 0.000 description 5
- 230000006399 behavior Effects 0.000 description 4
- 238000005516 engineering process Methods 0.000 description 3
- 230000002787 reinforcement Effects 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 230000003542 behavioural effect Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000007613 environmental effect Effects 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000010367 cloning Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The present invention provides an imitation-learning mixed-sample processing method and device based on a visual pre-training model. The method includes: acquiring an expert sample set; adding target noise to the suboptimal expert samples to obtain noisy expert samples, and obtaining a mixed sample set from the noisy expert samples and the optimal expert samples; assigning a weight coefficient to each sample in the mixed sample set, predicting and scoring on the redistributed mixed sample set, and training a policy network and a reward function network according to the scoring results; scoring each sample of an evaluation dataset with the target reward function network to obtain the predicted ranking of the evaluation dataset, so as to update the weight coefficient of each sample in the redistributed mixed sample set; and finally performing imitation learning on the samples under the redistributed weight coefficients with the target policy network to obtain optimized expert samples. The method performs differentiated learning on mixed expert samples of varying quality, improves the sample distribution of the dataset, and enhances the generalization ability of the imitation-learning agent.
Description
Technical Field
The present invention relates to the field of deep learning, and in particular to a method and device for processing imitation-learning mixed samples based on a visual pre-training model.
Background Art
With the continuous development of technologies such as robots, self-driving cars, and game agents, enabling them to complete complex decision-making tasks and adapt quickly to environmental changes has become an important research problem.
Generative adversarial imitation learning is one of the representative imitation learning algorithms. Drawing on the idea of the Generative Adversarial Network (GAN), it brings an agent close to the decision-making ability of an expert agent through adversarial training between the agent and a reward function. Generative adversarial imitation learning aims to imitate expert behavior by training a generative model, so that tasks can be solved automatically. In this approach, the generative model is trained to produce samples similar to the expert's behavior, while a discriminative model is used to distinguish samples produced by the generative model from the expert's real samples, driving the generative model to gradually approach the expert's behavior.
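For illustration only (not part of the claimed subject matter), the following minimal sketch shows the basic adversarial loop of generative adversarial imitation learning: a discriminator learns to separate expert state-action pairs from policy-generated ones, and its output serves as a reward signal for the policy. Network sizes, optimizer settings, and the batch tensors are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 16, 4
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
discriminator = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt_pi = torch.optim.Adam(policy.parameters(), lr=3e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

def discriminator_step(expert_sa, policy_sa):
    # Expert state-action pairs are labeled 1, policy-generated pairs 0.
    logit_e, logit_p = discriminator(expert_sa), discriminator(policy_sa)
    loss_d = F.binary_cross_entropy_with_logits(logit_e, torch.ones_like(logit_e)) + \
             F.binary_cross_entropy_with_logits(logit_p, torch.zeros_like(logit_p))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

def policy_step(states):
    # Reward signal = log D(s, a): larger when the discriminator mistakes the pair for expert data.
    actions = policy(states)
    reward = F.logsigmoid(discriminator(torch.cat([states, actions], dim=-1)))
    loss_pi = -reward.mean()   # simple surrogate; a full implementation would use an RL policy update
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()

states = torch.randn(32, state_dim)
expert_sa = torch.randn(32, state_dim + action_dim)
policy_sa = torch.cat([states, policy(states).detach()], dim=-1)
discriminator_step(expert_sa, policy_sa)
policy_step(states)
```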
In the related art, generative adversarial imitation learning relies on high-quality data. In many task scenarios, labor-cost constraints mean there are not enough optimal expert samples for the imitation-learning agent to be fully trained, and common expert datasets usually also contain suboptimal expert samples; imitating suboptimal expert samples degrades the performance of the imitation-learning algorithm and lowers the efficiency of the related tasks. In addition, data in expert datasets generalize poorly in high-dimensional state spaces. As the dimensionality of the state space grows, for example in visual-observation task scenarios where the state is represented by high-dimensional images, redundant information about the environment increases sharply and changes of the agent itself become hard to capture. An agent trained to convergence under these conditions is noticeably affected by environmental changes when deployed in a new test environment, so its generalization and learning ability decline.
Summary of the Invention
The present invention provides a sample processing method and device based on a generative adversarial imitation learning algorithm, to address the following defects of the prior art: the generative adversarial imitation learning algorithm depends on high-quality data, yet such data are scarce, and the influence of lower-quality data prevents the imitation-learning agent from being fully trained, giving poor training results; in addition, existing expert datasets generalize poorly in high-dimensional state spaces, so the trained agent has low generalization ability. The method improves the quality of the expert sample set and thereby improves the generalization ability of the imitation-learning agent.
The present invention provides an imitation-learning mixed-sample processing method based on a visual pre-training model, comprising:
acquiring an expert sample set, the expert sample set including optimal expert samples and suboptimal expert samples;
adding target noise to the suboptimal expert samples to obtain noisy expert samples, and processing the noisy expert samples and the optimal expert samples with a generative adversarial network to obtain a mixed sample set;
assigning a weight coefficient to each sample in the mixed sample set to obtain a redistributed mixed sample set; predicting on the redistributed mixed sample set with a policy network to obtain action prediction results; scoring the action prediction results with a reward function network to obtain scoring results; and training the policy network and the reward function network according to a discriminative loss function and the scoring results to obtain a target policy network and a target reward function network;
scoring each sample in an evaluation dataset with the target reward function network to obtain a predicted ranking for the evaluation dataset, computing a ranking-error loss from the predicted ranking, updating the weight coefficient corresponding to each sample in the redistributed mixed sample set by gradient optimization to obtain redistributed weight coefficients, and performing imitation learning on the samples under the redistributed weight coefficients with the target policy network to obtain optimized expert samples; the evaluation dataset belongs to the mixed sample set.
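Purely for orientation (not part of the claims), the steps above can be read as the following high-level loop; every function name is a placeholder introduced here for illustration, not an API defined by the patent.

```python
# High-level outline of the described processing flow (placeholder functions only).
def process_mixed_samples(optimal, suboptimal, num_rounds):
    noisy = add_target_noise(suboptimal)                     # noise injection
    mixed = adversarial_feature_alignment(noisy, optimal)    # GAN-based mixing
    weights = init_weights(mixed)                            # one coefficient per sample
    for _ in range(num_rounds):
        policy, reward_net = adversarial_imitation(mixed, weights)          # imitation phase
        predicted_rank = score_with_reward(reward_net, eval_subset(mixed))  # evaluation scoring
        weights = update_weights_by_ranking_loss(weights, predicted_rank)   # weight phase
    return reweighted_samples(mixed, weights)
```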
The present invention further provides an imitation-learning mixed-sample processing method based on a visual pre-training model, wherein acquiring the expert sample set includes:
performing forward graph inference on each sample in the expert sample set with the visual pre-training model, and taking the feature map of an intermediate network layer as the effective features of the expert sample set.
The present invention further provides an imitation-learning mixed-sample processing method based on a visual pre-training model, wherein the generative adversarial network includes a generator network and a discriminator network;
processing the noisy expert samples and the optimal expert samples with the generative adversarial network to obtain the mixed sample set includes:
performing feature extraction on the noisy expert samples and the optimal expert samples with the generator network to obtain state representation vectors;
discriminating the source of the state representation vectors with the discriminator network to obtain a discrimination result;
computing the loss value of the generator network according to the discrimination result and a first loss function, and optimizing the parameters of the generator network according to its loss value to obtain an optimized generator network; computing the loss value of the discriminator network according to the discrimination result and a second loss function, and optimizing the parameters of the discriminator network according to its loss value to obtain an optimized discriminator network, so as to output the mixed sample set;
wherein the first loss function is determined from the occupancy measure of the optimal expert samples' state-action pairs, the occupancy measure after normally distributed noise is added, the state representation vector, the conditional probability distribution corresponding to the state representation vector, and the discriminator network parameters; and the second loss function is determined from the occupancy measure of the optimal expert samples' state-action pairs, the occupancy measure after normally distributed noise is added, the state representation vector, the conditional probability corresponding to the state representation vector, and the generator network parameters.
The present invention further provides an imitation-learning mixed-sample processing method based on a visual pre-training model, wherein the first loss function includes:
The second loss function includes:
wherein E denotes expectation, D(z) is the discriminator network, ρ*(s, a) is the occupancy measure of the optimal expert samples' state-action pairs, ρ′(s, a) is the occupancy measure after adding normally distributed noise, z is the state representation vector, p(z|x) is the conditional probability distribution corresponding to the state representation vector, θ_D are the discriminator network parameters, and θ_E are the generator network parameters.
The present invention further provides an imitation-learning mixed-sample processing method based on a visual pre-training model, wherein the first loss function further includes:
a target regularization term, the target regularization term being determined from the KL divergence between the optimal expert samples and the corresponding state representation vectors;
the second loss function further includes the target regularization term.
The present invention further provides an imitation-learning mixed-sample processing method based on a visual pre-training model, wherein the evaluation dataset is obtained from the expert sample set according to a prior ranking;
computing the ranking-error loss from the predicted ranking and updating, by gradient optimization, the weight coefficient corresponding to each sample in the redistributed mixed sample set to obtain the redistributed weight coefficients includes:
computing a ranking-difference loss from the predicted ranking, the prior ranking, and a ranking loss function;
determining the redistributed weight coefficients according to the ranking-difference loss;
the ranking loss function includes:
wherein i and j are sample indices, η_ζi is the true cumulative expected return of sample i and η′_ζi is its predicted cumulative expected return; η_ζj is the true cumulative expected return of sample j and η′_ζj is its predicted cumulative expected return.
The present invention further provides an imitation-learning mixed-sample processing device based on a visual pre-training model, comprising:
a sample acquisition module, configured to acquire an expert sample set, the expert sample set including optimal expert samples and suboptimal expert samples, each sample in the expert sample set corresponding to a different weight coefficient;
a first processing module, configured to add target noise to the suboptimal expert samples to obtain noisy expert samples, and to process the noisy expert samples and the optimal expert samples with a generative adversarial network to obtain a mixed sample set;
a second processing module, configured to assign a weight coefficient to each sample in the mixed sample set to obtain a redistributed mixed sample set, predict on the redistributed mixed sample set with a policy network to obtain action prediction results, score the action prediction results with a reward function network to obtain scoring results, and train the policy network and the reward function network according to a discriminative loss function and the scoring results to obtain a target policy network and a target reward function network;
a third processing module, configured to score each sample in an evaluation dataset with the target reward function network to obtain a predicted ranking for the evaluation dataset, compute a ranking-error loss from the predicted ranking, update the weight coefficient corresponding to each sample in the redistributed mixed sample set by gradient optimization to obtain redistributed weight coefficients, and perform imitation learning on the samples under the redistributed weight coefficients with the target policy network to obtain optimized expert samples; the evaluation dataset belongs to the mixed sample set.
The present invention further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any of the imitation-learning mixed-sample processing methods based on a visual pre-training model described above.
The present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing any of the imitation-learning mixed-sample processing methods based on a visual pre-training model described above.
The present invention further provides a computer program product, including a computer program, the computer program, when executed by a processor, implementing any of the imitation-learning mixed-sample processing methods based on a visual pre-training model described above.
The imitation-learning mixed-sample processing method and device based on a visual pre-training model provided by the present invention improve the data distribution of the suboptimal expert samples by adding target noise to them; optimize the parameters of the generative adversarial network with the noisy expert samples and the optimal expert samples to obtain a mixed sample set, improving the feature distribution of the suboptimal expert samples; assign a weight coefficient to each sample in the mixed sample set, and predict and score on the calibrated mixed sample set with the policy network to update the policy network and the reward function network; score each sample in the evaluation dataset with the target reward function network to obtain the corresponding predicted ranking, and update the weight coefficient corresponding to each sample in the redistributed mixed sample set according to the predicted ranking to obtain the redistributed weight coefficients; and perform imitation learning on the samples under the redistributed weight coefficients with the target policy network to obtain optimized expert samples. This enables differentiated learning of mixed expert samples of varying quality, improves the sample distribution of the dataset, and enhances the generalization ability of the imitation-learning agent.
Brief Description of the Drawings
To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from them without creative effort.
FIG. 1 is a first schematic flowchart of the imitation-learning mixed-sample processing method based on a visual pre-training model provided by the present invention;
FIG. 2 is a second schematic flowchart of the imitation-learning mixed-sample processing method based on a visual pre-training model provided by the present invention;
FIG. 3 is a schematic flowchart of extracting intermediate-layer feature maps based on the visual pre-training model provided by the present invention;
FIG. 4 is a third schematic flowchart of the imitation-learning mixed-sample processing method based on a visual pre-training model provided by the present invention;
FIG. 5 is a schematic structural diagram of the imitation-learning mixed-sample processing device based on a visual pre-training model provided by the present invention;
FIG. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The imitation-learning mixed-sample processing method and device based on a visual pre-training model of the present invention are described below with reference to FIG. 1 to FIG. 5.
FIG. 1 is a first schematic flowchart of the imitation-learning mixed-sample processing method based on a visual pre-training model provided by the present invention. As shown in FIG. 1, the method includes the following steps.
Step 110: acquire an expert sample set, the expert sample set including optimal expert samples and suboptimal expert samples.
In this step, the expert sample set may include the training or test samples required for at least one of the following scenarios: image processing tasks such as image classification, object detection, and image segmentation.
For example, for an image classification task the expert sample set includes the image data to be classified and its sample image data; for an image segmentation task it includes the image data to be segmented and its sample image data.
In this embodiment, an optimal expert sample is of higher quality than a suboptimal expert sample; for example, it contains richer data content and more useful information.
Step 120: add target noise to the suboptimal expert samples to obtain noisy expert samples, and process the noisy expert samples and the optimal expert samples with a generative adversarial network to obtain a mixed sample set.
In this step, the target noise is normally distributed noise, such as Gaussian noise or Gaussian white noise.
In this embodiment, the generative adversarial network includes a generator network and a discriminator network. The generator network takes the noisy expert samples and the optimal expert samples as features and extracts new state representations; the discriminator network judges the source of each new state representation, and the corresponding loss functions are computed from the discrimination results so as to optimize the parameters of the generator and discriminator networks. By supervising and guiding the learning of the feature extractor, the hidden-layer state representations of the suboptimal expert samples are driven toward the representations of the optimal expert samples.
In this embodiment, the behavioral features of the suboptimal expert samples are compared with those of the optimal expert samples, and a noise signal is used to model the difference between the two distributions while remaining compatible with the deep generative imitation learning architecture. Specifically, by minimizing a distribution distance between the two, such as the KL (Kullback-Leibler) divergence, the state representations of the suboptimal expert samples after feature extraction are brought close to those of the optimal expert samples.
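A minimal sketch of this idea, under the assumption that the state representations are modeled as diagonal Gaussians; the encoder architecture, dimensions, and noise scale are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.distributions as dist

feat_dim, latent_dim = 128, 32
encoder = nn.Linear(feat_dim, 2 * latent_dim)   # outputs mean and log-std of the representation z

def encode(x):
    mu, log_std = encoder(x).chunk(2, dim=-1)
    return dist.Normal(mu, log_std.exp())

suboptimal = torch.randn(64, feat_dim)                       # stand-in suboptimal expert features
optimal = torch.randn(64, feat_dim)                          # stand-in optimal expert features
noisy = suboptimal + 0.1 * torch.randn_like(suboptimal)      # inject normally distributed target noise

# Pull the noisy suboptimal representation toward the optimal one by minimizing KL divergence.
kl = dist.kl_divergence(encode(noisy), encode(optimal)).sum(-1).mean()
kl.backward()   # gradients flow into the encoder; an optimizer step would follow
```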
Step 130: assign a weight coefficient to each sample in the mixed sample set to obtain a redistributed mixed sample set; predict on the redistributed mixed sample set with the policy network to obtain action prediction results; score the action prediction results with the reward function network to obtain scoring results; train the policy network and the reward function network according to the discriminative loss function and the scoring results to obtain a target policy network and a target reward function network.
In this step, each expert sample corresponds to its own weight coefficient. The higher the weight coefficients of the optimal expert samples in the expert sample set, the higher the quality of the set and the more suitable it is as input data for the image processing tasks described above.
For example, with weight coefficient β ∈ (0, 1), the closer a sample's weight coefficient is to 1, the higher the sample's quality.
In this step, the imitation-learning strategy treats a sample as more important to imitate the closer its weight coefficient is to 1 (i.e., the larger its weight share).
FIG. 2 is a second schematic flowchart of the imitation-learning mixed-sample processing method based on a visual pre-training model provided by the present invention. In the embodiment shown in FIG. 2, in the weight-learning stage each sample in the mixed sample set is assigned a weight coefficient indicating its quality (initialized to 1, meaning each sample is a priori considered equally good); after calibration, the expert samples redistributed by the weight coefficients, i.e., the redistributed mixed sample set, are obtained.
In this embodiment, the redistributed mixed sample set is fed into the policy network, which predicts the corresponding action outputs.
For example, suppose the imitation-learning scenario is a sequential decision-making task. Taking autonomous driving as an example, the expert samples are road-condition images captured by a camera, and the predicted actions are the steering-wheel angle and throttle level.
In the embodiment shown in FIG. 2, in the imitation-learning stage the action outputs of the policy network, i.e., the action prediction results, are fed into the reward function network, which scores the different actions (for example, giving high scores to expert samples and low scores to samples generated by the policy network). The loss functions are then computed to update the policy network and the reward function network, yielding the target policy network and the target reward function network; this improves the policy network's ability to generate expert-like actions and the reward function's ability to distinguish samples generated by the policy network, i.e., adversarial training.
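A hedged sketch of one such weighted adversarial update, in which the per-sample weight rescales the expert term of the reward-network loss; the networks, shapes, and weighting scheme shown here are assumptions for illustration, not the patented loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, action_dim = 32, 4
policy = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
reward_net = nn.Sequential(nn.Linear(feat_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-4)
opt_r = torch.optim.Adam(reward_net.parameters(), lr=1e-4)

def train_step(expert_states, expert_actions, weights):
    """One adversarial step: `weights` holds one coefficient per expert sample."""
    # Reward network: score expert pairs high, policy-generated pairs low, weighted per sample.
    policy_actions = policy(expert_states).detach()
    s_e = reward_net(torch.cat([expert_states, expert_actions], -1)).squeeze(-1)
    s_p = reward_net(torch.cat([expert_states, policy_actions], -1)).squeeze(-1)
    loss_r = -(weights * F.logsigmoid(s_e)).mean() - F.logsigmoid(-s_p).mean()
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()

    # Policy network: maximize the reward-network score of its own actions.
    a = policy(expert_states)
    loss_pi = -reward_net(torch.cat([expert_states, a], -1)).mean()
    opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()

train_step(torch.randn(16, feat_dim), torch.randn(16, action_dim), torch.ones(16))
```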
Step 140: score each sample in the evaluation dataset with the target reward function network to obtain the predicted ranking for the evaluation dataset; compute the ranking-error loss from the predicted ranking; update the weight coefficient corresponding to each sample in the redistributed mixed sample set by gradient optimization to obtain the redistributed weight coefficients; perform imitation learning on the samples under the redistributed weight coefficients with the target policy network to obtain optimized expert samples; the evaluation dataset belongs to the mixed sample set.
In this step, the evaluation dataset is extracted from the mixed sample set according to a preset weight-coefficient ordering (i.e., the prior ranking).
For example, after the target reward function network is obtained, a small subset of samples is drawn from the original mixed expert samples, and their relative quality is annotated as prior knowledge to form the evaluation dataset.
In this embodiment, the evaluation dataset is obtained from the expert sample set according to the prior ranking. Computing the ranking-error loss from the predicted ranking and updating, by gradient optimization, the weight coefficient corresponding to each sample in the redistributed mixed sample set to obtain the redistributed weight coefficients includes: computing a ranking-difference loss from the predicted ranking, the prior ranking, and the ranking loss function; and determining the redistributed weight coefficients according to the ranking-difference loss. The ranking loss function includes:
wherein i and j are sample indices, η_ζi is the true cumulative expected return of sample i and η′_ζi is its predicted cumulative expected return; η_ζj is the true cumulative expected return of sample j and η′_ζj is its predicted cumulative expected return.
In the embodiment shown in FIG. 2, the reward function network scores the data in the evaluation dataset to obtain a predicted ranking; the predicted ranking is compared with the annotated prior ranking to compute the ranking-difference loss, and this loss is used to update the weight coefficients. The smaller the loss, the better the reward function has been learned in the imitation-learning stage, which indirectly indicates that the weight coefficients are set reasonably.
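The exact ranking loss appears only as a formula image in the original patent and is not reproduced here; the sketch below uses a standard pairwise logistic ranking loss over predicted returns as a stand-in, purely to illustrate how a ranking mismatch can drive a gradient update of the sample weights. The link from weights to predicted returns is also a stand-in.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(pred_returns, true_returns):
    """Penalize sample pairs whose predicted ordering contradicts the prior ordering."""
    diff_pred = pred_returns.unsqueeze(1) - pred_returns.unsqueeze(0)            # [i, j] = eta'_i - eta'_j
    prefer_i = (true_returns.unsqueeze(1) > true_returns.unsqueeze(0)).float()   # 1 where eta_i > eta_j
    return (prefer_i * F.softplus(-diff_pred)).sum() / prefer_i.sum().clamp(min=1)

# Per-sample weight coefficients being learned (initialized to 1, as in the weight-learning stage).
weights = torch.ones(8, requires_grad=True)
opt_w = torch.optim.Adam([weights], lr=1e-2)

true_returns = torch.arange(8.0)        # prior (annotated) ordering of the evaluation samples
scores = torch.randn(8)                 # stand-in reward-network scores
pred_returns = weights * scores         # stand-in dependence of predicted returns on the weights

loss = pairwise_ranking_loss(pred_returns, true_returns)
opt_w.zero_grad(); loss.backward(); opt_w.step()   # gradient update of the weight coefficients
```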
In this embodiment, after the redistributed weight coefficients are obtained, the mixed samples weighted by them are fed into the policy network for imitation learning. Imitation learning and weight learning alternate until a predetermined number of iterations is reached, after which the optimized expert samples are output.
In this embodiment, during training the weight coefficients are adaptively optimized so that the samples in the expert sample set are learned in a differentiated manner, and redistributing the weight coefficients of the samples redistributes the current imitation-learning policy. In each round of differentiated learning, the optimized generator network outputs a return value, and the optimized discriminator network re-ranks the weight coefficients of the samples according to this return value and the ranking loss function. Over multiple training rounds, the ranking of the sample weight coefficients is adaptively optimized by maximizing the cumulative expected return, which improves the distribution of the samples in the expert sample set and thereby the quality of the set; the maximized cumulative expected return is the cumulative value of the expected return predicted by the reward function trained with the adversarial inverse reinforcement learning algorithm.
Specifically, suppose the current mixed expert policy is denoted π_d, with a corresponding occupancy measure while interacting with the environment. Redistributing the state-action pairs of π_d yields a new policy π_new, with its own occupancy measure. The imitation-learning loss function after weight-coefficient redistribution can then be expressed as:
where L_imitation can be the loss function of any conventional imitation-learning algorithm, such as behavioral cloning or inverse reinforcement learning.
In this embodiment, differentiated learning of the different expert samples through the weight coefficients maximizes the performance of the imitation-learning algorithm, i.e., maximizes the cumulative expected return. The corresponding optimization problem can be expressed as:
where β* denotes the optimal weight-coefficient distribution and the objective denotes the cumulative expected return of the redistributed policy π_new, expressed as:
where R is the environment's reward function and γ is the discount coefficient. The goal of weight learning is to maximize the performance of the imitation-learning algorithm, so that performance can be evaluated to reflect how well the current weight coefficients have been learned. Let the current time be t, with s_t denoting the state information of the expert sample at time t and a_t the action information of the expert sample at time t.
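For concreteness, the cumulative expected return referred to here is, in the usual discounted form, estimated from a trajectory as follows; this is a generic illustration, since the patent's own formula appears only as an image in the original.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum over t of gamma**t * R(s_t, a_t) for one trajectory of per-step rewards."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# Example: rewards predicted by the trained reward function along one expert trajectory.
print(discounted_return([1.0, 0.5, 0.2]))  # 1.0 + 0.99*0.5 + 0.99**2*0.2
```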
The imitation-learning mixed-sample processing method based on a visual pre-training model provided by the embodiments of the present invention improves the data distribution of the suboptimal expert samples by adding target noise to them; optimizes the parameters of the generative adversarial network with the noisy expert samples and the optimal expert samples to obtain a mixed sample set, improving the feature distribution of the suboptimal expert samples; assigns a weight coefficient to each sample in the mixed sample set, and predicts and scores on the calibrated mixed sample set with the policy network to update the policy network and the reward function network; scores each sample in the evaluation dataset with the target reward function network to obtain the corresponding predicted ranking, and updates the weight coefficient corresponding to each sample in the redistributed mixed sample set according to the predicted ranking to obtain the redistributed weight coefficients; and performs imitation learning on the samples under the redistributed weight coefficients with the target policy network to obtain optimized expert samples. This enables differentiated learning of mixed expert samples of varying quality, improves the sample distribution of the dataset, and enhances the generalization ability of the imitation-learning agent.
In some embodiments, acquiring the expert sample set includes: performing forward graph inference on each sample in the expert sample set with the visual pre-training model, and taking the feature map of an intermediate network layer as the effective features of the expert sample set.
In this embodiment, the visual pre-training model is a deep learning model pre-trained on a large-scale dataset, typically usable to initialize the network parameters of downstream tasks such as image classification, object detection, and image segmentation.
In this embodiment, general visual reinforcement learning methods use high-dimensional image information as observations of the environment, which places higher demands on the state abstraction of the feature extractor. Because the expressiveness, trainability, and capacity of the image feature extractor are key to the generalization of the algorithm in generative adversarial imitation learning, performing generative imitation learning on features extracted by a model pre-trained on a large-scale visual dataset can improve the agent's generalization to new tasks or new environments.
FIG. 3 is a schematic flowchart of extracting intermediate-layer feature maps based on the visual pre-training model provided by the present invention. In the embodiment shown in FIG. 3, the original feature map is input into the visual pre-training model (a convolutional neural network pre-trained on a large-scale visual dataset); the network parameters are fixed, only forward inference is performed on the input image data, and the feature map of an intermediate network layer is extracted as the visual-observation input to the subsequent imitation-learning algorithm.
In this embodiment, during inference only the normalization layers (BatchNorm) of the pre-trained model may have their parameters updated, i.e., the mean and standard-deviation parameters of the pre-trained model's normalization layers keep being updated during training, because the statistics updated by BatchNorm help adapt to shifts in the visual observations and thereby improve the agent's generalization ability.
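A minimal PyTorch-style sketch of this usage, assuming a torchvision ResNet as the pre-trained backbone; the choice of ResNet-18 and of the truncation point are illustrative assumptions, not choices specified by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights="IMAGENET1K_V1")                   # CNN pre-trained on a large-scale dataset
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])    # keep layers up to the last conv block

# Freeze all weights; only BatchNorm affine parameters stay trainable.
for module in feature_extractor.modules():
    for p in module.parameters(recurse=False):
        p.requires_grad = isinstance(module, nn.BatchNorm2d)

feature_extractor.train()   # train() mode lets the BatchNorm running statistics keep adapting

images = torch.randn(4, 3, 224, 224)        # stand-in visual observations
feature_maps = feature_extractor(images)    # intermediate-layer feature maps, here of shape (4, 512, 7, 7)
```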
The imitation-learning mixed-sample processing method based on a visual pre-training model provided by the embodiments of the present invention performs forward graph inference on each sample in the expert sample set with the visual pre-training model and takes the feature map of an intermediate network layer as the effective features of the expert sample set, reducing development cost and providing data input for the subsequent generative imitation-learning algorithm, so as to improve the agent's generalization ability in new tasks or new environments.
In some embodiments, the generative adversarial network includes a generator network and a discriminator network. Processing the noisy expert samples and the optimal expert samples with the generative adversarial network to obtain the mixed sample set includes: performing feature extraction on the noisy expert samples and the optimal expert samples with the generator network to obtain state representation vectors; discriminating the source of the state representation vectors with the discriminator network to obtain a discrimination result; computing the loss value of the generator network according to the discrimination result and the first loss function, and optimizing the parameters of the generator network according to its loss value to obtain the optimized generator network; and computing the loss value of the discriminator network according to the discrimination result and the second loss function, and optimizing the parameters of the discriminator network according to its loss value to obtain the optimized discriminator network, so as to output the mixed sample set.
In this embodiment, the feature-extraction network E (the generator network) and the discriminator network D are initialized; the optimal expert samples ξ* and suboptimal expert samples ξ′ are collected from the expert sample set, normally distributed noise ε is injected into the suboptimal expert samples to obtain the noisy expert samples, and the optimal expert samples and the noisy expert samples are fed into the feature-extraction network E to obtain new state representations z, i.e., the state representation vectors; the state representation vectors are fed into the discriminator network, which judges the source of each feature, and the loss functions are computed from the discrimination results to optimize the parameters of the feature-extraction network and the discriminator network.
Here, the first loss function is determined from the occupancy measure of the optimal expert samples' state-action pairs, the occupancy measure after normally distributed noise is added, the state representation vector, the conditional probability distribution corresponding to the state representation vector, and the discriminator network parameters; the second loss function is determined from the occupancy measure of the optimal expert samples' state-action pairs, the occupancy measure after normally distributed noise is added, the state representation vector, the conditional probability corresponding to the state representation vector, and the generator network parameters.
Specifically, the first loss function includes:
The second loss function includes:
where E denotes expectation, D(z) is the discriminator network, ρ*(s, a) is the occupancy measure of the optimal expert samples' state-action pairs, ρ′(s, a) is the occupancy measure after adding normally distributed noise, z is the state representation vector, p(z|x) is the conditional probability distribution corresponding to the state representation vector, θ_D are the discriminator network parameters, and θ_E are the generator network parameters.
FIG. 4 is a third schematic flowchart of the imitation-learning mixed-sample processing method based on a visual pre-training model provided by the present invention. In the embodiment shown in FIG. 4, noise is injected into the suboptimal expert samples to obtain noisy expert samples, and the optimal expert samples and the noisy expert samples are fed into the generator network (the feature-extraction network) to obtain state representation vectors (state features), which are used to impose a mutual-information constraint with respect to the optimal expert samples. The state representation vectors are fed into the discriminator network to discriminate the source of the features and obtain the discrimination result; the loss value of the generator network is computed from the discrimination result and the network gradients of the feature-extraction network are updated accordingly, yielding the optimized generator network; at the same time, the loss value of the discriminator network is computed from the discrimination result and the gradients of the discriminator network are updated accordingly, yielding the optimized discriminator network, whereupon the mixed sample set is output.
In this embodiment, the first loss function further includes a target regularization term determined from the KL divergence between the optimal expert samples and the corresponding state representation vectors; the second loss function also includes the target regularization term.
In this embodiment, during training of the generator and discriminator networks, the KL divergence between the optimal expert samples and the extracted features may be added as a regularization term to the loss function of the feature-extraction network, to prevent the extracted features from deviating too far from the features of the optimal expert samples and thereby improve algorithm performance.
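The sketch below shows one way the E-versus-D objective described above, together with the KL regularization term, could be wired up. It is an illustrative assumption (standard non-saturating GAN losses over Gaussian representations), not a reproduction of the patent's first and second loss functions, whose exact formulas appear only as images in the original.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as dist

feat_dim, latent_dim = 128, 32
E = nn.Linear(feat_dim, 2 * latent_dim)                                     # feature-extraction network
D = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))   # discriminator network
opt_E = torch.optim.Adam(E.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

def encode(x):
    mu, log_std = E(x).chunk(2, dim=-1)
    return dist.Normal(mu, log_std.exp())

def train_step(optimal_x, suboptimal_x, kl_weight=0.1):
    noisy_x = suboptimal_x + 0.1 * torch.randn_like(suboptimal_x)           # inject normal noise
    q_opt, q_noisy = encode(optimal_x), encode(noisy_x)
    z_opt, z_noisy = q_opt.rsample(), q_noisy.rsample()

    # Discriminator: separate representations of optimal samples from those of noisy suboptimal ones.
    logit_opt, logit_noisy = D(z_opt.detach()), D(z_noisy.detach())
    loss_D = F.binary_cross_entropy_with_logits(logit_opt, torch.ones_like(logit_opt)) + \
             F.binary_cross_entropy_with_logits(logit_noisy, torch.zeros_like(logit_noisy))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Extractor: make noisy-sample representations indistinguishable from optimal ones,
    # with a KL regularizer keeping the two representation distributions close.
    logit_fake = D(z_noisy)
    loss_E = F.binary_cross_entropy_with_logits(logit_fake, torch.ones_like(logit_fake)) \
             + kl_weight * dist.kl_divergence(q_noisy, q_opt).sum(-1).mean()
    opt_E.zero_grad(); loss_E.backward(); opt_E.step()

train_step(torch.randn(64, feat_dim), torch.randn(64, feat_dim))
```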
本发明实施例提供的基于视觉预训练模型的模仿学习混合样本处理方法,通过对抗生成网络对噪声专家样本和最优专家样本进行处理,得到混合样本集,再生成网络对噪声专家样本和最优专家样本进行特征特征提取,得到状态表征向量,并根据判别网络对状态表征向量的来源进行特征判别,得到判别结果,最后根据判别结果、第一损失函数和第二损失函数分别优化生成网络和判别网络,并输出混合样本集;通过将噪声对比估计引入对抗生成式模仿学习,以改善次优专家样本的数据分布,通过正态噪声模拟最优专家样本和次优专家样本之间的特征差异,使用对抗生成的方法训练特征提取器,改善次优专家样本的特征分布,进而提升混合专家样本下的模仿学习算法性能。The embodiment of the present invention provides an imitation learning mixed sample processing method based on a visual pre-training model, which processes noisy expert samples and optimal expert samples through an adversarial generative network to obtain a mixed sample set, then uses a generative network to extract features from the noisy expert samples and the optimal expert samples to obtain a state representation vector, and uses a discriminant network to perform feature discrimination on the source of the state representation vector to obtain a discrimination result, and finally optimizes the generative network and the discriminant network respectively according to the discrimination result, the first loss function and the second loss function, and outputs a mixed sample set; by introducing noise contrast estimation into adversarial generative imitation learning to improve the data distribution of suboptimal expert samples, simulating the feature difference between the optimal expert samples and the suboptimal expert samples through normal noise, and using an adversarial generation method to train a feature extractor to improve the feature distribution of suboptimal expert samples, thereby improving the performance of the imitation learning algorithm under mixed expert samples.
下面对本发明提供的基于视觉预训练模型的模仿学习混合样本处理装置进行描述,下文描述的基于视觉预训练模型的模仿学习混合样本处理装置与上文描述的基于视觉预训练模型的模仿学习混合样本处理方法可相互对应参照。The following is a description of the mixed sample processing device for imitation learning based on a visual pre-training model provided by the present invention. The mixed sample processing device for imitation learning based on a visual pre-training model described below and the mixed sample processing method for imitation learning based on a visual pre-training model described above can be referenced to each other.
图5是本发明提供的基于视觉预训练模型的模仿学习混合样本处理装置的结构示意图,如图5所示,该基于视觉预训练模型的模仿学习混合样本处理装置包括:样本获取模块510、第一处理模块520、第二处理模块530和第三处理模块540。Figure 5 is a structural schematic diagram of the mixed sample processing device for imitation learning based on the visual pre-training model provided by the present invention. As shown in Figure 5, the mixed sample processing device for imitation learning based on the visual pre-training model includes: a sample acquisition module 510, a first processing module 520, a second processing module 530 and a third processing module 540.
The sample acquisition module 510 is configured to acquire an expert sample set, where the expert sample set includes optimal expert samples and suboptimal expert samples, and each sample in the expert sample set corresponds to a different weight coefficient.
The first processing module 520 is configured to add target noise to the suboptimal expert samples to obtain noise expert samples, and to process the noise expert samples and the optimal expert samples with the generative adversarial network to obtain a mixed sample set.
The second processing module 530 is configured to calibrate a weight coefficient for each sample in the mixed sample set to obtain a redistributed mixed sample set; to predict on the redistributed mixed sample set with the policy network to obtain action prediction results; to score the action prediction results with the reward function network to obtain scoring results; and to train the policy network and the reward function network according to the discriminant loss function and the scoring results to obtain a target policy network and a target reward function network.
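For concreteness, the training carried out by the second processing module 530 can be sketched in a weighted GAIL style, where the reward function network plays the role of the discriminator trained with a discriminant (binary cross-entropy) loss. The network architectures, the per-sample weight vector and the simplified policy update (a direct gradient through the reward score rather than a full reinforcement-learning step) are assumptions of this sketch, not the claimed implementation.

```python
# Hedged sketch: weighted adversarial training of policy and reward networks.
import torch
import torch.nn as nn

state_dim, action_dim = 128, 8
policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
reward_net = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
r_opt = torch.optim.Adam(reward_net.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss(reduction="none")

def train_step(states, expert_actions, weights):
    # weights: calibration coefficient of each sample in the redistributed mixed sample set, shape (B,).
    pred_actions = policy(states)                     # action prediction on the redistributed set

    # Reward (discriminant) update: expert pairs labelled 1, policy pairs labelled 0,
    # with each expert sample weighted by its calibration coefficient.
    expert_logits = reward_net(torch.cat([states, expert_actions], dim=-1))
    policy_logits = reward_net(torch.cat([states, pred_actions.detach()], dim=-1))
    r_loss = (weights * bce(expert_logits, torch.ones_like(expert_logits)).squeeze(-1)).mean() + \
             bce(policy_logits, torch.zeros_like(policy_logits)).mean()
    r_opt.zero_grad(); r_loss.backward(); r_opt.step()

    # Policy update: raise the reward-network score of the predicted actions.
    score = reward_net(torch.cat([states, policy(states)], dim=-1))
    pi_loss = -score.mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    return r_loss.item(), pi_loss.item()
```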
The third processing module 540 is configured to score each sample in an evaluation data set with the target reward function network to obtain the predicted ranking of the evaluation data set, to compute a ranking error loss from the predicted ranking, to update the weight coefficient of each sample in the redistributed mixed sample set through gradient optimization to obtain redistributed weight coefficients, and to perform imitation learning on the redistributed weight coefficients with the target policy network to obtain optimized expert samples; the evaluation data set belongs to the mixed sample set.
The imitation-learning mixed-sample processing device based on a visual pre-training model provided by the embodiments of the present invention improves the data distribution of the suboptimal expert samples by adding target noise to them; optimizes the parameters of the generative adversarial network with the noise expert samples and the optimal expert samples to obtain a mixed sample set, improving the feature distribution of the suboptimal expert samples; calibrates a weight coefficient for each sample in the mixed sample set and uses the policy network to predict and score on the calibrated set so as to update the policy network and the reward function network; scores each sample in the evaluation data set with the target reward function network to obtain the corresponding predicted ranking; updates the weight coefficient of each sample in the redistributed mixed sample set according to the predicted ranking to obtain redistributed weight coefficients; and performs imitation learning on the redistributed weight coefficients with the target policy network to obtain optimized expert samples. The device can thus learn differentially from mixed expert samples of varying quality, improve the sample distribution of the data set, and enhance the generalization ability of the imitation learning agent.
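The weight-redistribution step performed by the third processing module 540 can likewise be sketched. The pairwise logistic ranking loss, the softmax normalisation of the weight coefficients and the reference ordering `ref_rank` are illustrative assumptions; the embodiment only specifies that a ranking error loss is computed from the predicted ranking and that the weights are updated by gradient optimization.

```python
# Hedged sketch: gradient update of per-sample weight coefficients from a ranking error loss.
import torch
import torch.nn.functional as F

def reweight(reward_net, eval_states, eval_actions, ref_rank, weight_logits, lr=1e-2):
    # weight_logits: learnable tensor with requires_grad=True, one entry per evaluation sample.
    # ref_rank: LongTensor of sample indices ordered from best to worst (reference ordering).
    scores = reward_net(torch.cat([eval_states, eval_actions], dim=-1)).squeeze(-1)

    # Pairwise ranking error: a sample ranked above its neighbour should also score higher.
    idx_i, idx_j = ref_rank[:-1], ref_rank[1:]
    margin = scores[idx_i] - scores[idx_j]
    rank_loss = F.softplus(-margin)                    # logistic pairwise ranking loss

    # Weight the per-pair loss with the current coefficients and back-propagate to them.
    weights = torch.softmax(weight_logits, dim=0)      # normalised weight coefficients
    loss = (weights[idx_i] * rank_loss).sum()
    grad, = torch.autograd.grad(loss, weight_logits)
    with torch.no_grad():
        weight_logits -= lr * grad                     # gradient step on the weights
    return torch.softmax(weight_logits, dim=0)         # redistributed weight coefficients
```

The returned coefficients would then re-weight the mixed sample set before the target policy network performs the final imitation-learning pass.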
FIG. 6 is a schematic structural diagram of the electronic device provided by the present invention. As shown in FIG. 6, the electronic device may include a processor 610, a communications interface 620, a memory 630 and a communication bus 640, where the processor 610, the communications interface 620 and the memory 630 communicate with one another through the communication bus 640. The processor 610 may call logic instructions in the memory 630 to execute the imitation-learning mixed-sample processing method based on a visual pre-training model. The method includes: acquiring an expert sample set, the expert sample set including optimal expert samples and suboptimal expert samples; adding target noise to the suboptimal expert samples to obtain noise expert samples, and processing the noise expert samples and the optimal expert samples with the generative adversarial network to obtain a mixed sample set; calibrating a weight coefficient for each sample in the mixed sample set to obtain a redistributed mixed sample set, and predicting on the redistributed mixed sample set with the policy network to obtain action prediction results; scoring the action prediction results with the reward function network to obtain scoring results; training the policy network and the reward function network according to the discriminant loss function and the scoring results to obtain a target policy network and a target reward function network; scoring each sample in an evaluation data set with the target reward function network to obtain the predicted ranking of the evaluation data set, computing a ranking error loss from the predicted ranking, updating the weight coefficient of each sample in the redistributed mixed sample set through gradient optimization to obtain redistributed weight coefficients, and performing imitation learning on the redistributed weight coefficients with the target policy network to obtain optimized expert samples; the evaluation data set belongs to the mixed sample set.
In addition, when the logic instructions in the memory 630 are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program, which may be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, a computer is able to execute the imitation-learning mixed-sample processing method based on a visual pre-training model provided by the methods described above. The method includes: acquiring an expert sample set, the expert sample set including optimal expert samples and suboptimal expert samples; adding target noise to the suboptimal expert samples to obtain noise expert samples, and processing the noise expert samples and the optimal expert samples with the generative adversarial network to obtain a mixed sample set; calibrating a weight coefficient for each sample in the mixed sample set to obtain a redistributed mixed sample set, and predicting on the redistributed mixed sample set with the policy network to obtain action prediction results; scoring the action prediction results with the reward function network to obtain scoring results; training the policy network and the reward function network according to the discriminant loss function and the scoring results to obtain a target policy network and a target reward function network; scoring each sample in an evaluation data set with the target reward function network to obtain the predicted ranking of the evaluation data set, computing a ranking error loss from the predicted ranking, updating the weight coefficient of each sample in the redistributed mixed sample set through gradient optimization to obtain redistributed weight coefficients, and performing imitation learning on the redistributed weight coefficients with the target policy network to obtain optimized expert samples; the evaluation data set belongs to the mixed sample set.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the imitation-learning mixed-sample processing method based on a visual pre-training model provided by the methods described above. The method includes: acquiring an expert sample set, the expert sample set including optimal expert samples and suboptimal expert samples; adding target noise to the suboptimal expert samples to obtain noise expert samples, and processing the noise expert samples and the optimal expert samples with the generative adversarial network to obtain a mixed sample set; calibrating a weight coefficient for each sample in the mixed sample set to obtain a redistributed mixed sample set, and predicting on the redistributed mixed sample set with the policy network to obtain action prediction results; scoring the action prediction results with the reward function network to obtain scoring results; training the policy network and the reward function network according to the discriminant loss function and the scoring results to obtain a target policy network and a target reward function network; scoring each sample in an evaluation data set with the target reward function network to obtain the predicted ranking of the evaluation data set, computing a ranking error loss from the predicted ranking, updating the weight coefficient of each sample in the redistributed mixed sample set through gradient optimization to obtain redistributed weight coefficients, and performing imitation learning on the redistributed weight coefficients with the target policy network to obtain optimized expert samples; the evaluation data set belongs to the mixed sample set.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement the embodiments without creative effort.
Through the description of the above implementations, those skilled in the art can clearly understand that each implementation may be realized by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution, in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311868899.4A CN117975190B (en) | 2023-12-29 | 2023-12-29 | Method and device for processing simulated learning mixed sample based on vision pre-training model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311868899.4A CN117975190B (en) | 2023-12-29 | 2023-12-29 | Method and device for processing simulated learning mixed sample based on vision pre-training model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117975190A CN117975190A (en) | 2024-05-03 |
CN117975190B true CN117975190B (en) | 2024-11-05 |
Family
ID=90850310
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311868899.4A Active CN117975190B (en) | 2023-12-29 | 2023-12-29 | Method and device for processing simulated learning mixed sample based on vision pre-training model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117975190B (en) |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109615627B (en) * | 2018-12-14 | 2021-07-27 | 国网智能科技股份有限公司 | Power transmission and transformation inspection image quality evaluation method and system |
US11376500B2 (en) * | 2019-02-27 | 2022-07-05 | Nvidia Corporation | Gamer training using neural networks |
US20210142181A1 (en) * | 2019-11-07 | 2021-05-13 | Microsoft Technology Licensing, Llc | Adversarial training of machine learning models |
CN114595799A (en) * | 2020-11-30 | 2022-06-07 | 华为技术有限公司 | Model training method and device |
WO2023062196A1 (en) * | 2021-10-15 | 2023-04-20 | Bracco Imaging S.P.A. | Training a machine learning model for simulating images at higher dose of contrast agent in medical imaging applications |
CN116935170B (en) * | 2023-09-14 | 2024-05-28 | 腾讯科技(深圳)有限公司 | Processing method and device of video processing model, computer equipment and storage medium |
- 2023-12-29: CN application CN202311868899.4A filed, granted as patent CN117975190B (en), status active
Non-Patent Citations (1)
Title |
---|
Weight-adaptive adversarial generative imitation learning based on noise contrastive estimation (基于噪声对比估计的权重自适应对抗生成式模仿学习); 关凡伟 et al.; Pattern Recognition and Artificial Intelligence (模式识别与人工智能); 2023-04-30; Vol. 36, No. 4; pp. 300-312 *
Also Published As
Publication number | Publication date |
---|---|
CN117975190A (en) | 2024-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112668235B (en) | Robot control method based on DDPG algorithm of offline model pre-training learning | |
CN110168578B (en) | Multi-tasking neural network with task-specific paths | |
CN111310814A (en) | Method and device for training business prediction model by utilizing unbalanced positive and negative samples | |
CN114162146B (en) | Driving strategy model training method and automatic driving control method | |
CN114139637B (en) | Multi-agent information fusion method and device, electronic equipment and readable storage medium | |
CN112052948B (en) | Network model compression method and device, storage medium and electronic equipment | |
CN111282272B (en) | Information processing method, computer readable medium and electronic device | |
CN114038055A (en) | Image generation method based on contrast learning and generation countermeasure network | |
CN111310918B (en) | Data processing method, device, computer equipment and storage medium | |
CN111724370B (en) | Multi-task image quality evaluation method and system based on uncertainty and probability | |
CN115511069A (en) | Neural network training method, data processing method, device and storage medium | |
CN114330580A (en) | A Robust Knowledge Distillation Method Based on Ambiguity-Guided Mutual Label Update | |
KR20240034804A (en) | Evaluating output sequences using an autoregressive language model neural network | |
CN111639695B (en) | A method and system for classifying data based on an improved fruit fly optimization algorithm | |
CN116822633A (en) | Model reasoning method and device based on self-cognition and electronic equipment | |
CN115062606B (en) | Dialogue data analysis method, dialogue data model training method and electronic equipment | |
CN117975190B (en) | Method and device for processing simulated learning mixed sample based on vision pre-training model | |
CN112884129A (en) | Multi-step rule extraction method and device based on teaching data and storage medium | |
CN118068703A (en) | Multi-target game countermeasure method for unmanned system cluster | |
Saini et al. | Image compression using APSO | |
CN114841887A (en) | Image restoration quality evaluation method based on multi-level difference learning | |
CN116176606A (en) | Method and device for reinforcement learning of intelligent agent for controlling vehicle driving | |
CN113807005A (en) | Bearing residual life prediction method based on improved FPA-DBN | |
CN112200102A (en) | A computer vision task execution method and system for adaptive data augmentation | |
CN117574539B (en) | Optimization parameter verification system for vehicle restraint system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |