
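WO2021068638A1 - Interactive reinforcement learning method combining the TAMER framework and facial expression feedback - Google Patents

Interactive reinforcement learning method combining the TAMER framework and facial expression feedback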

Info

Publication number
WO2021068638A1
Authority
WO
WIPO (PCT)
Prior art keywords
tamer
agent
feedback
reward
facial expression
Prior art date
Application number
PCT/CN2020/108156
Other languages
English (en)
French (fr)
Inventor
李光亮
林金莹
张期磊
何波
冯晨
Original Assignee
中国海洋大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国海洋大学
Publication of WO2021068638A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2178 Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition

Definitions

  • the present invention relates to the field of artificial intelligence technology, in particular to an interactive reinforcement learning method combining TAMER framework and facial expression feedback.
  • TAMER is a typical interactive reinforcement learning method.
  • the system can learn a prediction model of human user rewards. This model can successfully train TAMER agents even when human rewards are delayed or inconsistent.
  • TAMER agents can understand the intentions of human users and adapt to their preferences.
  • human user preferences are conveyed through clear instructions or expensive corrective feedback, such as through predefined words or sentences, buttons, mouse clicks, etc.
  • these feedback methods increase the cognitive load of human users.
  • the problem in the prior art is that adjusting the behavior of the TAMER agent through explicit feedback forms such as predefined keyboard feedback increases the cognitive burden of human users, and strategy updates require a large number of interactive behaviors, which increases the cost of learning.
  • the direct significance is to reduce the number of explicit feedback required during the TAMER agent training process, and reduce the cognitive burden of human users;
  • the purpose of the present invention is to provide an interactive reinforcement learning method that combines the TAMER framework and facial expression feedback, and combines explicit feedback and facial expression feedback to perform learning on the TAMER framework.
  • an interactive reinforcement learning method combining TAMER framework and facial expression feedback including:
  • combining the TAMER framework and facial expression evaluation to form a Face Valuing-TAMER agent; the Face Valuing-TAMER agent anticipates future rewards by learning a value function from human feedback;
  • the combination of the TAMER framework and facial expression evaluation to form the Face Valuing-TAMER agent is specifically: the trainer trains the TAMER agent under the TAMER framework, uses keyboard key feedback to determine the keyboard reward signal, and trains the TAMER agent to obtain an initial executable strategy;
  • the trainer determines the facial reward signal through facial expression feedback to adjust the behavior strategy of the TAMER agent.
  • the trainer trains the TAMER agent under the TAMER framework, determines keyboard reward signals through keyboard key feedback, and trains the TAMER agent to obtain an initial executable strategy, which specifically includes:
  • the trainer observes the current action of the TAMER agent, and feeds it back through a keyboard interface, obtains a keyboard feedback signal, and determines a keyboard reward signal according to the keyboard feedback signal;
  • An initial executable strategy is determined according to the keyboard feedback signal and the keyboard reward signal.
  • enabling the trainer to determine the facial reward signal through facial expression feedback to adjust the behavior strategy of the TAMER agent specifically includes:
  • the updated value function includes a state value function and an action value function;
  • the trainer obtains a facial feedback signal through facial expression feedback, and determines a facial reward signal according to the facial feedback signal;
  • the updated value function is:
  • the Face Valuing-TAMER agent predicts future rewards by learning a value function from human feedback, which specifically includes:
  • a TAMER agent learns a reward model whose behavior is defined as the human reward expected in the current state and action, i.e. the reward signal the agent receives after taking action $A_t$ in any state $S_t$;
  • given a state s, the TAMER agent chooses the maximum expected return;
  • based on the maximum expected return, the trainer observes and evaluates the behavior of the TAMER agent and rewards it.
  • An information data processing terminal applied to the interactive reinforcement learning method combining TAMER framework and facial expression feedback.
  • compared with the prior art, the present invention has the following advantages: it introduces the human user's facial expression feedback into the TAMER framework, and the human user can provide feedback through the keyboard or other interactive interfaces to train the TAMER agent. After the agent learns an initial strategy, the human user adjusts the agent's behavior through facial expression feedback. This process reduces the cognitive burden of the human user and frees the human user from the heavy feedback task. The method supplements the existing interactive machine learning methods and helps to further improve the interaction efficiency between the agent and the human user.
  • the interactive reinforcement learning method of the present invention combining the TAMER framework and facial expression feedback can reduce the cognitive burden of the human user in the process of training the agent, so that the agent can better understand human preferences, and can effectively learn from human rewards.
  • Figure 5 shows the number of time steps required for each Episode through keyboard feedback training and facial expression feedback training during the training process.
  • the histogram shows the average value and standard deviation of each Episode, and the table shows the average value;
  • Figure 6 shows, for each Episode during the training process, the amount of feedback required by keyboard feedback training and by training with facial expression feedback introduced.
  • the histogram shows the average value and standard deviation of each Episode, and the table shows the average value.
  • FIG. 1 is a flowchart of an interactive reinforcement learning method combining TAMER framework and facial expression feedback provided by an embodiment of the present invention
  • FIG. 2 is an implementation flowchart of an interactive reinforcement learning method combining TAMER framework and facial expression feedback provided by an embodiment of the present invention
  • Fig. 3 is a screenshot of an example of a training interface and a Grid World environment task provided by an embodiment of the present invention
  • FIG. 4 is a schematic block diagram of an agent interaction reinforcement learning combining TAMER framework and facial expression feedback provided by an embodiment of the present invention
  • FIG. 5 is a comparison diagram of required time steps for keyboard feedback training and facial expression feedback training provided by an embodiment of the present invention
  • Fig. 6 is a comparison diagram of the number of feedbacks required by keyboard feedback training and the introduction of facial expression feedback training provided by an embodiment of the present invention.
  • the present invention provides an interactive reinforcement learning method combining the TAMER framework and facial expression feedback.
  • the present invention will be described in detail below with reference to the accompanying drawings.
  • the interactive reinforcement learning method combining TAMER framework and facial expression feedback includes the following steps:
  • S101: Face Valuing-TAMER allows human trainers to first train the agent under the TAMER framework; the agent chooses actions according to the current state.
  • S102: The human trainer observes and provides explicit feedback as a reward signal through keyboard keys and other interfaces.
  • S103: The reward function and the value function are updated.
  • S104: The behavior strategy of the agent is updated.
  • S105: The agent obtains an initial executable strategy through keyboard feedback learning.
  • S106: The human trainer provides rewards through facial expression feedback to adjust the behavior of the agent, and the adjusted strategy is checked for whether it has reached a satisfactory state; if it has, the process ends, and if not, the strategy is adjusted again through facial expression feedback.
  • the algorithm for the agent to learn from human feedback includes:
  • TAMER learns a value function through a predictive reward model learned from human feedback:
  • $R_{t+i}$ is the reward obtained by the TAMER agent performing action a in state s at time t+i
  • $G_t$ is the expected return at time t, which is defined as the discounted sum of rewards after time t
  • $v_\pi(s)$ is the state value function corresponding to each strategy $\pi$, which maps each state $s \in S$ to the expected return $G_t$ of the state by following the strategy $\pi$;
  • $q_\pi(s, a)$ is the action value function corresponding to each strategy $\pi$, which provides the expected return $G_t$ by following the strategy $\pi$ and executing action a in the state s.
  • when a given task requires prediction, the state value function is very important; conversely, if the given task requires control, it is very important to use the action value function $q_\pi(s, a)$; human trainers can provide reward feedback through keyboard keys or facial expressions to adjust the behavior of the agent.
  • the present invention introduces facial expression recognition feedback on the basis of the TAMER framework, which is a typical method for an agent to learn from human rewards. The hypothesis is that, once the TAMER agent has learned an initial strategy through keyboard feedback, adjusting it through facial expression feedback needs less explicit feedback than learning from keyboard feedback alone. The algorithm was tested in the Grid World task domain and compared with agent learning through the TAMER framework using different discount factors on human rewards. The results show that, although training the agent directly through facial expression feedback cannot quickly obtain an effective strategy, it can capture the facial features of human users in real time and adjust the agent's strategy online according to user preferences without changing the model.
  • Figure 4 is a schematic block diagram of agent interactive reinforcement learning combining the TAMER framework and facial expression feedback.
  • the TAMER framework is constructed for a variant of the Markov decision process, which is a model of sequential decision-making, which is solved through dynamic programming and reinforcement learning.
  • an agent learns in an MDP without a clearly defined reward function and instead learns a reward model; this setting is denoted MDP\R.
  • the TAMER agent learns from the real-time evaluation of its behavior by human trainers.
  • the agent interprets this evaluation as a human reward, creates a predictive model, and selects the behavior that it predicts will receive the most human rewards. It strives to maximize the immediate rewards caused by behavior, which is in sharp contrast with traditional reinforcement learning. In traditional reinforcement learning, the agent seeks the largest future reward.
  • first, human rewards can be delivered with a small delay, which is the time the trainer takes to evaluate the behavior of the agent and deliver feedback.
  • second, the assessment provided by the human trainer judges the behavior itself and takes into account a model of its long-term consequences.
  • a TAMER agent learns a reward model that approximates the human reward expected in the current state and action. Given a state s, the agent myopically chooses the action with the largest expected reward. The trainer can observe and evaluate the behavior of the agent and give rewards.
  • TAMER feedback is given through keyboard input and is attributed to the agent’s recent actions.
  • Each feedback button press is marked as a scalar reward signal (-1 or +1). This signal can also be enhanced by pressing the button multiple times.
  • the label of a sample is the delay-weighted total reward, which is calculated from the probability that a human reward signal targets a specific time step.
  • the TAMER learning algorithm continuously repeats the process of taking actions, perceiving rewards, and updating the learned model.
  • VI-TAMER is a TAMER variant that lets the agent learn from non-myopic human rewards.
  • in VI-TAMER the agent learns from discounted human rewards using a planning algorithm, value iteration.
  • a VI-TAMER agent learns its value function and applies it to the reward function most recently updated by TAMER, and uses the value function to choose the next action.
  • in reinforcement learning (RL), the discount factor $\gamma$ ($0 \le \gamma \le 1$) determines how far the agent can look into the future. Since the discount factor $\gamma$ of the original TAMER is 0 (myopic), it can be regarded as a special case of VI-TAMER. Therefore, in the present invention, from now on, TAMER is used as a general method for agents to learn from human rewards, and $\gamma_{TAMER}$ denotes the discount factor for human rewards.
  • the Grid World task contains 30 states. In each state, the agent's movement can be selected from four actions in the action space: move up, down, left or right. The agent cannot pass through the wall, and attempts to pass through the wall will not change the current state of the agent.
  • the task performance index is the time step required to reach the target position from the initial position, that is, the number of actions. As shown in the middle of the screenshot of Figure 3, the small dark gray square is the agent, and the cross indicates the direction of the agent's next movement. In this task, the agent tries to learn a strategy so that it can reach the goal state and minimize the number of time steps. The optimal strategy from the starting state to the target state requires 20 time steps.
  • the current position of the agent is the initial state of the agent, and the target state is the position where the elliptical square in the upper right corner is located.
  • the black lines and the light gray squares both indicate the fence, and the agent cannot directly pass through this area.
  • a radial basis function effectively creates a pseudo-table centered on each square cell of Grid World, which can be slightly generalized among nearby cells.
  • with Face Valuing-TAMER, feedback is first provided through the keyboard to train the agent to obtain an initial strategy, and the user's facial expression feedback is then used to adjust the strategy obtained by the agent. The analysis is expected to use the average data collected in 20 experiments to test the performance of the proposed method.
  • the planned discount factors on human rewards are $\gamma_{TAMER}$ = 0, 0.2, 0.5, 0.8, 1. It should be noted that in the TAMER module of Face Valuing-TAMER and the TAMER framework, the value of $\gamma_{TAMER}$ is the same.
  • the hypothesis is that using Face Valuing-TAMER to train an agent requires less feedback, especially explicit feedback, than using the TAMER framework to train an agent.
  • the number of time steps in which feedback is received can be counted to compare, for Face Valuing-TAMER and TAMER under different discount factors, the number of time steps with total feedback, positive feedback, and negative feedback. It is expected that the Face Valuing-TAMER agent will receive much less feedback than the TAMER agent. Such a result would show that providing evaluative feedback in the form of facial expressions can reduce the amount of feedback needed to train the agent and effectively reduce the cognitive burden of humans training agent behavior.
  • the research results show that, compared with learning from the TAMER framework, facial expression feedback does not effectively reduce the amount of explicit feedback required (because the current facial expression recognition accuracy is only a little over 60%), but it can still obtain the same optimal strategy as learning from keyboard feedback. Further improving the accuracy of facial expression recognition can effectively reduce the amount of explicit feedback required to obtain an optimal strategy.
  • the total number of time steps required to train the agent to obtain the optimal strategy can be used as the performance measurement in the experiment.
  • the experiment compares the total number of time steps required to train the Face Valuing-TAMER and TAMER agents to obtain the best strategy using different discount factors on human rewards. It is expected that the total time step required to train a Face Valuing-TAMER agent is much less than that of the TAMER agent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

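An interactive reinforcement learning method combining the TAMER framework and facial expression feedback. The method includes: combining the TAMER framework and facial expression evaluation to form a Face Valuing-TAMER agent; the Face Valuing-TAMER agent anticipates future rewards by learning a value function from human feedback. The human trainer first trains the agent under the TAMER framework, providing reward signals through keyboard key feedback so that the agent obtains an initial executable policy; the human trainer is then allowed to provide rewards through facial expression feedback to adjust the agent's behavior. The interactive reinforcement learning method based on facial expression feedback can reduce the cognitive burden on human users while they train the agent, help the agent better understand human preferences, and enable it to learn effectively from human rewards.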

Description

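Interactive reinforcement learning method combining the TAMER framework and facial expression feedback
This application claims priority to Chinese patent application No. 201910967991.3, filed with the Chinese Patent Office on October 12, 2019 and entitled "An interactive reinforcement learning method combining the TAMER framework and facial expression feedback", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of artificial intelligence technology, and in particular to an interactive reinforcement learning method combining the TAMER framework and facial expression feedback.
Background
At present, the closest prior art is as follows:
Human-centered reinforcement learning has received wide attention in recent years. Reinforcement learning from human feedback has proven to be a very effective approach: it allows non-technical people to guide a TAMER agent in performing tasks, and the advantages of learning from human feedback are making the approach increasingly common in real-life applications. Learning from human feedback requires continual trial and error. When the TAMER agent performs a correct action, the human user can encourage it with positive feedback; when it performs an incorrect action, the user must give negative feedback to punish it and tell it to try other actions. This carries some risk, because the TAMER agent may then perform an even worse action. The human user then has to provide more feedback to guide the TAMER agent to correct the model it has learned and relearn the correct behavior policy, which increases the agent's learning cost and places more burden on the human.
An important application of interactive reinforcement learning is to extend or augment human cognitive and physical abilities. TAMER is a typical interactive reinforcement learning method. In TAMER, the system learns a predictive model of the human user's reward, and this model can successfully train a TAMER agent even when human rewards are delayed or inconsistent. Through interactive learning, the TAMER agent can understand the intentions of human users and adapt to their preferences. In most current research, human users' preferences are conveyed through explicit instructions or costly corrective feedback, for example predefined words or sentences, buttons, or mouse clicks; in practical applications, however, these feedback methods increase the cognitive load on human users.
Therefore, there is an urgent need for an interactive reinforcement learning method based on facial expression feedback that allows the trainer to provide feedback without a large number of costly interactions, and that can be transferred to new or changing scenarios without retraining the TAMER agent.
In summary, the problem with the prior art is that adjusting the TAMER agent's behavior through explicit forms of feedback such as predefined keyboard feedback increases the cognitive burden on human users, and policy updates require a large number of interactions, which increases the learning cost.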
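Difficulties in solving the above technical problem:
1. How to introduce facial expression recognition into the TAMER framework, replacing explicit feedback interfaces such as the keyboard with facial expressions;
2. How to effectively combine complex human facial expressions with reward signals so as to provide effective feedback for the TAMER agent's learning.
Significance of solving the above technical problem:
1. The direct significance is to reduce the amount of explicit feedback needed while training the TAMER agent and to lower the cognitive burden on human users;
2. A direct communication channel is established between the human user and the TAMER agent; without prior skills training, a human user can obtain a TAMER agent that quickly adapts to their preferences;
3. In particular, for people with physical disabilities who find it inconvenient to interact physically with the TAMER agent, the introduction of facial expression feedback provides convenience.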
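Summary of the Invention
The object of the present invention is to provide an interactive reinforcement learning method combining the TAMER framework and facial expression feedback, in which learning on the TAMER framework combines explicit feedback and facial expression feedback.
To achieve the above object, the technical solution of the present invention is an interactive reinforcement learning method combining the TAMER framework and facial expression feedback, including:
combining the TAMER framework and facial expression evaluation to form a Face Valuing-TAMER agent; the Face Valuing-TAMER agent anticipates future rewards by learning a value function from human feedback;
combining the TAMER framework and facial expression evaluation to form the Face Valuing-TAMER agent specifically means: a trainer trains a TAMER agent under the TAMER framework, determines a keyboard reward signal through keyboard key feedback, and trains the TAMER agent to obtain an initial executable policy;
based on the initial executable policy, the trainer determines a facial reward signal through facial expression feedback so as to adjust the behavior policy of the TAMER agent.
Optionally, the trainer training the TAMER agent under the TAMER framework, determining the keyboard reward signal through keyboard key feedback, and training the TAMER agent to obtain an initial executable policy specifically includes:
the trainer observes the current action of the TAMER agent and gives feedback through a keyboard interface; a keyboard feedback signal is obtained, and the keyboard reward signal is determined from the keyboard feedback signal;
the initial executable policy is determined from the keyboard feedback signal and the keyboard reward signal.
Optionally, the trainer determining the facial reward signal through facial expression feedback, based on the initial executable policy, so as to adjust the behavior policy of the TAMER agent specifically includes:
updating the value function according to the keyboard reward signal to determine an updated value function; the updated value function includes a state value function and an action value function;
updating the behavior policy of the TAMER agent according to the reward function;
based on the initial executable policy, the trainer gives facial expression feedback; a facial feedback signal is obtained, and the facial reward signal is determined from the facial feedback signal;
adjusting the behavior policy of the TAMER agent according to the facial reward signal.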
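Optionally, the updated value function is:
$$G_t = \sum_{i \geq 1} \gamma^{i-1} R_{t+i}$$ (Figure PCTCN2020108156-appb-000001)
$$v_\pi(s) = \mathbb{E}\{G_t \mid S_t = s, \pi\}$$
$$q_\pi(s, a) = \mathbb{E}\{G_t \mid S_t = s, A_t = a, \pi\}$$
where $G_t$ is the expected return at any time t, i.e. the discounted sum of rewards at any time t; i is the step difference between step t+i and step t; $\gamma$ is the discount factor, and $\gamma^{i-1}$ is the discount applied to the reward obtained at step t+i; $R_{t+i}$ is the reward obtained by the TAMER agent for executing action a in state s at time t+i; $v_\pi(s)$ is the state value function corresponding to each behavior policy $\pi$, which maps each state s to the expected return $G_t$ of that state when following $\pi$, $s \in S_t$; $q_\pi(s, a)$ is the action value function corresponding to each behavior policy $\pi$, which gives the expected return for executing action a in state s while following $\pi$, $a \in A_t$; $\mathbb{E}$ denotes taking the expectation of the return obtained.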
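The discounted sum can be illustrated with a short sketch. The function name `discounted_return` and the use of a finite reward list are assumptions made for illustration; the source only defines $G_t$ as the discounted sum of the rewards $R_{t+i}$ with discount $\gamma^{i-1}$.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards = [R_{t+1}, R_{t+2}, ...] (a finite list)."""
    g = 0.0
    # Work backwards so each step applies G = R + gamma * G_next,
    # which expands to the sum of gamma**(i-1) * R_{t+i}.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


# With gamma = 0 only the immediate reward R_{t+1} counts.
print(discounted_return([1.0, -1.0, 1.0], 0.0))
# With gamma = 0.5 later rewards are weighted by 0.5, 0.25, ...
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

The backward recursion avoids computing powers of the discount factor explicitly.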
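Optionally, the Face Valuing-TAMER agent anticipating future rewards by learning a value function from human feedback specifically includes:
a TAMER agent learns a reward model
Figure PCTCN2020108156-appb-000002
whose behavior is defined as the human reward expected in the current state and action:
Figure PCTCN2020108156-appb-000003
is the reward signal the agent receives after taking action $A_t$ in any state $S_t$;
given a state s, the TAMER agent chooses the maximum expected return:
Figure PCTCN2020108156-appb-000004
based on the maximum expected return, the trainer observes and evaluates the behavior of the TAMER agent and gives rewards.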
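A minimal sketch of this select-then-update cycle follows, under stated assumptions: the class `TabularRewardModel`, its `step_size` parameter and the incremental update rule are hypothetical simplifications (the embodiment described later uses a linear model over radial basis functions), and the equations shown as figures in the source are not reproduced here.

```python
from collections import defaultdict


class TabularRewardModel:
    """Hypothetical tabular stand-in for the learned human reward model."""

    def __init__(self, actions, step_size=0.5):
        self.actions = list(actions)
        self.step_size = step_size
        self.estimates = defaultdict(float)  # (state, action) -> predicted reward

    def predict(self, state, action):
        return self.estimates[(state, action)]

    def update(self, state, action, human_reward):
        # Move the prediction part of the way toward the observed human reward.
        error = human_reward - self.estimates[(state, action)]
        self.estimates[(state, action)] += self.step_size * error

    def greedy_action(self, state):
        # Pick the action with the largest predicted human reward in this state.
        return max(self.actions, key=lambda a: self.predict(state, a))


model = TabularRewardModel(["up", "down", "left", "right"])
model.update(0, "up", -1.0)     # trainer disapproved of "up" in state 0
model.update(0, "right", 1.0)   # trainer approved of "right" in state 0
print(model.greedy_action(0))
```

The table keeps the example short; any regressor that predicts a scalar from a state-action pair could take its place.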
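An information data processing terminal applying the interactive reinforcement learning method combining the TAMER framework and facial expression feedback.
Compared with the prior art, the advantages of the present invention are as follows. The invention introduces the human user's facial expression feedback into the TAMER framework; the human user can train the TAMER agent by providing feedback through the keyboard or other interactive interfaces. After the agent has learned an initial policy, the human user adjusts the agent's behavior through facial expression feedback. This process reduces the cognitive burden on the human user and frees the user from a heavy feedback task. The method supplements existing interactive machine learning methods and helps to further improve the efficiency of interaction between the agent and the human user.
The interactive reinforcement learning method of the present invention combining the TAMER framework and facial expression feedback can reduce the cognitive burden on human users while they train the agent, help the agent better understand human preferences, and enable it to learn effectively from human rewards. Figure 5 shows, for each episode during training, the number of time steps required with keyboard feedback training and with facial expression feedback introduced; the bar chart shows the mean and standard deviation for each episode, and the table shows the mean. Figure 6 shows, for each episode during training, the amount of feedback required with keyboard feedback training and with facial expression feedback introduced; the bar chart shows the mean and standard deviation for each episode, and the table shows the mean. The analysis of the experimental results in Figures 5 and 6 shows that, although introducing facial expression feedback did not reduce the amount of explicit feedback required (because current facial expression recognition accuracy is low, only a little over 60%), it still obtained the same optimal policy as learning from keyboard feedback, which lowers the training cost. Further improving expression recognition accuracy can effectively reduce the amount of explicit feedback needed to obtain an optimal policy.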
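Brief Description of the Drawings
The present invention is further described below with reference to the accompanying drawings:
FIG. 1 is a flowchart of the interactive reinforcement learning method combining the TAMER framework and facial expression feedback provided by an embodiment of the present invention;
FIG. 2 is an implementation flowchart of the interactive reinforcement learning method combining the TAMER framework and facial expression feedback provided by an embodiment of the present invention;
FIG. 3 is a screenshot of an example of the training interface and the Grid World environment task provided by an embodiment of the present invention;
FIG. 4 is a schematic block diagram of agent interactive reinforcement learning combining the TAMER framework and facial expression feedback provided by an embodiment of the present invention;
FIG. 5 is a comparison chart of the time steps required by keyboard feedback training and by training with facial expression feedback introduced, provided by an embodiment of the present invention;
FIG. 6 is a comparison chart of the amount of feedback required by keyboard feedback training and by training with facial expression feedback introduced, provided by an embodiment of the present invention.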
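Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention and not to limit it.
In view of the problems in the prior art, the present invention provides an interactive reinforcement learning method combining the TAMER framework and facial expression feedback, which is described in detail below with reference to the drawings.
As shown in FIG. 1 and FIG. 2, the interactive reinforcement learning method combining the TAMER framework and facial expression feedback provided by an embodiment of the present invention includes the following steps:
S101: Face Valuing-TAMER lets the human trainer first train the agent under the TAMER framework; the agent selects an action according to the current state.
S102: The human trainer observes and provides explicit feedback as a reward signal through an interface such as keyboard keys.
S103: The reward function and the value function are updated.
S104: The agent's behavior policy is updated.
S105: The agent obtains an initial executable policy by learning from keyboard feedback.
S106: The human trainer provides rewards through facial expression feedback to adjust the agent's behavior, and the adjusted policy is checked for whether it has reached a satisfactory state; if it has, the process ends; if not, the policy is adjusted again through facial expression feedback.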
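Steps S101 to S106 can be sketched as two phases that share one learned reward table. Everything named here is an assumption for illustration: `run_phase`, `train_face_valuing_tamer`, the `feedback(state, action)` callables (which return a scalar reward or `None` when the trainer stays silent) and the fixed step counts, which stand in for the "repeat until the trainer is satisfied" check in S106. How a facial expression is converted into a scalar is not specified in the source and is left to the caller.

```python
def run_phase(reward_table, actions, state, step, feedback, num_steps, step_size=0.5):
    """Run one training phase and return how many feedback signals were used."""
    used = 0
    for _ in range(num_steps):
        # S101: choose the action with the highest predicted human reward.
        action = max(actions, key=lambda a: reward_table.get((state, a), 0.0))
        # S102 / S106: the trainer may give a scalar reward, or None for no feedback.
        signal = feedback(state, action)
        if signal is not None:
            # S103: move the stored estimate toward the observed signal.
            old = reward_table.get((state, action), 0.0)
            reward_table[(state, action)] = old + step_size * (signal - old)
            used += 1
        # S104: the greedy policy changes implicitly because the table changed.
        state = step(state, action)
    return used


def train_face_valuing_tamer(actions, start, step, keyboard_feedback,
                             face_feedback, keyboard_steps, face_steps):
    reward_table = {}
    # S101-S105: the keyboard phase produces the initial executable policy.
    keyboard_used = run_phase(reward_table, actions, start, step,
                              keyboard_feedback, keyboard_steps)
    # S106: the facial expression phase adjusts the same model.
    face_used = run_phase(reward_table, actions, start, step,
                          face_feedback, face_steps)
    return reward_table, keyboard_used, face_used
```

Returning the two feedback counts separately mirrors the comparison the experiments make between explicit (keyboard) feedback and facial expression feedback.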
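Further, the algorithm by which the agent learns from human feedback includes:
TAMER learns a value function through a predictive reward model learned from human feedback:
$$G_t = \sum_{i \geq 1} \gamma^{i-1} R_{t+i}$$ (Figure PCTCN2020108156-appb-000005)
$$v_\pi(s) = \mathbb{E}\{G_t \mid S_t = s, \pi\}$$
$$q_\pi(s, a) = \mathbb{E}\{G_t \mid S_t = s, A_t = a, \pi\}$$
where $R_{t+i}$ is the reward obtained by the TAMER agent for executing action a in state s at time t+i; $G_t$ is the expected return at time t, defined as the discounted sum of the rewards after time t; $v_\pi(s)$ is the state value function corresponding to each policy $\pi$, which maps each state $s \in S$ to the expected return $G_t$ of that state when following $\pi$;
$q_\pi(s, a)$ is the action value function corresponding to each policy $\pi$, which gives the expected return $G_t$ for executing action a in state s while following $\pi$.
When a given task requires prediction, the state value function is very important; conversely, if the given task requires control, it is very important to use the action value function $q_\pi(s, a)$. The human trainer can provide reward feedback through keyboard keys or facial expressions to adjust the agent's behavior.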
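The technical solution of the present invention is further described below with reference to embodiments.
The present invention introduces facial expression recognition feedback on the basis of the TAMER framework, which is a typical method for an agent to learn from human rewards. The hypothesis is that, once the TAMER agent has learned an initial policy through keyboard feedback, adjusting it through facial expression feedback needs less explicit feedback than learning from keyboard feedback alone. The algorithm was tested in the Grid World task domain and compared with agent learning through the TAMER framework using different discount factors on human rewards. The results show that, although training the agent directly through facial expression feedback cannot quickly obtain an effective policy, it can capture the human user's facial features in real time and adjust the agent's policy online according to user preferences without changing the model. The experimental results also show that learning from the human user's facial expression feedback did not reduce the amount of explicit feedback required (because current facial expression recognition accuracy is low, only a little over 60%), but the same optimal policy as learning from keyboard feedback can still be obtained. FIG. 4 is a schematic block diagram of agent interactive reinforcement learning combining the TAMER framework and facial expression feedback.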
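A. TAMER framework
The TAMER framework is built for a variant of the Markov decision process (MDP), a model of sequential decision-making that is solved through dynamic programming and reinforcement learning. In the TAMER framework, an agent learns in an MDP without a clearly defined reward function and instead learns a reward model; this setting is denoted MDP\R.
The TAMER agent learns from a human trainer's real-time evaluation of its behavior. The agent interprets this evaluation as human reward, builds a predictive model of it, and selects the behavior it predicts will earn the most human reward. It strives to maximize the immediate reward elicited by its behavior, in sharp contrast to traditional reinforcement learning, in which the agent seeks the largest future reward. Two reasons explain why an agent can learn to perform a task from short-term rewards. First, human reward can be delivered with only a small delay, namely the time the trainer takes to evaluate the agent's behavior and deliver feedback. Second, the evaluation provided by the human trainer judges the behavior itself and takes a model of its long-term consequences into account.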
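A TAMER agent learns a reward model
Figure PCTCN2020108156-appb-000006
that approximates the human reward expected in the current state and action,
Figure PCTCN2020108156-appb-000007
Given a state s, the agent myopically chooses the largest expected reward,
Figure PCTCN2020108156-appb-000008
The trainer can observe and evaluate the agent's behavior and give rewards.
In TAMER, feedback is given through keyboard input and is attributed to the agent's most recent actions. Each press of a feedback button is recorded as a scalar reward signal (-1 or +1), and the signal can be strengthened by pressing the button several times. The label of a sample is the delay-weighted total reward, computed from the probability that a human reward signal targets a specific time step. The TAMER learning algorithm continually repeats this process: take an action, sense the reward, and update
Figure PCTCN2020108156-appb-000009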
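The delay-weighted label can be sketched as follows. The source does not give the delay distribution, so the uniform reaction delay between 0.2 and 0.8 seconds used here is purely an illustrative assumption, as are the function names `credit_weights` and `delay_weighted_labels` and the representation of time steps by start and end times.

```python
def credit_weights(step_starts, step_ends, press_time, min_delay=0.2, max_delay=0.8):
    """Probability that a key press targets each time step (assumed uniform delay)."""
    # The action being judged happened between max_delay and min_delay seconds ago.
    window_lo = press_time - max_delay
    window_hi = press_time - min_delay
    width = window_hi - window_lo
    weights = []
    for start, end in zip(step_starts, step_ends):
        overlap = max(0.0, min(end, window_hi) - max(start, window_lo))
        weights.append(overlap / width)
    return weights


def delay_weighted_labels(step_starts, step_ends, presses):
    """Sum each press's signal (-1 or +1) into the time steps it probably targets."""
    labels = [0.0] * len(step_starts)
    for press_time, signal in presses:
        for k, weight in enumerate(credit_weights(step_starts, step_ends, press_time)):
            labels[k] += weight * signal
    return labels


# Two +1 presses at t = 2.0 s both fall on the step that ran from 1.0 s to 2.0 s,
# so repeated presses strengthen the label for that step.
print(delay_weighted_labels([0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [(2.0, 1), (2.0, 1)]))
```

A press whose delay window straddles two time steps is split between them in proportion to the overlap.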
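Until recently, myopia was a feature of all algorithms that learn from human evaluative feedback, and it had empirical support. Recently, however, a TAMER variant called VI-TAMER has been proposed that helps the agent learn from non-myopic human rewards. In VI-TAMER, the agent learns from discounted human rewards using a planning algorithm, value iteration: a VI-TAMER agent learns its value function and applies it to the reward function most recently updated by TAMER
Figure PCTCN2020108156-appb-000010
and uses the value function to select the next action. In reinforcement learning (RL), the discount factor $\gamma$ ($0 \le \gamma \le 1$) determines how far into the future the agent can look. Because the discount factor $\gamma$ of the original TAMER is 0 (myopic), it can be regarded as a special case of VI-TAMER. Therefore, from here on, the present invention treats TAMER as the general method for an agent to learn from human rewards and uses $\gamma_{TAMER}$ to denote the discount factor on human rewards.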
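A minimal value iteration sketch over a learned reward function is given below. It assumes deterministic transitions, an absorbing goal state with value 0, and a fixed number of sweeps; the names `value_iteration`, `greedy_action`, `transition` and `reward` are hypothetical and not taken from the source.

```python
def value_iteration(states, actions, transition, reward, gamma, goal, sweeps=100):
    """Compute state values from a learned reward function by repeated backups."""
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            if s == goal:
                continue  # the goal is treated as absorbing with value 0
            values[s] = max(reward(s, a) + gamma * values[transition(s, a)]
                            for a in actions)
    return values


def greedy_action(state, actions, transition, reward, gamma, values):
    """Select the action with the best one-step lookahead value."""
    return max(actions,
               key=lambda a: reward(state, a) + gamma * values[transition(state, a)])


# A four-state corridor: the learned reward favours moving right toward state 3.
move = lambda s, a: min(3, s + 1) if a == "right" else max(0, s - 1)
learned_reward = lambda s, a: 1.0 if a == "right" else -1.0
v = value_iteration([0, 1, 2, 3], ["left", "right"], move, learned_reward, 0.5, goal=3)
print(v, greedy_action(0, ["left", "right"], move, learned_reward, 0.5, v))
```

Setting `gamma` to 0 reduces the backup to the immediate learned reward, which matches the description of the original TAMER as a special case.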
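B. Experimental verification:
To demonstrate the potential effectiveness of the proposed method, experiments were conducted in the Grid World task domain, which has discrete state and action spaces.
a. Grid World task
The Grid World task contains 30 states. In each state, the agent can choose among four actions in the action space: move up, down, left, or right. The agent cannot pass through walls, and an action that attempts to do so leaves the agent's current state unchanged. The task performance metric is the number of time steps, i.e. actions, needed to reach the goal position from the initial position. In the middle of the screenshot in FIG. 3, the small dark gray square is the agent, and the cross marks the direction of the agent's next move. In this task, the agent tries to learn a policy that reaches the goal state in as few time steps as possible. The optimal policy from the start state to the goal state takes 20 time steps. The agent's current position is its start state, and the goal state is the position of the elliptical block in the upper right corner. In the FIG. 3 screenshot, black lines and light gray squares both represent walls, which the agent cannot pass through directly.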
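The movement rule can be sketched as a small environment class. The 5 by 6 layout (which gives 30 cells) and the wall positions are assumptions, since the actual layout appears only in FIG. 3; the class `GridWorld` and the `MOVES` table are hypothetical names.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class GridWorld:
    def __init__(self, rows=5, cols=6, walls=()):
        self.rows = rows
        self.cols = cols
        # Each wall is an unordered pair of adjacent cells with a fence between them.
        self.walls = {frozenset(pair) for pair in walls}

    def states(self):
        return [(r, c) for r in range(self.rows) for c in range(self.cols)]

    def step(self, state, action):
        d_row, d_col = MOVES[action]
        nxt = (state[0] + d_row, state[1] + d_col)
        inside = 0 <= nxt[0] < self.rows and 0 <= nxt[1] < self.cols
        # Leaving the grid or crossing a wall leaves the state unchanged.
        if not inside or frozenset((state, nxt)) in self.walls:
            return state
        return nxt


env = GridWorld(walls=[((1, 0), (1, 1))])
print(len(env.states()), env.step((1, 0), "right"), env.step((1, 0), "down"))
```

Counting calls to `step` until the goal cell is reached gives the time-step metric described above.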
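b. Experimental setup
In the experiment, to observe whether an agent learning from the human user's facial expression feedback needs less explicit feedback, agent learning under the TAMER framework is to be compared with agent learning under the Face Valuing-TAMER framework for different discount factors $\gamma_{TAMER}$. In Face Valuing-TAMER, the TAMER module is identical to the TAMER framework, so $\gamma_{TAMER}$ applies to both. The only difference between TAMER and Face Valuing-TAMER is therefore whether the human user's facial expression feedback is introduced. A linear model over Gaussian radial basis functions is used to represent the TAMER human reward model $R_H$. TAMER's value function is likewise a linear function approximated with Gaussian radial basis functions.
One radial basis function is centered on each square cell of Grid World, which effectively creates a pseudo-table that generalizes slightly between nearby cells. Each radial basis function has width $\sigma^2 = 0.05$, where 1 is the distance to the nearest neighboring radial basis function center, and the linear model has an additional bias feature with constant value 0.1. Face Valuing-TAMER and TAMER agents are trained for all discount factors, with each agent trained 20 times per discount factor. In every trial of either method, the agent is trained until it obtains the optimal policy. With Face Valuing-TAMER, the agent is first trained with keyboard feedback to obtain an initial policy, and the policy is then adjusted through the user's facial expression feedback. The analysis is planned to use the average data collected over the 20 trials to test the performance of the proposed method.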
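The feature construction can be sketched as follows. The values $\sigma^2 = 0.05$ and the bias of 0.1 come from the text; the exact kernel form exp(-d² / (2σ²)), the 5 by 6 arrangement of centers, and the names `rbf_features` and `linear_reward` are assumptions made for illustration.

```python
import math


def rbf_features(cell, centers, sigma_sq=0.05, bias=0.1):
    """Gaussian RBF activations for one grid cell, plus a constant bias feature."""
    features = []
    for center in centers:
        dist_sq = (cell[0] - center[0]) ** 2 + (cell[1] - center[1]) ** 2
        # Assumed kernel form; neighbouring centers are at distance 1.
        features.append(math.exp(-dist_sq / (2.0 * sigma_sq)))
    features.append(bias)
    return features


def linear_reward(weights, features):
    """Linear model: predicted human reward is the dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


centers = [(r, c) for r in range(5) for c in range(6)]
phi = rbf_features((0, 0), centers)
# The cell's own center responds with 1.0; a neighbour at distance 1 responds
# with exp(-10), so the features behave almost like a table.
print(phi[0], phi[1], phi[-1])
```

With such a narrow width, each cell activates essentially one feature, which is why the text calls the representation a pseudo-table.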
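C. Experimental results:
Different discount factors on human rewards are planned: $\gamma_{TAMER}$ = 0, 0.2, 0.5, 0.8, 1. Note that the value of $\gamma_{TAMER}$ is the same in the TAMER module of Face Valuing-TAMER and in the TAMER framework.
a. Amount of feedback
The hypothesis is that training an agent with Face Valuing-TAMER needs less feedback, especially explicit feedback, than training it with the TAMER framework. To measure the amount of feedback given, the number of time steps in which feedback is received can be counted, comparing, for Face Valuing-TAMER and TAMER under different discount factors, the number of time steps in which the agent receives total, positive, and negative feedback. The Face Valuing-TAMER agent is expected to receive far less feedback than the TAMER agent; such a result would indicate that providing evaluative feedback to the agent in the form of facial expressions can reduce the amount of feedback needed to train the agent and effectively reduce the cognitive burden of humans training agent behavior. The results indicate that, compared with learning under the TAMER framework, facial expression feedback does not effectively reduce the amount of explicit feedback required (because current facial expression recognition accuracy is low, only a little over 60%), but the same optimal policy as learning from keyboard feedback can still be obtained. Further improving expression recognition accuracy can effectively reduce the amount of explicit feedback needed to obtain an optimal policy.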
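b. Performance
Because the task performance metric is based on the number of time steps needed to reach the goal in the Grid World domain, the total number of time steps needed to train the agent to the optimal policy can serve as the performance measure in the experiment. The experiment compares the total number of time steps needed to train Face Valuing-TAMER and TAMER agents to the optimal policy using different discount factors on human rewards. Training a Face Valuing-TAMER agent is expected to take far fewer total time steps than training a TAMER agent.
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.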

Claims (6)

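  1. An interactive reinforcement learning method combining the TAMER framework and facial expression feedback, characterized by including:
    combining the TAMER framework and facial expression evaluation to form a Face Valuing-TAMER agent; the Face Valuing-TAMER agent anticipates future rewards by learning a value function from human feedback;
    combining the TAMER framework and facial expression evaluation to form the Face Valuing-TAMER agent specifically means: a trainer trains a TAMER agent under the TAMER framework, determines a keyboard reward signal through keyboard key feedback, and trains the TAMER agent to obtain an initial executable policy;
    based on the initial executable policy, the trainer determines a facial reward signal through facial expression feedback so as to adjust the behavior policy of the TAMER agent.
  2. The interactive reinforcement learning method combining the TAMER framework and facial expression feedback according to claim 1, characterized in that the trainer training the TAMER agent under the TAMER framework, determining the keyboard reward signal through keyboard key feedback, and training the TAMER agent to obtain an initial executable policy specifically includes:
    the trainer observes the current action of the TAMER agent and gives feedback through a keyboard interface; a keyboard feedback signal is obtained, and the keyboard reward signal is determined from the keyboard feedback signal;
    the initial executable policy is determined from the keyboard feedback signal and the keyboard reward signal.
  3. The interactive reinforcement learning method combining the TAMER framework and facial expression feedback according to claim 2, characterized in that the trainer determining the facial reward signal through facial expression feedback, based on the initial executable policy, so as to adjust the behavior policy of the TAMER agent specifically includes:
    updating the value function according to the keyboard reward signal to determine an updated value function; the updated value function includes a state value function and an action value function;
    updating the behavior policy of the TAMER agent according to the reward function;
    based on the initial executable policy, the trainer gives facial expression feedback; a facial feedback signal is obtained, and the facial reward signal is determined from the facial feedback signal;
    adjusting the behavior policy of the TAMER agent according to the facial reward signal.
  4. The interactive reinforcement learning method combining the TAMER framework and facial expression feedback according to claim 3, characterized in that the updated value function is:
    $$G_t = \sum_{i \geq 1} \gamma^{i-1} R_{t+i}$$ (Figure PCTCN2020108156-appb-100001)
    $$v_\pi(s) = \mathbb{E}\{G_t \mid S_t = s, \pi\}$$
    $$q_\pi(s, a) = \mathbb{E}\{G_t \mid S_t = s, A_t = a, \pi\}$$
    where $G_t$ is the expected return at any time t, i.e. the discounted sum of rewards at any time t; i is the step difference between step t+i and step t; $\gamma$ is the discount factor, and $\gamma^{i-1}$ is the discount applied to the reward obtained at step t+i; $R_{t+i}$ is the reward obtained by the TAMER agent for executing action a in state s at time t+i; $v_\pi(s)$ is the state value function corresponding to each behavior policy $\pi$, which maps each state s to the expected return $G_t$ of that state when following $\pi$, $s \in S_t$; $q_\pi(s, a)$ is the action value function corresponding to each behavior policy $\pi$, which gives the expected return for executing action a in state s while following $\pi$, $a \in A_t$; $\mathbb{E}$ denotes taking the expectation of the return obtained.
  5. The interactive reinforcement learning method combining the TAMER framework and facial expression feedback according to claim 4, characterized in that the Face Valuing-TAMER agent anticipating future rewards by learning a value function from human feedback specifically includes:
    a TAMER agent learns a reward model
    Figure PCTCN2020108156-appb-100002
    whose behavior is defined as the human reward expected in the current state and action:
    Figure PCTCN2020108156-appb-100003
    is the reward signal the agent receives after taking action $A_t$ in any state $S_t$;
    given a state s, the TAMER agent chooses the maximum expected return:
    Figure PCTCN2020108156-appb-100004
    based on the maximum expected return, the trainer observes and evaluates the behavior of the TAMER agent and gives rewards.
  6. An information data processing terminal applying the interactive reinforcement learning method combining the TAMER framework and facial expression feedback according to any one of claims 1 to 5.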
PCT/CN2020/108156 2019-10-12 2020-08-10 Interactive reinforcement learning method combining the TAMER framework and facial expression feedback WO2021068638A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910967991.3A CN110826723A (zh) 2019-10-12 2019-10-12 An interactive reinforcement learning method combining the TAMER framework and facial expression feedback
CN201910967991.3 2019-10-12

Publications (1)

Publication Number Publication Date
WO2021068638A1 true WO2021068638A1 (zh) 2021-04-15

Family

ID=69548992

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/108156 WO2021068638A1 (zh) 2019-10-12 2020-08-10 Interactive reinforcement learning method combining the TAMER framework and facial expression feedback

Country Status (3)

Country Link
CN (1) CN110826723A (zh)
LU (1) LU500028B1 (zh)
WO (1) WO2021068638A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657583A (zh) * 2021-08-24 2021-11-16 广州市香港科大霍英东研究院 一种基于强化学习的大数据特征提取方法及系统
CN114003121A (zh) * 2021-09-30 2022-02-01 中国科学院计算技术研究所 数据中心服务器能效优化方法与装置、电子设备及存储介质
CN114371728A (zh) * 2021-12-14 2022-04-19 河南大学 一种基于多智能体协同优化的无人机资源调度方法
CN114710792A (zh) * 2022-03-30 2022-07-05 合肥工业大学 基于强化学习的5g配网分布式保护装置的优化布置方法
CN115250156A (zh) * 2021-09-09 2022-10-28 李枫 一种基于联邦学习的无线网络多信道频谱接入方法
CN115361717A (zh) * 2022-07-12 2022-11-18 华中科技大学 一种基于vr用户视点轨迹的毫米波接入点选择方法及系统
CN116307241A (zh) * 2023-04-04 2023-06-23 暨南大学 基于带约束多智能体强化学习的分布式作业车间调度方法

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826723A (zh) * 2019-10-12 2020-02-21 中国海洋大学 一种结合tamer框架和面部表情反馈的交互强化学习方法
CN114118434A (zh) * 2020-08-27 2022-03-01 朱宝 智能机器人及其学习方法
CN112859591B (zh) * 2020-12-23 2022-10-21 华电电力科学研究院有限公司 一种面向能源系统运行优化的强化学习控制系统
CN112818672A (zh) * 2021-01-26 2021-05-18 山西三友和智慧信息技术股份有限公司 一种基于文本游戏的强化学习情感分析系统
CN114355786A (zh) * 2022-01-17 2022-04-15 北京三月雨文化传播有限责任公司 基于大数据的多媒体数字化展厅的调控云系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
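CN105105771A (zh) * 2015-08-07 2015-12-02 北京环度智慧智能技术研究所有限公司 Cognitive index analysis method for potential value tests
CN105759677A (zh) * 2015-03-30 2016-07-13 公安部第研究所 Multimodal behavior analysis and monitoring system and method suitable for visual terminal operation posts
US20190179893A1 (en) * 2017-12-08 2019-06-13 General Electric Company Systems and methods for learning to extract relations from text via user feedback
CN110070185A (zh) * 2019-04-09 2019-07-30 中国海洋大学 Method for interactive reinforcement learning from demonstrations and human evaluative feedback
CN110826723A (zh) * 2019-10-12 2020-02-21 中国海洋大学 An interactive reinforcement learning method combining the TAMER framework and facial expression feedback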

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
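CN109978012A (zh) * 2019-03-05 2019-07-05 北京工业大学 Improved Bayesian inverse reinforcement learning method based on combined feedback
CN110070188B (zh) * 2019-04-30 2021-03-30 山东大学 Incremental cognitive development system and method incorporating interactive reinforcement learning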

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
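CN105759677A (zh) * 2015-03-30 2016-07-13 公安部第研究所 Multimodal behavior analysis and monitoring system and method suitable for visual terminal operation posts
CN105105771A (zh) * 2015-08-07 2015-12-02 北京环度智慧智能技术研究所有限公司 Cognitive index analysis method for potential value tests
US20190179893A1 (en) * 2017-12-08 2019-06-13 General Electric Company Systems and methods for learning to extract relations from text via user feedback
CN110070185A (zh) * 2019-04-09 2019-07-30 中国海洋大学 Method for interactive reinforcement learning from demonstrations and human evaluative feedback
CN110826723A (zh) * 2019-10-12 2020-02-21 中国海洋大学 An interactive reinforcement learning method combining the TAMER framework and facial expression feedback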

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
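CN113657583A (zh) * 2021-08-24 2021-11-16 广州市香港科大霍英东研究院 Big data feature extraction method and system based on reinforcement learning
CN115250156A (zh) * 2021-09-09 2022-10-28 李枫 Multi-channel spectrum access method for wireless networks based on federated learning
CN114003121A (zh) * 2021-09-30 2022-02-01 中国科学院计算技术研究所 Data center server energy efficiency optimization method and apparatus, electronic device, and storage medium
CN114003121B (zh) * 2021-09-30 2023-10-31 中国科学院计算技术研究所 Data center server energy efficiency optimization method and apparatus, electronic device, and storage medium
CN114371728A (zh) * 2021-12-14 2022-04-19 河南大学 UAV resource scheduling method based on multi-agent collaborative optimization
CN114371728B (zh) * 2021-12-14 2023-06-30 河南大学 UAV resource scheduling method based on multi-agent collaborative optimization
CN114710792A (zh) * 2022-03-30 2022-07-05 合肥工业大学 Reinforcement-learning-based optimal placement method for distributed protection devices in a 5G distribution network
CN114710792B (zh) * 2022-03-30 2024-09-06 合肥工业大学 Reinforcement-learning-based optimal placement method for distributed protection devices in a 5G distribution network
CN115361717A (zh) * 2022-07-12 2022-11-18 华中科技大学 Millimeter-wave access point selection method and system based on VR user viewpoint trajectory
CN115361717B (zh) * 2022-07-12 2024-04-19 华中科技大学 Millimeter-wave access point selection method and system based on VR user viewpoint trajectory
CN116307241A (zh) * 2023-04-04 2023-06-23 暨南大学 Distributed job shop scheduling method based on constrained multi-agent reinforcement learning
CN116307241B (zh) * 2023-04-04 2024-01-05 暨南大学 Distributed job shop scheduling method based on constrained multi-agent reinforcement learning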

Also Published As

Publication number Publication date
LU500028B1 (en) 2021-04-23
CN110826723A (zh) 2020-02-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20874225

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20874225

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05.06.2023)