WO2021068638A1 - Interactive reinforcement learning method combining TAMER framework and facial expression feedback - Google Patents
Interactive reinforcement learning method combining TAMER framework and facial expression feedback
- Publication number
- WO2021068638A1 (PCT/CN2020/108156; CN2020108156W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- tamer
- agent
- feedback
- reward
- facial expression
- Prior art date
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N20/00—Machine learning
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F18/00—Pattern recognition
        - G06F18/20—Analysing
          - G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
            - G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
              - G06F18/2178—Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
        - G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
          - G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
      - G06V10/00—Arrangements for image or video recognition or understanding
        - G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
          - G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
      - G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
        - G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
          - G06V40/16—Human faces, e.g. facial parts, sketches or expressions
            - G06V40/174—Facial expression recognition
Definitions
- the present invention relates to the field of artificial intelligence technology, and in particular to an interactive reinforcement learning method combining the TAMER framework and facial expression feedback.
- TAMER is a typical interactive reinforcement learning method.
- the system can learn a prediction model of human user rewards. This model can successfully train TAMER agents even when human rewards are delayed or inconsistent.
- TAMER agents can understand the intentions of human users and adapt to their preferences.
- human user preferences are conveyed through clear instructions or expensive corrective feedback, such as through predefined words or sentences, buttons, mouse clicks, etc.
- these feedback methods increase the cognitive load of human users.
- the problem in the prior art is that adjusting the behavior of the TAMER agent through explicit feedback forms such as predefined keyboard feedback increases the cognitive burden of human users, and strategy updates require a large number of interactions, which increases the cost of learning.
- the direct significance is to reduce the amount of explicit feedback required during TAMER agent training and to reduce the cognitive burden of human users;
- the purpose of the present invention is to provide an interactive reinforcement learning method that combines the TAMER framework and facial expression feedback, in which explicit feedback and facial expression feedback are combined for learning under the TAMER framework.
- an interactive reinforcement learning method combining the TAMER framework and facial expression feedback includes:
- combining the TAMER framework and facial expression evaluation to form a Face Valuing-TAMER agent; the Face Valuing-TAMER agent anticipates future rewards by learning a value function from human feedback;
- the combination of the TAMER framework and facial expression evaluation to form the Face Valuing-TAMER agent is specifically: the trainer trains the TAMER agent under the TAMER framework, determines a keyboard reward signal through keyboard key feedback, and trains the TAMER agent to obtain an initial executable strategy;
- the trainer determines the facial reward signal through facial expression feedback to adjust the behavior strategy of the TAMER agent.
- the trainer trains the TAMER agent under the TAMER framework, determines keyboard reward signals through keyboard key feedback, and trains the TAMER agent to obtain an initial executable strategy, which specifically includes:
- the trainer observes the current action of the TAMER agent and provides feedback through a keyboard interface; a keyboard feedback signal is obtained, and a keyboard reward signal is determined according to the keyboard feedback signal;
- An initial executable strategy is determined according to the keyboard feedback signal and the keyboard reward signal.
- enabling the trainer to determine the facial reward signal through facial expression feedback to adjust the behavior strategy of the TAMER agent specifically includes:
- the updated value function includes a state value function and an action value function;
- the trainer obtains a facial feedback signal through facial expression feedback, and determines a facial reward signal according to the facial feedback signal;
- the updated value functions are: v_π(s) = E{G_t | S_t = s, π} and q_π(s, a) = E{G_t | S_t = s, A_t = a, π}, with the symbols defined below;
- the Face Valuing-TAMER agent predicts future rewards by learning a value function from human feedback, which specifically includes:
- a TAMER agent learns a reward model;
- the reward model is defined as the expected human reward for the current state and action: at time t the agent is in state S_t, takes action A_t, and the reward is the signal received after taking action A_t in state S_t;
- the TAMER agent chooses the action with the maximum expected return;
- based on the maximum expected return, the trainer observes and evaluates the behavior of the TAMER agent and gives it a reward.
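The reward-model learning and reward-maximizing action selection described above can be sketched as follows. This is a minimal illustrative sketch, not the patent's specified implementation: the linear model over state-action features, the feature function, and the learning rate are assumptions.

```python
import numpy as np

class HumanRewardModel:
    """Sketch of a TAMER-style human reward model: the expected human reward
    for a state-action pair, assumed here to be linear in the features."""

    def __init__(self, n_features, learning_rate=0.1):
        self.w = np.zeros(n_features)
        self.lr = learning_rate

    def predict(self, features):
        # Predicted human reward for one state-action feature vector.
        return float(self.w @ features)

    def update(self, features, human_reward):
        # One stochastic-gradient step toward the observed human reward.
        error = human_reward - self.predict(features)
        self.w += self.lr * error * features


def greedy_action(model, state, actions, feature_fn):
    """Choose the action whose predicted human reward is largest."""
    values = [model.predict(feature_fn(state, a)) for a in actions]
    return actions[int(np.argmax(values))]
```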
- An information data processing terminal applied to the interactive reinforcement learning method combining TAMER framework and facial expression feedback.
- compared with the prior art, the present invention has the following advantages: the present invention introduces the human user's facial expression feedback into the TAMER framework. The human user can provide feedback through the keyboard or other interactive interfaces to train the TAMER agent; after the agent learns an initial strategy, the human user adjusts the agent's behavior through facial expression feedback. This process reduces the cognitive burden of the human user and frees the human user from the heavy feedback task. The method supplements existing interactive machine learning methods and helps to further improve the interaction efficiency between the agent and the human user.
- the interactive reinforcement learning method of the present invention combining the TAMER framework and facial expression feedback can reduce the cognitive burden of the human user in the process of training the agent, so that the agent can better understand human preferences, and can effectively learn from human rewards.
- Figure 5 shows the number of time steps required for each Episode through keyboard feedback training and facial expression feedback training during the training process.
- the histogram shows the average value and standard deviation of each Episode, and the table shows the average value;
- Figure 6 shows, for each Episode during the training process, the number of feedbacks required by keyboard feedback training and by training with facial expression feedback introduced.
- the histogram shows the average value and standard deviation of each Episode, and the table shows the average value.
- FIG. 1 is a flowchart of an interactive reinforcement learning method combining TAMER framework and facial expression feedback provided by an embodiment of the present invention
- FIG. 2 is an implementation flowchart of an interactive reinforcement learning method combining TAMER framework and facial expression feedback provided by an embodiment of the present invention
- Fig. 3 is a screenshot of an example of the training interface and the Grid World environment task provided by an embodiment of the present invention
- FIG. 4 is a schematic block diagram of an agent interaction reinforcement learning combining TAMER framework and facial expression feedback provided by an embodiment of the present invention
- FIG. 5 is a comparison diagram of required time steps for keyboard feedback training and facial expression feedback training provided by an embodiment of the present invention
- Fig. 6 is a comparison diagram of the number of feedbacks required by keyboard feedback training and the introduction of facial expression feedback training provided by an embodiment of the present invention.
- the present invention provides an interactive reinforcement learning method combining the TAMER framework and facial expression feedback.
- the present invention will be described in detail below with reference to the accompanying drawings.
- the interactive reinforcement learning method combining TAMER framework and facial expression feedback includes the following steps:
- S101 Face Valuing-TAMER allows human trainers to first train the agent under the TAMER framework; the agent chooses actions according to the current state.
- S102 The human trainer observes the agent and provides explicit feedback as a reward signal through keyboard keys and other interfaces.
- S105 The agent obtains an initial executable strategy through keyboard feedback learning.
- S106 The human trainer provides rewards through facial expression feedback to adjust the behavior of the agent, and checks whether the adjusted strategy reaches a satisfactory state; if satisfied, training ends; if not, the strategy is adjusted again through facial expression feedback.
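The two-phase flow S101 to S106 can be summarized in the following sketch, which reuses the HumanRewardModel and greedy_action from the earlier sketch. The environment interface (reset/step returning a state and a done flag) and the two feedback callbacks returning a scalar reward or None are assumptions for illustration only.

```python
def run_episode(env, reward_model, feature_fn, get_reward):
    """One episode: act greedily, collect human reward, update the model."""
    state, done = env.reset(), False
    while not done:
        action = greedy_action(reward_model, state, env.actions, feature_fn)
        features = feature_fn(state, action)
        state, done = env.step(action)
        reward = get_reward()  # scalar human reward, or None if no feedback
        if reward is not None:
            reward_model.update(features, reward)


def train_face_valuing_tamer(env, reward_model, feature_fn,
                             get_keyboard_reward, get_face_reward,
                             keyboard_episodes, satisfied):
    """Two-phase sketch: keyboard feedback first, then facial expressions."""
    # Phase 1 (S101-S105): learn an initial executable strategy
    # from explicit keyboard feedback.
    for _ in range(keyboard_episodes):
        run_episode(env, reward_model, feature_fn, get_keyboard_reward)

    # Phase 2 (S106): adjust the learned strategy through facial
    # expression feedback until the trainer is satisfied.
    while not satisfied():
        run_episode(env, reward_model, feature_fn, get_face_reward)
```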
- the algorithm for the agent to learn from human feedback includes:
- TAMER learns a value function through a predictive reward model learned from human feedback:
- R_{t+i} is the reward obtained by the TAMER agent for performing action a in state s at time t+i;
- G_t is the expected return at time t, defined as the discounted sum of rewards after time t, G_t = Σ_{i≥1} γ^(i-1) R_{t+i}, where γ is the discount factor;
- v_π(s) is the state value function corresponding to each strategy π, which maps each state s ∈ S to the expected return G_t obtained by following strategy π from that state;
- q_π(s, a) is the action value function corresponding to each strategy π, which gives the expected return G_t obtained by executing action a in state s and thereafter following strategy π.
- if the given task only requires evaluating states (prediction), the state value function v_π(s) is very important; conversely, if the given task needs to be controlled, the action value function q_π(s, a) is very important. Human trainers can provide reward feedback through keyboard keys or facial expressions to adjust the behavior of the agent.
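For reference, the return and the two value functions discussed above, written out as defined in claim 4:

```latex
G_t = \sum_{i \ge 1} \gamma^{\,i-1} R_{t+i}, \qquad
v_\pi(s) = \mathbb{E}\{\, G_t \mid S_t = s,\ \pi \,\}, \qquad
q_\pi(s,a) = \mathbb{E}\{\, G_t \mid S_t = s,\ A_t = a,\ \pi \,\}
```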
- the present invention introduces facial expression recognition feedback on the basis of the TAMER framework, a typical method for an agent to learn from human rewards. Assuming that the TAMER agent learns an initial strategy through keyboard feedback, the amount of explicit feedback needed for adjustment through facial expression feedback is less than the amount of feedback the agent needs when learning from keyboard feedback alone. The algorithm is tested in the Grid World task domain and compared with agents learning through the TAMER framework under different discount factors on human rewards. The results show that although training the agent directly through facial expression feedback cannot quickly obtain an effective strategy, the method can capture the facial features of human users in real time and adjust the agent's strategy online according to user preferences without changing the model.
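One way a facial reward signal could be derived from an off-the-shelf expression classifier is sketched below. The expression label set, the confidence threshold, and the ±1 mapping (mirroring the keyboard reward signal described later in the text) are illustrative assumptions, not the patent's specified facial expression recognition pipeline.

```python
# Illustrative mapping from a recognized facial expression to a scalar reward.
# The expression labels and the confidence threshold are assumptions for this sketch.
POSITIVE_EXPRESSIONS = {"happy", "satisfied"}
NEGATIVE_EXPRESSIONS = {"angry", "disgusted", "frustrated"}

def face_reward(expression_label, confidence, threshold=0.6):
    """Return +1 / -1 for a confidently recognized expression, else None."""
    if confidence < threshold:
        return None  # ignore low-confidence recognitions
    if expression_label in POSITIVE_EXPRESSIONS:
        return 1.0
    if expression_label in NEGATIVE_EXPRESSIONS:
        return -1.0
    return None  # neutral or unrelated expressions give no feedback
```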
- Figure 4 is a schematic block diagram of agent interactive reinforcement learning combining the TAMER framework and facial expression feedback.
- the TAMER framework is built on a variant of the Markov decision process (MDP), a model of sequential decision-making that is solved through dynamic programming and reinforcement learning.
- the agent learns in an MDP without a clearly defined reward function, denoted MDP\R, and instead learns a reward model.
- the TAMER agent learns from the real-time evaluation of its behavior by human trainers.
- the agent interprets this evaluation as a human reward, creates a predictive model, and selects the behavior that it predicts will receive the most human reward. It strives to maximize the immediate reward caused by its behavior, which is in sharp contrast with traditional reinforcement learning, where the agent seeks the largest future reward.
- First, human rewards are delivered with a small delay; this delay is the time the trainer needs to evaluate the agent's behavior and deliver feedback.
- Second, the assessment provided by the human trainer is assumed to judge the behavior itself while taking its long-term consequences into account.
- a TAMER agent learns a reward model that approximates the human reward expected for the current state and action. Given a state s, the agent myopically chooses the action with the largest expected reward, and the trainer observes and evaluates the agent's behavior and gives rewards.
- TAMER feedback is given through keyboard input and is attributed to the agent’s recent actions.
- Each feedback button press is marked as a scalar reward signal (-1 or +1). This signal can also be enhanced by pressing the button multiple times.
- the label of each sample is the delay-weighted aggregate reward, calculated from the probability that the human reward signal targets a specific time step.
- the TAMER learning algorithm continuously repeats this process of taking an action, perceiving the human reward, and updating the reward model.
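A rough sketch of this credit-assignment step: a single keyboard reward is spread over the agent's recent state-action pairs. The uniform window used here is a simplifying assumption standing in for the probability-based delay weighting described above.

```python
def tamer_credit_update(reward_model, feature_fn, history, human_reward, window=8):
    """Spread one human reward (-1 or +1) over the last few state-action pairs.

    `history` is the list of (state, action) pairs in the order they occurred;
    a uniform credit window is assumed in place of TAMER's probability-based
    delay weighting of the reward signal.
    """
    recent = history[-window:]
    if not recent or human_reward is None:
        return
    credit = human_reward / len(recent)  # equal share of the reward per sample
    for state, action in recent:
        reward_model.update(feature_fn(state, action), credit)
```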
- VI-TAMER is a variant of TAMER.
- in VI-TAMER, the agent learns from discounted human rewards and uses a planning algorithm, value iteration.
- a VI-TAMER agent repeatedly applies value iteration to the most recently updated reward model learned by TAMER and uses the resulting value function to choose the next action.
- in reinforcement learning (RL), the discount factor γ (between 0 and 1) determines how far into the future the agent looks. Since the discount factor of the original TAMER is 0 (myopic), it can be regarded as a special case of VI-TAMER. Therefore, in the present invention, TAMER is henceforth used as the general method for agents to learn from human rewards, and γ_TAMER denotes the discount factor on human rewards.
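A minimal sketch of the VI-TAMER idea: planning by value iteration over the currently learned human-reward model with discount factor γ_TAMER. The deterministic tabular transition function is an assumption suited to a small task such as Grid World.

```python
import numpy as np

def value_iteration(n_states, actions, transition, reward_model, feature_fn,
                    gamma_tamer=0.8, sweeps=100, tol=1e-6):
    """Compute state values over the learned human-reward model (VI-TAMER style).

    `transition(s, a)` is assumed to return the deterministic next state,
    as in Grid World; stochastic dynamics would require an expectation.
    """
    values = np.zeros(n_states)
    for _ in range(sweeps):
        new_values = np.empty(n_states)
        for s in range(n_states):
            q_values = [reward_model.predict(feature_fn(s, a))
                        + gamma_tamer * values[transition(s, a)]
                        for a in actions]
            new_values[s] = max(q_values)
        converged = np.max(np.abs(new_values - values)) < tol
        values = new_values
        if converged:
            break
    return values
```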
- the Grid World task contains 30 states. In each state, the agent's movement can be selected from four actions in the action space: move up, down, left or right. The agent cannot pass through the wall, and attempts to pass through the wall will not change the current state of the agent.
- the task performance index is the time step required to reach the target position from the initial position, that is, the number of actions. As shown in the middle of the screenshot of Figure 3, the small dark gray square is the agent, and the cross indicates the direction of the agent's next movement. In this task, the agent tries to learn a strategy so that it can reach the goal state and minimize the number of time steps. The optimal strategy from the starting state to the target state requires 20 time steps.
- the current position of the agent is the initial state of the agent, and the target state is the position where the elliptical square in the upper right corner is located.
- the black line and the light gray squares both indicate the fence, and the agent cannot pass directly through this area.
- radial basis functions centered on each square cell of Grid World effectively create a pseudo-table that generalizes slightly among nearby cells.
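A small sketch of what such radial-basis-function state-action features might look like. The 6x5 grid layout (giving the 30 states mentioned above), the per-action tiling, and the RBF width are assumptions for illustration.

```python
import numpy as np

N_COLS, N_ROWS = 6, 5          # assumed layout yielding the 30 Grid World states
N_ACTIONS = 4                  # up, down, left, right
CENTERS = [(c, r) for r in range(N_ROWS) for c in range(N_COLS)]

def rbf_features(state, action, width=0.5):
    """State-action features: one RBF per grid cell, tiled once per action.

    Each cell acts as an RBF center, giving a pseudo-tabular representation
    that generalizes slightly to neighboring cells.
    """
    col, row = state % N_COLS, state // N_COLS
    activations = np.array([
        np.exp(-((col - cc) ** 2 + (row - cr) ** 2) / (2 * width ** 2))
        for cc, cr in CENTERS
    ])
    features = np.zeros(len(CENTERS) * N_ACTIONS)
    features[action * len(CENTERS):(action + 1) * len(CENTERS)] = activations
    return features
```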
- in Face Valuing-TAMER, feedback is first provided through the keyboard to train the agent to obtain an initial strategy, and the user's facial expression feedback is then used to adjust the strategy the agent has obtained. The average of the data collected over 20 experiments is analyzed to test the performance of the proposed method.
- γ_TAMER takes the values 0, 0.2, 0.5, 0.8, and 1. It should be noted that the value of γ_TAMER is the same in the TAMER module of Face Valuing-TAMER and in the TAMER framework.
- training an agent with Face Valuing-TAMER requires less feedback, especially explicit feedback, than training an agent with the TAMER framework alone.
- the number of time steps on which feedback is received can be counted to compare the total, positive, and negative feedback received by Face Valuing-TAMER and TAMER agents under different discount factors. The Face Valuing-TAMER agent is expected to receive much less feedback than the TAMER agent, which shows that providing evaluative feedback in the form of facial expressions can reduce the amount of feedback needed to train the agent and effectively reduce the cognitive burden on the human training the agent's behavior.
- the research results show that, compared with learning under the TAMER framework, although facial expression feedback cannot yet effectively reduce the amount of explicit feedback required (because current facial expression recognition accuracy is only slightly above 60%), it can still obtain the same optimal strategy as learning from keyboard feedback. Further improving the accuracy of facial expression recognition would effectively reduce the amount of explicit feedback required to obtain an optimal strategy.
- the total number of time steps required to train the agent to obtain the optimal strategy can be used as the performance measurement in the experiment.
- the experiment compares the total number of time steps required to train the Face Valuing-TAMER and TAMER agents to obtain the best strategy under different discount factors on human rewards. The total number of time steps required to train a Face Valuing-TAMER agent is expected to be much less than that of the TAMER agent.
Abstract
Description
Claims (6)
- An interactive reinforcement learning method combining a TAMER framework and facial expression feedback, characterized by comprising: combining the TAMER framework and facial expression evaluation to form a Face Valuing-TAMER agent; the Face Valuing-TAMER agent anticipating future rewards by learning a value function from human feedback; wherein combining the TAMER framework and facial expression evaluation to form the Face Valuing-TAMER agent specifically comprises: a trainer training a TAMER agent under the TAMER framework, determining a keyboard reward signal through keyboard key feedback, and training the TAMER agent to obtain an initial executable strategy; and, based on the initial executable strategy, enabling the trainer to determine a facial reward signal through facial expression feedback so as to adjust the behavior strategy of the TAMER agent.
- The interactive reinforcement learning method combining a TAMER framework and facial expression feedback according to claim 1, characterized in that the trainer training the TAMER agent under the TAMER framework, determining the keyboard reward signal through keyboard key feedback, and training the TAMER agent to obtain the initial executable strategy specifically comprises: the trainer observing the current action of the TAMER agent and providing feedback through a keyboard interface, obtaining a keyboard feedback signal, and determining the keyboard reward signal according to the keyboard feedback signal; and determining the initial executable strategy according to the keyboard feedback signal and the keyboard reward signal.
- The interactive reinforcement learning method combining a TAMER framework and facial expression feedback according to claim 2, characterized in that, based on the initial executable strategy, enabling the trainer to determine the facial reward signal through facial expression feedback so as to adjust the behavior strategy of the TAMER agent specifically comprises: updating the value function according to the keyboard reward signal to determine an updated value function, the updated value function comprising a state value function and an action value function; updating the behavior strategy of the TAMER agent according to the reward function; based on the initial executable strategy, the trainer obtaining a facial feedback signal through facial expression feedback and determining the facial reward signal according to the facial feedback signal; and adjusting the behavior strategy of the TAMER agent according to the facial reward signal.
- The interactive reinforcement learning method combining a TAMER framework and facial expression feedback according to claim 3, characterized in that the updated value functions are: v_π(s) = E{G_t | S_t = s, π} and q_π(s, a) = E{G_t | S_t = s, A_t = a, π}, wherein G_t is the expected return at any time t, namely the discounted sum of rewards after time t; i is the step difference between step t+i and step t; γ is the discount factor, and γ^(i-1) is the discount factor applied to the reward obtained at step t+i; R_{t+i} is the reward obtained by the TAMER agent for executing action a in state s at time t+i; v_π(s) is the state value function corresponding to each behavior strategy π, which maps each state s to the expected return G_t of that state by following the behavior strategy π, s ∈ S_t; q_π(s, a) is the action value function corresponding to each behavior strategy π, which provides the expected return by executing action a in state s while following the behavior strategy π, a ∈ A_t; and E denotes taking the expectation of the obtained return.
- An information data processing terminal applying the interactive reinforcement learning method combining a TAMER framework and facial expression feedback according to any one of claims 1 to 5.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910967991.3A CN110826723A (zh) | 2019-10-12 | 2019-10-12 | Interactive reinforcement learning method combining TAMER framework and facial expression feedback |
CN201910967991.3 | 2019-10-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021068638A1 true WO2021068638A1 (zh) | 2021-04-15 |
Family
ID=69548992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/108156 WO2021068638A1 (zh) | 2019-10-12 | 2020-08-10 | Interactive reinforcement learning method combining TAMER framework and facial expression feedback |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110826723A (zh) |
LU (1) | LU500028B1 (zh) |
WO (1) | WO2021068638A1 (zh) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113657583A (zh) * | 2021-08-24 | 2021-11-16 | 广州市香港科大霍英东研究院 | Big data feature extraction method and system based on reinforcement learning |
CN114003121A (zh) * | 2021-09-30 | 2022-02-01 | 中国科学院计算技术研究所 | Data center server energy efficiency optimization method and apparatus, electronic device, and storage medium |
CN114371728A (zh) * | 2021-12-14 | 2022-04-19 | 河南大学 | Unmanned aerial vehicle resource scheduling method based on multi-agent collaborative optimization |
CN114710792A (zh) * | 2022-03-30 | 2022-07-05 | 合肥工业大学 | Optimal deployment method for distributed protection devices in 5G distribution networks based on reinforcement learning |
CN115250156A (zh) * | 2021-09-09 | 2022-10-28 | 李枫 | Wireless network multi-channel spectrum access method based on federated learning |
CN115361717A (zh) * | 2022-07-12 | 2022-11-18 | 华中科技大学 | Millimeter-wave access point selection method and system based on VR user viewpoint trajectories |
CN116307241A (zh) * | 2023-04-04 | 2023-06-23 | 暨南大学 | Distributed job shop scheduling method based on constrained multi-agent reinforcement learning |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826723A (zh) * | 2019-10-12 | 2020-02-21 | 中国海洋大学 | Interactive reinforcement learning method combining TAMER framework and facial expression feedback |
CN114118434A (zh) * | 2020-08-27 | 2022-03-01 | 朱宝 | Intelligent robot and learning method thereof |
CN112859591B (zh) * | 2020-12-23 | 2022-10-21 | 华电电力科学研究院有限公司 | Reinforcement learning control system for energy system operation optimization |
CN112818672A (zh) * | 2021-01-26 | 2021-05-18 | 山西三友和智慧信息技术股份有限公司 | Reinforcement learning sentiment analysis system based on text games |
CN114355786A (zh) * | 2022-01-17 | 2022-04-15 | 北京三月雨文化传播有限责任公司 | Big-data-based regulation and control cloud system for multimedia digital exhibition halls |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105105771A (zh) * | 2015-08-07 | 2015-12-02 | 北京环度智慧智能技术研究所有限公司 | Cognitive index analysis method for potential value tests |
CN105759677A (zh) * | 2015-03-30 | 2016-07-13 | 公安部第研究所 | Multimodal behavior analysis and monitoring system and method suitable for visual terminal workstations |
US20190179893A1 (en) * | 2017-12-08 | 2019-06-13 | General Electric Company | Systems and methods for learning to extract relations from text via user feedback |
CN110070185A (zh) * | 2019-04-09 | 2019-07-30 | 中国海洋大学 | Method for interactive reinforcement learning from demonstrations and human evaluative feedback |
CN110826723A (zh) * | 2019-10-12 | 2020-02-21 | 中国海洋大学 | Interactive reinforcement learning method combining TAMER framework and facial expression feedback |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109978012A (zh) * | 2019-03-05 | 2019-07-05 | 北京工业大学 | Improved Bayesian inverse reinforcement learning method based on combined feedback |
CN110070188B (zh) * | 2019-04-30 | 2021-03-30 | 山东大学 | Incremental cognitive development system and method integrating interactive reinforcement learning |
2019
- 2019-10-12 CN CN201910967991.3A patent/CN110826723A/zh active Pending
2020
- 2020-08-10 WO PCT/CN2020/108156 patent/WO2021068638A1/zh active Application Filing
- 2020-08-10 LU LU500028A patent/LU500028B1/en active IP Right Grant
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113657583A (zh) * | 2021-08-24 | 2021-11-16 | 广州市香港科大霍英东研究院 | Big data feature extraction method and system based on reinforcement learning |
CN115250156A (zh) * | 2021-09-09 | 2022-10-28 | 李枫 | Wireless network multi-channel spectrum access method based on federated learning |
CN114003121A (zh) * | 2021-09-30 | 2022-02-01 | 中国科学院计算技术研究所 | Data center server energy efficiency optimization method and apparatus, electronic device, and storage medium |
CN114003121B (zh) * | 2021-09-30 | 2023-10-31 | 中国科学院计算技术研究所 | Data center server energy efficiency optimization method and apparatus, electronic device, and storage medium |
CN114371728A (zh) * | 2021-12-14 | 2022-04-19 | 河南大学 | Unmanned aerial vehicle resource scheduling method based on multi-agent collaborative optimization |
CN114371728B (zh) * | 2021-12-14 | 2023-06-30 | 河南大学 | Unmanned aerial vehicle resource scheduling method based on multi-agent collaborative optimization |
CN114710792A (zh) * | 2022-03-30 | 2022-07-05 | 合肥工业大学 | Optimal deployment method for distributed protection devices in 5G distribution networks based on reinforcement learning |
CN114710792B (zh) * | 2022-03-30 | 2024-09-06 | 合肥工业大学 | Optimal deployment method for distributed protection devices in 5G distribution networks based on reinforcement learning |
CN115361717A (zh) * | 2022-07-12 | 2022-11-18 | 华中科技大学 | Millimeter-wave access point selection method and system based on VR user viewpoint trajectories |
CN115361717B (zh) * | 2022-07-12 | 2024-04-19 | 华中科技大学 | Millimeter-wave access point selection method and system based on VR user viewpoint trajectories |
CN116307241A (zh) * | 2023-04-04 | 2023-06-23 | 暨南大学 | Distributed job shop scheduling method based on constrained multi-agent reinforcement learning |
CN116307241B (zh) * | 2023-04-04 | 2024-01-05 | 暨南大学 | Distributed job shop scheduling method based on constrained multi-agent reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
LU500028B1 (en) | 2021-04-23 |
CN110826723A (zh) | 2020-02-21 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20874225; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20874225; Country of ref document: EP; Kind code of ref document: A1
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05.06.2023)