CN115545188B - Multi-task offline data sharing method and system based on uncertainty estimation - Google Patents
- Publication number
- CN115545188B CN202211307085.9A
- Authority
- CN
- China
- Prior art keywords
- learning
- offline
- uncertainty
- data set
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to the technical field of reinforcement learning and provides a multi-task offline data sharing method and system based on uncertainty estimation. The method comprises the following steps: providing a multi-task offline data set comprising a plurality of tasks; performing data sharing with the multi-task offline data set to generate a mixed data set; and performing offline policy learning on the mixed data set, wherein the offline policy learning comprises: training a plurality of value function networks on the mixed data set and generating a plurality of prediction results; performing an uncertainty calculation using the standard deviation of the plurality of prediction results; and performing policy learning based on the result of the uncertainty calculation. The invention greatly improves the efficiency of data sharing, uses an approximate Bayesian posterior to measure the uncertainty of the data, suits the application scenario of offline reinforcement learning, and can be used for large-scale robot tasks.
Description
Technical Field
The present invention relates generally to the field of reinforcement learning. In particular, the invention relates to a method and a system for sharing multi-task offline data based on uncertainty estimation.
Background
In offline reinforcement learning, an agent learns a policy from a fixed data set. Consider a multi-task offline learning scenario in which each task corresponds to an independent offline data set. Because the experience contained in each offline data set is limited, the offline samples cover only a limited part of the state space and action space. During learning, the policy obtained by a value-based reinforcement learning algorithm deviates from the region of the state-action space covered by the offline samples, so the algorithm's estimate of the target value function becomes inaccurate or even diverges.
A direct way to address this problem is to use the data of other tasks in the multi-task data set to enlarge the data available for a given task, thereby improving coverage of the state-action space. However, because different tasks have different learning objectives, the behavior policies used to collect their data also differ substantially. Direct data expansion can therefore introduce a serious distribution shift problem in offline reinforcement learning. To address this distribution shift problem, existing data sharing methods include methods based on human prior knowledge, methods based on a policy distance metric, and offline methods based on a pessimistic value function.
The data sharing method based on human prior knowledge relies on the fact that humans have some knowledge of the degree of association between different tasks. For example, the tasks of a quadruped robot may include standing, walking, jumping, and running. When learning the running task, sharing the data of the walking task may be considered; when learning the jumping task, sharing the data of the standing task may be considered, because the agent needs to stand before it can jump. Humans can weight the degree of correlation between tasks according to their subjective knowledge and then give larger weights to the more strongly correlated tasks during learning, which reduces the distribution shift problem.
In the data sharing method based on a policy distance metric, the relationship between a shared data set and the original task data is measured by computing, from the given data sets, the distance between their behavior policies. For example, for each data set, a state-to-action mapping is fitted by supervised learning to obtain a behavior policy. When sharing, the distance between the behavior policies of the tasks (such as the KL divergence or the maximum mean discrepancy) is calculated. Two tasks whose policies are close can be considered related, and their data can be shared. Tasks whose policies are far apart do not share data, which reduces the distribution shift problem.
The offline data sharing method based on a pessimistic value function trains a pessimistic value function on the shared data set using an offline reinforcement learning method in order to measure the value of each state-action pair for the task. If the value function of a state-action pair is low, there may be two reasons. First, the state-action pair cannot bring a high reward signal, resulting in a small value function estimate. Second, the state-action pair has low relevance to the task and is unrelated to the optimal policy for solving the task. The second reason is often caused by data sharing between unrelated tasks. Based on the pessimistic value function estimates, the samples in the shared data can be ranked by value, and the high-value samples can be selected for training.
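The policy-distance idea described above can be illustrated with a minimal sketch. It assumes discrete states and actions and replaces the supervised fit with smoothed counting; the names `fit_policy` and `mean_kl`, the smoothing constant, and the toy data sets are all hypothetical and are not taken from the prior art being described.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Policy = Dict[int, List[float]]  # maps a discrete state to action probabilities


def fit_policy(pairs: List[Tuple[int, int]], n_actions: int, smoothing: float = 1.0) -> Policy:
    """Estimate a behavior policy pi(a | s) from (state, action) pairs by smoothed counting."""
    counts: Dict[int, Counter] = {}
    for s, a in pairs:
        counts.setdefault(s, Counter())[a] += 1
    policy: Policy = {}
    for s, c in counts.items():
        total = sum(c.values()) + smoothing * n_actions
        policy[s] = [(c[a] + smoothing) / total for a in range(n_actions)]
    return policy


def mean_kl(p: Policy, q: Policy) -> float:
    """Average KL(p(.|s) || q(.|s)) over the states that appear in both policies."""
    shared = sorted(set(p) & set(q))
    if not shared:
        raise ValueError("policies share no states")
    per_state = [
        sum(pa * math.log(pa / qa) for pa, qa in zip(p[s], q[s]))
        for s in shared
    ]
    return sum(per_state) / len(per_state)


# Toy data sets for three hypothetical tasks, all observed in state 0.
walk = fit_policy([(0, 0)] * 8 + [(0, 1)] * 2, n_actions=2)
run = fit_policy([(0, 0)] * 7 + [(0, 1)] * 3, n_actions=2)
jump = fit_policy([(0, 1)] * 10, n_actions=2)

kl_walk_run = mean_kl(walk, run)    # small: similar behavior policies
kl_walk_jump = mean_kl(walk, jump)  # large: dissimilar behavior policies
```

Under this sketch, a sharing rule would compare such distances against a threshold, which is exactly the kind of extra computation module the patent later argues against.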
However, the existing offline data sharing methods still have the following problems:
1. Data selection must follow certain rules. The prior art performs data selection based on human knowledge, policy distance, or a value function. Selection based on human knowledge lacks an automated process, is highly subjective, and is difficult to carry out when the number of tasks is large. Both policy distance and value function estimation require additional computation modules. Policy distance requires a deep neural network to fit the distribution of the behavior policy in the data set, which tends to be a mixture of policies. Value function estimation must be learned using a policy evaluation method and the Bellman operator. After estimation, the value functions of the offline shared samples must be ranked and the samples with higher values selected for policy learning, which requires additional computation.
2. Data selection and multi-task offline learning are decoupled. After data selection, the agent obtains an offline multi-task data set, from which it then learns using an offline reinforcement learning method. However, data selection and multi-task offline learning are separate and use different algorithms and different learning mechanisms. The final offline learning method cannot influence the data selection process, and if the data selection mechanism is weak, the subsequent offline learning is severely affected.
Disclosure of Invention
To at least partially solve the above-mentioned problems in the prior art, the present invention provides a multi-task data sharing method for offline reinforcement learning, comprising the following steps:
providing a multi-task offline data set, the multi-task offline data set comprising a plurality of tasks;
performing data sharing with the multi-task offline data set to generate a mixed data set; and
performing offline policy learning on the mixed data set, wherein the offline policy learning comprises the following steps:
training a plurality of value function networks on the mixed data set and generating a plurality of prediction results;
performing an uncertainty calculation using the standard deviation of the plurality of prediction results; and
performing policy learning based on the result of the uncertainty calculation.
In one embodiment of the invention, performing data sharing with the multi-task offline data set to generate a mixed data set comprises the following steps:
selecting a main task and a shared task among the plurality of tasks, wherein data is shared from the shared task while learning the main task;
relabeling the rewards of the data in the shared task, wherein the rewards of the samples in the shared task are recalculated according to the reward function of the main task; and
mixing the relabeled data of the shared task with the data of the main task to generate the mixed data set.
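The relabel-and-mix steps above can be sketched as follows. This is an illustration only: the transition layout, the reward function signature `(s, a, s_next)`, and the names `relabel_and_mix` and `toy_main_reward` are assumptions, since the patent does not specify how the main task's reward function is represented.

```python
from typing import Callable, List, Tuple

# A transition is (state, action, reward, next_state); scalars keep the sketch small.
Transition = Tuple[float, float, float, float]


def relabel_and_mix(
    main_data: List[Transition],
    shared_data: List[Transition],
    main_reward_fn: Callable[[float, float, float], float],
) -> List[Transition]:
    """Recompute shared-task rewards with the main task's reward function, then mix."""
    relabeled = [
        (s, a, main_reward_fn(s, a, s_next), s_next)
        for (s, a, _old_reward, s_next) in shared_data
    ]
    # No sample selection is applied: every relabeled transition is kept.
    return list(main_data) + relabeled


def toy_main_reward(s: float, a: float, s_next: float) -> float:
    """Hypothetical main-task reward: negative distance of the next state from a goal at 1.0."""
    return -abs(s_next - 1.0)


main_data = [(0.0, 1.0, -0.5, 0.5)]
shared_data = [(0.5, 1.0, 3.0, 1.0), (1.0, -1.0, 7.0, 0.5)]  # rewards from another task
mixed = relabel_and_mix(main_data, shared_data, toy_main_reward)
```

After the call, `mixed` holds the original main-task transition followed by the shared transitions, whose old rewards (3.0 and 7.0) have been replaced by values from the main task's reward function.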
In one embodiment of the invention, the plurality of value function networks have identical network structures and different initialization parameters, wherein the plurality of value function networks are trained using a stochastic gradient method to estimate a Bayesian posterior distribution of the value function.
In one embodiment of the invention, the value function is learned by an actor-critic model and iterated by the Bellman operator, comprising the following steps:
representing the experience stored in the mixed data set as a set of state transition tuples $(s, a, r, s')$, where $s$ denotes the state, $a$ the action, $r$ the reward, and $s'$ the state at the next time step;
setting, according to the Bellman operator, the learning target $y$ of the value function $Q(s, a)$ as:
$y = r + \gamma \max_{a'} Q(s', a')$,
where $r$ denotes the single-step environment reward, $\gamma$ denotes the discount factor of the reward over time, and $a'$ denotes the greedy action at the next time step;
expressing the Bellman loss $L$ as: $L = (Q(s, a) - y)^2$; and
training the value function by minimizing the Bellman loss $L$.
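As a worked illustration of the target and loss above, the following sketch uses a small lookup table in place of a value network. The table, the learning rate, and the function names are assumptions made for the example; the patent itself uses neural networks.

```python
from typing import Dict, List, Tuple

QTable = Dict[Tuple[int, int], float]  # maps (state, action) to a value estimate


def bellman_target(q: QTable, r: float, s_next: int, actions: List[int], gamma: float) -> float:
    """y = r + gamma * max over a' of Q(s', a')."""
    return r + gamma * max(q[(s_next, a_next)] for a_next in actions)


def bellman_loss(q: QTable, s: int, a: int, y: float) -> float:
    """L = (Q(s, a) - y)^2."""
    return (q[(s, a)] - y) ** 2


actions = [0, 1]
q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 2.0}
s, a, r, s_next = 0, 1, 0.5, 1
gamma = 0.5

y = bellman_target(q, r, s_next, actions, gamma)  # 0.5 + 0.5 * 2.0
loss_before = bellman_loss(q, s, a, y)

# One gradient step on L with respect to the entry Q(s, a); the target y is held fixed.
learning_rate = 0.25
q[(s, a)] -= learning_rate * 2.0 * (q[(s, a)] - y)
loss_after = bellman_loss(q, s, a, y)
```

Here the target is 1.5, and a single step moves `Q(0, 1)` from 0.0 to 0.75, reducing the loss from 2.25 to 0.5625.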
In one embodiment of the invention, the uncertainty $\Gamma(s, a)$ of the state-action pair $(s, a)$ is calculated using the standard deviation of the plurality of prediction results, expressed as:
$\Gamma(s, a) = \mathrm{Std}(Q_i(s, a))$,
where $i \in [1, K]$ and $K$ denotes the number of value function networks.
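A minimal sketch of this uncertainty estimate follows. The ensemble members here are hand-built linear functions rather than trained networks, and the choice of the population standard deviation is an assumption, since the text does not say whether the population or sample form is intended.

```python
import statistics
from typing import Callable, List, Sequence

QFunction = Callable[[float, float], float]


def uncertainty(q_ensemble: Sequence[QFunction], s: float, a: float) -> float:
    """Gamma(s, a): standard deviation of the K predictions Q_i(s, a)."""
    predictions = [q_i(s, a) for q_i in q_ensemble]
    # Population standard deviation (an assumption; see the lead-in).
    return statistics.pstdev(predictions)


def make_linear_q(slope: float) -> QFunction:
    """Hypothetical ensemble member: all members agree at s = 0 and disagree away from it."""
    return lambda s, a: slope * s + a


K = 5
q_ensemble: List[QFunction] = [make_linear_q(0.8 + 0.1 * i) for i in range(K)]

gamma_near = uncertainty(q_ensemble, 0.0, 1.0)   # members agree, so the spread is zero
gamma_far = uncertainty(q_ensemble, 10.0, 1.0)   # members disagree, so the spread is large
```

The example mimics the intended behavior: where the members agree the uncertainty vanishes, and where they disagree it grows.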
In one embodiment of the invention, performing policy learning based on the result of the uncertainty calculation comprises:
using the result of the uncertainty calculation as a penalty in value function learning to reset the learning target $y$, expressed as:
$y = r + \gamma \max_{a'} Q(s', a') - \Gamma(s', a')$; and
performing policy learning according to the penalized learning target, wherein the policy output is obtained by optimizing $\min_{i \in [1, K]} Q_i$.
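The penalized target and the min-over-ensemble policy can be sketched with lookup tables as below. Two points are assumptions of this sketch rather than statements of the patent: the penalty is evaluated at the greedy next action of the member being updated, and it is subtracted undiscounted, following the formula as printed.

```python
import statistics
from typing import Dict, List, Tuple

QTable = Dict[Tuple[int, int], float]


def ensemble_std(ensemble: List[QTable], s: int, a: int) -> float:
    """Gamma(s, a) as the population standard deviation over ensemble members."""
    return statistics.pstdev([q[(s, a)] for q in ensemble])


def penalized_target(
    ensemble: List[QTable], q_i: QTable, r: float, s_next: int, actions: List[int], gamma: float
) -> float:
    """y = r + gamma * max_{a'} Q_i(s', a') - Gamma(s', a'), with a' greedy under Q_i."""
    a_greedy = max(actions, key=lambda a_next: q_i[(s_next, a_next)])
    return r + gamma * q_i[(s_next, a_greedy)] - ensemble_std(ensemble, s_next, a_greedy)


def pessimistic_action(ensemble: List[QTable], s: int, actions: List[int]) -> int:
    """Pick the action that maximizes min_i Q_i(s, a)."""
    return max(actions, key=lambda a: min(q[(s, a)] for q in ensemble))


actions = [0, 1]
# Two members agree on action 0 and disagree strongly on action 1.
ensemble = [
    {(0, 0): 1.0, (0, 1): 3.0},
    {(0, 0): 1.0, (0, 1): 0.0},
]

y_penalized = penalized_target(ensemble, ensemble[0], r=0.0, s_next=0, actions=actions, gamma=0.5)
chosen = pessimistic_action(ensemble, 0, actions)
```

Without the penalty the target for the first member would be 1.5; with it the target drops to 0.0 because the members disagree on the greedy action. The pessimistic policy picks action 0, whose worst-case value (1.0) exceeds that of action 1 (0.0).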
The invention also provides a multi-task offline data sharing system based on uncertainty estimation, comprising:
a data sharing module configured to perform the following actions:
providing a multi-task offline data set, the multi-task offline data set comprising a plurality of tasks; and
performing data sharing with the multi-task offline data set to generate a mixed data set; and
a policy learning module configured to perform offline policy learning on the mixed data set.
In one embodiment of the invention, the policy learning module comprises:
a value function learning module configured to train a plurality of value function networks on the mixed data set and generate a plurality of prediction results;
an uncertainty metric module configured to perform an uncertainty calculation using the standard deviation of the plurality of prediction results; and
a policy learning module configured to perform policy learning based on the result of the uncertainty calculation.
The invention also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the steps of the method.
The invention also provides a computer system comprising:
a processor configured to execute machine-executable instructions; and
a memory having stored thereon machine-executable instructions which, when executed by the processor, perform the steps of the method.
The invention has at least the following beneficial effects. The invention provides a multi-task offline data sharing method and system based on uncertainty estimation that removes the extra data selection module when sharing multi-task data, so that uncertainty estimation and policy learning can be performed directly on the shared multi-task data. The original single value network is extended to an ensemble of value function networks, and the standard deviation of the ensemble's value function estimates is used as the uncertainty prediction; because the ensemble approximates the posterior distribution of the value function, the uncertainty measure has a theoretical guarantee. In addition, the uncertainty is used as a penalty in the value function update, so that samples with larger uncertainty receive a larger penalty in value function learning, and a stable offline policy can be learned on this basis.
Compared with data selection methods based on a pessimistic value function, the invention needs no additional data selection and learns directly from all of the shared data, which greatly improves the efficiency of data sharing. In addition, the invention uses an approximate Bayesian posterior to measure the uncertainty of the data and uses that uncertainty to measure the value of the data, which better suits the application scenario of offline reinforcement learning and has a theoretical guarantee. Furthermore, the ensemble model can be applied to high-dimensional state and action spaces and can be used for large-scale robot tasks.
Drawings
To further clarify the advantages and features present in various embodiments of the present invention, a more particular description of various embodiments of the present invention will be rendered by reference to the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. In the drawings, for clarity, the same or corresponding parts will be designated by the same or similar reference numerals.
FIG. 1 illustrates a computer system that implements systems and/or methods in accordance with the present invention.
FIG. 2 is a flow chart of a multi-task offline data sharing method based on uncertainty estimation in one embodiment of the present invention.
FIG. 3 is a schematic diagram of a multi-task offline data sharing method based on uncertainty estimation according to one embodiment of the present invention.
FIG. 4 shows a schematic diagram of an ensemble-based uncertainty measure in one embodiment of the invention.
FIG. 5 is a diagram comparing the results of the present method and the prior art when run on a physics simulator in one embodiment of the invention.
Detailed Description
It should be noted that the components in the figures may be shown exaggerated for illustrative purposes and are not necessarily to scale. In the drawings, identical or functionally identical components are provided with the same reference numerals.
In the present invention, unless specifically indicated otherwise, "disposed on …", "disposed over …" and "disposed above …" do not preclude the presence of an intermediate element between the two. Furthermore, "disposed on or above …" merely indicates the relative positional relationship between two components and may, under certain circumstances such as after the product orientation is reversed, be converted to "disposed under or below …", and vice versa.
In the present invention, the embodiments are merely intended to illustrate the scheme of the present invention, and should not be construed as limiting.
In the present invention, the articles "a" and "an" do not exclude a plurality of elements, unless specifically indicated otherwise.
It should also be noted herein that in embodiments of the present application, only a portion of the components or assemblies may be shown for clarity and simplicity, but those of ordinary skill in the art will appreciate that the components or assemblies may be added as needed for a particular scenario under the teachings of the present application. In addition, features of different embodiments of the application may be combined with each other, unless otherwise specified. For example, a feature of the second embodiment may be substituted for a corresponding feature of the first embodiment, or may have the same or similar function, and the resulting embodiment may fall within the scope of disclosure or description of the application.
It should also be noted herein that, within the scope of the present invention, the terms "identical", "equal" and the like do not mean that two values are absolutely equal but allow for some reasonable error; that is, the terms also encompass "substantially identical" and "substantially equal". By analogy, in the present invention, terms denoting direction such as "perpendicular" and "parallel" also cover the meanings "substantially perpendicular" and "substantially parallel".
In the present invention, the modules may be implemented in software, hardware, firmware, or a combination thereof.
The step numbering of the methods of the present invention does not limit the order in which the steps are executed. The method steps may be performed in a different order unless otherwise indicated.
The invention is further elucidated below in connection with the embodiments with reference to the drawings.
FIG. 1 illustrates a computer system 100 that implements systems and/or methods in accordance with the present invention. The method and/or system according to the present invention may be implemented in the computer system 100 shown in FIG. 1 to accomplish the objects of the present invention or, unless specifically stated otherwise, may be distributed via a network, such as a local area network or the Internet, among a plurality of computer systems 100 according to the present invention. The computer system 100 of the present invention may comprise various types of computer systems, such as hand-held devices, laptop computers, personal digital assistants (PDAs), multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, network servers, tablet computers, and the like.
As shown in FIG. 1, computer system 100 includes processor 111, system bus 101, system memory 102, video adapter 105, audio adapter 107, hard disk drive interface 109, optical drive interface 113, network interface 114, and Universal Serial Bus (USB) interface 112. The system bus 101 may be any of several types of bus structures, such as a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system bus 101 is used for communication between the various bus devices. In addition to the bus devices or interfaces shown in FIG. 1, other bus devices or interfaces are also contemplated. The system memory 102 includes a Read Only Memory (ROM) 103 and a Random Access Memory (RAM) 104, where the ROM 103 may store, for example, basic input/output system (BIOS) data implementing the basic routines for information transfer at start-up, and the RAM 104 provides the system with working memory that has a relatively high access speed. The computer system 100 further includes a hard disk drive interface 109 for reading from and writing to a hard disk 110, an optical drive interface 113 for reading from or writing to optical media such as a CD-ROM, and the like. The hard disk 110 may store, for example, an operating system and application programs. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer system 100. Computer system 100 may also include a video adapter 105 for image processing and/or image output, which connects to an output device such as a display 106. Computer system 100 may also include an audio adapter 107 for audio processing and/or audio output, which connects to output devices such as speakers 108.
In addition, computer system 100 may also include a network interface 114 for network connection, where network interface 114 may connect to the Internet 116 through a network device such as router 115, where the connection may be wired or wireless. In addition, computer system 100 may also include a universal serial bus interface (USB) 112 for connecting peripheral devices, including, for example, a keyboard 117, a mouse 118, and other peripheral devices, such as a microphone, a camera, and the like.
When the invention is implemented on the computer system 100 shown in FIG. 1, the extra data selection module can be removed when multi-task data is shared, and uncertainty estimation and policy learning can be performed directly on the shared multi-task data. The original single value network is extended to an ensemble of value function networks, and the standard deviation of the ensemble's value function estimates is used as the uncertainty prediction; because the ensemble approximates the posterior distribution of the value function, the uncertainty measure has a theoretical guarantee. In addition, the uncertainty is used as a penalty in the value function update, so that samples with larger uncertainty receive a larger penalty in value function learning, and a stable offline policy can be learned on this basis.
Furthermore, embodiments may be provided as a computer program product that may include one or more machine-readable media having stored thereon machine-executable instructions that, when executed by one or more machines, such as a computer, computer network, or other electronic device, may result in the one or more machines performing operations in accordance with embodiments of the present invention. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disk read-only memory), and magneto-optical disks, ROMs (read-only memory), RAMs (random access memory), EPROMs (erasable programmable read-only memory), EEPROMs (electrically erasable programmable read-only memory), magnetic or optical cards, flash memory, or other type of media/machine-readable medium suitable for storing machine-executable instructions.
Furthermore, embodiments may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of one or more data signals embodied in and/or modulated by a carrier wave or other propagation medium via a communication link (e.g., a modem and/or network connection). Accordingly, a machine-readable medium as used herein may include such a carrier wave, but is not required.
FIG. 2 is a flow chart of a multi-task offline data sharing method based on uncertainty estimation in one embodiment of the present invention. As shown in FIG. 2, the method may include the following steps:
Step 201, providing a multi-task offline data set, the multi-task offline data set comprising a plurality of tasks.
Step 202, performing data sharing with the multi-task offline data set to generate a mixed data set.
Step 203, performing offline policy learning on the mixed data set.
A system for running the above multi-task offline data sharing method based on uncertainty estimation may include a data sharing module and a policy learning module.
The data sharing module may perform the following actions:
providing a multi-task offline data set, the multi-task offline data set comprising a plurality of tasks; and
performing data sharing with the multi-task offline data set to generate a mixed data set.
The policy learning module may perform offline policy learning on the mixed data set and may include:
a value function learning module that trains a plurality of value function networks on the mixed data set and generates a plurality of prediction results;
an uncertainty metric module that performs an uncertainty calculation using the standard deviation of the plurality of prediction results; and
a policy learning module that performs policy learning based on the result of the uncertainty calculation.
The method and system of the present invention are specifically described below with reference to examples.
FIG. 3 is a schematic diagram of a multi-task offline data sharing method based on uncertainty estimation according to one embodiment of the present invention. As shown in FIG. 3, the method may include two stages: data sharing and policy learning.
In the data sharing stage, this patent removes the complex data selection module and directly shares the data of other tasks. Taking FIG. 1 as an example, assume that the main task is A1 and that data is to be shared from another task Ai when learning task A1. All data of task Ai is then reward-relabeled and combined with the data of task A1 to form a mixed offline data set for policy learning. Reward relabeling recalculates the rewards of the samples in task Ai using the reward function of task A1, so that the entire data set after sharing has a unified reward function. Because an uncertainty measure is used in the policy learning stage, directly sharing the data does not cause a distribution shift problem even if the correlation between task Ai and task A1 is low.
The uncertainty-based multi-task policy learning stage comprises three modules.
In the value function learning module, the value function is learned using the actor-critic algorithm from reinforcement learning. The critic model performs value function learning and iterates through the Bellman operator. The experience stored in the offline data set is a set of state transition tuples $(s, a, r, s')$, each containing the state $s$, the action $a$, the reward $r$, and the next state $s'$. The Bellman operator sets the learning target of the value function $Q(s, a)$ to $y = r + \gamma \max_{a'} Q(s', a')$, where $a'$ is the greedy action at the next time step. The Bellman loss is defined as $L = (Q(s, a) - y)^2$. A single value function can be trained by minimizing this loss. In this patent, the original value function network is extended to an ensemble of about 5 independent networks. Each network has the same structure but different initialization parameters, and each is trained iteratively on its own. Because the different value function networks produce different gradients when trained with the stochastic gradient method, their optimization directions differ slightly, which further differentiates their parameters. It can be shown theoretically that the ensemble value function model approximates the Bayesian posterior of the value function.
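The effect of different initializations and different stochastic gradient trajectories can be illustrated with a deliberately simplified sketch. It is not the patent's training procedure: each ensemble member here is a one-dimensional linear model fitted to fixed regression targets rather than a network trained on Bellman targets, and the data, seeds, step count, and learning rate are all assumptions chosen for the example.

```python
import random
import statistics
from typing import List, Tuple

Sample = Tuple[float, float]   # (input x standing in for a state-action pair, target y)
Member = Tuple[float, float]   # (weight, bias) of one linear ensemble member


def train_member(data: List[Sample], seed: int, steps: int = 200, lr: float = 0.05) -> Member:
    """Fit q(x) = w * x + b by SGD; the seed sets both the initialization and the sample order."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    for _ in range(steps):
        x, y = rng.choice(data)          # one random sample per step
        error = (w * x + b) - y
        w -= lr * 2.0 * error * x        # gradient of the squared error with respect to w
        b -= lr * 2.0 * error            # gradient of the squared error with respect to b
    return w, b


def spread(ensemble: List[Member], x: float) -> float:
    """Standard deviation of the members' predictions at input x."""
    return statistics.pstdev([w * x + b for (w, b) in ensemble])


# The data only covers x in [0, 1]; the targets follow y = 2x.
data = [(i / 10.0, 2.0 * i / 10.0) for i in range(11)]
ensemble = [train_member(data, seed=k) for k in range(5)]  # about 5 members, as in the text

in_dist = spread(ensemble, 0.5)    # inside the region covered by data
out_dist = spread(ensemble, 10.0)  # far outside the region covered by data
```

In this toy setting the members are pulled toward the same predictions where data exists but remain free to differ elsewhere, so the spread far from the data is expected to exceed the spread inside it. That disagreement is what the standard deviation in the uncertainty module picks up.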
In the uncertainty metric module, the standard deviation of the K value function predictions is used as an uncertainty estimate Γ (s, a) of the state action pair (s, a) on the basis of learning the integrated value function. Formalized Γ (s, a) =std (Q i (s, a)), where i e [1, k ]. In an offline dataset, an uncertainty metric measures the distribution of each state action pair in the dataset. If the uncertainty is high, it is indicated that the shared data is far from the data distribution of the original task, and its value function estimation may be inaccurate. If the uncertainty is low, it indicates that the shared data is very close to the original task, and a small uncertainty penalty should be given in learning.
FIG. 4 shows a schematic diagram of an integrated model-based uncertainty metric in one embodiment of the invention. In the single dataset task shown in the left-hand diagram of fig. 4, the distribution of the first data points 201 is more concentrated, with regions near the data points having lower uncertainty and the remaining regions having higher uncertainty. It can be found that the uncertainty based on the integrated model can accurately measure the distribution of the sample in the state action space. The right plot shows the uncertainty profile after sharing the data points of the remaining two tasks, the second data point 202 and the third data point 203, from offline data of the other tasks. It has been found that the uncertainty measure across the space after offline data sharing changes and the uncertainty in many areas is reduced by data sharing. Since uncertainty is used as a penalty in the value function estimation, the penalty of the value function in the region of less uncertainty is smaller, and thus the strategy can iterate and optimize over a wider region.
In the policy learning module, policy learning is performed based on the uncertainty measure. First, the uncertainty is used as a penalty in value function learning, and the learning target is reset to y = r + γ max_{a'} Q(s', a') - Γ(s', a'). If the uncertainty is large, the penalty term in the value function is large, which yields a conservative estimate. Shared data that is irrelevant to the optimal policy of the main task receives a larger uncertainty from the ensemble, which reduces its role in learning. Policy learning is then performed on this basis: the policy output is obtained by optimizing min_i Q_i, where i ∈ [1, K], meaning that the minimum over the ensemble of value function networks is selected for optimization. The policy is thus optimized against a lower bound of the value function estimated from the approximate Bayesian posterior, which ensures that a conservative and stable policy is learned.
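The penalized target and the minimum over the ensemble can be sketched as follows. This is one possible reading under stated assumptions: actions are discrete, each network i uses its own greedy action a', Γ(s', a') is the population standard deviation across networks at that action, and the policy is a direct argmax over min_i Q_i rather than a trained actor network. The function names and the discount value are hypothetical.

```python
from statistics import pstdev

def penalized_targets(q_next, r, discount=0.99):
    """Per-network target y_i = r + discount * max_a' Q_i(s', a') - Gamma(s', a').

    q_next[i][a] holds Q_i(s', a) for network i and discrete action a.
    """
    n_actions = len(q_next[0])
    targets = []
    for q_i in q_next:
        a_greedy = max(range(n_actions), key=lambda a: q_i[a])
        penalty = pstdev([q_j[a_greedy] for q_j in q_next])  # Gamma(s', a')
        targets.append(r + discount * q_i[a_greedy] - penalty)
    return targets

def pessimistic_action(q_s):
    """Pick the action that maximizes min_i Q_i(s, a) over the ensemble."""
    n_actions = len(q_s[0])
    return max(range(n_actions), key=lambda a: min(q_i[a] for q_i in q_s))

# Two networks that disagree strongly about action 0 but agree on action 1.
q_values = [[1.0, 0.5], [-1.0, 0.4]]
chosen = pessimistic_action(q_values)
```

In this toy case action 0 has the higher value under one network but a much lower minimum, so the pessimistic rule selects action 1, on which the networks agree.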
Based on the above model design, a robot multi-task offline dataset is used to evaluate the effect of multi-task offline learning. The uncertainty measure shown in FIG. 4 is for a two-dimensional input, whereas in actual robot tasks the state tends to be high-dimensional. For example, the state of a robotic arm typically includes the position and velocity of each joint, usually 10 to 20 dimensions. The value function network takes the combination of state and action as input and outputs an estimate of the value function. The ensemble can measure the uncertainty of high-dimensional state-action pairs, so samples with higher uncertainty are penalized during learning, which mitigates the distribution shift problem in multi-task data sharing. The method can be tested with DeepMind Control Suite and other robot simulation environments, on robotic arms, quadruped robots, and similar platforms, to comprehensively evaluate its application value and practical effect. In addition, a policy learned in the simulation environment can be transferred to a real robot scene for deployment.
In the prior art, a data selection method based on the magnitude of a pessimistic value function selects, for training, the shared samples that have a higher value under the given task, so that high-value samples can be screened. However, this method requires an additional data selection module: after value function learning, the samples must be sorted and selected by their values to obtain the high-value shared samples, and only then is the policy learned. Moreover, because the value function is iterated continuously, data selection must be repeated with the latest value function after every update, which is computationally expensive.
Compared with the data selection method based on a pessimistic value function, the present method requires no additional data selection and learns directly from all shared data, which greatly improves the efficiency of data sharing. In addition, the invention uses an approximate Bayesian posterior to measure the uncertainty of the data and uses that uncertainty to measure the value of the data, which better fits the offline reinforcement learning setting and comes with a theoretical guarantee. Furthermore, the ensemble implementation is suitable for high-dimensional state and action spaces and can be used for large-scale robot tasks.
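Learning from all shared data without a selection step can be illustrated with a short sketch of how the mixed dataset might be built, following the reward re-marking described in claim 1. The function names, the reward function signature (s, a, s'), and the toy transitions are hypothetical and chosen only for illustration.

```python
def relabel(transitions, main_reward_fn):
    """Recompute each shared sample's reward with the main task's reward function."""
    return [(s, a, main_reward_fn(s, a, s_next), s_next)
            for (s, a, _old_reward, s_next) in transitions]

def build_mixed_dataset(main_data, shared_datasets, main_reward_fn):
    """Mix the main-task data with ALL relabeled shared data; nothing is filtered."""
    mixed = list(main_data)
    for shared in shared_datasets:
        mixed.extend(relabel(shared, main_reward_fn))
    return mixed

# Hypothetical main task: reward 1.0 for reaching state 3, otherwise 0.0.
def main_reward(s, a, s_next):
    return 1.0 if s_next == 3 else 0.0

main_data = [(0, 0, 1.0, 3)]
shared = [[(1, 1, 5.0, 2), (2, 0, -1.0, 3)]]  # rewards from another task
mixed = build_mixed_dataset(main_data, shared, main_reward)
```

Every shared transition is kept; the uncertainty penalty applied later during value function learning, rather than a separate filtering module, limits the influence of samples that lie far from the main task.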
FIG. 5 compares the results of the present method and the prior art when run on a physics simulator in one embodiment of the invention. As shown in FIG. 5, the method is trained on the physics simulator DeepMind Control Suite, which involves a plurality of robot-related tasks. The results show that the present method performs better than the other related methods.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to those skilled in the relevant art that various combinations, modifications, and variations can be made therein without departing from the spirit and scope of the invention. Thus, the breadth and scope of the present invention as disclosed herein should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Claims (5)
1. A multi-task offline data sharing method based on uncertainty estimation, comprising the steps of:
Providing a robot multitasking offline data set comprising a plurality of tasks;
performing data sharing with the robot multitasking offline data set to generate a hybrid data set; and
Performing offline policy learning according to the hybrid dataset, including:
training a plurality of value function networks according to the mixed data set and generating a plurality of prediction results;
Performing uncertainty calculation by using standard deviations of the plurality of prediction results; and
Performing strategy learning based on the result of the uncertainty calculation;
wherein data sharing with the robot multitasking offline data set to generate a hybrid data set comprises the steps of:
selecting a master task and a sharing task among the plurality of tasks, wherein data is shared from the sharing task while learning the master task;
Re-marking the rewards of the data in the sharing task, wherein the rewards of the samples in the sharing task are recalculated according to the rewards function of the main task; and
Mixing the shared task with the primary task to generate a mixed dataset;
the multiple value function networks comprise the same network structure and different initialization parameters, wherein the multiple value function networks are trained by using a stochastic gradient method to estimate the Bayesian posterior distribution of the value function;
The value function is learned by an actor-critic model and iterated by a Bellman operator, comprising the steps of:
Representing the experience stored in the hybrid dataset as a set of state transition tuples (s, a, r, s '), where s represents state, a represents action, r represents reward, and s' represents the next time state;
The learning target y of the value function Q(s, a), set according to the Bellman operator, is expressed as:
y = r + γ max_{a'} Q(s', a'),
wherein r represents a single-step environmental reward, γ represents a discount factor of the reward over time, and a' represents a greedy action at the next moment;
The Bellman loss L is expressed as: L = (Q(s, a) - y)^2; and
Training a value function by minimizing the Bellman loss L;
Calculating uncertainty Γ(s, a) of the state-action pair (s, a) using standard deviations of the plurality of prediction results, expressed as:
Γ(s, a) = Std(Q_i(s, a)),
wherein i ∈ [1, K], and K represents the number of value function networks;
Policy learning based on the result of the uncertainty calculation includes:
using the result of uncertainty calculation as a penalty in value function learning to reset the learning objective y, expressed as:
y = r + γ max_{a'} Q(s', a') - Γ(s', a'); and
Performing policy learning according to the penalized learning target, wherein the policy output is obtained by optimizing min_i Q_i, where i ∈ [1, K].
2. A multi-tasking offline data sharing system based on uncertainty estimation, characterized in that the system is configured to perform the steps of the method according to claim 1, the system comprising:
A data sharing module configured to perform the following actions:
providing a multi-tasking offline data set, the multi-tasking offline data set comprising a plurality of tasks; and
Performing data sharing with the multitasking offline data set to generate a hybrid data set; and
A policy learning module configured to perform offline policy learning from the hybrid dataset.
3. The uncertainty estimation-based multi-tasking offline data sharing system of claim 2, wherein the policy learning module comprises:
A value function learning module configured to train a plurality of value function networks from the mixed dataset and generate a plurality of prediction results;
an uncertainty metric module configured to perform an uncertainty calculation using standard deviations of the plurality of predicted outcomes; and
A policy learning module configured to perform policy learning based on a result of the uncertainty calculation.
4. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to claim 1.
5. A computer system, comprising:
a processor configured to execute machine-executable instructions; and
Memory having stored thereon machine executable instructions which when executed by a processor perform the steps of the method according to claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211307085.9A CN115545188B (en) | 2022-10-24 | 2022-10-24 | Multi-task offline data sharing method and system based on uncertainty estimation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115545188A CN115545188A (en) | 2022-12-30 |
CN115545188B CN115545188B (en) | 2024-06-14 |
Family
ID=84718760
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211307085.9A Active CN115545188B (en) | 2022-10-24 | 2022-10-24 | Multi-task offline data sharing method and system based on uncertainty estimation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115545188B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN115545188A (en) | 2022-12-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||