
CN113807460A - Method and device for determining agent action, electronic device, and medium

Info

Publication number
CN113807460A
Authority
CN
China
Prior art keywords
action
action sequence
sequence
state
target
Prior art date
Legal status
Granted
Application number
CN202111138932.9A
Other languages
Chinese (zh)
Other versions
CN113807460B (en)
Inventor
张海超
徐伟
余昊男
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN202111138932.9A
Publication of CN113807460A
Application granted
Publication of CN113807460B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Feedback Control In General (AREA)

Abstract

Embodiments of the present disclosure disclose a method and apparatus for determining an agent action, an electronic device, and a medium. The method for determining an agent action includes the following steps: generating a first action sequence based on the current state of the agent at the current time step, the first action sequence including a first action for at least one time step; determining a first state action sequence value corresponding to each first action in the first action sequence based on the current state and the first action sequence; determining a target action sequence to be executed currently based on the first state action sequence value corresponding to each first action in the first action sequence and a second state action sequence value corresponding to each second action in a candidate action sequence; and determining a target action to be executed currently based on the target action sequence so that the agent acts according to the target action. The method and apparatus enable multi-step actions to be generated at each time step and to participate in determining subsequent target actions, achieving temporally coordinated exploration and effectively improving exploration efficiency.

Description

Method and device for determining agent action, electronic device, and medium
Technical Field
The present disclosure relates to intelligent control technologies, and in particular, to a method and an apparatus for determining an action of an agent, an electronic device, and a medium.
Background
An unmanned device can be regarded as an agent with sensing and action capabilities in a real natural environment. The actions of such an agent usually need to be planned, and the corresponding actions are then executed according to the plan. In the prior art, a single-step action that maximizes the expected future return is usually generated based on the current state of the agent and serves as the target action to be executed currently, which results in low exploration efficiency.
Disclosure of Invention
The present disclosure has been made to solve the above technical problems. Embodiments of the present disclosure provide a method and apparatus for determining an agent action, an electronic device, and a medium.
According to an aspect of the embodiments of the present disclosure, there is provided a method for determining an agent action, including: generating a first action sequence based on a current state of the agent at a current time step, the first action sequence including a first action for at least one time step; determining a first state action sequence value corresponding to each first action in the first action sequence based on the current state and the first action sequence, wherein the first state action sequence value is a state action sequence value function value; determining a target action sequence to be executed currently based on the first state action sequence value corresponding to each first action in the first action sequence and a second state action sequence value corresponding to each second action in a candidate action sequence, wherein the candidate action sequence is an action sequence formed by the remaining unexecuted actions in the action sequence executed at the previous time step; and determining a target action to be executed currently based on the target action sequence so as to enable the agent to act according to the target action.
According to another aspect of the embodiments of the present disclosure, there is provided an apparatus for determining an action of an agent, including: a generating module, configured to generate a first action sequence based on a current state of an agent at a current time step, where the first action sequence includes a first action of at least one time step; a determining module, configured to determine, based on the current state and the first action sequence, a first state action sequence value corresponding to each first action in the first action sequence, where the first state action sequence value is a state action sequence cost function value; a first processing module, configured to determine a target action sequence to be executed currently based on the first state action sequence value corresponding to each first action in the first action sequence and a second state action sequence value corresponding to each second action in a candidate action sequence, where the candidate action sequence is an action sequence formed by the remaining unexecuted actions in the action sequence executed at the previous time step; and a second processing module, configured to determine a target action to be executed currently based on the target action sequence so as to enable the agent to act according to the target action.
According to a further aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the method for determining an action of an agent according to any of the above embodiments of the present disclosure.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method for determining an agent action according to any of the above embodiments of the present disclosure.
Based on the method and apparatus for determining an agent action, the electronic device, and the medium provided by the above embodiments of the present disclosure, an action sequence including actions of at least one time step can be generated at the current time step based on the current state of the agent; a target action sequence to be executed at the current time step is determined based on the state action sequence values of the generated action sequence and the state action sequence values of the remaining actions of the action sequence executed at the previous time step; and a target action to be executed at the current time step is then determined based on the target action sequence. That is, multi-step actions can be generated at each time step and participate in the determination of subsequent target actions, thereby effectively improving exploration efficiency.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is a schematic diagram of an exemplary scenario to which the present disclosure is applicable;
FIG. 2 is a flowchart illustrating a method for determining actions of an agent according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flowchart of step 201 provided by an exemplary embodiment of the present disclosure;
FIG. 4 is an exemplary diagram of an action sequence generator provided in an exemplary embodiment of the present disclosure;
FIG. 5 is an exemplary structural diagram of an RNN unit combined with a skip connection according to an exemplary embodiment of the present disclosure;
FIG. 6 is an exemplary architectural diagram of a state action sequence value network model provided by an exemplary embodiment of the present disclosure;
fig. 7 is a flowchart illustrating a method for determining an agent action according to another exemplary embodiment of the present disclosure;
FIG. 8 is an exemplary flowchart of step 302 provided by an exemplary embodiment of the present disclosure;
FIG. 9 is an exemplary flowchart of step 203 provided by an exemplary embodiment of the present disclosure;
FIG. 10 is an exemplary flowchart of a preset mapping rule provided by an exemplary embodiment of the present disclosure;
FIG. 11 is an exemplary architectural diagram of a network architecture for CARLA tasks provided by an exemplary embodiment of the present disclosure;
FIG. 12 is an exemplary architectural diagram of a CARLA environment and its controller interface provided by an exemplary embodiment of the present disclosure;
FIG. 13 is an overall flow diagram of action determination for a driving scenario provided by an exemplary embodiment of the present disclosure;
FIG. 14 is a flowchart illustrating a process for performing a target action in a driving scenario provided by an exemplary embodiment of the present disclosure;
FIG. 15 is a schematic diagram illustrating visualization of state access and evolution results during Pendulum task training of a GPM algorithm and other algorithms according to an exemplary embodiment of the present disclosure;
FIG. 16 is a graph illustrating performance effectiveness curves of a GPM algorithm and other algorithms provided by an exemplary embodiment of the present disclosure;
FIG. 17 is a diagram illustrating an exploration trajectory visualization result of a GPM algorithm and other algorithms provided by an exemplary embodiment of the present disclosure;
FIG. 18 is a schematic diagram illustrating an evolution process of a GPM algorithm generating action sequence according to an exemplary embodiment of the present disclosure;
FIG. 19 is a schematic visualization diagram of a GPM generated action sequence provided by an exemplary embodiment of the present disclosure;
fig. 20 is a schematic structural diagram of an apparatus for determining an action of an agent according to an exemplary embodiment of the present disclosure;
fig. 21 is an exemplary structural diagram of the generating module 501 provided in an exemplary embodiment of the present disclosure;
fig. 22 is a schematic structural diagram of an apparatus for determining an agent action according to another exemplary embodiment of the present disclosure;
fig. 23 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning, nor is the necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure is only one kind of association relationship describing an associated object, and means that three kinds of relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to agents and to electronic devices such as terminal devices, computer systems, servers, etc. in communication with an agent and which are operable with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network pcs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the disclosure
In implementing the present disclosure, the inventors found that, when determining a target action to be performed by an agent, the action generation method is typically optimized using a model-free reinforcement learning algorithm, which generates the action to be executed at the current time step so as to maximize the expected future return. However, such methods generate only a single-step action per time step, which, although flexible, suffers from inefficient exploration.
Exemplary overview
The embodiments of the present disclosure can be applied to automatic driving scenarios in the traffic field, automatic robot operation scenarios in the industrial field, automatic weather-change recognition scenarios in the meteorological field, and any other scenario in which they can be implemented.
According to the present disclosure, a first action sequence comprising multi-step actions can be generated at each time step; the target action sequence to be executed by the agent at the current time step is selected between this first action sequence and a candidate action sequence formed by the remaining unexecuted actions of the action sequence executed at the previous time step; and the target action to be executed by the agent at the current time step is determined based on the target action sequence. Compared with generating single-step actions, this effectively improves exploration efficiency. One example is shown in FIG. 1, a schematic diagram of an exemplary scenario to which the present disclosure is applicable. The figure shows an autonomous vehicle, which is the agent, driving on a road as the host vehicle. The host vehicle senses surrounding environment information through various sensors arranged on it and, at the current time step, generates a first action sequence based on its current state; the first action sequence includes one or more actions starting at the current time step. A target action sequence to be executed at the current time step is determined based on the first action sequence and the candidate action sequence, for example a sequence containing 3 set point positions [(x1, y1), (x2, y2), (x3, y3)]. A target action to be executed by the vehicle is then determined based on the target action sequence and a preset action operator; for example, a target direction and a target speed are determined based on the first set point position (x1, y1) and the current position (x0, y0) of the vehicle and transmitted to a PID controller to generate a control signal, i.e., the target action, such as throttle, steering, braking, or reversing. The vehicle executes the target action to move forward. After the target action is executed, the vehicle enters the next time step; the next time step is taken as the current time step, the new state of the vehicle is acquired as the current state, the action to be executed by the vehicle is determined again according to the above process, and so on, thereby realizing automatic driving of the vehicle.
Exemplary method
Fig. 2 is a flowchart illustrating a method for determining an action of an agent according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an agent or an electronic device, and as shown in fig. 2, the method includes the following steps:
Step 201, at the current time step, a first action sequence is generated based on the current state of the agent, the first action sequence including a first action of at least one time step.
In an alternative example, the agent may be a computing entity, such as an autonomous vehicle or a robot, equipped with an intelligent system capable of stereoscopic perception, global coordination, precise judgment, continuous evolution, and openness, and able to operate continuously and autonomously in a certain environment. In the practical application of the agent, the current time of the agent is referred to as the current time step, and the state of the agent at the current time step is referred to as its current state. In an automatic driving scenario, for example, the current state of the agent may include the current position, current speed, current acceleration, and current surrounding environment information of the vehicle.
In an alternative example, the first action sequence may be denoted by τ_new and comprises first actions for n time steps, i.e., τ_new = [a_{t,new}, a_{t+1,new}, a_{t+2,new}, …, a_{t+n-1,new}], where t denotes the current time step, a_{t,new} represents the first action corresponding to the current time step t, and a_{t+i-1,new} (i ∈ [1, n]) represents the first action corresponding to time step t+i-1.
In an optional example, the generation of the first action sequence may be implemented by an action sequence generator. The action sequence generator may be implemented by a recurrent neural network (RNN) and needs to be trained in advance; it may be trained by maximizing value, so that the generated action sequence is an intentional action sequence that reaches high-value regions. Temporally coordinated exploration can therefore be performed based on the generated multi-step actions, thereby improving exploration efficiency.
Step 202, determining a first state action sequence value corresponding to each first action in the first action sequence based on the current state and the first action sequence, wherein the first state action sequence value is a state action sequence cost function value.
In one optional example, the state action sequence value is a state action sequence cost function value used to evaluate the goodness of the action sequence.
In an alternative example, the state action sequence cost function may be implemented by a state action sequence cost network model, which may be obtained through learning based on a recurrent neural network RNN.
In one optional example, the state action sequence cost function inputs are the state and action sequences and the output is a set of state action sequence costs along the action sequence.
Illustratively, the current state is denoted as s_t and the first action sequence is denoted as τ_new. If the state action sequence value function is Q(s, τ), then the first state action sequence value corresponding to the first action a_{t+i-1,new} of time step t+i-1 (i ∈ [1, n]) is:
Q(s_t, τ_new,i) = Q(s_t, a_{t,new}, a_{t+1,new}, a_{t+2,new}, …, a_{t+i-1,new})
where τ_new,i represents the first i first actions of the first action sequence. That is, the first state action sequence value corresponding to each first action a_{t+i-1,new} is the state action sequence value Q(s_t, τ_new,i) of the action sequence formed by the first i actions of the first action sequence.
In the present disclosure, a single action sequence includes a plurality of actions, and the state action sequence value Q corresponding to one action refers to the state action sequence value of the sub-sequence formed by the actions from the start of the action sequence up to and including that action; that is, the value of an action sequence is evaluated by its state action sequence value.
Step 203, determining a target action sequence to be executed currently based on the first state action sequence value corresponding to each first action in the first action sequence and the second state action sequence value corresponding to each second action in the candidate action sequence, wherein the candidate action sequence is an action sequence formed by the remaining unexecuted actions in the action sequence executed at the previous time step.
In an alternative example, the candidate action sequence may be obtained by time-shifting the old action sequence executed at the previous time step t-1 by one step (if there are still remaining unexecuted actions), i.e., the candidate action sequence τ_old = ρ(τ_actual_0), where τ_actual_0 represents the action sequence executed at the previous time step t-1 and ρ() is a shift operator that extracts the remaining part of the action sequence. For example, if the 1st action τ_actual_0[0] of the action sequence τ_actual_0 has been executed at the previous time step, then ρ(τ_actual_0) = T_shift(τ_actual_0) = τ_actual_0[1:].
The value of the second state action sequence corresponding to each second action in the candidate action sequence is also determined by the state action sequence value function, and the specific principle is similar to the first action sequence and is not repeated herein.
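For illustration only, a minimal Python sketch of the shift operator ρ() described above is given below; the function and variable names are hypothetical, and the snippet merely mirrors the list-slicing reading ρ(τ_actual_0) = τ_actual_0[1:].

```python
def shift(executed_sequence):
    """Shift operator rho(): drop the action that was already executed at the
    previous time step and return the remaining, not-yet-executed actions as
    the candidate action sequence for the current time step."""
    return executed_sequence[1:]

# Example: the sequence executed at time step t-1 held 4 actions and only the
# first one was actually executed, so 3 actions remain as tau_old.
tau_actual_prev = [(0.1, 0.0), (0.2, 0.1), (0.3, 0.1), (0.3, 0.2)]
tau_old = shift(tau_actual_prev)   # candidate action sequence of length 3
```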
In an alternative example, the target action sequence is determined by competition of the first action sequence and the candidate action sequence based on the respective state action sequence values, that is, the target action sequence is the first action sequence or the candidate action sequence, and the competition rules of the two can be set according to actual requirements.
And step 204, determining the target action which the intelligent agent needs to execute currently based on the target action sequence.
In an alternative example, the target action may be determined according to a target action sequence and a preset action operator (also referred to as a preset mapping rule).
In an alternative example, if the action included in the target action sequence is an action that can be executed by the agent, the action corresponding to the current time step in the target action sequence may be used as the target action.
In an alternative example, the target action sequence may include actions that cannot be directly executed by the agent. For example, in an automatic driving scenario the target action sequence contains a sequence of position coordinates (which may be referred to as a set point sequence) to be reached in the future. In that case, an action that the agent can execute needs to be determined as the target action according to the target action sequence. For example, a target direction and a target speed at which the vehicle needs to travel are determined based on the set point to be reached at the current time step and the current position of the agent (the vehicle); based on the target direction and target speed, a controller (e.g., a PID controller) produces a control signal of the vehicle (including throttle, steering, braking, reversing, and the like) as the target action; and the vehicle executes the target action, thereby completing one state update.
According to the present disclosure, an action sequence comprising multi-step actions can be generated at each time step; the target action sequence to be executed at the current time step is determined based on the state action sequence values of the generated action sequence and the state action sequence values of the remaining actions of the action sequence executed at the previous time step, thereby maximizing the state action sequence value; and the target action to be executed at the current time step is further determined based on the target action sequence. That is, the multi-step actions generated at each time step participate in the determination of subsequent target actions for temporally coordinated exploration, effectively improving exploration efficiency.
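The overall per-time-step procedure of steps 201 to 204 can be summarized by the following illustrative Python sketch. It is only a schematic reading of the method: generator, q_values, to_executable_action, and the simple prefer_new comparison are hypothetical placeholders for the action sequence generator, the state action sequence value network, the preset mapping rule, and the switching rule detailed later.

```python
def prefer_new(q_new, q_old, epsilon=0.0):
    """Simplified competition rule: prefer the newly generated sequence when its
    value over the shared horizon m beats the candidate's value by epsilon."""
    m = len(q_old)
    return q_new[m - 1] > q_old[m - 1] + epsilon


def act_one_step(state, tau_old, generator, q_values, to_executable_action, epsilon=0.0):
    """One decision cycle of the agent at the current time step (steps 201-204)."""
    tau_new = generator(state)                  # step 201: new action sequence
    q_new = q_values(state, tau_new)            # step 202: values of all prefixes

    # Step 203: competition between the new sequence and the candidate sequence
    # (the remaining, unexecuted actions of the previously executed sequence).
    if len(tau_old) == 0:
        tau_target = tau_new
    else:
        q_old = q_values(state, tau_old)
        tau_target = tau_new if prefer_new(q_new, q_old, epsilon) else tau_old

    # Step 204: map the first action of the target sequence to an executable
    # action (e.g. a set point is converted into a control signal).
    action = to_executable_action(state, tau_target[0])

    # The remaining actions become the candidate sequence of the next time step.
    return action, tau_target[1:]
```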
In an alternative example, fig. 3 is a schematic flowchart of step 201 provided in an exemplary embodiment of the present disclosure, and as shown in fig. 3, on the basis of the above embodiment shown in fig. 2, step 201 of the present disclosure may include the following steps:
Step 2011, at the current time step, the current state of the agent is obtained.
Step 2012, inputting the current state into an action sequence generator obtained by pre-training to generate the first action sequence, where the action sequence generator is an action sequence generation model based on a recurrent neural network.
In an alternative example, the current state of the agent may be obtained by corresponding state acquisition devices, such as various sensors and other related devices, including a camera, an IMU sensor, which may be specifically set according to actual requirements.
In an optional example, the action sequence generator is an action sequence generation model obtained by training based on a recurrent neural network RNN, and the recurrent neural network may be set according to actual requirements, such as a gated recurrent unit GRU network, which is not limited in the embodiment of the present disclosure.
In an alternative example, fig. 4 is an exemplary structural schematic diagram of an action sequence generator of the present disclosure, which implements the functions as follows:
z_t = encoder(s_t)    (1)
h_t = h(z_t)    (2)
a_t = μ_mean(z_t) + σ_std(z_t) · n_noise    (3)
a_{t+i} = f(h_{t+i}, a_{t+i-1})    (4)
h_{t+i} = RNN(h_{t+i-1}, a_{t+i-1})    (5)
where s_t denotes the state at time step t; encoder(s_t) denotes encoding s_t to obtain the encoding result z_t; n_noise denotes a noise vector with n_noise ~ N(0, 1), i.e., n_noise follows a normal distribution; h(z_t) denotes applying the h mapping to z_t to obtain the initial state h_t of the recurrent neural network RNN at time step t; μ_mean(z_t) and σ_std(z_t) are the μ network and σ network in the figure, representing the mean (μ in the figure) and the variance (σ in the figure) of the action distribution at time step t; f() denotes the action decoder; a_{t+i-1} (i ∈ [1, n-1]) denotes the generated action of time step t+i-1; and RNN(h_{t+i-1}, a_{t+i-1}) denotes a recurrent neural network unit whose inputs are h_{t+i-1} and a_{t+i-1} and whose output is h_{t+i}.
In an alternative example, starting from the second generated action (a_{t+1}), RNN() can be implemented in combination with a skip connection, i.e., a skip connection is added from the input action to the output action. FIG. 5 is an exemplary structural diagram of the RNN unit combined with a skip connection according to the present disclosure. Taking the first RNN unit as an example, its inputs are the first generated action a_t and h_t, and a skip connection is added from the input action a_t to the output action a_{t+1}. With this skip connection, reasonable prior knowledge is introduced, and the action decoder f() can use the following time-series structure:
a_{t+i} = f(h_{t+i-1}, a_{t+i-1}) = a_{t+i-1} + g(h_{t+i-1}, a_{t+i-1})    (6)
where i ∈ [1, n-1] and g() can be implemented by a neural network. If g is initialized such that the values it generates are small, a prior of action repetition is effectively embedded for initial reinforcement-learning exploration.
In an alternative example, the action sequence generator may adjust the degree of randomness by matching the specified target entropy, that is, introducing a cost function J (α) as a part of the overall cost function, where α is a parameter for adjusting the degree of randomness, and needs to be obtained by learning.
In this way, the action sequence generator based on the recurrent neural network can generate an action sequence comprising a plurality of actions at each time step, performing temporally coordinated exploration and effectively improving exploration efficiency. Reasonable prior knowledge can be introduced through the skip connection, improving the effectiveness of the generated actions. In addition, modeling based on a recurrent neural network uses an autoregressive structure, which helps represent the internal relationship between actions of consecutive time steps and further improves the effectiveness of the model.
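A minimal PyTorch sketch of such a recurrent action sequence generator is given below for illustration. The layer sizes, the choice of a GRU cell, the log-std parameterization of σ, and all identifiers are assumptions rather than the exact network of this disclosure; the sketch only follows the structure of equations (1) to (6), including the skip connection of equation (6).

```python
import torch
import torch.nn as nn

class ActionSequenceGenerator(nn.Module):
    """Illustrative RNN-based action sequence generator (cf. equations (1)-(6))."""

    def __init__(self, state_dim, action_dim, hidden_dim=128, horizon=4):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.Linear(state_dim, hidden_dim)           # z_t = encoder(s_t), eq. (1)
        self.init_h = nn.Linear(hidden_dim, hidden_dim)           # h_t = h(z_t), eq. (2)
        self.mu = nn.Linear(hidden_dim, action_dim)               # mean of the first-action distribution
        self.log_std = nn.Linear(hidden_dim, action_dim)          # (log) std of the first-action distribution
        self.rnn = nn.GRUCell(action_dim, hidden_dim)             # h_{t+i} = RNN(h_{t+i-1}, a_{t+i-1}), eq. (5)
        self.g = nn.Linear(hidden_dim + action_dim, action_dim)   # residual term g() of eq. (6)

    def forward(self, state):
        z = torch.relu(self.encoder(state))                       # eq. (1)
        h = torch.tanh(self.init_h(z))                            # eq. (2)
        mu = self.mu(z)
        noise = torch.randn_like(mu)                              # n_noise ~ N(0, 1)
        a = mu + torch.exp(self.log_std(z)) * noise               # eq. (3), reparameterized
        actions = [a]
        for _ in range(self.horizon - 1):
            h = self.rnn(a, h)                                    # eq. (5)
            # Skip connection of eq. (6): a_{t+i} = a_{t+i-1} + g(...), which embeds
            # an action-repetition prior when g is initialized near zero.
            a = a + self.g(torch.cat([h, a], dim=-1))
            actions.append(a)
        return torch.stack(actions, dim=1)                        # [batch, horizon, action_dim]
```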
In an alternative example, on the basis of the embodiment shown in fig. 2 described above, the step 202 of the present disclosure may include: inputting the current state and the first action sequence into a state action sequence value network model obtained by pre-training to obtain a first state action sequence value corresponding to each first action in the first action sequence; the network architecture of the state action sequence value network model is established based on a recurrent neural network.
In an optional example, the state action sequence value network model is implemented based on a recurrent neural network, such as an LSTM (Long Short-Term Memory) implementation, and may be specifically set according to actual requirements.
In an alternative example, FIG. 6 is an exemplary architectural diagram of a state action sequence value network model of the present disclosure. The state s_t at time step t is encoded to obtain the initial input of the RNN at time step t, and the action sequence is τ = [a_t, a_{t+1}, a_{t+2}, …, a_{t+n-1}]. The output of the RNN unit corresponding to each action is decoded by a decoder to obtain a decoding result. For the first action a_t, since no state action sequence value of a previous step is available, an independent decoder is used to obtain the corresponding state action sequence value Q(s_t, a_t). The remaining decoders share parameters; that is, starting from the second action, the sum of the decoder output corresponding to action a_{t+i} (i ∈ [1, n-1]) and the state action sequence value Q(s_t, a_t, a_{t+1}, …, a_{t+i-1}) corresponding to the previous action a_{t+i-1} is used as the state action sequence value Q(s_t, a_t, a_{t+1}, …, a_{t+i}) corresponding to that action, where Q(s_t, a_t, a_{t+1}, …, a_{t+i-1}) refers to the state action sequence value of the action sequence [a_t, a_{t+1}, …, a_{t+i-1}] ending with the previous action.
In an alternative example, in the state action sequence value network model, the decoder corresponding to the first action a_t can be implemented by a normal-distribution projection network, and the decoders corresponding to the other actions can be implemented by a multilayer perceptron (MLP). In practical applications, other implementations can be adopted according to actual requirements, and the present disclosure is not limited in this respect.
In an optional example, the value of the state action sequence can be generated simultaneously through two state action sequence value network models, and the smaller value of the two state action sequence values is selected as the value of the state action sequence corresponding to each action, so that the problem of value overestimation can be solved.
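For illustration, a PyTorch sketch of a state action sequence value network following the structure described for FIG. 6 is given below. The linear decoding heads stand in for the normal-distribution projection network and MLP decoders mentioned above, and all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class StateActionSequenceValueNet(nn.Module):
    """Illustrative cumulative value network: Q(s_t, a_t, ..., a_{t+i}) is built
    by adding a shared-head increment to the value of the previous prefix."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Linear(state_dim, hidden_dim)
        self.lstm = nn.LSTM(action_dim, hidden_dim, batch_first=True)
        self.first_head = nn.Linear(hidden_dim, 1)    # independent decoder for a_t
        self.shared_head = nn.Linear(hidden_dim, 1)   # shared decoder for a_{t+1}, ...

    def forward(self, state, actions):
        # state: [batch, state_dim]; actions: [batch, n, action_dim]
        z = torch.tanh(self.encoder(state))
        h0 = z.unsqueeze(0)                           # initial hidden state from s_t
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(actions, (h0, c0))         # one output per action
        q = [self.first_head(out[:, 0])]              # Q(s_t, a_t)
        for i in range(1, actions.size(1)):
            # Q(s_t, ..., a_{t+i}) = Q(s_t, ..., a_{t+i-1}) + shared_head(out_i)
            q.append(q[-1] + self.shared_head(out[:, i]))
        return torch.cat(q, dim=-1)                   # [batch, n] prefix values
```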
According to the method, the state action sequence value function is realized through the state action sequence value model based on the recurrent neural network, so that the learning can be more effectively realized through data collected by interacting with the environment, the model-free reinforcement learning is convenient to realize, and the performance of the method is effectively improved.
In an alternative example, fig. 7 is a flowchart illustrating a method for determining an action of an agent according to another exemplary embodiment of the present disclosure, and as shown in fig. 7, before step 201, the method of the present disclosure further includes the following steps:
step 301, establishing an action sequence generation network corresponding to the action sequence generator and a state action sequence value network corresponding to the state action sequence value network model.
Step 302, placing the action sequence generation network and the state action sequence value network in a target learning environment for reinforcement learning training, and obtaining an action sequence generator and a state action sequence value network model when the action sequence generation network meets a preset training end condition.
In an optional example, the action sequence generation network is a neural network based on a recurrent neural network (RNN), and the state action sequence value network is also a neural network based on an RNN. The specific network structures of the two networks can be set according to actual requirements; for example, the action sequence generation network may be implemented using a GRU network, which is a simplification of the LSTM and a variant of the LSTM network, and the state action sequence value network may be implemented using an LSTM network.
In an alternative example, the action sequence generator and the state value network model may be trained jointly or separately.
In an alternative example, the optimization training of the action sequence generator can be performed by maximizing value, so that the generated action sequence can be regarded as a beneficial action sequence that reaches a high-value region. Temporally coordinated exploration can therefore be conducted using the multi-step actions in the action sequence, and the network is optimized to adapt to the task by continuous reinforcement learning starting from a rough initial action sequence generation network, which benefits future exploration and is more effective than the commonly used action repetition strategy. In addition, the multi-step actions in the action sequence can be regarded as the agent's intention over a period of time from the present into the future, and thus provide more information and intuitive interpretation signals, further improving the effectiveness of target action determination.
In an alternative example, the parameters of the action sequence generator π_θ are optimized as follows:
J_π(θ) = E_{s~D, l~U(1,L), τ~π_θ(s)} [ -Q(s, τ_l) ]    (7)
where θ is a learnable parameter of the action sequence generator; τ_l represents the first l actions of the action sequence τ, i.e., the part of length l starting from the beginning of τ; l ~ U(1, L) indicates that l is a random variable following a uniform distribution, with L being the maximum length of the action sequence τ; and D represents the data set (replay buffer) obtained through interaction of the agent with the environment during training. That is, the negative state action sequence value is taken in expectation over the random variables s, τ and l.
The state action sequence value function Q(s_t, τ_l) is defined as:
Q(s_t, τ_l) = E[ Σ_{i=1}^{l} γ^{i-1} r_{t+i} + γ^l V(s') ]    (8)
where γ is a discount factor, r_{t+1} = R(s_t, a_t), R represents the reward function, r_{t+i} (i ∈ [1, l]) is the immediate reward corresponding to action a_{t+i-1}, s_{t+1} ~ P(s_t, a_t) denotes the probability of transitioning from state s_t to s_{t+1} when action a_t is executed, s' ~ T(s_t, τ_l) denotes the probability of transitioning from state s_t to s' when the action sequence τ_l is executed, and V(s') is the state value function of s':
V(s') = E_{τ'~π(s')} Q(s', τ')    (9)
where τ' represents the action sequence generated by the action sequence generator based on the state s'.
In an optional example, the state action sequence value network is parameterized as Q_φ, and Q_φ can be trained using the TD-learning method of reinforcement learning, specifically as follows:
J_Q(φ) = E[ (Q_φ(s, τ_l) - T(R, s'))^2 ]    (10)
where φ represents the parameters to be learned by the state action sequence value network; J_Q(φ) represents the cost function corresponding to the state action sequence value network; Q_φ(s, τ_l) represents the state action sequence value of the length-l prefix of the action sequence τ generated based on the state s under the parameters φ (also referred to in this disclosure as the state action sequence value corresponding to the l-th action a_{t+l-1}, i.e., the action of time step t+l-1, in the action sequence τ); and T(R, s') represents the reference state action sequence value calculated based on the reward function R, which serves as the comparison target for the state action sequence value produced by the network and guides the training of the state action sequence value network so as to continuously optimize the parameters φ:
T(R, s') = Σ_{i=1}^{l} γ^{i-1} r_{t+i} + γ^l V'(s')    (11)
where γ is a discount factor, V' represents the target state value function, and V'(s') represents the target state value corresponding to the state s', which can be calculated as follows:
V'(s') = E_{τ'~π(s')} Q_{φ'}(s', τ')    (12)
where τ' ~ π(s') represents the action sequence τ' generated by the action sequence generation network based on the state s', Q_{φ'}(s', τ') represents the state action sequence value obtained by the state action sequence value network under the parameters φ' based on the state s' and the action sequence τ', and φ' represents parameters periodically soft-copied from φ.
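A sketch of the corresponding TD training step, written in PyTorch purely for illustration, is shown below. It assumes the replay buffer stores, for each sample, the state s, the executed prefix τ_l, the l immediate rewards, and the resulting state s'; the full-sequence value of a freshly generated τ' is used as an approximation of V'(s') in equation (12).

```python
import torch

def critic_loss(q_net, target_q_net, generator, batch, gamma=0.99):
    """Illustrative TD loss following equations (10)-(12)."""
    s, tau_l, rewards, s_next = batch                  # rewards: [batch, l]
    l = rewards.size(1)

    with torch.no_grad():
        tau_next = generator(s_next)                                 # tau' ~ pi(s')
        v_next = target_q_net(s_next, tau_next)[:, -1]               # stand-in for V'(s'), eq. (12)
        discounts = gamma ** torch.arange(l, dtype=rewards.dtype)    # [1, gamma, ..., gamma^(l-1)]
        target = (discounts * rewards).sum(dim=1) + (gamma ** l) * v_next   # T(R, s'), eq. (11)

    q_pred = q_net(s, tau_l)[:, -1]                    # Q_phi(s, tau_l)
    return torch.mean((q_pred - target) ** 2)          # J_Q(phi), eq. (10)
```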
In an alternative example, FIG. 8 is an exemplary flowchart of step 302 of the present disclosure. As shown in FIG. 8, step 302 includes the following steps:
step 3021, at each time step of the training process, obtaining a current training state of the agent at the current time step.
Step 3022, inputting the current training state into the action sequence generation network to obtain a first training action sequence, where the first training action sequence includes a first training action of at least one time step.
Step 3023, determining a reward function value corresponding to the first training action at each time step based on the first training action sequence, and determining a reference training state action sequence value corresponding to each first training action based on the reward function value corresponding to the first training action at each time step.
And step 3024, inputting the current training state and the first training action sequence into the state action sequence value network to obtain a first training state action sequence value corresponding to each first training action in the first training action sequence.
And step 3025, judging whether the current cost value meets a preset condition based on the reference training state action sequence value, the first training state action sequence value and a preset cost function.
And step 3026, if the current cost value meets the preset condition, ending the training to obtain the action sequence generator and the state action sequence value network model.
And step 3027, if the current cost value does not meet the preset condition, updating the action sequence according to the parameter updating rule to generate a first parameter of the network and a second parameter of the state action sequence value network.
Step 3028, determining a target training motion sequence based on the first training motion sequence and the candidate training motion sequence, and determining a target training motion based on the target training motion sequence.
And step 3029, after the agent executes the target training action, entering a next time step, taking the next time step as a current time step, acquiring a new state of the agent as a current state, and continuing the reinforcement learning training on the updated action sequence generation network and the state action sequence value network according to the steps until the cost value meets the preset condition.
In an alternative example, in addition to the first parameter θ of the action sequence generation network and the second parameter φ of the state action sequence value network, the training process also needs to learn a randomness adjustment parameter α (for adjusting the randomness of the action sequence generator). The learning of α is implemented based on the cost function J(α):
J(α) = E[ -α log π_0(a_t | s_t) - α H̄ ]    (13)
The parameter α is updated along with the other parameters during training, where log π_0(a_t | s_t) corresponds to the entropy term of the action at the first time step and H̄ denotes the target entropy.
In an optional example, determining the target training action sequence in step 3028 based on the first training action sequence and the candidate training action sequence requires a switching rule, i.e., a condition under which the first training action sequence is taken as the target training action sequence, the candidate training action sequence being taken as the target training action sequence otherwise. In order to continuously optimize this switching rule during training, the present disclosure introduces an action sequence switching adjustment parameter, denoted by ε. This parameter is also learned during training to obtain an optimal value, specifically through the corresponding cost function J(ε):
J(ε) = ε · (l_commit - l_commit_target)    (14)
where l_commit_target is a preset target execution length for action sequences, and l_commit is obtained by computing an exponential moving average of the lengths of the action sequences actually executed, i.e., an exponential moving average over the execution lengths (numbers of executed actions) of all historically executed action sequences.
The action sequence switching adjustment parameter is also updated when the parameter of step 3027 is updated as described above.
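A small Python sketch of this bookkeeping is shown below for illustration; the decay rate beta is an assumed hyper-parameter and is not specified in the present disclosure.

```python
def update_commit_length(l_commit, executed_len, beta=0.99):
    """Exponential moving average over the numbers of actions actually executed
    from the historically executed action sequences (l_commit in eq. (14))."""
    return beta * l_commit + (1.0 - beta) * executed_len

def switching_adjustment_loss(epsilon, l_commit, l_commit_target):
    """Cost function J(epsilon) of equation (14)."""
    return epsilon * (l_commit - l_commit_target)
```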
In an alternative example, the parameter update of step 3027 is specifically as follows:
θ^(t+1) = θ^(t) - λ ∇_θ J_π(θ^(t))    (15)
φ^(t+1) = φ^(t) - λ ∇_φ J_Q(φ^(t))    (16)
α^(t+1) = α^(t) - λ ∇_α J(α^(t))    (17)
ε^(t+1) = ε^(t) - λ ∇_ε J(ε^(t))    (18)
where the superscript (t+1) denotes the parameter value after the update, the superscript (t) denotes the parameter value before the update, J denotes the corresponding cost function (the cost functions corresponding to the different parameters can be set according to actual requirements and are not limited to the specific forms given above in the present disclosure), λ is a preset update step size, and ∇ denotes the gradient.
In an alternative example, to train the action sequence generator, a re-parameterization technique may be employed, and the gradient calculation exploits the differentiability of the action sequence generator:
∇_θ J_π(θ) = E_{s~D, l~U(1,L)} [ -∇_{τ_l} Q(s, τ_l) · ∇_θ τ_l(z; θ) ]    (19)
where z represents the input from which τ is generated, i.e., the encoding z obtained from the state s (corresponding to s_t and z_t in FIG. 4); for the other symbols, refer to the description above, which is not repeated here.
The present disclosure generates temporally extended action sequences through the RNN-based action sequence generator and the RNN-based state action sequence value network model, and switches to a newly generated action sequence only when a specific condition is met through competition between the newly generated action sequence and the old action sequence. The method therefore has flexible, intentional action sequences and temporally extended exploration (this kind of temporally extended, intentional exploration is referred to as temporally coordinated exploration), going beyond simply repeating the same action. Using action sequences to conduct temporally coordinated exploration induces a more flexible form of intentional action sequences, allows useful action-sequence forms to be explored in a lightweight manner, and retains the overall model-free property of the whole algorithm. It effectively combines the complementary advantages and underlying connection between the simple and effective action-repetition strategy of existing model-free reinforcement learning and the more flexible and explicit model-based reinforcement learning, and addresses the problems that existing model-free reinforcement learning requires a large number of samples and thus has low sample efficiency, while model-based reinforcement learning has high sample efficiency but high computational requirements and needs access to a model.
In an alternative example, fig. 9 is an exemplary flowchart of step 203 of the present disclosure, and as shown in fig. 9, on the basis of the above embodiment shown in fig. 2, step 203 of the present disclosure may include the following steps:
step 2031, obtaining the action sequence switching adjustment parameters obtained by learning.
Step 2032, based on the first state action sequence value corresponding to each first action in the first action sequence, the second state action sequence value corresponding to each second action in the candidate action sequence, the learned action sequence switching adjustment parameter and the preset action sequence switching rule, determining the target action sequence.
In an optional example, the action sequence switching adjustment parameter is the parameter ε obtained by the above learning; after learning is finished, it can be stored together with the action sequence generator and the state action sequence value network model and acquired from the corresponding storage area when needed.
In an alternative example, the preset action sequence switching rule may be set such that when the switch value replan is 1, switching to the first action sequence τ_new is performed, and otherwise the candidate action sequence τ_old continues to be executed, where replan follows a categorical distribution, i.e., replan ~ Categorical(y). Here y is computed from the state action sequence value Q(s_t, τ_new,m) of τ_new,m, the action sequence formed by the first m actions of the first action sequence τ_new, the state action sequence value Q(s_t, τ_old) of the candidate action sequence, and the learned action sequence switching adjustment parameter ε, where m represents the length of the candidate action sequence τ_old. Since the candidate action sequence is formed by the actions remaining from the action sequence executed at the previous time step, its length is smaller than that of the first action sequence; therefore, when y is computed, the state action sequence values are taken over the length of the candidate action sequence, and ε is the learned action sequence switching adjustment parameter. That is, the preset action sequence switching rule may be expressed as follows:
τ_actual = (1 - replan) · τ_old + replan · τ_new
It is understood that if all actions in the action sequence executed at the previous time step have been executed, the candidate action sequence is an empty set, and replan is then 1.
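One consistent reading of this switching rule is sketched in Python below; the exact form of y used in the disclosure may differ, and the threshold comparison with ε shown here is an assumption.

```python
def choose_target_sequence(q_new, q_old, tau_new, tau_old, epsilon):
    """tau_actual = (1 - replan) * tau_old + replan * tau_new, with replan = 1
    when the candidate sequence is empty or the new sequence's value over the
    first m steps beats the candidate's value adjusted by epsilon."""
    if len(tau_old) == 0:                 # previous sequence fully executed
        return tau_new
    m = len(tau_old)
    replan = 1 if q_new[m - 1] > q_old[m - 1] + epsilon else 0
    return tau_new if replan == 1 else tau_old
```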
In an alternative example, on the basis of the embodiment shown in fig. 2, the step 204 of the present disclosure may include: and determining the target action which needs to be executed currently by the intelligent agent based on the action corresponding to the current time step in the target action sequence and a preset mapping rule, wherein the preset mapping rule is a rule describing the mapping relation between the action in the action sequence and the action to be executed.
In an alternative example, the preset mapping rule may be set according to the requirements of the actual scenario. Denoting the target action sequence by τ_actual and the preset mapping rule by ω(), the target action is a_{t,actual} = ω(τ_actual).
In an alternative example, when the determined target action sequence is the first action sequence, namely:
τ_actual = τ_new = [a_{t,new}, a_{t+1,new}, a_{t+2,new}, …, a_{t+n-1,new}]
if the actions in τ_new are actions that can be directly executed by the agent, the target action is:
a_{t,actual} = ω(τ_actual) = a_{t,new}
If the actions in τ_new cannot be directly executed by the agent, a corresponding mapping conversion needs to be performed based on the preset mapping rule to obtain an action that the agent can execute. For example, in the automatic driving scenario, τ_actual is a sequence of position coordinates of set points to be reached in the future, and ω(τ_actual) can then be implemented as shown in FIG. 10, which is an exemplary flowchart of the preset mapping rule of the present disclosure. In FIG. 10, the set point coordinates are a_{t,new} = (x_1, y_1) and the current position coordinates of the agent are (x_0, y_0); a target direction and a target speed are determined from (x_0, y_0) and (x_1, y_1) and transmitted to a PID controller, which generates control signals such as throttle, steering, braking, and reversing, thereby obtaining the target action a_{t,actual} that the agent can execute.
It should be noted that the preset mapping rules may differ in different scenarios; FIG. 10 only exemplifies one implementation in an automatic driving scenario, and the present disclosure is not limited to this scenario, let alone to this implementation.
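For the driving example of FIG. 10, the mapping ω() from a set point to an executable control signal can be sketched as follows. The snippet is illustrative only: the gains are arbitrary, and the PID controller of the example is reduced to proportional terms for brevity.

```python
import math

def setpoint_to_control(current_pos, setpoint, dt=0.1):
    """Map the first set point of the target action sequence to a control signal."""
    x0, y0 = current_pos
    x1, y1 = setpoint
    target_direction = math.atan2(y1 - y0, x1 - x0)       # heading towards the set point
    target_speed = math.hypot(x1 - x0, y1 - y0) / dt      # speed needed to reach it within dt

    # Proportional stand-ins for the PID controller of the example.
    k_steer, k_throttle = 0.5, 0.1
    steering = k_steer * target_direction
    throttle = k_throttle * target_speed
    return {"throttle": throttle, "steering": steering, "brake": 0.0}
```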
In an optional example, after the target action is determined and executed, the first action of the target action sequence, corresponding to a_{t,actual}, has been performed, and the remaining unexecuted actions, shifted forward by one time step, serve as the candidate action sequence for the next time step.
In an optional example, after determining, in step 204, a target action that the agent currently needs to perform based on the target action sequence, the method further includes: and taking the next time step entered by the intelligent agent as the current time step, acquiring the new state of the intelligent agent as the current state, and continuously determining the action to be executed for the intelligent agent based on the current state.
After the agent completes a target action, the state update is realized once, the time step also enters the next time step from the current time step, and after entering the next time step, the next time step becomes the current time step of the agent.
According to the method, the intelligent agent executes the target action to realize the state conversion along with the time step conversion, and further determines the target action to be executed by the intelligent agent again based on the new current state of the intelligent agent, so that the continuous action of the intelligent agent is realized.
The following is a brief description of existing model-free reinforcement learning and model-based reinforcement learning to more clearly demonstrate the beneficial effects of the method of the present disclosure:
1. continuous control of existing model-free reinforcement learning
Existing model-free reinforcement learning algorithms such as DDPG, SAC, etc. mainly map a state s to an action a, which is called a strategy pi(s) → a. The strategy is usually modeled by a neural network, the trainable parameters of which are denoted as θ', and the optimization is performed as follows:
max_{θ′} E_{s∼D}[ Q′(s, π_{θ′}(s)) ]        (20)
where Q′(s, a) denotes the standard state-action value function in reinforcement learning, and D denotes the data set (replay buffer) obtained from the agent's interaction with the environment during training.
A policy of the form of equation (20) generates a single-step action at a time; while simple and flexible, it may be ineffective during exploration, producing jittery behavior that limits its ability to escape local optima and wastes much of the exploration effort.
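For reference, the sketch below shows a generic PyTorch-style actor update corresponding to the objective in equation (20); it is a simplified DDPG/SAC-flavored illustration under the assumption of a deterministic policy, and is not the training procedure of the present disclosure.

```python
import torch

def actor_loss(policy, q_network, states):
    """Generic form of equation (20): maximize E_{s~D}[Q'(s, pi_{theta'}(s))],
    implemented as gradient descent on the negative Q-value."""
    actions = policy(states)                # single-step actions a = pi_{theta'}(s)
    q_values = q_network(states, actions)   # Q'(s, a)
    return -q_values.mean()

# Usage sketch: `states` is a batch sampled from the replay buffer D.
# loss = actor_loss(policy, q_network, states); loss.backward(); optimizer.step()
```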
2. Model-based reinforcement learning
In model-based reinforcement learning, the optimized action at a particular time step is obtained by minimizing a cost function C online. To make better use of the temporal structure of the cost function, it is common to optimize a sequence of actions τ = [a_0, a_1, a_2, …] over a period of time:
min_τ Σ_i C(ŝ_i, a_i)        (21)

where ŝ_i denotes the state predicted by a dynamics model when the action sequence τ is rolled out.
The approach of equation (21) can generate action sequences that are more intentional and flexible than repeated actions, since they are optimized at the sequence level. However, it requires additional models to be learned and the solution has to be optimized online, so the computational cost is high and there is a large gap between this kind of method and model-free RL methods, which makes exploring with a planned action sequence within model-free RL more challenging.
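As an illustration of the sequence-level optimization in equation (21), the sketch below solves it approximately by random shooting with a learned dynamics model; the model, the cost function and the sampling range are assumptions used only to show the structure of such online planners.

```python
import numpy as np

def plan_action_sequence(state, dynamics_model, cost_fn,
                         horizon=10, n_candidates=256, action_dim=2):
    """Approximately solve min_tau sum_i C(s_i, a_i) by random shooting."""
    candidates = np.random.uniform(-1.0, 1.0,
                                   size=(n_candidates, horizon, action_dim))
    total_costs = np.zeros(n_candidates)
    for k in range(n_candidates):
        s = state
        for a in candidates[k]:
            total_costs[k] += cost_fn(s, a)   # accumulate the cost C(s_i, a_i)
            s = dynamics_model(s, a)          # roll the learned model forward
    return candidates[np.argmin(total_costs)] # best action sequence found online
```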
It can be seen that model-based planning can generate more flexible and adaptive future actions but departs considerably from model-free reinforcement learning, while action repetition is more efficient but the future actions it generates adapt poorly. Based on this, the method provided by the disclosure combines the advantages and the internal relation of existing model-free reinforcement learning and model-based reinforcement learning: without deviating massively from the model-free reinforcement learning algorithm, it realizes flexible future action sequences beyond simple repetition, performs temporally extended exploration during learning, generates an intentional action sequence at each time step to ensure flexibility, and combines this with action repetition, thereby effectively improving performance. Verification shows that the disclosed method can find high-value states more quickly, further improving sample efficiency.
In an alternative example, the performance comparison of the disclosed method with existing methods may be demonstrated based on the GYM tasks (GYM is a toolkit for developing and comparing reinforcement learning algorithms), the CARLA task (CARLA is an open-source autonomous driving simulator), and the like. The GYM tasks may include, but are not limited to, the tasks shown in Table 1, with each task requiring the corresponding GYM environment.
TABLE 1
Task GYM environment
Pendulum Pendulum-v0
InvertedPendulum InvertedPendulum-v2
InvertedDoublePendulum InvertedDoublePendulum-v2
MountainCarContinuous MountainCarContinuous-v0
CartPoleSwingUp CartPoleSwingUp-v0
BipedalWalker BipedalWalker-v2
The CARLA task environment is implemented based on the open-source CARLA simulator. The specific setup of the CARLA task is explained below:
1. Sensors and acquired observation signals
The CARLA task uses the following types of observations:
a camera: the front camera collects RGB images of 64 × 64 × 3 size.
An IMU sensor: a 7D vector comprising accelerometer (3D), gyroscope (3D) and compass (1D) readings, providing a measurement of part of the current vehicle state.
A speed sensor: a 3D vector comprising a velocity (in m/s) relative to its own coordinates (ego-coordinate).
Navigation: a set of 8 waypoint locations d meters from the host vehicle along the route, d ∈ {1, 3, 10, 30, 100, 300, 1000, 3000} meters. The waypoint position is represented in 3D self coordinates (ego-coordinate) and the sensor reading is a 24D vector.
Goal: a 3D position in ego coordinates representing the destination position at which the vehicle should arrive and stop.
Last action: a 2D vector representing the action taken at the previous time step.
In this example, the time difference between the two time steps is set to 0.05s, i.e. the sensing and control frequency is 20 Hz.
2. Network structure of CARLA task
Fig. 11 is an exemplary structural diagram of the network structure for the CARLA task of the present disclosure. The figure shows a common structure shared by all methods to be compared, including the method of the present disclosure; when the method of the present disclosure is implemented, the Actor and Critic networks adopt the aforementioned action sequence generator and state action sequence value network structures, which are not repeated here. In this example, given an observation containing multi-modal signals, they first need to be fused into a low-dimensional feature using an Observation Encoder, which can be implemented by processing each signal modality with a dedicated encoder having the same output dimension. More specifically, the visual encoder may be implemented with a ResNet-18-based convolutional neural network (CNN), while the other modal signals are encoded by fully connected encoders (FC). The outputs of the encoders are fused together by an addition operator and then passed through a fully connected layer (FC1) to generate the feature vector used as input to the downstream networks. The observation encoder is shared by the Actor and Critic networks and is trained only by value learning; the gradient from the Actor is truncated and does not directly affect the training of this network structure.
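The following PyTorch sketch illustrates the fusion scheme just described (per-modality encoders with a common output width, additive fusion, then FC1); the layer sizes and input dimensions are assumptions chosen to match the sensor description above, not values taken from the disclosure.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ObservationEncoder(nn.Module):
    """Per-modality encoders with equal output width, fused by addition,
    then projected by a fully connected layer (FC1)."""
    def __init__(self, feature_dim=256, imu_dim=7, speed_dim=3, nav_dim=24,
                 goal_dim=3, last_action_dim=2):
        super().__init__()
        self.vision_encoder = resnet18(num_classes=feature_dim)  # CNN visual encoder
        self.fc_encoders = nn.ModuleList(
            nn.Linear(d, feature_dim)
            for d in (imu_dim, speed_dim, nav_dim, goal_dim, last_action_dim))
        self.fc1 = nn.Linear(feature_dim, feature_dim)

    def forward(self, image, low_dim_signals):
        feat = self.vision_encoder(image)            # (B, feature_dim)
        for enc, x in zip(self.fc_encoders, low_dim_signals):
            feat = feat + enc(x)                     # additive fusion of modalities
        return self.fc1(feat)                        # feature vector for Actor/Critic
```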
Illustratively, the network structure of the observation encoder is shown in table 2 below:
TABLE 2
Network: Observation encoder [256]
Batch size: 128
Action sequence length: 20
Number of parallel actors: 4
3. CARLA controller and action space
Fig. 12 is an exemplary structural schematic diagram of the CARLA environment and its controller interface of the present disclosure. The current position p_0 of the vehicle and a set point (i.e., target position) p = (x1, x2) are taken as input, specifying the location to go to; for autonomous driving this can be expressed in 2D spatial coordinates, so the action space has two dimensions (x: lateral, y: longitudinal), where (x, y) ∈ [−20, 20] × [−20, 20]. The agent controls the movement of the vehicle by generating a 2D action as a set point; a target direction and a target speed can be calculated from the set point and the current position and passed to the PID controller to generate control signals including throttle, steering, braking and reverse. The simulator uses these control signals to generate the next state of the vehicle.
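A possible realization of this set point interface is sketched below: a target direction and target speed are computed from the 2D set point and handed to PID controllers that output throttle, steering, braking and reverse. The PID gains, the speed cap and the reverse heuristic are illustrative assumptions, not values from the disclosure.

```python
import math

class PID:
    """Minimal PID controller; the gains are illustrative, not tuned values."""
    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def setpoint_to_controls(current_pos, current_heading, current_speed,
                         setpoint, steer_pid, speed_pid, max_speed=10.0):
    """Map a 2D set point to (throttle, steer, brake, reverse) control signals."""
    dx, dy = setpoint[0] - current_pos[0], setpoint[1] - current_pos[1]
    target_direction = math.atan2(dy, dx)
    target_speed = min(math.hypot(dx, dy), max_speed)    # capped, illustrative

    # wrap the heading error into [-pi, pi] before feeding the steering PID
    heading_error = math.atan2(math.sin(target_direction - current_heading),
                               math.cos(target_direction - current_heading))
    steer = max(-1.0, min(1.0, steer_pid.step(heading_error)))

    accel = speed_pid.step(target_speed - current_speed)
    throttle, brake = (min(accel, 1.0), 0.0) if accel > 0 else (0.0, min(-accel, 1.0))
    reverse = abs(heading_error) > math.pi / 2           # set point behind the car
    return throttle, steer, brake, reverse
```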
In an alternative example, as part of the environment and in order to facilitate the emergence of braking behavior, the set point p may additionally be processed as follows before the control signals are calculated:

p̃ = p   if ‖p − p_0‖ ≥ δ,   and   p̃ = p_0   otherwise

where ‖p − p_0‖ denotes the distance between the set point p and the current position p_0 of the vehicle, and δ is a preset parameter, e.g. δ = 3 in this example; the processed set point p̃ is then used as the set point for generating the control signals.
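Under the reconstruction of this preprocessing given above, a direct Python transcription looks as follows (δ = 3 as in the example); the exact form of the rule is inferred from the surrounding text and should be read as an assumption.

```python
import math

def preprocess_setpoint(setpoint, current_pos, delta=3.0):
    """Replace a set point lying within `delta` meters of the vehicle with the
    current position, so that the downstream controller commands braking."""
    distance = math.dist(setpoint, current_pos)   # ||p - p_0||
    return setpoint if distance >= delta else tuple(current_pos)
```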
4. Reward function
The reward function in the CARLA environment contains the following parts:
route distance reward: the reward calculation mode is the distance traveled along the route from the previous time step to the current time step;
collision reward: computed as min(collision_penalty, 0.5 · max(0, episode_reward)), where collision_penalty = 20 and episode_reward denotes the cumulative reward obtained so far in the current episode; consecutive collision events are processed only once;
red light reward: computed as min(red_light_penalty, 0.3 · max(0, episode_reward)), where red_light_penalty = 20; consecutive red-light violation events are processed only once.
Success reward: if the current position of the vehicle is within 5 meters of the target position and the speed is less than 10⁻³ m/s, a success reward of 100 is given (a combined sketch of these reward terms is shown after this list).
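The sketch below combines the reward terms listed above into a single function; the sign conventions (penalties subtracted) and the aggregation into one scalar are assumptions beyond what the text states explicitly.

```python
def carla_reward(route_progress, collided, ran_red_light, episode_reward,
                 dist_to_goal, speed,
                 collision_penalty=20.0, red_light_penalty=20.0):
    """Illustrative combination of the reward terms described above."""
    reward = route_progress                       # distance travelled along the route
    if collided:                                  # consecutive collisions count once
        reward -= min(collision_penalty, 0.5 * max(0.0, episode_reward))
    if ran_red_light:                             # consecutive violations count once
        reward -= min(red_light_penalty, 0.3 * max(0.0, episode_reward))
    if dist_to_goal < 5.0 and speed < 1e-3:       # success bonus
        reward += 100.0
    return reward
```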
5. Round end conditions
The round will end if any of the following conditions are met:
the success is as follows: the vehicle is located within 5m from the target position and has a speed less than 10-3m/s。
The maximum number of time steps K allowed is reached: the maximum number of time steps K is calculated as follows
K = route_length / (min_velocity × Δt)

where route_length denotes the length of the route, min_velocity can be set according to actual requirements (for example 5 m/s), and Δt is the time difference between two time steps (0.05 s in this example). When an episode ends for this reason, bootstrap-based state value learning can be performed on the final state.
Stuck in collision: the vehicle has been involved in the same collision event and has moved no more than 1 meter over 100 time steps. In this case, bootstrap-based state value learning may also be performed on the final state (see the sketch after this list).
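The termination logic can be summarized as follows, using the reconstructed formula for K; the appearance of Δt = 0.05 s in that formula is an assumption, and all names are illustrative.

```python
def max_time_steps(route_length, min_velocity=5.0, dt=0.05):
    """Reconstructed time budget: time to finish the route at min_velocity,
    expressed in control steps of dt seconds."""
    return int(route_length / (min_velocity * dt))


def episode_done(dist_to_goal, speed, step, route_length, stuck_in_collision):
    """Check the three termination conditions described above."""
    success = dist_to_goal < 5.0 and speed < 1e-3
    timeout = step >= max_time_steps(route_length)   # bootstrap the value here
    return success or timeout or stuck_in_collision
```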
6. Map and sample route
This example uses the map Town01, a representative town layout consisting of T-junctions with an area of about 410 m × 344 m. The example also spawns 20 other autopilot-controlled vehicles and 20 pedestrians to simulate typical driving conditions.
The destination location is randomly selected from a set of waypoints, and at the beginning of each round a random valid waypoint location is generated for the agent. A waypoint is valid if it is on the road and at least 10 meters away from any other participant (vehicles and pedestrians) in the world.
The task of the agent is to travel to the destination location and remain stationary after reaching it. Routes constructed in this manner can be very diverse, covering representative driving scenarios; for example, different routes may include different numbers of turns and different dynamic objects (e.g., other vehicles) at different spatial locations.
In an alternative example, in the driving task the action decoder f(·) of the action sequence generator can be implemented as:

a_{t+i} = a_{t+i−1} + Δa_{t+i−1} + g(h_{t+i−1}, a_{t+i−1}, Δa_{t+i−1})

where i ∈ [1, n−1] and Δa_{t+i−1} = a_{t+i−1} − a_{t+i−2}. In this way a linear prior is fused into f, which works well for position-based control tasks.
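A minimal PyTorch sketch of this decoder is given below; the network g and its layer sizes are assumptions, and only the residual structure with the linear prior follows the formula above.

```python
import torch
import torch.nn as nn

class LinearPriorActionDecoder(nn.Module):
    """a_{t+i} = a_{t+i-1} + Δa_{t+i-1} + g(h_{t+i-1}, a_{t+i-1}, Δa_{t+i-1})"""
    def __init__(self, hidden_dim=128, action_dim=2):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(hidden_dim + 2 * action_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim))

    def forward(self, h_prev, a_prev, a_prev2):
        delta_prev = a_prev - a_prev2                 # Δa_{t+i-1}
        residual = self.g(torch.cat([h_prev, a_prev, delta_prev], dim=-1))
        return a_prev + delta_prev + residual         # linear prior + learned correction
```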
In an alternative example, a cost term based on the maximum absolute value of the third derivative of the action sequence (taken over τ) may be added to the overall cost function to improve the smoothness of the generated action sequences and reduce jitter.
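One way to realize such a smoothness term is with third-order finite differences over the generated sequence, as sketched below; the discrete approximation and its weighting in the overall cost are assumptions.

```python
import torch

def third_derivative_penalty(action_sequence):
    """Maximum absolute third-order finite difference of an action sequence of
    shape (sequence_length, action_dim), a discrete stand-in for the third derivative."""
    d = action_sequence
    for _ in range(3):
        d = d[1:] - d[:-1]        # repeated first-order differences
    return d.abs().max()
```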
It should be noted that, in practical applications, the recurrent action sequence generator may be adjusted according to actual requirements, for example run at a coarser granularity to reduce computation cost: the action sequence generator may output several (k) anchor points, and a complete action sequence is then obtained by interpolating between them, e.g. with linear interpolation.
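A sketch of this anchor-point variant with linear interpolation follows; the uniform-in-time spacing of the anchors is an assumption.

```python
import numpy as np

def interpolate_anchors(anchors, sequence_length):
    """Linearly interpolate k anchor actions output by the generator into a
    full action sequence of the desired length."""
    anchors = np.asarray(anchors)                     # shape (k, action_dim)
    anchor_pos = np.linspace(0.0, 1.0, len(anchors))  # positions of the anchors
    query_pos = np.linspace(0.0, 1.0, sequence_length)
    return np.stack([np.interp(query_pos, anchor_pos, anchors[:, d])
                     for d in range(anchors.shape[1])], axis=-1)
```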
Since in the driving task the action sequence is a spatial trajectory expressed in the agent's own coordinates (i.e. a coordinate system with the agent's current position as the origin), the same spatial location corresponds to different coordinates when the agent is at different positions. In this case the shift operator ρ(·) takes the form:
ρ(τ) = T_world2ego( shift( T_ego2world(τ) ) )
where T_ego2world denotes the conversion of an action sequence from self-coordinates to world coordinates and T_world2ego the conversion from world coordinates back to self-coordinates. Specifically, the shift operator first converts the action sequence from self-coordinates to world coordinates, then shifts it, and finally converts the shifted sequence from world coordinates back to self-coordinates, which solves the problem that the self-coordinate frame changes as the agent moves.
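For the 2D driving case, the shift operator can be sketched as below; the (x, y, yaw) pose representation and the rigid-transform convention are assumptions, while the ego→world→shift→ego structure follows the description above.

```python
import numpy as np

def shift_operator(action_sequence, old_pose, new_pose):
    """rho(tau): express the plan in world coordinates, drop the step that was
    just executed, and re-express the remainder in the agent's new ego frame."""
    def rotation(yaw):
        return np.array([[np.cos(yaw), -np.sin(yaw)],
                         [np.sin(yaw),  np.cos(yaw)]])

    def ego2world(points, pose):
        x, y, yaw = pose
        return points @ rotation(yaw).T + np.array([x, y])

    def world2ego(points, pose):
        x, y, yaw = pose
        return (points - np.array([x, y])) @ rotation(yaw)

    tau_world = ego2world(np.asarray(action_sequence, dtype=float), old_pose)
    remaining = tau_world[1:]                 # shift: the first action was executed
    return world2ego(remaining, new_pose)     # back to the agent's new self-coordinates
```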
Fig. 13 is an overall flow diagram of the action determination for the driving scenario of the present disclosure. The horizontal direction represents different time steps developing along the time axis, and the vertical direction represents the processing flow at each time step; π denotes the action sequence generator, ρ denotes the shift operator, and ω denotes the preset mapping rule.
In an alternative example, the preset mapping rule ω may also simply take an action from the target action sequence: the action corresponding to the current time step is directly taken as the target action, regardless of whether the target action sequence contains actions that the agent can execute directly, and the processing from the target action to an executable action then belongs to the agent's own execution of the target action. For example, fig. 14 is a schematic processing flow diagram of the driving scene for executing the target action according to the present disclosure; it is consistent with the overall processing flow shown in fig. 10 and differs only in what is identified as the target action, without affecting the actual flow of the method of the present disclosure.
Table 3 shows the overall pseudo code of the action determination method of the agent based on the action sequence generator of the present disclosure; the algorithm of the present disclosure is referred to as GPM (general planning method):
TABLE 3
(The pseudo code of the GPM algorithm is provided as an image in the original publication and is not reproduced here.)
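Since the pseudo code itself is only available as an image, the following Python-style sketch restates the per-time-step logic of GPM already described in this disclosure (generate a new sequence, compare state action sequence values with the carried-over candidate, map the chosen sequence's current action, shift the remainder); the switching rule is simplified and all names are placeholders.

```python
def gpm_step(state, candidate_sequence, actor, critic, omega, rho, env):
    """One time step of the method described above (simplified sketch).
    actor, critic, omega, rho and env stand in for the action sequence generator,
    the state action sequence value model, the preset mapping rule, the shift
    operator and the environment, respectively."""
    new_sequence = actor(state)                       # generate the first action sequence
    value_new = critic(state, new_sequence)           # state action sequence values
    value_old = (critic(state, candidate_sequence)
                 if candidate_sequence is not None else None)

    # Simplified switching rule: keep the carried-over plan only while it still
    # looks at least as valuable as the freshly generated one (the disclosure
    # additionally uses a learned switching adjustment parameter).
    if value_old is not None and value_old >= value_new:
        target_sequence = candidate_sequence
    else:
        target_sequence = new_sequence

    action = omega(target_sequence)                   # executable target action
    next_state, reward, done, _ = env.step(action)    # agent executes the action
    next_candidate = rho(target_sequence)             # remaining actions, shifted
    return next_state, next_candidate, reward, done
```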
The following describes the performance comparison between the GPM algorithm of the present disclosure and several existing algorithms. The existing algorithms include the SAC (Soft Actor-Critic) algorithm, a representative model-free reinforcement learning (RL) algorithm; the EZ (EZ-greedy) algorithm, which performs temporally extended exploration; the FAR algorithm (fixed number of action-repeat steps); the DAR algorithm (dynamically adjusted number of action-repeat steps); and the TAAC algorithm (which adjusts action repetition through a switching strategy, 1step-td). SAC is used as the backbone algorithm. The comparison includes the following:
1. State visitation and its evolution during Pendulum task training
Fig. 15 is a visualization of the state visitation and its evolution during Pendulum task training for the GPM algorithm of the present disclosure and other algorithms. The labels 0-2k steps, 2k-4k steps and 4k-6k steps at the top of the figure denote the ranges of environment interaction steps during training, and fig. 15 shows all rollout trajectories interacting with the environment within each step range for each algorithm. Since the state with the highest value is (0, 0), i.e. the pendulum's angular deviation (angle) and angular velocity are both zero, an effective strategy should approach this state quickly: there should be few light-colored points on the path from the initial state (light points) to (0, 0), and most end points (dark points) should be located at (0, 0). As can be seen from fig. 15, as training progresses, the GPM algorithm of the present disclosure is able to find high-value states faster than the other algorithms.
2. Comparison of performance of algorithms across multiple tasks
Fig. 16 shows the performance curves of the GPM algorithm of the present disclosure and other algorithms. Experiments on a number of standard tasks show that the algorithms combining action repetition are clearly superior to SAC, while the algorithm of the present disclosure further improves sample efficiency and its final performance is superior to the other algorithms.
3. Exploration trajectory visualization
Fig. 17 is a schematic diagram of the exploration-trajectory visualization results of the GPM algorithm of the present disclosure and other algorithms; the SAC and EZ algorithms are shown for comparison as thumbnails. The results show that the GPM algorithm of the present disclosure effectively pushes the end-point positions as far as possible along the set route, i.e. it explores outward from the origin (0, 0). As can be seen in fig. 17, SAC expands slowly and its effective exploration range is rather limited, indicating low exploration efficiency, while GPM exhibits directed and extended exploration behavior and rapidly expands its exploration range in the initial stage of training; it further exhibits reasonable turning behavior after about 100 episodes and more complex multi-turn behavior after about 150 episodes.
Fig. 18 is a schematic diagram of how the form of the action sequences generated by the GPM algorithm of the present disclosure evolves. At the beginning of training, i.e. part (a) of the figure, the action sequences generated by GPM are almost uniformly distributed in direction (from the center point outward); in (b), action sequences advancing toward the center take a greater weight; in (c), some action sequences with shorter spatial length are generated through learning, allowing more speed adjustment; in (d), action sequences containing turns start to appear, with longer corresponding spatial length; and in (e), at the final stage of training, action sequences advancing toward the center and action sequences with turns account for most of the generated sequences.
Fig. 19 is a visualization of an action sequence generated by GPM of the present disclosure. The front camera used during training captures RGB images of size 64 × 64; for a better and clearer visualization, this example shows the GPM action sequences in a higher-resolution top view. At t1 a relatively straight action sequence is generated and executed for a number of time steps up to t2; at the next time step t3, GPM switches to a new action sequence, which is executed until a vehicle appears ahead (t4); at t5 another new action sequence is adopted, and t6 and t7 show two frames in the middle of executing the action sequence generated at t5; the old action sequence is then replaced at t8 by a new action sequence, which is executed until t9 and replaced at t10 by yet another action sequence.
In conclusion, the GPM algorithm of the present disclosure has an intrinsic mechanism for generating temporally coordinated action sequences for reinforcement-learning exploration, while retaining the main features and advantages of typical model-free reinforcement learning algorithms; it exhibits better exploration behavior and has stronger interpretability and intelligibility.
Any of the methods for determining an agent action provided by embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, a server and the like. Alternatively, the method for determining any kind of agent action provided by the embodiments of the present disclosure may be executed by a processor, for example, the processor may execute the method for determining any kind of agent action mentioned by the embodiments of the present disclosure by calling a corresponding instruction stored in a memory. And will not be described in detail below.
Exemplary devices
Fig. 20 is a schematic structural diagram of an apparatus for determining an action of an agent according to an exemplary embodiment of the present disclosure. The device of the embodiment can be used for realizing the corresponding method embodiment of the disclosure. The apparatus shown in fig. 20 includes: a generating module 501, a determining module 302, a first processing module 503, and a second processing module 504.
The generating module 501 is configured to generate a first action sequence based on the current state of the agent at the current time step, where the first action sequence includes a first action of at least one time step.
The determining module 302 is configured to determine, based on the current state and the first motion sequence generated by the generating module 501, a first state motion sequence value corresponding to each first motion in the first motion sequence, where the first state motion sequence value is a state motion sequence cost function value.
The first processing module 503 is configured to determine a target action sequence to be currently executed based on the first state action sequence value corresponding to each first action in the first action sequence determined by the determining module 302 and the second state action sequence value corresponding to each second action in the candidate action sequence, where the candidate action sequence is an action sequence formed by the remaining unexecuted actions in the action sequence executed at the previous time step.
The second processing module 504 is configured to determine a target action that the agent currently needs to perform based on the target action sequence determined by the first processing module 503.
In an alternative example, fig. 21 is an exemplary structural diagram of the generating module 501 of the present disclosure, and the generating module 501 as shown in fig. 21 may include a first acquiring unit 5011 and a first processing unit 5012.
The first obtaining unit 5011 is configured to obtain a current state of the agent at the current time step.
The first processing unit 5012 is configured to input the current state acquired by the first acquiring unit 5011 to a motion sequence generator obtained by training in advance, and generate a first motion sequence, and the motion sequence generator is a motion sequence generation model based on a recurrent neural network.
In an optional example, the determining module 302 may include a second processing unit, configured to input the current state acquired by the first acquiring unit 5011 and the first action sequence generated by the first processing unit 5012 into a state action sequence value network model obtained by pre-training, and obtain a first state action sequence value corresponding to each first action in the first action sequence; the network architecture of the state action sequence value network model is established based on a recurrent neural network.
In an alternative example, fig. 22 is a schematic structural diagram of an apparatus for determining an agent action according to another exemplary embodiment of the present disclosure. The apparatus further comprises a setup module 505 and a training module 506.
And the establishing module 505 is configured to establish an action sequence generating network corresponding to the action sequence generator and a state action sequence value network corresponding to the state action sequence value network model.
A training module 506, configured to place the action sequence generation network and the state action sequence value network established by the establishing module 505 in a target learning environment for reinforcement learning training, and obtain an action sequence generator and a state action sequence value network model when the action sequence generation network meets a preset training end condition.
In one optional example, training module 506 includes: the device comprises a second acquisition unit, a first training unit, a first determination unit, a second training unit, a judgment unit, a third processing unit, a parameter updating unit, a second determination unit and a fourth processing unit.
And the second acquisition unit is used for acquiring the current training state of the intelligent agent at each time step of the training process.
And the first training unit is used for inputting the current training state acquired by the second acquisition unit into the action sequence generation network to acquire a first training action sequence, and the first training action sequence comprises a first training action of at least one time step.
And the first determining unit is used for determining the reward function value corresponding to the first training action of each time step based on the first training action sequence obtained by the first training unit, and determining the reference training state action sequence value corresponding to each first training action based on the reward function value corresponding to the first training action of each time step.
And the second training unit is used for inputting the current training state acquired by the second acquisition unit and the first training action sequence acquired by the first training unit into the state action sequence value network to acquire a first training state action sequence value corresponding to each first training action sequence.
And the judging unit is used for judging whether the current cost value meets the preset condition or not based on the reference training state action sequence value determined by the first determining unit, the first training state action sequence value obtained by the second training unit and the preset cost function.
And the third processing unit is used for finishing the training and obtaining the action sequence generator and the state action sequence value network model if the current cost value meets the preset condition.
And the parameter updating unit is used for updating the action sequence according to the parameter updating rule to generate a first parameter of the network and a second parameter of the state action sequence value network if the current cost value does not meet the preset condition.
And the second determining unit is used for determining a target training action sequence based on the first training action sequence and the candidate training action sequence obtained by the first training unit and determining a target training action based on the target training action sequence.
And the fourth processing unit is used for entering the next time step after the intelligent agent executes the target training action determined by the second determining unit, and continuing the reinforcement learning training on the updated action sequence generation network and the state action sequence value network according to the steps until the cost value meets the preset condition.
In an alternative example, the first processing module 503 may include a third acquiring unit and a fifth processing unit.
And the third acquisition unit is used for acquiring the motion sequence switching and adjusting parameters obtained by learning.
A fifth processing unit, configured to determine a target action sequence based on the first state action sequence value corresponding to each first action in the first action sequence determined by the determining module 302, the second state action sequence value corresponding to each second action in the candidate action sequence, the action sequence switching adjustment parameter obtained by the third obtaining unit through learning, and the preset action sequence switching rule.
In an optional example, the second processing module 504 may include a sixth processing unit, configured to determine a target action that the agent needs to execute currently, based on an action corresponding to a current time step in the target action sequence and a preset mapping rule, where the preset mapping rule is a rule describing a mapping relationship between an action in the action sequence and an action to be executed.
In an optional example, the apparatus of the present disclosure may further include an obtaining module, configured to take a next time step entered by the agent as a current time step, obtain a new state of the agent as a current state, and continue to determine an action that needs to be performed for the agent based on the current state.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 23. Fig. 23 is a block diagram of the structure of an electronic device according to an embodiment of the present disclosure. As shown in fig. 23, the electronic device 60 includes one or more processors 61 and a memory 62.
The processor 61 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 60 to perform desired functions.
Memory 62 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 61 to implement the methods of determining agent actions and/or other desired functionality of the various embodiments of the present disclosure described above. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 60 may further include: an input device 63 and an output device 64, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
In one example, the input device 63 may be a microphone or a microphone array as described above for capturing an input signal of a sound source. When the electronic device is a stand-alone device, the input means 63 may be a communication network connector for receiving input signals collected by other devices.
The input device 63 may also include, for example, a keyboard, a mouse, and the like.
The output device 64 may output various information to the outside, and the output device 64 may include, for example, a display, a speaker, a printer, and a communication network and a remote output apparatus connected thereto, and the like.
Of course, for simplicity, only some of the components of the electronic device 60 relevant to the present disclosure are shown in fig. 23, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 60 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of determining the actions of an agent according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The computer program product may include program code for carrying out operations of embodiments of the present disclosure written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++ and the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method of determining agent actions according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses and systems referred to in this disclosure are only given as illustrative examples and are not intended to require or imply that the connections, arrangements and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses and systems may be connected, arranged and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A method of determining an agent action, comprising:
generating a first action sequence based on a current state of the agent at a current time step, the first action sequence including a first action for at least one time step;
determining a first state action sequence value corresponding to each first action in the first action sequence based on the current state and the first action sequence, wherein the first state action sequence value is a state action sequence value function value;
determining a target action sequence to be executed currently based on a first state action sequence value corresponding to each first action in the first action sequence and a second state action sequence value corresponding to each second action in a candidate action sequence, wherein the candidate action sequence is an action sequence formed by the rest unexecuted actions in the action sequence executed at the previous time step;
and determining the target action which needs to be executed by the intelligent agent currently based on the target action sequence.
2. The method of claim 1, wherein the generating, at the current time step, a first sequence of actions based on a current state of an agent comprises:
at the current time step, acquiring the current state of the agent;
and inputting the current state into a motion sequence generator obtained by pre-training to generate the first motion sequence, wherein the motion sequence generator is a motion sequence generation model based on a recurrent neural network.
3. The method of claim 2, wherein the determining a first state action sequence value for each first action in the first action sequence based on the current state and the first action sequence comprises:
inputting the current state and the first action sequence into a state action sequence value network model obtained by pre-training to obtain a first state action sequence value corresponding to each first action in the first action sequence; and the network architecture of the state action sequence value network model is established based on a recurrent neural network.
4. The method of claim 3, further comprising:
establishing an action sequence generation network corresponding to the action sequence generator and a state action sequence value network corresponding to the state action sequence value network model;
and placing the action sequence generation network and the state action sequence value network in a target learning environment for reinforcement learning training, and obtaining the action sequence generator and the state action sequence value network model when the action sequence generation network meets a preset training end condition.
5. The method of claim 1, wherein the determining a target sequence of actions to be currently performed based on a first state action sequence value corresponding to each first action in the first sequence of actions and a second state action sequence value corresponding to each second action in the candidate sequence of actions comprises:
acquiring an action sequence switching and adjusting parameter obtained by learning;
and determining the target action sequence based on the value of the first state action sequence corresponding to each first action in the first action sequence, the value of the second state action sequence corresponding to each second action in the candidate action sequence, the learned action sequence switching regulation parameter and a preset action sequence switching rule.
6. The method of claim 1, wherein the determining a target action that the agent currently needs to perform based on the target sequence of actions comprises:
and determining the target action which needs to be executed currently by the intelligent agent based on the action corresponding to the current time step in the target action sequence and a preset mapping rule, wherein the preset mapping rule is a rule describing the mapping relation between the action in the action sequence and the action to be executed.
7. The method of any of claims 1-6, further comprising, after the determining a target action that the agent currently needs to perform based on the target sequence of actions:
and taking the next time step entered by the intelligent agent as the current time step, acquiring a new state of the intelligent agent as the current state, and continuing to determine the action to be executed for the intelligent agent based on the current state.
8. An apparatus for determining an agent action, comprising:
a generating module, configured to generate a first action sequence based on a current state of an agent at a current time step, where the first action sequence includes a first action of at least one time step;
a determining module, configured to determine, based on the current state and the first action sequence, a first state action sequence value corresponding to each first action in the first action sequence, where the first state action sequence value is a state action sequence cost function value;
the first processing module is used for determining a target action sequence to be executed currently based on a first state action sequence value corresponding to each first action in the first action sequence and a second state action sequence value corresponding to each second action in a candidate action sequence, wherein the candidate action sequence is an action sequence formed by the residual unexecuted actions in the action sequence executed at the previous time step;
and the second processing module is used for determining the target action which needs to be executed by the intelligent agent currently based on the target action sequence.
9. A computer-readable storage medium, storing a computer program for performing the method of determining the actions of an agent according to any of claims 1 to 7.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method for determining the actions of the agent as claimed in any one of claims 1 to 7.
CN202111138932.9A 2021-09-27 2021-09-27 Method and device for determining actions of intelligent agent, electronic equipment and medium Active CN113807460B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111138932.9A CN113807460B (en) 2021-09-27 2021-09-27 Method and device for determining actions of intelligent agent, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111138932.9A CN113807460B (en) 2021-09-27 2021-09-27 Method and device for determining actions of intelligent agent, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN113807460A true CN113807460A (en) 2021-12-17
CN113807460B CN113807460B (en) 2024-05-14

Family

ID=78896916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111138932.9A Active CN113807460B (en) 2021-09-27 2021-09-27 Method and device for determining actions of intelligent agent, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN113807460B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418242A (en) * 2022-03-28 2022-04-29 海尔数字科技(青岛)有限公司 Material discharging scheme determination method, device, equipment and readable storage medium
CN114581748A (en) * 2022-05-06 2022-06-03 南京大学 Multi-agent perception fusion system based on machine learning and implementation method thereof
CN114833830A (en) * 2022-04-27 2022-08-02 北京市商汤科技开发有限公司 Grabbing method and device, electronic equipment and storage medium
CN116188648A (en) * 2023-04-27 2023-05-30 北京红棉小冰科技有限公司 Virtual person action generation method and device based on non-driving source and electronic equipment
CN117332814A (en) * 2023-12-01 2024-01-02 中国科学院自动化研究所 Collaborative agent model based on modularized network, learning method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251437A1 (en) * 2016-09-15 2019-08-15 Google Llc Control policies for robotic agents
CN111260027A (en) * 2020-01-10 2020-06-09 电子科技大学 Intelligent agent automatic decision-making method based on reinforcement learning
CN111796593A (en) * 2020-06-08 2020-10-20 北京旷视机器人技术有限公司 Robot control method and device and electronic equipment
CN112052936A (en) * 2020-07-24 2020-12-08 清华大学 Reinforced learning exploration method and device based on generation countermeasure mechanism
CN112884148A (en) * 2021-03-10 2021-06-01 中国人民解放军军事科学院国防科技创新研究院 Hybrid reinforcement learning training method and device embedded with multi-step rules and storage medium
CN113112016A (en) * 2021-04-07 2021-07-13 北京地平线机器人技术研发有限公司 Action output method for reinforcement learning process, network training method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251437A1 (en) * 2016-09-15 2019-08-15 Google Llc Control policies for robotic agents
CN111260027A (en) * 2020-01-10 2020-06-09 电子科技大学 Intelligent agent automatic decision-making method based on reinforcement learning
CN111796593A (en) * 2020-06-08 2020-10-20 北京旷视机器人技术有限公司 Robot control method and device and electronic equipment
CN112052936A (en) * 2020-07-24 2020-12-08 清华大学 Reinforced learning exploration method and device based on generation countermeasure mechanism
CN112884148A (en) * 2021-03-10 2021-06-01 中国人民解放军军事科学院国防科技创新研究院 Hybrid reinforcement learning training method and device embedded with multi-step rules and storage medium
CN113112016A (en) * 2021-04-07 2021-07-13 北京地平线机器人技术研发有限公司 Action output method for reinforcement learning process, network training method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
丁晨阳 (Ding Chenyang): "RoboCup中基于动作序列模型的动作决策" [Action decision-making based on action sequence models in RoboCup], 微型机与应用 (Microcomputer & Its Applications), No. 07, 10 April 2013 (2013-04-10) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418242A (en) * 2022-03-28 2022-04-29 海尔数字科技(青岛)有限公司 Material discharging scheme determination method, device, equipment and readable storage medium
CN114833830A (en) * 2022-04-27 2022-08-02 北京市商汤科技开发有限公司 Grabbing method and device, electronic equipment and storage medium
CN114581748A (en) * 2022-05-06 2022-06-03 南京大学 Multi-agent perception fusion system based on machine learning and implementation method thereof
CN114581748B (en) * 2022-05-06 2022-09-23 南京大学 Multi-agent perception fusion system based on machine learning and implementation method thereof
CN116188648A (en) * 2023-04-27 2023-05-30 北京红棉小冰科技有限公司 Virtual person action generation method and device based on non-driving source and electronic equipment
CN117332814A (en) * 2023-12-01 2024-01-02 中国科学院自动化研究所 Collaborative agent model based on modularized network, learning method and device

Also Published As

Publication number Publication date
CN113807460B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
US12051001B2 (en) Multi-task multi-sensor fusion for three-dimensional object detection
Hu et al. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning
Rasouli et al. Bifold and semantic reasoning for pedestrian behavior prediction
Casas et al. Implicit latent variable model for scene-consistent motion forecasting
US11702105B2 (en) Technology to generalize safe driving experiences for automated vehicle behavior prediction
JP7471397B2 (en) Simulation of various long-term future trajectories in road scenes
CN113807460A (en) Method and device for determining intelligent body action, electronic equipment and medium
CN110796692A (en) End-to-end depth generation model for simultaneous localization and mapping
KR20210074366A (en) Autonomous vehicle planning and forecasting
US11242050B2 (en) Reinforcement learning with scene decomposition for navigating complex environments
Huang et al. Goal-guided transformer-enabled reinforcement learning for efficient autonomous navigation
Huang et al. Multi-modal policy fusion for end-to-end autonomous driving
Katyal et al. Occupancy map prediction using generative and fully convolutional networks for vehicle navigation
Zhang et al. Generative planning for temporally coordinated exploration in reinforcement learning
Jin et al. Safe-Nav: learning to prevent PointGoal navigation failure in unknown environments
Lange et al. Lopr: Latent occupancy prediction using generative models
Fu et al. Action-aware encoder-decoder network for pedestrian trajectory prediction
CN117197767A (en) Vehicle track prediction method, device, electronic equipment and storage medium
Yuan et al. Drama: An efficient end-to-end motion planner for autonomous driving with mamba
Li et al. Hybrid Kalman Recurrent Neural Network for Vehicle Trajectory Prediction
EP3839830A1 (en) Trajectory estimation for vehicles
Nguyen et al. Multi‐agent trajectory prediction with adaptive perception‐guided transformers
Seiya et al. Point grid map-based mid-to-mid driving without object detection
Shamsoshoara et al. SwapTransformer: Highway overtaking tactical planner model via imitation learning on OSHA dataset
Leong Bridging the Gap Between Modular and End-to-end Autonomous Driving Systems

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant