CN114819093B - Strategy optimization method and device using environment model based on memristor array - Google Patents
Strategy optimization method and device using environment model based on memristor array
- Publication number
- CN114819093B CN202210497721.2A
- Authority
- CN
- China
- Prior art keywords
- time
- policy
- memristor array
- environment model
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
Abstract
A policy optimization method and a policy optimization apparatus using a memristor-array-based dynamic environment model. The method includes: acquiring a dynamic environment model based on a memristor array; performing predictions at multiple times according to the dynamic environment model and an object policy to obtain a data sample set including the optimization costs of the object policy corresponding to those times; and, based on the data sample set, performing a policy search using a policy gradient optimization algorithm to optimize the object policy. The method uses the memristor-array-based dynamic environment model to generate the data sample set, realizing long-term dynamic planning based on the dynamic environment model, and then performs the policy search with a relatively stable algorithm such as a policy gradient optimization algorithm, so the object policy can be optimized effectively.
Description
Technical Field
Embodiments of the present disclosure relate to a policy optimization method and a policy optimization apparatus using a memristor array-based dynamic environment model.
Background
Artificial neural networks (Artificial Neural Network, ANN) are widely applied in the modeling of dynamic systems. However, using conventional artificial neural networks for long-term task planning remains a challenge because they lack the ability to model uncertainty. The randomness (uncertainty) inherent in a real system, namely the process noise and the approximation errors introduced by data-driven modeling, can cause the long-term estimates of an artificial neural network to deviate from the actual behavior of the system. Probabilistic models provide a way to handle uncertainty: they make it possible to make informed decisions with a model's predictions while remaining cautious about the uncertainty of those predictions.
Disclosure of Invention
At least one embodiment of the present disclosure provides a policy optimization method using a memristor array-based dynamic environment model, including: acquiring a dynamic environment model based on a memristor array; carrying out multiple predictions at a plurality of moments according to the dynamic environment model and the object strategy to obtain a data sample set comprising optimization costs of the object strategy corresponding to the moments; based on the data sample set, a policy search is performed using a policy gradient optimization algorithm to optimize the object policy.
For example, in a policy optimization method provided in an embodiment of the present disclosure, obtaining the dynamic environment model includes: acquiring a Bayesian neural network, where the Bayesian neural network has a weight matrix obtained through training; obtaining a plurality of corresponding target conductance values according to the weight matrix of the Bayesian neural network and mapping the plurality of target conductance values into the memristor array; and inputting the state and a hidden input variable corresponding to time t of the dynamic system as input signals to the weight-mapped memristor array, processing the state and the hidden input variable of time t through the memristor array according to the Bayesian neural network, and acquiring from the memristor array an output signal corresponding to the processing result, from which the prediction result for time t+1 of the dynamic system is obtained.
For example, in the policy optimization method provided in an embodiment of the present disclosure, s_{t+1} = f(s_t, a_t; W, ε), where s_t is the state of the dynamic system at time t, a_t is the action of the object policy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, and s_{t+1} is the prediction result for time t+1 of the dynamic system; the action of the object policy at time t is a_t = π(s_t; W_π), where π represents the function of the object policy and W_π represents the policy parameters; the weight matrix W of the Bayesian neural network satisfies the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise ε ~ N(0, σ²).
For example, in a policy optimization method provided in an embodiment of the present disclosure, the multiple times include time 1 to time T arranged sequentially from early to late, and performing predictions at the multiple times according to the dynamic environment model and the object policy to obtain the data sample set including the optimization costs of the object policy includes: for any time t-1 from time 1 to time T, obtaining an execution action a_{t-1} from the object policy, a_{t-1} = π(s_{t-1}; W_π), the action of the object policy at time t-1; calculating the state s_t at the next time t after time t-1 according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining the cost c_t of state s_t corresponding to time t, thereby obtaining the cost sequence {c_1, c_2, …, c_t} from time 1 to time t, and obtaining the optimization cost J_{t-1} for time t based on the cost sequence, where 1 ≤ t ≤ T; and obtaining the data sample set {[a_0, J_0], …, [a_{T-1}, J_{T-1}]} from time 1 to time T.
For example, in the policy optimization method provided in an embodiment of the present disclosure, the expected value of the cost c_t at time t is E[c_t], and the optimization cost at time t can be obtained by J_{t-1} = Σ_{τ=1}^{t} E[c_τ].
For example, in the policy optimization method provided in an embodiment of the present disclosure, the cost further includes a cost change caused by aleatoric (occasional) uncertainty arising from the hidden input variables and a cost change caused by cognitive uncertainty arising from the intrinsic noise of the memristor array.
For example, in the policy optimization method provided in an embodiment of the present disclosure, the optimization cost at time t is obtained by J_{t-1} = Σ_{τ=1}^{t} (E[c_τ] + σ(η, θ)), where σ(η, θ) is a function of the aleatoric uncertainty and the cognitive uncertainty, η represents the aleatoric uncertainty, and θ represents the cognitive uncertainty.
For example, in a policy optimization method provided in an embodiment of the present disclosure, calculating the state s_t at time t according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining the cost c_t of state s_t corresponding to time t includes: sampling the hidden input variable z from the distribution p(z) to obtain a sample; inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array to obtain the predicted state s_t; and, for the predicted state s_t, obtaining the cost c_t = c(s_t).
For example, in the policy optimization method provided in an embodiment of the present disclosure, the policy gradient optimization algorithm includes the REINFORCE algorithm, the PPO algorithm, or the TRPO algorithm.
At least one embodiment of the present disclosure also provides a policy optimization apparatus using a memristor array-based dynamic environment model, including: an acquisition unit configured to acquire a dynamic environment model based on the memristor array; the computing unit is configured to conduct multiple predictions of a plurality of moments according to the dynamic environment model and the object strategy to obtain a data sample set comprising optimization costs of the object strategy corresponding to the moments; and the strategy searching unit is configured to perform strategy searching by using a strategy gradient optimization algorithm based on the data sample set so as to optimize the object strategy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1A illustrates a schematic flow diagram of a method for policy optimization based on a dynamic environment model of a memristor array provided in accordance with at least one embodiment of the present disclosure;
FIG. 1B shows a schematic flow chart of step S101 in FIG. 1A;
FIG. 2A shows a schematic structure of a memristor array;
FIG. 2B is a schematic diagram of a memristor device;
FIG. 2C is a schematic diagram of another memristor device;
FIG. 2D illustrates a schematic diagram of mapping a weighting matrix of a Bayesian neural network to a memristor array;
FIG. 3 shows a schematic flow chart of step S102 in FIG. 1A;
FIG. 4 illustrates a schematic diagram of one example of a policy optimization method provided by at least one embodiment of the present disclosure;
FIG. 5 illustrates a schematic block diagram of a policy optimization device utilizing a memristor array-based dynamic environment model provided in accordance with at least one embodiment of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
In model-free deep reinforcement learning (Deep Reinforcement Learning), an agent generally needs to perform a large amount of trial-and-error interaction with the real environment, and its data efficiency is low, so it cannot be applied to real tasks where trial and error is costly. Model-based deep reinforcement learning can utilize data more efficiently. In model-based deep reinforcement learning, the agent first learns a dynamic environment model from the historical experience of interacting with the real dynamic environment (e.g., state transition data collected in advance), and then interacts with the dynamic environment model to obtain a suboptimal policy.
When a model-based reinforcement learning method has learned an accurate dynamic environment model, the model can be used while training the agent, so the agent does not need to interact with the real environment many times: it can "imagine" the experience of interacting with the real environment, which greatly improves data efficiency and suits real physical scenarios where data acquisition is expensive. Meanwhile, the dynamic environment model can predict unseen states of the environment and generalize the agent's knowledge; it can also serve as a new data source providing context information to aid decision making, thereby alleviating the exploration-exploitation dilemma. When modeling a real environment, the inherent randomness (uncertainty) of the environment, i.e., the process noise and the approximation errors introduced by data-driven modeling, can cause the long-term estimates of an artificial neural network to deviate from the actual behavior of the system. Probabilistic models provide a way to handle uncertainty: they enable intelligent decisions using the model's predictions while remaining cautious about the uncertainty of those predictions.
The inventors found that a Bayesian neural network (Bayesian Neural Network, BNN) is a probabilistic model that places a neural network in a Bayesian framework and can describe complex stochastic patterns. Moreover, a Bayesian neural network with latent input variables (BNN with latent input variables, BNN+LV) can describe complex stochastic patterns through a distribution over the hidden input variables (aleatoric uncertainty) while accounting for model uncertainty through a distribution over the weights (cognitive uncertainty). A hidden input variable is a variable that cannot be directly observed but affects the state and output of the probabilistic model. The inventors describe methods and apparatus for implementing a Bayesian neural network using memristor intrinsic noise in Chinese patent application publication CN110956256A, which is incorporated herein in its entirety as part of the present application.
The structure of the Bayesian neural network includes, but is not limited to, a fully connected structure, a convolutional neural network (Convolutional Neural Network, CNN) structure, etc., where the network weights W are random variables following a certain distribution (W ~ q(W)).
The inventors further found the following. Suppose there is already a dataset D = {X, Y} of the dynamic system for the Bayesian neural network, where X is the state feature vector of the dynamic system and Y is the next state of the dynamic system. The inputs of the Bayesian neural network are the state feature vector X of the dynamic system and a hidden input variable z (z ~ p(z)); the parameters of the Bayesian neural network can be trained; and the output of the Bayesian neural network, superposed with independent additive Gaussian noise ε (ε ~ N(0, σ²)), is the prediction y of the next state of the dynamic system, i.e., y = f(X, z, W, ε). Thus, after training is completed, each weight of the Bayesian neural network is a distribution. For example, each weight is a distribution independent of the others.
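As an illustration of the prediction y = f(X, z, W, ε), the following is a minimal software sketch of one stochastic forward pass (the layer sizes, the Gaussian form of the weight posterior q(W), and the function name are illustrative assumptions, not the implementation fixed by this disclosure):

```python
import numpy as np

def bnn_lv_predict(x, w_mu, w_sigma, sigma_eps=0.01):
    """One stochastic forward pass of a BNN with latent input variables.

    x        : state feature vector X of the dynamic system
    w_mu     : list of per-layer mean weight matrices
    w_sigma  : list of per-layer std-dev matrices; W ~ q(W) = N(w_mu, w_sigma^2)
    sigma_eps: std-dev of the additive Gaussian output noise epsilon
    """
    z = np.random.randn(1)                 # hidden input variable z ~ p(z) = N(0, 1)
    h = np.concatenate([x, z])             # network input: state features plus z
    for i, (mu, sig) in enumerate(zip(w_mu, w_sigma)):
        W = mu + sig * np.random.randn(*mu.shape)  # sample one weight realization
        h = h @ W
        if i < len(w_mu) - 1:
            h = np.tanh(h)                 # hidden nonlinearity; output layer stays linear
    return h + sigma_eps * np.random.randn(*h.shape)  # y = f(X, z, W, eps)
```

Repeating this pass yields a different prediction each time; the spread across passes reflects both the weight distribution and the hidden input variable.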
In long-horizon planning tasks, the gradient must be back-propagated through many steps, which suffers from the vanishing and exploding gradient problems; meanwhile, when a policy search is performed with a neural network implemented directly on a memristor array, the intrinsic stochastic characteristics of memristors introduce extra noise as the gradient back-propagates through the array, and such noisy gradients cannot effectively optimize the policy search.
At least one embodiment of the present disclosure provides a policy optimization method using a memristor array-based dynamic environment model, including: acquiring a dynamic environment model based on a memristor array; carrying out multiple predictions at a plurality of moments according to the dynamic environment model and the object strategy to obtain a data sample set comprising optimization costs of the object strategy corresponding to the moments; based on the data sample set, a policy search is performed using a policy gradient optimization algorithm to optimize the object policy.
The policy optimization method provided by the embodiments of the present disclosure uses the memristor-array-based dynamic environment model to generate the data sample set, realizing long-term dynamic planning based on the dynamic environment model, and then performs the policy search with a relatively stable algorithm such as a policy gradient optimization algorithm; it avoids the vanishing and exploding gradient problems and can effectively optimize the object policy.
At least one embodiment of the present disclosure further provides a policy optimization device corresponding to the policy optimization method.
Embodiments of the present disclosure will be described in detail below with reference to the attached drawings, but the present disclosure is not limited to these specific embodiments.
FIG. 1A illustrates a schematic flow diagram of a method for policy optimization based on a dynamic environment model of a memristor array provided in accordance with at least one embodiment of the present disclosure.
As shown in fig. 1A, the policy optimization method includes the following steps S101 to S103.
Step S101: a dynamic environment model based on the memristor array is obtained.
In an embodiment of the present disclosure, for example, the dynamic environment model may be obtained by modeling a dynamic system using a memristor-array-based BNN+LV; the specific steps are shown in FIG. 1B and described below.
Step S102: and carrying out multiple predictions at a plurality of moments according to the dynamic environment model and the object strategy to obtain a data sample set comprising optimization costs of the object strategy corresponding to the plurality of moments.
For example, the object policies involved are used for deep reinforcement learning, and may be, for example, policies that maximize rewards or achieve specific goals in the course of an agent's interaction with the environment.
Step S103: based on the data sample set, a policy search is performed using a policy gradient optimization algorithm to optimize the object policy.
For example, in different examples of embodiments of the present disclosure, the policy gradient optimization algorithm may include the REINFORCE algorithm, the PPO (Proximal Policy Optimization) algorithm, or the TRPO (Trust Region Policy Optimization) algorithm. In the embodiments of the present disclosure, these policy gradient optimization methods are relatively stable and can effectively optimize the object policy.
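As one concrete example of such a policy search step, a minimal REINFORCE-style update over the collected action/optimization-cost pairs might look as follows (a sketch only; the linear-Gaussian policy form, the learning rate, and the function name are assumptions, since the disclosure does not fix them):

```python
import numpy as np

def reinforce_update(theta, actions, costs, states, lr=1e-2, sigma=0.1):
    """One REINFORCE step on a linear-Gaussian policy a = theta @ s + noise.

    Estimates grad J ~ mean_t [ J_t * grad log pi(a_t | s_t) ] and steps
    against it, because J here is a cost to minimize rather than a reward.
    """
    grad = np.zeros_like(theta)
    for a, J, s in zip(actions, costs, states):
        mean = theta @ s
        grad_logp = np.outer((a - mean) / sigma**2, s)  # d log N(a; mean, sigma^2 I) / d theta
        grad += J * grad_logp
    grad /= len(actions)
    return theta - lr * grad  # gradient descent on the cost
```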
Fig. 1B shows a schematic flow chart of an example of step S101 in fig. 1A.
As shown in fig. 1B, an example of step S101 may include steps S111 to S113 as follows.
Step S111: and acquiring a Bayesian neural network, wherein the Bayesian neural network is provided with a weight matrix obtained through training.
For example, the structure of the bayesian neural network includes a fully connected structure or a convolutional neural network structure, or the like. Each network weight of the bayesian neural network is a random variable. For example, after the bayesian neural network is trained, each weight is a distribution, such as a gaussian distribution or a laplace distribution.
For example, the Bayesian neural network may be trained offline to obtain the weight matrix. The training may follow conventional methods, for example using a central processing unit (CPU), a graphics processing unit (GPU), a neural network processing unit (NPU), etc., and is not described further here.
Step S112: and obtaining a plurality of corresponding target conductance values according to the weight matrix of the Bayesian neural network, and mapping the plurality of target conductance values into the memristor array.
After the Bayesian neural network is trained and the weight matrix is obtained, the weight matrix is processed to obtain a plurality of corresponding target conductance values. For example, in this process the weight matrix may be shifted and scaled until it fits the usable conductance window of the memristor array. After the shifting and scaling, the target conductance values are calculated from the processed weight matrix and the memristor conductance range. The specific process of calculating the target conductance values may refer to the related description of memristor-based Bayesian neural networks and is not repeated here.
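The shift-and-scale idea can be pictured with the following sketch (illustrative assumptions: a differential pair of memristors per weight and a conductance window [g_min, g_max]; the disclosure leaves the concrete mapping rule open):

```python
import numpy as np

def weights_to_conductances(W, g_min=1e-6, g_max=1e-5):
    """Map a weight matrix into a memristor conductance window [g_min, g_max].

    Uses a differential pair per weight: w is represented as (g_pos - g_neg)
    after undoing the scale, so both positive and negative weights fit into
    the strictly positive conductance range.
    """
    scale = (g_max - g_min) / np.abs(W).max()   # fit the largest-magnitude weight
    g_pos = g_min + scale * np.maximum(W, 0)    # positive part of each weight
    g_neg = g_min + scale * np.maximum(-W, 0)   # negative part of each weight
    return g_pos, g_neg, scale

# Effective weight recovered during computation: W_eff = (g_pos - g_neg) / scale
```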
FIG. 2A illustrates the schematic structure of a memristor array, which is, for example, composed of a plurality of memristor cells forming an array of M rows and N columns, with M and N both positive integers. Each memristor cell includes a switching element and one or more memristors. In FIG. 2A, WL<1>, WL<2>, …, WL<M> denote the word lines of the first row, second row, …, Mth row, and the control electrode of the switching element (e.g., the gate of a transistor) in each row of memristor cell circuits is connected to the word line of that row; BL<1>, BL<2>, …, BL<N> denote the bit lines of the first column, second column, …, Nth column, and the memristors in each column of memristor cell circuits are connected to the bit line of that column; SL<1>, SL<2>, …, SL<M> denote the source lines of the first row, second row, …, Mth row, and the sources of the transistors in each row of memristor cell circuits are connected to the source line of that row. According to Kirchhoff's law, the memristor array can complete multiply-accumulate calculations in parallel by setting the states (e.g., resistances) of the memristor cells and applying corresponding signals on the word lines and bit lines.
FIG. 2B is a schematic diagram of a memristor device including a memristor array and its peripheral drive circuitry. For example, as shown in FIG. 2B, the memristor device includes a signal acquisition device, a word line drive circuit, a bit line drive circuit, a source line drive circuit, a memristor array, and a data output circuit.
For example, the signal acquisition device is configured to convert a digital signal to a plurality of analog signals through a digital-to-analog converter (Digital to Analog converter, DAC for short) for input to a plurality of column signal inputs of the memristor array.
For example, a memristor array includes M source lines, M word lines, and N bit lines, and a plurality of memristor cells arranged in M rows and N columns.
For example, operation of the memristor array is achieved by a word line drive circuit, a bit line drive circuit, and a source line drive circuit.
For example, the word line driving circuit includes a plurality of multiplexers (Mux) for switching the word line input voltages; the bit line driving circuit includes a plurality of multiplexers for switching the bit line input voltages; and the source line driving circuit also includes a plurality of multiplexers for switching the source line input voltages. For example, the source line driving circuit further includes a plurality of ADCs for converting analog signals into digital signals. In addition, a trans-impedance amplifier (Trans-Impedance Amplifier, TIA; not shown in the figure) may be arranged between the Mux and the ADC in the source line driving circuit to complete the current-to-voltage conversion for ADC processing.
For example, a memristor array includes an operational mode and a computation mode. When the memristor array is in the operation mode, the memristor unit is in an initialized state, and values of parameter elements in the parameter matrix can be written into the memristor array. For example, the source line input voltage, the bit line input voltage, and the word line input voltage of the memristor are switched to corresponding preset voltage intervals through the multiplexer.
For example, the word line input voltage is switched to the corresponding voltage interval by the control signal wl_sw [1:M ] of the multiplexer in the word line driving circuit in fig. 2B. For example, when the memristor is set, the word line input voltage is set to 2V (volts), for example, when the memristor is reset, the word line input voltage is set to 5V, for example, the word line input voltage may be obtained by the voltage signal v_wl [1:M ] in fig. 2B.
For example, the source line input voltage is switched to the corresponding voltage interval by the control signal sl_sw [1:M ] of the multiplexer in the source line driving circuit in fig. 2B. For example, when the memristor is set, the source line input voltage is set to 0V, for example, when the memristor is reset, the source line input voltage is set to 2V, for example, the source line input voltage may be obtained by the voltage signal v_sl [1:M ] in fig. 2B.
For example, the bit line input voltages are switched to the corresponding voltage intervals by the control signals BL_sw [1:N ] of the multiplexers in the bit line driving circuit in FIG. 2B. For example, the bit line input voltage is set to 2V when the memristor is set, for example, the bit line input voltage is set to 0V when the memristor is reset, for example, the bit line input voltage may be obtained by the DAC in fig. 2B.
For example, when the memristor array is in the calculation mode, the memristors in the array are in a conductive state available for calculation, and the bit line input voltage applied at the column signal input terminals does not change the conductance values of the memristors; the calculation can thus be completed by performing multiply-add operations through the memristor array. For example, the word line input voltages are switched to the corresponding voltage interval by the control signal WL_sw[1:M] of the multiplexers in the word line driving circuit: when a turn-on signal is applied, the word line input voltage of the corresponding row is set to 5V, and when no turn-on signal is applied, it is set to 0V (e.g., connected to the GND signal). The source line input voltages are switched to the corresponding voltage interval (e.g., set to 0V) by the control signal SL_sw[1:M] of the multiplexers in the source line driving circuit, so that the current signals at the row signal output terminals can flow into the data output circuit; and the bit line input voltages are switched to the corresponding voltage interval (e.g., set to 0.1V-0.3V) by the control signal BL_sw[1:N] of the multiplexers in the bit line driving circuit, so that the memristor array performs the multiply-add operation.
For example, the data output circuit may include a plurality of trans-impedance amplifiers (TIAs) and ADCs, which convert the current signals at the row signal output terminals into voltage signals and then into digital signals for subsequent processing.
FIG. 2C is a schematic diagram of another memristor device. The memristor device shown in FIG. 2C has substantially the same structure as the memristor device shown in FIG. 2B and also includes a memristor array and its peripheral driving circuitry. For example, as shown in FIG. 2C, the memristor device includes a signal acquisition device, a word line driving circuit, a bit line driving circuit, a source line driving circuit, a memristor array, and a data output circuit.
For example, the memristor array includes M source lines, 2M word lines, 2N bit lines, and a plurality of memristor cells arranged in an array of M rows and N columns. For example, each memristor cell has a 2T2R structure; the mapping of the parameter matrix used for the transformation processing onto the different memristor cells of the memristor array is not described here. It should be noted that the memristor array may also include M source lines, M word lines, 2N bit lines, and a plurality of memristor cells arranged in M rows and N columns.
The description of the signal acquisition device, the control driving circuit, and the data output circuit may refer to the previous description, and will not be repeated here.
FIG. 2D illustrates the process of mapping the weight matrix of a Bayesian neural network onto a memristor array. The inter-layer weight matrices of the Bayesian neural network are implemented with the memristor array: each weight is realized with N memristors that together represent the distribution corresponding to that weight, i.e., N conductance values are calculated for the random probability distribution of the weight, and this N-value conductance distribution is mapped into the N memristors. In this way, the weight matrix in the Bayesian neural network is converted into target conductance values mapped into cross sequences of the memristor array.
As shown in FIG. 2D, the left side of the figure is a three-layer Bayesian neural network comprising three neuron layers connected one after another. For example, the input layer is the first neuron layer, the hidden layer is the second neuron layer, and the output layer is the third neuron layer. For example, the input layer passes the received input data to the hidden layer; the hidden layer computes and transforms the input data and sends the result to the output layer; and the output layer outputs the result of the Bayesian neural network.
As shown in FIG. 2D, the input layer, the hidden layer, and the output layer each include one or more neuron nodes, and the number of neuron nodes in each layer can be set according to the application. For example, the number of neurons in the input layer is 2 (N_1 and N_2), the number of neurons in the intermediate hidden layer is 3 (N_3, N_4 and N_5), and the number of neurons in the output layer is 1 (N_6).
As shown in FIG. 2D, two adjacent neuron layers of the Bayesian neural network are connected by a weight matrix. For example, the weight matrix is implemented by the memristor array shown on the right side of FIG. 2D. For example, the weight parameters may be programmed directly to the conductances of the memristor array, or mapped to the conductances according to a certain rule; the difference in conductance of two memristors may also be used to represent a weight parameter. Describing the technical solution here in terms of programming the weight parameters directly to, or mapping them onto, the conductances of the memristor array is merely exemplary and does not limit the present disclosure.
The structure of the memristor array on the right of FIG. 2D, for example as shown in FIG. 2A, may include a plurality of memristors arranged in an array. In the example shown in FIG. 2D, the weight connecting input N_1 and output N_3 is implemented by 3 memristors (G_11, G_12, G_13), and the other weights in the weight matrix can be implemented in the same way. More specifically, source line SL_1 corresponds to neuron N_3, source line SL_2 corresponds to neuron N_4, source line SL_3 corresponds to neuron N_5, and bit lines BL_1, BL_2 and BL_3 correspond to neuron N_1. One weight between the input layer and the hidden layer (the weight between neuron N_1 and neuron N_3) is converted according to its distribution into three target conductance values, which are mapped into a cross sequence of the memristor array; these target conductance values, G_11, G_12 and G_13, are framed with a dashed line in the memristor array.
Returning to FIG. 1B, step S113: inputting the state and a hidden input variable corresponding to time t of the dynamic system as input signals to the weight-mapped memristor array, processing the state and the hidden input variable of time t through the memristor array according to the Bayesian neural network, and acquiring from the memristor array an output signal corresponding to the processing result, from which the prediction result for time t+1 of the dynamic system is obtained.
For example, in some embodiments of the present disclosure, the dynamic environment model is expressed as s_{t+1} = f(s_t, a_t; W, ε), where s_t is the state of the dynamic system at time t, a_t is the action of the object policy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, and s_{t+1} is the prediction for time t+1 of the dynamic system; the action of the object policy at time t is a_t = π(s_t; W_π), where π represents the function of the object policy and W_π represents the policy parameters; the weight matrix W of the Bayesian neural network satisfies the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise ε ~ N(0, σ²).
For the memristor array, the input signal is a voltage signal and the output signal is a current signal; the output signal is read and analog-to-digital converted for subsequent processing. For example, the input sequence is applied to the BL (Bit Line) in the form of voltage pulses, and the output current flowing from the SL (Source Line) is then collected for further computation. For example, for the memristor device shown in FIG. 2B or 2C, the input sequence may be converted by the DAC into an analog voltage signal applied to the BL through the multiplexer. Correspondingly, the output current obtained from the SL can be converted into a voltage signal by the trans-impedance amplifier and into a digital signal by the ADC for subsequent processing. When the read currents of the N memristors are summed and N is large, the total output current exhibits a certain distribution, such as a Gaussian distribution or a Laplace distribution. The total output current under all voltage pulses is the result of multiplying the input vector by the weight matrix. In the memristor crossbar array, such a parallel read operation is thus equivalent to performing the two operations of sampling and vector-matrix multiplication at once.
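The statement that one parallel read performs sampling and vector-matrix multiplication at once can be illustrated with a small simulation (a sketch under an assumed multiplicative Gaussian read-noise model; actual device noise statistics depend on the memristor technology):

```python
import numpy as np

def noisy_read(G_target, v_in, read_noise=0.05):
    """Simulate one parallel read of a memristor crossbar.

    G_target  : (rows, cols) target conductance matrix (the mapped weights)
    v_in      : input voltage vector applied on the bit lines, shape (rows,)
    read_noise: relative std-dev of the intrinsic conductance fluctuation

    Each read sees a fresh conductance realization, so the output current
    vector is simultaneously a weight sample and a vector-matrix product.
    """
    G_sample = G_target * (1 + read_noise * np.random.randn(*G_target.shape))
    return v_in @ G_sample  # currents summed along the output lines (Kirchhoff's law)
```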
How the predictions at the multiple times are made according to the dynamic environment model and the object policy to obtain the data sample set including the optimization costs of the object policy is described below with reference to FIG. 3.
For example, the plurality of times includes time 1 to time T which are sequentially arranged from early to late.
Fig. 3 shows a schematic flow chart of an example of step S102 in fig. 1A.
As shown in fig. 3, step S102 may include steps S301 to S303 as follows.
Step S301: for any time t-1 from time 1 to time T, obtain the execution action a_{t-1} from the object policy: a_{t-1} = π(s_{t-1}; W_π), the action of the object policy at time t-1.
For example, action a_{t-1} is the optimal action selected by the object policy in the state at time t-1.
Step S302: according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε), calculate the state s_t at the next time t after time t-1 and obtain the cost c_t of state s_t corresponding to time t, thereby obtaining the cost sequence {c_1, c_2, …, c_t} from time 1 to time t, and obtain the optimization cost J_{t-1} for time t based on the cost sequence, where 1 ≤ t ≤ T.
For example, in some embodiments of the present disclosure, an example of step S302 may include: sampling the hidden input variable z from the distribution p(z) to obtain a sample; inputting the sample and the state s_{t-1} at time t-1 into the weight-mapped memristor array to obtain the predicted state s_t; and, for the predicted state s_t, obtaining the cost c_t = c(s_t).
For example, the hidden input variable z is first sampled from the p(z) distribution; then the state s_{t-1} at time t-1 and the sample of the hidden input variable are applied to the BL with READ voltage pulses of the memristor array, and the output current from the SL is collected for further computation to obtain the cost c_t corresponding to time t. Performing this operation for each state from time 1 to time t yields the cost sequence {c_1, c_2, …, c_t}.
Step S303: obtain the data sample set {[a_0, J_0], …, [a_{T-1}, J_{T-1}]} from time 1 to time T.
For example, the expected value of the cost c_t at time t is E[c_t]; the optimization cost at time t can then be obtained by J_{t-1} = Σ_{τ=1}^{t} E[c_τ].
For example, in some embodiments of the present disclosure, the cost further includes a cost change caused by aleatoric uncertainty arising from the hidden input variables and a cost change caused by cognitive uncertainty arising from the intrinsic noise of the memristor array.
For example, if the cost changes caused by the aleatoric uncertainty and the cognitive uncertainty are further considered, the optimization cost at time t can be obtained by J_{t-1} = Σ_{τ=1}^{t} (E[c_τ] + σ(η, θ)), where σ(η, θ) is a function of the aleatoric uncertainty and the cognitive uncertainty, η represents the aleatoric uncertainty, and θ represents the cognitive uncertainty.
For any time between time 1 and time T, the data sample corresponding to that time can be obtained in this way, and the set of data samples from time 1 to time T is {[a_0, J_0], …, [a_{T-1}, J_{T-1}]}.
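To make the optimization cost concrete, E[c_t] and the uncertainty term σ(η, θ) can be estimated from repeated stochastic rollouts. The sketch below groups rollouts by weight sample to separate the two uncertainty sources (the grouping scheme, the simple additive form of σ, and the names are assumptions, not fixed by this disclosure):

```python
import numpy as np

def optimization_cost(costs, kappa=1.0):
    """Estimate J_{t-1} from a tensor of sampled cost sequences.

    costs : (n_weight_samples, n_z_samples, horizon) array; entry [i, j, t]
            is the cost c_t obtained with the i-th weight realization and
            the j-th hidden-input-variable sample.
    kappa : weight on the uncertainty penalty sigma(eta, theta).
    """
    expected = costs.mean(axis=(0, 1))      # E[c_t], averaging over both noise sources
    eta = costs.std(axis=1).mean(axis=0)    # aleatoric: spread across z samples
    theta = costs.mean(axis=1).std(axis=0)  # cognitive: spread across weight samples
    sigma = kappa * (eta + theta)           # one simple choice of sigma(eta, theta)
    return np.cumsum(expected + sigma)      # J_{t-1} for each rollout length t
```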
For example, an exemplary flow of the above policy optimization method using the memristor-array-based dynamic environment model is as follows:

Input: memristor-array-based dynamic environment model and initial object policy π
For n = 1 to N:
    Initialize state s_0
    For t = 1 to T:
        Obtain the execution action a_{t-1} from the object policy π
        Predict the state s_t at time t using the memristor-array-based dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) to obtain a data sample [s_t]
        Compute the cost c_t = c(s_t) and the optimization cost J_{t-1} corresponding to the object policy at this time
        Record [a_{t-1}, J_{t-1}]
    Obtain the data sample set {[a_0, J_0], …, [a_{T-1}, J_{T-1}]} and perform a policy search on it using the policy gradient optimization algorithm
End the loop over n
Output: the optimized policy π
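Translated into software, the exemplary flow might look like the following end-to-end sketch (illustrative only; env_model_step stands for one stochastic pass through the memristor-array environment model, reinforce_update is the earlier sketch, and the policy form and hyperparameters are assumptions):

```python
import numpy as np

def policy_search(env_model_step, cost_fn, theta, s0, N=100, T=50, lr=1e-2):
    """Model-based policy search loop mirroring the exemplary flow above.

    env_model_step(s, a) -> next state: one stochastic pass through the
    memristor-based dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, eps).
    """
    for n in range(N):
        s = s0.copy()
        states, actions, J = [], [], []
        running = 0.0
        for t in range(1, T + 1):
            a = theta @ s + 0.1 * np.random.randn(theta.shape[0])  # a_{t-1} = pi(s_{t-1}; W_pi)
            states.append(s); actions.append(a)
            s = env_model_step(s, a)   # predict s_t on the array model
            running += cost_fn(s)      # accumulate the cost along the rollout
            J.append(running)          # optimization cost J_{t-1} up to time t
        # policy search on {[a_0, J_0], ..., [a_{T-1}, J_{T-1}]}
        theta = reinforce_update(theta, actions, J, states, lr=lr)
    return theta
```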
Fig. 4 illustrates a schematic diagram of one example of a policy optimization method provided by at least one embodiment of the present disclosure.
In an exemplary application, as shown in FIG. 4, a vessel is to be driven against sea waves to approach a target location on the coastline as closely as possible, which requires training a control model for driving the vessel. The vessel at position (x, y) can select an action (a_x, a_y) representing the direction and magnitude of the drive. However, because of the dynamic sea-surface environment with waves, the subsequent position of the vessel exhibits drift and disturbance, and the closer the location is to the shore, the greater the disturbance. The vessel is given only a limited batch dataset of spatial position transitions and, for safety, cannot optimize its action policy by interacting directly with the marine environment. In this case, a marine environment model (dynamic environment model) capable of predicting the next state must be learned from the empirical data. Cognitive uncertainty and aleatoric uncertainty arise, respectively, from the missing information at unvisited locations and from the randomness of the marine environment.
In this embodiment, for the control model, the sea surface is the dynamic environment, and the object policy refers to the method by which the vessel travels from its current position to the target position. First, the dynamic environment model and an initial object policy for this dynamic environment are obtained. The initialization state of the vessel is its current position; the execution action at the current time is obtained from the object policy; the state at the next time (the position of the vessel) is predicted using the dynamic environment model; the cost and the optimization cost corresponding to the object policy are calculated; and a data sample consisting of the action and the optimization cost is recorded. Assuming the current time is time 1, the data sample set from time 1 to the subsequent time T is obtained, and a policy search is performed on the data sample set using the policy gradient optimization algorithm, thereby obtaining an optimized object policy.
FIG. 5 illustrates a schematic block diagram of a policy optimization device 500 utilizing a memristor-array-based dynamic environment model, which may be used to perform the policy optimization method illustrated in FIG. 1A, in accordance with at least one embodiment of the present disclosure.
As shown in fig. 5, the policy optimizing apparatus 500 includes an acquisition unit 501, a calculation unit 502, and a policy search unit 503.
The acquisition unit 501 is configured to acquire a memristor array-based dynamic environment model.
The computing unit 502 is configured to perform multiple predictions of a plurality of moments according to the dynamic environment model and the object policy, resulting in a set of data samples comprising optimization costs of the object policy for the plurality of moments.
The policy search unit 503 is configured to perform a policy search using a policy gradient optimization algorithm to optimize the object policy based on the set of data samples.
For example, policy optimization device 500 may be implemented in hardware, software, firmware, and any feasible combination thereof, which is not limiting in this disclosure.
The technical effects of the above-mentioned policy optimization device are the same as those of the policy optimization method shown in fig. 1A, and are not described herein.
The following points need to be described:
(1) The drawings of the embodiments of the present disclosure relate only to the structures to which the embodiments of the present disclosure relate, and reference may be made to the general design for other structures.
(2) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely specific embodiments of the disclosure, but the scope of the disclosure is not limited thereto, and the scope of the disclosure should be determined by the claims.
Claims (8)
1. A method of policy optimization using a memristor array-based dynamic environment model, comprising:
Acquiring the dynamic environment model based on the memristor array;
Carrying out multiple predictions at a plurality of moments according to the dynamic environment model and an object strategy to obtain a data sample set comprising optimization costs of the object strategy corresponding to the moments;
Performing policy searching to optimize the object policy using a policy gradient optimization algorithm based on the data sample set;
Wherein obtaining the dynamic environment model comprises:
acquiring a Bayesian neural network, wherein the Bayesian neural network is provided with a weight matrix obtained through training,
Obtaining a plurality of corresponding target conductance values according to the weight matrix of the Bayesian neural network, mapping the plurality of target conductance values into the memristor array,
Inputting a state and a hidden input variable corresponding to a time t of a dynamic system as input signals to the memristor array after weight mapping, processing the state and the hidden input variable at the time t through the memristor array according to the Bayesian neural network, and acquiring an output signal corresponding to a processing result from the memristor array, wherein the output signal is used for obtaining a prediction result of a time t+1 of the dynamic system;
Wherein the expression of the dynamic environment model is s_{t+1} = f(s_t, a_t; W, ε),
Where s_t is the state of the dynamic system at time t, a_t is the action of the object policy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, and s_{t+1} is the prediction result for time t+1 of the dynamic system,
Wherein the action of the object policy at time t is a_t = π(s_t; W_π), π represents the function of the object policy, W_π represents the policy parameters, the weight matrix W of the Bayesian neural network satisfies the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise ε ~ N(0, σ²).
2. The policy optimization method according to claim 1, wherein the plurality of time instants includes time instant 1 to time instant T arranged sequentially from early to late,
Performing multiple predictions of the multiple moments according to the dynamic environment model and the object policy to obtain the data sample set including the optimization cost of the object policy corresponding to the multiple moments, including:
For any time t-1 from the time 1 to the time T, an execution action a_{t-1} is obtained from the object policy,
a_{t-1} = π(s_{t-1}; W_π), the action of the object policy at the time t-1;
Calculating a state s_t at a next time t after the time t-1 according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining a cost c_t of the state s_t corresponding to the time t, thereby obtaining a cost sequence {c_1, c_2, …, c_t} from the time 1 to the time t,
Obtaining an optimization cost J_{t-1} for the time t based on the cost sequence, wherein 1 ≤ t ≤ T;
Obtaining the data sample set {[a_0, J_0], …, [a_{T-1}, J_{T-1}]} from the time 1 to the time T.
3. The policy optimization method according to claim 2, wherein the expected value of the cost c_t at the time t is E[c_t], and the optimization cost at the time t can be obtained by J_{t-1} = Σ_{τ=1}^{t} E[c_τ].
4. The policy optimization method according to claim 2, wherein the cost further includes a cost change caused by aleatoric uncertainty and a cost change caused by cognitive uncertainty,
Wherein the aleatoric uncertainty arises from the hidden input variable and the cognitive uncertainty arises from the intrinsic noise of the memristor array.
5. The policy optimization method according to claim 4, wherein the optimization cost at the time t is obtained by J_{t-1} = Σ_{τ=1}^{t} (E[c_τ] + σ(η, θ)), where σ(η, θ) is a function of the aleatoric uncertainty and the cognitive uncertainty, η represents the aleatoric uncertainty, and θ represents the cognitive uncertainty.
6. The policy optimization method according to claim 2, wherein calculating the state s_t at the time t according to the dynamic environment model s_t = f(s_{t-1}, a_{t-1}; W, ε) and obtaining the cost c_t of the state s_t corresponding to the time t includes:
Sampling the hidden input variable z from the distribution p(z) to obtain a sample;
Inputting the sample and the state s_{t-1} at the time t-1 into the weight-mapped memristor array to obtain a predicted state s_t;
For the predicted state s_t, obtaining the cost c_t = c(s_t).
7. The policy optimization method according to claim 1, wherein the policy gradient optimization algorithm comprises the REINFORCE algorithm, the PPO algorithm, or the TRPO algorithm.
8. A policy optimization device utilizing a memristor array-based dynamic environment model, comprising:
an acquisition unit configured to acquire the dynamic environment model based on the memristor array;
The computing unit is configured to conduct multiple predictions of a plurality of moments according to the dynamic environment model and an object strategy to obtain a data sample set comprising optimization costs of the object strategy corresponding to the moments;
a policy search unit configured to perform policy search using a policy gradient optimization algorithm to optimize the object policy based on the data sample set;
wherein, in acquiring the dynamic environment model, the acquisition unit is configured to:
acquiring a Bayesian neural network, wherein the Bayesian neural network is provided with a weight matrix obtained through training,
Obtaining a plurality of corresponding target conductance values according to the weight matrix of the Bayesian neural network, mapping the plurality of target conductance values into the memristor array,
Inputting a state and a hidden input variable corresponding to a time t of a dynamic system as input signals to the memristor array after weight mapping, processing the state and the hidden input variable at the time t through the memristor array according to the Bayesian neural network, and acquiring an output signal corresponding to a processing result from the memristor array, wherein the output signal is used for obtaining a prediction result of a time t+1 of the dynamic system;
Wherein the expression of the dynamic environment model is s_{t+1} = f(s_t, a_t; W, ε),
Where s_t is the state of the dynamic system at time t, a_t is the action of the object policy at time t, W is the weight matrix of the Bayesian neural network, ε is the additive noise corresponding to the memristor array, and s_{t+1} is the prediction result for time t+1 of the dynamic system,
Wherein the action of the object policy at time t is a_t = π(s_t; W_π), π represents the function of the object policy, W_π represents the policy parameters, the weight matrix W of the Bayesian neural network satisfies the distribution W ~ q(W), and the additive noise ε is additive Gaussian noise ε ~ N(0, σ²).
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210497721.2A CN114819093B (en) | 2022-05-09 | 2022-05-09 | Strategy optimization method and device using environment model based on memristor array |
PCT/CN2023/092475 WO2023217027A1 (en) | 2022-05-09 | 2023-05-06 | Policy optimization method and apparatus using environment model based on memristor array |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210497721.2A CN114819093B (en) | 2022-05-09 | 2022-05-09 | Strategy optimization method and device using environment model based on memristor array |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114819093A CN114819093A (en) | 2022-07-29 |
CN114819093B true CN114819093B (en) | 2024-11-01 |
Family
ID=82512800
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210497721.2A Active CN114819093B (en) | 2022-05-09 | 2022-05-09 | Strategy optimization method and device using environment model based on memristor array |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114819093B (en) |
WO (1) | WO2023217027A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114819093B (en) * | 2022-05-09 | 2024-11-01 | 清华大学 | Strategy optimization method and device using environment model based on memristor array |
CN116300477A (en) * | 2023-05-19 | 2023-06-23 | 江西金域医学检验实验室有限公司 | Method, system, electronic equipment and storage medium for regulating and controlling environment of enclosed space |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110956256A (en) * | 2019-12-09 | 2020-04-03 | 清华大学 | Method and device for realizing Bayes neural network by using memristor intrinsic noise |
CN113077829A (en) * | 2021-04-20 | 2021-07-06 | 清华大学 | Memristor array-based data processing method and electronic device |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543827B (en) * | 2018-12-02 | 2020-12-29 | 清华大学 | Generating type confrontation network device and training method |
US11681903B2 (en) * | 2019-10-31 | 2023-06-20 | Micron Technology, Inc. | Spike detection in memristor crossbar array implementations of spiking neural networks |
CN113269315B (en) * | 2021-06-29 | 2024-04-02 | 安徽寒武纪信息科技有限公司 | Apparatus, method and readable storage medium for performing tasks using deep reinforcement learning |
CN113505887B (en) * | 2021-09-12 | 2022-01-04 | 浙江大学 | Memristor memory neural network training method aiming at memristor errors |
CN114067157B (en) * | 2021-11-17 | 2024-03-26 | 中国人民解放军国防科技大学 | Memristor-based neural network optimization method and device and memristor array |
CN114819093B (en) * | 2022-05-09 | 2024-11-01 | 清华大学 | Strategy optimization method and device using environment model based on memristor array |
Also Published As
Publication number | Publication date |
---|---|
CN114819093A (en) | 2022-07-29 |
WO2023217027A1 (en) | 2023-11-16 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |