CN113246958A - TD3-based multi-target HEV energy management method and system - Google Patents
- Publication number
- CN113246958A CN113246958A CN202110654498.3A CN202110654498A CN113246958A CN 113246958 A CN113246958 A CN 113246958A CN 202110654498 A CN202110654498 A CN 202110654498A CN 113246958 A CN113246958 A CN 113246958A
- Authority
- CN
- China
- Prior art keywords
- battery
- energy management
- soc
- engine
- vehicle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W20/00—Control systems specially adapted for hybrid vehicles
- B60W20/10—Controlling the power contribution of each of the prime movers to meet required power demand
- B60W20/13—Controlling the power contribution of each of the prime movers to meet required power demand in order to stay within battery power input or output limits; in order to prevent overcharging or battery depletion
- B60W20/15—Control strategies specially adapted for achieving a particular effect
- B60W10/00—Conjoint control of vehicle sub-units of different type or different function
- B60W10/04—Conjoint control including control of propulsion units
- B60W10/06—Conjoint control including control of combustion engines
- B60W10/24—Conjoint control including control of energy storage means
- B60W10/26—Conjoint control including control of energy storage means for electrical energy, e.g. batteries or capacitors
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W2050/0001—Details of the control system
- B60W2050/0043—Signal treatments, identification of variables or parameters, parameter estimation or state estimation
- B60W2710/00—Output or target parameters relating to a particular sub-unit
- B60W2710/06—Combustion engines, Gas turbines
- B60W2710/0666—Engine torque
- B60W2710/24—Energy storage means
- B60W2710/242—Energy storage means for electrical energy
- B60W2710/244—Charge state
- B60W2710/246—Temperature
Landscapes
- Engineering & Computer Science (AREA)
- Transportation (AREA)
- Mechanical Engineering (AREA)
- Chemical & Material Sciences (AREA)
- Combustion & Propulsion (AREA)
- Automation & Control Theory (AREA)
- Human Computer Interaction (AREA)
- Electric Propulsion And Braking For Vehicles (AREA)
- Hybrid Electric Vehicles (AREA)
Abstract
A multi-target HEV energy management method and system based on the twin-delayed deep deterministic policy gradient (TD3) are disclosed. The invention innovatively uses TD3 to overcome both the curse of dimensionality that afflicts deep-reinforcement-learning energy management strategies built on discrete action spaces and the over-estimation problem of the deep deterministic policy gradient. Fuel consumption, battery temperature and battery state of health (SOH) are taken as optimization targets, improving the practical value of the energy management strategy.
Description
Technical Field
The invention relates to a deep reinforcement learning algorithm for improving the fuel economy of new energy vehicles and prolonging battery service life, and in particular to a multi-target energy management method for a parallel hybrid electric vehicle (HEV) based on the twin-delayed deep deterministic policy gradient (TD3).
Background
The energy crisis and climate change have attracted extensive attention worldwide, and vehicle fuel consumption and exhaust emissions are contributing factors that cannot be ignored. To alleviate the severe energy crisis and climate change, vehicle electrification is the necessary path for the future development of the automotive industry. Among new energy vehicles, hybrid electric vehicles need less fuel than conventional fuel vehicles and have a longer driving range than pure electric vehicles, making them the most effective solution at present. However, the energy management system of a hybrid electric vehicle is very complex: it must properly distribute power between the engine and the motor while comprehensively guaranteeing both drivability and economy. Because it spans the energy management problems of conventional, pure electric and gasoline-electric hybrid vehicles alike, HEV energy management has become a focus of extensive research in the automotive field at home and abroad.
Energy management strategies can be broadly divided into three categories. a) Rule-based energy management strategies depend on a rule set formulated from expert experience and do not need to predict driving conditions; although highly practical, they cannot achieve optimal control of the vehicle and are tuned to specific, narrow driving conditions. The binary (on-off) control strategy is a typical rule-based strategy: it first drives the vehicle on battery energy and switches to engine drive when the battery SOC reaches a set minimum value. b) Optimization-based energy management strategies, such as dynamic programming (DP), convex optimization and genetic algorithms, compute optimal control from known or predicted driving conditions, and can reach optimal or near-optimal results for a specific drive cycle; however, they require the entire drive cycle to be known in advance and consume too much computation to be used for real-time control. To improve practicality, real-time online optimization strategies have been widely studied, such as model predictive control (MPC), Pontryagin's minimum principle (PMP) and the equivalent consumption minimization strategy (ECMS). However, because these compute the system's equivalent fuel consumption from partial historical information, which does not necessarily represent future driving states, their robustness is poor. A better-performing class of strategies is needed to make up for these shortcomings: c) learning-based energy management strategies.
Machine learning (data-driven optimization), and in particular the deep reinforcement learning (DRL) algorithms developed in recent years, provides a powerful research tool for system modeling, control parameter optimization, and the extraction of road-condition and driving-behavior features. Among reinforcement learning algorithms, discrete-action-space methods such as Q-learning and the Deep Q Network (DQN) are the most widely used, but they are only applicable to discrete, low-dimensional action spaces, whereas the HEV energy management control task has a high-dimensional, continuous action space. These algorithms require discretizing the action space, which inevitably loses important information about the action space and also causes the curse of dimensionality. Continuous-action-space algorithms such as the deep deterministic policy gradient (DDPG) handle continuous action spaces directly, without discretization, but DDPG suffers from an over-estimation problem: the estimated value function is often larger than the true value function, which affects the stability of the energy management strategy and weakens the robustness of the algorithm.
Furthermore, current energy management strategies only marginally improve vehicle fuel economy and ignore the control strategy's impact on battery life. It is well known that the service life of a battery system is closely related to its operating conditions and temperature, and that an excessive internal temperature can cause thermal runaway. An energy management strategy must take these important factors into account; otherwise it has little practical value.
Disclosure of Invention
The invention provides a multi-target HEV energy management method and system based on the twin-delayed deep deterministic policy gradient (TD3). By representing the value function with two critic networks and using a delayed-update technique, the method and system effectively solve the over-estimation problem. Vehicle fuel consumption, battery SOC, battery temperature and battery state of health (SOH) are taken as optimization targets to construct a multi-objective energy management strategy, so that the vehicle operates in a truly optimal state and the practical value of the energy management strategy is improved.
At least one embodiment of the present invention provides an HEV energy management method, comprising:
establishing a dynamics model, a battery thermal model and a battery life model of the parallel hybrid electric vehicle, and taking the engine fuel consumption rate m_f, engine output torque T_eng, battery temperature T_emp, battery SOC and battery SOH calculated by the three models as control targets;
constructing a twin-delayed deep deterministic policy gradient (TD3) network;
taking the engine fuel consumption rate, engine output torque, battery temperature, battery SOH and battery SOC of the control targets as the TD3 state-space signal S, taking the engine output torque as the TD3 action-space signal A, and formulating the reward function r of TD3;
acquiring parameters and observations that influence energy management during driving under standard vehicle operating conditions, wherein these parameters and observations, together with the reward function r, are used to train the TD3 network so that it learns to take the action A that maximizes the reward r for a received state signal S, thereby obtaining a trained deep reinforcement learning agent;
and acquiring parameters and observations that influence energy management during actual driving, including the engine fuel consumption rate, engine output torque, battery temperature, battery SOH and battery SOC taken as control targets, and inputting them into the trained deep reinforcement learning agent for energy management.
At least one embodiment of the present invention provides an HEV energy management system, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform all or part of the steps of the method.
At least one embodiment of the invention provides a non-transitory computer readable storage medium having stored thereon a computer program that, when executed by a processor, performs all or part of the steps of a method as described herein.
The invention adopts a TD3 energy management strategy to optimize the power distribution between the engine and the motor and the usage of the battery; it both overcomes the curse-of-dimensionality problem of discrete-action-space deep reinforcement learning energy management strategies and solves the over-estimation and training-instability problems of the deep deterministic policy gradient.
The invention not only optimizes fuel consumption during driving and keeps the battery SOC within a reasonable range, but also considers the control strategy's influence on battery temperature and battery life. An innovatively designed reward function constructs a multi-target energy management strategy over fuel economy, battery SOC, battery temperature and battery service life, so that the vehicle can be comprehensively optimized across multiple objectives.
The method collects actual road-condition data to verify the optimality of the deep reinforcement learning TD3 energy management strategy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.
Fig. 1 is a flowchart of a multi-target HEV energy management method based on TD3 according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a parallel hybrid electric vehicle according to an embodiment of the present invention.
Fig. 3 is a basic architecture diagram of an intelligent agent TD3 for deep reinforcement learning according to an embodiment of the present invention.
Fig. 4 is a speed curve of a vehicle under a standard operating condition according to an embodiment of the present invention.
Fig. 5 is a speed curve of a vehicle actually traveling at a certain location according to an embodiment of the present invention.
Detailed Description
For HEV energy management, the invention innovatively uses the twin-delayed deep deterministic policy gradient (TD3) to solve the curse-of-dimensionality problem of deep-reinforcement-learning energy management strategies based on discrete action spaces and the over-estimation problem of the deep deterministic policy gradient. Fuel consumption, battery temperature and battery state of health (SOH) are taken as optimization targets, improving the practical value of the energy management strategy. The TD3-based multi-target HEV energy management method will be described in detail with reference to figs. 1-5.
Step 1: establish the parallel hybrid electric vehicle model. The vehicle dynamics model is built from the vehicle dynamics equation, the battery thermal model from the battery's heat generation and heat dissipation principles, and the battery life model from the battery's capacity fade principle. Combining the battery thermal model with the battery life model allows the dynamic characteristics of the battery system to be predicted. The engine fuel consumption rate m_f, engine output torque T_eng, battery temperature T_emp, battery SOH and battery SOC computed by the three models are taken as control targets;
step 2: respectively constructing a critical network and an Actor network by using a deep neural network, commonly constructing a basic network framework, namely the Actor-critical network, of a double-delay deep deterministic strategy gradient strategy TD3 to construct a multi-target HEV energy management strategy learning network, and initializing and normalizing state data of parameters of the Actor-critical network, wherein the network parameters are shown in a table 2. And taking the engine fuel consumption rate, the engine output torque, the battery temperature, the battery SOH and the battery SOC of the control target as a TD3 state space signal S, taking the engine output torque as a TD3 action space signal A, and establishing a reasonable return function r of TD 3.
Step 3: acquire the parameters and observations that influence energy management during driving under standard operating conditions. Together with the reward function r, these are used to train the basic TD3 network so that the TD3 energy management strategy takes the action A that maximizes the reward r for a received state signal S and controls the vehicle to drive in an energy-saving, efficient manner, yielding a trained deep reinforcement learning agent.
Step 4: acquire the parameters and observations that influence energy management during actual driving, including the engine fuel consumption rate, engine output torque, battery temperature, battery SOH and battery SOC taken as control targets, and input them into the trained deep reinforcement learning agent for energy management.
FIG. 2 shows a schematic diagram of a parallel hybrid vehicle drive system. In step 1, the automobile dynamic model may be calculated by an automobile dynamic equation, which is shown in formula (1):
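The equation image for formula (1) is not reproduced in this text; the following is a hedged reconstruction from the variables defined below, assuming the standard longitudinal vehicle dynamics balance:

```latex
% Hedged reconstruction of formula (1); the exact form in the patent image may differ.
F_t = F_f + F_i + F_\omega + F_j
    = mgf\cos\alpha + mg\sin\alpha + \tfrac{1}{2}\,C_D A \rho v^{2} + \delta m a
```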
wherein F_t is the driving force of the vehicle; F_f the rolling resistance; F_i the grade resistance; F_ω the air resistance; F_j the acceleration resistance; m the vehicle mass; g the gravitational acceleration; f the rolling resistance coefficient; α the road grade; ρ the air density; A the frontal area of the vehicle; C_D the air resistance coefficient; v the vehicle speed; δ the rotating-mass conversion factor; and a the vehicle acceleration.
The thermal model of the battery is shown as formula (2):
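The equation image for formula (2) is not reproduced in this text; the following is a hedged reconstruction from the variables defined below, assuming a lumped-parameter thermal model with ohmic-type heat generation and natural-convection dissipation:

```latex
% Hedged reconstruction of formula (2); the exact form in the patent image may differ.
mC\,\frac{dT_{emp}}{dt} = I\left(OCV - V\right) - h\left(T_{emp} - T_{amb}\right)
```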
wherein T_emp is the battery temperature; T_amb the ambient temperature; m the battery mass; C the specific heat capacity of the battery; I the battery operating current; OCV the battery open-circuit voltage; V the battery operating voltage; and h the natural thermal convection constant.
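The lumped thermal model above can be stepped forward in time with simple Euler integration. The sketch below is illustrative only: it assumes the common form m·C·dT/dt = I·(OCV − V) − h·(T − T_amb), and the parameter values (`mass`, `c_heat`, `h_conv`) are placeholders, not the patent's.

```python
def battery_temp_step(temp, i_bat, ocv, v, t_amb, dt,
                      mass=2.0, c_heat=900.0, h_conv=5.0):
    """Advance the lumped battery temperature by one Euler step (deg C)."""
    q_gen = i_bat * (ocv - v)          # heat generation [W]
    q_dis = h_conv * (temp - t_amb)    # natural-convection dissipation [W]
    return temp + dt * (q_gen - q_dis) / (mass * c_heat)

# A discharging cell (V < OCV) warms up while generation exceeds dissipation.
t = 25.0
for _ in range(600):  # 600 s at 1 s steps
    t = battery_temp_step(t, i_bat=50.0, ocv=3.6, v=3.3, t_amb=25.0, dt=1.0)
```

At equilibrium (no current, cell at ambient) the temperature is unchanged, which is a quick sanity check on the sign conventions.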
The battery life model is shown in equation (3):
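The equation images for formulas (3)-(5) are not reproduced in this text; the following is a hedged reconstruction from the variables defined below, assuming the common semi-empirical power-law capacity-fade model that the listed symbols (B, E_a, R, z, Ah) suggest:

```latex
% Hedged reconstruction of formulas (3)-(5); the exact forms in the patent images may differ.
\Delta C_n = B \exp\!\left(\frac{-E_a}{R\,T_{emp}}\right) Ah^{\,z}, \qquad
N(c_r, T_{emp}) = \left.\frac{3600\,Ah}{C_n}\right|_{\Delta C_n = 20\%}
```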
wherein N(c_r, T_emp) is the number of equivalent cycles before the end of battery life, influenced by the battery discharge rate c_r (C-rate) and the battery temperature T_emp, as shown by equation (4);
percentage of loss of battery capacity is CnB is an exponential factor, the value of which is given in table 1, R-8.314 is a universal gas constant,z0.55 is the power law coefficient, Ah is the battery throughput, EaIs the activation energy; when the capacity of the battery drops to 20%, the battery reaches the end of life. CnAh and EaIs defined by equation (5):
TABLE 1 Relationship between the exponential factor B and the discharge rate
In step 2, the TD3 state-space signal is S = (SOC, m_f, T_eng, T_emp, SOH), where SOC denotes the battery state of charge; m_f the engine fuel consumption rate; T_eng the engine output torque; and T_emp the battery temperature. The action-space signal is A = {T_eng | T_eng ∈ [−250, 841]}. The reward function is defined by equation (6):
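The equation image for formula (6) is not reproduced in this text; the following is a hedged reconstruction from the definitions that follow, assuming the reward is the offset minus the summed loss terms:

```latex
% Hedged reconstruction of formula (6); the exact form in the patent image may differ.
r_i = b - J_i, \qquad
J_i = \dot{m}_{f,i}(s,a) + C_{b,i} + \omega_1 P_s + \omega_2 P_t
```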
wherein b is an offset used to adjust the range of the reward function; J_i is the loss function and i denotes the time step; s and a denote, respectively, the state of the ith time step (the engine fuel consumption rate, engine output torque, battery temperature, battery SOH and battery SOC of the control targets) and the action (the engine output torque); ṁ_f,i denotes the engine fuel consumption rate; C_b denotes the battery degradation cost; P_s and P_t denote, respectively, the penalty factors for the deviation of SOC from the reference value SOC_ref and for excessive temperature; and ω_1 and ω_2 are the weights of the influencing factors P_s and P_t. C_b is calculated from equation (7):
C_b,i = λ·ΔSOH  (7)
where λ is the ratio of the battery replacement cost to the cost of one kilogram of fuel (N. Kittner, F. Lill, and D. M. Kammen, "Energy storage deployment and innovation for the clean energy transition," Nature Energy, vol. 2, 2017, Art. no. 17125).
The deviation of SOC from the reference value SOC_ref and the penalty coefficient for excessive temperature are determined by equation (8) and equation (9):
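The equation images for formulas (8) and (9) are not reproduced in this text; the following is a hedged reconstruction from the definitions around them, assuming quadratic penalty terms:

```latex
% Hedged reconstruction of formulas (8) and (9); the exact forms in the patent images may differ.
P_s = \tau_1 \left( SOC - SOC_{ref} \right)^{2}, \qquad
P_t = \tau_2 \left[ \max\!\left( 0,\; T_{emp} - T_{ref} \right) \right]^{2}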
wherein SOC_ref = 0.6 is the battery SOC reference value and T_ref is the penalty trigger threshold, which may be set at 40 °C. τ_1 and τ_2 are adjustment coefficients that bring the battery SOC deviation and the over-temperature penalty to the same order of magnitude as the engine fuel consumption rate.
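The multi-objective reward described above can be sketched in a few lines. This is an illustrative sketch only: it assumes quadratic penalty terms, and every numeric value (`b`, `lam`, the weights and τ coefficients) is a placeholder, not the patent's tuned parameter.

```python
def reward(fuel_rate, d_soh, soc, temp,
           b=10.0, lam=100.0, w1=1.0, w2=1.0,
           tau1=50.0, tau2=0.01, soc_ref=0.6, t_ref=40.0):
    """Reward r_i = b - J_i, where J_i sums fuel, degradation and penalties."""
    c_b = lam * d_soh                          # battery degradation cost, eq. (7)
    p_s = tau1 * (soc - soc_ref) ** 2          # SOC deviation penalty
    p_t = tau2 * max(0.0, temp - t_ref) ** 2   # over-temperature penalty
    j = fuel_rate + c_b + w1 * p_s + w2 * p_t
    return b - j

# Holding SOC at the reference while staying cool yields a higher reward
# than draining the battery and overheating it.
r_good = reward(fuel_rate=1.0, d_soh=0.0, soc=0.6, temp=30.0)
r_bad = reward(fuel_rate=1.0, d_soh=0.001, soc=0.3, temp=45.0)
```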
In step 2, the basic architecture of the twin-delayed deep deterministic policy gradient algorithm is shown in fig. 3.
wherein J denotes the loss function; M the number of samples per gradient-descent batch; θ^Q and θ^μ the parameters of the Critic network and the Actor network, respectively; r the reward function; ε the noise; τ the soft-update factor; y the temporal-difference (TD) target; and L_k the accumulated error.
The detailed parameters of the deep reinforcement learning TD3 agent are shown in table 2:
TABLE 2 TD3 agent specific parameters
The TD3 energy management policy implementation details are shown in table 3:
TABLE 3 TD3 Algorithm execution steps
wherein θ^Q and θ^μ are the parameters of the Critic network and the Actor network, respectively. The deep reinforcement learning agent feeds the observation signal (the engine fuel consumption rate, engine output torque, battery temperature, battery SOH and battery SOC) to the Actor network, which outputs a control action a = μ(s|θ^μ) + N through the deterministic policy function μ(s) and random noise N. By executing action a, the controlled object obtains a new state s′ and a new reward r; the tuple (s, a, r, s′) is stored in the experience replay buffer, from which M samples are drawn at random. Inputting s′ into the Actor network of the target network yields a′. The Critic network learns the value function Q(s, a) for state s and the Actor's action a using the Bellman equation, while the target Critic network computes the target Q value Q′(s, a) = E[r(s, a) + γQ′(s′, a′)], where Q′(s, a) denotes the target Q value, s the current observation, a the action selected by the agent's Actor network, E the expectation operator, r(s, a) the reward obtained for that state and action, γ the discount factor, and Q′(s′, a′) the target Q value of the next state. The controlled object obtains the new state s′ by executing action a, and a′ is the action for the next time step selected within the agent. The TD error is calculated as follows
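The equation image for the TD error is not reproduced in this text; the following is a hedged reconstruction from the surrounding definitions, assuming the standard TD3 clipped double-Q target and mean-squared critic loss:

```latex
% Hedged reconstruction of the TD target and critic loss; the exact forms in the patent image may differ.
y_j = r_j + \gamma \min_{k=1,2} Q'_k\!\left(s'_j, a'_j\right), \qquad
L_k = \frac{1}{M} \sum_{j=1}^{M} \left( y_j - Q_k\!\left(s_j, a_j\right) \right)^{2}
```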
where y denotes the approximation of the target Q value, L_k is the cumulative error, and Q(s_j, a_j) is the estimated Q value of the current network. The Actor network parameters of the current network, which map states to actions through the action-value function, are updated through gradient back-propagation of the neural network and a soft-update strategy.
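The target computation described above is the core of TD3's defense against over-estimation: the target takes the minimum of the two target critics ("clipped double Q"). The sketch below is a minimal numeric illustration; the two critics are stand-in functions, not trained networks.

```python
def td3_target(r, s_next, a_next, q1_target, q2_target, gamma=0.99, done=False):
    """y = r + gamma * min(Q1'(s', a'), Q2'(s', a')); bootstrap stops at episode end."""
    if done:
        return r
    return r + gamma * min(q1_target(s_next, a_next), q2_target(s_next, a_next))

# Two disagreeing target critics: the pessimistic one sets the target,
# which counteracts the over-estimation of a single critic.
q1 = lambda s, a: 5.0
q2 = lambda s, a: 3.0
y = td3_target(r=1.0, s_next=None, a_next=None, q1_target=q1, q2_target=q2)
```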
In step 3, the deep reinforcement learning agent learns while interacting with its environment (the vehicle and road conditions), selecting the actions that maximize the reward. Because the actions the agent selects in the early stage are far from optimal and can have undesirable consequences, the agent is first trained under standard operating conditions to obtain stable hyper-parameters (learning rate, number of neurons, number of network layers, replay buffer size, minibatch size, etc.) and is then applied to actual road conditions. A suitable standard drive cycle is selected and imported into a driver model; the driver model preprocesses the road-condition information, taking the cycle's speed, acceleration and grade as inputs and outputting the speed, acceleration and total torque demand required for driving. During training, the TD3 agent's hyper-parameters are adjusted according to the vehicle and cycle information so that the agent can quickly and accurately select the optimal control action. The deep reinforcement learning TD3 network can be trained with three typical standard drive cycles, but is not limited thereto. The speed profiles of the three cycles are shown in fig. 4, and the characteristics of each cycle are shown in table 4:
table 4 Standard Condition characteristics
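The driver model's conversion of speed, acceleration and gradient into a total torque demand can be illustrated with a standard longitudinal road-load calculation. All vehicle parameters below (mass, rolling-resistance and drag coefficients, frontal area, wheel radius) are illustrative placeholders, not values from the patent.

```python
import numpy as np

def torque_demand(v, a, grade, m=1500.0, g=9.81, f_r=0.015,
                  rho=1.2, cd=0.3, A=2.2, r_wheel=0.3):
    """Total wheel-torque demand [N*m] from speed v [m/s], acceleration
    a [m/s^2] and road grade [rad], via a longitudinal road-load model.
    Vehicle parameters are placeholders, not the patent's values."""
    F_roll = m * g * f_r * np.cos(grade)      # rolling resistance
    F_grade = m * g * np.sin(grade)           # grade resistance
    F_aero = 0.5 * rho * cd * A * v ** 2      # aerodynamic drag
    F_inertia = m * a                         # force to accelerate the mass
    return (F_roll + F_grade + F_aero + F_inertia) * r_wheel
```

The hybrid powertrain then splits this demand between engine torque (the TD3 action) and motor torque.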
In step 4, actual vehicle operation data are collected and converted into actual road-condition data, which are imported into the driver model; energy management is then performed with the trained deep reinforcement learning agent. At the same time, the trained TD3 energy management strategy can be verified and its optimality tested. The actual road speed profile is shown in fig. 5.
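Applying the trained agent to a recorded real-road cycle, as in step 4, amounts to rolling the deterministic policy forward over the cycle and accumulating the quantities to be verified (e.g. fuel use). The `agent` and `env` interfaces below are hypothetical, shown only to illustrate the deployment loop.

```python
def evaluate_policy(agent, env, cycle):
    """Roll a trained agent over a recorded drive cycle and accumulate
    fuel use. `agent`, `env` and the cycle format are hypothetical
    interfaces, not the patent's implementation."""
    state = env.reset(cycle)
    total_fuel, done = 0.0, False
    while not done:
        action = agent.act(state)          # deterministic policy, no exploration noise
        state, reward, done, info = env.step(action)
        total_fuel += info["fuel_g"]       # grams of fuel consumed this step
    return total_fuel
```

At deployment time the exploration noise used during training is switched off, so the agent's action is simply the actor network's output for the current state.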
In summary, the proposed method not only achieves optimal fuel economy while the vehicle is driving, but also keeps the battery within a suitable temperature range, extends its service life, and optimizes the overall multi-objective performance of the hybrid vehicle.
In an exemplary embodiment, there is also provided a multi-target HEV energy management system based on the twin-delayed deep deterministic policy gradient (TD3), comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform all or part of the steps of the method.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, on which a computer program is stored, which when executed by a processor implements all or part of the steps of the method. For example, the non-transitory computer readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Claims (4)
1. A HEV energy management method comprising:
establishing a dynamic model, a battery thermal model and a battery life model of the parallel hybrid electric vehicle, and taking the engine fuel consumption rate m_f, the engine output torque T_eng, the battery temperature Temp, the battery SOC and the battery SOH calculated from the three models as control targets;
constructing a dual-delay depth deterministic strategy gradient TD3 network;
taking the engine fuel consumption rate, engine output torque, battery temperature, battery SOH and battery SOC of the control targets as the TD3 state space signal S; taking the engine output torque as the TD3 action space signal A; and formulating the reward function r of TD3;
acquiring parameters and observations that influence energy management while the vehicle is driven under standard conditions, wherein these parameters and observations, together with the reward function r, are used to train the TD3 network so that it selects the action maximizing the reward function r according to the received state signal S, thereby obtaining a trained deep reinforcement learning agent;
and acquiring parameters and observations that influence energy management during actual vehicle operation, including the engine fuel consumption rate, engine output torque, battery temperature, battery SOH and battery SOC taken as control targets, and inputting them into the trained deep reinforcement learning agent for energy management.
2. The HEV energy management method of claim 1, wherein the TD3 state space signal is S = (SOC, m_f, T_eng, Temp, SOH), the action space signal is A = {T_eng | T_eng ∈ [-250, 841]}, and the reward function is defined by equation (1):
wherein b is an offset used to adjust the range of the reward function; i denotes the time step; m_f,i denotes the engine fuel consumption rate; C_b denotes the battery degradation cost; P_s and P_t denote, respectively, the penalty for the deviation of the SOC from the reference value SOC_ref and the penalty for excessive temperature; ω_1 and ω_2 denote the weights of the influencing factors P_s and P_t, respectively; C_b is calculated from equation (2):
C_b,i = λ·ΔSOH (2)
where λ is the ratio of battery replacement cost to one kilogram of fuel cost;
The deviation of the SOC from the reference value SOC_ref and the penalty coefficient for excessive temperature are determined by equation (8) and equation (9):
wherein SOC_ref = 0.6 is the battery SOC reference value; T_ref is the penalty trigger threshold, which can be set to 40 °C; τ_1 and τ_2 are adjustment coefficients that bring the battery SOC deviation penalty and the over-temperature penalty to the same order of magnitude as the engine fuel consumption rate.
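Since the body of equation (1) is not reproduced in the text, the sketch below assembles the reward from the quantities defined in this claim under the assumption that r equals the offset b minus the weighted sum of fuel use, SOC-deviation penalty, over-temperature penalty and battery degradation cost; the quadratic shapes of P_s and P_t and all default parameter values are likewise illustrative assumptions.

```python
def td3_reward(m_f, soc, temp, d_soh, b=1.0, w1=1.0, w2=1.0,
               lam=25.0, soc_ref=0.6, t_ref=40.0, tau1=1.0, tau2=1.0):
    """Multi-objective reward assembled from the claim's quantities.
    The combined form of equation (1) and the penalty shapes are
    assumptions; only SOC_ref = 0.6 and T_ref = 40 C come from the text."""
    P_s = tau1 * (soc - soc_ref) ** 2           # SOC deviation penalty, cf. eq. (8)
    P_t = tau2 * max(temp - t_ref, 0.0) ** 2    # over-temperature penalty, cf. eq. (9)
    C_b = lam * d_soh                           # battery degradation cost, eq. (2)
    return b - (m_f + w1 * P_s + w2 * P_t + C_b)
```

With this form the reward is largest when fuel use is low, SOC stays near 0.6, the battery stays below 40 °C, and SOH degrades slowly, which matches the multi-objective trade-off the method optimizes.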
3. A HEV energy management system, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of the method of any one of claims 1-2.
4. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110654498.3A CN113246958B (en) | 2021-06-11 | 2021-06-11 | TD3-based multi-target HEV energy management method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113246958A true CN113246958A (en) | 2021-08-13 |
CN113246958B CN113246958B (en) | 2022-06-14 |
Family
ID=77187634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110654498.3A Active CN113246958B (en) | 2021-06-11 | 2021-06-11 | TD3-based multi-target HEV energy management method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113246958B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102014222513A1 (en) * | 2014-11-04 | 2016-05-04 | Continental Automotive Gmbh | Method of operating a hybrid or electric vehicle |
CN108216201A (en) * | 2016-12-21 | 2018-06-29 | 株式会社电装 | Controller of vehicle, control method for vehicle and the recording medium for storing vehicle control program |
CN110254418A (en) * | 2019-06-28 | 2019-09-20 | 福州大学 | A hybrid electric vehicle reinforcement learning energy management control method |
CN110341690A (en) * | 2019-07-22 | 2019-10-18 | 北京理工大学 | A PHEV Energy Management Method Based on Deterministic Policy Gradient Learning |
CN112249002A (en) * | 2020-09-23 | 2021-01-22 | 南京航空航天大学 | A heuristic series-parallel hybrid energy management method based on TD3 |
CN112440974A (en) * | 2020-11-27 | 2021-03-05 | 武汉理工大学 | HEV energy management method based on distributed depth certainty strategy gradient |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114290959A (en) * | 2021-12-30 | 2022-04-08 | 重庆长安新能源汽车科技有限公司 | Power battery active service life control method and system and computer readable storage medium |
CN114290959B (en) * | 2021-12-30 | 2023-05-23 | 重庆长安新能源汽车科技有限公司 | Active life control method and system for power battery and computer readable storage medium |
CN114852043A (en) * | 2022-03-23 | 2022-08-05 | 武汉理工大学 | A HEV energy management method and system based on tiered reward TD3 |
CN118092150A (en) * | 2023-11-13 | 2024-05-28 | 重庆大学 | Weightless training and testing methods for deep reinforcement learning-based energy management strategies |
CN118092150B (en) * | 2023-11-13 | 2024-10-22 | 重庆大学 | Weight-free training and testing method for deep reinforcement learning type energy management strategy |
CN118842103A (en) * | 2024-09-24 | 2024-10-25 | 湖南理工职业技术学院 | Hybrid energy storage photovoltaic power generation control method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||