
CN114445252A - Data completion method, device, electronic device and storage medium - Google Patents


Info

Publication number
CN114445252A
CN114445252A
Authority
CN
China
Prior art keywords
data
matrix
data set
attention
completion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111346757.2A
Other languages
Chinese (zh)
Inventor
余剑峤
张舒昱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southern University of Science and Technology
Original Assignee
Southern University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southern University of Science and Technology
Priority to CN202111346757.2A
Publication of CN114445252A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/26: Government or public services
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Tourism & Hospitality (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Traffic Control Systems (AREA)

Abstract

The embodiments of the invention provide a data completion method and device, an electronic device, and a storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring an original data set; preprocessing the original data set and classifying it into an incomplete data set and a historical data set; extracting temporal features from the incomplete data set to obtain a first time series; extracting temporal features from the historical data set to obtain a second time series; performing multi-head attention computation on the first time series to obtain a first output matrix; performing multi-head attention computation on the second time series to obtain a second output matrix; and fusing the first output matrix and the second output matrix to obtain the completed data. The method can fill in missing traffic data and achieves good completion results under different patterns of data loss.

Description

Data completion method, device, electronic device and storage medium

TECHNICAL FIELD

The present invention relates to the technical field of artificial intelligence, and in particular to a data completion method, device, electronic device, and storage medium.

BACKGROUND

Traffic data, the collection of road characteristics such as flow and speed over a period of time, is an important component of an Intelligent Transportation System (ITS). On the basis of road characteristic data, traffic authorities can carry out reasonable and effective traffic control, and enterprises can provide more accurate and reliable services. In practice, however, traffic data sets often have missing entries caused by sensor failures, regional power outages, extreme weather, and similar factors. How to provide a data completion method that fills in the missing traffic data has therefore become an urgent technical problem.

SUMMARY OF THE INVENTION

The present invention aims to solve, at least to some extent, one of the technical problems in the related art. To this end, the present invention provides a data completion method, device, electronic device, and storage medium that can fill in missing traffic data on the basis of historical traffic data and achieve good completion results under different patterns of data loss.

To achieve the above object, a first aspect of the embodiments of the present invention provides a data completion method, comprising:

acquiring an original data set;

performing data preprocessing on the original data set to classify the original data set into an incomplete data set and a historical data set;

performing temporal feature extraction on the incomplete data set to obtain a first time series;

performing temporal feature extraction on the historical data set to obtain a second time series;

performing multi-head attention computation on the first time series to obtain a first output matrix;

performing multi-head attention computation on the second time series to obtain a second output matrix;

performing fusion processing on the first output matrix and the second output matrix to obtain the completed data.
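
The steps above can be sketched end to end as a toy pipeline. This is a hedged illustration only: the preprocessing, feature-extraction, attention, and fusion operators are deliberately simplified stand-ins (forward fill, per-slot averaging, an identity pass, and an element-wise mean), not the patented networks, and every name below is a placeholder.

```python
# Toy end-to-end sketch of the claimed data completion pipeline.
# `None` marks a missing reading in the incomplete data set.

def preprocess(raw):
    """Classify the raw data set into the incomplete part and the historical part."""
    return raw["current"], raw["history"]

def extract_incomplete_series(incomplete):
    """First time series: forward-fill missing readings (a stand-in for the
    LSTM completion network; a leading gap falls back to 0.0)."""
    series, last = [], 0.0
    for v in incomplete:
        last = v if v is not None else last
        series.append(last)
    return series

def extract_history_series(history):
    """Second time series: per-slot historical average (a stand-in for the
    historical data processing network)."""
    slots = len(history[0])
    return [sum(day[t] for day in history) / len(history) for t in range(slots)]

def attention_stub(series):
    """Placeholder for the multi-head attention output matrix."""
    return list(series)

def fuse(m1, m2):
    """Placeholder fusion: element-wise mean of the two output matrices."""
    return [(a + b) / 2 for a, b in zip(m1, m2)]

raw = {
    "current": [10.0, None, 12.0, None],
    "history": [[9.0, 11.0, 12.0, 13.0], [11.0, 13.0, 12.0, 15.0]],
}
incomplete, history = preprocess(raw)
completed = fuse(attention_stub(extract_incomplete_series(incomplete)),
                 attention_stub(extract_history_series(history)))
print(completed)
```

The point of the sketch is the data flow (split, two extraction branches, two attention branches, fusion), not the operators themselves.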

In some embodiments of the present invention, performing temporal feature extraction on the incomplete data set to obtain the first time series comprises:

inputting the incomplete data set into a preset long short-term memory completion network;

extracting the observation data within a preset period from the incomplete data set;

calculating the missing data of the preset period according to the observation data and the prediction data of the period preceding the preset period;

obtaining the first time series according to the missing data and the observation data.
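
A minimal sketch of the recurrence just described, under stated assumptions: the preset LSTM completion network is replaced by a single exponential-smoothing predictor, so a missing reading for the current period is filled from the prediction carried over from the previous period, while observed readings update that prediction. `alpha` is a hypothetical smoothing parameter, not a value from the patent.

```python
def complete_series(observations, alpha=0.5):
    """Return the first time series with gaps (None) filled.

    When x_t is missing, it is replaced by the previous period's prediction;
    a gap before any observation falls back to 0.0.
    """
    filled, pred = [], None
    for x in observations:
        if x is None:
            x = pred if pred is not None else 0.0   # fill from last prediction
        # update the running prediction with the (observed or filled) value
        pred = x if pred is None else alpha * x + (1 - alpha) * pred
        filled.append(x)
    return filled

out = complete_series([4.0, None, 6.0, None])
print(out)
```
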

In some embodiments of the present invention, performing temporal feature extraction on the historical data set to obtain the second time series comprises:

inputting the historical data set into a preset historical data processing network;

calculating the historical average data of the historical data set through the historical data processing network, and taking the historical average data as the historical data of the preset period;

obtaining the second time series according to the historical average data.
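
The historical-average computation can be sketched directly: the mean over past days for each time slot serves as the historical data of the preset period. This is a plain reading of the description; the internal structure of the actual historical data processing network is not specified here.

```python
def historical_average(history):
    """history: list of past days, each a list of readings per time slot.
    Returns the second time series: the per-slot mean across days."""
    days = len(history)
    return [sum(day[t] for day in history) / days
            for t in range(len(history[0]))]

# Three past days, two time slots each (toy speed readings).
ha = historical_average([[60.0, 50.0], [62.0, 54.0], [64.0, 58.0]])
print(ha)
```
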

In some embodiments of the present invention, performing the multi-head attention computation on the first time series to obtain the first output matrix comprises:

inputting the first time series into a first multi-head attention layer;

converting the first time series into a first attention matrix through the first multi-head attention layer;

converting the first attention matrix into a first probability matrix through a preset function;

performing an attention computation on the first probability matrix to obtain a first feature matrix;

performing dimensionality reduction on the first feature matrix to obtain the first output matrix.
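
The five steps above map onto scaled dot-product attention. The sketch below is a hedged, pure-Python rendition: the attention matrix is Q·K^T (with identity Q/K/V projections as a simplifying assumption), the preset function is taken to be softmax, and the dimensionality reduction is a per-feature average over the concatenated heads. Real layers use distinct learned projection weights per head.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention_head(x):
    """One head over a list of d-dimensional vectors x (the time series)."""
    d = len(x[0])
    scores = [[sum(q * k for q, k in zip(qv, kv)) / math.sqrt(d) for kv in x]
              for qv in x]                        # attention matrix Q.K^T / sqrt(d)
    probs = [softmax(row) for row in scores]      # probability matrix
    return [[sum(p * v[j] for p, v in zip(row, x)) for j in range(d)]
            for row in probs]                     # feature matrix (weighted V)

def multi_head_attention(x, heads=2):
    """Concatenate the head outputs, then average the per-feature copies as
    a stand-in for the dimensionality-reduction step."""
    outs = [attention_head(x) for _ in range(heads)]
    concat = [sum((o[i] for o in outs), []) for i in range(len(x))]
    d = len(x[0])
    return [[sum(row[j::d]) / heads for j in range(d)] for row in concat]

out = multi_head_attention([[1.0, 0.0], [0.0, 1.0]])
print(out)
```

Each output row is a convex combination of the input rows, so the rows of the output still sum to 1 for this one-hot input.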

In some embodiments of the present invention, performing the multi-head attention computation on the second time series to obtain the second output matrix comprises:

inputting the second time series into a second multi-head attention layer;

converting the second time series into a second attention matrix through the second multi-head attention layer;

converting the second attention matrix into a second probability matrix through a preset function;

performing an attention computation on the second probability matrix to obtain a second feature matrix;

performing dimensionality reduction on the second feature matrix to obtain the second output matrix.

In some embodiments of the present invention, performing the fusion processing on the first output matrix and the second output matrix to obtain the completed data comprises:

splicing the first output matrix and the second output matrix to obtain a spliced matrix;

performing an attention computation on the spliced matrix through a single-head attention layer to obtain a target matrix;

performing feature extraction on the target matrix through a linear layer to obtain the completed data.

In some embodiments of the present invention, performing the attention computation on the spliced matrix through the single-head attention layer to obtain the target matrix comprises:

inputting the spliced matrix into the single-head attention layer;

converting the spliced matrix into an attention matrix through the single-head attention layer;

converting the attention matrix into a probability matrix through a preset function;

performing an attention computation on the probability matrix to obtain the target matrix.
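
A minimal sketch of the fusion stage under stated assumptions: the two output matrices are spliced along the feature axis, a single-head attention pass (with softmax as the preset function and identity projections) produces the target matrix, and the linear layer is reduced to fixed uniform weights rather than learned ones.

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def single_head_attention(x):
    d = len(x[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
              for q in x]                      # attention matrix
    probs = [softmax(r) for r in scores]       # probability matrix
    return [[sum(p * v[j] for p, v in zip(r, x)) for j in range(d)]
            for r in probs]                    # target matrix

def fuse(m1, m2):
    spliced = [r1 + r2 for r1, r2 in zip(m1, m2)]   # splice feature-wise
    target = single_head_attention(spliced)
    d = len(target[0])
    # "linear layer" with toy uniform weights 1/d: one value per row
    return [sum(row) / d for row in target]

out = fuse([[1.0], [3.0]], [[2.0], [4.0]])
print(out)
```
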

To achieve the above object, a second aspect of the embodiments of the present invention provides a data completion device, comprising:

an original data set acquisition module, configured to acquire an original data set;

an original data set preprocessing module, configured to perform data preprocessing on the original data set to classify the original data set into an incomplete data set and a historical data set;

a first time series extraction module, configured to perform temporal feature extraction on the incomplete data set to obtain a first time series;

a second time series extraction module, configured to perform temporal feature extraction on the historical data set to obtain a second time series;

a first output matrix calculation module, configured to perform multi-head attention computation on the first time series to obtain a first output matrix;

a second output matrix calculation module, configured to perform multi-head attention computation on the second time series to obtain a second output matrix;

a data fusion module, configured to perform fusion processing on the first output matrix and the second output matrix to obtain the completed data.

To achieve the above object, a third aspect of the embodiments of the present invention provides an electronic device, comprising:

at least one memory;

at least one processor;

at least one program;

wherein the program is stored in the memory, and the processor executes the at least one program to implement the data completion method according to the first aspect above.

To achieve the above object, a fourth aspect of the present invention provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being configured to cause a computer to execute:

the data completion method according to the first aspect above.

The data completion method, device, electronic device, and storage medium proposed by the embodiments of the present invention acquire an original data set, preprocess it, and classify it into an incomplete data set and a historical data set; perform temporal feature extraction on the incomplete data set to obtain a first time series and on the historical data set to obtain a second time series; perform multi-head attention computation on the first time series to obtain a first output matrix and on the second time series to obtain a second output matrix; and finally fuse the first output matrix and the second output matrix to obtain the completed data. The technical solution provided by the embodiments of the present invention addresses the data loss caused by communication errors, sensor failures, storage loss, and other factors that frequently affect traffic data collection. By using an attention-based recurrent neural network, it can make full use of historical data and effectively improve the quality of traffic data completion.

DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of traffic data missing completely at random, provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of traffic data missing not at random, provided by an embodiment of the present invention;

FIG. 3 is a graph of traffic speed at 10-minute intervals on different dates on Dongsanlihe Street, Beijing, China, provided by an embodiment of the present invention;

FIG. 4 is a flowchart of a data completion method provided by an embodiment of the present invention;

FIG. 5 is a flowchart of step S130 in FIG. 4;

FIG. 6 is a flowchart of step S140 in FIG. 4;

FIG. 7 is a flowchart of step S150 in FIG. 4;

FIG. 8 is a flowchart of step S160 in FIG. 4;

FIG. 9 is a flowchart of step S170 in FIG. 4;

FIG. 10 is a flowchart of step S620 in FIG. 9;

FIG. 11 is a schematic diagram of the overall framework of the attention recurrent neural network for traffic data completion provided by an embodiment of the present invention;

FIG. 12 is a schematic diagram of the Beijing, China data set provided by an embodiment of the present invention;

FIG. 13 is a schematic diagram of the California District 5, USA data set provided by an embodiment of the present invention;

FIG. 14 is a schematic diagram of the Hong Kong, China data set provided by an embodiment of the present invention;

FIG. 15 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present invention.

DETAILED DESCRIPTION

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.

In the description of the present invention, it should be understood that orientation descriptions such as up, down, front, rear, left, and right refer to the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the devices or elements referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore should not be construed as limiting the present invention.

In the description of the present invention, "several" means one or more, and "multiple" means two or more; "greater than", "less than", "exceeding", and the like are understood to exclude the stated number, while "above", "below", "within", and the like are understood to include it. Where "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or any particular order or sequence of those features.

In the description of the present invention, unless otherwise expressly defined, terms such as "provided", "installed", and "connected" should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of these terms in the present invention in light of the specific content of the technical solution.

First, several terms involved in this application are explained:

Reinforcement learning: Reinforcement learning, also known as evaluative or reward-based learning, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent learning a policy, through interaction with its environment, that maximizes its return or achieves a specific goal. The agent learns a mapping from environment states to actions such that the chosen actions obtain the largest reward from the environment, i.e., such that the external environment's evaluation of the learning system (or the running performance of the whole system) is, in some sense, optimal. The basic principle is that if a behavior policy of the agent leads to a positive reward (reinforcement signal) from the environment, the agent's tendency to produce that behavior in the future is strengthened. The agent's goal is to find, in each discrete state, the optimal policy that maximizes the expected sum of discounted rewards.

Reinforcement learning treats learning as a process of trial and evaluation. The agent selects an action to apply to the environment; the environment's state changes upon accepting the action, and a reinforcement signal (reward or punishment) is fed back to the agent. Based on the reinforcement signal and the current environment state, the agent then selects the next action. What distinguishes reinforcement learning from supervised learning in connectionist learning is mainly the teacher signal: in reinforcement learning, the reinforcement signal provided by the environment is an evaluation (usually a scalar) of how good the produced action is, rather than an instruction telling the agent how to produce the correct action. Since the external environment provides little information, the agent must learn from its own experience. In this way, the agent acquires knowledge in an environment where actions are evaluated one by one, and the probability that its improved action plan adapts to the environment is increased by positive reinforcement (reward). The selected action affects not only the immediate reinforcement value but also the environment's state at the next moment and the final reinforcement value. The goal of a reinforcement learning system is to dynamically adjust its parameters so as to maximize the reinforcement signal.

In a typical reinforcement learning setting, the agent, as the learning system, obtains the current state information s of the external environment, takes a tentative action u on the environment, and receives from the environment an evaluation r of this action together with the new environment state. If an action u of the agent leads to a positive reward (immediate reward) from the environment, the agent's tendency to produce this action in the future is strengthened; otherwise, it is weakened. Through the repeated interaction between the control behavior of the learning system and the states and evaluations fed back by the environment, the mapping from states to actions is continuously revised in a learning manner, with the aim of optimizing system performance. Reinforcement learning methods fall into two categories: value-based and policy-based. Value-based methods learn a value function and derive a policy from it, indirectly producing a policy; the action-value estimates eventually converge to the corresponding true values (usually distinct finite numbers, which can be converted to probabilities between 0 and 1), so a deterministic policy is typically obtained. Policy-based methods learn a policy function directly, producing a probability πθ(a∣s) for each action; they usually do not converge to a deterministic value and are suitable for continuous action spaces, where instead of computing the probability of every action, actions can be sampled from a Gaussian (normal) distribution.

RNN (Recurrent Neural Network): An RNN consists of an input layer, a hidden layer, and an output layer. A recurrent neuron not only produces a prediction but also passes its hidden state s_{t-1} on to the next time step. The output layer is fully connected: o_t = g(V s_t), where g is the activation function, V is the weight matrix of the output layer, and s_t is the hidden state; that is, every output node is connected to every node of the hidden layer. The current output is computed from the hidden layer, and the current hidden state depends not only on the input but also on the previous hidden state: s_t = f(U x_t + W s_{t-1}), where U is the weight matrix applied to the input x_t, W is the weight matrix applied to the previous hidden state, and f is the activation function; the hidden layer is the recurrent layer.
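
The two equations above, s_t = f(U x_t + W s_{t-1}) and o_t = g(V s_t), can be run directly with scalar weights; here f = tanh, g is the identity, and the weight values are arbitrary toy constants, not learned parameters.

```python
import math

def rnn(xs, U=0.5, W=0.3, V=2.0, s0=0.0):
    """Scalar RNN: s_t = tanh(U*x_t + W*s_{t-1}), o_t = V*s_t."""
    s, outputs = s0, []
    for x in xs:
        s = math.tanh(U * x + W * s)   # hidden state carries the past forward
        outputs.append(V * s)          # output layer (g = identity)
    return outputs

out = rnn([1.0, 1.0, 1.0])
print(out)
```

With a constant input, the hidden state accumulates context, so the outputs rise monotonically toward a fixed point.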

Long Short-Term Memory (LSTM): An LSTM network is a special kind of RNN. LSTMs perform very well on many problems and are now widely used. By design, an LSTM explicitly avoids the long-term dependency problem. All recurrent neural networks take the form of a chain of repeating neural network modules; in an ordinary RNN, the repeating module has a very simple structure, such as a single tanh layer. An LSTM instead uses "gates" to remove information from, or add information to, the cell state. A gate is a way of letting information through selectively: it consists of a sigmoid neural network layer and a pointwise multiplication. The sigmoid layer outputs values between 0 and 1, describing how much of each component should pass; 0 means "let nothing through" and 1 means "let everything through". An LSTM has three gates to protect and control the cell state: the input gate, the output gate, and the forget gate.
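
A toy single-unit cell showing the three gates just described. For brevity all gates share the same scalar weights here, whereas a real LSTM learns a separate weight matrix for each gate; the weight values are arbitrary toy constants.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, w=0.5, u=0.5):
    """One step of a scalar LSTM cell (shared toy weights for all gates)."""
    f = sigmoid(w * x + u * h)          # forget gate: 0 = drop, 1 = keep
    i = sigmoid(w * x + u * h)          # input gate
    o = sigmoid(w * x + u * h)          # output gate
    c_tilde = math.tanh(w * x + u * h)  # candidate cell state
    c = f * c + i * c_tilde             # gated update of the cell state
    h = o * math.tanh(c)                # gated hidden-state output
    return h, c

h, c = 0.0, 0.0
for x in [1.0, 1.0, 1.0]:
    h, c = lstm_step(x, h, c)
print(h, c)
```

The sigmoid gates keep both the hidden state and the cell state bounded while letting the cell state accumulate information across steps.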

Attention mechanism: The attention mechanism originates from the study of human vision. In cognitive science, because of bottlenecks in information processing, humans selectively focus on part of the available information while ignoring the rest; this is commonly referred to as attention. A neural attention mechanism gives a neural network the ability to focus on a subset of its inputs by selecting specific inputs, and attention can be applied to input of any type regardless of its shape. Under limited computing power, the attention mechanism is a resource allocation scheme, the principal means of addressing information overload, that allocates computational resources to the more important tasks. The multi-head attention mechanism uses multiple queries to select multiple pieces of information from the input in parallel, with each head attending to a different part of the input.

Methods based on statistical learning: Traditional statistical learning applies simple statistical operations to complete incomplete data, such as linear interpolation, historical averaging, and filling missing values with the last observation, but such methods can only handle data with relatively simple distributions. Some researchers propose computing missing values from the data surrounding the missing positions in the feature matrix, as in the classic K-nearest neighbors (KNN) algorithm, which fills a missing position with the average of its K neighboring positions. The autoregressive integrated moving average model (ARIMA) and its variants compute missing values by predicting them from historical data; however, these methods cannot effectively exploit features collected after the missing values occur. Constraint-based methods establish completion rules from the overall characteristics of the data set; however, they apply only to univariate data and perform poorly in most practical scenarios involving multivariate data.
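
The simplest of the statistical completions mentioned above, linear interpolation, fills each gap from the nearest observed readings on either side; in this sketch a gap at either edge falls back to the nearest observation.

```python
def linear_interpolate(series):
    """Fill None gaps by linear interpolation between observed neighbours."""
    obs = [i for i, v in enumerate(series) if v is not None]
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((j for j in obs if j < i), default=None)
        right = min((j for j in obs if j > i), default=None)
        if left is None:
            filled[i] = series[right]            # leading gap: nearest obs
        elif right is None:
            filled[i] = series[left]             # trailing gap: nearest obs
        else:
            w = (i - left) / (right - left)      # position within the gap
            filled[i] = series[left] * (1 - w) + series[right] * w
    return filled

out = linear_interpolate([2.0, None, None, 8.0])
print(out)
```
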

Methods based on matrix factorization: Matrix factorization discovers correlations in the data by decomposing and reconstructing the data matrix, and completes the missing values accordingly. Temporal Regularized Matrix Factorization (TRMF) is a time series completion method that introduces regularization and scalable matrix factorization on top of ordinary matrix factorization. Other work introduces Probabilistic Principal Component Analysis (PPCA) into matrix factorization; this method assumes that the latent features of the observed data follow a Gaussian distribution and completes the data on that basis. A more sophisticated low-rank tensor completion algorithm has also been adopted to recover missing data, and the Bayesian Gaussian CANDECOMP/PARAFAC (BGCP) tensor decomposition method converts the original data matrix into a high-dimensional tensor and then describes and recovers the incomplete matrix. However, methods based on matrix factorization require inputs with specific shapes, which limits their applications.
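
The common core of these methods can be sketched as completion by plain matrix factorization: approximate X ≈ U·V^T by gradient descent on the observed entries only, then read the missing entries off the reconstruction. The rank, learning rate, epoch count, and initialization below are toy choices, not values from TRMF, PPCA, or BGCP.

```python
import random

def mf_complete(X, rank=2, lr=0.05, epochs=1000, seed=0):
    """Complete a matrix with None gaps via SGD on observed entries of U.V^T."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(m)]
    for _ in range(epochs):
        for i in range(n):
            for j in range(m):
                if X[i][j] is None:
                    continue                      # fit observed entries only
                err = sum(U[i][k] * V[j][k] for k in range(rank)) - X[i][j]
                for k in range(rank):
                    u, v = U[i][k], V[j][k]       # use pre-update values
                    U[i][k] -= lr * err * v
                    V[j][k] -= lr * err * u
    return [[sum(U[i][k] * V[j][k] for k in range(rank)) for j in range(m)]
            for i in range(n)]

# A rank-1 pattern with one entry missing; with small initialization, gradient
# descent typically reconstructs the missing entry near the rank-1 value,
# though that is not guaranteed in general.
X = [[1.0, 2.0], [2.0, None]]
R = mf_complete(X)
print(round(R[1][1], 1))
```
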

Generative Adversarial Networks (GAN): A GAN is a deep learning model and one of the most promising approaches to unsupervised learning over complex distributions in recent years. The framework contains at least two modules, a generative model and a discriminative model, which produce good outputs by learning through a mutual game. The original GAN theory does not require the generator G and the discriminator D to be neural networks; any functions capable of fitting the corresponding generation and discrimination will do. In practice, however, deep neural networks are generally used for both G and D. A good GAN application requires a careful training method, otherwise the freedom of the neural network models may lead to unsatisfactory outputs.
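The mutual game described above optimizes a minimax value function that D maximizes and G minimizes. The following toy sketch evaluates that value with simple linear stand-ins for G and D (consistent with the original GAN theory's point that G and D need not be neural networks); all names, constants, and distributions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical linear stand-ins for the generator G and discriminator D.
def G(z, w=2.0, b=1.0):      # maps noise z to a fake sample
    return w * z + b

def D(x, w=0.5, b=0.0):      # maps a sample to an estimated P(real)
    return sigmoid(w * x + b)

real = rng.normal(loc=1.0, scale=0.1, size=1000)   # "real" data samples
z = rng.normal(size=1000)                          # noise input
fake = G(z)

# GAN value function V(D, G): D is trained to maximize it, G to minimize it.
value = np.mean(np.log(D(real))) + np.mean(np.log(1.0 - D(fake)))
```

Training alternates gradient steps on D (ascending V) and on G (descending V); this sketch only shows the objective both players share.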

Historical Average (HA): HA takes the complete data of the most recent five days and, for each road segment, computes the average over the corresponding time period of the day to fill in the missing values.
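The HA baseline above reduces to a single average over the day axis; a minimal NumPy illustration follows (the array layout, with days as the leading axis, is an assumption for the sketch).

```python
import numpy as np

def historical_average(history):
    """history: array of shape (days, segments, slots) holding the complete
    data of the most recent days; returns the per-segment, per-slot
    average used to fill missing values."""
    return history.mean(axis=0)

# 5 days, 2 road segments, 3 time slots per day (toy values)
history = np.arange(5 * 2 * 3, dtype=float).reshape(5, 2, 3)
ha = historical_average(history)   # shape (2, 3)
```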

K-Nearest Neighbors algorithm (KNN): KNN is a non-parametric interpolation method that estimates a missing value as the average of its k nearest neighboring points; in the embodiments of the invention, the 4 nearest neighbors are used.
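A minimal sketch of the KNN baseline above, assuming "nearest" means nearest in time index (the function name and sample data are illustrative).

```python
import numpy as np

def knn_impute(series, k=4):
    """Fill each NaN with the mean of its k nearest observed points
    (nearest in time index), as in the KNN baseline described above."""
    series = np.asarray(series, dtype=float)
    out = series.copy()
    obs_idx = np.flatnonzero(~np.isnan(series))
    for i in np.flatnonzero(np.isnan(series)):
        # k observed positions closest to the missing index i
        nearest = obs_idx[np.argsort(np.abs(obs_idx - i))[:k]]
        out[i] = series[nearest].mean()
    return out

speeds = [10.0, 12.0, np.nan, 14.0, 16.0]
filled = knn_impute(speeds, k=4)   # NaN -> mean(10, 12, 14, 16) = 13
```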

Bayesian Gaussian CP decomposition (BGCP): BGCP is a Bayesian tensor decomposition model that uses Markov chain Monte Carlo to model the latent factors, i.e., the low-rank structure.

Denoising Stacked Autoencoders (DSAE): DSAE receives a time-series feature matrix containing observations and noise, extracts the hidden features of the data by reducing and then restoring its dimensionality, and outputs the completion result.

Bayesian Temporal Matrix Factorization (BTMF): BTMF uses a Gaussian vector autoregressive process to model temporal dependencies in order to complete time-series data.

Parallel Data and Generative Adversarial Networks for Imputation (PGAN): PGAN is a GAN-based data completion method for generating missing traffic data, in which both the generator and the discriminator consist of linear layers.

Bidirectional Recurrent Imputation for Time Series (BRITS): BRITS feeds the incomplete temporal data into a forward RNN and a backward RNN separately, and combines the outputs of the two RNNs to estimate the missing time-series values.

Missing completely at random and missing not at random: The present invention considers two common data-missing patterns, missing completely at random (MCAR) and missing not at random (MNAR), shown in Figure 1 and Figure 2 respectively, where gray cells denote missing data and white cells denote observed data. Under MCAR, the missing entries are scattered randomly across the time series, whereas under MNAR, missing data points occur at consecutive time points. MNAR is the more challenging problem, because the neighboring information necessary to recover an individual missing point is itself missing.
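The two missing patterns above can be simulated with boolean masks, scattering the missing entries for MCAR and placing them in one contiguous block for MNAR (a hypothetical sketch; the series length and missing rate are illustrative).

```python
import numpy as np

def make_mask(n_steps, missing_rate, mode, seed=0):
    """Return a boolean mask (True = missing) for one time series.
    MCAR: missing positions chosen independently at random.
    MNAR: one contiguous block of consecutive steps is missing."""
    rng = np.random.default_rng(seed)
    n_missing = int(n_steps * missing_rate)
    mask = np.zeros(n_steps, dtype=bool)
    if mode == "MCAR":
        mask[rng.choice(n_steps, size=n_missing, replace=False)] = True
    elif mode == "MNAR":
        start = rng.integers(0, n_steps - n_missing + 1)
        mask[start:start + n_missing] = True
    return mask

mcar = make_mask(100, 0.2, "MCAR")   # 20 scattered missing points
mnar = make_mask(100, 0.2, "MNAR")   # 20 consecutive missing points
```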

Traffic data, a collection of road features (such as flow and speed) over a period of time, is an important component of an Intelligent Transportation System (ITS). On the basis of such data, traffic authorities can carry out reasonable and effective traffic control, and enterprises can provide more accurate and reliable services. Recently, deep-learning-based algorithms have been widely applied to such data.

However, most deep-learning-based methods depend heavily on the quality of the data. In practice, traffic datasets often contain missing entries caused by sensor failures, regional power outages, extreme weather, and so on. The problem of missing traffic data therefore urgently needs to be solved.

To address the missing-data problem, the most straightforward approach is to discard the incomplete and unobserved data. However, such operations cause a loss of temporal or spatial information. To avoid this, an appropriate data completion method can be used to estimate the incomplete data. Data completion methods estimate missing values by analyzing the dependencies or the distribution of the data. An appropriate data completion method can accurately recover missing data, thereby avoiding performance degradation in the various downstream data-mining algorithms of an intelligent transportation system.

Furthermore, compared with conventional time-series data such as stock indices and medical-device readings, traffic data is strongly periodic and volatile. Figure 3 shows the road speed of Dongsanlihe Street in Beijing, China over five days. Overall, traffic speeds on consecutive days follow similar patterns; for example, road speeds are low during the morning and evening rush hours of every day. At the same time, owing to factors such as weather, accidents, and the date, traffic speeds differ considerably from day to day, while the speed at each time point is highly correlated with the road speeds immediately before and after it. Therefore, how to combine the observations before and after each time point with the overall historical data is crucial for handling missing data.

Some technical solutions design algorithms based on statistical methods such as KNN and ARIMA to complete incomplete data, but these methods are effective only for data with relatively simple distributions. Building on matrix factorization, some technical solutions introduce probabilistic models and Gaussian distributions, but these methods apply only to low-rank data and struggle with traffic data, which is affected by many factors; moreover, they impose specific requirements on the format of the input data, which limits their application scenarios. In recent years, deep learning has been widely applied to data completion problems. For example, some technical solutions propose BRITS, with a bidirectional recurrent neural network structure, to combine the data before and after the missing positions for completion. Some technical solutions propose E2GAN, based on generative adversarial networks, to generate the missing data. However, these studies often suffer from slow computation or slow convergence on large datasets. Although existing methods have produced acceptable results on the incomplete-traffic-data completion problem, several challenges remain. For example, some existing methods are overly complex, and many models are difficult to train. Existing methods consider only the temporal dependencies within the incomplete time-series data, without exploiting the inherent periodicity of traffic data. In addition, how to better extract temporal features from incomplete data remains an open question.

To solve the above problems and fill the research gap, the present invention proposes a new data completion model called the Attention-Driven Recurrent Imputation Network (ADRIN). Unlike previous models, the present invention extracts features from both the incomplete input and the historical averages, so as to account for the volatility and periodicity of traffic data. ADRIN uses a Long Short-Term Memory for Imputation (LSTM-I) network to receive the incomplete temporal input, and a multi-head attention network to model the completed temporal features. Furthermore, the present invention applies a multi-head attention network to the historical data to extract features related to the historical information. The outputs of these two modules are then passed through a fusion module, which contains an attention layer and a linear layer, to obtain the completion result.

On this basis, the embodiments of the present invention provide a data completion method, apparatus, electronic device, and storage medium, which can realize the completion of traffic data.

The embodiments of the present invention provide a data completion method, apparatus, electronic device, and storage medium, which are described through the following embodiments; the data completion method in the embodiments of the present invention is described first.

The embodiments of the present invention can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.

Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.

The data completion method provided by the embodiments of the present invention relates to the technical field of artificial intelligence, and in particular to a data completion method, apparatus, electronic device, and storage medium. The data completion method provided by the embodiments of the present invention can be applied to a terminal, to a server, or as software running in a terminal or a server. In some embodiments, the terminal may be a smartphone, tablet computer, notebook computer, desktop computer, smart watch, or the like; the server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain-name services, security services, Content Delivery Networks (CDN), and big-data and artificial-intelligence platforms; the software may be an application implementing the data completion method, among other forms, but is not limited to the above.

The present application may be used in numerous general-purpose or special-purpose computer-system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. The application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

FIG. 4 is an optional flowchart of the data completion method provided by an embodiment of the present invention. The method in FIG. 4 may include, but is not limited to, steps S110 to S170.

Step S110, obtaining an original data set;

Step S120, performing data preprocessing on the original data set to divide the original data set into an incomplete data set and a historical data set;

Step S130, performing temporal feature extraction on the incomplete data set to obtain a first time series;

Step S140, performing temporal feature extraction on the historical data set to obtain a second time series;

Step S150, performing multi-head attention computation on the first time series to obtain a first output matrix;

Step S160, performing multi-head attention computation on the second time series to obtain a second output matrix;

Step S170, fusing the first output matrix and the second output matrix to obtain the completed data.
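The data flow of steps S110 to S170 can be sketched as follows, with simple NumPy placeholders standing in for the real LSTM-I, multi-head attention, and fusion modules; every function here is a hypothetical stand-in that only illustrates the shapes and ordering of the steps.

```python
import numpy as np

def preprocess(raw):                          # S120: split the raw data
    """Treat the last day as the incomplete set, earlier days as history."""
    return raw[-1], raw[:-1]

def extract_incomplete_features(incomplete):  # S130: LSTM-I stand-in
    return np.nan_to_num(incomplete)          # NaN -> 0 as a crude fill

def extract_history_features(history):        # S140: HA stand-in
    return history.mean(axis=0)

def attention_stage(series):                   # S150 / S160 stand-in
    return series                              # identity placeholder

def fuse(first, second, alpha=0.5):            # S170: fusion stand-in
    return alpha * first + (1 - alpha) * second

raw = np.array([[1.0, 2.0, 3.0],
                [1.0, 2.0, 3.0],
                [1.0, np.nan, 3.0]])           # S110: original data set
incomplete, history = preprocess(raw)
first = attention_stage(extract_incomplete_features(incomplete))
second = attention_stage(extract_history_features(history))
completed = fuse(first, second)
```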

In practice, the collection of traffic data is often affected by communication errors, sensor failures, storage loss, and other factors, so the collected data is damaged, which seriously affects the effectiveness of downstream applications. Existing completion methods, however, estimate the missing data only from the incomplete observations and neglect the historical data. The present invention uses an attention-driven recurrent neural network to solve the problem of missing traffic data, which can effectively improve the traffic-data completion effect.

In step S110 of some embodiments, an original data set is obtained; the original data set may be collected by configuring different data sources and acquiring traffic-speed data of different regions from them. For example, three real-world traffic-speed datasets are used, including but not limited to: Beijing (China) traffic data, California (USA) traffic data, and Hong Kong (China) traffic data. The Beijing traffic data comprises the average speeds of 1368 roads in Beijing, China from 00:00 on January 1, 2019 to 23:55 on June 30, 2019. The California traffic data comprises 144 sensor stations in California's District 5, recorded from 00:00 on January 1, 2013 to 23:55 on June 30, 2013. The Hong Kong traffic data comprises the average road speeds of major roads in Hong Kong, China from 00:00 on March 10, 2021 to 23:50 on July 31, 2021.

In step S120 of some embodiments, data preprocessing is performed on the original data set to divide it into an incomplete data set and a historical data set. The collected data is sampled at different time intervals, such as 5 or 10 minutes; the complete data is taken as the historical data set, and the incomplete data as the incomplete data set.

In step S130 of some embodiments, temporal feature extraction is performed on the incomplete data set to obtain the first time series. For example, temporal features can be extracted from the incomplete data set through an improved long short-term memory completion network; they can also be extracted through a parallel generative adversarial completion network, a recurrent neural network, a bidirectional recurrent neural network, and the like, without limitation.

In step S140 of some embodiments, temporal feature extraction is performed on the historical data set to obtain the second time series. For example, temporal features can be extracted from the historical data set through a historical-average network; they can also be extracted through the K-nearest-neighbors algorithm, linear interpolation, matrix-factorization-based methods, and the like, without limitation.

In step S150 of some embodiments, multi-head attention is computed on the first time series to obtain the first output matrix: the first time series obtained in step S130 is passed through the multi-head attention computation to produce the first output matrix. Although the improved long short-term memory completion network can estimate missing data step by step, its ability to capture the temporal dependencies of longer time series is limited, especially since traffic data often spans hundreds of time periods within a day. Applying the multi-head attention mechanism allows the heads to learn in several subspaces separately and capture the time-series features of long sequences.

In step S160 of some embodiments, multi-head attention is computed on the second time series to obtain the second output matrix: the second time series obtained in step S140 is passed through the multi-head attention computation to produce the second output matrix. The historical-average network likewise cannot capture the temporal structure of longer time series; to use the historical data more effectively, the multi-head attention mechanism is applied to the historical data to obtain the second output matrix.

In step S170 of some embodiments, the first output matrix and the second output matrix are fused to obtain the completed data. After the incomplete data set and the historical data set have passed through the preceding multi-head attention mechanisms, the fusion module fuses the two output matrices and computes the final completion result.
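Assuming the fusion module concatenates the two branch outputs and then applies one attention layer followed by one linear layer, it could be sketched as follows; all weight matrices are random stand-ins for learned parameters, and the projection shape is an assumption made for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion(first, second, Wq, Wk, Wv, Wo):
    """Hypothetical fusion module: concatenate the two branch outputs
    along time, run one attention layer, then a linear layer that
    projects back to the original number of time steps."""
    X = np.concatenate([first, second], axis=0)   # (2T, d)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))    # attention weights
    H = A @ V                                     # (2T, d)
    return Wo @ H                                 # (T, 2T) @ (2T, d) -> (T, d)

T, d = 4, 3
rng = np.random.default_rng(0)
first, second = rng.standard_normal((T, d)), rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Wo = rng.standard_normal((T, 2 * T)) / np.sqrt(2 * T)
out = fusion(first, second, Wq, Wk, Wv, Wo)       # completed (T, d) matrix
```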

The data completion method provided by the embodiments of the above steps of the present invention can solve the completion problem of missing traffic data. An Attention-Driven Recurrent Imputation Network (ADRIN) structure is proposed. Compared with existing deep-learning-based imputation methods, ADRIN exploits the unique periodicity and volatility of traffic data, extracting features from the incomplete input and the historical averages to fill in missing values. ADRIN uses a long short-term memory completion network (LSTM-I) to accept input containing missing values, and applies the multi-head attention mechanism to extract temporal features from the historical averages and from the LSTM-I output, respectively. The outputs of the multi-head attention mechanisms are fed into a fusion module containing an attention layer and a fully connected layer to obtain the completion result. This solves the problem that prior technical solutions cannot use historical data to complete missing data; the embodiments of the present invention address the loss of traffic data and can effectively improve the traffic-data completion effect.

Referring to FIG. 5, in some embodiments, step S130 may include, but is not limited to, steps S210 to S240.

Step S210, inputting the incomplete data set into a preset long short-term memory completion network;

Step S220, extracting the observation data of a preset period from the incomplete data set;

Step S230, calculating the missing data of the preset period according to the observation data and the predicted data of the period preceding the preset period;

Step S240, obtaining the first time series according to the missing data and the observation data.

Specifically, in step S210 of some embodiments, the incomplete data set is input into the long short-term memory completion network. In step S220 of some embodiments, the observation data of a preset period is extracted from the incomplete data set; the completeness of the data differs across periods in the incomplete data set. For example, in a data set covering 8:00 to 20:00, the data from 10:00 to 12:00 may be incomplete while the other periods are complete; the 10:00-to-12:00 data is then the observation data of the preset period. In step S230 of some embodiments, the missing data of the preset period is calculated from the observation data and the predicted data of the preceding period. Given the time period of the current observation data, for example 10:00 to 12:00 on August 23, and owing to the periodicity of traffic data, the missing data of the preset period can be calculated with reference to the 10:00-to-12:00 period of August 22; at the same time, by fitting multiple aspects of the data, the long short-term memory completion network computes the missing data of the preset period. In step S240 of some embodiments, the first time series is obtained from the missing data and the observation data: with the missing data of the preset period computed, the incomplete data set can be completed to yield the first time series. In the data completion method provided by the embodiments of the above steps, the restructured LSTM-I uses the predicted values of the previous time period to estimate the missing values of the current time period. For each time period, LSTM-I combines the estimated values and the observed values to restore the features and obtain the first time series.
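The recurrence described above, in which a missing value at the current step is replaced by an estimate derived from the previous step while observed values are kept as-is, can be sketched like this; the `predict` stand-in simply carries the previous complemented value forward, whereas the real LSTM-I learns this prediction.

```python
import numpy as np

def lstm_i_fill(series, predict=lambda prev: prev):
    """Sketch of the LSTM-I input handling: at each time step, a missing
    value is replaced by the model's prediction from the previous step
    (`predict` is a hypothetical stand-in for the learned LSTM cell)."""
    series = np.asarray(series, dtype=float)
    out = np.empty_like(series)
    prev = 0.0
    for t, x in enumerate(series):
        est = predict(prev)                  # prediction for step t
        out[t] = est if np.isnan(x) else x   # keep observations as-is
        prev = out[t]                        # feed the complemented value on
    return out

speeds = [50.0, np.nan, np.nan, 55.0]
filled = lstm_i_fill(speeds)   # NaNs carry the previous value forward
```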

Referring to FIG. 6, in some embodiments, step S140 may include, but is not limited to, steps S310 to S330.

Step S310, inputting the historical data set into a preset historical-data processing network;

Step S320, calculating the historical average data of the historical data set through the historical-data processing network, and taking the historical average data as the historical data of the preset period;

Step S330, obtaining the second time series according to the historical average data.

Specifically, in step S310 of some embodiments, the historical data set is input into the preset historical-data processing network. In step S320 of some embodiments, the historical average data of the historical data set is calculated through the historical-data processing network and taken as the historical data of the preset period: using the complete data of the most recent five days, the average over the corresponding time period of the day is computed for each road segment to fill in the missing values. In step S330, the second time series is obtained from the historical average data. In the data completion method provided by the embodiments of the above steps, owing to the periodicity of traffic data, historical data plays an important role in traffic-data completion; computing the historical average of the historical data set reduces errors and interference, and prepares for the subsequent fusion processing.

Referring to FIG. 7, in some embodiments, step S150 may include, but is not limited to, steps S410 to S450.

Step S410, inputting the first time series into a first multi-head attention layer;

Step S420, converting the first time series into a first attention matrix through the first multi-head attention layer;

Step S430, converting the first attention matrix into a first probability matrix through a preset function;

Step S440, performing attention computation on the first probability matrix to obtain a first feature matrix;

Step S450, performing dimensionality reduction on the first feature matrix to obtain the first output matrix.

Specifically, in step S410 of some embodiments, the first time series is input into the first multi-head attention layer. In step S420 of some embodiments, the first multi-head attention layer converts the first time series into the first attention matrix through the matrix operations of the multi-head attention mechanism. In step S430 of some embodiments, the first attention matrix is converted into the first probability matrix through a preset function: the softmax function converts the attention scores into normalized probabilities that sum to 1. In step S440 of some embodiments, attention computation is performed on the first probability matrix to obtain the first feature matrix; through the computation of the multi-head attention mechanism, richer feature relationships can be captured. In step S450 of some embodiments, dimensionality reduction is performed on the first feature matrix to obtain the first output matrix: compared with single-head attention, multi-head attention yields multiple output matrices Z; all of the obtained output matrices Z are combined and reduced in dimensionality through a fully connected layer, which outputs the first output matrix with the same shape as the input time series. In the data completion method provided by the embodiments of the above steps, the first time series is input into the first multi-head attention layer, and the first output matrix is obtained through matrix operations, attention computation, and dimensionality reduction. An LSTM structure can model time-series data in only a single direction, which makes it difficult to incorporate the observations that come after the missing data when performing the completion task. With the multi-head attention mechanism, the incomplete data can be completed on the basis of the entire time series, combining the observations before and after the missing positions.
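Steps S410 to S450 can be sketched as follows: per head, attention scores are computed, normalized with softmax, and used to weight the values; the heads' outputs are concatenated and projected back to the input width by a fully connected layer. All weight matrices below are random stand-ins for learned parameters, and the sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, d_head, seed=0):
    """Sketch of steps S410-S450: per head, attention scores -> softmax
    probabilities -> weighted feature matrix; then concatenate the heads
    and project back to the input width with a linear layer."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    outputs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.standard_normal((d, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_head)   # S420: attention matrix
        probs = softmax(scores)              # S430: each row sums to 1
        outputs.append(probs @ V)            # S440: per-head feature matrix Z
    Z = np.concatenate(outputs, axis=1)      # (T, heads * d_head)
    Wo = rng.standard_normal((heads * d_head, d))
    return Z @ Wo                            # S450: back to shape (T, d)

X = np.random.default_rng(1).standard_normal((6, 4))  # 6 steps, 4 features
Y = multi_head_attention(X, heads=2, d_head=3)        # same shape as X
```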

Referring to FIG. 8, in some embodiments, step S160 may include, but is not limited to, steps S510 to S550.

Step S510, inputting the second time series into a second multi-head attention layer;

Step S520, converting the second time series into a second attention matrix through the second multi-head attention layer;

Step S530, converting the second attention matrix into a second probability matrix through a preset function;

Step S540, performing attention computation on the second probability matrix to obtain a second feature matrix;

Step S550, performing dimensionality reduction on the second feature matrix to obtain the second output matrix.

Specifically, in step S510 of some embodiments, the second time series is input to the second multi-head attention layer. In step S520, the second multi-head attention layer converts the second time-series matrix into a second attention matrix. In step S530, a preset function converts the second attention matrix into a second probability matrix; the softmax function is used, so the probabilities in every column sum to 1. In step S540, the attention mechanism is computed over the second probability matrix to obtain a second feature matrix; the multi-head computation captures richer feature relationships. In step S550, the second feature matrix is reduced in dimension to obtain a second output matrix. In the data completion method provided by these embodiments, the second time series is fed into the second multi-head attention layer, and the second output matrix is obtained through matrix operations, attention computation and dimensionality reduction. The output of LSTM-I can be regarded as a preliminary completion result, but the completion quality is limited by that model's capacity. The embodiments therefore combine it with a multi-head attention mechanism, whose stronger sequence-modeling ability allows observations both before and after each incomplete position to be used for completion; and, because traffic data is periodic, a historical-average module (HA) is added. To use the historical data effectively, the outputs of the left and right branches of the model must have approximately the same distribution at the fusion module: if they were inconsistent, concatenating them directly before the fusion module would create a distribution gap too large for the parameters to be learned, whereas matching distributions make the model parameters easier to learn.

Referring to FIG. 9, in some embodiments, step S170 may include, but is not limited to, steps S610 to S630:

Step S610: splice the first output matrix and the second output matrix to obtain a spliced matrix;

Step S620: compute the attention mechanism on the spliced matrix through a single-head attention layer to obtain a target matrix;

Step S630: perform feature extraction on the target matrix through a linear layer to obtain the completion data.

In step S610 of some embodiments, the first output matrix and the second output matrix are spliced to obtain a spliced matrix: the incomplete data set and the historical data set have each passed through the preceding multi-head attention layers, and the resulting output matrices are concatenated into the spliced matrix.

In step S620 of some embodiments, the attention mechanism is computed on the spliced matrix through a single-head attention layer to obtain a target matrix: the spliced matrix is fed into the single-head attention layer, which produces its output matrix.

In step S630 of some embodiments, feature extraction is performed on the target matrix through a linear layer to obtain the completion data. After the single-head attention computation, the result is fed into the linear layer, where a fully connected neural network produces the completion data, as shown in formula (1) (reconstructed from the surrounding definitions):

$$
\hat{Y} = \mathrm{attention}(\mathrm{concat}(H, H^{*}))\,W_{l} + b_{l} \qquad (1)
$$

where concat(H, H*) is the splice of the two hidden states, W_l and b_l are the parameters of the linear layer, and \hat{Y} is the final completion result of the attention recurrent neural network.

In the data completion method provided by these embodiments, the fusion module computes the final completion data. The upper part of the attention recurrent neural network produces a first output matrix for the incomplete data set and a second output matrix for the historical data set; both output matrices carry features of their respective data sets, and the fusion module must process the first and second output matrices together to obtain the final completion data.
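The fusion of steps S610 to S630 can be sketched as follows. This is a NumPy illustration with random weights; the hidden-state shapes and projection sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, d_k=8):
    # one Query/Key/Value projection, one attention map, one output
    d = x.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) * 0.1 for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

# H and H_star: hidden states from the two branches (toy shapes)
H = rng.standard_normal((24, 16))
H_star = rng.standard_normal((24, 16))

cat = np.concatenate([H, H_star], axis=-1)  # step S610: splice
target = single_head_attention(cat)         # step S620: single-head attention
Wl = rng.standard_normal((target.shape[1], 16)) * 0.1
bl = np.zeros(16)
Y_hat = target @ Wl + bl                    # step S630: linear layer -> completion
```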

Referring to FIG. 10, in some embodiments, step S620 may include, but is not limited to, steps S710 to S740:

Step S710: input the spliced matrix to the single-head attention layer;

Step S720: convert the spliced matrix into an attention matrix through the single-head attention layer;

Step S730: convert the attention matrix into a probability matrix through a preset function;

Step S740: perform the attention computation on the probability matrix to obtain the target matrix.

In step S710 of some embodiments, the spliced matrix, i.e. the splice of the first output matrix and the second output matrix, is input to the single-head attention layer. In step S720, the single-head attention layer processes the spliced matrix to form an attention matrix. In step S730, a preset function converts the attention matrix into a probability matrix; the softmax function is used, so the probabilities in every column sum to 1. In step S740, the single-head attention computation is performed on the probability matrix to obtain the target matrix.

In a specific application scenario, the goal of traffic-speed data completion is to estimate the missing entries from the known, incomplete traffic-speed data. Consider the ground-truth road-speed data $Y \in \mathbb{R}^{n \times T}$ and an input feature matrix with missing data points, denoted $X^{(?)} \in \mathbb{R}^{n \times T}$, where n is the number of nodes (e.g. sensor stations or road segments), T is the number of time steps in a day, and $x_{ij}$ is the observed data point of node i at the j-th time step. The embodiments additionally define a mask matrix (also called a flag matrix) $M \in \{0,1\}^{n \times T}$, as shown in formula (2) (reconstructed from the description below):

$$
M_{ij} = \begin{cases} 1, & x_{ij}\ \text{is missing} \\ 0, & \text{otherwise} \end{cases} \qquad (2)
$$

For ease of understanding, the following is an example (with generic entries) of an incomplete traffic-data matrix and the corresponding mask matrix:

$$
X^{(?)} = \begin{bmatrix} x_{11} & x_{12} & ? & x_{14} & x_{15} \\ x_{21} & x_{22} & x_{23} & x_{24} & ? \\ x_{31} & ? & x_{33} & x_{34} & x_{35} \end{bmatrix}, \qquad
M = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \end{bmatrix}
$$

As can be seen, the data values at positions (1,3), (2,5) and (3,2) of the feature matrix are missing; missing entries are shown as question marks. The corresponding entries of the mask matrix are 1, while the observed data points have the value 0. The purpose of missing-data completion is to recover the missing data points from the available observations: the error between $\hat{Y}$ and Y should be minimized, where $\hat{Y}$ is the imputation result.
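The mask matrix of formula (2) is straightforward to build when the missing entries are encoded as NaN. The speed values below are hypothetical; the "?" positions match the example above:

```python
import numpy as np

# Toy incomplete feature matrix: NaN marks the "?" entries at
# positions (1,3), (2,5), (3,2) (1-indexed, as in the text).
X = np.array([
    [42.0, 38.5, np.nan, 40.2, 41.0],
    [36.1, 35.8, 37.0, 36.5, np.nan],
    [50.3, np.nan, 49.8, 51.2, 50.0],
])

# Mask matrix per formula (2): 1 where data is missing, 0 where observed.
M = np.isnan(X).astype(int)
```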

FIG. 11 depicts the framework of the attention recurrent neural network (ADRIN). Based on the time-series characteristics of traffic-speed data, the embodiments construct two main data-processing streams, shown as the left and right parts of FIG. 11. The left module focuses on extracting temporal features from the incomplete input $X^{(?)}$. Because traffic data exhibits strong periodic correlation, the embodiments additionally construct the right module, which takes as input the historical average data matrix $X^{(a)}$ of the five days preceding the input. The outputs of these two parts are two hidden feature matrices, denoted H and $H^{*}$: the former contains the temporal information extracted between the incomplete periods, while the latter contains the periodic time-series information. Finally, a fusion module processes the outputs of the two modules and produces the completed data $\hat{Y}$.

Specifically, the embodiments tailor several advanced neural-network methods to time-series traffic-speed data and integrate them into ADRIN. In particular, the embodiments first improve the ordinary LSTM and propose the long short-term memory completion network (LSTM-I), which can accept inputs with missing data. LSTM-I estimates missing values by forward prediction to generate the predicted feature map $\hat{X}$. In addition, the embodiments apply a multi-head attention mechanism to the time-series processing, extracting the temporal dependencies of $\hat{X}$ and of the historical features $X^{(a)}$ respectively, and obtaining the hidden feature matrices H and $H^{*}$.

In a specific application scenario: LSTM has achieved much in time-series modeling tasks, especially in forecasting. Compared with an ordinary RNN, an LSTM can prevent exploding gradients during learning. Existing LSTM networks, however, require the input time series to be complete, a requirement that is unrealistic for traffic data. The embodiments therefore restructure the existing LSTM network and propose LSTM-I, designed specifically to handle inputs with missing data points.

As shown in FIG. 11, at time step t, $x_t^{(?)}$ denotes the incomplete observation and $\hat{x}_t$ the predicted value. When the input $x_t^{(?)}$ contains missing data, the embodiments splice $x_t^{(?)}$ and $\hat{x}_t$ into the current input. Specifically, the restructured LSTM-I uses the values predicted at the previous time step to estimate the missing values of the current time step. For each time step, LSTM-I combines the estimated and observed values to restore the features, as shown in formula (3) (reconstructed from the definitions above):

$$
x_t^{c} = (1 - m_t) \odot x_t^{(?)} + m_t \odot \hat{x}_t \qquad (3)
$$

where $m_t$ denotes the missing positions at the t-th time step (the t-th column of the mask matrix M) and $\odot$ denotes the Hadamard product.
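The complement of formula (3) is a simple element-wise blend. The numbers below are illustrative; the 0.0 in `x_t` is just a placeholder at the missing position:

```python
import numpy as np

# Formula (3) sketch: where the mask m_t flags a missing entry, take the
# LSTM-I prediction x_hat_t; elsewhere keep the observation x_t.
x_t = np.array([42.0, 0.0, 39.5])       # observation (placeholder at the gap)
x_hat_t = np.array([41.0, 40.2, 38.8])  # prediction from the previous step
m_t = np.array([0, 1, 0])               # 1 marks the missing position

x_complete = (1 - m_t) * x_t + m_t * x_hat_t  # Hadamard-product blend
```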

For each layer of LSTM-I, LSTM cells with shared parameters are used for the computation. For the t-th LSTM cell, the input consists of the cell state $c_{t-1}$ and hidden features $h_{t-1}$ from the previous time step, together with the current input $x_t^{c}$. An LSTM cell has three gating units, the input gate $i_t$, the forget gate $f_t$ and the output gate $o_t$, which decide whether information is added to or removed from the cell state. These gates adaptively save the input information into the current memory state and compute the hidden features $h_t \in \mathbb{R}^{d}$, where d is the dimension of the hidden features output by the LSTM cell. The full computation of the LSTM cell is shown in formula (4) (reconstructed as the standard gated update, consistent with the parameter list below):

$$
\begin{aligned}
i_t &= \sigma(W_i[h_{t-1}, x_t^{c}] + b_i) \\
f_t &= \sigma(W_f[h_{t-1}, x_t^{c}] + b_f) \\
o_t &= \sigma(W_o[h_{t-1}, x_t^{c}] + b_o) \\
\tilde{c}_t &= \tanh(W_c[h_{t-1}, x_t^{c}] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \qquad (4)
$$

where $\tilde{c}_t$ is the cell input activation vector; $W_c$ and $b_c$ are the weight matrix and bias parameter of the memory cell; $W_i$, $W_f$ and $W_o$ are the weight matrices of the input, forget and output gates; $b_i$, $b_f$ and $b_o$ are the bias parameters of the corresponding gates; and σ denotes the sigmoid activation function. In addition, the predicted value for the next time step, $\hat{x}_{t+1}$, is computed from the hidden features $h_t$ as $\hat{x}_{t+1} = W_x h_t + b_x$, where $W_x$ and $b_x$ are the weight and bias matrices respectively.
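One step of an LSTM-I cell, combining the complement of formula (3) with the gated update of formula (4), can be sketched as follows. The toy sizes, random parameters and the concatenated-weight layout are assumptions consistent with the standard LSTM form above:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n, d = 4, 6  # toy: 4 road segments, hidden size 6

# Shared parameters of one LSTM-I cell (random for illustration).
Wi, Wf, Wo, Wc = (rng.standard_normal((n + d, d)) * 0.1 for _ in range(4))
bi, bf, bo, bc = (np.zeros(d) for _ in range(4))
Wx, bx = rng.standard_normal((d, n)) * 0.1, np.zeros(n)

def lstm_i_step(x_t, m_t, x_hat_t, h_prev, c_prev):
    """One step of formulas (3) and (4): complement the missing entries,
    run the gated update, then predict the next input."""
    x_c = (1 - m_t) * x_t + m_t * x_hat_t  # formula (3)
    z = np.concatenate([x_c, h_prev])
    i = sigmoid(z @ Wi + bi)               # input gate
    f = sigmoid(z @ Wf + bf)               # forget gate
    o = sigmoid(z @ Wo + bo)               # output gate
    c_tilde = np.tanh(z @ Wc + bc)         # cell input activation
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    x_hat_next = h @ Wx + bx               # prediction for step t+1
    return h, c, x_hat_next

h, c = np.zeros(d), np.zeros(d)
x_hat = np.zeros(n)
x_t = rng.standard_normal(n)
m_t = np.array([0, 1, 0, 0])
h, c, x_hat = lstm_i_step(x_t, m_t, x_hat, h, c)
```

Running the step over all T columns of $X^{(?)}$, feeding each `x_hat` forward, yields the predicted feature map $\hat{X}$.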

In a specific application scenario: although LSTM-I can estimate the missing data step by step, its ability to capture temporal dependencies over longer sequences is limited, and traffic data in particular often has hundreds of time steps per day. The embodiments therefore apply a multi-head attention mechanism to further extract features of the LSTM-I output $\hat{X}$ and of the historical average $X^{(a)}$. The multi-head operation learns in several subspaces separately, then combines all of the results, reduces their dimension through a fully connected layer, and outputs a matrix with the same shape as the input time series.

For the attention computation, the embodiments define the time-series input as $X \in \mathbb{R}^{T \times n}$ (corresponding to the transposes of $\hat{X}$ and $X^{(a)}$). The computation involves three defined components, Query Q, Key K and Value V, defined as shown in formula (5) (reconstructed):

$$
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V \qquad (5)
$$

where $W_Q$, $W_K$ and $W_V$ are the parameter matrices of the corresponding components. The attention matrix E is computed as shown in formula (6) (reconstructed as the scaled dot-product form):

$$
E = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) \qquad (6)
$$

Here the softmax function converts the attention matrix into a probability matrix in which the probabilities of every column sum to 1; $E_{ij}$ therefore represents the influence of the i-th time point on the j-th time point. The output of the attention mechanism, denoted Z, is computed as shown in formula (7) (reconstructed):

$$
Z = E\,V \qquad (7)
$$

For the multi-head attention computation, the usual paradigm is followed: several attention mechanisms each compute their own output $Z_i$, $i = 1, \ldots, H$, where H is the number of attention heads. This allows separate learning in the different attention subspaces given by the equations above, which captures richer feature relationships. Finally, the embodiments feed all $Z_i$ to a linear layer to obtain the final result, expressed as shown in formula (8):

$$
\mathrm{Output} = \mathrm{concat}(Z_1, \ldots, Z_H)\,W_c \qquad (8)
$$

where Output is the final output of the multi-head self-attention layer (MSL), corresponding to the hidden states H and $H^{*}$, and $W_c$ is the parameter weight of the linear layer.
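Formulas (5) to (7) can be checked numerically. In this sketch the softmax is taken along the rows of E (the text describes the transposed, column-normalised convention), and all sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(3)

T, d_model, d_k = 10, 8, 8
X = rng.standard_normal((T, d_model))

# Formula (5): Query, Key, Value projections.
WQ, WK, WV = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
Q, K, V = X @ WQ, X @ WK, X @ WV

# Formula (6): attention matrix E, softmax-normalised into probabilities;
# E[i, j] weighs the influence of one time step on another.
scores = Q @ K.T / np.sqrt(d_k)
E = np.exp(scores - scores.max(axis=-1, keepdims=True))
E = E / E.sum(axis=-1, keepdims=True)

# Formula (7): attention output.
Z = E @ V
```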

In a specific application scenario, in order to combine the outputs of the two modules and compute the final completion result, the embodiments design a fusion module comprising an attention layer and a linear layer. The attention layer is a single-head version of the MSL, and its input is the splice of the two hidden states. After the attention computation, the embodiments use a fully connected neural network in the linear layer to obtain the completion result, computed as shown in formula (9) (reconstructed):

$$
\hat{Y} = \mathrm{attention}(\mathrm{concat}(H, H^{*}))\,W_l + b_l \qquad (9)
$$

where attention(·) denotes the computation introduced for the multi-head attention mechanism in formulas (5) to (7), $W_l$ and $b_l$ are the parameters of the linear layer, and $\hat{Y}$ is the final completion result of ADRIN.

In a specific application scenario, in order to better train the different modules of ADRIN, the embodiments define separate loss functions for LSTM-I, the fusion module and the final output. Given the ground-truth road-speed data Y and the estimated output $\hat{Y}$, the embodiments define the masked loss function $\mathcal{L}_m$, expressed as shown in formula (10) (reconstructed under the assumption of a squared error over the masked entries):

$$
\mathcal{L}_m(Y, \hat{Y}) = \left\lVert M \odot (Y - \hat{Y}) \right\rVert^{2} \qquad (10)
$$

where M is the mask matrix defined in the embodiments above, marking the missing data points.

To ensure that $\hat{X}$ (i.e. the output of LSTM-I) is similar to $X^{(?)}$, and to speed up the convergence of LSTM-I, the embodiments define the loss function $\mathcal{L}_{LSTM}$ as shown in formula (11) (reconstructed; the comparison is over the observed entries, i.e. using the complementary mask 1 − M):

$$
\mathcal{L}_{LSTM} = \left\lVert (1 - M) \odot (X^{(?)} - \hat{X}) \right\rVert^{2} \qquad (11)
$$

Combining the two sub-loss functions above, the embodiments obtain the final loss function $\mathcal{L}$, expressed as shown in formula (12) (reconstructed under the assumption of an unweighted sum):

$$
\mathcal{L} = \mathcal{L}_m(Y, \hat{Y}) + \mathcal{L}_{LSTM} \qquad (12)
$$
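The masked loss of formula (10) can be sketched as follows. The squared-error form and the normalisation by the number of masked entries are assumptions; the point of the sketch is that only the errors at masked positions contribute:

```python
import numpy as np

def masked_loss(Y, Y_hat, M):
    """Mean squared error evaluated only at the positions flagged by the
    mask M (assumed form of the masked loss L_m)."""
    M = M.astype(float)
    return float(((M * (Y - Y_hat)) ** 2).sum() / max(M.sum(), 1.0))

Y     = np.array([[1.0, 2.0], [3.0, 4.0]])
Y_hat = np.array([[1.0, 2.5], [0.0, 4.0]])
M     = np.array([[0, 1], [0, 0]])  # only entry (0,1) was missing

# The large error at (1,0) is masked out; only (0,1) counts.
loss = masked_loss(Y, Y_hat, M)
```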

In a specific application scenario, the embodiments use three real-world traffic-speed data sets: NavInfo-Beijing (BJ), from Beijing, China; PeMS, from the Caltrans Performance Measurement System of California, USA (PeMSD5); and Hong Kong traffic speed (HK). Specifically, as shown in FIG. 12, the Beijing data set, provided by the NavInfo traffic-index platform, contains the average speeds of 1368 roads in Beijing from 00:00 on 1 January 2019 to 23:55 on 30 June 2019, recorded at a 5-minute sampling interval. To minimize the impact of missing data while preserving the complexity of the data set, the embodiments experiment only with roads whose overall missing rate is below 5%, i.e. 168 roads in total. In addition, the embodiments apply linear interpolation to fill the missing data, record their positions, and remove them in the evaluation stage. As shown in FIG. 13, the PeMSD5 data set contains traffic-speed data collected from the Caltrans measurement system; it includes 144 sensor stations in District 5 of California, recorded from 00:00 on 1 January 2013 to 23:55 on 30 June 2013. It is worth noting that the PeMSD5 records are collected by highway sensors, unlike the average urban-road traffic speeds in the Beijing and Hong Kong data sets. As shown in FIG. 14, the Hong Kong data set comprises the average road speeds of major roads in Hong Kong from 00:00 on 10 March 2021 to 23:50 on 31 July 2021. Because records from outlying districts such as Tuen Mun and Sha Tin remain unchanged for long periods, the embodiments use only the road speeds on Hong Kong Island, 84 roads in total, at a 10-minute sampling interval. These data sets are summarized in Table 1 below; FIGS. 12 to 14 show the collection locations of some of the roads or sensors in the three data sets.

Table 1

                        BJ           PeMSD5       HK
    Road sections       168          144          84
    Days                181          181          144
    Time interval       5 min        5 min        10 min
    Average speed       36.70 km/h   54.47 mi/h   51.24 km/h
    Standard deviation  11.52 km/h   7.32 mi/h    20.19 km/h

Z-score normalization is used to preprocess the data. For cross-validation, following previous work, each of the three data sets is split chronologically into two non-overlapping subsets, a training set and a test set: the first 80% of the samples form the training data and the remaining 20% the test data. Data augmentation is applied during training: for each sample Y in the training set, ten incomplete inputs $X^{(?)}$ are generated at random. Adam is used as the optimizer with an initial learning rate of 0.001; the number of training epochs is set to 200 and the batch size to 20. The number of heads H in the multi-head attention mechanism is set to 8, and the feature dimension d in LSTM-I is set to 168. The experiments are implemented in PyTorch on hardware comprising an NVIDIA RTX 2080 Ti GPU and a Xeon Silver 4210 CPU.
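The preprocessing just described (Z-score normalisation with training-set statistics, a chronological 80/20 split, and MCAR-style random-missing augmentation) can be sketched as follows; the array sizes and the 30% missing rate are toy values, and only one of the ten augmented variants per sample is shown:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: 100 daily samples, each T=12 time steps x n=5 roads.
data = rng.normal(40.0, 10.0, size=(100, 12, 5))

# Chronological 80/20 split, then Z-score with training statistics.
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]
mu, sigma = train.mean(), train.std()
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma

# MCAR-style augmentation: drop entries completely at random
# (the text generates 10 such variants per training sample).
mask = rng.random(train_z[0].shape) < 0.3  # True = missing
x_missing = np.where(mask, np.nan, train_z[0])
```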

In a specific application scenario, to evaluate the model in this case study comprehensively, the embodiments run experiments under two missing patterns, MCAR (missing completely at random) and MNAR (missing not at random). In addition, to verify the effectiveness of the model in different situations, the embodiments compare ADRIN with existing methods at missing rates ranging from 10% to 90%. The comparison methods are drawn from the completion methods explained in the terminology above, some of which are regarded as state-of-the-art missing-data completion methods.

Among these methods, the matrix-factorization-based methods (such as TRMF and BTMF) strictly require the input to be shaped as day × time × road. The embodiments uniformly set the input of the deep-learning methods (DSAE, PGAN, BRITS and ADRIN) to one day of incomplete data, i.e. time × road. In addition, because the convergence of PGAN is hard to guarantee, the embodiments choose the learning rates of its generator and discriminator from {0.00001, 0.0001, 0.001, 0.01} and apply a grid search to obtain the best results. The hyperparameters of the other models are left unchanged.

Table 2 reports the completion accuracy (MAE/MAPE (%)) on the Beijing (BJ) data set, Table 3 on the California District 5 (PeMSD5) data set, and Table 4 on the Hong Kong (HK) data set; the embodiments select the root-mean-square error (RMSE) and the mean absolute percentage error (MAPE) as the evaluation metrics. The three tables summarize the completion results under MCAR and MNAR respectively, at missing rates between 10% and 90%. The experimental results show that the proposed ADRIN achieves the lowest RMSE and MAPE values in most cases, surpassing all the other methods. On the complex data sets, the urban road speeds of Beijing and Hong Kong, the results stand out clearly, while on the simpler highway-speed PeMSD5 data set ADRIN achieves results close to the state of the art under MCAR. This is because ADRIN can extract temporal dependencies from both the incomplete input and the historical average. In particular, the methods based on statistical learning, such as HA and KNN, produce completion results markedly worse than ADRIN: such methods rely only on simple sample distributions within the data set to fill missing values, so the temporal correlations cannot be extracted, resulting in poor performance. Moreover, when data are missing on a large scale, some missing points make the KNN method difficult to apply efficiently.

Table 2: completion accuracy (MAE/MAPE (%)) on the BJ data set [table rendered as an image in the original].

Table 3: completion accuracy (MAE/MAPE (%)) on the PeMSD5 data set [table rendered as an image in the original].

Table 4: completion accuracy (MAE/MAPE (%)) on the HK data set [table rendered as an image in the original].

Considering the deep-learning-based methods, the proposed ADRIN outperforms the advanced DSAE, PGAN and BRITS. Thanks to its ability to model the temporal dependencies of the incomplete input and the historical data separately, ADRIN achieves a clear improvement over the other methods even at high missing rates, whether or not the data are missing completely at random. For example, compared with BRITS, another deep-learning-based method, on the Beijing data set ADRIN reduces the MAPE by 7.16% and 7.77% at a 90% missing rate in the two missing patterns, and by 6.72% and 7.64% at a 10% missing rate. This shows that, owing to the incorporation of historical features, ADRIN achieves a more pronounced improvement at high missing rates than at low ones; the same phenomenon can be observed on the Hong Kong data set. Furthermore, because DSAE and PGAN model features with linear modules only, they can produce acceptable results only on the PeMSD5 data set, whose data distribution is simple. The proposed ADRIN, taking temporal dependencies into account, can handle missing data on urban roads and highways alike.

Unlike matrix factorization-based methods, deep learning methods have more advantages in modeling complex data such as urban road speeds, and these models achieve better completion results overall. However, deep learning models depend heavily on data quality, and a high missing rate makes it difficult for them to capture correlations among temporal features. Taking MCAR as an example, at a 90% missing rate on the California District 5 dataset, BTMF achieves the best performance, because the data distribution of highway speeds is simpler and easier to learn than that of urban roads. In addition, matrix factorization-based methods yield stable results across all missing rates, because they estimate the missing values from the overall data distribution. For example, on the Beijing dataset, the absolute difference in ADRIN's MAPE results across all missing rates is 0.60% (MCAR) and 0.43% (MNAR), whereas for BGCP it is 0.21% (MCAR) and 0.26% (MNAR).

Finally, comparing the completion results under MCAR and MNAR shows that all methods achieve lower MAPE under MCAR than under MNAR. Because of the continuous, large blocks of missing data, modeling the feature dependencies of the data under MNAR is challenging, so performance under MNAR is correspondingly worse than under MCAR. In other words, at the same missing rate, all models perform better under MCAR, because the observations are more dispersed and therefore represent the overall data more accurately. Under MNAR, however, ADRIN achieves more pronounced imputation gains than DSAE and BRITS at high missing rates, because ADRIN makes use of the historical average data.
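The MCAR/MNAR distinction discussed above can be made concrete by how the missing-value masks are generated. In the sketch below, MCAR drops entries independently at random, while the MNAR-style mask drops contiguous temporal blocks to mimic sensor outages; the block construction is an assumption for illustration, not the exact protocol of the experiments.

```python
import numpy as np

def mcar_mask(shape, rate, rng):
    """MCAR: each entry is missing independently with probability `rate`."""
    return rng.random(shape) < rate

def mnar_block_mask(shape, rate, block_len, rng):
    """MNAR-style mask: contiguous blocks along the time axis go missing,
    mimicking sensor outages (assumed construction, for illustration)."""
    n_sensors, n_steps = shape
    mask = np.zeros(shape, dtype=bool)
    target = int(rate * n_sensors * n_steps)
    while mask.sum() < target:
        s = rng.integers(0, n_sensors)                    # pick a sensor
        t = rng.integers(0, max(1, n_steps - block_len))  # pick a start step
        mask[s, t:t + block_len] = True                   # drop a whole block
    return mask

rng = np.random.default_rng(0)
m1 = mcar_mask((20, 100), 0.3, rng)             # scattered missing entries
m2 = mnar_block_mask((20, 100), 0.3, 10, rng)   # clustered missing blocks
```

Under the same overall missing rate, the MCAR mask leaves observations spread evenly over time, while the block mask leaves long gaps, which is exactly why temporal dependencies are harder to model under MNAR.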

The embodiments of the present invention further provide a data completion apparatus that can implement the above data completion method. The apparatus comprises:

an original data set acquisition module, configured to acquire an original data set;

an original data set preprocessing module, configured to perform data preprocessing on the original data set so as to classify the original data set into an incomplete data set and a historical data set;

a first time series extraction module, configured to perform temporal feature extraction on the incomplete data set to obtain a first time series;

a second time series extraction module, configured to perform temporal feature extraction on the historical data set to obtain a second time series;

a first output matrix calculation module, configured to perform a multi-head attention mechanism calculation on the first time series to obtain a first output matrix;

a second output matrix calculation module, configured to perform a multi-head attention mechanism calculation on the second time series to obtain a second output matrix;

a data fusion module, configured to fuse the first output matrix and the second output matrix to obtain the completion data.

The specific implementation of the data completion apparatus of this embodiment is substantially the same as that of the data completion method described above and is not repeated here.
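As an illustration of the processing that the modules above describe (not the patented implementation), the following sketch runs multi-head scaled dot-product attention over the two time series and then splices the outputs; the sequence shapes, head count, and random stand-in weights are all assumptions for the example.

```python
import numpy as np

def multi_head_attention(x, n_heads, rng):
    """Multi-head scaled dot-product self-attention over a (seq_len, d_model)
    sequence; random projections stand in for learned weights (sketch only)."""
    seq_len, d_model = x.shape
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = [rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                      for _ in range(3)]
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(d_k)               # attention matrix
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)            # probability matrix
        heads.append(p @ v)                           # per-head feature matrix
    concat = np.concatenate(heads, axis=-1)           # (seq_len, d_model)
    Wo = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return concat @ Wo                                # output projection

rng = np.random.default_rng(1)
incomplete_seq = rng.standard_normal((12, 16))  # first time series (assumed shape)
historical_seq = rng.standard_normal((12, 16))  # second time series (assumed shape)
out1 = multi_head_attention(incomplete_seq, 4, rng)  # first output matrix
out2 = multi_head_attention(historical_seq, 4, rng)  # second output matrix
fused = np.concatenate([out1, out2], axis=-1)        # spliced matrix for fusion
print(fused.shape)                                   # → (12, 32)
```

In the described apparatus, the spliced matrix would then pass through a single-head attention layer and a linear layer to produce the completion data; here the example stops at the splice.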

Embodiments of the present disclosure further provide an electronic device, comprising:

at least one memory;

at least one processor;

at least one program;

the program is stored in the memory, and the processor executes the at least one program to implement the data completion method described above. The electronic device may be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a vehicle-mounted computer, and the like.

Referring to FIG. 15, FIG. 15 illustrates the hardware structure of an electronic device according to another embodiment. The electronic device comprises:

a processor 1501, which may be implemented as a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs so as to implement the technical solutions provided by the embodiments of the present invention;

a memory 1502, which may be implemented as a ROM (read-only memory), a static storage device, a dynamic storage device, or a RAM (random access memory). The memory 1502 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1502 and is invoked by the processor 1501 to execute the data completion method of the embodiments of the present disclosure;

an input/output interface 1503, configured to implement information input and output;

a communication interface 1504, configured to implement communication between this device and other devices, where communication may be achieved in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, WIFI, Bluetooth);

a bus 1505, which transfers information between the components of the device (e.g., the processor 1501, the memory 1502, the input/output interface 1503, and the communication interface 1504);

wherein the processor 1501, the memory 1502, the input/output interface 1503, and the communication interface 1504 are communicatively connected to one another within the device via the bus 1505.

Embodiments of the present disclosure further provide a storage medium. The storage medium is a computer-readable storage medium storing computer-executable instructions, and the computer-executable instructions are configured to cause a computer to execute the above data completion method.

As a non-transitory computer-readable storage medium, the memory may be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The embodiments described herein are intended to illustrate the technical solutions of the embodiments of the present invention more clearly and do not limit those technical solutions. Those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided by the embodiments of the present invention remain applicable to similar technical problems.

Those skilled in the art will understand that the technical solutions shown in FIG. 4 to FIG. 10 do not limit the embodiments of the present invention, which may include more or fewer steps than illustrated, combine certain steps, or use different steps.

The apparatus embodiments described above are merely illustrative, and the units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and devices, may be implemented as software, firmware, hardware, or appropriate combinations thereof.

The terms "first", "second", "third", "fourth", and the like (if any) in the description of the present application and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the application described herein can be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such processes, methods, products, or devices.

It should be understood that, in this application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" or similar expressions refer to any combination of those items, including any combination of a single item or plural items. For example, at least one of a, b, or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b, and c may be single or multiple.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a logical functional division, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and this description does not limit the scope of rights of the embodiments of the present invention. Any modifications, equivalent replacements, and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present invention shall fall within the scope of rights of the embodiments of the present invention.

Claims (10)

1. A data completion method, comprising:
acquiring an original data set;
performing data preprocessing on the original data set to classify the original data set into an incomplete data set and a historical data set;
extracting time characteristics of the incomplete data set to obtain a first time sequence;
extracting time characteristics of the historical data set to obtain a second time sequence;
performing multi-head attention mechanism calculation on the first time sequence to obtain a first output matrix;
performing multi-head attention mechanism calculation on the second time sequence to obtain a second output matrix;
and performing fusion processing on the first output matrix and the second output matrix to obtain the completion data.
2. The data completion method according to claim 1, wherein said performing temporal feature extraction on the incomplete data set to obtain a first time series comprises:
inputting the incomplete data set into a preset long-term and short-term memory completion network;
extracting observation data in the incomplete data set at a preset time period;
calculating missing data of the preset time interval according to the observation data and the prediction data of the previous time interval of the preset time interval;
and obtaining the first time sequence according to the missing data and the observation data.
3. The data completion method according to claim 1, wherein said performing temporal feature extraction on said historical data set to obtain a second time series comprises:
inputting the historical data set into a preset historical data processing network;
calculating historical average data of the historical data set through the historical data processing network, and taking the historical average data as the historical data of the preset time period;
and obtaining the second time sequence according to the historical average data.
4. The data completion method according to claim 1, wherein said performing a multi-head attention mechanism calculation on the first time series to obtain a first output matrix comprises:
inputting the first time series into a first multi-head attention layer;
converting the first time series into a first attention matrix by the first multi-head attention layer;
converting the first attention matrix into a first probability matrix through a preset function;
performing attention mechanism calculation on the first probability matrix to obtain a first feature matrix;
and performing dimensionality reduction on the first feature matrix to obtain a first output matrix.
5. The data completion method according to claim 1, wherein said performing a multi-head attention mechanism calculation on the second time series to obtain a second output matrix comprises:
inputting the second time series into a second multi-head attention layer;
converting the second time series into a second attention matrix by the second multi-head attention layer;
converting the second attention matrix into a second probability matrix through a preset function;
performing attention mechanism calculation on the second probability matrix to obtain a second feature matrix;
and performing dimensionality reduction on the second feature matrix to obtain a second output matrix.
6. The data completion method according to any one of claims 1 to 5, wherein the fusing the first output matrix and the second output matrix to obtain the completion data comprises:
splicing the first output matrix and the second output matrix to obtain a spliced matrix;
calculating an attention mechanism of the spliced matrix through a single-head attention layer to obtain a target matrix;
and performing feature extraction on the target matrix through a linear layer to obtain the completion data.
7. The data completion method according to claim 6, wherein the calculating an attention mechanism of the spliced matrix through a single-head attention layer to obtain a target matrix comprises:
inputting the spliced matrix into a single-head attention layer;
converting the spliced matrix into an attention matrix by the single-head attention layer;
converting the attention matrix into a probability matrix through a preset function;
and carrying out attention mechanism calculation on the probability matrix to obtain the target matrix.
8. A data completion device, comprising:
the original data set acquisition module is used for acquiring an original data set;
an original data set preprocessing module, configured to perform data preprocessing on the original data set so as to classify the original data set into an incomplete data set and a historical data set;
the first time sequence extraction module is used for extracting time characteristics of the incomplete data set to obtain a first time sequence;
the second time sequence extraction module is used for extracting time characteristics of the historical data set to obtain a second time sequence;
the first output matrix calculation module is used for performing multi-head attention mechanism calculation on the first time sequence to obtain a first output matrix;
the second output matrix calculation module is used for performing multi-head attention mechanism calculation on the second time sequence to obtain a second output matrix;
and the data fusion module is used for carrying out fusion processing on the first output matrix and the second output matrix to obtain the completion data.
9. An electronic device, comprising:
at least one memory;
at least one processor;
at least one program;
the programs are stored in a memory, and a processor executes the at least one program to implement:
the data completion method according to any one of claims 1 to 7.
10. A storage medium that is a computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform:
the data completion method according to any one of claims 1 to 7.
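The recursive completion step of claim 2, in which each missing value of a time interval is filled from the prediction carried over from the previous interval, can be sketched with a simple smoothing predictor standing in for the long-term and short-term memory completion network; the smoothing factor `alpha` and the update rule are assumptions for illustration only.

```python
import numpy as np

def impute_series(x, alpha=0.5):
    """Fill NaNs in a 1-D series left to right: a missing step is replaced by
    the running prediction, and the prediction is then updated from whichever
    value (observed or imputed) was used at that step. Illustrative stand-in
    for the completion network of claim 2; alpha is an assumed parameter."""
    x = x.astype(float).copy()
    # Initialize the prediction from the first observed value (0.0 if none).
    pred = x[~np.isnan(x)][0] if np.any(~np.isnan(x)) else 0.0
    out = np.empty_like(x)
    for t, v in enumerate(x):
        out[t] = pred if np.isnan(v) else v          # impute or keep observation
        pred = alpha * out[t] + (1 - alpha) * pred   # update the prediction
    return out

series = np.array([50.0, np.nan, 54.0, np.nan, np.nan, 60.0])
print(impute_series(series))   # → [50. 50. 54. 52. 52. 60.]
```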
CN202111346757.2A 2021-11-15 2021-11-15 Data completion method, device, electronic device and storage medium Pending CN114445252A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111346757.2A CN114445252A (en) 2021-11-15 2021-11-15 Data completion method, device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111346757.2A CN114445252A (en) 2021-11-15 2021-11-15 Data completion method, device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN114445252A true CN114445252A (en) 2022-05-06

Family

ID=81364846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111346757.2A Pending CN114445252A (en) 2021-11-15 2021-11-15 Data completion method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114445252A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376309A (en) * 2022-06-29 2022-11-22 华南理工大学 A missing traffic data restoration method based on multi-view time matrix decomposition
CN115376309B (en) * 2022-06-29 2024-04-26 华南理工大学 Missing traffic data restoration method based on multi-view time matrix decomposition
CN115186047A (en) * 2022-07-15 2022-10-14 百度在线网络技术(北京)有限公司 Traffic flow dynamic graph reconstruction method, related device and computer program product
CN115186047B (en) * 2022-07-15 2023-07-18 百度在线网络技术(北京)有限公司 Traffic flow dynamic diagram reconstruction method, related device and computer program product
CN116244281A (en) * 2022-09-28 2023-06-09 北京百度网讯科技有限公司 Lane traffic flow data complement and model training method and device thereof
CN116244281B (en) * 2022-09-28 2023-11-21 北京百度网讯科技有限公司 Lane traffic flow data complement and model training method and device thereof
WO2024082533A1 (en) * 2022-10-17 2024-04-25 京东城市(北京)数字科技有限公司 Training method and apparatus for spatio-temporal data processing model, spatio-temporal data processing method and apparatus, and medium
WO2024087129A1 (en) * 2022-10-24 2024-05-02 大连理工大学 Generative adversarial multi-head attention neural network self-learning method for aero-engine data reconstruction
WO2024183151A1 (en) * 2023-03-09 2024-09-12 浙江大学 Urban traffic speed estimation method based on crowdsensing data
CN116028882B (en) * 2023-03-29 2023-06-02 深圳市傲天科技股份有限公司 User labeling and classifying method, device, equipment and storage medium
CN116028882A (en) * 2023-03-29 2023-04-28 深圳市傲天科技股份有限公司 User labeling and classifying method, device, equipment and storage medium
CN118626866A (en) * 2024-08-09 2024-09-10 武汉大学 PPP-B2b satellite clock error prediction method, device, electronic device and storage medium
CN118626866B (en) * 2024-08-09 2024-11-01 武汉大学 PPP-B2b satellite clock error prediction method, device, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN114445252A (en) Data completion method, device, electronic device and storage medium
Boquet et al. A variational autoencoder solution for road traffic forecasting systems: Missing data imputation, dimension reduction, model selection and anomaly detection
Liu et al. Pristi: A conditional diffusion framework for spatiotemporal imputation
Geiger et al. Tadgan: Time series anomaly detection using generative adversarial networks
Du et al. GAN-based anomaly detection for multivariate time series using polluted training set
Faheem et al. Multilayer cyberattacks identification and classification using machine learning in internet of blockchain (IoBC)-based energy networks
Huang et al. Learning multiaspect traffic couplings by multirelational graph attention networks for traffic prediction
He et al. MTAD‐TF: Multivariate Time Series Anomaly Detection Using the Combination of Temporal Pattern and Feature Pattern
CN114297036A (en) Data processing method and device, electronic equipment and readable storage medium
Wang et al. A support vector machine based MSM model for financial short-term volatility forecasting
CN117495109B (en) A neural network-based electricity stealing user identification system
CN111815046A (en) Traffic flow prediction method based on deep learning
Weng et al. Adversarial attention-based variational graph autoencoder
Collier et al. Transfer and marginalize: Explaining away label noise with privileged information
Zhang et al. Attention-driven recurrent imputation for traffic speed
CN116484016A (en) A time series knowledge graph reasoning method and system based on automatic maintenance of time series paths
CN114372526A (en) A data recovery method, system, computer equipment and storage medium
CN118886316A (en) A matrix-guided interpretable spatiotemporal integrated prediction method and system
Yang et al. Bayesian tensor completion for network traffic data prediction
Ahmed et al. Attention-based multi-modal missing value imputation for time series data with high missing rate
CN118070952A (en) Multi-head attention traffic flow prediction method based on space-time diagram ordinary differential equation
CN117909778A (en) Label-free clustering method for multi-view data based on trusted neighbor information aggregation
Bohra et al. Popularity prediction of social media post using tensor factorization
CN115293800A (en) Prediction method aiming at internet click rate prediction based on shadow feature screening
Zhu et al. Interval Forecasting of Carbon Price With a Novel Hybrid Multiscale Decomposition and Bootstrap Approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination