
CN114997067A - Trajectory prediction method based on space-time diagram and space-domain aggregation Transformer network - Google Patents

Trajectory prediction method based on space-time diagram and space-domain aggregation Transformer network Download PDF

Info

Publication number
CN114997067A
Authority
CN
China
Prior art keywords
pedestrian
network
time
space
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210767796.8A
Other languages
Chinese (zh)
Other versions
CN114997067B (en)
Inventor
曾繁虎
杨欣
王翔辰
李恒锐
樊江锋
周大可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202210767796.8A priority Critical patent/CN114997067B/en
Publication of CN114997067A publication Critical patent/CN114997067A/en
Application granted granted Critical
Publication of CN114997067B publication Critical patent/CN114997067B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9537: Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00: Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/02: Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a trajectory prediction method based on a space-time graph and a spatial-domain aggregation Transformer network, which addresses the insufficient extraction of interaction features in existing pedestrian trajectory prediction outputs. A space-time graph convolutional neural network and a temporal feature-transformation network are used to extract the pedestrian trajectory features in a scene effectively and accurately; at the same time, a brand-new spatial-domain aggregation Transformer architecture is designed to transform the pedestrian temporal features and to complete an efficient aggregation and utilization of the spatial pedestrian features. The predicted pedestrian trajectories are finally output in the form of a probability distribution, so that sudden situations are avoided reasonably and the motion consistency of pedestrian groups is maintained. The relevant indexes show that the framework achieves a breakthrough in predicting pedestrian endpoints and predicts the pedestrian trajectory distribution more accurately and efficiently, providing important help for the development of fields such as automatic driving and intelligent transportation.

Description

Trajectory prediction method based on space-time diagram and space-domain aggregation Transformer network
Technical Field
The invention relates to a trajectory prediction method based on a space-time graph and spatial-domain aggregation Transformer network, and belongs to the fields of artificial intelligence and automatic driving.
Background
Pedestrian trajectory prediction has a deep theoretical background and practical application value, and pedestrian trajectory recognition and prediction have long occupied an important position in fields such as autonomous driving and intelligent surveillance. In recent years, thanks to progress in artificial intelligence and deep learning, the deployment and application of intelligent algorithms for the pedestrian trajectory prediction problem have gradually attracted attention.
An intelligent agent needs to better understand and judge the behavior of the traffic participants in a scene, establish a pedestrian trajectory prediction model that carries spatial-interaction feature information, make the corresponding predictions, and reach accurate, fast and reasonable decisions. However, the high complexity and uncertainty of the pedestrian trajectory prediction problem bring the following difficulty: complex scene information means that the future trajectory of a pedestrian is influenced not only by its own historical trajectory and intended route, but also, in the space-time dimension, by obstacles and the other traffic participants in the scene. Whether a reasonable and accurate model can be established and fast prediction output and decisions can be produced is therefore the key to applying pedestrian trajectory prediction in real scenes.
Thanks to the development of machine learning within artificial intelligence, trajectory prediction methods based on LSTM and CNN algorithms were the mainstream prediction methods for a long time. These methods use simple models, obtain good prediction results with few parameters and basic model frameworks, provided ideas and basic modules for subsequent in-depth algorithm research, and were of pioneering significance.
Since graphs and graph network architectures have natural advantages in representing the data of the pedestrian trajectory prediction problem, graph-based pedestrian trajectory prediction has become a popular research direction in recent years. Mohamed A. et al. used a spatio-temporal graph neural network in 2020 (Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction [C]), performing two different convolution operations in the time domain and the space domain, respectively, to obtain trajectory feature information and produce the prediction output at the same time. The model also considers the randomness and uncertainty of pedestrian trajectories in space: the intended route and endpoint of each pedestrian are not known in advance at prediction time, so a reasonable research approach is to assume that the horizontal and vertical coordinates of the predicted trajectory follow a two-dimensional Gaussian distribution and to output trajectories by sampling during validation and prediction. The model completes its prediction under this assumption and obtains fairly good results. However, such pedestrian trajectory prediction models still do not further process the pedestrian interaction feature information, which leads to insufficient spatial interaction capability: the generated trajectories have large inertia and cannot produce closely related motion predictions according to the motion patterns among grouped pedestrians.
In recent years, many researchers have made progress in graph-based pedestrian trajectory prediction by combining various other algorithmic tools and research methods. Dan X. et al. proposed a pedestrian trajectory prediction architecture based on a spatio-temporal block and LSTM (Spatial-Temporal Block and LSTM Network for Pedestrian Trajectories Prediction [J]); based on a graph representation, the relation feature vector between each pedestrian node and its neighboring pedestrians is obtained through graph embedding, the encoded space-time-graph pedestrian interaction feature vectors are fed into an LSTM, and the subsequent prediction obtains good results. Rainbow B. A. et al. proposed a semantics-based space-time-graph pedestrian trajectory prediction model, Semantics-STGCNN (Semantics-STGCNN: A Semantics-guided Spatial-Temporal Graph Convolutional Network for Multi-class Trajectory Prediction [C]); starting from scene semantic understanding, the class labels of pedestrian objects are embedded into a label adjacency matrix, which is combined with a velocity adjacency matrix to output a semantic adjacency matrix, completing the modeling of semantic information and finally outputting the prediction result. Yu C. et al. used a Transformer-based network model (Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction [C]), which exploits the excellent performance of Transformers in other fields and directly concatenates several basic Transformer blocks to extract the spatio-temporal feature information of pedestrians in a scene and complete the prediction.
Aiming at the problems and shortcomings of existing trajectory prediction methods in extracting and predicting pedestrian spatial interaction features, the invention proposes a brand-new network architecture that predicts pedestrian trajectories with a space-time graph and a spatial-domain aggregation Transformer. The input raw data are given a suitable graph representation and preprocessing, the original pedestrian trajectory feature information is extracted with a space-time graph convolutional neural network and a temporal feature-transformation network, and a spatial-domain Transformer network architecture is introduced to fully extract and aggregate deep spatial feature information, ensuring the effectiveness and accuracy of the model with respect to spatial pedestrian interaction features. The invention pays attention to the reasonableness of the prediction results in terms of spatial interaction, preserves the spatial walking characteristics of pedestrians while accounting for their mutual influence, and in particular achieves a breakthrough in predicting pedestrian trajectory endpoints. It has positive effects on modeling pedestrian interaction and predicting pedestrian trajectories in complex scenes, and helps and inspires research and exploration in fields such as autonomous driving and artificial intelligence.
Disclosure of Invention
The invention discloses a trajectory prediction method based on a space-time graph and a spatial-domain aggregation Transformer network, aiming at the problems of existing pedestrian trajectory prediction methods, namely insufficient extraction of spatial pedestrian trajectory information, an unclear relative position relation between walking pedestrians, and the inability to make large turns to avoid collisions.
In the aspect of a space-time graph convolution neural network, pedestrian track characteristic information in a scene is represented and preprocessed in a graph form, and the graph convolution neural network is constructed to finish primary extraction of the pedestrian track characteristic information in the space and serve as subsequent network input.
In the temporal feature-transformation network, the extraction of temporal feature information and the transformation of feature dimensions are completed through a convolutional network; at the same time, the network is designed so as to simplify the model parameters and improve the performance of the model.
In the spatial aggregation Transformer network, the features obtained from the space-time graph convolutional neural network and the time-sequence feature transformation network are further processed. In order to further mine and model the interaction of pedestrian features in the spatial scene, the model uses the time sequence feature vector of each pedestrian as an input vector, inputs the time sequence feature vector into a spatial aggregation Transformer network to fully extract and aggregate the spatial trajectory features of the pedestrians, and simultaneously completes the task of trajectory prediction output.
The invention mainly comprises the following steps:
step (1): representing and preprocessing the pedestrian track characteristic information in the scene from the input original data by using the characteristics of the graph, selecting a proper kernel function to complete the construction of the adjacent matrix, and providing accurate and efficient pedestrian information in the scene for the subsequent network architecture input;
step (2): establishing a space-time graph convolution neural network module, establishing a graph convolution neural network, and finishing preliminary extraction of pedestrian track characteristic information in a space by selecting graph convolution times of pedestrian track characteristics to ensure accuracy and effectiveness of characteristic extraction;
step (3): establishing a time sequence feature transformation network module, and finishing the extraction of time sequence features and the transformation of feature dimensions by designing a convolutional neural network;
step (4): establishing a spatial aggregation Transformer network, using the time sequence characteristic vector of each pedestrian in the scene as an input vector, inputting it into the Transformer network to further aggregate spatial characteristics, and finishing the output of the pedestrian track prediction sequence.
Furthermore, in the step (1), a space-time diagram is introduced to represent the input original pedestrian trajectory data, a proper kernel function is selected from multiple kernel functions to construct an adjacent matrix under the diagram meaning, efficient construction and selection of pedestrian features in a scene are completed, and accurate and efficient information is provided for subsequent modeling.
Further, representing the input original pedestrian trajectory data with the introduced space-time graph is specifically as follows: for each time t, a spatial graph G_t is introduced to represent the interaction relation among the pedestrians at that time point. G_t is defined as G_t = (V_t, E_t), where V_t specifically represents the coordinate information of the pedestrians in the scene at time t, i.e.

V_t = { v_t^i | i = 1, ..., N }

Each v_t^i is characterized by the observed relative coordinate change (Δx_t^i, Δy_t^i), namely:

Δx_t^i = x_t^i - x_{t-1}^i
Δy_t^i = y_t^i - y_{t-1}^i

where i = 1, ..., N and t = 2, ..., T_obs; for the initial time instant, the relative position offset is defined as 0, i.e. (Δx_1^i, Δy_1^i) = (0, 0).

E_t represents the edge set of the spatial graph G_t and is a matrix of dimension N × N, defined as E_t = { e_t^{ij} }. The value of e_t^{ij} is given as follows: if node v_t^i and node v_t^j are connected, then e_t^{ij} = 1; conversely, if node v_t^i and node v_t^j are not connected, then e_t^{ij} = 0.
Further, selecting a suitable kernel function from several kernel functions to construct the adjacency matrix in the graph sense is specifically as follows:

A weighted adjacency matrix A_t is introduced to weight the node information of the pedestrian spatial graph; the magnitude of the mutual influence among pedestrians is obtained through a kernel-function transformation and stored in the weighted adjacency matrix A_t.

The reciprocal of the Euclidean distance between two nodes is selected as the kernel function; to avoid the divergence that occurs when two nodes are too close to each other, a small constant ε is added, which also accelerates model convergence:

a_t^{ij} = 1 / ( || v_t^i - v_t^j ||_2 + ε )

Stacking the spatial graphs G_t of all time instants along the time dimension yields the pedestrian trajectory prediction space-time graph sequence G = { G_1, ..., G_T } under the graph representation.
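As an illustration of the graph construction and kernel described above, the following PyTorch sketch builds the relative-displacement node features and the inverse-distance weighted adjacency matrices. It is only a sketch under stated assumptions: the function name is invented, the kernel is applied to the absolute positions (the text does not say explicitly whether positions or displacement features are used), the diagonal is zeroed, and ε = 1e-6 is an arbitrary choice.

```python
import torch

def build_graph_sequence(abs_coords: torch.Tensor, eps: float = 1e-6):
    """abs_coords: (T, N, 2) absolute pedestrian positions over T observed frames.
    Returns V (T, N, 2), the relative-displacement node features, and
    A (T, N, N), the inverse-distance weighted adjacency matrices."""
    V = torch.zeros_like(abs_coords)
    V[1:] = abs_coords[1:] - abs_coords[:-1]       # zero offset at the first frame

    diff = abs_coords.unsqueeze(2) - abs_coords.unsqueeze(1)  # (T, N, N, 2) pairwise differences
    dist = diff.norm(dim=-1)                                  # pairwise Euclidean distances
    A = 1.0 / (dist + eps)                                    # kernel a_t^{ij} = 1 / (distance + eps)
    A.diagonal(dim1=1, dim2=2).zero_()             # assumption: no self-weight on the diagonal
    return V, A
```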
Further, the step (2) is specifically as follows:

For the input sequence of feature graphs, the output is obtained through the constructed space-time graph convolutional neural network:

e_t = GNN(G_t)    (1.6)

where GNN denotes the constructed space-time graph convolutional neural network, whose output is obtained through multi-layer graph-convolution iteration, and e_t denotes the spatio-temporal feature information preliminarily extracted in the spatial dimension by the graph neural network.

This operation is performed on the output of every time instant; the actual output of the graph convolutional neural network is the stack of this time series:

e_g = Stack(e_t)    (1.7)

where Stack(·) denotes the superposition of the inputs along an extended dimension and e_g denotes the output of the graph convolution. In practice, several extended dimensions are fed into the graph neural network in parallel for processing.

The features then receive a suitable dimension transformation through a fully connected layer FC:

V_GNN = FC(e_g)    (1.8)

This yields the preliminary feature-extraction output of the space-time graph convolutional neural network.
Further, in the step (3), the output of the space-time graph convolutional neural network is dimension-transformed, and a CNN-based temporal feature-transformation network module with a designed number of convolutions is used to complete the extraction of each pedestrian's own historical trajectory feature information.

Further, the step (3) is specifically as follows:

After the feature information extracted by the space-time graph convolutional neural network is obtained, it is fed into the temporal feature-transformation network to extract the temporal features. Because the feature dimensions were already transformed appropriately by the fully connected layer in step two, this network module can directly use the obtained feature information. In the invention, a multi-layer CNN is selected to process the time-dimension feature information, which can be expressed as:

e_c = CNN(V_GNN)    (1.9)

where V_GNN denotes the feature information extracted by the graph convolutional neural network and e_c denotes the output of the convolutional part of the temporal feature-transformation network. A multi-layer perceptron MLP is then used to increase the expressive capability of the network:

V_CNN = MLP(e_c)    (1.10)

After these transformations, the output V_CNN of the temporal feature-transformation network is obtained.
Further, the main computation of step (4) comprises: in order to strengthen the relation between pedestrian features in the spatial domain, a spatial-domain Transformer network is designed to further aggregate the extracted feature information spatially. Specifically, the temporal feature vector of each pedestrian is used as one input vector, and the extracted features of the different pedestrians are input in sequence.

For the spatial-domain aggregation Transformer network, the encoder layer of the Transformer architecture is selected. First, positional encoding is added to the input:

V_in = V_CNN + PE_{pos,i}(V_CNN)    (1.11)

where pos denotes the relative position of the input feature and i denotes the dimension of the input feature. A multi-head attention layer is then introduced: the query (Q), key (K) and value (V) inputs of the attention layers are obtained from the input layer through matrix transformations, the input features are split according to the configured number of heads, and the attention scores are computed as:

Attention(Q_i, K_i, V_i) = softmax( Q_i K_i^T / sqrt(d_k) ) V_i    (1.12)
head_i = Attention(Q_i, K_i, V_i)    (1.13)

where d_k denotes the dimension of the key vectors, i = 1, ..., nhead, and nhead denotes the number of heads. The final multi-head output completes the feature extraction by concatenation:

V_Multi = ConCat(head_1, ..., head_h) W_o    (1.14)

where ConCat denotes the concatenation operation and W_o denotes the parameter matrix of the attention-layer output.

The final output of the spatial-domain Transformer is then obtained through a feedforward neural network and layer normalization:

V_out = LN(FFN(V_Multi))    (1.15)

Through this structure, the preliminarily extracted spatio-temporal features are aggregated into spatial pedestrian-interaction features, so that the output pedestrian trajectories better respect the association and interaction between pedestrians in the scene.
For the loss function, the sum of the negative log-likelihoods of every point on the predicted pedestrian trajectory is selected. The loss function of the i-th pedestrian is expressed as follows:

L^i = - Σ_{t = T_obs + 1}^{T_pred} log P( (x_t^i, y_t^i) | μ_t^i, σ_t^i, ρ_t^i )

where (μ_t^i, σ_t^i, ρ_t^i) are the unknown parameters of the predicted pedestrian trajectory distribution, and T_obs and T_pred denote the observed and predicted endpoint times, respectively. The sum of the loss functions of all pedestrians is the final loss function:

L = Σ_{i=1}^{N} L^i

Performing the forward loss computation and the backward parameter update on the model framework provided by the invention completes the training of the model and yields a reasonable predicted pedestrian trajectory output.
Advantageous effects
The invention provides a brand-new network model architecture. A space-time graph convolutional neural network, a temporal feature-transformation network and other related transformation operations are used to extract the pedestrian features in a scene effectively and accurately; at the same time, a brand-new spatial-domain aggregation Transformer architecture is designed to transform and exploit the pedestrian temporal features, and the predicted pedestrian trajectories are finally output in the form of a probability distribution. The method achieves the goals of reasonably avoiding sudden situations and keeping the motion of pedestrian groups consistent, completes a more accurate and reasonable prediction of pedestrian spatial interaction, provides a new idea for further in-depth research on the pedestrian trajectory prediction problem, has far-reaching significance for more accurate and timely prediction and application in real scenes, and provides help for the development of fields such as automatic driving and intelligent transportation.
Drawings
FIG. 1 is a general diagram of a space-time diagram and space-domain aggregation transform network framework according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the trajectory prediction performed by inputting the time-series transformation characteristics into the spatial aggregation Transformer network according to the present invention.
Detailed Description
The invention relates to a pedestrian trajectory prediction method based on a space-time diagram and airspace aggregation Transformer network, which mainly comprises the following steps:
for the pedestrian trajectory prediction problem under a given scene, N pedestrians are used for each observationThe coordinates of the time of day within the scene. For the coordinate information of the ith pedestrian at the t-th time, the coordinate information is used
Figure BDA0003722819760000072
And (4) showing. With the above definitions in mind, then the general formulation of the problem is that for each known set of given observed pedestrian trajectory sequences:
Figure BDA0003722819760000081
extracting and modeling pedestrian track characteristics by a constructed network framework through input data to obtain proper track characteristic information, and providing reasonable track prediction output in a scene:
Figure BDA0003722819760000082
wherein T is obs And T pred Respectively representing the observation time span and the prediction time span of the pedestrian, () representing the true value of the pedestrian track prediction,
Figure BDA0003722819760000083
and representing the predicted pedestrian track value given by the model.
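The following minimal sketch only fixes the tensor shapes assumed in the remainder of this description; the 8 observed frames and 12 predicted frames correspond to the evaluation setting reported below, while the five-parameter bivariate-Gaussian output per step is an assumption consistent with the probability-distribution output described later.

```python
import torch

# Illustrative shapes only (assumptions): T_obs = 8 observed frames, a 12-frame
# prediction horizon, N pedestrians in the scene, 2-D coordinates per pedestrian.
N, T_obs, T_horizon = 5, 8, 12
observed = torch.randn(T_obs, N, 2)            # observed (x, y) trajectories, the input X
# Assumed network output: per future step and pedestrian, the five parameters
# (mu_x, mu_y, sigma_x, sigma_y, rho) of a bivariate Gaussian.
predicted_params = torch.randn(T_horizon, N, 5)
```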
Fig. 1 is a general schematic diagram of a space-time diagram and space-domain aggregation transform network framework according to an embodiment of the present invention.
Step one: represent and preprocess the data appropriately to provide accurate and efficient pedestrian information in the scene.
According to the invention, firstly, a proper graph representation method is used for carrying out correlation graph conversion and preprocessing on the input original pedestrian trajectory data, so that the input characteristic information can be conveniently extracted and efficiently utilized in the follow-up process.
For each time t, a spatial graph G_t is introduced to represent the interaction relation among the pedestrians at that time point. G_t is defined as G_t = (V_t, E_t), where V_t represents the node information of the spatial graph G_t; in this model, V_t specifically represents the coordinate information of the pedestrians in the scene at time t, i.e.

V_t = { v_t^i | i = 1, ..., N }

For the model, each v_t^i is characterized by the observed relative coordinate change (Δx_t^i, Δy_t^i), namely:

Δx_t^i = x_t^i - x_{t-1}^i
Δy_t^i = y_t^i - y_{t-1}^i

where i = 1, ..., N and t = 2, ..., T_obs; for the initial time instant, the relative position offset is defined as 0, i.e. (Δx_1^i, Δy_1^i) = (0, 0).

E_t represents the edge set of the spatial graph G_t and is a matrix of dimension N × N. In its ordinary sense it is defined as E_t = { e_t^{ij} }, and the value of e_t^{ij} is given as follows: if node v_t^i and node v_t^j are connected, then e_t^{ij} = 1; conversely, if node v_t^i and node v_t^j are not connected, then e_t^{ij} = 0.

For the prediction task, not only the correlation between pedestrians is desired, but also the relative magnitude of the interpersonal influence in space should be measured. A weighted adjacency matrix A_t is therefore introduced to weight the node information of the pedestrian spatial graph; the magnitude of the mutual influence among pedestrians is obtained through a kernel-function transformation and stored in the weighted adjacency matrix A_t. In the invention, the reciprocal of the Euclidean distance between two nodes is used as the kernel function; to avoid the divergence that occurs when two nodes are too close to each other, a small constant ε is added, which also accelerates model convergence:

a_t^{ij} = 1 / ( || v_t^i - v_t^j ||_2 + ε )

Stacking the spatial graphs G_t of all time instants along the time dimension yields the pedestrian trajectory prediction space-time graph sequence G = { G_1, ..., G_T } under the graph representation. Through the above definitions and transformations, the graph representation and preprocessing of the data in the pedestrian trajectory prediction problem are completed.
Step two: establishing a space-time graph convolutional neural network to preliminarily extract characteristic information
In the invention, aiming at the data obtained by representing the original data in the step one, a space-time graph convolutional neural network is used for preliminarily extracting the characteristic information.
In this model architecture, the graph convolutional neural network uses an appropriate number of convolution layers, i.e. an appropriate number of feature-iteration steps, so as to better extract the trajectory features in space.
For the input sequence of feature graphs, the output is obtained through the constructed space-time graph convolutional neural network:

e_t = GNN(G_t)    (1.6)

where GNN denotes the constructed space-time graph convolutional neural network, whose output is obtained through multi-layer graph-convolution iteration, and e_t denotes the spatio-temporal feature information preliminarily extracted in the spatial dimension by the graph neural network.

This operation is performed on the output of every time instant. The actual output of the graph convolutional neural network is the stack of this time series:

e_g = Stack(e_t)    (1.7)

where Stack(·) denotes the superposition of the inputs along an extended dimension and e_g denotes the output of the graph convolution. In practice, several extended dimensions are fed into the graph neural network in parallel for processing.

The features then receive a suitable dimension transformation through a fully connected layer FC:

V_GNN = FC(e_g)    (1.8)

This yields the preliminary feature-extraction output of the space-time graph convolutional neural network.
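A minimal PyTorch sketch of such a multi-layer graph convolution followed by the fully connected layer FC of Eq. (1.8) is given below. The symmetric (Kipf-Welling style) normalization of the weighted adjacency, the ReLU activation, and the layer count and feature sizes are illustrative assumptions; the text only specifies multi-layer graph-convolution iteration.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph-convolution layer: H' = relu(A_norm @ H @ W), with A_norm the
    symmetrically normalized weighted adjacency including self-loops."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # H: (T, N, in_dim) node features, A: (T, N, N) weighted adjacency
        A_hat = A + torch.eye(A.size(-1), device=A.device)     # add self-loops
        d_inv_sqrt = A_hat.sum(-1).clamp(min=1e-12).pow(-0.5)  # (T, N)
        A_norm = d_inv_sqrt.unsqueeze(-1) * A_hat * d_inv_sqrt.unsqueeze(-2)
        return torch.relu(A_norm @ self.lin(H))

class STGraphEncoder(nn.Module):
    """Multi-layer graph convolution (Eq. 1.6 applied to every frame, stacked as
    in Eq. 1.7) followed by the fully connected layer FC of Eq. 1.8."""
    def __init__(self, in_dim: int = 2, hidden: int = 16, out_dim: int = 16,
                 n_layers: int = 2):
        super().__init__()
        dims = [in_dim] + [hidden] * n_layers
        self.layers = nn.ModuleList(GraphConv(a, b) for a, b in zip(dims[:-1], dims[1:]))
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, V: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        h = V                                # (T_obs, N, in_dim)
        for layer in self.layers:
            h = layer(h, A)                  # e_t for every frame, processed in parallel
        return self.fc(h)                    # V_GNN: (T_obs, N, out_dim)
```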
Step three: establish the temporal feature-transformation network; complete the extraction of temporal features and the transformation of feature dimensions through a designed convolutional neural network.

After the feature information extracted by the space-time graph convolutional neural network is obtained, it is fed into the temporal feature-transformation network to extract the temporal features. Because the feature dimensions were already transformed appropriately by the fully connected layer in step two, this network module can directly use the obtained feature information. In the invention, a multi-layer CNN is selected to process the time-dimension feature information, which can be expressed as:

e_c = CNN(V_GNN)    (1.9)

where V_GNN denotes the feature information extracted by the graph convolutional neural network and e_c denotes the output of the convolutional part of the temporal feature-transformation network. A multi-layer perceptron MLP is then used to increase the expressive capability of the network:

V_CNN = MLP(e_c)    (1.10)

After these transformations, the output V_CNN of the temporal feature-transformation network is obtained.
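One possible realization of this temporal feature-transformation network is sketched below. Treating the time axis as the 1-D convolution channel so that the observed horizon is mapped to the prediction horizon, together with the kernel size, the number of layers and the MLP width, are assumptions of this sketch rather than details fixed by the text.

```python
import torch
import torch.nn as nn

class TemporalTransform(nn.Module):
    """Temporal feature-transformation network (Eqs. 1.9-1.10): a stack of 1-D
    convolutions along the time axis followed by a small MLP."""
    def __init__(self, t_obs: int = 8, t_pred: int = 12, feat_dim: int = 16,
                 n_layers: int = 3):
        super().__init__()
        convs = [nn.Conv1d(t_obs, t_pred, kernel_size=3, padding=1)]
        convs += [nn.Conv1d(t_pred, t_pred, kernel_size=3, padding=1)
                  for _ in range(n_layers - 1)]
        self.convs = nn.ModuleList(convs)
        self.mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, v_gnn: torch.Tensor) -> torch.Tensor:
        # v_gnn: (T_obs, N, feat_dim); the time axis becomes the conv channel axis
        h = v_gnn.permute(1, 0, 2)           # (N, T_obs, feat_dim)
        for conv in self.convs:
            h = torch.relu(conv(h))          # (N, T_pred, feat_dim) after the first layer
        return self.mlp(h)                   # V_CNN: (N, T_pred, feat_dim)
```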
Step four: establishing a spatial domain aggregation Transformer network to further aggregate spatial domain characteristics, and finishing the output of a pedestrian trajectory prediction sequence
This step addresses the insufficient extraction of interaction features in existing pedestrian trajectory prediction outputs and the resulting weak spatial character of the predictions, which shows itself in two ways: on the one hand, the predicted pedestrian trajectories have large inertia and cannot make the large turns needed to avoid fast or sudden situations; on the other hand, the motion consistency of pedestrian group behavior is insufficient, so that pedestrians who are closely related in space do not keep the same motion trend over a period of time.
In order to strengthen the relation between pedestrian features in the spatial domain, a spatial-domain Transformer network is designed to further aggregate the extracted feature information spatially. Specifically, the temporal feature vector of each pedestrian is used as one input vector, and the extracted features of the different pedestrians are input in sequence.

For the spatial-domain aggregation Transformer network, the encoder layer of the Transformer architecture is selected. First, positional encoding is added to the input:

V_in = V_CNN + PE_{pos,i}(V_CNN)    (1.11)

where pos denotes the relative position of the input feature and i denotes the dimension of the input feature. A multi-head attention layer is then introduced: the query (Q), key (K) and value (V) inputs of the attention layers are obtained from the input layer through matrix transformations, the input features are split according to the configured number of heads, and the attention scores are computed as:

Attention(Q_i, K_i, V_i) = softmax( Q_i K_i^T / sqrt(d_k) ) V_i    (1.12)
head_i = Attention(Q_i, K_i, V_i)    (1.13)

where d_k denotes the dimension of the key vectors, i = 1, ..., nhead, and nhead denotes the number of heads. The final multi-head output completes the feature extraction by concatenation:

V_Multi = ConCat(head_1, ..., head_h) W_o    (1.14)

where ConCat denotes the concatenation operation and W_o denotes the parameter matrix of the attention-layer output.

The final output of the spatial-domain Transformer is then obtained through a feedforward neural network and layer normalization:

V_out = LN(FFN(V_Multi))    (1.15)

Through this structure, the preliminarily extracted spatio-temporal features are aggregated into spatial pedestrian-interaction features, so that the output pedestrian trajectories better respect the association and interaction between pedestrians in the scene.
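The sketch below shows one way the spatial-domain aggregation Transformer encoder could be assembled from standard PyTorch components. Flattening each pedestrian's temporal features into a single token, the sinusoidal positional encoding, and the head count, depth and feed-forward width are assumptions of this illustration, not values fixed by the text.

```python
import math
import torch
import torch.nn as nn

class SpatialTransformer(nn.Module):
    """Spatial-domain aggregation Transformer (Eqs. 1.11-1.15): one token per
    pedestrian; attention aggregates features across pedestrians in the scene."""
    def __init__(self, d_model: int = 192, n_heads: int = 4, dim_ff: int = 256,
                 n_layers: int = 1, max_peds: int = 200):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=dim_ff,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Standard sinusoidal positional encoding added to the input (Eq. 1.11).
        pe = torch.zeros(max_peds, d_model)
        pos = torch.arange(max_peds).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, v_cnn: torch.Tensor) -> torch.Tensor:
        # v_cnn: (N, T_pred, feat_dim); assumes T_pred * feat_dim == d_model (12 * 16 = 192 here)
        n_peds = v_cnn.size(0)
        tokens = v_cnn.reshape(1, n_peds, -1)            # (batch=1, N, d_model)
        tokens = tokens + self.pe[:n_peds].unsqueeze(0)  # V_in = V_CNN + PE
        return self.encoder(tokens).squeeze(0)           # V_out: (N, d_model)
```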
For the loss function, the sum of the negative log-likelihoods of every point on the predicted pedestrian trajectory is used. The loss function of the i-th pedestrian is expressed as follows:

L^i = - Σ_{t = T_obs + 1}^{T_pred} log P( (x_t^i, y_t^i) | μ_t^i, σ_t^i, ρ_t^i )

where (μ_t^i, σ_t^i, ρ_t^i) are the unknown parameters of the predicted pedestrian trajectory distribution, and T_obs and T_pred denote the observed and predicted endpoint times, respectively. The sum of the loss functions of all pedestrians is the final loss function:

L = Σ_{i=1}^{N} L^i

Performing the forward loss computation and the backward parameter update on the model framework provided by the invention completes the training of the model and yields a reasonable predicted pedestrian trajectory output.
In evaluating the accuracy and effectiveness of the model, as in common trajectory-prediction evaluation methods, the Average Displacement Error (ADE) and the Final Displacement Error (FDE) are used as evaluation indexes to describe the accuracy of the predicted trajectory. The average displacement error is the average, over all pedestrians in the scene and all prediction time steps, of the L2 norm between the predicted and true positions; the final (endpoint) displacement error is the average, over all pedestrians in the scene, of the L2 norm between the predicted and true positions at the endpoint time. The expressions are:

ADE = (1 / (N · T_p)) Σ_{i=1}^{N} Σ_{t = T_obs + 1}^{T_pred} || (x̂_t^i, ŷ_t^i) - (x_t^i, y_t^i) ||_2

FDE = (1 / N) Σ_{i=1}^{N} || (x̂_{T_pred}^i, ŷ_{T_pred}^i) - (x_{T_pred}^i, y_{T_pred}^i) ||_2

where (x_t^i, y_t^i) denotes the true trajectory of the pedestrian to be predicted and (x̂_t^i, ŷ_t^i) denotes the predicted trajectory output by the model; T_pred denotes the prediction endpoint time and T_p the prediction time range. The FDE index only averages the endpoint coordinate error of each pedestrian in the scene and places no further requirement on the chosen walking route, while the ADE index averages the summed coordinate errors over every time point. For both indexes, a smaller value indicates a predicted trajectory closer to the real one and therefore better prediction performance.
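For completeness, a small sketch of the ADE/FDE computation, together with the best-of-K selection used in the evaluation described next, is given below; the function names and tensor layout are illustrative.

```python
import torch

def ade_fde(pred: torch.Tensor, gt: torch.Tensor):
    """pred, gt: (T_horizon, N, 2) predicted and true absolute coordinates."""
    dist = (pred - gt).norm(dim=-1)     # (T_horizon, N) L2 error at every step
    ade = dist.mean().item()            # average over all steps and pedestrians
    fde = dist[-1].mean().item()        # endpoint error averaged over pedestrians
    return ade, fde

def best_of_k(samples: torch.Tensor, gt: torch.Tensor):
    """samples: (K, T_horizon, N, 2) trajectories sampled from the predicted
    distribution (K = 20 in the evaluation); keep the sample with the lowest ADE."""
    return min((ade_fde(s, gt) for s in samples), key=lambda m: m[0])
```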
Because the actual output is a probability distribution of the trajectory in the two-dimensional plane, in practical trajectory-prediction evaluation, to ensure trajectory diversity and generalization capability, multiple sampled predictions (for example, 20) are usually drawn, and the predicted trajectory closest to the ground truth is taken as the output trajectory for computing ADE/FDE and evaluating the model. Specifically, for the five subsets of the ETH and UCY datasets, pedestrian trajectory data are sampled every 0.4 seconds, every 20 frames of pedestrian trajectory data are used as one data sample, and the model is trained and validated by taking the pedestrian trajectories of the past 8 frames (3.2 s) as input and predicting the pedestrian trajectories of the following 12 frames (4.8 s). The model of the invention is compared with two other algorithms that also use graph network models; the comparison results are shown in Table 1, with the best performance highlighted.
Table 1: Comparison of the prediction results of the proposed model with mainstream graph-network models (ADE/FDE on the ETH and UCY datasets).
As can be seen from Table 1, the proposed framework achieves a major breakthrough on the endpoint-prediction problem: the FDE index is the best on almost all datasets, and both the average ADE and the average FDE are the best. Compared with the better of the two graph-network algorithms, the model improves FDE by 17%, 21%, 5% and 12% on ETH, UNIV, ZARA1 and ZARA2 respectively, and by 16% on the average FDE index. These data show that the model feeds the pedestrian temporal feature vectors into a spatial-domain aggregation Transformer framework, concentrates on exploiting the features extracted by the spatial graph neural network and the temporal feature-transformation network, and completes a better aggregation of spatial pedestrian interaction features, thereby achieving a better prediction effect, with the largest breakthrough on FDE and a stronger perception and expression of pedestrian interaction features in space.

Claims (8)

1. A trajectory prediction method based on a space-time diagram and a space-domain aggregation Transformer network is characterized by comprising the following steps:
(1) representing and preprocessing the pedestrian track characteristic information in the scene from the input original data by using the characteristics of the graph, selecting a proper kernel function to complete the construction of an adjacent matrix, and providing accurate and efficient pedestrian track characteristic information in the scene for the subsequent network architecture input;
(2) establishing a space-time graph convolution neural network module, establishing a graph convolution neural network, and finishing the preliminary extraction of the graph representation in the step (1) and the preprocessed pedestrian track characteristic information by selecting the graph convolution times of the pedestrian track characteristic, so as to ensure the accuracy and effectiveness of the extracted characteristic;
(3) establishing a time sequence feature transformation network module, and finishing the extraction of time sequence feature information and the transformation of feature dimensions by designing a convolutional neural network;
(4) and establishing a spatial aggregation Transformer network, using the time sequence characteristic vector of each pedestrian in the scene as an input vector, simultaneously inputting the Transformer network to further aggregate spatial characteristics, and finishing the output of the pedestrian track prediction sequence.
2. The trajectory prediction method based on the spatio-temporal graph and spatial domain aggregation Transformer network as claimed in claim 1, wherein in the step (1), the spatio-temporal graph is introduced to represent the input original pedestrian trajectory data, a proper kernel function is selected from a plurality of kernel functions to construct an adjacency matrix under the graph meaning, so that efficient construction and selection of pedestrian features in a scene are completed, and accurate and efficient information is provided for subsequent modeling.
3. The trajectory prediction method based on the spatio-temporal graph and the spatial aggregation Transformer network as claimed in claim 2, wherein the representation of the input original pedestrian trajectory data by the introduced spatio-temporal graph is specifically as follows: for each time t, a spatial graph G_t is introduced to represent the interaction relation among the pedestrians at that time point; G_t is defined as G_t = (V_t, E_t), where V_t specifically represents the coordinate information of the pedestrians in the scene at time t, i.e.

V_t = { v_t^i | i = 1, ..., N }

each v_t^i is characterized by the observed relative coordinate change (Δx_t^i, Δy_t^i), namely:

Δx_t^i = x_t^i - x_{t-1}^i
Δy_t^i = y_t^i - y_{t-1}^i

where i = 1, ..., N and t = 2, ..., T_obs; for the initial time instant, the relative position offset is defined as 0, i.e. (Δx_1^i, Δy_1^i) = (0, 0);

E_t represents the edge set of the spatial graph G_t and is a matrix of dimension N × N, defined as E_t = { e_t^{ij} }, where the value of e_t^{ij} is given as follows: if node v_t^i and node v_t^j are connected, then e_t^{ij} = 1; conversely, if node v_t^i and node v_t^j are not connected, then e_t^{ij} = 0.
4. The trajectory prediction method based on the spatio-temporal graph and spatial aggregation Transformer network as claimed in claim 2, wherein selecting a suitable kernel function from the plurality of kernel functions to construct the adjacency matrix in the graph sense is specifically:

a weighted adjacency matrix A_t is introduced to weight the node information of the pedestrian spatial graph; the magnitude of the mutual influence among pedestrians is obtained through a kernel-function transformation and stored in the weighted adjacency matrix A_t;

the reciprocal of the Euclidean distance between two nodes is selected as the kernel function, and, to avoid the divergence that occurs when two nodes are too close to each other, a small constant ε is added, which also accelerates model convergence:

a_t^{ij} = 1 / ( || v_t^i - v_t^j ||_2 + ε )

stacking the spatial graphs G_t of all time instants along the time dimension yields the pedestrian trajectory prediction space-time graph sequence G = { G_1, ..., G_T } under the graph representation.
5. The trajectory prediction method based on the spatio-temporal graph and the spatial domain aggregation Transformer network as claimed in claim 4, wherein the step (2) is specifically as follows:

for the input sequence of feature graphs, the output is obtained through the constructed space-time graph convolutional neural network:

e_t = GNN(G_t)    (1.6)

where GNN denotes the constructed space-time graph convolutional neural network, whose output is obtained through multi-layer graph-convolution iteration, and e_t denotes the spatio-temporal feature information preliminarily extracted in the spatial dimension by the graph neural network;

this operation is performed on the output of every time instant; the actual output of the graph convolutional neural network is the stack of this time series:

e_g = Stack(e_t)    (1.7)

where Stack(·) denotes the superposition of the inputs along an extended dimension and e_g denotes the output of the graph convolution; in practice, several extended dimensions are fed into the graph neural network in parallel for processing;

the features then receive a suitable dimension transformation through a fully connected layer FC:

V_GNN = FC(e_g)    (1.8)

thereby obtaining the preliminary feature-extraction output of the space-time graph convolutional neural network.
6. The trajectory prediction method based on the spatio-temporal graph and spatial aggregation Transformer network as claimed in claim 1, wherein in the step (3), the output of the spatio-temporal graph convolutional neural network is subjected to appropriate dimension transformation, and the extraction of the pedestrian's own historical trajectory feature information is completed by using a CNN-based time-series feature transformation network module and designing the convolution times.
7. The trajectory prediction method based on the spatio-temporal graph and the spatial domain aggregation Transformer network as claimed in claim 6, wherein the step (3) is specifically as follows:

after the feature information extracted by the space-time graph convolutional neural network is obtained, it is fed into the temporal feature-transformation network to extract the temporal features; because the feature dimensions were already transformed appropriately by the fully connected layer in step two, this network module can directly use the obtained feature information; a multi-layer CNN is selected to process the time-dimension feature information, which can be expressed as:

e_c = CNN(V_GNN)    (1.9)

where V_GNN denotes the feature information extracted by the graph convolutional neural network and e_c denotes the output of the convolutional part of the temporal feature-transformation network; a multi-layer perceptron MLP is then used to increase the expressive capability of the network:

V_CNN = MLP(e_c)    (1.10)

after these transformations, the output V_CNN of the temporal feature-transformation network is obtained.
8. The trajectory prediction method based on the spatio-temporal graph and the spatial domain aggregation Transformer network as claimed in claim 1, wherein the step (4) is specifically as follows:

the temporal feature vector of each pedestrian is used as one input vector, and the extracted features of the different pedestrians are input in sequence;

for the spatial-domain aggregation Transformer network, the encoder layer of the Transformer architecture is selected, and positional encoding is first added to the input:

V_in = V_CNN + PE_{pos,i}(V_CNN)    (1.11)

where pos denotes the relative position of the input feature and i denotes the dimension of the input feature; a multi-head attention layer is then introduced: the Query, Key and Value inputs of the attention layers are obtained from the input layer through matrix transformations, the input features are split according to the configured number of heads, and the attention scores are computed as:

Attention(Q_i, K_i, V_i) = softmax( Q_i K_i^T / sqrt(d_k) ) V_i    (1.12)
head_i = Attention(Q_i, K_i, V_i)    (1.13)

where d_k denotes the dimension of the key vectors, i = 1, ..., nhead, and nhead denotes the number of heads; the final multi-head output completes the feature extraction by concatenation:

V_Multi = ConCat(head_1, ..., head_h) W_o    (1.14)

where ConCat denotes the concatenation operation and W_o denotes the parameter matrix of the attention-layer output;

the final output of the spatial-domain Transformer is then obtained through a feedforward neural network and layer normalization:

V_out = LN(FFN(V_Multi))    (1.15)

for the loss function, the sum of the negative log-likelihoods of every point on the predicted pedestrian trajectory is selected; the loss function of the i-th pedestrian is expressed as follows:

L^i = - Σ_{t = T_obs + 1}^{T_pred} log P( (x_t^i, y_t^i) | μ_t^i, σ_t^i, ρ_t^i )

where (μ_t^i, σ_t^i, ρ_t^i) are the unknown parameters of the predicted pedestrian trajectory distribution, and T_obs and T_pred denote the observed and predicted endpoint times, respectively; the sum of the loss functions of all pedestrians is the final loss function:

L = Σ_{i=1}^{N} L^i

performing the forward loss computation and the backward parameter update on the model framework completes the training of the model and yields a reasonable predicted pedestrian trajectory output.
CN202210767796.8A 2022-06-30 2022-06-30 Trajectory prediction method based on space-time graph and spatial-domain aggregation Transformer network Active CN114997067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210767796.8A CN114997067B (en) 2022-06-30 2022-06-30 Trajectory prediction method based on space-time graph and spatial-domain aggregation Transformer network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210767796.8A CN114997067B (en) 2022-06-30 2022-06-30 Trajectory prediction method based on space-time graph and spatial-domain aggregation Transformer network

Publications (2)

Publication Number Publication Date
CN114997067A (en) 2022-09-02
CN114997067B (en) 2024-07-19

Family

ID=83019465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210767796.8A Active CN114997067B (en) 2022-06-30 2022-06-30 Trajectory prediction method based on space-time graph and spatial-domain aggregation Transformer network

Country Status (1)

Country Link
CN (1) CN114997067B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392595A (en) * 2022-10-31 2022-11-25 北京科技大学 Time-space short-term wind speed prediction method and system based on graph convolution neural network and Transformer
CN115881286A (en) * 2023-02-21 2023-03-31 创意信息技术股份有限公司 Epidemic prevention management scheduling system
CN115966313A (en) * 2023-03-09 2023-04-14 创意信息技术股份有限公司 Integrated management platform based on face recognition
CN117493424A (en) * 2024-01-03 2024-02-02 湖南工程学院 Vehicle track prediction method independent of map information
CN117523821A (en) * 2023-10-09 2024-02-06 苏州大学 System and method for predicting vehicle multi-mode driving behavior track based on GAT-CS-LSTM
CN117933492A (en) * 2024-03-21 2024-04-26 中国人民解放军海军航空大学 Ship track long-term prediction method based on space-time feature fusion
WO2024119489A1 (en) * 2022-12-09 2024-06-13 中国科学院深圳先进技术研究院 Pedestrian trajectory prediction method, system, device, and storage medium
WO2024193334A1 (en) * 2023-03-22 2024-09-26 重庆邮电大学 Automatic trajectory prediction method based on graph spatial-temporal pyramid

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255597A (en) * 2021-06-29 2021-08-13 南京视察者智能科技有限公司 Transformer-based behavior analysis method and device and terminal equipment thereof
CN113762595A (en) * 2021-07-26 2021-12-07 清华大学 Traffic time prediction model training method, traffic time prediction method and equipment
CN113837148A (en) * 2021-11-04 2021-12-24 昆明理工大学 Pedestrian trajectory prediction method based on self-adjusting sparse graph transform
CN114117892A (en) * 2021-11-04 2022-03-01 中通服咨询设计研究院有限公司 Method for predicting road traffic flow under distributed system
CN114267084A (en) * 2021-12-17 2022-04-01 北京沃东天骏信息技术有限公司 Video identification method and device, electronic equipment and storage medium
CN114626598A (en) * 2022-03-08 2022-06-14 南京航空航天大学 Multi-modal trajectory prediction method based on semantic environment modeling
CN114638408A (en) * 2022-03-03 2022-06-17 南京航空航天大学 Pedestrian trajectory prediction method based on spatiotemporal information
CN114757975A (en) * 2022-04-29 2022-07-15 华南理工大学 Pedestrian trajectory prediction method based on transformer and graph convolution network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255597A (en) * 2021-06-29 2021-08-13 南京视察者智能科技有限公司 Transformer-based behavior analysis method and device and terminal equipment thereof
CN113762595A (en) * 2021-07-26 2021-12-07 清华大学 Traffic time prediction model training method, traffic time prediction method and equipment
CN113837148A (en) * 2021-11-04 2021-12-24 昆明理工大学 Pedestrian trajectory prediction method based on self-adjusting sparse graph transform
CN114117892A (en) * 2021-11-04 2022-03-01 中通服咨询设计研究院有限公司 Method for predicting road traffic flow under distributed system
CN114267084A (en) * 2021-12-17 2022-04-01 北京沃东天骏信息技术有限公司 Video identification method and device, electronic equipment and storage medium
CN114638408A (en) * 2022-03-03 2022-06-17 南京航空航天大学 Pedestrian trajectory prediction method based on spatiotemporal information
CN114626598A (en) * 2022-03-08 2022-06-14 南京航空航天大学 Multi-modal trajectory prediction method based on semantic environment modeling
CN114757975A (en) * 2022-04-29 2022-07-15 华南理工大学 Pedestrian trajectory prediction method based on transformer and graph convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHE HUANG: "Learning Sparse Interaction Graphs of Partially Detected Pedestrians for Trajectory Prediction", IEEE Robotics and Automation Letters, vol. 7, no. 2, 28 December 2021 (2021-12-28), pages 1198 - 1205 *
成星橙: "Research on Pedestrian Trajectory Prediction Based on Transformer and Graph Convolutional Network", China Master's Theses Full-text Database, Information Science and Technology, no. 2023, 15 December 2023 (2023-12-15), pages 138 - 34 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392595B (en) * 2022-10-31 2022-12-27 北京科技大学 Time-space short-term wind speed prediction method and system based on graph convolution neural network and Transformer
CN115392595A (en) * 2022-10-31 2022-11-25 北京科技大学 Time-space short-term wind speed prediction method and system based on graph convolution neural network and Transformer
WO2024119489A1 (en) * 2022-12-09 2024-06-13 中国科学院深圳先进技术研究院 Pedestrian trajectory prediction method, system, device, and storage medium
CN115881286A (en) * 2023-02-21 2023-03-31 创意信息技术股份有限公司 Epidemic prevention management scheduling system
CN115881286B (en) * 2023-02-21 2023-06-16 创意信息技术股份有限公司 Epidemic prevention management scheduling system
CN115966313A (en) * 2023-03-09 2023-04-14 创意信息技术股份有限公司 Integrated management platform based on face recognition
CN115966313B (en) * 2023-03-09 2023-06-09 创意信息技术股份有限公司 Integrated management platform based on face recognition
WO2024193334A1 (en) * 2023-03-22 2024-09-26 重庆邮电大学 Automatic trajectory prediction method based on graph spatial-temporal pyramid
CN117523821A (en) * 2023-10-09 2024-02-06 苏州大学 System and method for predicting vehicle multi-mode driving behavior track based on GAT-CS-LSTM
CN117493424A (en) * 2024-01-03 2024-02-02 湖南工程学院 Vehicle track prediction method independent of map information
CN117493424B (en) * 2024-01-03 2024-03-22 湖南工程学院 Vehicle track prediction method independent of map information
CN117933492B (en) * 2024-03-21 2024-06-11 中国人民解放军海军航空大学 Ship track long-term prediction method based on space-time feature fusion
CN117933492A (en) * 2024-03-21 2024-04-26 中国人民解放军海军航空大学 Ship track long-term prediction method based on space-time feature fusion

Also Published As

Publication number Publication date
CN114997067B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
CN114997067A (en) Trajectory prediction method based on space-time diagram and space-domain aggregation Transformer network
CN113887610B (en) Pollen image classification method based on cross-attention distillation transducer
Zhang et al. Generative adversarial network based heuristics for sampling-based path planning
CN106970615B (en) A kind of real-time online paths planning method of deeply study
CN114613013A (en) End-to-end human behavior recognition method and model based on skeleton nodes
CN114611663B (en) Customized pedestrian track prediction method based on online updating strategy
CN110599521B (en) Method for generating trajectory prediction model of vulnerable road user and prediction method
CN113362368B (en) Crowd trajectory prediction method based on multi-level space-time diagram neural network
CN114626598B (en) Multi-mode track prediction method based on semantic environment modeling
Ye et al. GSAN: Graph self-attention network for learning spatial–temporal interaction representation in autonomous driving
CN116382267B (en) Robot dynamic obstacle avoidance method based on multi-mode pulse neural network
Su et al. Pedestrian trajectory prediction via spatial interaction transformer network
CN114117259A (en) Trajectory prediction method and device based on double attention mechanism
CN114580718B (en) Pedestrian track prediction method based on condition variation generation countermeasure network
Zhao et al. Spatial-channel transformer network for trajectory prediction on the traffic scenes
Liu et al. Multi-agent trajectory prediction with graph attention isomorphism neural network
CN116503446A (en) Multi-mode vehicle track prediction method for target driving and distribution thermodynamic diagram output
Liu et al. Data augmentation technology driven by image style transfer in self-driving car based on end-to-end learning
Chen et al. HGCN-GJS: Hierarchical graph convolutional network with groupwise joint sampling for trajectory prediction
CN115457657A (en) Method for identifying channel characteristic interaction time modeling behaviors based on BERT model
CN115272712A (en) Pedestrian trajectory prediction method fusing moving target analysis
Lian et al. Causal temporal–spatial pedestrian trajectory prediction with goal point estimation and contextual interaction
CN117314956A (en) Interactive pedestrian track prediction method based on graphic neural network
CN117522920A (en) Pedestrian track prediction method based on improved space-time diagram attention network
CN115482585A (en) Human body action prediction method based on scene perception in three-dimensional space

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant