WO2020010517A1 - Trajectory prediction method and apparatus - Google Patents
Trajectory prediction method and apparatus
- Publication number: WO2020010517A1 (PCT/CN2018/095144)
- Authority: WO (WIPO, PCT)
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
Abstract
A trajectory prediction method and apparatus, relating to the field of local navigation for robots and intelligent vehicles, and applied to a vehicle provided with a vehicle-mounted camera. The method comprises: photographing a surrounding environment by means of a vehicle-mounted camera to acquire a video sequence comprising a surrounding vehicle and a vehicle background (101); locating the surrounding vehicle in the video sequence and extracting historical trajectory information of the surrounding vehicle, and taking, as auxiliary information, scene semantic information obtained by performing image segmentation on the video sequence (102); and inputting the historical trajectory information and the auxiliary information into a neural network model to obtain a predicted trajectory of the surrounding vehicle (103). By means of the trajectory prediction method, the accuracy of predicting a vehicle trajectory can be improved.
Description
The present invention relates to the field of local navigation for robots and intelligent vehicles, and in particular to a trajectory prediction method and apparatus.
While a vehicle is driving, it is important to predict the future trajectories of other traffic participants so that an autonomous vehicle does not collide with other vehicles. Assuming that all traffic participants obey the traffic rules, a human driver can subconsciously predict a target's future trajectory; an autonomous vehicle, by contrast, typically predicts the future trajectories of other traffic participants by building a model.
However, most current work either uses static images to extract visual semantic information or learns a driving network with an end-to-end structure. The former ignores the temporal continuity of driving situations, and the latter lacks interpretability of the trained network, so both lead to low accuracy in predicting vehicle trajectories.
Summary of the invention
The main purpose of the present invention is to provide a trajectory prediction method and apparatus that can improve the accuracy of vehicle trajectory prediction.
The trajectory prediction method provided by the first aspect of the embodiments of the present invention is applied to a vehicle provided with a vehicle-mounted camera. The method includes: photographing the surrounding environment with the vehicle-mounted camera to obtain a video sequence including surrounding vehicles and the vehicle background; locating the surrounding vehicles in the video sequence and extracting their historical trajectory information, and taking scene semantic information obtained by image segmentation of the video sequence as auxiliary information; and inputting the historical trajectory information and the auxiliary information into a neural network model to obtain the predicted trajectories of the surrounding vehicles.
The trajectory prediction apparatus provided by the second aspect of the embodiments of the present invention is applied to a vehicle provided with a vehicle-mounted camera. The apparatus includes: an acquisition module configured to photograph the surrounding environment with the vehicle-mounted camera and obtain a video sequence including surrounding vehicles and the vehicle background; an extraction and segmentation module configured to locate the surrounding vehicles in the video sequence, extract their historical trajectory information, and take scene semantic information obtained by image segmentation of the video sequence as auxiliary information; and an output module configured to input the historical trajectory information and the auxiliary information into a neural network model to obtain the predicted trajectories of the surrounding vehicles.
It can be seen from the above embodiments that a video sequence including surrounding vehicles and the vehicle background is acquired through the on-board camera, the video sequence is image-segmented to obtain scene semantic information, and the scene semantic information and historical trajectory information are then input into a neural network model to obtain a predicted trajectory, instead of extracting scene semantic information from static images for analysis. This preserves the temporal continuity of the neural network model and improves the accuracy of vehicle trajectory prediction.
FIG. 1 is a schematic flowchart of a trajectory prediction method provided by the first embodiment of the present invention;
FIG. 2 is a schematic flowchart of a trajectory prediction method provided by the second embodiment of the present invention;
FIG. 3 is a schematic diagram of the neural network model of the trajectory prediction method provided by the second embodiment of the present invention;
FIG. 4 is a schematic diagram of an application of the trajectory prediction method provided by the second embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a trajectory prediction apparatus provided by the third embodiment of the present invention.
In order to make the objectives, features, and advantages of the present invention more apparent and comprehensible, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of the trajectory prediction method provided by the first embodiment of the present invention. The method is applied to a vehicle provided with a vehicle-mounted camera. As shown in FIG. 1, the trajectory prediction method mainly includes the following steps:
101. Photograph the surrounding environment with the vehicle-mounted camera to obtain a video sequence including surrounding vehicles and the vehicle background.
Specifically, during the autonomous driving of the vehicle, it is assumed that all traffic participants obey the traffic rules, and a model is built to predict the future trajectories of other traffic participants. Building the model requires information about the surrounding environment, so the on-board camera on the vehicle first photographs the surroundings to obtain a video sequence including surrounding vehicles and the vehicle background. The frame rate of the video sequence can be chosen according to the actual situation. A surrounding vehicle refers to a vehicle that is within a certain range of the camera-equipped vehicle and has a potential influence on it; this range may be 30 meters around the camera-equipped vehicle.
102. Locate the surrounding vehicles in the video sequence and extract their historical trajectory information, and take scene semantic information obtained by image segmentation of the video sequence as auxiliary information.
Specifically, motion in a video sequence is an illusion produced by displaying frames in rapid succession; each frame is a still image. The surrounding vehicles are therefore located in each frame, and their trajectories can be observed across consecutive frames. For the current frame, the historical trajectory information of the surrounding vehicles is thus obtained from the video sequences of the past frames.
The scene semantic information obtained by image segmentation of each frame is used as auxiliary information. Image segmentation means segmenting the objects in each frame by semantic category and labeling the scene semantic information, such as pedestrians, surrounding vehicles, buildings, sky, vegetation, road obstacles, lane lines, road signs, and traffic lights, so as to identify the drivable area in the current frame. Using scene semantic information as auxiliary information provides a degree of robustness to changes in the target's appearance.
Optionally, since the regions corresponding to different semantic categories are distinct feature regions whose boundaries are edges, edge detection can be used to segment each frame of the video sequence and extract the required targets. An edge marks where one feature region ends and another begins; the internal features or attributes of a required target, such as grayscale, color, or texture, are consistent within the target and differ from those of other feature regions.
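As an illustration of this optional step, the following is a minimal sketch of edge-based frame segmentation using OpenCV's Canny detector. The patent does not name a specific edge operator, so the choice of Canny, the hysteresis thresholds, and the morphological gap-closing step are assumptions made for the example.

```python
import cv2
import numpy as np

def segment_frame_by_edges(frame_bgr: np.ndarray) -> np.ndarray:
    """Label candidate feature regions in one frame via edge detection.

    Thresholds and the gap-closing kernel are illustrative assumptions,
    not values taken from the patent.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)  # suppress noise before edge detection
    edges = cv2.Canny(blurred, 50, 150)          # assumed hysteresis thresholds
    kernel = np.ones((3, 3), np.uint8)
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)  # close small gaps
    # Regions enclosed by edges become connected components of the non-edge mask.
    num_regions, labels = cv2.connectedComponents(255 - closed)
    return labels  # one integer label per candidate feature region
```

Closed contours found this way delimit feature regions whose interior grayscale, color, or texture is consistent, which is the property the paragraph above relies on.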
103. Input the historical trajectory information and the auxiliary information into the neural network model to obtain the predicted trajectories of the surrounding vehicles.
Specifically, a neural network is a complex network system formed by a large number of simple neurons that are extensively interconnected. It is a highly complex nonlinear dynamic learning system with massively parallel, distributed storage and processing, and self-organizing, adaptive, and self-learning capabilities. A mathematical model is therefore built with a neural network to obtain the neural network model, and the historical trajectory information and auxiliary information obtained above are input into it to obtain the predicted trajectories of the surrounding vehicles.
In this embodiment of the present invention, a video sequence including surrounding vehicles and the vehicle background is acquired through the on-board camera, the video sequence is image-segmented to obtain scene semantic information, and the scene semantic information and historical trajectory information are input into a neural network model to obtain a predicted trajectory, instead of extracting scene semantic information from static images for analysis. This preserves the temporal continuity of the neural network model and improves the accuracy of vehicle trajectory prediction.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of the trajectory prediction method provided by the second embodiment of the present invention. The method is applied to a vehicle provided with a vehicle-mounted camera. As shown in FIG. 2, the trajectory prediction method mainly includes the following steps:
201. Photograph the surrounding environment with the vehicle-mounted camera to obtain a video sequence including surrounding vehicles and the vehicle background.
202. Locate the surrounding vehicles in the video sequence and extract their historical trajectory information, and take scene semantic information obtained by image segmentation of the video sequence as auxiliary information.
203. Input the auxiliary information into the convolutional neural network to obtain spatial feature information.
Specifically, the neural network model includes a convolutional neural network, a first-layer long short-term memory network, a second-layer long short-term memory network, and a fully connected layer.
A convolutional neural network is a feedforward neural network. After the video sequence is segmented and labeled, the resulting scene semantic information is input into the convolutional neural network as auxiliary information to obtain spatial feature information. The auxiliary information is image information and can be one-hot encoded, with the number of channels equal to the number of semantic categories. It is input into a four-layer convolutional neural network whose convolution kernel may be 3*3*4, yielding spatial feature information represented as a 6-dimensional vector.
As shown in FIG. 3, the convolutional neural network includes convolutional layers, rectified linear units, pooling layers, and dropout layers. The convolutional layers extract features from the auxiliary information, the rectified linear units introduce nonlinearity, the pooling layers compress the input auxiliary information and extract its main features, and the dropout layers help mitigate overfitting.
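The following PyTorch sketch shows one way the described spatial-feature extractor could be assembled. The four convolutional layers, 3*3 kernels, rectified linear units, pooling, dropout, and 6-dimensional output come from the text; the channel widths, the 64x64 input resolution, the reading of "3*3*4" as 3x3 kernels over 4 one-hot semantic channels, and the final linear projection are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 4  # assumed count of one-hot semantic channels (reading of "3*3*4")

class SpatialCNN(nn.Module):
    """Four-layer CNN turning a one-hot segmentation map into the
    6-dimensional spatial feature vector described in the text."""

    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        widths = [num_classes, 16, 32, 64, 64]  # channel widths are assumptions
        self.convs = nn.ModuleList(
            nn.Conv2d(widths[i], widths[i + 1], kernel_size=3, padding=1)
            for i in range(4)
        )
        self.pool = nn.MaxPool2d(2)         # compress input, keep dominant features
        self.dropout = nn.Dropout(p=0.5)    # mitigate overfitting
        self.fc = nn.Linear(64 * 4 * 4, 6)  # project to the 6-dim spatial feature

    def forward(self, seg_onehot: torch.Tensor) -> torch.Tensor:
        x = seg_onehot                      # (batch, num_classes, 64, 64) assumed
        for conv in self.convs:
            x = self.pool(F.relu(conv(x)))  # convolution -> rectified linear unit -> pooling
        x = self.dropout(torch.flatten(x, 1))
        return self.fc(x)                   # (batch, 6)
```

A label map `seg` of shape (batch, height, width) holding integer class indices can be one-hot encoded, as the text describes, with `torch.nn.functional.one_hot(seg, NUM_CLASSES).permute(0, 3, 1, 2).float()`, so that the number of channels equals the number of semantic categories.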
204. Input the historical trajectory information into the first-layer long short-term memory network to obtain temporal feature information, and input the spatial feature information and the temporal feature information into the second-layer long short-term memory network to obtain joint feature information.
Specifically, a long short-term memory (LSTM) network is a recurrent network over time. Historical trajectory information is ordered in time, and positions carry contextual relationships: as a sequence input, the historical trajectory requires continual learning of the positional features before and after each step. The LSTM network is therefore used to train on the historical trajectory information, and the trajectory information of the historical frames is linked to infer the trajectory information of the current frame.
As shown in FIG. 3, the historical trajectory information is input into the first-layer LSTM network to obtain temporal feature information, and this temporal feature information, together with the spatial feature information obtained in step 203, is input into the second-layer LSTM network to obtain a joint representation. Since the dimension of the three-dimensional space occupancy grid is 6, the first-layer LSTM network not only learns the temporal feature information but also makes the dimensions of the temporal and spatial feature information consistent. In practice, the first-layer LSTM network may have 100 units, and the second-layer LSTM network may consist of two stacked LSTM layers of 300 units each.
205. Input the joint feature information into the fully connected layer to obtain the predicted trajectory.
Specifically, every node of the fully connected layer is connected to all nodes of the previous layer and integrates all the features that layer extracted. The joint representation is therefore input into the fully connected layer, and a series of matrix multiplications produces the output of the neural network model: the predicted trajectory J over T time steps. In practice, T may correspond to 1.6 s.
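Putting steps 203 to 205 together, a hedged sketch of the SEG-LSTM stack might look as follows. The unit counts (100, then two stacked layers of 300) come from the text; the per-frame concatenation of temporal and spatial features, the use of the last frame's hidden state, and the 16-step horizon (1.6 s at an assumed 10 frames per second) are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SegLSTM(nn.Module):
    """Sketch of the SEG-LSTM fusion of trajectory and segmentation streams."""

    def __init__(self, traj_dim: int = 6, spat_dim: int = 6, horizon: int = 16):
        super().__init__()
        self.horizon = horizon
        self.traj_dim = traj_dim
        # First-layer LSTM: temporal features of the historical trajectory (100 units).
        self.lstm1 = nn.LSTM(traj_dim, 100, batch_first=True)
        # Second-layer LSTM: two stacked layers of 300 units fusing temporal
        # and spatial features into a joint representation.
        self.lstm2 = nn.LSTM(100 + spat_dim, 300, num_layers=2, batch_first=True)
        # Fully connected head: joint representation -> T future grid states.
        self.fc = nn.Linear(300, horizon * traj_dim)

    def forward(self, history: torch.Tensor, spatial: torch.Tensor) -> torch.Tensor:
        # history: (batch, frames, 6) occupancy-grid states of past frames
        # spatial: (batch, frames, 6) per-frame CNN features (assumed pairing)
        temporal, _ = self.lstm1(history)            # (batch, frames, 100)
        joint, _ = self.lstm2(torch.cat([temporal, spatial], dim=-1))
        out = self.fc(joint[:, -1])                  # last frame's joint state
        return out.view(-1, self.horizon, self.traj_dim)
```

Here `spatial` is assumed to hold, for each historical frame, the 6-dimensional output of the CNN sketched above, which is what makes the per-frame concatenation possible.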
The neural network model includes the following formula:
J ← M_p(h, a) : H × A
where J denotes the predicted trajectory, M denotes the mapping among H, A, and J, H denotes the historical trajectory information, A denotes the auxiliary information, p denotes a surrounding vehicle, h denotes the position information of vehicle p in the video sequence at frame t, a denotes the scene semantic information of vehicle p at frame t, j denotes the position information of vehicle p in the t-th frame counted from frame T+1, and t denotes the frame index.
As shown in FIG. 3, this embodiment proposes a segmentation-long short-term memory (SEG-LSTM) network that fuses the multiple streams of the historical frames and predicts the future trajectories of the surrounding vehicles.
The number of LSTM layers, the number of units in each LSTM layer, the number of convolutional layers, and the convolution kernel size are all network hyperparameters determined by cross-validation. The purpose of cross-validation is to find the optimal hyperparameters while avoiding overfitting the model. For example, the data set is first split into a training set and a test set at a ratio of 5:1. The training set is then divided into 5 parts; each part in turn serves as the validation set while the remaining 4 parts are used for training, giving 5 rounds of training and validation. The mean accuracy obtained under different hyperparameters is compared, and the best-performing values are selected.
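The described hyperparameter search can be sketched with scikit-learn's splitters as follows. The `train_and_score` callback stands in for one round of SEG-LSTM training and is a hypothetical helper, as are the array-based data layout and the accuracy-based scoring convention.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def select_hyperparams(samples, labels, candidate_configs, train_and_score):
    """Pick the hyperparameter config with the best mean 5-fold accuracy.

    `samples` and `labels` are NumPy arrays; `train_and_score` is a
    hypothetical callback that trains the model once with `cfg` and
    returns its validation accuracy.
    """
    # 5:1 split into training and test data, as described in the text.
    tr_x, te_x, tr_y, te_y = train_test_split(samples, labels, test_size=1 / 6)
    kfold = KFold(n_splits=5, shuffle=True)
    best_cfg, best_score = None, -np.inf
    for cfg in candidate_configs:  # e.g. LSTM layers/units, CNN depth, kernel size
        scores = [
            train_and_score(cfg, tr_x[tr], tr_y[tr], tr_x[va], tr_y[va])
            for tr, va in kfold.split(tr_x)  # each fifth serves once as validation
        ]
        if np.mean(scores) > best_score:
            best_cfg, best_score = cfg, float(np.mean(scores))
    return best_cfg, (te_x, te_y)  # test set reserved for the final evaluation
```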
As shown in FIG. 4, the video sequence is divided by frame into video sequences over multiple time steps. Position information is obtained from each frame by detection and tracking, and image segmentation is performed to obtain semantic information. The position information and semantic information of the same frame are then input into the LSTM network for training, and the predicted trajectory is obtained by training on the video sequences of multiple historical frames and the current frame.
206. Obtain, through the depth camera, the minimum relative distance between the vehicle and each surrounding vehicle, and convert the two-dimensional predicted trajectory into a three-dimensional predicted trajectory according to the minimum relative distance.
Specifically, the predicted trajectory is a two-dimensional predicted trajectory, and the vehicle is further provided with a depth camera.
The two-dimensional predicted trajectory is converted into a three-dimensional predicted trajectory according to the minimum relative distance by the following formula (given only as an image in the original), where x, y, w, and h denote the elements of the two-dimensional predicted trajectory in the pixel bounding box of each frame of the video sequence, x_r, y_r, w_r, and h_r denote the corresponding elements of the three-dimensional predicted trajectory in each frame, f denotes the focal length of the depth camera, and d_min denotes the minimum relative distance between the vehicle and each surrounding vehicle.
If the subscript p is omitted, the historical trajectory information and the predicted trajectory can be defined as a three-dimensional space occupancy grid, that is,
H, J ∈ R^6 = {x, y, w, h, d_min, d_max}
where d_max denotes the maximum distance between the vehicle and each surrounding vehicle.
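Since the conversion formula itself appears only as an image in the original, the following sketch should be read as a guess: it assumes a standard pinhole-camera model in which pixel quantities are scaled by the ratio of the minimum relative distance to the focal length. The function name and the example values are hypothetical.

```python
def bbox_2d_to_3d(x, y, w, h, f, d_min):
    """Hypothetical pinhole-model reconstruction of the 2D-to-3D conversion.

    Pixel bounding-box quantities are scaled by depth over focal length;
    this is an assumption, not the formula from the patent image.
    """
    scale = d_min / f
    return x * scale, y * scale, w * scale, h * scale  # x_r, y_r, w_r, h_r

# Per-frame occupancy-grid state as defined in the text:
# H, J in R^6 = {x, y, w, h, d_min, d_max}
x_r, y_r, w_r, h_r = bbox_2d_to_3d(x=320.0, y=240.0, w=80.0, h=60.0,
                                   f=700.0, d_min=12.5)  # illustrative values
```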
In this embodiment of the present invention, a video sequence including surrounding vehicles and the vehicle background is first acquired through the vehicle-mounted camera, the video sequence is image-segmented to obtain scene semantic information, and the scene semantic information and historical trajectory information are then input into a neural network model to obtain the predicted trajectory, instead of extracting scene semantic information from static images for analysis. This preserves the temporal continuity of the neural network model and improves the accuracy of vehicle trajectory prediction. In addition, combining the convolutional neural network with LSTM networks improves the robustness of tracking surrounding vehicles, and obtaining scene semantic information by image segmentation improves the interpretability of the training process.
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of the trajectory prediction apparatus provided by the third embodiment of the present invention, applied to a vehicle provided with a vehicle-mounted camera. As shown in FIG. 5, the trajectory prediction apparatus mainly includes:
An acquisition module 301, configured to photograph the surrounding environment with the vehicle-mounted camera and obtain a video sequence including surrounding vehicles and the vehicle background.
An extraction and segmentation module 302, configured to locate the surrounding vehicles in the video sequence, extract their historical trajectory information, and take scene semantic information obtained by image segmentation of the video sequence as auxiliary information.
An output module 303, configured to input the historical trajectory information and the auxiliary information into the neural network model to obtain the predicted trajectories of the surrounding vehicles.
Further, the neural network model includes a convolutional neural network, a first-layer long short-term memory network, a second-layer long short-term memory network, and a fully connected layer, in which case:
The output module 303 is further configured to input the auxiliary information into the convolutional neural network to obtain spatial feature information.
The output module 303 is further configured to input the historical trajectory information into the first-layer long short-term memory network to obtain temporal feature information.
The output module 303 is further configured to input the spatial feature information and the temporal feature information into the second-layer long short-term memory network to obtain joint feature information.
The output module 303 is further configured to input the joint feature information into the fully connected layer to obtain the predicted trajectory.
Further, the neural network model includes the following formula:
J ← M_p(h, a) : H × A
where J denotes the predicted trajectory, M denotes the mapping among H, A, and J, H denotes the historical trajectory information, A denotes the auxiliary information, p denotes a surrounding vehicle, h denotes the position information of vehicle p in the video sequence at frame t, a denotes the scene semantic information of vehicle p at frame t, j denotes the position information of vehicle p in the t-th frame counted from frame T+1, and t denotes the frame index.
Further, the predicted trajectory is a two-dimensional predicted trajectory, and the vehicle is further provided with a depth camera.
The acquisition module 301 is further configured to obtain, through the depth camera, the minimum relative distance between the vehicle and each surrounding vehicle.
The apparatus then further includes a conversion module 304.
The conversion module 304 is configured to convert the two-dimensional predicted trajectory into a three-dimensional predicted trajectory according to the minimum relative distance.
Further, the conversion module 304 is configured to convert the two-dimensional predicted trajectory into a three-dimensional predicted trajectory according to the minimum relative distance by the formula given above (as an image in the original), where x, y, w, and h denote the elements of the two-dimensional predicted trajectory in the pixel bounding box of each frame of the video sequence, x_r, y_r, w_r, and h_r denote the corresponding elements of the three-dimensional predicted trajectory in each frame, f denotes the focal length of the depth camera, and d_min denotes the minimum relative distance between the vehicle and each surrounding vehicle.
For the processes by which the above modules implement their respective functions, reference may be made to the related content of the embodiments shown in FIG. 1 to FIG. 4; details are not repeated here.
In this embodiment of the present invention, a video sequence including surrounding vehicles and the vehicle background is acquired through the vehicle-mounted camera, the video sequence is image-segmented to obtain scene semantic information, and the scene semantic information and historical trajectory information are then input into a neural network model to obtain the predicted trajectory, instead of extracting scene semantic information from static images for analysis. This preserves the temporal continuity of the neural network model and improves the accuracy of vehicle trajectory prediction.
In the several embodiments provided in this application, it should be understood that the disclosed methods and apparatus may be implemented in other ways. For example, the embodiments described above are merely illustrative; the division into modules is only a division by logical function, and other divisions are possible in actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented. Furthermore, the mutual couplings, direct couplings, or communication links shown or discussed may be indirect couplings or communication links through interfaces or modules, and may be electrical, mechanical, or of other forms.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in the form of hardware or in the form of software functional modules.
It should be noted that, for brevity of description, the foregoing method embodiments are presented as series of action combinations. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
The foregoing describes the trajectory prediction method and apparatus, terminal, and computer-readable storage medium provided by the present invention. Those of ordinary skill in the art may, based on the ideas of the embodiments of the present invention, make changes to the specific implementations and application scope. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
- 一种轨迹预测方法,应用于设置有车载摄像头的车辆,其特征在于,所述方法包括:A trajectory prediction method, which is applied to a vehicle provided with a vehicle camera, is characterized in that the method includes:利用车载摄像头对周围环境进行摄影,获取包括有周围车辆和车辆背景的视频序列;Use the vehicle camera to photograph the surrounding environment and obtain a video sequence including the surrounding vehicles and the vehicle background;从所述视频序列中定位所述周围车辆并提取所述周围车辆的历史轨迹信息,将所述视频序列进行图像分割得到的场景语义信息作为辅助信息;Locate the surrounding vehicle from the video sequence and extract historical track information of the surrounding vehicle, and use scene semantic information obtained by image segmentation of the video sequence as auxiliary information;将所述历史轨迹信息和所述辅助信息输入神经网络模型,得到所述周围车辆的预测轨迹。The historical trajectory information and the auxiliary information are input into a neural network model to obtain a predicted trajectory of the surrounding vehicles.
- 如权利要求1所述的轨迹预测方法,其特征在于,所述神经网络模型包括卷积神经网络、第一层长短期记忆网络、第二层长短期记忆网络和全连接层,则所述将所述历史轨迹信息和所述辅助信息输入神经网络模型,得到所述周围车辆的预测轨迹包括:The trajectory prediction method according to claim 1, wherein the neural network model comprises a convolutional neural network, a first layer of long-term short-term memory network, a second layer of long-term short-term memory network, and a fully connected layer, then the The historical trajectory information and the auxiliary information are input into a neural network model, and obtaining the predicted trajectory of the surrounding vehicles includes:将所述辅助信息输入给所述卷积神经网络,得到空间特征信息;Inputting the auxiliary information to the convolutional neural network to obtain spatial feature information;将所述历史轨迹信息输入所述第一层长短期记忆网络,得到时间特征信息;Inputting the historical trajectory information into the first layer of short-term and short-term memory network to obtain temporal characteristic information;将所述空间特征信息和所述时间特征信息输入所述第二层长短期记忆网络,得到联合特征信息;Inputting the spatial feature information and the temporal feature information into the second-layer long-short-term memory network to obtain joint feature information;将所述联合特征信息输入全连接层,得到所述预测轨迹。The joint feature information is input to a fully connected layer to obtain the predicted trajectory.
- The trajectory prediction method according to claim 1, wherein the neural network model comprises the following formula:
J ← M_p(h, a): H × A;
where J denotes the predicted trajectory; M denotes the mapping relationship among H, A, and J; H denotes the historical trajectory information; A denotes the auxiliary information; p denotes a surrounding vehicle; h denotes the position information of vehicle p in the t-th frame of the video sequence; a denotes the scene semantic information of vehicle p in the t-th frame of the video sequence; j denotes the position information of vehicle p in the t-th frame of the video sequence, counting from frame T+1; and t denotes the frame index.
- The trajectory prediction method according to claim 1, wherein the predicted trajectory is a two-dimensional spatial predicted trajectory and the vehicle is further provided with a depth camera, and the method further comprises:
obtaining, through the depth camera, the minimum relative distance between the vehicle and each of the surrounding vehicles; and
converting the two-dimensional spatial predicted trajectory into a three-dimensional spatial predicted trajectory according to the minimum relative distance.
- The trajectory prediction method according to claim 4, wherein the two-dimensional spatial predicted trajectory is converted into the three-dimensional spatial predicted trajectory according to the minimum relative distance by the following formula:
where x, y, w, and h respectively denote the elements of the pixel bounding box of the two-dimensional spatial predicted trajectory in each frame of the video sequence; x_r, y_r, w_r, and h_r respectively denote the elements of the pixel bounding box of the three-dimensional spatial predicted trajectory in each frame of the video sequence; f denotes the focal length of the depth camera; and d_min denotes the minimum relative distance between the vehicle and each of the surrounding vehicles.
- A trajectory prediction apparatus, applied to a vehicle provided with a vehicle-mounted camera, wherein the apparatus comprises:
an acquisition module, configured to capture the surrounding environment with the vehicle-mounted camera and obtain a video sequence that includes surrounding vehicles and the vehicle background;
an extraction and segmentation module, configured to locate the surrounding vehicles in the video sequence, extract historical trajectory information of the surrounding vehicles, and take scene semantic information, obtained by image segmentation of the video sequence, as auxiliary information; and
an output module, configured to input the historical trajectory information and the auxiliary information into a neural network model to obtain predicted trajectories of the surrounding vehicles.
- The trajectory prediction apparatus according to claim 6, wherein the neural network model comprises a convolutional neural network, a first-layer long short-term memory network, a second-layer long short-term memory network, and a fully connected layer, wherein:
the output module is further configured to input the auxiliary information into the convolutional neural network to obtain spatial feature information;
the output module is further configured to input the historical trajectory information into the first-layer long short-term memory network to obtain temporal feature information;
the output module is further configured to input the spatial feature information and the temporal feature information into the second-layer long short-term memory network to obtain joint feature information; and
the output module is further configured to input the joint feature information into the fully connected layer to obtain the predicted trajectory.
- The trajectory prediction apparatus according to claim 6, wherein the neural network model comprises the following formula:
J ← M_p(h, a): H × A;
where J denotes the predicted trajectory; M denotes the mapping relationship among H, A, and J; H denotes the historical trajectory information; A denotes the auxiliary information; p denotes a surrounding vehicle; h denotes the position information of vehicle p in the t-th frame of the video sequence; a denotes the scene semantic information of vehicle p in the t-th frame of the video sequence; j denotes the position information of vehicle p in the t-th frame of the video sequence, counting from frame T+1; and t denotes the frame index.
- The trajectory prediction apparatus according to claim 6, wherein the predicted trajectory is a two-dimensional spatial predicted trajectory and the vehicle is further provided with a depth camera;
the acquisition module is further configured to obtain, through the depth camera, the minimum relative distance between the vehicle and each of the surrounding vehicles; and
the apparatus further comprises a conversion module, configured to convert the two-dimensional spatial predicted trajectory into a three-dimensional spatial predicted trajectory according to the minimum relative distance.
- The trajectory prediction apparatus according to claim 9, wherein the conversion module is further configured to convert the two-dimensional spatial predicted trajectory into the three-dimensional spatial predicted trajectory according to the minimum relative distance by the following formula:
where x, y, w, and h respectively denote the elements of the pixel bounding box of the two-dimensional spatial predicted trajectory in each frame of the video sequence; x_r, y_r, w_r, and h_r respectively denote the elements of the pixel bounding box of the three-dimensional spatial predicted trajectory in each frame of the video sequence; f denotes the focal length of the depth camera; and d_min denotes the minimum relative distance between the vehicle and each of the surrounding vehicles.
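The overall data flow of claim 1 can be sketched in a few lines of Python. This is an illustration only: the helper names `tracker`, `segmenter`, and `model` are assumptions, as the claims do not name any particular detector, segmentation network, or model implementation.

```python
# Minimal sketch of the claim-1 pipeline; all names here are assumptions.
def predict_surrounding_trajectories(video_frames, tracker, segmenter, model):
    # video_frames: the T camera frames containing the surrounding vehicles
    # and the vehicle background (the captured video sequence).

    # Locate surrounding vehicles and extract historical trajectories H.
    history = tracker(video_frames)      # e.g. {vehicle_id: list of (x, y, w, h)}

    # Image segmentation yields scene semantic maps, the auxiliary information A.
    semantics = segmenter(video_frames)  # e.g. one per-pixel class map per frame

    # The neural network model maps (H, A) to a predicted trajectory J
    # for each surrounding vehicle p.
    return {p: model(h, semantics) for p, h in history.items()}
```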
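Claims 2 and 7 fix only the layer types and their order: a CNN for the auxiliary input, a first LSTM for the track history, a second LSTM for the concatenated features, and a fully connected output. A sketch in PyTorch follows; every dimension (channel counts, hidden size, prediction horizon) is an arbitrary choice for illustration, not taken from the patent.

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    """CNN -> LSTM-1 -> LSTM-2 -> FC, as in claims 2 and 7."""
    def __init__(self, hidden=128, pred_len=10):
        super().__init__()
        # Spatial feature information from the segmentation (auxiliary) input.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (B*T, 32)
        )
        # Temporal feature information from the (x, y, w, h) track history.
        self.lstm1 = nn.LSTM(4, hidden, batch_first=True)
        # Joint feature information from spatial + temporal features.
        self.lstm2 = nn.LSTM(hidden + 32, hidden, batch_first=True)
        # Fully connected layer emitting pred_len future bounding boxes.
        self.fc = nn.Linear(hidden, pred_len * 4)

    def forward(self, history, semantics):
        # history: (B, T, 4); semantics: (B, T, 1, H, W)
        B, T = history.shape[:2]
        spatial = self.cnn(semantics.flatten(0, 1)).view(B, T, -1)
        temporal, _ = self.lstm1(history)
        joint, _ = self.lstm2(torch.cat([spatial, temporal], dim=-1))
        return self.fc(joint[:, -1]).view(B, -1, 4)         # (B, pred_len, 4)
```

For example, with a batch of 2 tracks over 8 frames and 64x64 semantic maps, `TrajectoryPredictor()(torch.zeros(2, 8, 4), torch.zeros(2, 8, 1, 64, 64))` returns a (2, 10, 4) tensor of future boxes.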
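The formula in claims 3 and 8 reads as a mapping from history and auxiliary information to the predicted trajectory. The rendering below, with observed frames indexed t ≤ T and predicted frames indexed t ≥ T+1, is a reconstruction from the variable definitions in the claims, not a verbatim reproduction:

```latex
M_p : H \times A \to J, \qquad
J = \{\, j_t \,\}_{t \ge T+1}
  = M_p\!\left( \{\, h_t \,\}_{t \le T},\ \{\, a_t \,\}_{t \le T} \right)
```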
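The conversion formula of claims 5 and 10 is reproduced only as an image in the source publication and does not survive in this text. Under the standard pinhole-camera assumption, each pixel-box element would be scaled by the depth-to-focal-length ratio, i.e. x_r = x · d_min / f, and likewise for y, w, and h. The sketch below implements that assumption; it is not the patent's verbatim formula.

```python
def box_2d_to_3d(box_2d, f, d_min):
    """Assumed pinhole-model conversion: scale each 2D pixel-box element
    by d_min / f. Not the patent's verbatim formula."""
    x, y, w, h = box_2d
    scale = d_min / f          # metres per pixel at distance d_min
    return (x * scale, y * scale, w * scale, h * scale)

# Example: a predicted 40x30-pixel box at (320, 180), focal length 700 px,
# minimum relative distance of 14 m to the tracked vehicle.
print(box_2d_to_3d((320, 180, 40, 30), f=700.0, d_min=14.0))
```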
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2018/095144 WO2020010517A1 (en) | 2018-07-10 | 2018-07-10 | Trajectory prediction method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020010517A1 (en) | 2020-01-16 |
Family
ID=69142182
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/095144 WO2020010517A1 (en) | Trajectory prediction method and apparatus | 2018-07-10 | 2018-07-10 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020010517A1 (en) |
- 2018-07-10: International application PCT/CN2018/095144 filed (WO2020010517A1), active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN106873580A (en) * | 2015-11-05 | 2017-06-20 | Ford Global Technologies | Autonomous driving at intersections based on perception data
US20180023960A1 (en) * | 2016-07-21 | 2018-01-25 | Mobileye Vision Technologies Ltd. | Distributing a crowdsourced sparse map for autonomous vehicle navigation
CN108068819A (en) * | 2016-11-17 | 2018-05-25 | Ford Global Technologies | Detecting and responding to emergency vehicles on the road
CN107144285A (en) * | 2017-05-08 | 2017-09-08 | Shenzhen Horizon Robotics Technology Co., Ltd. | Pose information determination method and apparatus, and movable device
CN107438873A (en) * | 2017-07-07 | 2017-12-05 | UISEE Technologies (Beijing) Co., Ltd. | Method and apparatus for controlling vehicle travel
CN108196535A (en) * | 2017-12-12 | 2018-06-22 | Tsinghua University Suzhou Automotive Research Institute (Wujiang) | Automated driving system based on reinforcement learning and multi-sensor fusion
CN108803617A (en) * | 2018-07-10 | 2018-11-13 | Shenzhen University | Trajectory prediction method and apparatus
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN111895931A (en) * | 2020-07-17 | 2020-11-06 | Jiaxing Boling Technology Co., Ltd. | Coal mine operation area calibration method based on computer vision
Similar Documents
Publication | Title
---|---
CN108803617B (en) | Trajectory prediction method and apparatus
US11315266B2 (en) | Self-supervised depth estimation method and system
Ramos et al. | Detecting unexpected obstacles for self-driving cars: Fusing deep learning and geometric modeling
JP6766844B2 (en) | Object identification device, mobile system, object identification method, object identification model learning method and object identification model learning device
US20210150203A1 (en) | Parametric top-view representation of complex road scenes
CN110837778A (en) | Traffic police command gesture recognition method based on skeleton joint point sequence
EP3822852A2 (en) | Method, apparatus, computer storage medium and program for training a trajectory planning model
CN111563415A (en) | Binocular vision-based three-dimensional target detection system and method
CN116258817B (en) | Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
CN113936139A (en) | Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation
CN114022830A (en) | Target determination method and target determination device
JP7276607B2 (en) | Methods and systems for predicting crowd dynamics
CN109919026B (en) | Surface unmanned ship local path planning method
JP2019149142A (en) | Object marking system and object marking method
CN112396000A (en) | Method for constructing multi-mode dense prediction depth information transmission model
WO2021175434A1 (en) | System and method for predicting a map from an image
CN114120283A (en) | Method for distinguishing unknown obstacles in road scene three-dimensional semantic segmentation
CN116194951A (en) | Method and apparatus for stereoscopic based 3D object detection and segmentation
Tran et al. | Enhancement of robustness in object detection module for advanced driver assistance systems
CN114419603A (en) | Automatic driving vehicle control method and system and automatic driving vehicle
Rashed et al. | Bev-modnet: Monocular camera based bird's eye view moving object detection for autonomous driving
CN112233079B (en) | Method and system for fusing images of multiple sensors
WO2020010517A1 (en) | Trajectory prediction method and apparatus
CN106650814B (en) | Outdoor road self-adaptive classifier generation method based on vehicle-mounted monocular vision
CN111611869B (en) | End-to-end monocular vision obstacle avoidance method based on serial deep neural network
Legal Events
Code | Title | Description
---|---|---
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18926159; Country of ref document: EP; Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03/05/2021)
122 | Ep: pct application non-entry in european phase | Ref document number: 18926159; Country of ref document: EP; Kind code of ref document: A1