CN114842085B - Full-scene vehicle attitude estimation method - Google Patents
Full-scene vehicle attitude estimation method
- Publication number
- CN114842085B (application CN202210780438.0A)
- Authority
- CN
- China
- Prior art keywords
- image
- vehicle
- layer
- key point
- full
- Prior art date: 2022-07-05
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention belongs to the technical field of vehicle attitude estimation, and relates to a full-scene vehicle attitude estimation method.
Description
Technical Field
The invention belongs to the technical field of vehicle attitude estimation, and relates to a full-scene vehicle attitude estimation method.
Background
Autonomous driving has broad prospects and is the development trend of future automobiles. Autonomous driving requires that a vehicle be able to clearly judge its surrounding environment, select the correct driving route and driving behavior, and assist the driver in controlling the vehicle. Real-world driving scenes are complex and changeable, and each complex scene requires different countermeasures. Vehicle attitude estimation, an important task in autonomous driving technology, aims to locate the key points of vehicles in images or videos and helps judge the driving states of surrounding vehicles.
At present, the main challenge in vehicle attitude estimation is occlusion. Occlusion exists in every driving scene, for example occlusion between vehicles, between pedestrians and vehicles, and between other objects and vehicles. Existing vehicle attitude estimation methods have difficulty recognizing the vehicle attitude in occluded scenes, so a vehicle attitude estimation method oriented to the full scene is urgently needed.
Convolutional neural networks have achieved excellent performance in the field of attitude estimation, but most work treats the deep convolutional neural network as a powerful black-box predictor, and how it captures the spatial relationships between components remains unclear. From the viewpoint of both science and practical application, the interpretability of a model helps in understanding how the model relates variables to reach the final prediction and how the attitude estimation algorithm processes various input images. In the vehicle attitude estimation task, a Transformer can capture long-range relationships and thereby reveal the dependency relationships between key points.
Since the advent of the Transformer, its high computational efficiency and scalability have made it dominant in natural language processing. The Transformer is a deep neural network based mainly on the self-attention mechanism, and owing to its powerful performance, researchers have been looking for ways to apply it to computer vision tasks. In various vision benchmarks, Transformer-based models perform similarly to or better than other types of networks (such as convolutional and recurrent networks), but no report of applying such a model to vehicle pose estimation is known at present.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by designing and providing a full-scene vehicle attitude estimation method that realizes efficient vehicle attitude estimation. The method takes a Swin Transformer as the backbone network for feature extraction, uses a Transformer encoder to encode the feature map information into position representations of key points, obtains key point dependency terms by calculating attention scores, and predicts the final key point positions, thereby effectively solving the vehicle occlusion problem and realizing full-scene vehicle attitude estimation.
To achieve this purpose, the Swin Transformer is introduced as the backbone network and the network structure is optimized according to the characteristics of the vehicle attitude estimation task. The original image information is compressed into a compact sequence of key point positions, converting the vehicle attitude estimation task into an encoding task; key point dependency terms are obtained by calculating attention scores, and the final key point positions are predicted. The specific process comprises the following steps:
(1) data set construction:
selecting vehicle images from an open-source data set, collecting images of various vehicles from traffic monitoring and parking lots, constructing a vehicle data set, and dividing the vehicle data set into a training set, a verification set and a test set;
(2) image segmentation: the images in the vehicle data set are segmented into non-overlapping image slices by a slice segmentation module; each image slice is regarded as a token whose feature is the concatenated raw RGB values of the input image;
(3) hierarchical feature extraction by the backbone network: the image slice tokens obtained in step (2) first pass through the linear embedding layer of the first stage of the backbone network, where the feature dimension is projected to an arbitrary dimension C; they then pass through two Swin Transformer blocks and the second stage for hierarchical feature extraction, obtaining a feature map;
(4) position coding: the feature map obtained in step (3) is input into a position coding layer for position coding; the feature map passes through a 1×1 convolution or a linear layer and is flattened into a sequence of (H/8)×(W/8) d-dimensional vectors, which pass through four attention layers and a feed-forward neural network and then output feature vectors, where H and W are respectively the height and width of the image and d is the encoder embedding dimension;
(5) generating key point heat maps: the feature vectors obtained in step (4) are reshaped back to the H/8 × W/8 spatial resolution, the channel dimension is then reduced from d to K, and the predicted K key point heat maps are generated, where K is the number of key points of each vehicle and takes the value 78;
(6) outputting the result: the key point heat maps are converted into key point coordinates through non-maximum suppression, and the positions of the key points are marked in the original image, realizing full-scene vehicle attitude estimation.
Further, 78 key points are defined for each vehicle in the vehicle images in step (1), and the bounding box of the vehicle, namely its minimum bounding rectangle, and the vehicle category are labeled.
Further, the backbone network in step (3) adopts a Swin Transformer backbone network. The first stage comprises a linear embedding layer and two Swin Transformer blocks, and the number of tokens processed by the two Swin Transformer blocks is (H/4)×(W/4), where H and W are the height and width of the input image. The second stage comprises a linear merging layer and two Swin Transformer blocks: the image slice tokens produced by the feature extraction of the first stage are reduced by the linear merging layer, which concatenates the features of each group of 2×2 adjacent tokens and applies a linear layer to the resulting 4C-dimensional features, so that the number of tokens is reduced by a factor of 4 and the output dimension becomes 2C; feature transformation is then carried out by the two Swin Transformer blocks, the resolution of the obtained feature map is (H/8)×(W/8), and hierarchical feature extraction is realized.
Further, the position coding layer in step (4) adopts an encoder with the standard Transformer architecture. The position coding layer regards the feature map as dynamic weights determined by the specific image content and re-weights the information flow in forward propagation; key point dependencies are obtained by calculating the scores of the last attention layer, where a higher attention score at a position in the image indicates a larger contribution to predicting a key point, and occluded key points are predicted through the dependencies of the key points.
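For illustration only, the following PyTorch sketch shows one way such a dependency read-out could look, assuming the (sequence × sequence) attention weight matrix of the last attention layer is already available; the function name and the top-k choice are illustrative assumptions and not part of the claimed method.

```python
import torch

def keypoint_dependencies(attn: torch.Tensor, query_idx: int, top_k: int = 5):
    """List the feature-map positions that contribute most to one predicted key point.

    attn:      (L, L) attention weights of the last attention layer, where L is the
               number of flattened feature-map positions (each row sums to 1 over keys).
    query_idx: flattened index of the position whose key point prediction is inspected.
    """
    scores = attn[query_idx]                     # contribution score of every position
    values, indices = torch.topk(scores, top_k)  # higher score = larger contribution
    return list(zip(indices.tolist(), values.tolist()))

# Example: inspect which visible positions an occluded key point depends on.
attn = torch.softmax(torch.rand(1024, 1024), dim=-1)   # dummy last-layer attention map
print(keypoint_dependencies(attn, query_idx=137))
```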
Compared with the prior art, the method replaces the traditional convolutional neural network with a Swin Transformer: the backbone network adopts a hierarchical Transformer, which improves computational efficiency and has linear computational complexity. A standard Transformer encoder is used to capture long-range relationships in the image and reveal the dependency relationships of the predicted key points; the final position of a predicted key point is formed by collecting, through the last attention layer, the dependency terms that contribute most to that key point, which solves the occlusion problem. The method achieves a good balance between detection accuracy and speed and has high practical application value.
Drawings
Fig. 1 is a schematic structural framework diagram of a vehicle attitude estimation system provided by the present invention.
Fig. 2 is a schematic structural diagram of a first stage of the backbone network according to the present invention.
Fig. 3 is a schematic structural diagram of a second stage of the backbone network according to the present invention.
FIG. 4 is a structural diagram of a single coding layer according to the present invention.
FIG. 5 is a block flow diagram of a vehicle attitude estimation method according to the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
Example:
This embodiment provides a full-scene vehicle attitude estimation method based on a Transformer backbone and a position encoder. Swin Transformer is introduced as the backbone network; by compressing the original image information into a compact sequence of key point positions, the vehicle attitude estimation task is converted into an encoding task; key point dependency terms are obtained by calculating attention scores and the final key point positions are predicted, so that the positions of occluded vehicle key points can be effectively predicted and full-scene vehicle attitude estimation is realized, as shown in FIGS. 1-5. The method specifically comprises the following steps:
(1) data set construction:
vehicle images are selected from an open-source data set, and images containing various vehicles in real scenes such as traffic monitoring and parking lots are collected to construct a vehicle data set; 78 key points are defined on each vehicle; taking a car as an example, points with strong local texture feature information are mainly selected, such as corner points (the 4 corner points of the car lamps, the 4 corner points of the front and rear windshields, and the like); the bounding box of the vehicle, namely its minimum bounding rectangle, and the vehicle category are labeled; finally, the data set is divided into a training set, a verification set and a test set;
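For clarity, one possible annotation record for a single vehicle instance is sketched below; the field names, the (x, y, visibility) key point triplets and the split helper are illustrative assumptions modelled on COCO-style labelling, not a format prescribed by the invention.

```python
# Hypothetical annotation record for one vehicle instance with 78 key points.
vehicle_annotation = {
    "image_id": 1024,
    "category": "car",                     # vehicle category label
    "bbox": [412.0, 230.5, 188.0, 96.0],   # minimum bounding rectangle: x, y, w, h
    "num_keypoints": 78,
    # each key point stored as (x, y, v); v = 0 marks an occluded point,
    # which is exactly the case the method aims to recover
    "keypoints": [(430.1, 250.2, 1)] * 78, # placeholder values for illustration
}

def split_dataset(samples, train=0.8, val=0.1):
    """Divide annotated samples into training / verification / test subsets."""
    n = len(samples)
    n_train, n_val = int(n * train), int(n * val)
    return samples[:n_train], samples[n_train:n_train + n_val], samples[n_train + n_val:]
```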
(2) image segmentation:
the vehicle image is segmented into non-overlapping image slices by a slice segmentation module; each image slice has a size of 4×4 pixels and a feature dimension of 4×4×3 = 48; each image slice is regarded as a token whose feature is the concatenated raw RGB values of the input image;
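A minimal PyTorch sketch of this slicing step is given below, assuming the standard Swin Transformer patch size of 4×4 pixels, so that each token is a 4·4·3 = 48-dimensional vector of concatenated RGB values; the function name and the plain tensor reshaping are illustrative choices rather than the patented implementation.

```python
import torch

def partition_into_patches(image: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Split an image of shape (3, H, W) into non-overlapping patch tokens.

    Returns a tensor of shape (H//patch * W//patch, patch*patch*3): one token
    per image slice, whose feature is the concatenated raw RGB values.
    """
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by the patch size"
    x = image.reshape(c, h // patch, patch, w // patch, patch)
    x = x.permute(1, 3, 0, 2, 4)                 # (H/4, W/4, 3, 4, 4)
    return x.reshape(-1, c * patch * patch)      # (H/4 * W/4, 48)

tokens = partition_into_patches(torch.rand(3, 256, 256))   # -> (4096, 48)
```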
(3) hierarchical feature extraction by the backbone network:
the backbone network is divided into two stages. The image slice tokens first pass through the first stage; as shown in FIG. 2, the first stage comprises a linear embedding layer and two Swin Transformer blocks. The linear embedding layer is applied to the raw-value features of the image slices and maps them to an arbitrary dimension C, and the number of tokens processed by the Transformer blocks is (H/4)×(W/4), where H and W are the height and width of the input image. This is followed by the second stage, as shown in FIG. 3: as the network deepens, the number of tokens is reduced by a linear merging layer, which concatenates the features of each group of 2×2 adjacent tokens and applies a linear layer to the resulting 4C-dimensional features, so that the number of tokens is reduced by a factor of 4 and the output dimension becomes 2C; feature transformation is then carried out by two Swin Transformer blocks, the resolution of the obtained feature map is (H/8)×(W/8), and hierarchical feature extraction is realized;
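The linear merging operation between the two stages can be sketched as follows, assuming the standard Swin Transformer behavior of concatenating each 2×2 group of neighbouring tokens (4C channels) and projecting them to 2C with one linear layer, which is what reduces the token count by a factor of 4; this is a simplified illustration, not the exact implementation of the patent.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighbouring tokens and project 4C -> 2C."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) token map from the first stage (H = H_img/4, W = W_img/4)
        b, h, w, c = x.shape
        x0 = x[:, 0::2, 0::2, :]   # top-left token of every 2x2 group
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

merge = PatchMerging(dim=96)
out = merge(torch.rand(1, 64, 64, 96))            # -> (1, 32, 32, 192), i.e. 4x fewer tokens
```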
(4) position coding:
the feature map output by the backbone network is input into the coding layers; this embodiment uses 4 coding layers, each structured as shown in FIG. 4. The feature map first passes through a 1×1 convolution or a linear layer and is flattened into a sequence of (H/8)×(W/8) d-dimensional vectors, which pass through 4 attention layers and a feed-forward neural network to obtain the feature vectors;
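A minimal sketch of one such coding layer is shown below; the 1×1 convolution used to set the embedding width d, the hyper-parameters (d = 192, 8 heads) and the residual wiring are illustrative assumptions, and the layer is built from PyTorch's standard multi-head attention rather than the exact encoder of the embodiment.

```python
import torch
import torch.nn as nn

class CodingLayer(nn.Module):
    """One position-coding layer: 1x1 conv, flatten, 4 attention layers + feed-forward."""

    def __init__(self, in_channels: int, d: int = 192, heads: int = 8, n_attn: int = 4):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, d, kernel_size=1)     # 1x1 convolution
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(d, heads, batch_first=True) for _ in range(n_attn)]
        )
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, fmap: torch.Tensor):
        # fmap: (B, C, H/8, W/8) feature map from the backbone
        b, _, h, w = fmap.shape
        x = self.proj(fmap).flatten(2).transpose(1, 2)           # (B, H/8 * W/8, d)
        attn_scores = None
        for layer in self.attn_layers:
            out, attn_scores = layer(x, x, x, need_weights=True) # keep last attention map
            x = x + out                                          # residual connection
        x = x + self.ffn(x)
        return x, attn_scores                                    # scores reveal key point dependencies

layer = CodingLayer(in_channels=192)
feats, scores = layer(torch.rand(1, 192, 32, 32))                # feats: (1, 1024, 192)
```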
(5) generating a keypoint heat map:
the feature vectors output by the coding layers are reshaped back to the H/8 × W/8 spatial resolution, and the channel dimension is then reduced from d to K (K is the number of key points per vehicle, with the value 78), generating the predicted K key point heat maps;
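The reshaping and channel reduction can be sketched as follows, assuming the encoder output width is d and the heat-map resolution is H/8 × W/8; the single 1×1 convolution used to go from d to K = 78 channels and the value d = 192 are illustrative assumptions.

```python
import torch
import torch.nn as nn

def sequence_to_heatmaps(feats: torch.Tensor, h: int, w: int, head: nn.Conv2d) -> torch.Tensor:
    """Reshape encoder output (B, h*w, d) back to (B, d, h, w) and reduce channels to K."""
    b, n, d = feats.shape
    assert n == h * w
    fmap = feats.transpose(1, 2).reshape(b, d, h, w)   # back to spatial layout
    return head(fmap)                                  # (B, K, h, w) key point heat maps

K = 78                                                 # key points per vehicle
head = nn.Conv2d(192, K, kernel_size=1)                # d = 192 assumed for illustration
heatmaps = sequence_to_heatmaps(torch.rand(1, 32 * 32, 192), 32, 32, head)  # (1, 78, 32, 32)
```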
(6) outputting the result: non-maximum suppression is applied to the key point heat maps generated in step (5) to obtain the key point coordinates, and the positions of the key points are marked in the original image, realizing full-scene vehicle attitude estimation.
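The decoding step can be illustrated with the following sketch, which keeps only the maximum response of every heat map (the simplest form of non-maximum suppression) and rescales it to original image coordinates; the 8× stride assumes the H/8 × W/8 heat-map resolution described above.

```python
import torch

def decode_heatmaps(heatmaps: torch.Tensor, stride: int = 8):
    """Convert (K, h, w) key point heat maps into K (x, y, score) tuples in image space.

    Keeping only the per-map maximum suppresses all non-maximum responses; stride
    maps heat-map coordinates back to the original image resolution.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    scores, idx = flat.max(dim=1)
    ys = torch.div(idx, w, rounding_mode="floor")
    xs = idx % w
    return [(float(x) * stride, float(y) * stride, float(s))
            for x, y, s in zip(xs, ys, scores)]

keypoints = decode_heatmaps(torch.rand(78, 32, 32))   # 78 predicted vehicle key points
```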
Structures, algorithms, and computational processes not described in detail herein are all common in the art.
It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what is disclosed in the embodiments; rather, the scope of the invention is defined by the appended claims.
Claims (5)
1. A full scene vehicle attitude estimation method is characterized by comprising the following steps:
(1) data set construction:
selecting vehicle images from an open-source data set, collecting images of various vehicles from traffic monitoring and parking lots, constructing a vehicle data set, and dividing the vehicle data set into a training set, a verification set and a test set;
(2) image segmentation: the images in the vehicle data set are segmented into non-overlapping image slices by a slice segmentation module; each image slice is regarded as a token whose feature is the concatenated raw RGB values of the input image;
(3) hierarchical feature extraction by the backbone network: the image slice tokens obtained in step (2) first pass through the linear embedding layer of the first stage of the backbone network, where the feature dimension is projected to an arbitrary dimension C; they then pass through two Swin Transformer blocks and the second stage for hierarchical feature extraction, obtaining a feature map;
(4) position coding: the feature map obtained in step (3) is input into a position coding layer for position coding; the feature map passes through a 1×1 convolution or a linear layer and is flattened into a sequence of (H/8)×(W/8) d-dimensional vectors, which pass through four attention layers and a feed-forward neural network and then output feature vectors, where H and W are respectively the height and width of the image and d is the encoder embedding dimension;
(5) generating key point heat maps: the feature vectors obtained in step (4) are reshaped back to the H/8 × W/8 spatial resolution, the channel dimension is then reduced from d to K, and K key point heat maps are generated, where K is the number of key points of each vehicle and takes the value 78;
(6) outputting the result: the key point heat maps are converted into key point coordinates through non-maximum suppression, and the positions of the key points are marked in the original image, realizing full-scene vehicle attitude estimation.
2. The full-scene vehicle attitude estimation method according to claim 1, wherein 78 key points are defined for each vehicle in the vehicle images in step (1), and the bounding box and the category of the vehicle are labeled.
4. The full-scene vehicle attitude estimation method according to claim 3, wherein the backbone network in step (3) adopts a Swin Transformer backbone network; the first stage comprises a linear embedding layer and two Swin Transformer blocks, and the number of tokens processed by the two Swin Transformer blocks is (H/4)×(W/4), where H and W are the height and width of the input image; the second stage comprises a linear merging layer and two Swin Transformer blocks, the image slice tokens produced by the feature extraction of the first stage are reduced by the linear merging layer, which concatenates the features of each group of 2×2 adjacent tokens and applies a linear layer to the resulting 4C-dimensional features, so that the number of tokens is reduced by a factor of 4 and the output dimension becomes 2C; feature transformation is then carried out by the two Swin Transformer blocks, the resolution of the obtained feature map is (H/8)×(W/8), and hierarchical feature extraction is realized.
5. The full-scene vehicle attitude estimation method according to claim 4, wherein the position coding layer in step (4) adopts an encoder of the standard Transformer architecture; the position coding layer regards the feature map as dynamic weights determined by the specific image content and re-weights the information flow in forward propagation; key point dependencies are obtained by calculating the scores of the last attention layer, where a higher attention score at a position in the image indicates a larger contribution to predicting the key point, and the occluded key points are predicted through the dependencies of the key points.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210780438.0A CN114842085B (en) | 2022-07-05 | 2022-07-05 | Full-scene vehicle attitude estimation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114842085A CN114842085A (en) | 2022-08-02 |
CN114842085B true CN114842085B (en) | 2022-09-16 |
Family
ID=82574897
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210780438.0A Active CN114842085B (en) | 2022-07-05 | 2022-07-05 | Full-scene vehicle attitude estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114842085B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115272992B (en) * | 2022-09-30 | 2023-01-03 | 松立控股集团股份有限公司 | Vehicle attitude estimation method |
CN116758341B (en) * | 2023-05-31 | 2024-03-19 | 北京长木谷医疗科技股份有限公司 | GPT-based hip joint lesion intelligent diagnosis method, device and equipment |
CN117352120B (en) * | 2023-06-05 | 2024-06-11 | 北京长木谷医疗科技股份有限公司 | GPT-based intelligent self-generation method, device and equipment for knee joint lesion diagnosis |
CN116740714B (en) * | 2023-06-12 | 2024-02-09 | 北京长木谷医疗科技股份有限公司 | Intelligent self-labeling method and device for hip joint diseases based on unsupervised learning |
CN116894973B (en) * | 2023-07-06 | 2024-05-03 | 北京长木谷医疗科技股份有限公司 | Integrated learning-based intelligent self-labeling method and device for hip joint lesions |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200020117A1 (en) * | 2018-07-16 | 2020-01-16 | Ford Global Technologies, Llc | Pose estimation |
CN109598339A (en) * | 2018-12-07 | 2019-04-09 | 电子科技大学 | A kind of vehicle attitude detection method based on grid convolutional network |
CN113591936B (en) * | 2021-07-09 | 2022-09-09 | 厦门市美亚柏科信息股份有限公司 | Vehicle attitude estimation method, terminal device and storage medium |
- 2022-07-05: CN application CN202210780438.0A filed; granted as patent CN114842085B (status: Active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113792669A (en) * | 2021-09-16 | 2021-12-14 | 大连理工大学 | Pedestrian re-identification baseline method based on hierarchical self-attention network |
CN114663917A (en) * | 2022-03-14 | 2022-06-24 | 清华大学 | Multi-view-angle-based multi-person three-dimensional human body pose estimation method and device |
Non-Patent Citations (2)
Title |
---|
Zhihong Wu et al., "DST3D: DLA-Swin Transformer for Single-Stage Monocular 3D Object Detection", 2022 IEEE Intelligent Vehicles Symposium (IV), 2022-06-09, full text *
Zinan Xiong et al., "SWIN-POSE: Swin Transformer Based Human Pose Estimation", arXiv:2201.07384v1, 2022-01-19, full text *
Also Published As
Publication number | Publication date |
---|---|
CN114842085A (en) | 2022-08-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |