CN111598095B - Urban road scene semantic segmentation method based on deep learning - Google Patents
Urban road scene semantic segmentation method based on deep learning
- Publication number
- CN111598095B CN111598095B CN202010156966.XA CN202010156966A CN111598095B CN 111598095 B CN111598095 B CN 111598095B CN 202010156966 A CN202010156966 A CN 202010156966A CN 111598095 B CN111598095 B CN 111598095B
- Authority
- CN
- China
- Prior art keywords
- image
- layer
- residual error
- network
- images
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000011218 segmentation Effects 0.000 title claims abstract description 34
- 238000000034 method Methods 0.000 title claims abstract description 29
- 238000013135 deep learning Methods 0.000 title claims abstract description 23
- 238000012549 training Methods 0.000 claims abstract description 42
- 238000005070 sampling Methods 0.000 claims abstract description 24
- 230000009466 transformation Effects 0.000 claims abstract description 10
- 239000011159 matrix material Substances 0.000 claims abstract description 7
- 238000013136 deep learning model Methods 0.000 claims abstract description 6
- 230000001131 transforming effect Effects 0.000 claims abstract description 4
- 230000006870 function Effects 0.000 claims description 36
- 238000011176 pooling Methods 0.000 claims description 18
- 238000012795 verification Methods 0.000 claims description 17
- 230000004927 fusion Effects 0.000 claims description 15
- 238000010606 normalization Methods 0.000 claims description 9
- 230000008569 process Effects 0.000 claims description 9
- 238000013528 artificial neural network Methods 0.000 claims description 7
- 238000012545 processing Methods 0.000 claims description 7
- 230000009467 reduction Effects 0.000 claims description 7
- 238000001514 detection method Methods 0.000 claims description 6
- 238000013519 translation Methods 0.000 claims description 6
- 230000014616 translation Effects 0.000 claims description 6
- 230000001902 propagating effect Effects 0.000 claims description 4
- 230000008859 change Effects 0.000 claims description 3
- 238000003709 image segmentation Methods 0.000 claims description 3
- 238000012546 transfer Methods 0.000 claims description 3
- 238000009432 framing Methods 0.000 claims description 2
- 238000002372 labelling Methods 0.000 description 14
- 238000013461 design Methods 0.000 description 6
- 238000010200 validation analysis Methods 0.000 description 4
- 238000006243 chemical reaction Methods 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- 238000010586 diagram Methods 0.000 description 2
- 238000003062 neural network model Methods 0.000 description 2
- 230000004044 response Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000000644 propagated effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/02—Affine transformations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4084—Scaling of whole images or parts thereof, e.g. expanding or contracting in the transform domain, e.g. fast Fourier transform [FFT] domain scaling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/90—Determination of colour characteristics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2200/00—Indexing scheme for image data processing or generation, in general
- G06T2200/32—Indexing scheme for image data processing or generation, in general involving image mosaicing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
A deep-learning-based urban road scene semantic segmentation method comprises the following steps: 1) acquiring images from the front end of the vehicle; 2) expanding the input data of the annotated images and original images: randomly cropping or splicing the images or adding different types of noise, transforming them through an image affine matrix, and finally restoring the original resolution by padding and cropping to obtain the data set; 3) using the expanded images and their annotations for network training, the residual U-net network comprising a down-sampling part, a bridge part, an up-sampling part and a classification part; 4) modifying the time interval T of the acquisition module, inputting subsequently acquired images into the trained deep-learning model, outputting the predicted semantic segmentation images, and returning the different gray levels in the images to the processor. The invention uses a smaller data set, prevents the gradient from decreasing too quickly, and avoids over-fitting during training.
Description
Technical Field
The invention belongs to the field of intelligent vehicles, and discloses an urban road scene semantic segmentation method based on deep learning.
Background
In recent years, as urbanization has progressed, urban road conditions have become increasingly complex: pedestrians, traffic lights, zebra crossings and different kinds of vehicles all affect an intelligent vehicle's speed and obstacle-avoidance measures. Deep-learning-based semantic segmentation allows the vehicle to recognize its surroundings well and respond accordingly. Semantic segmentation assigns a preset category to every pixel of an image, so that an intelligent vehicle can understand its surroundings in real time while driving, which helps reduce traffic accidents. Research on deep learning for the urban road environment has therefore long been a focus of the intelligent-vehicle field. Existing deep-learning semantic segmentation methods study neural networks such as SegNet, FCN and ResNet. Although these networks require no conventional object-recognition pipeline, learn features automatically rather than relying on features hand-designed by engineers, and can be trained on large image sets to obtain a suitable model that outputs semantic segmentation results, existing network training encounters the following problems: 1. too many weights cause over-fitting; 2. deeper networks may suffer from rapidly vanishing gradients; 3. the large data sets required make training time long. These problems make it difficult for a deep-learning network to output accurate semantic segmentation results, so an intelligent vehicle struggles to obtain real-time feedback about its surroundings under complex road conditions, which constitutes a safety hazard. It is therefore valuable to design a network that uses a smaller data set while preventing the gradient from vanishing too quickly and avoiding over-fitting during training.
Disclosure of Invention
To overcome the shortcomings of the prior art and to enable an intelligent vehicle to better recognize its surroundings in complex environments such as urban roads, the invention provides a deep-learning-based method for semantic segmentation of urban road scenes.
The technical scheme adopted by the invention for solving the technical problem is as follows:
a deep learning-based urban road scene semantic segmentation method comprises the following steps:
1) Image acquisition at the front end of the vehicle: urban road images are collected at a regular time interval T, and images with resolution h × w are input into an image detection module to obtain effective images; the images are then input into a labeling module for annotation. The system uses the publicly available graphical annotation software Labelme 3.11.2: through its scene-segmentation labeling function, vehicles, pedestrians, bicycles, traffic lights and neon lights in the image are framed and labeled as different categories. The generated annotated image reflects the different object categories through different gray levels, and a gray-level list and the number of object categories K stored in the image are obtained from these gray levels;
2) Expanding the input data of the annotated images and original images: the images are randomly cropped, spliced, or corrupted with different types of noise, and then transformed through the image affine matrix; the affine transformation is shown in formula (1):
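Formula (1) is not reproduced in this text; a plausible reconstruction from the parameter definitions that follow, assuming the standard homogeneous form of a two-dimensional affine transform, is:

```latex
\begin{pmatrix} a' \\ b' \\ 1 \end{pmatrix}
=
\begin{pmatrix}
c_1 & c_2 & s_x \\
c_3 & c_4 & s_y \\
0   & 0   & 1
\end{pmatrix}
\begin{pmatrix} a \\ b \\ 1 \end{pmatrix}
\tag{1}
```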
In the affine matrix, s_x denotes the amount of lateral translation and s_y the amount of longitudinal translation; c_1 denotes magnification or reduction along the image abscissa and c_4 along the ordinate; c_2 and c_3 control the shearing (cutting) transformation; (a, b) is the original pixel position and (a′, b′) the transformed position. Finally, padding, cropping and similar transformations restore the original image resolution, yielding the data set (a code sketch of this data-expansion step is given after step 4) below);
3) The image after data expansion and the marked image are used for network training, and the residual U-net network consists of four parts, namely a down-sampling part, a bridge part, an up-sampling part and a classification part;
The training parameters are the image length h, image width w, loss value L, number of network iterations epochs, batch size batch_size, and verification-set proportion rate. The data set is divided into a training set and a verification set according to rate; during training the data are fed into the residual U-net in batches of batch_size, L is computed from the predicted images output by the network and the actual annotated images, and the parameters in the network are adjusted by back-propagation so that the output L tends toward a minimum. Training is repeated until the number of iterations is reached, and the network parameters are tuned with the verification set during the iterations, finally yielding an optimal network model.
4) Road condition classification: the time interval T of the acquisition module is modified, subsequently acquired images are input into the trained deep-learning model, the predicted semantic segmentation images are output, and the different gray levels in the images are returned to the processor, so that the vehicle can identify which categories of objects are present ahead and make the corresponding subsequent reactions.
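A minimal sketch of the data-expansion step 2), assuming OpenCV and NumPy; the parameter ranges and the function name random_affine are illustrative assumptions rather than values taken from the patent:

```python
import cv2
import numpy as np

def random_affine(image, label):
    """Apply one random affine transform to an image and its gray-level annotation,
    then keep the original h x w resolution (borders filled with zeros)."""
    h, w = image.shape[:2]
    c1, c4 = np.random.uniform(0.9, 1.1, 2)      # scaling of abscissa / ordinate
    c2, c3 = np.random.uniform(-0.1, 0.1, 2)     # shearing ("cutting") terms
    sx = np.random.uniform(-0.05, 0.05) * w      # lateral translation
    sy = np.random.uniform(-0.05, 0.05) * h      # longitudinal translation
    M = np.float32([[c1, c2, sx],
                    [c3, c4, sy]])
    # warp while keeping the original resolution; nearest-neighbour preserves label gray levels
    image_t = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_LINEAR, borderValue=0)
    label_t = cv2.warpAffine(label, M, (w, h), flags=cv2.INTER_NEAREST, borderValue=0)
    return image_t, label_t
```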
Further, in step 3), the down-sampling part is divided into four stages, each consisting of one residual error network (the first-level to fourth-level residual error networks). The connection order of the layers in the first-level residual error network is: convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; finally the input image and the processed feature image are fused in the fusion layer through an identity connection. The second-level to fourth-level residual error networks share the same form, with connection order: batch normalization layer, softmax function layer, convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; finally the input feature image and the processed feature image are fused in the fusion layer through an identity connection. The convolution layers are composed of 3 × 3 convolution kernels, and the dimensions of the two convolution layers at each stage are 64, 128, 256 and 512, respectively. Finally, the stages are connected through 2 × 2 pooling layers with stride 2, whose dimension change is the same as that of the convolution layers of each stage.
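A minimal PyTorch sketch of one such pre-activation residual block; the class name ResidualBlock, the channel arguments and the use of Softplus as the activation (the text names softmax/softplus function layers) are illustrative assumptions:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Second- to fourth-level block: BN -> activation -> 3x3 conv -> BN -> activation
    -> 3x3 conv, fused with the input in the "fusion layer" by element-wise addition."""
    def __init__(self, in_ch, out_ch, use_1x1_shortcut=False):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Softplus(),                       # "softmax/softplus function layer" in the text
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.Softplus(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        # identity connection when shapes match; otherwise a 1x1 conv adjusts the dimension
        if use_1x1_shortcut or in_ch != out_ch:
            self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        return self.body(x) + self.shortcut(x)   # fusion layer
```

Under this sketch, a down-sampling stage would be one such block followed by a 2 × 2 pooling layer with stride 2, for example nn.MaxPool2d(2).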
Further, in step 3), the bridge part prepares for splicing the network's high- and low-dimension information. It consists of two batch normalization layers, two softplus function layers and two 3 × 3 convolution layers with dimension 1024; it has no fusion layer, so no identity connection is needed, and the connection order of the layers is the same as in the second-level residual error network. Finally, an up-sampling layer adjusts the feature image to the size required for splicing.
Furthermore, in step 3), the up-sampling part also consists of four stages of residual error networks, the fifth-level to eighth-level residual error networks. Their form and the connection order of the layers are essentially the same as those of the stages of the down-sampling part, except that the identity connections of the fifth-level to seventh-level residual error networks are replaced by 1 × 1 convolution layers; the eighth-level residual error network is unchanged. The dimensions of the convolution layers in the up-sampling residual error networks of the successive stages are 512, 256, 128 and 64, respectively. The stages are connected through up-sampling layers and splicing layers, and the splicing layers splice the high- and low-dimension information of corresponding size. The splicing measures are as follows:
(3.1) the feature image output by the fourth-level residual error network, after passing through the pooling layer, is spliced with the feature image output by the bridge part;
(3.2) the feature image output by the third-level residual error network, after passing through the pooling layer, is spliced with the feature image output by the fifth-level residual error network after passing through the up-sampling layer;
(3.3) the feature image output by the second-level residual error network, after passing through the pooling layer, is spliced with the feature image output by the sixth-level residual error network after passing through the up-sampling layer;
(3.4) the feature image output by the first-level residual error network, after passing through the pooling layer, is spliced with the feature image output by the seventh-level residual error network after passing through the up-sampling layer;
The dimension of the spliced feature images changes, so 1 × 1 convolution layers are used in place of the identity connections to adjust the feature-image dimensions; the dimensions of the four 1 × 1 convolution layers are 512, 256, 128 and 64, respectively, and the feature images are finally fused in the fusion layers.
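A minimal PyTorch sketch of one up-sampling stage, reusing the ResidualBlock sketch above; the nearest-neighbour interpolation mode and the class name UpSpliceStage are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpSpliceStage(nn.Module):
    """Up-sample the deeper feature map, splice (concatenate) it with the skip feature
    map from the down-sampling path, then apply a residual block whose shortcut is a
    1x1 convolution, since concatenation changes the channel dimension."""
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.block = ResidualBlock(deep_ch + skip_ch, out_ch, use_1x1_shortcut=True)

    def forward(self, deep, skip):
        deep = F.interpolate(deep, size=skip.shape[-2:], mode="nearest")  # up-sampling layer
        x = torch.cat([deep, skip], dim=1)                                # splicing layer
        return self.block(x)
```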
In step 3), the classification part consists of a 1 × 1 convolution layer and a softmax layer. Since urban road image segmentation involves six classes (vehicles, pedestrians, bicycles, traffic lights, neon lights and background), the 1 × 1 convolution layer produces feature images with 6 channels; however, the pixel values of these raw feature images are not probability values, so the output is converted into a probability distribution through the softmax layer. The softmax function is shown in formula (2):
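Formula (2) is not reproduced in this text; assuming the standard per-pixel softmax over the K channels, it would read:

```latex
g_k(x) = \frac{\exp\!\big(d_k(x)\big)}{\sum_{k'=1}^{K}\exp\!\big(d_{k'}(x)\big)} \tag{2}
```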
where d_k(x) is the value of pixel x on channel k, K is the number of object classes, and g_k(x) is the probability that pixel x belongs to class k, with g_k(x) ∈ [0, 1]; the channel with the highest probability gives the predicted class;
the deviation of the prediction from the actual is then evaluated using a cross-entropy loss function, see equation (3):
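Formula (3) is likewise not reproduced; assuming the usual pixel-wise cross-entropy between the softmax output g_k(x) and the one-hot annotation (written here as \hat{g}_k(x), an assumed symbol), it would read:

```latex
L = -\sum_{x}\sum_{k=1}^{K} \hat{g}_k(x)\,\log g_k(x)
  = -\sum_{x} \log g_{t(x)}(x) \tag{3}
```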
where t(x) denotes the class to which pixel x corresponds, so g_{t(x)}(x) is the predicted probability of that class, and the corresponding term from the annotated image gives the probability that pixel x belongs to class k; therefore, the smaller the value of the loss function, the closer the predicted image is to the annotated image. Through the backward transfer of the loss function, the internal parameters of the neural network are continuously optimized, so that the loss function keeps decreasing toward an ideal value;
Finally, the number of iterations epochs, the batch size batch_size and the verification-set proportion rate of the network are determined when the model is trained. The acquired image set is divided into a training set and a verification set according to the verification-set proportion, the training images are fed into the network in batches of the batch size until all training images have been input, which completes one iteration, and the model is finally trained repeatedly for the determined number of iterations to obtain the optimal neural network model.
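A minimal PyTorch sketch of the training and verification procedure just described; the optimizer choice (Adam), the device handling and the model-selection criterion are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader, random_split

def train(model, dataset, epochs, batch_size, rate, device="cuda"):
    # split the expanded data set into training and verification sets by `rate`
    n_val = int(len(dataset) * rate)
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)

    criterion = torch.nn.CrossEntropyLoss()           # pixel-wise cross-entropy loss L
    optimizer = torch.optim.Adam(model.parameters())  # optimizer choice is an assumption
    model.to(device)
    best_val, best_state = float("inf"), None

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)   # compare prediction with annotation
            optimizer.zero_grad()
            loss.backward()                           # back-propagate and adjust parameters
            optimizer.step()

        # tune / select the model with the verification set during the iterations
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / max(len(val_loader), 1)
        if val_loss < best_val:
            best_val = val_loss
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    return best_state
```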
The main execution parts of the invention are the acquisition and processing of images, the training of neural networks and the recognition of the images by using a recognition model. The implementation process of the method can be divided into the following three stages:
Firstly, acquiring the image data: the time interval T of the acquisition module is set, images are collected on different urban road sections and input into the detection module to obtain the effective image set; the images are then annotated with the labeling software Labelme 3.11.2, the target objects in each image are framed through the instance-segmentation labeling function and their categories labeled, and the software generates annotated images in which different objects are marked with different gray levels. From the different gray levels of the annotated images, the gray-level list and the number of object categories K are obtained (see the sketch after the third stage below); finally, the images and the annotated images are expanded through the data-expansion module to obtain the data set.
Secondly, network parameters and training: the parameters are the image length h, image width w, loss value L, number of network iterations epochs, batch size batch_size, and verification-set proportion rate. The data set is divided into a training set and a verification set according to rate; during training the data are fed into the residual U-net in batches of batch_size, L is computed from the predicted images output by the network and the actual annotated images, and the parameters in the network are adjusted by back-propagation so that the output L tends toward a minimum. Training is repeated until the number of iterations is reached, and the network parameters are tuned with the verification set during the iterations, finally yielding an optimal network model.
Thirdly, road condition classification: the time interval T of the acquisition module is modified, subsequently acquired images are input into the trained deep-learning model, the predicted semantic segmentation images are output, and the different gray levels in the images are returned to the processor, so that the vehicle can identify which categories of objects are present ahead and make the corresponding subsequent reactions.
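As referenced in the first stage above, a minimal sketch (assuming Pillow and NumPy) of collecting the gray-level list and the category count K from the annotated images; the function name gray_list_and_classes is illustrative:

```python
import numpy as np
from PIL import Image

def gray_list_and_classes(label_paths):
    """Collect the distinct gray levels appearing in the annotated images;
    each gray level corresponds to one object category, so K = len(gray_list)."""
    levels = set()
    for path in label_paths:
        label = np.array(Image.open(path).convert("L"))
        levels.update(int(v) for v in np.unique(label))
    gray_list = sorted(levels)
    return gray_list, len(gray_list)
```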
The invention has the following beneficial effects: 1. the network design comprehensively considers the problems a deep-learning network may encounter during training, namely rapidly vanishing gradients, an overly large required data set and over-fitting; batch normalization, residual error networks and splicing of high- and low-level information are therefore added to the network, which effectively reduces the gradient-vanishing and image-information-loss problems and improves the accuracy of the semantic segmentation; 2. the deep-learning road condition detection system is simple in design, easy to understand, needs only a small data set, runs in real time, and has strong practicability and adaptability.
Drawings
Fig. 1 is a flow of implementation of an urban road scene semantic segmentation system for deep learning.
FIG. 2 is an overall model design of a residual U-net network used in a deep learning urban road scene semantic segmentation system.
FIG. 3 is a network form of second-level to fifth-level residual error networks in a residual error U-net network used by the deep learning urban road scene semantic segmentation system.
FIG. 4 is a diagram showing the semantic segmentation effect of deep learning urban road scenes.
Detailed Description
The method of the present invention is described in further detail below with reference to the accompanying drawings.
Referring to fig. 1 to 4, a deep learning-based urban road scene semantic segmentation method includes the following steps:
1) Image acquisition at the front end of the vehicle: urban road images are collected at a regular time interval T, and images with resolution h × w are input into an image detection module to obtain effective images; the images are then input into a labeling module for annotation. The system uses the publicly available graphical annotation software Labelme 3.11.2: through its scene-segmentation labeling function, objects such as vehicles, pedestrians, bicycles, traffic lights and neon lights in the image are framed and labeled as different categories. The generated annotated image reflects the different object categories through different gray levels, and a gray-level list and the object category count K stored in the image are obtained from these gray levels;
2) Expanding the input data of the annotated images and original images: the images are randomly cropped, spliced, or corrupted with different types of noise, and then transformed through the image affine matrix; the affine transformation is shown in formula (1):
In the affine matrix, s_x denotes the amount of lateral translation and s_y the amount of longitudinal translation; c_1 denotes magnification or reduction along the image abscissa and c_4 along the ordinate; c_2 and c_3 control the shearing (cutting) transformation; (a, b) is the original pixel position and (a′, b′) the transformed position. Finally, padding, cropping and similar transformations keep the original image resolution, yielding the data set;
3) The image after data expansion and the marked image are used for network training, and the residual U-net network consists of four parts, namely a down-sampling part, a bridge part, an up-sampling part and a classification part;
The training parameters are the image length h, image width w, loss value L, number of network iterations epochs, batch size batch_size, and verification-set proportion rate. The data set is divided into a training set and a verification set according to rate; during training the data are fed into the residual U-net in batches of batch_size, L is computed from the predicted images output by the network and the actual annotated images, and the parameters in the network are adjusted by back-propagation so that the output L tends toward a minimum. Training is repeated until the number of iterations is reached, and the network parameters are tuned with the verification set during the iterations, finally yielding an optimal network model.
4) Road condition classification: the time interval T of the acquisition module is modified, subsequently acquired images are input into the trained deep-learning model, the predicted semantic segmentation images are output, and the different gray levels in the images are returned to the processor, so that the vehicle can identify which categories of objects are present ahead and make the corresponding subsequent responses.
Further, in step 3), the down-sampling part is divided into four stages, each consisting of one residual error network (the first-level to fourth-level residual error networks). The connection order of the layers in the first-level residual error network is: convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; finally the input image and the processed feature image are fused in the fusion layer through an identity connection. The second-level to fourth-level residual error networks share the same form, with connection order: batch normalization layer, softmax function layer, convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; finally the input feature image and the processed feature image are fused in the fusion layer through an identity connection. The convolution layers are composed of 3 × 3 convolution kernels, and the dimensions of the two convolution layers at each stage are 64, 128, 256 and 512, respectively. Finally, the stages are connected through 2 × 2 pooling layers with stride 2, whose dimension change is the same as that of the convolution layers of each stage.
The bridge part prepares for splicing the network's high- and low-dimension information and consists of two batch normalization layers, two softplus function layers and two 3 × 3 convolution layers with dimension 1024; it has no fusion layer, so no identity connection is needed, and the connection order of the layers is the same as in the second-level residual error network. Finally, an up-sampling layer adjusts the feature image to the size required for splicing.
The up-sampling part also consists of four stages of residual error networks, the fifth-level to eighth-level residual error networks. Their form and the connection order of the layers are essentially the same as those of the stages of the down-sampling part, except that the identity connections of the fifth-level to seventh-level residual error networks are replaced by 1 × 1 convolution layers; the eighth-level residual error network is unchanged. The dimensions of the convolution layers in the up-sampling residual error networks of the successive stages are 512, 256, 128 and 64, respectively. The stages are connected through up-sampling layers and splicing layers, and the splicing layers splice the high- and low-dimension information of corresponding size. The splicing measures are:
and (3.1) splicing the feature image output by the fourth-level residual error network after passing through the pooling layer with the feature image output by the bridge part.
And (3.2) splicing the characteristic image output by the third-level residual error network after passing through the pooling layer with the characteristic image output by the fifth-level residual error network after passing through the up-sampling layer.
And (3.3) splicing the characteristic image of the second-level residual error network after the output of the second-level residual error network passes through the pooling layer with the characteristic image of the sixth-level residual error network after the output of the sixth-level residual error network passes through the up-sampling layer.
And (3.4) splicing the characteristic image of the first-level residual error network after the output of the first-level residual error network passes through the pooling layer with the characteristic image of the seventh-level residual error network after the output of the seventh-level residual error network passes through the up-sampling layer.
The dimension of the spliced feature images changes, so 1 × 1 convolution layers are used in place of the identity connections to adjust the feature-image dimensions; the dimensions of the four 1 × 1 convolution layers are 512, 256, 128 and 64, respectively, and the feature images are finally fused in the fusion layers.
The classification part consists of a 1 × 1 convolution layer and a softmax layer. Since urban road image segmentation involves six classes (vehicles, pedestrians, bicycles, traffic lights, neon lights and background), the 1 × 1 convolution layer produces feature images with 6 channels; however, the pixel values of these raw feature images are not probability values, so the output is converted into a probability distribution through the softmax layer. The softmax function is shown in formula (2):
where d_k(x) is the value of pixel x on channel k, K is the number of object classes, and g_k(x) is the probability that pixel x belongs to class k, with g_k(x) ∈ [0, 1]. The channel with the highest probability gives the predicted class.
The deviation of the predicted result from the actual is then evaluated using a cross-entropy loss function, see equation (3):
where t(x) denotes the class to which pixel x corresponds, so g_{t(x)}(x) is the predicted probability of that class, and the corresponding term from the annotated image gives the probability that pixel x belongs to class k; therefore, a smaller value of the loss function indicates that the predicted image and the annotated image are closer. Through the backward transfer of the loss function, the internal parameters of the neural network are continuously optimized, so that the loss function keeps decreasing toward an ideal value.
Finally, the number of iterations epochs, the batch size batch_size and the verification-set proportion rate of the network are determined when the model is trained. The acquired image set is divided into a training set and a verification set according to the verification-set proportion, the training images are fed into the network in batches of the batch size until all training images have been input, which completes one iteration, and the model is finally trained repeatedly for the determined number of iterations to obtain the optimal neural network model.
The main execution parts of the embodiment are image acquisition and processing, neural network training and image recognition by using a recognition model. The implementation process of the method can be divided into the following three stages:
Firstly, acquiring the image data: the time interval of the acquisition module is set to T = 4 s, images are collected on different urban road sections and input into the detection module, giving 1000 effective images; the images are then annotated with the labeling software Labelme 3.11.2, the targets in each image are framed through the instance-segmentation labeling function and their categories labeled, and the software generates annotated images in which different target categories are marked with different gray levels. The gray-level list list = [0, 20, 80, 140, 180, 230] gives the pixel values of the different objects, namely background, neon lights, traffic lights, vehicles, pedestrians and bicycles, with total number of categories K = 6; finally, the images and the annotated images are expanded through the data-expansion module to obtain the data set.
Secondly, the network parameters are entered on the parameter-setting interface: image length h = 224, image width w = 224, loss function L, number of network iterations epochs = 30, batch size batch_size = 4, and verification-set proportion rate = 0.1. The 3000-image set is divided into a training set of 2700 images and a verification set of 300 images; during training, 4 images (batch_size) are input into the residual U-net at a time until the whole training set has been used, the loss L is computed from the predicted images output by the network and the actual annotated images, and the parameters in the network are adjusted by back-propagation so that the output L tends toward a minimum, which completes one iteration. The network is trained for 30 iterations, and the network parameters are tuned with the verification set during the iterations; finally a suitable network model is obtained.
Thirdly, the time interval of the acquisition module is modified to T = 0.2 s, the subsequently acquired images are input into the trained deep-learning model, the real-time semantic segmentation result is output, and the different gray levels in the images are returned to the processor, so that the vehicle can identify which categories of objects are present ahead and make the corresponding subsequent responses.
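For illustration only, the training sketch given earlier could be invoked with the embodiment's settings roughly as follows; the model and dataset objects are assumed to be constructed elsewhere:

```python
# hypothetical invocation with the embodiment's parameters
# (h = w = 224, epochs = 30, batch_size = 4, rate = 0.1)
best_state = train(model, dataset, epochs=30, batch_size=4, rate=0.1)
```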
The actual system design form, the network establishment process and the results are shown in fig. 1, fig. 2, fig. 3 and fig. 4, and fig. 1 is a flow of implementation of the deep learning urban road scene semantic segmentation system. FIG. 2 is an overall model design of a residual U-net network used in a deep learning urban road scene semantic segmentation system. FIG. 3 is a network form of second-level to fifth-level residual error networks in a residual error U-net network used by the deep learning urban road scene semantic segmentation system. FIG. 4 is a diagram showing the semantic segmentation effect of deep learning urban road scenes.
The above illustrates the excellent deep learning urban road scene semantic segmentation effect exhibited by one embodiment of the present invention. It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that any modifications made within the spirit and scope of the appended claims are intended to be within the scope of the invention.
Claims (1)
1. A deep learning-based urban road scene semantic segmentation method is characterized by comprising the following steps:
1) Image acquisition at the front end of the vehicle: urban road images are collected at a regular time interval T, and image detection is carried out on the images with resolution h × w to obtain effective images; the obtained effective images are then annotated with the publicly available graphical annotation software Labelme 3.11.2, the objects of vehicles, pedestrians, bicycles, traffic lights and neon lights in the image are framed and marked as different categories through the scene-segmentation labeling function, the generated annotated image reflects the objects of different categories through different gray levels, and a gray-level list and the object category count K stored in the image are obtained from the different gray levels of the annotated image;
2) Expanding the input data of the annotated images and original images: the images are randomly cropped, spliced, or corrupted with different types of noise, and then transformed through the image affine matrix; the affine transformation is shown in formula (1):
In the affine matrix, s_x denotes the amount of lateral translation and s_y the amount of longitudinal translation; c_1 denotes magnification or reduction along the image abscissa and c_4 along the ordinate; c_2 and c_3 control the shearing (cutting) transformation; (a, b) is the original pixel position and (a′, b′) the transformed position; finally, the original image resolution is maintained through transformations such as padding and cropping, yielding the data set;
3) The image after data expansion and the marked image are used for network training, and the residual U-net network consists of four parts, namely a down-sampling part, a bridge part, an up-sampling part and a classification part;
the method comprises the steps of image length h, image width w, loss function size L, network iteration times epochs, batch processing of batch _ size and verification set proportion rate, dividing a data set into a training set and a verification set through the rate, inputting batch _ size into a residual U-net network for training according to the batch _ size during training, calculating L through predicted images output by the network and actual label images, reversely propagating and adjusting parameters in the network to enable the output of the L to tend to be minimized, repeatedly training the network to the iteration times, adjusting network parameters through the verification set in the iteration process, and finally obtaining an optimal network model;
4) Road condition classification: modifying the acquisition time interval T, inputting the subsequently obtained images into a trained deep learning model, outputting predicted semantic segmentation images, transmitting different gray levels in the images back to a processor, and identifying the object types existing in the front position by the vehicle;
in step 3), the down-sampling part is divided into four stages, each consisting of one residual error network, namely the first-level to fourth-level residual error networks; the connection order of the layers in the first-level residual error network is: convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer, and finally the input image and the processed feature image are fused in the fusion layer through an identity connection; the second-level to fourth-level residual error networks have the same form, with connection order: batch normalization layer, softmax function layer, convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer, and finally the input feature image and the processed feature image are fused in the fusion layer through an identity connection; the convolution layers are composed of 3 × 3 convolution kernels, the dimensions of the two convolution layers at each stage are 64, 128, 256 and 512, respectively, and finally the stages are connected through 2 × 2 pooling layers with stride 2, whose dimension change is the same as that of the convolution layers of each stage;
in step 3), the bridge part prepares for splicing the network's high- and low-dimension information and consists of two batch normalization layers, two softplus function layers and two 3 × 3 convolution layers with dimension 1024; it has no fusion layer, so no identity connection is needed, the connection order of the layers is the same as in the second-level residual error network, and finally an up-sampling layer adjusts the feature image to the size required for splicing;
in step 3), the up-sampling part also consists of four stages of residual error networks, namely the fifth-level to eighth-level residual error networks; their form and the connection order of the layers are essentially the same as those of the residual error networks of the down-sampling part, except that the identity connections of the fifth-level to seventh-level residual error networks are replaced by 1 × 1 convolution layers, the eighth-level residual error network being unchanged; the dimensions of the convolution layers in the up-sampling residual error networks of the successive stages are 512, 256, 128 and 64, respectively; the stages are connected through up-sampling layers and splicing layers, and the splicing layers splice the high- and low-dimension information of corresponding size, with the following splicing measures:
(3.1) splicing the feature image output by the fourth-level residual error network after passing through the pooling layer with the feature image output by the bridge part;
(3.2) splicing the characteristic image of the output of the third-level residual error network after passing through the pooling layer with the characteristic image of the output of the fifth-level residual error network after passing through the upper sampling layer;
(3.3) splicing the characteristic image of the second-level residual error network after the output of the second-level residual error network passes through the pooling layer with the characteristic image of the sixth-level residual error network after the output of the sixth-level residual error network passes through the up-sampling layer;
(3.4) splicing the characteristic image of the first-level residual error network after the output of the first-level residual error network passes through the pooling layer with the characteristic image of the seventh-level residual error network after the output of the seventh-level residual error network passes through the up-sampling layer;
the dimension of the spliced feature images changes, so 1 × 1 convolution layers are used in place of the identity connections to adjust the feature-image dimensions; the dimensions of the four 1 × 1 convolution layers are 512, 256, 128 and 64, respectively, and the feature images are finally fused in the fusion layers;
in the step 3), the classification part is composed of a 1 × 1 convolution layer and a softmax layer, since the urban road image segmentation relates to six classes of vehicles, pedestrians, bicycles, traffic lights, neon lights and backgrounds, the feature images of 6 channels are obtained through the 1 × 1 convolution layer, but the pixel values of the original feature images are not probability values, so that the output is converted into probability distribution through the softmax layer, and the softmax function is shown in formula (2):
where d_k(x) is the value of pixel x on channel k, K is the number of object classes, and g_k(x) is the probability that pixel x belongs to class k, with g_k(x) ∈ [0, 1]; the channel with the highest probability gives the predicted class;
the deviation of the prediction from the actual is then evaluated using a cross-entropy loss function, see equation (3):
where t(x) denotes the class to which pixel x corresponds, so g_{t(x)}(x) is the predicted probability of that class, and the corresponding term from the annotated image gives the probability that pixel x belongs to class k; therefore, the smaller the value of the loss function, the closer the predicted image and the annotated image are; through the backward transfer of the loss function, the internal parameters of the neural network are continuously optimized, so that the loss function keeps decreasing toward an ideal value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010156966.XA CN111598095B (en) | 2020-03-09 | 2020-03-09 | Urban road scene semantic segmentation method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010156966.XA CN111598095B (en) | 2020-03-09 | 2020-03-09 | Urban road scene semantic segmentation method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111598095A CN111598095A (en) | 2020-08-28 |
CN111598095B true CN111598095B (en) | 2023-04-07 |
Family
ID=72181296
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010156966.XA Active CN111598095B (en) | 2020-03-09 | 2020-03-09 | Urban road scene semantic segmentation method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111598095B (en) |
Families Citing this family (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10678244B2 (en) | 2017-03-23 | 2020-06-09 | Tesla, Inc. | Data synthesis for autonomous control systems |
US11157441B2 (en) | 2017-07-24 | 2021-10-26 | Tesla, Inc. | Computational array microprocessor system using non-consecutive data formatting |
US11409692B2 (en) | 2017-07-24 | 2022-08-09 | Tesla, Inc. | Vector computational unit |
US10671349B2 (en) | 2017-07-24 | 2020-06-02 | Tesla, Inc. | Accelerated mathematical engine |
US11893393B2 (en) | 2017-07-24 | 2024-02-06 | Tesla, Inc. | Computational array microprocessor system with hardware arbiter managing memory requests |
US11561791B2 (en) | 2018-02-01 | 2023-01-24 | Tesla, Inc. | Vector computational unit receiving data elements in parallel from a last row of a computational array |
US11215999B2 (en) | 2018-06-20 | 2022-01-04 | Tesla, Inc. | Data pipeline and deep learning system for autonomous driving |
US11361457B2 (en) | 2018-07-20 | 2022-06-14 | Tesla, Inc. | Annotation cross-labeling for autonomous control systems |
US11636333B2 (en) | 2018-07-26 | 2023-04-25 | Tesla, Inc. | Optimizing neural network structures for embedded systems |
US11562231B2 (en) | 2018-09-03 | 2023-01-24 | Tesla, Inc. | Neural networks for embedded devices |
CA3115784A1 (en) | 2018-10-11 | 2020-04-16 | Matthew John COOPER | Systems and methods for training machine models with augmented data |
US11196678B2 (en) | 2018-10-25 | 2021-12-07 | Tesla, Inc. | QOS manager for system on a chip communications |
US11816585B2 (en) | 2018-12-03 | 2023-11-14 | Tesla, Inc. | Machine learning models operating at different frequencies for autonomous vehicles |
US11537811B2 (en) | 2018-12-04 | 2022-12-27 | Tesla, Inc. | Enhanced object detection for autonomous vehicles based on field view |
US11610117B2 (en) | 2018-12-27 | 2023-03-21 | Tesla, Inc. | System and method for adapting a neural network model on a hardware platform |
US10997461B2 (en) | 2019-02-01 | 2021-05-04 | Tesla, Inc. | Generating ground truth for machine learning from time series elements |
US11150664B2 (en) | 2019-02-01 | 2021-10-19 | Tesla, Inc. | Predicting three-dimensional features for autonomous driving |
US11567514B2 (en) | 2019-02-11 | 2023-01-31 | Tesla, Inc. | Autonomous and user controlled vehicle summon to a target |
US10956755B2 (en) | 2019-02-19 | 2021-03-23 | Tesla, Inc. | Estimating object properties using visual image data |
CN112070049B (en) * | 2020-09-16 | 2022-08-09 | 福州大学 | Semantic segmentation method under automatic driving scene based on BiSeNet |
CN112348839B (en) * | 2020-10-27 | 2024-03-15 | 重庆大学 | Image segmentation method and system based on deep learning |
CN112329780B (en) * | 2020-11-04 | 2023-10-27 | 杭州师范大学 | Depth image semantic segmentation method based on deep learning |
CN112767361B (en) * | 2021-01-22 | 2024-04-09 | 重庆邮电大学 | Reflected light ferrograph image segmentation method based on lightweight residual U-net |
CN112819688A (en) * | 2021-02-01 | 2021-05-18 | 西安研硕信息技术有限公司 | Conversion method and system for converting SAR (synthetic aperture radar) image into optical image |
CN113076837A (en) * | 2021-03-25 | 2021-07-06 | 高新兴科技集团股份有限公司 | Convolutional neural network training method based on network image |
CN113034598B (en) * | 2021-04-13 | 2023-08-22 | 中国计量大学 | Unmanned aerial vehicle power line inspection method based on deep learning |
CN112949617B (en) * | 2021-05-14 | 2021-08-06 | 江西农业大学 | Rural road type identification method, system, terminal equipment and readable storage medium |
CN113378845A (en) * | 2021-05-28 | 2021-09-10 | 上海商汤智能科技有限公司 | Scene segmentation method, device, equipment and storage medium |
CN113468963A (en) * | 2021-05-31 | 2021-10-01 | 山东信通电子股份有限公司 | Road raise dust identification method and equipment |
CN113269276B (en) * | 2021-06-28 | 2024-10-01 | 陕西谦亿智能科技有限责任公司 | Image recognition method, device, equipment and storage medium |
CN113657174A (en) * | 2021-07-21 | 2021-11-16 | 北京中科慧眼科技有限公司 | Vehicle pseudo-3D information detection method and device and automatic driving system |
CN113569774B (en) * | 2021-08-02 | 2022-04-08 | 清华大学 | Semantic segmentation method and system based on continuous learning |
CN113705498B (en) * | 2021-09-02 | 2022-05-27 | 山东省人工智能研究院 | Wheel slip state prediction method based on distribution propagation diagram network |
CN113689436B (en) * | 2021-09-29 | 2024-02-02 | 平安科技(深圳)有限公司 | Image semantic segmentation method, device, equipment and storage medium |
CN113808128B (en) * | 2021-10-14 | 2023-07-28 | 河北工业大学 | Intelligent compaction whole process visualization control method based on relative coordinate positioning algorithm |
CN114495236B (en) * | 2022-02-11 | 2023-02-28 | 北京百度网讯科技有限公司 | Image segmentation method, apparatus, device, medium, and program product |
CN118172787B (en) * | 2024-05-09 | 2024-07-30 | 南昌航空大学 | Lightweight document layout analysis method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108062756A (en) * | 2018-01-29 | 2018-05-22 | 重庆理工大学 | Image, semantic dividing method based on the full convolutional network of depth and condition random field |
CN109145983A (en) * | 2018-08-21 | 2019-01-04 | 电子科技大学 | A kind of real-time scene image, semantic dividing method based on lightweight network |
CN110111335A (en) * | 2019-05-08 | 2019-08-09 | 南昌航空大学 | A kind of the urban transportation Scene Semantics dividing method and system of adaptive confrontation study |
CN110147794A (en) * | 2019-05-21 | 2019-08-20 | 东北大学 | A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning |
-
2020
- 2020-03-09 CN CN202010156966.XA patent/CN111598095B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108062756A (en) * | 2018-01-29 | 2018-05-22 | 重庆理工大学 | Image, semantic dividing method based on the full convolutional network of depth and condition random field |
CN109145983A (en) * | 2018-08-21 | 2019-01-04 | 电子科技大学 | A kind of real-time scene image, semantic dividing method based on lightweight network |
CN110111335A (en) * | 2019-05-08 | 2019-08-09 | 南昌航空大学 | A kind of the urban transportation Scene Semantics dividing method and system of adaptive confrontation study |
CN110147794A (en) * | 2019-05-21 | 2019-08-20 | 东北大学 | A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning |
Non-Patent Citations (4)
Title |
---|
Research on semantic segmentation methods for traffic scenes based on convolutional neural networks; Li Linhui et al.; Journal on Communications; 2018-04-25 (No. 04); full text *
Image semantic segmentation based on multi-scale feature extraction; Xiong Zhiyong et al.; Journal of South-Central University for Nationalities (Natural Science Edition); 2017-09-15 (No. 03); full text *
Scene semantic segmentation network based on color-depth images and deep learning; Dai Juting et al.; Science Technology and Engineering; 2018-07-18 (No. 20); full text *
Semantic segmentation of newly added buildings in remote sensing images based on deep learning; Chen Yiming et al.; Computer and Digital Engineering; 2019-12-20 (No. 12); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111598095A (en) | 2020-08-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111598095B (en) | Urban road scene semantic segmentation method based on deep learning | |
CN113506300B (en) | Picture semantic segmentation method and system based on rainy day complex road scene | |
Serna et al. | Classification of traffic signs: The european dataset | |
Alghmgham et al. | Autonomous traffic sign (ATSR) detection and recognition using deep CNN | |
CN109886066A (en) | Fast target detection method based on the fusion of multiple dimensioned and multilayer feature | |
CN112990065B (en) | Vehicle classification detection method based on optimized YOLOv5 model | |
CN110781850A (en) | Semantic segmentation system and method for road recognition, and computer storage medium | |
CN113902915A (en) | Semantic segmentation method and system based on low-illumination complex road scene | |
CN110263786A (en) | A kind of road multi-targets recognition system and method based on characteristic dimension fusion | |
CN111860269A (en) | Multi-feature fusion tandem RNN structure and pedestrian prediction method | |
CN110991523A (en) | Interpretability evaluation method for unmanned vehicle detection algorithm performance | |
CN114495029A (en) | Traffic target detection method and system based on improved YOLOv4 | |
Al Mamun et al. | Lane marking detection using simple encode decode deep learning technique: SegNet | |
CN114913498A (en) | Parallel multi-scale feature aggregation lane line detection method based on key point estimation | |
Gupta et al. | Image-based road pothole detection using deep learning model | |
Naik et al. | Implementation of YOLOv4 algorithm for multiple object detection in image and video dataset using deep learning and artificial intelligence for urban traffic video surveillance application | |
CN114973199A (en) | Rail transit train obstacle detection method based on convolutional neural network | |
CN116630702A (en) | Pavement adhesion coefficient prediction method based on semantic segmentation network | |
CN114612883A (en) | Forward vehicle distance detection method based on cascade SSD and monocular depth estimation | |
Bailke et al. | Traffic sign classification using CNN | |
Yildiz et al. | Hybrid image improving and CNN (HIICNN) stacking ensemble method for traffic sign recognition | |
Kadav et al. | Development of Computer Vision Models for Drivable Region Detection in Snow Occluded Lane Lines | |
Dong et al. | Intelligent pixel-level pavement marking detection using 2D laser pavement images | |
Vellaidurai et al. | A novel oyolov5 model for vehicle detection and classification in adverse weather conditions | |
CN114495050A (en) | Multitask integrated detection method for automatic driving forward vision detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |