Optimized 3D Street Scene Reconstruction from Driving Recorder Images
"> Figure 1
<p>Driving recorder and recorded image. (<b>a</b>) Photo of one type of driving recorder obtained from the Internet. (<b>b</b>) Test data recorded by the driving recorder in this paper.</p> "> Figure 2
<p>The pipeline of 3D reconstruction from driving recorder data. The grey frames show the typical SfM process. The two orange frames are the main improvement steps proposed in this paper.</p> "> Figure 3
<p>Example of samples and classifiers.</p> "> Figure 4
<p>Photographic model of driving recorder. (<b>a</b>) Integrated photographic model of driving recorder. (<b>b</b>) Side view of model. (<b>c</b>) Partial enlargement of side view model. The oblique image plane is the driving recorder image plane. Point O is the projective center, and <math display="inline"> <semantics> <mrow> <mtext>O″</mtext> </mrow> </semantics> </math> is the principal point on the driving recorder image plane. The focal length f is <math display="inline"> <semantics> <mrow> <mtext>OO″</mtext> </mrow> </semantics> </math>. Point <math display="inline"> <semantics> <mrow> <mtext>O′</mtext> </mrow> </semantics> </math> is the principal point on the virtual vertical image plane. Line OE is perpendicular to the ground. Point E is the intersection point of the ground and line OE. The plane <math display="inline"> <semantics> <mrow> <mtext>M′O′</mtext> </mrow> </semantics> </math>OOEF can be drawn perpendicular to both the image plane and the ground. <math display="inline"> <semantics> <mrow> <mtext>OM′</mtext> </mrow> </semantics> </math> is perpendicular to <math display="inline"> <semantics> <mrow> <mtext>M′J′</mtext> </mrow> </semantics> </math>, and OM is perpendicular to MJ. Line LN is perpendicular to OE. MP is a vertical line for the ground, and P is the intersection point of line MP and line ON. Line <math display="inline"> <semantics> <mrow> <mtext>M″</mtext> </mrow> </semantics> </math>T is perpendicular to <math display="inline"> <semantics> <mrow> <mtext>O′O</mtext> </mrow> </semantics> </math>. The angle between the oblique plane and the vertical plane is <math display="inline"> <semantics> <mtext>θ</mtext> </semantics> </math>. Angles MON and <math display="inline"> <semantics> <mrow> <mo> </mo> <mtext>O″ON″</mtext> </mrow> </semantics> </math> are <math display="inline"> <semantics> <mtext>α</mtext> </semantics> </math> and <math display="inline"> <semantics> <mtext>β</mtext> </semantics> </math>, respectively.</p> "> Figure 5
<p>(<b>a</b>) and (<b>b</b>) depictions of the box marking drawing method. (<b>c</b>) Example of box marking in an image. The principal point <math display="inline"> <semantics> <mtext>O″</mtext> </semantics> </math> is the center point of the image, and the black rectangle KJAB is the vehicle back surface in the image plane, which are detected by the classifier described in <a href="#sec2dot1-remotesensing-07-09091" class="html-sec">Section 2.1</a>. Point V is the vanishing point in the image. Line l is the perpendicular bisector of the image passing through principal point <math display="inline"> <semantics> <mtext>O″</mtext> </semantics> </math>. Line K<math display="inline"> <semantics> <mtext>M″</mtext> </semantics> </math> is parallel to the x axis of the image and <math display="inline"> <semantics> <mtext>M″</mtext> </semantics> </math> is the intersection point on l. Line <math display="inline"> <semantics> <mtext>N″</mtext> </semantics> </math>Q intersects lines VK and VJ at points Q and C, respectively. <math display="inline"> <semantics> <mtext>N″</mtext> </semantics> </math>Q is parallel with <math display="inline"> <semantics> <mtext>M″</mtext> </semantics> </math>K. Line QD intersects line VA at point D, and line DE intersects line VB at point E. Line QC and DE are parallel to the x axis and QD is parallel to the y axis of the image.</p> "> Figure 6
<p>Guardrails detection process. (<b>a</b>) Detection results of a specially-designed guardrail-classifier which could detect thousands of results, including not only correct guardrails but many wrong detection regions as well. (<b>b</b>) Example of how to draw the red lines from the vanishing point to the detection regions. (<b>c</b>) Results of red lines drawn from the vanishing point to each centre line of the rectangle regions at an interval of <math display="inline"> <semantics> <mrow> <mn>2</mn> <mo>°</mo> </mrow> </semantics> </math>. An example of a rectangle region’s centre line is shown in the bottom left corner of the (<b>c</b>); and (<b>d</b>) is an example of an intersection angle between the top and bottom edges of the guardrail.</p> "> Figure 7
<p>Guardrail location method. (<b>a</b>) Example of four triangle regions which included the angle of <math display="inline"> <semantics> <mrow> <mn>15</mn> <mo>°</mo> </mrow> </semantics> </math> and the fixed vertex (the vanishing point). (<b>b</b>) Triangle region that had the largest line numbers. (<b>c</b>) Final detection results of guardrail location method.</p> "> Figure 8
<p>Blocked vehicles detection method (guardrail region broadening method). Two vehicles running in opposite directions are missed detection by the vehicle classifier, which are indicated by the yellow arrows. These missed detection vehicle regions are included in the broadened guardrail regions, which are shown as the red region.</p> "> Figure 9
<p>SIFT feature points removing results. (<b>a</b>) Original SIFT feature points set on image. (<b>b</b>) Mask results, which show the masked out features on the vehicle and guardrail regions.</p> "> Figure 10
<p>RMSEs of each image. The X-axis represents the serial number of the image pairs and the Y-axis represents the RMSEs, which are shown as millimeters. The blue and red lines show the RMSEs of the typical method and our method, respectively. The correspondences in our method were matched after removing the SIFT features on the Mask, and then the outliers were eliminated by the epipolar constraint (QDEGSAC) method. In the typical method, the correspondences were filtered only by the epipolar constraint (QDEGSAC) method.</p> "> Figure 11
<p>Explanation of the reconstructed camera-pose-triangle and driving tracks. (<b>a</b>) Colored triangle represents the position of the recovered image and the camera projective center. The size of the triangle is followed by the size of the image data. (<b>b</b>) Red line represents the recovered vehicle driving tracks that carried recorder 1. The colored triangles are the reconstructed results that represent the position of the images taken by recorder 1.</p> "> Figure 12
<p>The recovered image positions of Set 1. These images were taken by recorders 4 and 5, which had the same exposure interval and were mounted on one vehicle. (<b>a</b>) and (<b>b</b>) are the recovered results from the same data with different methods. (a) depicts the reconstruction by the typical SfM method. The recovered images in the red rectangle of (a) are unordered obviously. (b) depicts the reconstruction by our method (features on vehicles and guardrails were masked out before matching and reconstruction). (<b>c</b>) is not a georeferenced result. We manually scaled the results of (b) and put it on the Google satellite map to help readers visualize the rough locations of the image sequences on roundabout.</p> "> Figure 13
<p>The recovered image positions of Set 2. These images were taken by recorders 1, 2, 3, and 4 mounted on their respective vehicles. (<b>a</b>) Reconstruction by the typical SfM method. The recovered disordered images in the red rectangles of (a) were recorded by recorder 4. (<b>b</b>) is not a georeferenced result. We manually scaled the results of (a) and put it on the Google satellite map. Based on the enlargement in (a) and the visualized rough location in (b), it can be seen that they were reconstructed in the wrong place. (<b>c</b>) Reconstruction by our method (features on vehicles and guardrails were masked out before matching and reconstruction). The recovered triangles of recorder 4 are smaller than the others because the sizes of the images taken by recorder 4 were smaller than those of the other recorders, which is reflected in (c) by the different reconstructed sizes of the triangles. (a) and (c) are the recovered results from the same data using different methods.</p> "> Figure 14
<p>The recovered image positions of Set 3. These images were taken by recorders 1–5. (<b>a</b>) and (<b>b</b>) are the recovered results from the same data with different methods; (a) was reconstructed by the typical SfM method and (b) was reconstructed by our method (features on vehicles and guardrails were masked out before matching and reconstruction). The images in red rectangles in (a) were recovered in chaos. (<b>c</b>) is not a georeferenced result. We manually scaled the recovery results of our method and put it on the Google satellite map to help readers visualize the rough locations of the image sequences on roundabout.</p> "> Figure 15
<p>Main targets in the sparse point clouds reconstruction process. The two building models in (<b>a</b>) and (<b>b</b>) with red and yellow marks are the main reconstruction targets. (a) and (b) are the side and oblique bird’s-eye view of two buildings from Google Earth, respectively.</p> "> Figure 15 Cont.
<p>Main targets in the sparse point clouds reconstruction process. The two building models in (<b>a</b>) and (<b>b</b>) with red and yellow marks are the main reconstruction targets. (a) and (b) are the side and oblique bird’s-eye view of two buildings from Google Earth, respectively.</p> "> Figure 16
<p>Side view of main target reconstruction results with sparse point clouds. Each result was reconstructed with the same data of 311 images in Set 3. (<b>a</b>) Sparse point clouds reconstructed by Photosynth without any added processing. The building on the left marked in red was repetitively reconstructed. (<b>b</b>) Sparse point clouds reconstructed by VisualSFM with the typical method. The building on the right could not be reconstructed and should be positioned inside the yellow box. (<b>c</b>) Sparse point clouds reconstructed by VisualSFM with our method. The details of the differences between the typical method and our method are described in <a href="#sec2dot5-remotesensing-07-09091" class="html-sec">Section 2.5</a> but can be summarized by saying that our method removed the features on the Mask and matched the remaining feature points before reconstruction.</p> "> Figure 16 Cont.
<p>Side view of main target reconstruction results with sparse point clouds. Each result was reconstructed with the same data of 311 images in Set 3. (<b>a</b>) Sparse point clouds reconstructed by Photosynth without any added processing. The building on the left marked in red was repetitively reconstructed. (<b>b</b>) Sparse point clouds reconstructed by VisualSFM with the typical method. The building on the right could not be reconstructed and should be positioned inside the yellow box. (<b>c</b>) Sparse point clouds reconstructed by VisualSFM with our method. The details of the differences between the typical method and our method are described in <a href="#sec2dot5-remotesensing-07-09091" class="html-sec">Section 2.5</a> but can be summarized by saying that our method removed the features on the Mask and matched the remaining feature points before reconstruction.</p> "> Figure 17
<p>Vertical view of main target reconstruction results with sparse point clouds. Each result was reconstructed by same data of 311 images in Set 3. (<b>a</b>) Sparse point clouds reconstructed by Photosynth without any added processing. The result is chaos. Expecting the repetition we experienced as shown in <a href="#remotesensing-07-09091-f016" class="html-fig">Figure 16</a>, it can be clearly seen that not only was the left building repeatedly reconstructed, but the right building was as well. The repetitive reconstructions of the buildings are marked in red for the left building and the right building is in yellow. (<b>b</b>) Sparse point clouds reconstructed by VisualSFM with the typical method. The right building was missed which should be reconstructed inside the yellow mark. (<b>c</b>) Sparse point clouds reconstructed by VisualSFM with our method. The details between the typical method and our method are described in <a href="#sec2dot5-remotesensing-07-09091" class="html-sec">Section 2.5</a>, which can be summarized by saying that we removed the features on the Mask and matched the remaining feature points before reconstruction. (<b>d</b>) shows a more intuitive result. It is not a georeferenced result. We manually scaled the sparse point clouds of our method and put it on the Google satellite map, which can help readers visualize the high level of overlapping between the point clouds and the map, the rough relative positions of the two buildings, and the position of recovered images in roundabout.</p> "> Figure 17 Cont.
<p>Vertical view of main target reconstruction results with sparse point clouds. Each result was reconstructed by same data of 311 images in Set 3. (<b>a</b>) Sparse point clouds reconstructed by Photosynth without any added processing. The result is chaos. Expecting the repetition we experienced as shown in <a href="#remotesensing-07-09091-f016" class="html-fig">Figure 16</a>, it can be clearly seen that not only was the left building repeatedly reconstructed, but the right building was as well. The repetitive reconstructions of the buildings are marked in red for the left building and the right building is in yellow. (<b>b</b>) Sparse point clouds reconstructed by VisualSFM with the typical method. The right building was missed which should be reconstructed inside the yellow mark. (<b>c</b>) Sparse point clouds reconstructed by VisualSFM with our method. The details between the typical method and our method are described in <a href="#sec2dot5-remotesensing-07-09091" class="html-sec">Section 2.5</a>, which can be summarized by saying that we removed the features on the Mask and matched the remaining feature points before reconstruction. (<b>d</b>) shows a more intuitive result. It is not a georeferenced result. We manually scaled the sparse point clouds of our method and put it on the Google satellite map, which can help readers visualize the high level of overlapping between the point clouds and the map, the rough relative positions of the two buildings, and the position of recovered images in roundabout.</p> "> Figure 18
<p>The vertical view of the two planes. (<b>a</b>) shows the sparse point clouds reconstructed by VisualSFM with our method. The Plane 1 and 2 are target planes we fitted. (<b>b</b>) shows the position of the target wall-planes in Google Map. (<b>c</b>) shows the Plane 1 and 2 in street view. (<b>d</b>) Example of plane fitted result in vertical view. The red line respects the vertical view of the plane fitted by wall points, and the blue lines are examples of the distances between the plane and points.</p> ">
Abstract
1. Introduction
2. Methodology
- Both the cameras and the objects are in motion, which changes the relative pose of the objects. Moreover, the appearance of vehicles varies significantly (e.g., color, size, and difference between back/front appearances).
- The environment of the scene (e.g., illumination and background) often changes, and events such as occlusions are common.
- Guardrails appear as long, thin strips in the images, which makes detecting their whole regions difficult.
2.1. Vehicle Front/Back Surfaces Detection
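The Discussion notes that the vehicle detectors are trained Haar-like-feature cascade classifiers in the spirit of Viola and Jones and of Lienhart and Maydt (see the References). As a rough sketch of that detection step, assuming OpenCV's `cv2.CascadeClassifier` and a hypothetical cascade file `vehicle_back_cascade.xml` rather than the classifier actually trained in this paper:

```python
import cv2

# Hypothetical cascade file: a Haar-like-feature cascade trained on
# vehicle back-surface samples (cf. the samples/classifiers in Figure 3).
back_cascade = cv2.CascadeClassifier("vehicle_back_cascade.xml")

def detect_vehicle_backs(image_bgr):
    """Return candidate back-surface rectangles as (x, y, w, h) tuples."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)      # reduce the illumination variation noted above
    rects = back_cascade.detectMultiScale(
        gray,
        scaleFactor=1.1,               # image-pyramid step between detection scales
        minNeighbors=3,                # overlapping hits required to keep a box
        minSize=(24, 24),              # ignore implausibly small candidates
    )
    return list(rects)

# Usage (hypothetical frame name):
# img = cv2.imread("frame_0001.jpg")
# for (x, y, w, h) in detect_vehicle_backs(img):
#     cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)
```

The scale factor, neighbour count, and minimum size are illustrative defaults, not values reported in the paper.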
2.2. Vehicle Side-Surface Detection
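Figure 5 describes a geometric construction that extends the detected back-surface rectangle toward the vanishing point to delimit the vehicle side surface. The function below is only a crude stand-in for that construction, assuming axis-aligned rectangles from the classifier; the `depth_fraction` parameter and all names are hypothetical.

```python
import numpy as np

def toward_vanishing_point(p, vp, t):
    """Move point p a fraction t of the way toward the vanishing point vp."""
    p, vp = np.asarray(p, dtype=float), np.asarray(vp, dtype=float)
    return (1.0 - t) * p + t * vp

def side_surface_quad(back_rect, vp, depth_fraction=0.15):
    """Rough side-surface quadrilateral for a detected back rectangle (x, y, w, h).

    The edge of the rectangle facing the vanishing point is slid toward the
    vanishing point, because the vehicle body extends in that direction.
    """
    x, y, w, h = back_rect
    vx, _ = vp
    edge_x = x + w if vx > x + w / 2.0 else x      # edge facing the vanishing point
    near_top = np.array([edge_x, y], dtype=float)
    near_bot = np.array([edge_x, y + h], dtype=float)
    far_top = toward_vanishing_point(near_top, vp, depth_fraction)
    far_bot = toward_vanishing_point(near_bot, vp, depth_fraction)
    return np.array([near_top, far_top, far_bot, near_bot])
```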
2.3. Guardrail Detection
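Figures 6 and 7 outline the guardrail location idea: rays are traced from the vanishing point through the candidate detections, and a triangular sector with a 15° included angle anchored at the vanishing point is swept in 2° steps; the sector containing the most rays marks the guardrail. A minimal sketch of that angular voting, with hypothetical names and without the paper's exact bookkeeping:

```python
import numpy as np

def locate_guardrail_sector(candidate_rects, vp, sector_deg=15.0, step_deg=2.0):
    """Return (start_angle, end_angle, votes) of the sector with the most candidates.

    candidate_rects: (x, y, w, h) boxes from the guardrail classifier.
    vp: (x, y) vanishing point in image coordinates.
    """
    vx, vy = vp
    angles = np.array([
        np.degrees(np.arctan2((y + h / 2.0) - vy, (x + w / 2.0) - vx))
        for (x, y, w, h) in candidate_rects
    ])

    best = (None, None, -1)
    # Sweep the sector start angle; wrap-around at +/-180 degrees is ignored here.
    for start in np.arange(-180.0, 180.0, step_deg):
        votes = int(np.sum((angles >= start) & (angles < start + sector_deg)))
        if votes > best[2]:
            best = (start, start + sector_deg, votes)
    return best
```

The surviving candidates inside the winning sector would then be merged into the final guardrail region shown in Figure 7c.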
2.4. Blocked Vehicle Regions Detection
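Figure 8 shows that the located guardrail region is broadened so that oncoming vehicles partly hidden behind the guardrail, and therefore missed by the vehicle classifier, still fall inside the mask. A toy version of such a broadening step, with a hypothetical scale factor, might look like this:

```python
def broaden_guardrail_region(guardrail_rect, image_shape, scale=1.5):
    """Enlarge a guardrail box (x, y, w, h), growing mainly upward where the
    tops of partly blocked vehicles appear; the result is clipped to the image."""
    x, y, w, h = guardrail_rect
    img_h, img_w = image_shape[:2]
    new_h = int(h * scale)
    new_y = max(0, y - (new_h - h))
    new_w = min(int(w * scale), img_w - x)
    return (x, new_y, new_w, min(new_h, img_h - new_y))
```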
2.5. Mask and Structure from Motion
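The key change relative to typical SfM is that SIFT features falling on the Mask (vehicle and broadened guardrail regions) are discarded before matching (Figure 9). The sketch below assumes OpenCV's SIFT implementation (`cv2.SIFT_create`, available in OpenCV 4.4+) and a plain ratio-test matcher; the paper's subsequent epipolar filtering uses QDEGSAC, which OpenCV does not provide, so it is not reproduced here.

```python
import cv2
import numpy as np

def masked_sift(image_bgr, mask_rects):
    """Detect SIFT features only outside the Mask regions.

    mask_rects: (x, y, w, h) boxes for vehicles and broadened guardrails.
    Returns (keypoints, descriptors).
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    mask = np.full(gray.shape, 255, dtype=np.uint8)   # 255 = detect here
    for (x, y, w, h) in mask_rects:
        mask[y:y + h, x:x + w] = 0                    # 0 = suppress features
    sift = cv2.SIFT_create()
    return sift.detectAndCompute(gray, mask)

def match_pair(desc1, desc2, ratio=0.8):
    """Ratio-test matching of the remaining descriptors of two images."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(desc1, desc2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return good
```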
3. Experiment
3.1. Test Data and Platform
- The testing images were taken by five recorders mounted on four vehicles, and the largest time interval between the two image sequences was nearly three years.
- A total of 125 images were extracted from videos recorded by driving recorders 1, 2, and 3.
- 186 images were recorded by recorders 4 and 5, which were mounted on the same vehicle with identical exposure intervals.
- The roundabout was crowded during the recording periods, so the survey vehicles changed lanes and speeds when necessary to move with the traffic.
- The remaining details of the recorders and images are shown in Table 1.
Recorder No. | Sensor Type | Focus Style | Image Size | Image Extraction Interval | Recording Date |
---|---|---|---|---|---|
1, 2, 3 | Video | Zoom Lens | 1920 × 1080 | About 1 s | 12/23/2014 |
4, 5 | Camera | Fixed Focus | 800 × 600 | 0.5 s | 1/23/2012 |
Set No. | Recorders | Number of Images | Attribute |
---|---|---|---|
1 | 4, 5 | 186 | Stereo images taken by two cameras mounted on the same vehicle with identical exposure intervals. |
2 | 1, 2, 3, 4 | 218 | The longest time interval between the two image sequences was nearly three years, and the images were two different sizes. |
3 | 1, 2, 3, 4, 5 | 311 | Three monocular and two stereo image sequences. The longest time interval between the two image sequences was nearly three years, and the images were two different sizes. |
3.2. Precision of Pairwise Orientation
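Figure 10 reports per-pair RMSEs after the correspondences have been filtered by the epipolar constraint. The paper's values are relative-orientation residuals in millimeters; the sketch below is only a loose pixel-domain analogue that substitutes OpenCV's RANSAC fundamental-matrix estimation for QDEGSAC and reports the RMS Sampson distance of the inliers.

```python
import cv2
import numpy as np

def pairwise_epipolar_rmse(pts1, pts2):
    """RMS Sampson distance (pixels) of RANSAC inliers for one image pair.

    pts1, pts2: Nx2 float arrays of corresponding image points.
    """
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    keep = inlier_mask.ravel().astype(bool)
    x1 = cv2.convertPointsToHomogeneous(pts1[keep]).reshape(-1, 3)
    x2 = cv2.convertPointsToHomogeneous(pts2[keep]).reshape(-1, 3)
    Fx1 = x1 @ F.T                    # epipolar lines of x1 in image 2
    Ftx2 = x2 @ F                     # epipolar lines of x2 in image 1
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    sampson_sq = num / den            # squared first-order geometric error
    return float(np.sqrt(np.mean(sampson_sq)))
```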
3.3. Camera Pose Recovery Results
3.4. Sparse 3D Point Cloud Reconstruction Results
Plane No. | Typical Method RMSE | Typical Method Maximum | Typical Method Minimum | Proposed Method RMSE | Proposed Method Maximum | Proposed Method Minimum |
---|---|---|---|---|---|---|
1 | 0.0047 | 0.0170 | | 0.0031 | 0.0142 | |
2 | 0.0171 | 0.0994 | | 0.0095 | 0.0705 | |
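The residuals in the table above come from fitting planes to the wall points of the two reconstructed buildings and measuring point-to-plane distances (Figure 18d). A minimal sketch of such an evaluation, assuming a least-squares plane fit via SVD (the paper does not state its exact fitting procedure); the units follow the scale of the reconstructed model:

```python
import numpy as np

def fit_plane_svd(points):
    """Least-squares plane through Nx3 points: returns (centroid, unit normal)."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The plane normal is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(pts - centroid)
    return centroid, vt[-1]

def plane_residual_stats(points):
    """RMSE, maximum, and minimum of absolute point-to-plane distances."""
    centroid, normal = fit_plane_svd(points)
    d = np.abs((np.asarray(points, dtype=float) - centroid) @ normal)
    return {"rmse": float(np.sqrt(np.mean(d ** 2))),
            "max": float(d.max()),
            "min": float(d.min())}
```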
4. Discussion
- We proposed a street scene reconstruction method for driving recorder data. The method makes full use of the massive amount of imagery that driving recorders produce, with a much shorter update cycle, and it reduces the cost of recovering sparse 3D point clouds compared with mobile mapping equipment carrying stable GPS/INS systems. To improve recovery accuracy, we analyzed and summarized the distribution regularities of the outliers in the SIFT matching results through extensive experiments.
- Our work differs from typical SfM approaches in that we eliminate the feature points on the Mask before matching is undertaken. Our experiments also showed that both the relative orientation results and the reconstruction results improve after the feature points on the Mask are removed.
- We designed guardrail and vehicle side-region detection methods based on the characteristics of driving recorder data. The detection methods rely on trained Haar-like-feature cascade classifiers, the position of the vanishing point, and a small set of camera parameters.
5. Conclusions
- Reconstructing robust side surfaces in the vehicle detection method without camera parameters.
- Extracting the most appropriate images from driving recorder videos.
- Reducing the number of images in the time-consuming matching step with a reasonable strategy.
- Increasing the density of reconstructed point clouds.
- Detecting blocked vehicles within a region more accurately.
Acknowledgments
Author Contributions
Conflicts of Interest
References
- Habib, A.; Pullivelli, A.; Mitishita, E.; Ghanma, M.; Kim, E. Stability analysis of low-cost digital cameras for aerial mapping using different georeferencing techniques. Photogramm. Rec. 2006, 21, 29–43.
- Zhang, Y.; Zhang, Z.; Zhang, J.; Wu, J. 3D building modelling with digital map, LiDAR data and video image sequences. Photogramm. Rec. 2005, 20, 285–302.
- Xiao, J.; Fang, T.; Zhao, P.; Lhuillier, M.; Quan, L. Image-based street-side city modeling. ACM Trans. Graph. 2009, 28.
- Cornelis, N.; Leibe, B.; Cornelis, K.; Van Gool, L. 3D urban scene modeling integrating recognition and reconstruction. Int. J. Comput. Vis. 2008, 78, 121–141.
- Aijazi, A.; Checchin, P.; Trassoudaine, L. Automatic removal of imperfections and change detection for accurate 3D urban cartography by classification and incremental updating. Remote Sens. 2013, 5, 3701–3728.
- Williams, K.; Olsen, M.; Roe, G.; Glennie, C. Synthesis of transportation applications of mobile LiDAR. Remote Sens. 2013, 5, 4652–4692.
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: New York, NY, USA, 2003.
- Frahm, J.; Lazebnik, S.; Fite-Georgel, P.; Gallup, D.; Johnson, T.; Raguram, R.; Wu, C.; Jen, Y.; Dunn, E.; Clipp, B. Building Rome on a cloudless day. In Proceedings of the 2010 European Conference on Computer Vision (ECCV), Crete, Greece, 5–11 September 2010; Volume 6314, pp. 368–381.
- Raguram, R.; Wu, C.; Frahm, J.; Lazebnik, S. Modeling and recognition of landmark image collections using iconic scene graphs. Int. J. Comput. Vis. 2011, 95, 213–239.
- Snavely, N.; Seitz, S.M.; Szeliski, R. Photo tourism: Exploring photo collections in 3D. ACM Trans. Graph. 2006, 25.
- Snavely, N.; Seitz, S.M.; Szeliski, R. Modeling the world from internet photo collections. Int. J. Comput. Vis. 2008, 80, 189–210.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. 511–518.
- Lienhart, R.; Maydt, J. An extended set of Haar-like features for rapid object detection. In Proceedings of the 2002 International Conference on Image Processing, New York, NY, USA, 22–25 September 2002; pp. 900–903.
- Schapire, R.E.; Singer, Y. Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 1999, 37, 297–336.
- Ponsa, D.; Lopez, A.; Lumbreras, F.; Serrat, J.; Graf, T. 3D vehicle sensor based on monocular vision. In Proceedings of the 2005 IEEE Intelligent Transportation Systems Conference, Vienna, Austria, 13–16 September 2005; pp. 1096–1101.
- OpenCV Team. OpenCV 2.4.9.0 Documentation. Available online: http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html (accessed on 18 October 2014).
- Jiang, J.L.; Loe, K.F. S-AdaBoost and pattern detection in complex environment. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'03), Los Alamitos, CA, USA, 18–20 June 2003; pp. 413–418.
- Sinha, S.N.; Steedly, D.; Szeliski, R. A multi-stage linear approach to structure from motion. In Trends and Topics in Computer Vision; Kutulakos, K.N., Ed.; Springer: Berlin, Germany, 2012; pp. 267–281.
- Moghadam, P.; Starzyk, J.A.; Wijesoma, W.S. Fast vanishing-point detection in unstructured environments. IEEE Trans. Image Process. 2012, 21, 425–430.
- Frahm, J.M.; Pollefeys, M. RANSAC for (quasi-)degenerate data (QDEGSAC). In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 17–22 June 2006; Volume 1, pp. 453–460.
- Torr, P.; Zisserman, A. Robust computation and parametrization of multiple view relations. In Proceedings of the Sixth International Conference on Computer Vision, Bombay, India, 4–7 January 1998; pp. 727–732.
- Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395.
- Wu, C. VisualSFM: A Visual Structure from Motion System. Available online: http://ccwu.me/vsfm/ (accessed on 8 October 2014).
- Wu, C. Towards linear-time incremental structure from motion. In Proceedings of the 2013 International Conference on 3D Vision (3DV), Seattle, WA, USA, 29 June–1 July 2013; pp. 127–134.
- Nister, D. An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 756–777.
- Microsoft Corporation. Photosynth. Available online: https://photosynth.net/Background.aspx (accessed on 8 October 2014).
- Nister, D.; Stewenius, H. Scalable recognition with a vocabulary tree. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 17–22 June 2006; pp. 2161–2168.
© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).