Automatic Defect Description of Railway Track Line Image Based on Dense Captioning
Figure 1. The railway track line structure of Beijing Metro Line 6.
Figure 2. A brief schema of DenseCap. An input image is first processed by the VGG16 [12]. The localization layer proposes regions and uses bilinear interpolation to smoothly extract a batch of corresponding activations. These regions are then processed by a fully-connected recognition network and described with an LSTM model.
Figure 3. Network architecture of RTLCap. ResNet-50-FPN is used instead of VGG16 as the backbone network.
Figure 4. The basic structure of ResNet-50-FPN (Conv*_x stands for the corresponding convolutional layer in ResNet-50).
Figure 5. Sketch map of the overlap between the fastener area and the backing plate area.
Figure 6. Illustration of the focal loss. This figure is from [14].
Figure 7. Example of manual labeling using the VGG Image Annotator, with image regions and captions.
Figure 8. Defects of the railway track line of Beijing Metro Line 6: (a) missing fastener and broken fastener; (b) rail corrugation.
Figure 9. (a,b) Qualitative comparisons between DenseCap_RF and RTLCap. From left to right: the ground truth (GT), the prediction of DenseCap_RF, and the prediction of RTLCap. Note that the prediction output is sorted by confidence score.
Figure 10. Network architecture of the proposed Faster RTLCap model.
Figure 11. (a) The architecture of the bifurcation-fusion-based encoder part (YOLO-MFLMF); (b) DBL module; (c) DBLC module; (d) the Resn module; (e) sketch map of the res unit. The numbers in (a) indicate the number of blocks or components.
Figure 12. Schematic diagram of the calculation of $b_x$, $b_y$, $b_w$, $b_h$.
Figure 13. The internal structure of MFLMF.
Figure 14. The relationship between the number of anchor boxes and the average IOU.
Figure 15. Long Short-Term Memory (LSTM) unit.
Figure 16. CM-LSTM.
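For reference, the quantities in Figure 12 are the decoded bounding box center and size. Assuming the standard YOLOv3 parameterization [15] (this matches the figure's symbols but is not spelled out in this excerpt), with grid-cell offset $(c_x, c_y)$, anchor prior $(p_w, p_h)$, and raw network outputs $(t_x, t_y, t_w, t_h)$:

$$
b_x = \sigma(t_x) + c_x, \qquad
b_y = \sigma(t_y) + c_y, \qquad
b_w = p_w e^{t_w}, \qquad
b_h = p_h e^{t_h},
$$

where $\sigma(\cdot)$ is the sigmoid function, so the predicted box center stays inside its grid cell.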
Abstract
1. Introduction
1. Based on advanced deep learning networks and natural language processing technologies, the problem of automatically describing defects in railway track line images is investigated and solved for the first time, and the proposed methods meet the demand for automatic generation of inspection reports on railway track line safety status.
2. A railway track line image captioning model (RTLCap for short) is proposed based on an improved DenseCap. It achieves better defect description accuracy than the original DenseCap and is better suited to the railway track line scenario. To the best of our knowledge, this is the first work to introduce dense captioning into railway track line safety status detection for the automatic generation of inspection reports.
3. Motivated by YOLOv3, a reconstructed RTLCap model named Faster RTLCap is presented. Faster RTLCap greatly reduces image processing time while maintaining sound defect description performance: per-image processing time drops by about 97.7% (from 2.0298 s to 0.0465 s), and defect description accuracy (mAP) improves by 1.12% relative (from 0.980 to 0.991).
2. Related Work
3. Automatic Defect Description of Railway Track Line Image
3.1. Dense Captioning Model
3.2. Railway Track Line Image Captioning Model (RTLCap)
3.2.1. Backbone and Anchors
3.2.2. Soft-NMS
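Soft-NMS [13] decays, rather than discards, the scores of boxes that overlap an already-selected detection, which suits heavily overlapping regions such as the fastener and backing plate areas in Figure 5. Below is a minimal NumPy sketch of the Gaussian variant, which rescores each remaining box as $s_i \leftarrow s_i\, e^{-\mathrm{IoU}^2/\sigma}$; the function names and the default `sigma` are illustrative, not the paper's exact settings:

```python
import numpy as np

def pairwise_iou(box, others):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of suppressing boxes."""
    boxes = boxes.astype(np.float64)
    scores = scores.astype(np.float64).copy()
    idxs = np.arange(len(scores))
    keep = []
    while idxs.size > 0:
        m = np.argmax(scores[idxs])                   # current best box
        best = idxs[m]
        keep.append(best)
        idxs = np.delete(idxs, m)
        if idxs.size == 0:
            break
        iou = pairwise_iou(boxes[best], boxes[idxs])
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)   # Gaussian score decay
        idxs = idxs[scores[idxs] > score_thresh]      # drop near-zero scores
    return keep
```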
3.2.3. Focal Loss
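Focal loss [14] reshapes the cross-entropy so that easy, well-classified examples are down-weighted and training focuses on hard ones: $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, as illustrated in Figure 6. A minimal sketch of the binary case follows; the defaults $\alpha = 0.25$, $\gamma = 2$ are the values suggested in [14], not necessarily the settings used in this paper:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted foreground probabilities in (0, 1)
    y: binary labels (1 = foreground, 0 = background)
    """
    eps = 1e-9
    p_t = np.where(y == 1, p, 1.0 - p)             # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return -(alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)).mean()
```

Setting $\gamma = 0$ and $\alpha = 0.5$ recovers (scaled) cross-entropy, which makes the down-weighting effect easy to verify numerically.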
3.3. Experiments and Results
3.3.1. Experimental Environment and Datasets
3.3.2. Evaluation Metrics
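For context, dense captioning is commonly scored with the mAP defined in DenseCap [3], which averages AP over a grid of localization (IoU) and language (METEOR [46]) thresholds; a prediction counts as correct only if it passes both. A sketch of that averaging, assuming DenseCap's published threshold grid:

```python
import itertools

# Threshold grid from the DenseCap evaluation protocol [3] (assumed here)
IOU_THRESHOLDS = (0.3, 0.4, 0.5, 0.6, 0.7)
METEOR_THRESHOLDS = (0.0, 0.05, 0.10, 0.15, 0.20, 0.25)

def dense_captioning_map(ap_fn):
    """Average AP over all (IoU, METEOR) threshold pairs.

    ap_fn(iou_t, met_t) must return the average precision when a predicted
    region counts as correct only if its box IoU with a ground-truth region
    exceeds iou_t AND its caption's METEOR score exceeds met_t.
    """
    pairs = list(itertools.product(IOU_THRESHOLDS, METEOR_THRESHOLDS))
    return sum(ap_fn(iou_t, met_t) for iou_t, met_t in pairs) / len(pairs)
```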
3.3.3. Loss Function
3.3.4. Availability of the ResNet-50-FPN
3.3.5. Performance Evaluation for RTLCap
4. Faster Railway Track Line Image Captioning Model
4.1. One-Stage Detection Algorithm
4.2. Faster RTLCap
4.2.1. Feature Bifurcation-Fusion-Based Encoder Part
- Image Feature Extraction Stage
- Bounding Box and Class Prediction Stage
- Regional Feature Construction and Encoding Stage
- Convolutional Anchors
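Anchor priors are typically obtained by k-means clustering of ground-truth box sizes under the $1 - \mathrm{IoU}$ distance of YOLOv2/YOLOv3 [15,56]; Figure 14 plots the resulting average IoU against the number of anchors. A minimal sketch of that clustering follows (the mean-update and the seed are illustrative choices, not the paper's stated procedure):

```python
import numpy as np

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster box sizes with the 1 - IoU distance of YOLOv2/v3 [15,56].

    wh: (N, 2) array of ground-truth box (width, height) pairs
    Returns (k, 2) anchor sizes and the mean best IoU (cf. Figure 14).
    """
    wh = np.asarray(wh, dtype=np.float64)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].copy()
    for _ in range(iters):
        # IoU between every box and every anchor, both anchored at the origin
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0]) *
                 np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = (wh[:, 0] * wh[:, 1])[:, None] + \
                (anchors[:, 0] * anchors[:, 1])[None, :] - inter
        iou = inter / union
        assign = iou.argmax(axis=1)            # nearest anchor by 1 - IoU
        for j in range(k):
            if np.any(assign == j):            # leave empty clusters unchanged
                anchors[j] = wh[assign == j].mean(axis=0)
    avg_iou = iou[np.arange(len(wh)), assign].mean()
    return anchors, avg_iou
```

Sweeping `k` and plotting `avg_iou` reproduces a curve like Figure 14; the experiments in Section 4.3.3 compare several anchor counts along such a trade-off.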
4.2.2. Stacked LSTM-Based Decoder Part
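The CM-LSTM internals (Figures 15 and 16) are not reproduced in this excerpt. Purely as a reference point, a generic two-layer stacked LSTM caption decoder in the spirit of [59,61] can be sketched as follows; all dimensions and the feature-as-first-token scheme are assumptions for illustration, not the published design:

```python
import torch
import torch.nn as nn

class StackedLSTMDecoder(nn.Module):
    """Generic stacked-LSTM caption decoder (two layers, cf. [61])."""

    def __init__(self, vocab_size, feat_dim=512, embed_dim=512, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_proj = nn.Linear(feat_dim, embed_dim)  # region feature -> first step
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, region_feats, captions):
        # Teacher forcing: prepend the encoded region feature as the first
        # input step, then feed the shifted ground-truth caption tokens.
        v = self.init_proj(region_feats).unsqueeze(1)    # (B, 1, E)
        w = self.embed(captions[:, :-1])                 # (B, T-1, E)
        h, _ = self.lstm(torch.cat([v, w], dim=1))       # (B, T, H)
        return self.out(h)                               # per-step word logits
```

At inference time, greedy decoding would feed the argmax token of each step back in as the next input until an end-of-sentence token is produced.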
4.3. Experiments and Results
4.3.1. Loss Function
4.3.2. Performance Evaluation for Faster RTLCap
4.3.3. The Influence of Choosing Different Numbers of Anchors
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Design Details of MFLMF
Appendix A.1. Feature Localization
Appendix A.2. Feature Mapping and Fusion
References
- Li, Y.; Trinh, H.; Haas, N.; Otto, C.; Pankanti, S. Rail component detection, optimization, and assessment for automatic rail track inspection. IEEE Trans. Intell. Transp. Syst. 2013, 15, 760–770.
- Zuwen, L. Overall comments on track technology of high-speed railway. J. Railw. Eng. Soc. 2007, 1, 41–54.
- Johnson, J.; Karpathy, A.; Li, F.-F. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4565–4574.
- Yang, L.; Tang, K.; Yang, J.; Li, L.J. Dense captioning with joint inference and visual context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2193–2202.
- Wang, T.J.J.; Tavakoli, H.R.; Sjöberg, M.; Laaksonen, J. Geometry-aware relational exemplar attention for dense captioning. In Proceedings of the 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications, Nice, France, 25 October 2019; pp. 3–11.
- Yin, G.; Sheng, L.; Liu, B.; Yu, N.; Wang, X.; Shao, J. Context and attribute grounded dense captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6241–6250.
- Zhang, Z.; Zhang, Y.; Shi, Y.; Yu, W.; Nie, L.; He, G.; Fan, Y.; Yang, Z. Dense image captioning based on precise feature extraction. In International Conference on Neural Information Processing; Springer: Berlin/Heidelberg, Germany, 2019; pp. 83–90.
- Zhao, D.; Chang, Z.; Guo, S. Cross-scale fusion detection with global attribute for dense captioning. Neurocomputing 2020, 373, 98–108.
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
- Hossain, M.Z.; Sohel, F.; Shiratuddin, M.F.; Laga, H. A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. 2019, 51, 1–36.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Marino, F.; Distante, A.; Mazzeo, P.L.; Stella, E. A real-time visual inspection system for railway maintenance: Automatic hexagonal-headed bolts detection. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2007, 37, 418–428.
- De Ruvo, P.; Distante, A.; Stella, E.; Marino, F. A GPU-based vision system for real time detection of fastening elements in railway inspection. In Proceedings of the 2009 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 2333–2336.
- Gibert, X.; Patel, V.M.; Chellappa, R. Robust fastener detection for autonomous visual railway track inspection. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2015; pp. 694–701.
- Gibert, X.; Patel, V.M.; Chellappa, R. Deep multitask learning for railway track inspection. IEEE Trans. Intell. Transp. Syst. 2016, 18, 153–164.
- Wei, X.; Yang, Z.; Liu, Y.; Wei, D.; Jia, L.; Li, Y. Railway track fastener defect detection based on image processing and deep learning techniques: A comparative study. Eng. Appl. Artif. Intell. 2019, 80, 66–81.
- Zhou, Y.; Li, X.; Chen, H. Railway fastener defect detection based on deep convolutional networks. In Proceedings of the Eleventh International Conference on Graphics and Image Processing (ICGIP 2019), Hangzhou, China, 12–14 October 2019; Volume 11373, p. 113732D.
- Qi, H.; Xu, T.; Wang, G.; Cheng, Y.; Chen, C. MYOLOv3-Tiny: A new convolutional neural network architecture for real-time detection of track fasteners. Comput. Ind. 2020, 123, 103303.
- Bai, T.; Yang, J.; Xu, G.; Yao, D. An optimized railway fastener detection method based on modified Faster R-CNN. Measurement 2021, 182, 109742.
- Faghih-Roohi, S.; Hajizadeh, S.; Núñez, A.; Babuska, R.; De Schutter, B. Deep convolutional neural networks for detection of rail surface defects. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 2584–2589.
- Liang, Z.; Zhang, H.; Liu, L.; He, Z.; Zheng, K. Defect detection of rail surface with deep convolutional neural networks. In Proceedings of the 2018 13th World Congress on Intelligent Control and Automation (WCICA), Changsha, China, 4–8 July 2018; pp. 1317–1322.
- James, A.; Jie, W.; Xulei, Y.; Chenghao, Y.; Ngan, N.B.; Yuxin, L.; Yi, S.; Chandrasekhar, V.; Zeng, Z. TrackNet: A deep learning based fault detection for railway track inspection. In Proceedings of the 2018 International Conference on Intelligent Rail Transportation (ICIRT), Singapore, 12–14 December 2018; pp. 1–5.
- Shang, L.; Yang, Q.; Wang, J.; Li, S.; Lei, W. Detection of rail surface defects based on CNN image recognition and classification. In Proceedings of the 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Korea, 11–14 February 2018; pp. 45–51.
- Feng, J.H.; Yuan, H.; Hu, Y.Q.; Lin, J.; Liu, S.W.; Luo, X. Research on deep learning method for rail surface defect detection. IET Electr. Syst. Transp. 2020, 10, 436–442.
- Wei, X.; Wei, D.; Suo, D.; Jia, L.; Li, Y. Multi-target defect identification for railway track line based on image processing and improved YOLOv3 model. IEEE Access 2020, 8, 61973–61988.
- Zhang, Z.; Liang, M.; Wang, Z. A deep extractor for visual rail surface inspection. IEEE Access 2021, 9, 21798–21809.
- Ni, X.; Ma, Z.; Liu, J.; Shi, B.; Liu, H. Attention network for rail surface defect detection via CASIoU-guided center-point estimation. IEEE Trans. Ind. Inform. 2021, 18, 1694–1705.
- Guo, F.; Qian, Y.; Wu, Y.; Leng, Z.; Yu, H. Automatic railroad track components inspection using real-time instance segmentation. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 362–377.
- Wu, Y.; Qin, Y.; Qian, Y.; Guo, F.; Wang, Z.; Jia, L. Hybrid deep learning architecture for rail surface segmentation and surface defect detection. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 227–244.
- Bai, T.; Gao, J.; Yang, J.; Yao, D. A study on railway surface defects detection based on machine vision. Entropy 2021, 23, 1437.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
- Karpathy, A.; Joulin, A.; Li, F.-F. Deep fragment embeddings for bidirectional image sentence mapping. arXiv 2014, arXiv:1406.5679.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Nickolls, J. GPU parallel computing architecture and CUDA programming model. In Proceedings of the 2007 IEEE Hot Chips 19 Symposium (HCS), Stanford, CA, USA, 19–21 August 2007.
- Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Geng, M.; Wang, Y.; Xiang, T.; Tian, Y. Deep transfer learning for person re-identification. arXiv 2016, arXiv:1611.05244.
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv 2016, arXiv:1602.07332.
- Bang, S.; Kim, H. Context-based information generation for managing UAV-acquired data using image captioning. Autom. Constr. 2020, 112, 103116.
- Dutta, A.; Zisserman, A. The VIA annotation software for images, audio and video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2276–2279.
- Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2425–2433.
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72.
- Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318.
- Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. arXiv 2019, arXiv:1905.05055.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
- Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659.
- Li, Z.; Zhou, F. FSSD: Feature fusion single shot multibox detector. arXiv 2017, arXiv:1712.00960.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
- Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Wang, C.; Yang, H.; Bartz, C.; Meinel, C. Image captioning with deep bidirectional LSTMs. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 988–997.
- Yu, L.; Qu, J.; Gao, F.; Tian, Y. A novel hierarchical algorithm for bearing fault diagnosis based on stacked LSTM. Shock Vib. 2019, 2019, 2756284.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
Table 1. Defect description performance comparison (mAP).

| Method | mAP |
| --- | --- |
| DenseCap [3] | 0.143 |
| DenseCap_RF | 0.958 |
| RTLCap | 0.980 |
Table 2. Performance comparison of RTLCap and Faster RTLCap variants (mAP, per-image processing time, and parameter count).

| Method | mAP | Time (s) | Parameters |
| --- | --- | --- | --- |
| RTLCap | 0.980 | 2.0298 | ∼94.98 M |
| Faster RTLCap (no SPP) | 0.984 | 0.0519 | ∼211.88 M |
| Faster RTLCap (with LSTM) | 0.986 | 0.0363 | ∼111.78 M |
| Faster RTLCap | 0.991 | 0.0465 | ∼113.88 M |
Table 3. The influence of the number of anchors on mAP and per-image processing time.

| Number of Anchors | mAP | Time (s) |
| --- | --- | --- |
| 3 | 0.967 | 0.0463 |
| 5 | 0.969 | 0.0469 |
| 7 | 0.963 | 0.0468 |
| 9 | 0.991 | 0.0465 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite

Wei, D.; Wei, X.; Jia, L. Automatic Defect Description of Railway Track Line Image Based on Dense Captioning. Sensors 2022, 22, 6419. https://doi.org/10.3390/s22176419