Development of an Optimized YOLO-PP-Based Cherry Tomato Detection System for Autonomous Precision Harvesting
Figure 1. Research flow of our approach.
Figure 2. Cherry tomato cultivation and data collection scene: (a) tomato cultivation environment in facility agriculture; (b) field data collection of tomatoes.
Figure 3. Representative sample datasets in different states: (a) direct light; (b) backlight; (c) front view; (d) side view; (e) top-down view.
Figure 4. Data annotation process.
Figure 5. The overall network architecture of YOLO-PP.
Figure 6. The evolution of the architecture of the C3, C2F, and C2FET modules.
Figure 7. (a) The left path integrates a fundamental convolutional component and a series of bottleneck structures, which refine residual features and fuse the outputs of the two independent branches of the C2FET module at the endpoint. (b) The constructed Transformer branch adopts a three-layer architecture and incorporates a progressive group attention mechanism. (c) The Cascaded Group Attention (CGA) module decomposes the computation of each attention head, customizing feature enhancement per head to improve the diversity of attention maps.
Figure 8. The structure of the SPSP module.
Figure 9. The representation of Inner-IoU and a visual explanation.
Figure 10. Comparison of mAP@50 and mAP@50-95 results for different models.
Figure 11. Actual picking point detection results of the different network models.
Figure 12. Performance of YOLO-PP in special cases: (a) detection results under different lighting conditions; (b) detection results in multi-target and occlusion scenarios.
Figure 13. Comparison of ablation results in terms of precision, mAP@0.5, and mAP@0.5-0.95: precision curve; mAP50 curve; mAP50-95 curve.
Figure 14. Variation curves of the loss function for the ablation experiments.
Figure 15. Training loss function curves: (a) YOLOv8-Pose; (b) YOLO-PP. Abscissa: number of iterations; ordinate: loss value.
Figure 16. Actual screen of the device deployment.
Figure 17. Hardware platform and software implementation of the automated tomato harvesting robot.
Abstract
1. Introduction
- (1) A dataset for target recognition and picking point detection of single cherry tomatoes was collected and annotated at a tomato planting base, and the feasibility of the YOLO keypoint detection algorithm for this task was verified (a hedged example of one such label line follows this list).
- (2) The EfficientViT block was integrated into the C2F module in a parallel structure, yielding the proposed C2FET module, which enables the network to capture global information more effectively (see the sketch after this list).
- (3) A Spatial Pyramid Squeeze and Pooling (SPSP) module is proposed at the junction between the backbone network and the neck. The SP component extracts refined multi-scale features, while the SEWeight module, combined with Softmax, recalibrates channel attention, so the SPSP module effectively captures multi-scale spatial and contextual information (see the sketch after this list).
- (4) The Inner-CIoU concept was introduced to compute the IoU loss with auxiliary bounding boxes, and a scaling factor adjusts the size of these boxes during loss calculation, which improves detection accuracy, particularly for small target points (see the sketch after this list).
- (5) A software interface for intuitive interactive recognition was developed on the Jetson Xavier NX around the components of YOLO-PP, integrating the previously built hardware electronic platform of the tomato harvesting robot in single-fruit picking mode with practical planting scenarios.
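For contribution (1), each annotated fruit pairs a bounding box with its picking point keypoint. The line below is a minimal, hypothetical example assuming the Ultralytics YOLO pose label format (one object per line; all coordinates normalized to [0, 1]; visibility flag 2 meaning labeled and visible); the paper does not spell out the exact field layout used.

```
# <class> <cx> <cy> <w> <h> <kpt_x> <kpt_y> <visibility>
0 0.512 0.430 0.118 0.135 0.507 0.352 2
```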
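For contribution (2), the key idea is a parallel pairing of a C2F-style convolutional path with a transformer path, fused at the end (Figure 7). The sketch below is a simplified, hypothetical PyTorch rendering: a plain multi-head self-attention branch stands in for the EfficientViT block with cascaded group attention, and the module names, channel splits, and fusion layout are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """YOLOv8-style bottleneck: two 3x3 convs with an optional shortcut."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class SimpleAttnBranch(nn.Module):
    """Stand-in for the EfficientViT block: multi-head self-attention over
    flattened spatial tokens (the real branch uses cascaded group attention)."""
    def __init__(self, c, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(c)
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)          # (B, HW, C) token sequence
        t, _ = self.attn(self.norm(t), self.norm(t), self.norm(t))
        return t.transpose(1, 2).reshape(b, c, h, w) + x

class C2FET(nn.Module):
    """Hypothetical C2FET-style block: a convolutional path (bottlenecks) runs
    in parallel with a transformer path; outputs are fused by a 1x1 conv."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c = c_out // 2                            # assumed per-branch width
        self.cv_in = nn.Conv2d(c_in, c, 1)
        self.bottlenecks = nn.Sequential(*[Bottleneck(c) for _ in range(n)])
        self.attn_in = nn.Conv2d(c_in, c, 1)
        self.attn = SimpleAttnBranch(c)
        self.fuse = nn.Conv2d(2 * c, c_out, 1)

    def forward(self, x):
        conv_path = self.bottlenecks(self.cv_in(x))   # local/residual features
        attn_path = self.attn(self.attn_in(x))        # global context
        return self.fuse(torch.cat([conv_path, attn_path], dim=1))

# Example: maps a (1, 64, 80, 80) feature map to (1, 128, 80, 80).
# y = C2FET(64, 128, n=2)(torch.randn(1, 64, 80, 80))
```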
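For contribution (3), the SPSP module combines multi-scale spatial pooling with SEWeight-plus-Softmax recalibration. The following is a minimal sketch under assumed kernel sizes and fusion layout; the actual SPSP structure is the one shown in Figure 8.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEWeight(nn.Module):
    """Squeeze-and-excitation weighting: global average pool + two 1x1 convs."""
    def __init__(self, c, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, max(c // r, 4), 1), nn.ReLU(inplace=True),
            nn.Conv2d(max(c // r, 4), c, 1), nn.Sigmoid())

    def forward(self, x):
        return self.fc(x)                              # (B, C, 1, 1)

class SPSP(nn.Module):
    """Hypothetical SPSP sketch: multi-scale pooling branches, then SEWeight
    followed by a softmax across branches to recalibrate channel attention."""
    def __init__(self, c, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels])
        self.se = SEWeight(c)
        self.fuse = nn.Conv2d(c * (len(kernels) + 1), c, 1)

    def forward(self, x):
        branches = [x] + [p(x) for p in self.pools]    # multi-scale spatial context
        weights = torch.stack([self.se(b) for b in branches], dim=1)  # (B, S, C, 1, 1)
        weights = F.softmax(weights, dim=1)            # recalibrate across scales
        branches = torch.stack(branches, dim=1) * weights
        b, s, c, h, w = branches.shape
        return self.fuse(branches.reshape(b, s * c, h, w))

# Example: SPSP(256)(torch.randn(1, 256, 20, 20)) keeps the (1, 256, 20, 20) shape.
```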
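For contribution (4), the cited Inner-IoU work (Zhang et al.) computes the overlap on auxiliary boxes that keep the original centers but are rescaled by a ratio, and the Inner-CIoU loss replaces the plain IoU term of CIoU with this auxiliary-box IoU (L_Inner-CIoU = L_CIoU + IoU - IoU_inner). Below is a minimal sketch; the (cx, cy, w, h) box layout and the default ratio are assumptions for illustration.

```python
import torch

def inner_iou(box_pred, box_gt, ratio=0.75, eps=1e-7):
    """Inner-IoU sketch: IoU of auxiliary boxes sharing the original centres
    but rescaled by `ratio` (< 1 shrinks, > 1 enlarges the boxes).
    Boxes are (cx, cy, w, h) tensors of shape (..., 4)."""
    cxp, cyp, wp, hp = box_pred.unbind(-1)
    cxg, cyg, wg, hg = box_gt.unbind(-1)

    # Auxiliary (inner) boxes obtained by scaling width/height by `ratio`.
    wp, hp, wg, hg = wp * ratio, hp * ratio, wg * ratio, hg * ratio

    inter_w = (torch.min(cxp + wp / 2, cxg + wg / 2)
               - torch.max(cxp - wp / 2, cxg - wg / 2)).clamp(min=0)
    inter_h = (torch.min(cyp + hp / 2, cyg + hg / 2)
               - torch.max(cyp - hp / 2, cyg - hg / 2)).clamp(min=0)
    inter = inter_w * inter_h
    union = wp * hp + wg * hg - inter + eps
    return inter / union

# Example: identical boxes give an inner IoU of ~1 regardless of ratio.
# inner_iou(torch.tensor([0.5, 0.5, 0.2, 0.2]), torch.tensor([0.5, 0.5, 0.2, 0.2]))
```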
2. Related Works
2.1. Traditional Approach
2.2. Deep Learning Approach
3. Materials and Evaluation Metrics
3.1. Image and Data Acquisition
3.2. Annotation of Datasets
3.3. Evaluation Metrics
4. Methodologies
4.1. YOLO-PP Architecture Overview
4.2. C2FET Module
4.3. Spatial Pyramid Squeeze and Pooling Module
4.4. Loss Function
5. Experiments and Evaluation
5.1. Experimental Setup
5.2. Comparison of Network Models
5.3. Ablation Experiment
6. Device Deployment and Interactive Interface
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Mohamed, A.; Shaw-Sutton, J.; Green, B.; Andrews, W.; Rolley-Parnell, E.; Zhou, Y.; Zhou, P.; Mao, X.; Fuller, M.; Stoelen, M. Soft manipulator robot for selective tomato harvesting. In Precision Agriculture’19; Wageningen Academic: Wageningen, The Netherlands, 2019; pp. 799–805. [Google Scholar]
- Maureira, F.; Rajagopalan, K.; Stöckle, C.O. Evaluating tomato production in open-field and high-tech greenhouse systems. J. Clean. Prod. 2022, 337, 130459. [Google Scholar] [CrossRef]
- Wang, Z.; Xun, Y.; Wang, Y.; Yang, Q. Review of smart robots for fruit and vegetable picking in agriculture. Int. J. Agric. Biol. Eng. 2022, 15, 33–54. [Google Scholar]
- Zhou, H.; Wang, X.; Au, W.; Kang, H.; Chen, C. Intelligent robots for fruit harvesting: Recent developments and future challenges. Precis. Agric. 2022, 23, 1856–1907. [Google Scholar] [CrossRef]
- Lu, S.; Xiao, X. Neuromorphic Computing for Smart Agriculture. Agriculture 2024, 14, 1977. [Google Scholar] [CrossRef]
- Li, Y.; Feng, Q.; Li, T.; Xie, F.; Liu, C.; Xiong, Z. Advance of target visual information acquisition technology for fresh fruit robotic harvesting: A review. Agronomy 2022, 12, 1336. [Google Scholar] [CrossRef]
- Li, Z.; Yuan, X.; Wang, C. A review on structural development and recognition–localization methods for end-effector of fruit–vegetable picking robots. Int. J. Adv. Robot. Syst. 2022, 19, 17298806221104906. [Google Scholar] [CrossRef]
- Li, M.; Wu, F.; Wang, F.; Zou, T.; Li, M.; Xiao, X. CNN-MLP-Based Configurable Robotic Arm for Smart Agriculture. Agriculture 2024, 14, 1624. [Google Scholar] [CrossRef]
- Tang, Y.; Chen, M.; Wang, C.; Luo, L.; Li, J.; Lian, G.; Zou, X. Recognition and localization methods for vision-based fruit picking robots: A review. Front. Plant Sci. 2020, 11, 510. [Google Scholar] [CrossRef]
- Zhang, B.; Xie, Y.; Zhou, J.; Wang, K.; Zhang, Z. State-of-the-art robotic grippers, grasping and control strategies, as well as their applications in agricultural robots: A review. Comput. Electron. Agric. 2020, 177, 105694. [Google Scholar] [CrossRef]
- Fu, L.; Gao, F.; Wu, J.; Li, R.; Karkee, M.; Zhang, Q. Application of consumer RGB-D cameras for fruit detection and localization in field: A critical review. Comput. Electron. Agric. 2020, 177, 105687. [Google Scholar] [CrossRef]
- Saleem, M.H.; Potgieter, J.; Arif, K.M. Automation in agriculture by machine and deep learning techniques: A review of recent developments. Precis. Agric. 2021, 22, 2053–2091. [Google Scholar] [CrossRef]
- Meshram, V.; Patil, K.; Meshram, V.; Hanchate, D.; Ramkteke, S. Machine learning in agriculture domain: A state-of-art survey. Artif. Intell. Life Sci. 2021, 1, 100010. [Google Scholar] [CrossRef]
- Xiang, R.; Jiang, H.; Ying, Y. Recognition of clustered tomatoes based on binocular stereo vision. Comput. Electron. Agric. 2014, 106, 75–90. [Google Scholar] [CrossRef]
- Luo, L.; Tang, Y.; Lu, Q.; Chen, X.; Zhang, P.; Zou, X. A vision methodology for harvesting robot to detect cutting points on peduncles of double overlapping grape clusters in a vineyard. Comput. Ind. 2018, 99, 130–139. [Google Scholar] [CrossRef]
- Luo, L.; Tang, Y.; Zou, X.; Ye, M.; Feng, W.; Li, G. Vision-based extraction of spatial information in grape clusters for harvesting robots. Biosyst. Eng. 2016, 151, 90–104. [Google Scholar] [CrossRef]
- Luo, L.; Liu, W.; Lu, Q.; Wang, J.; Wen, W.; Yan, D.; Tang, Y. Grape berry detection and size measurement based on edge image processing and geometric morphology. Machines 2021, 9, 233. [Google Scholar] [CrossRef]
- Pérez-Zavala, R.; Torres-Torriti, M.; Cheein, F.A.; Troni, G. A pattern recognition strategy for visual grape bunch detection in vineyards. Comput. Electron. Agric. 2018, 151, 136–149. [Google Scholar] [CrossRef]
- Behroozi-Khazaei, N.; Maleki, M.R. A robust algorithm based on color features for grape cluster segmentation. Comput. Electron. Agric. 2017, 142, 41–49. [Google Scholar] [CrossRef]
- Bai, Y.; Mao, S.; Zhou, J.; Zhang, B. Clustered tomato detection and picking point location using machine learning-aided image analysis for automatic robotic harvesting. Precis. Agric. 2023, 24, 727–743. [Google Scholar] [CrossRef]
- Jin, Y.; Yu, C.; Yin, J.; Yang, S.X. Detection method for table grape ears and stems based on a far-close-range combined vision system and hand-eye-coordinated picking test. Comput. Electron. Agric. 2022, 202, 107364. [Google Scholar] [CrossRef]
- Tang, Y.; Qiu, J.; Zhang, Y.; Wu, D.; Cao, Y.; Zhao, K.; Zhu, L. Optimization strategies of fruit detection to overcome the challenge of unstructured background in field orchard environment: A review. Precis. Agric. 2023, 24, 1183–1219. [Google Scholar] [CrossRef]
- Tian, Y.; Yang, G.; Wang, Z.; Li, E.; Liang, Z. Detection of apple lesions in orchards based on deep learning methods of CycleGAN and YOLOV3-dense. J. Sens. 2019, 2019, 7630926. [Google Scholar] [CrossRef]
- Yan, C.; Chen, Z.; Li, Z.; Liu, R.; Li, Y.; Xiao, H.; Lu, P.; Xie, B. Tea sprout picking point identification based on improved DeepLabV3+. Agriculture 2022, 12, 1594. [Google Scholar] [CrossRef]
- Sa, I.; Ge, Z.; Dayoub, F.; Upcroft, B.; Perez, T.; McCool, C. Deepfruits: A fruit detection system using deep neural networks. Sensors 2016, 16, 1222. [Google Scholar] [CrossRef]
- Wu, F.; Duan, J.; Chen, S.; Ye, Y.; Ai, P.; Yang, Z. Multi-target recognition of bananas and automatic positioning for the inflorescence axis cutting point. Front. Plant Sci. 2021, 12, 705021. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Q.; Chen, J.; Li, B.; Xu, C. Method for recognizing and locating tomato cluster picking points based on RGB-D information fusion and target detection. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2021, 37, 143–152. [Google Scholar]
- Li, D.; Sun, X.; Lv, S.; Elkhouchlaa, H.; Jia, Y.; Yao, Z.; Lin, P.; Zhou, H.; Zhou, Z.; Shen, J.; et al. A novel approach for the 3D localization of branch picking points based on deep learning applied to longan harvesting UAVs. Comput. Electron. Agric. 2022, 199, 107191. [Google Scholar] [CrossRef]
- Qi, X.; Dong, J.; Lan, Y.; Zhu, H. Method for identifying litchi picking position based on YOLOv5 and PSPNet. Remote Sens. 2022, 14, 2004. [Google Scholar] [CrossRef]
- Zhang, T.; Wu, F.; Wang, M.; Chen, Z.; Li, L.; Zou, X. Grape-bunch identification and location of picking points on occluded fruit axis based on YOLOv5-GAP. Horticulturae 2023, 9, 498. [Google Scholar] [CrossRef]
- Ding, J.; Niu, S.; Nie, Z.; Zhu, W. Research on Human Posture Estimation Algorithm Based on YOLO-Pose. Sensors 2024, 24, 3036. [Google Scholar] [CrossRef]
- Pavlov, M.; Marakhtanov, A.; Korzun, D. Detection of Key Points for a Rainbow Trout in Underwater Video Surveillance System. In Proceedings of the 33rd Conference of FRUCT Association, Zilina, Slovakia, 24–26 May 2023. [Google Scholar]
- Tan, J.; Qin, H.; Chen, X.; Li, J.; Li, Y.; Li, B.; Leng, Y.; Fu, C. Point cloud segmentation of breast ultrasound regions to be scanned by fusing 2D image instance segmentation and keypoint detection. In Proceedings of the 2023 International Conference on Advanced Robotics and Mechatronics (ICARM), Sanya, China, 8–10 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 669–674. [Google Scholar]
- Nguyen, T.D.T.; Nguyen, M.H.; Nguyen, T.H.; Pham, V.C. Deep Learning Based Pose Estimation and Action Prediction for Construction Machines. In Proceedings of the 2023 8th International Scientific Conference on Applying New Technology in Green Buildings (ATiGB), Danang, Vietnam, 10–11 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 268–273. [Google Scholar]
- Wu, Z.; Xia, F.; Zhou, S.; Xu, D. A method for identifying grape stems using keypoints. Comput. Electron. Agric. 2023, 209, 107825. [Google Scholar] [CrossRef]
- Chen, J.; Ma, A.; Huang, L.; Li, H.; Zhang, H.; Huang, Y.; Zhu, T. Efficient and lightweight grape and picking point synchronous detection model based on key point detection. Comput. Electron. Agric. 2024, 217, 108612. [Google Scholar] [CrossRef]
- Tzutalin, D. tzutalin/labelimg: Labelimg is a Graphical Image Annotation Tool and Label Object Bounding Boxes in Images. 2018. Available online: https://github.com/wkentaro/labelme (accessed on 8 December 2024).
- Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. Yolo-pose: Enhancing yolo for multi person pose estimation using object keypoint similarity loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 2637–2646. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
- Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 8 December 2024).
- Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. Efficientvit: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14420–14430. [Google Scholar]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
Hardware/Software | Configuration |
---|---|
CPU | Intel(R) Core(TM) i5-12600KF |
Memory (GB) | 16 |
GPU | NVIDIA GeForce RTX 4090 |
Graphics Memory (GB) | 24 |
Training Environment | CUDA 11.6 |
Operating System | Windows 11 (64-bit) |
Development Environment | Python 3.8.18 & PyTorch 1.13.1 |
Embedded Device | NVIDIA Jetson Xavier NX |
Parameter | Value |
---|---|
Initial learning rate | 0.01 |
Optimizer | Adam |
Momentum | 0.937 |
Weight Decay | 0.0005 |
Batch size | 12 |
Input image size | 960 × 960 |
Training Epochs | 150 |
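As a concrete illustration of how the hyperparameters above could be supplied to a training run, the snippet below assumes the Ultralytics training API with a stock YOLOv8s-Pose checkpoint as the starting point; the dataset config name is a placeholder, and YOLO-PP's modified modules are not part of the stock package.

```python
from ultralytics import YOLO

# Hypothetical training call mirroring the table of parameters above.
model = YOLO("yolov8s-pose.pt")          # baseline pose weights (stand-in for YOLO-PP)
model.train(
    data="cherry_tomato.yaml",           # dataset config (placeholder name)
    imgsz=960,                           # input image size
    epochs=150,
    batch=12,
    optimizer="Adam",
    lr0=0.01,                            # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
```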
Model | Backbone | Size (pixels) | Recall (%) | mAP50 (%) | mAP50-95 (%) | Params |
---|---|---|---|---|---|---|
DEKR | HRNet-W32 | 640 | 98.23 | 82.24 | 79.85 | 29.5 M |
YOLO-Pose | CSPDarknet53 m | 640 | 96.17 | 89.72 | 87.46 | 21.2 M |
YOLOv8-Pose | CSPDarknet53 s | 640 | 97.54 | 98.37 | 96.76 | 11.4 M |
YOLO-PP | CSPDarknet53 s | 640 | 97.51 | 98.49 | 97.29 | 12.7 M |
YOLO-Pose | CSPDarknet53 m | 960 | 96.23 | 88.35 | 87.51 | 21.2 M |
YOLOv8-Pose | CSPDarknet53 s | 960 | 97.72 | 98.95 | 98.24 | 11.4 M |
YOLO-PP | CSPDarknet53 s | 960 | 98.86 | 99.18 | 98.87 | 12.7 M |
SPSP Module | C2FET Module | Inner CIoU | Precision (%) | Recall (%) | mAP50 (%) |
---|---|---|---|---|---|
✗ | ✗ | ✗ | 94.81 | 97.72 | 98.95 |
✗ | ✗ | ✓ | 93.49 | 98.12 | 98.85 |
✗ | ✓ | ✗ | 94.95 | 97.76 | 98.96 |
✓ | ✗ | ✗ | 94.53 | 97.61 | 98.78 |
✗ | ✓ | ✓ | 95.27 | 97.80 | 99.15 |
✓ | ✓ | ✓ | 95.81 | 98.86 | 99.18 |
Methods | mAP50 (%) | mAP50-95 (%) |
---|---|---|
baseline (CIoU) | 98.75 | 97.5 |
baseline + EIoU | 97.86 | 95.76 |
baseline + SIoU | 97.42 | 96.35 |
baseline + DIoU | 98.52 | 97.76 |
baseline + Shape-IoU | 98.43 | 98.45 |
baseline + Inner CIoU | 99.18 | 98.87 |
Model | Inference Time | FPS |
---|---|---|
DEKR | 283.17 ms | 11.3 |
YOLO-Pose | 204.20 ms | 8.63 |
YOLO-PP (Unquantized) | 197.13 ms | 7.85 |
YOLO-PP (Quantized) | 31.64 ms | 31.24 |
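The roughly six-fold speedup of the quantized model is consistent with exporting the network to a TensorRT engine for the Jetson Xavier NX. The sketch below uses the Ultralytics export API; the file names are placeholders and FP16 is only one plausible quantization setting, as the exact procedure is not restated here.

```python
from ultralytics import YOLO

# Hypothetical export of trained weights to a TensorRT engine (FP16 assumed).
model = YOLO("yolo_pp_best.pt")                      # placeholder weight file
model.export(format="engine", half=True, imgsz=960, device=0)

# Inference with the exported engine on a sample frame (placeholder image name).
engine = YOLO("yolo_pp_best.engine")
results = engine.predict("tomato_frame.jpg", imgsz=960)
```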