1. Introduction
As one of the most important vegetable crops, tomato ranks first in the world in terms of export volume and value, and plays an important role in agricultural production and trade worldwide, accounting for about 30 percent of the world's total planting area [
1]. To meet the demand for increased tomato production, breeding experts have adopted efficient genetic breeding programs to cultivate high-yield and high-quality tomato varieties [
2]. High-throughput phenotypic analysis of tomato plants can effectively help researchers analyze and track the growth of tomato plants. In the past few decades, computer vision-based techniques for high-throughput analysis of plant phenotypes have received ever-increasing research interest from various communities, such as the computer science and modern intelligent agricultural breeding communities [
3,
4]. In order to automatically obtain plant organ parameters, it is necessary to accurately segment the stems and leaves, the primary plant organs for photosynthesis, transpiration and growth characterization. Recently, with the development of genetic breeding for tomato, the stem-leaf segmentation problem for spatial tomato structures has gained particular research attention [
2].
In recent years, two-dimensional (2D) imaging techniques have been applied to obtain structural parameters of plants [
5,
6], particularly in vegetation studies. It is possible to monitor plant growth, assess health status, and detect diseases through the analysis of obtained images using 2D image-based processing methods [
7,
8]. Because a 2D image is essentially a projection along a single viewing direction while plants have complicated three-dimensional (3D) structures, techniques based on 2D imaging suffer significant limitations in stem-leaf segmentation, including complex backgrounds, occlusions and overlaps, the structural complexity of plants, and limited image resolution [
9,
10]. To overcome the problem of insufficient spatial information of 2D imaging, 3D point cloud technology has been utilized to obtain spatial information about plants [
11,
12,
13]. For example, lettuce leaf segmentation and fresh weight estimation have been achieved in [
14] by using 3D point cloud technology. The stem-leaf segmentation of individual corn plants has been studied based on 3D lidar (light detection and ranging) point cloud data [
15]. A tomato stem-leaf segmentation method has been proposed based on point cloud data obtained through 3D laser scanning [
16]. Although the aforementioned traditional stem-leaf segmentation algorithms have achieved success for specific growth stages or plant types with few leaves, they struggle to achieve adaptive segmentation and higher accuracy.
With the development of machine learning methods, deep learning techniques have been applied to process plant 3D data, with significant results [
17,
18]. The SegNet model was adopted to segment leaves from individual poplar seedlings using 3D point cloud data under the condition of strong interference [
19]. Stem-leaf segmentations of maize are obtained by using voxel-based convolutional neural network and 3D point cloud data technology [
20,
21]. The semantic segmentation method based on PointNet++ is proposed in [
22] to realize automatic fruit segmentation through multi-sensor 3D point cloud data fusion. Researchers have primarily focused on utilizing deep learning and point cloud techniques for the automatic segmentation of plant leaves and organs; however, the point cloud resolution involved (approximately tens of thousands of points per plant) is relatively low. For high-resolution plant 3D point clouds, several shortcomings remain, such as insufficient processing capacity, poor environmental adaptability, and high computational complexity. Although existing research provides valuable methodologies for plant phenotypic analysis, further improvements are needed to meet the practical requirements of high-resolution plant 3D point cloud analysis and to enhance model performance in agricultural applications. As such, we are motivated to further study the stem-leaf segmentation problem for tomato with high-resolution 3D point clouds.
In this work, we propose an encoding-decoding structure with voxel sparse convolution (SpConv) and attention-based feature fusion (VSCAFF) for semantic segmentation of 3D point clouds generated from high-throughput imaging of tomato seedlings. In the proposed network, the point clouds of the tomato seedlings are rasterized and voxelized to facilitate deep learning analysis. At the same time, an attention-based feature fusion module is designed to resolve the semantic ambiguity of points within the same voxel by effectively fusing voxel features with point features. The proposed model runs efficiently and provides technical support for semantic segmentation of point clouds without data preprocessing such as downsampling. Semantic segmentation of tomato seedlings provides a powerful tool for plant phenotypic analysis, whose application can guide tomato breeding in precision agriculture. The main contributions of this paper can be highlighted as follows:
- (1)
The SpConv module based on the 3D skeleton convolution kernel is designed to strengthen the weights of the convolution kernel skeleton and reduce the number of parameters.
- (2)
A feature fusion module based on the attention mechanism is designed to effectively fuse the features of different branches and suppress noise.
- (3)
A composite loss function combining Lovász-Softmax and weighted cross-entropy is introduced to effectively alleviate class imbalance and improve model performance.
- (4)
The VSCAFF model, based on an encoding-decoding network structure for semantic segmentation, is constructed.
2. Materials and Methods
2.1. Point Cloud Data Acquisition
For tasks such as point cloud annotation and semantic segmentation, the spatiotemporal tomato seedling point cloud datasets employed are from Pheno4D [
23], obtained with a high-precision 3D laser scanning system consisting of a Perceptron Scan Works V5 laser triangulation scanner (Perceptron Inc., Plymouth, MI, USA) connected to a ROMER Infinite 2.0 measurement arm (Hexagon Metrology Services Ltd., London, UK). The system can completely scan the surface of a tomato plant from different positions and angles, with a spatial accuracy better than one-tenth of a millimeter. The measuring distance of the scanner is 100 mm; thus, in order to scan tomato plants of practical size, the scanner is mounted on the measurement arm, whose seven joints span a spherical measurement volume with a radius of 140 cm. The measurements were performed in the greenhouse where the plants were grown. Each plant was scanned separately using a mounting system that ensured the pot had the same position and orientation relative to the measuring arm every day, so that the plant's position and orientation were consistent throughout the whole series of scans. The measuring volume is sufficient to scan the plant from different positions and angles. Due to the non-invasive nature of the scanning system, the measurements could be carried out without disturbing the plants. A total of 140 high-resolution point clouds of tomato seedlings were acquired over 20 days from 7 tomato plants, starting when the seedlings issued their first buds.
2.2. Data Annotation
Using CloudCompare (v2.12.2), the tomato seedling point cloud datasets were annotated with semantic labels, categorized into “Leaf”, “Stem”, and “Soil”, as shown in
Figure 1. Raw 3D point clouds of the tomato are shown in
Figure 1a, and the semantically annotated point clouds are illustrated in
Figure 1b, where the green, purple and yellow parts indicate “leaf”, “stem” and “soil”, respectively.
The point cloud of a single high-resolution scan of a tomato seedling generally contains millions of points, which can finely represent the details of tomato organs. The point cloud sizes of the seven tomato seedlings collected over 20 days vary with the size of the seedlings at different growth stages. The statistics of the 20-day point cloud quantities are shown in
Table 1. As illustrated in
Table 1, the average number of points is 2.43 million; the plants scanned on the 19th day have the maximum (3.68 million points) and those on the 17th day the minimum (1.88 million). The standard deviation of 0.47 million points indicates an uneven distribution of points across the days.
In this study, we used the 20-day point cloud datasets numbered 1 to 5 as the training set, comprising a total of 100 point clouds. The point cloud datasets numbered 6 and 7, also covering 20 days, are designated as the test set, consisting of 40 point clouds. The dataset distribution is shown in
Table 2. As evident in
Table 2, the point sets for leaf, stem, and soil are extremely unevenly distributed, because tomato seedlings exhibit significant changes in size and morphology during different growth stages, resulting in variations in the number of points. In the semantic class distribution of the training and test sets, soil and leaf points each account for more than 45% of the total plant point clouds, while stem points make up less than 5%. For point-based methods that take raw point clouds as input, this highly uneven distribution of points among semantic classes poses a significant challenge for semantic segmentation.
2.3. Network Structure of Semantic Segmentation
Convolutional neural networks (CNNs) excel at feature extraction from 2D images and have been extensively utilized in various tasks, including semantic segmentation and object detection in 2D images, demonstrating significant effectiveness [
24,
25]. During the computation of a CNN, convolutional kernels interact with input data through local receptive fields, and the principle of weight sharing substantially decreases the number of parameters, thereby reducing the overall complexity of the model. The convolution operation relies on a fixed grid structure to extract features, such as the pixel matrix of an image, where the relationships between adjacent pixels are well-defined. The disordered and non-structured nature of 3D point cloud data, which makes it difficult to form fixed connectivity relationships, hinders the direct application of traditional 3D convolution to point clouds. This limitation has spurred researchers to explore novel network structures and algorithms, such as PointNet and its variants, to more effectively process and analyze point cloud data. An effective method, which has been widely applied and has shown good performance in 3D point cloud recognition tasks, is to voxelize point clouds into a voxel representation and then apply standard 3D convolution for feature extraction [
26]. During voxelization, point cloud data is greatly influenced by the chosen resolution, making it difficult to balance model performance and efficiency. With a higher voxel resolution, the model performs well but requires more time and runs less efficiently; conversely, with a lower voxel resolution, the model executes quickly but suffers a significant drop in performance. A key reason is that at higher voxel resolutions, the increase in empty voxels makes the 3D point clouds excessively sparse, resulting in a large number of empty operations when performing regular 3D convolution [
27].
The point cloud structures of tomato plants are complex and irregular in shape, leading to greater sparsity after voxelization. Applying conventional 3D convolution can significantly deform fine details and incurs a substantial computational burden. To address the challenge of directly applying convolutions to tomato plant point clouds, an encoding-decoding network structure based on SpConv is designed, as illustrated in
Figure 2. The core of the network comprises three components: the encoding-decoding structure based on SpConv, the skeleton convolution kernel, and the attention-based feature fusion method. The network first voxelizes the point clouds of tomato plants and then extracts features using SpConv. SpConv denotes the sparse convolution module, while DeConv denotes the deconvolution module, which restores the voxel shape prior to sparse convolution. The skeleton convolution kernel significantly reduces the number of parameters while enhancing the model's performance. The voxel-based encoding-decoding structure ultimately generates the same class label for all points within the same voxel. The expand module, representing feature diffusion, spreads voxel features to point features; the diffusion strategy used in this paper shares the voxel features among all points within the same voxel. Point-branch features must therefore be fused for all points using the attention-based feature fusion to restore the uniqueness of points within the same voxel, thereby eliminating ambiguity.
2.3.1. Encoding-Decoding Structure Based on SpConv
During the rasterization of tomato plant point clouds, sparsity tends to increase with voxel resolution. A higher voxel resolution is often required for high-resolution point cloud data in order to accurately extract semantic information and the geometric details of plant organs such as stems, which increases the time required for convolution operations. Therefore, SpConv is employed as a local feature extractor to perform convolution operations on the point clouds, constructing a hierarchical feature extraction module.
In this paper, a SpConv-based feature encoder is constructed using TorchSparse [
28], an efficient 3D SpConv library that accelerates SpConv through adaptive matrix multiplication and vectorization. The encoder takes the feature sparse matrix of the point clouds as input, generated as follows. First, the tomato point clouds are rasterized with a voxel unit of 0.2, and the points within each voxel are fed into a PointNet subnet to learn detailed point features. Then, the points within the same voxel are pooled to generate voxel features and construct the feature sparse matrix of the point clouds. Finally, the sparse matrix is used as input to the backbone network, which performs feature encoding.
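The sparse-matrix construction described above can be sketched in a few lines of numpy: floor the coordinates onto a 0.2 voxel grid, group points by voxel, and pool within each voxel. In this illustrative sketch the raw coordinates stand in for the features a PointNet subnet would learn, and all names are ours rather than the paper's:

```python
import numpy as np

def voxelize(points, voxel_size=0.2):
    """Rasterize points into voxels and max-pool features per voxel.

    points: (N, 3) xyz coordinates; here they double as the per-point
    features that a PointNet subnet would otherwise produce.
    Returns the occupied voxel coordinates, the pooled per-voxel features
    (the sparse feature matrix), and the point-to-voxel inverse index.
    """
    # Integer voxel coordinates: floor each coordinate by the voxel edge.
    coords = np.floor(points / voxel_size).astype(np.int64)
    # Unique occupied voxels; `inverse` maps every point to its voxel row.
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    # Max-pool point features within each voxel (symmetric, order-invariant).
    feats = np.full((len(uniq), points.shape[1]), -np.inf)
    np.maximum.at(feats, inverse, points)
    return uniq, feats, inverse
```

The pair (voxel coordinates, pooled features) plays the role of the feature sparse matrix fed to the backbone; in the actual model the pooled vectors are learned PointNet features rather than coordinates.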
In the decoding stage, the encoded features are deconvolved, and the voxel shape is restored layer by layer using the encoding convolution table [
29]. The features with the same voxel shape from the encoding and decoding stages are concatenated as the input to the subsequent deconvolutional layers, fully utilizing the semantic information of high-level features and the detail information of low-level features. The voxel features are diffused into point features through the diffusion module, meaning that all points within the same voxel share the voxel features. The encoding-decoding structure is shown in
Figure 3, where the value annotated at each module indicates the number of voxels output by that module. Because the sparsity of the point clouds varies across tomato plants, the number of output voxels also varies between inputs. The voxel features are diffused to all points using the diffusion module, simultaneously restoring the raw point cloud shape.
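Assuming an `inverse` array that maps each point to the row holding its voxel's feature, the diffusion (expand) step is a single gather; the sketch below is illustrative, not the paper's implementation:

```python
import numpy as np

def expand(voxel_feats, inverse):
    """Diffuse voxel features back to points: every point receives the
    feature of the voxel containing it, so points in the same voxel share
    one feature vector (the ambiguity that the later fusion step removes)."""
    return voxel_feats[inverse]
```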
2.3.2. Skeleton Convolution
In order to further improve the inference speed of the network, we design a 3D skeleton convolution (Skeleton Conv), which strengthens the weights of the convolution kernel skeleton while reducing the model parameters, as illustrated in
Figure 4a. Compared to a traditional 3D convolutional kernel [
30], the parameters of a single skeleton convolution kernel are reduced by nearly 74%, and the parameters of the SpConv module designed in this paper (shown in
Figure 4b) are reduced by 37% compared with the equivalent convolution module (shown in
Figure 4c).
3D Skeleton Conv extracts the key structure from a conventional convolutional kernel to enhance its core weights, effectively matching the organ structures of tomato plant point clouds. During the actual convolution process, the overlap between the convolution kernel and the tomato plant point clouds is concentrated on the horizontal and vertical skeletal parts. Through the residual connection between the skeleton and the conventional convolution kernel, feature extraction of the key parts of tomato plant point clouds is achieved while preserving edge information.
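Our reading of the skeleton kernel, the three axes through the centre of a 3×3×3 kernel, can be checked numerically: keeping only those positions retains 7 of 27 weights, consistent with the roughly 74% per-kernel parameter reduction quoted above. The exact mask construction is an assumption on our part:

```python
import numpy as np

def skeleton_mask(k=3):
    """Binary mask keeping only the three orthogonal axes through the
    kernel centre (our interpretation of the '3D skeleton'). For k=3 this
    keeps 7 of 27 positions, a ~74% reduction per kernel."""
    c = k // 2
    m = np.zeros((k, k, k), dtype=bool)
    m[:, c, c] = True  # vertical axis through the centre
    m[c, :, c] = True  # first horizontal axis
    m[c, c, :] = True  # second horizontal axis
    return m

mask = skeleton_mask(3)
reduction = 1 - mask.sum() / mask.size  # 20/27, i.e. about 74%
```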
2.3.3. Attention-Mechanism Feature Fusion Method
In the later stage of decoding, point features are generated for the points in each voxel by the expand module. However, the module generates identical features for all points within the same voxel, which can cause ambiguity as points with different semantics share the same features. To avoid this issue, the point features generated by the expand module are fused with the point features generated by PointNet at the voxelization stage. Traditional feature fusion methods typically rely on accumulation or concatenation of different features, but these approaches often fail to effectively filter out noise in the features.
Therefore, the feature fusion method designed in this paper is based on the attention mechanism, integrating channel attention and self-attention to compute channel-wise attention. Using this fusion method, useful information can be effectively filtered from a large amount of irrelevant data. The attention mechanism adopted by the model automatically learns the allocation of attention weights, which can be dynamically adjusted according to different inputs, demonstrating flexibility and adaptability [
Compared to traditional attention mechanisms, incorporating a multi-layer perceptron (MLP) enables learning more complex attention patterns, better capturing intricate relationships and features within the input data.
The attention score for each feature type, namely the feature weight vector, is calculated from the voxel feature and the point feature of each point, as shown in Formula (1):

w_b = Softmax(MLP([f_v; f_p]))_b,  b ∈ {v, p}  (1)

where b is the feature type, namely the voxel feature (v) or the point feature (p). Then the product of each weight vector and the corresponding feature vector is computed and summed, yielding the final point feature, as shown in Formula (2):

f = w_v ⊙ f_v + w_p ⊙ f_p  (2)

where f_v is the voxel feature of the point, f_p is the point feature, and w_b is the weight vector. The calculation process is shown in
Figure 5.
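A minimal numpy sketch of the fusion computed by Formulas (1) and (2): a stand-in MLP scores the concatenated branch features, a softmax turns the two scores into weights, and the fused feature is the weighted sum. The helper names and the toy MLP are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(point_feats, voxel_feats, mlp):
    """Fuse the point-branch and (diffused) voxel-branch features of each
    point with attention weights predicted from their concatenation."""
    scores = mlp(np.concatenate([point_feats, voxel_feats], axis=1))  # (N, 2)
    w = softmax(scores, axis=1)  # one weight per branch, summing to 1
    # Weighted sum of the two branch features gives the final point feature.
    return w[:, :1] * point_feats + w[:, 1:] * voxel_feats

# Toy 'MLP': equal scores for both branches -> weights 0.5 / 0.5.
toy_mlp = lambda x: np.zeros((x.shape[0], 2))
fused = fuse(np.array([[2.0, 4.0]]), np.array([[0.0, 0.0]]), toy_mlp)  # [[1.0, 2.0]]
```

In the trained model the MLP is learned, so the weights adapt per point; points inside one voxel then receive distinct fused features, removing the ambiguity introduced by the expand module.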
2.3.4. Loss Function
With the stems of tomato plant point clouds accounting for only 4.56% of the total points, the class distribution is extremely uneven. This imbalance can bias the model towards the majority classes during training, thereby degrading performance. To effectively alleviate the unevenness, a weighted cross-entropy loss function that assigns different weights to different classes is adopted. In addition, the Lovász-Softmax loss directly optimizes the intersection over union (IoU) and dynamically adjusts the contribution of each point, improving the learning of minority classes and enhancing segmentation accuracy, especially in challenging scenarios such as small-object segmentation [
32]. Through integrating these two loss functions, the model benefits from improved classification accuracy while optimizing overall segmentation quality. The composite loss function attends to point-level classification while simultaneously optimizing overall segmentation performance, as shown in Formula (3):

L = L_point + L_voxel  (3)

where L_point is responsible for the training of the point features and L_voxel is responsible for optimizing the training of the voxel branch, the latter consisting of the weighted cross-entropy loss and the Lovász-Softmax loss, as shown in Formula (4):

L_voxel = L_wce + L_Lovász  (4)

where the class weight w_i in the weighted cross-entropy loss is calculated from the class proportion p_i. For the training of the point features, only the weighted cross-entropy loss function is adopted.
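The weighted cross-entropy term can be sketched as follows. The inverse-proportion weighting is our assumption, since the text only states that the class weights are derived from the class proportions, and the Lovász-Softmax term is omitted for brevity:

```python
import numpy as np

def class_weights(labels, n_classes=3):
    """Class weights from class proportions (inverse proportion, our
    assumption). Rare classes such as stem (~4.6% of points) get larger
    weights, so they are not swamped by soil and leaf during training."""
    props = np.bincount(labels, minlength=n_classes) / len(labels)
    w = 1.0 / np.maximum(props, 1e-6)
    return w / w.sum()

def weighted_cross_entropy(probs, labels, weights):
    """Weighted cross-entropy over per-point softmax probabilities."""
    p = np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0)
    return float(np.mean(weights[labels] * -np.log(p)))
```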
2.4. Experimental Platform
The computer used in the experiments is configured with an Intel Core i9-12900 CPU, an NVIDIA RTX 3090 GPU, 64 GB of memory, and a 2 TB solid-state disk, running Ubuntu 20.04. The model is implemented in Python on the PyTorch deep learning framework.
2.5. Model Evaluation Indicators
The model performance is evaluated by IoU (per-class intersection over union) and mIoU (mean IoU) [
33], as shown in Formula (5):

IoU_i = TP_i / (TP_i + FP_i + FN_i)  (5)

where i is the class (soil, stem or leaf), TP_i (true positives) is the number of samples for which both the prediction and the ground truth are class i, FP_i (false positives) is the number of samples predicted as class i whose ground truth is a different class, and FN_i (false negatives) is the number of samples of ground-truth class i predicted as a different class. mIoU is calculated by averaging the IoU values across all classes, as shown in Formula (6):

mIoU = (1/N) Σ_{i=1}^{N} IoU_i  (6)

where N represents the total number of classes.
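The IoU and mIoU metrics of Formulas (5) and (6) can be computed directly from predicted and ground-truth labels; a short numpy sketch (function names ours):

```python
import numpy as np

def iou_per_class(pred, gt, n_classes=3):
    """IoU_i = TP_i / (TP_i + FP_i + FN_i) for each class i."""
    ious = []
    for i in range(n_classes):
        tp = np.sum((pred == i) & (gt == i))
        fp = np.sum((pred == i) & (gt != i))
        fn = np.sum((pred != i) & (gt == i))
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return np.array(ious)

def miou(pred, gt, n_classes=3):
    """mIoU: average of the per-class IoU values (absent classes ignored)."""
    return float(np.nanmean(iou_per_class(pred, gt, n_classes)))
```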
To verify the model performance using IoU and mIoU, the proposed model is compared with three classical models based on different principles: the point-based PointNet [
34], neighbor retrieval-based PointNet++ [
35] and graph-based DGCNN [
36].
2.6. Ablation Experiments
To further verify the effectiveness of VSCAFF, ablation experiments are conducted to analyze the contribution of each module. First, a baseline network consisting solely of an encoder and decoder with conventional sparse convolution is constructed, in which the semantic segmentation head is based entirely on fully connected layers. Then, on top of this baseline, the skeleton convolution kernel integrated into SpConv, the simple fusion of the point feature extraction branch and the voxel branch, and the attention-based feature fusion module are introduced in turn. After incorporating the point feature extraction branch, various feature fusion techniques, including concatenation, accumulation, and the attention-based method described in this paper, are explored. Finally, the weighted cross-entropy loss function (WCE) is compared with the combination of WCE and the Lovász-Softmax loss function.
4. Discussion
4.1. Efficiency of Methods
To achieve efficient semantic segmentation of high-resolution 3D point clouds of tomato plants, a SpConv-based semantic segmentation network integrating PointNet is constructed. By employing skeleton convolution kernels, the number of model parameters is reduced, which in turn enhances inference speed. Additionally, an attention mechanism is utilized to fuse features extracted from two distinct views, thereby accomplishing semantic segmentation of high-resolution tomato plant point clouds. Finally, the composite loss function of Lovász-Softmax and weighted cross-entropy efficiently alleviates the problems caused by the uneven class distribution of tomato plant point clouds, enhancing model performance.
Comparative analysis with several different models, including the point-based PointNet, the neighbor-feature-based PointNet++, and the graph-based DGCNN, shows that the point-based model, due to its lack of neighbor information, delivers the worst segmentation performance, albeit with the fastest inference speed. While incorporating neighbor retrieval significantly improves segmentation accuracy, it also introduces substantial memory and computational overhead. To overcome this issue, the proposed model achieves efficient and accurate semantic segmentation of high-resolution tomato plant point clouds by fusing features extracted from two distinct views.
Because semantic segmentation of tomato seedling point clouds is non-invasive and physically harmless, measurements can be repeated without disrupting plant growth, enabling dynamic monitoring. Precise segmentation facilitates the extraction not only of plant morphological features, such as leaf area, plant height, and stem diameter, but also of 3D structural features such as branch angles and leaf distribution [
37]. This is significant for understanding plant growth mechanisms and genetic variation in precision agriculture. Phenotype-based decisions can optimize planting density, irrigation strategies, and harvest timing, thereby enhancing crop yield and quality. Phenotypic analysis helps identify plant varieties with desirable traits, expedites the breeding process, and supports the cultivation of varieties better adapted to specific environments and market demands [
38]. Continuous monitoring of phenotypic changes allows for the early detection of disease or environmental stress, facilitating timely interventions to minimize losses. Furthermore, personalized management of each plant enables refined management practices. In summary, segmentation of tomato seedling point clouds offers a robust tool for agricultural plant phenotypic analysis, and its application in precision agriculture is pivotal for fostering sustainable agricultural practices and boosting agricultural productivity.
Improving the study of tomato seedling phenotype can guide tomato breeding. Through high-precision phenotypic analysis techniques, the growth characteristics of tomato seedlings can be more accurately identified and quantified, helping breeders select plants with desirable traits [
39]. By integrating phenotypic data with genotypic data, the relationships between specific genes or gene combinations and their associated phenotypic traits can be revealed, helping breeders optimize breeding strategies by selecting genes that positively contribute to desired traits [
38]. Leveraging high-throughput phenotypic analysis techniques, plants exhibiting favorable traits can be rapidly identified and propagated, thereby significantly reducing the breeding cycle and improving breeding efficiency.
4.2. Limitations of Methods
On the other hand, the research still has some limitations, such as the small number of stem points and the lower IoU for stem segmentation. The stem IoU of the compared methods falls within a relatively wide range (28-64%), reflecting the differences in stem segmentation among these methods. Like the stem, petioles and pods are slender and tiny, and more effective and generalizable methods need to be developed for them. When the number of stem points is too small, the following issues may arise: (1) Data sparsity: insufficient points in space or time make it difficult to accurately capture the growth patterns and deformation of the stem. (2) Low statistical accuracy: in analyses such as stem shape analysis or growth rate calculation, the reliability and precision of the results are limited by the scarcity of data points. (3) Difficulty in model fitting: a low number of points makes it challenging for the model to learn accurate shape features, affecting the accuracy of predictions and analysis [
40]. To address the scarcity of stem points, the frequency of data collection can be increased to raise the density of points in both time and space. Interpolation algorithms (e.g., linear interpolation, spline interpolation) can also be used to generate new points between existing ones, increasing data density and continuity. To improve the model's generalization ability and data diversity, synthetically augmented data can be created and combined with real data for model training.
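As an illustration of the interpolation idea above, a minimal numpy sketch that inserts linearly interpolated points between consecutive stem points (assuming the points are ordered along the stem; the function name is ours):

```python
import numpy as np

def densify(points, k=1):
    """Insert k linearly interpolated points between each pair of
    consecutive points, increasing stem point density. A simple stand-in
    for the linear/spline interpolation strategies mentioned in the text."""
    out = [points[0]]
    for a, b in zip(points[:-1], points[1:]):
        # Parameter t sweeps the segment; skip t=0 to avoid duplicating `a`.
        for t in np.linspace(0.0, 1.0, k + 2)[1:]:
            out.append(a + t * (b - a))
    return np.array(out)
```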
4.3. Future Work
In the future, since suboptimal stem segmentation can affect applications requiring precise stem analysis, the accuracy of stem segmentation should be improved. It is feasible to add more feature extraction layers to the existing model, use stronger encoder-decoder structures, and introduce multi-scale feature fusion mechanisms [
41], which can enhance the model's ability to capture stem details and handle the complex relationships between the stem and its surroundings. With these improvements, the stem segmentation performance of the VSCAFF model can be significantly enhanced, making it more practical for agricultural applications. Future work on semantic segmentation of high-resolution 3D tomato plant point clouds can explore more advanced deep learning architectures and integrate multi-modal data. At the same time, lightweight models should be emphasized to suit practical applications, improving segmentation efficiency and accuracy. These directions provide strong support for precision agriculture and promote the development of agricultural intelligence.
5. Conclusions
Because existing semantic segmentation methods often suffer from low precision and slow inference, an encoding-decoding structure incorporating voxel sparse convolution (SpConv) and attention-based feature fusion (VSCAFF) is proposed. The SpConv module reduces the number of parameters to further enhance inference speed. The attention-based feature fusion module avoids the ambiguity arising from points with different semantics sharing voxel features. The composite loss function mitigates class bias during model training and improves performance.
The experimental results indicate that: (1) the VSCAFF model exhibits the best segmentation performance, with an mIoU of 86.96% and IoU for soil, stem, and leaf reaching 99.63%, 64.47%, and 96.72%, respectively, significantly outperforming the other conventional models; (2) the proposed model achieves the highest efficiency (35 ms) with the fewest parameters and lowest memory consumption; (3) the VSCAFF model fully utilizes computational resources, achieving the optimal balance between segmentation performance and speed. These results demonstrate that VSCAFF has significant advantages in semantic segmentation of high-resolution tomato point clouds, providing an efficient and practical solution for high-throughput automatic phenotypic analysis.