1 Introduction

Extracting the central line or skeleton of a worm from images is not an easy task, and it is harder still in low-resolution images when worms aggregate with each other or with plate noise. We use the word “noise” to refer to dark objects or segmentations, residues, stains or worm-like shapes that are not actually worms. All these cases cause classical skeletonization algorithms to fail, leading to erroneous results. As demonstrated in Layana Castro et al. (2020), compared to classical skeletonization techniques, improved techniques can better locate and identify worms in the aforementioned cases, facilitating the automation of monitoring tasks, posture recognition, behavioral studies, etc.

Over the years, many applications have been developed for automatic monitoring and inspection of C. elegans using classical image processing techniques. Many of these applications solve this problem by identifying the central line or skeleton of C. elegans. Skeleton identification reduces the shape of the worms without losing information about their posture. In order to identify the worms within images, some methods have been implemented to extract characteristics of the worm from the skeleton. The best-known characteristics include endpoints Tsibidis and Tavernarakis (2007), smoothness Rizvandi et al. (2008a, 2008b), length Wählby et al. (2012), previous segmentation Uhlmann and Unser (2015); Winter et al. (2016), or a combination of these Layana Castro et al. (2021). Other applications for extracting worm skeletons use neural network techniques Chen et al. (2020); Li et al. (2020b); Hebert et al. (2021).

Methods that use neural networks are becoming more reliable and precise, helping many professionals and researchers to achieve their goals in many fields of science. However, a problem implicit to using these techniques is having a dataset large enough to train and validate the model, in addition to testing the results. Moreover, creating a labeled dataset is usually a time-consuming manual task. In this context, some neural network applications have proven that the use of synthetic data can solve this problem either partially (mixing real and synthetic data) or completely (purely synthetic data).

Generally, in C. elegans assays, these applications are performed using high-resolution images where there is only one worm or very few worms per plate. The advantage is that better processing results are obtained due to the number of pixels and that overlaps or aggregations between worms are avoided. A recent work Hebert et al. (2021) proposes a simulator to generate purely synthetic high-resolution images using reverse skeletonization, a technique that employs small rectangular image patches to generate worm images.

We present a new convolutional neural network skeletonization method trained using purely synthetic low-resolution images. The convolutional network architecture used is the U-Net Plebani et al. (2022). We take advantage of this U-shaped architecture with an encoder and decoder to produce encoded images of worm skeletons from low-resolution grayscale images. Instead of generating new synthetic frames of individual poses with rollings and self-intersections as in Hebert et al. (2021), our simulator generates new synthetic frames of multi-worm poses with intersection behaviors and parallel contacts. The results show a significant improvement of 3.32% compared to a previous work Layana Castro et al. (2020) which improved classical skeletonization methods.

Highlights:

  • A C. elegans skeletonization method is proposed based on U-Net type neural networks with low-resolution images and noise.

  • A new method for generating low-resolution synthetic images is proposed to easily generate a custom-labeled dataset for different C. elegans behaviors.

  • A neural network has been trained with low-resolution synthetic images and successfully tested in the domain of real images.

  • Different U-Net architectures have been compared with an algorithm based on traditional image processing techniques.

2 Related Work

In this section, we review the state of the art of various works related to different network architectures and neural network techniques applied to C. elegans.

2.1 Caenorhabditis elegans and Neural Networks

Caenorhabditis elegans is one of the most widely studied organisms and has acquired great importance in the field of biology Biron and Haspel (2015). Its genome has been annotated in great detail, and research shows that many human diseases have homologues in the genome of this nematode Olsen and Gill (2017), making it an attractive animal model for the study of human pathologies. The advantages offered by this organism with respect to others include its short life cycle (around 21 days), short reproductive period, small size (around 1 mm long), and feeding based on bacterial strains such as Escherichia coli; all of which facilitate its large-scale culture Conn (2017).

In the past, C. elegans assays were monitored manually but nowadays many researchers choose more automatic technologies, thus reducing both processing times and the fatigue of technicians, who would otherwise spend hours looking through the microscope daily. This is where computer-vision applications have a great advantage over these manual practices. Due to the flexible body and the different poses that C. elegans can adopt, many automatic applications use skeletonization techniques to solve problems related to healthspan (Hahm et al., 2015; Le et al., 2020; Di Rosa et al., 2020), lifespan (Jung et al., 2014; Kumar et al., 2019; Puchalt et al., 2020), tracking (Javer et al., 2018; Koopman et al., 2020), behavior monitoring (Yu et al., 2014; Pitt et al., 2019), etc. These techniques are very useful in microscopic images without noise, but low-resolution and noisy images present a challenge difficult to overcome. Other automatic methods use neural networks (NN), which are more robust against these problems.

Neural networks have different topologies and these can perform tasks of classification, segmentation, detection, and so on. All these architectures can automatically extract features, with which C. elegans applications are developed, such as skeletonization Hebert et al. (2021), extreme segmentation Mane et al. (2020), and others (Wiehman and de Villiers, 2016; Wang et al., 2019, 2020; Yu et al., 2021). Furthermore, networks that have an encoder and a decoder have proven capable of solving more complex tasks such as posture classification Javer et al. (2019), skeleton definition in microscope images Chen et al. (2020) or patch acquisition to resolve aggregation Mais et al. (2020).

2.2 U-Nets

The U-Net is a neural network with an encoder and a decoder. Since it first appeared, the U-Net Ronneberger et al. (2015) has been widely used not only in medical applications, for which it was first introduced, but also in the segmentation of animals (Han et al., 2019; Padubidri et al., 2021), objects (Zhao et al., 2019; Wiles & Zisserman, 2019), etc. Various authors have taken this network architecture as a reference and have modified it to achieve greater convergence in training. These modifications consist of adding, removing, or replacing convolutional layers with others, thereby increasing or decreasing the size of the network. However, making the network deeper (increasing the number of convolution layers) or wider does not always result in a better prediction; the outcome depends on the dataset.

SmaAt-UNet Trebing et al. (2021) proposes reducing the size of the base U-Net Ronneberger et al. (2015) by adding spatial-channel attention, and shows that it can achieve similar precision values, which reduces the inference time or the resources needed during the exploitation phase. On the other hand, UMF U-Net Plebani et al. (2022) modifies the standard U-Net by adding BatchNorm and Dropout layers and shows that a careful choice of hyperparameters and other training configurations plays a very important role in network development. Besides these cases, there are other more recent modifications (Alexandre, 2019; Moradi et al., 2019; Tang et al., 2019; Tschandl et al., 2019; Li et al., 2020a; Liu et al., 2020; Qamar et al., 2020; Baheti et al., 2020; McManigle et al., 2020; Huang et al., 2020; Cao et al., 2020; Isensee et al., 2021) that present significant improvements in precision on their respective datasets compared to the standard U-Net.

When working with neural networks, it is problematic to obtain a well-labeled and balanced dataset. This can be a very expensive or even an impossible task; therefore, to alleviate this problem, different data augmentation methods have been developed, as well as simulators for generating synthetic images.

2.3 Data Augmentation and Synthetic Images

The use of synthetic data can help attain network convergence, avoid overfitting, and improve data generalization. On the one hand, the mixture of synthetic data with real data has been shown to help improve the training of neural networks (Pashevich et al., 2019; Doshi, 2019; Bargsten & Schlaefer, 2020; Dewi et al., 2021), even in applications with C. elegans García Garví et al. (2021). In general, these types of techniques convert synthetic images to the domain of real images, in order to achieve similar distributions. This domain change is achieved thanks to architectures such as GAN Li et al. (2020b), encoders, and decoders Chen et al. (2021).

On the other hand, applications that only use synthetic images also provide good results in the domain of real images (Schraml, 2019; Hinterstoisser et al., 2019; Mayershofer et al., 2021). The use of simulators to create purely synthetic images paves the way to creating larger and more variable datasets, making it possible to generate cases or events that occur infrequently in real images. The simulation of C. elegans in low-resolution images is a challenge that, to our knowledge, has not been addressed before. By contrast, Hebert et al. (2021) uses a high-resolution synthetic image simulator of a single worm to train a neural network.

In this work, we aim to generate low-resolution synthetic images to train a segmentation neural network of worm skeletons from 20 to 40 points. Our method obtains outstanding results, outperforming previous skeletonization work that improved classical image processing techniques.

3 Methods

3.1 Strain and Culture Conditions of C. elegans

The C. elegans used were wild type (N2) and CB1370, daf-2(e1370) young adults, obtained from worm eggs incubated at \(20^\circ \)C in 55 mm diameter NGM plates at the Caenorhabditis Genetics Center (CGC), University of Minnesota. Escherichia coli strain OP50 was used as food. To prevent reproduction, FUdR (0.2 mM) was used, and to prevent contamination by fungi, fungizone (1µg/mL) was added and the plates were closed with a lid. The standard method Stiernagle (2006) was followed to remove those plates with contamination. Plates with 10, 15, 30, 60, and 90 nematodes were cultured to obtain greater variability and analyze different types of behavior (aggregation of two or more C. elegans, aggregations with plate noise and occlusions).

3.2 Real-Image Acquisition Method

Images of complete 55 mm diameter Petri dishes were captured. Image acquisition was performed by a laboratory operator at a temperature of \(20^\circ \)C using the hardware and software described in Puchalt et al. (2021). To capture the images, the laboratory operator removed the plates from the incubator and placed them in the capture system (Fig. 1) where the system proceeded to capture a sequence of images of \(1944 \times 1944\) pixels at a frequency of 1Hz. The Escherichia coli (E. coli) OP50 strain was seeded in the center of the plate to prevent C. elegans from moving out of the field of view, either by climbing up the edges of the plates or by positioning themselves near the plate edges. Those plates with condensation on the cover were withdrawn from the image acquisition process.

The abovementioned system Puchalt et al. (2021) was developed with open hardware and software. It uses a Raspberry Pi v1.3 RGB camera (OmniVision OV5647) with a resolution of \(2592 \times 1944\) pixels, a pixel size of \(1.4 \times 1.4\,\mu \hbox {m}\), a field of view of \(53.50^\circ \times 41.41^\circ \), a 1/4” optical size and a focal ratio of 2.9; a lighting system based on a 7” Raspberry Pi screen with a resolution of \(800 \times 480\) at 60 fps and 24-bit RGB color; and a Raspberry Pi 3 as the processing unit (Fig. 1). The mounting process and image capture are detailed in Puchalt et al. (2021). The lighting technique used was active backlighting, which has been shown to be effective for low-resolution C. elegans applications with both the aforementioned Puchalt et al. (2019) and Puchalt et al. (2022) capture systems. Active backlighting reduces the variability of the captured images by keeping gray levels more constant, which made it possible to distinguish C. elegans from the background easily in all images. To capture the real image sequences, our object of interest (Petri dish with worms) was placed between the illumination system and the camera, as described in Puchalt et al. (2021). With this configuration the nematodes have a maximum size of \(40 \times 4\) pixels and a minimum of \(20 \times 3\) pixels, whereas the real worms measure about 1 mm in length. Working with low resolutions may complicate the problem in some cases, but it has advantages in terms of computational and memory efficiency. This resolution is sufficient to automate assays such as lifespan, healthspan, etc.

Fig. 1

Image capture system. Location of the Petri dishes, as well as the other parts of the capture system Puchalt et al. (2021)

3.3 Image Simulation Method

A simulator was designed that can generate new sequences of aggregation behaviors between worms (parallel behaviors and intersections) and augment individual behaviors (free motion, coiling) with geometric transformations. Figure 2 shows a conceptual outline of the synthetic image generation process. The simulator starts from a database containing manually labeled real C. elegans paths and images of Petri dishes without worms. The labels contain the location (X,Y) of each skeleton point, as well as the color and width at that point. To simulate a sequence, K paths are selected at random, rotation and translation transformations are applied to them, and the transformed paths are combined to obtain the desired behaviors, yielding an integrated sequence.

To generate the parallel aggregation behavior, one path is randomly selected and a new path is generated by modifying its color and XY position. The XY position of the first worm is at NR, while the second worm is at NR + W1, where W1 is the smallest width of the worm body (1 or 2 pixels). The simulator generates two types of parallel behavior: in the first, both worms navigate in the same direction; in the second, one of them navigates in the opposite direction. To achieve the second case, the path of one of the two worms is rotated \(180^\circ \), i.e., the worm on the rotated path starts at the end of the path and navigates in parallel with the other worm approximately in the middle of the path.

To generate the intersection aggregation behavior, two paths are randomly selected; the intersection point is a random point on the skeleton of each worm, taken from a random pose of each path. Each path has 30 poses and each worm can have between 20 and 40 skeleton points. We observed in the real dataset that aggregation behaviors (intersection and parallel) are accompanied by speed changes and pauses, so we randomly added these events within the simulator. Pauses are simulated by repeating a pose for two instants of time. Speed changes are obtained by skipping a pose in the path.
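As an illustration only, the pause and speed-change events described above could be sketched as simple list operations on a path, where a path is an ordered list of poses. The function names `add_pause` and `skip_pose` are hypothetical and are not taken from the simulator code in the appendix.

```python
def add_pause(poses, index):
    """Repeat the pose at `index` so it is shown for two consecutive instants."""
    return poses[:index + 1] + [poses[index]] + poses[index + 1:]


def skip_pose(poses, index):
    """Drop the pose at `index`, so the worm covers more distance per frame."""
    return poses[:index] + poses[index + 1:]


# Toy path with four labeled poses (real paths in the text have 30 poses).
path = ["pose0", "pose1", "pose2", "pose3"]
paused = add_pause(path, 1)   # pose1 appears twice in a row
faster = skip_pose(path, 2)   # pose2 is skipped
```

In a full simulator the indices would presumably be drawn at random around the aggregation event; that sampling policy is not specified here and is left out of the sketch.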

Fig. 2

Conceptual outline of the synthetic image generation process

Fig. 3

Development of synthetic images. a Random generation of the track of a worm. b Synthetic image with 16 worms on the plate

The trajectory simulation consisted of applying a rotation angle (\(\theta \)) to all the skeletons of a worm sequence and moving that sequence to a random X-Y point inside the plate. The values of NR, \(\theta \) and CC, the centroid of the Petri dish (pre-process info), were needed to calculate the new X-Y position (\(N_{x}\), \(N_{y}\)) of the worm trajectory (Eqs. 2 and 3; Fig. 3a and b). The angle \(\theta \) and NR were randomly generated in the ranges \([0, 2\pi ]\) and \([0, R]\), respectively. The value of R was obtained from the difference between the plate radius (P) and the diameter of the trajectory found in the pre-process info file (2T), Eq. 1.

$$\begin{aligned} R= & {} P - 2T \end{aligned}$$
(1)
$$\begin{aligned} N_{x}= & {} CC_{x} + NR\cos {\theta } \end{aligned}$$
(2)
$$\begin{aligned} N_{y}= & {} CC_{y} + NR\sin {\theta } \end{aligned}$$
(3)
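A minimal sketch of Eqs. 1 to 3 is given below, assuming that the random radius (NR in the text) and the angle \(\theta \) are drawn uniformly. The function name `place_trajectory` and the numeric values in the usage lines are hypothetical.

```python
import math
import random


def place_trajectory(plate_radius, traj_radius, centroid, rng):
    """Return a random (x, y) position for a trajectory inside the plate.

    plate_radius: P, traj_radius: T, centroid: (CC_x, CC_y), all in pixels.
    """
    max_radius = plate_radius - 2 * traj_radius          # Eq. 1: R = P - 2T
    if max_radius < 0:
        raise ValueError("trajectory does not fit inside the plate")
    nr = rng.uniform(0.0, max_radius)                    # NR in [0, R]
    theta = rng.uniform(0.0, 2.0 * math.pi)              # theta in [0, 2*pi]
    n_x = centroid[0] + nr * math.cos(theta)             # Eq. 2
    n_y = centroid[1] + nr * math.sin(theta)             # Eq. 3
    return n_x, n_y


# Hypothetical values: plate radius 900 px, trajectory radius 50 px.
rng = random.Random(0)
x, y = place_trajectory(900, 50, (972, 972), rng)
```

Because NR never exceeds R, the returned point always lies within a distance R of the plate centroid, which is what keeps the whole trajectory inside the dish.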

To rotate the worm skeletons, a random angle (\(\alpha \)) between \(0 - 2\pi \) was generated. The rotation operation was performed by multiplying each skeleton pixel (Eq. 5) by a rotation matrix (Eq. 4):

$$\begin{aligned} M_{a}= & {} \begin{bmatrix} \cos {\alpha } &{} \sin {\alpha } \\ -\sin {\alpha } &{} \cos {\alpha } \end{bmatrix} \end{aligned}$$
(4)
$$\begin{aligned} P_{i}= & {} \begin{bmatrix} P_{x} \\ P_{y} \end{bmatrix} \end{aligned}$$
(5)
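The product of Eqs. 4 and 5 can be sketched as follows. This is an illustration that rotates about the origin; a practical implementation would presumably subtract a pivot point (for example the trajectory center) first, which is an assumption not stated in the text. The function name `rotate_skeleton` is hypothetical.

```python
import math


def rotate_skeleton(points, alpha):
    """Apply M_a = [[cos a, sin a], [-sin a, cos a]] to each (x, y) point."""
    c, s = math.cos(alpha), math.sin(alpha)
    return [(c * x + s * y, -s * x + c * y) for x, y in points]


skeleton = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
rotated = rotate_skeleton(skeleton, math.pi / 2)
```

With the sign convention of Eq. 4, the point (1, 0) maps to (0, -1) for \(\alpha = \pi /2\). Since \(\alpha \) is sampled over the full circle, the direction of rotation does not affect the distribution of generated poses, and distances between skeleton points are preserved.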

Then, using the integrated sequence, the image sequence is generated by drawing the paths on an empty Petri dish image. The paths are drawn by placing, at each skeleton pixel, a circle whose diameter equals the width value stored in the database for that skeleton point. To color the circle, the stored value of the skeleton point is averaged with the background. Finally, a blur filter (3 \(\times \) 3) is applied to the images. This filter was essential to bring the synthetic image domain closer to the real image domain. In addition, it favored convergence in the training of the networks. In this step, in addition to the gray image sequence, ground-truth masks are also generated. This simulator, which has been designed to obtain sequences of behaviors, allows an image to be selected from the generated sequence as input to the network. The details of the code implementation have been added to Appendix 1.
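The drawing step can be sketched in plain Python as shown below. Two assumptions are made that the text does not specify: the gray value is combined with the background by an equal-weight average, and the 3 \(\times \) 3 blur is a box (mean) filter. The function names `draw_worm` and `box_blur_3x3` are hypothetical, and images are represented as lists of rows.

```python
def draw_worm(background, skeleton):
    """Draw filled circles on a copy of `background`.

    skeleton: list of (x, y, width, gray) tuples, one per skeleton point.
    """
    img = [row[:] for row in background]
    h, w = len(img), len(img[0])
    for x, y, width, gray in skeleton:
        r = width / 2.0
        reach = int(r) + 1
        for yy in range(max(0, y - reach), min(h, y + reach + 1)):
            for xx in range(max(0, x - reach), min(w, x + reach + 1)):
                if (xx - x) ** 2 + (yy - y) ** 2 <= r * r:
                    # Assumed equal-weight average with the original background.
                    img[yy][xx] = (gray + background[yy][xx]) / 2.0
    return img


def box_blur_3x3(img):
    """Mean filter over a 3x3 window, truncated at the image border."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out


plate = [[200] * 7 for _ in range(7)]          # toy 7x7 empty plate
drawn = draw_worm(plate, [(3, 3, 3, 40)])      # one point, width 3, gray 40
blurred = box_blur_3x3(drawn)
```

The blur softens the sharp boundary between the drawn discs and the plate, which is the effect the text credits with bringing the synthetic images closer to the real image domain.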

3.4 Classical Skeletonization Method

Classical skeletonization techniques have proven easy to implement for extracting shapes and predicting worm behaviors, but they are problematic when worms coil or aggregate with each other or with plate noise. When this happens, part of the skeleton is absent or displaced, because these techniques thin the segmentation pixels until only the central line remains. In these cases, skeleton prediction errors occur, as shown in Fig. 4. If the aggregation affects a large part of the worm body, it can lead to a large error.

Fig. 4

Classical skeletonization of problematic cases. a Grayscale image of worm aggregated with noise. b Grayscale image of two worms aggregated at one end and part of the body. c Gray image of worm coiled upon itself. d, e, f Result of classical skeletonization of images a, b, c, respectively. The white pixels show the segmentation using a threshold of 35, while the blue pixels show the result of skeletonizing that segmentation

3.5 Skeletonization Method Using Improved Skeleton

This method involves obtaining improved skeletons (Fig. 5a–c) from the width and length characteristics. These characteristics are obtained during a previous preprocessing step, when the worms are free and not coiled Layana Castro et al. (2020). The advantage of this technique, unlike classical skeletonization techniques, is that it can separate connected or coiled worms through new skeletons. In general, other applications cancel tracks where aggregation between worms, aggregation with noise, and coiling occur; however, the improved technique is very useful in these cases, recognizing skeletons (poses) and predicting behaviors. Layana Castro et al. (2020) showed that an improved skeleton, together with worm-specific features such as color and temporal image features, can solve problems of aggregation between worms or with noise in image sequences.

Fig. 5

Skeletonization with enhanced algorithm (ISA) Layana Castro et al. (2020). a, b, c Skeletonization result with an improved algorithm (ISA) of the gray images from Fig. 4a, b, c, respectively. The white pixels show the segmentation using a threshold of 35, while the blue pixels show the result of skeletonizing that segmentation

Fig. 6

Image pipeline through U-Net architecture. The blocks of the U-Net architecture were made using the PlotNeuralNet tool Iqbal (2018). The image is divided into 4 parts, each part enters the network and the result is reassembled to form a single image

3.6 Proposed Skeletonization Method

The model used for the segmentation of worm skeletons was the convolutional neural network U-Net. Different U-Net architectures were analysed to compare their performance. Comparison was made of the models U-Net standard Ronneberger et al. (2015), Alexandre’s U-Net Alexandre (2019), SmaAt-UNet, U-Net with DSC, U-Net with CBAM, U-Net with DSC, CBAM Trebing et al. (2021), UMF U-Net Plebani et al. (2022) and all showed good results.

Figure 6 shows all the blocks used with the different architectures. In the standard U-Net Ronneberger et al. (2015), the Doubleconv block has neither the white BatchNorm2d block nor the yellow CBAM blocks, and the purple Down4 block has no Dropout blocks. Alexandre’s U-Net Alexandre (2019) is the same as the standard U-Net but includes the BatchNorm2d block inside the Doubleconv block. UMF U-Net Plebani et al. (2022) is the same as Alexandre’s U-Net but adds the Dropout blocks to the purple Down4 block. SmaAt-UNet Trebing et al. (2021), on the other hand, has the spatial-channel attention blocks and no Dropout blocks in the purple Down4 block. The models with DSC replace the Doubleconv blocks with depthwise-separable convolution blocks.

These models predict three classes: background, worm ends, and worm body. The background class (red pixels in Fig. 6) comprises all pixels that do not correspond to worm skeletons, such as the plate edge and interior of the Petri dish, dark spots, residues inside the dish, etc. The worm-ends class (green pixels in Fig. 6) includes the pixels corresponding to the head and tail of the worm skeleton. These span between 5 and 10 pixels at each end and are generally lighter (higher grayscale intensity). Finally, the worm-body class (blue pixels in Fig. 6) comprises the pixels in the center of the skeleton, which are darker than the worm-end pixels (lower grayscale intensity).

Given the dimensions of the input image and the limitations of the hardware to train and validate the architecture, both the input image and the ground-truth image were divided into 4 equal parts. To reconstruct the prediction image, each output prediction part of the network was joined in the same order as the input image, as shown in Figs. 6 and 7.
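The split and reassembly can be sketched as follows, assuming the four equal parts are the quadrants of a 2 \(\times \) 2 grid (the text only states that there are four equal parts). The function names `split_quadrants` and `join_quadrants` are hypothetical, and images are represented as lists of rows.

```python
def split_quadrants(img):
    """Split an image into top-left, top-right, bottom-left, bottom-right."""
    half_h, half_w = len(img) // 2, len(img[0]) // 2
    return [
        [row[:half_w] for row in img[:half_h]],   # top-left
        [row[half_w:] for row in img[:half_h]],   # top-right
        [row[:half_w] for row in img[half_h:]],   # bottom-left
        [row[half_w:] for row in img[half_h:]],   # bottom-right
    ]


def join_quadrants(parts):
    """Reassemble the quadrants in the same order used by split_quadrants."""
    tl, tr, bl, br = parts
    top = [left + right for left, right in zip(tl, tr)]
    bottom = [left + right for left, right in zip(bl, br)]
    return top + bottom


image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 image
parts = split_quadrants(image)
restored = join_quadrants(parts)
```

Under this assumption, a \(1944 \times 1944\) input yields four \(972 \times 972\) tiles, and the same split would be applied to the ground-truth image so that tiles and labels stay aligned.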

Fig. 7

Coding of the output image from the network. a, b, c Resulting skeletons using the UMF U-Net Plebani et al. (2022) network from the gray images in Fig. 4a, b, c, respectively. d, e, f Pixel encoding using the maximum value of the RGB channels. Red pixels are background pixels, blue pixels are worm body pixels, and green pixels are worm-end pixels. The results obtained with the rest of the models are similar to these

4 Evaluation Method

To evaluate all the U-Net models, two datasets were used, a synthetic dataset that was used in the training and validation stages of the networks, and a dataset of real images to test the results in the cases of coiling, aggregation between worms, and aggregation with noise (Fig. 8).

Fig. 8

Synthetic and real dataset pipeline. The synthetic dataset was used to train and validate the U-Net neural network. The trained network was used to test the real image domain

To evaluate the real dataset, the Jaccard index, also known as intersection over union (IoU), and the Euclidean distance were used. The IoU coefficient measures the accuracy of a prediction with respect to a ground-truth Koul et al. (2019). As its name indicates, it is obtained by dividing the area of the intersection of the prediction and the ground-truth by the area of their union, Eq. 6. On the other hand, the Euclidean distance measures the average error in pixels of a prediction with respect to a ground-truth, Eq. 7.

$$\begin{aligned} IoU= & {} \dfrac{\sum _{}^{}{P_{w1} \bigcap P_{w2}}}{\sum _{}^{}{P_{w1} \bigcup P_{w2}}} \end{aligned}$$
(6)
$$\begin{aligned} E.D.= & {} {{\sum _{i=1}^{nw}{\dfrac{ \sqrt{(X1_{i} - X2_{i})^2 + (Y1_{i} - Y2_{i})^2}}{nw}}}} \end{aligned}$$
(7)
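Eqs. 6 and 7 can be sketched as follows, representing each pixel set as a Python set of (x, y) coordinates and each skeleton as an ordered list of points. Returning 0.0 for two empty sets is a convention chosen here, not something stated in the text, and the function names are hypothetical.

```python
import math


def iou(pixels_a, pixels_b):
    """Eq. 6: size of the intersection divided by size of the union."""
    a, b = set(pixels_a), set(pixels_b)
    union = a | b
    if not union:
        return 0.0   # assumed convention for two empty sets
    return len(a & b) / len(union)


def mean_euclidean_distance(points_a, points_b):
    """Eq. 7: average distance between corresponding points (same index)."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("point lists must be non-empty and of equal length")
    total = sum(math.hypot(x1 - x2, y1 - y2)
                for (x1, y1), (x2, y2) in zip(points_a, points_b))
    return total / len(points_a)


overlap = iou({(0, 0), (1, 0), (2, 0)}, {(1, 0), (2, 0), (3, 0)})
error = mean_euclidean_distance([(0, 0), (3, 4)], [(0, 0), (0, 0)])
```

In the toy example, two of the four distinct pixels are shared, so the IoU is 0.5, and the two point pairs are 0 and 5 pixels apart, so the mean distance is 2.5 pixels.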

The Precision and Recall metrics were used to evaluate the results of the detection experiment, and the MOTA metric was used to evaluate the results of the tracking experiment. To obtain the Precision (Eq. 8) and Recall (Eq. 9) metrics, three parameters were used: TP (true positives), FP (false positives) and FN (false negatives). To obtain the MOTA metric (Eq. 10), the FN, FP, IDS and GT parameters were used. GT was the total number of worms in the aggregation, and the IDS value was increased by 1 when the body of a predicted worm overlapped more with another worm than with its respective GT. For the overlap, the IoU value and a threshold of 0.5 were considered.

$$\begin{aligned} Precision= & {} \frac{TP}{TP + FP} \end{aligned}$$
(8)
$$\begin{aligned} Recall= & {} \frac{TP}{TP + FN} \end{aligned}$$
(9)
$$\begin{aligned} MOTA= & {} 1- \dfrac{\sum _{t}^{}{FN_{t} + FP_{t} + IDS_{t}}}{\sum _{t}^{}{GT_{t}}} \end{aligned}$$
(10)
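For illustration, Eqs. 8 to 10 translate directly into the sketch below. The per-frame tuple layout `(fn, fp, ids, gt)` and the counts in the usage lines are hypothetical.

```python
def precision(tp, fp):
    """Eq. 8."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Eq. 9."""
    return tp / (tp + fn)


def mota(frames):
    """Eq. 10. `frames` is a list of (fn, fp, ids, gt) tuples, one per frame t."""
    errors = sum(fn + fp + ids for fn, fp, ids, _ in frames)
    total_gt = sum(gt for _, _, _, gt in frames)
    return 1.0 - errors / total_gt


# Hypothetical counts for a two-frame sequence with 4 worms per frame.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=2)
m = mota([(1, 0, 0, 4), (0, 1, 1, 4)])
```

Here the two frames accumulate three errors over eight ground-truth worms, so MOTA is 1 - 3/8 = 0.625. Note that the errors are summed over all frames before dividing, rather than averaging per-frame ratios.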

The synthetic dataset labels (ground-truth) were obtained automatically from the simulator, while the real dataset labels (ground-truth) were obtained by manually labeling worm skeletons. This task was performed using a pixel labeling application. The pixels of each worm skeleton were selected one by one until the skeleton was complete.

The predictions are not always exact for all the pixels. Usually, one or more pixels are displaced with respect to the real label (ground-truth), which yields low metric values and incorrect skeleton indicators. When the worm is 3 pixels wide, the skeleton pixel is the center pixel, but if the worm width is an even number of pixels, there is no single center pixel to select, which may result in false errors between manual labeling and pixel predictions of the skeleton. To solve this problem and obtain a better measurement of results, we decided to use the worm body to obtain a more significant IoU value that would better reflect the prediction of the skeleton. To recover the shape and body of the worm, a dilation operation was performed on all the pixels of the skeleton with a disk of radius 2 (approximate diameter of the worm). This operation was performed for all manual labels. IoU and Euclidean distance metrics have been calculated for the following classes: worm ends, worm body and worm (fusion of body and ends).
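The effect of the dilation on the IoU can be sketched with pixel sets, as below. The disk is taken here as all integer offsets within Euclidean distance 2 of the center, which is an assumed discretization, and the function names are hypothetical.

```python
def disk_offsets(radius):
    """Integer offsets (dx, dy) inside a disk of the given radius."""
    span = range(-radius, radius + 1)
    return [(dx, dy) for dx in span for dy in span
            if dx * dx + dy * dy <= radius * radius]


def dilate(pixels, radius=2):
    """Binary dilation of a set of (x, y) pixels with a disk."""
    offsets = disk_offsets(radius)
    return {(x + dx, y + dy) for x, y in pixels for dx, dy in offsets}


def iou(a, b):
    """Intersection over union of two non-empty pixel sets."""
    return len(a & b) / len(a | b)


# A straight 20-pixel skeleton and a prediction displaced by one pixel.
label = {(x, 5) for x in range(20)}
prediction = {(x, 6) for x in range(20)}

raw_iou = iou(label, prediction)
body_iou = iou(dilate(label), dilate(prediction))
```

In this toy case the raw skeletons share no pixels, so their IoU is 0 even though the prediction is only one pixel off, whereas the dilated bodies overlap substantially (IoU of roughly 0.64). This is the kind of false error that the body-based measurement is meant to avoid.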

5 Experiments and Results

5.1 Method Comparison

In this experiment, different U-Net architectures were compared to find the most accurate one for our case. In addition, it was compared with the results of a method based on traditional computer vision techniques.

Table 1 Synthetic dataset loss and IoU results
Table 2 Average IoU results of the actual dataset

As previously mentioned, a synthetic dataset was used to train and validate the networks and a real dataset to test all the results. For the synthetic dataset, 400 sequences of 30 images were simulated, giving a total of 12000 images; 70% were used to train the network (8400) and the other 30% were used for the validation stage (3600). Each image of the sequence had 16 worms per plate, in which different cases of aggregation were simulated. The hardware used for training and validation of the different networks was a Gigabyte Technology Z390 AORUS PRO machine with an Intel(R) Core(TM) i9-9900KF CPU @ 3.60GHz x16, 32GB of RAM, and an NVIDIA GeForce RTX 2080 Ti graphics card with 4352 CUDA cores, running the Ubuntu 19.04 64-bit operating system. The implementation was carried out in a Python version 3.7.5 environment, using the PyTorch 1.18, OpenCV 4.5.4, and SWIG 3.2 libraries. Different U-Net architectures were compared and the training for each of these took about 48 h with the abovementioned hardware. The hyper-parameters used for the training and validation phase were Batch_size = 1, num_workers = 1, maximum epoch = 10. The optimizer used was ADAM with learning rate = 0.0001, betas = [0.95, 0.999], eps = 1e-8, CrossEntropyLoss() as the loss function, and the ReduceLROnPlateau scheduler with hyper-parameters mode = 'min' and patience = 2. All the U-Net architectures used were trained and evaluated using the same training and validation dataset. After each training, the model with the lowest loss value in the validation phase was selected to evaluate all the results. The average loss resulting from the loss function CrossEntropyLoss() and average IoU values for each model are shown in Table 1.

For the real dataset, 4500 images of Petri dishes were analyzed and those difficult cases were selected in which the worms coiled on themselves, aggregated to each other, or presented noise from the dish, in 90, 157, and 417 images, respectively. In order to obtain all this variability, the images contained 10, 15, 30, 60, and 90 worms.

The output images of all U-Net models were encoded using the maximum RGB channel value. Once the coded images of prediction and ground-truth were obtained, the pixels of the worm body and the worm ends were joined to form a single skeleton. The shape of the worm was then recovered as indicated in the evaluation method, and the precision result was obtained using the IoU index. Table 2 shows the total parameters of each model and the average of the results obtained from the evaluation with these models and for all the cases analyzed. Appendix 2 shows the metrics obtained for each of the problematic cases: aggregation between worms (Table 4), aggregation with noise (Table 5) and rolled cases (Table 6).
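The encoding and the merging of body and ends can be sketched as follows, assuming the network output is available as per-pixel (R, G, B) scores with red for background, green for worm ends and blue for worm body, as in the color coding of Fig. 7. Ties are resolved in favor of the lower channel index, which is a choice made only for this sketch, and the function names are hypothetical.

```python
BACKGROUND, ENDS, BODY = 0, 1, 2   # channel order R, G, B


def encode_max_channel(pred):
    """Map each (r, g, b) score triple to the index of its largest channel."""
    return [[max(range(3), key=lambda c: px[c]) for px in row] for row in pred]


def skeleton_pixels(labels):
    """Join worm-end and worm-body pixels into a single skeleton pixel set."""
    return {(x, y)
            for y, row in enumerate(labels)
            for x, lab in enumerate(row)
            if lab != BACKGROUND}


# Toy 2x2 prediction with hypothetical scores.
pred = [[(0.9, 0.05, 0.05), (0.1, 0.7, 0.2)],
        [(0.2, 0.1, 0.7), (0.8, 0.1, 0.1)]]
labels = encode_max_channel(pred)
skeleton = skeleton_pixels(labels)
```

The resulting pixel set is what would then be dilated and compared with the manually labeled skeleton using the IoU.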

As shown, the results of the networks were also compared with the values obtained in a previous work (ISA) Layana Castro et al. (2020). The average results showed that for the real dataset UMF U-Net is the best skeletonization method, showing a statistically significant difference with the previous work Layana Castro et al. (2020) (Fig. 9). It should be noted that it is the best option for cases of aggregation with noise. Figure 10 compares the previous work with the UMF U-Net architecture using a box plot. Figure 11 shows the results obtained on an image section with the different architectures used. Although the results are similar, the UMF U-Net (Fig. 11d) predicts more connected skeletons than the other architectures.

Fig. 9

Statistical analyses. a Normality test on the difference of methods (ISA-UMF U-Net). The p-value obtained was 5.88E-61, less than the significance value of 0.05; thus the null hypothesis was rejected and the alternative hypothesis H1 was accepted (the data did not come from a normal distribution). Once the alternative hypothesis was accepted, the Wilcoxon signed-rank test was used to evaluate both methods. b The Wilcoxon signed-rank test table shows the difference between two related samples across positive, negative and tied ranks. c The p-value obtained with the Wilcoxon signed-rank test was 0.0040, less than the significance value of 0.05, thus concluding there was a statistically significant difference between both models

Fig. 10

Comparison of previous work Layana Castro et al. (2020) with UMF U-Net Plebani et al. (2022). The green line indicates the mean in both graphs and the gray line indicates the median. ISA N = 664, mean = 0.6936, median = 0.7635, standard deviation = 0.1649, variance = 0.0272. UMF U-Net N = 664, mean = 0.7279, median = 0.7430, standard deviation = 0.0871, variance = 0.0076

5.2 Real Versus Synthetic Image Training

To demonstrate the need for a simulator, an experiment was performed comparing training with synthetic images alone against training with the available real dataset using standard data augmentation. The labeling effort required to generate the starting database for the simulator was also analyzed. For this purpose, we compared the accuracy obtained by the network when the simulation was generated from different numbers of starting worms. The model trained in all cases was the UMF U-Net. The network was trained for 20 epochs using the same hyperparameters as in the U-Net comparison test. The training with real images used only data augmentation based on affine transformations: rotations (90, 180 and 270 degrees), vertical translations (1–30 pixels up/down), horizontal translations (1–30 pixels left/right), and brightness and contrast changes. Table 3 shows the results obtained using the IoU metric and the Euclidean distance between skeleton points, broken down by case (aggregation, aggregation with noise and rolled). As can be seen, training with real images obtains worse results than all the models trained with synthetic images. This result justifies the use of the simulator, since training with only 30 base poses already gives good results (IoU = 0.6850), much better than those obtained by training with augmented real images (IoU = 0.4265). Regarding the comparison between the different models trained with synthetic images, the higher the number of base poses of the simulator, the better the results obtained. We also observe that the higher the number of base poses, the lower the number of epochs required for model convergence. However, these differences are modest, which means that good results can be obtained with less labeling effort to generate the base poses.
In this experiment, we also wanted to analyze the learning of the rolling case. Comparing the accuracy obtained in the trial using 30 base poses (none rolled up) with that of the trial using 18810 base poses (3780 rolled up), the improvement is small. The fact that the network can skeletonize rolled poses without having seen such examples in training may be because it learns from the cases of parallel motion and cross aggregation, where it also has to solve the problem of skeletonizing wide areas.
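The augmentations listed for the real-image training can be sketched as below. This is only an illustration under several assumptions that the text does not state: all transformations are applied together to each sample, uncovered pixels are filled with zeros after a translation, and the contrast and brightness ranges are arbitrary. The function names are hypothetical.

```python
import numpy as np

def _span(shift, size):
    """Source and destination slices along one axis for a shift of `shift` pixels."""
    src = slice(max(0, -shift), max(0, min(size, size - shift)))
    dst = slice(max(0, shift), max(0, min(size, size + shift)))
    return src, dst

def translate(img, dy, dx):
    """Shift img by (dy, dx) pixels; uncovered pixels are set to zero (assumed fill)."""
    out = np.zeros_like(img)
    src_y, dst_y = _span(dy, img.shape[0])
    src_x, dst_x = _span(dx, img.shape[1])
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def augment(img, mask, rng):
    """Apply one random rotation, translation and brightness/contrast change."""
    k = int(rng.integers(1, 4))  # 1, 2 or 3 quarter turns: 90, 180 or 270 degrees
    img, mask = np.rot90(img, k), np.rot90(mask, k)
    # Translations of 1-30 pixels in a random direction along each axis.
    dy = int(rng.integers(1, 31)) * int(rng.choice([-1, 1]))
    dx = int(rng.integers(1, 31)) * int(rng.choice([-1, 1]))
    img, mask = translate(img, dy, dx), translate(mask, dy, dx)
    contrast = rng.uniform(0.8, 1.2)   # assumed range
    brightness = rng.uniform(-20, 20)  # assumed range, in gray levels
    img = np.clip(contrast * img.astype(np.float32) + brightness, 0, 255)
    return img.astype(np.uint8), mask

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
label = (image > 200).astype(np.uint8)
aug_image, aug_label = augment(image, label, rng)
```

The geometric transformations are applied identically to the image and its label so that the skeleton annotation stays aligned, while the intensity change affects the image only.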

Fig. 11

Comparison of skeletons obtained with the different U-Net architectures. An image was selected and the encoded results were obtained for all the architectures; the same section was then cropped from all the images. a Grayscale image. b Result with standard U-Net Ronneberger et al. (2015), c Result with Alexandre’s U-Net Alexandre (2019), d Result with UMF U-Net Plebani et al. (2022), e, f, g, h Result with SmaAt-UNet Trebing et al. (2021) (SmaAT Ds, SmaATDs At, SmaATDs At 4CBAMs and SmaAT, respectively)

Table 3 Results of the experiment Real versus synthetic image
Fig. 12

Noise-generated connected components detection. a, d UMF U-Net network output encoding. b, e Joining method between final pixels of head/tail and body classes. c, f Detection result

5.3 Detection Application

This experiment analyzes the accuracy of a detection method based on the proposed skeletonization method. For this experiment, 120 images from the real dataset were used. These images contained different aggregation behaviors between worms as well as individual behaviors, and all of the plates had high noise content. The selected images were passed through the trained UMF U-Net network and encoded as indicated in Fig. 7d, e, and f (Fig. 12a, d). We then checked for broken worm skeletons, that is, whether there was a distance greater than 1 and less than 4 pixels between the final pixels of the head/tail and body classes; if so, both endpoints were joined (Fig. 12b, e). After this, each connected component was analyzed to obtain possible skeleton solutions. To do so, a recursive algorithm used the endpoints and intersections of connected points to traverse the connected component and obtain sequences of between 20 and 40 points. All these possible solutions were analyzed by an optimizer, which evaluated whether they were individual worms or aggregations between worms and obtained the best solution for each case using two evaluation criteria: minimum skeleton length (20 pixels) and completeness (the solutions occupy the entire connected component). A connected component was considered an aggregation if it had more than 3 head/tail pixel connected components (green pixels) and more than 30 pixels. The Precision and Recall metrics, detailed in the evaluation method section, were used to measure the accuracy of this experiment. The results were 83.50% and 73.77%, respectively. For cases of cross aggregations and individual behaviors, the precision and recall values were very high, close to 1. The precision errors were due to problems with the method of joining body and endpoint stubs.
The recall errors were due to noise-generated connected components that complied with worm characteristics (Fig. 12c, f). In order to use the proposed skeletonization method in detection applications, the joining method and noise filtering should be improved.
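The joining rule for broken skeletons (a gap greater than 1 and less than 4 pixels between terminal pixels of the head/tail and body classes) can be sketched as follows. The text does not say how the gap is filled, so the straight-line bridge is an assumption, as are the function names and the toy coordinates.

```python
import math

def pairs_to_join(end_pixels, body_pixels, min_d=1.0, max_d=4.0):
    """Pairs (end, body) whose Euclidean distance lies strictly between min_d and max_d."""
    return [(e, b) for e in end_pixels for b in body_pixels
            if min_d < math.dist(e, b) < max_d]

def bridge(p, q):
    """Pixels on the straight segment from p to q, both included (assumed joining rule)."""
    steps = max(abs(q[0] - p[0]), abs(q[1] - p[1]))
    if steps == 0:
        return [p]
    return [(round(p[0] + (q[0] - p[0]) * t / steps),
             round(p[1] + (q[1] - p[1]) * t / steps))
            for t in range(steps + 1)]

# Hypothetical terminal pixels as (row, column) coordinates.
head_tail_ends = [(5, 5)]
body_ends = [(5, 8), (5, 6), (9, 9)]

joins = pairs_to_join(head_tail_ends, body_ends)
filled = [bridge(e, b) for e, b in joins]
```

Only the pair at distance 3 is joined: the pixel at distance 1 is already adjacent, and the one at distance of about 5.7 is too far to be treated as part of the same broken skeleton.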

5.4 Tracking Application

This experiment analyzes the accuracy of a tracking method based on the proposed skeletonization method. In addition, the IoU and computational cost results were compared with the WT-ISA method Layana Castro et al. (2021). The WT-ISA tracking method is based on a skeletonization method Layana Castro et al. (2020) and an optimization method. The skeletonization method was designed specifically to solve rolling and aggregation cases. In the aggregation cases, the optimizer obtains the best solution among all possible combinations of endpoints and branches (possible solutions). The proposed tracking method consists of the UMF U-Net skeletonization method and the same optimization method used in WT-ISA. The accuracy evaluation was performed with paths from the real dataset, which was divided into individual behaviors and aggregations. The aggregations dataset contains 72 paths of aggregations between 2, 3 and 4 worms and with plate noise. The paths of individual behaviors, such as rolling and self-occlusions, obtained IoU and MOTA values close to 1. The results obtained with the aggregations dataset showed an average IoU value of 0.66, a MOTA value of 0.70 and an identity loss (IDS) of 7%. Of the identity losses, 95% occurred in aggregations between worms (mostly in parallel) and 5% in aggregations with noise. In most of these cases, the identity of each individual was recovered after the worms separated. Regarding the comparison, the WT-ISA method Layana Castro et al. (2021) obtained an IoU value of 0.70 and generated 75626 possible solutions for the 72 aggregation paths, while the proposed method obtained an IoU value of 0.66 and generated 10997 possible solutions, about 85% fewer. Demonstration videos using the proposed skeletonization method were uploaded to GitHub, together with a Google Colab notebook that shows worm tracking in image sequences (https://github.com/playanaC/Skeletonizing_Unet/blob/main/Demo_videos.ipynb).
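For readers unfamiliar with MOTA, the sketch below uses the standard CLEAR MOT definition, in which misses, false positives and identity switches are summed over all frames and divided by the total number of ground-truth objects. The exact formulation used in this work is the one given in the evaluation method section; the function name and per-frame counts here are hypothetical.

```python
def mota(false_negatives, false_positives, id_switches, ground_truth):
    """MOTA = 1 - (sum of FN + FP + IDS over frames) / (sum of ground-truth objects)."""
    total_gt = sum(ground_truth)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / total_gt

# Hypothetical counts for a 4-frame path with 3 worms per frame.
gt_per_frame = [3, 3, 3, 3]
fn_per_frame = [0, 1, 0, 0]
fp_per_frame = [0, 0, 1, 0]
ids_per_frame = [0, 0, 0, 1]

score = mota(fn_per_frame, fp_per_frame, ids_per_frame, gt_per_frame)
```

With 3 errors over 12 ground-truth objects, the toy path scores 0.75. Because all three error types share one denominator, a single identity switch costs as much as a missed worm.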

6 Discussion

Noise on the plate (dark objects, residues, stains, or worm shapes that are not worms) usually appears as constant gray levels across the plate surface, so it is distinguishable from worm bodies, which have different characteristics: head and tail pixels are lighter than body pixels. However, these intensities are altered when a worm aggregates with noise on the plate, producing connected segmentations and causing fixed-threshold skeletonization techniques to fail. Neural networks, in contrast, are capable of learning features that are robust to changes in illumination and intensity. In this scenario, the variability of the dataset is key to network training: changes in skeleton intensity, as well as variability in the social and individual cases included in training, result in better predictions, even surpassing improved skeletonization techniques Layana Castro et al. (2020), as shown in Table 2.

Worm curling cases, however, are more complex to detect, especially in low-resolution images. Depending on the extent to which the worms are coiled, it may be possible to identify skeletons and worm ends or, on the contrary, it may be an almost impossible task for either traditional skeletonization techniques or neural networks. Previous work Layana Castro et al. (2020) showed that information on the width and length of the worm can be used to create an improved skeleton that better represents the pose of the worm’s body. As indicated in Trebing et al. (2021), U-Nets with spatial-channel attention can achieve results similar to those of architectures with a greater number of parameters, which allows lighter applications, or applications with multiple models, to be developed.

All networks except the standard U-Net have batch normalization layers. Applying batch normalization after each convolutional layer provides regularization and better convergence by reducing internal covariate shift, as described in Ioffe and Szegedy (2015). UMF U-Net Plebani et al. (2022), on the other hand, adds another regularization technique (dropout) at the end of the encoder, obtaining better skeletons with synthetic and real data.

7 Conclusions and Future Work

In this work, we have proposed a skeletonization method for low-resolution images of Petri dishes containing C. elegans based on U-Net type neural networks. A synthetic dataset of different C. elegans behaviors was generated using a low-resolution image simulator to train different U-Net architectures. Good skeletonization results were achieved on real images with all the trained models. Finally, the results of the networks were compared with a skeletonization method based on traditional image processing techniques, showing that all networks were superior to the previous work Layana Castro et al. (2020) in terms of aggregations with noise. In future work, the low-resolution synthetic image simulator will be used together with a U-Net plus a temporal network to predict worm behaviors and to perform tracking by resolving aggregations between worms.