Introduction

With the development of autonomous driving technology, the autonomous driving scenario is gradually expanding from the road to the water, and Unmanned Surface Vehicles (USVs) and their related applications have become one of the research hotspots in recent years [1, 2]. Increasingly prominent due to their ability to perform hazardous and time-consuming tasks [3], the strong demand from commercial, scientific, and environmental communities has accelerated the development of USVs applications such as hydrographic surveying and mapping, marine resource exploration, water quality monitoring, and floating waste removal [4,5,6,7]. Compared to marine USVs, inland waterway USVs are more closely related to human life and have great potential value, which makes them central to the construction of autonomous transportation systems for inland waterways [8].

For USVs in inland river scenarios to accomplish the established tasks without human intervention, it is necessary to deal with the perception of the environment by unmanned vessels in the near-shore environment, especially the perception and understanding of waterline information [9, 10]. The environment perception capability of the USVs in the inland river scenario is a key technology for the unmanned vessel to complete various tasks [11, 12], for example, the river patrol of the USVs in the inland river scenario, the automatic berthing, and docking of the USVs, the automatic mapping of the USVs, etc. First, the location and orientation of the water shoreline need to be determined to complete the set tasks under the premise of ensuring the safety of the unmanned boat's navigation. In addition, for some areas without obvious structure and lacking obvious water shoreline, the detection of the river surface can provide information about the drivable area, thus can improve the active safety function of the unmanned boat navigation system. Water shoreline detection helps to analyze and judge the specific navigation information of inland rivers, thus assisting and ensuring the safe navigation behavior of unmanned boats [13,14,15].

The current research on water shoreline detection methods is mainly divided into the following categories: the edge detection-based methods, such as the water shoreline detection method based on the Canny edge detection operator and the Hough transform proposed by Zeng et al. [16], and the fractional dimensional method of fusing multiple features to identify the shoreline proposed by Shao et al. [17]. They argue that the waterfront demarcation line traverses the entire image and that the waterfront distant boundary approximates a horizontal straight line when the observation point is close to the water, and propose an edge extraction algorithm that combines fractional dimensional features and shape location for edge extraction of waterfront images, followed by fitting the waterfront demarcation line by the median method; the methods based on Spatial characteristics. The water shoreline detection method based on logistic regression and sample polynomial fitting proposed by Kröhnert [18] has obtained a better fitting effect on the water shoreline of nearly straight lines and has better real-time performance, but it is not suitable for irregular water shoreline detection. Peng et al. [19] analyzed the features of HSV color space under different illumination and used the features with higher land saturation for image segmentation to extract riverbank contour lines, but the method has low robustness and the segmentation effect is not satisfactory in cases such as the similar color of waterbanks or the influence of illumination. Sun et al. [20] developed a superpixel-based conditional random field model to segment the land and sea regions; the methods based on area growth, such as the water shoreline detection based on the area growth algorithm proposed by Zheng et al. [21], which achieves automatic growth point selection by constructing a least squares problem and obtains a better detection effect, but the time consumption of the area growth algorithm is large, and the real-time performance of water shoreline detection using the area growth algorithm is low; the methods based on deep learning, such as Steccanella et al. [22] proposed Convolutional Neural Network (CNN) for water surface area segmentation, such algorithms also have the problem of poor real-time performance. Shen et al. [23] proposed an improved algorithm based on DeepLab v3 + for semantic segmentation of shoreline images to extract the water surface region, and when combined with the traditional edge detection algorithm to detect the water shoreline. Erdem et al. [24] propose a majority voting method based on different deep learning structures to automatically acquire water shorelines Methods of deep learning are highly adaptable to a different times and shoreline scenes and can overcome the interference of uncontrollable factors such as reflection and ripple in shoreline scenes on image segmentation, and achieve accurate detection of water shoreline with fast extraction speed [25, 26].

Traditional detection methods for inland river scenes with varying weather and a complex water shoreline need adjusting parameters in different situations, which is not applicable, and the semantic segmentation method is used to train images of waterbanks in inland river scenes with different inland river scenes, enabling the model to learn waterbank segmentation in different inland river scenes and thus solve applicable problems. Due to the existing mainstream algorithms, there is a lack of information acquisition for irregular water shoreline details, resulting in the segmentation accuracy needs to be improved, we improve PSPNet to improve the network's robustness and segmentation accuracy to improve the accuracy of waterfront segmentation in inland river scenes. The following are our major contributions:

  1. (1)

    Applying migration learning to PSPNet networks using ResNet50 weights on the ImageNet dataset to improve training efficiency.

  2. (2)

    The CBAM attention mechanism is added to the backbone feature extraction network ResNet50 to improve the robustness of the network.

  3. (3)

    We have added a branch of the atrous convolution module to the pyramid pooling module to combine pooling and null convolution to improve the acquisition of global information from the network.

The rest of the paper is organized in this way. The second section introduces the PSPNet structure, the third section explains the methodology of the paper, and the fourth section conducts experiments and shows the results. The fifth section concludes the paper.

Principle and method

PSPNet network model

The algorithm in this paper is based on the Pyramid Scene Parsing Network (PSPNet) [27], which is a Fully Convolutional Network (FNC) [28]-based network, but differs significantly from the traditional FCN in that PSPNet considers contextual and local information to make predictions based on the traditional FCN. To enhance the extraction of contextual information and local features, PSPNet proposes a pyramid pooling module, which uses pooling operations at different scales to obtain image contextual and local information and realize the extraction of multi-scale features, which solves the problem of scale diversity in image segmentation to a certain extent and improves the image segmentation effect compared with traditional FCN. The PSPNet network structure is composed of two main parts, the first module is the CNN [29] backbone network for obtaining feature maps, and the second module is the pyramid pooling module. These two modules are described separately below.

PSPNet was proposed in 2017 by literature, the entire network consists of four parts: input image, feature map, pyramid pooling module and final prediction. lt for the input image in part 1, we use a pre-trained ResNet model with an extended network strategy to extract the feature maps, as shown in part 2. For the above feature maps, we use the pyramid pooling module to obtain contextual information, where the pyramid pooling module is divided into four levels with pooling kernel sizes of 1 × 1, 2 × 2, 3 × 3 and 6 × 6, which can eventually be fused into global features. Then, in the final part of module, we connect the fused global features to the original feature map. Finally, the final prediction map is generated in part 4 by a layer of convolution. The specific implementation of feature map connects is shown in Eq. (1):

$$ R = \sum\limits_{i = 1}^{C} {(M_{i} + N_{i} ) \otimes K_{i} } , $$
(1)

where \(C\) denotes the total number of channels, \(M_{i}\) and denotes the feature maps of feature map and feature map at the ith channel, respectively, \(K_{i}\) is the convolution kernel at the ith channel in the convolution kernel and denotes the convolution operation. R is the output result of a single convolution after feature map connect (Fig. 1).

Feature extraction network

ResNet [30] is short for residual network and this network structure is usually used as a backbone network The core idea of ResNet is to add a shortcut branch to the ordinary convolutional neural network, short-circuiting the original convolutional layers and allowing the training process to skip a part of the convolutional layers, such a design can reduce a certain number of parameters, and the comparison between the residual network and the general network structure is shown in Fig. 2. This design can reduce the number of parameters.

Fig. 1
figure 1

PSPNet network structure

Fig. 2
figure 2

Comparison of general network and residual network

The problem of gradient disappearance can be effectively solved by the residual structure, as can be demonstrated by Eq. (2).

$$ {\text{Loss}} = F(X,W) $$
(2)
$$ \frac{{\partial {\text{Loss}}}}{\partial X} = \frac{\partial F(X,W)}{{\partial X}} $$
(3)
$$ \begin{aligned} & {\text{Loss}}_{n + 1} = F_{n} (X_{n} ,W_{n} ),\\ & {\text{Loss}}_{n} = F_{n - 1} (X_{n - 1} ,W_{n - 1} ), \cdot \cdot \cdot {\text{Loss}}_{2} = F_{1} (X_{1} ,W_{1} )\end{aligned} $$
(4)
$$ \frac{{\partial {\text{Loss}}}}{{\partial X_{i} }} = \frac{{\partial F_{n} (X_{n} ,W_{n} )}}{{\partial X_{n} }} \times \cdots \times \frac{{\partial F_{n} (X_{i + 1} ,W_{i + 1} )}}{{\partial X_{i} }} $$
(5)
$$ \frac{{\partial X_{i + 1} }}{{\partial X_{i} }} = \frac{{\partial X_{i} + \partial F(X_{i} ,W_{i} )}}{{\partial X_{i} }} = 1 + \frac{{\partial F(X_{i + 1} ,W_{i + 1} )}}{{\partial X_{i} }}. $$
(6)

The equation is the loss function of the network, and the Eq. (3) is its back-propagation gradient value. The same principle is extended to the multilayer neural network with the loss function of the network as Eq. (4), where, n is the number of layers of the neural network, and finally, the gradient of the ith layer can be introduced according to the chain rule as the Eq. (5). Therefore, it can be seen that the gradient of the previous layer of the network becomes smaller and smaller as the error is passed back. ResNet changes the output layer H(x) = F(x) to H(x) = F(x) + x, which is changed from Eqs. (5) to (6), so the gradient will not disappear even if the network structure is very deep.

The residual module shown in Fig. 2b is called Bottleneck, and is explained in terms of the 50 layers used in this paper. The main part of the network architecture of ResNet50 is that it goes through 4 large convolutional groups, which contain 4 Bottleneck modules, 3, 4, 6, and 3, respectively. The convolutional groups are pooled globally, then fully connected in 1000 dimensions, and finally, the classification scores are output by Softmax. The specific structure is shown in Table 1.

Table 1 Resnet50 network structure

Pyramid pooling module

In the deep neural network, the size of the receptive field can roughly represent the extent to which we use contextual information. Although the receptive field of ResNet is larger than the input image in theory, it can be found from experience that the receptive field of CNN is far smaller than the theoretical size, especially in deep networks, which makes many networks not fully integrate important global scene information. The pyramid pooling module can extract features through pooling cores of different sizes to obtain more global scene information.

The pyramid pooling module is composed of four levels of pooling, which are 1 × 1, 2 × 2, 3 × 3, and 6 × 6. 1 × 1 pool extracts the shallowest features, and processes the feature map completely. 2 × 2, 3 × 3 and 6 × 6 are the sub-region of the other three levels, which uses the pooling of different scales to process the feature map. The features of different scales are obtained by grading and pooling. Due to the different pool cores in the pool process, the output feature scales are also different. Finally, the features extracted from different pool cores can be fused into global features. In the last part of the pyramid pooling module, we connect the fused global features with the original feature map.

The improved pyramid scene analysis network

Implementation of CBAM attention mechanism

Convolutional Block Attention Module (CBAM) represents the attention mechanism module of the convolutional module, which is an attention mechanism module combining spatial and channel. The overall structure after adding the CBAM module is shown in Fig. 3. What can be seen is that the output of the convolutional layer will first pass through a channel attention module to get the weighted result and then will pass through a spatial attention module to finally weight the result. Compared with focusing only on the channel, the attention mechanism can achieve better results.

Fig. 3
figure 3

The overall structure after adding the CBAM Module

The channel attention module takes the input feature map, goes through global max pooling and global average pooling based on width and height, respectively, and then goes through MLP, respectively. The channel attention feature map and the input feature map are multiplied elementwise to generate the input features required by the Spatial attention module. The specific steps are shown in Fig. 4.

Fig. 4
figure 4

The channel attention module

The Channel Attention Module compresses the feature map in the spatial dimension into a one-dimensional vector, which it then manipulates. Not only does the compression in the spatial dimension take into account Average Pooling, but it also takes into account Max Pooling. To create a channel attention map, use Average Pooling and Max Pooling to aggregate the spatial information of feature maps, transfer them to a shared network, compress the spatial dimensions of the input feature maps, and sum and merge them element by element. For a single graph, channel attention is concerned with which pieces on the graph are important. When doing gradient back-propagation calculations, mean pooling provides feedback for every pixel point on the feature map, whereas maximum pooling provides feedback for gradients only when the response is highest in the feature map. The following is an equation to describe the channel attention mechanism.

$$ \begin{aligned} M_{{\text{c}}} (F) & = \sigma (MLP({\text{AvgPool(}}F)) + MLP({\text{MaxPool}}(F))) \\ & = \sigma (W_{{1}} (W_{0} (F_{{{\text{avg}}}}^{{\text{c}}} )) + W_{1} (W_{0} (F_{{{\text{max}}}}^{{\text{c}}} ))), \\ \end{aligned} $$
(7)

where MLP denotes multilayer perceptron, \(\sigma\) denotes the sigmoid function, \(W_{0}\) and \(W_{1}\) denotes the weight matrix in the multilayer perceptron, and \(F_{{{\text{avg}}}}^{{\text{C}}}\) and \(F_{{{\text{max}}}}^{{\text{C}}}\) denotes the average set feature and maximum set feature in the channel attention module.

The feature map output from the Channel attention module is used as the input feature map of the spatial attention module, as shown in Fig. 5. We execute global max pooling and global average pooling based on the channel first, and then we concat these two outcomes based on the channel. The sigmoid generates the spatial attention feature, which is then multiplied by the module's input feature to produce the end feature.

Fig. 5
figure 5

The spatial attention module

In the same way, the Spatial Attention Module compresses the channels and pools the mean value and the maximum value in the channel dimension, respectively. The AvgPool operation is to extract the average value on the channels, and the number of times it is extracted is also the height multiplied by the width; then, the previously extracted feature maps (the number of channels is 1) are combined to obtain a two-channel feature map. The following is an equation to describe the spatial attention mechanism.

$$ \begin{aligned} M_{{\text{s}}} (F)& = \sigma (f^{7 \times 7} ([{\text{AvgPool}}(F);{\text{MaxPool}}(F)])) \\ & = \sigma (f^{7 \times 7} ([F_{{{\text{avg}}}}^{{\text{s}}} ;F_{{{\text{max}}}}^{{\text{s}}} ])), \\ \end{aligned} $$
(8)

where \(f^{7 \times 7}\) denotes the convolution operation with a filter size of 7 × 7, and \(F_{{{\text{avg}}}}^{{\text{S}}}\) and \(F_{{{\text{max}}}}^{{\text{S}}}\) denote the average pool feature and the maximum pool feature in the spatial attention module, respectively.

Combining the CBAM attention mechanism with the ResNet50 network for end-to-end training, using the mathematical principle of attention mechanism to calculate the region of interest can be understood as a larger value of the weight of the region of interest, then correspondingly after the attention mechanism, the weight accounts for a larger proportion of the region of interest, highlighting the region of interest. By adding the CBAM attention mechanism to the first and last layer of the Resnet50 network, the network can pay more attention to the useful information and discard the useless information. As Fig. 6 shows the training loss function curve before and after the addition of the CBAM attention mechanism, it can be seen that the CBAM attention mechanism effectively reduces the oscillation when the curve converges.

Fig. 6
figure 6

Comparison of training loss function with CBAM attention mechanism

Improved pyramid pooling module with atrous convolution module

In the inland river scene, the surrounding environment of the riverbank is complex and changeable, and the water shoreline is extremely irregular. To segment the water area more accurately, there are high requirements for the extraction of the global information of the scene. Based on the original pyramid pooling module, we add an atrous convolution module branch to extract the global information more comprehensively by expanding the receptive field.

The atrous convolution is obtained by filling 0 with ordinary convolution so that the convolution can obtain a larger receptive field and more image feature information without changing the parameter size. More contextual information can be obtained by increasing the receptive field of the separation layer or fusing the deep and shallow features, which is easy to save the detailed information of the target as shown in Fig. 7a, but the three-layer convolution is an expansion convolution with r = 2. It can be seen that only 75% of the receptive field of the orange pixel on the top layer participates in the actual calculation. Using the hybrid dilated convolution, a certain number of layers are formed into a group, and then each group uses the continuously increasing hole rate, as shown in Fig. 7b. The combination of three-atrous rates of r = 1, r = 2, and r = 3 is adopted, which not only expands the receptive field, which is the same as the three r = 2 but also obtains information from a wider range of pixels. By designing a sawtooth wave with a void rate sequence \([r_{1} ,...,r_{i} ,...,r_{n} ]\), the aim of sensory field coverage of the whole map is achieved, with a small rate extracting local information and a large rate extracting long distance information, which can obtain wider area information and improve information utilization while keeping the size of the receiving field constant. And Eq. (9) for calculating the rate is defined as follows:

$$ M_{i} = \max \left[ {M_{i + 1} - 2r_{i} ,M_{i + 1} - 2(M_{i + 1} - r_{i} ),r_{i} } \right]. $$
(9)
Fig. 7
figure 7

Atrous convolution group combined with different dilation rates

With \(M_{n} = r_{n}\), the design objective is to have \(M_{2} \le K\). We adopt the idea of Fig. 7b and for our dilated convolution kernel size K = 3, we design the dilation rate as a [1, 2, 5, 1, 2, 5] sawtooth structure. By fusing the convolutional feature mapping of each dilation rate, the local and global information in the image is obtained by fusion, which facilitates the preservation of image details and enriches the feature information of the image while avoiding grid effects.

By adding a branch of the atrous convolution module to the pyramid pooling module of PSPNet, the global features extracted from the shallow features of the feature backbone network are extracted by one more hollow convolution module, and the features of its expanded field of perception are used to have better feature understanding for complex image information, and finally, the output of the atrous convolution module, the shallow feature map and the output of the pyramid pooling module are spatially connected and merged to make the current features include both global and local features, as shown in Fig. 8. Finally, the final prediction results are obtained by convolution and upsampling operations. The network's ability to gather global information about the image and make better judgments about complicated boundaries is improved when the pyramid pooling module and the atrous convolution module are combined.

Fig. 8
figure 8

Improved pyramid pool module

The water shoreline detection

In the complex environment of the inland river, the water surface can be effectively segmented out using the improved semantic segmentation model. Based on segmenting out the water surface area, the edges of the water surface area are extracted by combining the classical edge detection Canny algorithm [31], and the extracted edge lines are superimposed onto the original image after expansion [32] processing to facilitate observation. The results are shown in Fig. 9. The experimental results show that a simple Canny edge detection operator can quickly and effectively detect a clear and complete water shoreline in the segmented image.

Fig. 9
figure 9

Display of water shoreline detection results

Experimental results and analysis

Introduction to the dataset

Our training uses the USVInland dataset [33], which is a multi-sensor dataset for USVs in inland waterways. The dataset is available online at https://www.orca-tech.cn/datasets. USVInland is collected across a 26 km trajectory in a variety of realistic scenarios in inland waterways using various modalities including lidar, stereo cameras, millimeter-wave radar, GPS and inertial measurement units (IMUs) and contains data collected in real inland river scenarios under different weather or lighting conditions to cover real-world driving scenarios. The dataset has a total of 700 images of waterways in various river landscapes of different widths and lengths. Furthermore, unlike most datasets used for seawater segmentation, where waterlines are mainly straight water antennas, the USVInland dataset covers different inland river landscapes with different waterline structures. As shown in Fig. 10, the water bank images of inland river scenes under different weather conditions in the data set are shown.

Fig. 10
figure 10

The images in different weather

Evaluation based on improved semantic segmentation algorithm

Semantic segmentation algorithm training

In this dataset of 700 images containing multiple condition acquisitions, we set the images to a uniform size of 320 × 640 and trained them using the labels provided by the dataset. We selected 500 images of similar numbers for each scenes. Using 400 of these were used for training the improved semantic segmentation model, and the remaining 100 images were used as a validation set to test the accuracy of the algorithm.

The training networks in this paper were all implemented using the PyTorch deep learning framework in Python. The test device used was an RTX2080ti with 11 GB of video memory, Python 3.6, and PyTorch 1.4. A total of 200 epochs are trained for the network. The improved PSPNet model has the parameter FLOPS of 68.03G in time complexity and the parameter Param of 64.80 M in space complexity. It takes about 1.5 h to complete the whole training.

We use the idea of transfer learning [34] in training and fine-tune the weight of Resnet50 on the Imagenet dataset to greatly improve the efficiency of training. Therefore, we freeze the network in the first 25 epochs can speed up the training efficiency and prevent the weight from being damaged. In the freezing stage, the backbone of the model is frozen, and the feature extraction network does not change. The visual memory used is tiny, and the network is merely fine-tuned. The network is then unfrozen after 175 epochs. The backbone of the model is not frozen in the unfreezing step, and the feature extraction network will change. The video memory is fully utilized, and all network parameters will change.

For the evaluation of semantic segmentation. In classification and diagnostic tests, ROC and the area under the ROC curve (AUC) describe how an adjustable threshold causes changes in two types of errors: false positives and false negatives. However, the ROC curve and AUC are only partially meaningful when used with unbalanced data [35]. In this paper, the confusion matrix approach is used to evaluate the model, and the overall PA and MIoU are selected as the evaluation metrics for pixel-level semantic segmentation. PA calculates the ratio of correct pixels to the total number of pixels in the prediction, and the IoU calculations for each category of MIoU are summed and averaged to obtain a global evaluation.

$$ {\text{PA}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}} + {\text{TN}}}} $$
(10)
$$ {\text{IoU}} = \frac{{{\text{TP}}}}{{{\text{FP}} + {\text{FN}} + {\text{TP}}}} $$
(11)
$$ {\text{MIoU}} = \frac{1}{k + 1}\sum\limits_{i = 0}^{k} {\frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}}}}} , $$
(12)

where TP is the number of positive classes predicted as positive classes; FN is the number of positive classes predicted as negative classes; FP is the number of negative classes predicted as positive classes; TN is the number of negative classes predicted as negative classes.

Analysis of experimental results

Even though based on semantic segmentation models, watershed area detection is not only unlimited to areas with straight boundaries but is also very applicable. However, due to the high variability of different inland water environments, it is difficult to build robust segmentation models for water environments. Furthermore, unlike the lane detection task for autonomous vehicles, the similarity in appearance and reflection of objects on river banks makes accurate segmentation of waters difficult. The use of some mainstream semantic segmentation algorithms can suffer from poor segmentation of reflection regions, recognition of walls as water regions, and inaccurate segmentation of irregular water boundaries.

The improved PSPNet extracts shallow features through ResNet50 with the CBAM attention mechanism, then extracts global information through two branches of multi-scale pooling and multi-scale convolution, and then connects the shallow feature maps with the feature maps obtained from the two branches, so that the current features include both global and local features. Finally, the final prediction results are obtained by convolution and upsampling operations. The improved pyramidal scene analysis network can effectively solve the problem of not segmenting the reflection area and segmenting the walls.

The new model adds a branch of the atrous convolution module, which not only increases the segmentation accuracy of the original model but also enhances the solution to the segmentation problem of irregularly shaped water boundaries, which can effectively improve the accuracy of water bank segmentation in complex inland river scenes. A comparison of the specific shoreline segmentation results is shown in Fig. 11. We can see that when the water shoreline is relatively regular, the current semantic segmentation network can also segment the waterfront more accurately. However, it does not perform well in irregular water shorelines. The improved semantic segmentation model not only maintains its accuracy in regular water shoreline detection, but also extracts detailed information well in irregular water shoreline scenarios, which is closer to Ground Truth, effectively increasing the segmentation effect.

Fig. 11
figure 11

The different network segmentation results

Meanwhile, the segmentation index of our improved semantic segmentation network is 96.87% for MIoU and 98.49 for PA, which is higher than some mainstream algorithms. It shows that our algorithm has a good segmentation effect in complex inland river scenes. Some improvements to the algorithm can increase the information acquisition ability of the network for water bank images and obtain high segmentation accuracy at the same time. The results are shown in Table 2.

Table 2 Comparison of different model indicators

Results and analysis of water shoreline detection under different weather conditions

In the complex inland scene environment, when the coastline is extremely irregular or there is interference on the water surface, the use of traditional methods may fail, while the traditional coastline detection method uses some features of water surface and ground for image segmentation. In the face of different scenes, it needs to adjust the reference, which does not apply to complex inland river scenes and changeable weather environments. We use the improved semantic segmentation model, which has good applicability and improves the accuracy of water bank segmentation. Based on the water bank segmentation results, we use the Canny edge detection algorithm to extract the water bank line.

We extracted the waterfront segmentation map using the Canny edge detection algorithm to extract the waterfront contours. To further demonstrate the applicability of the algorithm in this paper, as shown in Fig. 12, we compared the water shorelines extracted by different networks with the reference standard water shorelines separately to be able to see the error with the standard reference water shorelines more intuitively. Where the green line indicates the reference standard water shoreline and the red line indicates the water shoreline extracted from the experimental results, with the overlap in yellow. We use the root mean square error (RMSE) to quantitatively evaluate the extraction error of the water shoreline. the RMSE is calculated as shown in equation (13):

$$ {\text{RMSE}} = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} {(A_{i} - B_{i} )^{2} } }}{n}} , $$
(13)
Fig. 12
figure 12

Comparison with the standard reference water shoreline

where A is the pixel sampling matrix of the actual extracted water shoreline obtained by uniform sampling n times, and B is the water shoreline extracted as a result of the experiment.

We can see that the algorithm in this paper extracts the water shoreline with the least error from the reference water shoreline, which is closer to the real water shoreline location. In Table 3, we can see that our algorithm has the lowest RMSE value, but as the water shoreline becomes more complex, the RMSE value increases.

Table 3 RMSE values for different methods

As shown in Fig. 13a–f, we show the results of water shoreline detection using the algorithm in this paper under different weather conditions and different inland river scenes. We can see that our algorithm can segment the shoreline in a variety of weather conditions and in inland river scenes of varying complexity, and that when combined with the Canny edge detection algorithm and then superimposed on the original image by the expansion algorithm, we can intuitively see the position of the water shoreline.

Fig. 13
figure 13

The water shoreline detection results

Conclusion

In this paper, we propose a water shoreline detection algorithm based on the improved PSPNet algorithm. To improve the training efficiency, robustness, and accuracy of the algorithm, we first introduce the idea of migration learning based on the PSPNet algorithm, then we add the CBAM attention mechanism to the backbone feature extraction network ResNet50 to improve the algorithm's robustness, and finally, we add the atrous convolution module branch to the pyramid pooling module to combine pooling and atrous convolution. The use of the pyramid pooling module and atrous convolution module together considerably improves the accuracy of waterfront segmentation and the gathering of global information. Finally, the Canny edge detection algorithm extracts the shoreline segmentation map from the water shorelines. The algorithm is more suitable for complex inland river scenes in the Inland dataset under various weather conditions. It can overcome the interference of reflections, water waves, and riverbank obstacles in the inland scene and accurately achieve the segmentation of the waterbank image to detect the water shoreline. For USVs operating in inland river settings, the technology could be useful for visual navigation. In terms of the detection results, the detection effect is still low for the particularly difficult to identify reflection part and the part where bright light beams to the waterfront junction due to the diversity of the inland river scene. The detection accuracy can be improved by pre-processing the waterfront image to remove environmental interference before running the detection algorithm. In terms of approach, although semantic segmentation-based methods greatly enhance the applicability of the detection, they are too demanding in terms of data annotation, requiring not only large amounts of image data but also these images need to provide accurate tagging information down to the pixel level, so weakly supervised semantic segmentation-based approaches will be a direction of development.