Introduction

Ancient paintings are a valuable part of traditional culture. Owing to their long history, many of them have suffered varying degrees of damage. With the development of deep learning, the technology has been widely applied to image inpainting tasks, including the restoration of ancient paintings. In recent years, many researchers have used deep learning methods such as convolutional neural networks and generative adversarial networks to restore images [1]. Pathak et al. [2] proposed an inpainting method based on an encoder-decoder architecture and contextual semantic features, known as context encoders (CE). However, the restored areas often suffer from low resolution and lack consistency with the surrounding areas, and the method requires the damaged areas to be located at the center of the image. Liao et al. [3] proposed an improved version of CE, the edge-aware context encoder (E-CE), to overcome these limitations: instead of relying on the context of the entire image, E-CE restores textures guided by edge structures reconstructed with a fully convolutional network. Vo et al. [4] introduced a structural loss into CE, further refining the model’s inpainting capability for various scenes. Yan et al. [5] proposed a shift connection layer to address the blurriness and semantic ambiguity of CE; this layer connects encoder features from known areas with decoder features from missing areas, improving the inpainting of semantic structure and detailed textures. Liu et al. [6] introduced partial convolution within the U-Net to address the color bias and blurriness caused by regular convolution; the partial convolution automatically updates the mask during forward propagation, thereby reducing artifacts. Li et al. [7] developed a recurrent feature reasoning module that performs iterative inference on mask-boundary feature maps, together with a knowledge-consistent attention mechanism that gradually refines feature maps by adaptively fusing attention scores. Yi et al. [8] introduced a contextual residual aggregation (CRA) mechanism within the U-Net to address memory limitations and enable the inpainting of ultra-high-resolution images (e.g., 4K to 8K); by aggregating weighted residuals from contextual patches, this mechanism generates high-frequency residuals for the missing content.

Although generative adversarial networks (GANs) can restore natural images, they often suffer from training instability and loss of semantic information, producing blurry and distorted images, and they struggle to learn the textures and structures of high-resolution images. Consequently, researchers have proposed various improved GANs to address these challenges. Yeh et al. [9] combined context and prior losses to better restore semantic information. To address blur and artifacts when restoring high-resolution images, Yang et al. [10] proposed a multi-scale neural patch synthesis method that adjusts patches with the most similar mid-level feature correlations to generate clearer and more coherent high-resolution images. Hui et al. [11] proposed a single-stage model employing dilated convolutions to capture a broader range of semantic information, and designed a self-guided regression loss and a geometric alignment constraint loss to enhance semantic details and minimize the gap between predicted and real features. For repairing large irregular missing areas, Zeng et al. [12] proposed the aggregated contextual transformation GAN (AOT-GAN): stacking multiple AOT blocks enhances the generator’s contextual reasoning ability, while introducing the mask prediction module from PatchGAN [13] enhances the discriminative capability. Lv et al. [14] argued that introducing prior information can improve inpainting quality; to this end, they designed a GAN with two generators, an edge restoration network that restores mural edges and guides a content completion network for content inpainting. Moreover, Deng et al. [15] proposed a structure-guided dual-branch GAN that enhances the inpainting capability for complex mural structures.

In 2017, Vaswani et al. proposed the Transformer [16], which is built on self-attention mechanisms. Unlike convolutional neural networks (CNNs), the Transformer establishes long-range dependencies between different positions, enabling it to capture global contextual information effectively and understand image structures more accurately. Zhou et al. [17] argued that image inpainting methods often rely on sample similarity and large-scale training data to learn texture and semantic information, which limits them on larger and more complex inpainting tasks. They therefore proposed a multi-homography transformed fusion method that fills missing areas in a target image by referring to another source image with the same content, and designed a color and spatial transformer that adjusts color so that the inpainting results are more consistent with the target image. Chen et al. [18] introduced a two-stage blind facial inpainting model, the frequency-guided transformer and top-down refinement network (FT-TDR), which detects the areas to be restored by modeling relationships between different patches and improves the detection results by using frequency information as supplementary guidance. Since capturing global contextual information through deep, large receptive fields is inefficient, Zheng et al. [19] treat image inpainting as an undirected sequence-to-sequence prediction task, directly capturing long-range dependencies with Transformers in the encoder, and introduce an attention-aware layer (AAL) to better exploit distant high-frequency features. Transformer-based models are constrained by heavy inference computation for large images; to address this, Dong et al. [20] use a structure repair network to restore the overall image structure and employ a mask position encoding strategy for inpainting large damaged areas.

Pluralistic image inpainting models can generate multiple inpainting results, allowing users to choose among several outputs. Han et al. [21] proposed a two-stage model with shape and appearance generation networks, each containing two encoders that are jointly optimized to generate multiple plausible results. Zhao et al. [22] introduced the unsupervised cross-space translation generative adversarial network (UCTGAN), which learns a one-to-one mapping between two image spaces. Additionally, Peng et al. [23] proposed a two-stage inpainting model: the first stage generates multiple coarse results with diverse structural features, and the second stage uses a texture generation network to improve the texture details of each coarse result, with structural attention introduced in the texture generation network to capture distant correlations and further enhance the results.

Many deep-learning inpainting models excel at restoring natural images but face challenges with ancient paintings, often losing texture details and producing excessive smoothing. Better semantic feature extraction and feature transfer are needed for ancient paintings so that their original texture and detail information are preserved. In recent years, multi-scale feature fusion, gated encoding branches, and feature aggregation have been widely used in computer vision and image processing [24,25,26]. Motivated by these ideas, we propose an ancient painting inpainting model based on dual encoders and contextual information to address these challenges. The model consists of a generator and a discriminator. The generator adopts a dual-branch structure comprising a regular encoding branch and a gated encoding branch, each incorporating four dense multi-scale feature fusion modules to capture rich information. The merged features from the two encoding branches then pass through a contextual feature aggregation module, which strengthens their correlation and further refines the restoration. Finally, the restored ancient painting is passed to a discriminator for quality assessment. Centered on the dual encoder architecture, the dense multi-scale feature fusion modules, and the contextual feature aggregation module, our model aims to address the detail loss and overall inconsistency of ancient painting inpainting. The key components of the network are as follows:

  • To fully capture the semantic information from ancient paintings, a gated encoding branch is designed to reduce information loss.

  • A dense multi-scale feature fusion module is designed to extract the texture and details of ancient paintings at different scales. Additionally, dilated depthwise separable convolutions are employed to reduce the number of parameters and improve computational efficiency.

  • A contextual feature aggregation module is introduced to extract contextual features, which can improve the overall consistency of the results.

  • Color loss is added to improve the color consistency of the inpainting results, ensuring that the color of the restored area is in harmony with that of the surrounding area.

Methodology

Overall network architecture

The proposed model consists of a generator and a discriminator, as shown in Fig. 1. The generator uses a dual-branch structure consisting of a regular encoding branch and a gated encoding branch, each of which includes four dense multi-scale feature fusion modules. After encoding, the merged features from the two branches are fed into the contextual feature aggregation module to strengthen their correlation. Finally, the restored ancient painting is passed to the discriminator.

Fig. 1
figure 1

Overall structure of the network

The discriminator has two branches, one for evaluating the global image and another for the local image. Both branches use convolutional layers with a stride of 2. To further optimize the performance of the generator, the model integrates several loss functions, including mean absolute error loss, color loss, adversarial loss, and feature matching loss. These loss functions ensure that the generated results are closer to the style and details of the original ancient painting.

Gated encoding branch

To improve the model’s capability to extract features from ancient paintings, a gated encoding branch is introduced. This branch consists of a gated encoder built from gated convolutions [27] and four dense multi-scale feature fusion modules. The gated encoder exploits the dynamic pixel-wise feature selection of gated convolutions to learn soft masks automatically. The gated convolution is shown in Fig. 2. The soft gating mechanism applies a Sigmoid function after the convolution operation to produce gating values between 0 and 1, dynamically adjusting the importance of each element in the feature map and giving the model fine-grained control over feature processing. By fusing the feature maps from the regular encoding branch and the gated encoding branch, the model can leverage the strengths of both to generate more accurate inpainting results.

Fig. 2
figure 2

Structure of gated convolution
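The gating operation shown in Fig. 2 can be sketched in PyTorch as follows. This is a minimal illustration, not the exact layer configuration of our gated encoder: the channel counts, kernel size, and the ELU activation on the feature branch are assumptions.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: a feature branch modulated by a learned soft mask."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        # Feature branch: ordinary convolution producing candidate features.
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, dilation)
        # Gating branch: parallel convolution whose output is squashed to (0, 1).
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, dilation)
        self.act = nn.ELU()  # activation choice is an assumption

    def forward(self, x):
        # Soft gating: the Sigmoid yields per-pixel, per-channel values in (0, 1)
        # that dynamically weight how much of each feature passes through.
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))

# Example: gate a 64-channel feature map of size 64 x 64.
y = GatedConv2d(64, 64)(torch.randn(1, 64, 64, 64))   # -> (1, 64, 64, 64)
```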

Dense multi-scale feature fusion module

To keep the number of parameters low while improving feature richness and continuity, we propose a dense multi-scale feature fusion (DMFF) module, as shown in Fig. 3. The module comprises four parallel dilated depthwise separable convolution branches. Replacing dilated convolutions (DConv) [28] with dilated depthwise separable convolutions (DDSConv) [29] reduces the number of parameters. Specifically, the four DDSConv branches use dilation rates of 1, 2, 4, and 8, respectively. Table 1 shows the resulting decrease in the number of generator parameters.

Fig. 3
figure 3

Architecture of dense multi-scale feature fusion module

Table 1 Comparison of parameters in the generator

In the DMFF module, the channel number of the input features is first reduced to 64 before being fed into the four branches for multi-scale feature extraction, denoted as \(x_{i}\) \(\left( {i = 1,2,3,4} \right)\). Each branch consists of a 3 × 3 convolution operation, denoted as \(k_{i} \left( \cdot \right)\). To obtain rich dense feature \(y_{i}\) from sparse multi-scale features, a dense feature fusion is employed, expressed as:

$$y_{i} = \begin{cases} x_{i} , & i = 1 \\ k_{i} \left( x_{i - 1} + x_{i} \right), & i = 2 \\ k_{i} \left( y_{i - 1} + x_{i} \right), & 2 < i \le 4 \end{cases}$$
(1)

The output of each branch is accumulated with the outputs of the other branches, effectively preventing gridding effects [30] and preserving feature continuity. A 1 × 1 convolution combined with skip connections further enhances feature richness and continuity.
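A minimal PyTorch sketch of the DMFF computation is given below. It follows Eq. (1) literally; our reading that the multi-scale features \(x_{i}\) are produced by the DDSConv branches before the \(k_{i}\) convolutions, together with the placement of the final 1 × 1 fusion and the outer skip connection, reflects Fig. 3 only approximately and should be treated as an assumption.

```python
import torch
import torch.nn as nn

class DDSConv(nn.Module):
    """Dilated depthwise separable 3x3 convolution (depthwise + 1x1 pointwise)."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=dilation,
                                   dilation=dilation, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class DMFF(nn.Module):
    """Sketch of the dense multi-scale feature fusion module, Eq. (1)."""
    def __init__(self, in_ch, mid_ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1)             # reduce channels to 64
        self.branches = nn.ModuleList(
            [DDSConv(mid_ch, d) for d in (1, 2, 4, 8)])        # multi-scale extraction -> x_i
        self.k = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, 3, padding=1) for _ in range(3)])  # k_2 .. k_4
        self.fuse = nn.Conv2d(mid_ch * 4, in_ch, 1)            # 1x1 fusion back to in_ch

    def forward(self, f):
        r = self.reduce(f)
        x = [b(r) for b in self.branches]           # sparse multi-scale features x_1..x_4
        y = [x[0]]                                  # y_1 = x_1
        y.append(self.k[0](x[0] + x[1]))            # y_2 = k_2(x_1 + x_2)
        for i in (2, 3):
            y.append(self.k[i - 1](y[-1] + x[i]))   # y_i = k_i(y_{i-1} + x_i), 2 < i <= 4
        return f + self.fuse(torch.cat(y, dim=1))   # 1x1 conv + skip connection

# Example: process a 256-channel feature map.
out = DMFF(256)(torch.randn(1, 256, 64, 64))        # -> (1, 256, 64, 64)
```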

Contextual feature aggregation module

To fully use contextual information in ancient paintings, we introduce the contextual feature aggregation (CFA) module [31], as shown in Fig. 4. This module extracts contextual features of ancient paintings through region affinity learning, capturing both overall structure and local details to provide important feature information for the aggregation process. The multi-scale feature aggregation operation encodes rich semantic features at different scales. The use of a partial convolutional mask updating mechanism eliminates the need to distinguish between foreground and background pixels. Furthermore, the CFA module introduces skip connections to prevent the loss of potential semantic information.

Fig. 4
figure 4

Structure of contextual feature aggregation module

Initially, the input feature is processed through a convolutional (Conv) layer to obtain feature map \(F\), and then 3 × 3 patches of \(F\) are extracted. Subsequently, the cosine similarity between these patches is calculated by using (2).

$$S_{contextual}^{i,j} = \left\langle {\frac{{f_{i} }}{{\left\| {f_{i} } \right\|_{2} }},\frac{{f_{j} }}{{\left\| {f_{j} } \right\|_{2} }}} \right\rangle$$
(2)

where \(f_{i}\) and \(f_{j}\) represent the \(i\)-th and \(j\)-th patches of the feature map \(F\), respectively. \(\left\| \cdot \right\|_{2}\) denotes the L2 norm.

The Softmax function is then applied to the calculated cosine similarity to obtain the attention score for each patch as shown in (3):

$$\hat{S}_{contextual}^{i,j} = \frac{{\exp \left( {S_{contextual}^{i,j} } \right)}}{{\sum\limits_{j = 1}^{N} {\exp \left( {S_{contextual}^{i,j} } \right)} }}$$
(3)

Subsequently, based on the extracted patches and their corresponding attention scores, the feature map is reconstructed. The reconstruction process is shown in (4):

$$\tilde{f}_{i} = \sum\limits_{j = 1}^{N} {f_{j} \cdot \hat{S}_{contextual}^{i,j} }$$
(4)

where \(\tilde{f}_{i}\) represents the \(i\)-th patch of the reconstructed feature map \(F_{rec}\).

To capture multi-scale semantic features, CFA uses four sets of dilated convolutions with different dilation rates, as shown in (5):

$$F_{rec}^{k} = Conv_{k} \left( {F_{rec} } \right)$$
(5)

where \(Conv_{k} \left( \cdot \right)\) denotes the dilated convolution, and the dilation rates are set as \(k \in \left\{ {1,2,4,8} \right\}\).

A weight generator \(G_{w}\) is used to predict the pixel-wise weight maps. It consists of two convolutional layers with kernel sizes of 3 and 1 respectively. After each layer, a ReLU activation function is applied. The output channels of this generator are set to 4, which are used to compute the weights \(W^{k}\) by using (6) and (7).

$$W = {\text{Softmax}}\left( {G_{w} \left( {F_{rec} } \right)} \right)$$
(6)
$$W^{1} ,W^{2} ,W^{4} ,W^{8} = {\text{Slice}}\left( W \right)$$
(7)

where \({\text{Slice}}\left( \cdot \right)\) denotes a channel-wise slicing operation.

The multi-scale semantic features are aggregated to generate refined feature map \(F_{c}\) through element-wise summation, as shown in (8):

$$F_{c} = \left( {F_{rec}^{1} \odot W^{1} } \right) \oplus \left( {F_{rec}^{2} \odot W^{2} } \right) \oplus \left( {F_{rec}^{4} \odot W^{4} } \right) \oplus \left( {F_{rec}^{8} \odot W^{8} } \right)$$
(8)

Finally, the feature map \(F\) is concatenated with \(F_{c}\), and the concatenated feature is then processed through a deconvolutional (Deconv) layer to produce the final output features.
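The following simplified PyTorch sketch summarizes the computation of Eqs. (2)–(8). It omits the partial-convolution mask updating mentioned above, and the channel widths, the patch-overlap averaging, and the final Conv/Deconv layers are assumptions rather than the exact configuration of the CFA module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CFA(nn.Module):
    """Simplified contextual feature aggregation: Eqs. (2)-(8)."""
    def __init__(self, ch, patch=3):
        super().__init__()
        self.patch = patch
        self.pre = nn.Conv2d(ch, ch, 3, padding=1)                     # Conv -> feature map F
        self.scales = nn.ModuleList(                                   # Eq. (5), k in {1, 2, 4, 8}
            [nn.Conv2d(ch, ch, 3, padding=k, dilation=k) for k in (1, 2, 4, 8)])
        self.weight_gen = nn.Sequential(                               # G_w, Eqs. (6)-(7)
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 4, 1), nn.ReLU())
        self.post = nn.ConvTranspose2d(2 * ch, ch, 3, padding=1)       # Deconv on [F, F_c]

    def forward(self, x):
        f = self.pre(x)
        b, c, h, w = f.shape
        # Extract 3x3 patches and L2-normalize them -> cosine similarity, Eq. (2).
        patches = F.unfold(f, self.patch, padding=self.patch // 2)      # (b, c*9, h*w)
        normed = F.normalize(patches, dim=1)
        attn = torch.softmax(torch.bmm(normed.transpose(1, 2), normed), dim=-1)  # Eq. (3)
        # Reconstruct each patch as an attention-weighted sum, Eq. (4), then fold back.
        rec = torch.bmm(patches, attn.transpose(1, 2))                  # (b, c*9, h*w)
        f_rec = F.fold(rec, (h, w), self.patch, padding=self.patch // 2)
        f_rec = f_rec / F.fold(torch.ones_like(patches), (h, w),
                               self.patch, padding=self.patch // 2)     # average overlaps
        # Multi-scale dilated convolutions weighted by the generated maps, Eq. (8).
        weights = torch.softmax(self.weight_gen(f_rec), dim=1)          # Eq. (6)
        f_c = sum(conv(f_rec) * weights[:, k:k + 1]                     # W^k via slicing, Eq. (7)
                  for k, conv in enumerate(self.scales))
        return self.post(torch.cat([f, f_c], dim=1))

# Example on a small 64-channel feature map.
out = CFA(64)(torch.randn(1, 64, 32, 32))    # -> (1, 64, 32, 32)
```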

Global and local discriminators

The global and local discriminators proposed in [32] are used to assess the generated images. The global branch evaluates the entire image, capturing global information and structural features with a larger receptive field. Meanwhile, the local branch evaluates details and texture information in local areas. This combination of global and local branches addresses the limitations of a single discriminator. More detailed information about the layers of the discriminators can be found in Tables 2 and 3.

Table 2 Convolutional layer parameters of the local discriminator
Table 3 Convolutional layer parameters of the global discriminator

As shown in Fig. 1, the global discriminator consists of six convolutional layers, taking 256 × 256 output images as input. The local discriminator consists of five convolutional layers, randomly selecting a 128 × 128 local patch as input. The outputs of the global and local discriminators are concatenated into a 2048-dimensional vector. This vector is then processed through a fully connected layer to produce a probability value between 0 and 1 using a Sigmoid function.
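A hedged PyTorch sketch of this two-branch discriminator is shown below. It reproduces the structure described above (six stride-2 convolutions for the global branch on 256 × 256 images, five for the local branch on 128 × 128 patches, a 2048-dimensional concatenated vector, and a fully connected layer with Sigmoid), but the channel widths and kernel sizes are assumptions; the exact values are those listed in Tables 2 and 3.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Stride-2 convolution shared by both discriminator branches (assumed 4x4 kernel).
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),
                         nn.LeakyReLU(0.2, inplace=True))

class GlobalLocalDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        g_ch = [3, 64, 128, 256, 512, 512, 512]   # global: 6 conv layers, 256 -> 4
        l_ch = [3, 64, 128, 256, 512, 512]        # local:  5 conv layers, 128 -> 4
        self.global_branch = nn.Sequential(
            *[conv_block(g_ch[i], g_ch[i + 1]) for i in range(6)],
            nn.Flatten(), nn.Linear(512 * 4 * 4, 1024))
        self.local_branch = nn.Sequential(
            *[conv_block(l_ch[i], l_ch[i + 1]) for i in range(5)],
            nn.Flatten(), nn.Linear(512 * 4 * 4, 1024))
        self.head = nn.Linear(2048, 1)            # 1024 + 1024 = 2048 -> scalar score

    def forward(self, full_img, local_patch):
        g = self.global_branch(full_img)
        l = self.local_branch(local_patch)
        score = self.head(torch.cat([g, l], dim=1))   # raw score C(x), see Eq. (9)
        return torch.sigmoid(score)                   # probability in (0, 1)

d = GlobalLocalDiscriminator()
p = d(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 128, 128))   # -> shape (1, 1)
```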

Loss function

We combine the adversarial loss, feature matching loss, mean absolute error loss, and color loss to constitute the total loss function of the proposed model. The following will introduce the details of each loss function.

Adversarial loss

The traditional discriminator evaluates the probability that the input image is real. The generator aims to increase the probability of its generated image being misclassified by the discriminator. The score of the input image \(x\) is defined as \(C\left( x \right)\), which is subsequently transformed through a Sigmoid function to obtain a probability value \(D\left( x \right)\) between 0 and 1, as shown in (9):

$$D\left( x \right) = {\text{Sigmoid}} \left( {C\left( x \right)} \right)$$
(9)

To enhance the authenticity of texture details in generated samples, a relativistic average discriminator \(D_{Ra}\) [33] is used. This discriminator no longer judges the authenticity of images but predicts the probability that real images \(x_{r}\) are more realistic compared to fake images \(x_{f}\), as shown in (10):

$$D_{Ra} \left( {x_{r} ,x_{f} } \right) = {\text{Sigmoid}} \left( {C\left( {x_{r} } \right) - {\rm E}_{{x_{f} }} \left[ {C\left( {x_{f} } \right)} \right]} \right)$$
(10)

where \({\rm E}_{{x_{f} }} \left( \cdot \right)\) is the mean of all fake data in the mini-batch.

Therefore, the adversarial loss is expressed as:

$$L_{adv} = - {\rm E}_{{x_{r} }} \left[ {\log \left( {1 - D_{Ra} \left( {x_{r} ,x_{f} } \right)} \right)} \right] - {\rm E}_{{x_{f} }} \left[ {\log \left( {D_{Ra} \left( {x_{f} ,x_{r} } \right)} \right)} \right]$$
(11)
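As a sketch, the generator-side loss of Eq. (11) can be computed directly from the raw discriminator scores; the small constant added inside the logarithms for numerical stability is an implementation assumption.

```python
import torch

def relativistic_adv_loss(c_real, c_fake, eps=1e-8):
    """Generator adversarial loss of Eq. (11) from raw scores C(x_r), C(x_f)."""
    # D_Ra(x_r, x_f) = Sigmoid(C(x_r) - E[C(x_f)]), Eq. (10)
    d_rf = torch.sigmoid(c_real - c_fake.mean())
    d_fr = torch.sigmoid(c_fake - c_real.mean())
    # L_adv = -E[log(1 - D_Ra(x_r, x_f))] - E[log(D_Ra(x_f, x_r))]
    return -torch.log(1 - d_rf + eps).mean() - torch.log(d_fr + eps).mean()

# Example with random scores for a mini-batch of four images.
loss = relativistic_adv_loss(torch.randn(4), torch.randn(4))
```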

Feature matching loss

The feature matching loss [34] measures the difference between the intermediate-layer features of the ground truth and the generated image in the discriminator. This helps the generator learn multi-scale information, leading to higher-quality generated results with improved semantics and structure. Its formula is as follows:

$$L_{fm\_dis} = \sum\limits_{l = 1}^{5} {\omega^{l} \frac{{\left\| {D_{local}^{l} \left( {{\text{I}}_{gt} } \right) - D_{local}^{l} \left( {{\text{I}}_{output} } \right)} \right\|_{1} }}{{{\text{N}}_{{D_{local}^{l} \left( {{\text{I}}_{gt} } \right)}} }}}$$
(12)

where \({\text{I}}_{gt}\) and \({\text{I}}_{output}\) represent the ground truth and the output image, respectively. \(D_{local}^{l} \left( {{\text{I}}_{gt} } \right)\) and \(D_{local}^{l} \left( {{\text{I}}_{output} } \right)\) denote the feature maps of \({\text{I}}_{gt}\) and \({\text{I}}_{output}\) at the \(l\)-th ReLU activation layer of the local discriminator, and \({\text{N}}_{{D_{local}^{l} \left( {{\text{I}}_{gt} } \right)}}\) is the number of elements in \(D_{local}^{l} \left( {{\text{I}}_{gt} } \right)\). \(\omega^{l}\) represents the weight of the \(l\)-th ReLU activation layer, and \(\left\| \cdot \right\|_{1}\) denotes the L1 norm. The upper limit of the summation is 5, corresponding to the number of layers of the local discriminator.
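A short PyTorch sketch of Eq. (12) is given below; the per-layer weights \(\omega^{l}\) are placeholders, since their values are not specified here.

```python
import torch

def feature_matching_loss(feats_gt, feats_out, weights):
    """Eq. (12): weighted, element-count-normalized L1 distance per ReLU layer."""
    loss = 0.0
    for w, f_gt, f_out in zip(weights, feats_gt, feats_out):
        # torch.abs(...).mean() equals the L1 norm divided by the number of elements.
        loss = loss + w * torch.abs(f_gt - f_out).mean()
    return loss

# Example with dummy features from the five local-discriminator layers.
gt_feats = [torch.randn(1, 64 * 2 ** i, 64 // 2 ** i, 64 // 2 ** i) for i in range(5)]
out_feats = [torch.randn_like(f) for f in gt_feats]
loss = feature_matching_loss(gt_feats, out_feats, weights=[1.0] * 5)
```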

Mean absolute error loss

Mean absolute error loss [35] minimizes the average of absolute errors between generated results \(y_{i}\) and the actual values \(x_{i}\). Its formula is as follows:

$$L_{mae} = \frac{{\sum\limits_{i = 1}^{n} {\left| {y_{i} - x_{i} } \right|} }}{n}$$
(13)

where \(n\) is the number of samples.

Color loss

Color loss [36] is introduced to reduce the color bias between the output image and the target image. Gaussian filtering is first applied to produce blurred images that discard high-frequency information while preserving low-frequency components. The mean squared error (MSE) between the two blurred images is then computed as the loss, measuring the color difference independently of texture and content. Its formula is as follows:

$$L_{color} = {\text{MSE}} \left( {I_{smoothed}^{1} ,I_{smoothed}^{2} } \right)$$
(14)

where \(I_{smoothed}^{1}\) and \(I_{smoothed}^{2}\) denote the blurred target image and the blurred inpainting result obtained by Gaussian filtering, respectively.
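A possible implementation of Eq. (14) is sketched below; the Gaussian kernel size and standard deviation are assumptions, as they are not specified above.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(img, kernel_size=21, sigma=3.0):
    """Depthwise Gaussian blur that keeps low-frequency (color) information."""
    half = kernel_size // 2
    coords = torch.arange(kernel_size, dtype=torch.float32, device=img.device) - half
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = torch.outer(g, g).view(1, 1, kernel_size, kernel_size)
    kernel = kernel.repeat(img.shape[1], 1, 1, 1)            # one kernel per channel
    return F.conv2d(img, kernel, padding=half, groups=img.shape[1])

def color_loss(output, target, kernel_size=21, sigma=3.0):
    """Eq. (14): MSE between the two Gaussian-blurred images."""
    return F.mse_loss(gaussian_blur(output, kernel_size, sigma),
                      gaussian_blur(target, kernel_size, sigma))

# Example on a random pair of 256 x 256 RGB images.
loss = color_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
```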

Total loss

The total loss of the proposed model is shown in (15):

$$L_{Total} = \lambda_{adv} L_{adv} + \lambda_{fm\_dis} L_{fm\_dis} + \lambda_{mae} L_{mae} + \lambda_{color} L_{color}$$
(15)

where \(\lambda_{adv}\), \(\lambda_{fm\_dis}\), \(\lambda_{mae}\), and \(\lambda_{color}\) are the weights of the adversarial loss, feature matching loss, mean absolute error loss, and color loss, respectively. To determine these weights, we conducted experiments to balance the contribution of each loss term and optimize the overall performance of the model. After trying various weight combinations, we set the weights of the adversarial loss, feature matching loss, mean absolute error loss, and color loss to 0.03, 5, 1, and 0.01, respectively. The total loss therefore reflects a balance between these components, ensuring that each term contributes appropriately to training and to the final performance of the model.
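With the weights chosen above, Eq. (15) reduces to the following one-line computation (a sketch; the four inputs are the loss values defined in the previous subsections):

```python
def total_loss(l_adv, l_fm_dis, l_mae, l_color):
    """Eq. (15) with the reported weights 0.03, 5, 1, and 0.01."""
    return 0.03 * l_adv + 5.0 * l_fm_dis + 1.0 * l_mae + 0.01 * l_color
```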

Experimental analysis

Datasets

The experiment used two datasets: an ancient painting inpainting dataset that we created and a publicly available mural dataset.

Ancient painting inpainting dataset

We constructed an ancient painting inpainting dataset containing training, validation, and test sets with 10,500, 1373, and 624 samples, respectively, as shown in Table 4. The samples come from four well-known Chinese ancient paintings of different styles and painting techniques, and each sample image has a size of 256 × 256 pixels. Single-color samples were removed to keep the dataset diverse. Among the four paintings, the version of Along the River During the Qingming Festival used here is the Ming Dynasty copy by Qiu Ying of the original Song Dynasty work by Zhang Zeduan. To obtain simulated damaged samples, we performed pixel-wise multiplication between the complete ancient paintings and mask images randomly selected from the mask dataset in [6], which allowed us to simulate various breakage rates. The ancient painting inpainting dataset is accessible at https://github.com/luyjsnsndjx/painting-dataset.git.

Table 4 The details of ancient painting inpainting dataset
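The damage simulation described above can be reproduced with a few lines of Python. This is a sketch: the file names are hypothetical, and the mask convention (intact pixels equal to 1, damaged pixels equal to 0, so that multiplication zeroes out the damaged regions) is an assumption consistent with the pixel-wise multiplication described in the text.

```python
import torch
from PIL import Image
from torchvision import transforms

def simulate_damage(painting_path, mask_path, size=256):
    """Pixel-wise multiplication of a complete painting with a random binary mask."""
    to_tensor = transforms.Compose([transforms.Resize((size, size)),
                                    transforms.ToTensor()])
    painting = to_tensor(Image.open(painting_path).convert("RGB"))
    mask = (to_tensor(Image.open(mask_path).convert("L")) > 0.5).float()
    damaged = painting * mask            # damaged pixels are set to zero
    return damaged, mask

# Usage (hypothetical file names):
# damaged, mask = simulate_damage("painting_0001.png", "mask_0427.png")
```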

Mural dataset

To evaluate our model in different scenarios, experiments were also conducted on the mural dataset [37]. This dataset contains 1714 samples, mainly from the Mogao Grottoes in Dunhuang, including both authentic murals and copies made by artists.

Experimental configuration

The proposed model was implemented with PyTorch 1.8.1 and Python 3.8 and trained on an NVIDIA GeForce RTX 3080 Ti GPU, using a batch size of 4, a learning rate of 0.0002, and the Adam optimizer.

Experimental results

Our model is validated on the ancient painting inpainting dataset constructed in the “Ancient painting inpainting dataset” section. From top to bottom, Fig. 5c displays the restored paintings of A Thousand Li of Rivers and Mountains, Dwelling in the Fuchun Mountains, Spring Morning in the Han Palace, and Along the River During the Qingming Festival, respectively.

Fig. 5
figure 5

Inpainting results of the proposed model: a original ancient paintings; b simulated damaged ancient paintings; c inpainting results

The results demonstrate that the proposed model can accurately restore various elements, such as buildings, mountains, rocks, and trees in A Thousand Li of Rivers and Mountains. Additionally, the model can fully recover the original image details and textures, such as the contours of mountains in Dwelling in the Fuchun Mountains, the tables, chairs, and clothing in Spring Morning in the Han Palace, and the details of the tree in Along the River During the Qingming Festival, without structural distortion and color bias.

Comparison experiments

We compare the proposed model with the following five typical image inpainting models: (1) LaMa [38]: a large-mask inpainting network that uses fast Fourier convolutions (FFC) to obtain a global receptive field. (2) CoordFill [39]: a framework that achieves efficient high-resolution image inpainting via parameterized coordinate querying. (3) FRRN [40]: a full-resolution residual network that achieves progressive image inpainting. (4) RFA [41]: a texture-aware network that preserves shallow texture features for fine-grained inpainting. (5) CTSDG [31]: a dual-branch network that couples texture synthesis and structure reconstruction.

Qualitative comparison

The results of both our model and the comparative models are shown in Fig. 6. Figure 6c and d exhibit obvious color bias and blurriness. The restored texture in Fig. 6e appears incomplete and chaotic. Figure 6f exhibits noticeable blurriness. Figure 6g has improved but still lacks texture details and has blurry edges. In contrast, for the inpainting results of Fig. 6h, from the first row, our model restores the original line of clothing, avoiding the missing texture issues observed in comparative models. From the second row, our model recovers detail textures overlooked by other models. From the third row, our model preserves harmonious lines and there is no color bias. Furthermore, from the fourth row, our model avoids the texture blurring and distortion issues encountered by the other five models. Lastly, from the fifth row, our model restores the painting to its original form.

Fig. 6
figure 6

Comparison of inpainting results from different models: a original ancient paintings; b simulated damaged ancient paintings; c LaMa; d CoordFill; e RFA; f FRRN; g CTSDG; h Ours

In our model, the gated encoding branch allows the model to effectively capture and encode spatial information, enhancing its ability to extract complex image structures. The contextual feature aggregation module improves the ability of the proposed model to capture long-range dependencies and contextual information. Furthermore, the color loss ensures that the model preserves the original color information, leading to more realistic and visually appealing results. In conclusion, by introducing these modules, our model successfully restores the color and detailed texture of ancient paintings.

Quantitative comparison

We evaluate inpainting results using quantitative metrics: peak signal-to-noise ratio (PSNR) [42], structural similarity index measure (SSIM) [43], mean squared error (MSE) [44], and learned perceptual image patch similarity (LPIPS) [45]. PSNR is derived from the mean squared error between the original image and the restored image; a higher PSNR means less distortion. SSIM produces a score between 0 and 1 by comparing differences in brightness, contrast, and structure between two images; a higher SSIM indicates higher similarity. MSE averages the squared differences between predicted and true values; a smaller MSE means a smaller discrepancy. LPIPS is designed to agree more closely with human perception and thus applies better to real scenarios: two images are fed into a pre-trained deep network such as VGG, features extracted at different depths are normalized, and similarity is assessed by the distance between these features; a lower LPIPS indicates higher similarity between the two images.
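A small sketch of how PSNR, SSIM, and MSE can be computed with scikit-image (version 0.19 or later for the channel_axis argument) is given below; LPIPS additionally requires a pretrained network, for example the lpips package with a VGG backbone, and is therefore only noted in a comment.

```python
import numpy as np
from skimage.metrics import (peak_signal_noise_ratio, structural_similarity,
                             mean_squared_error)

def evaluate_pair(gt, pred):
    """PSNR, SSIM, and MSE for one ground-truth/restored pair of HxWx3 uint8 arrays."""
    return {
        "PSNR": peak_signal_noise_ratio(gt, pred, data_range=255),
        "SSIM": structural_similarity(gt, pred, channel_axis=2, data_range=255),
        "MSE": mean_squared_error(gt, pred),
        # LPIPS: feed both images to a pretrained network (e.g. lpips.LPIPS(net="vgg"))
        # and report the feature distance; omitted here to keep the sketch dependency-free.
    }

# Example with random images standing in for ground truth and inpainting result.
gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
pred = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(evaluate_pair(gt, pred))
```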

Table 5 shows that the proposed model outperforms all other models on these quantitative metrics, thereby indicating the superiority of the proposed model in terms of reconstruction accuracy, structural similarity, and perceptual quality.

Table 5 Comparison of evaluation metrics of different models on the test set

Table 6 reports the performance of the proposed model and all comparison models under mask rates ranging from 1% to 50%, where the mask rate is the percentage of the image that is masked or damaged. By simulating different levels of damage, we fully evaluate the performance of the proposed model. The results in Table 6 show that our model remains robust and effective across damage levels, further validating its usefulness for ancient painting inpainting.

Table 6 Comparison of evaluation metrics in different mask rates

Ablation study

The ablation study is conducted on our ancient painting dataset to verify the effects of color loss \(L_{color}\), gated encoding branch (GEB), and contextual feature aggregation (CFA) modules. The results presented in Table 7 indicate that the removal of these modules resulted in a decrease in PSNR and SSIM values, along with an increase in MSE and LPIPS values. This suggests that each module plays a crucial role in the inpainting process.

Table 7 Ablation study

The differences before and after using each module are shown in Fig. 7. Color loss adjusts and preserves the color of the original image by constraining brightness, contrast, and major color differences; comparing Fig. 7a and b shows that the color loss resolves the color bias in the inpainted area. GEB captures both shallow detail features and deep semantic features, reducing semantic information loss; comparing Fig. 7c and d shows that adding the GEB improves detail texture. Comparing Fig. 7e and f shows that removing CFA leads to inconsistent texture, whereas incorporating CFA improves texture color and edge clarity.

Fig. 7
figure 7

Results comparison before and after using the module: a with color loss; b without color loss; c with GEB; d without GEB; e with CFA; f without CFA

In summary, Fig. 7 and Table 7 show that introducing the color loss, the gated encoding branch (GEB), and the contextual feature aggregation (CFA) module has a significant impact on ancient painting inpainting. These modules address color bias, detail loss, and the lack of overall integrity in the inpainting results. The proposed model therefore exhibits excellent performance in ancient painting inpainting.

Experiment on murals

To evaluate our model’s effectiveness in various scenarios, we conducted experiments on a mural dataset. As shown in Fig. 8, our model restores the lost color and texture in the original murals, improving the overall quality of inpainting results and resolving filling errors in comparative models.

Fig. 8
figure 8

Inpainting results of different models using the mural dataset: a original mural images; b simulated damaged mural images; c LaMa; d CoordFill; e RFA; f FRRN; g CTSDG; h Ours

As shown in Table 8, the proposed model achieves the best evaluation metrics among all comparison models. In summary, our model excels not only in ancient painting inpainting but also in mural inpainting.

Table 8 Comparison of evaluation metrics of different models on the mural test set

Experiments on real damaged ancient paintings and murals

In the preceding sections, our model was trained and tested using simulated damaged ancient paintings and murals. To assess its performance in real scenarios, we also tested it on real damaged ancient paintings and murals. For these real cases, we consulted experienced artists to accurately mark the shape, location, and size of the damaged areas, ensuring that our model can be applied to the restoration of real ancient paintings and murals.

Real damaged ancient paintings

We chose four damaged ancient paintings as test images; the inpainting results are shown in Fig. 9c. From top to bottom, the first result in Fig. 9c achieves excellent integrity by effectively restoring the structure and texture of the mountain. The second result maintains the color of the river. The third result exhibits minimal structural loss and color bias, preserving the integrity of the houses and trees. The fourth damaged painting is well reconstructed, with the lost color and texture information restored and no structural distortion introduced. These experiments verify the strong performance of the proposed model in restoring mountains, rivers, houses, and trees in real damaged ancient paintings.

Fig. 9
figure 9

Inpainting results of the real damaged ancient paintings: a real damaged ancient paintings; b ancient paintings with labeled damaged areas; c inpainting results

Real damaged murals

We also conducted tests on real damaged murals. As shown in Fig. 10c, the restored murals in the first, second, and fourth rows achieve good overall restoration with structural integrity and avoid problems such as color distortion. These results show that the proposed model can proficiently fill the damaged areas while maintaining the overall clarity and integrity of the murals. However, the model does have limitations on real damaged murals, such as the blurred ears of the character in the third row of Fig. 10c. We will further optimize the model to improve its ability to restore details.

Fig. 10
figure 10

Inpainting results of the real damaged murals: a real damaged murals; b murals with labeled damaged areas; c inpainting results

Conclusion

An inpainting model for ancient paintings based on dual encoders and contextual information is proposed. First, in the gated encoding branch, gated convolutions are introduced to capture the semantic information of ancient paintings and reduce information loss through a dynamic feature selection mechanism. Second, a dense multi-scale feature fusion module is designed to extract multi-scale features accurately by expanding the receptive field with dilated convolutions and applying dense feature fusion; dilated depthwise separable convolutions are also introduced to reduce the generator parameters. Next, a contextual feature aggregation module aggregates contextual information to better handle long-range spatial dependencies. Finally, a color loss is introduced to preserve color information. Qualitative and quantitative experiments show that, compared with the comparison models, the proposed model performs better in restoring color and detailed textures while preserving the artistic styles and overall textures of the paintings. The proposed model is also validated on real damaged ancient paintings and murals to confirm its practical applicability, and thus provides a valuable reference for the digital restoration of cultural heritage. However, the model still has limitations in accurately restoring the details of real damaged murals. In future research, we aim to improve the model’s ability to restore murals with greater accuracy and to build a more comprehensive dataset by collecting a wider range of ancient paintings covering different styles, periods, and authors.