Abstract
Deep learning-based inpainting models have achieved success in restoring natural images, yet their application to ancient paintings is hindered by the loss of texture, lines, and color. To address these issues, we introduce an ancient painting inpainting model based on dual encoders and contextual information, which compensates for insufficient feature extraction and detail texture recovery when restoring ancient paintings. Specifically, the proposed model employs a gated encoding branch that minimizes information loss and effectively captures semantic information from ancient paintings. A dense multi-scale feature fusion module is designed to extract texture and detail information at various scales, while dilated depthwise separable convolutions are utilized to reduce parameters and enhance computational efficiency. Furthermore, a contextual feature aggregation module is incorporated to extract contextual features, enhancing the overall consistency of the inpainting results. Finally, a color loss function is introduced to ensure color consistency in the restored area, harmonizing it with the surrounding region. The experimental results indicate that the proposed model effectively restores the texture details of ancient paintings, outperforming other methods both qualitatively and quantitatively. Additionally, the model is tested on real damaged ancient paintings to validate its practicality and efficacy.
Introduction
Ancient paintings are a valuable part of traditional culture. Owing to their long history, many ancient paintings have suffered varying degrees of damage. With the development of deep learning, such techniques have been widely applied to image inpainting, including the inpainting of ancient paintings. In recent years, many researchers have used deep learning methods such as convolutional neural networks and generative adversarial networks to restore images [1]. Pathak et al. [2] proposed an inpainting method based on an encoder-decoder architecture and contextual semantic features, known as context encoders (CE). However, the restored areas often suffer from low resolution and lack consistency with the surrounding areas. Moreover, this method requires the damaged area to be located at the center of the image. Liao et al. [3] proposed an improved version of CE called the edge-aware context encoder (E-CE) to overcome these limitations. Instead of relying on the context of the entire image, E-CE uses the edge structure reconstructed by a fully convolutional network to restore textures. Vo et al. [4] introduced a structural loss to the CE, which further refines the model’s inpainting capability across various scenes. Yan et al. [5] proposed a shift connection layer to address the blurriness and semantic ambiguity of CE. This layer connects encoder features from known areas with decoder features from missing parts, enhancing the inpainting of semantic structure and detailed textures. Liu et al. [6] introduced partial convolution within the U-Net to address the color bias and blurriness caused by regular convolution. The partial convolution automatically generates new masks during forward propagation, thereby reducing artifacts. Li et al. [7] developed a recurrent feature reasoning module that performs iterative inference on mask boundary feature maps. Additionally, a knowledge-consistent attention mechanism is employed to gradually refine feature maps by adaptively fusing attention scores. Yi et al. [8] introduced a contextual residual aggregation (CRA) mechanism within the U-Net to address memory limitations and enable the inpainting of ultra-high-resolution images (e.g., 4 K to 8 K). By aggregating weighted residuals from contextual patches, this mechanism generates high-frequency residuals for the missing content.
Although generative adversarial networks (GANs) can restore natural images, they often suffer from instability and loss of semantic information, leading to blurry and distorted images. Moreover, GANs struggle to learn high-resolution image textures and structural information. Consequently, researchers have proposed various improved GANs to address these challenges. Yeh et al. [9] combined context and prior losses to better restore semantic information. To address the blur and artifacts that arise when restoring high-resolution images, Yang et al. [10] proposed a multi-scale neural patch synthesis method. This method adjusts patches with the most similar mid-level feature correlations to generate clearer and more coherent high-resolution images. Hui et al. [11] proposed a single-stage model employing dilated convolutions to capture a broader range of semantic information. They also designed self-guided regression losses and geometric alignment constraint losses to enhance semantic details and minimize the gap between predicted and real features. To repair large irregular missing areas, Zeng et al. [12] proposed the aggregated contextual transformation GAN (AOT-GAN). By stacking multiple AOT blocks, the generator’s contextual reasoning ability is enhanced, while introducing the mask prediction module from PatchGAN [13] strengthens the discriminative capability. Lv et al. [14] argue that introducing prior information can improve inpainting quality. To this end, they designed a GAN model with two generators: an edge restoration network that restores mural edges and guides a content completion network for content inpainting. Moreover, Deng et al. [15] proposed a structure-guided dual-branch GAN model, which enhances the inpainting capability for complex mural structures.
In 2017, Vaswani et al. proposed the Transformer [16], which uses self-attention mechanisms. Unlike convolutional neural networks (CNNs), the Transformer establishes long-range dependencies between different positions. This enables the Transformer to effectively capture global contextual information and achieve a more accurate understanding of image structures. Zhou et al. [17] observe that image inpainting methods often rely on sample similarity and large-scale training data to learn texture and semantic information, but such methods are limited when dealing with larger and more complex image inpainting tasks. They therefore proposed a multi-homography transformed fusion method to fill missing areas in a target image by referring to another source image with the same content. Additionally, they designed a color and spatial transformer that adjusts the color to make the inpainting results more consistent with the target image. Chen et al. [18] introduced a novel two-stage blind facial inpainting model called the frequency-guided transformer and top-down refinement network (FT-TDR). This model detects areas to be restored by modeling relationships between different patches and improves detection results by using the frequency modality as supplementary information. Since capturing global contextual information through deep and large receptive fields is inefficient, Zheng et al. [19] treat image inpainting as an undirected sequence-to-sequence prediction task and directly capture long-range dependencies by adopting Transformers in the encoder. They also introduce an attention-aware layer (AAL) to better exploit high-frequency features from distantly related regions. Transformer-based models are constrained by heavy inference computations on large images. To address this issue, Dong et al. [20] use a structure repair network to restore the overall image structure and employ a masking positional encoding strategy for inpainting large damaged areas.
Pluralistic image inpainting models can generate multiple inpainting results, allowing users to choose various outputs. Han et al. [21] proposed a two-stage model with shape and appearance generation networks. Each network includes two encoders that jointly optimize to generate multiple plausible results. Zhao et al. [22] introduced the unsupervised cross-space translation generative adversarial network (UCTGAN), which enables one-to-one image mapping between two spaces. Additionally, Peng et al. [23] proposed a two-stage inpainting model. In the first stage, multiple coarse results with diverse structural features are generated. The second stage uses a texture generation network to improve texture details for each coarse result. They also introduced structural attention in the texture generation network to enhance results by capturing distant correlations.
Many deep learning inpainting models excel at restoring natural images but face challenges with ancient paintings, often resulting in the loss of texture details and excessive smoothing. Further improvements are required in extracting semantic features and transferring features for ancient paintings so that their original texture and detail information can be preserved. In recent years, multi-scale feature fusion, gated encoding branches, and feature aggregation have been widely used in computer vision and image processing [24,25,26]. Motivated by these ideas, we propose an ancient painting inpainting model based on dual encoders and contextual information to address these challenges. The model consists of a generator and a discriminator. The generator employs a dual-branch structure comprising a regular encoding branch and a gated encoding branch, both of which incorporate four dense multi-scale feature fusion modules to capture rich information. The merged features from these two encoding branches are then passed through a contextual feature aggregation module, which enhances their correlation and further refines the restoration. Finally, the restored ancient painting is fed to a discriminator for quality assessment. Built around the dual-encoder architecture, the dense multi-scale feature fusion modules, and the contextual feature aggregation module, our model aims to effectively address the detail loss and overall inconsistency in ancient painting inpainting. The main contributions of this work are as follows:
- To fully capture the semantic information of ancient paintings, a gated encoding branch is designed to reduce information loss.
- A dense multi-scale feature fusion module is designed to extract the texture and details of ancient paintings at different scales. Additionally, dilated depthwise separable convolutions are employed to reduce the number of parameters and improve computational efficiency.
- A contextual feature aggregation module is introduced to extract contextual features, which improves the overall consistency of the results.
- A color loss is added to improve color consistency in the inpainting results, ensuring that the color of the restored area is in harmony with that of the surrounding area.
Methodology
Overall network architecture
The proposed model consists of a generator and a discriminator, as shown in Fig. 1. The generator uses a dual-branch structure that consists of a regular encoding branch and a gated encoding branch, both of which include four dense multi-scale feature fusion modules. The features from the two encoding branches are then merged and fed into the contextual feature aggregation module to improve their correlation. Finally, the restored ancient painting is input to a discriminator.
The discriminator has two branches, one for evaluating the global image and another for the local image. Both branches use convolutional layers with a stride of 2. To further optimize the performance of the generator, the model integrates several loss functions, including mean absolute error loss, color loss, adversarial loss, and feature matching loss. These loss functions ensure that the generated results are closer to the style and details of the original ancient painting.
Gated encoding branch
To improve the model’s capability of extracting features from ancient paintings, a gated encoding branch is introduced. This branch consists of a gated encoder made up of gated convolutions [27] and four dense multi-scale feature fusion modules. The gated encoder fully exploits the dynamic pixel-selection property of gated convolutions to learn soft masks automatically. The gated convolution is shown in Fig. 2. The soft gating mechanism dynamically adjusts the importance of each element in the feature map by applying a Sigmoid function after the convolution operation, generating gating values that range from 0 to 1. This allows the model to exercise fine-grained control over feature processing. By fusing the feature maps from the regular encoding branch and the gated encoding branch, the model can leverage the strengths of both to generate more accurate inpainting results.
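To make this gating mechanism concrete, the following minimal PyTorch sketch implements a gated convolution in which a parallel gating branch produces a soft mask via a Sigmoid. The layer widths, activation function, and default hyperparameters here are illustrative assumptions rather than the exact configuration of our generator.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Sketch of a gated convolution [27]: a feature branch and a parallel
    gating branch share the same input; the gating branch is passed through
    a Sigmoid to produce a soft mask in (0, 1) that re-weights the features
    pixel by pixel."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, dilation)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, dilation)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        gate = torch.sigmoid(self.gate(x))       # soft gating values in (0, 1)
        return self.act(self.feature(x)) * gate  # element-wise re-weighting
```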
Dense multi-scale feature fusion module
To keep the number of parameters low while improving feature richness and continuity, we propose a dense multi-scale feature fusion (DMFF) module, as shown in Fig. 3. The module comprises four parallel dilated depthwise separable convolution branches. The parameters are reduced by replacing dilated convolutions (DConv) [28] with dilated depthwise separable convolutions (DDSConv) [29]. Specifically, the four DDSConv branches have dilation rates of 1, 2, 4, and 8, respectively. Table 1 shows the resulting decrease in the number of generator parameters.
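To illustrate where the parameter saving comes from, the sketch below builds a DDSConv as a depthwise dilated 3 × 3 convolution followed by a 1 × 1 pointwise convolution. The channel width of 64 follows the description in this section, while the absence of normalization and activation layers is an assumption made for brevity.

```python
import torch.nn as nn

class DDSConv(nn.Module):
    """Sketch of a dilated depthwise separable convolution: a depthwise 3x3
    convolution with a given dilation rate followed by a 1x1 pointwise
    convolution. For the same receptive field it needs far fewer parameters
    than a standard dilated convolution."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# The four parallel DMFF branches would then use dilation rates 1, 2, 4, and 8:
branches = nn.ModuleList([DDSConv(64, d) for d in (1, 2, 4, 8)])
```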
In the DMFF module, the channel number of the input features is first reduced to 64 before being fed into the four branches for multi-scale feature extraction, denoted as \(x_{i}\) \(\left( {i = 1,2,3,4} \right)\). Each branch consists of a 3 × 3 convolution operation, denoted as \(k_{i} \left( \cdot \right)\). To obtain rich dense feature \(y_{i}\) from sparse multi-scale features, a dense feature fusion is employed, expressed as:
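One plausible form of this dense fusion, assuming each branch receives the accumulated output of the preceding branch, is:

$$y_{i} = \begin{cases} k_{i}\left( x_{i} \right), & i = 1 \\ k_{i}\left( x_{i} + y_{i - 1} \right), & i = 2,3,4 \end{cases}$$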
The outputs of each branch are accumulated with the outputs of the other branches, effectively preventing the gridding effect [30] and preserving feature continuity. Combining a 1 × 1 convolution with skip connections further enhances feature richness and continuity.
Contextual feature aggregation module
To fully use contextual information in ancient paintings, we introduce the contextual feature aggregation (CFA) module [31], as shown in Fig. 4. This module extracts contextual features of ancient paintings through region affinity learning, capturing both overall structure and local details to provide important feature information for the aggregation process. The multi-scale feature aggregation operation encodes rich semantic features at different scales. The use of a partial convolutional mask updating mechanism eliminates the need to distinguish between foreground and background pixels. Furthermore, the CFA module introduces skip connections to prevent the loss of potential semantic information.
Initially, the input feature is processed through a convolutional (Conv) layer to obtain feature map \(F\), and then 3 × 3 patches of \(F\) are extracted. Subsequently, the cosine similarity between these patches is calculated by using (2).
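A standard form of this patch cosine similarity, consistent with the symbols defined below, is:

$$s_{i,j} = \left\langle \frac{f_{i}}{\left\| f_{i} \right\|_{2}},\ \frac{f_{j}}{\left\| f_{j} \right\|_{2}} \right\rangle$$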
where \(f_{i}\) and \(f_{j}\) represent the \(i\)-th and \(j\)-th patches of the feature map \(F\), respectively. \(\left\| \cdot \right\|_{2}\) denotes the L2 norm.
The Softmax function is then applied to the calculated cosine similarity to obtain the attention score for each patch as shown in (3):
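Writing \(N\) for the number of extracted patches (a notation introduced here), the attention scores take the usual Softmax form:

$$\alpha_{i,j} = \frac{\exp \left( s_{i,j} \right)}{\sum_{j' = 1}^{N} \exp \left( s_{i,j'} \right)}$$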
Subsequently, based on the extracted patches and their corresponding attention scores, the feature map is reconstructed. The reconstruction process is shown in (4):
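Using the attention scores above, the reconstruction can be written as the attention-weighted sum of the extracted patches:

$$\tilde{f}_{i} = \sum_{j = 1}^{N} \alpha_{i,j}\, f_{j}$$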
where \(\tilde{f}_{i}\) represents the \(i\)-th patch of the reconstructed feature map \(F_{rec}\).
To capture multi-scale semantic features, CFA uses four sets of dilated convolutions with different dilation rates, as shown in (5):
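Assuming the dilated convolutions act on the reconstructed feature map, and writing \(F^{k}\) for the output at dilation rate \(k\) (a notation introduced here), this step can be expressed as:

$$F^{k} = Conv_{k} \left( F_{rec} \right), \quad k \in \left\{ 1,2,4,8 \right\}$$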
where \(Conv_{k} \left( \cdot \right)\) denotes the dilated convolution, and the dilation rates are set as \(k \in \left\{ {1,2,4,8} \right\}\).
A weight generator \(G_{w}\) is used to predict the pixel-wise weight maps. It consists of two convolutional layers with kernel sizes of 3 and 1 respectively. After each layer, a ReLU activation function is applied. The output channels of this generator are set to 4, which are used to compute the weights \(W^{k}\) by using (6) and (7).
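A plausible formulation of these two steps, assuming the weight generator takes the reconstructed feature map as input and its four output channels are normalized by a Softmax before slicing, is:

$$W = {\text{Softmax}} \left( G_{w} \left( F_{rec} \right) \right)$$

$$W^{k} = {\text{Slice}} \left( W, k \right), \quad k \in \left\{ 1,2,4,8 \right\}$$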
where \({\text{Slice}}\left( \cdot \right)\) denotes the channel-wise slice operation.
The multi-scale semantic features are aggregated to generate refined feature map \(F_{c}\) through element-wise summation, as shown in (8):
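Writing \(\odot\) for element-wise multiplication, the weighted aggregation can be expressed as:

$$F_{c} = \sum_{k \in \left\{ 1,2,4,8 \right\}} W^{k} \odot F^{k}$$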
Finally, the feature map \(F\) is concatenated with \(F_{c}\), and the concatenated feature is then processed through a deconvolutional (Deconv) layer to produce the final output features.
Global and local discriminators
The global and local discriminators proposed in [32] are used to assess the generated images. The global branch evaluates the entire image, capturing global information and structural features with a larger receptive field. Meanwhile, the local branch evaluates details and texture information in local areas. This combination of global and local branches addresses the limitations of a single discriminator. More detailed information about the layers of the discriminators can be found in Tables 2 and 3.
As shown in Fig. 1, the global discriminator consists of six convolutional layers, taking 256 × 256 output images as input. The local discriminator consists of five convolutional layers, randomly selecting a 128 × 128 local patch as input. The outputs of the global and local discriminators are concatenated into a 2048-dimensional vector. This vector is then processed through a fully connected layer to produce a probability value between 0 and 1 using a Sigmoid function.
Loss function
We combine the adversarial loss, feature matching loss, mean absolute error loss, and color loss to constitute the total loss function of the proposed model. The following will introduce the details of each loss function.
Adversarial loss
The traditional discriminator evaluates the probability that the input image is real. The generator aims to increase the probability of its generated image being misclassified by the discriminator. The score of the input image \(x\) is defined as \(C\left( x \right)\), which is subsequently transformed through a Sigmoid function to obtain a probability value \(D\left( x \right)\) between 0 and 1, as shown in (9):
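With \(\sigma \left( \cdot \right)\) denoting the Sigmoid function, this reads:

$$D\left( x \right) = \sigma \left( C\left( x \right) \right)$$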
To enhance the authenticity of texture details in generated samples, a relativistic average discriminator \(D_{Ra}\) [33] is used. This discriminator no longer judges the authenticity of images but predicts the probability that real images \(x_{r}\) are more realistic compared to fake images \(x_{f}\), as shown in (10):
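Following the relativistic average formulation of [33], this probability is computed as:

$$D_{Ra} \left( x_{r} ,x_{f} \right) = \sigma \left( C\left( x_{r} \right) - {\rm E}_{x_{f}} \left[ C\left( x_{f} \right) \right] \right)$$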
where \({\rm E}_{{x_{f} }} \left( \cdot \right)\) is the mean of all fake data in the mini-batch.
Therefore, the adversarial loss is expressed as:
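A standard relativistic average adversarial loss for the generator, consistent with the definition above, is:

$$L_{adv} = - {\rm E}_{x_{r}} \left[ \log \left( 1 - D_{Ra} \left( x_{r} ,x_{f} \right) \right) \right] - {\rm E}_{x_{f}} \left[ \log \left( D_{Ra} \left( x_{f} ,x_{r} \right) \right) \right]$$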
Feature matching loss
The feature matching loss [34] is used to calculate the difference between the intermediate layer features in the discriminator. This helps the generator to learn multiscale information, leading to higher-quality generated images with improved semantics and structure. Its formula is as follows:
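With the symbols explained below, the feature matching loss over the five layers of the local discriminator can be written as:

$$L_{fm\_dis} = {\rm E}\left[ \sum_{l = 1}^{5} \frac{\omega^{l}}{{\text{N}}_{D_{local}^{l} \left( {\text{I}}_{gt} \right)}} \left\| D_{local}^{l} \left( {\text{I}}_{gt} \right) - D_{local}^{l} \left( {\text{I}}_{output} \right) \right\|_{1} \right]$$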
where \({\text{I}}_{gt}\) and \({\text{I}}_{output}\) represent the ground truth and output image, respectively. \(D_{local}^{l} \left( {{\text{I}}_{gt} } \right)\) and \(D_{local}^{l} \left( {{\text{I}}_{output} } \right)\) represent the feature maps of \({\text{I}}_{gt}\) and \({\text{I}}_{output}\) at the \(l\)-th ReLU activation layer, respectively. \({\text{N}}_{{D_{local}^{l} \left( {{\text{I}}_{gt} } \right)}}\) is the number of elements in \(D_{local}^{l} \left( {{\text{I}}_{gt} } \right)\). \(\omega^{l}\) represents the weight of the \(l\)-th ReLU activation layer. \(\left\| \cdot \right\|_{1}\) denotes the L1 norm. The upper limit of the summation is 5, which corresponds to the number of layers of the local discriminator.
Mean absolute error loss
Mean absolute error loss [35] minimizes the average of absolute errors between generated results \(y_{i}\) and the actual values \(x_{i}\). Its formula is as follows:
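In its usual form, this loss is:

$$L_{mae} = \frac{1}{n} \sum_{i = 1}^{n} \left| x_{i} - y_{i} \right|$$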
where \(n\) is the number of samples.
Color loss
Color loss [36] is introduced to reduce the color bias between the output image and the target image. Gaussian filtering is first applied to both images to produce blurred versions that remove high-frequency information while preserving low-frequency components. The mean squared error (MSE) between the two blurred images is then computed as the loss, measuring the color difference irrespective of texture and content. Its formula is as follows:
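Denoting by \(n\) the number of pixels (a notation introduced here), the color loss is the MSE between the two blurred images:

$$L_{color} = \frac{1}{n} \sum_{i = 1}^{n} \left( I_{smoothed,i}^{1} - I_{smoothed,i}^{2} \right)^{2}$$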
where \(I_{smoothed}^{1}\) and \(I_{smoothed}^{2}\) denote the blurred input image and blurred inpainting result obtained by Gaussian filtering, respectively.
Total loss
The total loss of the proposed model is shown in (15):
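Given the weights defined below, the total loss is the weighted sum of the four terms:

$$L_{total} = \lambda_{adv} L_{adv} + \lambda_{fm\_dis} L_{fm\_dis} + \lambda_{mae} L_{mae} + \lambda_{color} L_{color}$$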
where \(\lambda_{adv}\), \(\lambda_{fm\_dis}\), \(\lambda_{mae}\), and \(\lambda_{color}\) are the weights for the adversarial loss, feature matching loss, mean absolute error loss, and color loss, respectively. To determine these weights, we conducted experiments to balance the contribution of each loss term and optimize the overall performance of the model. We tried various combinations of weights to find the best results. Ultimately, we set the weights for the adversarial loss, feature matching loss, mean absolute error loss, and color loss to 0.03, 5, 1, and 0.01, respectively. The total loss function therefore reflects the balance between these components, ensuring that each loss term contributes appropriately to the training and final performance of the model.
Experimental analysis
Datasets
The experiment used two datasets: an ancient painting inpainting dataset that we created and a publicly available mural dataset.
Ancient painting inpainting dataset
We constructed an ancient painting inpainting dataset, which contains training, validation, and test sets of 10,500, 1373, and 624 images, respectively, as shown in Table 4. These samples are taken from four well-known ancient Chinese paintings of different styles and painting techniques. Each sample image in the dataset has a size of 256 × 256 pixels. Single-color samples were removed to keep the dataset diverse. Among them, Along the River During the Qingming Festival is a Ming Dynasty imitation by Qiu Ying of the original work by Zhang Zeduan from the Song Dynasty. To obtain simulated damaged painting samples, we performed pixel-wise multiplication between the original complete painting and a randomly selected mask image from the mask dataset in [6]. This process allowed us to simulate various breakage rates in the painting samples. The ancient painting inpainting dataset is accessible at https://github.com/luyjsnsndjx/painting-dataset.git.
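For illustration, a minimal sketch of this damage simulation is given below. The file paths, the use of PIL/NumPy, and the mask convention (1 for intact pixels, 0 for missing pixels) are assumptions for the example rather than details specified above.

```python
import numpy as np
from PIL import Image

def simulate_damage(painting_path: str, mask_path: str):
    """Simulate a damaged painting by pixel-wise multiplication of a
    complete painting with a binary mask (1 = intact, 0 = missing)."""
    painting = np.asarray(Image.open(painting_path).convert("RGB"),
                          dtype=np.float32) / 255.0
    mask = np.asarray(Image.open(mask_path).convert("L"),
                      dtype=np.float32) / 255.0
    # Broadcast the single-channel mask over the three RGB channels.
    damaged = painting * mask[..., None]
    return (damaged * 255).astype(np.uint8), mask
```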
Mural dataset
To evaluate our model in different scenarios, experiments were also conducted on the mural dataset [37]. This mural dataset has 1714 samples which are mainly from the Mogao Grottoes in Dunhuang, including both authentic murals and those copied by artists.
Experimental configuration
The proposed model was implemented with PyTorch 1.8.1 and Python 3.8 and run on an NVIDIA GeForce RTX 3080 Ti GPU. Training used a batch size of 4, a learning rate of 0.0002, and the Adam optimizer.
Experimental results
Our model is validated based on the ancient painting inpainting dataset constructed in “Ancient painting inpainting dataset” section. From top to bottom, Fig. 5c displays the restored paintings of A Thousand Li of Rivers and Mountains, Dwelling in the Fuchun Mountains, Spring Morning in the Han Palace, and Along the River During the Qingming Festival, respectively.
The results demonstrate that the proposed model can accurately restore various elements, such as buildings, mountains, rocks, and trees in A Thousand Li of Rivers and Mountains. Additionally, the model can fully recover the original image details and textures, such as the contours of mountains in Dwelling in the Fuchun Mountains, the tables, chairs, and clothing in Spring Morning in the Han Palace, and the details of the tree in Along the River During the Qingming Festival, without structural distortion and color bias.
Comparison experiments
We compare the proposed model with the following five typical image inpainting models: (1) LaMa [38]: a large mask inpainting network that uses Fast Fourier Convolutions (FFC) to obtain the global receptive field. (2) CoordFill [39]: a novel framework that achieves efficient high-resolution image inpainting via parameterized coordinate querying. (3) FRRN [40]: a full-resolution residual network that achieves progressive image inpainting. (4) RFA [41]: a network with texture perception that preserves shallow texture features for fine-grained inpainting. (5) CTSDG [31]: a dual-branch network for the simulation of texture synthesis and structural reconstruction in a coupled form.
Qualitative comparison
The results of both our model and the comparative models are shown in Fig. 6. Figure 6c and d exhibit obvious color bias and blurriness. The restored texture in Fig. 6e appears incomplete and chaotic. Figure 6f exhibits noticeable blurriness. Figure 6g has improved but still lacks texture details and has blurry edges. In contrast, for the inpainting results of Fig. 6h, from the first row, our model restores the original line of clothing, avoiding the missing texture issues observed in comparative models. From the second row, our model recovers detail textures overlooked by other models. From the third row, our model preserves harmonious lines and there is no color bias. Furthermore, from the fourth row, our model avoids the texture blurring and distortion issues encountered by the other five models. Lastly, from the fifth row, our model restores the painting to its original form.
In our model, the gated encoding branch allows the model to effectively capture and encode spatial information, enhancing its ability to extract complex image structures. The contextual feature aggregation module improves the ability of the proposed model to capture long-range dependencies and contextual information. Furthermore, the color loss ensures that the model preserves the original color information, leading to more realistic and visually appealing results. In conclusion, by introducing these modules, our model successfully restores the color and detailed texture of ancient paintings.
Quantitative comparison
We evaluate inpainting results using quantitative metrics: peak signal-to-noise ratio (PSNR) [42], structural similarity index measure (SSIM) [43], mean square error (MSE) [44], and learned perceptual image patch similarity (LPIPS) [45]. PSNR is calculated from the mean squared error of the image, which measures the average difference between the original image and the restored image; a higher PSNR value means less distortion. SSIM produces a score between 0 and 1 by comparing the brightness, contrast, and structure of two images; the higher the SSIM value, the more similar the two images. MSE is computed by averaging the squared differences between predicted values and true values; a smaller MSE signifies less disparity between predicted and true values. LPIPS is designed to be closer to human perception and is better suited to real scenarios. It compares two images by feeding them into a pre-trained deep neural network such as VGG, which extracts features from the input images at different depth levels and normalizes them. The similarity between the images is then assessed by computing the distance between these extracted features; a lower LPIPS value indicates a higher similarity between the two images.
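For reference, the standard definitions of MSE and PSNR for an \(H \times W\) reference image \(I\) and restored image \(\hat{I}\) (notation introduced here) are:

$${\text{MSE}} = \frac{1}{HW} \sum_{i = 1}^{H} \sum_{j = 1}^{W} \left( I\left( i,j \right) - \hat{I}\left( i,j \right) \right)^{2}, \qquad {\text{PSNR}} = 10 \log_{10} \frac{{\text{MAX}}^{2}}{{\text{MSE}}}$$

where MAX is the maximum pixel value (255 for 8-bit images).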
Table 5 shows that the proposed model outperforms all other models on these quantitative metrics, thereby indicating the superiority of the proposed model in terms of reconstruction accuracy, structural similarity, and perceptual quality.
Table 6 reports the performance of the proposed model compared with all other models under various mask rates ranging from 1% to 50%. The mask rate represents the percentage of the image that is masked or damaged, so these experiments simulate different levels of damage and thus fully evaluate the performance of the proposed model. The results in Table 6 show that our model is robust and effective in handling different levels of damage, further validating its efficacy and usefulness for ancient painting inpainting.
Ablation study
The ablation study is conducted on our ancient painting dataset to verify the effects of color loss \(L_{color}\), gated encoding branch (GEB), and contextual feature aggregation (CFA) modules. The results presented in Table 7 indicate that the removal of these modules resulted in a decrease in PSNR and SSIM values, along with an increase in MSE and LPIPS values. This suggests that each module plays a crucial role in the inpainting process.
The differences before and after using the modules are shown in Fig. 7. Color loss adjusts and preserves the color of the original image by assessing brightness, contrast, and major color differences. A comparison of Fig. 7a and b shows that the color loss solves the problem of color bias in the inpainted area. GEB captures both shallow detail features and deep semantic features, reducing semantic information loss. By comparing Fig. 7c and d, it can be seen that the addition of the GEB improves the detail texture. The absence of CFA results in the inconsistency of texture, as seen in Fig. 7e compared to Fig. 7f. Incorporating CFA improves texture color and edge clarity.
In summary, Fig. 7 and Table 7 show that introducing the color loss, gated encoding branch (GEB), and contextual feature aggregation (CFA) module has a significant impact on the inpainting of ancient paintings. These modules address issues such as color bias, detail loss, and lack of overall integrity in the inpainting results. Therefore, the proposed model exhibits excellent performance in ancient painting inpainting.
Experiment on murals
To evaluate our model’s effectiveness in various scenarios, we conducted experiments on a mural dataset. As shown in Fig. 8, our model restores the lost color and texture in the original murals, improving the overall quality of inpainting results and resolving filling errors in comparative models.
As shown in Table 8, the proposed model achieves the best evaluation metrics compared with the comparative models. In summary, our model excels not only in ancient painting inpainting but also in mural inpainting.
Experiments on real damaged ancient paintings and murals
In the preceding section, our model was trained and tested by using simulated damaged ancient paintings and murals. To assess its performance in real damaged scenarios, we used real damaged ancient paintings and murals for testing. It is worth mentioning that for the real damaged ancient paintings and murals, we have consulted experienced artists to accurately mark the shape, location, and size of the real broken areas. This ensures that our model can be applied to the restoration of real ancient paintings and murals.
Real damaged ancient paintings
We chose four damaged ancient paintings as our test images. The inpainting results are shown in Fig. 9c. From top to bottom, the first inpainting result in Fig. 9c achieves excellent integrity by effectively restoring the structure and texture of the mountain. The second result maintains the color of the river. The third result exhibits minimal structural loss and color bias, thus maintaining the integrity of the houses and trees. The fourth damaged ancient painting is well reconstructed by the proposed model, which restores the lost color and texture information without introducing any structural distortion. These experiments verify the excellent performance of the proposed model in restoring mountains, rivers, houses, and trees in real damaged ancient paintings.
Real damaged murals
We conducted test experiments using real damaged murals. As shown in Fig. 10c, the restored murals in the first, second, and fourth rows show good overall restoration results with structural integrity, avoiding problems such as color distortion. These results show that the proposed model can proficiently fill the damaged areas while maintaining the overall clarity and integrity of the murals. However, it is worth pointing out that our model does have limitations when applied to real damaged murals, such as the blurring of the ears of the character in the third row of Fig. 10c. We will further optimize the model to improve its ability to restore details.
Conclusion
An inpainting model for ancient paintings is proposed based on dual encoders and contextual information. First, in the gated encoding branch, gated convolutions are introduced to capture the semantic information of ancient paintings and reduce information loss through a dynamic feature selection mechanism. Second, a dense multi-scale feature fusion module is designed to accurately extract multi-scale features by expanding the receptive field with dilated convolutions and applying dense feature fusion operations; dilated depthwise separable convolutions are also introduced to reduce the generator parameters. Next, a contextual feature aggregation module is used to aggregate contextual information and better handle long-range spatial dependencies. Finally, a color loss is introduced to preserve color information. The qualitative and quantitative results show that, compared with the comparison models, the proposed model performs better in restoring color and detailed textures while preserving the artistic styles and overall textures of the paintings. Additionally, the proposed model is validated on real damaged ancient paintings and murals to further confirm its practical applicability. Therefore, the proposed model provides a valuable reference for the digital restoration of cultural heritage. However, our model has limitations when it comes to accurately restoring the details of real damaged murals. In future research, we aim to improve our model’s ability to restore murals with greater accuracy. Additionally, to further improve the model’s capability, we need to build a more comprehensive dataset by collecting a wider range of ancient paintings covering different styles, periods, and authors.
Availability of data and materials
The datasets used and analyzed in the current study are available from the given link or the corresponding author by reasonable request.
References
Gupta V, Sambyal N, Sharma A, et al. Restoration of artwork using deep neural networks. Evol Syst. 2021;12(2):439–46.
Pathak D, Krahenbuhl P, Donahue J et al. Context encoders: feature learning by inpainting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2016. p. 2536–44.
Liao L, Hu R, Xiao J et al. Edge-aware context encoder for image inpainting. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing; 2018. p. 3156–60.
Vo HV, Duong NQK, Pérez P. Structural inpainting. In: Proceedings of the ACM international conference on multimedia; 2018. p. 1948–56.
Yan Z, Li X, Li M, et al. Shift-net: Image inpainting via deep feature rearrangement. In: Proceedings of the European conference on computer vision; 2018. p. 1–17.
Liu G, Reda FA, Shih KJ, et al. Image inpainting for irregular holes using partial convolutions. In: Proceedings of the European conference on computer vision; 2018. p. 85–100.
Li J, Wang N, Zhang L, et al. Recurrent feature reasoning for image inpainting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 7760–68.
Yi Z, Tang Q, Azizi S, et al. Contextual residual aggregation for ultra high-resolution image inpainting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 7508–17.
Yeh RA, Chen C, Lim TY, et al. Semantic image inpainting with deep generative models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2017. p. 5485–93.
Yang C, Lu X, Lin Z, et al. High-resolution image inpainting using multi-scale neural patch synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2017. p. 6721–9.
Hui Z, Li J, Wang X, et al. Image fine-grained inpainting; 2020. arXiv preprint arXiv:2002.02609.
Zeng Y, Fu J, Chao H, et al. Aggregated contextual transformations for high-resolution image inpainting; 2021. arXiv preprint arXiv:2104.01431.
Isola P, Zhu J, Zhou T, et al. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2017. p. 1125–34.
Lv C, Li Z, Shen Y, et al. SeparaFill: Two generators connected mural image restoration based on generative adversarial network with skip connect. Herit Sci. 2022;10(1):1–13.
Deng X, Yu Y. Ancient mural inpainting via structure information guided two-branch model. Herit Sci. 2023;11(1):1–17.
Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Proceedings of the conference on neural information processing systems; 2017. p. 5998–6008.
Zhou Y, Barnes C, Shechtman E, et al. Transfill: Reference-guided image inpainting by merging multiple color and spatial transformations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 2266–76.
Chen S, Wu Z, Jiang Y, et al. FT-TDR: frequency-guided transformer and top-down refinement network for blind face inpainting. IEEE Trans Multimedia. 2022;25:2382–92.
Zheng C, Cham TJ, Cai J. TFill: Image completion via a transformer-based architecture; 2021. arXiv preprint arXiv:2104.00845.
Dong Q, Cao C, Fu Y. Incremental transformer structure enhanced image inpainting with masking positional encoding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022. p. 11358–68.
Han X, Wu Z, Huang W, et al. FiNet: Compatible and diverse fashion image inpainting. In: Proceedings of the IEEE/CVF international conference on computer vision; 2019. p. 4481–91.
Zhao L, Mo Q, Lin S, et al. UCTGAN: Diverse image inpainting based on unsupervised cross-space translation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 5740–9.
Peng J, Liu D, Xu S, et al. Generating diverse structure for image inpainting with hierarchical VQ-VAE. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 10775–84.
Wang Q, Liu Y, Xiong Z, et al. Hybrid feature aligned network for salient object detection in optical remote sensing imagery. IEEE Trans Geosci Remote Sens. 2022;60:5624915.
Liu Y, Xiong Z, Yuan Y, et al. Distilling knowledge from super-resolution for efficient remote sensing salient object detection. IEEE Trans Geosci Remote Sens. 2023;61:5609116.
Liu Y, Li Q, Yuan Y, et al. ABNet: adaptive balanced network for multiscale object detection in remote sensing imagery. IEEE Trans Geosci Remote Sens. 2022;60:5614914.
Yu J, Lin Z, Yang J, et al. Free-form image inpainting with gated convolution. In: Proceedings of the IEEE/CVF international conference on computer vision; 2019. p. 4471–80.
Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions; 2015. arXiv preprint arXiv:1511.07122.
Howard AG, Zhu M, Chen B, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications; 2017. arXiv preprint arXiv:1704.04861.
Wang P, Chen P, Yuan Y, et al. Understanding convolution for semantic segmentation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision; 2018. p. 1451–60.
Guo X, Yang H, Huang D. Image inpainting via conditional texture and structure dual generation. In: Proceedings of the international conference on computer vision; 2021. p. 14134–43.
Iizuka S, Simo-Serra E, Ishikawa H. Globally and locally consistent image completion. ACM Trans Graphics. 2017;36(4):1–14.
Jolicoeur-Martineau A. The relativistic discriminator: a key element missing from standard GAN; 2018. arXiv preprint arXiv:1807.00734.
Wang T, Liu M, Zhu J, et al. High-resolution image synthesis and semantic manipulation with conditional GANs. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2018. p. 8798–807.
Hodson TO. Root mean square error (RMSE) or mean absolute error (MAE): when to use them or not. Geosci Model Dev Discussions. 2022;15(14):5481–7.
Ignatov A, Kobyshev N, Timofte R, et al. Dslr-quality photos on mobile devices with deep convolutional networks. In: Proceedings of the IEEE/CVF international conference on computer vision; 2017. p. 3277–85.
Li L, Zou Q, Zhang F, et al. Line drawing guided progressive inpainting of mural damages; 2022. arXiv Preprint arXiv:2211.06649.
Suvorov R, Logacheva E, Mashikhin A, et al. Resolution-robust large mask inpainting with fourier convolutions. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision; 2022. p. 2149–59.
Liu W, Cun X, Pun C, et al. Coordfill: Efficient high-resolution image inpainting via parameterized coordinate querying. In: Proceedings of the AAAI conference on artificial intelligence; 2023. p. 1746–54.
Guo Z, Chen Z, Yu T, et al. Progressive image inpainting with full-resolution residual network. In: Proceedings of the 27th ACM international conference on multimedia; 2019. p. 2496–504.
Chen M, Zang S, Ai Z, et al. RFA-net: residual feature attention network for fine-grained image inpainting. Eng Appl Artif Intell. 2023;119: 105814.
Gupta P, Srivastava P, Bhardwaj S, et al. A modified PSNR metric based on HVS for quality assessment of color images. In: Proceedings of the international conference on communication and industrial application; 2011. p. 1–4.
Hore A, Ziou D. Image quality metrics: PSNR vs. SSIM. In: Proceedings of the international conference on pattern recognition; 2010. p. 2366–9.
Wang Z, Bovik AC, Sheikh HR, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process. 2004;13(4):600–12.
Zhang R, Isola P, Efros AA, et al. The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2018. p. 586–95.
Acknowledgements
None.
Funding
This work was supported by the National Key Research and Development Program of China (No.2017YFB1402102), the National Natural Science Foundation of China (No. 62377033), the Shaanxi Key Science and Technology Innovation Team Project (No. 2022TD-26), the Xi’an Science and Technology Plan Project (No. 23ZDCYJSGG0010-2022), and the Fundamental Research Funds for the Central Universities (No. GK202205036, GK202101004).
Author information
Contributions
Conceptualization, ZS; methodology, ZS; validation, YL and XW; investigation, YL; data curation, YL; writing-original draft preparation, ZS; writing-review and editing, ZS, YL and XW; supervision, ZS; funding acquisition, ZS. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sun, Z., Lei, Y. & Wu, X. Ancient paintings inpainting based on dual encoders and contextual information. Herit Sci 12, 266 (2024). https://doi.org/10.1186/s40494-024-01391-2