5.1 Results on JPEG Artefacts
First, we study in detail the behavior of LANBIQUE on a single distortion. This way, we can easily control the amount of image corruption and evaluate the behavior of our metric on GAN-restored images.
Results with reference captions. To use a dataset of images with associated captions, we selected the 5,000-image test set from the Karpathy split of the COCO dataset [9]. The images were then compressed at different JPEG Quality Factors (QF) and subsequently reconstructed using the GAN approach of Reference [14]. In Table 1, we report results of LANBIQUE using various captioning metrics \(\mathcal {D}\). Interestingly, all metrics show that captions computed over reconstructed images (REC rows) are better than captions computed over compressed images (JPEG rows). This shows that image details compromised by the strong compression induce errors in the captioning algorithm. However, the GAN approach recovers an image that is not only pleasant to the human eye but also restores details that are relevant to a semantic algorithm. In Figure 1, we show the differences between the captions generated by Reference [2] over original, compressed, and restored images. A human would likely produce an almost correct caption for highly compressed images; nevertheless, state-of-the-art algorithms tend to make severe mistakes that do not occur on reconstructed images.
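As a minimal illustration of this protocol (not the exact pipeline used in our experiments), the sketch below compresses an image at a given QF, restores it, captions it, and scores the generated sentence against the COCO reference captions with CIDEr. Here `caption_model` and `restore_with_gan` are hypothetical placeholders for the captioner of Reference [2] or [11] and the artefact-removal GAN of Reference [14], while the scorer comes from the pycocoevalcap package.

```python
# Sketch of LANBIQUE scoring with reference captions (COCO-style setup).
import io
from PIL import Image
from pycocoevalcap.cider.cider import Cider

def jpeg_compress(img: Image.Image, qf: int) -> Image.Image:
    """Re-encode an image as JPEG at the given Quality Factor."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=qf)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def lanbique_score(img, reference_captions, qf, caption_model, restore_with_gan):
    # caption_model and restore_with_gan are illustrative placeholders.
    degraded = jpeg_compress(img, qf)
    restored = restore_with_gan(degraded)        # GAN artefact removal
    candidate = caption_model(restored)          # generated sentence
    gts = {0: reference_captions}                # human reference captions
    res = {0: [candidate]}
    score, _ = Cider().compute_score(gts, res)   # language similarity D
    return score
```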
In Figure 6, we show the performance of captioning algorithms in terms of the CIDEr measure on the same test split of compressed and restored images, considering different JPEG quality factors. The captioner proposed in Reference [11] outperforms Reference [2], as expected; interestingly, we also observe that the range of CIDEr values of Reference [11] is significantly wider than that of Reference [2]. We argue that this could be considered a strong feature of our evaluation approach, as a wider range of values may imply that a good captioner is able to predict image quality at a finer granularity than weaker captioning algorithms.
Figure 7 shows the bottom-up captioning process performed on an image used in the subjective evaluation. The left image shows the version compressed at JPEG QF 10, while the right one shows the GAN reconstruction; both images show the bounding boxes of the detected elements. In the first case, the incorrect detections of indoor elements such as “floor” and “wall” are the likely reason for the wrong caption, as opposed to the correct recognition of a “white wave” and “blue water” in the GAN-reconstructed image.
Results without reference captions. A common setting used to evaluate image enhancement algorithms is Full-reference image quality assessment, where image similarity metrics measure how much a restored version differs from the uncorrupted original image. These metrics, based on pixel-wise value differences, tend to favor MSE-optimized networks, which are usually prone to produce blurry, poorly detailed images.
In certain cases, Full-reference quality metrics cannot be used, e.g., when no original image is available, and No-reference metrics are used instead. These metrics typically evaluate the “naturalness” of the image being analyzed. In the same setup used previously, we perform experiments with NIQE and BRISQUE, two popular No-reference image metrics. Interestingly, these metrics tend to favor GAN-restored images over the original uncompressed ones. Most surprisingly, NIQE and BRISQUE obtain better results when we reconstruct the most degraded versions of the images (QF 10–20), but these values increase (i.e., worsen) as we reconstruct less degraded images. We believe that BRISQUE and NIQE favor crisper images with high-frequency patterns, which are distinctive of GAN-based image enhancement and typically stronger when reconstructing heavily distorted images.
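As a hedged sketch of this No-reference setup (an illustrative reimplementation, not our evaluation code), the snippet below computes NIQE and BRISQUE on a single image, assuming the pyiqa (IQA-PyTorch) package is available; for both metrics, lower values indicate higher predicted quality.

```python
# Sketch of the No-reference evaluation with NIQE and BRISQUE via pyiqa.
import pyiqa
from PIL import Image
from torchvision.transforms.functional import to_tensor

niqe = pyiqa.create_metric("niqe")        # lower is better
brisque = pyiqa.create_metric("brisque")  # lower is better

def no_reference_scores(path: str) -> dict:
    # 1x3xHxW float tensor in [0, 1], as expected by pyiqa metrics
    img = to_tensor(Image.open(path).convert("RGB")).unsqueeze(0)
    return {"niqe": niqe(img).item(), "brisque": brisque(img).item()}
```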
In Table 2, we report results on COCO for Full-reference and No-reference indexes. In this setup, we compress the original images at different QFs and then restore them with a QF-specific artefact removal GAN. We use the caption generated on the uncompressed image as ground truth, as in Table 3. The results show that, for restored images, PSNR improves slightly, while SSIM is lower than for the compressed counterparts. This is an expected outcome, since Reference [14] shows that state-of-the-art results on PSNR can be obtained only when MSE is optimized, and on SSIM only when that metric is optimized directly. Nonetheless, as can be seen in Figure 2, GAN-enhanced images are more pleasant to the human eye; therefore, we should not rely only on PSNR and SSIM for GAN-restored images. LANBIQUE, using Reference [11], is in line with LPIPS [54]. Unfortunately, LPIPS, as shown in Table 3, has low correlation with human-perceived quality scores.
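For completeness, the Full-reference indexes of Table 2 can be sketched as follows, using PSNR and SSIM from scikit-image and the perceptual LPIPS distance from the lpips package (lower LPIPS means perceptually closer); this is an illustrative reimplementation rather than the exact evaluation code.

```python
# Sketch of the Full-reference indexes: PSNR, SSIM, and LPIPS.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # AlexNet backbone, a common choice

def full_reference_scores(original: np.ndarray, restored: np.ndarray) -> dict:
    """Both inputs are HxWx3 uint8 arrays of the same size."""
    psnr = peak_signal_noise_ratio(original, restored)
    ssim = structural_similarity(original, restored, channel_axis=-1)
    # LPIPS expects NCHW float tensors scaled to [-1, 1]
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(original), to_t(restored)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}
```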
Correlation with Mean Opinion Score. In Figure 8 (left), subjective evaluation results are reported as Mean Opinion Scores (MOS) using box plots, where the box shows the quartiles of the scores and the whiskers show the rest of the distribution. The plots are shown for the original images, the images compressed with JPEG at QF = 10, and the images restored with the GAN-based approach of Reference [14] from the heavily compressed JPEG images. The figure shows that the GAN-based network is able to produce images that are perceptually of much higher quality than the images from which they originate: the average MOS is 1.15 for JPEG images, 2.56 for the GAN-based approach, and 3.59 for the original images. The relatively low MOS scores obtained also by the original images are due to the fact that COCO images have a visual quality that is much lower than that of datasets designed for image quality evaluation. To give better insight into the distribution of MOS scores, Figure 8 (right) shows the histograms of the MOS scores for the three types of images: orange for the original images, green for the JPEG-compressed images, and blue for the restored images.
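As an illustrative sketch (variable names are hypothetical), plots of this kind can be reproduced from the per-image MOS lists with matplotlib, whose boxplot draws the quartiles as the box; setting the whisker percentiles to the full range matches the description above.

```python
# Sketch of MOS box plots (left) and histograms (right) from per-image scores.
import matplotlib.pyplot as plt

def plot_mos(mos_original, mos_jpeg, mos_restored):
    fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
    data = [mos_original, mos_jpeg, mos_restored]
    labels = ["Original", "JPEG QF=10", "GAN restored"]
    ax_box.boxplot(data, labels=labels, whis=(0, 100))  # whiskers span full range
    ax_box.set_ylabel("MOS")
    for scores, color, label in zip(data, ["orange", "green", "blue"], labels):
        ax_hist.hist(scores, bins=5, alpha=0.5, color=color, label=label)
    ax_hist.set_xlabel("MOS")
    ax_hist.legend()
    plt.show()
```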
We further show that our language-based approach correlates with perceived quality using an IQA benchmark test on the LIVE dataset [40], which consists of 29 high-resolution images compressed at different JPEG qualities, for a total of 204 images. For each LIVE image, a set of user scores indicating the perceived quality of the image is provided; however, no captions are provided in this dataset. For this reason, we consider the sentences produced by the captioning approaches over the undistorted image as the ground truth for computing the language similarity measures, following the LANBIQUE-NC protocol presented in Section 4.2. In Table 3, we show the Pearson correlation scores of different captioning metrics and of other common Full-reference quality assessment approaches. The experiment shows an interesting behavior of our approach in terms of correlation. First, we observe that each captioning metric has a correlation index that is higher than, or at least comparable to, the other Full-reference metrics. In particular, METEOR and CIDEr perform better than the other metrics, independently of which captioning algorithm is used. In the following experiments, LANBIQUE, LANBIQUE-NC, and LANBIQUE-NR have been computed using the CIDEr metric. Moreover, we observe that the correlation improves significantly if we employ a better performing captioner. In this case, the visual features used by the two captioning techniques are exactly the same; the main difference lies in the overall language generation pipeline of the approaches. Hence, we argue that language is effectively useful for quality assessment: the more a captioning algorithm is capable of providing detailed and meaningful captions, the better we can use the generated sentences to predict image quality.
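The correlation analysis itself is straightforward; below is a minimal sketch with SciPy, where `predicted_scores` holds the per-image LANBIQUE-NC (CIDEr) values and `subjective_scores` the corresponding LIVE user scores (variable names are illustrative).

```python
# Sketch of the Pearson correlation between predicted quality and user scores.
import numpy as np
from scipy.stats import pearsonr

def correlation_with_subjective(predicted_scores, subjective_scores):
    """Both arguments are 1-D sequences aligned per image (204 JPEG images for LIVE)."""
    r, p_value = pearsonr(np.asarray(predicted_scores), np.asarray(subjective_scores))
    return r, p_value
```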
To better understand which metric could be used in place of human evaluation, we computed the correlation coefficient between BRISQUE [31], NIQE [32], the proposed LANBIQUE, and MOS for all versions of the images. As shown in Table 4, a fine-grained semantic task such as image captioning turns out to be the best proxy (highest correlation) for real human judgment.
Figure 9 shows a captioning example from the COCO images used in the subjective quality evaluation experiment. On the left, we show a sample compressed with JPEG at QF = 10; in the center, the image restored with Reference [14]; and on the right, the original one. It can be observed that the caption of the restored image correctly describes the image content, on par with the caption obtained on the original image. In contrast, the caption of the highly compressed JPEG image is completely unrelated to the image content, probably due to object detection errors.