Abstract
Recent self-supervised contrastive learning methods learn robust representations efficiently by pulling together the features of different cropped views of the same image while pushing them away from the features of other images in the embedding space. However, training with such instance-level objectives is inefficient: in the high-dimensional embedding space, images can differ from one another in many ways. We address this problem with heuristic attention pixel-level contrastive loss for representation learning (HAPiCLR), a self-supervised joint embedding contrastive framework that operates at the pixel level and makes use of heuristic mask information. HAPiCLR leverages pixel-level information from the object’s contextual representation instead of relying on pair-wise differences between instance-level representations. It thereby strengthens contrastive learning objectives without requiring large batch sizes, memory banks, or queues, reducing both the memory footprint and the computation needed for large datasets. Furthermore, combining the HAPiCLR loss with other contrastive objectives such as the SimCLR or MoCo loss yields considerable performance gains on all downstream tasks, including image classification, object detection, and instance segmentation.
Data availability
In this study, we constructed a heuristic binary segmentation mask dataset for the ImageNet ILSVRC-2012, which can be found here: https://www.hh-ri.com/2022/05/30/heuristic-attention-representation.
References
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-supervised models are strong semi-supervised learners. Adv. Neural Inf. Process. Syst. 33, 22243–22255 (2020)
Bardes, A., Ponce, J., LeCun, Y.: VICReg: variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906 (2021)
Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: self-supervised learning via redundancy reduction. In: International Conference on Machine Learning, pp. 12310–12320. PMLR (2021)
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 33, 9912–9924 (2020)
Putri, W.R., Liu, S.H., Aslam, M.S., Li, Y.H., Chang, C.C., Wang, J.C.: Self-supervised learning framework toward state-of-the-art iris image segmentation. Sensors 22(6), 2133 (2022)
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a “Siamese” time delay neural network. Adv. Neural Inf. Process. Syst. 6, 737 (1993)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 539–546. IEEE (2005)
Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., Bengio, Y.: Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)
Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., Hu, H.: Propagate yourself: exploring pixel-level consistency for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16684–16693 (2021)
Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Van Gool, L.: Unsupervised semantic segmentation by contrasting object mask proposals. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10052–10062 (2021)
Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Dense contrastive learning for self-supervised visual pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3024–3033 (2021)
Xie, E., Ding, J., Wang, W., Zhan, X., Xu, H., Sun, P., Li, Z., Luo, P.: DetCo: unsupervised contrastive learning for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
Ding, J., Xie, E., Xu, H., Jiang, C., Li, Z., Luo, P., Xia, G.S.: Deeply unsupervised patch re-identification for pre-training object detectors. IEEE Trans. Pattern Anal. Mach. Intell. (2022). https://doi.org/10.1109/TPAMI.2022.3164911
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)
Jaiswal, A., Babu, A.R., Zadeh, M.Z., Banerjee, D., Makedon, F.: A survey on contrastive self-supervised learning. Technologies 9(1), 2 (2020)
Iizuka, S., Simo-Serra, E., Ishikawa, H.: Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Trans. Graph. 35(4), 1–11 (2016)
Mahendran, A., Thewlis, J., Vedaldi, A.: Cross pixel optical-flow similarity for self-supervised learning. In: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V 14, pp. 99–116. Springer (2019)
Zhan, X., Pan, X., Liu, Z., Lin, D., Loy, C.C.: Self-supervised learning via conditional motion propagation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1881–1889 (2019)
Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)
Misra, I., Maaten, L.V.D.: Self-supervised learning of pretext-invariant representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717 (2020)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2, pp. 1735–1742. IEEE (2006)
Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742 (2018)
Ye, M., Zhang, X., Yuen, P.C., Chang, S.F.: Unsupervised embedding learning via invariant and spreading instance feature. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6210–6219 (2019)
Zhong, L., Yang, J., Chen, Z., Wang, S.: Contrastive graph convolutional networks with generative adjacency matrix. IEEE Trans. Signal Process. 71, 772–785 (2023)
Xia, W., Wang, T., Gao, Q., Yang, M., Gao, X.: Graph embedding contrastive multi-modal representation learning for clustering. IEEE Trans. Image Process. 32, 1170–1183 (2023)
Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pp. 776–794. Springer (2020)
Henaff, O.: Data-efficient image recognition with contrastive predictive coding. In: International Conference on Machine Learning, pp. 4182–4192. PMLR (2020)
Zhuang, C., Zhai, A.L., Yamins, D.: Local aggregation for unsupervised learning of visual embeddings. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6002–6012 (2019)
Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., Isola, P.: What makes for good views for contrastive learning? Adv. Neural Inf. Process. Syst. 33, 6827–6839 (2020)
Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent—a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)
Cao, Y., Xie, Z., Liu, B., Lin, Y., Zhang, Z., Han, H.: Parametric instance classification for unsupervised visual feature learning. Adv. Neural Inf. Process. Syst. 33, 15614–15624 (2020)
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)
Tran, V.N., Liu, S.H., Li, Y.H., Wang, J.C.: Heuristic attention representation learning for self-supervised pretraining. Sensors 22(14), 5169 (2022)
Nguyen, T., Dax, M., Mummadi, C.K., Ngo, N., Nguyen, T.H.P., Lou, Z., Brox, T.: Deepusps: deep robust unsupervised saliency prediction via self-supervision. In: Advances in Neural Information Processing Systems, vol 32 (2019)
Zhang, S., Liew, J.H., Wei, Y., Wei, S., Zhao, Y.: Interactive object segmentation with inside-outside guidance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12234–12244 (2020)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
You, Y., Gitman, I., Ginsburg, B.: Scaling SGD batch size to 32k for ImageNet training. arXiv preprint arXiv:1708.03888 (2017)
Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 303–338 (2010)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13, pp. 740–755. Springer (2014)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)
Author information
Contributions
The authors of the research paper made significant contributions to the study. Y-HL provided the methodology and conceptualized the study, while C-EH, VNT, and S-HL developed and validated the software. K-LY assisted with data curation. MSA, VNT, C-EH, S-HL, and Y-HL contributed to writing, reviewing, and editing. Y-HL and J-CW provided supervision, and Y-HL also managed project administration and acquired funding. All authors have approved the published work.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Implementation details
1.1 Heuristic mask proposal generator
Our heuristic binary mask approach eliminates the need for external supervision or for training on limited datasets of annotated masks. The mask generator comprises a residual convolutional feature extractor and a \(1\times 1\) convolutional classification layer whose output is a spatial saliency map. The generated map can be segmented into foreground and background regions depending on the saliency threshold, as shown in Fig. 2 (in the main text).
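The following is a minimal PyTorch sketch of such a generator, assuming torchvision's ResNet-50 stands in for the self-supervised pre-trained backbone; the class and layer names, as well as the bilinear upsampling back to the input resolution, are illustrative choices rather than details taken from our released code.

```python
# Minimal sketch of the heuristic mask proposal generator described above.
# Assumptions: torchvision's ResNet-50 is a placeholder for the self-supervised
# pre-trained backbone; the upsampling step and class name are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class HeuristicMaskGenerator(nn.Module):
    def __init__(self, saliency_threshold: float = 0.5):
        super().__init__()
        backbone = resnet50()  # load self-supervised pre-trained weights here
        # Residual convolutional feature extractor: all stages up to the last conv block.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 convolutional classification layer producing a per-pixel saliency score.
        self.saliency_head = nn.Conv2d(2048, 1, kernel_size=1)
        self.saliency_threshold = saliency_threshold

    @torch.no_grad()
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images)                      # (B, 2048, H/32, W/32)
        saliency = torch.sigmoid(self.saliency_head(feats))
        # Upsample the saliency map back to the input resolution.
        saliency = F.interpolate(saliency, size=images.shape[-2:],
                                 mode="bilinear", align_corners=False)
        # Threshold into a binary foreground/background mask.
        return (saliency > self.saliency_threshold).float()
```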
Demonstration of heuristic binary masks employed in the HAPiCLR framework. First row: images randomly selected from the ImageNet [16] training set. Rows two to four: masks obtained from our heuristic mask proposal technique with saliency thresholds of 0.4, 0.5, and 0.6, respectively. Fifth row: human-annotated ground-truth masks from Pixel-ImageNet [40]
For the feature extractor, we use a ResNet-50 backbone pre-trained with self-supervised objectives [5, 12, 34] on the unlabeled dataset. We evaluated saliency threshold values of 0.4, 0.5, and 0.6, as illustrated in Fig. 11. To generate high-quality masks, we chose the threshold that yielded the maximum mean Intersection-over-Union (mIoU) against the ground-truth saliency object masks from Pixel-ImageNet [40]. Mask quality was evaluated on a subset of 485,000 images covering 946 of the 1000 ImageNet classes. The mIoU of the generated masks at thresholds 0.4, 0.5, and 0.6 was 0.451, 0.490, and 0.448, respectively; we therefore selected the threshold of 0.5 for generating masks for the entire ImageNet training set.
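The threshold selection amounts to computing the mean IoU of the thresholded saliency maps against the ground-truth masks for each candidate value and keeping the best one. A short sketch is given below; the function names are illustrative and the inputs are assumed to be per-image saliency maps and binary ground-truth masks of matching shape.

```python
# Sketch of the saliency-threshold selection by mean IoU described above.
# binary_iou and select_threshold are illustrative helper names.
import numpy as np


def binary_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return float(np.logical_and(pred, gt).sum()) / float(union)


def select_threshold(saliency_maps, gt_masks, thresholds=(0.4, 0.5, 0.6)):
    """Pick the threshold with the highest mean IoU over the evaluation subset."""
    mious = {}
    for t in thresholds:
        ious = [binary_iou(s > t, g) for s, g in zip(saliency_maps, gt_masks)]
        mious[t] = float(np.mean(ious))
    best = max(mious, key=mious.get)
    return best, mious
```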
1.2 Data augmentation
The HAPiCLR data augmentation pipeline starts with the HAPiCLR cropping technique. The cropped views then undergo the same set of image augmentations as SimCLR [1]: a randomly composed sequence of color distortion, grayscale conversion, Gaussian blur, and solarization. Each image and its corresponding heuristic binary mask pass through the pipeline as follows. First, the image is randomly cropped, resized, and randomly flipped; the binary mask receives only the same crop and flip. The cropped image is then transformed with color distortion (color jittering, color dropping), random Gaussian blur, and solarization, each applied with its specified probability. Table 7 lists the specific parameter settings of the augmentation pipeline; a sketch of the pipeline is given below.
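The sketch below illustrates the joint image–mask augmentation with torchvision: geometric transforms (crop, flip) are applied identically to the image and its binary mask, while photometric transforms touch the image only. The probabilities and jitter strengths shown are typical SimCLR/BYOL-style values, not the exact settings of Table 7, and the inputs are assumed to be PIL images.

```python
# Hedged sketch of the HAPiCLR-style augmentation: shared geometric transforms
# for image and mask, photometric transforms for the image only.
# Parameter values are illustrative, not the Table 7 settings.
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF


def augment_pair(image, mask, size=224):
    # Shared geometric transform: identical crop and flip for image and mask.
    i, j, h, w = T.RandomResizedCrop.get_params(image, scale=(0.2, 1.0),
                                                ratio=(3 / 4, 4 / 3))
    image = TF.resized_crop(image, i, j, h, w, [size, size])
    mask = TF.resized_crop(mask, i, j, h, w, [size, size],
                           interpolation=TF.InterpolationMode.NEAREST)
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)

    # Photometric transforms applied to the image only.
    photometric = T.Compose([
        T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
        T.RandomSolarize(threshold=128, p=0.2),
        T.ToTensor(),
    ])
    return photometric(image), TF.to_tensor(mask)
```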
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tran, V.N., Liu, SH., Huang, CE. et al. HAPiCLR: heuristic attention pixel-level contrastive loss representation learning for self-supervised pretraining. Vis Comput 40, 7945–7960 (2024). https://doi.org/10.1007/s00371-023-03217-x
DOI: https://doi.org/10.1007/s00371-023-03217-x