Abstract
Recent self-supervised contrastive learning methods learn robust representations efficiently by pulling together the features of different cropped views of the same image while pushing them away from the features of other images in the embedding space. However, training with such instance-level objectives is inefficient: in the high-dimensional embedding space, images can differ from one another in many ways. We address this problem with heuristic attention pixel-level contrastive loss for representation learning (HAPiCLR), a self-supervised joint embedding contrastive framework that operates at the pixel level and makes use of heuristic mask information. HAPiCLR leverages pixel-level information from the object’s contextual representation instead of relying on pair-wise differences between instance-level representations. It thereby strengthens contrastive learning objectives without requiring large batch sizes, memory banks, or queues, reducing both the memory footprint and the computation needed for large datasets. Furthermore, combining the HAPiCLR loss with other contrastive objectives such as the SimCLR or MoCo loss yields considerable performance gains on all downstream tasks, including image classification, object detection, and instance segmentation.
Data availability
In this study, we constructed a heuristic binary segmentation mask dataset for the ImageNet ILSVRC-2012, which can be found here: https://www.hh-ri.com/2022/05/30/heuristic-attention-representation.
References
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-supervised models are strong semi-supervised learners. Adv. Neural Inf. Process. Syst. 33, 22243–22255 (2020)
Bardes, A., Ponce, J., LeCun, Y.: VICReg: variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906 (2021)
Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: self-supervised learning via redundancy reduction. In: International Conference on Machine Learning, pp. 12310–12320. PMLR (2021)
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 33, 9912–9924 (2020)
Putri, W.R., Liu, S.H., Aslam, M.S., Li, Y.H., Chang, C.C., Wang, J.C.: Self-supervised learning framework toward state-of-the-art iris image segmentation. Sensors 22(6), 2133 (2022)
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a “Siamese” time delay neural network. Adv. Neural Inf. Process. Syst. 6, 737 (1993)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 539–546. IEEE (2005)
Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., Bengio, Y.: Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)
Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., Hu, H.: Propagate yourself: exploring pixel-level consistency for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16684–16693 (2021)
Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Van Gool, L.: Unsupervised semantic segmentation by contrasting object mask proposals. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10052–10062 (2021)
Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Dense contrastive learning for self-supervised visual pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3024–3033 (2021)
Xie, E., Ding, J., Wang, W., Zhan, X., Xu, H., Sun, P., Li, Z., Luo, P.: DetCo: unsupervised contrastive learning for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2021)
Ding, J., Xie, E., Xu, H., Jiang, C., Li, Z., Luo, P., Xia, G.S.: Deeply unsupervised patch re-identification for pre-training object detectors. IEEE Trans. Pattern Anal. Mach. Intell. (2022). https://doi.org/10.1109/TPAMI.2022.3164911
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)
Jaiswal, A., Babu, A.R., Zadeh, M.Z., Banerjee, D., Makedon, F.: A survey on contrastive self-supervised learning. Technologies 9(1), 2 (2020)
Iizuka, S., Simo-Serra, E., Ishikawa, H.: Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Trans. Graph. 35(4), 1–11 (2016)
Mahendran, A., Thewlis, J., Vedaldi, A.: Cross pixel optical-flow similarity for self-supervised learning. In: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V 14, pp. 99–116. Springer (2019)
Zhan, X., Pan, X., Liu, Z., Lin, D., Loy, C.C.: Self-supervised learning via conditional motion propagation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1881–1889 (2019)
Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)
Misra, I., Maaten, L.V.D.: Self-supervised learning of pretext-invariant representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717 (2020)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2, pp. 1735–1742. IEEE (2006)
Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742 (2018)
Ye, M., Zhang, X., Yuen, P.C., Chang, S.F.: Unsupervised embedding learning via invariant and spreading instance feature. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6210–6219 (2019)
Zhong, L., Yang, J., Chen, Z., Wang, S.: Contrastive graph convolutional networks with generative adjacency matrix. IEEE Trans. Signal Process. 71, 772–785 (2023)
Xia, W., Wang, T., Gao, Q., Yang, M., Gao, X.: Graph embedding contrastive multi-modal representation learning for clustering. IEEE Trans. Image Process. 32, 1170–1183 (2023)
Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pp. 776–794. Springer (2020)
Henaff, O.: Data-efficient image recognition with contrastive predictive coding. In: International Conference on Machine Learning, pp. 4182–4192. PMLR (2020)
Zhuang, C., Zhai, A.L., Yamins, D.: Local aggregation for unsupervised learning of visual embeddings. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6002–6012 (2019)
Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., Isola, P.: What makes for good views for contrastive learning? Adv. Neural Inf. Process. Syst. 33, 6827–6839 (2020)
Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent—a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)
Cao, Y., Xie, Z., Liu, B., Lin, Y., Zhang, Z., Han, H.: Parametric instance classification for unsupervised visual feature learning. Adv. Neural Inf. Process. Syst. 33, 15614–15624 (2020)
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)
Tran, V.N., Liu, S.H., Li, Y.H., Wang, J.C.: Heuristic attention representation learning for self-supervised pretraining. Sensors 22(14), 5169 (2022)
Nguyen, T., Dax, M., Mummadi, C.K., Ngo, N., Nguyen, T.H.P., Lou, Z., Brox, T.: Deepusps: deep robust unsupervised saliency prediction via self-supervision. In: Advances in Neural Information Processing Systems, vol 32 (2019)
Zhang, S., Liew, J.H., Wei, Y., Wei, S., Zhao, Y.: Interactive object segmentation with inside-outside guidance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12234–12244 (2020)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
You, Y., Gitman, I., Ginsburg, B.: Scaling SGD batch size to 32k for ImageNet training. arXiv preprint arXiv:1708.03888 (2017)
Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 303–338 (2010)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13, pp. 740–755. Springer (2014)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)
Author information
Contributions
The authors of the research paper made significant contributions to the study. Y-HL provided the methodology and conceptualized the study, while C-EH, VNT, and S-HL developed and validated the software. K-LY assisted with data curation. MSA, VNT, C-EH, S-HL, and Y-HL contributed to writing, reviewing, and editing. Y-HL and J-CW provided supervision, and Y-HL also managed project administration and acquired funding. All authors have approved the published work.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Implementation details
1.1 Heuristic mask proposal generator
Our heuristic binary mask approach eliminates the need for external supervision or for training on limited datasets of annotated masks. The mask generator comprises a residual convolutional feature extractor and a \(1\times 1\) convolutional classification layer whose output is a spatial saliency map. The generated map can be segmented into foreground and background regions depending on the saliency threshold, as shown in Fig. 2 (in the main text).
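The following is a minimal PyTorch sketch of such a generator, assuming torchvision's ResNet-50 stands in for the self-supervised pre-trained backbone; the class and layer names, as well as the bilinear upsampling back to the input resolution, are illustrative choices rather than details taken from our released code.

```python
# Minimal sketch of the heuristic mask proposal generator described above.
# Assumptions: torchvision's ResNet-50 is a placeholder for the self-supervised
# pre-trained backbone; the upsampling step and class name are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class HeuristicMaskGenerator(nn.Module):
    def __init__(self, saliency_threshold: float = 0.5):
        super().__init__()
        backbone = resnet50()  # load self-supervised pre-trained weights here
        # Residual convolutional feature extractor: all stages up to the last conv block.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 convolutional classification layer producing a per-pixel saliency score.
        self.saliency_head = nn.Conv2d(2048, 1, kernel_size=1)
        self.saliency_threshold = saliency_threshold

    @torch.no_grad()
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images)                      # (B, 2048, H/32, W/32)
        saliency = torch.sigmoid(self.saliency_head(feats))
        # Upsample the saliency map back to the input resolution.
        saliency = F.interpolate(saliency, size=images.shape[-2:],
                                 mode="bilinear", align_corners=False)
        # Threshold into a binary foreground/background mask.
        return (saliency > self.saliency_threshold).float()
```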
Demonstration of heuristic binary masks employed in the HAPiCLR framework. First row: images randomly selected from the ImageNet [16] training set. Rows two to four: masks obtained from our heuristic mask proposal technique with saliency thresholds of 0.4, 0.5, and 0.6, respectively. Fifth row: human-annotated ground-truth masks from Pixel-ImageNet [40]
For the feature extractor, we use a ResNet-50 backbone pre-trained with self-supervised objectives [5, 12, 34] on the unlabeled dataset. We evaluated saliency threshold values of 0.4, 0.5, and 0.6, as illustrated in Fig. 11. To generate high-quality masks, we chose the threshold that yielded the maximum mean Intersection-over-Union (mIoU) against the ground-truth saliency object masks from Pixel-ImageNet [40]. Mask quality was evaluated on a subset of 485,000 images covering 946 of the 1000 ImageNet classes. The mIoU of the generated masks at thresholds 0.4, 0.5, and 0.6 was 0.451, 0.490, and 0.448, respectively; we therefore selected the threshold of 0.5 for generating masks for the entire ImageNet training set.
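The threshold selection amounts to computing the mean IoU of the thresholded saliency maps against the ground-truth masks for each candidate value and keeping the best one. A short sketch is given below; the function names are illustrative and the inputs are assumed to be per-image saliency maps and binary ground-truth masks of matching shape.

```python
# Sketch of the saliency-threshold selection by mean IoU described above.
# binary_iou and select_threshold are illustrative helper names.
import numpy as np


def binary_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return float(np.logical_and(pred, gt).sum()) / float(union)


def select_threshold(saliency_maps, gt_masks, thresholds=(0.4, 0.5, 0.6)):
    """Pick the threshold with the highest mean IoU over the evaluation subset."""
    mious = {}
    for t in thresholds:
        ious = [binary_iou(s > t, g) for s, g in zip(saliency_maps, gt_masks)]
        mious[t] = float(np.mean(ious))
    best = max(mious, key=mious.get)
    return best, mious
```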
1.2 Data augmentation
The HAPiCLR data augmentation pipeline starts with the HAPiCLR cropping technique. The cropped views then undergo the same set of image augmentations as SimCLR [1]: a randomly composed sequence of color distortion, grayscale conversion, Gaussian blur, and solarization. Each image and its corresponding heuristic binary mask pass through the pipeline as follows. First, the image is randomly cropped, resized, and randomly flipped; the binary mask receives only the same crop and flip. The cropped image is then transformed with color distortion (color jittering, color dropping), random Gaussian blur, and solarization, each applied with its specified probability. Table 7 lists the specific parameter settings of the augmentation pipeline; a sketch of the pipeline is given below.
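The sketch below illustrates the joint image–mask augmentation with torchvision: geometric transforms (crop, flip) are applied identically to the image and its binary mask, while photometric transforms touch the image only. The probabilities and jitter strengths shown are typical SimCLR/BYOL-style values, not the exact settings of Table 7, and the inputs are assumed to be PIL images.

```python
# Hedged sketch of the HAPiCLR-style augmentation: shared geometric transforms
# for image and mask, photometric transforms for the image only.
# Parameter values are illustrative, not the Table 7 settings.
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF


def augment_pair(image, mask, size=224):
    # Shared geometric transform: identical crop and flip for image and mask.
    i, j, h, w = T.RandomResizedCrop.get_params(image, scale=(0.2, 1.0),
                                                ratio=(3 / 4, 4 / 3))
    image = TF.resized_crop(image, i, j, h, w, [size, size])
    mask = TF.resized_crop(mask, i, j, h, w, [size, size],
                           interpolation=TF.InterpolationMode.NEAREST)
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)

    # Photometric transforms applied to the image only.
    photometric = T.Compose([
        T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
        T.RandomSolarize(threshold=128, p=0.2),
        T.ToTensor(),
    ])
    return photometric(image), TF.to_tensor(mask)
```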
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tran, V.N., Liu, SH., Huang, CE. et al. HAPiCLR: heuristic attention pixel-level contrastive loss representation learning for self-supervised pretraining. Vis Comput 40, 7945–7960 (2024). https://doi.org/10.1007/s00371-023-03217-x
DOI: https://doi.org/10.1007/s00371-023-03217-x