
CLIP-DFGS: A Hard Sample Mining Method for CLIP in Generalizable Person Re-Identification

Published: 23 December 2024

Abstract

Recent advances in pre-trained vision-language models such as CLIP have shown promise for person re-identification (ReID), yet their performance on generalizable person ReID tasks remains suboptimal. Because CLIP is pre-trained on large-scale, diverse image-text pairs, certain fine-grained features may be missing or insufficiently represented. To address these challenges, we propose a hard sample mining method called Depth-First Graph Sampler (DFGS), based on depth-first search, designed to supply sufficiently challenging samples that enhance CLIP’s ability to extract fine-grained features. DFGS can be applied to both the image encoder and the text encoder in CLIP. Leveraging CLIP’s strong cross-modal learning capabilities, DFGS mines challenging samples and forms mini-batches with high discriminative difficulty, providing the image model with hard-to-distinguish samples and thereby enhancing its ability to differentiate between individuals. Our results demonstrate significant improvements over other methods, confirming the effectiveness of DFGS in providing challenging samples that improve CLIP’s performance on generalizable person ReID.
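
To make the sampling idea concrete, the following is a minimal, hypothetical Python sketch of a depth-first graph sampler for hard mini-batch construction. It is not the paper's exact DFGS procedure: the graph construction, the use of identity-level feature centroids (e.g., from CLIP's image encoder), and parameters such as the number of neighbours k and identities per batch are illustrative assumptions.

import numpy as np

def build_identity_graph(centroids, k=4):
    # Hypothetical graph: each identity is linked to its k most similar
    # identities by cosine similarity of their feature centroids.
    feats = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)  # an identity is not its own neighbour
    return [list(np.argsort(-sim[i])[:k]) for i in range(len(centroids))]

def dfs_hard_batches(graph, ids_per_batch=8):
    # Walk the identity graph depth-first so that identities visited
    # consecutively are mutually similar, then chunk the visiting order
    # into mini-batches of hard-to-distinguish identities.
    visited, order = set(), []
    for start in range(len(graph)):
        if start in visited:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            order.append(node)
            # push neighbours so the most similar identity is expanded next
            stack.extend(n for n in reversed(graph[node]) if n not in visited)
    return [order[i:i + ids_per_batch] for i in range(0, len(order), ids_per_batch)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centroids = rng.normal(size=(32, 512))  # e.g., 32 identities, 512-d features
    batches = dfs_hard_batches(build_identity_graph(centroids))
    print(batches[0])  # identities grouped into one hard mini-batch

The intent of the sketch is that identities reached along one depth-first path are mutually similar, so each resulting mini-batch is dominated by identities that are difficult to tell apart, which is the kind of high-discriminative-difficulty batch the abstract describes.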



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 21, Issue 1
January 2025
860 pages
EISSN: 1551-6865
DOI: 10.1145/3703004

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 December 2024
Online AM: 21 October 2024
Accepted: 13 October 2024
Revised: 28 September 2024
Received: 31 July 2024
Published in TOMM Volume 21, Issue 1


Author Tags

  1. Visual language model
  2. Generalizable person re-identification
  3. Depth-first search

Qualifiers

  • Research-article

Funding Sources

  • NSFC
  • China Postdoctoral Science Foundation
  • CPSF
  • Jiangsu Funding Program for Excellent Postdoctoral Talent

