
Learning shared features from specific and ambiguous descriptions for text-based person search

  • Regular Paper
Multimedia Systems

Abstract

Text-based person search aims to retrieve pedestrian images from natural language descriptions. Previous studies have focused mainly on exploiting differences between pedestrians with distinct identities, overlooking variation among samples of the same identity. Although some methods draw multiple samples per identity, they do not pair this sampling with an appropriate loss function. To address this gap, we present LFSA, a concise cross-modal framework that Learns shared Features from Specific and Ambiguous descriptions. First, building on a distinctive sampling strategy, we formulate the Boundary Constraints Loss (BCL) and the Hard Sample Mining Loss (HSML), which extract unique features from specific descriptions while capturing shared features from ambiguous descriptions. Second, we introduce a textual augmentation module, Mask-Delete-Replace (MDR), whose three operations direct the model’s attention toward more comprehensive details in the textual descriptions. LFSA uses CLIP as its backbone and relies only on the global features from the [CLS] token. Extensive experiments on two benchmark datasets, CUHK-PEDES and ICFG-PEDES, demonstrate the effectiveness of our approach. Code is available at https://github.com/CottonCandyZ/LFSA.
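
To make the idea of a Mask-Delete-Replace style augmentation concrete, the following is a minimal sketch of token-level masking, deletion, and replacement. It is not the authors' MDR module: the abstract does not specify how tokens are selected or how the three operations are combined, so the per-token probability `p`, the uniform choice among operations, the `MASK_TOKEN` string, and the `vocab` list are all assumptions made for illustration.

```python
import random

# Hypothetical placeholder; the mask token used by the actual model is not stated in the abstract.
MASK_TOKEN = "[MASK]"


def mask_delete_replace(tokens, vocab, p=0.15, rng=None):
    """Return a copy of `tokens` where each token is altered with probability `p`.

    An altered token is masked, deleted, or replaced by a random entry of
    `vocab`, with the operation chosen uniformly (an assumption of this sketch).
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        if rng.random() >= p:
            out.append(tok)  # keep the token unchanged
            continue
        op = rng.choice(("mask", "delete", "replace"))
        if op == "mask":
            out.append(MASK_TOKEN)
        elif op == "replace":
            out.append(rng.choice(vocab))
        # "delete": append nothing, so the token is dropped
    # Fall back to the original caption if every token was deleted.
    return out or list(tokens)


if __name__ == "__main__":
    caption = "a woman wearing a red coat and black boots".split()
    vocab = ["blue", "jacket", "man", "white", "bag"]
    print(mask_delete_replace(caption, vocab, p=0.3, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which is convenient when comparing training runs.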

Data availability

Not available.

Acknowledgement

This work was supported in part by the National Science Foundation Program of China (NSFC) (grant number: 61976241), and the International Science and Technology Cooperation Plan Project of Zhenjiang (grant number: GJ2021008).

Author information

Contributions

Qikai Geng and Hu Lu wrote the main manuscript text and Juanjuan Tu prepared figures. All authors reviewed the manuscript.

Corresponding author

Correspondence to Hu Lu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest. No funds, grants, or other support was received.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cheng, K., Geng, Q., Huang, S. et al. Learning shared features from specific and ambiguous descriptions for text-based person search. Multimedia Systems 30, 94 (2024). https://doi.org/10.1007/s00530-024-01286-z

  • DOI: https://doi.org/10.1007/s00530-024-01286-z
