SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking

Siyuan Li, Lei Ke, Yung-Hsu Yang, Luigi Piccinelli, Mattia Segù, Martin Danelljan & Luc Van Gool

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15085)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories that are not in the training set. Currently, the best-performing methods are mainly based on pure appearance matching. Because motion patterns are complex in large-vocabulary scenarios and the classification of novel objects is unstable, existing methods either ignore motion and semantic cues or apply them heuristically in the final matching steps. In this paper, we present SLAck, a unified framework that jointly considers semantic, location, and appearance priors in the early steps of association and learns how to integrate all valuable information through a lightweight spatial and temporal object graph. Our method eliminates complex post-processing heuristics for fusing different cues and significantly boosts association performance for large-scale open-vocabulary tracking. Without bells and whistles, we outperform previous state-of-the-art methods for novel-class tracking on the open-vocabulary MOT and TAO TETA benchmarks. Our code is available at github.com/siyuanliii/SLAck.
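To make the idea of fusing cues before matching more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes each object comes with a semantic embedding, a normalized box `(cx, cy, w, h)`, and an appearance embedding. It replaces the learned spatial and temporal object graph with plain concatenation, and it uses greedy matching in place of any learned association. All function names, the toy feature values, and the `min_sim` threshold are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_cues(semantic, box, appearance):
    """Early fusion: normalize each cue, concatenate, and renormalize.

    Hypothetical stand-in for a learned fusion module. `box` is
    (cx, cy, w, h) in normalized image coordinates.
    """
    fused = l2_normalize(semantic) + l2_normalize(box) + l2_normalize(appearance)
    return l2_normalize(fused)


def similarity_matrix(tracks, detections):
    """Cosine similarity between every fused track and detection descriptor."""
    return [[sum(a * b for a, b in zip(t, d)) for d in detections] for t in tracks]


def greedy_match(sim, min_sim=0.5):
    """Pair tracks and detections greedily by descending similarity."""
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(sim) for j, s in enumerate(row)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for s, i, j in pairs:
        if s < min_sim:
            break  # all remaining pairs are below the threshold
        if i in used_tracks or j in used_dets:
            continue
        used_tracks.add(i)
        used_dets.add(j)
        matches.append((i, j))
    return sorted(matches)


# Toy data: two existing tracks and two new detections listed in swapped order.
tracks = [
    fuse_cues([1.0, 0.0], [0.20, 0.30, 0.1, 0.2], [0.9, 0.1, 0.0]),
    fuse_cues([0.0, 1.0], [0.70, 0.60, 0.2, 0.3], [0.1, 0.8, 0.3]),
]
detections = [
    fuse_cues([0.1, 0.9], [0.72, 0.61, 0.2, 0.3], [0.2, 0.8, 0.2]),
    fuse_cues([0.9, 0.1], [0.22, 0.31, 0.1, 0.2], [0.8, 0.2, 0.1]),
]

matches = greedy_match(similarity_matrix(tracks, detections))
print(matches)  # [(0, 1), (1, 0)]
```

In this sketch every cue contributes equally to the fused descriptor, so no separate post-processing rule is needed to combine them at matching time. A learned model would instead weight and refine the cues from data, and an optimal assignment solver could replace the greedy step.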

Author information

Authors and Affiliations

Computer Vision Lab, ETH Zürich, Zürich, Switzerland
Siyuan Li, Lei Ke, Yung-Hsu Yang, Luigi Piccinelli, Mattia Segù, Martin Danelljan & Luc Van Gool
INSAIT, Sofia, Bulgaria
Luc Van Gool

Corresponding author

Correspondence to Siyuan Li.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6461 KB)

About this paper

Cite this paper

Li, S. et al. (2025). SLAck: Semantic, Location, and Appearance Aware Open-Vocabulary Tracking. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15085. Springer, Cham. https://doi.org/10.1007/978-3-031-73383-3_1

DOI: https://doi.org/10.1007/978-3-031-73383-3_1
Published: 03 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73382-6
Online ISBN: 978-3-031-73383-3
eBook Packages: Computer Science, Computer Science (R0)
