Harnessing Temporal Information for Precise Frame-Level Predictions in Endoscopy Videos

Pooya Mobadersany¹⁴,
Chaitanya Parmar¹⁴,
Pablo F. Damasceno¹⁴,
Shreyas Fadnavis¹⁴,
Krishna Chaitanya¹⁴,
Shilong Li¹⁴,
Evan Schwab¹⁵,
Jaclyn Xiao¹⁶,
Lindsey Surace¹⁴,
Tommaso Mansi¹⁴,
Gabriela Oana Cula¹⁴,
Louis R. Ghanem¹⁴ &
Kristopher Standish¹⁴

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15006)

Included in the following conference series:

International Conference on Medical Image Computing and Computer-Assisted Intervention

Abstract

Camera localization in endoscopy videos plays a fundamental role in enabling precise diagnosis and effective treatment planning for patients with Inflammatory Bowel Disease (IBD). Precise frame-level classification, however, depends on long-range temporal dynamics spanning hundreds to tens of thousands of frames per video, which challenges current neural network approaches. To address this, we propose EndoFormer, a frame-level classification model that leverages long-range temporal information for anatomic segment classification in gastrointestinal endoscopy videos. EndoFormer combines a Foundation Model block, judicious video-level augmentations, and a Transformer classifier for frame-level classification while maintaining a small memory footprint. Experiments on 4160 endoscopy videos from four clinical trials, totaling over 61 million frames, demonstrate that EndoFormer achieves an AUC of 0.929, significantly improving on state-of-the-art models for anatomic segment classification. These results highlight the potential for adopting EndoFormer in endoscopy video analysis applications that require long-range temporal dynamics for precise frame-level predictions.
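
The abstract describes the pipeline only at a high level, so the following is a minimal illustrative sketch, not the authors' implementation. It assumes that per-frame feature vectors have already been extracted by some frozen foundation model (random numbers stand in for them here) and shows how a single self-attention step over the time axis lets every frame's prediction draw on every other frame in the video. All names, dimensions, and weights (`NUM_FRAMES`, `FEATURE_DIM`, `NUM_SEGMENTS`, `classify_frames`, and so on) are hypothetical, and positional encodings, multiple heads and layers, the video-level augmentations, and training are all omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Untrained stand-in weights; a real model would learn these."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def temporal_self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product attention across the time axis."""
    queries = [matvec(w_q, f) for f in frames]
    keys = [matvec(w_k, f) for f in frames]
    values = [matvec(w_v, f) for f in frames]
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    mixed = []
    for q in queries:
        # Each frame scores every frame in the video, near or far.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        mixed.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return mixed


def classify_frames(frames, params):
    """Return one probability vector over segments for every frame."""
    context = temporal_self_attention(
        frames, params["w_q"], params["w_k"], params["w_v"]
    )
    # Residual connection: keep the frame's own features plus temporal context.
    hidden = [[a + b for a, b in zip(f, c)] for f, c in zip(frames, context)]
    return [softmax(matvec(params["w_out"], h)) for h in hidden]


# Hypothetical sizes, chosen only to keep the example small.
NUM_FRAMES, FEATURE_DIM, NUM_SEGMENTS = 12, 8, 4

rng = random.Random(0)
# Stand-in for embeddings produced by a frozen foundation model.
frame_features = [
    [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)] for _ in range(NUM_FRAMES)
]
params = {
    "w_q": random_matrix(FEATURE_DIM, FEATURE_DIM, rng),
    "w_k": random_matrix(FEATURE_DIM, FEATURE_DIM, rng),
    "w_v": random_matrix(FEATURE_DIM, FEATURE_DIM, rng),
    "w_out": random_matrix(NUM_SEGMENTS, FEATURE_DIM, rng),
}

probs = classify_frames(frame_features, params)
predicted = [max(range(NUM_SEGMENTS), key=p.__getitem__) for p in probs]
```

Working on precomputed per-frame features rather than raw pixels is one plausible way to keep memory use low when a video contains thousands of frames, although the abstract does not say how EndoFormer achieves its small memory footprint.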

Author information

Authors and Affiliations

Janssen R&D, LLC, A Johnson and Johnson Company, New Brunswick, USA
Pooya Mobadersany, Chaitanya Parmar, Pablo F. Damasceno, Shreyas Fadnavis, Krishna Chaitanya, Shilong Li, Lindsey Surace, Tommaso Mansi, Gabriela Oana Cula, Louis R. Ghanem & Kristopher Standish
Epic Sciences, San Francisco, CA, USA
Evan Schwab
University of California, San Francisco, CA, USA
Jaclyn Xiao

Corresponding authors

Correspondence to Pooya Mobadersany or Kristopher Standish.

Editor information

Editors and Affiliations

Children’s National Hospital/George Washington University, Washington, DC, USA
Marius George Linguraru
The Chinese University of Hong Kong, Hong Kong, China
Qi Dou
Technical University of Denmark, Kgs Lyngby, Denmark
Aasa Feragen
Imperial College London, London, UK
Stamatia Giannarou
Imperial College London, London, UK
Ben Glocker
Universitat de Barcelona, Barcelona, Spain
Karim Lekadir
Helmholtz Munich, Technical University of Munich and King’s College London, Munich, Germany
Julia A. Schnabel

Ethics declarations

Disclosure of Interests

All authors were employees of Janssen R&D, LLC, when conducting this research, and may own company stock/stock options.

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1113 KB)

About this paper

Cite this paper

Mobadersany, P. et al. (2024). Harnessing Temporal Information for Precise Frame-Level Predictions in Endoscopy Videos. In: Linguraru, M.G., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. MICCAI 2024. Lecture Notes in Computer Science, vol 15006. Springer, Cham. https://doi.org/10.1007/978-3-031-72089-5_28

DOI: https://doi.org/10.1007/978-3-031-72089-5_28
Published: 03 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72088-8
Online ISBN: 978-3-031-72089-5
eBook Packages: Computer Science, Computer Science (R0)
