MAIM: a mixer MLP architecture for image matching

  • Original article
The Visual Computer

Abstract

Recent advances in multilayer perceptron (MLP) models have provided effective new network architecture designs for computer vision tasks. Compared with convolutional neural networks (CNNs) and vision transformers, MLP-based visual backbones have less inductive bias, which can improve sample efficiency and reduce computational cost. We therefore designed the Mixer MLP Architecture for Image Matching (MAIM), a coarse-to-fine, detector-free image-matching scheme. At its core is a mixer MLP architecture called Mixer-WMLP. In the coarse-level model, Mixer-WMLP evenly divides the feature map into non-overlapping windows, flattens each window into a token, and exchanges token information across spatial locations and feature channels through a two-layer MLP structure. It then passes the windows to dense fine-level matching, which produces the final matches. The resulting mixer MLP framework retains a global field of view while incurring a low computational cost. We compare our MLP architecture with CNN-based and transformer-based image-matching methods in indoor and outdoor relative pose experiments. Our method has significant advantages in real-time performance and substantially reduces computational cost, demonstrating its effectiveness in image-matching tasks.
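The window-to-token step and the two-layer MLP mixing described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the full text is not available here, so the function names (`window_tokens`, `mixer_block`), the layer normalization, the GELU activation, the residual connections, and all sizes are assumptions borrowed from the generic MLP-Mixer design.

```python
import numpy as np

def window_tokens(feat, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows
    and flatten each window into one token of length w * w * C."""
    H, W, C = feat.shape
    assert H % w == 0 and W % w == 0, "feature map must divide evenly into windows"
    x = feat.reshape(H // w, w, W // w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)              # (H/w, W/w, w, w, C)
    return x.reshape((H // w) * (W // w), w * w * C)

def gelu(x):
    # tanh approximation of GELU (assumed activation)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP applied along the last axis of x."""
    return gelu(x @ w1 + b1) @ w2 + b2

def init_params(rng, n_tokens, dim, hidden):
    """Hypothetical parameter layout: one two-layer MLP per mixing direction."""
    def two_layer(d_in, d_hidden):
        return (rng.normal(0.0, 0.02, (d_in, d_hidden)), np.zeros(d_hidden),
                rng.normal(0.0, 0.02, (d_hidden, d_in)), np.zeros(d_in))
    return {"token": two_layer(n_tokens, hidden), "channel": two_layer(dim, hidden)}

def mixer_block(tokens, params):
    """One mixing block over tokens of shape (N, D)."""
    # Token mixing: transpose so the MLP acts across window positions.
    y = layer_norm(tokens).T                     # (D, N)
    tokens = tokens + mlp(y, *params["token"]).T
    # Channel mixing: the MLP acts on the features inside each token.
    return tokens + mlp(layer_norm(tokens), *params["channel"])

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 4))                # toy coarse feature map
tokens = window_tokens(feat, w=4)                # 4 windows, each of length 64
params = init_params(rng, n_tokens=tokens.shape[0], dim=tokens.shape[1], hidden=32)
out = mixer_block(tokens, params)
print(tokens.shape, out.shape)
```

Because the token-mixing MLP acts across all windows at once, every output token depends on every input window, which is one way to read the "global field-of-view" claim. Each mixing step is a pair of dense matrix products rather than pairwise attention.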

Data availability

The data that support the findings of this study are available from the corresponding author, upon reasonable request.

Funding

This work was supported by the National Key Research and Development Program of China (Grant Number: 2020AAA0108103); the Institute of Robotics and Intelligent Manufacturing Innovation, Chinese Academy of Sciences (Grant Number: C2021002); and the Innovation Engineering Project for New Energy and Intelligent Networked Automobile of Anhui Province, China.

Author information

Corresponding author

Correspondence to Bin Kong.

Ethics declarations

Conflicts of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shen, Z., Kong, B. & Dong, X. MAIM: a mixer MLP architecture for image matching. Vis Comput 40, 1327–1337 (2024). https://doi.org/10.1007/s00371-023-02851-9

  • DOI: https://doi.org/10.1007/s00371-023-02851-9
