Article

Self-Supervised Foundation Model for Template Matching

1 Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”, 5 James Bourchier Blvd., 1164 Sofia, Bulgaria
2 Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Acad. G. Bonchev Str., Block 2, 1113 Sofia, Bulgaria
3 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Acad. G. Bonchev Str., Block 8, 1113 Sofia, Bulgaria
* Authors to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(2), 38; https://doi.org/10.3390/bdcc9020038
Submission received: 5 November 2024 / Revised: 12 January 2025 / Accepted: 8 February 2025 / Published: 11 February 2025
(This article belongs to the Special Issue Perception and Detection of Intelligent Vision)
Figure 1
Illustration of Self-TM.
Figure 2
Illustration of a receptive field, $RF_{pred\_p_N}$, in layer $N-1$ (in orange) of a detected maximum value, $pred\_p_N$, in layer $N$ (in red).
Figure 3
Visual representation of results on HPatches (values, excluding those for Self-TM, are taken from Twin-Net [61]): (a) patch verification task; (b) image matching task; (c) patch retrieval task. The methods are grouped as follows: “handcrafted”, created manually by their authors; “supervised”, trained on annotated data; “self-supervised”, trained without any annotations. A plus (+) denotes Self-TM models that are fine-tuned on the HPatches dataset, and an asterisk (*) denotes variations of Tfear models.
Figure 4
Comparison of OmniGlue [34] (a) and OmniGlue + Self-TM Base (b) in finding keypoint matches in an image with out-of-training-domain modality. For visualization purposes, matches with high confidence are not shown so that the errors remain visible. Correct matches are shown in green and incorrect matches in red.

Abstract

Finding the location of a template in a query image is a fundamental problem in many computer vision applications, such as localization of known objects, image registration, image matching, and object tracking. Currently available methods fail when little training data are available or when the images exhibit large texture variations, different modalities, or weak visual features, which limits their use in real-world tasks. We introduce the Self-Supervised Foundation Model for Template Matching (Self-TM), a novel end-to-end approach to self-supervised learning of template matching. The idea behind Self-TM is to learn hierarchical features that incorporate localization properties from images without any annotations. Going deeper into the layers of a convolutional neural network (CNN), the filters begin to respond to more complex structures and their receptive fields grow, so localization information is lost relative to the early layers. Hierarchically propagating the activations of the last layers back to the first layer restores precise template localization. Due to its zero-shot generalization capabilities on tasks such as image retrieval, dense template matching, and sparse image matching, our pre-trained model can be classified as a foundation model.

1. Introduction

In computer vision, visual template detection is a technique used to identify and locate a pattern within a larger image. It usually involves comparing the template image against an input image at different positions to find the best match. The technique is used for detection of familiar objects, image registration, and object tracking.
Despite significant progress in this area, the solutions developed so far do not cover a sufficiently wide range of properties that matter in practice, such as generalization capability, real-time execution, and easy retraining that does not require annotated data.
Recent advancements in self-supervised learning [1,2] have transformed how convolutional neural networks (CNNs) and visual transformers (ViTs) are developed and applied in computer vision. Self-supervised methods enable models to learn rich, transferable representations from unlabeled data, reducing their dependence on annotated datasets while enhancing generalization across tasks. The masked autoencoder (MAE) [3], a self-supervised transformer introduced in 2022, effectively reconstructs masked image patches, demonstrating the potential of transformers for efficient representation learning. Building on this, ref. [4] proposed a self-distilled framework for self-supervised learning in CNNs, highlighting the renewed relevance of CNNs in pretraining scenarios. Further work such as [5] explored joint embedding predictive architectures to optimize self-supervised CNNs. Additionally, hybrid models combining CNNs and visual transformers, like RingMo-Lite [6], have emerged as scalable foundation models for domain-specific tasks. These studies underscore the potential of self-supervised CNNs and visual transformers for scalable, robust, and generalizable solutions across diverse applications and tasks.
The current work presents a novel approach to self-supervised learning of a CNN designed for visual template detection. Our Self-TM models provide a high degree of generalization across various tasks, which classifies them as so-called “foundation” models that maintain high accuracy on data of a different modality, i.e., data highly different from those used for training. If an increase in accuracy is needed, these foundation models can be easily and quickly fine-tuned on a small set of real data.
The applied approach in Self-TM can be attributed to “Siamese network-based trackers” [7,8,9,10,11,12,13,14,15,16], where typically the same model (encoder) is used to extract both image features of the search object and image features in which it is searched. Then, the features from the last layers are correlated using learnable cross-correlation [15,16], and the result is then processed by one or multiple decoders (convolutional neural networks or transformers) [17,18,19,20,21,22]. Most often, the result obtained by the decoders is one or several regression and classification maps [15,16].
Based on this framework, the present work proposes a highly efficient architecture, in addition to a simple yet accurate correlation approach. Combined with the intuitive training method, Self-TM can be easily fine-tuned on any type of images.
Self-TM is evaluated on the tasks of template matching, patch verification, image matching, and patch retrieval, which differ in their objectives and address specific application needs. Template matching is focused on localization of a predefined template within a larger image; it relies on pixel intensity comparisons or feature descriptors and is sensitive to scaling, rotation, and noise. Patch verification determines whether two image patches represent the same visual content. Image matching seeks to find correspondences between two images, mainly by identifying keypoint matches. Finally, patch retrieval returns the most visually similar patches to a query patch from a large set of patches, applying similarity metrics to feature embeddings. Each task employs specific methodologies and tools, ensuring effectiveness for its particular input−output requirements and application scenarios.
The novelty in the proposed Self-TM may be summarized by the following:
  • The high degree of generalization eliminates the need to retrain the model with real data;
  • If further training (fine-tuning) is still necessary, very few images of the real data are needed until the desired accuracy is reached;
  • The specially designed encoder is trained on the task of locating a searched object, which differs from the standard approaches of using a network trained on some other tasks, most often classification or a more general one;
  • The integrated correlation operator we use provides real information about the location of the searched object, rather than being a separate layer of the network [15,16] which has to be further trained and decoded by one or more neural networks. This approach has not been found in the literature so far;
  • Hierarchical propagation of activations from the last to the first layer leads to precise localization while precluding the need to use an additional decoder that needs to be trained. The result is an extremely simplified and lightweight architecture, while providing high accuracy. This approach has not been found in the literature so far;
  • Two-step self-supervised training, involving two types of data augmentations: color augmentation and color and geometric augmentations;
  • Self-TM is rotationally invariant. In this work, the Self-TM models’ family is trained over the entire interval from −90 to +90 degrees.

2. Related Work

2.1. Hand-Crafted Methods

Template matching has a long history in the field of computer vision with many hand-crafted methods, mostly involving correlation-based techniques in which the searched template is slid over the target image, computing various similarity scores.
Traditional methods applied directly to pixel intensities, such as SSD (sum of squared differences) [23] and SAD (sum of absolute differences) [24], are very sensitive to noise, deformations, and visual transformations of the object or region of interest, making them inapplicable to real tasks. Papageorgiou and Poggio introduced a method [25] in 1998 that uses normalized cross-correlation (NCC), providing a certain degree of robustness against illumination and contrast variations. Their approach spawned later advances, for example zero-mean NCC (ZNCC) [26] in pattern matching, emphasizing the importance of normalization.
Another family of methods, based on the visual features of keypoints, such as SIFT [27], SURF [28], and ORB [29], has emerged as an alternative to traditional template matching techniques. These methods are widely popular because they can efficiently handle variations in scale, rotation, and perspective. Despite fluctuations in accuracy, they are used in various systems since they offer high performance, good generalization on new data, and low computational requirements.

2.2. Learnable Methods

Driven by the goal of increasing the accuracy of template matching, the focus in recent years has shifted mainly to two types of learnable methods: sparse and dense.
Sparse methods have gained popularity for their efficiency due to the use of keypoint features, resulting in reduced computational cost and improved robustness. For example, SuperPoint [30] is a keypoint extraction algorithm that introduces a trained CNN to detect keypoints and generate their features. SuperGlue [31] focuses on feature matching. It evaluates and aligns each pair of keypoints by constructing an association matrix based on their features and applies Sinkhorn [32] to determine the optimal match between them. LightGlue [33] shows significant progress with its efficient matching: it uses a small architecture that can stop inference early when matches are already found and excludes points that cannot be matched from later processing. OmniGlue [34] presents a framework that uses SuperPoint [30] for keypoint detection and attempts to improve generalization by exploiting features from a foundation model [35].
Dense methods [36,37,38,39] have demonstrated remarkable improvements in accuracy for similarity search between images, especially with recent advances introduced by the transformers [18,20,21,22]. Unlike sparse methods, they aim to establish correspondences for each pixel of the processed images, resulting in dense information flow. This approach is particularly useful in applications where understanding the entire image is critical. However, recent methods mainly emphasize accuracy, thereby increasing computational requirements to unacceptable levels, challenging even systems with significant GPU resources.
Self-TM offers a learnable dense approach to localizing a search template, taking advantage of the dense method’s high accuracy, while being characterized by a very efficient architecture that does not require high computational resources. Self-TM can also be integrated into a sparse-based framework such as OmniGlue [34] (described in Section 4.3), increasing its accuracy and preserving the characteristic properties of this type of methods.
Although learnable methods have proven to perform well on their training data, the range of tasks to which they can be applied is limited by their restricted generalization, in contrast to foundation models.

2.3. Foundation Models

Foundation models are pre-trained, data-rich, highly generalizable models designed to perform more than one task with sufficiently high accuracy. These models serve as a foundation for a variety of applications, allowing for fine-tuning or adaptation to specific tasks with less data and computational resources.
Key features of these models are as follows: pre-training with large datasets covering a wide range of visual concepts, the ability to fine-tune on a small but specific corpus of data focused on a particular task, the ability to adapt to different tasks or subtasks, high robustness, and generalization, ensuring robust performance in different domains and tasks.
For example, DINO [35] and DINOv2 [40] are transformers trained by a self-supervised learning approach that allow semantic segmentation without relying on annotated data, making them invaluable for training on large unannotated image sets. DINOv2, as a successor, extracts higher quality features, providing improvements that allow it to solve more complex tasks with greater accuracy. Both methods use a self-distillation approach, where a student model learns from a teacher model of the same architecture. The teacher model is updated based on the student parameters using the exponential moving average technique. DINO and DINOv2 use “contrastive” learning [41], which encourages the model to produce similar embeddings under variations of the same image, by applying different color and geometric augmentations while discriminating between different images.
SAM [42] and SAM2 [43] with their accuracy and generalization revolutionize semantic image and video segmentation by requiring minimal annotations compared to traditional methods. In addition to images, the input for these frameworks also includes prompts in the form of point positions, a binary mask, rectangle coordinates, and even a textual description, allowing them to focus on a specific region to extract desired features and output.
SegGPT [44] is a segmentation model built on results from [45] that allows the model to adapt to different tasks using a minimum of examples. The model is useful for all segmentation tasks such as instance, object, and semantic segmentation. During training, it performs contextual coloring using a random coloring scheme (rather than specific colors) to identify segments by learning their contextual information, resulting in improved generalization.
Self-TM is a foundation model oriented towards learning high-quality hierarchical features incorporating localization properties. Similar to DINO [35] and DINOv2 [40], the Self-TM models’ family is trained using a self-supervised learning approach, not relying on annotated data, making them invaluable for training on large unannotated image sets. The localization properties provide high accuracy in tasks such as object detection, image registration, object tracking, and sparse and/or dense image matching.

3. Theoretical Formulation of the Proposed Method

In this section, the setup of our proposed Self-Supervised Foundation Model for Template Matching (Self-TM) method, illustrated in Figure 1, is discussed.
Technical details such as the choice of architecture and layers, the data, and the training steps are explained in detail, as well as the essential properties of the hierarchical features that are the overall goal of the present work, including the ability to localize image templates without any annotations.
The next section (Section 4) presents the experimental results of Self-TM on data with different modalities (HPatches [46], MegaDepth [47], and ScanNet [48]) and a variety of tasks including template localization, patch verification, patch retrieval, and image matching.
Section 5 presents details related to the development of the approach: the iterative learning approach and a comparison of Self-TM against the architecture on which it is based.

3.1. Model

A convolutional neural network is chosen because the proposed method relies on hierarchical activations from the final to the first layer of the network: the deeper the layer in which a neuron is located, the larger its receptive field, i.e., the larger the region of input pixels it covers.
The proposed method is independent of the CNN architecture and can be applied to models with different properties. Our chosen architecture is based on the recently introduced ConvNeXt [17], which is similar to ResNet [19] but designed to compete in performance with the state-of-the-art vision transformers [21,22]. In addition, this architecture is one of the few effectively used in self-supervised learning.
Our Self-TM model uses the same blocks as ConvNeXt [17], but with a modified filter size and stride (see Table 1). Downsampling blocks use 3 × 3 filters and a stride of 3, which in terms of receptive field gives a neuron in layer N a field of view of 3 by 3 in layer N − 1.
The layers used, following ConvNeXt [17], are a standard two-dimensional convolution (Conv2D), a linear layer (Linear), a normalization layer (Norm) [49], and a non-linear GELU activation function [50] (see also Section 5.2).
The size of Self-TM Small is significantly compressed, containing only 13 million learnable parameters, which is considerably smaller than the standard architectures (Table 2) used as encoders and decoders in template matching or tracking methods. This is a consequence of a network design focused entirely on this type of task.
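For illustration, a minimal PyTorch sketch of such a downsampling block with the modified 3 × 3 filter and stride of 3 is given below; the class name and the channels-last normalization follow the ConvNeXt convention, and the spatial sizes in the comment are arithmetic consequences of kernel 3 / stride 3, not figures from the paper.

```python
import torch
import torch.nn as nn

class Downsample3x3(nn.Module):
    """Illustrative downsampling block: 3x3 convolution with stride 3, so each output
    neuron sees a 3x3 field of view in the previous layer's feature map."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.norm = nn.LayerNorm(in_channels)  # channels-last layer norm, as in ConvNeXt
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> normalize over channels, then downsample spatially by 3
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return self.conv(x)

# With kernel 3 and stride 3, three such stages map a 189 x 189 input
# to 63 x 63, then 21 x 21, then 7 x 7 feature maps.
```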

3.2. Data

ImageNet-1K Train [56] is used for training the Self-TM models because it is an established standard for training image processing models and has a high diversity of classes (1000 classes), covering enough visual concepts to achieve strong generalization.

3.3. Training

The often-preferred invariance-based approach [57] is used for training, where the idea is to learn similar features for compatible images and discriminative features for incompatible ones. Compatible images are produced by applying random augmentations that modify the input image, such as changing the contrast or blurring. With this learning approach it is possible to run into so-called representation collapse, where the network’s output is constant regardless of the input.
The training is performed in two successive steps, which differ in the type of augmentations applied to the input images: color augmentations and color and geometric augmentations. This is necessary because geometric augmentations add increased variation to the data that the model cannot initially overcome. To overcome this, Self-TM is initially trained only on data with color augmentations applied. Once this training step is complete, the model continues its training, adding geometric augmentations to the color augmentations, which are as follows: perspective transform with a factor up to 0.5, a rotation from −90 to +90 degrees, and a rescale with a tolerance from −0.3 to +0.3.
The same training procedure is applied on all Self-TM family models (Self-TM Small, Self-TM Base, and Self-TM Large), starting with 15 epochs with color augmentations and continuing with 30 epochs with color and geometric augmentations, using the same dataset. 4x NVIDIA A5000 24GB graphics cards are used for the training, and 1 epoch takes about 5.1 h for Self-TM Small, 5.7 h for Self-TM Base, and 6.5 h for Self-TM Large.
The training steps are as follows:
  • An input image $I$ is taken from an unannotated image database, to which a random crop $R$ is applied; the crop is then resized by $S$ to 189 × 189 pixels, yielding a “query” image, $S(R(I)) = Q$ (see Figure 1);
  • Another random crop $R$ is performed on $Q$, and then random image augmentations (color and/or geometric) $A$ modify it to obtain a “template” image, $A(R(Q)) = T$ (see Figure 1). This step can produce one or many different templates; in the Self-TM training, two templates are used;
  • The position $gt\_p$ (ground truth position) of the resulting “template” on the “query” image is stored (the coordinates of the red rectangle’s center on the “query”; see Figure 1);
  • The “query” and the two templates are fed as inputs to the Self-TM network, $f_\theta : (Q, T) \rightarrow (y, y')$, and the resulting feature maps from all layers are stored, $y = f_\theta(Q)$, $y' = f_\theta(T)$, $(y, y') = \{(y_n, y'_n) \mid n = 1, \dots, N\}$, where $N$ is the number of the network’s layers. In the current architecture, the number of layers is 3, denoted by first, mid, and last;
  • A correlation operator $CORR(y_n, y'_n)$ is applied to every two corresponding feature maps, sequentially starting from the deepest layer. Then, the position of the maximum value (predicted position), $pred\_p_n = softmax(CORR(y_n, y'_n))$, is found in its output (see “hierarchical activations propagation” in Figure 1);
  • A mean squared error is calculated on every two corresponding feature maps, $MSE_n = MSE(y_n, y'_n) = \frac{1}{|D|}\sum_{D}(R(y)_n - y'_n)^2$, $D = \operatorname{def}(y_n) \cap \operatorname{def}(y'_n)$, where $D$ is the region of summation, i.e., the intersection of the definition domains of the “template” feature map $y'_n$ and the cropped “query” feature map $R(y)_n$, with $(y, y') = \{(y_n, y'_n) \mid n = 1, \dots, N\}$; the multiplication is scalar, i.e., elementwise in $D$;
  • Parameter optimization (gradient descent) is performed by minimizing the errors $MSE_N, \dots, MSE_1$ and the offsets of the positions $pred\_p_N, \dots, pred\_p_1$ relative to the real template positions $gt\_p_N, \dots, gt\_p_1$, which is possible due to the compatibility of positions (in pixels). A minimal sketch of these steps is given below.
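The following minimal PyTorch-style sketch illustrates how a (query, template, ground-truth position) triple could be generated for one training sample; the crop scales, helper structure, and use of torchvision are illustrative assumptions rather than the released implementation.

```python
import random
import torch
from torchvision import transforms
from torchvision.transforms import functional as TF
from PIL import Image

def make_training_triple(image: Image.Image, query_size: int = 189, template_scale: float = 0.3):
    """Illustrative generation of a (query Q, template T, ground-truth center gt_p) triple."""
    # Step 1: random crop R of the input image, resized by S to 189 x 189 -> query Q
    w, h = image.size
    crop = int(min(w, h) * random.uniform(0.5, 0.9))  # illustrative crop scale
    left, top = random.randint(0, w - crop), random.randint(0, h - crop)
    query = TF.resize(TF.crop(image, top, left, crop, crop), [query_size, query_size])

    # Step 2: random crop of Q, followed by augmentations A -> template T
    t_size = int(query_size * template_scale)
    t_left = random.randint(0, query_size - t_size)
    t_top = random.randint(0, query_size - t_size)
    template = TF.crop(query, t_top, t_left, t_size, t_size)
    template = transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)(template)  # color augmentation (Section 3.4.1)

    # Step 3: store the ground-truth center of the template on the query (in pixels)
    gt_p = torch.tensor([t_left + t_size / 2.0, t_top + t_size / 2.0])
    return query, template, gt_p
```

The query and templates would then pass through the shared encoder $f_\theta$, after which the per-layer correlation positions and MSE terms described above are computed and combined into the loss (Section 3.7).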

3.4. Augmentation

Because geometric augmentations add increased variation to the data that the model cannot initially overcome, training is performed in two steps: training begins on data with applied color augmentations, and then, it continues on data with both color and geometric augmentations.

3.4.1. Color Augmentations

The applied color augmentations (see Table 3) are based on the approach in [58]:
  • To obtain the “query” image, the following steps are applied sequentially:
    Random crop: scale from 0.1 to 0.9;
    Rescale: 189 × 189 pixels;
    Normalization: mean = [0.485, 0.456, 0.406], std = [0.228, 0.224, 0.225], computed on ImageNet;
  • To obtain the two “templates” on the “query”, the following steps are applied sequentially:
    Random color jitter with an independent probability * of 80%: brightness = 0.4, contrast = 0.4, saturation = 0.2, and hue = 0.1;
    Random grayscale with an independent probability * of 20%;
    For “template” 1:
    Random Gaussian blur: radius from 0.1 to 0.2;
    For “template” 2:
    Random Gaussian blur with an independent probability * of 10%: radius from 0.1 to 2.0;
    Random invert of all pixel values above a given threshold (solarization) with an independent probability * of 20%: threshold = 128;
    Normalization: the same as in “query”;
    Random crop: scale from 0.14 to 0.85; ratio from 0.2 to 5.0.
* The random operation is applied independently with the given probability, drawn from a uniform distribution.
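As an illustration, the color pipeline for “template” 2 could be expressed with torchvision transforms roughly as follows; the blur kernel size is an assumption (the paper specifies a radius, approximated here by the sigma range), the random crop step is omitted, and the released code may compose the operations differently.

```python
from torchvision import transforms

# Sketch of the "template 2" color pipeline listed above (probabilities and ranges as in Table 3).
template2_color = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),  # color jitter, 80%
    transforms.RandomGrayscale(p=0.2),                                            # grayscale, 20%
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0))], p=0.1),
    transforms.RandomSolarize(threshold=128, p=0.2),                              # solarization, 20%
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.228, 0.224, 0.225]),
])
```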

3.4.2. Geometric Augmentations

Geometric augmentations (see Table 4), aligned to the center of the resulting cropping square, are applied to the “template” images only:
  • Random square crop: scale from 0.14 to 0.45;
  • A randomly selected geometric augmentation is applied with a 50% probability:
    Random perspective transformation: distortion scale = 0.5;
    Random rotation: degrees from −90 to +90;
    Random rescale: factor from 0.7 to 1.3.
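A corresponding sketch of the geometric step, again using torchvision; realizing the rescale via an affine scale is an assumption about how the paper’s “rescale” operation is implemented.

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

def geometric_augment(template):
    """Apply one randomly selected geometric augmentation with 50% probability (sketch)."""
    if random.random() < 0.5:
        choice = random.choice(["perspective", "rotation", "rescale"])
        if choice == "perspective":
            template = transforms.RandomPerspective(distortion_scale=0.5, p=1.0)(template)
        elif choice == "rotation":
            template = transforms.RandomRotation(degrees=90)(template)  # -90 to +90 degrees
        else:
            scale = random.uniform(0.7, 1.3)  # rescale factor
            template = TF.affine(template, angle=0.0, translate=[0, 0], scale=scale, shear=[0.0, 0.0])
    return template
```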

3.5. Hierarchical Activations Propagation

One of the main novelties in the proposed framework is the hierarchical propagation of activations from the last to the first layer, $pred\_p_N, \dots, pred\_p_1$ (currently $pred\_p_{last}$, $pred\_p_{mid}$, $pred\_p_{first}$). This results in precise localization without training an additional decoder (“hierarchical activations propagation” in Figure 1).
The localization of the “template” $T$ in a selected image $Q$ is performed by finding the maximum value (the highest-value activation position, visualized by a red square in Figure 1 and Figure 2) in the correlation operator’s output produced between every two corresponding layers, starting from the deepest and moving to the first, i.e., $pred\_p_{N,\dots,1} = softmax(CORR(y_{N,\dots,1}, y'_{N,\dots,1}))$.
We start by finding the position of the maximum value ($pred\_p_N$) in the last layer $N$, which is propagated to the previous layer $N-1$ as the position $pred\_p_{N-1}$. This new position ($pred\_p_{N-1}$) is again a maximum value, but selected from a limited region, denoted as $RF$ (receptive field), around the maximum value of layer $N$: $pred\_p_{N-1} = softmax(RF_{pred\_p_N}(CORR(y_{N-1}, y'_{N-1})))$ (see Figure 2). In this way, each subsequent layer refines the assumed position of $T$.
To detect the position of the maximum value, a softmax function [59] is used. It is a differentiable argmax function allowing the gradient to flow freely from the end to the beginning of the network during training:
$$softmax(z)_i = \frac{e^{\beta z_i}}{\sum_{j=1}^{K} e^{\beta z_j}} \quad \text{for } i = 1, \dots, K, \text{ and } z = (z_1, \dots, z_K) \in \mathbb{R}^K.$$
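A minimal sketch of the propagation step is given below, using a hard argmax restricted to the receptive field for clarity; during training the paper uses the differentiable softmax-based variant above, and the 3 × 3 window follows the stride-3 downsampling design of Section 3.1.

```python
import torch

def propagate_position(corr_prev: torch.Tensor, pos: tuple, stride: int = 3) -> tuple:
    """Refine a position predicted in layer n using the correlation map of layer n-1.

    corr_prev: 2D correlation map of layer n-1, shape (H, W).
    pos: (row, col) of the maximum found in layer n.
    The receptive field of that neuron in layer n-1 is a stride x stride window.
    """
    r0, c0 = pos[0] * stride, pos[1] * stride           # top-left corner of the receptive field
    window = corr_prev[r0:r0 + stride, c0:c0 + stride]  # restrict the search to the receptive field
    local = torch.argmax(window)
    dr, dc = divmod(int(local), window.shape[1])
    return (r0 + dr, c0 + dc)                           # refined position in layer n-1 coordinates

# Usage sketch: start from the argmax of the deepest correlation map, then refine layer by layer:
# pos = argmax2d(corr_last); pos = propagate_position(corr_mid, pos); pos = propagate_position(corr_first, pos)
```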

3.6. Correlation

Another novelty in the proposed framework is the integration of a static correlation operator in the network, the result of which provides information about the location of the searched object, instead of being a separate layer [15,16] that needs to be further trained and subsequently decoded by one or more neural networks.
The correlation operator used here is normalized cross-correlation representing a dense measure of a pixel-wise similarity metric:
$$CORR(x, y) = \frac{\sum_{x', y'} T(x', y') \cdot I(x + x', y + y')}{\sqrt{\sum_{x', y'} T(x', y')^2 \; \sum_{x', y'} I(x + x', y + y')^2}},$$
where $(x', y') \in \operatorname{def}(Q) \cap \operatorname{def}(T)$, i.e., the summation is (only) over the intersection of the domains of the input $I$ and of the template $T$ positioned at the point $(x, y)$ of $I$.
The result of the correlation operator is a grayscale image, where the position of the template is the pixel with the highest intensity. In this case, the operator is not applied to pixel values, but to the activation values of the layers (feature maps).
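A minimal sketch of such a normalized cross-correlation applied to feature maps, using conv2d as a sliding dot product; this is an illustrative reimplementation rather than the paper’s exact operator, and the epsilon is added for numerical stability.

```python
import torch
import torch.nn.functional as F

def ncc(query_feat: torch.Tensor, template_feat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Dense normalized cross-correlation between a query and a template feature map.

    query_feat:    (C, Hq, Wq) feature map of the query image.
    template_feat: (C, Ht, Wt) feature map of the template.
    Returns a (Hq - Ht + 1, Wq - Wt + 1) similarity map; the maximum marks the template position.
    """
    q = query_feat.unsqueeze(0)       # (1, C, Hq, Wq)
    t = template_feat.unsqueeze(0)    # (1, C, Ht, Wt), used as a convolution kernel
    numer = F.conv2d(q, t)            # sliding dot product: sum over x', y' (and channels) of T * I
    t_energy = (t ** 2).sum()         # sum of T^2 (constant per template)
    ones = torch.ones_like(t)
    q_energy = F.conv2d(q ** 2, ones) # sum of I^2 under each template placement
    return (numer / torch.sqrt(t_energy * q_energy + eps)).squeeze(0).squeeze(0)
```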

3.7. Loss

Two terms, $\mathcal{L}_{MSE}$ and $\mathcal{L}_{CORR}$, are used for computing the loss while training the network $f_\theta$:
$$\mathcal{L}_{MSE} = \frac{1}{N}\sum_{n=1}^{N} MSE(y_n, y'_n),$$
$$\mathcal{L}_{CORR} = \frac{1}{N}\sum_{n=1}^{N} \left\| pred\_p_n - gt\_p_n \right\|_2, \quad pred\_p_n \in \mathbb{Z}^2, \; gt\_p_n \in \mathbb{R}^2,$$
$$\mathcal{L} = \frac{\mathcal{L}_{MSE} + \mathcal{L}_{CORR}}{2},$$
where $\mathcal{L}_{MSE}$ is the average of the mean squared error ($MSE$) results over the layers, and $\mathcal{L}_{CORR}$ is the average of the Euclidean distances between the detected maximum-value positions $pred\_p$ and the real template positions $gt\_p$ over the layers.
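A compact sketch of this loss, assuming the per-layer feature maps, predicted positions, and ground-truth positions have already been collected; in practice a soft-argmax over the correlation map keeps the position term differentiable.

```python
import torch
import torch.nn.functional as F

def self_tm_loss(query_feats, template_feats, pred_positions, gt_positions):
    """query_feats/template_feats: lists of per-layer feature maps (cropped to the common region D).
    pred_positions/gt_positions: lists of per-layer (x, y) tensors in pixel coordinates."""
    n_layers = len(query_feats)
    mse_term = sum(F.mse_loss(q, t) for q, t in zip(query_feats, template_feats)) / n_layers
    corr_term = sum(torch.linalg.norm(p - g) for p, g in zip(pred_positions, gt_positions)) / n_layers
    return (mse_term + corr_term) / 2  # L = (L_MSE + L_CORR) / 2
```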

3.8. Hyperparameters/Optimization

An AdamW optimizer [60] with a learning rate of 0.001 and a weight decay (regularization) of $10^{-6}$ is used, with the regularization entering the loss calculation as
$$\mathcal{L}' = \mathcal{L} + weight\_decay \cdot \underbrace{\lVert \theta \rVert_2^2}_{L2 \text{ norm of the weights}}.$$
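In PyTorch, this configuration would look roughly as follows; the placeholder module stands in for the Self-TM network.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3, stride=3)  # placeholder for the Self-TM network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-6)
```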

3.9. Prediction

A correlation operator $CORR(y, y')$ is applied to every two corresponding feature maps obtained from all layers of the network for an input image, $y = f_\theta(I_Q)$, and a selected template, $y' = f_\theta(I_T)$. Starting with the $CORR$ result of the deepest layer and using “hierarchical activations propagation”, the maximum activation value of the first layer is reached. The position of this activation gives the coordinates of the template $I_T$ in the input image $I_Q$. However, since the receptive field of the downsampling blocks in layer N is 3 by 3, the resulting position refers to the input image $I_Q$ downsized 3 times.

4. Experiments

This section presents experimental results and demonstrates our proposed Self-TM method on data having the same and different modality as the one used in the training (ImageNet-1K Test [56], HPatches [46], MegaDepth [47], and ScanNet [48]) as well as a variety of tasks:
  • ImageNet-1K Test [56] showing the accuracy of template localization on data having the same modality as the training data;
  • HPatches [46] evaluating the properties of feature maps as local descriptors for finding matches between different image patches in corresponding images;
  • MegaDepth [47] evaluating image matching accuracy on outdoor scenes having different modality from the training data;
  • ScanNet [48] evaluating image matching accuracy on indoor scenes having different modality from the training data.

4.1. ImageNet-1K Test

An ImageNet-1K test set [56] was used to compare the trained Self-TM model family (Small, Base, and Large) based on template localization accuracy.
Since the training is carried out in two successive steps that differ in the type of augmentations applied to the input images (color augmentations; color and geometric augmentations), this test was performed separately for each of them.
The results are presented in Table 5, where for all Self-TM models (Small, Base, and Large) the displacements in pixels for the three layers ($D\_L_{last}$, $D\_L_{mid}$, $D\_L_{first}$) of $pred\_p_N, \dots, pred\_p_1$ were calculated relative to the real template positions $gt\_p_N, \dots, gt\_p_1$.
The test steps are similar to training steps and are as follows:
  • An input image is taken from an unannotated dataset (ImageNet-1K test), from which a “template” is obtained. The positions of the maximum values of the correlation operator’s result are then calculated for each corresponding feature map (see training steps 1 to 5 in Section 3.3).
  • For each corresponding feature map (i.e., for each layer), the position displacement in pixels ($D\_L_{last}$, $D\_L_{mid}$, $D\_L_{first}$) of $pred\_p_N, \dots, pred\_p_1$ is computed relative to the actual template positions $gt\_p_N, \dots, gt\_p_1$, using the Euclidean distance:
    $$D\_L_n = \overline{|gt\_p_n - pred\_p_n|} = \frac{1}{M}\sum_{i=1}^{M} \left\| gt\_p_n^{(i)} - pred\_p_n^{(i)} \right\|_2,$$
    where $M$ is the number of images in the dataset, $gt\_p_n^{(i)}$ and $pred\_p_n^{(i)}$ are the two-dimensional position vectors for each image $i = 1, \dots, M$, and $n$ is the currently selected layer of the network, $\{D\_L_n \mid n = 1, \dots, N\}$; in this case, $N = 3$, and for clarity, instead of $n = 1, 2, 3$, we use the indexations $first$, $mid$, $last$; $D\_L_N$ or $D\_L_{last}$ is the average Euclidean distance for the deepest layer.
Table 5 clearly shows that the models trained and tested using only color augmentations achieved the same localization accuracy despite their size difference. Specifically, the average displacement was 0.15 pixels for the deepest layer, 0.17 pixels for the middle layer, and 0.57 pixels for the first layer. This shows that even the smallest Self-TM model (13 M) has enough parameters for successful training, competing with the largest Self-TM (130 M). However, this is not the case when geometric augmentations are added to the color augmentations, where we observed almost twice the displacement ($D\_L_{first} = 2.214$, $D\_L_{mid} = 0.767$, $D\_L_{last} = 0.409$) compared to Self-TM Large (130 M) ($D\_L_{first} = 1.331$, $D\_L_{mid} = 0.452$, $D\_L_{last} = 0.273$). The reason is the increased variation in the training data introduced by the geometric augmentations, which requires bigger models and more training time.
Table 6 shows the performance of the Self-TM Large model on random images from the ImageNet-1K test set. A region was randomly selected, and random color and geometric augmentations were applied to it to obtain a template. The examples clearly show that the selected model successfully coped with the applied complex color and geometric augmentations.

4.2. HPatches

HPatches [46] is a dataset for finding matches between different patches, which is used to evaluate the generalization capabilities of Self-TM models.
This dataset has a modality different from the one used for training, containing scenes (116 groups of scenes with 6 images each) with different illumination and geometric augmentations. Patch matches were detected by feature points located with HoG, Hessian, and Harris detectors between the first and the other images. The standard mAP (mean average precision) metric was used for evaluation on three different tasks: patch verification, image matching, and patch retrieval.
Patch verification evaluates how well positive pairs are separated from negative pairs. There are two subtasks: SameSeq containing pairs of patches from the same set of scenes and DiffSeq containing pairs of patches from different sets of scenes. The SameSeq is considered more challenging because the textures in different parts of the image are often similar.
Image matching evaluates how accurately images from the same scene are detected. Here, the overall mAP results are much lower than in patch verification, since the ratio of positive to negative examples is significantly lower here and all the negatives come from the same group of scenes. Another interesting observation is that results obtained from images with photometric changes (illum) are significantly lower than those with viewpoint changes (viewpt). This is because the “illum” images include extreme changes in illumination.
Patch retrieval evaluates how accurately patch correspondences are detected in a large set of patches. Here, a single patch is compared to a large collection containing mainly distractors obtained randomly from the remaining image sets.
In the Self-TM experiments on HPatches, the feature maps from the last layer of the network were used to obtain the patch descriptors. The input data were resized to 32 × 32, yielding descriptors of 1 × 512 (Self-TM Small), 1 × 1024 (Self-TM Base), and 1 × 2048 (Self-TM Large), as sketched below.
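A sketch of how such a descriptor could be extracted; the assumption that the model exposes its per-layer feature maps, as well as the spatial pooling and L2 normalization, are illustrative choices and not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def patch_descriptor(model, patch: torch.Tensor) -> torch.Tensor:
    """patch: (3, H, W) tensor. Returns an L2-normalized 1 x C descriptor from the last layer."""
    x = F.interpolate(patch.unsqueeze(0), size=(32, 32), mode="bilinear", align_corners=False)
    with torch.no_grad():
        last_feat = model(x)[-1]                          # assume the model returns per-layer feature maps
    desc = torch.flatten(last_feat.mean(dim=(2, 3)), 1)   # average over the (small) spatial grid -> (1, C)
    return F.normalize(desc, dim=1)                       # C = 512 / 1024 / 2048 depending on the model size
```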
The benchmark of Self-TM with state-of-the-art methods on all three HPatches tasks [46] is shown in Figure 3.
In all three tasks, the color of the markers indicates the amount of “geometric noise” (Easy, Hard, Tough) contained in the images used. The accuracy percentages represent the average values for the selected subtask.
The experiment used all the Self-TM models, i.e., Self-TM Small, Self-TM Base, and Self-TM Large, trained on ImageNet-1K Train [56] using color and geometric augmentations, where (+) denotes models that are fine-tuned on the HPatches dataset.
These results demonstrate the robust generalization of all the Self-TM models, where Self-TM Large exhibits impressive performance on all three tasks—competitive with previous results of models trained with annotated data (supervised training) and approaching the best.
The Self-TM Large+ model fine-tuned on HPatches even outperforms the previous results on the patch verification task, and on image matching it is within only 0.2% mAP of first place.

4.3. Image Matching

Image matching by sparse methods is a suitable task for verifying the high degree of generalization capabilities of Self-TM models, also being very close to template matching. Here, matching was performed by finding sparse keypoint matches in a set of images (Figure 4).
To demonstrate the effectiveness of Self-TM, we used two well-known databases as follows:
  • MegaDepth [47] is a large-scale outdoor image dataset. Ground truth matches between images are computed using Structure-From-Motion [62] algorithms. Test data used contained 1500 predefined image pairs, following the previous work [37].
  • ScanNet [48] which contains many indoor images composed of challenging scenes with low texture and large perspective changes. Here again, test data used contained 1500 predefined image pairs, following [37].
Both datasets contained only out-of-domain images different from the Self-TM training data.
We estimated the matching accuracy between each pair of images through camera pose estimation: the essential matrix is computed with the RANSAC [63] algorithm and then decomposed into rotation and translation. Finally, the angular error between the calculated and real rotation over all image pairs is reported as the AUC (area under the ROC curve) at thresholds of 5°, 10°, and 20° (see [47,48]). The AUC summarizes the performance of classification models across different classification thresholds using the ratio of the True Positive Rate (TPR) to the False Positive Rate (FPR).
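For reference, the relative-pose step of such an evaluation is commonly implemented with OpenCV as sketched below; the matched keypoint arrays, intrinsics, and ground-truth rotation are placeholders, and the exact evaluation code used in [34,37] may differ.

```python
import cv2
import numpy as np

def relative_rotation_error(kpts0: np.ndarray, kpts1: np.ndarray, K: np.ndarray, R_gt: np.ndarray) -> float:
    """kpts0, kpts1: (N, 2) matched keypoints in two images; K: 3x3 intrinsics; R_gt: ground-truth rotation.
    Returns the angular error (degrees) between the recovered and ground-truth rotation."""
    E, inliers = cv2.findEssentialMat(kpts0, kpts1, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, kpts0, kpts1, K, mask=inliers)   # decompose into rotation and translation
    cos = np.clip((np.trace(R_gt.T @ R) - 1.0) / 2.0, -1.0, 1.0)     # geodesic distance between rotations
    return float(np.degrees(np.arccos(cos)))
```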
Since Self-TM itself is a convolutional neural network model, it cannot be applied alone to solve the image matching tasks. Therefore, it is integrated into the OmniGlue [34] framework, designed and focused on a high degree of generalization using the DINOv2 foundation model [40].
More generally, OmniGlue [34] uses frozen weights of DINOv2 [40] and SuperPoint [30] to extract sparse keypoints and their features, which are used to construct an internal graph between each image’s keypoints and an external graph between all the potential correspondences. Both graphs are fed to a learning module containing multiple blocks with two attention layers ending with a layer giving the potential corresponding keypoints.
Self-TM is included in the OmniGlue framework [34], replacing DINOv2 [40] without any further training. Thus, a direct comparison between Self-TM and DINOv2 [40] features in terms of quality and generalization is performed.
Table 7 presents a comparison of the novel OmniGlue + Self-TM framework against the original OmniGlue [34] and various competing methods (which require training), following the experiments performed in [34]. These experiments also include traditional approaches like SIFT [27] and SuperPoint [30], still used in tasks where training is not possible. For these hand-crafted methods, mutual nearest neighbor matching is used to find correspondences between keypoints.
Table 7 shows a significant advantage of OmniGlue + Self-TM Large/Base over OmniGlue [34] on both MegaDepth [47] and ScanNet [48] datasets. OmniGlue + Self-TM Small improves accuracy at AUC@5° but lags behind at AUC@20°.
In addition to the accuracy evaluation, the comparison of DINOv2 [40] and Self-TM should also consider their architecture sizes, which directly affect training time and inference speed. The DINOv2 [40] architecture used is ViT-14-base [21] with 87 M parameters, compared to Self-TM Base with 40 M and Self-TM Large with 130 M parameters.
Considering the results of Table 7 and the size of the architectures used, Self-TM gives a higher degree of generalization than DINOv2 [40] (ViT-14-base) and at the same time provides a more optimized architecture in terms of the number of trainable parameters and speed (Table 8).

5. Implementation Details

5.1. Two-Step Training

Table 9 presents the experiments on which the iterative training approach was developed (two successive steps with different image augmentations: color, then color and geometric). The HPatches dataset [46] was used, which effectively evaluates feature map properties as local descriptors for finding matches between different image patches in corresponding images.
Our experiment used the Self-TM Base model trained from scratch on the ImageNet or HPatches dataset with color or with color and geometric augmentations, as well as the same model fine-tuned on the same data starting from the weights of models already trained on HPatches (color augmentations), ImageNet (color augmentations), or ImageNet (color and geometric augmentations).
The iterative training approach (row 3 in Table 9), where the model is first trained on HPatches with color augmentations and then fine-tuned with color and geometric augmentations, increases accuracy by +7.73% for patch verification, +11.96% for image matching, and +5.41% for patch retrieval compared to the model trained directly on HPatches with color and geometric augmentations (row 2 in Table 9). The reason is that geometric augmentations introduce additional variability into the data, which the model cannot initially overcome.
HPatches contains scenes (116 groups of scenes with 6 images each) with different illumination and geometric augmentations, where models trained with color and geometric augmentations achieve higher accuracy than those using only color augmentations:
  • ImageNet training (row 5 vs. row 4 in Table 9) results in increased model accuracy: +31.77% for patch verification; +81.76% for image matching; +66.63% for patch retrieval;
  • HPatches training (row 3 vs. row 1 in Table 9) results in increased model accuracy: +9.23% for patch verification; +33.89% for image matching; +18.18% for patch retrieval.
The high degree of generalization can be explained by the high number of images and the high diversity of classes (1000 classes) covering a sufficient number of visual concepts in ImageNet. Comparing ImageNet and HPatches training (row 5 vs. row 3 in Table 9), a significant advantage was observed for the model trained on ImageNet: +22.59% for patch verification; +248.03% for image matching; +102.73% for patch retrieval, despite out-of-domain training data compared to the HPatches data used for test evaluation.
When an increase in accuracy is required, Self-TM allows for easy fine-tuning with a small number of images to incorporate precise information about the objects and visual concepts to be detected. Comparing the ImageNet-trained model with the same model fine-tuned on HPatches (row 8 vs. row 5 in Table 9) shows an increase in accuracy: +1.15% for patch verification; +4.08% for image matching; +3.79% for patch retrieval.

5.2. Model Architecture

The computer vision field has seen increasing model refinement based on improved architectures. These architectures focus on creating general-purpose visual features that perform well on out-of-domain images and various tasks. Such models are designed to cover different types of tasks, which usually results in increased size and complexity due to the incorporation of additional layers for each type of task (classification, segmentation, tracking, etc.).
The current work shows that designing an architecture targeting a domain of tasks (object detection, image registration, object tracking, and sparse and dense image matching) significantly reduces the network size, simplifies training and inference, and increases accuracy while retaining generalization capability, performing well on out-of-domain images.
Our proposed architecture (described in Section 3) uses ConvNeXt [17] blocks with a modified filter size and stride. Downsampling blocks use 3 × 3 filters and a stride of 3, which in terms of receptive field gives a layer N neuron a field of view of 3 by 3 on layer N − 1. This results in smoother layer-to-layer hierarchical propagation of activations and precise centering due to the odd filter size.
The layers used, following ConvNeXt [17], are as follows:
  • Standard two-dimensional convolution (Conv2D) using a certain number of filters performing scalar multiplication on their input data;
  • Linear layer (Linear), representing a fully connected layer where each neuron is connected to each neuron from the previous layer;
  • Normalization layer (Norm) [49], a faster alternative to Batch Normalization [64], used to normalize network weights, in order to reduce the training time;
  • Non-linear activation function GELU [50], used in all modern transformers.
Since the proposed method is CNN architecture-independent, it can be applied to any CNN architecture, and we assume that the hierarchical localization properties of the layers will be again successfully learned. Furthermore, it can also be applied on visual transformers, sacrificing the ability of CNNs’ deeper layers to cover a larger receptive field of the input images. These kinds of experiments were briefly analyzed and are not the focus of this study.
A comparison of the template positioning accuracy of the proposed Self-TM architecture versus ConvNeXt is shown in Table 10, using the pixel displacements $D\_L_{First}$ and $D\_L_{Last}$ between the calculated positions $pred\_p_1$ and $pred\_p_N$ and the ground truth positions $gt\_p_1$ and $gt\_p_N$ of the template. Here, $D\_L_{First}$ is the average displacement using the first-layer feature maps, and $D\_L_{Last}$ is the average displacement using the last (deepest) layer feature maps.
Here, Self-TM Base (40 million parameters) and ConvNeXt-S (50 million parameters) were used, as they are closest in number of parameters. The results show that Self-TM Base, with 20% fewer parameters, matches the templates more accurately by 48.44% when using only color augmentations and by 30.39% when applying both color and geometric augmentations.
An additional comparison between Self-TM Base and ConvNeXt-S is shown in Table 11, aiming to evaluate the properties of feature maps as local descriptors for detecting matches between different image patches in their corresponding images, using the HPatches dataset [46].
Results of multiple training steps for each model (sequentially with different augmentations) are presented, starting with ImageNet training with color augmentations and ending with the model fine-tuned on HPatches with color and geometric transformations.
Despite the smaller number of parameters in Self-TM Base, increased accuracy was observed in each of the three tasks: patch verification, image matching, and patch retrieval. The best result, using a model trained on HPatches with color and geometric augmentations, shows an increase in accuracy of 2.96% (from 84.39 to 86.89) in patch verification, 15.68% (from 34.88 to 40.35) in image matching, and 5.61% (from 60.61 to 64.01) in patch retrieval.

6. Conclusions

The present work introduces a novel end-to-end approach, the Self-Supervised Foundation Model for Template Matching (Self-TM), a family of foundation models for precise template matching, image retrieval, dense template matching, and sparse image matching that uses hierarchical propagation of activations from the last to the first layer in an efficiently sized convolutional neural network.
Trained on ImageNet-1K, without any annotations, Self-TM provides robust out-of-domain data generalization capabilities involving challenging geometric augmentations. On the sparse image matching task, Self-TM Base significantly outperforms DINOv2 [40] by +19.6%/+6.7%/+0.3% (AUC@5°/10°/20°) on MegaDepth-1500 [47] and on ScanNet-1500 [48] by +57.1%/+20.5%/+6.2% (AUC@5°/10°/20°). On HPatches [46], Self-TM has competitive performance compared to methods using a supervised learning approach. When an increase in accuracy is required, the model is easily fine-tuned, outperforming the previous supervised results in the patch verification task on HPatches.
Methods used to augment training data in self-supervised models can limit their applicability by learning representations aligned to specific augmentations rather than general real-world features, which reduces the model’s ability to generalize to tasks or environments where such transformations are uncommon. Our experiments show no signs of overfitting to augmentation-induced patterns, which once again confirms the generalization properties of the Self-TM family.
In addition, the application of Self-TM can serve as an efficient component for visual object tracking. By leveraging its high CPU performance (Table 8), it can be integrated into existing frameworks such as BoT-SORT [65] or ByteTrack [66], acting as an encoder for feature extraction. Its robust generalization capabilities might be helpful to various medical imaging tasks, integrated as a vision encoder into multimodal frameworks [67,68,69,70]. Furthermore, our proposed hierarchical features can be applied to a hierarchical multi-modal alignment [71].

Author Contributions

Conceptualization, A.H.; methodology, A.H.; exploration, A.H., D.D. and M.N.-P.; formal analysis, A.H. and D.D.; writing—original draft preparation, A.H.; writing—review and editing, D.D., M.N.-P. and A.H.; visualization, A.H.; supervision, M.N.-P. and D.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in [46,47,48,56]. We shared source codes using the GitHub repository for this project: https://github.com/anhristov/self-tm (accessed on 5 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. Int. Conf. Mach. Learn. 2020, 1, 1597–1607. [Google Scholar] [CrossRef]
  2. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  3. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollar, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  4. Jang, J.; Kim, S.; Yoo, K.; Kong, C.; Kim, J.; Kwak, N. Self-Distilled Self-Supervised Representation Learning. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2828–2838. [Google Scholar] [CrossRef]
  5. Kalapos, A.; Gyires-Tóth, B. CNN-JEPA: Self-Supervised Pretraining Convolutional Neural Networks Using Joint Embedding Predictive Architecture. arXiv 2024. [Google Scholar] [CrossRef]
  6. Wang, Y.; Zhang, T.; Zhao, L.; Hu, L.; Wang, Z.; Niu, Z.; Cheng, P.; Chen, K.; Zeng, X.; Wang, Z.; et al. RingMo-Lite: A remote sensing lightweight network with CNN-Transformer hybrid framework. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–20. [Google Scholar] [CrossRef]
  7. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H.S. Fully-Convolutional Siamese networks for object tracking. In Computer Vision—ECCV 2016 Workshops. ECCV 2016; Lecture notes in computer science; Springer: Cham, Switzerland, 2016; pp. 850–865. [Google Scholar] [CrossRef]
  8. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with Siamese Region Proposal network. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  9. He, A.; Luo, C.; Tian, X.; Zeng, W. A twofold siamese network for real-time object tracking. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  10. Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; Torr, P.H.S. End-to-end representation learning for correlation filter based tracking. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  11. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. arXiv 2018. [Google Scholar] [CrossRef]
  12. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-Aware Siamese networks for visual object tracking. In Computer Vision—ECCV 2018. ECCV 2018; Lecture notes in computer science; Springer: Cham, Switzerland, 2018; pp. 103–119. [Google Scholar] [CrossRef]
  13. Fan, H.; Ling, H. Siamese cascaded region proposal networks for real-time visual tracking. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  14. Song, Y.; Ma, C.; Gong, L.; Zhang, J.; Lau, R.; Yang, M.-H. CREST: Convolutional Residual Learning for Visual Tracking. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  15. Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SIAMCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  16. Hu, W.; Wang, Q.; Zhang, L.; Bertinetto, L.; Torr, P.H.S. SiamMask: A framework for fast online object tracking and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3072–3089. [Google Scholar] [PubMed]
  17. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  18. Steiner, A.; Kolesnikov, A.; Zhai, X.; Wightman, R.; Uszkoreit, J.; Beyer, L. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. arXiv 2021. [Google Scholar] [CrossRef]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  21. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020. [Google Scholar] [CrossRef]
22. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning 2021, Virtual Event, 18–24 July 2021. [Google Scholar] [CrossRef]
  23. Hisham, M.B.; Yaakob, S.N.; Raof, R.A.A.; Nazren, A.B.A.; Wafi, N.M. Template matching using sum of squared difference and normalized cross correlation. In Proceedings of the IEEE Student Conference on Research and Development (SCOReD) 2015, Kuala Lumpur, Malaysia, 13–14 December 2015. [Google Scholar] [CrossRef]
  24. Niitsuma, H.; Maruyama, T. Sum of absolute difference implementations for image processing on FPGAs. In Proceedings of the 2010 International Conference on Field Programmable Logic and Applications, Milan, Italy, 31 August–2 September 2010; Volume 33, pp. 167–170. [Google Scholar] [CrossRef]
  25. Papageorgiou, C.P.; Oren, M.; Poggio, T. A general framework for object detection. In Proceedings of the Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), Bombay, India, 7 January 1998. [Google Scholar] [CrossRef]
  26. Di Stefano, L.; Mattoccia, S.; Tombari, F. ZNCC-based template matching using bounded partial correlation. Pattern Recognit. Lett. 2005, 26, 2129–2134. [Google Scholar] [CrossRef]
  27. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  28. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Computer Vision—ECCV 2006. ECCV 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar] [CrossRef]
  29. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the International Conference on Computer Vision 2011, Barcelona, Spain, 6–13 November 2011. [Google Scholar] [CrossRef]
  30. DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2018, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar] [CrossRef]
  31. Sarlin, P.-E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  32. Cuturi, M. Sinkhorn Distances: Lightspeed computation of optimal transport. Neural Inf. Process. Syst. 2013, 26, 2292–2300. [Google Scholar]
  33. Lindenberger, P.; Sarlin, P.-E.; Pollefeys, M. LightGlue: Local feature matching at light speed. In Proceedings of the International Conference on Computer Vision 2023, Paris, France, 1–6 October 2023. [Google Scholar] [CrossRef]
  34. Jiang, H.; Karpur, A.; Cao, B.; Huang, Q.; Araujo, A. OmniGlue: Generalizable feature matching with foundation model guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle, WA, USA, 16–22 June 2024; pp. 19865–19875. [Google Scholar] [CrossRef]
  35. Caron, M.; Touvron, H.; Misra, I.; Jegou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
  36. Chen, H.; Luo, Z.; Zhou, L.; Tian, Y.; Zhen, M.; Fang, T.; McKinnon, D.; Tsin, Y.; Quan, L. ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer. In European Conference on Computer Vision; Lecture notes in computer science; Springer: Cham, Switzerland, 2022; pp. 20–36. [Google Scholar] [CrossRef]
  37. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LOFTR: Detector-Free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
  38. Edstedt, J.; Athanasiadis, I.; Wadenbäck, M.; Felsberg, M. DKM: Dense Kernelized Feature Matching for geometry estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar] [CrossRef]
  39. Truong, P.; Danelljan, M.; Timofte, R.; Van Gool, L. PDC-NET+: Enhanced Probabilistic Dense Correspondence Network. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10247–10266. [Google Scholar] [CrossRef] [PubMed]
  40. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023. [Google Scholar] [CrossRef]
  41. Van Den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018. [Google Scholar] [CrossRef]
42. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 1–6 October 2023. [Google Scholar] [CrossRef]
  43. Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment anything in images and videos. arXiv 2024. [Google Scholar] [CrossRef]
  44. Wang, X.; Zhang, X.; Cao, Y.; Wang, W.; Shen, C.; Huang, T. SegGPT: Towards segmenting everything in context. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2023, Paris, France, 1–6 October 2023. [Google Scholar] [CrossRef]
  45. Wang, X.; Wang, W.; Cao, Y.; Shen, C.; Huang, T. Images speak in images: A generalist painter for in-context visual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar] [CrossRef]
  46. Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  47. Li, Z.; Snavely, N. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar] [CrossRef]
  48. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Niessner, M. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  49. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016. [Google Scholar] [CrossRef]
  50. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  51. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
  52. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning 2019, Long Beach, CA, USA, 9–15 June 2019. [Google Scholar] [CrossRef]
  53. Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollar, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  54. Tan, M.; Le, Q. EfficientNetV2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning 2021, Virtual, 18–24 July 2021. [Google Scholar]
  55. Kolesnikov, A.; Beyer, L.; Zhai, X.; Puigcerver, J.; Yung, J.; Gelly, S.; Houlsby, N. Big Transfer (BIT): General visual representation learning. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 491–507. [Google Scholar] [CrossRef]
56. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar] [CrossRef]
  57. Assran, M.; Duval, Q.; Misra, I.; Bojanowski, P.; Vincent, P.; Rabbat, M.; LeCun, Y.; Ballas, N. Self-Supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar] [CrossRef]
  58. Bardes, A.; Ponce, J.; Lecun, Y. VICRegL: Self-supervised learning of local visual features. Adv. Neural Inf. Process. Syst. 2022, 35, 8799–8810. [Google Scholar] [CrossRef]
  59. Bridle, J.S. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. Adv. Neural Inf. Process. Syst. 1989, 2, 211–217. [Google Scholar]
  60. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017. [Google Scholar] [CrossRef]
  61. Irshad, A.; Hafiz, R.; Ali, M.; Faisal, M.; Cho, Y.; Seo, J. Twin-Net descriptor: Twin negative mining with quad loss for Patch-Based matching. IEEE Access 2019, 7, 136062–136072. [Google Scholar] [CrossRef]
  62. Schonberger, J.L.; Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  63. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. In Readings in Computer Vision; Elsevier eBooks; Elsevier: Amsterdam, The Netherlands, 1987; pp. 726–740. [Google Scholar] [CrossRef]
  64. Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in Batch-Normalized models. In Proceedings of the Advances in Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar] [CrossRef]
  65. Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BOT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022. [Google Scholar] [CrossRef]
  66. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object tracking by associating every detection box. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar] [CrossRef]
  67. Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv 2020. [Google Scholar] [CrossRef]
  68. Huang, S.-C.; Shen, L.; Lungren, M.P.; Yeung, S. GLORIA: A multimodal Global-Local Representation learning framework for label-efficient medical image recognition. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3922–3931. [Google Scholar] [CrossRef]
  69. Wang, F.; Zhou, Y.; Wang, S.; Vardhanabhuti, V.; Yu, L. Multi-Granularity cross-modal alignment for generalized medical visual representation learning. arXiv 2022. [Google Scholar] [CrossRef]
  70. Liu, C.; Ouyang, C.; Cheng, S.; Shah, A.; Bai, W.; Arcucci, R. G2D: From global to Dense Radiography Representation Learning via Vision-Language Pre-training. arXiv 2023. [Google Scholar] [CrossRef]
  71. Liu, C.; Cheng, S.; Shi, M.; Shah, A.; Bai, W.; Arcucci, R. IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-Training. IEEE Trans. Med. Imaging 2025, 44, 519–529. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Illustration of Self-TM.
Figure 2. Illustration of a receptive field, RF_pred_pN, in layer N − 1 (in orange) of a detected maximum value, pred_pN, in layer N (in red).
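To make the backtracking in Figure 2 concrete, the sketch below maps an output coordinate of a stack of convolutions back to the input window (receptive field) it depends on, composing kernel size, stride, and padding per operation. It is an illustrative reconstruction only: the operation list follows the stage layout of Table 1b and is an assumption, not code from the paper.

```python
def receptive_window(out_idx, ops):
    """Map an output index (along one spatial axis) back to the inclusive
    [start, end] input window it depends on.

    ops: list of (kernel, stride, padding) tuples, ordered from input to output.
    Negative indices fall inside the zero-padding region.
    """
    start, end = out_idx, out_idx
    for kernel, stride, padding in reversed(ops):
        start = start * stride - padding
        end = end * stride - padding + (kernel - 1)
    return start, end

# Assumed operations between layer N-1 and layer N in the Table 1b layout:
# one 3x3 stride-3 downsampling conv followed by three 7x7 stride-1 pad-3 convs.
stage_ops = [(3, 3, 0)] + [(7, 1, 3)] * 3
print(receptive_window(0, stage_ops))  # receptive field of pred_p_N = 0 in layer N-1
```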
Figure 3. Visual representation of results on Hpatches (values, excluding those for Self-TM, are taken from Twin-Net [61]): (a) patch verification task; (b) image matching task; (c) patch retrieval task. The methods are grouped into the following groups: “handcrafted”, which were manually created by their authors; “supervised”, which used annotated data for their training; “self-supervised”, which did not use any annotations. A plus (+) denotes Self-TM models that are finetuned on the Hpatches dataset, and similarly (*) denotes variations of Tfear models.
Figure 4. Comparison of OmniGlue [34] (a) and OmniGlue + Self-TM Base (b) in finding keypoint matches in an image with an out-of-training-domain modality. For visualization purposes, matches with high "confidence" are not drawn, so that the errors remain visible. Correct matches are shown in green and incorrect matches in red.
Table 1. Detailed description of the Self-TM architecture: (a) available model sizes; (b) Self-TM architecture. It contains 3 layers, referred to for simplicity as "first", "middle", and "last". X_first, X_mid, and X_last denote the number of channels in the first, middle, and last layers, respectively, for each model size.
(a)
Size | X_first | X_mid | X_last | Number of Parameters
Self-TM Small | 128 | 256 | 512 | 13 M
Self-TM Base | 128 | 384 | 1024 | 40 M
Self-TM Large | 128 | 512 | 2048 | 130 M
(b)
Input Size | Layer Name | Layer Components | Output Size
3 × 189 × 189 | Downsampling | Conv2D 3 × 3, [X_first], stride 3; Norm | X_first × 63 × 63
X_first × 63 × 63 | ConvNeXt block | [Conv2D 7 × 7, [X_first], stride 1, pad 3; Norm; Linear [X_first, 512]; GELU; Linear [512, X_first]] × 3 | X_first × 63 × 63
X_first × 63 × 63 | Normalization | Norm | X_first × 63 × 63
X_first × 63 × 63 | Downsampling | Norm; Conv2D 3 × 3, [X_mid], stride 3 | X_mid × 21 × 21
X_mid × 21 × 21 | ConvNeXt block | [Conv2D 7 × 7, [X_mid], stride 1, pad 3; Norm; Linear [X_mid, 1024]; GELU; Linear [1024, X_mid]] × 9 | X_mid × 21 × 21
X_mid × 21 × 21 | Normalization | Norm | X_mid × 21 × 21
X_mid × 21 × 21 | Downsampling | Norm; Conv2D 3 × 3, [X_last], stride 3 | X_last × 7 × 7
X_last × 7 × 7 | ConvNeXt block | [Conv2D 7 × 7, [X_last], stride 1, pad 3; Norm; Linear [X_last, 2048]; GELU; Linear [2048, X_last]] × 3 | X_last × 7 × 7
X_last × 7 × 7 | Normalization | Norm | X_last × 7 × 7
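For readers who prefer code, the following is a minimal PyTorch sketch of the three-stage layout in Table 1b, instantiated with the Self-TM Base widths from Table 1a. Block internals (residual connections, channels-last LayerNorm) follow common ConvNeXt conventions [17] and are assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dimension of an NCHW tensor."""
    def forward(self, x):
        x = x.permute(0, 2, 3, 1)            # NCHW -> NHWC
        x = super().forward(x)
        return x.permute(0, 3, 1, 2)         # NHWC -> NCHW

class ConvNeXtBlock(nn.Module):
    """Depthwise 7x7 conv -> Norm -> Linear -> GELU -> Linear, with a residual."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, stride=1, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x).permute(0, 2, 3, 1)           # channels-last for Norm/Linear
        x = self.fc2(self.act(self.fc1(self.norm(x))))
        return shortcut + x.permute(0, 3, 1, 2)

class SelfTMBackbone(nn.Module):
    """Three-stage hierarchical encoder following Table 1b (Self-TM Base widths)."""
    def __init__(self, x_first=128, x_mid=384, x_last=1024):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, x_first, 3, stride=3), LayerNorm2d(x_first))
        self.stage1 = nn.Sequential(*[ConvNeXtBlock(x_first, 512) for _ in range(3)],
                                    LayerNorm2d(x_first))
        self.down1 = nn.Sequential(LayerNorm2d(x_first), nn.Conv2d(x_first, x_mid, 3, stride=3))
        self.stage2 = nn.Sequential(*[ConvNeXtBlock(x_mid, 1024) for _ in range(9)],
                                    LayerNorm2d(x_mid))
        self.down2 = nn.Sequential(LayerNorm2d(x_mid), nn.Conv2d(x_mid, x_last, 3, stride=3))
        self.stage3 = nn.Sequential(*[ConvNeXtBlock(x_last, 2048) for _ in range(3)],
                                    LayerNorm2d(x_last))

    def forward(self, x):
        f_first = self.stage1(self.stem(x))       # e.g., 128 x 63 x 63 for a 189 x 189 input
        f_mid = self.stage2(self.down1(f_first))  # e.g., 384 x 21 x 21
        f_last = self.stage3(self.down2(f_mid))   # e.g., 1024 x 7 x 7
        return f_first, f_mid, f_last

feats = SelfTMBackbone()(torch.randn(1, 3, 189, 189))
print([tuple(f.shape) for f in feats])  # [(1, 128, 63, 63), (1, 384, 21, 21), (1, 1024, 7, 7)]
```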
Table 2. Size comparison of Self-TM.
Model | Number of Parameters
Self-TM Small | 13 M
DeiT-S [22], ViT-S [18], and Swin-T [51] | 22–28 M
ConvNeXt-T [17] | 29 M
Self-TM Base | 40 M
ConvNeXt-S [17], Swin-S [51] | 50 M
EffNet-B7 [52], RegNetY-16G [53], DeiT-B [22], ViT-B [18], and Swin-B [51] | 66–88 M
ConvNeXt-B [17] | 89 M
EffNetV2-L [54] | 120 M
Self-TM Large | 130 M
ConvNeXt-L [17] | 198 M
ViT-L [18] | 304 M
ConvNeXt-XL [17] | 350 M
R-101x3 [55] and R-152x4 [55] | 388–937 M
Table 3. Visualization of random color augmentations on randomly cropped images from ImageNet-1K.
Image Crop | Color Augmentation
(Six example rows of image pairs, each showing an ImageNet-1K crop and its color-augmented version.)
Table 4. Visualization of random color and random geometric augmentations on random images from ImageNet-1K. For better visualization, the images are cropped at a 1:1 height-to-width ratio.
Image Crop | Color and Geometric Augmentation
(Six example rows of image pairs, each showing an ImageNet-1K crop and its color- and geometry-augmented version.)
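As an illustration of the kind of augmentation pipelines visualized in Tables 3 and 4, the sketch below builds a color-only pipeline and a color-plus-geometric pipeline with torchvision. The choice of transforms and every parameter range here are assumptions for illustration only, not the paper's exact settings.

```python
import torchvision.transforms as T
from PIL import Image

# Color-only augmentation (cf. Table 3): photometric changes, geometry untouched.
color_aug = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.RandomGrayscale(p=0.1),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])

# Color + geometric augmentation (cf. Table 4): adds affine and perspective warps,
# so the template's true position in the augmented view changes and must be tracked.
color_geom_aug = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.RandomAffine(degrees=20, translate=(0.1, 0.1), scale=(0.8, 1.2), shear=10),
    T.RandomPerspective(distortion_scale=0.3, p=0.5),
])

crop = Image.open("example_crop.jpg").convert("RGB")  # hypothetical input crop
augmented = color_geom_aug(crop)
```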
Table 5. ImageNet-1K test results of Self-TM models trained on ImageNet-1K Train with color or color-and-geometric augmentations. The displacements (in pixels) D_L_last, D_L_mid, and D_L_first were calculated from the last, middle, and first layers of the network, respectively.
Model Size | Applied Augmentation | D_L_first (pixels) | D_L_mid (pixels) | D_L_last (pixels)
Self-TM Small | color | 0.579 | 0.176 | 0.156
Self-TM Base | color | 0.577 | 0.173 | 0.156
Self-TM Large | color | 0.572 | 0.171 | 0.153
Self-TM Small | color and geometric | 2.214 | 0.767 | 0.409
Self-TM Base | color and geometric | 1.752 | 0.602 | 0.338
Self-TM Large | color and geometric | 1.331 | 0.452 | 0.273
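One plausible way to compute such a per-layer displacement is sketched below: cross-correlate the template features against the image features from a given layer, take the argmax, map it back to input-pixel coordinates through that layer's cumulative stride, and measure the distance to the ground-truth template position. This is an illustrative reconstruction of the metric under those assumptions, not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

def displacement_pixels(img_feat, tmpl_feat, stride, gt_xy):
    """Localize a template at one feature level and return the pixel displacement
    from the ground-truth top-left corner.

    img_feat:  (C, H, W) image feature map from one layer
    tmpl_feat: (C, h, w) template feature map from the same layer
    stride:    cumulative stride of that layer w.r.t. the input (3, 9, or 27 in Table 1b)
    gt_xy:     ground-truth (x, y) of the template's top-left corner, in input pixels
    """
    # Slide the template over the image; the template acts as the correlation kernel.
    score = F.conv2d(img_feat.unsqueeze(0), tmpl_feat.unsqueeze(0))  # (1, 1, H-h+1, W-w+1)
    score_w = score.shape[-1]
    y, x = divmod(int(score.flatten().argmax()), score_w)
    pred_xy = torch.tensor([x * stride, y * stride], dtype=torch.float32)
    return torch.dist(pred_xy, torch.tensor(gt_xy, dtype=torch.float32)).item()
```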
Table 6. Visualization of template localization in random images from the ImageNet-1K test set. Random color and geometric augmentations were applied to all the templates. The color legend of the rectangles is as follows: yellow indicates the template position; red indicates the calculated position obtained from the deepest layer; green indicates the calculated position from the middle layer; blue indicates the calculated position from the first layer.
Template | Result
(Four example rows of image pairs, each showing an augmented template and the localization result with the color-coded rectangles described above.)
Table 7. Comparison of the presented novel framework, OmniGlue + Self-TM, against OmniGlue [34] and various competing methods (requiring training). Results (excluding those of OmniGlue + Self-TM Small/Base/Large) are taken from [34]. The relative gain of OmniGlue + Self-TM over OmniGlue is shown in green color.
Category | Method | MegaDepth-1500 AUC@5°/10°/20° | ScanNet AUC@5°/10°/20°
Descriptors with hand-crafted rules | SIFT [27] + MNN | 25.8/41.5/54.2 | 1.7/4.8/10.3
 | SuperPoint [30] + MNN | 31.7/46.8/60.1 | 7.7/17.8/30.6
Sparse methods | SuperGlue [31] | 42.2/61.2/76.0 | 10.4/22.9/37.2
 | LightGlue [33] | 47.6/64.8/77.9 | 15.1/32.6/50.3
 | OmniGlue [34] | 47.4/65.0/77.8 | 14.0/28.9/44.3
 | OmniGlue + Self-TM Small | 48.2/64.7/73.8 | 15.8/29.4/43.4
 | — Relative gain (in %) over OmniGlue | +1.8/−0.4/−5.1 | +13.0/+1.8/−2.0
 | OmniGlue + Self-TM Base | 56.7/69.4/78.1 | 22.0/34.8/47.0
 | — Relative gain (in %) over OmniGlue | +19.6/+6.7/+0.3 | +57.1/+20.5/+6.2
 | OmniGlue + Self-TM Large | 59.8/70.6/78.4 | 26.6/37.7/48.4
 | — Relative gain (in %) over OmniGlue | +26.2/+8.7/+0.8 | +90.1/+30.3/+9.2
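The AUC@5°/10°/20° figures above follow the relative-pose evaluation protocol commonly used for MegaDepth [47] and ScanNet [48] matching benchmarks, e.g., by SuperGlue [31] and OmniGlue [34]: a per-pair pose error is computed, and the area under the recall-versus-error curve is reported up to each angular threshold. The sketch below is a generic implementation of that metric for illustration; it is not taken from the paper.

```python
import numpy as np

def pose_auc(errors_deg, thresholds_deg=(5, 10, 20)):
    """Area under the recall curve of pose errors, one AUC value per threshold.

    errors_deg: 1-D array of per-pair pose errors in degrees (typically the
                maximum of the rotation and translation-direction errors).
    Returns a list of AUC values in [0, 1].
    """
    errors = np.sort(np.asarray(errors_deg, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    # Prepend the origin so the recall curve starts at (0, 0).
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds_deg:
        last = np.searchsorted(errors, t)            # first index with error >= t
        x = np.concatenate((errors[:last], [t]))     # clip the curve at the threshold
        y = np.concatenate((recall[:last], [recall[last - 1]]))
        area = np.sum(np.diff(x) * (y[:-1] + y[1:]) / 2.0)  # trapezoidal integration
        aucs.append(area / t)                        # normalize so a perfect method scores 1
    return aucs

print(pose_auc([1.2, 3.5, 8.0, 40.0]))  # toy example, not data from the paper
```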
Table 8. Inference speed and architecture size comparison of DINOv2 and Self-TM. The experiment was performed with input images of various resolutions (multiples of 14 due to a DINOv2 limitation) on an Intel Xeon Gold 5222 processor (3.80 GHz); no graphics card was used.
Model | Number of Parameters | Inference Speed (364 × 238 Pixels) | Inference Speed (742 × 490 Pixels) | Inference Speed (1498 × 994 Pixels)
Self-TM (Small) | 13 M | 212 ms | 659 ms | 2481 ms
Self-TM (Base) | 40 M | 244 ms | 914 ms | 3432 ms
DINOv2 (ViT-14-base) | 87 M | 445 ms | 3065 ms | 38,709 ms
Self-TM (Large) | 130 M | 377 ms | 1268 ms | 4706 ms
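A CPU inference benchmark of this kind can be reproduced with a simple timing loop such as the one below; the warm-up pass, repetition count, and the reference to the illustrative SelfTMBackbone class sketched after Table 1 are assumptions, not the paper's exact measurement script.

```python
import time
import torch

def cpu_inference_ms(model, height, width, repeats=10):
    """Average forward-pass latency (ms) for a single RGB image on CPU."""
    model.eval()
    x = torch.randn(1, 3, height, width)
    with torch.no_grad():
        model(x)                           # warm-up pass, excluded from timing
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
        elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / repeats

# Example with the illustrative backbone from the sketch after Table 1:
# print(cpu_inference_ms(SelfTMBackbone(), 238, 364))
```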
Table 9. Results of trained Self-TM Base models on HPatches [46] using different combinations of training data (ImageNet-1k [56] and HPatches [46]) and applied augmentations (color, and color and geometric).
Model | Exp. No | Initial Weights | Dataset | Augmentations | Patch Verification mAP % | Image Matching mAP % | Patch Retrieval mAP %
Self-TM Base | 1 | Random init | HPatches | color | 64.15 | 8.32 | 25.74
 | 2 | Random init | HPatches | color and geometric | 65.04 | 9.95 | 28.86
 | 3 | HPatches (color) | HPatches | color and geometric | 70.07 | 11.14 | 30.42
 | 4 | Random init | ImageNet | color | 65.19 | 21.33 | 37.01
 | 5 | ImageNet (color) | ImageNet | color and geometric | 85.90 | 38.77 | 61.67
 | 6 | ImageNet (color) | HPatches | color | 66.09 | 21.97 | 37.79
 | 7 | ImageNet (color) | HPatches | color and geometric | 78.97 | 29.85 | 50.30
 | 8 | ImageNet (color and geometric) | HPatches | color and geometric | 86.89 | 40.35 | 64.01
Table 10. Comparison of Self-TM Base and ConvNeXt-S [17] trained on ImageNet-1k [56], with various augmentations (color, and color and geometric) applied. The D_L_first and D_L_last displacements (in pixels) were calculated using the first and last layers, respectively. The Self-TM Base's relative gain over ConvNeXt-S is shown in green color.
Model | Number of Parameters | Dataset | Augmentations | D_L_first (pixels) | D_L_last (pixels)
ConvNeXt-S [17] | 50 M | ImageNet | color | 1.119 | 0.418
 | | ImageNet | color and geometric | 2.515 | 0.542
Self-TM Base (relative gain over ConvNeXt-S [17]) | 40 M (−20.00%) | ImageNet | color | 0.577 (−48.44%) | 0.156 (−62.68%)
 | | ImageNet | color and geometric | 1.752 (−30.39%) | 0.338 (−37.64%)
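For clarity, each relative-gain entry in Table 10 is the percentage change of the Self-TM Base value with respect to the corresponding ConvNeXt-S value; for example, the −48.44% entry for D_L_first under color augmentation follows from:

```latex
\frac{0.577 - 1.119}{1.119} \times 100\% \approx -48.44\%
```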
Table 11. Self-TM Base and ConvNeXt-S [17] training results using different combinations of training data (ImageNet-1k [56] and HPatches [46]) and applied augmentations (color, and color and geometric). The Self-TM Base's relative gain over ConvNeXt-S is shown in green color.
Model | Initial Weights | Dataset | Augmentations | Patch Verification mAP % | Image Matching mAP % | Patch Retrieval mAP %
ConvNeXt-S [17] | Random init | ImageNet | color | 63.00 | 16.99 | 33.49
 | ImageNet (color) | ImageNet | color and geometric | 83.31 | 32.94 | 58.42
 | ImageNet (color and geometric) | HPatches | color and geometric | 84.39 | 34.88 | 60.61
Self-TM Base (relative gain over ConvNeXt-S [17]) | Random init | ImageNet | color | 65.19 (+3.48%) | 21.33 (+25.54%) | 37.01 (+10.51%)
 | ImageNet (color) | ImageNet | color and geometric | 85.90 (+3.11%) | 38.77 (+17.70%) | 61.67 (+5.56%)
 | ImageNet (color and geometric) | HPatches | color and geometric | 86.89 (+2.96%) | 40.35 (+15.68%) | 64.01 (+5.61%)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
