1 Introduction

Social media feed us every day with an unprecedented amount of visual data. Images are uploaded by various actors, from corporations to political parties, institutions, entrepreneurs and private citizens, with roughly \(10^2M\) unique images shared every day on Twitter, Facebook and Instagram. For the sake of freedom of expression, control over their content is limited, and the vast majority of them are uploaded without any textual description of their content. Their sheer magnitude makes it imperative to use algorithms to monitor and make sense of them, finding the right balance between protecting the privacy of citizens and their right of expression, and tracking fake news (often associated with malicious intentions) while fighting illegal and hate content. In most cases this boils down to the ability to automatically associate as many tags as possible with images, which in turn means determining which objects are present in a scene.

Object detection has been widely investigated since the infancy of computer vision [11, 47] and continues to attract considerable attention in the current deep learning era [10, 19, 30, 52]. Most of the algorithms assume that training and test data come from the same visual domain [18, 19, 40]. Recently, some authors have started to investigate the more challenging yet realistic scenario where the detector is trained on data from a visual source domain, and deployed at test time in a different target domain [32, 33, 44, 46]. This setting is usually referred to as cross-domain detection and heavily relies on concepts and results from the domain adaptation literature [14, 20, 32]. Specifically, it inherits the standard transductive logic, according to which unsupervised target data is available at training time together with annotated source data, and can be used to adapt across domains. This approach is neither suitable nor effective for monitoring social media feeds. Consider for instance the scenario depicted in Fig. 1, where there is an incoming stream of images from various social media and the detector is asked to look for instances of the class bicycle. The images come continuously, but they are produced by different users that share them on different social platforms. Hence, even though they might contain the same object, each of them has been acquired by a different person, in a different context, under different viewpoints and illuminations. In other words, each image comes from a different visual domain, distinct from the visual domain where the detector has been trained. This poses two key challenges to current cross-domain detectors: (1) to adapt to the target data, these algorithms first need to gather feeds, and only after enough target data has been collected can they learn to adapt and start performing on the incoming images; (2) even if the algorithms have learned to adapt on target images from the feed up to time t, there is no guarantee that the images that will arrive from time \(t+1\) will come from the same target domain.

This is the scenario we address. We focus on cross-domain detection when only one target sample is available for adaptation, without any form of supervision. We propose an object detection method able to adapt from one target image, hence suitable for the social media scenario described above. Specifically, we build a multi-task deep architecture that adapts across domains by leveraging a pretext task. This auxiliary knowledge is further guided by a cross-task pseudo-labeling that injects the locality specific to object detection into self-supervised learning. The result is an architecture able to perform unsupervised adaptive object detection from a single image. Extensive experiments show the power of our method compared to previous state-of-the-art approaches. To summarize, the contributions of our paper are as follows:

  1. We introduce the One-Shot Unsupervised Cross-Domain Detection setting, a cross-domain detection scenario where the target domain changes from sample to sample, hence adaptation can be learned only from one image. This scenario is especially relevant for monitoring social media image feeds. We are not aware of previous works addressing it.

  2. We propose OSHOT, the first cross-domain object detector able to perform one-shot unsupervised adaptation. Our approach leverages self-supervised one-shot learning guided by a cross-task pseudo-labeling procedure, embedded into a multi-task architecture. A thorough ablation study showcases the importance of each component.

  3. We present a new experimental setup for studying one-shot unsupervised cross-domain adaptation, designed on three existing databases plus a new test set collected from social media feeds. We compare against recent algorithms in cross-domain adaptive detection [28, 42] and one-shot unsupervised learning [8], achieving state-of-the-art results.

We make the code of our project available at https://github.com/VeloDC/oshot_detection.

Fig. 1. Each social media image comes from a different domain. Existing Cross-Domain Detection algorithms (e.g. [28] in the left gray box) struggle to adapt in this setting. OSHOT (right) is able to adapt across domains from one single target image, thanks to the combined use of self-supervision and pseudo-labeling.

2 Related Work

Object Detection. Many successful object detection approaches have been developed during the past several years, starting from the original sliding window methods based on handcrafted features, up to the most recent deep-learning empowered solutions. Modern detectors can be divided into one-stage and two-stage techniques. In the former, classification and bounding box prediction are performed on the convolutional feature map, either solving a regression problem on grid cells [39], or exploiting anchor boxes at different scales and aspect ratios [31]. In the latter, an initial stage deals with the region proposal process and is followed by a refinement stage that adjusts the coarse region localization and classifies the box content. Existing variants of this strategy differ mainly in the region proposal algorithm [18, 19, 40]. Regardless of the specific implementation, the detector robustness across visual domains remains a major issue.

Cross-Domain Detection. When training and test data are drawn from two different distributions, a model learned on the first is doomed to fail on the second. Unsupervised domain adaptation methods attempt to close the domain gap between the annotated source on which learning is performed, and the target samples on which the model is deployed. Most of the literature has focused on object classification with solutions based on feature alignment [2, 32, 33, 44] or adversarial approaches [15, 46]. GAN-based methods make it possible to update the visual style of the annotated source data and reduce the domain shift directly at pixel level [23, 41]. Adaptive detection methods have been developed only in the last two years, and they build on three main components: (i) multiple and increasingly accurate feature alignment modules at different internal stages, (ii) a preliminary pixel-level adaptation and (iii) pseudo-labeling. The last one is also known as self-training and consists in using the output of the source model detector as coarse annotation on the target. The importance of considering both global and local domain adaptation, together with a consistency regularizer to bridge the two, was first highlighted in [7]. The Strong-Weak (SW) method of [42] improves over the previous one, pointing out the need for a better balanced alignment with strong global and weak local adaptation. It was also further extended by [49], where the adaptive steps are multiplied at different depths in the network. By generating new source images that look like those of the target, the Domain-Transfer (DT, [25]) method was the first to adopt pixel adaptation for object detection and combine it with pseudo-labeling. More recently the Div-Match approach [28] re-elaborated the idea of domain randomization [45]: multiple CycleGAN [53] applications with different constraints produce three extra source variants with which the target can be aligned to different extents through an adversarial multi-domain discriminator. A weak self-training procedure (WST) to reduce false negatives is combined with adversarial background score regularization (BSR) in [27]. Finally, [26] followed the pseudo-labeling strategy, including an approach to deal with noisy annotations.

Adaptive Learning on a Budget. There is a wide literature on learning from a limited amount of data, both for classification and detection. However, in case of domain shift, learning on a target budget becomes extremely challenging. Indeed, the standard assumption for adaptive learning is that a large amount of unsupervised target samples are available at training time, so that a source model can capture the target domain style from them and adapt to it.

Only a few attempts have been made to reduce the target cardinality. In [36] the considered setting is that of few-shot supervised domain adaptation: only a few target samples are available but they are fully labeled. In [3, 8] the focus is on one-shot unsupervised style transfer with a large source dataset and a single unsupervised target image. These works propose time-costly autoencoder-based methods to generate a version of the target image that maintains its content, but visually resembles the source in its global appearance. Thus the goal is image generation with no discriminative purpose. A related setting is that of online domain adaptation, where unsupervised target samples are initially scarce but accumulate over time [22, 34, 48]. In this case target samples belong to a continuous data stream with smoothly changing domains, so the coherence among subsequent samples can be exploited for adaptation.

Self-supervised Learning. Despite not being manually annotated, unsupervised data is rich in structural information that can be learned by self-supervision, i.e. hiding a subpart of the data information and then trying to recover it. This procedure is generally referred to as a pretext task, and possible examples are image completion [38], colorization [29, 51], relative position of patches [12, 37], rotation recognition [17] and many more. Self-supervised learning has been extensively used as an initialization step for scarcely annotated supervised learning settings, and very recently [1] has shown with a thorough analysis the potential of self-supervised learning from a single image. Recent works also indicated that self-supervision supports adaptation and generalization when combined with supervised learning in a multi-task framework [4, 6, 50].

Our approach for cross-domain detection relates to the described scenario of learning on a budget and exploits self-supervised learning to perform one-shot unsupervised adaptation. Specifically with OSHOT we show how to recognize objects and their location on a single target image starting from a pre-trained source model, thus without the need of accessing the source data during testing.

3 Method

Problem Setting. We introduce the one-shot unsupervised cross-domain detection scenario where our goal is to predict on a single image \(x^t\), with t being any target domain not available at training time, starting from N annotated samples of the source domain \(S=\{x^s_{i},y^s_{i}\}_{i=1}^N\). Here the structured labels \(y^s=(c,b)\) describe class identity c and bounding box location b in each image \(x^s\), and we aim to obtain \(y^t\) that precisely detects objects in \(x^t\) despite the domain shift.

OSHOT Strategy. To pursue the described goal, our strategy is to train the parameters of a detection model such that it is ready to reach maximal performance on a single unsupervised sample from a new domain after a few gradient update steps on it. Since we have no ground truth on the target sample, we implement this strategy by learning a representation that exploits inherent data information, such as that captured by a self-supervised task, and then finetune it on the target sample (see Fig. 2). Thus, we design our OSHOT to include (1) an initial pretraining phase where we extend a standard deep detection model by adding an image rotation classifier, and (2) a following adaptation stage where the network features are updated on the single target sample by further optimization of the rotation objective. Moreover, we exploit pseudo-labeling to focus the auxiliary task on the local object context. A clear advantage of this solution is that we decouple source training from target testing, with no need to access the source data while adapting on the target sample.

Fig. 2. Visualization of the adaptive phase of OSHOT with cross-task pseudo-labeling. The target image passes through the network and produces detections. While the class information is not used, the identified boxes are exploited to select object regions from the feature maps of the rotated image. The obtained region-specific feature vectors are finally sent to the rotation classifier. A number of subsequent finetuning iterations adapts the convolutional backbone to the domain represented by the test image.

Preliminaries. We build on Faster R-CNN [40] as our base detection model. It is a two-stage detector with three main components: an initial block of convolutional layers, a region proposal network (RPN) and a region-of-interest (ROI) based classifier. The bottom layers transform any input image x into its convolutional feature map \(G_{f}(x|\theta _{f})\) where \(\theta _{f}\) parametrizes the feature extraction model. The feature map is then used by the RPN to generate candidate object proposals. Finally the ROI-wise classifier predicts the category label from the feature vector obtained using ROI-pooling. The training objective combines the losses of both RPN and ROI, each of them composed of two terms:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{d}(G_{d}(G_{f}(x|\theta _{f})|\theta _{d}), y)=&\big (\mathcal {L}_{class}(c^*) + \mathcal {L}_{regr}(b) \big )_{RPN} + \\&\big ( \mathcal {L}_{class}(c) + \mathcal {L}_{regr}(b) \big )_{ROI}~. \end{aligned} \end{aligned}$$
(1)

Here \(\mathcal {L}_{class}\) is a classification loss to evaluate the object recognition accuracy, while \(\mathcal {L}_{regr}\) is a regression loss on the box coordinates for better localization. To maintain a simple notation we summarize the role of ROI and RPN with the function \(G_{d}(G_{f}(x|\theta _{f})|\theta _{d})\) parametrized by \(\theta _{d}\). Moreover, we use \(c^*\) to highlight that RPN deals with a binary classification task to separate foreground and background objects, while ROI deals with the multi-class objective needed to discriminate among c foreground object categories. As mentioned above, ROI and RPN are applied in sequence: they both elaborate on the feature maps produced by the convolutional block, and then influence each other in the final optimization of the multi-task (classification, regression) objective function.

OSHOT Pretraining. As a first step, we extend Faster R-CNN to include image rotation recognition. Formally, to each source training image \(x^s\) we apply four geometric transformations \(R(x,\alpha )\) where \(\alpha = q\times 90^{\circ }\) indicates rotations with \(q \in \{1,\ldots ,4\}\). In this way we obtain a new set of samples \(\{R(x)_j, q_j\}_{j=1}^M\) where we dropped the \(\alpha \) without loss of generality. We indicate the auxiliary rotation classifier and its parameters respectively with \(G_{r}\) and \(\theta _{r}\) and we train our network to optimize the following multi-task objective

$$\begin{aligned} \begin{aligned} {\mathop {\hbox {argmin}}\limits _{\theta _{f}, \theta _{d}, \theta _{r}}}&\sum _{i=1}^N\mathcal {L}_{d}(G_{d}(G_{f}(x^s_i|\theta _{f})|\theta _{d}),y^s_i) + \lambda \sum _{j=1}^M\mathcal {L}_{r}(G_{r}(G_{f}(R(x^s)_j|\theta _{f})|\theta _{r}), q^s_j), \end{aligned} \end{aligned}$$
(2)

where \(\mathcal {L}_{r}\) is the cross-entropy loss. When solving this problem, we can design \(G_{r}\) in two different ways. It can either be a Fully Connected layer that naïvely takes as input the feature map produced by the whole (rotated) image, \(G_{r}(\cdot |\theta _r) = \text{ FC}_{\theta _r}(\cdot )\), or it can exploit the ground truth location of each object by selecting only the features inside its bounding box in the original map, \(G_{r}(\cdot |\theta _r) = \text{ FC}_{\theta _r}(boxcrop(\cdot ))\). The boxcrop operation includes pooling to rescale the feature dimension before entering the final FC layer. In this last case the network is encouraged to focus only on the object orientation without introducing noisy information from the background, and it provides better results than the whole-image option, as we will discuss in Sect. 4.4. In practical terms, for both image and box rotations, we randomly pick one rotation angle per instance rather than considering all four of them: this avoids any troublesome imbalance between rotated and non-rotated data when solving the multi-task optimization problem.
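The rotation sampling and the weighted objective of Eq. (2) can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the image is a plain nested list, the counter-clockwise direction of the rotation is an assumption (the text only fixes \(\alpha = q\times 90^{\circ }\)), and the helper names `rot90_ccw`, `rotate`, `sample_rotation` and `multitask_loss` are hypothetical.

```python
import random


def rot90_ccw(img):
    """Rotate a 2-D grid (list of rows) by 90 degrees counter-clockwise."""
    # Transposing and then reversing the row order yields a quarter turn.
    return [list(row) for row in zip(*img)][::-1]


def rotate(img, q):
    """R(x, alpha) with alpha = q * 90 degrees; q = 4 gives back the original."""
    out = [list(row) for row in img]
    for _ in range(q % 4):
        out = rot90_ccw(out)
    return out


def sample_rotation(img, rng=random):
    """Pick a single rotation label q in {1, ..., 4} for this instance."""
    q = rng.randint(1, 4)
    return rotate(img, q), q


def multitask_loss(detection_loss, rotation_loss, lam=0.05):
    """Detection loss plus the rotation loss weighted by lambda, as in Eq. (2)."""
    return detection_loss + lam * rotation_loss
```

Sampling one label per instance, instead of emitting all four rotated copies, keeps one auxiliary sample for each detection sample, which is the balance argument made in the text.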

OSHOT Adaptation. Given the single target image \(x^t\), we finetune the backbone’s parameters \(\theta _{f}\) by iteratively solving a self-supervised task on it. This adapts the original feature representation both to the content and to the style of the new sample. Specifically, we start from the rotated versions \(R(x^t)\) of the provided sample and optimize the rotation classifier through

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{\theta _{f}, \theta _{r}}} \, \mathcal {L}_{r}(G_{r}(G_{f}(R(x^t)|\theta _{f})|\theta _{r}),q^t)~. \end{aligned}$$
(3)

This process involves only \(G_{f}\) and \(G_{r}\), while the RPN and ROI detection components described by \(G_{d}\) remain unchanged. In the following we use \(\gamma \) to indicate the number of gradient steps (i.e. iterations), with \(\gamma =0\) corresponding to the OSHOT pretraining phase. At the end of the finetuning process, the inner feature model is described by \(\theta ^*_f\) and the detection prediction on \(x^t\) is obtained by \(y^{t*} = G_{d}(G_{f}(x^t|\theta ^*_{f})|\theta _{d})\).
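The control flow of this adaptation stage can be sketched as follows. This is only a schematic under stated assumptions: the parameter container is a plain dictionary with hypothetical keys `theta_f`, `theta_d` and `theta_r`, and `rotation_step` and `detect` are placeholder callables standing in for one gradient step on Eq. (3) and for the detection forward pass; none of these names come from the released code.

```python
import copy


def adapt_and_detect(pretrained, x_t, rotation_step, detect, gamma=30):
    """Finetune on one target image for gamma steps, then predict on it.

    pretrained    : dict with keys 'theta_f', 'theta_d', 'theta_r' (assumed layout)
    rotation_step : callable (theta_f, theta_r, x_t) -> (theta_f, theta_r),
                    a stand-in for one gradient step on the rotation loss
    detect        : callable (theta_f, theta_d, x_t) -> prediction
    """
    # Work on a copy so that every target image starts from the same
    # pretrained model and adaptation never leaks across samples.
    params = copy.deepcopy(pretrained)
    for _ in range(gamma):
        # Only the backbone and the rotation head are updated;
        # the detection parameters theta_d are left untouched.
        params["theta_f"], params["theta_r"] = rotation_step(
            params["theta_f"], params["theta_r"], x_t
        )
    # gamma = 0 skips the loop and falls back to the pretrained model.
    return detect(params["theta_f"], params["theta_d"], x_t)
```

Copying the pretrained parameters for each image reflects the setting in which every target sample may belong to a different domain, so nothing learned on one sample is carried over to the next.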

Cross-Task Pseudo-labeling. As in the pretraining phase, also in the adaptation stage we have two possible choices to design \(G_{r}\): either considering the whole feature map \(G_{r}(\cdot |\theta _r) = \text{ FC}_{\theta _r}(\cdot )\), or focusing on the object locations \(G_{r}(\cdot |\theta _r) = \text{ FC}_{\theta _r}(pseudoboxcrop(\cdot ))\). For both variants we include dropout to prevent overfitting on the single target sample. With pseudoboxcrop we mean a localized feature extraction operation analogous to that discussed for pretraining, but obtained through a particular form of cross-task self-training. Specifically, we follow the self-training strategy used in [25, 27] with a cross-task variant: instead of reusing the pseudo-labels produced by the source model on the target to update the detector, we exploit them for the self-supervised rotation classifier. In this way we keep the advantage of the self-training initialization, largely reducing the risks of error propagation due to wrong class pseudo-labels.

More practically, we start from the \((\theta _{f},\theta _{d})\) model parameters of the pretraining stage and we get the feature maps from all the rotated versions of the target sample \(G_{f}(\{R(x^t),q\}|\theta _{f})\), \(q\in \{1,\ldots ,4\}\). Only the feature map produced by the original image (i.e. \(q=4\)) is provided as input to the RPN and ROI network components to get the predicted detection \(y^{t}=(c,b)=G_{d}(G_{f}(x^t|\theta _{f})|\theta _{d})\). This pseudo-label is composed of the class label c and the bounding box location b. We discard the first and consider only the second to localize the region containing an object in all four feature maps, also recalibrating the position to compensate for the orientation of each map. Once passed through this pseudoboxcrop operation, the obtained features are used to finetune the rotation classifier, updating the bottom convolutional network block.
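The recalibration of a pseudo-labeled box for each rotated map can be made concrete with a small coordinate sketch. It assumes boxes in `(x1, y1, x2, y2)` format with the origin at the top-left corner, counter-clockwise rotations, and coordinates expressed in the same units as the map size; the function names are hypothetical, and scaling by the feature stride is left out.

```python
def rotate_box_ccw(box, width, height):
    """Map a box through one 90-degree counter-clockwise rotation.

    A point (x, y) in a width x height map lands at (y, width - x)
    in the rotated map, whose size is height x width.
    """
    x1, y1, x2, y2 = box
    new_box = (y1, width - x2, y2, width - x1)
    return new_box, height, width  # box, new width, new height


def rotate_box(box, width, height, q):
    """Apply q quarter turns; q = 4 returns the original box and size."""
    for _ in range(q % 4):
        box, width, height = rotate_box_ccw(box, width, height)
    return box, width, height


# Example: one detected box, relocated in each of the four rotated maps.
pseudo_box = (10, 20, 30, 40)
crops = {q: rotate_box(pseudo_box, 100, 50, q)[0] for q in (1, 2, 3, 4)}
```

Each entry of `crops` is the region that would be pooled from the corresponding rotated feature map before entering the rotation classifier.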

4 Experiments

4.1 Datasets

Real-World (VOC). Pascal-VOC [13] is the standard real-world image dataset for object detection benchmarks. VOC2007 and VOC2012 both contain bounding boxes annotations of 20 common categories. VOC2007 has 5011 images in the train-val split and 4952 images in the test split, while VOC2012 contains 11540 images in the train-val split.

Artistic Media Datasets (AMD). Clipart1k, Comic2k and Watercolor2k [25] are three object detection datasets designed for benchmarking Domain Adaptation methods when the source domain is Pascal-VOC. Clipart1k shares its 20 categories with Pascal-VOC: it has 500 images in the training set and 500 images in the test set. Comic2k and Watercolor2k both have the same 6 classes (a subset of the 20 classes of Pascal-VOC), and 1000-1000 images in the training-test splits each.

Cityscapes [9] is an urban street scene dataset with pixel level annotations of 8 categories. It has 2975 and 500 images respectively in the training and validation splits. We use the instance level pixel annotations to generate bounding boxes of objects, as in [7].

Foggy Cityscapes [43] is obtained by adding different levels of synthetic fog to Cityscapes images. We only consider images with the highest amount of artificial fog, thus training-validation splits have 2975-500 images respectively.

KITTI [16] is a dataset of images depicting several driving urban scenarios. By following [7], we use the full 7481 images for both training (when used as source) and evaluation (when used as target).

Fig. 3. The Social Bikes concept-dataset. A random data acquisition from multiple users/feeds leads to a target distribution with several, uneven domain shifts.

Social Bikes is our new concept-dataset containing 30 images of scenes with persons/bicycles collected from Twitter, Instagram and Facebook by searching for #bike tags. Square crops of the full dataset are presented in Fig. 3: images acquired randomly from social feeds show diverse style properties and cannot be grouped under a single shared domain.

4.2 Performance Analysis

Experimental Setup. We evaluate OSHOT on several testbeds using the described datasets. In the following we will use an arrow \(Source \rightarrow Target\) to indicate the experimental setting. Our base detector is Faster-RCNN [35] with a ResNet-50 [21] backbone pre-trained on ImageNet, RPN with 300 top proposals after non-maximum suppression, anchors at three scales (128, 256, 512) and three aspect ratios (1:1, 1:2, 2:1). For all our experiments we set the IoU threshold at 0.5 for the mAP results, and report the average of three independent runs.
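As an illustration of the anchor configuration, the nine anchor shapes implied by three scales and three aspect ratios can be enumerated as below. The sketch assumes the common Faster R-CNN convention in which each anchor keeps an area of scale squared and the ratio is height over width; the text does not state which convention is used, and `anchor_shapes` is a hypothetical helper.

```python
import math


def anchor_shapes(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (width, height) pairs, one per scale/ratio combination.

    Assumes ratio = height / width and area = scale ** 2.
    """
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            shapes.append((w, h))
    return shapes


for w, h in anchor_shapes():
    print(f"{w:7.1f} x {h:7.1f}")
```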

OSHOT Pretraining. We always resize the image’s shorter side to 600 pixels and apply random horizontal flipping. Unless otherwise specified, we train the base network for 70k iterations using SGD with momentum set at 0.9; the initial learning rate is 0.001 and decays after 50k iterations. We use a batch size of 1, keep batch normalization layers fixed for both pretraining and adaptation phases and freeze the first 2 blocks of ResNet50. The weight of the auxiliary task is set to \(\lambda =0.05\).
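The preprocessing and schedule described above can be summarized in a short sketch. Two points are assumptions rather than facts from the text: the resize keeps the aspect ratio, and the learning-rate decay is a single multiplicative step whose factor is not reported, so `decay_factor` is a required argument instead of a default. Both function names are hypothetical.

```python
def resize_shorter_side(width, height, target=600):
    """Scale an image size so that its shorter side equals `target` pixels."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale), scale


def learning_rate(iteration, decay_factor, base_lr=0.001, step=50_000):
    """Step schedule: base_lr before `step` iterations, decayed afterwards.

    decay_factor is NOT given in the text and must be supplied by the caller.
    """
    return base_lr if iteration < step else base_lr * decay_factor
```

For instance, a 500 x 375 image is resized to 800 x 600 under this rule.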

OSHOT Adaptation. We increase the weight of the auxiliary task to \(\lambda =0.2\) to speed up adaptation and keep all other training hyperparameters fixed. For each test instance, we finetune the initial model on the auxiliary task for 30 iterations before testing.

Benchmark Methods. We compare OSHOT with the following algorithms. FRCNN: baseline Faster-RCNN with ResNet50 backbone, trained on the source domain and deployed on the target without further adaptation. DivMatch [28]: cross-domain detection algorithm that, by exploiting target data, creates multiple randomized domains via CycleGAN and aligns their representations using an adversarial loss. SW [42]: adaptive detection algorithm that aligns source and target features based on global context similarity. For both DivMatch and SW, we use a ResNet-50 backbone pretrained on ImageNet for fair comparison. Since all cross-domain algorithms need target data in advance and are not designed to work in our one-shot unsupervised setting, we provide them with the advantage of 10 target images accessible during training and randomly selected at each run. We collect average precision statistics during inference under the favorable assumption that the target domain will not shift after deployment.

Table 1. (left) VOC \(\rightarrow \) Social Bikes mAP results; (right) visualization of DivMatch and OSHOT detections. The number associated with each bounding box indicates the model’s confidence in localization. Examples show how OSHOT detection is accurate, while most DivMatch boxes are false positives

Adapting to Social Feeds. When data is collected from multiple sources, the assumption that all target images originate from the same underlying distribution does not hold and standard cross-domain detection methods are penalized regardless of the number of seen target samples. We pretrain the source detector on Pascal VOC, and deploy it on Social Bikes. We consider only the bicycle and person annotations for this target, since all other instances of VOC classes are scarce. We report results in Table 1. OSHOT outperforms all considered competitors, with a mAP score of 64.4. Despite being granted access to the full target, adaptive algorithms incur negative transfer due to data scarcity and the large variety of target styles.

Large Distribution Shifts. Artistic images are difficult benchmarks for cross-domain methods. Unpredictable perturbations in shape and color are challenging to detectors trained only on realistic images. We investigate this setting by training the source detector on Pascal VOC and deploying it on the Clipart, Comic and Watercolor datasets. Table 2 summarizes results on the three adaptation splits. We can see how OSHOT with 30 finetuning iterations outperforms all competitors, with mAP gains ranging from 7.5 points on Clipart to 9.2 points on Watercolor. Cross-domain detection methods perform poorly in this setting, despite using 9 more samples in the adaptation phase compared to OSHOT, which only uses the test sample. These results confirm that they are not designed to tackle data scarcity conditions and exhibit negligible improvements compared to the baseline.

Table 2. mAP results for VOC \(\rightarrow \) AMD
Table 3. mAP results for Cityscapes \(\rightarrow \) FoggyCityscapes

Adverse Weather. Some peculiar environmental conditions, such as fog, may be disregarded in source data acquisition, yet adaptation to these circumstances is crucial for real world applications. We assess the performance of OSHOT on Cityscapes \(\rightarrow \) FoggyCityscapes. We train our base detector on Cityscapes for 30k iterations without stepdown, as in [5]. We select the best performing model on the Cityscapes validation split and deploy it to FoggyCityscapes. Experimental evaluation in Table 3 shows that OSHOT outperforms all compared approaches. Without finetuning iterations, performance using the auxiliary rotation task increases compared to the baseline. Subsequent finetuning iterations on the target sample improve these results, and 30 iterations yield models able to outperform the second-best method by 5 mAP. Cross-domain algorithms used in this setting struggle to surpass the baseline (DivMatch) or suffer negative transfer (SW).

Cross-Camera Transfer. Dataset bias between training and testing is unavoidable in practical applications, as with urban scenes collected in different cities and with different cameras. We test adaptation between KITTI and Cityscapes in both directions. For cross-domain evaluation we consider only the label car, as is standard practice. In Table 4, OSHOT improves by 7 mAP points on KITTI \(\rightarrow \) Cityscapes compared to the FRCNN baseline. DivMatch and SW both show a gain in this split, with SW obtaining the highest mAP of 39.2 in the ten-shot setting. We argue that this is not surprising considering that, as shown in the visualization of Table 4, the Cityscapes images all share a uniform visual style. As a consequence, 10 target images may be enough for standard cross-domain detection methods. Despite visual style homogeneity, the diversity among car instances in Cityscapes is high enough for learning a good car detection model. This is highlighted by the results in the Cityscapes \(\rightarrow \) KITTI task, for which adaptation performance for all methods is similar, and OSHOT with \(\gamma =0\) obtains the highest mAP of 75.4. The FRCNN baseline on KITTI scores a high mAP of 75.1: in this favorable condition detection doesn’t benefit from adaptation.

Table 4. mAP of car class in KITTI/Cityscapes detection experiments
Table 5. Comparison between baseline, one-shot style transfer and OSHOT in the one-shot unsupervised cross-domain detection setting

4.3 Comparison with One-Shot Style Transfer

Although not specifically designed for cross-domain detection, in principle it is possible to apply one-shot style transfer methods as an alternative solution for our setting. We use BiOST [8], the current state-of-the-art method for one-shot transfer, to modify the style of the target sample towards that of the source domain before performing inference. Due to the time-heavy requirements to perform BiOST on each test sample, we test it on Social Bikes and on a random subset of 100 Clipart images that we name Clipart100. We compare performance and time requirements of OSHOT and BiOST on these two targets. Speed has been computed on an RTX2080Ti with full precision settings.

Table 5 shows summary mAP results using BiOST and OSHOT. On Clipart100, the baseline FRCNN detector obtains 27.9 mAP. We can see how BiOST is effective in the adaptation from one sample, gaining 1.9 points over the baseline; however, it is outperformed by OSHOT, which obtains 30.7 mAP. On Social Bikes, while OSHOT still outperforms the baseline, BiOST incurs negative transfer, indicating that it was not able to effectively modify the source’s style on the images we collected. Furthermore, BiOST is affected by two strong issues: (1) as already mentioned, it has an extremely high time complexity, with more than 6 hours needed to modify the style of a single source instance; (2) it works under the strict assumption of accessing at the same time the entire source training set and the target sample. Due to these weaknesses, and the fact that OSHOT still outperforms BiOST, we argue that existing one-shot translation methods are not suitable for one-shot unsupervised cross-domain adaptation.

4.4 Ablation Study

Detection Error Analysis. Following [24], we provide detection error analysis for VOC \(\rightarrow \) Clipart setting in Fig. 4. We select the 1000 most confident detections, and assign error classes based on IoU with ground truth (IoUgt). Errors are categorized as: correct (IoUgt \(\geqslant \) 0.5), mislocalized (0.3 \(\leqslant \) IoUgt < 0.5) and background (IoUgt < 0.3). Results show that, compared to the baseline FRCNN model, the regularization effect of adding a self-supervised task at training time (\(\gamma = 0\)) marginally increases the quality of detections. Instead subsequent finetuning iterations on the test sample substantially improve the number of correct detections, while also decreasing both false positives and mislocalization errors.
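The three error classes can be expressed as a short rule on the best overlap with ground truth. The sketch below is illustrative only: boxes are assumed to be `(x1, y1, x2, y2)` tuples, class agreement is ignored because the text defines the categories through IoU alone, and the names `iou` and `error_class` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def error_class(detection, ground_truths):
    """Assign a detection to correct / mislocalized / background."""
    best = max((iou(detection, gt) for gt in ground_truths), default=0.0)
    if best >= 0.5:
        return "correct"
    if best >= 0.3:
        return "mislocalized"
    return "background"
```

For example, a detection shifted by half its width against a 10 x 10 ground-truth box has an IoU of 1/3 and is counted as mislocalized.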

Fig. 4. Detection error analysis on the most confident detections on Clipart.

Cross-Task Pseudo-labeling Ablation. As explained in Sect. 3 we have two options in the OSHOT adaptation phase: either considering the whole image, or focusing on pseudo-labeled bounding boxes obtained from the detector after the first OSHOT pretraining stage. For all the experiments presented above we focused on the second case. Indeed by solving the auxiliary task only on objects, we limit the use of background features which may mislead the network towards solutions of the rotation task not based on relevant semantic information (e.g.: finding fixed patterns in images, exploiting watermarks). We validate our choice by comparing it against using the rotation task on the entire image in both training and adaptation phases. Table 6 shows results for VOC \(\rightarrow \) AMD and Cityscapes \(\rightarrow \) Foggy Cityscapes using OSHOT. We observe that the choice of rotated regions is critical for the effectiveness of the algorithm. Solving the rotation task on objects using pseudo-annotations results in mAP improvements that range from 2.9 to 5.9 points, indicating that we learn better features for the main task.

Table 6. Rotating image vs rotating objects via pseudo-labeling on OSHOT

Self-supervised Iterations. We study the effects of adapting with up to \(\gamma = 70\) iterations on VOC \(\rightarrow \) Clipart, Cityscapes \(\rightarrow \) FoggyCityscapes and KITTI \(\rightarrow \) Cityscapes. Results are shown in Fig. 5. We observe a positive correlation between the number of finetuning iterations and the final mAP of the model in the earliest steps. This correlation is strong for the first 10 iterations and reaches a plateau after about 30 iterations: increasing \(\gamma \) beyond this point doesn’t affect the final results.

Fig. 5. Performance of OSHOT at different self-supervised iterations.

5 Conclusions

This paper introduced the one-shot unsupervised cross-domain detection scenario, which is extremely relevant for monitoring image feeds on social media, where algorithms are called to adapt to a new visual domain from one single image. We showed that existing cross-domain detection methods suffer in this setting, as they are all explicitly designed to adapt from far larger quantities of target data. We presented OSHOT, the first deep architecture able to reduce the domain gap between source and target distributions by leveraging a single target image. Our approach is based on a multi-task structure that exploits self-supervision and cross-task pseudo-labeling. Extensive quantitative experiments and a qualitative analysis demonstrate its effectiveness.