Abstract
Despite impressive progress in object detection over the last years, it is still an open challenge to reliably detect objects across visual domains. All current approaches access a sizable amount of target data at training time. This is a heavy assumption, as often it is not possible to anticipate the domain where a detector will be used, nor to access it in advance for data acquisition. Consider for instance the task of monitoring image feeds from social media: as every image is uploaded by a different user it belongs to a different target domain that is impossible to foresee during training. Our work addresses this setting, presenting an object detection algorithm able to perform unsupervised adaptation across domains by using only one target sample, seen at test time. We introduce a multi-task architecture that one-shot adapts to any incoming sample by iteratively solving a self-supervised task on it. We further enhance this auxiliary adaptation with cross-task pseudo-labeling. A thorough benchmark analysis against the most recent cross-domain detection methods and a detailed ablation study show the advantage of our approach.
1 Introduction
Social media feed us every day with an unprecedented amount of visual data. Images are uploaded by various actors, from corporations to political parties, institutions, entrepreneurs and private citizens, with roughly \(10^2M\) unique images shared every day on Twitter, Facebook and Instagram. For the sake of freedom of expression, control over their content is limited, and the vast majority of them are uploaded without any textual description of their content. Their sheer magnitude makes it imperative to use algorithms to monitor and make sense of them, finding the right balance between protecting the privacy of citizens and their right of expression, and tracking fake news (often associated with malicious intentions) while fighting illegal and hate content. In most cases this boils down to the ability to automatically associate as many tags as possible to images, which in turn means determining which objects are present in a scene.
Object detection has been largely investigated since the infancy of computer vision [11, 47] and continues to attract considerable attention in the current deep learning era [10, 19, 30, 52]. Most algorithms assume that training and test data come from the same visual domain [18, 19, 40]. Recently, some authors have started to investigate the more challenging yet realistic scenario where the detector is trained on data from a visual source domain and deployed at test time in a different target domain [32, 33, 44, 46]. This setting is usually referred to as cross-domain detection and heavily relies on concepts and results from the domain adaptation literature [14, 20, 32]. Specifically, it inherits the standard transductive logic, according to which unsupervised target data is available at training time together with annotated source data and can be used to adapt across domains. This approach is neither suitable nor effective for monitoring social media feeds. Consider for instance the scenario depicted in Fig. 1, where there is an incoming stream of images from various social media and the detector is asked to look for instances of the class bicycle. The images come continuously, but they are produced by different users who share them on different social platforms. Hence, even though they might contain the same object, each of them has been acquired by a different person, in a different context, under different viewpoints and illuminations. In other words, each image comes from a different visual domain, distinct from the visual domain where the detector has been trained. This poses two key challenges to current cross-domain detectors: (1) to adapt to the target data, these algorithms need first to gather feeds, and only after enough target data has been collected can they learn to adapt and start performing on the incoming images; (2) even if the algorithms have learned to adapt on target images from the feed up to time t, there is no guarantee that the images arriving from time \(t+1\) will come from the same target domain.
This is the scenario we address. We focus on cross-domain detection when only one target sample is available for adaptation, without any form of supervision. We propose an object detection method able to adapt from one target image, hence suitable for the social media scenario described above. Specifically, we build a multi-task deep architecture that adapts across domains by leveraging a pretext task. This auxiliary knowledge is further guided by a cross-task pseudo-labeling that injects the locality specific to object detection into self-supervised learning. The result is an architecture able to perform unsupervised adaptive object detection from a single image. Extensive experiments show the power of our method compared to previous state-of-the-art approaches. To summarize, the contributions of our paper are as follows:
(1) We introduce the One-Shot Unsupervised Cross-Domain Detection setting, a cross-domain detection scenario where the target domain changes from sample to sample, hence adaptation can be learned only from one image. This scenario is especially relevant for monitoring social media image feeds. We are not aware of previous works addressing it.
(2) We propose OSHOT, the first cross-domain object detector able to perform one-shot unsupervised adaptation. Our approach leverages self-supervised one-shot learning guided by a cross-task pseudo-labeling procedure, embedded in a multi-task architecture. A thorough ablation study showcases the importance of each component.
(3) We present a new experimental setup for studying one-shot unsupervised cross-domain adaptation, designed on three existing databases plus a new test set collected from social media feeds. We compare against recent algorithms for cross-domain adaptive detection [28, 42] and one-shot unsupervised learning [8], achieving state-of-the-art results.
We make the code of our project available at https://github.com/VeloDC/oshot_detection.
2 Related Work
Object Detection. Many successful object detection approaches have been developed over the past several years, from the original sliding-window methods based on handcrafted features to the most recent deep-learning-empowered solutions. Modern detectors can be divided into one-stage and two-stage techniques. In the former, classification and bounding box prediction are performed on the convolutional feature map, either by solving a regression problem on grid cells [39] or by exploiting anchor boxes at different scales and aspect ratios [31]. In the latter, an initial stage deals with the region proposal process and is followed by a refinement stage that adjusts the coarse region localization and classifies the box content. Existing variants of this strategy differ mainly in the region proposal algorithm [18, 19, 40]. Regardless of the specific implementation, detector robustness across visual domains remains a major issue.
Cross-Domain Detection. When training and test data are drawn from two different distributions, a model learned on the first is doomed to fail on the second. Unsupervised domain adaptation methods attempt to close the domain gap between the annotated source on which learning is performed and the target samples on which the model is deployed. Most of the literature has focused on object classification, with solutions based on feature alignment [2, 32, 33, 44] or adversarial approaches [15, 46]. GAN-based methods make it possible to directly update the visual style of the annotated source data and reduce the domain shift at the pixel level [23, 41]. Only in the last two years have adaptive detection methods been developed, considering three main components: (i) including multiple and increasingly more accurate feature alignment modules at different internal stages, (ii) adding a preliminary pixel-level adaptation and (iii) pseudo-labeling. The last one is also known as self-training and consists in using the output of the source model detector as coarse annotation on the target. The importance of considering both global and local domain adaptation, together with a consistency regularizer to bridge the two, was first highlighted in [7]. The Strong-Weak (SW) method of [42] improves over the previous one by pointing out the need for a better balanced alignment, with strong global and weak local adaptation. It was further extended by [49], where the adaptive steps are multiplied at different depths in the network. By generating new source images that look like those of the target, the Domain-Transfer (DT, [25]) method was the first to adopt pixel adaptation for object detection and combine it with pseudo-labeling. More recently, the Div-Match approach [28] re-elaborated the idea of domain randomization [45]: multiple CycleGAN [53] applications with different constraints produce three extra source variants with which the target can be aligned to different extents through an adversarial multi-domain discriminator. A weak self-training procedure (WST) to reduce false negatives is combined with adversarial background score regularization (BSR) in [27]. Finally, [26] followed the pseudo-labeling strategy, including an approach to deal with noisy annotations.
Adaptive Learning on a Budget. There is a wide literature on learning from a limited amount of data, both for classification and detection. However, in the case of domain shift, learning on a target budget becomes extremely challenging. Indeed, the standard assumption of adaptive learning is that a large amount of unsupervised target samples is available at training time, so that a source model can capture the target domain style from them and adapt to it.
Only a few attempts have been made to reduce the target cardinality. In [36] the considered setting is that of few-shot supervised domain adaptation: only a few target samples are available, but they are fully labeled. In [3, 8] the focus is on one-shot unsupervised style transfer with a large source dataset and a single unsupervised target image. These works propose time-costly autoencoder-based methods to generate a version of the target image that maintains its content but visually resembles the source in its global appearance. Thus the goal is image generation with no discriminative purpose. A related setting is that of online domain adaptation, where unsupervised target samples are initially scarce but accumulate over time [22, 34, 48]. In this case target samples belong to a continuous data stream with smoothly changing domains, so the coherence among subsequent samples can be exploited for adaptation.
Self-supervised Learning. Despite not being manually annotated, unsupervised data is rich in structural information that can be learned by self-supervision, i.e. hiding a subpart of the data information and then trying to recover it. This procedure is generally indicated as a pretext task; possible examples are image completion [38], colorization [29, 51], relative position of patches [12, 37], rotation recognition [17] and many more. Self-supervised learning has been extensively used as an initialization step for scarcely annotated supervised learning settings, and very recently [1] has shown with a thorough analysis the potential of self-supervised learning from a single image. Recent works have also indicated that self-supervision supports adaptation and generalization when combined with supervised learning in a multi-task framework [4, 6, 50].
Our approach for cross-domain detection relates to the described scenario of learning on a budget and exploits self-supervised learning to perform one-shot unsupervised adaptation. Specifically, with OSHOT we show how to recognize objects and their locations in a single target image starting from a pre-trained source model, thus without the need to access the source data during testing.
3 Method
Problem Setting. We introduce the one-shot unsupervised cross-domain detection scenario where our goal is to predict on a single image \(x^t\), with t being any target domain not available at training time, starting from N annotated samples of the source domain \(S=\{x^s_{i},y^s_{i}\}_{i=1}^N\). Here the structured labels \(y^s=(c,b)\) describe class identity c and bounding box location b in each image \(x^s\), and we aim to obtain \(y^t\) that precisely detects objects in \(x^t\) despite the domain shift.
OSHOT Strategy. To pursue the described goal, our strategy is to train the parameters of a detection learning model such that it is ready to reach maximal performance on a single unsupervised sample from a new domain after a few gradient update steps on it. Since we have no ground truth on the target sample, we implement this strategy by learning a representation that exploits inherent data information, such as that captured by a self-supervised task, and then finetuning it on the target sample (see Fig. 2). Thus, we design OSHOT to include (1) an initial pretraining phase where we extend a standard deep detection model with an image rotation classifier, and (2) a following adaptation stage where the network features are updated on the single target sample by further optimization of the rotation objective. Moreover, we exploit pseudo-labeling to focus the auxiliary task on the local object context. A clear advantage of this solution is that we decouple source training from target testing, with no need to access the source data while adapting on the target sample.
Preliminaries. We leverage Faster R-CNN [40] as our base detection model. It is a two-stage detector with three main components: an initial block of convolutional layers, a region proposal network (RPN) and a region-of-interest (ROI) based classifier. The bottom layers transform any input image x into its convolutional feature map \(G_{f}(x|\theta _{f})\), where \(\theta _{f}\) parametrizes the feature extraction model. The feature map is then used by the RPN to generate candidate object proposals. Finally, the ROI-wise classifier predicts the category label from the feature vector obtained via ROI-pooling. The training objective combines the losses of both RPN and ROI, each composed of two terms:
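The displayed loss, reconstructed here to be consistent with the definitions given below, reads:

$$ \mathcal{L}_{d}(\theta_f,\theta_d) \;=\; \underbrace{\mathcal{L}_{class}(c^*) + \mathcal{L}_{regr}(b)}_{\text{RPN}} \;+\; \underbrace{\mathcal{L}_{class}(c) + \mathcal{L}_{regr}(b)}_{\text{ROI}} $$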
Here \(\mathcal {L}_{class}\) is a classification loss to evaluate the object recognition accuracy, while \(\mathcal {L}_{regr}\) is a regression loss on the box coordinates for better localization. To maintain a simple notation we summarize the role of ROI and RPN with the function \(G_{d}(G_{f}(x|\theta _{f})|\theta _{d})\) parametrized by \(\theta _{d}\). Moreover, we use \(c^*\) to highlight that RPN deals with a binary classification task to separate foreground and background objects, while ROI deals with the multi-class objective needed to discriminate among c foreground object categories. As mentioned above, ROI and RPN are applied in sequence: they both elaborate on the feature maps produced by the convolutional block, and then influence each other in the final optimization of the multi-task (classification, regression) objective function.
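As a minimal illustration of this notation, the following PyTorch-style sketch shows how the backbone \(G_f\) and the detection head \(G_d\) compose; the module names are placeholders, not the actual implementation.

```python
import torch.nn as nn

class TwoStageDetector(nn.Module):
    """Illustrative decomposition into G_f (backbone) and G_d (RPN + ROI head),
    mirroring the notation in the text; module names are placeholders."""

    def __init__(self, backbone, rpn, roi_head):
        super().__init__()
        self.backbone = backbone   # G_f(.|theta_f): image -> convolutional feature map
        self.rpn = rpn             # binary foreground/background classification + box regression
        self.roi_head = roi_head   # multi-class classification + box refinement

    def forward(self, image):
        features = self.backbone(image)                  # G_f(x|theta_f)
        proposals = self.rpn(features)                   # candidate object proposals
        detections = self.roi_head(features, proposals)  # G_d(G_f(x|theta_f)|theta_d)
        return detections
```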
OSHOT Pretraining. As a first step, we extend Faster R-CNN to include image rotation recognition. Formally, to each source training image \(x^s\) we apply four geometric transformations \(R(x,\alpha )\) where \(\alpha = q\times 90^{\circ }\) indicates rotations with \(q \in \{1,\ldots ,4\}\). In this way we obtain a new set of samples \(\{R(x)_j, q_j\}_{j=1}^M\) where we dropped the \(\alpha \) without loss of generality. We indicate the auxiliary rotation classifier and its parameters respectively with \(G_{r}\) and \(\theta _{r}\) and we train our network to optimize the following multi-task objective
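The displayed multi-task objective, reconstructed from the surrounding notation (with \(\lambda\) weighting the auxiliary loss as in Sect. 4.2), reads:

$$ \min_{\theta_f,\theta_d,\theta_r}\; \sum_{i=1}^{N} \mathcal{L}_{d}\big(G_{d}(G_{f}(x^s_i|\theta_f)|\theta_d)\big) \;+\; \lambda \sum_{j=1}^{M} \mathcal{L}_{r}\big(G_{r}(G_{f}(R(x^s)_j|\theta_f)|\theta_r)\big) $$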
where \(\mathcal {L}_{r}\) is the cross-entropy loss. When solving this problem, we can design \(G_{r}\) in two different ways. It can either be a fully connected layer that naïvely takes as input the feature map produced by the whole (rotated) image, \(G_{r}(\cdot |\theta _r) = \text{ FC}_{\theta _r}(\cdot )\), or it can exploit the ground truth location of each object with a subselection of the features only from its bounding box in the original map, \(G_{r}(\cdot |\theta _r) = \text{ FC}_{\theta _r}(boxcrop(\cdot ))\). The boxcrop operation includes pooling to rescale the feature dimension before entering the final FC layer. In this last case the network is encouraged to focus only on the object orientation without introducing noisy information from the background, and it provides better results than the whole-image option, as we will discuss in Sect. 4.4. In practical terms, both in the case of image and box rotations, we randomly pick one rotation angle per instance rather than considering all four of them: this avoids any troublesome imbalance between rotated and non-rotated data when solving the multi-task optimization problem.
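The sketch below gives one possible realization of the rotation pretext sampling and of the boxcrop pooling, assuming PyTorch and torchvision; the helper names and the use of roi_align are our illustrative choices, not necessarily those of the released code.

```python
import random
import torch
from torchvision.ops import roi_align

def make_rotation_sample(image):
    """Randomly pick one of the four rotations (q in {1, ..., 4}) and return the
    rotated image with its rotation label. `image` is a CHW float tensor."""
    q = random.randint(1, 4)
    rotated = torch.rot90(image, k=q, dims=(1, 2))
    return rotated, q - 1  # zero-based label for the cross-entropy loss

def boxcrop(feature_map, boxes, output_size=7):
    """Pool a fixed-size feature from each ground-truth box so that the rotation
    classifier only sees object regions. `feature_map` is CxHxW, `boxes` is an
    Nx4 float tensor [x1, y1, x2, y2] in feature-map coordinates."""
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes.float()], dim=1)  # prepend batch index 0
    return roi_align(feature_map.unsqueeze(0), rois, output_size)
```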
OSHOT Adaptation. Given the single target image \(x^t\), we finetune the backbone’s parameters \(\theta _{f}\) by iteratively solving a self-supervised task on it. This allows the original feature representation to adapt both to the content and to the style of the new sample. Specifically, we start from the rotated versions \(R(x^t)\) of the provided sample and optimize the rotation classifier through
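The displayed adaptation objective, reconstructed from the notation above, reads:

$$ \min_{\theta_f,\theta_r}\; \mathcal{L}_{r}\big(G_{r}(G_{f}(R(x^t)|\theta_f)|\theta_r)\big) $$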
This process involves only \(G_{f}\) and \(G_{r}\), while the RPN and ROI detection components described by \(G_{d}\) remain unchanged. In the following we use \(\gamma \) to indicate the number of gradient steps (i.e. iterations), with \(\gamma =0\) corresponding to the OSHOT pretraining phase. At the end of the finetuning process, the inner feature model is described by \(\theta ^*_f\) and the detection prediction on \(x^t\) is obtained by \(y^{t*} = G_{d}(G_{f}(x^t|\theta ^*_{f})|\theta _{d})\).
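A minimal sketch of this adaptation loop, assuming PyTorch modules with illustrative names (detector.backbone for \(G_f\), rotation_head for \(G_r\)), could look as follows; it is not the released implementation.

```python
import copy
import random
import torch
import torch.nn.functional as F

def oshot_adapt_and_predict(detector, rotation_head, target_image, gamma=30, lr=0.001):
    """Finetune the backbone (and rotation head) on the rotation task for `gamma`
    steps using only the single target image, then run detection with the
    adapted features. Module names are illustrative placeholders."""
    # Work on copies so the pretrained model stays untouched for the next sample.
    detector, rotation_head = copy.deepcopy(detector), copy.deepcopy(rotation_head)
    params = list(detector.backbone.parameters()) + list(rotation_head.parameters())
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)

    for _ in range(gamma):
        q = random.randint(1, 4)                                  # rotation in {1, ..., 4}
        rotated = torch.rot90(target_image, k=q, dims=(1, 2))
        features = detector.backbone(rotated.unsqueeze(0))        # G_f(R(x^t)|theta_f)
        logits = rotation_head(features)                          # 4-way rotation prediction
        loss = F.cross_entropy(logits, torch.tensor([q - 1]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    detector.eval()
    with torch.no_grad():
        return detector(target_image.unsqueeze(0))                # y^t* with adapted theta_f
```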
Cross-Task Pseudo-labeling. As in the pretraining phase, also in the adaptation stage we have two possible choices to design \(G_{r}\): either considering the whole feature map \(G_{r}(\cdot |\theta _r) = \text{ FC}_{\theta _r}(\cdot )\), or focusing on the object locations \(G_{r}(\cdot |\theta _r) = \text{ FC}_{\theta _r}(pseudoboxcrop(\cdot ))\). For both variants we include dropout to prevent overfitting on the single target sample. With pseudoboxcrop we mean a localized feature extraction operation analogous to that discussed for pretraining, but obtained through a particular form of cross-task self-training. Specifically, we follow the self-training strategy used in [25, 27] with a cross-task variant: instead of reusing the pseudo-labels produced by the source model on the target to update the detector, we exploit them for the self-supervised rotation classifier. In this way we keep the advantage of the self-training initialization, largely reducing the risks of error propagation due to wrong class pseudo-labels.
More practically, we start from the \((\theta _{f},\theta _{d})\) model parameters of the pretraining stage and we get the feature maps from all the rotated versions of the target sample, \(G_{f}(\{R(x^t),q\}|\theta _{f})\), \(q \in \{1,\ldots ,4\}\). Only the feature map produced by the original image (i.e. \(q=4\)) is provided as input to the RPN and ROI network components to get the predicted detection \(y^{t}=(c,b)=G_{d}(G_{f}(x^t|\theta _{f})|\theta _{d})\). This pseudo-label is composed of the class label c and the bounding box location b. We discard the first and consider only the second to localize the region containing an object in all four feature maps, also recalibrating the position to compensate for the orientation of each map. Once passed through this pseudoboxcrop operation, the obtained features are used to finetune the rotation classifier, updating the bottom convolutional network block.
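The recalibration of a pseudo-box onto each rotated feature map can be done with a small coordinate remapping like the one sketched below; the helper is illustrative and assumes the torch.rot90 rotation convention used in the earlier sketches.

```python
def rotate_box(box, q, height, width):
    """Map an [x1, y1, x2, y2] box from the original feature map onto the map
    rotated q times by 90 degrees (torch.rot90 convention: a pixel at (x, y)
    moves to (y, width - 1 - x) and the map sizes swap). Illustrative helper
    for the pseudoboxcrop step, not the paper's exact code."""
    x1, y1, x2, y2 = box
    h, w = height, width
    for _ in range(q % 4):
        x1, y1, x2, y2 = y1, w - 1 - x2, y2, w - 1 - x1
        h, w = w, h
    return [x1, y1, x2, y2]
```

Note that for \(q=4\) the box is left unchanged, consistently with the original orientation of the map.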
4 Experiments
4.1 Datasets
Real-World (VOC). Pascal-VOC [13] is the standard real-world image dataset for object detection benchmarks. VOC2007 and VOC2012 both contain bounding boxes annotations of 20 common categories. VOC2007 has 5011 images in the train-val split and 4952 images in the test split, while VOC2012 contains 11540 images in the train-val split.
Artistic Media Datasets (AMD). Clipart1k, Comic2k and Watercolor2k [25] are three object detection datasets designed for benchmarking domain adaptation methods when the source domain is Pascal-VOC. Clipart1k shares its 20 categories with Pascal-VOC: it has 500 images in the training set and 500 images in the test set. Comic2k and Watercolor2k both have the same 6 classes (a subset of the 20 classes of Pascal-VOC), each with 1000 training and 1000 test images.
Cityscapes [9] is an urban street scene dataset with pixel level annotations of 8 categories. It has 2975 and 500 images respectively in the training and validation splits. We use the instance level pixel annotations to generate bounding boxes of objects, as in [7].
Foggy Cityscapes [43] is obtained by adding different levels of synthetic fog to Cityscapes images. We only consider images with the highest amount of artificial fog, so the training and validation splits have 2975 and 500 images respectively.
KITTI [16] is a dataset of images depicting several driving urban scenarios. By following [7], we use the full 7481 images for both training (when used as source) and evaluation (when used as target).
Social Bikes is our new concept-dataset containing 30 images of scenes with persons/bicycles collected from Twitter, Instagram and Facebook by searching for #bike tags. Square crops of the full dataset are presented in Fig. 3: images acquired randomly from social feeds show diverse style properties and cannot be grouped under a single shared domain.
4.2 Performance Analysis
Experimental Setup. We evaluate OSHOT on several testbeds using the described datasets. In the following we will use an arrow \(Source \rightarrow Target\) to indicate the experimental setting. Our base detector is Faster R-CNN [35] with a ResNet-50 [21] backbone pre-trained on ImageNet, an RPN with 300 top proposals after non-maximum suppression, and anchors at three scales (128, 256, 512) and three aspect ratios (1:1, 1:2, 2:1). For all our experiments we set the IoU threshold at 0.5 for the mAP results, and report the average of three independent runs.
OSHOT Pretraining. We always resize the image’s shorter side to 600 pixels and apply random horizontal flipping. Unless differently specified, we train the base network for 70k iterations using SGD with momentum set at 0.9; the initial learning rate is 0.001 and decays after 50k iterations. We use a batch size of 1, keep batch normalization layers fixed for both the pretraining and adaptation phases and freeze the first 2 blocks of ResNet-50. The weight of the auxiliary task is set to \(\lambda =0.05\).
OSHOT Adaptation. We increase the weight of the auxiliary task to \(\lambda =0.2\) to speed up adaptation and keep all other training hyperparameters fixed. For each test instance, we finetune the initial model on the auxiliary task for 30 iterations before testing.
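For quick reference, the hyperparameters stated above can be gathered in a small configuration sketch; the key names are our own illustrative naming, not taken from the released code.

```python
# Hyperparameters as reported in the text; key names are illustrative.
OSHOT_PRETRAIN = {
    "iterations": 70_000,
    "optimizer": "SGD",
    "momentum": 0.9,
    "learning_rate": 0.001,
    "lr_decay_step": 50_000,
    "batch_size": 1,
    "frozen": ["batch_norm_layers", "first_two_resnet50_blocks"],
    "aux_rotation_weight": 0.05,   # lambda during pretraining
}

OSHOT_ADAPT = {
    "aux_rotation_weight": 0.2,    # lambda raised to speed up adaptation
    "finetune_iterations": 30,     # gamma, per test instance
}
```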
Benchmark Methods. We compare OSHOT with the following algorithms. FRCNN: baseline Faster-RCNN with ResNet50 backbone, trained on the source domain and deployed on the target without further adaptation. DivMatch [28]: cross-domain detection algorithm that, by exploiting target data, creates multiple randomized domains via CycleGAN and aligns their representations using an adversarial loss. SW [42]: adaptive detection algorithm that aligns source and target features based on global context similarity. For both DivMatch and SW, we use a ResNet-50 backbone pretrained on ImageNet for fair comparison. Since all cross-domain algorithms need target data in advance and are not designed to work in our one-shot unsupervised setting, we provide them with the advantage of 10 target images accessible during training and randomly selected at each run. We collect average precision statistics during inference under the favorable assumption that the target domain will not shift after deployment.
Adapting to Social Feeds. When data is collected from multiple sources, the assumption that all target images originate from the same underlying distribution does not hold, and standard cross-domain detection methods are penalized regardless of the number of seen target samples. We pretrain the source detector on Pascal VOC and deploy it on Social Bikes. We consider only the bicycle and person annotations for this target, since all other instances of VOC classes are scarce. We report results in Table 1. OSHOT outperforms all considered competitors, with a mAP score of 64.4. Despite being granted access to the full target, adaptive algorithms incur negative transfer due to data scarcity and the large variety of target styles.
Large Distribution Shifts. Artistic images are difficult benchmarks for cross-domain methods. Unpredictable perturbations in shape and color are challenging for detectors trained only on realistic images. We investigate this setting by training the source detector on Pascal VOC and deploying it on the Clipart, Comic and Watercolor datasets. Table 2 summarizes results on the three adaptation splits. We can see how OSHOT with 30 finetuning iterations outperforms all competitors, with mAP gains ranging from 7.5 points on Clipart to 9.2 points on Watercolor. Cross-domain detection methods perform poorly in this setting, despite using 9 more samples in the adaptation phase than OSHOT, which only uses the test sample. These results confirm that they are not designed to tackle data scarcity conditions and exhibit negligible improvements compared to the baseline.
Adverse Weather. Some peculiar environmental conditions, such as fog, may be disregarded in source data acquisition, yet adaptation to these circumstances is crucial for real world applications. We assess the performance of OSHOT on Cityscapes \(\rightarrow \) FoggyCityscapes. We train our base detector on Cityscapes for 30k iterations without stepdown, as in [5]. We select the best performing model on the Cityscapes validation split and deploy it to FoggyCityscapes. Experimental evaluation in Table 3 shows that OSHOT outperforms all compared approaches. Without finetuning iterations, performance using the auxiliary rotation task increases compared to the baseline. Subsequent finetuning iterations on the target sample improve these results, and 30 iterations yield models able to outperform the second-best method by 5 mAP. Cross-domain algorithms used in this setting struggle to surpass the baseline (DivMatch) or suffer negative transfer (SW).
Cross-Camera Transfer. Dataset bias between training and testing is unavoidable in practical applications, as for urban scene scenarios collected in different cities and with different cameras. We test adaptation between KITTI and Cityscapes in both directions. For cross-domain evaluation we consider only the label car, as is standard practice. In Table 4, OSHOT improves by 7 mAP points on KITTI \(\rightarrow \) Cityscapes compared to the FRCNN baseline. DivMatch and SW both show a gain in this split, with SW obtaining the highest mAP of 39.2 in the ten-shot setting. We argue that this is not surprising considering that, as shown in the visualization of Table 4, the Cityscapes images all share a uniform visual style. As a consequence, 10 target images may be enough for standard cross-domain detection methods. Despite this visual style homogeneity, the diversity among car instances in Cityscapes is high enough for learning a good car detection model. This is highlighted by the results on the Cityscapes \(\rightarrow \) KITTI task, for which adaptation performance is similar for all methods, and OSHOT with \(\gamma =0\) obtains the highest mAP of 75.4. The FRCNN baseline on KITTI scores a high mAP of 75.1: in this favorable condition detection does not benefit from adaptation.
4.3 Comparison with One-Shot Style Transfer
Although not specifically designed for cross-domain detection, in principle it is possible to apply one-shot style transfer methods as an alternative solution for our setting. We use BiOST [8], the current state-of-the-art method for one-shot transfer, to modify the style of the target sample towards that of the source domain before performing inference. Due to the heavy time requirements of performing BiOST on each test sample (see footnote 1), we test it on Social Bikes and on a random subset of 100 Clipart images that we name Clipart100. We compare performance and time requirements of OSHOT and BiOST on these two targets. Speed has been computed on an RTX 2080 Ti with full precision settings.
Table 5 shows summary mAP results using BiOST and OSHOT. On Clipart100, the baseline FRCNN detector obtains 27.9 mAP. We can see that BiOST is effective in adapting from one sample, gaining 1.9 points over the baseline; however, it is outperformed by OSHOT, which obtains 30.7 mAP. On Social Bikes, while OSHOT still outperforms the baseline, BiOST incurs negative transfer, indicating that it was not able to effectively modify the source's style on the images we collected. Furthermore, BiOST is affected by two strong issues: (1) as already mentioned, it has an extremely high time complexity, with more than 6 hours needed to modify the style of a single source instance; (2) it works under the strict assumption of accessing at the same time the entire source training set and the target sample. Due to these weaknesses, and the fact that OSHOT still outperforms BiOST, we argue that existing one-shot translation methods are not suitable for one-shot unsupervised cross-domain adaptation.
4.4 Ablation Study
Detection Error Analysis. Following [24], we provide detection error analysis for VOC \(\rightarrow \) Clipart setting in Fig. 4. We select the 1000 most confident detections, and assign error classes based on IoU with ground truth (IoUgt). Errors are categorized as: correct (IoUgt \(\geqslant \) 0.5), mislocalized (0.3 \(\leqslant \) IoUgt < 0.5) and background (IoUgt < 0.3). Results show that, compared to the baseline FRCNN model, the regularization effect of adding a self-supervised task at training time (\(\gamma = 0\)) marginally increases the quality of detections. Instead subsequent finetuning iterations on the test sample substantially improve the number of correct detections, while also decreasing both false positives and mislocalization errors.
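The error assignment rule can be written compactly as in the following illustrative helper; the IoU computation itself is assumed to be available elsewhere.

```python
def categorize_detection(iou_with_gt):
    """Assign an error class to a detection from its best IoU with the ground truth,
    following the thresholds used in this analysis (illustrative helper)."""
    if iou_with_gt >= 0.5:
        return "correct"
    if iou_with_gt >= 0.3:
        return "mislocalized"
    return "background"
```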
Cross-Task Pseudo-labeling Ablation. As explained in Sect. 3, we have two options in the OSHOT adaptation phase: either considering the whole image, or focusing on pseudo-labeled bounding boxes obtained from the detector after the first OSHOT pretraining stage. For all the experiments presented above we focused on the second case. Indeed, by solving the auxiliary task only on objects, we limit the use of background features, which may mislead the network towards solutions of the rotation task not based on relevant semantic information (e.g. finding fixed patterns in images, exploiting watermarks). We validate our choice by comparing it against using the rotation task on the entire image in both the training and adaptation phases. Table 6 shows results for VOC \(\rightarrow \) AMD and Cityscapes \(\rightarrow \) Foggy Cityscapes using OSHOT. We observe that the choice of rotated regions is critical for the effectiveness of the algorithm. Solving the rotation task on objects using pseudo-annotations results in mAP improvements ranging from 2.9 to 5.9 points, indicating that we learn better features for the main task.
Self-supervised Iterations. We study the effects of adapting with up to \(\gamma = 70\) iterations on VOC \(\rightarrow \) Clipart, Cityscapes \(\rightarrow \) FoggyCityscapes and KITTI \(\rightarrow \) Cityscapes. Results are shown in Fig. 5. We observe a positive correlation between the number of finetuning iterations and the final mAP of the model in the earliest steps. This correlation is strong for the first 10 iterations and reaches a plateau after about 30 iterations: increasing \(\gamma \) beyond this point does not affect the final results.
5 Conclusions
This paper introduced the one-shot unsupervised cross-domain detection scenario, which is extremely relevant for monitoring image feeds on social media, where algorithms are called to adapt to a new visual domain from one single image. We showed that existing cross-domain detection methods suffer in this setting, as they are all explicitly designed to adapt from far larger quantities of target data. We presented OSHOT, the first deep architecture able to reduce the domain gap between the source and target distributions by leveraging one single target image. Our approach is based on a multi-task structure that exploits self-supervision and cross-task self-labeling. Extensive quantitative experiments and a qualitative analysis clearly demonstrate its effectiveness.
Notes
- 1.
To get the style update, BiOST trains a double-variational autoencoder using the entire source set in addition to the single target sample. As advised by the authors through personal communication, we trained the model for 5 epochs.
References
Asano, Y.M., Rupprecht, C., Vedaldi, A.: A critical analysis of self-supervision, or what we can learn from a single image. In: ICLR (2020)
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.: A theory of learning from different domains. Mach. Learn. 79, 151–175 (2010). https://doi.org/10.1007/s10994-009-5152-4
Benaim, S., Wolf, L.: One-shot unsupervised cross domain translation. In: NIPS (2018)
Bucci, S., D’Innocente, A., Tommasi, T.: Tackling partial domain adaptation with self-supervision. In: Ricci, E., Rota Bulò, S., Snoek, C., Lanz, O., Messelodi, S., Sebe, N. (eds.) ICIAP 2019. LNCS, vol. 11752, pp. 70–81. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30645-8_7
Cai, Q., Pan, Y., Ngo, C.W., Tian, X., Duan, L., Yao, T.: Exploring object relation in mean teacher for cross-domain detection. In: CVPR (2019)
Carlucci, F.M., D’Innocente, A., Bucci, S., Caputo, B., Tommasi, T.: Domain generalization by solving jigsaw puzzles. In: CVPR (2019)
Chen, Y., Li, W., Sakaridis, C., Dai, D., Van Gool, L.: Domain adaptive faster R-CNN for object detection in the wild. In: CVPR (2018)
Cohen, T., Wolf, L.: Bidirectional one-shot unsupervised domain mapping. In: ICCV (2019)
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: NIPS (2016)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)
Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV (2015)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4
Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: ICML (2015)
Ganin, Y., et al.: Domain-adversarial training of neural networks. JMLR 17(1), 2030–2096 (2016)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)
Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: ICLR (2018)
Girshick, R.: Fast R-CNN. In: ICCV (2015)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
Goodfellow, I., et al.: Generative adversarial nets. In: NIPS (2014)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hoffman, J., Darrell, T., Saenko, K.: Continuous manifold based adaptation for evolving visual domains. In: CVPR (2014)
Hoffman, J., et al.: CyCADA: cycle-consistent adversarial domain adaptation. In: ICML (2018)
Hoiem, D., Chodpathumwan, Y., Dai, Q.: Diagnosing error in object detectors. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7574, pp. 340–353. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33712-3_25
Inoue, N., Furuta, R., Yamasaki, T., Aizawa, K.: Cross-domain weakly-supervised object detection through progressive domain adaptation. In: CVPR (2018)
Khodabandeh, M., Vahdat, A., Ranjbar, M., Macready, W.G.: A robust learning approach to domain adaptive object detection. In: ICCV (2019)
Kim, S., Choi, J., Kim, T., Kim, C.: Self-training and adversarial background regularization for unsupervised domain adaptive one-stage object detection. In: ICCV (2019)
Kim, T., Jeong, M., Kim, S., Choi, S., Kim, C.: Diversify and match: a domain adaptive representation learning paradigm for object detection. In: CVPR (2019)
Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: CVPR (2017)
Liu, S., Huang, D., Wang, Y.: Receptive field block net for accurate and fast object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 404–419. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_24
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with deep adaptation networks. In: ICML (2015)
Long, M., Zhu, H., Wang, J., Jordan, M.I.: Deep transfer learning with joint adaptation networks. In: ICML (2017)
Mancini, M., Karaoguz, H., Ricci, E., Jensfelt, P., Caputo, B.: Kitting in the wild through online domain adaptation. In: IROS (2018)
Massa, F., Girshick, R.: maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark (2018). Accessed 22 Aug 2019
Motiian, S., Jones, Q., Iranmanesh, S., Doretto, G.: Few-shot adversarial domain adaptation. In: NIPS (2017)
Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5
Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.: Context encoders: feature learning by inpainting. In: CVPR (2016)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
Russo, P., Carlucci, F.M., Tommasi, T., Caputo, B.: From source to target and back: symmetric bi-directional adaptive GAN. In: CVPR (2018)
Saito, K., Ushiku, Y., Harada, T., Saenko, K.: Strong-weak distribution alignment for adaptive object detection. In: CVPR (2019)
Sakaridis, C., Dai, D., Van Gool, L.: Semantic foggy scene understanding with synthetic data. IJCV 126(9), 973–992 (2018). https://doi.org/10.1007/s11263-018-1072-8
Sun, B., Saenko, K.: Deep CORAL: correlation alignment for deep domain adaptation. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 443–450. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_35
Tobin, J., Fong, R.H., Ray, A., Schneider, J., Zaremba, W., Abbeel, P.: Domain randomization for transferring deep neural networks from simulation to the real world. In: IROS (2017)
Tzeng, E., Hoffman, J., Darrell, T., Saenko, K.: Adversarial discriminative domain adaptation. In: CVPR (2017)
Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR (2001)
Wulfmeier, M., Bewley, A., Posner, I.: Incremental adversarial domain adaptation for continually changing environments. In: ICRA (2018)
Xie, R., Yu, F., Wang, J., Wang, Y., Zhang, L.: Multi-level domain adaptive learning for cross-domain detection. In: ICCV Workshops (2019)
Xu, J., Xiao, L., López, A.M.: Self-supervised domain adaptation for computer vision tasks. arXiv abs/1907.10915 (2019)
Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 649–666. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_40
Zhang, S., Wen, L., Bian, X., Lei, Z., Li, S.Z.: Single-shot refinement neural network for object detection. In: CVPR (2018)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)
Acknowledgements
This work was partially funded by the ERC grant 637076 RoboExNovo (AD, FCB, SB, BC) and took advantage of the GPU donated by NVIDIA (Academic Hardware Grant, TT). We acknowledge the support provided by Tomer Cohen and Kim Taekyung on their code of BiOST and DivMatch, respectively.