Abstract
Object detection is a hot topic with various applications in computer vision, e.g., image understanding, autonomous driving, and video surveillance. Much of the progress has been driven by the availability of object detection benchmark datasets, including PASCAL VOC, ImageNet, and MS COCO. However, object detection on the drone platform is still a challenging task, due to various factors such as viewpoint changes, occlusion, and scale variations. To narrow the gap between current object detection performance and real-world requirements, we organized the Vision Meets Drone (VisDrone2018) Object Detection in Images challenge in conjunction with the 15th European Conference on Computer Vision (ECCV 2018). Specifically, we release a large-scale drone-based dataset, including 8,599 images (6,471 for training, 548 for validation, and 1,580 for testing) with rich annotations, including object bounding boxes, object categories, occlusion, truncation ratios, etc. Featuring diverse real-world scenarios, the dataset was collected using various drone models, in different scenarios (across 14 different cities separated by thousands of kilometres), and under various weather and lighting conditions. We mainly focus on ten object categories in object detection, i.e., pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. Some rarely occurring special vehicles (e.g., machineshop truck, forklift truck, and tanker) are ignored in evaluation. The dataset is extremely challenging due to various factors, including large scale and pose variations, occlusion, and cluttered backgrounds. We present the evaluation protocol of the VisDrone-DET2018 challenge and the comparison results of 38 detectors on the released dataset, which are publicly available on the challenge website: http://www.aiskyeye.com/. We expect the challenge to largely boost the research and development of object detection in images on drone platforms.
1 Introduction
Detecting objects in images, which aims to locate objects from a predefined set of categories (e.g., cars and pedestrians), is a problem with a long history [9, 17, 32, 40, 50]. Accurate object detection would have an immediate and far-reaching impact on many applications, such as image understanding, video surveillance, and anomaly detection. Although object detection has attracted much research attention and achieved significant advances with deep learning techniques in recent years, existing algorithms are usually not optimal for dealing with sequences or images captured by drone-based platforms, due to various challenges such as viewpoint changes, scale variations, and occlusion.
To narrow the gap between current object detection performance and real-world requirements, we organized the “Vision Meets Drone - Object Detection in Images” (VisDrone-DET2018) challenge, which is one track of the “Vision Meets Drone: A Challenge” (or VisDrone2018, for short) held on September 8, 2018, in conjunction with the 15th European Conference on Computer Vision (ECCV 2018) in Munich, Germany. We collected a large-scale object detection dataset in real scenarios with detailed annotations. The VisDrone2018 challenge mainly focuses on humans and vehicles in our daily life. A comparison of the proposed dataset with previous datasets is presented in Table 1.
We invited researchers to submit algorithms that detect objects of ten predefined categories (e.g., pedestrian and car) in individual images of the VisDrone-DET2018 dataset, and to share their research results at the workshop. We believe this comprehensive challenge benchmark is useful to further boost research on object detection on drone platforms. The authors of the detection algorithms in this challenge have the opportunity to share their ideas and publish their source code at our website: http://www.aiskyeye.com/, which helps promote the development of object detection algorithms.
2 Related Work
2.1 Existing Datasets
Several object detection benchmarks have been collected for evaluating object detection algorithms. Enzweiler and Gavrila [12] present the Daimler dataset, captured by a vehicle driving through urban environments. The dataset includes 3,915 manually annotated pedestrians in video images in the training set, and 21,790 video images with 56,492 annotated pedestrians in the testing set. The Caltech dataset [11] consists of approximately 10 h of \(640\times 480\) 30 Hz videos taken from a vehicle driving through regular traffic in an urban environment. It contains \(\sim \)250,000 frames with a total of 350,000 annotated bounding boxes of 2,300 unique pedestrians. The KITTI-D benchmark [19] is designed to evaluate car, pedestrian, and cyclist detection algorithms in autonomous driving scenarios, with 7,481 training and 7,518 testing images. Mundhenk et al. [34] create a large dataset for classification, detection and counting of cars, which contains 32,716 unique cars from six different image sets, covering different geographical locations and different imagers. The recent UA-DETRAC benchmark [33, 47] provides 1,210k objects in 140k frames for vehicle detection.
The PASCAL VOC dataset [15, 16] is one of the pioneering works in generic object detection, designed to provide a standardized test bed for object detection, image classification, object segmentation, person layout, and action classification. ImageNet [10, 41] follows the footsteps of the PASCAL VOC dataset by scaling up by more than an order of magnitude in the number of object classes and images, i.e., PASCAL VOC 2012 with 20 object classes and 21,738 images vs. ILSVRC2012 with 1,000 object classes and 1,431,167 annotated images. Recently, Lin et al. [31] release the MS COCO dataset, containing more than 328,000 images with 2.5 million manually segmented object instances. It has 91 object categories with 27.5k instances per category on average. Notably, it contains object segmentation annotations that are not available in ImageNet.
2.2 Review of Object Detection Methods
Classical Object Detectors. In the early days, object detection methods were built on the sliding-window paradigm, applying hand-crafted features and classifiers on dense image grids to locate objects. As one of the most popular early frameworks, Viola and Jones [45] use Haar features and the AdaBoost algorithm to learn a series of cascaded classifiers for face detection, which achieves accurate results with high efficiency. Felzenszwalb et al. [17] develop an effective object detection method based on mixtures of multi-scale deformable part models. Specifically, they compute Histograms of Oriented Gradients (HOG) features on each object part and train a latent SVM (a reformulation of MI-SVM in terms of latent variables) for robust performance. However, these classical object detectors do not perform well in challenging scenarios. In recent years, with the advance of deep Convolutional Neural Networks (CNNs), the object detection field has become dominated by CNN-based detectors, which can be roughly divided into two categories, i.e., the two-stage approach and the one-stage approach.
Two-Stage CNN-Based Methods. The two-stage approach first generates a pool of object proposals with a separate proposal generator and then predicts accurate object regions and the corresponding class labels, such as R-CNN [21], SPP-Net [24], Fast R-CNN [20], Faster R-CNN [40], R-FCN [7], Mask R-CNN [23], and FPN [29].
R-CNN [21] is one of the pioneering works using a CNN model pre-trained on ImageNet: it extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. SPP-Net [24] proposes the SPP layer that pools the features and generates fixed-length outputs to remove the fixed input size constraint of the CNN model. In contrast to SPP [24], Fast R-CNN [20] designs a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations in an end-to-end way. Faster R-CNN [40] further improves Fast R-CNN by using a region proposal network instead of the selective search algorithm [44] to extract region proposals. The R-FCN method [7] develops a fully convolutional network (FCN) for object detection, which constructs a set of position-sensitive score maps using a bank of specialized convolutional layers to incorporate translation variance into the FCN. Recently, Lin et al. [29] exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost to improve detection performance. In [28], the head of the network is designed to be as light as possible to decrease the computation cost, using a thin feature map and a cheap R-CNN subnet (pooling and a single fully-connected layer). Zhang et al. [49] propose a new occlusion-aware R-CNN to improve pedestrian detection in crowded scenes, which designs an aggregation loss to enforce proposals to be close to and locate compactly around the corresponding objects. In general, the aforementioned methods share almost the same pipeline for object detection (i.e., object proposal generation, feature extraction, object classification, and bounding box regression), and the region proposal generation stage is the bottleneck for running efficiency.
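For reference, the sketch below shows this proposal-then-classify pipeline through the off-the-shelf Faster R-CNN implementation in torchvision; it is only an illustration of the two-stage structure, not the implementation used by any of the challenge baselines.

```python
import torch
import torchvision

# Off-the-shelf two-stage detector: an RPN produces proposals, which are then
# classified and regressed by the RoI head (the pipeline described above).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn()  # weights optional
model.eval()

image = torch.rand(3, 800, 800)          # dummy RGB image in [0, 1]
with torch.no_grad():
    output = model([image])[0]           # dict with 'boxes', 'labels', 'scores'
print(output["boxes"].shape, output["scores"][:5])
```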
One-Stage CNN-Based Methods. Different from the two-stage approach, the one-stage approach directly predicts object locations, shapes, and class labels without a proposal extraction stage, and can therefore run with high efficiency. The community has witnessed noticeable improvements in this direction, including YOLO [37], SSD [32], DSSD [18], RefineDet [50], and RetinaNet [30].
Specifically, YOLO [37] formulates object detection as a regression problem from the image to spatially separated bounding boxes and associated class probabilities. After that, Redmon et al. [38] improve YOLO in various aspects, such as adding batch normalization to all of the convolutional layers, using anchor boxes to predict bounding boxes, and using multi-scale training. SSD [32] takes advantage of a set of default anchor boxes with different aspect ratios and scales to discretize the output space of bounding boxes, and fuses predictions from multiple feature maps with different resolutions. DSSD [18] augments SSD with deconvolution layers to introduce additional large-scale context into object detection and improve accuracy, especially for small objects. Zhang et al. [51] enrich the semantics of object detection features within SSD by adding a semantic segmentation branch and a global activation module. Lin et al. [30] use the focal loss (RetinaNet) to address the class imbalance issue in object detection by reshaping the standard cross entropy loss so that it down-weights the loss assigned to well-classified examples. In addition, Zhang et al. [50] propose the single-shot detector RefineDet, which is formed by two inter-connected modules, i.e., the anchor refinement module and the object detection module, achieving high accuracy and efficiency. Moreover, Chen et al. [6] propose a dual refinement network to boost the performance of one-stage detectors, which considers anchor refinement and feature offset refinement in the anchor-offset detection.
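To make the down-weighting idea behind the focal loss concrete, the snippet below is a minimal PyTorch sketch of the sigmoid focal loss described in [30]; the modulating factor \((1-p_t)^{\gamma }\) suppresses the contribution of well-classified examples.

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: standard cross entropy scaled by (1 - p_t)^gamma so
    that easy, well-classified examples contribute almost nothing."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).sum()

# logits and targets share the same shape, e.g. (num_anchors, num_classes)
loss = sigmoid_focal_loss(torch.randn(8, 10), torch.randint(0, 2, (8, 10)).float())
```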
3 The VisDrone-DET2018 Challenge
As mentioned above, to track and advance the developments in object detection, we designed the VisDrone-DET2018 challenge, which focuses on detecting ten predefined categories of objects (i.e., pedestrian, person (see Footnote 1), car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle) in images from drones. We require each participating algorithm to predict the bounding boxes of objects in the predefined classes with a real-valued confidence. Some rarely occurring special vehicles (e.g., machineshop truck, forklift truck, and tanker) are ignored in the evaluation. The VisDrone-DET2018 dataset consists of 8,599 images (6,471 for training, 548 for validation, 1,580 for testing) with rich annotations, including object bounding boxes, object categories, occlusion, and truncation ratios. Featuring diverse real-world scenarios, the dataset was collected using various drone platforms (i.e., drones with different models), in different scenarios (across 14 different cities separated by thousands of kilometres), and under various weather and lighting conditions. The manually annotated ground truths of the training and validation sets are made available to users, but the ground truths of the testing set are reserved in order to avoid (over)fitting of algorithms. We encourage the participants to use the provided training data, while also allowing them to use additional training data. The use of external data must be indicated during submission.
3.1 Dataset
The dataset and annotations presented in this workshop are expected to be a significant contribution to the community. As mentioned above, we have collected and annotated a benchmark dataset consisting of 8,599 images captured by drone platforms at different places and different heights, which is much larger than any previously published drone-based dataset. Specifically, we manually annotated more than 540k bounding boxes of targets of ten predefined categories. Some example images are shown in Fig. 3. We present the number of objects per category under different occlusion degrees in the training, validation, and testing sets in Fig. 1, and plot the number of objects per image vs. the percentage of images in each subset to show the distribution of the number of objects per image in Fig. 2. The images of the three subsets are taken at different locations, but share similar environments and attributes.
In addition, we provide two kinds of useful annotations, the occlusion ratio and the truncation ratio. Specifically, we use the fraction of occluded pixels to define the occlusion ratio, and define three degrees of occlusion: no occlusion (occlusion ratio \(0\%\)), partial occlusion (occlusion ratio 1%–50%), and heavy occlusion (occlusion ratio \({>}{50\%}\)). The truncation ratio indicates the degree to which an object extends outside the frame. If an object is not fully captured within a frame, we annotate the bounding box inside the frame boundary and estimate the truncation ratio based on the region outside the image. It is worth mentioning that a target is skipped during evaluation if its truncation ratio is larger than \(50\%\).
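A minimal sketch of how these two annotations could be consumed, assuming both ratios are stored as fractions in [0, 1] (the exact annotation file format is defined by the released toolkit):

```python
def occlusion_degree(occlusion_ratio):
    """Map an occlusion ratio to the three degrees defined above."""
    if occlusion_ratio == 0.0:
        return "no occlusion"
    return "partial occlusion" if occlusion_ratio <= 0.5 else "heavy occlusion"

def used_in_evaluation(truncation_ratio):
    """Targets with more than 50% of their extent outside the frame are skipped."""
    return truncation_ratio <= 0.5

print(occlusion_degree(0.3), used_in_evaluation(0.6))  # partial occlusion False
```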
3.2 Evaluation Protocol
We require each participating algorithm to output a list of detected bounding boxes with confidence scores for each test image. Following the evaluation protocol in MS COCO [31], we use the AP\(^{\text {IoU}=0.50:0.05:0.95}\), AP\(^{\text {IoU}=0.50}\), AP\(^{\text {IoU}=0.75}\), AR\(^{\text {max}=1}\), AR\(^{\text {max}=10}\), AR\(^{\text {max}=100}\) and AR\(^{\text {max}=500}\) metrics to evaluate the results of detection algorithms. These criteria penalize missing detection of objects as well as duplicate detections (two detection results for the same object instance). Specifically, AP\(^{\text {IoU}=0.50:0.05:0.95}\) is computed by averaging over all 10 Intersection over Union (IoU) thresholds (i.e., in the range [0.50 : 0.95] with the uniform step size 0.05) of all categories, which is used as the primary metric for ranking. AP\(^{\text {IoU}=0.50}\) and AP\(^{\text {IoU}=0.75}\) are computed at the single IoU thresholds 0.5 and 0.75 over all categories, respectively. The AR\(^{\text {max}=1}\), AR\(^{\text {max}=10}\), AR\(^{\text {max}=100}\) and AR\(^{\text {max}=500}\) scores are the maximum recalls given 1, 10, 100 and 500 detections per image, averaged over all categories and IoU thresholds. Please refer to [31] for more details.
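The official evaluation is performed with the released toolkit; the snippet below is only a simplified, single-image, single-class NumPy sketch of the underlying idea, i.e., greedy matching at a fixed IoU threshold followed by averaging AP over the ten thresholds (it omits ignored regions, the truncation filter, and the 500-detections cap).

```python
import numpy as np

def iou_matrix(a, b):
    """Pairwise IoU between boxes in (x1, y1, x2, y2) format."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = np.maximum(a[:, None, :2], b[None, :, :2])
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def average_precision(dets, scores, gts, iou_thr):
    """AP at one IoU threshold via greedy matching in descending score order;
    duplicate detections of an already-matched ground truth count as false positives."""
    order = np.argsort(-scores)
    dets = dets[order]
    matched = np.zeros(len(gts), dtype=bool)
    tp = np.zeros(len(dets))
    ious = iou_matrix(dets, gts) if len(gts) else np.zeros((len(dets), 0))
    for i in range(len(dets)):
        j = int(ious[i].argmax()) if ious.shape[1] else -1
        if j >= 0 and ious[i, j] >= iou_thr and not matched[j]:
            tp[i], matched[j] = 1.0, True
    rec = np.cumsum(tp) / max(len(gts), 1)
    prec = np.cumsum(tp) / (np.arange(len(dets)) + 1)
    # 101-point interpolated AP, as in the COCO protocol
    return float(np.mean([prec[rec >= r].max() if (rec >= r).any() else 0.0
                          for r in np.linspace(0, 1, 101)]))

def primary_metric(dets, scores, gts):
    """AP averaged over the 10 IoU thresholds 0.50:0.05:0.95."""
    return float(np.mean([average_precision(dets, scores, gts, t)
                          for t in np.linspace(0.5, 0.95, 10)]))
```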
4 Results and Analysis
4.1 Submitted Detectors
In total, 34 different object detection methods from 31 different institutes were submitted to the VisDrone-DET2018 challenge. The VisDrone committee also reports the results of 4 baseline methods, i.e., FPN (A.35) [29], R-FCN (A.36) [7], Faster R-CNN (A.37) [40], and SSD (A.38) [32]. For these baselines, the default parameters are used or set to reasonable values. Thus, 38 algorithms in total are included in the VisDrone-DET2018 challenge. We present a brief overview of the entries and provide the algorithm descriptions in Appendix A.
Nine submitted detectors improve the Faster R-CNN method [40], namely JNU_Faster RCNN (A.5), Faster R-CNN3 (A.7), MMN (A.9), CERTH-ODI (A.13), MFaster-RCNN (A.14), Faster R-CNN2 (A.16), IITH DODO (A.18), Faster R-CNN+ (A.19), and DPNet (A.34). Seven detectors are based on the FPN method [29], including FPN+ (A.1), DE-FPN (A.3), DFS (A.4), FPN2 (A.11), DDFPN (A.17), FPN3 (A.21), and DenseFPN (A.22). Three detectors are inspired from RetinaNet [30], including Keras-RetinaNet (A.27), RetinaNet2 (A.28), and HAL-Retina-Net (A.32). Three detectors, i.e., RefineDet+ (A.10), RD\(^4\)MS (A.24), and R-SSRN (A.30), are based on the RefineDet method [50]. Five detectors, i.e., YOLOv3+ (A.6), YOLOv3++ (A.12), YOLOv3_DP (A.26), MSYOLO (A.29) and SODLSY (A.33), are based on the YOLOv3 method [39]. CFE-SSDv2 (A.15) is based on the SSD method [32]. SOD (A.23) is based on the R-FCN method [7]. L-H RCNN+ (A.25) is modified from the light-head R-CNN method [28]. AHOD (A.31) is a feature fusion backbone network with the capability of modeling geometric transformations. MSCNN (A.20) is formed by two sub-networks: a multi-scale object proposal network (MS-OPN) [4] and an accurate object detection network (AODN) [5]. YOLO-R-CNN (A.2) and MMF (A.8) are the combinations of YOLOv3 and Faster R-CNN. We summarize the submitted algorithms in Table 3.
4.2 Overall Results
The overall results of the submissions are presented in Table 2. As shown in Table 2, HAL-Retina-Net (A.32) and DPNet (A.34) are the only two algorithms achieving more than a \(30\%\) AP score. HAL-Retina-Net (A.32) uses the SE module [27] and a downsampling-upsampling branch [46] to learn channel attention and spatial attention. DPNet (A.34) employs the FPN framework [29] to capture context information in feature maps of different scales. DE-FPN (A.3) and CFE-SSDv2 (A.15) rank in third and fourth place, respectively, each with more than a \(25\%\) AP score. We also report the detection results for each object category in Table 4. As shown in Table 4, the top three results for every object category are produced by the detectors with the top four overall AP scores (see Table 2), i.e., HAL-Retina-Net (A.32), DPNet (A.34), DE-FPN (A.3), and CFE-SSDv2 (A.15).
Among the 4 baseline methods provided by the VisDrone committee, FPN (A.35) achieves the best performance, SSD (A.38) performs the worst, and R-FCN (A.36) performs better than Faster R-CNN (A.37). This ranking is consistent with the relative performance of these algorithms on the MS COCO dataset [31].
- SSD (A.38) performs the worst, producing only a \(2.52\%\) AP score. CFE-SSDv2 (A.15) improves SSD (A.38) with a new comprehensive feature enhancement mechanism that highlights the weak features of small objects, and adopts multi-scale testing to further improve the performance. This brings a significant improvement in the AP score (i.e., \(26.48\%\)), ranking in fourth place.
- Faster R-CNN (A.37) performs slightly better than SSD, with a \(2.89\%\) AP score. DPNet (A.34) uses three Faster R-CNN models to detect objects at different scales. Specifically, the authors train FPN [29]-based Faster R-CNN models at multiple scales (i.e., \(1000\times 1000\), \(800\times 800\), \(600\times 600\)), achieving the second best AP score (\(30.92\%\)). Faster R-CNN2 (A.16) and Faster R-CNN+ (A.19) design the anchor sizes to adapt to the distribution of objects, producing \(21.34\%\) and \(9.67\%\) AP scores, respectively. MFaster-RCNN (A.14) replaces the RoI pooling layer with the RoI align layer proposed in Mask R-CNN [23] to obtain better results on small objects, obtaining an \(18.08\%\) AP score.
- R-FCN (A.36) achieves much better performance than SSD and Faster R-CNN, producing \(7.20\%\) AP. However, its accuracy is still not satisfactory. SOD (A.23) uses a pyramid-like prediction network for the RPN and R-FCN [7] to improve object detection performance. In this way, the predictions made by higher-level feature maps contain stronger contextual semantics, while the lower-level ones integrate more localized information at finer spatial resolution. It achieves a \(0.93\%\) higher AP score than R-FCN (A.36), i.e., \(8.27\%\) vs. \(7.20\%\).
- FPN (A.35) performs the best among the 4 baseline methods, achieving a \(13.36\%\) AP score and ranking in the middle of all submissions. We speculate that the semantic feature maps extracted at all scales are effective in dealing with objects of various sizes. To further improve the accuracy, DE-FPN (A.3) enhances the data augmentation with image cropping and color jitter, achieving \(27.10\%\) AP and ranking in third place. DDFPN (A.17) uses the DBPN [22] super-resolution network to up-sample the images, producing \(21.05\%\) AP. FPN2 (A.11) implements an additional keypoint classification module to help locate objects, improving the AP score by \(2.79\%\) compared to FPN (A.35).
4.3 Discussion
As shown in Table 3, we find that 18 detectors perform better than all the baseline methods. The best detector HAL-Retina-Net (A.32) achieves \(31.88\%\) AP score, which is still far from satisfactory in real applications. In the following, we discuss some critical issues in object detection on drone platforms.
Large Scale Variations. As shown in Fig. 3, the objects exhibit substantial differences in scale, even within the same category. For example, as shown in the top-left of Fig. 3, cars at the bottom of the image appear much larger than cars on the top-right side of the image. This factor greatly challenges the performance of the detectors. For better performance, it is necessary to redesign the anchor scales to adapt to the scales of objects in the dataset, and it would also be interesting to design an automatic mechanism to handle objects with large scale variations. Meanwhile, fusing multi-level convolutional features to integrate contextual semantic information is also effective in handling scale variations, as in the architecture of FPN (A.35). In addition, multi-scale testing and model ensembles help deal with scale variations.
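Several entries redesigned their anchors by clustering the training boxes (e.g., A.8, A.14, and A.30 use k-means); the sketch below illustrates one such data-driven choice of anchor scales, assuming the box widths and heights have been read from the training annotations (the random inputs in the usage line are hypothetical placeholders).

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_scales_from_boxes(widths, heights, k=5, seed=0):
    """Cluster sqrt(w * h) of the annotated boxes so that anchor scales follow
    the actual size distribution instead of generic defaults."""
    sizes = np.sqrt(np.asarray(widths, float) * np.asarray(heights, float))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(sizes.reshape(-1, 1))
    return sorted(float(c) for c in km.cluster_centers_.ravel())

# hypothetical usage: widths/heights would come from the VisDrone training annotations
print(anchor_scales_from_boxes(np.random.gamma(2.0, 15.0, 1000),
                               np.random.gamma(2.0, 15.0, 1000)))
```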
Occlusion. Occlusion is one of the critical issues challenging the detection performance, especially in our VisDrone2018 dataset (see Fig. 3). For example, as shown in Fig. 1, most of the instances in the bus and motor categories are occluded by other objects or background obstacles, which greatly hurts the detection performance. Specifically, the best detector HAL-Retina-Net (A.32) produces less than \(20\%\) AP in these two categories, and all the other detectors produce less than \(1\%\) AP on the motor class. In summary, it is important and urgent to design effective strategies for handling occlusion in order to improve the detection performance.
Class Imbalance. Class imbalance is another issue in object detection. As shown in Fig. 1, there are far fewer awning-tricycle, tricycle, and bus instances in the training set than instances of the car and pedestrian classes. Most of the detectors perform much better on the car and pedestrian classes than on the awning-tricycle, tricycle, and bus classes. For example, DPNet (A.34) produces \(45.06\%\) and \(54.62\%\) AP on the car and pedestrian classes, while producing only \(11.85\%\), \(21.79\%\), and \(3.78\%\) AP on the awning-tricycle, tricycle, and bus classes; see Table 4 for more details. The most straightforward and common approach is to use a sampling strategy to balance the samples of different classes. Meanwhile, some methods (e.g., Keras-RetinaNet (A.27) and RetinaNet2 (A.28)) weight different object classes in the loss function to handle this issue, for example via the focal loss [30]. How to solve the class imbalance issue remains an open problem.
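As a simple illustration of the re-weighting idea, the sketch below derives inverse-frequency class weights from per-class instance counts (the counts here are hypothetical placeholders, not the actual VisDrone statistics); such weights can be plugged into a weighted loss or used to bias a sampler.

```python
import numpy as np

def class_weights(instance_counts):
    """Inverse-frequency weights, normalized so that the mean weight is 1."""
    counts = np.asarray(instance_counts, dtype=float)
    weights = 1.0 / counts
    return weights / weights.mean()

# hypothetical counts: a frequent class, a mid-frequency class, a rare class
print(class_weights([100000, 20000, 3000]))  # rare classes get larger weights
```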
5 Conclusions
This paper reviews the VisDrone-DET2018 challenge and its results. The challenge provides a large-scale drone-based object detection dataset, including 8,599 images (6,471 for training, 548 for validation, and 1,580 for testing) with rich annotations, including object bounding boxes, object categories, occlusion status, truncation ratios, etc. A set of 38 detectors has been evaluated on the released dataset. A large percentage of them have been published in recent top conferences and journals, such as ICCV, CVPR, and TPAMI, and some of them have not yet been published (available on arXiv). The top three detectors are HAL-Retina-Net (A.32), DPNet (A.34), and DE-FPN (A.3), achieving \(31.88\%\), \(30.92\%\), and \(27.10\%\) AP, respectively.
The primary objective of VisDrone-DET2018 is to establish a community-based common platform for the discussion and evaluation of detection performance on drones. The challenge not only serves as a meeting place for researchers in this area, but also highlights major issues and potential opportunities. We hope the released dataset allows for the development and comparison of algorithms in the object detection field, and that the workshop challenge provides a way to track progress. Our future work will focus on revising the evaluation kit and dataset, as well as including more challenging vision tasks on the drone platform, based on feedback from the community.
Notes
1. If a human maintains a standing pose or is walking, we classify it as a pedestrian; otherwise, it is classified as a person.
References
Agarwal, S., Awan, A., Roth, D.: Learning to detect objects in images via a sparse, part-based representation. TPAMI 26(11), 1475–1490 (2004)
Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-detection-by-tracking. In: CVPR. IEEE Computer Society (2008)
Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-NMS - improving object detection with one line of code. In: ICCV, pp. 5562–5570 (2017)
Cai, Z., Fan, Q., Feris, R.S., Vasconcelos, N.: A unified multi-scale deep convolutional neural network for fast object detection. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 354–370. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_22
Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. CoRR abs/1712.00726 (2017)
Chen, X., Wu, Z., Yu, J.: Dual refinement network for single-shot object detection. CoRR abs/1807.08638 (2018). http://arxiv.org/abs/1807.08638
Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: NIPS, pp. 379–387 (2016)
Dai, J., et al.: Deformable convolutional networks. In: ICCV, pp. 764–773 (2017)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. 886–893 (2005)
Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)
Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: an evaluation of the state of the art. TPAMI 34(4), 743–761 (2012)
Enzweiler, M., Gavrila, D.M.: Monocular pedestrian detection: survey and experiments. TPAMI 31(12), 2179–2195 (2009)
Ess, A., Leibe, B., Gool, L.J.V.: Depth and appearance for mobile scene analysis. In: ICCV, pp. 1–8 (2007)
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC 2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
Everingham, M., Eslami, S.M.A., Gool, L.J.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes challenge: a retrospective. IJCV 111(1), 98–136 (2015)
Everingham, M., Gool, L.J.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010)
Felzenszwalb, P.F., Girshick, R.B., McAllester, D.A., Ramanan, D.: Object detection with discriminatively trained part-based models. TPAMI 32(9), 1627–1645 (2010)
Fu, C., Liu, W., Ranga, A., Tyagi, A., Berg, A.C.: DSSD: deconvolutional single shot detector. CoRR abs/1701.06659 (2017). http://arxiv.org/abs/1701.06659
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR, pp. 3354–3361 (2012)
Girshick, R.B.: Fast R-CNN. In: ICCV, pp. 1440–1448 (2015)
Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, pp. 580–587 (2014)
Haris, M., Shakhnarovich, G., Ukita, N.: Deep back-projection networks for super-resolution. CoRR abs/1803.02735 (2018)
He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: ICCV, pp. 2980–2988 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI 37(9), 1904–1916 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Hsieh, M., Lin, Y., Hsu, W.H.: Drone-based object counting by spatially regularized regional proposal network. In: ICCV (2017)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. CoRR abs/1709.01507 (2017)
Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., Sun, J.: Light-head R-CNN: in defense of two-stage object detector. CoRR abs/1711.07264 (2017)
Lin, T., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: CVPR, pp. 936–944 (2017)
Lin, T., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV, pp. 2999–3007 (2017)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Lyu, S., et al.: UA-DETRAC 2017: report of AVSS2017 & IWT4S challenge on advanced traffic monitoring. In: AVSS, pp. 1–7 (2017)
Mundhenk, T.N., Konjevod, G., Sakla, W.A., Boakye, K.: A large contextual dataset for classification, detection and counting of cars with deep learning. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 785–800. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_48
Özuysal, M., Lepetit, V., Fua, P.: Pose estimation for category specific multiview object localization. In: CVPR, pp. 778–785 (2009)
Razakarivony, S., Jurie, F.: Vehicle detection in aerial imagery : a small target detection benchmark. J. Vis. Commun. Image Represent. 34, 187–203 (2016)
Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR, pp. 779–788 (2016)
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: CVPR, pp. 6517–6525 (2017)
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. CoRR abs/1804.02767 (2018)
Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. TPAMI 39(6), 1137–1149 (2017)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3), 211–252 (2015)
Shrivastava, A., Gupta, A., Girshick, R.B.: Training region-based object detectors with online hard example mining. In: CVPR, pp. 761–769 (2016)
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI, pp. 4278–4284 (2017)
Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. IJCV 104(2), 154–171 (2013)
Viola, P.A., Jones, M.J.: Rapid object detection using a boosted cascade of simple features. In: CVPR, pp. 511–518 (2001)
Wang, F., et al.: Residual attention network for image classification. In: CVPR, pp. 6450–6458 (2017)
Wen, L., et al.: UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. CoRR abs/1511.04136 (2015)
Xie, S., Girshick, R.B., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR, pp. 5987–5995 (2017)
Zhang, S., Wen, L., Bian, X., Lei, Z., Li, S.Z.: Occlusion-aware R-CNN: detecting pedestrians in a crowd. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 657–674. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_39
Zhang, S., Wen, L., Bian, X., Lei, Z., Li, S.Z.: Single-shot refinement neural network for object detection. In: CVPR (2018)
Zhang, Z., Qiao, S., Xie, C., Shen, W., Wang, B., Yuille, A.L.: Single-shot object detection with enriched semantics. In: CVPR (2018)
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61502332 and Grant 61732011, in part by Natural Science Foundation of Tianjin under Grant 17JCZDJC30800, in part by US National Science Foundation under Grant IIS-1407156 and Grant IIS-1350521, and in part by Beijing Seetatech Technology Co., Ltd and GE Global Research.
A Submitted Detectors
In this appendix, we provide a short summary of all algorithms that participated in the VisDrone2018 competition, ordered according to the submission of their final results.
1.1 A.1 Improved Feature Pyramid Network (FPN+)
Karthik Suresh, Hongyu Xu, Nitin Bansal, Chase Brown, Yunchao Wei, Zhangyang Wang, Honghui Shi
k21993@tamu.edu, xuhongyu2006@gmail.com, bansa01@tamu.edu
chasebrown42@tamu.edu, wychao1987@gmail.com, atlaswang@tamu.edu
honghui.shi@ibm.com
FPN+ is improved from the Feature Pyramid Network (FPN) model [29]. The main changes we made are summarized as follows: (1) We resize the input images to different scales; (2) We use more scales of smaller anchors; (3) We ensemble FPN models with different anchors and parameters; (4) We employ NMS as another post-processing step to avoid overlapping boxes, together with multi-scale testing. Specifically, we use an FPN with ResNet-101 weights pre-trained on ImageNet as the backbone. We also attempt to make some changes to the training data (resizing it to different shapes, cutting it into pieces, etc.).
1.2 A.2 Fusion of Faster R-CNN and YOLOv3 (YOLO-R-CNN)
Wenchi Ma, Yuanwei Wu, Usman Sajid, Guanghui Wang
{wenchima, y262w558,usajid,ghwang}@ku.edu
YOLO-R-CNN is essentially a voting algorithm designed for object detection. Instead of the widely used feature-level fusion for deep neural networks, our approach works at the detection level. We train two different DCNN models, i.e., Faster R-CNN [40] and YOLOv3 [39]. The final detection results are then produced by voting, i.e., weighted averages of the two models above.
1.3 A.3 Data Enhanced Feature Pyramid Network (DE-FPN)
Jingkai Zhou, Yi Luo, Hu Lin, Qiong Liu
{201510105876, 201721045510, 201721045497}@mail.scut.edu.cn
liuqiong@scut.edu.cn
DE-FPN is based on the Feature Pyramid Network (FPN) model [29] with data enhancement. Specifically, we enhance the training data by image cropping and color jitter. We use ResNeXt-101 64\(\times \)4d as the backbone of FPN with a COCO pre-trained model. We remove level 6 of FPN to improve small object detection.
1.4 A.4 Focal Loss for Object Detection (DFS)
Ke Bo
kebo3@mail2.sysu.edu.cn
DFS is based on ResNet-101 and Feature Pyramid Networks [29]. The features from conv2_x are also used to detect objects, which gains about a \(1\%\) improvement in mAP. Our model also uses other techniques, including multi-scale training and testing, deformable convolution, and Soft-NMS.
1.5 A.5 Faster R-CNN by Jiangnan University (JNU_Faster RCNN)
Haipeng Zhang
6161910043@vip.jiangnan.edu.cn
JNU_Faster RCNN is based on the Faster R-CNN algorithm [40] to complete the detection task. The source code is from the GitHub repository named faster-rcnn.pytorch. We use the train and validation sets of the VisDrone2018-DET dataset without additional training data to train this model. The pre-trained model is Faster R-CNN with a ResNet-101 backbone.
1.6 A.6 Improved YOLOv3 (YOLOv3+)
Siwei Wang, Xintao Lian
285111284@qq.com
YOLOv3+ is improved from YOLOv3 [39]. Specifically, we fine-tune models pre-trained on the COCO dataset using the VisDrone2018-DET train set.
1.7 A.7 Improved Faster R-CNN: (Faster R-CNN3)
Yiling Liu, Ying Li
liulingyi601@mail.nwpu.edu.cn, lybyp@nwpu.edu.cn
Faster R-CNN3 is based on Faster R-CNN [40]. We only use the VisDrone2018 train set for training. Our algorithm is implemented with PyTorch on Ubuntu with 2 TITAN XP GPUs. The testing speed is about 7 s per image. The base network of Faster R-CNN is ResNet-101.
1.8 A.8 The Object Detection Algorithm Based on Multi-Model Fusion (MMF)
Yuqin Zhang, Weikun Wu, Zhiyao Guo, Minyu Huang
{23020161153381,23020171153097}@stu.xmu.edu.cn
{23020171153021,23020171153029}@stu.xmu.edu.cn
MMF is a multi-model fusion of Faster R-CNN [40] and YOLOv3 [39]. The Faster R-CNN part is a modification of a published implementation: we re-write the code and re-set the parameters, including the learning rate, gamma, step size, scales, anchors, and ratios, and use ResNet-152 as the backbone. The YOLOv3 part is also a modification of a published implementation, where we modify the anchor settings using the k-means++ algorithm.
Since the number of objects in different categories is very unbalanced in the train set, we adopt a multi-model fusion method to improve the accuracy. Specifically, the car category is trained using the Faster R-CNN algorithm, and the remaining categories are trained using the YOLOv3 algorithm. Moreover, the remaining categories are divided into two types: one for pedestrian and people, and the other for bicycle, van, truck, tricycle, awning-tricycle, bus and motor. Finally, the detection result is determined by the three models.
1.9 A.9 Multi-Model Net Based on Faster R-CNN (MMN)
Xin Sun
sunxin@ouc.edu.cn
MMN is based on the Faster R-CNN network [40]. We first crop the training images into small sizes to avoid the resize operation. Then these cropped images are used to train different Faster R-CNN networks. Finally, we merge the results to obtain the best detection result.
1.10 A.10 An Improved Object Detector Based on the Single-Shot Refinement Neural Network (RefineDet+)
Kaiwen Duan, Honggang Qi, Qingming Huang
duankaiwen17@mails.ucas.ac.cn, hgqi@jdl.ac.cn, qmhuang@ucas.ac.cn
RefineDet+ improves the single-shot refinement neural network (RefineDet) [50] with a new anchor matching strategy based on center point translation of anchors (CPTMatching). During the training phase, the detector needs to determine which anchors correspond to an object bounding box. RefineDet first matches each object to the anchor with the highest Jaccard overlap and then matches each anchor to an object with Jaccard overlap higher than a threshold (usually 0.5). However, some nearby anchors whose Jaccard overlap is lower than the threshold may also help the bounding box regression. In our CPTMatching, we first select the bounding boxes predicted by the anchor refinement module (ARM) [50] whose Jaccard overlap with any ground-truth object is higher than 0.5. For each selected bounding box, we compute a measurement \(\beta \), the ratio of the center point distance between its corresponding anchor and its matched ground-truth box to the scale of the anchor, and discard anchors whose \(\beta \) is larger than a threshold. The remaining anchors are called potential valid anchors. Finally, we align the center point of each potential valid anchor to the center of its nearest ground-truth box, and preserve anchors whose Jaccard overlap with the aligned ground truth is higher than 0.6.
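A sketch of how the measurement \(\beta \) could be computed, assuming axis-aligned boxes in (x1, y1, x2, y2) format and taking \(\sqrt{wh}\) as the anchor scale (the exact scale definition is an assumption here):

```python
import numpy as np

def beta(anchor, matched_gt):
    """Ratio of the anchor-to-ground-truth center distance to the anchor scale."""
    ax, ay = (anchor[0] + anchor[2]) / 2.0, (anchor[1] + anchor[3]) / 2.0
    gx, gy = (matched_gt[0] + matched_gt[2]) / 2.0, (matched_gt[1] + matched_gt[3]) / 2.0
    center_dist = np.hypot(ax - gx, ay - gy)
    anchor_scale = np.sqrt((anchor[2] - anchor[0]) * (anchor[3] - anchor[1]))
    return center_dist / anchor_scale

print(beta((10, 10, 50, 50), (20, 12, 60, 52)))  # small beta: centers are nearby
```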
1.11 A.11 An Improved Object Detector Based on Feature Pyramid Networks (FPN2)
Zhenwei He, Lei Zhang
{hzw, leizhang}@cqu.edu.cn
FPN2 is based on the Feature Pyramid Networks (FPN) object detection framework [29]. To obtain better detection results, we improve the original FPN in three aspects:
- Data expansion. We extend the training set by clipping the images, where each clipped image contains at least one object. The new images cover different proportions of the original images: in our implementation, the proportions of the width or height are set to 0.5 and 0.7, resulting in 4 kinds of ratios ([0.5, 0.5], [0.5, 0.7], [0.7, 0.7], [0.7, 0.5] for width and height, respectively). As a result, the extended dataset has 5 times the number of training images of the original dataset.
- Keypoint classification. We implement an auxiliary keypoint classification task to further improve detection accuracy. Since the bounding box is the borderline between foreground and background, we take the 4 corners and the center of the bounding box as keypoints of the corresponding object. In our implementation, the 4 corners are annotated as background, while the center is annotated as the category of the corresponding object.
- Fusion of different models. We train our deep model on the different expanded datasets to obtain different models. First, we apply NMS to generate the detection results of each deep model. Then, for each bounding box, we count the number of models that produce it with a score greater than the threshold: if the number is more than half of the deep models, we keep the bounding box; otherwise we discard it. Finally, we perform NMS again to generate the final detection results (a sketch of this voting step is given after the list).
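A minimal sketch of one plausible reading of the voting step above (a box is kept only when boxes above the score threshold from more than half of the models overlap it); the IoU threshold and the overlap-based matching rule are assumptions:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def majority_vote(per_model_dets, score_thr=0.5, iou_thr=0.5):
    """per_model_dets: one list of (box, score) pairs per model (after per-model NMS).
    Returns the boxes supported by more than half of the models."""
    n_models = len(per_model_dets)
    kept = []
    for m, dets in enumerate(per_model_dets):
        for box, score in dets:
            if score < score_thr:
                continue
            votes = 1 + sum(  # the model that produced the box counts as one vote
                any(s >= score_thr and iou(box, b) >= iou_thr for b, s in other)
                for k, other in enumerate(per_model_dets) if k != m)
            if votes > n_models / 2:
                kept.append((box, score))
    return kept  # a final NMS pass then removes the remaining duplicates
```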
1.12 A.12 Modified YOLOv3 (YOLOv3++)
Yuanwei Wu, Wenchi Ma, Usman Sajid, Guanghui Wang
YOLOv3++ is based on YOLOv3 [39], a one-stage detection method that does not use object proposals. We follow the default YOLOv3 settings during training. To improve detection performance, we conduct experiments that increase the network resolution at inference and training time, and recalculate the anchor box priors on the VisDrone dataset. We only use the provided training set to train YOLOv3 without additional training data, and evaluate the performance on the validation set. Then, the training and validation sets are combined to train a new YOLOv3 model, and the predicted class probabilities and bounding box positions on the testing set are submitted as our final submission.
1.13 A.13 CERTH’s Object Detector in Images (CERTH-ODI)
Emmanouil Michail, Konstantinos Avgerinakis, Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris
{michem, koafgeri, giannakeris, stefanos, ikom}@iti.gr
CERTH-ODI is trained on the whole training set of the VisDrone2018-DET dataset. However, since pedestrians and cars were dominant compared to the other classes, we removed several thousand car and pedestrian annotations to balance the training set. For training we use the Inception ResNet-v2 Faster R-CNN model pre-trained on the MS COCO dataset. In order to provide more accurate results, we use a combination of different training set-ups: one with all the available object classes trained for 800,000 training steps, one with four-wheel vehicles only (i.e., car, van, truck, and bus, because they share similar characteristics), and one with the remaining classes. We apply each model separately to each image, run NMS on the results, and afterwards merge all the resulting bounding boxes from the different training models. Subsequently, we reject overlapping bounding boxes with an IoU above 0.6, chosen empirically, excluding several class combinations that tend to overlap highly, such as people-bicycle and people-motor.
1.14 A.14 Modified Faster-RCNN for Small Objects Detection (MFaster-RCNN)
Wenrui He, Feng Zhu
{hewenrui,zhufeng}@bupt.edu.cn
MFaster-RCNN is improved from the Faster R-CNN model [40]. Our method only uses the VisDrone2018-DET train set, with data augmentation including cropping, zooming, and flipping. We use a pre-trained ResNet-101 as the backbone due to GPU memory limits. The tuned hyper-parameters are as follows: (1) The anchor ratios are adjusted from [0.5, 1, 2] to [0.5, 1.5, 2.5], calculated by k-means on the training data. (2) The base size of the anchors remains 16, but the multiplicative scales are adjusted from [4, 8, 16] to [1, 2, 4, 8, 16] to detect very small objects. (3) The RPN positive overlap threshold, which decides whether a proposal is regarded as a positive sample for training the RPN, is adjusted from 0.7 to 0.5, while the RPN negative overlap threshold is adjusted from 0.3 to 0.2. (4) The foreground and background thresholds for the Fast R-CNN part are 0.5 and 0.1, respectively. The foreground fraction is adjusted from 0.25 to 0.4, as we find these values perform best in practice. (5) The maximal number of ground-truth boxes used for training in one input image is adjusted from 20 to 60, since we have an enormous number of training samples per image on average.
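The reported changes can be summarized in a config-style sketch; the key names below are illustrative placeholders rather than the options of any particular Faster R-CNN codebase.

```python
# Hypothetical configuration summary of the changes described above.
mfaster_rcnn_cfg = dict(
    anchor_ratios=[0.5, 1.5, 2.5],     # from k-means on the training boxes (was [0.5, 1, 2])
    anchor_base_size=16,
    anchor_scales=[1, 2, 4, 8, 16],    # adds very small anchors (was [4, 8, 16])
    rpn_positive_overlap=0.5,          # was 0.7
    rpn_negative_overlap=0.2,          # was 0.3
    rcnn_fg_thresh=0.5,
    rcnn_bg_thresh_hi=0.1,
    rcnn_fg_fraction=0.4,              # was 0.25
    max_gt_boxes_per_image=60,         # was 20
)
print(mfaster_rcnn_cfg["anchor_scales"])
```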
1.15 A.15 SSD with Comprehensive Feature Enhancement (CFE-SSDv2)
Qijie Zhao, Feng Ni, Yongtao Wang
{zhaoqijie,nifeng,wyt}@pku.edu.cn
CFE-SSDv2 is an end-to-end one-stage object detector with a specially designed novel module, namely the Comprehensive Feature Enhancement (CFE) module. We first improve the original SSD model [32] by enhancing the weak features used for detecting small objects; our CFE-SSDv2 is designed to enhance the detection ability for small objects. In addition, we apply a multi-scale inference strategy: although training is performed on an input size of \(800\times 800\), we broaden the input size to \(2200\times 2200\) at inference time, leading to further improvement in detecting small objects.
1.16 A.16 Faster R-CNN Based Object Detection (Faster R-CNN2)
Fan Zhang
zhangfan_1@stu.xidian.edu.cn
Faster R-CNN2 is based on the VisDrone2018-DET dataset and Faster R-CNN [40], with some parameters adjusted. For example, we add a small anchor scale of \(64^2\) to detect small objects and reduce the mini-batch size from 256 to 128.
1.17 A.17 DBPN+Deformable FPN+Soft NMS (DDFPN)
Liyu Lu
coordinate@tju.edu.cn
DDFPN is designed for small object detection. Since the dataset contains a large number of small objects, we scale up the original image first and then detect the objects. We use the DBPN [22] super-resolution network to upsample the image. The model used for the detection task is Deformable FPN [8, 29]. Besides, we use Soft-NMS [3] as our non-maximum suppression algorithm.
For network training, we first divide the input images into patches of size \(1024\times 1024\), obtaining 23,602 training images and their corresponding labels as the training set for Deformable FPN. Our training process uses the OHEM training method [42]. The learning rate is 0.001, and the input size used for training is \(1024\times 1024\). ResNet-101 is used as the backbone, and the weights are initialized using a model pre-trained on ImageNet.
For network testing, we use the same method as for the training set to divide each test image into patches of size \(512\times 512\). Next, we up-sample these test patches to \(1024\times 1024\) via the DBPN network, and then send them to our trained Deformable FPN to obtain \(1024\times 1024\) results; each such result corresponds to a \(512\times 512\) region of the original image. Since results at different scales have complementary blind spots, we also test with multi-scale images, i.e., [688, 688], [800, 800], [1200, 1200], [1400, 1400], [1600, 1600], [2000, 2000]. Finally, we merge the results at each scale derived from the same image back into a single image to obtain the final test results.
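A simplified sketch of the split-detect-merge scheme (leaving out the DBPN super-resolution and multi-scale steps): images are tiled, each tile is passed to the detector, and the tile offsets are added back to the predicted boxes before a final NMS; run_detector is a hypothetical placeholder.

```python
def tile_image(image, tile=512):
    """Split an (H, W, C) array into tile x tile patches, keeping each patch's offset."""
    h, w = image.shape[:2]
    return [(image[y:y + tile, x:x + tile], (x, y))
            for y in range(0, h, tile) for x in range(0, w, tile)]

def detect_by_tiles(image, run_detector, tile=512):
    """run_detector(patch) -> list of (x1, y1, x2, y2, score) in patch coordinates."""
    merged = []
    for patch, (ox, oy) in tile_image(image, tile):
        for x1, y1, x2, y2, score in run_detector(patch):
            merged.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy, score))
    return merged  # followed by NMS / Soft-NMS over the whole image
```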
1.18 A.18 IIT-H Drone Object DetectiOn (IITH DODO)
Nehal Mamgain, Naveen Kumar Vedurupaka, K. J. Joseph, Vineeth N. Balasubramanian
cs17mtech11023@iith.ac.in, naveenkumarvedurupaka@gmail.com
{cs17m18p100001, vineethnb}@iith.ac.in
IITH DODO is based on the Faster R-CNN architecture [40]. Faster R-CNN has a Region Proposal Network that is trained end-to-end and shares convolutional features with the detection network, thus reducing the computational cost of high-quality region proposals. Our model uses the Inception ResNet-v2 [43] backbone for Faster R-CNN, pre-trained on the COCO dataset. The anchor sizes are adapted to improve the performance of the detector on small objects. To reduce the complexity of the model, anchors of only a single aspect ratio are used. Non-maximum suppression is applied both to the region proposals and to the final bounding box predictions. Atrous convolutions are also used. No external data is used for training and no test-time augmentation is performed. The reported performance is the result of the detection pipeline without any ensemble.
1.19 A.19 Adjusted Faster Region-Based Convolutional Neural Networks (Faster R-CNN+)
Tiaojio Lee, Yue Fan, Han Deng, Lin Ma, Wei Zhang
{tianjiao.lee, fanyue}@mail.sdu.edu.cn, 67443542@qq.com
forest.linma@gmail.com, davidzhang@sdu.edu.cn
Faster R-CNN+ basically follows the original Faster R-CNN algorithm [40]. However, we make a few adjustments to adapt it to the VisDrone-DET dataset. The dataset contains objects of widely varying sizes, which leads to a multi-scale object detection problem. To mitigate the impact of the relatively rapid changes in bounding box scales, we add more anchors with large sizes to fit larger objects and keep the small anchors unchanged for detecting tiny objects, such as people and cars at long distances. Moreover, the VisDrone-DET dataset has an unbalanced object distribution. When testing on the validation set, we find that the classification performance for cars is much better than for other classes because cars appear more frequently. To alleviate this problem, we mask out some car bounding boxes by hand to pursue better classification performance.
1.20 A.20 Multi-Scale Convolutional Neural Networks (MSCNN)
Dongdong Li, Yangliu Kuai, Hao Liu, Zhipeng Deng, Juanping Zhao
moqimubai@sina.cn
MSCNN is a unified and effective deep CNN based approach for simultaneously detecting multi-class objects in UAV images with large scale variability. Similar to Faster R-CNN, our method consists of two sub-networks: a multi-scale object proposal network (MS-OPN) [4] and an accurate object detection network (AODN) [5]. Firstly, we redesign the architecture of the feature extractor by adopting some recent building blocks, such as the inception module, which increases the variety of receptive field sizes. To ease the inconsistency between the variability of object sizes and the fixed filter receptive fields, MS-OPN is applied to several intermediate feature maps, according to the scale ranges of different objects. That is, larger objects are proposed in deeper feature maps with highly abstracted information, whereas smaller objects are proposed in shallower feature maps with fine-grained details. The object proposals from the various intermediate feature maps are combined to form the output of MS-OPN, and are then sent to the AODN for accurate object detection. For detecting small objects that appear in groups, AODN combines several outputs of intermediate layers to increase the resolution of the feature maps, enabling small and densely packed objects to produce larger regions of strong response.
1.21 A.21 Feature Pyramid Networks for Object Detection (FPN3)
Chengzheng Li, Zhen Cui
czhengli@njust.edu.cn, zhen.cui@njust.edu.cn
FPN3 follows Faster R-CNN [40] with the feature pyramid [29]. We make some modifications to the algorithm. First of all, since most of the objects in the VisDrone-DET2018 dataset are quite small, we add another pyramid stage on top of the original P2-P6 layers: we take the output of conv1 in ResNet [25] (before the pooling layer) as C1 and transform it into P1, whose stride is 1/2, as done in FPN; the anchor size of this stage is 16, and the additional stage is used to detect smaller objects. Secondly, we replace the parameter-free nearest-neighbor up-sampling with deconvolution layers, which have learnable parameters like convolution layers, since layers with parameters give better performance. In the training phase, we train two models based on ResNet-50 and ResNet-101, respectively; all training images are artificially occluded and flipped to make the models more robust. In the testing phase, we combine the results from ResNet-50 and ResNet-101 as the final results.
1.22 A.22 Dense Feature Pyramid Net (DenseFPN)
Xin Sun
sunxin@ouc.edu.cn
DenseFPN is inspired by Feature Pyramid Networks [29] to detect small objects in the VisDrone2018 dataset. The original FPN uses low-level features to predict small objects. We use the same strategy and fuse high-level and low-level features in a dense feature pyramid network. Meanwhile, we crop the training images into small sizes to avoid the resize operation. Then we merge the results to obtain the best detection result.
1.23 A.23 SJTU-Ottawa-Detector (SOD)
Lu Ding, Yong Wang, Chen Qian, Robert Laganière, Xinbin Luo
dinglu@sjtu.edu.cn, ywang6@uottawa.ca, qian_chen@sjtu.edu.cn
laganier@eecs.uottawa.ca, losinbin@sjtu.edu.cn
SOD employs a pyramid-like prediction network to detect objects with a large range of scales, since pyramid-like representations are widely used in recognition systems for detecting objects at different scales [29]. The predictions made by higher-level feature maps contain stronger contextual semantics, while the lower-level ones integrate more localized information at finer spatial resolution. These predictions are hierarchically fused together to make pyramid-like decisions. We use this pyramid-like prediction network for the RPN and region-based fully convolutional networks (R-FCN) [7] to perform object detection.
1.24 A.24 Ensemble of Four RefineDet Models with Multi-scale Deployment (RD\(^4\)MS)
Oliver Acatay, Lars Sommer, Arne Schumann
{oliver.acatay, lars.sommer, arne.schumann}@iosb.fraunhofer.de
RD\(^4\)MS is a variant of the RefineDet detector [50], using the Squeeze-and-Excitation Network (SENet) [27] as the base network. We train four variants of the detector: three with SE-ResNeXt-50 and one with ResNet-50 as the base network, each with its own set of anchor sizes. Multi-scale testing is employed, and the detection results of the four detectors are combined via weighted averaging.
1.25 A.25 Improved Light-Head RCNN (L-H RCNN+)
Li Yang, Qian Wang, Lin Cheng, Shubo Wei
liyang16361@163.com, {844021514,2643105823,914417478}@qq.com
L-H RCNN+ modifies the published light-head R-CNN algorithm [28]. Firstly, we modify the parameter “anchor_scales”, replacing \(32\times 32\), \(64\times 64\), \(128\times 128\), \(256\times 256\), and \(512\times 512\) with \(16\times 16\), \(32\times 32\), \(64\times 64\), \(128\times 128\), and \(256\times 256\). Secondly, we modify the parameter “max_boxes_of_image”, increasing it from 50 to 600. Thirdly, we perform NMS over all detected objects belonging to the same category.
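The per-category NMS in the last step can be expressed with torchvision's batched_nms, which only suppresses overlapping boxes that share the same class index; a small sketch with toy boxes:

```python
import torch
from torchvision.ops import batched_nms

# Boxes in (x1, y1, x2, y2) format; overlaps are suppressed only within a class.
boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [0., 0., 10., 10.]])
scores = torch.tensor([0.9, 0.8, 0.7])
labels = torch.tensor([1, 1, 2])          # the third box belongs to another class
keep = batched_nms(boxes, scores, labels, iou_threshold=0.5)
print(keep)  # the lower-scored overlapping box of class 1 is removed
```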
1.26 A.26 Improved YOLOv3 with Data Processing (YOLOv3_DP)
Qiuchen Sun, Sheng Jiang
345412791@qq.com
YOLOv3_DP is based on the YOLOv3 model [39]. We process the images of the training set as follows. Firstly, we remove some images containing pedestrians and cars. Secondly, we increase the brightness of some low-brightness images to enhance the data. Thirdly, we black out the ignored regions in the images and cut the images into \(512\times 512\) patches with a step size of 400; patches without objects are removed. The final training set thus contains 31,406 images of size \(512\times 512\).
1.27 A.27 RetinaNet Implemented in Keras (Keras-RetinaNet)
Qiuchen Sun, Sheng Jiang
345412791@qq.com
Keras-RetinaNet is based on RetinaNet [30] and is implemented with the Keras toolkit. The source code can be found at: https://github.com/facebookresearch/Detectron.
1.28 A.28 Focal Loss for Dense Object Detection (RetinaNet2)
Li Yang, Qian Wang, Lin Cheng, Shubo Wei
liyang16361@163.com, 844021514@qq.com
2643105823@qq.com, 914417478@qq.com
RetinaNet2 is based on the RetinaNet [30] algorithm. The short side of the images is set to 800, and the maximum image size is set to 1,333. Each mini-batch has 1 image per GPU for training/testing.
1.29 A.29 Multiple-Scale YOLO Network (MSYOLO)
Haoran Wang, Zexin Wang, Ke Wang, Xiufang Li
18629585405@163.com, 1304180668@qq.com
MSYOLO is a multiple-scale YOLO network [39]. We divide the annotated categories into three cases according to their typical object scale. The first case consists of ignored regions and the others category, which are excluded from training. Since the remaining categories differ considerably in scale, we split them into big objects and small objects based on their box sizes: big objects include car, truck, van, and bus, while small objects include pedestrian, people, bicycle, motor, tricycle, and awning-tricycle. Image crops centered on big objects have a size of \(480\times 480\), and crops centered on small objects have a size of \(320\times 320\).
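The category-dependent cropping can be summarised as in the sketch below; clamping the crop window to the image border is an assumption of this sketch, as the handling of boundary cases is not described.

```python
import numpy as np

BIG = {"car", "truck", "van", "bus"}
SMALL = {"pedestrian", "people", "bicycle", "motor", "tricycle", "awning-tricycle"}

def crop_around_object(image, box, category):
    """Cut a training crop centred on an annotated object: 480x480 for the
    big-object categories and 320x320 for the small-object categories."""
    if category in BIG:
        size = 480
    elif category in SMALL:
        size = 320
    else:                       # ignored regions / "others": not trained
        return None
    h, w = image.shape[:2]
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    x0 = int(np.clip(cx - size / 2, 0, max(w - size, 0)))
    y0 = int(np.clip(cy - size / 2, 0, max(h - size, 0)))
    return image[y0:y0 + size, x0:x0 + size]
```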
1.30 A.30 Region-Based Single-Shot Refinement Network (R-SSRN)
Wenzhe Yang, Jianxiu Yang
wzyang@stu.xidian.edu.cn, jxyang xidian@outlook.com
R-SSRN is based on the RefineDet [50] detector. We make the following modifications: (1) we remove the deep convolutional layers after fc7, since they contribute little to detecting the small objects in VisDrone; (2) we add additional small-scale default boxes at conv3_3 and set new aspect ratios obtained by running the k-means clustering algorithm on the VisDrone dataset, which makes the default boxes better match the objects; (3) because the objects are small and dense, we split each image into 5 sub-images (i.e., bottom left, bottom right, middle, top left, top right), where the size of each sub-image is 1/4 of that of the original image. After testing on the sub-images, we merge the results using NMS.
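The five-way split in modification (3) can be written as follows; each sub-image is half the width and half the height of the original (i.e., 1/4 of its area), and mapping detections back simply adds the crop offset before the final NMS. The exact placement of the "middle" crop is an assumption.

```python
def split_into_five(image):
    """Split an image into five sub-images: the four corners and the centre,
    each 1/4 of the original area. Returns (sub_image, (x_offset, y_offset))."""
    h, w = image.shape[:2]
    hh, hw = h // 2, w // 2
    offsets = [
        (0, hh),            # bottom left
        (hw, hh),           # bottom right
        (w // 4, h // 4),   # middle
        (0, 0),             # top left
        (hw, 0),            # top right
    ]
    return [(image[y:y + hh, x:x + hw], (x, y)) for x, y in offsets]

# After running the detector on each sub-image, add (x_offset, y_offset) to
# every predicted box and merge the shifted boxes from all five crops with NMS.
```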
1.31 A.31 A Highly Accurate Object Detector in Drone Scenarios (AHOD)
Jianqiang Wang, Yali Li, Shengjin Wang
wangjian16@mails.tsinghua.edu.cn, liyali@ocrserv.ee.tsinghua.edu.cn
wgsgj@tsinghua.edu.cn
AHOD is a novel detection method with high accuracy in drone scenarios. First, a feature fusion backbone network with the capability of modelling geometric transformations is proposed to extract object features. Second, a special object proposal sub-network is applied to generate candidate proposals using multi-level semantic feature maps. Finally, a head network refines the categories and locations of these proposals.
1.32 A.32 Hybrid Attention based Low-Resolution Retina-Net (HAL-Retina-Net)
Yali Li, Zhaoyue Xia, Shengjin Wang
{liyali13, wgsgj}@tsinghua.edu.cn
HAL-Retina-Net is improved from Retina-Net [30]. To detect low-resolution objects, we remove P6 and P7 from the pyramid, so the pyramid of the network includes three pathways, named P1, P3, and P5. We inherit the head design of Retina-Net. The post-processing steps include Soft-NMS [3] and bounding box voting; we find that bounding box voting improves the detection accuracy significantly, and that increasing the normalized input size of the images also brings a significant improvement. To make full use of the training samples, we split the images into patches of size \(640\times 640\). To avoid running out of memory during detection, we use SE-ResNeXt-50 [27] as the backbone network and train Retina-Net on the cropped sub-images. To further improve the detection accuracy, we add a hybrid attention mechanism: an additional SE module [27] learns channel attention, and a downsample-upsample module [46] learns spatial attention. Our final detection results on the test-challenge set are based on an ensemble of the modified Retina-Net with these two kinds of attention.
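For reference, a minimal PyTorch sketch of a squeeze-and-excitation (SE) channel-attention block in the spirit of [27] is given below; the reduction ratio and the framework are assumptions, since the submission does not state its implementation details.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention: global pooling ("squeeze")
    followed by a two-layer bottleneck that produces per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (N, C, H, W)
        n, c, _, _ = x.shape
        squeeze = x.mean(dim=(2, 3))            # (N, C) global average pooling
        weights = self.fc(squeeze).view(n, c, 1, 1)
        return x * weights                      # re-scale each channel

feat = torch.randn(2, 256, 40, 40)
print(SEBlock(256)(feat).shape)                 # torch.Size([2, 256, 40, 40])
```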
1.33 A.33 Small Object Detection in Large Scene based on YOLOv3 (SODLSY)
Sujuan Wang, Yifan Zhang, Jian Cheng
Wangsujuan@airia.cn, {yfzhang,jcheng}@nlpr.ia.ac.cn
SODLSY is designed to detect objects under various weather and lighting conditions, representing the diverse scenarios of daily life. The maximum resolution of VOC images is \(469\times 387\) and that of COCO images is \(640\times 640\), whereas static images in VisDrone2018 can be as large as \(2000\times 1500\). Our algorithm first increases the training image size to 1184 so that information about small objects is not lost during resizing, and we adopt a multi-scale training scheme (input sizes \(800, 832, 864, \cdots, 1376\)) to further improve the detection results. We also re-generate the anchors for VisDrone-DET2018.
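A common way to re-generate anchors for a YOLO-style detector is k-means clustering on the annotated box sizes with a 1−IoU distance; the sketch below shows this standard procedure, which we assume here since the exact method is not described in the submission.

```python
import numpy as np

def iou_wh(wh, centroids):
    """IoU between boxes and centroids given only (width, height), assuming
    boxes share a common corner (as in YOLO anchor clustering)."""
    inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centroids[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / (union + 1e-9)

def kmeans_anchors(box_wh, k=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchors using 1 - IoU."""
    rng = np.random.default_rng(seed)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(box_wh, centroids), axis=1)  # max IoU = min (1 - IoU)
        new = np.stack([box_wh[assign == j].mean(axis=0) if (assign == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids.prod(axis=1))]       # sort by area
```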
1.34 A.34 Drone Pyramid Networks (DPNet)
HongLiang Li, Qishang Cheng, Wei Li, Xiaoyu Chen, Heqian Qiu, Zichen Song
hlli@uestc.edu.cn, cqs@std.uestc.edu.cn, weili.cv@gmail.com
xychen9459@gmail.com, hqqiu@std.uestc.edu.cn, szc.uestc@gmail.com
DPNet consists of three object detectors based on the Faster R-CNN [40] method, implemented in the Caffe2 deep learning framework and trained in parallel on 8 GPUs. The design of DPNet follows the idea of FPN [29]; the feature extractors are ResNet-50 [25], ResNet-101, and ResNeXt [48], respectively, all pre-trained on ImageNet only. To make full use of the data, the method is designed as follows:
- No additional data other than the train + val dataset are used for network training.
- We train Faster R-CNN with FPN using multiple scales (\(1000\times 1000\), \(800\times 800\), \(600\times 600\)) to naturally handle objects of various sizes, yielding an improvement of \(4\%\).
- When selecting the prior boxes, we set multiple specific aspect ratios based on the scale distribution of the training data.
- We change the IoU threshold from 0.5 to 0.6 and remove the last FPN layer, yielding an improvement of \(1.5\%\).
We use Soft-NMS [3] instead of conventional NMS to select predicted boxes. We replace RoIPooling with RoIAlign [23] to avoid the coarse spatial quantization of pooled features. We use multi-scale training and testing. On the validation set, our best single detector obtains an mAP of \(49.6\%\), and the ensemble of the three detectors achieves an mAP of \(50.0\%\).
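For reference, the Soft-NMS step can be sketched as below (Gaussian re-weighting, as described in [3]); the sigma and score-threshold values are illustrative, not those used by DPNet.

```python
import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Gaussian Soft-NMS [3]: instead of removing overlapping boxes, decay
    their scores by exp(-IoU^2 / sigma). Boxes are [x1, y1, x2, y2]."""
    boxes = boxes.astype(float).copy()
    scores = scores.astype(float).copy()
    keep = []
    idxs = list(range(len(boxes)))
    while idxs:
        # Pick the remaining box with the highest (possibly decayed) score.
        best = max(idxs, key=lambda i: scores[i])
        keep.append(best)
        idxs.remove(best)
        if not idxs:
            break
        rest = np.array(idxs)
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_b = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_b + area_r - inter + 1e-9)
        scores[rest] *= np.exp(-(iou ** 2) / sigma)        # Gaussian decay
        idxs = [i for i in idxs if scores[i] > score_thr]  # drop tiny scores
    return np.array(keep)
```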
1.35 A.35 Feature Pyramid Networks for Object Detection (FPN)
Submitted by the VisDrone Committee
FPN exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids. It combines low-resolution but semantically strong features with high-resolution but semantically weak features, so that rich semantic information is available at all scales, and the whole architecture is trained end-to-end. Experimental results show that this architecture significantly improves generic deep models in several applications. Please refer to [29] for more details.
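A minimal PyTorch sketch of the top-down pathway with lateral connections is shown below (256-channel pyramid over backbone stages C3, C4, C5); it is a simplified illustration of the architecture of [29], not the exact reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Builds pyramid features from backbone stages C3, C4, C5 (fine to coarse)."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                     # feats = [C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down: upsample the coarser map and add it to the finer lateral.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]   # [P3, P4, P5]

c3, c4, c5 = (torch.randn(1, 512, 64, 64), torch.randn(1, 1024, 32, 32),
              torch.randn(1, 2048, 16, 16))
print([o.shape for o in TopDownFPN()([c3, c4, c5])])
```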
1.36 A.36 Object Detection via Region-based Fully Convolutional Networks (R-FCN)
Submitted by the VisDrone Committee
R-FCN is a region-based fully convolutional network for object detection without any RoI-wise sub-network. Unlike previous methods such as Fast R-CNN and Faster R-CNN, which apply a costly per-region sub-network, R-FCN addresses the dilemma between translation invariance in image classification and translation variance in object detection using position-sensitive score maps, so that almost all computation is shared over the whole image. It can also adopt recent state-of-the-art classification backbones (e.g., ResNet and Inception) for better performance. Please refer to [7] for more details.
1.37 A.37 Towards Real-Time Object Detection with Region Proposal Networks (Faster R-CNN)
Submitted by the VisDrone Committee
Faster R-CNN improves Fast R-CNN [20] by adding a Region Proposal Network (RPN). The RPN shares full-image convolutional features with the detection network, making proposal generation nearly cost-free. Specifically, it is implemented as a fully convolutional network that simultaneously predicts object bounding boxes and their objectness scores. Given the proposals produced by the RPN, the Fast R-CNN branch shares the convolutional features and detects objects efficiently. Please refer to [40] for more details.
1.38 A.38 Single Shot MultiBox Detector (SSD)
Submitted by the VisDrone Committee
SSD is a one-stage object detection method based on a single deep neural network without proposal generation. It discretizes the output space of bounding boxes into a set of pre-set anchor boxes with different aspect ratios and scales. To handle objects at multiple scales, the network combines predictions from several feature maps at different layers, predicting the score of each object category and adjusting the corresponding bounding box simultaneously. The network is optimized with a multi-task loss consisting of a confidence loss and a localization loss. Finally, the multi-scale bounding boxes are converted into detection results using NMS. Please refer to [32] for more details.