LEARNING TO DETECT, PREDICT AND SYNTHESIZE OBJECTS

Date

2020

Abstract

The performance of object detection has steadily improved over the past decade, primarily due to improved CNN architectures and well-annotated datasets. State-of-the-art detectors learn category-independent region proposals using a convolutional neural network and refine them to improve object localization. Several open problems remain with these approaches; we address two of them: (1) multiple proposals often get regressed to the same region of interest (RoI), leading to cluttered detections, and (2) the foreground/background labels assigned to proposals can be incorrect when annotations are missing, which produces incorrect signals during backpropagation.

First, non-maximum suppression (NMS) is used to mitigate the cluttered-detection problem and significantly reduce the number of false positives. By design, the algorithm selects the detection box $M$ with the maximum score and suppresses every other detection box whose overlap with $M$ exceeds a pre-defined threshold. Consequently, if a second object lies within that overlap threshold, it is missed. To mitigate this, we propose Soft-NMS, an algorithm that decays the detection scores of all other boxes as a continuous function of their overlap with $M$, so no object is eliminated outright. Soft-NMS obtains consistent improvements in the COCO-style mAP metric on standard datasets such as PASCAL VOC 2007 (1.7% for both R-FCN and Faster R-CNN) and MS-COCO (1.3% for R-FCN and 1.1% for Faster R-CNN).

Second, state-of-the-art object detectors require detailed bounding-box annotations for images containing multiple objects. These annotations are used to assign foreground and background labels to the proposals in the classification stage; when the dataset is not fully annotated, the assigned labels can be incorrect. To this end, we investigate the robustness of object detection in the presence of missing annotations and propose a method called Soft Sampling to bridge the performance gap caused by missing annotations. Soft Sampling re-weights the gradients of RoIs as a function of their overlap with positive instances, ensuring that uncertain background regions receive a smaller weight than hard negatives. Extensive experiments on curated PASCAL VOC and OpenImages V3 datasets demonstrate the effectiveness of the proposed Soft Sampling method.
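
To make the rescoring rule concrete, the following is a minimal NumPy sketch of the Soft-NMS idea described in the first paragraph above, using the Gaussian decay variant. The box format, helper names, and default parameters are illustrative assumptions rather than the exact implementation.

```python
# Minimal Soft-NMS sketch (Gaussian decay). Illustrative only: box format
# [x1, y1, x2, y2], helper names, and defaults are assumptions.
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + area_boxes - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Decay the scores of boxes overlapping the current top box M instead of
    discarding them; boxes whose score falls below score_thresh are dropped."""
    boxes = boxes.astype(float).copy()
    scores = scores.astype(float).copy()
    kept_boxes, kept_scores = [], []
    while scores.size > 0:
        m = int(np.argmax(scores))                 # box M with the maximum score
        kept_boxes.append(boxes[m])
        kept_scores.append(scores[m])
        boxes = np.delete(boxes, m, axis=0)
        scores = np.delete(scores, m)
        if scores.size == 0:
            break
        overlap = iou(kept_boxes[-1], boxes)
        scores = scores * np.exp(-(overlap ** 2) / sigma)  # continuous decay, no hard cut
        keep = scores > score_thresh
        boxes, scores = boxes[keep], scores[keep]
    return np.array(kept_boxes), np.array(kept_scores)
```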

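Similarly, the Soft Sampling re-weighting from the second paragraph can be sketched as a per-RoI weight that grows with overlap to annotated positives. The weighting function, constants, and loss usage below are assumptions for illustration (the iou helper from the previous sketch is reused), not the dissertation's exact formulation.

```python
# Illustrative Soft Sampling weights: background RoIs far from any annotated
# positive (possibly unannotated objects) get a small weight, while confident
# hard negatives near positives get a weight close to 1. The functional form
# and constants are assumptions; iou() comes from the Soft-NMS sketch above.
import numpy as np

def soft_sampling_weights(background_rois, positive_boxes, w_min=0.25, k=10.0):
    """Per-RoI loss weights in [w_min, 1], increasing with overlap to positives."""
    if len(positive_boxes) == 0:
        return np.full(len(background_rois), w_min)
    max_overlap = np.array([iou(roi, positive_boxes).max() for roi in background_rois])
    return w_min + (1.0 - w_min) * (1.0 - np.exp(-k * max_overlap))

# During training, each background RoI's classification loss would be scaled by
# its weight before backpropagation, e.g.:
#   loss = (weights * per_roi_loss).sum() / weights.sum()
```
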
Fidelity, diversity, and controllable sampling are the main quality measures of a good image generation model. Our work focuses on improving controllable sampling while maintaining very high fidelity and diversity. We argue that controllability can be achieved by disentangling the generation process into stages, and we propose a method called FusedGAN, which has a single-stage pipeline with a built-in stacking of GANs. Unlike existing methods, which require full supervision with paired conditions and images, FusedGAN can effectively leverage more abundant images without corresponding conditions during training to produce more diverse samples with high fidelity. We achieve this by fusing two generators: one for unconditional image generation and the other for conditional image generation, where the two partly share a common latent space, thereby disentangling the generation. We demonstrate the efficacy of FusedGAN on fine-grained image generation tasks such as text-to-image and attribute-to-face generation.
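
A minimal PyTorch sketch of the two-generator fusion described above: a shared first stage maps noise to an intermediate structure feature, which is then decoded either unconditionally or together with a condition embedding. Layer sizes, module names, and the flat decoders are illustrative assumptions, not the actual FusedGAN architecture.

```python
# Sketch of fusing an unconditional and a conditional generator through a
# shared latent stage. Dimensions and modules are illustrative assumptions.
import torch
import torch.nn as nn

class FusedGenerator(nn.Module):
    def __init__(self, z_dim=100, cond_dim=128, feat_dim=256):
        super().__init__()
        # Shared stage: produces the structure feature used by both branches.
        self.shared = nn.Sequential(nn.Linear(z_dim, feat_dim), nn.ReLU())
        # Unconditional branch: trainable on images without paired conditions.
        self.uncond_decoder = nn.Sequential(nn.Linear(feat_dim, 64 * 64 * 3), nn.Tanh())
        # Conditional branch: fuses the shared feature with a condition embedding.
        self.cond_decoder = nn.Sequential(nn.Linear(feat_dim + cond_dim, 64 * 64 * 3), nn.Tanh())

    def forward(self, z, cond=None):
        feat = self.shared(z)
        if cond is None:                        # unconditional sample
            out = self.uncond_decoder(feat)
        else:                                   # sample conditioned on text/attributes
            out = self.cond_decoder(torch.cat([feat, cond], dim=1))
        return out.view(-1, 3, 64, 64)

# Usage: the same noise vector can be decoded with or without a condition.
# gen = FusedGenerator()
# z = torch.randn(4, 100)
# image_uncond = gen(z)
# image_cond = gen(z, torch.randn(4, 128))   # e.g. a text or attribute embedding
```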

Learning to model and predict how humans interact with objects while performing an action is challenging. Our work builds on hierarchical video prediction models, which disentangle video generation into two stages: predicting a high-level representation, such as pose, over time, and then learning a pose-to-pixels translation model for pixel generation. An action sequence for a human-object interaction task is typically very complicated, involving the evolution of pose, the person's appearance, object locations, and object appearances over time. To this end, we propose a Hierarchical Video Prediction model using a Relational Layout. In the first stage, we learn to predict a sequence of layouts. A layout is a high-level representation of the video containing both pose and object information for every frame. The layout sequence is learned by modeling the relationships between pose and objects using relational reasoning and recurrent neural networks. The layout sequence acts as a strong structural prior for the second stage, which learns to map the layouts into pixel space. Experimental evaluation of our method on two datasets, UMD-HOI and Bimanual, shows significant improvements in standard video evaluation metrics such as LPIPS, PSNR, and SSIM. We also perform a detailed qualitative analysis of our model to demonstrate various generalizations.
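
The two-stage pipeline can be pictured with a highly simplified PyTorch sketch: stage one rolls out the layout (pose and object entities) with relational reasoning and a recurrent cell, and stage two decodes each layout into pixels. The entity representation, relation module, and decoder below are illustrative assumptions rather than the dissertation's architecture.

```python
# Simplified two-stage hierarchical video prediction sketch. Entity sizes,
# the pairwise relation module, and the pixel decoder are assumptions.
import torch
import torch.nn as nn

class LayoutPredictor(nn.Module):
    """Stage 1: predict the next layout from the current pose/object entities."""
    def __init__(self, n_entities=4, ent_dim=16, hid=128):
        super().__init__()
        self.relation = nn.Sequential(nn.Linear(2 * ent_dim, hid), nn.ReLU())  # pairwise relations
        self.rnn = nn.GRUCell(hid, hid)
        self.out = nn.Linear(hid, n_entities * ent_dim)

    def forward(self, entities, h=None):
        # entities: (B, n_entities, ent_dim); aggregate all pairwise relations.
        B, N, D = entities.shape
        a = entities.unsqueeze(2).expand(B, N, N, D)
        b = entities.unsqueeze(1).expand(B, N, N, D)
        rel = self.relation(torch.cat([a, b], dim=-1)).sum(dim=(1, 2))  # (B, hid)
        h = self.rnn(rel, h)
        next_layout = self.out(h).view(B, N, D)
        return next_layout, h

class LayoutToPixels(nn.Module):
    """Stage 2: map a predicted layout to a frame (placeholder decoder)."""
    def __init__(self, n_entities=4, ent_dim=16):
        super().__init__()
        self.decode = nn.Sequential(nn.Linear(n_entities * ent_dim, 64 * 64 * 3), nn.Tanh())

    def forward(self, layout):
        return self.decode(layout.flatten(1)).view(-1, 3, 64, 64)
```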
