Weakly-supervised spatio-temporally grounding natural sentence in video

Z Chen, L Ma, W Luo, KYK Wong - arXiv preprint arXiv:1906.02549, 2019 - arxiv.org
In this paper, we address a novel task, namely weakly-supervised spatio-temporal grounding of a natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, without relying on any spatio-temporal annotations during training. First, a set of spatio-temporal tubes, referred to as instances, is extracted from the video. We then encode these instances and the sentence using our proposed attentive interactor, which exploits their fine-grained relationships to characterize their matching behaviors. In addition to a ranking loss, a novel diversity loss is introduced to train the attentive interactor, strengthening the matching behaviors of reliable instance-sentence pairs and penalizing unreliable ones. Moreover, we contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Extensive experimental results demonstrate the superiority of our model over the baseline approaches.
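The abstract does not give the exact form of the ranking and diversity losses. The sketch below is one plausible instantiation in PyTorch, assuming per-instance matching scores, max-pooling aggregation over instances, a hinge-based ranking loss, and an entropy-style diversity term; the function names, the margin, and the 0.8 weight are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def ranking_loss(scores_pos, scores_neg, margin=0.5):
    """Hinge-based ranking loss (assumed form): the best matching score of the
    paired (video, sentence) should exceed that of a mismatched pair by at
    least `margin`. Inputs are per-instance matching scores, shape (num_instances,)."""
    s_pos = scores_pos.max()   # aggregate via the best-matching instance
    s_neg = scores_neg.max()
    return F.relu(margin - s_pos + s_neg)

def diversity_loss(scores_pos):
    """Entropy-style diversity term (assumed form): pushes the distribution of
    matching scores over instances to be peaked, so reliable instances are
    strengthened and unreliable ones are suppressed."""
    p = F.softmax(scores_pos, dim=0)
    return -(p * torch.log(p + 1e-8)).sum()

# Toy usage with random per-instance scores.
scores_pos = torch.randn(20)   # instances from the paired video
scores_neg = torch.randn(20)   # instances from a mismatched video
loss = ranking_loss(scores_pos, scores_neg) + 0.8 * diversity_loss(scores_pos)
```

At inference time, under the same assumptions, the instance with the highest matching score for the query sentence would be returned as the grounded spatio-temporal tube.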