Multi-stream multi-class fusion of deep networks for video classification

Z Wu, YG Jiang, X Wang, H Ye, X Xue - Proceedings of the 24th ACM International Conference on Multimedia, 2016 - dl.acm.org
This paper studies deep network architectures for video classification. A multi-stream framework is proposed to fully utilize the rich multimodal information in videos. Specifically, we first train three Convolutional Neural Networks to model spatial, short-term motion, and audio cues respectively. Long Short-Term Memory (LSTM) networks are then adopted to explore long-term temporal dynamics. With the outputs of the individual streams on multiple classes, we propose to mine the class relationships hidden in the data from the trained models. The automatically discovered relationships are then leveraged as a prior in the multi-stream multi-class fusion process, indicating which and how much information is needed from the remaining classes, to adaptively determine the optimal fusion weights for generating the final scores of each class. Our contributions are two-fold. First, the multi-stream framework is able to exploit multimodal features that are more comprehensive than those previously attempted. Second, our proposed fusion method not only learns the best weights of the multiple network streams for each class, but also takes class relationships into account, which are known to be a helpful cue in multi-class visual classification tasks. Our framework produces significantly better results than the state of the art on two popular benchmarks: 92.2% on UCF-101 (without using audio) and 84.9% on Columbia Consumer Videos.
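The fusion step described above combines per-class scores from each stream with weights that can also draw on the scores of related classes. As a rough illustration (not the paper's exact formulation), the sketch below implements class-aware late fusion in PyTorch; the module name, tensor shapes, and einsum parameterization are assumptions made for this example.

```python
# Hypothetical sketch of multi-stream multi-class late fusion.
# Each stream outputs per-class scores; a learned weight tensor lets the
# fused score for class c also draw on the scores of related classes k.
import torch
import torch.nn as nn

class MultiStreamClassFusion(nn.Module):
    def __init__(self, num_streams: int, num_classes: int):
        super().__init__()
        # W[c, s, k]: weight on stream s's score for class k when
        # producing the fused score for class c.
        self.W = nn.Parameter(torch.zeros(num_classes, num_streams, num_classes))
        with torch.no_grad():
            # Initialize as plain stream averaging (no cross-class mixing).
            for c in range(num_classes):
                self.W[c, :, c] = 1.0 / num_streams

    def forward(self, stream_scores: torch.Tensor) -> torch.Tensor:
        # stream_scores: (batch, num_streams, num_classes)
        # fused[b, c] = sum over s, k of W[c, s, k] * stream_scores[b, s, k]
        return torch.einsum('csk,bsk->bc', self.W, stream_scores)

# Example: three streams (spatial, motion, audio) over 101 UCF-101 classes.
fusion = MultiStreamClassFusion(num_streams=3, num_classes=101)
scores = torch.softmax(torch.randn(8, 3, 101), dim=-1)  # dummy stream outputs
print(fusion(scores).shape)  # torch.Size([8, 101])
```

In the paper, the mined class relationships serve as a prior on the fusion weights; in a sketch like this, that could take the form of a regularizer penalizing cross-class entries W[c, :, k] (c ≠ k) for weakly related class pairs.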