Learning Fine-Grained Visual Understanding
for Video Question Answering
via Decoupling Spatial-Temporal Modeling

¹National Taiwan University  ²Mobile Drive Technology
BMVC 2022 (spotlight)

Do video-language models really
understand videos?

Video-language models are currently the dominant approach to video-language tasks, including video question answering. Questions come in many forms: some ask about spatial information, while others require temporal reasoning. Our study shows that both the spatial and the temporal modeling of state-of-the-art video-language methods are far from optimal. For questions about spatial semantics, simply averaging the frame-by-frame predictions of an image-language model yields better performance. For questions about temporal information, video-language models perform almost identically even when their input frames are shuffled (a sketch of both diagnostics is shown below). In this work, we propose a solution that improves both spatial and temporal modeling.
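A minimal sketch of the two diagnostics mentioned above. The function and argument names are illustrative, not the paper's released code; it assumes a generic `image_qa_model(frame, question)` and `video_qa_model(frames, question)` that each return answer logits.

```python
import torch

def spatial_baseline(image_qa_model, frames, question):
    """Average frame-by-frame answer logits from an image-language model."""
    per_frame_logits = torch.stack([image_qa_model(f, question) for f in frames])
    return per_frame_logits.mean(dim=0)  # shape: (num_answers,)

def shuffled_frame_probe(video_qa_model, frames, question, seed=0):
    """Compare predictions on ordered vs. randomly shuffled frames.

    If the two predictions rarely differ, the model is largely ignoring
    temporal order -- the behaviour we observed on temporal questions.
    """
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(frames), generator=g)
    ordered = video_qa_model(frames, question)
    shuffled = video_qa_model([frames[i] for i in perm], question)
    return ordered.argmax(-1), shuffled.argmax(-1)
```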

Recent breakthroughs in video-language modeling are built on large-scale pre-training over video captions, speech-to-text transcripts, or synthesized QAs. However, video captions usually describe entire videos, neglecting small events; transcripts contain noisy spoken words unrelated to visual scenes; and existing synthesized QAs are mostly about spatial information.

Prior video-language pre-training may suffer from a lack of event details, video-transcript misalignment, or limited diversity.

To learn effective temporal modeling of sequential events in videos, we propose Temporal Referring Modeling. It first synthesizes long videos containing sequential events by concatenating short clips, and then queries the model about the absolute and relative temporal relationships between those events (see the sketch below).
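An illustrative sketch of how such training samples could be constructed; the function and field names are ours, not the released code, and the exact question templates in the paper may differ. Each short clip is assumed to come with a caption describing its event.

```python
import random

def ordinal(n):
    return {1: "first", 2: "second", 3: "third"}.get(n, f"{n}th")

def make_trm_sample(clips, rng=random):
    """clips: list of (frames, caption) pairs from independent short videos."""
    order = list(range(len(clips)))
    rng.shuffle(order)
    long_video = [frame for i in order for frame in clips[i][0]]  # concatenated video
    captions = [clips[i][1] for i in order]

    qas = []
    # Absolute reference: which event occupies a given position in the long video.
    k = rng.randrange(len(captions))
    qas.append((f"What happened {ordinal(k + 1)} in the video?", captions[k]))
    # Relative reference: the event immediately following another event.
    if len(captions) > 1:
        j = rng.randrange(len(captions) - 1)
        qas.append((f"What happened right after '{captions[j]}'?", captions[j + 1]))
    return long_video, qas
```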

Temporal Referring Modeling

We then take advantage of image-language models to complement spatial encoding. We design Decoupled Spatial-Temporal Encoders, a double-stream architecture for video question answering with a spatial stream and a temporal stream. The spatial stream is a pre-trained image-language model that takes high-resolution but sparsely sampled frames and provides overall spatial information. The temporal stream is a video-language model trained with Temporal Referring Modeling; it encodes low-resolution but densely sampled frames and models the temporal relationships between events. The final prediction integrates the outputs of the two streams (see the sketch below).
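A minimal sketch of the decoupled two-stream design, assuming each stream exposes answer logits for a question; the class name, the stream interfaces, and the fusion by plain logit addition are our assumptions, and the paper's actual backbones and fusion may differ.

```python
import torch
import torch.nn as nn

class DecoupledSTEncoders(nn.Module):
    def __init__(self, spatial_stream, temporal_stream):
        super().__init__()
        self.spatial_stream = spatial_stream      # pre-trained image-language model
        self.temporal_stream = temporal_stream    # video-language model trained with TRM

    def forward(self, hires_sparse_frames, lowres_dense_frames, question):
        # Spatial stream: answer logits averaged over a few high-resolution frames.
        spatial_logits = torch.stack(
            [self.spatial_stream(f, question) for f in hires_sparse_frames]
        ).mean(dim=0)
        # Temporal stream: a single pass over many low-resolution frames.
        temporal_logits = self.temporal_stream(lowres_dense_frames, question)
        # Integrate the two streams (simple logit addition here, as an assumption).
        return spatial_logits + temporal_logits
```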

Decoupled Spatial-Temporal Encoders

Our experiments show that the model outperforms previous work pre-trained on datasets orders of magnitude larger. Please refer to the paper for more details.

BibTeX

@inproceedings{Lee_2022_BMVC,
  author    = {Hsin-Ying Lee and Hung-Ting Su and Bing-Chen Tsai and Tsung-Han Wu and Jia-Fong Yeh and Winston H. Hsu},
  title     = {Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling},
  booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
  year      = {2022},
}