Video Relation Inference and Content Understanding

4 May, 2020 by Jonathan Staniforth

Introduction & Motivation

Recent advances in personal video recording devices and video sharing platforms such as YouTube and TikTok have resulted in explosive growth in the number of videos. This has given rise to demand for various video content understanding tasks, including video search, captioning, question answering, and content filtering, and it motivates our research into comprehensive and fine-grained video content analytics. A crux of content understanding is the set of object relations in a video, which capture a wide variety of interactions between objects. As shown in Figure 1, the relations expressed in the video include adults holding bottles, a woman watching a child, and the child standing in front of a table. Understanding these relations enables a system to infer higher-order semantics, such as the emotional state of a person and the social relations among people, and hence to better support the aforementioned tasks. Furthermore, it permits the system to extract relation triplets of the form <subject, predicate, object>, which elevates the extractable semantics of video to that of text and supports video-text fusion tasks such as video description, video question answering, causal reasoning, and the construction of multimodal knowledge graphs.

Figure 1. An illustration of possible objects, relations and higher order semantics in a video

Current Research

Our research focuses on recognizing a set of predefined basic relations in web videos, including verb-based relations (e.g. person kicks ball), spatial relations (e.g. car moves towards bench) and comparative relations (e.g. a tall man enters the scene). We have developed an overall framework [3,6], shown in Figure 2, to detect and spatio-temporally localize such relations in a given video. Based on this framework, we have carried out research on several fronts.

The first involves tracking the motion trajectories of visual objects (or entities) in video. To achieve this, we developed a novel approach to object trajectory proposal that jointly explores the appearance and motion characteristics of video content [4]. It aims to generate a number of sequences of temporally consistent bounding boxes, i.e. trajectories, that indicate the object candidates in a video. From the generated proposals, object features can be extracted and measured for visual relation detection.
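To make the notion of a trajectory concrete, the following is a minimal sketch that links per-frame bounding boxes into trajectories by greedy overlap matching. It is a generic baseline for illustration only and is not the proposal method of [4], which jointly uses appearance and motion; the names `iou` and `link_boxes` and the `min_iou` threshold are assumptions introduced here.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_boxes(frames: List[List[Box]], min_iou: float = 0.5) -> List[Dict[int, Box]]:
    """Greedily link per-frame boxes into trajectories (frame index -> box).

    Hypothetical baseline: a trajectory is extended with the best-overlapping
    unused box in the next frame, and ends when no box overlaps enough.
    """
    trajectories: List[Dict[int, Box]] = []
    active: List[Dict[int, Box]] = []  # trajectories with a box in the previous frame
    for t, boxes in enumerate(frames):
        unused = list(boxes)
        next_active: List[Dict[int, Box]] = []
        for traj in active:
            if not unused:
                break
            prev = traj[t - 1]
            best = max(unused, key=lambda b: iou(prev, b))
            if iou(prev, best) >= min_iou:
                traj[t] = best
                unused.remove(best)
                next_active.append(traj)
        # Every unmatched box starts a new trajectory.
        for box in unused:
            traj = {t: box}
            trajectories.append(traj)
            next_active.append(traj)
        active = next_active
    return trajectories


if __name__ == "__main__":
    frames = [
        [(0, 0, 10, 10), (50, 50, 60, 60)],
        [(1, 1, 11, 11), (50, 50, 60, 60)],
        [(2, 2, 12, 12)],
    ]
    for traj in link_boxes(frames):
        print(sorted(traj.items()))
```

Each returned trajectory maps frame indices to boxes, so features for an object candidate can be pooled over exactly the frames in which it appears.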

The second front builds the framework to extract visual relations between the detected entities in videos (see Figure 2). Here, a given video is first decomposed into a set of overlapping segments, and object trajectory proposals are generated on each segment. Next, short-term relations are predicted for each object pair on all the segments, based on feature extraction and relation modeling. Finally, video visual relations are generated by greedily associating the short-term relations. With this approach, we can also discover common-sense visual knowledge from massive collections of user-generated web videos. This work was published at ACM Multimedia 2017 [3]. To the best of our knowledge, it is the first work on visual relation detection in videos. We have further explored more robust algorithms [1], and worked on an end-to-end deep learning model for video relation inference as well as a novel approach to grounding relations in videos [7].
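The segment-then-associate pipeline described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation from [3]: the segment length and stride are arbitrary defaults, the `Relation` class and function names are hypothetical, and relations are merged purely on matching triplets and touching frame ranges, whereas a real system would also check that the subject and object trajectories match across segments.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Relation:
    triplet: Triplet
    start: int  # first frame (inclusive)
    end: int    # last frame (exclusive)
    scores: List[float] = field(default_factory=list)

    @property
    def score(self) -> float:
        """Mean confidence over the merged short-term relations."""
        return sum(self.scores) / len(self.scores)


def split_into_segments(num_frames: int, seg_len: int = 30,
                        stride: int = 15) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges; segments overlap when stride < seg_len."""
    if seg_len <= 0 or stride <= 0:
        raise ValueError("seg_len and stride must be positive")
    segments = []
    start = 0
    while start < num_frames:
        end = min(start + seg_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += stride
    return segments


def greedy_associate(short_term: List[Relation]) -> List[Relation]:
    """Merge short-term relations that share a triplet and touch in time."""
    merged: List[Relation] = []
    for rel in sorted(short_term, key=lambda r: (r.triplet, r.start)):
        last = merged[-1] if merged else None
        if last is not None and last.triplet == rel.triplet and rel.start <= last.end:
            # Same triplet on an overlapping or adjacent segment: extend it.
            last.end = max(last.end, rel.end)
            last.scores.extend(rel.scores)
        else:
            merged.append(Relation(rel.triplet, rel.start, rel.end, list(rel.scores)))
    # Rank the resulting video-level relations by confidence.
    return sorted(merged, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    print(split_into_segments(60))
    short_term = [
        Relation(("person", "kick", "ball"), 0, 30, [0.8]),
        Relation(("person", "kick", "ball"), 15, 45, [0.6]),
        Relation(("person", "kick", "ball"), 60, 90, [0.9]),
        Relation(("dog", "chase", "ball"), 15, 45, [0.5]),
    ]
    for rel in greedy_associate(short_term):
        print(rel.triplet, rel.start, rel.end, round(rel.score, 2))
```

Overlapping segments are what make the association step possible: a relation that continues across a segment boundary is predicted on both neighbouring segments, so the two predictions share frames and can be joined into one longer relation instance.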

The third front focuses on the construction of a large-scale video dataset to facilitate research in this emerging field [2]. This work contributes the first benchmark dataset, named ViDOR, for visual relation extraction in videos. It consists of 10,000 videos (84 hours) with dense annotations that localize 80 categories of objects and 50 categories of predicates in each video. The dataset will facilitate the video research tasks mentioned above. It was made available for a grand challenge at ACM Multimedia 2019 [5], and we plan to support this grand challenge for the next three years.

The fourth front explores various high-level video analysis tasks, including video question answering (VQA) and video description. VQA aims to understand complex video content together with a user's natural-language question in order to automatically infer an answer and the reasons for it. In particular, we will focus on difficult inference-type questions. We will also explore multi-perspective video captioning, which enables a system to describe a video comprehensively from different points of view. This also paves the way for personalized video description.

Figure 2. A framework for visual relation inference in videos

Plans for Future Research

First, we will extend our algorithms to recognize a wider range of relations, especially important and meaningful ones that involve human users in specific application domains. Second, because tagging a video dataset with relations is difficult, we will explore novel transfer learning algorithms that can transfer the knowledge learned on one dataset to new domains, such as cooking or live surveillance videos. Third, we will study the problem of higher-order relation recognition and inference, such as causal reasoning. For instance, identifying what event caused a man to be lying on the road is crucial to automatically identifying the vehicle involved in an accident in a surveillance video. Finally, we will explore the construction of multimodal knowledge graphs by incorporating knowledge from both text and images/videos.


  1. Donglin Di, Xindi Shang, Weinan Zhang, Xun Yang and Tat-Seng Chua. Multiple Hypothesis Video Relation Detection. IEEE BigMM 2019, Sep.
  2. Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang and Tat-Seng Chua. Annotating Objects and Relations in User-Generated Videos. ACM ICMR 2019, Jun.
  3. Xindi Shang, Tongwei Ren, Jingfan Guo, Hanwang Zhang and Tat-Seng Chua. Video Visual Relation Detection. ACM Multimedia 2017, Oct.
  4. Xindi Shang, Tongwei Ren, Hanwang Zhang, Gangshan Wu and Tat-Seng Chua. Object Trajectory Proposal. ICME 2017, Jul.
  5. Xindi Shang, Junbin Xiao, Donglin Di and Tat-Seng Chua. Relation Understanding in Videos: A Grand Challenge Overview. ACM Multimedia 2019, Oct. (Grand Challenge)
  6. Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang and Tat-Seng Chua. Visual Translation Embedding Network for Visual Relation Detection. CVPR 2017, Jul.
  7. Junbin Xiao, Xindi Shang and Tat-Seng Chua. Visual Relation Grounding in Videos. Internal Report. 2020.

