Multimodal Conversational Search

4 May, 2020 by NExT


The design of intelligent assistants that can interact directly with humans ranks high on the agenda of current AI research. Promising results so far have mainly been achieved in text-based conversation systems, in both academia and industry. However, the next generation of intelligent systems should go beyond natural language to multiple modalities, offering a more intuitive and natural human-computer interaction paradigm. The key research problems lie in multimodal understanding, knowledge incorporation, intelligent policy making and contextualized response generation.

Current Research

1. Knowledge-aware Fashion Dialogue

We study multimodal understanding in conversational search in two stages. In Stage 1, we explore the interactive fashion retrieval scenario, where we align the different modalities by leveraging structured concept relationships [2]. In Stage 2, we look into the dialogue state tracking problem and build a neural multimodal belief tracker [3] that learns fashion semantics by delving into image sub-regions.

To enhance the intelligence of conversational systems, we further carry out two explorations on knowledge incorporation [1]. In the first approach, we develop a knowledge-aware multimodal dialogue system in the fashion domain that focuses on exploiting various forms of domain knowledge, as shown in Figure 1. In the second approach, we propose to model the strategies of the opponent in a conversation and use the opponent model to improve our system’s ability to anticipate unexpected opponent behavior.

Figure 1. An example of a fashion dialogue agent. Key research points are multimodal understanding and knowledge incorporation.

2. MMDial Dataset for Multimodal Multi-task Dialogue Analysis

As humans naturally engage in conversations involving multiple tasks, supporting multiple tasks is a basic requirement for a multimodal conversational agent. To facilitate this line of research, we construct the MMDial dataset in a travel scenario involving several related tasks. The dataset supports research on internal dialogue state representation, conversational recommendation and contextualized response generation.

  • Internal dialogue state representation: The current representation (dialogue act + slot values) requires heavy human effort in constructing the dialogue ontology and is inflexible [4]. In fact, it is very hard to represent the example presented in Figure 2 using the current representation. We thus investigate a graph-based back-end database representation and propose to represent states as sub-graphs that are updated dynamically across conversation turns.
  • Conversational recommendation: Existing conversational recommendation systems do not consider the degree of preference that users express in their natural language utterances. We address this research gap by modeling preference degrees, and we further handle the situation where user preferences only partially match what the back-end database can recommend, as shown in Figure 2.
  • Contextualized response generation: For task-oriented dialogue systems, response generation is an intensively studied but over-simplified task. We argue that the desired responses should not only deliver the information requested or translate the generated actions into text, but also guide users towards task completion. They should fit user preferences and make use of outside knowledge sources, as in the example shown in Figure 2. We thus study how to leverage a matching network and CopyNet to achieve this goal.

Figure 2. An example of a multimodal multi-task dialogue that we collected. Key research points are internal dialogue state representation, conversational recommendation and contextualized response generation.
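
To make the first two points more concrete, the following is a minimal Python sketch of a dialogue state kept as a sub-graph of a back-end graph, with a preference degree attached to each mentioned node. It is an illustration under stated assumptions, not our actual model: the class names, the hotel and attribute nodes, the degrees in [0, 1], and the scoring rule (satisfied preference mass divided by total preference mass) are all hypothetical.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy back-end graph linking entities to attribute nodes (hypothetical)."""

    def __init__(self):
        self.adjacent = defaultdict(set)

    def add_edge(self, a, b):
        # Undirected edge; relation labels are omitted to keep the sketch short.
        self.adjacent[a].add(b)
        self.adjacent[b].add(a)

    def neighbours(self, node):
        return self.adjacent.get(node, set())


class SubgraphState:
    """Dialogue state: graph nodes mentioned so far, each with a preference degree."""

    def __init__(self, graph):
        self.graph = graph
        self.degree = {}  # node -> assumed preference degree in [0, 1]

    def update(self, mentions):
        # mentions: {node: degree} grounded from the latest user turn.
        for node, degree in mentions.items():
            if node in self.graph.adjacent:  # ignore nodes missing from the graph
                self.degree[node] = degree

    def rank(self, entities):
        # Score = satisfied preference mass / total preference mass, so an
        # entity that only partially matches is ranked lower, not dropped.
        total = sum(self.degree.values())
        scored = []
        for entity in entities:
            near = self.graph.neighbours(entity)
            hit = sum(d for n, d in self.degree.items() if n in near)
            scored.append((entity, hit / total if total else 0.0))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Usage with made-up travel data.
kg = KnowledgeGraph()
kg.add_edge("Hotel A", "Marina Bay")
kg.add_edge("Hotel A", "pool")
kg.add_edge("Hotel B", "Marina Bay")

state = SubgraphState(kg)
state.update({"Marina Bay": 1.0})             # turn 1: strong location preference
state.update({"pool": 0.5, "sauna": 0.8})     # turn 2: "sauna" is not in the graph
ranking = state.rank(["Hotel B", "Hotel A"])
```

In this toy setting the state grows turn by turn without a hand-built slot ontology, and a hotel that satisfies only the location preference still appears in the ranking with a lower score.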

3. Conversational Intervention in Product Search

Currently, users find the products they want mainly by browsing and clicking through a product web site. This is laborious, and navigating the complex site structure often takes a long time, so users may leave before they click and buy. Hence, we propose to cut the browsing stage short by detecting when users need help and intervening through a conversation system. We first leverage a neural attention model to decide when to intervene. We then train a model to learn the optimal policy of question selection through continuous interactions with the users.
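
As a rough illustration of the first step only, the sketch below attention-pools a session of browsing-action vectors and passes the result through a logistic layer to obtain an intervention probability. Everything specific here is an assumption made for the example: the two-dimensional action features, the query vector, the weights and the bias are hand-set, whereas a real model would learn them from interaction logs.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def intervention_probability(actions, query, weights, bias):
    """Attention-pool the action vectors, then apply a logistic output layer."""
    # Dot-product attention: actions that align with the query get more weight.
    attention = softmax([dot(query, a) for a in actions])
    dim = len(actions[0])
    pooled = [sum(w * a[i] for w, a in zip(attention, actions)) for i in range(dim)]
    logit = dot(weights, pooled) + bias
    return 1.0 / (1.0 + math.exp(-logit)), attention


# Hypothetical features per action: [viewed a product page, went back to the listing].
session = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [0.0, 2.0]      # assumed to focus on "went back" actions
weights = [-1.0, 3.0]   # assumed output-layer weights
bias = -1.0

probability, attention = intervention_probability(session, query, weights, bias)
should_intervene = probability > 0.5  # the threshold is also an assumption
```

The attention weights indicate which actions drove the decision, which is one reason an attention layer is a convenient choice for this kind of trigger.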

Plans for Future Research

We plan to further develop a general framework for intelligent conversational search into which the various components we are exploring can be integrated easily and flexibly. With an in-depth understanding of conversation systems’ internal state representation, policy making and contextualized response generation, we aim to realize a conversational search assistant in an e-commerce search application.


[1] L. Liao, Y. Ma, X. He, R. Hong, and T.S. Chua, “Knowledge-aware multimodal dialogue systems”, ACM Multimedia, 2018, pp. 801–809.

[2] L. Liao, X. He, B. Zhao, C.-W. Ngo, and T.S. Chua, “Interpretable multimodal retrieval for fashion products”, ACM Multimedia, 2018, pp. 1571–1579.

[3] Z. Zhang, L. Liao, M. Huang, X. Zhu, and T.S. Chua, “Neural multimodal belief tracker with adaptive attention for dialogue systems”, The Web Conference (WWW), 2019, pp. 2401–2412.

[4] L. Liao, Y. Ma, W. Lei, and T.S. Chua, “Rethinking Dialogue State Tracking with Reasoning”, ACL, 2020.

[5] Y. Sun and Y. Zhang, “Conversational recommender system”, ACM SIGIR, 2018.

[6] J. Zou and E. Kanoulas, “Learning to Ask: Question-based Sequential Bayesian Product Search”, CIKM, 2019, pp. 369–378.

