KwaiChat

A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus

Xiaoming Shi, Zeming Liu*, Yiming Lei, Chenkai Zhang, Haitao Leng,
Chuan Wang, Qingjie Liu*, Wanxiang Che, Shaoguo Liu, Size Li, Yunhong Wang
NAACL Findings, 2025
Project Leader. *Corresponding Authors.

Abstract

Video-based dialogue systems, such as education assistants, have compelling application value and are attracting growing interest. However, current video-based dialogue systems rely on a single dialogue type, which limits their versatility across practical scenarios such as question answering and emotional dialogue. In this paper, we frame this challenge as generating video-driven multilingual mixed-type dialogues. To address it, we propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus, termed KwaiChat, containing 93,209 videos and 246,080 dialogues across 4 dialogue types, 30 domains, 4 languages, and 13 topics. We also establish baseline models on KwaiChat. An extensive analysis of 7 LLMs on KwaiChat shows that GPT-4o achieves the best performance but still falls short, even with in-context learning and fine-tuning, which indicates that the task is non-trivial and needs further research.

Data filtering pipeline

Dataset composition overview: Domains, languages, topics, and dialogue types with balanced distribution across categories in KwaiChat.

Dialogue Corpus

Data Statistics

Distribution of domains and data statistics in KwaiChat: The dataset consists of 30 domains grouped into six major categories, including entertainment, education, and technology. Basic statistics such as the number of dialogues and average video duration are also provided.
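
As a rough illustration of how the statistics above might be tallied, the following is a minimal sketch. The record layout (`DialogueRecord` and its field names) and the helper functions are hypothetical and are not taken from the KwaiChat release. Only the corpus totals and the language abbreviations come from this page.

```python
from collections import Counter
from dataclasses import dataclass, field

# Corpus-level totals as quoted in the abstract.
NUM_VIDEOS = 93_209
NUM_DIALOGUES = 246_080

# Language abbreviations as used in the figure captions on this page.
LANGUAGES = ("POR", "ID", "ES", "ZH")


@dataclass
class DialogueRecord:
    """Hypothetical record layout; the field names are illustrative only."""
    video_id: str
    language: str        # one of LANGUAGES
    domain: str          # one of the 30 domains
    topic: str           # one of the 13 topics
    dialogue_type: str   # one of the 4 dialogue types
    turns: list[str] = field(default_factory=list)


def count_by(records, attribute):
    """Count records per distinct value of the given attribute."""
    return Counter(getattr(record, attribute) for record in records)


def dialogues_per_video(num_dialogues=NUM_DIALOGUES, num_videos=NUM_VIDEOS):
    """Average number of dialogues attached to each video."""
    return num_dialogues / num_videos


if __name__ == "__main__":
    # Toy records with placeholder values, not real corpus entries.
    sample = [
        DialogueRecord("v1", "ZH", "domain-a", "topic-a", "question-answering"),
        DialogueRecord("v1", "ZH", "domain-a", "topic-a", "emotional dialogue"),
        DialogueRecord("v2", "ES", "domain-b", "topic-b", "question-answering"),
    ]
    print(count_by(sample, "language"))
    print(f"{dialogues_per_video():.2f} dialogues per video on average")
```

With the totals from the abstract, the ratio comes to roughly 2.64 dialogues per video. Per-domain or per-topic counts would follow from the same `count_by` helper once the real field names are known.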

Dataset Comparison

Comparison of KwaiChat with other dialogue datasets. This figure compares the key characteristics of KwaiChat and existing dialogue datasets. Abbreviations include: DE (German), EN (English), ZH (Chinese), JPN (Japanese), ID (Indonesian), RUS (Russian), AR (Arabic), KIS (Kiswahili), ES (Spanish), POR (Portuguese). “Multi-party” indicates multi-participant dialogues.

Examples of KwaiChat

Experiment Results

Results of Zero-shot Evaluation

Zero-shot evaluation on KwaiChat: Percentage scores of seven LLMs in the zero-shot setting on the KwaiChat dataset. “POR”, “ID”, “ES”, and “ZH” denote Portuguese, Indonesian, Spanish, and Chinese, respectively.

Generation cases

Generation outputs from five LLMs: Two example cases generated by five models based on the same video and contextual input.

BibTeX

@inproceedings{shi2025kwaichat,
  title={KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus},
  author={Shi, Xiaoming and Liu, Zeming and Lei, Yiming and Zhang, Chenkai and Leng, Haitao and Wang, Chuan and Liu, Qingjie and Che, Wanxiang and Wang, Yunhong},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2025},
  pages={2279--2294},
  year={2025}
}