Dataset composition overview: Domains, languages, topics, and dialogue types with balanced distribution across categories in KwaiChat.
Distribution of domains and data statistics in KwaiChat: The dataset consists of 30 domains grouped into six major categories, including entertainment, education, and technology. Basic statistics such as the number of dialogues and average video duration are also provided.
Comparison of KwaiChat with other dialogue datasets: Key characteristics of KwaiChat alongside those of existing dialogue datasets. Abbreviations: DE (German), EN (English), ZH (Chinese), JPN (Japanese), ID (Indonesian), RUS (Russian), AR (Arabic), KIS (Kiswahili), ES (Spanish), POR (Portuguese). “Multi-party” indicates multi-participant dialogues.
Example from KwaiChat: Video frames (top) with multilingual comments and a structured Chinese dialogue (bottom), including topic annotations and dialogue types.
More examples from KwaiChat in various languages: Dialogue examples in Portuguese, Spanish, and Indonesian, illustrating the dataset's multilingual coverage.
Zero-shot evaluation on KwaiChat: Percentage scores of seven LLMs in the zero-shot setting on the KwaiChat dataset. “POR”, “ID”, “ES”, and “ZH” denote Portuguese, Indonesian, Spanish, and Chinese, respectively.
Generation outputs from five LLMs: Two example cases, each showing the responses of the five models given the same video and contextual input.
@inproceedings{shi2025kwaichat,
title={KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus},
author={Shi, Xiaoming and Liu, Zeming and Lei, Yiming and Zhang, Chenkai and Leng, Haitao and Wang, Chuan and Liu, Qingjie and Che, Wanxiang and Wang, Yunhong},
booktitle={Findings of the Association for Computational Linguistics: NAACL 2025},
pages={2279--2294},
year={2025}
}