KuaiMM Conversation

MLLM-powered Conversational Interaction for KuaiShou Platform

Yiming Lei, Chenkai Zhang, Hui Qiu, Yu Cao, Jiaji Dong, Zeming Liu*, Haitao Leng, Shaoguo Liu,
Xiaoming Shi, Zhizheng Yang, Chuan Wang, Tingting Gao, Qingjie Liu*, Wanxiang Che, Yunhong Wang
Equal Contribution. Project Leader. *Corresponding Authors.

Introduction

Multimodal Large Language Models (MLLMs) have achieved notable advancements in tasks such as image captioning, video understanding, and vision-language dialogue, benefiting from unified semantic representations across modalities. Despite this progress, short video platforms present distinct challenges that current benchmarks do not adequately cover: frequent user interaction, code-switching, rapid scene transitions, and blended reposting patterns.

To better support modeling in these dynamic scenarios, we introduce a systematic framework for evaluating and improving MLLM capabilities in short video contexts, centered around three key benchmarks:

  • KwaiChat: A multilingual, multi-topic dialogue dataset grounded in real short videos, enabling video-driven question answering, mixed-modal dialogue, and emotionally rich interactions.
  • SeriesBench: A benchmark tailored for short video series, aimed at character consistency tracking, narrative reconstruction, and logical reasoning across clips.
  • GODBench: A generation-oriented suite for multimodal “golden comment” creation, with evaluation dimensions including cultural relevance, creativity, humor, and contextual sensitivity.

The broader goal is to advance multimodal interaction from perception-level understanding to fully contextual, dialogue-centric, and creatively expressive use cases. Additional details, including dataset construction, evaluation protocols, and experimental results, are available on each benchmark's page.

BibTeX

@misc{KwaiMM-Dialogue,
  author       = {Yiming Lei and Chenkai Zhang and Zeming Liu and Xiaoming Shi and Haitao Leng and Shaoguo Liu and Tingting Gao and Qingjie Liu and Wanxiang Che and Yunhong Wang},
  title        = {KwaiMM-Dialogue: A Multimodal Dialogue Dataset from Real Short Video Comments},
  howpublished = {\url{https://github.com/stan-lei/KwaiMM-Dialogue}},
  note         = {Accessed: 2025-05-21},
  year         = {2024}
}