KwaiMM Conversation
Multimodal Large Language Models (MLLMs) have achieved notable advances in tasks such as image captioning, video understanding, and vision-language dialogue, benefiting from unified semantic representations across modalities. Despite this progress, short-video platforms present distinct challenges (frequent user interaction, code-switching, rapid scene transitions, and blended reposting patterns) that current benchmarks address insufficiently.
To better support modeling in these dynamic scenarios, we introduce a systematic framework for evaluating and improving MLLM capabilities in short-video contexts, centered on three key benchmarks.
The broader goal is to advance multimodal interaction from perception-level understanding to fully contextual, dialogue-centric, and creatively expressive use cases. Additional details, including dataset construction, evaluation protocols, and experimental results, are available on the respective pages.
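Since the released data format and loader are not described on this page, the sketch below only illustrates how such a dialogue benchmark might be consumed: it assumes a hypothetical JSONL layout with `video_id`, `context`, and `response` fields, a hypothetical file path, and a `generate_reply` stub standing in for a model-specific MLLM call. None of these names come from the project; the actual schema and evaluation protocol are documented on the benchmark pages.

```python
import json
from pathlib import Path

# Hypothetical layout: one JSON object per line, each holding a short-video
# dialogue sample. Field names here are assumptions, not the released schema.
DATA_PATH = Path("kwaimm_dialogue/train.jsonl")  # assumed path


def load_samples(path: Path):
    """Yield dialogue samples from a JSONL file (assumed format)."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def generate_reply(model, video_id: str, context: list[str]) -> str:
    """Placeholder for an MLLM call; the real interface is model-specific."""
    raise NotImplementedError


def evaluate(model, path: Path = DATA_PATH) -> float:
    """Exact-match accuracy of model replies against reference responses.

    A deliberately simple metric for illustration; the project's actual
    evaluation protocol is described on the benchmark pages.
    """
    correct = total = 0
    for sample in load_samples(path):
        pred = generate_reply(model, sample["video_id"], sample["context"])
        correct += int(pred.strip() == sample["response"].strip())
        total += 1
    return correct / max(total, 1)
```

In practice, `generate_reply` would wrap whichever MLLM is under evaluation, and exact match would likely be replaced by the reference-based or judge-based metrics the benchmarks define.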
If you find KwaiMM-Dialogue useful in your research, please cite:

```bibtex
@misc{KwaiMM-Dialogue,
  author       = {Yiming Lei and Chenkai Zhang and Zeming Liu and Xiaoming Shi and Haitao Leng and Shaoguo Liu and Tingting Gao and Qingjie Liu and Wanxiang Che and Yunhong Wang},
  title        = {KwaiMM-Dialogue: A Multimodal Dialogue Dataset from Real Short Video Comments},
  howpublished = {\url{https://github.com/stan-lei/KwaiMM-Dialogue}},
  note         = {Accessed: 2025-05-21},
  year         = {2024}
}
```