GODBench

A Benchmark for Multimodal Large Language Models
in Video Comment Art

Yiming Lei, Chenkai Zhang, Zeming Liu*, Haitao Leng, Shaoguo Liu,
Tingting Gao, Qingjie Liu*, Yunhong Wang
ACL, 2025
Equal Contribution. Project Leader. *Corresponding Authors.

Abstract

Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, which requires a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) prompting have demonstrated strong reasoning abilities in STEM tasks (e.g., mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate the ability of MLLMs to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improving creative composition, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at our GitHub repository.
Overview of Method

Overview of the RoT framework: Inspired by how ripples diffuse in physics, human creative thinking is abstracted into five propagation components, which are mapped onto the RoT reasoning process in MLLMs.
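As a rough illustration of such a staged pipeline, the Python sketch below chains five prompting steps, feeding each step's output into the next. The stage names, prompt wording, and the llm callable are illustrative assumptions of our own, not the paper's official RoT components.

from typing import Callable

# Any text-in/text-out model call (e.g., a wrapped MLLM endpoint).
LLM = Callable[[str], str]

# Five propagation-inspired stages. NOTE: names and prompts are
# placeholders, not the actual RoT components defined in the paper.
STAGES = [
    ("source",       "Identify the core event and emotional point of the video:\n{ctx}"),
    ("spread",       "List cultural references and associations radiating from it:\n{ctx}"),
    ("reflection",   "Describe how viewers with different backgrounds might react:\n{ctx}"),
    ("interference", "Combine two associations into one unexpected, witty angle:\n{ctx}"),
    ("crest",        "Write a single short, creative comment from that angle:\n{ctx}"),
]

def ripple_of_thought(llm: LLM, video_context: str) -> str:
    """Run the staged prompts, propagating each stage's output forward."""
    ctx = video_context
    for _, template in STAGES:
        ctx = llm(template.format(ctx=ctx))
    return ctx  # the final stage yields the candidate comment

if __name__ == "__main__":
    stub = lambda prompt: f"[model output for: {prompt[:40]}...]"  # stand-in LLM
    print(ripple_of_thought(stub, "A cat knocks a cup off the table in slow motion."))

Each stage widens the "ripple" around the video's core event before the final comment is written, which is the intuition the framework's name suggests.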

Benchmark

Statistics


Composition and task distribution of GODBench: (Left) The dataset includes two main video categories—series videos and thematic videos—spanning 11 common themes such as Urban Life, Romance, Fantasy, and Food. Each series is annotated with its total number of videos. (Right) Task distribution is shown with detailed sample counts for each task defined in GODBench.

Taxonomy

Comment Art taxonomy and example tasks: The figure defines five dimensions of Comment Art and presents an example from the “Imaginary Completion” category with associated tasks.

Dataset Comparison


GODBench vs. existing benchmarks: The comparison covers video types, duration, context–response pairs with review type, coverage of the five Comment Art dimensions (RT, DA, WT, IV, ER), and task types (SEL, RNK, CLS, EXP, CRE).

Examples of GODBench

Examples from GODBench: We select one example for each of the 25 subcategories in GODBench; all samples are carefully annotated by human experts.

Evaluation


Performance of MLLMs on discriminative tasks: "Size" denotes the size of the underlying LLM. Evaluation uses Exact Match Accuracy (EMA), with results reported in percent (%). † denotes models fine-tuned with LoRA. The top three scores are highlighted in purple, orange, and gray.
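For reference, a minimal sketch of the EMA metric is given below; the whitespace/case normalization is our assumption and may differ from the paper's scoring script.

def exact_match_accuracy(preds: list[str], golds: list[str]) -> float:
    """Percentage of predictions that exactly match the gold answer."""
    norm = lambda s: s.strip().lower()  # normalization step is an assumption
    hits = sum(norm(p) == norm(g) for p, g in zip(preds, golds))
    return 100.0 * hits / len(golds)

print(exact_match_accuracy(["B", "c ", "A"], ["B", "C", "D"]))  # ≈66.67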


Performance of MLLMs on comment generation: Results are reported in percent (%). "SGPT-4o" denotes quality scores assigned by GPT-4o. A user study reports voting percentages (%) comparing outputs from different models and improvement methods.
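A minimal sketch of GPT-4o-as-judge scoring, written against the openai Python SDK; the rubric, 1-5 scale, and prompt wording are illustrative assumptions rather than the paper's exact judging protocol.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical rubric; the paper's actual judging instructions may differ.
RUBRIC = (
    "Rate the creativity of the video comment on a 1-5 scale, "
    "considering humor, relevance to the video, and originality. "
    "Reply with a single integer."
)

def judge_comment(video_description: str, comment: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Video: {video_description}\nComment: {comment}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())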

Case Study

Case Study: We select 9 examples from GODBench to illustrate the performance of MLLMs on discriminative tasks and comment generation.

BibTeX

@misc{godbench2025,
      title={GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art}, 
      author={Yiming Lei and Chenkai Zhang and Zeming Liu and Haitao Leng and Shaoguo Liu and Tingting Gao and Qingjie Liu and Yunhong Wang},
      year={2025},
      eprint={2505.11436},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.11436}, 
}