SeriesBench

A Benchmark for Narrative-Driven Drama Series Understanding

Chenkai Zhang, Yiming Lei, Zeming Liu*, Haitao Leng, Shaoguo Liu,
Tingting Gao, Qingjie Liu*, Yunhong Wang
CVPR, 2025
Equal Contribution. Project Leader. *Corresponding Authors.

Abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus on standalone videos and mainly assess "visual elements" such as human actions and object states. In reality, contemporary videos often carry complex, continuous narratives, typically presented as a series. To address this gap, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach that converts the manual annotations into diverse task formats. To further strengthen model capability for detailed analysis of plot structures and character relationships within a series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve notable performance improvements. Overall, SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities for understanding narrative-driven series, guiding the future development of MLLMs. SeriesBench is publicly available at our GitHub repository.
Overview of Method

Overview of PC-DCoT framework: Event and character chains are constructed separately from the input, then merged to enable question answering via dual-chain reasoning.
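To make the dual-chain idea concrete, here is a minimal, purely illustrative sketch. All names and data structures below (`Node`, `build_event_chain`, `build_character_chain`, `merge_chains`) are hypothetical: the actual PC-DCoT framework prompts an MLLM to extract the chains from video and subtitles, whereas this toy version simply merges two pre-extracted chains chronologically to form a unified reasoning context.

```python
# Toy sketch of dual-chain reasoning: build an event chain and a character
# chain separately, then merge them by timestamp into one timeline that a
# question-answering step could condition on. Hypothetical helpers only.
from dataclasses import dataclass

@dataclass
class Node:
    time: float  # position of the event/appearance within the episode
    kind: str    # "event" or "character"
    text: str    # natural-language description

def build_event_chain(descriptions):
    """Turn (time, text) plot-event tuples into event-chain nodes."""
    return [Node(t, "event", d) for t, d in descriptions]

def build_character_chain(appearances):
    """Turn (time, text) character-behavior tuples into character nodes."""
    return [Node(t, "character", d) for t, d in appearances]

def merge_chains(events, characters):
    """Interleave both chains chronologically (the dual-chain merge)."""
    return sorted(events + characters, key=lambda n: n.time)

events = build_event_chain([(1.0, "The heist is planned"),
                            (3.0, "The alarm is triggered")])
chars = build_character_chain([(2.0, "Ana slips into the vault")])

timeline = merge_chains(events, chars)
context = " -> ".join(n.text for n in timeline)
print(context)
# The heist is planned -> Ana slips into the vault -> The alarm is triggered
```

In the real framework the merged timeline would be fed back to the MLLM together with the question; the sketch only shows the chain-construction and merge steps that the figure depicts.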

Benchmark

Statistics


Composition and task distribution of SeriesBench: (Left) The dataset includes two main video categories—series videos and thematic videos—spanning 11 common themes such as Urban Life, Romance, Fantasy, and Food. Each series is annotated with its total number of videos. (Right) Task distribution is shown with detailed sample counts for each task defined in SeriesBench.

Dataset Comparison


Comparison between SeriesBench and existing benchmarks. SeriesBench uniquely supports narrative-driven series understanding. The table summarizes differences in modality, annotation type, clip count, video length, subtitle density, and task diversity.

Examples of SeriesBench

Evaluation

Case Study

BibTeX

@misc{seriesbench2025,
      title={SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding}, 
      author={Chenkai Zhang and Yiming Lei and Zeming Liu and Haitao Leng and Shaoguo Liu and Tingting Gao and Qingjie Liu and Yunhong Wang},
      year={2025},
      eprint={2504.21435},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.21435}, 
}