VideoMarathon

Unleashing Hour-Scale Video Training for
Long Video-Language Understanding

AMD, University of Rochester

Abstract

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LMMs underexplored.

To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. It includes around 9,700 hours of long videos sourced from diverse domains, with individual videos ranging from 3 to 60 minutes. In total, it contains 3.3M high-quality QA pairs spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations, up to 1 hour per video, and supports 22 diverse tasks requiring both short- and long-term video comprehension.

Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates question-relevant and spatiotemporally informative semantics from a cached full-video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating both the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.

VideoMarathon: A Long Video Instruction-following Dataset

VideoMarathon is a long video instruction-following dataset with a total duration of around 9,700 hours, consisting of 3.3M QA pairs across 22 tasks. The task taxonomy and dataset statistics are summarized in Figure 1.


Figure 1: VideoMarathon: A diverse long video instruction-following dataset. (a) The dataset contains 22 diverse tasks, covering both short-form (yellow tag) and long-form (red tag) comprehension. (b) The dataset spans diverse video source domains. (c) The dataset features a wide range of question types for long-form video-language modeling. (d) The dataset consists of long videos ranging from three minutes to one hour. (e) The dataset includes complex video content reflected by the number of events per video.


As shown in Table 1, compared with existing video instruction-following datasets, VideoMarathon features a significantly longer average video length, a broader duration range, and a larger number of QA pairs.

Table 1: Comparison between VideoMarathon and other existing video instruction-following datasets. OE and MC denote open-ended and multiple-choice, respectively.
| Dataset | Captioner | Summarizer | Total Video Time | Average Video Length | Duration Range | #OE QA | #MC QA |
|---|---|---|---|---|---|---|---|
| LLaVA-Hound | GPT-4V | GPT-4 | 3K hrs | 0.2 mins | 0–8 mins | 900K | 0 |
| ShareGPT4Video | GPT-4V | GPT-4 | 0.2K hrs | 0.3 mins | < 2 mins | 0 | 0 |
| LLaVA-Video-178K | GPT-4o | GPT-4o | 2K hrs | 0.6 mins | < 3 mins | 960K | 196K |
| VideoMarathon | Qwen2VL-7B | DeepSeek-V3 | 9.7K hrs | 20.9 mins | 3–60 mins | 1.73M | 1.57M |
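
For concreteness, each training sample in an instruction-following dataset of this kind pairs a long video with a question and a target response (plus candidate options for multiple-choice items). The snippet below is only an illustrative sketch; the field and task names are hypothetical and do not reflect the released VideoMarathon schema.

```python
# Illustrative sketch only: the field names below ("video", "topic", "task", ...)
# are hypothetical and do not reflect the released VideoMarathon schema.
mc_example = {
    "video": "videos/cooking/abc123.mp4",   # a 3-60 minute source video
    "topic": "event",                       # one of six topics: temporality, spatiality,
                                            # object, action, scene, event
    "task": "event_ordering",               # one of the 22 tasks (hypothetical task name)
    "question": "Which step happens first in the recipe?",
    "options": ["A. Chopping onions", "B. Boiling water",
                "C. Plating the dish", "D. Washing vegetables"],
    "answer": "B",                          # multiple-choice (MC) target
}

oe_example = {
    "video": "videos/sports/def456.mp4",
    "topic": "temporality",
    "task": "duration_estimation",          # hypothetical task name
    "question": "Roughly how long does the warm-up last?",
    "answer": "About five minutes, at the very start of the video.",  # open-ended (OE) target
}
```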

Hour-LLaVA: An Efficient Hour-scale Video-LMM

We propose Hour-LLaVA, an efficient Video-LMM powered by memory augmentation and capable of modeling hour-long videos at 1 FPS. It comprises three key modules: a video encoder, a memory augmentation module (MemAug), and an LLM decoder. Figure 2 shows the Hour-LLaVA framework, with the video encoder omitted for simplicity.
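
As rough motivation for the forgetting mechanism and memory repository described next, the back-of-the-envelope estimate below counts the visual tokens produced by an hour-long video at 1-FPS sampling; the tokens-per-frame value is an assumed placeholder, not a figure reported for Hour-LLaVA.

```python
# Back-of-the-envelope context length for a 1-hour video sampled at 1 FPS.
# TOKENS_PER_FRAME is an assumed placeholder (ViT-style encoders typically emit
# on the order of a hundred tokens per frame); Hour-LLaVA's actual per-frame
# token count is not stated here.
FPS = 1
VIDEO_MINUTES = 60
TOKENS_PER_FRAME = 196                      # assumption for illustration

num_frames = FPS * VIDEO_MINUTES * 60       # 3,600 frames for an hour-long video
raw_tokens = num_frames * TOKENS_PER_FRAME  # 705,600 visual tokens before any decay

print(f"{num_frames} frames -> {raw_tokens:,} visual tokens")
```

Feeding that many tokens directly to an LLM decoder is impractical, which is what the forgetting mechanism and MemAug module in Figure 2 are designed to address.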


Figure 2: Overview of the Hour-LLaVA Framework. Input video features \( \mathbf{H}_\text{v} \), encoded from 1-FPS sampled frames, are selectively decayed spatially and temporally through a forgetting mechanism, producing decayed video tokens \( \tilde{\mathbf{H}}_\text{v} \) for efficient video modeling. Meanwhile, the full video features \( \mathbf{H}_\text{v} \) are stored in a memory repository. Given the decayed tokens \( \tilde{\mathbf{H}}_\text{v} \) and the user question tokens \( \mathbf{H}_\text{q} \), the MemAug module enhances them with the full video context and question-relevant details retrieved from the memory repository, yielding memory-augmented video tokens \( \hat{\mathbf{H}}_\text{v} \). These augmented tokens are then passed, together with the original user question tokens \( \mathbf{H}_\text{q} \), into the LLM decoder to generate the final response \( \mathbf{X}_\text{a} \).
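
The internals of the forgetting mechanism and MemAug are not spelled out above, so the sketch below is only one plausible reading: it assumes uniform temporal subsampling for the decay step and cross-attention from the decayed video tokens (plus question tokens) into the full-video memory for the augmentation step. Module names, hyperparameters, and the attention layout are illustrative assumptions, not the released Hour-LLaVA implementation.

```python
import torch
import torch.nn as nn

class MemAug(nn.Module):
    """Sketch of a memory-augmentation block: decayed video tokens attend to the
    full-video memory, conditioned on the user question. Hyperparameters and the
    exact attention layout are assumptions, not the released Hour-LLaVA design."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, h_v_decayed, h_q, memory):
        # Queries: decayed video tokens concatenated with the question tokens, so the
        # retrieved context is both question-relevant and spatiotemporally informative.
        queries = self.norm_q(torch.cat([h_v_decayed, h_q], dim=1))
        mem = self.norm_kv(memory)
        context, _ = self.cross_attn(queries, mem, mem)  # cross-attend into full-video memory
        fused = queries + context
        fused = fused + self.ffn(fused)
        # Keep only the video-token positions as the memory-augmented tokens.
        return fused[:, : h_v_decayed.size(1)]


def forget(h_v, temporal_stride: int = 4):
    """Toy forgetting mechanism: uniform temporal subsampling of frame tokens.
    (The figure describes both spatial and temporal decay; only temporal decay is
    sketched here.)  h_v: (batch, num_frames, tokens_per_frame, dim)."""
    return h_v[:, ::temporal_stride].flatten(1, 2)


# Toy usage with random features standing in for encoder outputs (sizes are
# deliberately small; a real 1-hour video at 1 FPS would yield 3,600 frames).
B, T, P, D = 1, 120, 8, 256              # 2-minute toy clip, 8 tokens/frame (assumed)
h_v = torch.randn(B, T, P, D)            # full video features H_v from the video encoder
h_q = torch.randn(B, 16, D)              # user question tokens H_q
memory = h_v.flatten(1, 2)               # cached full-video context (memory repository)
h_v_hat = MemAug(dim=D)(forget(h_v), h_q, memory)  # memory-augmented video tokens
# h_v_hat, together with h_q, would then go to the LLM decoder to generate the response.
```

In this reading, the memory repository is simply the cached, undecayed token sequence, and conditioning the cross-attention queries on the user question is what makes the retrieved context adaptive to the question.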

Performance

We evaluate our models on four mainstream video-language benchmarks: TempCompass, LongVideoBench, Video-MME, and LVBench. As shown in Table 2, Hour-LLaVA consistently achieves the best performance on these four benchmarks in both the 3B and 7-8B model size categories.

Table 2: Performance comparison of existing LMMs on TempCompass, LongVideoBench, Video-MME, and LVBench. M-Avg denotes the average performance on multiple-choice tasks, and the average video duration of each benchmark (or Video-MME subset) is given in the column header. Video-MME results are reported as w/o subtitles / w/ subtitles. Note that LVBench's average video length exceeds the maximum video length seen during training, and some baseline results are our reimplementations.
| Method | LLM Params | Input Video | TempCompass M-Avg (11s) | LongVideoBench M-Avg (459s) | Video-MME Overall (1021s) | Video-MME Medium (516s) | Video-MME Long (2466s) | LVBench Avg (4037s) |
|---|---|---|---|---|---|---|---|---|
| *Proprietary LMM* | | | | | | | | |
| GPT-4V | – | 10 frames | – | 61.3 | 59.9/63.3 | 55.8/59.7 | 53.5/56.9 | – |
| GPT-4o | – | 384 frames | 70.9 | 66.7 | 71.9/77.2 | 70.3/76.6 | 65.3/72.1 | 48.9 |
| Gemini 1.5 Flash | – | 0.5/1 fps | – | 61.6 | 70.3/75.0 | 68.8/74.7 | 61.1/68.8 | – |
| Gemini 1.5 Pro | – | 0.5/1 fps | 69.3 | 64.0 | 75.0/81.3 | 74.3/81.0 | 67.4/77.4 | 33.1 |
| *Open-source LMM (<7B)* | | | | | | | | |
| ViLMA-1.5-3B | 3B | 8 frames | 56.1 | 42.9 | 42.2/44.2 | – | – | – |
| Phi-3.5-Vision-4.2B | 4.2B | 16 frames | – | – | 50.8/– | – | – | – |
| LongVU-3.2B | 3.2B | 1 fps | – | – | –/51.5 | – | –/47.2 | – |
| InternVL2.5-2B | 2B | 64 frames | 53.4 | 46.0 | 51.9/54.1 | – | – | – |
| Apollo-1.5B | 1.5B | 2 fps | 60.8 | 54.1 | 53.0/54.6 | – | – | – |
| Apollo-3B | 3B | 2 fps | 62.5 | 55.1 | 58.4/60.6 | – | – | – |
| LLaVA-Video-3B | 3B | 64 frames | 63.4 | 55.2 | 58.7/60.7 | 55.2/57.3 | 47.0/49.9 | 41.7 |
| Hour-LLaVA-3B (ours) | 3B | 1 fps | 63.6 | 57.8 | 60.6/66.7 | 59.0/65.4 | 52.1/60.4 | 44.7 |
| *Open-source LMM (7-8B)* | | | | | | | | |
| Video-LLaVA | 7B | 8 frames | 37.9 | 39.1 | 39.9/41.6 | 38.0/40.7 | 36.2/38.1 | – |
| VideoChat2 | 7B | 16 frames | 51.1 | 39.3 | 39.5/43.8 | 37.0/39.4 | 33.2/39.2 | – |
| ShareGPT4Video | 8B | 16 frames | 59.4 | 41.8 | 39.9/43.6 | 36.3/39.3 | 35.0/37.9 | – |
| VideoLLaMA2 | 7B | 16 frames | 51.4 | – | 47.9/50.3 | 37.0/39.4 | 33.2/39.2 | – |
| Video-XL | 7B | 1 fps | – | 50.7 | 55.5/61.0 | – | – | – |
| Kangaroo | 8B | 64 frames | 62.5 | 54.8 | 56.0/57.6 | 55.3/55.4 | 46.7/49.3 | 39.4 |
| LongVA | 7B | 128 frames | – | – | 52.6/54.3 | 50.4/53.6 | 46.2/47.6 | – |
| LongVILA | 7B | 256 frames | – | – | 60.1/65.1 | 58.3/64.9 | 53.0/57.4 | – |
| LongVU | 7B | 1 fps | – | – | –/60.9 | – | –/59.5 | – |
| Apollo-7B | 7B | 2 fps | 64.9 | 58.5 | 61.3/63.3 | – | – | – |
| LLaVA-Video-7B | 7B | 64 frames | 64.3 | 58.2 | 63.3/69.7 | 58.9/62.9 | 53.0/55.0 | 42.2 |
| Hour-LLaVA-7B (ours) | 7B | 1 fps | 68.1 | 60.4 | 63.6/70.2 | 63.8/70.0 | 55.0/65.1 | 45.6 |

Citation

@article{lin2025unleashing,
  author    = {Lin, Jingyang and Wu, Jialian and Sun, Ximeng and Wang, Ze and Liu, Jiang and Chen, Hao and Luo, Jiebo and Liu, Zicheng and Barsoum, Emad},
  title     = {Unleashing Hour-Scale Video Training for Long Video-Language Understanding},
  journal   = {arXiv preprint arXiv:2506.05332},
  year      = {2025},
}