Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization (2025)

Changli Tang1*, Yixuan Li1*, Yudong Yang1, Jimin Zhuang1, Guangzhi Sun2,
Wei Li3, Zejun Ma3, Chao Zhang1†
1Tsinghua University, 2University of Cambridge, 3ByteDance
{tcl24, yixuan-l21}@mails.tsinghua.edu.cn, cz277@tsinghua.edu.cn
*Equal contribution. †Corresponding author.

Abstract

Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through direct preference optimization (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimized using DPO. To further improve training, we introduce a novel multi-round DPO (mrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initializing the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilize the process. To address potential catastrophic forgetting of non-captioning abilities due to mrDPO, we propose rebirth tuning, which finetunes the pre-DPO LLM by using the captions generated by the mrDPO-trained model as supervised labels. Experiments show that mrDPO significantly enhances video-SALMONN 2’s captioning accuracy, reducing global and local error rates by 40% and 20%, respectively, while decreasing the repetition rate by 35%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining performance competitive with the state of the art on widely used video question-answering benchmarks among models of similar size. Upon acceptance, we will release the code, model checkpoints, and training and test data. Demos are available at https://video-salmonn-2.github.io.

1 Introduction

Large language models (LLMs) have exhibited outstanding capabilities in a wide range of natural language processing (NLP) tasks, and in some instances have even approached human-level performance (OpenAI et al., 2024; Dubey et al., 2024; Touvron et al., 2023; Du et al., 2022; Bai et al., 2023a). LLMs’ remarkable ability to understand, generate, and reason with text has sparked widespread interest, attracting both academia and industry to extend them to multimodal understanding and generation. To endow LLMs with multimodal understanding capability, recent studies adopt a paradigm of training modality adapters and aligners between multimodal encoders and LLMs. This approach leverages the world knowledge in the textual LLM to interpret diverse types of data perceived by multimodal encoders, enabling the generation of meaningful insights. Over the past two years, many multimodal LLMs have emerged following this paradigm across different modalities, including models for image and silent video understanding (Liu et al., 2024b; a; Li et al., 2023; Bai et al., 2023b; Lin et al., 2023; Chen et al., 2023; Lin et al., 2024; Chen et al., 2024), audio understanding (Wu et al., 2023; Tang et al., 2024b; Chu et al., 2023; 2024; Gong et al., 2024; 2023; Tang et al., 2024c; Zheng et al., 2024), and audio-visual understanding (Team et al., 2024; Cheng et al., 2024; Sun et al., 2024; Fu et al., 2024b; Tang et al., 2024d).

Text descriptions of multimodal data are critical for building multimodal LLMs, because most contemporary multimodal LLMs treat multimodal captioning as a cornerstone task during pre-training or supervised fine-tuning (SFT) to align the representation spaces of multimodal encoders with that of the textual LLM, helping the LLM recognise and understand events in multimodal data. Collecting high-quality text descriptions paired with multimodal data is therefore crucial for constructing high-performance multimodal LLMs, since training on more detailed and less hallucinated labels aligned with the multimodal data can enhance the LLM’s ability to perform multimodal understanding and reasoning. In video understanding, generating detailed and accurate captions is crucial but challenging, as videos contain rich content that encompasses not only spatial features within individual visual frames but also audio-visual events that unfold across multiple frames over time. However, very few multimodal LLM-related works focus on improving the quality of video captions, due to the lack of quantitative metrics for evaluating video captions and the absence of training methods to enhance the completeness of these descriptions while reducing the risks of hallucination. Additionally, while audio is typically paired with video and provides crucial, complementary information to the visual content, most current visual LLMs lack audio-understanding abilities, leading to the omission of audio information in the generated captions.

In this paper, we introduce video-SALMONN 2, a multimodal LLM that supports both audio and visual inputs and primarily focuses on detailed and holistic audio-visual captioning. Building upon an already well-trained visual LLM, video-SALMONN 2 is further enhanced with auditory capabilities by training on audio-only data as well as videos with synchronized audio tracks. This enables the model to simultaneously “see” and “hear” the video, emulating the way humans perceive and interpret multimedia content. To accurately assess the performance of the model, new metrics for evaluating captioning quality are proposed, which then serve as the objective to optimize during reinforcement learning (RL) based on direct preference optimization (DPO). A novel multi-round DPO (mrDPO) procedure is performed based on the preferences guided by these metrics, followed by a novel rebirth tuning stage to avoid the degradation of non-captioning abilities caused by mrDPO. Rebirth tuning leverages the post-mrDPO model to revise the captions of the videos in the training set, and trains the model after audio modality alignment using supervised fine-tuning (SFT) with the revised training data. Experiments demonstrate that video-SALMONN 2, with 7 billion (B) parameters, can generate complete and accurate video descriptions and even outperforms much larger commercial multimodal LLMs such as GPT-4o and Gemini-1.5-Pro, while maintaining performance competitive with the state-of-the-art (SOTA) multimodal LLMs of similar model size on the commonly used Video-MME (Fu et al., 2024a) video question-answering (QA) benchmark.

The main contributions of this work can be summarised as follows:

  • We develop video-SALMONN 2, a powerful audio-visual LLM that generates high-quality video captions, outperforming larger commercial models such as GPT-4o and Gemini-1.5 in terms of completeness and accuracy.

  • We introduce an evaluation pipeline that computes the missing and hallucination rates of audio-visual events in video captions using text-based LLMs, breaking down the process into sub-tasks suited for current LLMs. Additionally, we provide a new benchmark for video captioning with a human-annotated test set.

  • We propose the mrDPO approach to optimize multimodal LLMs for video captioning, incorporating periodic updates to the DPO reference model, merging and reinitializing the low-rank adaptation (LoRA) (Hu et al., 2022) module, and smoothing the training loss using SFT based on ground-truth captions. To our knowledge, this is the first work applying RL to audio-visual LLMs.

  • We introduce rebirth tuning to ensure the resulting model maintains high performance in both captioning and non-captioning tasks. The mrDPO process, followed by rebirth tuning, can be iteratively applied to further enhance performance.

2 Related Work

2.1 Multimodal LLMs

Following the paradigm of connecting multimodal encoders to LLMs using modality adapters, various models have been developed. For image-based LLMs, LLaVA (Liu et al., 2024b; a) applies instruction tuning (Wei et al., 2022) to enhance performance on zero-shot tasks. BLIP-2 (Li et al., 2023) uses Q-Former to link a frozen encoder with an LLM, while VILA (Lin et al., 2023) explores pre-training strategies, achieving impressive results in video QA. InternVL (Chen et al., 2023) scales up the size of visual encoders for improved image representation. For silent video understanding, Video-LLaVA (Lin et al., 2024) aligns both image and video adapters to learn unified representations. ShareGPT4Video (Chen et al., 2024) uses GPT-4 to generate dense video captions, improving data quality, and LLaVA-Hound (Zhang et al., 2024) introduces DPO to enhance video LLMs’ understanding capabilities.

In the realm of audio perception, SALMONN (Tang et al., 2024b) uses a dual-encoder structure and can perform zero-shot audio reasoning tasks. LTU (Gong et al., 2024) and LTU-AS (Gong et al., 2023), trained on a large audio question-answering dataset, are able to answer open-ended questions about audio. Qwen-Audio (Chu et al., 2023) and Qwen2-Audio (Chu et al., 2024) are built on large amounts of audio data to achieve high performance on a wide range of carefully selected audio tasks. Zheng et al. (2024) and Tang et al. (2024c) extend LLMs to perceive spatial audio information obtained from microphone array recordings.

As the visual frame sequence is often paired with audio in real-world video recordings, some studies investigate understanding non-silent video. video-SALMONN (Sun et al., 2024) uses a multi-resolution causal Q-Former to understand audio and video simultaneously. The Google Gemini model achieves video understanding as a native multimodal LLM built upon text, audio, and visual tokens (Team et al., 2024). AVicuna (Tang et al., 2024d) achieves audio-visual temporal understanding by introducing pseudo-untrimmed video data. Video-LLaMA (Zhang et al., 2023) and Video-LLaMA 2 (Cheng et al., 2024) directly concatenate audio and visual tokens for joint audio and video understanding.

2.2 RL for LLMs

RL with human feedback (RLHF) (Ouyang et al., 2022) is commonly used to enhance text-based LLMs, with early efforts applying PPO (Schulman et al., 2017) alongside a reward model trained on human preference data. Building on this, DPO (Rafailov et al., 2024) proposes that the LLM itself can serve as a reward model, using paired preference data to optimize the model without the need for an external reward model. KTO (Ethayarajh et al., 2024) further eliminates the need for paired preference data. Expanding on this, RLAIF (Lee et al., 2023) takes a cost-efficient approach by utilizing feedback generated automatically by models, reducing reliance on human involvement.

3 Methods

Figure 1: Overall architecture of video-SALMONN 2.

3.1 Model Architecture

The overall architecture of our model is illustrated in Fig. 1. The paired sequences of audio and visual frames from each video are fed into the audio and visual encoders separately. Users can provide textual prompts to guide the model in performing specific tasks based on the video content. This structure is implemented by incorporating a separate audio encoder branch into a pre-trained visual LLM, which enables the model to process and understand paired audio-visual sequences without degrading its visual performance.

In this structure, audio and visual tokens are computed independently in their respective branches. For the visual branch, the input visual frame sequence is first downsampled at a fixed frame rate of $\phi$ frames/second, and the total number of frames to sample is $n=\phi T$, where $T$ is the duration of the input video in seconds. Let $m$ be the maximum number of frames to sample under the resource constraint. If $n>m$, the frame rate is further reduced to $\phi'=\lfloor m/T\rfloor$, resulting in $n=\phi' T\leq m$. Let $\mathbf{I}_i$ be the $i$th sampled visual frame. Each visual frame in $\mathbf{I}_1,\mathbf{I}_2,\ldots,\mathbf{I}_n$ is transformed into visual tokens independently using a pre-trained visual encoder $\text{Encoder}_{\text{Visual}}$ followed by a visual modality aligner $\text{Aligner}_{\text{Visual}}$, as shown in Eqn. (1):

$$\mathbf{H}^{\text{Visual}}_{i}=\text{Aligner}_{\text{Visual}}(\text{Encoder}_{\text{Visual}}(\mathbf{I}_{i})),\quad 1\leq i\leq n, \qquad (1)$$

where $\mathbf{H}^{\text{Visual}}_{i}$ represents the visual tokens corresponding to $\mathbf{I}_{i}$.
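To make the frame-sampling rule concrete, here is a minimal Python sketch. It assumes that, when the budget is exceeded, frames are sampled uniformly over the video as stated in Section 4.1; `visual_encoder` and `visual_aligner` are placeholder callables, not identifiers from the released code.

```python
import numpy as np

def sample_frame_times(duration_s: float, fps: float = 1.0, max_frames: int = 30):
    """Timestamps (in seconds) of the visual frames to encode.

    Frames are taken at a fixed rate of `fps` frames/second; if that would exceed
    `max_frames`, we instead sample `max_frames` frames uniformly over the video.
    """
    n = int(fps * duration_s)
    if n <= max_frames:
        return [i / fps for i in range(n)]
    return list(np.linspace(0.0, duration_s, num=max_frames, endpoint=False))

def encode_visual_frames(frames, visual_encoder, visual_aligner):
    # Each sampled frame I_i is encoded independently and mapped into the LLM
    # token space by the visual aligner, as in Eqn. (1).
    return [visual_aligner(visual_encoder(frame)) for frame in frames]
```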

The audio frame sequence $\mathbf{S}$ is fed into a pre-trained audio encoder $\text{Encoder}_{\text{Audio}}$. Since $\text{Encoder}_{\text{Audio}}$ may have a maximum processing duration $t_{\text{max}}$, the audio is sliced into $l=\lceil T/t_{\text{max}}\rceil$ segments of length $t_{\text{max}}$, and each segment is processed separately by $\text{Encoder}_{\text{Audio}}$, as shown in Eqn. (2):

$$\mathbf{Z}^{\text{Audio}}_{j}=\text{Encoder}_{\text{Audio}}(\mathbf{S}_{(j-1)\times t_{\text{max}}:\,j\times t_{\text{max}}}),\quad 1\leq j\leq l, \qquad (2)$$

where $\mathbf{Z}^{\text{Audio}}_{j}$ is the audio feature output by the audio encoder for the $j$th audio segment.

As suggested by Yu et al. (2024), a segment-level positional embedding is added before the modality aligner to improve performance on long-form audio. Denote $\mathbf{Z}^{\text{Pos}}_{j}$ as the segment-level position embedding matrix corresponding to position $j$, $\text{Concat}(\cdot)$ as the concatenation operation along the time dimension, and $\text{Aligner}_{\text{Audio}}$ as the audio modality aligner. The audio token sequence $\mathbf{H}^{\text{Audio}}$ for the whole audio is computed as shown in Eqns. (3)–(5):

$$\tilde{\mathbf{Z}}^{\text{Audio}}_{j} = \mathbf{Z}^{\text{Audio}}_{j}+\mathbf{Z}^{\text{Pos}}_{j},\quad 1\leq j\leq l \qquad (3)$$
$$\tilde{\mathbf{Z}}^{\text{Audio}} = \text{Concat}(\tilde{\mathbf{Z}}^{\text{Audio}}_{1},\tilde{\mathbf{Z}}^{\text{Audio}}_{2},\ldots,\tilde{\mathbf{Z}}^{\text{Audio}}_{l}) \qquad (4)$$
$$\mathbf{H}^{\text{Audio}} = \text{Aligner}_{\text{Audio}}(\tilde{\mathbf{Z}}^{\text{Audio}}). \qquad (5)$$
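A matching sketch of the audio branch follows, assuming the waveform has already been sliced into segments of at most $t_{\text{max}}$ seconds; `audio_encoder`, `audio_aligner`, and `pos_embeddings` are placeholders for the Whisper encoder, the window-level Q-Former, and the segment-level positional embeddings.

```python
import torch

def encode_audio_segments(segments, audio_encoder, audio_aligner, pos_embeddings):
    """Encode a long audio track segment by segment, following Eqns. (2)-(5)."""
    feats = []
    for j, segment in enumerate(segments):
        z_j = audio_encoder(segment)           # Z_j^Audio, shape (frames, dim)
        feats.append(z_j + pos_embeddings[j])  # add segment-level position (Eqn. (3))
    z_all = torch.cat(feats, dim=0)            # concatenate along time (Eqn. (4))
    return audio_aligner(z_all)                # audio tokens H^Audio (Eqn. (5))
```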

Next, the audio and visual tokens are interleaved chronologically to form the input audio-visual token sequence $\mathbf{H}$ fed into the LLM backbone, where $\mathbf{H}$ is obtained based on Eqns. (6)–(8):

$$\alpha_{i} = l\cdot i/n,\quad 1\leq i\leq n \qquad (6)$$
$$\mathbf{H}_{i} = \text{Concat}(\mathbf{H}^{\text{Visual}}_{i},\mathbf{H}^{\text{Audio}}_{\alpha_{i-1}:\alpha_{i}}),\quad 1\leq i\leq n \qquad (7)$$
$$\mathbf{H} = \text{Concat}(\mathbf{H}_{1},\mathbf{H}_{2},\ldots,\mathbf{H}_{n}). \qquad (8)$$
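The interleaving of Eqns. (6)–(8) can be sketched as below; since the paper leaves the rounding of $\alpha_i$ implicit, the sketch rounds down, and the function names are our own.

```python
def interleave_av_tokens(visual_tokens, audio_tokens):
    """Interleave per-frame visual tokens with the chronologically matching slice
    of audio tokens. `visual_tokens` is a list of n per-frame token tensors and
    `audio_tokens` is the full audio token sequence."""
    n, l = len(visual_tokens), len(audio_tokens)
    blocks, prev = [], 0
    for i in range(1, n + 1):
        cur = (l * i) // n                      # alpha_i = l * i / n (rounded down)
        blocks.append(visual_tokens[i - 1])     # H_i^Visual
        blocks.append(audio_tokens[prev:cur])   # H^Audio_{alpha_{i-1} : alpha_i}
        prev = cur
    return blocks                               # concatenated to form H before the LLM
```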

Finally, the text-based backbone LLM generates a text response $\hat{\mathbf{Y}}$ given the user’s text prompt $\mathbf{P}$ and the audio-visual token sequence $\mathbf{H}$:

$$\hat{\mathbf{Y}}=\arg\max\nolimits_{\mathbf{Y}}P(\mathbf{Y}\,|\,\mathbf{P},\mathbf{H}). \qquad (9)$$

3.2 Training Strategies

To introduce audio perceptual capabilities to the visual LLM, we employ a multi-stage training approach that enables the model to fully utilize audio information for video understanding while maintaining its performance in processing visual data. Building on a well-trained visual LLM, the training proceeds through several stages: audio modality alignment, audio-visual SFT, RL based on the proposed mrDPO, and the newly introduced rebirth tuning. Since the pre-trained visual LLM already understands video, with the LLM, visual aligner, and visual encoder well initialized, both the LLM and the visual branch are kept frozen during all training stages. Similarly, the audio encoder parameters remain fixed, as they have already been trained on a large-scale audio dataset.

Audio modality alignment extends the visual LLM by adding a parallel audio branch, enabling auditory perception capabilities. During this stage, only the audio aligner is trained on a large audio dataset, while the rest of the model remains frozen to preserve its original visual understanding performance. Since the focus is exclusively on learning the audio branch, only audio data is needed for training.

After audio modality alignment, the backbone LLM can recognize both visual and audio tokens. However, due to the lack of training with paired audio and visual token sequences, the model is not yet capable of synchronizing and integrating audio-visual information for comprehensive video understanding. To address this, we conduct audio-visual SFT using supervised video data. To improve the backbone LLM’s ability to process audio-visual token sequences, LoRA (Hu et al., 2022) is applied and trained during this stage. Additionally, the audio aligner is trained to align the output of the audio encoder with the input representation space of the LLM, making it easier for the backbone LLM to interpret audio tokens.

Although the model demonstrates the ability to describe synchronized audio-visual information in video after SFT, several issues persist, including missing information, hallucinations, and repetitive decoding. To address these shortcomings, we apply RL based on mrDPO to improve the model’s performance. Additionally, we introduce rebirth tuning after RL to further enhance the model’s performance in non-captioning tasks. Fig. 2 provides an overview of the entire training process involving mrDPO and rebirth tuning, with further details explained in Sections 3.3 and 3.4.

Figure 2: Overview of the training process of video-SALMONN 2, including mrDPO and rebirth tuning.

3.3 RL stage with mrDPO

We aim to leverage RL to improve the quality of video captions generated by the model. To establish an effective method for evaluating the completeness of video captions, we propose using atomic events as a bridge to automatically assess the preference of caption samples through artificial intelligence (AI) feedback, guiding the model to produce more accurate and detailed video descriptions.

Figure 3: Pipeline for selecting preferred caption samples when applying RL to video-SALMONN 2.

The pipeline for selecting preferred samples when applying RL to video-SALMONN 2 is illustrated in Fig. 3. First, distinct video captions are sampled from the model’s output distribution given the input video. These captions may be either global captions describing the entire video or local captions focusing on a specific time interval. To determine the preferred sample for global captions, the labelled caption of the input video is fed into a powerful text LLM, which is tasked with breaking down the caption into basic atomic events. This is relatively straightforward for commercial LLMs like GPT-3.5 and GPT-4o, and the resulting atomic events are generally reasonable. Next, the text LLM is used to evaluate each sampled caption by identifying missed or hallucinated events and calculating the information missing and hallucination rates; the total error rate is the sum of the missing and hallucination rates. For local captions, a similar evaluation process is followed, with atomic events extracted using Gemini-1.5-Pro (as detailed in Appendix A). In addition to the metrics based on atomic events, we also consider the repetition rate of the video captions; the calculation procedure is provided in Appendix B. DPO (Rafailov et al., 2024) is applied as the main RL method based on automatic AI feedback. We assume that only sample pairs with significant metric differences are suitable for RL; therefore, sample pairs with minimal metric gaps are excluded from the RL training set.
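As an illustration, the AI-feedback metrics and the pair-selection rule can be sketched as follows. The threshold values are the round-1 settings from Appendix F, and all function and argument names are our own rather than identifiers from the released code.

```python
def caption_metrics(num_missed, num_hallucinated, num_events, num_repeated_words, num_words):
    """Per-caption AI-feedback metrics: missing rate, hallucination rate, their sum
    (total error rate), and the text repetition rate (Appendix B)."""
    miss = num_missed / num_events
    hall = num_hallucinated / num_events
    return {"miss": miss, "hall": hall, "total": miss + hall,
            "rep": num_repeated_words / max(num_words, 1)}

def build_preference_pair(m_a, m_b, err_gap=0.05, rep_gap=0.01):
    """Keep a sampled caption pair for DPO only if one caption beats the other on
    both the total error rate and the repetition rate by at least the given
    thresholds; otherwise discard the pair."""
    if m_a["total"] + err_gap <= m_b["total"] and m_a["rep"] + rep_gap <= m_b["rep"]:
        return {"chosen": "a", "rejected": "b"}
    if m_b["total"] + err_gap <= m_a["total"] and m_b["rep"] + rep_gap <= m_a["rep"]:
        return {"chosen": "b", "rejected": "a"}
    return None  # metric gap too small: excluded from the RL training set
```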

Unlike previous approaches that applied only single-round DPO to multimodal LLMs, we introduce a multi-round strategy, since prolonged offline training with a single round fails to optimize the model effectively: the frozen reference model in the DPO algorithm becomes increasingly mismatched with the most recent model update. In the multi-round framework, the following steps are taken to perform DPO training in each round $t$.

  1. First, the pre-trained LoRA module $\Delta_{t-1}$ is merged into the LLM backbone $\Lambda_{t-1}$ to derive a new LLM backbone $\Lambda_{t}$ that is equivalent to $\Lambda_{t-1}$ combined with $\Delta_{t-1}$, based on Eqn. (10):

     $$\mathbf{W}_{t}=\mathbf{W}_{t-1}+\alpha\mathbf{A}_{t-1}\mathbf{B}_{t-1}, \qquad (10)$$

     where $\mathbf{W}_{t}$ and $\mathbf{W}_{t-1}$ are the weight parameters to adapt in $\Lambda_{t}$ and $\Lambda_{t-1}$, $\alpha$ is the scaling factor of LoRA, $r$ is the rank of LoRA, and $d$ is the dimension of $\mathbf{W}_{t-1}$. $\mathbf{A}_{t-1}\in\mathbb{R}^{d\times r}$ and $\mathbf{B}_{t-1}\in\mathbb{R}^{r\times d}$ are the low-rank matrix parameters of LoRA from the previous round $t-1$, and $\mathbf{W}\in\mathbb{R}^{d\times d}$ denotes the weight parameters of the LLM backbone.

  2. Next, the new LLM backbone $\Lambda_{t}$ is paired with a newly and randomly initialized LoRA module $\tilde{\Delta}_{t}$, forming the policy model for round $t$. To address the growing difference between the reference and policy models caused by freezing the reference model in standard DPO, $\Lambda_{t}$ is used as the updated reference model in round $t$.

  3. Finally, $\tilde{\Delta}_{t}$ is trained to obtain a well-trained $\Delta_{t}$, which can be achieved using the standard DPO loss. However, after multiple training rounds, the model starts to produce unnatural language patterns such as unintelligible or meaningless sentences. To alleviate this issue by stabilizing the training, a guided DPO (gDPO) loss is proposed (a minimal implementation sketch is given after this list):

     $$\mathcal{L}_{\text{gDPO}}(\pi_{\theta};\pi_{\text{ref}}) = \mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}}) + \lambda\,\mathbb{E}_{(\mathbf{x},\mathbf{y}_{\text{gt}})\sim\mathcal{D}_{\text{gt}}}\log\pi_{\theta}(\mathbf{y}_{\text{gt}}\,|\,\mathbf{x}), \qquad (11)$$

     where $\mathcal{L}_{\text{DPO}}$ is the standard DPO loss, and $\pi_{\theta}=\{\Lambda_{t},\tilde{\Delta}_{t}\}$ and $\pi_{\text{ref}}=\Lambda_{t}$ represent the policy and reference models, respectively. $\mathcal{D}_{\text{gt}}$ denotes the SFT training dataset, where $(\mathbf{x},\mathbf{y}_{\text{gt}})$ corresponds to a video and its paired ground-truth text description, randomly selected from $\mathcal{D}_{\text{gt}}$. Finally, $\lambda$ is the weight of the second regularization term, which corresponds to cross-entropy learning towards ground-truth text descriptions without unnatural patterns.
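The gDPO objective can be written as a short PyTorch-style sketch operating on pre-computed sequence-level log-probabilities. The regularization term is implemented here as the cross-entropy (negative log-likelihood) of the ground-truth caption under the policy, matching its description above; all names are illustrative and not taken from the released code.

```python
import torch.nn.functional as F

def dpo_loss(pol_chosen_lp, pol_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Standard DPO loss on sequence-level log-probabilities (Rafailov et al., 2024).
    margin = beta * ((pol_chosen_lp - ref_chosen_lp) - (pol_rejected_lp - ref_rejected_lp))
    return -F.logsigmoid(margin).mean()

def gdpo_loss(pol_chosen_lp, pol_rejected_lp, ref_chosen_lp, ref_rejected_lp,
              pol_gt_lp, lam=0.1, beta=0.1):
    # gDPO = DPO loss + lambda * cross-entropy towards the ground-truth caption,
    # i.e. the negative ground-truth log-likelihood under the policy (cf. Eqn. (11)).
    reg = -pol_gt_lp.mean()
    return dpo_loss(pol_chosen_lp, pol_rejected_lp,
                    ref_chosen_lp, ref_rejected_lp, beta) + lam * reg
```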

These steps complete the training for a single round. Our proposed mrDPO is implemented by repeating these steps across multiple rounds. Notably, by merging $\Delta_{t-1}$ into $\Lambda_{t-1}$ and equipping the resulting $\Lambda_{t}$ with a new LoRA module $\tilde{\Delta}_{t}$, the new $\tilde{\Delta}_{t}$ functions as a LoRA proxy for parameter updates. This proxy helps regularize the training by introducing a new random initialization at each round of mrDPO, as sketched below.
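A minimal sketch of this per-round bookkeeping on a single weight matrix follows. The zero initialization of one low-rank factor (so that the fresh LoRA starts as a no-op) is a common convention and an assumption here, not a detail specified in the paper; the default rank and scaling factor are those stated in Section 4.1.

```python
import torch

def merge_lora_and_reinit(W, A, B, alpha=2.0, rank=256):
    """Merge the round-(t-1) LoRA update into the backbone weight (Eqn. (10)),
    then create a freshly initialized LoRA pair acting as the proxy for round t."""
    with torch.no_grad():
        W_t = W + alpha * (A @ B)            # W_t = W_{t-1} + alpha * A_{t-1} B_{t-1}
    d_out, d_in = W_t.shape
    A_new = torch.randn(d_out, rank) * 0.01  # random low-rank factor
    B_new = torch.zeros(rank, d_in)          # zero factor: new LoRA starts as a no-op
    return W_t, A_new, B_new
```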

3.4 Rebirth Tuning

After multiple rounds of iteration with the LoRA proxy, the model demonstrates significant improvements in captioning, showcasing strong potential for audio-visual understanding. However, despite efforts to preserve its language abilities, the model gradually begins to produce repetitive and unnatural text patterns in its responses; in some cases, its benchmark performance remains high even as these unnatural patterns appear. We believe this issue arises because RL methods primarily optimize the model’s output distribution using self-generated data. As a result, the model may overfit the feedback provided by AI models that mismatches real human preferences, leading it to adopt these unnatural patterns. This tendency can cause a collapse in the model’s output distribution, resulting in frequent incoherent or repetitive outputs and considerable performance declines on non-captioning tasks.

Rebirth tuning is introduced to address the issue of declining non-captioning language abilities. This method applies teacher-forcing training on self-generated data, promoting a more stable learning process for video understanding. Teacher-forcing, which guides the model to predict the next token, helps prevent it from converging on limited and repetitive patterns. More specifically, before applying rebirth tuning, mrDPO is halted once we observe a significant decline in the model’s language capabilities. The final iteration of the model, which excels at generating complete and accurate video descriptions, is then used to label a large dataset of videos. Since the model’s language abilities remain relatively intact, natural and fluent descriptions can be easily filtered by detecting problematic patterns, with the remaining high-quality descriptions used as training data for rebirth tuning.
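The exact filtering rules are not specified in the paper; the sketch below assumes that problematic captions can be caught with simple heuristics such as a high phrase-repetition rate or long runs of a repeated word, and the threshold value is purely illustrative.

```python
import re

def keep_for_rebirth(caption: str, max_rep_rate: float = 0.05) -> bool:
    """Heuristic filter applied to self-generated captions before rebirth tuning."""
    phrases = [p.strip().lower() for p in re.split(r"[,.;:!?]", caption) if p.strip()]
    if not phrases:
        return False
    repeated = sum(phrases.count(p) - 1 for p in set(phrases))
    rep_rate = repeated / len(phrases)
    # long runs of the same word (5+ repeats) are treated as degenerate output
    degenerate = re.search(r"(\b\w+\b)(\s+\1\b){4,}", caption) is not None
    return rep_rate <= max_rep_rate and not degenerate
```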

Rebirth tuning is applied to the model after audio modality alignment, allowing it to be "reborn" from self-generated high-quality data to enhance video understanding. Following rebirth tuning, the model not only avoids catastrophic forgetting of non-captioning abilities but also supports the development of the next generation of models by applying mrDPO again, followed by a subsequent stage of rebirth tuning.

4 Experimental Setup

4.1 Model Specifications

video-SALMONN 2 is built on an internally trained high-performance visual LLM. This visual LLM uses SigLIP (Zhai et al., 2023) as the visual encoder, Qwen 1.5 with 7B parameters as the backbone LLM, and two linear layers with the GELU activation function (Hendrycks & Gimpel, 2016) as the visual aligner. The model processes video frames at a frame rate of 1 frame/second (i.e., $\phi=1$) and can handle up to 30 frames. For videos longer than 30 seconds, 30 frames are uniformly sampled from the video.

For the audio branch, we use the Whisper-Large-v3 encoder (Radford et al., 2023) as the audio encoder, and a window-level Q-Former (Tang et al., 2024a) with a window length of 0.2 seconds as the audio aligner, producing a total of 150 audio tokens for a 30-second input. The Whisper encoder has a maximum processing duration of $t_{\text{max}}=30$ seconds. The rank $r$ and scaling factor $\alpha$ of LoRA are set to 256 and 2.0, respectively. During training, the visual encoder, visual aligner, audio encoder, and LLM remain frozen.
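For reference, the stated specifications can be gathered into a single configuration object; the field names below are our own and purely illustrative, not identifiers from the released code.

```python
from dataclasses import dataclass

@dataclass
class VideoSalmonn2Config:
    """Model specifications stated in Section 4.1."""
    visual_encoder: str = "SigLIP"
    backbone_llm: str = "Qwen-1.5-7B"
    audio_encoder: str = "Whisper-Large-v3"
    frame_rate_fps: float = 1.0         # phi
    max_frames: int = 30                # m
    audio_window_s: float = 0.2         # window length of the Q-Former aligner
    audio_max_duration_s: float = 30.0  # t_max of the Whisper encoder
    audio_tokens_per_30s: int = 150     # 30 s / 0.2 s windows -> 150 audio tokens
    lora_rank: int = 256                # r
    lora_alpha: float = 2.0             # alpha
```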

4.2 Data and Training Specifications

During the audio modality alignment stage, LibriSpeech-960h (Panayotov et al., 2015) and AudioCaps (Kim et al., 2019) are used to train the audio aligner. LibriSpeech-960h is utilized for speech recognition training, while AudioCaps is employed for audio captioning training.

In the audio-visual SFT stage, experiments are conducted using an internal video dataset that will be released upon acceptance. A total of 13k videos with rich audio information are automatically selected, and high-quality audio-visual captions are regenerated with the assistance of GPT-4o (OpenAI et al., 2024), Whisper-Large-v3 (Radford et al., 2023), and SALMONN (Tang et al., 2024b). The detailed pipeline is described in Appendix C. Additionally, to further enhance the quality of the SFT data, around 1.5k video captions were manually refined.

In the RL stage, two kinds of tasks are studied: global captioning for the whole video and local captioning for a given time interval. Before each round, a pair of captions for global captioning and a pair for local captioning are sampled from the model for each video in the SFT data. We consider the information missing rate, hallucination rate, and repetition rate to determine whether a sample pair is suitable for DPO and, if so, to determine the chosen and rejected samples. The selection criteria for each round are listed in Appendix F.

After mrDPO, the language abilities of the model are reduced. We stop RL training once significant degradation in language abilities is detected. The checkpoint of the last DPO round is used to label captions of a large number of videos. Unnatural captions are eliminated, and the remaining high-quality captions form the training data for the rebirth tuning stage.

For the test dataset, we curated a video captioning benchmark to evaluate the event missing rate (Miss), hallucination rate (Hall), and text repetition rate (Rep). Details of the test data and evaluation process can be found in Appendix D and Appendix E, respectively. The benchmark consists of 483 carefully selected videos, each labelled with complete audio-visual captions by human annotators. Atomic events for the test dataset were initially obtained using GPT-4o and then manually refined. For local captioning, we used Gemini-1.5-Pro (Team et al., 2024) to tag the start and end times of each event within specific time intervals. Since Gemini could not process some videos, only 457 videos were used for the local captioning evaluation.

Regarding training settings, we conducted audio modality alignment using 8×A100 GPUs for 35k steps and audio-visual SFT using 16×H100 GPUs for 4 epochs. Each RL round was trained with 64×H100 GPUs for 1k steps. After six rounds of mrDPO training, rebirth tuning was performed. During the rebirth tuning stage, we used 64×H100 GPUs and trained for 4 epochs. The batch size per device was set to 1, making the total batch size equal to the number of devices. The weight $\lambda$ in Eqn. (11) was set to 0.1 for all related experiments. The final video-SALMONN 2 model was obtained after one round of gDPO training following rebirth tuning.

5 Experimental Results

5.1 Overall Results

The results on our video captioning benchmark are presented in Table 1. video-SALMONN 2 outperforms other models in both information missing and hallucination rates for global and local captioning. Among existing open-source multimodal LLMs, few can provide detailed and accurate video descriptions, whether purely visual models like Video-LLaVA (Lin et al., 2024) and VILA (Lin et al., 2023), or audio-visual models like Video-LLaMA 2 (Cheng et al., 2024) and video-SALMONN (Sun et al., 2024). Notably, many open-source models, such as Video-LLaVA and Video-LLaMA 2, tend to generate shorter captions, leading to high information missing rates but low hallucination rates. GPT-4o and Gemini-1.5-Pro generate more detailed captions of higher quality than current open-source models. However, the purely visual version of GPT-4o lacks audio comprehension, and Gemini’s understanding of visual content is somewhat limited, resulting in both models exhibiting some degree of information missing and hallucination.

Our visual base model, trained on a large dataset of images and silent videos, is capable of generating detailed text descriptions based solely on visual information, with a relatively low information missing rate. However, generating longer texts leads to a higher hallucination rate. After audio modality alignment and audio-visual SFT, the model can leverage audio content to reduce both information loss and hallucinations in its descriptions. However, the inclusion of audio tokens may confuse the visual LLM, resulting in frequent repetition during decoding. Building on the SFT model, we applied mrDPO and rebirth tuning, achieving approximately a 35% absolute reduction in the repetition rate and absolute reductions of around 40% and 20% in the total error rate for global and local captioning, respectively. The final video-SALMONN 2 model outperforms commercial models such as GPT-4o and Gemini-1.5-Pro in video captioning. As an audio-visual LLM, video-SALMONN 2 retains strong visual understanding capabilities and performs well on various visual benchmarks, such as Video-MME (Fu et al., 2024a). For more details, refer to Appendix G.

Table 1: Video captioning results (%Rep: repetition rate; %Miss: missing rate; %Hall: hallucination rate; %Total = %Miss + %Hall; lower is better).

Model | Modality | Global %Rep | Global %Miss | Global %Hall | Global %Total | Local %Miss | Local %Hall | Local %Total
GPT-4o Visual | V | 3.6 | 16.6 | 17.2 | 33.8 | 35.3 | 30.7 | 66.0
Gemini-1.5-Pro | A + V | 1.3 | 21.8 | 16.5 | 38.3 | 36.9 | 17.2 | 54.1
7B Video-LLaVA | V | 13.2 | 65.3 | 5.4 | 70.7 | 59.1 | 9.4 | 68.5
8B VILA | V | 4.5 | 39.3 | 18.6 | 57.9 | 47.9 | 23.4 | 71.2
7B Video-LLaMA 2 | A + V | 5.7 | 56.8 | 8.9 | 65.7 | 47.6 | 14.3 | 61.9
13B video-SALMONN | A + V | 1.2 | 52.1 | 26.6 | 78.7 | 47.8 | 40.7 | 88.4
7B Ours-Visual Base | V | 11.8 | 29.8 | 30.0 | 59.7 | 36.1 | 46.1 | 82.2
7B Ours-SFT | A + V | 36.0 | 26.7 | 26.9 | 53.6 | 30.8 | 33.3 | 64.0
7B video-SALMONN 2 | A + V | 1.4 | 6.9 | 6.8 | 13.7 | 22.2 | 21.4 | 43.6

5.2 Analysis of Multi-Round Reinforcement Learning

This section explores various approaches to training in mrDPO. In terms of loss functions, we compared the standard DPO loss, the proposed gDPO loss (which includes the regularization term based on the ground-truth captions), and a similar loss referred to as “cDPO”, which is the sum of the DPO loss and a cross-entropy loss on the chosen samples. Additionally, the LoRA proxy is evaluated against directly tuning the model’s original LoRA, referred to as “Direct”. Fig. 4(a) presents the total error rates for global video captioning. Training is halted when unnatural captions begin to appear frequently; examples of such cases are provided in Appendix I.

Figure 4: (a) Total error rates for global captioning over mrDPO rounds with different loss functions and LoRA settings. (b) gDPO after rebirth tuning compared with continued gDPO rounds using the LoRA proxy.

Among the three loss functions, DPO shows the fastest improvement in captioning metrics, but unfortunately, it also quickly leads to outputs with frequent unnatural patterns. This is likely because the model only sees self-generated labels rather than ground-truth labels. cDPO faces the same issue but performs worse than DPO. By incorporating loss on ground-truth labels, gDPO makes the training process more stable, allowing the model to generate text responses without unnatural patterns for a longer period of training across multiple RL rounds. This stability also preserves the model’s potential for further improvement, with a significant drop in the error rate observed after three rounds of mrDPO.

The LoRA proxy, which randomly initializes a new LoRA module in each RL round, is found to be more beneficial for mrDPO performance than directly training the same LoRA, especially after more than four gDPO rounds. This reveals that the regularization effect introduced by the LoRA proxies helps to alleviate over-fitting. Since gDPO with the LoRA proxy performs best for mrDPO, we use the model after six gDPO rounds with the LoRA proxy to generate captions for a large number of videos. After excluding captions with unnatural patterns, a total of 180k video captions remain for rebirth tuning.

5.3 Analysis of Rebirth Tuning

Table 2: Occurrence rate (%) of unnatural patterns in global captions after each training stage (lower is better).

Stage | SFT | mrDPO R1 | R2 | R3 | R4 | R5 | R6 | R7 | Rebirth Tuning
%Unnatural Rate | 0.0 | 0.0 | 0.0 | 1.9 | 0.9 | 2.9 | 2.1 | 12.0 | 0.0

While multiple rounds of mrDPO with LoRA proxy significantly improve video captioning performance, they also lead to an increasing frequency of unnatural patterns in text responses. Table 2 shows the occurrence rate of these unnatural patterns in global captioning after each training stage. Through rebirth tuning, the backbone LLM discards the LoRA proxies and restores its ability to generate fluent captions. Additionally, the careful selection of rebirth-tuning data enhances data quality, ensuring the model is fine-tuned with superior data, further boosting its overall performance.

Another notable effect of rebirth tuning is that it sustains continued training. As shown in Fig. 4(a), in the later rounds of mrDPO, the model gains less with each round and eventually converges. The decline in the ability to generate fluent text responses is also more likely to occur in these later rounds, suggesting that the model has fallen into a local minimum after mrDPO. However, after the rebirth tuning stage, where only teacher-forcing training is applied, the model escapes the local optimum of the previous training and becomes receptive to further optimization with RL. Fig. 4(b) compares gDPO after rebirth tuning with six rounds of gDPO with the LoRA proxy. Only minimal improvement can be achieved after sufficient RL rounds in terms of the total error rate for global captioning, while an extra RL stage following rebirth tuning yields significant performance gains once again. This suggests the potential of iterating mrDPO and rebirth tuning.

6 Conclusions

This work introduces video-SALMONN 2, a powerful audio-visual LLM designed for detailed video captioning, and proposes the mrDPO method. To our knowledge, this is the first study applying RL to audio-visual LLMs in the literature. New metrics are designed to evaluate the information missing and hallucination rates of video captions, which are used to guide sample selection for DPO. To further stabilize training, the novel gDPO loss and LoRA proxy are introduced. After mrDPO, we propose a novel rebirth tuning method to restore the LLM’s performance on non-captioning tasks. As a result, video-SALMONN 2 demonstrates significant improvements in video captioning, outperforming notable models such as GPT-4o and Gemini-1.5-Pro, and setting a promising direction for achieving detailed and accurate video captioning for video understanding.

References

  • Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, et al. Qwen Technical Report. arXiv preprint arXiv:2309.16609, 2023a.
  • Bai et al. (2023b) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966, 2023b.
  • Chen et al. (2024) Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. ShareGPT4Video: Improving Video Understanding and Generation with Better Captions. arXiv preprint arXiv:2406.04325, 2024.
  • Chen et al. (2023) Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. arXiv preprint arXiv:2312.14238, 2023.
  • Cheng et al. (2024) Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. arXiv preprint arXiv:2406.07476, 2024.
  • Chu et al. (2023) Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models. arXiv preprint arXiv:2311.07919, 2023.
  • Chu et al. (2024) Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-Audio Technical Report. arXiv preprint arXiv:2407.10759, 2024.
  • Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proc. ACL, Dublin, 2022.
  • Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, et al. The LLaMA 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024.
  • Ethayarajh et al. (2024) Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model Alignment as Prospect Theoretic Optimization. arXiv preprint arXiv:2402.01306, 2024.
  • Fu et al. (2024a) Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. arXiv preprint arXiv:2405.21075, 2024a.
  • Fu et al. (2024b) Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, Ran He, Rongrong Ji, Yunsheng Wu, Caifeng Shan, and Xing Sun. VITA: Towards Open-Source Interactive Omni Multimodal LLM. arXiv preprint arXiv:2408.05211, 2024b.
  • Gong et al. (2023) Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, and James Glass. Joint Audio and Speech Understanding. In Proc. ASRU, Taipei, 2023.
  • Gong et al. (2024) Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James R. Glass. Listen, Think, and Understand. In Proc. ICLR, Vienna, 2024.
  • Hendrycks & Gimpel (2016) Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
  • Hu et al. (2022) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In Proc. ICLR, 2022.
  • Kim et al. (2019) Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps: Generating Captions for Audios in The Wild. In Proc. NAACL-HLT, Minneapolis, 2019.
  • Lee et al. (2023) Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. arXiv preprint arXiv:2309.00267, 2023.
  • Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proc. ICML, Honolulu, 2023.
  • Lin et al. (2024) Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. In Proc. CVPR, Seattle, 2024.
  • Lin et al. (2023) Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. VILA: On Pre-training for Visual Language Models, 2023.
  • Liu et al. (2024a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved Baselines with Visual Instruction Tuning. In Proc. CVPR, Seattle, 2024a.
  • Liu et al. (2024b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. In Proc. NeurIPS, Vancouver, 2024b.
  • OpenAI et al. (2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, et al. GPT-4 Technical Report, 2024.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training Language Models to Follow Instructions with Human Feedback. In Proc. NeurIPS, New Orleans, 2022.
  • Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In Proc. ICASSP, Brisbane, 2015.
  • Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust Speech Recognition via Large-scale Weak Supervision. In Proc. ICML. PMLR, 2023.
  • Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proc. NeurIPS, Vancouver, 2024.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sun et al. (2024) Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models. In Proc. ICML, Vienna, 2024.
  • Tang et al. (2024a) Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Extending Large Language Models for Speech and Audio Captioning. In Proc. ICASSP, Seoul, 2024a.
  • Tang et al. (2024b) Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. SALMONN: Towards Generic Hearing Abilities for Large Language Models. In Proc. ICLR, Vienna, 2024b.
  • Tang et al. (2024c) Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Jun Zhang, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. Can Large Language Models Understand Spatial Audio? In Proc. Interspeech, Kos Island, 2024c.
  • Tang et al. (2024d) Yunlong Tang, Daiki Shimada, Jing Bi, and Chenliang Xu. AVicuna: Audio-visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue. arXiv preprint arXiv:2403.16276, 2024d.
  • Team et al. (2024) Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805, 2024.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, et al. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971, 2023.
  • Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned Language Models are Zero-Shot Learners. In Proc. ICML, 2022.
  • Wu et al. (2023) Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, et al. On Decoder-only Architecture for Speech-to-Text and Large Language Model Integration. In Proc. ASRU, Taipei, 2023.
  • Yu et al. (2024) Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Connecting Speech Encoder and Large Language Model for ASR. In Proc. ICASSP, Seoul, 2024.
  • Zhai et al. (2023) Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. arXiv preprint arXiv:2303.15343, 2023.
  • Zhang et al. (2023) Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In Proc. EMNLP: System Demonstrations, Singapore, 2023.
  • Zhang et al. (2024) Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, and Yiming Yang. Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward. arXiv preprint arXiv:2404.01258, 2024.
  • Zheng et al. (2024) Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, and David Harwath. BAT: Learning to Reason about Spatial Sounds with Large Language Models. In Proc. ICML, Vienna, 2024.

Appendix A Pipeline for Getting Atomic Events in a Time Interval

Gemini-1.5-Pro is used to obtain the atomic events within given time intervals. We input the video and all of its atomic events labelled by GPT into Gemini and ask it to tag the start and end time of each atomic event. Events are then selected if their time intervals overlap with the given time interval. We have checked this process and confirmed that the atomic events obtained for a given time interval are roughly accurate.

Appendix B Calculation Procedure of The Text Repetition Rate

The procedure to calculate the repetition rate of a long and detailed text is as follows (a minimal implementation sketch is given after the list):

  1. Split the text into short phrases by punctuation;

  2. Count the number of occurrences of each phrase;

  3. Divide the number of words belonging to recurring phrases by the total number of words in the text to obtain the repetition rate.
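A rough Python implementation of this procedure is given below; the exact tokenization and punctuation set are not specified in the paper, so those details are assumptions.

```python
import re

def repetition_rate(text: str) -> float:
    """Repetition rate: words in recurring phrases / total words in the text."""
    phrases = [p.strip().lower() for p in re.split(r"[,.;:!?]", text) if p.strip()]
    total_words = sum(len(p.split()) for p in phrases)
    if total_words == 0:
        return 0.0
    seen, repeated_words = set(), 0
    for p in phrases:
        if p in seen:
            repeated_words += len(p.split())  # words in a phrase that already occurred
        else:
            seen.add(p)
    return repeated_words / total_words
```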

Appendix C Pipeline for Labelling High-quality Audio-visual Captions

To curate training data for supervised fine-tuning, we employ GPT-4o to label the visual content in each frame, while SALMONN-13B and Whisper-Large-v3 are used to annotate the speech content and audio events in the audio track. This process is illustrated in Figure 5. Our initial aim is to automatically filter out videos that contain limited speech. We begin by slicing each video into 10-second segments, and the audio from each segment is analyzed by SALMONN to generate automatic audio captions (AAC). These captions help us filter out videos whose audio lacks descriptive speech indicators such as “A man is speaking” or “A woman says…”. This initial filtering step is somewhat coarse.

Next, the audio of each segment is processed by Whisper to produce automatic speech recognition (ASR) results. If the transcribed text is too brief or nonsensical, the corresponding video is deemed to lack rich audio content and is excluded from further consideration. A video is labelled only if all of its segments pass this exclusion criterion.

The segments of the remaining videos are then sampled at 1 fps and fed into GPT-4o to extract segment-level visual captions. Finally, the segment-level visual captions, AAC results, and ASR results are fed into GPT-4o together to generate a detailed global audio-visual caption for each video.
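A minimal sketch of the two audio-based filtering steps is given below. The per-segment AAC captions and ASR transcripts are taken as inputs; the speech-cue pattern and the transcript-length threshold are illustrative assumptions, not the exact criteria used in our pipeline.

```python
import re
from typing import List

def keep_video(aac_captions: List[str],
               asr_transcripts: List[str],
               min_words: int = 5) -> bool:
    """Keep a video only if every 10-second segment shows evidence of speech.

    aac_captions and asr_transcripts hold the per-segment SALMONN audio
    captions and Whisper transcripts, respectively.
    """
    speech_cues = re.compile(r"\b(speak|speaking|says|said|talk)", re.IGNORECASE)
    for aac, asr in zip(aac_captions, asr_transcripts):
        if not speech_cues.search(aac):      # coarse AAC-based speech filter
            return False
        if len(asr.split()) < min_words:     # ASR transcript too brief
            return False
    return True
```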

[Figure 5: Pipeline for labelling high-quality audio-visual captions.]

Appendix D About The Test Dataset

Figure 6 shows basic information about our captioning benchmark. The benchmark covers 14 different fields. All videos are between 30 s and 60 s long, with an average duration of 51 s.

[Figure 6: Basic information about the captioning benchmark.]

Appendix E Process of Evaluating Detailed Captions

To evaluate a video caption generated by our model, we first split the labelled caption of the video into several atomic events, using GPT-4o for the test set and GPT-3.5 for the RL training set. The list of atomic events and the caption to be evaluated are then fed into GPT-3.5 together to determine which events in the atomic-event list are missing from the caption and which events in the caption are hallucinated. Specifically, we ask GPT-3.5 to list all missing and hallucinated events for better evaluation precision; events that are described incorrectly are also regarded as hallucinations. The missing (hallucination) rate is the number of missing (hallucinated) events divided by the total number of atomic events in the video. For more robust testing, GPT-3.5 evaluates each caption 7 times and the median of each metric is reported. We have manually confirmed that the scores given by GPT-3.5 are roughly plausible.
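The aggregation of the per-run judgements into the final metrics can be sketched as follows. The function name and argument structure are illustrative; the calls to GPT-3.5 that produce the per-run event counts are omitted.

```python
from statistics import median
from typing import List, Tuple

def caption_metrics(num_atomic_events: int,
                    missing_per_run: List[int],
                    hallucinated_per_run: List[int]) -> Tuple[float, float]:
    """Aggregate per-run GPT-3.5 judgements into the final missing and
    hallucination rates, taking the median over the evaluation runs."""
    missing_rates = [m / num_atomic_events for m in missing_per_run]
    hallucination_rates = [h / num_atomic_events for h in hallucinated_per_run]
    return median(missing_rates), median(hallucination_rates)

# Example: 20 atomic events, 7 evaluation runs.
print(caption_metrics(20, [3, 4, 3, 2, 3, 3, 4], [1, 2, 1, 1, 2, 1, 1]))
```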

Appendix F Sample Selection Method for Each RL Round

To achieve better performance and training efficiency, we adopt a specially designed strategy to select preference pairs. A sample pair is selected only if one sample is better than the other on all metrics by at least a threshold. For global captioning, we consider the global error-rate difference $\Delta e_g$ and global repetition-rate difference $\Delta r_g$, while for local captioning we consider the local error-rate difference $\Delta e_t$ and local repetition-rate difference $\Delta r_t$. Table 3 shows the thresholds used in each round; a selection sketch is given after the table.

Table 3: Thresholds used for preference-pair selection in each RL round.

RL Round    $\Delta e_g$    $\Delta r_g$    $\Delta e_t$    $\Delta r_t$
1           ≥ 5%            ≥ 1%            ≥ 20%           ≥ 1%
2           ≥ 20%           ≥ -1%           ≥ 45%           ≥ 0
3           ≥ 20%           ≥ -1%           ≥ 45%           ≥ 0
4           ≥ 20%           ≥ -1%           ≥ 45%           ≥ 0
5           ≥ 20%           ≥ -1%           ≥ 45%           ≥ 0
6           ≥ 25%           ≥ -1%           ≥ 45%           ≥ 0
7           ≥ 30%           ≥ -1%           ≥ 45%           ≥ 0
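A minimal sketch of the pair-selection rule is given below. It assumes each $\Delta$ is computed as the rejected sample's metric minus the chosen sample's metric, with all metrics (error rate, repetition rate) being lower-is-better; the dictionary keys are illustrative names for the metrics above.

```python
from typing import Dict, Optional, Tuple

Metrics = Dict[str, float]  # per-caption metrics, lower is better

def select_pair(a: Metrics, b: Metrics,
                thresholds: Metrics) -> Optional[Tuple[Metrics, Metrics]]:
    """Return (chosen, rejected) if one caption beats the other on every
    metric by at least the per-metric threshold, otherwise None."""
    def wins(chosen: Metrics, rejected: Metrics) -> bool:
        return all(rejected[k] - chosen[k] >= thresholds[k] for k in thresholds)

    if wins(a, b):
        return a, b
    if wins(b, a):
        return b, a
    return None

# Example: round-2 thresholds for global captioning from Table 3.
round2_global = {"e_g": 0.20, "r_g": -0.01}
pair = select_pair({"e_g": 0.30, "r_g": 0.02},
                   {"e_g": 0.55, "r_g": 0.02},
                   round2_global)  # first sample is chosen, second rejected
```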

Appendix G Results on Visual QA Benchmarks

Since QA data is not seen during the mrDPO process, the model's QA abilities degrade substantially after mrDPO. Thanks to rebirth tuning on both captioning and QA data, these non-captioning abilities are able to recover. After one round of gDPO with a LoRA proxy, video-SALMONN 2 finally achieves detailed and accurate captioning while remaining competitive with SOTA models of similar size on QA benchmarks such as Video-MME. Since video-SALMONN 2 cannot handle very long audio, we only consider the Video-MME Short set. The table below shows the results of our models, together with those of SOTA 7B and 8B models.

Accuracy results on the Video-MME Short set.

(#Params) Model            #Max Frames    Video-MME Short ↑
(8B) MiniCPM-V 2.6         64             71.3
(7B) Long-LLaVA            64             61.9
(7B) Ours-Visual Base      30             67.2
(7B) Ours-mrDPO            30             65.3
(7B) Ours-Rebirth          30             67.6
(7B) video-SALMONN 2       30             67.0

Appendix H Video Captioning Cases of video-SALMONN 2

Some video captioning cases generated by video-SALMONN 2 are shown in Figure 7 and Figure 8. More demos can be found at https://video-salmonn-2.github.io.

[Figures 7 and 8: Video captioning cases generated by video-SALMONN 2.]

Appendix I Cases of Unnatural Responses

Unnatural responses may be generated after multiple RL rounds. Figure 9 shows a caption generated by the model after 6 gDPO rounds with LoRA proxies, which contains repeated patterns that make the caption unnatural. Figure 10 shows a caption generated by the model after 5 classical DPO rounds with LoRA proxies, which includes strange characters and sentences.

[Figures 9 and 10: Examples of unnatural captions generated after multiple gDPO or classical DPO rounds with LoRA proxies.]