Abstract. Generating music that aligns with the visual content of a video is a challenging task: it requires a deep understanding of visual semantics, and the generated music's melody, rhythm, and dynamics must harmonize with the visual narrative. This paper presents MuVi, a novel framework that effectively addresses these challenges to enhance the cohesion and immersive experience of audio-visual content. MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features. These features are used to generate music that matches not only the video's mood and theme but also its rhythm and pacing. We also introduce a contrastive music-visual pre-training scheme, based on the periodic nature of music phrases, to ensure synchronization. In addition, we demonstrate that our flow-matching-based music generator has in-context learning ability, allowing us to control the style and genre of the generated music. Experimental results show that MuVi achieves superior performance in both audio quality and temporal synchronization.
(Training code will be released soon.)
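Until the code is released, the sketch below gives a rough, non-authoritative illustration of the kind of contrastive music-visual alignment objective mentioned in the abstract. It is not MuVi's implementation: the InfoNCE-style formulation, the use of other in-batch segments as negatives, and all names and feature dimensions are our own assumptions.

```python
# Minimal sketch (not the released MuVi code) of a contrastive music-visual
# alignment objective: features of a video segment and of the music segment
# that co-occurs with it are pulled together, while mismatched (temporally
# shifted) pairs in the batch act as negatives.
import torch
import torch.nn.functional as F


def contrastive_sync_loss(visual_feats: torch.Tensor,
                          music_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over time-aligned (visual, music) segment pairs.

    visual_feats: (B, D) pooled visual-adaptor features of a video segment
    music_feats:  (B, D) pooled features of the co-occurring music segment
    """
    v = F.normalize(visual_feats, dim=-1)
    m = F.normalize(music_feats, dim=-1)
    logits = v @ m.t() / temperature                    # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)  # diagonal = true pairs
    # Symmetric cross-entropy: each video segment should retrieve its own
    # music segment and vice versa, encouraging temporal synchronization.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, D = 8, 512                                       # illustrative sizes
    loss = contrastive_sync_loss(torch.randn(B, D), torch.randn(B, D))
    print(float(loss))
```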
In this section, we demonstrate the results of music generation conditioned on input videos.
(All music in the videos is generated by MuVi. The original videos were collected from public websites.)

In this section, we demonstrate the in-context learning capability of MuVi by providing music clips in different styles as prompts.
Prompt Style | Prompt Audio | Samples |
---|---|---|
Piano | | |
Electronic | | |
Rock | | |
 
In this section, we visualize the attention distribution of the visual adaptor. We mask the video frames with the averaged attention scores: the Softmax-normalized weight of each patch is converted into a colored mask and overlaid on the frame. Patches with smaller weights (close to 0.0) appear bluer, while patches with larger weights (close to 1.0) appear yellower, so the yellower a patch, the more it is related to the generated music.
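As a rough illustration of this overlay procedure (not the authors' actual visualization script), the sketch below maps Softmax-normalized patch weights onto a blue-to-yellow ramp and alpha-blends the result with the frame. The function name, patch-grid size, and blending factor are assumptions.

```python
# Sketch of the attention-mask overlay described above: per-patch weights in
# [0, 1] are upsampled to the frame resolution, colored on a blue-to-yellow
# ramp (bluer near 0.0, yellower near 1.0), and blended with the frame.
import numpy as np


def overlay_attention(frame: np.ndarray, attn: np.ndarray,
                      alpha: float = 0.5) -> np.ndarray:
    """frame: (H, W, 3) uint8 RGB image; attn: (h, w) patch weights in [0, 1]."""
    H, W, _ = frame.shape
    h, w = attn.shape
    # Nearest-neighbour upsampling of the patch grid to pixel resolution.
    ys = np.arange(H) * h // H
    xs = np.arange(W) * w // W
    attn_full = attn[np.ix_(ys, xs)]                         # (H, W)
    # Blue (weight 0.0) -> yellow (weight 1.0) colour ramp for the mask.
    blue = np.array([0.0, 0.0, 1.0])
    yellow = np.array([1.0, 1.0, 0.0])
    mask = blue + attn_full[..., None] * (yellow - blue)     # (H, W, 3)
    blended = (1.0 - alpha) * (frame / 255.0) + alpha * mask
    return (blended * 255.0).astype(np.uint8)


if __name__ == "__main__":
    frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
    weights = np.random.rand(14, 14)        # stand-in for Softmax patch weights
    out = overlay_attention(frame, weights)
    print(out.shape, out.dtype)             # (224, 224, 3) uint8
```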
In this section, we compare the results of MuVi with baselines, as well as the ground-truth music. We choose M2UGen as a strong publicly available baseline. M2UGen leverages an LLM to comprehend information extracted by multiple multi-modal encoders (including a video encoder) and to generate content-correlated music soundtracks. VidMuse and V2Meow would also be suitable baselines, but they have not open-sourced their training or inference code. Other works, such as CMT and D2M-GAN, are not considered because their scope of application differs from ours (symbolic music generation and dance-to-music generation, respectively), which could result in an unfair comparison.
Even so, the comparison with M2UGen is not entirely fair, because 1) M2UGen is not specifically designed for semantic alignment or rhythmic synchronization, and 2) M2UGen leverages very powerful language models and music generators. Since our work is pioneering in this setting, we can only choose the strongest and most appropriate available baseline for comparison.
Ground Truth | M2UGen | MuVi |
---|---|---|
 
All videos and generated music on this page are for research purposes only. If you feel that any of the content is inappropriate, please contact us.