1 Media Intelligence Laboratory, Chengdu Sobey Digital Technology Co., Ltd
2 University of Electronic Science and Technology of China
3 Sichuan University
*Indicates the corresponding author
ESA
Energy-based shot assembly optimization. Given a specific theme and a library of video clips, our method employs an energy-based model to search for an optimal shot assembly that aligns with the thematic semantics, editing syntax, and user intent. Combined with additional post-production steps such as voice-overs and subtitles, this approach enables high-quality, intelligent video editing.
Abstract
Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been performed manually by experienced editors. While current intelligent video editing technologies can handle some automated editing tasks, they often fail to capture the creator's unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that match the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles, but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos.
Approach
Our approach consists of modules such as "Shot Segmentation and Label Extraction", "Visual-Semantic Matching", and "Multi-Syntax Joint Assembly Optimization". These modules work together to automatically generate an edited video sequence that reflects the shot scale style and camera motion style of a reference sequence. The core idea is to integrate semantic analysis, visual matching, and energy-based model optimization into a multimodal joint optimization process.
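As a concrete illustration of the visual-semantic matching module, the sketch below retrieves candidate shots for each script sentence by cosine similarity in a shared text-video embedding space. The encoder interface, the `text_encoder` name, and the top-k retrieval policy are assumptions for illustration; the page does not prescribe this exact implementation.

```python
# A minimal sketch of visual-semantic matching, assuming a CLIP-style
# joint embedding space. `text_encoder` and the top-k retrieval policy
# are illustrative placeholders, not the system's exact components.
import numpy as np

def retrieve_candidates(script_sentences, shot_embeddings, text_encoder, k=5):
    """Return, for each script sentence, the indices of the k shots whose
    visual embeddings are most similar to the sentence embedding."""
    shots = shot_embeddings / np.linalg.norm(shot_embeddings, axis=1, keepdims=True)
    candidates = []
    for sentence in script_sentences:
        t = text_encoder(sentence)              # (d,) sentence embedding
        t = t / np.linalg.norm(t)
        scores = shots @ t                      # cosine similarity per shot
        candidates.append(np.argsort(scores)[::-1][:k].tolist())
    return candidates
```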
Given a user-provided textual description, we first retrieve candidate video shots from a video library that semantically match the text. We then segment these candidate shots and extract shot-related labels relevant to the text. These labels, combined with shot syntax rules, provide structured semantic guidance for the shot assembly optimization. Building on this foundation, the framework employs an energy-based model to evaluate the quality of the assembled sequence. The energy model considers multiple objectives, including the alignment of textual semantics with video content, visual continuity between shots, temporal logic consistency, and adherence to shot syntax rules.
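The sketch below shows one way such a multi-objective energy could be combined and searched, assuming the per-shot and pairwise terms have already been learned or defined. The weights, term signatures, and beam search are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of energy-based assembly, assuming per-shot and
# pairwise energy terms are given. Weights and beam width are
# illustrative; the learned energy model itself is not reproduced here.
def sequence_energy(shots, e_sem, e_cont, e_syn, w_sem=1.0, w_cont=1.0, w_syn=1.0):
    """Total energy of a shot sequence; lower energy = better assembly."""
    e = w_sem * sum(e_sem(s) for s in shots)       # text-video alignment
    for a, b in zip(shots, shots[1:]):
        e += w_cont * e_cont(a, b)                 # visual continuity
        e += w_syn * e_syn(a, b)                   # shot syntax rules
    return e

def assemble(candidates, energy, beam_width=8):
    """Beam search choosing one shot per script sentence.
    `candidates` holds one candidate-shot list per sentence."""
    beams = [([], 0.0)]
    for options in candidates:
        scored = [(seq + [s], energy(seq + [s])) for seq, _ in beams for s in options]
        beams = sorted(scored, key=lambda x: x[1])[:beam_width]  # keep lowest energies
    return beams[0][0]

# Usage: best = assemble(candidates, lambda seq: sequence_energy(seq, e_sem, e_cont, e_syn))
```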
Comparison
Reference video 1 - Traveling
Reference video
MoneyPrinterTurbo
MoneyPrinterTurboClip
Ours
CapCut
JichuangAI
Reference video 1 - Military
Reference video
MoneyPrinterTurbo
MoneyPrinterTurboClip
Ours
CapCut
JichuangAI
Reference video 1 - Shopping
Reference video
MoneyPrinterTurbo
MoneyPrinterTurboClip
Ours
CapCut
JichuangAI
Reference video 2 - Cooking
Reference video
MoneyPrinterTurbo
MoneyPrinterTurboClip
Ours
CapCut
JichuangAI
Reference video 2 - BBQ
Reference video
MoneyPrinterTurbo
MoneyPrinterTurboClip
Ours
CapCut
JichuangAI
Reference video 2 - Chicken
Reference video
MoneyPrinterTurbo
MoneyPrinterTurboClip
Ours
CapCut
JichuangAI
Experiments
Comparison of Subjective Video Similarity Scores:
Visual Comparison of Video Editing. We compare the results of video editing using the same "Script Text Content" and "Video Repository", and our method achieves better visual-text alignment.
Comparison of Subjective Video Similarity Scores. This table presents the performance of different video generation methods in experiments based on two distinct reference videos. The methods are evaluated on four metrics: Semantic Matching Score (SMC), Camera Motion Similarity (CMS), Shot Size Similarity (SSS), and Overall Style Similarity (OSS). These metrics collectively reflect the stylistic similarity between the generated and reference videos.
Comparison of Objective Video Similarity Scores:
Visual Comparison of the Transition Score Matrices for Shot Size and Camera Motion Syntax in the Edited Videos. S0 to S4 respectively represent the shot size attributes: Extreme Long Shot (ELS), Long Shot (LS), Medium Shot (MS), Close-Up (CU), and Extreme Close-Up (ECU). C0 to C6 respectively represent the camera motion attributes: Stable, Up, Down, Left, Right, Out, and In.
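For readers who want to reproduce such matrices, the sketch below tallies row-normalized transition counts from per-shot labels. The label vocabularies follow the caption (S0-S4 shot sizes, C0-C6 camera motions); the example sequence is made up for illustration.

```python
# A small sketch of how a transition score matrix like the one in the
# figure can be tallied from per-shot labels; the label sets follow the
# caption, everything else is illustrative.
import numpy as np

SHOT_SIZES = ["ELS", "LS", "MS", "CU", "ECU"]                         # S0..S4
CAM_MOTIONS = ["Stable", "Up", "Down", "Left", "Right", "Out", "In"]  # C0..C6

def transition_matrix(labels, vocab):
    """Row-normalized counts of label-to-label transitions in a video."""
    idx = {l: i for i, l in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)))
    for a, b in zip(labels, labels[1:]):
        m[idx[a], idx[b]] += 1
    rows = m.sum(axis=1, keepdims=True)
    return np.divide(m, rows, out=np.zeros_like(m), where=rows > 0)

# Example: shot-size sequence of a five-shot video (hypothetical data)
print(transition_matrix(["LS", "MS", "CU", "MS", "LS"], SHOT_SIZES))
```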