Video Training

Preprocessing Videos

Selection of Training Videos

  • Use videos with consistent content, actions, or visual effects, but different main subjects.

  • Prioritize using videos; images can be used as supplementary data.

  • Videos must be high-resolution and watermark-free.

Number of Videos

  • 4 to 10 videos are sufficient. (Image-only training is not recommended.)

Frame Rate

  • Convert videos to 16fps, with a total of 81 frames (i.e., 5 seconds in duration).

  • You can use video editing tools to trim clips to 5 seconds, then extract frames at 16fps.

  • Shorter videos (e.g., 2s or 3s) are also acceptable, but they must be processed to 16fps.

Resolution

  • 480p works well. You can also reduce it to 320p to speed up training (a preprocessing sketch follows this list).

  • (Training will likely fail if the resolution is too high.)
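
As a concrete example, here is a minimal preprocessing sketch, assuming ffmpeg is installed and the raw clips sit in a local folder; the folder names and paths are illustrative only.

```python
import subprocess
from pathlib import Path

def preprocess_clip(src: Path, dst: Path, seconds: int = 5, fps: int = 16, height: int = 480) -> None:
    """Trim a clip to `seconds`, resample it to `fps`, and scale it to `height`."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", str(src),
            "-t", str(seconds),           # keep only the first `seconds` of the clip
            "-r", str(fps),               # resample the output to 16 fps
            "-vf", f"scale=-2:{height}",  # scale to 480p (or 320p), keeping aspect ratio
            "-an",                        # drop audio; it is not needed for training
            str(dst),
        ],
        check=True,
    )

# Example: convert every .mp4 in ./raw into ./dataset (folder names are illustrative).
for clip in Path("raw").glob("*.mp4"):
    preprocess_clip(clip, Path("dataset") / clip.name)
```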

Video Tagging

Automatic Tagging

Manual Tagging

Key Points: Secondary Features + Main Features

Main Features: Actions/effects to be learned; Secondary Features: Characters in the video, where they are, what they're doing.

Example: In the video, a woman wearing a black formal suit raises her hand and showers colorful confetti in celebration with a smile. The person then reveals a bikini, causing a b1k1n1 bikini up effect. The person continues celebrating, further showing the b1k1n1 bikini up effect.

The part of the caption before the trigger phrase describes the video content, and the trigger phrase (b1k1n1 bikini up effect) summarizes the action/effect being learned; in other words, the trigger phrase is the main feature and the rest are secondary features.
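
To make this structure concrete, here is a small sketch that assembles a caption from a secondary-feature description plus the trigger phrase for the main feature; the build_caption helper and the strings are purely illustrative.

```python
# A caption = secondary features (who is in the video, where, what they are doing)
# + main features (the trigger phrase for the effect being learned).
TRIGGER = "b1k1n1 bikini up effect"  # example trigger phrase from the text above

def build_caption(secondary: str, trigger: str = TRIGGER) -> str:
    """Combine a scene description with the trigger phrase of the learned effect."""
    return (
        f"{secondary} "
        f"The person then reveals a bikini, causing a {trigger}. "
        f"The person continues celebrating, further showing the {trigger}."
    )

print(build_caption(
    "In the video, a woman wearing a black formal suit raises her hand "
    "and showers colorful confetti in celebration with a smile."
))
```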


Online Training

Video Model Introduction

Hunyuan Video

Text-to-video: hunyuanvideo-fp8

Wan Video

Text-to-video: Wan2.1-14B

Image-to-video: Wan2.1-14B-480P, Wan2.1-14B-720P

Difference between text-to-video and image-to-video: In the parameter settings, under Model Effect Preview Prompts, text-to-video only needs text similar to the training set captions to generate preview samples.

Image-to-video requires both an input image and the corresponding prompt to generate preview samples.

Wan 2.1 Video LoRA Training

Video Model Introduction

Wan Video

Text-to-Video: Wan2.1-14B.

Image-to-Video: Wan2.1-14B-480P, Wan2.1-14B-720P.

Difference between Text-to-Video and Image-to-Video: In parameter settings, under Model Effect Preview Prompts, for text-to-video, you only need to enter text similar to the training set captions to generate preview samples.

For image-to-video, you must provide both an image and the corresponding prompt to generate preview samples.

Online Parameter Settings

Image-to-video

Image-to-video: Wan2.1-14B-480P, Wan2.1-14B-720P (mainly selected based on training video resolution).

For training materials at 216*320 (below 480p), choose the 480p model; there is little difference in the final training result between the 720p and 480p models, so 480p is recommended (a selection sketch follows the table below).

Resolution | Specific Size | Total Pixels
480p | 854*480 | about 410,000
720p | 1280*720 | about 920,000
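
As a rough illustration of the rule above, here is a minimal sketch that picks the image-to-video base model from a clip's resolution; the pick_i2v_model helper and its threshold are assumptions drawn from the table, not an official rule.

```python
def pick_i2v_model(width: int, height: int) -> str:
    """Pick the Wan2.1 image-to-video base model from the training clip resolution.

    Clips at or below the 480p pixel count (~410,000 pixels) are fine on the
    480P model; only genuinely 720p material would call for the 720P model.
    """
    return "Wan2.1-14B-480P" if width * height <= 854 * 480 else "Wan2.1-14B-720P"

print(pick_i2v_model(216, 320))   # -> Wan2.1-14B-480P (the example material above)
print(pick_i2v_model(1280, 720))  # -> Wan2.1-14B-720P
```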

Complete Dataset Upload

Parameter Settings

Frames to Extract: Number of frames to extract from each video segment.

Example: For a segment at 16fps, setting Frames to Extract to 9 means only 9 frames from that segment are learned, not every frame.

Number of Slices: How many segments each video is divided into.

Example: For a 5-second video at 16fps, setting Number of Slices to 5 gives segments of 16 frames each; setting it to 4 gives segments of 20 frames (see the sketch after this parameter list).

Times per Image: Number of times each video is learned in one cycle.

Cycles: Number of training cycles; each cycle runs every video Times per Image times.

Model Effect Preview Prompts: Prompt used to generate the preview video (adapt it from the dataset captions combined with the content of the initial frame).

Initial Frame: For image-to-video, the image used to generate the preview video.
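
The slice arithmetic above can be checked with a short sketch; the frames_per_slice helper is just an illustration of how the numbers in the examples are obtained.

```python
def frames_per_slice(duration_s: int, fps: int, num_slices: int) -> int:
    """Frames in each segment when a clip is divided evenly into slices."""
    return (duration_s * fps) // num_slices

# 5-second clip at 16 fps, as in the examples above:
print(frames_per_slice(5, 16, 5))  # 16 frames per segment
print(frames_per_slice(5, 16, 4))  # 20 frames per segment

# With Frames to Extract = 9, only 9 of the frames in each segment are learned.
```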

Advanced Parameter Settings

The only setting to modify is Flow Shift: 5 for 720p, 3 for 480p (the training materials must also be 480p).

Text-to-Video Parameters

Text-to-video parameters are the same as the image-to-video parameters; leave Flow Shift at its default value.

Model Selection

Choose the saved model whose real-time preview samples best match the effects or actions shown in the training-set videos.

Model Testing

Image-to-Video Testing

kijai Workflow: kj wan testing.json

AI App Testing: SeaArt AI | kj wan testing

Parameter Settings

Model Selection: The training model should match the testing model.

Select LoRA: Choose saved LoRA from your models.

Weight: LoRA weight.

Width: The size after the input image is compressed and cropped.

Height: The size after the input image is compressed and cropped.

Frames: Total frame count of the output video; the count must be of the form 4k+1. At 16fps, n seconds corresponds to 16*n+1 frames (e.g., 5 seconds = 81 frames).

Shift: 720p is 5, 480p is 3.

CFG: The default is 6; it can be lowered to 5 (these settings are gathered in the sketch below).
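
A minimal sketch, assuming only the 16fps frame-count rule from the preprocessing section; the frame_count and test_settings helpers and the dictionary keys are illustrative, mirroring the parameter names listed above.

```python
def frame_count(seconds: int, fps: int = 16) -> int:
    """Total output frames: fps * seconds + 1, which satisfies the 4k+1 rule at 16fps."""
    frames = fps * seconds + 1
    assert frames % 4 == 1
    return frames

def test_settings(resolution: str, seconds: int = 5) -> dict:
    """Collect the kj-workflow test parameters described above."""
    return {
        "frames": frame_count(seconds),             # 5 seconds -> 81 frames
        "shift": 5 if resolution == "720p" else 3,  # 5 for 720p, 3 for 480p
        "cfg": 6,                                   # default 6; can be lowered to 5
    }

print(test_settings("480p"))  # {'frames': 81, 'shift': 3, 'cfg': 6}
```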

Official Workflow: wan official workflow.json

AI App Testing: SeaArt AI | wan official workflow

Default CFG is 6; it can be lowered to 5.

Sampler/scheduler: uni_pc with the normal or simple scheduler, or dpmpp_2m with the sgm_uniform scheduler.

Note: Other parameters are consistent with the kj parameter settings.

Text-to-Video Testing

Wan Creation Flow Testing

Model: Select wan2.1.

Additional: Select saved trained model.

Select Text to Video.

Hunyuan Creation Flow Testing

Model: Hunyuan Video.

Additional: Select saved trained model.

Select Text to Video.

Hunyuan LoRA Video Training

Video Model Introduction

Hunyuan Video: Currently, only online text-to-video training is available.

Text-to-Video: hunyuanvideo-fp8.

Parameter Settings

Frames to Extract: Number of frames to extract from each video segment.

Example: For a segment at 16fps, setting Frames to Extract to 9 means only 9 frames from that segment are learned, not every frame.

Number of Slices: How many segments each video is divided into.

Example: For a 5-second video at 16fps, setting Number of Slices to 5 gives segments of 16 frames each; setting it to 4 gives segments of 20 frames.

Times per Image (Repeat): Number of times each video is learned in one cycle.

Cycles (Epoch): Number of training cycles; each cycle runs every video Times per Image times.

Model Effect Preview Prompts: Prompt used to generate the preview video (adapt it from the dataset captions).

Hunyuan Creation Flow Testing

Model: Hunyuan Video.

Additional: Select saved trained model.

Select Text to Video.

Wan 2.2 Video LoRA Training

Video preprocessing is the same as for Wan 2.1: resolution, video length, frame rate, and number of videos are unchanged.

Video Model Introduction

Wan Video

Text-to-Video: wan2.2 t2v-low, wan2.2 t2v-high.

Image-to-Video: wan2.2 i2v-low, wan2.2 i2v-high.

Differences between Wan 2.2 Video and Wan 2.1 Video Training: text-to-video and image-to-video each have two models, a high-noise model and a low-noise model. The high-noise model mainly controls motion/dynamics in the video, while the low-noise model mainly controls fine details. To minimize training time, you can train only the low-noise model to quickly reach a basic effect.

It is best to train both the high-noise and low-noise models on the same dataset and then load the two LoRAs together; this makes full use of Wan 2.2's capabilities.

The Wan 2.2 video models have stronger language understanding, so you can use simpler and more consistent descriptions when annotating videos.

Wan 2.2 Video Annotation

Wan 2.2 Text-to-Video

Automatic Annotation

You can use Automatic Annotation.

Manual Annotation

Describe the video content clearly; avoid overly generic descriptions such as "a person," "an animal."

Wan 2.2 Image-to-Video

Automatic Annotation

Automatic annotation is not recommended: there are only a few video assets, and the generated captions tend to be overly detailed. For Wan 2.2, excessively detailed captions make it more cumbersome to enter prompts when using the LoRA later.

Manual Annotation

You can use simple descriptions if all video assets are of a person.

For example: A person whose head turns into a pumpkin head, then puts on a robe, with a Halloween background of a jack-o'-lantern, bats, and a moon.

You do not need to specify gender or age in the description; simply describe “a person,” clearly and consistently outlining the transformation effects. Then, copy this entire sentence into the captions of other videos.
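
As a small illustration of the "copy the same caption to every clip" step, the sketch below assumes the captions are kept as sidecar .txt files next to the videos before upload; the folder layout and file naming are assumptions, and the online trainer may instead take captions directly in its UI.

```python
from pathlib import Path

# The unified caption from the example above; adjust it to your own effect.
CAPTION = (
    "A person whose head turns into a pumpkin head, then puts on a robe, "
    "with a Halloween background of a jack-o'-lantern, bats, and a moon."
)

# Write the same caption next to every clip as a sidecar .txt file
# (assumed layout; adapt to however your training tool expects captions).
dataset = Path("dataset")
for clip in dataset.glob("*.mp4"):
    clip.with_suffix(".txt").write_text(CAPTION, encoding="utf-8")
```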

Parameter Setting

Frames to Extract: Number of frames to extract from each video segment.

Example: For a segment at 16fps, setting Frames to Extract to 9 means only 9 frames from that segment are learned, not every frame.

Number of Slices: How many segments each video is divided into.

Example: For a 5-second video at 16fps, setting Number of Slices to 5 gives segments of 16 frames each; setting it to 4 gives segments of 20 frames.

Times per Image (Repeat): Number of times each video is learned in one cycle.

Cycles (Epoch): Number of training cycles; each cycle runs every video Times per Image times.

Model Effect Preview Prompts: Prompt used to generate the preview video (adapt it from the dataset captions combined with the content of the initial frame).

Initial Frame: For image-to-video, the image used to generate the preview video.

High/Low Noise Model Effect Comparison

The comparison table shows the impact of the low-noise and high-noise LoRAs on the final generated videos across three scenarios: Motion, Special Effects, and Complex Scenario.

From this we can see that training only the high-noise model already expresses the overall video effect, but some fine details still require the low-noise model.

Image to Video

Model: Generally, just choose wan2.2 i2v-high.

Advanced Parameter Settings: The default values are the best.

Model Effect Preview Prompts: You can directly use the video captions.

Initial Frame: It is recommended that the uploaded image type be consistent with the video asset type. For example, if the video asset is realistic, upload a realistic image; if the video is a half-body shot, upload a half-body image.

Text to Video

Model: Choose wan2.2 t2v-low.

Model Effect Preview Prompts: You can directly use the video captions.

Model Testing

Generally, save and test the LoRAs from the last few epochs.

Image-to-Video Testing

AI App Testing: SeaArt AI | WAN 2.2 Test

Parameter Settings

First LoRA: Select the high-noise LoRA.

Second LoRA: Select the low-noise LoRA.

If you only trained a high-noise LoRA, select the same high-noise LoRA in the low-noise slot and set its strength to 0. If you also have a low-noise LoRA, load the high-noise and low-noise LoRAs together (see the sketch after this list).

Please enter text: Input the prompts (captions).

Please select an image: Input the image (preferably similar to the first frame of the training video).
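
Here is a minimal sketch of the slot-filling logic above; the lora_slots helper, its field names, and the example file name are illustrative and do not correspond to the app's actual parameter names.

```python
from typing import Optional

def lora_slots(high_noise_lora: str, low_noise_lora: Optional[str]) -> dict:
    """Fill the two LoRA slots as described above.

    If no low-noise LoRA was trained, reuse the high-noise LoRA in the
    low-noise slot with its strength set to 0 (effectively disabling it).
    """
    if low_noise_lora is None:
        return {
            "first_lora": (high_noise_lora, 1.0),
            "second_lora": (high_noise_lora, 0.0),  # placeholder slot, strength 0
        }
    return {
        "first_lora": (high_noise_lora, 1.0),
        "second_lora": (low_noise_lora, 1.0),
    }

print(lora_slots("halloween_high.safetensors", None))  # hypothetical file name
```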

Text-to-Video Testing

AI App Testing: SeaArt AI | wan2.2 t2i Test

Parameter Settings

First LoRA: Select the high-noise LoRA.

Second LoRA: Select the low-noise LoRA.

If you only trained a high-noise LoRA, select the same high-noise LoRA in the low-noise slot and set its strength to 0. If you also have a low-noise LoRA, load the high-noise and low-noise LoRAs together (as in the image-to-video sketch above).

Please enter text: Input the prompts (captions).

Please select an image: Input the image (preferably similar to the first frame of the training video).
