Video Training
Preprocessing Videos
Selection of Training Videos
Use videos with consistent content, actions, or visual effects, but different main subjects.
Prioritize using videos; images can be used as supplementary data.
Videos must be high-resolution and watermark-free.
Number of Videos
4 to 10 videos are sufficient. (Image-only training is not recommended.)
Frame Rate
Convert videos to 16fps, with a total of 81 frames (i.e., 5 seconds in duration).
You can use video editing tools to trim clips to 5 seconds, then extract frames at 16fps.
Shorter videos (e.g., 2s or 3s) are also acceptable, but they must be processed to 16fps.
Resolution
480p works well. You can also reduce it to 320p to speed up training.
(Training will likely fail if the resolution is too high.)
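As one way to batch this preprocessing, here is a minimal sketch that trims each clip to 5 seconds, resamples it to 16fps, and scales it to 480p with ffmpeg (the folder names are placeholders, and ffmpeg must be installed separately):

```python
import subprocess
from pathlib import Path

SRC, DST = Path("raw_videos"), Path("train_videos")  # placeholder folders
DST.mkdir(exist_ok=True)

for clip in SRC.glob("*.mp4"):
    # Trim to 5 s, resample to 16 fps, and scale the height to 480 px
    # (width is chosen automatically to keep the aspect ratio).
    subprocess.run([
        "ffmpeg", "-y", "-i", str(clip),
        "-t", "5",
        "-vf", "fps=16,scale=-2:480",
        "-an",  # audio is not needed for training clips
        str(DST / clip.name),
    ], check=True)
```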
Video Tagging
Automatic Tagging

Manual Tagging
Key Points: Secondary Features + Main Features
Main Features: the actions/effects to be learned. Secondary Features: who appears in the video, where they are, and what they are doing.
Example: In the video, a woman wearing a black formal suit is presented. The person raises her hand and, with a smile, showers colorful confetti in celebration. The person then reveals a bikini, causing a b1k1n1 bikini up effect. The person continues celebrating, further showing the b1k1n1 bikini up effect.
The descriptive sentences are the secondary features (the video content), while the repeated trigger phrase "b1k1n1 bikini up effect" is the main feature, summarizing the action effect being learned.
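If you prepare captions offline before uploading, one common convention is a sidecar .txt caption per video that combines the per-clip description (secondary features) with the shared trigger phrase (main feature). This is only a sketch: the file names and the second description are placeholders, and the online trainer also lets you edit tags directly in the interface.

```python
from pathlib import Path

TRIGGER = "b1k1n1 bikini up effect"  # main feature: the effect being learned
descriptions = {
    # secondary features: who is in the clip, where, and what they are doing
    "clip_01.mp4": "a woman in a black formal suit raises her hand and showers colorful confetti with a smile",
    "clip_02.mp4": "a man in a grey hoodie stands in a living room and celebrates",  # placeholder
}

for name, desc in descriptions.items():
    caption = f"{desc}, causing a {TRIGGER}"
    Path(name).with_suffix(".txt").write_text(caption, encoding="utf-8")
```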
Online Training
Video Model Introduction
Hunyuan Video
Text-to-video: hunyuanvideo-fp8
Wan Video
Text-to-video: Wan2.1-14B
Image-to-video: Wan2.1-14B-480P, Wan2.1-14B-720P
Difference between text-to-video and image-to-video: in the parameter settings, under Model Effect Preview Prompts, text-to-video only needs text similar to the training set captions to generate preview samples.
Image-to-video requires both an input image and a corresponding prompt to generate preview samples.
Wan 2.1 Video LoRA Training
Video Model Introduction
Wan Video
Text-to-Video: Wan2.1-14B.
Image-to-Video: Wan2.1-14B-480P, Wan2.1-14B-720P.
Difference between Text-to-Video and Image-to-Video: In parameter settings, under Model Effect Preview Prompts, for text-to-video, you only need to enter text similar to the training set captions to generate preview samples.
For image-to-video, you must provide both an image and the corresponding prompt to generate preview samples.
Online Parameter Settings
Image-to-video
Image-to-video: Wan2.1-14B-480P, Wan2.1-14B-720P (mainly selected based on training video resolution).
For training materials at 216*320 (below 480p), choose the 480p model; there is little difference in the final training effect between 720p and 480p, so 480p is recommended.
Resolution | Specific Size | Total Pixels
--- | --- | ---
480p | 854*480 | about 410,000
720p | 1280*720 | about 920,000
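Since the image-to-video model is chosen mainly by the training video resolution, here is a quick illustrative check against the table above (the helper name and threshold are assumptions; as noted, the 480P model is recommended in most cases):

```python
def pick_i2v_model(width: int, height: int) -> str:
    """Pick the Wan2.1 image-to-video variant from the training video size."""
    # 854*480 (about 410,000 pixels) is the 480p bucket from the table above.
    return "Wan2.1-14B-480P" if width * height <= 854 * 480 else "Wan2.1-14B-720P"

print(pick_i2v_model(320, 216))   # Wan2.1-14B-480P (the 216*320 example above)
print(pick_i2v_model(1280, 720))  # Wan2.1-14B-720P
```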
Complete Dataset Upload

Parameter Settings
Frames to Extract: Number of frames to extract from a single video segment.
Example: For a segment at 16fps, setting Frames to Extract to 9 means only 9 frames of the segment are sampled for training, not every frame.
Number of Slices: How many segments each training video is divided into.
Example: For a 5-second video at 16fps (80 frames), setting Number of Slices to 5 gives segments of 16 frames each; setting it to 4 gives segments of 20 frames each (see the sketch after this parameter list).
Times per Image: How many times each video is learned per cycle.
Cycles: Number of training cycles; each cycle learns every video Times per Image times.
Model Effect Preview Prompts: Prompt for generating example video (modify it based on dataset tags combined with initial frame image content).
Initial frame: For image-to-video, the required image for generating the example video.
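For reference, a small sketch of how the slice and frame counts in the examples above relate (the function name is just for illustration):

```python
def frames_per_slice(duration_s: float, fps: int, num_slices: int) -> int:
    """Frames in each segment when a video is split into equal slices."""
    total_frames = int(duration_s * fps)  # e.g. 5 s * 16 fps = 80 frames
    return total_frames // num_slices

print(frames_per_slice(5, 16, 5))  # 16 frames per slice
print(frames_per_slice(5, 16, 4))  # 20 frames per slice
```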
Advanced Parameter Settings
The only setting to be modified: Flow Shift.
Use 5 for 720p and 3 for 480p (the 480p value requires the training materials to also be 480p).

Text-to-Video Parameters
Text-to-video parameters are the same as the image-to-video parameters; Flow Shift can be left at its default value.
Model Selection
Choose the epoch whose live preview samples best match the effects or actions shown in the training set videos.
Model Testing
Image-to-Video Testing
kijai Workflow: kj wan testing.json
AI App Testing: SeaArt AI | kj wan testing
Parameter Settings
Model Selection: The training model should match the testing model.
Select LoRA: Choose saved LoRA from your models.
Weight: LoRA weight.
Width: The size after the input image is compressed and cropped.
Height: The size after the input image is compressed and cropped.
Frames: Total frame count for the output duration; it must have the form 4*k+1, and at 16fps a clip of n seconds needs 16*n+1 frames (e.g., 5 seconds = 81 frames); see the sketch below.
Shift: 720p is 5, 480p is 3.
CFG: Default cfg is 6, can be adjusted to 5.
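A small sketch of the frame-count rule mentioned above (the helper name is illustrative; durations assume 16fps output):

```python
def valid_frame_count(seconds: float, fps: int = 16) -> int:
    """Round duration * fps to the nearest frame count of the form 4*n + 1."""
    raw = round(seconds * fps)           # e.g. 5 s * 16 fps = 80
    return 4 * round((raw - 1) / 4) + 1  # -> 81

print(valid_frame_count(5))  # 81 frames for a 5-second clip
print(valid_frame_count(3))  # 49 frames for a 3-second clip
```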
Official Workflow: wan official workflow.json
AI App Testing: SeaArt AI | wan official workflow
Default cfg is 6, can be adjusted to 5.
Sampler: uni-pc, with scheduler normal or simple.
Alternatively: sampler dpmpp_2m with scheduler sgm_uniform.
Note: Other parameters are consistent with the kj parameter settings.
Text-to-Video Testing
Wan Creation Flow Testing
Model: Select wan2.1.
Additional: Select saved trained model.
Select Text to Video.
Hunyuan Creation Flow Testing

Model: Hunyuan Video.
Additional: Select saved trained model.
Select Text to Video.
Hunyuan LoRA Video Training
Video Model Introduction
Hunyuan Video: Currently, only online text-to-video training is available.
Text-to-Video: hunyuanvideo-fp8.
Parameter Settings
Frames to Extract: Number of frames to extract from a single video segment.
Example: For a segment at 16fps, setting Frames to Extract to 9 means only 9 frames of the segment are sampled for training, not every frame.
Number of Slices: How many segments each training video is divided into.
Example: For a 5-second video at 16fps (80 frames), setting Number of Slices to 5 gives segments of 16 frames each; setting it to 4 gives segments of 20 frames each.
Times per Image (Repeat): How many times each video is learned per cycle.
Cycles (Epoch): Number of training cycles; each cycle learns every video Times per Image times.
Model Effect Preview Prompts: Prompt for generating example video (modify it based on dataset tags combined with initial frame image content).

Hunyuan Creation Flow Testing

Model: Hunyuan Video.
Additional: Select saved trained model.
Select Text to Video.
Wan 2.2 Video LoRA Training
Video preprocessing is the same as for Wan 2.1: use the same resolution, video length, frame rate, and number of videos.
Video Model Introduction
Wan Video
Text-to-Video: wan2.2 t2v-low, wan2.2 t2v-high.
Image-to-Video: wan2.2 i2v-low, wan2.2 i2v-high.
Differences between Wan 2.2 and Wan 2.1 training: in Wan 2.2, text-to-video and image-to-video each have two models, a high-noise model and a low-noise model. The high-noise model mainly controls motion and dynamics in the video, while the low-noise model mainly controls fine details. To minimize training time, you can train only the low-noise model to reach a basic effect quickly.
It is best to train both the high-noise and low-noise models on the same dataset and then load the two LoRAs together; this makes full use of Wan 2.2's capabilities.
The Wan 2.2 video models have stronger language understanding, so you can use simpler and more uniform descriptions when annotating videos.
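The two models split the denoising schedule rather than running in parallel: the high-noise model handles the early, noisier steps (large-scale motion), and the low-noise model takes over for the later steps (fine detail). Below is a minimal conceptual sketch of that handoff; the function names, boundary value, and loop are illustrative placeholders, not the actual Wan 2.2 implementation.

```python
BOUNDARY = 0.9  # assumed handoff point on the noise schedule (1.0 = pure noise)

def denoise_high(latents, t):
    # placeholder for the high-noise expert (learns motion / large-scale dynamics)
    return latents

def denoise_low(latents, t):
    # placeholder for the low-noise expert (refines fine details)
    return latents

def sample(latents, timesteps):
    """Denoising loop that switches experts once the noise level drops below BOUNDARY."""
    for t in timesteps:  # timesteps run from 1.0 (pure noise) down to 0.0
        expert = denoise_high if t >= BOUNDARY else denoise_low
        latents = expert(latents, t)
    return latents
```

This is also why training only one of the two LoRAs can still work: the untouched half of the schedule simply runs on the base model.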
Wan 2.2 Video Annotation
Wan 2.2 Text-to-Video
Automatic Annotation
You can use Automatic Annotation.
Manual Annotation
Describe the video content clearly; avoid overly generic descriptions such as "a person," "an animal."
Wan 2.2 Image-to-Video
Automatic Annotation
Automatic annotation is not recommended, as there are only a few video assets, and it tends to be overly detailed. For Wan 2.2, excessively detailed captions can make it more cumbersome to enter prompts when using the LoRA later.

Manual Annotation
You can use simple descriptions if all video assets are of a person.


For example: A person's head turns into a pumpkin head, then the person puts on a robe, with a Halloween background of a pumpkin lantern, bats, and a moon.
You do not need to specify gender or age in the description; simply write "a person" and describe the transformation effect clearly and consistently. Then copy this same sentence into the captions of the other videos.
Parameter Setting
Frames to Extract: Number of frames to extract from a single video segment.
Example: For a segment at 16fps, setting Frames to Extract to 9 means only 9 frames of the segment are sampled for training, not every frame.
Number of Slices: How many segments each training video is divided into.
Example: For a 5-second video at 16fps (80 frames), setting Number of Slices to 5 gives segments of 16 frames each; setting it to 4 gives segments of 20 frames each.
Times per Image (Repeat): How many times each video is learned per cycle.
Cycles (Epoch): Number of training cycles; each cycle learns every video Times per Image times.
Model Effect Preview Prompts: Prompt for generating example video (modify it based on dataset tags combined with initial frame image content).
Initial frame: For image-to-video, the required image for generating the example video.
High/Low Noise Model Effect Comparison
The table shows the impact of low-noise and high-noise LoRA on the final generated videos.
Comparison categories: Motion, Special Effects, Complex Scenario.
From this comparison we can see that even if you train only the high-noise model, the overall video effect is largely expressed, but some fine details still require the low-noise model.
Image to Video
Model: Generally, just choose wan2.2 i2v-high.
Advanced Parameter Settings: The default values are the best.

Model Effect Preview Prompts: You can directly use the video captions.
Initial Frame: It is recommended that the uploaded image type be consistent with the video asset type. For example, if the video asset is realistic, upload a realistic image; if the video is a half-body shot, upload a half-body image.
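One convenient way to get a matching initial frame is to grab the first frame of one of the training clips. A minimal sketch using OpenCV (the file names are placeholders):

```python
import cv2

cap = cv2.VideoCapture("clip_01.mp4")  # placeholder training clip
ok, frame = cap.read()                 # read the first frame
cap.release()
if ok:
    cv2.imwrite("clip_01_first_frame.png", frame)
```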

Text to Video
Model: Choose wan2.2 t2v-low.
Model Effect Preview Prompts: You can directly use the video captions.

Model Testing
Generally, save and test the LoRAs from the last few epochs.
Image-to-Video Testing
AI App Testing: SeaArt AI | WAN 2.2 Test

Parameter Settings
First LoRA: Select the high-noise LoRA.
Second LoRA: Select the low-noise LoRA.
If you only have a high-noise LoRA, then choose the same high-noise LoRA for the low-noise slot and set its strength to 0. If you have a low-noise LoRA, you can use both high-noise and low-noise LoRAs together.
Please enter text: Input the prompts (captions).
Please select an image: Input the image (preferably similar to the first frame of the training video).
Text-to-Video Testing
AI App Testing: SeaArt AI | wan2.2 t2i Test

Parameter Settings
First LoRA: Select the high-noise LoRA.
Second LoRA: Select the low-noise LoRA.
If you only have a high-noise LoRA, then choose the same high-noise LoRA for the low-noise slot and set its strength to 0. If you have a low-noise LoRA, you can use both high-noise and low-noise LoRAs together.
Please enter text: Input the prompts (captions).