Image Training

Concept of LoRA

  • LoRA allows fine-tuning specific image features without modifying the base Checkpoint weights. This means you can generate targeted image results just by adjusting the LoRA, instead of retraining the entire model.

  • Currently, LoRA training is typically conducted on official base models such as SD1.5, SDXL, Pony, Illustrious, Flux, and SD3.5. Refinement can also be performed on community-created models.

  • How LoRA training works: The AI first generates images based on prompts. These are then compared with the images in your dataset, and the trainable LoRA weights are gradually adjusted based on the differences, guiding the AI to produce results increasingly similar to the dataset. Eventually, the model can generate images nearly identical in style or subject to the dataset, building a strong associative connection.
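The idea above can be sketched numerically. This is a minimal illustration of the low-rank update LoRA learns (all dimensions and names are invented for the example, not SeaArt's implementation): the base weight W stays frozen while two small matrices B and A are trained, scaled by alpha/rank.

```python
import random

# Conceptual sketch, not SeaArt's implementation: LoRA freezes the base
# weight matrix W and trains a low-rank update B @ A, scaled by alpha / rank.
d_out, d_in, rank, alpha = 4, 4, 2, 1.0
rng = random.Random(0)

def matmul(P, Q):
    """Plain-Python matrix multiply."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen base weight
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable "down" matrix
B = [[0.0] * rank for _ in range(d_out)]                              # trainable "up" matrix, zero-init

def effective_weight(lora_scale=1.0):
    """Return W plus the scaled low-rank LoRA update."""
    delta = matmul(B, A)                     # (d_out x d_in) adjustment
    s = lora_scale * alpha / rank
    return [[W[i][j] + s * delta[i][j] for j in range(d_in)]
            for i in range(d_out)]

# With B zero-initialized, the LoRA initially leaves the base model unchanged;
# training nudges B and A so the effective weight drifts toward the dataset.
assert effective_weight() == W
```

Because only B and A are trained, adjusting the LoRA never touches the base Checkpoint weights, which is why one Checkpoint can serve many LoRAs.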

LoRA Training Workflow

Five steps: Prepare dataset → Image preprocessing → Set parameters → Monitor training → Complete training

Dataset

High-quality datasets are key to effective model training. The dataset is the source material for the model, and correct image sizing, tagging, and editing are crucial for building a good one.

● Usually, 20 to 40 images are sufficient for many types of LoRA, but style model training requires more images for better generalization. More images aren't always better; adding low-quality materials can reduce model quality.

● Materials can typically be found on image websites or created using AI image generation. But low-quality materials should be avoided in both cases.

Low-quality materials generally share these characteristics: scenes too dark to make out, unreasonable composition, blurry images, details that are difficult to reproduce consistently, no clear distinction between primary and secondary elements, irrelevant elements in frame, inconsistent subjects caused by image splicing, and incorrect or misaligned limbs.

● The character style being trained should ideally be similar to the chosen Checkpoint's style; for example, anime characters should be trained on anime Checkpoints.

Usually, 25-40 images of the same character are sufficient. Use the same character images but with different poses, angles, views, clothes, expressions, backgrounds, etc.

For example, a dataset containing 30 images (categories may overlap): 12 portrait photos, 4 beach-background photos, 4 smiling-expression photos, 6 photos wearing the original outfit, 4 back-view photos, 6 upper-body photos, 8 full-body photos, 6 sitting poses, 8-10 standing poses, etc.

● LoRA Style Dataset Selection

● The base model used must be flexible or similar in style to the one being trained; for example, illustration styles should be trained on flat-style Checkpoints.

All images must have the same style (the style needed to create the LoRA).

Editing Tags and Trigger Words

● Trigger words must be used for LoRA to work optimally; they are a very important part of the prompting phase.

● Trigger words are the activation keys for a LoRA: essentially tags for features that appear in the images but are not written into the caption data. The consistent features of the LoRA should be bound to its trigger words, not described in its tags.

For example, if a character being trained has fixed hair features, there is no need to note them in the tags; leaving them untagged lets those features merge into the trigger word. For style training, tags describing the style itself should be deleted.
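As a concrete illustration of this tag-editing rule, here is a hypothetical sketch (the trigger word, feature list, and helper function are all invented for the example) that strips fixed character features from a caption and prepends the trigger word:

```python
# Hypothetical example of pruning caption tags so fixed character features
# (e.g. hair color) are absorbed into the trigger word instead of the tags.
TRIGGER = "my_character"                       # assumed trigger word
FIXED_FEATURES = {"blonde hair", "blue eyes"}  # features the LoRA should own

def edit_caption(caption: str) -> str:
    tags = [t.strip() for t in caption.split(",")]
    kept = [t for t in tags if t not in FIXED_FEATURES]
    return ", ".join([TRIGGER] + kept)  # trigger word goes first

print(edit_caption("blonde hair, blue eyes, smile, beach background"))
# -> "my_character, smile, beach background"
```

The changeable attributes (expression, background) stay tagged, so they remain controllable by prompts, while the fixed features bind to the trigger word.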

Tagging

| Name | Output Form | Main Use | Model Usage | Threshold |
| --- | --- | --- | --- | --- |
| wd1.4 | Words | Anime Specialization | 1.5, il, pony | 0.3-0.6 |
| deepbooru | Words | Anime Image Sorting | 1.5, il, pony | 0.3-0.6 |
| blip | Natural Language | General Image Description | flux, xl | / |
| joy2 | Natural Language | General Image Description | flux, xl | / |
| llava | Natural Language | General Image Description | flux, xl | / |

Threshold: Lower values mean more detailed descriptions.
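A small sketch of how the threshold works (the tags and confidence scores below are made up; wd1.4-style taggers output a confidence per tag):

```python
# Made-up tagger output: confidence score per predicted tag.
scores = {"1girl": 0.98, "long hair": 0.72, "smile": 0.55, "umbrella": 0.18}

def tags_above(threshold: float):
    """Keep only tags whose confidence meets the threshold."""
    return [tag for tag, s in scores.items() if s >= threshold]

print(tags_above(0.6))  # higher threshold -> fewer, safer tags
print(tags_above(0.3))  # lower threshold -> more detailed description
```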

Image Preprocessing

Cropping Images

● Center cropping: Crops the center area of the image.

● Focus cropping: Automatically identifies the main subject of the image.

● No cropping: No image cropping, must be used with ARB bucketing.

● Compared to center cropping, focus cropping more easily preserves the subject of the dataset, so focus cropping is generally recommended.

● For Stable Diffusion 1.5 LoRA, 512x512, 512x768, and 768x512 are recommended. If you want to create highly detailed LoRA, 768x768 can also be used.

● For SDXL, Flux, and Stable Diffusion 3.5 training, 1024x1024 is recommended.
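As an illustration of matching images to these recommended sizes, here is a hypothetical helper (the chooser logic is ours; only the sizes come from the guidance above) that picks the closest SD1.5 training resolution for an image's aspect ratio:

```python
# Recommended SD1.5 training sizes from the guidance above.
SIZES = [(512, 512), (512, 768), (768, 512)]

def pick_size(width: int, height: int):
    """Pick the recommended size whose aspect ratio is closest to the image's."""
    target = width / height
    return min(SIZES, key=lambda wh: abs(wh[0] / wh[1] - target))

print(pick_size(1000, 1500))  # portrait  -> (512, 768)
print(pick_size(1920, 1080))  # landscape -> (768, 512)
print(pick_size(800, 800))    # square    -> (512, 512)
```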

Dataset Creation and Upload

You can choose to upload existing datasets (i.e., the collective term for corresponding images and text annotations) or upload images for tagging and cropping processing.

Uploaded datasets cannot be recropped or automatically tagged, but tags can be manually modified.

After uploading images (batch upload supports up to 50 images at a time), you can select the cropping method, size, and tagging method. After selection, click Crop/Tag (wait until processing is complete to start adjusting parameters for training).

Training Parameter Settings

● At the top is the base model type (i.e., the large model type under which we train LoRA).

● Base Model: The specific base models available differ for each base model type.

● Repeat: How many times each image is trained in one epoch.

● Epoch: How many complete passes are made over all images.

● Model Effect Preview Prompts: After training is complete, each model will have a sample image; the prompt used to generate this sample image is the model effect preview prompt.

Advanced Parameter Settings

● Batch size: The number of data samples sent to the model at once. When set to 4, the model processes 4 images per step. Processing data in batches improves memory utilization and training speed. Batch Size values are typically powers of 2. Increasing Batch Size allows a proportional increase in learning rate, e.g., doubling the batch size can support double the UNet learning rate, but the TE learning rate should not be increased as much.
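The scaling rule of thumb above can be sketched as follows (the base values and the gentler square-root scaling for the text encoder are assumptions for illustration, not SeaArt defaults):

```python
# Sketch of the rule of thumb above: scale the UNet learning rate linearly
# with batch size, while raising the text-encoder (TE) rate more gently.
BASE_BATCH, BASE_UNET_LR, BASE_TE_LR = 1, 1e-4, 5e-5  # assumed starting point

def scaled_lrs(batch_size: int):
    scale = batch_size / BASE_BATCH
    unet_lr = BASE_UNET_LR * scale     # linear scaling with batch size
    te_lr = BASE_TE_LR * scale ** 0.5  # gentler scaling: TE must not grow as fast
    return unet_lr, te_lr

print(scaled_lrs(4))  # batch size 4 -> 4x the UNet LR, but only 2x the TE LR
```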

● Gradient Checkpointing: A training algorithm that trades computation for VRAM, saving memory but sacrificing some speed. If Batch Size is 1, it's turned off; if Batch Size is 2 or above, it's turned on.

● ARB Bucketing: Trains with images of non-fixed aspect ratios. With ARB bucketing enabled, no cropping is needed; it increases training time to some extent, and the ARB bucket resolution must be greater than the training material resolution.

● ARB Bucket Minimum Resolution: Default is 256; uploaded image resolution cannot be less than 256.

● ARB Bucket Maximum Resolution: Default is 1024; uploaded image resolution cannot be greater than 1024. You can increase the value to add materials with a greater resolution.

● ARB Bucket Resolution Steps: Default is 64, which is usually fine.
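With the defaults above, the candidate bucket side lengths can be enumerated like this (a sketch of the idea, not the exact bucketing algorithm):

```python
# Defaults described above: min 256, max 1024, step 64.
MIN_RES, MAX_RES, STEP = 256, 1024, 64

def bucket_sides():
    """Enumerate candidate ARB bucket side lengths."""
    return list(range(MIN_RES, MAX_RES + 1, STEP))

sides = bucket_sides()
print(sides)  # 256, 320, ..., 1024
# Buckets pair a width and a height from these values; each image is
# resized into the bucket whose aspect ratio best matches its own.
```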

● Save Every N Epochs: Saves a model every N epochs, which determines the final number of LoRAs saved. If set to 2 with an Epoch of 10, then 5 LoRAs will be saved.

● Learning Rate: The intensity with which the AI learns from the dataset. A higher learning rate means stronger learning, but may also lead to inconsistent output images. It's recommended to start low and increase gradually; the suggested learning rate is 0.0001.

● unet lr: When unet lr is set, it overrides the global Learning Rate. The recommended setting is 0.0001.

● text encoder lr: Determines sensitivity to tags. Typically, it is set to 1/2 or 1/10 of the unet lr.

● Learning Rate & Optimizer:

|  | AdamW8bit | prodigy |
| --- | --- | --- |
| Learning Rate | Total learning rate 1e-4; scale up proportionally with batch size | All learning rates set to 1; the actual learning rate adjusts adaptively |
| Lr scheduler | Cosine with restart (restart count not exceeding 4) | constant |
| Lr warm up | Warm-up steps are 5%-10% of total steps | / |
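A rough sketch of the AdamW8bit column above, linear warm-up followed by cosine with restarts (the formula is our own illustration, not a specific library's scheduler):

```python
import math

# Sketch: learning rate at a given step for linear warm-up followed by
# cosine annealing with evenly spaced restarts.
def lr_at(step, total_steps, base_lr=1e-4, warmup_frac=0.05, restarts=4):
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warm-up
    # position within the current cosine cycle
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cycle_pos = (t * restarts) % 1.0
    return base_lr * 0.5 * (1 + math.cos(math.pi * cycle_pos))

total = 1000
print(lr_at(10, total))  # still warming up, below the base LR
print(lr_at(50, total))  # warm-up just finished: back at the base LR
```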

● Network: Common values

| Network Rank (Dim) | 32 | 64 | 128 |
| --- | --- | --- | --- |
| Network Alpha | 16 | 32 | 64 |

Setting the Network Rank (Dim) too high will cause the AI to learn too deeply and make the model file larger, capturing many irrelevant details, similar to "overfitting."

● Shuffle Caption: When enabled, the token order of the text will be randomly shuffled during training to enhance the generalization ability of the generation model. It is recommended to turn it on.

● Keep N Tokens: Generally set to 1, so that the first tag entered (the trigger word) keeps the highest weight.
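A sketch of how Shuffle Caption interacts with Keep N Tokens = 1 (hypothetical helper; a real trainer reshuffles on every training step):

```python
import random

# Shuffle Caption with Keep N Tokens = 1: the first tag (the trigger word)
# stays in place while the remaining tags are shuffled.
def shuffle_caption(caption: str, keep_n: int = 1, seed=None) -> str:
    tags = [t.strip() for t in caption.split(",")]
    head, tail = tags[:keep_n], tags[keep_n:]  # head = kept trigger word(s)
    random.Random(seed).shuffle(tail)          # the rest change order
    return ", ".join(head + tail)

print(shuffle_caption("my_character, smile, beach, standing", seed=0))
# The trigger word always stays first; the remaining tags vary in order.
```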

● Noise Offset: Adds global noise during training, improving the brightness range of images (meaning it can generate darker or whiter images).

● Multires Noise Iterations: The number of noise resolution levels (iterations) used when applying multi-resolution noise.

● Multires Noise Discount: Defines the proportion by which noise gradually decreases with iterations.

Note: Because Noise Offset, Multires Noise Iterations, and Multires Noise Discount all require extra steps to ensure convergence, enabling them increases training time.
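A minimal sketch of what Noise Offset does (the values are illustrative, not SeaArt defaults): one small per-image constant is added on top of the per-pixel Gaussian noise, which is what widens the learnable brightness range:

```python
import random

# Noise Offset sketch: a single constant shift per image is added to the
# per-pixel Gaussian noise, letting the model learn darker or brighter images.
def offset_noise(n_pixels: int, noise_offset: float = 0.1, seed=0):
    rng = random.Random(seed)
    shift = noise_offset * rng.gauss(0, 1)  # one shift per image
    return [rng.gauss(0, 1) + shift for _ in range(n_pixels)]

plain = offset_noise(1000, noise_offset=0.0)
shifted = offset_noise(1000, noise_offset=0.5)
print(sum(plain) / len(plain), sum(shifted) / len(shifted))
# With an offset, each image's noise mean drifts away from zero.
```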

View Training Records

View training records on the right side of the dataset creation screen.

Select models with no issues in the sample images and save them.

Click on your account avatar in the upper right corner to jump to the personal work screen, and select the Model tab to use.

Model Testing

SeaArt will automatically synchronize the base model and the trained LoRA.

Turn on the Fixed Seed number in the Advanced Config, and adjust the model weight to test the effect of the model at different weights.

LoRA Training Issues

Overfitting/Underfitting

Overfitting: When the dataset is limited or the AI matches the dataset too precisely, the LoRA generates images very similar to the dataset, resulting in poor generalization ability.

The image in the upper right is very similar to the dataset on the left in appearance and pose.

Causes of Overfitting:

● Insufficient dataset

● Incorrect parameter settings (tags, learning rate, steps, optimizer, etc.)

Preventing Overfitting:

● Appropriately reduce learning rate.

● Reduce Epoch.

● Reduce Repeat.

● Use regularization training.

● Increase dataset.

Underfitting: The model fails to adequately learn the features of the dataset during training, resulting in generated images that don't match the dataset well.

Causes of Underfitting:

● Low model complexity

● Insufficient features

Preventing Underfitting:

● Appropriately increase learning rate.

● Increase Epoch.

● Increase Repeat.

● Reduce regularization constraints.

● Add more feature materials (high quality) to the dataset.

Regularization Dataset

One way to avoid overfitting is to add extra images that enhance the model's generalization ability. The regularization dataset should not be too large, otherwise the AI will over-learn it and drift from the original target; 10-20 images are recommended.

For example, in a portrait dataset where most images feature long hair, add short-hair images to the regularization dataset. Similarly, if the dataset consists entirely of images in one artistic style, add images of different styles to enrich the model. Regularization datasets don't need to be tagged or cropped.

Image Model Type Classification

● SD1.5: Released in October 2022. The mainstream training size is 512*512. As a training base model, the training speed is fast, but the image quality is relatively average.

● Common models include:

majicMIX realistic (麦橘写实)

Counterfeit-V3.0

GhostMix (鬼混)

XXMix_9realistic

● SDXL: Released in July 2023. The mainstream training size is 1024*1024. As a training base model, the training speed is average, and the image effect is better.

● Common models include:

Animagine XL V3 Series

XXMix_9realisticSDXL

Juggernaut XL

● Pony: There are many versions; the V6 XL version released in January 2024 is the most popular. The mainstream training size is 1024*1024. Focuses on cartoon and animal-style image generation.

● Common models include:

WAI-ANI-NSFW-PONYXL

Prefect Pony XL

CyberRealistic Pony

● Illustrious: The V1.0 version released in July 2024 is the most popular. The mainstream training size is 1024*1024. Focused on providing high-quality anime and illustration style image generation capabilities.

● Common models include:

WAI-NSFW-illustrious-SDXL

Prefectious XL NSFW

Illustrious-Anime Artist

● Note: LoRAs trained on Illustrious and Pony base models can be used with SDXL models.

● Flux: Released in August 2024, based on a novel transformer architecture, using 12 billion parameters, allowing it to generate detailed and realistic images.

The mainstream training size is 1024*1024, but 512*512 also produces good effects.

● Model versions include: FLUX Pro, FLUX Dev, FLUX Schnell, FLUX GGUF, NF4

● Common models include:


SeaArt Infinity

STOIQO NewRealit

MajicFlus (麦橘超然)
