3-2 LoRA Training (Advanced)
Master AI art with advanced LoRA training! This guide covers everything from principles and processes to optimizing parameters for stunning, controllable results.
LoRA fine-tunes the generated image while leaving the weights of the Checkpoint unchanged. Only the LoRA needs to be adjusted to generate specific images; the Checkpoint itself is not modified. LoRA is also used for fine-tuning on images that the AI has never encountered before. This gives AI art a certain degree of "controllability."
Currently, the trained models are all "refinements" made on officially trained models (SD1.5, SDXL). Of course, refinements can also be made on models created by others.
LoRA Training: The AI first generates images based on the prompts, then compares them with the images in the training dataset. The differences guide the AI to keep fine-tuning the LoRA weights, so the generated results gradually approach the dataset. Eventually, the fine-tuned model produces results that closely resemble the dataset, forming an association between the images generated by the AI and the dataset.
*Compared to the Checkpoint, LoRA has a smaller file size, which saves time and resources. Moreover, it can adjust weights on top of the Checkpoint, achieving different effects.
Five steps: Prepare dataset - Image preprocessing - Set parameters - Monitor Lora training process - Training completion
*Taking the training of a facial Lora with SeaArt as an example.
*If you want to learn more about creating a dataset, you can read the guide below.
How To Create Dataset For Training
When uploading the dataset, it's essential to maintain the principle of "diversified samples." This means the dataset should include images from different angles, poses, lighting conditions, etc., and the images should be of high resolution. This step is primarily aimed at helping the AI understand the images.
I. Cropping images II. Tagging III. Trigger words.
I. Cropping images
To enable the AI to better discern objects through images, it's generally best to maintain consistent image dimensions. You can choose from 512*512 (1:1), 512*768 (2:3), or 768*512 (3:2) based on the desired output.
Crop Mode: Center Crop / Focus Crop / No Crop
Center Crop: Crops the central region of the image.
Focus Crop: Automatically identifies the main subject of the image.
*Compared to center cropping, focus cropping is more likely to preserve the main subject of the dataset, so it is generally recommended to use focus crop.
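As a rough illustration of what Center Crop does, the sketch below computes the largest centered region that matches a target size such as 512*512 or 512*768. The function name and the sample image sizes are hypothetical; SeaArt's own cropping code is not shown in this guide, and Focus Crop additionally needs subject detection, which is not covered here.

```python
def center_crop_box(width, height, target_w, target_h):
    """Return (left, top, right, bottom) of the largest centered region
    that has the same aspect ratio as target_w:target_h."""
    target_ratio = target_w / target_h
    if width / height > target_ratio:
        # Image is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = round(height * target_ratio)
    else:
        # Image is taller than the target: keep full width, trim top and bottom.
        crop_w = width
        crop_h = round(width / target_ratio)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


# A 1920x1080 landscape photo cropped for a 512*512 (1:1) training image.
print(center_crop_box(1920, 1080, 512, 512))
# A 1000x2000 portrait photo cropped for a 512*768 (2:3) training image.
print(center_crop_box(1000, 2000, 512, 768))
```

The returned box can then be resized to the target size. If the subject sits near an edge of the original image, a centered box may cut it off, which is the case where Focus Crop is preferable.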
II. Tagging
Tagging provides textual descriptions for the images in the dataset so that the AI can learn from the text.
Tagging Algorithm: BLIP/Deepbooru
BLIP: Natural language tagger, for example, "a girl with black hair."
Deepbooru: Phrase language labels, for example, "a girl, black hair."
Tagging Threshold: The smaller the value, the more tags are kept and the finer the description. 0.6 is recommended.
Tagging process: Remove the tags for fixed features (such as physical features...) so that the AI learns these features as part of the LoRA itself. Conversely, keep or add tags for features you want to be able to adjust in the future (clothing, accessories, actions, background...).
*For example, if you want all the generated images to have black hair and black eyes, you can delete these two tags.
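The tagging steps above can be sketched as a small filter. This is an illustration only: the tag scores are made-up values standing in for what a tagger such as Deepbooru might return, and `build_caption` is a hypothetical helper, not a SeaArt function.

```python
def build_caption(scored_tags, threshold=0.6, fixed_features=()):
    """Keep tags scoring at or above the threshold, then drop the tags
    for fixed features so the LoRA absorbs them."""
    kept = [tag for tag, score in scored_tags if score >= threshold]
    fixed = set(fixed_features)
    return ", ".join(tag for tag in kept if tag not in fixed)


# Hypothetical tagger output: (tag, confidence score).
scored = [
    ("1girl", 0.99),
    ("black hair", 0.93),
    ("black eyes", 0.81),
    ("red dress", 0.72),
    ("outdoors", 0.45),
]

# Black hair and black eyes should always appear, so their tags are removed.
caption = build_caption(scored, threshold=0.6,
                        fixed_features=["black hair", "black eyes"])
print(caption)

# A lower threshold keeps more tags, giving a finer description.
print(build_caption(scored, threshold=0.4))
```

Tags that remain in the caption, such as the dress, stay adjustable from the prompt later.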
III. Trigger words
Trigger words are the words that activate the LoRA, effectively consolidating the character's features into a single word.
Base Model: It is recommended to choose a high-quality, stable base model that closely matches the style of Lora, as this makes it easier for AI to match features and record differences.
Recommended Base Models:
Realistic: SD1.5, ChilloutMix, MajicMIX Realistic, Realistic Vision
Anime: AnyLoRA, Anything | 万象熔炉, ReV Animated
Training Parameters:
Repeat (Single Image Repetitions): The number of times a single image is learned. The more repetitions, the better the learning effect, but excessive repetitions may lead to image rigidity. Suggestion: Anime: 8; Realistic: 15.
Epoch (Cycles): One cycle equals the number of images in the dataset multiplied by Repeat. Epoch sets how many of these cycles the model is trained for. For example, if there are 20 images in the training set and Repeat is set to 10, then one cycle is 20 * 10 = 200 steps. If Epoch is set to 10, then the LoRA training will have a total of 200 * 10 = 2000 steps. Suggestion: Anime: 20; Realistic: 10.
Batch size: It refers to the number of images the AI learns simultaneously. For example, when set to 2, the AI learns 2 images at a time, which shortens the overall training duration. However, learning multiple images simultaneously may lead to a relative decrease in the precision for each image.
Mixed precision: fp16 is recommended.
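The step arithmetic above can be checked with a few lines of Python. The `batch_size` handling is an assumption: many trainers count one step per batch, so a batch size of 2 halves the number of steps, but this guide does not state how SeaArt counts them.

```python
import math


def total_steps(num_images, repeat, epochs, batch_size=1):
    """Total training steps: images * repeat per epoch, times epochs.
    Assumes one step per batch (batch_size images learned at once)."""
    steps_per_epoch = math.ceil(num_images * repeat / batch_size)
    return steps_per_epoch * epochs


# The example from the text: 20 images, Repeat 10, Epoch 10.
print(total_steps(20, 10, 10))
# Same settings with Batch size 2 (assumed to halve the step count).
print(total_steps(20, 10, 10, batch_size=2))
```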
Sample Settings:
Resolution: Determines the size of the preview image for the final model effect.
SD1.5: 512*512
Seed: Controls the randomly generated images. Using the same seed with the same prompts will likely generate the same or similar images.
Sampler \ Prompts \ Negative Prompts: These mainly determine how the preview image of the final model is rendered.
Save Settings:
The save interval (in epochs) determines the final number of LoRAs. If it is set to 2 and Epoch is 10, a LoRA is saved every 2 epochs, so 5 LoRAs will be saved in the end.
Save precision: Recommended fp16.
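A quick way to see which epochs produce a saved LoRA is sketched below. It assumes a file is written at the end of every Nth epoch; `saved_epochs` is a hypothetical helper, and some trainers may also save the final epoch even when it is not a multiple of the interval.

```python
def saved_epochs(total_epochs, save_every):
    """Epoch numbers at which a LoRA file is written,
    assuming one save at the end of every `save_every` epochs."""
    return list(range(save_every, total_epochs + 1, save_every))


epochs = saved_epochs(10, 2)
print(epochs, len(epochs))
```

Saving several intermediate LoRAs makes it possible to compare them afterwards and pick the one that is neither underfitted nor overfitted.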
Learning Rate & Optimizer:
Learning Rate: It denotes the intensity with which the AI learns the dataset. The higher the learning rate, the faster the AI learns, but a value that is too high may lead to output images that do not resemble the dataset. When the dataset grows, it's advisable to try reducing the learning rate. Start with the default value (0.0001 is recommended) and raise it gradually only if the training results call for it.
unet lr: When the unet lr is set, the Learning Rate will not take effect. Recommended at 0.0001.
text encoder lr: It determines the sensitivity to tags. Usually, the text encoder lr is set to 1/2 or 1/10 of the unet lr.
Lr scheduler: It primarily governs the decay of the learning rate. Different schedulers have minimal impact on the final results. Generally, the default "cosine" scheduler is used, but an upgraded version, "Cosine with Restart," is also available. It restarts and decays the learning rate multiple times to learn the dataset fully, avoiding getting stuck in "local optimal solutions" during training. If using "Cosine with Restart," set the Restart Times to 3-5.
Optimizer: It determines how AI grasps the learning process during training, directly impacting the learning results. It's recommended to use AdamW8bit.
Lion: A newly introduced optimizer, typically with a learning rate about 10 times smaller than AdamW.
Prodigy: If all learning rates are set to 1, Prodigy will automatically adjust the learning rate to achieve the best results, suitable for beginners.
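To show what "Cosine with Restart" does to the learning rate, here is a minimal sketch. It treats Restart Times as the number of equal cosine cycles and ignores warmup; both are assumptions, and the function is an illustration rather than the scheduler SeaArt actually runs.

```python
import math


def cosine_with_restarts(step, total_steps, base_lr=1e-4, restarts=4):
    """Learning rate at `step`: each cycle decays from base_lr toward 0
    along a cosine curve, then jumps back to base_lr (a restart)."""
    cycle_len = total_steps / restarts
    progress = (step % cycle_len) / cycle_len  # position in current cycle, 0..1
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# 2000 total steps split into 4 cycles of 500 steps each.
for step in (0, 250, 499, 500, 750):
    print(step, f"{cosine_with_restarts(step, 2000):.6f}")
```

The jump back to the full learning rate at step 500 is the restart: it gives the model a chance to leave a poor local optimum it settled into while the rate was small.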
Network:
Used to build a suitable Lora model base for AI input data.
Network Rank Dim: It directly affects the size of the LoRA. The larger the Rank, the more parameters are fine-tuned during training. Approximate file sizes: 128 = 140MB+; 64 = 70MB+; 32 = 40MB+.
Recommended:
Realistic: 64/128
Anime: 8/16/32
Setting the value too high will make the AI learn too deeply, capturing many irrelevant details, similar to "overfitting."
Network Alpha: It can be understood as a scaling factor for the LoRA's influence on the original model weights. In common LoRA implementations the learned update is multiplied by Alpha / Rank, so the closer Alpha is to Rank, the more fully the update is applied, while the closer it is to 0, the more the update is dampened (which usually calls for a higher learning rate). Alpha generally does not exceed Rank. Currently, Alpha is typically set to half of Rank.
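To make the relationship between Rank, Alpha, and LoRA size concrete, here is a minimal sketch. It assumes the formulation used by common LoRA implementations, in which each adapted layer learns two small matrices (B of size d_out x rank and A of size rank x d_in) and their product is multiplied by Alpha / Rank before being added to the frozen weights. The 768 layer width and the function names are illustrative and not taken from SeaArt.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable values for one adapted layer:
    B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


def lora_scale(alpha, rank):
    """Factor applied to the learned update B @ A (assumed alpha / rank)."""
    return alpha / rank


full = 768 * 768  # values in one hypothetical 768x768 weight matrix
for rank in (8, 32, 128):
    n = lora_param_count(768, 768, rank)
    print(f"rank {rank}: {n} trainable values ({n / full:.1%} of the full layer)")

# Alpha at half of Rank applies the update at half strength.
print(lora_scale(alpha=64, rank=128))
```

Because the parameter count grows linearly with Rank, the file size does too, which matches the roughly doubling sizes listed for 32, 64, and 128.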
Tagging Settings:
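In general, the closer a tag is to the front of the caption, the greater its weight. It is therefore usually recommended to enable Shuffle Caption, which randomizes the tag order during training so that no tag gains weight from its position alone.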
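A minimal sketch of caption shuffling is shown below. The `keep_first` parameter, which pins the trigger word at the front while the remaining tags are shuffled, is a hypothetical addition for illustration; whether SeaArt offers such an option is not stated in this guide. The trigger word "mychar" is also made up.

```python
import random


def shuffle_caption(caption, keep_first=1, rng=random):
    """Shuffle comma-separated tags, leaving the first `keep_first`
    tags (for example the trigger word) in place."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    head, tail = tags[:keep_first], tags[keep_first:]
    rng.shuffle(tail)  # shuffles the copy in place
    return ", ".join(head + tail)


rng = random.Random(0)  # fixed seed so the example is repeatable
original = "mychar, 1girl, red dress, outdoors, smile"
shuffled = shuffle_caption(original, keep_first=1, rng=rng)
print(shuffled)
```

Each training step would see a differently ordered caption, so over many steps every tag spends time near the front.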
Overfitting: When there is a limited dataset or the AI matches the dataset too precisely, it leads to Lora generating images that largely resemble the dataset, resulting in poor generalization ability of the model.
The image on the top right closely resembles the dataset on the left, both in appearance and posture.
Reasons for Overfitting:
The dataset is too small or lacks variety.
Incorrect parameter settings (tags, learning rate, steps, optimizer, etc.).
Preventing Overfitting:
Decrease learning rate appropriately.
Shorten the Epoch.
Reduce Rank, or lower Alpha to weaken the LoRA update.
Decrease Repeat.
Utilize regularization training.
Increase dataset.
Underfitting: The model fails to adequately learn the features of the dataset during training, resulting in generated images that do not match the dataset well.
You can see that Lora's generated images fail to adequately preserve the features of the dataset — they are dissimilar.
Reasons for Underfitting:
Low model complexity
Insufficient feature quantity
Preventing Underfitting:
Increase learning rate appropriately
Increase Epoch
Raise Rank, or raise Alpha to strengthen the LoRA update
Increase Repeat
Reduce regularization constraints
Add more features to the dataset (high quality)
Regularization training is a way to avoid overfitting: additional images are added to enhance the model's generalization ability. The regularization dataset should not be too large; otherwise, the AI will learn too much from it and drift away from the original target. It is recommended to have 10-20 images.
For example, in a portrait dataset where most images feature long hair, you can add images with short hair to the regularization dataset. Similarly, if the dataset consists entirely of images with the same artistic style, you can add images with different styles to the regularization dataset to diversify the model. The regularization dataset does not need to be tagged.
*In layman's terms, training a LoRA in this way is somewhat like combining the dataset with a regularization dataset.
Loss measures the deviation between what the AI has learned and reality, and it guides the direction of the AI's learning. When the loss is low, that deviation is relatively small, and the AI learns most accurately. As long as the loss gradually decreases, there are usually no major issues.
The loss value for Realistic images generally ranges from 0.1 to 0.12, while for anime, it can be lowered appropriately.
Use the loss value to assess model training issues.
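One simple way to judge whether the loss is "gradually decreasing" is to compare its average at the start and at the end of training, since per-step values are noisy. The sketch below is an illustration under that assumption; the loss values are invented and `loss_trend` is a hypothetical helper.

```python
def loss_trend(losses, window=3):
    """Difference between the mean of the last `window` losses and the
    mean of the first `window` losses. Negative means the loss fell."""
    if len(losses) < window:
        raise ValueError("need at least `window` loss values")
    first = sum(losses[:window]) / window
    last = sum(losses[-window:]) / window
    return last - first


# Invented per-epoch loss values for a realistic-style LoRA.
healthy = [0.18, 0.16, 0.15, 0.13, 0.12, 0.11]
stalled = [0.12, 0.13, 0.15, 0.16, 0.18, 0.19]

print(loss_trend(healthy))  # negative: loss is going down
print(loss_trend(stalled))  # positive: loss is rising, check the settings
```

A falling loss alone does not rule out overfitting, so the preview images are still worth checking alongside the numbers.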
Currently, the "fine-tuning models" can be roughly divided into three types: the Checkpoint output by Dreambooth, the Lora, and the Embeddings output by Textual Inversion. Considering factors such as model size, training duration, and training dataset requirements, Lora offers the best "cost-effectiveness". Whether it's adjusting the art style, characters, or various poses, Lora can perform effectively.
Epoch: The number of training cycles over the dataset images. We suggest 10 for beginners. Raise the value if the training seems insufficient because the dataset is small, or lower it if the dataset is huge.
Repeat: The number of times an image is learned. Higher values lead to better effects and more complex image compositions, but setting it too high increases the risk of overfitting. We therefore suggest 10, which achieves good training results while minimizing the chance of overfitting.
Note: You can increase the epochs and repeats if the training results do not resemble the dataset.
Learning Rate: The degree of change in each training step. Higher values mean faster learning but may cause the model to crash or fail to converge. Lower values mean slower learning but may reach the optimal state. This value has no effect once separate learning rates are set for the U-Net and the Text Encoder.
U-Net Learning Rate: The U-Net guides the noise images generated from random seeds, determining the denoising direction, finding the areas that need to change, and providing the required data. Higher values mean faster fitting but risk missing details, while lower values cause underfitting, where the generated images do not resemble the training materials. Set the value according to the model type and dataset. We suggest 0.0002 for character training.
Text Encoder Learning Rate: The text encoder converts tags into embeddings that the U-Net can understand. Since the text encoder of SDXL is already well-trained, there is usually no need for further training, and the default values are fine unless there are special needs.
Optimizer: An algorithm in deep learning that adjusts model parameters to minimize the loss function. During neural network training, the optimizer updates the model's weights based on the gradient of the loss function so that the model fits the training data better. The default optimizer, AdamW, can be used for SDXL training, and other optimizers, like the easy-to-use Prodigy with adaptive learning rates, can also be chosen based on specific requirements.
Lr Scheduler: A strategy or algorithm for dynamically adjusting the learning rate during training. Choosing Constant is sufficient under normal circumstances.
Network Rank (Dim): Closely related to the size of the trained LoRA. For SDXL, a 32-dim LoRA is about 200MB, a 16-dim LoRA is about 100MB, and an 8-dim LoRA is about 50MB. For characters, selecting 8 dim is sufficient.
Network Alpha: Typically set to half or a quarter of the dim value. If the dim is set to 8, then the alpha can be set to 4.
Resolution: The training resolution can be non-square, but the width and height must be multiples of 64. For SDXL, we suggest 1024*1024 or 1024*768.
Enable Bucket: If the images' resolutions are not unified, turn on this parameter. Before training starts, it automatically classifies the resolutions of the training set and creates a bucket to store the images of each resolution or similar resolutions. This saves the time otherwise spent on unifying the resolutions beforehand. If the images' resolutions have already been unified, there is no need to turn it on.
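The bucketing idea can be sketched as assigning each image to the bucket whose aspect ratio is closest to its own. The bucket list and image sizes below are hypothetical; real trainers typically generate their buckets automatically, and the exact rule SeaArt uses is not described here.

```python
from collections import defaultdict

# Hypothetical buckets; every side is a multiple of 64.
BUCKETS = [(1024, 1024), (1024, 768), (768, 1024), (1152, 896), (896, 1152)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


# Made-up training images: (file name, width, height).
images = [("a.png", 2000, 2000), ("b.png", 4000, 3000), ("c.png", 3000, 4000)]

grouped = defaultdict(list)
for name, w, h in images:
    grouped[nearest_bucket(w, h)].append(name)

for bucket, names in grouped.items():
    print(bucket, names)
```

Images in the same bucket can then be resized to that bucket's resolution and batched together, so nothing has to be cropped to a single fixed size beforehand.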
Noise Offset / Multires Noise Iterations: Both noise offsets improve cases where the generated image is overly bright or too dark. If there are no excessively bright or dark images in the training set, they can be turned off. If turned on, we suggest using multires_noise_iterations with a value of 6-10.
Multires Noise Discount: Needs to be turned on together with the multires_noise_iterations mentioned above. A value of 0.3-0.8 is recommended.
Clip Skip: Specifies which text encoder layer's output to use, counting from the last. Usually, the default value is fine.