Guide
Stable Diffusion is an image generation model created by Stability AI. It took around 4000 GPUs and over 100,000 USD to train; models trained from scratch at this kind of expense are called Foundation models. It was released to the public as a Checkpoint: a 2-4GB file containing all of the model's weights. The Stable Diffusion Checkpoint was further trained by individuals on different data. This training, called Fine-tuning, is much smaller in scale, taking a few days to weeks on a couple of consumer-level GPUs. Checkpoints can also be merged with each other in different proportions to create new Checkpoints; most Checkpoints you find will be Merges. Stability AI has released many versions of Stable Diffusion (V1, V2, XL, etc.) which have slightly different architectures and are incompatible with each other. The most common Checkpoint architecture is the original V1.
Checkpoints are actually composed of three smaller models: the UNET, the VAE, and CLIP. The UNET is the main model and actually generates the image. The VAE encodes images into a form the UNET can use, and decodes what the UNET produces back into images. CLIP understands your prompts and produces a condensed form that guides the UNET.
Networks are another type of model used during generation: smaller models designed to attach to a Checkpoint and alter its outputs. LoRAs are the modern approach, though Hypernetworks were used before. They are usually less than 200MB, and multiple can be attached at once. Networks are used by adding them to the prompt with a special syntax, `<lora:some_thing>`; the tag will be highlighted purple if it's working.
Embeddings are another smaller model, which can be thought of as a condensed prompt trained for a specific purpose. These are usually tiny, a few KB in size. Embeddings are used by adding their name to the prompt; it will be highlighted orange if it's working.
Networks and Embeddings are not compatible across different Checkpoint architectures.
Generating an image is done gradually over many steps. The image begins as random noise, and at each step some of the noise is removed until none remains; this process is called Denoising. An existing image can be altered by adding some noise to it and then Denoising it.
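The loop described above can be pictured with a toy sketch (an illustration only: a real sampler uses the UNET's noise prediction rather than a known target, and `strength` here mirrors the image-to-image behavior just described):

```python
import random

def generate(target, steps, strength=1.0):
    # Toy denoising loop. strength=1.0 starts from pure noise (txt2img);
    # strength<1.0 mixes some noise into an existing image (img2img).
    noise = [random.random() for _ in target]
    x = [(1 - strength) * t + strength * n for t, n in zip(target, noise)]
    run_steps = max(1, round(steps * strength))  # lower strength -> fewer steps
    for _ in range(run_steps):
        # each step removes a fraction of the remaining noise
        x = [xi + (t - xi) * 0.25 for xi, t in zip(x, target)]
    return x

image = generate([0.2, 0.8], steps=20)  # ends up close to the target
```

Over 20 steps the remaining noise shrinks geometrically, which is also why extra steps eventually give diminishing returns.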
Width/Height affects how fast generation is; larger resolutions take much longer and use more VRAM. Models are also trained on certain resolutions (512 for V1, 768 for V2), and generating outside those resolutions can result in nonsense. To address this, generation can be split into two stages: first a normal-resolution image is generated, then it's upscaled and used as the base for a higher-resolution image (often called Highres Fix).
Steps affects the quality of the image and how long it takes to generate. There are diminishing returns past a certain point. 20-50 is the standard range.
Scale affects how drastic each step is: too low and the image will be bland, too high and the image will be fried. 5-20 is the standard range.
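For context, Scale is the classifier-free guidance (CFG) scale. Conceptually, each step blends an unconditional prediction with the prompt-conditioned one; the snippet below shows the standard CFG formula (whether qDiffusion layers anything extra on top, such as CFG Rescale, is not shown here):

```python
def cfg(uncond, cond, scale):
    # Push the prediction away from the unconditional output and toward
    # the prompt-conditioned one; higher scale follows the prompt harder.
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

print(cfg([0.0], [1.0], 7.5))  # -> [7.5]
```

At very high scales the prediction is pushed far outside its normal range, which is the "fried" look mentioned above.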
Special syntax is used to manipulate prompts. The basic features are Emphasis and Scheduling.
Emphasis changes the influence text has on the output, using a numeric weighting. Syntax: Emphasis `(PROMPT)`, De-emphasis `[PROMPT]`, Custom `(PROMPT:WEIGHT)`.
| Prompt | Weighting |
|---|---|
| `(hello world)` | 110% |
| `[hello world]` | 90% |
| `(hello world:2.0)` | 200% |
| `([hello world])` | 100% |
| `(hello (world))` | hello: 110%, world: 120% |
Scheduling changes the text during generation. Syntax: Scheduling `[A:B:SPECIFIER]`, Alternating `[A|B]`.
| Prompt | Effect |
|---|---|
| `[hello:world:10]` | At step 10, change from `hello` to `world`. |
| `[:hello world:10]` | At step 10, add `hello world`. |
| `[hello world::10]` | At step 10, remove `hello world`. |
| `[hello\|world]` | Alternate between `hello` and `world` every step. |
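One way to picture scheduling is a function that rewrites the prompt for a given step (a simplified sketch: the `HR` specifier is ignored, and treating fractional specifiers as a fraction of the total steps is an assumption based on the strength-scheduling examples below):

```python
import re

def prompt_at_step(prompt, step, total_steps):
    # Resolve [A:B:SPEC] (switch from A to B at a step) and [A|B]
    # (alternate every step) for one particular step.
    def sched(m):
        a, b, spec = m.groups()
        t = float(spec)
        boundary = t * total_steps if t < 1 else t
        return b if step >= boundary else a

    def alt(m):
        options = m.group(1).split("|")
        return options[step % len(options)]

    prev = None
    while prompt != prev:  # repeat so nested schedules resolve inside-out
        prev = prompt
        prompt = re.sub(r"\[([^\[\]:|]*):([^\[\]:|]*):([\d.]+)\]", sched, prompt)
        prompt = re.sub(r"\[([^\[\]:]+\|[^\[\]:]+)\]", alt, prompt)
    return " ".join(prompt.split())
```

For example, `prompt_at_step("[hello:world:10]", 12, 20)` yields `world`, while step 5 still yields `hello`.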
Strength scheduling is a more compact way to change strengths. Syntax: `(PROMPT:SCHEDULE)`. For example, `(hello:[0.5:0.7:10])` is equivalent to `[(hello:0.5):(hello:0.7):10]`.
| Schedule | Effect |
|---|---|
| `[0.5:0.7:10]` | At step 10, change from 50% strength to 70%. |
| `[0.0:0.7:10]` | At step 10, add the prompt at 70% strength. |
| `[[0.0:0.25:5]:0.7:10]` | At step 5, change from 0% to 25% strength, then to 70% at step 10. |
| `[0.5:0.7:0.5]` | Halfway through generating, change from 50% strength to 70%. |
| `[0.5:0.7:HR]` | For the Highres pass, change from 50% strength to 70%. |
Escaping characters may be needed. The `()[]` characters are interpreted as prompt syntax and ultimately get removed. To use them in a prompt directly you must preface them with a `\`. For example, `hello_(world)` becomes `hello_\(world\)`.
LoRAs, LoCons, and Hypernetworks are activated by including them in the prompt. The program makes no distinction between LoCons and LoRAs, since LoCons are just LoRAs with additional convolutional layers.
| Prompt | Effect |
|---|---|
| `<lora:borzoi>` | Enable the LoRA/LoCon named `borzoi`. |
| `<hypernet:borzoi>` | Enable the Hypernetwork named `borzoi`. |
| `<lora:borzoi:0.5>` | Set `borzoi` to 50% strength. |
| `<lora:borzoi:0.5:0.7>` | Set the `borzoi` UNET to 50% strength and CLIP to 70% strength. |
| `<lora:borzoi:1,0.75,0.5,0.25,0,0.25,0.5,0.75,1>` | Use block weights for the `borzoi` UNET (4 and 12 block weights are supported). |
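The tag structure can be illustrated with a small parser (a sketch: block-weight lists are not handled, and defaulting the CLIP strength to the UNET strength is an assumption):

```python
import re

def parse_network_tag(tag):
    # Parse <lora:NAME>, <lora:NAME:UNET> and <lora:NAME:UNET:CLIP>
    # (and the same shapes for <hypernet:...>).
    m = re.fullmatch(r"<(lora|hypernet):([^:>]+)(?::([^:>]+))?(?::([^:>]+))?>", tag)
    if m is None:
        return None
    kind, name, unet, clip = m.groups()
    unet = float(unet) if unet else 1.0
    clip = float(clip) if clip else unet
    return {"type": kind, "name": name, "unet": unet, "clip": clip}
```

For example, `parse_network_tag("<lora:borzoi:0.5:0.7>")` yields a UNET strength of 0.5 and a CLIP strength of 0.7.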
Dynamic mode evaluates each network individually at each step, allowing network strengths to change during generation. This is slower than Static mode, which merges LoRAs into the model before generating. Static mode is much faster if you don't need scheduling (as fast as using no LoRAs at all), especially for multiple or large LoRAs, though it must reload the model from disk whenever networks are changed.
Prompt encoding happens before generating, so the CLIP strength at step 0 is what's used; scheduling CLIP strength after that has no effect (except that the prompt gets encoded again before the Highres pass). Likewise, when scheduling in Static mode, the network strengths at step 0 are used.
In dynamic mode, the following is possible:
| Prompt | Effect |
|---|---|
| `<lora:borzoi:[0.5:1.0:10]>` | Switch from 50% to 100% strength at step 10. |
| `<lora:borzoi:[0.5:1.0:0.5]>` | Switch from 50% to 100% strength halfway through generating. |
| `<lora:borzoi:[0.5:1.0:HR]>` | Switch from 50% to 100% strength for the Highres pass. |
| `[:<lora:borzoi>:HR]` | Enable the network for the Highres pass. |
| `[<lora:borzoi>:<lora:borzoi:1,0.75,0.5,0.25,0,0.25,0.5,0.75,1>:10]` | Switch to block weights at step 10. |
| `<lora:borzoi:1,0.75,[0.5:1.0:0.5],0.25,0,0.25,0.5,0.75,1>` | Modify the DOWN2 block weight halfway through generating. |
Inpainting is used to selectively denoise part of an image. Access it by adding a `Mask` and linking it to the image you want to edit. With no mask specified, the entire image will be affected.
Inpainting has a few parameters for controlling how the masks and images are processed.
`Strength` controls how much noise is added before denoising. At 0.0 no noise is added and nothing is denoised, so the image remains the same. At 1.0 the full amount of noise is added, so the image can change completely. Strength also affects how many steps are run; lower strength means fewer steps.
The mask is processed with the `Mask Blur` and `Mask Expand` parameters before being used. `Mask Blur` applies a blur to the mask, which helps avoid seams. `Mask Expand` expands the mask outwards in all directions.
`Padding` affects the area of the image that's actually used, called the extent. The extent is shown on the mask with a red or green box; red means you will be denoising at a lower resolution than the source image, making the end result blurry. The extent area is upscaled to your specified resolution before denoising.
`Upscaler` is the method for upscaling the image. There are few reasons to change this from Lanczos.
`Mask Fill` controls how the masked area is prepared before denoising. `Original` keeps the image as it is; `Noise` completely replaces the masked area with noise. `Noise` mode requires a `Strength` of 1.0 to work optimally.
There are some options hidden by default; these are usually things the average user won't need. In the `Settings` tab set `Advanced Parameters` to `Show`, then these parameters will be available in the `Generate` tab. Highres: `HR ToMe Ratio`, `HR Sampler`, `HR Scale`, `HR Eta`. Misc: `CFG Rescale`, `ToMe Ratio`, `Subseed`, `Subseed Strength`. Operation: `VAE Tiling`, `Precision`, `VAE Precision`, `Autocast`.
Black outputs. Usually due to precision issues with your hardware or the model itself. Under `Operation`, set `VAE Precision` to `FP32` (requires `Advanced Parameters`). If that doesn't work, try with `Precision` set to `FP32`. `FP32` is automatically forced when generating with the CPU or on GPUs with known `FP16` issues.
Noisy blob outputs. Can happen when generating with a V-prediction model in EPS prediction mode. Check whether your model is V-prediction; if it is, set the `Prediction` and `CFG Rescale` parameters accordingly (under `Misc`). The YAML is not required.
Ran out of VRAM. Under `Operation`, set `VRAM` to `Minimal`. If it still fails at the end (or in the middle for Highres), set `VAE Tiling` to `Enabled` (requires `Advanced Parameters`). The primary factor in VRAM usage is the resolution, so try reducing it. ControlNets and batch size also affect VRAM usage, and LoRAs affect it when in `Dynamic` mode. Generating with less than 4GB of VRAM is not recommended; use Remote instead.
The primary model format is `safetensors`. Pickled models (`ckpt`, `pt`, `pth`, `bin`, etc.) are also supported but not recommended. Diffusers folders are also supported. The model folder structure is flexible, supporting both the A1111 layout and qDiffusion's own layout:
- Checkpoint: `SD`, `Stable-diffusion`, `VAE`
- Upscaler/Super resolution: `SR`, `ESRGAN`, `RealESRGAN`
- Embedding/Textual Inversion: `TI`, `embeddings`, `../embeddings`
- LoRA: `LoRA`
- Hypernet: `HN`, `hypernetworks`
- ControlNet: `CN`, `ControlNet`
VAEs need `.vae.` in their filename to be recognized as external, e.g. `PerfectColors.vae.safetensors`. Embeddings in a subfolder with "negative" in the folder name will be considered negative embeddings. Subfolders and models starting with `_` will be ignored.
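The filename rules above might look roughly like this in code (a sketch of the described behavior, not the actual implementation):

```python
from pathlib import PurePosixPath

def classify(path):
    # Apply the special filename rules: '_' prefixes are ignored, '.vae.'
    # marks an external VAE, and a 'negative' folder marks negative embeddings.
    p = PurePosixPath(path)
    if any(part.startswith("_") for part in p.parts):
        return "ignored"
    if ".vae." in p.name:
        return "external VAE"
    if any("negative" in part.lower() for part in p.parent.parts):
        return "negative embedding"
    return "regular model"
```

For example, `classify("TI/negative/bad_hands.pt")` would be treated as a negative embedding.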
Remote instances can download models from URLs or receive models uploaded by the client (`Settings->Remote`). Some sources have special support:
- HuggingFace: You can also provide an access token in `config.json` (`"hf_token": "TOKEN"`).
  - Ex. `https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors`
- Civit.ai: Right click the model's download button and copy the link.
  - Ex. `https://civitai.com/api/download/models/90854`
- Google Drive: They may block you, good luck.
  - Ex. `https://drive.google.com/file/d/1_sK-uEEZnS5mZThQbVg-2B-dV7qmAVyJ/view?usp=sharing`
- Mega.nz: The URL must include the key.
  - Ex. `https://mega.nz/file/W1QxVZpL#E-B6XmqIWii3-mnzRtWlS2mQSrgm17sX20unA14fAu8`
- Other: All other URLs are downloaded with `curl -OJL URL`, so simple file hosts will work.
Quick showcase of how qDiffusion operates with basic prompting, LoRAs, Embeddings, Inpainting, and ControlNets.