
Background

Stable Diffusion is an image generation model created by Stability AI. It took around 4,000 GPUs and over 100,000 USD to train. Models trained from scratch at great expense like this are called Foundation models. It was released to the public as a Checkpoint, a 2-4GB file containing all of the model's weights. The Stable Diffusion Checkpoint has been further trained by individuals on different data; this training is much smaller in scale, taking days to weeks on a few consumer-level GPUs, and is called Fine-tuning. Checkpoints can also be merged with each other in different proportions to create new Checkpoints; most Checkpoints you find will be Merges. Stability AI has released many versions of Stable Diffusion with slightly different architectures that are incompatible with each other (V1, V2, XL, etc). The most common Checkpoint architecture is the original V1, though XL is now more popular.

Models

Checkpoints are actually composed of three smaller models: the UNET, the VAE and the CLIP text encoder. The UNET is the main model and actually generates the image. The VAE encodes images into a compressed form (latents) the UNET works in, and decodes what the UNET produces back into images. The CLIP text encoder interprets your prompt and produces a condensed form that guides the UNET.
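As an illustration, here is a rough sketch using the open-source diffusers library (not qDiffusion itself) showing that one checkpoint file bundles these three sub-models; the filename is just a placeholder:

```python
# Rough sketch using the diffusers library; the checkpoint filename is a placeholder.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file("v1-5-pruned-emaonly.safetensors")

print(pipe.unet)          # UNET: predicts the noise to remove at each step
print(pipe.vae)           # VAE: encodes images to latents and decodes latents back to images
print(pipe.text_encoder)  # CLIP text encoder: turns the prompt into guidance for the UNET
```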

Networks are another type of model used during generation: smaller models designed to attach to a Checkpoint and alter its outputs. LoRAs are the modern approach, though Hypernetworks were used before. They are usually under 200MB, and multiple can be attached at once. Networks are used by adding them to the prompt with a special syntax, `<lora:some_thing>`, which will be highlighted purple if it's working.

Embeddings are another small type of model, which can be thought of as a condensed prompt trained for a specific purpose. These are usually tiny, a few KB in size. Embeddings are used by adding their name to the prompt, which will be highlighted orange if it's working.

Networks and Embeddings are not compatible across different Checkpoint architectures.

Denoising

Generating an image is done gradually over many steps. The image begins as random noise, and at each step some of the noise is removed until none remains; this process is called Denoising. An existing image can be altered by adding some noise to it and then Denoising it.
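A minimal sketch of this loop (illustrative Python, not qDiffusion's internals; `denoise_step` stands in for one UNET prediction plus a scheduler update):

```python
import torch

def generate(denoise_step, steps=30, shape=(1, 4, 64, 64)):
    """Start from pure noise and remove a little of it at every step."""
    latent = torch.randn(shape)              # txt2img begins as random noise
    for step in range(steps):
        latent = denoise_step(latent, step)  # each step removes some of the noise
    return latent                            # the final latent is decoded to an image by the VAE
```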

Width/Height affect how fast generation is; larger resolutions take much longer and use more VRAM. Models are also trained at certain resolutions (512 for V1, 768 for V2), and generating far outside those resolutions can produce nonsense. To address this, generation can be split into two stages: first a normal-resolution image is generated, then it is upscaled and used as the base for a higher-resolution image (often called Highres Fix).
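A sketch of the two-stage idea using the diffusers library, as an illustration of the general technique rather than qDiffusion's implementation (the prompt and sizes are arbitrary):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Stage 1: generate at the model's native resolution.
base = pipe("a photo of a borzoi", width=512, height=512).images[0]

# Stage 2: upscale the result, then denoise it again at the higher resolution.
upscaled = base.resize((1024, 1024), Image.LANCZOS)
img2img = StableDiffusionImg2ImgPipeline(**pipe.components)
final = img2img("a photo of a borzoi", image=upscaled, strength=0.5).images[0]
```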

Steps affects the quality of the image and how long it takes to generate. There are diminishing returns beyond a certain point; 20-50 is the standard range.

Scale affects how drastic each step is: too low and the image will be bland, too high and the image will be fried. 5-20 is the standard range.
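For reference, a sketch of the standard classifier-free guidance formula that a Scale value typically plugs into; this is the common formulation, not a statement about qDiffusion's exact code:

```python
import torch

def guided_noise(noise_uncond: torch.Tensor, noise_cond: torch.Tensor, scale: float) -> torch.Tensor:
    # Scale pushes each step's prediction away from the unconditional result and
    # toward the prompt-conditioned one; too high a value "fries" the image.
    return noise_uncond + scale * (noise_cond - noise_uncond)
```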

Prompting

Special syntax is used to manipulate prompts. The basic features are Emphasis and Scheduling.

Emphasis changes the influence text has on the output. It does this with a numeric weighting. Syntax: Emphasis (PROMPT), De-emphasis [PROMPT], Custom (PROMPT:WEIGHT).

| Prompt | Weighting |
| --- | --- |
| `(hello world)` | 110% |
| `[hello world]` | 90% |
| `(hello world:2.0)` | 200% |
| `([hello world])` | 100% |
| `(hello (world))` | hello: 110%, world: 120% |

Scheduling changes the text during generation. Syntax: Scheduling [A:B:SPECIFIER], Alternating [A|B].

| Prompt | Effect |
| --- | --- |
| `[hello:world:10]` | At step 10, change from hello to world. |
| `[:hello world:10]` | At step 10, add hello world. |
| `[hello world::10]` | At step 10, remove hello world. |
| `[hello\|world]` | Every step, alternate between hello and world. |
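A sketch of how these rules resolve per step (the helper functions are hypothetical; only the behaviour comes from the table above):

```python
def scheduled(step: int, a: str, b: str, switch_step: int) -> str:
    """[A:B:N] -> use A before step N, B from step N onwards."""
    return a if step < switch_step else b

def alternating(step: int, a: str, b: str) -> str:
    """[A|B] -> swap between A and B every step."""
    return a if step % 2 == 0 else b

assert scheduled(9, "hello", "world", 10) == "hello"
assert scheduled(10, "hello", "world", 10) == "world"
```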

Strength scheduling is a more compact way to change strengths. Syntax: (PROMPT:SCHEDULE).

For example (hello:[0.5:0.7:10]) is equivalent to [(hello:0.5):(hello:0.7):10]

| Schedule | Effect |
| --- | --- |
| `[0.5:0.7:10]` | At step 10, change from 50% strength to 70%. |
| `[0.0:0.7:10]` | At step 10, add the prompt at 70% strength. |
| `[[0.0:0.25:5]:0.7:10]` | At step 5, change from 0% to 25% strength, then to 70% strength at step 10. |
| `[0.5:0.7:0.5]` | Halfway through generating, change from 50% strength to 70%. |
| `[0.5:0.7:HR]` | For the Highres pass, change from 50% strength to 70%. |
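A sketch of how the SPECIFIER could map to a concrete switch point, based on the rows above (integers are steps, fractions are a portion of the total steps, HR means the Highres pass); the helper itself is hypothetical:

```python
def resolve_specifier(specifier: str, total_steps: int, in_highres_pass: bool) -> int:
    """Return the step at which the schedule switches."""
    if specifier == "HR":
        # the switch only happens for the Highres pass
        return 0 if in_highres_pass else total_steps
    value = float(specifier)
    if value < 1.0:
        return round(total_steps * value)  # e.g. 0.5 -> halfway through
    return int(value)                      # e.g. 10 -> step 10
```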

Escaping characters may be needed. The ()[] characters are interpreted as prompt syntax and ultimately get removed. To use them in a prompt directly you must preface them with a \. For example hello_(world) becomes hello_\(world\).

Networks

LoRAs, LoCons and Hypernetworks are activated by including them in the prompt. No distinction is made between LoCons and LoRAs in the program, since LoCons are just LoRAs with additional convolutional layers.


| Prompt | Effect |
| --- | --- |
| `<lora:borzoi>` | Enable the LoRA/LoCon named borzoi. |
| `<hypernet:borzoi>` | Enable the Hypernetwork named borzoi. |
| `<lora:borzoi:0.5>` | Set borzoi to 50% strength. |
| `<lora:borzoi:0.5:0.7>` | Set the borzoi UNET strength to 50% and CLIP strength to 70%. |
| `<lora:borzoi:1,0.75,0.5,0.25,0,0.25,0.5,0.75,1>` | Use block weights for the borzoi UNET (4 and 12 block weights are supported). |

Scheduling

Dynamic mode evaluates each network individually at every step, allowing network strengths to change during generation. This is slower than Static mode, which merges LoRAs into the model before generating. Overall, Static mode is much faster if you don't need scheduling (as fast as using no LoRAs at all), especially with multiple or large LoRAs, though it needs to reload the model from disk whenever the networks are changed.
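A sketch of the difference, assuming the usual low-rank formulation of a LoRA (delta = strength * B @ A); this illustrates the trade-off rather than qDiffusion's actual code:

```python
import torch

def static_merge(weight: torch.Tensor, A: torch.Tensor, B: torch.Tensor, strength: float) -> torch.Tensor:
    # Static: bake the LoRA into the checkpoint weight once, before generating.
    # Fast afterwards, but the strength is fixed for the whole generation.
    return weight + strength * (B @ A)

def dynamic_forward(x: torch.Tensor, weight: torch.Tensor, A: torch.Tensor, B: torch.Tensor, strength: float) -> torch.Tensor:
    # Dynamic: keep the LoRA separate and add its contribution on every forward
    # pass, so the strength can change from step to step (enabling scheduling).
    return x @ weight.T + strength * (x @ A.T @ B.T)
```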

Prompt encoding happens before generating, so the CLIP strength at step 0 is what gets used; scheduling CLIP strength after that has no effect (except that the prompt is encoded again before the Highres pass). Likewise, when trying to schedule in Static mode, the network strengths at step 0 are the ones used.


In Dynamic mode, the following is possible:

| Prompt | Effect |
| --- | --- |
| `<lora:borzoi:[0.5:1.0:10]>` | Switch from 50% to 100% strength at step 10. |
| `<lora:borzoi:[0.5:1.0:0.5]>` | Switch from 50% to 100% strength halfway through generating. |
| `<lora:borzoi:[0.5:1.0:HR]>` | Switch from 50% to 100% strength for the Highres pass. |
| `[:<lora:borzoi>:HR]` | Enable the network for the Highres pass. |
| `[<lora:borzoi>:<lora:borzoi:1,0.75,0.5,0.25,0,0.25,0.5,0.75,1>:10]` | Switch to block weights at step 10. |
| `<lora:borzoi:1,0.75,[0.5:1.0:0.5],0.25,0,0.25,0.5,0.75,1>` | Modify the DOWN2 block weight halfway through generating. |

Inpainting

Inpainting is used to selectively denoise part of an image. Access it by adding a Mask and linking it to the image you want to edit. With no mask specified, the entire image will be affected.


Controls

Inpainting has a few parameters for controlling how the masks and images are processed.


Strength controls how much noise is added before denoising. At 0.0 no noise is added and nothing is denoised, so the image remains the same. At 1.0 the full amount of noise is added, so the image can change completely. This also affects how many steps are actually run; lower strength means fewer steps.
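A sketch of the usual img2img relationship between Strength and step count (an assumption based on the common convention, not a quote of qDiffusion's code):

```python
def effective_steps(steps: int, strength: float) -> int:
    # strength 0.0 -> no noise is added, so nothing needs denoising;
    # strength 1.0 -> full noise, so every step runs.
    return round(steps * strength)

print(effective_steps(30, 0.4))  # ~12 of the 30 steps are actually run
```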

The mask is processed with the Mask Blur and Mask Expand parameters before being used. Mask Blur applies a blur to the mask which helps avoid seams. Mask Expand will expand the mask outwards in all directions.
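A sketch of what Mask Expand followed by Mask Blur might look like using OpenCV; the library choice and kernel sizes are assumptions made for illustration:

```python
import cv2
import numpy as np

def process_mask(mask: np.ndarray, expand_px: int, blur_px: int) -> np.ndarray:
    if expand_px > 0:
        # Mask Expand: grow the mask outwards in all directions
        kernel = np.ones((2 * expand_px + 1, 2 * expand_px + 1), np.uint8)
        mask = cv2.dilate(mask, kernel)
    if blur_px > 0:
        # Mask Blur: soften the edge to help avoid visible seams
        ksize = 2 * blur_px + 1  # Gaussian kernel size must be odd
        mask = cv2.GaussianBlur(mask, (ksize, ksize), 0)
    return mask
```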

Padding affects the area of the image that's actually used, called the extent. The extent is shown on the mask with a red/green box. Red means you will be denoising at a lower resolution than the source image, making the end result blurry. The extent area is upscaled to your specified resolution before denoising.

Upscaler is the method for upscaling the image. There are few reasons to change this from Lanczos.

Mask Fill controls how the masked area is prepared before denoising. Original keeps the image as it is, Noise completely replaces the masked area with noise. Noise mode requires a Strength of 1.0 to work optimally.

Advanced Parameters

Some options are hidden by default; these are usually things the average user won't need. In the Settings tab set Advanced Parameters to Show, then the following parameters will be available in the Generate tab:

  • Highres: HR ToMe Ratio, HR Sampler, HR Scale, HR Eta
  • Misc: CFG Rescale, ToMe Ratio, Subseed, Subseed Strength
  • Operation: VAE Tiling, Precision, VAE Precision, Autocast

Troubleshooting

Black outputs. Usually due to precision issues with your hardware or the model itself. Under Operation set VAE Precision to FP32 (requires Advanced Parameters). If that doesn't work, try with Precision set to FP32. FP32 is automatically forced when generating on the CPU or on GPUs with known FP16 issues.

Noisy blob outputs. Can happen when generating with a V-prediction model in EPS prediction mode. Check your model to see if it's V-prediction; if it is, set the Prediction and CFG Rescale parameters accordingly (under Model). The YAML is not required.

Ran out of VRAM. Under Operation set VRAM to Minimal. If it still fails at the end (or in the middle for Highres), set VAE Tiling to Enabled (requires Advanced Parameters). The primary factor in VRAM usage is the resolution, so try reducing it. ControlNets and batch size also affect VRAM usage, and LoRAs affect VRAM usage when in Dynamic mode. Generating with less than 4GB of VRAM is not recommended; use Remote instead.

Models

The primary model format is safetensors. Pickled models (ckpt, pt, pth, bin, etc) are also supported but not recommended. Diffusers folders are also supported. The folder structure is flexible, supporting both the A1111 folder layout and qDiffusion's own layout:

  • Checkpoint: SD, Stable-diffusion, VAE
  • Upscaler/Super resolution: SR, ESRGAN, RealESRGAN
  • Embedding/Textual Inversion: TI, embeddings, ../embeddings
  • LoRA: LoRA
  • Hypernet: HN, hypernetworks
  • ControlNet: CN, ControlNet

VAEs need .vae. in their filename to be recognized as external, e.g. PerfectColors.vae.safetensors. Embeddings in a subfolder with "negative" in the folder name are treated as negative embeddings. Subfolders and models starting with _ are ignored.
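A hypothetical layout that follows these rules (every filename apart from PerfectColors.vae.safetensors is made up, and the exact placement is only illustrative):

```
models/
├── SD/
│   ├── SomeCheckpoint.safetensors
│   └── PerfectColors.vae.safetensors    <- external VAE (".vae." in the name)
├── LoRA/
│   └── borzoi.safetensors
├── embeddings/
│   ├── negative/                        <- treated as negative embeddings
│   │   └── some_negative.pt
│   └── some_style.pt
└── _ignored/                            <- "_" prefix: not loaded
```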

Downloading

Remote instances can download models from URLs or receive models uploaded by the client (Settings->Remote). Some sources have special support:

  • HuggingFace: Can also provide an access token in config.json ("hf_token": "TOKEN").
    • Ex. https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors
  • Civit.ai: Right click the model's download button and copy the link.
    • Ex. https://civitai.com/api/download/models/90854
  • Google Drive: They may block you, good luck.
    • Ex. https://drive.google.com/file/d/1_sK-uEEZnS5mZThQbVg-2B-dV7qmAVyJ/view?usp=sharing
  • Mega.nz: URL must include the key.
    • Ex. https://mega.nz/file/W1QxVZpL#E-B6XmqIWii3-mnzRtWlS2mQSrgm17sX20unA14fAu8
  • Other: All other URLs get downloaded with curl -OJL URL, so simple file hosts will work.

Example

A quick showcase of how qDiffusion operates with basic prompting, LoRAs, Embeddings, Inpainting and ControlNets.

example.mp4