
Background

Stable Diffusion is an image generation model created by Stability AI. It took 4000 GPUs and >100,000 USD to train; models trained from scratch at such great expense are called Foundation models. It was released to the public as a Checkpoint, a 2-4GB file containing all of the model's weights. The Stable Diffusion Checkpoint was then further trained by individuals on different data; this training is much smaller in scale, taking a few days to weeks on a couple of consumer-level GPUs, and is called Fine-tuning. Checkpoints can also be merged with each other in different proportions to create new Checkpoints; most Checkpoints you will find are Merges. Stability AI has released many versions of Stable Diffusion with slightly different architectures that are incompatible with each other (V1, V2, XL, etc). The most common Checkpoint architecture is the original V1.

Models

Checkpoints are actually composed of three smaller models: the UNET, the VAE, and CLIP. The UNET is the main model and actually generates the image. The VAE encodes images into a form the UNET can use, and decodes what the UNET produces back into images. CLIP interprets your prompt and produces a condensed form that guides the UNET.
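
As an illustration, the three components are visible when a V1 Checkpoint is loaded with the Hugging Face diffusers library (a sketch only, not how qDiffusion itself loads models; the model ID is just an example):

    # Minimal sketch using the diffusers library - not qDiffusion's own loader.
    # The model ID below is only an example.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

    print(type(pipe.unet))          # UNet2DConditionModel - denoises the latents
    print(type(pipe.vae))           # AutoencoderKL - encodes/decodes images to/from latents
    print(type(pipe.text_encoder))  # CLIPTextModel - turns the prompt into conditioning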

Networks are another type of model used during generation; they are smaller models designed to attach to a Checkpoint and alter its outputs. LoRAs are the modern approach, though Hypernetworks were used before them. They are usually less than 200MB and multiple can be attached at once. Networks are used by adding them to the prompt with a special syntax: <lora:some_thing>. It will be highlighted purple if it's working.

Embeddings are another kind of smaller model, which can be thought of as a condensed prompt trained for a specific purpose. They are usually tiny, a few KB in size. Embeddings are used by adding their name to the prompt; it will be highlighted orange if it's working.

Networks and Embeddings are not compatible across different Checkpoint architectures.

Denoising

Generating an image is done gradually over many steps. The image begins as random noise, and each step removes some of the noise until none remains; this process is called Denoising. An existing image can be altered by adding some noise to it and then Denoising it.
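
Conceptually the loop looks something like the following sketch (Python-style pseudocode; the scheduler and UNET calls are placeholders, not real qDiffusion or diffusers functions):

    # Conceptual sketch of Denoising - names are placeholders, not a real API.
    import torch

    def generate(unet, scheduler, conditioning, steps=20):
        latents = torch.randn(1, 4, 64, 64)                   # start from pure noise
        for t in scheduler.timesteps(steps):                   # hypothetical helper
            noise_pred = unet(latents, t, conditioning)        # predict the noise present
            latents = scheduler.step(noise_pred, t, latents)   # remove a little of it
        return latents                                         # decoded by the VAE afterwards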

Width/Height affects how fast generation is; larger resolutions take much longer and use more VRAM. Models are also trained at certain resolutions (512 for V1, 768 for V2), and generating outside those resolutions can result in nonsense. To address this, generation can be split into two stages: first a normal resolution image is generated, then it is upscaled and used as the base for a higher resolution image (often called Highres Fix).
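
A sketch of the two-stage idea (the function names here are illustrative, not qDiffusion's API):

    # Illustrative two-stage "Highres Fix" flow - function names are hypothetical.
    base  = txt2img(prompt, width=512, height=512, steps=25)   # native resolution
    big   = upscale(base, factor=2)                            # e.g. Lanczos or an SR model
    final = img2img(big, prompt, strength=0.5, steps=25)       # re-denoise at 1024x1024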

Steps affects the quality of the image and how long it takes to generate. There are diminishing returns past a certain point. 20-50 is the standard range.

Scale affects how drastic each step is: too low and the image will be bland, too high and the image will be fried. 5-20 is the standard range.
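
Under the hood, Scale is commonly implemented as classifier-free guidance, where it blends the unconditional and prompt-conditioned predictions (a sketch of the usual formula, not qDiffusion-specific code):

    # Classifier-free guidance as commonly implemented - a sketch.
    # noise_uncond: prediction with an empty prompt
    # noise_cond:   prediction with your prompt
    noise_pred = noise_uncond + scale * (noise_cond - noise_uncond)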

Prompting

Special syntax is used to manipulate prompts. The basic features are Emphasis and Scheduling.

Emphasis changes the influence text has on the output. It does this with a numeric weighting. Syntax: Emphasis (PROMPT), De-emphasis [PROMPT], Custom (PROMPT:WEIGHT).

Prompt               Weighting
(hello world)        110%
[hello world]        90%
(hello world:2.0)    200%
([hello world])      100%
(hello (world))      hello 110%, world 120%

Scheduling changes the text during generation. Syntax: Scheduling [A:B:SPECIFIER], Alternating [A|B].

Prompt               Effect
[hello:world:10]     At step 10, change from hello to world.
[:hello world:10]    At step 10, add hello world.
[hello world::10]    At step 10, remove hello world.
[hello|world]        Alternate between hello and world every step.

Escaping characters may be needed. The ()[] characters are interpreted as prompt syntax and are ultimately removed. To use them literally in a prompt you must prefix them with a \. For example, hello_(world) becomes hello_\(world\).

Inpainting

Inpainting is used to selectively denoise part of an image. Access it by adding a Mask and linking it to the image you want to edit. With no mask specified the entire image will be affected.


Controls

Inpainting has a few parameters for controlling how the masks and images are processed.


Strength controls how much noise is added before denoising. At 0.0 no noise is added and nothing is denoised, so the image remains the same. At 1.0 the full amount of noise is added, so the image can change completely. This also affects how many steps are actually done; lower strength means fewer steps.
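
In most implementations the number of steps actually run scales with strength roughly like this (an assumption about the common approach, not a guarantee of qDiffusion's exact behaviour):

    # Common img2img behaviour (assumed): only a fraction of the schedule is run.
    steps = 25
    strength = 0.6
    actual_steps = int(steps * strength)   # ~15 denoising steps in this example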

The mask is processed with the Mask Blur and Mask Expand parameters before being used. Mask Blur applies a blur to the mask, which helps avoid seams. Mask Expand grows the mask outwards in all directions.
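
The equivalent operations look roughly like this with PIL (a sketch of the general idea; qDiffusion's internal processing may differ):

    # Rough equivalents of Mask Expand and Mask Blur using PIL - illustrative only.
    from PIL import Image, ImageFilter

    mask = Image.open("mask.png").convert("L")
    expanded = mask.filter(ImageFilter.MaxFilter(9))         # expand: dilate outwards
    blurred  = expanded.filter(ImageFilter.GaussianBlur(4))  # blur: soften the edge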

Padding affects the area of the image that's actually used, called the extent. The extent is shown on the mask with a red/green box. Red means you will be denoising at a lower resolution than the source image, making the end result blurry. The extent area is upscaled to your specified resolution before denoising.

Upscaler is the method used to upscale the image. There is little reason to change this from Lanczos.

Mask Fill controls how the masked area is prepared before denoising. Original keeps the image as it is, while Noise completely replaces the masked area with noise. Noise mode requires a Strength of 1.0 to work optimally.

Models

The primary model format is safetensors. Pickled models are also supported but not recommended: ckpt, pt, pth, bin, etc. Diffusers folders are also supported. The model folder structure is flexible, supporting both the A1111 folder layout and qDiffusion's own layout:

  • Checkpoint: SD, Stable-diffusion, VAE
  • Upscaler/Super resolution: SR, ESRGAN, RealESRGAN
  • Embedding/Textual Inversion: TI, embeddings, ../embeddings
  • LoRA: LoRA
  • Hypernet: HN, hypernetworks
  • ControlNet: CN, ControlNet

VAEs need .vae. in their filename to be recognized as external, e.g. PerfectColors.vae.safetensors. Embeddings in a subfolder with "negative" in the folder name will be treated as negative embeddings. Subfolders and models starting with _ will be ignored.
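
As an illustration, a model folder using qDiffusion's layout might look like the following (the individual file names are made up):

    models/
      SD/
        AnythingV3.safetensors
        PerfectColors.vae.safetensors    # external VAE (".vae." in the name)
      LoRA/
        some_thing.safetensors
      TI/
        negative/
          bad_hands.pt                   # treated as a negative embedding
        good_style.pt
      SR/
        RealESRGAN_x4plus.pth
      _disabled/                         # ignored (starts with "_")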

Downloading

Remote instances can download models from URLs or receive models uploaded by the client (Settings->Remote). Some sources have special support:

  • Civit.ai: Right click the models download button and copy the link.
    • Ex. https://civitai.com/api/download/models/90854
  • HuggingFace: Can also provide an access token in config.json ("hf_token": "TOKEN"), see the example after this list.
    • Ex. https://huggingface.co/arenasys/demo/blob/main/AnythingV3.safetensors
  • Google Drive: They may block you, good luck.
    • Ex. https://drive.google.com/file/d/1_sK-uEEZnS5mZThQbVg-2B-dV7qmAVyJ/view?usp=sharing
  • Mega.nz: URL must include the key.
    • Ex. https://mega.nz/file/W1QxVZpL#E-B6XmqIWii3-mnzRtWlS2mQSrgm17sX20unA14fAu8
  • Other: All other URLs get downloaded with curl -OJL URL, so simple file hosts will work.
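
For the HuggingFace token mentioned above, the config.json entry would look something like this (the token value is a placeholder):

    {
        "hf_token": "hf_XXXXXXXXXXXXXXXX"
    }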

Example

A quick showcase of how qDiffusion operates with basic prompting, LoRAs, Embeddings, Inpainting, and ControlNets.

example.mp4