Trying to visually understand ComfyUI

      flowchart LR
        A([Load Checkpoint]) --> C([CLIP Text Encode]) & D([VAE Decode])
        C --> E(["fa:fa-plus Positive Prompt"]) & F(["fa:fa-minus Negative Prompt"])
        A -- connects the model to the --> G([KSampler])
        G --> D
        D --> I(["fa:fa-camera-retro Image"])
        E --> G
        F --> G
        I --> M(["Preview"]) & N(["Save"])
        L([ControlNet]) --> C
        H(["Empty Latent Image"]) --> G
        click A callback "What is a Checkpoint"
        click C callback "What is CLIP Text Encode"
        click D callback "What is VAE Decode"
        click E callback "What is a Positive Prompt"
        click F callback "What is a Negative prompt"
        click G callback "What is a KSampler"
        click H callback "What is an Empty Latent image"
        click I callback "Pretty self-explanatory, but still, what is an Image in ComfyUI"
        click L callback "What is ControlNet"
      

CHECKPOINT

Base models

The base models we typically find are Stable Diffusion v1.5 or SDXL. They are trained on a wide dataset and, while they can achieve great results, they are not as good when we are looking for a specific style.

Fine-tuned models

Fine-tuned models are trained on a narrower dataset to produce results close to that dataset (e.g. anime styles or realistic portraits). Think of SD 1.5 as a painter who makes general paintings, while fine-tuned models are painters who have studied to replicate a specific style, like Van Gogh's.

Useful Links

CivitAI is a great resource for models.

MODEL

The MODEL output of the Load Checkpoint node is the diffusion model itself (the UNet): it is what the KSampler uses to progressively denoise the latent image. The same checkpoint also bundles the CLIP text encoder and the VAE, which is why the Load Checkpoint node has three outputs.
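
To see how these three outputs are wired, here is a minimal sketch of the Load Checkpoint node in ComfyUI's API (JSON) workflow format, written as a Python dict; the node id and the checkpoint filename are just placeholders:

    # Minimal sketch of a Load Checkpoint node in ComfyUI's API workflow format.
    # The node id "4" and the checkpoint filename are placeholders.
    checkpoint_loader = {
        "4": {
            "class_type": "CheckpointLoaderSimple",
            "inputs": {"ckpt_name": "v1-5-pruned-emaonly.ckpt"},
        }
    }
    # Other nodes reference its outputs as [node_id, output_index]:
    #   ["4", 0] -> MODEL (goes to the KSampler)
    #   ["4", 1] -> CLIP  (goes to the CLIP Text Encode nodes)
    #   ["4", 2] -> VAE   (goes to the VAE Decode node)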

CLIP TEXT ENCODE

The Stable Diffusion model uses ClipText, OpenAI's CLIP text encoder. Through a fairly involved process, each word (token) of the prompt is transformed into an embedding that conditions the image generation.
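
To make the "words into embeddings" step concrete, here is a minimal sketch using the Hugging Face transformers library with the text encoder used by SD 1.5 (this is roughly what happens inside the CLIP Text Encode node; the prompt is just an example):

    # Sketch: turning a prompt into CLIP text embeddings.
    from transformers import CLIPTokenizer, CLIPTextModel

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    prompt = "beautiful man, portrait, photorealistic"
    tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

    # One embedding per token position: shape (1, 77, 768) for SD 1.5's encoder.
    embeddings = text_encoder(**tokens).last_hidden_state
    print(embeddings.shape)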

How CLIP is trained

CLIP has been trained on a dataset made of images and their captions. This allows CLIP to act as both an image and a text encoder.

VAE DECODE

The VAE Decode node decodes latent-space images back into pixel-space images. The latent image coming out of the KSampler is linked to the "samples" input of the VAE Decode node, while for the "vae" input we can use the VAE from the checkpoint or load a different one.
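
In the API format that wiring looks roughly like this sketch (node ids are placeholders, assuming "3" is the KSampler and "4" the checkpoint loader from before):

    # Sketch of a VAE Decode node; node ids "3" (KSampler) and "4" (checkpoint loader) are placeholders.
    vae_decode = {
        "8": {
            "class_type": "VAEDecode",
            "inputs": {
                "samples": ["3", 0],  # the LATENT output of the KSampler
                "vae": ["4", 2],      # the VAE bundled in the checkpoint (or a separate VAE loader)
            },
        }
    }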

POSITIVE PROMPT

Here we give the positive prompt, meaning we write what we want to see in the final image. The CLIP text encoder reads these words and turns them into the conditioning that steers the sampler toward an image that fits them.

NEGATIVE PROMPT

Here we write what we do not want to see in the final image. For example, if in the positive prompt we wrote "beautiful man" but we don't want a picture of a beautiful man with glasses, here we will write "glasses".
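
Both prompts go through the same CLIP Text Encode node type; here is a rough sketch in the API format (node ids are placeholders, with "4" again being the checkpoint loader):

    # Sketch: the positive and negative prompts are just two CLIPTextEncode nodes.
    prompts = {
        "6": {  # positive prompt
            "class_type": "CLIPTextEncode",
            "inputs": {"text": "beautiful man, portrait", "clip": ["4", 1]},
        },
        "7": {  # negative prompt
            "class_type": "CLIPTextEncode",
            "inputs": {"text": "glasses", "clip": ["4", 1]},
        },
    }
    # The KSampler then receives ["6", 0] as its "positive" input and ["7", 0] as its "negative" input.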

KSAMPLER

The KSampler is, at least in my opinion, the most important part of Stable Diffusion, as this is where we decide how the model will behave while generating the image.

KSampler Options

In the KSampler node we can select many options and it is easy to get confused and lost in the process, so let's see what each of them does (a sketch of the full node follows the list):

  • seed: the number used to generate the initial random noise; the same seed with the same settings reproduces the same image
  • control_after_generate: whether the seed stays fixed, increments, decrements or is randomized after each generation
  • steps: how many denoising steps the sampler performs
  • cfg: the classifier-free guidance scale, i.e. how strongly the prompt is followed
  • sampler_name: which sampling algorithm is used (euler, dpmpp_2m, ...)
  • scheduler: how the noise is distributed across the steps (normal, karras, ...)
  • denoise: how much of the latent is re-noised before sampling; 1.0 for text-to-image, lower values for img2img

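Here is a sketch of the full KSampler node in ComfyUI's API format; the parameter values are reasonable defaults rather than recommendations, and the node ids are placeholders (note that control_after_generate is a UI-side widget, so it does not show up in the API workflow):

    # Sketch of a KSampler node; node ids refer to the placeholder nodes sketched above.
    ksampler = {
        "3": {
            "class_type": "KSampler",
            "inputs": {
                "seed": 42,
                "steps": 20,
                "cfg": 7.0,
                "sampler_name": "euler",
                "scheduler": "normal",
                "denoise": 1.0,
                "model": ["4", 0],         # MODEL from the checkpoint loader
                "positive": ["6", 0],      # conditioning from the positive prompt
                "negative": ["7", 0],      # conditioning from the negative prompt
                "latent_image": ["5", 0],  # LATENT from the Empty Latent Image node
            },
        }
    }
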
EMPTY LATENT IMAGE

Our process starts with a random image in the latent space. In ComfyUI we can decide the size of the latent image, which is proportional to the actual image in pixel space. We can set the height and the width, and decide whether we want one or more images generated by adjusting the batch size (e.g. a batch size of 3 gets us 3 generated images in one run).
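
In the API format this is one small node; 512x512 below is the usual size for SD 1.5 (SDXL works better around 1024x1024), and the node id is a placeholder:

    # Sketch of an Empty Latent Image node.
    empty_latent = {
        "5": {
            "class_type": "EmptyLatentImage",
            "inputs": {"width": 512, "height": 512, "batch_size": 1},
        }
    }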

GENERATED IMAGE

Finally, you have generated the image! In ComfyUI you can choose whether to preview or save the generated image.
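
If you prefer to trigger the generation from a script instead of the UI, here is a minimal sketch that queues a workflow on a locally running ComfyUI instance; it assumes the default server address and a workflow exported from the UI with "Save (API Format)" (the filename is a placeholder):

    # Sketch: queue an API-format workflow on a local ComfyUI server.
    import json
    import urllib.request

    with open("workflow_api.json") as f:  # exported via "Save (API Format)" in the ComfyUI UI
        workflow = json.load(f)

    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",                # default ComfyUI address and endpoint
        data=json.dumps({"prompt": workflow}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # The server queues the prompt; images from SaveImage nodes end up in ComfyUI's output folder.
    print(urllib.request.urlopen(req).read().decode("utf-8"))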

CONTROLNET

ControlNet, introduced by Zhang et al. (2023), is a family of neural networks fine-tuned on Stable Diffusion that lets us have more control over the generation of images. Thanks to ControlNet we can control Stable Diffusion through single or multiple conditions, both with and without prompts.

Why do we need ControlNet?

We can think of ControlNet as an extra layer of control over the images we want to generate, especially when we want to achieve a specific pose or structure: we can trace a shape (a pose skeleton, a sketch, an edge map) that will guide the generation of the image.
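
In ComfyUI this usually means adding a couple of nodes between the prompt and the KSampler: one that loads the ControlNet model and one that applies it to the conditioning. Here is a rough sketch in the API format (node ids, the model filename and the reference image are placeholders; some workflows use the Advanced apply node instead):

    # Sketch: loading a ControlNet and applying it to the positive conditioning.
    controlnet_nodes = {
        "10": {
            "class_type": "ControlNetLoader",
            "inputs": {"control_net_name": "control_v11p_sd15_openpose.pth"},  # placeholder filename
        },
        "11": {
            "class_type": "LoadImage",
            "inputs": {"image": "pose_reference.png"},  # the traced pose / sketch / edge map
        },
        "12": {
            "class_type": "ControlNetApply",
            "inputs": {
                "conditioning": ["6", 0],  # positive conditioning from CLIP Text Encode
                "control_net": ["10", 0],
                "image": ["11", 0],
                "strength": 1.0,
            },
        },
    }
    # The KSampler's "positive" input then takes ["12", 0] instead of ["6", 0].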

ControlNet models I like and wanted to mention

  • OpenPose
  • Sketch
  • Canny

Useful resources

ComfyUI

Great resource that explains Stable Diffusion visually

Paper about the introduction of Stable Diffusion XL

Nice blog article about ControlNet

A must-see for getting inspired by many ComfyUI workflows, like this one where you can create AI pixel art with an accurate pixel count

To build this website

Where I took the icons you are seeing

How I built the flowchart

For the fonts of the website