```mermaid
flowchart LR
    A([Load Checkpoint]) --> C([CLIP Text Encode]) & D([VAE Decode])
    C --> E(["fa:fa-plus Positive Prompt"]) & F(["fa:fa-minus Negative Prompt"])
    A -- connect the model to the --> G([KSampler])
    G --> D
    D --> I(["fa:fa-camera-retro Image"])
    E --> G
    F --> G
    I --> M(["Preview"]) & N(["Save"])
    L([ControlNet]) --> C
    H(["Empty Latent Image"]) --> G
    click A callback "What is a Checkpoint"
    click C callback "What is CLIP Text Encode"
    click D callback "What is VAE Decode"
    click E callback "What is a Positive Prompt"
    click F callback "What is a Negative Prompt"
    click G callback "What is a KSampler"
    click H callback "What is an Empty Latent Image"
    click I callback "Pretty self-explanatory, but still, what is an Image in ComfyUI"
    click L callback "What is ControlNet"
```
The base models we usually find are Stable Diffusion v1.5 or SDXL. They are trained on broad datasets and, while they can achieve great results, they are not so good when we are looking for a specific style.
Fine-tuned models are trained on a narrower dataset to produce results similar to that dataset (e.g. anime styles or realistic portraits). Think of SD 1.5 as a painter who makes general paintings; fine-tuned models are painters who have studied to replicate a specific style, like Van Gogh's.
CivitAI is a great resource for models.
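To make the idea concrete, here is a minimal sketch of loading a base checkpoint and a fine-tuned one outside of ComfyUI, using the Hugging Face diffusers library (this is not what ComfyUI runs internally; the repository id and file name are just examples):

```python
from diffusers import StableDiffusionPipeline

# Base model: a general-purpose Stable Diffusion v1.5 checkpoint
pipe = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5")

# Fine-tuned model: a single-file checkpoint downloaded from CivitAI
# (the file name is hypothetical: use whatever checkpoint you downloaded)
pipe_anime = StableDiffusionPipeline.from_single_file("anime_style_checkpoint.safetensors")

image = pipe_anime("portrait of a samurai, anime style").images[0]
image.save("samurai.png")
```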
Here is how the CLIP Text Encode step works:
The Stable Diffusion model uses ClipText, OpenAI's language model, which transforms each word of the prompt into embeddings through a fairly complex process.
CLIP has been trained on a dataset made of images and their captions. This allows CLIP to work as both an image and a text encoder.
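A minimal sketch of what the text encoder does, using the transformers library and the CLIP model that SD 1.5 is built on (this illustrates the idea rather than ComfyUI's internal code):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# SD 1.5 uses OpenAI's CLIP ViT-L/14 text encoder
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Prompts are padded/truncated to 77 tokens before encoding
tokens = tokenizer(
    "a beautiful man, photorealistic",
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
)

with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]) -> one 768-dim vector per token
```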
The VAE Decode node decodes latent-space images back into pixel-space images. The latent image coming from the KSampler is linked to the "samples" input of the VAE Decode node. We can then use the VAE baked into the checkpoint or load a different one.
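A minimal sketch of the decoding step with diffusers, using a standalone VAE (the stabilityai/sd-vae-ft-mse repository is just one commonly used example, the equivalent of loading a different VAE in ComfyUI):

```python
import torch
from diffusers import AutoencoderKL

# Load a standalone VAE instead of the one baked into the checkpoint
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# Pretend this latent came out of the sampler: 4 channels, 64x64 for a 512x512 image
latent = torch.randn(1, 4, 64, 64)

with torch.no_grad():
    # Latents are scaled during sampling, so we undo the scaling before decoding
    pixels = vae.decode(latent / vae.config.scaling_factor).sample

print(pixels.shape)  # torch.Size([1, 3, 512, 512]) -> an RGB image in pixel space
```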
Here we give the positive prompt, meaning we write what we want to see in the final image. The CLIP text encoder reads the words and turns them into a conditioning signal that guides the sampler toward an image that fits them.
Here we write what we do not want to see in the final image. For example, if in the positive prompt we wrote "beautiful man" but we don't want a picture of a beautiful man with glasses, here we will write "glasses".
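Translated to a diffusers call (reusing the pipe object from the checkpoint sketch above), the two prompts map to the prompt and negative_prompt arguments:

```python
# Positive prompt: what we want; negative prompt: what we want to avoid
image = pipe(
    prompt="portrait of a beautiful man, photorealistic, studio lighting",
    negative_prompt="glasses, blurry, low quality",
).images[0]
image.save("beautiful_man.png")
```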
The KSampler is the most important part of Stable Diffusion, at least in my opinion, as here we decide how the model behaves while generating the image.
In the KSampler node we can set many options (seed, steps, CFG, sampler, scheduler, denoise), and it is easy to get confused and lost in the process; let's see what we can do.
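As a rough mapping (again reusing the pipe object from above, not ComfyUI's internal code), the KSampler's main knobs correspond to these diffusers parameters, with the scheduler class standing in for the sampler/scheduler choice:

```python
import torch
from diffusers import EulerAncestralDiscreteScheduler

# Sampler/scheduler: swap the noise scheduler (roughly "euler_ancestral" in ComfyUI)
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

# Seed: fixing it makes the generation reproducible
generator = torch.Generator().manual_seed(42)

image = pipe(
    prompt="portrait of a beautiful man, photorealistic",
    negative_prompt="glasses, blurry",
    num_inference_steps=20,   # steps: how many denoising iterations
    guidance_scale=7.0,       # CFG: how strongly the prompt steers the image
    generator=generator,
).images[0]
# ("denoise" has no direct equivalent here; it corresponds to "strength" in img2img pipelines)
```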
Our process starts with a random image in the latent space. In ComfyUI we can decide the size of the latent image, which is proportional to the actual image in pixel space. We can set the height, the width, and how many images are generated per run by adjusting the batch size (e.g. a batch size of 3 gives us 3 generated images in one run).
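As a sketch of what that node produces (ComfyUI's Empty Latent Image starts from zeros; the 8x downscaling factor is a property of the SD VAE):

```python
import torch

batch_size, height, width = 3, 512, 512

# SD latents have 4 channels and are 8x smaller than the final pixel image
latent = torch.zeros(batch_size, 4, height // 8, width // 8)
print(latent.shape)  # torch.Size([3, 4, 64, 64]) -> three 512x512 images after decoding
```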
Finally, you have generated the image! In ComfyUI you can choose between previewing or saving the generated image.
ControlNet is a family of neural networks built on top of Stable Diffusion, introduced by Zhang et al. (2023), that lets us have more control over the generation of images. Thanks to ControlNet we can condition Stable Diffusion on single or multiple extra inputs, both with and without prompts.
We can think of ControlNet as an extra layer of control over the images we want to generate, especially when we want a specific pose or structure: we trace a shape (an edge map, a depth map, a pose skeleton, etc.) that will guide the generation of the image.
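A minimal sketch with diffusers, conditioning the generation on a Canny edge map (the model ids and the reference file name are examples, not the only options):

```python
import cv2
import numpy as np
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# A ControlNet trained to follow Canny edge maps, paired with an SD 1.5 base model
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet
)

# Turn a reference photo (hypothetical file) into the edge map that guides the composition
reference = np.array(Image.open("pose_reference.png").convert("RGB"))
edges = cv2.Canny(reference, 100, 200)
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe(
    prompt="a dancer on a neon-lit stage",
    image=edges,                 # the conditioning image
    num_inference_steps=20,
).images[0]
image.save("dancer.png")
```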
Great resource that explains Stable Diffusion visually
The paper introducing Stable Diffusion XL.
Nice blog article about ControlNet
A must-see for getting inspired by many ComfyUI workflows, like this one where you can create AI pixel art with an accurate pixel count.