How our team uses Stable Diffusion to create highly realistic images.
Hi, my name is Marko, and I work as the Machine Learning Lead at Linearity.
As an ML team, our goal is to empower design professionals with the most powerful tools, helping them share their stories with the world.
AI can dramatically remove barriers to creativity and design execution. We experienced this first-hand after the launch of our Auto Trace and Background Removal tools. And now, our team is taking another innovative step forward — unlocking the limitless potential of Generative AI.
Jumpstart your ideas with Linearity Curve
Take your designs to the next level.
Our process and how diffusion models work
AI image generation has been around for a long time, but only recently has it achieved high enough quality for the design community to adopt it and start making amazing things. You could witness it firsthand and follow the progress on social media, but let’s review the key events that got us here today:
- Diffusion models were first introduced in the 2015 paper “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.”
- OpenAI figured out how to connect the dots between text and pixels and released CLIP and DALL-E in 2021.
- The public release of Stable Diffusion accelerated the space and allowed thousands of developers around the world to build on top of it.
Without going too far down the rabbit hole, let's quickly recap how latent diffusion models work.
To reduce memory requirements and ensure faster processing, the model works in latent space instead of pixel space: a VAE encoder compresses every image of shape 512x512x3 into a latent representation (latents) of shape 64x64x4. During inference, the initial input latents are either sampled from Gaussian noise (text-to-image) or produced by encoding an input image (image-to-image).
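As a quick sanity check on those shapes, here is the compression factor implied by moving from pixel space to latent space (plain Python arithmetic, nothing model-specific):

```python
# Shapes from the article: pixel space vs. latent space.
pixel_shape = (512, 512, 3)
latent_shape = (64, 64, 4)

def num_values(shape):
    """Total number of values in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n

compression = num_values(pixel_shape) / num_values(latent_shape)
print(compression)  # 48.0 — the diffusion process runs on 48x less data
```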
The process starts with noisy latents, and the model is trained to reduce the noise incrementally, step by step. As the model filters out the noise, it begins to recognize the patterns and structures that are common across images. Eventually, it can generate a new image similar to those it was trained on. Guidance comes from CLIP, which provides a latent representation of the input text prompt that serves as context for the diffusion process.
Depending on the input noise, you get different outputs. Therefore, you can generate nearly infinite variations of images from a single prompt. This is usually controlled by the random seed.
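A minimal sketch of that idea, with NumPy standing in for the real sampler: the seed fully determines the starting Gaussian noise, so the same seed reproduces the same output while a new seed yields a new variation:

```python
import numpy as np

def initial_latents(seed, shape=(4, 64, 64)):
    """Gaussian noise that seeds the text-to-image diffusion process."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

a = initial_latents(42)
b = initial_latents(42)
c = initial_latents(7)

print(np.allclose(a, b))  # True  — same seed, same starting noise
print(np.allclose(a, c))  # False — different seed, different image
```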
At the end of the diffusion process, the VAE decoder converts the denoised latents back into a full-resolution image.
How diffusion is used for image editing
The Stable Diffusion Inpainting model followed the release of Stable Diffusion. In computer graphics, inpainting is the process of restoring small damaged areas of an image using information from nearby regions.
Diffusion models take inpainting to a whole new level. The process is no longer limited to restoring small parts of an image; the model can generate something entirely new in their place.
We can now use the capabilities of a text-to-image generative model while simultaneously capturing the context of the input image. The model matches colors, light, shadows, and textures so that the generated parts blend in naturally, no matter what you decide to put there.
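A toy illustration of the masking idea (not the actual diffusion pipeline): a binary mask decides which pixels are taken from the generated content and which are kept from the original image:

```python
import numpy as np

def composite(original, generated, mask):
    """Toy sketch: the mask (1 = inpaint) selects generated pixels,
    everything else is kept from the original image."""
    mask = mask[..., None]  # broadcast over the color channels
    return mask * generated + (1 - mask) * original

original = np.zeros((4, 4, 3))   # stand-in for the input image
generated = np.ones((4, 4, 3))   # stand-in for the generated content
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0             # inpaint the central 2x2 region

result = composite(original, generated, mask)
print(result[2, 2, 0], result[0, 0, 0])  # 1.0 0.0
```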
Inpainting for background generation
What if, instead of inpainting small areas, we tried to generate an entirely new backdrop or surroundings for the object?
Let’s start with a cut-out of a bottle of Chanel perfume and try to replace the transparent area with a simple background using the following prompt:
As we can observe, Stable Diffusion successfully generates a plausible background and accurately positions a stone pedestal to hold the bottle. However, it also distorts the input object and introduces additional elements along its edges.
Enhancing object preservation with ControlNet
To solve the problem of object distortion and preserve its identity, we add extra guidance via ControlNet, a neural network trained to control diffusion models by adding extra conditions. The idea behind this technique is to freeze the large Stable Diffusion model and train an additional control network whose outputs are added as residuals to the features of the diffusion U-Net.
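A toy sketch of that structure (the functions below are simple stand-ins, not the real networks): the frozen base block's output is combined with residuals produced by the trainable control branch from the conditioning signal:

```python
import numpy as np

def frozen_block(x):
    """Stand-in for a frozen Stable Diffusion U-Net block."""
    return x * 2.0

def control_branch(condition):
    """Stand-in for the trainable ControlNet block."""
    return condition * 0.5

x = np.ones((4,))                 # stand-in for U-Net features
condition = np.full((4,), 0.2)    # stand-in for an encoded control image

# Only control_branch is trained; frozen_block never changes.
out = frozen_block(x) + control_branch(condition)
print(out.tolist())  # [2.1, 2.1, 2.1, 2.1]
```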
We observed that using a normal map, cropped by the object mask, as a condition in the diffusion process lets us augment the inpainting model to accurately preserve the edges of the object and better understand its appearance in the scene. Because we provide surface normals only within the bounds of the target object, we can generate diverse and sophisticated backgrounds along with realistic shadows and reflections, and even simulate the refraction of light.
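A minimal sketch of how such a conditioning image could be prepared (a NumPy stand-in; the actual preprocessing in our pipeline may differ): the surface normals are kept only inside the object mask and zeroed everywhere else:

```python
import numpy as np

def masked_normal_map(normal_map, object_mask):
    """Keep surface normals only inside the object mask; zero elsewhere.
    The result plays the role of the ControlNet conditioning image."""
    return normal_map * object_mask[..., None]

# Stand-in data: random normals and a square object mask.
normal_map = np.random.default_rng(0).uniform(-1, 1, (64, 64, 3))
object_mask = np.zeros((64, 64))
object_mask[16:48, 16:48] = 1.0

cond = masked_normal_map(normal_map, object_mask)
print(np.count_nonzero(cond[object_mask == 0]))  # 0 — background carries no normals
```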
In this example, we adapt the ControlNet training script from the diffusers library to the inpainting pipeline. When training the control network, we feed it the noisy latents, the control image (the normalized normal map), and the timesteps in a single forward pass.
```python
down_block_res_samples, mid_block_res_sample = controlnet(
    noisy_latents,
    timesteps,
    encoder_hidden_states=encoder_hidden_states,
    controlnet_cond=controlnet_image,
    return_dict=False,
)
```
Subsequently, we add the acquired residuals to the mid and down blocks of the original Stable Diffusion U-Net model while keeping all other inputs the same as in the original inpainting pipeline.
```python
latent_model_input = torch.cat(
    [noisy_latents, mask, masked_latents], dim=1
).to(dtype=weight_dtype)

model_pred = unet(
    latent_model_input,
    timesteps,
    encoder_hidden_states=encoder_hidden_states,
    down_block_additional_residuals=[
        sample.to(dtype=weight_dtype) for sample in down_block_res_samples
    ],
    mid_block_additional_residual=mid_block_res_sample.to(dtype=weight_dtype),
).sample
```
After training this model for 240 hours on an A6000 Ada GPU, we are able to generate a virtually unlimited number of high-quality backgrounds in different styles in a matter of seconds, without any distortion of the target object. At this point, the whole process is bounded only by your imagination.
Linearity Curve’s Generative AI
In 2024, with a Linearity Curve Pro or Organization subscription, users will gain access to the remarkable world of Generative AI. Professional designers and creatives will be empowered with the capability to effortlessly craft intricate, unique, and high-quality visuals. Create backgrounds, intricate patterns, and even complex visual effects, all while maintaining full control over the creative process. With Generative AI at your fingertips, Linearity Curve users can take their creative projects to new heights.
Linearity stands with Ukraine
Many of our colleagues, including myself, are from Ukraine. It’s been 18 months since Russia launched a full-scale invasion, and the fight for our freedom and peace continues. Here, you can learn about ways to help, and find organizations that provide relief and support to the Ukrainian people. Дякую — Thank you! 🇺🇦
- Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, & Björn Ommer. (2022). High-Resolution Image Synthesis with Latent Diffusion Models.
- Lvmin Zhang, & Maneesh Agrawala. (2023). Adding Conditional Control to Text-to-Image Diffusion Models.
- Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, & Thomas Wolf. (2022). Diffusers: State-of-the-art diffusion models.
Marko is the Machine Learning Lead for Linearity in Berlin. He is passionate about finding new ways to equip graphic designers with machine-learning technologies.