High-performance image generation using Stable Diffusion in KerasCV

Authors: fchollet, lukewood, divamgupta
Generate new images using KerasCV's StableDiffusion model.


Overview

In this guide, we will show how to generate novel images based on a text prompt using the KerasCV implementation of stability.ai's text-to-image model, Stable Diffusion.

Stable Diffusion is a powerful, open-source text-to-image generation model. While there exist multiple open-source implementations that allow you to easily create images from textual prompts, KerasCV's offers a few distinct advantages. These include XLA compilation and mixed precision support, which together achieve state-of-the-art generation speed.

In this guide, we will explore KerasCV's Stable Diffusion implementation, show how to use these powerful performance boosts, and explore the performance benefits that they offer.

To get started, let's install a few dependencies and sort out some imports:

pip install tensorflow keras_cv --upgrade --quiet

import time
import keras_cv
from tensorflow import keras
import matplotlib.pyplot as plt

Introduction

Unlike most tutorials, where we first explain a topic then show how to implement it, with text-to-image generation it is easier to show instead of tell.

Check out the power of keras_cv.models.StableDiffusion().

First, we construct a model:

model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

By using this model checkpoint, you acknowledge that its usage is subject to the terms of the CreativeML Open RAIL-M license at https://raw.githubusercontent.com/CompVis/stable-diffusion/main/LICENSE

Next, we give it a prompt:

images = model.text_to_image("photograph of an astronaut riding a horse", batch_size=3)


def plot_images(images):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        plt.imshow(images[i])
        plt.axis("off")


plot_images(images)

Downloading data from https://github.com/openai/CLIP/blob/main/clip/bpe_simple_vocab_16e6.txt.gz?raw=true
1356917/1356917 [==============================] - 0s 0us/step
Downloading data from https://huggingface.co/fchollet/stable-diffusion/resolve/main/kcv_encoder.h5
492466864/492466864 [==============================] - 9s 0us/step
Downloading data from https://huggingface.co/fchollet/stable-diffusion/resolve/main/kcv_diffusion_model.h5
3439090152/3439090152 [==============================] - 63s 0us/step
50/50 [==============================] - 126s 295ms/step
Downloading data from https://huggingface.co/fchollet/stable-diffusion/resolve/main/kcv_decoder.h5
198180272/198180272 [==============================] - 2s 0us/step

Pretty incredible!

But that's not all this model can do. Let's try a more complex prompt:

images = model.text_to_image(
    "cute magical flying dog, fantasy art, "
    "golden color, high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting, mystery, adventure",
    batch_size=3,
)
plot_images(images)

50/50 [==============================] - 15s 294ms/step

The possibilities are literally endless (or at least extend to the boundaries of Stable Diffusion's latent manifold).

Wait, how does this even work?

Unlike what you might expect at this point, StableDiffusion doesn't actually run on magic. It's a kind of "latent diffusion model". Let's dig into what that means.

You may be familiar with the idea of super-resolution: it's possible to train a deep learning model to denoise an input image -- and thereby turn it into a higher-resolution version. The deep learning model doesn't do this by magically recovering the information that's missing from the noisy, low-resolution input -- rather, the model uses its training data distribution to hallucinate the visual details that would be most likely given the input. To learn more about super-resolution, you can check out the super-resolution tutorials on Keras.io.

When you push this idea to the limit, you may start asking -- what if we just run such a model on pure noise? The model would then "denoise the noise" and start hallucinating a brand new image. By repeating the process multiple times, you can turn a small patch of noise into an increasingly clear and high-resolution artificial picture.

This is the key idea of latent diffusion, proposed in High-Resolution Image Synthesis with Latent Diffusion Models in 2021. To understand diffusion in depth, you can check the Keras.io tutorial Denoising Diffusion Implicit Models.

Now, to go from latent diffusion to a text-to-image system, you still need to add one key feature: the ability to control the generated visual contents via prompt keywords. This is done via "conditioning", a classic deep learning technique which consists of concatenating to the noise patch a vector that represents a bit of text, then training the model on a dataset of {image: caption} pairs.

This gives rise to the Stable Diffusion architecture. Stable Diffusion consists of three parts:

A text encoder, which turns your prompt into a latent vector.
A diffusion model, which repeatedly "denoises" a 64x64 latent image patch.
A decoder, which turns the final 64x64 latent patch into a higher-resolution 512x512 image.

First, your text prompt gets projected into a latent vector space by the text encoder, which is simply a pretrained, frozen language model. Then that prompt vector is concatenated to a randomly generated noise patch, which is repeatedly "denoised" by the diffusion model over a series of "steps" (the more steps you run the clearer and nicer your image will be -- the default value is 50 steps).

Finally, the 64x64 latent image is sent through the decoder to properly render it in high resolution.
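
The three-stage flow described above can be sketched in a few lines of plain Python. This is only a toy illustration of the control flow and the shapes involved, not the KerasCV implementation: every function here (toy_text_encoder, toy_denoise_step, toy_decoder, toy_text_to_image, value_spread) is a hypothetical stand-in, the "denoiser" contains no neural network, and the conditioning is simplified to pulling values toward the mean of the prompt vector instead of concatenating it to the noise patch.

```python
import random

LATENT_SIZE = 64   # side of the latent patch, from the text
IMAGE_SIZE = 512   # side of the decoded image, from the text
NUM_STEPS = 50     # default number of denoising steps, from the text


def toy_text_encoder(prompt, dim=8):
    """Stand-in for the frozen language model: prompt -> fixed-size vector."""
    rng = random.Random(prompt)  # seeding with the prompt keeps the output deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def toy_denoise_step(latent, context, strength=0.1):
    """Stand-in for one diffusion step.

    The real model predicts the noise with a neural network conditioned on
    `context`; this toy just pulls every value toward the mean of the context.
    """
    target = sum(context) / len(context)
    return [[v + strength * (target - v) for v in row] for row in latent]


def toy_decoder(latent):
    """Stand-in for the decoder: nearest-neighbour upsampling to IMAGE_SIZE."""
    scale = IMAGE_SIZE // LATENT_SIZE  # 8x in each direction
    image = []
    for row in latent:
        wide_row = [v for v in row for _ in range(scale)]
        image.extend(list(wide_row) for _ in range(scale))
    return image


def toy_text_to_image(prompt, num_steps=NUM_STEPS, seed=0):
    """Encode the prompt, denoise a random latent patch, then decode it."""
    rng = random.Random(seed)
    context = toy_text_encoder(prompt)
    latent = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_SIZE)]
              for _ in range(LATENT_SIZE)]
    for _ in range(num_steps):
        latent = toy_denoise_step(latent, context)
    return toy_decoder(latent)


def value_spread(image):
    """Max minus min pixel value: a crude measure of leftover noise."""
    flat = [v for row in image for v in row]
    return max(flat) - min(flat)


image = toy_text_to_image("photograph of an astronaut riding a horse")
print(len(image), "x", len(image[0]))
print("spread after 5 steps: ", value_spread(toy_text_to_image("demo", num_steps=5)))
print("spread after 50 steps:", value_spread(toy_text_to_image("demo", num_steps=50)))
```

Even in this toy, running more steps leaves less residual noise in the output, which mirrors the remark that more steps give a clearer image.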

All-in-all, it's a pretty simple system -- the Keras implementation fits in four files that represent fewer than 500 lines of code in total.

But this relatively simple system starts looking like magic once you train on billions of pictures and their captions. As Feynman said about the universe: "It's not complicated, it's just a lot of it!"

Perks of KerasCV

With several implementations of Stable Diffusion publicly available, why should you use keras_cv.models.StableDiffusion?

Aside from the easy-to-use API, KerasCV's Stable Diffusion model comes with some powerful advantages, including:

Graph mode execution
XLA compilation through jit_compile=True
Support for mixed precision computation

When these are combined, the KerasCV Stable Diffusion model runs significantly faster than naive implementations. This section shows how to enable all of these features, and the performance gain that results from using them.

For the purposes of comparison, we ran benchmarks comparing the runtime of the HuggingFace diffusers implementation of Stable Diffusion against the KerasCV implementation. Both implementations were tasked to generate 3 images with a step count of 50 for each image. In this benchmark, we used Tesla T4 and Tesla V100 GPUs.

All of our benchmarks are open source on GitHub, and may be re-run on Colab to reproduce the results. The results from the benchmark are displayed in the table below:

GPU	Model	Runtime
Tesla T4	KerasCV (Warm Start)	28.97s
Tesla T4	diffusers (Warm Start)	41.33s
Tesla V100	KerasCV (Warm Start)	12.45s
Tesla V100	diffusers (Warm Start)	12.72s

That is a 30% improvement in execution time on the Tesla T4! While the improvement is much lower on the V100, we generally expect the results of the benchmark to consistently favor KerasCV across all NVIDIA GPUs.
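
As a quick check of that figure, the snippet below recomputes the relative time saved from the warm-start numbers in the table. The helper name time_reduction_percent is ours, introduced only for this illustration.

```python
# Warm-start runtimes in seconds, copied from the table above.
warm_runtimes = {
    "Tesla T4": {"KerasCV": 28.97, "diffusers": 41.33},
    "Tesla V100": {"KerasCV": 12.45, "diffusers": 12.72},
}


def time_reduction_percent(baseline, optimized):
    """Percentage of the baseline runtime saved by the optimized run."""
    return 100.0 * (baseline - optimized) / baseline


reductions = {
    gpu: time_reduction_percent(times["diffusers"], times["KerasCV"])
    for gpu, times in warm_runtimes.items()
}

for gpu, saved in reductions.items():
    print(f"{gpu}: KerasCV takes {saved:.1f}% less time than diffusers")
```

This gives roughly 29.9% on the T4 and 2.1% on the V100, consistent with the summary above.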

For the sake of completeness, both cold-start and warm-start generation times are reported. Cold-start execution time includes the one-time cost of model creation and compilation, and is therefore negligible in a production environment (where you would reuse the same model instance many times). Regardless, here are the cold-start numbers:

GPU	Model	Runtime
Tesla T4	KerasCV (Cold Start)	83.47s
Tesla T4	diffusers (Cold Start)	46.27s
Tesla V100	KerasCV (Cold Start)	76.43s
Tesla V100	diffusers (Cold Start)	13.90s
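
To make the "reuse the same model instance" argument concrete, here is a rough break-even estimate based on the two tables. It rests on an assumption that is not stated in the benchmark itself: that each cold-start figure covers the first batch of images, and that every later batch costs the warm-start time. The helper names total_time and break_even_batches are hypothetical.

```python
# (cold-start seconds, warm-start seconds), copied from the two tables above.
timings = {
    "Tesla T4": {"KerasCV": (83.47, 28.97), "diffusers": (46.27, 41.33)},
    "Tesla V100": {"KerasCV": (76.43, 12.45), "diffusers": (13.90, 12.72)},
}


def total_time(cold, warm, n_batches):
    """Total seconds for n_batches, assuming one cold batch and the rest warm."""
    return cold + (n_batches - 1) * warm


def break_even_batches(fast, slow, max_batches=10_000):
    """Smallest batch count at which `fast` has the lower total time, or None."""
    for n in range(1, max_batches + 1):
        if total_time(*fast, n) < total_time(*slow, n):
            return n
    return None


break_even = {
    gpu: break_even_batches(t["KerasCV"], t["diffusers"])
    for gpu, t in timings.items()
}

for gpu, n in break_even.items():
    print(f"{gpu}: KerasCV is ahead in total time from batch {n} onward")
```

Under that assumption, the compilation cost is recovered after 5 batches on the T4, but only after 233 batches on the V100, where the warm-start gap is small.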

While the runtime results from running this guide may vary, in our testing the KerasCV implementation of Stable Diffusion is significantly faster than its PyTorch counterpart. This may be largely attributed to XLA compilation.

To get started, let's first benchmark our unoptimized model:

benchmark_result = []
start = time.time()
images = model.text_to_image(
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["Standard", end - start])
plot_images(images)

print(f"Standard model: {(end - start):.2f} seconds")
keras.backend.clear_session()  # Clear session to preserve memory.

50/50 [==============================] - 15s 294ms/step
Standard model: 15.02 seconds
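
The timing pattern used throughout this guide (run once to warm up, then time a call) can be wrapped in a small helper. The sketch below is not part of KerasCV; benchmark is a hypothetical function, and it uses time.perf_counter rather than time.time because the former is intended for measuring short durations.

```python
import time


def benchmark(fn, warmup_runs=1, timed_runs=3):
    """Call fn a few times untimed, then return (best, mean) seconds over timed runs."""
    for _ in range(warmup_runs):
        fn()  # warm-up: absorbs one-time costs such as graph tracing
    durations = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations), sum(durations) / len(durations)


# Stand-in workload so the example runs anywhere. With the model from this
# guide you could instead pass something like:
#   lambda: model.text_to_image("An avocado armchair", batch_size=3)
best, mean = benchmark(lambda: sum(range(100_000)))
print(f"best: {best:.4f}s, mean: {mean:.4f}s")
```

Reporting the best of several runs reduces the influence of unrelated background activity on the measurement.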

Mixed precision

"Mixed precision" consists of performing computation using float16 precision, while storing weights in the float32 format. This is done to take advantage of the fact that float16 operations are backed by significantly faster kernels than their float32 counterparts on modern NVIDIA GPUs.

Enabling mixed precision computation in Keras (and therefore for keras_cv.models.StableDiffusion) is as simple as calling:

keras.mixed_precision.set_global_policy("mixed_float16")

That's all. Out of the box - it just works.

model = keras_cv.models.StableDiffusion()

print("Compute dtype:", model.diffusion_model.compute_dtype)
print(
    "Variable dtype:",
    model.diffusion_model.variable_dtype,
)

By using this model checkpoint, you acknowledge that its usage is subject to the terms of the CreativeML Open RAIL-M license at https://raw.githubusercontent.com/CompVis/stable-diffusion/main/LICENSE
Compute dtype: float16
Variable dtype: float32

As you can see, the model constructed above now uses mixed precision computation, leveraging the speed of float16 operations for computation while storing variables in float32 precision.
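
To get a feel for how coarse float16 is compared to float32, the standard-library sketch below rounds values through both formats using the struct module. One commonly cited reason for keeping variables in float32, mostly relevant when training rather than for inference as in this guide, is that small changes to a value near 1.0 are lost entirely in float16. The helpers to_float16 and to_float32 are ours and only emulate rounding; they say nothing about how Keras implements its policy.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


weight = 1.0
update = 1e-4  # a small change applied to the weight

# Emulate the addition in each format by rounding inputs and result.
fp16_result = to_float16(to_float16(weight) + to_float16(update))
fp32_result = to_float32(to_float32(weight) + to_float32(update))

print("float16:", fp16_result)  # the update is rounded away
print("float32:", fp32_result)  # the update survives
```

Near 1.0, adjacent float16 values are about 0.001 apart, so a change of 0.0001 disappears, while float32 still resolves it.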

# Warm up model to run graph tracing before benchmarking.
model.text_to_image("warming up the model", batch_size=3)

start = time.time()
images = model.text_to_image(
    "a cute magical flying dog, fantasy art, "
    "golden color, high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting, mystery, adventure",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["Mixed Precision", end - start])
plot_images(images)

print(f"Mixed precision model: {(end - start):.2f} seconds")
keras.backend.clear_session()

50/50 [==============================] - 24s 229ms/step
50/50 [==============================] - 11s 229ms/step
Mixed precision model: 11.87 seconds

XLA Compilation

TensorFlow comes with the XLA: Accelerated Linear Algebra compiler built-in. keras_cv.models.StableDiffusion supports a jit_compile argument out of the box. Setting this argument to True enables XLA compilation, resulting in a significant speed-up.

Let's use this below:

# Set back to the default for benchmarking purposes.
keras.mixed_precision.set_global_policy("float32")

model = keras_cv.models.StableDiffusion(jit_compile=True)
# Before we benchmark the model, we run inference once to make sure the TensorFlow
# graph has already been traced.
images = model.text_to_image("An avocado armchair", batch_size=3)
plot_images(images)

By using this model checkpoint, you acknowledge that its usage is subject to the terms of the CreativeML Open RAIL-M license at https://raw.githubusercontent.com/CompVis/stable-diffusion/main/LICENSE
50/50 [==============================] - 71s 233ms/step

Let's benchmark our XLA model:

start = time.time()
images = model.text_to_image(
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["XLA", end - start])
plot_images(images)

print(f"With XLA: {(end - start):.2f} seconds")
keras.backend.clear_session()

50/50 [==============================] - 12s 233ms/step
With XLA: 11.84 seconds

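On an A100 GPU, we get about a 2x speedup. Fantastic! (The run recorded above shows a smaller gain, from 15.02 to 11.84 seconds, so expect the benefit to vary with your hardware.)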

Putting it all together

So, how do you assemble the world's most performant stable diffusion inference pipeline (as of September 2022)?

With these two lines of code:

keras.mixed_precision.set_global_policy("mixed_float16")
model = keras_cv.models.StableDiffusion(jit_compile=True)

By using this model checkpoint, you acknowledge that its usage is subject to the terms of the CreativeML Open RAIL-M license at https://raw.githubusercontent.com/CompVis/stable-diffusion/main/LICENSE

And to use it...

# Let's make sure to warm up the model
images = model.text_to_image(
    "Teddy bears conducting machine learning research",
    batch_size=3,
)
plot_images(images)

50/50 [==============================] - 71s 144ms/step

Exactly how fast is it? Let's find out!

start = time.time()
images = model.text_to_image(
    "A mysterious dark stranger visits the great pyramids of egypt, "
    "high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["XLA + Mixed Precision", end - start])
plot_images(images)

print(f"XLA + mixed precision: {(end - start):.2f} seconds")

50/50 [==============================] - 7s 144ms/step
XLA + mixed precision: 7.51 seconds

Let's check out the results:

print("{:<22} {:<22}".format("Model", "Runtime"))
for result in benchmark_result:
    name, runtime = result
    print("{:<22} {:<22}".format(name, runtime))

Model                  Runtime
Standard               15.015103816986084
Mixed Precision        11.867290258407593
XLA                    11.838508129119873
XLA + Mixed Precision  7.507506370544434

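On an A100 GPU, it only took our fully-optimized model four seconds to generate three novel images from a text prompt. In the run recorded above, it took about 7.5 seconds, roughly half the time of the standard model.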
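
To put the recorded runtimes in perspective, the snippet below expresses each configuration as a speedup over the standard model. The numbers are copied from the output printed above; on other hardware they will differ.

```python
# Runtimes in seconds, copied from the benchmark output above.
benchmark_result = [
    ["Standard", 15.015103816986084],
    ["Mixed Precision", 11.867290258407593],
    ["XLA", 11.838508129119873],
    ["XLA + Mixed Precision", 7.507506370544434],
]

baseline = benchmark_result[0][1]  # the unoptimized "Standard" run
speedups = {name: baseline / runtime for name, runtime in benchmark_result}

print("{:<22} {:>9} {:>8}".format("Model", "Runtime", "Speedup"))
for name, runtime in benchmark_result:
    print("{:<22} {:>8.2f}s {:>7.2f}x".format(name, runtime, speedups[name]))
```

In this run, mixed precision and XLA each give about a 1.27x speedup on their own, and combining them gives about 2x.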

Conclusions

KerasCV offers a state-of-the-art implementation of Stable Diffusion -- and through the use of XLA and mixed precision, it delivers the fastest Stable Diffusion pipeline available as of September 2022.

Normally, at the end of a keras.io tutorial we leave you with some future directions to continue learning. This time, we leave you with one idea:

Go run your own prompts through the model! It is an absolute blast!

If you have your own NVIDIA GPU, or an M1 MacBook Pro, you can also run the model locally on your machine. (Note that when running on an M1 MacBook Pro, you should not enable mixed precision, as it is not yet well supported by Apple's Metal runtime.)


