Diffusion Models

Diffusion models are the state-of-the-art when it comes to AI generated images. Open-source models like Latent Diffusion and Stable Diffusion that can generate convincing images of anything are available to the public, making AI art more accessible than ever. For an explaination of how they work, check out my previous post in which I trained my own diffusion model from scratch.

That was great for training from scratch, but just like we've fine-tuned language models in some of my previous posts, we can leverage someone else's pre-training by fine-tuning Stable Diffusion. This model's pre-training means that instead of semi-abstract images, we should be able to get pretty realistic, grounded images in several styles.

Finetuning with Dreambooth

Finetuning a model allows us to take a model that's been trained on a generic task and use its general knowledge on a specific task. In the case of image generation, we start with Stable Diffusion, a model that gernerates high-quality images given a text input. Finetuning models can still be an involved process, requiring hundreds to thousands of examples to teach the model what to do.

Enter Dreambooth. This technique, developed at Google Research allows us to fine-tune image generation models using as few as 5 reference images. Computer Vision is not my area of expertise, but I'll try and give a brief explaination.
Dreambooth works by binding the pictures of the subject (in this case, me) to a unique identifier. This identifier can then be used to synthesize new pictures of me in new contexts, so long as I include the identifier in the prompt.

Figure from the dreambooth paper
Sample from the Dreambooth Paper

Prior Preservation

In order to teach the model the 'gist' of what we're going for, we first have it generate about 200 images of something like what we're trying to teach it. In this case, I generated pictures of 'a guy in his 20s', which is what I am.
This helps the model to understand that I am 'a guy in his 20s' and tweak its understanding of 'a guy in his 20s' to me.

Training the model is straightforward. There's a nice example script provided by HuggingFace that you can tweak and use for yourself. I grabbed 8ish pictures of myself and went to town. The model took under an hour to train on an A100 GPU.

Photo of me in a tan suit eating a corn dog photo of me smiling selfie of me with a big mac Photo of me, well dressed, in an upscale food court
Half of the training images of me

A major part of generating good images is prompt engineering. (For more on that, see my previous post on prompt engineering.) Essentially, in order to get the best output from the model you need to be very exact in your query. Interestingly, just asking for a better image usually works. Tacking on phrases like 'very detailed' or 'contest winner' will make your image better. Prompt engineering is a whole subfield with plenty of guides out there.


AI generated image of me in a film noir AI generated image of me in Skyrim AI generated image of me as a child AI generated image of me, Vietnam War

Right off the bat, we can see some great results. The model learned what I look like, and can put me in tons of cool contexts.

AI generated image of me in Star Trek AI generated image of me, Axis Propaganda Poster AI generated image of me in Blade Runner AI generated image of me in McDonalds 1965

It can change my age, clothing, and even animation style.


But it's not perfect. I noticed that about 25% of the time the model would make me look vaguely to moderately Asian, and sometimes give me a beard.

AI generated image of me, vaguely Asian AI generated image of me with a beard AI generated image of me, vaguely Asian AI generated image of me with a beard
Some of the bad outputs

After some investigation I realized the problem was with my prior preservation images. It turns out the prompt 'photo of a guy in his 20s' has a lot of bearded men and a lot of men of east Asian descent.


If the problem is as simple as bad prior preservation images, it shouldn't be hard to solve. I tried again with the prompt 'photo of a white guy in his 20s, clean-shaven'. That solved the issues with the beard and the race-shifting, but then a new issue cropped up. Now I wasn't wearing glasses. The prompt 'photo of a white guy in his 20s, clean-shaven, glasses' solved that.

With good prior-preservation images and a few tweaks with training hyperparameters, I finally have a model that does exceptionally well at generating pictures of me. Every so often my glasses come off or the image comes out wonky, but with a few tries I can be anything.

AI generated image of me in space AI generated image of me, pencil drawing AI generated image of me, oil painting from behind AI generated image of me early photography
AI generated image of me line drawing AI generated image of me as a muppet AI generated image of me, Studio Ghibli style AI generated image of me by Norman Rockwell

Final Thoughts

Combining Dreambooth and Stable Diffusion was very simple and very cool. It's exciting to see my AI image progress. In the course of six months, I went from playing with DALL-E mini to training my own diffusion model from scratch, and now fine-tuning for model personalization.
I can't wait to see what's next.