Large Denoising Diffusion Models

This year has been extremely exciting in the world of text-to-image generation. OpenAI's DALL-E 2 made waves, closely followed by Google's Imagen. Both of these models can produce amazingly realistic images of almost anything in a variety of styles. Unfortunately, they are also completely unavailable to the general public. On the one hand, I get it. I understand that giving anyone in the world the power to draw a picture of anything is unprecedented and probably has unforeseen consequences for society. On the other hand, is giving that power to a few already powerful corporations any better? That's not what we're here to talk about today, though. Today we're talking about DALL-E mini, Boris Dayma's attempt to democratize this technology. (DALL-E mini is not a diffusion model, but we're not going to dwell on that here.)
While significantly less powerful than the big companies' models, DALL-E mini is an amazing feat, and free for anyone to use. Let's dive in.

If you're smarter than me, maybe you'll understand this 2020 paper introducing denoising diffusion models. This blog post from Hugging Face is easier to understand, but still requires some technical know-how. I'm not an image guy, I'm a text guy, so I don't fully understand how these models work. However, we don't need to. Just like you don't need to understand the nuances of the internal combustion engine to drive a car, we can 'drive' these models without fully understanding the underlying mechanism.

Prompt Engineering: How to 'Drive' the Model

As large models have become smarter, they have gained the capability to accept instructions in natural language. You can prompt GPT-3 to 'Tell me a story about two happy frogs' and it'll write a nice little piece for you.

GPT-3 interface with a short story about two happy frogs. Text: Once upon a time, there were two frogs who were the best of friends. They would do everything together, and they were always happy. They would hop around in the fields and play in the ponds. One day, they saw a beautiful butterfly and decided to follow it. They hopped and hopped, but they couldn't keep up with the butterfly. Eventually, they lost sight of it. The two frogs were so sad that they didn't want to play or hop around anymore. They just sat there, looking at the ground. Suddenly, the butterfly came back and landed on one of the frog's nose. The frog was so happy that it started to hop around again. Its friend was so happy too, and they both spent the rest of the day playing and hopping together.

That didn't take any special ability; I just had to ask. However, changing the prompt can drastically improve performance on certain tasks. For example, asking GPT-3 to 'explain your reasoning' results in much more in-depth answers.
If we ask GPT-3 "How many trolls are in the forest?" it responds:

There are an unknown number of trolls in the forest.

However, if we ask "How many trolls are in the forest? Explain your reasoning." we get:

There is no set answer to this question as the number of trolls in the forest can vary depending on the story. However, it is typically assumed that there are at least a few trolls living in the forest, as they are often depicted as inhabiting dark and dangerous places.

Prompt engineering, or learning to ask the right question, is just as important for image-generation models as it is for text-generation ones.
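
To make the comparison above concrete, here is a minimal Python sketch that builds the two prompt variants. The helper name `with_instruction` is made up for this illustration, and the sketch only builds the strings; actually sending them to a model is left out.

```python
def with_instruction(question: str, instruction: str = "Explain your reasoning.") -> str:
    """Append an extra instruction to a base question (hypothetical helper)."""
    return f"{question.strip()} {instruction}"


base = "How many trolls are in the forest?"

# The plain question and the version that asks the model to show its reasoning.
variants = [base, with_instruction(base)]

for prompt in variants:
    print(prompt)
```

Keeping the base question fixed and changing only the added instruction makes it easier to see which part of the prompt is responsible for a change in the answer.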

Testing DALL-E Mini

People have been playing around with this model for weeks, and it seems like the formula for generating a funny and successful series of images is the following:

  1. Take a well-known character
  2. Put them in an unlikely context

If I remember correctly from my Linguistics undergraduate days, this formula is successful because of its juxtaposition of pragmatic frames.
So let's do that. I decided to try and put two well-known characters in different spots.

  1. The Teletubbies in early Christian art
  2. Tony Soprano in a video game

Teletubbies in Early Christian Art

The first thing I decided to try was asking for traditional scenes from Christian art. I tried 'Teletubbies Pietà', 'Teletubbies Annunciation', 'Teletubbies Stations of the Cross', etc.
These were all largely unsuccessful. It seems like DALL-E mini can't abstract the traditional composition of these images and replace the characters with Teletubbies.

AI generated images of Teletubbies AI generated images of Teletubbies
Two unsuccessful generations

So the next thing I decided to try was putting the Teletubbies in the style of several artists famous for Christian paintings. This had about a 50% success rate. 'painted by Raphael' and 'painted by Michelangelo' didn't change the style of the images, whereas 'painted by Caravaggio' and 'painted by El Greco' did a decent job of style transfer.

AI generated images of Teletubbies AI generated images of Teletubbies
Neither of these looks like Raphael or Michelangelo

AI generated images of Teletubbies, with shading similar to Caravaggio's style AI generated images of Teletubbies, successfully style transferred to El Greco's painting style
The model captures the styles of Caravaggio and El Greco very well

At this point, it became clear that artist credits could transfer style, but not content. This makes sense. Even though most of El Greco's paintings are religious in nature, they don't all depict the same scenes. The model picks up on some of his distinct style, which is common to all of his paintings.
I decided to broaden my search to different genres of early Christian art. Some of the prompts I came up with were 'Orthodox Icon' and 'Byzantine Art', and I tried appending other words that could help cue the model into knowing what I was hoping for. Since most triptychs are works of early Christian art, I thought that 'Triptych' could help, and I tried 'Mosaic' (a popular medium) as well.

AI generated images of Orthodox icons of Teletubbies AI generated images of Orthodox icons of Teletubbies in mosaic style
Orthodox Icon was the first prompt that yielded the sorts of results I was hoping for

another series of AI generated images of Orthodox icons of Teletubbies another series of AI generated images of Orthodox icons of Teletubbies
'Byzantine Art' seems to be 'stronger' than the prompt 'Teletubbies', and we see a number of non-Teletubby characters

Given the success of the word 'Triptych', I decided to experiment with other items found almost exclusively in Christian art contexts. I looked for common Christian items that often carry images. 'Stained Glass' and 'Votive Candles' both seemed like safe bets.

AI generated images of stained glass windows featuring Teletubbies AI generated images of votive candles with Teletubbies on each candle

What did we learn from this? Artist names can help transfer style, but not content. Giving the subject a context (Orthodox Icon, Stained Glass) was much more effective at melding two distinct things. Interestingly, the context can have more strength than the characters (like with Byzantine Art), or the characters can overpower the rest of the prompt (like with The Annunciation).
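
The takeaway above can be sketched as a small prompt grid in Python: keep the subject fixed and vary either the context or the artist credit. The function name `build_prompts` and the exact word order are my own assumptions for illustration. The strings are the ones tried in this section, and nothing here calls the model itself.

```python
from itertools import chain


def build_prompts(subject, contexts, artists):
    """Combine one subject with context words and 'painted by' artist credits.

    Hypothetical helper: the word order (subject first) is an assumption.
    """
    context_prompts = (f"{subject} {context}" for context in contexts)
    artist_prompts = (f"{subject} painted by {artist}" for artist in artists)
    return list(chain(context_prompts, artist_prompts))


prompts = build_prompts(
    "Teletubbies",
    # Contexts that worked well at melding subject and setting.
    contexts=["Orthodox Icon", "Byzantine Art", "Stained Glass", "Votive Candles"],
    # Artist credits that changed the style but not the content.
    artists=["Caravaggio", "El Greco"],
)

for prompt in prompts:
    print(prompt)
```

Generating the whole grid up front makes it easy to compare results side by side and spot which part of the prompt is overpowering the other.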

Tony Soprano in Video Games

The Sopranos is widely considered one of the greatest TV shows of all time, taking the #1 spot in Rolling Stone's 2016 rankings. Its viewership was extremely high while it was on the air, and even now, more than 15 years later, The Sopranos is one of the most popular shows on HBO Max. While The Sopranos did get a spinoff video game in 2006, it was not well received. So let's try to put Tony Soprano (the titular character) in some well-known video games. I'm not too much of a gamer, so to get started I did a Google search for some of the most popular video games of all time. I haven't played most of the games on the list, so I chose games I've at least heard about so I could recognize whether the pictures capture the game.
My prompt engineering started in a very basic fashion, with 'Tony Soprano ______'. Sometimes it worked pretty well, but sometimes I needed to tack the word 'gameplay' on the end. Let's look at the results. We'll start with the earliest games on the list, Doom and Half-Life.
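
As a rough sketch of the fill-in-the-blank template described above, the Python below fills in the game name and optionally tacks 'gameplay' on the end. The function name `game_prompt` and its parameters are hypothetical, and the post does not record which games needed the suffix, so Doom is used for the retry purely as an example.

```python
def game_prompt(game: str, character: str = "Tony Soprano", add_gameplay: bool = False) -> str:
    """Fill the 'Tony Soprano ______' template, optionally appending 'gameplay'."""
    prompt = f"{character} {game}"
    if add_gameplay:
        # Fallback wording for when the bare template misses the game's look.
        prompt += " gameplay"
    return prompt


# First attempt: the bare template for the two earliest games on the list.
first_batch = [game_prompt(game) for game in ["Doom", "Half-Life"]]

# Retry with the suffix (which game needed it is an assumption here).
retry = game_prompt("Doom", add_gameplay=True)

print(first_batch)
print(retry)
```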

AI generated images of Tony Soprano as an enemy in a Doom game AI generated images of Tony Soprano as an enemy in a Half-Life game

We're off to a great start. Both of these images feature someone who looks like Tony Soprano and blocky 20-year-old graphics. The Doom shot also features the series' signature dark environments.
Next, I decided to embrace one of the biggest franchises from my childhood, Pokémon.

AI generated images of Tony Soprano as a Pokémon trainer. The AI was not very successful. Batch #2 of AI generated images of Tony Soprano as a Pokémon trainer. The AI was not very successful.
The model didn't do too well on this task, even with a little bit of prompt tweaking

From this, it seems that for the model to be successful we need to put Tony in the game, not just style him as a character from the game.
Since Fortnite does crossovers with people from outside the game all the time, I figured the model would have some idea how to translate a real-world person into the game.

AI generated images of Tony Soprano in the game Fortnite. Batch #2 of AI generated images of Tony Soprano in the game Fortnite.
While I've never played Fortnite, this seems right to me

Next up on the list are a number of AAA games. See if you can figure them out before reading the prompts.

AI generated images of Tony Soprano in the game Red Dead Redemption 2. AI generated images of Tony Soprano in the game Skyrim.

AI generated images of Tony Soprano in the game God of War. AI generated images of Tony Soprano in the game Grand Theft Auto 5

AI generated images of Tony Soprano in the game The Sims. AI generated images of Tony Soprano in the game Overwatch.

The model did pretty well on all of those. I love the period clothing Tony's wearing in Skyrim and RDR2, and Kratos Tony gave me a laugh.

Bonus Image

Thanks for making it to the end of the post. As a reward, I'll share my favorite image I made during this project.
One of my favorite Christian artists is the 15th-century Dutch painter Hieronymus Bosch. Bosch's works often feature nightmarish imagery, particularly in conjunction with themes such as temptation and hell.

Christ in Limbo by Hieronymus Bosch

Let's see what Bosch's Teletubbies would've looked like.

AI generated images of Teletubbies painted by Hieronymus Bosch

Pretty amazing.