In October, OpenAI launched its newest AI image generator, DALL-E 3, into wide release for ChatGPT subscribers. DALL-E can pull off media generation tasks that would have seemed impossible just a couple of years ago, and although its unexpected creations can delight, they also bring unease to some. Science fiction predicted moments like this long ago, but watching machines carry out creative instructions feels different when it’s actually happening before our eyes.
“It’s impossible to dismiss the power of AI when it comes to image generation,” said Aurich Lawson, Ars Technica’s creative director. “With the rapid increase in visual acuity and the ability to get a usable result, there’s no question that it’s beyond a gimmick or a toy and is a legit tool.”
With the advent of AI image synthesis, it increasingly looks like the future of creative media, for many, will come with the aid of creative machines that can reproduce any art style, format, or medium. Media reality is becoming completely fluid and malleable. But how did AI image synthesis become so much more capable so quickly, and what could that mean for future artists?
Using AI to improve itself
We first covered DALL-E 3 upon its announcement from OpenAI in late September, and since then, we’ve been using it quite a bit. For those just tuning in, DALL-E 3 is an AI (neural network) model that uses a technique called latent diffusion to “grow” images out of noise, progressively, based on text provided by a user, or in this case, by ChatGPT. It works on the same basic principle as other popular image synthesis models such as Stable Diffusion and Midjourney.
You type in a description of what you want to see, and DALL-E 3 creates it.
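DALL-E 3 itself is closed, but because it works on the same latent-diffusion principle as open models, the basic loop can be sketched with the open source diffusers library and a Stable Diffusion checkpoint. The model ID and settings below are common public defaults, not anything OpenAI uses:

```python
# Illustrative only: DALL-E 3's weights aren't public, but Stable Diffusion
# uses the same latent-diffusion idea: start from random noise in a compressed
# latent space and progressively denoise it, guided by the text prompt.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # a commonly used open checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "an astronaut riding a horse, oil painting",  # the text prompt guiding denoising
    num_inference_steps=30,                       # how many denoising steps to run
    guidance_scale=7.5,                           # how strongly to follow the prompt
).images[0]

image.save("astronaut.png")
```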
ChatGPT and DALL-E 3 currently work hand in hand, making AI image generation into a conversational, interactive experience. You tell ChatGPT (via the GPT-4 large language model) what you want to generate, and it writes appropriate prompts for you and submits them to the DALL-E backend. DALL-E returns the images (usually two at a time), and you view them through the ChatGPT interface, either on the web or in the ChatGPT app.
In many cases, ChatGPT will vary the artistic medium of the results, so you may see the same subject expressed in many styles—such as photography, illustration, animation, oil painting, or portrait photography. You can also change the aspect ratio of the generated image from the default square to “wide” (16:9) or “tall” (9:16).
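Outside of ChatGPT, the same DALL-E 3 model is also reachable through OpenAI’s Images API, without the conversational prompt-rewriting layer. A minimal sketch of a direct call, with a placeholder prompt, might look like this:

```python
# A minimal sketch of calling DALL-E 3 directly through OpenAI's Images API
# (the ChatGPT integration adds automatic prompt rewriting on top of this).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="dall-e-3",
    prompt="a lighthouse on a cliff at sunset, watercolor illustration",
    size="1792x1024",  # the "wide" format; "1024x1792" is "tall", "1024x1024" is square
    n=1,               # DALL-E 3 returns one image per request
)

print(result.data[0].url)             # URL of the generated image
print(result.data[0].revised_prompt)  # the rewritten prompt the model actually used
```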
OpenAI has not disclosed the dataset used to train DALL-E 3, but if previous models are any indication, it likely includes hundreds of millions of images found online, plus images licensed from stock libraries such as Shutterstock. To learn visual concepts, the training process pairs words from descriptions of images found online (in titles, alt tags, and metadata) with the images themselves, then encodes those associations in multidimensional vector form. However, the scraped captions, written by humans, are not always detailed or accurate, which leads to noisy associations that reduce the AI model’s ability to follow a prompt.
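OpenAI hasn’t published DALL-E 3’s actual training pipeline, but the general idea of encoding a caption and its image into the same vector space can be illustrated with a publicly available model like CLIP. Consider this an analogy for the technique, not OpenAI’s code; the file name and caption are made up:

```python
# Illustrative analogy: encoding an image and its caption into the same
# multidimensional vector space, as CLIP-style models do. DALL-E 3's real
# training pipeline is not public; this only shows the general idea of
# pairing text with images numerically.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat_on_sofa.jpg")             # hypothetical training image
caption = "an orange tabby cat asleep on a sofa"  # its human-written caption

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

print(outputs.text_embeds.shape)   # e.g. torch.Size([1, 512]): the caption as a vector
print(outputs.image_embeds.shape)  # e.g. torch.Size([1, 512]): the image as a vector
```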
To get around that problem, OpenAI decided to use AI to improve itself. As detailed in the DALL-E 3 research paper, the team at OpenAI trained the new model to surpass its predecessor using synthetic (AI-written) image captions generated by GPT-4V, the visual version of GPT-4. With GPT-4V writing the captions, the team produced more accurate and detailed descriptions for the DALL-E model to learn from during training. That makes a world of difference in DALL-E 3’s prompt fidelity, in doing exactly what a written prompt says. (It’s quite handy, too.)
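OpenAI’s internal captioning setup isn’t public either, but the general recipe of asking a vision-capable GPT-4 model to write a detailed description of an image can be sketched against the public chat API. The model name, prompt, and image URL below are assumptions for illustration:

```python
# Hedged sketch: asking a vision-capable GPT-4 model to write a detailed,
# synthetic caption for a training image. OpenAI's internal captioner for
# DALL-E 3 is not public; the model name here is an assumption.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable GPT-4 variant would do
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Write a detailed, literal caption describing this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/training_image.jpg"}},
        ],
    }],
)

synthetic_caption = response.choices[0].message.content
print(synthetic_caption)  # typically far richer than scraped alt text
```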