Top 3 Text-to-Image Generators: How DALL-E 2, GLIDE, and Imagen Stand Out

The text-to-image generator revolution is in full swing with tools such as OpenAI’s DALL-E 2 and GLIDE, as well as Google’s Imagen, which have become hugely popular – even in beta – since each of them was introduced in the past year.

These three tools are all examples of a broader trend in AI systems: text-to-image synthesis, in which generative models trained on image captions produce novel visual scenes.

Intelligent systems that can create images and videos have a wide variety of uses, from entertainment to education, with the potential to be used as accessible solutions for people with physical disabilities. Digital graphic design tools are widely used in the creation and editing of many modern cultural and artistic works. Yet their complexity can make them inaccessible to anyone without the necessary technical knowledge or infrastructure.

That’s why systems that can follow text-based instructions and then perform a corresponding image-editing task are breaking new ground when it comes to accessibility. These benefits can also be easily extended to other areas of image generation, such as gaming, animation and creating visual teaching materials.

The Rise of Text-to-Image AI Generators

AI has progressed over the past decade due to three key factors: the rise of big data, the emergence of high-performance GPUs, and the re-emergence of deep learning. Generative AI systems are helping the tech sector realize its vision of the future of ambient computing — the idea that people will one day be able to use computers intuitively without having to know any particular systems or coding.

AI text-to-image generators are now slowly transforming from generating dreamy images to producing realistic portraits. Some even speculate that AI art will overtake human creations. Many of the current text-to-image generation systems focus on learning to iteratively generate images from continuous linguistic input, just as a human artist can.

This iterative process is inspired by the way an artist gradually transforms a blank canvas into a finished scene. Systems trained to perform this task can take advantage of text-conditioned, step-by-step refinement to generate a single image.

How 3 Text-to-Image AI Tools Stand Out

AI tools that mimic human communication and creativity have always been buzzworthy. For the past four years, the major tech companies have prioritized building tools that produce images automatically.

There have been several notable releases in the past few months – a few were instant phenomena as soon as they were released, even though they were only available to a relatively small group for testing.

Let’s take a look at the technology of three of the most talked-about text-to-image generators released recently — and what sets them all apart.

OpenAI’s DALL-E 2: Diffusion creates state-of-the-art images

DALL-E 2, released in April, is OpenAI’s latest text-to-image generator and the successor to DALL-E, a generative language model that takes sentences and creates original images.

A diffusion model is at the heart of DALL-E 2, which can directly add and remove elements while taking into account shadows, reflections and textures. Current research shows that diffusion models have emerged as a promising generative modeling framework, pushing the state of the art in image and video generation tasks. To achieve the best results, the diffusion model in DALL-E 2 uses a guidance method that optimizes sample fidelity (for photorealism) at the price of some sample diversity.

DALL-E 2 learns the relationship between images and text through “diffusion,” which starts with a pattern of random dots and gradually transforms that pattern into an image as it recognizes specific aspects of it. With 3.5 billion parameters, DALL-E 2 is a large model, yet, interestingly, it is not nearly as large as GPT-3 and is smaller than its DALL-E predecessor (which had 12 billion). Despite its size, DALL-E 2 generates images at four times the resolution of DALL-E, and human judges prefer its output more than 70% of the time for both caption matching and photorealism.
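To make that idea concrete, below is a minimal, framework-free sketch of the reverse-diffusion sampling loop such models run at generation time. The `predict_noise` function is a toy stand-in for the large trained denoising network, and the noise schedule, step count and image size are illustrative assumptions rather than DALL-E 2’s actual configuration.

```python
import numpy as np

def predict_noise(noisy_image, t, text_embedding):
    """Placeholder for the trained denoising network.
    In a real system this is a large U-Net conditioned on the text."""
    return np.zeros_like(noisy_image)  # toy stand-in

def sample(shape=(64, 64, 3), steps=1000, text_embedding=None):
    # Linear beta (noise) schedule -- illustrative values only.
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from pure Gaussian noise ("a pattern of random dots").
    x = np.random.randn(*shape)

    # Walk the diffusion process backwards, removing a little noise each step.
    for t in reversed(range(steps)):
        eps = predict_noise(x, t, text_embedding)          # model's noise estimate
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / np.sqrt(alphas[t])       # denoised mean for this step
        noise = np.random.randn(*shape) if t > 0 else 0.0  # no extra noise at the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x  # a sample in image space

image = sample()
```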

Image source: OpenAI

The versatile model can go beyond sentence-to-image generation: using the robust embeddings of CLIP, an OpenAI computer vision system for relating text to images, it can create different variations of output for a given input, preserving semantic information and stylistic elements. Moreover, compared to other image representation models, CLIP embeds images and text in the same latent space, allowing for language-driven image manipulations.

While conditioning image generation on CLIP embeddings improves diversity, it comes with certain limitations. For example, unCLIP, which generates images by inverting the CLIP image encoder, is worse at binding attributes to objects than a corresponding GLIDE model. This is because the CLIP embedding itself does not explicitly bind attributes to objects, and the decoder’s reconstructions were found to often mix up attributes and objects. Still, at the higher guidance scales used to generate photorealistic images, unCLIP yields greater diversity for comparable photorealism and caption matching.
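To illustrate what a shared latent space makes possible, here is a small, hypothetical sketch. The `encode_text` and `encode_image` functions are stand-ins for CLIP’s two encoders rather than real library calls; the point is simply that once text and images live in one vector space, relevance can be scored with cosine similarity.

```python
import numpy as np

EMBED_DIM = 512  # size of the joint text-image embedding space (illustrative)

def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder (hypothetical)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def encode_image(pixels: np.ndarray) -> np.ndarray:
    """Stand-in for CLIP's image encoder (hypothetical)."""
    rng = np.random.default_rng(int(pixels.sum()) % (2**32))
    v = rng.standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: both modalities share one vector space."""
    return float(a @ b)

image = np.zeros((224, 224, 3))
captions = ["a corgi playing a trumpet", "a bowl of soup", "an astronaut riding a horse"]
img_emb = encode_image(image)
scores = {c: similarity(encode_text(c), img_emb) for c in captions}
best_caption = max(scores, key=scores.get)  # the caption scored as most relevant
```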

GLIDE by OpenAI: Realistic edits to existing images

OpenAI’s Guided Language-to-Image Diffusion for Generation and Editing, also known as GLIDE, was released in December 2021. GLIDE can automatically create photorealistic images from natural language prompts, allowing users to create visual material through easier iterative refinement and fine-grained control of the generated images.

This diffusion model delivers performance comparable to DALL-E despite using only a third of the parameters (3.5 billion compared to DALL-E’s 12 billion). GLIDE can also convert basic line drawings into photorealistic images thanks to its powerful zero-shot generation and inpainting capabilities for complex scenarios. In addition, GLIDE has low sampling latency and does not require CLIP reranking.

Most notably, the model can also inpaint images, making realistic edits to existing images in response to natural language prompts. This makes it similar in function to editors such as Adobe Photoshop, but easier to use.

Modifications made by the model match the style and lighting of the surrounding context, including convincing shadows and reflections. These models could potentially help people create compelling custom images with unprecedented speed and ease, but they could also make it significantly easier to produce convincing misinformation or deepfakes. To protect against these use cases and to support future research, the OpenAI team also released a smaller diffusion model and a noised CLIP model trained on filtered datasets.
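That editing workflow can be pictured as a mask-guided denoising loop: at each reverse-diffusion step, pixels outside the user’s mask are pinned to a noised copy of the original image, so only the masked region gets regenerated to match the prompt. The helper below is a generic sketch of that blending step, not GLIDE’s exact fine-tuned procedure, and the array shapes are illustrative.

```python
import numpy as np

def blend_known_region(x_t, original, mask, alpha_bar_t):
    """One step of mask-guided inpainting: pixels outside the mask are pinned
    to a noised copy of the original image at the current noise level, so only
    the masked region is regenerated to match the text prompt."""
    noise = np.random.randn(*original.shape)
    noised_original = np.sqrt(alpha_bar_t) * original + np.sqrt(1.0 - alpha_bar_t) * noise
    return np.where(mask, x_t, noised_original)

# Toy usage: regenerate only the top-left quadrant of a 64x64 image.
original = np.zeros((64, 64, 3))
mask = np.zeros((64, 64, 3), dtype=bool)
mask[:32, :32, :] = True
x_t = np.random.randn(64, 64, 3)  # current noisy sample, mid-denoising
x_t = blend_known_region(x_t, original, mask, alpha_bar_t=0.5)
```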

Image source: OpenAI

Imagen by Google: Understanding text-based input

Announced in May, Imagen is a text-to-image generator created by the Brain Team at Google Research. It is similar to, yet different from, DALL-E 2 and GLIDE.

Google’s Brain Team aimed to generate images with greater accuracy and fidelity by using short, descriptive sentences as input. The model analyzes each phrase as a digestible chunk of information and tries to produce an image that is as close to that phrase as possible.

Imagen builds on the prowess of large transformer language models for understanding text, while leveraging the strength of diffusion models for generating high-fidelity images. Unlike previous work that used only image-text data for model training, Google’s key finding was that text embeddings from large language models pretrained on text-only corpora (large, structured collections of text) are remarkably effective for text-to-image synthesis. Moreover, increasing the size of the language model improves both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model does.

Image source: Google

Rather than training a text encoder on image-text data, the Google team simply used an “off-the-shelf” text encoder, T5, to convert input text into embeddings. The frozen T5-XXL encoder maps the input text into a sequence of embeddings that condition a 64×64 image diffusion model, which is followed by two super-resolution diffusion models that upsample the output to 256×256 and then 1024×1024. The diffusion models are conditioned on the text embedding sequence and use classifier-free guidance, relying on new sampling techniques to apply large guidance weights without degrading sample quality.
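Here is a rough, hypothetical sketch of that cascade. Every function is a placeholder rather than a real Imagen API; the `guided_noise` helper shows the classifier-free guidance rule of extrapolating from the unconditional noise prediction toward the text-conditioned one with a guidance weight `w`.

```python
import numpy as np

def t5_encode(prompt: str) -> np.ndarray:
    """Stand-in for the frozen T5-XXL text encoder (hypothetical)."""
    return np.zeros((128, 4096))  # a sequence of token embeddings

def guided_noise(eps_cond: np.ndarray, eps_uncond: np.ndarray, w: float) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the text-conditioned one with guidance weight w.
    Each diffusion stage would apply this at every denoising step."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def base_model(text_emb: np.ndarray) -> np.ndarray:
    """Placeholder 64x64 text-to-image diffusion stage."""
    return np.zeros((64, 64, 3))

def super_res(image: np.ndarray, text_emb: np.ndarray, size: int) -> np.ndarray:
    """Placeholder text-conditioned super-resolution diffusion stage."""
    return np.zeros((size, size, 3))

def generate(prompt: str) -> np.ndarray:
    text_emb = t5_encode(prompt)                # text is encoded once by a frozen LM
    img_64 = base_model(text_emb)               # low-resolution base sample
    img_256 = super_res(img_64, text_emb, 256)  # first super-resolution stage
    return super_res(img_256, text_emb, 1024)   # final 1024x1024 image

image = generate("a photo of a corgi riding a skateboard")
```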

Imagen achieved a state-of-the-art FID score of 7.27 on the COCO dataset without ever being trained on COCO. When assessed on DrawBench against recent methods including VQ-GAN+CLIP, latent diffusion models, GLIDE and DALL-E 2, Imagen was found to outperform them in both sample quality and image-text alignment.

Future Text-to-Image Opportunities and Challenges

There is no doubt that the rapidly advancing text-to-image AI generator technology is paving the way for unprecedented possibilities for instant editing and generated creative output.

There are also many challenges ahead, ranging from questions about ethics and bias (although the creators have implemented safeguards within the models designed to restrict potentially harmful uses) to issues of copyright and ownership. The sheer amount of computing power required to train text-to-image models on massive amounts of data also limits the work to major, well-resourced players.

But there’s also no doubt that each of these three text-to-image AI models stands on its own as a way for creative professionals to let their imaginations run wild.
