
Build Your Own Imagen Text-to-Image Model

What's this blog post about?

MinImagen is a minimal, lightweight implementation of Imagen, the text-to-image diffusion model introduced by Google Research in 2022. It strips Imagen down to its essential components to show how a text-to-image generator works without the massive datasets and compute behind models like DALL-E 2 or Imagen itself. The model consists of two main components: a base U-Net that generates low-resolution images conditioned on text embeddings from a frozen T5 text encoder, and a super-resolution U-Net that upscales those images to a higher resolution. A key technique is classifier-free guidance: at sampling time, the diffusion model's text-conditioned and unconditional predictions are combined to amplify the influence of the caption and improve the fidelity of the generated images. Training MinImagen involves training both U-Nets on image-caption pairs; at generation time, the base U-Net's low-resolution output is passed to the super-resolution U-Net to produce the final image from a textual description. In summary, MinImagen makes the core ideas behind modern text-to-image diffusion models accessible and computationally tractable, providing a foundation for further study and experimentation.
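To make the classifier-free guidance idea concrete, here is a minimal PyTorch-style sketch of how the text-conditioned and unconditional noise predictions are typically combined at sampling time. The function and argument names are illustrative assumptions, not MinImagen's actual API.

    import torch

    def guided_noise_prediction(unet, noisy_images, timesteps, text_embeds, guidance_scale=3.0):
        """Sketch of classifier-free guidance for a text-conditioned diffusion U-Net.

        The U-Net predicts the noise twice: once conditioned on the caption's
        text embeddings and once unconditionally (embeddings zeroed out as a
        stand-in for the null caption). The two predictions are then combined
        so that the conditional signal is amplified by `guidance_scale`.
        """
        # Conditional prediction: the model sees the caption embeddings.
        cond_pred = unet(noisy_images, timesteps, text_embeds)

        # Unconditional prediction: the caption is dropped.
        uncond_pred = unet(noisy_images, timesteps, torch.zeros_like(text_embeds))

        # Guided prediction: start from the unconditional output and push
        # toward the conditional one.
        return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

A guidance scale of 1.0 recovers ordinary conditional sampling; larger values trade sample diversity for closer adherence to the caption, which is the trade-off the blog post's guidance discussion centers on.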

Company
AssemblyAI

Date published
Aug. 17, 2022

Author(s)
Ryan O'Connor

Word count
6700

Hacker News points
111

Language
English
