In this article, the authors describe their ongoing project to build a text-to-image foundation model from scratch, focusing on the architectural choices that underpin the model's design. They compare several transformer-based architectures, including DiT, MMDiT, DiT-Air, UViT, and their own custom design, PRX, evaluating each in terms of efficiency, scalability, and alignment with text prompts. PRX emerges as a promising option that balances speed, memory efficiency, and generative quality, and it is paired with a modern text encoder, T5Gemma, which adds multilingual capability while reducing computational demands. The authors also examine latent-space representations and autoencoders such as FluxVAE and Deep-Compression Autoencoders to lower training cost. The project is open source, inviting community engagement through platforms like Hugging Face and Discord as the authors continue to refine their models and prepare for larger-scale training runs.
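To make the overall pipeline concrete, the minimal sketch below wires together the three ingredients the article discusses: an autoencoder that maps images into a compact latent space, a text embedding used as conditioning, and a small DiT-style transformer trained to denoise latents. Every class name, dimension, and the flow-matching-style loss here is an illustrative assumption, not the authors' PRX implementation or the actual FluxVAE/T5Gemma components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyAutoencoder(nn.Module):
    """Stand-in for a latent-space autoencoder (e.g. a VAE): maps images
    to a smaller latent grid so the diffusion backbone sees far fewer tokens."""
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.encoder = nn.Conv2d(3, latent_channels, kernel_size=8, stride=8)
        self.decoder = nn.ConvTranspose2d(latent_channels, 3, kernel_size=8, stride=8)

    def encode(self, images: torch.Tensor) -> torch.Tensor:
        return self.encoder(images)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents)


class ToyDiTDenoiser(nn.Module):
    """Stand-in for a DiT-style transformer that denoises latent patches,
    conditioned on a pooled text embedding and the timestep."""
    def __init__(self, latent_channels: int = 4, dim: int = 256,
                 depth: int = 4, text_dim: int = 256):
        super().__init__()
        self.patch_embed = nn.Linear(latent_channels, dim)
        self.text_proj = nn.Linear(text_dim, dim)
        self.time_proj = nn.Linear(1, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, latent_channels)

    def forward(self, noisy_latents, timesteps, text_embedding):
        b, c, h, w = noisy_latents.shape
        tokens = noisy_latents.flatten(2).transpose(1, 2)   # (b, h*w, c)
        x = self.patch_embed(tokens)
        cond = self.text_proj(text_embedding) + self.time_proj(timesteps[:, None])
        x = x + cond[:, None, :]                            # broadcast conditioning over tokens
        x = self.blocks(x)
        pred = self.out(x).transpose(1, 2).reshape(b, c, h, w)
        return pred


# One flow-matching-style training step on random data, purely for illustration.
autoencoder = ToyAutoencoder()
denoiser = ToyDiTDenoiser()
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

images = torch.randn(2, 3, 64, 64)        # pretend batch of RGB images
text_embedding = torch.randn(2, 256)      # pretend pooled text-encoder output

with torch.no_grad():
    latents = autoencoder.encode(images)  # train in latent space, not pixel space

noise = torch.randn_like(latents)
t = torch.rand(latents.shape[0])          # uniform timestep in [0, 1]
noisy = (1 - t)[:, None, None, None] * latents + t[:, None, None, None] * noise

pred = denoiser(noisy, t, text_embedding)
loss = F.mse_loss(pred, noise - latents)  # regress the interpolation velocity
loss.backward()
optimizer.step()
print(f"toy training loss: {loss.item():.4f}")
```

The point of the sketch is the division of labor: the autoencoder shrinks a 64x64 image to an 8x8 latent grid, so the transformer attends over 64 tokens instead of thousands of pixels, which is the main reason latent-space training cuts compute and memory.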