/plushcap/analysis/assemblyai/how-to-train-large-deep-learning-models-as-a-startup

How to Train Large Deep Learning Models as a Startup

What's this blog post about?

OpenAI's GPT-3 is a large deep learning model with 175 billion parameters, and training it demands enormous compute: on a single GPU it would take hundreds of years. OpenAI instead trained GPT-3 in a matter of weeks on Microsoft's high-bandwidth cluster of NVIDIA V100 GPUs. Building a comparable cluster of 1,024x NVIDIA A100 GPUs is estimated to cost almost $10 million, not including electricity and hardware maintenance. Training large models is therefore expensive and slow, which is a serious problem for startups that need to iterate quickly. AssemblyAI, a startup building large Automatic Speech Recognition (ASR) models, shares several lessons about training large models efficiently. To improve iteration speed, they recommend scaling training across more GPUs, optimizing training code to get more throughput out of each GPU, and using reduced precision during training. To cut costs, they suggest buying your own hardware or renting dedicated servers from smaller hosting providers like Cirrascale instead of relying on public clouds like AWS or Google Cloud.
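The speed recommendations above (more GPUs, higher per-GPU throughput, reduced precision) map onto standard deep learning framework features. As a rough illustration of the reduced-precision idea only, here is a minimal PyTorch sketch using automatic mixed precision (torch.cuda.amp); the model, data, and hyperparameters are placeholders, not code from the original post.

```python
import torch
import torch.nn as nn

# Placeholder model and optimizer -- stand-ins, not AssemblyAI's ASR model.
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# GradScaler rescales the loss so small fp16 gradients don't underflow to zero.
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    inputs = torch.randn(32, 512, device="cuda")          # dummy batch
    targets = torch.randint(0, 10, (32,), device="cuda")  # dummy labels

    optimizer.zero_grad(set_to_none=True)

    # autocast runs eligible ops in float16 while keeping numerically
    # sensitive ops (e.g. the loss) in float32.
    with torch.cuda.amp.autocast():
        logits = model(inputs)
        loss = loss_fn(logits, targets)

    scaler.scale(loss).backward()   # backward pass on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then optimizer step
    scaler.update()                 # adjusts the loss scale for the next step
```

Scaling to more GPUs is typically layered on top of a loop like this with torch.nn.parallel.DistributedDataParallel, which keeps the mixed-precision logic unchanged.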

Company
AssemblyAI

Date published
Oct. 7, 2021

Author(s)
Dylan Fox

Word count
2099

Hacker News points
273

Language
English


By Matt Makai. 2021-2024.