
Training mRNA Language Models Across 25 Species for $165

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Maziyar Panahi
Word Count: 6,915
Language: -
Hacker News Points: -
Summary

OpenMed has built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization, with a focus on mRNA language modeling across 25 species. Among the transformer architectures compared, CodonRoBERTa-large-v2 performed best at codon-level language modeling, reaching a perplexity of 4.10 and a Spearman correlation of 0.40 with the Codon Adaptation Index (CAI). The model was trained on 250,000 coding sequences in 55 GPU-hours, and the work produced a species-conditioned system that the post describes as unique among open-source projects. The pipeline combines established tools, ESMFold for structure prediction and ProteinMPNN for sequence design, with new models for codon optimization. Because the genetic code is degenerate (most amino acids are encoded by several synonymous codons), the same protein can be written as many different DNA sequences; the new models predict organism-specific codon usage patterns, reportedly more effectively than traditional methods. The result is DNA sequences optimized for a specific organism, with applications in therapeutic mRNA, vaccines, and recombinant protein production. The post emphasizes domain-specific metrics, transfer learning, and species-specific fine-tuning, and presents an open-source workflow that shortens the path from protein concept to synthesis-ready DNA.
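
The two metrics named in the summary can be made concrete with a minimal sketch. It assumes the standard textbook definitions: CAI as the geometric mean of each codon's usage relative to the most-used synonymous codon, and perplexity as the exponential of the mean negative log-likelihood per token. It is not the post's evaluation code. The codon table is a small subset of the genetic code, and the usage counts and function names are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict

# Toy subset of the standard genetic code (a real table covers all 64 codons).
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine: 4 synonymous codons
    "AAA": "K", "AAG": "K",                          # lysine: 2 synonymous codons
    "ATG": "M",                                      # methionine: 1 codon
}

def relative_adaptiveness(usage):
    """Return w = count / (max count among synonymous codons) for each codon.

    Assumes every count is positive, so that log(w) is defined later.
    """
    best = defaultdict(float)
    for codon, count in usage.items():
        aa = CODON_TO_AA[codon]
        best[aa] = max(best[aa], count)
    return {c: n / best[CODON_TO_AA[c]] for c, n in usage.items()}

def cai(seq, weights):
    """Codon Adaptation Index: geometric mean of w over the codons in seq."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to each true token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical codon counts for one organism (illustrative values only).
usage = {"GCT": 10, "GCC": 40, "GCA": 20, "GCG": 10,
         "AAA": 30, "AAG": 10, "ATG": 5}
weights = relative_adaptiveness(usage)

# Both sequences encode the same peptide (Met-Ala-Lys) with different codons.
preferred = cai("ATGGCCAAA", weights)  # uses the most frequent codon each time
rare = cai("ATGGCTAAG", weights)       # uses less frequent synonymous codons
```

Here `preferred` scores higher than `rare` even though both sequences encode the same protein, which is the degeneracy that codon optimization exploits. For scale, a model that spread its probability evenly over four candidate codons at every position would have a perplexity of 4.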