Building Tucano 2: Open-Source Language Models That Actually Think in Portuguese
Blog post from Hugging Face
Tucano 2 is a family of open-source language models built specifically for Portuguese. It targets two shortcomings of existing multilingual models: limited transparency and little optimization for the language. Developed with a focus on openness and collaboration, the models range from 0.5 billion to 3.7 billion parameters and outperform prior Portuguese models of similar size. Development involved building GigaVerbo-v2, a large, high-quality Portuguese corpus, and a custom tokenizer optimized for Portuguese, which significantly reduces computational costs. The models were trained on a blend of educational and synthetic data and evaluated with a new two-tier suite designed to provide reliable benchmarks for Portuguese. The project is also transparent about energy consumption and environmental costs, reporting both the carbon emissions and the material footprint associated with GPU usage. All datasets, models, and tools are released under permissive licenses, inviting further research and development in Portuguese natural language processing.
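The summary does not say how the tokenizer's cost savings were measured. One common proxy is fertility, the average number of tokens produced per word: fewer tokens for the same text means fewer positions to process in training and inference. The sketch below illustrates that idea only. The function names and both toy tokenizers are hypothetical stand-ins, not the Tucano 2 tokenizer or its evaluation code.

```python
from typing import Callable, List


def fertility(tokenize: Callable[[str], List[str]], texts: List[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def two_char_tokenizer(text: str) -> List[str]:
    """Toy stand-in for a poorly fitted vocabulary: 2-character chunks per word."""
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


def whole_word_tokenizer(text: str) -> List[str]:
    """Toy stand-in for a well-fitted vocabulary: one token per word."""
    return text.split()


sample = ["o tucano voa"]
print(fertility(two_char_tokenizer, sample))    # 2.0 (6 tokens / 3 words)
print(fertility(whole_word_tokenizer, sample))  # 1.0 (3 tokens / 3 words)
```

To compare real tokenizers, the same `fertility` function could be given any callable that maps a string to a list of tokens, evaluated over a representative sample of Portuguese text.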