How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas
Blog post from Hugging Face
Nemotron-Personas-Korea is a dataset that gives AI agents culturally and demographically accurate personas for the Korean market, addressing a limitation of models trained primarily on English data. It contains 6 million synthetic personas grounded in official statistics from multiple Korean institutions. Because it contains no personally identifiable information, it is designed to comply with Korea's Personal Information Protection Act. The dataset covers all 17 Korean provinces, provides 26 fields per persona, and spans more than 2,000 occupation categories. It was built with NVIDIA's NeMo Data Designer, which uses a Probabilistic Graphical Model for statistical grounding and Gemma-4-31B for Korean-language narrative generation. Conditioning an agent on one of these personas lets it operate within a Korean context: it can apply region-specific communication norms and professional expertise when interacting with Korean users. The dataset is part of the Nemotron-Personas Collection, which also includes data for other countries and so supports building multilingual agents. NVIDIA provides tools such as NemoClaw and NIM for deploying these agents, and the post argues that culturally grounded AI improves user trust and relevance in diverse markets.
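As a rough illustration of what grounding an agent in a persona can look like, the minimal sketch below turns one persona record into a system prompt. It is not the method from the post: the field names (`province`, `occupation`, `age`, `persona`), the function `build_system_prompt`, and the example record are all hypothetical placeholders, and the real dataset schema may differ. In practice the record would come from the dataset itself rather than being written by hand.

```python
from typing import Mapping

# Hypothetical field names chosen for illustration; the actual
# Nemotron-Personas-Korea schema may use different names.
PROMPT_FIELDS = ("province", "occupation", "age", "persona")


def build_system_prompt(record: Mapping[str, object]) -> str:
    """Render one persona record as a system prompt for a chat agent."""
    missing = [field for field in PROMPT_FIELDS if field not in record]
    if missing:
        raise KeyError(f"persona record is missing fields: {missing}")
    lines = [
        "Adopt the following persona when responding to Korean users.",
        "Follow the communication norms of the persona's region and profession.",
        f"- Region: {record['province']}",
        f"- Occupation: {record['occupation']}",
        f"- Age: {record['age']}",
        f"- Background: {record['persona']}",
    ]
    return "\n".join(lines)


# Invented placeholder record, not taken from the dataset.
example_record = {
    "province": "Busan",
    "occupation": "nurse",
    "age": 34,
    "persona": "Works night shifts at a regional hospital.",
}

system_prompt = build_system_prompt(example_record)
print(system_prompt)
```

The resulting string can be passed as the system message to whichever chat model serves the agent. Keeping the prompt builder separate from the data source makes it easy to swap the placeholder record for real dataset rows once the actual field names are known.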