Overcoming the "Data Scarcity" Barrier: Synthetic Personas Accelerate AI Development in Japan

Blog post from HuggingFace

Post Details
Company
Date Published
Author
Atsunori Fujita, Masaya Ogushi, Will Jennings, Yev Meyer, Kotaro Yamamoto, Yoshi Suhara, Vincent Gong, and Dane Corneil
Word Count
280
Summary

A new study by NTT DATA shows how synthetic personas can help AI developers in Japan overcome data scarcity, particularly when building systems that must understand Japanese language and culture. The shortage of task-specific, culturally relevant data has held back model development. To address it, NTT DATA used the open-source NeMo Data Designer to create the Nemotron-Personas-Japan dataset, which consists of six million synthetic personas grounded in Japanese demographics, geography, and culture. Using this dataset raised model accuracy on legal Q&A tasks from 15.3% to 79.3% without exposing sensitive data. The approach shows that high-quality AI models can be built from minimal proprietary data by combining open-source infrastructure with synthetic data, and it eases privacy concerns because the personas contain no personally identifiable information. The study further suggests that synthetic data can reduce computational costs and shorten development cycles, giving developers in domains with limited access to proprietary data a practical path forward and supporting innovation aligned with Japan's AI governance vision.
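To put the reported legal Q&A result in perspective, the short sketch below works out the size of the improvement from the two figures quoted in the summary (15.3% and 79.3%). The function name `summarize_gain` and its output keys are hypothetical, introduced only for this illustration; nothing here comes from the study's own code.

```python
def summarize_gain(before_pct: float, after_pct: float) -> dict:
    """Return the absolute and relative change between two accuracy percentages.

    Hypothetical helper for illustration only; the inputs are the two
    accuracy figures quoted in the summary.
    """
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    diff = after_pct - before_pct
    return {
        # Difference in percentage points.
        "absolute_gain_points": round(diff, 1),
        # How many times higher the new accuracy is than the old one.
        "ratio": round(after_pct / before_pct, 2),
        # Increase expressed as a percentage of the starting accuracy.
        "relative_gain_pct": round(diff / before_pct * 100, 1),
    }


gain = summarize_gain(15.3, 79.3)
print(gain)
```

Under these figures the gain is 64.0 percentage points, or roughly a 5.18-fold increase over the starting accuracy.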