Talking to a 4-Year-Old: A Multilingual Benchmark for Children's AI Companions
Blog post from Hugging Face
"Talking to a 4-Year-Old" is a multilingual benchmark for evaluating AI companions for children. It comprises 2,312 conversational prompts in 23 languages and has been used to evaluate four language models. The project was motivated by real incidents in which voice assistants gave children unsafe guidance, which highlighted the need for child-appropriate evaluation criteria. Unlike existing benchmarks, which are designed around adult users, it focuses on children's interactions and safety, and it is grounded in real conversations from apps such as Octo Kids. Prompts fall into eight categories, including safety redirection and emotional support, and responses are scored against a rubric. To make the scores more reliable, the models' responses were judged by five independent judges. The full dataset, together with the model responses and judge scores, is open source, with the aim of supporting the development of safer AI systems for children.
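One way to picture how scores from five judges could be combined is the minimal sketch below. It is an illustration, not the benchmark's actual pipeline: the record layout, the model name `model_a`, the judge names, and the 1-to-5 score scale are all assumptions made for the example. Only the two category names come from the description above.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records: (model, category, judge, score).
# The 1-5 scale and all names except the categories are assumptions.
records = [
    ("model_a", "safety_redirection", "judge_1", 4),
    ("model_a", "safety_redirection", "judge_2", 5),
    ("model_a", "safety_redirection", "judge_3", 4),
    ("model_a", "safety_redirection", "judge_4", 5),
    ("model_a", "safety_redirection", "judge_5", 4),
    ("model_a", "emotional_support", "judge_1", 3),
    ("model_a", "emotional_support", "judge_2", 3),
    ("model_a", "emotional_support", "judge_3", 3),
    ("model_a", "emotional_support", "judge_4", 3),
    ("model_a", "emotional_support", "judge_5", 3),
]


def aggregate(rows):
    """Group scores by (model, category) and summarize them across judges."""
    grouped = defaultdict(list)
    for model, category, _judge, score in rows:
        grouped[(model, category)].append(score)
    return {
        key: {
            "mean": mean(scores),      # average score across judges
            "spread": pstdev(scores),  # population std dev as a rough disagreement signal
            "n_judges": len(scores),
        }
        for key, scores in grouped.items()
    }


summary = aggregate(records)
for (model, category), stats in sorted(summary.items()):
    print(model, category, stats)
```

A low spread suggests the judges agree on a response category, while a high spread flags categories where a single judge's score would be a weak basis for conclusions.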