The blog post by Andrei Fajardo at LlamaIndex introduces new types of llama-datasets, the LabelledEvaluatorDataset and the LabelledPairwiseEvaluatorDataset, designed to benchmark large language model (LLM) evaluators such as Google's Gemini and OpenAI's GPT models. Using strong LLMs as evaluators offers a cost-effective and scalable alternative to human evaluation, but the post stresses that these LLM evaluators must themselves be evaluated on an ongoing basis. It explains how the new datasets support this by comparing an evaluator's predictions against reference judgements, with particular attention to evaluating Retrieval-Augmented Generation (RAG) systems and LLM responses. The benchmarking results show that Gemini Pro performs comparably to GPT-3.5 and may even outperform it in certain scenarios. The datasets are available for download through LlamaHub, and readers are encouraged to build their own benchmark datasets and metrics.
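
To make the workflow concrete, here is a minimal sketch of pulling one of these evaluator benchmark datasets from LlamaHub and inspecting its examples. The dataset name "MiniMtBenchSingleGradingDataset" and the import path are assumptions based on the LlamaHub catalogue and may differ depending on your llama-index version.

```python
# Sketch: download a LabelledEvaluatorDataset from LlamaHub and inspect it.
# Assumptions: dataset name and import path as noted above.
from llama_index.core.llama_dataset import download_llama_dataset

# download_llama_dataset returns the dataset plus any accompanying source
# documents; evaluator datasets typically have no documents, hence the "_".
evaluator_dataset, _ = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset",  # assumed dataset name on LlamaHub
    "./mini_mt_bench_data",             # local directory for the download
)

# Each example pairs a query and an LLM answer with a reference evaluation,
# so an LLM evaluator's own predictions can be scored against it.
print(len(evaluator_dataset.examples))
print(evaluator_dataset.examples[0])
```

From here, the usual pattern is to run a candidate evaluator (for example a Gemini Pro or GPT-3.5 judge) over the dataset's examples and compare its scores or pairwise preferences against the reference judgements to produce the kind of benchmark numbers the post reports.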