This blog explores the open-source Prometheus model as an alternative to GPT-4, the model traditionally relied on for evaluating Retrieval-Augmented Generation (RAG) pipelines. The evaluation focuses on the Correctness, Faithfulness, and Context Relevancy metrics. Prometheus is integrated with the LlamaIndex framework and benchmarked against GPT-4 on a dataset built from the Llama2 paper, highlighting differences in scoring and feedback precision between the two models. The comparison reveals that while Prometheus provides more detailed feedback, it penalizes missing facts more strictly and hallucinates more often in its feedback than GPT-4, particularly on the faithfulness and relevancy metrics. Although Prometheus is more cost-effective, its feedback is sometimes incorrect, so it should be applied with caution and its judgments interpreted carefully.