The rapid development of GenAI has brought increased focus on testing and evaluating its capabilities, leading to the release of numerous Large Language Model (LLM) benchmarks. These benchmarks assess different aspects of LLMs, including natural language understanding, logical reasoning, coding ability, and the performance of agentic systems. However, existing benchmarks have limitations: newer models often saturate them on specific tasks while still struggling with others, and the evaluation scores of state-of-the-art models underscore the need for more comprehensive frameworks to assess their capabilities. As agentic AI gains prominence, specialized benchmarks such as AgentBench and τ-bench are needed to evaluate end-to-end system performance in realistic, actionable scenarios. Ultimately, the evolution of GenAI calls for new evaluation metrics that can keep pace with the demanding practical requirements of these systems.
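As a rough illustration of what such benchmarks do in practice, the sketch below shows a minimal exact-match evaluation loop. The task list, the model_answer stub, and the scoring rule are hypothetical simplifications for illustration only, not the methodology of any benchmark named above.

```python
# Minimal sketch of a benchmark-style evaluation loop.
# Hypothetical example: the tasks, the model_answer stub, and exact-match
# scoring are illustrative assumptions, not a real benchmark's protocol.

def model_answer(prompt: str) -> str:
    """Stand-in for an LLM call; replace with a real model or API client."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "unknown")

def evaluate(tasks: list[tuple[str, str]]) -> float:
    """Score the model by exact match against reference answers."""
    correct = sum(
        1 for prompt, reference in tasks
        if model_answer(prompt).strip() == reference
    )
    return correct / len(tasks)

if __name__ == "__main__":
    tasks = [
        ("What is 2 + 2?", "4"),          # arithmetic / reasoning item
        ("Capital of France?", "Paris"),  # factual-recall item
    ]
    print(f"Exact-match accuracy: {evaluate(tasks):.2%}")
```

Real benchmarks differ mainly in the task suite and the scoring function (e.g., pass@k for code or multi-step success criteria for agentic tasks), but the overall loop of prompting, collecting outputs, and scoring against references follows this general shape.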