xAI has introduced its latest Grok models, Grok 4 and the premium Grok 4 Heavy, which are designed to solve reasoning tasks by using tools rather than relying on generalization alone. Elon Musk claims the models outperform most graduate and PhD students on academic questions. As an informal check on claims like this, Simon Willison asks models to generate and describe an image of a pelican riding a bicycle, a prompt that exposes differences in how language models behave. The Braintrust platform offers a way to run this kind of test systematically: a custom 'LLM-as-Jury' scorer combines judgments from OpenAI, Anthropic, and xAI models to score each output. In initial tests Grok 4 performs well, with the Anthropic judge rating it especially highly. The platform also supports continued experimentation, so models from different vendors can be compared and progress tracked over time.
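The 'LLM-as-Jury' idea can be illustrated with a minimal sketch. It is not Braintrust's API or the scorer used in the tests described here: the `jury_score` function, the `Judge` type, the plain average, and the stub judges are all assumptions made for illustration. In a real setup each judge would wrap a call to one vendor's model and parse a numeric score from its reply.

```python
from statistics import mean
from typing import Callable, Dict

# Hypothetical judge signature: (prompt, output) -> score in [0, 1].
Judge = Callable[[str, str], float]


def jury_score(prompt: str, output: str, judges: Dict[str, Judge]) -> dict:
    """Ask every judge for a score and combine them with a plain average."""
    if not judges:
        raise ValueError("at least one judge is required")
    per_judge = {}
    for name, judge in judges.items():
        score = float(judge(prompt, output))
        # Clamp so a misbehaving judge cannot push the mean outside [0, 1].
        per_judge[name] = min(1.0, max(0.0, score))
    return {"score": mean(list(per_judge.values())), "per_judge": per_judge}


# Stub judges stand in for real model calls; the scores are made up.
stub_judges = {
    "openai": lambda prompt, output: 0.5,
    "anthropic": lambda prompt, output: 1.0,
    "xai": lambda prompt, output: 0.75,
}

result = jury_score(
    "Generate and describe an image of a pelican riding a bicycle",
    "<model output goes here>",
    stub_judges,
)
print(result["score"])      # combined jury score
print(result["per_judge"])  # individual scores, useful for spotting a lenient judge
```

Keeping the per-judge scores alongside the average makes it easy to see when one judge is consistently more generous than the others, which a single combined number would hide.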