| Moving Beyond Engagement: Optimizing Facebook's Algorithms for Human Values | Edwin Chen | 2022-02-10 | 3,086 | -- |
| 30% of Google's Emotions Dataset is Mislabeled | Edwin Chen | 2022-07-11 | 1,996 | -- |
| SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations | Logan Ritchie | 2025-09-15 | 3,790 | -- |
| The Human/AI Frontier: A Conversation with Bogdan Grechuk | -- | 2025-09-29 | 1,708 | -- |
| Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality | -- | 2022-07-29 | 3,387 | -- |
| AI Red Teams and Adversarial Data Labeling with Redwood Research | -- | 2022-06-28 | 1,484 | -- |
| How TikTok is Evolving the Next Generation of Search | -- | 2022-10-25 | 2,696 | -- |
| Benchmarks are broken | -- | 2025-09-07 | 865 | -- |
| We asked 100 humans to draw the DALL·E prompts | Edwin Chen | 2022-05-12 | 1,120 | -- |
| HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors | Edwin Chen | 2022-12-04 | 2,404 | -- |
| AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs … | -- | 2022-12-12 | 2,582 | -- |
| DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet | Edwin Chen | 2024-08-01 | 2,016 | -- |
| How Anthropic uses Surge AI to Train and Evaluate Claude | -- | 2023-03-09 | 1,372 | -- |
| LMArena is a cancer on AI | Surge AI Research Team | 2025-12-01 | 1,585 | -- |
| Is Google Search Deteriorating? Measuring Google's Search Quality in 2022 | Edwin Chen | 2022-01-10 | 2,289 | -- |
| Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI … | Edwin Chen | 2022-06-22 | 2,349 | -- |
| Unsexy AI Failures: The PDF That Broke ChatGPT | -- | 2025-08-25 | 2,102 | -- |
| We Evaluated ChatGPT vs. Google on 500 Search Queries | -- | 2022-12-21 | 3,557 | -- |
| The $250K Inverse Scaling Prize and Human-AI Alignment | -- | 2022-08-15 | 1,558 | -- |
| Why Instagram is Losing Gen Z: We Asked 100 Users to Compare … | -- | 2022-08-31 | 3,814 | -- |
| Human Evaluation of Large Language Models: How Good is Hugging Face's BLOOM? | -- | 2022-07-19 | 3,497 | -- |
| Google Search is Falling Behind | -- | 2022-04-12 | 2,405 | -- |
| Holy $#!t: Are popular toxicity models simply profanity detectors? | -- | 2022-01-22 | 1,394 | -- |
| Building AdvancedIF: Evolving Instruction Following Beyond IFEval and "Avoid the Letter C" | -- | 2025-12-06 | 1,916 | -- |
| How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems | Edwin Chen | 2022-06-13 | 2,583 | -- |
| RL Environments and the Hierarchy of Agentic Capabilities | Surge AI Research Team | 2025-11-03 | 4,073 | -- |
| How do frontier models perform on real-world finance problems? | Lily Zhao | 2025-11-03 | 3,212 | -- |
| A Product Take on Sonnet 4.5 | Nick Heiner | 2025-10-10 | 1,255 | -- |
| Bringing light to the GPT-4o vs. GPT-5 personality controversy | Nick Heiner | 2025-08-15 | 2,544 | -- |
| Is Sonnet 4.5 the best coding model in the world? | Logan Ritchie | 2025-10-08 | 3,102 | -- |
| AdvancedIF and Our Philosophy on Building Benchmarks | -- | 2025-12-07 | 1,420 | -- |
| Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI … | -- | 2022-09-29 | 3,545 | -- |
| Hemingway-bench Leaderboard: Because Good Writing Isn't a Checklist of Vibes | -- | 2026-02-04 | 3,283 | -- |
| EnterpriseBench: CoreCraft – Measuring AI Agents in Chaotic, Enterprise RL Environments | -- | 2026-02-19 | 4,038 | -- |