Surge AI Blog - Plushcap

Blog URL

www.surgehq.ai/blog

Posts YTD

7 ↑ vs 0 last year

Avg Posts/Month

0.7 since 2022

Post Details

Title	Author	Published	Words	HN Points
Moving Beyond Engagement: Optimizing Facebook's Algorithms for Human Values	Edwin Chen	2022-02-10	3,086	--
30% of Google's Emotions Dataset is Mislabeled	Edwin Chen	2022-07-11	1,996	--
SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations	Logan Ritchie	2025-09-15	3,790	--
The Human/AI Frontier: A Conversation with Bogdan Grechuk	--	2025-09-29	1,708	--
Search Behind-the-Scenes: How Neeva Uses Human Evaluation to Measure Search Quality	--	2022-07-29	3,387	--
AI Red Teams and Adversarial Data Labeling with Redwood Research	--	2022-06-28	1,484	--
How TikTok is Evolving the Next Generation of Search	--	2022-10-25	2,696	--
Benchmarks are broken	--	2025-09-07	865	--
We asked 100 humans to draw the DALL·E prompts	Edwin Chen	2022-05-12	1,120	--
HellaSwag or HellaBad? 36% of this popular LLM benchmark contains errors	Edwin Chen	2022-12-04	2,404	--
AI Red Teams for Adversarial Training: How to Make ChatGPT and LLMs …	--	2022-12-12	2,582	--
DALL·E 3 and Midjourney Fail Astral Codex Ten's Image Generation Bet	Edwin Chen	2024-08-01	2,016	--
How Anthropic uses Surge AI to Train and Evaluate Claude	--	2023-03-09	1,372	--
LMArena is a cancer on AI	Surge AI Research Team	2025-12-01	1,585	--
Is Google Search Deteriorating? Measuring Google's Search Quality in 2022	Edwin Chen	2022-01-10	2,289	--
Humans vs. Gary Marcus vs. Slate Star Codex: When is an AI …	Edwin Chen	2022-06-22	2,349	--
Unsexy AI Failures: The PDF That Broke ChatGPT	--	2025-08-25	2,102	--
We Evaluated ChatGPT vs. Google on 500 Search Queries	--	2022-12-21	3,557	--
The $250K Inverse Scaling Prize and Human-AI Alignment	--	2022-08-15	1,558	--
Why Instagram is Losing Gen Z: We Asked 100 Users to Compare …	--	2022-08-31	3,814	--
Human Evaluation of Large Language Models: How Good is Hugging Face's BLOOM?	--	2022-07-19	3,497	--
Google Search is Falling Behind	--	2022-04-12	2,405	--
Holy $#!t: Are popular toxicity models simply profanity detectors?	--	2022-01-22	1,394	--
Building AdvancedIF: Evolving Instruction Following Beyond IFEval and "Avoid the Letter C"	--	2025-12-06	1,916	--
How Surge AI Built OpenAI's GSM8K Dataset of 8,500 Math Problems	Edwin Chen	2022-06-13	2,583	--
RL Environments and the Hierarchy of Agentic Capabilities	Surge AI Research Team	2025-11-03	4,073	--
How do frontier models perform on real-world finance problems?	Lily Zhao	2025-11-03	3,212	--
A Product Take on Sonnet 4.5	Nick Heiner	2025-10-10	1,255	--
Bringing light to the GPT-4o vs. GPT-5 personality controversy	Nick Heiner	2025-08-15	2,544	--
Is Sonnet 4.5 the best coding model in the world?	Logan Ritchie	2025-10-08	3,102	--
AdvancedIF and Our Philosophy on Building Benchmarks	--	2025-12-07	1,420	--
Evaluating Generative AI: Did Astral Codex Ten Win His Bet on AI …	--	2022-09-29	3,545	--
Hemingway-bench Leaderboard: Because Good Writing Isn't a Checklist of Vibes	--	2026-02-04	3,283	--
EnterpriseBench: CoreCraft – Measuring AI Agents in Chaotic, Enterprise RL Environments	--	2026-02-19	4,038	--
Riemann-bench: A Benchmark for Moonshot Mathematics	--	2026-03-24	1,154	--
GDP.pdf: Can $100B AI Models Master the Documents that Run the World?	--	2026-04-14	1,170	--
Slop is a choice. Introducing Antidote.	--	2026-01-01	2,626	--
Cross-Benchmark Generalization for Long-Horizon Agentic Tasks	--	2026-05-28	2,201	--
ComplexConstraints: A Benchmark for Entangled Instruction Following	--	2026-06-03	1,879	--

Plushcap, by Matt Makai. 2021-2026.