The researchers curated five datasets with verifiable outcomes for benchmarking prompt optimization techniques, including the "Support email routing 3" and "Support email routing 10" tasks. They implemented and benchmarked five methods for systematically improving prompts, using models such as Claude Sonnet and GPT-4o as optimizers. The results show that prompt optimization can significantly improve LLM accuracy on tasks where the underlying model lacks domain knowledge, delivering a ~200% increase in accuracy over naive baseline prompts.

The researchers recommend using Claude Sonnet for prompt optimization, particularly in situations where the underlying model lacks domain knowledge. They also found that meta-prompting is especially useful for discovering rules, preferences, and other clear patterns in the data, while few-shot prompting can communicate more information than simple instructions but doesn't capture complex conditionals and rules (a minimal sketch of a meta-prompting loop appears below).

The results support what has been observed: LLM-driven prompt optimization can systematically improve prompts and automate much of the manual guess-and-check process that dominates prompt engineering today. It is not a silver bullet, however, and prompt optimization is best viewed as one tool in a broader toolkit for improving LLM applications.
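To make the meta-prompting idea concrete, here is a minimal sketch of such a loop, assuming hypothetical `target_llm` and `optimizer_llm` callables that take a prompt string and return the model's completion (these names, and the `Example` type, are illustrative, not the benchmarked implementation). The core idea: evaluate the current prompt, collect the failures, and ask the optimizer model to rewrite the prompt so the inferred rules are stated explicitly.

```python
# Minimal meta-prompting sketch (hypothetical helpers, not the benchmarked code).
from dataclasses import dataclass

@dataclass
class Example:
    text: str   # e.g. an incoming support email
    label: str  # the correct routing category

META_PROMPT = """You are improving a prompt for a support-email routing task.

Current prompt:
{prompt}

Examples the current prompt got wrong (email -> predicted vs. expected):
{failures}

Infer any rules or preferences the prompt is missing and rewrite it so those
patterns are stated explicitly. Return only the improved prompt."""

def optimize_prompt(prompt, dataset, target_llm, optimizer_llm, rounds=3):
    """Run a few rounds of meta-prompting: evaluate, collect failures, rewrite."""
    for _ in range(rounds):
        failures = []
        for ex in dataset:
            predicted = target_llm(f"{prompt}\n\nEmail:\n{ex.text}\n\nCategory:").strip()
            if predicted != ex.label:
                failures.append(f"- {ex.text!r}: predicted {predicted}, expected {ex.label}")
        if not failures:
            break  # the current prompt already handles every example
        # Ask the optimizer model to rewrite the prompt based on the failures.
        prompt = optimizer_llm(META_PROMPT.format(prompt=prompt,
                                                  failures="\n".join(failures)))
    return prompt
```

Because the optimizer model sees concrete failures rather than raw examples, this style of loop tends to surface explicit rules and preferences, which is consistent with the finding that meta-prompting excels at pattern discovery while few-shot prompting does not capture complex conditionals as well.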