Company:
Date Published:
Author: Erin Mikail Staples
Word count: 3172
Language: English
Hacker News points: None

Summary

This tutorial teaches developers how to build Startup Sim 3000, a Python-powered web app that uses real-time data and large language models (LLMs) to generate creative or professional startup pitches. The twist is that the system also tracks, monitors, and measures its own performance using Agent Reliability tools and custom metrics with Galileo. The application was also featured as a talk at Databricks' 2025 Data and AI Conference.

Large language models are inherently nondeterministic, which makes them hard to evaluate with traditional software metrics. Custom metrics address this by letting developers define and track domain-specific signals directly. Applied to a comedy-generating app, they make it possible to measure success on timing, tone, delivery, and more. The key takeaway is that custom metrics are essential for domain-specific AI applications, enabling developers to turn cool demos into production-ready products.

The tutorial covers setting up Galileo, creating a new project, installing dependencies, and running the application, then moves on to creating custom metrics with LLM-as-a-Judge prompts and testing them against sample outputs. By the end, developers will have built an AI Agent system that combines multiple tools, logged tool spans and LLM spans with Galileo, tracked custom LLM-as-a-Judge metrics, and learned how to structure an agent-based AI system, define and measure success, and translate fuzzy ideas into measurable signals. The final goal is to move from "It runs" to "It works well," from a cool demo to a useful product, and from "Kinda funny" to "Funny enough to ship."