Performance benchmarking for Large Language Models (LLMs) is complex because many factors interact: hardware, streaming, quantization, input and output lengths, batch size, network speed, latency, throughput, and cost. A good benchmark should reflect the specific use case and the tradeoffs that make sense for that scenario. Latency matters most for chat-style applications, where the key metric is time to first token; throughput is closer to top speed, measured in requests per second or tokens per second. Cost is just as important, with hardware choice, batching, and concurrency playing significant roles. Building nuanced benchmarks that account for these factors is essential to balancing latency, throughput, and cost.
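
As a rough illustration of how the latency-side metrics can be measured, here is a minimal sketch in Python. It assumes an OpenAI-compatible streaming endpoint reached through the `openai` client; the model name, prompt, and the chunk-counting approximation of token throughput are placeholders for illustration, not a definitive harness.

```python
import time
from openai import OpenAI

# Assumes an OpenAI-compatible endpoint; model and prompt are placeholders.
client = OpenAI()

def benchmark_request(prompt: str, model: str = "gpt-4o-mini") -> dict:
    """Stream one completion and record time to first token and rough token throughput."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # some providers send a trailing chunk with no choices
        delta = chunk.choices[0].delta.content
        if not delta:
            continue
        if first_token_at is None:
            first_token_at = time.perf_counter()  # the latency a chat user actually feels
        chunks += 1
    end = time.perf_counter()

    generation_time = end - (first_token_at or end)
    return {
        "time_to_first_token_s": (first_token_at or end) - start,
        "total_time_s": end - start,
        # Chunk count only approximates token count; a tokenizer gives exact figures.
        "approx_tokens_per_s": chunks / generation_time if generation_time > 0 else 0.0,
    }

if __name__ == "__main__":
    print(benchmark_request("Summarize the tradeoffs between latency and throughput."))
```

Running a sketch like this against different hardware, providers, or input and output lengths gives comparable numbers for the latency side of the tradeoff; throughput and cost comparisons would layer batching and concurrent requests on top.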