In the blog post, Mingran Wang and Tan Li of SambaNova discuss how to select the right metrics for optimizing large language model (LLM) system performance, arguing that Tokens per Second alone can be misleading. They draw a clear distinction between throughput and latency: throughput measures how many instances a system processes per unit of time, while latency measures how long each individual instance takes. The authors advocate a system design that balances both, especially in user-facing applications such as chatbots, where Time to First Token (TTFT) is critical to a smooth experience. SambaNova's Reconfigurable Dataflow Unit (RDU) illustrates this balance with its three-tier memory hierarchy, which improves both throughput and latency. The post concludes that Tokens per Second cannot serve as the sole metric, and that a comprehensive set of metrics is needed to capture the full spectrum of system capabilities across applications.
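
To make the throughput/latency distinction concrete, here is a minimal sketch of how one might measure TTFT and output tokens per second against a streaming LLM endpoint. The `fake_stream_tokens` generator is a hypothetical stand-in for a real streaming client (it is not SambaNova's API); the measurement logic itself is generic.

```python
import time
from typing import Iterable


def fake_stream_tokens(prompt: str) -> Iterable[str]:
    """Hypothetical stand-in for a streaming LLM endpoint.

    Sleeps to mimic a prefill delay followed by steady decoding;
    a real client would yield tokens as the server streams them.
    """
    time.sleep(0.25)  # simulated prefill (prompt processing) before the first token
    for word in "this is a simulated model response".split():
        time.sleep(0.02)  # simulated per-token decode step
        yield word


def measure_streaming_metrics(prompt: str) -> dict:
    """Measure Time to First Token (TTFT) and output tokens per second."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in fake_stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # the latency a chat user actually feels
        n_tokens += 1
    end = time.perf_counter()

    ttft = first_token_at - start
    decode_time = end - first_token_at  # steady-state generation window
    return {
        "ttft_s": round(ttft, 4),
        "output_tokens_per_s": (
            round(n_tokens / decode_time, 2) if decode_time > 0 else float("inf")
        ),
        "total_tokens": n_tokens,
    }


if __name__ == "__main__":
    print(measure_streaming_metrics("Explain dataflow architectures."))
```

Separating the two numbers this way shows why a single Tokens per Second figure hides information: a system can post a high aggregate token rate while still making each user wait a long time for the first token.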