Company:
Date Published:
Author: Aman
Word count: 4006
Language: English
Hacker News points: None

Summary

The blog post examines the cost and latency characteristics of the Llama-2-70B language model compared to OpenAI's GPT-3.5, arguing that Llama-2 is better suited to prompt-dominated tasks than to completion-heavy workloads. The analysis finds that Llama-2 is cheaper per prompt token but more expensive per generated completion token when served on two 80-GB A100 GPUs, the minimum needed to fit the model in memory. The article works through the model's inference math, memory requirements, and the effect of batch size on cost and latency. It concludes that while Llama-2 can be price-competitive for specific use cases, such as large prompts with few generated tokens or offline batch-processing jobs, GPT-3.5 remains more efficient for most generation-heavy tasks. The post also touches on advanced techniques that closed-source models use to optimize serving, and suggests that open-source models like Llama-2 are most attractive when cost efficiency for prompt processing is the priority.
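The two-GPU requirement mentioned above follows from simple arithmetic. A minimal sketch, assuming fp16/bf16 weights (2 bytes per parameter) and ignoring KV-cache and activation memory, which only push the requirement higher:

```python
# Back-of-the-envelope check: why Llama-2-70B needs two 80 GB A100s.
# Assumes 2 bytes per parameter (fp16/bf16); KV cache and activations
# are excluded, so this is a lower bound on required memory.
import math

PARAMS = 70e9          # Llama-2-70B parameter count
BYTES_PER_PARAM = 2    # fp16/bf16 weights
GPU_MEMORY_GB = 80     # one A100

weight_gb = PARAMS * BYTES_PER_PARAM / 1e9
gpus_needed = math.ceil(weight_gb / GPU_MEMORY_GB)

print(f"weights: {weight_gb:.0f} GB -> {gpus_needed}x 80 GB A100")
# weights alone (140 GB) already exceed a single 80 GB card
```

This is why the cost analysis prices inference against a two-A100 node rather than a single GPU.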