As transformer-based models grow in size and complexity, optimizing their inference speed becomes critical, particularly in latency-sensitive applications such as chatbots. Key-value (KV) caching speeds up autoregressive decoding by storing the key and value projections of every token already processed, so each new token only needs to compute its own query, key, and value and can attend over the cached entries instead of recomputing them for the entire sequence. This yields substantial time savings, but the cache grows with sequence length and can strain memory-constrained deployments. To manage this, strategies such as sequence truncation and model simplification may be employed, usually at some cost to accuracy.

Implementing KV caching effectively in large-scale systems also requires careful management of cache invalidation and reuse, keeping memory usage bounded while preserving fast response times. Popular invalidation strategies include session-based clearing, time-to-live (TTL) policies, and contextual-relevance approaches, while cache reuse pays off in scenarios with shared context or frequently repeated queries.
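To make the mechanics concrete, here is a minimal sketch of single-head attention with a KV cache in NumPy. The model width, the random weight initialization, and the `decode_step` helper are illustrative assumptions rather than the API of any particular framework; the point is that each decoding step computes projections only for the newest token and reads everything else back from the cache.

```python
import numpy as np

D_MODEL = 64  # hypothetical model width for this sketch

rng = np.random.default_rng(0)
W_q = rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
W_k = rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
W_v = rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def decode_step(x_t, cache):
    """Process one new token embedding x_t (shape: [D_MODEL]).

    Only the new token's key and value are computed; keys and values
    for earlier tokens are read from the cache instead of being
    recomputed from the full sequence.
    """
    q = x_t @ W_q
    k = x_t @ W_k
    v = x_t @ W_v

    # Append the new key/value to the cache (one extra row per token).
    cache["k"] = k[None, :] if cache["k"] is None else np.vstack([cache["k"], k[None, :]])
    cache["v"] = v[None, :] if cache["v"] is None else np.vstack([cache["v"], v[None, :]])

    # Attend over every cached position, including the current one.
    scores = cache["k"] @ q / np.sqrt(D_MODEL)   # shape: [seq_len]
    weights = softmax(scores)
    return weights @ cache["v"]                  # shape: [D_MODEL]


# Usage: generate a short sequence one token at a time.
cache = {"k": None, "v": None}
for _ in range(5):
    x_t = rng.standard_normal(D_MODEL)  # stand-in for the current token's embedding
    decode_step(x_t, cache)

print(cache["k"].shape)  # (5, 64): one cached key per processed token
```

Without the cache, every step would recompute keys and values for the whole prefix; with it, the per-step projection cost is constant and only the attention itself scales with the prefix length, which is also why the cache's memory footprint grows linearly with the sequence.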
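At the serving layer, invalidation and reuse are typically handled by a small bookkeeping component wrapped around the per-session caches. The sketch below combines session-based clearing with a TTL policy; the `KVCacheRegistry` class and its method names are hypothetical and meant only to illustrate the policies mentioned above, not the interface of any specific inference server.

```python
import time

class KVCacheRegistry:
    """Hypothetical registry mapping session IDs to cached KV states."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self._entries = {}  # session_id -> (kv_cache_object, last_used_timestamp)

    def put(self, session_id, kv_cache):
        self._entries[session_id] = (kv_cache, time.monotonic())

    def get(self, session_id):
        """Return a cached KV state if present and not expired, else None."""
        entry = self._entries.get(session_id)
        if entry is None:
            return None
        kv_cache, last_used = entry
        if time.monotonic() - last_used > self.ttl:
            del self._entries[session_id]   # TTL invalidation
            return None
        self._entries[session_id] = (kv_cache, time.monotonic())  # refresh recency
        return kv_cache

    def end_session(self, session_id):
        """Session-based clearing: drop the cache when the conversation ends."""
        self._entries.pop(session_id, None)

    def evict_expired(self):
        """Periodic sweep removing every entry older than the TTL."""
        now = time.monotonic()
        for sid in [s for s, (_, t) in self._entries.items() if now - t > self.ttl]:
            del self._entries[sid]
```

A contextual-relevance policy would slot into the same structure by scoring entries on how likely their context is to be needed again, and reuse across requests (for example, a shared system prompt) amounts to handing the same cached prefix to multiple sessions before their divergent tokens are appended.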