Mixtral 8x7B is an LLM that produces results comparable to larger models like Llama 2 70B and GPT-3.5, but with fewer parameters and faster inference. Using TensorRT-LLM and quantizing the model to int8, we can serve Mixtral performantly on a single A100 GPU. Mixtral's mixture-of-experts architecture activates only 12.9B of its parameters per token during inference, so individual requests run faster than they would on a comparably capable dense model. Batched inference benefits less from this architecture, however: different tokens in a batch route to different experts, so more of the model's weights become active and throughput scales less favorably than the 12.9B active-parameter figure suggests. Quantizing the model to int8 cuts inference cost in half while preserving quality, with only a minimal increase in perplexity. Together, TensorRT-LLM and quantization unlock fast single-request and batched inference, making Mixtral suitable for a wide range of use cases.
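To make the routing argument concrete, here is a minimal sketch of top-2 expert routing in PyTorch. It is an illustrative toy, not Mixtral's implementation: the class name `TopTwoMoELayer` and the dimensions are invented for brevity, and Mixtral's real experts are gated feed-forward blocks inside a full transformer. The point it demonstrates is that a single token passes through only two of the eight expert MLPs, while the tokens of a large batch collectively route through all of them.

```python
# Toy top-2 mixture-of-experts layer (illustrative sketch, not Mixtral's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoELayer(nn.Module):
    """Feed-forward layer with 8 experts; each token uses only the top 2."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        logits = self.gate(x)                           # (n_tokens, n_experts)
        top_w, top_i = torch.topk(logits, k=2, dim=-1)  # pick 2 experts per token
        top_w = F.softmax(top_w, dim=-1)                # normalize their weights
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopTwoMoELayer()
one_token = layer(torch.randn(1, 64))  # touches at most 2 of the 8 expert MLPs
a_batch = layer(torch.randn(64, 64))   # likely routes through all 8 experts
```

Running the single-token and batched calls side by side shows the asymmetry: per-request latency reflects only the two active experts, but a batch keeps nearly every expert's weights in play, which is the root of the reduced batching benefit described above.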