| Points | Title | Date |
|---|---|---|
| 113 | A guide to open-source LLM inference and performance | 2023-11-20 |
| 51 | How we got Stable Diffusion XL inference to under 2 seconds | 2023-08-31 |
| 9 | Show HN: Baseten Chains – Framework and SDK for Multi-Model AI Products | 2024-06-27 |
| 3 | SDXL inference in under 2 seconds | 2023-08-31 |
| 2 | Open Source Inference Engine Baseten Raises $40M from IVP, Spark and Greylock | 2024-03-14 |
| 2 | Faster Mixtral inference with TensorRT-LLM and quantization | 2023-12-27 |
| 2 | How to double tokens per second for Llama 3 with Medusa | 2024-08-20 |
| 2 | Show HN: Automatically Build Nvidia TRT-LLM Engines | 2024-08-01 |
| 2 | FP8: Efficient model inference with 8-bit floating point numbers | 2024-03-08 |
| 1 | How to build function calling and JSON mode for open-source and fine-tuned LLMs | 2024-09-12 |
| 1 | Show HN: 60% higher tokens per second for 70B custom LLMs | 2024-07-31 |
| 1 | Introduction to quantizing machine learning models | 2024-02-16 |
| 1 | Three techniques to adapt LLMs for any use case | 2023-06-15 |
| 402 | Show HN: ChatLLaMA – A ChatGPT style chatbot for Facebook's LLaMA | 2023-03-22 |
| 16 | Show HN: Fine-tune generative models in 1 line of code | 2023-03-01 |
| 1 | Deploying custom ComfyUI workflows as APIs | 2024-11-20 |
| 1 | Continuous vs. dynamic batching for AI inference | 2025-08-06 |
| 247 | Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs | 2025-08-07 |