Content Deep Dive
How we improved Tensorflow Serving performance by over 70%
Blog post from Mux
Post Details
Company
Date Published
Author
Masroor Hasan
Word Count
1,852
Language
English
Hacker News Points
-
Summary
TensorFlow Serving is a flexible server architecture for deploying and serving machine learning models. It provides monitoring components, a configurable architecture, and support for multiple ML models or model versions. The size of the "servable" matters: smaller models use less memory and storage, which leads to faster load times. To reduce latency, optimizations can be made on both the prediction server and the client. Techniques such as building a CPU-optimized serving binary, enabling server-side batching, and implementing client-side batching can significantly reduce prediction latency. Hardware acceleration such as GPUs may also be worth considering for "offline" inference processing at massive volumes.
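As a rough illustration of client-side batching, the sketch below groups several inputs into one request body instead of sending one request per input. It is not taken from the original post: it assumes the client talks to TensorFlow Serving's REST predict endpoint using the row-oriented `instances` format, and the helper names (`chunk`, `build_predict_payloads`), the batch size, and the sample feature vectors are all hypothetical.

```python
import json
from typing import Iterator, List, Sequence


def chunk(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_predict_payloads(inputs: Sequence, batch_size: int) -> List[str]:
    """Serialize each batch as one JSON body in the REST 'instances' format."""
    return [
        json.dumps({"instances": list(batch)})
        for batch in chunk(inputs, batch_size)
    ]


# Hypothetical inputs: five feature vectors of length 2 (illustrative only).
sample_inputs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

# Five inputs with a batch size of 2 produce three request bodies
# (2 + 2 + 1) rather than five separate requests.
payloads = build_predict_payloads(sample_inputs, batch_size=2)
```

Each payload would then be sent as the body of a single POST to the model's `:predict` URL. A suitable batch size depends on the model and hardware, so it is best treated as a parameter to tune against measured latency.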