
Why does one NGINX worker take all the load?

What's this blog post about?

The text discusses different ways of designing a TCP server with performance in mind, focusing on three models: (a) a single listen socket with a single worker process; (b) a single listen socket shared by multiple worker processes; and (c) multiple worker processes, each with its own listen socket. It explains that while adding worker processes overcomes the single-CPU-core bottleneck, it introduces a new problem: spreading the accept() load across processes. Linux handles this differently depending on how workers wait for connections — blocking accept() distributes connections roughly round-robin, while epoll-and-accept uses LIFO wakeups, so the busiest worker tends to take most of the new connections. It then shows how SO_REUSEPORT can work around the balancing problem by splitting incoming connections into multiple separate accept queues, resulting in better load distribution. However, it also highlights that while the average latency is comparable, the maximum latency increases significantly and, most importantly, the deviation becomes gigantic, leaving the server in a degraded-latency state. The text concludes by suggesting that changing the standard epoll behavior from LIFO to FIFO could be a better solution.
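The SO_REUSEPORT mechanism the summary refers to can be sketched as follows — a minimal, hedged illustration (not code from the original post), assuming Linux, where setting SO_REUSEPORT on each socket lets several listeners bind the same address and port, and the kernel hashes each incoming connection's 4-tuple to pick one of the separate accept queues:

```python
import socket

def make_reuseport_listener(port):
    """Create a TCP listener with SO_REUSEPORT set.

    With this option, several sockets (typically one per worker
    process) may bind the same address/port; each gets its own
    accept queue, which is how the kernel spreads new connections.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT is Linux-specific (kernel >= 3.9); all sockets
    # sharing the port must set it before bind().
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    return s

# Two listeners on the same port — in a real server each worker
# process would create its own listener like this.
a = make_reuseport_listener(0)               # port 0: kernel picks a free port
port = a.getsockname()[1]
b = make_reuseport_listener(port)            # second bind on the same port succeeds
```

The trade-off the summary describes follows from this design: because the queues are independent, a connection hashed to a busy worker's queue cannot be stolen by an idle worker, which is what inflates the maximum latency and its deviation.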

Company
Cloudflare

Date published
Oct. 23, 2017

Author(s)
Marek Majkowski

Word count
1663

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.