
Debugging Hardware Performance on Gen X Servers

What's this blog post about?

In this post, Cloudflare hardware engineer Yasir Jamal describes a problem in which servers from one vendor (SKU-B) consistently performed 5-10% worse than servers from another vendor (SKU-A). The team first suspected CPU performance and ran AMD's DGEMM high-performance computing tool, which showed that the underperforming servers had lower Thermal Design Power (TDP) and a lower floating-point computation rate. After trying several debugging options (disabling idle power saving mode, checking the network interface, and enabling AMD Preferred I/O functionality), the team used AMD's HSMP tool and discovered a difference in the memory clock frequency of the Infinity Fabric system. They asked the vendor for a new BIOS that sets the frequency to 1467 MHz at compile time, which resolved the issue and brought the performance of SKU-B servers up to match or exceed that of SKU-A servers.

Company
Cloudflare

Date published
May 17, 2022

Author(s)
Yasir Jamal

Word count
920

Hacker News points
5

Language
English


By Matt Makai. 2021-2024.