Company
Date Published
Author
Andreas Grabner
Word count
1013
Language
American English
Hacker News points
None

Summary

Dynatrace employs a "Progressive Delivery at Cloud Scale" strategy to update its Software Intelligence Platform, which involves rolling out new versions gradually across clusters, using automated self-monitoring to ensure updates meet service level objectives before proceeding. A recent issue during an update highlighted the platform's diagnostic capabilities, as a node took significantly longer to restart due to a specific tenant's extensive configuration, which led to a CPU spike. This problem was swiftly diagnosed and resolved by the Cluster Performance Engineering team, aided by Dynatrace's thread and CPU hotspot analysis tools, which identified a newly added method as the root cause. The fix involved rethinking the startup check requirements and considering caching strategies, all without impacting customers. This incident illustrates how Dynatrace's own tools facilitate efficient troubleshooting and performance optimization, benefiting both their internal processes and external offerings.