Company
Date Published
Author
Kesha Mykhailov
Word count
1979
Language
English
Hacker News points
None

Summary

At Intercom, building a resilient system is crucial for customer experience, and observability plays a key role by letting humans "look" inside the systems they run. The company invests heavily in the reliability of its application, but unpredictable failures are inevitable, and when they happen, it is humans who fix them. Observability is essential for resilience, and Intercom defines it as a continuous process of humans asking questions about production and getting answers. To build a stronger culture of observability, Intercom worked in stages, starting with identifying the problem statement and formulating a solution, which involved shifting from metrics-based to tracing-centric tooling. They used an existing tracing library and made a small adjustment to convert trace data into the Honeycomb-native format. The company then helped teammates adopt traces by finding allies, tailoring its message, and demonstrating potential. Having completed the technical part of the observability program, Intercom has reached several key milestones, including auto-instrumenting its main monolith application with high-quality, attribute-rich traces and deploying Honeycomb Refinery to sample data dynamically and retain more of the "interesting" traces. To increase adoption, they offered optional tracing from the local development environment and added a custom bot reaction to a "show me web performance" message. The company is now focusing on measuring the return on investment (ROI) of its observability tooling and exploring front-end instrumentation, with the goal of building a culture of observability that continues to evolve.