How Reliability and Product Teams Collaborate at Booking.com
Blog post from Honeycomb
Booking.com runs more than 50,000 servers to handle over 1.5 million room nights booked daily, which makes reliability a first-order concern. To address this, Booking.com has developed two tools: the Reliability Collaboration Model (RCM) and the Ownership Map. The RCM spells out who is responsible for keeping a system reliable, grouping tasks into four areas: Basic Operations, Disaster Recovery, Observability, and Advanced Operations. Three Support Levels define how the work is divided between reliability and product teams, ranging from full autonomy for product teams at Level 3 to complete support from reliability teams at Level 1. The Ownership Map visualizes each system's support level and criticality, helping align reliability efforts with business priorities. Together, the two tools foster collaboration across teams and reduce the ownership confusion that arises when multiple product teams work on shared systems. System criticality itself is determined by product leadership, which keeps it aligned with business needs. While the RCM and Ownership Map are vital tools, Booking.com acknowledges that instilling a reliability culture across the organization is a long-term effort that requires ongoing education, process refinement, and strategic alignment.
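To make the structure described above concrete, here is a minimal Python sketch of how the four RCM areas, the three Support Levels, and an Ownership Map could be modeled as data. It is an illustration only, not Booking.com's implementation: the class and function names, the system names, the criticality labels ("high", "low"), and the treatment of Level 2 as simply "shared" are all assumptions, since the summary only describes the two endpoints (Level 1 and Level 3).

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum, IntEnum


class Area(Enum):
    """The four RCM task areas named in the post."""
    BASIC_OPERATIONS = "Basic Operations"
    DISASTER_RECOVERY = "Disaster Recovery"
    OBSERVABILITY = "Observability"
    ADVANCED_OPERATIONS = "Advanced Operations"


class SupportLevel(IntEnum):
    """Three Support Levels; member names are hypothetical."""
    RELIABILITY_SUPPORTED = 1  # complete support from reliability teams
    SHARED = 2                 # assumption: the exact split is not given
    PRODUCT_AUTONOMOUS = 3     # full autonomy for product teams


@dataclass(frozen=True)
class SystemEntry:
    name: str                  # hypothetical system name
    support_level: SupportLevel
    criticality: str           # set by product leadership; labels assumed


def responsibilities(level: SupportLevel) -> dict:
    """Map each RCM area to the party responsible at a given level."""
    if level is SupportLevel.RELIABILITY_SUPPORTED:
        party = "reliability team"
    elif level is SupportLevel.PRODUCT_AUTONOMOUS:
        party = "product team"
    else:
        # Assumption: the per-area split at Level 2 is not specified.
        party = "shared"
    return {area: party for area in Area}


def build_ownership_map(entries) -> dict:
    """Group systems by (criticality, support level), like cells in a grid."""
    grid = defaultdict(list)
    for entry in entries:
        grid[(entry.criticality, entry.support_level)].append(entry.name)
    return {cell: sorted(names) for cell, names in grid.items()}


if __name__ == "__main__":
    systems = [
        SystemEntry("search-api", SupportLevel.RELIABILITY_SUPPORTED, "high"),
        SystemEntry("internal-wiki", SupportLevel.PRODUCT_AUTONOMOUS, "low"),
    ]
    for cell, names in build_ownership_map(systems).items():
        print(cell, names)
```

Keeping criticality as a plain field on each entry reflects the point that it is an input from product leadership, not something the model derives from the support level.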