Deep Dive: Telemetry cardinality in the Apollo GraphOS Router
Blog post from Apollo
The Apollo GraphOS Router uses OpenTelemetry as its telemetry backbone, providing the observability needed to run high-scale production graphs. The router produces metrics and traces with the OpenTelemetry Rust SDK and exports them to a range of observability backends. One significant limitation is the SDK's cardinality cap of 2000 unique attribute combinations per metric batch: once a metric exceeds the cap, additional combinations overflow and their attributes are lost. Apollo plans to address this constraint by upgrading the SDK, although the upgrade involves complex changes. Metrics and traces flow through multiple exporters, including Apollo Usage Reports and standard OpenTelemetry formats, and the batch processor behind each exporter can be configured to tune performance. Apollo GraphOS also applies its own cardinality protection, replacing high-cardinality attribute values with a placeholder when limits are breached. To stay under these limits, users are encouraged to audit their attribute configurations and to use configuration parameters that reduce the number of unique metric combinations. Understanding these details makes it possible to build an observability strategy that balances data completeness against performance in GraphQL infrastructure.
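To make the cardinality cap concrete, here is a minimal Python sketch of how a capped metric aggregator might behave. It is an illustration only, not the OpenTelemetry Rust SDK's or the router's actual implementation. The names `CappedCounter` and `worst_case_cardinality` are hypothetical, and the overflow label is an assumption modeled on the overflow attribute described in the OpenTelemetry metrics SDK specification. The limit of 2000 is the figure cited above.

```python
from math import prod

CARDINALITY_LIMIT = 2000  # cap cited in the text
# Assumption: label used for the shared overflow series in this sketch.
OVERFLOW_KEY = (("otel.metric.overflow", True),)


class CappedCounter:
    """Toy counter that tracks at most `limit` distinct attribute sets.

    In this sketch the overflow series is kept in addition to the limit.
    """

    def __init__(self, limit=CARDINALITY_LIMIT):
        self.limit = limit
        self.series = {}  # attribute set -> accumulated value

    def add(self, value, attributes):
        # Normalize the attribute dict into a hashable, order-independent key.
        key = tuple(sorted(attributes.items()))
        # A new attribute set arriving after the cap is reached loses its
        # attributes and is folded into the single overflow series.
        if key not in self.series and len(self.series) >= self.limit:
            key = OVERFLOW_KEY
        self.series[key] = self.series.get(key, 0) + value


def worst_case_cardinality(attribute_values):
    """Upper bound on unique combinations: product of distinct values per attribute."""
    return prod(len(set(values)) for values in attribute_values.values())
```

As a rough audit, a metric tagged with 50 operation names, 5 status codes, and 10 client versions could produce up to 50 × 5 × 10 = 2500 combinations, which already exceeds the cap. Dropping or bucketing one attribute is usually the simplest way to bring the worst case back under the limit.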