
Cutting Telemetry Latency by 83%: What Actually Worked


When I joined the IOS-XR telemetry team at Cisco, the pipeline had a dirty secret: data was 30 minutes stale by the time it reached operators. For a carrier-grade router OS, that’s a long time. A routing flap, a line card fault, a congestion event — all of it was invisible for half an hour.

The team had accepted this. The latency was "known." I wanted to understand why.

The Actual Problem

Telemetry in IOS-XR works by collecting metrics from dozens of subsystems — interface counters, BGP state, MPLS labels, CPU utilization — and streaming them out via gRPC. Each subsystem speaks to the telemetry daemon via IPC (inter-process communication).

The issue was that IPC calls were happening one at a time, synchronously. For each data item, the telemetry daemon would: send a request, wait for a response, serialize the result, move to the next item. With hundreds of paths subscribed, this was catastrophically sequential.
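
To make the shape of the problem concrete, here is a minimal sketch of a sequential collection cycle. The names (`fetchViaIPC`, the path strings) and the 2 ms round-trip figure are hypothetical stand-ins for illustration, not the actual IOS-XR interfaces.

```go
// Sketch of a per-path, synchronous collection loop: one IPC round trip
// per subscribed path, with the daemon idle while it waits.
package main

import (
	"fmt"
	"time"
)

// fetchViaIPC stands in for one synchronous request/response round trip
// to a subsystem; the 2ms delay simulates the per-call cost.
func fetchViaIPC(path string) string {
	time.Sleep(2 * time.Millisecond)
	return "value-for-" + path
}

func main() {
	// A few hundred subscribed paths, collected one at a time.
	var subscribedPaths []string
	for i := 0; i < 500; i++ {
		subscribedPaths = append(subscribedPaths, fmt.Sprintf("path-%d", i))
	}

	start := time.Now()
	for _, p := range subscribedPaths {
		_ = fetchViaIPC(p) // request, wait, serialize, move to the next item
	}
	// 500 paths x ~2ms per round trip is roughly a second per cycle,
	// before any serialization or transport cost is counted.
	fmt.Println("sequential cycle took", time.Since(start))
}
```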

The second problem was data that didn’t change. We were re-fetching static configuration data — things like interface descriptions, VRF mappings, static route configurations — on every collection cycle. That data almost never changes.

Fix 1: Batched IPC

The first change was to batch IPC requests. Instead of one request per data path, we grouped requests by subsystem and sent them as a batch. This reduced the round-trip overhead by an order of magnitude. The subsystem could also optimize internally when it saw a batch — walking its data structures once instead of once per request.

This alone cut latency significantly. The synchronous wall of sequential calls collapsed into a handful of parallel bursts.
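
As a rough illustration of the batching idea, here is a sketch that groups subscribed paths by their owning subsystem and pays one simulated round trip per group instead of one per path. `batchFetch`, `subsystemOf`, and the example paths are hypothetical, not the real IPC surface.

```go
// Group subscribed paths by subsystem, then issue one batched request per
// subsystem instead of one request per path.
package main

import (
	"fmt"
	"strings"
	"time"
)

// batchFetch simulates one IPC round trip that returns values for a whole
// batch of paths owned by a single subsystem.
func batchFetch(subsystem string, paths []string) map[string]string {
	time.Sleep(2 * time.Millisecond) // one round trip for the whole batch
	out := make(map[string]string, len(paths))
	for _, p := range paths {
		out[p] = "value-for-" + p
	}
	return out
}

// subsystemOf derives the owning subsystem from the path prefix,
// e.g. "bgp/neighbors/192.0.2.1/state" -> "bgp".
func subsystemOf(path string) string {
	return strings.SplitN(path, "/", 2)[0]
}

func main() {
	paths := []string{
		"interfaces/GigabitEthernet0/0/0/0/counters",
		"interfaces/GigabitEthernet0/0/0/1/counters",
		"bgp/neighbors/192.0.2.1/state",
		"mpls/label-bindings/16001",
	}

	// Group by subsystem, then one round trip per group.
	groups := make(map[string][]string)
	for _, p := range paths {
		sub := subsystemOf(p)
		groups[sub] = append(groups[sub], p)
	}

	start := time.Now()
	results := make(map[string]string)
	for subsystem, batch := range groups {
		for k, v := range batchFetch(subsystem, batch) {
			results[k] = v
		}
	}
	fmt.Printf("%d paths fetched in %d round trips (%v)\n",
		len(results), len(groups), time.Since(start))
}
```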

Fix 2: In-Memory Caching for Static Data

For configuration data that rarely changes, we added a read-through cache with a short TTL. The telemetry daemon now serves cached results for most configuration paths, only hitting the subsystem IPC when the cache misses or expires.

Cache coherency was the hard part. IOS-XR is a distributed system; configuration can change on any node, and we needed cache invalidation to be consistent. We solved this by subscribing to configuration change events and invalidating relevant cache entries proactively — so the TTL was a safety net, not the primary mechanism.
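
Here is a minimal sketch of that pattern: a TTL-bounded cache in front of the subsystem fetch, with an invalidation hook meant to be driven by configuration change events. `ConfigCache` and the fetch callback are illustrative, not the daemon's actual code.

```go
// TTL-bounded cache for rarely-changing configuration data, with explicit
// invalidation so the TTL is only a safety net.
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value   string
	fetched time.Time
}

type ConfigCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	data  map[string]entry
	fetch func(path string) string // falls through to the subsystem IPC
}

func NewConfigCache(ttl time.Duration, fetch func(string) string) *ConfigCache {
	return &ConfigCache{ttl: ttl, data: make(map[string]entry), fetch: fetch}
}

// Get serves from the cache while the entry is fresh; otherwise it falls
// through to the subsystem and repopulates the cache.
func (c *ConfigCache) Get(path string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.data[path]; ok && time.Since(e.fetched) < c.ttl {
		return e.value
	}
	v := c.fetch(path)
	c.data[path] = entry{value: v, fetched: time.Now()}
	return v
}

// Invalidate would be called from the configuration-change event
// subscription, so stale entries are evicted immediately.
func (c *ConfigCache) Invalidate(path string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.data, path)
}

func main() {
	cache := NewConfigCache(30*time.Second, func(path string) string {
		fmt.Println("IPC fetch:", path)
		return "description for " + path
	})

	cache.Get("interfaces/Gi0/0/0/0/description")        // miss: hits IPC
	cache.Get("interfaces/Gi0/0/0/0/description")        // hit: served from cache
	cache.Invalidate("interfaces/Gi0/0/0/0/description") // config changed
	cache.Get("interfaces/Gi0/0/0/0/description")        // miss again: re-fetches
}
```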

Fix 3: Bulk API Patterns

Some subsystems had both a per-item API and a bulk API that returned the full data set in one shot. We audited every IPC interface and switched to bulk APIs wherever they existed. The difference was dramatic: one call to get all MPLS label bindings versus one call per binding.
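
A toy comparison of the two API shapes, using MPLS label bindings as in the example above. `LabelStore`, `GetBinding`, and `GetAllBindings` are hypothetical stand-ins for a subsystem's per-item and bulk IPC calls.

```go
// Per-item vs. bulk retrieval: N round trips versus one.
package main

import (
	"fmt"
	"time"
)

type LabelStore struct{ bindings map[int]string }

// GetBinding is the per-item API: one simulated round trip per label.
func (s *LabelStore) GetBinding(label int) string {
	time.Sleep(2 * time.Millisecond)
	return s.bindings[label]
}

// GetAllBindings is the bulk API: one round trip for the whole table.
func (s *LabelStore) GetAllBindings() map[int]string {
	time.Sleep(2 * time.Millisecond)
	out := make(map[int]string, len(s.bindings))
	for k, v := range s.bindings {
		out[k] = v
	}
	return out
}

func main() {
	store := &LabelStore{bindings: map[int]string{}}
	for i := 16000; i < 16500; i++ {
		store.bindings[i] = fmt.Sprintf("prefix-%d", i)
	}

	start := time.Now()
	for label := range store.bindings {
		_ = store.GetBinding(label)
	}
	fmt.Println("per-item:", time.Since(start)) // ~500 round trips

	start = time.Now()
	_ = store.GetAllBindings()
	fmt.Println("bulk:    ", time.Since(start)) // 1 round trip
}
```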

Result

Together, these three changes reduced end-to-end latency from 30 minutes to under 5 minutes, an 83% reduction. System throughput improved by 70% because we incurred far less IPC overhead per unit of data collected.

The lesson: before optimizing, understand the actual bottleneck. We didn’t need a new architecture. We needed to stop doing obviously inefficient things at scale.