Raising the Bar: New SLOs at 99 % (and 99.9 % for Overall Health)

created: Sunday, Jun 15, 2025

TL;DR

| Metric | Old SLO | New SLO |
| --- | --- | --- |
| Any DTZ customer-facing service | 95 % | 99 % |
| dtz overall health (aggregated heartbeat) | 95 % | 99.9 % |

The new objectives take effect **1 July 2025** and will be measured over the same rolling 30-day window you already know from the [status page](https://status.dtz.rocks).

Why we’re ready for an extra nine

Over the past year our platform has quietly evolved from “promising” to “battle-hardened”:

The data speaks for itself: since 1 April we’ve logged 11 production incidents totalling 7 h 28 m of downtime. Over that 75-day stretch (1,800 h) this works out to roughly 99.59 % availability, already above the new 99 % service target.
Overall health is rock solid: The aggregated dtz overall health probe has been unavailable for only 16 m in 2025 to date, which translates to better than 99.99 % availability.
Mean time to recovery (MTTR) shrank 42 % thanks to automatic rollbacks, blue/green deploys and a growing suite of smoke tests.
Observability everywhere: Every critical path now emits RED metrics (rate, errors, duration), and SLO burn alerts feed directly into on-call Slack channels.
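For readers less familiar with burn alerts, here is a minimal sketch of the usual burn-rate calculation: the observed share of bad time divided by the share the SLO allows. A value of 1.0 means the budget is being spent exactly as fast as the target permits. The function name and the sample inputs are illustrative assumptions; the thresholds our alerts actually use are not covered here.

```python
def burn_rate(bad_seconds: float, window_seconds: float, slo_percent: float) -> float:
    """Ratio of the observed failure share to the share the SLO allows.

    1.0 -> budget is consumed exactly at the allowed pace
    >1.0 -> budget would run out before the SLO window ends
    """
    observed_ratio = bad_seconds / window_seconds
    allowed_ratio = (100 - slo_percent) / 100
    return observed_ratio / allowed_ratio


# Illustrative inputs: 36 s of failing probes within one hour.
one_hour = 3600
print(f"99 %   target: {burn_rate(36, one_hour, 99):.1f}x")
print(f"99.9 % target: {burn_rate(36, one_hour, 99.9):.1f}x")
```

The same 36 seconds of failure burns budget ten times faster against a 99.9 % target than against a 99 % one, which is why the tighter objective needs faster alerting.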

What changes for you

Tighter error budgets. At 99 % availability a service may now be down for about 7 h 12 m per rolling 30-day window (previously about 36 h at 95 %). For the 99.9 % overall-health check the allowance is just 43 m.
Faster incident response. Pager thresholds are being shortened from 3 m to 60 s of failing probes so we can act before you notice.
Transparent credits. If we breach the SLO, service credits will land automatically—no ticket required. The updated ToS goes live next week.
Richer public telemetry. Latency percentiles and burn-rate graphs will be added to each component on the status page so you can correlate issues with your own dashboards.
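If you want to reproduce the error-budget arithmetic for your own dashboards, the sketch below shows one way to do it. It assumes the window is exactly 30 days (43,200 minutes); the function names are our own illustration, not part of any DTZ API. Note that an average calendar month of about 30.44 days yields slightly larger figures, for example roughly 7 h 18 m instead of 7 h 12 m at 99 %.

```python
def error_budget_minutes(slo_percent: float, window_days: float = 30) -> float:
    """Allowed downtime in minutes for an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def fmt(minutes: float) -> str:
    """Render minutes as 'H h M m', rounded to the nearest minute."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours} h {mins} m" if hours else f"{mins} m"


for target in (95, 99, 99.9):
    print(f"{target} %: {fmt(error_budget_minutes(target))}")
```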

How we’ll stay inside budget

Redundant probes from three regions for every heartbeat.
Instant deploy rollbacks. 90 % of reversions already complete in under three minutes; the goal is sub-one-minute.
Chaos drills keep recovery playbooks fresh.
Sustainable ops, not wasteful ops. We continue to run on carbon-aware schedules; more nines do not mean more megawatts.
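To illustrate how redundant probes and a 60-second pager threshold can work together, here is a simplified sketch. It assumes a majority rule (a heartbeat counts as failing only when more than half of the regional probes fail) and placeholder region names; both are assumptions made for the example and may differ from the production setup.

```python
from typing import Dict, Optional

PAGE_AFTER_SECONDS = 60  # new pager threshold (previously 3 minutes)


def heartbeat_failing(results: Dict[str, bool]) -> bool:
    """True when a majority of regional probes report failure (assumed rule)."""
    failures = sum(1 for ok in results.values() if not ok)
    return failures * 2 > len(results)


class Pager:
    """Fires once the heartbeat has been failing continuously for the threshold."""

    def __init__(self, page_after: float = PAGE_AFTER_SECONDS) -> None:
        self.page_after = page_after
        self.failing_since: Optional[float] = None

    def observe(self, now: float, results: Dict[str, bool]) -> bool:
        """Record one probe round at time `now` (seconds); True means page."""
        if not heartbeat_failing(results):
            self.failing_since = None  # any healthy round resets the timer
            return False
        if self.failing_since is None:
            self.failing_since = now
        return now - self.failing_since >= self.page_after


# Hypothetical region names, for illustration only.
pager = Pager()
degraded = {"region-a": False, "region-b": False, "region-c": True}
for t in (0, 30, 60):
    print(t, pager.observe(t, degraded))
```

With a majority rule, a single flaky region cannot trigger a page on its own, which keeps the shorter threshold from producing more false alarms.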

A quick look at the numbers

Since 1 April 2025 we have seen:

11 incidents across five services.
Average incident length: 41 m.
Longest single outage: 1 h 5 m (objectstore, 06 April).
Latest 30-day window: 2 incidents, 1 h 9 m total downtime → 99.84 % availability.

These figures give us comfortable headroom to meet the new targets even before the upcoming redundancy upgrades land.

Thank you

Reliability isn’t a switch you flip—it’s the cumulative effect of design reviews, test coverage, observability and a crew that cares. Your bug reports and feature suggestions pushed us to raise the bar. Keep the feedback coming, and here’s to fewer pages, greener ops and one extra nine.