Raising the Bar: New SLOs at 99 % (and 99.9 % for Overall Health)
created: Sunday, Jun 15, 2025
TL;DR
| Metric | Old SLO | New SLO |
| --- | --- | --- |
| Any DTZ customer-facing service | 95 % | 99 % |
| dtz overall health (aggregated heartbeat) | 95 % | 99.9 % |
The new objectives take effect **1 July 2025** and will be measured over the same rolling 30-day window you already know from the [status page](https://status.dtz.rocks).
Over the past year our platform has quietly evolved from “promising” to “battle-hardened”:
- Data speaks: Since April 1 we’ve logged 11 production incidents totalling 7 h 28 m of downtime. Over that 75-day stretch (1,800 hours) this works out to roughly 99.59 % availability, already above the new 99 % service target.
- Overall health is rock solid: The aggregated dtz overall health probe has been unavailable for only 16 m in 2025 to date, which works out to better than 99.99 %.
- Mean time to recovery (MTTR) shrank 42 % thanks to automatic rollbacks, blue/green deploys and a growing suite of smoke tests.
- Observability everywhere: Every critical path now emits RED metrics (rate, errors, duration), and SLO burn alerts feed directly into on-call Slack channels.
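For readers new to the term, burn rate is commonly defined as the observed failure fraction divided by the fraction the SLO allows. A value of 1 uses up the budget exactly at the end of the window, and anything above 1 uses it up sooner. The Python sketch below shows only that arithmetic; the function name and its inputs are illustrative assumptions, not a description of our alerting pipeline.

```python
def burn_rate(bad_minutes: float, window_minutes: float, slo_percent: float) -> float:
    """Observed unavailability divided by the unavailability the SLO allows.

    Hypothetical helper for illustration only.
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    allowed_fraction = (100 - slo_percent) / 100
    if allowed_fraction <= 0:
        raise ValueError("a 100 % SLO leaves no budget to burn")
    observed_fraction = bad_minutes / window_minutes
    return observed_fraction / allowed_fraction


if __name__ == "__main__":
    # 6 failing minutes in the last hour, measured against a 99 % SLO.
    print(f"burn rate: {burn_rate(6, 60, 99):.1f}")
```

A sustained burn rate well above 1 over a short window is the usual signal that a budget is at risk long before the 30-day figure shows it.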
What changes for you
- Tighter error budgets. With 99 % availability a service may now be down for ~7 h 12 m per 30-day window (previously ~36 h). For the 99.9 % overall-health check the allowance is just ~43 m.
- Faster incident response. Pager thresholds are being shortened from 3 m to 60 s of failing probes so we can act before you notice.
- Transparent credits. If we breach the SLO, service credits will land automatically—no ticket required. The updated ToS goes live next week.
- Richer public telemetry. Latency percentiles and burn-rate graphs will be added to each component on the status page so you can correlate issues with your own dashboards.
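If you want to reproduce the error-budget arithmetic for your own dashboards, the following Python sketch derives the allowed downtime from an SLO percentage over a rolling 30-day window. The function names are made up for this example.

```python
WINDOW_MINUTES = 30 * 24 * 60  # rolling 30-day window = 43,200 minutes


def error_budget_minutes(slo_percent: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Downtime (in minutes) that an availability SLO permits over the window."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # Rounded to one decimal to hide floating-point noise.
    return round(window_minutes * (100 - slo_percent) / 100, 1)


def as_hours_minutes(minutes: float) -> str:
    """Format a minute count as 'H h M m', rounded to whole minutes."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours} h {mins} m"


if __name__ == "__main__":
    for slo in (95, 99, 99.9):
        budget = error_budget_minutes(slo)
        print(f"{slo} % -> {budget} min ({as_hours_minutes(budget)})")
```

For a 30-day window this yields 2,160 minutes at 95 %, 432 minutes at 99 % and 43.2 minutes at 99.9 %.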
How we’ll stay inside budget
- Redundant probes from three regions for every heartbeat.
- Instant deploy rollbacks. 90 % of rollbacks already complete in under three minutes; the goal is under one minute.
- Chaos drills keep recovery playbooks fresh.
- Sustainable ops, not wasteful ops. We continue to run on carbon-aware schedules; more nines do not mean more megawatts.
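To illustrate how three-region probes and a 60-second pager threshold could fit together, here is a minimal Python sketch. It assumes a majority rule (a probe round counts as failed only when more than half of the regions report failure) and pages once rounds have failed continuously for 60 seconds. The majority rule, the class and function names, and the region labels are assumptions for illustration; only the three regions and the 60 s threshold come from this post.

```python
from dataclasses import dataclass

PAGE_AFTER_SECONDS = 60  # pager threshold named in this post


@dataclass(frozen=True)
class ProbeRound:
    timestamp: float          # seconds since some epoch
    results: dict[str, bool]  # region name -> probe succeeded


def round_failed(probe_round: ProbeRound) -> bool:
    """Assumed rule: a round fails when a majority of regions fail."""
    failures = sum(1 for ok in probe_round.results.values() if not ok)
    return failures * 2 > len(probe_round.results)


def should_page(rounds: list[ProbeRound], threshold: float = PAGE_AFTER_SECONDS) -> bool:
    """True when the latest rounds have failed continuously for >= threshold seconds."""
    failing_since = None
    latest = None
    for r in sorted(rounds, key=lambda r: r.timestamp):
        latest = r.timestamp
        if round_failed(r):
            if failing_since is None:
                failing_since = r.timestamp
        else:
            failing_since = None  # any healthy round resets the streak
    if failing_since is None or latest is None:
        return False
    return latest - failing_since >= threshold


if __name__ == "__main__":
    down = {"region-a": False, "region-b": False, "region-c": True}
    rounds = [ProbeRound(t, down) for t in (0, 20, 40, 60)]
    print(should_page(rounds))
```

Requiring agreement between regions is one way to keep a single flaky vantage point from paging anyone while still reacting quickly to a real outage.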
A quick look at the numbers
Since 1 April 2025 we have seen:
- 11 incidents across five services.
- Average incident length: 41 m.
- Longest single outage: 1 h 5 m (objectstore, 06 April).
- Latest 30-day window: 2 incidents, 1 h 9 m total downtime → 99.84 % availability.
These figures give us comfortable head-room to meet the new targets even before the upcoming redundancy upgrades land.
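The availability figures are plain ratios of downtime to window length. The Python sketch below shows the calculation for the latest 30-day window; the function name is made up for this example, and it treats all reported downtime as counting against a single window, which is a simplifying assumption.

```python
def availability_percent(downtime_minutes: float, window_days: int) -> float:
    """Share of the window (in percent) during which the service was up."""
    window_minutes = window_days * 24 * 60
    if window_minutes <= 0:
        raise ValueError("window_days must be positive")
    if not 0 <= downtime_minutes <= window_minutes:
        raise ValueError("downtime must lie between 0 and the window length")
    return 100 * (1 - downtime_minutes / window_minutes)


if __name__ == "__main__":
    # Latest 30-day window: 1 h 9 m = 69 minutes of downtime.
    latest = availability_percent(downtime_minutes=69, window_days=30)
    print(f"latest 30-day availability: {latest:.2f} %")
```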
Thank you
Reliability isn’t a switch you flip—it’s the cumulative effect of design reviews, test coverage, observability and a crew that cares. Your bug reports and feature suggestions pushed us to raise the bar. Keep the feedback coming, and here’s to fewer pages, greener ops and one extra nine.