Raising the Bar: New SLOs at 99 % (and 99.9 % for Overall Health)
created: Sunday, Jun 15, 2025
TL;DR
| Metric | Old SLO | New SLO |
| --- | --- | --- |
| Any DTZ customer‑facing service | 95 % | 99 % |
| dtz overall health (aggregated heartbeat) | 95 % | 99.9 % |
The new objectives take effect **1 July 2025** and will be measured over the same rolling 30‑day window you already know from the [status page](https://status.dtz.rocks).
Over the past year our platform has quietly evolved from “promising” to “battle‑hardened”:
- Data speaks: Since 1 April we’ve logged 11 production incidents totalling 7 h 28 m of downtime. Over that 75‑day stretch this works out to roughly 99.59 % availability, already above the new 99 % target.
- Overall health is rock solid: The aggregated dtz overall health probe has been unavailable for only 16 m in 2025 to date, which translates to better than 99.99 % availability.
- Mean time to recovery (MTTR) shrank 42 % thanks to automatic rollbacks, blue/green deploys and a growing suite of smoke tests.
- Observability everywhere: Every critical path now emits RED metrics (rate, errors, duration), and SLO burn‑rate alerts feed directly into on‑call Slack channels.
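For readers new to burn‑rate alerting, here is a minimal sketch of the idea in Python. The function names and the alert threshold are illustrative assumptions, not the actual DTZ alerting configuration: a burn rate of 1.0 means the error budget would be used up exactly at the end of the SLO period, and anything above 1.0 means it would run out early.

```python
def burn_rate(bad_minutes: float, window_minutes: float, slo: float) -> float:
    """Return how fast the error budget is being consumed.

    `slo` is a fraction (0.99 for a 99 % objective). The result is the
    observed error rate divided by the error rate the SLO allows.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    observed_error_rate = bad_minutes / window_minutes
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate


def should_alert(rate: float, threshold: float = 10.0) -> bool:
    """Assumed policy: alert once the burn rate reaches the threshold."""
    return rate >= threshold


if __name__ == "__main__":
    # Hypothetical example: 3 failing minutes in the last hour, 99 % SLO.
    rate = burn_rate(bad_minutes=3, window_minutes=60, slo=0.99)
    print(f"burn rate: {rate:.1f}x, alert: {should_alert(rate)}")
```

Pairing a short window with a high threshold catches fast burns early, while a longer window with a lower threshold catches slow leaks; which combination fits depends on the service.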
What changes for you
- Tighter error budgets. At 99 % availability, a service may be down for ~7 h 12 m per 30‑day window (previously ~36 h at 95 %). For the 99.9 % overall‑health check, the allowance is just ~43 m.
- Faster incident response. Pager thresholds are being shortened from 3 m to 60 s of failing probes so we can act before you notice.
- Transparent credits. If we breach the SLO, service credits will land automatically—no ticket required. The updated ToS goes live next week.
- Richer public telemetry. Latency percentiles and burn‑rate graphs will be added to each component on the status page so you can correlate issues with your own dashboards.
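The error‑budget figures above follow directly from the SLO percentage and the rolling 30‑day window. A small Python sketch of the arithmetic (the function name is our own choice for illustration):

```python
from datetime import timedelta

# The post measures SLOs over a rolling 30-day window.
WINDOW = timedelta(days=30)


def error_budget(slo_percent: float, window: timedelta = WINDOW) -> timedelta:
    """Return the downtime an availability SLO allows within the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    return window * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (95.0, 99.0, 99.9):
        print(f"{slo} % -> {error_budget(slo)} allowed downtime per 30 days")
```

Passing a different `window` shows how the budget scales; a 7‑day window at 99 %, for instance, allows a little over an hour and a half.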
How we’ll stay inside budget
- Redundant probes from three regions for every heartbeat.
- Instant deploy rollbacks. 90 % of reversions already complete in under three minutes; the goal is sub‑one‑minute.
- Chaos drills keep recovery playbooks fresh.
- Sustainable ops, not wasteful ops. We continue to run on carbon‑aware schedules; more nines do not mean more megawatts.
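To make the combination of multi‑region probes and the 60 s pager threshold concrete, here is a hedged Python sketch. The majority‑vote rule, the region names and the data layout are assumptions made for illustration; the post does not describe how the real probes are aggregated.

```python
PAGE_AFTER_SECONDS = 60  # from the post: page after 60 s of failing probes


def heartbeat_ok(region_results: dict[str, bool]) -> bool:
    """Assumed rule: a heartbeat is healthy if a majority of regions succeed."""
    successes = sum(region_results.values())
    return successes * 2 > len(region_results)


def should_page(samples: list[tuple[float, dict[str, bool]]]) -> bool:
    """Decide whether to page based on probe samples, oldest first.

    Each sample is (timestamp_in_seconds, per-region probe results).
    Pages when the latest unbroken run of failing heartbeats spans
    at least PAGE_AFTER_SECONDS.
    """
    failing_since = None
    for timestamp, results in samples:
        if heartbeat_ok(results):
            failing_since = None  # any healthy heartbeat resets the run
        elif failing_since is None:
            failing_since = timestamp
    if failing_since is None:
        return False
    return samples[-1][0] - failing_since >= PAGE_AFTER_SECONDS


if __name__ == "__main__":
    # Hypothetical region names; one failing region alone does not count.
    healthy = {"eu": True, "us": True, "ap": False}
    failing = {"eu": False, "us": False, "ap": True}
    samples = [(0, healthy), (30, failing), (60, failing), (90, failing)]
    print("page on-call:", should_page(samples))
```

Requiring agreement between regions trades a little detection speed for fewer false pages caused by a single flaky probe location.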
A quick look at the numbers
Since 1 April 2025 we have seen:
- 11 incidents across five services.
- Average incident length: 41 m.
- Longest single outage: 1 h 5 m (objectstore, 06 April).
- Latest 30‑day window: 2 incidents, 1 h 9 m total downtime → 99.84 % availability.
These figures give us comfortable head‑room to meet the new targets even before the upcoming redundancy upgrades land.
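If you want to recompute availability from downtime totals like the ones listed here, or from your own incident log, the calculation is short. The sketch below is illustrative only; the function names are our own, and it assumes availability is simply the share of the window without downtime.

```python
from datetime import timedelta


def availability_percent(downtime: timedelta, window: timedelta) -> float:
    """Return the percentage of the window during which the service was up."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    if downtime < timedelta(0) or downtime > window:
        raise ValueError("downtime must be between zero and the window length")
    return 100 * (1 - downtime / window)


def meets_slo(downtime: timedelta, window: timedelta, slo_percent: float) -> bool:
    """Check a downtime total against an availability objective."""
    return availability_percent(downtime, window) >= slo_percent


if __name__ == "__main__":
    # Latest 30-day window from the list above: 1 h 9 m of downtime.
    downtime = timedelta(hours=1, minutes=9)
    window = timedelta(days=30)
    print(f"availability: {availability_percent(downtime, window):.2f} %")
    print("meets 99 % target:", meets_slo(downtime, window, 99.0))
```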
Thank you
Reliability isn’t a switch you flip—it’s the cumulative effect of design reviews, test coverage, observability and a crew that cares. Your bug reports and feature suggestions pushed us to raise the bar. Keep the feedback coming, and here’s to fewer pages, greener ops and one extra nine.