How to Calculate and Improve Your Website's Uptime SLA
Learn how to calculate uptime percentages, set realistic SLA targets, and implement strategies to improve your service reliability.
“We guarantee 99.9% uptime.” It is a common promise, but do you know what it actually means? How do you calculate it? How do you know if you are meeting it? And what does it take to go from 99.9% to 99.99%?
This guide covers the math, the measurement, and the practical strategies for improving your uptime SLA.
Understanding Uptime Percentages
Uptime is expressed as a percentage of time your service is available within a given period. The “nines” are shorthand:
| SLA | Common Name | Monthly Downtime | Annual Downtime |
|---|---|---|---|
| 99% | Two nines | 7h 18m | 3d 15h 36m |
| 99.5% | Two and a half nines | 3h 39m | 1d 19h 48m |
| 99.9% | Three nines | 43m 49s | 8h 45m 57s |
| 99.95% | Three and a half nines | 21m 55s | 4h 22m 58s |
| 99.99% | Four nines | 4m 23s | 52m 36s |
| 99.999% | Five nines | 26s | 5m 15s |
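Monthly figures in the table assume an average month of about 30.44 days, so they differ slightly from the 30-day calculations later in this guide. Each additional nine cuts the allowed downtime by a factor of 10, and as a rule of thumb each one is considerably harder to achieve than the last.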
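If you want to reproduce downtime allowances like these for your own targets, a minimal sketch follows. The function name `allowed_downtime` and the average-month and year lengths are assumptions made for illustration; results may differ from the table by a few seconds depending on the period length and rounding you choose.

```python
from datetime import timedelta

# Assumed period lengths: an average month (365.25 / 12 days) and a 365.25-day year.
AVG_MONTH = timedelta(days=365.25 / 12)
YEAR = timedelta(days=365.25)


def allowed_downtime(sla_percent: float, period: timedelta) -> timedelta:
    """Return the most downtime a period can contain while still meeting the SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
        monthly = allowed_downtime(sla, AVG_MONTH)
        yearly = allowed_downtime(sla, YEAR)
        print(f"{sla}%  monthly: {monthly}  yearly: {yearly}")
```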
The Math
Basic Calculation
Uptime % = ((Total Minutes - Downtime Minutes) / Total Minutes) * 100
For a 30-day month (43,200 minutes):
Required uptime at 99.9% = 43,200 * 0.999 = 43,156.8 minutes
Allowed downtime = 43,200 - 43,156.8 = 43.2 minutes
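The same calculation as a small function, in case you want to run it against your own numbers. The name `uptime_percent` and the input validation are illustrative choices, not part of any particular tool.

```python
def uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Uptime % = ((total - downtime) / total) * 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes * 100


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200

if __name__ == "__main__":
    # 43.2 minutes of downtime in a 30-day month is exactly the 99.9% threshold.
    print(round(uptime_percent(MINUTES_IN_30_DAYS, 43.2), 3))
```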
How to Measure Downtime
Not all monitoring tools measure downtime the same way. Key considerations:
Check interval matters: A monitoring tool that checks every 5 minutes might miss a 3-minute outage entirely, or might record a 5-minute outage when the actual downtime was 2 minutes.
With 30-second checks (as StatusApp provides), the recorded start and end of an outage are each accurate to within about 30 seconds. With 5-minute checks, each can be off by up to 5 minutes.
Confirmation checks: Most monitoring tools require multiple failed checks before recording downtime. This prevents false positives but means the recorded downtime starts slightly after the actual downtime began.
Partial outages: How do you measure a situation where 50% of requests succeed? Is the service “up” or “down”? Most monitoring tools record it as “up” since their individual checks succeed, but users experience degradation.
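To see how much the check interval matters, here is a rough simulation. It assumes a simplified poller that checks at fixed times and records downtime from the first failed check to the next successful one, with no confirmation checks. The function `measured_downtime` and its parameters are hypothetical and do not describe how any specific monitoring tool works.

```python
def measured_downtime(outage_start: int, outage_end: int, interval: int,
                      offset: int = 0, horizon: int = 3600) -> int:
    """Seconds of downtime a fixed-interval poller would record.

    The outage covers outage_start <= t < outage_end (in seconds).
    Checks run at offset, offset + interval, ... up to horizon.
    """
    t = offset
    first_fail = None
    while t <= horizon:
        down = outage_start <= t < outage_end
        if down and first_fail is None:
            first_fail = t          # first failed check marks the recorded start
        elif not down and first_fail is not None:
            return t - first_fail   # first passing check marks the recorded end
        t += interval
    # Either no check ever failed, or the outage outlasted the simulation window.
    return 0 if first_fail is None else horizon - first_fail


if __name__ == "__main__":
    # A 3-minute outage that falls between two 5-minute checks is missed entirely.
    print(measured_downtime(60, 240, interval=300))
    # A 2-minute outage that straddles one 5-minute check is recorded as 5 minutes.
    print(measured_downtime(290, 410, interval=300))
    # The same outage measured with 30-second checks.
    print(measured_downtime(290, 410, interval=30))
```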
Composite SLAs
Your application depends on multiple services, each with its own reliability:
Overall Availability = Service A Availability * Service B Availability * ...
If your app depends on a web server (99.99%), a database (99.99%), and a third-party payment API (99.95%):
99.99% * 99.99% * 99.95% = 99.93%
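This formula assumes every dependency is required to serve a request and that failures are independent. Under those assumptions, every dependency reduces your composite availability, and the result can never be higher than your least reliable dependency. That is why improving the weakest service usually pays off most.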
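A short sketch of the composite calculation, assuming serial dependencies with independent failures (the function name `composite_availability` is illustrative):

```python
from math import prod


def composite_availability(*availabilities_percent: float) -> float:
    """Multiply the availabilities of services that must all be up at once."""
    if not availabilities_percent:
        raise ValueError("provide at least one availability")
    return prod(a / 100 for a in availabilities_percent) * 100


if __name__ == "__main__":
    # Web server, database, and third-party payment API from the example above.
    print(round(composite_availability(99.99, 99.99, 99.95), 2))  # 99.93
```

Redundant components (for example, two replicas where either one can serve traffic) follow a different formula and are not covered by this sketch.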
Setting Realistic SLA Targets
What Most Teams Should Target
- 99% (7.3 hours/month): Reasonable for non-critical internal tools
- 99.5% (3.6 hours/month): Good for most B2B SaaS products
- 99.9% (43.8 minutes/month): Standard target for production SaaS
- 99.95% (21.9 minutes/month): Good for business-critical services
- 99.99% (4.4 minutes/month): Requires significant engineering investment
- 99.999% (26 seconds/month): Requires redundancy at every level; most organizations cannot achieve this
Do Not Over-Promise
If you have never measured your actual uptime, do not promise 99.99%. Start by measuring for 3-6 months, then set your SLA target slightly below your actual performance.
Promising 99.9% when your actual availability is 99.7% means you are in breach of your SLA from day one.
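One way to turn "slightly below your actual performance" into a rule is to pick the highest standard tier that sits below your worst measured month. This is only one possible interpretation; the tier list, the use of the worst month, and the name `suggest_target` are all assumptions for illustration.

```python
# Standard tiers from the table earlier in this guide.
TIERS = (99.0, 99.5, 99.9, 99.95, 99.99, 99.999)


def suggest_target(monthly_uptimes_percent):
    """Return the highest tier strictly below the worst measured month, or None."""
    if not monthly_uptimes_percent:
        raise ValueError("need at least one month of measurements")
    worst = min(monthly_uptimes_percent)
    eligible = [tier for tier in TIERS if tier < worst]
    return max(eligible) if eligible else None


if __name__ == "__main__":
    # Hypothetical measurements: the worst month was 99.92%, so 99.9% is the candidate.
    print(suggest_target([99.97, 99.92, 99.99]))
    # With a worst month of 99.7%, promising 99.9% would already be a breach.
    print(suggest_target([99.7, 99.8]))
```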
Measuring Your Current Uptime
Step 1: Set Up Monitoring
You cannot improve what you do not measure. Set up monitoring for every critical service:
- Website/app endpoints
- API endpoints
- Database connectivity
- Third-party dependencies
StatusApp provides continuous monitoring from 35+ global locations, giving you accurate uptime data from multiple geographic perspectives.
Step 2: Establish a Baseline
Run monitoring for at least 30 days before drawing any conclusions, and for the full 3-6 months before committing to an SLA. Short periods are not representative:
- A month with zero incidents might be followed by a month with three
- Seasonal patterns affect reliability (traffic spikes, maintenance windows)
- External factors (ISP outages, DDoS attacks) are unpredictable
Step 3: Analyze the Data
After baseline monitoring:
- What is your current uptime percentage?
- What caused downtime? (Deployments, infrastructure, third parties, human error)
- Are certain time periods worse than others?
- Which services are least reliable?
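Most of these questions come down to grouping incident records by some field and summing the downtime. A minimal sketch follows, assuming you can export incidents as records with a service, a cause, and a duration. The `Incident` type and the sample data are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Incident(NamedTuple):
    service: str
    cause: str
    minutes: int


def downtime_by(incidents, field: str):
    """Total downtime minutes grouped by the given field, largest first."""
    totals = Counter()
    for incident in incidents:
        totals[getattr(incident, field)] += incident.minutes
    return totals.most_common()


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        Incident("api", "deployment", 12),
        Incident("api", "third party", 7),
        Incident("database", "infrastructure", 15),
        Incident("website", "deployment", 18),
    ]
    print(downtime_by(sample, "cause"))    # which causes cost the most downtime
    print(downtime_by(sample, "service"))  # which services are least reliable
```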
Strategies to Improve Uptime
From 99% to 99.9% (Eliminate Obvious Failures)
Automated monitoring: Detect issues in seconds, not hours. Moving from “a customer reported it” to “automated alerting” eliminates hours of undetected downtime.
Blue-green deployments: Deploy new code to a standby environment, test it, then switch traffic. This removes most deployment-related downtime and makes rollback a matter of switching traffic back.
Database backups and tested recovery: Ensure you can recover from database failures. Test your recovery process quarterly.
SSL certificate monitoring: Automate renewal and monitor expiration. An expired certificate is 100% preventable downtime.
DNS redundancy: Use multiple nameserver providers or a provider with global anycast.
From 99.9% to 99.95% (Reduce Incident Duration)
Faster detection: Move from 5-minute checks to 30-second checks. Every minute of faster detection is a minute less downtime.
Runbooks: Documented procedures reduce mean time to recovery (MTTR).
Auto-scaling: Handle traffic spikes without manual intervention.
Health checks and auto-restart: Configure your container runtime, orchestrator, or process supervisor (Docker, Kubernetes, systemd) to restart failed processes automatically.
Staged rollouts: Deploy to a small percentage of traffic first, then gradually increase.
From 99.95% to 99.99% (Eliminate Single Points of Failure)
Multi-region deployment: Run your application in at least two geographic regions.
Database replication: Primary with automatic failover to replica.
Load balancing: Distribute traffic across multiple application instances.
CDN: Serve static content from edge locations to reduce origin load.
Circuit breakers: Prevent cascading failures when dependencies go down.
Chaos engineering: Deliberately inject failures to test your resilience.
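Of the techniques above, the circuit breaker is the easiest to show in a few lines. The sketch below is a deliberately simple version: after a number of consecutive failures it stops calling the dependency for a cool-down period, then lets one trial call through. The class name, thresholds, and use of `RuntimeError` are illustrative assumptions; production systems typically rely on an established library.

```python
import time


class CircuitBreaker:
    """Skip calls to a failing dependency until a cool-down period has passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each dependency call, for example `breaker.call(fetch_payment_status, order_id)` where `fetch_payment_status` is a hypothetical function, and fall back to a cached or degraded response when the breaker raises.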
Beyond 99.99% (Extreme Measures)
Multi-cloud: Run across two cloud providers simultaneously (for example, AWS and GCP).
Global load balancing: Route traffic to the closest healthy region.
Automated failover: Zero-human-intervention recovery from failures.
Immutable infrastructure: Never patch servers; replace them.
This level of reliability requires significant engineering investment and is only justified for services where downtime has extreme consequences (payment processing, emergency services, critical infrastructure).
SLA in Practice
Error Budgets
Instead of treating your SLA as a hard line, use an error budget:
Error Budget = 100% - SLA Target
For a 99.9% SLA over a month:
Error Budget = 0.1% of 43,200 minutes = 43.2 minutes
You have 43.2 minutes of downtime to “spend” each month. This creates a healthy tension:
- If you have budget remaining, you can ship riskier changes
- If you are running low on budget, slow down and focus on reliability
- If you have exhausted your budget, freeze all non-essential changes
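The budget and the three rules above can be sketched as follows. The 25% threshold for "running low" is an arbitrary assumption, as are the function names. Pick a threshold that matches your own release cadence.

```python
def error_budget_minutes(sla_percent: float, period_minutes: float = 43_200) -> float:
    """Error budget = (100% - SLA target) of the period, in minutes."""
    return period_minutes * (100 - sla_percent) / 100


def release_policy(sla_percent: float, downtime_minutes: float,
                   period_minutes: float = 43_200, low_fraction: float = 0.25) -> str:
    """Map remaining error budget to one of the three policies described above."""
    budget = error_budget_minutes(sla_percent, period_minutes)
    remaining = budget - downtime_minutes
    if remaining <= 0:
        return "freeze non-essential changes"
    if remaining / budget <= low_fraction:   # assumed threshold for "running low"
        return "slow down and focus on reliability"
    return "budget available for riskier changes"


if __name__ == "__main__":
    print(round(error_budget_minutes(99.9), 1))   # 43.2
    for spent in (10, 40, 50):
        print(spent, "minutes down:", release_policy(99.9, spent))
```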
SLA Credits
If you offer SLA credits (refunds for missed SLA), define clear terms:
- How is downtime measured? (Your monitoring data? Customer reports?)
- What qualifies as downtime? (Complete outage? Degraded performance?)
- What is the credit amount? (Pro-rated? Fixed percentage?)
- How do customers claim credits?
Excluding Scheduled Maintenance
Most SLAs exclude planned maintenance. Be transparent about:
- How much advance notice you provide
- How much maintenance is allowed per month
- Whether maintenance windows count toward downtime if they overrun
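How you answer these questions changes the reported number. The sketch below shows one possible policy, not a standard: maintenance within the agreed allowance is removed from the measurement period, and any overrun counts as downtime. The function name and parameters are hypothetical.

```python
def sla_uptime_percent(period_minutes: float, unplanned_downtime: float,
                       maintenance_used: float, maintenance_allowance: float) -> float:
    """Uptime % with allowed maintenance excluded and overruns counted as downtime."""
    excluded = min(maintenance_used, maintenance_allowance)
    overrun = max(0, maintenance_used - maintenance_allowance)
    measured_period = period_minutes - excluded
    if measured_period <= 0:
        raise ValueError("maintenance allowance covers the whole period")
    downtime = unplanned_downtime + overrun
    return (measured_period - downtime) / measured_period * 100


if __name__ == "__main__":
    # 30-day month, 20 minutes of unplanned downtime, and a 90-minute
    # maintenance window against a 60-minute allowance (30 minutes of overrun).
    print(round(sla_uptime_percent(43_200, 20, 90, 60), 2))  # 99.88
```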
Tracking and Reporting
Use your monitoring platform’s analytics to generate SLA reports:
- Monthly uptime percentage per service
- Incident count and duration
- Response time trends (performance SLA, not just availability SLA)
- Regional availability (if you serve a global audience)
StatusApp’s analytics provide all of these data points, making SLA reporting straightforward for your team and your customers.
Start measuring your uptime with confidence. Try StatusApp free and get accurate SLA data from 35+ global monitoring locations.