How to Calculate and Improve Your Website's Uptime SLA

“We guarantee 99.9% uptime.” It is a common promise, but do you know what it actually means? How do you calculate it? How do you know if you are meeting it? And what does it take to go from 99.9% to 99.99%?

This guide covers the math, the measurement, and the practical strategies for improving your uptime SLA.

Understanding Uptime Percentages

Uptime is expressed as the percentage of time your service is available within a given period. The “nines” are shorthand for common targets. The monthly figures below assume an average month of about 30.44 days:

SLA	Common Name	Monthly Downtime	Annual Downtime
99%	Two nines	7h 18m	3d 15h 40m
99.5%	Two and a half nines	3h 39m	1d 19h 50m
99.9%	Three nines	43m 49s	8h 45m 57s
99.95%	Three and a half nines	21m 55s	4h 22m 58s
99.99%	Four nines	4m 23s	52m 36s
99.999%	Five nines	26s	5m 15s

Each additional nine cuts the allowed downtime by a factor of 10 and, as a rule of thumb, is roughly 10x harder to achieve.

The Math

Basic Calculation

Uptime % = ((Total Minutes - Downtime Minutes) / Total Minutes) * 100

For a 30-day month (43,200 minutes) at a 99.9% target:

Allowed downtime = 43,200 * 0.001 = 43.2 minutes
Required uptime  = 43,200 - 43.2  = 43,156.8 minutes
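The same arithmetic can be written as a small helper. This is a minimal sketch; the function names are illustrative and not part of any monitoring tool's API.

```python
def uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Uptime % = ((total - downtime) / total) * 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(sla_percent: float, total_minutes: float) -> float:
    """Downtime allowance implied by an SLA target over a period."""
    return total_minutes * (100 - sla_percent) / 100


MINUTES_30_DAY_MONTH = 30 * 24 * 60  # 43,200

allowance = allowed_downtime_minutes(99.9, MINUTES_30_DAY_MONTH)
achieved = uptime_percent(MINUTES_30_DAY_MONTH, 43.2)
print(f"allowance: {allowance:.1f} min, uptime at 43.2 min down: {achieved:.2f}%")
```

Passing an average month of 43,830 minutes (365.25 days / 12) instead gives about 43.8 minutes, which is why the table above shows a slightly larger monthly allowance than the 30-day figure.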

How to Measure Downtime

Not all monitoring tools measure downtime the same way. Key considerations:

Check interval matters: A monitoring tool that checks every 5 minutes might miss a 3-minute outage entirely, or might record a 5-minute outage when the actual downtime was 2 minutes.

With 30-second checks (as StatusApp provides), the start and the end of each outage are each known to within about 30 seconds. With 5-minute checks, each boundary can be off by up to 5 minutes.

Confirmation checks: Most monitoring tools require multiple failed checks before recording downtime. This prevents false positives but means the recorded downtime starts slightly after the actual downtime began.

Partial outages: How do you measure a situation where 50% of requests succeed? Is the service “up” or “down”? Many monitoring tools record it as “up” whenever their own check happens to succeed, even though users experience degradation.
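To see how the check interval distorts the numbers, here is a rough simulation. It assumes a deliberately simplified model in which every failed probe is billed as one full interval of downtime; real tools differ (confirmation checks, retries), so treat it as an illustration only.

```python
def measured_downtime(outages, interval, horizon, offset=0):
    """Downtime in seconds that a fixed-interval prober would record.

    outages: list of (start, end) second offsets, half-open [start, end).
    Simplified model: each failed probe counts as one full interval.
    """
    failed = 0
    t = offset
    while t < horizon:
        if any(start <= t < end for start, end in outages):
            failed += 1
        t += interval
    return failed * interval


HOUR = 3600

# A 3-minute outage that falls between two 5-minute probes goes unseen.
missed = measured_downtime([(60, 240)], interval=300, horizon=HOUR)

# A 2-minute outage that straddles a 5-minute probe is billed as 5 minutes.
inflated = measured_downtime([(290, 410)], interval=300, horizon=HOUR)

# The same 2-minute outage with 30-second probes is measured as 2 minutes.
precise = measured_downtime([(290, 410)], interval=30, horizon=HOUR)

print(missed, inflated, precise)
```

The three calls reproduce the cases described above: a missed outage, an overstated outage, and a close measurement with shorter intervals.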

Composite SLAs

Your application depends on multiple services, each with its own reliability:

Overall Availability = Service A Availability * Service B Availability * ...

If your app depends on a web server (99.99%), a database (99.99%), and a third-party payment API (99.95%):

99.99% * 99.99% * 99.95% = 99.93%

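Every dependency reduces your composite availability, which can never be higher than that of your least reliable required dependency. The formula assumes every request needs each service and that the services fail independently, so treat the result as an estimate.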
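The multiplication is easy to script when you have more than a handful of dependencies. A minimal sketch, assuming serial dependencies that fail independently (the function name is illustrative):

```python
from math import prod


def composite_availability(availabilities_percent):
    """Multiply per-service availabilities (in percent) into one figure."""
    return prod(a / 100 for a in availabilities_percent) * 100


# Web server, database, third-party payment API from the example above.
stack = [99.99, 99.99, 99.95]
print(f"{composite_availability(stack):.2f}%")
```

Adding a fourth 99.9% dependency to the same list drops the composite to roughly 99.83%, which shows how quickly the chain erodes.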

Setting Realistic SLA Targets

What Most Teams Should Target

99% (7.3 hours/month): Reasonable for non-critical internal tools
99.5% (3.7 hours/month): Good for most B2B SaaS products
99.9% (43.8 minutes/month): Standard target for production SaaS
99.95% (21.9 minutes/month): Good for business-critical services
99.99% (4.4 minutes/month): Requires significant engineering investment
99.999% (26 seconds/month): Requires redundancy at every level; most organizations cannot achieve this

Do Not Over-Promise

If you have never measured your actual uptime, do not promise 99.99%. Start by measuring for 3-6 months, then set your SLA target slightly below your actual performance.

Promising 99.9% when your actual availability is 99.7% means you are in breach of your SLA from day one.

Measuring Your Current Uptime

Step 1: Set Up Monitoring

You cannot improve what you do not measure. Set up monitoring for every critical service:

Website/app endpoints
API endpoints
Database connectivity
Third-party dependencies

StatusApp provides continuous monitoring from 35+ global locations, giving you accurate uptime data from multiple geographic perspectives.

Step 2: Establish a Baseline

Run monitoring for at least 30 days before drawing conclusions. Short periods are not statistically meaningful:

A month with zero incidents might be followed by a month with three
Seasonal patterns affect reliability (traffic spikes, maintenance windows)
External factors (ISP outages, DDoS attacks) are unpredictable

Step 3: Analyze the Data

After baseline monitoring:

What is your current uptime percentage?
What caused downtime? (Deployments, infrastructure, third parties, human error)
Are certain time periods worse than others?
Which services are least reliable?
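The questions above can be answered from a simple incident log. The sketch below uses made-up sample records and hypothetical field names (`service`, `cause`, `minutes`); adapt them to whatever your monitoring export actually contains.

```python
from collections import defaultdict

# Hypothetical sample data for one 30-day month.
incidents = [
    {"service": "api", "cause": "deployment", "minutes": 12},
    {"service": "api", "cause": "third party", "minutes": 25},
    {"service": "web", "cause": "deployment", "minutes": 8},
]


def downtime_by(records, key):
    """Total downtime minutes grouped by `key`, largest first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["minutes"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


def uptime_percent(records, period_minutes):
    """Uptime over the period, assuming incidents do not overlap."""
    down = sum(record["minutes"] for record in records)
    return (period_minutes - down) / period_minutes * 100


print(downtime_by(incidents, "cause"))
print(downtime_by(incidents, "service"))
print(f"{uptime_percent(incidents, 30 * 24 * 60):.3f}%")
```

Grouping by cause and by service points to where reliability work is likely to pay off first.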

Strategies to Improve Uptime

From 99% to 99.9% (Eliminate Obvious Failures)

Automated monitoring: Detect issues in seconds, not hours. Moving from “a customer reported it” to “automated alerting” eliminates hours of undetected downtime.

Blue-green deployments: Deploy new code to a standby environment, test it, then switch traffic. This removes most deployment-related downtime and makes rollback a matter of switching traffic back.

Database backups and tested recovery: Ensure you can recover from database failures. Test your recovery process quarterly.

SSL certificate monitoring: Automate renewal and monitor expiration. An expired certificate is 100% preventable downtime.

DNS redundancy: Use multiple nameserver providers or a provider with global anycast.
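As an illustration of the SSL certificate monitoring item above, here is a minimal expiry check built on Python's standard `ssl` module. The function names and the 14-day threshold are assumptions chosen for the example, not a recommended policy.

```python
import socket
import ssl
import time
from typing import Optional

WARN_DAYS = 14  # assumed alert threshold for this sketch


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left on a certificate, given its notAfter string."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field of the certificate served by host.

    The default context verifies the certificate, so an already expired
    certificate raises an SSL error here instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal(not_after: str, now: Optional[float] = None) -> bool:
    """True when the certificate expires within WARN_DAYS."""
    return days_until_expiry(not_after, now) <= WARN_DAYS
```

Running `needs_renewal(fetch_not_after("example.com"))` on a schedule and alerting on `True` is one way to wire this in; automated renewal should still be the primary safeguard.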

From 99.9% to 99.95% (Reduce Incident Duration)

Faster detection: Move from 5-minute checks to 30-second checks. Every minute of faster detection is a minute less downtime.

Runbooks: Documented procedures reduce mean time to recovery (MTTR).

Auto-scaling: Handle traffic spikes without manual intervention.

Health checks and auto-restart: Configure your container runtime, orchestrator, or process supervisor (Docker, Kubernetes, systemd) to restart failed processes automatically.

Staged rollouts: Deploy to a small percentage of traffic first, then gradually increase.

From 99.95% to 99.99% (Eliminate Single Points of Failure)

Multi-region deployment: Run your application in at least two geographic regions.

Database replication: Primary with automatic failover to replica.

Load balancing: Distribute traffic across multiple application instances.

CDN: Serve static content from edge locations to reduce origin load.

Circuit breakers: Prevent cascading failures when dependencies go down.

Chaos engineering: Deliberately inject failures to test your resilience.
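The circuit breaker idea mentioned above can be sketched in a few lines. This is a simplified, single-threaded illustration with assumed defaults (3 failures, 30-second cool-down); production libraries add locking, metrics, and finer-grained half-open handling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are short-circuited instead of reaching the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable for testing
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency calls are short-circuited")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each call to a flaky dependency in `breaker.call(...)` lets the application fail fast and serve a fallback instead of tying up threads on timeouts.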

Beyond 99.99% (Extreme Measures)

Multi-cloud: Run across two cloud providers (for example, AWS and GCP) simultaneously.

Global load balancing: Route traffic to the closest healthy region.

Automated failover: Zero-human-intervention recovery from failures.

Immutable infrastructure: Never patch servers; replace them.

This level of reliability requires significant engineering investment and is only justified for services where downtime has extreme consequences (payment processing, emergency services, critical infrastructure).

SLA in Practice

Error Budgets

Instead of treating your SLA as a hard line, use an error budget:

Error Budget = 100% - SLA Target

For a 99.9% SLA over a month:

Error Budget = 0.1% of 43,200 minutes = 43.2 minutes

You have 43.2 minutes of downtime to “spend” each month. This creates a healthy tension:

If you have budget remaining, you can ship riskier changes
If you are running low on budget, slow down and focus on reliability
If you have exhausted your budget, freeze all non-essential changes
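The three rules above map naturally onto a small policy function. In this sketch the 25% "running low" threshold is an assumption; the text does not define where "low" begins, so pick a value that suits your team.

```python
def error_budget_status(sla_percent, period_minutes, downtime_minutes,
                        low_threshold=0.25):
    """Return (budget, remaining, policy) for the current period.

    low_threshold: fraction of budget left at which to slow down
    (an assumed value for this sketch).
    """
    budget = period_minutes * (100 - sla_percent) / 100
    remaining = budget - downtime_minutes
    if remaining <= 0:
        policy = "freeze non-essential changes"
    elif remaining / budget <= low_threshold:
        policy = "slow down and focus on reliability"
    else:
        policy = "budget available for riskier changes"
    return budget, remaining, policy


for spent in (10, 40, 50):
    budget, remaining, policy = error_budget_status(99.9, 43_200, spent)
    print(f"spent {spent} of {budget:.1f} min -> {policy}")
```

Running a check like this before each release gives the team a shared, mechanical answer to "can we ship this now?".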

SLA Credits

If you offer SLA credits (refunds for missed SLA), define clear terms:

How is downtime measured? (Your monitoring data? Customer reports?)
What qualifies as downtime? (Complete outage? Degraded performance?)
What is the credit amount? (Pro-rated? Fixed percentage?)
How do customers claim credits?
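Once the terms are defined, the credit calculation itself is mechanical. The tiers below are entirely hypothetical and only show the shape of a tiered, fixed-percentage scheme; your contract defines the real numbers.

```python
# (uptime floor in %, credit as % of monthly fee), checked top to bottom.
# Hypothetical tiers for illustration only.
CREDIT_TIERS = [
    (99.9, 0),
    (99.0, 10),
    (95.0, 25),
]
BELOW_ALL_TIERS_CREDIT = 50  # also hypothetical


def sla_credit(measured_uptime_percent, monthly_fee):
    """Credit owed for the month under the tier table above."""
    for floor, credit_percent in CREDIT_TIERS:
        if measured_uptime_percent >= floor:
            return monthly_fee * credit_percent / 100
    return monthly_fee * BELOW_ALL_TIERS_CREDIT / 100


for uptime in (99.95, 99.5, 96.0, 90.0):
    print(uptime, sla_credit(uptime, monthly_fee=200))
```

Keeping the tier table as data makes it easy to publish the same numbers in the SLA document and in the billing code.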

Excluding Scheduled Maintenance

Most SLAs exclude planned maintenance. Be transparent about:

How much advance notice you provide
How much maintenance is allowed per month
Whether maintenance windows count toward downtime if they overrun
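One possible way to encode these choices is shown below. It assumes that maintenance within the monthly allowance is removed from the measurement period entirely and that any overrun counts as ordinary downtime; other SLAs handle this differently, so this is a sketch of one convention rather than a standard.

```python
def uptime_excluding_maintenance(period_minutes, downtime_minutes,
                                 maintenance_minutes, allowance_minutes):
    """Uptime % with planned maintenance excluded up to an allowance.

    Assumed convention: excluded maintenance shrinks the measured period,
    and maintenance beyond the allowance is counted as downtime.
    """
    excluded = min(maintenance_minutes, allowance_minutes)
    overrun = maintenance_minutes - excluded
    measured_period = period_minutes - excluded
    counted_downtime = downtime_minutes + overrun
    return (measured_period - counted_downtime) / measured_period * 100


# 20 min of unplanned downtime, 150 min of maintenance, 120 min allowed.
result = uptime_excluding_maintenance(43_200, 20, 150, 120)
print(f"{result:.3f}%")
```

In this example the 30-minute overrun is added to the 20 minutes of unplanned downtime, so 50 minutes count against the SLA.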

Tracking and Reporting

Use your monitoring platform’s analytics to generate SLA reports:

Monthly uptime percentage per service
Incident count and duration
Response time trends (performance SLA, not just availability SLA)
Regional availability (if you serve a global audience)

StatusApp’s analytics provide all of these data points, making SLA reporting straightforward for your team and your customers.

Start measuring your uptime with confidence. Try StatusApp free and get accurate SLA data from 35+ global monitoring locations.