Incident Lifecycle
Understand how incidents are created, updated, and resolved — with MTTR tracking and audit logs.
What is an Incident?
An incident is any event that disrupts or degrades service availability or performance. StatusApp provides a complete timeline from detection through resolution, helping you communicate and learn from failures.
Incidents Track:
- When the issue started and how long it lasted
- Which monitors/services are affected
- All status updates posted during the incident
- Severity and impact
- Mean Time To Resolution (MTTR)
- Customer communications
How Incidents Are Created
Automatic Incident Creation
StatusApp automatically creates incidents when monitors fail:
Trigger Process:
- Monitor check fails (non-200 status, timeout, or error)
- Confirmation checks verify the failure (preventing false positives)
- Incident created with unique ID (INC-00123)
- Status page updated instantly
- Subscribers notified via email
Automatic Incident Data:
- Incident Number (unique identifier like INC-00123)
- Start time (`startedAt`) - when the first failure was detected
- Affected monitor - which service failed
- Severity - auto-determined based on impact (low, medium, high, critical)
- Affected regions - array of regions where the failure occurred
- Check statistics - `checksFailed` and `checksTotal` counts
- Status - `open` or `resolved`
- Incident logs - detailed check results with status codes, response times, errors
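The automatic incident data above can be sketched as a data model. This is a hypothetical Python representation for illustration only; the field names mirror the list above, not an official StatusApp schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Incident:
    """Hypothetical shape of an auto-created incident record."""
    incident_number: str                # unique identifier, e.g. "INC-00123"
    started_at: datetime                # when the first failure was detected
    monitor_id: str                     # which service failed
    severity: str                       # "low" | "medium" | "high" | "critical"
    affected_regions: list = field(default_factory=list)
    checks_failed: int = 0
    checks_total: int = 0
    status: str = "open"                # "open" | "resolved"
    logs: list = field(default_factory=list)  # status codes, response times, errors

incident = Incident(
    incident_number="INC-00123",
    started_at=datetime(2024, 1, 1, 10, 30),
    monitor_id="payment-api",
    severity="critical",
    affected_regions=["us-east", "eu-west"],
)
```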
Grace Period:
- Configure monitors to wait before creating incident
- Prevents false positives from temporary blips
- Common values: 1-5 minutes
- Balances speed vs accuracy
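The confirmation-check and grace-period behavior above can be sketched as follows. This is an illustrative simplification, not StatusApp's actual implementation; the `check` callable and the threshold values are assumptions:

```python
import time

def should_open_incident(check, interval=30, grace_period=120, confirmations=3):
    """Open an incident only if the monitor keeps failing for the whole
    grace period, confirmed by several consecutive failed checks."""
    failures = 0
    deadline = time.monotonic() + grace_period
    while time.monotonic() < deadline:
        if check():                 # e.g. HTTP probe: True if healthy
            return False            # recovered: just a temporary blip
        failures += 1
        time.sleep(interval)
    return failures >= confirmations
```

A single successful check inside the grace period cancels the incident, which is what suppresses false positives from momentary blips.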
Manual Incident Creation
Create incidents for issues not automatically detected:
When to Create Manually:
- Planned maintenance windows
- Third-party services down (not monitored)
- Internal issues affecting customers
- Customer-reported issues
- Special events/incidents
How to Create:
- Go to Incidents page
- Click Create Incident
- Fill in details:
- Title: Clear, concise description
- Affected Monitors: Select affected services
- Status: Start with "Investigating"
- Message: Detailed explanation
- Severity: Low, Medium, High, or Critical
- Click Create & Notify
Incident Details:
- Title: "Payment Processing Down"
- Affected: Payment API, Payment Gateway
- Status: Investigating
- Severity: Critical
- Message: "Our payment processor is experiencing issues..."
Subscribers automatically notified.
Incident Statuses (UpdateType)
Incidents move through statuses as you progress toward resolution. StatusApp uses these status types:
Investigating
Meaning: Issue reported, team assessing
When to use:
- Just discovered the problem
- Trying to understand scope
- Collecting information
- Determining root cause
Example update: "We've identified increased error rates in the payment service. Our team is investigating the root cause."
Identified
Meaning: Root cause found, working on fix
When to use:
- You know what's causing the problem
- Implementing a solution
- ETA available or being determined
- Working on remediation
Example update: "Root cause identified: database connection pool exhausted. We're scaling up connections now."
Monitoring
Meaning: Fix deployed, validating stability
When to use:
- Solution has been applied
- Testing or monitoring for stability
- Not yet 100% confident in resolution
- May need rollback
Example update: "We've deployed the fix. Monitoring for stability over the next 10 minutes."
Update
Meaning: Progress update without status change
When to use:
- Providing intermediate progress
- Sharing additional information
- No significant status change yet
- Keeping stakeholders informed during long incidents
Example update: "Our team is still working on the fix. Currently testing in staging environment."
Resolved
Meaning: Issue fixed, service fully operational
When to use:
- Confirmed service is working normally
- Previously impacted customers are no longer affected
- Back to normal operations
- Ready to close incident
Example update: "The incident has been resolved. All services are operating normally."
Root Cause Categories
When resolving incidents, you can categorize the root cause for better analytics:
| Category | Description |
|---|---|
| DNS_FAILURE | DNS resolution issues |
| NETWORK_TIMEOUT | Network connectivity or timeout issues |
| SERVER_ERROR | Web server errors (5xx responses) |
| APPLICATION_ERROR | Application code or logic errors |
| DATABASE_ISSUE | Database connection, query, or performance issues |
| SSL_CERTIFICATE | SSL/TLS certificate problems |
| CONFIGURATION_ERROR | Misconfiguration of services or infrastructure |
| DEPLOYMENT_ISSUE | Problems introduced during deployment |
| THIRD_PARTY_SERVICE | External service or dependency failure |
| INFRASTRUCTURE | Cloud provider or infrastructure issues |
| DDOS_ATTACK | Distributed denial of service attack |
| MAINTENANCE | Planned or unplanned maintenance |
| UNKNOWN | Root cause not determined |
Incident Workflow
Typical Incident Timeline
1. Monitor Failure Detected
↓
2. Incident Created (Investigating)
├─ Update 1: Initial assessment
├─ Update 2: Root cause found (Identified)
├─ Update 3: Deploying fix
├─ Update 4: Monitoring (Monitoring)
├─ Update 5: Confirmed stable
↓
3. Incident Resolved
↓
4. Post-Mortem (optional)
Updating Incidents
Keep customers informed with regular updates:
- Open incident
- Click Post Update
- Change status if needed
- Add update message
- Click Post Update
Best Practices for Updates:
- Update every 30-60 minutes during outage
- Be specific about what's happening
- Provide ETAs when possible
- Explain what you're doing
- Update when status changes
- Thank customers for patience
Example Updates:
10:30 AM - Investigating
"We've detected elevated error rates affecting payment processing.
Our team is currently investigating the cause."
10:45 AM - Identified
"Root cause identified: database connection limits reached due to
surge in traffic. We're scaling up connections now."
11:00 AM - Monitoring
"We've increased database capacity. Monitoring for stability
before marking resolved."
11:15 AM - Resolved
"The incident is now resolved. All payments are processing normally.
We apologize for the disruption."
Incident Severity
Severity levels indicate impact:
Critical
Description: Major functionality down, revenue impact
Examples:
- Core platform unavailable
- Payment processing down
- Authentication broken
- Data loss
Response: Immediate escalation, all hands on deck
High
Description: Important feature broken, significant impact
Examples:
- API endpoint down
- Reporting feature unavailable
- Performance severely degraded
- Affecting multiple customers
Response: Urgent response, prioritize fix
Medium
Description: Feature degraded, some users impacted
Examples:
- Slow response times (but working)
- One feature broken (others work)
- Affecting specific user segments
- Non-critical functionality
Response: Important, address within hours
Low
Description: Minor issue, minimal impact
Examples:
- UI glitch on rarely-used page
- Email notification delays
- Minor performance degradation
- One user impacted
Response: Address when convenient
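A consistent impact-to-severity mapping, as described above, can be codified as a simple rule. This is an illustrative heuristic only, not StatusApp's auto-determination algorithm; the inputs and thresholds are assumptions:

```python
def classify_severity(core_service_down, customers_affected_pct, degraded_only):
    """Toy heuristic mirroring the severity levels above."""
    if core_service_down:
        return "critical"   # core platform, payments, auth, data loss
    if customers_affected_pct >= 50:
        return "high"       # significant impact across many customers
    if degraded_only or customers_affected_pct >= 5:
        return "medium"     # feature degraded, some users impacted
    return "low"            # minimal impact

classify_severity(False, 60, False)  # → "high"
```

Encoding the rule once, rather than deciding severity ad hoc per incident, is one way to keep classifications consistent across the team.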
Incident Resolution
Auto-Resolution for Monitored Services
When a failed monitor recovers:
- StatusApp detects recovery
- Incident automatically resolves
- Final status change posted
- Subscribers notified
Automatic Detection:
- Monitor returns to "Up" status
- 1-2 successful check cycles confirm recovery
- Incident marked resolved
- Recovery time recorded
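The recovery-confirmation step above can be sketched as a simple check. A hypothetical sketch; the result list and the required-success count are assumptions:

```python
def is_recovered(latest_results, required_successes=2):
    """Mark the incident resolved only after N consecutive successful
    check cycles, mirroring the auto-resolution rules above."""
    recent = latest_results[-required_successes:]
    return len(recent) == required_successes and all(recent)

# True/False per check cycle, oldest first
is_recovered([False, False, True, True])  # → True
is_recovered([False, True])               # → False (only one success so far)
```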
Manual Resolution
For manually-created incidents:
- Update status to "Resolved"
- Post final update message
- Incident is closed
- Customers notified
Final Update Example:
"The issue has been fully resolved. Payment processing is operating
at normal capacity. We apologize for the inconvenience and thank you
for your patience."
Incident Analytics
Mean Time To Resolution (MTTR)
Time from incident start to resolution:
MTTR = Incident End Time - Incident Start Time
Tracking MTTR:
- View in incident details
- Historical MTTR trends
- Compare across incident types
- Identify patterns
Example: Incident from 10:30 AM to 11:15 AM = 45 minute MTTR
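The MTTR formula above applies directly to the example timestamps (illustrative values):

```python
from datetime import datetime

started_at = datetime(2024, 1, 1, 10, 30)   # incident start
resolved_at = datetime(2024, 1, 1, 11, 15)  # incident resolution

mttr = resolved_at - started_at
print(mttr.total_seconds() / 60)  # → 45.0 (minutes)
```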
Incident Frequency
Number of incidents per time period:
Tracking:
- Dashboard shows recent incidents
- Analytics show trends
- Identify reliability patterns
- Plan improvements
Example: 3 incidents in last week
- Monday: 1 incident
- Wednesday: 2 incidents
- Other days: None
Mean Time Between Failures (MTBF)
Time between consecutive incidents:
MTBF = Time Between Incident End and Next Incident Start
Improvement: Longer MTBF = more stable service
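MTBF across a series of incidents can be computed the same way. A minimal sketch with made-up incident data:

```python
from datetime import datetime, timedelta

# (start, end) pairs in chronological order — illustrative data
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 45)),
    (datetime(2024, 1, 3, 9, 0),  datetime(2024, 1, 3, 9, 15)),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 30)),
]

def mtbf(incidents):
    """Average gap between one incident's end and the next one's start."""
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(incidents, incidents[1:])]
    return sum(gaps, timedelta()) / len(gaps)

print(mtbf(incidents))  # → 3 days, 13:30:00
```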
Incident History
View all incidents on your status page:
Historical Record:
- Last 90 days by default
- Includes resolution time
- Links to full incident details
- Shows customer impact
Benefits:
- Track reliability trends
- Identify patterns
- Demonstrate transparency
- Learn from failures
Example Incident History:
INC-00125: Database Timeout - Resolved (15 min)
INC-00124: API Rate Limiting - Resolved (45 min)
INC-00123: SSL Certificate Update - Resolved (5 min)
Best Practices
1. Create Incidents Promptly
Don't delay creating incidents:
Good: Create immediately when issue detected
Bad: Wait until multiple customers complain
Bad: Wait until team has full root cause
2. Update Every 30-60 Minutes
Keep customers informed:
During Active Incident:
├─ 0 min: Create incident
├─ 15 min: Post initial update
├─ 30 min: Post status update
├─ 45 min: Post significant update
└─ Resolution: Mark resolved
3. Use Consistent Severity
Be consistent with severity classification:
Good: Critical reserved for major outages
Bad: Calling everything Critical
Bad: Never using Critical even for major issues
Customers learn to ignore severity labels if they're overused
4. Be Transparent
Share what you know:
Good: "We're still investigating but database team is engaged"
Bad: "We have no idea what's happening"
Bad: Silence during active incident
Transparency builds trust
5. Include Root Cause in Resolved Update
Help customers understand:
Good: "Root cause was X. We've deployed Y to prevent recurrence."
Bad: "It's fixed now."
This explains what happened and shows how you're preventing recurrence
Common Incident Scenarios
Scenario 1: Database Connection Pool Exhausted
10:00 - Incident Created: API Timeouts Detected
Status: Investigating
Severity: Critical
10:05 - First Update: "We've identified elevated database connection
usage. Investigating cause."
Status: Identified
10:10 - Second Update: "Root cause: connection pool limit exceeded
due to surge in traffic. Scaling connections now."
Status: Monitoring
10:15 - Third Update: "We've increased connection pool size.
Monitoring for stability."
10:20 - Resolved: "Database connection pool increased from 50 to 200
connections. Service operating normally."
Scenario 2: Third-Party Service Down
12:00 - Incident Created: Payment Processor Integration Down
Status: Investigating
Severity: Critical
Message: "Our payment processor is experiencing issues"
12:05 - Update: "Contacted payment processor support. They confirm
service degradation in US region."
Status: Identified
12:15 - Update: "Payment processor reports 90% of service restored.
Monitoring their status page for full recovery."
12:30 - Resolved: "Payment processor has fully recovered. All
payment processing operating normally."
Next Steps
- Incident Communication - Best practices for customer communication
- Status Pages - Display incidents publicly
- Notifications - Alert your team