Incident Lifecycle
Understand how incidents are created, updated, and resolved — with MTTR tracking and audit logs.
What is an Incident?
An incident is any event that disrupts or degrades service availability or performance. StatusApp provides a complete timeline from detection through resolution, helping you communicate and learn from failures.
Incidents Track:
- When the issue started and how long it lasted
- Which monitors/services are affected
- All status updates posted during the incident
- Severity and impact
- Mean Time To Resolution (MTTR)
- Customer communications
How Incidents Are Created
Automatic Incident Creation
StatusApp automatically creates incidents when monitors fail:
Trigger Process:
- Monitor check fails (non-200 status, timeout, or error)
- Confirmation checks verify the failure (preventing false positives)
- Incident created with unique ID (INC-00123)
- Status page updated instantly
- Subscribers notified via email
Automatic Incident Data:
- Incident Number (unique identifier like INC-00123)
- Start time (`startedAt`) - when the first failure was detected
- Affected monitor - which service failed
- Severity - auto-determined based on impact (low, medium, high, critical)
- Affected regions - array of regions where the failure occurred
- Check statistics - `checksFailed` and `checksTotal` counts
- Status - `open` or `resolved`
- Incident logs - detailed check results with status codes, response times, errors
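The automatic incident data above can be sketched as a data model. This is a hypothetical Python representation for illustration only; the field names mirror the list above, not an official StatusApp schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Incident:
    """Hypothetical shape of an auto-created incident record."""
    incident_number: str                # unique identifier, e.g. "INC-00123"
    started_at: datetime                # when the first failure was detected
    monitor_id: str                     # which service failed
    severity: str                       # "low" | "medium" | "high" | "critical"
    affected_regions: list = field(default_factory=list)
    checks_failed: int = 0
    checks_total: int = 0
    status: str = "open"                # "open" | "resolved"
    logs: list = field(default_factory=list)  # status codes, response times, errors

incident = Incident(
    incident_number="INC-00123",
    started_at=datetime(2024, 1, 1, 10, 30),
    monitor_id="payment-api",
    severity="critical",
    affected_regions=["us-east", "eu-west"],
)
```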
Grace Period:
- Configure monitors to wait before creating incident
- Prevents false positives from temporary blips
- Common values: 1-5 minutes
- Balances speed vs accuracy
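The confirmation-check and grace-period behavior above can be sketched as follows. This is an illustrative simplification, not StatusApp's actual implementation; the `check` callable and the threshold values are assumptions:

```python
import time

def should_open_incident(check, interval=30, grace_period=120, confirmations=3):
    """Open an incident only if the monitor keeps failing for the whole
    grace period, confirmed by several consecutive failed checks."""
    failures = 0
    deadline = time.monotonic() + grace_period
    while time.monotonic() < deadline:
        if check():                 # e.g. HTTP probe: True if healthy
            return False            # recovered: just a temporary blip
        failures += 1
        time.sleep(interval)
    return failures >= confirmations
```

A single successful check inside the grace period cancels the incident, which is what suppresses false positives from momentary blips.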
Manual Incident Creation
Create incidents for issues not automatically detected:
When to Create Manually:
- Planned maintenance windows
- Third-party services down (not monitored)
- Internal issues affecting customers
- Customer-reported issues
- Special events/incidents
How to Create:
- Go to Incidents page
- Click Create Incident
- Fill in details:
- Title: Clear, concise description
- Affected Monitors: Select affected services
- Status: Start with "Investigating"
- Message: Detailed explanation
- Severity: Low, Medium, High, or Critical
- Click Create & Notify
Incident Details:
- Title: "Payment Processing Down"
- Affected: Payment API, Payment Gateway
- Status: Investigating
- Severity: Critical
- Message: "Our payment processor is experiencing issues..."
Subscribers automatically notified.
Incident Statuses (UpdateType)
Incidents move through statuses as you progress toward resolution. StatusApp uses these status types:
Investigating
Meaning: Issue reported, team assessing
When to use:
- Just discovered the problem
- Trying to understand scope
- Collecting information
- Determining root cause
Example update: "We've identified increased error rates in the payment service. Our team is investigating the root cause."
Identified
Meaning: Root cause found, working on fix
When to use:
- You know what's causing the problem
- Implementing a solution
- ETA available or being determined
- Working on remediation
Example update: "Root cause identified: database connection pool exhausted. We're scaling up connections now."
Monitoring
Meaning: Fix deployed, validating stability
When to use:
- Solution has been applied
- Testing or monitoring for stability
- Not yet 100% confident in resolution
- May need rollback
Example update: "We've deployed the fix. Monitoring for stability over the next 10 minutes."
Update
Meaning: Progress update without status change
When to use:
- Providing intermediate progress
- Sharing additional information
- No significant status change yet
- Keeping stakeholders informed during long incidents
Example update: "Our team is still working on the fix. Currently testing in staging environment."
Resolved
Meaning: Issue fixed, service fully operational
When to use:
- Confirmed service is working normally
- Previously impacted customers are no longer affected
- Back to normal operations
- Ready to close incident
Example update: "The incident has been resolved. All services are operating normally."
Root Cause Categories
When resolving incidents, you can categorize the root cause for better analytics:
| Category | Description |
|---|---|
| DNS_FAILURE | DNS resolution issues |
| NETWORK_TIMEOUT | Network connectivity or timeout issues |
| SERVER_ERROR | Web server errors (5xx responses) |
| APPLICATION_ERROR | Application code or logic errors |
| DATABASE_ISSUE | Database connection, query, or performance issues |
| SSL_CERTIFICATE | SSL/TLS certificate problems |
| CONFIGURATION_ERROR | Misconfiguration of services or infrastructure |
| DEPLOYMENT_ISSUE | Problems introduced during deployment |
| THIRD_PARTY_SERVICE | External service or dependency failure |
| INFRASTRUCTURE | Cloud provider or infrastructure issues |
| DDOS_ATTACK | Distributed denial of service attack |
| MAINTENANCE | Planned or unplanned maintenance |
| UNKNOWN | Root cause not determined |
Incident Workflow
Typical Incident Timeline
1. Monitor Failure Detected
↓
2. Incident Created (Investigating)
├─ Update 1: Initial assessment
├─ Update 2: Root cause found (Identified)
├─ Update 3: Deploying fix
├─ Update 4: Monitoring (Monitoring)
├─ Update 5: Confirmed stable
↓
3. Incident Resolved
↓
4. Post-Mortem (optional)
Updating Incidents
Keep customers informed with regular updates:
- Open incident
- Click Post Update
- Change status if needed
- Add update message
- Click Post Update
Best Practices for Updates:
- Update every 30-60 minutes during outage
- Be specific about what's happening
- Provide ETAs when possible
- Explain what you're doing
- Update when status changes
- Thank customers for patience
Example Updates:
10:30 AM - Investigating
"We've detected elevated error rates affecting payment processing.
Our team is currently investigating the cause."
10:45 AM - Identified
"Root cause identified: database connection limits reached due to
surge in traffic. We're scaling up connections now."
11:00 AM - Monitoring
"We've increased database capacity. Monitoring for stability
before marking resolved."
11:15 AM - Resolved
"The incident is now resolved. All payments are processing normally.
We apologize for the disruption."
Incident Severity
Severity levels indicate impact:
Critical
Description: Major functionality down, revenue impact
Examples:
- Core platform unavailable
- Payment processing down
- Authentication broken
- Data loss
Response: Immediate escalation, all hands on deck
High
Description: Important feature broken, significant impact
Examples:
- API endpoint down
- Reporting feature unavailable
- Performance severely degraded
- Affecting multiple customers
Response: Urgent response, prioritize fix
Medium
Description: Feature degraded, some users impacted
Examples:
- Slow response times (but working)
- One feature broken (others work)
- Affecting specific user segments
- Non-critical functionality
Response: Important, address within hours
Low
Description: Minor issue, minimal impact
Examples:
- UI glitch on rarely-used page
- Email notification delays
- Minor performance degradation
- One user impacted
Response: Address when convenient
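A consistent impact-to-severity mapping, as described above, can be codified as a simple rule. This is an illustrative heuristic only, not StatusApp's auto-determination algorithm; the inputs and thresholds are assumptions:

```python
def classify_severity(core_service_down, customers_affected_pct, degraded_only):
    """Toy heuristic mirroring the severity levels above."""
    if core_service_down:
        return "critical"   # core platform, payments, auth, data loss
    if customers_affected_pct >= 50:
        return "high"       # significant impact across many customers
    if degraded_only or customers_affected_pct >= 5:
        return "medium"     # feature degraded, some users impacted
    return "low"            # minimal impact

classify_severity(False, 60, False)  # → "high"
```

Encoding the rule once, rather than deciding severity ad hoc per incident, is one way to keep classifications consistent across the team.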
Incident Resolution
Auto-Resolution for Monitored Services
When a failed monitor recovers:
- StatusApp detects recovery
- Incident automatically resolves
- Final status change posted
- Subscribers notified
Automatic Detection:
- Monitor returns to "Up" status
- 1-2 successful check cycles confirm recovery
- Incident marked resolved
- Recovery time recorded
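The recovery-confirmation step above can be sketched as a simple check. A hypothetical sketch; the result list and the required-success count are assumptions:

```python
def is_recovered(latest_results, required_successes=2):
    """Mark the incident resolved only after N consecutive successful
    check cycles, mirroring the auto-resolution rules above."""
    recent = latest_results[-required_successes:]
    return len(recent) == required_successes and all(recent)

# True/False per check cycle, oldest first
is_recovered([False, False, True, True])  # → True
is_recovered([False, True])               # → False (only one success so far)
```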
Manual Resolution
For manually-created incidents:
- Update status to "Resolved"
- Post final update message
- Incident is closed
- Customers notified
Final Update Example:
"The issue has been fully resolved. Payment processing is operating
at normal capacity. We apologize for the inconvenience and thank you
for your patience."
Incident Analytics
Mean Time To Resolution (MTTR)
Time from incident start to resolution:
MTTR = Incident End Time - Incident Start Time
Tracking MTTR:
- View in incident details
- Historical MTTR trends
- Compare across incident types
- Identify patterns
Example: Incident from 10:30 AM to 11:15 AM = 45 minute MTTR
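The MTTR formula above applies directly to the example timestamps (illustrative values):

```python
from datetime import datetime

started_at = datetime(2024, 1, 1, 10, 30)   # incident start
resolved_at = datetime(2024, 1, 1, 11, 15)  # incident resolution

mttr = resolved_at - started_at
print(mttr.total_seconds() / 60)  # → 45.0 (minutes)
```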
Incident Frequency
Number of incidents per time period:
Tracking:
- Dashboard shows recent incidents
- Analytics show trends
- Identify reliability patterns
- Plan improvements
Example: 3 incidents in last week
- Monday: 1 incident
- Wednesday: 2 incidents
- Other days: None
Mean Time Between Failures (MTBF)
Time between consecutive incidents:
MTBF = Time Between Incident End and Next Incident Start
Improvement: Longer MTBF = more stable service
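MTBF across a series of incidents can be computed the same way. A minimal sketch with made-up incident data:

```python
from datetime import datetime, timedelta

# (start, end) pairs in chronological order — illustrative data
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 45)),
    (datetime(2024, 1, 3, 9, 0),  datetime(2024, 1, 3, 9, 15)),
    (datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 30)),
]

def mtbf(incidents):
    """Average gap between one incident's end and the next one's start."""
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(incidents, incidents[1:])]
    return sum(gaps, timedelta()) / len(gaps)

print(mtbf(incidents))  # → 3 days, 13:30:00
```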
Incident History
View all incidents on your status page:
Historical Record:
- Last 90 days by default
- Includes resolution time
- Links to full incident details
- Shows customer impact
Benefits:
- Track reliability trends
- Identify patterns
- Demonstrate transparency
- Learn from failures
Example Incident History:
INC-00125: Database Timeout - Resolved (15 min)
INC-00124: API Rate Limiting - Resolved (45 min)
INC-00123: SSL Certificate Update - Resolved (5 min)
Best Practices
1. Create Incidents Promptly
Don't delay creating incidents:
Good: Create immediately when issue detected
Bad: Wait until multiple customers complain
Bad: Wait until team has full root cause
2. Update Every 30-60 Minutes
Keep customers informed:
During Active Incident:
├─ 0 min: Create incident
├─ 15 min: Post initial update
├─ 30 min: Post status update
├─ 45 min: Post significant update
└─ Resolution: Mark resolved
3. Use Consistent Severity
Be consistent with severity classification:
Good: Critical reserved for major outages
Bad: Calling everything Critical
Bad: Never using Critical even for major issues
Customers learn to ignore severity labels if they're overused
4. Be Transparent
Share what you know:
Good: "We're still investigating but database team is engaged"
Bad: "We have no idea what's happening"
Bad: Silence during active incident
Transparency builds trust
5. Include Root Cause in Resolved Update
Help customers understand:
Good: "Root cause was X. We've deployed Y to prevent recurrence."
Bad: "It's fixed now."
This explains what happened and shows how you're preventing recurrence
Common Incident Scenarios
Scenario 1: Database Connection Pool Exhausted
10:00 - Incident Created: API Timeouts Detected
Status: Investigating
Severity: Critical
10:05 - First Update: "We've identified elevated database connection
usage. Investigating cause."
Status: Identified
10:10 - Second Update: "Root cause: connection pool limit exceeded
due to surge in traffic. Scaling connections now."
Status: Monitoring
10:15 - Third Update: "We've increased connection pool size.
Monitoring for stability."
10:20 - Resolved: "Database connection pool increased from 50 to 200
connections. Service operating normally."
Scenario 2: Third-Party Service Down
12:00 - Incident Created: Payment Processor Integration Down
Status: Investigating
Severity: Critical
Message: "Our payment processor is experiencing issues"
12:05 - Update: "Contacted payment processor support. They confirm
service degradation in US region."
Status: Identified
12:15 - Update: "Payment processor reports 90% of service restored.
Monitoring their status page for full recovery."
12:30 - Resolved: "Payment processor has fully recovered. All
payment processing operating normally."
Next Steps
- Incident Communication - Best practices for customer communication
- Status Pages - Display incidents publicly
- Notifications - Alert your team