Incident Communication
Best practices for communicating incidents to your team and customers during downtime.
The Importance of Communication
During incidents, communication is as important as fixing the problem.
Why Communication Matters:
- Reduces customer support tickets
- Builds trust through transparency
- Shows you're in control
- Prevents rumors and misinformation
- Demonstrates professionalism
- Reduces customer anxiety
Poor Communication:
- Silence during outage = customers panic
- Vague updates = customer frustration
- Delayed updates = loss of trust
- No transparency = reputation damage
Effective Communication:
- Frequent updates (every 30-60 min)
- Transparent about what you know
- Clear about what you're doing
- Realistic timelines
- Apologetic and professional
Real-Time Update Strategy
Initial Update (Within 15 Minutes)
Post the first update immediately when the issue is detected:
What to Include:
- Acknowledge the problem
- State what's affected
- Brief description of issue
- What you're doing right now
Example:
"We're aware of intermittent errors affecting our payment service.
Our team is investigating. We'll have an update in 10 minutes."
What to Avoid:
- Speculation about root cause
- Blame on other teams/services
- Over-promises on ETA
- Technical jargon customers don't understand
Status Updates (Every 30-60 Minutes)
Continue updating customers throughout incident:
What to Include:
- Current status
- Progress since last update
- Updated timeline if applicable
- Any new discoveries
Example:
"Update: We've identified the issue is in our message queue service.
The team has implemented a fix and is currently validating it.
We expect full resolution within 30 minutes."
Keep Updates:
- Concise (2-3 sentences typical)
- Specific (not just "we're working on it")
- Honest (share what you know)
- Professional (no frustration)
Resolution Update (When Fixed)
Post final update when issue is fully resolved:
What to Include:
- Confirmation service is fully operational
- Root cause (if known)
- What you've done to prevent recurrence
- Appreciation for patience
Example:
"The incident is now fully resolved. Root cause was a memory leak
in our message queue service that accumulated over 3 days. We've
deployed a fix and will be monitoring closely. We apologize for
the disruption and appreciate your patience."
Always:
- Apologize for the inconvenience
- Thank customers for patience
- Explain prevention measures
Customer Communication Channels
Status Page
Primary channel for incident communication:
What Appears:
- Incident created when monitor fails
- Real-time status updates
- Affected services clearly listed
- Uptime impact calculated
- Accessible 24/7
Best For: All customers, maximum visibility
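The uptime impact shown on the status page is simple arithmetic: downtime divided by the length of the reporting window. A minimal sketch (the function name and 30-day window are illustrative, not StatusApp's actual calculation):

```python
def uptime_impact(downtime_minutes: float, window_days: int = 30) -> float:
    """Return the uptime percentage over the window after an outage."""
    total_minutes = window_days * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes

# A 45-minute outage in a 30-day window:
print(round(uptime_impact(45), 3))  # 99.896
```

Even a noticeable outage moves monthly uptime by only a fraction of a percent, which is why the timeline and communication matter more to customers than the raw number.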
Email Notifications
Automated emails to subscribers:
Triggers:
- New incident created
- Incident status changed
- Incident resolved
- Maintenance scheduled
Best For: Detailed communication, record-keeping
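The trigger list above can be wired up as a simple event-to-template mapping. This is a hypothetical sketch of such a dispatcher, not StatusApp's actual notification API; the event names and subject formats are illustrative:

```python
# Hypothetical mapping of notification triggers to email subject templates.
SUBJECTS = {
    "incident_created": "[Incident] {service}: investigating",
    "incident_updated": "[Update] {service}: {status}",
    "incident_resolved": "[Resolved] {service} is operational",
    "maintenance_scheduled": "[Maintenance] {service}: scheduled window",
}

def email_subject(event: str, **fields) -> str:
    """Build the subject line for a subscriber email from a trigger event."""
    return SUBJECTS[event].format(**fields)

print(email_subject("incident_resolved", service="Payments"))
# [Resolved] Payments is operational
```

Keeping subjects template-driven means every subscriber sees the same wording for the same event, which supports the consistency practice covered later.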
Slack/Discord/Teams
Team channels for internal coordination:
What to Post:
- Initial discovery
- Status updates (shorter than public)
- Key decisions
- Resolution confirmation
Best For: Team coordination, rapid response
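Internal updates like these can be pushed to a channel via an incoming webhook. The payload below follows Slack's incoming-webhook format (a JSON body with a "text" field); the webhook URL is a placeholder you'd replace with your own hook, and the message format is just a suggestion:

```python
import json
import urllib.request

def build_slack_payload(status: str, details: str) -> dict:
    """Format a short internal update for Slack's incoming-webhook API."""
    return {"text": f"*Incident update* [{status}]: {details}"}

def post_update(webhook_url: str, payload: dict) -> None:
    """POST the payload as JSON to the (placeholder) webhook URL."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()

payload = build_slack_payload(
    "identified", "Root cause is the message queue; fix in validation."
)
```

Separating payload construction from delivery makes the formatting testable without hitting the network.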
Twitter/Social Media
Optional for large outages:
When to Use:
- Outage affecting major portion of user base
- Incident lasting > 2 hours
- High customer visibility needed
What to Post:
- Acknowledge issue
- Share status page link
- Update on progress
- Resolution confirmation
Example:
"We're currently experiencing issues with our payment service.
Our team is investigating. Real-time updates: status.company.com"
What NOT to Say During Incidents
Don't Blame External Services
Bad: "Our provider's issue, not ours"
Good: "We're working with our provider to resolve this"
Customers don't care who's responsible - they just want it fixed
Don't Make Promises You Can't Keep
Bad: "We'll have this fixed in 5 minutes"
Good: "We expect resolution within 30 minutes"
Broken promises destroy trust more than waiting
Don't Use Technical Jargon
Bad: "Database connection pool exhausted, scaling MySQL replicas"
Good: "Database running out of connections, adding capacity"
Customers don't understand technical details
Don't Disappear
Bad: No updates for 2 hours during active incident
Good: Update every 30-60 minutes minimum
Silence breeds panic and rumors
Don't Minimize the Issue
Bad: "Just a minor glitch affecting a few users"
Good: "We're experiencing service degradation affecting payments"
Be honest about severity
Post-Mortem Process
After the incident is resolved, conduct a post-mortem:
Step 1: Schedule Meeting (Within 48 Hours)
Don't wait too long:
- Team memory still fresh
- Details accurate
- Momentum to implement fixes
Step 2: Gather Information
Collect during meeting:
What to Document:
- Timeline of events (minute by minute)
- Root cause (why did it happen?)
- Impact (how many customers affected?)
- Detection time (how long before discovered?)
- Resolution time (how long to fix?)
- Contributing factors
Use Data:
- Incident timeline from StatusApp
- Error logs from systems
- Customer impact data
- Team observations
Example Timeline:
2:00 PM - Issue begins (database connections exhausted)
2:05 PM - First customer report
2:10 PM - Alert fires in monitoring
2:15 PM - Incident created
2:20 PM - Root cause identified
2:30 PM - Fix deployed
2:45 PM - Incident resolved (45 min total MTTR)
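From a timeline like the one above, detection time and MTTR fall out directly. A minimal sketch using stdlib datetimes (the event names and parsing format are illustrative):

```python
from datetime import datetime

FMT = "%I:%M %p"  # e.g. "2:45 PM"
timeline = {
    "issue_begins": "2:00 PM",
    "alert_fires": "2:10 PM",
    "resolved": "2:45 PM",
}
t = {name: datetime.strptime(stamp, FMT) for name, stamp in timeline.items()}

# How long until monitoring caught it, and total time to resolution.
detection_minutes = (t["alert_fires"] - t["issue_begins"]).seconds // 60
mttr_minutes = (t["resolved"] - t["issue_begins"]).seconds // 60
print(detection_minutes, mttr_minutes)  # 10 45
```

Note the 10-minute gap between issue start and the alert firing: the first customer report arrived before monitoring did, which is itself a post-mortem finding.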
Step 3: Identify Root Causes
Dig deeper than immediate cause:
5 Whys Technique:
1. Why did database connections exhaust? → Traffic spike exceeded capacity
2. Why did traffic spike? → New feature launch with viral marketing
3. Why weren't connections scaled? → Didn't anticipate that level of traffic
4. Why no capacity planning? → Load testing only simulated 50% actual load
5. Why was load testing insufficient? → Outdated assumptions about typical usage
Root Cause: Insufficient load testing and capacity planning
Step 4: Discuss Lessons Learned
What to improve:
Types of Lessons:
- What worked well (keep doing this)
- What didn't work (don't do again)
- What to improve
- What will prevent recurrence
Example Lessons:
What Worked Well:
- Team responded quickly
- Communication was clear
- Root cause identified fast
What to Improve:
- Capacity planning was inadequate
- Load testing didn't simulate real conditions
- No pre-incident escalation procedures
Preventive Measures:
- Implement continuous load testing
- Auto-scale database connections
- Set up traffic surge alerts
- Create escalation procedures
Step 5: Assign Action Items
Concrete steps to prevent recurrence:
Make Assignments:
- Who is responsible?
- What exactly needs to be done?
- When is the deadline?
- How will you verify it's done?
Example Action Items:
1. Implement auto-scaling for database connections
Owner: Database team
Deadline: 1 week
Verify: Test auto-scaling response
2. Update load testing scenarios
Owner: QA team
Deadline: 2 weeks
Verify: New tests pass
3. Implement traffic surge alerts
Owner: DevOps team
Deadline: 1 week
Verify: Alert triggers at 80% capacity
4. Document escalation procedures
Owner: On-call engineer
Deadline: 3 days
Verify: All team trained
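The "alert triggers at 80% capacity" verification step above boils down to a threshold check. A minimal sketch (the function name and default threshold are illustrative):

```python
def should_alert(in_use: int, capacity: int, threshold: float = 0.8) -> bool:
    """Fire a surge alert once utilization reaches the threshold."""
    return in_use >= threshold * capacity

print(should_alert(80, 100))  # True
print(should_alert(79, 100))  # False
```

Verifying the action item then means driving utilization to the threshold in a test environment and confirming the alert actually fires, not just that the code exists.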
Step 6: Publish Post-Mortem (Optional)
Share with customers:
Customer-Facing Post-Mortem:
- High-level timeline
- What you learned
- What you're doing differently
- Appreciation for patience
Example:
Post-Mortem: Payment Service Incident
On January 20, we experienced a 45-minute outage affecting our
payment service. Our team has investigated the root cause and
implemented preventive measures.
What Happened:
We launched a new feature that generated significantly more traffic
than anticipated. Our database connection pool reached capacity,
causing services to fail.
What We Learned:
Our load testing didn't accurately simulate real user behavior. We've
updated our testing procedures.
What We're Doing:
1. Implemented automatic scaling for database resources
2. Updated our load testing to better simulate production conditions
3. Set up early warning alerts for resource saturation
We apologize for the disruption and appreciate your patience.
Communication Templates
Incident Start Template
"We're aware of [service/feature] issues starting at [time].
Our team is investigating. We'll provide an update in [15 min]."
Progress Update Template
"Update: We've identified [issue]. The team is [action].
We expect [timeframe]."
Resolution Template
"This incident has been resolved. Root cause was [brief explanation].
We've [prevention measure]. We apologize for the disruption."
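The templates above can be filled in programmatically so every channel gets consistent wording. A sketch using str.format, with the bracketed placeholders renamed to valid identifiers (the template keys and field names are illustrative):

```python
# Illustrative versions of the templates above, with named placeholders.
TEMPLATES = {
    "start": ("We're aware of {service} issues starting at {time}. "
              "Our team is investigating. We'll provide an update in {eta}."),
    "progress": ("Update: We've identified {issue}. The team is {action}. "
                 "We expect {timeframe}."),
    "resolved": ("This incident has been resolved. Root cause was {cause}. "
                 "We've {prevention}. We apologize for the disruption."),
}

def render(kind: str, **fields) -> str:
    """Fill a template so the same message can go to every channel."""
    return TEMPLATES[kind].format(**fields)

print(render("start", service="payment", time="2:00 PM", eta="15 minutes"))
```

Rendering from one source of truth avoids the "different information in email vs status page" mistake called out under Be Consistent.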
Best Practices
1. Be Proactive
Post incidents before customers call support:
Good: Post incident 2 minutes after detecting
Bad: Wait for first customer complaint
Bad: Wait until you know the root cause
2. Be Transparent
Share what you know, not what you don't:
Good: "We're still investigating but here's what we know so far..."
Bad: "We have no idea"
Bad: Going silent
3. Be Consistent
Use same status page for all customers:
Good: One status page with updates
Bad: Telling different customers different things
Bad: Different information in email vs status page
4. Be Professional
Maintain professionalism even under pressure:
Good: "We're working to resolve this"
Bad: "Our stupid CDN failed again"
Bad: Frustration evident in tone
5. Follow Up
Post-mortem shows you care about improvement:
Good: Conduct post-mortem, share learnings
Good: Implement preventive measures
Good: Share progress with customers
Bad: Forget about it after incident ends
Bad: No preventive measures taken
Common Mistakes to Avoid
Waiting Too Long for First Update
Mistake: "Let's understand the full issue before updating"
Problem: Customers panic, support tickets flood in
Fix: Update immediately, even if partial information
Over-Promising on Timeframe
Mistake: "Should be fixed in 15 minutes"
Problem: When it takes 45 minutes, customer anger increases
Fix: Under-promise, over-deliver (say 30 min, fix in 20)
Using Too Much Technical Jargon
Mistake: "Database connection pool exhausted on MySQL replicas"
Problem: Most customers don't understand, feel excluded
Fix: "Database running out of connections"
No Post-Mortem
Mistake: Fix issue and move on
Problem: Same issue happens again in 3 months
Fix: Conduct post-mortem, implement preventive measures
Disappearing After Resolution
Mistake: Post "resolved" update and ghost
Problem: No post-mortem, no explanation, no prevention
Fix: Follow up with post-mortem and learnings
Tools for Communication
StatusApp Status Page
- Real-time incident publishing
- Automatic customer notifications
- Incident timeline/history
- Subscriber management
Email/SMS
- Targeted notifications
- Offline communication
- Record-keeping
Slack/Discord/Teams
- Internal team coordination
- Rapid communication
- Context sharing
Social Media
- Large audience reach
- Quick updates
- External visibility
Next Steps
- Incident Lifecycle - How incidents progress
- Status Pages - Publish incidents publicly
- Notifications - Alert your team