Incident Communication
Best practices for communicating incidents to your team and customers during downtime.
The Importance of Communication
During incidents, communication is as important as fixing the problem.
Why Communication Matters:
- Reduces customer support tickets
- Builds trust through transparency
- Shows you're in control
- Prevents rumors and misinformation
- Demonstrates professionalism
- Reduces customer anxiety
Poor Communication:
- Silence during outage = customers panic
- Vague updates = customer frustration
- Delayed updates = loss of trust
- No transparency = reputation damage
Effective Communication:
- Frequent updates (every 30-60 min)
- Transparent about what you know
- Clear about what you're doing
- Realistic timelines
- Apologetic and professional
Real-Time Update Strategy
Initial Update (Within 15 Minutes)
Post the first update immediately when the issue is detected:
What to Include:
- Acknowledge the problem
- State what's affected
- Brief description of issue
- What you're doing right now
Example:
"We're aware of intermittent errors affecting our payment service.
Our team is investigating. We'll have an update in 10 minutes."
What to Avoid:
- Speculation about root cause
- Blame on other teams/services
- Over-promises on ETA
- Technical jargon customers don't understand
Status Updates (Every 30-60 Minutes)
Continue updating customers throughout incident:
What to Include:
- Current status
- Progress since last update
- Updated timeline if applicable
- Any new discoveries
Example:
"Update: We've identified the issue is in our message queue service.
The team has implemented a fix and is currently validating it.
We expect full resolution within 30 minutes."
Keep Updates:
- Concise (2-3 sentences typical)
- Specific (not just "we're working on it")
- Honest (share what you know)
- Professional (no frustration)
Resolution Update (When Fixed)
Post final update when issue is fully resolved:
What to Include:
- Confirmation service is fully operational
- Root cause (if known)
- What you've done to prevent recurrence
- Appreciation for patience
Example:
"The incident is now fully resolved. Root cause was a memory leak
in our message queue service that accumulated over 3 days. We've
deployed a fix and will be monitoring closely. We apologize for
the disruption and appreciate your patience."
Always:
- Apologize for the inconvenience
- Thank customers for patience
- Explain prevention measures
Customer Communication Channels
Status Page
Primary channel for incident communication:
What Appears:
- Incident created when monitor fails
- Real-time status updates
- Affected services clearly listed
- Uptime impact calculated
- Accessible 24/7
Best For: All customers, maximum visibility
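The uptime impact shown on the status page is simple arithmetic: downtime divided by the length of the reporting window. A minimal sketch (the function name and 30-day window are illustrative, not StatusApp's actual calculation):

```python
def uptime_impact(downtime_minutes: float, window_days: int = 30) -> float:
    """Return the uptime percentage over the window after an outage."""
    total_minutes = window_days * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes

# A 45-minute outage in a 30-day window:
print(round(uptime_impact(45), 3))  # 99.896
```

Even a noticeable outage moves monthly uptime by only a fraction of a percent, which is why the timeline and communication matter more to customers than the raw number.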
Email Notifications
Automated emails to subscribers:
Triggers:
- New incident created
- Incident status changed
- Incident resolved
- Maintenance scheduled
Best For: Detailed communication, record-keeping
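The trigger list above can be wired up as a simple event-to-template mapping. This is a hypothetical sketch of such a dispatcher, not StatusApp's actual notification API; the event names and subject formats are illustrative:

```python
# Hypothetical mapping of notification triggers to email subject templates.
SUBJECTS = {
    "incident_created": "[Incident] {service}: investigating",
    "incident_updated": "[Update] {service}: {status}",
    "incident_resolved": "[Resolved] {service} is operational",
    "maintenance_scheduled": "[Maintenance] {service}: scheduled window",
}

def email_subject(event: str, **fields) -> str:
    """Build the subject line for a subscriber email from a trigger event."""
    return SUBJECTS[event].format(**fields)

print(email_subject("incident_resolved", service="Payments"))
# [Resolved] Payments is operational
```

Keeping subjects template-driven means every subscriber sees the same wording for the same event, which supports the consistency practice covered later.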
Slack/Discord/Teams
Team channels for internal coordination:
What to Post:
- Initial discovery
- Status updates (shorter than public)
- Key decisions
- Resolution confirmation
Best For: Team coordination, rapid response
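Internal updates like these can be pushed to a channel via an incoming webhook. The payload below follows Slack's incoming-webhook format (a JSON body with a "text" field); the webhook URL is a placeholder you'd replace with your own hook, and the message format is just a suggestion:

```python
import json
import urllib.request

def build_slack_payload(status: str, details: str) -> dict:
    """Format a short internal update for Slack's incoming-webhook API."""
    return {"text": f"*Incident update* [{status}]: {details}"}

def post_update(webhook_url: str, payload: dict) -> None:
    """POST the payload as JSON to the (placeholder) webhook URL."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()

payload = build_slack_payload(
    "identified", "Root cause is the message queue; fix in validation."
)
```

Separating payload construction from delivery makes the formatting testable without hitting the network.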
Twitter/Social Media
Optional for large outages:
When to Use:
- Outage affecting major portion of user base
- Incident lasting > 2 hours
- High customer visibility needed
What to Post:
- Acknowledge issue
- Share status page link
- Update on progress
- Resolution confirmation
Example:
"We're currently experiencing issues with our payment service.
Our team is investigating. Real-time updates: status.company.com"
What NOT to Say During Incidents
Don't Blame External Services
Bad: "Our provider's issue, not ours"
Good: "We're working with our provider to resolve this"
Customers don't care who's responsible - they just want it fixed
Don't Make Promises You Can't Keep
Bad: "We'll have this fixed in 5 minutes"
Good: "We expect resolution within 30 minutes"
Broken promises destroy trust more than waiting
Don't Use Technical Jargon
Bad: "Database connection pool exhausted, scaling MySQL replicas"
Good: "Database running out of connections, adding capacity"
Customers don't understand technical details
Don't Disappear
Bad: No updates for 2 hours during active incident
Good: Update every 30-60 minutes minimum
Silence breeds panic and rumors
Don't Minimize the Issue
Bad: "Just a minor glitch affecting a few users"
Good: "We're experiencing service degradation affecting payments"
Be honest about severity
Post-Mortem Process
After the incident is resolved, conduct a post-mortem:
Step 1: Schedule Meeting (Within 48 Hours)
Don't wait too long:
- Team memory still fresh
- Details accurate
- Momentum to implement fixes
Step 2: Gather Information
Collect during meeting:
What to Document:
- Timeline of events (minute by minute)
- Root cause (why did it happen?)
- Impact (how many customers affected?)
- Detection time (how long before discovered?)
- Resolution time (how long to fix?)
- Contributing factors
Use Data:
- Incident timeline from StatusApp
- Error logs from systems
- Customer impact data
- Team observations
Example Timeline:
2:00 PM - Issue begins (database connections exhausted)
2:05 PM - First customer report
2:10 PM - Alert fires in monitoring
2:15 PM - Incident created
2:20 PM - Root cause identified
2:30 PM - Fix deployed
2:45 PM - Incident resolved (45 min total MTTR)
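From a timeline like the one above, detection time and MTTR fall out directly. A minimal sketch using stdlib datetimes (the event names and parsing format are illustrative):

```python
from datetime import datetime

FMT = "%I:%M %p"  # e.g. "2:45 PM"
timeline = {
    "issue_begins": "2:00 PM",
    "alert_fires": "2:10 PM",
    "resolved": "2:45 PM",
}
t = {name: datetime.strptime(stamp, FMT) for name, stamp in timeline.items()}

# How long until monitoring caught it, and total time to resolution.
detection_minutes = (t["alert_fires"] - t["issue_begins"]).seconds // 60
mttr_minutes = (t["resolved"] - t["issue_begins"]).seconds // 60
print(detection_minutes, mttr_minutes)  # 10 45
```

Note the 10-minute gap between issue start and the alert firing: the first customer report arrived before monitoring did, which is itself a post-mortem finding.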
Step 3: Identify Root Causes
Dig deeper than immediate cause:
5 Whys Technique:
1. Why did database connections exhaust? → Traffic spike exceeded capacity
2. Why did traffic spike? → New feature launch with viral marketing
3. Why weren't connections scaled? → Didn't anticipate that level of traffic
4. Why no capacity planning? → Load testing only simulated 50% actual load
5. Why was load testing insufficient? → Outdated assumptions about typical usage
Root Cause: Insufficient load testing and capacity planning
Step 4: Discuss Lessons Learned
What to improve:
Types of Lessons:
- What worked well (keep doing this)
- What didn't work (don't do again)
- What to improve
- What will prevent recurrence
Example Lessons:
What Worked Well:
- Team responded quickly
- Communication was clear
- Root cause identified fast
What to Improve:
- Capacity planning was inadequate
- Load testing didn't simulate real conditions
- No pre-incident escalation procedures
Preventive Measures:
- Implement continuous load testing
- Auto-scale database connections
- Set up traffic surge alerts
- Create escalation procedures
Step 5: Assign Action Items
Concrete steps to prevent recurrence:
Make Assignments:
- Who is responsible?
- What exactly needs to be done?
- When is the deadline?
- How will you verify it's done?
Example Action Items:
1. Implement auto-scaling for database connections
Owner: Database team
Deadline: 1 week
Verify: Test auto-scaling response
2. Update load testing scenarios
Owner: QA team
Deadline: 2 weeks
Verify: New tests pass
3. Implement traffic surge alerts
Owner: DevOps team
Deadline: 1 week
Verify: Alert triggers at 80% capacity
4. Document escalation procedures
Owner: On-call engineer
Deadline: 3 days
Verify: All team trained
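The "alert triggers at 80% capacity" verification step above boils down to a threshold check. A minimal sketch (the function name and default threshold are illustrative):

```python
def should_alert(in_use: int, capacity: int, threshold: float = 0.8) -> bool:
    """Fire a surge alert once utilization reaches the threshold."""
    return in_use >= threshold * capacity

print(should_alert(80, 100))  # True
print(should_alert(79, 100))  # False
```

Verifying the action item then means driving utilization to the threshold in a test environment and confirming the alert actually fires, not just that the code exists.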
Step 6: Publish Post-Mortem (Optional)
Share with customers:
Customer-Facing Post-Mortem:
- High-level timeline
- What you learned
- What you're doing differently
- Appreciation for patience
Example:
Post-Mortem: Payment Service Incident
On January 20, we experienced a 45-minute outage affecting our
payment service. Our team has investigated the root cause and
implemented preventive measures.
What Happened:
We launched a new feature that generated significantly more traffic
than anticipated. Our database connection pool reached capacity,
causing services to fail.
What We Learned:
Our load testing didn't accurately simulate real user behavior. We've
updated our testing procedures.
What We're Doing:
1. Implemented automatic scaling for database resources
2. Updated our load testing to better simulate production conditions
3. Set up early warning alerts for resource saturation
We apologize for the disruption and appreciate your patience.
Communication Templates
Incident Start Template
"We're aware of [service/feature] issues starting at [time].
Our team is investigating. We'll provide an update in [15 min]."
Progress Update Template
"Update: We've identified [issue]. The team is [action].
We expect [timeframe]."
Resolution Template
"This incident has been resolved. Root cause was [brief explanation].
We've [prevention measure]. We apologize for the disruption."
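The templates above can be filled in programmatically so every channel gets consistent wording. A sketch using str.format, with the bracketed placeholders renamed to valid identifiers (the template keys and field names are illustrative):

```python
# Illustrative versions of the templates above, with named placeholders.
TEMPLATES = {
    "start": ("We're aware of {service} issues starting at {time}. "
              "Our team is investigating. We'll provide an update in {eta}."),
    "progress": ("Update: We've identified {issue}. The team is {action}. "
                 "We expect {timeframe}."),
    "resolved": ("This incident has been resolved. Root cause was {cause}. "
                 "We've {prevention}. We apologize for the disruption."),
}

def render(kind: str, **fields) -> str:
    """Fill a template so the same message can go to every channel."""
    return TEMPLATES[kind].format(**fields)

print(render("start", service="payment", time="2:00 PM", eta="15 minutes"))
```

Rendering from one source of truth avoids the "different information in email vs status page" mistake called out under Be Consistent.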
Best Practices
1. Be Proactive
Post incidents before customers call support:
Good: Post incident 2 minutes after detecting
Bad: Wait for first customer complaint
Bad: Wait until you know the root cause
2. Be Transparent
Share what you know, not what you don't:
Good: "We're still investigating but here's what we know so far..."
Bad: "We have no idea"
Bad: Going silent
3. Be Consistent
Use same status page for all customers:
Good: One status page with updates
Bad: Telling different customers different things
Bad: Different information in email vs status page
4. Be Professional
Maintain professionalism even under pressure:
Good: "We're working to resolve this"
Bad: "Our stupid CDN failed again"
Bad: Frustration evident in tone
5. Follow Up
Post-mortem shows you care about improvement:
Good: Conduct post-mortem, share learnings
Good: Implement preventive measures
Good: Share progress with customers
Bad: Forget about it after incident ends
Bad: No preventive measures taken
Common Mistakes to Avoid
Waiting Too Long for First Update
Mistake: "Let's understand the full issue before updating"
Problem: Customers panic, support tickets flood in
Fix: Update immediately, even if partial information
Over-Promising on Timeframe
Mistake: "Should be fixed in 15 minutes"
Problem: When it takes 45 minutes, customer anger increases
Fix: Under-promise, over-deliver (say 30 min, fix in 20)
Using Too Much Technical Jargon
Mistake: "Database connection pool exhausted on MySQL replicas"
Problem: Most customers don't understand, feel excluded
Fix: "Database running out of connections"
No Post-Mortem
Mistake: Fix issue and move on
Problem: Same issue happens again in 3 months
Fix: Conduct post-mortem, implement preventive measures
Disappearing After Resolution
Mistake: Post "resolved" update and ghost
Problem: No post-mortem, no explanation, no prevention
Fix: Follow up with post-mortem and learnings
Tools for Communication
StatusApp Status Page
- Real-time incident publishing
- Automatic customer notifications
- Incident timeline/history
- Subscriber management
Email/SMS
- Targeted notifications
- Offline communication
- Record-keeping
Slack/Discord/Teams
- Internal team coordination
- Rapid communication
- Context sharing
Social Media
- Large audience reach
- Quick updates
- External visibility
Next Steps
- Incident Lifecycle - How incidents progress
- Status Pages - Publish incidents publicly
- Notifications - Alert your team