Server Monitoring 101: CPU, Memory, Disk & Beyond
Server monitoring catches problems before they cause outages. Learn what metrics to track, what thresholds to set, and how to set up effective server monitoring.
Your website might be responding to HTTP requests, but that does not mean your server is healthy. A server running at 95% CPU, with 90% memory used and a disk that is 98% full, is a ticking time bomb. Server monitoring gives you visibility into the health of your infrastructure before problems cascade into outages.
Why Server Monitoring Matters
External monitoring (HTTP checks, ping) tells you whether your service is accessible from the outside. Server monitoring tells you why it might stop being accessible.
Common failure scenarios that external monitoring misses:
- Gradual CPU saturation: Response times slowly increase until the server cannot handle requests
- Memory leaks: Applications consume more memory over time until the OOM killer strikes
- Disk filling up: Logs, uploads, or database files grow until the disk is full, causing crashes
- Process crashes: A critical background process dies but the web server keeps responding
- Network saturation: Bandwidth limits are reached, causing dropped connections
By the time external monitoring detects these issues, the impact is already being felt by users. Server monitoring gives you advance warning.
Key Metrics to Monitor
CPU Usage
What it measures: The percentage of CPU time spent on processing tasks.
Why it matters: High CPU usage means your server is working hard. Sustained high CPU (above 80-90%) means requests will start queuing, response times will increase, and eventually the server may become unresponsive.
Thresholds:
- Warning at 70%: Investigate and consider scaling
- Critical at 90%: Immediate action required
What to look for:
- Sustained high CPU vs. brief spikes (spikes are often normal)
- CPU usage by type: user, system, I/O wait
- I/O wait indicates disk bottlenecks, not CPU bottlenecks
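To make the user, system, and I/O wait split concrete, here is a minimal sketch that computes it from two snapshots of the aggregate `cpu` line in Linux's `/proc/stat`. The function names are illustrative assumptions, not part of any tool mentioned in this article.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat (first 8 fields).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse an aggregate 'cpu ...' line into a dict of cumulative tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(p) for p in parts[1:1 + len(FIELDS)])))


def cpu_breakdown(before, after):
    """Return the percent of elapsed CPU time per category between two snapshots."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * v / total for k, v in deltas.items()}


def read_cpu_snapshot(path="/proc/stat"):
    """Read the current cumulative counters (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

Taking two snapshots a few seconds apart and passing them to `cpu_breakdown` gives percentages over that window. A high `iowait` share with low `user` and `system` shares suggests the disk, not the CPU, is the bottleneck.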
Memory Usage
What it measures: RAM utilization including used, free, cached, and buffered memory.
Why it matters: When physical RAM is exhausted, the OS starts swapping to disk (dramatically slower) or the OOM (Out of Memory) killer terminates processes.
Thresholds:
- Warning at 80%: Memory pressure is building
- Critical at 95%: At risk of OOM or excessive swapping
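Important nuance: On Linux, “free” memory looks low because the kernel uses spare RAM for disk cache, most of which can be reclaimed on demand. The meaningful metric is “available” (MemAvailable in /proc/meminfo), the kernel's estimate of how much memory applications can claim without swapping. Older tools that count cache as used need the manual correction used - (buffers + cached).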
```shell
# Check actual memory usage on Linux
free -h
#        total   used    free    shared  buff/cache  available
# Mem:   16Gi    8.2Gi   1.1Gi   256Mi   6.7Gi       7.3Gi
# "available" is what matters, not "free"
```
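The "available" column that `free` prints comes from the `MemAvailable` field in `/proc/meminfo`. As a rough sketch, a monitoring script could compute a usage percentage from it like this; the function names are hypothetical, and `MemAvailable` is assumed to be present (Linux 3.14 and later).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of {field: value}, mostly in kB."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[name.strip()] = int(fields[0])
    return info


def memory_used_percent(info):
    """Percent of RAM that is not available to applications."""
    total = info["MemTotal"]
    available = info["MemAvailable"]  # assumption: kernel exposes this field
    return 100.0 * (total - available) / total


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Comparing `memory_used_percent(read_meminfo())` against the 80% and 95% thresholds avoids false alarms caused by a large but reclaimable disk cache.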
Disk Usage
What it measures: Storage space utilization on each mounted filesystem.
Why it matters: A full disk causes cascading failures:
- Databases crash when they cannot write
- Applications fail when log files cannot be written
- Temporary files cannot be created
- Package managers cannot update
Thresholds:
- Warning at 80%: Plan cleanup or expansion
- Critical at 90%: Urgent action required
- Emergency at 95%: Immediate intervention
What to watch: Disk usage growth rate. If you are at 80% and growing by 1 percentage point per day, you have 20 days until the disk is full. At 5 points per day, you have 4 days.
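The growth-rate arithmetic is simple enough to automate. The sketch below assumes a constant growth rate expressed in percentage points per day; `days_until_full` and `disk_used_percent` are illustrative names.

```python
import shutil


def days_until_full(used_percent, growth_points_per_day):
    """Days left before reaching 100% at a constant growth rate."""
    if growth_points_per_day <= 0:
        return float("inf")  # not growing, so it never fills at this rate
    return max(0.0, (100.0 - used_percent) / growth_points_per_day)


def disk_used_percent(path="/"):
    """Current usage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

For example, `days_until_full(80, 1)` gives 20.0 and `days_until_full(80, 5)` gives 4.0, matching the figures above. Real growth is rarely constant, so treat the result as an estimate, not a deadline.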
Disk I/O
What it measures: Read/write operations per second and throughput.
Why it matters: Even with available disk space, I/O saturation causes severe performance degradation. Database-heavy applications are particularly sensitive.
Key metrics:
- IOPS: Input/Output operations per second
- Throughput: MB/s read and written
- I/O wait: CPU time spent waiting for I/O (should be under 10%)
- Queue depth: Number of pending I/O operations
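On Linux, these figures can be derived from the cumulative counters in `/proc/diskstats`. The following is a minimal sketch, assuming the standard field layout of that file; the function names and the choice of decimal megabytes are assumptions made for illustration.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats_line(line):
    """Extract the cumulative counters we need from one /proc/diskstats line."""
    parts = line.split()
    return {
        "device": parts[2],
        "reads": int(parts[3]),             # reads completed
        "sectors_read": int(parts[5]),
        "writes": int(parts[7]),            # writes completed
        "sectors_written": int(parts[9]),
        "in_flight": int(parts[11]),        # I/Os in progress right now (a gauge)
        "busy_ms": int(parts[12]),          # milliseconds the device spent doing I/O
    }


def io_rates(before, after, interval_s):
    """Turn two snapshots taken interval_s seconds apart into per-second rates."""
    ops = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    read_bytes = (after["sectors_read"] - before["sectors_read"]) * SECTOR_BYTES
    write_bytes = (after["sectors_written"] - before["sectors_written"]) * SECTOR_BYTES
    busy_ms = after["busy_ms"] - before["busy_ms"]
    return {
        "iops": ops / interval_s,
        "read_mb_s": read_bytes / 1e6 / interval_s,
        "write_mb_s": write_bytes / 1e6 / interval_s,
        "util_percent": 100.0 * busy_ms / (interval_s * 1000),
        "in_flight": after["in_flight"],
    }
```

The `in_flight` value is an instantaneous reading, so a single sample says little; sampling it repeatedly gives a better picture of queue depth.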
Network I/O
What it measures: Bytes sent and received, packet counts, errors, and drops.
Key metrics:
- Bandwidth usage: Are you approaching network limits?
- Packet errors: Errors indicate network hardware or driver issues
- Connection count: Active TCP connections
- Dropped packets: Network interface or kernel is overwhelmed
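Byte, error, and drop counters for each interface are exposed in `/proc/net/dev` on Linux. Here is a small sketch of turning two readings into per-second rates; it assumes the usual 16-column layout of that file, and the function names are hypothetical.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev text into {interface: cumulative counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, _, data = line.partition(":")
        fields = data.split()
        if len(fields) < 16:
            continue  # skip anything that is not a full counter row
        values = [int(v) for v in fields]
        stats[name.strip()] = {
            "rx_bytes": values[0], "rx_errs": values[2], "rx_drop": values[3],
            "tx_bytes": values[8], "tx_errs": values[10], "tx_drop": values[11],
        }
    return stats


def net_rates(before, after, interval_s):
    """Per-second change in each counter for a single interface."""
    return {key: (after[key] - before[key]) / interval_s for key in before}
```

A steadily rising `rx_drop` or `tx_errs` rate is usually more informative than the raw totals, which only ever grow.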
Load Average
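What it measures: The average number of processes running or waiting to run, measured over 1, 5, and 15 minutes. On Linux, it also counts processes in uninterruptible sleep, which are usually blocked on disk I/O.
How to interpret: A load average of 1.0 on a single-core server means the CPU is roughly fully utilized. On a 4-core server, a load average of 4.0 means full utilization. Because Linux includes I/O-blocked processes, a high load average alongside low CPU usage points to a disk bottleneck.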
Rule of thumb: Load average should stay below the number of CPU cores. If your 4-core server consistently shows a load average above 4, it is overloaded.
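The rule of thumb amounts to dividing the load average by the core count. A minimal sketch, with illustrative function names, using Python's standard `os.getloadavg()` (Unix only) and `os.cpu_count()`:

```python
import os


def load_per_core(load, cores):
    """Normalize a load average by core count; 1.0 means roughly full utilization."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def is_overloaded(load, cores):
    """Apply the rule of thumb: load above the core count means overloaded."""
    return load_per_core(load, cores) > 1.0


def current_load_per_core():
    """Return the 1-, 5-, and 15-minute load averages per core (Unix only)."""
    cores = os.cpu_count() or 1
    return tuple(load_per_core(value, cores) for value in os.getloadavg())
```

Alerting on the 5- or 15-minute value instead of the 1-minute value helps ignore brief spikes.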
Process Monitoring
Beyond resource metrics, monitor that critical processes are running:
- Web server (nginx, Apache, Caddy)
- Application server (Node.js, Python, Java, PHP-FPM)
- Database (PostgreSQL, MySQL, MongoDB)
- Cache (Redis, Memcached)
- Queue workers (Sidekiq, Celery, Bull)
- Cron daemon
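One simple way to check this on Linux is to compare a list of expected process names against what is currently running. The sketch below reads `/proc/<pid>/comm`; the expected names are examples only, and note that the kernel truncates `comm` to 15 characters.

```python
import os


def running_process_names(proc_root="/proc"):
    """Collect command names of live processes from /proc/<pid>/comm (Linux only)."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                names.add(f.read().strip())
        except OSError:
            continue  # the process exited between listing and reading
    return names


def missing_processes(expected, running):
    """Return the expected process names that are not currently running."""
    return sorted(set(expected) - set(running))
```

For example, `missing_processes(["nginx", "postgres"], running_process_names())` returns an empty list when both are up. A name match only proves the process exists, not that it is healthy, so pair this with application-level checks.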
Setting Up Server Monitoring
Agent-Based Monitoring
The most common approach: install a lightweight agent on your server that reports metrics to your monitoring platform.
StatusApp provides a server monitoring agent that:
- Reports CPU, memory, disk, and network metrics
- Runs as a lightweight background process
- Uses minimal resources (typically under 1% CPU and 20MB RAM)
- Sends data securely over HTTPS
Agentless Monitoring
For environments where installing an agent is not possible (managed hosting, certain compliance requirements), you can:
- Monitor server health through application-level health endpoints
- Use SSH-based checks (periodic remote commands)
- Rely on cloud provider metrics (AWS CloudWatch, GCP Monitoring)
Cloud Provider Integration
If you use AWS, GCP, or Azure, your cloud provider offers basic server monitoring. However, these built-in tools are often:
- Limited in metric granularity
- Expensive at scale (CloudWatch charges per metric)
- Siloed from your application monitoring
A unified monitoring platform like StatusApp lets you see server metrics alongside website, API, and SSL monitoring in one dashboard.
Alert Thresholds by Server Role
Different servers have different normal patterns:
Web Servers
- CPU: Warning at 70%, Critical at 90%
- Memory: Warning at 80%, Critical at 90%
- Disk: Warning at 75%, Critical at 90%
- Focus on: Connection count, request queue depth
Database Servers
- CPU: Warning at 60%, Critical at 80% (databases need headroom for query bursts)
- Memory: Warning at 85%, Critical at 95%
- Disk: Warning at 70%, Critical at 85% (databases need disk space for operations)
- Focus on: Disk I/O, replication lag, connection pool usage
Worker/Queue Servers
- CPU: Warning at 80%, Critical at 95% (workers are expected to be busy)
- Memory: Warning at 80%, Critical at 90%
- Focus on: Queue depth, job processing rate, failed job count
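The role-specific thresholds above map naturally onto a small lookup table. This is a sketch of one possible encoding, not a StatusApp configuration format; the names `THRESHOLDS` and `classify` are hypothetical.

```python
# (warning, critical) percentages per role, copied from the lists above.
THRESHOLDS = {
    "web": {"cpu": (70, 90), "memory": (80, 90), "disk": (75, 90)},
    "database": {"cpu": (60, 80), "memory": (85, 95), "disk": (70, 85)},
    "worker": {"cpu": (80, 95), "memory": (80, 90)},
}


def classify(role, metric, value):
    """Return 'ok', 'warning', or 'critical' for a metric value on a given role."""
    warning, critical = THRESHOLDS[role][metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"
```

The same reading can mean different things by role: 65% CPU is `"ok"` on a web server but `"warning"` on a database server.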
Common Server Issues and How Monitoring Catches Them
Memory leak: Memory usage creeps up 1-2% per day. Without monitoring, you discover it when the server crashes at 3 AM. With monitoring, you see the trend and restart the application during business hours.
Log file growth: Application logs fill the disk over weeks. Monitoring alerts you at 80% disk usage, giving you time to configure log rotation.
Zombie processes: Workers that exit without being reaped by their parent linger as zombie (defunct) entries in the process table. Process monitoring detects that your expected worker count has dropped.
Network saturation: A traffic spike or DDoS attack saturates your bandwidth. Network monitoring alerts you as soon as the next reporting interval shows the spike, often within seconds.
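CPU thermal throttling: In bare-metal environments, overheating causes the CPU to lower its clock speed. The same workload then takes longer, so load average and response times climb even though traffic has not changed. CPU percentage alone does not reveal the cause; temperature or clock-frequency metrics do.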
Server Monitoring + External Monitoring
Server monitoring is most powerful when combined with external monitoring:
| Issue | Server Monitoring Detects | External Monitoring Detects |
|---|---|---|
| High CPU | Yes (cause) | Yes (effect: slow responses) |
| Full disk | Yes (cause) | Yes (effect: errors) |
| Network outage | Maybe (the agent stops reporting, which shows up as a data gap) | Yes (site unreachable) |
| Application bug | No (server is healthy) | Yes (wrong responses) |
| DNS issue | No (server is fine) | Yes (site unreachable) |
Together, they give you both the “what” and the “why.” External monitoring tells you that something is wrong. Server monitoring tells you why.
Get complete visibility into your server health. Start monitoring with StatusApp — server monitoring alongside 9 other monitor types.