Server Monitoring 101: CPU, Memory, Disk & Beyond

Your website might be responding to HTTP requests, but that does not mean your server is healthy. A server running at 95% CPU, with 90% memory used and a disk that is 98% full, is a ticking time bomb. Server monitoring gives you visibility into the health of your infrastructure before problems cascade into outages.

Why Server Monitoring Matters

External monitoring (HTTP checks, ping) tells you whether your service is accessible from the outside. Server monitoring tells you why it might stop being accessible.

Common failure scenarios that external monitoring misses:

Gradual CPU saturation: Response times slowly increase until the server cannot handle requests
Memory leaks: Applications consume more memory over time until the OOM killer strikes
Disk filling up: Logs, uploads, or database files grow until the disk is full, causing crashes
Process crashes: A critical background process dies but the web server keeps responding
Network saturation: Bandwidth limits are reached, causing dropped connections

By the time external monitoring detects these issues, the impact is already being felt by users. Server monitoring gives you advance warning.

Key Metrics to Monitor

CPU Usage

What it measures: The percentage of time the CPU spends doing work rather than sitting idle.

Why it matters: High CPU usage means your server is working hard. Sustained high CPU (above 80-90%) means requests will start queuing, response times will increase, and eventually the server may become unresponsive.

Thresholds:

Warning at 70%: Investigate and consider scaling
Critical at 90%: Immediate action required
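A single reading above a threshold is rarely worth an alert on its own. As a minimal sketch, assuming one CPU sample per check interval, the helper below fires only when several consecutive samples exceed the threshold. The function name, the 3-sample window, and the sample values are illustrative choices, not part of any particular monitoring tool.

```python
from collections import deque


def make_sustained_check(threshold_pct, window):
    """Return a checker that is True only when the last `window`
    samples were all above `threshold_pct`."""
    recent = deque(maxlen=window)

    def check(sample_pct):
        recent.append(sample_pct)
        return len(recent) == window and all(s > threshold_pct for s in recent)

    return check


# Hypothetical CPU readings: one brief spike, then a sustained breach.
cpu_critical = make_sustained_check(threshold_pct=90.0, window=3)
results = [cpu_critical(s) for s in [95, 40, 92, 96, 99, 91]]
print(results)  # [False, False, False, False, True, True]
```

The first reading of 95% does not trigger anything because it is followed by a drop. The alert fires only once three readings in a row stay above 90%.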

What to look for:

Sustained high CPU vs. brief spikes (spikes are often normal)
CPU usage by type: user, system, I/O wait
I/O wait indicates disk bottlenecks, not CPU bottlenecks
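To make the user, system, and I/O wait split concrete, here is a rough sketch that compares two snapshots of the aggregate cpu line from /proc/stat on Linux. The two sample lines are made-up values for illustration. On a real host you would read the first line of /proc/stat twice, a few seconds apart.

```python
# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line into a dict of tick counters."""
    values = line.split()[1:9]
    return dict(zip(FIELDS, (int(v) for v in values)))


def cpu_breakdown(before, after):
    """Percent of elapsed CPU time spent in each category between samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Hypothetical snapshots taken a few seconds apart.
sample_1 = "cpu  1000 0 500 8000 400 0 100 0 0 0"
sample_2 = "cpu  1300 0 600 8400 550 0 150 0 0 0"

pct = cpu_breakdown(parse_cpu_line(sample_1), parse_cpu_line(sample_2))
busy = 100.0 - pct["idle"] - pct["iowait"]
print(f"user={pct['user']:.0f}% system={pct['system']:.0f}% "
      f"iowait={pct['iowait']:.0f}% busy={busy:.0f}%")
```

In this example 15% of the interval went to I/O wait, which would suggest looking at storage before adding CPU.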

Memory Usage

What it measures: RAM utilization including used, free, cached, and buffered memory.

Why it matters: When physical RAM is exhausted, the OS starts swapping to disk (dramatically slower) or the OOM (Out of Memory) killer terminates processes.

Thresholds:

Warning at 80%: Memory pressure is building
Critical at 95%: At risk of OOM or excessive swapping

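Important nuance: On Linux, the kernel uses otherwise idle RAM for disk cache and buffers, so “free” memory is often low on a perfectly healthy server. Most of that cache can be reclaimed when applications need it. The meaningful metric is “available” (MemAvailable in /proc/meminfo), the kernel's estimate of how much memory applications can still claim without swapping.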

# Check actual memory usage on Linux
free -h
#               total        used        free      shared  buff/cache   available
# Mem:           16Gi        8.2Gi       1.1Gi       256Mi       6.7Gi       7.3Gi

# "available" is what matters, not "free"
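The same idea can be sketched in code: base the alert on MemAvailable rather than MemFree. The snippet below parses text in /proc/meminfo format. The sample values are invented to roughly match the free -h output above, and the function names are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def unavailable_pct(meminfo):
    """Percent of RAM that applications could not claim without swapping."""
    return 100.0 * (1 - meminfo["MemAvailable"] / meminfo["MemTotal"])


# Hypothetical sample; on a real host, read the contents of /proc/meminfo.
sample = """MemTotal:       16384000 kB
MemFree:         1126400 kB
MemAvailable:    7475200 kB
Buffers:          204800 kB
Cached:          6656000 kB"""

info = parse_meminfo(sample)
naive_pct = 100.0 * (1 - info["MemFree"] / info["MemTotal"])
real_pct = unavailable_pct(info)
print(f"based on free: {naive_pct:.1f}%  based on available: {real_pct:.1f}%")
```

A check based on free memory would report this server as over 90% used and page someone. The available-based figure is well under the 80% warning level.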

Disk Usage

What it measures: Storage space utilization on each mounted filesystem.

Why it matters: A full disk causes cascading failures:

Databases crash when they cannot write
Applications fail when log files cannot be written
Temporary files cannot be created
Package managers cannot update

Thresholds:

Warning at 80%: Plan cleanup or expansion
Critical at 90%: Urgent action required
Emergency at 95%: Immediate intervention

What to watch: Disk usage growth rate. If you are at 80% and usage grows by 1 percentage point per day, you have 20 days until the disk is full. At 5 percentage points per day, you have 4 days.
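The runway arithmetic is simple enough to automate. This is a minimal sketch, assuming growth is roughly linear and expressed in percentage points per day. The function names are illustrative, and shutil.disk_usage is the standard-library call for reading the current figure.

```python
import shutil


def used_percent(path="/"):
    """Current usage of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def days_until_full(used_pct, growth_points_per_day):
    """Days left before reaching 100%, assuming linear growth."""
    if growth_points_per_day <= 0:
        return float("inf")  # flat or shrinking usage never fills the disk
    return (100.0 - used_pct) / growth_points_per_day


print(days_until_full(80, 1))  # 20.0
print(days_until_full(80, 5))  # 4.0
print(f"root filesystem is {used_percent('/'):.1f}% used")
```

Alerting on days of runway rather than on a fixed percentage can catch a fast-growing disk that is still well below the warning threshold.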

Disk I/O

What it measures: Read/write operations per second and throughput.

Why it matters: Even with available disk space, I/O saturation causes severe performance degradation. Database-heavy applications are particularly sensitive.

Key metrics:

IOPS: Input/Output operations per second
Throughput: MB/s read and written
I/O wait: CPU time spent waiting for I/O (as a rule of thumb, sustained values above 10% deserve investigation)
Queue depth: Number of pending I/O operations
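On Linux, these figures can be derived from the cumulative counters in /proc/diskstats by sampling twice and dividing by the interval. The sketch below assumes the standard field layout, in which sectors are always counted in 512-byte units. The two sample lines and the 10-second interval are invented for illustration.

```python
SECTOR_BYTES = 512  # /proc/diskstats reports sectors in 512-byte units


def parse_diskstats_line(line):
    """Extract the counters needed for rate calculations from one device line."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),
        "sectors_read": int(f[5]),
        "writes": int(f[7]),
        "sectors_written": int(f[9]),
        "in_flight": int(f[11]),  # I/Os currently in progress (queue depth)
        "io_ms": int(f[12]),      # milliseconds the device spent doing I/O
    }


def io_rates(before, after, interval_s):
    """Convert two cumulative samples into per-second rates."""
    ops = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    read_bytes = (after["sectors_read"] - before["sectors_read"]) * SECTOR_BYTES
    write_bytes = (after["sectors_written"] - before["sectors_written"]) * SECTOR_BYTES
    return {
        "iops": ops / interval_s,
        "read_mb_s": read_bytes / 1e6 / interval_s,
        "write_mb_s": write_bytes / 1e6 / interval_s,
        "busy_pct": 100.0 * (after["io_ms"] - before["io_ms"]) / (interval_s * 1000),
        "in_flight": after["in_flight"],
    }


# Hypothetical samples for device "sda", taken 10 seconds apart.
t0 = parse_diskstats_line("8 0 sda 1000 0 80000 500 2000 0 160000 900 0 1200 1400")
t1 = parse_diskstats_line("8 0 sda 1500 0 120000 700 3000 0 360000 1500 2 3200 2600")
rates = io_rates(t0, t1, interval_s=10)
print(rates)
```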

Network I/O

What it measures: Bytes sent and received, packet counts, errors, and drops.

Key metrics:

Bandwidth usage: Are you approaching network limits?
Packet errors: Errors indicate network hardware or driver issues
Connection count: Active TCP connections
Dropped packets: Network interface or kernel is overwhelmed
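Bandwidth, error, and drop figures on Linux come from the per-interface counters in /proc/net/dev. As a rough sketch, the code below parses two snapshots and reports throughput plus any new errors or drops. The interface name eth0, the counter values, and the 10-second interval are assumptions for the example.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev-style text into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        name, _, rest = line.partition(":")
        f = [int(v) for v in rest.split()]
        if len(f) < 12:
            continue
        stats[name.strip()] = {
            "rx_bytes": f[0], "rx_errs": f[2], "rx_drop": f[3],
            "tx_bytes": f[8], "tx_errs": f[10], "tx_drop": f[11],
        }
    return stats


def interface_delta(before, after, interval_s):
    """Throughput in Mbit/s and new error/drop counts between two samples."""
    def diff(key):
        return after[key] - before[key]
    return {
        "rx_mbit_s": diff("rx_bytes") * 8 / 1e6 / interval_s,
        "tx_mbit_s": diff("tx_bytes") * 8 / 1e6 / interval_s,
        "new_errors": diff("rx_errs") + diff("tx_errs"),
        "new_drops": diff("rx_drop") + diff("tx_drop"),
    }


HEADER = (
    "Inter-|   Receive                                   |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
# Hypothetical snapshots taken 10 seconds apart.
snap_0 = HEADER + "  eth0: 1000000 900 0 0 0 0 0 0 500000 700 0 0 0 0 0 0\n"
snap_1 = HEADER + "  eth0: 13500000 9900 0 3 0 0 0 0 3000000 5700 1 0 0 0 0 0\n"

delta = interface_delta(parse_net_dev(snap_0)["eth0"],
                        parse_net_dev(snap_1)["eth0"], interval_s=10)
print(delta)
```

Because the counters are cumulative, any non-zero change in errors or drops between samples is worth recording, even when throughput looks normal.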

Load Average

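What it measures: The average number of processes running or waiting to run, reported over 1, 5, and 15 minutes. On Linux the count also includes processes blocked in uninterruptible I/O, so heavy disk activity raises load average even when the CPU is idle.

How to interpret: A load average of 1.0 on a single-core server means the CPU is, on average, fully occupied. On a 4-core server, a load average of 4.0 means full utilization.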

Rule of thumb: Load average should stay below the number of CPU cores. If your 4-core server consistently shows a load average above 4, it is overloaded.
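The rule of thumb amounts to dividing load average by core count. Here is a small sketch using Python's standard os module. The helper names are illustrative, and os.getloadavg is only available on Unix-like systems, hence the guard.

```python
import os


def load_per_core(load_avg, cores):
    """Normalize load average so that 1.0 means 'all cores busy'."""
    return load_avg / max(cores, 1)


def is_overloaded(load_avg, cores):
    """True when there is more runnable work than cores to run it."""
    return load_per_core(load_avg, cores) > 1.0


print(is_overloaded(3.2, 4))  # False
print(is_overloaded(6.0, 4))  # True

# Live reading, where the platform supports it.
try:
    one, five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"15-minute load per core: {load_per_core(fifteen, cores):.2f}")
except (AttributeError, OSError):
    print("load average is not available on this platform")
```

Checking the 15-minute figure rather than the 1-minute one is a simple way to ignore short bursts.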

Process Monitoring

Beyond resource metrics, monitor that critical processes are running:

Web server (nginx, Apache, Caddy)
Application server (Node.js, Python, Java, PHP-FPM)
Database (PostgreSQL, MySQL, MongoDB)
Cache (Redis, Memcached)
Queue workers (Sidekiq, Celery, Bull)
Cron daemon
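A basic process check compares the names you expect against what is actually running. The sketch below reads process names from /proc/<pid>/comm on Linux, where names are truncated to 15 characters. The expected process names are examples only, so adjust them to whatever your distribution actually calls each service.

```python
import os


def running_process_names(proc_root="/proc"):
    """Collect the command names of all running processes (Linux only)."""
    names = set()
    if not os.path.isdir(proc_root):
        return names
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                names.add(fh.read().strip())
        except OSError:
            continue  # the process exited while we were scanning
    return names


def missing_processes(expected, running):
    """Return the expected process names that are not currently running."""
    return sorted(set(expected) - set(running))


# Example with a hypothetical snapshot of running processes.
expected = ["nginx", "postgres", "redis-server"]
snapshot = {"nginx", "redis-server", "cron", "sshd"}
print(missing_processes(expected, snapshot))  # ['postgres']

# Against the live system:
print(missing_processes(expected, running_process_names()))
```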

Setting Up Server Monitoring

Agent-Based Monitoring

The most common approach: install a lightweight agent on your server that reports metrics to your monitoring platform.

StatusApp provides a server monitoring agent that:

Reports CPU, memory, disk, and network metrics
Runs as a lightweight background process
Uses minimal resources (typically under 1% CPU and 20MB RAM)
Sends data securely over HTTPS

Agentless Monitoring

For environments where installing an agent is not possible (managed hosting, certain compliance requirements), you can:

Monitor server health through application-level health endpoints
Use SSH-based checks (periodic remote commands)
Rely on cloud provider metrics (AWS CloudWatch, GCP Monitoring)
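For the health-endpoint option, the application itself can report a few host-level facts in the body it returns. This is a minimal sketch of such a payload. The field names, the 90% limit, and the "ok"/"degraded" vocabulary are assumptions, and wiring it into a route depends on your web framework.

```python
import json
import shutil


def health_payload(path="/", disk_critical_pct=90.0):
    """Build a small health report that an HTTP check can parse."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    status = "ok" if used_pct < disk_critical_pct else "degraded"
    return {"status": status, "disk_used_pct": round(used_pct, 1)}


# The JSON body a hypothetical /health route would return.
body = json.dumps(health_payload())
print(body)
```

An external HTTP check can then alert on the status field, which gives you some server-level visibility without installing an agent.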

Cloud Provider Integration

If you use AWS, GCP, or Azure, your cloud provider offers basic server metrics. However, these are often:

Limited in metric granularity
Expensive at scale (CloudWatch, for example, bills per custom metric)
Siloed from your application monitoring

A unified monitoring platform like StatusApp lets you see server metrics alongside website, API, and SSL monitoring in one dashboard.

Alert Thresholds by Server Role

Different servers have different normal patterns:

Web Servers

CPU: Warning at 70%, Critical at 90%
Memory: Warning at 80%, Critical at 90%
Disk: Warning at 75%, Critical at 90%
Focus on: Connection count, request queue depth

Database Servers

CPU: Warning at 60%, Critical at 80% (databases need headroom for query bursts)
Memory: Warning at 85%, Critical at 95%
Disk: Warning at 70%, Critical at 85% (databases need disk space for operations)
Focus on: Disk I/O, replication lag, connection pool usage

Worker/Queue Servers

CPU: Warning at 80%, Critical at 95% (workers are expected to be busy)
Memory: Warning at 80%, Critical at 90%
Focus on: Queue depth, job processing rate, failed job count
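Role-specific thresholds fit naturally into a small lookup table. The sketch below encodes the percentages listed above and classifies a reading. The role keys and the function name are illustrative, not a format used by any specific tool.

```python
# (warning, critical) thresholds in percent, taken from the lists above.
THRESHOLDS = {
    "web":      {"cpu": (70, 90), "memory": (80, 90), "disk": (75, 90)},
    "database": {"cpu": (60, 80), "memory": (85, 95), "disk": (70, 85)},
    "worker":   {"cpu": (80, 95), "memory": (80, 90)},
}


def classify(role, metric, value_pct):
    """Return 'ok', 'warning', or 'critical' for a reading on a given role."""
    warning, critical = THRESHOLDS[role][metric]
    if value_pct >= critical:
        return "critical"
    if value_pct >= warning:
        return "warning"
    return "ok"


# The same 82% CPU reading means different things on different servers.
for role in ("web", "database", "worker"):
    print(role, classify(role, "cpu", 82))
# web warning
# database critical
# worker warning
```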

Common Server Issues and How Monitoring Catches Them

Memory leak: Memory usage creeps up 1-2% per day. Without monitoring, you discover it when the server crashes at 3 AM. With monitoring, you see the trend and restart the application during business hours.

Log file growth: Application logs fill the disk over weeks. Monitoring alerts you at 80% disk usage, giving you time to configure log rotation.

Zombie processes: Crashed workers can linger as zombie (defunct) entries that do no work until their parent reaps them. Process monitoring detects that your count of healthy workers has dropped.

Network saturation: A traffic spike or DDoS attack saturates your bandwidth. Network monitoring alerts you as soon as throughput or drop counters cross a threshold, typically within one or two collection intervals.

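CPU thermal throttling: In bare-metal environments, overheating causes the CPU to lower its clock speed. The same work then takes longer, so load average and response times climb without any increase in traffic. Tracking CPU temperature and clock frequency helps confirm the cause.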
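Several of these scenarios, the memory leak and log growth in particular, are caught by looking at a trend rather than a single value. As a sketch, the function below fits a least-squares line through evenly spaced samples and returns the growth per sample. The daily readings are invented, and the 95% limit is the critical memory threshold from earlier in this article.

```python
def slope_per_sample(values):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical memory usage, one reading per day at the same time of day.
daily_memory_pct = [61.0, 62.5, 64.0, 65.5, 67.0, 68.5, 70.0]

growth = slope_per_sample(daily_memory_pct)
days_to_critical = (95.0 - daily_memory_pct[-1]) / growth if growth > 0 else float("inf")
print(f"growing {growth:.1f} points/day, about {days_to_critical:.0f} days to 95%")
```

A steady positive slope over a week, with no matching growth in traffic, is a reasonable signal to schedule a restart or start hunting for the leak.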

Server Monitoring + External Monitoring

Server monitoring is most powerful when combined with external monitoring:

Issue	Server Monitoring Detects	External Monitoring Detects
High CPU	Yes (cause)	Yes (effect: slow responses)
Full disk	Yes (cause)	Yes (effect: errors)
Network outage	Maybe (the agent stops reporting)	Yes (site unreachable)
Application bug	No (server is healthy)	Yes (wrong responses)
DNS issue	No (server is fine)	Yes (site unreachable)

Together, they give you both the “what” and the “why.” External monitoring tells you that something is wrong. Server monitoring tells you why.

Get complete visibility into your server health. Start monitoring with StatusApp — server monitoring alongside 9 other monitor types.
