Technical December 30, 2025 6 min read

Server Monitoring 101: CPU, Memory, Disk & Beyond

Server monitoring catches problems before they cause outages. Learn what metrics to track, what thresholds to set, and how to set up effective server monitoring.

StatusApp Team

Your website might be responding to HTTP requests, but that does not mean your server is healthy. A server running at 95% CPU, with 90% memory used and a disk that is 98% full, is a ticking time bomb. Server monitoring gives you visibility into the health of your infrastructure before problems cascade into outages.

Why Server Monitoring Matters

External monitoring (HTTP checks, ping) tells you whether your service is accessible from the outside. Server monitoring tells you why it might stop being accessible.

Common failure scenarios that external monitoring misses:

  • Gradual CPU saturation: Response times slowly increase until the server cannot handle requests
  • Memory leaks: Applications consume more memory over time until the OOM killer strikes
  • Disk filling up: Logs, uploads, or database files grow until the disk is full, causing crashes
  • Process crashes: A critical background process dies but the web server keeps responding
  • Network saturation: Bandwidth limits are reached, causing dropped connections

By the time external monitoring detects these issues, the impact is already being felt by users. Server monitoring gives you advance warning.

Key Metrics to Monitor

CPU Usage

What it measures: The percentage of CPU time spent on processing tasks.

Why it matters: High CPU usage means your server is working hard. Sustained high CPU (above 80-90%) means requests will start queuing, response times will increase, and eventually the server may become unresponsive.

Thresholds:

  • Warning at 70%: Investigate and consider scaling
  • Critical at 90%: Immediate action required
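
One way to keep brief spikes from triggering alerts is to require several consecutive samples above the threshold before alerting. The sketch below is illustrative only: the function name, the sample values, and the choice of three consecutive samples are assumptions, not settings from any particular monitoring tool.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if `min_consecutive` samples in a row are at or above `threshold`."""
    streak = 0
    for value in samples:
        # Extend the streak on a breach, reset it on any sample below the threshold.
        streak = streak + 1 if value >= threshold else 0
        if streak >= min_consecutive:
            return True
    return False


# Hypothetical CPU readings (percent), one per sampling interval.
spiky = [50, 95, 60, 92, 55]
sustained = [50, 91, 93, 95, 60]

print(sustained_breach(spiky, 90, 3))      # isolated spikes
print(sustained_breach(sustained, 90, 3))  # three breaches in a row
```

The trade-off is detection delay: requiring more consecutive samples reduces noise but postpones the alert by that many sampling intervals.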

What to look for:

  • Sustained high CPU vs. brief spikes (spikes are often normal)
  • CPU usage by type: user, system, I/O wait
  • I/O wait indicates disk bottlenecks, not CPU bottlenecks
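
To see how CPU time splits into user, system, and I/O wait on Linux, you can compare two readings of the aggregate "cpu" line in /proc/stat. This is a minimal sketch under that assumption; the function names are made up for illustration, and the sample lines are invented values rather than real measurements.

```python
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line from /proc/stat into named tick counters."""
    values = [int(v) for v in line.split()[1:1 + len(CPU_FIELDS)]]
    return dict(zip(CPU_FIELDS, values))


def cpu_breakdown(before, after):
    """Share of CPU time (percent) per category between two samples of the 'cpu' line."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: b[name] - a[name] for name in CPU_FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in CPU_FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Invented sample lines; on a real host, read the first line of /proc/stat twice,
# a few seconds apart.
before = "cpu  100 0 50 800 50 0 0 0 0 0"
after = "cpu  200 0 100 1500 200 0 0 0 0 0"
shares = cpu_breakdown(before, after)
print(shares["user"], shares["system"], shares["iowait"])
```

A high iowait share with low user and system shares points at storage rather than at the CPU itself.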

Memory Usage

What it measures: RAM utilization including used, free, cached, and buffered memory.

Why it matters: When physical RAM is exhausted, the OS starts swapping to disk (dramatically slower) or the OOM (Out of Memory) killer terminates processes.

Thresholds:

  • Warning at 80%: Memory pressure is building
  • Critical at 95%: At risk of OOM or excessive swapping

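Important nuance: On Linux, a low “free” figure is normal because the kernel uses otherwise idle RAM for disk cache, most of which can be reclaimed when applications need it. Older tools counted that cache as used memory, so the meaningful figure there was used - (buffers + cached). Current versions of free report an “available” column (MemAvailable in /proc/meminfo), which estimates how much memory applications can still claim without swapping.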

# Check actual memory usage on Linux
free -h
#               total        used        free      shared  buff/cache   available
# Mem:           16Gi        8.2Gi       1.1Gi       256Mi       6.7Gi       7.3Gi

# "available" is what matters, not "free"
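
The same idea can be turned into a percentage for alerting. The sketch below assumes a Linux host that exposes MemTotal and MemAvailable in /proc/meminfo; the function name and the sample text are illustrative, not taken from a real server.

```python
def memory_used_percent(meminfo_text):
    """Percent of RAM that is not available to applications, based on MemAvailable."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # /proc/meminfo reports sizes in kB
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return 100.0 * (total - available) / total


# Invented sample; on a real host, pass the contents of /proc/meminfo instead.
sample = """MemTotal:       16000000 kB
MemFree:         1100000 kB
MemAvailable:    8000000 kB
"""
print(memory_used_percent(sample))
```

Comparing this value against the 80% and 95% thresholds avoids false alarms caused by a large but reclaimable disk cache.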

Disk Usage

What it measures: Storage space utilization on each mounted filesystem.

Why it matters: A full disk causes cascading failures:

  • Databases crash when they cannot write
  • Applications fail when log files cannot be written
  • Temporary files cannot be created
  • Package managers cannot update

Thresholds:

  • Warning at 80%: Plan cleanup or expansion
  • Critical at 90%: Urgent action required
  • Emergency at 95%: Immediate intervention

What to watch: Disk usage growth rate. If you are at 80% and growing by 1 percentage point per day, you have 20 days until the disk is full. At 5 percentage points per day, you have 4 days.
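
The runway arithmetic is simple enough to automate. This is a minimal sketch that assumes linear growth, which real disks rarely follow exactly; the function name is hypothetical.

```python
def days_until_full(used_percent, growth_points_per_day):
    """Days left before the disk reaches 100%, assuming constant linear growth."""
    if growth_points_per_day <= 0:
        return float("inf")  # usage is flat or shrinking, so no deadline
    return (100.0 - used_percent) / growth_points_per_day


print(days_until_full(80, 1))  # 20.0
print(days_until_full(80, 5))  # 4.0
```

Alerting on the projected number of days, rather than only on the current percentage, catches a fast-growing disk that has not yet crossed the warning threshold.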

Disk I/O

What it measures: Read/write operations per second and throughput.

Why it matters: Even with available disk space, I/O saturation causes severe performance degradation. Database-heavy applications are particularly sensitive.

Key metrics:

  • IOPS: Input/Output operations per second
  • Throughput: MB/s read and written
  • I/O wait: CPU time spent waiting for I/O (should be under 10%)
  • Queue depth: Number of pending I/O operations
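
On Linux, IOPS, throughput, and queue depth can be derived from two readings of a device line in /proc/diskstats. The sketch below assumes that file's standard field order and its fixed 512-byte sector unit; the function names and the sample lines are invented for illustration.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats_line(line):
    """Extract the counters used here from one device line of /proc/diskstats."""
    parts = line.split()
    return {
        "name": parts[2],
        "reads": int(parts[3]),             # reads completed
        "sectors_read": int(parts[5]),
        "writes": int(parts[7]),            # writes completed
        "sectors_written": int(parts[9]),
        "in_flight": int(parts[11]),        # I/Os currently in progress
    }


def io_rates(before, after, interval_seconds):
    """IOPS and throughput between two samples, plus the latest in-flight count."""
    a, b = parse_diskstats_line(before), parse_diskstats_line(after)
    ops = (b["reads"] - a["reads"]) + (b["writes"] - a["writes"])
    sectors = (b["sectors_read"] - a["sectors_read"]) + (
        b["sectors_written"] - a["sectors_written"]
    )
    return {
        "iops": ops / interval_seconds,
        "throughput_mb_s": sectors * SECTOR_BYTES / 1e6 / interval_seconds,
        "queue_depth": b["in_flight"],
    }


# Invented samples taken 10 seconds apart for a device named sda.
before = "   8       0 sda 1000 0 80000 500 2000 0 160000 900 0 1200 1400"
after = "   8       0 sda 1500 0 120000 700 2500 0 200000 1100 3 1500 1800"
print(io_rates(before, after, 10))
```

The in-flight count is an instantaneous reading, so it is worth sampling it often or averaging it over several readings before alerting on it.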

Network I/O

What it measures: Bytes sent and received, packet counts, errors, and drops.

Key metrics:

  • Bandwidth usage: Are you approaching network limits?
  • Packet errors: Errors indicate network hardware or driver issues
  • Connection count: Active TCP connections
  • Dropped packets: Network interface or kernel is overwhelmed
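
Bandwidth, error, and drop figures can be computed on Linux from two readings of an interface line in /proc/net/dev. This sketch assumes that file's standard column order; the function names, the interface name, and the counter values are illustrative.

```python
def parse_net_dev_line(line):
    """Extract byte, error, and drop counters from one interface line of /proc/net/dev."""
    name, _, rest = line.partition(":")
    f = [int(v) for v in rest.split()]
    return {
        "interface": name.strip(),
        "rx_bytes": f[0], "rx_errors": f[2], "rx_dropped": f[3],
        "tx_bytes": f[8], "tx_errors": f[10], "tx_dropped": f[11],
    }


def network_deltas(before, after, interval_seconds):
    """Bandwidth in Mbit/s plus new errors and drops between two samples."""
    a, b = parse_net_dev_line(before), parse_net_dev_line(after)
    return {
        "rx_mbit_s": (b["rx_bytes"] - a["rx_bytes"]) * 8 / 1e6 / interval_seconds,
        "tx_mbit_s": (b["tx_bytes"] - a["tx_bytes"]) * 8 / 1e6 / interval_seconds,
        "new_errors": (b["rx_errors"] - a["rx_errors"]) + (b["tx_errors"] - a["tx_errors"]),
        "new_drops": (b["rx_dropped"] - a["rx_dropped"]) + (b["tx_dropped"] - a["tx_dropped"]),
    }


# Invented samples taken 10 seconds apart.
before = "  eth0: 1000000 1000 0 0 0 0 0 0 2000000 1500 0 0 0 0 0 0"
after = "  eth0: 13500000 9000 2 5 0 0 0 0 7000000 6000 1 0 0 0 0 0"
print(network_deltas(before, after, 10))
```

Because the kernel counters are cumulative, alerting on the change between samples is more useful than alerting on the absolute values.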

Load Average

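What it measures: The average number of processes running or waiting to run, measured over 1, 5, and 15 minutes. On Linux, the figure also counts processes blocked in uninterruptible I/O wait, so heavy disk activity can raise it even when the CPU is mostly idle.

How to interpret: A load average of 1.0 on a single-core server means the CPU is roughly fully utilized. On a 4-core server, a load average of 4.0 means full utilization.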

Rule of thumb: Load average should stay below the number of CPU cores. If your 4-core server consistently shows a load average above 4, it is overloaded.
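
Dividing the load average by the core count gives a figure that is comparable across servers of different sizes. The sketch below uses Python's os.getloadavg() and os.cpu_count(), which are available on Unix-like systems; the helper names are hypothetical.

```python
import os


def load_per_core(load_average, cores):
    """Normalize a load average by the number of CPU cores."""
    return load_average / cores


def is_overloaded(load_average, cores):
    """Apply the rule of thumb: load above the core count means overload."""
    return load_per_core(load_average, cores) > 1.0


def current_load_per_core():
    """Read the 5-minute load average of this host and normalize it (Unix only)."""
    _one, five, _fifteen = os.getloadavg()
    return load_per_core(five, os.cpu_count() or 1)


print(is_overloaded(6.0, 4))  # a 4-core server with load 6.0
print(is_overloaded(3.0, 4))  # a 4-core server with load 3.0
```

Using the 5-minute or 15-minute value for alerting smooths out short bursts that the 1-minute value would flag.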

Process Monitoring

Beyond resource metrics, monitor that critical processes are running:

  • Web server (nginx, Apache, Caddy)
  • Application server (Node.js, Python, Java, PHP-FPM)
  • Database (PostgreSQL, MySQL, MongoDB)
  • Cache (Redis, Memcached)
  • Queue workers (Sidekiq, Celery, Bull)
  • Cron daemon
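
A basic process check compares the names of running processes against a list of required ones. This is a sketch for Linux that reads /proc/<pid>/comm; the required process names are examples and will differ per distribution and setup, and the function names are hypothetical.

```python
import os


def running_process_names(proc_root="/proc"):
    """Collect the command names of all running processes from /proc (Linux only)."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are process IDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                names.add(handle.read().strip())  # comm is truncated to 15 characters
        except OSError:
            continue  # the process exited between listing and reading
    return names


def missing_processes(required, running):
    """Return the required process names that are not currently running, sorted."""
    return sorted(set(required) - set(running))


# Example names only; on a real host, pass running_process_names() as `running`.
required = ["nginx", "redis-server", "cron"]
running = {"nginx", "cron", "sshd"}
print(missing_processes(required, running))
```

A name match only shows that a process exists, not that it is doing useful work, so pairing this with an application-level health check gives a fuller picture.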

Setting Up Server Monitoring

Agent-Based Monitoring

The most common approach: install a lightweight agent on your server that reports metrics to your monitoring platform.

StatusApp provides a server monitoring agent that:

  • Reports CPU, memory, disk, and network metrics
  • Runs as a lightweight background process
  • Uses minimal resources (typically under 1% CPU and 20MB RAM)
  • Sends data securely over HTTPS

Agentless Monitoring

For environments where installing an agent is not possible (managed hosting, certain compliance requirements), you can:

  • Monitor server health through application-level health endpoints
  • Use SSH-based checks (periodic remote commands)
  • Rely on cloud provider metrics (AWS CloudWatch, GCP Monitoring)

Cloud Provider Integration

If you use AWS, GCP, or Azure, the cloud provider offers basic server monitoring. However, these are often:

  • Limited in metric granularity
  • Expensive at scale (CloudWatch charges per metric)
  • Siloed from your application monitoring

A unified monitoring platform like StatusApp lets you see server metrics alongside website, API, and SSL monitoring in one dashboard.

Alert Thresholds by Server Role

Different servers have different normal patterns:

Web Servers

  • CPU: Warning at 70%, Critical at 90%
  • Memory: Warning at 80%, Critical at 90%
  • Disk: Warning at 75%, Critical at 90%
  • Focus on: Connection count, request queue depth

Database Servers

  • CPU: Warning at 60%, Critical at 80% (databases need headroom for query bursts)
  • Memory: Warning at 85%, Critical at 95%
  • Disk: Warning at 70%, Critical at 85% (databases need disk space for operations)
  • Focus on: Disk I/O, replication lag, connection pool usage

Worker/Queue Servers

  • CPU: Warning at 80%, Critical at 95% (workers are expected to be busy)
  • Memory: Warning at 80%, Critical at 90%
  • Focus on: Queue depth, job processing rate, failed job count
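
The role-specific thresholds above can be kept in one table and applied with a small helper. This is a minimal sketch: the dictionary layout, the role keys, and the function name are assumptions, while the numbers are the ones listed above.

```python
# (warning, critical) thresholds in percent, per server role and metric.
THRESHOLDS = {
    "web": {"cpu": (70, 90), "memory": (80, 90), "disk": (75, 90)},
    "database": {"cpu": (60, 80), "memory": (85, 95), "disk": (70, 85)},
    "worker": {"cpu": (80, 95), "memory": (80, 90)},  # no disk threshold listed
}


def classify(role, metric, value):
    """Return 'ok', 'warning', or 'critical' for a metric value on a given role."""
    warning, critical = THRESHOLDS[role][metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


# The same 65% CPU reading means different things on different roles.
print(classify("web", "cpu", 65))       # ok
print(classify("database", "cpu", 65))  # warning
```

Keeping thresholds in data rather than in code makes it easier to review and adjust them as each server's normal pattern becomes clearer.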

Common Server Issues and How Monitoring Catches Them

Memory leak: Memory usage creeps up 1-2% per day. Without monitoring, you discover it when the server crashes at 3 AM. With monitoring, you see the trend and restart the application during business hours.

Log file growth: Application logs fill the disk over weeks. Monitoring alerts you at 80% disk usage, giving you time to configure log rotation.

Zombie processes: Crashed workers can linger as zombie (defunct) entries in the process table that do no work. Process monitoring detects that the number of live workers has dropped below the expected count.

Network saturation: A traffic spike or DDoS attack saturates your bandwidth. Network monitoring alerts you as soon as the next samples show the spike, which can be within seconds if the agent reports frequently.

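CPU thermal throttling: In bare-metal environments, overheating causes the CPU to throttle its clock speed. The same workload then takes longer, so load average and response times climb without a matching rise in traffic. Tracking CPU temperature or frequency alongside utilization makes the cause visible.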

Server Monitoring + External Monitoring

Server monitoring is most powerful when combined with external monitoring:

Issue           | Server Monitoring Detects      | External Monitoring Detects
High CPU        | Yes (cause)                    | Yes (effect: slow responses)
Full disk       | Yes (cause)                    | Yes (effect: errors)
Network outage  | Maybe (if agent cannot report) | Yes (site unreachable)
Application bug | No (server is healthy)         | Yes (wrong responses)
DNS issue       | No (server is fine)            | Yes (site unreachable)

Together, they give you both the “what” and the “why.” External monitoring tells you something is wrong. Server monitoring tells you what caused it.


Get complete visibility into your server health. Start monitoring with StatusApp — server monitoring alongside 9 other monitor types.

