Optimum Web
DevOps · 8 min read

Docker Container Crash Loop: Production Back in 47 Minutes for $99


Optimum Web

Senior Node.js / Docker Engineer

A production application went down when a Docker container entered a crash loop — restarting every 3 minutes due to OOM (Out of Memory) kills. The client's team could not identify the cause for 4 hours. Optimum Web's senior engineer connected via SSH, identified the root cause (a Node.js worker process leaking memory through an unbounded event listener array), and fixed it in 47 minutes for a fixed price of $99. The application has been stable for the 5 months since, with zero OOM events.

This case study shows exactly what the crash looked like, how we diagnosed it, what the fix was, and how to recognize the same problem on your server.

The Problem: "Container Keeps Restarting Every 3 Minutes"

Friday, 5:17 PM. A logistics SaaS company in Germany noticed their tracking dashboard was down. Docker logs showed the pattern immediately:

```text
app_worker  | Killed
app_worker  | OOM: Kill process 1 (node) score 987
app_worker exited with code 137
app_worker  | Starting...
# ... repeats every 2-3 minutes
```

Exit code 137 means the process was killed with SIGKILL — here, by the kernel's OOM (Out of Memory) killer. The container was configured with a 512MB memory limit. Every 2–3 minutes, the Node.js process consumed all 512MB, was killed by the kernel, restarted, and the cycle repeated.
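Container exit codes above 128 follow the shell convention: exit code = 128 + signal number, so 137 = 128 + 9 (SIGKILL). A minimal decoder (the signal table here is trimmed to the two signals you meet most often):

```javascript
// Decode a container exit code into the signal that terminated the process.
// Codes <= 128 are normal exits; codes above follow 128 + signal number.
const SIGNALS = { 9: 'SIGKILL', 15: 'SIGTERM' };

function decodeExitCode(code) {
  if (code <= 128) return { signal: null, code }; // normal exit, no signal
  const signum = code - 128;
  return { signal: SIGNALS[signum] || `signal ${signum}`, code };
}

console.log(decodeExitCode(137)); // { signal: 'SIGKILL', code: 137 }
```

Seeing `SIGKILL` alone doesn't prove an OOM kill — but combined with memory usage pinned at the container limit, it's the classic signature.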

What the client's team tried (and why it didn't work):

| Attempt | Result |
| --- | --- |
| Increase memory limit to 1GB | Crash shifted to every 5 minutes — same problem, more RAM to fill |
| Restart the entire Docker stack | Same crash loop after 3 minutes |
| Roll back to previous Docker image | Same crash — the bug was in a data dependency, not code version |

After 4 hours of downtime, they contacted us.

Why This Happens

Memory leaks in Node.js are notoriously difficult to diagnose because:

JavaScript uses garbage collection — developers assume memory is managed automatically. It is, but only when there are no references to an object. If your code keeps a reference, garbage collection is blocked.

Leaks only appear under production load. Local development handles 5–10 concurrent connections. Production has 1,000+. The leak grows slowly at low traffic, then triggers frequently under load.

OOM kill is silent. The kernel kills the process without warning, and the only record lands in the kernel log (`dmesg` or `journalctl -k`), not in application logs — the next application log line is simply the container restarting. You never see the root cause in the app's own logs.

Common Node.js leak sources: unbounded arrays growing with every request, event listeners not removed on disconnect, global caches without eviction policy, circular references preventing garbage collection.

Our Diagnosis (First 12 Minutes)

```bash
# 1. Check which container is crashing
$ docker ps -a --format "{{.Names}}: {{.Status}}"
app_worker: Restarting (137) 45 seconds ago

# 2. Memory at the moment of crash
$ docker stats --no-stream app_worker
# MEM USAGE: 498MB / 512MB — at limit

# 3. Heap growth pattern
$ docker exec app_worker node -e "console.log(process.memoryUsage())"
# heapUsed: 487MB — almost all memory is heap

# 4. Generate heap snapshot via Chrome DevTools
# Found: Array "eventListeners" — 340,000 entries (growing ~1,000/min)
```

Root cause: A WebSocket handler was adding an event listener on every new connection but never removing it on disconnect. After 340,000 connections over 5 hours of production traffic, the listener array consumed all available memory.

The Fix (35 Minutes)

```javascript
// BEFORE (bug):
socket.on('connection', (ws) => {
  process.on('SIGTERM', () => ws.close()); // Added every connection, never removed
});

// AFTER (fix):
socket.on('connection', (ws) => {
  const cleanup = () => ws.close();
  process.once('SIGTERM', cleanup); // Auto-removes after firing

  ws.on('close', () => {
    process.removeListener('SIGTERM', cleanup); // Explicit cleanup on disconnect
  });
});
```

Additionally applied:

  • Set `--max-old-space-size=384` to trigger GC pressure earlier, before reaching the container limit
  • Added a memory monitoring alert (Telegram notification when heap > 80%)
  • Configured the Docker restart policy with an exponential backoff delay

[Fix My Docker — $99](/fixed-price/fix-docker-issues#checkout) · Root cause diagnosis included · 14-day warranty

The Result

| Metric | Before | After |
| --- | --- | --- |
| Container restarts per hour | 20–30 | 0 |
| Memory usage (steady state) | 498MB → OOM every 3 min | 180–220MB stable |
| Downtime | 4 hours (and counting) | 0 (5 months stable) |
| Time to resolution | 4+ hours (no result) | 47 minutes |

Cost & Timeline

| Item | Detail |
| --- | --- |
| Service | OW-BUG-01: Fix Docker & Docker Compose Issues |
| Price | $99 fixed |
| Actual time | 47 minutes (SSH to resolution) |
| Engineer | Senior Node.js / Docker specialist |
| Included | Root cause diagnosis, code fix, memory monitoring, Docker config update |
| Warranty | 14 days |

🐛 Docker Container Crashing? We Fix It Same Day — $99.

OOM kills, network errors, permission denied, crash loops — root cause diagnosed and fixed. Not just restarted.

  • SSH diagnostic within 24 hours of order
  • Root cause identification (heap snapshot, logs, traces)
  • Code fix included (not just Docker restart)
  • Memory monitoring setup
  • Docker restart policy configuration
  • 14-day warranty

$99 · Service ID: OW-BUG-01 · 14-day warranty

Fix My Docker — $99 →

Production Completely Down?

If your production is down right now and every minute costs you revenue, use [Emergency Server Recovery — $199](/fixed-price/quickfix-server-recovery#checkout) instead. Response within 1–4 hours. We SSH in, stabilize production first, diagnose root cause second.

Could This Be Your Problem? (5 Warning Signs)

  • Docker container restarts periodically (check `docker ps -a` for exit code 137)
  • Memory usage climbs steadily then drops to zero (the restart cycle)
  • The problem only appears in production, never in local development
  • Increasing container memory limit only delays the crash by minutes
  • Application logs show no errors before the crash — OOM kill is silent

If 2 or more apply, you likely have a memory leak.

[Fix Docker Issues — $99](/fixed-price/fix-docker-issues#checkout) · Root cause + code fix · 14-day warranty

[Emergency Recovery — $199](/fixed-price/quickfix-server-recovery#checkout) · 1–4 hour response · Production down now

Docker · Node.js · Memory Leak · OOM · Crash Loop · Case Study · Bugfix

Frequently Asked Questions

How fast can you respond if my production is currently down?
For Docker bug fixes (OW-BUG-01, $99): we start diagnosis within 24 hours. For production emergencies (OW-EMR-01, $199): we SSH in within 1–4 hours of order confirmation.
What if the fix requires code changes in my application?
Code fixes are included in the $99 price, as long as the fix is related to the Docker/container issue. If the root cause requires days of refactoring — we tell you honestly and quote a separate engagement.
Do you support Docker Swarm and Kubernetes?
Docker and Docker Compose are covered by OW-BUG-01 ($99). Kubernetes-specific issues are more complex and require a custom quote. Contact us with your setup details.
Can you help prevent future Docker crashes?
Yes. As part of every fix, we configure basic memory monitoring (alerts via Telegram or email when heap usage exceeds threshold) and document prevention recommendations. For ongoing monitoring, see OW-AI-14 Website Uptime Monitor ($190).
What programming languages can you debug inside Docker containers?
Node.js, Java, PHP, Python, .NET, Go, and Ruby. Our team covers all major stacks. The diagnostic approach (heap snapshots, memory profiling, log analysis) is adapted per language but the $99 price is the same.