Kubernetes Incident Response: Debugging Production Issues in Containerized Environments

Kubernetes incidents feel different from traditional server incidents. The infrastructure is dynamic, pods come and go, and the thing that's broken right now might have already restarted and hidden its own evidence. If your mental model is still "SSH to the box and look at logs," you're going to spend a lot of time chasing ghosts.

This post covers a practical approach to K8s incident response: what to look at first, which commands actually help, and where people commonly waste time.

The first two minutes

When you get paged for a Kubernetes issue, resist the urge to immediately start running commands. Spend thirty seconds on context:

Is this one service or multiple? Multiple affected services usually point to infrastructure (nodes, networking, control plane).
Is the workload restarting or completely missing? Crash-looping is different from a missing deployment.
Did anything deploy in the last hour? Most K8s incidents have a deploy as the proximate cause.

Then start with cluster-wide health:

kubectl get nodes
kubectl get pods -A --field-selector status.phase!=Running

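The first command shows whether any node is NotReady or cordoned. The second lists every pod whose phase is not Running (Pending, Failed, Succeeded, or Unknown) across all namespaces. Be aware that a crash-looping pod usually still reports phase Running, so this filter will miss it; scan the STATUS column of kubectl get pods -A for CrashLoopBackOff as well. If you see dozens of pods in Pending or CrashLoopBackOff across multiple namespaces, you're dealing with something systemic, not a single service issue.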
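One way to get that cross-namespace picture quickly is to count pods by namespace and STATUS. The sketch below is illustrative: summarize_unhealthy is a hypothetical helper name, the sample rows are made up, and it assumes the default column layout of kubectl get pods -A --no-headers.

```shell
# Count pods per namespace and STATUS, skipping healthy ones.
# Input: output of `kubectl get pods -A --no-headers`
# (columns: NAMESPACE NAME READY STATUS RESTARTS AGE).
summarize_unhealthy() {
  awk '$4 != "Running" && $4 != "Completed" { count[$1 " " $4]++ }
       END { for (k in count) print count[k], k }' | sort -k1,1nr -k2
}

# Live usage (assumes kubectl points at the affected cluster):
#   kubectl get pods -A --no-headers | summarize_unhealthy

# Offline demo with made-up rows:
summarize_unhealthy <<'EOF'
payments   api-7d9f   0/1   CrashLoopBackOff   12 (2m ago)   1h
payments   api-8c2a   0/1   CrashLoopBackOff   11 (3m ago)   1h
payments   web-1a2b   1/1   Running            0             3d
batch      job-x1     0/1   Pending            0             9m
batch      job-x0     0/1   Completed          0             2h
EOF
# prints:
# 2 payments CrashLoopBackOff
# 1 batch Pending
```

Because a crash-looping pod cycles between Running, Error, and CrashLoopBackOff in the STATUS column, a single snapshot can undercount; running it twice a minute apart gives a steadier picture.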

Reading pod state

CrashLoopBackOff is the most common state you'll see during an incident. It means the container started, exited (usually with an error), and Kubernetes is backing off before restarting it again.

Get details on a specific pod:

kubectl describe pod <pod-name> -n <namespace>

The Events section at the bottom is often more useful than the status fields, though the container's Last State block is worth reading alongside it. Between the two, look for:

OOMKilled: the container hit its memory limit and was killed by the kernel
Error: failed to create containerd task: usually a node-level issue
Liveness probe failed: the app started but isn't responding to health checks
Back-off pulling image: image pull is failing, often an auth or network issue

For the logs from the last run (not the current restart):

kubectl logs <pod-name> -n <namespace> --previous

That --previous flag is critical. Without it, you're looking at logs from after the restart, which might be completely clean if the crash happened early in startup.

OOMKills deserve their own section

Memory-related crashes are easy to misread. The pod restarts, the --previous logs show nothing obviously wrong, and you assume it's a flapping health check. Check the exit code:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

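Exit code 137 means the container was killed with SIGKILL (128 + 9), which in this situation almost always means an OOMKill. To confirm, check that lastState.terminated.reason in the same status block reads OOMKilled. Note that the [0] index assumes the first container in the pod; adjust it for multi-container pods. The fix is usually one of three things: the memory limit is too low, there's a memory leak in the application, or traffic spiked and the service didn't scale fast enough.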

Don't just raise the limit and close the incident. OOMKills are usually symptoms. Raising the limit is fine as a short-term measure, but open a follow-up to understand why memory usage exceeded the configured limit.
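Exit codes above 128 conventionally encode "128 plus the signal number", which is why 137 maps to SIGKILL. A small decoder can save a lookup mid-incident. This is a minimal sketch; explain_exit_code is a hypothetical helper, and it only names the few signals that tend to show up in container crashes.

```shell
# Decode a container exit code.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case $sig in
      9)  echo "killed by SIGKILL (signal 9): OOM kill or forced kill" ;;
      11) echo "crashed with SIGSEGV (signal 11)" ;;
      15) echo "terminated by SIGTERM (signal 15): graceful shutdown request" ;;
      *)  echo "terminated by signal $sig" ;;
    esac
  else
    echo "application error (exit code $code)"
  fi
}

explain_exit_code 137
# prints: killed by SIGKILL (signal 9): OOM kill or forced kill
```

A 137 on its own does not prove an OOMKill, since a forced kill after a failed graceful shutdown produces the same code, so treat the decoder as a hint and the terminated reason field as the confirmation.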

Node pressure

If pods are stuck in Pending for more than a few minutes, check node capacity:

kubectl describe nodes | grep -A 8 "Conditions:"
kubectl top nodes

Look for MemoryPressure, DiskPressure, or PIDPressure conditions. Any of these will cause the scheduler to stop placing pods on that node.
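On a cluster with many nodes, scanning describe output by eye gets slow. The sketch below filters a one-line-per-node summary down to the nodes that need attention. flag_unhealthy_nodes is a hypothetical helper, the sample lines are made up, and the jsonpath expression in the comment is an assumption about how you would produce that input; check its output on your cluster before relying on it.

```shell
# Flag nodes reporting a pressure condition or a Ready status other than True.
# Expected input, one node per line:
#   <node> MemoryPressure=False DiskPressure=True PIDPressure=False Ready=True
# One way to produce it (assumption, verify on your cluster):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{range .status.conditions[*]}{" "}{.type}={.status}{end}{"\n"}{end}'
flag_unhealthy_nodes() {
  awk '{
    problems = ""
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[1] ~ /Pressure$/ && kv[2] == "True") problems = problems " " kv[1]
      if (kv[1] == "Ready" && kv[2] != "True") problems = problems " NotReady"
    }
    if (problems != "") print $1 ":" problems
  }'
}

# Offline demo with made-up nodes:
flag_unhealthy_nodes <<'EOF'
node-a MemoryPressure=False DiskPressure=False PIDPressure=False Ready=True
node-b MemoryPressure=False DiskPressure=True PIDPressure=False Ready=True
node-c MemoryPressure=False DiskPressure=False PIDPressure=False Ready=Unknown
EOF
# prints:
# node-b: DiskPressure
# node-c: NotReady
```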

Disk pressure is the sneakiest one. It often comes from container image layers accumulating, or application logs filling up /var/log. You can check what's eating disk by running a debug container on the affected node:

kubectl debug node/<node-name> -it --image=busybox
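# inside the debug container (the node's root filesystem is mounted at /host):
df -h /host
# /var/log/containers holds only symlinks, so measure the real files under /var/log/pods (sizes in KiB)
du -sk /host/var/log/pods/* | sort -n | tail -20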

If multiple nodes hit disk pressure simultaneously, check if a recent deployment pushed a significantly larger image. A 2GB image pulled to ten nodes at once can exhaust disk headroom fast.

Networking failures

Network incidents in Kubernetes are genuinely hard to debug because there are multiple layers: kube-proxy, CNI plugins, DNS, service mesh (if you have one), and application-level connection pooling.

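Start with DNS since it's the most common culprit. The image is pinned to busybox:1.28 because nslookup in some later busybox builds is known to misbehave against cluster DNS:

kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup <service-name>.<namespace>.svc.cluster.local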

If DNS resolution fails, check the CoreDNS pods in kube-system (on most clusters, kubectl get pods -n kube-system -l k8s-app=kube-dns lists them). They're often the first thing to fall over when the cluster is under memory pressure.

For service connectivity:

kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- curl -v http://<service-name>.<namespace>.svc.cluster.local

If DNS works but the curl fails, the problem is likely in the service definition, endpoints, or the pods themselves. Check:

kubectl get endpoints <service-name> -n <namespace>

An empty endpoints list means no pods are matching the service selector, which usually means a label mismatch or all pods are failing their readiness probes.
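To tell a label mismatch apart from a readiness problem, compare the service selector against a pod's labels. The sketch below does that comparison on comma-separated key=value lists, which is the format shown in the SELECTOR column of kubectl get svc -o wide and the LABELS column of kubectl get pods --show-labels. selector_matches is a hypothetical helper and the label values are made up.

```shell
# Return 0 if every key=value pair in the selector appears in the pod's labels.
# Both arguments are comma-separated lists, e.g. "app=checkout,tier=api".
selector_matches() {
  selector=$1
  labels=",$2,"            # pad with commas so each pair can be matched whole
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case $labels in
      *",$pair,"*) ;;      # this selector pair is present on the pod
      *) IFS=$old_ifs; echo "missing: $pair"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}

# Made-up example: the service selects tier=api but the pod is labeled tier=backend.
if selector_matches "app=checkout,tier=api" "app=checkout,tier=backend,pod-template-hash=5d4f"; then
  echo "selector matches"
fi
# prints: missing: tier=api
```

If the selector matches and the endpoints list is still empty, the readiness probes are the next place to look.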

Readiness vs liveness probes

These two probe types behave differently during an incident and it matters:

Probe	On failure	Common cause
Liveness	Container is restarted	App is deadlocked or hung
Readiness	Pod is removed from Service endpoints (no traffic)	App is overloaded or starting up

A pod that's stuck in readiness failures won't crash, but it will receive no traffic. If your service has three replicas and all three are failing readiness checks, you have a full outage with zero pod restarts, which makes it easy to miss in monitoring that only watches for crash loops.

Check probe configuration with:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].readinessProbe}'
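Since a readiness outage produces no restarts, it helps to look specifically for pods that are Running but not fully ready. A minimal sketch, assuming the default columns of kubectl get pods --no-headers in a single namespace; running_not_ready is a hypothetical helper and the pod names are made up.

```shell
# List pods that are Running but have fewer ready containers than total.
# Input: output of `kubectl get pods -n <namespace> --no-headers`
# (columns: NAME READY STATUS RESTARTS AGE).
running_not_ready() {
  awk '$3 == "Running" {
         split($2, r, "/")                       # READY looks like "0/1"
         if (r[1] + 0 < r[2] + 0) print $1, $2
       }'
}

# Offline demo with made-up rows:
running_not_ready <<'EOF'
checkout-7d9f-abcde   0/1   Running   0   3h
checkout-7d9f-fghij   1/1   Running   0   3h
checkout-7d9f-klmno   1/2   Running   0   3h
EOF
# prints:
# checkout-7d9f-abcde 0/1
# checkout-7d9f-klmno 1/2
```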

Rollback checklist

If the incident started after a deployment, a rollback is often the fastest path to recovery. Kubernetes makes this straightforward:

# Check rollout history
kubectl rollout history deployment/<name> -n <namespace>

# Roll back to previous version
kubectl rollout undo deployment/<name> -n <namespace>

# Watch the rollback progress
kubectl rollout status deployment/<name> -n <namespace>

One thing to verify before rolling back: if the deployment introduced a database migration, rolling back the application code won't undo the migration. Coordinate with your database team before triggering an application rollback in that scenario.

Alerts worth setting up

If you're not already alerting on these, add them to your monitoring:

Pod restart rate: alert when a pod has restarted more than 3 times in 15 minutes
OOMKill count: any OOMKill should generate a low-severity alert for investigation
Pending pods duration: pods stuck in Pending for more than 5 minutes usually indicate a scheduling problem
Node disk usage: alert at 75% and page at 85%
CoreDNS error rate: DNS failures often precede wider cluster issues

The goal isn't to page on every hiccup. It's to catch things that require human judgment before they become full outages.
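The disk thresholds above are simple enough to express as a check you can run by hand while the real alert rules are being written. This is a sketch rather than a monitoring setup: disk_severity and check_df are hypothetical helpers, the sample df output is made up, and "at 75%" and "at 85%" are read as inclusive lower bounds.

```shell
# Classify a disk usage percentage: below 75 is ok, 75-84 alerts, 85 and up pages.
disk_severity() {
  pct=${1%\%}              # accept "82" or "82%"
  if [ "$pct" -ge 85 ]; then
    echo page
  elif [ "$pct" -ge 75 ]; then
    echo alert
  else
    echo ok
  fi
}

# Apply it to `df -P` output (POSIX format: column 5 is capacity, column 6 the mount point).
check_df() {
  tail -n +2 | while read -r _fs _blocks _used _avail cap mount; do
    echo "$mount $(disk_severity "$cap")"
  done
}

# Live usage on a node: df -P | check_df
# Offline demo with made-up numbers:
check_df <<'EOF'
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 100000 80000 20000 80% /
/dev/sdb1 100000 90000 10000 90% /var/lib/containerd
EOF
# prints:
# / alert
# /var/lib/containerd page
```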

Keep a K8s-specific runbook

Generic incident runbooks don't work well for Kubernetes. The commands are specific, the failure modes are unique, and a responder who's never worked with K8s before will waste time on the wrong things.

Your K8s runbook should include:

The exact kubectl commands to run in the first five minutes
How to get logs from crashed containers (with --previous)
Which namespaces contain production workloads
How to identify and contact the team that owns a given deployment
The rollback procedure for your specific CD tooling

Store it somewhere your on-call rotation can find it at 3am without thinking too hard about where to look.

Kubernetes incident response gets easier with practice, and the tooling genuinely helps. The cluster knows a lot about what's happening. You just have to ask it the right questions.