Phase 5: Operating It

Time: week 5. Goal: the stuff that separates "I deployed it" from "I run it".

The debugging workflow

When something is broken, in order:

  1. kubectl get pods - what state is it in?
  2. kubectl describe pod <name> - read the Events section at the bottom. This answers most questions.
  3. kubectl logs <name> - what did the app itself say? Add --previous to read the logs of the container instance that crashed before the current restart.
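
The three steps above can be wrapped in a small shell helper. This is a minimal sketch, not part of kubectl: the function name triage_pod and its optional namespace argument are made up for illustration, and it assumes kubectl already points at the right cluster.

```shell
# triage_pod is a hypothetical helper, not a kubectl feature.
# Usage: triage_pod <pod-name> [namespace]
triage_pod() {
  local pod="$1"
  local ns="${2:-default}"

  # 1. Current state: STATUS, READY and RESTARTS columns.
  kubectl get pod "$pod" -n "$ns" -o wide

  # 2. The Events section is printed at the bottom of this output.
  kubectl describe pod "$pod" -n "$ns"

  # 3. Logs of the current container, then of the previous (crashed) one.
  #    --previous errors out if the container has never restarted,
  #    so that failure is tolerated.
  kubectl logs "$pod" -n "$ns" --tail=50
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || true
}
```

Call it as triage_pod <name> or triage_pod <name> <namespace>. Keeping the order fixed is the point: state first, events second, logs last.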

Failure states and what they mean

| State | Meaning | First thing to check |
| --- | --- | --- |
| ImagePullBackOff | Can't pull the image | Tag typo, private registry auth, image doesn't exist |
| CrashLoopBackOff | Container starts and dies repeatedly | logs --previous, bad config, missing env var |
| Pending | Scheduler can't place the pod | describe: not enough CPU/memory, no matching node, PVC unbound |
| OOMKilled | Exceeded memory limit | Raise the limit or fix the leak |
| Ready 0/1 forever | Readiness probe failing | Probe path/port wrong, or dependency genuinely down |
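
After a restart the STATUS column may already have moved on, but the reason for the last crash is still recorded in the pod's container status. A sketch of reading it directly with a jsonpath query; the helper name last_exit is hypothetical, and the pod is assumed to live in the current namespace.

```shell
# last_exit is a hypothetical helper name.
# Prints one line per container: "<container>: <reason> (exit <code>)".
# Reason and code are empty for a container that has never terminated.
last_exit() {
  local pod="$1"
  kubectl get pod "$pod" -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" (exit "}{.lastState.terminated.exitCode}{")"}{"\n"}{end}'
}
```

A container killed for exceeding its memory limit typically reports the reason OOMKilled with exit code 137 here.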

HorizontalPodAutoscaler

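Scales a workload's replica count on CPU or memory utilization. It needs metrics-server installed (kind doesn't ship it by default). Utilization is measured as a percentage of each container's resource request, so the target pods must declare a CPU request for CPU-based scaling to work.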

```bash
kubectl autoscale deployment web --cpu-percent=70 --min=2 --max=10
```
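
The same autoscaler can be kept in version control as a manifest. A minimal sketch, assuming the Deployment is named web as in the command above; the file name web-hpa.yaml is arbitrary.

```shell
# Write an HPA manifest roughly equivalent to the kubectl autoscale command.
cat > web-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request
EOF

# Against a cluster with metrics-server running:
#   kubectl apply -f web-hpa.yaml
#   kubectl get hpa web --watch
```

The manifest form is also where you would add a second entry under metrics, for example memory, if CPU alone doesn't track your load.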

Jobs and CronJobs

Your one-off scripts and scheduled tasks:

  • Job: run a pod to completion, retry on failure.
  • CronJob: a Job on a cron schedule. Direct replacement for host crontabs calling docker run.
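
A CronJob is a Job template plus a schedule. A minimal sketch of one: the name nightly-report, the busybox image, and the echo command are placeholders standing in for your real task.

```shell
# Write a CronJob manifest; all names and the image are placeholders.
cat > nightly-report.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"         # standard cron syntax: every day at 02:00
  concurrencyPolicy: Forbid     # skip a run if the previous one is still going
  jobTemplate:
    spec:
      backoffLimit: 3           # retries before the Job is marked failed
      template:
        spec:
          restartPolicy: OnFailure   # Jobs require OnFailure or Never
          containers:
            - name: report
              image: busybox:1.36
              command: ["sh", "-c", "date; echo generating report"]
EOF

# kubectl apply -f nightly-report.yaml
# Trigger one run now instead of waiting for the schedule:
#   kubectl create job manual-run --from=cronjob/nightly-report
```

Everything under jobTemplate.spec is an ordinary Job spec, so the same block without the schedule wrapper (and with kind: Job) gives you the one-off version.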

RBAC basics

Just enough to read it, not design it:

  • ServiceAccount: an identity for pods.
  • Role / ClusterRole: a set of permissions (verbs on resources).
  • RoleBinding / ClusterRoleBinding: glues an identity to permissions.
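
To see how the three pieces fit together, here is a sketch that lets one ServiceAccount read pods in a single namespace. The names app-sa, pod-reader, and app-sa-reads-pods are invented for the example.

```shell
# Write an identity, a permission set, and the binding between them.
cat > pod-reader-rbac.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
  - apiGroups: [""]              # "" is the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-sa-reads-pods
  namespace: default
subjects:
  - kind: ServiceAccount
    name: app-sa
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
EOF

# kubectl apply -f pod-reader-rbac.yaml
# Check the result by impersonating the ServiceAccount:
#   kubectl auth can-i list pods   --as=system:serviceaccount:default:app-sa
#   kubectl auth can-i delete pods --as=system:serviceaccount:default:app-sa
```

Assuming no other bindings apply to that account, the first check should answer yes and the second no. Reading RBAC in the wild is mostly this: find the binding, then follow subjects to the identity and roleRef to the permissions.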

NetworkPolicies (awareness)

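Default k8s networking is wide open: every pod can talk to every pod. NetworkPolicies are firewall rules between pods, selected by label. Know they exist, that the default is "allow all", and that they are only enforced when the cluster's network plugin (CNI) supports them.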
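
For a sense of what a policy looks like, here is a sketch of the common pattern: deny all ingress in a namespace, then open one path. The labels app: web and role: frontend and both policy names are assumptions for the example.

```shell
# Write a default-deny policy plus one explicit allow rule.
cat > web-netpol.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}          # empty selector: every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules listed, so all ingress is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-from-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web             # the pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend   # the only pods allowed in
      ports:
        - protocol: TCP
          port: 80
EOF

# kubectl apply -f web-netpol.yaml
# kubectl get networkpolicy
```

Policies are additive: once any policy selects a pod, only traffic matched by some policy is allowed to reach it.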

Exercise

Full playbook with the broken manifests, diagnosis walkthroughs, CronJob, and HPA setup: Lab 5.

Deliberately break things and diagnose each from events and logs only, no peeking at the manifest:

  1. Deploy with a non-existent image tag.
  2. Point a readiness probe at a wrong path.
  3. Set a memory limit lower than the app needs.

For each: identify the failure state, find the evidence in describe/logs, fix it.

Checkpoint

  • Given a broken pod, you reach for describe events before anything else.
  • You can name the five failure states above and their most common cause.
  • You can write a CronJob from memory.

Next: Phase 6: Helm and Packaging

A VineLab lab. Released under the MIT License.