Phase 5: Operating It
Time: week 5. Goal: the stuff that separates "I deployed it" from "I run it".
The debugging workflow
When something is broken, in order:
1. kubectl get pods - what state is it in?
2. kubectl describe pod <name> - read the Events section at the bottom. This answers most questions.
3. kubectl logs <pod> - and --previous for the logs of the crashed container before the current restart.
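The same three steps can be sketched as one shell function. The name diagnose_pod and the optional namespace argument are made up for illustration; only the kubectl subcommands come from the workflow itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run the debugging steps in order for a single pod.
# Usage: diagnose_pod <pod-name> [namespace]
diagnose_pod() {
  local pod="$1"
  local ns="${2:-default}"   # assumption: fall back to the "default" namespace

  # 1. What state is it in?
  kubectl get pod "$pod" -n "$ns" -o wide

  # 2. The Events section is at the bottom of this output.
  kubectl describe pod "$pod" -n "$ns"

  # 3. Logs of the current container, then of the previous (crashed) one.
  kubectl logs "$pod" -n "$ns" --tail=50
  # --previous errors out if the container has never restarted, so tolerate that.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 || true
}
```

Call it as diagnose_pod web-abc123 staging and read the output top to bottom; the describe events usually settle it before you reach the logs.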
Failure states and what they mean
| State | Meaning | First thing to check |
|---|---|---|
| ImagePullBackOff | Can't pull the image | Tag typo, private registry auth, image doesn't exist |
| CrashLoopBackOff | Container starts and dies repeatedly | logs --previous, bad config, missing env var |
| Pending | Scheduler can't place the pod | describe: not enough CPU/memory, no matching node, PVC unbound |
| OOMKilled | Exceeded memory limit | Raise the limit or fix the leak |
| Ready 0/1 forever | Readiness probe failing | Probe path/port wrong, or dependency genuinely down |
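Most of these states can also be read straight from the pod's status fields, which helps when the describe output is long. A minimal sketch, assuming a pod in the current namespace; the function name pod_reasons is hypothetical, while the field paths are the standard pod status fields.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the machine-readable reasons behind a pod's state.
# Usage: pod_reasons <pod-name>
pod_reasons() {
  local pod="$1"

  # ImagePullBackOff / CrashLoopBackOff show up as the waiting reason.
  echo "waiting reason: $(kubectl get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}')"

  # OOMKilled shows up as the reason the previous container terminated.
  echo "last exit reason: $(kubectl get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}')"

  # For Pending pods, the scheduler's explanation is on the PodScheduled condition.
  echo "scheduling: $(kubectl get pod "$pod" \
    -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}')"
}
```

Empty values are normal: a healthy pod has no waiting reason, and a pod that never restarted has no last exit reason.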
HorizontalPodAutoscaler
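Scales replicas on CPU/memory utilization. Utilization is measured against the pods' resource requests, so the target containers need requests set. Needs metrics-server installed (kind doesn't ship it by default).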
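A declarative sketch of such an autoscaler, assuming a Deployment named web already exists and its containers declare CPU requests. The file name web-hpa.yaml and the 70% / 2 / 10 values are illustrative choices, not requirements.

```bash
#!/usr/bin/env bash
# Write an autoscaling/v2 HPA manifest to a file (the file name is arbitrary).
cat > web-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:            # what to scale (assumed: a Deployment named "web")
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the pods' CPU *requests*
EOF

# Then, against a cluster with metrics-server running:
#   kubectl apply -f web-hpa.yaml
#   kubectl get hpa web --watch
```

If the TARGETS column of kubectl get hpa shows unknown, the usual suspects are a missing metrics-server or containers without CPU requests.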
```bash
kubectl autoscale deployment web --cpu-percent=70 --min=2 --max=10
```
Jobs and CronJobs
Your one-off scripts and scheduled tasks:
- Job: run a pod to completion, retry on failure.
- CronJob: a Job on a cron schedule. Direct replacement for host crontabs calling docker run.
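A minimal CronJob sketch to show the shape: a cron schedule wrapped around a Job template wrapped around a pod template. The name nightly-report, the 02:00 schedule, and the busybox command are placeholders, not anything from the lab.

```bash
#!/usr/bin/env bash
# Write a CronJob manifest to a file (name, schedule, and image are placeholders).
cat > nightly-report-cronjob.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"        # standard cron syntax: every day at 02:00
  concurrencyPolicy: Forbid    # skip a run if the previous one is still going
  jobTemplate:
    spec:
      backoffLimit: 3          # Job-level retries on failure
      template:
        spec:
          restartPolicy: OnFailure   # Jobs require OnFailure or Never
          containers:
            - name: report
              image: busybox:1.36
              command: ["sh", "-c", "date; echo running nightly report"]
EOF

# Then, against a cluster:
#   kubectl apply -f nightly-report-cronjob.yaml
#   kubectl create job --from=cronjob/nightly-report manual-run   # trigger one run now
#   kubectl logs job/manual-run
```

Everything under jobTemplate.spec is an ordinary Job spec, so the same block minus the schedule wrapper is how you would write a one-off Job.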
RBAC basics
Just enough to read it, not design it:
- ServiceAccount: an identity for pods.
- Role / ClusterRole: a set of permissions (verbs on resources).
- RoleBinding / ClusterRoleBinding: glues an identity to permissions.
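To make the three pieces concrete, here is a small read-only example: one identity, one set of permissions, one binding. The name pod-reader and the default namespace are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Write a ServiceAccount + Role + RoleBinding to one file (names are placeholders).
cat > pod-reader-rbac.yaml <<'EOF'
apiVersion: v1
kind: ServiceAccount          # the identity
metadata:
  name: pod-reader
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role                    # the permissions: verbs on resources
metadata:
  name: pod-reader
  namespace: default
rules:
  - apiGroups: [""]           # "" is the core API group, where pods live
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding             # the glue
metadata:
  name: pod-reader
  namespace: default
subjects:
  - kind: ServiceAccount
    name: pod-reader
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
EOF

# Then, against a cluster, check what the identity can and cannot do:
#   kubectl apply -f pod-reader-rbac.yaml
#   kubectl auth can-i list pods   --as=system:serviceaccount:default:pod-reader
#   kubectl auth can-i delete pods --as=system:serviceaccount:default:pod-reader
```

With only this Role bound, the first can-i check should answer yes and the second no, which is a quick way to confirm you read the rules correctly.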
NetworkPolicies (awareness)
Default k8s networking is wide open: every pod can talk to every pod. NetworkPolicies are firewalls between pods. Know they exist and that the default is "allow all".
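For recognition purposes, this is roughly what the smallest useful policy looks like: a default deny for inbound traffic in one namespace. The policy name is a placeholder, and it only has an effect if the cluster's network plugin enforces NetworkPolicy.

```bash
#!/usr/bin/env bash
# Write a default-deny ingress policy to a file (the policy name is a placeholder).
cat > default-deny-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}      # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules listed, so all inbound traffic is denied
EOF

# Then, against a cluster:
#   kubectl apply -f default-deny-ingress.yaml
#   kubectl get networkpolicy
```

Once a pod is selected by any policy, only traffic that some policy explicitly allows gets through, so real setups pair a default deny like this with narrower allow rules.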
Exercise
Full playbook with the broken manifests, diagnosis walkthroughs, CronJob, and HPA setup: Lab 5.
Deliberately break things and diagnose each from events and logs only, no peeking at the manifest:
- Deploy with a non-existent image tag.
- Point a readiness probe at a wrong path.
- Set a memory limit lower than the app needs.
For each: identify the failure state, find the evidence in describe/logs, fix it.
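If you want a known-good starting point to break, something like the following could work. It is a sketch, not the Lab 5 manifest: the nginx image, the probe path, and the memory values are assumptions, and the comments mark the one line to change for each scenario.

```bash
#!/usr/bin/env bash
# Write a working baseline Deployment; break exactly one marked line per scenario.
cat > web-baseline.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27        # scenario 1: change to a tag that does not exist
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /              # scenario 2: point at a path the app does not serve
              port: 80
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
            limits:
              memory: 128Mi        # scenario 3: set lower than the app needs
EOF

# Then, against a cluster, after editing one marked line:
#   kubectl apply -f web-baseline.yaml
#   kubectl get pods -l app=web --watch
```

Restore the file between scenarios so each failure shows up on its own.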
Checkpoint
- Given a broken pod, you reach for describe events before anything else.
- You can name the five failure states above and their most common cause.
- You can write a CronJob from memory.