Phase 5: Operating It

Time: week 5. Goal: the stuff that separates "I deployed it" from "I run it".

The debugging workflow

When something is broken, in order:

1. `kubectl get pods`: what state is it in?
2. `kubectl describe pod <name>`: read the Events section at the bottom. This answers most questions.
3. `kubectl logs <pod>`: read the container's output. Add `--previous` to see the logs of the container that crashed before the current restart.

Failure states and what they mean

| State | Meaning | First thing to check |
| --- | --- | --- |
| `ImagePullBackOff` | Can't pull the image | Tag typo, private registry auth, image doesn't exist |
| `CrashLoopBackOff` | Container starts and dies repeatedly | `logs --previous`, bad config, missing env var |
| `Pending` | Scheduler can't place the pod | `describe`: not enough CPU/memory, no matching node, PVC unbound |
| `OOMKilled` | Exceeded memory limit | Raise the limit or fix the leak |
| Ready `0/1` forever | Readiness probe failing | Probe path/port wrong, or dependency genuinely down |
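
Two of these states often show up together: a container that keeps getting OOM-killed is usually displayed as `CrashLoopBackOff`, with the real reason one level down. As a rough illustration, this is the shape of the relevant part of `kubectl get pod <name> -o yaml`. The container name and restart count are made up, and the real output contains many more fields.

```yaml
# Illustrative excerpt only; values are hypothetical, other fields omitted.
status:
  containerStatuses:
    - name: web
      ready: false
      restartCount: 4
      state:
        waiting:
          reason: CrashLoopBackOff   # what the STATUS column tends to show
      lastState:
        terminated:
          reason: OOMKilled          # why the previous container died
          exitCode: 137              # 128 + 9 (SIGKILL)
```

The same information appears in `kubectl describe pod <name>` under "Last State", which is why the table's first check for `CrashLoopBackOff` is the previous container rather than the current one.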

HorizontalPodAutoscaler

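An HPA scales a workload's replica count on observed CPU or memory utilization. It needs metrics-server installed (kind doesn't ship it by default). Utilization is measured as a percentage of each pod's resource requests, so the target pods must set them.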

```bash
kubectl autoscale deployment web --cpu-percent=70 --min=2 --max=10
```
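
For keeping the autoscaler in version control, the command above can also be written as a manifest. The following is a sketch of a roughly equivalent `autoscaling/v2` object. It assumes the Deployment is named `web`, as in the command, and that its containers declare CPU requests.

```yaml
# Declarative sketch of the same autoscaler: 2 to 10 replicas, targeting 70% CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:              # the workload whose replica count is adjusted
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization    # percentage of the pods' CPU requests
          averageUtilization: 70
```

After applying it, `kubectl get hpa web` shows current versus target utilization. A target that stays `<unknown>` usually means metrics-server is missing or the pods have no CPU requests.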

Jobs and CronJobs

Your one-off scripts and scheduled tasks:

Job: run a pod to completion, retry on failure.
CronJob: a Job on a cron schedule. Direct replacement for host crontabs calling docker run.
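
A minimal CronJob sketch, to show how a Job template nests inside a schedule. The name, image, command, and schedule are placeholders chosen for illustration, not taken from the lab.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report             # hypothetical name
spec:
  schedule: "0 2 * * *"            # standard cron syntax: 02:00 every day
  concurrencyPolicy: Forbid        # skip a run if the previous one is still going
  successfulJobsHistoryLimit: 3    # keep the last 3 finished Jobs for inspection
  failedJobsHistoryLimit: 1
  jobTemplate:                     # everything below is an ordinary Job spec
    spec:
      backoffLimit: 3              # retry budget before the Job is marked failed
      template:
        spec:
          restartPolicy: OnFailure # Jobs must use OnFailure or Never, not Always
          containers:
            - name: report
              image: busybox:1.36  # placeholder image
              command: ["sh", "-c", "date; echo generating report"]
```

To avoid waiting for the schedule while testing, `kubectl create job --from=cronjob/nightly-report manual-run` starts a one-off Job from the same template.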

RBAC basics

Just enough to read it, not design it:

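ServiceAccount: an identity for pods.
Role / ClusterRole: a set of permissions (verbs on resources). A Role applies within one namespace, a ClusterRole across the whole cluster.
RoleBinding / ClusterRoleBinding: glues an identity to permissions, within a namespace or cluster-wide respectively.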
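
To make the three pieces concrete, here is a small sketch that lets one identity read pods in a single namespace. All names (`pod-reader-sa`, `pod-reader`, `pod-reader-binding`) and the `default` namespace are hypothetical.

```yaml
# 1. The identity a pod can run as.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-reader-sa
  namespace: default
---
# 2. The permissions: verbs on resources, scoped to one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
  - apiGroups: [""]                  # "" is the core API group (pods, services, ...)
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
# 3. The glue: grants the Role to the ServiceAccount.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: pod-reader-sa
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
```

Reading RBAC mostly means following this chain backwards: binding, then role, then rules. To check the result, `kubectl auth can-i list pods --as=system:serviceaccount:default:pod-reader-sa` should answer `yes`, and the same query with `delete` should answer `no`.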

NetworkPolicies (awareness)

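Default k8s networking is wide open: every pod can talk to every other pod, including across namespaces. NetworkPolicies are firewalls between pods, but they only take effect if the cluster's network plugin (CNI) enforces them. Know they exist and that the default is "allow all".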
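
For recognition purposes, a common pattern looks roughly like this: one policy that denies all inbound traffic in a namespace, plus one that opens a specific path. The labels `app: web` and `role: frontend` and the port are assumptions for illustration.

```yaml
# Flip the namespace default from "allow all" to "deny all inbound".
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}          # empty selector = every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules listed, so nothing is allowed in
---
# Then re-open only what is needed: frontend pods may reach web pods on port 80.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-from-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web             # the pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 80
```

Policies are additive: a pod selected by at least one policy accepts only the traffic that some policy explicitly allows.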

Exercise

Full playbook with the broken manifests, diagnosis walkthroughs, CronJob, and HPA setup: Lab 5.

Deliberately break things and diagnose each from events and logs only, no peeking at the manifest:

Deploy with a non-existent image tag.
Point a readiness probe at a wrong path.
Set a memory limit lower than the app needs.

For each: identify the failure state, find the evidence in describe/logs, fix it.
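
If you want a starting point outside the lab, the following is a sketch of a healthy baseline Deployment with the three fields the exercise tampers with marked in comments. The image, probe path, and resource values are assumptions, not the lab's actual manifests.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27        # break 1: change to a tag that doesn't exist
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /              # break 2: point at a path the app doesn't serve
              port: 80
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
            limits:
              memory: 128Mi        # break 3: lower below what the app needs
```

Have someone else apply one breakage at a time, so that you really are diagnosing from `describe` and `logs` rather than from the diff.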

Checkpoint

Given a broken pod, you reach for describe events before anything else.
You can name the five failure states above and their most common cause.
You can write a CronJob from memory.

Next: Phase 6: Helm and Packaging
