Lab 5: Break It, Fix It, Schedule It, Scale It
Pairs with Phase 5. Three deliberate failures to diagnose using only kubectl describe and kubectl logs - no peeking at the manifests for hints - then a CronJob and an autoscaler.
Work in a fresh namespace:
kubectl create namespace lab5
kubectl config set-context --current --namespace=lab5
Break 1: The image that isn't
Save and apply lab-5/break-1.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: break-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: break-1
  template:
    metadata:
      labels:
        app: break-1
    spec:
      containers:
      - name: app
        image: nginx:1.99-does-not-exist

kubectl apply -f lab-5/break-1.yaml
kubectl get pods
You'll see ErrImagePull, then ImagePullBackOff. Diagnose:
kubectl describe pod -l app=break-1
What the events show, and the fix
The Events section reads something like:
Failed to pull image "nginx:1.99-does-not-exist": ... not found
Back-off pulling image "nginx:1.99-does-not-exist"
The registry said the tag doesn't exist. In real life the same symptom comes from a tag typo, a missing push, or missing private-registry credentials (imagePullSecrets). Fix: edit the manifest to image: nginx, re-apply, watch it recover. Note that Kubernetes never stopped retrying - BackOff means "retrying with increasing delays", not "gave up".
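To make "retrying with increasing delays" concrete, here is a rough sketch of what such a schedule looks like. It assumes a 10-second initial delay that doubles after each failed attempt up to a 300-second cap; those numbers are assumptions based on commonly documented kubelet defaults and may differ by version. backoff_delays is a hypothetical helper, not a kubectl feature.

```shell
# Hypothetical helper: print the first N delays of a doubling back-off.
# Assumed parameters (not from the lab): 10s initial delay, 300s cap.
backoff_delays() {
  attempts=$1
  delay=10
  cap=300
  out=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Append the current delay, space-separated.
    out="$out${out:+ }${delay}s"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$((i + 1))
  done
  echo "$out"
}

backoff_delays 7
```

Under these assumptions the delay stops growing at the cap, so a fixed manifest is picked up within a few minutes at worst without any manual restart.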
Break 2: The probe that lies
Save and apply lab-5/break-2.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: break-2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: break-2
  template:
    metadata:
      labels:
        app: break-2
    spec:
      containers:
      - name: app
        image: traefik/whoami:v1.10
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /
            port: 8080

kubectl apply -f lab-5/break-2.yaml
kubectl get pods
The pod shows Running but READY 0/1 - forever. The app is healthy; why is it getting no traffic?
What the events show, and the fix
Readiness probe failed: Get "http://10.x.x.x:8080/": dial tcp ... connect: connection refused
The probe targets port 8080; whoami listens on 80. connection refused on a probe almost always means wrong port; a 404 means wrong path. A pod that is not Ready never enters any Service's endpoints, so it silently receives nothing - the most insidious failure mode in this lab, because nothing is "red". Fix the port to 80, re-apply, watch 0/1 become 1/1.
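The rule of thumb above can be written down as a tiny lookup. This is only an illustrative sketch: probe_hint is a hypothetical helper, and the 404 message text is an assumption about how the kubelet words an HTTP status failure, so check it against your own events.

```shell
# Hypothetical helper: map a probe-failure event message to its likely cause.
# The match patterns are illustrative, not an exhaustive list.
probe_hint() {
  case "$1" in
    *"connection refused"*) echo "likely wrong port" ;;
    *"statuscode: 404"*)    echo "likely wrong path" ;;
    *)                      echo "no rule of thumb; read the full event" ;;
  esac
}

probe_hint 'Readiness probe failed: Get "http://10.x.x.x:8080/": dial tcp ... connect: connection refused'
probe_hint 'Readiness probe failed: HTTP probe failed with statuscode: 404'
```

The first call prints "likely wrong port" and the second "likely wrong path"; anything else falls through to reading the event in full.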
Break 3: The memory eater
Save and apply lab-5/break-3.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: break-3
spec:
  replicas: 1
  selector:
    matchLabels:
      app: break-3
  template:
    metadata:
      labels:
        app: break-3
    spec:
      containers:
      - name: app
        image: polinux/stress
        command: ["stress"]
        args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
        resources:
          limits:
            memory: 100Mi

kubectl apply -f lab-5/break-3.yaml
kubectl get pods -w
The container tries to allocate 250M with a 100Mi limit. Watch the pod cycle: Running -> OOMKilled -> CrashLoopBackOff -> restart -> repeat.
Where the evidence is, and the fix
kubectl describe pod -l app=break-3 shows:
Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137
Reason: OOMKilled and exit code 137 are the fingerprints. Note it's under Last State - the current state is the restarted container. The same logic applies to logs: kubectl logs <pod> --previous shows the dead container's output, while plain kubectl logs <pod> shows the new one's. In real life the fix is either raising the limit (the app genuinely needs more) or fixing a leak. Here, raise the limit to 300Mi, re-apply, watch it stabilize.
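Exit code 137 is not arbitrary: a process killed by signal N is conventionally reported with exit code 128 + N, and SIGKILL (what the kernel's OOM killer sends) is signal 9. The sketch below reproduces the number locally without a cluster; signal_from_exit is a hypothetical helper introduced only for illustration.

```shell
# Hypothetical helper: recover the signal number from an exit code above 128.
signal_from_exit() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo "not a signal exit"
  fi
}

# A child shell kills itself with SIGKILL; the parent captures its exit code.
code=0
sh -c 'kill -KILL $$' || code=$?
echo "exit code: $code, signal: $(signal_from_exit "$code")"
```

Seeing 137 alone only tells you the container received SIGKILL; it is the OOMKilled reason next to it that pins the cause on the memory limit.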
Clean up the breakage:
kubectl delete deploy break-1 break-2 break-3
Part 2: A CronJob
Save and apply lab-5/cronjob.yaml:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ticker
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: tick
            image: busybox:1.36
            command: ["sh", "-c", "date; echo tick from a CronJob"]

kubectl apply -f lab-5/cronjob.yaml
# wait a minute or two, then:
kubectl get jobs
kubectl logs job/$(kubectl get jobs -o jsonpath='{.items[-1].metadata.name}')
Expected: the date and tick from a CronJob. Each minute the CronJob spawns a Job, and each Job runs a pod to completion. This is the direct replacement for host crontabs calling docker run. Delete it before it litters: kubectl delete cronjob ticker.
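The schedule string in the manifest uses the standard five cron fields, in the order minute, hour, day of month, month, day of week. A small sketch that labels them follows; schedule_fields is a hypothetical helper, and the second schedule is an invented example, not part of the lab.

```shell
# Hypothetical helper: label the five fields of a cron schedule string.
schedule_fields() {
  set -f          # disable globbing so a bare * is not expanded to filenames
  set -- $1       # split the schedule on whitespace into $1..$5
  set +f
  echo "minute=$1 hour=$2 day-of-month=$3 month=$4 day-of-week=$5"
}

schedule_fields "* * * * *"
schedule_fields "*/5 9 * * 1"
```

The first call shows every field as a wildcard, which is why ticker fires every minute; the second would read as every five minutes during the 9 o'clock hour on Mondays.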
Part 3: Autoscaling
HPA needs metrics-server, which kind doesn't ship. Install and patch it (the patch tolerates kind's self-signed kubelet certs - needed on kind, not on real clusters):
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
Wait a minute, then confirm metrics flow:
kubectl top nodes
Deploy a scalable target with a deliberately small CPU request, plus an HPA:
kubectl create deployment scale-me --image=traefik/whoami:v1.10 --port=80
kubectl set resources deployment scale-me --requests=cpu=20m
kubectl expose deployment scale-me --port=80
kubectl autoscale deployment scale-me --cpu-percent=50 --min=1 --max=5
Generate load from inside the cluster:
kubectl run load --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://scale-me > /dev/null; done"
Watch it react (takes 1-3 minutes - the HPA is deliberately not jumpy):
kubectl get hpa scale-me -w
Expected: TARGETS climbs past 50%, REPLICAS steps up. Kill the load (kubectl delete pod load) and watch it scale back down - note the downscale is much slower (about 5 minutes), by design, to avoid flapping.
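The replica counts you see follow, roughly, the documented HPA rule: desired = ceil(current replicas x current utilization / target utilization), clamped to the min and max. The sketch below is a simplified approximation that ignores the controller's tolerance band and stabilization windows; desired_replicas is a hypothetical helper and the utilization figures are made-up inputs.

```shell
# Hypothetical helper: approximate the HPA replica calculation.
# Args: current replicas, current CPU utilization %, target %, min, max.
desired_replicas() {
  current=$1; usage=$2; target=$3; min=$4; max=$5
  # Integer ceiling of current * usage / target.
  want=$(( (current * usage + target - 1) / target ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

desired_replicas 1 150 50 1 5   # one pod at 150% of its request
desired_replicas 2 300 50 1 5   # raw result 12, clamped to --max=5
desired_replicas 3 10 50 1 5    # load gone, back toward --min=1
```

Utilization is measured against the CPU request, so with the 20m request set earlier, the 50% target corresponds to an average of only 10m of CPU per pod, which is why a simple wget loop is enough to trigger scaling.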
Clean up:
kubectl delete namespace lab5
kubectl config set-context --current --namespace=default
Verify it worked
- [ ] For each break, you found the evidence in describe/logs before opening the solution
- [ ] You can state the diagnostic fingerprint of each: not-found in pull events, connection refused on probe, OOMKilled/137 in Last State
- [ ] You know when to reach for logs --previous and why
- [ ] You watched an HPA scale up under load and (slowly) back down