Lab 5: Break It, Fix It, Schedule It, Scale It

Pairs with Phase 5. Three deliberate failures to diagnose using only kubectl describe and kubectl logs - no peeking at the manifests for hints - then a CronJob and an autoscaler.

Work in a fresh namespace:

bash

kubectl create namespace lab5
kubectl config set-context --current --namespace=lab5

Break 1: The image that isn't

Save and apply lab-5/break-1.yaml:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: break-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: break-1
  template:
    metadata:
      labels:
        app: break-1
    spec:
      containers:
        - name: app
          image: nginx:1.99-does-not-exist

bash

kubectl apply -f lab-5/break-1.yaml
kubectl get pods

You'll see ErrImagePull, then ImagePullBackOff. Diagnose:

bash

kubectl describe pod -l app=break-1

What the events show, and the fix

The Events section reads something like:

Failed to pull image "nginx:1.99-does-not-exist": ... not found
Back-off pulling image "nginx:1.99-does-not-exist"

The registry said the tag doesn't exist. In real life this is usually a tag typo, an image that was never pushed, or missing private-registry credentials (imagePullSecrets). Fix: edit the manifest to image: nginx, re-apply, watch it recover. Note that Kubernetes never stopped retrying - BackOff means "retrying with increasing delays", not "gave up".
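To make "retrying with increasing delays" concrete, here is a minimal sketch of a capped, doubling retry schedule. It does not talk to a cluster. The 10-second starting delay and the doubling factor are assumptions for illustration; the 300-second cap is the limit the Kubernetes documentation describes for image pull back-off.

```bash
#!/usr/bin/env bash
# Illustrative only: print a capped, doubling retry schedule.
# The initial delay (10s) and factor (x2) are assumptions, not values read from a cluster.
backoff_schedule() {
  local delay=$1 cap=$2 attempts=$3 out="" i
  for ((i = 0; i < attempts; i++)); do
    out+="${delay} "
    delay=$((delay * 2))
    # Never wait longer than the cap, but keep retrying at the cap.
    if ((delay > cap)); then
      delay=$cap
    fi
  done
  echo "${out% }"
}

schedule=$(backoff_schedule 10 300 7)
echo "delays in seconds: $schedule"
```

The schedule flattens at the cap instead of ending, which is why the pod recovers on its own once the image reference is fixed.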

Break 2: The probe that lies

Save and apply lab-5/break-2.yaml:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: break-2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: break-2
  template:
    metadata:
      labels:
        app: break-2
    spec:
      containers:
        - name: app
          image: traefik/whoami:v1.10
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 8080

bash

kubectl apply -f lab-5/break-2.yaml
kubectl get pods

The pod shows Running but READY 0/1 - forever. The app is healthy; why is it getting no traffic?

What the events show, and the fix

kubectl describe pod -l app=break-2 shows, under Events:

Readiness probe failed: Get "http://10.x.x.x:8080/": dial tcp ... connect: connection refused

The probe targets port 8080; whoami listens on 80. connection refused on a probe almost always means wrong port; a 404 means wrong path. The pod never enters any Service's endpoints, so it silently receives nothing - the most insidious failure mode in this lab, because nothing is "red". Fix the port to 80, re-apply, watch 0/1 become 1/1.
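The rule of thumb above can be written down as a small lookup. This is a sketch, not a kubectl feature: classify_probe_failure is a hypothetical helper, and the 404 message format ("HTTP probe failed with statuscode: 404") is assumed from typical kubelet event text rather than taken from this lab's output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a probe failure message to its most likely cause.
classify_probe_failure() {
  case "$1" in
    *"connection refused"*) echo "wrong port (nothing is listening there)" ;;
    *"statuscode: 404"*)    echo "wrong path (server answered, path unknown)" ;;
    *)                      echo "unknown - read the full event" ;;
  esac
}

# Message shape from this lab's Events section.
refused=$(classify_probe_failure 'Readiness probe failed: Get "http://10.x.x.x:8080/": dial tcp ... connect: connection refused')
# Assumed message shape for a wrong path.
not_found=$(classify_probe_failure 'Readiness probe failed: HTTP probe failed with statuscode: 404')

echo "$refused"
echo "$not_found"
```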

Break 3: The memory eater

Save and apply lab-5/break-3.yaml:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: break-3
spec:
  replicas: 1
  selector:
    matchLabels:
      app: break-3
  template:
    metadata:
      labels:
        app: break-3
    spec:
      containers:
        - name: app
          image: polinux/stress
          command: ["stress"]
          args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
          resources:
            limits:
              memory: 100Mi

bash

kubectl apply -f lab-5/break-3.yaml
kubectl get pods -w

The container tries to allocate 250M with a 100Mi limit. Watch the pod cycle: Running -> OOMKilled -> CrashLoopBackOff -> restart -> repeat.
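The two numbers use different suffixes, so it helps to compare them in bytes. The sketch below is plain shell arithmetic. Kubernetes reads Mi as 2^20 bytes; how stress reads its M suffix is treated as an assumption, so both the decimal and the binary interpretation are checked.

```bash
#!/usr/bin/env bash
# Compare the allocation against the limit in bytes.
limit_bytes=$((100 * 1024 * 1024))        # 100Mi, Kubernetes binary suffix
alloc_decimal=$((250 * 1000 * 1000))      # 250M if M means 10^6 (assumption)
alloc_binary=$((250 * 1024 * 1024))       # 250M if M means 2^20 (assumption)

echo "limit:            $limit_bytes bytes"
echo "alloc (decimal):  $alloc_decimal bytes"
echo "alloc (binary):   $alloc_binary bytes"

# Either way the allocation is more than double the limit.
if ((alloc_decimal > limit_bytes && alloc_binary > limit_bytes)); then
  verdict="over limit"
else
  verdict="within limit"
fi
echo "verdict: $verdict"
```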

Where the evidence is, and the fix

kubectl describe pod -l app=break-3 shows:

Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137

Reason: OOMKilled and exit code 137 are the fingerprints. Note it's under Last State - the current state is the restarted container. The same logic applies to logs: kubectl logs <pod> --previous shows the dead container's output, plain logs shows the new one's. In real life the fix is either raising the limit (the app genuinely needs more) or fixing a leak. Here, raise the limit to 300Mi, re-apply, watch it stabilize.
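Exit code 137 is not specific to Kubernetes: it is 128 plus the signal number, and SIGKILL is signal 9. The sketch below reproduces that locally by having a child shell kill itself, with no cluster involved. It only demonstrates the arithmetic; a real OOM kill is delivered by the kernel, not by the process itself.

```bash
#!/usr/bin/env bash
# A child shell sends itself SIGKILL; the parent then inspects the exit status.
status=0
sh -c 'kill -KILL $$' || status=$?

# Statuses above 128 mean "terminated by signal (status - 128)".
signal=$((status - 128))
echo "exit status: $status, signal number: $signal"
```

Any container that dies with 137 was killed by SIGKILL; it is the OOMKilled reason next to it that pins the cause on memory.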

Clean up the breakage:

bash

kubectl delete deploy break-1 break-2 break-3

Part 2: A CronJob

Save and apply lab-5/cronjob.yaml:

yaml

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ticker
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: tick
              image: busybox:1.36
              command: ["sh", "-c", "date; echo tick from a CronJob"]

bash

kubectl apply -f lab-5/cronjob.yaml
# wait a minute or two, then:
kubectl get jobs
kubectl logs job/$(kubectl get jobs -o jsonpath='{.items[-1].metadata.name}')

Expected: the date, then "tick from a CronJob". Each minute the CronJob creates a Job, and each Job runs a pod to completion. This is the direct replacement for host crontabs calling docker run. Delete it before it litters: kubectl delete cronjob ticker.
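The schedule field uses the standard five-field cron layout: minute, hour, day of month, month, day of week. As a reading aid, the sketch below splits a schedule string into those fields. split_schedule is a hypothetical helper that only checks the field count, not whether each field is valid, and the "0 3 * * 1" schedule is an invented example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: label the five fields of a cron schedule string.
split_schedule() {
  local minute hour dom month dow extra
  # read splits on whitespace and does not expand the * characters.
  read -r minute hour dom month dow extra <<< "$1"
  if [[ -z "$dow" || -n "$extra" ]]; then
    echo "expected exactly 5 fields" >&2
    return 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$minute" "$hour" "$dom" "$month" "$dow"
}

every_minute=$(split_schedule "* * * * *")   # the schedule used by ticker
weekly=$(split_schedule "0 3 * * 1")         # invented example: 03:00 on Mondays

echo "$every_minute"
echo "$weekly"
```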

Part 3: Autoscaling

HPA needs metrics-server, which kind doesn't ship. Install and patch it (the patch makes metrics-server skip verification of kind's self-signed kubelet certificates - needed on kind, and not something to carry over to production clusters):

bash

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

Wait a minute, then confirm metrics flow:

bash

kubectl top nodes

Deploy a scalable target with a deliberately small CPU request, plus an HPA:

bash

kubectl create deployment scale-me --image=traefik/whoami:v1.10 --port=80
kubectl set resources deployment scale-me --requests=cpu=20m
kubectl expose deployment scale-me --port=80
kubectl autoscale deployment scale-me --cpu-percent=50 --min=1 --max=5
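Why the request is deliberately small: the HPA measures CPU utilization as a percentage of the pod's request, so a 20m request with a 50% target puts the scale-up threshold at very little actual CPU. A rough calculation follows. It ignores the HPA's built-in tolerance band, so treat the result as approximate.

```bash
#!/usr/bin/env bash
# Approximate per-pod CPU usage at which the HPA target is reached.
cpu_request_m=20     # from: kubectl set resources ... --requests=cpu=20m
target_percent=50    # from: kubectl autoscale ... --cpu-percent=50

threshold_m=$((cpu_request_m * target_percent / 100))
echo "target reached at roughly ${threshold_m}m average CPU per pod"
```

With a larger request, the busybox load loop might never push utilization past the target.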

Generate load from inside the cluster:

bash

kubectl run load --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://scale-me > /dev/null; done"

Watch it react (takes 1-3 minutes - the HPA is deliberately not jumpy):

bash

kubectl get hpa scale-me -w

Expected: TARGETS climbs past 50%, REPLICAS steps up. Kill the load (kubectl delete pod load) and watch it scale back down - note the downscale is much slower (about 5 minutes), by design, to avoid flapping.
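The replica steps follow the HPA's documented core formula: desired = ceil(current replicas x current utilization / target utilization), clamped to the min and max. The sketch below implements only that formula. It leaves out the tolerance band and the stabilization window that causes the slow downscale. The utilization figures in the calls are invented inputs, not measurements from this lab.

```bash
#!/usr/bin/env bash
# Simplified HPA replica calculation (no tolerance, no stabilization window).
# Args: current_replicas current_utilization_% target_utilization_% min max
hpa_desired_replicas() {
  local current=$1 util=$2 target=$3 min=$4 max=$5
  # Integer ceiling of (current * util / target).
  local desired=$(( (current * util + target - 1) / target ))
  if ((desired < min)); then desired=$min; fi
  if ((desired > max)); then desired=$max; fi
  echo "$desired"
}

# Invented inputs, using this lab's target (50%) and bounds (1..5).
scale_up=$(hpa_desired_replicas 1 180 50 1 5)    # ceil(3.6)
scale_down=$(hpa_desired_replicas 4 20 50 1 5)   # ceil(1.6)
capped=$(hpa_desired_replicas 2 400 50 1 5)      # ceil(16), clamped to max

echo "scale up to:   $scale_up"
echo "scale down to: $scale_down"
echo "capped at:     $capped"
```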

Clean up:

bash

kubectl delete namespace lab5
kubectl config set-context --current --namespace=default

Verify it worked

[ ] For each break, you found the evidence in describe/logs before opening the solution
[ ] You can state the diagnostic fingerprint of each: not-found in pull events, connection refused on probe, OOMKilled/137 in Last State
[ ] You know when to reach for logs --previous and why
[ ] You watched an HPA scale up under load and (slowly) back down

Next: Lab 6: Helm and Kustomize
