How Claude Code Helps Debug my Kubernetes Cluster
And Other Claude Code Use Cases
I've been doing something that sounds slightly insane: giving an AI agent access to my Kubernetes cluster.
My Claude Code setup has access to bash commands. kubectl, flux, and helm are all bash commands. Why not see how well an LLM can navigate DevOps tasks?
Now when something isn't working, I don't paste logs into chat. I describe what I see and let it investigate.
How I Used to Debug
Website returns 500 errors. I'd start the usual dance:
kubectl get pods -n production
kubectl logs deployment/api -n production
kubectl describe pod api-7d9f4b8c5-x2v9n -n production
flux get kustomizations -n flux-system
# ... 20 minutes later ...
Now I type:
"The website is returning 500 errors after the latest Flux changes. Figure out what's wrong."
And Claude starts working.
How It Works
Since Claude Code has access to bash, and the tools we already use to manage Kubernetes clusters are just bash commands, we can let Claude run them itself: look at the logs, describe the pods, follow the trail. It catches single-letter typos that might take me hours to notice. It knows all of nginx's error codes, so when it tails a log from the ingress it immediately spots the little signals amongst the noise that take us humans so much longer to see.
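That kind of log triage is just a few lines of shell. Here's a sketch with made-up log lines (the path and the entries are sample data, not from my cluster):

```shell
# Fabricate a small nginx access log in combined-log format (sample data).
cat > /tmp/access.log <<'EOF'
10.0.0.1 - - [01/Jan/2025:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2025:12:00:01 +0000] "GET /api/orders HTTP/1.1" 502 157
10.0.0.3 - - [01/Jan/2025:12:00:02 +0000] "POST /api/orders HTTP/1.1" 504 160
EOF

# Field 9 is the HTTP status code; count responses by status to surface failures.
awk '{print $9}' /tmp/access.log | sort | uniq -c | sort -rn
```

The 502s and 504s jump out of the count immediately, which is exactly the signal Claude picks up on when it tails the real ingress log.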
I describe a problem, and it builds its own investigation. It follows dependencies and checks everything methodically.
Sometimes, for certain operators, I've seen it run commands I didn't know existed: commands that show the spec for a CRD, what kind of values it expects, and which keys are available to configure it.
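One example is kubectl explain, which prints the schema for any resource, CRDs included. A sketch, assuming Flux's HelmRelease CRD is installed on the cluster you're pointed at:

```shell
# Show the documented fields of a CRD's spec, recursively (assumes the
# helmreleases.helm.toolkit.fluxcd.io CRD exists on this cluster).
kubectl explain helmrelease.spec --recursive | head -40

# Drill into one sub-object to see exactly which keys it accepts.
kubectl explain helmrelease.spec.chart.spec
```

This is the kind of thing Claude reaches for on its own when it needs to know whether a key it's about to write is actually valid.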
It runs commands, reads output, decides what to check next, and keeps digging until it finds something.
The Ghost Blog Domain Typo
This works outside Kubernetes too. I was setting up this very blog with DigitalOcean's 1-click installer for Ghost. It literally asks two questions: domain and email. Should be simple.
It wasn't.
I entered my domain, but the wizard failed with "domain misconfigured" and died. No explanation, no restart option. The droplet was left half-configured: Ghost was running but had no SSL, and I couldn't access the admin panel.
I had SSH access but didn't know the 1-click internals. Where does Ghost store config? How is nginx set up? Where do Let's Encrypt certs go? I could figure it out manually, but that would take hours.
So I gave Claude SSH access and told it: "SSH in and figure out why the Ghost setup failed. I think I mistyped the domain."
It checked /var/www/ghost/config.production.json and found the URL was set to breakingprod.new instead of breakingprod.now. I had fat-fingered the domain during setup. It checked the nginx config and found the same typo.
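The check itself is simple once you know where to look. Here's a reproduction against a made-up config file (the real one lives at /var/www/ghost/config.production.json; this is sample data):

```shell
# Recreate a Ghost config with the fat-fingered domain (sample data, not the real file).
cat > /tmp/config.production.json <<'EOF'
{
  "url": "https://breakingprod.new",
  "server": { "host": "127.0.0.1", "port": 2368 }
}
EOF

# Pull out the configured URL; it's one character off from breakingprod.now.
grep -o '"url": *"[^"]*"' /tmp/config.production.json
```

One grep and the typo is staring at you. The hard part was knowing which file to grep, and that's the part Claude already knew.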
The SSH connection started flaking (unrelated). Connection refused, then working, then refusing again. DigitalOcean's networking was having issues. But Claude pieced together what happened anyway: I typed the wrong domain, Ghost and nginx got configured incorrectly, SSL never completed, and Ghost was redirecting to HTTPS on a non-existent domain.
It gave me commands to fix the Ghost URL, regenerate nginx config, set up SSL, and restart everything. Once SSH stabilized, I ran them. Five minutes later, the blog was live with working SSL.
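The repair boiled down to something like this. It's a sketch, not a transcript: exact steps depend on the Ghost-CLI version, and on a 1-click droplet you'd run it as the ghost manager user from /var/www/ghost:

```shell
cd /var/www/ghost

# Point Ghost at the correct domain (it had been set to breakingprod.new).
ghost config url https://breakingprod.now

# Regenerate the nginx vhost and request a Let's Encrypt certificate.
ghost setup nginx ssl

# Restart so Ghost picks up the corrected config.
ghost restart
```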
I didn't need to understand the 1-click setup internals or know where Ghost stores config. I just described the symptom and gave Claude access to investigate.
The Framework Network Driver Mess
This pattern works for hardware problems too. I'm expanding my homelab k3s cluster with some Framework Ryzen AI Max+ 395 boards. 128GB of RAM in a 10-inch mini rack. Beautiful.
The problem: the onboard NIC is so new that Linux doesn't support it yet, at least not in Ubuntu 24.04 LTS. No network. I downloaded the driver manually but had no way to get it onto the machine.
I have a GL.iNet Comet Pro KVM from Kickstarter. One of its tricks is storage mounting. I can upload files to the KVM, then mount that storage to the target machine like a virtual USB drive. So I uploaded the driver to the KVM, mounted it to the Framework node, and now I had the driver on a machine with zero network access.
Still didn't work. The driver loaded but the NIC wouldn't come up. I was stuck.
Then I remembered I had a USB-C to Ethernet adapter for my MacBook. I plugged it into the Framework node and suddenly I had network. I SSH'd in, which meant Claude could SSH in too.
I told Claude: "The onboard NIC isn't working. I've got a USB-C ethernet adapter working temporarily so you can connect. The driver should be installed but something's wrong. Fix the networking so I can use the onboard NIC."
Claude started digging through Ubuntu's driver subsystem: it checked dkms status, looked at the network interface configuration, examined device trees, and verified whether the module had loaded correctly. It found the issue. Unfortunately I don't remember what it was, but it would have taken me hours to work through on my own. I know because I've done it before.
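I can't reconstruct the exact fix, but the investigation looks roughly like this. The module name r8126 is my guess at an out-of-tree NIC driver; substitute whatever your hardware actually needs:

```shell
# Is the out-of-tree driver built and installed for the running kernel?
dkms status

# Did the module actually load, and what does the kernel say about the NIC?
lsmod | grep r8126
dmesg | grep -i r8126

# Which interfaces exist, and are any of them stuck in the DOWN state?
ip -br link

# Load the module by hand if it isn't resident.
sudo modprobe r8126
```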
It asked me if I wanted it to implement the fix and I said yes. A few moments later I was disconnecting the USB-C adapter. I plugged ethernet into the onboard port, rebooted, and everything came up clean.
I didn't need to understand Ubuntu network driver internals or dig through dkms configuration. I just needed to describe what I wanted and give Claude SSH access to figure it out.
Security
"You're giving an AI root access to your cluster?!"
Not exactly. Claude can investigate anything but can't modify resources directly via kubectl. All changes go through Git. Claude modifies the source of truth, Flux applies it. I get Git history for audit trails, the ability to revert changes, and approval gates before anything deploys.
The kubectl access uses a limited service account that can read most things but only write to specific namespaces. It can't delete production namespaces or modify cluster-wide resources. And Claude proposes actions and asks before executing. It's not running autonomously while I sleep.
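A read-mostly service account like that takes only a few kubectl commands to build. A sketch with illustrative names (claude, tooling, and sandbox are my placeholders, not anything Claude Code requires):

```shell
# Identity for the agent.
kubectl create serviceaccount claude -n tooling

# Cluster-wide read access to the resources it needs for investigation.
kubectl create clusterrole claude-read \
  --verb=get,list,watch \
  --resource=pods,pods/log,deployments,services,events,configmaps
kubectl create clusterrolebinding claude-read \
  --clusterrole=claude-read --serviceaccount=tooling:claude

# Write access confined to a single scratch namespace.
kubectl create role claude-write -n sandbox \
  --verb=create,update,patch,delete --resource=deployments,configmaps
kubectl create rolebinding claude-write -n sandbox \
  --role=claude-write --serviceaccount=tooling:claude
```

With this shape, a stray destructive command fails at the API server instead of relying on the model's good behavior.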
Should You Try This?
Consider it if you understand Kubernetes well enough to validate suggestions, use GitOps for audit trails and rollbacks, have non-production environments to break things in, and are comfortable approving actions before they run.
Skip it if you're new to Kubernetes, don't have GitOps configured, or need SOC2 compliance. Auditors will ask uncomfortable questions.
The Future
This feels like early days. Right now I'm the bottleneck. Claude investigates, proposes, waits for approval. The next step is handling known-failure patterns autonomously: increasing memory limits when it sees OOMKilled, checking image tags when Flux syncs fail, or paging me with findings when error rates spike.
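The detection half of that is already scriptable. Here's a sketch of the OOMKilled check an autonomous loop could start from (I haven't wired this into anything yet):

```shell
# List every container whose last termination reason was OOMKilled, across all
# namespaces, as namespace<TAB>pod<TAB>reason lines.
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep OOMKilled
```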
We're close, but not quite there.
I'm working on giving openclaw access to my homelab cluster. Again, read-only. But that'll be the topic of a future post.
Have you experimented with giving AI agents cluster access? What guardrails did you use?