Hello!
We have a DOKS cluster with 2 nodes. One node runs many CronJobs. Now the pods on that node are stuck in the Pending state, and there is no more disk space in /var/lib/containerd.
We had a similar problem in January (in 2 clusters at the same time). They recreated the node and installed doks-debug. Now we have the same problem again and again. Support says: increase cluster resources or enable autoscaling, adjust CronJob frequency, etc…
Maybe the old containers and logs are not being cleaned up automatically? Is garbage collection not enabled? This is a managed (PaaS) system.
This cluster is running in ams3 on 1.30.2-do.0. (We can update it, but will that help?)
What can we do?
Thanks for helping!
Hey there,
Yep, as you mentioned, since this is DigitalOcean's managed Kubernetes service (DOKS), you don't have direct access to node-level garbage collection settings, so the best approach is to optimize what you can control and work with DigitalOcean support for the rest.
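Before changing anything, it might be worth confirming that the node is actually reporting disk pressure and why the new pods stay Pending. A quick check along these lines (with placeholder names) should show it:

```bash
# Look for a DiskPressure=True condition and related events on the affected node.
kubectl describe node <node-name>

# The Events section at the end usually explains why the scheduler cannot place
# the pod (e.g. a node.kubernetes.io/disk-pressure taint added to the node).
kubectl describe pod <pending-pod-name> -n <namespace>
```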
If your CronJobs are filling up storage, what you could do is indeed tweak their history retention. For example, you can lower `successfulJobsHistoryLimit` and `failedJobsHistoryLimit` to keep fewer old Jobs, and setting `concurrencyPolicy: Forbid` will also prevent overlapping runs, which can reduce unnecessary storage usage.
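As a rough sketch, assuming a CronJob called `report-generator` (a placeholder name, repeat for each of your CronJobs), a single patch can apply all three settings; they only affect Jobs created after the change:

```bash
# Keep only the most recent successful and failed Job, and forbid overlapping runs.
# "report-generator" is a placeholder CronJob name.
kubectl patch cronjob report-generator -p '{
  "spec": {
    "successfulJobsHistoryLimit": 1,
    "failedJobsHistoryLimit": 1,
    "concurrencyPolicy": "Forbid"
  }
}'
```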
Another thing to check is completed and failed pods left over from old Jobs. You might want to clean them up manually. If I am not mistaken, the `kubectl delete pods --field-selector=status.phase=Succeeded` and `kubectl delete pods --field-selector=status.phase=Failed` commands should help with that (they act on the current namespace, so add `-n <namespace>` or `--all-namespaces` as needed).

As you mentioned, since you can't manage node storage on DOKS directly, scaling up might be necessary. As the support team mentioned, increasing the node size, enabling autoscaling, or using multiple node pools to separate the CronJobs from other workloads can help distribute storage usage more evenly.
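If you go the multiple-node-pools route, one way to pin the CronJobs to their own pool is a `nodeSelector` on the Job pod template. If I remember correctly, DOKS labels every node with `doks.digitalocean.com/node-pool`, so something like this should work (`cron-pool` and `report-generator` are placeholder names):

```bash
# Run this CronJob's pods only on nodes in the "cron-pool" node pool,
# keeping their image and log churn away from the rest of the workloads.
kubectl patch cronjob report-generator -p '{
  "spec": {
    "jobTemplate": {
      "spec": {
        "template": {
          "spec": {
            "nodeSelector": {
              "doks.digitalocean.com/node-pool": "cron-pool"
            }
          }
        }
      }
    }
  }
}'
```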
It's possible that DOKS's default garbage collection settings aren't aggressive enough for your workload. If that's the case, reducing job history retention, spreading the CronJobs across nodes, and using smaller container images are good ways to work around it, or you can keep working with DigitalOcean support to confirm whether that is actually what's happening.
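Another workload-side knob you do control is the pods' ephemeral-storage requests and limits: the scheduler then accounts for scratch space when placing Jobs, and the kubelet evicts a pod that exceeds its limit instead of letting it fill the node disk. A minimal sketch, with placeholder names and sizes (the container name must match the one in your CronJob spec):

```bash
# Cap per-pod scratch/log usage so a runaway run is evicted before it
# fills /var/lib/containerd. Names and sizes below are placeholders.
kubectl patch cronjob report-generator -p '{
  "spec": {
    "jobTemplate": {
      "spec": {
        "template": {
          "spec": {
            "containers": [
              {
                "name": "report-generator",
                "resources": {
                  "requests": { "ephemeral-storage": "200Mi" },
                  "limits": { "ephemeral-storage": "1Gi" }
                }
              }
            ]
          }
        }
      }
    }
  }
}'
```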
Upgrading to a newer Kubernetes version is also worth considering. Kubernetes 1.30.x will likely be deprecated on DOKS soon, and newer versions might handle this better.
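If it helps, doctl should be able to list the available upgrade targets and kick off the upgrade (if I am not mistaken about the subcommands):

```bash
# List the versions this cluster can be upgraded to, then upgrade.
# Replace <cluster-name> with your cluster's name or ID.
doctl kubernetes cluster get-upgrades <cluster-name>
doctl kubernetes cluster upgrade <cluster-name> --version <new-version-slug>
```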
If this keeps happening even after these changes, it’s best to keep the communication with the DigitalOcean support team open and see if they can provide more insights or help you with this issue.
- Bobby.