Kubernetes (DOKS) node ran out of disk space in /var/lib/containerd

Posted on February 7, 2025

Hello!

We have a DOKS cluster with 2 nodes. One node runs many CronJobs. Now the pods on that node are stuck in the Pending state because there is no disk space left in /var/lib/containerd.

We had a similar problem in January (in 2 clusters at the same time). Support recreated the node and installed doks-debug. Now we keep running into the same problem again and again. Support says: increase cluster resources or enable autoscaling, adjust CronJob frequency, etc.

Could it be that old containers and logs are not being cleaned up automatically? Is garbage collection not enabled? This is a managed (PaaS) system.

The cluster is running in ams3 on 1.30.2-do.0. (We can upgrade it, but will that help?)

What can we do?

Thanks for helping!




Hey there,

Yep, as you mentioned, since this is the DigitalOcean Managed Kubernetes Service, you don’t have direct access to node-level garbage collection settings, so the best approach is to optimize what you can control and work with DigitalOcean support for the rest.

If your CronJobs are filling up storage, what you could do is indeed tweak their history retention. For example, you can lower `successfulJobsHistoryLimit` and `failedJobsHistoryLimit` to keep fewer old Jobs. Setting `concurrencyPolicy: Forbid` will also prevent overlapping runs, which can reduce unnecessary storage usage.
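Roughly, the relevant part of a CronJob spec would look something like this (the name, schedule, and container below are just placeholders for your own jobs):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron              # placeholder, use your CronJob's name
spec:
  schedule: "*/15 * * * *"        # placeholder schedule
  concurrencyPolicy: Forbid       # don't start a new run while the previous one is still going
  successfulJobsHistoryLimit: 1   # keep only the most recent successful Job (the default is 3)
  failedJobsHistoryLimit: 1       # keep only the most recent failed Job
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: task
              image: busybox:1.36           # placeholder image
              command: ["sh", "-c", "echo running"]
```

Fewer retained Jobs means fewer finished pods (and their containers and logs) hanging around on the node.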

Another thing to check is completed and failed pods left behind by those Jobs. You might want to clean them up manually. If I am not mistaken, `kubectl delete pods --field-selector=status.phase=Succeeded` and `kubectl delete pods --field-selector=status.phase=Failed` should help with that.
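For example, to clear them out across all namespaces (the `-A` flag assumes you want every namespace; drop it to target only the current one):

```bash
# Delete finished pods in every namespace
kubectl delete pods -A --field-selector=status.phase=Succeeded
kubectl delete pods -A --field-selector=status.phase=Failed
```

As far as I know, this only removes the Pod objects; the kubelet’s image garbage collection on the node still decides when unused container images actually get pruned.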

Since you can’t manage node storage directly, scaling up might be necessary. As the support team mentioned, increasing the node size, enabling autoscaling, or using multiple node pools to separate the CronJobs from your other workloads can help distribute storage usage more efficiently.
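If you go the dedicated node pool route, something along these lines should work with `doctl` (the cluster name, pool name, and Droplet size are just examples, and it’s worth double-checking the flags against `doctl kubernetes cluster node-pool create --help` since I’m going from memory):

```bash
# Add an autoscaling node pool dedicated to the CronJob workloads
doctl kubernetes cluster node-pool create my-cluster \
  --name cron-pool \
  --size s-2vcpu-4gb \
  --count 1 \
  --auto-scale \
  --min-nodes 1 \
  --max-nodes 3
```

You could then add a `nodeSelector` (or taints and tolerations) to the CronJobs so they land on that pool and their disk usage stays away from the rest of your workloads.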

It’s possible that DOKS’s default garbage collection settings aren’t aggressive enough for your workload. If that’s the case, reducing Job history retention, spreading CronJobs across nodes, and using smaller container images are good ways to work around it. It’s also worth continuing to work with DigitalOcean support so they can confirm whether that is actually what’s happening.

Upgrading to a newer Kubernetes version is also worth considering. I believe that Kubernetes 1.30.x will likely be deprecated soon, and newer versions might handle this better.
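If you want to check what’s available and kick off the upgrade from the CLI, something like this should do it (verify against `doctl kubernetes --help`, as the exact version slugs change over time):

```bash
# List the Kubernetes versions DOKS currently offers
doctl kubernetes options versions

# Upgrade the cluster to one of the listed version slugs
doctl kubernetes cluster upgrade my-cluster --version 1.31.1-do.0
```

The slug `1.31.1-do.0` is just an example; use whichever version the first command reports.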

If this keeps happening even after these changes, it’s best to keep the communication with the DigitalOcean support team open and see if they can provide more insights or help you with this issue.

- Bobby.
