Hello!
We have a DOKS cluster with 2 nodes. One node runs many CronJobs. Now the pods on that node are stuck in the Pending state, and there is no more disk space in /var/lib/containerd.
We had a similar problem in January (in 2 clusters at the same time). They recreated the node and installed doks-debug. Now we have the same problem again and again. Support says: increase cluster resources or enable autoscaling, adjust CronJob frequency, etc…
Maybe the old containers and logs are not being cleaned up automatically? Is garbage collection not enabled? This is a managed (PaaS) system.
This cluster is running in ams3 on 1.30.2-do.0. (We can update it, but will that help?)
What can we do?
Thanks for helping!
Hey there,
Yep, as you mentioned, since this is DigitalOcean's managed Kubernetes service (DOKS), you don't have direct access to node-level garbage collection settings, so the best approach is to optimize what you can control and work with DigitalOcean support for the rest.
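Before changing anything, it might be worth confirming that the node is actually reporting disk pressure and why the new pods stay Pending. A quick check along these lines (with placeholder names) should show it:

```bash
# Look for a DiskPressure=True condition and related events on the affected node.
kubectl describe node <node-name>

# The Events section at the end usually explains why the scheduler cannot place
# the pod (e.g. a node.kubernetes.io/disk-pressure taint added to the node).
kubectl describe pod <pending-pod-name> -n <namespace>
```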
If your CronJobs are filling up storage, what you could do is indeed tweak their history retention. For example, you can lower `successfulJobsHistoryLimit` and `failedJobsHistoryLimit` to keep fewer old Jobs, and setting `concurrencyPolicy: Forbid` will also prevent overlapping runs, which can reduce unnecessary storage usage.
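As a rough sketch, assuming a CronJob called `report-generator` (a placeholder name, repeat for each of your CronJobs), a single patch can apply all three settings; they only affect Jobs created after the change:

```bash
# Keep only the most recent successful and failed Job, and forbid overlapping runs.
# "report-generator" is a placeholder CronJob name.
kubectl patch cronjob report-generator -p '{
  "spec": {
    "successfulJobsHistoryLimit": 1,
    "failedJobsHistoryLimit": 1,
    "concurrencyPolicy": "Forbid"
  }
}'
```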
Another thing to check is completed and failed pods left over from old Jobs. You might want to clean them up manually. If I am not mistaken, the `kubectl delete pods --field-selector=status.phase=Succeeded` and `kubectl delete pods --field-selector=status.phase=Failed` commands should help with that (they act on the current namespace, so add `-n <namespace>` or `--all-namespaces` as needed).

As you mentioned, since you can't manage node storage on DOKS directly, scaling up might be necessary. As the support team mentioned, increasing the node size, enabling autoscaling, or using multiple node pools to separate the CronJobs from other workloads can help distribute storage usage more evenly.
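If you go the multiple-node-pools route, one way to pin the CronJobs to their own pool is a `nodeSelector` on the Job pod template. If I remember correctly, DOKS labels every node with `doks.digitalocean.com/node-pool`, so something like this should work (`cron-pool` and `report-generator` are placeholder names):

```bash
# Run this CronJob's pods only on nodes in the "cron-pool" node pool,
# keeping their image and log churn away from the rest of the workloads.
kubectl patch cronjob report-generator -p '{
  "spec": {
    "jobTemplate": {
      "spec": {
        "template": {
          "spec": {
            "nodeSelector": {
              "doks.digitalocean.com/node-pool": "cron-pool"
            }
          }
        }
      }
    }
  }
}'
```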
It's possible that DOKS's default garbage collection settings aren't aggressive enough for your workload. If that's the case, reducing job history retention, spreading the CronJobs across nodes, and using smaller container images are good ways to work around it, or you can keep working with DigitalOcean support to confirm whether that is actually what's happening.
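Another workload-side knob you do control is the pods' ephemeral-storage requests and limits: the scheduler then accounts for scratch space when placing Jobs, and the kubelet evicts a pod that exceeds its limit instead of letting it fill the node disk. A minimal sketch, with placeholder names and sizes (the container name must match the one in your CronJob spec):

```bash
# Cap per-pod scratch/log usage so a runaway run is evicted before it
# fills /var/lib/containerd. Names and sizes below are placeholders.
kubectl patch cronjob report-generator -p '{
  "spec": {
    "jobTemplate": {
      "spec": {
        "template": {
          "spec": {
            "containers": [
              {
                "name": "report-generator",
                "resources": {
                  "requests": { "ephemeral-storage": "200Mi" },
                  "limits": { "ephemeral-storage": "1Gi" }
                }
              }
            ]
          }
        }
      }
    }
  }
}'
```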
Upgrading to a newer Kubernetes version is also worth considering. Kubernetes 1.30.x will likely be deprecated on DOKS soon, and newer versions might handle this better.
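If it helps, doctl should be able to list the available upgrade targets and kick off the upgrade (if I am not mistaken about the subcommands):

```bash
# List the versions this cluster can be upgraded to, then upgrade.
# Replace <cluster-name> with your cluster's name or ID.
doctl kubernetes cluster get-upgrades <cluster-name>
doctl kubernetes cluster upgrade <cluster-name> --version <new-version-slug>
```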
If this keeps happening even after these changes, it’s best to keep the communication with the DigitalOcean support team open and see if they can provide more insights or help you with this issue.
- Bobby.