Question

Managed Kubernetes not working - cilium can't connect to etcd

Posted on March 7, 2019
Kubernetes
Asked by DTrierweiler

Hi guys,

I already opened a support ticket, but I haven’t had a reply in 3 days, so I wanted to try here too.

I use the managed Kubernetes service with Rancher and had it running smoothly. Then on Monday morning it suddenly stopped reporting to Rancher, and the deployed websites stopped working. I checked all pods and saw that the Cilium pods are restarting like crazy and most other pods are stuck in ContainerCreating.

It seems like the Cilium pods can’t reach the etcd node anymore. This is the log of one Cilium node: https://gist.github.com/DTrierweiler/f2eecb5568fdf899695cb6f644318ffb I even downloaded the certs from the secret and tried to connect to etcd from my local machine with curl, which worked without problems.

Could this be related to DNS problems? The 2 CoreDNS pods are not running either, because they are stuck in ContainerCreating.

Thanks a lot for your help. Besides this, is it normal for support to take so much time? I have had an unusable cluster for 4 days now, which costs me $200 per month, and my websites are not running. Luckily this is still only staging and not production.

Cheers, Daniel

jarland

DigitalOcean Employee

• March 7, 2019

Accepted Answer

Hey friend,

Per Nicholas from our support team:

We have seen a similar report of pods stuck in a ContainerCreating state and there might be a Cilium dependency issue; if you run kubectl -n kube-system edit ds cilium, what is dnsPolicy set to? If you change that to “ClusterFirst” or “Default”, does that resolve the issue?
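For anyone who prefers not to use the interactive editor, here is a minimal Python sketch of the same check and change. It only builds the `kubectl` command lines and prints them, so nothing is applied to a cluster. The function names are hypothetical, and the namespace and DaemonSet name are taken from the command above. It also assumes a merge patch is acceptable on your cluster; a managed control plane may later overwrite manual changes.

```python
import json
import shlex

# The four dnsPolicy values Kubernetes accepts for a pod spec.
VALID_DNS_POLICIES = {"Default", "ClusterFirst", "ClusterFirstWithHostNet", "None"}


def dns_policy_patch(policy):
    """Build a merge patch that sets dnsPolicy on a DaemonSet's pod template."""
    if policy not in VALID_DNS_POLICIES:
        raise ValueError(f"unknown dnsPolicy: {policy!r}")
    return {"spec": {"template": {"spec": {"dnsPolicy": policy}}}}


def kubectl_get_args(namespace="kube-system", name="cilium"):
    """Command that prints the DaemonSet's current dnsPolicy."""
    return [
        "kubectl", "-n", namespace, "get", "ds", name,
        "-o", "jsonpath={.spec.template.spec.dnsPolicy}",
    ]


def kubectl_patch_args(policy, namespace="kube-system", name="cilium"):
    """Command that applies the merge patch built by dns_policy_patch()."""
    return [
        "kubectl", "-n", namespace, "patch", "ds", name,
        "--type", "merge", "-p", json.dumps(dns_policy_patch(policy)),
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    print(shlex.join(kubectl_get_args()))
    print(shlex.join(kubectl_patch_args("ClusterFirst")))
```

Run the first printed command to see the current value, and the second only if the value needs changing.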

I also wanted to quickly address this question:

“is it normal for the support to take so much time?”

It varies a bit. Our intention is to provide you with all of the things you need to troubleshoot and repair problems from your side, without having to wait for a response from our team. On the rare occasion that you do not have the ability to resolve an issue on your side and our intervention is required, such wait time is obviously unacceptable, and it is something we are working very hard on improving. By continually exposing customers to the right information up front, and getting better about providing a clear user experience as we go, we hope to see more customers empowered to solve problems so that we can be more available for the rare opportunities that you absolutely need us.

Jarland

Amir A • September 29, 2019

I am in the same boat, waiting for support to address a related issue, though my dnsPolicy is already set to ClusterFirst. I find myself debugging Cilium issues so frequently that I am questioning whether DigitalOcean’s offering is truly a “managed” offering.

At least if I hosted my own Kubernetes distribution, I would have some control over the setup, as opposed to having to wait a few days for an answer.

DTrierweiler • March 8, 2019

Hey Jarland,

Thank you so much. The dnsPolicy was set to ClusterFirstWithHostNet, and changing it to ClusterFirst did the trick. It’s running again.

Do you know why it was set to ClusterFirstWithHostNet and why it stopped working from one moment to the next? The documentation says this value should only be used when you use hostNetwork: true, which is not the case.
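As a rough illustration of how dnsPolicy and hostNetwork interact, here is a small Python sketch based on the Kubernetes documentation. The function name and the sample pod specs are hypothetical and not taken from this cluster. One documented detail is relevant here: a pod with hostNetwork: true and ClusterFirst falls back to the behavior of Default, meaning it uses the node’s DNS configuration. Whether that explains the fix in this thread is not confirmed.

```python
def describe_dns_policy(pod_spec):
    """Summarize the documented DNS behavior for a pod spec given as a dict."""
    host_network = bool(pod_spec.get("hostNetwork", False))
    # Kubernetes uses ClusterFirst when dnsPolicy is not set.
    policy = pod_spec.get("dnsPolicy", "ClusterFirst")

    if policy == "ClusterFirstWithHostNet":
        if host_network:
            return "cluster DNS (documented setting for hostNetwork pods)"
        return "flagged: ClusterFirstWithHostNet is intended for hostNetwork pods"
    if policy == "ClusterFirst":
        if host_network:
            return "falls back to Default (node DNS configuration)"
        return "cluster DNS"
    if policy == "Default":
        return "node DNS configuration"
    if policy == "None":
        return "only the pod's own dnsConfig"
    raise ValueError(f"unknown dnsPolicy: {policy!r}")


if __name__ == "__main__":
    # Hypothetical specs for illustration only.
    print(describe_dns_policy({"dnsPolicy": "ClusterFirstWithHostNet"}))
    print(describe_dns_policy({"hostNetwork": True, "dnsPolicy": "ClusterFirst"}))
```

To compare against a real cluster, the pod template spec of the DaemonSet (for example from kubectl get ds cilium -o json) could be passed in as the dictionary.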

I think you do a good job of providing a lot of information to fix and repair problems, but it all comes down to those rare occasions you mentioned (like this case). I’m still not sure why the ticket was unanswered for 3 days. Correct me if I’m wrong, but I thought the main benefit of having a managed Kubernetes service is not having to worry about this exact problem.

Anyway - thanks a lot for the reply. Maybe it’ll help someone else as well :)

Cheers