The AirMap platform service outage has been fully resolved. The incident began at around 2018-07-19 15:00 UTC and ended at 2018-07-19 20:00 UTC.
The AirMap platform experienced a service outage on the 19th of July, 2018, from approximately 15:00 UTC to 20:00 UTC. The outage caused all services to become unresponsive, beginning with the pilot service that retrieves a user’s profile. This first resulted in unsuccessful logins to the platform, and the impact grew from there. We investigated to find the underlying bug and ultimately restarted all worker nodes to clear the issue.
- On the 19th of July, 2018, at ~15:00 UTC, our monitoring infrastructure started alerting us about an increase in 5xx responses from some of our APIs (most notably pilot and tiles).
- Our engineering response team started investigating and found that all the affected APIs were running on our production Kubernetes cluster.
- Further investigation revealed that a majority of the API replicas had entered a restart loop and were unable to serve traffic. In addition, a subset of worker nodes in the cluster was reported as “Not Ready”.
- We attempted to resurrect the Kubelets of the stuck worker nodes, but they quickly fell back to “Not Ready”. At that point, additional worker nodes started to be marked as “Not Ready”.
- The investigation into the per-pod/per-container logs revealed network and DNS failures. These caused the sidecar containers that connect to our internal traffic management infrastructure to fail, taking down the workload containers and resulting in the aforementioned restart loop.
- At ~18:00 UTC, the response team decided to restart all worker nodes in the cluster in a staggered fashion.
- Shortly after a quorum of the worker nodes powering our service agent pools had been restarted, the situation started clearing up and healthy replicas of the APIs were scheduled on the rebooted nodes.
- After ~1.5 hours, all the worker nodes had been restarted and both cluster-internal infrastructure and public-facing services had stabilized. The 5xx responses disappeared and status indicators switched to green.
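The “Not Ready” symptom from the timeline above can be checked programmatically. The following is a minimal sketch, not the tooling we used during the incident: it assumes the node list has been exported as JSON (for example with `kubectl get nodes -o json`), and the function name `not_ready_nodes` is hypothetical.

```python
import json
from typing import List


def not_ready_nodes(node_list_json: str) -> List[str]:
    """Return the names of nodes whose "Ready" condition is not "True".

    Assumes the input follows the Kubernetes NodeList JSON shape:
    items[].metadata.name and items[].status.conditions[].
    """
    nodes = json.loads(node_list_json).get("items", [])
    flagged = []
    for node in nodes:
        conditions = node.get("status", {}).get("conditions", [])
        # Find the "Ready" condition; a missing condition is treated as not ready.
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            flagged.append(node["metadata"]["name"])
    return flagged
```

A node whose Ready status is "False" or "Unknown", or that reports no Ready condition at all, is flagged. Treating a missing condition as unhealthy is a conservative choice made for this sketch.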
As a result of our ad-hoc and post-mortem investigations, we were able to correlate the symptoms identified in our production cluster with a known upstream Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/45419
Fortunately, the root cause has been identified (see https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-405168344), together with good candidates for fixes (see https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-406121274).
Remediation and Prevention
In the meantime, to prevent further service outages, we will apply a rolling restart regime that keeps nodes and Kubelets operating healthily until the proper fixes have landed upstream and propagated to our cloud provider (Microsoft Azure).
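A rolling restart of this kind can be sketched as follows. This is an illustrative outline under stated assumptions, not our actual tooling: the function names, the `max_unavailable` parameter, and the `restart_node` and `wait_until_ready` callbacks are all hypothetical and stand in for whatever the cloud provider or cluster tooling offers.

```python
from typing import Callable, Iterator, List, Sequence


def restart_batches(nodes: Sequence[str], max_unavailable: int) -> Iterator[List[str]]:
    """Split the node list into consecutive batches of at most max_unavailable nodes."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    for start in range(0, len(nodes), max_unavailable):
        yield list(nodes[start:start + max_unavailable])


def rolling_restart(
    nodes: Sequence[str],
    max_unavailable: int,
    restart_node: Callable[[str], None],      # hypothetical: triggers a node restart
    wait_until_ready: Callable[[str], None],  # hypothetical: blocks until the node is Ready
) -> None:
    """Restart nodes batch by batch, waiting for each batch to recover first."""
    for batch in restart_batches(nodes, max_unavailable):
        for node in batch:
            restart_node(node)
        # Do not move on until every node in this batch reports Ready again,
        # so that at most max_unavailable nodes are down at any time.
        for node in batch:
            wait_until_ready(node)
```

Capping the batch size keeps most of the cluster's capacity available while each group of nodes is restarted.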
We apologize to our customers whose services or businesses were affected during this incident, and we are taking steps to improve the platform’s reliability. We take incident resolution very seriously. Investigating each incident and its root cause is our top priority in preventing a recurrence.
We sincerely thank the open-source community, and the k8s/CNI ecosystem in particular, for their help in correlating our findings with a known issue. We’re happy to see the dedication and expertise of the people contributing their insight into the bug and finding strong candidates for fixes.