The AirMap platform service outage has been fully resolved. The incident began at around 2018-08-23 22:00 UTC and ended at 2018-08-24 16:45 UTC.
The AirMap platform experienced a service outage starting on the 23rd of August, 22:00 UTC and ending on the 24th of August, 16:45 UTC. The outage initially caused all services that interact with advisory and flights to become unresponsive. As a result, rules did not load in the mobile apps and users could not file flights. While the rule data was being fixed, the VPN tunnel that we use to connect Azure services to AWS services went down and had to be restarted manually. After the tunnel failure, all Azure k8s services also had to be restarted.
- On the 23rd of August 2018, ~22:00 UTC, our monitoring infrastructure started alerting us about an increase in 5xx responses from some of our APIs (most notably advisory and flight).
- Our engineering response team started investigating and found that all the affected APIs were running on our production Kubernetes cluster.
- Further investigation revealed that a data engineer had inserted a malformed rule, which caused rules to evaluate incorrectly.
- At 15:30 UTC on the 24th of August, alerting pointed us to the specific malformed rule(s), and a data engineer removed them from the database.
- At 15:57 UTC, we discovered that our VPN tunnel between Azure and AWS had collapsed and by 16:05 UTC it had been restarted.
- Shortly after, all our pods that use the VPN tunnel were restarted and the situation started clearing up.
- By 16:46 UTC, all the pods were in a running state and status indicators switched to green.
Our data team determined that a malformed rule caused the first issue and a lack of self-healing infrastructure for the VPN tunnel exacerbated the second issue.
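The missing self-healing behaviour for the tunnel can be illustrated with a small watchdog. This is a minimal sketch, not our production tooling: `tunnel_reachable`, `TunnelWatchdog`, the `on_down` hook, and the failure threshold are all hypothetical names and values chosen for illustration. The idea is to probe an endpoint on the far side of the tunnel and, after several consecutive failures, fire a hook that could page the on-call engineer or trigger a restart.

```python
import socket
from typing import Callable


def tunnel_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection through the tunnel succeeds.

    host/port are placeholders for any service on the far side of the tunnel.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class TunnelWatchdog:
    """Counts consecutive probe failures and fires a hook at a threshold."""

    def __init__(self, probe: Callable[[], bool],
                 on_down: Callable[[], None],
                 max_failures: int = 3) -> None:
        self.probe = probe
        self.on_down = on_down            # hypothetical hook: alert or restart
        self.max_failures = max_failures  # assumed threshold, tune as needed
        self.failures = 0

    def tick(self) -> bool:
        """Run one probe and return its result.

        The hook fires once, when the failure count first reaches the
        threshold, so a long outage does not trigger repeated restarts.
        """
        if self.probe():
            self.failures = 0
            return True
        self.failures += 1
        if self.failures == self.max_failures:
            self.on_down()
        return False
```

A scheduler would call `tick()` at a fixed interval, with `probe` bound to something like `lambda: tunnel_reachable(host, port)`. Requiring consecutive failures avoids reacting to a single dropped packet.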
Remediation and Prevention
To prevent malformed data from entering the system in the future, we are overhauling our data engineering QA. We have created new alerts and notifications for the VPN tunnel and have shared the knowledge of how to restart the tunnel with our overseas and domestic response teams so that they can address future tunnel fragility efficiently. After we finish migrating our services from AWS to Azure, the tunnel will no longer be necessary, and this potential point of failure will be removed from our stack.
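One form the data QA overhaul could take is a validation gate that rejects a rule before it reaches the database. The sketch below is only an illustration under stated assumptions: the field names in `REQUIRED_FIELDS`, the `validate_rule` and `insert_rule` helpers, and the in-memory `store` are hypothetical, since the actual rule schema is not described in this report.

```python
from typing import Dict, List

# Hypothetical rule shape, used only for this example.
REQUIRED_FIELDS = {"id": str, "jurisdiction": str, "expression": str}


def validate_rule(rule: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the rule passed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in rule:
            problems.append(f"missing field: {field}")
        elif not isinstance(rule[field], expected_type):
            problems.append(f"wrong type for {field}")
        elif not rule[field].strip():
            problems.append(f"empty value for {field}")
    return problems


def insert_rule(rule: Dict[str, object], store: List[dict]) -> None:
    """Append the rule to the store only if validation finds no problems."""
    problems = validate_rule(rule)
    if problems:
        # Reject early so a malformed rule never reaches rule evaluation.
        raise ValueError("rule rejected: " + "; ".join(problems))
    store.append(rule)
```

Returning every problem at once, rather than stopping at the first, gives the data engineer a complete report to fix in one pass.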
We apologize to our customers whose services or businesses were affected during this incident, and we are taking steps to refine the platform’s performance. We take incident resolution very seriously. Investigating each incident and its root cause is our top priority in preventing a recurrence.