Hyper Portal service outage

Aug 23 at 10:49am BST

Affected services

Hyper Portal

Resolved
Aug 23 at 10:49am BST

1. Incident Overview

The Hyper Portal suffered a service outage that prevented it from serving HMDFs to the client SDKs. The outage was caused by a failure to pull dependencies from NPM during an automated instance restart. The incident began at 09:19 BST on Wednesday 23rd August and was reported by a customer at 09:24 BST; the Web Team were notified immediately. At 09:37 BST, customers were informed that the issue was being investigated. The cause was identified and the issue was resolved by 10:41 BST.

2. Incident Details

AWS Elastic Beanstalk automatically scales instances within the deployment. In this case it deployed a new instance, which triggered a re-download of the application dependencies. One dependency failed to install, causing the server launch to fail. The service then became unavailable and responded to requests with a 502 Bad Gateway.

3. Incident Response

The Web Team responded immediately by diagnosing the problem. The issue was resolved by updating the dependency list of the application and initiating a redeployment at 10:39 BST with restoration of services at 10:41 BST. The redeployment caused the application to re-download its dependencies and start up correctly.

4. Root Cause Analysis

What caused the new instance to fail to install the dependency?
1. The dependency was not sufficiently configured, and a third-party change caused it to break.
What changed?
1. NPM installs dependencies in a relatively non-deterministic manner; its behaviour is constrained by configuration. A change in an upstream dependency altered the resulting dependency tree such that the service became misconfigured.
What fixed it?
1. The issue was fixed by updating the dependency configuration and redeploying the service.
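
The report does not say which configuration change was made, so the following is only a minimal sketch of one way to surface loosely constrained dependencies before they are deployed. It assumes the risk comes from version ranges (for example `^1.4.0`) in `package.json`; the function name `findUnpinned` and the sample dependency names are hypothetical.

```typescript
// Matches an exact semver version such as "1.2.3" or "1.2.3-beta.1".
// Anything else (ranges like "^1.2.3", tags like "latest", "*") is treated as unpinned.
const EXACT_VERSION = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?(\+[0-9A-Za-z.-]+)?$/;

// Shape of the "dependencies" map in a package.json file.
type DependencyMap = Record<string, string>;

// Returns the names of dependencies whose version specifier is not an exact version.
function findUnpinned(dependencies: DependencyMap): string[] {
  return Object.entries(dependencies)
    .filter(([, spec]) => !EXACT_VERSION.test(spec.trim()))
    .map(([name]) => name);
}

// Hypothetical dependency list, for illustration only.
const example: DependencyMap = {
  "lib-a": "^1.4.0",
  "lib-b": "2.0.1",
  "lib-c": "~0.9.3",
  "lib-d": "latest",
};

console.log(findUnpinned(example)); // names of the unpinned entries
```

Pinning direct dependencies does not constrain transitive ones. Committing `package-lock.json` and installing with `npm ci`, which installs exactly what the lockfile records, is the usual way to make the whole tree reproducible.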

5. Impact & Consequences

The incident was classified as a Priority 1 (P1) outage that affected customers. The overall impact was a loss of map service for the SDKs, which left the mapping experience unable to function.

6. Lessons Learned

The alarms were not sufficient, and we need to investigate how and why the outage was not detected sooner. The alarm reported an OK state. We suspect this was not accurate and need to investigate to confirm the exact time the service became unavailable.
The ESLint configuration was not sufficient to bring this issue to light before deployment.

7. Preventive Action

The Hyper Portal is currently being migrated to ECS, where the application is pre-built and stored in an image registry. This means dependencies will not need to be downloaded when deploying the application, only during the initial build.

AWS uses an automated process for monitoring the health of instances. We will review the configuration in AWS to understand whether we can adjust the parameters that determine what is considered an unhealthy instance, to avoid any unnecessary redeployments.

We will set up an external health check service to coarsely monitor service availability.
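
As a rough illustration of what a coarse external check might look like, the sketch below classifies a single probe and decides when to raise an alert. It is not the planned implementation: the names `classify`, `shouldAlert`, and `probe`, the three-failure threshold, and the injected `Fetcher` type are all assumptions made for the example.

```typescript
// Outcome of a single probe: the HTTP status, or null if the request failed outright.
type ProbeResult = { status: number | null; latencyMs: number };

type Health = "up" | "down";

// Minimal fetch-like function, injected so the sketch does not depend on a specific HTTP client.
type Fetcher = (url: string) => Promise<{ status: number }>;

// Treat network failures and 5xx responses (such as 502 Bad Gateway) as "down".
function classify(result: ProbeResult): Health {
  if (result.status === null || result.status >= 500) return "down";
  return "up";
}

// Alert only after `threshold` consecutive failed probes, to avoid alerting on a single blip.
function shouldAlert(history: Health[], threshold = 3): boolean {
  if (history.length < threshold) return false;
  return history.slice(-threshold).every((h) => h === "down");
}

// Performs one probe against `url` using the injected fetcher and records the latency.
async function probe(url: string, fetcher: Fetcher): Promise<ProbeResult> {
  const started = Date.now();
  try {
    const response = await fetcher(url);
    return { status: response.status, latencyMs: Date.now() - started };
  } catch {
    return { status: null, latencyMs: Date.now() - started };
  }
}
```

On Node 18 or later the global `fetch` can be passed as the fetcher. The check would need to run from outside the AWS deployment so that it observes what customers observe.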

We will configure the ESLint plugin eslint-plugin-import to mitigate future occurrences of this issue.
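
The report does not name the specific rules to be enabled. As an illustration only, the object below mirrors a legacy `.eslintrc`-style configuration that turns on two rules provided by eslint-plugin-import; the choice of these two rules is an assumption, not the team's confirmed setup.

```typescript
// Legacy (.eslintrc-style) ESLint configuration object enabling eslint-plugin-import.
// Assumption: these two rules are illustrative choices, not the team's confirmed setup.
const eslintConfig = {
  plugins: ["import"],
  rules: {
    // Fail when an import path cannot be resolved to a file or installed module.
    "import/no-unresolved": "error",
    // Fail when code imports a package that is not declared in package.json.
    "import/no-extraneous-dependencies": "error",
  },
};

console.log(JSON.stringify(eslintConfig, null, 2));
```

Running the linter with such rules in CI would flag an import whose package is missing from `package.json` before the code reaches a deployment.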

8. Incident Closure

Services were confirmed internally as back online at 10:47 BST, followed by external confirmation at 10:49 BST that the downstream services were functioning as expected again.