Hyper Portal service outage

Aug 03 at 10:15am BST

Affected services

Hyper Portal

Resolved
Aug 03 at 10:15am BST

1. Incident Overview

The Hyper Portal suffered a service outage that prevented it from serving HMDFs to the client SDKs. The issue was caused by an expired NPM token on the Hyper Portal server account that the API uses for reading HMDFs from S3. The incident was reported by our internal team at 09:07 BST on Thursday 3rd August. The Web Team were notified immediately, and the cause was identified and resolved by 09:57 BST. During the outage, a customer also raised the issue at 09:19 BST, whilst we were working through the problem. Customers received a response within 20 minutes explaining the cause and the timeline for remediation.

2. Incident Details

The issue was caused by an edge case involving an expired NPM token, used for building the application, after the tokens had been cycled. AWS Elastic Beanstalk automatically scales instances within the deployment. In this case it deployed a new instance using an out-of-date build that referenced the expired token. The application could not deploy successfully and the service was subsequently unavailable, responding to requests with a 502 Bad Gateway.

3. Incident Response

The Web Team responded immediately, diagnosing the problem and providing a fix and timeline within 15 minutes of the alert. The issue was resolved by ensuring that the deployment referenced the updated token, and then initiating a redeployment of the application at 09:57 BST.

4. Root Cause Analysis

The root cause is the way that NPM tokens are referenced within deployments. When a deployment is created, it references the token that is available at that time. When the same build redeploys (which can happen for a number of reasons due to the auto scaling behaviour of the infrastructure), it uses the token that it was given during the initial deployment, rather than a newer token that may since have been issued (as was the case here).
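
The failure mode can be illustrated with a minimal Python sketch. It is a simulation only and not the Hyper Portal deployment code: `TokenStore`, `Deployment`, and the token values are hypothetical names introduced for illustration, and the assumption is that cycling a token invalidates the previous one.

```python
from dataclasses import dataclass


class TokenStore:
    """Hypothetical stand-in for wherever the current NPM token is kept."""

    def __init__(self, token: str) -> None:
        self.current = token
        self.revoked: set[str] = set()

    def cycle(self, new_token: str) -> None:
        # Assumption: cycling invalidates the old token and makes the new one current.
        self.revoked.add(self.current)
        self.current = new_token

    def is_valid(self, token: str) -> bool:
        return token == self.current and token not in self.revoked


@dataclass
class Deployment:
    """A build that captured the token available when it was created."""

    captured_token: str

    def redeploy(self, store: TokenStore) -> str:
        # A redeploy of the same build reuses the captured token;
        # it does not look up the current one.
        return "ok" if store.is_valid(self.captured_token) else "failed"


store = TokenStore("token-v1")
build = Deployment(captured_token=store.current)

before = build.redeploy(store)  # token is still valid at this point
store.cycle("token-v2")
after = build.redeploy(store)   # same build, now holding a stale token

# A build created after the cycle captures the new token and deploys.
fresh = Deployment(captured_token=store.current).redeploy(store)
```

In this sketch the old build only fails once something triggers a redeploy after the cycle, which is why the problem can stay hidden until an automated scaling event occurs.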

5. Impact & Consequences

The incident was a Priority 1 (P1) outage that affected customers. The overall impact was a loss of map service for the SDKs, which left the mapping experience unable to function.

6. Lessons Learned

The alarms were not sufficient, and we need to investigate how and why the outage was not known about sooner. The alarm reported that it was OK. We suspect this was not the case and need to investigate to confirm the exact time the outage began.
The redeployment occurred outside working hours, when we had no ability to act to rectify the problem.
We noticed the outage by coincidence, internally, whilst our Head of Growth was using the Portal to map a store. When he refreshed the page he was presented with a 502 error. The same error was reproduced on mobile.
We recommend cycling the tokens, since they were linked to a single employee account, so that any changes to internal users have no impact on service availability.
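
One way to avoid relying on chance discovery is an external probe that requests the Portal and raises an alarm on a 5xx response. The following is a minimal sketch using only the Python standard library, not a description of the existing alarms; the function names are hypothetical, and the URL to probe and the alerting mechanism are left to the caller.

```python
import urllib.error
import urllib.request


def probe_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx responses, including 502 Bad Gateway.
        return err.code
    except OSError:
        # Covers URLError (DNS or connection failures) and socket timeouts.
        return 0


def should_alarm(status: int) -> bool:
    """Alarm on any 5xx status, or when no response arrived at all."""
    return status == 0 or 500 <= status <= 599
```

A scheduler could call `should_alarm(probe_status(url))` at a fixed interval and page the on-call engineer when it returns `True`, which would also cover failures that occur outside working hours.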

7. Preventive Action

The Hyper Portal is currently being migrated to ECS, where the application is pre-built and stored in an image registry. This means that the token will not be required for deploying the application beyond the initial build.

AWS uses an automated process for monitoring the health of instances. We will review the configuration in AWS to understand whether we can adjust the parameters that determine what is considered an unhealthy instance, to avoid any unnecessary redeployments.
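
As an illustration of the kind of parameter under review, the sketch below models a generic "unhealthy threshold": an instance is only marked unhealthy after a given number of consecutive failed checks. This is a simplified Python model, not the AWS implementation; `is_unhealthy` and `unhealthy_threshold` are hypothetical names and do not refer to a specific AWS setting.

```python
def is_unhealthy(check_results: list[bool], unhealthy_threshold: int) -> bool:
    """Return True once the most recent `unhealthy_threshold` checks all failed.

    check_results holds health check outcomes in chronological order,
    with True meaning the check passed.
    """
    if unhealthy_threshold < 1:
        raise ValueError("unhealthy_threshold must be at least 1")
    if len(check_results) < unhealthy_threshold:
        return False
    # Unhealthy only if none of the most recent checks passed.
    return not any(check_results[-unhealthy_threshold:])


# Hypothetical check history: the last two checks failed.
history = [True, False, True, False, False]
strict = is_unhealthy(history, unhealthy_threshold=2)
lenient = is_unhealthy(history, unhealthy_threshold=3)
```

With this history, a threshold of 2 marks the instance unhealthy while a threshold of 3 does not, showing how a higher threshold trades slower detection for fewer replacements triggered by transient failures.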

NPM tokens should be generated from a shared admin account so that they do not need to be cycled when an employee leaves the company.

8. Incident Closure

The incident was closed at 10:15 BST upon confirmation from the customer that the issue was no longer occurring.