HMDF Downloads unavailable
Resolved
May 25 at 10:15am BST
1. Incident Overview
The Hyper Portal suffered a service outage that prevented it from serving HMDFs to the client SDKs. The outage was caused by a permissions change to the Hyper Portal server account that the API uses to read HMDFs from S3. A customer reported the incident at 07:11 BST on Friday 19th May, and Hyper picked it up internally at 09:12 BST the same day. The cause was identified and resolved by 10:02 BST on Friday 19th May.
2. Incident Details
The SDKs make requests to the HMDF endpoint of the Hyper Portal API, which is
currently hosted on AWS Elastic Beanstalk. The API requests the HMDF from an S3
bucket before returning it to the client SDK.
The unintended removal of the Hyper Portal user's S3 read permission left the API unable to read HMDFs from the S3 bucket, so requests for HMDFs returned error responses. As a result, SDK clients were unable to request new HMDFs.
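The failure path described above can be illustrated with a minimal sketch. It is not the Hyper Portal implementation: `HmdfStore`, `InMemoryStore`, `handle_hmdf_request`, the status codes, and the object key are all hypothetical, and an in-memory stand-in replaces S3 so the example runs without AWS access.

```python
from __future__ import annotations

from typing import Protocol


class HmdfStore(Protocol):
    """Hypothetical storage interface standing in for the HMDF S3 bucket."""

    def read(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in store: serves HMDFs from a dict unless read access is revoked."""

    def __init__(self, objects: dict[str, bytes], can_read: bool = True) -> None:
        self._objects = objects
        self.can_read = can_read

    def read(self, key: str) -> bytes:
        if not self.can_read:
            # Plays the role of the "Access Denied" condition seen in the logs.
            raise PermissionError("Access Denied")
        return self._objects[key]


def handle_hmdf_request(store: HmdfStore, key: str) -> tuple[int, bytes]:
    """Return (status, body). The status codes are assumptions for illustration."""
    try:
        return 200, store.read(key)
    except PermissionError:
        # The API cannot read from storage, so every HMDF request fails.
        return 500, b"HMDF unavailable"
    except KeyError:
        return 404, b"HMDF not found"


store = InMemoryStore({"example.hmdf": b"map-data"})
print(handle_hmdf_request(store, "example.hmdf"))  # served normally

store.can_read = False  # simulate the removed read permission
print(handle_hmdf_request(store, "example.hmdf"))  # now an error response
```

The point of the sketch is that a single revoked permission on the storage side turns every HMDF request into an error, regardless of which HMDF is requested.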
The Hyper Portal API logs are recorded using CloudWatch with a 2-week retention
period. The issue was logged in the server logs with the message:
[HMDFService] Access Denied
which refers to the HMDF S3 storage bucket. The first log event mentioning this was recorded at 17:29 BST 18th May.
The Hyper Portal API health is recorded in the AWS Elastic Beanstalk Events Log with a retention period of 2 weeks. The Hyper Portal entered the Warning state at 18:33 BST 18th May and first entered the Degraded state at 22:28 BST 18th May. It then moved between Degraded, Warning, and Ok several times until the issue was resolved.
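To make the detection delay concrete, the sketch below computes the time between each early signal recorded above and the moment Hyper picked up the issue internally (09:12 BST 19th May). The timestamps come from this report; the helper name `gap` is introduced here only for illustration, and the year is taken from the response timeline.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# All times are BST, so no timezone conversion is needed for the differences.
first_access_denied = datetime.strptime("2023-05-18 17:29", FMT)
first_warning = datetime.strptime("2023-05-18 18:33", FMT)
first_degraded = datetime.strptime("2023-05-18 22:28", FMT)
picked_up_internally = datetime.strptime("2023-05-19 09:12", FMT)


def gap(start: datetime, end: datetime) -> str:
    """Format the elapsed time between two datetimes as hours and minutes."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


print(gap(first_access_denied, picked_up_internally))  # 15h 43m
print(gap(first_warning, picked_up_internally))        # 14h 39m
print(gap(first_degraded, picked_up_internally))       # 10h 44m
```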
3. Incident Response
09:12 BST 2023-05-19 Hayley Hinsley (Mobile PM) responds to the customer report.
09:12 BST 2023-05-19 Hayley Hinsley (Mobile PM) notifies the teams responsible for the Hyper Portal and SDKs
09:13 BST 2023-05-19 Chris Rivers (Mobile Lead) identifies the API is responding with an error
09:28 BST 2023-05-19 Harry Jarman (Web Engineer) investigates the Hyper Portal logs
09:40 BST 2023-05-19 Harry Jarman (Web Engineer) notifies Hyper team that the cause has been identified
09:56 BST 2023-05-19 Harry Jarman (Web Engineer) applies the fix and notifies Hyper team that the issue has been resolved
09:57 BST 2023-05-19 Chris Rivers (Mobile Lead) confirms the fix
10:02 BST 2023-05-19 Hayley Hinsley (Mobile PM) communicates to customer that the issue has been fixed
10:15 BST 2023-05-19 Customer confirms the issue is resolved
4. Root Cause Analysis
The issue was caused by human error: the Hyper Portal API server account's read access to the HMDF S3 bucket was removed unintentionally.
5. Impact and Consequences
The outage caused a loss of map service for the SDKs, leaving the mapping experience unable to function.
6. Lessons Learned
Two main process improvements have been identified as a result of the incident.
Firstly, Hyper did not observe the issue until the team arrived for work in the morning.
Secondly, it should not be easy for a developer to alter the server permissions
unintentionally.
7. Mitigations for Preventing Future Occurrences
In order to mitigate this issue in future, a number of improvements have been implemented:
- Hyper Portal alerts have been created to monitor the service health of the Hyper Portal.
- Hyper Portal alerts have been set up to notify a shared email that relays messages to:
  - Neil Thomson (Web Lead)
  - Harry Jarman (Web Engineer)
  - Greg Sims (Web Engineer)
  - Andrew Hart (CEO)
- Hyper Portal alerts have been set up to post messages in Hyper's internal Slack channel
- A specific tag has been applied to all Hyper Portal AWS resources. An IAM policy that prevents interaction with these resources has been applied to all users except Neil Thomson (Web Lead), Harry Jarman (Web Engineer), and Greg Sims (Web Engineer).
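As an illustration of the tag-based restriction in the last item, the sketch below builds an IAM policy document that denies actions on resources carrying a given tag, using the `aws:ResourceTag/<key>` condition key. The tag key, tag value, statement ID, and function name are hypothetical, since this report does not state the real ones, and not every AWS service honours `aws:ResourceTag`, so a policy of this kind would need checking against each service in use.

```python
import json

# Hypothetical tag key and value; the real tag is not named in this report.
TAG_KEY = "Project"
TAG_VALUE = "hyper-portal"


def build_deny_policy(tag_key: str, tag_value: str) -> dict:
    """Build a policy document denying all actions on resources with the tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyTaggedHyperPortalResources",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": tag_value}
                },
            }
        ],
    }


policy = build_deny_policy(TAG_KEY, TAG_VALUE)
print(json.dumps(policy, indent=2))
```

In this sketch the exemption for the three named engineers comes from not attaching the policy to their users, rather than from a condition inside the policy itself.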
Future improvements and recommendations
- A policy allowing modification of Hyper Portal resources will be created and will be accessible only by explicitly assuming a role on a time-limited basis. This will then be applied to all users, with no exceptions.
- A policy for requesting permission to assume the above role will be defined. Requests by Hyper employees for access to modify Hyper Portal resources will be logged and will require approval.
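One way the two recommendations above could fit together is sketched below, purely as an assumption about a possible design rather than a description of what Hyper will build. An approved request is turned into the parameters for an AWS STS `AssumeRole` call with a bounded `DurationSeconds`. The `AccessRequest` type, the function name, the role ARN, the user names, and the one-hour cap are all hypothetical; the 900-second lower bound is the STS minimum.

```python
from dataclasses import dataclass

MIN_SESSION_SECONDS = 900   # lowest DurationSeconds that STS AssumeRole accepts
MAX_SESSION_SECONDS = 3600  # cap assumed for this sketch, not taken from the report


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical record of a logged, approved request for temporary access."""

    requester: str
    approver: str
    reason: str
    minutes: int


def build_assume_role_params(request: AccessRequest, role_arn: str) -> dict:
    """Validate the request and return keyword arguments for AssumeRole."""
    if not request.approver or request.approver == request.requester:
        raise ValueError("request must be approved by someone other than the requester")
    seconds = request.minutes * 60
    if not MIN_SESSION_SECONDS <= seconds <= MAX_SESSION_SECONDS:
        raise ValueError("requested duration is outside the allowed window")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": request.requester,
        "DurationSeconds": seconds,
    }


request = AccessRequest(
    requester="example.engineer",
    approver="example.lead",
    reason="rotate bucket policy",
    minutes=30,
)
params = build_assume_role_params(
    request, "arn:aws:iam::123456789012:role/HyperPortalModify"
)
# With boto3 installed and credentials configured, these parameters could be
# passed as boto3.client("sts").assume_role(**params); that call is not made here.
print(params)
```

Keeping the validation separate from the AWS call means the approval and duration rules can be checked and logged before any credentials are issued.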
8. Incident Closure
The incident was closed at 10:15 BST on 19th May upon confirmation from the customer that the
issue was no longer occurring.
Affected services
Hyper Portal