Resolved -
Incident Summary: Notifications Service Outage
Date: April 1, 2026
Duration: approximately 8 hours
Severity: Email notifications only - no impact to job processing
What happened
On April 1st, 2026, our Notifications service, which sends job update email notifications, went offline. A configuration change had removed the service's image from our image registry, so the service was unable to start. Jobs continued to run and complete normally throughout the incident, and no data was lost. The only impact was the loss of email notifications for job status updates during the outage.
Root cause
As part of a planned infrastructure cost optimization effort, we activated a storage cleanup policy on our container image registry to remove old, unused images. The Notifications service is released on a cycle independent of our core platform and had not been rebuilt recently. As a result, its deployed image version fell outside the retention window and was removed by the cleanup policy, even though it was still in use. When the service attempted to restart, it could not pull the required image.
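For readers who want to see the failure mode concretely, here is a minimal sketch of an age-only cleanup rule. It is not our actual registry configuration: the function name, image tags, dates, and the 90-day window are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def select_for_cleanup(images, now, retention):
    """Return tags pushed before the retention cutoff (age is the only criterion)."""
    cutoff = now - retention
    return [tag for tag, pushed_at in images.items() if pushed_at < cutoff]


# Hypothetical registry contents: tag -> time the image was pushed.
now = datetime(2026, 4, 1, tzinfo=timezone.utc)
images = {
    "core-platform:2026.03": now - timedelta(days=10),
    "notifications:1.4.0": now - timedelta(days=200),  # old, but still deployed
}

# Hypothetical 90-day retention window.
to_delete = select_for_cleanup(images, now, retention=timedelta(days=90))
```

Because the rule looks only at image age, the rarely rebuilt image is selected for deletion even though a running service still depends on it.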
Resolution
We identified the issue, deployed the latest version of the Notifications service, and confirmed full functionality was restored.
Steps taken to prevent recurrence
1. Added an image version retention policy - images for infrequently built services are never pruned while in service.
2. Upgraded monitoring and alerting - we are adding monitors and alerts that will page our On Call Engineering team to ensure faster response times.
3. Audited all deployed services - we have verified that all currently deployed image versions across all environments are present in the registry and covered by the updated retention policy.
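As a rough illustration of items 1 and 3, the sketch below shows one way to exclude in-service images from pruning and to audit deployed images against the registry. It is a sketch under stated assumptions, not our production tooling: the function names, image tags, dates, and 90-day window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def prune_candidates(registry, deployed, now, retention):
    """Tags older than the retention window that no environment is running."""
    cutoff = now - retention
    return sorted(
        tag for tag, pushed_at in registry.items()
        if pushed_at < cutoff and tag not in deployed
    )


def missing_images(registry, deployed):
    """Deployed tags that can no longer be found in the registry."""
    return sorted(tag for tag in deployed if tag not in registry)


# Hypothetical data: registry maps tag -> push time; deployed is the set of
# tags currently running in any environment.
now = datetime(2026, 4, 2, tzinfo=timezone.utc)
registry = {
    "notifications:1.4.0": now - timedelta(days=200),
    "notifications:1.3.0": now - timedelta(days=400),
    "core-platform:2026.03": now - timedelta(days=10),
}
deployed = {"notifications:1.4.0", "core-platform:2026.03"}

candidates = prune_candidates(registry, deployed, now, timedelta(days=90))
missing = missing_images(registry, deployed)
```

With this data, only the superseded `notifications:1.3.0` image is eligible for pruning, and the audit reports no deployed image missing from the registry.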
We apologize for the inconvenience and are committed to ensuring this does not happen again. If you have any questions, please reach out to our support team.
Apr 2, 04:49 UTC
Monitoring -
The service is online and we are monitoring for any issues.
Apr 2, 01:26 UTC
Identified -
We are resolving an issue with our Notifications service that sends job update email notifications. The service was inadvertently taken offline due to an infrastructure configuration change. Jobs are continuing to be processed and should not be impacted by this outage.
We estimate the service will be restored within 30 minutes and will update this incident with more details.
Apr 2, 01:16 UTC