
Post-mortem of the June 29th and 30th 2025 incident

· 6 min read
Dibran Mulder
CTO @ Caesar Groep

According to our mission, we are committed to transparency and accountability. This post-mortem is part of that commitment, detailing the events surrounding the issue with the Yivi app on iOS on June 29th and 30th 2025.

Summary of the impact

Some iOS users of the Yivi app were unable to open the app on June 29th and 30th, 2025, due to an issue with the Universal Links feature. The issue was caused by a migration of the irma.app domain, which caused traffic to be redirected from irma.app to yivi.app. iOS does not follow redirects when verifying Universal Links, so the app could no longer be opened through them. The issue was resolved by pointing the irma.app domain back to its original server.

Users who had the Yivi app installed prior to the incident were most likely not affected, as the app could still be opened. However, users who installed the app after the domain migration were unable to open it due to the Universal Links issue.

Timeline of the incident

| Date & Time | Description |
| --- | --- |
| June 29th, 2025, 20:00 CEST | Received an email from our DNS provider that the domain irma.app was migrated to the new provider and traffic was starting to be redirected to yivi.app. |
| June 29th, 2025, 21:07 CEST | Received a report from a user that the Yivi app on iOS was not opening. |
| June 30th, 2025, 00:03 CEST | Received another report from a user that the Yivi app on iOS was not opening. |
| June 30th, 2025, 13:15 CEST | Started investigating the issue and found that the Universal Links feature was not working due to the domain migration. |
| June 30th, 2025, 14:00 CEST | Confirmed the issue was caused by the redirection of traffic from irma.app to yivi.app. The issue affected only new installations of the Yivi app on iOS, not existing ones. |
| June 30th, 2025, 15:30 CEST | Changed the irma.app DNS settings back to the old server, allowing Apple devices to check the site association file again. |
| June 30th, 2025, 16:00 CEST | Confirmed that the issue was resolved; the Yivi app opened on iOS devices again. |
| July 1st, 2025, 11:45 CEST | Delivered a workaround to affected users: a page that converts irma.app Universal Links to open.yivi.app Universal Links. |
| July 1st, 2025, 14:00 CEST | Documented the issue and resolution in a post-mortem on our blog and shared it with the community. |

Customer impact

The incident affected a small number of iOS users who were unable to open the Yivi app in a same-device flow due to the Universal Links issue. The impact was limited to users who installed the app after the domain migration and to users whose iOS device rechecked the site association file after the migration. We have no reason to believe that it affected a large number of existing Yivi iOS users. According to our keyshare server registrations, about 78 users (including Android users) installed the Yivi app in the time window mentioned above, which is a small percentage of our total user base.

Root cause and mitigation

As part of our efforts to improve the Yivi app, we migrated the domain irma.app to yivi.app. This migration was intended to provide a more consistent branding experience for our users and to gain control over all domains originally maintained by SIDN. However, the migration inadvertently broke the Universal Links feature on iOS devices, which do not follow redirects for Universal Links.

The Universal Links feature allows the Yivi app to open directly from links in the browser, providing a seamless user experience. However, iOS requires that the associated domain file (apple-app-site-association) be hosted on the domain without any redirects. The redirection of traffic from irma.app to yivi.app also redirected the associated domain file, which prevented the Yivi app from opening on iOS devices where the app was installed after the migration.

> `https://<fully qualified domain>/.well-known/apple-app-site-association`
>
> You must host the file using `https://` with a valid certificate and with no redirects.
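This requirement is easy to verify from the command line or a small script. The sketch below (not part of our tooling; Node 18+ assumed for the global `fetch`) fetches the association file without following redirects and flags any 3xx response, which is exactly the condition that broke Universal Links during this incident:

```javascript
// Pure helper: only a 2xx response is acceptable for the association
// file; any 3xx means a redirect, which iOS will not follow here.
function aasaStatusOk(status) {
  return status >= 200 && status < 300;
}

// Fetch the apple-app-site-association file for a domain without
// following redirects, and report whether it meets Apple's requirement.
async function checkAasa(domain) {
  const url = `https://${domain}/.well-known/apple-app-site-association`;
  // redirect: 'manual' returns the 3xx response instead of following it.
  const res = await fetch(url, { redirect: 'manual' });
  return { url, status: res.status, ok: aasaStatusOk(res.status) };
}
```

During the incident, a check like `checkAasa('irma.app')` would have surfaced the redirect immediately; running it as part of a domain migration checklist catches this class of problem before users do.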

Link to Apple's documentation

The reason this did not break Android devices is that Android uses a different mechanism for handling deep links, which doesn't rely on the irma.app domain. As can be seen in our Yivi frontend packages code, Android devices do not use the irma.app domain; instead they use an intent link, which does not require the associated domain file to be hosted without redirects.

Yivi frontend packages code

```javascript
_getMobileUrl(sessionPtr) {
  const json = JSON.stringify(sessionPtr);
  switch (this._userAgent) {
    case 'Android': {
      // Universal links are not stable in Android webviews and custom tabs, so always use intent links.
      const intent = `Intent;package=org.irmacard.cardemu;scheme=irma;l.timestamp=${Date.now()}`;
      return `intent://qr/json/${encodeURIComponent(json)}#${intent};end`;
    }
    case 'iOS': {
      return `https://irma.app/-/session#${encodeURIComponent(json)}`;
    }
    default: {
      throw new Error('Device type is not supported.');
    }
  }
}
```

To mitigate the issue, we changed the DNS settings of irma.app back to the old server, allowing the Yivi app to open on iOS devices again. This was a temporary solution to ensure that users could continue to use the app while we worked on a more permanent fix.

The open.yivi.app domain was not affected by this issue, as it was not part of the domain migration. The open.yivi.app domain is an alternative Universal Link that can be used to open the Yivi app on iOS devices. We provided a workaround for affected users by creating a page on irma.app/-/session that converts the irma.app Universal Links to open.yivi.app Universal Links, allowing them to continue using the app without any issues.
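The conversion itself is small, since only the host changes while the path and the session payload in the URL fragment stay intact. A hypothetical sketch of the page's logic (the function name is ours, not the actual implementation):

```javascript
// Rewrite an irma.app Universal Link to the equivalent open.yivi.app
// link, preserving the path and the session payload in the fragment.
function toOpenYiviLink(irmaLink) {
  const url = new URL(irmaLink);
  if (url.hostname !== 'irma.app') {
    throw new Error('Not an irma.app link.');
  }
  // Only the host changes; path and fragment are carried over as-is.
  return `https://open.yivi.app${url.pathname}${url.hash}`;
}
```

For example, `toOpenYiviLink('https://irma.app/-/session#payload')` yields `https://open.yivi.app/-/session#payload`, which iOS resolves against the unaffected open.yivi.app association file.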

This workaround will stay in place until we have fully migrated the irma.app domain to the new server and ensured that all Universal Links work correctly. Apple states that it can take up to a week for devices to recheck the associated domain file, so we will monitor the situation closely and make adjustments as needed.

Affected users can reinstall the Yivi app from the App Store, which ensures that the app can open Universal Links correctly. This is not necessary for users who already had the app installed before the incident, as they can still open the app without issues.

Next steps

  1. Reevaluate domain migration strategy: We will review our domain migration strategy to ensure that future migrations do not cause similar issues with Universal Links or other features.
  2. Review App integrations: We will review the integration of the Yivi app with Universal Links and QR intents on both iOS and Android devices to ensure that it is robust and has fallbacks in place for any potential issues.

Post-mortem of the 1st of July outage

· 6 min read
Dibran Mulder
CTO @ Caesar Groep

According to our mission, we are committed to transparency and accountability. This post-mortem is part of that commitment, detailing the events surrounding the outage of the 1st of July 2025.

Summary of the impact

We experienced a significant outage on the 1st of July 2025, which affected our services for approximately 6 hours. During this time, users were unable to validate their PIN codes and access the Yivi app, making it unusable. The issue was caused by an outage in the Scaleway data center AMS-1, where our database server is located. Scaleway preemptively shut down several services in the data center to prevent further issues such as hardware damage due to abnormal temperatures.

Timeline of the incident

| Date & Time | Description |
| --- | --- |
| July 1st, 2025, 16:30 CEST | Scaleway confirmed that they were experiencing issues due to abnormal temperatures in the data center AMS-1. |
| July 1st, 2025, 16:45 CEST | Our database server's CPU usage and total connections started spiking; subsequently, Scaleway preemptively shut down several services in the data center to prevent further issues such as hardware damage. |
| July 1st, 2025, 16:57 CEST | Received a first report from a user about errors when validating the PIN code in the Yivi app. |
| July 1st, 2025, 17:00 CEST | Started investigating the issue and found that the database server was experiencing high CPU usage. |
| July 1st, 2025, 17:10 CEST | Confirmed that the issue was caused by Scaleway, our infrastructure provider. |
| July 1st, 2025, 17:30 CEST | Restarted the Keyshare backend services to try to restore the database connections. |
| July 1st, 2025, 18:00 CEST | Started configuring a failover database server in case the issue was not fixed overnight. |
| July 1st, 2025, 23:25 CEST | Scaleway resolved the issue and restarted the services; our system recovered accordingly. |

See the full incident report from Scaleway for more details: Scaleway Status Page

Below are the metrics from the database server that show the CPU usage and Total Connections were spiking prior to the downtime.

Metrics from the database server

Customer impact

Users were unable to validate their PIN codes and access the Yivi app during the outage. The impact was significant: the app was unusable for approximately 6 hours.

Root cause and mitigation

Yivi uses Scaleway as its infrastructure provider, and the outage was caused by an issue in their data center AMS-1. The issue was due to abnormal temperatures in the data center, which led to Scaleway preemptively shutting down several services to prevent further issues such as hardware damage. Temperatures in the Amsterdam area were unusually high, and the data center's cooling system couldn't keep up, causing it to overheat.

note

An availability zone is a distinct location within a data center region, designed to be isolated from failures in other availability zones. This setup allows for high availability and fault tolerance.

Yivi uses a multi-availability zone setup, which means that if one availability zone goes down, another availability zone can take over. We use this for our Kubernetes cluster, which is responsible for running the Yivi backend services, including the Keyshare backend services, several issuers, the portal, etc. This setup allows us to ensure high availability and fault tolerance for our services.

However, our database server, which is configured as a High Availability (HA) database within Scaleway, was located in the affected data center. This meant that when Scaleway shut down the services in that availability zone, our database server was also affected, leading to the outage. As a team we expected the failover database server to take over automatically, but this did not happen: apparently both of the racks that power our database server were affected by the outage.

From the Scaleway documentation:

> **Are my active and standby database nodes in a high-availability cluster hosted in the same data center?**
>
> In a high-availability cluster, active and hot standby nodes are indeed located in the same data center but in two separate racks. The idea is to offer the best performance to our users by reducing latency between active and hot standby nodes, as we use a sync replication process between the nodes.

Next steps

We think the best way to prevent this issue from happening again is to create an additional failover database server in a different availability zone, or even with a different infrastructure provider. This way, if one availability zone goes down, the failover can take over and the database server will still be available. This is not a native feature of the Scaleway Managed Database service, so we will have to implement it ourselves. Our Kubernetes cluster is already set up to use multiple availability zones, and during this incident the two other nodes in the cluster were still operational, servicing traffic and requests; the outage was caused solely by the unavailable database server.

Next to that, we identified the following next steps as well:

  • If one of the nodes powering the Kubernetes cluster goes down, the other nodes will still be operational. However, if we then want to change the configuration of, for instance, the keyshare server to point to a different database, we can't do that with Terraform, since it blocks on shutting down pods that run on unavailable nodes. That leaves us no option other than manually changing the configuration of the keyshare server, which is not ideal.
  • The Scaleway Managed Database service allows for a read-only replica which can be promoted to a primary database server. However, this is not done automatically in case of an outage, so we will have to implement this ourselves as well. This means monitoring the database server and, if it goes down, promoting the read-only replica to primary.
  • Our blog also runs on the same Scaleway infrastructure, and we couldn't update it because the Terraform changes were blocked by the unavailable nodes in the Kubernetes cluster. As a result, we couldn't use the blog to inform users about the outage and the progress we were making to resolve it. We will have to implement a way to update our blog during an outage so that we can keep our users informed.
  • Our public website needs an incident page that informs users about an outage and the progress we are making to resolve it. This page should be accessible even if the blog is down, so we will have to implement this as well.
  • We have no mechanism in place to inform users in the Yivi app about an outage or scheduled maintenance. We will have to implement one, so that users are aware of the situation and can take appropriate action.
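The replica-promotion step could be driven by a simple health-check watchdog. The sketch below is only an illustration of the shape of such a tool: `isPrimaryHealthy` and `promoteReplica` are placeholder callbacks (with Scaleway they would go through their API or CLI), and real tooling would also need guards against split-brain and alerting around the promotion.

```javascript
// Watchdog sketch: poll the primary database and promote the read-only
// replica after several consecutive failed health checks. The two
// callbacks are placeholders, not real Scaleway API calls.
async function watchAndPromote({ isPrimaryHealthy, promoteReplica, maxFailures = 3, intervalMs = 10_000 }) {
  let failures = 0;
  for (;;) {
    const healthy = await isPrimaryHealthy();
    failures = healthy ? 0 : failures + 1;
    if (failures >= maxFailures) {
      // Require several consecutive failures so a transient blip
      // does not trigger an unnecessary (and disruptive) promotion.
      await promoteReplica();
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The consecutive-failure threshold is the important design choice here: promoting on a single failed check risks flapping, while too high a threshold extends the outage window.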