The Railway engineering team experienced a production outage from 21:45 UTC to 23:24 UTC, caused by problems with Google's metadata server, which was upgraded as part of the GKE v1.25 rollout. The upgrade significantly delayed encrypt and decrypt requests, which impaired users' ability to deploy new workloads and update environment variables. The team identified the problem and resolved it by rolling out a newer GKE version that was not affected by the known issue, restoring all services to normal operation by 23:24 UTC. The incident highlighted the importance of careful planning and monitoring during significant infrastructure changes, and it produced several takeaways for improving internal coordination, messaging, re-shoring, and incident management.