On March 13th we experienced an incident lasting 7 hours (8:00 UTC - 15:00 UTC) during which users were unable to create new clusters. Existing clusters were unaffected, and our established SLA remained intact. We apologise for any inconvenience and, in line with our transparency principle, want to share some details about what happened and how we plan to prevent similar issues in the future.
At DoubleCloud, data security is our top priority. To encrypt data in transit, we maintain TLS certificates for all clusters and rely on Let’s Encrypt as our Certificate Authority. However, because Let’s Encrypt is a free, nonprofit service, it does not offer a Service Level Agreement (SLA) and imposes strict quotas on certificate issuance.
When a cluster is created, a dedicated certificate is provisioned exclusively for it. After regularly expanding our test suites to improve platform quality, we began creating 10x more clusters for testing purposes. As a consequence, we exceeded Let’s Encrypt’s weekly quota on the number of certificates, which made it impossible to create new clusters. We regret to report that 7 customers were impacted. We promptly provided support to assist them, and once the issue was fixed their clusters were created successfully.
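To illustrate how this kind of quota exhaustion could be guarded against, here is a minimal Python sketch, not a description of our production code. It counts issuances in a rolling seven-day window and holds back part of the quota so that test traffic cannot consume the share needed for customer clusters. The class name, the method names and the numbers used with it are hypothetical, and the limits are not Let’s Encrypt’s actual values.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class WeeklyQuotaTracker:
    """Counts certificate issuances in a rolling 7-day window (illustrative only)."""

    def __init__(self, weekly_limit, reserved_for_production):
        if not 0 <= reserved_for_production <= weekly_limit:
            raise ValueError("reserve must be between 0 and the weekly limit")
        self.weekly_limit = weekly_limit            # hypothetical CA quota
        self.reserved = reserved_for_production     # share test clusters may not use
        self.window = timedelta(days=7)
        self._issued = deque()                      # issuance timestamps, oldest first

    def _expire(self, now):
        # Drop issuances that have left the rolling window.
        while self._issued and now - self._issued[0] >= self.window:
            self._issued.popleft()

    def used(self, now):
        self._expire(now)
        return len(self._issued)

    def try_issue(self, now, is_test):
        """Record an issuance and return True, or return False if over budget."""
        self._expire(now)
        # Test clusters may only use the part of the quota not held in reserve.
        limit = self.weekly_limit - self.reserved if is_test else self.weekly_limit
        if len(self._issued) >= limit:
            return False
        self._issued.append(now)
        return True


# Example: a hypothetical limit of 10 per week with 3 held back for production.
tracker = WeeklyQuotaTracker(weekly_limit=10, reserved_for_production=3)
now = datetime.now(timezone.utc)
allowed = tracker.try_issue(now, is_test=True)
```

With a guard like this, a test run that creates many clusters would be refused certificates before the shared quota is used up, while customer requests could still draw on the reserve. Calls are assumed to arrive in chronological order.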
To resolve the issue, we promptly decided to switch to ZeroSSL, which offers higher quotas and enhanced business support. To prevent similar issues in the future, we aim to implement further enhancements, such as diversifying our SSL provider dependencies and improving certificate pooling.
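The two planned enhancements can be sketched together in Python. This is a simplified illustration under stated assumptions, not our implementation: `StubProvider` is a hypothetical stand-in for a real CA client, the "certificates" are placeholder strings, and the sketch assumes certificates can be prepared ahead of time (for example, for pre-allocated hostnames). The pool keeps a small stock of ready certificates and, when one provider reports that its quota is exhausted, falls back to the next one.

```python
from collections import deque


class QuotaExceeded(Exception):
    """Raised when a provider cannot issue more certificates right now."""


class StubProvider:
    """Hypothetical stand-in for a CA client; issues placeholder strings."""

    def __init__(self, name, remaining):
        self.name = name
        self.remaining = remaining  # issuances left before the quota is hit

    def issue(self):
        if self.remaining <= 0:
            raise QuotaExceeded(self.name)
        self.remaining -= 1
        return f"cert-from-{self.name}"


class CertificatePool:
    """Keeps pre-issued certificates ready and falls back across providers."""

    def __init__(self, providers, target_size):
        self.providers = list(providers)  # tried in order of preference
        self.target_size = target_size
        self._ready = deque()

    def _issue_with_fallback(self):
        for provider in self.providers:
            try:
                return provider.issue()
            except QuotaExceeded:
                continue  # try the next provider
        raise QuotaExceeded("all providers exhausted")

    def refill(self):
        """Top up the pool; stop quietly if every provider is exhausted."""
        while len(self._ready) < self.target_size:
            try:
                self._ready.append(self._issue_with_fallback())
            except QuotaExceeded:
                break
        return len(self._ready)

    def acquire(self):
        """Hand out a pooled certificate, or issue one on demand if the pool is empty."""
        if self._ready:
            return self._ready.popleft()
        return self._issue_with_fallback()


# Example: the first provider runs out, so the pool draws on the second.
pool = CertificatePool(
    [StubProvider("primary", remaining=2), StubProvider("secondary", remaining=5)],
    target_size=3,
)
pool.refill()
certificate = pool.acquire()
```

In this design, cluster creation only fails when the pool is empty and every configured provider is out of quota at the same time, so a quota problem at a single provider no longer blocks cluster creation on its own.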