Let's Encrypt quota is exhausted, unable to create new clusters

Incident Report for DoubleCloud

Postmortem

On March 13th we experienced an incident lasting 7 hours (8:00 UTC - 15:00 UTC) during which users were unable to create new clusters. Existing clusters were unaffected, and our established SLA remained intact. We apologise for any inconvenience and, in line with our transparency principle, want to share some details about what happened and how we plan to prevent similar issues in the future.

At DoubleCloud, safeguarding data security is our top priority. To encrypt data in transit, we maintain TLS certificates for all clusters, relying on Let’s Encrypt as our Certificate Authority. However, because Let’s Encrypt operates as a free, nonprofit entity, it does not offer a Service Level Agreement (SLA) and imposes strict quotas on certificate issuance.

Upon cluster creation, a dedicated certificate is provisioned exclusively for the cluster. After regularly expanding our test suites to fortify platform quality, we started creating 10x more clusters for testing purposes and, as a consequence, exceeded Let’s Encrypt’s weekly quota on the number of certificates, which made it impossible to create new clusters. We regret to report that 7 customers were impacted. However, we promptly provided high-grade support to assist them, and once we fixed the issue their clusters were created successfully.
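To illustrate how a weekly issuance quota can be exhausted silently by test traffic, here is a minimal sketch of a sliding-window counter that could sit in front of certificate requests. It is not our production code: the class name `WeeklyQuotaTracker`, its methods, and the `limit` value are hypothetical, and the actual Let’s Encrypt limits should be taken from their documentation.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class WeeklyQuotaTracker:
    """Counts certificate issuances inside a sliding window (7 days by default).

    Hypothetical helper for illustration; `limit` is whatever quota the
    certificate authority enforces and is supplied by the caller.
    """

    def __init__(self, limit, window=timedelta(days=7)):
        self.limit = limit
        self.window = window
        self._issued = deque()  # issuance timestamps, oldest first

    def _prune(self, now):
        # Drop issuances that have aged out of the window.
        while self._issued and now - self._issued[0] >= self.window:
            self._issued.popleft()

    def remaining(self, now):
        """Return how many more certificates may be issued at time `now`."""
        self._prune(now)
        return self.limit - len(self._issued)

    def try_issue(self, now):
        """Record an issuance if quota allows; return False when exhausted."""
        if self.remaining(now) <= 0:
            return False
        self._issued.append(now)
        return True
```

A caller would check `remaining(datetime.now(timezone.utc))` before each request and raise an alert well before it reaches zero, so that a 10x jump in test clusters shows up as a warning rather than as failed cluster creation.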

To rectify this issue, we promptly decided to transition to ZeroSSL, which offers higher quotas and enhanced business support. To prevent similar issues in the future, we aim to implement further enhancements, such as diversifying our SSL provider dependencies and improving certificate pooling.
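The two planned enhancements, provider diversification and certificate pooling, can be sketched together as follows. This is only an illustration under stated assumptions: `CertificateSource`, `QuotaExceeded`, and the provider callables are hypothetical names, and the sketch assumes a pooled certificate can be assigned to any new cluster, which this report does not specify.

```python
from collections import deque


class QuotaExceeded(Exception):
    """Raised by a provider callable when its issuance quota is exhausted."""


class CertificateSource:
    """Serves certificates from a pre-issued pool, with provider fallback.

    `providers` is a list of (name, issue_fn) pairs in priority order; each
    issue_fn is a hypothetical callable that returns a certificate or raises
    QuotaExceeded.
    """

    def __init__(self, providers, pool_size):
        self.providers = providers
        self.pool_size = pool_size
        self.pool = deque()

    def _issue(self):
        # Try each provider in order and fall through on quota errors.
        for _name, issue in self.providers:
            try:
                return issue()
            except QuotaExceeded:
                continue
        raise QuotaExceeded("all providers exhausted")

    def refill(self):
        """Top the pool up to pool_size, e.g. from a background job."""
        while len(self.pool) < self.pool_size:
            self.pool.append(self._issue())

    def acquire(self):
        """Return a pooled certificate, or issue one on demand if empty."""
        if self.pool:
            return self.pool.popleft()
        return self._issue()
```

With a pool in place, a quota outage at one provider delays refills rather than blocking cluster creation, and the fallback order lets a second provider absorb requests while the first one recovers.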

Posted Mar 22, 2024 - 09:24 CET

Resolved

This incident has been resolved.

Posted Mar 13, 2024 - 17:38 CET

Monitoring

The fix is now deployed and we are monitoring its state.

Posted Mar 13, 2024 - 16:56 CET

Update

We are rolling out the fix in our environment. We will inform you promptly.

Posted Mar 13, 2024 - 15:12 CET

Update

We are implementing a fix and proceeding to the testing phase.

Posted Mar 13, 2024 - 12:40 CET

Identified

We have identified the issue and are in the process of applying a mitigation. We will inform you promptly.

Posted Mar 13, 2024 - 11:15 CET

Update

The issue is persisting, and we are actively looking into it.

Posted Mar 13, 2024 - 10:32 CET

Investigating

We are currently investigating this issue.

Posted Mar 13, 2024 - 10:25 CET

This incident affected: API.