On March 18th, we experienced an incident that lasted just over an hour, from 10:39 UTC to 11:49 UTC. During this period, six customers encountered issues with their TLS connections to our managed ClickHouse clusters. We’re sorry for any inconvenience this might have caused.
In the spirit of transparency, we want to share what happened and outline the steps we're taking to prevent similar issues.
A few days before the incident, we started transitioning to a new SSL certificate provider, ZeroSSL. This change was prompted by another incident involving Let's Encrypt, our previous certificate service. Before switching to ZeroSSL on March 14th, we rigorously tested it at every available testing layer, though only for newly created clusters.
On March 18th, our automatic certificate renewal mechanism issued new ZeroSSL certificates to six clusters that had previously been using Let's Encrypt certificates. Since ClickHouse supports reloading TLS certificates dynamically, we don't restart clusters for these renewals. Usually, this is a silent background process that doesn't interfere with ClickHouse operations. In this case, however, we hit a bug in the upstream ClickHouse codebase that caused the certificate chain to be reloaded only partially.
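Because this reload is silent, it helps to be able to confirm from the outside that a running server has actually picked up a renewed certificate. The following is a minimal Python sketch of such a check (the hostname, port, and bundle path are placeholders, not our production tooling): it compares the SHA-256 fingerprint of the leaf certificate the server presents with the first certificate in the renewed bundle on disk.

```python
import socket
import ssl

from cryptography import x509
from cryptography.hazmat.primitives import hashes

HOST = "cluster.example.com"  # placeholder for the cluster endpoint
PORT = 9440                   # default ClickHouse secure native-protocol port
BUNDLE = "/etc/clickhouse-server/certs/fullchain.pem"  # placeholder path
END = b"-----END CERTIFICATE-----"


def served_leaf_fingerprint() -> bytes:
    """Fetch the leaf certificate the server presents and return its SHA-256 fingerprint."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # we only need the certificate here, not validation
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            der = tls.getpeercert(binary_form=True)
    return x509.load_der_x509_certificate(der).fingerprint(hashes.SHA256())


def on_disk_leaf_fingerprint() -> bytes:
    """Return the SHA-256 fingerprint of the first (leaf) certificate in the PEM bundle."""
    blob = open(BUNDLE, "rb").read()
    first_pem = blob.split(END, 1)[0] + END + b"\n"  # first certificate block only
    return x509.load_pem_x509_certificate(first_pem).fingerprint(hashes.SHA256())


if served_leaf_fingerprint() == on_disk_leaf_fingerprint():
    print("server is serving the renewed leaf certificate")
else:
    print("server still serves an old leaf certificate; the reload did not take effect")
```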
A certificate chain contains multiple layers of certificates, each verifying the authenticity of the next. During the incident, we discovered that ClickHouse had reloaded only the last certificate of the chain, mixing certificates in the process: the tail of the chain came from one TLS provider and the head from another.
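The invariant that was violated here is easy to state in code: in a well-formed bundle, each certificate's issuer must match the subject of the certificate that follows it, and the same holds for the chain a server presents during the handshake. Below is a minimal Python sketch of that name-level check against a PEM bundle (the path is a placeholder, not our production tooling); a bundle whose head and tail come from different providers fails it immediately.

```python
from cryptography import x509

END = b"-----END CERTIFICATE-----"


def load_bundle(path: str) -> list[x509.Certificate]:
    """Split a PEM bundle into individual certificates, preserving file order (leaf first)."""
    blob = open(path, "rb").read()
    return [
        x509.load_pem_x509_certificate(block.strip() + b"\n" + END + b"\n")
        for block in blob.split(END)
        if b"BEGIN CERTIFICATE" in block
    ]


def check_links(certs: list[x509.Certificate]) -> None:
    """Verify that each certificate was issued by the next one in the bundle."""
    for child, parent in zip(certs, certs[1:]):
        if child.issuer != parent.subject:
            raise ValueError(
                f"broken link: {child.subject.rfc4514_string()!r} "
                f"is not issued by {parent.subject.rfc4514_string()!r}"
            )


certs = load_bundle("/etc/clickhouse-server/certs/fullchain.pem")  # placeholder path
check_links(certs)
print("issuer/subject links are consistent across the bundle")
```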
A few minutes after the incident began, we halted our background renewal mechanism to limit the impact. After identifying the problem on the affected clusters, we manually restarted them to forcibly reload the SSL context in the running processes.
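After a forced restart like this, the quickest external confirmation that clients can validate the endpoint again is a plain verifying TLS handshake. Here is a minimal Python sketch (hostname and port are placeholders); during the incident, a handshake like this against an affected cluster would have failed certificate verification because of the mixed chain.

```python
import socket
import ssl

HOST = "cluster.example.com"  # placeholder for the cluster endpoint
PORT = 9440                   # default ClickHouse secure native-protocol port

ctx = ssl.create_default_context()  # verifies the served chain against the system trust store

try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            print("handshake OK:", tls.version())
except ssl.SSLCertVerificationError as err:
    # An incomplete or mixed chain typically surfaces here,
    # e.g. "unable to get local issuer certificate".
    print("certificate verification failed:", err.verify_message)
```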
Within two days, one of our software engineers delved into the codebase and prepared a PR fixing the issue for upcoming versions of ClickHouse. On March 27th, the PR was merged into the main branch.
We are unable to apply this patch on the fly to existing clusters, so we prepared other solutions: