On March 18th, we experienced an incident that lasted just over an hour, from 10:39 UTC to 11:49 UTC. During this period, six customers encountered issues with their TLS connections to our managed ClickHouse clusters. We’re sorry for any inconvenience this might have caused.
In the spirit of transparency, we want to share what happened and outline the steps we're taking to prevent similar issues.
A few days before the incident, we started transitioning to a new SSL certificate provider, ZeroSSL. This change was prompted by another incident involving Let's Encrypt, our previous certificate service. Before switching to ZeroSSL on March 14th, we rigorously tested it at every available testing layer, though only for newly created clusters.
On March 18th, our automatic certificate renewal mechanism issued new ZeroSSL certificates to six clusters that had previously been using Let's Encrypt certificates. Since ClickHouse supports reloading TLS certificates dynamically, we don't restart clusters for these renewals. Usually, this is a silent background process that doesn't interfere with ClickHouse operations. In this case, however, we hit a bug in the upstream ClickHouse codebase that caused the certificate chain to be reloaded only partially.
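Because this reload is silent, it helps to be able to confirm from the outside that a running server has actually picked up a renewed certificate. The following is a minimal Python sketch of such a check (the hostname, port, and bundle path are placeholders, not our production tooling): it compares the SHA-256 fingerprint of the leaf certificate the server presents with the first certificate in the renewed bundle on disk.

```python
import socket
import ssl

from cryptography import x509
from cryptography.hazmat.primitives import hashes

HOST = "cluster.example.com"  # placeholder for the cluster endpoint
PORT = 9440                   # default ClickHouse secure native-protocol port
BUNDLE = "/etc/clickhouse-server/certs/fullchain.pem"  # placeholder path
END = b"-----END CERTIFICATE-----"


def served_leaf_fingerprint() -> bytes:
    """Fetch the leaf certificate the server presents and return its SHA-256 fingerprint."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # we only need the certificate here, not validation
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            der = tls.getpeercert(binary_form=True)
    return x509.load_der_x509_certificate(der).fingerprint(hashes.SHA256())


def on_disk_leaf_fingerprint() -> bytes:
    """Return the SHA-256 fingerprint of the first (leaf) certificate in the PEM bundle."""
    blob = open(BUNDLE, "rb").read()
    first_pem = blob.split(END, 1)[0] + END + b"\n"  # first certificate block only
    return x509.load_pem_x509_certificate(first_pem).fingerprint(hashes.SHA256())


if served_leaf_fingerprint() == on_disk_leaf_fingerprint():
    print("server is serving the renewed leaf certificate")
else:
    print("server still serves an old leaf certificate; the reload did not take effect")
```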
A certificate chain contains multiple layers of certificates, each verifying the authenticity of the next. During the incident, we discovered that ClickHouse had reloaded only the last certificate of the chain, mixing certificates in the process: the tail of the chain came from one TLS provider and the head from another.
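The invariant that was violated here is easy to state in code: in a well-formed bundle, each certificate's issuer must match the subject of the certificate that follows it, and the same holds for the chain a server presents during the handshake. Below is a minimal Python sketch of that name-level check against a PEM bundle (the path is a placeholder, not our production tooling); a bundle whose head and tail come from different providers fails it immediately.

```python
from cryptography import x509

END = b"-----END CERTIFICATE-----"


def load_bundle(path: str) -> list[x509.Certificate]:
    """Split a PEM bundle into individual certificates, preserving file order (leaf first)."""
    blob = open(path, "rb").read()
    return [
        x509.load_pem_x509_certificate(block.strip() + b"\n" + END + b"\n")
        for block in blob.split(END)
        if b"BEGIN CERTIFICATE" in block
    ]


def check_links(certs: list[x509.Certificate]) -> None:
    """Verify that each certificate was issued by the next one in the bundle."""
    for child, parent in zip(certs, certs[1:]):
        if child.issuer != parent.subject:
            raise ValueError(
                f"broken link: {child.subject.rfc4514_string()!r} "
                f"is not issued by {parent.subject.rfc4514_string()!r}"
            )


certs = load_bundle("/etc/clickhouse-server/certs/fullchain.pem")  # placeholder path
check_links(certs)
print("issuer/subject links are consistent across the bundle")
```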
A few minutes after the incident began, we halted our background renewal mechanism to limit the impact. After identifying the problem on the affected clusters, we manually restarted them to forcibly reload the SSL context in the running processes.
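After a forced restart like this, the quickest external confirmation that clients can validate the endpoint again is a plain verifying TLS handshake. Here is a minimal Python sketch (hostname and port are placeholders); during the incident, a handshake like this against an affected cluster would have failed certificate verification because of the mixed chain.

```python
import socket
import ssl

HOST = "cluster.example.com"  # placeholder for the cluster endpoint
PORT = 9440                   # default ClickHouse secure native-protocol port

ctx = ssl.create_default_context()  # verifies the served chain against the system trust store

try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            print("handshake OK:", tls.version())
except ssl.SSLCertVerificationError as err:
    # An incomplete or mixed chain typically surfaces here,
    # e.g. "unable to get local issuer certificate".
    print("certificate verification failed:", err.verify_message)
```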
Within two days, one of our software engineers delved into the codebase and prepared a PR fixing the issue for upcoming versions of ClickHouse. On March 27th, the PR was merged into the main branch.
We are unable to apply this patch on the fly to existing clusters, so we prepared other solutions: