# CelerData Cluster Upgrade
CelerData's Bring Your Own Cloud (BYOC) architecture is designed to minimize disruption during maintenance activities such as upgrades and patches. All system upgrades are implemented using a rolling upgrade strategy, ensuring a near-zero downtime experience for users.
We strongly recommend using 3 coordinator nodes for any production cluster to ensure high availability during all types of upgrades and normal operations.
This document provides answers to frequently asked questions about cluster upgrades: what happens during an upgrade, how long it takes, what impact to expect, and how to prepare.
## Types of Cluster Upgrades
There are two primary types of upgrades that may be performed on your cluster:
### CelerData Version Upgrade (Patch / Minor)
This upgrades the CelerData database engine to a newer version (e.g., 4.0.1 to 4.0.8) to apply bug fixes, performance improvements, or new features. The upgrade is performed in place on existing nodes: the service is briefly restarted with the updated version on each node sequentially. No new virtual machines are provisioned for this type of upgrade.
### AMI Upgrade (Infrastructure / Security Patching)
This upgrades the underlying machine image (AMI) to address operating system security vulnerabilities, kernel patches, or infrastructure updates. AMI upgrades involve provisioning new virtual machines with the updated image and then decommissioning the old ones, while preserving your cluster configuration and data.
For compute node clusters, AMI upgrades involve a kernel change, which means local cache on compute nodes will be lost during the upgrade. The cache will be rebuilt automatically as queries are executed after the upgrade.
## How the Upgrade Process Works
All CelerData cluster upgrades use a rolling upgrade approach. This process involves sequentially updating the various components of the cluster, including the individual cluster nodes. The upgrade sequence targets critical components in a controlled manner: compute (warehouse) nodes are upgraded first, followed by coordinator nodes. This staggered approach helps maintain system availability throughout the upgrade window.
### Coordinator Node Upgrade
#### Version Upgrade
- Each coordinator node is upgraded sequentially with the updated CelerData version. The service on each node is briefly restarted as part of this process.
- Follower coordinator nodes are upgraded first. The leader coordinator node is upgraded last, which triggers a brief leader election (typically a few seconds).
- Each coordinator node has a 60-second graceful shutdown period. During this time, the node stops accepting new connections. If active connections remain after 60 seconds, they will be forcefully terminated.
#### AMI Upgrade
- AMI upgrades use a scale-out then scale-in approach: new coordinator nodes are provisioned with the updated AMI first (e.g., from 3 to 6 nodes), then old nodes are decommissioned (back to 3).
- This approach means that even clusters with a single coordinator node experience zero downtime during AMI upgrades, since the new node is brought online before the old one is removed (1 → 2 → 1).
### Compute (Warehouse) Node Upgrade
#### Version Upgrade
- Each compute node is upgraded sequentially: only one node is restarted at a time to maintain query capacity and data availability.
- Each compute node has a 20-second graceful shutdown period before being restarted.
#### AMI Upgrade
- AMI upgrades provision new machines with the updated image, add them to the cluster, then decommission old machines one at a time.
- The same 20-second graceful shutdown period applies to each node being replaced.
## Minimizing Disruption with Graceful Shutdown
A key feature of the CelerData upgrade process is the graceful shutdown period. This mechanism minimizes disruption to ongoing user activity while nodes are taken offline for the upgrade:
- Compute (Warehouse) Nodes: A default graceful shutdown period of 20 seconds is applied. This brief window allows in-flight queries and processes to complete before the node is upgraded.
- Coordinator Nodes: Given their critical role in managing the cluster and handling incoming requests, coordinator nodes are assigned a longer default graceful shutdown period of 60 seconds to ensure a seamless handoff of operations.
## Automatic Tablet Rebalance During Upgrade
During version upgrades, tablet rebalance is automatically disabled to prevent unnecessary data movement while nodes are being restarted. You may observe a temporary tablet imbalance during and shortly after the upgrade. Rebalancing resumes automatically once the upgrade is complete, and the system will gradually equalize tablet distribution.
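If you want to watch the rebalance catch up from the client side, the snippet below polls tablet statistics through the MySQL-compatible interface. It is a minimal sketch: the endpoint and credentials are placeholders, and it assumes your CelerData version supports the StarRocks-style `SHOW PROC '/statistic'` statement (output columns vary by version, so rows are printed generically).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class TabletStatisticCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials for illustration.
        String url = "jdbc:mysql://<coordinator-endpoint>:9030";
        try (Connection conn = DriverManager.getConnection(url, "admin", "****");
             Statement stmt = conn.createStatement();
             // Reports per-database tablet statistics (including unhealthy
             // tablet counts) while rebalancing equalizes distribution.
             ResultSet rs = stmt.executeQuery("SHOW PROC '/statistic'")) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    row.append(md.getColumnName(i)).append('=')
                       .append(rs.getString(i)).append(' ');
                }
                System.out.println(row.toString().trim());
            }
        }
    }
}
```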
## Expected Impact During Upgrade
The following table summarizes the expected impact on various workloads during a cluster upgrade:
| Area | Impact | Recovery |
|---|---|---|
| Active Queries | Queries hitting a node being restarted will fail. Queries on other nodes are unaffected. | Automatic upon client retry. Queries are routed to healthy nodes after reconnection. |
| Client Connections | Existing connections to a restarting coordinator node will be disconnected after the 60-second graceful shutdown period. | Clients with connection pool or retry logic will reconnect to an available coordinator node within seconds. |
| Data Ingestion (Stream Load) | Active stream load jobs targeting a restarting node may fail. | Should be retried by the ingestion client (e.g., Flink/Kafka connector). Brief interruption only. |
| Routine Load | Routine Load tasks on a restarting compute node are temporarily interrupted. | Automatically resumed once the node is back online. No manual intervention required. |
| Long-Running Queries (>15s) | Higher chance of being terminated if running on a node being restarted. | Must be retried by the client application. |
| Compute Node Local Cache (AMI upgrade only) | Local cache on compute nodes is lost during AMI upgrades due to kernel change. | Cache rebuilds automatically as subsequent queries are executed. No action required. |
| Tablet Balance (version upgrade) | Tablet rebalance is auto-disabled during upgrade. Temporary imbalance is expected. | Rebalancing resumes automatically after upgrade completes. |
| Data Integrity | No impact. Data is not affected by the upgrade process. | N/A. All data remains intact. |
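As the table notes, recovery for failed queries and dropped connections relies on client-side retries. The sketch below shows that pattern over plain JDBC; the endpoint, credentials, table name, retry count, and backoff are illustrative placeholders, not CelerData-specific APIs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RetryingQueryDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials; CelerData speaks the MySQL protocol.
        String url = "jdbc:mysql://<coordinator-endpoint>:9030/demo_db";
        String sql = "SELECT COUNT(*) FROM demo_table";
        SQLException last = null;
        for (int attempt = 1; attempt <= 3; attempt++) {
            // A node restarting mid-upgrade surfaces as a SQLException here;
            // reconnecting routes the query to a healthy coordinator.
            try (Connection conn = DriverManager.getConnection(url, "user", "****");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                if (rs.next()) {
                    System.out.println("count = " + rs.getLong(1));
                }
                return;
            } catch (SQLException e) {
                last = e;
                Thread.sleep(2000L * attempt); // simple linear backoff
            }
        }
        throw last; // all attempts exhausted
    }
}
```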
## How Long Does an Upgrade Take?
The upgrade process typically takes approximately 1–2 minutes per node, so the total time for a complete cluster upgrade depends directly on the number of nodes in the cluster. Larger clusters have longer, though still predictable, upgrade windows.
| Cluster Size | Upgrade Type | Estimated Duration |
|---|---|---|
| Small (3 Coordinator + 3 Compute) | Version Upgrade | 5–10 minutes |
| Small (3 Coordinator + 3 Compute) | AMI Upgrade | 15–20 minutes |
| Medium (3 Coordinator + 5–10 Compute) | Version or AMI Upgrade | 15–30 minutes |
| Large (3 Coordinator + 10+ Compute) | Version or AMI Upgrade | 30–60 minutes |
We recommend scheduling a maintenance window of approximately 1 hour to allow for the upgrade plus post-upgrade verification.
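For the verification step, a quick client-side check is usually enough: confirm the engine reports the new version and that all nodes rejoined the cluster. This is a minimal sketch assuming the StarRocks-compatible `SELECT current_version()` and `SHOW BACKENDS` statements are available in your deployment; the endpoint and credentials are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PostUpgradeCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials for illustration.
        String url = "jdbc:mysql://<coordinator-endpoint>:9030";
        try (Connection conn = DriverManager.getConnection(url, "admin", "****");
             Statement stmt = conn.createStatement()) {
            // 1. Confirm the engine is running the expected version.
            try (ResultSet rs = stmt.executeQuery("SELECT current_version()")) {
                if (rs.next()) {
                    System.out.println("engine version: " + rs.getString(1));
                }
            }
            // 2. Confirm every node rejoined after the rolling restart
            //    ("Alive" should be true for all rows).
            try (ResultSet rs = stmt.executeQuery("SHOW BACKENDS")) {
                while (rs.next()) {
                    System.out.println("node alive = " + rs.getString("Alive"));
                }
            }
        }
    }
}
```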
## Downtime Expectations
### Production Clusters with 3 Coordinator Nodes (Recommended)
With a 3-node coordinator setup, there is near-zero downtime for both upgrade types. While one coordinator node is being upgraded, the other two continue to serve queries. When the leader coordinator is upgraded, a brief leader election occurs (typically a few seconds). Clients with retry logic will experience only momentary connection blips.
### Clusters with 1 Coordinator Node
The downtime behavior for single-coordinator clusters differs by upgrade type:
AMI Upgrade: Zero downtime. The AMI upgrade uses a scale-out/scale-in approach (1 → 2 → 1), so a new coordinator node is brought online before the old one is removed. The cluster remains available throughout.
Version Upgrade: There will be a period of complete unavailability while the single coordinator node is restarted with the new version (typically 1–3 minutes). During this window, no queries can be served.
This is why we strongly recommend using 3 coordinator nodes for any production cluster.
### Compute Node Downtime
Individual compute nodes experience approximately 20–30 seconds of downtime during their restart. Since nodes are upgraded one at a time, the cluster as a whole continues to serve queries through the remaining healthy nodes.
## Recommended Client Preparations
To minimize the impact of a cluster upgrade, we recommend the following preparations:
### Ensure Retry / Reconnect Logic
- Verify that your query clients (JDBC, MySQL client, application connection pools) have automatic retry and reconnection logic enabled; a pool configuration sketch follows this list.
- Most standard MySQL-compatible connection pools (HikariCP, etc.) handle transient disconnections automatically.
- If your application uses long-lived persistent connections, ensure they can tolerate a brief disconnection and reconnect.
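For example, a HikariCP pool can be tuned so that connections broken by a coordinator restart are validated and replaced transparently. This is a minimal sketch; the endpoint, credentials, and timeout values are illustrative assumptions, not recommended production settings.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {
    public static HikariDataSource buildPool() {
        HikariConfig cfg = new HikariConfig();
        // Placeholder endpoint; CelerData exposes a MySQL-compatible interface.
        cfg.setJdbcUrl("jdbc:mysql://<coordinator-endpoint>:9030/demo_db"
                + "?connectTimeout=3000&socketTimeout=60000");
        cfg.setUsername("user");
        cfg.setPassword("****");
        cfg.setMaximumPoolSize(10);
        // Recycle connections periodically so the pool drifts back onto
        // healthy coordinators during and after a rolling restart.
        cfg.setMaxLifetime(300_000);      // 5 minutes
        cfg.setConnectionTimeout(10_000); // fail fast and let the app retry
        // Validate connections on checkout so one killed during a coordinator
        // restart is discarded and replaced instead of handed to the app.
        cfg.setConnectionTestQuery("SELECT 1");
        return new HikariDataSource(cfg);
    }
}
```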
### Schedule a Maintenance Window
- Coordinate with the CelerData team to schedule the upgrade during a low-traffic period or an established maintenance window.
- We recommend choosing a time when query volume and ingestion load are at their lowest.
- Avoid scheduling upgrades immediately before weekends, holidays, or critical business events.
### Pause or Reduce Ingestion (Optional but Recommended)
- If possible, temporarily pause or reduce data ingestion (Stream Load, Routine Load, Kafka connectors) before the upgrade begins, as shown in the sketch after this list.
- This reduces the chance of failed ingestion jobs and minimizes traffic during the rolling restart.
- Ingestion can be resumed immediately after the upgrade is confirmed complete.
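If you use Routine Load, pausing and resuming around the window can be scripted. A minimal sketch, assuming a Routine Load job named `demo_job` in `demo_db` (both placeholders) and the StarRocks-compatible `PAUSE ROUTINE LOAD` / `RESUME ROUTINE LOAD` statements:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IngestionWindow {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint, credentials, and job name.
        String url = "jdbc:mysql://<coordinator-endpoint>:9030/demo_db";
        try (Connection conn = DriverManager.getConnection(url, "admin", "****");
             Statement stmt = conn.createStatement()) {
            // Stop consuming from the source before the rolling restart begins.
            stmt.execute("PAUSE ROUTINE LOAD FOR demo_job");
            System.out.println("ingestion paused; run the upgrade now");
            // ... upgrade window ...
            // Resume once the upgrade is confirmed complete.
            stmt.execute("RESUME ROUTINE LOAD FOR demo_job");
        }
    }
}
```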
### Verify Compatibility (for Major Version Upgrades)
- Some major version upgrades may require underlying runtime updates (e.g., JDK version changes). The CelerData team handles these updates as part of the upgrade process.
- Clients should be aware this may add a few extra seconds to each node restart, but no action is required on the client side.
## Proactive Incident Management & Rollback
In the rare event that an upgrade process encounters an issue and becomes stuck, the CelerData team is proactively notified through automated monitoring systems. Immediate involvement from the team ensures rapid diagnosis and resolution. Remedial actions may include:
- Rolling back the upgrade: reverting the system to the previous stable state to restore full functionality immediately.
- Fixing the ongoing issue: applying a hotfix or resolving the underlying problem to allow the upgrade to complete successfully.
- For AMI upgrades, old machine images are retained and can be used for rapid fallback.
Not all version downgrades are safe. Some versions introduce metadata changes that are not backward-compatible. The CelerData team will always verify downgrade compatibility before proceeding with any upgrade.
## Quick Reference Summary
| Question | Answer |
|---|---|
| Is there a full outage? | No. Rolling upgrade: nodes are upgraded one at a time. |
| Will my queries fail? | Queries hitting a restarting node may fail. Retry logic handles this automatically. |
| How long is the total upgrade? | ~1–2 minutes per node. Typically 5–60 minutes total depending on cluster size. |
| Graceful shutdown per node? | Coordinator: 60 seconds. Compute: 20 seconds. |
| Is my data safe? | Yes. No data loss or corruption during upgrades. |
| Do I need to do anything? | Ensure retry logic is in place. Optionally pause ingestion. Agree on a maintenance window. |
| Can we roll back? | Yes, with pre-verified rollback plans. Some version constraints may apply. |
| 3 vs 1 coordinator node? | 3 nodes: near-zero downtime for all upgrades. 1 node: zero downtime for AMI, 1–3 min for version upgrade. |
| Will compute node cache be lost? | Only during AMI upgrades (kernel change). Cache rebuilds automatically. |
## Contact & Support
For questions about scheduling an upgrade or understanding the impact on your specific environment, please reach out to your CelerData Solutions Architect or contact us through your dedicated Slack channel.
We are happy to arrange a joint monitoring session during the upgrade to ensure everything proceeds smoothly.