# CelerData Cluster Upgrade
CelerData's Bring Your Own Cloud (BYOC) architecture is designed to minimize disruption during maintenance activities such as upgrades and patches. All system upgrades are implemented using a rolling upgrade strategy, ensuring a near-zero downtime experience for users.
We strongly recommend using 3 coordinator nodes for any production cluster to ensure high availability during all types of upgrades and normal operations.
This document provides answers to frequently asked questions about cluster upgrades: what happens during an upgrade, how long it takes, what impact to expect, and how to prepare.
## Types of Cluster Upgrades
There are two primary types of upgrades that may be performed on your cluster:
### CelerData Version Upgrade (Patch / Minor)
This upgrades the CelerData database engine to a newer version (e.g., 4.0.1 to 4.0.8) to apply bug fixes, performance improvements, or new features. The upgrade is performed in place on existing nodes: the service is briefly restarted with the updated version on each node sequentially. No new virtual machines are provisioned for this type of upgrade.
### AMI Upgrade (Infrastructure / Security Patching)
This upgrades the underlying machine image (AMI) to address operating system security vulnerabilities, kernel patches, or infrastructure updates. AMI upgrades involve provisioning new virtual machines with the updated image and then decommissioning the old ones, while preserving your cluster configuration and data.
For compute node clusters, AMI upgrades involve a kernel change, which means local cache on compute nodes will be lost during the upgrade. The cache will be rebuilt automatically as queries are executed after the upgrade.
## How the Upgrade Process Works
All CelerData cluster upgrades use a rolling upgrade approach. This process involves sequentially updating the various components of the cluster, including the individual cluster nodes. The upgrade sequence targets critical components in a controlled manner: compute (warehouse) nodes are upgraded first, followed by coordinator nodes. This staggered approach helps maintain system availability throughout the upgrade window.
### Coordinator Node Upgrade
#### Version Upgrade
- Each coordinator node is upgraded sequentially with the updated CelerData version. The service on each node is briefly restarted as part of this process.
- Follower coordinator nodes are upgraded first. The leader coordinator node is upgraded last, which triggers a brief leader election (typically a few seconds).
- Each coordinator node has a 60-second graceful shutdown period. During this time, the node stops accepting new connections. If active connections remain after 60 seconds, they will be forcefully terminated.
#### AMI Upgrade
- AMI upgrades use a scale-out then scale-in approach: new coordinator nodes are provisioned with the updated AMI first (e.g., from 3 to 6 nodes), then old nodes are decommissioned (back to 3).
- This approach means that even clusters with a single coordinator node experience zero downtime during AMI upgrades, since the new node is brought online before the old one is removed (1 → 2 → 1).
### Compute (Warehouse) Node Upgrade
#### Version Upgrade
- Each compute node is upgraded sequentially: only one node is restarted at a time to maintain query capacity and data availability.
- Each compute node has a 20-second graceful shutdown period before being restarted.
#### AMI Upgrade
- AMI upgrades provision new machines with the updated image, add them to the cluster, then decommission old machines one at a time.
- The same 20-second graceful shutdown period applies to each node being replaced.
## Minimizing Disruption with Graceful Shutdown
A key feature of the CelerData upgrade process is the graceful shutdown period. This mechanism minimizes disruption to ongoing user activity while nodes are taken offline for the upgrade:
- Compute (Warehouse) Nodes: A default graceful shutdown period of 20 seconds is applied. This brief window allows in-flight queries and processes to complete before the node is upgraded.
- Coordinator Nodes: Given their critical role in managing the cluster and handling incoming requests, coordinator nodes are assigned a longer default graceful shutdown period of 60 seconds to ensure a seamless handoff of operations.
## Automatic Tablet Rebalance During Upgrade
During version upgrades, tablet rebalance is automatically disabled to prevent unnecessary data movement while nodes are being restarted. You may observe a temporary tablet imbalance during and shortly after the upgrade. Rebalancing resumes automatically once the upgrade is complete, and the system will gradually equalize tablet distribution.
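If you want to watch the rebalance catch up from the client side, the snippet below polls tablet statistics through the MySQL-compatible interface. It is a minimal sketch: the endpoint and credentials are placeholders, and it assumes your CelerData version supports the StarRocks-style `SHOW PROC '/statistic'` statement (output columns vary by version, so rows are printed generically).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class TabletStatisticCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials for illustration.
        String url = "jdbc:mysql://<coordinator-endpoint>:9030";
        try (Connection conn = DriverManager.getConnection(url, "admin", "****");
             Statement stmt = conn.createStatement();
             // Reports per-database tablet statistics (including unhealthy
             // tablet counts) while rebalancing equalizes distribution.
             ResultSet rs = stmt.executeQuery("SHOW PROC '/statistic'")) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    row.append(md.getColumnName(i)).append('=')
                       .append(rs.getString(i)).append(' ');
                }
                System.out.println(row.toString().trim());
            }
        }
    }
}
```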
## Expected Impact During Upgrade
The following table summarizes the expected impact on various workloads during a cluster upgrade:
| Area | Impact | Recovery |
|---|---|---|
| Active Queries | Queries hitting a node being restarted will fail. Queries on other nodes are unaffected. | Automatic upon client retry. Queries are routed to healthy nodes after reconnection. |
| Client Connections | Existing connections to a restarting coordinator node will be disconnected after the 60-second graceful shutdown period. | Clients with connection pool or retry logic will reconnect to an available coordinator node within seconds. |
| Data Ingestion (Stream Load) | Active stream load jobs targeting a restarting node may fail. | Should be retried by the ingestion client (e.g., Flink/Kafka connector). Brief interruption only. |
| Routine Load | Routine Load tasks on a restarting compute node are temporarily interrupted. | Automatically resumed once the node is back online. No manual intervention required. |
| Long-Running Queries (>15s) | Higher chance of being terminated if running on a node being restarted. | Must be retried by the client application. |
| Compute Node Local Cache (AMI upgrade only) | Local cache on compute nodes is lost during AMI upgrades due to kernel change. | Cache rebuilds automatically as subsequent queries are executed. No action required. |
| Tablet Balance (version upgrade) | Tablet rebalance is auto-disabled during upgrade. Temporary imbalance is expected. | Rebalancing resumes automatically after upgrade completes. |
| Data Integrity | No impact. Data is not affected by the upgrade process. | N/A. All data remains intact. |
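As the table notes, recovery for failed queries and dropped connections relies on client-side retries. The sketch below shows that pattern over plain JDBC; the endpoint, credentials, table name, retry count, and backoff are illustrative placeholders, not CelerData-specific APIs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RetryingQueryDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials; CelerData speaks the MySQL protocol.
        String url = "jdbc:mysql://<coordinator-endpoint>:9030/demo_db";
        String sql = "SELECT COUNT(*) FROM demo_table";
        SQLException last = null;
        for (int attempt = 1; attempt <= 3; attempt++) {
            // A node restarting mid-upgrade surfaces as a SQLException here;
            // reconnecting routes the query to a healthy coordinator.
            try (Connection conn = DriverManager.getConnection(url, "user", "****");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                if (rs.next()) {
                    System.out.println("count = " + rs.getLong(1));
                }
                return;
            } catch (SQLException e) {
                last = e;
                Thread.sleep(2000L * attempt); // simple linear backoff
            }
        }
        throw last; // all attempts exhausted
    }
}
```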
## How Long Does an Upgrade Take?
The upgrade process typically takes approximately 1–2 minutes per node, so the total time for a complete cluster upgrade depends directly on the number of nodes in the cluster. Larger clusters have longer, though still predictable, upgrade windows.
| Cluster Size | Upgrade Type | Estimated Duration |
|---|---|---|
| Small (3 Coordinator + 3 Compute) | Version Upgrade | 5–10 minutes |
| Small (3 Coordinator + 3 Compute) | AMI Upgrade | 15–20 minutes |
| Medium (3 Coordinator + 5–10 Compute) | Version or AMI Upgrade | 15–30 minutes |
| Large (3 Coordinator + 10+ Compute) | Version or AMI Upgrade | 30–60 minutes |
We recommend scheduling a maintenance window of approximately 1 hour to allow for the upgrade plus post-upgrade verification.
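For the verification step, a quick client-side check is usually enough: confirm the engine reports the new version and that all nodes rejoined the cluster. This is a minimal sketch assuming the StarRocks-compatible `SELECT current_version()` and `SHOW BACKENDS` statements are available in your deployment; the endpoint and credentials are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PostUpgradeCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials for illustration.
        String url = "jdbc:mysql://<coordinator-endpoint>:9030";
        try (Connection conn = DriverManager.getConnection(url, "admin", "****");
             Statement stmt = conn.createStatement()) {
            // 1. Confirm the engine is running the expected version.
            try (ResultSet rs = stmt.executeQuery("SELECT current_version()")) {
                if (rs.next()) {
                    System.out.println("engine version: " + rs.getString(1));
                }
            }
            // 2. Confirm every node rejoined after the rolling restart
            //    ("Alive" should be true for all rows).
            try (ResultSet rs = stmt.executeQuery("SHOW BACKENDS")) {
                while (rs.next()) {
                    System.out.println("node alive = " + rs.getString("Alive"));
                }
            }
        }
    }
}
```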
## Downtime Expectations
### Production Clusters with 3 Coordinator Nodes (Recommended)
With a 3-node coordinator setup, there is near-zero downtime for both upgrade types. While one coordinator node is being upgraded, the other two continue to serve queries. When the leader coordinator is upgraded, a brief leader election occurs (typically a few seconds). Clients with retry logic will experience only momentary connection blips.
### Clusters with 1 Coordinator Node
The downtime behavior for single-coordinator clusters differs by upgrade type:
AMI Upgrade: Zero downtime. The AMI upgrade uses a scale-out/scale-in approach (1 → 2 → 1), so a new coordinator node is brought online before the old one is removed. The cluster remains available throughout.
Version Upgrade: There will be a period of complete unavailability while the single coordinator node is restarted with the new version (typically 1–3 minutes). During this window, no queries can be served.
This is why we strongly recommend using 3 coordinator nodes for any production cluster.
### Compute Node Downtime
Individual compute nodes experience approximately 20–30 seconds of downtime during their restart. Since nodes are upgraded one at a time, the cluster as a whole continues to serve queries through the remaining healthy nodes.
## Recommended Client Preparations
To minimize the impact of a cluster upgrade, we recommend the following preparations:
### Ensure Retry / Reconnect Logic
- Verify that your query clients (JDBC, MySQL client, application connection pools) have automatic retry and reconnection logic enabled; a pool configuration sketch follows this list.
- Most standard MySQL-compatible connection pools (HikariCP, etc.) handle transient disconnections automatically.
- If your application uses long-lived persistent connections, ensure they can tolerate a brief disconnection and reconnect.
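For example, a HikariCP pool can be tuned so that connections broken by a coordinator restart are validated and replaced transparently. This is a minimal sketch; the endpoint, credentials, and timeout values are illustrative assumptions, not recommended production settings.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {
    public static HikariDataSource buildPool() {
        HikariConfig cfg = new HikariConfig();
        // Placeholder endpoint; CelerData exposes a MySQL-compatible interface.
        cfg.setJdbcUrl("jdbc:mysql://<coordinator-endpoint>:9030/demo_db"
                + "?connectTimeout=3000&socketTimeout=60000");
        cfg.setUsername("user");
        cfg.setPassword("****");
        cfg.setMaximumPoolSize(10);
        // Recycle connections periodically so the pool drifts back onto
        // healthy coordinators during and after a rolling restart.
        cfg.setMaxLifetime(300_000);      // 5 minutes
        cfg.setConnectionTimeout(10_000); // fail fast and let the app retry
        // Validate connections on checkout so one killed during a coordinator
        // restart is discarded and replaced instead of handed to the app.
        cfg.setConnectionTestQuery("SELECT 1");
        return new HikariDataSource(cfg);
    }
}
```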
### Schedule a Maintenance Window
- Coordinate with the CelerData team to schedule the upgrade during a low-traffic period or an established maintenance window.
- We recommend choosing a time when query volume and ingestion load are at their lowest.
- Avoid scheduling upgrades immediately before weekends, holidays, or critical business events.
### Pause or Reduce Ingestion (Optional but Recommended)
- If possible, temporarily pause or reduce data ingestion (Stream Load, Routine Load, Kafka connectors) before the upgrade begins, as shown in the sketch after this list.
- This reduces the chance of failed ingestion jobs and minimizes traffic during the rolling restart.
- Ingestion can be resumed immediately after the upgrade is confirmed complete.
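If you use Routine Load, pausing and resuming around the window can be scripted. A minimal sketch, assuming a Routine Load job named `demo_job` in `demo_db` (both placeholders) and the StarRocks-compatible `PAUSE ROUTINE LOAD` / `RESUME ROUTINE LOAD` statements:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IngestionWindow {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint, credentials, and job name.
        String url = "jdbc:mysql://<coordinator-endpoint>:9030/demo_db";
        try (Connection conn = DriverManager.getConnection(url, "admin", "****");
             Statement stmt = conn.createStatement()) {
            // Stop consuming from the source before the rolling restart begins.
            stmt.execute("PAUSE ROUTINE LOAD FOR demo_job");
            System.out.println("ingestion paused; run the upgrade now");
            // ... upgrade window ...
            // Resume once the upgrade is confirmed complete.
            stmt.execute("RESUME ROUTINE LOAD FOR demo_job");
        }
    }
}
```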
### Verify Compatibility (for Major Version Upgrades)
- Some major version upgrades may require underlying runtime updates (e.g., JDK version changes). The CelerData team handles these updates as part of the upgrade process.
- Clients should be aware this may add a few extra seconds to each node restart, but no action is required on the client side.
## Proactive Incident Management & Rollback
In the rare event that an upgrade process encounters an issue and becomes stuck, the CelerData team is proactively notified through automated monitoring systems. Immediate involvement from the team ensures rapid diagnosis and resolution. Remedial actions may include:
- Rolling back the upgrade: reverting the system to the previous stable state to restore full functionality immediately.
- Fixing the ongoing issue: applying a hotfix or resolving the underlying problem to allow the upgrade to complete successfully.
- For AMI upgrades, old machine images are retained and can be used for rapid fallback.
Not all version downgrades are safe. Some versions introduce metadata changes that are not backward-compatible. The CelerData team will always verify downgrade compatibility before proceeding with any upgrade.
## Quick Reference Summary
| Question | Answer |
|---|---|
| Is there a full outage? | No. Rolling upgrade: nodes are upgraded one at a time. |
| Will my queries fail? | Queries hitting a restarting node may fail. Retry logic handles this automatically. |
| How long is the total upgrade? | ~1–2 minutes per node. Typically 5–60 minutes total depending on cluster size. |
| Graceful shutdown per node? | Coordinator: 60 seconds. Compute: 20 seconds. |
| Is my data safe? | Yes. No data loss or corruption during upgrades. |
| Do I need to do anything? | Ensure retry logic is in place. Optionally pause ingestion. Agree on a maintenance window. |
| Can we roll back? | Yes, with pre-verified rollback plans. Some version constraints may apply. |
| 3 vs 1 coordinator node? | 3 nodes: near-zero downtime for all upgrades. 1 node: zero downtime for AMI, 1–3 min for version upgrade. |
| Will compute node cache be lost? | Only during AMI upgrades (kernel change). Cache rebuilds automatically. |
## Contact & Support
For questions about scheduling an upgrade or understanding the impact on your specific environment, please reach out to your CelerData Solutions Architect or contact us through your dedicated Slack channel.
We are happy to arrange a joint monitoring session during the upgrade to ensure everything proceeds smoothly.