Unified catalog
A unified catalog is a type of external catalog that is provided by CelerData to handle tables from Apache Hive™, Apache Iceberg, Apache Hudi, and Delta Lake data sources as a unified data source without ingestion. With unified catalogs, you can:
- Directly query data stored in Hive, Iceberg, Hudi, and Delta Lake without the need to manually create tables.
- Use INSERT INTO or asynchronous materialized views to process data stored in Hive, Iceberg, Hudi, and Delta Lake and load the data into CelerData.
- Perform operations on CelerData to create or drop Hive and Iceberg databases and tables.
To ensure successful SQL workloads on your unified data source, your CelerData cluster must be able to access the storage system and metastore of your unified data source. CelerData supports the following storage systems and metastores:
- Object storage like AWS S3 and Microsoft Azure Storage
- Metastore like Hive metastore (HMS) or AWS Glue
NOTE
If you choose AWS S3 as storage, you can use HMS or AWS Glue as metastore. If you choose any other storage system, you can only use HMS as metastore.
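The storage and metastore pairing rule in this note can be sketched as a small check. This is only an illustration: the function name and the lowercase labels ("s3", "hms", "glue") are hypothetical and are not CelerData property values.

```python
def allowed_metastores(storage: str) -> set:
    """Return the metastore services that can be paired with a storage system.

    Labels are illustrative only: "s3" stands for AWS S3, "hms" for
    Hive metastore, and "glue" for AWS Glue.
    """
    if storage.strip().lower() in {"s3", "aws s3"}:
        # AWS S3 can be paired with either HMS or AWS Glue.
        return {"hms", "glue"}
    # Any other storage system can be paired only with HMS.
    return {"hms"}
```

For example, allowed_metastores("azure") contains only "hms", which reflects that AWS Glue is an option only when the data is stored in AWS S3.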
Limits
A unified catalog supports integration with only one storage system and one metastore service. Therefore, make sure that all the data sources you want to integrate with CelerData as a unified data source use the same storage system and the same metastore service.
Usage notes
- See the "Usage notes" section in Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog to understand the file formats and data types supported.
- Format-specific operations are supported only for specific table formats. For example, CREATE TABLE and DROP TABLE are supported only for Hive and Iceberg, and REFRESH EXTERNAL TABLE is supported only for Hive and Hudi.
  When you create a table within a unified catalog by using the CREATE TABLE statement, use the ENGINE parameter to specify the table format (Hive or Iceberg).
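The format restrictions above can be captured in a lookup table, which is useful if you generate statements from a script. The following is a minimal sketch, not an official client: the helper names are hypothetical, and the layout of the generated statement (in particular "ENGINE = ...") is an assumption that you should check against the CREATE TABLE reference for your CelerData version.

```python
from typing import Mapping

# Statements and the table formats that support them, transcribed from
# the usage note above.
SUPPORTED_FORMATS = {
    "CREATE TABLE": {"hive", "iceberg"},
    "DROP TABLE": {"hive", "iceberg"},
    "REFRESH EXTERNAL TABLE": {"hive", "hudi"},
}


def is_supported(statement: str, table_format: str) -> bool:
    """Return True if the statement is listed as supported for the format."""
    formats = SUPPORTED_FORMATS.get(statement.upper(), set())
    return table_format.lower() in formats


def build_create_table(table: str, columns: Mapping[str, str], engine: str) -> str:
    """Build a CREATE TABLE statement with an explicit ENGINE clause.

    The clause layout is an assumption made for illustration.
    """
    if not is_supported("CREATE TABLE", engine):
        raise ValueError(f"CREATE TABLE is not supported for format {engine!r}")
    column_list = ", ".join(f"{name} {col_type}" for name, col_type in columns.items())
    return f"CREATE TABLE {table} ({column_list}) ENGINE = {engine.lower()}"
```

Calling build_create_table with engine set to "hudi" or "delta" raises ValueError, because CREATE TABLE is supported only for Hive and Iceberg.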
Integration preparations
Before you create a unified catalog, make sure your CelerData cluster can integrate with the storage system and metastore of your unified data source.
Hive metastore
If your Hive cluster uses Hive metastore as its metastore, make sure that your CelerData cluster can access the host of your Hive metastore.
NOTE
In most cases, you can take one of the following actions to enable integration between your CelerData cluster and your Hive metastore:
- Deploy your CelerData cluster and your Hive metastore on the same VPC.
- Configure a VPC peering connection between the VPC of your CelerData cluster and the VPC of your Hive metastore.
Then, check the security group of your Hive metastore to ensure that its inbound rules allow traffic from your CelerData cluster's security group and that the allowed port range covers the default Hive metastore port 9083.
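After you adjust the network and security group settings, a quick TCP check can confirm that the metastore port is reachable. The following is a minimal sketch with a hypothetical function name. It tests only whether a TCP connection can be opened and does not validate the Hive metastore service itself, and it is meaningful only when run from a machine in the same network as your CelerData cluster.

```python
import socket


def can_reach_metastore(host: str, port: int = 9083, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, can_reach_metastore("hms.example.internal") checks the default port 9083 on a placeholder host name. A False result usually points to a missing VPC peering route or an inbound rule that does not cover the port.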
AWS
If your Hive cluster uses AWS S3 as storage or AWS Glue as metastore, choose an authentication method that suits your needs and make the required preparations so that your CelerData cluster can access these AWS resources. The preparations include creating IAM roles or users and adding IAM policies to those roles or users. For more information, see Authenticate to AWS resources > Preparations.
Microsoft Azure Storage
If your Hive cluster uses Microsoft Azure Storage as storage, choose an authentication method that suits your needs and make the required preparations, such as adding role assignments. For more information, see Authenticate to Azure cloud storage.