Overview of database objects

As an open cloud-native lakehouse, CelerData Cloud Serverless can serve both as a data warehouse that stores data and as a query engine that queries external data in your data lake. Database objects such as catalogs, databases, and tables play an important role in organizing data in CelerData. For example, the internal catalog (default_catalog) organizes all data stored in CelerData, while external catalogs are used to query data from various external data systems.

This topic introduces database objects in CelerData to help you better design your own data governance paradigm and efficiently manage your organization's data assets.

Overview

The following figure displays an efficient catalog-based data management architecture in CelerData.

[Figure: database objects in CelerData]

External catalogs let you directly query data stored in external data systems such as AWS S3, with high performance and without data ingestion. They suit scenarios such as ad hoc queries and data exploration.

When your business application requires lower latency and higher concurrency, you can load data into the internal catalog in CelerData. The internal catalog also provides various database objects to help you manipulate data, such as databases, tables, and materialized views.

Catalog

Catalogs enable you to manage internal and external data in one system, and offer a flexible way for you to easily query and analyze data that is stored in various external systems.

CelerData provides two types of catalogs: the internal catalog and external catalogs.

Internal catalog

NOTE

This feature is only available in Premium Edition.

The internal catalog manages the data loaded into CelerData. Each account has only one built-in internal catalog (default_catalog), which can include one or more databases. You can use the internal catalog as a data warehouse to store data, significantly enhancing query performance, especially for complex analytical queries on large volumes of data.

External catalog

An external catalog allows access to external data in your data lake. Each account can have more than one external catalog, and each external catalog can include one or more databases. With external catalogs, you can directly query the external data in your data lake without loading the data into CelerData. By default, CelerData caches the metadata of the tables in external catalogs to accelerate queries.

When you query external data from your data lake, CelerData interacts with two components of the data lake to generate a query execution plan, scan the target external data, perform calculations, and then return the query result. The two components are a metastore (for example, Hive metastore, AWS Glue, or Tabular) and a data storage system (for example, AWS S3).
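The query flow described above can be pictured with a small toy model. This is only a sketch under stated assumptions: the two dictionaries stand in for a metastore and an object store, and the table name, bucket name, file paths, and the query_total function are all hypothetical. It does not reflect how CelerData, Hive metastore, or AWS S3 are actually called.

```python
# Toy stand-in for a metastore: table name -> schema and data file list.
# All names and paths here are hypothetical.
metastore = {
    "lake_db.events": {
        "columns": ["user_id", "amount"],
        "files": [
            "s3://example-bucket/events/part-0.parquet",
            "s3://example-bucket/events/part-1.parquet",
        ],
    }
}

# Toy stand-in for the data storage system: file path -> rows.
storage = {
    "s3://example-bucket/events/part-0.parquet": [(1, 10), (2, 20)],
    "s3://example-bucket/events/part-1.parquet": [(3, 5)],
}


def query_total(table: str, column: str) -> int:
    """Sum one column of an external table without loading it first."""
    meta = metastore[table]               # 1. plan: fetch schema and file list
    idx = meta["columns"].index(column)
    total = 0
    for path in meta["files"]:            # 2. scan the data files in storage
        for row in storage[path]:
            total += row[idx]             # 3. perform the calculation
    return total                          # 4. return the query result
```

The point of the sketch is the division of labor: the metastore answers "what does the table look like and where are its files", and the storage system only serves the files.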

Currently, CelerData supports the following types of external catalogs: Hive catalog, Iceberg catalog, Tabular catalog, Hudi catalog, and Delta Lake catalog.

Database

A database is a collection of data objects, such as tables, materialized views, and pipes, that are used to store, manage, and manipulate data.
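The containment relationship between catalogs, databases, and tables can be sketched as a small data model. The class names, the example database and table names, and the dotted catalog.database.table naming used below are illustrative assumptions; only default_catalog comes from this topic.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

INTERNAL_CATALOG = "default_catalog"  # name of the built-in internal catalog


@dataclass
class Database:
    name: str
    tables: Set[str] = field(default_factory=set)


@dataclass
class Catalog:
    name: str
    databases: Dict[str, Database] = field(default_factory=dict)

    @property
    def is_internal(self) -> bool:
        # In this sketch, a catalog is internal only if it is default_catalog.
        return self.name == INTERNAL_CATALOG

    def add_table(self, database: str, table: str) -> str:
        """Register a table and return its dotted, fully qualified name."""
        db = self.databases.setdefault(database, Database(database))
        db.tables.add(table)
        return f"{self.name}.{database}.{table}"
```

For example, Catalog(INTERNAL_CATALOG).add_table("sales", "orders") yields the name default_catalog.sales.orders, while a hypothetical external catalog named hive_lake would yield names that start with hive_lake.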

Table

Tables are categorized into internal tables and external tables.

Internal table

NOTE

This feature is only available in Premium Edition.

Internal tables are maintained in databases under the internal catalog (default_catalog). The metadata and data files of internal tables are managed within CelerData. An internal table logically consists of rows and columns, where each row represents a record and each column represents an attribute or field of a record.

CelerData provides four types of internal tables: Primary Key tables, Duplicate Key tables, Aggregate tables, and Unique Key tables. These table types store various kinds of data, such as raw logs, real-time data, and aggregated data, to meet your varying business needs.

An internal table adopts a two-tier data distribution strategy, partitioning followed by bucketing, to achieve even data distribution.
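The two-tier idea can be illustrated with a minimal sketch. The assumptions are loud ones: partitions are formed per month of a date column, buckets are chosen by a CRC32 hash of a user ID, and the bucket count is 4. The function names are hypothetical, and the actual partitioning options and hash function used by CelerData are not stated in this topic.

```python
import zlib
from datetime import date
from typing import Tuple

NUM_BUCKETS = 4  # hypothetical bucket count


def partition_of(event_day: date) -> str:
    """Tier 1: pick a partition from the partitioning column (monthly here)."""
    return f"p{event_day.year}{event_day.month:02d}"


def bucket_of(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Tier 2: hash the bucketing column into one of num_buckets buckets."""
    return zlib.crc32(str(user_id).encode("utf-8")) % num_buckets


def place(event_day: date, user_id: int) -> Tuple[str, int]:
    """Return the (partition, bucket) pair that a row would land in."""
    return partition_of(event_day), bucket_of(user_id)
```

Partitioning narrows a query to the relevant slices of the table, and hashing within each partition spreads rows with different key values across buckets so that no single bucket holds most of the data.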

External table

External tables are maintained in external catalogs. You can use external tables to query data from external data sources. The metadata and data files of external tables are stored in the external data sources. By default, CelerData caches the metadata of these external tables to accelerate queries.

Materialized view

NOTE

This feature is only available in Premium Edition.

A materialized view is a special physical table that holds pre-computed query results from one or more base tables (internal or external tables). Materialized views can only be created in the internal catalog.
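The notion of pre-computed query results can be shown with a toy example. This sketch assumes a hypothetical base table of (day, region, amount) rows and a hypothetical view that stores the total amount per region. It models only the concept and says nothing about how CelerData builds or refreshes materialized views.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str, int]  # (day, region, amount)

# Hypothetical base table.
base_table: List[Row] = [
    ("2024-03-01", "eu", 10),
    ("2024-03-01", "us", 7),
    ("2024-03-02", "eu", 5),
]


def refresh(rows: List[Row]) -> Dict[str, int]:
    """Recompute the stored result: total amount per region."""
    totals: Dict[str, int] = defaultdict(int)
    for _day, region, amount in rows:
        totals[region] += amount
    return dict(totals)


# The "materialized view": a stored result that queries can read directly
# instead of re-aggregating the base table every time.
mv_sales_by_region = refresh(base_table)
```

A query for the total per region reads mv_sales_by_region instead of scanning base_table. The stored result reflects the base table only as of the last refresh, so new base rows become visible in it after refresh runs again.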

Pipe

NOTE

This feature is only available in Premium Edition.

A pipe is a data pipeline that is used to ingest data from an external data source into a table.

Access control

CelerData employs a role-based access control (RBAC) framework, which allows account administrators to precisely manage and regulate access to objects or data in their CelerData accounts.
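The general RBAC idea, where privileges are granted to roles and roles are granted to users, can be sketched as follows. The role names, user names, privilege names, and the is_allowed helper are hypothetical and are not taken from the CelerData privilege system.

```python
from typing import Dict, Set, Tuple

Privilege = Tuple[str, str]  # (action, object name)

# Privileges are attached to roles, never directly to users.
role_privileges: Dict[str, Set[Privilege]] = {
    "analyst": {("SELECT", "default_catalog.sales.orders")},
    "loader": {("INSERT", "default_catalog.sales.orders")},
}

# Users receive access only through the roles they hold.
user_roles: Dict[str, Set[str]] = {
    "alice": {"analyst"},
    "bob": {"analyst", "loader"},
}


def is_allowed(user: str, action: str, obj: str) -> bool:
    """Return True if any role held by the user grants the action on obj."""
    return any(
        (action, obj) in role_privileges.get(role, set())
        for role in user_roles.get(user, set())
    )
```

In this sketch, alice can read the orders table but cannot insert into it, while bob can do both because he holds two roles. Changing what a role may do changes access for every user who holds that role, which is what makes the model manageable for administrators.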