CREATE TABLE

Description

Creates a new table in CelerData.

Syntax

CREATE TABLE [IF NOT EXISTS] [database.]table_name
(column_definition1[, column_definition2, ...]
[, index_definition1[, ndex_definition12,]])
[ENGINE = olap]
[key_desc]
[COMMENT "table comment"];
[partition_desc]
[distribution_desc]
[rollup_index]
[ORDER BY (column_definition1,...)]
[PROPERTIES ("key"="value", ...)]
[BROKER PROPERTIES ("key"="value", ...)]

Parameters

column_definition

Syntax:

col_name col_type [agg_type] [NULL | NOT NULL] [DEFAULT "default_value"] [AUTO_INCREMENT] [AS generation_expr]

Note:

col_name:Column name
col_type:Column type



Specific column information, such as types and ranges: 



* TINYINT(1 byte)

Range: -2^7 + 1 ~ 2^7 - 1



* SMALLINT(2 bytes)

Range: -2^15 + 1 ~ 2^15 - 1



* INT(4 bytes)

Range: -2^31 + 1 ~ 2^31 - 1



* BIGINT(8 bytes)

Range: -2^63 + 1 ~ 2^63 - 1



* LARGEINT(16 bytes)

Range: -2^127 + 1 ~ 2^127 - 1



* FLOAT(4 bytes)

Support scientific notation 



* DOUBLE(8 bytes)

Support scientific notation 



* DECIMAL[(precision, scale)] (16 bytes) 

 Default value: DECIMAL(10, 0).

 precision: 1 ~ 38

 scale: 0 ~ precision

Integer part:precision - scale

Scientific notation is not supported 



* DATE(3 bytes)

Range :0000-01-01 ~ 9999-12-31



* DATETIME(8 bytes )

Range :0000-01-01 00:00:00 ~ 9999-12-31 23:59:59



* CHAR[(length)]

Fixed length string. Range:1 ~ 255。Default value: 1.



* VARCHAR[(length)]

Variable length string. Range:1 ~ 65533.



* HLL (1~16385 bytes)

For HLL type, there's no need to specify length or default value. 

The length will be controlled within the system according to data aggregation. 

HLL column can only be queried or used by hll_union_agg、Hll_cardinality、hll_hash.



* BITMAP

 Bitmap type does not require specified length or default value. It represents a set of unsigned bigint numbers. The largest element could be up to 2^64 - 1.
agg_type:aggregation type. If not specified, this column is key column. 

If specified, it it value column. 



The aggregation types supported are as follows: 



* SUM、MAX、MIN、REPLACE



* HLL_UNION (only for HLL type) 



* BITMAP_UNION(only for BITMAP) 



* REPLACE_IF_NOT_NULL:This means the imported data will only be replaced when it is of non-null value. If it is of null value, CelerData will retain the original value. 

Note: if NOT NULL is specified by REPLACE_IF_NOT_NULL column when the table was created, CelerData will still convert the data to NULL without sending an error report to the user. With this, the user can import selected columns. 

This aggregation type applies ONLY to the aggregation table whose key_desc type is AGGREGATE KEY. 
NULL is not allowed by default. NULL value should be represented by /N in the imported data. 
Note: 

When the column of aggregation type BITMAP_UNION is imported, its original data types must be TINYINT, SMALLINT, 

AUTO_INCREMENT:specifies an AUTO_INCREMENT column. The data types of AUTO_INCREMENT columns must be BIGINT. Auto-incremented IDs start from 1 and increase at a step of 1. For more information about AUTO_INCREMENT columns, see AUTO_INCREMENT.

AS generation_expr: specifies the generated column and its expression. The generated column can be used to precompute and store the results of expressions, which significantly accelerates queries with the same complex expressions.

index_definition

Syntax:

INDEX index_name (col_name[, col_name, ...]) [USING BITMAP] COMMENT 'xxxxxx'

Note:

index_name: Index name

col_name: Column name

Note: Currently only BITMAP index is supported and it only applies to single columns.

ENGINE type

Default value: olap.

CelerData supports creating Parquet-formatted tables in Iceberg catalogs, and you can insert data to these Parquet-formatted Iceberg tables by using INSERT INTO. See Create an Iceberg table.

key_desc

Syntax:

`key_type(k1[,k2 ...])`

Note:

Data is sequenced in specified key columns and has different attributes for different key_type. 

Key_type supports: 

AGGREGATE KEY: Identical content in key columns will be aggregated into value columns according to the specified aggregation type. It usually applies to business scenarios such as financial statements and multi-dimensional analysis. 

UNIQUE KEY/PRIMARY KEY: Identical content in key columns will be replaced in value columns according to the import sequence. It can be applied to make addition, deletion, modification and query on key columns. 

DUPLICATE KEY: Identical content in key columns, which also exists in CelerData at the same time. It can be used to store detailed data or data with no aggregation attributes. 

DUPLICATE KEY is set as default. Data will be sequenced according to key columns.

Please note: 

Value columns do not need to specify aggregation types when other key_type is used to create tables with the exception of AGGREGATE KEY. 

partition_desc

Partition description can be used in the following three ways:

LESS THAN

Syntax:

PARTITION BY RANGE (k1, k2, ...)
(
    PARTITION partition_name1 VALUES LESS THAN MAXVALUE|("value1", "value2", ...),
    PARTITION partition_name2 VALUES LESS THAN MAXVALUE|("value1", "value2", ...)
    ...
)

Note:

Please use specified key columns and specified value ranges for partitioning.

  • Partition name only supports [A-z0-9_]
  • Columns in Range partition only support the following types: TINYINT, SAMLLINT, INT, BIGINT, LARGEINT, DATE, and DATETIME.
  • Partitions are left closed and right open. The left boundary of the first partition is of minimum value.
  • NULL value is stored only in partitions that contain minimum values. When the partition containing the minimum value is deleted, NULL values can no longer be imported.
  • Partition columns can either be single columns or multiple columns. The partition values are the default minimum values.

Please note:

  • Partitions are often used for managing data related to time.
  • When data backtracking is needed, you may want to consider emptying the first partition for adding partitions later when necessary.

Fixed Range

Syntax:

PARTITION BY RANGE (k1, k2, k3, ...)
(
    PARTITION partition_name1 VALUES [("k1-lower1", "k2-lower1", "k3-lower1",...), ("k1-upper1", "k2-upper1", "k3-upper1", ...)),
    PARTITION partition_name2 VALUES [("k1-lower1-2", "k2-lower1-2", ...), ("k1-upper1-2", MAXVALUE, )),
    "k3-upper1-2", ...
)

Note:

  • Fixed Range is more flexible than LESS THAN. You can customize the left and right partitions.
  • Fixed Range is the same as LESS THAN in the other aspects.

Create partitions in bulk

Syntax

PARTITION BY RANGE (datekey) (

    START ("2021-01-01") END ("2021-01-04") EVERY (INTERVAL 1 day)

)

Description

You can specify the value for START and END and the expression in EVERY to create partitions in bulk .

  • If datekey supports DATE and INTEGER data type, the data type of START, END, and EVERY must be the same as the data type of datekey.
  • If datekey only supports DATE data type, you need to use the INTERVAL keyword to specify the date interval. You can specify the date interval by day, week, month, or year. The naming conventions of partitions are the same as those for dynamic partitions.

distribution_des

Hash bucketing

Syntax:

`DISTRIBUTED BY HASH (k1[,k2 ...]) [BUCKETS num]`

Note:

Please use specified key columns for Hash bucketing. The default bucket number is 10.

It is recommended to use Hash bucketing method.

ORDER BY

The primary key and sort key are decoupled in the Primary Key table. The sort key is specified by the ORDER BY keyword and can be the permutation and combination of any columns.

NOTICE

If the sort key is specified, the prefix index is built according to the sort key; if the sort key is not specified, the prefix index is built according to the primary key.

PROPERTIES

  • If ENGINE type is olap. Users can specify storage medium, cooldown time and replica number.
PROPERTIES (
    "storage_medium" = "[SSD|HDD]",
    [ "storage_cooldown_time" = "yyyy-MM-dd HH:mm:ss", ]
    [ "replication_num" = "3" ]
)

storage_medium: SSD or HDD could be specified as the initial storage media. You can specify default initial storage medium by specifying default_storage_medium=xxx through FE configuration file fe.conf. If no medium is specified, the default is HDD.

Note: When FE configuration item enable_strict_storage_medium_check is True and if storage medium is not set in the CelerData cluster, the statement for table creating will report an error: Failed to find enough host in all backends with storage medium is SSD|HDD.

storage_cooldown_time: When the storage medium is SSD, please specify its storage cooldown time. The default time is 30 days. Format: "yyyy-MM-dd HH:mm:ss"

replication_num: number of replicas in the specified partition. Default number: 3.

When the table has only one partition, the properties belongs to the table. When the table has two levels of partitions, the properties belong to each partition. Users can also specify different properties for different partitions through ADD ADDITION and MODIFY PARTITION statements.

  • If Engine type is olap, users can specify a column to adopt Bloomfilter index which applies only to the condition where in and equal are query filters. More discrete values in this column will result in more precise queries. Bloomfilter currently supports the key column, with the exception of the key column in TINYINT FLOAT DOUBLE type, and the value column with the aggregation method REPLACE.
PROPERTIES (
    "bloom_filter_columns"="k1,k2,k3"
)
  • If you want to use Colocate Join attributes, please specify it in properties.
PROPERTIES (
    "colocate_with"="table1"
)
  • If you want to use dynamic partition attributes, please specify it in properties.
PROPERTIES (

    "dynamic_partition.enable" = "true|false",
    "dynamic_partition.time_unit" = "DAY|WEEK|MONTH",
    "dynamic_partition.start" = "${integer_value}",
    "dynamic_partition.end" = "${integer_value}",
    "dynamic_partition.prefix" = "${string_value}",
    "dynamic_partition.buckets" = "${integer_value}"

dynamic_partition.enable: It is used to specify whether dynamic partitioning at the table level is enabled. Default value: true.

dynamic_partition.time_unit: It is used to specify the time unit for adding partitions dynamically. Time unit could be DAY, WEEK, MONTH.

dynamic_partition.start: It is used to specify how many partitions should be deleted. The value must be less than 0. Default value: integer.Min_VALUE.

dynamic_partition.end: It is used to specify the how many partitions will be created in advance. The value must be more than 0.

dynamic_partition.prefix: It is used to specify the prefix of the created partition. For instance, if the prefix is p, the partition will be named p20200108 automatically.

dynamic_partition.buckets: It is used to specify the number of buckets automatically created in partitions.

  • When building tables, Rollup can be created in bulk.

Syntax:

ROLLUP (rollup_name (column_name1, column_name2, ...)
[FROM from_index_name]
[PROPERTIES ("key"="value", ...)],...)

Examples

Example 1: Create an olap table that uses Hash bucketing and column-based storage, and that is aggregated by identical key records.

CREATE TABLE example_db.table_hash
(
    k1 TINYINT,
    k2 DECIMAL(10, 2) DEFAULT "10.5",
    v1 CHAR(10) REPLACE,
    v2 INT SUM
)
ENGINE=olap
AGGREGATE KEY(k1, k2)
COMMENT "my first starrocks table"
DISTRIBUTED BY HASH(k1) BUCKETS 10
PROPERTIES ("storage_type"="column");

Example 2: Create an olap table that uses Hash bucketing and column-based storage, and that is aggregated by identical key records. Also, please set the storage medium and the cooldown time.

CREATE TABLE example_db.table_hash
(
    k1 BIGINT,
    k2 LARGEINT,
    v1 VARCHAR(2048) REPLACE,
    v2 SMALLINT SUM DEFAULT "10"
)
ENGINE=olap
UNIQUE KEY(k1, k2)
DISTRIBUTED BY HASH (k1, k2) BUCKETS 10
PROPERTIES(
    "storage_type"="column",
    "storage_medium" = "SSD",
    "storage_cooldown_time" = "2015-06-04 00:00:00"
);

Or

CREATE TABLE example_db.table_hash
(
    k1 BIGINT,
    k2 LARGEINT,
    v1 VARCHAR(2048) REPLACE,
    v2 SMALLINT SUM DEFAULT "10"
)
ENGINE=olap
PRIMARY KEY(k1, k2)
DISTRIBUTED BY HASH (k1, k2) BUCKETS 10
PROPERTIES(
    "storage_type"="column",
    "storage_medium" = "SSD",
    "storage_cooldown_time" = "2015-06-04 00:00:00"
);

Example 3: Create an olap table that uses Range partition, Hash bucketing and the default column-based storage. Records with the same key should exist at the same time. Also, please set the initial storage medium and the cooldown time.

LESS THAN

CREATE TABLE example_db.table_range
(
    k1 DATE,
    k2 INT,
    k3 SMALLINT,
    v1 VARCHAR(2048),
    v2 DATETIME DEFAULT "2014-02-04 15:36:00"
)
ENGINE=olap
DUPLICATE KEY(k1, k2, k3)
PARTITION BY RANGE (k1)
(
    PARTITION p1 VALUES LESS THAN ("2014-01-01"),
    PARTITION p2 VALUES LESS THAN ("2014-06-01"),
    PARTITION p3 VALUES LESS THAN ("2014-12-01")
)
DISTRIBUTED BY HASH(k2) BUCKETS 10
PROPERTIES(
    "storage_medium" = "SSD", 
    "storage_cooldown_time" = "2015-06-04 00:00:00"
);

Note:

This statement will create 3 data partitions:

( {    MIN     },   {"2014-01-01"} )
[ {"2014-01-01"},   {"2014-06-01"} )
[ {"2014-06-01"},   {"2014-12-01"} )

Data outside these ranges will be not be loaded.

Fixed Range

CREATE TABLE table_range
(
    k1 DATE,
    k2 INT,
    k3 SMALLINT,
    v1 VARCHAR(2048),
    v2 DATETIME DEFAULT "2014-02-04 15:36:00"
)
ENGINE=olap
DUPLICATE KEY(k1, k2, k3)
PARTITION BY RANGE (k1, k2, k3)
(
    PARTITION p1 VALUES [("2014-01-01", "10", "200"), ("2014-01-01", "20", "300")),
    PARTITION p2 VALUES [("2014-06-01", "100", "200"), ("2014-07-01", "100", "300"))
)
DISTRIBUTED BY HASH(k2) BUCKETS 10
PROPERTIES(
    "storage_medium" = "SSD"
);
  • Create a table that contain HLL columns.
CREATE TABLE example_db.example_table
(
    k1 TINYINT,
    k2 DECIMAL(10, 2) DEFAULT "10.5",
    v1 HLL HLL_UNION,
    v2 HLL HLL_UNION
)
ENGINE=olap
AGGREGATE KEY(k1, k2)
DISTRIBUTED BY HASH(k1) BUCKETS 10
PROPERTIES ("storage_type"="column");
  • Create table containing BITMAP_UNION aggregation type. (The original data type of v1 and v2 columns muse be TINYINT, SMALLINT, INT)
CREATE TABLE example_db.example_table
(
    k1 TINYINT,
    k2 DECIMAL(10, 2) DEFAULT "10.5",
    v1 BITMAP BITMAP_UNION,
    v2 BITMAP BITMAP_UNION
)
ENGINE=olap
AGGREGATE KEY(k1, k2)
DISTRIBUTED BY HASH(k1) BUCKETS 10
PROPERTIES ("storage_type"="column");
  • Create table t1 and t2 that support Colocate Join.
CREATE TABLE `t1` 
(
     `id` int(11) COMMENT "",
    `value` varchar(8) COMMENT ""
) 
ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES 
(
    "colocate_with" = "t1"
);

CREATE TABLE `t2` 
(
    `id` int(11) COMMENT "",
    `value` varchar(8) COMMENT ""
) 
ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES 
(
    "colocate_with" = "t1"
);
  • Create a table with bitmap index
CREATE TABLE example_db.table_hash
(
    k1 TINYINT,
    k2 DECIMAL(10, 2) DEFAULT "10.5",
    v1 CHAR(10) REPLACE,
    v2 INT SUM,
    INDEX k1_idx (k1) USING BITMAP COMMENT 'xxxxxx'
)
ENGINE=olap
AGGREGATE KEY(k1, k2)
COMMENT "my first starrocks table"
DISTRIBUTED BY HASH(k1) BUCKETS 10
PROPERTIES ("storage_type"="column");
  • Create a dynamic partition table. (The dynamic partitioning function should be turned on in FE configuration.) This table will create partitions for three days and delete those created three days ago. For example, assuming today is 2020-01-08, partitions with these names will be created: p20200108, p20200109, p20200110, p20200111. And their ranges are:
[types: [DATE]; keys: [2020-01-08]; ‥types: [DATE]; keys: [2020-01-09]; )
[types: [DATE]; keys: [2020-01-09]; ‥types: [DATE]; keys: [2020-01-10]; )
[types: [DATE]; keys: [2020-01-10]; ‥types: [DATE]; keys: [2020-01-11]; )
[types: [DATE]; keys: [2020-01-11]; ‥types: [DATE]; keys: [2020-01-12]; )
CREATE TABLE example_db.dynamic_partition
(
    k1 DATE,
    k2 INT,
    k3 SMALLINT,
    v1 VARCHAR(2048),
    v2 DATETIME DEFAULT "2014-02-04 15:36:00"
)
ENGINE=olap
DUPLICATE KEY(k1, k2, k3)
PARTITION BY RANGE (k1)
(
    PARTITION p1 VALUES LESS THAN ("2014-01-01"),
    PARTITION p2 VALUES LESS THAN ("2014-06-01"),
    PARTITION p3 VALUES LESS THAN ("2014-12-01")
)
DISTRIBUTED BY HASH(k2) BUCKETS 10
PROPERTIES(
    "storage_medium" = "SSD",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-3",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "10"
);

Create a Primary Key table and specify the sort key

Suppose that you need to analyze user behavior in real time from dimensions such as users' address and last active time. When you create a table, you can define the user_id column as the primary key and define the combination of the address and last_active columns as the sort key.

create table users (
    user_id bigint NOT NULL,
    name string NOT NULL,
    email string NULL,
    address string NULL,
    age tinyint NULL,
    sex tinyint NULL,
    last_active datetime,
    property0 tinyint NOT NULL,
    property1 tinyint NOT NULL,
    property2 tinyint NOT NULL,
    property3 tinyint NOT NULL
) 
PRIMARY KEY (`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 4
ORDER BY(`address`,`last_active`)
PROPERTIES(
    "enable_persistent_index" = "true"
);