AUTO_INCREMENT

CelerData supports the AUTO_INCREMENT column attribute, which can simplify data management. This topic introduces the application scenarios, usage and features of the AUTO_INCREMENT column attribute.

Introduction

When a new data row is loaded into a table and values are not specified for the AUTO_INCREMENT column, CelerData automatically assigns an integer value for the row's AUTO_INCREMENT column as its unique ID across the table. The subsequent values for the AUTO_INCREMENT column automatically increase at a specific step starting from the ID of the row. An AUTO_INCREMENT column can be used to simplify data management and speed up some queries. Here are some application scenarios of an AUTO_INCREMENT column:

  • Serve as primary keys: An AUTO_INCREMENT column can be used as the primary key to ensure that each row has a unique ID and make it easy to query and manage data.
  • Join tables: When multiple tables are joined, an AUTO_INCREMENT column can be used as the Join Key, which can expedite queries compared to using a column whose data type is STRING, for example, UUID.
  • Count the number of distinct values in a high-cardinality column: An AUTO_INCREMENT column can be used to represent the unique value column in a dictionary. Compared to directly counting distinct STRING values, counting distinct integer values of the AUTO_INCREMENT column can sometimes improve the query speed by several times or even tens of times.

You need to specify an AUTO_INCREMENT column in the CREATE TABLE statement. The data types of an AUTO_INCREMENT column must be BIGINT. The value for an AUTO_INCREMENT column can be implicitly assigned or explicitly specified. It starts from 1, and increments by 1 for each new row.

Basic operations

Specify AUTO_INCREMENT column at table creation

Create a table named test_tbl1 with two columns, id and number. Specify the column number as the AUTO_INCREMENT column.

CREATE TABLE test_tbl1
(
    id BIGINT NOT NULL, 
    number BIGINT NOT NULL AUTO_INCREMENT
) 
PRIMARY KEY (id) 
DISTRIBUTED BY HASH(id)
PROPERTIES("replicated_storage" = "true");

Assign values for AUTO_INCREMENT column

Assign values implicitly

When you load data into a table, you do not need to specify the values for the AUTO_INCREMENT column. CelerData automatically assigns unique integer values for that column and inserts them into the table.

INSERT INTO test_tbl1 (id) VALUES (1);
INSERT INTO test_tbl1 (id) VALUES (2);
INSERT INTO test_tbl1 (id) VALUES (3),(4),(5);

View data in the table.

mysql > SELECT * FROM test_tbl1 ORDER BY id;
+------+--------+
| id   | number |
+------+--------+
|    1 |      1 |
|    2 |      2 |
|    3 |      3 |
|    4 |      4 |
|    5 |      5 |
+------+--------+
5 rows in set (0.02 sec)

When you load data into a table, you can also specify the values as DEFAULT for the AUTO_INCREMENT column. CelerData automatically assigns unique integer values for that column and inserts them into the table.

INSERT INTO test_tbl1 (id, number) VALUES (6, DEFAULT);

View data in the table.

mysql > SELECT * FROM test_tbl1 ORDER BY id;
+------+--------+
| id   | number |
+------+--------+
|    1 |      1 |
|    2 |      2 |
|    3 |      3 |
|    4 |      4 |
|    5 |      5 |
|    6 |      6 |
+------+--------+
6 rows in set (0.02 sec)

In actual usage, the following result may be returned when you view the data in the table. This is because CelerData cannot guarantee that the values for the AUTO_INCREMENT column are strictly monotonic. But CelerData can guarantee that the values roughly increase in chronological order. For more information, see Monotonicity.

mysql > SELECT * FROM test_tbl1 ORDER BY id;
+------+--------+
| id   | number |
+------+--------+
|    1 |      1 |
|    2 | 100001 |
|    3 | 200001 |
|    4 | 200002 |
|    5 | 200003 |
|    6 | 200004 |
+------+--------+
6 rows in set (0.01 sec)

Specify values explicitly

You can also explicitly specify the values for the AUTO_INCREMENT column and insert them into the table.

INSERT INTO test_tbl1 (id, number) VALUES (7, 100);

-- view data in the table.

mysql > SELECT * FROM test_tbl1 ORDER BY id;
+------+--------+
| id   | number |
+------+--------+
|    1 |      1 |
|    2 | 100001 |
|    3 | 200001 |
|    4 | 200002 |
|    5 | 200003 |
|    6 | 200004 |
|    7 |    100 |
+------+--------+
7 rows in set (0.01 sec)

Moreover, explicitly specifying values does not affect the subsequent values generated by CelerData for newly inserted data rows.

INSERT INTO test_tbl1 (id) VALUES (8);

-- view data in the table.

mysql > SELECT * FROM test_tbl1 ORDER BY id;
+------+--------+
| id   | number |
+------+--------+
|    1 |      1 |
|    2 | 100001 |
|    3 | 200001 |
|    4 | 200002 |
|    5 | 200003 |
|    6 | 200004 |
|    7 |    100 |
|    8 |      2 |
+------+--------+
8 rows in set (0.01 sec)

NOTICE

We recommend that you do not use implicitly assigned values and explicitly specified values for the AUTO_INCREMENT column at the same time. Because the specified values may be the same as the values generated by CelerData, breaking the global uniqueness of auto-incremented IDs.

Basic features

Uniqueness

In general, CelerData guarantees that the values for an AUTO_INCREMENT column are globally unique across a table. We recommend that you do not implicitly assign and explicitly specify the values for the AUTO_INCREMENT column at the same time. If you do so, it may break the global uniqueness of auto-incremented IDs. Here is a simple example: Create a table named test_tbl2 with two columns, id and number. Specify the column number as the AUTO_INCREMENT column.

CREATE TABLE test_tbl2
(
    id BIGINT NOT NULL,
    number BIGINT NOT NULL AUTO_INCREMENT
 ) 
PRIMARY KEY (id) 
DISTRIBUTED BY HASH(id)
PROPERTIES("replicated_storage" = "true");

Implicitly assign and explicitly specify the values for the AUTO_INCREMENT column number in the table test_tbl2.

INSERT INTO test_tbl2 (id, number) VALUES (1, DEFAULT);
INSERT INTO test_tbl2 (id, number) VALUES (2, 2);
INSERT INTO test_tbl2 (id) VALUES (3);

Query the table test_tbl2.

mysql > SELECT * FROM test_tbl2 ORDER BY id;
+------+--------+
| id   | number |
+------+--------+
|    1 |      1 |
|    2 |      2 |
|    3 | 100001 |
+------+--------+
3 rows in set (0.08 sec)

Monotonicity

In order to improve the performance of allocating auto-incremented IDs, BEs cache some auto-incremented IDs locally. In this situation, CelerData cannot guarantee that the values for the AUTO_INCREMENT column are strictly monotonic. It can only be ensured that the values roughly increase in chronological order.

NOTE

The number of auto-incremented IDs cached by the BEs is determined by the FE dynamic parameter auto_increment_cache_size, which defaults to 100,000. You can modify the value by using ADMIN SET FRONTEND CONFIG ("auto_increment_cache_size" = "xxx");

For example, a CelerData cluster has one FE node and two BE nodes. Create a table named test_tbl3 and insert five rows of data as follows:

CREATE TABLE test_tbl3
(
    id BIGINT NOT NULL,
    number BIGINT NOT NULL AUTO_INCREMENT
) 
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id)
PROPERTIES("replicated_storage" = "true");

INSERT INTO test_tbl3 VALUES (1, DEFAULT);
INSERT INTO test_tbl3 VALUES (2, DEFAULT);
INSERT INTO test_tbl3 VALUES (3, DEFAULT);
INSERT INTO test_tbl3 VALUES (4, DEFAULT);
INSERT INTO test_tbl3 VALUES (5, DEFAULT);

The auto-incremented IDs in the table test_tbl3 do not monotonically increase, because the two BE nodes cache auto-incremented IDs, [1, 100000] and [100001, 200000], respectively. When data is loaded by using multiple INSERT statements, the data is sent to different BE nodes which allocate auto-incremented IDs independently. Therefore, it cannot be guaranteed that auto-incremented IDs are strictly monotonic.

mysql > SELECT * FROM test_tbl3 ORDER BY id;
+------+--------+
| id   | number |
+------+--------+
|    1 |      1 |
|    2 | 100001 |
|    3 | 200001 |
|    4 |      2 |
|    5 | 100002 |
+------+--------+
5 rows in set (0.07 sec)

Partial updates and AUTO_INCREMENT column

This section explains how to update only a few specified columns in a table that contains an AUTO_INCREMENT column.

NOTE

Currently, only primary key tables support partial updates.

AUTO_INCREMENT column is primary key

You need to specify the primary key during partial updates. Therefore, if the AUTO_INCREMENT column is the primary key or part of the primary key, the user behavior for partial updates is exactly the same as when the AUTO_INCREMENT column is not defined.

  1. Create a table test_tbl4 in the database example_db and insert one data row.

    -- Create a table.
    CREATE TABLE test_tbl4
    (
        id BIGINT AUTO_INCREMENT,
        name BIGINT NOT NULL,
        job1 BIGINT NOT NULL,
        job2 BIGINT NOT NULL
    ) 
    PRIMARY KEY (id, name)
    DISTRIBUTED BY HASH(id)
    PROPERTIES("replicated_storage" = "true");
    
    -- Prepared data.
    mysql > INSERT INTO test_tbl4 (id, name, job1, job2) VALUES (0, 0, 1, 1);
    Query OK, 1 row affected (0.04 sec)
    {'label':'insert_6af28e77-7d2b-11ed-af6e-02424283676b', 'status':'VISIBLE', 'txnId':'152'}
    
    -- Query the table.
    mysql > SELECT * FROM test_tbl4 ORDER BY id;
    +------+------+------+------+
    | id   | name | job1 | job2 |
    +------+------+------+------+
    |    0 |    0 |    1 |    1 |
    +------+------+------+------+
    1 row in set (0.01 sec)
  2. Prepare the CSV file my_data4.csv to update table test_tbl4. The CSV file includes values for the AUTO_INCREMENT column and does not include values for the column job1. The primary key of the first row already exists in table test_tbl4, while the primary key of the second row does not exist in the table.

    0,0,99
    1,1,99
  3. Run a Stream Load job and use the CSV file to update table test_tbl4.

    curl --location-trusted -u <username>:<password> -H "label:1" \
        -H "column_separator:," \
        -H "partial_update:true" \
        -H "columns:id,name,job2" \
        -T my_data4.csv -XPUT \
        http://<fe_host>/api/example_db/test_tbl4/_stream_load
  4. Query the updated table. The first row of data already exists in table test_tbl4, and the value for the column job1 remains unchanged. The second row of data is newly inserted, and because the default value for the column job1 is not specified, the partial update framework directly sets the value for this column to 0.

    mysql > SELECT * FROM test_tbl4 ORDER BY id;
    +------+------+------+------+
    | id   | name | job1 | job2 |
    +------+------+------+------+
    |    0 |    0 |    1 |   99 |
    |    1 |    1 |    0 |   99 |
    +------+------+------+------+
    2 rows in set (0.01 sec)

AUTO_INCREMENT column is not primary key

If the AUTO_INCREMENT column is not a primary key or a part of the primary key, and auto-incremented IDs are not provided in a Stream Load job, the following situations occur:

  • If the row already exists in the table, CelerData does not update the auto-incremented ID.
  • If the row is newly loaded into the table, CelerData generates a new auto-incremented ID.

This feature can be used to build a dictionary table for quickly computing distinct STRING values.

  1. In the database example_db, create a table test_tbl5 and specify the column job1 as the AUTO_INCREMENT column and insert a data row into the table test_tbl5.

    -- Create a table.
    CREATE TABLE test_tbl5
    (
        id BIGINT NOT NULL,
        name BIGINT NOT NULL,
        job1 BIGINT NOT NULL AUTO_INCREMENT,
        job2 BIGINT NOT NULL
    )
    PRIMARY KEY (id, name)
    DISTRIBUTED BY HASH(id)
    PROPERTIES("replicated_storage" = "true");
    
    -- Prepare data.
    mysql > INSERT INTO test_tbl5 VALUES (0, 0, -1, -1);
    Query OK, 1 row affected (0.04 sec)
    {'label':'insert_458d9487-80f6-11ed-ae56-aa528ccd0ebf', 'status':'VISIBLE', 'txnId':'94'}
    
    -- Query the table.
    mysql > SELECT * FROM test_tbl5 ORDER BY id;
    +------+------+------+------+
    | id   | name | job1 | job2 |
    +------+------+------+------+
    |    0 |    0 |   -1 |   -1 |
    +------+------+------+------+
    1 row in set (0.01 sec)
  2. Prepare a CSV file my_data5.csv to update table test_tbl5. The CSV file does not contain values for the AUTO_INCREMENT column job1. The primary key of the first row already exists in the table while the primary keys of the second and third rows do not.

    0,0,99
    1,1,99
    2,2,99
  3. Run a Stream Load job to load data from the CSV file into table test_tbl5.

    curl --location-trusted -u <username>:<password> -H "label:2" \
        -H "column_separator:," \
        -H "partial_update:true" \
        -H "columns: id,name,job2" \
        -T my_data5.csv -XPUT \
        http://<fe_host>/api/example_db/test_tbl5/_stream_load
  4. Query the updated table. The first row of data already exists in table test_tbl5, so the AUTO_INCREMENT column job1 retains its original value. The second and third rows of data are newly inserted, so CelerData generates new values for the AUTO_INCREMENT column job1.

    mysql > SELECT * FROM test_tbl5 ORDER BY id;
    +------+------+--------+------+
    | id   | name | job1   | job2 |
    +------+------+--------+------+
    |    0 |    0 |     -1 |   99 |
    |    1 |    1 |      1 |   99 |
    |    2 |    2 | 100001 |   99 |
    +------+------+--------+------+
    3 rows in set (0.01 sec)

Limits

  • When a table with an AUTO_INCREMENT column is created, 'replicated_storage' = 'true' must be set to ensure that all replicas have the same auto-incremented IDs.

  • Each table can have only one AUTO_INCREMENT column.

  • The data type of the AUTO_INCREMENT column must be BIGINT.

  • The AUTO_INCREMENT column must be NOT NULL and does not have a default value.

  • You can delete data from a Primary Key table with an AUTO_INCREMENT column. However, if the AUTO_INCREMENT column is not the primary key or part of the primary key, you need to note the following limits when you delete data in the following scenarios:

    • During the DELETE operation, there is also a load job for partial updates, which only contains UPSERT operations. If both the UPSERT and DELETE operations hit the same data row and the UPSERT operation is executed after the DELETE operation, the UPSERT operation may not take effect.
    • There is a load job for partial updates, which includes multiple UPSERT and DELETE operations on the same data row. If a certain UPSERT operation is executed after the DELETE operation, the UPSERT operation may not take effect.
  • Adding the AUTO_INCREMENT attribute by using ALTER TABLE is not supported.

  • CelerData does not support specifying the starting value and step size for the AUTO_INCREMENT column.

Keywords

AUTO_INCREMENT, AUTO INCREMENT