Generated columns

CelerData supports generated columns. Generated columns can be used to accelerate queries with complex expressions. This feature supports precomputing and storing the results of expressions and query rewrites, which significantly accelerates queries with the same complex expressions.

You can define one or more generated columns to store the results of expressions at table creation. As such, when executing queries that contain the expression whose results are stored in the generated column you have defined, the CBO rewrites the query to read data directly from the generated column. Alternatively, you can directly query the data in the generated column.

It is also recommended to evaluate the impact of generated columns on loading performance because computing expressions takes some time. Additionally, it is advised to create generated columns at table creation rather than adding or modifying them after table creation. Because it is time-consuming and costly to add or modify generated columns after table creation.

Basic operations

Create generated columns

Syntax

<col_name> <data_type> [NULL] AS <expr> [COMMENT 'string']

Create a table named test_tbl1 with five columns of which columns newcol1 and newcol2 are generated columns whose values are computed by using the specified expressions and referencing the values of the regular columns data_array and data_json respectively.

CREATE TABLE test_tbl1
(
    id INT NOT NULL,
    data_array ARRAY<int> NOT NULL,
    data_json JSON NOT NULL,
    newcol1 DOUBLE AS array_avg(data_array),
    newcol2 String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);

NOTICE:

  • Generated columns must be defined after regular columns.
  • Aggregate functions cannot be used in the expressions for generated columns.
  • The expressions for generated column cannot reference other generated columns or auto-increment columns, but the expressions can reference multiple regular columns.
  • The data type of a generated column must match the data type of the result generated by the expression for the generated column.
  • Generated columns cannot be created on Aggregate tables.

Add generated columns after table creation

NOTICE

This operation is time-consuming and resource-intensive. Therefore, it is recommended to add generated columns at table creation. If it is unavoidable to use ALTER TABLE to add generated columns, it is recommended to evaluate the cost and time involved in advance.

  1. Create a table named test_tbl2 with three regular columns id, data_array, and data_json. Insert a data row into the table.

    -- Create a table.
    CREATE TABLE test_tbl2
    (
        id INT NOT NULL,
        data_array ARRAY<int> NOT NULL,
        data_json JSON NOT NULL
    )
    PRIMARY KEY (id)
    DISTRIBUTED BY HASH(id);
    
    -- Insert a data row.
    INSERT INTO test_tbl2 VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
    
    -- Query the table.
    MySQL [example_db]> select * from test_tbl2;
    +------+------------+------------------+
    | id   | data_array | data_json        |
    +------+------------+------------------+
    |    1 | [1,2]      | {"a": 1, "b": 2} |
    +------+------------+------------------+
    1 row in set (0.04 sec)
  2. Execute ALTER TABLE ... ADD COLUMN ... to add generated columns newcol1 and newcol2, which are created by evaluating the expressions based on the values of regular columns data_array and data_json.

    ALTER TABLE test_tbl2
    ADD COLUMN newcol1 DOUBLE AS array_avg(data_array);
    
    ALTER TABLE test_tbl2
    ADD COLUMN newcol2 String AS json_string(json_query(data_json, "$.a"));

    NOTICE:

    • Adding generated columns to Aggregate tables is not supported.
    • Regular columns need to be defined before generated columns. When you use the ALTER TABLE ... ADD COLUMN ... statement to add a regular column without specifying the position of the new regular column, the system automatically places it before the generated columns. Moreover, you cannot use AFTER to explicitly place the regular column after a generated column.
  3. Query the table data.

    MySQL [example_db]> SELECT * FROM test_tbl2;
    +------+------------+------------------+---------+---------+
    | id   | data_array | data_json        | newcol1 | newcol2 |
    +------+------------+------------------+---------+---------+
    |    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
    +------+------------+------------------+---------+---------+
    1 row in set (0.04 sec)

    The result shows that the generated columns newcol1 and newcol2 are added to the table, and CelerData automatically computes their values based on the expression.

Load data into generated columns

During data loading, CelerData automatically calculates the values for generated columns based on the expressions. You cannot specify the values of generated columns. The following example uses the INSERT INTO statement to load data:

  1. Use INSERT INTO to insert a record into the test_tbl1 table. Note that you cannot specify the values for the generated columns within the VALUES () clause.

    INSERT INTO test_tbl1 (id, data_array, data_json)
    VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
  2. Query the table data.

    MySQL [example_db]> SELECT * FROM test_tbl1;
    +------+------------+------------------+---------+---------+
    | id   | data_array | data_json        | newcol1 | newcol2 |
    +------+------------+------------------+---------+---------+
    |    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
    +------+------------+------------------+---------+---------+
    1 row in set (0.01 sec)

    The results show that CelerData automatically computes the values for the generated columns newcol1 and newcol2 based on the expressions.

    NOTICE:

    The following error is returned if you specify values for the generated columns during data loading:

    MySQL [example_db]> INSERT INTO test_tbl1 (id, data_array, data_json, newcol1, newcol2) 
    VALUES (2, [3,4], parse_json('{"a" : 3, "b" : 4}'), 3.5, "3");
    ERROR 1064 (HY000): Getting analyzing error. Detail message: materialized column 'newcol1' can not be specified.
    
    MySQL [example_db]> INSERT INTO test_tbl1 VALUES (2, [3,4], parse_json('{"a" : 3, "b" : 4}'), 3.5, "3");
    ERROR 1064 (HY000): Getting analyzing error. Detail message: Column count doesn't match value count.

Modify generated columns

NOTICE

This operation is time-consuming and resource-intensive. If it is unavoidable to use ALTER TABLE to modify generated columns, it is recommended to evaluate the cost and time involved in advance.

You can modify the data type and expression of a generated column.

  1. Create a table test_tbl3 with five columns of which columns newcol1 and newcol2 are generated columns whose values are computed by using the specified expressions and referencing the values of the regular columns data_array and data_json respectively. Insert a data row into the table.

    -- Create a table.
    MySQL [example_db]> CREATE TABLE test_tbl3
    (
        id INT NOT NULL,
        data_array ARRAY<int> NOT NULL,
        data_json JSON NOT NULL,
        -- The data types and expressions of generated columns are specified as follows:
        newcol1 DOUBLE AS array_avg(data_array),
        newcol2 String AS json_string(json_query(data_json, "$.a"))
    )
    PRIMARY KEY (id)
    DISTRIBUTED BY HASH(id);
    
    -- Insert a data row.
    INSERT INTO test_tbl3 (id, data_array, data_json)
    VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
    
    -- Query the table.
    MySQL [example_db]> select * from test_tbl3;
    +------+------------+------------------+---------+---------+
    | id   | data_array | data_json        | newcol1 | newcol2 |
    +------+------------+------------------+---------+---------+
    |    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
    +------+------------+------------------+---------+---------+
    1 row in set (0.01 sec)
  2. Modified the generated columns newcol1 and newcol2:

    • Change the data type of the generated column newcol1 to ARRAY<INT> and change its expression to data_array.

      ALTER TABLE test_tbl3 
      MODIFY COLUMN newcol1 ARRAY<INT> AS data_array;
    • Modify the expression of the generated column newcol2 to extract the values of field b from the regular column data_json.

      ALTER TABLE test_tbl3
      MODIFY COLUMN newcol2 String AS json_string(json_query(data_json, "$.b"));
  3. View the modified schema and the data in the table.

    • View the modified schema.

      MySQL [example_db]> show create table test_tbl3\G
      **** 1. row ****
          Table: test_tbl3
      Create Table: CREATE TABLE test_tbl3 (
      id int(11) NOT NULL COMMENT "",
      data_array array<int(11)> NOT NULL COMMENT "",
      data_json json NOT NULL COMMENT "",
      -- After modification, the data types and expressions of generated columns are as follows:
      newcol1 array<int(11)> NULL AS example_db.test_tbl3.data_array COMMENT "",
      newcol2 varchar(65533) NULL AS json_string(json_query(example_db.test_tbl3.data_json, '$.b')) COMMENT ""
      ) ENGINE=OLAP 
      PRIMARY KEY(id)
      DISTRIBUTED BY HASH(id)
      PROPERTIES (...);
      1 row in set (0.00 sec)
    • Query the table data after modification. The result shows that CelerData recalculates the values of the generated columns newcol1 and newcol2 based on the modified expressions.

      MySQL [example_db]> select * from test_tbl3;
      +------+------------+------------------+---------+---------+
      | id   | data_array | data_json        | newcol1 | newcol2 |
      +------+------------+------------------+---------+---------+
      |    1 | [1,2]      | {"a": 1, "b": 2} | [1,2]   | 2       |
      +------+------------+------------------+---------+---------+
      1 row in set (0.01 sec)

Drop a generated column

Drop column newcol1 from the table test_tbl3

ALTER TABLE test_tbl3 DROP COLUMN newcol1;

NOTICE:
If a generated colum references a regular column in the expression, you cannot directly drop or modify that regular column. Instead, you need to first drop the generated column and then drop or modify the regular column.

Query rewrites

If the expression in a query matches the expression of a generated column, the optimizer automatically rewrites the query to directly read the values of the generated column.

  1. Suppose that you create a table test_tbl4 with the following schema:

    CREATE TABLE test_tbl4
    (
        id INT NOT NULL,
        data_array ARRAY<int> NOT NULL,
        data_json JSON NOT NULL,
        newcol1 DOUBLE AS array_avg(data_array),
        newcol2 String AS json_string(json_query(data_json, "$.a"))
    )
    PRIMARY KEY (id) DISTRIBUTED BY HASH(id);
  2. If you query the data in the table test_tbl4 by using the SELECT array_avg(data_array), json_string(json_query(data_json, "$.a")) FROM test_tbl4; statement, the query involves only the regular columns data_array and data_json. However, the expressions in the query match the expressions of the generated columns newcol1 and newcol2. In this case, the execution plan shows that the CBO automatically rewrites the query to read the values of the generated columns newcol1 and newcol2.

    MySQL [example_db]> EXPLAIN SELECT array_avg(data_array), json_string(json_query(data_json, "$.a")) FROM test_tbl4;
    +---------------------------------------+
    | Explain String                        |
    +---------------------------------------+
    | PLAN FRAGMENT 0                       |
    |  OUTPUT EXPRS:4: newcol1 | 5: newcol2 | -- The query is rewritten to read data from the generated columns newcol1 and newcol2 are accessed.
    |   PARTITION: RANDOM                   |
    |                                       |
    |   RESULT SINK                         |
    |                                       |
    |   0:OlapScanNode                      |
    |      TABLE: test_tbl4                 |
    |      PREAGGREGATION: ON               |
    |      partitions=0/1                   |
    |      rollup: test_tbl4                |
    |      tabletRatio=0/0                  |
    |      tabletList=                      |
    |      cardinality=1                    |
    |      avgRowSize=2.0                   |
    +---------------------------------------+
    15 rows in set (0.00 sec)

Partial updates and generated columns

To perform partial updates on a Primary Key table, you must specify all the regular columns referenced by the generated columns in the columns parameter. The following example uses Stream Load to perform partial updates.

  1. Create a table test_tbl5 with five columns of which columns newcol1 and newcol2 are generated columns whose values are computed by using the specified expressions and referencing the values of the regular columns data_array and data_json respectively. Insert a data row into the table.

    -- Create a table.
    CREATE TABLE test_tbl5
    (
        id INT NOT NULL,
        data_array ARRAY<int> NOT NULL,
        data_json JSON NULL,
        newcol1 DOUBLE AS array_avg(data_array),
        newcol2 String AS json_string(json_query(data_json, "$.a"))
    )
    PRIMARY KEY (id)
    DISTRIBUTED BY HASH(id);
    
    -- Insert into a data row.
    INSERT INTO test_tbl5 (id, data_array, data_json)
    VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
    
    -- Query the table.
    MySQL [example_db]> select * from test_tbl5;
    +------+------------+------------------+---------+---------+
    | id   | data_array | data_json        | newcol1 | newcol2 |
    +------+------------+------------------+---------+---------+
    |    1 | [1,2]      | {"a": 1, "b": 2} |     1.5 | 1       |
    +------+------------+------------------+---------+---------+
    1 row in set (0.01 sec)
  2. Prepare a CSV file my_data1.csv to update some columns in the test_tbl5 table.

    1|[3,4]|{"a": 3, "b": 4}
    2|[3,4]|{"a": 3, "b": 4} 
  3. Use Stream Load with the my_data1.csv file to update some columns of the test_tbl5 table. You need to set partial_update:true and specify all the regular columns referenced by the generated columns in the columns parameter.

    curl --location-trusted -u <username>:<password> -H "label:1" \
        -H "column_separator:|" \
        -H "partial_update:true" \
        -H "columns:id,data_array,data_json" \ 
        -T my_data1.csv -XPUT \
        http://<fe_host>/api/example_db/test_tbl5/_stream_load
  4. Query the table data.

    [example_db]> select * from test_tbl5;
    +------+------------+------------------+---------+---------+
    | id   | data_array | data_json        | newcol1 | newcol2 |
    +------+------------+------------------+---------+---------+
    |    1 | [3,4]      | {"a": 3, "b": 4} |     3.5 | 3       |
    |    2 | [3,4]      | {"a": 3, "b": 4} |     3.5 | 3       |
    +------+------------+------------------+---------+---------+
    2 rows in set (0.01 sec)

An error is returned by Stream Load if you perform partial updates without specifying all the regular columns referenced by the generated columns.

  1. Prepare a CSV file my_data2.csv.

    1|[3,4]
    2|[3,4]
  2. When partial column updates are performed by using Stream Load with the my_data2.csv file, if the values for the data_json column are not provided in my_data2.csv and the columns parameter in the Stream Load job does not include the data_json column, even if the data_json column allows null values, an error is returned by Stream Load because the column data_json is referenced by the generated column newcol2.