- Release Notes
- Get Started
- Clusters
- Cloud Settings
- Table Type
- Query Data Lakes
- Integration
- Query Acceleration
- Data Loading
- Concepts
- Batch load data from Amazon S3
- Batch load data from Azure cloud storage
- Load data from a local file system
- Load data from Confluent Cloud
- Load data from Amazon MSK
- Load data from Amazon Kinesis
- Data Unloading
- Data Backup
- Security
- Console Access Control
- Data Access Control
- Application keys
- Service accounts
- Use SSL connection
- Alarm
- Usage and Billing
- Organizations and Accounts
- Reference
- Amazon Web Services (AWS)
- Microsoft Azure
- SQL Reference
- Keywords
- ALL statements
- User Account Management
- Cluster Management
- ADMIN CANCEL REPAIR
- ADMIN CHECK TABLET
- ADMIN REPAIR
- ADMIN SET CONFIG
- ADMIN SET REPLICA STATUS
- ADMIN SHOW CONFIG
- ADMIN SHOW REPLICA DISTRIBUTION
- ADMIN SHOW REPLICA STATUS
- ALTER RESOURCE GROUP
- ALTER SYSTEM
- CANCEL DECOMMISSION
- CREATE FILE
- CREATE RESOURCE GROUP
- DROP FILE
- DROP RESOURCE GROUP
- EXPLAIN
- INSTALL PLUGIN
- SET
- SHOW BACKENDS
- SHOW BROKER
- SHOW COMPUTE NODES
- SHOW FRONTENDS
- SHOW FULL COLUMNS
- SHOW INDEX
- SHOW PLUGINS
- SHOW PROCESSLIST
- SHOW RESOURCE GROUP
- SHOW TABLE STATUS
- SHOW FILE
- SHOW VARIABLES
- UNINSTALL PLUGIN
- DDL
- ALTER DATABASE
- ALTER MATERIALIZED VIEW
- ALTER TABLE
- ALTER VIEW
- ANALYZE TABLE
- BACKUP
- CANCEL ALTER TABLE
- CANCEL BACKUP
- CANCEL RESTORE
- CREATE ANALYZE
- CREATE DATABASE
- CREATE EXTERNAL CATALOG
- CREATE INDEX
- CREATE MATERIALIZED VIEW
- CREATE REPOSITORY
- CREATE TABLE AS SELECT
- CREATE TABLE LIKE
- CREATE TABLE
- CREATE VIEW
- CREATE FUNCTION
- DROP ANALYZE
- DROP STATS
- DROP CATALOG
- DROP DATABASE
- DROP INDEX
- DROP MATERIALIZED VIEW
- DROP REPOSITORY
- DROP TABLE
- DROP VIEW
- DROP FUNCTION
- KILL ANALYZE
- RECOVER
- REFRESH EXTERNAL TABLE
- RESTORE
- SET CATALOG
- SHOW ANALYZE JOB
- SHOW ANALYZE STATUS
- SHOW META
- SHOW FUNCTION
- TRUNCATE TABLE
- USE
- DML
- ALTER LOAD
- ALTER ROUTINE LOAD
- BROKER LOAD
- CANCEL LOAD
- CANCEL EXPORT
- CANCEL REFRESH MATERIALIZED VIEW
- CREATE ROUTINE LOAD
- DELETE
- EXPORT
- GROUP BY
- INSERT
- PAUSE ROUTINE LOAD
- RESUME ROUTINE LOAD
- REFRESH MATERIALIZED VIEW
- SELECT
- SHOW ALTER
- SHOW ALTER MATERIALIZED VIEW
- SHOW BACKUP
- SHOW CATALOGS
- SHOW CREATE CATALOG
- SHOW CREATE MATERIALIZED VIEW
- SHOW CREATE TABLE
- SHOW CREATE VIEW
- SHOW DATA
- SHOW DATABASES
- SHOW DELETE
- SHOW DYNAMIC PARTITION TABLES
- SHOW EXPORT
- SHOW LOAD
- SHOW MATERIALIZED VIEW
- SHOW PARTITIONS
- SHOW REPOSITORIES
- SHOW RESTORE
- SHOW ROUTINE LOAD
- SHOW ROUTINE LOAD TASK
- SHOW SNAPSHOT
- SHOW TABLES
- SHOW TABLET
- SHOW TRANSACTION
- STOP ROUTINE LOAD
- STREAM LOAD
- SUBMIT TASK
- UPDATE
- Auxiliary Commands
- Data Types
- Keywords
- SQL Functions
- Function list
- Java UDFs
- Window functions
- Lambda expression
- Date Functions
- add_months
- adddate
- convert_tz
- current_date
- current_time
- current_timestamp
- date
- date_add
- date_diff
- date_format
- date_slice
- date_sub, subdate
- date_trunc
- datediff
- day
- dayofweek_iso
- dayname
- dayofmonth
- dayofweek
- dayofyear
- days_add
- days_diff
- days_sub
- from_days
- from_unixtime
- hour
- hours_add
- hours_diff
- hours_sub
- jodatime_format
- last_day
- makedate
- microseconds_add
- microseconds_sub
- minute
- minutes_add
- minutes_diff
- minutes_sub
- month
- monthname
- months_add
- months_diff
- months_sub
- next_day
- now
- previous_day
- quarter
- second
- seconds_add
- seconds_diff
- seconds_sub
- str_to_date
- str_to_jodatime
- str2date
- time_slice
- time_to_sec
- timediff
- timestamp
- timestampadd
- timestampdiff
- to_date
- to_days
- to_iso8601
- to_tera_date
- to_tera_timestamp
- unix_timestamp
- utc_timestamp
- week
- week_iso
- weekofyear
- weeks_add
- weeks_diff
- weeks_sub
- year
- years_add
- years_diff
- years_sub
- Aggregate Functions
- any_value
- approx_count_distinct
- array_agg
- avg
- bitmap
- bitmap_agg
- count
- count_if
- corr
- covar_pop
- covar_samp
- group_concat
- grouping
- grouping_id
- hll_empty
- hll_hash
- hll_raw_agg
- hll_union
- hll_union_agg
- max
- max_by
- min
- min_by
- multi_distinct_sum
- multi_distinct_count
- percentile_approx
- percentile_cont
- percentile_disc
- retention
- stddev
- stddev_samp
- sum
- variance, variance_pop, var_pop
- var_samp
- window_funnel
- Geographic Functions
- String Functions
- append_trailing_char_if_absent
- ascii
- char
- char_length
- character_length
- concat
- concat_ws
- ends_with
- find_in_set
- group_concat
- hex
- hex_decode_binary
- hex_decode_string
- instr
- lcase
- left
- length
- locate
- lower
- lpad
- ltrim
- money_format
- null_or_empty
- parse_url
- repeat
- replace
- reverse
- right
- rpad
- rtrim
- space
- split
- split_part
- substring_index
- starts_with
- strleft
- strright
- str_to_map
- substring
- trim
- ucase
- unhex
- upper
- url_decode
- url_encode
- Pattern Matching Functions
- JSON Functions
- Overview of JSON functions and operators
- JSON operators
- JSON constructor functions
- JSON query and processing functions
- Bit Functions
- Bitmap Functions
- Array Functions
- all_match
- any_match
- array_agg
- array_append
- array_avg
- array_concat
- array_contains
- array_contains_all
- array_cum_sum
- array_difference
- array_distinct
- array_filter
- array_generate
- array_intersect
- array_join
- array_length
- array_map
- array_max
- array_min
- array_position
- array_remove
- array_slice
- array_sort
- array_sortby
- array_sum
- arrays_overlap
- array_to_bitmap
- cardinality
- element_at
- reverse
- unnest
- Map Functions
- Binary Functions
- cast function
- hash function
- Cryptographic Functions
- Math Functions
- Pattern Matching Functions
- Percentile Functions
- Scalar Functions
- Struct Functions
- Table Functions
- Utility Functions
- AUTO_INCREMENT
- Generated columns
- System variables
- System limits
- Information Schema
- Overview
- be_bvars
- be_cloud_native_compactions
- be_compactions
- character_sets
- collations
- column_privileges
- columns
- engines
- events
- global_variables
- key_column_usage
- load_tracking_logs
- loads
- materialized_views
- partitions
- pipe_files
- pipes
- referential_constraints
- routines
- schema_privileges
- schemata
- session_variables
- statistics
- table_constraints
- table_privileges
- tables
- tables_config
- task_runs
- tasks
- triggers
- user_privileges
- views
- System Metadatabase
- API
- Overview
- Actions
- Clusters
- Create and Manage Clusters
- Query Clusters
- Identity and Access Management
- Organization and Account
- Usage and Billing
- Clusters
- Terraform Provider
- Run scripts
Asynchronous materialized views
This topic describes how to understand, create, use, and manage an asynchronous materialized view.
Compared with synchronous materialized views, asynchronous materialized views support multi-table join and more aggregate functions. The refresh of asynchronous materialized views can be triggered manually or by scheduled tasks. You can also refresh some of the partitions instead of the whole materialized view, greatly reducing the cost of refresh. In addition, asynchronous materialized views support a variety of query rewrite scenarios, allowing automatic, transparent query acceleration.
For the scenario and usage of the synchronous materialized views (Rollup), see Synchronous materialized view (Rollup).
Overview
Applications in databases often perform complex queries on large tables. Such queries involve multi-table joins and aggregations on tables that contain billions of rows. Processing these queries can be expensive, in terms of system resources and the time it takes to compute the results.
Asynchronous materialized views in CelerData are designed to tackle these issues. An asynchronous materialized view is a special physical table that holds pre-computed query results from one or more base tables. When you perform complex queries on the base table, CelerData returns the pre-computed results from the relevant materialized views to process these queries. This way, query performance can be improved because repetitive complex calculations are avoided. This performance difference can be significant when a query is run frequently or is sufficiently complex.
Additionally, asynchronous materialized views are especially useful for building mathematical models upon your data warehouse. By doing so, you can provide a unified data specification to upper-layer applications, shield the underlying implementation, or protect the raw data security of the base tables.
Understand materialized views in CelerData
CelerData provides a synchronous materialized view that can be built only on a single table. Synchronous materialized views, or the Rollup, retain higher data freshness and lower refreshing costs. However, compared to asynchronous materialized views, synchronous materialized views have many limitations. You have limited choices of aggregation operators when you want to build a synchronous materialized view to accelerate or rewrite your queries.
The following table compares the asynchronous materialized views (ASYNC MV) and the synchronous materialized view (SYNC MV) in CelerData in the perspective of features that they support:
Single-table aggregation | Multi-table join | Query rewrite | Refresh strategy | Base table | |
---|---|---|---|---|---|
ASYNC MV | Yes | Yes | Yes |
| Multiple tables from:
|
SYNC MV (Rollup) | Limited choices of aggregate functions | No | Yes | Synchronous refresh during data loading | Single table in the default catalog |
Basic concepts
Base table
Base tables are the driving tables of a materialized view.
For CelerData' asynchronous materialized views, base tables can be CelerData native tables in the default catalog, tables in external catalogs, or even existing asynchronous materialized views and views. CelerData supports creating asynchronous materialized views on all table types.
Refresh
When you create an asynchronous materialized view, its data reflects only the state of the base tables at that time. When the data in the base tables change, you need to refresh the materialized view to keep the changes synchronized.
Currently, CelerData supports two generic refreshing strategies: ASYNC (refreshing triggered regularly by tasks) and MANUAL (refreshing triggered manually by users).
Query rewrite
Query rewrite means that when executing a query on base tables with materialized views built on, the system automatically judges whether the pre-computed results in the materialized view can be reused for the query. If they can be reused, the system will load the data directly from the relevant materialized view to avoid the time- and resource-consuming computations or joins.
CelerData supports automatic, transparent query rewrite based on the SPJG-type asynchronous materialized views that are created on the default catalog or an external catalog such as a Hive catalog, Hudi catalog, or Iceberg catalog.
Decide when to create a materialized view
You can create an asynchronous materialized view if you have the following demands in your data warehouse environment:
Accelerating queries with repetitive aggregate functions
Suppose that most queries in your data warehouse include the same sub-query with an aggregate function, and these queries have consumed a huge proportion of your computing resources. Based on this sub-query, you can create an asynchronous materialized view, which will compute and store all results of the sub-query. After the materialized view is built, CelerData rewrites all queries that contain the sub-query, loads the intermediate results stored in the materialized view, and thus accelerates these queries.
Regular JOIN of multiple tables
Suppose that you need to regularly join multiple tables in your data warehouse to make a new wide table. You can build an asynchronous materialized view for these tables, and set the ASYNC refreshing strategy that triggers refreshing tasks at a fixed time interval. After the materialized view is built, query results are returned directly from the materialized view, and thus the latency caused by JOIN operations is avoided.
Data warehouse layering
Suppose that your data warehouse contains a mass of raw data, and queries in it require a complex set of ETL operations. You can build multiple layers of asynchronous materialized views to stratify the data in your data warehouse, and thus decompose the query into a series of simple sub-queries. It can significantly reduce repetitive computation, and, more importantly, help your DBA identify the problem with ease and efficiency. Beyond that, data warehouse layering helps decouple raw data and statistical data, protecting the security of sensitive raw data.
Accelerating queries in data lakes
Querying a data lake can be slow due to network latency and object storage throughput. However, you can enhance the query performance by building an asynchronous materialized view on top of the data lake to filter the raw data. CelerData automatically refreshes the materialized view whenever the data from the external source changes, ensuring data consistency. Moreover, the SQL optimizer of CelerData can intelligently rewrite queries to use the existing materialized views, saving you the trouble of modifying your queries manually.
Create an asynchronous materialized view
CelerData' asynchronous materialized views can be created on the following base tables:
- CelerData' native tables of all table types
- Tables in Hive catalog, Hudi catalog, and Iceberg catalog
- Existing asynchronous materialized views
- Existing views
Before you begin
Prepare base tables
The following examples involve two base tables:
- The table
goods
records the item IDitem_id1
, the item nameitem_name
, and the item priceprice
. - The table
order_list
records the order IDorder_id
, client IDclient_id
, item IDitem_id2
, and order dateorder_date
.
The column goods.item_id1
is equivalent to the column order_list.item_id2
.
Execute the following statements to create the tables and insert data into them:
CREATE TABLE goods(
item_id1 INT,
item_name STRING,
price FLOAT
) DISTRIBUTED BY HASH(item_id1);
INSERT INTO goods
VALUES
(1001,"apple",6.5),
(1002,"pear",8.0),
(1003,"potato",2.2);
CREATE TABLE order_list(
order_id INT,
client_id INT,
item_id2 INT,
order_date DATE
) DISTRIBUTED BY HASH(order_id);
INSERT INTO order_list
VALUES
(10001,101,1001,"2022-03-13"),
(10001,101,1002,"2022-03-13"),
(10002,103,1002,"2022-03-13"),
(10002,103,1003,"2022-03-14"),
(10003,102,1003,"2022-03-14"),
(10003,102,1001,"2022-03-14");
The scenario in the following example demands frequent calculations of the total of each order. It requires frequent joins of the two base tables and intensive usage of the aggregate function sum()
. Besides, the business scenario demands the data refresh at an interval of one day.
The query statement is as follows:
SELECT
order_id,
sum(goods.price) as total
FROM order_list INNER JOIN goods ON goods.item_id1 = order_list.item_id2
GROUP BY order_id;
Create the materialized view
You can create a materialized view based on a specific query statement using CREATE MATERIALIZED VIEW.
Based on the table goods
, order_list
, and the query statement mentioned above, the following example creates the materialized view order_mv
to analyze the total of each order. The materialized view is set to refresh itself at an interval of one day.
CREATE MATERIALIZED VIEW order_mv
DISTRIBUTED BY HASH(`order_id`)
REFRESH ASYNC START('2022-09-01 10:00:00') EVERY (interval 1 day)
AS SELECT
order_list.order_id,
sum(goods.price) as total
FROM order_list INNER JOIN goods ON goods.item_id1 = order_list.item_id2
GROUP BY order_id;
NOTE
- You must specify a bucketing strategy when creating an asynchronous materialized view.
- You can set different partitioning and bucketing strategies for an asynchronous materialized view from those of its base tables.
- Asynchronous materialized views support a dynamic partitioning strategy in a longer span. For example, if the base table is partitioned at an interval of one day, you can set the materialized view to be partitioned at an interval of one month.
- The query statement that is used to create an asynchronous materialized view must include the partition keys and bucket keys of the materialized views.
- The query statement used to create a materialized view does not support random functions, including rand(), random(), uuid(), and sleep().
- Asynchronous materialized views support a variety of data types. For more information, see CREATE MATERIALIZED VIEW - Supported data types.
About refresh mechanisms of asynchronous materialized views
Currently, CelerData supports two ON DEMAND refresh strategies: manual refresh and regular refresh at a fixed time interval.
Asynchronous materialized views further support a variety of asynchronous refreshing mechanisms:
- If an MV has many large partitions, each refresh can consume a large amount of resources. CelerData supports splitting refresh tasks. You can specify the maximum number of partitions to be refreshed, and CelerData performs refresh in batches, with a batch size smaller or equal to the specified maximum number of partitions. This feature ensures large asynchronous materialized views are stably refreshed, enhancing the stability and robustness of data modeling.
- You can specify the time to live (TTL) for partitions of an asynchronous materialized view, reducing the storage size taken by the materialized view.
- You can specify the refresh range to refresh only the latest few partitions, reducing the refresh overhead.
For more information, see the PROPERTIES section in CREATE MATERIALIZED VIEW - Parameters. You can also modify the mechanisms of an existing asynchronous materialized view using ALTER MATERIALIZED VIEW.
About nested materialized views
CelerData supports creating nested asynchronous materialized views. You can build asynchronous materialized views based on existing asynchronous materialized views. The refreshing strategy for each materialized view does not affect the materialized views on the upper or lower layers. Currently, CelerData does not limit the number of nesting levels. In a production environment, we recommend that the number of nesting layers does not exceed THREE.
About external catalog materialized views
CelerData supports creating asynchronous materialized views based on Hive catalog, Hudi catalog, and Iceberg catalog. An external catalog materialized view is created in the same way as a general asynchronous materialized view is created, with the following usage restrictions:
Strict consistency is not guaranteed between the materialized view and the base tables in the external catalog.
Currently, building asynchronous materialized views based on external resources is not supported.
Currently, CelerData cannot perceive the data changes on the base tables in Iceberg catalogs and Hudi catalogs, so all partitions are refreshed by default every time the refreshing task is triggered. If you want to refresh only some of the partitions, you can manually refresh the materialized view using the REFRESH MATERIALIZED VIEW statement and specify the partition you want to refresh.
CelerData can periodically refresh the cached metadata of the frequently accessed Hive catalogs to perceive data changes. You can configure the Hive metadata cache refresh through the following FE parameters:
Configuration item Default Description enable_background_refresh_connector_metadata true
Whether to enable the periodic Hive metadata cache refresh. After it is enabled, CelerData polls the metastore (Hive Metastore or AWS Glue) of your Hive cluster, and refreshes the cached metadata of the frequently accessed Hive catalogs to perceive data changes. true
indicates to enable the Hive metadata cache refresh, andfalse
indicates to disable it. This item is an FE dynamic parameter. You can modify it using the ADMIN SET FRONTEND CONFIG command.background_refresh_metadata_interval_millis 600000
(10 minutes)The interval between two consecutive Hive metadata cache refreshes. Unit: millisecond. This item is an FE dynamic parameter. You can modify it using the ADMIN SET FRONTEND CONFIG command. background_refresh_metadata_time_secs_since_last_access_secs 86400
(24 hours)The expiration time of a Hive metadata cache refresh task. For the Hive catalog that has been accessed, if it has not been accessed for more than the specified time, CelerData stops refreshing its cached metadata. For the Hive catalog that has not been accessed, CelerData will not refresh its cached metadata. Unit: second. This item is an FE dynamic parameter. You can modify it using the ADMIN SET FRONTEND CONFIG command.
Manually refresh an asynchronous materialized view
You can refresh an asynchronous materialized view regardless of its refreshing strategy via REFRESH MATERIALIZED VIEW. CelerData supports refreshing specific partitions of an asynchronous materialized view by specifying partition names. CelerData supports making a synchronous call of the refresh task, and the SQL statement is returned only when the task succeeds or fails.
-- Refresh the materialized view via an asynchronous call (default).
REFRESH MATERIALIZED VIEW order_mv;
-- Refresh the materialized view via a synchronous call.
REFRESH MATERIALIZED VIEW order_mv WITH SYNC MODE;
You can cancel a refresh task submitted via an asynchronous call using CANCEL REFRESH MATERIALIZED VIEW.
Query the asynchronous materialized view
The asynchronous materialized view you created is essentially a physical table that contains the complete set of pre-computed results in accordance with the query statement. Therefore, you can directly query the materialized view after the materialized view is refreshed for the first time.
MySQL > SELECT * FROM order_mv;
+----------+--------------------+
| order_id | total |
+----------+--------------------+
| 10001 | 14.5 |
| 10002 | 10.200000047683716 |
| 10003 | 8.700000047683716 |
+----------+--------------------+
3 rows in set (0.01 sec)
NOTE
You can directly query an asynchronous materialized view, but the results may be inconsistent with what you get from the query on its base tables.
Rewrite queries with the asynchronous materialized view
CelerData supports automatic and transparent query rewrite based on the SPJG-type asynchronous materialized views. The SPJG-type materialized views refer to materialized views whose plan only includes Scan, Filter, Project, and Aggregate types of operators. The SPJG-type materialized views query rewrite includes single table query rewrite, Join query rewrite, aggregation query rewrite, Union query rewrite and query rewrite based on nested materialized views.
Currently, CelerData supports rewriting queries on asynchronous materialized views that are created on the default catalog or an external catalog such as a Hive catalog, Hudi catalog, or Iceberg catalog. When querying data in the default catalog, CelerData ensures strong consistency of results between the rewritten query and the original query by excluding materialized views whose data is inconsistent with the base table. When the data in a materialized view expires, the materialized view will not be used as a candidate materialized view. When querying data in external catalogs, CelerData does not ensure a strong consistency of the results because CelerData cannot perceive the data changes in external catalogs.
Enable query rewrite
Enable query rewrite based on the default catalog materialized views
CelerData enables asynchronous materialized view query rewrite by default. You can enable or disable this feature with the session variable enable_materialized_view_rewrite.
SET GLOBAL enable_materialized_view_rewrite = { true | false };
[Experimental] Enable query rewrite based on the external catalog materialized views
Because CelerData does not ensure a strong consistency of the results when you query data in external catalogs using asynchronous materialized views, the query rewrite based on the external catalog materialized views is disabled by default. You can enable this feature for an external catalog materialized view by adding the property
"force_external_table_query_rewrite" = "true"
when creating the materialized view.Example:
CREATE MATERIALIZED VIEW ex_mv_par_tbl PARTITION BY emp_date DISTRIBUTED BY hash(empid) PROPERTIES ( "force_external_table_query_rewrite" = "true" ) AS select empid, deptno, emp_date from `hive_catalog`.`emp_db`.`emps_par_tbl` where empid < 5;
Rewrite queries with Aggregate Rollup
CelerData supports rewriting queries with Aggregate Rollup, that is, CelerData can rewrite aggregate queries with a GROUP BY a
clause using an asynchronous materialized view created with a GROUP BY a,b
clause.
In the following example, CelerData can rewrite Query 1 and 2 with the materialized view order_agg_mv
.
CREATE MATERIALIZED VIEW order_agg_mv
DISTRIBUTED BY HASH(`order_id`) BUCKETS 12
REFRESH ASYNC START('2022-09-01 10:00:00') EVERY (interval 1 day)
AS
SELECT
order_id,
order_date,
bitmap_union(to_bitmap(client_id)) -- uv
FROM order_list
GROUP BY order_id, order_date;
-- Query 1
SELECT
order_date,
bitmap_union(to_bitmap(client_id)) -- uv
FROM order_list
GROUP BY order_date;
-- Query 2
SELECT
order_date,
count(distinct client_id)
FROM order_list
GROUP BY order_date;
Only certain aggregate functions support query rewrite with Aggregate Rollup. In the preceding example, if the materialized view order_agg_mv
uses count(distinct client_id)
instead of bitmap_union(to_bitmap(client_id))
, CelerData cannot rewrite the queries with Aggregate Rollup.
The following table shows the correspondence between the aggregate functions in the original query and the aggregate function used to build the materialized view. You can select the corresponding aggregate functions to build a materialized view according to your business scenario.
Aggregate function supported in original queries | Function supported Aggregate Rollup in materialized view |
---|---|
sum | sum |
count | count |
min | min |
max | max |
avg | sum / count |
bitmap_union, bitmap_union_count, count(distinct) | bitmap_union |
hll_raw_agg, hll_union_agg, ndv, approx_count_distinct | hll_union |
percentile_approx, percentile_union | percentile_union |
DISTINCT aggregates without the corresponding GROUP BY column cannot be rewritten with Aggregate Rollup. However, if a query with an Aggregate Rollup DISTINCT aggregate function does not have a GROUP BY column but an equal predicate, it can also be rewritten by the relevant materialized view because CelerData can convert the equal predicates into a GROUP BY constant expression.
In the following example, CelerData can rewrite the query with the materialized view order_agg_mv1
.
CREATE MATERIALIZED VIEW order_agg_mv1
DISTRIBUTED BY HASH(`order_id`) BUCKETS 12
REFRESH ASYNC START('2022-09-01 10:00:00') EVERY (interval 1 day)
AS
SELECT
order_date,
count(distinct client_id)
FROM order_list
GROUP BY order_date;
-- Query
SELECT
order_date,
count(distinct client_id)
FROM order_list WHERE order_date='2023-07-03';
Rewrite queries in View Delta Join scenarios
CelerData now supports rewriting queries based on asynchronous materialized views with Delta Join, which means that the queried tables are a subset of the materialized view's base tables. For example, queries of the form table_a INNER JOIN table_b
can be rewritten by materialized views of the form table_a INNER JOIN table_b INNER JOIN/LEFT OUTER JOIN table_c
, where table_b INNER JOIN/LEFT OUTER JOIN table_c
is the Delta Join. This feature allows transparent acceleration for such queries, thereby preserving the flexibility of the query and avoiding the huge cost of building wide tables. CelerData supports query rewrite in View Delta Join scenarios for Hive Catalog.
View Delta Join queries can be rewritten only when the following requirements are met:
The Delta Join must be Inner Join or Left Outer Join.
If the Delta Join is Inner Join, the keys to be joined must be the corresponding Foreign/Primary/Unique Key and NOT NULL.
For example, the materialized view is of the form
A INNER JOIN B ON (A.a1 = B.b1) INNER JOIN C ON (B.b2 = C.c1)
, and the query is of the formA INNER JOIN B ON (A.a1 = B.b1)
. In this case,B INNER JOIN C ON (B.b2 = C.c1)
is the Delta Join.B.b2
must be the Foreign Key of B and must be NOT NULL, andC.c1
must be the Primary Key or Unique Key of C.If the Delta Join is Left Outer Join, the keys to be joined must be the corresponding Foreign/Primary/Unique Key.
For example, the materialized view is of the form
A INNER JOIN B ON (A.a1 = B.b1) LEFT OUTER JOIN C ON (B.b2 = C.c1)
, and the query is of the formA INNER JOIN B ON (A.a1 = B.b1)
. In this case,B LEFT OUTER JOIN C ON (B.b2 = C.c1)
is the Delta Join.B.b2
must be the Foreign Key of B, andC.c1
must be the Primary Key or Unique Key of C.
To implement the above constraints, you must define the Unique Key constraints and Foreign Key constraints of a table using the properties unique_constraints
and foreign_key_constraints
when creating the table. For more information, see CREATE TABLE - PROPERTIES.
CAUTION
The Unique Key constraints and Foreign Key constraints are only used for query rewrite. The Foreign Key constraint checks are not guaranteed when data is loaded into the table. You must ensure the data loaded into the table meets the constraints.
The following example defines multiple Foreign Keys when creating the table lineorder
:
CREATE TABLE `lineorder` (
`lo_orderkey` int(11) NOT NULL COMMENT "",
`lo_linenumber` int(11) NOT NULL COMMENT "",
`lo_custkey` int(11) NOT NULL COMMENT "",
`lo_partkey` int(11) NOT NULL COMMENT "",
`lo_suppkey` int(11) NOT NULL COMMENT "",
`lo_orderdate` int(11) NOT NULL COMMENT "",
`lo_orderpriority` varchar(16) NOT NULL COMMENT "",
`lo_shippriority` int(11) NOT NULL COMMENT "",
`lo_quantity` int(11) NOT NULL COMMENT "",
`lo_extendedprice` int(11) NOT NULL COMMENT "",
`lo_ordtotalprice` int(11) NOT NULL COMMENT "",
`lo_discount` int(11) NOT NULL COMMENT "",
`lo_revenue` int(11) NOT NULL COMMENT "",
`lo_supplycost` int(11) NOT NULL COMMENT "",
`lo_tax` int(11) NOT NULL COMMENT "",
`lo_commitdate` int(11) NOT NULL COMMENT "",
`lo_shipmode` varchar(11) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`lo_orderkey`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`lo_orderkey`)
PROPERTIES (
-- Define Unique Keys in unique_constraints
"unique_constraints" = "lo_orderkey,lo_linenumber",
-- Define Foreign Keys in foreign_key_constraints
"foreign_key_constraints" = "
(lo_custkey) REFERENCES customer(c_custkey);
(lo_partkey) REFERENCES ssb.part(p_partkey);
(lo_suppkey) REFERENCES supplier(s_suppkey);
(lo_orderdate) REFERENCES dates(d_datekey)
"
);
Configure query rewrite
You can configure the asynchronous materialized view query rewrite through the following session variables:
Variable | Default | Description |
---|---|---|
enable_materialized_view_union_rewrite | true | Boolean value to control if to enable materialized view Union query rewrite. |
enable_rule_based_materialized_view_rewrite | true | Boolean value to control if to enable rule-based materialized view query rewrite. This variable is mainly used in single-table query rewrite. |
nested_mv_rewrite_max_level | 3 | The maximum levels of nested materialized views that can be used for query rewrite. Type: INT. Range: [1, +∞). The value of 1 indicates that materialized views created on other materialized views will not be used for query rewrite. |
Check if a query is rewritten
You can check if your query is rewritten by viewing its query plan using the EXPLAIN statement. If the field TABLE
under the section OlapScanNode
shows the name of the corresponding materialized view, it means that the query has been rewritten based on the materialized view.
mysql> EXPLAIN SELECT
order_id, sum(goods.price) AS total
FROM order_list INNER JOIN goods
ON goods.item_id1 = order_list.item_id2
GROUP BY order_id;
+------------------------------------+
| Explain String |
+------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS:1: order_id | 8: sum |
| PARTITION: RANDOM |
| |
| RESULT SINK |
| |
| 1:Project |
| | <slot 1> : 9: order_id |
| | <slot 8> : 10: total |
| | |
| 0:OlapScanNode |
| TABLE: order_mv |
| PREAGGREGATION: ON |
| partitions=1/1 |
| rollup: order_mv |
| tabletRatio=0/12 |
| tabletList= |
| cardinality=3 |
| avgRowSize=4.0 |
| numNodes=0 |
+------------------------------------+
20 rows in set (0.01 sec)
Manage an asynchronous materialized view
Alter an asynchronous materialized view
You can alter the property of an asynchronous materialized view using ALTER MATERIALIZED VIEW.
Rename an asynchronous materialized view.
ALTER MATERIALIZED VIEW order_mv RENAME order_total;
Alter the refreshing interval of an asynchronous materialized view to 2 days.
ALTER MATERIALIZED VIEW order_mv REFRESH ASYNC EVERY(INTERVAL 2 DAY);
Show asynchronous materialized views
You can view the asynchronous materialized views in your database by using SHOW MATERIALIZED VIEWS or querying the system metadata table in Information Schema.
Check all asynchronous materialized views in your database.
SHOW MATERIALIZED VIEWS;
Check a specific asynchronous materialized view.
SHOW MATERIALIZED VIEWS WHERE NAME = "order_mv";
Check specific asynchronous materialized views by matching the name.
SHOW MATERIALIZED VIEWS WHERE NAME LIKE "order%";
Check all asynchronous materialized views by querying the metadata table in Information Schema.
SELECT * FROM information_schema.materialized_views;
Check the definition of asynchronous materialized view
You can check the query used to create an asynchronous materialized view via SHOW CREATE MATERIALIZED VIEW.
SHOW CREATE MATERIALIZED VIEW order_mv;
Check the execution status of asynchronous materialized view
You can check the execution (building or refreshing) status of an asynchronous materialized view by querying the tasks
and task_runs
metadata tables.
The following example checks the execution status of the materialized view that was created most recently:
Check the
TASK_NAME
of the most recent task in the tabletasks
.mysql> select * from information_schema.tasks order by CREATE_TIME desc limit 1\G; *************************** 1. row *************************** TASK_NAME: mv-59299 CREATE_TIME: 2022-12-12 17:33:51 SCHEDULE: MANUAL DATABASE: ssb_1 DEFINITION: insert overwrite hive_mv_lineorder_flat_1 SELECT `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_orderkey`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_linenumber`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_custkey`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_partkey`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_orderpriority`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_ordtotalprice`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_revenue`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`p_mfgr`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`s_nation`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`c_city`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`c_nation`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_orderdate` FROM `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc` WHERE `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_orderdate` = '1997-01-01' EXPIRE_TIME: NULL 1 row in set (0.02 sec)
Check the execution status in the table
task_runs
using theTASK_NAME
you have found.mysql> select * from information_schema.task_runs where task_name='mv-59299' order by CREATE_TIME \G; *************************** 1. row *************************** QUERY_ID: d9cef11f-7a00-11ed-bd90-00163e14767f TASK_NAME: mv-59299 CREATE_TIME: 2022-12-12 17:39:19 FINISH_TIME: 2022-12-12 17:39:22 STATE: SUCCESS DATABASE: ssb_1 DEFINITION: insert overwrite hive_mv_lineorder_flat_1 SELECT `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_orderkey`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_linenumber`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_custkey`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_partkey`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_orderpriority`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_ordtotalprice`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_revenue`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`p_mfgr`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`s_nation`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`c_city`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`c_nation`, `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_orderdate` FROM `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc` WHERE `hive_ci`.`dla_scan`.`lineorder_flat_1000_1000_orc`.`lo_orderdate` = '1997-01-01' EXPIRE_TIME: 2022-12-15 17:39:19 ERROR_CODE: 0 ERROR_MESSAGE: NULL PROGRESS: 100% 2 rows in set (0.02 sec)
Drop an asynchronous materialized view
You can drop an asynchronous materialized view via DROP MATERIALIZED VIEW.
DROP MATERIALIZED VIEW order_mv;
Relevant session variables
The following variables control the behavior of an asynchronous materialized view:
analyze_mv
: Whether and how to analyze the materialized view after refresh. Valid values are an empty string (Do not analyze),sample
(Sampled statistics collection), andfull
(Full statistics collection). Default issample
.enable_materialized_view_rewrite
: Whether to enable the automatic rewrite for materialized view. Valid values aretrue
(Default) andfalse
.
- Asynchronous materialized views