Data cache warmup
Some data lake analytics and shared-data cluster scenarios have high performance requirements for queries, such as BI reports and proof of concept (PoC) performance testing. Loading remote data into local data cache can avoid the need to fetch the same data multiple times, significantly speeding up query execution and minimizing resource usage.
StarRocks v3.3 introduces the Data Cache Warmup feature, which is an enhancement to Data Cache. Data Cache is a process of passively populating the cache, in which data is written to the cache during data querying. Data Cache Warmup, however, is an active process of populating the cache. It proactively fetches the desired data from remote storage in advance.
Scenarios
- The disk used for data cache has a storage capacity much larger than the amount of data to warm up. If the disk capacity is less than the data to warm up, the expected warmup effect cannot be achieved. For example, if 100 GB data needs to be warmed up but the disk has only 50 GB of space, then only 50 GB data can be loaded to the cache and the previously loaded 50 GB data will be replaced by the 50 GB data that is loaded later.
- Data access on the disk used for data cache is relatively stable. If there is a surge in the access volume, the expected warmup effect cannot be achieved. For example, if 100 GB data needs to be warmed up and the disk has 200 GB of space, then the first condition is met. However, if a large amount of new data (150 GB) is written to the cache during the warmup process, or if an unexpected large cold query needs to load 150 GB data to the cache, it may result in the eviction of the warmed data.
How it works
StarRocks provides the CACHE SELECT syntax to implement Data Cache Warmup. Before using CACHE SELECT, make sure that the Data Cache feature has been enabled.
Syntax of CACHE SELECT:
CACHE SELECT <column_name> [, ...]
FROM [<catalog_name>.][<db_name>.]<table_name> [WHERE <boolean_expression>]
[PROPERTIES("verbose"="true")]
Parameters:
column_name: The columns to fetch. You can use*to fetch all columns in the external table.catalog_name: The name of the external catalog, required only when querying external tables in data lakes. If you have switched to the external catalog using SET CATALOG, it can be left unspecified.db_name: The name of the database. If you have switched to that database, it can be left unspecified.table_name: The name of the table from which to fetch data.boolean_expression: The filter condition.PROPERTIES: Currently, only theverboseproperty is supported. It is used to return detailed warmup metrics.
CACHE SELECT is a synchronous process and it can warm up only one table at a time. Upon successful execution, it will return warmup-related metrics.
Warm up all data in an external table
The following example loads all data from external table lineitem:
mysql> cache select * from hive_catalog.test_db.lineitem;
+-----------------+------------------+----------------------+-------------------+
| READ_CACHE_SIZE | WRITE_CACHE_SIZE | AVG_WRITE_CACHE_TIME | TOTAL_CACHE_USAGE |
+-----------------+------------------+----------------------+-------------------+
| 48.2MB | 3.7GB | 59ms | 96.83% |
+-----------------+------------------+----------------------+-------------------+
1 row in set (19.56 sec)
Return fields:
READ_CACHE_SIZE: The total size of data read from the data cache by all nodes.WRITE_CACHE_SIZE: The total size of data written to the data cache by all nodes.AVG_WRITE_CACHE_TIME: The average time taken by each node to write data to the data cache.TOTAL_CACHE_USAGE: The space usage of the data cache of the entire cluster after this warmup task is complete. This metric can be used to assess whether the data cache has sufficient space.