Bitmap indexes
This topic describes how to create and manage a bitmap index, along with usage cases.
Introduction
A bitmap index is a special database index that uses bitmaps, which are an array of bits. A bit is always in one of two values: 0 and 1. Each bit in the bitmap corresponds to a single row in the table. The value of each bit depends upon the value of the corresponding row.
A bitmap index can help improve the query performance on a given column. If a query's filter conditions match a prefix index, it can significantly improve query efficiency and quickly return results. However, a table can only have one prefix index. If the query's filter conditions do not include the prefix of the prefix index, a bitmap index can be created for that column to improve query efficiency.
How to design a bitmap index to accelerate queries
The primary considerations for choosing a bitmap index are column cardinality and the filtering effect of the bitmap index on queries. Contrary to popular belief, bitmap indexes in StarRocks are more suitable for queries on columns with high cardinality and queries on the combination of multiple low cardinality columns. Furthermore, bitmap indexes need to effectively filter out data, potentially filtering out at least 999/1000 of the data, thereby reducing the amount of Page data read.
For queries on a single low cardinality column, the filtering effect of a bitmap index is poor, because too many rows need to be read and scattered across multiple Pages.
You need to consider the cost of loading data when evaluating the filtering effect of bitmap indexes on queries. In StarRocks, underlying data is organized and loaded by Pages (default size is 64K). The cost of loading data includes the time to load Pages from disk, decompress Pages, and decode.
However, excessively high cardinality can also cause issues such as occupying more disk space, and because bitmap indexes need to be constructed during data loading, frequent data loading can affect loading performance.
Additionally, the overhead of loading bitmap indexes during queries should be considered. During a query, bitmap indexes are loaded on demand, and the larger the value of number of column values involved in query conditions/cardinality x bitmap index
, the greater the overhead of loading bitmap indexes during queries.
To determine the appropriate cardinality and query conditions for bitmap indexes, it is recommended to refer to the Performance test on bitmap index in this topic to conduct performance tests. You can use actual business data and queries to create bitmap indexes on columns of different cardinalities, to analyze the filtering effect of bitmap indexes on queries (at least filtering out 999/1000 of the data), the disk space usage, the impact on loading performance, and the overhead of loading bitmap indexes during queries.
StarRocks has a built-in adaptive selection mechanism for bitmap indexes. If a bitmap index fails to accelerate queries, for example, if it cannot filter out many Pages, or the overhead of loading bitmap indexes during queries is high, it will not be used during the query, so query performance will not be significantly affected.
Adaptive selection of bitmap indexes
StarRocks can adaptively choose whether to use a bitmap index based on column cardinality and query conditions. If a bitmap index does not effectively filter out many Pages or the overhead of loading bitmap indexes during queries is high, StarRocks will not use the bitmap index by default to avoid degrading query performance.
StarRocks determines whether to use a bitmap index based on the ratio of the number of values involved in the query condition to the column cardinality. Generally, the smaller this ratio, the better the filtering effect of the bitmap index. Thus, StarRocks uses bitmap_max_filter_ratio/1000
as the threshold. When the number of values in the filter condition/column cardinality
is less than bitmap_max_filter_ratio/1000
, the bitmap index will be used. The default value of bitmap_max_filter_ratio
is 1
.
Take a query based on a single column as an example, such as SELECT * FROM employees WHERE gender = 'male';
. The gender
column in the employees
table has values 'male' and 'female', so the cardinality is 2 (two distinct values). The query condition involves one value, so the ratio is 1/2, which is greater than 1/1000. Therefore, this query will not use the bitmap index.
Take another query based on a combination of multiple columns as an example, such as SELECT * FROM employees WHERE gender = 'male' AND city IN ('Beijing', 'Shanghai');
. The cardinality of the city
column is of 10,000, and the query condition involves two values, so the ratio is calculated as (1*2)/(2*10000)
, which is less than 1/1000. Therefore, this query will use the bitmap index.
The value range for bitmap_max_filter_ratio
is 1-1000. If bitmap_max_filter_ratio
is set to 1000
, any query on a column with a bitmap index will be forced to use the bitmap index.
Advantages
- Bitmap indexes can quickly locate the row numbers of queried column values, suitable for point queries or small-range queries.
- Bitmap indexes can optimize multi-dimensional queries involving union and intersection operations (
OR
andAND
operations).
Considerations
Queries that can be optimized
Bitmap indexes are suitable for optimizing equality =
queries, [NOT] IN
range queries, >
, >=
, <
, <=
queries, and IS NULL
queries. They are not suitable for optimizing !=
and [NOT] LIKE
queries.
Supported columns and data types
Bitmap indexes can be created on all columns in primary key and duplicate key tables, and on key columns in aggregate and unique key tables. Bitmap indexes can be created on the columns of the following data types:
- Date types: DATE, DATETIME.
- Numeric types: TINYINT, SMALLINT, INT, BIGINT, LARGEINT, DECIMAL, BOOLEAN.
- String types: CHAR, STRING, VARCHAR.
- Other types: HLL.
Basic operations
Create an index
-
Create a bitmap index during table creation.
CREATE TABLE `lineorder_partial` (
`lo_orderkey` int(11) NOT NULL COMMENT "",
`lo_orderdate` int(11) NOT NULL COMMENT "",
`lo_orderpriority` varchar(16) NOT NULL COMMENT "",
`lo_quantity` int(11) NOT NULL COMMENT "",
`lo_revenue` int(11) NOT NULL COMMENT "",
INDEX lo_orderdate_index (lo_orderdate) USING BITMAP
) ENGINE=OLAP
DUPLICATE KEY(`lo_orderkey`)
DISTRIBUTED BY HASH(`lo_orderkey`) BUCKETS 1;In this example, a bitmap index named
lo_orderdate_index
is created on thelo_orderdate
column. Naming requirements for bitmap indexes can be found in System Limits. Identical bitmap indexes cannot be created within the same table.Multiple bitmap indexes can be created for multiple columns, separated by commas (,).
noteFor more parameters of table creation, refer to CREATE TABLE.
-
CREATE INDEX
can be used to create a bitmap index after table creation. For detailed parameter descriptions and examples, refer to CREATE INDEX.CREATE INDEX lo_quantity_index ON lineorder_partial (lo_quantity) USING BITMAP;
Progress of creating an index
Creating a bitmap index is an asynchronous process. After executing the index creation statement, you can check the index creation progress using the SHOW ALTER TABLE command. When the State
field in the returned value shows FINISHED
, the index is successfully created.
SHOW ALTER TABLE COLUMN;
Each table can only have one ongoing Schema Change task at a time. You cannot create a new bitmap index until the current bitmap index is created.
View an index
View all bitmap indexes for a specified table. For detailed parameters and returned results, refer to SHOW INDEX.
SHOW INDEXES FROM lineorder_partial;
Creating a bitmap index is an asynchronous process. Using the above statement, you can only view indexes that have successfully finished being created.
Delete an index
Delete a bitmap index for a specified table. For detailed parameters and examples, refer to DROP INDEX.
DROP INDEX lo_orderdate_index ON lineorder_partial;
Verify whether the bitmap index accelerates queries
Check the BitmapIndexFilterRows
field in the query Profile. For information on viewing the Profile, refer to Query analysis.
Performance test on bitmap index
Objective of testing
To analyze the filtering effect and other impacts, such as disk usage, of bitmap indexes on queries with different cardinalities:
- Query on a single column of low cardinality
- Query on the combination of multiple columns of low cardinality
- Query on a single column of high cardinality
This section also compares the performance between always using bitmap index and using bitmap indexes adaptively to verify the effectiveness of StarRocks' adaptive selection of bitmap indexes.
Create table and bitmap index
To avoid caching Page data affecting query performance, ensure that the BE configuration item disable_storage
is set to true
.
This section takes the table lineorder
(SSB 20G) as an example.
-
Original table (without bitmap indexes) as a reference
CREATE TABLE `lineorder_without_index` (
`lo_orderkey` int(11) NOT NULL COMMENT "",
`lo_linenumber` int(11) NOT NULL COMMENT "",
`lo_custkey` int(11) NOT NULL COMMENT "",
`lo_partkey` int(11) NOT NULL COMMENT "",
`lo_suppkey` int(11) NOT NULL COMMENT "",
`lo_orderdate` int(11) NOT NULL COMMENT "",
`lo_orderpriority` varchar(16) NOT NULL COMMENT "",
`lo_shippriority` int(11) NOT NULL COMMENT "",
`lo_quantity` int(11) NOT NULL COMMENT "",
`lo_extendedprice` int(11) NOT NULL COMMENT "",
`lo_ordtotalprice` int(11) NOT NULL COMMENT "",
`lo_discount` int(11) NOT NULL COMMENT "",
`lo_revenue` int(11) NOT NULL COMMENT "",
`lo_supplycost` int(11) NOT NULL COMMENT "",
`lo_tax` int(11) NOT NULL COMMENT "",
`lo_commitdate` int(11) NOT NULL COMMENT "",
`lo_shipmode` varchar(11) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`lo_orderkey`)
DISTRIBUTED BY HASH(`lo_orderkey`) BUCKETS 1; -
Table with bitmap indexes: Bitmap indexes created based on
lo_shipmode
,lo_quantity
,lo_discount
,lo_orderdate
,lo_tax
, andlo_partkey
.CREATE TABLE `lineorder_with_index` (
`lo_orderkey` int(11) NOT NULL COMMENT "",
`lo_linenumber` int(11) NOT NULL COMMENT "",
`lo_custkey` int(11) NOT NULL COMMENT "",
`lo_partkey` int(11) NOT NULL COMMENT "",
`lo_suppkey` int(11) NOT NULL COMMENT "",
`lo_orderdate` int(11) NOT NULL COMMENT "",
`lo_orderpriority` varchar(16) NOT NULL COMMENT "",
`lo_shippriority` int(11) NOT NULL COMMENT "",
`lo_quantity` int(11) NOT NULL COMMENT "",
`lo_extendedprice` int(11) NOT NULL COMMENT "",
`lo_ordtotalprice` int(11) NOT NULL COMMENT "",
`lo_discount` int(11) NOT NULL COMMENT "",
`lo_revenue` int(11) NOT NULL COMMENT "",
`lo_supplycost` int(11) NOT NULL COMMENT "",
`lo_tax` int(11) NOT NULL COMMENT "",
`lo_commitdate` int(11) NOT NULL COMMENT "",
`lo_shipmode` varchar(11) NOT NULL COMMENT "",
INDEX i_shipmode (`lo_shipmode`) USING BITMAP,
INDEX i_quantity (`lo_quantity`) USING BITMAP,
INDEX i_discount (`lo_discount`) USING BITMAP,
INDEX i_orderdate (`lo_orderdate`) USING BITMAP,
INDEX i_tax (`lo_tax`) USING BITMAP,
INDEX i_partkey (`lo_partkey`) USING BITMAP
) ENGINE=OLAP
DUPLICATE KEY(`lo_orderkey`)
DISTRIBUTED BY HASH(`lo_orderkey`) BUCKETS 1;
Disk space usage of bitmap indexes
lo_shipmode
: String type, cardinality 7, occupies 130Mlo_quantity
: Integer type, cardinality 50, occupies 291Mlo_discount
: Integer type, cardinality 11, occupies 198Mlo_orderdate
: Integer type, cardinality 2406, occupies 191Mlo_tax
: Integer type, cardinality 9, occupies 160Mlo_partkey
: Integer type, cardinality 600,000, occupies 601M
Query 1: Query on single column of low cardinality
Query table without bitmap index
Query:
SELECT count(1) FROM lineorder_without_index WHERE lo_shipmode="MAIL";
Query performance analysis: Since the table queried does not have bitmap index, all pages containing the lo_shipmode
column data need to be read and then predicate filtering is applied.
Total Time: Approximately 0.91 seconds, with data loading taking 0.47 seconds, decoding dictionary for low cardinality optimization taking 0.31 seconds, and predicate filtering taking 0.23 seconds.
PullRowNum: 20.566M (20566493) // Number of rows in the result set.
CompressedBytesRead: 55.283 MB // Total amount of data read.
RawRowsRead: 143.999M (143999468) // Number of rows read. Since there is no bitmap index, all data in this column is read.
ReadPagesNum: 8.795K (8795) // Number of pages read. Since there is no bitmap index, all pages containing data for this column are read.
IOTaskExecTime: 914ms // Total time for scanning data.
BlockFetch: 469ms // Time for loading data.
DictDecode: 311.612ms // Time for decoding dictionary for low cardinality optimization.
PredFilter: 23.182ms // Time for predicate filtering.
PredFilterRows: 123.433M (123432975) // Number of rows filtered out.
Query table with bitmap index
Use bitmap index compulsorily
To use the bitmap index compulsorily, according to Starrocks' configuration, bitmap_max_filter_ratio=1000
needs to be set in the be.conf file of each BE node, and then the BE nodes need to be restarted.
Query:
SELECT count(1) FROM lineorder_with_index WHERE lo_shipmode="MAIL";
Query performance analysis: Since the column queried is of low cardinality, bitmap index does not filter the data efficiently. Even though bitmap index can quickly locate the row numbers of actual data, a large number of rows need to be read, scattered across multiple pages. As a result, it cannot effectively filter out the pages that need to be read. Moreover, additional overhead for loading the bitmap index and using the bitmap index to filter data is incurred, resulting in a longer total time.
Total time: 2.077 seconds, with 0.93 seconds spent loading data and bitmap index, 0.33 seconds on decoding dictionary for low cardinality optimization, 0.42 seconds on filtering data with bitmap index, and 0.17 seconds on filtering data with ZoneMap Index.
PullRowNum: 20.566M (20566493) // Number of rows in the result set.
CompressedBytesRead: 72.472 MB // Total amount of data read. The size of the bitmap index for this column is 130MB, with 7 unique values. The size of a bitmap index for each value is 18MB. The data size of Pages is 55MB = 73MB.
RawRowsRead: 20.566M (20566493) // Number of rows read. 20 million rows are actually read.
ReadPagesNum: 8.802K (8802) // Number of Pages read. All Pages are read because the 20 million rows filtered by bitmap index are randomly distributed across different Pages, and Page is the smallest data reading unit. Therefore, bitmap index does not filter out Pages.
IOTaskExecTime: 2s77ms // Total time for scanning data, longer than without bitmap index.
BlockFetch: 931.144ms // Time for loading data and bitmap index. Compared to the previous query, an additional 400ms was spent to load bitmap index (18MB).
DictDecode: 329.696ms // Time for decoding dictionary for low cardinality optimization is similar since the output row count is the same.
BitmapIndexFilter: 419.308ms // Time for filtering data with the bitmap index.
BitmapIndexFilterRows: 123.433M (123432975) // Number of rows filtered by the bitmap index.
ZoneMapIndexFiter: 171.580ms // Time for filtering data with ZoneMap Index.