Colocate Join
In a shuffle join or broadcast join, the rows of the two joining tables that satisfy the join condition must be brought together on a single node to complete the join. Neither of these two join methods can avoid the latency and overhead of transferring data between nodes over the network.
The core idea of Colocate Join is to keep the bucketing key, the number of replicas, and the replica placement consistent across tables in the same Colocation Group. If the join columns are the bucketing key, each computing node only needs to do a local join, without fetching data from other nodes.
This document introduces the principle, implementation, usage, and considerations of Colocate Join.
Terminology
- Colocation Group (CG): A CG contains one or more tables. The tables within a CG share the same bucketing and replica placement, which are described by the Colocation Group Schema.
- Colocation Group Schema (CGS): A CGS contains the bucketing key, number of buckets, and number of replicas of a CG.
Principle
Colocate Join forms a CG from a set of tables that share the same CGS, and ensures that the corresponding bucket replicas of these tables are placed on the same set of BE nodes. When tables in the CG join on the bucketing columns, the local data can be joined directly, saving the time of transferring data between nodes.
The Bucket Seq is obtained by `hash(key) mod buckets`. Suppose a table has 8 buckets; its Bucket Seqs are then [0, 1, 2, 3, 4, 5, 6, 7]. For example, a row whose bucketing key hashes to 42 falls into Bucket Seq 42 mod 8 = 2. Each bucket contains one or more tablets, and the number of tablets depends on the number of partitions: a multi-partition table has multiple tablets per bucket.
In order to have the same data distribution, tables within the same CG must comply with the following constraints (see the sketch after this list):
- Tables within the same CG must have an identical bucketing key (type, number, and order of columns) and the same number of buckets, so that the data slices of the tables can be distributed and placed in one-to-one correspondence. The bucketing key is the set of columns specified by `DISTRIBUTED BY HASH(col1, col2, ...)` in the table creation statement; it determines which columns of data are hashed into the different Bucket Seqs. The column names of the bucketing key may differ between tables in the same CG, but the data types at the corresponding positions in `DISTRIBUTED BY HASH(col1, col2, ...)` must be exactly the same.
- Tables within the same CG must have the same number of partition replicas. Otherwise, a tablet replica may have no corresponding replica in a partition of another table on the same BE.
- Tables within the same CG may have different numbers of partitions and different partition keys.
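For instance, the following minimal sketch (with hypothetical tables `orders` and `order_items` and a hypothetical group `demo_group`) shows two tables that satisfy these constraints: the bucketing column names differ, but their types, the number of buckets, and the replication factor match, while the partitioning schemes differ:

```sql
-- Hypothetical tables sharing one Colocation Group. The bucketing
-- columns have different names (order_id vs. oid) but the same type
-- (BIGINT), the same bucket count (8), and the same replication
-- factor (3), satisfying the CGS constraints.
CREATE TABLE orders (
    order_id BIGINT,
    dt DATE,
    amount DECIMAL(10, 2)
)
PARTITION BY RANGE (dt) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01")
)
DISTRIBUTED BY HASH(order_id)
BUCKETS 8
PROPERTIES (
    "colocate_with" = "demo_group",
    "replication_num" = "3"
);

-- A non-partitioned table can join the same group: the number of
-- partitions and the partition keys are allowed to differ.
CREATE TABLE order_items (
    oid BIGINT,
    sku VARCHAR(64),
    qty INT
)
DISTRIBUTED BY HASH(oid)
BUCKETS 8
PROPERTIES (
    "colocate_with" = "demo_group",
    "replication_num" = "3"
);
```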
When creating a table, the CG is specified by the attribute `"colocate_with" = "group_name"` in the table PROPERTIES. If the CG does not exist, the table is the first table of the CG and is called the Parent Table. The data distribution of the Parent Table (the type, number, and order of the bucketing key columns, the number of replicas, and the number of buckets) determines the CGS. If the CG already exists, CelerData checks whether the data distribution of the new table is consistent with the CGS.
The replica placement of tables within the same CG satisfies:
- The mapping between the Bucket Seqs and BE nodes of every table is the same as that of the Parent Table.
- The mapping between the Bucket Seqs and BE nodes of every partition in the Parent Table is the same as that of its first partition.
- The mapping between the Bucket Seqs and BE nodes of the first partition of the Parent Table is determined using the native Round Robin algorithm.
This consistent data distribution and mapping guarantee that data rows with the same bucketing key values fall on the same BE, so joining on the bucketing key columns requires only local joins.
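To illustrate, joining the two hypothetical tables from the earlier sketch on their bucketing columns can be completed locally on each BE. Whether the optimizer actually chose a colocate join can be checked by inspecting the query plan with `EXPLAIN`:

```sql
-- Join on the bucketing columns of two tables in the same CG:
-- matching rows are guaranteed to reside on the same BE, so each
-- node joins its local buckets without a network shuffle.
SELECT o.order_id, o.amount, i.sku, i.qty
FROM orders AS o
JOIN order_items AS i ON o.order_id = i.oid;

-- Inspect the plan to verify how the join is executed.
EXPLAIN
SELECT o.order_id, o.amount, i.sku, i.qty
FROM orders AS o
JOIN order_items AS i ON o.order_id = i.oid;
```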
Usage
Table creation
When creating a table, you can specify the attribute `"colocate_with" = "group_name"` in PROPERTIES to indicate that the table is a Colocate Join table and belongs to the specified Colocation Group.
For example:
```sql
CREATE TABLE tbl (k1 int, v1 int sum)
DISTRIBUTED BY HASH(k1)
BUCKETS 8
PROPERTIES(
    "colocate_with" = "group1"
);
```
If the specified Group does not exist, CelerData automatically creates a Group that contains only the current table. If the Group exists, CelerData checks whether the current table meets the Colocation Group Schema. If it does, CelerData creates the table and adds it to the Group; the table's partitions and tablets are then created according to the data distribution rules of the existing Group.
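For example, a second table (a hypothetical `tbl2`) can join group1 as long as its data distribution is consistent with the Group:

```sql
-- tbl2 matches tbl: same bucketing column type (int) and the same
-- bucket count (8), so it is added to the existing group1.
CREATE TABLE tbl2 (k2 int, v2 int sum)
DISTRIBUTED BY HASH(k2)
BUCKETS 8
PROPERTIES(
    "colocate_with" = "group1"
);
```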
The Group belongs to a Database, and the name of the Group is unique within the Database. The full name of the Group is dbId_groupName in internal storage, but users only perceive the groupName.
Delete
A complete deletion is a deletion from the Recycle Bin. Normally, after a table is deleted with the `DROP TABLE` command, it stays in the recycle bin for one day by default before being deleted completely. When the last table in a Group is completely deleted, the Group is also deleted automatically.
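As a sketch, dropping the example tables sends them to the recycle bin first; a table only counts as completely deleted, for the purpose of Group cleanup, once it leaves the recycle bin:

```sql
-- By default the dropped table stays in the recycle bin for a day.
DROP TABLE tbl;

-- FORCE skips the recycle bin and deletes the table immediately
-- (the table cannot be recovered afterwards).
DROP TABLE tbl2 FORCE;
```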