BROKER LOAD
Description
CelerData provides the MySQL-based loading method Broker Load. After you submit a load job, CelerData asynchronously runs the job. You can use SELECT * FROM information_schema.loads to check the load progress.
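For example, a minimal sketch of such a check (the label and database names are hypothetical, and the loads view is assumed to expose LABEL and DATABASE_NAME columns for filtering):
SELECT * FROM information_schema.loads
WHERE DATABASE_NAME = 'test_db' AND LABEL = 'label1';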
Syntax
LOAD LABEL [<database_name>.]<label_name>
(
data_desc[, data_desc ...]
)
WITH BROKER
(
StorageCredentialParams
)
[PROPERTIES
(
opt_properties
)
]
Note that in CelerData some literals are used as reserved keywords by the SQL language. Do not directly use these keywords in SQL statements. If you want to use such a keyword in an SQL statement, enclose it in a pair of backticks (`). See Keywords.
Parameters
database_name and label_name
label_name specifies the label of the load job.
database_name optionally specifies the name of the database to which the destination table belongs.
Each load job has a label that is unique across the entire database. You can use the label of a load job to view the execution status of the load job and prevent repeatedly loading the same data. When a load job enters the FINISHED state, its label cannot be reused. Only the label of a load job that has entered the CANCELLED state can be reused. In most cases, the label of a load job is reused to retry that load job and load the same data, thereby implementing Exactly-Once semantics.
data_desc
The description of a batch of data to be loaded. Each data_desc descriptor declares information such as the data source, ETL functions, destination table, and destination partitions.
Batch loading supports loading multiple data files at a time. In one load job, you can use multiple data_desc descriptors to declare multiple data files you want to load, or use one data_desc descriptor to declare one file path from which you want to load all data files in it. Batch loading can also ensure the transactional atomicity of each job that loads multiple data files. Atomicity means that the loading of multiple data files in one load job must all succeed or fail. It never happens that the loading of some data files succeeds while the loading of the other files fails.
data_desc supports the following syntax:
DATA INFILE ("<file_path>"[, "<file_path>" ...])
[NEGATIVE]
INTO TABLE <table_name>
[PARTITION (<partition_name>[, <partition_name> ...])]
[FORMAT AS "CSV | Parquet | ORC"]
[COLUMNS TERMINATED BY "<column_separator>"]
[(column_list)]
[COLUMNS FROM PATH AS (<partition_field_name>[, <partition_field_name> ...])]
[SET <destination_column_name=function(temp_column_name)>[, <destination_column_name=function(temp_column_name)> ...]]
[WHERE predicate]
data_desc must include the following parameters:
DATA INFILE
Specifies the save path of one or more data files you want to load.
AWS S3 can be accessed according to the S3 or S3A protocol. Therefore, you can include s3:// or s3a:// as the prefix in the S3 URI that you pass as the file path (DATA INFILE). For example, you can specify this parameter as "s3a://<bucket_name>/<folder_name>/<folder_name>/<file_name>" to load a single data file from the specified path. You can also specify this parameter as "s3a://<bucket_name>/<folder_name>/<folder_name>/<wildcard>" to load all data files from the specified path. The supported wildcards are ?, *, [], {}, and ^. For more information, see Wildcard reference.
NOTE
Wildcards can also be used to specify intermediate paths.
The parameters in file_path are as follows:
- bucket_name: the name of the S3 bucket in which your data files are stored.
- folder_name: the name of the folder in which your data files are stored.
- file_name: the name of the data file that you want to load.
- wildcard: the wildcard that you want to use in the file path.
Different types of Azure cloud storage allow access via different protocols:
- When you load data from Blob Storage, you must use the wasb or wasbs protocol to access your data:
  - If your storage account allows access over HTTP, use the wasb protocol and write the file path as wasb://<container_name>@<storage_account_name>.blob.core.windows.net/<path>/<file_name>/*.
  - If your storage account allows access over HTTPS, use the wasbs protocol and write the file path as wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<path>/<file_name>/*.
- When you load data from Data Lake Storage Gen2, you must use the abfs or abfss protocol to access your data:
  - If your storage account allows access over HTTP, use the abfs protocol and write the file path as abfs://<container_name>@<storage_account_name>.dfs.core.windows.net/<file_name>.
  - If your storage account allows access over HTTPS, use the abfss protocol and write the file path as abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<file_name>.
- When you load data from Data Lake Storage Gen1, you must use the adl protocol to access your data and write the file path as adl://<data_lake_storage_gen1_name>.azuredatalakestore.net/<path>/<file_name>.
INTO TABLE
Specifies the name of the destination table.
data_desc can also optionally include the following parameters:
NEGATIVE
Revokes the loading of a specific batch of data. To achieve this, you need to load the same batch of data with the NEGATIVE keyword specified.
NOTE
This parameter is valid only when the destination table uses the Aggregate table type and all its value columns are computed by the sum function.
PARTITION
Specifies the partitions into which you want to load data. By default, if you do not specify this parameter, the source data will be loaded into all partitions of the destination table.
FORMAT AS
Specifies the format of the data file. Valid values: CSV, Parquet, and ORC. If you include the filename extension .csv, .parquet, or .orc in <file_path>, you can skip this parameter. Otherwise, you must specify this parameter.
COLUMNS TERMINATED BY
Specifies the column separator used in the data file. If you do not specify this parameter, this parameter defaults to \t, indicating tab. The column separator you specify must be the same as the column separator used in the data file. Otherwise, the load job fails due to inadequate data quality, and its State is displayed as CANCELLED.
NOTE
- You need to set this parameter only when you load CSV data.
- For CSV data, you can use a UTF-8 string, such as a comma (,), tab, or pipe (|), whose length does not exceed 50 bytes, as the text delimiter.
column_list
Specifies the column mapping between the data file and the destination table. Syntax: (<column_name>[, <column_name> ...]). The columns declared in column_list are mapped by name onto the destination table columns.
NOTE
If the columns of the data file are mapped in sequence onto the columns of the destination table, you do not need to specify column_list. For more information, see the "Column mapping" section of this topic.
If you want to skip a specific column of the data file, you only need to temporarily give that column a name that is different from any of the destination table column names.
SET
Specifies one or more functions that you want to use to convert a column of the data file. Examples:
- The destination table consists of three columns, which are col1, col2, and col3 in sequence. The data file consists of four columns, among which the first two columns are mapped in sequence onto col1 and col2 of the destination table and the sum of the last two columns is mapped onto col3 of the destination table. In this case, you can specify column_list as (col1,col2,tmp_col3,tmp_col4) and pass (col3=tmp_col3+tmp_col4) into the SET clause to implement data conversion.
- The destination table consists of three columns, which are year, month, and day in sequence. The data file consists of only one column that accommodates date and time values in yyyy-mm-dd hh:mm:ss format. In this case, you can specify column_list as (tmp_time) and pass (year = year(tmp_time), month = month(tmp_time), day = day(tmp_time)) into the SET clause to implement data conversion.
WHERE
Specifies the conditions based on which you want to filter the source data. CelerData loads only the source data that meets the filter conditions specified in the WHERE clause.
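Putting these parameters together, a hedged sketch of a single data_desc descriptor might look like the following (the bucket, path, table, partition, and column names are all hypothetical):
DATA INFILE ("s3a://my_bucket/input/2024-01-*/*.csv")
INTO TABLE sales
PARTITION (p202401)
COLUMNS TERMINATED BY ","
(region, tmp_amount, tmp_tax)
SET (total = tmp_amount + tmp_tax)
WHERE region = "east"
This sketch reads all CSV files that match the wildcard path, names the three file columns, computes the destination column total from the two temporary columns, and filters rows before they are loaded.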
StorageCredentialParams
A set of parameters about how your CelerData cluster integrates with your data source.
AWS S3
If you choose AWS S3 as your storage system, take one of the following actions:
To choose the instance profile-based authentication method, configure StorageCredentialParams as follows:
"aws.s3.use_instance_profile" = "true",
"aws.s3.region" = "<aws_s3_region>"
To choose the assumed role-based authentication method, configure StorageCredentialParams as follows:
"aws.s3.use_instance_profile" = "true",
"aws.s3.iam_role_arn" = "<iam_role_arn>",
"aws.s3.region" = "<aws_s3_region>"
To choose the IAM user-based authentication method, configure StorageCredentialParams as follows:
"aws.s3.use_instance_profile" = "false",
"aws.s3.access_key" = "<iam_user_access_key>",
"aws.s3.secret_key" = "<iam_user_secret_key>",
"aws.s3.region" = "<aws_s3_region>"
The following table describes the parameters you need to configure in StorageCredentialParams.
| Parameter | Required | Description |
|---|---|---|
| aws.s3.use_instance_profile | Yes | Specifies whether to enable the credential methods instance profile and assumed role. Valid values: true and false. Default value: false. |
| aws.s3.iam_role_arn | No | The ARN of the IAM role that has privileges on your AWS S3 bucket. If you choose assumed role as the credential method for accessing AWS S3, you must specify this parameter. |
| aws.s3.region | Yes | The region in which your AWS S3 bucket resides. Example: us-west-1. |
| aws.s3.access_key | No | The access key of your IAM user. If you choose IAM user as the credential method for accessing AWS S3, you must specify this parameter. |
| aws.s3.secret_key | No | The secret key of your IAM user. If you choose IAM user as the credential method for accessing AWS S3, you must specify this parameter. |
For information about how to choose a credential method for accessing AWS S3 and how to configure an access control policy in AWS IAM Console, see Authentication parameters for accessing AWS S3.
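For reference, a hedged sketch of a WITH BROKER clause that uses the assumed role-based method (the ARN and region values are placeholders):
WITH BROKER
(
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.iam_role_arn" = "arn:aws:iam::<account_id>:role/<role_name>",
    "aws.s3.region" = "us-west-1"
)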
Microsoft Azure Storage
This section describes the parameters that you need to configure in StorageCredentialParams for integrating with various Azure cloud storage services by using various authentication methods. For more information about how to obtain the values of these parameters, see Authenticate to Azure cloud storage.
Azure Blob Storage
If you choose Blob Storage as your storage system, take one of the following actions:
To use the Shared Key authentication method, configure StorageCredentialParams as follows:
"azure.blob.storage_account" = "<storage_account_name>",
"azure.blob.shared_key" = "<storage_account_shared_key>"
The following table describes the parameters.
| Parameter | Description |
|---|---|
| azure.blob.storage_account | The name of your Blob storage account. |
| azure.blob.shared_key | The shared key (access key) of your Blob storage account. |
To use the SAS Token authentication method, configure StorageCredentialParams as follows:
"azure.blob.storage_account" = "<storage_account_name>",
"azure.blob.container" = "<container_name>",
"azure.blob.sas_token" = "<storage_account_SAS_token>"
The following table describes the parameters.
| Parameter | Description |
|---|---|
| azure.blob.storage_account | The name of your Blob storage account. |
| azure.blob.container | The name of the Blob container that stores your data within your Blob storage account. |
| azure.blob.sas_token | The SAS token that is used to access your Blob storage account. |
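For reference, a hedged sketch of a load job that reads a CSV file from Blob Storage with Shared Key authentication (all object names and the key value are placeholders):
LOAD LABEL test_db.blob_label1
(
    DATA INFILE("wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/input/test_file.csv")
    INTO TABLE table1
    COLUMNS TERMINATED BY ","
)
WITH BROKER
(
    "azure.blob.storage_account" = "<storage_account_name>",
    "azure.blob.shared_key" = "<storage_account_shared_key>"
);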
Azure Data Lake Storage Gen2
If you choose Data Lake Storage Gen2 as your storage system, take one of the following actions:
To use the Managed Identity authentication method, configure StorageCredentialParams as follows:
"azure.adls2.oauth2_use_managed_identity" = "true",
"azure.adls2.oauth2_tenant_id" = "<service_principal_tenant_id>",
"azure.adls2.oauth2_client_id" = "<service_client_id>"
The following table describes the parameters.
| Parameter | Description |
|---|---|
| azure.adls2.oauth2_use_managed_identity | Specifies whether to enable the Managed Identity authentication method. Set the value to true. |
| azure.adls2.oauth2_tenant_id | The ID of the tenant of your ADLS Gen2 storage account. |
| azure.adls2.oauth2_client_id | The client ID of the managed identity that is referenced in the data credential of the destination CelerData cluster. |
To use the Shared Key authentication method, configure StorageCredentialParams as follows:
"azure.adls2.storage_account" = "<storage_account_name>",
"azure.adls2.shared_key" = "<storage_account_shared_key>"
The following table describes the parameters.
| Parameter | Description |
|---|---|
| azure.adls2.storage_account | The name of your ADLS Gen2 storage account. |
| azure.adls2.shared_key | The shared key (access key) of your ADLS Gen2 storage account. |
To use the Service Principal authentication method, configure StorageCredentialParams as follows:
"azure.adls2.oauth2_client_id" = "<service_client_id>",
"azure.adls2.oauth2_client_secret" = "<service_principal_client_secret>",
"azure.adls2.oauth2_client_endpoint" = "<service_principal_client_endpoint>"
The following table describes the parameters.
| Parameter | Description |
|---|---|
| azure.adls2.oauth2_client_id | The application (client) ID of the service principal. |
| azure.adls2.oauth2_client_secret | The value of the client secret of the service principal. |
| azure.adls2.oauth2_client_endpoint | The OAuth 2.0 token endpoint (v1) of the service principal. |
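Similarly, a hedged sketch for Data Lake Storage Gen2 with Service Principal authentication (all names and endpoints are placeholders):
LOAD LABEL test_db.adls2_label1
(
    DATA INFILE("abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/input/test_file.csv")
    INTO TABLE table1
    COLUMNS TERMINATED BY ","
)
WITH BROKER
(
    "azure.adls2.oauth2_client_id" = "<service_client_id>",
    "azure.adls2.oauth2_client_secret" = "<service_principal_client_secret>",
    "azure.adls2.oauth2_client_endpoint" = "<service_principal_client_endpoint>"
);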
Azure Data Lake Storage Gen1
If you choose Data Lake Storage Gen1 as your storage system, take one of the following actions:
To use the Managed Service Identity authentication method, configure StorageCredentialParams as follows:
"azure.adls1.use_managed_service_identity" = "true"
The following table describes the parameters.
| Parameter | Description |
|---|---|
| azure.adls1.use_managed_service_identity | Specifies whether to enable the Managed Service Identity authentication method. Set the value to true. |
To use the Service Principal authentication method, configure StorageCredentialParams as follows:
"azure.adls1.oauth2_client_id" = "<application_client_id>",
"azure.adls1.oauth2_credential" = "<application_client_credential>",
"azure.adls1.oauth2_endpoint" = "<OAuth_2.0_authorization_endpoint_v2>"
The following table describes the parameters.
| Parameter | Description |
|---|---|
| azure.adls1.oauth2_client_id | The application (client) ID of the service principal. |
| azure.adls1.oauth2_credential | The value of the client secret of the service principal. |
| azure.adls1.oauth2_endpoint | The OAuth 2.0 token endpoint (v1) of the service principal or application. |
opt_properties
Specifies some optional parameters whose settings are applied to the entire load job. Syntax:
PROPERTIES ("<key>" = "<value>"[, "<key>" = "<value>" ...])
The following parameters are supported:
timeout
Specifies the timeout period of the load job. Unit: second. The default timeout period is 4 hours. We recommend that you specify a timeout period shorter than 6 hours. If the load job does not finish within the timeout period, CelerData cancels the load job and the status of the load job becomes CANCELLED.
NOTE
In most cases, you do not need to set the timeout period. We recommend that you set the timeout period only when the load job cannot finish within the default timeout period.
Use the following formula to infer the timeout period:
Timeout period > (Total size of the data files to be loaded x Total number of the data files to be loaded and the materialized views created on the data files)/Average load speed
NOTE
"Average load speed" is the average load speed for your entire CelerData cluster. The average load speed varies for each cluster depending on the server configuration and the maximum number of concurrent query tasks allowed for the cluster. You can infer the average load speed based on the load speeds of historical load jobs.
Suppose that you want to load a 1-GB data file on which two materialized views are created into a CelerData cluster whose average load speed is 10 MB/s. The amount of time required for the data load is approximately 307.2 seconds:
(1 x 1024 x 3)/10 = 307.2 (seconds)
For this example, we recommend that you set the timeout period to a value greater than 308 seconds.
max_filter_ratio
Specifies the maximum error tolerance of the load job. The maximum error tolerance is the maximum percentage of rows that can be filtered out as a result of inadequate data quality. Valid values: 0~1. Default value: 0.
If you set this parameter to 0, CelerData does not ignore unqualified rows during loading. As such, if the source data contains unqualified rows, the load job fails. This helps ensure the correctness of the data loaded into CelerData.
If you set this parameter to a value greater than 0, CelerData can ignore unqualified rows during loading. As such, the load job can succeed even if the source data contains unqualified rows.
NOTE
Rows that are filtered out due to inadequate data quality do not include rows that are filtered out by the WHERE clause.
If the load job fails because the maximum error tolerance is set to 0, you can use SHOW LOAD to view the job result. Then, determine whether unqualified rows can be filtered out. If unqualified rows can be filtered out, calculate the maximum error tolerance based on the values returned for dpp.abnorm.ALL and dpp.norm.ALL in the job result, adjust the maximum error tolerance, and submit the load job again. The formula for calculating the maximum error tolerance is as follows:
max_filter_ratio = [dpp.abnorm.ALL/(dpp.abnorm.ALL + dpp.norm.ALL)]
The sum of the values returned for dpp.abnorm.ALL and dpp.norm.ALL is the total number of rows to be loaded.
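As a hedged worked example (the row counts are hypothetical): if the job result shows dpp.abnorm.ALL = 100 and dpp.norm.ALL = 9900, the minimum tolerance works out to 100/(100 + 9900) = 0.01, so retrying the job with "max_filter_ratio" = "0.01" (or a slightly larger value) in PROPERTIES allows the unqualified rows to be filtered out.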
log_rejected_record_num
Specifies the maximum number of unqualified data rows that can be logged. Valid values: 0, -1, and any non-zero positive integer. Default value: 0.
- The value 0 specifies that no data rows that are filtered out will be logged.
- The value -1 specifies that all data rows that are filtered out will be logged.
- A non-zero positive integer such as n specifies that up to n data rows that are filtered out can be logged on each BE.
load_mem_limit
Specifies the maximum amount of memory that can be provided to the load job. Unit: bytes. The default memory limit is 2 GB.
strict_mode
Specifies whether to enable the strict mode. Valid values: true and false. Default value: false. true specifies to enable the strict mode, and false specifies to disable the strict mode.
If you enable the strict mode, CelerData returns errors when data conversions fail. If you disable the strict mode, CelerData fills NULL values into the destination table for the values whose conversions fail.
The strict mode does not take effect in the following situations:
- The data from a source column is generated by functions.
- The data type of the destination column imposes range limits on the column values. As a result, the values from the source column can be properly converted to the destination data type, but cannot pass the range limit checks. For example, the destination column is defined as the DECIMAL(1,0) data type, and a value in the source column is 10.
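For example, a hedged PROPERTIES sketch that enables the strict mode and keeps the default zero error tolerance, so the job fails as soon as a data conversion fails:
"strict_mode" = "true",
"max_filter_ratio" = "0"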
timezone
Specifies the time zone of the load job. Default value: Asia/Shanghai. The time zone setting affects the results returned by functions such as strftime, alignment_timestamp, and from_unixtime. The time zone specified in the timezone parameter is a session-level time zone.
priority
Specifies the priority of the load job. Valid values: LOWEST, LOW, NORMAL, HIGH, and HIGHEST. Default value: NORMAL. Batch loading provides a specific task pool size, which determines the maximum number of tasks that can be concurrently run within a specific time period. If the number of tasks to run for jobs that are submitted within the specified time period exceeds the maximum number, the jobs in the task pool will wait to be scheduled based on their priorities.
merge_condition
Specifies the name of the column you want to use as the condition to determine whether updates can take effect. The update from a source record to a destination record takes effect only when the source data record has a value greater than or equal to that of the destination data record in the specified column.
NOTE
The column that you specify cannot be a primary key column. Additionally, only tables that use the Primary Key table type support conditional updates.
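For example, a hedged PROPERTIES sketch that uses a hypothetical version column named update_time as the update condition, so an update is applied only when the incoming row's update_time is greater than or equal to the existing row's value:
"merge_condition" = "update_time"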
For the loading of JSON data, CelerData Cloud BYOC provides the following parameters:
jsonpaths
The names of the keys that you want to load from the JSON data file. You need to specify this parameter only when you load JSON data by using the matched mode. The value of this parameter is in JSON format. See Configure column mapping for JSON data loading.
strip_outer_array
Specifies whether to strip the outermost array structure. Valid values: true and false. Default value: false.
In real-world business scenarios, the JSON data may have an outermost array structure as indicated by a pair of square brackets []. In this situation, we recommend that you set this parameter to true, so CelerData Cloud BYOC removes the outermost square brackets [] and loads each inner element as a separate data record. If you set this parameter to false, CelerData Cloud BYOC parses the entire JSON data file into one array and loads the array as a single data record. For example, the JSON data is [ {"category" : 1, "author" : 2}, {"category" : 3, "author" : 4} ]. If you set this parameter to true, {"category" : 1, "author" : 2} and {"category" : 3, "author" : 4} are parsed into separate data records that are loaded into separate rows of the destination table.
json_root
The root element of the JSON data that you want to load from the JSON data file. You need to specify this parameter only when you load JSON data by using the matched mode. The value of this parameter is a valid JsonPath string. By default, the value of this parameter is empty, indicating that all data of the JSON data file will be loaded.
When you load JSON data, also note that the size per JSON object cannot exceed 4 GB. If an individual JSON object in the JSON data file exceeds 4 GB in size, an error "This parser can't support a document that big." is reported.
Column mapping
Configure column mapping for CSV data loading
If the columns of the data file can be mapped one on one in sequence to the columns of the destination table, you do not need to configure the column mapping between the data file and the destination table.
If the columns of the data file cannot be mapped one on one in sequence to the columns of the destination table, you need to use the columns parameter to configure the column mapping between the data file and the destination table. This includes the following two use cases:
Same number of columns but different column sequence. Also, the data from the data file does not need to be computed by functions before it is loaded into the matching destination table columns.
In the columns parameter, you need to specify the names of the destination table columns in the same sequence as how the data file columns are arranged.
For example, the destination table consists of three columns, which are col1, col2, and col3 in sequence, and the data file also consists of three columns, which can be mapped to the destination table columns col3, col2, and col1 in sequence. In this case, you need to specify "columns: col3, col2, col1".
Different number of columns and different column sequence. Also, the data from the data file needs to be computed by functions before it is loaded into the matching destination table columns.
In the columns parameter, you need to specify the names of the destination table columns in the same sequence as how the data file columns are arranged and specify the functions you want to use to compute the data. Two examples are as follows:
- The destination table consists of three columns, which are col1, col2, and col3 in sequence. The data file consists of four columns, among which the first three columns can be mapped in sequence to the destination table columns col1, col2, and col3 and the fourth column cannot be mapped to any of the destination table columns. In this case, you need to temporarily specify a name for the fourth column of the data file, and the temporary name must be different from any of the destination table column names. For example, you can specify "columns: col1, col2, col3, temp", in which the fourth column of the data file is temporarily named temp.
- The destination table consists of three columns, which are year, month, and day in sequence. The data file consists of only one column that accommodates date and time values in yyyy-mm-dd hh:mm:ss format. In this case, you can specify "columns: col, year = year(col), month = month(col), day = day(col)", in which col is the temporary name of the data file column and the functions year = year(col), month = month(col), and day = day(col) are used to extract data from the data file column col and load the data into the mapping destination table columns. For example, year = year(col) is used to extract the yyyy data from the data file column col and load the data into the destination table column year. A Broker Load sketch of this case follows the list.
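A hedged Broker Load sketch of that case, expressed through column_list and the SET clause (the path and table name are hypothetical):
DATA INFILE ("s3a://<bucket_name>/input/dates.csv")
INTO TABLE t_dates
COLUMNS TERMINATED BY ","
(tmp_time)
SET (year = year(tmp_time), month = month(tmp_time), day = day(tmp_time))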
Configure column mapping for JSON data loading
If the keys of the JSON document have the same names as the columns of the destination table, you can load the JSON-formatted data by using the simple mode. In simple mode, you do not need to specify the jsonpaths parameter. This mode requires that the JSON-formatted data must be an object as indicated by curly brackets {}, such as {"category": 1, "author": 2, "price": "3"}. In this example, category, author, and price are key names, and these keys can be mapped one on one by name to the columns category, author, and price of the destination table.
If the keys of the JSON document have different names than the columns of the destination table, you can load the JSON-formatted data by using the matched mode. In matched mode, you need to use the jsonpaths and COLUMNS parameters to specify the column mapping between the JSON document and the destination table:
- In the jsonpaths parameter, specify the JSON keys in the same sequence as how they are arranged in the JSON document.
- In the COLUMNS parameter, specify the mapping between the JSON keys and the destination table columns:
  - The column names specified in the COLUMNS parameter are mapped one on one in sequence to the JSON keys.
  - The column names specified in the COLUMNS parameter are mapped one on one by name to the destination table columns.
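As a hedged illustration of matched mode (the key names, column names, and exact clause placement are hypothetical): for a JSON document such as {"cat": 1, "auth": 2}, the jsonpaths property and the declared column list could be written as follows, mapping cat to the destination column category and auth to author.
-- JSON document (hypothetical): {"cat": 1, "auth": 2}
"jsonpaths" = "[\"$.cat\", \"$.auth\"]"
-- Column list declared in the load statement: (category, author)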
Examples
The following examples assume that you choose instance profile as the credential method for accessing AWS S3.
Perform a load
Your CelerData database test_db contains a table named table1. The table consists of three columns, which are col1, col2, and col3 in sequence.
Your s3a://celerdata_bucket/folder_a/folder_b path stores a number of CSV data files, one of which is test_file.csv. These data files each also consist of three columns, which are mapped in sequence onto col1, col2, and col3 of table1.
Create a load job with the label label1 to load only the data records whose values in the first column are greater than 20180601 from test_file.csv into partitions p1 and p2 of table1, and specify that the load job will time out in 3600 seconds:
LOAD LABEL test_db.label1
(
DATA INFILE("s3a://celerdata_bucket/folder_a/folder_b/test_file.csv")
INTO TABLE table1
COLUMNS TERMINATED BY ","
PARTITION (p1, p2)
(col1, col2, col3)
WHERE col1 > 20180601
)
WITH BROKER
(
"aws.s3.use_instance_profile" = "true",
"aws.s3.region" = "<aws_s3_region>"
)
PROPERTIES
(
"timeout" = "3600"
);
Revoke a load
Create a load job with the label label2 to revoke the previous load whose label is label1:
LOAD LABEL test_db.label2
(
DATA INFILE("s3a://celerdata_bucket/folder_a/folder_b/test_file.csv")
NEGATIVE
INTO TABLE table1
COLUMNS TERMINATED BY ","
PARTITION (p1, p2)
(col1, col2, col3)
WHERE col1 > 20180601
)
WITH BROKER
(
"aws.s3.use_instance_profile" = "true",
"aws.s3.region" = "<aws_s3_region>"
);
Load data into a table containing HLL-type columns
Your CelerData database test_db contains a table named table2. The table consists of four columns, which are id, col1, col2, and col3 in sequence. col1 and col2 are defined as HLL-type columns.
Your data file test_file.csv consists of three columns, among which the first column is mapped onto id of table2 and the second and third columns are mapped in sequence onto col1 and col2 of table2. The values in the second and third columns of test_file.csv can be converted into HLL-type data by using functions before they are loaded into col1 and col2 of table2.
Create a load job with the label label3 to load all data from test_file.csv into table2:
LOAD LABEL test_db.label3
(
DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/user/celerdata/data/input/test_file.csv")
INTO TABLE table2
COLUMNS TERMINATED BY ","
(id, temp1, temp2)
SET
(
col1 = hll_hash(temp1),
col2 = hll_hash(temp2),
col3 = hll_empty()
)
)
WITH BROKER
(
"aws.s3.use_instance_profile" = "true",
"aws.s3.region" = "<aws_s3_region>"
);
NOTE
In the preceding example, the three columns of test_file.csv are named id, temp1, and temp2 in sequence by using column_list. Then, functions are used to convert data as follows:
- The hll_hash function is used to convert the values in temp1 and temp2 of test_file.csv into HLL-type data and map temp1 and temp2 of test_file.csv onto col1 and col2 of table2.
- The hll_empty function is used to fill the specified default value into col3 of table2.
For usage of the functions hll_hash and hll_empty, see HLL.