Batch load data from Azure cloud storage
CelerData provides two options for loading data from Microsoft Azure Storage:
- Synchronous loading using INSERT+FILES()
- Asynchronous loading using Broker Load
Each of these options has its own advantages, which are detailed in the following sections.
In most cases, we recommend that you use the INSERT+FILES() method, which is much easier to use.
However, the INSERT+FILES() method currently supports only the Parquet and ORC file formats. Therefore, if you need to load data in other file formats such as CSV, or perform data changes such as DELETE during data loading, you can resort to Broker Load.
Before you begin
Make source data ready
Make sure that the source data you want to load into CelerData is properly stored in a container within your Azure storage account.
In this topic, suppose you want to load the data of a Parquet-formatted sample dataset (user_behavior_ten_million_rows.parquet) stored in the root directory of a container (celerdata-container) within an Azure Data Lake Storage Gen2 (ADLS Gen2) storage account (celerdata).
Check privileges
You can load data into tables within your CelerData cluster only as a user who has the INSERT privilege on those tables. If you do not have the INSERT privilege, follow the instructions provided in GRANT to grant the INSERT privilege to the user that you use to connect to your CelerData cluster.
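For example, a statement like the following grants the INSERT privilege on a single table (a sketch; the user identity 'jack'@'%' and the table mydatabase.user_behavior are placeholders for your own user and table):
GRANT INSERT ON TABLE mydatabase.user_behavior TO USER 'jack'@'%';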
Gather connection details
The examples in this topic use the Shared Key authentication method. To ensure that you have permission to read data from ADLS Gen2, we recommend that you read Azure Data Lake Storage Gen2 > Shared Key (access key of storage account) to understand the authentication parameters that you need to configure.
In a nutshell, if you use Shared Key authentication, you need to gather the following information:
- The name of your ADLS Gen2 storage account
- The shared key (access key) of your ADLS Gen2 storage account
For information about all the authentication methods available, see Authenticate to Azure cloud storage.
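These two values map to the azure.adls2.storage_account and azure.adls2.shared_key properties used in the FILES() and Broker Load examples later in this topic, shown here with placeholders (substitute your own values):
"azure.adls2.storage_account" = "<storage_account_name>",
"azure.adls2.shared_key" = "<shared_key_of_storage_account>"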
Use INSERT+FILES()
This method supports only the Parquet and ORC file formats.
Advantages of INSERT+FILES()
FILES() can read the file stored in cloud storage based on the path-related properties you specify, infer the table schema of the data in the file, and then return the data from the file as data rows.
With FILES(), you can:
- Query the data directly from Azure using SELECT.
- Create and load a table using CREATE TABLE AS SELECT (CTAS).
- Load the data into an existing table using INSERT.
Typical examples
Querying directly from Azure using SELECT
Querying directly from Azure using SELECT+FILES() can give a good preview of the content of a dataset before you create a table. For example:
- Get a preview of the dataset without storing the data.
- Query for the min and max values and decide what data types to use.
- Check for NULL values.
The following example queries the sample dataset user_behavior_ten_million_rows.parquet stored in the container celerdata-container within your storage account celerdata:
SELECT * FROM FILES
(
"path" = "abfss://celerdata-container@celerdata.dfs.core.windows.net/user_behavior_ten_million_rows.parquet",
"format" = "parquet",
"azure.adls2.storage_account" = "celerdata",
"azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx"
)
LIMIT 3;
The system returns a query result similar to the following:
NOTE
Notice that the column names returned above are provided by the Parquet file.
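Beyond a simple preview, you can also check value ranges and look for NULL values before settling on column types. The following aggregate query against the same file is a sketch (the UserID and Timestamp column names come from this sample dataset, and the shared key is a placeholder):
SELECT
    MIN(Timestamp) AS min_ts,
    MAX(Timestamp) AS max_ts,
    COUNT(*) - COUNT(UserID) AS null_userid_count
FROM FILES
(
    "path" = "abfss://celerdata-container@celerdata.dfs.core.windows.net/user_behavior_ten_million_rows.parquet",
    "format" = "parquet",
    "azure.adls2.storage_account" = "celerdata",
    "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx"
);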
Creating and loading a table using CTAS
This is a continuation of the previous example. The previous query is wrapped in CREATE TABLE AS SELECT (CTAS) to automate table creation using schema inference. This means CelerData will infer the table schema, create the table, and then load the data into it. The column names and types are not required to create a table when using the FILES() table function with Parquet files, because the Parquet format includes the column names.
NOTE
The syntax of CREATE TABLE when using schema inference does not allow setting the number of replicas. If you are using a CelerData classic cluster, set the number of replicas before creating the table. The example below is for a system with a single replica:
ADMIN SET FRONTEND CONFIG ("default_replication_num" = "1");
Create a database and switch to it:
CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;
Use CTAS to create a table and load the data of the sample dataset user_behavior_ten_million_rows.parquet, which is stored in the container celerdata-container within your storage account celerdata, into the table:
CREATE TABLE user_behavior_inferred AS
SELECT * FROM FILES
(
"path" = "abfss://celerdata-container@celerdata.dfs.core.windows.net/user_behavior_ten_million_rows.parquet",
"format" = "parquet",
"azure.adls2.storage_account" = "celerdata",
"azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx"
);
After creating the table, you can view its schema by using DESCRIBE:
DESCRIBE user_behavior_inferred;
The system returns the following query result:
Compare the inferred schema with the schema created by hand:
- data types
- nullable
- key fields
To better control the schema of the destination table and for better query performance, we recommend that you specify the table schema by hand in production environments.
Query the table to verify that the data has been loaded into it. Example:
SELECT * from user_behavior_inferred LIMIT 3;
The system returns a query result similar to the following, indicating that the data has been successfully loaded:
Loading into an existing table using INSERT
You may want to customize the table that you are inserting into, for example, the:
- column data type, nullable setting, or default values
- key types and columns
- data partitioning and bucketing
NOTE
Creating the most efficient table structure requires knowledge of how the data will be used and the content of the columns. This topic does not cover table design. For information about table design, see Table types.
In this example, we are creating a table based on knowledge of how the table will be queried and the data in the Parquet file. The knowledge of the data in the Parquet file can be gained by querying the file directly in Azure.
- Since a query of the dataset in Azure indicates that the Timestamp column contains data that matches a datetime data type, the column type is specified in the following DDL.
- By querying the data in Azure, you can find that there are no NULL values in the dataset, so the DDL does not set any columns as nullable.
- Based on knowledge of the expected query types, the sort key and bucketing column are set to the column UserID. Your use case might be different for this data, so you might decide to use ItemID in addition to or instead of UserID for the sort key.
Create a database and switch to it:
CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;
Create a table by hand (we recommend that the table have the same schema as the Parquet file you want to load from Azure):
CREATE TABLE user_behavior_declared
(
UserID int(11),
ItemID int(11),
CategoryID int(11),
BehaviorType varchar(65533),
Timestamp datetime
)
DUPLICATE KEY(UserID)
DISTRIBUTED BY HASH(UserID);
After creating the table, you can load it with INSERT INTO SELECT FROM FILES():
INSERT INTO user_behavior_declared
SELECT * FROM FILES
(
"path" = "abfss://celerdata-container@celerdata.dfs.core.windows.net/user_behavior_ten_million_rows.parquet",
"format" = "parquet",
"azure.adls2.storage_account" = "celerdata",
"azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx"
);
After the load is complete, you can query the table to verify that the data has been loaded into it. Example:
SELECT * from user_behavior_declared LIMIT 3;
The system returns a query result similar to the following, indicating that the data has been successfully loaded:
Check load progress
You can query the progress of INSERT jobs from the information_schema.loads view. This feature is supported from v3.1 onwards. Example:
SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC;
If you have submitted multiple load jobs, you can filter on the LABEL associated with the job. Example:
SELECT * FROM information_schema.loads WHERE LABEL = 'insert_659411dd-a560-11ee-af93-000d3a544fa3';
For information about the fields provided in the loads view, see Information Schema.
NOTE
INSERT is a synchronous command. If an INSERT job is still running, you need to open another session to check its execution status.
Use Broker Load
An asynchronous Broker Load process handles making the connection to Azure, pulling the data, and storing the data in CelerData.
This method supports the Parquet, ORC, CSV, and JSON file formats.
Advantages of Broker Load
- Broker Load supports data transformation and data changes such as UPSERT and DELETE operations during loading (see the sketch after this list).
- Broker Load runs in the background and clients do not need to stay connected for the job to continue.
- Broker Load is preferred for long-running jobs, with the default timeout spanning 4 hours.
- In addition to Parquet and ORC file formats, Broker Load supports CSV and JSON file formats.
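As one illustration of the transformation capabilities mentioned above, a Broker Load job can filter out unwanted rows during loading with a WHERE clause. The following is a minimal sketch (the label and filter value are illustrative, and the destination table is the user_behavior table created in the example that follows; see BROKER LOAD for the full syntax):
LOAD LABEL user_behavior_filtered
(
    DATA INFILE("abfss://celerdata-container@celerdata.dfs.core.windows.net/user_behavior_ten_million_rows.parquet")
    INTO TABLE user_behavior
    FORMAT AS "parquet"
    WHERE BehaviorType = 'pv'
)
WITH BROKER
(
    "azure.adls2.storage_account" = "celerdata",
    "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx"
);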
Typical example
Create a database and a table, start a load process that pulls the sample dataset user_behavior_ten_million_rows.parquet from Azure, and verify the progress and success of the data loading.
Create a database and a table
Connect to your CelerData cluster. Then, create a database and switch to it:
CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;
Create a table by hand (we recommend that the table have the same schema as the Parquet file you want to load from Azure):
CREATE TABLE user_behavior
(
UserID int(11),
ItemID int(11),
CategoryID int(11),
BehaviorType varchar(65533),
Timestamp datetime
);
Start a Broker Load
Run the following command to start a Broker Load job that loads data from the sample dataset user_behavior_ten_million_rows.parquet to the user_behavior table:
LOAD LABEL user_behavior
(
DATA INFILE("abfss://celerdata-container@celerdata.dfs.core.windows.net/user_behavior_ten_million_rows.parquet")
INTO TABLE user_behavior
FORMAT AS "parquet"
)
WITH BROKER
(
"azure.adls2.storage_account" = "celerdata",
"azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx"
)
PROPERTIES
(
"timeout" = "3600"
);
This job has four main sections:
- LABEL: A string used when querying the state of the load job.
- LOAD declaration: The source URI, source data format, and destination table name.
- BROKER: The connection details for the source.
- PROPERTIES: The timeout value and any other properties to apply to the load job.
For detailed syntax and parameter descriptions, see BROKER LOAD.
Check load progress
Query the information_schema.loads view to track the load progress:
SELECT * FROM information_schema.loads;
If you have submitted multiple load jobs, you can filter on the LABEL associated with the job:
SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior';
For information about the fields provided in the loads view, see Information Schema.
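For a quick status check, you can also select only a few fields of interest, for example (the JOB_ID, STATE, PROGRESS, and ERROR_MSG field names are assumed here; verify them against the loads view documentation):
SELECT JOB_ID, LABEL, STATE, PROGRESS, ERROR_MSG
FROM information_schema.loads
WHERE LABEL = 'user_behavior';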
After you confirm that the load job has finished, you can check a subset of the destination table to see if the data has been successfully loaded. Example:
SELECT * from user_behavior LIMIT 3;
The system returns a query result similar to the following, indicating that the data has been successfully loaded: