
Unload data into Azure cloud storage

This topic describes how to unload data from CelerData into Microsoft Azure Blob Storage and Data Lake Storage.

From v3.2 onwards, CelerData supports using the table function FILES() to define a writable table in remote storage. You can then use INSERT INTO FILES to unload data from CelerData to your remote storage.

Compared with the other data export methods that CelerData supports, unloading data with INSERT INTO FILES provides a more unified, easy-to-use, and feature-rich interface. You can unload data directly into your remote storage using the same syntax that you use to load data from it with a SELECT statement. In addition, it supports unloading data into Parquet-, ORC-, and CSV-formatted (from v3.3 onwards) files. Therefore, in most cases, INSERT INTO FILES is recommended over EXPORT or SELECT INTO OUTFILE.
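
For example, the read side uses the same function: to load data from remote storage, you query FILES() directly. The following is a minimal sketch with placeholder path and credentials; substitute the values for your own storage account:

SELECT * FROM FILES(
    -- Placeholder path and credentials; replace with your own values.
    'path' = 'wasbs://<container>@<storage_account>.blob.core.windows.net/<path>/',
    'format' = 'parquet',
    'azure.blob.storage_account' = '<storage_account>',
    'azure.blob.shared_key' = '<shared_key>'
);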

NOTE

We recommend that you go through all four examples in this topic to understand not only the different credentials for cloud storage but also the various file formats and compression algorithms.

Unload data into Azure Blob Storage

If you want to unload data into Azure Blob Storage from an AWS-based CelerData cluster, you can use either Shared Key authentication or Shared Access Signature (SAS) authentication.

Preparations

Obtain container URL

The format of the Azure Blob container URL is as follows:

{wasb | wasbs}://<container_name>@<storage_account_name>.blob.core.windows.net/<path>/<dir>/

Suppose you want to unload data into the directory test/user_behavior/ under the container unloading within the Azure Blob Storage account celerdata. When you unload data into Blob Storage, you can use either the wasb (over HTTP) protocol or the wasbs (over HTTPS) protocol. Therefore, the URL (the path defined in the FILES() function) will be either of the following:

  • wasb://unloading@celerdata.blob.core.windows.net/test/user_behavior/
  • wasbs://unloading@celerdata.blob.core.windows.net/test/user_behavior/

Obtain credential information

  • If you want to use the Shared Key authentication, follow the instructions provided in Authenticate to Azure cloud storage - Shared Key to get the key value of the storage account celerdata.
  • If you want to use the SAS authentication, follow the instructions provided in Authenticate to Azure cloud storage - SAS Token to get the SAS token of your storage account with the following permission configurations on the storage account celerdata:
    • Allowed services: Blob.
    • Allowed resource types: Container, Object.
    • Allowed permissions: Read, Write, Delete, List, Add, Create, Immutable storage, Permanent delete.

Unload data using Shared Key

The following example unloads all data in the table user_behavior into Parquet-formatted files using the wasbs protocol.

INSERT INTO FILES(
    'path' = 'wasbs://unloading@celerdata.blob.core.windows.net/test1/user_behavior/',
    'format' = 'parquet', -- Unload data as Parquet-formatted files.
    'azure.blob.storage_account' = 'celerdata',
    'azure.blob.shared_key' = '<storage_account_shared_key>'
)
SELECT * FROM user_behavior;
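
After the unload finishes, you can verify the result by reading the files back through the same FILES() function. The following is a quick sanity check, assuming the unload above has completed; compare its output with the row count of the source table:

SELECT COUNT(*) FROM FILES(
    -- Same path and Shared Key credentials as the unload job above.
    'path' = 'wasbs://unloading@celerdata.blob.core.windows.net/test1/user_behavior/',
    'format' = 'parquet',
    'azure.blob.storage_account' = 'celerdata',
    'azure.blob.shared_key' = '<storage_account_shared_key>'
);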

NOTE

Using Shared Key authentication effectively grants you super-user access, that is, full access to all operations on all resources of the storage account, including managing data, setting the owner, and changing ACLs.

Unload data using SAS Token

The following example uses a complex query with a WHERE clause and a JOIN to unload the data between 2020-01-01 and 2024-01-01. It stores the data as ORC-formatted files compressed with the LZ4 algorithm, partitions the data by the dt column, and distributes it into different sub-directories under user_behavior/.

INSERT INTO FILES(
    'path' = 'wasbs://unloading@celerdata.blob.core.windows.net/test2/user_behavior/',
    'format' = 'orc', -- Unload data as ORC-formatted files.
    'compression' = 'lz4', -- Use the LZ4 compression algorithm.
    'azure.blob.storage_account' = 'celerdata',
    'azure.blob.container' = 'unloading',
    'azure.blob.sas_token' = '<storage_account_SAS_token>',
    'partition_by' = 'dt' -- Partition the data by the dt column into different directories.
)
SELECT tb.*, tu.name FROM user_behavior tb
JOIN users tu ON tb.user_id = tu.id
WHERE tb.dt >= '2020-01-01' AND tb.dt < '2024-01-01';

NOTE

The partition_by parameter partitions the data based on the specified column and stores it in sub-directories named in the format prefix/<column_name>=<value>/, for example, /test2/user_behavior/dt=2020-01-01/.
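
Because each partition lands in its own sub-directory, you can later read a single partition back by pointing FILES() at that sub-directory. The following is a sketch that assumes the unload above has completed and reuses its SAS credentials; note that the partition column value is encoded in the directory name rather than in the data files:

SELECT * FROM FILES(
    -- Path narrowed to one partition sub-directory produced by partition_by.
    'path' = 'wasbs://unloading@celerdata.blob.core.windows.net/test2/user_behavior/dt=2020-01-01/',
    'format' = 'orc',
    'azure.blob.storage_account' = 'celerdata',
    'azure.blob.container' = 'unloading',
    'azure.blob.sas_token' = '<storage_account_SAS_token>'
);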

Unload data into Azure Data Lake Storage

If you want to unload data into Azure Data Lake Storage from an Azure-based CelerData cluster, you can use either Managed Identity authentication or Service Principal authentication.

Preparations

Obtain container URL

The format of the Azure Data Lake container URL is as follows:

{abfs | abfss}://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>/<dir>/

Suppose you want to unload data into the directory test/user_behavior/ under the container unloading within the Azure Data Lake Storage account celerdataadls. When you unload data into Data Lake Storage, you can use either the abfs (over HTTP) protocol or the abfss (over HTTPS) protocol. Therefore, the URL (the path defined in the FILES() function) will be either of the following:

  • abfs://unloading@celerdataadls.dfs.core.windows.net/test/user_behavior/
  • abfss://unloading@celerdataadls.dfs.core.windows.net/test/user_behavior/

Obtain credential information

  • If you want to use the Managed Identity authentication, follow the instructions provided in Authenticate to Azure cloud storage - Managed Identity to create a managed identity with the following role assignments on the storage account celerdataadls, and get the corresponding oauth2_tenant_id (the ID of the tenant of your ADLS storage account) and oauth2_client_id (the client ID of the managed identity):
    • Storage Account Contributor
    • Storage Blob Data Contributor
  • If you want to use the Service Principal authentication, follow the instructions provided in Authenticate to Azure cloud storage - Service Principal to create a service principal with the following role assignments on the storage account celerdataadls, and get the corresponding oauth2_client_id (the application (client) ID), oauth2_client_secret (the value of the client secret), and oauth2_client_endpoint (the OAuth 2.0 token endpoint (v1)) of the service principal:
    • Storage Account Contributor
    • Storage Blob Data Contributor

Unload data using Managed Identity

The following example unloads the data into a single CSV-formatted file, using a comma (,) as the column separator.

INSERT INTO FILES(
    'path' = 'abfss://unloading@celerdataadls.dfs.core.windows.net/test3/user_behavior/',
    'format' = 'csv', -- Unload data as a CSV-formatted file.
    'csv.column_separator' = ',', -- Use `,` as the column separator.
    'single' = 'true', -- Output the data into a single file.
    'azure.adls2.oauth2_use_managed_identity' = 'true',
    'azure.adls2.oauth2_tenant_id' = '<tenant_id>',
    'azure.adls2.oauth2_client_id' = '<client_id>'
)
SELECT * FROM user_behavior;
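
To inspect the unloaded CSV file, you can read it back through FILES() with the same Managed Identity credentials. The following is a sketch that reuses the parameters above and previews the first few rows:

SELECT * FROM FILES(
    -- Same path and Managed Identity credentials as the unload job above.
    'path' = 'abfss://unloading@celerdataadls.dfs.core.windows.net/test3/user_behavior/',
    'format' = 'csv',
    'csv.column_separator' = ',',
    'azure.adls2.oauth2_use_managed_identity' = 'true',
    'azure.adls2.oauth2_tenant_id' = '<tenant_id>',
    'azure.adls2.oauth2_client_id' = '<client_id>'
)
LIMIT 10;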

NOTE

  • If you are using a Blob Storage account, you must upgrade it to Data Lake Storage before you can use Managed Identity authentication.
  • The abfss protocol is required when you unload data into Azure Data Lake Storage using Managed Identity authentication.

Unload data using Service Principal

The following example unloads the data into multiple Parquet-formatted files, each with a maximum size of 100000000 bytes (approximately 100 MB).

INSERT INTO FILES(
    'path' = 'abfss://unloading@celerdataadls.dfs.core.windows.net/test4/user_behavior/',
    'format' = 'parquet',
    'target_max_file_size' = '100000000', -- Set the maximum file size to 100000000 bytes.
    'azure.adls2.oauth2_client_id' = '<service_principal_client_id>',
    'azure.adls2.oauth2_client_secret' = '<service_principal_client_secret_value>',
    'azure.adls2.oauth2_client_endpoint' = '<service_principal_client_endpoint>'
)
SELECT * FROM user_behavior;
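
To confirm that all rows were unloaded across the multiple files, you can count them back through FILES() with the same Service Principal credentials. The following is a sketch assuming the unload above has completed:

SELECT COUNT(*) FROM FILES(
    -- Same path and Service Principal credentials as the unload job above.
    'path' = 'abfss://unloading@celerdataadls.dfs.core.windows.net/test4/user_behavior/',
    'format' = 'parquet',
    'azure.adls2.oauth2_client_id' = '<service_principal_client_id>',
    'azure.adls2.oauth2_client_secret' = '<service_principal_client_secret_value>',
    'azure.adls2.oauth2_client_endpoint' = '<service_principal_client_endpoint>'
);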

Limits

  • Outputting CSV-formatted files is supported only from v3.3 onwards.
  • Setting a compression algorithm is not supported for CSV-formatted files.