Authenticate to Azure cloud storage

Suppose you have successfully deployed a cluster on CelerData Cloud BYOC, and the cluster-related cloud resources such as your virtual machine (VM) and cloud storage are already located under your Azure account.

This topic introduces how to enable a CelerData cluster to access your self-managed Azure cloud storage such as Azure Blob Storage or Azure Data Lake Storage in the following scenarios:

The following table lists the authentication methods that are supported for various Azure storage services.

Azure storage serviceAuthentication method
Azure Blob Storage
  • Shared Key (access key of storage account)
  • SAS Token
Azure Data Lake Storage Gen2 (ADLS Gen2)
  • Managed Identity
  • Shared Key (access key of storage account)
  • Service Principal
Azure Data Lake Storage Gen1 (ADLS Gen1)
  • Managed Service Identity
  • Service Principal

The examples of data loading in this topic use file1.parquet stored in your Azure storage as a sample data file and use table1 which belongs to a database named mydatabase within your CelerData cluster as a sample destination table. Suppose that file1.parquet and table1 have the same schema.

Azure Blob Storage

When you access Blob Storage, you may need to input a file path:

  • If your storage account allows access over HTTP, use the wasb protocol and write the file path as wasb://<container_name>@<storage_account_name>.blob.core.windows.net/<path>/<file_name>.
  • If your storage account allows access over HTTPS, use the wasbs protocol and write the file path as wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<path>/<file_name>.

Shared Key (access key of storage account)

Access keys authenticate your Azure-based CelerData cluster's requests to your storage account. For more information about access keys, see Manage account access keys - Azure Storage.

NOTICE

This authentication method now supports external catalog and Broker Load. Note that for external catalogs, the storage account you use for authentication must be the one used to store the data of your external data source.

Parameters

Configure the following parameters as credentials for accessing your Blob Storage.

ParameterDescription
azure.blob.storage_accountThe name of your Blob storage account.
azure.blob.shared_keyThe shared key (access key) of your Blob storage account.

Follow these steps to obtain the shared key (access key) of your Blob storage account:

  1. Sign in to the Microsoft Azure portal and navigate to the Storage accounts service page.

  2. Find your storage account and click its name.

  3. In the left-side navigation pane, choose Access keys.

  4. On the page that appears, click Show next to the key you want to use, and then click the copy icon to copy the key.

    Blob - Obtain Shared Key

Examples

External catalog

Create a Hive catalog named hive_catalog_azure:

CREATE EXTERNAL CATALOG hive_catalog_azure
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083",
    "azure.blob.storage_account" = "<storage_account_name>",
    "azure.blob.shared_key" = "<storage_account_shared_key>"
);

CelerData now supports Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog. For the detailed syntax and parameters for creating these external catalogs, see Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog.

Broker Load

Create a Broker Load job to load data from the data file file1.parquet to the destination table table1 which belongs to the database named mydatabase within the CelerData cluster:

LOAD LABEL mydatabase.label_blob_key
(
    DATA INFILE("wasb[s]://<container_name>@<storage_account_name>.blob.core.windows.net/<path>/file1.parquet")
    INTO TABLE table1
    FORMAT AS "parquet"
)
WITH BROKER
(
    "azure.blob.storage_account" = "<storage_account_name>",
    "azure.blob.shared_key" = "<storage_account_shared_key>"
)
PROPERTIES
(
    "timeout" = "3600"
);

For the detailed syntax and parameters for creating a Broker Load job, see BROKER LOAD.

For a typical example of loading data from Azure with table creation DDL, see Batch load data from Azure cloud storage.

SAS Token

A shared access signature (SAS) is a URI that grants restricted access to your storage account. For more information about SAS tokens, see Delegate access by using a shared access signature - Azure Storage.

NOTICE

This authentication method now supports external catalog and Broker Load. Note that for external catalogs, the storage account you use for authentication must be the one used to store the data of your external data source.

Parameters

Configure the following parameters as credentials for accessing your Blob Storage.

ParameterDescription
azure.blob.storage_accountThe name of your Blob storage account.
azure.blob.containerThe name of the Blob container that stores your data within your Blob storage account.
azure.blob.sas_tokenThe SAS token that is used to access your Blob storage account.

Follow these steps to obtain the SAS token of your Blob storage account:

  1. Sign in to the Microsoft Azure portal and navigate to the Storage accounts service page.

  2. Find your storage account and click its name.

  3. In the left-side navigation pane, choose Shared access signature.

  4. On the page that appears, check that you have properly made configurations in the Allowed services, Allowed resource types, and Allowed permissions sections, then select a key from the Signing key drop-down list and click Generate SAS and connection string.

    You can click the copy icon to copy the generated SAS token.

    Blob - Obtain SAS token

Examples

External catalog

Create a Hive catalog named hive_catalog_azure:

CREATE EXTERNAL CATALOG hive_catalog_azure
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083",
    "azure.blob.storage_account" = "<storage_account_name>",
    "azure.blob.container" = "<container_name>",
    "azure.blob.sas_token" = "<storage_account_SAS_token>"
);

CelerData now supports Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog. For the detailed syntax and parameters for creating these external catalogs, see Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog.

Broker Load

Create a Broker Load job to load data from the data file file1.parquet to the destination table table1 which belongs to the database named mydatabase within the CelerData cluster:

LOAD LABEL mydatabase.label_blob_sas
(
    DATA INFILE("wasb[s]://<container_name>@<storage_account_name>.blob.core.windows.net/<path>/file1.parquet")
    INTO TABLE table1
    FORMAT AS "parquet"
)
WITH BROKER
(
    "azure.blob.storage_account" = "<storage_account_name>",
    "azure.blob.container" = "<container_name>",
    "azure.blob.sas_token" = "<storage_account_SAS_token>"
)
PROPERTIES
(
    "timeout" = "3600"
);

For the detailed syntax and parameters for creating a Broker Load job, see BROKER LOAD.

For a typical example of loading data from Azure with table creation DDL, see Batch load data from Azure cloud storage.

Azure Data Lake Storage Gen2

When you access ADLS Gen2, you may need to input a file path:

  • If your storage account allows access over HTTP, use the abfs protocol and write the file path as abfs://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>/<file_name>.
  • If your storage account allows access over HTTPS, use the abfss protocol and write the file path as abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>/<file_name>.

Managed Identity

User-assigned managed identities enable Azure resources to authenticate to cloud services (for example, Azure Key Vault) without storing credentials in code. Each managed identity is created as a standalone Azure resource, and has its own lifecycle. A single resource like a VM can utilize multiple user-assigned managed identities. Similarly, a single user-assigned managed identity can be shared across multiple resources (for example, multiple VMs).

Preparation

Before you can use the Managed Identity authentication method, you must find the managed identity that is referenced in the data credential of the CelerData cluster. Also, if you want the CelerData cluster to access more data in your storage account, you need to check that the managed identity is assigned the required read and write permissions on your storage account.

NOTICE

This authentication method now supports external catalog and Broker Load. Note that for external catalogs, the storage account you use for authentication must be the one used to store the data of your external data source.

Follow these steps:

  1. Sign in to the CelerData Cloud BYOC console.

  2. On the Clusters page, find your CelerData cluster and click it.

  3. On the Cloud settings tab, navigate to the Data credential section where you can find the Managed identity resource ID. The last element in the Managed identity resource ID is the name of the managed identity associated with the CelerData cluster. For example, if the Managed identity resource ID is /subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/celerdata-20240426-resource-group-001/providers/Microsoft.ManagedIdentity/userAssignedIdentities/celerdata-20240426-managed-identity, then the name of the managed identity is celerdata-20240426-managed-identity.

    Gen2 - Managed Identity - Obtain the name of the managed identity

  4. Sign in to the Microsoft Azure portal and navigate to the Storage accounts service page.

  5. Find your storage account and click its name.

  6. In the left-side navigation pane, choose Access control (IAM).

  7. On the Role assignments tab of the page that appears, check the existing role assignments.

    Gen2 - Managed Identity - Role assignments

  8. If the required read and write permissions on the storage account have not been assigned to the managed identity, click Add and choose Add role assignment. Then, add a proper role assignment:

    a. On the Job function roles tab of the Add role assignment page, select a proper role assignment (for example, Storage Blob Data Owner, which allows for full access to the containers and data within your storage account) and click Next.

    b. On the Members tab, select Managed identity for Assign access to.

    c. For Members, click Select members. In the right-side pane that appears, select your Microsoft subscription from the Subscription drop-down list, select User Assigned Managed Identity from the Managed identity drop-down list, search for and select the managed identity from the Select section, and then click Select.

    Gen2 - Managed Identity - Role assignments

    d. Click Review + assign. Then, on the Review + assign tab, review the information you have filled, and click Review + assign to add the role assignment.

Parameters

Configure the following parameters as credentials for accessing your ADLS Gen2.

ParameterDescription
azure.adls2.oauth2_use_managed_identitySpecifies whether to enable the Managed Identity authentication method. Set the value to true.
azure.adls2.oauth2_tenant_idThe ID of the tenant of your ADLS Gen2 storage account.
azure.adls2.oauth2_client_idThe client ID of the managed identity that is referenced in the data credential of the destination CelerData cluster.

Follow these steps to obtain the ID of the tenant of your ADLS Gen2 storage account:

  1. Sign in to the Microsoft Azure portal and navigate to the Tenant properties setting page.
  2. Find Tenant ID and click the copy icon to copy it.

Follow these steps to obtain the client ID of the managed identity:

  1. Sign in to the Microsoft Azure portal and navigate to the Managed Identities service page.

  2. Find the managed identity based on its name which you have obtained in the previous step, and click it.

  3. On the Overview page that is displayed, find the Client ID and click the copy icon to copy it.

    Gen2 - Managed Identity - Obtain the client ID of the managed identity

Examples

External catalog

Create a Hive catalog named hive_catalog_azure:

CREATE EXTERNAL CATALOG hive_catalog_azure
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083",
    "azure.adls2.oauth2_use_managed_identity" = "true",
    "azure.adls2.oauth2_tenant_id" = "<service_principal_tenant_id>",
    "azure.adls2.oauth2_client_id" = "<service_client_id>"
);

CelerData now supports Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog. For the detailed syntax and parameters for creating these external catalogs, see Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog.

Broker Load

Create a Broker Load job to load data from the data file file1.parquet to the destination table table1 which belongs to the database named mydatabase within the CelerData cluster:

LOAD LABEL mydatabase.label_gen2_identity
(
    DATA INFILE("abfs[s]://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>/file1.parquet")
    INTO TABLE table1
    FORMAT AS "parquet"
)
WITH BROKER
(
    "azure.adls2.oauth2_use_managed_identity" = "true",
    "azure.adls2.oauth2_tenant_id" = "<service_principal_tenant_id>",
    "azure.adls2.oauth2_client_id" = "<service_client_id>"
)
PROPERTIES
(
    "timeout" = "3600"
);

For the detailed syntax and parameters for creating a Broker Load job, see BROKER LOAD.

For a typical example of loading data from Azure with table creation DDL, see Batch load data from Azure cloud storage.

Shared Key (access key of storage account)

Using Shared Key to access ADLS Gen2 works in the same way as using Shared Key to access Blob Storage. For information about using and obtaining the shared key, see Blobe Storage > Shared Key.

NOTICE

This authentication method now supports external catalog and Broker Load. Note that for external catalogs, the storage account you use for authentication must be the one used to store the data of your external data source.

Parameters

Configure the following parameters as credentials for accessing your ADLS Gen2.

ParameterDescription
azure.adls2.storage_accountThe name of your ADLS Gen2 storage account.
azure.adls2.shared_keyThe shared key (access key) of your ADLS Gen2 storage account.

Examples

External catalog

Create a Hive catalog named hive_catalog_azure:

CREATE EXTERNAL CATALOG hive_catalog_azure
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083",
    "azure.adls2.storage_account" = "<storage_account_name>",
    "azure.adls2.shared_key" = "<storage_account_shared_key>"
);

CelerData now supports Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog. For the detailed syntax and parameters for creating these external catalogs, see Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog.

Broker Load

Create a Broker Load job to load data from the data file file1.parquet to the destination table table1 which belongs to the database named mydatabase within the CelerData cluster:

LOAD LABEL mydatabase.label_gen2_key
(
    DATA INFILE("abfs[s]://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>/file1.parquet")
    INTO TABLE table1
    FORMAT AS "parquet"
)
WITH BROKER
(
    "azure.adls2.storage_account" = "<storage_account_name>",
    "azure.adls2.shared_key" = "<storage_account_shared_key>"
)
PROPERTIES
(
    "timeout" = "3600"
);

For the detailed syntax and parameters for creating a Broker Load job, see BROKER LOAD.

For a typical example of loading data from Azure with table creation DDL, see Batch load data from Azure cloud storage.

Service Principal

Preparation

Before you can use this authentication method, make sure that you already have a service principal, or create a new one. For information about service principals in Azure, see Create a service principal in Azure.

Then, check that the service principal is assigned the required read and write permissions on your storage account. Follow these steps:

NOTICE

This authentication method now supports external catalog and Broker Load. Note that for external catalogs, the storage account you use for authentication must be the one used to store the data of your external data source, and the service principal must be the one used to deploy your CelerData cluster.

  1. Sign in to the Microsoft Azure portal and navigate to the Storage accounts service page.

  2. Find your ADLS Gen2 storage account and click its name.

  3. In the left-side navigation pane, choose Access control (IAM).

  4. On the Role assignments tab of the page that appears, check the existing role assignments.

  5. If the required read and write permissions on the storage account have not been assigned to the service principal, click Add and choose Add role assignment. Then, add a proper role assignment:

    Gen2 - Service Principal - Role assignments

    a. On the Job function roles tab of the Add role assignment page, select a proper role assignment (for example, Storage Blob Data Owner, which allows for full access to the containers and data within your storage account) and click Next.

    b. On the Members tab, select User, group, or service principal for Assign access to.

    c. For Members, click Select members. In the right-side pane that appears, search for and select the application you have registered, then click Select.

    Gen2 - Service Principal - Role assignments

    d. Click Review + assign. On the Review + assign tab, review the information you have filled. Then, click Review + assign again to add the role assignment.

Parameters

Configure the following parameters as credentials for accessing your ADLS Gen2.

ParameterDescription
azure.adls2.oauth2_client_idThe application (client) ID of the service principal.
azure.adls2.oauth2_client_secretThe value of the client secret of the service principal.
azure.adls2.oauth2_client_endpointThe OAuth 2.0 token endpoint (v1) of the service principal.

NOTE

Client secret values cannot be viewed, except for immediately after creation. After you create a client secret for the service principal, be sure to save the value of the client secret before you leave the page.

Follow these steps to obtain the application (client) ID and OAuth 2.0 token endpoint (v1) of the service principal:

  1. Sign in to the Microsoft Azure portal and navigate to the App registrations service page.

  2. Find your registered application and click its name.

  3. On the Overview page that is displayed, find the Application (client) ID and click the copy icon to copy it.

    Gen2 - Service Principal - Obtain the application (client) ID

  4. Click Endpoints in the upper area of the Overview page.

  5. In the right-side Endpoints pane, find the OAuth 2.0 token endpoint (v1) and click the copy icon to copy it.

    Gen2 - Service Principal - Obtain the OAuth 2.0 token endpoint (v1)

Examples

External catalog

Create a Hive catalog named hive_catalog_azure:

CREATE EXTERNAL CATALOG hive_catalog_azure
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083",
    "azure.adls2.oauth2_client_id" = "<service_client_id>",
    "azure.adls2.oauth2_client_secret" = "<service_principal_client_secret>",
    "azure.adls2.oauth2_client_endpoint" = "<service_principal_client_endpoint>"
);

CelerData now supports Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog. For the detailed syntax and parameters for creating these external catalogs, see Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog.

Broker Load

Create a Broker Load job to load data from the data file file1.parquet to the destination table table1 which belongs to the database named mydatabase within the CelerData cluster:

LOAD LABEL test_db.label_gen2_principal
(
    DATA INFILE("abfs[s]://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>/file1.parquet")
    INTO TABLE table1
    FORMAT AS "parquet"
)
WITH BROKER
(
    "azure.adls2.oauth2_client_id" = "<service_principle_client_id>",
    "azure.adls2.oauth2_client_secret" = "<service_principal_client_secret>",
    "azure.adls2.oauth2_client_endpoint" = "<service_principal_client_endpoint>"
)
PROPERTIES
(
    "timeout" = "3600"
);

For the detailed syntax and parameters for creating a Broker Load job, see BROKER LOAD.

For a typical example of loading data from Azure with table creation DDL, see Batch load data from Azure cloud storage.

Azure Data Lake Storage Gen1

Azure Data Lake Storage Gen1 (ADLS Gen1) is currently no longer publicly accessible in Azure. But if you have data stored in ADLS Gen1, you can still access it by using the following authentication methods.

ADLS Gen1 supports the adl protocol. If you need to input a file path when you access ADLS Gen1, write the file path as adl://<data_lake_storage_gen1_name>.azuredatalakestore.net/<path>/<file_name>.

Managed Service Identity

Parameters

Configure the following parameters as credentials for accessing your ADLS Gen1.

ParameterDescription
azure.adls1.use_managed_service_identitySpecifies whether to enable the Managed Service Identity authentication method. Set the value to true.

Examples

External catalog

Create a Hive catalog named hive_catalog_azure:

CREATE EXTERNAL CATALOG hive_catalog_azure
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083",
    "azure.adls1.use_managed_service_identity" = "true"
);

CelerData now supports Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog. For the detailed syntax and parameters for creating these external catalogs, see Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog.

Broker Load

Create a Broker Load job to load data from the data file file1.parquet to the destination table table1 which belongs to the database named mydatabase within the CelerData cluster:

LOAD LABEL mydatabase.label_gen1_identity
(
    DATA INFILE("adl://<data_lake_storage_gen1_name>.azuredatalakestore.net/<path>/file1.parquet")
    INTO TABLE table1
    FORMAT AS "parquet"
)
WITH BROKER
(
    "azure.adls1.use_managed_service_identity" = "true"
)
PROPERTIES
(
    "timeout" = "3600"
);

For the detailed syntax and parameters for creating a Broker Load job, see BROKER LOAD.

For a typical example of loading data from Azure with table creation DDL, see Batch load data from Azure cloud storage.

Service Principal

Using Service Principal to access ADLS Gen1 works in the same way as using Service Principal to access ADLS Gen2. For information about using the service principal and obtaining its information, see Azure Data Lake Storage Gen2 > Service Principal.

NOTICE

This authentication method now supports external catalog and Broker Load. Note that for external catalogs, the storage account you use for authentication must be the one used to store the data of your external data source, and the service principal must be the one used to deploy your CelerData cluster.

Parameters

Configure the following parameters as credentials for accessing your ADLS Gen1.

ParameterDescription
azure.adls1.oauth2_client_idThe application (client) ID of the service principal.
azure.adls1.oauth2_credentialThe value of the client secret of the service principal.
azure.adls1.oauth2_endpointThe OAuth 2.0 token endpoint (v1) of the service principal or application.

Examples

External catalog

Create a Hive catalog named hive_catalog_azure:

CREATE EXTERNAL CATALOG hive_catalog_azure
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083",
    "azure.adls1.oauth2_client_id" = "<application_client_id>",
    "azure.adls1.oauth2_credential" = "<application_client_credential>",
    "azure.adls1.oauth2_endpoint" = "<OAuth_2.0_authorization_endpoint_v2>"
);

CelerData now supports Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog. For the detailed syntax and parameters for creating these external catalogs, see Hive catalog, Iceberg catalog, Hudi catalog, and Delta Lake catalog.

Broker Load

Create a Broker Load job to load data from the data file file1.parquet to the destination table table1 which belongs to the database named mydatabase within the CelerData cluster:

LOAD LABEL mydatabase.label_gen1_principal
(
    DATA INFILE("adl://<data_lake_storage_gen1_name>.azuredatalakestore.net/<path>/file1.parquet")
    INTO TABLE table1
    FORMAT AS "parquet"
)
WITH BROKER
(
    "azure.adls1.oauth2_client_id" = "<application_client_id>",
    "azure.adls1.oauth2_credential" = "<application_client_credential>",
    "azure.adls1.oauth2_endpoint" = "<OAuth_2.0_authorization_endpoint_v2>"
)
PROPERTIES
(
    "timeout" = "3600"
);

For the detailed syntax and parameters for creating a Broker Load job, see BROKER LOAD.

For a typical example of loading data from Azure with table creation DDL, see Batch load data from Azure cloud storage.