
CREATE STORAGE VOLUME

Description

Creates a storage volume for a remote storage system. This feature is supported from v3.1.

A storage volume consists of the properties and credential information of the remote data storage. You can reference a storage volume when you create databases and cloud-native tables in a shared-data StarRocks cluster.
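
As a minimal sketch of what "referencing" a storage volume can look like, the statements below assume a shared-data cluster and an existing, enabled storage volume. The names my_s3_volume, cloud_db, and cloud_table are placeholders, and the storage_volume property is assumed to be accepted by CREATE DATABASE and CREATE TABLE in your version:

```sql
-- Hypothetical names; assumes the storage volume my_s3_volume already exists and is enabled.
CREATE DATABASE cloud_db
PROPERTIES ("storage_volume" = "my_s3_volume");

-- A cloud-native table that stores its data in the same storage volume.
CREATE TABLE cloud_db.cloud_table (
    id BIGINT,
    name VARCHAR(64)
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH (id)
PROPERTIES ("storage_volume" = "my_s3_volume");
```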

CAUTION

Only users with the CREATE STORAGE VOLUME privilege on the SYSTEM level can perform this operation.
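
A sketch of how this privilege might be granted, assuming the standard GRANT syntax for SYSTEM-level privileges. The role name storage_admin and the user 'jack'@'%' are hypothetical:

```sql
-- Grant the privilege to a role (hypothetical role name).
GRANT CREATE STORAGE VOLUME ON SYSTEM TO ROLE storage_admin;

-- Or grant it directly to a user (hypothetical user identity).
GRANT CREATE STORAGE VOLUME ON SYSTEM TO USER 'jack'@'%';
```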

Syntax

CREATE STORAGE VOLUME [IF NOT EXISTS] <storage_volume_name>
TYPE = { S3 | HDFS | AZBLOB }
LOCATIONS = ('<remote_storage_path>')
[ COMMENT '<comment_string>' ]
PROPERTIES
("key" = "value",...)

Parameters

Parameter: Description
storage_volume_name: The name of the storage volume. Please note that you cannot create a storage volume named builtin_storage_volume because it is used to create the builtin storage volume. For the naming conventions, see System limits.
TYPE: The type of the remote storage system. Valid values: S3, HDFS and AZBLOB. S3 indicates AWS S3 or S3-compatible storage systems. AZBLOB indicates Azure Blob Storage (supported from v3.1.1 onwards). HDFS indicates an HDFS cluster.
LOCATIONS: The storage locations. The format is as follows:
  • For AWS S3 or S3 protocol-compatible storage systems: s3://<s3_path>. <s3_path> must be an absolute path, for example, s3://testbucket/subpath. Note that if you want to enable the Partitioned Prefix feature for the storage volume, you can only specify the bucket name, and specifying a sub-path is not allowed.
  • For Azure Blob Storage: azblob://<azblob_path>. <azblob_path> must be an absolute path, for example, azblob://testcontainer/subpath.
  • For HDFS: hdfs://<host>:<port>/<hdfs_path>. <hdfs_path> must be an absolute path, for example, hdfs://127.0.0.1:9000/user/xxx/starrocks.
  • For WebHDFS: webhdfs://<host>:<http_port>/<hdfs_path>, where <http_port> is the HTTP port of the NameNode. <hdfs_path> must be an absolute path, for example, webhdfs://127.0.0.1:50070/user/xxx/starrocks.
  • For ViewFS: viewfs://<ViewFS_cluster>/<viewfs_path>, where <ViewFS_cluster> is the ViewFS cluster name. <viewfs_path> must be an absolute path, for example, viewfs://myviewfscluster/user/xxx/starrocks.
COMMENT: The comment on the storage volume.
PROPERTIES: Parameters in the "key" = "value" pairs used to specify the properties and credential information to access the remote storage system. For detailed information, see PROPERTIES.
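
The optional clauses in the syntax can be combined as in the following sketch. The volume name, bucket path, and comment are illustrative, and the statement assumes that the AWS SDK default credential chain is available on the cluster nodes:

```sql
-- Hypothetical volume name and bucket path.
-- IF NOT EXISTS makes the statement a no-op if the volume is already defined.
CREATE STORAGE VOLUME IF NOT EXISTS my_commented_volume
TYPE = S3
LOCATIONS = ('s3://testbucket/subpath')
COMMENT 'Storage volume for test data'
PROPERTIES
(
    "enabled" = "true",
    "aws.s3.region" = "us-west-2",
    "aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
    "aws.s3.use_aws_sdk_default_behavior" = "true"
);
```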

PROPERTIES

The table below lists all available properties of storage volumes. The sections after the table explain how to use these properties in different scenarios, grouped under Credential information and Features.

Property: Description
enabled: Whether to enable this storage volume. Default: false. A disabled storage volume cannot be referenced.
aws.s3.region: The region in which your S3 bucket resides, for example, us-west-2.
aws.s3.endpoint: The endpoint URL used to access your S3 bucket, for example, https://s3.us-west-2.amazonaws.com. [Preview] From v3.3.0 onwards, the Amazon S3 Express One Zone storage class is supported, for example, https://s3express.us-west-2.amazonaws.com.
aws.s3.use_aws_sdk_default_behavior: Whether to use the default authentication credential of AWS SDK. Valid values: true and false (Default).
aws.s3.use_instance_profile: Whether to use Instance Profile and Assumed Role as credential methods for accessing S3. Valid values: true and false (Default).
  • If you use IAM user-based credential (Access Key and Secret Key) to access S3, you must specify this item as false, and specify aws.s3.access_key and aws.s3.secret_key.
  • If you use Instance Profile to access S3, you must specify this item as true.
  • If you use Assumed Role to access S3, you must specify this item as true, and specify aws.s3.iam_role_arn.
  • If you use an external AWS account, you must specify this item as true, and specify aws.s3.iam_role_arn and aws.s3.external_id.
aws.s3.access_key: The Access Key ID used to access your S3 bucket.
aws.s3.secret_key: The Secret Access Key used to access your S3 bucket.
aws.s3.iam_role_arn: The ARN of the IAM role that has privileges on the S3 bucket in which your data files are stored.
aws.s3.external_id: The external ID of the AWS account that is used for cross-account access to your S3 bucket.
azure.blob.endpoint: The endpoint of your Azure Blob Storage Account, for example, https://test.blob.core.windows.net.
azure.blob.shared_key: The Shared Key used to authorize requests for your Azure Blob Storage.
azure.blob.sas_token: The shared access signatures (SAS) used to authorize requests for your Azure Blob Storage.
hadoop.security.authentication: The authentication method. Valid values: simple (Default) and kerberos. simple indicates simple authentication, that is, username. kerberos indicates Kerberos authentication.
username: The username used to access the NameNode in the HDFS cluster.
hadoop.security.kerberos.ticket.cache.path: The path that stores the kinit-generated Ticket Cache.
dfs.nameservices: The name of the HDFS cluster.
dfs.ha.namenodes.<ha_cluster_name>: The name of the NameNode. Multiple names must be separated by commas (,). No space is allowed in the double quotes. <ha_cluster_name> is the name of the HDFS service specified in dfs.nameservices.
dfs.namenode.rpc-address.<ha_cluster_name>.<NameNode>: The RPC address information of the NameNode. <NameNode> is the name of the NameNode specified in dfs.ha.namenodes.<ha_cluster_name>.
dfs.client.failover.proxy.provider: The provider of the NameNode for client connection. The default value is org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.
fs.viewfs.mounttable.<ViewFS_cluster>.link./<viewfs_path>: The path to the ViewFS cluster to be mounted. Multiple paths must be separated by commas (,). <ViewFS_cluster> is the ViewFS cluster name specified in LOCATIONS.
aws.s3.enable_partitioned_prefix: Whether to enable the Partitioned Prefix feature for the storage volume. Default: false. For more information about this feature, see Partitioned Prefix.
aws.s3.num_partitioned_prefix: The number of prefixes to be created for the storage volume. Default: 256. Valid range: [4, 1024].
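
Because enabled defaults to false, a storage volume created without this property cannot be referenced until it is enabled. A sketch of enabling one afterwards, assuming the ALTER STORAGE VOLUME ... SET form is available in your version and that a volume named my_s3_volume exists:

```sql
-- Assumes an existing storage volume named my_s3_volume.
ALTER STORAGE VOLUME my_s3_volume
SET ("enabled" = "true");
```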

Credential information

AWS S3
  • If you use the default authentication credential of AWS SDK to access S3, set the following properties:

    "enabled" = "{ true | false }",
    "aws.s3.region" = "<region>",
    "aws.s3.endpoint" = "<endpoint_url>",
    "aws.s3.use_aws_sdk_default_behavior" = "true"
  • If you use IAM user-based credential (Access Key and Secret Key) to access S3, set the following properties:

    "enabled" = "{ true | false }",
    "aws.s3.region" = "<region>",
    "aws.s3.endpoint" = "<endpoint_url>",
    "aws.s3.use_aws_sdk_default_behavior" = "false",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>"
  • If you use Instance Profile to access S3, set the following properties:

    "enabled" = "{ true | false }",
    "aws.s3.region" = "<region>",
    "aws.s3.endpoint" = "<endpoint_url>",
    "aws.s3.use_aws_sdk_default_behavior" = "false",
    "aws.s3.use_instance_profile" = "true"
  • If you use Assumed Role to access S3, set the following properties:

    "enabled" = "{ true | false }",
    "aws.s3.region" = "<region>",
    "aws.s3.endpoint" = "<endpoint_url>",
    "aws.s3.use_aws_sdk_default_behavior" = "false",
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.iam_role_arn" = "<role_arn>"
  • If you use Assumed Role to access S3 from an external AWS account, set the following properties:

    "enabled" = "{ true | false }",
    "aws.s3.region" = "<region>",
    "aws.s3.endpoint" = "<endpoint_url>",
    "aws.s3.use_aws_sdk_default_behavior" = "false",
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.iam_role_arn" = "<role_arn>",
    "aws.s3.external_id" = "<external_id>"
GCS

If you use GCP Cloud Storage, set the following properties:

"enabled" = "{ true | false }",

-- For example: us-east-1
"aws.s3.region" = "<region>",

-- For example: https://storage.googleapis.com
"aws.s3.endpoint" = "<endpoint_url>",

"aws.s3.access_key" = "<access_key>",
"aws.s3.secret_key" = "<secret_key>"

MinIO

If you use MinIO, set the following properties:

"enabled" = "{ true | false }",

-- For example: us-east-1
"aws.s3.region" = "<region>",

-- For example: http://172.26.xx.xxx:39000
"aws.s3.endpoint" = "<endpoint_url>",

"aws.s3.access_key" = "<access_key>",
"aws.s3.secret_key" = "<secret_key>"

Azure Blob Storage

Creating a storage volume on Azure Blob Storage is supported from v3.1.1 onwards.

  • If you use Shared Key to access Azure Blob Storage, set the following properties:

    "enabled" = "{ true | false }",
    "azure.blob.endpoint" = "<endpoint_url>",
    "azure.blob.shared_key" = "<shared_key>"
  • If you use shared access signatures (SAS) to access Azure Blob Storage, set the following properties:

    "enabled" = "{ true | false }",
    "azure.blob.endpoint" = "<endpoint_url>",
    "azure.blob.sas_token" = "<sas_token>"
note

The hierarchical namespace must be disabled when you create the Azure Blob Storage Account.
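
Put together, a Shared Key setup might look like the following sketch. The volume name my_azure_volume is hypothetical, the container path and endpoint reuse the sample values from the tables above, and <shared_key> must be replaced with your own key:

```sql
-- Hypothetical volume name; replace <shared_key> with the key of your storage account.
CREATE STORAGE VOLUME my_azure_volume
TYPE = AZBLOB
LOCATIONS = ('azblob://testcontainer/subpath')
PROPERTIES
(
    "enabled" = "true",
    "azure.blob.endpoint" = "https://test.blob.core.windows.net",
    "azure.blob.shared_key" = "<shared_key>"
);
```

To use a SAS token instead, the azure.blob.shared_key line would be swapped for azure.blob.sas_token.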

HDFS
  • If you do not use authentication to access HDFS, set the following properties:

    "enabled" = "{ true | false }"
  • If you are using simple authentication (supported from v3.2) to access HDFS, set the following properties:

    "enabled" = "{ true | false }",
    "hadoop.security.authentication" = "simple",
    "username" = "<hdfs_username>"
  • If you are using Kerberos Ticket Cache authentication (supported since v3.2) to access HDFS, set the following properties:

    "enabled" = "{ true | false }",
    "hadoop.security.authentication" = "kerberos",
    "hadoop.security.kerberos.ticket.cache.path" = "<ticket_cache_path>"

    CAUTION

    • This setting only forces the system to use KeyTab to access HDFS via Kerberos. Make sure that each BE or CN node has access to the KeyTab files. Also make sure that the /etc/krb5.conf file is set up correctly.
    • The Ticket cache is generated by an external kinit tool. Make sure you have a crontab or similar periodic task to refresh the tickets.
  • If your HDFS cluster is enabled for NameNode HA configuration (supported since v3.2), additionally set the following properties:

    "dfs.nameservices" = "<ha_cluster_name>",
    "dfs.ha.namenodes.<ha_cluster_name>" = "<NameNode1>,<NameNode2> [, ...]",
    "dfs.namenode.rpc-address.<ha_cluster_name>.<NameNode1>" = "<hdfs_host>:<hdfs_port>",
    "dfs.namenode.rpc-address.<ha_cluster_name>.<NameNode2>" = "<hdfs_host>:<hdfs_port>",
    [...]
    "dfs.client.failover.proxy.provider.<ha_cluster_name>" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"

    For more information, see HDFS HA Documentation.

  • If you are using WebHDFS (supported since v3.2), set the following properties:

    "enabled" = "{ true | false }"

    For more information, see WebHDFS Documentation.

  • If you are using Hadoop ViewFS (supported since v3.2), set the following properties:

    -- Replace <ViewFS_cluster> with the name of the ViewFS cluster.
    "fs.viewfs.mounttable.<ViewFS_cluster>.link./<viewfs_path_1>" = "hdfs://<hdfs_host_1>:<hdfs_port_1>/<hdfs_path_1>",
    "fs.viewfs.mounttable.<ViewFS_cluster>.link./<viewfs_path_2>" = "hdfs://<hdfs_host_2>:<hdfs_port_2>/<hdfs_path_2>",
    [, ...]

    For more information, see ViewFS Documentation.

Features

Partitioned Prefix

From v3.2.4 onwards, StarRocks supports creating storage volumes with the Partitioned Prefix feature for S3-compatible object storage systems. When this feature is enabled, StarRocks stores the data under multiple, uniformly prefixed partitions (sub-paths) in the bucket. Because the QPS or throughput limit of a bucket applies per partition, spreading data files across these prefixes can multiply StarRocks' read and write performance on the bucket.

To enable this feature, set the following properties in addition to the above credential-related parameters:

"aws.s3.enable_partitioned_prefix" = "{ true | false }",
"aws.s3.num_partitioned_prefix" = "<INT>"
note
  • The Partitioned Prefix feature is only supported for S3-compatible object storage systems, that is, the TYPE of the storage volume must be S3.
  • LOCATIONS of the storage volume must only contain the bucket name, for example, s3://testbucket. Specifying a sub-path after the bucket name is not allowed.
  • Both properties are immutable once the storage volume is created.
  • You cannot enable this feature when you create a storage volume by using the FE configuration file fe.conf.
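
The constraints above can be seen together in the following sketch. The volume name my_prefixed_volume and the prefix count of 512 are illustrative choices, and the credential properties assume the AWS SDK default credential chain:

```sql
-- Hypothetical volume name. LOCATIONS contains only the bucket name, with no sub-path.
CREATE STORAGE VOLUME my_prefixed_volume
TYPE = S3
LOCATIONS = ('s3://testbucket')
PROPERTIES
(
    "enabled" = "true",
    "aws.s3.region" = "us-west-2",
    "aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
    "aws.s3.use_aws_sdk_default_behavior" = "true",
    -- Both properties below are immutable after creation.
    "aws.s3.enable_partitioned_prefix" = "true",
    "aws.s3.num_partitioned_prefix" = "512"
);
```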

Examples

Example 1: Create a storage volume my_s3_volume for the AWS S3 bucket defaultbucket, use the IAM user-based credential (Access Key and Secret Key) to access S3, and enable it.

CREATE STORAGE VOLUME my_s3_volume
TYPE = S3
LOCATIONS = ("s3://defaultbucket/test/")
PROPERTIES
(
"enabled" = "true",
"aws.s3.region" = "us-west-2",
"aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
"aws.s3.use_aws_sdk_default_behavior" = "false",
"aws.s3.use_instance_profile" = "false",
"aws.s3.access_key" = "xxxxxxxxxx",
"aws.s3.secret_key" = "yyyyyyyyyy"
);
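
As a variant of the S3 example, the following sketch uses Assumed Role instead of an Access Key and Secret Key, following the property combination listed under Credential information. The volume name my_role_volume and the role ARN are hypothetical placeholders:

```sql
-- Hypothetical volume name and role ARN; replace them with your own values.
CREATE STORAGE VOLUME my_role_volume
TYPE = S3
LOCATIONS = ("s3://defaultbucket/test/")
PROPERTIES
(
    "enabled" = "true",
    "aws.s3.region" = "us-west-2",
    "aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
    "aws.s3.use_aws_sdk_default_behavior" = "false",
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.iam_role_arn" = "arn:aws:iam::123456789012:role/starrocks-s3-access"
);
```

For cross-account access, aws.s3.external_id would be added alongside aws.s3.iam_role_arn.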

Example 2: Create a storage volume my_hdfs_volume for HDFS and enable it.

CREATE STORAGE VOLUME my_hdfs_volume
TYPE = HDFS
LOCATIONS = ("hdfs://127.0.0.1:9000/sr/test/")
PROPERTIES
(
"enabled" = "true"
);

Example 3: Create a storage volume hdfsvolumehadoop for HDFS using simple authentication.

CREATE STORAGE VOLUME hdfsvolumehadoop
TYPE = HDFS
LOCATIONS = ("hdfs://127.0.0.1:9000/sr/test/")
PROPERTIES(
"hadoop.security.authentication" = "simple",
"username" = "starrocks"
);

Example 4: Use Kerberos Ticket Cache authentication to access HDFS and create storage volume hdfsvolkerberos.

CREATE STORAGE VOLUME hdfsvolkerberos
TYPE = HDFS
LOCATIONS = ("hdfs://127.0.0.1:9000/sr/test/")
PROPERTIES(
"hadoop.security.authentication" = "kerberos",
"hadoop.security.kerberos.ticket.cache.path" = "/path/to/ticket/cache/path"
);

Example 5: Create storage volume hdfsvolha for an HDFS cluster with NameNode HA configuration enabled.

CREATE STORAGE VOLUME hdfsvolha
TYPE = HDFS
LOCATIONS = ("hdfs://myhacluster/data/sr")
PROPERTIES(
"dfs.nameservices" = "myhacluster",
"dfs.ha.namenodes.myhacluster" = "nn1,nn2,nn3",
"dfs.namenode.rpc-address.myhacluster.nn1" = "machine1.example.com:8020",
"dfs.namenode.rpc-address.myhacluster.nn2" = "machine2.example.com:8020",
"dfs.namenode.rpc-address.myhacluster.nn3" = "machine3.example.com:8020",
"dfs.namenode.http-address.myhacluster.nn1" = "machine1.example.com:9870",
"dfs.namenode.http-address.myhacluster.nn2" = "machine2.example.com:9870",
"dfs.namenode.http-address.myhacluster.nn3" = "machine3.example.com:9870",
"dfs.client.failover.proxy.provider.myhacluster" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);

Example 6: Create a storage volume webhdfsvol for WebHDFS.

CREATE STORAGE VOLUME webhdfsvol
TYPE = HDFS
LOCATIONS = ("webhdfs://namenode:9870/data/sr");

Example 7: Create a storage volume viewfsvol using Hadoop ViewFS.

CREATE STORAGE VOLUME viewfsvol
TYPE = HDFS
LOCATIONS = ("viewfs://clusterX/data/sr")
PROPERTIES(
"fs.viewfs.mounttable.clusterX.link./data" = "hdfs://nn1-clusterx.example.com:8020/data",
"fs.viewfs.mounttable.clusterX.link./project" = "hdfs://nn2-clusterx.example.com:8020/project"
);
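
After creating a storage volume, you may want to check that it exists and inspect its properties. The following sketch assumes that the SHOW STORAGE VOLUMES, DESC STORAGE VOLUME, and SET DEFAULT STORAGE VOLUME statements are available in your version and reuses my_s3_volume from Example 1:

```sql
-- List all storage volumes in the cluster.
SHOW STORAGE VOLUMES;

-- Inspect the type, locations, and properties of one storage volume.
DESC STORAGE VOLUME my_s3_volume;

-- Optionally make it the default storage volume for new databases and tables.
SET my_s3_volume AS DEFAULT STORAGE VOLUME;
```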

Relevant SQL statements