Load data from cloud storage
This tutorial uses Amazon S3 as an example to demonstrate how to load data from cloud storage into CelerData Cloud Serverless.
NOTE
This feature is included in the Premium software edition. If you are subscribed to the Standard edition, see the Software editions documentation for details on the differences between the Standard and Premium editions.
After reading this tutorial, you will be able to load data from an S3 bucket into a destination CelerData table by using INSERT+FILES().
This tutorial uses the Parquet file format as an example. For information about how to load data of other file formats, see FILES and Batch load data from Amazon S3.
Interface: SQL
Estimated time: 20 to 30 minutes
Prerequisites
- You have a CelerData cloud account and are a user with the necessary privileges to create databases and tables. If you do not fulfill these prerequisites, follow the instructions in Sign up for CelerData Cloud Serverless.
- You have obtained the S3 URI of the data file. In this tutorial, CelerData provides sample datasets stored in the s3://lilyliukinesis/tutorials/ path on Amazon S3, helping you run a quick data load from cloud storage.
- You have created a database and a table in your CelerData cloud. Make sure that the table has the same schema as the data file you store in Amazon S3. In this tutorial, the database mydatabase and the table user_behavior are used as examples:

CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;

DROP TABLE IF EXISTS user_behavior;
CREATE TABLE `user_behavior` (
    `UserID` int(11),
    `ItemID` int(11),
    `CategoryID` int(11),
    `BehaviorType` varchar(65533),
    `Timestamp` datetime
);
Load your data
Query the sample dataset
You can use SELECT * FROM FILES(...) to query the sample dataset customer01.parquet:
SELECT * FROM FILES
(
'path' = 's3://lilyliukinesis/tutorials/parquet/customer01.parquet',
'format' = 'parquet',
'aws.s3.use_instance_profile' = 'false',
'aws.s3.access_key' = 'AAAAAAAAAAAAAAAAAAAA',
'aws.s3.secret_key' = 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
)
LIMIT 3;
- FILES() is a table function that reads the file stored in cloud storage based on the path-related properties you specify, infers the table schema of the data in the file, and returns the data from the file as data rows.
- The above example queries the CelerData-provided Parquet-formatted sample dataset customer01.parquet. Hence the path parameter is set to s3://lilyliukinesis/tutorials/parquet/customer01.parquet and the format parameter is set to parquet.
- The above example uses the IAM user-based authentication method and therefore includes the parameters aws.s3.access_key and aws.s3.secret_key. For information about the other authentication methods available, see Authenticate to AWS resources.
NOTICE
If you want to load your own data stored in Amazon S3, you need to set path to the S3 URI of your data file. Also, you need to configure authentication parameters to allow access to that S3 URI.
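For example, if your bucket is accessed through an IAM role instead of access keys, the query might look like the following sketch. The bucket path and role ARN are placeholders, and the role-based parameters shown here are an assumption based on the options described in Authenticate to AWS resources; adjust them to match the authentication method you actually use.
SELECT * FROM FILES
(
    -- Replace with the S3 URI of your own data file.
    'path' = 's3://your-bucket/your-path/your_data.parquet',
    'format' = 'parquet',
    -- Role-based authentication instead of access keys (placeholder ARN).
    'aws.s3.use_instance_profile' = 'true',
    'aws.s3.iam_role_arn' = 'arn:aws:iam::123456789012:role/your-role'
)
LIMIT 3;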
Load data from a single file using INSERT+FILES()
Use the INSERT INTO statement together with the above-mentioned SELECT * FROM FILES(...) statement to load the data of the customer01.parquet file into the user_behavior table:
INSERT INTO user_behavior
SELECT * FROM FILES
(
'path' = 's3://lilyliukinesis/tutorials/parquet/customer01.parquet',
'format' = 'parquet',
'aws.s3.use_instance_profile' = 'false',
'aws.s3.access_key' = 'AAAAAAAAAAAAAAAAAAAA',
'aws.s3.secret_key' = 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
);
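If you want load jobs to be easier to identify later in the information_schema.loads view, you can attach a label to the INSERT statement. The following is an optional variant of the statement above (running both would load the same rows twice); it assumes the StarRocks-style WITH LABEL clause is available, and the label name insert_customer01_demo is a placeholder of your choosing.
-- Same load as above, but with an explicit label for easier tracking.
INSERT INTO user_behavior WITH LABEL insert_customer01_demo
SELECT * FROM FILES
(
    'path' = 's3://lilyliukinesis/tutorials/parquet/customer01.parquet',
    'format' = 'parquet',
    'aws.s3.use_instance_profile' = 'false',
    'aws.s3.access_key' = 'AAAAAAAAAAAAAAAAAAAA',
    'aws.s3.secret_key' = 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
);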
View the load job information
Use the following statement to view the information about the load job you just submitted:
SELECT * FROM information_schema.loads ORDER BY JOB_ID DESC;
Information similar to the following is returned:
+--------+---------------------------------------------+---------------+----------+---------------------+--------+----------+-----------+---------------+-----------------+-----------+----------+-----------------------------------------------------+---------------------+---------------------+---------------------+---------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--------------+--------------+----------------------+
| JOB_ID | LABEL | DATABASE_NAME | STATE | PROGRESS | TYPE | PRIORITY | SCAN_ROWS | FILTERED_ROWS | UNSELECTED_ROWS | SINK_ROWS | ETL_INFO | TASK_INFO | CREATE_TIME | ETL_START_TIME | ETL_FINISH_TIME | LOAD_START_TIME | LOAD_FINISH_TIME | JOB_DETAILS | ERROR_MSG | TRACKING_URL | TRACKING_SQL | REJECTED_RECORD_PATH |
+--------+---------------------------------------------+---------------+----------+---------------------+--------+----------+-----------+---------------+-----------------+-----------+----------+-----------------------------------------------------+---------------------+---------------------+---------------------+---------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--------------+--------------+----------------------+
| 30045 | insert_aa4519c3-6cc0-11ee-b1ae-065bac47e6e9 | mydatabase | FINISHED | ETL:100%; LOAD:100% | INSERT | NORMAL | 5 | 0 | 0 | 5 | | resource:N/A; timeout(s):300; max_filter_ratio:0.0 | 2023-10-17 00:42:04 | 2023-10-17 00:42:04 | 2023-10-17 00:42:04 | 2023-10-17 00:42:04 | 2023-10-17 00:42:04 | {"All backends":{"aa4519c3-6cc0-11ee-b1ae-065bac47e6e9":[10004]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":723,"InternalTableLoadRows":5,"ScanBytes":655,"ScanRows":5,"TaskNumber":1,"Unfinished backends":{"aa4519c3-6cc0-11ee-b1ae-065bac47e6e9":[]}} | NULL | NULL | NULL | NULL |
+--------+---------------------------------------------+---------------+----------+---------------------+--------+----------+-----------+---------------+-----------------+-----------+----------+-----------------------------------------------------+---------------------+---------------------+---------------------+---------------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--------------+--------------+----------------------+
- The SCAN_ROWS field tells you that the customer01.parquet file consists of 5 data rows.
- The SINK_ROWS field tells you that a total of 5 data rows are successfully loaded into the user_behavior table.
- For more information, you can see the JOB_DETAILS field.
For information about the fields provided in the loads view, see Information Schema.
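Because the full output of the loads view is quite wide, you may prefer a narrower query that selects only the fields discussed above, for example:
-- Show only the most recent load job in mydatabase, with the key fields.
SELECT JOB_ID, LABEL, STATE, PROGRESS, SCAN_ROWS, SINK_ROWS
FROM information_schema.loads
WHERE DATABASE_NAME = 'mydatabase'
ORDER BY JOB_ID DESC
LIMIT 1;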
NOTE
INSERT INTO is a synchronous statement. Therefore, if the load job is still running, you need to open another session to query its execution status.
Then, use the following statement to query the user_behavior table:
SELECT * FROM user_behavior LIMIT 3;
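Optionally, you can also cross-check the total row count of the table against the SINK_ROWS value reported above; because the table was empty before this load, the following statement should return 5 at this point:
SELECT COUNT(*) FROM user_behavior;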
Load data from multiple data files using INSERT+FILES()
You can use wildcards in the path parameter to specify multiple data files you want to load.
The following example loads the data of multiple files, customer02.parquet, customer03.parquet, and customer04.parquet, into the user_behavior table:
INSERT INTO user_behavior
SELECT * FROM FILES
(
'path' = 's3://lilyliukinesis/tutorials/parquet/customer0[2-4].parquet',
'format' = 'parquet',
'aws.s3.use_instance_profile' = 'false',
'aws.s3.access_key' = 'AAAAAAAAAAAAAAAAAAAA',
'aws.s3.secret_key' = 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
);
Load data using CTAS
In the above examples, you need to create a table before you can load data by using INSERT+FILES().
You can also load data by using the CREATE TABLE AS SELECT (CTAS) statement without creating a table up front. The following example creates a table named user_behavior_ctas and loads the data of all files named like customer0*.parquet in the s3://lilyliukinesis/tutorials/parquet path to that table:
CREATE TABLE user_behavior_ctas AS
SELECT * FROM FILES
(
'path' = 's3://lilyliukinesis/tutorials/parquet/customer0*.parquet'
, 'format' = 'parquet'
, 'aws.s3.use_instance_profile' = 'false'
, 'aws.s3.access_key' = 'AAAAAAAAAAAAAAAAAAAA'
, 'aws.s3.secret_key' = 'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
);
Use the following statement to query the user_behavior_ctas table, and you will find that the table has been created and the data has been successfully loaded into it:
SELECT * FROM user_behavior_ctas LIMIT 3;
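Because CTAS derives the table schema from the Parquet files rather than from a CREATE TABLE statement you wrote, it can be worth confirming the inferred column names and types. A quick way to check, using the standard DESCRIBE statement:
-- Inspect the schema that CTAS inferred from the Parquet files.
DESCRIBE user_behavior_ctas;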
Clean up the environment
Congratulations! You have finished your journey with a data load on CelerData Cloud Serverless!
Use the following statement to drop mydatabase, which you created for a test run:
DROP DATABASE IF EXISTS mydatabase;