Data modeling with materialized views

This topic describes how to perform data modeling with the help of CelerData's asynchronous materialized views. By doing so, you can greatly simplify the ETL pipeline of your data warehouses, and significantly improve data quality and query performance.

NOTE

This feature is included in the Premium software edition. Please see the Software editions documentation for details on the differences between Standard and Premium editions if you are subscribed to the Standard edition.

Overview

Data modeling is the process of cleaning, layering, aggregating, and associating data with rational methodologies. It can create a comprehensible representation of the raw data that is too rough, too complicated, or too costly to analyze directly, and provide actionable insights into the data.

However, the common challenge in real-world data modeling is that the modeling process struggles to keep up with the pace of business development, and it is difficult to measure the return on investment for data modeling efforts. Despite the fact that the modeling methodologies are simple, business experts are required to have a solid background in data organization and governance, which is a complex process. In the early stages of business, decision-makers rarely spare sufficient resources for data modeling, and it is challenging to see the value that data modeling can bring. Moreover, business models can change rapidly, and the modeling methodologies themselves need iteration and evolution. Therefore, many data analysts tend to avoid modeling and use raw data directly, thus inevitably leading to issues of data quality and query performance. When the need for modeling arises, it becomes difficult to restructure data analytics patterns that have already been established to match those of the data models.

Using materialized views for data modeling can effectively address these problems. CelerData's asynchronous materialized views can:

  • Simplify the data warehouse architecture: Because CelerData can provide a one-stop data governance experience, you do not need to maintain other data processing systems, saving the human and system resources spent on them.
  • Ease the data modeling experience: Any data analyst with only basic SQL knowledge is capable of data modeling with CelerData. Data modeling is no longer the exclusive province of experienced data engineers.
  • Reduce maintenance complexity: CelerData's asynchronous materialized views can automatically manage the lineage relationships and dependencies across data layers, eliminating the need for an entire data platform to handle this task.

Modeling-1

In real-world situations, you can perform data modeling by combining the usage of CelerData's views (logical views) and asynchronous materialized views as follows:

  1. Use views to associate real-time data with dimension data, and use materialized views to associate historical data from the data lake with dimension data. Perform necessary data cleaning and semantic mapping to obtain the detail data for the Intermediate Layer, which reflects the semantics required in your business scenarios.
  2. In the Application Layer, perform data Join, Aggregation, Union, and Window calculations tailored to different business scenarios. This will yield views for real-time pipelines and materialized views for near-real-time pipelines.
  3. On the Application side, choose an appropriate Analytical Data Store (ADS) for query analysis based on your timeliness and performance requirements. These ADS can serve real-time dashboards, near-real-time BI, ad-hoc queries, and scheduled reports.

During this process, you will leverage several built-in capabilities of CelerData, which will be elaborated in the following section.

Capabilities of asynchronous materialized view

CelerData's asynchronous materialized view possesses the following atomic functions that can aid in data modeling:

  • Automatic refresh: After data is loaded into base tables, materialized views can be automatically refreshed. You do not need to maintain a scheduling task externally.
  • Partitioned refresh: Near-real-time calculations can be achieved through the partitioned refresh of the materialized views built on tables featuring time series.
  • Synergy with views: You can achieve multi-layer modeling by using materialized views and logical views, thus enabling the reuse of the Intermediate Layer and simplification of data models.
  • Schema change: You can alter the calculation results through simple SQL statements, without the need to modify the complex data pipelines.

With these features, you can design comprehensive and adaptable data models to meet various business needs and scenarios.

Automatic refresh

When creating an asynchronous materialized view, you can specify the refresh strategy using the REFRESH clause. Currently, CelerData supports the following refresh strategies of asynchronous materialized views:

  • Automatic Refresh (REFRESH ASYNC): Refresh tasks are triggered each time data in the base tables changes. Data dependencies are automatically managed by the materialized view.
  • Scheduled Refresh (REFRESH ASYNC EVERY (INTERVAL <refresh_interval>)): Refresh tasks are triggered at regular intervals, for example, every minute, day, or month. If there are no data changes in the base tables, the refresh task will not be triggered.
  • Manual Refresh (REFRESH MANUAL): Refresh tasks are triggered only by executing REFRESH MATERIALIZED VIEW manually. This refresh strategy can be used when you maintain an external scheduling framework to trigger refresh tasks.

Syntax:

CREATE MATERIALIZED VIEW <name>
REFRESH 
    [ ASYNC | 
      ASYNC [START <time>] EVERY(<interval>) | 
      MANUAL
    ]
AS <query>

Partitioned refresh

When creating an asynchronous materialized view, you can specify the PARTITION BY clause to associate the partitions of the base table with those of the materialized view, thus achieving partition-level refresh.

  • PARTITION BY <column>: You can reference the same partitioning column for the base table and the materialized view. As a result, the base table and the materialized view are partitioned on the same granularity.
  • PARTITION BY date_trunc(<column>): You can use the date_trunc function to assign a different partition strategy (on the level of granularity) for the materialized view by truncating the time unit.
  • PARTITION BY { time_slice | date_slice }(<column>): In comparison to date_trunc, time_slice and date_slice offer more flexible time granularity adjustments, allowing for finer control over partitioning based on time.

Syntax:

CREATE MATERIALIZED VIEW <name>
REFRESH ASYNC
PARTITION BY 
    [
        <base_table_column> | 
        date_trunc(<granularity>, <base_table_column>) |
        time_slice(<base_table_column>, <granularity>) | 
        date_slice(<base_table_column>, <granularity>)
    ]
AS <query>

Synergy with views

  • Materialized views can be created based on views. In this case, when the base tables referenced by the view undergo data changes, the materialized view can be automatically refreshed.
  • You can also create materialized views based on other materialized views, enabling multi-level cascading refresh mechanisms.
  • Views can be created based on materialized views, which are equivalent to regular tables.

Schema change

  • You can perform an atomic exchange between two asynchronous materialized views using the ALTER MATERIALIZED VIEW SWAP statement. This allows you to create a new materialized view with added columns or changed column types and then replace the old one with it.
  • The definition of views can be directly modified using the ALTER VIEW statement.
  • Regular tables in CelerData can be modified using either SWAP or ALTER operations.
  • Furthermore, when there are changes in the base tables (which can be materialized views, views, or regular tables), it triggers cascading changes in the corresponding materialized views.

Layered modeling

In many real-world business scenarios, there are various forms of data sources, including real-time detail data, dimension data, and historical data from data lakes. On the other hand, business requirements call for diverse analytical methods, such as real-time dashboards, near-real-time BI queries, ad-hoc queries, and scheduled reports. Different scenarios have different demands - some require flexibility, some prioritize performance, while others emphasize cost-effectiveness.

Clearly, a single solution cannot adequately address such versatile demands. CelerData can efficiently address these needs by combining the usage of views and materialized views. Because views maintain no physical data, each time a view is queried, the query is parsed and executed according to the definition of the view. In comparison, materialized views, which hold the pre-computed results, can prevent the overhead of repetitive execution. Views are suitable for expressing business semantics and simplifying SQL complexity, but they cannot reduce the costs of query execution. Materialized Views, on the other hand, optimize the query performance through pre-computation and are suitable for streamlining ETL pipelines.

Below is a summary of the differences between views and materialized views:

ViewMaterialized view
Use casesBusiness modeling, data governanceData modeling, transparent acceleration, data lake integration
Storage costNo storage costStorage cost incurred by storing pre-computed results
Update costNo update costRefresh cost incurred when base table data updates
Performance benefitsNo performance benefitQuery acceleration introduced by reusing pre-computes results
Data real-time attributeThe latest data is returned because queries against views are computed in real-time.Data may not be up-to-date because results are pre-computed.
DependencyViews become invalid if base table names are changed because they reference base tables by name.Changes of base table names do not affect the availability of the materialized views, which reference base tables by ID.
Creation SyntaxCREATE VIEWCREATE MATERIALIZED VIEW
Modification SyntaxALTER VIEWALTER MATERIALIZED VIEW

You can use the following statements to modify your views, materialized views, and base tables:

-- Modify a table.
ALTER TABLE <table_name> ADD COLUMN <column_desc>;

-- Swap two tables.
ALTER TABLE <table1> SWAP WITH <table2>;

-- Modify the definition of a view.
ALTER VIEW <view_name> AS <query>;

-- Swap two materialized views 
-- (by swapping the name of the two materialized views without affecting the data within).
ALTER MATERIALIZED VIEW <mv1> SWAP WITH <mv2>;

-- Re-activate a materialized view.
ALTER MATERIALIZED VIEW <mv_name> ACTIVE;

Schema changes follow these principles:

  • Rename and Swap operations on a table will set the dependent materialized views inactive. For Schema Change operations, the dependent materialized views are set inactive only when Schema Change operations are performed on the base table columns referenced by the materialized views.
  • If you change the definition of a view, the dependent materialized views are set to inactive.
  • If a materialized view is swapped, any nested materialized views built on it are set to inactive.
  • The inactive status cascades upwards until there are no materialized view dependencies.
  • Inactive materialized views cannot be refreshed or used for automatic query rewrite.
  • Inactive materialized views can still be queried directly, but data consistency is not guaranteed until they are active again.

Whereas the data consistency of inactive materialized views cannot be guaranteed, you can restore the functionality of them using the following methods:

  • Manual repair: You can manually repair an inactive materialized view by executing ALTER MATERIALIZED VIEW <mv_name> ACTIVE. This statement will recreate the materialized view based on its original SQL definition. Please note that the SQL definition must still be valid after the underlying schema changes, otherwise, the operation will fail.
  • Automatic repair: CelerData will attempt to automatically activate inactive materialized views. However, the timeliness of this process cannot be guaranteed.

Partitioned modeling

In addition to layered modeling, partitioned modeling is also an essential aspect of data modeling. Data modeling often involves associating data based on business semantics and setting data's Time-To-Live (TTL) as per timeliness requirements. Partitioned modeling plays a significant role in this process.

Partitioned modeling is an essential aspect of data modeling, complementing layered modeling. It involves associating data based on business semantics and setting Time-To-Live (TTL) for data as per timeliness requirements. Data partitioning plays a significant role in this process.

Different ways of associating data give rise to various modeling approaches, such as star schemas and snowflake schemas. These models have something in common - they all use fact tables and dimension tables. Some business scenarios require multiple large fact tables, while others deal with complex dimension tables and the relationships among them. CelerData's materialized views support partition association for fact tables, meaning that the fact table is partitioned, and the materialized view's join results are partitioned in the same way.

Modeling-2

As the above figure shows, a materialized view associates a fact table with multiple dimension tables:

  • You need to reference the partition key of a specific base table (usually the fact table) as the materialized view's partitioning key (PARTITION BY fact_tbl.col) to associate their partitioning strategies. Each materialized view can only be associated with one base table.
  • When the data in a partition of the referenced table changes, the corresponding partition in the materialized view is refreshed without affecting other partitions.
  • When the data in the non-referenced tables changes, the entire materialized view is refreshed by default. However, you can choose to ignore the data changes in certain non-referenced base tables so that the materialized view will not be refreshed when data in these tables changes.

Such partition association supports various business scenarios:

  • Fact table updates: You can partition the fact table at a fine-grained level, such as daily or hourly. After the fact table is updated, the corresponding partitions in the materialized view are automatically refreshed.
  • Dimension table updates: Usually, data updates in dimension tables will lead to the refresh of all associated results, which can be costly. You can choose to ignore the data updates in some dimension tables to avoid refreshing the entire materialized view, or you can specify a time range so that only the partitions within the time range can be refreshed.
  • External table automatic refresh: In external data sources like Apache Hive or Apache Iceberg, data changes at the partition level. CelerData' materialized views can subscribe to changes in external catalogs at the partition level, and only refresh the corresponding partition of the materialized view.
  • TTL: When setting the partitioning strategy for a materialized view, you can set the number of recent partitions to retain, thus keeping only the most recent data. This is useful in business scenarios when analysts only query up-to-date data from a certain window of time, and they do not need to retain all the historical data.

Several parameters can be used to control refresh behavior:

  • partition_refresh_number: the number of partitions to refresh in each refresh operation.
  • partition_ttl_number: the number of recent partitions to retain.
  • excluded_trigger_tables: the table whose data changes can be ignored to avoid triggering the automatic refresh.
  • auto_refresh_partitions_limit: the number of partitions to refresh in each automatic refresh operation.

For more information, see CREATE MATERIALIZED VIEW.

Currently, partitioned materialized views have the following limitations:

  • You can only build a partitioned materialized view based on a partitioned table.
  • You can only use the DATE or DATETIME type columns as the partitioning key. The STRING data type is not supported.
  • You can only perform partition roll-up using the date_trunc, time_slice, and date_slice functions.
  • You can only specify a single column as the partitioning key. Multiple partitioning columns are not supported.

Summary

Utilizing CelerData's asynchronous materialized views for data modeling offers the advantage of simplifying pipeline management and enhancing the efficiency and flexibility of data modeling through a declarative modeling language.

Apart from the data modeling, CelerData's asynchronous materialized views find application in various scenarios involving transparent acceleration and data lake integration. This facilitates further exploration of data value and improves data efficiency.