How to Integrate dbt and Great Expectations

This blog post provides a step-by-step guide to integrating dbt (data build tool) and Great Expectations to improve data quality, using e-commerce analytics as an example. It covers the entire process: setting up a project, transforming raw data into analytics-ready tables, and implementing data quality tests against a sample e-commerce dataset in Snowflake.


Data quality is crucial for making good data-driven decisions. As data volumes grow exponentially, automating data testing and documentation is key to maintaining high-quality data in analytics databases. dbt (data build tool) and Great Expectations are two popular open-source frameworks that tackle different aspects of managing analytics engineering pipelines.
In this post, we will walk through an end-to-end example of integrating dbt and Great Expectations using an e-commerce dataset. We will load sample e-commerce order data into a Snowflake data warehouse. dbt will help us transform the raw data into analytics-ready tables and views. Great Expectations will then help us create data quality test suites to validate the analytics dbt models.

Prerequisites to integrate dbt and Great Expectations

Before we begin, let's outline the key items we need:

  • Access to a Snowflake account where the raw e-commerce data is loaded
  • A database called ECOMM_ANALYTICS created in Snowflake
  • dbt installed and configured to connect to Snowflake (a sample profiles.yml sketch follows this list)
  • Great Expectations installed and connected to Snowflake
  • A Git/GitHub repository to manage dbt and Great Expectations config as code
  • A basic understanding of SQL, dbt, and Great Expectations
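
For reference, here is a minimal profiles.yml sketch for the ecomm_snowflake profile used later in this post; the account, credentials, warehouse, and target schema below are placeholders to adapt to your environment:

# profiles.yml

ecomm_snowflake:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <SNOWFLAKE_ACCOUNT>
      user: <USER>
      password: <PASSWORD>
      role: <ROLE>
      warehouse: <WAREHOUSE>
      database: ECOMM_ANALYTICS
      schema: analytics
      threads: 4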

Sample E-Commerce Data Model

Our raw sample e-commerce data has information on customer orders, products, payments and order reviews stored in Snowflake stages.

Here is a snippet of what the raw orders data looks like:

order_id | customer_id | order_date | order_value
1        | 100         | 2022-01-01 | 99.99
2        | 200         | 2022-01-05 | 149.99

And the products data:

product_id | product_name | category | cost_price | selling_price
1001       | T-shirt      | Apparel  | 20         | 49.99
1002       | Jeans        | Apparel  | 30         | 99.99

We want to transform this raw data into an analytics dataset with clean, derived metrics such as order revenue and product margins.
Here is what our target analytics data model looks like:

  • stg_orders - staged orders data
  • stg_products - staged products data
  • stg_payments - staged payments data
  • stg_reviews - staged reviews data
  • fct_orders - derived metrics like order revenue, profit
  • dim_customers - clean customer dimension
  • dim_products - clean product dimension


How to set up the dbt project

We first set up a dbt project to build the transformations and generate the analytics dataset.
Initialize a dbt project called ecomm_dbt and configure the connection to the Snowflake database ECOMM_ANALYTICS:
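
The project can be scaffolded with the dbt CLI, and the Snowflake connection verified against the profiles.yml sketched in the prerequisites:

# create the project skeleton
dbt init ecomm_dbt

# from inside the project directory, verify the connection defined in profiles.yml
dbt debug

The generated dbt_project.yml then points at the Snowflake profile: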


# dbt_project.yml

name: 'ecomm_dbt' 
version: '1.0.0'
config-version: 2

profile: ecomm_snowflake
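
Optionally, default materializations can be declared per folder in the same file; a minimal sketch, assuming the models are organized into staging and marts directories:

models:
  ecomm_dbt:
    staging:
      +materialized: view
    marts:
      +materialized: table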

Define the source and target schemas in the schema.yml file:


# schema.yml

sources:
  - name: ecommerce
    database: ECOMM_ANALYTICS
    schema: raw
    tables:
      - name: orders
      - name: customers
      - name: order_costs
      - name: products

models:
  - name: stg_orders
    description: Staged raw orders data
    columns:
      - name: order_id
        description: Primary key
        tests:
          - unique
          - not_null

  - name: dim_customers
    description: Customer dimension table
    columns:
      - name: customer_id
        description: Customer ID
        tests:
          - unique
          - not_null

  - name: fct_orders
    description: Order fact table
    columns:
      - name: order_profit
        description: Order profit (revenue minus cost)

Build the dbt SQL models to transform raw data:


-- stg_orders.sql 

select * from {{ source('ecommerce','orders') }}


-- dim_customers.sql

select
    customer_id,
    first_name,
    last_name
from {{ source('ecommerce','customers') }}


-- fct_orders.sql

select
    o.order_id,
    o.order_date,

    o.order_value as order_revenue,

    o.order_value - c.order_cost as order_profit

from {{ source('ecommerce','orders') }} as o

inner join {{ source('ecommerce','order_costs') }} as c
    on o.order_id = c.order_id
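
With the staging, dimension, and fact models defined, they can be built in Snowflake (assuming the default target from profiles.yml):

# build every model in the project
dbt run

# or build just the models shown above
dbt run --select stg_orders dim_customers fct_orders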

Run tests on the dbt project:

  • dbt compile to catch SQL errors
  • dbt test to run the schema tests defined in schema.yml
  • A sample singular test (a singular test passes when its query returns zero rows):

-- test_fct_orders.sql
-- singular test: returns the rows that violate our expectations,
-- so the test passes only when no rows come back

select
    order_id,
    order_revenue,
    order_profit
from {{ ref('fct_orders') }}
where order_revenue <= 0
   or order_profit is null
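
Singular tests saved under the tests/ directory are picked up automatically and run alongside the schema tests defined in schema.yml (with a recent dbt version):

# run all schema and singular tests
dbt test

# or limit the run to tests related to fct_orders
dbt test --select fct_orders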

Great Expectations Integration

Next, we set up Great Expectations to define data quality expectations and validate the analytics dataset.

  1. Initialize Great Expectations in the project repository:

great_expectations init

  2. Create a datasource pointing to the ECOMM_ANALYTICS database so Great Expectations can read the tables dbt builds:

great_expectations --v3-api datasource new
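
The datasource new workflow writes a configuration block into great_expectations.yml; it should look roughly like the sketch below, where the datasource name, connection details, and target schema are placeholders and the exact keys depend on your Great Expectations version:

# great_expectations.yml (excerpt)

datasources:
  ecomm_analytics:
    class_name: Datasource
    execution_engine:
      class_name: SqlAlchemyExecutionEngine
      connection_string: snowflake://<USER>:<PASSWORD>@<ACCOUNT>/ECOMM_ANALYTICS/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>
    data_connectors:
      default_inferred_data_connector_name:
        class_name: InferredAssetSqlDataConnector
        include_schema_name: true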

  3. Create an expectation suite for the dbt models and open it for editing:

great_expectations --v3-api suite new
great_expectations --v3-api suite edit dim_customers.suite

  4. Auto-generate expectations for dim_customers:

expect_column_values_to_not_be_null -> customer_id
expect_column_values_to_be_unique -> customer_id
expect_column_values_to_be_in_type_list -> all columns
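
These expectations are stored as JSON under great_expectations/expectations/; a trimmed sketch of what dim_customers.suite.json might contain (the field layout varies slightly by version):

{
  "expectation_suite_name": "dim_customers.suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "customer_id"}
    },
    {
      "expectation_type": "expect_column_values_to_be_unique",
      "kwargs": {"column": "customer_id"}
    }
  ]
}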

  5. Add custom expectations for the metrics in fct_orders:

expect_column_min_to_be_between -> order_profit: min 5
expect_column_max_to_be_between -> order_revenue: between 500 and 50000
expect_column_kl_divergence_to_be_less_than -> order_profit: threshold 0.6
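
In the corresponding suite JSON (for example, fct_orders.suite.json), the bounds become explicit kwargs; a sketch of the first two is below, while the KL divergence expectation additionally needs a partition_object describing the expected distribution (typically produced during profiling), so it is left out here:

"expectations": [
  {
    "expectation_type": "expect_column_min_to_be_between",
    "kwargs": {"column": "order_profit", "min_value": 5}
  },
  {
    "expectation_type": "expect_column_max_to_be_between",
    "kwargs": {"column": "order_revenue", "min_value": 500, "max_value": 50000}
  }
]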

  6. Execute validation for the dim_customers expectation suite by running a checkpoint:

great_expectations --v3-api checkpoint new dim_customers_checkpoint
great_expectations --v3-api checkpoint run dim_customers_checkpoint
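
The checkpoint itself is a small YAML config stored under great_expectations/checkpoints/; a minimal sketch, assuming the placeholder datasource, schema, and suite names used above:

# checkpoints/dim_customers_checkpoint.yml

name: dim_customers_checkpoint
config_version: 1.0
class_name: SimpleCheckpoint
validations:
  - batch_request:
      datasource_name: ecomm_analytics
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: analytics.dim_customers
    expectation_suite_name: dim_customers.suite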

Next Steps

Here are some ways we can build on this data quality monitoring framework:

  1. Schedule batch workflows for daily validation runs
  2. Set up Data Docs to publish data quality results and metrics
  3. Expand test coverage to more edge cases
  4. Enable consumers to view data quality issues and fixes

Conclusion

In this post, we walked through a detailed example of integrating dbt and Great Expectations for an e-commerce pipeline, using code snippets along the way. Defining rigorous data quality tests is key to catching issues early and maintaining high-quality datasets.
Hopefully this gives you a blueprint for implementing automated data testing and documentation in your own data transformation pipelines!

