Many data teams across Kraken use Databricks and dbt to build their analytics lakehouses. Databricks is a powerful tool that lets you run and query dbt models quickly with Spark. However, with a growing number of analysts, increasingly complex queries and ever-larger dbt projects, the cost of building these models can rise quickly. To optimize those costs we first need to understand them, so in this post we’ll walk through one method of estimating the cost of running any dbt model in Databricks, which can then be used for budgeting or for prioritizing technical debt clean-up.
In this post, we’ll be using the following tools:
- Databricks Unity Catalog - system tables
- dbt Core - a dbt project running on Databricks
- elementary-data - elementary dbt metadata tables
Overview
To estimate the cost of running each dbt model over a week in Databricks we will:
- Calculate the total cost accrued by your Databricks SQL warehouse (cost_per_warehouse)
- Calculate the total execution time of all queries run against the warehouse (total_warehouse_execution_time)
- Calculate the execution time of each dbt model and its associated tests (model_execution_time)
- Calculate every dbt model’s estimated cost using our algorithm, with a worked example just below:
cost_per_model = cost_per_warehouse * (model_execution_time / total_warehouse_execution_time)
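For intuition, here is the algorithm with made-up numbers: if the warehouse accrued $500 over the week, queries against it ran for a combined 200 hours, and one model (plus its tests) accounted for 4 of those hours, that model is allocated $10 for the week.
-- Worked example with made-up numbers: $500 of warehouse cost, 200 hours of total
-- query time, and 4 hours spent on one model and its tests
SELECT 500 * (4 / 200) AS cost_per_model; -- 10.0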
Assumptions:
- The dbt project runs against only one Databricks SQL warehouse
- These estimates ignore the additional cloud compute and storage costs that sit behind Databricks, so they are conservative
- All prices are in USD
Step 1: Calculate the total cost accrued by your Databricks SQL warehouse
Databricks costs are based on processing units called Databricks Units (DBUs). To get the total cost of a warehouse, we need to multiply the number of DBUs used by the dollar rate per DBU at the time of use. Luckily, we can find this easily by joining the Databricks system tables system.billing.usage and system.billing.list_prices.
SELECT
SUM(price.pricing.default * databricks_usage.usage_quantity) AS price
FROM
system.billing.usage AS databricks_usage
INNER JOIN
system.billing.list_prices AS price
ON databricks_usage.sku_name = price.sku_name
AND databricks_usage.usage_start_time >= price.price_start_time
AND databricks_usage.usage_start_time <= COALESCE(price.price_end_time, TIMESTAMP('2099-09-09'))
WHERE
databricks_usage.usage_metadata.warehouse_id = {WAREHOUSE_ID}
AND databricks_usage.usage_start_time >= CURRENT_TIMESTAMP() - INTERVAL 7 DAY;
Now we have the value for the first part of our algorithm: cost_per_warehouse = price.
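If you’re unsure which value to use for {WAREHOUSE_ID}, one optional way to find it (a minimal sketch, reusing the same system table) is to list the warehouse IDs that have accrued usage recently:
-- List warehouse IDs with billing usage in the last week
SELECT DISTINCT
usage_metadata.warehouse_id
FROM
system.billing.usage
WHERE
usage_metadata.warehouse_id IS NOT NULL
AND usage_start_time >= CURRENT_TIMESTAMP() - INTERVAL 7 DAY;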
Step 2: Calculate the total execution time of all queries run against the warehouse
We can use the system.query.history table to get the total execution time of all queries run against the warehouse over the last week.
SELECT
-- Extracting dbt object name
REGEXP_EXTRACT(statement_text, '(?:"node_id": ")(.*?)(?="}|$)') AS object_name,
statement_text,
start_time,
end_time,
total_duration_ms,
CASE
-- Categorizing dbt statements - if warehouse is running multiple dbt projects
WHEN statement_text LIKE '%", "target_name": "{PRODUCTION_PROJECT}"%' THEN 'PRODUCTION'
ELSE 'OTHER'
END AS dbt_project_category
FROM
system.query.history
WHERE
start_time >= DATE_SUB(CURRENT_DATE(), 7) -- last seven days
AND client_application = 'Databricks Dbt'
AND compute.warehouse_id = {WAREHOUSE_ID};
The above query returns every query run against the warehouse over the last seven days, along with additional information such as the total query duration, the dbt object name, and a dbt project category (helpful if different dbt processes run against the same warehouse). The results will be reused in the following steps, but to determine the total runtime against our warehouse, all query durations can be summed like so:
WITH all_queries AS (
SELECT
-- Extracting dbt object name
REGEXP_EXTRACT(statement_text, '(?:"node_id": ")(.*?)(?="}|$)') AS object_name,
statement_text,
start_time,
end_time,
total_duration_ms,
CASE
-- Categorizing based on target_name in the statement
WHEN statement_text LIKE '%", "target_name": "{PRODUCTION_PROJECT}"%' THEN 'PRODUCTION'
ELSE 'OTHER'
END AS dbt_project_category
FROM
system.query.history
WHERE
start_time >= DATE_SUB(CURRENT_DATE(), 7) -- last seven days
AND client_application = 'Databricks Dbt'
AND compute.warehouse_id = {WAREHOUSE_ID}
)
SELECT
CAST(SUM(total_duration_ms) AS DECIMAL) / 3600000 AS total_execution_time_hours
FROM
all_queries;
Now we have the value for the denominator of our algorithm: total_warehouse_execution_time = total_execution_time_hours.
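As an optional sanity check, the same result can be split by dbt_project_category to see how much of the warehouse time is driven by the production dbt project versus everything else. A minimal sketch, reusing the all_queries CTE from the query above:
-- Assumes the all_queries CTE defined in the previous query
SELECT
dbt_project_category,
CAST(SUM(total_duration_ms) AS DECIMAL) / 3600000 AS execution_time_hours
FROM
all_queries
GROUP BY
dbt_project_category;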
Step 3: Calculate the execution time of each dbt model and its associated tests
In this step, we bring in elementary tables to map tests to their associated models using the column parent_model_unique_id in the elementary table dbt_tests. This allows us to group warehouse queries around specific models, rather than treating tests and models as separate.
WITH all_queries AS (
SELECT
-- Extracting dbt object name
REGEXP_EXTRACT(statement_text, '(?:"node_id": ")(.*?)(?="}|$)') AS object_name,
total_duration_ms,
CASE
-- Categorizing based on target_name in the statement
WHEN statement_text LIKE '%", "target_name": "{PRODUCTION_PROJECT}"%' THEN 'PRODUCTION'
ELSE 'OTHER'
END AS dbt_project_category
FROM
system.query.history
WHERE
start_time >= DATE_SUB(CURRENT_DATE(), 7) -- last seven days
AND client_application = 'Databricks Dbt'
AND compute.warehouse_id = {WAREHOUSE_ID}
)
-- Link tests to their parent model
SELECT
COALESCE(tests.parent_model_unique_id, aq.object_name) AS dbt_model_id,
CAST(SUM(total_duration_ms) AS DECIMAL) / 3600000 AS model_execution_time_hours
FROM
all_queries aq
LEFT JOIN
{ELEMENTARY_CATALOG}.elementary.dbt_tests tests
ON aq.object_name = tests.unique_id
WHERE
aq.dbt_project_category = 'PRODUCTION'
GROUP BY
dbt_model_id;
The above query returns the total execution time of each dbt model and its associated tests over the last seven days. This gives us the final element of our algorithm: model_execution_time = model_execution_time_hours.
Step 4: Putting it all together
To get the weekly cost of every model in USD, the following query, which combines all of the above SQL, can be used:
WITH price AS (
SELECT
SUM(prices.pricing.default * databricks_usage.usage_quantity) AS price
FROM
system.billing.usage AS databricks_usage
INNER JOIN
system.billing.list_prices AS prices
ON databricks_usage.sku_name = prices.sku_name
AND databricks_usage.usage_start_time >= prices.price_start_time
AND databricks_usage.usage_start_time <= COALESCE(prices.price_end_time, TIMESTAMP('2099-09-09'))
WHERE
databricks_usage.usage_metadata.warehouse_id = {WAREHOUSE_ID}
AND databricks_usage.usage_start_time >= CURRENT_TIMESTAMP() - INTERVAL 7 DAY
),
all_queries AS (
SELECT
-- Extracting dbt object name
REGEXP_EXTRACT(statement_text, '(?:"node_id": ")(.*?)(?="}|$)') AS object_name,
total_duration_ms,
CASE
-- Categorizing based on target_name in statement_text
WHEN statement_text LIKE '%", "target_name": "{PRODUCTION_PROJECT}"%' THEN 'PRODUCTION'
ELSE 'OTHER'
END AS dbt_project_category
FROM
system.query.history
WHERE
start_time >= DATE_SUB(CURRENT_DATE(), 7) -- last seven days
AND client_application = 'Databricks Dbt'
AND compute.warehouse_id = {WAREHOUSE_ID}
),
total_execution_time AS (
SELECT
CAST(SUM(total_duration_ms) AS DECIMAL) / 3600000 AS total_execution_time_hours
FROM
all_queries
),
test_model_map AS (
-- Link tests to their parent model
SELECT
COALESCE(tests.parent_model_unique_id, aq.object_name) AS dbt_model_id,
CAST(SUM(total_duration_ms) AS DECIMAL) / 3600000 AS model_execution_time_hours
FROM
all_queries aq
LEFT JOIN
{ELEMENTARY_CATALOG}.elementary.dbt_tests tests
ON aq.object_name = tests.unique_id
WHERE
aq.dbt_project_category = 'PRODUCTION'
GROUP BY
dbt_model_id
)
-- Bring it all together
SELECT
tmm.dbt_model_id,
-- cost algorithm
price.price * (tmm.model_execution_time_hours / te.total_execution_time_hours) AS model_weekly_cost_usd
FROM
test_model_map tmm
LEFT JOIN
price ON 1 = 1
LEFT JOIN
total_execution_time te ON 1 = 1;
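The result is one row per dbt model with its estimated weekly cost in USD. To surface the most expensive models first, an ORDER BY (and optionally a LIMIT) can be appended to the final SELECT, for example:
-- Appended to the final SELECT above, replacing its closing semicolon
ORDER BY
model_weekly_cost_usd DESC
LIMIT 10;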
Further information
Additional metadata from elementary tables can be added to enrich the cost information. For example, model owners can be added to group costs by user or team, as sketched below.
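A minimal sketch of that enrichment, assuming the output of the Step 4 query has been saved as model_costs (a hypothetical view or table) and that your elementary schema includes the dbt_models table with an owner column (worth checking against your elementary version):
-- model_costs is hypothetical: the saved output of the Step 4 query
SELECT
models.owner,
SUM(model_costs.model_weekly_cost_usd) AS weekly_cost_usd
FROM
model_costs
LEFT JOIN
{ELEMENTARY_CATALOG}.elementary.dbt_models AS models
ON model_costs.dbt_model_id = models.unique_id
GROUP BY
models.owner;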
There are similar blogs that cover estimating dbt model costs in Snowflake - this one by SELECT is my favourite.