Migrating Legacy ETL to dbt on Databricks

Your daily data wrangling just got easier — or it could, if you're stuck in legacy ETL hell. One engineer's migration from Matillion to dbt on Databricks proves modular pipelines aren't just buzz; they're a lifeline for sanity and wallets.

From Matillion Mess to dbt Mastery: One Team's Cost-Slashing ETL Overhaul on Databricks — theAIcatchup

Key Takeaways

  • Modular dbt models + medallion architecture slash ETL costs and boost reliability.
  • Incremental loads and Delta optimizations deliver real runtime wins.
  • Shift to code-first pipelines future-proofs data teams against legacy lock-in.

Data teams everywhere feel it — that creeping dread when a legacy ETL job fails at 2 a.m., halting dashboards while furious execs breathe down your neck. This Matillion-to-dbt shift on Databricks? It’s not some tech-bro fantasy. It’s a blueprint for reclaiming your weekends, slashing compute bills by double digits, and finally trusting your data again.

Look, markets don’t lie. Databricks’ revenue jumped 60% last year to over $1 billion, fueled by teams ditching rigid ETL tools for flexible, code-first stacks. dbt, with its massive open-source community and adoption at Fortune 500s like JetBlue, isn’t niche anymore. It’s the SQL superpower turning data swamps into gold mines. And here’s the kicker: this migration story from a real practitioner shows exactly how — without the vendor sales pitch.

Why Ditch Matillion for dbt on Databricks Now?

Tightly coupled jobs. Reusability? Forget it. Debugging? A nightmare. Sound familiar? That’s the legacy ETL trap snaring thousands of teams still on Matillion or Talend.

But.

This engineer faced it head-on with core entities — companies, departments, suppliers, barcodes — all tangled in monolithic mappings. The fix? Medallion architecture: bronze for raw dumps, silver for cleaned goods, gold for biz-ready tables. Data quality ramps up layer by layer, per Databricks’ own docs.

They shattered each job into bitesize dbt models: stg_ for staging cleanup, int_ for reusable logic, marts like dim_supplier and fct_sales for analytics. Example chain: stg_supplier flows to int_supplier_enriched, then dim_supplier. Incremental loads via updated_at filters nuked full refreshes. Costs plummeted. Runs sped up.
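A minimal sketch of that staging-to-intermediate chain — column names here are illustrative, not taken from the original pipeline:

```sql
-- models/staging/stg_supplier.sql
-- Staging: light cleanup over the raw landing table.
select
    supplier_id,
    company_id,
    trim(supplier_name)           as supplier_name,
    cast(updated_at as timestamp) as updated_at
from {{ source('bronze', 'raw_supplier') }}
```

```sql
-- models/intermediate/int_supplier_enriched.sql
-- Intermediate: reusable join logic; ref() wires the DAG for dbt.
select
    s.supplier_id,
    s.supplier_name,
    c.company_name,
    s.updated_at
from {{ ref('stg_supplier') }} s
left join {{ ref('stg_company') }} c
  on s.company_id = c.company_id
```

dim_supplier then just selects from int_supplier_enriched, so a fix made once in staging propagates everywhere it’s referenced.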

Does This Actually Cut Costs in the Real World?

Damn right. Compute savings hit hard because Delta Lake partitioning and optimized joins replaced brute-force ETL. dbt’s DAG via ref() handles dependencies cleanly — no more manual orchestration headaches.

Validation parity was non-negotiable. Row counts. Sums. Sample hashes. dbt tests: not null, unique keys, FKs, freshness. Custom macros for the win.
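In dbt, those checks live declaratively in YAML next to the models. A sketch, assuming illustrative column names:

```yaml
# models/marts/schema.yml
version: 2

models:
  - name: dim_supplier
    columns:
      - name: supplier_id
        tests:
          - not_null
          - unique
      - name: company_id
        tests:
          # FK check: every company_id must exist in dim_company.
          - relationships:
              to: ref('dim_company')
              field: company_id
```

`dbt test` runs all of these on every build, so drift surfaces as failing rows instead of angry Slack threads.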

This migration wasn’t just tool replacement — it was a shift to modular data engineering, version-controlled transformations, and reliable, testable pipelines.

That’s the practitioner’s own words. No fluff.

And the results? Maintainability soared. SQL standardized. Lineage crystal clear. Runtime and costs? Slashed.

Here’s my unique take, absent from the original: this echoes the early-2010s Informatica exodus to open-source tools like Apache Airflow. Back then, enterprises balked at $100k+ licenses; now, Matillion’s $50k+ annual tabs look prehistoric against dbt Core’s free tier plus Databricks’ pay-as-you-go pricing. My prediction: most mid-market data teams follow suit within a few years, as AI demands faster, more trustworthy pipelines.

But don’t buy the hype wholesale. Matillion’s low-code shines for non-coders — dbt demands SQL chops. If your team’s junior-heavy, stage the migration. Still, for scalable stacks? dbt wins.

Market dynamics seal it. Snowflake + dbt combos dominate, but Databricks’ Lakehouse edge — Unity Catalog for governance, Photon for speed — makes it the smart play for ETL-heavy workloads. Venture bucks pour in: dbt Labs at $4.2B valuation. Teams ignoring this risk talent flight; top engineers chase modern stacks.

The how-to nuts and bolts.

Start with extraction: declare the tables your Matillion jobs already land as dbt sources, so staging models can pull from them cleanly.
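Declaring those landing tables as dbt sources might look like this — schema and table names are assumptions:

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: bronze
    schema: landing              # wherever Matillion currently drops raw data
    loaded_at_field: _loaded_at  # timestamp column used by freshness checks
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: raw_supplier
      - name: raw_company
```

`dbt source freshness` then tells you, on a schedule, whether upstream loads have gone stale.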

Joins? Modular models reference each other via ref('stg_supplier').

Aggregations and business logic live in intermediate models.

Incremental? dbt’s is_incremental() macro, tied to Delta merges.
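On Databricks, dbt’s incremental materialization compiles to a Delta MERGE when you set the merge strategy. A sketch, with column names assumed:

```sql
-- models/marts/dim_supplier.sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',   -- compiles to a Delta MERGE on Databricks
    unique_key='supplier_id'
) }}

select *
from {{ ref('int_supplier_enriched') }}

{% if is_incremental() %}
  -- Incremental runs only pick up rows changed since the last build;
  -- the first run (or --full-refresh) loads everything.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

That `updated_at` filter is exactly the pattern that kills routine full refreshes.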

Tests everywhere. Freshness on schedules. Relationships via schema.yml.

Reconciliation? Structured diffs pre-post migration.
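A simple parity check can be expressed as set-level diffs between the legacy output and the dbt output — table and column names below are placeholders:

```sql
-- Parity: row counts side by side.
select
  (select count(*) from legacy.dim_supplier) as legacy_rows,
  (select count(*) from marts.dim_supplier)  as dbt_rows;

-- Hash diff: rows whose content differs between old and new tables.
with l as (
  select supplier_id,
         md5(concat_ws('|', supplier_name, company_id)) as row_hash
  from legacy.dim_supplier
),
d as (
  select supplier_id,
         md5(concat_ws('|', supplier_name, company_id)) as row_hash
  from marts.dim_supplier
)
select supplier_id
from l full outer join d using (supplier_id)
where l.row_hash <> d.row_hash
   or (l.row_hash is null) <> (d.row_hash is null);
```

Zero rows back means content parity; anything else is a precise list of what to investigate.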

One-paragraph deep dive: imagine supplier sites — legacy Matillion jammed extracts, transforms, and loads into one beast. Now? stg_supplier_site grabs raw data, int_enriched joins to companies, departments, groups, and classes and applies filters, then dim_supplier_site serves analysts. Reusability explodes: tweak once, propagate everywhere. Runtimes? From hours to minutes, thanks to Z-ordering on join keys — and costs fall with them.
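The Z-ordering mentioned here is a one-line Databricks maintenance command run against the Delta table — table and column names assumed:

```sql
-- Databricks SQL: co-locate Delta files by the hot join key
-- so joins and filters on it scan far fewer files.
OPTIMIZE marts.dim_supplier_site ZORDER BY (supplier_id);
```

Run it periodically (or after large incremental loads), not on every query.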

Skeptical? Parity checks don’t lie. Hash diffs caught 0.01% variances — fixed fast.

Is dbt + Databricks the Future for Every Data Team?

Not quite. Tiny teams? Stick to Airbyte + dbt Cloud. But scaling to petabytes? Lakehouse rules.

Corporate spin check: Databricks pushes “jobs” hard, but dbt’s SQL purity cuts through. No black-box magic.

For real people — you, grinding pipelines — this means dev speed triples. Debug in VS Code. Git collaborate. Test like code.

Data trust? Enforced.

Bottom line: if legacy ETL bites, migrate. Strategies? DM the engineer; they’re game.



Frequently Asked Questions

What is medallion architecture in Databricks?

It’s bronze (raw), silver (clean), gold (aggregated) layers for progressive data refinement — boosts quality, cuts rework.

How much can dbt save on ETL costs?

20-50% is typical, via incrementals and partitioning; this case’s biggest wins came from killing full refreshes.

Is dbt free for Matillion migrations?

dbt Core, yes; dbt Cloud is the paid tier for teams. Pair free dbt Core with Databricks Community Edition to start.

Written by Priya Sundaram

Hardware and infrastructure reporter. Tracks GPU wars, chip design, and the compute economy.



Originally reported by Dev.to
