PostgreSQL Recursive CTEs for Multi-Generation Pedigree Trees

A breeder's app was tracking 200 animals nine days after launch. The secret? Understanding why adjacency lists beat nested sets, and why your duplicates aren't a bug: they're inbreeding.


Key Takeaways

  • Adjacency lists with self-referential foreign keys are the only schema that handles pedigrees correctly; nested sets and materialized paths fail at scale.
  • Recursive CTEs are the right tool for multi-generation ancestor traversal—they're elegant, performant, and standardized across SQL databases.
  • Duplicate ancestors in your pedigree query results aren't bugs; they're the signal for detecting inbreeding, and removing them corrupts your data.
  • Closure tables lose to recursive CTEs for pedigrees because common ancestors create path explosion—the exact scenario where you most need efficiency.

Fifty paid subscribers. Two hundred animals tracked. Nine days after launch.

That’s not a toy project. That’s ReptiDex, a production app for animal breeders that solves one of SQL’s most deceptively hard problems: building and querying multi-generation pedigree trees in PostgreSQL.

Most developers never think about this. Most should. Whether you’re building genealogy software, org charts that trace reporting lines back decades, or bill-of-materials systems where parts contain parts, you’re staring at the exact same architectural problem: a binary tree that doubles in width every generation, incomplete historical records, and zero tolerance for data corruption.

Miss the design details, and your entire inheritance chain collapses.

The Problem Nobody Talks About: Why Pedigrees Aren’t Just Hierarchies

Here’s where most developers go wrong. They think a pedigree tree is just a hierarchy—like an org chart or a folder structure. Wrong.

An animal can be the sire of offspring across many different dams, producing multiple subtrees that share a root. Pedigrees are the variant of the hierarchy problem where both parents matter, the graph can double in width at every generation, and incorrect data has real consequences for the people who depend on it.

That’s the kicker. In a normal hierarchy, each node has one parent. In a pedigree, each animal has two parents, and those parents might share ancestors. At generation 0, you have 1 animal. At generation 1, up to 2 parents. Generation 2, up to 4 grandparents. By generation 4, you’re tracking up to 16 great-great-grandparents. Mathematically, a complete pedigree that runs from generation 0 through generation n has 2^0 + 2^1 + ... + 2^n = 2^(n+1) - 1 total nodes, so n = 4 gives 31.
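The arithmetic above can be checked with a few lines of Python. This is only an illustration of the counting; the function names are made up for the example, and the counts are upper bounds because real pedigrees have missing and shared ancestors.

```python
def max_ancestors_at(generation: int) -> int:
    """Upper bound on pedigree slots at one generation (generation 0 is the animal itself)."""
    return 2 ** generation


def max_pedigree_nodes(generations: int) -> int:
    """Total slots in a complete pedigree covering generations 0..generations."""
    return sum(max_ancestors_at(g) for g in range(generations + 1))


# Print generation, slots at that generation, and the running total.
for n in range(5):
    print(n, max_ancestors_at(n), max_pedigree_nodes(n))
```

The running total matches the closed form 2^(n+1) - 1 at every generation.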

Then you add chaos to the mix. Not every animal has documented parents. Import a breeding animal from another program? Maybe you have records for the grandparents but not the dam. Your schema has to handle missing data without breaking.

And here’s the thing nobody tells you: the duplicates matter. If ancestor X shows up on both the sire side and the dam side of your tree, that’s not a data error. That’s inbreeding. That’s the entire signal you need for coefficient of inbreeding (COI) calculations. Delete those duplicates, and you’ve just corrupted your genealogy.

The Schema: Self-Referential Foreign Keys Are Non-Negotiable

Start here. One table. Two columns that point back to itself.

The animals table has sire_id and dam_id—both nullable foreign keys that reference the table’s own primary key. Both parents are optional because real-world breeding data is messy. But both foreign keys are mandatory constraints. If you allow orphan references—a sire_id pointing to a nonexistent animal—your entire pedigree graph becomes unreliable at the database level.

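SQLite only enforces foreign keys when enforcement is switched on for the connection, so by default it will let you do this. MySQL might let you do this if you’re not careful, for example on a storage engine that doesn’t enforce foreign keys. PostgreSQL will reject it outright, and that’s the right call.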

Self-referential foreign keys enforce integrity at the database level. A sire_id that points to a nonexistent animal gets rejected by Postgres before any application code runs.
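A minimal sketch of such a table, under some loud assumptions: the article does not show ReptiDex's actual DDL, so the column set and the `name` column are hypothetical, and Python's built-in `sqlite3` stands in for PostgreSQL so the example runs without a server. The `CREATE TABLE` statement itself is plain SQL that PostgreSQL also accepts, although a real PostgreSQL schema would more likely use an identity or `bigint` key.

```python
import sqlite3

# Hypothetical schema: one table, two nullable self-referential foreign keys.
SCHEMA = """
CREATE TABLE animals (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    sire_id INTEGER REFERENCES animals (id),  -- nullable: sire may be unknown
    dam_id  INTEGER REFERENCES animals (id)   -- nullable: dam may be unknown
);
"""

conn = sqlite3.connect(":memory:")
# SQLite-only step: enforcement is off by default. PostgreSQL always enforces.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# A founder with no recorded parents, and an offspring that references it.
conn.execute("INSERT INTO animals (id, name) VALUES (1, 'Founder sire')")
conn.execute("INSERT INTO animals (id, name, sire_id) VALUES (2, 'Hatchling', 1)")

# An orphan reference: sire_id 999 does not exist, so the insert should fail.
try:
    conn.execute("INSERT INTO animals (id, name, sire_id) VALUES (3, 'Orphan ref', 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM animals").fetchone()[0]
print("orphan reference rejected:", orphan_rejected, "| rows stored:", row_count)
```

The point of the design is that the rejection happens inside the database, so no application bug can store a parent that does not exist.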

Adjacency list models, where each record points to its parents, win here because the alternatives break down spectacularly. Nested sets assume a strict hierarchy; pedigrees aren’t strict hierarchies. Materialized paths (storing the full ancestor lineage as a string) blow up in size; by the formula above, a complete 10-generation pedigree has up to 2,047 nodes to encode. Closure tables? We’ll get to why they lost.

Why Recursive CTEs Are the Right Weapon

PostgreSQL’s recursive common table expressions (CTEs) let you walk up the tree elegantly. Here’s the core pattern:

Start with one animal (the base case). Then, recursively fetch every ancestor by joining the CTE back to the animals table—matching any animal whose ID appears as either a sire_id or dam_id of an already-found record. Increment a generation counter on each step. Stop when you hit your depth limit.

For a 4-generation pedigree, the base case returns 1 row. Gen 1 returns up to 2. Gen 2 returns up to 4. By gen 4, you’re tracking up to 16 ancestors across 31 total rows.

The magic line: a.id = p.sire_id OR a.id = p.dam_id. Combined with UNION ALL (plain UNION would discard identical rows), this single condition allows the same ancestor to appear multiple times if they show up on both the maternal and paternal sides. That duplication is not a bug. That duplication is signal.
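The pattern described above might look like the following sketch. It is not the author's actual query: the table layout, the sample animals, and the alias names are assumptions, and `sqlite3` stands in for PostgreSQL so the code is runnable as-is. The `WITH RECURSIVE` statement uses only standard syntax, so it should carry over to PostgreSQL with the placeholders changed to whatever your driver expects.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE animals (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    sire_id INTEGER REFERENCES animals (id),
    dam_id  INTEGER REFERENCES animals (id)
);
-- Hypothetical data: E comes from a full-sibling mating (C x D).
INSERT INTO animals (id, name, sire_id, dam_id) VALUES
    (1, 'A', NULL, NULL),
    (2, 'B', NULL, NULL),
    (3, 'C', 1, 2),
    (4, 'D', 1, 2),
    (5, 'E', 3, 4);
""")

ANCESTORS_SQL = """
WITH RECURSIVE pedigree (id, name, sire_id, dam_id, generation) AS (
    -- Base case: the starting animal is generation 0.
    SELECT id, name, sire_id, dam_id, 0
    FROM animals
    WHERE id = ?
  UNION ALL
    -- Recursive step: fetch both parents of every row found so far.
    SELECT a.id, a.name, a.sire_id, a.dam_id, p.generation + 1
    FROM animals AS a
    JOIN pedigree AS p ON a.id = p.sire_id OR a.id = p.dam_id
    WHERE p.generation < ?   -- depth limit
)
SELECT generation, id, name FROM pedigree ORDER BY generation, id
"""

rows = conn.execute(ANCESTORS_SQL, (5, 4)).fetchall()
for generation, animal_id, name in rows:
    print(generation, animal_id, name)

# Ancestors that occur more than once are the inbreeding signal.
repeated = {aid: n for aid, n in Counter(r[1] for r in rows).items() if n > 1}
print("repeated ancestors:", repeated)
```

In this made-up pedigree, founders A and B are each reached once through C and once through D, so both appear twice at generation 2. Swapping `UNION ALL` for `UNION` would collapse those identical rows and hide the shared ancestry.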

Why You Don’t Want Closure Tables (And Why the Author Tested Them Anyway)

Closure tables are seductive. You pre-compute every ancestor-descendant relationship, store it in a separate table, and queries become instant lookups. It works beautifully for organizational hierarchies where ancestors don’t repeat.

For pedigrees? It’s performance theater.

You’d need a row for every possible path through the tree. A complete 4-generation pedigree has 31 nodes and a 10-generation one has up to 2,047, but a closure table stores a row per ancestor-descendant path rather than per node, and that number explodes with inbreeding (which is exactly when you most need the data). You’re trading query speed for storage and maintenance nightmares.
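To put rough numbers on that cost, the sketch below counts ancestor-descendant pairs in one complete pedigree taken in isolation, with no shared ancestors and no self-pairs. That is a simplifying assumption for illustration, not a measurement of ReptiDex's data, and the function names are invented for the example.

```python
def pedigree_nodes(generations: int) -> int:
    """Slots in a complete pedigree covering generations 0..generations."""
    return 2 ** (generations + 1) - 1


def closure_rows(generations: int) -> int:
    """Ancestor-descendant pairs in that pedigree, excluding self-pairs.

    A slot at generation g sits above exactly g slots on its line back to
    the starting animal, so it contributes g closure rows.
    """
    return sum(g * 2 ** g for g in range(generations + 1))


# Compare node count with closure row count at two depths.
for n in (4, 10):
    print(n, pedigree_nodes(n), closure_rows(n))
```

Even in this best case the closure rows outgrow the node count quickly, and every new animal or corrected parent means rewriting some of them.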

The author tested this against production ReptiDex data and chose recursive CTEs instead. Real-world data wins over theory.

The Denormalization Trap: When Caching Pedigrees Makes Sense

Here’s where pragmatism enters. Recursive CTEs are fast for single queries, but if you’re rendering pedigree trees constantly—especially on mobile (where ReptiDex lives)—you’re recomputing the same ancestor paths over and over.

Denormalization becomes tempting. Cache the full pedigree tree in a separate table, keyed to the starting animal. Update it when breeding data changes.

It works. It’s faster. It also means you now have two sources of truth, and if one gets out of sync, your genealogy data is corrupted. The trade-off is real, and it depends on your query patterns. For a mobile app showing breeders their complete pedigrees multiple times per session? Caching probably wins. For a system that’s mostly write-heavy? Don’t do it.
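One way to picture the trade-off is a small cache table plus an explicit invalidation step. Everything here is hypothetical: the `pedigree_cache` table, the JSON payload, and both helper functions are assumptions made for the sketch, not ReptiDex's design, and `sqlite3` stands in for PostgreSQL.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE animals (
    id      INTEGER PRIMARY KEY,
    sire_id INTEGER REFERENCES animals (id),
    dam_id  INTEGER REFERENCES animals (id)
);
-- Hypothetical cache: one serialized pedigree per starting animal.
CREATE TABLE pedigree_cache (root_id INTEGER PRIMARY KEY, payload TEXT NOT NULL);
INSERT INTO animals (id, sire_id, dam_id) VALUES (1, NULL, NULL), (2, NULL, NULL), (3, 1, NULL);
""")

ANCESTORS_SQL = """
WITH RECURSIVE pedigree (id, sire_id, dam_id, generation) AS (
    SELECT id, sire_id, dam_id, 0 FROM animals WHERE id = ?
  UNION ALL
    SELECT a.id, a.sire_id, a.dam_id, p.generation + 1
    FROM animals AS a
    JOIN pedigree AS p ON a.id = p.sire_id OR a.id = p.dam_id
    WHERE p.generation < 4
)
SELECT generation, id FROM pedigree ORDER BY generation, id
"""

DESCENDANTS_SQL = """
WITH RECURSIVE line (id) AS (
    SELECT id FROM animals WHERE id = ?
  UNION
    SELECT a.id FROM animals AS a
    JOIN line AS l ON a.sire_id = l.id OR a.dam_id = l.id
)
SELECT id FROM line
"""


def get_pedigree(conn, root_id):
    """Return the cached pedigree if present, otherwise compute and store it."""
    hit = conn.execute(
        "SELECT payload FROM pedigree_cache WHERE root_id = ?", (root_id,)
    ).fetchone()
    if hit is not None:
        return json.loads(hit[0])
    rows = [list(r) for r in conn.execute(ANCESTORS_SQL, (root_id,))]
    conn.execute(
        "INSERT INTO pedigree_cache (root_id, payload) VALUES (?, ?)",
        (root_id, json.dumps(rows)),
    )
    return rows


def invalidate(conn, animal_id):
    """Drop cached pedigrees for the animal and everything descended from it."""
    stale_ids = [r[0] for r in conn.execute(DESCENDANTS_SQL, (animal_id,))]
    conn.executemany(
        "DELETE FROM pedigree_cache WHERE root_id = ?", [(i,) for i in stale_ids]
    )


first = get_pedigree(conn, 3)                       # computed, then cached
conn.execute("UPDATE animals SET dam_id = 2 WHERE id = 3")
stale = get_pedigree(conn, 3)                       # cache is now out of date
invalidate(conn, 3)
fresh = get_pedigree(conn, 3)                       # recomputed with the dam
print(first, stale, fresh)
```

The middle read is the failure mode the paragraph warns about: the parent record changed, but the cached tree still shows the old lineage until something invalidates it.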

Indexing for Ancestor Traversal: The Performance Foundation

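Your recursive CTE runs joins involving sire_id and dam_id repeatedly across generations. Both columns need indexes. Period.

The ancestor walk itself looks up each parent by primary key, which is already indexed. The parent columns matter as soon as you go the other way: without indexes on them, finding an animal’s offspring means scanning every animal in your database at each generation, and PostgreSQL does not create indexes on foreign key columns automatically. With a B-tree index on each parent column, those lookups take logarithmic rather than linear time. At ReptiDex’s 200 animals the gap is likely small; it widens as the table grows.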
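A sketch of the two indexes, with assumed index names and the same hypothetical `animals` table. The `CREATE INDEX` statements are standard and build B-tree indexes by default in PostgreSQL; the plan check at the end uses SQLite's `EXPLAIN QUERY PLAN` only because `sqlite3` is standing in for PostgreSQL here, where you would run `EXPLAIN` instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE animals (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    sire_id INTEGER REFERENCES animals (id),
    dam_id  INTEGER REFERENCES animals (id)
);
-- One index per parent column (index names are illustrative).
CREATE INDEX animals_sire_id_idx ON animals (sire_id);
CREATE INDEX animals_dam_id_idx  ON animals (dam_id);
""")

# List the explicit indexes that now exist on the table.
index_names = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'animals'"
    )
)

# Ask the planner how it would find the offspring of one sire.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM animals WHERE sire_id = ?", (1,)
).fetchall()
plan_text = " ".join(str(col) for row in plan for col in row)

print(index_names)
print(plan_text)
```

The lookup by `sire_id` is the kind of access a descendant query performs at every generation, which is where these indexes earn their keep.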

The Unique Insight: This is What Platform Shifts Look Like

Here’s what nobody says out loud: this is how you identify a fundamental platform shift in software.

SQL has had recursive CTEs since SQL:1999. PostgreSQL added them in version 8.4, released in 2009. That’s well over a decade ago. Yet most web developers have never used them. They build genealogy apps with Python loops and N+1 queries. They struggle with graph traversal because they’re trying to solve it in application code.

When a single SQL feature—one that’s been standardized for 25 years—can suddenly unlock an entire category of previously-difficult problems, you’re looking at the reason PostgreSQL has become the database that swallows every alternative. Not because it’s trendy. Because it works for shapes of data nobody else handles elegantly.


Frequently Asked Questions

What does coefficient of inbreeding actually measure? COI quantifies the probability that two alleles in an offspring are copies of the same allele from a common ancestor. It requires counting every shared path through the pedigree—which is why duplicate ancestors in your query results are essential, not errors.
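For a feel of how shared ancestors turn into a number, here is a small sketch of the recursive kinship method, one common way to compute COI. It is not necessarily how ReptiDex does it. The pedigree is hypothetical, unknown parents are treated as unrelated and non-inbred, and the code assumes every parent has a smaller id than its offspring.

```python
from functools import lru_cache

# Hypothetical pedigree: id -> (sire_id, dam_id); None means unknown parent.
# Assumption: parents always have smaller ids than their offspring.
PARENTS = {
    1: (None, None),  # founder sire
    2: (None, None),  # founder dam
    3: (1, 2),
    4: (1, 2),        # full sibling of 3
    5: (3, 4),        # offspring of a full-sibling mating
}


@lru_cache(maxsize=None)
def kinship(a, b):
    """Probability that an allele drawn from a and one drawn from b are identical by descent."""
    if a is None or b is None:
        return 0.0                       # unknown parent: assume unrelated
    if a == b:
        return 0.5 * (1.0 + inbreeding(a))
    if a < b:
        a, b = b, a                      # recurse through the younger animal
    sire, dam = PARENTS[a]
    return 0.5 * (kinship(sire, b) + kinship(dam, b))


def inbreeding(animal_id):
    """COI of an animal is the kinship between its two parents."""
    sire, dam = PARENTS[animal_id]
    return kinship(sire, dam)


print(inbreeding(5))  # 0.25 for a full-sibling mating
print(inbreeding(3))  # 0.0: parents are unrelated founders
```

Animal 5 reaches both founders through its sire and again through its dam. Those repeated paths are exactly what a de-duplicated ancestor list would throw away.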

Will recursive CTEs in PostgreSQL work for descendant queries too? Yes. Flip the logic: instead of joining on ancestors (sire_id and dam_id), join on offspring (any record where the starting animal’s ID is listed as sire_id or dam_id). Same pattern, opposite direction.
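A sketch of the flipped query, using the same hypothetical table and sample animals as an ancestor example would, with `sqlite3` standing in for PostgreSQL. The CTE name `progeny` and the depth parameter are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE animals (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    sire_id INTEGER REFERENCES animals (id),
    dam_id  INTEGER REFERENCES animals (id)
);
-- Hypothetical data: A and B produce C and D, which together produce E.
INSERT INTO animals (id, name, sire_id, dam_id) VALUES
    (1, 'A', NULL, NULL),
    (2, 'B', NULL, NULL),
    (3, 'C', 1, 2),
    (4, 'D', 1, 2),
    (5, 'E', 3, 4);
""")

DESCENDANTS_SQL = """
WITH RECURSIVE progeny (id, name, generation) AS (
    -- Base case: the starting animal.
    SELECT id, name, 0 FROM animals WHERE id = ?
  UNION ALL
    -- Recursive step: any animal listing a found row as sire or dam.
    SELECT a.id, a.name, p.generation + 1
    FROM animals AS a
    JOIN progeny AS p ON a.sire_id = p.id OR a.dam_id = p.id
    WHERE p.generation < ?   -- depth limit
)
SELECT generation, id, name FROM progeny ORDER BY generation, id
"""

descendants = conn.execute(DESCENDANTS_SQL, (1, 4)).fetchall()
for generation, animal_id, name in descendants:
    print(generation, animal_id, name)
```

E shows up twice at generation 2 because it descends from A through both C and D, the mirror image of a repeated ancestor in the upward query.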

Can I use this pattern for non-breeding pedigrees, like family genealogy? Absolutely. The schema and queries work for human genealogy, historical records, anything with bilateral inheritance. The only change is adapting parent columns from sire_id/dam_id to father_id/mother_id.

Written by Priya Sundaram

Hardware and infrastructure reporter. Tracks GPU wars, chip design, and the compute economy.

Originally reported by Dev.to
