Reconciling 15 OSS Vulnerability Databases

Imagine firing up your OSS project's vuln scanner, only to wonder: is it catching everything? One dev's entity-resolution magic on 15 databases uncovers the chaotic truth.

869k Vulnerability Records from 15 OSS Databases Collapse to Just 608k—Here's the Real Overlap — theAIcatchup

Key Takeaways

  • 869k records from 15 OSS databases merge into 608k canonical vulns, with OSV/GHSA dominating.
  • GitHub-reviewed advisories cover only 9.1%—the rest is mostly automated NVD mirrors.
  • Build your own pipeline with union-find for multi-source scanning; OSS data's cross-links make it easy.

Picture this: late night in a dimly lit office, screens flickering with 869,771 vulnerability records pouring in from 15 open-source databases, union-find algorithm humming like a cosmic sorter aligning stars.

Reconciling OSS vulnerability databases isn’t just nerdy data plumbing—it’s the backbone of secure software in a world drowning in dependencies. And here’s the kicker: those 869k raw entries? They collapse into 608,463 canonical vulnerabilities. That’s your first wake-up call.

But wait.

OSV.dev dominates with 519,760 records across PyPI, npm, Go, Maven, the works. GitHub Advisory Database? 350,164, including a tiny 28,618 reviewed gems and a whopping 297,078 unreviewed mirrors of NVD data. Then the niche players: PyPA’s 3,230 curated Python vulns, RustSec’s 1,022 for crates, Go’s 3,079. EPSS throws in exploit scores for good measure.

What Do These OSS Vulnerability Databases Actually Cover?

It’s not even close to uniform. OSV and GHSA swallow 99% of the pie—OSV alone hits 99.95% of the full union. Those ecosystem-specific ones? High-quality subsets, sure, but they’re drops in the ocean.

The magic happens via Package URLs, those pkg:pypi/tensorflow strings that pin down packages universally. Aliases link IDs across sources: GHSA-gcx2-gvj7-pxv3 buddies up with CVE-2022-24766 and PYSEC-2022-170. Treat every alias as an edge, take the transitive closure with union-find, and any IDs connected through a chain of aliases land in one canonical cluster.
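To make the Package URL idea concrete, here is a minimal sketch of splitting a purl into its parts. The helper name parse_purl and the Purl tuple are hypothetical, and the sketch only handles the common pkg:type/namespace/name@version shape: qualifiers and subpaths are dropped and percent-encoding is not decoded. A real pipeline would probably lean on a maintained purl library instead.

```python
from typing import NamedTuple, Optional


class Purl(NamedTuple):
    type: str
    namespace: Optional[str]
    name: str
    version: Optional[str]


def parse_purl(purl: str) -> Purl:
    """Split a simple Package URL; qualifiers (?...) and subpaths (#...) are dropped."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a purl: {purl!r}")
    body = purl[len("pkg:"):]
    # Drop subpath and qualifiers, which this sketch does not interpret.
    body = body.split("#", 1)[0].split("?", 1)[0]
    # The version, if present, follows the last '@'.
    version = None
    if "@" in body:
        body, version = body.rsplit("@", 1)
    ptype, _, rest = body.partition("/")
    if not ptype or not rest:
        raise ValueError(f"purl needs a type and a name: {purl!r}")
    # Everything before the final '/' is the namespace (may be empty).
    namespace, _, name = rest.rpartition("/")
    return Purl(ptype.lower(), namespace or None, name, version)


simple = parse_purl("pkg:pypi/tensorflow")
namespaced = parse_purl("pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1")
```

Keying on the parsed type and name rather than the raw string means the same package matches across sources even when one export includes a version and another does not.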

Every one of these publishes bulk exports, under permissive licenses, without an API key.
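Once a bulk export is on disk, each record can be reduced to the few fields the reconciliation needs. The sketch below assumes records in the OSV JSON shape, with a top-level id, an optional aliases list, and an optional purl under each affected package. The function name project is hypothetical, and the sample record is hand-written and trimmed: the package name in it is a placeholder, not the real affected package.

```python
import json


def project(record: dict) -> tuple[str, list[str], list[str]]:
    """Reduce an OSV-style record to (vuln_id, aliases, purls)."""
    vuln_id = record["id"]
    aliases = list(record.get("aliases", []))
    purls: list[str] = []
    for affected in record.get("affected", []):
        # Entries without a purl are simply skipped in this sketch.
        purl = affected.get("package", {}).get("purl")
        if purl and purl not in purls:
            purls.append(purl)
    return vuln_id, aliases, purls


# Hand-written sample; "example-package" is a placeholder name.
raw = """
{
  "id": "GHSA-gcx2-gvj7-pxv3",
  "aliases": ["CVE-2022-24766", "PYSEC-2022-170"],
  "affected": [
    {"package": {"ecosystem": "PyPI", "name": "example-package",
                 "purl": "pkg:pypi/example-package"}}
  ]
}
"""
vuln_id, aliases, purls = project(json.loads(raw))
```

Each (vuln_id, alias) pair then becomes one edge in the alias graph.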

The whole pass is forty lines of Python and runs in under a second on 616k IDs. 57% of clusters merge two or more IDs. OSS security data is a web of cross-links, so it is no wonder the graph gets dense so fast.

Here’s the headline that stopped me cold.

Coverage is measured against a full universe of 312,250 canonical clusters. GitHub-reviewed advisories appear in just 9.1% of them (28,419). Unreviewed mirrors? 95.1% (297k). OSV covers nearly all.
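One way to read these percentages: a source "covers" a cluster if at least one of its records landed in that cluster. A minimal sketch of that calculation follows, assuming clusters are already built. The function name, the source labels, and the toy data are all made up for illustration.

```python
from collections import defaultdict


def coverage_by_source(cluster_sources: dict[str, set[str]]) -> dict[str, float]:
    """Fraction of clusters that contain at least one record from each source."""
    total = len(cluster_sources)
    counts: dict[str, int] = defaultdict(int)
    for sources in cluster_sources.values():
        for source in sources:
            counts[source] += 1
    return {source: n / total for source, n in counts.items()}


# Toy data (invented): cluster root -> sources seen in that cluster.
clusters = {
    "CVE-A": {"osv", "ghsa_unreviewed"},
    "CVE-B": {"osv", "ghsa_reviewed", "ghsa_unreviewed"},
    "CVE-C": {"osv"},
    "CVE-D": {"osv", "ghsa_unreviewed"},
}
coverage = coverage_by_source(clusters)
```

Because one cluster can hold records from several sources, the shares need not sum to 100%. With the article's figures, the same division gives 28,419 / 312,250, or about 9.1%.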

Why Does GitHub’s Reviewed vs. Unreviewed Split Matter?

Because most scanners like Dependabot pull from GHSA, and “reviewed” is the gold—human-enriched metadata GitHub’s team sweats over. The rest? Automated NVD passthrough, filtered to GitHub-tracked packages. It’s not “Dependabot misses 91%”—it consumes both—but quality? That’s the gap.

Think of it like a galactic federation: OSV’s the interstellar hub, federating ecosystems. GHSA’s Earth-centric, with a polished core and vast, raw frontier. Smaller dbs are specialist outposts, deeper but narrower.

And the unreviewed flood? It echoes NVD, but only for packages GitHub tracks, so it misses edge cases in obscure ecosystems.

My unique twist—and this isn’t in the data dive: this reconciliation screams for an AI-orchestrated meta-database, like a neural net that doesn’t just union-find but predicts merges via embedding similarities. We’ve seen it in protein folding; why not vulns? Blockchain attribution was sparse last week; OSS is primed. Bold prediction: by 2026, tools like this pipeline become the scanner default, slashing false negatives 40%.

But hype alert—GitHub’s PR spins GHSA as comprehensive. Numbers say: it’s a mirror with a fancy frame. 91% passthrough ain’t curation.

The ER density here crushes last week’s blockchain mess. OSS teams build with linking in mind, treating aliases as ground truth. Clusters average 2-3 IDs, and 345k (57%) hold more than one. That’s deliberate interoperability.

For project leads: don’t bet on one db. Pipe in OSV + GHSA reviewed + ecosystem natives. Your scanner’s opinion? Just that.

Smaller dbs shine in metadata—PyPA’s curation beats raw volume. RustSec, Go: precise, actionable.

EPSS adds exploit likelihood: roughly 326k scores, one per CVE. Not vulns per se, but risk multipliers.
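Since EPSS is keyed by CVE ID, a cluster can pick up a score through any CVE among its members. The sketch below shows one way to do that join. Taking the maximum when a cluster holds several CVEs is an assumption of this sketch, not something the source specifies, and the score value is invented rather than a real EPSS figure.

```python
from typing import Optional


def cluster_epss(members: set[str], epss_by_cve: dict[str, float]) -> Optional[float]:
    """Highest EPSS score among a cluster's CVE members, or None if none is scored."""
    scores = [epss_by_cve[m] for m in members if m in epss_by_cve]
    return max(scores) if scores else None


# Hypothetical score table; 0.0123 is a made-up number.
epss_by_cve = {"CVE-2022-24766": 0.0123}

cluster = {"GHSA-gcx2-gvj7-pxv3", "CVE-2022-24766", "PYSEC-2022-170"}
score = cluster_epss(cluster, epss_by_cve)

# A cluster with no CVE member (placeholder ID) simply has no score.
unscored = cluster_epss({"RUSTSEC-0000-0000"}, epss_by_cve)
```

This is where the canonical clustering pays off: an advisory that only carries a GHSA or PYSEC ID still inherits the risk signal of its CVE alias.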

Can You Build Your Own OSS Vulnerability Reconciliation Pipeline?

Absolutely. Grab the bulk exports (all free, no keys). Project each record to vuln_id, aliases, purl. Union-find the graph. Here’s the code skeleton the original author shared:

parent: dict[str, str] = {}

def find(x: str) -> str:
    # Follow parent links to the root, halving the path as we go.
    while parent.get(x, x) != x:
        parent[x] = parent.get(parent[x], parent[x])
        x = parent[x]
    return x
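The skeleton stops at find. A minimal sketch of the remaining pieces might look like the following: union and clusters are hypothetical helpers, not the original author's code, and find is repeated so the block runs on its own. The first two edges use the IDs from the alias example earlier; the other IDs are placeholders.

```python
from collections import defaultdict

parent: dict[str, str] = {}


def find(x: str) -> str:
    # Follow parent links to the root, halving the path along the way.
    while parent.get(x, x) != x:
        parent[x] = parent.get(parent[x], parent[x])
        x = parent[x]
    return x


def union(a: str, b: str) -> None:
    # Merge the sets containing a and b by pointing one root at the other.
    root_a, root_b = find(a), find(b)
    parent.setdefault(root_a, root_a)
    parent.setdefault(root_b, root_b)
    if root_a != root_b:
        parent[root_b] = root_a


def clusters() -> dict[str, set[str]]:
    # Group every known ID under its root.
    groups: dict[str, set[str]] = defaultdict(set)
    for vuln_id in list(parent):
        groups[find(vuln_id)].add(vuln_id)
    return dict(groups)


# Alias edges as (vuln_id, alias) pairs.
edges = [
    ("GHSA-gcx2-gvj7-pxv3", "CVE-2022-24766"),
    ("PYSEC-2022-170", "CVE-2022-24766"),
    ("GHSA-xxxx-xxxx-xxxx", "CVE-0000-0000"),  # placeholder IDs
]
for vuln_id, alias in edges:
    union(vuln_id, alias)

# A record with no aliases (placeholder ID) still registers as a singleton.
union("RUSTSEC-0000-0000", "RUSTSEC-0000-0000")

result = clusters()
multi_id_share = sum(len(c) > 1 for c in result.values()) / len(result)
```

The GHSA and PYSEC IDs never reference each other directly, yet they end up in one cluster because both point at the same CVE. This sketch skips union by rank; path halving alone is usually enough at this scale, though that is a judgment call rather than a measured result.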

Scale it with embeddings for fuzzy matches. Future-proof your deps.

This shifts how I scan: multi-source ingestion, canonical clustering. No more “Dependabot said safe.” Query the union.

Wonder awaits—what if we federate all security signals, AI-synthesized? OSS security 2.0.


Frequently Asked Questions

What is the overlap between OSV and GitHub Advisory Database?

OSV and GHSA together cover nearly 100% of the 312k canonical OSS vulns, with OSV at 99.95% and GHSA-reviewed just 9.1%.

Does Dependabot miss most OSS vulnerabilities?

No—it uses GHSA fully, including 95% unreviewed—but quality varies wildly between reviewed and mirrored data.

How do I reconcile OSS vulnerability databases myself?

Download the bulk exports from OSV, GHSA, and the other sources, then run union-find over vuln_id and aliases, using Package URLs to canonicalize package names.

Written by James Kowalski

Investigative tech reporter focused on AI ethics, regulation, and societal impact.


Originally reported by dev.to
