Saturday afternoon, screen glowing in a quiet apartment, I fired up pgit on the Linux kernel repo.
And just like that, 1.2 million commits — stretching back to 1991 — landed in a PostgreSQL database. Not as an endless scroll of git log --oneline, but as structured data: tables you can slice, dice, and filter with plain SQL.
pgit, that unassuming tool, flips Git from a black box into a spreadsheet for software archaeologists. The creator queried authorship trends, subsystem merge velocities, even commit-hour patterns across decades. Linus Torvalds’ earliest pushes? Right there, timestamped before most devs were born.
But here’s the hook — and my sharp take: this isn’t hobbyist fun. It’s a market signal. Dev tools are exploding toward data layers on version control. GitHub Copilot analyzes diffs; now imagine SQL on your full history powering AI code reviews. Teams ignoring this? They’re dinosaurs in a Postgres world.
What pgit Actually Does (And Why Git Log Sucks)
Git log spits out a list. Flat, unsearchable, gone when you scroll up. pgit pipes that chaos — hashes, authors, emails, timestamps, diff stats — into relational tables. Suddenly, you’re not hunting; you’re querying.
The schema’s dead simple: a commits table with the hash as primary key, plus fields for author, timestamp, message, file counts, insertions, and deletions (the column names are Spanish: autor, fecha, inserciones, eliminaciones). Indexes on date, repo, and author for speed.
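A minimal sketch of that schema, reconstructed from the column names the post's own queries use (autor, fecha, inserciones, eliminaciones); the exact types, the repo and archivos columns, and the index names are my guesses, not pgit's:

```shell
#!/bin/sh
# Hypothetical reconstruction of the commits schema described above.
# In real use, pipe to psql; here PSQL defaults to cat for a dry run.
${PSQL:-cat} <<'SQL'
CREATE TABLE IF NOT EXISTS commits (
    hash          CHAR(40) PRIMARY KEY,
    repo          TEXT,
    autor         TEXT,
    email         TEXT,
    fecha         TIMESTAMPTZ,
    mensaje       TEXT,
    archivos      INT,
    inserciones   INT,
    eliminaciones INT
);
-- the indexes the post mentions: date, repo, author
CREATE INDEX IF NOT EXISTS commits_fecha_idx ON commits (fecha);
CREATE INDEX IF NOT EXISTS commits_repo_idx  ON commits (repo);
CREATE INDEX IF NOT EXISTS commits_autor_idx ON commits (autor);
SQL
```

Run with `PSQL='psql "$DATABASE_URL"'` to actually apply it.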
Take this ingestion script snippet — raw bash glory:
#!/bin/bash
# ingestar_repo.sh — loads a repo's history into postgres
REPO_PATH=$1
REPO_NAME=$2
…
git log --format="%H|%an|%ae|%aI|%x00" --numstat | awk '…' | psql …
It parses git log's quirky --numstat output, tallies lines added/deleted per commit, and pipes to COPY for a bulk insert. Pipes in messages can break it (yeah, edge case), but for exploration? Gold.
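Here's roughly how that tally step works — a self-contained sketch, with sample text standing in for real `git log --numstat` output (the `%H|%an|%aI` header format is my assumption, not the script's exact one):

```shell
#!/bin/sh
# Sketch: sum insertions/deletions per commit from --numstat-style output.
# Header lines look like "hash|author|date"; numstat lines are
# "insertions<TAB>deletions<TAB>path"; "-" marks binary files.
printf 'abc123|Alice|2024-01-01T11:00:00\n10\t2\tsrc/app.c\n3\t0\tREADME.md\n\ndef456|Bob|2024-01-01T22:00:00\n-\t-\tlogo.png\n' |
awk -F'\t' '
  NF == 1 && index($0, "|") {                  # commit header line
      if (hash != "") print hash "," ins "," del
      split($0, h, "|"); hash = h[1]; ins = 0; del = 0; next
  }
  NF == 3 {                                    # numstat line
      if ($1 != "-") ins += $1
      if ($2 != "-") del += $2
  }
  END { if (hash != "") print hash "," ins "," del }
'
# prints:
#   abc123,13,2
#   def456,0,0
```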
I cloned it, tweaked for my repos. Nine projects: freelance gigs, side hacks, work monorepo. 4,847 commits, 2020-2024. Ingested in minutes.
First query: peak commit hours.
SELECT EXTRACT(HOUR FROM fecha) AS hora,
       COUNT(*) AS cantidad,
       ROUND(AVG(inserciones + eliminaciones)) AS lineas_promedio
FROM commits
WHERE autor LIKE '%Torchia%'
GROUP BY hora
ORDER BY cantidad DESC;
Peaks at 11am and 10pm. Night commits? Double the lines. Productive? Wait.
Ever Queried Your Own Commit Messages? Don’t.
Night messages:
“arreglo” (fix), “wip”, “no sé qué pasó pero funciona” (no idea what happened, but it works), “fix de antes” (fix from before).
Oof. Bigger changes, sloppier notes. My worst self, data-proven. That’s the perturbing mirror pgit holds up — your code as stranger’s artifact.
Linux scale amplifies it. Commits from the dead. Decisions traceable to exact weeks. Stratigraphy: layers of choices, merges, refactors. Query merges by subsystem:
SELECT subsystem, AVG(days_to_merge) FROM merges GROUP BY subsystem;
(Assuming a derived table.) Networking lags behind drivers. Who’s committing at 3am UTC? Timezone ghosts of global collab.
But my unique angle — overlooked in the original: this echoes the Human Genome Project’s data pivot. Early ’90s, biologists sequenced by hand; then databases turned DNA into queryable gold, birthing biotech booms. Git histories are code’s DNA. pgit? The sequencer. Predict this: by 2026, IDEs bundle Git-to-SQL exporters. Vercel, GitHub — watch them copy.
Chilling.
Why Does This Matter for Solo Devs and Kernel Giants?
Market dynamics scream yes. Solo? Self-audit habits, spot burnout (those 10pm slogs). Teams? Enforcement: “No WIP commits past 8pm.” OSS maintainers? Authorship disputes die under SQL fire.
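That "no WIP commits past 8pm" rule could even be wired into a commit-msg hook. A hypothetical sketch — the check_msg helper is mine, not pgit's:

```shell
#!/bin/sh
# Hypothetical commit-msg hook for a "no WIP commits past 8pm" policy.
# check_msg takes the current hour and the commit-message file;
# returning non-zero makes git abort the commit.
check_msg() {
    hour=$1
    msg_file=$2
    if [ "$hour" -ge 20 ] && grep -qi '^wip' "$msg_file"; then
        echo "rejected: WIP commits are not allowed after 20:00" >&2
        return 1
    fi
    return 0
}

# Demo run; in a real .git/hooks/commit-msg you'd call:
#   check_msg "$(date +%H)" "$1"
tmp=$(mktemp)
printf 'wip: no sé qué pasó pero funciona\n' > "$tmp"
check_msg 22 "$tmp" || echo "blocked"    # prints "blocked"
check_msg 11 "$tmp" && echo "allowed"    # prints "allowed"
rm -f "$tmp"
```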
Linux’s repo: public treasure. 30+ years, 20k+ contributors. pgit unearths power laws — top 1% authors drive 80% commits? Checkable now.
I dug into file patterns. TSX files spike alongside “component” in commit messages. Approximate, sure — it really needs a files table:
CREATE TABLE files (
    commit_hash CHAR(40) REFERENCES commits(hash),
    path        TEXT,
    extension   TEXT
);
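Populating it is a small pipeline away. A rough sketch, with sample text standing in for `git log --format='%H' --name-only` output and a naive last-dot rule for extensions:

```shell
#!/bin/sh
# Sketch: emit hash,path,extension CSV rows for the proposed files table.
# Sample text stands in for real `git log --format='%H' --name-only` output;
# a file whose name happens to be all hex digits would fool the hash check.
printf 'abc1234\nsrc/App.tsx\nscripts/etl.py\nMakefile\n' |
awk '
  $0 ~ /^[0-9a-f]+$/ && length($0) >= 7 { hash = $0; next }   # hash line
  NF {
      ext = $0; sub(/^.*\./, "", ext)       # keep text after the last dot
      if (ext == $0) ext = ""               # no dot means no extension
      print hash "," $0 "," ext
  }
'
# prints:
#   abc1234,src/App.tsx,tsx
#   abc1234,scripts/etl.py,py
#   abc1234,Makefile,
```

Feed the CSV to Postgres with `\copy files FROM … CSV`.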
But even rough: my repos reek of React churn, Python experiments fading post-2022.
Corporate spin? None here — this dev’s raw. No hype. Just facts forcing change.
And the archaeology bit? Spot on. Commits as strata: fossil bugs, extinct APIs. Query deletions over time: what vanished?
SELECT extension, SUM(eliminaciones) AS total_eliminadas FROM commits GROUP BY extension ORDER BY total_eliminadas DESC;
(Assumes extension denormalized onto commits; otherwise join through the files table above.)
COBOL ghosts? Nah, but your old jQuery sins, yes.
Is pgit Ready for Prime Time — Or Just a Weekend Hack?
Limits: pipes in messages, no blame data (git blame’s heavier), scale on massive monorepos. But Postgres handles billions; shard if needed.
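The pipes-in-messages limit has a cheap fix: delimit fields with a byte far less likely to appear in a message than |, like the ASCII unit separator (0x1f, emitted by git log's %x1f placeholder). A minimal sketch with simulated input:

```shell
#!/bin/sh
# Sketch: parse 0x1f-delimited fields, as produced by e.g.
# git log --format='%H%x1f%an%x1f%s'. Simulated input stands in for git,
# with a pipe in the subject to show it no longer breaks parsing.
printf 'abc123\037Alice\037fix | handle pipes in paths\n' |
awk -F'\037' '{ print "hash=" $1 " author=" $2 " subject=" $3 }'
# prints: hash=abc123 author=Alice subject=fix | handle pipes in paths
```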
My position: Bullish, but tactical. Pair with GitHub API for live sync. Or Neon/Cockroach for serverless. Devs sleeping on this lose the data edge — Copilot’s coming for histories next.
Ingested mine. Fixed habits. Night commits now? Crisp messages only.
Data wins.
Frequently Asked Questions
What is pgit and how does it work?
pgit loads full Git repo histories into PostgreSQL tables using git log output parsed via scripts, enabling SQL queries on commits, authors, diffs.
Can I use pgit on my own Git repos?
Yes — clone from GitHub, tweak the ingestion script for your DB URL, run on any repo. Handles thousands of commits fast.
Why query Git history with SQL instead of git log?
SQL lets you aggregate patterns (hours, file types, merge speeds) across years; git log is just a dumb list.