PII Detection API: Zero AI Cost, Pure Regex

Most PII detection tools bleed money because they run your data through an LLM. One developer just showed you don't need AI to catch credit cards, emails, and SSNs: plain regex patterns do the job faster and cheaper.

I Built a PII Detection API Without Touching AI—And It's Faster Than Every Enterprise Tool — theAIcatchup

Key Takeaways

  • Regex patterns outperform LLMs for structured PII like credit cards, emails, and SSNs—and cost nothing per request
  • Adding checksum validation (the Luhn algorithm for cards) filters out most false positives without needing AI
  • Deterministic, sub-500ms response times make this faster and more reliable than enterprise alternatives that cost $0.01+ per call
  • Not every problem needs machine learning; sometimes the right question is 'Do I actually need AI for this?' and the answer is no

Regex beats AI for this one specific job.

Here’s the thing: we’ve been conditioned to think that detecting personal data requires machine learning. That detecting a credit card number, an email address, or a phone number demands neural networks and token counting and all the expensive machinery that comes with LLMs. It doesn’t.

One developer just shipped Origrid PII Detect, a PII scanning API that uses nothing but regex pattern matching—zero LLM calls, zero AI overhead, response times under 500 milliseconds. And it works. This is what happens when you actually ask the question: “Do I need artificial intelligence for this, or am I just pattern-matching?” Usually, it’s the latter.

Most companies reaching for a PII detection API today face the same trap. Microsoft Presidio demands you self-host a full NLP pipeline (complexity tax). AWS Comprehend charges $0.01+ per request (and those requests add up fast when you’re scanning millions of form submissions). Google DLP? Enterprise pricing meets enterprise friction. For most real-world use cases—GDPR compliance, preventing accidental data leaks in logs and chat systems, scrubbing user-submitted content before it hits your database—you don’t actually need the sophistication of an LLM.

Why Regex Wins When Patterns Are Predictable

Emails follow RFC 5322 (a specific, standardized format). Credit cards aren’t random: Visa numbers start with 4, Mastercard numbers start with 51-55 or fall in the 2221-2720 range, and every valid card number passes the Luhn algorithm, a checksum that catches most mistyped or made-up number sequences. US Social Security numbers follow XXX-XX-XXXX. Phone numbers, IBANs, and IPv4 addresses all have well-defined structures too.

The API detects six entity types, each with surgical precision:

  • Email: RFC 5322 simplified pattern
  • Phone: International formats (US, EU, UK, LATAM)
  • Credit card: Visa/MC/Amex/Discover patterns + Luhn validation
  • SSN: US format with range validation
  • IBAN: European format with country code prefix
  • IP address: IPv4 with octet range validation
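To make the "pattern plus validation" idea concrete, here is a minimal Python sketch covering three of the entity types above. The regexes, the function names, and the tuple layout are illustrative assumptions, not the API's actual implementation; the email pattern in particular is far looser than full RFC 5322.

```python
import re

# Simplified, assumed patterns (not the API's real regexes).
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
IPV4_RE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def valid_ssn(area: str, group: str, serial: str) -> bool:
    """Range check: area 000, 666 and 900-999, group 00 and serial 0000 are never issued."""
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def valid_ipv4(octets) -> bool:
    """Octet range check: every part must be between 0 and 255."""
    return all(0 <= int(o) <= 255 for o in octets)


def find_entities(text: str):
    """Return (type, value, start, end) tuples sorted by start position."""
    found = []
    for m in EMAIL_RE.finditer(text):
        found.append(("email", m.group(), m.start(), m.end()))
    for m in SSN_RE.finditer(text):
        if valid_ssn(*m.groups()):
            found.append(("ssn", m.group(), m.start(), m.end()))
    for m in IPV4_RE.finditer(text):
        if valid_ipv4(m.groups()):
            found.append(("ip_address", m.group(), m.start(), m.end()))
    return sorted(found, key=lambda e: e[2])
```

The regex only proposes candidates; the small validators decide whether a candidate is plausible, which is why strings like 999.1.1.1 or 000-12-3456 are dropped even though they match the shape.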

The validation steps matter. A naive regex that just asks “does this look like a credit card?” will flag random number sequences. But put the Luhn algorithm on top (a roughly 15-line checksum function that has been around since the 1960s) and you stop chasing ghosts: about nine in ten random digit strings fail the check, so most false positives disappear while real card numbers still get caught.
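For readers who haven't seen it, the Luhn check really is tiny. The sketch below pairs it with a deliberately loose candidate regex; `CARD_RE`, `luhn_valid`, and `find_cards` are names chosen for this illustration, and the real service presumably also checks issuer prefixes, which this version skips.

```python
import re

# Assumed candidate pattern: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def find_cards(text: str):
    """Return regex candidates that also pass the Luhn check."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```

With this in place, an order ID such as 1234567890123456 matches the regex but fails the checksum, while the standard test number 4111 1111 1111 1111 passes both.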

Is This Actually Better Than AI-Powered Alternatives?

Depends entirely on what you’re trying to do. For structured, well-defined PII patterns? Absolutely. The response times hover at 100-400ms of pure network overhead. There’s no model latency, no token counting, no spinning up GPU inference. The cost is a hard zero. And here’s the kicker: it’s deterministic. Feed it the same text twice, and you get identical results every single time. No hallucinations. No model drift.

“Because there’s no AI model in the loop: Latency: ~100-400ms (network overhead, not compute). Cost per call: $0.00 (no LLM tokens). Reliability: deterministic—same input always produces same output.”

What it can’t do: it won’t catch names without a dictionary. Street addresses trip it up because there are too many formats globally. Context-dependent PII (“my birthday is next Thursday”) lives beyond regex’s reach. That’s the honest conversation nobody in the AI space wants to have—your hammer isn’t a universal solution, and that’s fine.

The Architecture: Deduplication and Risk Scoring

When patterns overlap—a phone number embedded in an IBAN, for example—the system deduplicates by priority. Credit cards and SSNs rank highest because they’re nuclear in terms of risk and regulatory exposure. The response includes exact start/end positions so you can highlight or redact in your UI, a pre-redacted version of the text ready for safe storage, and a risk_level flag (high = financial or government IDs detected).
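One way to picture priority-based deduplication is the sketch below. Only "credit cards and SSNs rank highest" and "high = financial or government IDs" come from the description above; the rest of the ordering, the `"low"` fallback value, and all names here are assumptions made for the example.

```python
# Assumed priority table: lower number wins when spans overlap.
PRIORITY = {"credit_card": 0, "ssn": 0, "iban": 1, "email": 2, "phone": 3, "ip_address": 4}
# Assumed set of types that count as financial or government IDs.
HIGH_RISK = {"credit_card", "ssn", "iban"}


def overlaps(a: dict, b: dict) -> bool:
    """True if the half-open spans [start, end) of a and b intersect."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def deduplicate(entities: list) -> list:
    """Keep the highest-priority entity among overlapping spans."""
    kept = []
    # Visit higher-priority matches first; break ties in favour of longer spans.
    ordered = sorted(entities, key=lambda e: (PRIORITY[e["type"]], -(e["end"] - e["start"])))
    for ent in ordered:
        if not any(overlaps(ent, k) for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


def risk_level(entities: list) -> str:
    """'high' if any financial or government ID is present; 'low' is an assumed default."""
    return "high" if any(e["type"] in HIGH_RISK for e in entities) else "low"
```

In the IBAN example from the paragraph above, the phone-shaped run of digits inside the IBAN overlaps the IBAN span, loses on priority, and is dropped, so the caller sees one entity instead of two.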

Example output:

{
  "pii_found": true,
  "entity_count": 3,
  "entities": [
    {"type": "email", "value": "[email protected]", "start": 6, "end": 19, "confidence": 1.0},
    {"type": "phone", "value": "+34 612 345 678", "start": 26, "end": 41, "confidence": 1.0},
    {"type": "credit_card", "value": "4111-1111-1111-1111", "start": 48, "end": 67, "confidence": 1.0}
  ],
  "redacted_text": "Email [EMAIL], call [PHONE], card [CREDIT_CARD]",
  "risk_level": "high"
}
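The start/end offsets in a response like this are enough to rebuild the redacted text yourself, for example if you want different placeholders. A minimal sketch, assuming half-open offsets as in the sample and using a made-up 13-character address (a@example.org) as a stand-in for the email value:

```python
def redact(text: str, entities: list) -> str:
    """Replace each entity span with a [TYPE] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['type'].upper()}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hypothetical input whose offsets line up with the sample response.
original = "Email a@example.org, call +34 612 345 678, card 4111-1111-1111-1111"
entities = [
    {"type": "email", "start": 6, "end": 19},
    {"type": "phone", "start": 26, "end": 41},
    {"type": "credit_card", "start": 48, "end": 67},
]
redacted = redact(original, entities)
```

Replacing from the end of the string backwards avoids having to adjust offsets as the text changes length.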

This isn’t a proof-of-concept built on a whiteboard. It’s live on RapidAPI with a free tier (50 requests/month), built with FastAPI, and ready to drop into production.

Why This Matters for the Broader AI Conversation

There’s a narrative in tech that says every problem benefits from more intelligence, more models, more tokens. But this project is a counter-signal. A sharp one. It’s saying: before you reach for the sledgehammer of machine learning, ask whether the job actually demands it. Sometimes the answer is no. Sometimes regex from 1987 solves the problem cheaper, faster, and with better reliability than the fanciest 2024 LLM wrapped in a REST API.

That’s not anti-AI rhetoric. It’s pro-pragmatism.

The roadmap hints at v2 adding a “deep scan” mode with optional LLM analysis for edge cases and context-dependent PII. That’s the right instinct: regex as the baseline, AI as an optional upgrade layer. It inverts the default assumption most companies operate under right now.



Frequently Asked Questions

What does a PII detection API actually do?
It scans text for personal information (emails, credit cards, phone numbers, SSNs, IBANs, IP addresses) and either redacts it or returns its exact position in the text. This one uses only regex patterns, so there’s no LLM cost or latency.

Can regex really catch credit card fraud?
It catches credit card numbers in your logs, chats, or user submissions before they leak, and it validates them with the Luhn algorithm to cut false positives. It won’t detect fraud itself; that’s a different problem. But preventing accidental exposure is the whole point.

Will this replace expensive PII detection services?
For 80% of compliance use cases (GDPR, CCPA, basic data hygiene), yes. For context-dependent PII, names without a dictionary, or complex patterns, you’d still layer in an LLM. But if you’re just trying to keep credit card numbers out of your database, this covers it without the $0.01-per-call bleed.

Written by Aisha Patel

Former ML engineer turned writer. Covers computer vision and robotics with a practitioner perspective.



Originally reported by Dev.to
