Dangerous Regex in Python Packages Audit

One wrong regex, and your server grinds to a halt — like Cloudflare's infamous 2019 meltdown. I ran the audit on 20 Python heavyweights; 23 ticking bombs remain.

[Figure: exponential time growth for the vulnerable regex pattern (a+)+]

Key Takeaways

  • 23 ReDoS-vulnerable regex patterns found in runtime code of 20 top Python libraries.
  • Static AST analysis beats dynamic timing tests — detects nested quantifiers at compile time.
  • Atomic groups in Python 3.11+ fix many issues; aiohttp and pytest need fixes.

Cloudflare’s WAF just imploded. Eighty percent traffic drop. CPUs screaming at 100%. All from a single regex fragment: .*(?:.*=.*). Three minutes flat.

That’s ReDoS in action, folks — Regular Expression Denial of Service. Not some abstract boogeyman. Real-world carnage.

The engineer tweaked an XSS rule. Competing .* quantifiers turned it into a backtracking nightmare. Python’s regex engine, like most, is a backtracker: when a match fails, it tries every possible path before giving up. Worst case, exponential time. Boom.

At 13:42 UTC on July 2, 2019, an engineer working for Cloudflare made changes to the regular ruleset that was being used by their Web Application Firewall. In under three minutes, there was an 80% drop in the amount of traffic globally.

Our hero here — let’s call him the Auditor — got curious. How many of these time bombs lurk in the Python libs we all depend on? Requests. Django. Pandas. The daily grind.

He built a static analyzer. No fuzzy timing tests with evil inputs. Smart move. Python’s re module parses patterns into an AST via sre_parse (re._parser since Python 3.11). Spot a MAX_REPEAT nested over a SUBPATTERN that holds another repeat? Flag it. Nested quantifiers. Repeats that can match the empty string. Overlapping alternatives. Even a sneaky new category.
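To make the idea concrete, here is a minimal sketch of a nested-quantifier check built on the parser's output. It is not the Auditor's tool: the function names are made up for illustration, it covers only the nested-quantifier category, and it leans on CPython's private parser module, whose tuple layout can change between versions.

```python
try:  # Python 3.11+: the parser lives inside the re package
    from re import _parser as sre_parse, _constants as sre_constants
except ImportError:  # older interpreters
    import sre_parse
    import sre_constants

REPEATS = (sre_constants.MAX_REPEAT, sre_constants.MIN_REPEAT)


def _children(op, av):
    """Yield the sub-patterns nested directly inside one parsed node."""
    if op in REPEATS:
        yield av[2]              # av = (min, max, subpattern)
    elif op == sre_constants.SUBPATTERN:
        yield av[-1]             # the subpattern is the last element
    elif op == sre_constants.BRANCH:
        yield from av[1]         # av = (None, [alternative, ...])
    # Other node types (lookarounds, atomic groups) are not descended into.


def has_nested_quantifier(pattern):
    """True if an unbounded repeat contains another unbounded repeat."""
    def walk(sub, inside_unbounded):
        for op, av in sub:
            unbounded = op in REPEATS and av[1] == sre_constants.MAXREPEAT
            if unbounded and inside_unbounded:
                return True
            for child in _children(op, av):
                if walk(child, inside_unbounded or unbounded):
                    return True
        return False

    return walk(sre_parse.parse(pattern), False)


for candidate in (r"(a+)+", r"(?:\w+\s?)*", r"a+b*", r"(a|b)+"):
    print(candidate, has_nested_quantifier(candidate))
```

This is a heuristic, not a proof. It ignores bounded repeats such as (a{1,30})+ that can still hurt, and it will flag some patterns that are harmless in practice, which is why the audit's raw warning count needed trimming by hand.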

Tested 20 giants: requests, flask, django, fastapi, sqlalchemy, pydantic, pytest, numpy, pandas, pillow, scrapy, celery, boto3, httpx, aiohttp, click, rich, typer, black, mypy.

Ninety warnings. Trim tests, deps, build junk — 23 left in live code.

Why Do Regexes Still Haunt Python in 2024?

Regexes are like that ex who promises ‘just one more try.’ Charming at first. Then endless loops.

Take (a+)+ on a non-match. n=10: 0.001s. n=20: 0.884s. n=30: hours. Each extra character roughly doubles the time. Your API endpoint? Dead on arrival.
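You can watch the curve yourself with a few lines. This is a sketch, not the article's benchmark harness: the trailing $ anchor and the "b" suffix are assumptions added here so the match is forced to fail, and the absolute timings depend on your machine.

```python
import re
import time

EVIL = re.compile(r"(a+)+$")  # anchor added so the match has to fail


def time_failure(n):
    """Seconds spent failing to match n 'a's followed by one 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = EVIL.match(text)
    elapsed = time.perf_counter() - start
    if result is not None:
        raise RuntimeError("expected the match to fail")
    return elapsed


# Kept small on purpose: every +2 in n costs roughly 4x the time.
for n in range(10, 19, 2):
    print(f"n={n:2d}  {time_failure(n):.4f}s")
```

Push the upper bound past 25 or so only if you have time to kill.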

Old checkers feed crafted payloads, clock ‘em. Works, but crafting per pattern? Tedious. Auditor’s AST trick sidesteps that. Compile-time detection. Elegant.

But here’s my beef — Python 3.11+ has atomic groups (?>…). One-way doors. No backtrack hell.

BEFORE: (a+)+ — disaster.

AFTER: (?>a+)+ — safe as houses.
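A quick before-and-after, again as a sketch under assumptions: the $ anchor and the failing "b" suffix are added here to force backtracking, and the helper name is made up. The atomic version is guarded because (?>…) only parses on Python 3.11 or newer.

```python
import re
import sys
import time


def seconds_to_fail(pattern, n):
    """Time a failed match of `pattern` against n 'a's plus one 'b'."""
    compiled = re.compile(pattern)
    text = "a" * n + "b"
    start = time.perf_counter()
    result = compiled.match(text)
    elapsed = time.perf_counter() - start
    if result is not None:
        raise RuntimeError("expected the match to fail")
    return elapsed


print("backtracking (a+)+$   n=18:  ", seconds_to_fail(r"(a+)+$", 18))

if sys.version_info >= (3, 11):
    # Older interpreters raise re.error on the (?>...) syntax.
    print("atomic       (?>a+)+$ n=5000:", seconds_to_fail(r"(?>a+)+$", 5000))
```

One catch: the atomic group also drops the capture, so any code reading group(1) needs a second look before you swap it in.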

Why aren’t we shoving this in every linter? Black, mypy — they format, type-check. Where’s the ReDoS cop?

Is Aiohttp’s WebSocket Parser a Ticking Bomb?

Aiohttp flagged big. _WS_EXT_RE in the WebSocket header parser. headers.get() slurps user input. Tool screams: nested quantifiers.

Benchmark: 0.8ms worst-case on CPython 3.12. Yawn. No DoS panic. Maintainer shrugs — inefficient, sure, but secure.

Yet — plot thickens. Past audit nuked _COOKIE_PATTERN (PR #11900). They know. Fixed one, missed this.

Pytest? expression.py line 113: (:?\w|:|…). Typo city. The author meant (?:…), a non-capturing group. What they got is a capturing group that opens with an optional colon. The LIKELY_TYPO detector nails it. (:? is legal syntax, just not what anyone meant.
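Here is the typo in miniature. The patterns are cut-down stand-ins, not pytest's actual regex, and the heuristic at the bottom is a hypothetical one-liner, not the audit's real LIKELY_TYPO implementation.

```python
import re

typo = re.compile(r"(:?\w|:)+")       # "(:?" = capturing group, optional colon
intended = re.compile(r"(?:\w|:)+")   # "(?:" = non-capturing group

# Both accept the same sample input, so tests keep passing...
print(typo.fullmatch("a:b") is not None)      # True
print(intended.fullmatch("a:b") is not None)  # True

# ...but the typo quietly adds a capture group nobody asked for.
print(typo.groups, intended.groups)           # 1 0

# Hypothetical heuristic: an unescaped "(" immediately followed by ":?".
LIKELY_TYPO = re.compile(r"(?<!\\)\(:\?")


def looks_like_typo(pattern):
    """Flag patterns that probably swapped '(?:' for '(:?'."""
    return LIKELY_TYPO.search(pattern) is not None


print(looks_like_typo(r"(:?\w|:)+"))  # True
print(looks_like_typo(r"(?:\w|:)+"))  # False
```

A check this crude will occasionally flag a deliberate optional colon, so treat a hit as a prompt to read the pattern, not as a verdict.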

Twenty-three like this. Scrapy. Celery. Your favorites.

And look, corporate spin? None here. This ain’t a VC-funded scanner sales pitch. Open audit. Raw counts. But prediction time, my hot take: AI code-gen tools like Copilot are spewing regexes like confetti. No AST smarts. The next Cloudflare? Some Copilot-generated ‘secure’ endpoint.

Historical parallel? Stack Overflow, 2016. A whitespace-trimming regex met a pathological post and floored the site. Patterns repeat. We forget.

Will This Fix Your Regex Nightmares?

Short answer: partially.

Atomic groups help. But they’re no silver bullet. Overlapping alternatives, repeats that can match nothing: still tricky. Auditor suggests fixes per pattern. Smart.

Run it yourself. Fork the repo. Scan your codebase. requests clean-ish. But aiohttp? Poke it.
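If you want a feel for the first step of a scan like that, here is a rough sketch that pulls regex literals out of Python source with the standard ast module. It is an assumption about how such a scanner could start, not the audit tool's actual code. The names are invented, and it only sees calls written as re.<function>("literal", ...).

```python
import ast

RE_FUNCS = {
    "compile", "match", "search", "fullmatch",
    "findall", "finditer", "sub", "subn", "split",
}


def regex_literals(source):
    """Yield (lineno, pattern) for re.<func>("literal", ...) calls."""
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "re"
            and node.func.attr in RE_FUNCS
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            yield node.lineno, node.args[0].value


sample = (
    "import re\n"
    'TOKEN = re.compile(r"(a+)+")\n'
    're.search("b*", "bbb")\n'
    'print("not a regex")\n'
)

for lineno, pattern in sorted(regex_literals(sample)):
    print(lineno, pattern)
```

It misses aliased imports, patterns built at runtime and f-strings, so a clean result from a sketch like this proves very little. Feed whatever it does find into a pattern checker and read the hits by hand.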

Python’s re is a backtracking engine. It’s baked in. Switch to Russ Cox’s RE2? Linear time. But no lookarounds, no backreferences. Tradeoffs suck.

Devs: audit now. Linters: add this yesterday.

The Hidden Cost of ‘Good Enough’ Regex

We lean on these libs. Blind trust. One ReDoS in a header parser — attackers craft payloads, spam endpoints. Game over.

Numpy, pandas — data crunchers. Safe there? Audit says mostly. But black, typer — formatters with regex? Edge cases matter.

Dry humor break: if regex were a person, it’d be the friend who ‘borrows’ your server for ‘experiments.’ Hours later, still computing.

Unique insight: this audit’s ‘unexpected’ category? Probably polyquantifiers or some AST quirk. Undocumented gold. Means even experts miss static wins.



Frequently Asked Questions

What is ReDoS in Python regex?

ReDoS: regex Denial of Service. Evil input triggers exponential backtracking. Server hangs.

Which Python libraries have dangerous regex patterns?

Aiohttp (_WS_EXT_RE), pytest (expression.py), plus 21 more across the 20 audited packages. Full list in audit.

How to fix nested quantifiers in Python regex?

Use atomic groups: (?>a+) instead of (a+). Python 3.11+. No backtrack explosion.

Written by Elena Vasquez

Senior editor and generalist covering the biggest stories with a sharp, skeptical eye.


Originally reported by dev.to
