Firecrawl meets rs-trafilatura. Boom.
This duo? It’s the scraping upgrade you’ve been ignoring. Firecrawl handles the heavy lifting — JS rendering, bot evasion, polite rate limits — spitting out Markdown by default. Solid for articles. But toss in raw HTML, pipe it through rs-trafilatura, and suddenly you’re not just grabbing content. You’re dissecting pages with surgical smarts: titles, authors, dates, page types, even a quality score from 0 to 1.
Why bother? Because Firecrawl’s Markdown can bloat up on product pages or forums. Navigation cruft. Sidebars. Irrelevant chatter. rs-trafilatura — a Rust-powered beast wrapped for Python — sniffs out the page type and carves away the noise.
Look, web scraping’s been a crapshoot since the BeautifulSoup days. Remember those regex hellscapes? This is different. rs-trafilatura leans on heuristics and trained models, echoing the shift we saw in search engines around 2010, when Google started scoring page freshness and authority behind the scenes. Here’s my bet: quality scores like this will be table stakes for scraping APIs by 2026. Firecrawl’s smart to pair with it early.
The HTML Trick That Changes Everything
One line seals it: `formats=["html"]`. Skip that, and you’re stuck with Markdown, no fuel for rs-trafilatura’s engine.
```python
from firecrawl import FirecrawlApp
from rs_trafilatura import extract_firecrawl_result  # import path assumed from the package name

app = FirecrawlApp(api_key="fc-YOUR-KEY")

result = app.scrape("https://example.com/blog/post", formats=["html"])
extracted = extract_firecrawl_result(result)
print(f"Title: {extracted.title}")
print(f"Page type: {extracted.page_type}")
print(f"Quality: {extracted.extraction_quality:.2f}")
```
That snippet? Straight from the docs. Dead simple. Install via `pip install rs-trafilatura firecrawl`, snag an API key from firecrawl.dev, and you’re extracting metadata that Markdown alone can’t touch.
But — and here’s the Wired-style deep dive — it’s the `page_type` field that rewires your pipeline. Articles? Clean prose. Products? JSON-LD fallback for specs. Forums? User posts only, no vote buttons or profiles. Service pages? Merges hero, features, testimonials without the sprawl.
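That routing is the whole point, so here’s a minimal sketch of it. Everything below is illustrative: the handler names are mine, not the library’s, and the dispatcher just assumes `page_type` arrives as a plain string on the extraction result (shown here as a dict).

```python
# Hypothetical dispatcher keyed on the page_type field rs-trafilatura reports.
# Handler names and return shapes are illustrative; swap in your own stages.

def handle_article(extracted):
    return {"route": "prose", "title": extracted.get("title")}

def handle_product(extracted):
    return {"route": "specs", "title": extracted.get("title")}

def handle_forum(extracted):
    return {"route": "posts", "title": extracted.get("title")}

HANDLERS = {
    "article": handle_article,
    "product": handle_product,
    "forum": handle_forum,
}

def route(extracted: dict) -> dict:
    # Unknown page types fall back to the article path.
    handler = HANDLERS.get(extracted.get("page_type"), handle_article)
    return handler(extracted)
```

The fallback matters: a new page type from the extractor shouldn’t crash ingestion, just degrade to the generic path.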
Why Does Firecrawl’s Markdown Fall Short?
Firecrawl’s great for blogs. On anything else, it treats the whole page as content.
Product listings drown in filters and “related items.” Forums become UI soup. rs-trafilatura scores it all; the `extraction_quality` field lets you ditch the duds automatically. Say a forum thread scores 0.3? Skip it, log it, move on.
Compare outputs side by side:
- Articles: Firecrawl’s built-in Markdown output holds up fine.
- Product pages: Firecrawl may include navigation, filters, and “related products” sections in its Markdown; rs-trafilatura recognises the page type and extracts just the product description.
Spot the win? Firecrawl grabs everything; rs-trafilatura curates. And with `favor_precision=True`, it gets stingy — perfect for noisy sites. Crank `favor_recall=True` for completeness.
Scale it up. Batch scrape a dozen URLs:
```python
urls = ["https://example.com/products/widget", ...]

batch = app.batch_scrape(urls, formats=["html"])
for doc in batch.data:
    extracted = extract_firecrawl_result(doc)
    print(f"[{extracted.page_type}] {extracted.title} (quality: {extracted.extraction_quality:.2f})")
```
Handles v4 Documents or old dicts smoothly. That’s architectural maturity — no brittle parsers.
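The pattern behind that smoothness is plain duck typing. A sketch of the idea, not the library’s actual code; `FakeDocument` is a stand-in for a v4 `Document`:

```python
def html_of(doc):
    """Pull raw HTML from either a v4 Document-style object or a legacy dict."""
    if isinstance(doc, dict):           # legacy shape: plain dict payload
        return doc.get("html")
    return getattr(doc, "html", None)   # v4 shape: attribute access on a Document

class FakeDocument:
    """Minimal stand-in for a v4 Document carrying an html attribute."""
    html = "<p>hi</p>"
```

One accessor at the boundary, and nothing downstream has to care which SDK version produced the payload.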
What About the Output? Markdown or Bust?
Toggle `output_markdown=True` for GFM-ready content. But the real gold’s in `ExtractResult`: `main_content` (text), `content_markdown`, `images` (with alt/captions), `language`, even `sitename`. It’s structured data on steroids.
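Those fields flatten straight into JSON for whatever sits downstream. A sketch using a stand-in dataclass mirroring the fields named above; the real `ExtractResult` may differ in shape and defaults:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ExtractResultStub:
    # Mirrors the fields named above; the real ExtractResult may differ.
    title: str = ""
    page_type: str = "article"
    main_content: str = ""
    content_markdown: str = ""
    images: list = field(default_factory=list)  # e.g. dicts with src/alt/caption
    language: str = ""
    sitename: str = ""
    extraction_quality: float = 0.0

record = ExtractResultStub(
    title="Widget review",
    main_content="A fine widget.",
    language="en",
    sitename="example.com",
    extraction_quality=0.88,
)
payload = json.dumps(asdict(record))
```

From here it’s one hop to a message queue, a vector store’s metadata field, or a warehouse row.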
Corporate spin check: Firecrawl pitches itself as “the best web scraper.” Fair, but without quality signals, you’re flying blind. rs-trafilatura calls the bluff — and wins. My unique angle? This combo hints at scraping’s next era: not just data dumps, but probabilistic intel pipelines, like ad tech’s click prediction models but for content fidelity.
Forums, docs, listings — rs-trafilatura tags ‘em: “forum,” “documentation,” “listing.” Filter by type upstream. Quality under 0.7? Human review queue. It’s devops for data ingestion.
And images? Extracted with context, not just src tags. No more blind scraping.
Is rs-trafilatura Worth the Extra Step?
Absolutely — if you’re building RAG systems, knowledge bases, or analytics. Firecrawl solo? Quick prototypes. This pair? Production-grade.
Benchmarks back it: Check webcontentextraction.org. rs-trafilatura crushes on precision/recall across page types. Rust core means it’s fast — no Python bottlenecks.
Downsides? API costs stack if you’re hammering Firecrawl. Local fallback? Trafilatura’s got you, but no JS rendering.
So, yeah. Ditch the hype. This is how you scrape like a pro in 2024.
Frequently Asked Questions
How do I use rs-trafilatura with Firecrawl?
`pip install rs-trafilatura firecrawl`, init `FirecrawlApp` with your key, scrape with `formats=["html"]`, then call `extract_firecrawl_result(result)`.
What’s the difference between Firecrawl Markdown and rs-trafilatura?
Firecrawl gives raw-ish Markdown with noise; rs-trafilatura cleans by page type, adds metadata and quality scores.
Does rs-trafilatura work with Firecrawl batch scraping?
Yes — loop over `batch.data` and extract each document. It handles both current and legacy output formats.