Docling CLI: PDF Parsing Tested

2 minutes and 26 seconds. That’s all it took for Docling CLI to chew through the PyTorch Conference brochure — that beast of a PDF with multi-column layouts, rich tables, icons crammed next to text — and spit out 281 lines of clean Markdown.

Pictures? Twenty-five detected, perfectly cataloged. One table? Structure intact, no mangled cells. And yet, flip on --force-ocr, and boom: PyTorch models gobble RAM like it’s free candy, crashing your local setup.

Here’s the thing. Docling isn’t just another PDF scraper. It’s baked into Ramalama’s RAG stack, aiming to make document understanding a CLI away from your messy real-world files. I fired it up on two test cases: that brochure for layout hell, and the Attention Is All You Need paper for math formulas. Why? Because PDFs aren’t uniform. They’re Frankenstein’s monsters of scans, vectors, and typesetting nightmares.

What the Hell Is Docling Doing Under the Hood?

Docling’s pipeline — document understanding models, table extraction, OCR if you poke it — loads heavy hitters from PyTorch. Baseline run? Feather-light on CPU (under 400ms total). But force OCR? Every page becomes an image buffet, spiking memory to exhaustion on even a modest machine.

I tried everything. Closed tabs. Switched to pypdfium2 backend — lighter than default. Still, nada. Ended up in Google Colab, which handled the brochure smoothly. That’s not a CLI win; that’s a cry for cloud crutches.
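For anyone who wants to reproduce the two runs, here is a minimal sketch of how the invocations could be assembled from Python. It assumes the docling executable is on your PATH; the file name brochure.pdf, the output directory, and the docling_cmd helper are hypothetical, and nothing is executed until you hand the list to subprocess.

```python
import shlex


def docling_cmd(source, to="md", output_dir="out", force_ocr=False, pdf_backend=None):
    """Build the argv list for one docling CLI run (nothing is executed here)."""
    cmd = ["docling", source, "--to", to, "--output", output_dir]
    if force_ocr:
        # OCR every page, including pages that already carry a text layer
        cmd.append("--force-ocr")
    if pdf_backend:
        # e.g. "pypdfium2", the lighter backend tried above
        cmd += ["--pdf-backend", pdf_backend]
    return cmd


# Hypothetical input file; swap in your own PDF.
baseline = docling_cmd("brochure.pdf")
heavy = docling_cmd("brochure.pdf", force_ocr=True, pdf_backend="pypdfium2")

print(shlex.join(baseline))
print(shlex.join(heavy))
```

To actually run one, pass the list to subprocess.run(cmd, check=True). Building the command separately makes it easy to try the baseline first and add the memory-hungry flag only on a machine that can take it.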

“Overall output is clean and readable Markdown. Table structure is fully preserved. I did not observe any broken cells or merged rows.”

That quote’s from the experimenter’s notes, and damn if it isn’t spot-on. The Markdown flows like it was born that way — headings nested right, lists intact. Minor quibbles: icons slightly off-kilter from the PDF’s tight design. But who cares? For RAG ingestion, this is gold.

Why Does Docling’s JSON Output Feel Like Peeking at Source Code?

Switch to --to json, and you get the full DoclingDocument dump. 6.5 MB for our brochure, bloated with base64 images and metadata. My quick script unpacked it:

Top-level keys: schema_name, version, name, origin, furniture, body, groups, texts, pictures, tables, key_value_items, form_items, pages.

Texts: 122 blocks. Tables: 1. Pictures: 25. Pages: 7. Perfect match.
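A script along these lines would produce that kind of summary. This is a sketch, not the notebook's actual code: the summarize helper and the file name brochure.json are hypothetical, and the inline document is a tiny stand-in so the snippet runs without Docling installed.

```python
import json


def summarize(doc):
    """Return the top-level keys and item counts of a DoclingDocument-style dict."""
    counts = {
        name: len(doc.get(name, []))
        for name in ("texts", "tables", "pictures", "pages")
    }
    return list(doc.keys()), counts


# Tiny stand-in document. For a real export, load the file instead:
#   with open("brochure.json", encoding="utf-8") as fh:
#       doc = json.load(fh)
doc = json.loads(
    '{"schema_name": "DoclingDocument", "texts": [{}, {}], '
    '"tables": [{}], "pictures": [], "pages": {"1": {}}}'
)

keys, counts = summarize(doc)
print("Top-level keys:", ", ".join(keys))
for name, count in counts.items():
    print(f"{name}: {count}")
```

Because len works on both lists and dicts, the count does not depend on whether a section is stored as a list or keyed by page number.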

This schema — furniture for layout quirks, groups for hierarchy — screams architectural ambition. It’s not dumping text blobs; it’s modeling the document as a tree. Title feeds sections, subsections bloom into promo blocks. For RAG? Embed this JSON, and your vector store knows structure, not just words.

But. File size balloons because of embedded images. Trade-off for portability, sure. Still, if you’re piping to a database, strip that bloat first.
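Stripping that bloat can be done without knowing the schema in detail. The sketch below assumes the embedded images appear as data:image/...;base64 strings somewhere in the JSON; the strip_embedded_images helper and the sample dict are hypothetical illustrations, not part of Docling.

```python
import json


def strip_embedded_images(node, placeholder="[image removed]"):
    """Recursively replace base64 image data URIs with a short placeholder."""
    if isinstance(node, dict):
        return {k: strip_embedded_images(v, placeholder) for k, v in node.items()}
    if isinstance(node, list):
        return [strip_embedded_images(v, placeholder) for v in node]
    if isinstance(node, str) and node.startswith("data:image/") and ";base64," in node:
        return placeholder
    return node


# Hypothetical fragment shaped like an export with one embedded picture.
sample = {
    "pictures": [
        {"image": {"mimetype": "image/png", "uri": "data:image/png;base64,iVBORw0KGgo="}}
    ],
    "texts": [{"text": "Keynote schedule"}],
}

slim = strip_embedded_images(sample)
print(json.dumps(slim))
```

The function returns a new structure and leaves the input alone, so you can keep the full export on disk and send only the slim copy to the database.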

Raw text mode? --to text strips markup, pure content. Quick, dirty, forgettable. Use it for grep jobs, not insight.

Is Docling’s Memory Hunger a Dealbreaker for Devs?

Short answer: Yes, if you’re local-only. That --enrich-formula flag on the Transformer paper? Worked fine: detected math, enriched it. But brochure OCR? Systemic issue. Models for layout, tables, OCR: they stack up.

A 7-page PDF shouldn’t nuke your rig. Reminds me of early computer vision libs in 2015 — GPU hogs before quantization hit. Docling’s on that curve. Prediction: Quantized models incoming, or Docker images with caps. Ramalama’s crew knows; this is alpha vibes.

Corporate spin? None here. This is an open experiment, notebook shared. No hype. Just raw flags: --help reveals the arsenal, --output <dir> for batches.

Unique angle: Think back to Tabula or Camelot for tables: rule-based, brittle on fancy PDFs. Docling? ML shift, like YOLO for docs. Underlying architecture: trained models that detect layout regions and reconstruct table structure, instead of hand-written rules. Why it matters? RAG pipelines were text-only; now they can see layout. Your enterprise PDFs (invoices, brochures) finally yield without custom hacks.

Tested on Colab, baseline Markdown clocked 2:26 wall time. JSON? 2:07. Efficient enough for prod, if memory’s tamed.

Skeptical take: Great for brochures, but scale to 100-page reports? Or scanned handwriting? Jury’s out. OCR force is opt-in for a reason — defaults are smart.

Why Does This Matter for RAG Builders?

RAG’s bottleneck? Crappy chunking from PDFs. Docling fixes that — semantic blocks, not page soups. Plug into LlamaIndex or Haystack, and your retrieval jumps.
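To make "semantic blocks" concrete, here is a rough sketch of heading-based chunking over the texts list from the JSON export. It assumes each text block carries a label and a text field, that headings are labeled title or section_header, and that the list is in reading order; all three are assumptions about the export, and this is not Docling's own chunker. The chunk_by_heading helper and the sample blocks are hypothetical.

```python
def chunk_by_heading(texts):
    """Group text blocks under the most recent heading-like block."""
    chunks = []
    current = {"heading": None, "body": []}
    for item in texts:
        text = item.get("text", "").strip()
        if not text:
            continue
        if item.get("label") in ("title", "section_header"):
            # A new heading closes the chunk collected so far
            if current["heading"] or current["body"]:
                chunks.append(current)
            current = {"heading": text, "body": []}
        else:
            current["body"].append(text)
    if current["heading"] or current["body"]:
        chunks.append(current)
    return chunks


# Hypothetical blocks shaped like entries in the export's texts list.
sample_texts = [
    {"label": "title", "text": "Conference Brochure"},
    {"label": "text", "text": "Two days of talks."},
    {"label": "section_header", "text": "Sponsorship"},
    {"label": "list_item", "text": "Booth space"},
    {"label": "list_item", "text": "Logo placement"},
]

for chunk in chunk_by_heading(sample_texts):
    print(chunk["heading"], "->", " ".join(chunk["body"]))
```

Each chunk keeps its heading alongside the body text, so the embedding for "Booth space" still knows it belongs under "Sponsorship" rather than floating free on a page.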

Downsides? Colab dependency screams ‘not ready for air-gapped servers.’ And that memory spike — call out the PR if they gloss it.

Wandered a bit there. Point is, Docling’s a leap. Not perfect. But in the doc-parsing wars, it’s wielding a sharper blade.

Frequently Asked Questions

What is Docling CLI used for?

Docling CLI parses PDFs into Markdown, JSON, or text — excels at tables, images, layouts for RAG pipelines.

How do I fix Docling’s OCR memory errors?

Run the job on Google Colab or another machine with more memory, and avoid --force-ocr on image-heavy PDFs locally. Switching to the lighter pypdfium2 backend did not fix it in this test.

Does Docling preserve PDF table structure?

Yes, in this test: the one table in the multi-column brochure came through as a full Markdown table, with no broken cells or merged rows.
