Gemma Fine-Tune Fails Real Invoice Test

Validation loss plummeted to 0.024. The Gemma fine-tune looked invincible on synthetic invoices. Then reality struck — one document exposed four deadly flaws.


Key Takeaways

  • Synthetic data creates overly optimistic validation but crumbles on real invoices due to domain gaps.
  • Failures hit aggregates and enums first; it's a data-distribution flaw, not a model flaw.
  • One real document beats hundreds of synthetic examples for calibration; hybrid data pipelines win.

Validation loss: 0.024 after 300 iterations. Textbook success for a Gemma fine-tune on synthetic Indian invoices.

Then? One real invoice from Jon Doe Print. It nailed the total, supplier, address. Still bombed four fields, rendering output useless for finance workflows.

That’s the hook here — not some abstract ML warning, but the gritty how and why synthetic data builds castles on sand.

Why’d This Gemma Setup Seem So Perfect?

Picture this: you’re dodging pricey APIs, chasing privacy, eyeing easy deploys. Fine-tune google/gemma-3n-E2B-it on a fat synthetic dataset targeting 22-field JSON for Indian invoices. MLX-LM framework, 7M trainable params, Mac-friendly at 13GB peak memory. Batch size 1, grad accum 8, LR 5e-5. Smooth sailing.
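For reference, that setup roughly maps to an mlx-lm LoRA YAML config like the sketch below. This assumes the Gemma 3n E2B instruction-tuned checkpoint; the dataset path and LoRA hyperparameters are illustrative assumptions, and key names should be checked against your installed mlx-lm version.

```yaml
# Hedged sketch of an mlx-lm LoRA config matching the run described above.
# Paths and lora_parameters are illustrative assumptions, not the author's exact values.
model: "google/gemma-3n-E2B-it"
train: true
data: "data/invoices"        # hypothetical JSONL dataset directory
batch_size: 1
iters: 300
learning_rate: 5e-5
# The article's gradient accumulation of 8 may need a newer mlx-lm flag
# or a custom training loop; verify against your version.
lora_parameters:
  rank: 8                    # assumption; chosen to land near ~7M trainable params
  dropout: 0.0
  scale: 20.0
```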

Loss curve? Monotonic magic: 0.552 at iter 1, down to 0.084 by 50, 0.024 by 300. You’d ship it. Hell, I’d have.

But.

No massive labeled corpus? Synthetic fills the gap: standardized positions, clean labels, uniform spacing. Large enough to train a pipeline, too uniform to capture real-world chaos.

“That invoice was more useful than another few hundred synthetic examples.”

Spot on. Here’s the original’s punch: synthetic hides the domain gap, pumps fake optimism, erodes trust till reality bites.

It learned invoice structure. Not variance.

What Broke First — And Why It Matters

Jon Doe Print invoice. Model spits plausible JSON: supplier name bang-on, GSTIN format/state correct, address mostly good, invoice number/date/total perfect.

Failure table? Brutal.

Description: “3D Printed Prototype” instead of “3D Printed Prototype (Pre filter)”. Downstream categorization wrecked.

Taxable value: grabs line-item amount, ignores subtotal. Accounts? Screwed.

IGST rate: 0.09 when it’s 0.0. Tax logic downstream? Chaos.

Reverse charge: 0 (number) vs “No” (string). JSON contracts shatter, parsers choke.

Four hits. Not random garbage — patterned failure from synthetic bliss.

Multiple line items? Model snags “number near items” for taxable_value. Synthetics standardize subtotals: fixed spot, label family. Reals? Unit prices, line totals, tax-inclusives, formatting noise everywhere.

No override logic baked in.
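A cheap post-hoc guard for that aggregate failure: recompute the subtotal from the extracted line items and compare it to the extracted taxable value. A minimal sketch, assuming hypothetical field names (`line_items`, `amount`, `taxable_value`) in the extraction JSON:

```python
def check_taxable_value(extraction, tolerance=0.01):
    """Flag extractions where taxable_value doesn't match the sum of
    line-item amounts -- the failure mode where the model grabs a single
    line amount instead of the subtotal. Field names are assumptions."""
    line_sum = sum(item["amount"] for item in extraction["line_items"])
    declared = extraction["taxable_value"]
    ok = abs(line_sum - declared) <= tolerance
    return ok, line_sum, declared
```

A failed check is a signal to route the document to review rather than straight into the books.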

The Tax Mapping Nightmare No One Saw Coming

Supplier and place-of-supply? Same state. Intra-state invoice: CGST/SGST >0, IGST=0.

Model? igst_rate=0.09. Why? Spotted 18% tax print, slotted wrong.

Not math. Field-to-concept hell. Synthetics teach that tax fields exist; they don't teach how to disambiguate them amid layout ambiguity.
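The intra-state rule itself is mechanical enough to assert after extraction. A minimal consistency check, with assumed schema field names (`supplier_state`, `place_of_supply`, `igst_rate`, `cgst_rate`, `sgst_rate`):

```python
def check_gst_split(extraction):
    """Intra-state supply: CGST and SGST apply and IGST must be zero.
    Inter-state supply: IGST applies and CGST/SGST must be zero.
    Field names are assumptions about the extraction schema."""
    intra_state = extraction["supplier_state"] == extraction["place_of_supply"]
    igst = extraction["igst_rate"]
    cgst = extraction["cgst_rate"]
    sgst = extraction["sgst_rate"]
    if intra_state:
        return igst == 0 and cgst > 0 and sgst > 0
    return igst > 0 and cgst == 0 and sgst == 0
```

This check would have caught the `igst_rate=0.09` slip on the spot.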

And reverse charge? Strict enum downstream expects string. Gets int. Boom — brittle rules fail.
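One defensive option is to normalize the enum at the boundary instead of letting parsers choke. A sketch, assuming the downstream contract wants the strings "Yes"/"No"; anything unmappable fails loudly rather than passing a silently wrong value:

```python
def normalize_reverse_charge(value):
    """Coerce model output for a strict enum field to "Yes"/"No" strings.
    Raises on unrecognized values instead of guessing. The target
    vocabulary is an assumption about the downstream contract."""
    # True == 1 and False == 0 in Python, so ints cover booleans too.
    mapping = {"yes": "Yes", "no": "No", 1: "Yes", 0: "No"}
    key = value.strip().lower() if isinstance(value, str) else value
    try:
        return mapping[key]
    except (KeyError, TypeError):
        raise ValueError(f"reverse_charge: unmappable value {value!r}")
```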

This isn’t model weakness. Data distribution sin.

Real invoices don’t play nice.

Is Synthetic Data Useless for Invoice Parsing?

No. But blind faith? Suicide.

Here’s my twist — remember early OCR in the ’90s? Lab scans pristine, real docs with smudges/faxes crushed accuracy. Same vibe: synthetic invoices are your clean-room prototype. One wild specimen stress-tests architecture.

Training curves pre/post that invoice? Instability spikes. Real data flips trust.

Concrete shapes matter: fields break predictably (aggregates before details, enums before numerics). Assumptions? Position over semantics, uniformity over noise.

Lessons? Data-centric. Not swap models — curate distributions.

Gemma performed admirably with only 0.157% of parameters tuned. The framework nailed it. But synthetic variance starved the beast.

Look, companies hype “fully synthetic pipelines” — PR spin. It’s cheaper, sure. Private, yeah. But this exposes the lie: without real anchors, you’re optimizing illusions.

Prediction: open-source invoice datasets explode next year. Hybrid reigns — synthetic bulk, real gold for calibration. Gemma’s fine; your data pipeline isn’t.

And that Mac train? Enviable efficiency. Broader shift: edge ML for docs, ditching cloud calls.

But chase real docs first.

Or regret.

Why Does This Matter for Developers?

You’re building extraction? Don’t trust val curves alone. One real PDF per domain — your canary.

Indian GST invoices? Nightmarish variance: fonts, layouts, bilingual noise. Synthetics standardize; reals defy.

Downstream? Finance workflows demand perfection. Wrong taxable? Audit hell. Bad enum? Pipeline halt.

Shift: augment synthetics with 10-50 reals early. Curves will wobble — that’s signal.
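Mixing those reals in can be as simple as oversampling them so a handful of documents still moves the loss. A sketch, with `real_weight` as an illustrative knob rather than a tuned value:

```python
import random

def mix_datasets(synthetic, real, real_weight=5, seed=0):
    """Blend a large synthetic set with a handful of real examples,
    repeating each real example `real_weight` times so a few real
    documents actually influence training. A sketch, not a recipe."""
    mixed = list(synthetic) + list(real) * real_weight
    random.Random(seed).shuffle(mixed)  # deterministic shuffle for repeatability
    return mixed
```

With 100 synthetic examples, two reals, and `real_weight=5`, you get a 110-example set where real documents make up about 9% of updates instead of under 2%.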

Historical parallel: speech recognition in the 2010s. Clean lab audio aced benchmarks; accents and noise crushed accuracy in the wild. The fix? Diverse real-world data. Same for invoices.

Corporate spin calls it “ready.” Nah. This is the wake-up.



Frequently Asked Questions

What causes domain gap in synthetic invoice data?

Synthetics assume uniform layouts, labels, spacing — reals throw noisy formatting, competing numbers, layout shifts that break field mapping.

Why did the Gemma model output wrong tax rates?

It learned tax fields but not disambiguation; grabbed 18% print and slotted into IGST despite intra-state context.

Can synthetic data alone train reliable invoice parsers?

Nope — great for structure, fails variance. Hybrid with real samples is key for production trust.

Written by Sarah Chen

AI research editor covering LLMs, benchmarks, and the race between frontier labs. Previously at MIT CSAIL.


Originally reported by Dev.to
