Which AI model turns documents into structured data?

Extracting structured data from unstructured text is the most immediately profitable thing a language model can do. The return on investment is plain — a PDF parsed into a row of a spreadsheet is a thing the business can measure — and the failure modes are equally plain. A model that invents one field per hundred documents is a model that corrupts your database. This guide picks the five models we would build an extraction pipeline on today, and the dimensions that decide which one belongs where.

The model that wins here is the one that knows when to stay silent.

Why extraction is the workload models fail on most quietly

Extraction is the workload where mistakes hide longest. The output looks like data — fields, types, neat values — and downstream systems consume it as if it were authored by a deterministic parser. When the model fills a missing field with a plausible guess, no log will flag it. The number turns up in a quarterly report and someone makes a decision on it.

That changes the criteria for picking a model. Schema adherence and refusal-to-invent matter more than raw intelligence. A model that returns null for a missing field is more useful than one that returns a confident-looking guess. A model that follows the JSON shape you described to the letter is more valuable than one that adds a friendly preamble. Some of the most capable frontier models score badly on these axes: they were tuned to be helpful, and inventing a value for a missing field reads as helpful unless you measure for it.

The job is also unusually price-sensitive. In a pipeline parsing a million invoices a month, the whole document goes into the model and only a short record comes out, so every unnecessary token of system prompt or chain-of-thought becomes real money. Models that produce concise, clean structured output earn their place on price alone.

Five constraints define the work: strict schema adherence, bulk throughput economics, long-document context, robustness on noisy input, and cross-language coverage. The right model for batch-parsing receipts in twenty currencies is rarely the right model for parsing one fifty-page contract with five overlapping tables. The stack usually needs both.

One more constraint sits underneath the other five and is easy to forget at the design stage: observability. An extraction pipeline you cannot audit is a pipeline you cannot trust. Every output should be traceable to the input span it came from, every confidence score should be recorded, and every refusal-to-extract should be logged so the next iteration can decide whether the model was right to stay silent or wrong to give up. That telemetry is worth more than any single model upgrade.
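
As a rough illustration of that telemetry, the sketch below records one audit entry per extracted field. The `FieldAudit` record, its field names and the character-offset convention for spans are all assumptions made for this example, not part of any particular pipeline.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional, Tuple


@dataclass
class FieldAudit:
    """One audit entry per field (hypothetical record layout)."""
    doc_id: str
    field: str
    value: Optional[str]              # None means the model declined to extract
    span: Optional[Tuple[int, int]]   # (start, end) character offsets in the source
    confidence: Optional[float]       # whatever score the pipeline records


def audit_line(source_text: str, rec: FieldAudit) -> str:
    """Serialize one record and check that the cited span really holds the value."""
    entry = asdict(rec)
    entry["refused"] = rec.value is None
    if rec.value is not None and rec.span is not None:
        start, end = rec.span
        entry["span_matches"] = source_text[start:end] == rec.value
    else:
        entry["span_matches"] = None
    return json.dumps(entry, sort_keys=True)
```

Writing one such line per field, refusals included, gives the next iteration something concrete to review: a refusal with no span can be checked against the document, and a value whose span does not match is a candidate for an invented field.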

Schema-first extraction beats free-form parsing every time.

The five dimensions that decide which model wins

These are the axes our scorecard weights for any model that ships near an extraction pipeline. The relative weighting shifts with whether you process a few high-value documents or millions of low-value ones — but the floor on all five is non-negotiable.

  1. Schema adherence

    Does the output match the shape you specified?

    The single best predictor of fitness for extraction is how often the model returns valid, schema-conforming JSON without surrounding prose, extra fields or renamed keys. Strict structured-output modes from vendors that support them collapse this problem; models without those modes need a retry loop and a validator.

  2. Refusal to invent

    Will it leave a field empty when the source is silent?

    A missing invoice date that gets a guessed value is a quiet bug that surfaces in the next audit. Test candidates explicitly on documents where required fields are absent — the right model returns null, the wrong one returns its best inference and never tells you.

  3. Long-document context

    Can it pull data from page forty without losing page two?

    Contracts, prospectuses, medical records and legal filings routinely run past a hundred pages with cross-references that span the whole document. The model needs both window size and deep attention across the window; the first without the second is marketing.

  4. Robustness on messy input

    Does it recover gracefully from OCR errors and broken layout?

    Real-world extraction never sees clean text. The input is OCR output from a scanned receipt with a smudge on the date, or HTML from a site with three different table layouts on the same page. The model has to tolerate that noise and still produce clean output without overcorrecting.

  5. Cross-language coverage

    Does it extract from Japanese invoices as well as English ones?

    An extraction model deployed at scale will eventually see every script and convention its customers use. Frontier models advertise broad coverage; quality outside the top six languages varies sharply. Date formats, decimal separators and address conventions all need empirical testing.
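
For models without a strict structured-output mode, the validator and retry loop mentioned under schema adherence can be sketched as follows. This is a minimal standard-library version under stated assumptions: `SCHEMA` is a hypothetical three-field invoice shape, and `call_model` stands in for whatever client function returns the model's raw text.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: field name -> allowed Python types.
# type(None) is allowed everywhere so the model can report "absent in source".
SCHEMA = {
    "invoice_number": (str, type(None)),
    "invoice_date": (str, type(None)),
    "total": (int, float, type(None)),
}


def validate(raw: str) -> Optional[dict]:
    """Return the parsed object if it matches SCHEMA exactly, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # surrounding prose or broken JSON
    if not isinstance(obj, dict) or set(obj) != set(SCHEMA):
        return None  # extra, missing or renamed keys
    for key, allowed in SCHEMA.items():
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(obj[key], bool) or not isinstance(obj[key], allowed):
            return None
    return obj


def extract_with_retry(call_model: Callable[[str], str],
                       document: str,
                       max_attempts: int = 3) -> Optional[dict]:
    """Call the model until the output validates or attempts run out."""
    for _ in range(max_attempts):
        result = validate(call_model(document))
        if result is not None:
            return result
    return None  # caller should route the document to review, not guess
```

Returning `None` after the last attempt, instead of a partially repaired object, keeps the failure visible to whatever handles the review queue.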

Tokonomix top 5 picks for data extraction today

Below is what we would route real production traffic through tomorrow morning. Extraction at any meaningful scale almost always means a two-tier pipeline — a bulk-tier model that gets the well-behaved ninety percent done at near-zero cost, and a heavier model the bulk tier kicks documents up to when its own confidence drops. Picking both from the list below is more useful than picking one perfectly.
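
The two-tier idea can be sketched as a small router. Everything named here is an assumption for illustration: `bulk` and `heavy` are placeholder callables wrapping the two models, each returning extracted fields plus a confidence score, and the 0.8 threshold is arbitrary and would need tuning on your own documents.

```python
from typing import Callable, Optional, Tuple

# (extracted fields or None, confidence in [0, 1]); how confidence is
# produced is left to the pipeline and is an assumption of this sketch.
Extraction = Tuple[Optional[dict], float]


def route(document: str,
          bulk: Callable[[str], Extraction],
          heavy: Callable[[str], Extraction],
          threshold: float = 0.8) -> Tuple[Optional[dict], str]:
    """Try the bulk tier first; escalate when it fails or is unsure."""
    fields, confidence = bulk(document)
    if fields is not None and confidence >= threshold:
        return fields, "bulk"
    fields, _ = heavy(document)
    return fields, "heavy"
```

Returning the tier name alongside the fields makes it easy to log what share of traffic escalates, which is the number that decides what the pipeline costs.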

#1 · Bulk extraction champion · Tier A

Gemini 2.5 Flash

via Google Gemini

The cheapest credible model for high-volume extraction work: invoice line items, form fields, address parsing, log structuring. Sub-second first-token latency keeps bulk jobs moving, and a million-token context means it can take a large document in one shot without chunking.

Input / 1M tokens: $0.30
Output / 1M tokens: $2.50
Context: 1,048,576 tokens

#2 · Structured discipline · Tier A

Claude Haiku 4.5

via Anthropic

Haiku 4.5 produces strikingly clean JSON that adheres to the schema you described, with very few invented fields or stray prose. The right pick when extraction feeds directly into a typed downstream system and any escape from the schema breaks the pipeline.

Input / 1M tokens: $1.00
Output / 1M tokens: $5.00
Context: 200K tokens

#3 · Strict schema mode · Tier C

GPT-4.1 mini

via OpenAI

OpenAI Structured Outputs mode forces the model to conform to a JSON schema you supply, eliminating an entire class of parse errors. GPT-4.1 mini hits that mode at a price low enough to put it on every form-fill, classification or extraction job that does not need premium reasoning.

Input / 1M tokens: $0.40
Output / 1M tokens: $1.60
Context: 1,047,576 tokens

#4 · Messy-document specialist · Tier A

Claude Sonnet 4.6

via Anthropic

When the input is a scanned PDF, an OCR-corrupted spreadsheet or a contract with five overlapping tables, Sonnet 4.6 is the model that figures out what was meant. Costs more per call than the volume-tier picks; pays for itself the first time it parses a document the cheaper models could not.

Input / 1M tokens: $3.00
Output / 1M tokens: $15.00
Context: 1M tokens

#5 · Reasoning over noisy data · Tier C

o4-mini

via OpenAI

A reasoning model that benefits from the extra thinking time on extraction tasks with ambiguity — disambiguating which of three "John Smith" entries matches, deciding whether an unspecified date should be inferred from context. Slower than chat-tier; reserve for the steps that need the judgment.

Input / 1M tokens: $1.10
Output / 1M tokens: $4.40
Context

Input price per million tokens

Extraction is the rare workload where input cost dominates, not output — the whole document gets read, the response is a compact JSON. The chart shows the live list input price for each of the five models above.

Price per 1M input tokens, USD. Source: live provider pricing tracked by Tokonomix.
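
To see how the list prices translate into a monthly bill, the arithmetic can be sketched as below. The prices are the ones quoted in the model cards above; the workload (a million documents a month, 2,000 input tokens and 150 output tokens per document) is an assumption chosen for illustration and will differ for your documents.

```python
# USD per 1M tokens as (input, output), taken from the model cards above.
PRICES = {
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-4.1 mini": (0.40, 1.60),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "o4-mini": (1.10, 4.40),
}


def monthly_cost(model: str, docs: int, in_tokens: int, out_tokens: int):
    """Return (input_cost, output_cost) in USD for one month of traffic."""
    in_price, out_price = PRICES[model]
    input_cost = docs * in_tokens / 1_000_000 * in_price
    output_cost = docs * out_tokens / 1_000_000 * out_price
    return input_cost, output_cost


if __name__ == "__main__":
    # Assumed workload: 1M documents, 2,000 tokens in, 150 tokens out.
    for name in PRICES:
        cost_in, cost_out = monthly_cost(name, 1_000_000, 2_000, 150)
        print(f"{name:18s} input ${cost_in:>8,.2f}  output ${cost_out:>8,.2f}")
```

Under those assumed token counts, Gemini 2.5 Flash comes to about $600 of input and $375 of output a month, and Claude Sonnet 4.6 to about $6,000 and $2,250. The split shifts if the model emits long explanations or reasoning before the JSON, which is one more reason to measure on real traffic.
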
Measure precision and recall, not parse-success.

A field guide: which model for which extraction job

The mapping below is the one we would use to advise an operations team starting fresh. Treat it as a default, not a verdict — a benchmark on a hundred of your own documents will beat any general recommendation.

Pattern A

Invoices, receipts, forms at scale

Clean templates, predictable layout, millions per month. Gemini 2.5 Flash for the bulk, Haiku 4.5 when schema discipline becomes the bottleneck. Both are cheap enough to retry with verification.

Pattern B

Contracts, prospectuses, legal documents

Long, dense, full of cross-references. Sonnet 4.6 for the heavy reading, o4-mini for steps that need explicit reasoning about ambiguous clauses. Always produce structured output with citations to the source page.

Pattern C

Real-time form fill

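User pastes raw text, your UI fills the form. Latency dominates. GPT-4.1 mini with strict schema mode is the safe default: the user can see the answer in under a second, and the structured output is guaranteed to match the schema. That guarantee covers the shape, not the values, so the extracted values still need checking.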
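
A sketch of what the schema side of this pattern might look like follows. It only builds the request fragment and parses a reply; it makes no network call. The form fields are hypothetical, and the `response_format` layout reflects our understanding of OpenAI's Structured Outputs for the Chat Completions API, so check it against the current documentation before relying on it.

```python
import json

# Hypothetical contact-form schema. Strict mode, as we understand it, expects
# every property to be listed in "required" and "additionalProperties" to be
# false; a value that may be absent is expressed as a union with "null".
FORM_SCHEMA = {
    "type": "object",
    "properties": {
        "full_name": {"type": ["string", "null"]},
        "email": {"type": ["string", "null"]},
        "phone": {"type": ["string", "null"]},
    },
    "required": ["full_name", "email", "phone"],
    "additionalProperties": False,
}

# Intended to be passed as the response_format argument of a chat
# completion request (assumption: Chat Completions API, strict mode).
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {"name": "contact_form", "strict": True, "schema": FORM_SCHEMA},
}


def fill_form(raw_reply: str) -> dict:
    """Parse the reply, keeping nulls so the UI leaves those inputs empty."""
    data = json.loads(raw_reply)
    return {key: data[key] for key in FORM_SCHEMA["required"]}
```

Making every field nullable is the design choice that matters here: it gives the model a schema-valid way to say the pasted text did not contain a phone number.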

Pattern D

PII-sensitive or sovereign documents

Medical records, financial filings, citizen-data forms with cross-border restrictions. Self-host an open-weight model on infrastructure you control — see the local & self-hosted guide for hardware fits.

The pipeline is only as good as the schema, the validator and the human spot-checks.

Benchmark on your own documents before you commit

Pick fifty real documents from your own backlog and label them by hand. It is unglamorous work; it pays for itself the first time the production pipeline gets rolled out and you want to know whether the model is better than the regex it replaced. Run every candidate across the same fifty and measure precision and recall against your ground truth.
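
One way to score candidates against the hand labels is field-level precision and recall. The sketch below assumes predictions and ground truth are parallel lists of dicts with `None` for "no value", and uses exact-match comparison; both are simplifying assumptions, and a real harness would normalize dates and amounts first.

```python
def precision_recall(predictions, truths):
    """Field-level precision and recall over parallel lists of dicts.

    A wrong non-null value counts against both scores: it is a value the
    model emitted that was incorrect, and a true value it failed to recover.
    """
    tp = fp = fn = 0
    for pred, truth in zip(predictions, truths):
        for field, expected in truth.items():
            got = pred.get(field)
            if got is not None and got == expected:
                tp += 1
                continue
            if got is not None:
                fp += 1  # wrong or invented value
            if expected is not None:
                fn += 1  # value present in the document but not recovered
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Fields that are correctly left empty count toward neither score, so a model cannot raise its numbers by staying silent everywhere or by guessing everywhere.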

Then look at the failures, not the averages. Where did each model invent a field? Where did it leave one blank that should have been filled? How did each one cope with the scanned page, the second-language document, the rotated table? The model that survives your failure analysis is the model that survives production. Ship that one, regardless of which guide picked which.
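
The failure review can start from a simple tally of what went wrong per field. The bucket names below are our own labels for this sketch, and the inputs follow the same assumption as before: parallel lists of dicts, with `None` meaning no value.

```python
from collections import Counter


def classify(pred, truth):
    """Bucket one field comparison (labels are illustrative)."""
    if pred == truth:
        return "correct"
    if truth is None:
        return "invented"      # model filled a field the source is silent on
    if pred is None:
        return "missed"        # model left blank a field that was present
    return "wrong_value"


def failure_report(predictions, truths):
    """Count outcomes per (field, bucket) across all labeled documents."""
    counts = Counter()
    for pred, truth in zip(predictions, truths):
        for field, expected in truth.items():
            counts[(field, classify(pred.get(field), expected))] += 1
    return counts
```

Sorting the report by the "invented" bucket first puts the quiet failures, the ones no parser error will ever surface, at the top of the review.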
