We ran Claude, GPT-5.5 and Gemini on the same tenders. Here's what we found.
May 30, 2026

The model behind an AI procurement analysis is, for us, a configuration flag. We can run the same bid-vs-tender analysis on Anthropic's Claude, OpenAI's GPT, or Google's Gemini. So we asked the obvious question: if we change the engine, what actually changes — in cost, in speed, and in the quality of the verdict?

To find out, we ran our complete production analysis pipeline (the same one behind the "Run Analysis" button) against two realistic tenders on all three providers, and repeated the larger one four times per provider to separate signal from noise. On the large tender alone that is 24 full bid analyses: three providers, four runs each, two bids per run. Here is what we found.

12 / 12: times the non-compliant bid was correctly rejected (every provider, every run)
~4.8×: cost difference between the most and least expensive provider for the same job
0 / 3: providers that reliably treated a compliant bid fairly

The setup

We built two synthetic tenders with a pre-registered answer key, so quality could be scored objectively rather than by impression.

Each tender had two bids: one compliant vendor that meets the requirements, and one non-compliant vendor with deliberately planted defects — 7 in the small tender, 17 in the large one, including subtle ones like a data-residency clause buried in a paragraph, a "modern API" offered in place of the required standard, and a price total that quietly contradicts its own line items.

One point of fairness: in our pipeline every stage runs on the chosen provider — the main reasoning, the criteria extraction, the per-requirement scoring, and the verification pass. An "OpenAI run" is 100% OpenAI; a "Google run" is 100% Google. Nothing falls back to a different vendor.
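The "no fallback" rule can be pictured as a tiny configuration object. This is only a sketch under assumptions: the `RunConfig` class and the stage identifiers are hypothetical names chosen for illustration (the stages themselves are the four listed above), not the actual pipeline code.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the four stages named in the text.
STAGES = (
    "main_reasoning",
    "criteria_extraction",
    "requirement_scoring",
    "verification",
)


@dataclass(frozen=True)
class RunConfig:
    """Illustrative config: one provider flag drives every stage."""

    provider: str  # e.g. "anthropic", "openai", "google"

    def provider_for(self, stage: str) -> str:
        # Every known stage resolves to the selected provider.
        # There is deliberately no fallback branch to another vendor.
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        return self.provider


config = RunConfig(provider="openai")
assignments = {stage: config.provider_for(stage) for stage in STAGES}
print(assignments)
```

Keeping the provider in a single field, rather than one setting per stage, is what makes a mixed-vendor run impossible to configure by accident.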

Cost: a wide, stable gap

Cost was the most stable result across all four trials, and the spread is large. These are the median costs per provider for the full analysis of the large tender (both bids):

Cost to analyse one tender (both bids)
Large tender, median of 4 runs. Lower is better.
Anthropic $11.31
OpenAI $5.37
Google $2.35

Google was the cheapest in every single run, at roughly one-fifth of Anthropic's cost. Counter-intuitively, it also pushed the most tokens through the pipeline; it just runs them on a far cheaper tier. Cost, at least, is a clean and repeatable win.
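For readers who want to check the multiples, the arithmetic can be reproduced from the three reported medians. The snippet below is a minimal sketch that uses only those published figures; the variable names are ours.

```python
# Median cost in USD to analyse the large tender (both bids), as reported above.
median_cost = {"Anthropic": 11.31, "OpenAI": 5.37, "Google": 2.35}

# Express every provider's cost as a multiple of the cheapest one.
cheapest = min(median_cost, key=median_cost.get)
ratios = {name: cost / median_cost[cheapest] for name, cost in median_cost.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.1f}x the cost of {cheapest}")
```

The Anthropic-to-Google multiple works out to about 4.8, which is the "roughly five times" figure quoted in the text.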

Rejecting a bad bid: everyone passed

Here is the reassuring part. On the non-compliant bid, the verdict was unanimous and perfectly stable: every provider, in every one of the runs, recommended "do not submit." Whatever else varies, the system reliably rejects a bid that should be rejected.

What varied was thoroughness — how much of the evidence behind that verdict each provider actually surfaced. Recall of the 17 planted defects in the large tender:

Defects caught on the bad bid (of 17)
Large tender, across 4 trials per provider.
Anthropic 17 / 17 — every trial
OpenAI 16–17 / 17
Google 13–17 / 17 (usually ~14)

Anthropic was the standout detector — all 17 defects, in all four trials, including the subtle ones. Google, the cheapest by far, trades thoroughness for cost: it typically surfaced about 14 of 17 and repeatedly missed real issues. The bottom-line verdict was still correct, but a procurement officer relying on Google gets a thinner, less defensible evidence pack.
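Scoring recall against a pre-registered answer key comes down to a set intersection. The sketch below assumes each planted defect has an ID; the IDs, the `defect_recall` helper, and the example findings are hypothetical and do not come from our test data.

```python
def defect_recall(answer_key: set[str], reported: set[str]) -> tuple[int, int, float]:
    """Return (caught, planted, recall) for one run.

    Findings that are not in the answer key are ignored here:
    recall only measures how many planted defects were surfaced.
    """
    caught = answer_key & reported
    return len(caught), len(answer_key), len(caught) / len(answer_key)


# Hypothetical IDs for the 17 planted defects in the large tender.
answer_key = {f"D{i:02d}" for i in range(1, 18)}

# Illustrative run: three planted defects missed, one extra finding reported.
reported = (answer_key - {"D03", "D09", "D14"}) | {"X99"}

caught, planted, recall = defect_recall(answer_key, reported)
print(f"{caught} / {planted} defects caught ({recall:.0%} recall)")
```

Because the key is fixed before any model runs, the same function scores every provider and every trial identically.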

Approving a good bid: nobody was reliable

This is the finding a single run would have hidden. On the compliant bid, the right answer is "shortlist it, with minor improvements." Across four trials, here is how often each provider actually did that — versus over-penalising a perfectly qualified vendor:

Verdict on the compliant bid, over 4 trials
Large tender. The correct verdict is to shortlist; the other outcomes are asking for major revision and wrongly rejecting a qualified vendor.
Google 2 / 4 correct (2 wrongly rejected)
Anthropic 1 / 4 correct (3 asked for major revision)
OpenAI 0 / 4 correct (3 wrongly rejected)

No provider reliably got the good bid right. OpenAI was the most aggressive — it recommended rejecting a fully qualified vendor in three of four runs. Anthropic never rejected outright, but demanded "significant revision" three times in four. Google was the least-bad, correct half the time, but still rejected the compliant bid twice. A single early run had suggested Google was perfectly calibrated here; three more runs erased that impression.

In fairness, part of this is our test bid: it references its certificates and bid bond as attachments we did not physically include, so a model flagging "bid bond missing" was, strictly, correct about the document in front of it. A real submission with attachments would soften these verdicts. But the instability is the real lesson: on a large, evidence-dense bid, the verdict on a good vendor is not stable enough to trust to any single model run.
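One way to make that instability visible is to tally verdicts per provider across repeated runs and flag any provider whose runs disagree. The sketch below is illustrative: the `summarise` helper and the verdict labels are our own names, the run order is arbitrary, and OpenAI is left out because the text gives its three rejections but not its fourth verdict.

```python
from collections import Counter

CORRECT = "shortlist"

# Verdicts on the compliant bid as described in the text; run order is illustrative.
verdicts = {
    "Google": ["shortlist", "shortlist", "reject", "reject"],
    "Anthropic": ["shortlist", "major_revision", "major_revision", "major_revision"],
}


def summarise(runs: list[str]) -> dict:
    """Count correct verdicts and report whether all runs agreed."""
    counts = Counter(runs)
    return {
        "correct": counts[CORRECT],
        "total": len(runs),
        "stable": len(counts) == 1,  # True only if every run gave the same verdict
    }


summary = {provider: summarise(runs) for provider, runs in verdicts.items()}
for provider, s in summary.items():
    print(f"{provider}: {s['correct']} / {s['total']} correct, stable={s['stable']}")
```

A single run can never set `stable` to `False`, which is why one-run comparisons hide this kind of result.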

Trust the rejection, verify the approval

Across 24 analyses, every provider rejected the bad bid every time — but no provider reliably approved the good one. The safe operating rule: automate the "flag and reject clearly non-compliant bids" path, and keep a human on every "this looks good" verdict.
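As a rough sketch, that operating rule could be expressed as a routing function. Everything here is an assumption made for illustration: the `Verdict` enum, the `route` function, and especially the `min_verified_findings` threshold used to stand in for "clearly non-compliant".

```python
from enum import Enum


class Verdict(Enum):
    SHORTLIST = "shortlist"
    MAJOR_REVISION = "major_revision"
    REJECT = "reject"


def route(verdict: Verdict, verified_critical_findings: int,
          min_verified_findings: int = 1) -> str:
    """Automate only the clear-rejection path; send everything else to a person.

    `min_verified_findings` is a hypothetical threshold for what counts as
    clearly non-compliant; it is not a value taken from our pipeline.
    """
    if (verdict is Verdict.REJECT
            and verified_critical_findings >= min_verified_findings):
        return "auto_flag_reject"
    # Positive or borderline verdicts, and rejections without verified
    # evidence, always go to a human reviewer.
    return "human_review"


print(route(Verdict.REJECT, verified_critical_findings=5))
print(route(Verdict.SHORTLIST, verified_critical_findings=0))
```

The asymmetry is intentional: no input can produce an automatic approval, only an automatic flag.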

There is no single winner — there's a trade-off

Each provider won on one axis and lost on another:

Anthropic (Claude Opus 4.7): the most thorough and most consistent. It caught every defect in every trial and extracted an identical requirement set each time. The cost: roughly 5× Google's price, the slowest runtime, and a tendency to be over-strict with a good bid.

OpenAI (GPT-5.5): the fastest, with clean, well-itemised findings and reliable detection. The serious catch: it recommended rejecting a compliant vendor in three of four runs, the worst false-positive profile of the three. Never wire it to an automatic reject.

Google (Gemini 3.1 Pro): by far the cheapest and competitive on speed. The catch: the least thorough (it typically surfaced about 14 of the 17 planted defects, and as few as 13) and the least consistent run-to-run. Excellent for high-volume first-pass triage, not as a sole reviewer of a high-value tender.

On a single run, one model looked like the clear winner. Four runs erased that. Judging an AI on one run is how you publish the wrong conclusion.

What this means if you use AI for tenders

Two lessons, both only visible once we went big and repeated. First: the dependable half of the job is rejection, not approval — which is exactly why a human stays on every positive verdict. Second: provider choice is a deliberate trade-off between cost, thoroughness, and consistency, not a search for one universal best model. The cheapest option is a superb triage engine; the most thorough one is worth its price when a missed defect is expensive; and any of them, on a single pass, can misjudge a strong bid.

None of which removes the person signing the recommendation. It sharpens what they do: the AI reads every page of every document and never tires on page 200, and the specialist checks the evidence and keeps the decision. That is the same conclusion banking, law, and medicine reached — and our own numbers point straight at it.

How we ran it

  1. Two synthetic tenders with a pre-registered defect key (7 defects in the small, 17 in the large), each with a compliant and a non-compliant bid.
  2. The full production pipeline per run: domain detection → criteria & lot extraction → per-requirement coverage scoring → main reasoning agent → verification of every critical finding.
  3. Every model tier (main, verification, extraction) ran on the selected provider — no cross-provider fallback.
  4. Small tender: one run per provider. Large tender: four runs per provider (24 analyses), reporting medians and ranges.
  5. Limitations: synthetic English documents in one domain family; the compliant bid references attachments not physically included, which inflates "missing evidence" findings. Directional, not a universal ranking.