Claude Opus 4.8 is now live for tender analysis. Here's what changed.

The engine behind a MitigateTenders analysis is set by a configuration flag, and today we moved it forward: Claude Opus 4.8 is now the default model for bid-vs-tender analysis in the app. Every analysis you run from here on uses it for the reasoning, the criteria and lot extraction, and the per-requirement scoring.

Before flipping the switch we did the obvious thing: ran our complete production pipeline — the same one behind the "Run Analysis" button — on the same tender, once on Opus 4.7 and once on Opus 4.8, with everything else held identical. Here is what actually changed, and what didn't.

Live now: Opus 4.8 is the default analysis engine for every new run.

17 / 17: planted defects caught by both models, including all six subtle ones.

≈ same cost: $10.63 on 4.8 vs $10.26 on 4.7 for the full analysis, within a few percent.

The test

We used a realistic tender with a pre-registered answer key, so the comparison could be scored against truth rather than impression: a EUR 12M regional healthcare digital-infrastructure programme — four lots, roughly twenty evaluation and pass/fail requirements with hard numeric thresholds (uptime, turnover, insurance, certifications, data-residency, interoperability). It came with two bids: one compliant vendor that meets the requirements, and one non-compliant vendor carrying 17 deliberately planted defects — including six subtle ones, like a data-residency clause buried mid-paragraph, a "modern API" offered instead of the required standard, and a price total that quietly contradicts its own line items.
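Scoring against a pre-registered key comes down to a set comparison between the defects planted in advance and the defects a run reports. The sketch below is a minimal illustration of that idea, not the production scorer: the `PlantedDefect` type, the `D01`-style identifiers, and the `score_recall` helper are all hypothetical, and only the shape of the key (17 defects, 6 of them subtle) comes from the test described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlantedDefect:
    defect_id: str  # hypothetical identifier from the answer key
    subtle: bool    # True for the hard-to-spot defects


def score_recall(answer_key, reported_ids):
    """Return (caught, total, subtle_caught, subtle_total) for one run."""
    reported = set(reported_ids)
    caught = [d for d in answer_key if d.defect_id in reported]
    subtle_total = sum(1 for d in answer_key if d.subtle)
    subtle_caught = sum(1 for d in caught if d.subtle)
    return len(caught), len(answer_key), subtle_caught, subtle_total


# Illustrative key with the same shape as the test: 17 defects, 6 subtle.
answer_key = [PlantedDefect(f"D{i:02d}", subtle=(i <= 6)) for i in range(1, 18)]

# A run that reports every planted defect plus one unrelated finding.
reported = [d.defect_id for d in answer_key] + ["extra-finding"]
print(score_recall(answer_key, reported))  # (17, 17, 6, 6)
```

Because the key is fixed before either model runs, extra findings outside the key do not inflate the score, and a missed subtle defect shows up directly in the third number.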

One point of fairness: every reasoning stage ran on the model under test, at high reasoning effort in both cases. A "4.8 run" extracts the criteria, scores each requirement, and writes the verdict entirely on 4.8; nothing falls back to the older model.

What stayed the same: the verdicts

The decisions that matter were identical, and identically correct. On the non-compliant bid, both models returned "do not submit." On the compliant bid, both returned "submit with improvements" — correctly shortlisting a qualified vendor rather than over-penalising it. And both caught all 17 planted defects, including the difficult six: the missing ISO 13485, the non-EU data residency, the "modern API" offered in place of the required FHIR standard, the forbidden performance extrapolation, the buried arithmetic error, and the internal timeline contradiction.

Cost held too. Running the full analysis of both bids cost $10.26 on 4.7 and $10.63 on 4.8 — within a few percent. The upgrade is not a price increase.
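To make "within a few percent" concrete, here is the arithmetic on the two reported totals. The `relative_increase` helper is only an illustrative name; the two dollar figures are the ones quoted above.

```python
def relative_increase(old_cost: float, new_cost: float) -> float:
    """Percentage change from old_cost to new_cost."""
    return (new_cost - old_cost) / old_cost * 100.0


cost_47 = 10.26  # full analysis on Opus 4.7, in USD
cost_48 = 10.63  # full analysis on Opus 4.8, in USD

diff = cost_48 - cost_47
pct = relative_increase(cost_47, cost_48)
print(f"+${diff:.2f} (+{pct:.1f}%)")  # prints: +$0.37 (+3.6%)
```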

Figure: cost to analyse the full tender (both bids), one full pipeline run per model on the large healthcare tender; lower is better. Opus 4.7: $10.26. Opus 4.8: $10.63.

What changed: a sharper, better-organised read

If both models reach the same verdict and catch the same defects, what does the newer one actually buy you? A cleaner, more useful evidence pack behind that verdict.

It understood the domain better. Before any scoring, the pipeline classifies the tender's domain to route the right specialist checks. 4.7 labelled this healthcare programme generically as "IT." 4.8 correctly classified it as medical — which points the downstream checks at the right regulatory frame (MDR, ISO 13485, clinical data residency) rather than a generic software lens.
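One way to picture why the domain label matters is a routing table keyed by domain. This is a sketch under stated assumptions rather than the app's actual routing: the `SPECIALIST_CHECKS` table and `route_checks` function are hypothetical, the "medical" entries are the three frames named above, and the "it" entry is a placeholder.

```python
# Hypothetical routing table. Only the "medical" entries come from the post;
# the "it" entry is a placeholder for a generic software lens.
SPECIALIST_CHECKS = {
    "medical": ["MDR", "ISO 13485", "clinical data residency"],
    "it": ["generic software checks"],
}


def route_checks(domain: str) -> list[str]:
    """Pick the specialist checks for a detected domain, defaulting to 'it'."""
    return SPECIALIST_CHECKS.get(domain.lower(), SPECIALIST_CHECKS["it"])


print(route_checks("medical"))  # ['MDR', 'ISO 13485', 'clinical data residency']
print(route_checks("IT"))       # ['generic software checks']
```

Under this kind of design, a generic "IT" label would never trigger the regulatory checks at all, which is why the classification step comes before any scoring.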

Its findings are tidier and higher-signal. On the bad bid, 4.7 itemised 13 separate critical findings: thorough, but repetitive. 4.8 consolidated the mandatory misses into eight clearly named gates plus one clean roll-up of the threshold failures (turnover, indemnity, validity, SLA, RPO/RTO, bond, support), and kept the full per-requirement grid for the detail. Same rigour, far less noise to read through.

It promotes the subtle stuff. 4.8 elevated the two hardest-to-spot defects to headline criticals of their own: the EUR 400,000 arithmetic error where the line items sum to 9.3M but the stated total is 8.9M, and the internal timeline contradiction (a go-live date stated as month 14 in one section and month 20 in another). On the compliant bid, where 4.7 spent several findings on an internal coverage-reconciliation artifact, 4.8 named the one gap that actually matters — a bid bond referenced but not evidenced in the submission — and flagged the missing appendices plainly.
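The arithmetic defect is the kind of check that is easy to state mechanically: add up the line items and compare the result with the stated total. In the sketch below, the per-lot breakdown is invented for illustration (the post gives only the 9.3M sum and the 8.9M stated total), and `check_price_total` is a hypothetical helper, not part of the product.

```python
def check_price_total(line_items_eur, stated_total_eur):
    """Return (computed_total, gap), where gap = computed - stated."""
    computed = sum(line_items_eur.values())
    return computed, computed - stated_total_eur


# Hypothetical per-lot amounts chosen to sum to EUR 9.3M.
line_items = {
    "Lot 1": 3_100_000,
    "Lot 2": 2_400_000,
    "Lot 3": 2_000_000,
    "Lot 4": 1_800_000,
}
stated_total = 8_900_000  # the total the bid states

computed, gap = check_price_total(line_items, stated_total)
print(f"computed EUR {computed:,}, stated EUR {stated_total:,}, gap EUR {gap:,}")
# prints: computed EUR 9,300,000, stated EUR 8,900,000, gap EUR 400,000
```

Amounts are kept as whole euros in integers so the comparison is exact; any non-zero gap is worth surfacing as a finding of its own.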

All of this at comparable speed: across the two bids, 4.8 was, if anything, marginally quicker overall.

Same answer, cleaner reasoning

On this tender, Opus 4.8 didn't change a single verdict or miss a single defect that 4.7 caught. What it changed was the quality of the read: the right domain frame, fewer redundant findings, and the subtle errors pulled to the surface. For a procurement officer, that means less to wade through and a more defensible evidence trail — at the same cost.

One honest caveat

This is a single full-pipeline run per model. We have written before about how one run can flatter or mislead a model, and we hold ourselves to that here: treat the exact counts as directional, not definitive. What we trust more than any single number is that the verdicts and the defect recall were identical, while the organisation of the analysis was cleaner on 4.8 across both bids: the right domain, tighter findings, the subtle errors surfaced. That is the kind of improvement we expect to hold up across runs, and it is why we were comfortable making it the default.

What this means for you

If you run an analysis in MitigateTenders today, you are already on Opus 4.8 — no setting to change, no price difference. You get the same dependable rejection of clearly non-compliant bids and the same fair treatment of strong ones, with a sharper read of the domain and a cleaner, better-organised set of findings underneath. As always, the model reads every page of every document and never tires on page 200; the specialist checks the evidence and keeps the decision. 4.8 just hands them a sharper draft to work from.

How we ran it

One realistic EUR 12M healthcare tender with a pre-registered defect key (17 planted defects, 6 of them subtle), a compliant bid and a non-compliant bid.
The full production pipeline per run: domain detection → criteria & lot extraction → per-requirement coverage scoring → main reasoning agent → verification of every critical finding.
Every reasoning stage ran on the model under test (Opus 4.7 or Opus 4.8) at high reasoning effort; supporting stages were held constant across both for a fair comparison.
One full pipeline run per model (both bids). Single-run results — directional, not a definitive ranking.
Cost is the all-in dollar cost of the complete analysis (prep + both bids), priced on each model's published rate.
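
The setup above can be sketched as an ordered list of stages, each pinned to the model under test. This is an illustration only: the stage names paraphrase the list above, `RunConfig` and `plan_run` are hypothetical, the model labels are placeholders rather than real API identifiers, and treating all five listed stages as reasoning stages is an assumption.

```python
from dataclasses import dataclass

# Stage names paraphrased from the pipeline description above.
PIPELINE_STAGES = (
    "domain_detection",
    "criteria_and_lot_extraction",
    "per_requirement_coverage_scoring",
    "main_reasoning_agent",
    "critical_finding_verification",
)


@dataclass(frozen=True)
class RunConfig:
    model: str                      # placeholder label, e.g. "opus-4.8"
    reasoning_effort: str = "high"  # both runs used high effort


def plan_run(config):
    """Pair every stage with the model under test, with no fallback model."""
    return [(stage, config.model, config.reasoning_effort)
            for stage in PIPELINE_STAGES]


for label in ("opus-4.7", "opus-4.8"):
    plan = plan_run(RunConfig(model=label))
    # Every stage in a run uses the same single model.
    assert {model for _, model, _ in plan} == {label}
    print(label, [stage for stage, _, _ in plan])
```

Keeping the model as the only field that differs between the two configurations is what makes the comparison like-for-like.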

Run your next tender analysis on Claude Opus 4.8.
Your first AI analysis is free.
