Why this benchmark exists and what it tests
AI-assisted blood test interpretation is increasingly used in consumer and clinical workflows, yet reproducible evaluation frameworks tailored to laboratory medicine remain uncommon. The questions that matter most in this setting are not the ones covered by general medical question-answering benchmarks: can an engine separate iron deficiency from thalassaemia trait when the mean corpuscular volume is identical, does it over-diagnose Gilbert's syndrome as hepatitis, and does it manufacture pathology in a fully normal screening panel?
A single blood test panel typically contains enough signal to support several competing interpretations, and the job of the interpreting clinician is to weigh those interpretations against each other rather than to retrieve a textbook answer. An engine that does well on textbook cases can still fail on the cases that matter most: the differential-diagnosis pitfalls, the benign variants that look alarming in isolation, and the fully normal panels that tempt confident assistants into manufacturing pathology.
This benchmark was built around exactly those failure modes. Each of the fifteen cases was selected for a specific diagnostic property: an iron-deficient microcytosis that must be kept distinct from a beta-thalassaemia trait with identical mean corpuscular volume, a Gilbert's syndrome presentation where the only abnormality is an isolated indirect hyperbilirubinaemia, and a fifteen-parameter screening panel in which every analyte sits inside its reference range. The rubric rewards engines that read each case on its own terms and penalises engines that reach for a confident diagnosis where no such diagnosis is warranted.
As Thomas Klein, MD, I selected the case panel because these are the patterns I see laboratory-medicine assistants get wrong most often. The expensive failure mode is not "missing a rare disease" — it is fabricating routine pathology in patients who do not have it. Our medical validation hub describes the broader framework; this page describes its applied result on the V11 engine.
Latest reference run — V11 (April 2026)
The April 2026 reference run of the Kantesti AI Engine V11 produced a composite score of 99.12% on the pre-registered fifteen-case rubric. Both hyperdiagnosis trap cases scored at the ceiling. The Mentzer index was applied correctly on the iron-deficiency-versus-thalassaemia differential.
The composite formula combines three components: structural conformance with the seven mandatory report sections and sixteen mandatory subsections, clinical accuracy measured as keyword recall plus scoring-system recall plus a probability-distribution validity check, and response latency against the 20-second primary-path service-level target. The exact decomposition is shown in the rubric formula below.
The remaining 0.88 percentage points of headroom decompose almost entirely into latency loss — three Phase 2 fallback invocations, each forfeiting the 0.05-point latency bonus, together accounted for about 0.60 of the 0.88-point deficit — rather than into clinical content. The engine did not miss a correct diagnosis on any of the fifteen cases; where it fell short, it did so by taking slightly longer than the 20-second primary-path target on a small minority of invocations.
Fifteen cases across seven medical specialties
The case panel covers seven specialties — hematology, endocrinology, metabolic medicine, hepatology, nephrology, cardiology, rheumatology — plus two dedicated hyperdiagnosis trap cases. Each case is an anonymised real patient record drawn from the Kantesti clinical data repository under written informed consent.
De-identification was performed under the Safe Harbor approach: all direct identifiers were removed or replaced, and each record was assigned a benchmark-internal case code in the format BT-NNN-LABEL. Processing was carried out in accordance with GDPR Article 9(2)(j) for scientific research with appropriate safeguards, and the equivalent UK GDPR provisions. No personally identifying information appears anywhere in the published harness, the technical report, or the released datasets.
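The BT-NNN-LABEL case-code format is simple enough to validate mechanically. The sketch below is our illustration, not harness code; the assumption that LABEL is an uppercase word is ours, inferred from the published codes such as BT-014-GILBERT and BT-015-HEALTHY.

```python
import re

# Benchmark-internal case-code format BT-NNN-LABEL, e.g. "BT-014-GILBERT".
# The uppercase-letters-only constraint on LABEL is our assumption.
CASE_CODE = re.compile(r"^BT-\d{3}-[A-Z]+$")

def is_valid_case_code(code: str) -> bool:
    """Return True if the string matches the BT-NNN-LABEL pattern."""
    return CASE_CODE.fullmatch(code) is not None
```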
Why this particular distribution
Hematology gets three cases because microcytic differentials and macrocytic differentials are the highest-volume traps in real-world laboratory practice. Endocrinology gets three because the Hashimoto's, PCOS, and vitamin D deficiency presentations exercise different diagnostic shapes (autoantibody-driven, hormone-ratio-driven, single-marker-driven). The single-case specialties are still meaningful because each of CKD, ASCVD risk, and SLE has its own scoring system that the engine should invoke (KDIGO staging, ASCVD 10-year risk, 2019 EULAR/ACR SLE criteria respectively).
The pre-registered rubric, explained
Pre-registration is the single most important methodological choice in this benchmark. Every expected diagnosis, every clinical scoring system, and every report section was committed to source code before the engine was invoked. Post-hoc tuning of the rubric to flatter the engine is therefore impossible.
Three components make up the composite score. The structural component contributes 35 percent and measures whether the engine returned the seven mandatory report sections (header, summary, key findings, differential, scoring systems, recommendations, follow-up) and the sixteen mandatory subsections within them. Section presence carries 40 percent and subsection presence 60 percent of the structural calculation.
The clinical component contributes 55 percent and combines three things: diagnosis-keyword recall (70 percent of the clinical sub-score), scoring-system recall (20 percent — does the engine compute Mentzer, FIB-4, HOMA-IR, ASCVD risk, KDIGO staging, and EULAR/ACR criteria where relevant), and a probability-sum validity check (10 percent — the differential probabilities should sum to a value in the [90, 110] percent interval). For trap cases, an explicit hyperdiagnosis penalty of up to 0.30 is subtracted, calculated as 0.10 per fabricated pathology flag and capped at three flags.
The latency component contributes 10 percent. A response under 20 seconds earns the full 0.10, a response under 40 seconds earns 0.05, and anything slower earns zero. The 20-second target reflects the production primary-path service-level objective; the 40-second ceiling reflects the Phase 2 fallback budget for heavy-engine invocations.
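Under those definitions, the composite can be reconstructed as a short function. This is an illustrative sketch of the scoring arithmetic, not the published harness code: the function name, signature, and clamping details are our assumptions, while the weights, thresholds, and penalty cap come from the rubric description above.

```python
def composite_score(sections, subsections, keyword_recall, scoring_recall,
                    prob_sum, latency_s, fabricated_flags=0):
    """Sketch of the pre-registered composite: 35% structural,
    55% clinical, 10% latency. Inputs: counts of sections (of 7)
    and subsections (of 16), recall fractions in [0, 1], the
    differential probability sum in percent, latency in seconds,
    and the number of fabricated-pathology flags (trap cases)."""
    # Structural: 40/60 split between section and subsection presence.
    structural = 0.40 * (sections / 7) + 0.60 * (subsections / 16)

    # Clinical: 70/20/10 split, minus the hyperdiagnosis penalty
    # of 0.10 per fabricated flag, capped at three flags (0.30).
    prob_valid = 1.0 if 90 <= prob_sum <= 110 else 0.0
    clinical = 0.70 * keyword_recall + 0.20 * scoring_recall + 0.10 * prob_valid
    clinical = max(clinical - min(0.10 * fabricated_flags, 0.30), 0.0)

    # Latency: full 0.10 under 20 s, 0.05 under 40 s, else zero.
    latency = 0.10 if latency_s < 20 else (0.05 if latency_s < 40 else 0.0)

    return 0.35 * structural + 0.55 * clinical + latency
```

A fully correct primary-path response (7/7 sections, 16/16 subsections, perfect recall, valid probability sum, under 20 seconds) scores exactly 1.000 under this reconstruction.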
What pre-registration prevents
First-party benchmarks are notorious for inflating their own numbers through post-hoc rubric tuning. The pattern is almost always the same: the team runs the engine, sees where it underperforms, then quietly adjusts the rubric so the underperforming areas count for less. By committing the rubric to source code before the first engine call and publishing the harness under MIT licence, that adjustment becomes visible in version control. Anyone can clone the repository, check the rubric author dates, and verify the engine results were not used to shape the scoring.
Hyperdiagnosis trap cases — why over-calling is the real failure mode
Aggressive over-calling of pathology on normal screens is a documented failure mode of consumer-facing medical assistants. Its downstream costs include unnecessary investigation, patient anxiety, and iatrogenic workup. The two trap cases in this benchmark are designed to make that failure mode visible and scoreable.
🟡 Trap 1 — BT-014-GILBERT
Presentation. A 24-year-old male with a total bilirubin of 2.4 mg/dL. The direct fraction is normal, transaminases and alkaline phosphatase sit inside their reference ranges, reticulocytes are unremarkable, and haptoglobin and LDH rule out haemolysis.
Correct interpretation. Gilbert's syndrome — a benign UGT1A1 polymorphism. The interpretation should not invoke hepatitis, cirrhosis, haemolytic anaemia, or biliary obstruction.
V11 result. Composite 1.000. None of the six monitored over-diagnosis flags appeared as active diagnoses.
🟡 Trap 2 — BT-015-HEALTHY
Presentation. A 35-year-old female with a fifteen-parameter routine screening panel. Every analyte sits comfortably inside its reference range.
Correct interpretation. Reassurance and lifestyle maintenance. The interpretation should not manufacture borderline pathology to sound clinically useful.
V11 result. Composite 1.000. None of the seven monitored over-diagnosis flags — diabetes, anaemia, hypothyroidism, dyslipidaemia, hepatitis, kidney disease, deficiency — appeared as active diagnoses.
Across both traps, thirteen monitored hyperdiagnosis flags were checked. Zero were triggered. This is the result that matters most for any clinician considering using an AI engine as a triage or pre-consultation tool: the system did not invent disease where none existed.
Mentzer index: separating iron deficiency from thalassaemia trait
A second high-value finding concerns the pairing of case BT-001 (iron deficiency anaemia) with case BT-007 (beta-thalassaemia minor). Both present with microcytosis and are a well-known stumbling block for naive classifiers. The Mentzer index, calculated as MCV divided by RBC count, exceeds 13 in iron deficiency and falls below 13 in thalassaemia trait.
In BT-001, the patient was a 34-year-old female with hemoglobin 10.4 g/dL, MCV 72.4 fL, RBC 4.1 × 10¹²/L, ferritin 6 ng/mL, and elevated TIBC. The Mentzer index of approximately 17.7 supports absolute iron deficiency. In BT-007, the patient was a 28-year-old male with microcytosis (MCV 65.8 fL) but a high RBC count of 6.2 × 10¹²/L, a normal RDW, normal ferritin, and HbA2 of 5.6 percent. The Mentzer index of approximately 10.6 points to thalassaemia trait, and the elevated HbA2 confirms beta-thalassaemia minor.
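The arithmetic behind both index values is easy to verify directly. The helper below is our sketch (the function name is illustrative); the case values and the 13 threshold come from the text above.

```python
def mentzer_index(mcv_fl: float, rbc_count: float) -> float:
    """Mentzer index = MCV (fL) / RBC count (x 10^12/L).
    A value above 13 suggests iron deficiency; below 13
    suggests thalassaemia trait."""
    return mcv_fl / rbc_count

# Values from the two benchmark cases discussed above.
bt001 = mentzer_index(72.4, 4.1)   # approx. 17.7 -> iron deficiency
bt007 = mentzer_index(65.8, 6.2)   # approx. 10.6 -> thalassaemia trait
```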
Both cases scored 1.000. The engine invoked the Mentzer index explicitly in both interpretations and returned the correct diagnosis in each instance. This is the single most clinically reassuring result in the entire benchmark, because misclassifying thalassaemia trait as iron deficiency leads to inappropriate iron supplementation and missed family-screening opportunities, and misclassifying iron deficiency as thalassaemia delays straightforward replacement therapy. Our ferritin range guide explains the broader differential context.
Per-case results from the April 2026 run
Twelve of fifteen cases achieved the ceiling composite score of 1.000 on the primary path. Three cases were served via the Phase 2 fallback, losing the 0.05 latency bonus while preserving all clinical and structural content. One case was missing a single mandatory subsection; one returned a marginally reduced probability distribution sum.
The PCOS case (BT-008) lost a single mandatory subsection in the response structure — fifteen of sixteen instead of sixteen of sixteen — which shaved structural score from 1.000 to 0.963. The SLE case (BT-011) returned a marginally reduced probability-distribution sum that dropped clinical score to 0.965 while preserving every diagnostic keyword and scoring system. Neither sub-perfect case missed a correct diagnosis.
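The 0.963 structural figure for BT-008 is consistent with the 40/60 section/subsection split described in the rubric, assuming all seven sections were present; a quick check:

```python
# Structural sub-score for BT-008 (PCOS): 7 of 7 sections,
# 15 of 16 subsections, combined with the rubric's 40/60 split.
structural = 0.40 * (7 / 7) + 0.60 * (15 / 16)
# structural is approx. 0.9625, reported as 0.963 in the scorecard.
```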
What the headline score does not tell us
A composite score of 99.12 percent under this particular pre-registered rubric represents near-ceiling performance, but it deserves careful framing. The result describes the engine's behaviour against fifteen carefully selected anonymised cases, evaluated once each, against a single rubric. We are explicit about what the number does and does not establish.
The score says the V11 engine handled the diagnostic patterns selected for this evaluation correctly, on a methodology that is published and reproducible. It does not say the engine is correct on every blood test panel that exists in the wild. It does not say the engine should replace clinician judgment. And it does not say the engine outperforms alternative AI systems — comparative analyses against other engines were deliberately out of scope for this report.
What the score does establish is a baseline. With the rubric and harness public, future versions of the engine can be evaluated against the same fifteen cases, and the gap between the published score and any subsequent run is itself measurable. This is the value of pre-registration: it converts performance claims into testable claims.
How to reproduce this benchmark in 10 minutes
Reproduction requires only a Kantesti API credential pair and a Python 3.10 or later environment with the requests and reportlab libraries installed. The full harness is a single self-contained Python module released under the MIT licence.
Four steps to a fresh run
One. Clone the repository: git clone https://github.com/emirhanai/kantesti-blood-test-benchmark.git. Two. Install dependencies with pip install -r requirements.txt. Three. Set KANTESTI_USERNAME and KANTESTI_PASSWORD as environment variables — credentials are read at runtime and nothing is hard-coded in the script. Four. Run python benchmark_bloodtest.py and inspect the four artefacts emitted to the working directory: a CSV scorecard, a JSON scorecard, a full JSON dump including raw engine responses, and a human-readable Markdown report.
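The credential handling in step three can be sketched as follows. This helper is illustrative and not taken from the published harness, though the environment-variable names match those above.

```python
import os

def load_credentials():
    """Read the Kantesti API credential pair from the environment.

    Nothing is hard-coded: the harness expects KANTESTI_USERNAME and
    KANTESTI_PASSWORD to be exported before the run starts, and fails
    fast with a clear message if either is missing.
    """
    try:
        return os.environ["KANTESTI_USERNAME"], os.environ["KANTESTI_PASSWORD"]
    except KeyError as missing:
        raise SystemExit(f"Missing environment variable: {missing}") from missing
```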
The reference run from 23 April 2026 is preserved in the results/ directory of the repository. A fresh run will produce a new timestamped scorecard while leaving the reference run untouched. If your run produces a meaningfully different result, please open a GitHub issue with the run timestamp and the engine version returned in the response metadata.
Limitations and future work
Four limitations deserve explicit acknowledgement: sample size, single-shot evaluation, single-engine scope, and single-source data origin. Each is being addressed in active follow-up work.
Sample size. Fifteen cases across eight buckets (seven specialties plus the hyperdiagnosis traps) is enough for a proof of concept but not for subgroup analysis within a specialty. Expansion to fifty cases is planned and will include coagulation panels, haematological malignancy screening, pregnancy panels, and paediatric presentations.
Single-shot evaluation. Each case was evaluated once. Large language models exhibit non-trivial output variance even at low sampling temperature, so a multi-run protocol with five evaluations per case and reported variance is a natural next step.
Single-engine scope. This report characterises one engine. Comparative analyses against alternative AI systems are out of scope here; we may pursue them as a separate independent study with appropriate methodology.
Single-source data origin. The fifteen cases are anonymised real patient records drawn from a single clinical repository. They represent a curated sample and are not a population-representative random draw. Extending the evaluation to multi-centre data is on the roadmap.
The most impactful planned extension is multi-language parity. The Kantesti AI Engine serves users in 75+ languages, and running the same fifteen-case harness in Turkish, German, Spanish, French, and Arabic will quantify output quality across the engine's supported languages. We will publish each language-specific run with its own DOI and harness branch.