Kantesti AI Blood Test Benchmark — Clinical Validation

Automated Benchmark · Pre-Registered Benchmark · V11 Second Update (April 2026) · MIT-Licensed · Reproducible · Open Data · 100K Synthetic Cohort · 127 Country Labels

99.80% Composite Score on a Pre-Registered Rubric — V11 Second Update, 100,000-Case Cohort Across 127 Country Labels

A pre-registered, rubric-based automated technical benchmark of the Kantesti engine on 100,000 synthetically generated blood-test cases tagged with 127 country labels. It measures output conformance, not diagnostic accuracy. The rubric was frozen in source code before the V11 initial release and kept byte-identical for this Second Update; the evaluation harness is MIT-licensed; a stratified random sample of raw engine responses is published for inspection. All cases are synthetic; no personal data is used.

📅 Published April 23, 2026 · 🔄 Updated April 26, 2026 (V11 Second Update) · 🩺 Medically reviewed April 26, 2026 · 🔗 DOI: 10.6084/m9.figshare.32095435

This automated benchmark was designed and run by Julian Emirhan Bulut, Senior AI Engineer and CEO of Kantesti Ltd. Scoring is fully automated in source code; the scoring rubric and case panel were developed with clinical input from Dr. Thomas Klein, MD, Chief Medical Officer at Kantesti AI, and reviewed by the Kantesti AI Medical Advisory Board. It is a self-run internal benchmark, not an independent or peer-reviewed evaluation.

Lead Author & Clinical Oversight

Thomas Klein, MD

Chief Medical Officer, Kantesti AI

Dr. Thomas Klein is a board-certified clinical hematologist and internist with over 15 years of experience in laboratory medicine. As Chief Medical Officer at Kantesti AI, he selected the case panel for this benchmark, reviewed the clinical content and expected answers of the synthetic cases, and approved the pre-registered rubric prior to the first engine invocation.

ORCID 0009-0009-1490-1321

Co-Author & Implementation

Julian Emirhan Bulut

Senior AI Engineer & CEO, Kantesti Ltd

Julian Emirhan Bulut is the founder and CEO of Kantesti Ltd. He designed and implemented the evaluation harness (including the SQL case loader added for the V11 Second Update), performed the API integration, conducted both the V11 initial reference run and the V11 Second Update 100,000-case run, and prepared the statistical aggregation. He founded the platform in 2019.


⚡ Quick Summary V11 Second Update — April 26, 2026

99.80% composite score on 100,000 synthetic blood-test cases across eight medical specialties and 127 country labels (V11 Second Update).
Zero hyperdiagnosis false-positives across 87,412 monitored trap-case flag opportunities — same trap-case methodology as V11 initial, scaled to population level.
Pre-registered rubric frozen in source code before the V11 initial run and kept byte-identical for this Second Update — no post-hoc tuning was possible.
Mentzer index correctly applied to differentiate iron deficiency anaemia from beta-thalassaemia minor in the V11 initial release; the differential behaviour was preserved at population scale.
Production endpoint only — no privileged routing, evaluated exactly as a paying customer would access it.
13.26 second mean latency end-to-end (range 9.0–16.94 s), with all 100,000 cases completing on the engine's primary path.
Synthetic cohort. 100,000 synthetically generated test cases loaded at run-time. No real patient data and no personal data are used.
MIT-licensed harness released on GitHub with a stratified random sample (n = 201) of full raw engine responses for inspection.
Figshare DOI: 10.6084/m9.figshare.32095435 · Mirrored on ResearchGate, Academia.edu, GitHub.

Why this benchmark exists and what it tests

AI-assisted blood test interpretation is increasingly used in consumer and clinical workflows, yet reproducible evaluation frameworks tailored to laboratory medicine remain uncommon. The questions that matter most in this setting are not the ones covered by general medical question-answering benchmarks: can an engine separate iron deficiency from thalassaemia trait when the mean corpuscular volume is identical, does it over-diagnose Gilbert's syndrome as hepatitis, and does it manufacture pathology in a fully normal screening panel?

A single blood test panel typically contains enough signal to support several competing interpretations, and the job of the interpreting clinician is to weigh those interpretations against each other rather than to retrieve a textbook answer. An engine that does well on textbook cases can still fail on the cases that matter most: the differential-diagnosis pitfalls, the benign variants that look alarming in isolation, and the fully normal panels that tempt confident assistants into manufacturing pathology.

This benchmark was built around exactly those failure modes. Each of the fifteen cases was selected for a specific diagnostic property: an iron-deficient microcytosis that must be kept distinct from a beta-thalassaemia trait with identical mean corpuscular volume, a Gilbert's syndrome presentation where the only abnormality is an isolated indirect hyperbilirubinaemia, and a fifteen-parameter screening panel in which every analyte sits inside its reference range. The rubric rewards engines that read each case on its own terms and penalises engines that reach for a confident diagnosis where no such diagnosis is warranted.

Writing as Thomas Klein, MD: I selected the case panel because these are the patterns I see laboratory-medicine assistants get wrong most often. The expensive failure mode is not "missing a rare disease"; it is fabricating routine pathology in patients who do not have it. Our Medical Validation hub describes the broader framework. This page describes the V11 initial proof-of-concept and the V11 Second Update, which scaled it to 100,000 synthetic cases spanning 127 country labels using the same scoring rubric, byte-identical, with no post-hoc tuning permitted.

Latest reference run — V11 Second Update (April 26, 2026)

The V11 Second Update reference run of 26 April 2026 produced a composite score of 99.80% on the same pre-registered rubric used in the V11 initial release, evaluated on 100,000 synthetic cases drawn from the Kantesti synthetic case set and spanning 127 country labels and 75+ languages. Every case completed on the engine's primary path; trap-case hyperdiagnosis flag activations remained at 0 / 87,412. The original V11 run on 23 April 2026 covered 15 hand-curated cases (composite 99.12%) and validated the rubric; the Second Update keeps that rubric byte-identical and extends evaluation to a population-scale cohort.

Composite 99.80% 100,000 of 100,000 cases scored

1.000 Structural score

0.996 Clinical score

13.26 s Mean latency

0 / 87,412 Trap false-positives

The composite formula combines three components: structural conformance with the seven mandatory report sections and sixteen mandatory subsections, content accuracy measured as keyword recall plus scoring-system recall plus a probability-distribution validity check, and response latency against the primary-path service-level target. The exact decomposition is shown in the rubric formula below — none of these weights or sub-rubrics were altered for the Second Update.

Composite = 0.35 × Structural + 0.55 × Clinical + 0.10 × Latency

The remaining 0.20 percentage points of headroom decompose almost entirely into the clinical sub-score — a small fraction of cases (predominantly in Hepatology and Rheumatology) had one expected scoring-system keyword absent from the engine's interpretation despite the diagnostic content being correct. No case in the 100,000-case Second-Update cohort missed the diagnosis itself. Latency improved from a mean of 20.17 s in the V11 initial release to 13.26 s in the Second Update, reflecting production engine optimisations between the two runs; the rubric, the scoring code, and the API endpoint are unchanged.
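As a minimal sketch of the formula above, the weighted sum can be written in a few lines of Python. The function name and the assumption that each component is normalised to [0, 1] are illustrative and are not taken from the published harness.

```python
# Illustrative sketch only: names are hypothetical, weights come from the formula above.
W_STRUCTURAL, W_CLINICAL, W_LATENCY = 0.35, 0.55, 0.10


def composite(structural: float, clinical: float, latency: float) -> float:
    """Weighted composite of three components, each assumed to lie in [0, 1]."""
    for name, value in (("structural", structural),
                        ("clinical", clinical),
                        ("latency", latency)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return W_STRUCTURAL * structural + W_CLINICAL * clinical + W_LATENCY * latency


# Rounded Second Update means. Full latency credit (1.0) is assumed because the
# reported latency range for that run tops out at 16.94 s.
print(f"{composite(structural=1.000, clinical=0.996, latency=1.0):.4f}")
```

With the rounded component means this evaluates to about 0.9978. The gap to the published 0.9980 is consistent with rounding of the clinical mean, although that explanation is an assumption rather than something the report states.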

Per-label composite scores ranged from 0.9971 to 0.9985 across the 30 most-represented country labels. The long tail of 97 additional labels (≈7,300 cases combined) showed no systematic degradation. The most frequent labels by case count were the United States (10,500), Brazil (9,500), Spain (9,000), Italy (8,000), Germany (7,800), France (7,400), Portugal (5,800), Türkiye (3,400), the United Kingdom (2,900), and Mexico (2,500).

From 15 cases to 100,000: cohort evolution across 127 country labels

The original V11 case panel covered seven specialties — hematology, endocrinology, metabolic medicine, hepatology, nephrology, cardiology, rheumatology — plus two dedicated hyperdiagnosis trap cases, with each case a synthetically generated blood-test panel. The V11 Second Update extends evaluation to 100,000 synthetic cases across 127 country labels, distributed across eight specialties (the original seven plus a dedicated internal-medicine bucket that absorbs the trap subset). The same scoring rubric is applied byte-identically across both runs.

Because all cases are synthetically generated, there are no real identifiers to remove and no personal data is involved. Each synthetic case carries a benchmark-internal case code (BT-NNN-LABEL in the V11 initial set, a stable case_uid in the Second Update). No personal data appears anywhere in the published harness, the technical report, or the released datasets.

V11 initial release — 15 hand-curated cases

The original V11 case panel was hand-curated by Dr. Thomas Klein to exercise the diagnostic patterns that laboratory-medicine assistants get wrong most often. Each of the fifteen cases was selected for a specific diagnostic property, listed below.

Hematology (3): BT-001, BT-006, BT-007. Iron deficiency anaemia · B12 deficiency · Beta-thalassaemia minor

Endocrinology (3): BT-002, BT-008, BT-012. Hashimoto's thyroiditis · PCOS with insulin resistance · Severe vitamin D deficiency

Metabolic (2): BT-003, BT-013. T2DM with metabolic syndrome · Hyperuricaemia with gout risk

Hepatology (2): BT-004, BT-009. NAFLD / NASH · Acute viral hepatitis

Nephrology · Cardiology · Rheumatology (3): BT-005, BT-010, BT-011. CKD stage 3 · Atherogenic dyslipidaemia · Systemic lupus erythematosus

Trap cases (2): BT-014, BT-015. Gilbert's syndrome (isolated indirect hyperbilirubinaemia) · Fully normal adult screen

Why this particular distribution

Hematology gets three cases because microcytic differentials and macrocytic differentials are the highest-volume traps in real-world laboratory practice. Endocrinology gets three because the Hashimoto's, PCOS, and vitamin D deficiency presentations exercise different diagnostic shapes (autoantibody-driven, hormone-ratio-driven, single-marker-driven). The single-case specialties are still meaningful because each of CKD, ASCVD risk, and SLE has its own scoring system that the engine should invoke (KDIGO staging, ASCVD 10-year risk, 2019 EULAR/ACR SLE criteria respectively).

V11 Second Update — 100,000 synthetic cases across 127 country labels

The Second Update replaces the original V11 hard-coded 15-case Python literal with a larger, programmatically generated synthetic case set. The case set is loaded at the start of every run and the configuration is logged for transparency. The cohort distribution by content area is shown below.

Endocrinology: 23,900 cases (23.9%). Thyroid, PCOS, vitamin D, gonadal axis, pituitary

Metabolic medicine: 21,900 cases (21.9%). T2DM, metabolic syndrome, lipid panels, hyperuricaemia

Hematology: 15,400 cases (15.4%). Microcytic and macrocytic differentials, B12/folate, iron studies

Hepatology: 12,400 cases (12.4%). NAFLD/NASH, viral hepatitis, FIB-4, cholestasis

Internal medicine (incl. trap subset): 9,000 cases (9.0%). Mixed presentations and 8,723 dedicated hyperdiagnosis trap cases

Cardiology: 7,500 cases (7.5%). ASCVD risk, atherogenic dyslipidaemia, hs-CRP

Rheumatology: 6,000 cases (6.0%). SLE, RA, vasculitis, autoantibody panels (EULAR/ACR criteria)

Nephrology: 4,000 cases (4.0%). CKD staging (KDIGO), eGFR trends, electrolyte disturbance

Synthetic country-label distribution — top 10 labels

The 100,000 synthetic cases carry 127 country labels (ISO 3166-1 alpha-2) to exercise locale handling. Label assignment: Europe 57.7%, the Americas 25.4%, Asia-Pacific 6.2%, named Middle-East/Africa labels 3.4%, and a long tail of 97 additional labels roughly 7.3% combined. The ten most frequent labels by case count are the United States (10,500), Brazil (9,500), Spain (9,000), Italy (8,000), Germany (7,800), France (7,400), Portugal (5,800), Türkiye (3,400), the United Kingdom (2,900), and Mexico (2,500). Per-label composite scores ranged from 0.9971 to 0.9985. These label counts are properties of the generated cases used to exercise locale handling — they are not real users and not real-world geographic coverage.
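The concentration of the cohort in its ten largest labels can be checked directly from the counts listed above. The snippet below is a small arithmetic sketch; the variable names are ours, and the figures are copied from the paragraph.

```python
# Case counts for the ten most frequent country labels, as listed in the text.
top10 = {
    "United States": 10_500,
    "Brazil": 9_500,
    "Spain": 9_000,
    "Italy": 8_000,
    "Germany": 7_800,
    "France": 7_400,
    "Portugal": 5_800,
    "Türkiye": 3_400,
    "United Kingdom": 2_900,
    "Mexico": 2_500,
}
TOTAL_CASES = 100_000

top10_cases = sum(top10.values())
top10_share = top10_cases / TOTAL_CASES
print(f"top-10 labels: {top10_cases:,} cases ({top10_share:.1%} of the cohort)")
```

By these counts the ten largest labels hold 66,800 of the 100,000 cases, so roughly two thirds of the cohort sits in under a tenth of the labels.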

The pre-registered rubric, explained

Pre-registration is the single most important methodological choice in this benchmark. Every expected diagnosis, every clinical scoring system, and every report section was committed to source code before the engine was invoked. Any post-hoc tuning of the rubric to flatter the engine would therefore be visible in version control.

Three components make up the composite score. The structural component contributes 35 percent and measures whether the engine returned the seven mandatory report sections (header, summary, key findings, differential, scoring systems, recommendations, follow-up) and the sixteen mandatory subsections within them. Section presence weighs 40 percent and subsection presence weighs 60 percent within the structural calculation.
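The structural calculation described above can be sketched as follows. The function and parameter names are hypothetical; only the 40/60 split and the counts of seven sections and sixteen subsections come from the text.

```python
def structural_score(sections_found: int, subsections_found: int,
                     sections_required: int = 7,
                     subsections_required: int = 16) -> float:
    """Section presence weighs 40 percent, subsection presence 60 percent."""
    if not 0 <= sections_found <= sections_required:
        raise ValueError("sections_found out of range")
    if not 0 <= subsections_found <= subsections_required:
        raise ValueError("subsections_found out of range")
    section_fraction = sections_found / sections_required
    subsection_fraction = subsections_found / subsections_required
    return 0.40 * section_fraction + 0.60 * subsection_fraction


# A response with every section but only fifteen of sixteen subsections.
print(round(structural_score(7, 15), 4))
```

A response that has all seven sections but is missing one subsection scores 0.4 + 0.6 × 15/16 = 0.9625, which matches the 0.963 reported for the PCOS case in the V11 initial run.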

The clinical component contributes 55 percent and combines three things: diagnosis-keyword recall (70 percent of the clinical sub-score), scoring-system recall (20 percent — does the engine compute Mentzer, FIB-4, HOMA-IR, ASCVD risk, KDIGO staging, EULAR/ACR criteria where relevant), and a probability-sum validity check (10 percent — the differential probabilities should sum to within the [90, 110] interval). For trap cases, an explicit hyperdiagnosis penalty of up to 0.30 is subtracted, calculated as 0.10 per fabricated pathology flag, capped at three flags.
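A minimal sketch of the clinical sub-score follows, assuming each recall term is already expressed as a fraction in [0, 1]. The function names, the clamping at zero, and the treatment of the probability check as a separate helper are our assumptions; the weights, the [90, 110] interval, and the penalty schedule come from the paragraph above.

```python
def probability_sum_in_range(probabilities_percent, low=90.0, high=110.0) -> bool:
    """True when the differential probabilities sum to within [low, high]."""
    return low <= sum(probabilities_percent) <= high


def clinical_score(keyword_recall: float, scoring_recall: float,
                   probability_validity: float, fabricated_flags: int = 0) -> float:
    """70/20/10 weighting, minus 0.10 per fabricated flag (capped at three flags).

    fabricated_flags is only meaningful for trap cases; it defaults to 0.
    Clamping the result at zero is an assumption, not a documented behaviour.
    """
    base = (0.70 * keyword_recall
            + 0.20 * scoring_recall
            + 0.10 * probability_validity)
    penalty = 0.10 * min(fabricated_flags, 3)
    return max(0.0, base - penalty)


print(round(clinical_score(1.0, 1.0, 1.0), 3))                       # clean case
print(round(clinical_score(1.0, 1.0, 1.0, fabricated_flags=5), 3))   # capped penalty
print(probability_sum_in_range([60, 30, 8]))                         # sums to 98
```

If keyword recall and scoring-system recall are both 1.0, a clinical score of 0.965 (as reported for the SLE case) corresponds to a validity term of 0.65. That suggests the harness awards partial credit for the probability check rather than a strict pass or fail, but this reading is an inference from the published numbers.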

The latency component contributes 10 percent. A response under 20 seconds earns the full 0.10, a response under 40 seconds earns 0.05, and anything slower earns zero. The 20-second target reflects the production primary-path service-level objective; the 40-second ceiling reflects the Phase 2 fallback budget for heavy-engine invocations.
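The latency tiers translate into a simple step function. This is a sketch with a hypothetical function name; the thresholds and point values are the ones stated above, and "under" is read as a strict inequality.

```python
def latency_points(seconds: float) -> float:
    """Points contributed to the composite by response latency."""
    if seconds < 0:
        raise ValueError("latency cannot be negative")
    if seconds < 20.0:
        return 0.10  # within the primary-path target
    if seconds < 40.0:
        return 0.05  # within the fallback budget
    return 0.0


for seconds in (13.26, 37.0, 45.0):
    print(seconds, latency_points(seconds))
```

Under this reading a response at exactly 20.0 seconds falls into the middle tier; the source does not say how the harness treats the boundary.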

What pre-registration prevents

First-party benchmarks are notorious for inflating their own numbers through post-hoc rubric tuning. The pattern is almost always the same: the team runs the engine, sees where it underperforms, then quietly adjusts the rubric so the underperforming areas count for less. By committing the rubric to source code before the first engine call and publishing the harness under MIT licence, that adjustment becomes visible in version control. Anyone can clone the repository, check the rubric author dates, and verify the engine results were not used to shape the scoring.

Hyperdiagnosis trap cases — why over-calling is the real failure mode

Aggressive over-calling of pathology on normal screens is a documented failure mode of consumer-facing medical assistants. Its downstream costs include unnecessary investigation, patient anxiety, and iatrogenic workup. The two trap cases in this benchmark are designed to make that failure mode visible and scoreable.

🟡 Trap 1 — BT-014-GILBERT

Presentation. A 24-year-old male with a total bilirubin of 2.4 mg/dL. The direct fraction is normal, transaminases and alkaline phosphatase sit inside their reference ranges, reticulocytes are unremarkable, and haptoglobin and LDH rule out haemolysis.

Correct interpretation. Gilbert's syndrome — a benign UGT1A1 polymorphism. The interpretation should not invoke hepatitis, cirrhosis, haemolytic anaemia, or biliary obstruction.

V11 result. Composite 1.000. None of the six monitored over-diagnosis flags appeared as active diagnoses.

🟡 Trap 2 — BT-015-HEALTHY

Presentation. A 35-year-old female with a fifteen-parameter routine screening panel. Every analyte sits comfortably inside its reference range.

Correct interpretation. Reassurance and lifestyle maintenance. The interpretation should not manufacture borderline pathology to sound clinically useful.

V11 result. Composite 1.000. None of the seven monitored over-diagnosis flags — diabetes, anaemia, hypothyroidism, dyslipidaemia, hepatitis, kidney disease, deficiency — appeared as active diagnoses.

Across both traps, thirteen monitored hyperdiagnosis flags were checked. Zero were triggered. This is the result that matters most for any clinician considering using an AI engine as a triage or pre-consultation tool: the system did not invent disease where none existed.

Mentzer index: separating iron deficiency from thalassaemia trait

A second high-value finding concerns the pairing of case BT-001 (iron deficiency anaemia) with case BT-007 (beta-thalassaemia minor). Both present with microcytosis and are a well-known stumbling block for naive classifiers. The Mentzer index, calculated as MCV divided by RBC count, exceeds 13 in iron deficiency and falls below 13 in thalassaemia trait.

In BT-001, the patient was a 34-year-old female with hemoglobin 10.4 g/dL, MCV 72.4 fL, RBC 4.1 × 10¹²/L, ferritin 6 ng/mL, and elevated TIBC. The Mentzer index of approximately 17.7 supports absolute iron deficiency. In BT-007, the patient was a 28-year-old male with microcytosis (MCV 65.8 fL) but a high RBC count of 6.2 × 10¹²/L, a normal RDW, normal ferritin, and HbA2 of 5.6 percent. The Mentzer index of approximately 10.6 points to thalassaemia trait, and the elevated HbA2 confirms beta-thalassaemia minor.

Iron deficiency anaemia: Mentzer > 13. Low ferritin, low TSAT, high TIBC, elevated RDW

Beta-thalassaemia trait: Mentzer < 13. Normal ferritin, normal RDW, elevated HbA2 (>3.5%), high RBC count
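The two worked values can be reproduced with a short calculation. This is an illustrative sketch of the index itself, not of how the engine computes or applies it; the function names and the "indeterminate" label at exactly 13 are our own choices.

```python
def mentzer_index(mcv_fl: float, rbc_10e12_per_l: float) -> float:
    """Mentzer index: MCV (fL) divided by RBC count (x 10^12/L)."""
    if rbc_10e12_per_l <= 0:
        raise ValueError("RBC count must be positive")
    return mcv_fl / rbc_10e12_per_l


def mentzer_suggests(index: float, cutoff: float = 13.0) -> str:
    """Screening heuristic only; ferritin, RDW and HbA2 are still needed."""
    if index > cutoff:
        return "iron deficiency more likely"
    if index < cutoff:
        return "thalassaemia trait more likely"
    return "indeterminate"


for case_id, mcv, rbc in (("BT-001", 72.4, 4.1), ("BT-007", 65.8, 6.2)):
    index = mentzer_index(mcv, rbc)
    print(case_id, round(index, 1), mentzer_suggests(index))
```

The index is a screening aid rather than a diagnosis, which is why the case write-ups pair it with ferritin, RDW, and HbA2.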

Both cases scored 1.000. The engine invoked the Mentzer index explicitly in both interpretations and returned the correct diagnosis in each instance. This is the single most clinically reassuring result in the entire benchmark, because misclassifying thalassaemia trait as iron deficiency leads to inappropriate iron supplementation and missed family-screening opportunities, and misclassifying iron deficiency as thalassaemia delays straightforward replacement therapy. Our ferritin range guide explains the broader differential context.

Per-case results from the V11 initial reference run (April 23, 2026)

The original V11 reference run on the 15-case proof-of-concept cohort serves as the methodological foundation of the Second Update: every per-case detail below illustrates how the rubric handles a real engine response. Eleven of fifteen cases achieved the ceiling composite score of 1.000. Twelve cases were served on the primary path and three via the Phase 2 fallback; two of the fallback cases (BT-002 and BT-009) exceeded the 20-second target and lost 0.05 of latency credit while preserving all clinical and structural content. One case was missing a single mandatory subsection; one returned a marginally reduced probability-distribution sum.

Case ID          Specialty       Composite  Latency  Path
BT-001-IDA       Hematology      1.000      17.8 s   primary
BT-006-B12       Hematology      1.000      18.4 s   primary
BT-007-THAL      Hematology      1.000      17.0 s   primary
BT-002-HASH      Endocrinology   0.950      37.0 s   fallback
BT-008-PCOS      Endocrinology   0.987      18.6 s   primary
BT-003-T2DM      Metabolic       1.000      19.1 s   primary
BT-013-GOUT      Metabolic       1.000      19.4 s   primary
BT-004-NAFLD     Hepatology      1.000      19.6 s   primary
BT-009-VIRHEP    Hepatology      0.950      23.4 s   fallback
BT-014-GILBERT   Trap            1.000      18.9 s   primary
BT-005-CKD       Nephrology      1.000      17.4 s   primary
BT-010-ASCVD     Cardiology      1.000      19.7 s   primary
BT-011-SLE       Rheumatology    0.981      18.2 s   primary
BT-012-VITD      Endocrinology   1.000      19.3 s   primary
BT-015-HEALTHY   Trap            1.000      18.7 s   fallback

The PCOS case (BT-008) lost a single mandatory subsection in the response structure — fifteen of sixteen instead of sixteen of sixteen — which shaved structural score from 1.000 to 0.963. The SLE case (BT-011) returned a marginally reduced probability-distribution sum that dropped clinical score to 0.965 while preserving every diagnostic keyword and scoring system. Neither sub-perfect case missed a correct diagnosis.
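The headline V11 initial figures can be recomputed from the per-case table. The sketch below simply transcribes the published rows; the tuple layout and variable names are ours.

```python
from statistics import mean

# (case id, composite, latency in seconds, engine path), transcribed from the table.
rows = [
    ("BT-001-IDA", 1.000, 17.8, "primary"),
    ("BT-006-B12", 1.000, 18.4, "primary"),
    ("BT-007-THAL", 1.000, 17.0, "primary"),
    ("BT-002-HASH", 0.950, 37.0, "fallback"),
    ("BT-008-PCOS", 0.987, 18.6, "primary"),
    ("BT-003-T2DM", 1.000, 19.1, "primary"),
    ("BT-013-GOUT", 1.000, 19.4, "primary"),
    ("BT-004-NAFLD", 1.000, 19.6, "primary"),
    ("BT-009-VIRHEP", 0.950, 23.4, "fallback"),
    ("BT-014-GILBERT", 1.000, 18.9, "primary"),
    ("BT-005-CKD", 1.000, 17.4, "primary"),
    ("BT-010-ASCVD", 1.000, 19.7, "primary"),
    ("BT-011-SLE", 0.981, 18.2, "primary"),
    ("BT-012-VITD", 1.000, 19.3, "primary"),
    ("BT-015-HEALTHY", 1.000, 18.7, "fallback"),
]

mean_composite = mean(composite for _, composite, _, _ in rows)
mean_latency = mean(latency for _, _, latency, _ in rows)
primary_count = sum(1 for *_, path in rows if path == "primary")
ceiling_count = sum(1 for _, composite, _, _ in rows if composite == 1.000)

print(f"mean composite {mean_composite:.4f}, mean latency {mean_latency:.2f} s")
print(f"{primary_count} primary-path cases, {ceiling_count} cases at 1.000")
```

The means agree with the published 0.9912 composite and 20.17 s latency for the V11 initial run.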

V11 Second Update aggregate — 100,000 cases

At population scale, individual case rows are not human-readable, so the Second Update reports aggregated metrics rather than a 100,000-row table. The headline aggregate is shown below; per-specialty and per-country-label breakdowns are published in the technical report and the Figshare deposit. A stratified random sample of n = 201 raw engine responses (deterministic seed 20260426) is published in the GitHub results/ directory for inspection.
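One way to draw a reproducible stratified sample of this kind is sketched below. The report does not state which variable the sample was stratified on or how quotas were allocated, so the stratum key, the proportional allocation, and the largest-remainder rounding are all assumptions; only the sample size of 201 and the seed 20260426 come from the text.

```python
import random
from collections import defaultdict


def stratified_sample(cases, stratum_of, n_total, seed):
    """Proportionally allocated stratified random sample (largest-remainder rounding)."""
    if not 0 <= n_total <= len(cases):
        raise ValueError("n_total must be between 0 and the number of cases")
    if n_total == 0:
        return []

    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)

    population = len(cases)
    keys = sorted(strata)  # fixed order keeps the draw reproducible
    # Integer arithmetic avoids floating-point surprises in the quotas.
    quota = {k: (n_total * len(strata[k])) // population for k in keys}
    remainder = {k: (n_total * len(strata[k])) % population for k in keys}
    leftover = n_total - sum(quota.values())
    for k in sorted(keys, key=lambda key: remainder[key], reverse=True)[:leftover]:
        quota[k] += 1

    rng = random.Random(seed)  # a private generator, so global state is untouched
    sample = []
    for k in keys:
        sample.extend(rng.sample(strata[k], quota[k]))
    return sample


# Toy data: hypothetical case records with a made-up three-way specialty split.
specialties = ("hematology", "hepatology", "nephrology")
cases = [{"case_uid": i, "specialty": specialties[i % 3]} for i in range(1000)]
picked = stratified_sample(cases, lambda c: c["specialty"], n_total=201, seed=20260426)
print(len(picked))
```

Because the generator is seeded and the strata are visited in sorted order, repeating the call with the same inputs returns the same cases, which is the property a published inspection sample needs.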

Composite score: V11 initial 0.9912 (99.12%) → Second Update 0.9980 (99.80%). Δ = +0.0068 across the 100,000-case cohort.

Structural score (mean): V11 initial 0.998 → Second Update 1.000. Perfect structural conformance at population scale.

Clinical score (mean): V11 initial 0.998 → Second Update 0.996. Δ = −0.002; no case missed the diagnosis itself.

Latency, mean (range): V11 initial 20.17 s (17.0–37.0 s) → Second Update 13.26 s (9.0–16.94 s). Production engine optimisations between runs.

Engine path = primary: V11 initial 12 / 15 → Second Update 100,000 / 100,000. No Phase 2 fallback was needed at any point during the run.

Trap-subset hyperdiagnosis flags: V11 initial 0 / 13 → Second Update 0 / 87,412. Zero false-positives at population scale (8,723 trap cases monitored).

What the headline score does not tell us

A composite score of 99.80 percent under this particular pre-registered rubric, on a 100,000-case synthetic cohort spanning 127 country labels, represents near-ceiling performance — but it deserves careful framing. The result describes the engine's behaviour against the rubric we committed to source code in V11; it is not a universal claim about the engine's correctness on every blood test panel that exists in the wild.

The score says the engine handled the diagnostic patterns selected for this evaluation correctly across a population-scale cohort, on a methodology that is published and reproducible. It does not say the engine is correct on every blood test panel that exists in the wild. It does not say the engine should replace clinician judgment. And it does not say the engine outperforms alternative AI systems — comparative analyses against other engines were deliberately out of scope for this report.

What the score does establish is a baseline. With the rubric and harness public, future versions of the engine can be evaluated against the same rubric — applied to the V11 initial 15 cases, the Second Update 100,000-case cohort, or any subsequent expansion — and the gap between the published score and any subsequent run is itself measurable. This is the value of pre-registration: it converts performance claims into testable claims.

How to reproduce this benchmark in 10 minutes

Reproducing the V11 initial run requires only a Kantesti API credential pair and a Python 3.10 or later environment with the requests and reportlab libraries installed; the Second Update run additionally needs database credentials for the SQL case loader. The full harness is a single self-contained Python module released under the MIT licence.

💻 GitHub: MIT-licensed harness · raw responses · reference run
🔗 Figshare: DOI 10.6084/m9.figshare.32095435 · canonical academic record
🎓 ResearchGate: Publication 404175463 · V11 Second Update · academic discovery layer
📄 Academia.edu: Paper 165956808 · V11 Second Update · academic discovery layer

Four steps to a fresh run

One. Clone the repository: git clone https://github.com/emirhanai/kantesti-blood-test-benchmark.git

Two. Install dependencies with pip install -r requirements.txt (the Second Update adds mysql-connector-python ≥ 8.0 for the SQL case loader).

Three. Set KANTESTI_USERNAME and KANTESTI_PASSWORD as environment variables for the engine API. For the Second Update SQL case loader, also set KANTESTI_DB_HOST, KANTESTI_DB_PORT, KANTESTI_DB_NAME, KANTESTI_DB_USER, and KANTESTI_DB_PASSWORD. The loader connects through a read-only role (bench_reader) that has no privileges on identifying tables.

Four. Run python benchmark_bloodtest.py --limit 100000 for the full Second-Update run, or python benchmark_bloodtest.py --limit 1000 for quick iteration.

Outputs land in ./benchmark_results/: a CSV scorecard with per-country-label and per-specialty columns, a JSON aggregate, a stratified-random raw-response sample, and a Markdown report.
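Before launching a long run it can help to confirm that the environment variables from step three are set. The check below is a hypothetical pre-flight helper, not part of the published harness; only the variable names are taken from the steps above.

```python
import os

API_VARS = ("KANTESTI_USERNAME", "KANTESTI_PASSWORD")
DB_VARS = ("KANTESTI_DB_HOST", "KANTESTI_DB_PORT", "KANTESTI_DB_NAME",
           "KANTESTI_DB_USER", "KANTESTI_DB_PASSWORD")


def missing_variables(environ, need_db=True):
    """Return the required variable names that are unset or empty."""
    required = API_VARS + (DB_VARS if need_db else ())
    return [name for name in required if not environ.get(name)]


missing = missing_variables(os.environ, need_db=True)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing need_db=False restricts the check to the two API credentials, which is all the V11 initial run is described as needing.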

The reference runs from 23 April 2026 (V11 initial, 15 cases) and 26 April 2026 (V11 Second Update, 100,000 cases) are preserved in the results/ directory of the repository. A fresh run will produce a new timestamped scorecard while leaving the reference runs untouched. If your run produces a meaningfully different result, please open a GitHub issue with the run timestamp and the engine version returned in the response metadata.

Limitations and future work

Even at 100,000 cases across 127 country labels, four limitations deserve explicit acknowledgement: long-tail label undersampling, single-shot evaluation, single-engine scope, and single-source data origin. Each is being addressed in active follow-up work.

Long-tail label coverage. The Second Update spans 127 country labels, but the distribution is unbalanced: the top 10 labels account for ≈66.8% of cases (66,800 by the per-label counts), and the long tail of 97 additional labels together contributes ≈7.3% (roughly 7,300 cases combined, ~75 cases per label on average). Per-label composites in this long tail are therefore noisier than headline figures suggest. Future runs will rebalance label assignment to firm up per-label estimates.
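To give a rough sense of scale, the standard error of a per-label mean shrinks with the square root of the number of cases. The sketch below compares an average long-tail label with the largest label, assuming independent cases and a similar per-case score variance in both groups; neither assumption is verified in the report.

```python
import math


def standard_error_ratio(n_small: float, n_large: float) -> float:
    """How many times wider the standard error is at n_small than at n_large."""
    if n_small <= 0 or n_large <= 0:
        raise ValueError("sample sizes must be positive")
    return math.sqrt(n_large / n_small)


long_tail_mean_n = 7_300 / 97   # about 75 cases per long-tail label
largest_label_n = 10_500        # United States label
ratio = standard_error_ratio(long_tail_mean_n, largest_label_n)
print(f"{long_tail_mean_n:.1f} cases per long-tail label, "
      f"standard error about {ratio:.1f}x that of the largest label")
```

Under those assumptions, a per-label composite in the long tail carries a standard error roughly twelve times larger than the same estimate for the largest label.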

Single-shot evaluation. Each case in the cohort was evaluated once. Large language models exhibit non-trivial output variance even at low sampling temperature, so a multi-run protocol with five evaluations per case and reported variance is a natural next step — particularly on the trap-case subset, where consistency under sampling jitter is part of the safety claim.

Single-engine scope. This report characterises one engine. Comparative analyses against alternative AI systems are out of scope here; we may pursue them as a separate independent study with appropriate methodology, against the same MIT-licensed harness.

Synthetic data. The 100,000 cases are synthetically generated, not real patient cases, and results do not transfer to real-world clinical performance. Evaluation on real, consented, externally-sourced data would require appropriate ethical oversight and is out of scope for this synthetic benchmark.

Beyond these four, the most impactful planned extension is multi-language parity per jurisdiction. The Kantesti AI Engine serves users in 75+ languages, and running language-stratified Second-Update sub-cohorts (Turkish, German, Spanish, French, Italian, Portuguese, Arabic, Mandarin) will quantify output quality across the engine's supported languages. Each language-stratified analysis will be published with its own DOI and harness branch.

Try the Same Engine That Achieved 99.80% Composite Score on 100,000 Cases

Upload your own blood test panel to the same production endpoint that was evaluated in this benchmark. Over 2 million users worldwide use the Kantesti AI Engine to interpret over 15,000 biomarkers across 75+ languages.

📚 How to Cite This Benchmark

BibTeX

@techreport{klein2026kantesti_v11_second_update,
  author      = {Klein, Thomas and Bulut, Julian Emirhan},
  title       = {A Pre-Registered, Rubric-Based Automated Technical
                 Benchmark of the Kantesti Blood-Test Interpretation
                 Engine on 100,000 Synthetic Test Cases
                 --- V11 Second Update},
  institution = {Kantesti Ltd},
  address     = {London, United Kingdom},
  year        = {2026},
  month       = {April},
  type        = {Technical Report},
  number      = {V11 (Second Update)},
  doi         = {10.6084/m9.figshare.32095435},
  url         = {https://doi.org/10.6084/m9.figshare.32095435}
}

APA

Klein, T., & Bulut, J. E. (2026). A Pre-Registered, Rubric-Based Automated Technical Benchmark of the Kantesti Blood-Test Interpretation Engine on 100,000 Synthetic Test Cases — V11 Second Update (Technical Report V11 Second Update). Kantesti Ltd. https://doi.org/10.6084/m9.figshare.32095435

📖 External Methodological References

Mentzer, W. C. (1973). Differentiation of Iron Deficiency from Thalassaemia Trait. The Lancet, 301(7808), 882.


Aringer, M., Costenbader, K., Daikh, D., et al. (2019). 2019 European League Against Rheumatism / American College of Rheumatology Classification Criteria for Systemic Lupus Erythematosus. Arthritis & Rheumatology, 71(9), 1400–1412.


Umapathi, L. K., Pal, A., & Sankarasubbu, M. (2023). Med-HALT: Medical Domain Hallucination Test for Large Language Models. Proceedings of CoNLL 2023.



Frequently Asked Questions

How accurate is the Kantesti AI Engine on synthetic test cases?

On a pre-registered rubric, run on 100,000 synthetically generated test cases across eight content areas and 127 country labels (V11 Second Update), the engine reached a composite score of 99.80 percent, with zero hyperdiagnosis flags across 87,412 monitored trap-case opportunities and a mean response latency of 13.26 seconds. This composite measures output conformance on synthetic inputs, not diagnostic accuracy. The original V11 release exercised the same rubric on 15 hand-constructed cases (composite 99.12%); the Second Update keeps the rubric byte-identical and extends it to a larger synthetic cohort. The full scorecard is published on Figshare under DOI 10.6084/m9.figshare.32095435 and on GitHub under MIT licence.

Is the Kantesti AI Engine clinically validated?

No. The engine has been evaluated with an automated technical benchmark (not a clinical validation), against a rubric that was frozen in source code before the V11 initial run and kept byte-identical for the V11 Second Update, evaluated on 100,000 synthetic blood-test cases across hematology, endocrinology, metabolic medicine, hepatology, nephrology, cardiology, rheumatology, and internal medicine, drawn from 127 country labels. Clinical oversight was provided by Dr. Thomas Klein, MD (ORCID 0009-0009-1490-1321), board-certified clinical hematologist and Chief Medical Officer at Kantesti AI.

What is a hyperdiagnosis trap case?

A hyperdiagnosis trap case is a clinical scenario specifically designed to detect over-diagnosis behaviour in AI engines. The V11 initial benchmark used two such cases as a methodological proof-of-concept: an isolated indirect hyperbilirubinaemia consistent with Gilbert's syndrome (where the correct interpretation is the benign UGT1A1 polymorphism rather than hepatitis or haemolysis) and a fully normal adult screening panel (where the correct output is reassurance rather than a manufactured borderline pathology). The V11 Second Update scaled this trap-case methodology to a dedicated subset of 8,723 cases yielding 87,412 monitored hyperdiagnosis flag opportunities — and the engine's false-positive rate remained at zero.

Is the Kantesti AI Engine evaluation reproducible?

The full evaluation harness is released under the MIT licence as a single self-contained Python module. The V11 initial run requires only a Kantesti API credential pair and Python 3.10 or later. The V11 Second Update adds a parameterised, read-only SQL case loader that requires Kantesti clinical-repository credentials (a bench_reader role with no privileges on identifying tables). The code, the case loader SQL, the rubric (byte-identical between releases), and a stratified random sample of raw engine responses from both the V11 initial and the Second Update reference runs are available at github.com/emirhanai/kantesti-blood-test-benchmark and mirrored on Figshare, ResearchGate, and Academia.edu.

How does the Kantesti AI Engine differentiate iron deficiency from beta-thalassaemia trait?

The engine applies the Mentzer index, calculated as mean corpuscular volume divided by red blood cell count. A Mentzer index above 13 supports iron deficiency anaemia, while a value below 13 supports beta-thalassaemia trait. In the V11 initial benchmark both presentations were classified correctly with explicit Mentzer index calculation, supported by ferritin, RDW, and HbA2 context. Across the V11 Second Update 100,000-case cohort, the same differential behaviour was preserved at population scale.
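The Mentzer calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the engine's code: the function names `mentzer_index` and `mentzer_hint` are hypothetical, the inputs are assumed to be MCV in femtolitres and red blood cell count in millions per microlitre, and a value of exactly 13 is treated as indeterminate because the text only describes values above and below the threshold.

```python
def mentzer_index(mcv_fl: float, rbc_millions_per_ul: float) -> float:
    """Return the Mentzer index: MCV (fL) divided by RBC count (10^6/uL)."""
    if rbc_millions_per_ul <= 0:
        raise ValueError("RBC count must be positive")
    return mcv_fl / rbc_millions_per_ul


def mentzer_hint(mcv_fl: float, rbc_millions_per_ul: float) -> str:
    """Map the index to the direction of support described in the text.

    Hypothetical helper: the index is a screening heuristic, so the result
    is phrased as 'supports', never as a diagnosis.
    """
    index = mentzer_index(mcv_fl, rbc_millions_per_ul)
    if index > 13:
        return "supports iron deficiency anaemia"
    if index < 13:
        return "supports beta-thalassaemia trait"
    return "indeterminate"  # assumption: exactly 13 is left undecided


if __name__ == "__main__":
    # MCV 70 fL with RBC 4.0 gives 17.5, above the threshold of 13.
    print(mentzer_index(70.0, 4.0), mentzer_hint(70.0, 4.0))
    # MCV 65 fL with RBC 6.0 gives roughly 10.8, below the threshold.
    print(round(mentzer_index(65.0, 6.0), 1), mentzer_hint(65.0, 6.0))
```

As the paragraph above notes, the index is read alongside ferritin, RDW, and HbA2. The sketch covers only the arithmetic and the threshold comparison.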

Where can I find the raw benchmark data and source code?

The technical report is deposited on Figshare under DOI 10.6084/m9.figshare.32095435, covering both the V11 initial release and the V11 Second Update. It is mirrored on ResearchGate (publication 404175463) and Academia.edu (paper 165956808), both updated with the V11 Second Update title and 100,000-case results. The MIT-licensed Python harness, with all reference run results, is at github.com/emirhanai/kantesti-blood-test-benchmark. Mirroring across four platforms supports long-term availability and citation flexibility.

Why is pre-registration important for AI medical benchmarks?

Pre-registration prevents post-hoc rubric tuning, a common way for company-run benchmarks to inflate their own numbers. Because the rubric was committed to source code before any engine call and the harness is published, the rubric's authoring dates are inspectable in version control, and the engine results cannot have shaped the scoring criteria.
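One way a reader could check a "byte-identical" claim for themselves is to compare cryptographic digests of the rubric file from two checkouts of the repository. The sketch below is an assumption about how such a check might look, not part of the published harness: the helper names `sha256_of_file` and `is_byte_identical` are hypothetical, and it assumes the rubric lives in a single file whose path the reader supplies.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in binary mode so line-ending translation cannot mask a change.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(path_a: str, path_b: str) -> bool:
    """True when both files hash to the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

In practice a reader would check out the commit for the V11 initial release and the commit for the Second Update into separate directories, then pass the two rubric file paths to `is_byte_identical`. The commit timestamps in version control then show when the rubric was written relative to the reference runs.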

Does this benchmark include comparisons to other AI engines?

No. The V11 report — both the initial release and the Second Update — deliberately characterises a single engine against a fixed rubric rather than positioning it against alternative commercial systems. The harness is open source under MIT licence (now including the SQL case loader), so independent researchers can evaluate any engine they choose against the same rubric and case loader and publish their results.

Are the patient cases real or synthetic?

All cases are synthetically generated: 15 hand-constructed cases in the V11 initial release and 100,000 in the Second Update. They are not real patient records. No patient data, consent process, or de-identification is involved, because no personal data exists in the cohort. No personal data appears in the published harness, the technical report, or the released datasets.

⚕️ Medical Disclaimer & Conflict of Interest

This benchmark report is for research and methodological transparency purposes. It does not constitute medical advice, is not a diagnosis, and is not a substitute for professional medical care; no result here should be used to delay or avoid seeing a doctor. Always consult a qualified healthcare provider for diagnosis and treatment decisions. This is a self-run internal benchmark of the company's own engine and has not been independently validated or peer-reviewed. The composite score measures conformance to a fixed rubric (report structure, keyword and scoring-system recall, and latency); it is not a measure of real-world diagnostic accuracy or clinical safety. Both authors are employed by and hold equity in Kantesti Ltd, and the engine under evaluation is a commercial product of the same organisation. This conflict of interest is mitigated by pre-registering the rubric in source code, releasing the harness under the MIT licence, and publishing a stratified random sample of raw engine responses.

E-E-A-T Trust Signals

⭐

Experience

15+ years of clinical hematology and laboratory medicine practice supervising the case panel selection.

📋

Expertise

Pre-registered rubric design with explicit hyperdiagnosis penalties and recognized clinical scoring systems (Mentzer, FIB-4, EULAR/ACR, KDIGO).

👤

Authoritativeness

Lead author Dr. Thomas Klein, MD (ORCID 0009-0009-1490-1321). Implementation by Julian Emirhan Bulut, CEO of Kantesti Ltd.

🛡️

Trustworthiness

MIT-licensed reproducible harness, raw engine responses published, open conflict-of-interest disclosure, four-platform research mirror network.

🏢 Kantesti LTD Registered in England & Wales · Company No. 17090423 London, United Kingdom · kantesti.net