Digital Pathology AI Copilot Benchmark (DALPHIN)

Introduction

Foundation models with vision-language capabilities are rapidly emerging as interactive visual question answering (VQA) systems in digital pathology. Despite growing interest in clinical adoption, their ability to support pathologists as virtual assistants for diagnostic tasks remains poorly understood. Existing pathology benchmarks are limited: many are non-public or vulnerable to data leakage, which prevents reproducible evaluation. Independent, long-term benchmarking is therefore essential to rigorously assess the clinical potential, robustness, and limitations of these AI copilots on diagnostically meaningful tasks.

Benchmark

To enable fair evaluation and comparison of pathology AI copilots, we introduce the Digital Pathology AI Copilot Benchmark (DALPHIN) [1,2], a multicentric, open VQA benchmark. DALPHIN consists of 300 cases collected across six healthcare institutions in six countries, covering 130 diagnoses from 14 pathology subspecialties, including non-neoplastic entities and rare cancers. The benchmark comprises 1,236 histopathology images (low-resolution whole-slide images and higher-resolution regions of interest) and 1,757 questions across six tasks: tissue/organ recognition, neoplastic status, neoplastic behavior (benign, malignant, in situ, or uncertain), diagnosis, case-specific multiple-choice questions, and case-specific free-response questions. The images and questions are publicly available on Zenodo, while the ground-truth reference labels remain sequestered on this platform to preserve the benchmark's integrity.
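To make the task structure concrete for readers building an evaluation harness, here is a minimal sketch of the task taxonomy and a per-question record. The enum values and field names are our own illustrative assumptions, not the benchmark's actual schema; the Zenodo CSV defines the real one.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    """The six DALPHIN tasks described above.
    Identifiers are illustrative; the benchmark CSV may use different names."""
    TISSUE_ORGAN_RECOGNITION = "tissue_organ_recognition"
    NEOPLASTIC_STATUS = "neoplastic_status"
    NEOPLASTIC_BEHAVIOR = "neoplastic_behavior"  # benign, malignant, in situ, or uncertain
    DIAGNOSIS = "diagnosis"
    CASE_MULTIPLE_CHOICE = "case_multiple_choice"
    CASE_FREE_RESPONSE = "case_free_response"


@dataclass
class Question:
    """One benchmark item: a question tied to a case and its images.
    Field names are assumptions for illustration only."""
    case_id: str
    task: Task
    preamble: str           # context shown before the question, per the CSV
    question: str
    image_files: list[str]  # the histopathology images this question refers to
```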

Performance reference

In Lems et al. [2], we used DALPHIN to evaluate three vision-language models (VLMs) alongside pathologists with different levels of expertise. The study included two general-purpose models (Gemini 2.5 Pro and GPT-5) and one pathology-specific model (PathChat+), evaluated on all 300 cases. In addition, 115 cases were independently answered by 31 pathologists (24 board-certified pathologists and 7 residents) to establish a human performance reference that reflects interobserver variability and expertise. Each case was reviewed by one subspecialty expert and three non-experts, with additional review by two semi-experts for 60 cases. Performance results for all models across two answer-generation scenarios (contextual/sequential and independent) are included on the leaderboard, enabling direct comparison with new submissions.
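As a rough illustration of how the two answer-generation scenarios might differ, the sketch below assumes a hypothetical ask(messages) wrapper around a VLM chat API: in the contextual/sequential scenario the conversation accumulates across a case's questions, while in the independent scenario each question is posed in a fresh session. This is our reading of the scenario names, not the exact protocol of the study; see [2] for the authoritative description.

```python
def answer_contextually(ask, questions):
    """Contextual/sequential scenario (assumed): the questions of a case
    share one conversation, so earlier answers remain visible as context."""
    history, answers = [], []
    for q in questions:
        history.append({"role": "user", "content": q})
        reply = ask(history)  # hypothetical VLM call on the full history
        history.append({"role": "assistant", "content": reply})
        answers.append(reply)
    return answers


def answer_independently(ask, questions):
    """Independent scenario (assumed): each question is posed in a fresh
    session with no memory of previous questions or answers."""
    return [ask([{"role": "user", "content": q}]) for q in questions]
```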

How to use this platform

To assess the performance of your model on DALPHIN, follow these steps:

  1. Download the dataset from Zenodo, which includes:
    • Images: n=1,236 histopathology images
    • CSV file: the full benchmark metadata, including questions, preambles, and the mapping between images and questions.
  2. Evaluate your model on the full benchmark or on individual tasks. Example evaluation code is available in our GitHub repository, and a minimal illustrative sketch follows this list.
  3. Submit your model's responses for a specific task following the instructions on the corresponding submission page. Successful submissions will appear on the associated leaderboard.
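As a starting point for step 2, the sketch below shows one way to loop over the benchmark metadata and collect model responses. The column names (question_id, preamble, question, image_files), the semicolon delimiter, and the model.generate wrapper are all assumptions for illustration; consult the Zenodo CSV for the actual schema and our GitHub repository for the official evaluation code and submission format.

```python
import pandas as pd


def run_benchmark(model, csv_path="dalphin_benchmark.csv"):
    """Collect one response per benchmark question into a CSV.
    Column names and the image-list delimiter are assumptions; check
    the Zenodo release for the real schema."""
    df = pd.read_csv(csv_path)
    rows = []
    for _, row in df.iterrows():
        images = row["image_files"].split(";")       # assumed delimiter
        prompt = f"{row['preamble']}\n\n{row['question']}"
        answer = model.generate(prompt, images)      # your own VLM wrapper
        rows.append({"question_id": row["question_id"], "response": answer})
    # Save the responses for upload on the task's submission page.
    pd.DataFrame(rows).to_csv("responses.csv", index=False)
```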

Disclaimer

DALPHIN is intended as a long-term benchmark rather than a traditional challenge. There are no submission deadlines, prizes, or formal publication opportunities. Currently, we do not plan to publish a paper summarizing benchmark results. The primary goal of DALPHIN is to provide researchers with a standardized dataset and evaluation framework to assess their pathology AI copilots. You are, of course, welcome to report your model's performance on the benchmark in your own publications.

References

[1] C. Lems, N. Klubíčková, B. Brattoli, T. Lee, S. Kim, V. Vilaplana Besler, et al., Towards a multicentric open DigitAL PatHology assIstant beNchmark: Initial Results from the DALPHIN Study. Laboratory Investigation, 105(3, Supplement), 103609, 2025. doi:10.1016/j.labinv.2024.103609. Available at: https://www.sciencedirect.com/science/article/pii/S0023683724032872

[2] C. Lems, S. Moonemans, N. Klubíčková, B. Brattoli, T. Lee, S. Kim, et al., DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset. arXiv preprint arXiv:2605.03544, 2026. Available at: https://arxiv.org/abs/2605.03544