Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as “deep ensembles” (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology–implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match the segmentation accuracy of CV ensembles while improving calibration and failure detection, whereas CV ensemble disagreement sometimes correlates more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral or failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
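The difference between the two ensemble constructions can be sketched in a few lines of Python. This is a minimal illustration, not the paper's nnU-Net modification: the function names, the dictionary layout, and the seeding scheme are hypothetical, and the sketch assumes the DE's fixed training set is simply the full list of cases passed in.

```python
import random


def cv_ensemble_members(case_ids, k=5, base_seed=0):
    """K-fold CV ensemble: each member trains on a different (k-1)/k subset."""
    shuffled = list(case_ids)
    random.Random(base_seed).shuffle(shuffled)
    # Split the shuffled cases into k disjoint folds.
    folds = [shuffled[i::k] for i in range(k)]
    members = []
    for held_out in range(k):
        train = sorted(
            c for j, fold in enumerate(folds) if j != held_out for c in fold
        )
        # Members differ in BOTH training data and seed (assumed seeding scheme).
        members.append({"train_cases": train, "seed": base_seed + held_out})
    return members


def deep_ensemble_members(case_ids, n_members=5, base_seed=0):
    """Deep ensemble: every member sees the same training set; only the seed differs."""
    train = sorted(case_ids)
    return [
        {"train_cases": list(train), "seed": base_seed + m}
        for m in range(n_members)
    ]


cases = list(range(10))  # hypothetical case identifiers
cv = cv_ensemble_members(cases)
de = deep_ensemble_members(cases)
```

In the CV variant, disagreement between members can come from either the seed or the held-out data; in the DE variant, only the seed varies, which is the distinction the abstract draws.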
@inproceedings{kirscher2026lost,
  title={Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation},
  author={Kirscher, Tristan and Bujotzek, Markus and Kirchhoff, Yannick and Rokuss, Maximilian and Isensee, Fabian and Kahl, Kim-Celine and Kovacs, Balint and Maier-Hein, Klaus},
  booktitle={29th International Conference on Medical Image Computing and Computer Assisted Intervention},
  address={Strasbourg, France},
  year={2026},
}
MIDL
TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation
Tristan Kirscher, Alexandra Ertl, Klaus Maier-Hein, Xavier Coubez, Philippe Meyer, and Sylvain Faisan
Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
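The calibration target described above, the mean human response, is simple to compute from multi-rater annotations. The snippet below is a minimal sketch of that definition only; it does not reproduce the TwinTrack calibration procedure. The function name is hypothetical, and masks are assumed to be flattened lists of binary voxel labels, one list per annotator.

```python
def mean_human_response(rater_masks):
    """Fraction of annotators labeling each voxel as tumor.

    rater_masks: one flattened binary mask (list of 0/1) per annotator,
    all of equal length (assumed input layout).
    """
    n_raters = len(rater_masks)
    # zip(*...) groups the votes of all annotators for each voxel.
    return [sum(votes) / n_raters for votes in zip(*rater_masks)]


# Three hypothetical annotators, four voxels each.
masks = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
mhr = mean_human_response(masks)
```

Here the first voxel is labeled tumor by all three annotators and the last by none, so their MHR values are 1 and 0; the two middle voxels take the fractional values 2/3 and 1/3.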
@inproceedings{kirscher2026twintrack,
  title={TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation},
  author={Kirscher, Tristan and Ertl, Alexandra and Maier-Hein, Klaus and Coubez, Xavier and Meyer, Philippe and Faisan, Sylvain},
  booktitle={Medical Imaging with Deep Learning},
  year={2026},
  address={Taipei, Taiwan},
}
Pediatric medical image segmentation remains challenging due to limited annotated data and significant anatomical variability across developmental stages. We propose PSAT, a framework that leverages adult data augmentations and transfer learning strategies to improve pediatric segmentation performance.
@inproceedings{kirscher2025psat,
  title={PSAT: Pediatric Segmentation Approaches via Adult Augmentations and Transfer Learning},
  author={Kirscher, Tristan and Faisan, Sylvain and Coubez, Xavier and Barrier, Loris and Meyer, Philippe},
  booktitle={Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
  year={2025},
  address={Daejeon, South Korea},
}
This work introduces a novel methodological framework for analyzing health trajectories and survival outcomes in heart failure patients. We combine NLP techniques for characterizing patient trajectories, unsupervised clustering with a new metric for measuring diagnosis distances, and survival analysis to assess patient outcomes.
@inproceedings{murris2024health,
  title={A Novel Methodological Framework for the Analysis of Health Trajectories and Survival Outcomes in Heart Failure Patients},
  author={Murris, Juliette and Amadei, Tristan and Kirscher, Tristan and Klein, Antoine and Tropeano, Anne-Isabelle and Katsahian, Sandrine},
  booktitle={ICLR 2024 Workshop on Learning from Time Series For Health},
  year={2024},
  address={Vienna, Austria},
}