Reducing uncertainty in medical segmentation using ensemble methods
AUTHOR
Rémy SIAHAAN--GENSOLLEN
PUBLISHED ON
September 7, 2025
This project is based on academic work carried out at ENSAE with my classmates Lucas Cumunel, Tara Leroux and Léo Leroy, and supervised by
Automatic organ segmentation, although very useful in medical imaging, remains subject to high uncertainty, especially when it relies on (subjective) manual annotations. This project evaluates the use of an ensemble method to reduce this uncertainty, by training and combining several U-Nets on multiple CT scans annotated by different experts. We evaluate the accuracy of the predictions, as well as their aleatoric and epistemic uncertainty. Results indicate that this simple method significantly reduces the uncertainty of the predictions.
Background and project
Introduction
For several years, artificial intelligence has been revolutionizing medical practice, supporting doctors in their diagnoses and decision-making. Medical imaging, in particular, plays a central role in assessing patients' health and guiding their care [Li, 2023]
Medical image analysis using deep learning algorithms
Li, Mengfang and Jiang, Yuanyuan and Zhang, Yanzhou and Zhu, Haisheng (2023)
. Automatic segmentation—that is, the precise delineation of organs and structures by algorithms—facilitates diagnosis, treatment planning, and clinical monitoring. These algorithms include convolutional neural networks (CNNs), a powerful deep learning tool that has outperformed human experts in many image understanding tasks [D. R. Sarvamangala, 2022]
Convolutional neural networks in medical image understanding: a survey
D. R. Sarvamangala and Raghavendra V. Kulkarni (2022)
3D segmentation of pancreas, kidneys and liver, as well as a section of the abdominal scanner used to delineate them.
However, many of the structures and anomalies analyzed (organs, blood vessels, tumors, etc.) are particularly complex and variable, leading to a certain uncertainty in their delineation. This uncertainty is accentuated by inter-expert variability: different medical specialists may disagree on the precise location of the boundaries of segmented structures. It increases further when multiple structures are predicted simultaneously. Neural networks must deal with these discrepancies, sometimes leading to inconsistencies in segmentation results.
Quantifying these uncertainties allows for the generation of uncertainty maps on medical images, in order to isolate areas where physicians need to pay extra attention. This provides clinicians with better-calibrated predictions and integrates confidence measures into medical image analysis and subsequent decision-making [Kim-Celine Kahl, 2024]
ValUES: A Framework for Systematic Validation of Uncertainty Estimation in Semantic Segmentation
Kim-Celine Kahl and Carsten T. Lüth and Maximilian Zenk and Klaus Maier-Hein and Paul F. Jaeger (2024)
. This not only improves the safety of AI-assisted diagnostics, but also makes algorithms more transparent and reliable for medical applications. Ensemble learning methods, which combine multiple individual models or their predictions, are a common choice for improving the performance of artificial intelligence models [Ganaie, 2022]
Ensemble deep learning: A review
Ganaie, M.A. and Hu, Minghui and Malik, A.K. and Tanveer, M. and Suganthan, P.N. (2022)
Engineering Applications of Artificial Intelligence, vol. 115, pp. 105151.
Machine learning models do not always clearly indicate their level of confidence in the predictions they produce: this is the problem of uncertainty in algorithmic predictions. Furthermore, medical experts may annotate the same image differently due to the ambiguity of certain anatomical structures. These disagreements reduce the quality of the annotations used to train the models and complicate the evaluation of their performance. The left figure below depicts three slices of the CT scan (also called a tomographic or abdominal scan) of the first patient in the dataset provided for the CURVAS challenge (more details below), along with the three experts' annotations of the pancreas, kidneys, and liver. The right figure highlights areas of disagreement:
Contours made by three doctors for different organs on three CT scan slices of the same patient.
Areas of disagreement highlighted in yellow
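Such a disagreement map is straightforward to compute from the expert annotations. A minimal NumPy sketch, using toy label arrays in place of the real annotation volumes:

```python
import numpy as np

# Three hypothetical annotation masks for the same CT slice
# (integer labels: 0 = background, 1 = pancreas, 2 = kidneys, 3 = liver).
rng = np.random.default_rng(0)
ann1 = rng.integers(0, 4, size=(4, 4))
ann2 = ann1.copy()
ann2[0, 0] = (ann2[0, 0] + 1) % 4  # simulate one disagreeing voxel
ann3 = ann1.copy()

# A voxel is "disagreed on" when the three experts do not all assign it
# the same label; True entries form the highlighted areas.
disagreement = ~((ann1 == ann2) & (ann2 == ann3))

print(disagreement.sum())  # number of voxels with inter-expert disagreement
```

On the real data, the same boolean reduction over the three annotators' label volumes yields the yellow regions shown above.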
Theoretically, we distinguish two types of uncertainty which, when combined, give the Predictive Uncertainty (PU):
Aleatoric Uncertainty (AU), which comes from the data itself. It is linked to ambiguities intrinsic to the image: typical causes include artifacts, digitization errors, and disagreements between annotators, as illustrated above.
Epistemic Uncertainty (EU), which comes from the learning model itself. Typical causes include a lack of knowledge (not enough diverse data observed during training) or an architecture unable to properly learn the target structures.
The most notable approach to capturing these uncertainties was introduced by [Alex Kendall, 2017]
What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
Alex Kendall and Yarin Gal (2017)
. It assumes that predictive uncertainty is represented by the Predictive Entropy (PE), which decomposes as the sum of the Mutual Information (MI) and the Expected Entropy (EE), representing epistemic and aleatoric uncertainty respectively. Denoting by H the Shannon entropy, we have:
$$\underbrace{H(Y \mid x)}_{\mathrm{PU}=\mathrm{PE}} = \underbrace{\mathrm{MI}(Y, \Omega \mid x)}_{\mathrm{EU}=\mathrm{MI}} + \underbrace{\mathbb{E}_{\omega \sim \Omega}\left[H(Y \mid \omega, x)\right]}_{\mathrm{AU}=\mathrm{EE}} \qquad \text{(for } x \text{ i.i.d.)}$$
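This decomposition can be computed directly from an ensemble's softmax outputs. A minimal sketch, where `probs` stacks each member's predicted class probabilities (the empirical average over members standing in for the expectation over ω):

```python
import numpy as np

def uncertainty_decomposition(probs):
    """Decompose predictive entropy into epistemic and aleatoric parts.

    probs: array of shape (n_models, n_classes, ...) holding each ensemble
    member's softmax output. Returns (PE, MI, EE) with PE = MI + EE.
    """
    eps = 1e-12  # avoid log(0)
    mean_p = probs.mean(axis=0)  # ensemble-averaged distribution
    # Predictive Entropy: entropy of the averaged distribution (PU).
    pe = -np.sum(mean_p * np.log(mean_p + eps), axis=0)
    # Expected Entropy: average of each member's own entropy (AU).
    ee = -np.sum(probs * np.log(probs + eps), axis=1).mean(axis=0)
    # Mutual Information: the gap between the two (EU).
    mi = pe - ee
    return pe, mi, ee

# Two members that are individually confident but disagree: the averaged
# distribution is uniform, so MI (epistemic uncertainty) is large.
probs = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
pe, mi, ee = uncertainty_decomposition(probs)
```

Here PE equals log 2 (maximal binary entropy), while EE stays low: the uncertainty is mostly epistemic, exactly the signature of model disagreement.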
The interactive figure below, based on the thesis by [Lambert, 2024]
Quantification et caractérisation de l'incertitude de segmentation d'images médicales par des réseaux profonds
, illustrates the two types of uncertainty in a one-dimensional regression task. You can hover over the colored regions to see details, adjust their sizes, or change the shape of the function.
Another very important concept is calibration. Neural networks produce probability distributions over possible class labels, which is a natural measure of confidence. A well-calibrated model should have high confidence for correct predictions and low confidence for incorrect ones. However, modern architectures often fall short of this ideal. To assess calibration, reliability plots (or calibration curves) are used: they compare predicted confidence with observed accuracy, highlighting any miscalibration.
Mathematically, a perfectly calibrated model satisfies:
$$\forall p \in [0,1], \quad \mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) = p$$
In other words, if the model assigns an 80 % probability to a prediction, it should be right 80 % of the time.
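A common way to summarize a reliability plot in one number is the Expected Calibration Error (ECE), a standard metric that bins predictions by confidence and averages the gap between confidence and accuracy. A minimal sketch (not the challenge's exact evaluation code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between predicted confidence and accuracy.

    confidences: max softmax probability per prediction, in (0, 1].
    correct:     boolean array, True where the prediction was right.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()        # observed accuracy in the bin
            conf = confidences[mask].mean()   # average confidence in the bin
            ece += mask.mean() * abs(acc - conf)
    return ece

# A model claiming 80% confidence but right only half the time
# is badly calibrated: its ECE is |0.5 - 0.8| = 0.3.
conf = np.full(10, 0.8)
correct = np.array([True] * 5 + [False] * 5)
ece = expected_calibration_error(conf, correct)
```

A perfectly calibrated model would have an ECE of 0; the reliability plot shows in which confidence range the deviations occur.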
Experiment
Data and model
Held from May to October 2024, the CURVAS Challenge (Calibration and Uncertainty for Multi-Rater Volume Assessment in Multiorgan Segmentation) aimed to develop accurate segmentation models capable of providing both optimal calibration and quantification of inter-expert variability. For this project, we used the dataset released for the challenge, which includes 90 patient CT scans, each annotated by three different experts for the pancreas, kidneys, and liver. The figures above were generated using data from the first patient in the cohort. These CT scans were collected at University Hospital Erlangen between August and October 2023. A total of 20 scans were provided for training (group A), 5 for validation (group A), and 65 for testing (20 in group A, 22 in group B, and 23 in group C) [Riera-Marín, 2024]
CURVAS dataset
Riera-Marín, Meritxell and Kleiß, Joy-Marie and Aubanell, Anton and Antolín, Andreu (2024)
nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation
Fabian Isensee and Jens Petersen and Andre Klein and David Zimmerer and Paul F. Jaeger and Simon Kohl and Jakob Wasserthal and Gregor Koehler and Tobias Norajitra and Sebastian Wirkert and Klaus H. Maier-Hein (2018)
. We based our segmentation models on the nnU-Net framework, which is specifically designed for automated biomedical image segmentation. nnU-Net automatically configures many training parameters based on dataset characteristics. This is particularly valuable in clinical contexts, where medical images often vary in format (2D vs. 3D), resolution, saturation, and acquisition protocol due to the use of different imaging instruments. However, these architectures come with the drawback of being highly computationally intensive, requiring powerful GPUs.
We first trained 9 different models on the training dataset (20 patients): for each annotator, we trained three models with different weight initializations, in order to explore distinct optimization trajectories in the loss landscape. We then ran inference with each model on the test dataset (65 patients). For every model and patient, we systematically generated the predicted probabilities (softmax outputs), which were then used to construct four ensembles by averaging them: one per annotator-specific model triplet, and a general ensemble combining all nine models. Finally, for each patient and all 13 models (individual and ensembles), we computed prediction accuracy, as well as aleatoric and epistemic uncertainty estimates. These computations and results are detailed in the following sections.
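The ensembling step itself amounts to averaging the members' softmax outputs and taking the argmax over classes. A toy sketch of this pipeline, with small arrays standing in for the real CT volumes, plus the Dice score used as a standard accuracy measure for segmentation (the exact metric set of the project is detailed below):

```python
import numpy as np

def build_ensemble(member_probs):
    """Average the softmax outputs of several models: the simple
    probability-averaging ensemble used in this project."""
    stacked = np.stack(member_probs)  # (n_models, n_classes, n_voxels...)
    return stacked.mean(axis=0)

def dice_score(pred_labels, gt_labels, label):
    """Dice overlap for one organ label between prediction and reference."""
    pred = pred_labels == label
    gt = gt_labels == label
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

# Toy example: three "models", two classes, four voxels.
p1 = np.array([[0.9, 0.2, 0.8, 0.6], [0.1, 0.8, 0.2, 0.4]])
p2 = np.array([[0.7, 0.4, 0.9, 0.5], [0.3, 0.6, 0.1, 0.5]])
p3 = np.array([[0.8, 0.3, 0.7, 0.7], [0.2, 0.7, 0.3, 0.3]])

ensemble = build_ensemble([p1, p2, p3])
labels = ensemble.argmax(axis=0)  # final ensemble segmentation
```

The annotator-specific ensembles average three members each; the general ensemble averages all nine, in exactly this fashion.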
Tools and resources
First, we modified the nnU-Net library to add an early-stopping functionality, as training could otherwise run for several days without any notable improvement. Even with early stopping enabled, training U-Net models on three-dimensional CT scans is computationally intensive and requires dedicated GPUs. We therefore used instances available through the Onyxia services provided by Insee and Groupe Genes, to which we had access. Even so, training each model still took nearly a full day.
Moreover, several challenges arose during inference, ensembling, and evaluation. The volume of data transferred at each step was particularly large, and since each instance was limited to 100 GB, we had to process each task and each patient—sometimes even each model—individually on different instances.
Due to the extensive decomposition of each task, we had to pay close attention to reproducibility.
For storage, Insee generously provided us with an S3-compatible storage space (following Amazon's standard) for file transfers, where we were able to store several terabytes of artifacts resulting from model training and inference. We then developed a CLI (command-line interface) using Typer, to interact with the remote storage and launch the various model-related tasks.
The CLI allows for fine-grained task execution (e.g., inference on patient 80 using the third model from annotator 2). The figure below provides an overview of the available commands. The source code for this tool is available on the GitHub repository.
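To give an idea of the shape of such a tool, here is a minimal Typer sketch in the spirit of the CLI described above; the command and argument names are illustrative, not the actual ones from the repository:

```python
import typer

# Hypothetical task-runner CLI: each command maps to one fine-grained task
# (inference, ensembling, ...) so tasks can be dispatched to separate instances.
app = typer.Typer(help="Launch segmentation tasks and talk to remote storage.")

@app.command()
def infer(patient: int, annotator: int, model: int):
    """Run inference for one (patient, annotator, model) triplet."""
    typer.echo(f"Running model {model} of annotator {annotator} on patient {patient}")
    # ... here: fetch inputs from S3, run the model, upload softmax outputs

@app.command()
def ensemble(patient: int):
    """Average the stored softmax outputs for one patient."""
    typer.echo(f"Building ensembles for patient {patient}")
    # ... here: download member probabilities from S3 and average them

if __name__ == "__main__":
    app()
```

Typer derives the argument parsing and `--help` output from the function signatures, which keeps such a task runner short; the real CLI additionally wraps the S3 transfers around each command.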