Evaluating cell AI foundation models in kidney pathology with human-in-the-loop enrichment.
Academic Article
Abstract
BACKGROUND: Large-scale artificial intelligence foundation models have emerged as promising tools for addressing healthcare challenges, including digital pathology. While many have been developed for complex tasks such as disease diagnosis and tissue quantification using extensive and diverse datasets, their readiness for seemingly simpler tasks, such as nuclei segmentation within a single organ (for example, the kidney), remains unclear. This study addresses two questions: how good are current cell foundation models, and how can they be improved?

METHODS: We curated a multi-center, multi-disease, and multi-species dataset sampled from 2,542 kidney whole slide images. Three state-of-the-art cell foundation models (Cellpose, StarDist, and CellViT) were evaluated. To enhance performance, we developed a human-in-the-loop strategy that distilled multi-model predictions, improving data quality while reducing reliance on pixel-level annotation. The models were fine-tuned on the enriched datasets, and segmentation performance was assessed quantitatively.

RESULTS: We show that cell nuclei segmentation in kidney pathology still requires improvement through more organ-targeted foundation models. Among the evaluated models, CellViT achieves the highest baseline performance, with an F1 score of 0.78. Fine-tuning with enriched data improves all three models, with StarDist achieving the highest F1 score of 0.82. Combining foundation-model-generated pseudo-labels with a subset of pathologist-corrected "hard" patches yields consistent performance gains across all models.

CONCLUSIONS: This study establishes a benchmark for the development and deployment of cell AI foundation models tailored to real-world data. The proposed framework, which leverages foundation models with reduced expert annotation, supports more efficient workflows in clinical pathology.
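The F1 scores reported above summarize how well predicted nuclei agree with reference nuclei. The abstract does not state the matching rule used, so the following is only a minimal sketch of one common convention: a predicted nucleus counts as a true positive when it overlaps an unmatched reference nucleus with intersection-over-union (IoU) of at least 0.5, and F1 = 2TP / (2TP + FP + FN). The helper names (`square`, `iou`, `match_counts`, `f1_score`), the greedy matching, and the 0.5 threshold are illustrative assumptions, not the study's evaluation code.

```python
from typing import FrozenSet, List, Tuple

Pixel = Tuple[int, int]
Nucleus = FrozenSet[Pixel]  # one nucleus = the set of pixels it covers


def square(row: int, col: int, size: int) -> Nucleus:
    """Hypothetical toy nucleus: a size x size block of pixels."""
    return frozenset((r, c) for r in range(row, row + size)
                     for c in range(col, col + size))


def iou(a: Nucleus, b: Nucleus) -> float:
    """Intersection-over-union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_counts(pred: List[Nucleus], truth: List[Nucleus],
                 iou_threshold: float = 0.5) -> Tuple[int, int, int]:
    """Greedy one-to-one matching; returns (TP, FP, FN).

    The 0.5 threshold is an assumption, not a value taken from the study.
    """
    pairs = sorted(((iou(p, t), i, j)
                    for i, p in enumerate(pred)
                    for j, t in enumerate(truth)), reverse=True)
    used_pred, used_truth = set(), set()
    tp = 0
    for score, i, j in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little to count
        if i in used_pred or j in used_truth:
            continue  # each nucleus may be matched at most once
        used_pred.add(i)
        used_truth.add(j)
        tp += 1
    return tp, len(pred) - tp, len(truth) - tp


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when nothing is present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    truth = [square(0, 0, 4), square(10, 10, 4), square(20, 20, 4)]
    pred = [square(0, 0, 4), square(11, 10, 4), square(30, 30, 4)]
    tp, fp, fn = match_counts(pred, truth)
    print(tp, fp, fn, round(f1_score(tp, fp, fn), 3))
```

In this toy case, two of three predictions match a reference nucleus (IoU 1.0 and 0.6), leaving one false positive and one false negative. Sets of pixel coordinates keep the sketch dependency-free; on real patches, labeled mask arrays would be the more practical representation.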