Embryo ranking agreement between embryologists and artificial intelligence (AI) algorithms.
Academic Article
Abstract
OBJECTIVE: To evaluate the degree of agreement in embryo ranking between embryologists and eight AI algorithms.
DESIGN: Retrospective study.
PATIENTS: A total of 100 cycles with at least eight embryos were selected from the Weill Cornell Medicine database. For each embryo, the full-length time-lapse (TL) video and a single embryo image at 120 h were given to five embryologists and eight AI algorithms for ranking.
INTERVENTIONS: None.
MAIN OUTCOME MEASURE(S): Kendall rank correlation coefficient (Kendall's τ).
RESULTS: Embryologists had a high degree of agreement in the overall ranking of 100 cycles, with an average Kendall's τ (K-τ) of 0.70, slightly lower than the intra-embryologist agreement when using a single image or video (average K-τ = 0.78). Overall agreement between embryologists and the AI algorithms was significantly lower (average K-τ = 0.53) and similar to the observed low inter-AI algorithm agreement (average K-τ = 0.47). Notably, two of the eight algorithms had very low agreement with the other ranking methodologies (average K-τ = 0.05) and with each other (K-τ = 0.01). The average agreement in selecting the best-quality embryo (1/8 in 100 cycles, with an expected agreement by random chance of 12.5%; CI95: 6-19%) was 59.5% among embryologists and 40.3% for six AI algorithms. For the two algorithms with low overall agreement, the corresponding agreement was 11.7%. Agreement on selecting the same top-two embryos per cycle (expected agreement by random chance of 25.0%; CI95: 17-32%) was 73.5% among embryologists and 56.0% among the AI methods, excluding the two discordant algorithms, which had an average agreement of 24.4%, within the range expected by random chance. Intra-embryologist ranking agreement (single image vs. video) was 71.7% and 77.8% for the single best and top-two embryos, respectively.
Analysis of average raw scores indicated that cycles with low diversity of embryo quality generally resulted in lower overall agreement between the methods (embryologists and AI models).
CONCLUSIONS: To our knowledge, this is the first study to evaluate the level of agreement in ranking embryo quality between different AI algorithms and embryologists. The different concordance methods were consistent and indicated that the highest agreement was intra-embryologist agreement, followed by inter-embryologist agreement. In contrast, the agreement between some of the AI algorithms and embryologists was similar to the inter-AI algorithm agreement, which also showed a wide range of pair-wise concordance. Specifically, two AI models showed intra- and inter-agreement at the level expected from random selection.
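The two quantities reported above, Kendall's τ and the chance-level agreement for picking the same top embryo, can be illustrated with a minimal sketch. The abstract does not describe the authors' implementation, so the details here are assumptions: the function names `kendall_tau` and `chance_agreement` are hypothetical, rankings are taken to be tie-free (tau-a), the top-two chance level is read as one rater's best embryo falling within the other's top two (2/8 = 25%), and the interval uses a normal approximation to the binomial. That approximation gives about 6-19% for the top-1 case but a slightly wider interval than the reported 17-32% for the top-two case, so the study may have used a different method.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two tie-free rankings of the same embryos.

    rank_a[i] and rank_b[i] are the ranks two raters gave embryo i.
    """
    if len(rank_a) != len(rank_b):
        raise ValueError("both rankings must cover the same embryos")
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # A pair is concordant when both raters order embryos i and j the same way.
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def chance_agreement(n_embryos, top_k, n_cycles, z=1.96):
    """Chance-level agreement and a normal-approximation 95% interval.

    Assumption: agreement means one rater's best embryo lies within the
    other rater's top_k, so the chance probability is top_k / n_embryos.
    """
    p = top_k / n_embryos
    half_width = z * sqrt(p * (1 - p) / n_cycles)
    return p, p - half_width, p + half_width


# Illustrative (made-up) rankings of eight embryos by two raters.
rater_1 = [1, 2, 3, 4, 5, 6, 7, 8]
rater_2 = [2, 1, 3, 4, 6, 5, 7, 8]
tau = kendall_tau(rater_1, rater_2)

# Chance level for choosing the same best embryo among 8, over 100 cycles.
p_top1, low_top1, high_top1 = chance_agreement(8, 1, 100)
```

In the example rankings, two of the 28 embryo pairs are ordered differently by the two raters, so τ = (26 - 2) / 28. Averaging such pairwise values over cycles and rater pairs is one plausible way to arrive at summary figures like those in the abstract, though the exact aggregation is not stated there.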