Embryo ranking agreement between embryologists and artificial intelligence algorithms

Nikica Zaninovic M.S., Ph.D., Jose T. Sierra Ph.D., Jonas E. Malmsten D.P.S., Zev Rosenwaks M.D.

F&S Science 5(1):50–57, February 2024. DOI: 10.1016/j.xfss.2023.10.002
https://www.sciencedirect.com/science/article/pii/S2666335X23000575
Abstract
Objective
To evaluate the degree of agreement in embryo ranking between embryologists and eight artificial intelligence (AI) algorithms.
Design
Retrospective study.
Patient(s)
A total of 100 cycles with at least eight embryos each were selected from the Weill Cornell Medicine database. For each embryo, the full-length time-lapse (TL) video, as well as a single image captured at 120 hours, was given to five embryologists and eight AI algorithms for ranking.
Intervention(s)
None.
Main Outcome Measure(s)
Kendall rank correlation coefficient (Kendall's τ).
Result(s)
Embryologists had a high degree of agreement in the overall ranking of 100 cycles, with an average Kendall's tau (K-τ) of 0.70, slightly lower than the interembryologist agreement when using a single image or video (average K-τ = 0.78). Overall agreement between embryologists and the AI algorithms was significantly lower (average K-τ = 0.53) and similar to the observed low inter-AI algorithm agreement (average K-τ = 0.47). Notably, two of the eight algorithms had very low agreement with the other ranking methodologies (average K-τ = 0.05) and with each other (K-τ = 0.01). The average agreement in selecting the best-quality embryo (1 of 8 in 100 cycles, with an expected agreement by random chance of 12.5%; CI95: 6%–19%) was 59.5% among embryologists and 40.3% for six AI algorithms; for the two algorithms with low overall agreement, it was 11.7%. Agreement on selecting the same top two embryos per cycle (expected agreement by random chance of 25.0%; CI95: 17%–32%) was 73.5% among embryologists and 56.0% among AI methods, excluding the two discordant algorithms, whose average agreement of 24.4% fell within the range expected by random chance. Intraembryologist ranking agreement (single image vs. video) was 71.7% for the single best embryo and 77.8% for the top two embryos. Analysis of average raw scores indicated that cycles with low diversity of embryo quality generally resulted in lower overall agreement between the methods (embryologists and AI models).
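To illustrate the statistics behind these figures, the following is a minimal sketch (not the authors' code; it assumes Python with NumPy and SciPy, and the example rankings are hypothetical rather than study data) of how pairwise Kendall's τ can be computed between two rankings of eight embryos, and how the quoted chance baselines (12.5% for picking the same best embryo, 25% expected overlap for the top two) can be checked by simulation.

```python
# Minimal sketch (hypothetical data): pairwise Kendall's tau between two
# rankings of 8 embryos, plus Monte Carlo checks of the chance baselines
# quoted in the abstract (12.5% top-1 agreement, 25% expected top-2 overlap).
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

# Hypothetical example: ranks assigned to the same 8 embryos by two raters
# (1 = best quality, 8 = worst).
rater_a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
rater_b = np.array([2, 1, 3, 5, 4, 6, 8, 7])

tau, p_value = kendalltau(rater_a, rater_b)
print(f"Kendall's tau between the two rankings: {tau:.2f}")

# Chance baselines: two independent random orderings of 8 embryos pick the
# same best embryo with probability 1/8 = 12.5%, and two random top-2 sets
# share on average 2 * (2/8) = 0.5 embryos, i.e. a 25% overlap fraction.
n_sim = 200_000
top1_hits = 0
top2_overlap = 0.0
for _ in range(n_sim):
    pick_a = rng.permutation(8)
    pick_b = rng.permutation(8)
    top1_hits += pick_a[0] == pick_b[0]
    shared = len(set(pick_a[:2]) & set(pick_b[:2]))
    top2_overlap += shared / 2
print(f"Simulated top-1 chance agreement: {top1_hits / n_sim:.3f} (expected 0.125)")
print(f"Simulated top-2 chance overlap:   {top2_overlap / n_sim:.3f} (expected 0.250)")
```

The simulation simply confirms the analytical baselines against which the observed 59.5%/40.3% (top-1) and 73.5%/56.0% (top-2) agreement rates are compared.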
Conclusion(s)
To our knowledge, this is the first study to evaluate the level of agreement in ranking embryo quality between different AI algorithms and embryologists. The different concordance methods were consistent and indicated that intraembryologist agreement was highest, followed by interembryologist agreement. In contrast, the agreement between some of the AI algorithms and embryologists was similar to the inter-AI algorithm agreement, which also showed a wide range of pairwise concordance. Specifically, two AI models showed intra- and inter-algorithm agreement at the level expected from random selection.