Embryo ranking agreement between embryologists and artificial intelligence algorithms

Nikica Zaninovic M.S., Ph.D., Jose T. Sierra Ph.D., Jonas E. Malmsten D.P.S., Zev Rosenwaks M.D.

F&S Science 5(1):50–57, February 2024. DOI: 10.1016/j.xfss.2023.10.002
https://www.sciencedirect.com/science/article/pii/S2666335X23000575
Abstract
Objective
To evaluate the degree of agreement in embryo ranking between embryologists and eight artificial intelligence (AI) algorithms.
Design
Retrospective study.
Patient(s)
A total of 100 cycles with at least eight embryos each were selected from the Weill Cornell Medicine database. For each embryo, the full-length time-lapse (TL) video, as well as a single image captured at 120 hours, was given to five embryologists and eight AI algorithms for ranking.
Intervention(s)
None.
Main Outcome Measure(s)
Kendall rank correlation coefficient (Kendall's τ).
Result(s)
Embryologists had a high degree of agreement in the overall ranking of 100 cycles, with an average Kendall's tau (K-τ) of 0.70, slightly lower than the interembryologist agreement when using a single image or video (average K-τ = 0.78). Overall agreement between embryologists and the AI algorithms was significantly lower (average K-τ = 0.53) and similar to the observed low inter-AI algorithm agreement (average K-τ = 0.47). Notably, two of the eight algorithms had very low agreement with the other ranking methodologies (average K-τ = 0.05) and with each other (K-τ = 0.01). The average agreement in selecting the best-quality embryo (1 of 8 in 100 cycles, with an expected agreement by random chance of 12.5%; CI95: 6%–19%) was 59.5% among embryologists and 40.3% for six AI algorithms; for the two algorithms with low overall agreement, it was 11.7%. Agreement on selecting the same top two embryos per cycle (expected agreement by random chance of 25.0%; CI95: 17%–32%) was 73.5% among embryologists and 56.0% among AI methods, excluding the two discordant algorithms, whose average agreement of 24.4% fell within the range expected by random chance. Intraembryologist ranking agreement (single image vs. video) was 71.7% for the single best embryo and 77.8% for the top two embryos. Analysis of average raw scores indicated that cycles with low diversity of embryo quality generally resulted in lower overall agreement between the methods (embryologists and AI models).
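To illustrate the statistics behind these figures, the following is a minimal sketch (not the authors' code; it assumes Python with NumPy and SciPy, and the example rankings are hypothetical rather than study data) of how pairwise Kendall's τ can be computed between two rankings of eight embryos, and how the quoted chance baselines (12.5% for picking the same best embryo, 25% expected overlap for the top two) can be checked by simulation.

```python
# Minimal sketch (hypothetical data): pairwise Kendall's tau between two
# rankings of 8 embryos, plus Monte Carlo checks of the chance baselines
# quoted in the abstract (12.5% top-1 agreement, 25% expected top-2 overlap).
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

# Hypothetical example: ranks assigned to the same 8 embryos by two raters
# (1 = best quality, 8 = worst).
rater_a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
rater_b = np.array([2, 1, 3, 5, 4, 6, 8, 7])

tau, p_value = kendalltau(rater_a, rater_b)
print(f"Kendall's tau between the two rankings: {tau:.2f}")

# Chance baselines: two independent random orderings of 8 embryos pick the
# same best embryo with probability 1/8 = 12.5%, and two random top-2 sets
# share on average 2 * (2/8) = 0.5 embryos, i.e. a 25% overlap fraction.
n_sim = 200_000
top1_hits = 0
top2_overlap = 0.0
for _ in range(n_sim):
    pick_a = rng.permutation(8)
    pick_b = rng.permutation(8)
    top1_hits += pick_a[0] == pick_b[0]
    shared = len(set(pick_a[:2]) & set(pick_b[:2]))
    top2_overlap += shared / 2
print(f"Simulated top-1 chance agreement: {top1_hits / n_sim:.3f} (expected 0.125)")
print(f"Simulated top-2 chance overlap:   {top2_overlap / n_sim:.3f} (expected 0.250)")
```

The simulation simply confirms the analytical baselines against which the observed 59.5%/40.3% (top-1) and 73.5%/56.0% (top-2) agreement rates are compared.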
Conclusion(s)
To our knowledge, this is the first study to evaluate the level of agreement in ranking embryo quality between different AI algorithms and embryologists. The different concordance methods were consistent and indicated that intraembryologist agreement was highest, followed by interembryologist agreement. In contrast, the agreement between some of the AI algorithms and embryologists was similar to the inter-AI algorithm agreement, which also showed a wide range of pairwise concordance. Specifically, two AI models showed intra- and inter-algorithm agreement at the level expected from random selection.