Intra- and interobserver agreement of proposed objective transvaginal ultrasound image-quality scoring system for use in artificial intelligence algorithm development.
Objectives: The development of valuable artificial intelligence (AI) tools to assist with ultrasound diagnosis depends on algorithms developed using high-quality data. This study aimed to test the intra- and interobserver agreement of a proposed image-quality scoring system to quantify the quality of gynecological transvaginal ultrasound (TVS) images, which could be used in clinical practice and AI tool development.
Methods: A proposed scoring system to quantify TVS image quality was created following a review of the literature. Under this system, a rater assigns each ultrasound image a score of 1-4 (2 = poor, 3 = suboptimal and 4 = optimal image quality); an image deemed inaccurate is assigned a score of 1, corresponding to 'reject'. Six professionals, including two radiologists, two sonographers and two sonologists, reviewed 150 images (50 images of the uterus and 100 images of the ovaries) obtained from 50 women, assigning each image a score of 1-4. Each rater reviewed all images a second time after an interval of at least 1 week. Mean scores were calculated for each rater. Overall interobserver agreement was assessed using the intraclass correlation coefficient (ICC), and interobserver agreement between paired professionals and intraobserver agreement for all professionals were assessed using weighted Cohen's kappa and ICC.
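The weighted Cohen's kappa used here for rater agreement on the ordinal 1-4 scale can be sketched in pure Python. Note that the weighting scheme (quadratic, shown below) is an assumption for illustration; the abstract does not state which weights were used.

```python
def quadratic_weighted_kappa(r1, r2, categories=(1, 2, 3, 4)):
    """Quadratic-weighted Cohen's kappa for two raters' ordinal scores.

    r1, r2: equal-length sequences of scores drawn from `categories`.
    Returns 1 for perfect agreement, 0 for chance-level agreement.
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)
    # Observed joint proportion matrix: obs[i][j] = P(rater1 = i, rater2 = j)
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[idx[a]][idx[b]] += 1.0 / n
    # Marginal distributions for each rater
    m1 = [sum(row) for row in obs]
    m2 = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # Quadratic disagreement weights: larger penalty for scores further apart
    w = [[((i - j) ** 2) / ((k - 1) ** 2) for j in range(k)] for i in range(k)]
    observed = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(w[i][j] * m1[i] * m2[j] for i in range(k) for j in range(k))
    return 1.0 - observed / expected
```

Identical score sequences yield a kappa of 1, while systematic maximal disagreement drives it toward -1; values between roughly 0.4 and 0.8 correspond to the weak-to-moderate range reported in the Results.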
Results: Poor levels of interobserver agreement were obtained between the six raters for all 150 images (ICC, 0.480 (95% CI, 0.363-0.586)), as well as for assessment of the uterine images only (ICC, 0.359 (95% CI, 0.204-0.523)). Moderate agreement was achieved for the ovarian images (ICC, 0.531 (95% CI, 0.417-0.636)). Agreement between the paired sonographers and sonologists was poor for all images (ICC, 0.336 (95% CI, -0.078 to 0.619) and 0.425 (95% CI, 0.014-0.665), respectively), as well as when images were grouped into uterine images (ICC, 0.253 (95% CI, -0.097 to 0.577) and 0.299 (95% CI, -0.094 to 0.606), respectively) and ovarian images (ICC, 0.400 (95% CI, -0.043 to 0.669) and 0.469 (95% CI, 0.088-0.689), respectively). Agreement between the paired radiologists was moderate for all images (ICC, 0.600 (95% CI, 0.487-0.693)) and for their assessment of uterine images (ICC, 0.538 (95% CI, 0.311-0.707)) and ovarian images (ICC, 0.621 (95% CI, 0.483-0.728)). Weak-to-moderate intraobserver agreement was seen for each of the raters, with weighted Cohen's kappa ranging from 0.533 to 0.718 for all images and from 0.467 to 0.751 for ovarian images. Similarly, for all raters, the ICC indicated moderate-to-good intraobserver agreement for all images overall (ICC range, 0.636-0.825) and for ovarian images (ICC range, 0.596-0.862). Slightly better intraobserver agreement was seen for uterine images, with weighted Cohen's kappa ranging from 0.568 to 0.808, indicating weak-to-strong agreement, and ICC ranging from 0.546 to 0.893, indicating moderate-to-good agreement. All measures were statistically significant (P < 0.001).
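The ICC values reported above can be illustrated with a minimal sketch. The specific ICC form used by the study is not stated in the abstract, so the two-way random-effects, single-measures model ICC(2,1) implemented below is an assumption for illustration only.

```python
def icc2_1(scores):
    """ICC(2,1): two-way random effects, single measures, absolute agreement.

    scores: list of n subjects, each a list of k raters' scores.
    Computed from the two-way ANOVA mean squares (rows = subjects,
    columns = raters).
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]
    # Sums of squares for subjects (rows), raters (columns), total, residual
    ssr = k * sum((rm - grand) ** 2 for rm in row_means)
    ssc = n * sum((cm - grand) ** 2 for cm in col_means)
    sst = sum((scores[i][j] - grand) ** 2 for i in range(n) for j in range(k))
    sse = sst - ssr - ssc
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

On this scale, perfectly matching raters give an ICC of 1, while the values near 0.3-0.5 reported for some rater pairs reflect substantial residual disagreement relative to between-image variability.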
Journal information:
Ultrasound in Obstetrics & Gynecology (UOG) is the official journal of the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) and is considered the foremost international peer-reviewed journal in the field. It publishes cutting-edge research that is highly relevant to clinical practice, which includes guidelines, expert commentaries, consensus statements, original articles, and systematic reviews. UOG is widely recognized and included in prominent abstract and indexing databases such as Index Medicus and Current Contents.