Crowdsourcing Skin Demarcations of Chronic Graft-Versus-Host Disease in Patient Photographs: Training Versus Performance Study.

JMIR Dermatology (JCR Q3, Medicine), vol. 6, e48589. Published: 2023-12-26. DOI: 10.2196/48589

Andrew J McNeil, Kelsey Parks, Xiaoqi Liu, Bohan Jiang, Joseph Coco, Kira McCool, Daniel Fabbri, Erik P Duhaime, Benoit M Dawant, Eric R Tkaczyk



Abstract

Background: Chronic graft-versus-host disease (cGVHD) is a significant cause of long-term morbidity and mortality in patients after allogeneic hematopoietic cell transplantation. Skin is the most commonly affected organ, and visual assessment of cGVHD can have low reliability. Crowdsourcing data from nonexpert participants has been used for numerous medical applications, including image labeling and segmentation tasks.

Objective: This study aimed to assess the ability of crowds of nonexpert raters (individuals without any prior training in identifying or marking cGVHD) to demarcate photos of cGVHD-affected skin. We also studied the effect of training and feedback on crowd performance.

Methods: Using a Canfield Vectra H1 3D camera, 360 photographs of the skin of 36 patients with cGVHD were taken. Ground truth demarcations were provided in 3D by a trained expert and reviewed by a board-certified dermatologist. In total, 3000 2D images (projections from various angles) were created for crowd demarcation through the DiagnosUs mobile app. Raters were split into high and low feedback groups. The performance of 4 different crowds of nonexperts was analyzed: 17 raters per image from the high feedback group, 17 raters per image from the low feedback group, 32-35 raters per image from the low feedback group, and the top 5 performers for each image from the low feedback group.

Results: Across 8 demarcation competitions, 130 raters were recruited to the high feedback group and 161 to the low feedback group. This resulted in a total of 54,887 individual demarcations from the high feedback group and 78,967 from the low feedback group. The nonexpert crowds achieved good overall performance in segmenting cGVHD-affected skin with minimal training: the median surface area error was less than 12% of skin pixels for all crowds in both the high and low feedback groups. The low feedback crowds performed slightly worse than the high feedback crowd, even when a larger crowd was used. Tracking the 5 most reliable raters from the low feedback group for each image recovered a performance similar to that of the high feedback crowd. Higher variability between raters for a given image was not found to correlate with lower performance of the crowd consensus demarcation and therefore cannot be used as a measure of reliability. No significant learning was observed during the task as more photos and feedback were seen.
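To make the reported metric concrete, the following is a minimal sketch of how a crowd consensus demarcation and a surface area error could be computed for one image. The abstract does not specify either procedure, so both are assumptions here: the consensus is taken as a pixel-wise majority vote, and the error as the absolute difference between the crowd's and the expert's affected area, divided by the number of skin pixels. The function names `consensus_mask` and `surface_area_error` and the toy masks are hypothetical.

```python
def consensus_mask(rater_masks):
    """Pixel-wise majority vote over equally sized binary masks.

    Assumption: the study's actual consensus rule is not stated in the abstract.
    """
    n_raters = len(rater_masks)
    rows, cols = len(rater_masks[0]), len(rater_masks[0][0])
    return [
        [1 if 2 * sum(m[r][c] for m in rater_masks) > n_raters else 0
         for c in range(cols)]
        for r in range(rows)
    ]


def surface_area_error(pred_mask, truth_mask, skin_mask):
    """|predicted area - expert area| as a fraction of skin pixels.

    Assumption: this definition is illustrative, not taken from the paper.
    Only pixels inside the skin mask are counted.
    """
    skin = pred = truth = 0
    for p_row, t_row, s_row in zip(pred_mask, truth_mask, skin_mask):
        for p, t, s in zip(p_row, t_row, s_row):
            if s:
                skin += 1
                pred += p
                truth += t
    return abs(pred - truth) / skin


# Toy 2x4 image in which every pixel is skin (hypothetical data).
skin = [[1, 1, 1, 1], [1, 1, 1, 1]]
expert = [[1, 1, 0, 0], [1, 1, 0, 0]]
raters = [
    [[1, 1, 0, 0], [1, 1, 0, 0]],
    [[1, 1, 1, 0], [1, 1, 0, 0]],
    [[1, 0, 1, 0], [1, 1, 0, 0]],
]

crowd = consensus_mask(raters)
error = surface_area_error(crowd, expert, skin)
print(crowd)   # [[1, 1, 1, 0], [1, 1, 0, 0]]
print(error)   # 0.125
```

In this toy case the crowd marks 5 pixels where the expert marked 4, out of 8 skin pixels, giving an error of 12.5%. A per-crowd summary like the one reported would then be the median of such per-image errors.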

Conclusions: Crowds of nonexpert raters can demarcate cGVHD images with good overall performance. Tracking the top 5 most reliable raters provided optimal results, obtaining the best performance with the lowest number of expert demarcations required for adequate training. However, the agreement amongst individual nonexperts does not help predict whether the crowd has provided an accurate result. Future work should explore the performance of crowdsourcing in standard clinical photos and further methods to estimate the reliability of consensus demarcations.
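The idea of tracking the most reliable raters can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it supposes that each rater already has error scores on images with expert ground truth, ranks raters by mean error, and keeps the k best. The function name `top_k_raters`, the rater IDs, and the error values are all hypothetical.

```python
from statistics import mean


def top_k_raters(errors_by_rater, k=5):
    """Return the k rater IDs with the lowest mean error.

    errors_by_rater maps a rater ID to that rater's errors on images
    with expert ground truth (assumed to be available beforehand).
    Ties are broken by rater ID so the result is deterministic.
    """
    ranked = sorted(
        errors_by_rater,
        key=lambda rid: (mean(errors_by_rater[rid]), rid),
    )
    return ranked[:k]


# Hypothetical per-rater errors on two expert-labeled images.
example = {
    "a": [0.10, 0.20],
    "b": [0.05, 0.05],
    "c": [0.30, 0.40],
    "d": [0.15, 0.05],
}

best = top_k_raters(example, k=2)
print(best)  # ['b', 'd']
```

Only the selected raters' demarcations would then be combined into the consensus for a new image. The trade-off named in the conclusions is that this ranking needs some expert demarcations to score raters against.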


