Carlos Aguirre, Shiye Cao, Amama Mahmood, Chien-Ming Huang
Title: Crowdsourcing Thumbnail Captions: Data Collection and Validation
Journal: ACM Transactions on Interactive Intelligent Systems (Q2, Computer Science, Artificial Intelligence; Impact Factor 3.6)
DOI: 10.1145/3589346
Publication date: 2023-03-28
Publication type: Journal Article
Citations: 0
Abstract
Speech interfaces, such as personal assistants and screen readers, read image captions to users—but typically only one caption is available per image, which may not be adequate for all situations (e.g., browsing large quantities of images). Long captions provide a deeper understanding of an image but require more time to listen to, whereas shorter captions may not allow for such thorough comprehension, yet have the advantage of being faster to consume. We explore how to effectively collect both thumbnail captions—succinct image descriptions meant to be consumed quickly—and comprehensive captions—which allow individuals to understand visual content in greater detail; we consider text-based instructions and time-constrained methods to collect descriptions at these two levels of detail and find that a time-constrained method is the most effective for collecting thumbnail captions while preserving caption accuracy. Additionally, we verify that caption authors using this time-constrained method are still able to focus on the most important regions of an image by tracking their eye gaze. We evaluate our collected captions along human-rated axes—correctness, fluency, amount of detail, and mentions of important concepts—and discuss the potential for model-based metrics to perform large-scale automatic evaluations in the future.
About the journal:
The ACM Transactions on Interactive Intelligent Systems (TiiS) publishes research on the design, realization, and evaluation of interactive systems that incorporate some form of machine intelligence. TiiS articles come from a wide range of research areas and communities. An article can take any of several complementary views of interactive intelligent systems, focusing on:
the intelligent technology,
the interaction of users with the system, or
both aspects at once.