H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark
Solim LeGris, Wai Keen Vong, Brenden M. Lake, Todd M. Gureckis
arXiv:2409.01374 (arXiv - CS - Artificial Intelligence), published 2024-09-02
The Abstraction and Reasoning Corpus (ARC) is a visual program synthesis
benchmark designed to test challenging out-of-distribution generalization in
humans and machines. Since 2019, limited progress has been observed on the
challenge using existing artificial intelligence methods. Comparing human and
machine performance is important for the validity of the benchmark. While
previous studies have explored how well humans can solve tasks from the ARC benchmark,
they either used only a subset of tasks from the original dataset or tasks from
variants of ARC, and therefore provided only a tentative estimate of human
performance. In this work, we obtain a more robust estimate of human
performance by evaluating 1729 humans on the full set of 400 training and 400
evaluation tasks from the original ARC problem set. We estimate that average
human performance lies between 73.3% and 77.2% correct with a reported
empirical average of 76.2% on the training set, and between 55.9% and 68.9%
correct with a reported empirical average of 64.2% on the public evaluation
set. However, we also find that 790 out of the 800 tasks were solvable by at
least one person in three attempts, suggesting that the vast majority of the
publicly available ARC tasks are in principle solvable by typical crowd-workers
recruited over the internet. Notably, while these numbers are slightly lower
than earlier estimates, human performance still greatly exceeds current
state-of-the-art approaches for solving ARC. To facilitate research on ARC, we
publicly release our dataset, called H-ARC (human-ARC), which includes all of
the submissions and action traces from human participants.
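
Since the abstract reports aggregate statistics (empirical average accuracy per split and the number of tasks solved by at least one participant within three attempts), the following is a minimal sketch of how such summaries could be computed from per-attempt submission records. The record fields (participant_id, task_id, split, attempt, correct) are hypothetical placeholders for illustration; the released H-ARC dataset may use a different schema.

# Hypothetical sketch: computing H-ARC-style summary statistics from
# per-attempt submission records. Field names are assumptions, not the
# actual schema of the released dataset.
from collections import defaultdict

def summarize(records):
    """records: iterable of dicts, one per submitted attempt."""
    solved_within_three = defaultdict(bool)   # task_id -> solved by anyone in <= 3 attempts
    per_person_task = defaultdict(bool)       # (participant_id, task_id) -> solved in <= 3 attempts
    split_of_task = {}                        # task_id -> "training" or "evaluation"

    for r in records:
        if r["attempt"] > 3:                  # scoring assumed to allow three attempts per task
            continue
        split_of_task[r["task_id"]] = r["split"]
        if r["correct"]:
            solved_within_three[r["task_id"]] = True
        key = (r["participant_id"], r["task_id"])
        per_person_task[key] = per_person_task[key] or r["correct"]

    # Empirical accuracy per split: fraction of (participant, task) pairs solved.
    outcomes_by_split = defaultdict(list)
    for (_, task_id), solved in per_person_task.items():
        outcomes_by_split[split_of_task[task_id]].append(solved)

    return {
        "accuracy_by_split": {s: sum(v) / len(v) for s, v in outcomes_by_split.items()},
        "tasks_solved_by_at_least_one": sum(solved_within_three.values()),
    }

Under these assumptions, accuracy_by_split would correspond to the reported empirical averages (76.2% on training, 64.2% on evaluation), and tasks_solved_by_at_least_one to the 790-of-800 figure.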