
BenchCouncil Transactions on Benchmarks, Standards and Evaluations: Latest Publications

Benchmarking ChatGPT for Prototyping Theories: Experimental Studies Using the Technology Acceptance Model
Pub Date : 2024-02-01 DOI: 10.1016/j.tbench.2024.100153
Yanwu Yang, T. Goh, Xin Dai
{"title":"Benchmarking ChatGPT for Prototyping Theories: Experimental Studies Using the Technology Acceptance Model","authors":"Yanwu Yang, T. Goh, Xin Dai","doi":"10.1016/j.tbench.2024.100153","DOIUrl":"https://doi.org/10.1016/j.tbench.2024.100153","url":null,"abstract":"","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"21 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139875928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A pluggable single-image super-resolution algorithm based on second-order gradient loss
Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2023.100148
Shuran Lin, Chunjie Zhang, Yanwu Yang
{"title":"A pluggable single-image super-resolution algorithm based on second-order gradient loss","authors":"Shuran Lin, Chunjie Zhang, Yanwu Yang","doi":"10.1016/j.tbench.2023.100148","DOIUrl":"https://doi.org/10.1016/j.tbench.2023.100148","url":null,"abstract":"","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"690 23","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139022929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Analyzing the potential benefits and use cases of ChatGPT as a tool for improving the efficiency and effectiveness of business operations
Pub Date : 2023-09-01 DOI: 10.1016/j.tbench.2023.100140
Rohit Raj, Arpit Singh, Vimal Kumar, Pratima Verma

The study addresses the potential benefits for companies of adopting ChatGPT, a popular chatbot built on a large-scale transformer-based language model known as a generative pre-trained transformer (GPT). Chatbots like ChatGPT may improve customer service, handle several client inquiries at once, and save operational costs. Moreover, ChatGPT may automate routine processes such as order tracking and billing, allowing human employees to focus on more complex and strategic responsibilities. Nevertheless, before deploying ChatGPT, enterprises must carefully analyze its use cases and restrictions, as well as its strengths and weaknesses. ChatGPT, for example, requires training data specific to the business domain and may otherwise produce erroneous or ambiguous results. Drawing on the literature currently available on ChatGPT, large language models, and artificial intelligence, the study identifies areas in which enterprises could benefit from deploying ChatGPT. The potential advantages are then weighed and prioritized using the PSI (Preference Selection Index) and COPRAS (Complex Proportional Assessment) approaches. By highlighting current trends and possible advantages in the industry, this editorial seeks to provide insight into the present state of employing ChatGPT in enterprises and research. ChatGPT may also learn biases from its training data and generate replies that reinforce those biases. As a result, enterprises must train and fine-tune ChatGPT for their specific operations, set explicit boundaries and limitations on its use, and implement appropriate security measures to guard against malicious input. The study addresses a gap in the literature by outlining ChatGPT's potential benefits for businesses, analyzing its strengths and limits, and offering insights into how organizations might use ChatGPT's capabilities to enhance their operations.
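The prioritization step rests on standard multi-criteria decision-making methods. As a concrete illustration, below is a minimal sketch of the standard Preference Selection Index (PSI) computation the abstract names; the criteria, scores, and use cases in the example are hypothetical and not taken from the paper.

```python
import numpy as np

def psi_rank(matrix, benefit):
    """Rank alternatives with the Preference Selection Index (PSI).

    matrix  : (m alternatives x n criteria) performance scores
    benefit : per-criterion flag, True if larger is better
    """
    X = np.asarray(matrix, dtype=float)
    # Normalize each criterion column (benefit: x/max, cost: min/x).
    N = np.where(benefit, X / X.max(axis=0), X.min(axis=0) / X)
    # Preference variation of each criterion around its column mean.
    PV = ((N - N.mean(axis=0)) ** 2).sum(axis=0)
    # Criterion weights from the deviation in preference values.
    phi = 1.0 - PV
    psi = phi / phi.sum()
    # Overall preference selection index per alternative.
    theta = N @ psi
    return theta, theta.argsort()[::-1]  # scores and ranking (best first)

# Hypothetical criteria (cost saving, response quality, deployment effort,
# where effort is a cost criterion) scored for three hypothetical use cases.
scores = [[7, 8, 3],   # customer service
          [9, 6, 5],   # order tracking
          [5, 7, 2]]   # billing automation
theta, ranking = psi_rank(scores, benefit=[True, True, False])
print(theta, ranking)
```

Larger index values indicate more preferred alternatives; COPRAS would weigh the same matrix differently, separating benefit and cost criteria before aggregation.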

Citations: 4
Algorithmic fairness in social context
Pub Date : 2023-09-01 DOI: 10.1016/j.tbench.2023.100137
Yunyou Huang, Wenjing Liu, Wanling Gao, Xiangjiang Lu, Xiaoshuang Liang, Zhengxin Yang, Hongxiao Li, Li Ma, Suqin Tang

Algorithmic fairness research is currently receiving significant attention, aiming to ensure that algorithms do not discriminate between different groups or between individuals with similar characteristics. However, with the spread of algorithms into all aspects of society, algorithms have changed from mere instruments into social infrastructure. For instance, facial recognition algorithms are widely used to provide user verification services and have become an indispensable part of many social infrastructures such as transportation and health care. As an instrument, an algorithm needs to attend to the fairness of its own behavior. As social infrastructure, however, it needs to pay even more attention to its impact on social fairness; otherwise, it may exacerbate existing inequities or create new ones. For example, if an algorithm treats all passengers equally and eliminates special seats for pregnant women in the interest of fairness, it increases the risk pregnant women face on public transport and indirectly damages their right to fair travel. Algorithms therefore have a responsibility to ensure social fairness, not just fairness within their own operations. It is now time to expand the concept of algorithmic fairness beyond mere behavioral equity, to assess algorithms in a broader societal context, and to examine whether they uphold and promote social fairness. This article analyzes the current status and challenges of algorithmic fairness from three key perspectives: fairness definitions, fairness datasets, and fairness algorithms. Furthermore, potential directions and strategies for promoting algorithmic fairness are proposed.
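The "behavioral equity" the article takes as its starting point is commonly operationalized with group-fairness metrics such as demographic parity. A minimal sketch, with hypothetical predictions rather than data from the paper:

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Gap in positive-prediction rates between two groups (0 = parity)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_a = y_pred[group == 0].mean()  # positive rate in group 0
    rate_b = y_pred[group == 1].mean()  # positive rate in group 1
    return abs(rate_a - rate_b)

# Hypothetical loan-approval predictions for two demographic groups.
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(demographic_parity_difference(y_pred, group))  # |0.75 - 0.25| = 0.5
```

The article's point is precisely that such within-system metrics are necessary but not sufficient once an algorithm becomes social infrastructure.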

Citations: 0
Benchmarking, ethical alignment, and evaluation framework for conversational AI: Advancing responsible development of ChatGPT
Pub Date : 2023-09-01 DOI: 10.1016/j.tbench.2023.100136
Partha Pratim Ray

Conversational AI systems like ChatGPT have seen remarkable advancements in recent years, revolutionizing human–computer interaction. However, evaluating the performance and ethical implications of these systems remains a challenge. This paper delves into the creation of rigorous benchmarks, adaptable standards, and an intelligent evaluation methodology tailored specifically to ChatGPT. We analyze several prominent benchmarks, including GLUE, SuperGLUE, SQuAD, CoQA, Persona-Chat, DSTC, BIG-Bench, HELM, and MMLU, illuminating their strengths and limitations. The paper also scrutinizes existing standards set by OpenAI, IEEE's Ethically Aligned Design, the Montreal Declaration, and the Partnership on AI's Tenets, investigating their relevance to ChatGPT. Further, we propose adaptive standards that encapsulate ethical considerations, context adaptability, and community involvement. In terms of evaluation, we explore traditional methods such as BLEU, ROUGE, METEOR, precision–recall, F1 score, perplexity, and user feedback, while also proposing a novel evaluation approach that harnesses the power of reinforcement learning. Our proposed evaluation framework is multidimensional, incorporating task-specific, real-world application, and multi-turn dialogue benchmarks. We perform feasibility, SWOT, and adaptability analyses of the proposed framework. The framework highlights the significance of user feedback, integrating it as a core component of evaluation alongside subjective assessments and interactive evaluation sessions. By amalgamating these elements, this paper contributes to the development of a comprehensive evaluation framework that fosters responsible and impactful advancement in the field of conversational AI.
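Two of the traditional evaluation methods the paper surveys can be stated very compactly. Below is a minimal sketch of token-level F1 (the SQuAD-style overlap score behind precision–recall evaluation) and perplexity; the inputs are illustrative only:

```python
import math
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1, as used in SQuAD-style QA evaluation."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())  # shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

print(token_f1("the cat sat", "the cat sat down"))  # ~0.857
print(perplexity([-0.1, -0.5, -0.2]))               # ~1.306
```

BLEU, ROUGE, and METEOR generalize the same overlap idea with n-grams, recall weighting, and synonym matching, respectively, which is why the paper treats them as one family of reference-based metrics.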

Citations: 4
MetaverseBench: Instantiating and benchmarking metaverse challenges
Pub Date : 2023-09-01 DOI: 10.1016/j.tbench.2023.100138
Hainan Ye, Lei Wang

The rapid evolution of the metaverse has led to the emergence of numerous metaverse technologies and products. From a computer-systems perspective, the metaverse system is a complex, large-scale system that integrates various state-of-the-art technologies, including AI, blockchain, big data, and AR/VR. It also spans multiple platforms, such as IoT devices, edge nodes, and data centers, and diverse hardware, including CPUs, GPUs, NPUs, and 3D glasses. Integrating these technologies and components into a holistic system poses a significant challenge for system designers. The first step towards building the metaverse is to instantiate and evaluate the challenges and provide a comprehensive benchmark suite. However, to the best of our knowledge, no existing benchmark defines the metaverse challenges and evaluates state-of-the-art solutions from a holistic perspective. In this paper, we instantiate metaverse challenges from a system perspective and propose MetaverseBench, a holistic and comprehensive metaverse benchmark suite. Our preliminary experiments indicate that existing system performance falls short of metaverse requirements by, on average, two orders of magnitude.

Citations: 0
Mind meets machine: Unravelling GPT-4’s cognitive psychology
Pub Date : 2023-09-01 DOI: 10.1016/j.tbench.2023.100139
Sifatkaur Dhingra, Manmeet Singh, Vaisakh S.B., Neetiraj Malviya, Sukhpal Singh Gill

Cognitive psychology is concerned with understanding perception, attention, memory, language, problem-solving, decision-making, and reasoning. Large Language Models (LLMs) are emerging as potent tools increasingly capable of performing human-level tasks. The recent development of Generative Pre-trained Transformer 4 (GPT-4), and its demonstrated success on tasks that are complex even for humans, such as professional exams and difficult reasoning problems, has increased confidence that LLMs may become capable instruments of intelligence. Although the GPT-4 report shows performance on some cognitive psychology tasks, a comprehensive assessment of GPT-4 on existing, well-established datasets is still required. In this study, we focus on evaluating GPT-4’s performance on a set of cognitive psychology datasets: CommonsenseQA, SuperGLUE, MATH, and HANS. In doing so, we examine how GPT-4 processes and integrates cognitive-psychology tasks with contextual information, providing insight into the underlying cognitive processes behind its responses. We show that GPT-4 exhibits a high level of accuracy on cognitive psychology tasks relative to prior state-of-the-art models. Our results strengthen the already available assessments of, and confidence in, GPT-4’s cognitive psychology abilities. By enabling machines to bridge the gap between human and machine reasoning, it has significant potential to revolutionise the field of Artificial Intelligence (AI).
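Accuracy on multiple-choice datasets such as CommonsenseQA is the kind of measurement this study relies on. A minimal sketch of such a harness follows; `ask_model` is a placeholder for any LLM call, and the sample item is illustrative, not the authors' code or data:

```python
def evaluate_accuracy(dataset, ask_model):
    """Accuracy of a model callable on multiple-choice items."""
    correct = 0
    for item in dataset:
        # Format the question with lettered answer choices.
        prompt = item["question"] + "\n" + "\n".join(
            f"{label}. {choice}"
            for label, choice in zip("ABCDE", item["choices"]))
        # Take the first letter of the model's reply as its answer.
        answer = ask_model(prompt).strip().upper()[:1]
        correct += answer == item["answer"]
    return correct / len(dataset)

# Illustrative item in CommonsenseQA's five-choice format.
sample = [{"question": "Where would you put a plate after washing it?",
           "choices": ["floor", "cupboard", "oven", "garden", "sink"],
           "answer": "B"}]
print(evaluate_accuracy(sample, ask_model=lambda prompt: "B"))  # 1.0
```

Datasets like HANS are scored the same way but are constructed so that shallow heuristics fail, which is what makes them informative about reasoning rather than recall.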

Citations: 0
2023 BenchCouncil Distinguished Doctoral Dissertation Award Call for Nominations
Pub Date : 2023-06-01 DOI: 10.1016/S2772-4859(23)00047-9

The BenchCouncil Distinguished Doctoral Dissertation Award recognizes and encourages superior research and writing by doctoral candidates in the broad field of the benchmarking community. This year, the award consists of two tracks: a Computer Architecture track and an Other Areas track. Each track carries a $1,000 honorarium and has its own nomination form and award subcommittee. For each track, all candidates are encouraged to submit articles to BenchCouncil Transactions on Benchmarks, Standards and Evaluations (TBench). Among the submissions in each track, four candidates will be selected as finalists. They will be invited to give a 30-minute presentation at the BenchCouncil Bench 2023 conference and to contribute research articles to TBench. Finally, one of the four finalists in each track will receive the award. More information is available at https://www.benchcouncil.org/awards/index.html#DistinguishedDoctoralDissertation

Important Dates:
Nomination deadline: October 15, 2023, at 11:59 PM AoE
Conference date: December 3–5, 2023
Online nomination forms:
Computer Architecture Track: https://forms.gle/a2JnWq9A9Vkq5JXXA
Other Areas Track: https://forms.gle/pHBDZzWGN4kjwRJu9

Citations: 0
StreamAD: A cloud platform metrics-oriented benchmark for unsupervised online anomaly detection
Pub Date : 2023-06-01 DOI: 10.1016/j.tbench.2023.100121
Jiahui Xu, Chengxiang Lin, Fengrui Liu, Yang Wang, Wei Xiong, Zhenyu Li, Hongtao Guan, Gaogang Xie

Cloud platforms, serving as fundamental infrastructure, play a significant role in developing modern applications. In recent years, there has been growing interest among researchers in using machine learning algorithms to rapidly detect and diagnose faults within complex cloud platforms, aiming to improve quality of service and optimize system performance. Online anomaly detection on cloud platform metrics is needed to provide timely fault alerts. To assist Site Reliability Engineers (SREs) in selecting suitable anomaly detection algorithms for specific use cases, we introduce a benchmark called StreamAD. This benchmark offers three contributions: (1) it encompasses eleven unsupervised algorithms with open-source code; (2) it abstracts various common operators for online anomaly detection, which improves the efficiency of algorithm development; and (3) it provides extensive comparisons of the algorithms under different evaluation methods. With StreamAD, researchers can efficiently conduct comprehensive evaluations of new algorithms, further facilitating research in this area. The code of StreamAD is published at https://github.com/Fengrui-Liu/StreamAD.
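To make "common operators for online anomaly detection" concrete, here is a toy streaming z-score detector built on Welford's running statistics. It only illustrates the operator style; it is not StreamAD's API (see the repository above for the real interfaces):

```python
import math

class StreamingZScoreDetector:
    """Toy unsupervised online detector: scores each point by its distance
    from the running mean in units of the running standard deviation."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def score(self, x):
        # Score against statistics of points seen so far, then update
        # them with Welford's single-pass algorithm.
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            z = abs(x - self.mean) / std if std > 0 else 0.0
        else:
            z = 0.0
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return z

detector = StreamingZScoreDetector()
stream = [1.0, 1.1, 0.9, 1.0, 1.05, 8.0]  # last point is anomalous
print([round(detector.score(x), 2) for x in stream])
```

Because the detector never revisits past points and keeps O(1) state, it fits the one-pass constraint that distinguishes online detection from offline batch methods.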

Citations: 0
DPUBench: An application-driven scalable benchmark suite for comprehensive DPU evaluation
Pub Date : 2023-06-01 DOI: 10.1016/j.tbench.2023.100120
Zheng Wang, Chenxi Wang, Lei Wang

With the development of data centers, network bandwidth has rapidly increased, reaching hundreds of Gbps. However, improvements in CPUs' network I/O processing performance have not kept pace with this growth in recent years, leaving CPUs increasingly burdened by network applications in data centers. To address this issue, the Data Processing Unit (DPU) has emerged as a hardware accelerator designed to offload network applications from the CPU. As a new class of hardware device, DPU architecture design is still in the exploration stage. Existing DPU benchmarks are neither neutral nor comprehensive, making them unsuitable as general benchmarks: to showcase the advantages of their specific architectural features, DPU vendors tend to provide particular architecture-dependent evaluation programs, which fail to provide comprehensive coverage and cannot adequately represent the full range of network applications. To address this gap, we propose an application-driven, scalable benchmark suite called DPUBench. DPUBench classifies DPU applications into three typical scenarios (network, storage, and security) and includes a scalable benchmark framework containing an essential Operator Set for these scenarios and End-to-end Evaluation Programs for real data center scenarios. DPUBench can easily incorporate new operators and end-to-end evaluation programs as DPUs evolve. We present the results of evaluating the NVIDIA BlueField-2 using DPUBench and provide optimization recommendations. DPUBench is publicly available at https://www.benchcouncil.org/DPUBench.
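Operator-level evaluation amounts to timing individual operators under a fixed payload. The sketch below shows such a measurement loop in the spirit of an Operator Set benchmark; it runs a stand-in checksum operator on the host CPU and is only an illustration, not DPUBench's actual framework or an offload path:

```python
import time
import statistics
import zlib

def bench_operator(operator, payload, warmup=10, iters=100):
    """Time one operator over a fixed payload and report throughput."""
    for _ in range(warmup):          # warm caches before measuring
        operator(payload)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        operator(payload)
        samples.append(time.perf_counter() - t0)
    med = statistics.median(samples)  # median resists outliers
    return {"median_s": med, "MB_per_s": len(payload) / med / 1e6}

# Example: a checksum-style operator over a 1 MiB buffer.
print(bench_operator(lambda buf: zlib.crc32(buf), b"\x00" * (1 << 20)))
```

End-to-end evaluation differs in that it measures whole application pipelines rather than isolated operators, which is why DPUBench includes both tiers.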

Citations: 0