
BenchCouncil Transactions on Benchmarks, Standards and Evaluations: Latest Articles

Evaluatology: The science and engineering of evaluation
Pub Date : 2024-03-01 DOI: 10.1016/j.tbench.2024.100162
Jianfeng Zhan , Lei Wang , Wanling Gao , Hongxiao Li , Chenxi Wang , Yunyou Huang , Yatao Li , Zhengxin Yang , Guoxin Kang , Chunjie Luo , Hainan Ye , Shaopeng Dai , Zhifei Zhang

Evaluation is a crucial aspect of human existence and plays a vital role in each field. However, it is often approached in an empirical and ad-hoc manner, lacking consensus on universal concepts, terminologies, theories, and methodologies. This lack of agreement has significant consequences. This article aims to formally introduce the discipline of evaluatology, which encompasses the science and engineering of evaluation. We propose a universal framework for evaluation, encompassing concepts, terminologies, theories, and methodologies that can be applied across various disciplines, if not all disciplines.

Our research reveals that the essence of evaluation lies in conducting experiments that intentionally apply a well-defined evaluation condition to individuals or systems under scrutiny, which we refer to as the subjects. This process allows for the creation of an evaluation system or model. By measuring and/or testing this evaluation system or model, we can infer the impact of different subjects. Derived from the essence of evaluation, we propose five axioms focusing on key aspects of evaluation outcomes as the foundational evaluation theory. These axioms serve as the bedrock upon which we build universal evaluation theories and methodologies. When evaluating a single subject, it is crucial to create evaluation conditions with different levels of equivalency. By applying these conditions to diverse subjects, we can establish reference evaluation models. These models allow us to alter a single independent variable at a time while keeping all other variables as controls. When evaluating complex scenarios, the key lies in establishing a series of evaluation models that maintain transitivity. Building upon the science of evaluation, we propose a formal definition of a benchmark as a simplified and sampled evaluation condition that guarantees different levels of equivalency. This concept serves as the cornerstone for a universal benchmark-based engineering approach to evaluation across various disciplines, which we refer to as benchmarkology.

Citations: 0
An approach to workload generation for modern data centers: A view from Alibaba trace
Pub Date : 2024-03-01 DOI: 10.1016/j.tbench.2024.100164
Yi Liang , Nianyi Ruan , Lan Yi , Xing Su

Modern data centers provide the foundational infrastructure of cloud computing. Workload generation, which involves simulating or constructing tasks and transactions to replicate the actual resource usage patterns of real-world systems or applications, plays an essential role in efficient resource management in these centers. Data center traces, rich in information about workload execution and resource utilization, are thus ideal data for workload generation. Traditional traces provide detailed temporal resource usage data to enable fine-grained workload generation. However, modern data centers tend to favor tracing statistical metrics to reduce overhead. Therefore, the accurate reconstruction of temporal resource consumption without detailed, temporized trace information becomes a major challenge for trace-based workload generation. To address this challenge, we propose STWGEN, a novel method that leverages statistical trace data for workload generation. STWGEN is specifically designed to generate batch task workloads based on the Alibaba trace. STWGEN contains two key components: a suite of C program-based flexible workload building blocks and a heuristic strategy to assemble building blocks for workload generation. Both components are carefully designed to produce synthetic batch tasks that closely replicate the observed resource usage patterns in a representative data center. Experimental results demonstrate that STWGEN outperforms state-of-the-art workload generation methods, as it emulates workload-level and machine-level resource usage with much higher accuracy.

Citations: 0
Benchmarking ChatGPT for Prototyping Theories: Experimental Studies Using the Technology Acceptance Model
Pub Date : 2024-02-01 DOI: 10.1016/j.tbench.2024.100153
Yanwu Yang, T. Goh, Xin Dai
Citations: 0
A pluggable single-image super-resolution algorithm based on second-order gradient loss
Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2023.100148
Shuran Lin , Chunjie Zhang , Yanwu Yang
Convolutional neural networks for single-image super-resolution have been widely used with great success. However, most of these methods use L1 loss to guide network optimization, resulting in blurry restored images with sharp edges smoothed. This is because L1 loss limits the optimization goal of the network to the statistical average of all solutions within the solution space of that task. To go beyond the L1 loss, this paper designs an image super-resolution algorithm based on second-order gradient loss. We impose additional constraints by considering the high-order gradient level of the image so that the network can focus on the recovery of fine details such as texture during the learning process. This helps to alleviate the problem of restored image texture over-smoothing to some extent. During network training, we extract the second-order gradient maps of the generated image and the target image and minimize the distance between them. This guides the network to pay attention to the high-frequency detail information in the image and to generate a high-resolution image with clearer edges and textures. Besides, the proposed loss function has good embeddability and can be easily integrated with existing image super-resolution networks. Experimental results show that the second-order gradient loss can significantly improve both Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID) performance over other image super-resolution deep learning models.
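The loss described in this abstract can be illustrated with a small NumPy sketch. The abstract does not state which gradient operator or distance the authors use, so the choices here are assumptions made only for illustration: the second-order gradient map is taken as the finite-difference gradient magnitude applied twice, the distance is the mean absolute difference, and the function names and the `weight` parameter are hypothetical.

```python
import numpy as np


def gradient_magnitude(img):
    """First-order gradient magnitude using finite differences."""
    gy, gx = np.gradient(np.asarray(img, dtype=np.float64))
    return np.sqrt(gx ** 2 + gy ** 2)


def second_order_gradient_map(img):
    """Assumed definition: apply the gradient-magnitude operator twice."""
    return gradient_magnitude(gradient_magnitude(img))


def second_order_gradient_loss(generated, target):
    """Mean absolute distance between the two second-order gradient maps."""
    diff = second_order_gradient_map(generated) - second_order_gradient_map(target)
    return float(np.mean(np.abs(diff)))


def total_loss(generated, target, weight=0.1):
    """Pixel-wise L1 loss plus a weighted second-order gradient term.

    The weight value is an illustrative assumption, not taken from the paper.
    """
    pixel = float(np.mean(np.abs(np.asarray(generated, dtype=np.float64)
                                 - np.asarray(target, dtype=np.float64))))
    return pixel + weight * second_order_gradient_loss(generated, target)
```

Because the extra term depends only on gradients, a uniform brightness shift leaves it unchanged while the pixel term still penalizes the shift; this is one way to see that the term targets edge and texture structure rather than intensity. In a real training loop the same computation would be written in an automatic-differentiation framework so that it can be added to an existing network's loss.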
Citations: 0
CloudAISim: A toolkit for modelling and simulation of modern applications in AI-driven cloud computing environments
Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2024.100150
Abhimanyu Bhowmik , Madhushree Sannigrahi , Deepraj Chowdhury , Ajoy Dey , Sukhpal Singh Gill
There is a very significant knowledge gap between Artificial Intelligence (AI) and a multitude of industries that exist in today’s modern world. This is primarily attributable to the limited availability of resources and technical expertise. However, a major obstacle is that AI needs to be flexible enough to work in many different applications, utilising a wide variety of datasets through cloud computing. As a result, we developed a benchmark toolkit called CloudAISim to make use of the power of AI and cloud computing in order to satisfy the requirements of modern applications. The goal of this study is to come up with a strategy for building a bridge so that AI can be utilised in order to assist those who are not very knowledgeable about technological advancements. In addition, we modelled a healthcare application as a case study in order to verify the scientific reliability of the CloudAISim toolkit and simulated it in a cloud computing environment using Google Cloud Functions to increase its real-time efficiency. A non-expert-friendly interface built with an interactive web app has also been developed. Any user without any technical knowledge can operate the entire model, which has a 98% accuracy rate. The proposed use case is designed to put AI to work in the healthcare industry, but CloudAISim would be useful and adaptable for other applications in the future.
Citations: 0
Characterizing and understanding deep neural network batching systems on GPUs
Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2024.100151
Feng Yu , Hao Zhang , Ao Chen , Xueying Wang , Xiaoxia Liang , Sheng Wang , Guangli Li , Huimin Cui , Xiaobing Feng
As neural network inference demands are ever-increasing in intelligent applications, the performance optimization of model serving becomes a challenging problem. Dynamic batching is an important feature of contemporary deep learning serving systems, which combines multiple requests of model inference and executes them together to improve the system’s throughput. However, the behavior characteristics of each part in deep neural network batching systems as well as their performance impact on different model structures are still unknown. In this paper, we characterize the batching system by leveraging three representative deep neural networks on GPUs, performing a systematic analysis of the performance effects from the request batching module, model slicing module, and stage reorchestrating module. Based on experimental results, several insights and recommendations are offered to facilitate the system design and optimization for deep learning serving.
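Dynamic batching, as summarized in this abstract, can be sketched as grouping incoming inference requests before they are executed together. The abstract does not describe the batching policy of the systems studied, so the following is a generic size-or-timeout policy written under that assumption; `Request`, `form_batches`, `max_batch_size`, and `max_wait_ms` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: int
    arrival_ms: float  # arrival time of the request in milliseconds


def form_batches(requests: List[Request], max_batch_size: int,
                 max_wait_ms: float) -> List[List[Request]]:
    """Group requests into batches with a size-or-timeout policy.

    A batch is closed when it is full, or when the next request arrives
    more than max_wait_ms after the first request in the batch.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        batch_full = len(current) >= max_batch_size
        timed_out = bool(current) and (
            req.arrival_ms - current[0].arrival_ms > max_wait_ms
        )
        if current and (batch_full or timed_out):
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches
```

The two parameters expose the usual trade-off: a larger `max_batch_size` or `max_wait_ms` tends to raise throughput by filling batches, at the cost of extra queueing latency for the earliest request in each batch.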
Citations: 0
Benchmarking ChatGPT for prototyping theories: Experimental studies using the technology acceptance model
Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2024.100153
Tiong-Thye Goh , Xin Dai , Yanwu Yang
This paper explores the paradigm of leveraging ChatGPT as a benchmark tool for theory prototyping in conceptual research. Specifically, we conducted two experimental studies using the classical technology acceptance model (TAM) to demonstrate and evaluate ChatGPT's capability of comprehending theoretical concepts, discriminating between constructs, and generating meaningful responses. Results of the two studies indicate that ChatGPT can generate responses aligned with the TAM theory and constructs. Key metrics including the factor loadings, internal consistency reliability, and convergence reliability of the measurement model surpass the minimum threshold, thus confirming the validity of TAM constructs. Moreover, supported hypotheses provide evidence for the nomological validity of TAM constructs. However, both studies show a high Heterotrait–Monotrait ratio of correlations (HTMT) among TAM constructs, suggesting a concern about discriminant validity. Furthermore, high duplicated response rates were identified and potential biases regarding gender, usage experiences, perceived usefulness, and behavioural intention were revealed in ChatGPT-generated samples. Therefore, this calls for additional effort in LLM research to address performance metrics related to duplicated responses, the strength of discriminant validity, the impact of prompt design, and the generalizability of findings across contexts.
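The HTMT ratio mentioned in this abstract can be made concrete with a short sketch. It follows the commonly used definition: the mean correlation between items of two different constructs, divided by the geometric mean of the average within-construct item correlations. The function name `htmt`, the index-list interface, and the toy correlation matrix are assumptions for illustration; the abstract does not report the authors' data or software.

```python
import numpy as np


def htmt(corr, items_a, items_b):
    """Heterotrait-Monotrait ratio for two constructs.

    corr    : item-by-item correlation matrix
    items_a : indices of the items measuring construct A (at least two)
    items_b : indices of the items measuring construct B (at least two)
    """
    corr = np.asarray(corr, dtype=float)

    # Mean correlation between items of different constructs.
    hetero = corr[np.ix_(items_a, items_b)].mean()

    def mono(items):
        # Mean correlation between distinct items of the same construct.
        block = corr[np.ix_(items, items)]
        return block[np.triu_indices(len(items), k=1)].mean()

    return hetero / np.sqrt(mono(items_a) * mono(items_b))


# Toy example: two constructs with two items each (made-up numbers).
toy_corr = np.array([
    [1.00, 0.80, 0.72, 0.72],
    [0.80, 1.00, 0.72, 0.72],
    [0.72, 0.72, 1.00, 0.80],
    [0.72, 0.72, 0.80, 1.00],
])
toy_ratio = htmt(toy_corr, [0, 1], [2, 3])
```

Values close to 1 mean that items correlate almost as strongly across constructs as within them. Cut-offs of 0.85 or 0.90 are commonly cited in the methodological literature, and ratios above them are usually read as a warning about discriminant validity.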
Citations: 0
AIGCBench: Comprehensive evaluation of image-to-video content generated by AI
Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2024.100152
Fanda Fan , Chunjie Luo , Wanling Gao , Jianfeng Zhan
The burgeoning field of Artificial Intelligence Generated Content (AIGC) is witnessing rapid advancements, particularly in video generation. This paper introduces AIGCBench, a pioneering comprehensive and scalable benchmark designed to evaluate a variety of video generation tasks, with a primary focus on Image-to-Video (I2V) generation. AIGCBench tackles the limitations of existing benchmarks, which suffer from a lack of diverse datasets, by including a varied and open-domain image–text dataset that evaluates different state-of-the-art algorithms under equivalent conditions. We employ a novel text combiner and GPT-4 to create rich text prompts, which are then used to generate images via advanced Text-to-Image models. To establish a unified evaluation framework for video generation tasks, our benchmark includes 11 metrics spanning four dimensions to assess algorithm performance. These dimensions are control-video alignment, motion effects, temporal consistency, and video quality. These metrics are both reference video-based and video-free, ensuring a comprehensive evaluation strategy. The evaluation standard proposed correlates well with human judgment, providing insights into the strengths and weaknesses of current I2V algorithms. The findings from our extensive experiments aim to stimulate further research and development in the I2V field. AIGCBench represents a significant step toward creating standardized benchmarks for the broader AIGC landscape, proposing an adaptable and equitable framework for future assessments of video generation tasks. We have open-sourced the dataset and evaluation code on the project website: https://www.benchcouncil.org/AIGCBench.
Citations: 0
Corrigendum regarding missing Declaration Conflict-of-Interests statements in previously published articles
Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2024.100149
Citations: 0
Analyzing the potential benefits and use cases of ChatGPT as a tool for improving the efficiency and effectiveness of business operations
Pub Date : 2023-09-01 DOI: 10.1016/j.tbench.2023.100140
Rohit Raj , Arpit Singh , Vimal Kumar , Pratima Verma

The study addresses the potential benefits for companies of adopting ChatGPT, a popular chatbot built on a large-scale transformer-based language model known as a generative pre-trained transformer (GPT). Chatbots like ChatGPT may improve customer service, handle several client inquiries at once, and save operational costs. Moreover, ChatGPT may automate regular processes like order tracking and billing, allowing human employees to focus on more complex and strategic responsibilities. Nevertheless, before deploying ChatGPT, enterprises must carefully analyze its use cases and restrictions, as well as its strengths and disadvantages. ChatGPT, for example, requires training data that is particular to the business domain and might produce erroneous and ambiguous findings. The study identifies areas of deployment of ChatGPT's possible benefits in enterprises by drawing on the literature that is currently accessible on ChatGPT, massive language models, and artificial intelligence. Then, using the PSI (Preference Selection Index) and COPRAS (Complex Proportional Assessment) approaches, potential advantages are taken into account and prioritized. By highlighting current trends and possible advantages in the industry, this editorial seeks to provide insight into the present state of employing ChatGPT in enterprises and research. ChatGPT may also learn biases from training data and create replies that reinforce those biases. As a result, enterprises must train and fine-tune ChatGPT to specific operations, set explicit boundaries and limitations for its use, and implement appropriate security measures to avoid malicious input. The study highlights the research gap in the dearth of literature by outlining ChatGPT's potential benefits for businesses, analyzing its strengths and limits, and offering insights into how organizations might use ChatGPT's capabilities to enhance their operations.

Citations: 4