Oluwaseun Adewunmi Alo, Sairam Sri Vatsavai, Ishan Thakkar
Deep Neural Networks (DNNs) predominantly rely on General Matrix Multiply (GEMM) kernels, which are often accelerated using specialized hardware architectures. Recently, analog photonic GEMM accelerators have emerged as a promising alternative, offering vastly superior speed and energy efficiency compared to traditional electronic accelerators. However, these photonic accelerators cannot support integer operands wider than 4 bits due to their inherent trade-offs between analog dynamic range and parallelism. This is often inadequate for DNN training, as operands at least 8 bits wide are deemed necessary to prevent significant accuracy drops. To address these limitations, we introduce a scalable photonic GEMM accelerator named SPOGA. SPOGA utilizes enhanced features such as analog summation of homodyne optical signals and in-transduction positional weighting of operands. By employing an extended optical-analog dataflow that minimizes the overheads associated with bit-sliced integer arithmetic, SPOGA supports byte-size integer GEMM kernels, achieving significant improvements in throughput, latency, and energy efficiency. Specifically, SPOGA demonstrates up to 14.4×, 2×, and 28.5× improvements in frames-per-second (FPS), FPS/Watt, and FPS/Watt/mm² respectively, compared to existing state-of-the-art photonic solutions.
{"title":"Scaling Analog Photonic Accelerators for Byte-Size, Integer General Matrix Multiply (GEMM) Kernels","authors":"Oluwaseun Adewunmi Alo, Sairam Sri Vatsavai, Ishan Thakkar","doi":"arxiv-2407.06134","DOIUrl":"https://doi.org/arxiv-2407.06134","url":null,"abstract":"Deep Neural Networks (DNNs) predominantly rely on General Matrix Multiply\u0000(GEMM) kernels, which are often accelerated using specialized hardware\u0000architectures. Recently, analog photonic GEMM accelerators have emerged as a\u0000promising alternative, offering vastly superior speed and energy efficiency\u0000compared to traditional electronic accelerators. However, these photonic cannot\u0000support wider than 4-bit integer operands due to their inherent trade-offs\u0000between analog dynamic range and parallelism. This is often inadequate for DNN\u0000training as at least 8-bit wide operands are deemed necessary to prevent\u0000significant accuracy drops. To address these limitations, we introduce a\u0000scalable photonic GEMM accelerator named SPOGA. SPOGA utilizes enhanced\u0000features such as analog summation of homodyne optical signals and\u0000in-transduction positional weighting of operands. By employing an extended\u0000optical-analog dataflow that minimizes overheads associated with bit-sliced\u0000integer arithmetic, SPOGA supports byte-size integer GEMM kernels, achieving\u0000significant improvements in throughput, latency, and energy efficiency.\u0000Specifically, SPOGA demonstrates up to 14.4$times$, 2$times$, and\u000028.5$times$ improvements in frames-per-second (FPS), FPS/Watt, and\u0000FPS/Watt/mm$^2$ respectively, compared to existing state-of-the-art photonic\u0000solutions.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141573207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CPU performance prediction, which involves forecasting the performance scores of a CPU based on its hardware characteristics during operation, is a critical technology for computational system design and resource management in the big data era. However, this research field currently faces two significant challenges. First, collecting real-world data is difficult due to the wide variety of CPU products on the market and the highly specialized nature of the relevant hardware characteristics; the field lacks a standard dataset with unified hardware characteristics, wide data coverage, and comprehensive benchmarks. Second, existing methods based on hardware simulation models or machine learning exhibit notable shortcomings, such as lengthy simulation test cycles and low prediction accuracy. To bridge these gaps, we first collect, preprocess, and standardize historical data from the 4th Generation Intel Xeon Scalable Processors across multiple benchmark suites to create a new dataset, named PerfCastDB. Subsequently, we design a deep-learning-based model called Nova CPU Performance Predictor (NCPP) as the baseline for this new dataset. The NCPP network is built around a group attention mechanism: it quantifies the implicit relationships between hardware characteristics within and across groups, and comprehensively models the impact of various hardware characteristics on CPU performance. We conduct comparative experiments on the proposed PerfCastDB dataset; compared to existing approaches, NCPP achieves superior evaluation results, demonstrating its effectiveness. Furthermore, we have open-sourced part of the dataset and the NCPP network code to facilitate subsequent research. The resources can be accessed at https://github.com/xiaoman-liu/NCPP.
{"title":"NCPP: Nova CPU Performance Predictor on a Novel Dataset","authors":"Xiaoman Liu","doi":"arxiv-2407.03385","DOIUrl":"https://doi.org/arxiv-2407.03385","url":null,"abstract":"CPU performance prediction, which involves forecasting the performance scores\u0000of a CPU based on its hardware characteristics during its operation, is a\u0000critical technology for computational system design and resource management in\u0000the big data era. However, this research field currently faces two significant\u0000challenges. First, collecting real-world data is challenging due to the wide\u0000variety of CPU products on the market and the highly specialized nature of\u0000relevant hardware characteristics. In the research process, this field lacks a\u0000standard dataset with unified hardware characteristics, wide data coverage, and\u0000comprehensive benchmarks. Second, existing methods based on hardware simulation\u0000models or machine learning exhibit notable shortcomings, such as lengthy\u0000simulation test cycles and low prediction accuracy. To bridge these gaps, we\u0000first collect, preprocess, and standardize historical data from the 4th\u0000Generation Intel Xeon Scalable Processors across multiple benchmark suites to\u0000create a new dataset, named PerfCastDB. Subsequently, we design a deep learning\u0000based model called Nova CPU Performance Predictor (NCPP) as the baseline for\u0000this new dataset. The NCPP network is designed based on group attention\u0000mechanism. It effectively quantifies the implicit relationships between\u0000hardware characteristics within and across groups and comprehensively models\u0000the impact of various hardware characteristics on CPU performance prediction.\u0000We conduct comparative experiments using the proposed PerfCastDB dataset.\u0000Compared to existing approaches, NCPP achieves superior evaluation results,\u0000demonstrating its effectiveness. Furthermore, we have open-sourced part of the\u0000dataset and the NCPP network code to facilitate subsequent research. The\u0000resources can be accessed at https://github.com/xiaoman-liu/NCPP.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141577704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the primary areas of interest in High Performance Computing is improving the performance of parallel workloads. Nowadays, compilable source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) to extract features from source code. Most such works target specific tasks or are designed around a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed; in particular, approaches mimicking large language models (LLMs) have been proposed, but these have prohibitively large training costs. In this paper, we propose MIREncoder, a Multi-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs and to better model code syntax, semantics, and structure. For code-based performance optimizations, these features are very important when making optimization decisions. A pre-trained model/embedding implicitly enables the use of transfer learning and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have low overhead and be easy to use. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables the use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations show that our proposed approach can outperform the state of the art while reducing overhead.
{"title":"MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations","authors":"Akash Dutta, Ali Jannesari","doi":"arxiv-2407.02238","DOIUrl":"https://doi.org/arxiv-2407.02238","url":null,"abstract":"One of the primary areas of interest in High Performance Computing is the\u0000improvement of performance of parallel workloads. Nowadays, compilable source\u0000code-based optimization tasks that employ deep learning often exploit LLVM\u0000Intermediate Representations (IRs) for extracting features from source code.\u0000Most such works target specific tasks, or are designed with a pre-defined set\u0000of heuristics. So far, pre-trained models are rare in this domain, but the\u0000possibilities have been widely discussed. Especially approaches mimicking\u0000large-language models (LLMs) have been proposed. But these have prohibitively\u0000large training costs. In this paper, we propose MIREncoder, a M}ulti-modal\u0000IR-based Auto-Encoder that can be pre-trained to generate a learned embedding\u0000space to be used for downstream tasks by machine learning-based approaches. A\u0000multi-modal approach enables us to better extract features from compilable\u0000programs. It allows us to better model code syntax, semantics and structure.\u0000For code-based performance optimizations, these features are very important\u0000while making optimization decisions. A pre-trained model/embedding implicitly\u0000enables the usage of transfer learning, and helps move away from task-specific\u0000trained models. Additionally, a pre-trained model used for downstream\u0000performance optimization should itself have reduced overhead, and be easily\u0000usable. These considerations have led us to propose a modeling approach that i)\u0000understands code semantics and structure, ii) enables use of transfer learning,\u0000and iii) is small and simple enough to be easily re-purposed or reused even\u0000with low resource availability. Our evaluations will show that our proposed\u0000approach can outperform the state of the art while reducing overhead.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141516732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chansup Byun, Julia Mullen, Albert Reuther, William Arcand, William Bergeron, David Bestor, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner, Lauren Milechin
One of the more complex tasks for researchers using HPC systems is performance monitoring and tuning of their applications. Developing a practice of continuous performance improvement, both for speed-up and for efficient use of resources, is essential to the long-term success of both the HPC practitioner and the research project. Profiling tools provide a detailed view of the performance of an application but often have a steep learning curve and rarely provide an easy-to-interpret view of resource utilization. Lower-level tools such as top and htop provide a view of resource utilization for those familiar and comfortable with Linux, but present a barrier for newer HPC practitioners. To expand the existing profiling and job monitoring options, the MIT Lincoln Laboratory Supercomputing Center created LLload, a tool that captures a snapshot of the resources being used by a job on a per-user basis. LLload is built from standard HPC tools and provides an easy way for a researcher to track the resource usage of active jobs. We explain how the tool was designed and implemented and provide insight into how it is used to aid new researchers in developing their performance monitoring skills as well as to guide researchers in their resource requests.
{"title":"LLload: Simplifying Real-Time Job Monitoring for HPC Users","authors":"Chansup Byun, Julia Mullen, Albert Reuther, William Arcand, William Bergeron, David Bestor, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner, Lauren Milechin","doi":"arxiv-2407.01481","DOIUrl":"https://doi.org/arxiv-2407.01481","url":null,"abstract":"One of the more complex tasks for researchers using HPC systems is\u0000performance monitoring and tuning of their applications. Developing a practice\u0000of continuous performance improvement, both for speed-up and efficient use of\u0000resources is essential to the long term success of both the HPC practitioner\u0000and the research project. Profiling tools provide a nice view of the\u0000performance of an application but often have a steep learning curve and rarely\u0000provide an easy to interpret view of resource utilization. Lower level tools\u0000such as top and htop provide a view of resource utilization for those familiar\u0000and comfortable with Linux but a barrier for newer HPC practitioners. To expand\u0000the existing profiling and job monitoring options, the MIT Lincoln Laboratory\u0000Supercomputing Center created LLoad, a tool that captures a snapshot of the\u0000resources being used by a job on a per user basis. LLload is a tool built from\u0000standard HPC tools that provides an easy way for a researcher to track resource\u0000usage of active jobs. We explain how the tool was designed and implemented and\u0000provide insight into how it is used to aid new researchers in developing their\u0000performance monitoring skills as well as guide researchers in their resource\u0000requests.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Operators are constantly faced with the need to increase optical-network capacity to accommodate rapid traffic growth while minimizing the cost-per-bit and power-per-bit. The drastic reduction in the power consumption of IP routers and ZR/ZR+ pluggable transponders seen in recent years has renewed interest in "opaque" optical-network architectures, where no optical bypassing is allowed. In this work, we aim to quantify and compare the power consumption of four "IP over Wavelength Division Multiplexing" (IPoWDM) transport network architectures employing ZR/ZR+ modules vs. long-haul muxponders, considering different grooming, regeneration, and optical bypassing capabilities. We first propose a power consumption model for different IPoWDM node architectures with ZR/ZR+ modules and long-haul muxponders. Then, to obtain the power consumption of the different architectures, we propose a compact auxiliary-graph-based network-design algorithm extensible to different network architectures. Moreover, through a sensitivity analysis, we investigate how the continuing decrease in the power consumption of ZR/ZR+ modules and IP routers can impact the power consumption of the different architectures. Illustrative numerical results on networks of different sizes show that, despite drastic reductions of power consumption at the IP layer, optical bypassing is still the most power-efficient solution, reducing consumption by up to 48%.
{"title":"A Power-Consumption Analysis for Different IPoWDM Network Architectures with ZR/ZR+ and Long-Haul Muxponders","authors":"Qiaolun Zhang, Annalisa Morea, Patricia Layec, Memedhe Ibrahimi, Francesco Musumeci, Massimo Tornatore","doi":"arxiv-2407.00643","DOIUrl":"https://doi.org/arxiv-2407.00643","url":null,"abstract":"Operators are constantly faced with the need to increase optical-network\u0000capacity to accommodate rapid traffic growth while minimizing the cost-per-bit\u0000and power-per-bit. The drastic reduction of power consumption of IP routers and\u0000ZR/ZR+ pluggable transponders seen in the last years has renewed the interest\u0000in \"opaque\" optical-network architectures, where no optical bypassing is\u0000allowed. In this work, we aim to quantify and compare the power consumption of\u0000four \"IP over Wavelength Division Multiplexing\" (IPoWDM) transport network\u0000architectures employing ZR/ZR+ modules vs. long-haul muxponders, considering\u0000different grooming, regeneration, and optical bypassing capabilities. We first\u0000propose a power consumption model for different IPoWDM node architectures with\u0000ZR/ZR+ modules and long-haul muxponders. Then, to obtain the power consumption\u0000of different architectures, we propose a compact auxiliary-graph-based\u0000network-design algorithm extensible to different network architectures.\u0000Moreover, we investigate how the continuous decrease in the power consumption\u0000of ZR/ZR+ and IP routers can impact the power consumption of different\u0000architectures through a sensitivity analysis. Illustrative numerical results on\u0000networks of different sizes show that, despite drastic reductions of power\u0000consumption at IP layer, optical bypassing is still the most power-efficient\u0000solution, reducing consumption by up to 48%.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141516733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the advent of automation, many manufacturing industries have transitioned to data-centric methodologies, giving rise to an unprecedented influx of data during the manufacturing process. This data has become instrumental in analyzing the quality of manufacturing processes and equipment. Engineers and data analysts, in particular, require extensive time-series data for seasonal cycle analysis. However, due to computational resource constraints, they are often limited to querying short-term data multiple times or resorting to summarized data in which key patterns may be overlooked. This study proposes a novel solution to overcome these limitations: the advanced resolution-based pixel preemption data filtering (AR-PPF) algorithm. This technique allows for efficient visualization of time-series charts over long periods while significantly reducing the time required to retrieve data. We also demonstrate how this approach not only enhances the efficiency of data analysis but also ensures that key features are not lost, thereby providing a more accurate and comprehensive understanding of the data.
{"title":"AR-PPF: Advanced Resolution-Based Pixel Preemption Data Filtering for Efficient Time-Series Data Analysis","authors":"Taewoong Kim, Kukjin Choi, Sungjun Kim","doi":"arxiv-2406.19575","DOIUrl":"https://doi.org/arxiv-2406.19575","url":null,"abstract":"With the advent of automation, many manufacturing industries have\u0000transitioned to data-centric methodologies, giving rise to an unprecedented\u0000influx of data during the manufacturing process. This data has become\u0000instrumental in analyzing the quality of manufacturing process and equipment.\u0000Engineers and data analysts, in particular, require extensive time-series data\u0000for seasonal cycle analysis. However, due to computational resource\u0000constraints, they are often limited to querying short-term data multiple times\u0000or resorting to the use of summarized data in which key patterns may be\u0000overlooked. This study proposes a novel solution to overcome these limitations;\u0000the advanced resolution-based pixel preemption data filtering (AR-PPF)\u0000algorithm. This technology allows for efficient visualization of time-series\u0000charts over long periods while significantly reducing the time required to\u0000retrieve data. We also demonstrates how this approach not only enhances the\u0000efficiency of data analysis but also ensures that key feature is not lost,\u0000thereby providing a more accurate and comprehensive understanding of the data.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"152 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xingfu Wu, Tupendra Oli, Justin H. Qian, Valerie Taylor, Mark C. Hersam, Vinod K. Sangwan
Support Vector Machine (SVM) is a state-of-the-art classification method widely used in science and engineering due to its high accuracy, its ability to deal with high-dimensional data, and its flexibility in modeling diverse sources of data. In this paper, we propose an autotuning-based optimization framework to quantify the ranges of hyperparameters in SVMs and identify their optimal choices, and we apply the framework to two SVMs with a mixed kernel combining Sigmoid and Gaussian kernels for smart pixel datasets in high energy physics (HEP) and for mixed-kernel heterojunction transistors (MKH). Our experimental results show that the optimal selection of hyperparameters in the SVMs and the kernels varies greatly across applications and datasets, and that choosing their optimal values is critical for high classification accuracy of the mixed-kernel SVMs. Uninformed choices of the hyperparameters C and coef0 in the mixed-kernel SVMs result in severely low accuracy, whereas the proposed framework effectively quantifies the proper ranges for the hyperparameters and identifies their optimal choices, achieving the highest accuracy of 94.6% for the HEP application and the highest average accuracy of 97.2%, with far less tuning time, for the MKH application.
{"title":"An Autotuning-based Optimization Framework for Mixed-kernel SVM Classifications in Smart Pixel Datasets and Heterojunction Transistors","authors":"Xingfu Wu, Tupendra Oli, ustin H. Qian, Valerie Taylor, Mark C. Hersam, Vinod K. Sangwan","doi":"arxiv-2406.18445","DOIUrl":"https://doi.org/arxiv-2406.18445","url":null,"abstract":"Support Vector Machine (SVM) is a state-of-the-art classification method\u0000widely used in science and engineering due to its high accuracy, its ability to\u0000deal with high dimensional data, and its flexibility in modeling diverse\u0000sources of data. In this paper, we propose an autotuning-based optimization\u0000framework to quantify the ranges of hyperparameters in SVMs to identify their\u0000optimal choices, and apply the framework to two SVMs with the mixed-kernel\u0000between Sigmoid and Gaussian kernels for smart pixel datasets in high energy\u0000physics (HEP) and mixed-kernel heterojunction transistors (MKH). Our\u0000experimental results show that the optimal selection of hyperparameters in the\u0000SVMs and the kernels greatly varies for different applications and datasets,\u0000and choosing their optimal choices is critical for a high classification\u0000accuracy of the mixed kernel SVMs. Uninformed choices of hyperparameters C and\u0000coef0 in the mixed-kernel SVMs result in severely low accuracy, and the\u0000proposed framework effectively quantifies the proper ranges for the\u0000hyperparameters in the SVMs to identify their optimal choices to achieve the\u0000highest accuracy 94.6% for the HEP application and the highest average\u0000accuracy 97.2% with far less tuning time for the MKH application.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"86 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, large language models (LLMs) have expanded into various domains. However, there remains a need to evaluate how these models perform when prompted with commonplace queries compared to domain-specific queries, which may be useful for benchmarking prior to fine-tuning on domain-specific downstream tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across diverse domains, including cybersecurity, medicine, and finance, in comparison with common knowledge queries. It employs a comprehensive methodology to evaluate foundational models, encompassing problem formulation, data analysis, and the development of novel outlier detection techniques; this methodological rigor enhances the credibility of the presented evaluation frameworks. The study focuses on assessing inference time, response length, throughput, quality, and resource utilization, and investigates the correlations between these factors. The results indicate that model size and the types of prompts used for inference significantly influence response length and quality. In addition, common prompts, which include various types of queries, generate diverse and inconsistent responses at irregular intervals. In contrast, domain-specific prompts consistently generate concise responses within a reasonable time. Overall, this study underscores the need for comprehensive evaluation frameworks to enhance the reliability of benchmarking procedures in multidomain AI research.
{"title":"How Good Is It? Evaluating the Efficacy of Common versus Domain-Specific Prompts on Foundational Large Language Models","authors":"Oluyemi Enoch Amujo, Shanchieh Jay Yang","doi":"arxiv-2407.11006","DOIUrl":"https://doi.org/arxiv-2407.11006","url":null,"abstract":"Recently, large language models (LLMs) have expanded into various domains.\u0000However, there remains a need to evaluate how these models perform when\u0000prompted with commonplace queries compared to domain-specific queries, which\u0000may be useful for benchmarking prior to fine-tuning domain-specific downstream\u0000tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across\u0000diverse domains, including cybersecurity, medicine, and finance, compared to\u0000common knowledge queries. This study employs a comprehensive methodology to\u0000evaluate foundational models, encompassing problem formulation, data analysis,\u0000and the development of novel outlier detection techniques. This methodological\u0000rigor enhances the credibility of the presented evaluation frameworks. This\u0000study focused on assessing inference time, response length, throughput,\u0000quality, and resource utilization and investigated the correlations between\u0000these factors. The results indicate that model size and types of prompts used\u0000for inference significantly influenced response length and quality. In\u0000addition, common prompts, which include various types of queries, generate\u0000diverse and inconsistent responses at irregular intervals. In contrast,\u0000domain-specific prompts consistently generate concise responses within a\u0000reasonable time. Overall, this study underscores the need for comprehensive\u0000evaluation frameworks to enhance the reliability of benchmarking procedures in\u0000multidomain AI research.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"46 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this white paper, I present my community effort to automatically co-design cheaper, faster and more energy-efficient software and hardware for AI, ML and other popular workloads with the help of the Collective Mind framework (CM), virtualized MLOps, MLPerf benchmarks and reproducible optimization tournaments. I developed CM to modularize, automate and virtualize the tedious process of building, running, profiling and optimizing complex applications across rapidly evolving open-source and proprietary AI/ML models, datasets, software and hardware. I achieved that with the help of portable, reusable and technology-agnostic automation recipes (ResearchOps) for MLOps and DevOps (CM4MLOps) discovered in close collaboration with academia and industry when reproducing more than 150 research papers and organizing the 1st mass-scale community benchmarking of ML and AI systems using CM and MLPerf. I donated CM and CM4MLOps to MLCommons to help connect academia and industry to learn how to build and run AI and other emerging workloads in the most efficient and cost-effective way using a common and technology-agnostic automation, virtualization and reproducibility framework while unifying knowledge exchange, protecting everyone's intellectual property, enabling portable skills, and accelerating transfer of the state-of-the-art research to production. My long-term vision is to make AI accessible to everyone by making it a commodity automatically produced from the most suitable open-source and proprietary components from different vendors based on user demand, requirements and constraints such as cost, latency, throughput, accuracy, energy, size and other important characteristics.
{"title":"Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments","authors":"Grigori Fursin","doi":"arxiv-2406.16791","DOIUrl":"https://doi.org/arxiv-2406.16791","url":null,"abstract":"In this white paper, I present my community effort to automatically co-design\u0000cheaper, faster and more energy-efficient software and hardware for AI, ML and\u0000other popular workloads with the help of the Collective Mind framework (CM),\u0000virtualized MLOps, MLPerf benchmarks and reproducible optimization tournaments.\u0000I developed CM to modularize, automate and virtualize the tedious process of\u0000building, running, profiling and optimizing complex applications across rapidly\u0000evolving open-source and proprietary AI/ML models, datasets, software and\u0000hardware. I achieved that with the help of portable, reusable and\u0000technology-agnostic automation recipes (ResearchOps) for MLOps and DevOps\u0000(CM4MLOps) discovered in close collaboration with academia and industry when\u0000reproducing more than 150 research papers and organizing the 1st mass-scale\u0000community benchmarking of ML and AI systems using CM and MLPerf. I donated CM and CM4MLOps to MLCommons to help connect academia and industry\u0000to learn how to build and run AI and other emerging workloads in the most\u0000efficient and cost-effective way using a common and technology-agnostic\u0000automation, virtualization and reproducibility framework while unifying\u0000knowledge exchange, protecting everyone's intellectual property, enabling\u0000portable skills, and accelerating transfer of the state-of-the-art research to\u0000production. My long-term vision is to make AI accessible to everyone by making\u0000it a commodity automatically produced from the most suitable open-source and\u0000proprietary components from different vendors based on user demand,\u0000requirements and constraints such as cost, latency, throughput, accuracy,\u0000energy, size and other important characteristics.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Radiance Fields (NeRF) is an emerging technique to synthesize 3D objects from 2D images, with a wide range of potential applications. However, rendering existing NeRF models is extremely computation intensive, making it challenging to support real-time interaction on mobile devices. In this paper, we take the first initiative to examine the state-of-the-art real-time NeRF rendering technique from a system perspective. We first define the entire working pipeline of the NeRF serving system. We then identify possible control knobs that are critical to the system from the communication, computation, and visual performance perspectives. Furthermore, an extensive measurement study is conducted to reveal the effects of these control knobs on system performance. Our measurement results reveal that different control knobs contribute differently towards improving system performance, with mesh granularity being the most effective knob and quantization the least effective. In addition, diverse hardware device settings and network conditions have to be considered to fully unleash the benefit of operating under the appropriate knobs.
{"title":"Towards Real-Time Neural Volumetric Rendering on Mobile Devices: A Measurement Study","authors":"Zhe Wang, Yifei Zhu","doi":"arxiv-2406.16068","DOIUrl":"https://doi.org/arxiv-2406.16068","url":null,"abstract":"Neural Radiance Fields (NeRF) is an emerging technique to synthesize 3D\u0000objects from 2D images with a wide range of potential applications. However,\u0000rendering existing NeRF models is extremely computation intensive, making it\u0000challenging to support real-time interaction on mobile devices. In this paper,\u0000we take the first initiative to examine the state-of-the-art real-time NeRF\u0000rendering technique from a system perspective. We first define the entire\u0000working pipeline of the NeRF serving system. We then identify possible control\u0000knobs that are critical to the system from the communication, computation, and\u0000visual performance perspective. Furthermore, an extensive measurement study is\u0000conducted to reveal the effects of these control knobs on system performance.\u0000Our measurement results reveal that different control knobs contribute\u0000differently towards improving the system performance, with the mesh granularity\u0000being the most effective knob and the quantization being the least effective\u0000knob. In addition, diverse hardware device settings and network conditions have\u0000to be considered to fully unleash the benefit of operating under the\u0000appropriate knobs","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}