Oluwaseun Adewunmi Alo, Sairam Sri Vatsavai, Ishan Thakkar
Deep Neural Networks (DNNs) predominantly rely on General Matrix Multiply (GEMM) kernels, which are often accelerated using specialized hardware architectures. Recently, analog photonic GEMM accelerators have emerged as a promising alternative, offering vastly superior speed and energy efficiency compared to traditional electronic accelerators. However, these photonic accelerators cannot support integer operands wider than 4 bits due to their inherent trade-offs between analog dynamic range and parallelism. This is often inadequate for DNN training, as operands at least 8 bits wide are deemed necessary to prevent significant accuracy drops. To address these limitations, we introduce a scalable photonic GEMM accelerator named SPOGA. SPOGA utilizes enhanced features such as analog summation of homodyne optical signals and in-transduction positional weighting of operands. By employing an extended optical-analog dataflow that minimizes the overheads associated with bit-sliced integer arithmetic, SPOGA supports byte-size integer GEMM kernels, achieving significant improvements in throughput, latency, and energy efficiency. Specifically, SPOGA demonstrates up to 14.4×, 2×, and 28.5× improvements in frames-per-second (FPS), FPS/Watt, and FPS/Watt/mm² respectively, compared to existing state-of-the-art photonic solutions.
{"title":"Scaling Analog Photonic Accelerators for Byte-Size, Integer General Matrix Multiply (GEMM) Kernels","authors":"Oluwaseun Adewunmi Alo, Sairam Sri Vatsavai, Ishan Thakkar","doi":"arxiv-2407.06134","DOIUrl":"https://doi.org/arxiv-2407.06134","url":null,"abstract":"Deep Neural Networks (DNNs) predominantly rely on General Matrix Multiply\u0000(GEMM) kernels, which are often accelerated using specialized hardware\u0000architectures. Recently, analog photonic GEMM accelerators have emerged as a\u0000promising alternative, offering vastly superior speed and energy efficiency\u0000compared to traditional electronic accelerators. However, these photonic cannot\u0000support wider than 4-bit integer operands due to their inherent trade-offs\u0000between analog dynamic range and parallelism. This is often inadequate for DNN\u0000training as at least 8-bit wide operands are deemed necessary to prevent\u0000significant accuracy drops. To address these limitations, we introduce a\u0000scalable photonic GEMM accelerator named SPOGA. SPOGA utilizes enhanced\u0000features such as analog summation of homodyne optical signals and\u0000in-transduction positional weighting of operands. By employing an extended\u0000optical-analog dataflow that minimizes overheads associated with bit-sliced\u0000integer arithmetic, SPOGA supports byte-size integer GEMM kernels, achieving\u0000significant improvements in throughput, latency, and energy efficiency.\u0000Specifically, SPOGA demonstrates up to 14.4$times$, 2$times$, and\u000028.5$times$ improvements in frames-per-second (FPS), FPS/Watt, and\u0000FPS/Watt/mm$^2$ respectively, compared to existing state-of-the-art photonic\u0000solutions.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141573207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CPU performance prediction, which involves forecasting the performance scores of a CPU based on its hardware characteristics during operation, is a critical technology for computational system design and resource management in the big data era. However, this research field currently faces two significant challenges. First, collecting real-world data is difficult due to the wide variety of CPU products on the market and the highly specialized nature of the relevant hardware characteristics; the field lacks a standard dataset with unified hardware characteristics, wide data coverage, and comprehensive benchmarks. Second, existing methods based on hardware simulation models or machine learning exhibit notable shortcomings, such as lengthy simulation test cycles and low prediction accuracy. To bridge these gaps, we first collect, preprocess, and standardize historical data from the 4th Generation Intel Xeon Scalable Processors across multiple benchmark suites to create a new dataset, named PerfCastDB. Subsequently, we design a deep-learning-based model called Nova CPU Performance Predictor (NCPP) as the baseline for this new dataset. The NCPP network is built around a group attention mechanism: it quantifies the implicit relationships between hardware characteristics within and across groups, and comprehensively models the impact of various hardware characteristics on CPU performance. We conduct comparative experiments on the proposed PerfCastDB dataset; compared to existing approaches, NCPP achieves superior evaluation results, demonstrating its effectiveness. Furthermore, we have open-sourced part of the dataset and the NCPP network code to facilitate subsequent research. The resources can be accessed at https://github.com/xiaoman-liu/NCPP.
{"title":"NCPP: Nova CPU Performance Predictor on a Novel Dataset","authors":"Xiaoman Liu","doi":"arxiv-2407.03385","DOIUrl":"https://doi.org/arxiv-2407.03385","url":null,"abstract":"CPU performance prediction, which involves forecasting the performance scores\u0000of a CPU based on its hardware characteristics during its operation, is a\u0000critical technology for computational system design and resource management in\u0000the big data era. However, this research field currently faces two significant\u0000challenges. First, collecting real-world data is challenging due to the wide\u0000variety of CPU products on the market and the highly specialized nature of\u0000relevant hardware characteristics. In the research process, this field lacks a\u0000standard dataset with unified hardware characteristics, wide data coverage, and\u0000comprehensive benchmarks. Second, existing methods based on hardware simulation\u0000models or machine learning exhibit notable shortcomings, such as lengthy\u0000simulation test cycles and low prediction accuracy. To bridge these gaps, we\u0000first collect, preprocess, and standardize historical data from the 4th\u0000Generation Intel Xeon Scalable Processors across multiple benchmark suites to\u0000create a new dataset, named PerfCastDB. Subsequently, we design a deep learning\u0000based model called Nova CPU Performance Predictor (NCPP) as the baseline for\u0000this new dataset. The NCPP network is designed based on group attention\u0000mechanism. It effectively quantifies the implicit relationships between\u0000hardware characteristics within and across groups and comprehensively models\u0000the impact of various hardware characteristics on CPU performance prediction.\u0000We conduct comparative experiments using the proposed PerfCastDB dataset.\u0000Compared to existing approaches, NCPP achieves superior evaluation results,\u0000demonstrating its effectiveness. Furthermore, we have open-sourced part of the\u0000dataset and the NCPP network code to facilitate subsequent research. The\u0000resources can be accessed at https://github.com/xiaoman-liu/NCPP.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141577704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the primary areas of interest in High Performance Computing is improving the performance of parallel workloads. Nowadays, compilable source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) to extract features from source code. Most such works target specific tasks or are designed around a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed; in particular, approaches mimicking large language models (LLMs) have been proposed, but these have prohibitively large training costs. In this paper, we propose MIREncoder, a Multi-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs and to better model code syntax, semantics, and structure. For code-based performance optimizations, these features are very important when making optimization decisions. A pre-trained model/embedding implicitly enables the use of transfer learning and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have low overhead and be easy to use. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables the use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations show that our proposed approach can outperform the state of the art while reducing overhead.
{"title":"MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations","authors":"Akash Dutta, Ali Jannesari","doi":"arxiv-2407.02238","DOIUrl":"https://doi.org/arxiv-2407.02238","url":null,"abstract":"One of the primary areas of interest in High Performance Computing is the\u0000improvement of performance of parallel workloads. Nowadays, compilable source\u0000code-based optimization tasks that employ deep learning often exploit LLVM\u0000Intermediate Representations (IRs) for extracting features from source code.\u0000Most such works target specific tasks, or are designed with a pre-defined set\u0000of heuristics. So far, pre-trained models are rare in this domain, but the\u0000possibilities have been widely discussed. Especially approaches mimicking\u0000large-language models (LLMs) have been proposed. But these have prohibitively\u0000large training costs. In this paper, we propose MIREncoder, a M}ulti-modal\u0000IR-based Auto-Encoder that can be pre-trained to generate a learned embedding\u0000space to be used for downstream tasks by machine learning-based approaches. A\u0000multi-modal approach enables us to better extract features from compilable\u0000programs. It allows us to better model code syntax, semantics and structure.\u0000For code-based performance optimizations, these features are very important\u0000while making optimization decisions. A pre-trained model/embedding implicitly\u0000enables the usage of transfer learning, and helps move away from task-specific\u0000trained models. Additionally, a pre-trained model used for downstream\u0000performance optimization should itself have reduced overhead, and be easily\u0000usable. These considerations have led us to propose a modeling approach that i)\u0000understands code semantics and structure, ii) enables use of transfer learning,\u0000and iii) is small and simple enough to be easily re-purposed or reused even\u0000with low resource availability. Our evaluations will show that our proposed\u0000approach can outperform the state of the art while reducing overhead.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141516732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chansup Byun, Julia Mullen, Albert Reuther, William Arcand, William Bergeron, David Bestor, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner, Lauren Milechin
One of the more complex tasks for researchers using HPC systems is performance monitoring and tuning of their applications. Developing a practice of continuous performance improvement, both for speed-up and for efficient use of resources, is essential to the long-term success of both the HPC practitioner and the research project. Profiling tools provide a detailed view of the performance of an application but often have a steep learning curve and rarely provide an easy-to-interpret view of resource utilization. Lower-level tools such as top and htop provide a view of resource utilization for those familiar and comfortable with Linux, but present a barrier for newer HPC practitioners. To expand the existing profiling and job monitoring options, the MIT Lincoln Laboratory Supercomputing Center created LLload, a tool that captures a snapshot of the resources being used by a job on a per-user basis. LLload is built from standard HPC tools and provides an easy way for a researcher to track the resource usage of active jobs. We explain how the tool was designed and implemented and provide insight into how it is used to aid new researchers in developing their performance monitoring skills as well as to guide researchers in their resource requests.
{"title":"LLload: Simplifying Real-Time Job Monitoring for HPC Users","authors":"Chansup Byun, Julia Mullen, Albert Reuther, William Arcand, William Bergeron, David Bestor, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner, Lauren Milechin","doi":"arxiv-2407.01481","DOIUrl":"https://doi.org/arxiv-2407.01481","url":null,"abstract":"One of the more complex tasks for researchers using HPC systems is\u0000performance monitoring and tuning of their applications. Developing a practice\u0000of continuous performance improvement, both for speed-up and efficient use of\u0000resources is essential to the long term success of both the HPC practitioner\u0000and the research project. Profiling tools provide a nice view of the\u0000performance of an application but often have a steep learning curve and rarely\u0000provide an easy to interpret view of resource utilization. Lower level tools\u0000such as top and htop provide a view of resource utilization for those familiar\u0000and comfortable with Linux but a barrier for newer HPC practitioners. To expand\u0000the existing profiling and job monitoring options, the MIT Lincoln Laboratory\u0000Supercomputing Center created LLoad, a tool that captures a snapshot of the\u0000resources being used by a job on a per user basis. LLload is a tool built from\u0000standard HPC tools that provides an easy way for a researcher to track resource\u0000usage of active jobs. We explain how the tool was designed and implemented and\u0000provide insight into how it is used to aid new researchers in developing their\u0000performance monitoring skills as well as guide researchers in their resource\u0000requests.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Operators are constantly faced with the need to increase optical-network capacity to accommodate rapid traffic growth while minimizing the cost-per-bit and power-per-bit. The drastic reduction in the power consumption of IP routers and ZR/ZR+ pluggable transponders seen in recent years has renewed interest in "opaque" optical-network architectures, where no optical bypassing is allowed. In this work, we aim to quantify and compare the power consumption of four "IP over Wavelength Division Multiplexing" (IPoWDM) transport network architectures employing ZR/ZR+ modules vs. long-haul muxponders, considering different grooming, regeneration, and optical bypassing capabilities. We first propose a power consumption model for different IPoWDM node architectures with ZR/ZR+ modules and long-haul muxponders. Then, to obtain the power consumption of the different architectures, we propose a compact auxiliary-graph-based network-design algorithm extensible to different network architectures. Moreover, through a sensitivity analysis, we investigate how the continuing decrease in the power consumption of ZR/ZR+ modules and IP routers can impact the power consumption of the different architectures. Illustrative numerical results on networks of different sizes show that, despite drastic reductions of power consumption at the IP layer, optical bypassing is still the most power-efficient solution, reducing consumption by up to 48%.
{"title":"A Power-Consumption Analysis for Different IPoWDM Network Architectures with ZR/ZR+ and Long-Haul Muxponders","authors":"Qiaolun Zhang, Annalisa Morea, Patricia Layec, Memedhe Ibrahimi, Francesco Musumeci, Massimo Tornatore","doi":"arxiv-2407.00643","DOIUrl":"https://doi.org/arxiv-2407.00643","url":null,"abstract":"Operators are constantly faced with the need to increase optical-network\u0000capacity to accommodate rapid traffic growth while minimizing the cost-per-bit\u0000and power-per-bit. The drastic reduction of power consumption of IP routers and\u0000ZR/ZR+ pluggable transponders seen in the last years has renewed the interest\u0000in \"opaque\" optical-network architectures, where no optical bypassing is\u0000allowed. In this work, we aim to quantify and compare the power consumption of\u0000four \"IP over Wavelength Division Multiplexing\" (IPoWDM) transport network\u0000architectures employing ZR/ZR+ modules vs. long-haul muxponders, considering\u0000different grooming, regeneration, and optical bypassing capabilities. We first\u0000propose a power consumption model for different IPoWDM node architectures with\u0000ZR/ZR+ modules and long-haul muxponders. Then, to obtain the power consumption\u0000of different architectures, we propose a compact auxiliary-graph-based\u0000network-design algorithm extensible to different network architectures.\u0000Moreover, we investigate how the continuous decrease in the power consumption\u0000of ZR/ZR+ and IP routers can impact the power consumption of different\u0000architectures through a sensitivity analysis. Illustrative numerical results on\u0000networks of different sizes show that, despite drastic reductions of power\u0000consumption at IP layer, optical bypassing is still the most power-efficient\u0000solution, reducing consumption by up to 48%.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141516733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the advent of automation, many manufacturing industries have transitioned to data-centric methodologies, giving rise to an unprecedented influx of data during the manufacturing process. This data has become instrumental in analyzing the quality of manufacturing processes and equipment. Engineers and data analysts, in particular, require extensive time-series data for seasonal cycle analysis. However, due to computational resource constraints, they are often limited to querying short-term data multiple times or resorting to summarized data in which key patterns may be overlooked. This study proposes a novel solution to overcome these limitations: the advanced resolution-based pixel preemption data filtering (AR-PPF) algorithm. This technique allows for efficient visualization of time-series charts over long periods while significantly reducing the time required to retrieve data. We also demonstrate how this approach not only enhances the efficiency of data analysis but also ensures that key features are not lost, thereby providing a more accurate and comprehensive understanding of the data.
{"title":"AR-PPF: Advanced Resolution-Based Pixel Preemption Data Filtering for Efficient Time-Series Data Analysis","authors":"Taewoong Kim, Kukjin Choi, Sungjun Kim","doi":"arxiv-2406.19575","DOIUrl":"https://doi.org/arxiv-2406.19575","url":null,"abstract":"With the advent of automation, many manufacturing industries have\u0000transitioned to data-centric methodologies, giving rise to an unprecedented\u0000influx of data during the manufacturing process. This data has become\u0000instrumental in analyzing the quality of manufacturing process and equipment.\u0000Engineers and data analysts, in particular, require extensive time-series data\u0000for seasonal cycle analysis. However, due to computational resource\u0000constraints, they are often limited to querying short-term data multiple times\u0000or resorting to the use of summarized data in which key patterns may be\u0000overlooked. This study proposes a novel solution to overcome these limitations;\u0000the advanced resolution-based pixel preemption data filtering (AR-PPF)\u0000algorithm. This technology allows for efficient visualization of time-series\u0000charts over long periods while significantly reducing the time required to\u0000retrieve data. We also demonstrates how this approach not only enhances the\u0000efficiency of data analysis but also ensures that key feature is not lost,\u0000thereby providing a more accurate and comprehensive understanding of the data.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"152 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xingfu Wu, Tupendra Oli, Justin H. Qian, Valerie Taylor, Mark C. Hersam, Vinod K. Sangwan
Support Vector Machine (SVM) is a state-of-the-art classification method widely used in science and engineering due to its high accuracy, its ability to deal with high-dimensional data, and its flexibility in modeling diverse sources of data. In this paper, we propose an autotuning-based optimization framework to quantify the ranges of hyperparameters in SVMs and identify their optimal choices, and we apply the framework to two SVMs with a mixed kernel combining Sigmoid and Gaussian kernels for smart pixel datasets in high energy physics (HEP) and for mixed-kernel heterojunction transistors (MKH). Our experimental results show that the optimal selection of hyperparameters in the SVMs and the kernels varies greatly across applications and datasets, and that choosing their optimal values is critical for high classification accuracy of the mixed-kernel SVMs. Uninformed choices of the hyperparameters C and coef0 in the mixed-kernel SVMs result in severely low accuracy, whereas the proposed framework effectively quantifies the proper ranges for the hyperparameters and identifies their optimal choices, achieving the highest accuracy of 94.6% for the HEP application and the highest average accuracy of 97.2%, with far less tuning time, for the MKH application.
{"title":"An Autotuning-based Optimization Framework for Mixed-kernel SVM Classifications in Smart Pixel Datasets and Heterojunction Transistors","authors":"Xingfu Wu, Tupendra Oli, ustin H. Qian, Valerie Taylor, Mark C. Hersam, Vinod K. Sangwan","doi":"arxiv-2406.18445","DOIUrl":"https://doi.org/arxiv-2406.18445","url":null,"abstract":"Support Vector Machine (SVM) is a state-of-the-art classification method\u0000widely used in science and engineering due to its high accuracy, its ability to\u0000deal with high dimensional data, and its flexibility in modeling diverse\u0000sources of data. In this paper, we propose an autotuning-based optimization\u0000framework to quantify the ranges of hyperparameters in SVMs to identify their\u0000optimal choices, and apply the framework to two SVMs with the mixed-kernel\u0000between Sigmoid and Gaussian kernels for smart pixel datasets in high energy\u0000physics (HEP) and mixed-kernel heterojunction transistors (MKH). Our\u0000experimental results show that the optimal selection of hyperparameters in the\u0000SVMs and the kernels greatly varies for different applications and datasets,\u0000and choosing their optimal choices is critical for a high classification\u0000accuracy of the mixed kernel SVMs. Uninformed choices of hyperparameters C and\u0000coef0 in the mixed-kernel SVMs result in severely low accuracy, and the\u0000proposed framework effectively quantifies the proper ranges for the\u0000hyperparameters in the SVMs to identify their optimal choices to achieve the\u0000highest accuracy 94.6% for the HEP application and the highest average\u0000accuracy 97.2% with far less tuning time for the MKH application.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"86 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, large language models (LLMs) have expanded into various domains. However, there remains a need to evaluate how these models perform when prompted with commonplace queries compared to domain-specific queries, which may be useful for benchmarking prior to fine-tuning on domain-specific downstream tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across diverse domains, including cybersecurity, medicine, and finance, in comparison with common knowledge queries. It employs a comprehensive methodology to evaluate foundational models, encompassing problem formulation, data analysis, and the development of novel outlier detection techniques; this methodological rigor enhances the credibility of the presented evaluation frameworks. The study focuses on assessing inference time, response length, throughput, quality, and resource utilization, and investigates the correlations between these factors. The results indicate that model size and the types of prompts used for inference significantly influence response length and quality. In addition, common prompts, which include various types of queries, generate diverse and inconsistent responses at irregular intervals. In contrast, domain-specific prompts consistently generate concise responses within a reasonable time. Overall, this study underscores the need for comprehensive evaluation frameworks to enhance the reliability of benchmarking procedures in multidomain AI research.
{"title":"How Good Is It? Evaluating the Efficacy of Common versus Domain-Specific Prompts on Foundational Large Language Models","authors":"Oluyemi Enoch Amujo, Shanchieh Jay Yang","doi":"arxiv-2407.11006","DOIUrl":"https://doi.org/arxiv-2407.11006","url":null,"abstract":"Recently, large language models (LLMs) have expanded into various domains.\u0000However, there remains a need to evaluate how these models perform when\u0000prompted with commonplace queries compared to domain-specific queries, which\u0000may be useful for benchmarking prior to fine-tuning domain-specific downstream\u0000tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across\u0000diverse domains, including cybersecurity, medicine, and finance, compared to\u0000common knowledge queries. This study employs a comprehensive methodology to\u0000evaluate foundational models, encompassing problem formulation, data analysis,\u0000and the development of novel outlier detection techniques. This methodological\u0000rigor enhances the credibility of the presented evaluation frameworks. This\u0000study focused on assessing inference time, response length, throughput,\u0000quality, and resource utilization and investigated the correlations between\u0000these factors. The results indicate that model size and types of prompts used\u0000for inference significantly influenced response length and quality. In\u0000addition, common prompts, which include various types of queries, generate\u0000diverse and inconsistent responses at irregular intervals. In contrast,\u0000domain-specific prompts consistently generate concise responses within a\u0000reasonable time. Overall, this study underscores the need for comprehensive\u0000evaluation frameworks to enhance the reliability of benchmarking procedures in\u0000multidomain AI research.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"46 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this white paper, I present my community effort to automatically co-design cheaper, faster and more energy-efficient software and hardware for AI, ML and other popular workloads with the help of the Collective Mind framework (CM), virtualized MLOps, MLPerf benchmarks and reproducible optimization tournaments. I developed CM to modularize, automate and virtualize the tedious process of building, running, profiling and optimizing complex applications across rapidly evolving open-source and proprietary AI/ML models, datasets, software and hardware. I achieved that with the help of portable, reusable and technology-agnostic automation recipes (ResearchOps) for MLOps and DevOps (CM4MLOps) discovered in close collaboration with academia and industry when reproducing more than 150 research papers and organizing the 1st mass-scale community benchmarking of ML and AI systems using CM and MLPerf. I donated CM and CM4MLOps to MLCommons to help connect academia and industry to learn how to build and run AI and other emerging workloads in the most efficient and cost-effective way using a common and technology-agnostic automation, virtualization and reproducibility framework while unifying knowledge exchange, protecting everyone's intellectual property, enabling portable skills, and accelerating transfer of the state-of-the-art research to production. My long-term vision is to make AI accessible to everyone by making it a commodity automatically produced from the most suitable open-source and proprietary components from different vendors based on user demand, requirements and constraints such as cost, latency, throughput, accuracy, energy, size and other important characteristics.
{"title":"Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments","authors":"Grigori Fursin","doi":"arxiv-2406.16791","DOIUrl":"https://doi.org/arxiv-2406.16791","url":null,"abstract":"In this white paper, I present my community effort to automatically co-design\u0000cheaper, faster and more energy-efficient software and hardware for AI, ML and\u0000other popular workloads with the help of the Collective Mind framework (CM),\u0000virtualized MLOps, MLPerf benchmarks and reproducible optimization tournaments.\u0000I developed CM to modularize, automate and virtualize the tedious process of\u0000building, running, profiling and optimizing complex applications across rapidly\u0000evolving open-source and proprietary AI/ML models, datasets, software and\u0000hardware. I achieved that with the help of portable, reusable and\u0000technology-agnostic automation recipes (ResearchOps) for MLOps and DevOps\u0000(CM4MLOps) discovered in close collaboration with academia and industry when\u0000reproducing more than 150 research papers and organizing the 1st mass-scale\u0000community benchmarking of ML and AI systems using CM and MLPerf. I donated CM and CM4MLOps to MLCommons to help connect academia and industry\u0000to learn how to build and run AI and other emerging workloads in the most\u0000efficient and cost-effective way using a common and technology-agnostic\u0000automation, virtualization and reproducibility framework while unifying\u0000knowledge exchange, protecting everyone's intellectual property, enabling\u0000portable skills, and accelerating transfer of the state-of-the-art research to\u0000production. My long-term vision is to make AI accessible to everyone by making\u0000it a commodity automatically produced from the most suitable open-source and\u0000proprietary components from different vendors based on user demand,\u0000requirements and constraints such as cost, latency, throughput, accuracy,\u0000energy, size and other important characteristics.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Radiance Fields (NeRF) is an emerging technique to synthesize 3D objects from 2D images, with a wide range of potential applications. However, rendering existing NeRF models is extremely computation intensive, making it challenging to support real-time interaction on mobile devices. In this paper, we take the first initiative to examine the state-of-the-art real-time NeRF rendering technique from a system perspective. We first define the entire working pipeline of the NeRF serving system. We then identify possible control knobs that are critical to the system from the communication, computation, and visual performance perspectives. Furthermore, an extensive measurement study is conducted to reveal the effects of these control knobs on system performance. Our measurement results reveal that different control knobs contribute differently towards improving system performance, with mesh granularity being the most effective knob and quantization the least effective. In addition, diverse hardware device settings and network conditions have to be considered to fully unleash the benefit of operating under the appropriate knobs.
{"title":"Towards Real-Time Neural Volumetric Rendering on Mobile Devices: A Measurement Study","authors":"Zhe Wang, Yifei Zhu","doi":"arxiv-2406.16068","DOIUrl":"https://doi.org/arxiv-2406.16068","url":null,"abstract":"Neural Radiance Fields (NeRF) is an emerging technique to synthesize 3D\u0000objects from 2D images with a wide range of potential applications. However,\u0000rendering existing NeRF models is extremely computation intensive, making it\u0000challenging to support real-time interaction on mobile devices. In this paper,\u0000we take the first initiative to examine the state-of-the-art real-time NeRF\u0000rendering technique from a system perspective. We first define the entire\u0000working pipeline of the NeRF serving system. We then identify possible control\u0000knobs that are critical to the system from the communication, computation, and\u0000visual performance perspective. Furthermore, an extensive measurement study is\u0000conducted to reveal the effects of these control knobs on system performance.\u0000Our measurement results reveal that different control knobs contribute\u0000differently towards improving the system performance, with the mesh granularity\u0000being the most effective knob and the quantization being the least effective\u0000knob. In addition, diverse hardware device settings and network conditions have\u0000to be considered to fully unleash the benefit of operating under the\u0000appropriate knobs","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}