ACM Transactions on Modeling and Performance Evaluation of Computing Systems: Latest Articles

Configuring and Coordinating End-to-End QoS for Emerging Storage Infrastructure

Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Pub Date : 2023-11-10 DOI: 10.1145/3631606

Jit Gupta, Krishna Kant, Amitangshu Pal, Joyanta Biswas

Modern data center storage systems are invariably networked to allow for consolidation and flexible management of storage. They also include high-performance storage devices based on flash or other emerging technologies, generally accessed through low-latency, high-throughput protocols such as NVMe (or its derivatives) carried over the network. With the increasing complexity and data-centric nature of applications, properly configuring the quality of service (QoS) for the storage path has become crucial for ensuring the desired application performance. Such QoS is substantially influenced by the QoS in the network path, in the access protocol, and in the storage device. In this paper, we define a new transport-level QoS mechanism for the network segment and demonstrate how it can augment and coordinate with the access-level QoS mechanism defined for NVMe and a similar QoS mechanism configured in the device. We show that the transport QoS mechanism not only provides the desired QoS to different classes of storage accesses but also protects access to shared persistent memory (PM) devices that are co-located with the storage but require much lower latency. We demonstrate that a properly coordinated configuration of the three QoS mechanisms on the path is crucial to achieving the desired differentiation, depending on where the bottlenecks appear.

Citations: 0

An approximation method for a non-preemptive multiserver queue with quasi-Poisson arrivals

Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Pub Date : 2023-09-13 DOI: 10.1145/3624474

Alexandre Brandwajn, Thomas Begin

We consider a non-preemptive multiserver queue with multiple priority classes. We assume distinct exponentially distributed service times and separate quasi-Poisson arrival processes with a predefined maximum number of requests that can be present in the system for each class. We present an approximation method to obtain the steady-state probabilities for the number of requests of each class in our system. In our method, the priority levels (classes) are solved “nearly separately”, linked only by certain conditional probabilities determined approximately from the solution of other priority levels. Several numerical examples illustrate the accuracy of our approximate solution. The proposed approach significantly reduces the complexity of the problem while featuring generally good accuracy.
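
As a rough illustration of the kind of steady-state computation involved, the sketch below solves a single-class birth-death chain only. It is not the paper's multi-class approximation. It assumes that "quasi-Poisson" can be read as Poisson arrivals at rate `lam` that stop once `max_requests` requests are present; that reading, and all function and parameter names, are assumptions made for this example.

```python
def birth_death_steady_state(lam, mu, servers, max_requests):
    """Steady-state probabilities p[n] of having n requests in the system.

    Hypothetical single-class model: arrivals at rate `lam` while fewer than
    `max_requests` requests are present, exponential service at rate `mu`
    per busy server, and `servers` identical servers.
    """
    weights = [1.0]  # unnormalised weight of the empty state
    for n in range(1, max_requests + 1):
        departure_rate = min(n, servers) * mu  # busy servers times service rate
        # Detailed balance: p[n] * departure_rate = p[n - 1] * lam
        weights.append(weights[-1] * lam / departure_rate)
    total = sum(weights)
    return [w / total for w in weights]


probabilities = birth_death_steady_state(lam=1.0, mu=1.0, servers=2, max_requests=2)
print(probabilities)
```

With several priority classes the state space grows as the product of the per-class caps, which is the complexity the abstract's "nearly separate" per-class solution is meant to avoid.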

Citations: 0

From compositional Petri Net modeling to macro and micro simulation by means of Stochastic Simulation and Agent-Based models

IF 0.6 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Pub Date : 2023-08-30 DOI: 10.1145/3617681

E. Amparore, M. Beccuti, P. Castagno, S. Pernice, G. Franceschinis, M. Pennisi

Computational modeling has become a widespread approach for studying real-world phenomena from different modeling perspectives. In particular, the microscopic point of view concentrates on the behavior of the single components and their interactions, from which the global system evolution emerges, while the macroscopic point of view represents the system’s overall behavior, abstracting as much as possible from that of the single components. The preferred point of view depends on the effort required to develop the model, on the level of detail of the available information about the system to be modeled, and on the type of measures that are of interest to the modeler; each point of view may lead to a different modeling language and simulation paradigm. An approach suited to the microscopic point of view is Agent-Based Modeling and Simulation, which has gained popularity in the last few decades but lacks a formal definition common to the different tools supporting it. This may lead to modeling mistakes and wrong interpretation of the results, especially when comparing models of the same system developed according to different points of view. The aim of the work described in this paper is to provide a common compositional modeling language from which both a macro and a micro simulation model can be automatically derived: these models are coherent by construction and may be studied through different simulation approaches and tools. A framework is thus proposed in which a model can be composed using a Petri Net formalism and then studied through both an Agent-Based Simulation and a classical Stochastic Simulation Algorithm, depending on the study goal.

Citations: 0

No-regret Caching via Online Mirror Descent

Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Pub Date : 2023-08-11 DOI: 10.1145/3605209

Tareq Si Salem, Giovanni Neglia, Stratis Ioannidis

We study an online caching problem in which requests can be served by a local cache to avoid retrieval costs from a remote server. The cache can update its state after a batch of requests and store an arbitrarily small fraction of each file. We study no-regret algorithms based on Online Mirror Descent (OMD) strategies. We show that bounds for the regret crucially depend on the diversity of the request process, captured by the diversity ratio R/h, where R is the size of the batch and h is the maximum multiplicity of a request in a given batch. We characterize the optimality of OMD caching policies w.r.t. regret under different diversity regimes. We also prove that, when the cache must store the entire file, rather than a fraction, OMD strategies can be coupled with a randomized rounding scheme that preserves regret guarantees, even when update costs cannot be neglected. We provide a formal characterization of the rounding problem through optimal transport theory, and moreover we propose a computationally efficient randomized rounding scheme.
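
To make the fractional caching update concrete, here is a minimal sketch of one OMD step under the Euclidean mirror map, in which case OMD reduces to projected online gradient ascent. The abstract does not say which mirror maps the paper analyses, so the Euclidean choice, the uniform retrieval cost, and the names `omd_step` and `project_capped_simplex` are assumptions made for illustration.

```python
def project_capped_simplex(y, capacity, tol=1e-9):
    """Euclidean projection of y onto {x : 0 <= x_i <= 1, sum(x) = capacity}.

    The projection has the form clip(y_i - tau, 0, 1); tau is found by
    bisection because the clipped sum is non-increasing in tau.
    """
    def clipped(tau):
        return [min(1.0, max(0.0, v - tau)) for v in y]

    lo, hi = min(y) - 1.0, max(y)  # clipped sum is len(y) at lo and 0 at hi
    while hi - lo > tol:
        tau = (lo + hi) / 2.0
        if sum(clipped(tau)) > capacity:
            lo = tau
        else:
            hi = tau
    return clipped((lo + hi) / 2.0)


def omd_step(x, batch, eta, capacity, cost=1.0):
    """One update of the fractional cache state x after a batch of requests.

    x[i] is the cached fraction of file i and batch is a list of requested
    file indices. The gradient of the caching gain with respect to x[i] is
    cost times the number of requests for file i in the batch.
    """
    counts = [0] * len(x)
    for file_index in batch:
        counts[file_index] += 1
    y = [xi + eta * cost * c for xi, c in zip(x, counts)]
    return project_capped_simplex(y, capacity)


state = omd_step([0.5, 0.5, 0.5, 0.5], batch=[0, 0, 0], eta=0.1, capacity=2)
print(state)
```

In this example the batch has R = 3 requests, all for the same file, so h = 3 and the diversity ratio R/h is 1, the least diverse case.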

Citations: 0

Optimal Pricing in a Single Server System

IF 0.6 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Pub Date : 2023-07-05 DOI: 10.1145/3607252

Ashok Krishnan K. S., C. Singh, S. T. Maguluri, Parimal Parag

We study optimal pricing in a single-server queue when the customers' valuation of service depends on their waiting time. In particular, we consider a very general model, where the customer valuations are random and are sampled from a distribution that depends on the queue length. The goal of the service provider is to set dynamic, state-dependent prices in order to maximize its revenue, while also managing congestion. We model the problem as a Markov decision process and present structural results on the optimal policy. We also present an algorithm to find an approximately optimal policy. We further present a myopic policy that is easy to evaluate and present bounds on its performance. We finally illustrate the quality of our approximate solution and the myopic solution using numerical simulations.

Citations: 0

Efficient Computation of Optimal Thresholds in Cloud Auto-scaling Systems

IF 0.6 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Pub Date : 2023-06-06 DOI: 10.1145/3603532

Thomas Tournaire, Hind Castel-Taleb, E. Hyon

We consider a horizontal and dynamic auto-scaling technique in a cloud system where virtual machines hosted on a physical node are turned on and off to minimise energy consumption while meeting performance requirements. Finding cloud management policies that adapt the system to the load is not straightforward, and we consider here that virtual machines are turned on and off depending on queue load thresholds. We want to compute the optimal threshold values that minimise consumption costs and penalty costs (incurred when performance requirements are not met). To solve this problem, we propose several optimisation methods, based on two different mathematical approaches. The first one is based on queueing theory and uses local search heuristics coupled with the stationary distributions of Markov chains. The second approach tackles the problem using a Markov Decision Process (MDP) in which we assume that the policy is of a special multi-threshold type called hysteresis. We improve the heuristics of the former approach with the aggregation of Markov chains and queue approximation techniques. We assess the benefit of threshold-aware algorithms for solving MDPs. Then we carry out theoretical analyses of the two approaches. We also compare them numerically and we show that all of the presented MDP algorithms strongly outperform the local search heuristics. Finally, we propose a cost model for a real scenario of a cloud system to apply our optimisation algorithms and to show their practical relevance. The major scientific contribution of the article is a set of fast (almost real-time) load-based threshold computation methods that can be used by a cloud provider to optimise its financial costs.
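
A hysteresis policy of the kind mentioned above can be sketched as follows. The sketch assumes one activation threshold and one strictly lower deactivation threshold per pair of adjacent VM counts; the threshold values and the function name `hysteresis_step` are hypothetical, and the abstract does not give the paper's exact policy definition.

```python
def hysteresis_step(active, queue_len, up, down):
    """Return the number of active VMs after observing the queue length.

    Hypothetical convention: up[i] and down[i] govern the transition between
    i + 1 and i + 2 active VMs, with down[i] < up[i].
      - with i + 1 VMs active, turn one on when queue_len >= up[i]
      - with i + 2 VMs active, turn one off when queue_len <= down[i]
    At most one VM is switched per call.
    """
    max_vms = len(up) + 1
    if active < max_vms and queue_len >= up[active - 1]:
        return active + 1
    if active > 1 and queue_len <= down[active - 2]:
        return active - 1
    return active


up_thresholds = [5, 10]   # illustrative values, not taken from the paper
down_thresholds = [2, 6]

vms = 1
for queue_len in [4, 5, 4, 2]:
    vms = hysteresis_step(vms, queue_len, up_thresholds, down_thresholds)
    print(queue_len, vms)
```

The gap between `down[i]` and `up[i]` is what prevents a VM from being switched on and off repeatedly when the queue length hovers around a single threshold.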

Citations: 0

Program Analysis and Machine Learning–based Approach to Predict Power Consumption of CUDA Kernel

IF 0.6 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Pub Date : 2023-06-05 DOI: 10.1145/3603533

Gargi Alavani, Jineet Desai, Snehanshu Saha, S. Sarkar

The General Purpose Graphics Processing Unit has secured a prominent position in the High-Performance Computing world due to its performance gain and programmability. Understanding the relationship between Graphics Processing Unit (GPU) power consumption and program features can aid developers in building energy-efficient, sustainable applications. In this work, we propose a static analysis-based power model built using machine learning techniques. We have investigated six machine learning models across three NVIDIA GPU architectures (Kepler, Maxwell, and Volta), with Random Forest, Extra Trees, Gradient Boosting, CatBoost, and XGBoost reporting favorable results. We observed that the XGBoost-based prediction model is the most efficient technique, with an R² value of 0.9646 on the Volta architecture. The dataset used for these techniques includes kernels from different benchmark suites and of different sizes, natures (e.g., compute-bound, memory-bound), and complexities (e.g., control divergence, memory access patterns). Experimental results suggest that the proposed solution can help developers precisely predict the power consumption of GPU applications using program analysis across GPU architectures. Developers can use this approach to refactor their code to build energy-efficient GPU applications.
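
For readers unfamiliar with the reported metric, the coefficient of determination can be computed as below. This is the standard definition of R² and not code from the paper; the power values in the example are made up purely for illustration.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Made-up power readings in watts, for illustration only.
measured_power = [50.0, 60.0, 70.0, 80.0]
predicted_power = [52.0, 59.0, 69.0, 82.0]
print(r_squared(measured_power, predicted_power))
```

A value of 1 means the predictions match the measurements exactly, and a value of 0 means the model does no better than always predicting the mean measured power.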

Citations: 1

Flydeling: Streamlined Performance Models for Hardware Acceleration of CNNs through System Identification

IF 0.6 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Pub Date : 2023-05-12 DOI: 10.1145/3594870

Walther Carballo-Hernández, M. Pelcat, S. Bhattacharyya, R. C. Galán, F. Berry

The introduction of deep learning algorithms, such as Convolutional Neural Networks (CNNs), in many near-sensor embedded systems opens new challenges in terms of energy efficiency and hardware performance. An emerging solution to address these challenges is to use tailored heterogeneous hardware accelerators combining processing elements of different architectural natures such as Central Processing Unit (CPU), Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), or Application Specific Integrated Circuit (ASIC). To progress towards heterogeneity, a great asset would be an automated design space exploration tool that chooses, for each accelerated partition of a CNN, the most appropriate architecture considering available resources. To feed such a design space exploration process, models are required that provide very fast yet precise evaluations of alternative architectures or alternative forms of CNNs. Quick configuration estimation could be achieved with few parameters from representative input sequences. This article studies a solution called flydeling (a contraction of flyweight modeling) for obtaining these models by drawing inspiration from the black-box System Identification (SI) domain. We refer to models derived using the proposed approach as flyweight models (flydels). A methodology is proposed to generate these flydels, using CNN properties as predictor features together with SI techniques with a stochastic excitation input at the level of feature map dimensions. For an embedded CPU-FPGA-GPU heterogeneous platform, it is demonstrated that it is possible to learn these Key Performance Indicator (KPI) flydels at an early design stage and from high-level application features. For latency, energy, and resource utilization, flydels obtain estimation errors varying between 5% and 10% with fewer model parameters compared to state-of-the-art solutions and are built automatically from platform measurements.

Citations: 1

Load-optimization in Reconfigurable Data-center Networks: Algorithms and Complexity of Flow Routing

IF 0.6 Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Pub Date : 2023-05-12 DOI: 10.1145/3597200

Wenkai Dai, Klaus-Tycho Foerster, David Fuchssteiner, Stefan Schmid

Emerging reconfigurable data centers introduce unprecedented flexibility in how the physical layer can be programmed to adapt to current traffic demands. These reconfigurable topologies are commonly hybrid, consisting of static and reconfigurable links, enabled by, e.g., an Optical Circuit Switch (OCS) connected to top-of-rack switches in Clos networks. Even though prior work has showcased the practical benefits of hybrid networks, several crucial performance aspects are not well understood. For example, many systems enforce artificial segregation of the hybrid network parts, leaving money on the table. In this article, we study the algorithmic problem of how to jointly optimize topology and routing in reconfigurable data centers, in order to optimize a most fundamental metric: the maximum link load. The complexity of reconfiguration mechanisms in this space is largely unexplored, especially for the following cross-layer network-design problem: given a hybrid network and a traffic matrix, jointly design the physical layer and the flow routing in order to minimize the maximum link load. We chart the corresponding algorithmic landscape in our work, investigating both un-/splittable flows and (non-)segregated routing policies. A topological complexity classification of the problem reveals NP-hardness in general for network topologies that are trees of depth at least two, in contrast to the tractability on trees of depth one. We moreover prove that the problem is not submodular for all these routing policies, even in multi-layer trees. However, networks that can be abstracted by a single packet switch (e.g., nonblocking Fat-Tree topologies) can be optimized efficiently, and we present optimal polynomial-time algorithms accordingly. We complement our theoretical results with trace-driven simulation studies, where our algorithms can significantly improve the network load in comparison to the state-of-the-art.
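
The objective named in the abstract, the maximum link load, can be evaluated for a given routing as in the sketch below. It only scores a fixed routing and does not implement any of the paper's optimization algorithms. Links are treated as undirected and loads as raw traffic volumes, and the node names and demands are invented for the example; all of these are assumptions.

```python
from collections import defaultdict


def max_link_load(routed_flows):
    """Return the largest total traffic carried by any single link.

    routed_flows is a list of (demand, path) pairs, where path is the list
    of nodes the flow traverses. Links are assumed undirected, so the hops
    (u, v) and (v, u) count towards the same link.
    """
    load = defaultdict(float)
    for demand, path in routed_flows:
        for u, v in zip(path, path[1:]):
            load[frozenset((u, v))] += demand
    return max(load.values(), default=0.0)


# Two racks A and B under a static switch S, plus one reconfigurable A-B link.
static_only = [(10.0, ["A", "S", "B"])]
split_over_both = [(6.0, ["A", "B"]), (4.0, ["A", "S", "B"])]
print(max_link_load(static_only), max_link_load(split_over_both))
```

In this toy case, splitting the demand across the static path and the reconfigurable link lowers the maximum load from 10 to 6, which is the kind of gain a non-segregated, splittable routing policy can exploit.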

Citations: 1

Dynamic Scheduling in a Partially Fluid, Partially Lossy Queueing System

Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Pub Date : 2023-04-13 DOI: 10.1145/3582884

Kiran Chaudhary, Veeraruna Kavitha, Jayakrishnan Nair

We consider a single server queueing system with two classes of jobs: eager jobs with small sizes that require service to begin almost immediately upon arrival, and tolerant jobs with larger sizes that can wait for service. While blocking probability is the relevant performance metric for the eager class, the tolerant class seeks to minimize its mean sojourn time. In this article, we analyse the performance of each class under dynamic scheduling policies, where the scheduling of both classes depends on the instantaneous state of the system. This analysis is carried out under a certain fluid limit, where the arrival rate and service rate of the eager class are scaled to infinity, holding the offered load constant. Our performance characterizations reveal a (dynamic) pseudo-conservation law that ties the performance of both classes to the standalone blocking probabilities associated with the scheduling policies for the eager class. Furthermore, the performance is robust to other specifics of the scheduling policies. We also characterize the Pareto frontier of the achievable region of performance vectors under the same fluid limit, and identify a (two-parameter) class of Pareto-complete scheduling policies.


Citations: 0