I/O virtualization is used by cloud platforms to provide tenants with efficient, scalable, and manageable network and storage services. The de-facto industry standard, paravirtualization, offers rich cloud functionality by introducing split front-end and back-end drivers in the guest and host operating systems, respectively. However, paravirtualization incurs host inefficiency and performance overhead. Emerging hardware virtio accelerators (i.e., SRIOV-capable devices that conform to the virtio specification), used with device passthrough technologies, mitigate this performance issue, but adopting them presents the challenge of insufficient support for live migration. This paper proposes Un-IOV, a novel I/O virtualization system that simultaneously achieves bare-metal-level I/O performance and migratability. The key idea is a new hybrid virtualization stack with: (1) a host-bypassed direct data path for virtio accelerators, and (2) a relayed control path that guarantees seamless live-migration support. Un-IOV achieves high scalability by consuming minimal host resources. Extensive experimental results demonstrate that Un-IOV delivers higher network and storage virtualization performance than software implementations, with performance comparable to direct passthrough I/O virtualization, while requiring zero guest modification (i.e., guest transparency).
{"title":"Un-IOV: Achieving Bare-Metal Level I/O Virtualization Performance for Cloud Usage With Migratability, Scalability and Transparency","authors":"Zongpu Zhang;Chenbo Xia;Cunming Liang;Jian Li;Chen Yu;Tiwei Bie;Roberts Martin;Daly Dan;Xiao Wang;Yong Liu;Haibing Guan","doi":"10.1109/TC.2024.3375589","journal":"IEEE Transactions on Computers","volume":"73 7","pages":"1655-1668","PeriodicalIF":3.7,"publicationDate":"2024-03-14","publicationType":"Journal Article"}
Long-timescale Molecular Dynamics (MD) simulation of small molecules is crucial in drug design and basic science. Accelerating a small data set that is executed for a large number of iterations requires high efficiency. Recent work in this domain has demonstrated that, among COTS devices, only FPGA-centric clusters can scale beyond a few processors. The problem addressed here is that, as the number of on-chip processors has increased from fewer than 10 into the hundreds, previous intra-chip routing solutions are no longer viable. We find, however, that high efficiency can be maintained through various design innovations. These include replacing the previous broadcast networks with ring routing and then augmenting the rings with out-of-order and caching mechanisms; others are adding a level of hierarchical filtering and memory recycling. Two novel optimized architectures emerge, together with a number of variations; these are validated, analyzed, and evaluated. We find that, in the domain of interest, speed-ups over GPUs are achieved. The potential impact is that this system promises to be the basis for scalable long-timescale MD with commodity clusters.
{"title":"FPGA-Accelerated Range-Limited Molecular Dynamics","authors":"Chunshu Wu;Chen Yang;Sahan Bandara;Tong Geng;Anqi Guo;Pouya Haghi;Ang Li;Martin Herbordt","doi":"10.1109/TC.2024.3375613","journal":"IEEE Transactions on Computers","volume":"73 6","pages":"1544-1558","PeriodicalIF":3.7,"publicationDate":"2024-03-14","publicationType":"Journal Article"}
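The abstract's replacement of broadcast networks with ring routing can be illustrated with a toy model. The sketch below simulates a ring all-gather, a standard communication pattern in which every node forwards one data chunk per step to a single fixed neighbour, so after n-1 steps all n nodes hold all chunks with constant per-node fan-out (versus the n-1 fan-out of a broadcast). This is only an illustrative software analogue under assumed names; it is not the paper's FPGA implementation.

```python
def ring_allgather(n):
    """Simulate an n-node ring all-gather.

    Each node starts with one chunk (its own index) and, every step,
    forwards the chunk it most recently received to its fixed ring
    neighbour. After n-1 steps every node holds all n chunks.
    Returns (per-node chunk sets, number of steps taken).
    """
    received = [{i} for i in range(n)]   # chunks each node holds
    sending = list(range(n))             # chunk each node sends this step
    steps = 0
    for _ in range(n - 1):
        incoming = [0] * n
        for i in range(n):
            dest = (i + 1) % n           # single fixed ring link
            received[dest].add(sending[i])
            incoming[dest] = sending[i]  # forwarded onward next step
        sending = incoming
        steps += 1
    return received, steps
```

The key property mirrored from the abstract: completion takes n-1 steps, but each node drives only one link per step, which is what lets a ring scale where a broadcast fabric does not.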
Lulu Yao;Yongkun Li;Patrick P. C. Lee;Xiaoyang Wang;Yinlong Xu
Memory deduplication effectively relieves the memory space bottleneck by removing duplicate pages, especially in virtualized systems in which virtual machines run the same OS and similar applications. However, due to the non-uniform access latencies in NUMA architectures, memory deduplication poses a trade-off between memory savings and access performance: global deduplication across NUMA nodes realizes high memory savings, but leads to frequent cross-node remote access after deduplication and results in performance degradations. In contrast, local deduplication avoids remote access, but limits deduplication effectiveness. We design AdaptMD, an adaptive memory deduplication system that addresses the space-performance trade-off in NUMA architectures. AdaptMD leverages hotness awareness to globally deduplicate only cold pages to reduce remote access. It also migrates similar applications to the same NUMA node to allow local deduplication without remote access. We further make AdaptMD readily configurable to address various deployment scenarios. Experiments show that AdaptMD achieves high memory savings as in global deduplication, while achieving similar access performance as in local deduplication.
{"title":"AdaptMD: Balancing Space and Performance in NUMA Architectures With Adaptive Memory Deduplication","authors":"Lulu Yao;Yongkun Li;Patrick P. C. Lee;Xiaoyang Wang;Yinlong Xu","doi":"10.1109/TC.2024.3375592","journal":"IEEE Transactions on Computers","volume":"73 6","pages":"1588-1602","PeriodicalIF":3.7,"publicationDate":"2024-03-14","publicationType":"Journal Article"}
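The hotness-aware policy described in the abstract — deduplicate cold pages globally across NUMA nodes, but restrict hot pages to node-local deduplication so a frequently accessed page never ends up mapped to remote memory — can be sketched as a toy model. The page representation, threshold, and function names below are assumptions for illustration, not AdaptMD's actual code.

```python
import hashlib

def hotness_aware_dedup(pages, hot_threshold=8):
    """Toy hotness-aware deduplication.

    pages: list of (node_id, access_count, content) tuples.
    Cold pages (below the threshold) share one global namespace, so
    duplicates merge across NUMA nodes; hot pages are scoped per node,
    so a hot page only merges with a copy on its own node and never
    becomes a remote access. Returns (kept pages, merge count).
    """
    owner = {}            # dedup key -> node holding the canonical copy
    kept, merged = [], 0
    for node, hits, content in pages:
        digest = hashlib.sha256(content).hexdigest()
        if hits >= hot_threshold:
            key = (digest, node)   # hot: node-local scope only
        else:
            key = (digest,)        # cold: global scope
        if key in owner:
            merged += 1            # map this page to the existing copy
        else:
            owner[key] = node
            kept.append((node, content))
    return kept, merged
```

Under this model, identical cold pages on different nodes collapse to one copy (maximizing savings), while identical hot pages on different nodes stay separate (preserving local access latency) — the space-performance trade-off the paper targets.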
In this paper, we present a new method to find S-box circuits with optimal multiplicative complexity (MC), i.e., MC-optimal S-box circuits. We provide new observations for efficiently constructing circuits and computing MC, combined with the popular A* pathfinding algorithm. In our search, the A* algorithm outputs a path of length MC, corresponding to an MC-optimal circuit. Based on an in-depth analysis of the process of computing MC, we enable the A* algorithm to function within our graph to investigate a wider range of S-boxes than existing methods such as the SAT-solver-based tool [1]
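The role A* plays in this search can be seen from a generic skeleton. With uniform edge cost 1, the length of the optimal path returned by A* is the minimum number of steps from the start state to a goal state; in the paper's setting each step corresponds to adding one AND gate, so that length is the MC. The state encoding, neighbour expansion, and heuristic below are deliberately generic placeholders — the paper's are far more elaborate — and the toy usage in the comment is purely illustrative.

```python
import heapq

def astar(start, is_goal, neighbors, h):
    """Generic A* with unit edge costs.

    h must be an admissible heuristic (never overestimates remaining
    steps); then the returned value g is the length of a shortest path,
    or None if no goal state is reachable.
    """
    frontier = [(h(start), 0, start)]    # (f = g + h, g, state)
    best_g = {start: 0}
    while frontier:
        f, g, state = heapq.heappop(frontier)
        if is_goal(state):
            return g
        if g > best_g.get(state, float("inf")):
            continue                     # stale queue entry
        for nxt in neighbors(state):
            ng = g + 1
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None

# Toy usage: reach 10 from 0 by steps of +1 or +2; the admissible
# heuristic ceil((10 - x) / 2) gives the optimal path length 5.
```

In the MC search, a state would encode the set of functions computed so far, a neighbour transition would add one AND gate, and the heuristic would lower-bound the number of AND gates still required.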