Haoran You, Cheng Wan, Yang Zhao, Zhongzhi Yu, Y. Fu, Jiayi Yuan, Shang Wu, Shunyao Zhang, Yongan Zhang, Chaojian Li, V. Boominathan, A. Veeraraghavan, Ziyun Li, Yingyan Lin
Eye tracking has become an essential human-machine interaction modality for providing immersive experiences in numerous virtual and augmented reality (VR/AR) applications that demand high throughput (e.g., 240 FPS), a small form-factor, and enhanced visual privacy. However, existing eye tracking systems are still limited by (1) their large form-factor, largely due to the bulky lens-based cameras they adopt; (2) the high communication cost between the camera and the backend processor; and (3) their low visual privacy, a growing concern, all of which prohibit their wider adoption. To this end, we propose, develop, and validate a lensless FlatCam-based eye tracking algorithm and accelerator co-design framework, dubbed EyeCoD, to enable eye tracking systems with a much-reduced form-factor and boosted system efficiency without sacrificing tracking accuracy, paving the way for next-generation eye tracking solutions. On the system level, we advocate the use of lensless FlatCams instead of lens-based cameras to meet the small form-factor requirement of mobile eye tracking systems, which also leaves room for a dedicated sensing-processor co-design that reduces the required camera-processor communication latency. On the algorithm level, EyeCoD integrates a predict-then-focus pipeline that first predicts the region-of-interest (ROI) via segmentation and then focuses only on the ROI to estimate gaze directions, greatly reducing redundant computations and data movements. On the hardware level, we further develop a dedicated accelerator that (1) integrates a novel workload orchestration between the aforementioned segmentation and gaze estimation models, (2) leverages intra-channel reuse opportunities for depth-wise layers, (3) utilizes input feature-wise partition to save activation memory size, and (4) develops a sequential-write-parallel-read input buffer to alleviate the bandwidth requirement of the activation global buffer. On-silicon measurements and extensive experiments validate that EyeCoD consistently reduces both communication and computation costs, leading to overall system speedups of 10.95×, 3.21×, and 12.85× over CPUs, GPUs, and a prior-art eye tracking processor (CIS-GEP), respectively, while maintaining tracking accuracy. Code is available at https://github.com/RICE-EIC/EyeCoD.
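As a rough illustration of the predict-then-focus idea, the sketch below first runs a hypothetical segmentation stand-in to obtain an ROI, then feeds only that crop to a hypothetical gaze-estimation stand-in; the function names, shapes, and dummy outputs are assumptions for illustration only, not EyeCoD's actual models or FlatCam reconstruction.

```python
import numpy as np

def segment_eye(frame):
    """Hypothetical stand-in for the segmentation model: returns a binary pupil/iris mask."""
    h, w = frame.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 3 : 2 * h // 3, w // 3 : 2 * w // 3] = True  # placeholder "predicted" region
    return mask

def roi_from_mask(mask, pad=8):
    """Bounding box of the predicted mask, padded and clipped to the frame."""
    ys, xs = np.nonzero(mask)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad + 1, mask.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad + 1, mask.shape[1])
    return y0, y1, x0, x1

def estimate_gaze(roi):
    """Hypothetical stand-in for the gaze-estimation model: maps the ROI crop to (yaw, pitch)."""
    return np.zeros(2)

frame = np.random.rand(240, 320)                  # one reconstructed eye frame (placeholder)
y0, y1, x0, x1 = roi_from_mask(segment_eye(frame))
gaze = estimate_gaze(frame[y0:y1, x0:x1])         # only the ROI reaches the second model
```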
{"title":"EyeCoD: eye tracking system acceleration via flatcam-based algorithm & accelerator co-design","authors":"Haoran You, Cheng Wan, Yang Zhao, Zhongzhi Yu, Y. Fu, Jiayi Yuan, Shang Wu, Shunyao Zhang, Yongan Zhang, Chaojian Li, V. Boominathan, A. Veeraraghavan, Ziyun Li, Yingyan Lin","doi":"10.1145/3470496.3527443","DOIUrl":"https://doi.org/10.1145/3470496.3527443","url":null,"abstract":"Eye tracking has become an essential human-machine interaction modality for providing immersive experience in numerous virtual and augmented reality (VR/AR) applications desiring high throughput (e.g., 240 FPS), small-form, and enhanced visual privacy. However, existing eye tracking systems are still limited by their: (1) large form-factor largely due to the adopted bulky lens-based cameras; (2) high communication cost required between the camera and backend processor; and (3) potentially concerned low visual privacy, thus prohibiting their more extensive applications. To this end, we propose, develop, and validate a lensless FlatCambased eye tracking algorithm and accelerator co-design framework dubbed EyeCoD to enable eye tracking systems with a much reduced form-factor and boosted system efficiency without sacrificing the tracking accuracy, paving the way for next-generation eye tracking solutions. On the system level, we advocate the use of lensless FlatCams instead of lens-based cameras to facilitate the small form-factor need in mobile eye tracking systems, which also leaves rooms for a dedicated sensing-processor co-design to reduce the required camera-processor communication latency. On the algorithm level, EyeCoD integrates a predict-then-focus pipeline that first predicts the region-of-interest (ROI) via segmentation and then only focuses on the ROI parts to estimate gaze directions, greatly reducing redundant computations and data movements. On the hardware level, we further develop a dedicated accelerator that (1) integrates a novel workload orchestration between the aforementioned segmentation and gaze estimation models, (2) leverages intra-channel reuse opportunities for depth-wise layers, (3) utilizes input feature-wise partition to save activation memory size, and (4) develops a sequential-write-parallel-read input buffer to alleviate the bandwidth requirement for the activation global buffer. On-silicon measurement and extensive experiments validate that our EyeCoD consistently reduces both the communication and computation costs, leading to an overall system speedup of 10.95×, 3.21×, and 12.85× over general computing platforms including CPUs and GPUs, and a prior-art eye tracking processor called CIS-GEP, respectively, while maintaining the tracking accuracy. Codes are available at https://github.com/RICE-EIC/EyeCoD.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"58 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124268525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, D. Novo, Juan G'omez-Luna, S. Stuijk, H. Corporaal, O. Mutlu
Hybrid storage systems (HSS) use multiple different storage devices to provide high and scalable storage capacity at high performance. Data placement across different devices is critical to maximize the benefits of such a hybrid system. Recent research proposes various techniques that aim to accurately identify performance-critical data to place it in a "best-fit" storage device. Unfortunately, most of these techniques are rigid, which (1) limits their adaptivity to perform well for a wide range of workloads and storage device configurations, and (2) makes it difficult for designers to extend these techniques to different storage system configurations (e.g., with a different number or different types of storage devices) than the configuration they are designed for. Our goal is to design a new data placement technique for hybrid storage systems that overcomes these issues and provides: (1) adaptivity, by continuously learning from and adapting to the workload and the storage device characteristics, and (2) easy extensibility to a wide range of workloads and HSS configurations. We introduce Sibyl, the first technique that uses reinforcement learning for data placement in hybrid storage systems. Sibyl observes different features of the running workload as well as the storage devices to make system-aware data placement decisions. For every decision it makes, Sibyl receives a reward from the system that it uses to evaluate the long-term performance impact of its decision and continuously optimizes its data placement policy online. We implement Sibyl on real systems with various HSS configurations, including dual- and tri-hybrid storage systems, and extensively compare it against four previously proposed data placement techniques (both heuristic- and machine learning-based) over a wide range of workloads. Our results show that Sibyl provides 21.6%/19.9% performance improvement in a performance-oriented/cost-oriented HSS configuration compared to the best previous data placement technique. Our evaluation using an HSS configuration with three different storage devices shows that Sibyl outperforms the state-of-the-art data placement policy by 23.9%-48.2%, while significantly reducing the system architect's burden in designing a data placement mechanism that can simultaneously incorporate three storage devices. We show that Sibyl achieves 80% of the performance of an oracle policy that has complete knowledge of future access patterns while incurring a very modest storage overhead of only 124.4 KiB.
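The sketch below shows the reward-driven placement loop in the spirit of Sibyl: an epsilon-greedy agent picks a device per request based on workload features and updates its value estimates online from a latency-derived reward. The state features, device names, and update rule are illustrative assumptions, not the paper's actual formulation.

```python
import random
from collections import defaultdict

DEVICES = ["fast_ssd", "slow_hdd"]            # an example dual-hybrid configuration (assumption)

class PlacementAgent:
    """Toy epsilon-greedy, reward-driven placement policy; not Sibyl's actual design."""
    def __init__(self, epsilon=0.1, lr=0.1):
        self.q = defaultdict(lambda: {d: 0.0 for d in DEVICES})   # workload state -> device values
        self.epsilon, self.lr = epsilon, lr

    def place(self, state):
        if random.random() < self.epsilon:                        # explore occasionally
            return random.choice(DEVICES)
        return max(self.q[state], key=self.q[state].get)          # otherwise exploit

    def update(self, state, device, reward):
        # Online update: nudge the estimated value of the chosen device toward the observed reward.
        self.q[state][device] += self.lr * (reward - self.q[state][device])

agent = PlacementAgent()
state = ("hot", "4KiB")                          # e.g., recency and request-size features of the page
device = agent.place(state)
latency_us = 80 if device == "fast_ssd" else 900
agent.update(state, device, reward=-latency_us)  # lower observed latency => higher reward
```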
{"title":"Sibyl: adaptive and extensible data placement in hybrid storage systems using online reinforcement learning","authors":"Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, D. Novo, Juan G'omez-Luna, S. Stuijk, H. Corporaal, O. Mutlu","doi":"10.1145/3470496.3527442","DOIUrl":"https://doi.org/10.1145/3470496.3527442","url":null,"abstract":"Hybrid storage systems (HSS) use multiple different storage devices to provide high and scalable storage capacity at high performance. Data placement across different devices is critical to maximize the benefits of such a hybrid system. Recent research proposes various techniques that aim to accurately identify performance-critical data to place it in a \"best-fit\" storage device. Unfortunately, most of these techniques are rigid, which (1) limits their adaptivity to perform well for a wide range of workloads and storage device configurations, and (2) makes it difficult for designers to extend these techniques to different storage system configurations (e.g., with a different number or different types of storage devices) than the configuration they are designed for. Our goal is to design a new data placement technique for hybrid storage systems that overcomes these issues and provides: (1) adaptivity, by continuously learning from and adapting to the workload and the storage device characteristics, and (2) easy extensibility to a wide range of workloads and HSS configurations. We introduce Sibyl, the first technique that uses reinforcement learning for data placement in hybrid storage systems. Sibyl observes different features of the running workload as well as the storage devices to make system-aware data placement decisions. For every decision it makes, Sibyl receives a reward from the system that it uses to evaluate the long-term performance impact of its decision and continuously optimizes its data placement policy online. We implement Sibyl on real systems with various HSS configurations, including dual- and tri-hybrid storage systems, and extensively compare it against four previously proposed data placement techniques (both heuristic- and machine learning-based) over a wide range of workloads. Our results show that Sibyl provides 21.6%/19.9% performance improvement in a performance-oriented/cost-oriented HSS configuration compared to the best previous data placement technique. Our evaluation using an HSS configuration with three different storage devices shows that Sibyl outperforms the state-of-the-art data placement policy by 23.9%-48.2%, while significantly reducing the system architect's burden in designing a data placement mechanism that can simultaneously incorporate three storage devices. We show that Sibyl achieves 80% of the performance of an oracle policy that has complete knowledge offuture access patterns while incurring a very modest storage overhead of only 124.4 KiB.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123235224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Z. Bingöl, G. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie S. Kim, Nika Mansouri-Ghiasi, Gagandeep Singh, Juan G'omez-Luna, N. Alserr, M. Alser, S. Subramoney, C. Alkan, Saugata Ghose, O. Mutlu
A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need for a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. Since sequence-to-sequence mapping can be treated as a special case of sequence-to-graph mapping, we aim to design an accelerator that is efficient for both linear and graph-based read mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator, which finds the candidate locations in a given genome graph; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator, which performs alignment between a given read and the subgraph identified by MinSeed. We couple SeGraM with high-bandwidth memory to exploit low latency and highly-parallel memory access, which alleviates the memory bottleneck. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph (i.e., S2G) and sequence-to-sequence (i.e., S2S) mapping pipelines. First, SeGraM outperforms state-of-the-art S2G mapping tools by 5.9×/3.9× and 106×/742× for long and short reads, respectively, while reducing power consumption by 4.1×/4.4× and 3.0×/3.2×. Second, BitAlign outperforms a state-of-the-art S2G alignment tool by 41×-539× and three S2S alignment accelerators by 1.2×-4.8×. We conclude that SeGraM is a high-performance and low-cost universal genomics mapping accelerator that efficiently supports both sequence-to-graph and sequence-to-sequence mapping pipelines.
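For readers unfamiliar with minimizer-based seeding (the idea behind MinSeed), the sketch below computes classic (w, k)-minimizers in software: the lexicographically smallest k-mer in each window of w consecutive k-mers becomes a seed. This is a reference-level illustration of the seeding concept, not the accelerator's implementation.

```python
def minimizers(seq, k=5, w=4):
    """Classic (w, k)-minimizer seeding: keep the smallest k-mer in every window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    seeds = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        j = min(range(w), key=lambda t: window[t])   # position of the smallest k-mer in the window
        seeds.add((start + j, window[j]))            # (position in read, minimizer)
    return sorted(seeds)

# Only a handful of k-mers survive as seeds, which is what keeps candidate lookup cheap.
print(minimizers("ACGTACGTTGCAACGT"))
```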
{"title":"SeGraM: a universal hardware accelerator for genomic sequence-to-graph and sequence-to-sequence mapping","authors":"Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Z. Bingöl, G. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie S. Kim, Nika Mansouri-Ghiasi, Gagandeep Singh, Juan G'omez-Luna, N. Alserr, M. Alser, S. Subramoney, C. Alkan, Saugata Ghose, O. Mutlu","doi":"10.1145/3470496.3527436","DOIUrl":"https://doi.org/10.1145/3470496.3527436","url":null,"abstract":"A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. Since sequence-to-sequence mapping can be treated as a special case of sequence-to-graph mapping, we aim to design an accelerator that is efficient for both linear and graph-based read mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator, which finds the candidate locations in a given genome graph; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator, which performs alignment between a given read and the subgraph identified by MinSeed. We couple SeGraM with high-bandwidth memory to exploit low latency and highly-parallel memory access, which alleviates the memory bottleneck. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph (i.e., S2G) and sequence-to-sequence (i.e., S2S) mapping pipelines. First, SeGraM outperforms state-of-the-art S2G mapping tools by 5.9×/3.9× and 106×/- 742× for long and short reads, respectively, while reducing power consumption by 4.1×/4.4× and 3.0×/3.2×. Second, BitAlign outperforms a state-of-the-art S2G alignment tool by 41×-539× and three S2S alignment accelerators by 1.2×-4.8×. 
We conclude that SeGraM is a high-performance and low-cost universal genomics mapping accelerator that efficiently supports both sequence-to-graph and sequence-to-sequence mapping pipelines.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125534583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Personalized recommendation models (RecSys) are among the most popular machine learning workloads served by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirement, with model sizes reaching hundreds of GBs to TBs. In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers. Unfortunately, training embeddings involves several memory-bandwidth-intensive operations, which is at odds with the slow CPU memory and causes performance overheads. Prior work proposed to cache frequently accessed embeddings inside GPU memory as a means to filter down the embedding-layer traffic to CPU memory, but this paper observes several limitations of such a cache design. In this work, we present a fundamentally different approach to designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that sees not only past but also "future" cache accesses. ScratchPipe exploits this property to guarantee that the active working set of the embedding layers can "always" be captured inside our proposed cache, enabling embedding-layer training to be conducted at GPU memory speed.
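The toy cache below conveys the "look forward" intuition: because the training schedule of embedding IDs is known ahead of time, the rows the next batch needs can be made resident before the GPU asks for them, and eviction can avoid rows that are about to be reused. The class, schedule, and eviction rule are illustrative assumptions, not ScratchPipe's actual architecture.

```python
from collections import OrderedDict

class LookaheadEmbeddingCache:
    """Toy cache that exploits a known schedule of upcoming embedding IDs."""
    def __init__(self, capacity, future_batches):
        self.capacity = capacity
        self.future = future_batches          # list of upcoming batches of embedding indices
        self.cache = OrderedDict()            # embedding id -> row resident in fast (GPU) memory

    def prefetch(self, step, cpu_table):
        """Before step+1 runs, pull every row it will need into the cache."""
        if step + 1 >= len(self.future):
            return
        needed = set(self.future[step + 1])
        for idx in needed:
            if idx not in self.cache:
                self._insert(idx, cpu_table[idx], keep=needed)

    def _insert(self, idx, row, keep):
        while len(self.cache) >= self.capacity:
            # Evict a row the next batch does not need; fall back to the oldest row otherwise.
            victim = next((i for i in self.cache if i not in keep), next(iter(self.cache)))
            self.cache.pop(victim)
        self.cache[idx] = row

schedule = [[3, 7, 7, 42], [7, 8, 9], [1, 3]]          # embedding IDs of each upcoming batch
table = {i: [0.0] * 4 for i in range(64)}              # stand-in for the CPU-resident table
cache = LookaheadEmbeddingCache(capacity=8, future_batches=schedule)
cache.prefetch(step=0, cpu_table=table)                # batch 1's rows (7, 8, 9) are now resident
```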
{"title":"Training personalized recommendation systems from (GPU) scratch: look forward not backwards","authors":"Youngeun Kwon, Minsoo Rhu","doi":"10.1145/3470496.3527386","DOIUrl":"https://doi.org/10.1145/3470496.3527386","url":null,"abstract":"Personalized recommendation models (RecSys) are one of the most popular machine learning workload serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirements, reaching hundreds of GBs to TBs of model size. In RecSys, the so-called embedding layers account for the majority of memory usage so current systems employ a hybrid CPU-GPU design to have the large CPU memory store the memory hungry embedding layers. Unfortunately, training embeddings involve several memory bandwidth intensive operations which is at odds with the slow CPU memory, causing performance overheads. Prior work proposed to cache frequently accessed embeddings inside GPU memory as means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such cache design. In this work, we present a fundamentally different approach in designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that not only sees the past but also the \"future\" cache accesses. ScratchPipe exploits such property to guarantee that the active working set of embedding layers can \"always\" be captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122941886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ only in the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have the potential to be accelerated by a general-purpose matrix operation architecture, instead of common MXUs. In this paper, we propose SIMD2, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD2 instructions accelerate eight more types of matrix operations, in addition to matrix multiplications. Since SIMD2 instructions resemble a matrix-multiplication instruction, we are able to build the SIMD2 architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD2 using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD2 provides up to 38.59× speedup, and more than 6.94× on average, over optimized CUDA programs, with only 5% full-chip area overhead.
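The snippet below makes the semiring generalization concrete: the same tiled dot-product loop structure computes either an ordinary GEMM or an add-minimum (tropical) product, depending only on which element-wise "multiply" and reduction are plugged in. This is naive reference code to show the shared structure, not an accelerated SIMD2 kernel.

```python
import numpy as np

def semiring_matmul(A, B, mul=np.multiply, reduce=np.sum):
    """Generalized matrix product: swap the (x, +) pair for any element-wise op and reduction."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    out = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            out[i, j] = reduce(mul(A[i, :], B[:, j]))   # same loop nest, different core operation
    return out

A = np.array([[0.0, 3.0],
              [2.0, 0.0]])
gemm     = semiring_matmul(A, A)                              # ordinary matrix multiplication
min_plus = semiring_matmul(A, A, mul=np.add, reduce=np.min)   # add-minimum: 1-hop shortest paths
```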
{"title":"SIMD2: a generalized matrix instruction set for accelerating tensor computation beyond GEMM","authors":"Yunan Zhang, Po-An Tsai, Hung-Wei Tseng","doi":"10.1145/3470496.3527411","DOIUrl":"https://doi.org/10.1145/3470496.3527411","url":null,"abstract":"Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix-multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ in only the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have potential to be accelerated by a general-purpose matrix operation architecture, instead of common MXUs. In this paper, we propose SIMD2, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD2 instructions accelerate eight more types of matrix operations, in addition to matrix multiplications. Since SIMD2 instructions resemble a matrix-multiplication instruction, we are able to build SIMD2 architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD2 using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD2 provides up to 38.59× speedup and more than 6.94× on average over optimized CUDA programs, with only 5% of full-chip area overhead.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130087576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DNA is emerging as an increasingly attractive medium for data storage due to a number of important and unique advantages it offers, most notably unprecedented durability and density. While the technology is evolving rapidly, the prohibitive cost of reads and writes, and the high frequency and peculiar nature of errors occurring in DNA storage, pose a significant challenge to its adoption. In this work we make the novel observation that the probability of successfully recovering a given bit from any type of DNA-based storage system depends heavily on its physical location within the DNA molecule. In other words, when used as a storage medium, some parts of DNA molecules appear significantly more reliable than others. We show that large differences in reliability between different parts of DNA molecules lead to highly inefficient use of error-correction resources, and that commonly used techniques such as unequal error correction cannot be used to bridge the reliability gap between different locations in the context of DNA storage. We then propose two approaches to address the problem. The first approach is general and applies to any type of data; it stripes the data and ECC codewords across DNA molecules in a particular fashion such that the effects of errors are spread out evenly across different codewords and molecules, effectively de-biasing the underlying storage medium and improving the resilience against losses of entire molecules. The second approach is application-specific, and seeks to leverage the underlying reliability bias by using application-aware mapping of data onto DNA molecules such that data that requires higher reliability is stored in more reliable locations, whereas data that tolerates lower reliability is stored in less reliable parts of DNA molecules. We show that the proposed data mapping can be used to achieve graceful degradation in the presence of high error rates, or to implement the concept of approximate storage in DNA. All proposed mechanisms are seamlessly integrated into the state-of-the-art DNA storage pipeline at zero storage overhead, validated through wet-lab experiments, and evaluated on end-to-end encrypted and compressed data.
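To illustrate the first (general) approach, the sketch below diagonally stripes codeword symbols across molecules so that every codeword samples every within-molecule position, and therefore every position-dependent error rate, roughly equally. The exact layout here is an assumption chosen for clarity, not the paper's precise striping scheme.

```python
def diagonal_stripe(codewords, num_molecules):
    """Toy de-biasing layout: symbol j of codeword i lands in molecule (i + j) % num_molecules
    at within-molecule position j, so no codeword is concentrated in unreliable positions."""
    length = len(codewords[0])
    molecules = [[None] * length for _ in range(num_molecules)]
    for i, cw in enumerate(codewords):
        for j, sym in enumerate(cw):
            molecules[(i + j) % num_molecules][j] = sym
    return molecules

codewords = [[f"c{i}s{j}" for j in range(4)] for i in range(4)]   # 4 ECC codewords of 4 symbols
for molecule in diagonal_stripe(codewords, num_molecules=4):
    print(molecule)   # each molecule holds one symbol of each codeword, at a different position
```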
{"title":"Managing reliability skew in DNA storage","authors":"Dehui Lin, Yasamin Tabatabaee, Yash Pote, Djordje Jevdjic","doi":"10.1145/3470496.3527441","DOIUrl":"https://doi.org/10.1145/3470496.3527441","url":null,"abstract":"DNA is emerging as an increasingly attractive medium for data storage due to a number of important and unique advantages it offers, most notably the unprecedented durability and density. While the technology is evolving rapidly, the prohibitive cost of reads and writes, the high frequency and the peculiar nature of errors occurring in DNA storage pose a significant challenge to its adoption. In this work we make a novel observation that the probability of successful recovery of a given bit from any type of a DNA-based storage system highly depends on its physical location within the DNA molecule. In other words, when used as a storage medium, some parts of DNA molecules appear significantly more reliable than others. We show that large differences in reliability between different parts of DNA molecules lead to highly inefficient use of error-correction resources, and that commonly used techniques such as unequal error-correction cannot be used to bridge the reliability gap between different locations in the context of DNA storage. We then propose two approaches to address the problem. The first approach is general and applies to any types of data; it stripes the data and ECC codewords across DNA molecules in a particular fashion such that the effects of errors are spread out evenly across different codewords and molecules, effectively de-biasing the underlying storage medium and improving the resilience against losses of entire molecules. The second approach is application-specific, and seeks to leverage the underlying reliability bias by using application-aware mapping of data onto DNA molecules such that data that requires higher reliability is stored in more reliable locations, whereas data that needs lower reliability is stored in less reliable parts of DNA molecules. We show that the proposed data mapping can be used to achieve graceful degradation in the presence of high error rates, or to implement the concept of approximate storage in DNA. All proposed mechanisms are seamlessly integrated into the state-of-the art DNA storage pipeline at zero storage overhead, validated through wetlab experiments, and evaluated on end-to-end encrypted and compressed data.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114640855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D perception in point clouds is transforming the perception ability of future intelligent machines. Point cloud algorithms, however, are plagued by irregular memory accesses, leading to massive inefficiencies in the memory sub-system that bottleneck overall efficiency. This paper proposes Crescent, an algorithm-hardware co-design system that tames the irregularities in deep point cloud analytics while achieving high accuracy. To that end, we introduce two approximation techniques, approximate neighbor search and selective bank-conflict elision, that "regularize" the DRAM and SRAM memory accesses. Doing so, however, necessarily introduces accuracy loss, which we mitigate with a new training procedure that integrates the approximations into the network training process. In essence, our training procedure trains models that are conditioned upon a specific approximation setting and thus retain high accuracy. Experiments show that Crescent doubles the performance and halves the energy consumption compared to an optimized baseline accelerator with < 1% accuracy loss. The code of our paper is available at: https://github.com/horizon-research/crescent.
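As a loose software analogy for why approximation helps regularize neighbor search, the sketch below buckets points into a voxel grid with a fixed per-cell capacity: overflowing points are dropped, trading a little accuracy for bounded, predictable bucket accesses. This is an illustrative stand-in under my own assumptions, not Crescent's approximate neighbor search or bank-conflict elision hardware.

```python
import numpy as np

def build_grid(points, cell, capacity=8):
    """Voxel grid with a hard per-cell capacity: lookups touch a bounded amount of data."""
    grid = {}
    for idx, p in enumerate(points):
        key = tuple(np.floor(p / cell).astype(int))
        bucket = grid.setdefault(key, [])
        if len(bucket) < capacity:        # drop overflow -> approximate but regular accesses
            bucket.append(idx)
    return grid

def approx_neighbors(grid, points, query, cell, radius):
    """Gather candidates from the 3x3x3 neighborhood of cells, then filter by true distance."""
    cx, cy, cz = np.floor(query / cell).astype(int)
    cand = [i for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
            for i in grid.get((cx + dx, cy + dy, cz + dz), [])]
    cand = np.array(cand, dtype=int)
    if cand.size == 0:
        return cand
    dist = np.linalg.norm(points[cand] - query, axis=1)
    return cand[dist <= radius]

points = np.random.rand(2048, 3).astype(np.float32)
grid = build_grid(points, cell=0.1)
print(approx_neighbors(grid, points, points[0], cell=0.1, radius=0.1))
```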
{"title":"Crescent: taming memory irregularities for accelerating deep point cloud analytics","authors":"Yu Feng, Gunnar Hammonds, Yiming Gan, Yuhao Zhu","doi":"10.1145/3470496.3527395","DOIUrl":"https://doi.org/10.1145/3470496.3527395","url":null,"abstract":"3D perception in point clouds is transforming the perception ability of future intelligent machines. Point cloud algorithms, however, are plagued by irregular memory accesses, leading to massive inefficiencies in the memory sub-system, which bottlenecks the overall efficiency. This paper proposes Crescent, an algorithm-hardware co-design system that tames the irregularities in deep point cloud analytics while achieving high accuracy. To that end, we introduce two approximation techniques, approximate neighbor search and selectively bank conflict elision, that \"regularize\" the DRAM and SRAM memory accesses. Doing so, however, necessarily introduces accuracy loss, which we mitigate by a new network training procedure that integrates approximation into the network training process. In essence, our training procedure trains models that are conditioned upon a specific approximate setting and, thus, retain a high accuracy. Experiments show that Crescent doubles the performance and halves the energy consumption compared to an optimized baseline accelerator with < 1% accuracy loss. The code of our paper is available at: https://github.com/horizon-research/crescent.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122267283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zheng Li, Soroush Ghodrati, A. Yazdanbakhsh, H. Esmaeilzadeh, Mingu Kang
Self-attention is a key enabler of state-of-the-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words correlates highly with the word under attention, and this subset is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggybacks on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision Transformer models. Post-layout results show that, on average, LeOPArd yields 1.9× and 3.9× speedup and energy reduction, respectively, while keeping the average accuracy virtually intact (< 0.2% degradation).
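The sketch below shows threshold-based attention pruning at the functional level: keys whose raw score falls below a threshold are skipped before softmax, so their value rows never contribute. Here the threshold is a fixed input for illustration; in the paper it is learned jointly with the weights via the differentiable regularizer, and the hardware applies it with bit-level early termination.

```python
import numpy as np

def pruned_attention(q, K, V, threshold):
    """Single-query attention with runtime score pruning (functional sketch only)."""
    scores = K @ q / np.sqrt(q.shape[0])       # correlation of the query with every key
    keep = scores >= threshold                 # runtime pruning decision
    if not keep.any():
        keep[np.argmax(scores)] = True         # always keep at least the strongest key
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()                               # softmax over the surviving scores only
    return w @ V[keep]                         # pruned keys' value rows are never touched

d, n = 64, 128
q, K, V = np.random.randn(d), np.random.randn(n, d), np.random.randn(n, d)
out = pruned_attention(q, K, V, threshold=0.5)   # attention output for this query
```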
{"title":"Accelerating attention through gradient-based learned runtime pruning","authors":"Zheng Li, Soroush Ghodrati, A. Yazdanbakhsh, H. Esmaeilzadeh, Mingu Kang","doi":"10.1145/3470496.3527423","DOIUrl":"https://doi.org/10.1145/3470496.3527423","url":null,"abstract":"Self-attention is a key enabler of state-of-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words highly correlates with the word under attention, which is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggy backs on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with bit-level early termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision transformer models. Post-layout results show that, on average, LeOPArd yields 1.9×and 3.9×speedup and energy reduction, respectively, while keeping the average accuracy virtually intact (< 0.2% degradation).","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126553002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ali Hadi Zadeh, M. Mahmoud, Ameer Abdelhadi, A. Moshovos
Increasingly larger and better Transformer models keep advancing state-of-the-art accuracy and capability for Natural Language Processing applications. These models demand more computational power, storage, and energy. Mokey reduces the footprint of state-of-the-art 32-bit or 16-bit floating-point transformer models by quantizing all values to 4-bit indexes into dictionaries of representative 16-bit fixed-point centroids. Mokey does not need fine-tuning, an essential feature as the training resources or datasets are often not available to many. Exploiting the range of values that naturally occurs in transformer models, Mokey selects centroid values that also fit an exponential curve. This unique feature enables Mokey to replace the bulk of the original multiply-accumulate operations with narrow 3-bit fixed-point additions, resulting in an area- and energy-efficient hardware accelerator design. Over a set of state-of-the-art transformer models, the Mokey accelerator delivers an order-of-magnitude improvement in energy efficiency over a Tensor Cores-based accelerator while improving performance by at least 4× and as much as 15×, depending on the model and on-chip buffering capacity. Optionally, Mokey can be used as a memory compression assist for any other accelerator, transparently stashing wide floating-point or fixed-point activations or weights into narrow 4-bit indexes. Mokey proves superior to prior state-of-the-art quantization methods for Transformers.
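The sketch below shows the dictionary-quantization step in isolation: a 16-entry centroid table is fit to the value distribution and every weight is replaced by the 4-bit index of its nearest centroid. Quantiles are used here as a simple stand-in for Mokey's exponential-curve centroid fit, and the fixed-point/addition-only compute path is not modeled.

```python
import numpy as np

def build_centroids(values, bits=4):
    """Build a 2^bits-entry dictionary from the value distribution (quantile-based stand-in)."""
    levels = 2 ** bits
    qs = np.linspace(0.5 / levels, 1.0 - 0.5 / levels, levels)
    return np.quantile(values, qs)

def quantize(values, centroids):
    """Replace each value with the index of its nearest centroid (a 4-bit code for bits=4)."""
    return np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1).astype(np.uint8)

weights = np.random.randn(4096).astype(np.float32)   # stand-in for a transformer weight tensor
centroids = build_centroids(weights)
codes = quantize(weights, centroids)                 # store 4-bit codes; dequantize as centroids[codes]
mean_err = np.abs(centroids[codes] - weights).mean() # rough reconstruction error of the dictionary
```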
{"title":"Mokey: enabling narrow fixed-point inference for out-of-the-box floating-point transformer models","authors":"Ali Hadi Zadeh, M. Mahmoud, Ameer Abdelhadi, A. Moshovos","doi":"10.1145/3470496.3527438","DOIUrl":"https://doi.org/10.1145/3470496.3527438","url":null,"abstract":"Increasingly larger and better Transformer models keep advancing state-of-the-art accuracy and capability for Natural Language Processing applications. These models demand more computational power, storage, and energy. Mokey reduces the footprint of state-of-the-art 32-bit or 16-bit floating-point transformer models by quantizing all values to 4-bit indexes into dictionaries of representative 16-bit fixed-point centroids. Mokey does not need fine-tuning, an essential feature as often the training resources or datasets are not available to many. Exploiting the range of values that naturally occur in transformer models, Mokey selects centroid values to also fit an exponential curve. This unique feature enables Mokey to replace the bulk of the original multiply-accumulate operations with narrow 3b fixed-point additions resulting in an area- and energy-efficient hardware accelerator design. Over a set of state-of-the-art transformer models, the Mokey accelerator delivers an order of magnitude improvements in energy efficiency over a Tensor Cores-based accelerator while improving performance by at least 4× and as much as 15× depending on the model and on-chip buffering capacity. Optionally, Mokey can be used as memory compression assist for any other accelerator transparently stashing wide floating-point or fixed-point activations or weights into narrow 4-bit indexes. Mokey proves superior to prior state-of-the-art quantization methods for Transformers.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132842198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Stein, Yufei Ding, N. Wiebe, Bo Peng, K. Kowalski, Nathan A. Baker, James Ang, A. Li
Variational quantum algorithms (VQAs), which comprise a classical optimizer and a parameterized quantum circuit, are emerging as one of the most promising approaches for harvesting the power of quantum computers in the noisy intermediate-scale quantum (NISQ) era. However, the deployment of VQAs on contemporary NISQ devices often faces considerable system and time-dependent noise and prohibitively slow training speeds. On the other hand, the expensive supporting resources and infrastructure put a premium on keeping quantum computers highly utilized. In this paper, we propose a virtualized way of building up a quantum backend for variational quantum algorithms: rather than relying on a single physical device, which tends to introduce ever-changing device-specific noise and less reliable performance as time-since-calibration grows, we propose to constitute a quantum ensemble, which dynamically distributes quantum tasks asynchronously across a set of physical devices and adjusts the ensemble configuration with respect to machine status. In addition to reducing machine-dependent noise, the ensemble can provide significant speedups for VQA training. With this idea, we build a novel VQA training framework called EQC - a distributed, gradient-based, processor-performance-aware optimization system - that comprises: (i) a system architecture for asynchronous parallel VQA cooperative training; (ii) an analytical model for assessing the quality of a circuit output with respect to its architecture, transpilation, and runtime conditions; and (iii) a weighting mechanism that adjusts each ensemble member's computational contribution according to the system's current performance. Evaluations comprising 500K circuit evaluations across 10 IBMQ NISQ devices, using VQE and QAOA applications, demonstrate that EQC can attain error rates very close to those of the most performant device in the ensemble, while boosting the training speed by 10.5× on average (up to 86× and at least 5.2×). EQC is available at https://github.com/pnnl/eqc.
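The sketch below illustrates the performance-aware weighting idea at the optimizer level: gradient estimates returned asynchronously by different backends are combined with weights proportional to a per-device quality score before one parameter update. The quality scores, gradients, and update rule here are illustrative assumptions, not EQC's analytical model or scheduler.

```python
import numpy as np

def weighted_aggregate(gradients, quality_scores):
    """Combine per-device gradient estimates, weighting each backend by its quality score."""
    w = np.asarray(quality_scores, dtype=float)
    w = w / w.sum()                                   # normalize contributions across the ensemble
    return sum(wi * gi for wi, gi in zip(w, gradients))

grads = [np.array([0.12, -0.30]),            # gradient estimate from device A
         np.array([0.18, -0.22]),            # ... device B
         np.array([0.05, -0.41])]            # ... device C (noisiest)
scores = [0.95, 0.90, 0.70]                  # better-calibrated devices contribute more
theta = np.array([0.30, 1.10])               # circuit parameters
theta = theta - 0.1 * weighted_aggregate(grads, scores)   # one gradient step of the VQA
```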
{"title":"EQC: ensembled quantum computing for variational quantum algorithms","authors":"S. Stein, Yufei Ding, N. Wiebe, Bo Peng, K. Kowalski, Nathan A. Baker, James Ang, A. Li","doi":"10.1145/3470496.3527434","DOIUrl":"https://doi.org/10.1145/3470496.3527434","url":null,"abstract":"Variational quantum algorithm (VQA), which is comprised of a classical optimizer and a parameterized quantum circuit, emerges as one of the most promising approaches for harvesting the power of quantum computers in the noisy intermediate scale quantum (NISQ) era. However, the deployment of VQAs on contemporary NISQ devices often faces considerable system and time-dependant noise and prohibitively slow training speeds. On the other hand, the expensive supporting resources and infrastructure make quantum computers extremely keen on high utilization. In this paper, we propose a virtualized way of building up a quantum backend for variational quantum algorithms: rather than relying on a single physical device which tends to introduce ever-changing device-specific noise with less reliable performance as time-since-calibration grows, we propose to constitute a quantum ensemble, which dynamically distributes quantum tasks asynchronously across a set of physical devices, and adjusts the ensemble configuration with respect to machine status. In addition to reduced machine-dependant noise, the ensemble can provide significant speedups for VQA training. With this idea, we build a novel VQA training framework called EQC - a distributed gradient-based processor-performance-aware optimization system - that comprises: (i) a system architecture for asynchronous parallel VQA cooperative training; (ii) an analytical model for assessing the quality of a circuit output concerning its architecture, transpilation, and runtime conditions; (iii) a weighting mechanism to adjust the quantum ensemble's computational contribution according to the systems' current performance. Evaluations comprising 500K times' circuit evaluations across 10 IBMQ NISQ devices using a VQE and a QAOA applications demonstrate that EQC can attain error rates very close to the most performant device of the ensemble, while boosting the training speed by 10.5X on average (up to 86X and at least 5.2x). EQC is available at https://github.com/pnnl/eqc.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129885314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}