AGQFL: Communication-efficient Federated Learning via Automatic Gradient Quantization in Edge Heterogeneous Systems
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00089
Zirui Lian, Jing Cao, Yanru Zuo, Weihong Liu, Zongwei Zhu
With the widespread use of artificial intelligence (AI) applications and the dramatic growth in data volumes from edge devices, many recent works place the training of AI models onto edge devices. The state-of-the-art edge training framework, federated learning (FL), requires a large amount of data to be transferred between edge devices and the central server, causing heavy communication overhead. Gradient compression techniques are widely used to alleviate this overhead. However, edge devices usually differ in bandwidth, causing communication heterogeneity. Existing gradient compression techniques usually adopt a fixed compression rate and do not account for the straggler problem caused by this heterogeneity. To address these issues, we propose AGQFL, an automatic gradient quantization method consisting of three modules: a quantization indicator module, a quantization strategy module, and a quantization optimizer module. The quantization indicator module automatically determines the adjustment direction of quantization precision by measuring the convergence ability of the current model. Following the indicator and the physical bandwidth of each node, the quantization strategy module adjusts the quantization precision at run-time. Furthermore, the quantization optimizer module introduces a new optimizer to reduce training bias and eliminate instability during the training process. Experimental results show that AGQFL can greatly speed up training in edge AI systems while maintaining or even improving model accuracy.
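As a concrete illustration of indicator-driven precision adjustment, here is a minimal Python sketch: gradients are uniformly quantized to a bit-width that is raised when the loss stops improving and lowered while convergence is still fast. The indicator (loss improvement), the threshold, and the bit-width bounds are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of adaptive gradient quantization in the spirit of AGQFL.
# Indicator, thresholds, and bounds are invented for illustration.
import numpy as np

def quantize(grad, bits):
    """Uniform quantization of a gradient tensor to `bits` bits."""
    levels = 2 ** bits - 1
    g_min, g_max = grad.min(), grad.max()
    scale = (g_max - g_min) / levels if g_max > g_min else 1.0
    q = np.round((grad - g_min) / scale)
    return q * scale + g_min  # dequantized view the server would reconstruct

def adjust_precision(bits, loss_history, lo=2, hi=8, eps=1e-3):
    """Raise precision when the loss plateaus (model needs finer gradients),
    lower it while the loss is still dropping fast (coarse gradients suffice)."""
    if len(loss_history) < 2:
        return bits
    improvement = loss_history[-2] - loss_history[-1]
    if improvement < eps:        # convergence stalling -> more precision
        return min(bits + 1, hi)
    return max(bits - 1, lo)     # still converging -> save bandwidth

bits, losses = 4, [2.31, 1.80, 1.42, 1.41]
grad = np.random.randn(1000).astype(np.float32)
bits = adjust_precision(bits, losses)
print(bits, np.abs(grad - quantize(grad, bits)).mean())
```

In a full system, the strategy module would further cap each node's bit-width by its measured bandwidth so slow links transmit fewer bits per round.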
{"title":"AGQFL: Communication-efficient Federated Learning via Automatic Gradient Quantization in Edge Heterogeneous Systems","authors":"Zirui Lian, Jing Cao, Yanru Zuo, Weihong Liu, Zongwei Zhu","doi":"10.1109/ICCD53106.2021.00089","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00089","url":null,"abstract":"With the widespread use of artificial intelligent (AI) applications and dramatic growth in data volumes from edge devices, there are currently many works that place the training of AI models onto edge devices. The state-of-the-art edge training framework, federated learning (FL), requires to transfer of a large amount of data between edge devices and the central server, which causes heavy communication overhead. To alleviate the communication overhead, gradient compression techniques are widely used. However, the bandwidth of the edge devices is usually different, causing communication heterogeneity. Existing gradient compression techniques usually adopt a fixed compression rate and do not take the straggler problem caused by the communication heterogeneity into account. To address these issues, we propose AGQFL, an automatic gradient quantization method consisting of three modules: quantization indicator module, quantization strategy module and quantization optimizer module. The quantization indicator module automatically determines the adjustment direction of quantization precision by measuring the convergence ability of the current model. Following the indicator and the physical bandwidth of each node, the quantization strategy module adjusts the quantization precision at run-time. Furthermore, the quantization optimizer module designs a new optimizer to reduce the training bias and eliminate the instability during the training process. Experimental results show that AGQFL can greatly speed up the training process in edge AI systems while maintaining or even improving model accuracy.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115637111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy-Efficient MAC Units for Fused Posit Arithmetic
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00032
Raul Murillo, David Mallasén, Alberto A. Del Barrio, Guillermo Botella Juan
Posit arithmetic is an alternative to the standard IEEE 754 floating-point format that claims to provide compelling advantages over floats, including higher accuracy, a larger dynamic range, and bitwise compatibility across systems. Interest in the design of arithmetic units for this novel format has increased in the last few years. However, while multiple designs for posit adders and multipliers have recently been developed in the literature, fused units for posit arithmetic are still in the early stages of research. Moreover, due to the large accumulators needed for fused operations, the few fused posit units proposed so far still require many hardware resources. In order to contribute to the development of the posit number format and facilitate its use in applications such as deep learning, this paper presents several designs of energy-efficient posit multiply-accumulate (MAC) units with support for the standard quire format. Concretely, the proposed designs are capable of computing fused dot products of large vectors without accuracy loss, while consuming less energy than previous implementations. Experiments show that, compared to previous implementations, the proposed designs consume up to 75.49%, 88.45% and 83.43% less energy and are 73.18%, 87.36% and 83.00% faster for 8-, 16- and 32-bit widths, with an additional area of only 4.97%, 7.44% and 4.24%, respectively.
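The benefit of a quire, an exact wide accumulator that defers rounding to the end of a fused dot product, can be illustrated without posit hardware. The sketch below uses Python's Fraction as a stand-in for the quire and float16 as a stand-in for a low-precision posit; both substitutions are assumptions for illustration only (real quire hardware uses a fixed-point accumulator).

```python
# Illustrative-only: accumulate a dot product exactly and round once at the end,
# versus rounding after every multiply-accumulate step.
from fractions import Fraction
import numpy as np

def dot_round_each_step(a, b):
    acc = np.float16(0)
    for x, y in zip(a, b):
        acc = np.float16(acc + np.float16(np.float16(x) * np.float16(y)))
    return float(acc)

def dot_quire_style(a, b):
    acc = Fraction(0)                      # "quire": exact, no intermediate rounding
    for x, y in zip(a, b):
        acc += Fraction(float(np.float16(x))) * Fraction(float(np.float16(y)))
    return float(np.float16(float(acc)))   # single final rounding

rng = np.random.default_rng(0)
a, b = rng.normal(size=512), rng.normal(size=512)
print(dot_round_each_step(a, b), dot_quire_style(a, b))
```

For long vectors the two results diverge noticeably, which is exactly the accuracy drop that quire-backed fused MAC units are designed to avoid.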
{"title":"Energy-Efficient MAC Units for Fused Posit Arithmetic","authors":"Raul Murillo, David Mallasén, Alberto A. Del Barrio, Guillermo Botella Juan","doi":"10.1109/ICCD53106.2021.00032","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00032","url":null,"abstract":"Posit arithmetic is an alternative format to the standard IEEE 754 for floating-point numbers that claims to provide compelling advantages over floats, including higher accuracy, larger dynamic range, or bitwise compatibility across systems. The interest in the design of arithmetic units for this novel format has increased in the last few years. However, while multiple designs for posit adder and multiplier have been developed recently in the literature, fused units for posit arithmetic are still in the early stages of research. Moreover, due to the large size of accumulators needed in fused operations, the few fused posit units proposed so far still require many hardware resources. In order to contribute to the development of the posit number format, and facilitate its use in applications such as deep learning, this paper presents several designs of energy-efficient posit multiply- accumulate (MAC) units with support for standard quire format. Concretely, the proposed designs are capable of computing fused dot products of large vectors without accuracy drop, while consuming less energy than previous implementations. Experiments show that, compared to previous implementations, the proposed designs consume up to 75.49%, 88.45% and 83.43% less energy and are 73.18%, 87.36% and 83.00% faster for 8, 16 and 32 bitwidths, with an additional area of only 4.97%, 7.44% and 4.24%, respectively.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122560820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate and Fast Performance Modeling of Processors with Decoupled Front-end
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00025
Yuya Degawa, Toru Koizumi, Tomoki Nakamura, Ryota Shioya, J. Kadomoto, H. Irie, S. Sakai
Various techniques, such as cache replacement algorithms and prefetching, have been studied to prevent instruction cache misses from becoming a bottleneck in the processor front-end. In such studies, the design goal has been to reduce the number of instruction cache misses. However, owing to the increasing complexity of modern processors, the correlation between reducing instruction cache misses and reducing the number of executed cycles has become weaker than it once was. In this paper, we propose a new guideline for improving the performance of modern processors. In addition, we propose a method for estimating the approximate performance of a design two orders of magnitude faster than a full simulation, each time designers modify their design.
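For intuition, a first-order analytical model of this kind combines event counts with per-event penalties and an overlap factor. The coefficients and the single overlap factor below are invented for illustration and are far simpler than the paper's actual model.

```python
# Crude first-order cycle estimate from event counts. In a decoupled front-end,
# part of each miss penalty hides behind the fetch-target queue, so only a
# fraction (1 - overlap) of it reaches the back-end. All numbers are assumptions.
def estimate_cycles(n_insts, ipc_ideal, icache_misses, miss_penalty, overlap=0.4):
    base = n_insts / ipc_ideal
    stall = icache_misses * miss_penalty * (1.0 - overlap)
    return base + stall

print(estimate_cycles(n_insts=1_000_000, ipc_ideal=4.0,
                      icache_misses=20_000, miss_penalty=30))
```

Because the overlap term scales the miss penalty, two designs with the same miss count can differ substantially in executed cycles, which is why miss count alone correlates weakly with performance.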
{"title":"Accurate and Fast Performance Modeling of Processors with Decoupled Front-end","authors":"Yuya Degawa, Toru Koizumi, Tomoki Nakamura, Ryota Shioya, J. Kadomoto, H. Irie, S. Sakai","doi":"10.1109/ICCD53106.2021.00025","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00025","url":null,"abstract":"Various techniques, such as cache replacement algorithms and prefetching, have been studied to prevent instruction cache misses from becoming a bottleneck in the processor frontend. In such studies, the goal of the design has been to reduce the number of instruction cache misses. However, owing to the increasing complexity of modern processors, the correlation between reducing instruction cache misses and reducing the number of executed cycles has become smaller than in previous cases. In this paper, we propose a new guideline for improving the performance of modern processors. In addition, we propose a method for estimating the approximate performance of a design two orders of magnitude faster than a full simulation each time the designers modify their design.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114268750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Novel Ultra-Low-Voltage Flip-Flops: Near-Vth Modeling and VLSI Integration
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00021
A. Ranasinghe, S. H. Gerez
This paper presents two novel ultra-low-voltage (ULV) Single-Edge-Triggered flip-flops (SET-FFs) based on the True-Single-Phase-Clocking (TSPC) scheme. By exploiting the TSPC principle, the overall energy efficiency is improved compared to traditional flip-flop designs while providing fully static, contention-free functionality suitable for ULV operation. At the 0.5V near-Vth level in 65nm bulk CMOS technology, the proposed SET-FFs demonstrate energy-efficiency improvements of 11-45% and 7-20% at 0% and 100% data activity rates, respectively, compared to the best-known SET-FFs. The proposed SET-FF can safely operate down to a 0.24V supply voltage without corrupting rail-to-rail voltage levels at its internal nodes. Integrating the proposed SET-FFs into a 320-bit parallel shift register demonstrated clock-network power reductions of up to 33% and register power reductions of 17-39% compared to state-of-the-art and commercial standard cells at the near-Vth level. In addition to these merits, with the aid of parasitic modeling, this paper re-evaluates the vital performance metrics of SET-FFs in the near-Vth voltage domain, improving their characterization accuracy and enabling VLSI integration for commercial end-use.
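For context, the quadratic dependence of switching energy on supply voltage is what makes near-Vth operation attractive; the back-of-the-envelope sketch below works through that arithmetic, with the node capacitance chosen purely as an assumption.

```python
# Dynamic switching energy scales as activity * C * Vdd^2, so dropping from a
# nominal 1.2 V to a near-Vth 0.5 V cuts it to (0.5/1.2)^2 ~ 0.17x.
def switching_energy(c_farads, vdd, activity=1.0):
    return activity * c_farads * vdd ** 2

c = 2e-15  # 2 fF illustrative node capacitance (assumption)
for vdd in (1.2, 0.5, 0.24):
    print(f"Vdd={vdd} V -> {switching_energy(c, vdd) * 1e15:.2f} fJ per toggle")
```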
{"title":"Novel Ultra-Low-Voltage Flip-Flops: Near-Vth Modeling and VLSI Integration","authors":"A. Ranasinghe, S. H. Gerez","doi":"10.1109/ICCD53106.2021.00021","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00021","url":null,"abstract":"This paper presents two novel ultra-low-voltage (ULV) Single-Edge-Triggered flip-flops (SET-FF) based on the True-Single-Phase-Clocking (TSPC) scheme. By exploiting the TSPC principle, the overall energy efficiency has been improved compared to the traditional flip-flop designs while providing fully static, contention-free functionality to satisfy ULV operation. At 0.5V near-Vth level in 65nm bulk CMOS technology, the proposed SET-FFs demonstrate up to 11-45% and 7-20% of energy efficiency at 0% and 100% data activity rates compared to the best known SET-FFs. The proposed SET-FF can safely operate down to 0.24V of supply voltage without corrupting rail-to-rail voltage levels at its internal nodes. The integration of proposed SET-FFs in a 320-bit parallel shift register demonstrated up to 33% of clock network power, 17-39% of register power reductions compared to the state-of-the-art and commercial standard-cells at near-Vth level. In addition to these merits, with the aid of parasitic modeling, this paper re-evaluates the vital performance metrics of SET-FFs at near-Vth voltage domain, improving their characterization accuracy and enabling the VLSI integration for commercial end-use.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129272061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seer-SSD: Bridging Semantic Gap between Log-Structured File Systems and SSDs to Reduce SSD Write Amplification
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00020
You Zhou, Ke Wang, Fei Wu, Changsheng Xie, Hao Lv
Log-structured file systems (LS-FSs) sequentialize writes, so they are expected to perform well on flash-based SSDs. However, we observe a semantic gap between the LS-FS and the SSD that causes a stale-LBA problem. When data are updated, the LS-FS allocates new logical block addresses (LBAs). The relevant stale LBAs are invalidated and then trimmed or reused by the LS-FS only after a delay. During this interval, stale LBAs are temporarily regarded as valid and migrated unnecessarily by garbage collection in the SSD. Our experimental study of real-world traces reveals that stale-LBA migrations amount to 59%-150% of host data writes. To solve this serious problem, we propose Seer-SSD, which delivers stale-LBA metadata along with written data from the LS-FS to the SSD. Stale LBAs are then invalidated actively and selectively in the SSD without compromising file system consistency. Seer-SSD can be implemented easily on top of existing block interfaces and maintains compatibility with non-LS-FSs. We perform a case study on an emulated NVMe SSD hosting F2FS (a state-of-the-art LS-FS). Experimental results with popular databases show that Seer-SSD improves throughput by 99.8% and reduces write amplification by 53.6%, on average, compared to a traditional SSD unaware of stale LBAs.
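A toy model makes the stale-LBA effect visible: baseline garbage collection migrates every untrimmed page of a victim block, while a stale-LBA-aware collector can skip pages the file system has already invalidated. The data structures and values below are invented for illustration; they are not Seer-SSD's actual interface.

```python
# Toy model of the stale-LBA problem: GC migrates any page whose LBA has not
# been trimmed, even if the file system already invalidated it. Passing the
# file system's stale set down to GC (the Seer-SSD idea, sketched with invented
# structures) lets those migrations be skipped.
def gc_migrate(victim_pages, trimmed, fs_stale, stale_aware):
    migrated = []
    for lba in victim_pages:
        if lba in trimmed:
            continue                 # already invalid to the SSD
        if stale_aware and lba in fs_stale:
            continue                 # stale-LBA-aware: skip known-stale data
        migrated.append(lba)         # baseline migrates it unnecessarily
    return migrated

victim = list(range(10))
trimmed = {0, 1}                     # LBAs the FS already trimmed
fs_stale = {2, 3, 4, 5}              # invalidated in the LS-FS, trim still pending
print(len(gc_migrate(victim, trimmed, fs_stale, stale_aware=False)))  # 8 migrations
print(len(gc_migrate(victim, trimmed, fs_stale, stale_aware=True)))   # 4 migrations
```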
{"title":"Seer-SSD: Bridging Semantic Gap between Log-Structured File Systems and SSDs to Reduce SSD Write Amplification","authors":"You Zhou, Ke Wang, Fei Wu, Changsheng Xie, Hao Lv","doi":"10.1109/ICCD53106.2021.00020","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00020","url":null,"abstract":"Log-structured file systems (LS-FSs) sequentialize writes, so they are expected to perform well on flash-based SSDs. However, we observe a semantic gap between the LS- FS and SSD that causes a stale-LBA problem. When data are updated, the LS-FS allocates new logical block addresses (LBAs). The relevant stale LBAs are invalidated and then trimmed or reused with a delay by the LS-FS. During the time interval, stale LBAs are regarded temporarily as valid and migrated unnecessarily by garbage collection in the SSD. Our experimental study of real-world traces reveals that stale-LBA migrations amount to 59%-150% of host data writes. To solve this serious problem, we propose Seer-SSD to deliver stale-LBA metadata along with written data from the LS-FS to the SSD. Then, stale LBAs are invalidated actively and selectively in the SSD without compromising file system consistency. Seer-SSD can be implemented easily based on existing block interfaces and maintain compatibility with non-LS-FSs. We perform a case study on an emulated NVMe SSD hosting F2FS (a state-of-the- art LS-FS). Experimental results with popular databases show that Seer-SSD improves the throughput by 99.8% and reduces the write amplification by 53.6%, on average, compared to a traditional SSD unaware of stale LBAs.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"185 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120899315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CoRe-ECO: Concurrent Refinement of Detailed Place-and-Route for an Efficient ECO Automation
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00065
Chung-Kuan Cheng, A. Kahng, Ilgweon Kang, Minsoo Kim, Daeyeal Lee, Bill Lin, Dongwon Park, M. Woo
With the relentless scaling of technology nodes, physical design engineers encounter non-trivial challenges caused by rapidly increasing design complexity, particularly in the routing stage. Back-end designers must manually stitch/modify all of the design rule violations (DRVs) that remain after automatic place-and-route (P&R), during the implementation of engineering change orders (ECOs). In this paper, we propose CoRe-ECO, a concurrent refinement framework for efficient automation of the ECO process. Our framework efficiently resolves pin accessibility-induced DRVs by simultaneously performing detailed placement, detailed routing, and cell replacement. In addition to perturbation-minimized solutions, our proposed SMT-based optimization framework also suggests the adoption of alternative master cells to better achieve DRV-clean layouts. We demonstrate that our framework successfully resolves from 33.3% to 100.0% (58.6% on average) of remaining DRVs on M1-M3 layers, across a range of benchmark circuits with various cell architectures, while also providing average total wirelength reduction of 0.003%.
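To give a flavor of the constraint formulation, the sketch below picks perturbation-minimal x-shifts for a row of pins so that a minimum-spacing rule (a stand-in for a pin-accessibility DRV) holds. It uses brute-force search rather than an SMT solver, and the one-dimensional geometry is an illustrative assumption; CoRe-ECO's actual formulation jointly covers placement, routing, and master-cell selection.

```python
# Minimal stand-in for an SMT-style ECO fix: choose small x-shifts so pin
# positions satisfy a min-spacing rule, minimizing total perturbation.
from itertools import product

def fix_pin_spacing(pin_x, min_space=2, max_shift=3):
    best = None
    for shifts in product(range(-max_shift, max_shift + 1), repeat=len(pin_x)):
        xs = sorted(p + s for p, s in zip(pin_x, shifts))
        if all(b - a >= min_space for a, b in zip(xs, xs[1:])):  # DRV-clean
            cost = sum(abs(s) for s in shifts)                   # perturbation
            if best is None or cost < best[0]:
                best = (cost, shifts)
    return best

print(fix_pin_spacing([10, 11, 12]))  # e.g., minimal total shift of 2
```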
{"title":"CoRe-ECO: Concurrent Refinement of Detailed Place-and-Route for an Efficient ECO Automation","authors":"Chung-Kuan Cheng, A. Kahng, Ilgweon Kang, Minsoo Kim, Daeyeal Lee, Bill Lin, Dongwon Park, M. Woo","doi":"10.1109/ICCD53106.2021.00065","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00065","url":null,"abstract":"With the relentless scaling of technology nodes, physical design engineers encounter non-trivial challenges caused by rapidly increasing design complexity, particularly in the routing stage. Back-end designers must manually stitch/modify all of the design rule violations (DRVs) that remain after automatic place-and-route (P&R), during the implementation of engineering change orders (ECOs). In this paper, we propose CoRe-ECO, a concurrent refinement framework for efficient automation of the ECO process. Our framework efficiently resolves pin accessibility-induced DRVs by simultaneously performing detailed placement, detailed routing, and cell replacement. In addition to perturbation-minimized solutions, our proposed SMT-based optimization framework also suggests the adoption of alternative master cells to better achieve DRV-clean layouts. We demonstrate that our framework successfully resolves from 33.3% to 100.0% (58.6% on average) of remaining DRVs on M1-M3 layers, across a range of benchmark circuits with various cell architectures, while also providing average total wirelength reduction of 0.003%.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114201325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Special Session: ADAPT: ANN-ControlleD System-Level Runtime Adaptable APproximate CompuTing
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00012
Prattay Chowdhury, B. C. Schafer
Approximate computing has been shown to be an effective approach to generating smaller and more power-efficient circuits by trading circuit accuracy for area and/or power. So far, most work on approximate computing has focused on specific components within a system. This severely limits the approximation potential, as most Integrated Circuits (ICs) are now complex heterogeneous systems. An additional limitation of current work in this domain is the assumption that the training data matches the actual workload. This is not always true, as these complex Systems-on-Chip (SoCs) are used for a variety of different applications. To address these issues, this work investigates whether lower-power designs can be found by mixing approximations across the different components of the SoC, as opposed to only aggressively approximating a single component. The main hypothesis is that some approximations amplify across the system while others tend to cancel each other out, allowing power savings to be maximized while meeting the given maximum error threshold. To investigate this, we propose a method called ADAPT. ADAPT uses a neural network-based controller to dynamically adjust the supply voltage (Vdd) of different components in the SoC at runtime based on the actual workload.
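The controller idea can be sketched as a small feed-forward network mapping workload features to a discrete Vdd choice per component. The weights below are random stand-ins (ADAPT's controller is trained so the mixed-approximation error stays under the threshold), and the feature set and rail values are assumptions.

```python
# Illustrative-only ANN controller choosing per-component supply voltages from
# workload features. Untrained random weights; structure only.
import numpy as np

rng = np.random.default_rng(42)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # 4 workload features -> hidden
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # hidden -> 3 SoC components

VDD_LEVELS = np.array([0.7, 0.8, 0.9, 1.0])     # assumed discrete rails

def choose_vdd(features):
    h = np.tanh(features @ W1 + b1)
    scores = h @ W2 + b2                         # one score per component
    idx = np.clip(np.round((np.tanh(scores) + 1) / 2 * 3), 0, 3).astype(int)
    return VDD_LEVELS[idx]

workload = np.array([0.6, 0.1, 0.8, 0.3])        # e.g., utilization statistics
print(choose_vdd(workload))                      # per-component Vdd choices
```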
{"title":"Special Session: ADAPT: ANN-ControlleD System-Level Runtime Adaptable APproximate CompuTing","authors":"Prattay Chowdhury, B. C. Schafer","doi":"10.1109/ICCD53106.2021.00012","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00012","url":null,"abstract":"Approximate computing has shown to be an effective approach to generate smaller and more power-efficient circuits by trading the accuracy of the circuit vs. area and/or power. So far, most work on approximate computing has focused on specific components within a system. It severely limits the approximation potential as most Integrated Circuits (ICs) are now complex heterogeneous systems. One additional limitation of current work in this domain is they assume that the training data matches the actual workload. This is nevertheless not always true as these complex Systems-on-Chip (SoCs) are used for a variety of different applications. To address these issues, this work investigates if lower-power designs can be found through mixing approximations across the different components in the SoC as opposed to only aggressively approximating a single component. The main hypothesis is that some approximations amplify across the system, while others tend to cancel each other out, thus, allowing to maximize the power savings while meeting the given maximum error threshold. To investigate this, we propose a method called ADAPT. ADAPT uses a neural network-based controller to dynamically adjust the supply voltage (Vdd) of different components in SoC at runtime based on the actual workload.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121709522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fault-Aware Prediction-Guided Page Offlining for Uncorrectable Memory Error Prevention
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00077
Xiaoming Du, Cong Li, Shen Zhou, Xian Liu, Xiaohan Xu, Tianjiao Wang, Shi-Lun Ge
Uncorrectable memory errors are a major cause of hardware failures in datacenters, leading to server crashes. Page offlining is an error-prevention mechanism implemented in modern operating systems. Traditional offlining policies are based on the correctable error (CE) rate of a page over a past period. However, CEs are just observations; the underlying causes are memory circuit faults. A single fault, such as a row fault, can impact quite a few pages. Meanwhile, not all faults are equally prone to uncorrectable errors (UEs). In this paper, we propose a fault-aware, prediction-guided policy for page offlining. In the proposed policy, we first identify row faults based on CE observations as preliminary candidates for offlining. Leveraging knowledge of the error correction code, we design a predictor based on error-bit patterns to predict whether a row fault is prone to UEs. Pages impacted by the UE-prone rows are then offlined. Empirical evaluation using the error log from a modern large-scale cluster at ByteDance demonstrates that the proposed policy avoids several times more UEs than the traditional policy at a comparable cost of memory capacity lost to page offlining.
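The two-step policy might look roughly like the following: group CEs that share a physical row to infer row faults, keep only rows whose error-bit patterns look UE-prone, and offline the affected pages. The log fields and the multi-bit-pattern heuristic are invented stand-ins for the paper's ECC-aware predictor.

```python
# Sketch with invented log fields: (1) infer row faults by grouping CEs on the
# same row, (2) flag rows whose error-bit patterns look UE-prone, (3) offline
# the pages those rows map to.
from collections import defaultdict

def ue_prone(ce_records, min_cols=2):
    cols = {c["col"] for c in ce_records}
    multibit = any(bin(c["syndrome_bits"]).count("1") > 1 for c in ce_records)
    return len(cols) >= min_cols and multibit    # illustrative heuristic only

def pages_to_offline(ce_log, page_of):
    rows = defaultdict(list)
    for ce in ce_log:
        rows[(ce["dimm"], ce["bank"], ce["row"])].append(ce)
    offline = set()
    for key, ces in rows.items():
        if len(ces) >= 2 and ue_prone(ces):      # row fault + UE-prone pattern
            offline.update(page_of(key, ce["col"]) for ce in ces)
    return offline

ce_log = [
    {"dimm": 0, "bank": 1, "row": 7, "col": 3, "syndrome_bits": 0b11},
    {"dimm": 0, "bank": 1, "row": 7, "col": 9, "syndrome_bits": 0b1},
]
print(pages_to_offline(ce_log, lambda row_key, col: (row_key, col // 8)))
```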
{"title":"Fault-Aware Prediction-Guided Page Offlining for Uncorrectable Memory Error Prevention","authors":"Xiaoming Du, Cong Li, Shen Zhou, Xian Liu, Xiaohan Xu, Tianjiao Wang, Shi-Lun Ge","doi":"10.1109/ICCD53106.2021.00077","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00077","url":null,"abstract":"Uncorrectable memory errors are the major causes of hardware failures in datacenters leading to server crashes. Page offlining is an error-prevention mechanism implemented in modern operating systems. Traditional offlining policies are based on correctable error (CE) rate of a page in a past period. However, CEs are just the observations while the underlying causes are memory circuit faults. A certain fault such as a row fault can impact quite a few pages. Meanwhile, not all faults are equally prone to uncorrectable errors (UEs). In this paper, we propose a fault-aware prediction-guide policy for page offlining. In the proposed policy, we first identify row faults based on CE observations as the preliminary candidates for offlining. Leveraging the knowledge of the error correction code, we design a predictor based on error-bit patterns to predict whether a row fault is prone to UEs or not. Pages impacted by the UE-prone rows are then offlined. Empirical evaluation using the error log from a modern large-scale cluster in ByteDance demonstrates that the proposed policy avoids several times more UEs than the traditional policy does at a comparable cost of memory capacity loss due to page offlining.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121618723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Legion: Tailoring Grouped Neural Execution Considering Heterogeneity on Multiple Edge Devices
Pub Date: 2021-10-01. DOI: 10.1109/ICCD53106.2021.00067
Kyunghwan Choi, Seongju Lee, Beom Woo Kang, Yongjun Park
Distributing workloads that cannot be handled by a single edge device across multiple edge devices is a promising solution that minimizes the inference latency of deep learning applications by exploiting model parallelism. Several prior solutions have been proposed to partition target models efficiently, but most studies have focused on finding the optimal fused-layer configurations, which minimize the data-transfer overhead between layers. However, as recent deep learning models have become more complex and the ability to deploy them quickly has become a key challenge, searching for the best fused-layer configurations of target models has become a major requirement. To solve this problem, we propose a lightweight model partitioning framework called Legion that finds the optimal fused-layer configurations with minimal profiling trials. By finding the optimal configurations using cost-matrix construction and wild-card selection, Legion achieves performance similar to a full configuration search at a fraction of the search time. Moreover, Legion performs effectively even on a group of heterogeneous target devices by introducing per-device cost-matrix construction. With three popular networks, Legion shows only a 3.4% performance loss compared to a full searching scheme (FSS) on various device configurations consisting of up to six heterogeneous devices, and reduces the profiling overhead by 48.7× on average.
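As a rough sketch of selecting a configuration from a cost matrix: each candidate assigns fused-layer groups to devices, and the cheapest candidate under measured per-device costs plus transfer overhead wins. The costs, groups, and latency model below are invented; Legion additionally avoids profiling every matrix cell through wild-card selection.

```python
# cost[d][g] = measured latency (ms) of layer group g on device d (invented values)
cost = {
    "dev_fast": {"g0": 4.0, "g1": 6.0, "g2": 5.0},
    "dev_slow": {"g0": 9.0, "g1": 14.0, "g2": 11.0},
}
transfer_ms = 1.5  # assumed per-boundary activation transfer cost

def total_latency(assignment):
    """assignment: ordered list of (device, group) stages."""
    stages = sum(cost[d][g] for d, g in assignment)
    hops = sum(transfer_ms
               for (d1, _), (d2, _) in zip(assignment, assignment[1:])
               if d1 != d2)                      # pay only when devices change
    return stages + hops

candidates = [
    [("dev_fast", "g0"), ("dev_fast", "g1"), ("dev_slow", "g2")],
    [("dev_fast", "g0"), ("dev_slow", "g1"), ("dev_fast", "g2")],
]
print(min(candidates, key=total_latency))
```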
{"title":"Legion: Tailoring Grouped Neural Execution Considering Heterogeneity on Multiple Edge Devices","authors":"Kyunghwan Choi, Seongju Lee, Beom Woo Kang, Yongjun Park","doi":"10.1109/ICCD53106.2021.00067","DOIUrl":"https://doi.org/10.1109/ICCD53106.2021.00067","url":null,"abstract":"Distributing workloads that cannot be handled by a single edge device across multiple edge devices is a promising solution that minimizes the inference latency of deep learning applications by exploiting model parallelism. Several prior solutions have been proposed to partition target models efficiently, but most studies have focused on finding the optimal fused layer configurations, which minimize the data-transfer overhead between layers. However, as recent deep learning network models have become more complex and the ability to deploy them quickly has become a key challenge, the search for the best fused layer configurations of target models has become a major requirement. To solve this problem, we propose a lightweight model partitioning framework called Legion to find the optimal fused layer configurations with minimal profiling execution trials. By finding the optimal configurations using cost matrix construction and wild card selection, the experimental results showed that Legion achieved a similar performance to the full configuration search at a fraction of the search time. Moreover, Legion performed effectively even on a group of heterogeneous target devices by introducing a per-device cost-related matrix construction. With three popular networks, Legion shows only 3.4% performance loss as compared to a full searching scheme (FSS), on various different device configurations consisting of up to six heterogeneous devices, and minimizes the profiling overhead by 48.7× on average.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126043129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}