In today's communication systems, processing elements (PEs) commonly receive simultaneous, conflicting signal requests that are unpredictable and randomly triggered. In such cases, multiple overlapping requests can compete within the same PE, causing erroneous operation or a halt of the communication cycle. The authors propose a self-timing CMOS first-edge take-all (FETA) circuit architecture, which examines two overlapping signal requests and passes only the leading-edge signal while declining the lagging-edge request. The FETA circuit is an essential component in First-In-First-Out (FIFO) buffers for avoiding the metastability that commonly arises between overlapping write and read requests in applications such as the Internet of Things, Networks-on-Chip, and microprocessor memory management units. HSPICE simulations in a 90 nm CMOS technology verify operation at speeds up to 1 GHz. The achievable resolution is on the order of 20 ps, including process-variation sensitivity, owing to the design's symmetric timing paths between the two signals. Additionally, the proposed architecture adopts a self-timing scheme that obviates overhead synchronisation circuitry; the complete design comprises 12 D-type flip-flops and about 300 transistors, making it well suited to HDL synthesis and FPGA implementation.
{"title":"A novel self-timing CMOS first-edge take-all circuit for on-chip communication systems","authors":"Saleh Abdelhafeez, Shadi M. S. Harb","doi":"10.1049/cdt2.12059","DOIUrl":"https://doi.org/10.1049/cdt2.12059","url":null,"abstract":"<p>In today's communication systems, it has become prominent for processing elements (PEs) to receive requests with simultaneous, conflicting signals, which are unpredicted and randomly triggered. In such a case, multiple overlapping signal requests can potentially compete in the same PE causing erroneous operations or a halt of the communication cycle. The authors propose a self-timing CMOS first-edge take-all (FETA) circuit architecture, which examines two overlapping signals’ requests, and outputs only the leading-edge signal while the lagging-edge signal's request is declined. The FETA circuit functionality is considered as an essential component in First-In-First-Out for metastability avoidance, which usually occurs between the write and read overlapping requests for applications related to Internet of Things, Network-on-Chips, and microprocessor memory management units. HSPICE simulations for a 90 nm CMOS technology are used to verify the speed up to 1 GHz. Besides, the achievable resolution is in the order of 20 ps considering process variation sensitivity based on design inheriting symmetric timing paths between the two signals. Additionally, the proposed circuit architecture adopts a self-timing scheme obviating the overhead synchronisation circuitry, which comprises 12 D-Type Flip-Flops with about 300 transistors. This design is suited for HDL synthesis and FPGA application features.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"17 3-4","pages":"141-148"},"PeriodicalIF":1.2,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12059","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50139916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to the ever-increasing computing requirements of modern applications, supercomputers have become the platform of choice for high-performance computing. Although various features and indicators for testing and evaluating supercomputers have been proposed in the literature, a comprehensive feature set to guide designers in comparing supercomputers and selecting an appropriate one has not been provided. Here, an integrated feature-based taxonomy comprising seven indicator groups, namely passive infrastructure, hardware, software, support and maintenance, service, business, and security features, is proposed. A case study using the proposed framework is also provided, comparing several commercial and research supercomputers, including Fugaku as the ideal supercomputer, the Sharif supercomputer, the Aramco supercomputer, and the ITU supercomputer. Moreover, the proposed method is compared with the Top500 method, and the comparison shows that it facilitates the ranking, comparison, and selection of an appropriate supercomputer in various fields by considering diverse aspects of design and implementation. The ranking results show that the Aramco, ITU, and Sharif supercomputers achieve 65.9%, 57.6%, and 48.2% of the ideal supercomputer's points, respectively.
{"title":"An integrated taxonomy of standard indicators for ranking and selecting supercomputers","authors":"Davood Maleki, Alireza Mansouri, Ehsan Arianyan","doi":"10.1049/cdt2.12061","DOIUrl":"https://doi.org/10.1049/cdt2.12061","url":null,"abstract":"<p>Due to the ever-increasing computing requirements of modern applications, supercomputers are at the centre of attraction as a platform for high-performance computing. Although various features and indicators for testing and evaluating supercomputers are proposed in the literature, a comprehensive feature set to guide designers in comparing supercomputers and selecting an appropriate choice is not provided. Here, an integrated feature-based taxonomy comprised of seven indicator groups including passive infrastructure, hardware, software, support and maintenance, service, business, and security features is proposed. Also, a case study using our proposed framework is provided and a comparison between some commercial and research supercomputers including Fugaku's ideal supercomputer, Sharif supercomputer, Aramco supercomputer, and ITU supercomputer is presented. Moreover, here, the authors’ proposed method is compared with the Top500 method, which shows that the authors’ proposed method facilitates the ranking, comparison, and selection of the appropriate supercomputer in various fields by considering various aspects of design and implementation. The ranking results show that Aramco supercomputer, ITU supercomputer, and Sharif supercomputer have 65.9%, 57.6%, and 48.2% of ideal supercomputer points, respectively.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"17 3-4","pages":"162-179"},"PeriodicalIF":1.2,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12061","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50139238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern deep neural networks typically have fully connected (FC) layers at the final classification stages. These layers have large memory requirements that can be expensive on resource-constrained embedded devices and consume significant energy just to read the parameters from external memory into the processing chip. The authors show that the weights in such layers can be modelled as permutations of a common sequence with minimal impact on recognition accuracy. This allows the storage requirements of the FC layer(s) to be significantly reduced, reflected in a reduction of total network parameters from 1.3× to 36×, with a median of 4.45×, on several benchmark networks. The authors compare the results with existing pruning, bitwidth reduction, and deep compression techniques and show the superior compression achieved with this method. They also show a 7× reduction of parameters on the VGG16 architecture with the ImageNet dataset and demonstrate that the proposed method can be used in the classification stage of transfer learning networks.
{"title":"Compressing fully connected layers of deep neural networks using permuted features","authors":"Dara Nagaraju, Nitin Chandrachoodan","doi":"10.1049/cdt2.12060","DOIUrl":"https://doi.org/10.1049/cdt2.12060","url":null,"abstract":"<p>Modern deep neural networks typically have some fully connected layers at the final classification stages. These stages have large memory requirements that can be expensive on resource-constrained embedded devices and also consume significant energy just to read the parameters from external memory into the processing chip. The authors show that the weights in such layers can be modelled as permutations of a common sequence with minimal impact on recognition accuracy. This allows the storage requirements of FC layer(s) to be significantly reduced, which reflects in the reduction of total network parameters from 1.3× to 36× with a median of 4.45× on several benchmark networks. The authors compare the results with existing pruning, bitwidth reduction, and deep compression techniques and show the superior compression that can be achieved with this method. The authors also showed 7× reduction of parameters on VGG16 architecture with ImageNet dataset. The authors also showed that the proposed method can be used in the classification stage of the transfer learning networks.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"17 3-4","pages":"149-161"},"PeriodicalIF":1.2,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12060","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50149530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transient execution attacks such as Spectre and Meltdown exploit speculative execution in modern microprocessors to leak information via cache side-channels. Software solutions to defend against many transient execution attacks employ the lfence serialising instruction, which does not allow instructions that come after the lfence to execute out-of-order with respect to instructions that come before the lfence. However, errors and Trojans in the hardware implementation of lfence can be exploited to compromise the software mitigations that use lfence. This security gap has not been identified and addressed previously. The authors provide a formal methods solution that addresses the verification of the lfence hardware implementation. They also show how hardware Trojans can be designed to circumvent lfence and demonstrate that their verification approach will flag such Trojans as well. The authors have demonstrated the efficacy of their approach using RSD, an open-source RISC-V-based superscalar out-of-order processor.
{"title":"Verification of serialising instructions for security against transient execution attacks","authors":"Kushal K. Ponugoti, Sudarshan K. Srinivasan, Nimish Mathure","doi":"10.1049/cdt2.12058","DOIUrl":"https://doi.org/10.1049/cdt2.12058","url":null,"abstract":"<p>Transient execution attacks such as Spectre and Meltdown exploit speculative execution in modern microprocessors to leak information via cache side-channels. Software solutions to defend against many transient execution attacks employ the <i>lfence</i> serialising instruction, which does not allow instructions that come after the <i>lfence</i> to execute out-of-order with respect to instructions that come before the <i>lfence</i>. However, errors and Trojans in the hardware implementation of <i>lfence</i> can be exploited to compromise the software mitigations that use <i>lfence</i>. The aforementioned security gap has not been identified and addressed previously. The authors provide a formal method solution that addresses the verification of <i>lfence</i> hardware implementation. The authors also show how hardware Trojans can be designed to circumvent <i>lfence</i> and demonstrate that their verification approach will flag such Trojans as well. The authors have demonstrated the efficacy of our approach using RSD, which is an open source RISC-V based superscalar out-of-order processor.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"17 3-4","pages":"127-140"},"PeriodicalIF":1.2,"publicationDate":"2023-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12058","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50147574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The protection of communications between Internet of Things (IoT) devices is of great concern because the information exchanged contains vital sensitive data. Malicious agents seek to exploit those data to extract secret information about the owners or the system. Power side-channel attacks are a particular threat on these devices because their power consumption unintentionally leaks information correlatable with the device's secret data. Several studies have demonstrated the effectiveness of authenticated encryption with associated data in protecting communications with these devices. A comprehensive evaluation of the seven (out of ten) algorithm finalists of the National Institute of Standards and Technology (NIST) IoT lightweight cipher competition that do not integrate built-in countermeasures is proposed. The study shows that they nonetheless present residual vulnerabilities to power side-channel attacks (SCA). For five ciphers, an attack methodology as well as the leakage function needed to perform correlation power analysis (CPA) is proposed. The authors assert that the security vulnerability of Ascon, Sparkle, and PHOTON-Beetle can generally be assessed under the security assumptions "Chosen ciphertext attack and leakage in encryption only, with nonce-misuse resilience adversary (CCAmL1)" and "Chosen ciphertext attack and leakage in encryption only, with nonce-respecting adversary (CCAL1)", respectively. However, the security vulnerability of GIFT-COFB, Grain, Romulus, and TinyJambu can be evaluated more straightforwardly with publicly available leakage models and solvers, or simply by increasing the number of traces collected to launch the attack.
{"title":"Residual vulnerabilities to power side channel attacks of lightweight ciphers cryptography competition finalists","authors":"Aurelien T. Mozipo, John M. Acken","doi":"10.1049/cdt2.12057","DOIUrl":"https://doi.org/10.1049/cdt2.12057","url":null,"abstract":"<p>The protection of communications between Internet of Things (IoT) devices is of great concern because the information exchanged contains vital sensitive data. Malicious agents seek to exploit those data to extract secret information about the owners or the system. Power side channel attacks are of great concern on these devices because their power consumption unintentionally leaks information correlatable to the device's secret data. Several studies have demonstrated the effectiveness of authenticated encryption with advanced data, in protecting communications with these devices. A comprehensive evaluation of the seven (out of 10) algorithm finalists of the National Institute of Standards and Technology (NIST) IoT lightweight cipher competition that do not integrate built-in countermeasures is proposed. The study shows that, nonetheless, they still present some residual vulnerabilities to power side channel attacks (SCA). For five ciphers, an attack methodology as well as the leakage function needed to perform correlation power analysis (CPA) is proposed. The authors assert that Ascon, Sparkle, and PHOTON-Beetle security vulnerability can generally be assessed with the security assumptions “Chosen ciphertext attack and leakage in encryption only, with nonce-misuse resilience adversary (CCAmL1)” and “Chosen ciphertext attack and leakage in encryption only with nonce-respecting adversary (CCAL1)”, respectively. However, the security vulnerability of GIFT-COFB, Grain, Romulus, and TinyJambu can be evaluated more straightforwardly with publicly available leakage models and solvers. They can also be assessed simply by increasing the number of traces collected to launch the attack.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"17 3-4","pages":"75-88"},"PeriodicalIF":1.2,"publicationDate":"2023-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12057","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50141094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, the Internet of Things (IoT) has become extensively used in our daily lives. This technology offers a new vision of the future internet in which devices are interconnected and can communicate with each other. The large number of these devices complicates firmware updates and makes them more expensive, since each node must be updated locally. Moreover, in many cases devices cannot be moved for a local firmware upgrade because they are installed in unreachable or hazardous locations. Therefore, it is necessary to update the firmware of devices remotely. In this study, the authors propose an efficient implementation of a low-cost and secure framework for upgrading firmware using the Over-The-Air programming technique. They present a proof of concept for a ubiquitous system based on a wireless programmer. The design offers remote broadcasting of the image code without disturbing the main functionality of the nodes. The authors validated the performance of the design on a real network based on STM32 microcontrollers. The results show a reduction of the network time-off, enabling continuous operation of the ecosystem.
{"title":"Efficient implementation of low cost and secure framework with firmware updates","authors":"Ines Ben Hlima, Halim Kacem, Ali Gharsallah","doi":"10.1049/cdt2.12054","DOIUrl":"https://doi.org/10.1049/cdt2.12054","url":null,"abstract":"<p>Recently, the Internet of things (IoT) has become extensively used in our daily lives. This technology offers a new vision of the future internet where devices are interconnected and can communicate together. A big number of these devices complicates the firmware update and makes it more expensive since each node must be updated locally. Nevertheless, there are many cases where devices cannot change their location to upgrade the firmware locally due to an unreachable location or dangerous place. Therefore, it is necessary to remotely update the firmware of devices. In this study, the authors propose an efficient implementation of a low cost and secure framework of upgrading the firmware employing the latter On-The-Air programming technique. The authors present a proof of concept for a ubiquitous system applying wireless programmer. Our design offers a remote broadcasting of image code without disturbing the main functionality of nodes. The authors validated the performance of our design on a real network based on STM32 micro-controllers. The results showed the reduction of the network time-off, enabling a continuous operation of the ecosystem.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"17 3-4","pages":"89-99"},"PeriodicalIF":1.2,"publicationDate":"2023-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12054","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50147647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The intellectual property (IP) protection of deep neural network (DNN) models has raised many concerns in recent years. To date, most existing works use DNN watermarking to protect the IP of DNN models. However, DNN watermarking methods can only passively verify the copyright of a model after it has been pirated; they cannot prevent piracy in the first place. In this paper, an active DNN IP protection technique against DNN piracy, called ActiveGuard, is proposed. ActiveGuard can provide active authorisation control, users' identity management, and ownership verification for DNN models. Specifically, for the first time, ActiveGuard exploits well-crafted rare and specific adversarial examples with specific classes and confidences as users' fingerprints to distinguish authorised users from unauthorised ones. Authorised users can input their fingerprints to the DNN model for identity authentication and then obtain normal usage, while unauthorised users will obtain very poor model performance. In addition, ActiveGuard enables the model owner to embed a watermark into the weights of the DNN model for ownership verification. Compared to the few existing active DNN IP protection works, ActiveGuard can support both user identity identification and active authorisation control, and it introduces lower overhead than these existing works. Experimental results show that, for authorised users, the test accuracies of the LeNet-5 and Wide Residual Network (WRN) models are 99.15% and 91.46%, respectively, while for unauthorised users they are only 8.92% and 10%. Moreover, each authorised user can pass the fingerprint authentication with a high success rate (up to 100%). For ownership verification, the embedded watermark can be successfully extracted, while the normal performance of the DNN models is not affected. Furthermore, it is demonstrated that ActiveGuard is robust against model fine-tuning attacks, pruning attacks, and three types of fingerprint forgery attacks.
{"title":"ActiveGuard: An active intellectual property protection technique for deep neural networks by leveraging adversarial examples as users' fingerprints","authors":"Mingfu Xue, Shichang Sun, Can He, Dujuan Gu, Yushu Zhang, Jian Wang, Weiqiang Liu","doi":"10.1049/cdt2.12056","DOIUrl":"https://doi.org/10.1049/cdt2.12056","url":null,"abstract":"<p>The intellectual properties (IP) protection of deep neural networks (DNN) models has raised many concerns in recent years. To date, most of the existing works use DNN watermarking to protect the IP of DNN models. However, the DNN watermarking methods can only passively verify the copyright of the model after the DNN model has been pirated, which cannot prevent piracy in the first place. In this paper, an active DNN IP protection technique against DNN piracy, called ActiveGuard<i>,</i> is proposed. ActiveGuard can provide active authorisation control, users' identities management, and ownership verification for DNN models. Specifically, for the first time, ActiveGuard exploits well-crafted rare and specific adversarial examples with specific classes and confidences as users' fingerprints to distinguish authorised users from unauthorised ones. Authorised users can input their fingerprints to the DNN model for identity authentication and then obtain normal usage, while unauthorised users will obtain a very poor model performance. In addition, ActiveGuard enables the model owner to embed a watermark into the weights of the DNN model for ownership verification. Compared to the few existing active DNN IP protection works, ActiveGuard can support both users' identities identification and active authorisation control. Besides, ActiveGuard introduces lower overhead than these existing active protection works. Experimental results show that, for authorised users, the test accuracy of LeNet-5 and Wide Residual Network (WRN) models are 99.15% and 91.46%, respectively, while for unauthorised users, the test accuracy of LeNet-5 and WRN models are only 8.92% and 10%, respectively. Besides, each authorised user can pass the fingerprint authentication with a high success rate (up to 100%). For ownership verification, the embedded watermark can be successfully extracted, while the normal performance of DNN models will not be affected. Furthermore, it is demonstrated that ActiveGuard is robust against model fine-tuning attack, pruning attack, and three types of fingerprint forgery attacks.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"17 3-4","pages":"111-126"},"PeriodicalIF":1.2,"publicationDate":"2023-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12056","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50147648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two methods are presented for finding the top-k items in data streams using Field Programmable Gate Arrays (FPGAs). These methods deploy two variants of a novel accelerator architecture capable of extracting an approximate list of the most frequently occurring items in a single pass over the input stream without the need for random access. The first variant of the accelerator implements the well-known Probabilistic sampling algorithm by mapping its main processing stages to a hardware architecture consisting of two custom systolic arrays. The proposed architecture retains all the properties of this algorithm, which works even if the stream size is unknown at run time, and shows better scalability than architectures based on other streaming algorithms. In addition, experimental results on both synthetic and real datasets, with the accelerator implemented on an Intel Arria 10 GX 1150 FPGA device, showed very good accuracy and significant throughput gains compared to existing software and hardware-accelerated solutions. The second variant of the accelerator is specifically tailored for applications requiring higher accuracy, provided that the size of the stream is known at run time. This variant takes advantage of the embedded memory resources in an FPGA to implement a sketch-based filter that precedes the main systolic array in the accelerator's pipeline. This filter enhances the accuracy of the accelerator by pre-processing the stream to remove most of the insignificant items, allowing the accelerator to process a significantly smaller filtered stream.
{"title":"Fast approximation of the top-k items in data streams using FPGAs","authors":"Ali Ebrahim, Jalal Khalifat","doi":"10.1049/cdt2.12053","DOIUrl":"https://doi.org/10.1049/cdt2.12053","url":null,"abstract":"<p>Two methods are presented for finding the top-<i>k</i> items in data streams using Field Programmable Gate Arrays (FPGAs). These methods deploy two variants of a novel accelerator architecture capable of extracting an approximate list of the topmost frequently occurring items in a single pass over the input stream without the need for random access. The first variant of the accelerator implements the well-known <i>Probabilistic</i> sampling algorithm by mapping its main processing stages to a hardware architecture consisting of two custom systolic arrays. The proposed architecture retains all the properties of this algorithm, which works even if the stream size is unknown at run time. The architecture shows better scalability compared to other architectures that are based on other stream algorithms. In addition, experimental results on both synthetic and real datasets, when implementing the accelerator on an Intel Arria 10 GX 1150 FPGA device, showed very good accuracy and significant throughput gains compared to the existing software and hardware-accelerated solutions. The second variant of the accelerator is specifically tailored for applications requiring higher accuracy, provided that the size of the stream is known at run time. This variant takes advantage of the embedded memory resources in an FPGA to implement a sketch-based filter that precedes the main systolic array in the accelerator's pipeline. This filter enhances the accuracy of the accelerator by pre-processing the stream to remove much of the insignificant items, allowing the accelerator to process a significantly smaller filtered stream.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"17 2","pages":"60-73"},"PeriodicalIF":1.2,"publicationDate":"2023-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12053","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50152328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a compact thermal model for smartphones, Phone-nomenon 2.0, to predict their thermal behaviour. First, the non-linearities of the internal and external heat transfer mechanisms of smartphones are studied and a compact thermal model for these non-linearities is proposed. Then, an iterative simulation procedure to handle these non-linearities is developed, establishing the basic simulation framework, one option in Phone-nomenon 2.0, called Phone-nomenon.Iter. Finally, a linearisation approach and model order reduction techniques are applied to enhance and speed up the basic framework, yielding the two further options Phone-nomenon.Lin and Phone-nomenon.LinMOR. Compared with a commercial tool, ANSYS Icepak, Phone-nomenon.Iter achieves two orders of magnitude speedup with a maximum error of less than 1.90% for steady-state simulations and three orders of magnitude speedup with a temperature difference of less than 0.65°C for transient simulations. In addition, the speedup of Phone-nomenon.Lin over Phone-nomenon.Iter is at least 4.22× and 3.26× for steady-state and transient simulations, respectively, and the speedup of Phone-nomenon.LinMOR over Phone-nomenon.Lin is at least 2.57×.
{"title":"Phone-nomenon 2.0: A compact thermal model for smartphones","authors":"Yu-Min Lee, Hong-Wen Chiou, Shinyu Shiau, Chi-Wen Pan, Shih-Hung Ting","doi":"10.1049/cdt2.12052","DOIUrl":"https://doi.org/10.1049/cdt2.12052","url":null,"abstract":"<p>This paper presents a compact thermal model for smartphones, Phone-nomenon 2.0, to predict the thermal behavior of smartphones. In the beginning, non-linearities of internal and external heat transfer mechanisms of smartphones and a compact thermal model for these non-linearities have been studied and proposed. Then, an iterative simulation procedure to handle these non-linearities was developed, and the basic simulation framework which is one option in Phone-nomenon 2.0 was established and we call it Phone-nomenon.Iter. Finally, the linearisation approach was applied, and model order reduction techniques to enhance and speed up the basic framework were employed, and these two options Phone-nomenon.Lin and Phone-nomenon.LinMOR were named. Compared with a commercial tool, ANSYS Icepak, Phone-nomenon.Iter can achieve two orders of magnitude speedup with the maximum error being less than 1.90% for steady-state simulations and three orders of magnitude speedup with the temperature difference being less than 0.65°C for transient simulations. In addition, the speedup of Phone-nomenon.Lin over Phone-nomenon.Iter can be at least 4.22× and 3.26× for steady-state and transient simulations, respectively. Moreover, the speedup of Phone-nomenon.LinMOR over Phone-nomenon.Lin is at least 2.57×.</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"17 2","pages":"43-59"},"PeriodicalIF":1.2,"publicationDate":"2023-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12052","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50125052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As embedded devices start supporting heterogeneous processing cores (Central Processing Unit [CPU] and Graphics Processing Unit [GPU] cores), performance-aware task allocation becomes a major issue. Using Open Computing Language (OpenCL) applications on both CPU and GPU cores improves performance and addresses this problem. However, it has an adverse effect on the overall power consumption and operating temperature of the system. Operating both kinds of cores within a small form factor at high frequency raises power consumption, which in turn increases processor temperature, and the elevated temperature brings about major thermal issues. In this paper, we investigate the role of the CPU during the execution of GPU-specific applications and argue against running it at high frequency. In addition, a machine learning guided mechanism to predict the optimal operating frequency of the CPU cores during execution of OpenCL GPU kernels is presented. Our experiments with OpenCL applications on the state-of-the-art ODROID XU4 embedded platform show that operating the CPU cores at the frequency proposed by our machine learning based predictive method yields about a 12.5°C reduction in processor temperature with a 1.06% degradation in performance compared to the baseline frequency (the default performance frequency governor of the embedded platform).
{"title":"Machine learning guided thermal management of Open Computing Language applications on CPU-GPU based embedded platforms","authors":"Rakesh Kumar, Bibhas Ghoshal","doi":"10.1049/cdt2.12050","DOIUrl":"https://doi.org/10.1049/cdt2.12050","url":null,"abstract":"<p>As embedded devices start supporting heterogeneous processing cores (Central Processing Unit [CPU]–Graphical Processing Unit [GPU] based cores), performance aware task allocation becomes a major issue. Use of Open Computing Language (OpenCL) applications on both CPU and GPU cores improves performance and resolves the problem. However, it has an adverse effect on the overall power consumption and the operating temperature of the system. Operating both kind of cores within a small form factor at high frequency causes rise in power consumption which in turn leads to increase in processor temperature. The elevated temperature brings about major thermal issues. In this paper, we present our investigation on the role of CPU during execution of GPU specific application and argue against running it at the high frequency. In addition, a machine learning guided mechanism to predict the optimal operating frequency of CPU cores during execution of OpenCL GPU kernels is presented in this study. Our experiments with OpenCL applications on the state of the art <i>ODROID XU4</i> embedded platform show that the CPU cores of the experimental board if operated at a frequency proposed by our Machine Learning-based predictive method brings about 12.5°C reduction in processor temperature at 1.06% degradation in performance compared to the baseline frequency (default <i>performance</i> frequency governor of the embedded platform).</p>","PeriodicalId":50383,"journal":{"name":"IET Computers and Digital Techniques","volume":"17 1","pages":"20-28"},"PeriodicalIF":1.2,"publicationDate":"2022-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cdt2.12050","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50155184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}