Pub Date: 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00035
K. Dang, Xuan-Tu Tran
Soft errors are expected to become more frequent as feature sizes shrink, owing to low operating voltages and high circuit density. However, the soft error rate per bit is expected to fall with technology scaling. Given tight area and energy constraints, on-chip communication needs a low-complexity, high-code-rate error correction code (ECC) to handle soft errors. In this work, we use a Parity Product Code (PPC) and propose several supporting mechanisms to detect and correct soft errors. First, PPC can work as a parity check to detect a single event upset (SEU) inside each flit. Then, to reduce the number of retransmissions, a Razor flip-flop with parity check (RFF-w-P) is proposed to work with PPC. Since PPC can also act as forward error correction (FEC), we present selective transmission by bit index using a transposable FIFO. The proposed mechanism therefore not only guarantees single-error detection and correction but also provides correction of two or more errors as FEC. The proposed work also reduces the FIFO area cost compared with traditional coding methods and adapts to multiple error rates.
Title: Parity-Based ECC and Mechanism for Detecting and Correcting Soft Errors in On-Chip Communication
Published in: 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
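As a rough illustration of how a product of parities can both detect and locate a single flipped bit, the sketch below stores a block of flits as rows, keeps one even-parity bit per flit (row) and one per bit index (column), and flips the bit at the intersection of the failing row and column. The data layout, the function names `encode` and `correct_single_error`, and the 4x4 example are assumptions made for illustration; they are not taken from the paper's PPC design, and the RFF-w-P and transposable FIFO are not modeled.

```python
from typing import List, Tuple

Block = List[List[int]]  # each row is one flit, each entry is one bit


def encode(block: Block) -> Tuple[List[int], List[int]]:
    """Return (row_parity, col_parity) using even parity."""
    row_parity = [sum(row) % 2 for row in block]
    col_parity = [sum(col) % 2 for col in zip(*block)]
    return row_parity, col_parity


def correct_single_error(block: Block,
                         row_parity: List[int],
                         col_parity: List[int]) -> Block:
    """Return a copy of `block` with a single upset data bit repaired.

    Assumption: at most one data bit flipped and the parity bits are intact.
    If the failing row/column pattern is anything else, the data is returned
    unchanged (a real design would request retransmission instead).
    """
    new_rows, new_cols = encode(block)
    bad_rows = [i for i, (a, b) in enumerate(zip(new_rows, row_parity)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(new_cols, col_parity)) if a != b]
    fixed = [row[:] for row in block]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        fixed[bad_rows[0]][bad_cols[0]] ^= 1  # flip the located bit back
    return fixed


if __name__ == "__main__":
    data = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
    rp, cp = encode(data)
    received = [row[:] for row in data]
    received[2][1] ^= 1  # inject one soft error
    print(correct_single_error(received, rp, cp) == data)
```

The row parity alone corresponds to the per-flit detection step; the column parity is what allows correction without retransmission in this simplified model.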
Pub Date: 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00049
Mohammad Loni, Amin Majd, A. Loni, M. Daneshtalab, Mikael Sjödin, E. Troubitsyna
Autonomous systems are used in a wide range of domains, from indoor utensils to autonomous robotic surgery and self-driving cars. Stereo vision cameras are probably the most flexible sensors in these systems, since they can extract depth, luminance, color, and shape information. However, stereo-vision-based applications suffer from huge image sizes and high computational complexity, which lead to higher power consumption. To tackle these challenges, we first employ the GIMME2 stereo vision system [1], a high-throughput and cost-efficient FPGA-based embedded stereo vision system. Next, we present a framework for designing an optimized Deep Convolutional Neural Network (DCNN) for time-constrained applications and/or platforms with a limited resource budget. The framework attempts to automatically generate a highly robust DCNN architecture for image data received from stereo vision cameras. It uses a multi-objective evolutionary optimization approach to design a near-optimal network architecture with respect to both accuracy and network size. Unlike recent works that aim only at a highly accurate network, we also consider network size in order to build a highly compact architecture. After designing a robust network, the framework maps the generated network onto a multi/many-core heterogeneous System-on-Chip (SoC). In addition, we have integrated the framework into the GIMME2 processing pipeline so that it can also estimate the distance of detected objects. The network generated by our framework offers up to a 24x compression rate while losing only 5% accuracy compared to the best result on the CIFAR-10 dataset.
Title: Designing Compact Convolutional Neural Network for Embedded Stereo Vision Systems
Published in: 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
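The trade-off between accuracy and network size can be made concrete with a Pareto-dominance filter, which is the selection idea underlying most multi-objective evolutionary methods. The sketch below is only that filter, not the authors' framework; the `Candidate` type, the function names, and the numbers in the example are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float  # higher is better
    params: int      # number of weights, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.params <= b.params
    better = a.accuracy > b.accuracy or a.params < b.params
    return no_worse and better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


if __name__ == "__main__":
    population = [
        Candidate("A", 0.90, 1000),
        Candidate("B", 0.85, 200),
        Candidate("C", 0.80, 500),   # dominated by B
        Candidate("D", 0.90, 1200),  # dominated by A
    ]
    print([c.name for c in pareto_front(population)])
```

An evolutionary loop would repeatedly mutate and recombine the surviving architectures and re-apply a filter of this kind; that loop is omitted here.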
Pub Date: 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00026
Mohamed Hamada, N. Odu, Mohammed Hassan
Recommender systems (RSs) are web-based tools that use various machine learning and filtering methods to propose useful items to users. Several techniques have been used to build such systems and generate lists of useful recommendations. Traditionally, RSs use a single rating to represent a user's preference for an item. Multi-criteria recommendation is a newer technique that recommends items to users based on multiple attributes of the items. It has been applied to many recommendation problems, and its predictive performance has been tested and shown to be better than that of the traditional approach. This paper presents a model based on the architecture and main features of fuzzy sets and systems. Fuzzy logic (FL) is widely applied across different fields of study; its main advantages are that it does not need much training data and that it can incorporate human heuristics into the computer-assisted decision-making process. FL is highly applicable in the RS domain. The study evaluates the predictive performance of the fuzzy-based multi-criteria technique and compares it with a single-rating RS. Experimental results on a real-world dataset from Yahoo! Movies show that the proposed technique remarkably improves the accuracy of the system.
Title: A Fuzzy-Based Approach for Modelling Preferences of Users in Multi-Criteria Recommender Systems
Published in: 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
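For readers unfamiliar with fuzzy sets, the fuzzification step that such a model relies on can be sketched with triangular membership functions. The abstract does not specify the membership functions or the rating scale, so the 1 to 5 scale, the three labels, and the breakpoints below are assumptions chosen purely for illustration.

```python
from typing import Dict


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership of x in a triangular fuzzy set rising from a, peaking at b,
    and falling back to zero at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzify(rating: float) -> Dict[str, float]:
    """Map a criterion rating on an assumed 1..5 scale to three fuzzy labels."""
    return {
        "low": triangular(rating, -1.0, 1.0, 3.0),
        "medium": triangular(rating, 1.0, 3.0, 5.0),
        "high": triangular(rating, 3.0, 5.0, 7.0),
    }


if __name__ == "__main__":
    # A rating of 4 belongs partly to "medium" and partly to "high".
    print(fuzzify(4.0))
```

In a multi-criteria setting, each criterion rating (for example story, acting, direction) would be fuzzified in this way before a rule base combines them into an overall preference; the rule base itself is not sketched here.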
Pub Date: 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00038
T. Schwarzer, Sascha Roloff, Valentina Richthammer, R. Khaldi, S. Wildermann, M. Glaß, J. Teich
Many-core architectures enable the concurrent execution of multiple application programs. In this context, the well-known problem of feasibly mapping applications, i.e., their tasks and communication, to such architectures has gained importance due to the large number of cores and the limited inter-processor communication capacities. This challenge is tackled by so-called Hybrid Application Mapping (HAM) approaches. These combine a design-time analysis, which extracts sets of mapping constraints that characterize feasible or optimal mappings, with the runtime determination of a concrete mapping that depends on these constraints and on the set of currently available resources. A major strength of HAM approaches is their demonstrated ability to give real-time and other guarantees for statically characterized application programs even in highly dynamic workload scenarios, while avoiding the pessimism of static resource partitioning. However, finding a feasible mapping is an NP-complete problem. This work discusses the resulting implications for HAM approaches in general and investigates two exact techniques for solving the mapping constraints at runtime in particular: (I) a problem-specific backtracking approach, and (II) an approach that adopts a general-purpose SAT solver. Experimental results show that the overhead of the general-purpose solver, in particular of processing and solving the required SAT formulation, becomes significant, whereas the problem-specific backtracking technique achieves significantly lower execution times.
Title: On the Complexity of Mapping Feasibility in Many-Core Architectures
Published in: 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
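A minimal sketch of what a problem-specific backtracking search for a feasible mapping might look like is given below. The constraint model is a deliberate simplification and an assumption of this sketch: each task has a load, each core a capacity, and communicating tasks must sit on the same core or on directly adjacent cores. The paper's actual constraint sets and search heuristics are not reproduced, and the name `find_mapping` is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def find_mapping(task_load: Dict[str, int],
                 core_capacity: Dict[str, int],
                 comm: List[Tuple[str, str]],
                 adjacent: Dict[str, Set[str]]) -> Optional[Dict[str, str]]:
    """Return one feasible task-to-core mapping, or None if none exists."""
    tasks = list(task_load)
    mapping: Dict[str, str] = {}
    free = dict(core_capacity)  # remaining capacity per core

    def compatible(task: str, core: str) -> bool:
        if free[core] < task_load[task]:
            return False
        for a, b in comm:
            other = b if a == task else a if b == task else None
            if other in mapping:
                peer = mapping[other]
                # Assumed constraint: peers share a core or are neighbours.
                if peer != core and peer not in adjacent[core]:
                    return False
        return True

    def backtrack(i: int) -> bool:
        if i == len(tasks):
            return True
        task = tasks[i]
        for core in core_capacity:
            if compatible(task, core):
                mapping[task] = core
                free[core] -= task_load[task]
                if backtrack(i + 1):
                    return True
                del mapping[task]            # undo and try the next core
                free[core] += task_load[task]
        return False

    return dict(mapping) if backtrack(0) else None


if __name__ == "__main__":
    line = {"c0": {"c1"}, "c1": {"c0", "c2"}, "c2": {"c1"}}
    caps = {"c0": 1, "c1": 1, "c2": 1}
    loads = {"t0": 1, "t1": 1, "t2": 1}
    print(find_mapping(loads, caps, [("t0", "t1"), ("t1", "t2")], line))
```

Because the search is exhaustive in the worst case, its run time grows exponentially with the number of tasks, which is consistent with the NP-completeness noted in the abstract.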
Pub Date: 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00044
Trong-Thuc Hoang, Duc-Hung Le, C. Pham
This paper proposes the design of a 32-bit floating-point Fast Fourier Transform (FFT) Twiddle Factor (TF) unit. The architecture is based on the adaptive COordinate Rotation DIgital Computer (CORDIC) algorithm. CORDIC is a well-known approach for approximating the complex-number multiplication by the TF in FFT implementations. Adaptive CORDIC computes its result iteratively, so limiting the number of iterations trades accuracy for a higher throughput rate. Accordingly, three FFT TF implementations are presented in this paper: TF-4, TF-8, and TF-16, with iteration limits of four, eight, and 16, respectively. The results of the three implementations are reported at both the Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuit (ASIC) levels. The FPGA results were obtained on the Altera Stratix IV development kit, and the ASIC results were reported by the Synopsys tools with the Silicon On Thin Buried-oxide (SOTB) 65nm process library.
Title: VLSI Design of Floating-Point Twiddle Factor Using Adaptive CORDIC on Various Iteration Limitations
Published in: 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
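The accuracy versus iteration-count trade-off can be seen in a small software model. The sketch below uses conventional rotation-mode CORDIC in double-precision Python floats, not the adaptive variant or the 32-bit hardware datapath described in the paper; the function names and the chosen twiddle index are assumptions for illustration.

```python
import cmath
import math
from typing import Tuple


def cordic_rotate(x: float, y: float, angle: float,
                  iterations: int) -> Tuple[float, float]:
    """Rotate (x, y) by `angle` radians using `iterations` micro-rotations.

    Assumes |angle| is within the CORDIC convergence range (about 1.74 rad);
    the quadrant pre-rotation needed for larger angles is omitted.
    """
    z = angle
    gain = 1.0
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        step = 2.0 ** -i
        x, y = x - d * y * step, y + d * x * step
        z -= d * math.atan(step)
        gain *= math.sqrt(1.0 + step * step)  # accumulated scale factor
    return x / gain, y / gain


def twiddle(k: int, n: int, iterations: int) -> complex:
    """Approximate W = exp(-2j*pi*k/n) by rotating the unit vector (1, 0)."""
    re, im = cordic_rotate(1.0, 0.0, -2.0 * math.pi * k / n, iterations)
    return complex(re, im)


if __name__ == "__main__":
    exact = cmath.exp(-2j * math.pi * 3 / 32)
    for limit in (4, 8, 16):  # mirrors the TF-4 / TF-8 / TF-16 limits
        print(limit, abs(twiddle(3, 32, limit) - exact))
```

After `n` iterations the residual angle is bounded by `atan(2**-(n-1))`, so each extra iteration roughly halves the worst-case error, which is the accuracy that a lower iteration limit gives up in exchange for throughput.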
Pub Date: 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00045
Van-Tinh Nguyen, Tieu-Khanh Luong, Han Le Duc, Van‐Phuc Hoang
In this paper, we present a new method for approximating non-linear activation functions, including tanh and sigmoid, using stochastic computing (SC) logic based on piecewise-linear (PWL) approximation over the full range [-1, 1]. The SC implementations of the PWL approximations are based on a 90nm CMOS process. The implementation results, which are presented and discussed, show that the proposed SC circuits provide better performance than previous methods such as the well-known Maclaurin-expansion-based, Bernstein-polynomial-based, and finite-state-machine (FSM) based implementations.
Title: An Efficient Hardware Implementation of Activation Functions Using Stochastic Computing for Deep Neural Networks
Published in: 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
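The piecewise-linear part of the idea can be illustrated numerically. The sketch below builds a PWL table for tanh on [-1, 1] with uniformly spaced breakpoints and evaluates it by linear interpolation; it models only the approximation, not the stochastic bitstream logic, and the uniform breakpoint spacing, segment counts, and function names are assumptions rather than the paper's choices.

```python
import math
from bisect import bisect_right
from typing import Callable, List, Tuple


def build_pwl(f: Callable[[float], float], lo: float, hi: float,
              segments: int) -> Tuple[List[float], List[float]]:
    """Sample f at segments + 1 uniformly spaced breakpoints on [lo, hi]."""
    xs = [lo + (hi - lo) * i / segments for i in range(segments + 1)]
    ys = [f(x) for x in xs]
    return xs, ys


def eval_pwl(xs: List[float], ys: List[float], x: float) -> float:
    """Evaluate the PWL function at x, clamping x to the table range."""
    x = min(max(x, xs[0]), xs[-1])
    k = min(bisect_right(xs, x) - 1, len(xs) - 2)  # index of the segment
    t = (x - xs[k]) / (xs[k + 1] - xs[k])
    return ys[k] + t * (ys[k + 1] - ys[k])


def max_error(f: Callable[[float], float], xs: List[float],
              ys: List[float], samples: int = 2001) -> float:
    """Largest absolute deviation from f over a dense grid on the range."""
    lo, hi = xs[0], xs[-1]
    grid = (lo + (hi - lo) * i / (samples - 1) for i in range(samples))
    return max(abs(f(x) - eval_pwl(xs, ys, x)) for x in grid)


if __name__ == "__main__":
    for segments in (4, 8):
        xs, ys = build_pwl(math.tanh, -1.0, 1.0, segments)
        print(segments, max_error(math.tanh, xs, ys))
```

Doubling the number of segments cuts the interpolation error by roughly a factor of four, since the error of linear interpolation scales with the square of the segment width.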
Pub Date: 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00036
H. Kidane, E. Bourennane
Network-on-chip (NoC) based communication is increasingly used as a solution for multi-IP systems-on-chip. A great deal of work has gone into adapting NoCs to FPGA-based dynamically reconfigurable IPs. A run-time scalable NoC based on Dynamic Partial Reconfiguration (DPR) is one way to reduce the power consumed by idle NoC components. However, the lack of custom HDL NoC generation tools that separate the NoC rows and columns into independent components remains an open problem. In this paper, we introduce a UML/MARTE and IP-XACT based approach to model and generate run-time scalable NoC components targeting Xilinx FPGAs. The NoC is modeled by splitting it into a static sub-NoC and a series of run-time scalable rows and columns, each treated as a component. First, both the static and the run-time scalable sub-NoCs are defined at a high level using UML/MARTE. Then, they are transformed into an intermediate XML description that follows the IP-XACT standard. Next, the XML descriptions of the top-level NoC and of the reconfigurable rows and columns are transformed into VHDL. Finally, the HDL files of the NoC are imported into Xilinx EDK to implement the dynamically scalable NoC together with the FPGA-based reconfigurable IPs. The proposed approach is validated by modeling a 3x3 NoC split into three components: a 2x2 static sub-NoC, a 2x1 reconfigurable column, and a 1x3 reconfigurable row. Small user-defined IPs are then connected to the NoC routers to implement the full system.
Title: MARTE and IP-XACT Based Approach for Run-Time Scalable NoC
Published in: 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
Pub Date: 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00043
K. Kanazawa, Shaowei Cai
In this paper, we propose an FPGA solver for maximum clique problems encoded as partial maximum satisfiability (partial MaxSAT). Given a Boolean formula with hard constraints that are required to be satisfied and soft constraints that are desired to be satisfied, the goal of partial MaxSAT is to find a truth assignment that satisfies all hard constraints and as many soft constraints as possible. The maximum clique problem involves finding a clique with the maximum possible number of vertices in a given graph, and it can be formulated as partial MaxSAT in a natural way. The Dist algorithm is one of the best-performing local search algorithms for solving partial MaxSAT. We reconstruct the Dist algorithm to leverage its inherent parallelism while maintaining its accuracy on maximum clique problems, and then describe its implementation on an FPGA. Our FPGA solver can solve partial MaxSAT-encoded maximum clique problems up to 22 times faster than the Dist algorithm on a CPU.
Title: FPGA Acceleration to Solve Maximum Clique Problems Encoded into Partial MaxSAT
Published in: 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
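The "natural" formulation mentioned in the abstract is commonly written as follows: one Boolean variable per vertex, a hard clause (not x_u or not x_v) for every pair of non-adjacent vertices, and a soft unit clause (x_v) for every vertex. The sketch below builds that encoding and checks it on a tiny graph by exhaustive search. It is assumed here that this is the encoding intended, since the abstract does not spell it out, and the brute-force checker is only a stand-in for illustration, not the Dist local search or its FPGA implementation.

```python
from itertools import combinations, product
from typing import List, Sequence, Set, Tuple

Clause = Tuple[int, ...]  # DIMACS-style literals: +v is x_v, -v is not x_v


def encode_max_clique(n: int, edges: Sequence[Tuple[int, int]]
                      ) -> Tuple[List[Clause], List[Clause]]:
    """Return (hard, soft) clauses for vertices numbered 1..n."""
    edge_set = {frozenset(e) for e in edges}
    hard = [(-u, -v) for u, v in combinations(range(1, n + 1), 2)
            if frozenset((u, v)) not in edge_set]
    soft = [(v,) for v in range(1, n + 1)]
    return hard, soft


def satisfied(clause: Clause, bits: Sequence[bool]) -> bool:
    """A clause holds if any literal agrees with the assignment."""
    return any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)


def brute_force(n: int, hard: List[Clause], soft: List[Clause]) -> Set[int]:
    """Try all 2**n assignments; return the vertex set of the best one."""
    best: Set[int] = set()
    best_score = -1
    for bits in product([False, True], repeat=n):
        if all(satisfied(c, bits) for c in hard):
            score = sum(satisfied(c, bits) for c in soft)
            if score > best_score:
                best_score = score
                best = {v for v in range(1, n + 1) if bits[v - 1]}
    return best


if __name__ == "__main__":
    edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    hard, soft = encode_max_clique(5, edges)
    print(brute_force(5, hard, soft))
```

Any assignment that satisfies all hard clauses selects a clique, and the number of satisfied soft clauses equals the clique size, so maximizing soft clauses maximizes the clique.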
Pub Date: 2018-09-01 DOI: 10.1109/MCSoC2018.2018.00031
M. Takayanagi, Tomohiro Suzuki
The tile algorithm for matrix decompositions is attracting attention as a method for the latest multicore/many-core architectures because it generates many fine-grained tasks that can be executed in parallel. Exploiting many parallel computing resources effectively with a fork-join paradigm is difficult. In recent years, CPU/GPU heterogeneous clusters have become the mainstream supercomputer architecture, and using such a cluster efficiently is more difficult than using a multicore cluster efficiently. We implemented the tile CAQR decomposition algorithm on a CPU/GPU cluster system with OpenMP 4.0, MPI, and cuBLAS, and proposed new approaches to utilize GPUs efficiently. In this paper, we show the performance results of our implementation on the Reedbush-H heterogeneous supercomputer.
Title: Communication-Avoiding Tile QR Decomposition on CPU/GPU Heterogeneous Cluster System
Published in: 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)