Acceleration of variance of color differences-based demosaicing using CUDA
Muhammad Ismail Faruqi, Fumihiko Ino, K. Hagihara
Pub Date: 2012-07-02 | DOI: 10.1109/HPCSim.2012.6266965
Image demosaicing algorithms reconstruct a full-color image from the incomplete color samples (RAW data) output by an image sensor overlaid with a Color Filter Array (CFA). Better demosaicing algorithms are superior in terms of acuity, dynamic range, signal-to-noise ratio, and artifact suppression, which makes them suitable for high-quality delivery such as theatrical broadcast. In this paper, we examine the feasibility of exploiting the Graphics Processing Unit (GPU) as an emerging accelerator to create an on-the-fly implementation of Variance of Color Differences (VCD) demosaicing, a state-of-the-art heuristic demosaicing algorithm developed to eliminate false-color artifacts in texture regions of images. Our contributions are 1) implementing the algorithm as several kernels, to separate the bottleneck portion of the algorithm from the rest and to minimize idle threads, and 2) reducing I/O between shared and global memory during green-channel interpolation by separating the input RAW data into four channels. We compare the implementation featuring both acceleration methods with a single-kernel implementation. In our experiments, the proposed acceleration methods achieved a per-frame processing time of 343 ms on an NVIDIA GTX 480, which translates into 2.95 fps. They also improved the kernel time and the effective memory bandwidth by a factor of 2.1× compared with the single-kernel counterpart.
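The second optimization, separating the RAW mosaic into four channels, can be sketched in a few lines. This is a minimal serial sketch assuming an RGGB Bayer layout (the abstract does not state the CFA pattern); the paper's implementation runs as CUDA kernels, not host code.

```python
def split_bayer_rggb(raw, width, height):
    """Split a flat RGGB Bayer mosaic into four quarter-resolution
    channel lists (R, G1, G2, B), so that each channel can be read
    contiguously during green-channel interpolation."""
    r, g1, g2, b = [], [], [], []
    for y in range(0, height, 2):       # walk 2x2 Bayer cells
        for x in range(0, width, 2):
            r.append(raw[y * width + x])
            g1.append(raw[y * width + x + 1])
            g2.append(raw[(y + 1) * width + x])
            b.append(raw[(y + 1) * width + x + 1])
    return r, g1, g2, b

# 4x4 toy mosaic with values 0..15 laid out row-major
r, g1, g2, b = split_bayer_rggb(list(range(16)), 4, 4)
```

On a GPU, each thread block would load one tile of a single channel into shared memory, which is what reduces the shared/global-memory traffic the paper measures.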
A method for communication efficient work distributions in stencil operation based applications on heterogeneous clusters
J. Schneible, L. Ríha, Maria Malik, T. El-Ghazawi, A. Alexandru
Pub Date: 2012-07-02 | DOI: 10.1109/HPCSim.2012.6266960
In recent years, the use of accelerators in conjunction with CPUs, known as heterogeneous computing, has brought about significant performance increases for scientific applications. One of the best examples of this is Lattice Quantum Chromodynamics (QCD), a stencil-operation-based simulation. These simulations have a large memory footprint, necessitating the use of many graphics processing units (GPUs) in parallel. This requires a heterogeneous cluster with one or more GPUs per node. To obtain optimal performance, it is necessary to determine an efficient communication pattern between GPUs on the same node and between nodes. In this paper we present a performance-model-based method for minimizing the communication time of applications built on stencil operations, such as Lattice QCD, on heterogeneous computing systems with a non-blocking InfiniBand interconnection network. The proposed method increases the performance of the most computationally intensive kernel of Lattice QCD by 25 percent thanks to improved overlapping of communication and computation.
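The communication/computation overlap that the model optimizes can be illustrated with a toy 1D stencil: interior points need no remote data and can be updated while halo values are in flight, so only the boundary points wait for communication. The 3-point averaging stencil below is an illustrative stand-in of ours, not the Lattice QCD operator, and the "exchange" is reduced to two ghost values.

```python
def stencil_step_overlapped(local, left_ghost, right_ghost):
    """One step of a 1D 3-point averaging stencil, split the way
    overlapped implementations split it: interior first (can run
    concurrently with the halo exchange), boundaries last (need
    the ghost values that the exchange delivers)."""
    n = len(local)
    out = [0.0] * n
    # Phase 1: interior update -- would overlap with non-blocking sends/recvs
    for i in range(1, n - 1):
        out[i] = (local[i - 1] + local[i] + local[i + 1]) / 3.0
    # Phase 2: boundary update, once the ghost cells have arrived
    out[0] = (left_ghost + local[0] + local[1]) / 3.0
    out[n - 1] = (local[n - 2] + local[n - 1] + right_ghost) / 3.0
    return out

out = stencil_step_overlapped([1.0, 2.0, 3.0], 0.0, 4.0)
```

The paper's contribution is choosing work distributions so that Phase 1 is long enough to hide the exchange on a real InfiniBand network.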
Scalable high performance computing in wide area network
R. Hassani, P. Luksch
Pub Date: 2012-07-02 | DOI: 10.1109/HPCSim.2012.6266993
Many parallel applications in High Performance Computing have to communicate via a wide area network (WAN), e.g., in a grid or cloud environment that spans multiple sites. Communication across WAN links slows down the application due to high latency and low bandwidth. Much of this overhead is due to the current implementations of the MPI (Message Passing Interface) standard. My project aims at improving the WAN performance of MPI. Virtually all of today's wide-area MPI implementations rely on the TCP/IP protocol. I propose to replace it with an innovative concurrent multipath communication method (CMC-SCTP) and to integrate it into the Open MPI project, which will increase bandwidth and enhance fault resilience within the MPI protocol stack in WAN environments. I plan to make my research results available to the community within the scope of the Open MPI project.
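The core idea of concurrent multipath transfer is to stripe one logical message across several network paths at once. The round-robin scheduler below is a deliberately naive sketch of that idea; the actual CMC-SCTP scheduling, retransmission handling, and Open MPI integration are the project's contribution and are not reproduced here.

```python
def stripe_chunks(chunks, n_paths):
    """Round-robin striping of message chunks across concurrent
    paths, the basic principle behind concurrent multipath transfer
    over SCTP. Real schedulers weight paths by measured bandwidth
    and RTT instead of striping uniformly."""
    paths = [[] for _ in range(n_paths)]
    for i, chunk in enumerate(chunks):
        paths[i % n_paths].append(chunk)
    return paths

# Five chunks spread over two WAN paths
paths = stripe_chunks([1, 2, 3, 4, 5], 2)
```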
An energy-aware multi-start local search heuristic for scheduling VMs on the OpenNebula cloud distribution
Y. Kessaci, N. Melab, E. Talbi
Pub Date: 2012-07-02 | DOI: 10.1109/HPCSim.2012.6266899
Reducing energy consumption is an increasingly important issue in cloud computing, especially when a cloud distribution is dispatched over a huge number of machines. Minimizing energy consumption can significantly reduce energy bills and greenhouse gas emissions. Therefore, much research is being carried out to develop new methods that consume less energy. In this paper, we present an Energy-aware Multi-start Local Search algorithm for an OpenNebula-based Cloud (EMLS-ONC) that optimizes the energy consumption of a geographically distributed cloud computing infrastructure managed by OpenNebula. The results of our EMLS-ONC scheduler are compared to those obtained by the default scheduler of OpenNebula. The two approaches have been evaluated using different VM arrival scenarios and different hardware infrastructures. The results show that EMLS-ONC outperforms OpenNebula's default scheduler by a significant margin in terms of energy consumption. In addition, EMLS-ONC is also shown to schedule more applications.
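A multi-start local search has a simple generic skeleton: repeat (random initial assignment, then greedy improvement) and keep the best result. The sketch below applies that skeleton to VM-to-host assignment; the neighborhood (relocate one VM) and the toy energy model are our simplifications, not the paper's EMLS-ONC neighborhood or energy model.

```python
import random

def multistart_local_search(vms, hosts, energy, starts=5, seed=0):
    """Multi-start local search over VM -> host assignments.
    `energy(assign)` scores an assignment (a dict vm -> host);
    lower is better. Each start descends by first-improvement
    single-VM relocation moves."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(starts):
        assign = {vm: rng.choice(hosts) for vm in vms}  # random restart
        improved = True
        while improved:
            improved = False
            for vm in vms:                # try relocating each VM in turn
                for h in hosts:
                    trial = dict(assign)
                    trial[vm] = h
                    if energy(trial) < energy(assign):
                        assign, improved = trial, True
        cost = energy(assign)
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

# Toy energy model: fewer active hosts = less energy (consolidation)
vms, hosts = ["vm1", "vm2", "vm3"], ["h1", "h2", "h3"]
best, cost = multistart_local_search(vms, hosts, lambda a: len(set(a.values())))
```

Each accepted move strictly lowers the energy, so every descent terminates; the restarts are what let the heuristic escape poor local optima.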
Hybrid parallel solutions of the Black-Scholes PDE with the truncated combination technique
J. Benk, D. Pflüger
Pub Date: 2012-07-02 | DOI: 10.1109/HPCSim.2012.6266992
This paper presents an efficient approach to the parallel pricing of multi-dimensional financial derivatives based on the Black-Scholes Partial Differential Equation (BS-PDE). One of the main challenges for such multi-dimensional problems is the curse of dimensionality, which our approach tackles with the combination technique (CT). This technique combines several solutions obtained on anisotropic full grids and hence offers the possibility to solve the BS-PDE on each grid in an embarrassingly parallel way. Besides parallelizing at the CT level, we have developed a shared-memory parallel multigrid solver for the BS-PDE. The parallel efficiency of our hybrid parallel approach is demonstrated by strong scaling results for 5D and 6D pricing problems.
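The combination technique has a classical closed form that makes the parallelism visible: the sparse-grid solution is assembled from independent anisotropic full-grid solutions $u_{\mathbf{l}}$ with level multi-index $\mathbf{l}$. Shown here are the standard $d$-dimensional CT coefficients; the paper uses a truncated variant, which restricts the index set but keeps the same structure.

```latex
u^{c}_{n} \;=\; \sum_{q=0}^{d-1} (-1)^{q} \binom{d-1}{q}
\sum_{\lvert \mathbf{l} \rvert_{1} = n - q} u_{\mathbf{l}}
```

Each $u_{\mathbf{l}}$ on the right-hand side is one full-grid BS-PDE solve, which is why the outer CT level is embarrassingly parallel while each solve can additionally use the shared-memory multigrid solver.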
Accurate CUDA performance modeling for sparse matrix-vector multiplication
Ping Guo, Liqiang Wang
Pub Date: 2012-07-02 | DOI: 10.1109/HPCSim.2012.6266964
This paper presents an integrated analytical and profile-based CUDA performance modeling approach to accurately predict the kernel execution times of sparse matrix-vector multiplication (SpMV) for the CSR, ELL, COO, and HYB CUDA kernels. In our experiments on a collection of 8 widely used test matrices on an NVIDIA Tesla C2050, the execution times predicted by our model match the measured execution times of NVIDIA's SpMV implementations very well. Specifically, for 29 out of 32 test cases, the performance differences are under or around 7%. For the remaining 3 test cases, the differences are between 8% and 10%. For the CSR, ELL, COO, and HYB SpMV kernels, the differences are 4.2%, 5.2%, 1.0%, and 5.7% on average, respectively.
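For readers unfamiliar with the storage schemes being modeled, CSR (the first of the four) stores the nonzeros row by row with a row-pointer array. The serial reference below fixes only the data layout and arithmetic; a CUDA kernel would map rows (or warps per row) to threads, which is precisely the behavior the paper's model predicts.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A @ x with A in CSR format:
    `values`/`col_idx` hold the nonzeros and their columns, and
    row i's nonzeros occupy the slice row_ptr[i]:row_ptr[i+1]."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

# A = [[1, 0, 2], [0, 3, 0]] in CSR form, multiplied by x = [1, 1, 1]
y = spmv_csr([1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], [1.0, 1.0, 1.0])
```

ELL pads every row to the same length for coalesced access, COO stores (row, col, value) triples, and HYB splits the matrix between an ELL part and a COO overflow part.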
Visual words selection for human action classification
J. R. Cózar, José María González-Linares, Nicolás Guil Mata, Ruber Hernández, Yanio Heredia
Pub Date: 2012-07-02 | DOI: 10.1109/HPCSim.2012.6266910
Human action classification is an important task in computer vision. The Bag-of-Words model attains this goal by assigning spatio-temporal features to the visual words of a vocabulary and applying a classification algorithm. In this work we study the effect of reducing the vocabulary size using a video word ranking method. We apply this method to the KTH dataset to obtain a vocabulary of more descriptive words, whose representation is more compact and efficient. Two feature descriptors, STIP and MoSIFT, and two classifiers, KNN and SVM, are used to check the validity of our approach. Results for different vocabulary sizes show that the recognition rate improves as non-descriptive words are removed. Additionally, state-of-the-art performance is reached with this new compact vocabulary representation.
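Once each visual word has a ranking score, shrinking the vocabulary amounts to keeping the top-k words and projecting every bag-of-words histogram onto them. The sketch below takes the scores as given input; how "descriptive" a word is (the ranking itself) is the paper's contribution and is not reproduced here.

```python
def reduce_vocabulary(histograms, scores, k):
    """Keep the k highest-scoring visual words and project each
    bag-of-words histogram onto the reduced vocabulary. Returns the
    kept word indices (in original order) and the compact histograms."""
    keep = sorted(range(len(scores)), key=lambda w: scores[w], reverse=True)[:k]
    keep.sort()  # preserve original word order in the compact histograms
    return keep, [[h[w] for w in keep] for h in histograms]

# Three visual words scored by descriptiveness; keep the top two
keep, reduced = reduce_vocabulary([[5, 1, 2], [0, 4, 4]], [0.1, 0.9, 0.5], 2)
```

The reduced histograms are what the KNN or SVM classifier then consumes, which is why the representation is both more compact and faster to classify.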
Connection reservation algorithm in a Web server with service differentiation
Paulo S. F. Eustaquio, Ricardo Figueiredo, S. Bruschi, R. Santana, M. J. Santana
Pub Date: 2012-07-02 | DOI: 10.1109/HPCSim.2012.6266955
This paper presents an architecture prototype, named Web Server with Service Differentiation, able to provide QoS to different classes of services. Using the implemented prototype, an admission control algorithm, named the connection reservation algorithm, is proposed and compared to a negotiation algorithm. The performance evaluation showed that both algorithms served proportionally more high-priority (Class 1) requests than low-priority (Class 2) requests, although the connection reservation algorithm adapted better to workload variations. The connection reservation algorithm can be extended to the Web, where dynamic workload characteristics predominate.
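A connection-reservation admission test can be sketched as a simple threshold policy: some connection slots are held back for Class 1, and Class 2 is refused once only the reserved pool remains. This is our simplification of the general idea; the paper's actual algorithm and its parameters are not reproduced here.

```python
def admit(request_class, in_use, capacity, reserved):
    """Admission decision in the spirit of connection reservation:
    Class 1 may use any free slot, while Class 2 is only admitted
    when more than `reserved` slots remain free."""
    free = capacity - in_use
    if request_class == 1:
        return free > 0
    return free > reserved

ok_class1 = admit(1, 8, 10, 2)  # 2 free slots: Class 1 admitted
ok_class2 = admit(2, 8, 10, 2)  # only the reserved pool left: Class 2 refused
```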
A cloud-based watermarking method for health data security
Zhiwei Yu, C. Thomborson, Chaokun Wang, Jianmin Wang, Rui Li
Pub Date: 2012-07-02 | DOI: 10.1109/HPCSim.2012.6266986
Private health information, once confined to local medical institutions, is migrating onto the Internet as Electronic Health Records (EHRs) accessed through cloud computing. No matter where it is hosted, health data is subject to security breaches, privacy abuses, and access control violations. However, novel technologies bring new vulnerabilities but also allow new mitigations. In this paper, we propose a watermarking method within a cloud computing architecture to mitigate the risk of insider disclosures. Our design and preliminary implementation exploit the MapReduce mechanism in the cloudlet we built. Our evaluation shows that our proposal addresses all of the requirements of the Cloud Oriented Architecture (COA) framework of the Jericho Forum.
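As a point of reference for what "watermarking" means here, the textbook least-significant-bit (LSB) scheme below embeds an identifier into data values so a leaked copy can later be traced. This is a generic stand-in of ours; the paper's actual embedding scheme and its MapReduce decomposition are not described in enough detail in the abstract to reproduce.

```python
def embed_lsb(samples, bits):
    """Embed watermark bits into the least-significant bit of the
    first len(bits) integer samples (one bit per sample)."""
    out = list(samples)
    for i, b in enumerate(bits):
        out[i] = (out[i] & ~1) | b
    return out

def extract_lsb(samples, n):
    """Read the first n watermark bits back out."""
    return [s & 1 for s in samples[:n]]

marked = embed_lsb([10, 11, 12], [1, 0, 1])
```

In a MapReduce setting, each mapper could embed its shard's portion of the watermark independently, since the per-sample operation has no cross-record dependencies.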
Dynamically reducing overestimated design margin of MultiCores
Toshinori Sato, Takanori Hayashida, Ken Yano
Pub Date: 2012-07-02 | DOI: 10.1109/HPCSim.2012.6266944
The multicore processor is one of the most promising techniques for satisfying the computing demands of future consumer devices. However, multicore processors are threatened by increasing energy consumption due to PVT (Process-Voltage-Temperature) variations, which require large design margins in the supply voltage and hence result in large energy consumption. The combination of DVS (dynamic voltage scaling) with canary flip-flops, named Canary-DVS, has been proposed to eliminate the overestimated voltage margin, but it has only been evaluated under the assumption of typical delay. This paper considers C2C (core-to-core) variations and evaluates how well Canary-DVS eliminates the energy waste under the practical assumption of delay variations. We apply Canary-DVS to a commercial processor, Toshiba's quad-core Media embedded Processor (MeP). Monte Carlo simulations show that energy is reduced by 18.6% on average, with no noticeable discrepancy from the typical-delay case, when a σ/μ value of 0.064 is assumed for gate delay.
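The canary flip-flop idea can be captured in a small control loop: because the canary fails slightly before the main flip-flop would, the supply voltage can be stepped down as long as the canary still passes one level below, leaving the core at the lowest canary-safe level. The loop below is our simplification of that policy; the real Canary-DVS controller and its timing model belong to the paper.

```python
def settle_voltage(levels, canary_fails):
    """Step the supply voltage down through the ascending list
    `levels` until the canary flip-flop would fail one step lower,
    and return the lowest canary-safe voltage. `canary_fails(v)`
    models the canary timing test at voltage v."""
    v = len(levels) - 1                   # start at the highest (safest) level
    while v > 0 and not canary_fails(levels[v - 1]):
        v -= 1                            # one step down is still canary-safe
    return levels[v]

# Toy timing model: the canary starts failing below 0.9 V
chosen = settle_voltage([0.8, 0.9, 1.0, 1.1], lambda volts: volts < 0.9)
```

Per-core loops of this kind are what let the scheme absorb core-to-core variation instead of provisioning one worst-case margin for the whole chip.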