Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273509
Xiangyu Wu, Y. Xia, Naifeng Jing, Xiaoyao Liang
With the rapid development of GPU server technology, cloud gaming has become popular in recent years. Unlike traditional desktop gaming, where graphics rendering is performed locally on the user's personal graphics card, cloud gaming runs multiple games in the data center to support many users at the same time, and most of the rendering is done on a remote GPU cluster. The rendered frames are streamed to users' devices such as notebooks, tablets and cell phones. For cloud gaming to be economically viable, the operator must make full use of expensive hardware resources such as graphics cards, and state-of-the-art technology tries to render multiple game instances on the same GPU. In this paper, we first identify that today's cloud gaming rendering contains many redundant and duplicated contexts/workloads that waste a large amount of memory bandwidth and system energy. We in turn propose novel system architecture enhancements to effectively share content across the game instances of different users in the cloud gaming center.
Title : CGSharing: Efficient content sharing in GPU-based cloud gaming
Venue : 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273481
Ahmedullah Aziz, N. Shukla, S. Datta, S. Gupta
We present a novel technique for optimizing the read operation of spin-transfer torque (STT) MRAMs by employing a correlated material (CM) in conjunction with a magnetic tunnel junction (MTJ). The design of the proposed memory cell exploits the orders-of-magnitude difference in resistance between the two phases of the CM and triggers operation-driven phase transitions in the CM by judiciously co-optimizing the devices and the memory cell. During read, the CM operates in the metallic phase when the MTJ is in the low-resistance state and in the insulating phase when the MTJ is in the high-resistance state. This leads to superior distinguishability, read efficiency and stability. During write, the CM operates in the metallic phase, which minimizes the impact of the CM resistance on the write speed. Our analysis shows that the CM amplifies the cell tunneling magneto-resistance from 107% (for the standard STT MRAM) to 1878% (for the proposed cell), leading to a 68% higher sense margin. In addition, a 45% enhancement in the read disturb margin and a 36% reduction in the cell read power are achieved. At the same time, the write asymmetry associated with different state transitions is mildly mitigated, leading to a 9% reduction in the write power. This comes at a negligible cost of a 4% longer write time. We also discuss the layout implications of our technique and propose sharing the CM amongst multiple cells. As a result of the sharing, the proposed technique incurs no area penalty.
Title : COAST: Correlated material assisted STT MRAMs for optimized read operation
Venue : 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
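The amplification of the cell-level tunneling magneto-resistance (TMR) described in the COAST abstract can be illustrated with a simple series-resistance model. The sketch below is an assumption made for illustration, not the authors' device model: it treats the CM as a resistor in series with the MTJ, and every resistance value is hypothetical, chosen only so that the bare MTJ has the 107% TMR quoted in the abstract.

```python
def tmr(r_low: float, r_high: float) -> float:
    """Tunneling magneto-resistance ratio in percent: (R_high - R_low) / R_low."""
    return 100.0 * (r_high - r_low) / r_low


def cell_tmr(r_p: float, r_ap: float, r_cm_metal: float, r_cm_insul: float) -> float:
    """Effective TMR of an MTJ in series with a correlated material (CM).

    Assumed read behaviour (taken from the abstract): the CM is metallic when
    the MTJ is in its low-resistance state and insulating when the MTJ is in
    its high-resistance state.
    """
    r_low = r_p + r_cm_metal
    r_high = r_ap + r_cm_insul
    return tmr(r_low, r_high)


# Hypothetical values in ohms; only the 107% bare-MTJ TMR comes from the abstract.
R_P = 3000.0
R_AP = R_P * (1 + 1.07)      # gives a bare-MTJ TMR of 107%
R_CM_METAL = 100.0           # assumed metallic-phase CM resistance
R_CM_INSUL = 50000.0         # assumed insulating-phase CM resistance

print(f"bare MTJ TMR: {tmr(R_P, R_AP):.0f}%")
print(f"cell TMR with CM: {cell_tmr(R_P, R_AP, R_CM_METAL, R_CM_INSUL):.0f}%")
```

In this toy model the insulating-phase resistance adds only to the high-resistance state, which is why the ratio grows by roughly an order of magnitude; the exact 1878% figure depends on device parameters that the abstract does not give.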
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273510
Sankalp Jain, Harshad Navale, Ümit Y. Ogras, S. Garg
Heterogeneous multi-core processors, such as the ARM big.LITTLE architecture, are becoming increasingly popular due to power and thermal constraints. In this paper, we address the use of low-power heterogeneous multi-cores as microservers, using web search as a motivating application. In particular, we propose a new family of scheduling policies for heterogeneous microservers that optimize performance metrics such as mean response time and service level agreements while guaranteeing thermally safe operation. Thorough experimental evaluations on a big.LITTLE platform demonstrate that naive performance-oriented scheduling policies quickly result in thermal instability, while the proposed policies not only reduce peak temperature but also achieve a 4.8× reduction in processing time and a 5.6× increase in energy efficiency compared to baseline scheduling policies.
Title : Energy efficient scheduling for web search on heterogeneous microservers
Venue : 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273515
Hayate Okuhara, K. Kitamori, Yu Fujita, K. Usami, H. Amano
Body bias control is an efficient means of balancing the trade-off between leakage power and performance, especially for chips built with silicon on thin buried oxide (SOTB), a type of FD-SOI technology. In this work, a method for finding the optimal combination of supply voltage and body bias voltage for the core and memory is proposed and applied to a real micro-controller chip fabricated in SOTB CMOS technology. By fitting the coefficients of equations for leakage power, switching power and operational frequency to real chip measurements, the optimized voltage setting can be obtained for a target operational frequency. The power lost to optimization error is at most 12.6%, and the method saves up to 73.1% of power compared with the case where only the body bias voltage is optimized. This method can be applied to the latest FD-SOI technologies.
Title : An optimal power supply and body bias voltage for a ultra low power micro-controller with silicon on thin box MOSFET
Venue : 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
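The search for a supply voltage and body bias pair that meets a target frequency at minimum power can be sketched as a small grid search. The model forms and every coefficient below are hypothetical placeholders: the abstract says only that equations for leakage power, switching power and operational frequency are fitted to chip measurements, and does not give their form.

```python
import math
from itertools import product

# Hypothetical fitted coefficients; in the paper these come from chip measurements.
C_SW = 1.0e-9     # effective switched capacitance (F)
I0 = 1.0e-4       # leakage current scale (A)
K_VDD = 2.0       # leakage sensitivity to supply voltage (1/V)
K_VBB = 4.0       # leakage sensitivity to body bias (1/V)
F_SCALE = 100e6   # frequency gained per volt of gate overdrive (Hz/V)
VTH0 = 0.3        # threshold voltage at zero body bias (V)
GAMMA = 0.1       # threshold shift per volt of body bias


def leakage_power(vdd, vbb):
    # Assumed form: forward bias (vbb > 0) raises leakage, reverse bias lowers it.
    return vdd * I0 * math.exp(K_VDD * vdd + K_VBB * vbb)


def switching_power(vdd, freq):
    return C_SW * vdd ** 2 * freq


def max_frequency(vdd, vbb):
    # Assumed form: forward bias lowers the threshold voltage and speeds the circuit up.
    overdrive = vdd - (VTH0 - GAMMA * vbb)
    return F_SCALE * max(overdrive, 0.0)


def best_setting(target_freq, vdds, vbbs):
    """Return (power, vdd, vbb) with the lowest total power that meets target_freq, or None."""
    feasible = [
        (leakage_power(vdd, vbb) + switching_power(vdd, target_freq), vdd, vbb)
        for vdd, vbb in product(vdds, vbbs)
        if max_frequency(vdd, vbb) >= target_freq
    ]
    return min(feasible) if feasible else None


vdds = [0.4 + 0.05 * i for i in range(13)]    # 0.40 V .. 1.00 V
vbbs = [-1.0 + 0.1 * i for i in range(21)]    # -1.0 V .. +1.0 V
print(best_setting(20e6, vdds, vbbs))
```

An exhaustive grid keeps the sketch easy to check. With real fitted models, the same objective and constraint could be handed to a numerical optimizer instead.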
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273518
M. Salehi, Mohammad Khavari Tavana, Semeen Rehman, F. Kriebel, M. Shafique, A. Ejlali, J. Henkel
Many-core processors facilitate coarse-grained reliability by exploiting available cores for redundant multithreading. However, ensuring high reliability with reduced power consumption requires jointly considering variations in the vulnerability, performance and power properties of the software as well as the underlying hardware. In this paper, we propose a power-efficient reliability management system for many-core processors. It exploits various basic redundancy techniques (such as dual and triple modular redundancy) operating at different voltage-frequency levels, each offering distinct reliability, performance and power properties. Our system performs Dynamic Redundancy and Voltage Scaling (DRVS), considering process variations in hardware and diversity in software vulnerability and execution time properties. Experiments show that the DRVS system provides significant reliability improvements while reducing power consumption by up to 60% compared to state-of-the-art techniques.
Title : DRVS: Power-efficient reliability management through Dynamic Redundancy and Voltage Scaling under variations
Venue : 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273521
A. Pathania, Santiago Pagani, M. Shafique, J. Henkel
Gaming on mobile platforms is highly power hungry and rapidly drains the limited-capacity battery. In multi-threaded gaming, each thread has different processing requirements, and even a single slow thread may lead to Quality of Service (QoS) violations. Further, modern mobile platforms are equipped with asymmetric multi-core processors, in which different cores exhibit diverse power and performance properties. These asymmetric cores, along with different Dynamic Power Management (DPM) techniques, enable a high degree of power efficiency in mobile gaming. The default Linux power manager (i.e. “Governor”) for asymmetric multi-cores is power-inefficient for mobile games because it is oblivious to QoS and over-allocates resources to processing threads. The state-of-the-art Governor for mobile gaming does not account for multi-threaded gaming workloads, which are mainstream in mobile gaming. In this work, we present a power-performance characterization of multi-threaded mobile games by executing them on a real-world mobile platform with an asymmetric multi-core processor. This analysis is leveraged to propose a QoS-aware Governor running a lightweight online heuristic that holistically accounts for thread-to-core mapping and DPM. This solution, when integrated into the platform's Operating System (OS), improves power efficiency by 12% on average.
Title : Power management for mobile games on asymmetric multi-cores
Venue : 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273528
Mengbai Xiao, Yao Liu, Lei Guo, Songqing Chen
The display subsystem of a mobile device usually consumes 38%-68% [1] of the total battery power in video streaming. Therefore, a few schemes have been designed to reduce the display power consumption. The basic idea is to dim the backlight while compensating the pixel luminance to maintain image fidelity. The luminance compensation and the calculation of a proper backlight level are computation intensive and require per-frame luminance information. For these reasons, existing schemes only work for video-on-demand, where each frame (and thus its luminance information) is available in advance. In addition, they require additional computing resources. If the computation is instead conducted on the mobile device, the power it consumes can easily offset the power saved by dimming the backlight. In this work, we set out to investigate power saving for real-time video calls on mobile devices. Unlike video-on-demand, real-time video calls are highly delay sensitive, and the frame luminance information is not known in advance. Moreover, video calls often involve multiple streaming sources from multiple (≥2) participants, which makes the problem more difficult. Because there are few background changes and the frame rate is usually low in video calls, we design a Greedy Display Power saving scheme, called LCD-GDP, which uses the commonly available GPU on mobile devices without requiring additional support. Our design is implemented on WebRTC, a popular standard for real-time, browser-based video calls. Experiments show that our scheme can save up to 33% of power consumption in video calls without affecting the video call quality.
Title : Reducing display power consumption for real-time video calls on mobile devices
Venue : 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
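The basic idea of dimming the backlight while compensating pixel luminance can be sketched in a few lines. This is a generic illustration and not the LCD-GDP algorithm, whose details the abstract does not give. It assumes perceived luminance is the product of the backlight scale and the pixel value; the function names and the `clip_fraction` parameter are hypothetical.

```python
def choose_backlight(luma, clip_fraction=0.01, max_level=255):
    """Pick the lowest backlight scale in (0, 1] that lets all but
    `clip_fraction` of the pixels be fully compensated.

    `luma` is a flat list of per-pixel luminance values in 0..max_level.
    """
    if not luma:
        return 1.0
    ordered = sorted(luma)
    n_clip = int(len(ordered) * clip_fraction)   # pixels allowed to saturate
    keep = max(0, len(ordered) - 1 - n_clip)     # brightest pixel to reproduce exactly
    peak = max(ordered[keep], 1)                 # avoid a zero backlight on black frames
    return peak / max_level


def compensate(luma, backlight, max_level=255):
    """Scale pixel luminance up by 1/backlight, saturating at max_level."""
    return [min(max_level, round(v / backlight)) for v in luma]


frame = [10, 20, 40, 80, 100, 120, 128, 128, 60, 30]   # hypothetical luminance samples
b = choose_backlight(frame, clip_fraction=0.0)
boosted = compensate(frame, b)
# Perceived luminance is modelled as backlight scale * pixel value.
perceived = [round(b * v) for v in boosted]
print(b, boosted, perceived)
```

In this sample frame the brightest pixel is about half of full scale, so the backlight can be roughly halved while the modelled perceived luminance stays unchanged. The cost the abstract points to is that both steps need per-frame luminance statistics.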
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273500
Cheng Li, P. Ampadu
Although buffers significantly impact Network-on-Chip (NoC) performance, they also account for up to 75% of NoC router area and nearly 50% of NoC router power. Traditionally, SRAM has been used as an area- and power-efficient implementation of the router buffer. However, motivated by the smaller size and lower-power potential of planar embedded DRAM (eDRAM), we implement the router buffer using a 3T NMOS eDRAM for improved power and area efficiency. We demonstrate that the lifetime of flits stalled in the NoC router buffer is much shorter than the retention time of currently available eDRAM. This observation allows us to make the appropriate trade-off in size and sense-amplifier complexity to meet power and performance requirements. A low-overhead need-based refresh mechanism is further explored. With a conservative buffer design in 65nm CMOS technology, our method reduces buffer area by up to 52% and power by 43%, while maintaining performance similar to that of an SRAM-based buffer. In a NoC router with a 128-bit channel width, we achieve 26% and 11% reductions in total router area and power, respectively. We conclude that an eDRAM-based buffer is a power- and area-efficient alternative to an SRAM-based buffer for NoC router design.
Title : A compact low-power eDRAM-based NoC buffer
Venue : 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
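The observation that stalled flits usually leave the buffer long before eDRAM retention expires suggests refreshing a slot only when its data is about to expire. The following is a minimal behavioural sketch of such a need-based refresh, not the authors' circuit: the class, its parameters and the cycle counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    flit: str
    written_at: int   # cycle of the last write or refresh


class EdramBuffer:
    """FIFO flit buffer that refreshes a slot only when its data is about to expire."""

    def __init__(self, retention_cycles: int, guard_cycles: int = 1):
        # Refresh slightly before the retention limit (guard band is an assumption).
        self.refresh_age = retention_cycles - guard_cycles
        self.slots: list[Slot] = []
        self.refreshes = 0

    def push(self, flit: str, cycle: int) -> None:
        self.slots.append(Slot(flit, cycle))

    def pop(self) -> str:
        return self.slots.pop(0).flit

    def tick(self, cycle: int) -> None:
        """Refresh only the slots whose age has reached the refresh threshold."""
        for slot in self.slots:
            if cycle - slot.written_at >= self.refresh_age:
                slot.written_at = cycle
                self.refreshes += 1


buf = EdramBuffer(retention_cycles=1000, guard_cycles=10)   # hypothetical numbers
buf.push("head", cycle=0)
buf.push("body", cycle=1)
for cycle in range(2, 20):      # short stall: flits leave long before they expire
    buf.tick(cycle)
buf.pop()
buf.pop()
print("refreshes during a short stall:", buf.refreshes)
```

Under these assumed numbers a short stall triggers no refresh at all. Refresh energy is spent only on flits that stay in the buffer for close to the full retention time.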
Modern General-Purpose Graphics Processing Units (GPGPUs) demand a large Register File (RF), which is typically organized into multiple banks to support massive parallelism. Although heavy banking benefits RF throughput, its area and energy costs, together with diminishing performance gains, greatly limit future RF scaling. In this paper, we propose an improved RF design with a bank stealing technique, which enables high RF throughput in a compact area. By closely investigating the GPGPU microarchitecture, we identify bank conflicts as the deficiency in state-of-the-art RF designs, and observe that the majority of conflicts can be eliminated by leveraging the fact that a highly banked RF is often under-utilized. This is especially true in GPGPUs, where multiple ready warps are available at the scheduling stage and their operand accesses can be coordinated. Our lightweight bank stealing technique opportunistically fills idle banks for better operand service, and average GPGPU performance improves under a smaller energy budget with significant area savings, which makes the technique promising for sustainable RF scaling.
Title : Bank stealing for conflict mitigation in GPGPU Register File
Authors : Naifeng Jing, Shuang Chen, Shunning Jiang, Li Jiang, Chao Li, Xiaoyao Liang
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273490
Venue : 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
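The bank conflict problem and the accompanying under-utilization can be made concrete with a toy model. The sketch below is an illustration only and does not implement bank stealing itself. It assumes registers are interleaved across banks by index and that each bank serves one operand per cycle, neither of which is stated in the abstract.

```python
from collections import Counter


def bank_of(register_id: int, num_banks: int) -> int:
    # Assumed mapping: registers are interleaved across banks by index.
    return register_id % num_banks


def access_cycles(operands: list[int], num_banks: int) -> tuple[int, int]:
    """Return (cycles needed to read all operands, banks the instruction never touches).

    Each bank is assumed to serve one operand per cycle, so operands that map
    to the same bank are serialized.
    """
    per_bank = Counter(bank_of(r, num_banks) for r in operands)
    cycles = max(per_bank.values(), default=0)
    idle = num_banks - len(per_bank)
    return cycles, idle


# Hypothetical instruction reading r0, r4 and r8 from a 4-bank register file.
print(access_cycles([0, 4, 8], num_banks=4))
```

Here all three operands fall into the same bank, so the reads are serialized while the other three banks sit idle. Idle capacity of this kind is what a technique like bank stealing aims to use for other ready warps.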
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273526
Donghwa Shin, N. Chang, Yanzhi Wang, Massoud Pedram
Photovoltaic (PV) power generation systems are usually accompanied by a battery to bridge the gap between generation and load demand. Solar tracking is also used to enhance power stability and increase the amount of energy collected from the Sun. However, batteries and tracking devices significantly increase the system cost, and they are subject to wear and tear, which makes maintenance-free installation challenging. In this work, we optimize the design of a twofold three-dimensional PV panel for solar-powered systems. With the proposed three-dimensional arrangement, we extend the solar-powered time of a target application that is powered only by solar power. Experimental results show that the proposed architecture and control method extend the service time of the target system by up to 23% compared to a non-reconfigurable flat panel with the same PV panel area.
Title : Reconfigurable three dimensional photovoltaic panel architecture for solar-powered time extension
Venue : 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
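Why a folded panel can lengthen the solar-powered time for the same panel area can be seen from the cosine law of incidence. The sketch below is a simplified two-dimensional model chosen for illustration, not the authors' architecture or control method: it considers direct light only, ignores self-shading between the two halves, and uses a hypothetical 40-degree tilt.

```python
import math


def flat_power(sun_deg: float) -> float:
    """Relative output of a flat, upward-facing panel of unit area.

    `sun_deg` is the sun's position along its path: 0 at one horizon,
    90 directly overhead, 180 at the opposite horizon.
    """
    return max(0.0, math.sin(math.radians(sun_deg)))


def folded_power(sun_deg: float, tilt_deg: float) -> float:
    """Relative output of two half-area panels tilted by +/- tilt_deg from horizontal."""
    halves = (90.0 - tilt_deg, 90.0 + tilt_deg)   # directions of the two panel normals
    return sum(
        0.5 * max(0.0, math.cos(math.radians(sun_deg - normal)))
        for normal in halves
    )


for sun in (20, 90):   # low sun vs. sun directly overhead
    print(sun, round(flat_power(sun), 3), round(folded_power(sun, 40.0), 3))
```

In this model the folded arrangement gives up some output when the sun is overhead and gains output when the sun is low. That flattening of the daily power profile is the kind of effect that could extend the time a load runs on solar power alone.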