Pub Date: 2018-10-01; DOI: 10.1109/DFT.2018.8602983
Elisabeth Baseman, Nathan Debardeleben, S. Blanchard, Juston S. Moore, O. Tkachenko, Kurt B. Ferreira, Taniya Siddiqua, Vilas Sridharan
As high-performance computing facilities approach the exascale era, a detailed understanding of hardware failures becomes important. In particular, the extreme memory capacity of modern supercomputers means that data corruption errors that were statistically negligible at smaller scales will become more prevalent. To understand hardware faults and mitigate their adverse effects on exascale workloads, we must learn from the behavior of current hardware. In this work, we investigate the predictability of DRAM errors using field data from two recently decommissioned supercomputers: Cielo, at Los Alamos National Laboratory, and Hopper, at Lawrence Berkeley National Laboratory. Because of the volume and complexity of the field data, we apply statistical machine learning to predict the probability of DRAM errors at previously unaccessed locations. We compare the predictive performance of six machine learning algorithms and find that a model incorporating physical knowledge of DRAM spatial structure outperforms purely statistical methods. Our findings both support the expected physical behavior of DRAM hardware and provide a mechanism for real-time error prediction. We demonstrate real-world feasibility by training an error model on one supercomputer and effectively predicting errors on another. Our methods demonstrate the importance of spatial locality over temporal locality in DRAM errors, and show that relatively simple statistical models are effective at predicting future errors from historical data, allowing proactive error mitigation.
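The idea of using DRAM spatial structure to score locations that have not yet reported an error can be illustrated with a small sketch. This is not the model from the paper: the `Location` fields, the `SpatialErrorModel` class, the weights, and the bias are all hypothetical choices made for illustration, and the weights are hand-picked rather than fitted to field data.

```python
from collections import Counter
from math import exp
from typing import NamedTuple


class Location(NamedTuple):
    """Hypothetical DRAM address fields; real error logs may differ."""
    node: int
    dimm: int
    bank: int
    row: int
    column: int


class SpatialErrorModel:
    """Counts past errors at each level of an assumed DRAM hierarchy."""

    # Illustrative hand-picked weights and bias, not fitted to any data.
    WEIGHTS = {"dimm": 0.5, "bank": 1.0, "row": 2.0, "column": 2.0}
    BIAS = -4.0

    def __init__(self) -> None:
        self.counts = {level: Counter() for level in self.WEIGHTS}

    @staticmethod
    def _keys(loc: Location) -> dict:
        """Build one lookup key per hierarchy level for a location."""
        dimm = (loc.node, loc.dimm)
        bank = dimm + (loc.bank,)
        return {
            "dimm": dimm,
            "bank": bank,
            "row": bank + (loc.row,),
            "column": bank + (loc.column,),
        }

    def observe(self, loc: Location) -> None:
        """Record one historical error."""
        for level, key in self._keys(loc).items():
            self.counts[level][key] += 1

    def predict(self, loc: Location) -> float:
        """Return a score in (0, 1) based on errors seen in shared structures."""
        z = self.BIAS
        for level, key in self._keys(loc).items():
            # Cap each count so one noisy structure cannot dominate the score.
            z += self.WEIGHTS[level] * min(self.counts[level][key], 3)
        return 1.0 / (1.0 + exp(-z))


model = SpatialErrorModel()
for col in (10, 11, 12):
    model.observe(Location(node=0, dimm=1, bank=2, row=7, column=col))

# Same row as the recorded errors, but a column never seen before.
same_row = model.predict(Location(0, 1, 2, 7, 99))
# Same coordinates on a different DIMM with no recorded errors.
other_dimm = model.predict(Location(0, 3, 2, 7, 99))
print(f"same row: {same_row:.3f}, other DIMM: {other_dimm:.3f}")
```

The sketch uses no timestamps at all, which mirrors the abstract's point that spatial locality matters more than temporal locality. In practice the weights would be learned from historical error logs.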
Title: Physics-Informed Machine Learning for DRAM Error Modeling
Published in: 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)
Pub Date: 2018-10-01; DOI: 10.1109/DFT.2018.8602958
G. Furano, A. Tavoularis, Lucana Santos, V. Ferlet-Cavrois, C. Boatella, R. G. Alía, P. Fernandez-Martínez, M. Kastriotou, V. Wyrwoll, S. Danzeca, M. Tali, Dejan Gacnik, I. Kramberger, L. Juul, Konstantinos Maragos, G. Lentaris
The use of System-on-Chip (SoC) solutions in the design of space-borne data handling systems is an important step towards further miniaturization in space. In CubeSats and in many aggressive commercial missions, the use of Commercial-Off-The-Shelf (COTS) components is becoming the rule rather than the exception, and many of those components are complex SoCs, multiprocessor SoCs (MPSoCs), Systems in Package (SiPs), or Analog/Mixed-Signal SoCs (AMS-SoCs). These changes are triggering attempts to modify the way we approach radiation tolerance and conduct radiation testing of electronics. Among the changes that affect Single Event Effect (SEE) testing are the scaling of geometries and supply voltages, new materials, new packaging technologies, and the challenges of overall speed and device complexity. Within the framework of the ESA-CERN cooperation agreement, certain ESA projects had access to the most intense beam of ultra-high energy heavy ions available at the Super Proton Synchrotron (SPS) particle accelerator. This paper presents the challenges and advantages of SEE testing of complex electronic devices in this new environment and the relevance of such testing for future space missions.
Title: FPGA SEE Test with Ultra-High Energy Heavy Ions
Published in: 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)
Pub Date: 2018-10-01; DOI: 10.1109/DFT.2018.8602867
G. Chapman, Rohan Thomas, Klinsmann J. Coelho Silva Meneses, I. Koren, Z. Koren
Digital imagers provide advantages over ICs for studying soft errors (single event upsets, SEUs): when a cosmic ray particle hits a pixel, the pixel stores the deposited charge for later readout, providing both the occurrence rate of events per unit time and area and the spatial distribution of the charge spread. SEUs are detected within an imager by taking a time sequence of long-exposure dark-field images and identifying events that occur in only one image and then disappear. For pixels in the $4$ to $7\ \mu\mathrm{m}$ range (high-end DSLRs), the native noise level is low enough to allow simple detection of SEUs. However, as pixels shrink to the $1\ \mu\mathrm{m}$ range (cell phone pixels), they become more sensitive to deposited charge (i.e., to weaker SEUs), but the background noise rises substantially, making it difficult to distinguish SEUs from noise. Noise in these imagers has a pattern that depends on the pixel's location on the imager. We developed statistical methods that use near-neighbor pixels to determine the local noise distribution characteristics and distinguish SEU events from the noise. We observed that the number of SEU events per unit area is substantially higher for $1.3\ \mu\mathrm{m}$ pixels than for bigger pixels, yet SEUs are still confined to a single pixel, indicating that the charge spread is well under $1\ \mu\mathrm{m}$. We also present a statistical analysis of the charge distribution and SEU events and their dependence on pixel size.
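The detection procedure described above (flag a pixel that stands out from its local noise in exactly one frame of a dark-field sequence) can be sketched as follows. This is an illustrative simplification, not the authors' statistical method: the function names, the 8-neighbor window, the threshold `k`, the noise `floor`, and the synthetic frames are all assumptions.

```python
from statistics import mean, pstdev


def neighbours(frame, r, c):
    """Values of the up-to-8 pixels surrounding (r, c)."""
    rows, cols = len(frame), len(frame[0])
    return [
        frame[i][j]
        for i in range(max(r - 1, 0), min(r + 2, rows))
        for j in range(max(c - 1, 0), min(c + 2, cols))
        if (i, j) != (r, c)
    ]


def is_outlier(frame, r, c, k=5.0, floor=1.0):
    """True if the pixel exceeds its neighbourhood mean by k local sigmas."""
    local = neighbours(frame, r, c)
    sigma = max(pstdev(local), floor)  # floor avoids a zero-width threshold
    return frame[r][c] > mean(local) + k * sigma


def find_transients(frames, k=5.0):
    """Return (frame_index, row, col) for outliers present in exactly one frame."""
    events = []
    for r in range(len(frames[0])):
        for c in range(len(frames[0][0])):
            hits = [t for t, f in enumerate(frames) if is_outlier(f, r, c, k)]
            if len(hits) == 1:
                events.append((hits[0], r, c))
    return events


# Three synthetic 4x4 dark frames with a mild position-dependent background.
frames = [[[10 + (r + c) % 2 for c in range(4)] for r in range(4)] for _ in range(3)]
frames[1][2][1] = 60  # bright in one frame only: a candidate SEU
for f in frames:
    f[0][3] = 60      # bright in every frame: a hot pixel, not a transient

events = find_transients(frames)
print(events)
```

Estimating the threshold from each pixel's own neighborhood, rather than from a global noise figure, is what lets the check adapt to noise that varies with position on the sensor. Requiring the outlier to appear in exactly one frame separates transient events from permanently hot pixels.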
Title: Analysis of Single Event Upsets Based on Digital Cameras with Very Small Pixels
Published in: 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)