
2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT): Latest Papers

Physics-Informed Machine Learning for DRAM Error Modeling
Elisabeth Baseman, Nathan Debardeleben, S. Blanchard, Juston S. Moore, O. Tkachenko, Kurt B. Ferreira, Taniya Siddiqua, Vilas Sridharan
As high performance computing facilities approach the exascale era, a detailed understanding of hardware failures becomes important. In particular, the extreme memory capacity of modern supercomputers means that data corruption errors that were statistically negligible at smaller scales will become more prevalent. To understand hardware faults and mitigate their adverse effects on exascale workloads, we must learn from the behavior of current hardware. In this work, we investigate the predictability of DRAM errors using field data from two recently decommissioned supercomputers: Cielo, at Los Alamos National Laboratory, and Hopper, at Lawrence Berkeley National Laboratory. Because of the volume and complexity of the field data, we apply statistical machine learning to predict the probability of DRAM errors at previously unaccessed locations. We compare the predictive performance of six machine learning algorithms and find that a model incorporating physical knowledge of DRAM spatial structure outperforms purely statistical methods. Our findings both support the expected physical behavior of DRAM hardware and provide a mechanism for real-time error prediction. We demonstrate real-world feasibility by training an error model on one supercomputer and effectively predicting errors on another. Our methods demonstrate the importance of spatial locality over temporal locality in DRAM errors, and show that relatively simple statistical models are effective at predicting future errors from historical data, allowing proactive error mitigation.
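The abstract does not spell out the model, so the following is only a minimal sketch of what "incorporating physical knowledge of DRAM spatial structure" could look like. It assumes that error records carry DIMM, bank, row, and column coordinates, and it scores an unseen location by how many past errors share its bank, row, or column. The `Loc` type, the feature choice, and the weights and bias are all hypothetical and are not fitted to any data; this is not the authors' method.

```python
import math
from typing import List, NamedTuple, Sequence


class Loc(NamedTuple):
    """Hypothetical DRAM coordinate of one logged error."""
    dimm: int
    bank: int
    row: int
    col: int


def spatial_features(history: Sequence[Loc], target: Loc) -> List[int]:
    """Count past errors that share DRAM structure with the target location."""
    same_bank = sum(
        1 for h in history
        if (h.dimm, h.bank) == (target.dimm, target.bank)
    )
    same_row = sum(
        1 for h in history
        if (h.dimm, h.bank, h.row) == (target.dimm, target.bank, target.row)
    )
    same_col = sum(
        1 for h in history
        if (h.dimm, h.bank, h.col) == (target.dimm, target.bank, target.col)
    )
    return [same_bank, same_row, same_col]


def error_probability(
    history: Sequence[Loc],
    target: Loc,
    weights: Sequence[float] = (0.2, 1.0, 1.0),  # illustrative, not fitted
    bias: float = -3.0,                          # illustrative, not fitted
) -> float:
    """Logistic score over the spatial features (a deliberately simple model)."""
    feats = spatial_features(history, target)
    z = bias + sum(w * x for w, x in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))


# Two past errors in the same row of one bank.
history = [Loc(0, 1, 10, 5), Loc(0, 1, 10, 7)]
near = Loc(0, 1, 10, 9)   # same row as the past errors, never seen before
far = Loc(3, 0, 99, 1)    # different DIMM entirely

p_near = error_probability(history, near)
p_far = error_probability(history, far)
```

With these made-up weights the location that shares a row with earlier errors receives a higher score than the unrelated one, which is the spatial-locality intuition the abstract describes. In practice the weights would be learned from field data.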
DOI: https://doi.org/10.1109/DFT.2018.8602983 (published 2018-10-01)
Citations: 11
FPGA SEE Test with Ultra-High Energy Heavy Ions
G. Furano, A. Tavoularis, Lucana Santos, V. Ferlet-Cavrois, C. Boatella, R. G. Alía, P. Fernandez-Martínez, M. Kastriotou, V. Wyrwoll, S. Danzeca, M. Tali, Dejan Gacnik, I. Kramberger, L. Juul, Konstantinos Maragos, G. Lentaris
The use of System-on-Chip (SoC) solutions in the design of space-borne data handling systems is an important step towards further miniaturization in space. In CubeSats and in many aggressive commercial missions, the use of Commercial-Off-The-Shelf (COTS) components is becoming the rule rather than the exception, and many of those components are complex SoC, multiprocessor system-on-chip (MPSoC), System-in-Package (SiP), or Analog/Mixed-Signal SoC (AMS-SoC) devices. These changes are triggering attempts to modify the way we approach and conduct radiation tolerance and testing of electronics. Among the changes that affect Single Event Effect (SEE) testing are the scaling of geometries and supply voltages, new materials, new packaging technologies, and the challenges of overall speed and device complexity. In the frame of the ESA-CERN cooperation agreement, certain ESA projects had access to the most intense beam of ultra-high energy heavy ions available at the Super Proton Synchrotron (SPS) particle accelerator. This paper presents the challenges and advantages of SEE tests of complex electronic devices in this new environment and their relevance for future space missions.
DOI: https://doi.org/10.1109/DFT.2018.8602958 (published 2018-10-01)
Citations: 5
Analysis of Single Event Upsets Based on Digital Cameras with Very Small Pixels
G. Chapman, Rohan Thomas, Klinsmann J. Coelho Silva Meneses, I. Koren, Z. Koren
Digital imagers provide advantages over ICs when studying soft errors (SEUs): when a cosmic ray particle hits a pixel, the pixel stores the deposited charge for later readout, providing both the time/area occurrence rate of SEUs and the area distribution of the charge spread. SEUs are detected within an imager by taking a time sequence of long-exposure dark-field images and identifying events that occur in only one image and then disappear. For pixels in the $4$-$7\,\mu\mathrm{m}$ range (high-end DSLRs), the native noise level is low enough to allow simple detection of SEUs. However, as pixels shrink to the $1\,\mu\mathrm{m}$ range (cell phone pixels), they become more sensitive to deposited charges (i.e., weaker SEUs), but the background noise rises substantially, making it difficult to distinguish between SEUs and noise. Noise in these imagers has a pattern that depends on the pixel's location on the imager. We developed statistical methods that use near-neighbor pixels to determine the local noise distribution characteristics and distinguish SEU events from the noise. We observed that the number of SEU events per unit area is substantially higher for $1.3\,\mu\mathrm{m}$ pixels than for bigger pixels, yet SEUs are still confined to a single pixel, indicating that the charge spread is well under $1\,\mu\mathrm{m}$. We also present a statistical analysis of the charge distribution and SEU events and their dependence on the pixel size.
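The detection idea in this abstract (a pixel that is bright in exactly one dark-field frame, judged against the noise of its near neighbors) can be sketched roughly as follows. The abstract does not name the statistic used, so the neighbor median plus a multiple of the median absolute deviation (MAD) is an assumption here, as are the function names, the 3x3 neighborhood, the factor `k`, and the MAD floor of 1.0. Frames are plain lists of lists of pixel values.

```python
from statistics import median


def neighbor_values(frame, r, c):
    """Values of the (up to 8) nearest neighbors of pixel (r, c)."""
    rows, cols = len(frame), len(frame[0])
    return [
        frame[i][j]
        for i in range(max(0, r - 1), min(rows, r + 2))
        for j in range(max(0, c - 1), min(cols, c + 2))
        if (i, j) != (r, c)
    ]


def local_threshold(frame, r, c, k=6.0):
    """Assumed local noise threshold: neighbor median + k * MAD."""
    vals = neighbor_values(frame, r, c)
    med = median(vals)
    mad = median([abs(v - med) for v in vals])
    # Floor the MAD so a perfectly flat neighborhood still has some margin.
    return med + k * max(mad, 1.0)


def detect_transients(frames, k=6.0):
    """Return (frame_index, row, col) for pixels that exceed their local
    threshold in one frame but not in the frames before and after it."""
    events = []
    for t in range(1, len(frames) - 1):
        prev, cur, nxt = frames[t - 1], frames[t], frames[t + 1]
        for r in range(len(cur)):
            for c in range(len(cur[0])):
                if (
                    cur[r][c] > local_threshold(cur, r, c, k)
                    and prev[r][c] <= local_threshold(prev, r, c, k)
                    and nxt[r][c] <= local_threshold(nxt, r, c, k)
                ):
                    events.append((t, r, c))
    return events


def flat_frame(size=5, level=10):
    """Synthetic dark frame with a uniform background level."""
    return [[level] * size for _ in range(size)]


frames = [flat_frame(), flat_frame(), flat_frame()]
for f in frames:
    f[0][0] = 200          # permanent hot pixel: present in every frame
frames[1][2][2] = 200      # transient event: present in the middle frame only

events = detect_transients(frames)
```

In this synthetic example the permanent hot pixel is rejected because it persists across frames, while the one-frame event is reported. The first and last frames are never reported here because they lack a neighbor frame on one side.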
DOI: https://doi.org/10.1109/DFT.2018.8602867 (published 2018-10-01)
Citations: 2