
2012 IEEE Conference on High Performance Extreme Computing: Latest Publications

Exploiting SPM-aware Scheduling on EPIC architectures for high-performance real-time systems
Pub Date : 2012-09-01 DOI: 10.1109/HPEC.2012.6408658
Yu Liu, Wei Zhang
In contemporary computer architectures, the Explicitly Parallel Instruction Computing (EPIC) paradigm permits microprocessors to implement Instruction Level Parallelism (ILP) through the compiler, rather than through the complex on-die circuitry that superscalar architectures use to control parallel instruction execution. Building on EPIC, this paper proposes a time-predictable two-level scratchpad-based memory architecture, and a scratchpad-aware scheduling method that improves performance by optimizing the load-to-use distance.
Citations: 0
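The abstract above centers on increasing the load-to-use distance, i.e., the number of instruction slots between a load and its first consumer, so that scratchpad or memory latency is hidden. A minimal sketch of that idea (hypothetical instruction names and a toy dependence map, not the paper's scheduler):

```python
# Toy illustration of load-to-use distance scheduling: hoist a load as
# early as its data dependences allow, increasing the gap to its use.
def load_to_use_distance(schedule, load, use):
    """Number of instruction slots between a load and its first use."""
    return schedule.index(use) - schedule.index(load) - 1

def hoist_load(schedule, load, deps):
    """Move `load` to the earliest slot permitted by `deps` (its producers)."""
    s = [i for i in schedule if i != load]
    # earliest legal slot: just after the last producer the load depends on
    earliest = max((s.index(d) + 1 for d in deps.get(load, ()) if d in s),
                   default=0)
    s.insert(earliest, load)
    return s

sched = ["addr", "other1", "load_x", "use_x", "other2"]
deps = {"load_x": ["addr"]}       # load_x needs the address computed first
better = hoist_load(sched, "load_x", deps)
# load_x moves to right after "addr", widening its distance to "use_x"
```

In the rescheduled order, independent work (`other1`) now overlaps the load's latency, which is the effect the paper's scheduling method optimizes for.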
An application of the constraint programming to the design and operation of synthetic aperture radars
Pub Date : 2012-09-01 DOI: 10.1109/HPEC.2012.6408663
M. Holzrichter
The design and operation of synthetic aperture radars require compatible sets of hundreds of quantities. Compatibility is achieved when these quantities satisfy constraints arising from physics, geometry, and so on. In the aggregate, these quantities and constraints form a logical model of the radar. In practice, the logical model is distributed over multiple people, documents, and software modules, thereby becoming fragmented. Fragmentation gives rise to inconsistencies and errors. The SAR Inference Engine addresses the fragmentation problem by implementing the logical model of a Sandia synthetic aperture radar in a form intended to be usable from system design to mission planning to actual operation of the radar. These diverse contexts require extreme flexibility, which is achieved by employing the constraint programming paradigm.
Citations: 0
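To make the "quantities linked by constraints" idea concrete, here is a toy bidirectional constraint in that spirit, tying pulse repetition frequency (PRF) to unambiguous range via the standard relation R_u = c / (2 · PRF). This is only an illustration of constraint propagation over radar quantities, not the actual SAR Inference Engine:

```python
# Toy constraint propagation: given either PRF or unambiguous range,
# derive the other; if both are given, check consistency.
C = 299_792_458.0  # speed of light, m/s

def propagate(q):
    """q maps quantity names to values; returns q with the missing one filled."""
    q = dict(q)
    if "prf" in q and "r_unamb" not in q:
        q["r_unamb"] = C / (2.0 * q["prf"])
    elif "r_unamb" in q and "prf" not in q:
        q["prf"] = C / (2.0 * q["r_unamb"])
    elif "prf" in q and "r_unamb" in q:
        if abs(q["r_unamb"] - C / (2.0 * q["prf"])) > 1e-6 * q["r_unamb"]:
            raise ValueError("inconsistent PRF / unambiguous range")
    return q

q = propagate({"prf": 1500.0})  # derive unambiguous range from PRF
```

A full system composes hundreds of such constraints and propagates in whatever direction the known quantities permit, which is what makes the paradigm flexible across design, planning, and operation.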
Scalable cryptographic authentication for high performance computing
Pub Date : 2012-09-01 DOI: 10.1109/HPEC.2012.6408671
Andrew Prout, W. Arcand, David Bestor, C. Byun, Bill Bergeron, M. Hubbell, J. Kepner, P. Michaleas, J. Mullen, A. Reuther, Antonio Rosa
High performance computing (HPC) uses supercomputers and computing clusters to solve large computational problems. Frequently, HPC resources are shared systems, and access to restricted data sets or resources must be authenticated. These authentication needs can take multiple forms, both internal and external to the HPC cluster. A computational stack that uses web services among nodes in the HPC system may need to perform authentication between nodes of the same job, or a job may need to reach out to data sources outside the HPC system. Traditional authentication mechanisms such as passwords or digital certificates encounter issues with the distributed and potentially disconnected nature of HPC systems. Distributing and storing plain-text passwords or cryptographic keys among nodes in an HPC system without special protection is a poor security practice. Systems that reach back to the user's terminal for access to the authenticator are possible, but only in fully interactive supercomputing, where connectivity to the user's terminal can be guaranteed. Point solutions can be enabled for these use cases, such as software-based roles or self-signed certificates; however, they require significant expertise in digital certificates to configure. A more general solution is called for that is both secure and easy to use. This paper presents an overview of a solution implemented on the interactive, on-demand LLGrid computing system [1,2,3] at MIT Lincoln Laboratory and its use to solve one such authentication problem.
Citations: 4
Ruggedization of MXM graphics modules
Pub Date : 2012-09-01 DOI: 10.1109/HPEC.2012.6408666
I. Straznicky
MXM modules, used to package graphics processing devices for benign environments, have been tested for use in the harsh environments typical of deployed defense and aerospace systems. Results show that specially mechanically designed MXM GP-GPU modules can survive these environments and successfully bring the enormous processing capability of the latest generation of GPUs to harsh-environment applications.
Citations: 2
HPC-VMs: Virtual machines in high performance computing systems
Pub Date : 2012-09-01 DOI: 10.1109/HPEC.2012.6408668
A. Reuther, P. Michaleas, Andrew Prout, J. Kepner
The concept of virtual machines dates back to the 1960s. Both IBM and MIT developed operating system features that enabled user and peripheral time sharing, the underpinnings of which were early virtual machines. Modern virtual machines present a translation layer of system devices between a guest operating system and the host operating system executing on a computer system, while isolating the guest operating systems from each other. In the past several years, enterprise computing has embraced virtual machines to deploy a wide variety of capabilities, from business management systems to email server farms. Those who have adopted virtual deployment environments have capitalized on a variety of advantages, including server consolidation, service migration, and higher service reliability. But they have also encountered challenges, including a sacrifice in performance and more complex system management. Some of these advantages and challenges also apply to HPC in virtualized environments. In this paper, we analyze the effectiveness of using virtual machines in a high performance computing (HPC) environment. We propose adding virtual machine capability to already robust HPC environments for specific scenarios where the productivity gained outweighs the performance lost by using virtual machines. Finally, we discuss an implementation that augments the software stack of an HPC cluster with virtual machines, and we analyze the effect of this implementation on job launch time.
Citations: 24
Use of CUDA for the Continuous Space Language Model
Pub Date : 2012-09-01 DOI: 10.1109/HPEC.2012.6408661
E. Thompson, Timothy R. Anderson
The training phase of the Continuous Space Language Model (CSLM) was implemented in CUDA, NVIDIA's hardware/software Compute Unified Device Architecture. The implementation was accomplished using a combination of CUBLAS library routines and CUDA kernel calls on three different CUDA-enabled devices of varying compute capability, and a time savings over the traditional CPU approach was demonstrated.
Citations: 1
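The heavy kernel that CUBLAS accelerates in neural-network training of this kind is the dense GEMM, C = alpha·A·B + beta·C (what `cublasSgemm` computes). A plain-Python sketch of that operation's semantics, purely for reference and not the paper's CUDA code:

```python
# Reference semantics of the GEMM primitive: out = alpha * (A @ B) + beta * C,
# with A (m x k), B (k x n), C (m x n) as nested lists.
def gemm(alpha, A, B, beta, C):
    m, k, n = len(A), len(B), len(B[0])
    return [[beta * C[i][j] + alpha * sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(n)] for i in range(m)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
# gemm(1.0, A, B, 0.0, C) == [[19.0, 22.0], [43.0, 50.0]]
```

On a GPU, this triple loop becomes a single highly tuned library call, which is why mapping the CSLM training phase onto CUBLAS pays off.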
Multithreaded FPGA acceleration of DNA sequence mapping
Pub Date : 2012-09-01 DOI: 10.1109/HPEC.2012.6408669
Edward Fernandez, W. Najjar, S. Lonardi, J. Villarreal
In bioinformatics, short read alignment is a computationally intensive operation that involves matching millions of short strings (called reads) against a reference genome. At the time of writing, a representative run requires matching tens of millions of reads, each about 100 symbols long, against a genome that can consist of a few billion characters. Existing short read aligners are expected to report all the occurrences of each read as well as allow users to control the number of allowed mismatches between reads and the reference genome. Popular software implementations such as Bowtie [8] or BWA [10] can take many hours or days to execute, making the problem an ideal candidate for hardware acceleration. In this paper, we describe FHAST (FPGA Hardware Accelerated Sequencing-matching Tool), a hardware accelerator that acts as a drop-in replacement for short read alignment software. Our architecture masks memory latency by executing many concurrent hardware threads accessing memory simultaneously, and consists of multiple parallel engines to exploit the parallelism available to us on an FPGA. We have implemented and tested FHAST on the Convey HC-1 [9], taking advantage of the large memory bandwidth available to the system and the shared memory image between hardware and software. By comparing the performance of FHAST against Bowtie on the Convey HC-1, we observed up to ~70X improvement in total end-to-end execution time, reducing runs that take several hours to a few minutes. We also favorably compare the rate of growth when expanding FHAST to multiple FPGAs against multiple CPUs running Bowtie.
Citations: 27
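The problem specification in the abstract (report every occurrence of each read, with a user-controlled mismatch budget) can be stated as a brute-force baseline in a few lines. This is only the specification that FHAST and indexed aligners like Bowtie accelerate, not their algorithms:

```python
# Brute-force short read alignment: scan every position of the reference
# and report (position, mismatch_count) wherever mismatches <= k.
def align(reference, read, k):
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= k:
            hits.append((pos, mismatches))
    return hits

ref = "ACGTACGTTA"
# exact occurrences of "ACGT" sit at positions 0 and 4 of this toy reference
```

The brute-force scan is O(genome × read) per read, which is why matching tens of millions of 100-symbol reads against billions of characters motivates both indexed software and the paper's multithreaded FPGA design.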
Efficient and scalable computations with sparse tensors
Pub Date : 2012-09-01 DOI: 10.1109/HPEC.2012.6408676
M. Baskaran, Benoît Meister, Nicolas Vasilache, R. Lethin
For applications that deal with large amounts of high dimensional multi-aspect data, it becomes natural to represent such data as tensors or multi-way arrays. Multi-linear algebraic computations such as tensor decompositions are performed for summarization and analysis of such data. Their use in real-world applications can span across domains such as signal processing, data mining, computer vision, and graph analysis. The major challenges with applying tensor decompositions in real-world applications are (1) dealing with large-scale high dimensional data and (2) dealing with sparse data. In this paper, we address these challenges in applying tensor decompositions in real data analytic applications. We describe new sparse tensor storage formats that provide storage benefits and are flexible and efficient for performing tensor computations. Further, we propose an optimization that improves data reuse and reduces redundant or unnecessary computations in tensor decomposition algorithms. Furthermore, we couple our data reuse optimization and the benefits of our sparse tensor storage formats to provide a memory-efficient scalable solution for handling large-scale sparse tensor computations. We demonstrate improved performance and address memory scalability using our techniques on both synthetic small data sets and large-scale sparse real data sets.
Citations: 63
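As a baseline for what "sparse tensor storage plus tensor computation" means, here is the simplest coordinate (COO) layout and a mode-n tensor-times-vector contraction, a building block of decompositions such as CP. The paper's formats are more elaborate than plain COO; this only fixes ideas:

```python
# COO sparse tensor: parallel lists of index tuples and values.
# ttv contracts the tensor with `vec` along dimension `mode`, summing
# contributions into the reduced index tuples.
from collections import defaultdict

def ttv(coords, vals, vec, mode):
    out = defaultdict(float)
    for idx, v in zip(coords, vals):
        reduced = idx[:mode] + idx[mode + 1:]   # drop the contracted mode
        out[reduced] += v * vec[idx[mode]]
    return dict(out)

# 2x2x2 tensor with two nonzeros: X[0,1,0] = 2, X[1,1,1] = 3
coords = [(0, 1, 0), (1, 1, 1)]
vals = [2.0, 3.0]
# contracting mode 1 with vec = [1, 10] scales both nonzeros by vec[1] = 10
```

Because the loop touches only the stored nonzeros, work scales with nnz rather than the dense tensor size; the data-reuse optimization in the paper goes further by sharing partial results across the repeated contractions inside a decomposition.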
Optimized parallel distribution load flow solver on commodity multi-core CPU
Pub Date : 2012-09-01 DOI: 10.1109/HPEC.2012.6408675
Tao Cui, F. Franchetti
Solving a large number of load flow problems quickly is required for Monte Carlo analysis and various power system problems, including long-term steady-state simulation and system benchmarking, among others. Due to the computational burden, such applications are considered time-consuming and infeasible for online or real-time use. In this work we developed a high performance framework for high-throughput distribution load flow computation, taking advantage of the performance-enhancing features of multi-core CPUs and various code optimization techniques. We optimized data structures to better fit the memory hierarchy. We use the SPIRAL code generator to exploit inherent patterns of the load flow model through code specialization. We use SIMD instructions and multithreading to parallelize our solver. Finally, we designed a Monte Carlo thread scheduling infrastructure to enable real-time operation. The optimized solver is able to achieve more than 50% of peak performance on an Intel Core i7 CPU, which translates to solving millions of load flow problems within a second for the IEEE 37 test feeder.
Citations: 7
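The kernel being batched in such Monte Carlo studies is a small load flow solve repeated over many sampled load values. A toy version, assuming a made-up two-bus feeder solved by fixed-point (sweep) iteration; the paper's solver handles full distribution feeders such as the IEEE 37 test feeder:

```python
# Two-bus load flow by fixed-point iteration: a source at voltage v_src
# feeds a load (p + jq) through line impedance z. Iterate I = conj(S)/conj(V),
# V = v_src - z*I until V settles. All per-unit values are illustrative.
import random

def two_bus_flow(v_src, z, p, q, iters=50):
    v = v_src  # flat start at the source voltage
    for _ in range(iters):
        i = complex(p, -q) / v.conjugate()  # load current from S = p + jq
        v = v_src - z * i                   # receiving-end voltage update
    return v

random.seed(0)
# Monte Carlo over sampled real-power demand; each sample is one solve
samples = [two_bus_flow(1.0 + 0j, 0.01 + 0.02j, random.uniform(0.1, 0.5), 0.1)
           for _ in range(1000)]
```

Each sample is independent, which is exactly the structure the paper exploits with SIMD lanes and threads to push millions of solves per second.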
CUDA and OpenCL implementations of 3D CT reconstruction for biomedical imaging
Pub Date : 2012-09-01 DOI: 10.1109/HPEC.2012.6408674
Saoni Mukherjee, Nicholas Moore, J. Brock, M. Leeser
Biomedical image reconstruction applications with large datasets can benefit from acceleration. Graphics Processing Units (GPUs) are particularly useful in this context, as they can produce high fidelity images rapidly. An image reconstruction algorithm for cone-beam computed tomography (CT) using two-dimensional projections is implemented on GPUs. The implementation takes slices of the target, weights the projection data, and then filters the weighted data to backproject it and create the final three-dimensional reconstruction. This is implemented on two types of hardware: a CPU, and a heterogeneous system combining CPU and GPU. The CPU codes in C and MATLAB are compared with the heterogeneous versions written in CUDA-C and OpenCL. The relative performance is tested and evaluated on a mathematical phantom as well as on mouse data.
{"title":"CUDA and OpenCL implementations of 3D CT reconstruction for biomedical imaging","authors":"Saoni Mukherjee, Nicholas Moore, J. Brock, M. Leeser","doi":"10.1109/HPEC.2012.6408674","DOIUrl":"https://doi.org/10.1109/HPEC.2012.6408674","url":null,"abstract":"Biomedical image reconstruction applications with large datasets can benefit from acceleration. Graphics Processing Units (GPUs) are particularly useful in this context as they can produce high-fidelity images rapidly. An image algorithm to reconstruct cone-beam computed tomography (CT) using two-dimensional projections is implemented using GPUs. The implementation takes slices of the target, weights the projection data and then filters the weighted data to backproject the data and create the final three-dimensional reconstruction. This is implemented on two types of hardware: CPU and a heterogeneous system combining CPU and GPU. The CPU codes in C and MATLAB are compared with the heterogeneous versions written in CUDA-C and OpenCL. The relative performance is tested and evaluated on a mathematical phantom as well as on mouse data.","PeriodicalId":193020,"journal":{"name":"2012 IEEE Conference on High Performance Extreme Computing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124052431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
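The weight-filter-backproject pipeline described in the abstract above can be illustrated with a deliberately simplified 2D parallel-beam filtered backprojection. This is a NumPy sketch, not the paper's 3D cone-beam CUDA/OpenCL implementation: the cone-beam distance weighting step does not apply in the parallel-beam geometry and is omitted, interpolation is nearest-neighbour, and all sizes are arbitrary.

```python
import numpy as np

def ramp_filter(sinogram):
    """Filter each projection row with the ramp (|f|) kernel in Fourier space."""
    n = sinogram.shape[-1]
    ramp = np.abs(np.fft.fftfreq(n))
    return np.real(np.fft.ifft(np.fft.fft(sinogram, axis=-1) * ramp, axis=-1))

def backproject(filtered, angles):
    """Smear every filtered projection back across the image grid."""
    n = filtered.shape[-1]
    img = np.zeros((n, n))
    coords = np.arange(n) - n / 2.0
    X, Y = np.meshgrid(coords, coords)
    for proj, theta in zip(filtered, angles):
        # Detector coordinate of each pixel for this view (nearest neighbour).
        t = X * np.cos(theta) + Y * np.sin(theta) + n / 2.0
        idx = np.clip(np.rint(t).astype(int), 0, n - 1)
        img += proj[idx]
    return img * np.pi / len(angles)

# Point phantom at the image centre: its sinogram is a spike at the
# detector centre for every view angle.
n, n_views = 64, 90
angles = np.linspace(0.0, np.pi, n_views, endpoint=False)
sinogram = np.zeros((n_views, n))
sinogram[:, n // 2] = 1.0
recon = backproject(ramp_filter(sinogram), angles)
print(np.unravel_index(np.argmax(recon), recon.shape))  # brightest pixel
```

The per-view backprojection loop is the part that maps naturally onto GPU threads (one thread per output voxel), which is why implementations like the one above move exactly this stage to CUDA or OpenCL.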
Journal: 2012 IEEE Conference on High Performance Extreme Computing