
ACM/IEEE SC 2002 Conference (SC'02): Latest Publications

An Empirical Performance Evaluation of Scalable Scientific Applications
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10036
J. Vetter, A. Yoo
We investigate the scalability, architectural requirements, and performance characteristics of eight scalable scientific applications. Our analysis is driven by empirical measurements using statistical and tracing instrumentation for both communication and computation. Based on these measurements, we refine our analysis into precise explanations of the factors that influence performance and scalability for each application; we distill these factors into common traits and overall recommendations for both users and designers of scalable platforms. Our experiments demonstrate that some traits, such as improvements in the scaling and performance of MPI's collective operations, will benefit most applications. We also find specific characteristics of some applications that limit performance. For example, one application's intensive use of a 64-bit, floating-point divide instruction, which has high latency and is not pipelined on the POWER3, limits the performance of the application's primary computation.
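The statistical instrumentation the abstract describes can be illustrated with a toy phase profiler — a minimal sketch, not the authors' tooling — that attributes wall-clock time to communication and computation categories. The `phase` helper and the stand-in workloads are invented for illustration:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per instrumentation category.
totals = defaultdict(float)

@contextmanager
def phase(category):
    """Attribute the wrapped block's wall-clock time to a category."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[category] += time.perf_counter() - start

# Toy "application": alternate local computation with a stand-in collective.
for _ in range(3):
    with phase("computation"):
        sum(i * i for i in range(50_000))      # local work
    with phase("collective"):
        time.sleep(0.001)                      # stand-in for an MPI collective

spent = sum(totals.values())
breakdown = {k: v / spent for k, v in totals.items()}
print({k: round(v, 2) for k, v in breakdown.items()})
```

A real measurement campaign of this kind would aggregate such per-category timings across ranks and runs before distilling traits.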
Citations: 80
Salinas: A Scalable Software for High-Performance Structural and Solid Mechanics Simulations
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10028
M. Bhardwaj, K. Pierson, G. Reese, T. Walsh, D. Day, K. Alvin, J. Peery, C. Farhat, M. Lesoinne
We present Salinas, a scalable implicit software application for the finite element static and dynamic analysis of complex structural real-world systems. This relatively complete engineering software with more than 100,000 lines of C++ code and a long list of users sustains 292.5 Gflop/s on 2,940 ASCI Red processors, and 1.16 Tflop/s on 3,375 ASCI White processors.
Citations: 89
Giggle: A Framework for Constructing Scalable Replica Location Services
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10024
A. Chervenak, E. Deelman, Ian T Foster, Leanne P. Guy, Wolfgang Hoschek, Adriana Iamnitchi, C. Kesselman, P. Kunszt, M. Ripeanu, Robert Schwartzkopf, H. Stockinger, Kurt Stockinger, B. Tierney
In wide area computing systems, it is often desirable to create remote read-only copies (replicas) of files. Replication can be used to reduce access latency, improve data locality, and/or increase robustness, scalability and performance for distributed applications. We define a replica location service (RLS) as a system that maintains and provides access to information about the physical locations of copies. An RLS typically functions as one component of a data grid architecture. This paper makes the following contributions. First, we characterize RLS requirements. Next, we describe a parameterized architectural framework, which we name Giggle (for GIGa-scale Global Location Engine), within which a wide range of RLSs can be defined. We define several concrete instantiations of this framework with different performance characteristics. Finally, we present initial performance results for an RLS prototype, demonstrating that RLS systems can be constructed that meet performance goals.
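The two-level structure such a framework admits — local replica catalogs plus a global index — can be sketched as follows. Class and method names here are hypothetical; the actual Giggle framework is parameterized far more richly (soft-state updates, compression, partitioning):

```python
class LocalReplicaCatalog:
    """Maps logical file names to physical replica locations at one site."""
    def __init__(self, site):
        self.site = site
        self.entries = {}          # logical name -> set of physical URLs

    def register(self, logical, physical):
        self.entries.setdefault(logical, set()).add(physical)

class ReplicaLocationIndex:
    """Global index: logical name -> catalogs that claim to hold a copy."""
    def __init__(self):
        self.index = {}            # logical name -> set of site names

    def update_from(self, catalog):
        # An RLS propagates (possibly compressed) state from catalogs to
        # indexes; here we simply copy the catalog's key set.
        for logical in catalog.entries:
            self.index.setdefault(logical, set()).add(catalog.site)

    def lookup(self, logical):
        return self.index.get(logical, set())

lrc = LocalReplicaCatalog("siteA")
lrc.register("lfn://climate/run42", "gsiftp://siteA/data/run42.dat")
rli = ReplicaLocationIndex()
rli.update_from(lrc)
print(rli.lookup("lfn://climate/run42"))   # {'siteA'}
```

A client resolves a logical name at the index to find candidate sites, then queries those sites' catalogs for the physical locations.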
Citations: 477
Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs Using Memory Mapped Network Interfaces
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10008
Maria Athanasaki, A. Sotiropoulos, G. Tsoukalas, N. Koziris
This paper describes the performance benefits attained using enhanced network interfaces to achieve low latency communication. We present a novel, pipelined scheduling approach which takes advantage of DMA communication mode, to send data to other nodes, while the CPUs are performing calculations. We also use zero-copy communication through pinned-down physical memory regions, provided by NIC’s driver modules. Our testbed concerns the parallel execution of tiled nested loops onto a cluster of SMP nodes with single PCI-SCI NICs inside each node. In order to schedule tiles, we apply a hyperplane-based grouping transformation to the tiled space, so as to group together independent neighboring tiles and assign them to the same SMP node. Experimental evaluation illustrates that memory mapped NICs with enhanced communication features enable the use of a more advanced pipelined (overlapping) schedule, which considerably improves performance, compared to an ordinary blocking schedule, implemented with conventional, CPU and kernel bounded, communication primitives.
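The compute/communicate overlap at the heart of this schedule can be mimicked in miniature with a background thread standing in for a DMA transfer. This is an illustrative sketch only, not the paper's PCI-SCI implementation:

```python
import threading

sent = []

def send(tile):
    # Stand-in for a DMA-initiated transfer that needs no CPU involvement.
    sent.append(tile)

def compute(i):
    # Stand-in for the per-tile computation.
    return i * i

results, pending = [], None
for i in range(4):
    results.append(compute(i))      # CPU computes tile i ...
    if pending:
        pending.join()              # ... while the previous tile's send drains
    pending = threading.Thread(target=send, args=(results[-1],))
    pending.start()
if pending:
    pending.join()
print(sent)   # [0, 1, 4, 9]
```

In the blocking schedule the paper compares against, each send would complete before the next tile's computation begins, serializing the two phases.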
Citations: 11
Efficient Synchronization for Nonuniform Communication Architectures
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10038
Z. Radovic, Erik Hagersten
Scalable parallel computers are often nonuniform communication architectures (NUCAs), where the access time to other processor’s caches vary with their physical location. Still, few attempts of exploring cache-to-cache communication locality have been made. This paper introduces a new kind of synchronization primitives (lock-unlock) that favor neighboring processors when a lock is released. This improves the lock handover time as well as access time to the shared data of the critical region. A critical section guarded by our new RH lock takes less than half the time to execute compared with the same critical section guarded by any other lock on our NUCA hardware. The execution time for Raytrace with 28 processors was improved 2.23 - 4.68 times, while global traffic was dramatically decreased compared with all the other locks. The average execution time was improved 7 - 24% while the global traffic was decreased 8 - 28% for an average over the seven applications studied.
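The handoff policy behind such locks — on release, prefer a waiter co-located with the releaser — can be sketched with a toy, single-threaded queue model. All names are invented; the real RH lock operates through cache-coherence traffic, not Python objects:

```python
from collections import deque

class NodeAwareLock:
    """Toy handoff policy: on release, hand the lock to a waiter from the
    same node if one exists, trading strict FIFO fairness for locality."""
    def __init__(self):
        self.holder = None
        self.waiters = deque()     # (thread_id, node_id)

    def acquire(self, thread_id, node_id):
        if self.holder is None:
            self.holder = (thread_id, node_id)
            return True
        self.waiters.append((thread_id, node_id))
        return False

    def release(self):
        _, node = self.holder
        # Prefer a waiter co-located with the releasing thread: the lock and
        # the critical section's data stay in nearby caches.
        for i, (tid, nid) in enumerate(self.waiters):
            if nid == node:
                del self.waiters[i]
                self.holder = (tid, nid)
                return self.holder
        self.holder = self.waiters.popleft() if self.waiters else None
        return self.holder

lock = NodeAwareLock()
lock.acquire("t0", node_id=0)          # t0 holds the lock
lock.acquire("t1", node_id=1)          # queued, remote node
lock.acquire("t2", node_id=0)          # queued, same node as holder
print(lock.release())                  # ('t2', 0): the local waiter wins
```

A production lock of this kind must also bound how long remote waiters can be bypassed, or starvation results.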
Citations: 22
An Overview of the BlueGene/L Supercomputer
Pub Date : 2002-11-16 DOI: 10.5555/762761.762787
N. Adiga, G. Almási, G. Almási, Y. Aridor, R. Barik, D. Beece, Ralph Bellofatto, G. Bhanot, R. Bickford, M. Blumrich, A. Bright, J. Brunheroto, Calin Cascaval, J. Castaños, W. Chan, L. Ceze, P. Coteus, S. Chatterjee, Dong Chen, G. Chiu, T. Cipolla, P. Crumley, K. Desai, A. Deutsch, T. Domany, M. B. Dombrowa, W. Donath, M. Eleftheriou, C. Erway, J. Esch, B. Fitch, J. Gagliano, A. Gara, R. Garg, R. Germain, M. Giampapa, B. Gopalsamy, John A. Gunnels, Manish Gupta, F. Gustavson, S. Hall, R. Haring, D. Heidel, P. Heidelberger, L. Herger, D. Hoenicke, R. Jackson, T. Jamal-Eddine, G. Kopcsay, E. Krevat, M. Kurhekar, A. P. Lanzetta, D. Lieber, L. K. Liu, M. Lu, M. Mendell, A. Misra, Y. Moatti, L. Mok, J. Moreira, B. J. Nathanson, M. Newton, M. Ohmacht, A. Oliner, Vinayaka Pandit, R. Pudota, R. Rand, R. Regan, B. Rubin, A. Ruehli, S. Rus, R. Sahoo, A. Sanomiya, E. Schenfeld, M. Sharma, Edi Shmueli, Sarabjeet Singh, Peilin Song, V. Srinivasan, B. Steinmacher-Burow, K. Strauss, C. Surovic, R. Swetz, T. Takken, R. T
This paper gives an overview of the BlueGene/L Supercomputer. This is a jointly funded research partnership between IBM and the Lawrence Livermore National Laboratory as part of the United States Department of Energy ASCI Advanced Architecture Research Program. Application performance and scaling studies have recently been initiated with partners at a number of academic and government institutions,including the San Diego Supercomputer Center and the California Institute of Technology. This massively parallel system of 65,536 nodes is based on a new architecture that exploits system-on-a-chip technology to deliver target peak processing power of 360 teraFLOPS (trillion floating-point operations per second). The machine is scheduled to be operational in the 2004-2005 time frame, at price/performance and power consumption/performance targets unobtainable with conventional architectures.
Citations: 567
A TCP Tuning Daemon
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10023
T. Dunigan, M. Mathis, B. Tierney
Many high performance distributed applications require high network throughput but are able to achieve only a small fraction of the available bandwidth. A common cause of this problem is improperly tuned network settings. Tuning techniques, such as setting the correct TCP buffers and using parallel streams, are well known in the networking community, but outside the networking community they are infrequently applied. In this paper, we describe a tuning daemon that uses TCP instrumentation data from the Unix kernel to transparently tune TCP parameters for specified individual flows over designated paths. No modifications are required to the application, and the user does not need to understand network or TCP characteristics.
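The basic knob involved — resizing a socket's kernel send and receive buffers — looks like this in portable socket code. A tuning daemon would adjust such parameters per flow from kernel instrumentation rather than inside the application; this fragment only shows the mechanism:

```python
import socket

# Request larger kernel buffers on a socket; the kernel may round or cap
# the value (Linux, for instance, doubles the request for bookkeeping).
target = 1 << 20                      # ask for 1 MiB each way
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, target)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, target)
rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
snd = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
s.close()
print(rcv, snd)
```

Buffers sized below the path's bandwidth-delay product cap throughput regardless of link capacity, which is why untuned applications see only a fraction of the available bandwidth.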
Citations: 92
Improving Route Lookup Performance Using Network Processor Cache
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10006
Kartik Gopalan, T. Chiueh
Earlier research has shown that the route lookup performance of a network processor can be significantly improved by caching ranges of lookup/classification keys rather than individual keys. While the previous work focused specifically on reducing capacity misses, we address two other important aspects - (a) reducing conflict misses and (b) cache consistency during frequent route updates. We propose two techniques to minimize conflict misses that aim to balance the number of cacheable entries mapped to each cache set. They offer different tradeoffs between performance and simplicity while improving the average route lookup time by 76% and 45.2% respectively. To maintain cache consistency during frequent route updates, we propose a selective cache invalidation technique that can limit the degradation in lookup latency to within 10.2%. Our results indicate potentially large improvement in lookup performance for network processors used at Internet edge and motivate further research into caching at the Internet core.
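Caching ranges of keys rather than individual keys can be sketched with a sorted-array range cache. The structure below is a hypothetical illustration of the idea, not the paper's hardware cache organization:

```python
import bisect

class RangeCache:
    """Caches non-overlapping address ranges mapped to next hops, so one
    cached entry answers lookups for every address inside the range."""
    def __init__(self):
        self.starts = []           # sorted range start addresses
        self.entries = []          # parallel list of (end, next_hop)

    def insert(self, start, end, hop):
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.entries.insert(i, (end, hop))

    def lookup(self, addr):
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0:
            end, hop = self.entries[i]
            if addr < end:
                return hop         # hit: addr falls inside a cached range
        return None                # miss: consult the full routing table

cache = RangeCache()
cache.insert(0x0A000000, 0x0A010000, "if1")   # 10.0.0.0/16 -> if1
print(cache.lookup(0x0A00FFFF))               # if1 (hit)
print(cache.lookup(0x0B000000))               # None (miss)
```

Because one entry covers an entire range, the effective hit rate far exceeds that of a cache keyed on individual destination addresses.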
Citations: 34
Accelerating Parallel Maximum Likelihood-Based Phylogenetic Tree Calculations Using Subtree Equality Vectors
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10016
A. Stamatakis, T. Ludwig, H. Meier, Marty J. Wolf
Heuristics for calculating phylogenetic trees for a large sets of aligned rRNA sequences based on the maximum likelihood method are computationally expensive. The core of most parallel algorithms, which accounts for the greatest part of computation time, is the tree evaluation function, that calculates the likelihood value for each tree topology. This paper describes and uses Subtree Equality Vectors (SEVs) to reduce the number of required floating point operations during topology evaluation. We integrated our optimizations into various sequential programs and into parallel fastDNAml, one of the most common and efficient parallel programs for calculating large phylogenetic trees. Experimental results for our parallel program, which renders exactly the same output as parallel fastDNAml show global runtime improvements of 26% to 65%. The optimization scales best on clusters of PCs, which also implies a substantial cost saving factor for the determination of large trees.
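The core observation can be sketched by memoization: equal leaf-character columns under a subtree yield equal per-site likelihoods, so the likelihood kernel need run only once per distinct column pattern. The `column_likelihood` function below is a stand-in, not the actual conditional-likelihood computation:

```python
from functools import lru_cache

# Aligned sequences at a subtree's leaves; each alignment column is a
# tuple of characters. Equal columns produce equal per-site likelihoods.
leaves = ["ACGTAC", "ACGTCC"]
evaluations = 0

@lru_cache(maxsize=None)
def column_likelihood(pattern):
    global evaluations
    evaluations += 1               # count real kernel invocations
    # Stand-in for the conditional-likelihood computation at this column.
    return sum(ord(c) for c in pattern) / 1000.0

columns = list(zip(*leaves))       # per-site character patterns
values = [column_likelihood(p) for p in columns]
print(evaluations, "evaluations for", len(columns), "columns")
```

Here only 5 of the 6 columns are distinct (`('C','C')` repeats), so one kernel invocation is saved; real rRNA alignments contain far more repeated patterns, which is where the reported 26-65% runtime improvements come from.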
Citations: 33
A New Data-Mapping Scheme for Latency-Tolerant Distributed Sparse Triangular Solution
Pub Date : 2002-11-16 DOI: 10.1109/SC.2002.10020
K. Teranishi, P. Raghavan, E. Ng
This paper concerns latency-tolerant schemes for the efficient parallel solution of sparse triangular linear systems on distributed memory multiprocessors. Such triangular solution is required when sparse Cholesky factors are used to solve for a sequence of right-hand-side vectors or when incomplete sparse Cholesky factors are used to precondition a Conjugate Gradients iterative solver. In such applications, the use of traditional distributed substitution schemes can create a performance bottleneck when the latency of interprocessor communication is large. We had earlier developed the Selective Inversion (SI) scheme to reduce communication latency costs by replacing distributed substitution by parallel matrix vector multiplication. We now present a new two-way mapping of the triangular sparse matrix to processors to improve the performance of SI by halving its communication latency costs. We provide analytic results for model sparse matrices and we report on the performance of our scheme for parallel preconditioning with incomplete sparse Cholesky factors.
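The trade behind Selective Inversion — replace the serial dependency chain of substitution with a latency-tolerant matrix-vector product against a precomputed inverse — can be demonstrated on a tiny dense lower-triangular system. This is illustrative only; the actual scheme selectively inverts portions of a distributed sparse factor:

```python
# Forward substitution has a serial dependency chain (x[i] needs x[0..i-1]);
# multiplying by a precomputed inverse is an embarrassingly parallel matvec.
def forward_substitute(L, b):
    x = []
    for i, row in enumerate(L):
        s = b[i] - sum(row[j] * x[j] for j in range(i))
        x.append(s / row[i])
    return x

def invert_lower(L):
    n = len(L)
    # Column k of the inverse solves L y = e_k.
    cols = [forward_substitute(L, [1.0 if i == k else 0.0 for i in range(n)])
            for k in range(n)]
    return [[cols[k][i] for k in range(n)] for i in range(n)]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
b = [2.0, 5.0, 32.0]
Linv = invert_lower(L)             # inverted once, amortized over many solves
x_sub = forward_substitute(L, b)   # latency-bound substitution
x_inv = matvec(Linv, b)            # latency-tolerant matvec, same answer
print(x_sub, x_inv)
```

The inversion cost is paid once and amortized over the many right-hand sides that arise in repeated solves or iterative preconditioning, which is exactly the setting the abstract targets.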
Citations: 12