Scaling analytics applications with OpenCL for loosely coupled heterogeneous clusters
T. Suganuma, R. Krishnamurthy, Moriyoshi Ohara, T. Nakatani
DOI: 10.1145/2482767.2482812

OpenCL is an open standard for heterogeneous parallel programming that exploits multi-core CPUs, GPUs, and other accelerators as parallel computing resources. Recent work has extended the OpenCL parallel programming model to distributed heterogeneous clusters. For such loosely coupled acceleration architectures, designing OpenCL programs to maximize performance differs considerably from doing so for conventional tightly coupled acceleration platforms. This paper describes our experiences in OpenCL programming to extract scalable performance in a distributed heterogeneous cluster environment. We picked two real-world analytics workloads, Two-Step Cluster and Linear Regression, that pose different challenges for efficient OpenCL implementations. We obtained scalable performance on this architecture by carefully managing the amount of data and computation in the kernel program design and by addressing network latency problems through targeted optimizations.
Reasoning and prediction on opportunistic networks to improve data dissemination
C. O. Rolim, C. Geyer
DOI: 10.1145/2482767.2482782

Opportunistic networks exploit social behavior to build connectivity opportunities. This paradigm uses pair-wise contacts to share and forward content without any prior knowledge of pre-existing infrastructure. In this context, optimizing data dissemination among nodes is paramount. This paper presents the early stages of our research, which focuses on reasoning and prediction issues to improve data dissemination in opportunistic networks. We intend to explore contextual and social aspects with machine learning techniques in the design of a reasoning and prediction engine for this purpose.
Bridging the programming gap between persistent and volatile memory using WrAP
Ellis R. Giles, K. Doshi, P. Varman
DOI: 10.1145/2482767.2482806

Advances in memory technology promise byte-addressable persistent memory as an integral component of future computing platforms. This change has significant implications for software that has traditionally drawn a sharp distinction between durable and volatile storage. In this paper we describe a software-hardware architecture, WrAP, for persistent memory that provides atomicity and durability while simultaneously ensuring that fast paths through the cache, DRAM, and persistent memory layers are not slowed down by burdensome buffering or double-copying requirements. Trace-driven simulation of transactional data structures indicates the potential for significant performance gains using the WrAP approach.
Computationally unifying urban masterplanning
David Birch
DOI: 10.1145/2482767.2482808

Architectural design, particularly in large-scale masterplanning projects, has yet to fully undergo the computational revolution experienced by other design-led industries such as automotive and aerospace. Those industries use computational frameworks to undertake automated design analysis and design-space exploration. Within the Architecture, Engineering and Construction (AEC) industries, however, we find no such computational platforms. This precludes the rapid analysis needed for the quantitative design iteration that sustainable design requires. This is a current computing frontier.

This paper considers computational solutions to the challenges preventing such advances, with the aim of improving architectural design performance for a more sustainable future. We present a practical discussion of the computational challenges and opportunities in this industry and present a computational framework, "HierSynth", with a data model designed for the needs of this industry.

We report the results and lessons learned from applying this framework to a major commercial urban masterplanning project. The framework was used to automate and augment existing practice and to undertake previously infeasible, designer-led design-space exploration. During the case study, an order of magnitude more analysis cycles were undertaken than the literature suggests is normal, each occurring in hours rather than days.
Mapping applications for high performance on multithreaded, NUMA systems
Guojing Cong, H. Wen
DOI: 10.1145/2482767.2482777

The communication latency and available resources for a group of logical processors are determined by their relative position in the hierarchy of chips, cores, and threads on modern shared-memory systems. Multithreaded applications exhibit different performance behavior depending on the mapping of software threads to logical processors. We observe that the execution time under one mapping can be 5.4 times that under another. Applications with irregular access patterns show the worst performance under the default OS mapping.

Mapping alone does not reduce remote accesses on NUMA machines when the logical processors span multiple chips. We present new data replication and distribution optimizations for two irregular applications. We further show that locality optimization simultaneously reduces remote accesses and improves cache performance, achieving better performance than prior NUMA-specific techniques.
GPU acceleration of regular expression matching for large datasets: exploring the implementation space
Xiaodong Yu, M. Becchi
DOI: 10.1145/2482767.2482791

Regular expression matching is a central task in several networking (and search) applications and has been accelerated on a variety of parallel architectures, including general-purpose multi-core processors, network processors, field-programmable gate arrays, and ASIC- and TCAM-based systems. All of these solutions are based on finite automata (in either deterministic or non-deterministic form) and mostly focus on effective memory representations for such automata. More recently, a handful of proposals have exploited the parallelism intrinsic to regular expression matching (i.e., coarse-grained packet-level parallelism and fine-grained data-structure parallelism) to propose efficient regex-matching designs for GPUs. However, most GPU solutions aim at achieving good performance on small datasets, which are far less complex and problematic than those used in real-world applications.

In this work, we provide a more comprehensive study of regular expression matching on GPUs. To this end, we consider datasets of practical size and complexity and explore the advantages and limitations of different automata representations and of various GPU implementation techniques. Our goal is not to show optimal speedup on specific datasets, but to highlight the advantages and disadvantages of GPU hardware in supporting state-of-the-art automata representations and encoding schemes, approaches that have been broadly adopted on other parallel memory-based platforms.
DCNSim: a unified and cross-layer computer architecture simulation framework for data center network research
Nongda Hu, Long Li, Binzhang Fu, Tao Li, Xiufeng Sui, Lixin Zhang
DOI: 10.1145/2482767.2482792

Within today's large-scale data centers, inter-node communication is often the major bottleneck, a fact that has recently spurred data center network (DCN) research. Since building a real data center is cost-prohibitive, most DCN studies rely on simulation. Unfortunately, state-of-the-art network simulators have limited support for real-world applications, which prevents researchers from first-hand investigation. To address this issue, we developed a unified and cross-layer simulation framework, DCNSim. By leveraging two widely deployed simulators, DCNSim introduces computer architecture solutions into DCN research. With DCNSim, one can run packet-level network simulation driven by commercial applications while varying computer and network parameters such as CPU frequency, memory access latency, network topology, and protocols. Through extensive validation, we show that DCNSim accurately captures performance trends caused by changing computer and network parameters. Finally, through several case studies we argue that future DCN research should consider computer architecture factors.
Network stacking considered harmful
Robert Surton
DOI: 10.1145/2482767.2482780

The most important challenge facing the future Internet is not technical, but is rather the need to justify placing trust in the technical solutions. Current network models suffer from limitations that result in practical deployments being too complex to reason about. The novel channel market model, based on composing networks by sharing channels through a flat market, offers a better opportunity for reasoning. The old language is still useful, and continues to make sense in the new model. Two design principles, the haggling principle and the composition principle, provide hints for discussing and designing networks in a channel market.
An algorithm for parallel calculation of trigonometric functions
T. Barrera, A. Hast, E. Bengtsson
DOI: 10.1145/2482767.2482778

We propose a new way of calculating the sine and cosine functions. The method is based on recursive applications of a modified complex power algorithm. On a machine with multiple complex multipliers, the method can be used to calculate sines and cosines in logarithmic time. The serial version of the presented method requires only two precomputed constants and no tables. In the parallel versions, a trade-off can be made between the number of parallel processing elements and the size of the tables.
RFiof: an RF approach to I/O-pin and memory controller scalability for off-chip memories
M. Marino
DOI: 10.1145/2482767.2482803

As Moore's law continues to hold, core counts are expected to keep growing, demanding ever more memory bandwidth to feed them. Memory controller (MC) scalability is crucial to meeting these bandwidth needs but is constrained by I/O pin scaling. In this study, we introduce RFiof, a radio-frequency (RF) memory approach that addresses the I/O pin constraints restricting MC scalability in off-chip-memory systems while keeping interconnection energy low.

In this paper, we model, design, and demonstrate how RFiof achieves high MC I/O pin scalability across memory technology generations, and we evaluate its area and power/energy impact. RFiof introduces the novel concept of RFpins, which replace traditional MC I/O pins, and uses RFMCs, MCs coupled to RF transmitters (TX) and receivers (RX), with a minimal RF path between each RFMC and its ranks. For a 32-core out-of-order multicore configured with off-chip ranks and a 1:1 core-to-MC ratio, RFiof needs a scalable four RFpins per RFMC, comparable to pin-scalable optical solutions, and improves bandwidth and performance by up to 7.2x and 8.6x, respectively, over a traditional baseline constrained by MC I/O pin counts. Furthermore, RFiof reduces MC area usage by about 65.6% and memory-path interconnection energy by 80%.