Matthias Carnein, Dennis Assenmacher, H. Trautmann
Analysing streaming data has received considerable attention in recent years. A key research area in this field is stream clustering, which aims to recognize patterns in a possibly unbounded data stream of varying speed and structure. Over the past decades, a multitude of new stream clustering algorithms have been proposed. However, to the best of our knowledge, no rigorous analysis and comparison of the different approaches has been performed. Our paper fills this gap and provides extensive experiments for a total of ten popular algorithms. We utilize a number of standard data sets, both real and synthetic, and identify key weaknesses and strengths of the existing algorithms.
{"title":"An Empirical Comparison of Stream Clustering Algorithms","authors":"Matthias Carnein, Dennis Assenmacher, H. Trautmann","doi":"10.1145/3075564.3078887","DOIUrl":"https://doi.org/10.1145/3075564.3078887","journal":{"name":"Proceedings of the Computing Frontiers Conference"},"publicationDate":"2017-05-15"}
Xi Yang, Jianbin Fang, Jing Chen, Chengkun Wu, T. Tang, Kai Lu
Coordinate descent (CD) has proven to be an effective technique for matrix factorization (MF) in recommender systems. To speed up factorization, various parallel CDMF implementations have been proposed to leverage modern multi-core CPUs and many-core GPUs. Existing implementations are limited in either speed or portability (constrained to certain platforms). In this paper, we present an efficient and portable CDMF solver for recommender systems. On the one hand, we diagnose the baseline implementation and observe that it lacks awareness of the hierarchical thread organization on modern hardware and of the data variance of the rating matrix. Thus, we apply thread batching and load balancing to achieve high performance. On the other hand, we implement the CDMF solver in OpenCL so that it can run on various platforms. Based on the architectural specifics, we customize code variants to map them efficiently to the underlying hardware. The experimental results show that our implementation performs 2x faster on dual-socket Intel Xeon CPUs and 22x faster on an NVIDIA K20c GPU than the baseline implementations. When taking the CDMF solver as a benchmark, we observe that it runs 2.4x faster on the GPU than on the CPUs, whereas its performance on Intel MIC is competitive with that on the CPUs.
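As background for the abstract above, the following is a minimal sketch of the textbook coordinate descent update for regularized matrix factorization, in which each factor entry is set to its exact one-dimensional minimizer while all other entries are held fixed. It is sequential Python rather than the OpenCL solver the paper describes, and the names `cd_sweep`, `objective`, `_update_side`, and the parameter `lam` are illustrative assumptions, not taken from the paper.

```python
def objective(ratings, W, H, lam):
    """Squared error over observed ratings plus L2 regularization on W and H."""
    err = 0.0
    for i, j, r in ratings:
        pred = sum(a * b for a, b in zip(W[i], H[j]))
        err += (r - pred) ** 2
    reg = sum(v * v for row in W for v in row) + sum(v * v for row in H for v in row)
    return err + lam * reg


def _update_side(obs_by_row, A, B, lam):
    """Update every coordinate A[i][k] exactly, holding B and the rest of A fixed."""
    rank = len(A[0])
    for i, obs in obs_by_row.items():
        for k in range(rank):
            num, den = 0.0, lam  # lam > 0 keeps the denominator positive
            for j, r in obs:
                pred = sum(A[i][t] * B[j][t] for t in range(rank))
                # Residual with the contribution of coordinate k added back.
                resid = r - pred + A[i][k] * B[j][k]
                num += resid * B[j][k]
                den += B[j][k] * B[j][k]
            A[i][k] = num / den


def cd_sweep(ratings, W, H, lam=0.1):
    """One coordinate descent pass: all user factors W, then all item factors H.

    ratings is a list of (user, item, rating) triples; W and H are lists of
    equal-length rows and are modified in place.
    """
    by_user, by_item = {}, {}
    for i, j, r in ratings:
        by_user.setdefault(i, []).append((j, r))
        by_item.setdefault(j, []).append((i, r))
    _update_side(by_user, W, H, lam)
    _update_side(by_item, H, W, lam)
```

Because every coordinate update is an exact minimization, the value returned by `objective` cannot increase from one sweep to the next. The inner loops over a user's or an item's ratings vary in length with the number of observed ratings, which is the kind of imbalance the load balancing mentioned in the abstract is aimed at.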
{"title":"High Performance Coordinate Descent Matrix Factorization for Recommender Systems","authors":"Xi Yang, Jianbin Fang, Jing Chen, Chengkun Wu, T. Tang, Kai Lu","doi":"10.1145/3075564.3077625","DOIUrl":"https://doi.org/10.1145/3075564.3077625","journal":{"name":"Proceedings of the Computing Frontiers Conference"},"publicationDate":"2017-05-15"}
The trend of increasing the number of cores in a processor will lead to certain challenges. One of them is that more cores issue more memory requests, which in turn increases the competition, or interference, for shared resources such as the Last-Level Cache (LLC). In this work we focus on cache interference while executing Decision Support System queries, which is a common case in a data center scenario. We study the co-execution of different queries from the TPC-H benchmark using the PostgreSQL DBMS on a multicore with up to 16 cores and different LLC configurations. In addition to the working set metric, to better understand the effects of co-execution, we develop two new "personality" metrics to classify the behavior of the queries in co-execution: social and sensitive metrics. These metrics can be used to manage the cache interference and thus improve the co-execution performance of the queries.
{"title":"Using Personality Metrics to Improve Cache Interference Management in Multicore Processors","authors":"Mwaffaq Otoom, A. Jaleel, P. Trancoso","doi":"10.1145/3075564.3075591","DOIUrl":"https://doi.org/10.1145/3075564.3075591","journal":{"name":"Proceedings of the Computing Frontiers Conference"},"publicationDate":"2017-05-15"}
Tim Llewellynn, M. Fernández-Carrobles, O. Déniz-Suárez, Samuel Fricker, A. Storkey, Nuria Pazos, Gordana S. Velikic, Kirsten Leufgen, Rozenn Dahyot, Sebastian Koller, G. Goumas, P. Leitner, Ganesh S. Dasika, Lei Wang, K. Tutschku
The Bonseyes EU H2020 collaborative project aims to develop a platform consisting of a Data Marketplace, a Deep Learning Toolbox, and Developer Reference Platforms for organizations wanting to adopt Artificial Intelligence. The project will focus on using artificial intelligence in low-power Internet of Things (IoT) devices ("edge computing"), embedded computing systems, and data center servers ("cloud computing"). It will bring about orders-of-magnitude improvements in efficiency, performance, reliability, security, and productivity in the design and programming of systems of artificial intelligence that incorporate Smart Cyber-Physical Systems (CPS). In addition, it will solve a causality problem for organizations that lack access to Data and Models. Its open software architecture will facilitate adoption of the whole concept on a wider scale. Four complementary demonstrators will be built to evaluate the effectiveness and technical feasibility of adding AI to products and services using the Bonseyes platform, and to quantify the real-world improvements in efficiency, security, performance, effort, and cost. Bonseyes platform capabilities are intended to align with the European FI-PPP activities and to take advantage of its flagship project FIWARE. This paper provides a description of the project motivation, goals, and preliminary work.
{"title":"BONSEYES: Platform for Open Development of Systems of Artificial Intelligence: Invited paper","authors":"Tim Llewellynn, M. Fernández-Carrobles, O. Déniz-Suárez, Samuel Fricker, A. Storkey, Nuria Pazos, Gordana S. Velikic, Kirsten Leufgen, Rozenn Dahyot, Sebastian Koller, G. Goumas, P. Leitner, Ganesh S. Dasika, Lei Wang, K. Tutschku","doi":"10.1145/3075564.3076259","DOIUrl":"https://doi.org/10.1145/3075564.3076259","journal":{"name":"Proceedings of the Computing Frontiers Conference"},"publicationDate":"2017-05-15"}
Xiaodong Yu, Hao Wang, Wu-chun Feng, H. Gong, Guohua Cao
The algebraic reconstruction technique (ART) is an iterative algorithm for CT (i.e., computed tomography) image reconstruction that delivers better image quality with less radiation dosage than the industry-standard filtered back projection (FBP). However, the high computational cost of ART requires researchers to turn to high-performance computing to accelerate the algorithm. Alas, existing approaches for ART suffer from inefficient design of compressed data structures and computational kernels on GPUs. Thus, this paper presents cuART, our enhanced CUDA-based CT image reconstruction tool built on ART. It delivers a compression and parallelization solution for ART-based image reconstruction on GPUs. We address the under-performance of popular GPU libraries, e.g., cuSPARSE, BRC, and CSR5, on the ART algorithm and propose a symmetry-based CSR format (SCSR) to further compress the CSR data structure and optimize data access for both SpMV and SpMV_T via a column-indices permutation. We also propose sorting-based and sorting-free blocking techniques to optimize the kernel computation by leveraging the sparsity patterns of the system matrix. The end result is that cuART can reduce the memory footprint significantly and enable practical CT datasets to fit into a single GPU. The experimental results on an NVIDIA Tesla K80 GPU illustrate that our approach can achieve up to 6.8x, 7.2x, and 5.4x speedups over counterparts that use cuSPARSE, BRC, and CSR5, respectively.
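For readers unfamiliar with ART, the following is a minimal sketch of the textbook row-action (Kaczmarz) update that ART is built on, applied to a system matrix stored in plain CSR arrays. It is not cuART: it is sequential Python, it uses standard CSR rather than the SCSR format proposed in the paper, and the names `art_sweep`, `art_reconstruct`, and `relax` are illustrative assumptions.

```python
def art_sweep(indptr, indices, data, b, x, relax=1.0):
    """One ART sweep: project x onto each row equation a_i . x = b_i in turn.

    (indptr, indices, data) is a CSR matrix; x is updated in place.
    """
    for i in range(len(b)):
        start, end = indptr[i], indptr[i + 1]
        dot = 0.0      # a_i . x
        norm_sq = 0.0  # ||a_i||^2
        for k in range(start, end):
            dot += data[k] * x[indices[k]]
            norm_sq += data[k] * data[k]
        if norm_sq == 0.0:
            continue  # an empty row carries no constraint
        step = relax * (b[i] - dot) / norm_sq
        for k in range(start, end):
            x[indices[k]] += step * data[k]
    return x


def art_reconstruct(indptr, indices, data, b, n_cols, sweeps=50, relax=1.0):
    """Run a fixed number of ART sweeps starting from the zero image."""
    x = [0.0] * n_cols
    for _ in range(sweeps):
        art_sweep(indptr, indices, data, b, x, relax)
    return x
```

For the 2x2 system with rows (1, 1) and (1, -1) and right-hand side (3, 1), calling `art_reconstruct([0, 2, 4], [0, 1, 0, 1], [1.0, 1.0, 1.0, -1.0], [3.0, 1.0], 2, sweeps=1)` recovers x = (2, 1) after a single sweep, because the two rows are orthogonal. Each row update reads the row once (a sparse dot product) and writes it once (a scaled sparse add), which is why the layout of the compressed matrix dominates the cost.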
{"title":"An Enhanced Image Reconstruction Tool for Computed Tomography on GPUs","authors":"Xiaodong Yu, Hao Wang, Wu-chun Feng, H. Gong, Guohua Cao","doi":"10.1145/3075564.3078889","DOIUrl":"https://doi.org/10.1145/3075564.3078889","journal":{"name":"Proceedings of the Computing Frontiers Conference"},"publicationDate":"2017-05-15"}
Intel Software Guard Extensions (SGX) is an emerging trusted hardware technology. SGX enables user-level code to allocate regions of trusted memory, called enclaves, where the confidentiality and integrity of code and data are guaranteed. While SGX offers strong security for applications, one limitation of SGX is the lack of system call support inside enclaves, which leads to a non-trivial refactoring effort when protecting existing applications with SGX. To address this issue, previous works have ported existing library OSes to SGX. However, these library OSes are suboptimal in terms of security and performance since they are designed without taking into account the characteristics of SGX. In this paper, we revisit the library OS approach in a new setting: Intel SGX. We first quantitatively evaluate the performance impact of enclave transitions on SGX programs, identifying them as a performance bottleneck for any library OS that aims to support system-intensive SGX applications. We then present the design and implementation of SGXKernel, an in-enclave library OS, highlighting its switchless design, which obviates the need for enclave transitions. This switchless design is achieved by incorporating two novel ideas: asynchronous cross-enclave communication and preemptible in-enclave multi-threading. We extensively evaluate the performance of SGXKernel on microbenchmarks and application benchmarks. The results show that SGXKernel significantly outperforms a state-of-the-art library OS that has been ported to SGX.
{"title":"SGXKernel: A Library Operating System Optimized for Intel SGX","authors":"H. Tian, Yong Zhang, Chunxiao Xing, Shoumeng Yan","doi":"10.1145/3075564.3075572","DOIUrl":"https://doi.org/10.1145/3075564.3075572","journal":{"name":"Proceedings of the Computing Frontiers Conference"},"publicationDate":"2017-05-15"}
Sudheer Chunduri, Prasanna Balaprakash, V. Morozov, V. Vishwanath, Kalyan Kumaran
Modeling the performance of scientific applications on emerging hardware plays a central role in achieving extreme-scale computing goals. Analytical models that capture the interaction between applications and hardware characteristics are attractive because even a reasonably accurate model can be useful for performance tuning before the hardware is made available. In this paper, we develop a hardware model of Intel's second-generation Xeon Phi architecture, code-named Knights Landing (KNL), for the SKOPE framework. We validate the KNL hardware model by projecting the performance of minibenchmarks and application kernels. The results show that our KNL model can project the performance with prediction errors of 10% to 20%. The hardware model also provides informative recommendations for code transformations and tuning.
{"title":"Analytical Performance Modeling and Validation of Intel's Xeon Phi Architecture","authors":"Sudheer Chunduri, Prasanna Balaprakash, V. Morozov, V. Vishwanath, Kalyan Kumaran","doi":"10.1145/3075564.3075593","DOIUrl":"https://doi.org/10.1145/3075564.3075593","journal":{"name":"Proceedings of the Computing Frontiers Conference"},"publicationDate":"2017-05-15"}
Spatial blocking is a critical memory-access optimization for efficiently exploiting the computing resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data over multiple spatial iterations, spatial blocking can significantly lessen the pressure of accessing slow global memory. Stencil computations, for example, can exploit such data reuse via spatial blocking through the memory hierarchy of the GPU to improve performance. However, approaches that take advantage of such blocking require complex and tedious changes to the GPU kernels for different stencils, GPU architectures, and multi-level cached systems. In this work, we explore the challenges of different spatial blocking strategies over three cache levels of the GPU (i.e., L1 cache, scratchpad memory, and registers) and propose a framework, GPU-UniCache, to automatically generate code to access buffered data in the cached systems of GPUs. Based on the characteristics of spatial blocking over various stencil kernels, we generalize the patterns of data communication, index conversion, and synchronization (with abstracted ISA-friendly interfaces) and map them to different architectures with highly optimized code variants. Our approach greatly simplifies the design of efficient and portable stencil computations across GPUs. Compared to stencil kernels based on hardware-managed memory (L1 cache) and other state-of-the-art GPU benchmarks, GPU-UniCache can achieve significant improvements.
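The idea of spatial blocking can be illustrated with a minimal CPU-side sketch, assuming a 1D three-point averaging stencil: each tile is copied, together with a one-cell halo on either side, into a small local buffer, and all neighbor reads for that tile are served from the buffer. This is plain Python, not GPU code and not output of GPU-UniCache; the function names and the `tile` parameter are illustrative assumptions, and the local list merely stands in for scratchpad memory or registers.

```python
def stencil_naive(a):
    """Three-point average; boundary cells are copied unchanged."""
    n = len(a)
    out = list(a)
    for i in range(1, n - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out


def stencil_blocked(a, tile=4):
    """Same stencil, computed tile by tile from a local buffer with halo cells."""
    n = len(a)
    out = list(a)
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)  # tile covers interior cells [start, end)
        # Load the tile plus one halo cell on each side (the "cache-loaded" data).
        buf = a[start - 1:end + 1]
        for local in range(1, len(buf) - 1):
            # Index conversion: buffer position local maps to global start + local - 1.
            out[start + local - 1] = (buf[local - 1] + buf[local] + buf[local + 1]) / 3.0
    return out
```

Both functions perform the same arithmetic in the same order, so their results are identical; the blocked version differs only in where the operands are read from. The halo load and the local-to-global index conversion are the parts that change with stencil shape and buffer type, which is the tedium the abstract says a code generator is meant to remove.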
{"title":"GPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs","authors":"Kaixi Hou, Hao Wang, Wu-chun Feng","doi":"10.1145/3075564.3075583","DOIUrl":"https://doi.org/10.1145/3075564.3075583","journal":{"name":"Proceedings of the Computing Frontiers Conference"},"publicationDate":"2017-05-15"}
During the past few years, social engineering has rapidly evolved and has become a mainstream technique in cybercrime and terrorism. It is used especially in targeted attacks involving complex human and technological exploits, aimed at deceiving humans and IT systems. Building on the work carried out in the DOGANA project, funded by the European Union, this paper provides an overview of the evolution and of the current landscape of social engineering, and introduces as its main contribution a theoretical model of how human exploits are built, named the Victim Communication Stack.
{"title":"Social Engineering 2.0: A Foundational Work: Invited Paper","authors":"Davide Ariu, E. Frumento, G. Fumera","doi":"10.1145/3075564.3076260","DOIUrl":"https://doi.org/10.1145/3075564.3076260","journal":{"name":"Proceedings of the Computing Frontiers Conference"},"publicationDate":"2017-05-15"}
A. Bagnato, Regina Krisztina Bíró, Dario Bonino, C. Pastrone, W. Elmenreich, René Reiners, M. Schranz, Edin Arnautovic
Cyber-Physical Systems (CPS) find applications in a number of large-scale, safety-critical domains such as transportation and smart cities. The increasing interactions amongst different CPS are starting to generate unpredictable behaviors and emerging properties, often leading to unforeseen and/or undesired results. Rather than being an unwanted byproduct, these interactions could become an advantage if they were explicitly managed and accounted for from the early design stages. The CPSwarm project, presented in this paper, aims at tackling these kinds of challenges by easing development and integration of complex herds of heterogeneous CPS. Thanks to CPSwarm, systems designed through a combination of existing and emerging tools will collaborate on the basis of local policies and exhibit a collective behavior capable of solving complex real-world problems. Three real-world use cases will demonstrate the validity of the foundational assumptions of the presented approach as well as the viability of the developed tools and methodologies.
{"title":"Designing Swarms of Cyber-Physical Systems: the H2020 CPSwarm Project: Invited Paper","authors":"A. Bagnato, Regina Krisztina Bíró, Dario Bonino, C. Pastrone, W. Elmenreich, René Reiners, M. Schranz, Edin Arnautovic","doi":"10.1145/3075564.3077628","DOIUrl":"https://doi.org/10.1145/3075564.3077628","journal":{"name":"Proceedings of the Computing Frontiers Conference"},"publicationDate":"2017-05-15"}