The Collection Virtual Machine: An Abstraction for Multi-Frontend Multi-Backend Data Analysis

Proceedings of the 16th International Workshop on Data Management on New Hardware. Pub Date: 2020-04-04. DOI: 10.1145/3399666.3399911

Ingo Müller, Renato Marroquín, D. Koutsoukos, Mike Wawrzoniak, G. Alonso, Sabir Akhadov


Citations: 9

Abstract




Getting the best performance from the ever-increasing number of hardware platforms has been a recurring challenge for data processing systems. In recent years, the advent of data science, with its increasingly numerous and complex types of analytics, has made this challenge even more difficult. In practice, system designers are overwhelmed by the number of combinations and typically implement a single analytics type on one platform, leading to repeated implementation effort and a plethora of semi-compatible tools for data scientists. In this paper, we propose the "Collection Virtual Machine" (CVM), an extensible compiler framework designed to keep the specialization process of data analytics systems tractable. It captures both the essence of a large span of low-level, hardware-specific implementation techniques and the high-level operations of different types of analyses. At its core lies a language for defining nested, collection-oriented intermediate representations (IRs). Frontends produce programs in their own IR flavors defined in that language. These programs are optimized through a series of rewritings (possibly changing the IR flavor multiple times) until the program is finally expressed in an IR of platform-specific operators. While reducing the overall implementation effort, this also improves the interoperability of both analyses and hardware platforms. We have used CVM successfully to build specialized backends for platforms as diverse as multi-core CPUs, RDMA clusters, and serverless computing infrastructure in the cloud, and we expect similar results for many more frontends and hardware platforms in the near future.
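The rewriting pipeline described in the abstract can be illustrated with a toy sketch. The code below is not CVM's actual IR or API. The `Op` node type, the flavor names (`frontend`, `relational`, `cpu`), the operator names, and the rules are all hypothetical, chosen only to show how a program might move through several IR flavors until every operator is platform-specific.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Op:
    """One node of a toy collection-oriented IR (hypothetical, not CVM's IR)."""
    flavor: str                      # IR flavor the operator belongs to
    name: str                        # operator name within that flavor
    inputs: Tuple["Op", ...] = ()    # upstream operators


# A rewrite rule returns a replacement node, or None if it does not apply.
Rule = Callable[[Op], Optional[Op]]


def make_rule(src: Tuple[str, str], dst: Tuple[str, str]) -> Rule:
    """Build a rule that renames one (flavor, name) pair into another."""
    def rule(op: Op) -> Optional[Op]:
        if (op.flavor, op.name) == src:
            return Op(dst[0], dst[1], op.inputs)
        return None
    return rule


# Assumed lowering chain: frontend -> relational -> cpu.
RULES: List[Rule] = [
    make_rule(("frontend", "table"), ("relational", "scan")),
    make_rule(("frontend", "filter"), ("relational", "select")),
    make_rule(("relational", "scan"), ("cpu", "column_scan")),
    make_rule(("relational", "select"), ("cpu", "parallel_select")),
]


def rewrite(op: Op, rules: List[Rule]) -> Op:
    """Rewrite inputs first, then this node, until no rule fires."""
    op = Op(op.flavor, op.name, tuple(rewrite(i, rules) for i in op.inputs))
    for rule in rules:
        replacement = rule(op)
        if replacement is not None:
            return rewrite(replacement, rules)
    return op


def flavors(op: Op) -> Set[str]:
    """Collect every IR flavor that still occurs in the program."""
    found = {op.flavor}
    for i in op.inputs:
        found |= flavors(i)
    return found


# A frontend program: filter over a table.
program = Op("frontend", "filter", (Op("frontend", "table"),))
lowered = rewrite(program, RULES)
```

In this sketch the program changes flavor twice, and `flavors(lowered)` contains only `cpu` once lowering finishes. A real framework would also carry operator parameters and nested collection types, which are omitted here.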

