Datenbanksysteme für Business, Technologie und Web: Latest Articles

SportsTables: A new Corpus for Semantic Type Detection
Pub Date : 2023-10-16 DOI: 10.18420/BTW2023-68
S. Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig
Table corpora such as VizNet or TURL, which contain annotated semantic types per column, are important for building machine learning models for automatic semantic type detection. However, there is a large discrepancy between these corpora and real-world data lakes: data lakes contain a large fraction of numerical data that is not present in existing corpora. Hence, in this paper, we introduce a new corpus that contains a much higher proportion of numerical columns than existing corpora. To reflect the distribution in real-world data lakes, our corpus SportsTables has on average approx. 86% numerical columns, posing new challenges to existing semantic type detection models, which have mainly targeted non-numerical columns so far. To demonstrate this effect, we show in this extended version of [18] the results of an extensive study using four different state-of-the-art approaches for semantic type detection on our new corpus. Overall, the results demonstrate significant performance differences in predicting semantic types for textual and numerical data.
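To make the "proportion of numerical columns" statistic concrete, the following is a minimal sketch of how such a share could be measured for a single table. It is not the authors' tooling: the function names, the 0.9 threshold, and the rule "a cell is numeric if it parses as a float" are all assumptions chosen for illustration.

```python
def is_numeric(value: str) -> bool:
    """Hypothetical rule: a cell counts as numeric if it parses as a float."""
    try:
        float(value.replace(",", ""))  # tolerate thousands separators
        return True
    except ValueError:
        return False


def numeric_column_fraction(columns: dict[str, list[str]],
                            threshold: float = 0.9) -> float:
    """Fraction of columns whose non-empty cells are mostly numeric.

    `threshold` (an assumed cut-off) is the share of non-empty cells
    that must parse as numbers for the column to count as numerical.
    """
    if not columns:
        return 0.0
    numeric = 0
    for cells in columns.values():
        filled = [c for c in cells if c.strip()]
        if filled and sum(is_numeric(c) for c in filled) / len(filled) >= threshold:
            numeric += 1
    return numeric / len(columns)


# Made-up example table: one textual column, three numerical ones.
table = {
    "player": ["A. Smith", "B. Jones", "C. Lee"],
    "points": ["21", "17", "30"],
    "rebounds": ["5", "11", "7"],
    "minutes": ["34.5", "28.0", "36.2"],
}
print(numeric_column_fraction(table))  # 0.75
```

Averaging this value over all tables of a corpus would give a corpus-level figure comparable in spirit to the roughly 86% reported above, although the exact definition used in the paper may differ.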
Citations: 0
Accelerating Large Table Scan using Processing-In-Memory Technology
Pub Date : 2023-10-10 DOI: 10.18420/BTW2023-51
Alexander Baumstark, M. Jibril, K. Sattler
Today’s systems are capable of storing large amounts of data in main memory. Particularly, in-memory DBMSs benefit from this development. However, the processing of data from the main memory necessarily has to run via the CPU. This creates a bottleneck, which affects the possible performance of the DBMS. Processing-In-Memory (PIM) is a paradigm to overcome this problem, which was not available in commercial systems for a long time. With the availability of UPMEM, a commercial product is finally available that provides PIM technology in hardware. In this work, we focus on the acceleration of the table scan, a fundamental database query operation. We show and investigate an approach that can be used to optimize this operation by using PIM. We evaluate the PIM scan in terms of parallelism and execution time in benchmarks with different table sizes and compare it to a traditional CPU-based table scan. The result is a PIM table scan that outperforms the CPU-based scan significantly.
Citations: 3
The InsightsNet Climate Change Corpus (ICCC)
Pub Date : 2023-09-11 DOI: 10.18420/BTW2023-59
Sabine Bartsch, Changxu Duan, Sherry Tan, Elena Volkanovska, W. Stille
The discourse on climate change has become a centerpiece of public debate, thereby creating a pressing need to analyze the multitude of messages created by the participants in this communication process. In addition to text, information on this topic is conveyed multimodally, through images, videos, tables and other data objects that are embedded within documents and accompany the text. This paper presents the process of building a multimodal pilot corpus to the InsightsNet Climate Change Corpus (ICCC) and using natural language processing (NLP) tools to enrich corpus (meta)data, thus creating a dataset that lends itself to the exploration of the interplay between the various modalities that constitute the discourse on climate change. We demonstrate how the pilot corpus can be queried for relevant information in two types of databases, and how the proposed data model promotes a more comprehensive sentiment analysis approach.
Citations: 0
On the State of German (Abstractive) Text Summarization
Pub Date : 2023-01-17 DOI: 10.48550/arXiv.2301.07095
Dennis Aumiller, Jing Fan, Michael Gertz
With recent advancements in the area of Natural Language Processing, the focus is slowly shifting from a purely English-centric view towards more language-specific solutions, including German. Especially practical for businesses to analyze their growing amount of textual data are text summarization systems, which transform long input documents into compressed and more digestible summary texts. In this work, we assess the particular landscape of German abstractive text summarization and investigate the reasons why practically useful solutions for abstractive text summarization are still absent in industry. Our focus is two-fold, analyzing a) training resources, and b) publicly available summarization systems. We are able to show that popular existing datasets exhibit crucial flaws in their assumptions about the original sources, which frequently leads to detrimental effects on system generalization and evaluation biases. We confirm that for the most popular training dataset, MLSUM, over 50% of the training set is unsuitable for abstractive summarization purposes. Furthermore, available systems frequently fail to compare to simple baselines, and ignore more effective and efficient extractive summarization approaches. We attribute poor evaluation quality to a variety of different factors, which are investigated in more detail in this work: A lack of qualitative (and diverse) gold data considered for training, understudied (and untreated) positional biases in some of the existing datasets, and the lack of easily accessible and streamlined pre-processing strategies or analysis tools. We provide a comprehensive assessment of available models on the cleaned datasets, and find that this can lead to a reduction of more than 20 ROUGE-1 points during evaluation. The code for dataset filtering and reproducing results can be found online at https://github.com/dennlinger/summaries
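For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a system summary and a reference summary. The sketch below is a simplified illustration, assuming lowercasing and whitespace tokenization; the function name and example sentences are made up, and established ROUGE implementations apply additional tokenization and optional stemming.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Simplification (assumption): tokens are lowercased and split on
    whitespace only.
    """
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection clips each token's count to the smaller side.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "die katze sitzt auf der matte"
candidate = "die katze liegt auf der matte"
print(round(rouge1_f1(reference, candidate), 4))  # 0.8333
```

Scores are typically reported on a 0 to 100 scale, so a drop of 20 ROUGE-1 points corresponds to 0.20 on the 0 to 1 scale returned here.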
Citations: 0
The Easiest Way of Turning your Relational Database into a Blockchain - and the Cost of Doing So
Pub Date : 2022-10-10 DOI: 10.48550/arXiv.2210.04484
F. Schuhknecht, Simon Jörz
Blockchain systems essentially consist of two levels: The network level has the responsibility of distributing an ordered stream of transactions to all nodes of the network in exactly the same way, even in the presence of a certain amount of malicious parties (byzantine fault tolerance). On the node level, each node then receives this ordered stream of transactions and executes it within some sort of transaction processing system, typically to alter some kind of state. This clear separation into two levels as well as drastically different application requirements have led to the materialization of the network level in the form of so-called blockchain frameworks. While providing all the "blockchain features", these frameworks leave the node level backend flexible or even left to be implemented depending on the specific needs of the application. In the following paper, we present how to integrate a highly versatile transaction processing system, namely a relational DBMS, into such a blockchain framework. As the framework, we use the popular Tendermint Core, now part of the Ignite/Cosmos eco-system, which can run both public and permissioned networks, and combine it with relational DBMSs as the backend. This results in a "relational blockchain", which is able to run deterministic SQL on a fully replicated relational database. Apart from presenting the integration and its pitfalls, we will carefully evaluate the performance implications of such combinations, in particular, the throughput and latency overhead caused by the blockchain layer on top of the DBMS. As a result, we give recommendations on how to run such a system combination efficiently in practice.
Citations: 0
Silentium! Run-Analyse-Eradicate the Noise out of the DB/OS Stack
Pub Date : 2021-02-11 DOI: 10.18420/btw2021-21
W. Mauerer, Ralf Ramsauer, Edson Ramiro Lucas Filho, Stefanie Scherzinger
When multiple tenants compete for resources, database performance tends to suffer. Yet there are scenarios where guaranteed sub-millisecond latencies are crucial, such as in real-time data processing, IoT devices, or when operating in safety-critical environments. In this paper, we study how to make query latencies deterministic in the face of noise (whether caused by other tenants or unrelated operating system tasks). We perform controlled experiments with an in-memory database engine in a multi-tenant setting, where we successively eradicate noisy interference from within the system software stack, to the point where the engine runs close to bare-metal on the underlying hardware. We show that we can achieve query latencies comparable to the database engine running as the sole tenant, but without noticeably impacting the workload of competing tenants. We discuss these results in the context of ongoing efforts to build custom operating systems for database workloads, and point out that for certain use cases, the margin for improvement is rather narrow. In fact, for scenarios like ours, existing operating systems might just be good enough, provided that they are expertly configured. We then critically discuss these findings in the light of a broader family of database systems (e.g., including disk-based), and how to extend the approach of this paper accordingly.
Citations: 4
Exploring Memory Access Patterns for Graph Processing Accelerators
Pub Date : 2020-10-26 DOI: 10.18420/btw2021-05
Jonas Dann, Daniel Ritter, H. Fröning
Recent trends in business and technology (e.g., machine learning, social network analysis) benefit from storing and processing growing amounts of graph-structured data in databases and data science platforms. FPGAs as accelerators for graph processing with a customizable memory hierarchy promise to solve performance problems caused by inherent irregular memory access patterns on traditional hardware (e.g., CPU). However, developing such hardware accelerators is still time-consuming and difficult, and benchmarking is non-standardized, hindering comprehension of the impact of memory access pattern changes and systematic engineering of graph processing accelerators. In this work, we propose a simulation environment for the analysis of graph processing accelerators based on simulating their memory access patterns. Further, we evaluate our approach on two state-of-the-art FPGA graph processing accelerators and show reproducibility, comparability, as well as the shortened development process by an example. Not implementing the cycle-accurate internal data flow on accelerator hardware like FPGAs significantly reduces the implementation time, increases the benchmark parameter transparency, and allows comparison of graph processing approaches.
Citations: 5
Die Data Science Challenge auf der BTW 2019 in Rostock (The Data Science Challenge at BTW 2019 in Rostock)
Pub Date : 2019-03-04 DOI: 10.18420/btw2019-ws-30
Hannes Grunert, Holger Meyer
For the second time, following BTW 2017 in Stuttgart [Wa17], the Data Science Challenge takes place at the BTW conference series. Participants in the challenge had the opportunity to develop their own approach to cloud-based data analysis and to compete directly against other participants.
Citations: 2
jHound: Large-Scale Profiling of Open JSON Data
Pub Date : 2019-03-04 DOI: 10.18420/btw2019-44
M. Möller, Nicolas Berton, Meike Klettke, Stefanie Scherzinger, U. Störl
We present jHound, a tool for profiling large collections of JSON data, and apply it to thousands of data sets holding open government data. jHound reports key characteristics of JSON documents, such as their nesting depth. As we show, jHound can help detect structural outliers, and most importantly, badly encoded documents: jHound can pinpoint certain cases of documents that use string-typed values where other native JSON datatypes would have been a better match. Moreover, we can detect certain cases of maladaptively structured JSON documents, which obviously do not comply with good data modeling practices. By interactively exploring particular example documents, we hope to inspire discussions in the community about what makes a good JSON encoding.
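The two characteristics named in this abstract, nesting depth and string-typed values that would fit a native JSON datatype, can be illustrated with a short sketch. This is not jHound's implementation: the function names, the path notation, and the example document are hypothetical, and the number check simply follows the JSON number grammar.

```python
import json
import re

# Pattern for a JSON number literal (integer, fraction, exponent).
JSON_NUMBER = re.compile(r"-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?")


def nesting_depth(value) -> int:
    """Depth of nested objects/arrays; scalars have depth 0."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


def string_encoded_values(value, path="$"):
    """Yield (path, text) for strings that look like numbers, booleans or null."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from string_encoded_values(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from string_encoded_values(child, f"{path}[{index}]")
    elif isinstance(value, str):
        text = value.strip()
        if text in ("true", "false", "null") or JSON_NUMBER.fullmatch(text):
            yield path, value


# Made-up document with three suspicious string-typed values.
doc = json.loads(
    '{"id": "42", "active": "true", "tags": ["a", "b"],'
    ' "geo": {"lat": "52.5", "name": "Berlin"}}'
)
print(nesting_depth(doc))                # 2
print(list(string_encoded_values(doc)))  # [('$.id', '42'), ('$.active', 'true'), ('$.geo.lat', '52.5')]
```

A profiler along these lines would aggregate such findings over many documents; whether a flagged string is truly mis-encoded (an identifier such as "42" may legitimately be a string) remains a judgment call.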
Citations: 7
Invest Once, Save a Million Times - LLVM-based Expression Compilation in PostgreSQL
Pub Date : 2017-03-01 DOI: 10.15496/publikation-25721
D. Butterstein, Torsten Grust
We demonstrate how a surgical change to PostgreSQL’s evaluation strategy for SQL expressions can have a noticeable impact on overall query runtime performance. (This is an abridged version of [BG16], originally demonstrated at VLDB 2016.) The evaluation of scalar or Boolean expressions often takes the backseat in a discussion of query processing although table scans, filters, aggregates, projections
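The "invest once, save a million times" idea in the title can be sketched by contrasting two ways of evaluating the same expression over many rows. This is an analogy only, not the authors' method: Python closures stand in for LLVM-generated code, and the tuple-based expression format, the `OPS` table, and the sample rows are all hypothetical. The first function re-walks the expression tree for every row. The second walks it once and returns a callable that is reused for every row.

```python
import operator

# Hypothetical operator table for a tiny expression language.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "<": operator.lt,
}


def interpret(expr, row):
    """Evaluate expr for one row by walking the tree (cost paid per row)."""
    kind = expr[0]
    if kind == "const":
        return expr[1]
    if kind == "col":
        return row[expr[1]]
    op, left, right = expr
    return OPS[op](interpret(left, row), interpret(right, row))


def compile_expr(expr):
    """Walk the tree once and return a function of the row (cost paid once)."""
    kind = expr[0]
    if kind == "const":
        value = expr[1]
        return lambda row: value
    if kind == "col":
        name = expr[1]
        return lambda row: row[name]
    op, left, right = expr
    fn, lhs, rhs = OPS[op], compile_expr(left), compile_expr(right)
    return lambda row: fn(lhs(row), rhs(row))


# Predicate: price * qty < 100
expr = ("<", ("*", ("col", "price"), ("col", "qty")), ("const", 100))
rows = [{"price": 10, "qty": 5}, {"price": 30, "qty": 4}]

predicate = compile_expr(expr)  # invest once
compiled_results = [predicate(row) for row in rows]  # reuse for every row
interpreted_results = [interpret(expr, row) for row in rows]
print(compiled_results == interpreted_results)
```

The compiled form removes the per-row dispatch on node kind and the operator lookup. Native code generation goes much further, but the trade-off has the same shape: a one-time cost repaid across every row.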
Citations: 0