
Advances in database technology : proceedings. International Conference on Extending Database Technology - Latest Publications

Consent Management in Data Workflows: A Graph Problem
Dorota Filipczuk, E. Gerding, G. Konstantinidis
In modern data processing systems, users expect a service provider to automatically respect their consent in all data processing within the service. However, data may be processed for many different purposes by several layers of algorithms that create complex workflows. To date, there is no existing approach that automatically satisfies a user's fine-grained privacy constraints in a way that optimises the service provider's gains from processing. In this paper, we model a data processing workflow as a graph. User constraints and processing purposes are pairs of vertices that need to be disconnected in this graph. We propose heuristics and algorithms, while at the same time showing that, in general, this problem is NP-hard. We discuss the optimality versus efficiency of our algorithms and evaluate them using synthetically generated data. On the practical side, our algorithms can provide a nearly optimal solution for tens of constraints and graphs of thousands of nodes in a few seconds.
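To make the formulation concrete, the following is a minimal Python sketch of the problem setting only (our illustration with invented names, not the paper's heuristics): the workflow is a networkx digraph, each consent constraint is a (data, purpose) vertex pair that must be disconnected, each edge carries a cost modelling the provider's lost processing value, and we brute-force the cheapest disconnecting edge set, which is feasible only for tiny graphs given the NP-hardness result.

```python
# A brute-force sketch of the consent-as-disconnection formulation.
# Only viable for tiny graphs; the paper contributes heuristics instead.
import itertools
import networkx as nx

def cheapest_disconnecting_cut(workflow, constraints, edge_cost):
    """Cheapest edge set whose removal disconnects every (data, purpose) pair."""
    edges = list(workflow.edges())
    best, best_cost = None, float("inf")
    for r in range(len(edges) + 1):
        for cut in itertools.combinations(edges, r):
            cost = sum(edge_cost[e] for e in cut)
            if cost >= best_cost:
                continue  # cannot beat the best cut found so far
            g = workflow.copy()
            g.remove_edges_from(cut)
            if all(not nx.has_path(g, s, t) for s, t in constraints):
                best, best_cost = cut, cost
    return best, best_cost

# Toy workflow: raw data flows through a cleaning step to two purposes.
wf = nx.DiGraph([("raw", "clean"), ("clean", "ads"), ("clean", "analytics")])
costs = {e: 1.0 for e in wf.edges()}
costs[("clean", "ads")] = 0.5  # cutting this edge loses the least value
print(cheapest_disconnecting_cut(wf, [("raw", "ads")], costs))
# ((('clean', 'ads'),), 0.5): "ads" is unreachable, "analytics" still served
```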
DOI: 10.48786/edbt.2023.61 | Pages 737-748 | Published 2023-01-01
Citations: 0
Multi-Task Processing in Vertex-Centric Graph Systems: Evaluations and Insights
Siqiang Luo, Zichen Zhu, Xiaokui Xiao, Y. Yang, Chunbo Li, B. Kao
Vertex-centric (VC) graph systems are at the core of large-scale distributed graph processing. For such systems, a common usage pattern is the concurrent processing of multiple tasks (multi-processing for short), which aims to execute a large number of unit tasks in parallel. In this paper, we point out that multi-processing has not been sufficiently studied or evaluated in previous work; hence, we fill this critical gap with three major contributions. First, we examine the tradeoff between two important measures in VC systems: the number of communication rounds and message congestion. We show that this tradeoff is crucial to system performance; yet, existing approaches fail to achieve an optimal tradeoff, leading to poor performance. Second, based on extensive experimental evaluations on mainstream VC systems (e.g., Giraph, Pregel+, GraphD) and benchmark multi-processing tasks (e.g., Batch Personalized PageRanks, Multiple Source Shortest Paths), we present several important insights on the correlation between system performance and configurations, which are valuable to practitioners in optimizing system performance. Third, based on the insights drawn from our experimental evaluations, we present a cost-based tuning framework that optimizes the performance of a representative VC system. This demonstrates the usefulness of the insights.
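The rounds-versus-congestion tradeoff is easy to see in a toy vertex-centric simulator (our own sketch, not code from Giraph or Pregel+): batching several BFS unit tasks into one execution shares supersteps across tasks but raises the per-round message volume, while running them one at a time does the opposite.

```python
# Toy superstep simulator contrasting batched vs. sequential multi-processing.
from collections import defaultdict

def bfs_supersteps(adj, sources):
    """Run BFS tasks (one per source) together; return (rounds, peak messages)."""
    state = defaultdict(set)              # state[v] = task ids that reached v
    for i, s in enumerate(sources):
        state[s].add(i)
    frontier = {s: {i} for i, s in enumerate(sources)}
    rounds, peak = 0, 0
    while frontier:
        msgs = defaultdict(set)
        for v, tasks in frontier.items():
            for u in adj[v]:
                new = tasks - state[u]    # only tasks that have not reached u
                if new:
                    msgs[u] |= new
        if not msgs:
            break
        rounds += 1
        peak = max(peak, sum(len(t) for t in msgs.values()))
        for u, tasks in msgs.items():
            state[u] |= tasks
        frontier = dict(msgs)
    return rounds, peak

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}          # a 4-node path graph
print(bfs_supersteps(adj, [0, 3]))                     # batched: (3, 2)
print([bfs_supersteps(adj, [s]) for s in (0, 3)])      # sequential: (3, 1) twice
```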
DOI: 10.48786/edbt.2023.20 | Pages 247-259 | Published 2023-01-01
Citations: 3
E2-NVM: A Memory-Aware Write Scheme to Improve Energy Efficiency and Write Endurance of NVMs using Variational Autoencoders
Saeed Kargar, Binbin Gu, S. Jyothi, Faisal Nawab
We introduce E2-NVM, a software-level, memory-aware storage layer to improve the Energy efficiency and write Endurance (E2) of NVMs. E2-NVM employs a Variational Autoencoder (VAE) based design to direct write operations judiciously to the memory segments that minimize bit flips. E2-NVM can be augmented with existing indexing solutions. It can also be combined with prior hardware-based solutions to further improve efficiency. We performed real evaluations on an Optane memory device which show that E2-NVM can achieve up to a 56% reduction in energy consumption.
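The write-placement intuition can be illustrated without the learned component (a minimal sketch with names of our choosing; E2-NVM uses a VAE to steer this decision rather than scoring segments exhaustively): choose the candidate segment whose current contents differ from the incoming value in the fewest bits, since every flipped NVM cell costs energy and wear.

```python
# Bit-flip-aware placement: write where the fewest cells must change.
def bit_flips(old: bytes, new: bytes) -> int:
    """Number of bits that must flip to overwrite `old` with `new` in place."""
    return sum(bin(a ^ b).count("1") for a, b in zip(old, new))

def pick_segment(segments: list[bytes], value: bytes) -> int:
    """Index of the candidate segment minimizing bit flips for this write."""
    return min(range(len(segments)), key=lambda i: bit_flips(segments[i], value))

segments = [b"\x00" * 4, b"\xff" * 4, b"\xf0\xf0\xf0\xf0"]
value = b"\xfe\xff\xff\xff"
i = pick_segment(segments, value)
print(i, bit_flips(segments[i], value))  # 1 1: segment 1 needs a single flip
```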
DOI: 10.48786/edbt.2023.49 | Pages 578-590 | Published 2023-01-01
Citations: 2
Describing and Assessing Cubes Through Intentional Analytics
Matteo Francia, M. Golfarelli, S. Rizzi
The Intentional Analytics Model (IAM) has been envisioned as a way to tightly couple OLAP and analytics by (i) letting users explore multidimensional cubes stating their intentions, and (ii) returning multidimensional data coupled with knowledge insights in the form of annotations of subsets of data. The goal of this demonstration is to showcase the IAM approach using a notebook where the user can create a data exploration session by writing describe and assess statements, whose results are displayed by combining tabular data and charts so as to bring the highlights discovered to the user's attention. The demonstration plan will show the effectiveness of the IAM approach in supporting data exploration and analysis, and its added value compared to a traditional OLAP session, by proposing two scenarios with guided interaction and letting users run custom sessions.
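As a rough flavour of what a describe statement could compute (the function name and the highlight rule below are ours for illustration, not the IAM syntax), one can roll up a cube along a dimension and annotate the cells that deviate most from the mean as highlights:

```python
# A much-simplified "describe"-style operator over a pandas cube.
import pandas as pd

def describe(cube: pd.DataFrame, by: str, measure: str, top_k: int = 1):
    """Aggregate `measure` by `by` and flag the most deviant cells."""
    agg = cube.groupby(by, as_index=False)[measure].sum()
    deviation = (agg[measure] - agg[measure].mean()).abs()
    agg["highlight"] = deviation.rank(ascending=False) <= top_k
    return agg

sales = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "ASIA"],
    "revenue": [10, 12, 55, 48, 11],
})
print(describe(sales, by="region", measure="revenue"))
# The US cell stands out from the mean and is annotated as a highlight.
```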
DOI: 10.48786/edbt.2023.69 | Pages 803-806 | Published 2023-01-01
Citations: 0
Detecting Stale Data in Wikipedia Infoboxes
Malte Barth, Tibor Bleidt, Martin Büßemeyer, Fabian Heseding, Niklas Köhnecke, Tobias Bleifuß, Leon Bornemann, D. Kalashnikov, Felix Naumann, D. Srivastava
Today’s fast-paced society is increasingly reliant on correct and up-to-date data. Wikipedia is the world’s most popular source of knowledge, and its infoboxes contain concise semi-structured data with important facts about a page’s topic. However, these data are not always up-to-date: we do not expect Wikipedia editors to update items at the moment their true values change. Also, many pages might not be well maintained and users might forget to update the data, e.g., when they are on holiday. To detect stale data in Wikipedia infoboxes, we combine correlation-based and rule-based approaches trained on different temporal granularities, based on all infobox changes over 15 years of English Wikipedia. We are able to predict 8.19% of all changes with a precision of 89.69% over a whole year, thus meeting our target precision of
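The rule-based side can be approximated with a simple staleness test (the estimation rule and threshold below are ours, not the paper's trained models): if an attribute has historically changed roughly once per period and its last edit is much older than that period, flag it as potentially stale.

```python
# A toy periodicity rule for flagging stale infobox attributes.
from datetime import date

def is_stale(change_dates: list[date], today: date, slack: float = 2.0) -> bool:
    """Flag an attribute whose typical update period has long been exceeded."""
    if len(change_dates) < 3:
        return False  # too little history to estimate an update period
    gaps = [(b - a).days for a, b in zip(change_dates, change_dates[1:])]
    period = sum(gaps) / len(gaps)
    return (today - change_dates[-1]).days > slack * period

# An attribute updated roughly yearly, then silent for three years:
history = [date(2018, 6, 1), date(2019, 6, 10), date(2020, 5, 28)]
print(is_stale(history, today=date(2023, 6, 1)))  # True
```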
DOI: 10.48786/edbt.2023.36 | Pages 450-456 | Published 2023-01-01
Citations: 0
PyFroid: Scaling Data Analysis on a Commodity Workstation
Venkatesh Emani, A. Floratou, C. Curino
Almost every organization today is promoting data-driven decision making leveraging advances in data science. According to various surveys, data scientists spend up to 80% of their time cleaning and transforming data. Although data management systems have been carefully optimized for such tasks over several decades, they are seldom leveraged by data scientists who prefer to use libraries such as Pandas, sacrificing performance and scalability in favor of familiarity and ease of use. As a result, data scientists are not able to fully leverage the hardware capabilities of commodity workstations and either end up working on a small sample of their data locally or migrate to more heavyweight frameworks in a cluster environment. In this paper, we present PyFroid, a framework that leverages lightweight relational databases to improve the performance and scalability of Pandas, allowing data scientists to operate on much larger datasets on a commodity workstation. PyFroid has zero learning curve as it maintains all the Pandas APIs and is fully compatible with the tools that data scientists use (e.g., Python notebooks). We experimentally demonstrate that, compared to Pandas, PyFroid is able to analyze up to 20X more data on the same machine, provide comparable or better performance for small datasets as well as near-memory data sizes, and consume much less resources.
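The core idea, pushing relational work down to an embedded engine and materializing only small results into Pandas, can be sketched with plain sqlite3 (a hand-written illustration; PyFroid performs this kind of translation automatically behind the unchanged Pandas API):

```python
# Pandas-style filter/group-by pushed down to an embedded SQLite engine.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({"user": [1, 1, 2, 2, 3],
              "amount": [5.0, 7.0, 3.0, 9.0, 4.0]}).to_sql(
    "orders", conn, index=False)

# Equivalent of df[df.amount > 4].groupby("user").amount.sum(), evaluated
# inside SQLite; only the small aggregate comes back as a DataFrame.
result = pd.read_sql_query(
    "SELECT user, SUM(amount) AS amount FROM orders "
    "WHERE amount > 4 GROUP BY user", conn)
print(result)
```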
DOI: 10.48786/edbt.2024.06 | Pages 61-67 | Published 2023-01-01
Citations: 0
KWIQ: Answering k-core Window Queries in Temporal Networks
Mahdihusain Momin, Raj Kamal, Shantwana Dixit, Sayan Ranu, A. Bagchi
Understanding the evolution of communities and the factors that contribute to their development, stability and disappearance over time is a fundamental problem in the study of temporal networks. The concept of the k-core is one of the most popular metrics to detect communities. Since the k-core of a temporal network changes with time, an important question arises: are there nodes that always remain within the k-core? In this paper, we explore this question by introducing the notion of core-invariant nodes. Given a temporal window Δ and a parameter K, the core-invariant nodes are those that are part of the K-core throughout Δ. Core-invariant nodes have been shown to dictate the stability of networks, while also being useful in detecting anomalous behavior. The complexity of finding core-invariant nodes is O(|Δ| × |E|), which is exorbitantly high for million-scale networks. We overcome this computational bottleneck by designing an algorithm called KWIQ. KWIQ efficiently processes the cascading impact of network updates through a novel data structure called the orientation graph. Through extensive experiments on real temporal networks containing millions of nodes, we establish that the proposed pruning strategies are more than 5 times faster than baseline strategies.
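For reference, the naive O(|Δ| × |E|) baseline that KWIQ improves on is direct to write with networkx: recompute the K-core for every snapshot in the window and intersect the node sets (KWIQ avoids this recomputation by processing updates incrementally through its orientation graph):

```python
# Naive core-invariant-node computation: one k-core per snapshot, intersected.
import networkx as nx

def core_invariant_nodes(snapshots, k):
    """Nodes that belong to the k-core of every snapshot in the window."""
    invariant = None
    for g in snapshots:
        core = set(nx.k_core(g, k).nodes())
        invariant = core if invariant is None else invariant & core
    return invariant or set()

g1 = nx.complete_graph(4)                   # every node has degree 3
g2 = nx.complete_graph(4)
g2.remove_edges_from([(0, 1), (0, 2)])      # node 0 keeps a single edge
print(core_invariant_nodes([g1, g2], k=2))  # {1, 2, 3}
```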
DOI: 10.48786/edbt.2023.17 | Pages 208-220 | Published 2023-01-01
Citations: 0
An Experimental Analysis of Quantile Sketches over Data Streams
Lasantha Fernando, Harsh Bindra, Khuzaima S. Daudjee
Streaming systems process large data sets in a single pass while applying operations on the data. Quantiles are one such operation used in streaming systems. Quantiles can outline the behaviour and the cumulative distribution of a data set. We study five recent quantile sketching algorithms designed for streaming settings: KLL Sketch, Moments Sketch, DDSketch, UDDSketch, and ReqSketch. Key aspects of the sketching algorithms in terms of speed, accuracy, and mergeability are examined. The accuracy of these algorithms is evaluated in Apache Flink, a popular open source streaming system, while the speed and mergeability is evaluated in a separate Java implementation. Results show that UDDSketch has the best relative-error accuracy guarantees, while DDSketch and ReqSketch also achieve consistently high accuracy, particularly with long-tailed data distributions. DDSketch has the fastest query and insertion times, while Moments Sketch has the fastest merge times. Our evaluations show that there is no single algorithm that dominates overall performance and different algorithms excel under the different accuracy and run-time performance criteria considered in our study.
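For a feel of the sketches under test, below is a small KLL example using the Apache DataSketches Python bindings (pip install datasketches; the paper's measurements instead run inside Apache Flink and a Java harness), exercising both quantile queries and mergeability:

```python
# KLL sketch: sub-linear space, mergeable approximate quantiles.
import random
from datasketches import kll_floats_sketch

left = kll_floats_sketch(200)     # k = 200 sets the accuracy/space tradeoff
right = kll_floats_sketch(200)
for _ in range(100_000):
    left.update(random.gauss(0.0, 1.0))
    right.update(random.gauss(0.0, 1.0))

left.merge(right)                 # combine sketches of two stream partitions
print(left.get_quantile(0.5))     # approximate median, close to 0.0
print(left.get_quantile(0.99))    # approximate 99th percentile, near 2.33
```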
DOI: 10.48786/edbt.2023.34 | Pages 424-436 | Published 2023-01-01
Citations: 0
VOYAGER: Automatic Computation of Visual Complexity and Aesthetics of Graph Query Interfaces
Duy Pham, S. Bhowmick
People prefer attractive visual query interfaces (VQIs). Such interfaces are paramount for enhancing the usability of graph querying frameworks. However, scant attention has been paid to the visual complexity and aesthetics of graph query interfaces. In this demonstration, we present a novel system called VOYAGER that leverages research in computer vision, human-computer interaction (HCI) and cognitive psychology to automatically compute the visual complexity and aesthetics of a graph query interface. VOYAGER can not only guide VQI designers to iteratively improve their designs to balance the usability and aesthetics of visual query interfaces, but it can also facilitate quantitative comparison of the visual complexity and aesthetics of a set of visual query interfaces. We demonstrate various innovative features of VOYAGER and its promising results.
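VOYAGER's measures are grounded in vision and HCI research; as a generic stand-in (our illustration, not VOYAGER's actual model), one classic low-level proxy for visual complexity is the Shannon entropy of an interface screenshot's intensity histogram, on which busier layouts score higher:

```python
# Histogram entropy as a crude visual-complexity proxy for a screenshot.
import numpy as np

def intensity_entropy(img: np.ndarray) -> float:
    """Shannon entropy (bits) of an 8-bit grayscale image's histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

flat = np.full((100, 100), 128, dtype=np.uint8)               # blank canvas
busy = np.random.randint(0, 256, (100, 100), dtype=np.uint8)  # dense clutter
print(intensity_entropy(flat), intensity_entropy(busy))       # ~0.0 vs ~8.0
```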
DOI: 10.48786/edbt.2023.72 | Pages 815-818 | Published 2023-01-01
Citations: 1
An Intrinsically Interpretable Entity Matching System
Andrea Baraldi, Francesco Del Buono, Francesco Guerra, Matteo Paganelli, M. Vincini
Explainable classification systems generate predictions along with a weight for each term in the input record measuring its contribution to the prediction. In the entity matching (EM) scenario, inputs are pairs of entity descriptions and the resulting explanations can be difficult to understand for the users. They can be very long and assign different impacts to similar terms located in different descriptions. To address these issues, we introduce the concept of decision units, i.e., basic information units formed either by pairs of (similar) terms, each one belonging to a different entity description, or unique terms, existing in one of the descriptions only. Decision units form a new feature space, able to represent, in a compact and meaningful way, pairs of entity descriptions. An explainable model trained on such features generates effective explanations customized for EM datasets. In this paper, we propose this idea via a three-component architecture template, which consists of a decision unit generator, a decision unit scorer, and an explainable matcher. Then, we introduce WYM (Why do You Match?), an implementation of the architecture oriented to textual EM databases. The experiments show that our approach has accuracy comparable to other state-of-the-art Deep Learning based EM models, but, unlike them, its predictions are highly interpretable.
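A rough sketch of how decision units could be assembled (the token-pairing rule below is ours, chosen for brevity; WYM scores units with a trained model): tokens from the two entity descriptions are paired when they are sufficiently similar, and leftover tokens become unique-term units.

```python
# Building decision units: paired similar terms plus unique leftovers.
from difflib import SequenceMatcher

def decision_units(left: str, right: str, threshold: float = 0.7):
    """Pair similar tokens across two descriptions; leftovers stand alone."""
    l_tokens, r_tokens = left.lower().split(), right.lower().split()
    units, used = [], set()
    for lt in l_tokens:
        sim = lambda rt: SequenceMatcher(None, lt, rt).ratio()
        best = max(r_tokens, key=sim, default=None)
        if best is not None and sim(best) >= threshold and best not in used:
            units.append((lt, best))   # a paired (similar-terms) unit
            used.add(best)
        else:
            units.append((lt, None))   # a unique term from the left entity
    units += [(None, rt) for rt in r_tokens if rt not in used]
    return units

print(decision_units("iPhone 13 Pro 128GB", "Apple iPhone13 Pro 128 GB"))
# [('iphone', 'iphone13'), ('13', None), ('pro', 'pro'),
#  ('128gb', '128'), (None, 'apple'), (None, 'gb')]
```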
DOI: 10.48786/edbt.2023.54 | Pages 645-657 | Published 2023-01-01
Citations: 0