Advances in database technology : proceedings. International Conference on Extending Database Technology最新文献

英文中文

Describing and Assessing Cubes Through Intentional Analytics 通过意向分析描述和评估多维数据集

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.69

Matteo Francia, M. Golfarelli, S. Rizzi

The Intentional Analytics Model (IAM) has been envisioned as a way to tightly couple OLAP and analytics by (i) letting users explore multidimensional cubes stating their intentions, and (ii) returning multidimensional data coupled with knowledge insights in the form of annotations of subsets of data. Goal of this demonstration is to showcase the IAM approach using a notebook where the user can create a data exploration session by writing describe and assess statements, whose results are displayed by combining tabular data and charts so as to bring the highlights discovered to the user’s attention. The demonstration plan will show the effectiveness of the IAM approach in supporting data exploration and analysis and its added value as compared to a traditional OLAP session by proposing two scenarios with guided interaction and letting users run custom sessions.

意向分析模型(IAM)被设想为一种紧密耦合OLAP和分析的方法，它可以(i)让用户探索多维数据集，说明他们的意图，(ii)以数据子集注释的形式返回多维数据，并附带知识见解。本演示的目标是使用笔记本展示IAM方法，用户可以通过编写描述和评估语句创建数据探索会话，其结果通过组合表格数据和图表显示，从而将发现的亮点引起用户的注意。演示计划将展示IAM方法在支持数据探索和分析方面的有效性，以及与传统OLAP会话相比的附加价值，该计划提出了两种带有引导交互的场景，并允许用户运行自定义会话。

引用次数: 0

Streaming Weighted Sampling over Join Queries 在连接查询上流式加权抽样

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.24

Michael Shekelyan, Graham Cormode, Qingzhi Ma, A. Shanghooshabad, P. Triantafillou

Join queries are a fundamental database tool, capturing a range of tasks that involve linking heterogeneous data sources. However, with massive table sizes, it is often impractical to keep these in memory, and we can only take one or few streaming passes over them. Moreover, building out the full join result (e.g., linking heterogeneous data sources along quasi-identifiers) can lead to a combinatorial explosion of results due to many-to-many links. Random sampling is a natural tool to boil this oversized result down to a representative subset with well-understood statistical properties, but turns out to be a challenging task due to the combinatorial nature of the sampling domain. Existing techniques in the literature focus solely on the setting with tabular data resid-ing in main memory, and do not address aspects such as stream operation, weighted sampling and more general join operators that are urgently needed in a modern data processing context. The main contribution of this work is to meet these needs with more lightweight practical approaches. First, a bijection between the sampling problem and a graph problem is introduced to support weighted sampling and common join operators. Second, the sampling techniques are refined to minimise the number of streaming passes. Third, techniques are presented to deal with very large tables under limited memory. Finally, the proposed techniques are compared to existing approaches that rely on database indices and the results indicate substantial memory savings, reduced runtimes for ad-hoc queries and competitive amortised runtimes. All pertinent code and data can be found at:

连接查询是一种基本的数据库工具，用于捕获涉及链接异构数据源的一系列任务。然而，对于巨大的表大小，将它们保存在内存中通常是不切实际的，并且我们只能对它们进行一次或几次流传递。此外，构建完整的连接结果(例如，沿着准标识符链接异构数据源)可能会由于多对多链接而导致结果的组合爆炸。随机抽样是一种自然的工具，可以将这个超大的结果归结为具有良好理解的统计属性的代表性子集，但由于抽样域的组合性质，这是一项具有挑战性的任务。文献中现有的技术仅仅关注于驻留在主存中的表格数据的设置，而没有解决现代数据处理环境中迫切需要的流操作、加权采样和更通用的连接操作等方面。这项工作的主要贡献是用更轻量级的实用方法来满足这些需求。首先，在抽样问题和图问题之间引入双射，以支持加权抽样和公共连接算子。其次，采样技术的改进，以尽量减少流的数量通过。第三，介绍了在有限内存下处理非常大的表的技术。最后，将所建议的技术与依赖数据库索引的现有方法进行比较，结果表明节省了大量内存，减少了临时查询的运行时间，并具有竞争性的分摊运行时间。有关守则及资料可于以下网址查阅:

{"title":"Streaming Weighted Sampling over Join Queries","authors":"Michael Shekelyan, Graham Cormode, Qingzhi Ma, A. Shanghooshabad, P. Triantafillou","doi":"10.48786/edbt.2023.24","DOIUrl":"https://doi.org/10.48786/edbt.2023.24","url":null,"abstract":"Join queries are a fundamental database tool, capturing a range of tasks that involve linking heterogeneous data sources. However, with massive table sizes, it is often impractical to keep these in memory, and we can only take one or few streaming passes over them. Moreover, building out the full join result (e.g., linking heterogeneous data sources along quasi-identifiers) can lead to a combinatorial explosion of results due to many-to-many links. Random sampling is a natural tool to boil this oversized result down to a representative subset with well-understood statistical properties, but turns out to be a challenging task due to the combinatorial nature of the sampling domain. Existing techniques in the literature focus solely on the setting with tabular data resid-ing in main memory, and do not address aspects such as stream operation, weighted sampling and more general join operators that are urgently needed in a modern data processing context. The main contribution of this work is to meet these needs with more lightweight practical approaches. First, a bijection between the sampling problem and a graph problem is introduced to support weighted sampling and common join operators. Second, the sampling techniques are refined to minimise the number of streaming passes. Third, techniques are presented to deal with very large tables under limited memory. Finally, the proposed techniques are compared to existing approaches that rely on database indices and the results indicate substantial memory savings, reduced runtimes for ad-hoc queries and competitive amortised runtimes. All pertinent code and data can be found at:","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"17 1","pages":"298-310"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84911541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

E2-NVM: A Memory-Aware Write Scheme to Improve Energy Efficiency and Write Endurance of NVMs using Variational Autoencoders E2-NVM:一种使用变分自编码器的内存感知写入方案，以提高nvm的能源效率和写入持久性

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.49

Saeed Kargar, Binbin Gu, S. Jyothi, Faisal Nawab

We introduce E2-NVM , a software-level memory-aware storage layer to improve the Energy efficiency and write Endurance (E2) of NVMs. E2-NVM employs a Variational Autoencoder (VAE) based design to direct the write operations judiciously to the memory segments that minimize bit flips. E2-NVM can be augmented with existing indexing solutions. E2-NVM can also be combined with prior hardware-based solutions to further improve efficiency. We performed real evaluations on an Optane memory device that show that E2-NVM can achieve up to 56% reduction in energy consumption.

我们引入了软件级内存感知存储层E2- nvm，以提高nvm的能效和写入持久性(E2)。E2-NVM采用基于变分自编码器(VAE)的设计，将写操作明智地引导到内存段，从而最大限度地减少位翻转。E2-NVM可以使用现有的索引解决方案进行扩展。E2-NVM还可以与先前基于硬件的解决方案相结合，以进一步提高效率。我们对Optane存储设备进行了实际评估，结果表明E2-NVM可以实现高达56%的能耗降低。

引用次数: 2

Detecting Stale Data in Wikipedia Infoboxes 检测维基百科信息框中的陈旧数据

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.36

Malte Barth, Tibor Bleidt, Martin Büßemeyer, Fabian Heseding, Niklas Köhnecke, Tobias Bleifuß, Leon Bornemann, D. Kalashnikov, Felix Naumann, D. Srivastava

Today’s fast-paced society is increasingly reliant on correct and up-to-date data. Wikipedia is the world’s most popular source of knowledge, and its infoboxes contain concise semi-structured data with important facts about a page’s topic. However, these data are not always up-to-date: we do not expect Wikipedia editors to update items at the moment their true values change. Also, many pages might not be well maintained and users might forget to update the data, e.g., when they are on holiday. To detect stale data in Wikipedia infoboxes, we combine cor-relation-based and rule-based approaches trained on different temporal granularities, based on all infobox changes over 15 years of English Wikipedia. We are able to predict 8 . 19% of all changes with a precision of 89 . 69% over a whole year, thus meet-ing our target precision of

当今快节奏的社会越来越依赖于正确和最新的数据。维基百科是世界上最受欢迎的知识来源，它的信息框包含简洁的半结构化数据，其中包含有关页面主题的重要事实。然而，这些数据并不总是最新的:我们不期望维基百科编辑在条目的真实价值发生变化时更新条目。此外，许多页面可能没有得到很好的维护，用户可能忘记更新数据，例如，当他们在度假时。为了检测维基百科信息框中的陈旧数据，我们结合了基于不同时间粒度的基于关联和基于规则的方法，基于15年来英文维基百科中所有信息框的变化。我们能够预测。19%的变化，精度为89。，达到了我们的目标精度

引用次数: 0

WedgeBlock: An Off-Chain Secure Logging Platform for Blockchain Applications WedgeBlock:区块链应用的链下安全日志平台

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.45

Abhishek A. Singh, Yinan Zhou, Mohammad Sadoghi, S. Mehrotra, Sharad Sharma, Faisal Nawab

Over the recent years, there has been a growing interest in building blockchain-based decentralized applications (DApps). Developing DApps faces many challenges due to the cost and high-latency of writing to a blockchain smart contract. We propose WedgeBlock , a secure data logging infrastructure for DApps. WedgeBlock ’s design reduces the performance and monetary cost of DApps with its main technical innovation called lazy-minimum trust (LMT). LMT combines the following features: (1) off-chain storage component, (2) it lazily writes digests of data—rather than all data—on-chain to minimize costs, and (3) it integrates a trust mechanism to ensure the detection and punishment of malicious acts by the Offchain Node . Our experiments show that WedgeBlock is up to 1470× faster and 310× cheaper than a baseline solution of writing directly on chain.

近年来，人们对构建基于区块链的去中心化应用程序(DApps)越来越感兴趣。由于写入区块链智能合约的成本和高延迟，开发dapp面临许多挑战。我们提出了WedgeBlock，一个安全的dapp数据记录基础设施。WedgeBlock的设计降低了dapp的性能和货币成本，其主要技术创新被称为惰性最小信任(LMT)。LMT结合了以下特点:(1)off-chain存储组件;(2)它惰性地写入数据摘要而不是链上的所有数据，以最大限度地降低成本;(3)它集成了信任机制，以确保Offchain Node检测和惩罚恶意行为。我们的实验表明，WedgeBlock比直接在链上写入的基准解决方案快1470倍，便宜310倍。

引用次数: 0

Consent Management in Data Workflows: A Graph Problem 数据工作流中的同意管理:一个图问题

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.61

Dorota Filipczuk, E. Gerding, G. Konstantinidis

Inmoderndataprocessing systemsusersexpectaserviceprovider to automatically respect their consent in all data processing within the service. However, data may be processed for many different purposes by several layers of algorithms that create complex workflows. To date, there is no existing approach to automatically satisfy fine-grained privacy constraints of a user in a way which optimises the service provider’s gains from processing. In this paper, we model a data processing workflow as a graph. User constraints and processing purposes are pairs of vertices which need to be disconnected in this graph. We propose heuristics and algorithms while at the same time we show that, in general, this problem is NP-hard. We discuss the optimality versus efficiency of our algorithms and evaluate them using synthetically generated data. On the practical side, our algorithms can provide a nearly optimal solution in the face of tens of constraints and graphs of thousands of nodes, in a few seconds.

在现代数据处理系统中，用户期望服务提供商在服务内的所有数据处理中自动尊重他们的同意。然而，数据可以通过创建复杂工作流的几层算法来处理许多不同的目的。到目前为止，还没有一种现有的方法可以自动满足用户的细粒度隐私约束，从而优化服务提供商从处理中获得的收益。在本文中，我们将数据处理工作流建模为一个图。用户约束和处理目的是图中需要断开连接的顶点对。我们提出了启发式和算法，同时我们表明，一般来说，这个问题是np困难的。我们讨论了算法的最优性和效率，并使用合成生成的数据对它们进行了评估。在实际方面，我们的算法可以在几秒钟内提供面对数十个约束和数千个节点的图的近乎最优解决方案。

引用次数: 0

KWIQ: Answering k-core Window Queries in Temporal Networks KWIQ:回答时间网络中的k核窗口查询

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.17

Mahdihusain Momin, Raj Kamal, Shantwana Dixit, Sayan Ranu, A. Bagchi

Understanding the evolution of communities and the factors that contribute to their development, stability and disappearance over time is a fundamental problem in the study of temporal networks. The concept of 𝑘 -core is one of the most popular metrics to detect communities. Since the 𝑘 -core of a temporal network changes with time, an important question arises: Are there nodes that always remain within the 𝑘 -core? In this paper, we explore this question by introducing the notion of core-invariant nodes . Given a temporal window ∆ and a parameter K , the core-invariant nodes are those that are part of the K -core throughout ∆. Core-invariant nodes have been shown to dictate the stability of networks, while being also useful in detecting anomalous behavior. The complexity of finding core-invariant nodes is 𝑂 ( | ∆ |×| 𝐸 | ), which is exorbitantly high for million-scale networks. We overcome this computational bottleneck by designing an algorithm called Kwiq. Kwiq efficiently processes the cascading impact of network updates through a novel data structure called orientation graph. Through extensive experiments on real temporal networks containing millions of nodes, we establish that the proposed pruning strategies are more than 5 times faster than baseline strategies.

了解群落的演变及其随时间发展、稳定和消失的因素是时间网络研究中的一个基本问题。𝑘-core的概念是检测社区最流行的指标之一。由于时间网络的𝑘-核心随着时间的变化而变化，因此出现了一个重要的问题:是否存在始终保持在𝑘-核心中的节点?在本文中，我们通过引入核心不变节点的概念来探讨这个问题。给定一个时间窗口∆和一个参数K，核心不变节点是整个∆中K核心的一部分。核心不变节点已被证明可以指示网络的稳定性，同时在检测异常行为方面也很有用。寻找核心不变节点的复杂度为𝑂(|∆| x | )，这对于百万规模的网络来说太高了。我们通过设计一个叫做Kwiq的算法来克服这个计算瓶颈。Kwiq通过一种称为方向图的新颖数据结构有效地处理网络更新的级联影响。通过对包含数百万节点的真实时态网络的大量实验，我们确定了所提出的修剪策略比基线策略快5倍以上。

{"title":"KWIQ: Answering k-core Window Queries in Temporal Networks","authors":"Mahdihusain Momin, Raj Kamal, Shantwana Dixit, Sayan Ranu, A. Bagchi","doi":"10.48786/edbt.2023.17","DOIUrl":"https://doi.org/10.48786/edbt.2023.17","url":null,"abstract":"Understanding the evolution of communities and the factors that contribute to their development, stability and disappearance over time is a fundamental problem in the study of temporal networks. The concept of 𝑘 -core is one of the most popular metrics to detect communities. Since the 𝑘 -core of a temporal network changes with time, an important question arises: Are there nodes that always remain within the 𝑘 -core? In this paper, we explore this question by introducing the notion of core-invariant nodes . Given a temporal window ∆ and a parameter K , the core-invariant nodes are those that are part of the K -core throughout ∆. Core-invariant nodes have been shown to dictate the stability of networks, while being also useful in detecting anomalous behavior. The complexity of finding core-invariant nodes is 𝑂 ( | ∆ |×| 𝐸 | ), which is exorbitantly high for million-scale networks. We overcome this computational bottleneck by designing an algorithm called Kwiq. Kwiq efficiently processes the cascading impact of network updates through a novel data structure called orientation graph. Through extensive experiments on real temporal networks containing millions of nodes, we establish that the proposed pruning strategies are more than 5 times faster than baseline strategies.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"55 1","pages":"208-220"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73858039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Experimental Analysis of Quantile Sketches over Data Streams 数据流上分位数草图的实验分析

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.34

Lasantha Fernando, Harsh Bindra, Khuzaima S. Daudjee

Streaming systems process large data sets in a single pass while applying operations on the data. Quantiles are one such operation used in streaming systems. Quantiles can outline the behaviour and the cumulative distribution of a data set. We study five recent quantile sketching algorithms designed for streaming settings: KLL Sketch, Moments Sketch, DDSketch, UDDSketch, and ReqSketch. Key aspects of the sketching algorithms in terms of speed, accuracy, and mergeability are examined. The accuracy of these algorithms is evaluated in Apache Flink, a popular open source streaming system, while the speed and mergeability is evaluated in a separate Java implementation. Results show that UDDSketch has the best relative-error accuracy guarantees, while DDSketch and ReqSketch also achieve consistently high accuracy, particularly with long-tailed data distributions. DDSketch has the fastest query and insertion times, while Moments Sketch has the fastest merge times. Our evaluations show that there is no single algorithm that dominates overall performance and different algorithms excel under the different accuracy and run-time performance criteria considered in our study.

流系统在对数据进行操作的同时，一次处理大型数据集。分位数就是流系统中使用的一种这样的操作。分位数可以勾勒出数据集的行为和累积分布。我们研究了最近为流设置设计的五种分位数素描算法:KLL Sketch, Moments Sketch, DDSketch, UDDSketch和ReqSketch。速写算法的关键方面在速度，准确性和可合并性方面进行了检查。这些算法的准确性在Apache Flink(一个流行的开源流系统)中进行评估，而速度和可合并性在单独的Java实现中进行评估。结果表明，UDDSketch具有最好的相对误差精度保证，而DDSketch和ReqSketch也保持了较高的精度，特别是在长尾数据分布的情况下。DDSketch具有最快的查询和插入时间，而Moments Sketch具有最快的合并时间。我们的评估表明，没有一种算法在整体性能上占主导地位，在我们研究中考虑的不同精度和运行时性能标准下，不同的算法表现优异。

{"title":"An Experimental Analysis of Quantile Sketches over Data Streams","authors":"Lasantha Fernando, Harsh Bindra, Khuzaima S. Daudjee","doi":"10.48786/edbt.2023.34","DOIUrl":"https://doi.org/10.48786/edbt.2023.34","url":null,"abstract":"Streaming systems process large data sets in a single pass while applying operations on the data. Quantiles are one such operation used in streaming systems. Quantiles can outline the behaviour and the cumulative distribution of a data set. We study five recent quantile sketching algorithms designed for streaming settings: KLL Sketch, Moments Sketch, DDSketch, UDDSketch, and ReqSketch. Key aspects of the sketching algorithms in terms of speed, accuracy, and mergeability are examined. The accuracy of these algorithms is evaluated in Apache Flink, a popular open source streaming system, while the speed and mergeability is evaluated in a separate Java implementation. Results show that UDDSketch has the best relative-error accuracy guarantees, while DDSketch and ReqSketch also achieve consistently high accuracy, particularly with long-tailed data distributions. DDSketch has the fastest query and insertion times, while Moments Sketch has the fastest merge times. Our evaluations show that there is no single algorithm that dominates overall performance and different algorithms excel under the different accuracy and run-time performance criteria considered in our study.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"56 1","pages":"424-436"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78966876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

VOYAGER: Automatic Computation of Visual Complexity and Aesthetics of Graph Query Interfaces VOYAGER:图形查询接口的视觉复杂性和美学的自动计算

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.72

Duy Pham, S. Bhowmick

People prefer attractive visual query interfaces (vqi). Such interfaces are paramount for enhancing usability of graph querying frameworks. However, scant attention has been paid to the vi- sual complexity and aesthetics of graph query interfaces. In this demonstration, we present a novel system called voyager that leverages on research in computer vision, human-computer interaction (hci) and cognitive psychology to automatically compute the visual complexity and aesthetics of a graph query interface. voyager can not only guide vqi designers to iteratively improve their design to balance usability and aesthetics of visual query interfaces but it can also facilitate quantitative comparison of the visual complexity and aesthetics of a set of visual query interfaces. We demonstrate various innovative features of voyager and its promising results.

人们更喜欢有吸引力的可视化查询界面(vqi)。这样的接口对于增强图形查询框架的可用性至关重要。然而，图形查询界面的视觉复杂性和美观性却很少受到关注。在本次演示中，我们展示了一个名为voyager的新系统，该系统利用计算机视觉、人机交互(hci)和认知心理学的研究来自动计算图形查询界面的视觉复杂性和美学。Voyager不仅可以指导vqi设计师迭代改进他们的设计，以平衡视觉查询界面的可用性和美观性，而且还可以促进一组视觉查询界面的视觉复杂性和美观性的定量比较。我们展示了航海家号的各种创新功能及其有希望的结果。

引用次数: 1

An Intrinsically Interpretable Entity Matching System 一个内在可解释的实体匹配系统

Advances in database technology : proceedings. International Conference on Extending Database Technology

Pub Date : 2023-01-01 DOI: 10.48786/edbt.2023.54

Andrea Baraldi, Francesco Del Buono, Francesco Guerra, Matteo Paganelli, M. Vincini

Explainable classification systems generate predictions along with a weight for each term in the input record measuring its contribution to the prediction. In the entity matching (EM) scenario, inputs are pairs of entity descriptions and the resulting explanations can be difficult to understand for the users. They can be very long and assign different impacts to similar terms located in different descriptions. To address these issues, we introduce the concept of decision units, i.e., basic information units formed either by pairs of (similar) terms, each one belonging to a different entity description, or unique terms, existing in one of the descriptions only. Decision units form a new feature space, able to represent, in a compact and meaningful way, pairs of entity descriptions. An explainable model trained on such features generates effective explanations customized for EM datasets. In this paper, we propose this idea via a three-component architecture template, which consists of a decision unit generator, a decision unit scorer, and an explainable matcher. Then, we introduce WYM (Why do You Match?), an implementation of the architecture oriented to textual EM databases. The experiments show that our approach has accuracy comparable to other state-of-the-art Deep Learning based EM models, but, differently from them, its predictions are highly interpretable.

可解释的分类系统生成预测，并为输入记录中的每个术语提供权重，以衡量其对预测的贡献。在实体匹配(EM)场景中，输入是成对的实体描述，结果的解释对于用户来说可能很难理解。它们可以很长，并将不同的影响分配给位于不同描述中的相似术语。为了解决这些问题，我们引入了决策单元的概念，即，基本信息单元由(相似的)术语对组成，每个术语属于不同的实体描述，或者唯一的术语，只存在于一个描述中。决策单元形成一个新的特征空间，能够以紧凑和有意义的方式表示成对的实体描述。在这些特征上训练的可解释模型生成针对EM数据集定制的有效解释。在本文中，我们通过一个三组件架构模板提出了这个想法，该模板由决策单元生成器、决策单元评分器和可解释的匹配器组成。然后，我们介绍了WYM (Why do You Match?)，一种面向文本EM数据库的体系结构实现。实验表明，我们的方法具有与其他最先进的基于深度学习的EM模型相当的准确性，但是，与它们不同的是，它的预测是高度可解释的。

{"title":"An Intrinsically Interpretable Entity Matching System","authors":"Andrea Baraldi, Francesco Del Buono, Francesco Guerra, Matteo Paganelli, M. Vincini","doi":"10.48786/edbt.2023.54","DOIUrl":"https://doi.org/10.48786/edbt.2023.54","url":null,"abstract":"Explainable classification systems generate predictions along with a weight for each term in the input record measuring its contribution to the prediction. In the entity matching (EM) scenario, inputs are pairs of entity descriptions and the resulting explanations can be difficult to understand for the users. They can be very long and assign different impacts to similar terms located in different descriptions. To address these issues, we introduce the concept of decision units, i.e., basic information units formed either by pairs of (similar) terms, each one belonging to a different entity description, or unique terms, existing in one of the descriptions only. Decision units form a new feature space, able to represent, in a compact and meaningful way, pairs of entity descriptions. An explainable model trained on such features generates effective explanations customized for EM datasets. In this paper, we propose this idea via a three-component architecture template, which consists of a decision unit generator, a decision unit scorer, and an explainable matcher. Then, we introduce WYM (Why do You Match?), an implementation of the architecture oriented to textual EM databases. The experiments show that our approach has accuracy comparable to other state-of-the-art Deep Learning based EM models, but, differently from them, its predictions are highly interpretable.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"31 1","pages":"645-657"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87061940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Advances in database technology : proceedings. International Conference on Extending Database Technology

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀