The Intentional Analytics Model (IAM) has been envisioned as a way to tightly couple OLAP and analytics by (i) letting users explore multidimensional cubes by stating their intentions, and (ii) returning multidimensional data coupled with knowledge insights in the form of annotations of subsets of data. The goal of this demonstration is to showcase the IAM approach using a notebook in which the user creates a data exploration session by writing describe and assess statements, whose results are displayed by combining tabular data and charts so as to bring the discovered highlights to the user's attention. The demonstration plan shows the effectiveness of the IAM approach in supporting data exploration and analysis, and its added value compared to a traditional OLAP session, by proposing two scenarios with guided interaction and letting users run custom sessions.
{"title":"Describing and Assessing Cubes Through Intentional Analytics","authors":"Matteo Francia, M. Golfarelli, S. Rizzi","doi":"10.48786/edbt.2023.69","DOIUrl":"https://doi.org/10.48786/edbt.2023.69","url":null,"abstract":"The Intentional Analytics Model (IAM) has been envisioned as a way to tightly couple OLAP and analytics by (i) letting users explore multidimensional cubes stating their intentions, and (ii) returning multidimensional data coupled with knowledge insights in the form of annotations of subsets of data. Goal of this demonstration is to showcase the IAM approach using a notebook where the user can create a data exploration session by writing describe and assess statements, whose results are displayed by combining tabular data and charts so as to bring the highlights discovered to the user’s attention. The demonstration plan will show the effectiveness of the IAM approach in supporting data exploration and analysis and its added value as compared to a traditional OLAP session by proposing two scenarios with guided interaction and letting users run custom sessions.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"2 1","pages":"803-806"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84552661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Shekelyan, Graham Cormode, Qingzhi Ma, A. Shanghooshabad, P. Triantafillou
Join queries are a fundamental database tool, capturing a range of tasks that involve linking heterogeneous data sources. However, with massive table sizes, it is often impractical to keep these in memory, and we can only take one or a few streaming passes over them. Moreover, building out the full join result (e.g., linking heterogeneous data sources along quasi-identifiers) can lead to a combinatorial explosion of results due to many-to-many links. Random sampling is a natural tool to boil this oversized result down to a representative subset with well-understood statistical properties, but it turns out to be a challenging task due to the combinatorial nature of the sampling domain. Existing techniques in the literature focus solely on the setting with tabular data residing in main memory, and do not address aspects such as stream operation, weighted sampling, and more general join operators that are urgently needed in a modern data processing context. The main contribution of this work is to meet these needs with more lightweight, practical approaches. First, a bijection between the sampling problem and a graph problem is introduced to support weighted sampling and common join operators. Second, the sampling techniques are refined to minimise the number of streaming passes. Third, techniques are presented to deal with very large tables under limited memory. Finally, the proposed techniques are compared to existing approaches that rely on database indices; the results indicate substantial memory savings, reduced runtimes for ad-hoc queries, and competitive amortised runtimes. All pertinent code and data can be found at:
{"title":"Streaming Weighted Sampling over Join Queries","authors":"Michael Shekelyan, Graham Cormode, Qingzhi Ma, A. Shanghooshabad, P. Triantafillou","doi":"10.48786/edbt.2023.24","DOIUrl":"https://doi.org/10.48786/edbt.2023.24","url":null,"abstract":"Join queries are a fundamental database tool, capturing a range of tasks that involve linking heterogeneous data sources. However, with massive table sizes, it is often impractical to keep these in memory, and we can only take one or few streaming passes over them. Moreover, building out the full join result (e.g., linking heterogeneous data sources along quasi-identifiers) can lead to a combinatorial explosion of results due to many-to-many links. Random sampling is a natural tool to boil this oversized result down to a representative subset with well-understood statistical properties, but turns out to be a challenging task due to the combinatorial nature of the sampling domain. Existing techniques in the literature focus solely on the setting with tabular data resid-ing in main memory, and do not address aspects such as stream operation, weighted sampling and more general join operators that are urgently needed in a modern data processing context. The main contribution of this work is to meet these needs with more lightweight practical approaches. First, a bijection between the sampling problem and a graph problem is introduced to support weighted sampling and common join operators. Second, the sampling techniques are refined to minimise the number of streaming passes. Third, techniques are presented to deal with very large tables under limited memory. Finally, the proposed techniques are compared to existing approaches that rely on database indices and the results indicate substantial memory savings, reduced runtimes for ad-hoc queries and competitive amortised runtimes. All pertinent code and data can be found at:","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"17 1","pages":"298-310"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84911541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce E2-NVM, a software-level memory-aware storage layer that improves the Energy efficiency and write Endurance (E2) of NVMs. E2-NVM employs a Variational Autoencoder (VAE) based design to direct write operations judiciously to the memory segments that minimize bit flips. E2-NVM can be augmented with existing indexing solutions, and it can also be combined with prior hardware-based solutions to further improve efficiency. Our evaluations on a real Optane memory device show that E2-NVM can achieve up to a 56% reduction in energy consumption.
{"title":"E2-NVM: A Memory-Aware Write Scheme to Improve Energy Efficiency and Write Endurance of NVMs using Variational Autoencoders","authors":"Saeed Kargar, Binbin Gu, S. Jyothi, Faisal Nawab","doi":"10.48786/edbt.2023.49","DOIUrl":"https://doi.org/10.48786/edbt.2023.49","url":null,"abstract":"We introduce E2-NVM , a software-level memory-aware storage layer to improve the Energy efficiency and write Endurance (E2) of NVMs. E2-NVM employs a Variational Autoencoder (VAE) based design to direct the write operations judiciously to the memory segments that minimize bit flips. E2-NVM can be augmented with existing indexing solutions. E2-NVM can also be combined with prior hardware-based solutions to further improve efficiency. We performed real evaluations on an Optane memory device that show that E2-NVM can achieve up to 56% reduction in energy consumption.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"10 1","pages":"578-590"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82318149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Malte Barth, Tibor Bleidt, Martin Büßemeyer, Fabian Heseding, Niklas Köhnecke, Tobias Bleifuß, Leon Bornemann, D. Kalashnikov, Felix Naumann, D. Srivastava
Today’s fast-paced society is increasingly reliant on correct and up-to-date data. Wikipedia is the world’s most popular source of knowledge, and its infoboxes contain concise semi-structured data with important facts about a page’s topic. However, these data are not always up-to-date: we do not expect Wikipedia editors to update items at the moment their true values change. Also, many pages might not be well maintained and users might forget to update the data, e.g., when they are on holiday. To detect stale data in Wikipedia infoboxes, we combine correlation-based and rule-based approaches trained on different temporal granularities, based on all infobox changes over 15 years of English Wikipedia. We are able to predict 8.19% of all changes with a precision of 89.69% over a whole year, thus meeting our target precision of
{"title":"Detecting Stale Data in Wikipedia Infoboxes","authors":"Malte Barth, Tibor Bleidt, Martin Büßemeyer, Fabian Heseding, Niklas Köhnecke, Tobias Bleifuß, Leon Bornemann, D. Kalashnikov, Felix Naumann, D. Srivastava","doi":"10.48786/edbt.2023.36","DOIUrl":"https://doi.org/10.48786/edbt.2023.36","url":null,"abstract":"Today’s fast-paced society is increasingly reliant on correct and up-to-date data. Wikipedia is the world’s most popular source of knowledge, and its infoboxes contain concise semi-structured data with important facts about a page’s topic. However, these data are not always up-to-date: we do not expect Wikipedia editors to update items at the moment their true values change. Also, many pages might not be well maintained and users might forget to update the data, e.g., when they are on holiday. To detect stale data in Wikipedia infoboxes, we combine cor-relation-based and rule-based approaches trained on different temporal granularities, based on all infobox changes over 15 years of English Wikipedia. We are able to predict 8 . 19% of all changes with a precision of 89 . 69% over a whole year, thus meet-ing our target precision of","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"29 1","pages":"450-456"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82545399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abhishek A. Singh, Yinan Zhou, Mohammad Sadoghi, S. Mehrotra, Sharad Sharma, Faisal Nawab
In recent years, there has been a growing interest in building blockchain-based decentralized applications (DApps). Developing DApps faces many challenges due to the cost and high latency of writing to a blockchain smart contract. We propose WedgeBlock, a secure data logging infrastructure for DApps. WedgeBlock's design reduces the performance and monetary cost of DApps with its main technical innovation, called lazy-minimum trust (LMT). LMT combines the following features: (1) an off-chain storage component, (2) lazy writing of digests of data (rather than all data) on-chain to minimize costs, and (3) an integrated trust mechanism to ensure the detection and punishment of malicious acts by the Offchain Node. Our experiments show that WedgeBlock is up to 1470× faster and 310× cheaper than a baseline solution that writes directly on-chain.
{"title":"WedgeBlock: An Off-Chain Secure Logging Platform for Blockchain Applications","authors":"Abhishek A. Singh, Yinan Zhou, Mohammad Sadoghi, S. Mehrotra, Sharad Sharma, Faisal Nawab","doi":"10.48786/edbt.2023.45","DOIUrl":"https://doi.org/10.48786/edbt.2023.45","url":null,"abstract":"Over the recent years, there has been a growing interest in building blockchain-based decentralized applications (DApps). Developing DApps faces many challenges due to the cost and high-latency of writing to a blockchain smart contract. We propose WedgeBlock , a secure data logging infrastructure for DApps. WedgeBlock ’s design reduces the performance and monetary cost of DApps with its main technical innovation called lazy-minimum trust (LMT). LMT combines the following features: (1) off-chain storage component, (2) it lazily writes digests of data—rather than all data—on-chain to minimize costs, and (3) it integrates a trust mechanism to ensure the detection and punishment of malicious acts by the Offchain Node . Our experiments show that WedgeBlock is up to 1470× faster and 310× cheaper than a baseline solution of writing directly on chain.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"17 1","pages":"526-539"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89906455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In modern data processing systems, users expect a service provider to automatically respect their consent in all data processing within the service. However, data may be processed for many different purposes by several layers of algorithms that create complex workflows. To date, there is no existing approach to automatically satisfy the fine-grained privacy constraints of a user in a way that optimises the service provider's gains from processing. In this paper, we model a data processing workflow as a graph. User constraints and processing purposes are pairs of vertices which need to be disconnected in this graph. We show that, in general, this problem is NP-hard, and propose heuristics and algorithms to solve it. We discuss the optimality versus efficiency of our algorithms and evaluate them using synthetically generated data. On the practical side, our algorithms can provide a nearly optimal solution in the face of tens of constraints and graphs of thousands of nodes, in a few seconds.
{"title":"Consent Management in Data Workflows: A Graph Problem","authors":"Dorota Filipczuk, E. Gerding, G. Konstantinidis","doi":"10.48786/edbt.2023.61","DOIUrl":"https://doi.org/10.48786/edbt.2023.61","url":null,"abstract":"Inmoderndataprocessing systemsusersexpectaserviceprovider to automatically respect their consent in all data processing within the service. However, data may be processed for many different purposes by several layers of algorithms that create complex workflows. To date, there is no existing approach to automatically satisfy fine-grained privacy constraints of a user in a way which optimises the service provider’s gains from processing. In this paper, we model a data processing workflow as a graph. User constraints and processing purposes are pairs of vertices which need to be disconnected in this graph. We propose heuristics and algorithms while at the same time we show that, in general, this problem is NP-hard. We discuss the optimality versus efficiency of our algorithms and evaluate them using synthetically generated data. On the practical side, our algorithms can provide a nearly optimal solution in the face of tens of constraints and graphs of thousands of nodes, in a few seconds.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"35 1","pages":"737-748"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87213118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mahdihusain Momin, Raj Kamal, Shantwana Dixit, Sayan Ranu, A. Bagchi
Understanding the evolution of communities and the factors that contribute to their development, stability, and disappearance over time is a fundamental problem in the study of temporal networks. The concept of the k-core is one of the most popular metrics to detect communities. Since the k-core of a temporal network changes with time, an important question arises: are there nodes that always remain within the k-core? In this paper, we explore this question by introducing the notion of core-invariant nodes. Given a temporal window ∆ and a parameter K, the core-invariant nodes are those that are part of the K-core throughout ∆. Core-invariant nodes have been shown to dictate the stability of networks, while also being useful in detecting anomalous behavior. The complexity of finding core-invariant nodes is O(|∆| × |E|), which is exorbitantly high for million-scale networks. We overcome this computational bottleneck by designing an algorithm called Kwiq. Kwiq efficiently processes the cascading impact of network updates through a novel data structure called the orientation graph. Through extensive experiments on real temporal networks containing millions of nodes, we establish that the proposed pruning strategies are more than 5 times faster than baseline strategies.
{"title":"KWIQ: Answering k-core Window Queries in Temporal Networks","authors":"Mahdihusain Momin, Raj Kamal, Shantwana Dixit, Sayan Ranu, A. Bagchi","doi":"10.48786/edbt.2023.17","DOIUrl":"https://doi.org/10.48786/edbt.2023.17","url":null,"abstract":"Understanding the evolution of communities and the factors that contribute to their development, stability and disappearance over time is a fundamental problem in the study of temporal networks. The concept of 𝑘 -core is one of the most popular metrics to detect communities. Since the 𝑘 -core of a temporal network changes with time, an important question arises: Are there nodes that always remain within the 𝑘 -core? In this paper, we explore this question by introducing the notion of core-invariant nodes . Given a temporal window ∆ and a parameter K , the core-invariant nodes are those that are part of the K -core throughout ∆. Core-invariant nodes have been shown to dictate the stability of networks, while being also useful in detecting anomalous behavior. The complexity of finding core-invariant nodes is 𝑂 ( | ∆ |×| 𝐸 | ), which is exorbitantly high for million-scale networks. We overcome this computational bottleneck by designing an algorithm called Kwiq. Kwiq efficiently processes the cascading impact of network updates through a novel data structure called orientation graph. Through extensive experiments on real temporal networks containing millions of nodes, we establish that the proposed pruning strategies are more than 5 times faster than baseline strategies.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"55 1","pages":"208-220"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73858039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lasantha Fernando, Harsh Bindra, Khuzaima S. Daudjee
Streaming systems process large data sets in a single pass while applying operations on the data. Quantiles are one such operation used in streaming systems. Quantiles can outline the behaviour and the cumulative distribution of a data set. We study five recent quantile sketching algorithms designed for streaming settings: KLL Sketch, Moments Sketch, DDSketch, UDDSketch, and ReqSketch. Key aspects of the sketching algorithms in terms of speed, accuracy, and mergeability are examined. The accuracy of these algorithms is evaluated in Apache Flink, a popular open source streaming system, while the speed and mergeability are evaluated in a separate Java implementation. Results show that UDDSketch has the best relative-error accuracy guarantees, while DDSketch and ReqSketch also achieve consistently high accuracy, particularly with long-tailed data distributions. DDSketch has the fastest query and insertion times, while Moments Sketch has the fastest merge times. Our evaluations show that no single algorithm dominates overall performance, and different algorithms excel under the different accuracy and run-time performance criteria considered in our study.
{"title":"An Experimental Analysis of Quantile Sketches over Data Streams","authors":"Lasantha Fernando, Harsh Bindra, Khuzaima S. Daudjee","doi":"10.48786/edbt.2023.34","DOIUrl":"https://doi.org/10.48786/edbt.2023.34","url":null,"abstract":"Streaming systems process large data sets in a single pass while applying operations on the data. Quantiles are one such operation used in streaming systems. Quantiles can outline the behaviour and the cumulative distribution of a data set. We study five recent quantile sketching algorithms designed for streaming settings: KLL Sketch, Moments Sketch, DDSketch, UDDSketch, and ReqSketch. Key aspects of the sketching algorithms in terms of speed, accuracy, and mergeability are examined. The accuracy of these algorithms is evaluated in Apache Flink, a popular open source streaming system, while the speed and mergeability is evaluated in a separate Java implementation. Results show that UDDSketch has the best relative-error accuracy guarantees, while DDSketch and ReqSketch also achieve consistently high accuracy, particularly with long-tailed data distributions. DDSketch has the fastest query and insertion times, while Moments Sketch has the fastest merge times. Our evaluations show that there is no single algorithm that dominates overall performance and different algorithms excel under the different accuracy and run-time performance criteria considered in our study.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"56 1","pages":"424-436"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78966876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
People prefer attractive visual query interfaces (VQIs). Such interfaces are paramount for enhancing the usability of graph querying frameworks. However, scant attention has been paid to the visual complexity and aesthetics of graph query interfaces. In this demonstration, we present a novel system called VOYAGER that leverages research in computer vision, human-computer interaction (HCI), and cognitive psychology to automatically compute the visual complexity and aesthetics of a graph query interface. VOYAGER can not only guide VQI designers to iteratively improve their designs to balance the usability and aesthetics of visual query interfaces, but it can also facilitate quantitative comparison of the visual complexity and aesthetics of a set of visual query interfaces. We demonstrate various innovative features of VOYAGER and its promising results.
{"title":"VOYAGER: Automatic Computation of Visual Complexity and Aesthetics of Graph Query Interfaces","authors":"Duy Pham, S. Bhowmick","doi":"10.48786/edbt.2023.72","DOIUrl":"https://doi.org/10.48786/edbt.2023.72","url":null,"abstract":"People prefer attractive visual query interfaces (vqi). Such interfaces are paramount for enhancing usability of graph querying frameworks. However, scant attention has been paid to the vi- sual complexity and aesthetics of graph query interfaces. In this demonstration, we present a novel system called voyager that leverages on research in computer vision, human-computer interaction (hci) and cognitive psychology to automatically compute the visual complexity and aesthetics of a graph query interface. voyager can not only guide vqi designers to iteratively improve their design to balance usability and aesthetics of visual query interfaces but it can also facilitate quantitative comparison of the visual complexity and aesthetics of a set of visual query interfaces. We demonstrate various innovative features of voyager and its promising results.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"151 1","pages":"815-818"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77798363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrea Baraldi, Francesco Del Buono, Francesco Guerra, Matteo Paganelli, M. Vincini
Explainable classification systems generate predictions along with a weight for each term in the input record measuring its contribution to the prediction. In the entity matching (EM) scenario, inputs are pairs of entity descriptions, and the resulting explanations can be difficult for users to understand. They can be very long and assign different impacts to similar terms located in different descriptions. To address these issues, we introduce the concept of decision units, i.e., basic information units formed either by pairs of (similar) terms, each belonging to a different entity description, or by unique terms that exist in only one of the descriptions. Decision units form a new feature space, able to represent, in a compact and meaningful way, pairs of entity descriptions. An explainable model trained on such features generates effective explanations customized for EM datasets. In this paper, we realize this idea via a three-component architecture template, which consists of a decision unit generator, a decision unit scorer, and an explainable matcher. We then introduce WYM (Why do You Match?), an implementation of the architecture oriented to textual EM databases. The experiments show that our approach has accuracy comparable to other state-of-the-art Deep Learning based EM models but, unlike them, its predictions are highly interpretable.
{"title":"An Intrinsically Interpretable Entity Matching System","authors":"Andrea Baraldi, Francesco Del Buono, Francesco Guerra, Matteo Paganelli, M. Vincini","doi":"10.48786/edbt.2023.54","DOIUrl":"https://doi.org/10.48786/edbt.2023.54","url":null,"abstract":"Explainable classification systems generate predictions along with a weight for each term in the input record measuring its contribution to the prediction. In the entity matching (EM) scenario, inputs are pairs of entity descriptions and the resulting explanations can be difficult to understand for the users. They can be very long and assign different impacts to similar terms located in different descriptions. To address these issues, we introduce the concept of decision units, i.e., basic information units formed either by pairs of (similar) terms, each one belonging to a different entity description, or unique terms, existing in one of the descriptions only. Decision units form a new feature space, able to represent, in a compact and meaningful way, pairs of entity descriptions. An explainable model trained on such features generates effective explanations customized for EM datasets. In this paper, we propose this idea via a three-component architecture template, which consists of a decision unit generator, a decision unit scorer, and an explainable matcher. Then, we introduce WYM (Why do You Match?), an implementation of the architecture oriented to textual EM databases. The experiments show that our approach has accuracy comparable to other state-of-the-art Deep Learning based EM models, but, differently from them, its predictions are highly interpretable.","PeriodicalId":88813,"journal":{"name":"Advances in database technology : proceedings. International Conference on Extending Database Technology","volume":"31 1","pages":"645-657"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87061940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}