The Euclidean Shortest Path Problem (ESPP), which involves finding the shortest path in a Euclidean plane with polygonal obstacles, is a classic problem with numerous real-world applications. The current state-of-the-art solution, Euclidean Hub Labeling (EHL), offers ultra-fast query performance, outperforming existing techniques by 1-2 orders of magnitude in runtime efficiency. However, this performance comes at the cost of significant memory overhead, requiring up to tens of gigabytes of storage on large maps, which can limit its applicability in memory-constrained environments such as mobile phones or smaller devices. Additionally, EHL's memory usage can only be determined after index construction, and while it provides a memory-runtime tradeoff, it does not fully optimize memory utilization. In this work, we introduce an improved version of EHL, called EHL*, which overcomes these limitations. A key contribution of EHL* is its ability to create an index that adheres to a specified memory budget while optimizing query runtime performance. Moreover, EHL* can leverage query distributions known in advance, a common scenario in many real-world applications, to further enhance runtime efficiency. Our results show that EHL* can reduce memory usage by up to 10-20 times compared to EHL without much impact on query runtime performance, making it a highly effective solution for optimal pathfinding in memory-constrained environments.
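For intuition, here is a minimal sketch of the hub-labeling query that underlies EHL: each query point carries a label of hubs with precomputed distances, and a query takes the minimum over hubs common to both labels. The label contents, the pruning heuristic, and all names below are illustrative assumptions, not EHL*'s actual data layout or budgeting strategy.

import math

# Illustrative hub labels: point -> {hub: precomputed obstacle-avoiding distance}.
# A real EHL-style index derives these from the visibility graph of the obstacles;
# here they are hard-coded toy values.
labels = {
    "s": {"h1": 2.0, "h2": 5.0},
    "t": {"h1": 4.5, "h3": 1.0},
}

def query(a, b, labels):
    """Hub-labeling query: shortest distance via the best common hub."""
    common = labels[a].keys() & labels[b].keys()
    if not common:
        return math.inf  # no common hub retained, e.g. after budget-driven pruning
    return min(labels[a][h] + labels[b][h] for h in common)

def prune_to_budget(labels, max_entries):
    """Naive memory-budget heuristic: keep only the smallest-distance entries per point."""
    return {
        p: dict(sorted(hubs.items(), key=lambda kv: kv[1])[:max_entries])
        for p, hubs in labels.items()
    }

print(query("s", "t", labels))                      # 6.5 via h1
print(query("s", "t", prune_to_budget(labels, 1)))  # a crude budget may lose the common hub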
"EHL*: Memory-Budgeted Indexing for Ultrafast Optimal Euclidean Pathfinding". Jinchun Du, Bojie Shen, Muhammad Aamir Cheema. arXiv:2408.11341, 2024-08-21.
Privacy-preservation policies are guidelines formulated to protect data providers' private data. Previous privacy-preservation methodologies have addressed privacy for data that are permanently stored in repositories and disconnected from the data providers' changing privacy preferences. This becomes evident as data move to other data repositories. Hence, enabling data providers to control and flexibly update their existing privacy preferences as data usage changes remains an open problem. This paper proposes a blockchain-based methodology for preserving data providers' private and sensitive data. The approach tightly couples a data provider's private attribute data element, their privacy preferences, and the data accessor data element into a privacy tuple. The implementation presents a framework of tightly-coupled relational databases and blockchains, delivering a secure, tamper-resistant, and query-efficient platform for data management and query processing. The evaluation of the implementation validates efficient processing of privacy-aware queries on the privacy infrastructure.
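A minimal sketch of the privacy-tuple idea described above, written in Python rather than as the paper's actual schema; the field names, the consent check, and the hash chaining used to mimic a tamper-evident ledger are all illustrative assumptions.

import hashlib, json
from dataclasses import dataclass, field

@dataclass
class PrivacyTuple:
    # Tightly couples a provider's private attribute, their current privacy
    # preference, and the permitted data accessors (all fields are illustrative).
    provider_id: str
    attribute: str            # e.g. "blood_type"
    preference: str           # e.g. "research_only", "no_sharing"
    allowed_accessors: set = field(default_factory=set)

def block_hash(record, prev_hash):
    """Hash-chain a serialized privacy tuple to the previous block (toy ledger)."""
    payload = json.dumps({**record.__dict__,
                          "allowed_accessors": sorted(record.allowed_accessors),
                          "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def may_access(t: PrivacyTuple, accessor: str) -> bool:
    """Privacy-aware query gate: consult the provider's current preference."""
    return t.preference != "no_sharing" and accessor in t.allowed_accessors

t = PrivacyTuple("p42", "blood_type", "research_only", {"hospital_A"})
print(may_access(t, "hospital_A"), may_access(t, "ad_network"))  # True False
print(block_hash(t, prev_hash="GENESIS")[:16])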
"Privacy-Preserving Data Management using Blockchains". Michael Mireku Kwakye. arXiv:2408.11263, 2024-08-21.
Daomin Ji, Hui Luo, Zhifeng Bao, J. Shane Culpepper
Line charts are a valuable tool for data analysis and exploration, distilling essential insights from a dataset. However, access to the underlying dataset behind a line chart is rarely readily available. In this paper, we explore a novel dataset discovery problem, dataset discovery via line charts, focusing on the use of line charts as queries to discover datasets within a large data repository that are capable of generating similar line charts. To solve this problem, we propose a novel approach called Fine-grained Cross-modal Relevance Learning Model (FCM), which aims to estimate the relevance between a line chart and a candidate dataset. To achieve this goal, FCM first employs a visual element extractor to extract informative visual elements, i.e., lines and y-ticks, from a line chart. Then, two novel segment-level encoders are adopted to learn representations for a line chart and a dataset, preserving fine-grained information, followed by a cross-modal matcher to match the learned representations in a fine-grained way. Furthermore, we extend FCM to support line chart queries generated based on data aggregation. Last, we propose a benchmark tailored for this problem since no such dataset exists. Extensive evaluation on the new benchmark verifies the effectiveness of our proposed method. Specifically, our proposed approach surpasses the best baseline by 30.1% and 41.0% in terms of prec@50 and ndcg@50, respectively.
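A heavily simplified sketch of the fine-grained matching idea: embed chart segments and dataset column segments, score all segment pairs, and aggregate into a chart-to-dataset relevance score. The embedding function and aggregation below are placeholder assumptions; FCM's segment-level encoders and cross-modal matcher are learned neural components.

import numpy as np

def embed_segment(values, dim=8):
    """Placeholder segment encoder: summary statistics padded to a fixed dimension.
    FCM learns these representations; this only makes the pipeline concrete."""
    v = np.asarray(values, dtype=float)
    stats = np.array([v.mean(), v.std(), v.min(), v.max(), v[-1] - v[0]])
    out = np.zeros(dim)
    out[:len(stats)] = stats
    return out / (np.linalg.norm(out) + 1e-9)

def relevance(chart_segments, dataset_segments):
    """Fine-grained matching: best dataset segment per chart segment, averaged."""
    C = np.stack([embed_segment(s) for s in chart_segments])
    D = np.stack([embed_segment(s) for s in dataset_segments])
    sims = C @ D.T                      # cosine similarities (embeddings are unit norm)
    return sims.max(axis=1).mean()      # aggregate the per-segment best matches

chart = [[1, 2, 4, 8], [8, 4, 2, 1]]                 # line segments extracted from a chart
candidate = [[1.1, 2.2, 3.9, 8.3], [5, 5, 5, 5]]     # column segments of a candidate dataset
print(round(float(relevance(chart, candidate)), 3))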
"The Story Behind the Lines: Line Charts as a Gateway to Dataset Discovery". Daomin Ji, Hui Luo, Zhifeng Bao, J. Shane Culpepper. arXiv:2408.09506, 2024-08-18.
Temporal conceptual data modelling, as an extension to regular conceptual data modelling languages such as EER and UML class diagrams, has received intermittent attention across the decades. It is receiving renewed interest in the context of, among others, business process modelling, which needs robust, expressive data models to complement it. However, none of the proposed temporal conceptual data modelling languages have been tested for understandability and usability by modellers, nor is it clear which temporal constraints modellers would actually use, or whether the ones included are the relevant temporal constraints. We therefore sought to investigate temporal representations in temporal conceptual data modelling languages, to design the most expressive such language to date, TREND, through small-scale qualitative experiments, and to finalise the graphical notation and assess modelling and understanding in large-scale experiments. This involved a series of 11 experiments with over a thousand participants in total, who created 246 temporal conceptual data models. Key outcomes are that the choice of label for transition constraints had limited impact, as did extending the explanations of the modelling language, but expressing what needs to be modelled in controlled natural language did improve model quality. The experiments also indicate that more training may be needed, in particular guidance for domain experts, to achieve adoption of temporal conceptual data modelling by the community.
"The temporal conceptual data modelling language TREND". Sonia Berman, C. Maria Keet, Tamindran Shunmugam. arXiv:2408.09427, 2024-08-18.
The NFDI4DataScience (NFDI4DS) project aims to enhance the accessibility and interoperability of research data within Data Science (DS) and Artificial Intelligence (AI) by connecting digital artifacts and ensuring they adhere to FAIR (Findable, Accessible, Interoperable, and Reusable) principles. To this end, this poster introduces the NFDI4DS Ontology, which describes resources in DS and AI and models the structure of the NFDI4DS consortium. Built upon the NFDICore ontology and mapped to the Basic Formal Ontology (BFO), this ontology serves as the foundation for the NFDI4DS knowledge graph currently under development.
"NFDI4DSO: Towards a BFO Compliant Ontology for Data Science". Genet Asefa Gesese, Jörg Waitelonis, Zongxiong Chen, Sonja Schimmler, Harald Sack. arXiv:2408.08698, 2024-08-16.
This paper presents the current version of our (Elementary) Mathematical Data Model ((E)MDM), which is based on the naïve theory of sets, relations, and functions, as well as on the first-order predicate calculus with equality. Many real-life examples illustrate its 4 types of sets, 4 types of functions, and 76 types of constraints. This rich panoply of constraints is the main strength of the model, guaranteeing that any data value stored in a database is plausible, which is the highest possible level of syntactic data quality. An (E)MDM example scheme is presented and contrasted with some popular family-tree software products. The paper also presents the main (E)MDM-related approaches in data modeling and processing.
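For flavour, here is a small fragment in the spirit of (E)MDM's sets, functions, and constraints, using a family-tree example; the notation and the particular constraints are our own hedged reconstruction, not taken from the paper.

\[
\mathit{Mother} : \mathit{PERSONS} \rightarrow \mathit{PERSONS} \cup \{\mathit{unknown}\}, \qquad
\mathit{BirthYear} : \mathit{PERSONS} \rightarrow [1000, 2100]
\]
\[
\forall x \in \mathit{PERSONS}:\ \mathit{Mother}(x) \neq x \qquad \text{(irreflexivity)}
\]
\[
\forall x \in \mathit{PERSONS}:\ \mathit{Mother}(x) \neq \mathit{unknown} \Rightarrow
\mathit{BirthYear}(\mathit{Mother}(x)) + 12 \leq \mathit{BirthYear}(x) \qquad \text{(plausible mother's age)}
\]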
"The (Elementary) Mathematical Data Model Revisited". Christian Mancas. arXiv:2408.08367, 2024-08-15.
Zhuoyue Wan, Yuanfeng Song, Shuaimin Li, Chen Jason Zhang, Raymond Chi-Wing Wong
Data visualization (DV) is a fundamental tool for efficiently conveying the insights behind big data and is widely used in today's data-driven world. Task automation in DV, such as converting natural language queries to visualizations (i.e., text-to-vis), generating explanations from visualizations (i.e., vis-to-text), answering DV-related questions in free form (i.e., FeVisQA), and explicating tabular data (i.e., table-to-text), is vital for advancing the field. Despite their potential, the application of pre-trained language models (PLMs) like T5 and BERT in DV has been limited by high costs and challenges in handling cross-modal information, leading to few studies on PLMs for DV. We introduce DataVisT5, a novel PLM tailored for DV that enhances the T5 architecture through a hybrid-objective pre-training and multi-task fine-tuning strategy, integrating text and DV datasets to effectively interpret cross-modal semantics. Extensive evaluations on public datasets show that DataVisT5 consistently outperforms current state-of-the-art models on various DV-related tasks. We anticipate that DataVisT5 will not only inspire further research on vertical PLMs but also expand the range of applications for PLMs.
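A hedged sketch of how a T5-style DV model could be queried for text-to-vis with the Hugging Face transformers API; the checkpoint name, the task prefix, and the schema linearization are hypothetical, since the abstract does not specify how DataVisT5 is packaged or prompted.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint name and task prefix: placeholders, not an official release.
checkpoint = "your-org/datavist5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# text-to-vis: natural language question plus a linearized table schema as input.
prompt = ("text-to-vis: show the average salary by department | "
          "table: employees(name, department, salary)")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Illustrative style of output: a visualization query such as
# "Visualize BAR SELECT department, AVG(salary) FROM employees GROUP BY department"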
"DataVisT5: A Pre-trained Language Model for Jointly Understanding Text and Data Visualization". Zhuoyue Wan, Yuanfeng Song, Shuaimin Li, Chen Jason Zhang, Raymond Chi-Wing Wong. arXiv:2408.07401, 2024-08-14.
Jan Luca Scheerer, Anton Lykov, Moe Kayali, Ilias Fountalis, Dan Olteanu, Nikolaos Vasiloglou, Dan Suciu
We demonstrate QirK, a system for answering natural language questions on Knowledge Graphs (KG). QirK can answer structurally complex questions that are still beyond the reach of emerging Large Language Models (LLMs). It does so using a unique combination of database technology, LLMs, and semantic search over vector embeddings. The glue for these components is an intermediate representation (IR). The input question is mapped to IR using LLMs, which is then repaired into a valid relational database query with the aid of a semantic search on vector embeddings. This allows a practical synthesis of LLM capabilities and KG reliability. A short video demonstrating QirK is available at https://youtu.be/6c81BLmOZ0U.
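A toy sketch of the repair step described above: an LLM-produced intermediate representation may reference relation names that do not exist in the KG, so each name is snapped to its nearest neighbour in the KG vocabulary. The character-trigram similarity below is a deterministic stand-in for QirK's semantic search over vector embeddings, and the IR shape is an illustrative assumption.

KG_RELATIONS = ["directed_by", "acted_in", "born_in", "capital_of"]

def trigrams(text):
    t = f"  {text.lower()}  "
    return {t[i:i + 3] for i in range(len(t) - 2)}

def similarity(a, b):
    """Stand-in for semantic similarity over embeddings: character-trigram Jaccard overlap."""
    A, B = trigrams(a), trigrams(b)
    return len(A & B) / len(A | B)

def repair_relation(name):
    """Snap a possibly invalid relation name from the LLM-produced IR onto the KG vocabulary."""
    if name in KG_RELATIONS:
        return name
    return max(KG_RELATIONS, key=lambda r: similarity(name, r))

# Illustrative IR for "Who directed the movie, and where was that person born?"
ir = [("?movie", "director", "?person"), ("?person", "birthplace", "?city")]
repaired = [(s, repair_relation(p), o) for s, p, o in ir]
print(repaired)   # relation names repaired to directed_by and born_in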
"QirK: Question Answering via Intermediate Representation on Knowledge Graphs". Jan Luca Scheerer, Anton Lykov, Moe Kayali, Ilias Fountalis, Dan Olteanu, Nikolaos Vasiloglou, Dan Suciu. arXiv:2408.07494, 2024-08-14.
Large Language Models (LLMs) have emerged as powerful conversational interfaces, and their application in process mining (PM) tasks has shown promising results. However, state-of-the-art LLMs struggle with complex scenarios that demand advanced reasoning capabilities. In the literature, two primary approaches have been proposed for implementing PM using LLMs: providing textual insights based on a textual abstraction of the process mining artifact, and generating code executable on the original artifact. This paper proposes utilizing the AI-Based Agents Workflow (AgWf) paradigm to enhance the effectiveness of PM on LLMs. This approach allows for: i) the decomposition of complex tasks into simpler workflows, and ii) the integration of deterministic tools with the domain knowledge of LLMs. We examine various implementations of AgWf and the types of AI-based tasks involved. Additionally, we discuss the CrewAI implementation framework and present examples related to process mining.
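A minimal sketch of the AgWf idea: a process-mining question is decomposed into a small workflow that chains deterministic tools before a single LLM call. The tool functions, the event-log format, and the ask_llm placeholder are illustrative assumptions rather than the CrewAI-based implementation the paper discusses.

from collections import Counter

# Deterministic PM tools operating on a toy event log: a list of (case_id, activity) pairs.
def variants(event_log):
    """Group events into per-case activity sequences (the process variants)."""
    traces = {}
    for case_id, activity in event_log:
        traces.setdefault(case_id, []).append(activity)
    return Counter(tuple(t) for t in traces.values())

def textual_abstraction(variant_counts):
    """Deterministic abstraction step: render the variants as text for the LLM."""
    return "\n".join(f"{n}x: {' -> '.join(v)}" for v, n in variant_counts.most_common())

def ask_llm(prompt):
    """Placeholder for the LLM step of the workflow (a chat-model call in practice)."""
    return f"[LLM would analyse:]\n{prompt}"

# AgWf-style decomposition: tool -> tool -> LLM, instead of one monolithic prompt.
log = [("c1", "register"), ("c1", "check"), ("c1", "pay"),
       ("c2", "register"), ("c2", "pay"),
       ("c3", "register"), ("c3", "check"), ("c3", "pay")]
abstraction = textual_abstraction(variants(log))
print(ask_llm("Which deviations from the expected 'register -> check -> pay' flow exist?\n"
              + abstraction))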
"Re-Thinking Process Mining in the AI-Based Agents Era". Alessandro Berti, Mayssa Maatallah, Urszula Jessen, Michal Sroka, Sonia Ayachi Ghannouchi. arXiv:2408.07720, 2024-08-14.
In this paper, we present ASPEN, an answer set programming (ASP) implementation of a recently proposed declarative framework for collective entity resolution (ER). While an ASP encoding had been previously suggested, several practical issues had been neglected, most notably, the question of how to efficiently compute the (externally defined) similarity facts that are used in rule bodies. This leads us to propose new variants of the encodings (including Datalog approximations) and show how to employ different functionalities of ASP solvers to compute (maximal) solutions, and (approximations of) the sets of possible and certain merges. A comprehensive experimental evaluation of ASPEN on real-world datasets shows that the approach is promising, achieving high accuracy in real-life ER scenarios. Our experiments also yield useful insights into the relative merits of different types of (approximate) ER solutions, the impact of recursion, and factors influencing performance.
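To make the notion of externally computed similarity facts concrete, here is a toy sketch that materializes the kind of sim(a, b, score) atoms an ASP or Datalog encoding could consume in rule bodies; the string-similarity measure and the threshold are illustrative choices, not ASPEN's actual similarity functions.

from difflib import SequenceMatcher

records = {
    "r1": "John A. Smith",
    "r2": "J. Smith",
    "r3": "Johanna Schmidt",
}

def similarity(a, b):
    """Illustrative string similarity (ASPEN's similarity functions are externally defined)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity_facts(records, threshold=0.6):
    """Materialize sim/3 facts above a threshold, ready for use in ASP rule bodies."""
    ids = sorted(records)
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            score = similarity(records[x], records[y])
            if score >= threshold:
                yield f'sim({x},{y},"{score:.2f}").'

for fact in similarity_facts(records):
    print(fact)   # each printed line is an ASP fact, e.g. sim(r1,r2,...).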
"ASPEN: ASP-Based System for Collective Entity Resolution". Zhiliang Xiang, Meghyn Bienvenu, Gianluca Cima, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García. arXiv:2408.06961, 2024-08-13.