
arXiv - CS - Databases: Latest Publications

EHL*: Memory-Budgeted Indexing for Ultrafast Optimal Euclidean Pathfinding
Pub Date : 2024-08-21 DOI: arxiv-2408.11341
Jinchun Du, Bojie Shen, Muhammad Aamir Cheema
The Euclidean Shortest Path Problem (ESPP), which involves finding the shortest path in a Euclidean plane with polygonal obstacles, is a classic problem with numerous real-world applications. The current state-of-the-art solution, Euclidean Hub Labeling (EHL), offers ultra-fast query performance, outperforming existing techniques by 1-2 orders of magnitude in runtime efficiency. However, this performance comes at the cost of significant memory overhead, requiring up to tens of gigabytes of storage on large maps, which can limit its applicability in memory-constrained environments like mobile phones or smaller devices. Additionally, EHL's memory usage can only be determined after index construction, and while it provides a memory-runtime tradeoff, it does not fully optimize memory utilization. In this work, we introduce an improved version of EHL, called EHL*, which overcomes these limitations. A key contribution of EHL* is its ability to create an index that adheres to a specified memory budget while optimizing query runtime performance. Moreover, EHL* can leverage preknown query distributions, a common scenario in many real-world applications, to further enhance runtime efficiency. Our results show that EHL* can reduce memory usage by up to 10-20 times without much impact on query runtime performance compared to EHL, making it a highly effective solution for optimal pathfinding in memory-constrained environments.
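The name points at the classical hub-labeling (2-hop labeling) query pattern: every point stores a small set of (hub, distance) pairs, and a point-to-point query takes the minimum of d(s,h) + d(h,t) over shared hubs h. The Python sketch below only illustrates that general pattern plus a crude label-trimming heuristic; it is not the authors' EHL* index, and the budgeting strategy shown is an assumption.

```python
# Generic hub-labeling query sketch (illustrative only, not the authors' EHL* code).
# Each node keeps a label {hub_id: distance_to_hub}; if the labels satisfy the
# usual 2-hop cover property, the query below is exact.

def query(labels, s, t):
    """Hub-label distance between nodes s and t."""
    ls, lt = labels[s], labels[t]
    if len(lt) < len(ls):              # iterate over the smaller label
        ls, lt = lt, ls
    best = float("inf")
    for hub, d_sh in ls.items():
        d_ht = lt.get(hub)
        if d_ht is not None:
            best = min(best, d_sh + d_ht)
    return best

def trim_to_budget(labels, max_entries):
    """Crude memory-budget heuristic (an assumption, not EHL*'s method): keep
    only the max_entries closest hubs per node. Trimming can break the 2-hop
    cover, which is why a real memory-budgeted index must be built far more
    carefully."""
    return {
        v: dict(sorted(lab.items(), key=lambda kv: kv[1])[:max_entries])
        for v, lab in labels.items()
    }

# Toy example: hub 0 is shared by both endpoints.
labels = {1: {0: 2.0, 3: 1.0}, 2: {0: 3.5, 4: 0.5}}
print(query(labels, 1, 2))   # 5.5
```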
Citations: 0
Privacy-Preserving Data Management using Blockchains
Pub Date : 2024-08-21 DOI: arxiv-2408.11263
Michael Mireku Kwakye
Privacy-preservation policies are guidelines formulated to protect data providers' private data. Previous privacy-preservation methodologies have addressed privacy in which data are permanently stored in repositories and disconnected from changing data provider privacy preferences. This occurrence becomes evident as data moves to another data repository. Hence, the need for data providers to control and flexibly update their existing privacy preferences due to changing data usage continues to remain a problem. This paper proposes a blockchain-based methodology for preserving data providers' private and sensitive data. The research proposes to tightly couple data providers' private attribute data element to privacy preferences and data accessor data element into a privacy tuple. The implementation presents a framework of tightly-coupled relational database and blockchains. This delivers a secure, tamper-resistant, and query-efficient platform for data management and query processing. The evaluation analysis from the implementation validates efficient query processing of privacy-aware queries on the privacy infrastructure.
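As a rough illustration of what such a privacy tuple could hold (the field names and policy check below are assumptions for exposition, not the paper's schema), the coupling of provider attribute, preference, and accessor might look like this:

```python
# Illustrative privacy tuple; field names and the policy check are assumptions,
# not the schema proposed in the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class PrivacyTuple:
    provider_id: str    # data provider who owns the attribute
    attribute: str      # private attribute name, e.g. "blood_type"
    value: str          # the sensitive value itself
    preference: str     # provider-chosen usage policy, e.g. "research-only"
    accessor_id: str    # data accessor the policy applies to

def may_access(t: PrivacyTuple, accessor_id: str, purpose: str) -> bool:
    """Toy check: the requesting accessor must match the tuple and the stated
    purpose must satisfy the provider's current preference."""
    return accessor_id == t.accessor_id and purpose == t.preference

t = PrivacyTuple("p01", "blood_type", "O+", "research-only", "hospital-42")
print(may_access(t, "hospital-42", "research-only"))  # True
print(may_access(t, "advertiser-07", "marketing"))    # False
```

In the paper's setting such tuples would live alongside a tamper-resistant ledger, so later updates to the provider's preference would naturally be auditable.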
Citations: 0
The Story Behind the Lines: Line Charts as a Gateway to Dataset Discovery
Pub Date : 2024-08-18 DOI: arxiv-2408.09506
Daomin Ji, Hui Luo, Zhifeng Bao, J. Shane Culpepper
Line charts are a valuable tool for data analysis and exploration, distilling essential insights from a dataset. However, access to the underlying dataset behind a line chart is rarely readily available. In this paper, we explore a novel dataset discovery problem, dataset discovery via line charts, focusing on the use of line charts as queries to discover datasets within a large data repository that are capable of generating similar line charts. To solve this problem, we propose a novel approach called Fine-grained Cross-modal Relevance Learning Model (FCM), which aims to estimate the relevance between a line chart and a candidate dataset. To achieve this goal, FCM first employs a visual element extractor to extract informative visual elements, i.e., lines and y-ticks, from a line chart. Then, two novel segment-level encoders are adopted to learn representations for a line chart and a dataset, preserving fine-grained information, followed by a cross-modal matcher to match the learned representations in a fine-grained way. Furthermore, we extend FCM to support line chart queries generated based on data aggregation. Last, we propose a benchmark tailored for this problem since no such dataset exists. Extensive evaluation on the new benchmark verifies the effectiveness of our proposed method. Specifically, our proposed approach surpasses the best baseline by 30.1% and 41.0% in terms of prec@50 and ndcg@50, respectively.
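The retrieval core can be pictured as fine-grained matching between embeddings of chart segments and of dataset columns. The toy scorer below (random stand-in embeddings, cosine similarity, best-match aggregation) only illustrates that idea; it is not the FCM architecture, and the aggregation choice is an assumption.

```python
# Toy fine-grained cross-modal scoring (not the FCM model; embeddings are random
# stand-ins for the outputs of the segment-level encoders).
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def relevance(chart_segments, dataset_columns):
    """For each chart segment take its best-matching column, then average:
    one simple way to aggregate fine-grained matches into a chart/dataset score."""
    return float(np.mean([
        max(cosine(seg, col) for col in dataset_columns)
        for seg in chart_segments
    ]))

# Pretend embeddings: 3 segments from the query line chart, two candidate datasets.
chart = [rng.normal(size=16) for _ in range(3)]
candidates = {
    "sales_2023": [rng.normal(size=16) for _ in range(5)],
    "weather_log": [rng.normal(size=16) for _ in range(4)],
}
ranking = sorted(candidates, key=lambda name: relevance(chart, candidates[name]),
                 reverse=True)
print(ranking)   # candidate datasets ordered by estimated relevance to the chart
```

Metrics such as prec@50 and ndcg@50 are then computed over such a ranking against ground-truth chart-dataset pairs.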
Citations: 0
The temporal conceptual data modelling language TREND
Pub Date : 2024-08-18 DOI: arxiv-2408.09427
Sonia Berman, C. Maria Keet, Tamindran Shunmugam
Temporal conceptual data modelling, as an extension to regular conceptual data modelling languages such as EER and UML class diagrams, has received intermittent attention across the decades. It is receiving renewed interest in the context of, among others, business process modelling that needs robust expressive data models to complement it. None of the proposed temporal conceptual data modelling languages have been tested on understandability and usability by modellers, however, nor is it clear which temporal constraints would be used by modellers or whether the ones included are the relevant temporal constraints. We therefore sought to investigate temporal representations in temporal conceptual data modelling languages, design TREND, the most expressive such language to date, through small-scale qualitative experiments, and finalise the graphical notation and modelling and understanding in large-scale experiments. This involved a series of 11 experiments with over a thousand participants in total, having created 246 temporal conceptual data models. Key outcomes are that the choice of label for transition constraints had limited impact, as did extending explanations of the modelling language, but expressing what needs to be modelled in controlled natural language did improve model quality. The experiments also indicate that more training may be needed, in particular guidance for domain experts, to achieve adoption of temporal conceptual data modelling by the community.
Citations: 0
NFDI4DSO: Towards a BFO Compliant Ontology for Data Science
Pub Date : 2024-08-16 DOI: arxiv-2408.08698
Genet Asefa Gesese, Jörg Waitelonis, Zongxiong Chen, Sonja Schimmler, Harald Sack
The NFDI4DataScience (NFDI4DS) project aims to enhance the accessibility and interoperability of research data within Data Science (DS) and Artificial Intelligence (AI) by connecting digital artifacts and ensuring they adhere to FAIR (Findable, Accessible, Interoperable, and Reusable) principles. To this end, this poster introduces the NFDI4DS Ontology, which describes resources in DS and AI and models the structure of the NFDI4DS consortium. Built upon the NFDICore ontology and mapped to the Basic Formal Ontology (BFO), this ontology serves as the foundation for the NFDI4DS knowledge graph currently under development.
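A minimal sketch of what "built upon NFDICore and mapped to BFO" can look like mechanically, using rdflib; all IRIs and the chosen BFO target class below are placeholders for illustration, not terms from the released NFDI4DS ontology.

```python
# Placeholder ontology-alignment example; IRIs and class choices are assumptions,
# not the released NFDI4DS ontology.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

NFDI4DS = Namespace("https://example.org/nfdi4ds#")      # hypothetical namespace
NFDICORE = Namespace("https://example.org/nfdicore#")    # hypothetical namespace
OBO = Namespace("http://purl.obolibrary.org/obo/")       # BFO terms live under OBO IRIs

g = Graph()
g.bind("nfdi4ds", NFDI4DS)

# Layering: a project-level class specialises an NFDICore class, which is in
# turn aligned under a BFO term (the concrete BFO class used here is only a
# stand-in for whatever the real mapping chooses).
g.add((NFDI4DS.Dataset, RDF.type, OWL.Class))
g.add((NFDI4DS.Dataset, RDFS.subClassOf, NFDICORE.Resource))
g.add((NFDICORE.Resource, RDFS.subClassOf, OBO.BFO_0000001))

print(g.serialize(format="turtle"))
```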
Citations: 0
The (Elementary) Mathematical Data Model Revisited
Pub Date : 2024-08-15 DOI: arxiv-2408.08367
Christian Mancas
This paper presents the current version of our (Elementary) Mathematical Data Model ((E)MDM), which is based on the naïve theory of sets, relations, and functions, as well as on the first-order predicate calculus with equality. Many real-life examples illustrate its 4 types of sets, 4 types of functions, and 76 types of constraints. This rich panoply of constraints is the main strength of this model, guaranteeing that any data value stored in a database is plausible, which is the highest possible level of syntactical data quality. An (E)MDM example scheme is presented and contrasted with some popular family tree software products. The paper also presents the main (E)MDM-related approaches in data modeling and processing.
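To give a concrete feel for what a constraint in this family buys you, the snippet below checks two textbook function constraints (totality and injectivity) over finite sets; it is only an illustration of the kind of plausibility check the model formalises, not the (E)MDM notation or tooling.

```python
# Illustrative checks for two classic function constraints over finite sets;
# not the (E)MDM notation or tooling itself.

def is_total(f: dict, domain: set) -> bool:
    """Totality: every element of the domain must have an image."""
    return domain.issubset(f.keys())

def is_injective(f: dict) -> bool:
    """Injectivity (one-to-oneness): no two domain elements share an image."""
    images = list(f.values())
    return len(images) == len(set(images))

# Toy instance: people -> social security numbers.
PEOPLE = {"ana", "bob", "eve"}
ssn = {"ana": "111", "bob": "222", "eve": "222"}   # duplicate SSN is implausible

print(is_total(ssn, PEOPLE))    # True
print(is_injective(ssn))        # False -> this stored data would be rejected
```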
Citations: 0
DataVisT5: A Pre-trained Language Model for Jointly Understanding Text and Data Visualization
Pub Date : 2024-08-14 DOI: arxiv-2408.07401
Zhuoyue Wan, Yuanfeng Song, Shuaimin Li, Chen Jason Zhang, Raymond Chi-Wing Wong
Data visualization (DV) is a fundamental tool for efficiently conveying the insights behind big data, and it is widely accepted in today's data-driven world. Task automation in DV, such as converting natural language queries to visualizations (i.e., text-to-vis), generating explanations from visualizations (i.e., vis-to-text), answering DV-related questions in free form (i.e., FeVisQA), and explicating tabular data (i.e., table-to-text), is vital for advancing the field. Despite their potential, the application of pre-trained language models (PLMs) like T5 and BERT in DV has been limited by high costs and challenges in handling cross-modal information, leading to few studies on PLMs for DV. We introduce DataVisT5, a novel PLM tailored for DV that enhances the T5 architecture through a hybrid objective pre-training and multi-task fine-tuning strategy, integrating text and DV datasets to effectively interpret cross-modal semantics. Extensive evaluations on public datasets show that DataVisT5 consistently outperforms current state-of-the-art models on various DV-related tasks. We anticipate that DataVisT5 will not only inspire further research on vertical PLMs but also expand the range of applications for PLMs.
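Because DataVisT5 keeps the T5 encoder-decoder interface, driving it for a task like text-to-vis would follow the usual transformers seq2seq pattern sketched below; the checkpoint name (plain t5-small is loaded here as a stand-in) and the task prefix are assumptions, not the released model or its prompt format.

```python
# Standard T5 seq2seq usage pattern; the checkpoint and task prefix are
# stand-ins, not the released DataVisT5 model or its prompt format.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")               # stand-in checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

prompt = "translate text to vis: show monthly revenue by region as a line chart"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
# For a text-to-vis model the decoded string would be a visualization query/spec.
print(tok.decode(out[0], skip_special_tokens=True))
```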
Citations: 0
QirK: Question Answering via Intermediate Representation on Knowledge Graphs
Pub Date : 2024-08-14 DOI: arxiv-2408.07494
Jan Luca Scheerer, Anton Lykov, Moe Kayali, Ilias Fountalis, Dan Olteanu, Nikolaos Vasiloglou, Dan Suciu
We demonstrate QirK, a system for answering natural language questions on Knowledge Graphs (KG). QirK can answer structurally complex questions that are still beyond the reach of emerging Large Language Models (LLMs). It does so using a unique combination of database technology, LLMs, and semantic search over vector embeddings. The glue for these components is an intermediate representation (IR). The input question is mapped to IR using LLMs, which is then repaired into a valid relational database query with the aid of a semantic search on vector embeddings. This allows a practical synthesis of LLM capabilities and KG reliability. A short video demonstrating QirK is available at https://youtu.be/6c81BLmOZ0U.
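The repair step, snapping LLM-produced IR terms onto the KG's actual vocabulary via embedding similarity, can be sketched as follows; the character-trigram "embedding" and the triple-shaped IR below are assumptions used only to show the idea, not QirK's implementation.

```python
# Toy IR-repair step: map each predicate emitted by the LLM to its nearest valid
# schema term by vector similarity. The trigram "embedding" and the IR shape are
# assumptions, not QirK's actual components.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Character-trigram bag as a stand-in for a learned embedding."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def repair(term: str, schema_terms: list) -> str:
    """Replace an IR term by its most similar valid schema term."""
    return max(schema_terms, key=lambda s: cosine(embed(term), embed(s)))

schema = ["director_of", "acted_in", "born_in"]   # relations actually in the KG
ir = [("directorOf", "?x", "Inception")]          # raw IR emitted by an LLM (assumed shape)
repaired = [(repair(p, schema), s, o) for p, s, o in ir]
print(repaired)   # [('director_of', '?x', 'Inception')] -> now references a valid relation
```

The repaired IR can then be compiled deterministically into the relational query that is actually executed.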
Citations: 0
Re-Thinking Process Mining in the AI-Based Agents Era
Pub Date : 2024-08-14 DOI: arxiv-2408.07720
Alessandro Berti, Mayssa Maatallah, Urszula Jessen, Michal Sroka, Sonia Ayachi Ghannouchi
Large Language Models (LLMs) have emerged as powerful conversational interfaces, and their application in process mining (PM) tasks has shown promising results. However, state-of-the-art LLMs struggle with complex scenarios that demand advanced reasoning capabilities. In the literature, two primary approaches have been proposed for implementing PM using LLMs: providing textual insights based on a textual abstraction of the process mining artifact, and generating code executable on the original artifact. This paper proposes utilizing the AI-Based Agents Workflow (AgWf) paradigm to enhance the effectiveness of PM on LLMs. This approach allows for: i) the decomposition of complex tasks into simpler workflows, and ii) the integration of deterministic tools with the domain knowledge of LLMs. We examine various implementations of AgWf and the types of AI-based tasks involved. Additionally, we discuss the CrewAI implementation framework and present examples related to process mining.
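In the AgWf spirit, a question is decomposed so that deterministic code does the event-log crunching and the LLM only verbalises a compact abstraction of the result. The sketch below uses a plain in-memory log and a stubbed summariser; the function names and workflow shape are assumptions, not the paper's CrewAI setup.

```python
# Minimal AgWf-style decomposition: a deterministic tool computes, an LLM-style
# step only summarises its output. Names and workflow shape are assumptions,
# not the paper's CrewAI implementation.
from collections import Counter

EVENT_LOG = [  # (case_id, activity) pairs of a toy process
    ("c1", "register"), ("c1", "check"), ("c1", "pay"),
    ("c2", "register"), ("c2", "pay"),
    ("c3", "register"), ("c3", "check"), ("c3", "reject"),
]

def tool_directly_follows(log):
    """Deterministic step: count directly-follows pairs per case."""
    traces = {}
    for case, activity in log:
        traces.setdefault(case, []).append(activity)
    dfg = Counter()
    for trace in traces.values():
        dfg.update(zip(trace, trace[1:]))
    return dfg

def llm_summarise(dfg):
    """Placeholder for the LLM step: in a real AgWf this would prompt a model
    with the textual abstraction produced by the deterministic tool."""
    return f"Most frequent hand-overs: {dfg.most_common(2)}"

# Workflow: deterministic mining -> textual abstraction -> (stubbed) LLM answer.
print(llm_summarise(tool_directly_follows(EVENT_LOG)))
```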
Citations: 0
ASPEN: ASP-Based System for Collective Entity Resolution
Pub Date : 2024-08-13 DOI: arxiv-2408.06961
Zhiliang Xiang, Meghyn Bienvenu, Gianluca Cima, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García
In this paper, we present ASPEN, an answer set programming (ASP) implementation of a recently proposed declarative framework for collective entity resolution (ER). While an ASP encoding had been previously suggested, several practical issues had been neglected, most notably, the question of how to efficiently compute the (externally defined) similarity facts that are used in rule bodies. This leads us to propose new variants of the encodings (including Datalog approximations) and show how to employ different functionalities of ASP solvers to compute (maximal) solutions, and (approximations of) the sets of possible and certain merges. A comprehensive experimental evaluation of ASPEN on real-world datasets shows that the approach is promising, achieving high accuracy in real-life ER scenarios. Our experiments also yield useful insights into the relative merits of different types of (approximate) ER solutions, the impact of recursion, and factors influencing performance.
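A stripped-down flavour of such an encoding, run through clingo's Python API: the similarity facts are hard-coded here rather than computed externally, and the merge rules and threshold are illustrative assumptions, not ASPEN's actual encoding.

```python
# Illustrative ASP merge rules solved with clingo's Python API; the program and
# the similarity threshold are assumptions, not ASPEN's actual encoding.
import clingo

PROGRAM = """
% hard-coded similarity facts (ASPEN derives these externally)
sim(a,b,92). sim(b,c,85). sim(a,d,40).

% merge sufficiently similar records, then close under symmetry and transitivity
eq(X,Y) :- sim(X,Y,S), S >= 80.
eq(Y,X) :- eq(X,Y).
eq(X,Z) :- eq(X,Y), eq(Y,Z), X != Z.
#show eq/2.
"""

ctl = clingo.Control(["0"])            # "0": enumerate all answer sets
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("merges:", m))
# Expected merges include eq(a,b), eq(b,c) and, by transitivity, eq(a,c).
```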
Citations: 0