Current architectures for main-memory online transaction processing (OLTP) database management systems (DBMS) typically use random scheduling to assign transactions to threads. This approach achieves uniform load across threads, but it ignores the likelihood of conflicts between transactions. If the DBMS could estimate the potential for transaction conflict and then intelligently schedule transactions to avoid conflicts, the system could improve its performance. Such estimation of transaction conflict, however, is non-trivial for several reasons. First, conflicts occur under complex conditions that are far removed in time from the scheduling decision. Second, transactions must be represented in a compact and efficient manner to allow for fast conflict detection. Third, given some evidence of potential conflict, the DBMS must schedule transactions in a way that minimizes this conflict. In this paper, we systematically explore the design decisions for solving these problems. We then empirically measure the performance impact of different representations on standard OLTP benchmarks. Our results show that intelligent scheduling using a history increases throughput by ~40% on a 20-core machine.
{"title":"Intelligent Transaction Scheduling via Conflict Prediction in OLTP DBMS","authors":"Tieying Zhang, Anthony Tomasic, Andrew Pavlo","doi":"arxiv-2409.01675","DOIUrl":"https://doi.org/arxiv-2409.01675","url":null,"abstract":"Current architectures for main-memory online transaction processing (OLTP)\u0000database management systems (DBMS) typically use random scheduling to assign\u0000transactions to threads. This approach achieves uniform load across threads but\u0000it ignores the likelihood of conflicts between transactions. If the DBMS could\u0000estimate the potential for transaction conflict and then intelligently schedule\u0000transactions to avoid conflicts, then the system could improve its performance.\u0000Such estimation of transaction conflict, however, is non-trivial for several\u0000reasons. First, conflicts occur under complex conditions that are far removed\u0000in time from the scheduling decision. Second, transactions must be represented\u0000in a compact and efficient manner to allow for fast conflict detection. Third,\u0000given some evidence of potential conflict, the DBMS must schedule transactions\u0000in such a way that minimizes this conflict. In this paper, we systematically\u0000explore the design decisions for solving these problems. We then empirically\u0000measure the performance impact of different representations on standard OLTP\u0000benchmarks. Our results show that intelligent scheduling using a history\u0000increases throughput by $sim$40% on 20-core machine.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the problem of answering conjunctive queries with aggregation on database instances that may violate primary key constraints. In SQL, these queries follow the SELECT-FROM-WHERE-GROUP BY format, where the WHERE-clause involves a conjunction of equalities, and the SELECT-clause can incorporate aggregate operators like MAX, MIN, SUM, AVG, or COUNT. Repairs of a database instance are defined as inclusion-maximal subsets that satisfy all primary keys. For a given query, our primary objective is to identify repairs that yield the lowest aggregated value among all possible repairs. We particularly investigate queries for which this lowest aggregated value can be determined through a rewriting in first-order logic with aggregate operators.
{"title":"Computing Range Consistent Answers to Aggregation Queries via Rewriting","authors":"Aziz Amezian El Khalfioui, Jef Wijsen","doi":"arxiv-2409.01648","DOIUrl":"https://doi.org/arxiv-2409.01648","url":null,"abstract":"We consider the problem of answering conjunctive queries with aggregation on\u0000database instances that may violate primary key constraints. In SQL, these\u0000queries follow the SELECT-FROM-WHERE-GROUP BY format, where the WHERE-clause\u0000involves a conjunction of equalities, and the SELECT-clause can incorporate\u0000aggregate operators like MAX, MIN, SUM, AVG, or COUNT. Repairs of a database\u0000instance are defined as inclusion-maximal subsets that satisfy all primary\u0000keys. For a given query, our primary objective is to identify repairs that\u0000yield the lowest aggregated value among all possible repairs. We particularly\u0000investigate queries for which this lowest aggregated value can be determined\u0000through a rewriting in first-order logic with aggregate operators.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dean Light, Ahmad Aiashy, Mahmoud Diab, Daniel Nachmias, Stijn Vansummeren, Benny Kimelfeld
Document spanners have been proposed as a formal framework for declarative Information Extraction (IE) from text, following IE products from industry and academia. Over the past decade, the framework has been studied thoroughly in terms of expressive power, complexity, and the ability to naturally combine text analysis with relational querying. This demonstration presents SpannerLib, a library for embedding document spanners in Python code. SpannerLib facilitates the development of IE programs by providing an implementation of Spannerlog (Datalog-based document spanners) that interacts with Python code in two directions: rules can be embedded inside Python, and they can invoke custom Python code (e.g., calls to ML-based NLP models) via user-defined functions. The demonstration scenarios showcase IE programs, with increasing levels of complexity, within Jupyter Notebook.
{"title":"SpannerLib: Embedding Declarative Information Extraction in an Imperative Workflow","authors":"Dean Light, Ahmad Aiashy, Mahmoud Diab, Daniel Nachmias, Stijn Vansummeren, Benny Kimelfeld","doi":"arxiv-2409.01736","DOIUrl":"https://doi.org/arxiv-2409.01736","url":null,"abstract":"Document spanners have been proposed as a formal framework for declarative\u0000Information Extraction (IE) from text, following IE products from the industry\u0000and academia. Over the past decade, the framework has been studied thoroughly\u0000in terms of expressive power, complexity, and the ability to naturally combine\u0000text analysis with relational querying. This demonstration presents SpannerLib\u0000a library for embedding document spanners in Python code. SpannerLib\u0000facilitates the development of IE programs by providing an implementation of\u0000Spannerlog (Datalog-based documentspanners) that interacts with the Python code\u0000in two directions: rules can be embedded inside Python, and they can invoke\u0000custom Python code (e.g., calls to ML-based NLP models) via user-defined\u0000functions. The demonstration scenarios showcase IE programs, with increasing\u0000levels of complexity, within Jupyter Notebook.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents an approach to using decentralized distributed digital (DDD) ledgers, such as blockchain, with multi-level verification. In regular DDD ledgers like blockchain, only a single level of verification is available, which makes them unsuitable for systems that have a hierarchy and require verification at each level. In systems where hierarchy emerges naturally, incorporating that hierarchy into the solution enables a better design. Introducing hierarchy means there can be several verifications within a level and more than one level of verification, which in turn raises challenges arising from the interaction between levels, such as a given level verifying the work of the previous level. The paper addresses these issues and provides a road map for tracing the state of the system at any given time, along with the probability of failure of the system.
{"title":"Multilevel Verification on a Single Digital Decentralized Distributed (DDD) Ledger","authors":"Ayush Thada, Aanchal Kandpal, Dipanwita Sinha Mukharjee","doi":"arxiv-2409.11410","DOIUrl":"https://doi.org/arxiv-2409.11410","url":null,"abstract":"This paper presents an approach to using decentralized distributed digital\u0000(DDD) ledgers like blockchain with multi-level verification. In regular DDD\u0000ledgers like Blockchain, only a single level of verification is available,\u0000which makes it not useful for those systems where there is a hierarchy and\u0000verification is required on each level. In systems where hierarchy emerges\u0000naturally, the inclusion of hierarchy in the solution for the problem of the\u0000system enables us to come up with a better solution. Introduction to hierarchy\u0000means there could be several verification within a level in the hierarchy and\u0000more than one level of verification, which implies other challenges induced by\u0000an interaction between the various levels of hierarchies that also need to be\u0000addressed, like verification of the work of the previous level of hierarchy by\u0000given level in the hierarchy. The paper will address all these issues, and\u0000provide a road map to trace the state of the system at any given time and\u0000probability of failure of the system.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"27 16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Peter Baile Chen, Fabian Wenz, Yi Zhang, Moe Kayali, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker
Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web, with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the poor performance is largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the "dark web", (2) schemas of enterprise tables are more complex than those in public data, which makes the SQL-generation task innately harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset, BEAVER, sourced from real enterprise data warehouses, together with natural language queries and their correct SQL statements collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will help future researchers build more sophisticated text-to-SQL systems that do better on this important class of data.
{"title":"BEAVER: An Enterprise Benchmark for Text-to-SQL","authors":"Peter Baile Chen, Fabian Wenz, Yi Zhang, Moe Kayali, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker","doi":"arxiv-2409.02038","DOIUrl":"https://doi.org/arxiv-2409.02038","url":null,"abstract":"Existing text-to-SQL benchmarks have largely been constructed using publicly\u0000available tables from the web with human-generated tests containing question\u0000and SQL statement pairs. They typically show very good results and lead people\u0000to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply\u0000off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In\u0000this environment, LLMs perform poorly, even when standard prompt engineering\u0000and RAG techniques are utilized. As we will show, the reasons for poor\u0000performance are largely due to three characteristics: (1) public LLMs cannot\u0000train on enterprise data warehouses because they are largely in the \"dark web\",\u0000(2) schemas of enterprise tables are more complex than the schemas in public\u0000data, which leads the SQL-generation task innately harder, and (3)\u0000business-oriented questions are often more complex, requiring joins over\u0000multiple tables and aggregations. As a result, we propose a new dataset BEAVER,\u0000sourced from real enterprise data warehouses together with natural language\u0000queries and their correct SQL statements which we collected from actual user\u0000history. We evaluated this dataset using recent LLMs and demonstrated their\u0000poor performance on this task. We hope this dataset will facilitate future\u0000researchers building more sophisticated text-to-SQL systems which can do better\u0000on this important class of data.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Split Learning has recently been introduced to facilitate applications where user data privacy is a requirement. However, it has not been thoroughly studied in the context of Privacy-Preserving Record Linkage, a problem in which the same real-world entity should be identified among databases from different data holders, but without disclosing any additional information. In this paper, we investigate the potential of Split Learning for Privacy-Preserving Record Matching by introducing a novel training method that utilizes Reference Sets, which are publicly available data corpora, and we show that it has minimal impact on matching quality compared to a traditional centralized SVM-based technique.
{"title":"Towards Split Learning-based Privacy-Preserving Record Linkage","authors":"Michail Zervas, Alexandros Karakasidis","doi":"arxiv-2409.01088","DOIUrl":"https://doi.org/arxiv-2409.01088","url":null,"abstract":"Split Learning has been recently introduced to facilitate applications where\u0000user data privacy is a requirement. However, it has not been thoroughly studied\u0000in the context of Privacy-Preserving Record Linkage, a problem in which the\u0000same real-world entity should be identified among databases from different\u0000dataholders, but without disclosing any additional information. In this paper,\u0000we investigate the potentials of Split Learning for Privacy-Preserving Record\u0000Matching, by introducing a novel training method through the utilization of\u0000Reference Sets, which are publicly available data corpora, showcasing minimal\u0000matching impact against a traditional centralized SVM-based technique.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"113 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amélie Gheerbrant, Leonid Libkin, Liat Peterfreund, Alexandra Rogova
SQL/PGQ and GQL are very recent international standards for querying property graphs: SQL/PGQ specifies how to query relational representations of property graphs in SQL, while GQL is a standalone language for graph databases. The rapid industrial development of these standards left the academic community trailing in its wake. While digests of the languages have appeared, we do not yet have concise foundational models, like relational algebra and calculus for relational databases, that enable the formal study of languages, including their expressiveness and limitations. At the same time, work on the next versions of the standards has already begun, to address the perceived limitations of their first versions. Motivated by this, we initiate a formal study of SQL/PGQ and GQL, concentrating on a concise formal model and on expressiveness. For the former, we define simple core languages -- Core GQL and Core PGQ -- that capture the essence of the new standards, are amenable to theoretical analysis, and fully clarify the difference between PGQ's bottom-up evaluation and GQL's linear, or pipelined, approach. Equipped with these models, we both confirm the necessity of extending the languages to fill the expressiveness gaps and identify the source of these deficiencies. We complement our theoretical analysis with an experimental study, demonstrating that existing workarounds in full GQL and PGQ are impractical, which further underscores the need to correct deficiencies in the language design.
{"title":"GQL and SQL/PGQ: Theoretical Models and Expressive Power","authors":"Amélie Gheerbrant, Leonid Libkin, Liat Peterfreund, Alexandra Rogova","doi":"arxiv-2409.01102","DOIUrl":"https://doi.org/arxiv-2409.01102","url":null,"abstract":"SQL/PGQ and GQL are very recent international standards for querying property\u0000graphs: SQL/PGQ specifies how to query relational representations of property\u0000graphs in SQL, while GQL is a standalone language for graph databases. The\u0000rapid industrial development of these standards left the academic community\u0000trailing in its wake. While digests of the languages have appeared, we do not\u0000yet have concise foundational models like relational algebra and calculus for\u0000relational databases that enable the formal study of languages, including their\u0000expressiveness and limitations. At the same time, work on the next versions of\u0000the standards has already begun, to address the perceived limitations of their\u0000first versions. Motivated by this, we initiate a formal study of SQL/PGQ and GQL,\u0000concentrating on their concise formal model and expressiveness. For the former,\u0000we define simple core languages -- Core GQL and Core PGQ -- that capture the\u0000essence of the new standards, are amenable to theoretical analysis, and fully\u0000clarify the difference between PGQ's bottom up evaluation versus GQL's linear,\u0000or pipelined approach. Equipped with these models, we both confirm the\u0000necessity to extend the language to fill in the expressiveness gaps and\u0000identify the source of these deficiencies. We complement our theoretical\u0000analysis with an experimental study, demonstrating that existing workarounds in\u0000full GQL and PGQ are impractical which further underscores the necessity to\u0000correct deficiencies in the language design.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Serverless query processing has become increasingly popular due to its auto-scaling, high elasticity, and pay-as-you-go pricing. It allows cloud data warehouse (or lakehouse) users to focus on data analysis without the burden of managing systems and resources. Accordingly, in serverless query services, users become more concerned about cost-efficiency under acceptable performance than about performance under fixed resources. This poses new challenges for serverless query engine design in providing flexible performance service-level agreements (SLAs) and cost-efficiency (i.e., prices). In this paper, we first define the problem of flexible performance SLAs and prices in serverless query processing and discuss its significance. Then, we envision the challenges and solutions for solving this problem and the opportunities it raises for other database research. Finally, we present PixelsDB, an open-source prototype with three service levels supported by dedicated architectural designs. Evaluations show that PixelsDB reduces resource costs by 65.5% for near-real-world workloads generated by the Cloud Analytics Benchmark (CAB) while not violating the pending-time guarantees.
{"title":"Serverless Query Processing with Flexible Performance SLAs and Prices","authors":"Haoqiong Bian, Dongyang Geng, Yunpeng Chai, Anastasia Ailamaki","doi":"arxiv-2409.01388","DOIUrl":"https://doi.org/arxiv-2409.01388","url":null,"abstract":"Serverless query processing has become increasingly popular due to its\u0000auto-scaling, high elasticity, and pay-as-you-go pricing. It allows cloud data\u0000warehouse (or lakehouse) users to focus on data analysis without the burden of\u0000managing systems and resources. Accordingly, in serverless query services,\u0000users become more concerned about cost-efficiency under acceptable performance\u0000than performance under fixed resources. This poses new challenges for\u0000serverless query engine design in providing flexible performance service-level\u0000agreements (SLAs) and cost-efficiency (i.e., prices). In this paper, we first define the problem of flexible performance SLAs and\u0000prices in serverless query processing and discuss its significance. Then, we\u0000envision the challenges and solutions for solving this problem and the\u0000opportunities it raises for other database research. Finally, we present\u0000PixelsDB, an open-source prototype with three service levels supported by\u0000dedicated architectural designs. Evaluations show that PixelsDB reduces\u0000resource costs by 65.5% for near-real-world workloads generated by Cloud\u0000Analytics Benchmark (CAB) while not violating the pending time guarantees.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"95 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bryan-Elliott Tam, Ruben Taelman, Julián Rojas Meléndez, Pieter Colpaert
Link Traversal queries face challenges in completeness and long execution times due to the size of the web. Reachability criteria define completeness by restricting the links followed by engines. However, the number of links to dereference remains the bottleneck of the approach. Web environments often have structures that query engines can exploit to prune irrelevant sources. Current criteria rely on information from the query definition and predefined predicates. However, it is difficult to use them to traverse environments where logical expressions indicate the location of resources. We propose a rule-based reachability criterion that captures logical statements expressed in hypermedia descriptions within linked data documents to prune irrelevant sources. In this poster paper, we show how the Comunica link traversal engine is modified to take hints from a hypermedia control vocabulary to prune irrelevant sources. Our preliminary findings show that, by using this strategy, the query engine can significantly reduce the number of HTTP requests and the query execution time without sacrificing the completeness of results. Our work shows that investigating hypermedia controls for link pruning in traversal queries is a worthwhile effort for optimizing web queries over unindexed decentralized databases.
{"title":"Optimizing Traversal Queries of Sensor Data Using a Rule-Based Reachability Approach","authors":"Bryan-Elliott Tam, Ruben Taelman, Julián Rojas Meléndez, Pieter Colpaert","doi":"arxiv-2408.17157","DOIUrl":"https://doi.org/arxiv-2408.17157","url":null,"abstract":"Link Traversal queries face challenges in completeness and long execution\u0000time due to the size of the web. Reachability criteria define completeness by\u0000restricting the links followed by engines. However, the number of links to\u0000dereference remains the bottleneck of the approach. Web environments often have\u0000structures exploitable by query engines to prune irrelevant sources. Current\u0000criteria rely on using information from the query definition and predefined\u0000predicate. However, it is difficult to use them to traverse environments where\u0000logical expressions indicate the location of resources. We propose to use a\u0000rule-based reachability criterion that captures logical statements expressed in\u0000hypermedia descriptions within linked data documents to prune irrelevant\u0000sources. In this poster paper, we show how the Comunica link traversal engine\u0000is modified to take hints from a hypermedia control vocabulary, to prune\u0000irrelevant sources. Our preliminary findings show that by using this strategy,\u0000the query engine can significantly reduce the number of HTTP requests and the\u0000query execution time without sacrificing the completeness of results. Our work\u0000shows that the investigation of hypermedia controls in link pruning of\u0000traversal queries is a worthy effort for optimizing web queries of unindexed\u0000decentralized databases.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Covid-19 pandemic has affected the world at multiple levels. Data sharing was pivotal for advancing research to understand the underlying causes and implement effective containment strategies. In response, many countries have promoted the availability of daily cases to support research initiatives, fostering collaboration between organisations and making such data available to the public through open data platforms. Despite the several advantages of data sharing, one of the major concerns before releasing health data is its impact on individuals' privacy. Such a sharing process should be based on state-of-the-art methods in Data Protection by Design and by Default. In this paper, we use a data set related to Covid-19 cases in the second largest hospital in Portugal to show how it is feasible to ensure data privacy while improving the quality and maintaining the utility of the data. Our goal is to demonstrate how knowledge exchange in multidisciplinary teams of healthcare practitioners, data privacy, and data science experts is crucial to co-developing strategies that ensure high utility of de-identified data.
{"title":"Empowering Open Data Sharing for Social Good: A Privacy-Aware Approach","authors":"Tânia Carvalho, Luís Antunes, Cristina Costa, Nuno Moniz","doi":"arxiv-2408.17378","DOIUrl":"https://doi.org/arxiv-2408.17378","url":null,"abstract":"The Covid-19 pandemic has affected the world at multiple levels. Data sharing\u0000was pivotal for advancing research to understand the underlying causes and\u0000implement effective containment strategies. In response, many countries have\u0000promoted the availability of daily cases to support research initiatives,\u0000fostering collaboration between organisations and making such data available to\u0000the public through open data platforms. Despite the several advantages of data\u0000sharing, one of the major concerns before releasing health data is its impact\u0000on individuals' privacy. Such a sharing process should be based on\u0000state-of-the-art methods in Data Protection by Design and by Default. In this\u0000paper, we use a data set related to Covid-19 cases in the second largest\u0000hospital in Portugal to show how it is feasible to ensure data privacy while\u0000improving the quality and maintaining the utility of the data. Our goal is to\u0000demonstrate how knowledge exchange in multidisciplinary teams of healthcare\u0000practitioners, data privacy, and data science experts is crucial to\u0000co-developing strategies that ensure high utility of de-identified data.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}