Current architectures for main-memory online transaction processing (OLTP) database management systems (DBMS) typically use random scheduling to assign transactions to threads. This approach achieves uniform load across threads, but it ignores the likelihood of conflicts between transactions. If the DBMS could estimate the potential for transaction conflict and then intelligently schedule transactions to avoid conflicts, the system could improve its performance. Such estimation of transaction conflict, however, is non-trivial for several reasons. First, conflicts occur under complex conditions that are far removed in time from the scheduling decision. Second, transactions must be represented in a compact and efficient manner to allow for fast conflict detection. Third, given some evidence of potential conflict, the DBMS must schedule transactions in a way that minimizes this conflict. In this paper, we systematically explore the design decisions for solving these problems. We then empirically measure the performance impact of different representations on standard OLTP benchmarks. Our results show that intelligent scheduling using a history increases throughput by ~40% on a 20-core machine.
{"title":"Intelligent Transaction Scheduling via Conflict Prediction in OLTP DBMS","authors":"Tieying Zhang, Anthony Tomasic, Andrew Pavlo","doi":"arxiv-2409.01675","DOIUrl":"https://doi.org/arxiv-2409.01675","url":null,"abstract":"Current architectures for main-memory online transaction processing (OLTP)\u0000database management systems (DBMS) typically use random scheduling to assign\u0000transactions to threads. This approach achieves uniform load across threads but\u0000it ignores the likelihood of conflicts between transactions. If the DBMS could\u0000estimate the potential for transaction conflict and then intelligently schedule\u0000transactions to avoid conflicts, then the system could improve its performance.\u0000Such estimation of transaction conflict, however, is non-trivial for several\u0000reasons. First, conflicts occur under complex conditions that are far removed\u0000in time from the scheduling decision. Second, transactions must be represented\u0000in a compact and efficient manner to allow for fast conflict detection. Third,\u0000given some evidence of potential conflict, the DBMS must schedule transactions\u0000in such a way that minimizes this conflict. In this paper, we systematically\u0000explore the design decisions for solving these problems. We then empirically\u0000measure the performance impact of different representations on standard OLTP\u0000benchmarks. Our results show that intelligent scheduling using a history\u0000increases throughput by $sim$40% on 20-core machine.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the problem of answering conjunctive queries with aggregation on database instances that may violate primary key constraints. In SQL, these queries follow the SELECT-FROM-WHERE-GROUP BY format, where the WHERE-clause involves a conjunction of equalities, and the SELECT-clause can incorporate aggregate operators like MAX, MIN, SUM, AVG, or COUNT. Repairs of a database instance are defined as inclusion-maximal subsets that satisfy all primary keys. For a given query, our primary objective is to identify repairs that yield the lowest aggregated value among all possible repairs. We particularly investigate queries for which this lowest aggregated value can be determined through a rewriting in first-order logic with aggregate operators.
{"title":"Computing Range Consistent Answers to Aggregation Queries via Rewriting","authors":"Aziz Amezian El Khalfioui, Jef Wijsen","doi":"arxiv-2409.01648","DOIUrl":"https://doi.org/arxiv-2409.01648","url":null,"abstract":"We consider the problem of answering conjunctive queries with aggregation on\u0000database instances that may violate primary key constraints. In SQL, these\u0000queries follow the SELECT-FROM-WHERE-GROUP BY format, where the WHERE-clause\u0000involves a conjunction of equalities, and the SELECT-clause can incorporate\u0000aggregate operators like MAX, MIN, SUM, AVG, or COUNT. Repairs of a database\u0000instance are defined as inclusion-maximal subsets that satisfy all primary\u0000keys. For a given query, our primary objective is to identify repairs that\u0000yield the lowest aggregated value among all possible repairs. We particularly\u0000investigate queries for which this lowest aggregated value can be determined\u0000through a rewriting in first-order logic with aggregate operators.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dean Light, Ahmad Aiashy, Mahmoud Diab, Daniel Nachmias, Stijn Vansummeren, Benny Kimelfeld
Document spanners have been proposed as a formal framework for declarative Information Extraction (IE) from text, following IE products from industry and academia. Over the past decade, the framework has been studied thoroughly in terms of expressive power, complexity, and the ability to naturally combine text analysis with relational querying. This demonstration presents SpannerLib, a library for embedding document spanners in Python code. SpannerLib facilitates the development of IE programs by providing an implementation of Spannerlog (Datalog-based document spanners) that interacts with Python code in two directions: rules can be embedded inside Python, and they can invoke custom Python code (e.g., calls to ML-based NLP models) via user-defined functions. The demonstration scenarios showcase IE programs, with increasing levels of complexity, within Jupyter Notebook.
{"title":"SpannerLib: Embedding Declarative Information Extraction in an Imperative Workflow","authors":"Dean Light, Ahmad Aiashy, Mahmoud Diab, Daniel Nachmias, Stijn Vansummeren, Benny Kimelfeld","doi":"arxiv-2409.01736","DOIUrl":"https://doi.org/arxiv-2409.01736","url":null,"abstract":"Document spanners have been proposed as a formal framework for declarative\u0000Information Extraction (IE) from text, following IE products from the industry\u0000and academia. Over the past decade, the framework has been studied thoroughly\u0000in terms of expressive power, complexity, and the ability to naturally combine\u0000text analysis with relational querying. This demonstration presents SpannerLib\u0000a library for embedding document spanners in Python code. SpannerLib\u0000facilitates the development of IE programs by providing an implementation of\u0000Spannerlog (Datalog-based documentspanners) that interacts with the Python code\u0000in two directions: rules can be embedded inside Python, and they can invoke\u0000custom Python code (e.g., calls to ML-based NLP models) via user-defined\u0000functions. The demonstration scenarios showcase IE programs, with increasing\u0000levels of complexity, within Jupyter Notebook.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents an approach to using decentralized distributed digital (DDD) ledgers, such as blockchain, with multi-level verification. In regular DDD ledgers like blockchain, only a single level of verification is available, which makes them unsuitable for systems that have a hierarchy and require verification at each level. In systems where hierarchy emerges naturally, incorporating that hierarchy into the solution enables a better design. Introducing hierarchy means there can be several verifications within a level and more than one level of verification, which in turn raises challenges arising from the interaction between levels, such as a given level verifying the work of the previous level. The paper addresses these issues and provides a road map for tracing the state of the system at any given time, along with the probability of failure of the system.
{"title":"Multilevel Verification on a Single Digital Decentralized Distributed (DDD) Ledger","authors":"Ayush Thada, Aanchal Kandpal, Dipanwita Sinha Mukharjee","doi":"arxiv-2409.11410","DOIUrl":"https://doi.org/arxiv-2409.11410","url":null,"abstract":"This paper presents an approach to using decentralized distributed digital\u0000(DDD) ledgers like blockchain with multi-level verification. In regular DDD\u0000ledgers like Blockchain, only a single level of verification is available,\u0000which makes it not useful for those systems where there is a hierarchy and\u0000verification is required on each level. In systems where hierarchy emerges\u0000naturally, the inclusion of hierarchy in the solution for the problem of the\u0000system enables us to come up with a better solution. Introduction to hierarchy\u0000means there could be several verification within a level in the hierarchy and\u0000more than one level of verification, which implies other challenges induced by\u0000an interaction between the various levels of hierarchies that also need to be\u0000addressed, like verification of the work of the previous level of hierarchy by\u0000given level in the hierarchy. The paper will address all these issues, and\u0000provide a road map to trace the state of the system at any given time and\u0000probability of failure of the system.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"27 16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Peter Baile Chen, Fabian Wenz, Yi Zhang, Moe Kayali, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker
Existing text-to-SQL benchmarks have largely been constructed using publicly available tables from the web, with human-generated tests containing question and SQL statement pairs. They typically show very good results and lead people to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In this environment, LLMs perform poorly, even when standard prompt engineering and RAG techniques are utilized. As we will show, the poor performance is largely due to three characteristics: (1) public LLMs cannot train on enterprise data warehouses because they are largely in the "dark web", (2) schemas of enterprise tables are more complex than those in public data, which makes the SQL-generation task innately harder, and (3) business-oriented questions are often more complex, requiring joins over multiple tables and aggregations. As a result, we propose a new dataset, BEAVER, sourced from real enterprise data warehouses, together with natural language queries and their correct SQL statements collected from actual user history. We evaluated this dataset using recent LLMs and demonstrated their poor performance on this task. We hope this dataset will help future researchers build more sophisticated text-to-SQL systems that do better on this important class of data.
{"title":"BEAVER: An Enterprise Benchmark for Text-to-SQL","authors":"Peter Baile Chen, Fabian Wenz, Yi Zhang, Moe Kayali, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker","doi":"arxiv-2409.02038","DOIUrl":"https://doi.org/arxiv-2409.02038","url":null,"abstract":"Existing text-to-SQL benchmarks have largely been constructed using publicly\u0000available tables from the web with human-generated tests containing question\u0000and SQL statement pairs. They typically show very good results and lead people\u0000to think that LLMs are effective at text-to-SQL tasks. In this paper, we apply\u0000off-the-shelf LLMs to a benchmark containing enterprise data warehouse data. In\u0000this environment, LLMs perform poorly, even when standard prompt engineering\u0000and RAG techniques are utilized. As we will show, the reasons for poor\u0000performance are largely due to three characteristics: (1) public LLMs cannot\u0000train on enterprise data warehouses because they are largely in the \"dark web\",\u0000(2) schemas of enterprise tables are more complex than the schemas in public\u0000data, which leads the SQL-generation task innately harder, and (3)\u0000business-oriented questions are often more complex, requiring joins over\u0000multiple tables and aggregations. As a result, we propose a new dataset BEAVER,\u0000sourced from real enterprise data warehouses together with natural language\u0000queries and their correct SQL statements which we collected from actual user\u0000history. We evaluated this dataset using recent LLMs and demonstrated their\u0000poor performance on this task. We hope this dataset will facilitate future\u0000researchers building more sophisticated text-to-SQL systems which can do better\u0000on this important class of data.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Split Learning has recently been introduced to facilitate applications where user data privacy is a requirement. However, it has not been thoroughly studied in the context of Privacy-Preserving Record Linkage, a problem in which the same real-world entity should be identified among databases from different data holders, but without disclosing any additional information. In this paper, we investigate the potential of Split Learning for Privacy-Preserving Record Matching by introducing a novel training method that utilizes Reference Sets, which are publicly available data corpora, and we show that it has minimal impact on matching quality compared to a traditional centralized SVM-based technique.
{"title":"Towards Split Learning-based Privacy-Preserving Record Linkage","authors":"Michail Zervas, Alexandros Karakasidis","doi":"arxiv-2409.01088","DOIUrl":"https://doi.org/arxiv-2409.01088","url":null,"abstract":"Split Learning has been recently introduced to facilitate applications where\u0000user data privacy is a requirement. However, it has not been thoroughly studied\u0000in the context of Privacy-Preserving Record Linkage, a problem in which the\u0000same real-world entity should be identified among databases from different\u0000dataholders, but without disclosing any additional information. In this paper,\u0000we investigate the potentials of Split Learning for Privacy-Preserving Record\u0000Matching, by introducing a novel training method through the utilization of\u0000Reference Sets, which are publicly available data corpora, showcasing minimal\u0000matching impact against a traditional centralized SVM-based technique.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"113 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amélie Gheerbrant, Leonid Libkin, Liat Peterfreund, Alexandra Rogova
SQL/PGQ and GQL are very recent international standards for querying property graphs: SQL/PGQ specifies how to query relational representations of property graphs in SQL, while GQL is a standalone language for graph databases. The rapid industrial development of these standards left the academic community trailing in its wake. While digests of the languages have appeared, we do not yet have concise foundational models, like relational algebra and calculus for relational databases, that enable the formal study of languages, including their expressiveness and limitations. At the same time, work on the next versions of the standards has already begun, to address the perceived limitations of their first versions. Motivated by this, we initiate a formal study of SQL/PGQ and GQL, concentrating on a concise formal model and on expressiveness. For the former, we define simple core languages -- Core GQL and Core PGQ -- that capture the essence of the new standards, are amenable to theoretical analysis, and fully clarify the difference between PGQ's bottom-up evaluation and GQL's linear, or pipelined, approach. Equipped with these models, we both confirm the necessity of extending the languages to fill the expressiveness gaps and identify the source of these deficiencies. We complement our theoretical analysis with an experimental study, demonstrating that existing workarounds in full GQL and PGQ are impractical, which further underscores the need to correct deficiencies in the language design.
{"title":"GQL and SQL/PGQ: Theoretical Models and Expressive Power","authors":"Amélie Gheerbrant, Leonid Libkin, Liat Peterfreund, Alexandra Rogova","doi":"arxiv-2409.01102","DOIUrl":"https://doi.org/arxiv-2409.01102","url":null,"abstract":"SQL/PGQ and GQL are very recent international standards for querying property\u0000graphs: SQL/PGQ specifies how to query relational representations of property\u0000graphs in SQL, while GQL is a standalone language for graph databases. The\u0000rapid industrial development of these standards left the academic community\u0000trailing in its wake. While digests of the languages have appeared, we do not\u0000yet have concise foundational models like relational algebra and calculus for\u0000relational databases that enable the formal study of languages, including their\u0000expressiveness and limitations. At the same time, work on the next versions of\u0000the standards has already begun, to address the perceived limitations of their\u0000first versions. Motivated by this, we initiate a formal study of SQL/PGQ and GQL,\u0000concentrating on their concise formal model and expressiveness. For the former,\u0000we define simple core languages -- Core GQL and Core PGQ -- that capture the\u0000essence of the new standards, are amenable to theoretical analysis, and fully\u0000clarify the difference between PGQ's bottom up evaluation versus GQL's linear,\u0000or pipelined approach. Equipped with these models, we both confirm the\u0000necessity to extend the language to fill in the expressiveness gaps and\u0000identify the source of these deficiencies. We complement our theoretical\u0000analysis with an experimental study, demonstrating that existing workarounds in\u0000full GQL and PGQ are impractical which further underscores the necessity to\u0000correct deficiencies in the language design.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Serverless query processing has become increasingly popular due to its auto-scaling, high elasticity, and pay-as-you-go pricing. It allows cloud data warehouse (or lakehouse) users to focus on data analysis without the burden of managing systems and resources. Accordingly, in serverless query services, users become more concerned about cost-efficiency under acceptable performance than about performance under fixed resources. This poses new challenges for serverless query engine design in providing flexible performance service-level agreements (SLAs) and cost-efficiency (i.e., prices). In this paper, we first define the problem of flexible performance SLAs and prices in serverless query processing and discuss its significance. Then, we envision the challenges and solutions for solving this problem and the opportunities it raises for other database research. Finally, we present PixelsDB, an open-source prototype with three service levels supported by dedicated architectural designs. Evaluations show that PixelsDB reduces resource costs by 65.5% for near-real-world workloads generated by the Cloud Analytics Benchmark (CAB) while not violating the pending-time guarantees.
{"title":"Serverless Query Processing with Flexible Performance SLAs and Prices","authors":"Haoqiong Bian, Dongyang Geng, Yunpeng Chai, Anastasia Ailamaki","doi":"arxiv-2409.01388","DOIUrl":"https://doi.org/arxiv-2409.01388","url":null,"abstract":"Serverless query processing has become increasingly popular due to its\u0000auto-scaling, high elasticity, and pay-as-you-go pricing. It allows cloud data\u0000warehouse (or lakehouse) users to focus on data analysis without the burden of\u0000managing systems and resources. Accordingly, in serverless query services,\u0000users become more concerned about cost-efficiency under acceptable performance\u0000than performance under fixed resources. This poses new challenges for\u0000serverless query engine design in providing flexible performance service-level\u0000agreements (SLAs) and cost-efficiency (i.e., prices). In this paper, we first define the problem of flexible performance SLAs and\u0000prices in serverless query processing and discuss its significance. Then, we\u0000envision the challenges and solutions for solving this problem and the\u0000opportunities it raises for other database research. Finally, we present\u0000PixelsDB, an open-source prototype with three service levels supported by\u0000dedicated architectural designs. Evaluations show that PixelsDB reduces\u0000resource costs by 65.5% for near-real-world workloads generated by Cloud\u0000Analytics Benchmark (CAB) while not violating the pending time guarantees.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"95 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bryan-Elliott Tam, Ruben Taelman, Julián Rojas Meléndez, Pieter Colpaert
Link Traversal queries face challenges in completeness and long execution times due to the size of the web. Reachability criteria define completeness by restricting the links followed by engines. However, the number of links to dereference remains the bottleneck of the approach. Web environments often have structures that query engines can exploit to prune irrelevant sources. Current criteria rely on information from the query definition and predefined predicates. However, it is difficult to use them to traverse environments where logical expressions indicate the location of resources. We propose a rule-based reachability criterion that captures logical statements expressed in hypermedia descriptions within linked data documents to prune irrelevant sources. In this poster paper, we show how the Comunica link traversal engine is modified to take hints from a hypermedia control vocabulary to prune irrelevant sources. Our preliminary findings show that, by using this strategy, the query engine can significantly reduce the number of HTTP requests and the query execution time without sacrificing the completeness of results. Our work shows that investigating hypermedia controls for link pruning in traversal queries is a worthwhile effort for optimizing web queries over unindexed decentralized databases.
{"title":"Optimizing Traversal Queries of Sensor Data Using a Rule-Based Reachability Approach","authors":"Bryan-Elliott Tam, Ruben Taelman, Julián Rojas Meléndez, Pieter Colpaert","doi":"arxiv-2408.17157","DOIUrl":"https://doi.org/arxiv-2408.17157","url":null,"abstract":"Link Traversal queries face challenges in completeness and long execution\u0000time due to the size of the web. Reachability criteria define completeness by\u0000restricting the links followed by engines. However, the number of links to\u0000dereference remains the bottleneck of the approach. Web environments often have\u0000structures exploitable by query engines to prune irrelevant sources. Current\u0000criteria rely on using information from the query definition and predefined\u0000predicate. However, it is difficult to use them to traverse environments where\u0000logical expressions indicate the location of resources. We propose to use a\u0000rule-based reachability criterion that captures logical statements expressed in\u0000hypermedia descriptions within linked data documents to prune irrelevant\u0000sources. In this poster paper, we show how the Comunica link traversal engine\u0000is modified to take hints from a hypermedia control vocabulary, to prune\u0000irrelevant sources. Our preliminary findings show that by using this strategy,\u0000the query engine can significantly reduce the number of HTTP requests and the\u0000query execution time without sacrificing the completeness of results. Our work\u0000shows that the investigation of hypermedia controls in link pruning of\u0000traversal queries is a worthy effort for optimizing web queries of unindexed\u0000decentralized databases.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Covid-19 pandemic has affected the world at multiple levels. Data sharing was pivotal for advancing research to understand the underlying causes and implement effective containment strategies. In response, many countries have promoted the availability of daily cases to support research initiatives, fostering collaboration between organisations and making such data available to the public through open data platforms. Despite the several advantages of data sharing, one of the major concerns before releasing health data is its impact on individuals' privacy. Such a sharing process should be based on state-of-the-art methods in Data Protection by Design and by Default. In this paper, we use a data set related to Covid-19 cases in the second largest hospital in Portugal to show how it is feasible to ensure data privacy while improving the quality and maintaining the utility of the data. Our goal is to demonstrate how knowledge exchange in multidisciplinary teams of healthcare practitioners, data privacy, and data science experts is crucial to co-developing strategies that ensure high utility of de-identified data.
{"title":"Empowering Open Data Sharing for Social Good: A Privacy-Aware Approach","authors":"Tânia Carvalho, Luís Antunes, Cristina Costa, Nuno Moniz","doi":"arxiv-2408.17378","DOIUrl":"https://doi.org/arxiv-2408.17378","url":null,"abstract":"The Covid-19 pandemic has affected the world at multiple levels. Data sharing\u0000was pivotal for advancing research to understand the underlying causes and\u0000implement effective containment strategies. In response, many countries have\u0000promoted the availability of daily cases to support research initiatives,\u0000fostering collaboration between organisations and making such data available to\u0000the public through open data platforms. Despite the several advantages of data\u0000sharing, one of the major concerns before releasing health data is its impact\u0000on individuals' privacy. Such a sharing process should be based on\u0000state-of-the-art methods in Data Protection by Design and by Default. In this\u0000paper, we use a data set related to Covid-19 cases in the second largest\u0000hospital in Portugal to show how it is feasible to ensure data privacy while\u0000improving the quality and maintaining the utility of the data. Our goal is to\u0000demonstrate how knowledge exchange in multidisciplinary teams of healthcare\u0000practitioners, data privacy, and data science experts is crucial to\u0000co-developing strategies that ensure high utility of de-identified data.","PeriodicalId":501123,"journal":{"name":"arXiv - CS - Databases","volume":"48 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}