I would like to share my opinions on the following question: how should a modern graph DBMS (GDBMS) be architected? This is the motivating research question we are addressing in the Kùzu project at the University of Waterloo [4, 5]. I will argue that a modern GDBMS should optimize for a set of what I will call, for lack of a better term, "beyond relational" workloads. As background, let me start with a brief overview of GDBMSs.
Kùzu: A Database Management System For "Beyond Relational" Workloads. Semih Salihoglu. SIGMOD Record, 2023-10-30. DOI: 10.1145/3631504.3631514.
Differential privacy has garnered significant attention in recent years due to its potential in offering robust privacy protection for individual data during analysis. With the increasing volume of sensitive information being collected by organizations and analyzed through SQL queries, the development of a general-purpose query engine that is capable of supporting a broad range of queries while maintaining differential privacy has become the holy grail in privacy-preserving query release. Towards this goal, this article surveys recent advances in query evaluation under differential privacy.
Query Evaluation under Differential Privacy. Wei Dong, Ke Yi. SIGMOD Record, 2023-10-30. DOI: 10.1145/3631504.3631506.
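The classic building block behind differentially private query answering is adding calibrated noise to a query's true answer. A minimal sketch for a counting query under pure ε-differential privacy (the survey itself covers far more general queries; the function names here are illustrative, not from the paper):

```python
import math
import random

def dp_count(records, predicate, epsilon):
    """Release a counting-query answer under epsilon-differential privacy.

    A count has sensitivity 1 (adding or removing one record changes the
    answer by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    # Sample Laplace(0, 1/epsilon) via the inverse-CDF method.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```

With a tiny ε the released count is very noisy (strong privacy); with a large ε it is close to the true count (weak privacy). The hard part surveyed in the article is extending this idea to joins and general SQL, where sensitivity is no longer a small constant.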
Olga Poppe, Pablo Castro, Willis Lang, Jyoti Leeka
Modern cloud services aim to find the middle ground between quality of service and operational cost efficiency by allocating resources if and only if these resources are needed by the customers. Unfortunately, most industrial demand-driven resource allocation approaches are reactive. Given that scaling mechanisms are not instantaneous, the reactive policy may introduce delays to latency-sensitive customer workloads and waste operational costs for cloud service providers. To solve this catch-22, we define the proactive resource allocation policy for Microsoft Azure Cognitive Search. In addition to the current resource demand, the proactive policy takes the typical resource usage patterns into account. We gained the following valuable insights from these patterns over several months of production workloads. One, 87% of the workload is stable due to continuous resource demand. Two, 90% of varying demand is predictable based on a few weeks of historical traces. Three, resources can be reclaimed 52% of the time due to extensive idle intervals of varying workload. Given the size and scope of our analysis, we believe that our approach applies to any latency-sensitive cloud service.
Proactive Resource Allocation Policy for Microsoft Azure Cognitive Search. Olga Poppe, Pablo Castro, Willis Lang, Jyoti Leeka. SIGMOD Record, 2023-10-30. DOI: 10.1145/3631504.3631516.
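The core idea of the abstract, allocating for the larger of current demand and the typical demand learned from historical traces, can be sketched as follows. This is a toy illustration under my own assumptions (hourly slots, weekly periodicity, a fixed headroom), not the production policy described in the paper:

```python
from collections import defaultdict

SLOTS_PER_WEEK = 7 * 24  # hourly slots across one week

def build_weekly_profile(history):
    """Aggregate (slot, demand) observations into a per-slot-of-week peak,
    reflecting the insight that most varying demand repeats week over week."""
    profile = defaultdict(int)
    for slot, demand in history:
        key = slot % SLOTS_PER_WEEK
        profile[key] = max(profile[key], demand)
    return profile

def proactive_allocation(current_demand, slot, profile, headroom=1):
    """Allocate for the larger of the demand observed right now and the
    typical demand for this hour of the week, plus a small headroom so that
    scaling need not be instantaneous."""
    predicted = profile.get(slot % SLOTS_PER_WEEK, 0)
    return max(current_demand, predicted) + headroom
```

Unlike a purely reactive policy, this scales up before a recurring weekly spike arrives, at the cost of holding some capacity idle when the prediction overshoots.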
Danilo B. Seufitelli, Michele A. Brandão, Ayane C. A. Fernandes, Kayque M. Siqueira, Mirella M. Moro
We present a systematic literature review and propose a taxonomy for research at the intersection of Digital Forensics and Databases. Research merging these two areas has become more prolific due to the growing volume of data and mobile apps on the Web, and the consequent rise in cyber attacks. Our review has identified 91 relevant papers. The taxonomy categorizes such papers into: Cyber-Attacks (subclasses SQLi, Attack Detection, Data Recovery) and Criminal Intelligence (subclasses Forensic Investigation, Research Products, Crime Resolution). Overall, we contribute to a better understanding of the intersection between digital forensics and databases, and open opportunities for future research and development with potential for significant social, economic, and technical-scientific contributions.
Where do Databases and Digital Forensics meet? A Comprehensive Survey and Taxonomy. Danilo B. Seufitelli, Michele A. Brandão, Ayane C. A. Fernandes, Kayque M. Siqueira, Mirella M. Moro. SIGMOD Record, 2023-10-30. DOI: 10.1145/3631504.3631508.
Sihem Amer-Yahia, Angela Bonifati, Lei Chen, Guoliang Li, Kyuseok Shim, Jianliang Xu, Xiaochun Yang
In recent years, large language models (LLMs) have garnered increasing attention from both academia and industry due to their potential to facilitate natural language processing (NLP) and generate high-quality text. Despite their benefits, however, the use of LLMs is raising concerns about the reliability of knowledge extraction. The combination of DB research and data science has advanced the state of the art in solving real-world problems, such as merchandise recommendation and hazard prevention [30]. In this discussion, we explore the challenges and opportunities related to LLMs in DB and data science research and education.
From Large Language Models to Databases and Back: A Discussion on Research and Education. Sihem Amer-Yahia, Angela Bonifati, Lei Chen, Guoliang Li, Kyuseok Shim, Jianliang Xu, Xiaochun Yang. SIGMOD Record, 2023-10-30. DOI: 10.1145/3631504.3631518.
This issue's contributors chose papers that address challenges at the heart of database systems: physical design tuning for index selection and transaction isolation levels. Both contributions emphasize the elegant, modular, and long-lasting design choices of the respective work. Enjoy reading!
Reminiscences on Influential Papers. Renata Borovica-Gajic. SIGMOD Record, 2023-10-30. DOI: 10.1145/3631504.3631512.
The large variety of specialized data processing platforms and the increased complexity of data analytics have led to the need for unifying data analytics within a single framework. Such a framework should free users from the burden of (i) choosing the right platform(s) and (ii) gluing code between the different parts of their pipelines. Apache Wayang (Incubating) is the only open-source framework that provides a systematic solution to unified data analytics by integrating multiple heterogeneous data processing platforms. It achieves that by decoupling applications from the underlying platforms and providing an optimizer so that users do not have to specify the platforms on which their pipeline should run. Wayang provides a unified view and processing model, effectively integrating the hodgepodge of heterogeneous platforms into a single framework with increased usability, without sacrificing performance or total cost of ownership. In this paper, we present the architecture of Wayang, describe its main components, and give an outlook on future directions.
Apache Wayang: A Unified Data Analytics Framework. Kaustubh Beedkar, Bertty Contreras-Rojas, Haralampos Gavriilidis, Zoi Kaoudi, Volker Markl, Rodrigo Pardo-Meza, Jorge-Arnulfo Quiané-Ruiz. SIGMOD Record, 2023-10-30. DOI: 10.1145/3631504.3631510.
This paper proposes a new recovery model based on group commit, called concurrent prefix recovery (CPR). CPR differs from traditional group commit implementations in two ways: (1) it provides a sem...
Concurrent Prefix Recovery. Guna Prasaad, Badrish Chandramouli, Donald Kossmann. SIGMOD Record, 2020-09-04. DOI: 10.1145/3422648.3422653.
Parallel collection processing based on second-order functions such as map and reduce has been widely adopted for scalable data analysis. Initially popularized by Google, over the past decade this ...
Implicit Parallelism through Deep Language Embedding. Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, Volker Markl. SIGMOD Record, 2016-06-02. DOI: 10.1145/2949741.2949754.
Christopher De Sa, Alexander J. Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, Ce Zhang
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and PDF reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.
DeepDive: Declarative Knowledge Base Construction. Christopher De Sa, Alexander J. Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, Ce Zhang. SIGMOD Record 45(1), 60-67, 2016-06-02. DOI: 10.1145/2949741.2949756.