Sigmod Record: Latest Articles
Kùzu: A Database Management System For "Beyond Relational" Workloads
Zone 4, Computer Science; Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-10-30; DOI: 10.1145/3631504.3631514
Semih Salihoglu
I would like to share my opinions on the following question: how should a modern graph DBMS (GDBMS) be architected? This is the motivating research question we are addressing in the Kùzu project at the University of Waterloo [4, 5]. I will argue that a modern GDBMS should optimize for a set of what I will call, for lack of a better term, "beyond relational" workloads. As background, let me start with a brief overview of GDBMSs.
Citations: 0
Query Evaluation under Differential Privacy
Zone 4, Computer Science; Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-10-30; DOI: 10.1145/3631504.3631506
Wei Dong, Ke Yi
Differential privacy has garnered significant attention in recent years due to its potential in offering robust privacy protection for individual data during analysis. With the increasing volume of sensitive information being collected by organizations and analyzed through SQL queries, the development of a general-purpose query engine that is capable of supporting a broad range of queries while maintaining differential privacy has become the holy grail in privacy-preserving query release. Towards this goal, this article surveys recent advances in query evaluation under differential privacy.
Citations: 0
Proactive Resource Allocation Policy for Microsoft Azure Cognitive Search
Zone 4, Computer Science; Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-10-30; DOI: 10.1145/3631504.3631516
Olga Poppe, Pablo Castro, Willis Lang, Jyoti Leeka
Modern cloud services aim to find the middle ground between quality of service and operational cost efficiency by allocating resources if and only if these resources are needed by the customers. Unfortunately, most industrial demand-driven resource allocation approaches are reactive. Given that scaling mechanisms are not instantaneous, the reactive policy may introduce delays to latency-sensitive customer workloads and waste operational costs for cloud service providers. To solve this catch-22, we define the proactive resource allocation policy for Microsoft Azure Cognitive Search. In addition to the current resource demand, the proactive policy takes the typical resource usage patterns into account. We gained the following valuable insights from these patterns over several months of production workloads. One, 87% of the workload is stable due to continuous resource demand. Two, 90% of varying demand is predictable based on a few weeks of historical traces. Three, resources can be reclaimed 52% of the time due to extensive idle intervals of varying workload. Given the size and scope of our analysis, we believe that our approach applies to any latency-sensitive cloud service.
Citations: 0
Where do Databases and Digital Forensics meet? A Comprehensive Survey and Taxonomy
Zone 4, Computer Science; Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-10-30; DOI: 10.1145/3631504.3631508
Danilo B. Seufitelli, Michele A. Brandão, Ayane C. A. Fernandes, Kayque M. Siqueira, Mirella M. Moro
We present a systematic literature review and propose a taxonomy for research at the intersection of Digital Forensics and Databases. The merge between these two areas has become more prolific due to the growing volume of data and mobile apps on the Web, and the consequent rise in cyber attacks. Our review has identified 91 relevant papers. The taxonomy categorizes such papers into: Cyber-Attacks (subclasses SQLi, Attack Detection, Data Recovery) and Criminal Intelligence (subclasses Forensic Investigation, Research Products, Crime Resolution). Overall, we contribute to better understanding the intersection between digital forensics and databases, and open opportunities for future research and development with potential for significant social, economic, and technical-scientific contributions.
Citations: 0
From Large Language Models to Databases and Back: A Discussion on Research and Education
Zone 4, Computer Science; Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-10-30; DOI: 10.1145/3631504.3631518
Sihem Amer-Yahia, Angela Bonifati, Lei Chen, Guoliang Li, Kyuseok Shim, Jianliang Xu, Xiaochun Yang
In recent years, large language models (LLMs) have garnered increasing attention from both academia and industry due to their potential to facilitate natural language processing (NLP) and generate high-quality text. Despite their benefits, however, the use of LLMs is raising concerns about the reliability of knowledge extraction. The combination of DB research and data science has advanced the state of the art in solving real-world problems, such as merchandise recommendation and hazard prevention [30]. In this discussion, we explore the challenges and opportunities related to LLMs in DB and data science research and education.
Citations: 3
Reminiscences on Influential Papers
Zone 4, Computer Science; Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-10-30; DOI: 10.1145/3631504.3631512
Renata Borovica-Gajic
This issue's contributors chose papers that address challenges at the heart of database systems: physical design tuning for index selection and transaction isolation levels. Both contributions emphasize the elegant, modular, and long-lasting design choices of the respective work. Enjoy reading!
Citations: 0
Apache Wayang: A Unified Data Analytics Framework
Zone 4, Computer Science; Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2023-10-30; DOI: 10.1145/3631504.3631510
Kaustubh Beedkar, Bertty Contreras-Rojas, Haralampos Gavriilidis, Zoi Kaoudi, Volker Markl, Rodrigo Pardo-Meza, Jorge-Arnulfo Quiané-Ruiz
The large variety of specialized data processing platforms and the increased complexity of data analytics have led to the need for unifying data analytics within a single framework. Such a framework should free users from the burden of (i) choosing the right platform(s) and (ii) gluing code between the different parts of their pipelines. Apache Wayang (Incubating) is the only open-source framework that provides a systematic solution to unified data analytics by integrating multiple heterogeneous data processing platforms. It achieves that by decoupling applications from the underlying platforms and providing an optimizer so that users do not have to specify the platforms on which their pipeline should run. Wayang provides a unified view and processing model, effectively integrating the hodgepodge of heterogeneous platforms into a single framework with increased usability without sacrificing performance and total cost of ownership. In this paper, we present the architecture of Wayang, describe its main components, and give an outlook on future directions.
Citations: 0
Concurrent Prefix Recovery
IF 1.1; Zone 4, Computer Science; Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2020-09-04; DOI: 10.1145/3422648.3422653
Guna Prasaad, Badrish Chandramouli, Donald Kossmann
This paper proposes a new recovery model based on group commit, called concurrent prefix recovery (CPR). CPR differs from traditional group commit implementations in two ways: (1) it provides a sem...
Citations: 0
Implicit Parallelism through Deep Language Embedding
IF 1.1; Zone 4, Computer Science; Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2016-06-02; DOI: 10.1145/2949741.2949754
Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, Volker Markl
Parallel collection processing based on second-order functions such as map and reduce has been widely adopted for scalable data analysis. Initially popularized by Google, over the past decade this ...
Citations: 3
DeepDive: Declarative Knowledge Base Construction
IF 1.1; Zone 4, Computer Science; Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS; Pub Date: 2016-06-02; DOI: 10.1145/2949741.2949756
Christopher De Sa, Alexander J. Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, Ce Zhang
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.
Citations: 133