Christopher De Sa, Alexander J. Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, Ce Zhang
DeepDive: Declarative Knowledge Base Construction
SIGMOD Record, vol. 45, no. 1, pp. 60-67. Published 2016-06-02. DOI: 10.1145/2949741.2949756 (https://doi.org/10.1145/2949741.2949756)
Citations: 133
Abstract
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and PDF reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.
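The declarative split described in the abstract (the domain expert supplies features or rules, and the system handles learning and inference) can be illustrated with a minimal Python sketch. This is not DeepDive's actual interface, which the abstract does not show. Every name here (`Candidate`, `FEATURES`, `fit`, `predict`), the toy sentences, and the use of plain logistic regression as the inference engine are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical candidate: a sentence plus two mentions that may be spouses.
Candidate = Tuple[str, str, str]  # (sentence, mention_a, mention_b)

# --- Written by the domain expert: features only, no inference code. ---
def spouse_word(c: Candidate) -> bool:
    return any(w in c[0].lower() for w in ("wife", "husband", "married"))

def sibling_word(c: Candidate) -> bool:
    return any(w in c[0].lower() for w in ("brother", "sister"))

FEATURES: Dict[str, Callable[[Candidate], bool]] = {
    "spouse_word": spouse_word,
    "sibling_word": sibling_word,
}

# --- Generic engine (assumed here to be logistic regression). ---
def featurize(c: Candidate) -> List[float]:
    # Leading 1.0 is a bias term; the rest are the expert's features.
    return [1.0] + [1.0 if f(c) else 0.0 for f in FEATURES.values()]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights: List[float], c: Candidate) -> float:
    # Probability that the candidate expresses the target relation.
    return sigmoid(sum(w * x for w, x in zip(weights, featurize(c))))

def fit(examples: List[Tuple[Candidate, bool]],
        epochs: int = 500, lr: float = 0.5) -> List[float]:
    # Deterministic per-example gradient ascent on the log-likelihood.
    weights = [0.0] * (len(FEATURES) + 1)
    for _ in range(epochs):
        for c, label in examples:
            err = (1.0 if label else 0.0) - predict(weights, c)
            weights = [w + lr * err * x
                       for w, x in zip(weights, featurize(c))]
    return weights

# Toy labeled candidates (invented for this sketch).
TRAINING: List[Tuple[Candidate, bool]] = [
    (("Barack Obama and his wife Michelle attended.", "Barack Obama", "Michelle"), True),
    (("Ann married Bob in 1990.", "Ann", "Bob"), True),
    (("Malia is the sister of Sasha.", "Malia", "Sasha"), False),
    (("Tom met his brother Jim.", "Tom", "Jim"), False),
    (("Alice spoke with Carol.", "Alice", "Carol"), False),
]

if __name__ == "__main__":
    w = fit(TRAINING)
    query = ("John and his husband Mark arrived.", "John", "Mark")
    print(f"P(spouse) = {predict(w, query):.3f}")
```

In this sketch, adapting the extractor to a new domain means editing only `FEATURES` and the labeled examples. The learning and inference code stays untouched, which is the property the abstract attributes to declarative KBC programs.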
Journal description:
SIGMOD investigates the development and application of database technology to support the full range of data management needs. The scope of interests and members is wide, with an almost equal mix of people from industry and academia. SIGMOD sponsors an annual conference that is regarded as one of the most important in the field, particularly for practitioners.
Areas of Special Interest:
Active and temporal data management, data mining and models, database programming languages, databases on the WWW, distributed data management, engineering, federated multi-database and mobile management, query processing & optimization, rapid application development tools, spatial data management, user interfaces.