Finding Related Tables in Data Lakes for Interactive Data Science

Proceedings. ACM-SIGMOD International Conference on Management of Data. Pub Date: 2020-06-01. DOI: 10.1145/3318464.3389726

Yi Zhang, Zachary G Ives

Volume 2020, pp. 1951-1966.

Citations: 75

Abstract

Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.


