Skluma: An Extensible Metadata Extraction Pipeline for Disorganized Data

Tyler J. Skluzacek, Rohan Kumar, Ryan Chard, Galen Harrison, Paul Beckman, K. Chard, Ian T Foster
2018 IEEE 14th International Conference on e-Science (e-Science), pp. 256-266. Published 2018-10-01. DOI: 10.1109/eScience.2018.00040 (https://doi.org/10.1109/eScience.2018.00040)
Citations: 15

Abstract

To mitigate the effects of high-velocity data expansion and to automate the organization of filesystems and data repositories, we have developed Skluma, a system that automatically processes a target filesystem or repository, extracts content- and context-based metadata, and organizes the extracted metadata for subsequent use. Skluma can extract diverse metadata, including aggregate values derived from embedded structured data; named entities and latent topics buried within free-text documents; and content encoded in images. Skluma implements an overarching probabilistic pipeline to extract increasingly specific metadata from files. It applies machine learning methods to determine file types, dynamically prioritizes and then executes a suite of metadata extractors, and explores contextual metadata based on relationships among files. The derived metadata, represented in JSON, describes probabilistic knowledge of each file that may subsequently be used for discovery or organization. Skluma's architecture enables it to be both deployed locally and used as an on-demand, cloud-hosted service to create and execute dynamic extraction workflows on massive numbers of files. It is modular and extensible, allowing users to contribute their own specialized metadata extractors. Thus far we have tested Skluma on local filesystems, remote FTP-accessible servers, and publicly accessible Globus endpoints. We have demonstrated its efficacy by applying it to a scientific environmental data repository of more than 500,000 files. We show that we can extract metadata from those files with modest cloud costs in a few hours.
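
The pipeline described in the abstract (estimate a file's type, prioritize extractors accordingly, run them, and record the result as JSON) can be illustrated with a minimal sketch. This is not Skluma's code or API: the names `EXTRACTORS`, `extractor`, `guess_type`, and `extract` are hypothetical, the comma-counting heuristic merely stands in for the machine-learned file-type model the abstract mentions, and the `threshold` value is an arbitrary assumption.

```python
import json
from collections import Counter
from typing import Callable, Dict

# Hypothetical plug-in registry: extractor name -> function(text) -> metadata.
EXTRACTORS: Dict[str, Callable[[str], dict]] = {}


def extractor(name: str):
    """Register a function as a metadata extractor (illustrative hook)."""
    def wrap(fn):
        EXTRACTORS[name] = fn
        return fn
    return wrap


@extractor("tabular")
def tabular_extractor(text: str) -> dict:
    """Compute aggregate values for the numeric columns of comma-separated text."""
    rows = [line.split(",") for line in text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    aggregates = {}
    for i, col in enumerate(header):
        try:
            values = [float(row[i]) for row in body]
        except (ValueError, IndexError):
            continue  # skip non-numeric or ragged columns
        if values:
            aggregates[col.strip()] = {
                "min": min(values),
                "max": max(values),
                "mean": sum(values) / len(values),
            }
    return {"columns": [c.strip() for c in header], "aggregates": aggregates}


@extractor("freetext")
def freetext_extractor(text: str) -> dict:
    """Report the most frequent longer words as a crude stand-in for topics."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    counts = Counter(w for w in words if len(w) > 3)
    return {"top_terms": [w for w, _ in counts.most_common(3)]}


def guess_type(text: str) -> Dict[str, float]:
    """Toy stand-in for a learned classifier: share of lines containing commas."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return {"tabular": 0.0, "freetext": 0.0}
    comma_share = sum("," in line for line in lines) / len(lines)
    return {"tabular": comma_share, "freetext": 1.0 - comma_share}


def extract(path: str, text: str, threshold: float = 0.2) -> dict:
    """Run extractors in order of estimated type probability, skipping unlikely ones."""
    probs = guess_type(text)
    record = {"path": path, "type_probabilities": probs, "metadata": {}}
    for name in sorted(probs, key=probs.get, reverse=True):
        if probs[name] < threshold or name not in EXTRACTORS:
            continue
        try:
            record["metadata"][name] = EXTRACTORS[name](text)
        except Exception as err:  # one failing extractor should not stop the rest
            record["metadata"][name] = {"error": str(err)}
    return record


# Usage: one JSON record per file, keeping the type probabilities alongside the metadata.
print(json.dumps(extract("cruise.csv", "depth,temp\n10,4.5\n20,3.5\n"), indent=2))
```

Storing the type probabilities in the record mirrors the abstract's point that the derived metadata expresses probabilistic knowledge of each file, and the registry decorator is one plausible way to let users contribute their own extractors without changing the pipeline itself.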