Towards a Universal, Quantifiable, and Scalable File Format Converter

Kenton McHenry, R. Kooper, P. Bajcsy
Published in: 2009 Fifth IEEE International Conference on e-Science
Publication date: 2009-12-09
DOI: 10.1109/e-Science.2009.28 (https://doi.org/10.1109/e-Science.2009.28)
Citations: 14

Abstract

This paper addresses the problem of designing a universal file format converter. File format conversion is a necessary part of data dissemination and curation. Complete and robust converters, however, are hard to find and build because of the abundance of file formats, the fact that many formats are closed, and the complexities within individual format specifications. On the other hand, many software applications exist that can perform some degree of data conversion between a subset of the available formats. To take advantage of this, we introduce a data structure called an I/O-Graph to store the available input and output formats of these applications. Based on a concept of imposed code reuse, we use this graph to develop a service, NCSA Polyglot, which is capable of performing the larger union of conversions supported by the underlying software. The Polyglot system is designed to be easily extensible, scalable with the number of conversion requests, and inclusive of all available third-party software. Given a data set of files from a particular domain, we are able to assign weights to the edges within the I/O-Graph indicating the amount of information retained during a conversion. These edge weights then allow the system to choose conversion paths with the least amount of information loss.
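
The idea of an I/O-Graph with weighted edges can be illustrated with a minimal sketch. This is not the Polyglot implementation: the application names, the formats, the loss values, and the assumption that information loss adds up along a path are all hypothetical, chosen only to show how a least-loss conversion path could be selected. Formats are vertices, each application contributes one edge per (input, output) pair it supports, and a standard Dijkstra search picks the cheapest chain of conversions.

```python
import heapq
from collections import defaultdict

# Hypothetical applications and the formats each one can read and write.
APPS = {
    "app_a": {"inputs": ["obj", "stl"], "outputs": ["ply"]},
    "app_b": {"inputs": ["ply"], "outputs": ["x3d", "stl"]},
    "app_c": {"inputs": ["obj"], "outputs": ["x3d"]},
}

# Hypothetical information loss per (application, source, target).
# 0.0 would mean a lossless conversion; these numbers are made up.
LOSS = {
    ("app_a", "obj", "ply"): 0.1,
    ("app_a", "stl", "ply"): 0.2,
    ("app_b", "ply", "x3d"): 0.1,
    ("app_b", "ply", "stl"): 0.3,
    ("app_c", "obj", "x3d"): 0.5,
}


def build_io_graph(apps, loss):
    """Return {source_format: [(target_format, application, loss), ...]}."""
    graph = defaultdict(list)
    for app, io in apps.items():
        for src in io["inputs"]:
            for dst in io["outputs"]:
                if src != dst:
                    graph[src].append((dst, app, loss[(app, src, dst)]))
    return graph


def least_loss_path(graph, source, target):
    """Dijkstra search; returns (total_loss, [(app, src, dst), ...]) or None.

    Assumes losses are non-negative and additive along a path.
    """
    queue = [(0.0, source, [])]
    best = {source: 0.0}
    visited = set()
    while queue:
        cost, fmt, path = heapq.heappop(queue)
        if fmt == target:
            return cost, path
        if fmt in visited:
            continue
        visited.add(fmt)
        for dst, app, weight in graph.get(fmt, []):
            new_cost = cost + weight
            if new_cost < best.get(dst, float("inf")):
                best[dst] = new_cost
                heapq.heappush(queue, (new_cost, dst, path + [(app, fmt, dst)]))
    return None


if __name__ == "__main__":
    io_graph = build_io_graph(APPS, LOSS)
    print(least_loss_path(io_graph, "obj", "x3d"))
```

In this toy graph, the direct conversion from obj to x3d through app_c is available but costly, so the search prefers the two-step route through ply. That is the sense in which chaining existing applications can cover the union of their conversions while still favoring paths that retain more information. If weights were instead stored as the fraction of information retained, as the abstract describes, one option would be to convert each weight to a loss (for example, one minus the retained fraction) before searching; that conversion is also an assumption here.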