Plain Text & Character Encoding: A Primer for Data Curators

S. Erickson
Journal of eScience Librarianship. Published 2021-08-11. DOI: 10.7191/jeslib.2021.1211 (https://doi.org/10.7191/jeslib.2021.1211)

Abstract

Plain text data consists of a sequence of encoded characters or “code points” from a given standard such as the Unicode Standard. Some of the most common file formats for digital data used in eScience (CSV, XML, and JSON, for example) are built atop plain text standards. Plain text representations of digital data are often preferred because plain text formats are relatively stable, and they facilitate reuse and interoperability. Despite its ubiquity, plain text is not as plain as it may seem. The standards used in modern text encoding (principally, the Unicode Character Set and the related encoding format, UTF-8) have complex architectures when compared to historical standards like ASCII. Further, while the Unicode Standard has gained in prominence, text encoding problems are not uncommon in research data curation. This primer provides conceptual foundations for modern text encoding and guidance for common curation and preservation actions related to textual data.
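
The distinction the abstract draws between code points and their encoded form can be illustrated with a minimal Python sketch. It is not taken from the primer; the function name `describe` and the sample string are assumptions chosen for illustration. The sketch lists each character's Unicode code point alongside its UTF-8 bytes, then shows one common kind of encoding problem: UTF-8 bytes decoded with the wrong encoding (here Latin-1).

```python
def describe(text: str) -> list:
    """Return (character, code point, UTF-8 bytes as hex) for each character."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"                # abstract code point
        utf8_hex = ch.encode("utf-8").hex(" ").upper()  # concrete encoded bytes
        rows.append((ch, code_point, utf8_hex))
    return rows


# Hypothetical sample: e-acute, the euro sign, and an ASCII letter.
sample = "\u00e9\u20acA"
for row in describe(sample):
    print(row)

# Decoding UTF-8 bytes as Latin-1 does not raise an error; it silently
# produces garbled text, which is why such problems can go unnoticed.
garbled = "\u00e9".encode("utf-8").decode("latin-1")
print(garbled)
```

The ASCII letter occupies a single byte, while the other two characters take two and three bytes respectively, which is one reason UTF-8 is more involved than a fixed one-byte standard such as ASCII.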