While small labs produce much of the fundamental experimental research in Materials Science and Engineering (MSE), little is known about their data management and sharing practices and the extent to which they promote trust in, and transparency of, the published research. In this research, we conduct a case study of a leading MSE research lab to characterize the limits of current data management and sharing practices concerning reproducibility and attribution. We systematically reconstruct the workflows underpinning four research projects by combining interviews, document review, and digital forensics. We then apply information graph analysis and computer-assisted retrospective auditing to identify where critical research information is unavailable or at risk. We find that while data management and sharing practices in this leading lab protect against computer and disk failure, they are insufficient to ensure reproducibility or correct attribution of work, especially when a group member withdraws before project completion. We conclude with recommendations for adjusting MSE data management and sharing practices to promote trustworthiness and transparency by adding lightweight automated file-level auditing and automated data transfer processes.
Ye Li, Sara Wilson, Micah Altman. "Reproducible and Attributable Materials Science Curation Practices: A Case Study." International Journal of Digital Curation. Published 2024-07-28. https://doi.org/10.2218/ijdc.v18i1.940
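The recommendation of lightweight automated file-level auditing can be illustrated with a minimal sketch. This is not the authors' tooling: the function names `build_manifest` and `audit`, and the choice of SHA-256 checksums over a project directory, are assumptions made purely for illustration. The idea is to record a checksum for every file at one point in time, then compare a later snapshot against it to see which files have gone missing, appeared, or changed.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest.

    Hypothetical helper: reads each file fully into memory, which is
    acceptable for a sketch but not for very large instrument files.
    """
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def audit(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and list missing, added, and modified files."""
    shared = set(old) & set(new)
    return {
        "missing": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "modified": sorted(p for p in shared if old[p] != new[p]),
    }
```

Running `build_manifest` on a schedule and storing each result alongside a timestamp would give a simple audit trail; who made a change is not captured by checksums alone and would need additional metadata.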
Trusted Research Environments (TREs) enable the analysis of sensitive data under strict security assertions that protect the data with technical, organizational, and legal measures from being leaked (even accidentally) outside the facility. While many TREs exist in Europe, little information is publicly available on their architecture, the descriptions of their building blocks, and their slight technical variations. To shed light on these problems, an overview of the existing, publicly described TREs and a bibliography linking to the system descriptions are provided. Their technical characteristics, especially their commonalities and variations, are analysed, and insight is provided into their data type characteristics and availability. The literature study shows that 47 TREs worldwide provide access to sensitive data, of which two-thirds provide data predominantly via secure remote access. Statistical offices (SOs) make available the majority of the sensitive data records included in this study.
Martin Weise, Andreas Rauber. "Trusted Research Environments: Analysis of Characteristics and Data Availability." International Journal of Digital Curation. Published 2024-07-22. https://doi.org/10.2218/ijdc.v18i1.939
Emulation and migration are still our main tools for digital curation and preservation practice. Both strategies have been discussed extensively and have been demonstrated to be effective and applicable in various scenarios. Discussions have primarily centered on technical feasibility, workflow integration, and usability. However, one important aspect of these two techniques remains to be addressed: managing and preserving operational knowledge. Both approaches require specialized knowledge, but emulation in particular requires future users to have broad knowledge of past software and computer systems in order to operate it successfully. We investigate how this knowledge can be stored and utilized, and to what extent it can be rendered machine-actionable, using modern large language models. We demonstrate a proof-of-concept implementation that operates an emulated software environment through natural language.
Klaus Rechert, Rafael Gieschke. "Preserving Secondary Knowledge." International Journal of Digital Curation. Published 2024-07-08. https://doi.org/10.2218/ijdc.v18i1.930
Katharina Flicker, Andreas Rauber, Bettina Kern, Fajar J. Ekaputra
Trust is an essential precondition for the acceptance of digital infrastructures and services. Transparency has been identified as one mechanism for increasing trustworthiness. Yet it is difficult to assess to what extent and how exactly different aspects of transparency contribute to trust, or potentially impede it when the information provided is overwhelmingly complex. To address these issues, we performed two initial studies to help determine the factors that influence trust, focusing on transparency across a range of elements associated with data, data infrastructures, and virtual research environments. On the one hand, we surveyed IT experts in the field of data science, focusing on quality aspects in the context of re-using and sharing open source software and assessing issues such as the need for documentation, test cases, and accountability. On the other hand, we complemented this with a set of semi-structured interviews with senior researchers to address specific questions about the degree of transparency achievable with different approaches. These include, for example, the amount of transparency we can achieve with approaches from explainable AI, and the usefulness and limitations of data provenance in determining the suitability of data for reuse, among others. Specifically, we consider mechanisms on three levels: technical, process-oriented, and social. Starting from attributes of trust in the “analogue world”, we aim to understand which of these can be applied in the digital world, how they differ, and what additional mechanisms need to be established to support trust in complex socio-technological processes and their emergent results when the traditional approaches can no longer be applied.
Katharina Flicker, Andreas Rauber, Bettina Kern, Fajar J. Ekaputra. "Factors Influencing Perceptions of Trust in Data Infrastructures." International Journal of Digital Curation. Published 2024-05-13. https://doi.org/10.2218/ijdc.v18i1.921
This paper aims to better understand early career researchers’ (ECRs’) research data management (RDM) competencies by assessing the contents and quality of data management plans (DMPs) developed during a multi-stakeholder RDM course. We also aim to identify differences between DMPs in relation to several background variables (e.g., discipline, course track). The Basics of Research Data Management (BRDM) course has been held in two multi-faculty, research-intensive universities in Finland since 2020. In this study, 223 ECRs’ DMPs created in the BRDM courses of 2020 to 2022 were assessed using the recommendations and criteria of the Finnish DMP Evaluation Guide + General Finnish DMP Guidance (FDEG). The median quality of the DMPs appeared to be satisfactory. The differences in rating according to FDEG’s three-point performance criteria were not statistically significant between DMPs developed in separate years, course tracks, or disciplines. However, content analysis revealed differences between disciplines and course tracks in key DMP characteristics such as sharing, storing, and preserving data. DMPs that contained a data table (DtDMPs) also differed highly significantly from prose DMPs. DtDMPs better acknowledged the data handling needs of different data types and improved the overall quality of a DMP. The results illustrated that the ECRs had learned the basic RDM competencies and grasped their significance to the integrity, reliability, and reusability of data. However, more focused further training is needed to reach advanced competency, especially in handling and sharing personal data, legal issues, long-term preservation, and funders’ data policies. Equally important for the cultural change in which RDM becomes an organic part of research practices is merging research support services, processes, and infrastructure into research projects’ processes. Additionally, incentives are needed for sharing and reusing data.
Jukka Rantasaari. "Assessing Quality Variations in Early Career Researchers’ Data Management Plans." International Journal of Digital Curation. Published 2024-04-14. https://doi.org/10.2218/ijdc.v18i1.873
Andrea L Pritt, Briana E. Wham, Rachel H. Toczydlowski, Eric D Crandall
Science, Technology, Engineering, and Mathematics (STEM) and Research Data Librarians collaborated with an international research team of conservation geneticists to create an instructional and practical guide combining genetic biodiversity initiatives and data curation. Over the course of two months, the academic librarians held multiple community-based Curate-a-Thons in which an international group of students, researchers, librarians, and faculty researchers tracked down publications and metadata for genomic sequence data, thus crowd-sourcing this metadata enhancement effort. This article details the successful Curate-a-Thon design and implementation process; the openly available instructional materials created and used to host the Curate-a-Thons; and the challenges and successes of these community-based events.
Andrea L Pritt, Briana E. Wham, Rachel H. Toczydlowski, Eric D Crandall. "Community-based Curate-a-Thons to Enhance Preservation of Global Genetic Biodiversity Data." International Journal of Digital Curation. Published 2024-02-11. https://doi.org/10.2218/ijdc.v18i1.891
The 2007 implementation of the Office Open XML standard for Microsoft Word introduced the assignment of individual revision save identifiers (Rsid) to document editing sessions that end in a save action. The relevant standards, ECMA (2016) and ISO/IEC 29500-1:2016 (2016), stipulate that these Rsid should be allocated randomly but with increasing numerical value, thereby documenting the progress of the editing. As MS Word is the most ubiquitous word processing software, Rsid appear to be a useful tool to examine and provide evidence for a wide range of common document generation, editing, and modification processes and file management operations, with implications for document analysis including, but not limited to, academic integrity issues in student assignment submissions (e.g. contract cheating). This paper presents the results of a series of experiments conducted to assess whether and how well MS Word implements the ECMA and ISO/IEC standards. The results show that the number of allocated Rsid indeed increases with each edit and save action, with the previous Rsid carried over and retained. The newly allocated Rsid, however, do not conform to the standard, as the numerical value of an Rsid associated with a save action may be larger or smaller than any or all of those allocated during previous save actions. The allocation of a new Rsid is not necessarily caused by an edit event; a new Rsid can also be generated if a file is saved as RTF or sent as an e-mail from within MS Word, even though the file was not edited in any way. Rsid numbers are not generated if a person opens an MS Word document, reads it, and closes the file without saving, making this action impossible to detect. MS Word template files on a given machine contain document (root) Rsid numbers that are generated when a newly installed application is launched for the first time. As these will be embedded as legacy Rsid into every new file generated from that template file, they act as signatures for all MS Word documents that are created. The experiments have shown that user behaviour has a direct influence on the number of Rsid represented in a given file. Although the implementation of Office Open XML chosen by Microsoft is not compliant with the relevant standards, and thus Rsid cannot be used to determine the exact chronological order of all editing sequences within a given document, the Rsid retain their value for document forensics, as they are associated with specific edit events and illuminate the document writing and editing process.
D. Spennemann, Clare L. Singh. "Generation of Revision Identifier (rsid) Numbers in MS Word." International Journal of Digital Curation. Published 2024-02-11. https://doi.org/10.2218/ijdc.v18i1.870
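To make the Rsid discussion concrete, the following is a minimal sketch of how the identifiers could be read from a `.docx` file. It is not the authors' experimental procedure. It relies on the fact that a `.docx` file is a ZIP archive whose `word/settings.xml` part holds a `w:rsids` element with a `w:rsidRoot` child and `w:rsid` children, each carrying a hexadecimal `w:val` attribute. The function names `read_rsids` and `new_rsids` are hypothetical, and the sketch makes no assumption that the stored order of the entries reflects editing chronology.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/settings.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_rsids(docx) -> dict:
    """Return the root Rsid and all Rsid values listed in a .docx file.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        settings = ET.fromstring(archive.read("word/settings.xml"))
    rsids = settings.find(f"{{{W_NS}}}rsids")
    if rsids is None:
        return {"root": None, "rsids": []}
    val = f"{{{W_NS}}}val"
    root = rsids.find(f"{{{W_NS}}}rsidRoot")
    return {
        "root": root.get(val) if root is not None else None,
        "rsids": [el.get(val) for el in rsids.findall(f"{{{W_NS}}}rsid")],
    }


def new_rsids(earlier: dict, later: dict) -> list[str]:
    """List Rsid values present in the later save but not in the earlier one."""
    return sorted(set(later["rsids"]) - set(earlier["rsids"]))
```

Comparing two saved versions of the same document with `new_rsids`, and converting the values with `int(value, 16)`, would allow a reader to check for themselves whether newly allocated identifiers are numerically larger than the ones carried over.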
The 2007 implementation of the Office Open XML standard for Microsoft Word introduced the assignation of individual revision save identifiers (Rsid) to document editing sessions that end in a save action. The relevant standards, ECMA (2016) and ISO/IEC 29500-1:2016 (2016), stipulate that these Rsid should be allocated at random but with increasing numerical value, thereby documenting the progress of the editing. As MS Word is the most ubiquitous word processing software, Rsid appear to be a useful tool to examine, and provide evidence for, a wide range of common document generation, editing and modification processes and file management operations, with implications for document analysis including, but not limited to, academic integrity issues in student assignment submissions (e.g. contract cheating). This paper presents the results of a series of experiments conducted to assess whether and how well MS Word implements the ECMA and ISO/IEC standards. The results show that the number of allocated Rsid indeed increases with each edit and save action, with the previous Rsid carried over and retained. The newly allocated Rsid, however, do not conform to the standard, as the numerical value of an Rsid associated with a save action may be larger or smaller than any or all of those allocated during previous save actions. The allocation of a new Rsid is not necessarily caused by an edit event: a new Rsid can also be generated if a file is saved as RTF or sent as an e-mail from within MS Word, even though the file was not edited in any way. Rsid numbers are not generated if a person opens an MS Word document, reads it and closes the file without saving, making this action impossible to detect. MS Word template files on a given machine contain document (root) Rsid numbers that are generated when a newly installed application is launched for the first time. As these will be embedded as legacy Rsid into every new file generated from that template file, they act as signatures for all MS Word documents created from it. The experiments have shown that user behaviour has a direct influence on the number of Rsid represented in a given file. Although the implementation of Office Open XML chosen by Microsoft is not compliant with the relevant standards, and thus Rsid cannot be used to determine the exact chronological order of all editing sequences within a given document, the Rsid retain their value for document forensics, as they are associated with specific edit events and illuminate the document writing and editing process.
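As a rough illustration of how the observations above could be checked on real files, the following Python sketch reads the Rsid table that WordprocessingML keeps in `word/settings.xml` (the `w:rsids` element with its `w:rsidRoot` and `w:rsid` children) and compares two saved versions of a document. The function names `read_rsids`, `compare_versions` and `make_docx` are hypothetical, the Rsid values in the demo are made up, and the comparison uses only the set of values present, not the order in which they are listed in the file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/settings.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
VAL = f"{{{W_NS}}}val"


def read_rsids(docx):
    """Return (root_rsid, set_of_rsids) from a .docx path or file-like object."""
    with zipfile.ZipFile(docx) as archive:
        settings = ET.fromstring(archive.read("word/settings.xml"))
    container = settings.find(f"{{{W_NS}}}rsids")
    if container is None:
        return None, set()
    root = container.find(f"{{{W_NS}}}rsidRoot")
    root_value = root.get(VAL) if root is not None else None
    values = {el.get(VAL) for el in container.findall(f"{{{W_NS}}}rsid")}
    return root_value, values


def compare_versions(older, newer):
    """Summarise how the Rsid table changed between two saved versions."""
    old_root, old_ids = read_rsids(older)
    new_root, new_ids = read_rsids(newer)
    added = new_ids - old_ids
    old_max = max((int(v, 16) for v in old_ids), default=None)
    return {
        # Same root Rsid suggests both files descend from the same origin.
        "same_root": old_root is not None and old_root == new_root,
        # True if every earlier Rsid was carried over into the newer file.
        "retained_all": old_ids <= new_ids,
        "added": sorted(added),
        # New Rsid values that are numerically smaller than an earlier one.
        "added_below_old_max": sorted(
            v for v in added if old_max is not None and int(v, 16) < old_max
        ),
    }


def make_docx(root, rsids):
    """Demo helper: an in-memory archive holding only word/settings.xml.

    This is not a complete Word document, just enough for read_rsids().
    """
    items = "".join(f'<w:rsid w:val="{r}"/>' for r in rsids)
    xml = (
        f'<w:settings xmlns:w="{W_NS}"><w:rsids>'
        f'<w:rsidRoot w:val="{root}"/>{items}</w:rsids></w:settings>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/settings.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Made-up values: the second version adds one Rsid after a further save.
    first = make_docx("00A1B2C3", ["00A1B2C3", "00F01234"])
    second = make_docx("00A1B2C3", ["00A1B2C3", "00F01234", "003D5E6F"])
    print(compare_versions(first, second))
```

With real files, `read_rsids("draft_v1.docx")` would take a path instead of the in-memory buffer (the file name is illustrative). A non-empty `added_below_old_max` corresponds to the reported finding that a newly allocated Rsid may be numerically smaller than earlier ones.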
{"title":"Generation of Revision Identifier (rsid) Numbers in MS Word","authors":"D. Spennemann, Clare L. Singh","doi":"10.2218/ijdc.v18i1.870","DOIUrl":"https://doi.org/10.2218/ijdc.v18i1.870","url":null,"abstract":"The 2007 implementation of the Office Open XML standard for Microsoft Word introduced the assignation of individual revision save identifiers (Rsid) to document editing sessions that end in a save action. The relevant standards, ECMA (2016) and ISO/IEC 29500-1:2016 (2016), stipulate that these Rsid should be allocated at random but with increasing numerical value, thereby documenting the progress of the editing. As MS Word is the most ubiquitous word processing software, Rsid appear to be a useful tool to examine, and provide evidence for, a wide range of common document generation, editing and modification processes and file management operations, with implications for document analysis including, but not limited to, academic integrity issues in student assignment submissions (e.g. contract cheating). This paper presents the results of a series of experiments conducted to assess whether and how well MS Word implements the ECMA and ISO/IEC standards. The results show that the number of allocated Rsid indeed increases with each edit and save action, with the previous Rsid carried over and retained. The newly allocated Rsid, however, do not conform to the standard, as the numerical value of an Rsid associated with a save action may be larger or smaller than any or all of those allocated during previous save actions. The allocation of a new Rsid is not necessarily caused by an edit event: a new Rsid can also be generated if a file is saved as RTF or sent as an e-mail from within MS Word, even though the file was not edited in any way. Rsid numbers are not generated if a person opens an MS Word document, reads it and closes the file without saving, making this action impossible to detect. MS Word template files on a given machine contain document (root) Rsid numbers that are generated when a newly installed application is launched for the first time. As these will be embedded as legacy Rsid into every new file generated from that template file, they act as signatures for all MS Word documents created from it. The experiments have shown that user behaviour has a direct influence on the number of Rsid represented in a given file. Although the implementation of Office Open XML chosen by Microsoft is not compliant with the relevant standards, and thus Rsid cannot be used to determine the exact chronological order of all editing sequences within a given document, the Rsid retain their value for document forensics, as they are associated with specific edit events and illuminate the document writing and editing process.","PeriodicalId":87279,"journal":{"name":"International journal of digital curation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139785561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Antique books and old and rare documents are fragile and vulnerable to a range of hazards, and preserving them for an extended period is a real challenge. Since ancient times people have expressed their knowledge in writing and kept records, and in later ages began collecting and storing these as antique materials. Such materials can be found in museums, libraries, archives, individual households, and other places all over the world. Preserving and conserving these antique, old and rare books and documents in good condition is a challenge for librarians, conservators, preservation administrators and others responsible for storing them. This paper discusses the digital preservation of one such collection, held by the Directorate of Historical and Antiquarian Studies (DHAS), Guwahati, Assam, India. DHAS is a wing of the Government of Assam, mandated mainly to collect, preserve and research historical and antiquarian resources. The collection of DHAS is one of the oldest collections and has served as a study and research centre in Assam since 1928. A special drive was undertaken for the digital preservation of an identified part of the collection, with grant support from the National Archive of India. The paper describes the entire project process, from formulating the project proposal to structuring the digital collection, and walks step by step through the digitization of 241 old and rare books from the main DHAS collection.
{"title":"E-Preservation of Old and Rare Books: A Structured Approach for Creating a Digital Collection","authors":"Sangeeta Chakravarty","doi":"10.2218/ijdc.v17i1.855","DOIUrl":"https://doi.org/10.2218/ijdc.v17i1.855","url":null,"abstract":"Antique books and old and rare documents are fragile and vulnerable to a range of hazards, and preserving them for an extended period is a real challenge. Since ancient times people have expressed their knowledge in writing and kept records, and in later ages began collecting and storing these as antique materials. Such materials can be found in museums, libraries, archives, individual households, and other places all over the world. Preserving and conserving these antique, old and rare books and documents in good condition is a challenge for librarians, conservators, preservation administrators and others responsible for storing them. This paper discusses the digital preservation of one such collection, held by the Directorate of Historical and Antiquarian Studies (DHAS), Guwahati, Assam, India. DHAS is a wing of the Government of Assam, mandated mainly to collect, preserve and research historical and antiquarian resources. The collection of DHAS is one of the oldest collections and has served as a study and research centre in Assam since 1928. A special drive was undertaken for the digital preservation of an identified part of the collection, with grant support from the National Archive of India. The paper describes the entire project process, from formulating the project proposal to structuring the digital collection, and walks step by step through the digitization of 241 old and rare books from the main DHAS collection.","PeriodicalId":87279,"journal":{"name":"International journal of digital curation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134954212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}