
Digital Scholarship in the Humanities: Latest Articles

“I would I had that corporal soundness”: Pervez Rizvi's Analysis of the Word Adjacency Network Method of Authorship Attribution
IF 0.8 Zone 3 (Literature) 0 HUMANITIES, MULTIDISCIPLINARY Pub Date: 2023-04-28 DOI: 10.1093/llc/fqad032
G. Egan, Mark Eisen, Alejandro Ribeiro, Santiago Segarra
In his two-part article ‘An Analysis of the Word Adjacency Network Method—Part 1—The evidence of its unsoundness’ and ‘Part 2—A true understanding of the method’ Digital Scholarship in the Humanities, 38: 347-78 (2022), Pervez Rizvi attempts to replicate the Word Adjacency Network (WAN) method for authorship attribution and show that it does not produce the new knowledge that we, its inventors, claim for it. In the present essay, we will show that Rizvi misrepresents fundamental aspects of the WAN method, that his attempted replication fails not because the method is flawed but because he erred in replicating it, and that Rizvi misunderstands key aspects of the mathematics of Information Theory that the method uses.
Citations: 0
Provenance visualization: Tracing people, processes, and practices through a data-driven approach to provenance
IF 0.8 Zone 3 (Literature) 0 HUMANITIES, MULTIDISCIPLINARY Pub Date: 2023-04-24 DOI: 10.1093/llc/fqad020
T. Vancisin, Loraine Clarke, M. Orr, Uta Hinrichs
Provenance disclosure—the documentation of an artifact’s origin and how it was produced—is an important aspect to consider when working with historical records which undergo multiple transformations in preparation for and during digitization. Provenance in this context is commonly communicated through explanatory text or static diagrams. However, the methodological and curatorial decisions that have influenced the records’ data are easily overlooked, in particular when exploring the records through visualization as a result of digitization processes. We propose a data-driven approach to provenance disclosure which (1) traces provenance back to when the records were created, (2) documents and categorizes the records’ transformations (transcriptions, content modifications, changes in organization, and representational form), and (3) uses data visualization to disclose provenance in interactive ways. We reflect on how this approach can be practically applied in the context of historical record collections, and we present findings from a qualitative study we conducted to investigate the merits and limitations of provenance-driven visualization. Our findings suggest that data-driven provenance disclosure has the potential to (1) promote transparency and deeper interpretations of historical records, (2) provide rigor in researching historical document collections and underlying production processes, and (3) encourage ethical considerations by making visible labor and implicit bias that influence the production and curation of historical records.
Citations: 0
Proverbs as indicators of proficiency for art-generating AI
IF 0.8 Zone 3 (Literature) 0 HUMANITIES, MULTIDISCIPLINARY Pub Date: 2023-04-22 DOI: 10.1093/llc/fqad034
Luis J. Tosina Fernández
Art generated by Artificial Intelligence (AI) is currently having a great impact online, because it allows people without creative talent to produce outstanding works just by typing a description of what they want to illustrate. However, the appearance of this technology has also caused some discomfort among artists and graphic designers, who see their craft threatened by a service that is available to anyone free of charge. In this article, the capability of some of these platforms to process figurative language is assessed with the help of five well-known proverbs found in almost identical terms across a number of Western languages. These proverbs were used as prompts on five of the most popular AI art generators currently accessible. Our experiment concludes that AI shows significant deficiencies in the processing of proverbs and, therefore, of figurative language. Consequently, AI does not yet seem able to substitute human agency completely in artistic creation. This exposes an aspect that needs improvement not just for the creative applications of AI but for other applications that it may have in the future. To achieve this, disciplines such as psycholinguistics should be integrated into the teams that develop AI.
Citations: 0
A new approach for the construction of historical databases—NoSQL Document-oriented databases: the example of AtlantoCracies
IF 0.8 Zone 3 (Literature) 0 HUMANITIES, MULTIDISCIPLINARY Pub Date: 2023-04-22 DOI: 10.1093/llc/fqad033
Manuel Díaz-Ordóñez, Domingo Savio Rodríguez Baena, Bartolomé Yun-Casalilla
This article proposes, and justifies, the use of document-oriented databases as a flexible, easy-to-use, and powerful digital tool in the field of historical research. First, the reasons that have made relational databases the predominant instrument among historians are studied, and the problems involved in their use are detailed. Next, the way in which historians have tried to face these problems by using other digital tools is explained, as well as the limitations that such use entails. Through a case study of European aristocratic networks in early modern times, it is shown, however, that document-oriented databases present notable advantages and have greater explanatory power for the historian’s work. Thanks to their flexibility, they are better adapted to the often-unpredictable nature of historical sources without diminishing their ease of use or their analytical potential.
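The flexibility claimed for document-oriented storage can be illustrated with a minimal sketch. It does not reproduce the AtlantoCracies schema: the records, field names, and the helper functions `having` and `ids` are hypothetical placeholders, and plain Python dictionaries stand in for a real document store. The point is only that two records about persons can carry different fields, including an explicitly unknown value, without a fixed table layout.

```python
# Hypothetical records: names, fields, and values are placeholders,
# not data from AtlantoCracies or any real source.
people = [
    {
        "_id": 1,
        "name": "Person A",
        "titles": [{"title": "Count", "granted": 1601}],
    },
    {
        "_id": 2,
        "name": "Person B",
        # A field the first record does not have at all.
        "marriages": [{"spouse_id": 1, "year": None}],
        "note": "marriage year illegible in the source",
    },
]


def having(collection, field):
    """Return the documents that carry the given top-level field."""
    return [doc for doc in collection if field in doc]


def ids(docs):
    """Return the identifiers of a list of documents."""
    return [doc["_id"] for doc in docs]
```

Here `ids(having(people, "marriages"))` selects only the second record, while the first remains valid without any empty "marriages" column. A relational design would typically need extra tables or nullable columns to express the same irregularity.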
Citations: 0
Web archive analytics: Blind spots and silences in distant readings of the archived web
IF 0.8 Zone 3 (Literature) 0 HUMANITIES, MULTIDISCIPLINARY Pub Date: 2023-04-19 DOI: 10.1093/llc/fqad014
Simon Donig, Markus Eckl, S. Gassner, Malte Rehbein
In this article, we discuss epistemological and methodological aspects of web archive analytics, a recent development towards more data-centred access to web archives. More specifically, we suggest understanding both the process of archiving and subsequent steps of analysis at scale as acts of observation that can be questioned for their epistemological a priori. Therefore, we propose the concepts of ‘blind spots’ (features of the live web not included upon creation in the archive) and ‘silences’ (latent features present in the archive but requiring a particular method to be made articulate). In particular, we address two forms of silences playing a structural role in web archive analytics, crucial to historians and social scientists alike: abundance (or scale) and time. We trace epistemological implications of web archive analytics across an exemplary case study workflow and suggest methodological answers to the issues raised in this process. On the data extraction side, we introduce warc2corpus (w2c), a new tool for extracting granular, structured data, especially temporal information related to the creation, modification, and publication of webpages. For data analysis, we demonstrate how distant reading techniques, more specifically structural topic modelling (STM), can contribute to providing a rich, temporally structured representation of textual web archive content that in turn can be subjected to scholarly inquiry, interpretation, and re-contextualization.
Citations: 0
NEAT—Named Entities in Archaeological Texts: A semantic approach to term extraction and classification
IF 0.8 Zone 3 (Literature) 0 HUMANITIES, MULTIDISCIPLINARY Pub Date: 2023-04-13 DOI: 10.1093/llc/fqad017
Maria Pia di Buono, Gennaro Nolano, J. Monti
The lack of annotated datasets affects the development of Natural Language Processing applications and heavily impacts access to textual data, in particular for specific domains and specific languages. In this paper, we propose a methodology to annotate texts concerning domain-specific knowledge, to provide a reliable source of data for the task of Named Entity Recognition (NER) in the domain of archaeology for the Italian language. This method integrates syntactic and semantic information from several structured sources to annotate entities’ mentions in unstructured texts. Furthermore, we make use of an ontology to label entities with the specific type they refer to. By using a corpus made up of item descriptions from Europeana’s Archaeology Collection, we first test our proposed methodology on a mock dataset composed of 1,000 texts. After several steps of improvements, we use the final process to create a complete dataset composed of 5,000 descriptions. The resulting dataset, Named Entities in Archaeological Texts (NEAT), has a total of 41,002 text spans annotated with their domain-specific entity classification according to the CIDOC Conceptual Reference Model.
Citations: 0
Automatic sentence segmentation for classical Chinese: The Spring and Autumn Annals as an example
IF 0.8 Zone 3 (Literature) 0 HUMANITIES, MULTIDISCIPLINARY Pub Date: 2023-04-12 DOI: 10.1093/llc/fqad016
Wenjie Fan, Dongbo Wang, Shuiqing Huang
Most classical Chinese literary texts contain no sentence boundaries. Since texts of this kind are difficult to read, experts in literature or linguistics segment the sentences manually. This article explores the effectiveness of classical Chinese sentence segmentation methods so as to provide a reference for classical Chinese punctuation. Working with machine learning methods, we chose three components, namely models, tagging schemes, and features, and compared the learning results. The models include conditional random field (CRF) models, long short-term memory (LSTM) models, BiLSTM–CRF models, and three Bidirectional Encoder Representations from Transformers (BERT) models. There are five tagging schemes in this article and three features: the statistical feature, Guangyun, and Fanqie. Finally, the performance of the combined feature template is evaluated by ten-fold cross-validation on four classical Chinese texts in different genres. The SikuBERT model proves to be the most effective model for sentence segmentation at present. The results show that the 5-tag-J tagging scheme can improve performance. The statistical feature, as an important clue for classical Chinese sentence segmentation, is useful in related tasks, but Guangyun and Fanqie have little impact. Other important factors in sentence segmentation are genre and writing style.
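Treating sentence segmentation as sequence tagging can be sketched as follows. The abstract does not define its five tagging schemes, so this example uses a hypothetical two-tag scheme of its own ("E" for a character that ends a segment, "O" for any other character); the function names `to_labels` and `segment` and the punctuation set `PUNCT` are likewise assumptions made for illustration. A trained model such as a CRF or BERT tagger would predict the tags; here they are derived from an already punctuated string to show the data format.

```python
# Punctuation marks treated as segment boundaries (an assumed, incomplete set).
PUNCT = "。，；：？！、"


def to_labels(punctuated):
    """Turn punctuated text into (characters, tags) training data.

    Hypothetical two-tag scheme: "E" marks a character followed by a
    boundary, "O" marks every other character.
    """
    chars, tags = [], []
    for ch in punctuated:
        if ch in PUNCT:
            if tags:
                tags[-1] = "E"  # the previous character closes a segment
        else:
            chars.append(ch)
            tags.append("O")
    return chars, tags


def segment(chars, tags, sep="/"):
    """Rebuild a segmented string from characters and predicted tags."""
    out = []
    for ch, tag in zip(chars, tags):
        out.append(ch)
        if tag == "E":
            out.append(sep)
    return "".join(out)
```

For the input "元年春，王正月。" the function `to_labels` yields the unpunctuated characters with the tags O, O, E, O, O, E, and `segment` turns those back into "元年春/王正月/". Richer schemes add tags for segment-initial or segment-internal positions, which gives the model more context per character.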
Citations: 0
Unravelling interlanguage facts via explainable machine learning
Zone 3 (Literature) 0 HUMANITIES, MULTIDISCIPLINARY Pub Date: 2023-04-10 DOI: 10.1093/llc/fqad019
Barbara Berti, Andrea Esuli, Fabrizio Sebastiani
Native language identification (NLI) is the task of training (via supervised machine learning) a classifier that guesses the native language of the author of a text. This task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved over the years. We focus on a different facet of the NLI task, i.e. that of analysing the internals of an NLI classifier trained by an explainable machine learning (EML) algorithm, in order to obtain explanations of its classification decisions, with the ultimate goal of gaining insight into which linguistic phenomena ‘give a speaker’s native language away’. We use this perspective in order to tackle both NLI and a (much less researched) companion task, i.e. guessing whether a text has been written by a native or a non-native speaker. Using three datasets of different provenance (two datasets of English learners’ essays and a dataset of social media posts), we investigate which kinds of linguistic traits (lexical, morphological, syntactic, and statistical) are most effective for solving our two tasks, that is, are most indicative of a speaker’s L1; our experiments indicate that the most discriminative features are the lexical ones, followed by the morphological, syntactic, and statistical features, in this order. We also present two case studies, one on Italian and one on Spanish learners of English, in which we analyse individual linguistic traits that the classifiers have singled out as most important for spotting these L1s; we show that the traits identified as most discriminative align well with our intuition, i.e. represent typical patterns of language misuse, underuse, or overuse, by speakers of the given L1. Overall, our study shows that the use of EML can be a valuable tool for the scholar who investigates interlanguage facts and language transfer.
Citations: 0
Hacking stylometry with multiple voices: Imaginary writers can override authorial signal in Delta
Zone 3 (Literature) HUMANITIES, MULTIDISCIPLINARY Pub Date: 2023-04-08 DOI: 10.1093/llc/fqad012
Daniil Skorinkin, Boris Orekhov
It is a basic assumption of stylometry that texts written by the same person show greater stylometric similarity even if published under multiple pen names. Statistical authorship attribution relies strongly on the ability of Burrows’s Delta and its variants to cluster one author’s texts together regardless of pseudonyms. At the same time, the very first computational discoveries by the founder of modern stylometry showed that a single author is capable of producing multiple voices (Burrows, 1987, Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Clarendon Press). We investigate two authors whose stylistically autonomous pen names seem to deceive Delta and override authorial signals: the Portuguese poet Fernando Pessoa and the French novelist Romain Gary. Pessoa managed to create at least three pen names (the author himself used the term ‘heteronym’) that exhibit all the traits of individual human beings from the stylometric point of view. Gary’s alter ego Emile Ajar, an intentional literary mystification, also demonstrates traits of stylometric autonomy. At the same time, other pseudonyms used by Gary lack that autonomy completely. Our investigation shows that there appears to be a continuum between a purely formal use of a pen name, which brings almost no distinction from the real name of an author, and a strong literary sub-personality such as those created by Pessoa.
Citations: 1
Sagas and genre: A case for application of network analysis to manuscripts preserving Old Norse-Icelandic saga literature
IF 0.8 Zone 3 (Literature) HUMANITIES, MULTIDISCIPLINARY Pub Date: 2023-04-07 DOI: 10.1093/llc/fqad013
K. Kapitan, Tarrin Wills
This study applies statistical approaches to the analysis of the genre relationships of Old Norse-Icelandic literature in order to expand our understanding of the relationships between works, their transmission, and their possible modes of reception, as manifested in the extant manuscripts. This article contributes to the ongoing discussion of the genre boundaries of Old Norse-Icelandic literature and presents an alternative method of engaging with this material in the form of computer-assisted analysis, i.e. data visualization and network analysis. Using data collected from major online databases of Old Norse-Icelandic manuscripts, we present the most complete network to date of co-occurrences in manuscripts of works belonging to a number of literary genres. The present study empirically demonstrates the manifoldness of the connections between Old Norse-Icelandic works, connections that transcend traditional scholarly genre boundaries. The study identifies two main communities within the network: a community of romances, or works of narrative fiction, which includes mainly legendary sagas (fornaldarsögur) and chivalric sagas (riddarasögur), and a community of historicizing narratives, or pseudo-history, which includes mainly sagas of Icelanders (Íslendingasögur) and kings’ sagas (konungasögur).
Citations: 0