"Epistemic consequences of unfair tools"
Ida Marie S Lassen, Ross Deans Kristensen-McLachlan, Mina Almasi, Kenneth Enevoldsen, Kristoffer L Nielbo
Digital Scholarship in the Humanities (2024-01-24). doi:10.1093/llc/fqad091

This article examines the epistemic consequences of unfair technologies used in digital humanities (DH). We connect bias analysis informed by the field of algorithmic fairness with perspectives on knowledge production in DH. We examine the fairness of Danish Named Entity Recognition tools through an innovative experimental method involving data augmentation and evaluate performance disparities using two metrics of algorithmic fairness: calibration within groups and balance for the positive class. Our results show that only two of the ten tested models comply with the fairness criteria. From an intersectional perspective, we shed light on how unequal performance across groups can lead to the exclusion and marginalization of certain social groups, leaving voices and experiences disregarded and silenced. We propose incorporating algorithmic fairness into the selection of tools in DH to help alleviate the risk of perpetuating silence and to move towards fairer and more inclusive research.
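The two fairness criteria named in the abstract can be illustrated with a small sketch. This is not the authors' code: the groups, scores, and labels below are invented, and a real NER evaluation would operate on token-level predictions.

```python
# Illustrative sketch of the two fairness criteria from the abstract,
# applied to hypothetical model output. Each record is
# (demographic_group, predicted_probability_of_entity, is_entity).

def calibration_within_groups(records):
    """A calibrated model's mean predicted probability matches the
    empirical positive rate within every group."""
    by_group = {}
    for group, p, y in records:
        by_group.setdefault(group, []).append((p, y))
    return {
        g: (sum(p for p, _ in pairs) / len(pairs),   # mean score
            sum(y for _, y in pairs) / len(pairs))   # empirical rate
        for g, pairs in by_group.items()
    }

def balance_for_positive_class(records):
    """Balance holds when true positives receive (roughly) the same
    mean score regardless of group membership."""
    by_group = {}
    for group, p, y in records:
        if y:
            by_group.setdefault(group, []).append(p)
    return {g: sum(ps) / len(ps) for g, ps in by_group.items()}

# Invented example: entities from the minority group get lower scores,
# violating balance for the positive class.
data = [("majority", 0.9, 1), ("majority", 0.1, 0),
        ("minority", 0.6, 1), ("minority", 0.2, 0)]
print(balance_for_positive_class(data))  # {'majority': 0.9, 'minority': 0.6}
```

A model passing both checks would show matching score/rate pairs per group and near-equal positive-class means across groups; the abstract reports that only two of ten tested models met such criteria.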
"The analogy of computing"
Willard McCarty
Digital Scholarship in the Humanities (2024-01-21). doi:10.1093/llc/fqad104

The digital machine is analogical by design: with it, we construct models of phenomena that by definition of that term are necessarily partial approximations. For that reason, we learn more by conceiving of them as analogues rather than imperfect copies. As the foofaraw over AI would make clear to anyone who bothered to separate its strange wheat from the common chaff, analogy is key to the digital engine’s intellectual power, whether for good or for ill. (The one we must further, the other oppose, but in both cases, understand as fully as we are able.) Analogy is itself a Proteus, however, surfacing in different forms in different disciplines where the machine has found its applications. In the following essay, I chase it through a number of fields before returning to computing, with two examples of its application. I end with a brief note on worldmaking, which after all is what it’s all about, at whatever scale.
"AGREE: a new benchmark for the evaluation of distributional semantic models of ancient Greek"
Silvia Stopponi, Saskia Peels-Matthey, Malvina Nissim
Digital Scholarship in the Humanities (2024-01-15). doi:10.1093/llc/fqad087

Recent years have seen the application of Natural Language Processing, in particular language models, to the study of the semantics of ancient Greek, but little work has been done to create gold data for the evaluation of such models. In this contribution we introduce AGREE, the first benchmark for the intrinsic evaluation of semantic models of ancient Greek created from expert judgements. In the absence of native speakers, eliciting expert judgements to create a gold standard is a way to leverage the competence closest to that of native speakers. Moreover, this method allows data to be collected in a uniform way and precise instructions to be given to participants. Human judgements about word relatedness were collected via two questionnaires: in the first, experts provided related lemmas for a set of proposed seeds, while in the second, they assigned relatedness judgements to pairs of lemmas. AGREE was built from a selection of the collected data.
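A benchmark of this kind is typically used by rank-correlating a model's relatedness scores with the expert judgements for the same lemma pairs. The sketch below uses invented pairs and scores; the actual AGREE data and evaluation protocol may differ.

```python
# Sketch of intrinsic evaluation against human relatedness judgements:
# Spearman correlation between expert scores and model scores.

def spearman(xs, ys):
    """Spearman rank correlation (assumes no tied values, for brevity)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Invented scores for four hypothetical lemma pairs: the model agrees
# with the experts on the two most related pairs but swaps the last two.
expert = [4.0, 3.2, 1.5, 0.8]
model = [0.81, 0.64, 0.22, 0.35]
print(spearman(expert, model))  # → 0.8
```

A higher correlation means the model's notion of relatedness tracks the expert consensus more closely; in production one would use scipy.stats.spearmanr, which also handles ties.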
"Digitizing the USPTO patent backfile"
Simon Rowberry
Digital Scholarship in the Humanities (2024-01-15). doi:10.1093/llc/fqad096

The digitization of the US Patent and Trademark Office’s (USPTO) backfile of six million patents undertaken between 1951 and 2001 was a five-decade struggle, featuring several media transitions from print and microfilm to CD-ROMs and, finally, the Web. This mass digitization project is on a similar scale to Google Books and the Internet Archive, but it is rarely discussed within critical digitization scholarship or for its significance as a tool for knowledge production. In this article, I focus on the digital and physical material form of the USPTO’s patent documents and how the current paradigm of access and storage of the digital backfile emerged. Through this case study, I build upon Ian Milligan’s distinction between the ‘text’ and ‘platform’ layers of a digitization project to demonstrate how historical decisions regarding format and metadata continue to influence how users retrieve and interpret documents, such as patents, online.
"Mapping Germanness in early 20th century USA: topic modeling and GIS within a small corpus framework"
Sijie Wang, Maciej Kurzynski
Digital Scholarship in the Humanities (2024-01-11). doi:10.1093/llc/fqad102

The increased emphasis on language and ethnicity among German immigrants in the USA at the beginning of the 20th century resulted from inter-ethnic competition as well as assimilation pressures on Germans as a minority in American society. Following the unification of Germany and the improvement of Germany’s international status, Germans in America claimed the superiority of German culture; middle-class advocates attempted to build a more united German-American community, fighting for a stronger voice on issues such as prohibition and German language education. These processes eventually led to the establishment of the National German-American Alliance in Philadelphia in 1901. The present article employs topic modeling and GIS techniques to examine the little-known conference proceedings of the Alliance and discuss Prince Heinrich “Henry” of Prussia’s 1902 visit to the USA. On the humanities side, we foreground the dynamics of the German diaspora, who sought their own ethnic uniqueness and constructed historical memory during this period. On the digital side, we discuss different statistical evaluations of topic models as well as their applicability within a small corpus research framework.
"Unsigned play by Milan Kundera? An authorship attribution study"
Lenka Jungmannová, Petr Plecháč
Digital Scholarship in the Humanities (2024-01-11). doi:10.1093/llc/fqad109

In addition to being a widely recognized novelist, Milan Kundera has also authored three pieces for theatre: The Owners of the Keys (Majitelé klíčů 1961), The Blunder (Ptákovina 1967), and Jacques and his Master (Jakub a jeho pán 1971). In recent years, however, the hypothesis has been raised that Kundera was the true author of a fourth play, Juro Jánošík, first performed in a 1974 production under the name of Karel Steigerwald, who was Kundera’s student at the time. In this study, we make use of supervised machine learning to settle the question of authorship attribution in the case of Juro Jánošík, with results strongly supporting the hypothesis of Kundera’s authorship.
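The abstract does not specify the features or classifier used. One common stylometric setup, sketched below with invented toy strings in place of real texts, compares character n-gram profiles of a disputed text against each candidate author's known writing.

```python
# One common stylometric setup (not necessarily the authors' method):
# character n-gram frequency profiles compared by cosine similarity,
# attributing a disputed text to the closest candidate author.
from collections import Counter
import math

def profile(text, n=3):
    """Relative frequencies of character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def cosine(p, q):
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm = lambda d: math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm(p) * norm(q))

def attribute(disputed, candidates):
    """candidates: {author_name: concatenated known texts}."""
    d = profile(disputed)
    scores = {a: cosine(d, profile(t)) for a, t in candidates.items()}
    return max(scores, key=scores.get)
```

A real study would use held-out validation, multiple feature sets, and a proper supervised classifier (e.g. an SVM); this sketch only shows the shape of the task.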
"The internal structure of medieval Latin legendaries: a computational analysis"
Sébastien de Valeriola, Bastien Dubuisson
Digital Scholarship in the Humanities (2024-01-11). doi:10.1093/llc/fqad097

Since the middle of the 17th century, scholars have been systematically describing the numerous medieval manuscripts preserved in libraries and religious institutions that contain hagiographic texts, that is, texts recounting the lives of saints. In this article, we apply quantitative tools to the resulting database to consider these codices from a new point of view. Specifically, we study their internal organization, that is, the order in which their texts are arranged. We first present a visualization tool that allows this structure to be grasped at a glance. Then, we describe a model, based on a constrained spline regression, that automatically classifies manuscripts according to their internal organization. The results of this classification task make it possible to identify manuscripts with a particular internal organization, called per circulum anni (following the course of the year), and thus to study their properties. Furthermore, they open up the possibility of obtaining clues regarding the origin of some codices and potential kinship links between them.
"Topic modelling literary interviews from The Paris Review"
Derek Greene, James O'Sullivan, Daragh O'Reilly
Digital Scholarship in the Humanities (2024-01-11). doi:10.1093/llc/fqad098

The interview has always proved a rich source for those hoping to better understand the figures behind a text, as well as the social contexts and writing practices which might have informed their aesthetic sentiments. Although research into the literary interview has made significant strides over the past two decades, both in how the genre is conceptualized and in how its emergence and development have been historically traced, the form remains somewhat neglected by literary and cultural theorists and scholars. There is also a remarkable absence of distant readings in this domain. With the rise of the digital humanities, particularly digital literary studies, one would expect more scholars to have used computer-assisted techniques to mine literary interviews, which are, in terms of dataset practicalities, somewhat ideal: semi-structured by nature and typically available online. Such is the question to which this article attends, taking as its dataset seven decades’ worth of literary interviews from The Paris Review, and ‘topic modelling’ these documents to determine the key themes that dominate such a culturally significant set of materials, while also exploring the value of topic modelling for socio-literary criticism.
"Using ontology to model time description in historical Chinese texts"
Linxu Wang, Jun Wang, Tong Wei
Digital Scholarship in the Humanities (2024-01-10). doi:10.1093/llc/fqad092

Temporal information plays a crucial role in historical research, as it enables scholars to gain insights into the events and processes that have shaped the past. However, the complexity and diversity of temporal descriptions found in Chinese historical texts pose significant challenges for analyzing and interpreting this information. This article addresses these challenges by introducing the traditional Chinese time ontology (TCT Ontology), which integrates relevant concepts and different timing methods into an ontology. The TCT Ontology comprises four classes (TCT Record, Chinese Calendar, Historical Interval, and Person) to represent time descriptions in Chinese texts. By separating time records from the traditional Chinese calendar, the ontology provides a reference model for understanding time information in Chinese historical archives and serves as a basis for converting those time records to the Gregorian calendar. This accurate conversion is critical for humanistic research in Chinese history, as it enables scholars to engage in meaningful reading, studying, and research of the historical record.
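The separation the abstract describes can be illustrated with a minimal sketch: this is not the TCT Ontology itself, just the idea of keeping the time record found in a text apart from the calendar system used to resolve it. The era start years used are real (Kangxi year 1 = 1662 CE, Qianlong year 1 = 1736 CE), but the two-entry table is purely illustrative.

```python
# Minimal sketch (not the TCT Ontology) of separating a time *record*
# from the calendar system, so records can be resolved to Gregorian
# years. Era start years are real; the table is a tiny illustration.

ERA_START = {"Kangxi": 1662, "Qianlong": 1736}  # reign era -> year 1 CE

def to_gregorian(era, year_in_era):
    """Resolve 'year N of reign era E' to a Gregorian year."""
    return ERA_START[era] + year_in_era - 1

# A record as it might appear in a text: "Kangxi year 3".
record = {"text": "康熙三年", "era": "Kangxi", "year": 3}
print(to_gregorian(record["era"], record["year"]))  # → 1664
```

Real conversion must also handle the lunisolar calendar's offset from Gregorian year boundaries and month/day resolution, which is why a dedicated ontology and reference model are needed rather than a lookup table.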
Review: The Digital Humanities and Literary Studies, by Martin Paul Eve
Reviewed by Tiping Su
Digital Scholarship in the Humanities (2024-01-09). doi:10.1093/llc/fqad095