Finding common features in multilingual fake news: a quantitative clustering approach
Wei Yuan, Haitao Liu

Since the Internet is a breeding ground for unverified fake news, automatic detection and clustering of such news have become crucial. Most current studies focus on English texts, and the features that fake news shares across languages have not been sufficiently studied. This article therefore takes English, Russian, and Chinese as examples and focuses on identifying the common quantitative features of fake news in different languages at the word, sentence, readability, and sentiment levels. These features are then used in principal component analysis, K-means clustering, hierarchical clustering, and two-step clustering experiments, which achieved satisfactory results. The common features we propose contribute more to automatic cross-lingual clustering than the features proposed in previous studies. At the same time, we discovered a trend toward linguistic simplification and economy in fake news. Furthermore, fake news is easier to understand and uses negative emotional expressions in ways that real news does not. Our research provides new reference features for fake news detection tasks and facilitates research into the linguistic characteristics of fake news.
{"title":"Finding common features in multilingual fake news: a quantitative clustering approach","authors":"Wei Yuan, Haitao Liu","doi":"10.1093/llc/fqae016","DOIUrl":"https://doi.org/10.1093/llc/fqae016","url":null,"abstract":"Since the Internet is a breeding ground for unconfirmed fake news, its automatic detection and clustering studies have become crucial. Most current studies focus on English texts, and the common features of multilingual fake news are not sufficiently studied. Therefore, this article uses English, Russian, and Chinese as examples and focuses on identifying the common quantitative features of fake news in different languages at the word, sentence, readability, and sentiment levels. These features are then utilized in principal component analysis, K-means clustering, hierarchical clustering, and two-step clustering experiments, which achieved satisfactory results. The common features we proposed play a greater role in achieving automatic cross-lingual clustering than the features proposed in previous studies. Simultaneously, we discovered a trend toward linguistic simplification and economy in fake news. Furthermore, fake news is easier to understand and uses negative emotional expressions in ways that real news does not. Our research provides new reference features for fake news detection tasks and facilitates research into their linguistic characteristics.","PeriodicalId":45315,"journal":{"name":"Digital Scholarship in the Humanities","volume":"4 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140572969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Retractions in arts and humanities: an analysis of the retraction notices
Ivan Heibi, Silvio Peroni

The aim of this work is to understand the retraction phenomenon in the arts and humanities domain through an analysis of retraction notices, that is, the formal documents stating and describing the retraction of a particular publication. The retractions and the corresponding notices are identified using data provided by Retraction Watch. Our methodology combines a metadata analysis and a content analysis (mainly performed using topic modelling) of the retraction notices. Considering 343 cases of retraction, we found that many retraction notices are neither identifiable nor findable. In addition, they were not always separated from the original papers, making it ambiguous how the community perceived (i.e. cited) these notices. We also noticed that there is no systematic way to write a retraction notice: some notices presented a complete discussion of the reasons for retraction, while others were more direct and succinct. Many notices share near-identical text while addressing different retractions. A further study applying the same methodology to a larger collection should be conducted to confirm and further investigate our findings.
{"title":"Retractions in arts and humanities: an analysis of the retraction notices","authors":"Ivan Heibi, Silvio Peroni","doi":"10.1093/llc/fqad093","DOIUrl":"https://doi.org/10.1093/llc/fqad093","url":null,"abstract":"The aim of this work is to understand the retraction phenomenon in the arts and humanities domain through an analysis of the retraction notices—formal documents stating and describing the retraction of a particular publication. The retractions and the corresponding notices are identified using the data provided by Retraction Watch. Our methodology for the analysis combines a metadata analysis and a content analysis (mainly performed using a topic modelling process) of the retraction notices. Considering 343 cases of retraction, we found that many retraction notices are neither identifiable nor findable. In addition, these were not always separated from the original papers, introducing ambiguity in understanding how these notices were perceived by the community (i.e. cited). Also, we noticed that there is no systematic way to write a retraction notice. Indeed, some retraction notices presented a complete discussion of the reasons for retraction, while others tended to be more direct and succinct. We have also reported many notices having similar text while addressing different retractions. We think a further study with a larger collection should be done using the same methodology to confirm and investigate our findings further.","PeriodicalId":45315,"journal":{"name":"Digital Scholarship in the Humanities","volume":"52 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140168566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Digital assemblages with AI for creative interpretation of short stories
Kieran O'Halloran

I demonstrate an approach that fosters inventive interpretation of short stories in Literary Studies and in higher education generally. It involves constructing an ‘assemblage’: at its simplest, an evolving network of unusual connections assembled for a creative outcome. The assemblage of this article combines freshly located research literature, directly and indirectly related to a story’s themes and/or the personality type of its protagonists. Importantly, this assemblage also utilizes text analysis software, which reveals the relatively invisible (e.g. (in)frequent words, parts of speech, and topics), and Large Language Model (LLM) Generative AI to enrich the interpretation. Using all these elements helps productively exceed initial intuitions about the story, facilitating creativity. I model the approach using Edgar Allan Poe’s short story, The Black Cat, whose protagonist is a homicidal psychopath. Specifically, the assemblage here includes relevant software-based research (a corpus analysis of homicidal psychopathic language), non-software-based research (psychoanalytical literary criticism of The Black Cat using the empirically validated concept of transference), text analysis software (WMatrix and Datayze), and the LLM Generative AI ‘ChatGPT’ (using the freely available LLM GPT-3.5). One use of this approach is as a pedagogy in Literary Studies courses employing text analysis software (e.g. a digital stylistics course). Yet creative adaptability is a key 21st-century skill, digital literacy (including the use of Generative AI) is an important contemporary competence, and the short story genre is universally known; I therefore also highlight the utility of this approach as a university-wide pedagogy for enhancing creative thinking.
{"title":"Digital assemblages with AI for creative interpretation of short stories","authors":"Kieran O'Halloran","doi":"10.1093/llc/fqad050","DOIUrl":"https://doi.org/10.1093/llc/fqad050","url":null,"abstract":"I demonstrate an approach fostering inventive interpretation of short stories in Literary Studies and higher education generally. It involves constructing an ‘assemblage’—at its simplest, an evolving network of unusual connections for creative outcome. The assemblage of this article combines freshly located research literature, directly and indirectly related to a story’s themes, and/or the personality type of protagonists. Importantly, this assemblage also utilizes text analysis software revealing the relatively invisible (e.g. (in)frequent words, parts of speech, and topics) and Large Language Model (LLM) Generative AI to enrich the interpretation. The use of all these elements helps productively exceed initial intuitions about the story, facilitating creativity. I model the approach using Edgar Allan Poe’s short story, The Black Cat, whose protagonist is a homicidal psychopath. Specifically, the assemblage here includes relevant software-based research (a corpus analysis of homicidal psychopathic language), non-software-based research (psychoanalytical literary criticism of The Black Cat using the empirically validated concept of transference), text analysis software (WMatrix and Datayze), and the LLM Generative AI, ‘ChatGPT’ (using the freely available LLM GPT-3.5). One use of this approach is as a pedagogy in Literary Studies employing text analysis software (e.g. on a digital stylistics course). Yet given creative adaptability is a key 21st-century skill, with digital literacy—including the use of Generative AI—an important contemporary competence, and with the short story genre universally known, I highlight too the utility of this approach as a university-wide pedagogy for enhancing creative thinking.","PeriodicalId":45315,"journal":{"name":"Digital Scholarship in the Humanities","volume":"38 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140075469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using deep learning to analyse the times of the UN Security Council
Tobias Blanke

This article analyses how digital humanities scholarship can make use of recent advances in deep learning to analyse the temporal relations in an online textual archive. We use transfer learning as well as data augmentation techniques to investigate changes in United Nations Security Council resolutions. Instead of pre-defined periods, as is common, we target the years directly. Such a text regression task is, as far as we can see, novel in the digital humanities, and it has the advantage of speaking directly to historical relations. We present very good experimental results and also demonstrate how such text regressions can be interpreted, both directly and with surrogate topic models.
{"title":"Using deep learning to analyse the times of the UN Security Council","authors":"Tobias Blanke","doi":"10.1093/llc/fqae009","DOIUrl":"https://doi.org/10.1093/llc/fqae009","url":null,"abstract":"This article analyses how digital humanities scholarship can make use of recent advances in deep learning to analyse the temporal relations in an online textual archive. We use transfer learning as well as data augmentation techniques to investigate changes in United Nations Security Council resolutions. Instead of pre-defined periods, as it is common, we target the years directly. Such a text regression task is novel in the digital humanities as far as we can see and has the advantage of speaking directly to historical relations. We present not only very good experimental results but also demonstrate how such text regressions can be interpreted directly and with surrogate topic models.","PeriodicalId":45315,"journal":{"name":"Digital Scholarship in the Humanities","volume":"70 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140019758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What drives non-linguists’ hands (or mouse) when drawing mental dialect maps?
Péter Jeszenszky, Carina Steiner, Nina von Allmen, Adrian Leemann
In perceptual dialectology, mental mapping is a popular tool for eliciting attitudes and the spatial imprint of linguistic cognition from non-linguists by asking them to draw linguistic variation on maps. Despite the popularity of this method, research on the geometrical parameters of the shapes drawn on these maps has been limited. In our study, we utilized 500 mental maps, both digital and hand-drawn, introducing a new digital implementation for mental mapping (source code available). Our contribution presents the first perceptual dialectological outcomes of the ‘Swiss German Dialects in Time and Space’ project, which recorded a socio-demographically balanced corpus containing a large amount of quantitative personal data about participants representing the entire Swiss German dialect continuum. Our first research question explores how various sociolinguistic variables, and other variables related to personal background, influence the geometrical parameters of the shapes drawn, such as their number, their coverage of the language area, and their compactness. Statistical modelling reveals that dialect identity plays the most important role, while educational background, urbanity, and regional differences also affect several parameters. The second research question investigates the comparability of hand-drawn and digital mental maps, showing that they are generally comparable in geometrical terms, with minor limitations due to specific technical considerations in our digital method.
{"title":"What drives non-linguists’ hands (or mouse) when drawing mental dialect maps?","authors":"Péter Jeszenszky, Carina Steiner, Nina von Allmen, Adrian Leemann","doi":"10.1093/llc/fqae003","DOIUrl":"https://doi.org/10.1093/llc/fqae003","url":null,"abstract":"In perceptual dialectology, mental mapping is a popular tool used for eliciting attitudes and the spatial imprint of linguistic cognition from non-linguists, through tasking them with drawing about linguistic variations on maps. Despite the popularity of this method, research on the geometrical parameters of the shapes drawn on these maps has been limited. In our study, we utilized 500 mental maps, both digital and hand-drawn, introducing a new digital implementation for mental mapping (source code available). Our contribution presents the first perceptual dialectological outcomes of the ‘Swiss German Dialects in Time and Space’ project, which recorded a socio-demographically balanced corpus containing a large amount of quantitative personal data about participants that represent the entire Swiss German dialect continuum. Our first research question explores how various sociolinguistic variables and other variables related to personal background influence the geometrical parameters of shapes drawn, such as the number of shapes, their coverage of the language area, and their compactness. Statistical modelling reveals that dialect identity plays the most important role, while educational background, urbanity, and regional differences also affect more parameters. The second research question investigates the comparability between hand-drawn and digital mental maps, showing that they are generally comparable in terms of geometrical aspects, with minor limitations due to specific technical considerations in our digital method.","PeriodicalId":45315,"journal":{"name":"Digital Scholarship in the Humanities","volume":"80 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140019702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gender relations in Spanish theatre during the Silver Age: a quantitative comparison of works in the Spanish Drama Corpus
Monika Dabrowska, María Teresa Santa María Fernández
One of the many changes witnessed by Spanish society at the beginning of the 20th century was the early reshaping of the role of women, including in the realm of theatre. During the first three decades of the new century, Spanish theatre was thriving, favouring the emergence of new gender roles: there were new female playwrights, professional actresses, stage designers, costume designers, theatre company directors, etc. Against this background, i.e. the awakening of female consciousness, it is worth exploring whether the growing position of women in public life goes hand in hand with a greater presence of female characters in the plays composed at that time. To assess the position of women in playwriting in the Silver Age of Spanish literature, we analysed twenty-five stage plays by nine playwrights, written between 1878 and 1936 and taken from the Spanish Drama Corpus, which forms part of the DraCor project. The distribution of male and female protagonists on stage and the influence of female presence in dramatic conflict have been traced based on quantitative textual factors. The study thus tests the potential of quantitative methods, and their scope, for the structural analysis of plays and for studies of dramatic corpora from a gender perspective.
{"title":"Gender relations in Spanish theatre during the Silver Age: a quantitative comparison of works in the Spanish Drama Corpus","authors":"Monika Dabrowska, María Teresa Santa María Fernández","doi":"10.1093/llc/fqae007","DOIUrl":"https://doi.org/10.1093/llc/fqae007","url":null,"abstract":"One of the many changes witnessed by Spanish society at the beginning of the 20th century was the early reshaping of the role of women, including in the realm of theatre. During the first three decades of the new century, Spanish theatre was thriving, favouring the emergence of new gender roles: there were new female playwrights, professional actresses, stage designers, costume designers, theatre company directors, etc. Against this background, i.e. the awakening of female consciousness, it is worth exploring whether the growing position of women in public life goes hand in hand with a greater presence of female characters in the plays composed at that time. With a view to assessing the position of women in playwriting in the Silver Age of Spanish literature, twenty-five stage plays by nine playwrights written between 1878 and 1936 have been analysed, taken from the Spanish Drama Corpus, which forms part of the DraCor project. The distribution of male and female protagonists on stage and the influence of female presence in dramatic conflict have been traced based on quantitative textual factors. The study thus tests the potential of quantitative methods and their scope for the structural analysis of plays and studies on dramatic corpora from a gender perspective.","PeriodicalId":45315,"journal":{"name":"Digital Scholarship in the Humanities","volume":"223 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140019744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Disentangling semantic and prosodic features of English poetry
Wenyi Shang, Ted Underwood

The distinction between genre and form is still contested in literary studies. While scholars associated with the New Formalism are criticized for perceiving everything as a form, digital humanists tend to argue that everything is a genre. In this research, we employed machine learning models to classify 36,635 English poems in the Chadwyck-Healey Literature Collections into twenty-seven categories, focusing on their semantic features (lexicons) and prosodic features (meters and rhymes) independently. Our findings reveal that different categories of poetry are distinguished by different groups of characteristics, without a clear-cut division between those driven predominantly by semantic features and those driven predominantly by prosodic features. Instead, poetry categories manifest a combination of semantic and prosodic elements, spanning a spectrum of different strengths in both domains. These findings suggest that the colloquial distinction between “genre” and “form” is based on real differences between poetic categories, although those differences may not be quite as crisply binary as the vocabulary implies.
{"title":"Disentangling semantic and prosodic features of English poetry","authors":"Wenyi Shang, Ted Underwood","doi":"10.1093/llc/fqae008","DOIUrl":"https://doi.org/10.1093/llc/fqae008","url":null,"abstract":"The distinction between genre and form is still contested in literary studies. While scholars associated with the New Formalism are criticized for perceiving everything as a form, digital humanists tend to argue that everything is a genre. In this research, we employed machine learning models to classify 36,635 English poems in the Chadwyck-Healey Literature Collections into twenty-seven categories, focusing on their semantic features (lexicons) and prosodic features (meters and rhymes) independently. Our findings reveal that different categories of poetry are distinguished by different groups of characteristics, without a clear-cut division between those driven predominantly by semantic features and those driven predominantly by prosodic features. Instead, poetry categories manifest a combination of semantic and prosodic elements, spanning a spectrum of different strengths in both domains. These findings suggest that the colloquial distinction between “genre” and “form” is based on real differences between poetic categories, although those differences may not be quite as crisply binary as the vocabulary implies.","PeriodicalId":45315,"journal":{"name":"Digital Scholarship in the Humanities","volume":"48 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140019592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Whose Anthropocene?: a data-driven look at the prospects for collaboration between natural science, social science, and the humanities
Carlos Santana, Kathryn Petrozzo, T J Perkins

Although the idea of the Anthropocene originated in the earth sciences, there have been increasing calls for questions about the Anthropocene to be addressed by pan-disciplinary groups of researchers from across the natural sciences, social sciences, and humanities. We use data analysis techniques from corpus linguistics to examine academic texts about the Anthropocene from these disciplinary families. We read the data as suggesting that the barriers to a broadly interdisciplinary study of the Anthropocene are high, but we are also able to identify some areas of common ground that could serve as interdisciplinary bridges.
{"title":"Whose Anthropocene?: a data-driven look at the prospects for collaboration between natural science, social science, and the humanities","authors":"Carlos Santana, Kathryn Petrozzo, T J Perkins","doi":"10.1093/llc/fqae004","DOIUrl":"https://doi.org/10.1093/llc/fqae004","url":null,"abstract":"Although the idea of the Anthropocene originated in the earth sciences, there have been increasing calls for questions about the Anthropocene to be addressed by pan-disciplinary groups of researchers from across the natural sciences, social sciences, and humanities. We use data analysis techniques from corpus linguistics to examine academic texts about the Anthropocene from these disciplinary families. We read the data to suggest that barriers to a broadly interdisciplinary study of the Anthropocene are high, but we are also able to identify some areas of common ground that could serve as interdisciplinary bridges.","PeriodicalId":45315,"journal":{"name":"Digital Scholarship in the Humanities","volume":"19 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139758892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding poetry using natural language processing tools: a survey
Mirella De Sisto, Laura Hernández-Lorenzo, Javier De la Rosa, Salvador Ros, Elena González-Blanco
Analyzing poetry with automatic tools has great potential for improving verse-related research. Over the last few decades, this field has expanded notably, and a large number of tools for analyzing various aspects of poetry have been developed. However, a concrete connection between these tools and the traditional scholars who investigate poetry and metrics is often missing. The purpose of this article is to bridge this gap by providing a comprehensive survey of the automatic poetry analysis tools available for European languages. The tools are described and classified according to the language for which they were primarily developed and to their functionalities and purpose. Particular attention is given to those that have open-source code or provide an online version with the same functionality. Combining more traditional research with these tools has clear advantages: it provides the opportunity to address theoretical questions with the support of large amounts of data, and it allows for the development of new and diversified approaches.
{"title":"Understanding poetry using natural language processing tools: a survey","authors":"Mirella De Sisto, Laura Hernández-Lorenzo, Javier De la Rosa, Salvador Ros, Elena González-Blanco","doi":"10.1093/llc/fqae001","DOIUrl":"https://doi.org/10.1093/llc/fqae001","url":null,"abstract":"Analyzing poetry with automatic tools has great potential for improving verse-related research. Over the last few decades, this field has expanded notably and a large number of tools aiming at analyzing various aspects of poetry have been developed. However, the concrete connection between these tools and traditional scholars investigating poetry and metrics is often missing. The purpose of this article is to bridge this gap by providing a comprehensive survey of the automatic poetry analysis tools available for European languages. The tools are described and classified according to the language for which they are primarily developed, and to their functionalities and purpose. Particular attention is given to those that have open-source code or provide an online version with the same functionality. Combining more traditional research with these tools has clear advantages: it provides the opportunity to address theoretical questions with the support of large amounts of data; also, it allows for the development of new and diversified approaches.","PeriodicalId":45315,"journal":{"name":"Digital Scholarship in the Humanities","volume":"4 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139758884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linguistic annotation of cuneiform texts using treebanks and deep learning
Matthew Ong, Shai Gordin

We describe an efficient pipeline for morpho-syntactically annotating an ancient-language corpus that takes advantage of bootstrapping techniques. This pipeline is designed for ancient-language scholars looking to jump-start their own treebank projects, which can in turn serve further pedagogical research projects in the target language. We situate our work in the field of similar ancient-language treebank projects, arguing that our approach shows that individual humanities scholars can leverage current machine-learning tools to produce their own richly annotated corpora. We illustrate this pipeline by producing a new Akkadian-language treebank based on two volumes from the online editions of the State Archives of Assyria project hosted on Oracc, as well as a spaCy language model named AkkParser trained on that treebank. Both are made publicly available for annotating other Akkadian corpora. In addition, we discuss linguistic issues particular to the Neo-Assyrian letter corpus and data-encoding complications of cuneiform texts in Oracc. The strategies, language models, and processing scripts we developed to handle both the linguistic and the data-encoding issues in this project will be of special interest to scholars seeking to develop their own cuneiform treebanks.
{"title":"Linguistic annotation of cuneiform texts using treebanks and deep learning","authors":"Matthew Ong, Shai Gordin","doi":"10.1093/llc/fqae002","DOIUrl":"https://doi.org/10.1093/llc/fqae002","url":null,"abstract":"We describe an efficient pipeline for morpho-syntactically annotating an ancient language corpus which takes advantage of bootstrapping techniques. This pipeline is designed for ancient language scholars looking to jump-start their own treebank projects, which can in turn serve further pedagogical research projects in the target language. We situate our work in the field of similar ancient language treebank projects, arguing that our approach shows that individual humanities scholars can leverage current machine-learning tools to produce their own richly annotated corpora. We illustrate this pipeline by producing a new Akkadian-language treebank based on two volumes from the online editions of the State Archives of Assyria project hosted on Oracc, as well as a spaCy language model named AkkParser trained on that treebank. Both of these are made publicly available for annotating other Akkadian corpora. In addition, we discuss linguistic issues particular to the Neo-Assyrian letter corpus and data-encoding complications of cuneiform texts in Oracc. The strategies, language models, and processing scripts we developed to handle both linguistic and data-encoding issues in this project will be of special interest to scholars seeking to develop their own cuneiform treebanks.","PeriodicalId":45315,"journal":{"name":"Digital Scholarship in the Humanities","volume":"245 1","pages":""},"PeriodicalIF":0.8,"publicationDate":"2024-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139677891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}