Pub Date: 2012-07-01; DOI: 10.21248/jlcl.27.2012.160
J. Gippert
‘Avestan’ is the name of the ritual language of Zoroastrianism, which was the state religion of the Iranian empire in Achaemenid, Arsacid and Sasanid times, covering a time span of more than 1200 years. It is named after the ‘Avesta’, i.e., the collection of holy scriptures that form the basis of the religion allegedly founded by Zarathushtra, also known as Zoroaster, around the beginning of the first millennium B.C. Together with Vedic Sanskrit, Avestan represents one of the most archaic witnesses of the Indo-Iranian branch of the Indo-European languages, which makes it especially interesting for historical-comparative linguistics. This is why the texts of the Avesta were among the first objects of electronic corpus building undertaken in the framework of Indo-European studies, leading to the establishment of the TITUS database (‘Thesaurus indogermanischer Text- und Sprachmaterialien’). Today, the complete Avestan corpus is available, together with elaborate search functions and an extended version of the subcorpus of the so-called ‘Yasna’, which covers a great deal of the attested variant readings. Right from the beginning of their computational work on the Avesta, the compilers had to cope with the fact that its texts have been transmitted in a special script written from right to left, which was also used for printing them in the scholarly editions still in use today. It goes without saying that in the mid-1980s there was no way to encode the Avestan scriptures exactly as they are found in the manuscripts. Instead, we had to rely upon transcriptional devices dictated by the restrictions of the character encodings provided by the computer systems in use. As the problems we had to face in this respect and the solutions we could apply are typical of the development of computational work on ancient languages, it seems worthwhile to sketch them out here.
1 The Avestan script and its transcription
1.1 Early western approaches to the Avestan script and its transcription
The Avestan script has been known to western scholarship since the 17th century, when the first accounts of the religion of the ‘Parsees’, i.e., Zoroastrians living in India and Iran, were published. The first notable description of the script is found in the travel report by JEAN CHARDIN, who sojourned in Iran in 1673–7; in the 1711 edition of his report, the author provides an ‘alphabet of the ancient Persians’, together with a lithographed table contrasting the characters of the Avestan script with their Perso-Arabic equivalents; cf. the extract illustrated in Fig. 1.
Title: The Encoding of Avestan - Problems and Solutions. Journal: J. Lang. Technol. Comput. Linguistics. URL: https://doi.org/10.21248/jlcl.27.2012.160
Pub Date: 2012-07-01; DOI: 10.21248/jlcl.27.2012.165
Jolanta Gelumbeckaite, Mindaugas Sinkunas, Vytautas Zinkevicius
Title: Old Lithuanian Reference Corpus (SLIEKKAS) and Automated Grammatical Annotation. Journal: J. Lang. Technol. Comput. Linguistics. URL: https://doi.org/10.21248/jlcl.27.2012.165
Pub Date: 2012-07-01; DOI: 10.21248/jlcl.27.2012.162
Roland Mittmann
In the pre-digital age, glossaries were indispensable for making words and word forms findable within texts. Today their data can be merged automatically with the corresponding texts in order to enrich those texts with further information. For the digitisation of the glossaries that this requires, a manual procedure is the most effective, given the historical typography and the often ambiguous markup of information. Depending on the structure of the glossary and on the nature and density of transmission of the text concerned, different challenges and problems arise. These are illustrated using the digitisation of the glossaries for Old High German and Old Saxon as an example.
Title: Digitalisierung historischer Glossare zur automatisierten Vorannotation von Textkorpora am Beispiel des Altdeutschen. Journal: J. Lang. Technol. Comput. Linguistics. URL: https://doi.org/10.21248/jlcl.27.2012.162
Pub Date: 2011-07-01; DOI: 10.21248/jlcl.26.2011.143
Kristin Bech, K. Eide
Title: The Annotation of Morphology, Syntax and Information Structure in a Multilayered Diachronic Corpus. Journal: J. Lang. Technol. Comput. Linguistics. URL: https://doi.org/10.21248/jlcl.26.2011.143
Pub Date: 2011-07-01; DOI: 10.21248/jlcl.26.2011.149
D. Kaplan, R. Iida, K. Nishina, T. Tokunaga
Research trends of the last five years show that richly annotated corpora inspire novel research. Such corpora are indispensable for advancing research, but they are also harder to manage and maintain as their complexity grows; what is needed is a way to manage the annotation project in its entirety. However, annotation project management has received little attention, with tools predominantly focusing on single-document annotation. We therefore define a list of corpus creation and management needs for annotation systems and then introduce Slate, our multi-purpose annotation and management system, which addresses these needs; a case study shows how project management is essential to creating good corpora.
Title: Slate - A Tool for Creating and Maintaining Annotated Corpora. Journal: J. Lang. Technol. Comput. Linguistics. URL: https://doi.org/10.21248/jlcl.26.2011.149
Pub Date: 2011-07-01; DOI: 10.21248/jlcl.26.2011.154
Arne Skjærholt
We present our experiments with annotating a Latin corpus using an assisted annotation procedure where the corpus to be annotated is preannotated by a statistical tagger. This assisted procedure gives a notable reduction in annotator error compared to the unassisted annotation of previous annotation efforts, even with a huge tagset (1 000 tags) and modest tagger accuracy due to limited training data and domain effects.
Title: More, Faster: Accelerated Corpus Annotation with Statistical Taggers. Journal: J. Lang. Technol. Comput. Linguistics. URL: https://doi.org/10.21248/jlcl.26.2011.154
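The assisted annotation procedure described in this abstract can be illustrated with a minimal sketch. It assumes the simplest possible statistical tagger, a most-frequent-tag baseline, which is not the tagger used in the paper; the function names, the `UNK` placeholder tag and the toy training data are all hypothetical. The point is only the workflow: the tagger proposes a tag for every token, and the annotator corrects the proposals instead of tagging from scratch.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for each (lowercased) token."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for token, tag in sentence:
            counts[token.lower()][tag] += 1
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}


def preannotate(tokens, model, unknown_tag="UNK"):
    """Propose a tag per token; unseen tokens get a placeholder for the annotator."""
    return [(token, model.get(token.lower(), unknown_tag)) for token in tokens]


# Toy training data (illustrative only, not taken from the paper's corpus).
training = [
    [("arma", "NOUN"), ("virumque", "NOUN"), ("cano", "VERB")],
    [("cano", "VERB"), ("arma", "NOUN")],
]

model = train_unigram_tagger(training)
suggestions = preannotate(["Arma", "cano", "Troiae"], model)
print(suggestions)
```

A real setup would replace the baseline with a trained sequence tagger and a tagset of realistic size; the pre-annotation step itself stays the same.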
Pub Date: 2011-07-01; DOI: 10.21248/jlcl.26.2011.141
Noah Bubenhofer
For linguistic training at universities it is indispensable to introduce students to the theory and methods of corpus linguistics, yet as a teacher one struggles with a number of problems, because the students' technical and methodological know-how is often very heterogeneous. It also proves important to be able to inspire students to do corpus-linguistic work by introducing them to attractive illustrative material. In what follows I show, with some examples, which topics in semantics, text linguistics, discourse analysis and cultural analysis can sensibly be treated with corpus-linguistic methods. In addition, I try to derive what users need from the methods and tools of corpus linguistics, based on the usage patterns of my online introduction to corpus linguistics.
Title: Korpuslinguistik in der linguistischen Lehre: Erfolge und Misserfolge. Journal: J. Lang. Technol. Comput. Linguistics. URL: https://doi.org/10.21248/jlcl.26.2011.141
Pub Date: 2011-07-01; DOI: 10.21248/jlcl.26.2011.152
M. Manca, L. Spinazzè, P. Mastandrea, L. Tessarolo, Federico Boschetti
The Musisque Deoque Project (MQDQ) aims to create a digital archive of Latin poetry, from its origins to the late Italian Renaissance, equipped with a critical apparatus and various exegetical and linguistic information. The project focuses on the study of synchronic and diachronic intertextuality as illustrated, e.g., in Cicu (2005). For this reason, we pay close attention to the formal and material aspects of the text that actually played a relevant role in the poetic tradition. The fixed text of printed critical editions, which aims at a reconstruction as close as possible to the lost originals, provides just a snapshot of the tradition, which is intrinsically dynamic, and gives the modern reader a distorted image of what an ancient text actually was. The fully searchable digital collections currently available are based on traditional critical editions, which are authoritarian texts; this authoritarianism is reinforced by the conversion from printed text to database, because the critical apparatus is usually cut away, leaving the reader no way to check a variant other than the one the editor put in the main text, often dubitanter, simply because a choice had to be made. Limiting lexical searches to the editor's choices unavoidably leads to both false positives and false negatives, which have to be checked against the printed critical editions. False positives arise from possibly wrong emendations by modern and contemporary scholars, which text retrieval systems return among the genuine occurrences; false negatives are plausible variants excluded by editors prejudiced against specific linguistic and stylistic phenomena (such as short-range repetition, systematically emended by philologists of recent centuries). The purpose of Musisque Deoque is to overcome these limitations by retrieving not only the word forms in the reference edition but also the variants in the critical apparatus. In this way, further knowledge can emerge about the path these texts have travelled, from the ancient works through the subsequent ages to Humanism and the Renaissance.
Title: Musisque Deoque: Text Retrieval on Critical Editions. Journal: J. Lang. Technol. Comput. Linguistics. URL: https://doi.org/10.21248/jlcl.26.2011.152
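The idea of retrieving variants from the critical apparatus alongside the reference text can be sketched as a small inverted index. This is a hedged illustration, not the MQDQ implementation: the data layout (`text` and `apparatus` keys), the function names and the toy readings are assumptions made for the example. Each hit records whether the word comes from the editor's main text or from the apparatus, so a reader can tell the two kinds of evidence apart.

```python
from collections import defaultdict


def build_index(verses):
    """Index every word of the main text and every apparatus variant by verse id."""
    index = defaultdict(set)
    for verse_id, verse in verses.items():
        for word in verse["text"].lower().split():
            index[word].add((verse_id, "text"))
        for variant in verse["apparatus"]:
            index[variant.lower()].add((verse_id, "apparatus"))
    return index


def search(index, word):
    """Return sorted (verse id, source) pairs for a word, or an empty list."""
    return sorted(index.get(word.lower(), set()))


# Invented toy readings; they are placeholders, not real apparatus entries.
verses = {
    "v1": {"text": "arma virumque cano", "apparatus": ["canam"]},
    "v2": {"text": "musa canam", "apparatus": []},
}

index = build_index(verses)
print(search(index, "canam"))
```

A search limited to the main text would find the word only in `v2`; the variant-aware index also surfaces the apparatus reading in `v1`, which is the kind of false negative the abstract describes.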
Pub Date: 2011-07-01; DOI: 10.21248/jlcl.26.2011.147
Iris Hendrickx, Rita Marquilhas
We aim to tackle the problem of spelling variation in a corpus of personal Portuguese letters from the 16th to the 20th century. We investigated the extent to which the task of normalising Portuguese spelling can be accomplished automatically. We adapted VARD2 (Baron and Rayson, 2008), a statistical tool for normalising spelling, for use with the Portuguese language and studied its performance over four different time periods. Our results showed that VARD2 performed best on the older letters and worst on the most modern ones. In an extrinsic evaluation, we measured the usefulness of automatic normalisation for the linguistic task of automatic POS-tagging and showed that automatic normalisation of spelling helps improve the performance of the POS-tagger.
Title: From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation. Journal: J. Lang. Technol. Comput. Linguistics. URL: https://doi.org/10.21248/jlcl.26.2011.147
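As a rough illustration of what automatic spelling normalisation involves, the sketch below combines a list of known variants with fuzzy matching against a modern lexicon. It is not VARD2's algorithm and does not use its API; the `KNOWN_VARIANTS` table, the `MODERN_LEXICON` list, the `normalise` function and the 0.75 similarity cutoff are all assumptions chosen for the example, and it relies only on Python's standard `difflib` module.

```python
from difflib import get_close_matches

# Hypothetical resources: a tiny variant list and a tiny modern lexicon.
KNOWN_VARIANTS = {"pharmacia": "farmácia"}
MODERN_LEXICON = ["farmácia", "carta", "muito"]


def normalise(word, cutoff=0.75):
    """Map a historical spelling to a modern form, or return it unchanged."""
    lowered = word.lower()
    # Step 1: exact lookup in the list of known historical variants.
    if lowered in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[lowered]
    # Step 2: words already in the modern lexicon need no change.
    if lowered in MODERN_LEXICON:
        return lowered
    # Step 3: fall back to the closest modern word above the similarity cutoff.
    candidates = get_close_matches(lowered, MODERN_LEXICON, n=1, cutoff=cutoff)
    return candidates[0] if candidates else word


for historical in ["pharmacia", "muyto", "carta"]:
    print(historical, "->", normalise(historical))
```

Normalised tokens can then be passed to a POS-tagger trained on modern text, which is the extrinsic setting the abstract evaluates; the cutoff trades false normalisations against missed ones and would need tuning per time period.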