
Latest publications in J. Lang. Technol. Comput. Linguistics

Aufbau eines Referenzkorpus zur deutschsprachigen internetbasierten Kommunikation als Zusatzkomponente für die Korpora im Projekt 'Digitales Wörterbuch der deutschen Sprache' (DWDS) [Building a reference corpus of German-language internet-based communication as a supplementary component to the corpora of the project 'Digitales Wörterbuch der deutschen Sprache' (DWDS)]
Pub Date : 2022-12-05 DOI: 10.21248/jlcl.28.2013.174
Michael Beißwenger, L. Lemnitzer
This article gives an overview of ongoing work in the project "Deutsches Referenzkorpus zur internetbasierten Kommunikation" (DeRiK), which is building a corpus of language use in German-language internet-based communication. The corpus is designed as a supplementary component to the corpora of the BBAW project "Digitales Wörterbuch der deutschen Sprache" (DWDS, http://www.dwds.de), which document written German since 1900.
{"title":"Aufbau eines Referenzkorpus zur deutschsprachigen internetbasierten Kommunikation als Zusatzkomponente für die Korpora im Projekt 'Digitales Wörterbuch der deutschen Sprache' (DWDS)","authors":"Michael Beißwenger, L. Lemnitzer","doi":"10.21248/jlcl.28.2013.174","DOIUrl":"https://doi.org/10.21248/jlcl.28.2013.174","url":null,"abstract":"Dieser Beitrag gibt einen Überblick über die laufenden Arbeiten im Projekt „Deutsches Referenzkorpus zur internetbasierten Kommunikation“ (DeRiK), in dem ein Korpus zur Sprachverwendung in der deutschsprachigen internetbasierten Kommunikation aufgebaut wird. Das Korpus ist als eine Zusatzkomponente zu den Korpora im BBAW-Projekt „Digitales Wörterbuch der deutschen Sprache“ (DWDS, http://www.dwds.de) konzipiert, die die geschriebene deutsche Sprache seit 1900 dokumentieren.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132862029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Supervised OCR Error Detection and Correction Using Statistical and Neural Machine Translation Methods
Pub Date : 2018-07-01 DOI: 10.5167/UZH-162394
Chantal Amrhein, S. Clematide
For indexing the content of digitized historical texts, optical character recognition (OCR) errors pose a serious obstacle. To explore the effectiveness of new strategies for OCR post-correction, this article focuses on methods of character-based machine translation, specifically neural machine translation (NMT) and statistical machine translation (SMT). Using the ICDAR 2017 data set on OCR post-correction for English and French, we experiment with different strategies for error detection and error correction. We analyze how OCR post-correction with NMT can profit from additional information and show that SMT and NMT can benefit from each other on these tasks. An ensemble of our models reached the best performance in ICDAR's 2017 error correction subtask and performed competitively in error detection. However, our experimental results also suggest that tuning supervised learning for OCR post-correction of texts from different sources, text types (periodicals and monographs), time periods and languages is a difficult task: the data on which the MT systems are trained have a large influence on which methods and features work best. Conclusive and generally applicable insights are hard to achieve.
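Character-based MT of this kind treats the noisy OCR line as the source language and the corrected line as the target. A minimal sketch of the usual preprocessing, assuming a generic seq2seq engine downstream; the function names and the marker symbol are illustrative, not taken from the paper's pipeline:

```python
# Character-level MT preprocessing: each character becomes a token and
# spaces are made explicit, so a generic seq2seq engine can "translate"
# noisy OCR output into clean text. Names here are illustrative.

def to_char_tokens(text: str, space_symbol: str = "▁") -> str:
    """Tokenize a line into characters, encoding spaces explicitly."""
    return " ".join(space_symbol if ch == " " else ch for ch in text)

def from_char_tokens(tokens: str, space_symbol: str = "▁") -> str:
    """Invert the character tokenization after decoding."""
    return "".join(" " if tok == space_symbol else tok for tok in tokens.split())

ocr_line  = "Tlie qnick brown fox"   # noisy OCR output (source side)
gold_line = "The quick brown fox"    # manual correction (target side)

src = to_char_tokens(ocr_line)   # 'T l i e ▁ q n i c k ▁ ...'
tgt = to_char_tokens(gold_line)
assert from_char_tokens(tgt) == gold_line
```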
Citations: 27
Comparison of OCR Accuracy on Early Printed Books using the Open Source Engines Calamari and OCRopus
Pub Date : 2018-07-01 DOI: 10.21248/jlcl.33.2018.219
C. Wick, Christian Reul, F. Puppe
This paper proposes a combination of a convolutional and an LSTM network to improve the accuracy of OCR on early printed books. While the default approach of line-based OCR is to use a single LSTM layer, as provided by the well-established OCR software OCRopus (OCRopy), we place a CNN-and-pooling-layer combination before the LSTM layer, as implemented by the novel OCR software Calamari. Since historical prints often require book-specific models trained on manually labeled ground truth (GT), the goal is to maximize the recognition accuracy of a trained model while keeping the needed manual effort to a minimum. We show that the deep model significantly outperforms the shallow LSTM network when using both many and only a few training examples, although the deep network has a higher number of trainable parameters. The error rate is thereby reduced by a factor of up to 55%, yielding character error rates (CER) of 1% and below for 1,000 lines of training. To further improve the results, we apply a confidence voting mechanism to achieve CERs below 0.5%. A simple data augmentation scheme and the usage of pretrained models reduce the CER by up to a further 62% if only little training data is available. Thus, we require only 100 lines of GT to reach an average CER of 1.2%. The runtime of the deep model for training and prediction of a book is very similar to that of a shallow network when trained on a CPU. However, using a GPU, as supported by Calamari, reduces the prediction time by a factor of at least four and the training time by more than six.
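The character error rate (CER) cited throughout is edit distance over reference length. A minimal, textbook implementation for reproducing such figures (not code from the paper):

```python
# Character error rate (CER) via Levenshtein distance: the metric
# behind the 1% / 0.5% figures reported above. Generic textbook code.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, ground_truth: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(prediction, ground_truth) / max(len(ground_truth), 1)

print(cer("Gothic lettcr", "Gothic letter"))  # 1 substitution / 13 chars ≈ 0.077
```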
Citations: 18
Crowdsourcing the OCR Ground Truth of a German and French Cultural Heritage Corpus
Pub Date : 2018-07-01 DOI: 10.21248/jlcl.33.2018.217
S. Clematide, Lenz Furrer, M. Volk
Crowdsourcing approaches for the post-correction of OCR (Optical Character Recognition) output have been successfully applied to several historical text collections. We report on our crowd-correction platform Kokos, which we built to improve the OCR quality of the digitized yearbooks of the Swiss Alpine Club (SAC) from the 19th century. This multilingual heritage corpus consists of Alpine texts mainly written in German and French, all typeset in Antiqua font. Finding and engaging volunteers to correct large numbers of pages into high-quality text requires a carefully designed user interface, an easy-to-use workflow, and continuous effort to keep the participants motivated. More than 180,000 characters on about 21,000 pages were corrected by volunteers in about 7 months, achieving an OCR ground truth with a systematically evaluated accuracy of 99.7% at the word level. The crowdsourced OCR ground truth and the corresponding original OCR recognition results from Abbyy FineReader for each page are available as a resource for machine learning and evaluation. Additionally, the scanned images (300 dpi) of all pages are included to enable tests with other OCR software.
Citations: 0
Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin
Pub Date : 2018-07-01 DOI: 10.21248/jlcl.33.2018.220
U. Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter
In this paper we describe a dataset of German and Latin ground truth (GT) for historical OCR in the form of printed text line images paired with their transcriptions. This dataset, called GT4HistOCR, consists of 313,173 line pairs covering a wide range of printing dates, from incunabula of the 15th century to 19th-century books printed in Fraktur types, and is openly available under a CC-BY 4.0 license. The special form of GT as line-image/transcription pairs makes it directly usable for training state-of-the-art recognition models for OCR software employing recurrent neural networks in LSTM architectures, such as Tesseract 4 or OCRopus. We also provide some pretrained OCRopus models for subcorpora of our dataset, yielding between 95% (early printings) and 98% (19th-century Fraktur printings) character accuracy rates on unseen test cases, a Perl script to harmonize GT produced by different transcription rules, and hints on how to construct GT for OCR purposes, whose requirements may differ from those of linguistically motivated transcriptions.
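Line-pair GT of this kind is conventionally stored OCRopus-style, as a PNG line image next to a *.gt.txt transcription. A minimal loading sketch under that assumption; the exact directory layout of GT4HistOCR may differ:

```python
# Pair OCRopus-style line images with their transcriptions. The tree
# layout ("GT4HistOCR/corpus") is assumed for illustration only.
from pathlib import Path

def load_line_pairs(root: str):
    """Yield (image_path, transcription) pairs from an OCRopus-style tree."""
    for gt_file in sorted(Path(root).rglob("*.gt.txt")):
        img = gt_file.with_name(gt_file.name.replace(".gt.txt", ".png"))
        if img.exists():
            yield img, gt_file.read_text(encoding="utf-8").rstrip("\n")

# Example usage (path is hypothetical):
# for img, text in load_line_pairs("GT4HistOCR/corpus"):
#     print(img.name, "->", text)
```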
Citations: 31
Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning
Pub Date : 2018-02-27 DOI: 10.21248/jlcl.33.2018.216
Christian Reul, U. Springmann, C. Wick, F. Puppe
We combine three methods which significantly improve the accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross-fold training on a single set of ground truth data (line images and their transcriptions) with a single OCR engine (OCRopus) produces a committee whose members then vote for the best outcome, also taking the top-N alternatives and their intrinsic confidence values into account. (3) Following the principle of maximal disagreement, we select additional training lines on which the voters disagree most, expecting them to offer the highest information gain for a subsequent training round (active learning). Evaluations on six early printed books yielded the following results: on average, the combination of pretraining and voting improved the character accuracy by 46% when training five folds starting from the same mixed model. This number rose to 53% when using different models for pretraining, underlining the importance of diverse voters. Incorporating active learning improved the obtained results by another 16% on average (evaluated on three of the six books). Overall, the proposed methods lead to an average error rate of 2.5% when training on only 60 lines. Using a substantial ground truth pool of 1,000 lines brought the error rate down even further, to less than 1% on average.
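The maximal-disagreement selection in (3) can be sketched independently of any OCR engine: given each committee member's prediction per line, rank lines by pairwise disagreement and transcribe the top ones. A simplified illustration that ignores the top-N alternatives and confidence values the paper also uses:

```python
# Rank lines by how much the cross-fold voters disagree and pick the
# most contested ones for manual transcription (active learning).
# Simplified: no top-N alternatives or confidence weighting.
from itertools import combinations

def disagreement(predictions: list[str]) -> float:
    """Fraction of model pairs that produced different outputs for a line."""
    pairs = list(combinations(predictions, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

def select_lines(per_line_predictions: dict[str, list[str]], k: int) -> list[str]:
    """Pick the k line IDs the committee disagrees on most."""
    return sorted(per_line_predictions,
                  key=lambda line: disagreement(per_line_predictions[line]),
                  reverse=True)[:k]

votes = {"line_01": ["Jn the", "In the", "In the", "In the", "Jn the"],
         "line_02": ["anno", "anno", "anno", "anno", "anno"]}
print(select_lines(votes, 1))  # ['line_01']
```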
Citations: 9
A Survey and Comparative Study of Arabic Diacritization Tools
Pub Date : 2017-07-01 DOI: 10.21248/jlcl.32.2017.213
O. Hamed, Torsten Zesch
Modern Standard Arabic, like other languages based on the Arabic script, is usually written without diacritics, which complicates many language processing tasks. Although many different approaches for automatic diacritization of Arabic have been proposed, it is still unclear what performance level can be expected in a practical setting. For that purpose, we first survey the Arabic diacritization tools in the literature and group the results by the corpus used for testing. We then conduct a comparative study between the available tools for diacritization (Farasa and Madamira) as well as two baselines. We evaluate the error rates of these systems using a set of publicly available, fully diacritized corpora in two different evaluation modes. With the help of human annotators, we conduct an additional experiment examining error categories. We find that Farasa outperforms Madamira and the baselines in both modes.
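A common way to score diacritization, which may differ in detail from the paper's metrics, is a diacritic error rate over base characters: strip the tashkeel marks (U+064B–U+0652), then compare the marks attached to each letter. A minimal sketch:

```python
# Diacritic error rate: strip Arabic tashkeel, check the base letters
# agree, and compare per-letter diacritics. One common definition of
# the metric, not necessarily the exact one used in the paper.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}  # fathatan … sukun

def split_diacritics(text: str):
    """Return (base_letters, per-letter diacritic strings)."""
    bases, marks = [], []
    for ch in text:
        if ch in DIACRITICS and marks:
            marks[-1] += ch            # attach mark to preceding letter
        else:
            bases.append(ch)
            marks.append("")
    return "".join(bases), marks

def der(predicted: str, gold: str) -> float:
    """Fraction of base characters whose diacritics differ from gold."""
    pb, pm = split_diacritics(predicted)
    gb, gm = split_diacritics(gold)
    assert pb == gb, "base letters must match to compare diacritics"
    return sum(p != g for p, g in zip(pm, gm)) / max(len(gm), 1)

print(round(der("كَتِبَ", "كَتَبَ"), 2))  # 0.33 – one wrong vowel out of three letters
```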
Citations: 24
Tagging Classical Arabic Text using Available Morphological Analysers and Part of Speech Taggers
Pub Date : 2017-07-01 DOI: 10.21248/jlcl.32.2017.212
A. Alosaimy, E. Atwell
Focusing on Classical Arabic, the first part of this paper evaluates morphological analysers and POS taggers that are freely available for research purposes, are designed for Modern Standard Arabic (MSA) or Classical Arabic (CA), are able to analyse all forms of words, and have academic credibility. We list and compare the supported features of each tool and how they differ in output format, segmentation, part-of-speech (POS) tags and morphological features. We demonstrate sample output of each analyser on one fully vowelized CA sentence. This evaluation serves as a guide for choosing the tool best suited to a given research need. In the second part, we report the accuracy and coverage of tagging a set of Classical Arabic vocabulary extracted from classical texts. The results show a drop in accuracy and coverage and suggest that an ensemble method might increase both for Classical Arabic.
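The ensemble suggested in the conclusion can be as simple as a per-token majority vote over aligned tagger outputs. A minimal sketch with illustrative tags, assuming the taggers already agree on tokenization:

```python
# Per-token majority vote over aligned POS tagger outputs. The tag
# values are illustrative placeholders, not any tool's actual tagset.
from collections import Counter

def majority_vote(tag_sequences: list[list[str]]) -> list[str]:
    """Per-token majority vote over aligned tagger outputs."""
    return [Counter(tags).most_common(1)[0][0]
            for tags in zip(*tag_sequences)]

tagger_a = ["NOUN", "VERB", "PART"]
tagger_b = ["NOUN", "VERB", "NOUN"]
tagger_c = ["ADJ",  "VERB", "PART"]
print(majority_vote([tagger_a, tagger_b, tagger_c]))
# ['NOUN', 'VERB', 'PART']
```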
Citations: 27
Relativisation across varieties: A corpus analysis of Arabic texts
Pub Date : 2017-07-01 DOI: 10.21248/jlcl.32.2017.214
Zainab Al-Zaghir
Relative clauses are among the structures used most frequently in written texts and everyday conversation. Various studies have investigated how relative clauses are used and distributed in corpora. Some support the claim that accessibility to relativisation, as represented by the Noun Phrase Accessibility Hierarchy (NPAH) proposed by KEENAN and COMRIE (1977), predicts the distribution of relative clauses in corpora. Others found that the discourse functions of relative clauses play an important role in their distribution (FOX, 1987). However, little attention has been paid to the role of the language variety in which relative clauses are written. This study investigates relativisation in Arabic written texts across three varieties: Classical Arabic, Modern Standard Arabic and Iraqi Arabic. A statistical analysis of the results shows that relativisation patterns differ significantly across varieties of Arabic and cannot be predicted by a single accessibility hierarchy.
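The statistical analysis referred to here is typically a test of independence over a contingency table of relativized positions per variety. A minimal sketch with invented placeholder counts (not the study's data), using SciPy's chi-square test:

```python
# Chi-square test of independence: do relativized syntactic positions
# (subject, direct object, oblique) distribute differently across
# varieties? The counts below are invented placeholders.
from scipy.stats import chi2_contingency

counts = [  # rows: Classical, MSA, Iraqi Arabic; cols: SU, DO, OBL
    [120, 45, 15],
    [100, 60, 30],
    [ 80, 70, 40],
]
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.4f}")  # small p -> distributions differ
```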
Citations: 0
Towards Interactive Multidimensional Visualisations for Corpus Linguistics
Pub Date : 2016-07-01 DOI: 10.21248/jlcl.31.2016.200
Paul Rayson, J. Mariani, Bryce Anderson-Cooper, Alistair Baron, David Gullick, Andrew Moore, Stephen Wattam
We propose the novel application of dynamic and interactive visualisation techniques to support the iterative and exploratory investigations typical of the corpus linguistics methodology. Very large-scale text analysis is already carried out in corpus-based language analysis using methods such as frequency profiling, keywords, concordancing, collocations and n-grams; at present, however, only basic visualisation methods are utilised. In this paper, we describe case studies of multiple types of key word clouds and explorer tools for collocation networks, and compare network and language-distance visualisations for online social networks. These are shown to fit better with the iterative, data-driven corpus methodology and permit some level of scalability to cope with ever-increasing corpus size and complexity. In addition, they will allow corpus linguistic methods to be used more widely in the digital humanities and social sciences, since the learning curve of visualisations is shallower for non-experts.
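Key word clouds of the kind described rest on a keyness score; the log-likelihood measure is the standard choice in corpus linguistics for comparing a word's frequency across two corpora. A minimal sketch of that standard formula (not code from the authors' tools):

```python
# Log-likelihood keyness: score how strongly a word is over-represented
# in a study corpus relative to a reference corpus. Standard formula;
# the example frequencies are illustrative.
import math

def log_likelihood(a: int, b: int, n1: int, n2: int) -> float:
    """LL for a word occurring a times in a corpus of n1 tokens
    and b times in a reference corpus of n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected frequency, corpus 1
    e2 = n2 * (a + b) / (n1 + n2)  # expected frequency, corpus 2
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# 'climbing': 150 hits in a 1M-token corpus vs 40 in a 2M-token one
print(round(log_likelihood(150, 40, 1_000_000, 2_000_000), 1))
# large score -> 'climbing' is strongly key in the first corpus
```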
Citations: 5