
Recent Advances in Natural Language Processing: Latest Articles

Discourse-Based Approach to Involvement of Background Knowledge for Question Answering
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_044
Boris A. Galitsky, Dmitry Ilvovsky
We introduce the concept of a virtual discourse tree to improve question answering (Q/A) recall for complex, multi-sentence questions. By augmenting the discourse tree of an answer with tree fragments obtained from text corpora that play the role of an ontology, we obtain on the fly a canonical discourse representation of the answer that is independent of the thought structure of a given author. This mechanism is critical for finding an answer that is relevant not only in terms of the question's entities but also in terms of the inter-relations between these entities in the answer and its style. We evaluate a Q/A system enabled with virtual discourse trees and observe a substantial increase in performance when answering complex questions from sources such as Yahoo! Answers and www.2carpros.com.
Citations: 3
Two Discourse Tree - Based Approaches to Indexing Answers
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_043
Boris A. Galitsky, Dmitry Ilvovsky
We explore the anatomy of answers with respect to which text fragments from an answer are worth matching with a question and which should not be matched. We apply Rhetorical Structure Theory to build a discourse tree of an answer and select elementary discourse units that are suitable for indexing. Manual rules for selecting these discourse units, as well as automated classification based on web search engine mining, are evaluated with respect to improving search accuracy. We form two sets of question-answer pairs for the FAQ and community QA search domains and use them to evaluate the proposed indexing methodology, which delivers up to a 16 percent improvement in search recall.
Citations: 0
v-trel: Vocabulary Trainer for Tracing Word Relations - An Implicit Crowdsourcing Approach
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_079
V. Lyding, Christos T. Rodosthenous, Federico Sangati, U. Hassan, Lionel Nicolas, Alexander Koenig, J. Horbačauskienė, Anisia Katinskaia
In this paper, we present our work on developing a vocabulary trainer that uses exercises generated from language resources such as ConceptNet and crowdsources the responses of the learners to enrich the language resource. We performed an empirical evaluation of our approach with 60 non-native speakers over two days, which shows that new entries to expand ConceptNet can be gathered efficiently through vocabulary exercises on word relations. We also report on the feedback gathered from the users and from an expert in language teaching, and discuss the potential of the vocabulary trainer application from the user and language learner perspectives. The feedback suggests that v-trel has educational potential, although some shortcomings were identified in its current state.
Citations: 7
Sentiment Polarity Detection in Azerbaijani Social News Articles
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_082
S. Mammadli, S. Huseynov, Huseyn Alkaramov, Ulviyya Jafarli, U. Suleymanov, S. Rustamov
The text classification field of natural language processing has experienced remarkable growth in recent years. In particular, sentiment analysis has received considerable attention from both industry and the research community. However, only a few research examples exist for the Azerbaijani language. The main objective of this research is to apply various machine learning algorithms to determine the sentiment of news articles in Azerbaijani. Approximately 30,000 social news articles were collected from online news sites and manually labeled as negative or positive according to their sentiment. First, text preprocessing was applied to the data to eliminate noise. Second, to convert the text to a more machine-readable form, the bag-of-words (BOW) model was applied. More specifically, two variants of the BOW model, tf-idf and a frequency-based model, were used as vectorization methods. Additionally, SVM, Random Forest, and Naive Bayes were applied as classification algorithms, and their combinations with the two vectorization approaches were tested and analyzed. Experimental results indicate that SVM outperforms the other classification algorithms.
Citations: 3
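As a rough illustration of the two vectorization methods named in the abstract above (frequency-based bag of words and tf-idf), the following sketch builds both representations for a toy corpus. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, and the function names `count_vectors` and `tfidf_vectors` are assumptions chosen for brevity, and the classification step (SVM, Random Forest, Naive Bayes) is omitted.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real preprocessing would do more."""
    return text.lower().split()

def count_vectors(docs):
    """Frequency-based bag of words: one Counter of term counts per document."""
    return [Counter(tokenize(d)) for d in docs]

def tfidf_vectors(docs):
    """Weight each term count by a smoothed inverse document frequency (assumed formula)."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

# Toy English corpus, purely illustrative.
docs = ["good news today", "bad news today", "good good result"]
freq = count_vectors(docs)
tfidf = tfidf_vectors(docs)
```

In the tf-idf representation, a term that appears in fewer documents (here "bad") receives more weight than an equally frequent term that appears in many (here "news"), which is the main difference from raw counts.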
Sentence Simplification for Semantic Role Labelling and Information Extraction
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_033
R. Evans, Constantin Orasan
In this paper, we report on the extrinsic evaluation of an automatic sentence simplification method with respect to two NLP tasks: semantic role labelling (SRL) and information extraction (IE). The paper begins with our observation of challenges in the intrinsic evaluation of sentence simplification systems, which motivates the use of extrinsic evaluation of these systems with respect to other NLP tasks. We describe the two NLP systems and the test data used in the extrinsic evaluation, and present arguments and evidence motivating the integration of a sentence simplification step as a means of improving the accuracy of these systems. Our evaluation reveals that their performance is improved by the simplification step: the SRL system is better able to assign semantic roles to the majority of the arguments of verbs and the IE system is better able to identify fillers for all IE template slots.
Citations: 7
Toponym Detection in the Bio-Medical Domain: A Hybrid Approach with Deep Learning
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_106
A. Plum, Tharindu Ranasinghe, Constantin Orasan
This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.
Citations: 2
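The second stage described in the abstract above, string matching against a gazetteer to find the exact position of a toponym in a sentence, might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' system: the function name `find_toponyms`, the word-boundary regular expression, the longest-match-first rule, and the three-entry gazetteer are all hypothetical, and the preceding sentence classifier is left out.

```python
import re

def find_toponyms(sentence, gazetteer):
    """Return sorted (start, end, name) character spans for gazetteer entries in the sentence."""
    spans = []
    # Try longer names first so "New York" wins over "York" (assumed tie-breaking rule).
    for name in sorted(gazetteer, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, sentence):
            # Skip matches that overlap a span already accepted.
            if all(m.end() <= s or m.start() >= e for s, e, _ in spans):
                spans.append((m.start(), m.end(), name))
    return sorted(spans)

# Hypothetical gazetteer and sentence, for illustration only.
gazetteer = {"York", "New York", "Kenya"}
sentence = "Samples were collected in New York and Kenya."
spans = find_toponyms(sentence, gazetteer)
```

Each returned span can be sliced back out of the sentence (`sentence[start:end]`), which is the kind of exact index the abstract refers to.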
Automatic Propbank Generation for Turkish
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_005
Koray Ak, O. T. Yildiz
Semantic role labeling (SRL) is an important task for understanding natural languages, where the objective is to analyse propositions expressed by the verb and to identify each word that bears a semantic role. It provides an extensive dataset to enhance NLP applications such as information retrieval, machine translation, information extraction, and question answering. However, creating SRL models is difficult. In some languages, it is even infeasible to create SRL models that have predicate-argument structure due to a lack of linguistic resources. In this paper, we present our method to create an automatic Turkish PropBank by exploiting parallel data from the translated sentences of the English PropBank. Experiments show that our method gives promising results.
Citations: 1
Named Entity Recognition in Information Security Domain for Russian
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_128
A. Sirotina, Natalia V. Loukachevitch
In this paper we discuss the named entity recognition task for Russian texts related to cybersecurity. First, we describe the problems that arise in the course of labeling unstructured texts from the information security domain. We introduce guidelines for human annotators, according to which a corpus has been marked up. Then, a CRF-based system and different neural architectures are implemented and applied to the corpus. The named entity recognition systems are evaluated and compared to determine the most efficient one.
Citations: 9
Lexical Quantile-Based Text Complexity Measure
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_031
M. Eremeev, K. Vorontsov
This paper introduces a new approach to estimating the complexity of text documents. Common readability indices are based on the average length of sentences and words. In contrast to these methods, we propose to count the number of rare words that occur abnormally often in the document. We use a reference corpus of texts and the quantile approach to determine which words are rare and which frequencies are abnormal. We construct a general text complexity model, which can be adjusted for a specific task, and introduce two special models. The experimental design is based on a set of thematically similar pairs of Wikipedia articles, labeled using crowdsourcing. The experiments demonstrate the competitiveness of the proposed approach.
Citations: 11
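The idea in the abstract above, counting rare words that occur unusually often in a document, can be illustrated with a small sketch. The abstract does not give the model's exact definition, so everything specific here is an assumption: a word is treated as rare if its reference-corpus count falls below the `rare_q` quantile of all reference counts, and "abnormally often" is simplified to a fixed `min_count` threshold rather than a second quantile. The names `quantile` and `complexity` and the toy counts are hypothetical.

```python
from collections import Counter

def quantile(values, q):
    """Nearest-rank style quantile of a non-empty list, for 0 <= q <= 1."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]

def complexity(doc_tokens, ref_counts, rare_q=0.5, min_count=2):
    """Count distinct rare words that occur at least min_count times in the document."""
    # Words whose reference count is below this cutoff are treated as rare (assumption).
    rare_cutoff = quantile(list(ref_counts.values()), rare_q)
    doc_counts = Counter(doc_tokens)
    return sum(
        1
        for word, n in doc_counts.items()
        if ref_counts.get(word, 0) < rare_cutoff and n >= min_count
    )

# Toy reference counts and document, for illustration only.
ref_counts = Counter({"the": 1000, "cell": 50, "of": 800, "ribosome": 2, "mitosis": 3})
doc = "the ribosome of the cell binds the ribosome during mitosis".split()
score = complexity(doc, ref_counts)
```

Unlike indices based on average sentence or word length, a score of this kind depends on the choice of reference corpus, so the same document can score differently against general and domain-specific references.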
A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System
Pub Date : 2019-10-22 DOI: 10.26615/978-954-452-056-4_160
V. Thorsteinsson, Hulda Óladóttir, H. Loftsson
We present an open-source, wide-coverage context-free grammar (CFG) for Icelandic, and an accompanying parsing system. The grammar has over 5,600 nonterminals, 4,600 terminals and 19,000 productions in fully expanded form, with feature agreement constraints for case, gender, number and person. The parsing system consists of an enhanced Earley-based parser and a mechanism to select best-scoring parse trees from shared packed parse forests. Our parsing system is able to parse about 90% of all sentences in articles published on the main Icelandic news websites. Preliminary evaluation with evalb shows an F-measure of 70.72% on parsed sentences. Our system demonstrates that parsing a morphologically rich language using a wide-coverage CFG can be practical.
Citations: 16