Pub Date: 2019-10-22 DOI: 10.26615/978-954-452-056-4_044
Boris A. Galitsky, Dmitry Ilvovsky
We introduce the concept of a virtual discourse tree to improve question answering (Q/A) recall for complex, multi-sentence questions. By augmenting the discourse tree of an answer with tree fragments obtained from text corpora that play the role of an ontology, we obtain on the fly a canonical discourse representation of this answer that is independent of the thought structure of a given author. This mechanism is critical for finding an answer that is relevant not only in terms of the question's entities but also in terms of the inter-relations between these entities in the answer and its style. We evaluate the Q/A system enabled with virtual discourse trees and observe a substantial increase in performance when answering complex questions, such as those from Yahoo! Answers and www.2carpros.com.
Title: Discourse-Based Approach to Involvement of Background Knowledge for Question Answering
Journal: Recent Advances in Natural Language Processing
Pub Date: 2019-10-22 DOI: 10.26615/978-954-452-056-4_043
Boris A. Galitsky, Dmitry Ilvovsky
We explore the anatomy of answers with respect to which text fragments from an answer are worth matching with a question and which should not be matched. We apply Rhetorical Structure Theory to build a discourse tree of an answer and select elementary discourse units that are suitable for indexing. Manual rules for selecting these discourse units, as well as automated classification based on web search engine mining, are evaluated with respect to improving search accuracy. We form two sets of question-answer pairs for the FAQ and community QA search domains and use them to evaluate the proposed indexing methodology, which delivers up to 16 percent improvement in search recall.
Title: Two Discourse Tree - Based Approaches to Indexing Answers
Journal: Recent Advances in Natural Language Processing
Pub Date: 2019-10-22 DOI: 10.26615/978-954-452-056-4_079
V. Lyding, Christos T. Rodosthenous, Federico Sangati, U. Hassan, Lionel Nicolas, Alexander Koenig, J. Horbačauskienė, Anisia Katinskaia
In this paper, we present our work on developing a vocabulary trainer that uses exercises generated from language resources such as ConceptNet and crowdsources the responses of the learners to enrich the language resource. We performed an empirical evaluation of our approach with 60 non-native speakers over two days, which shows that new entries to expand ConceptNet can be gathered efficiently through vocabulary exercises on word relations. We also report on the feedback gathered from the users and from a language teaching expert, and discuss the potential of the vocabulary trainer application from the user and language learner perspective. The feedback suggests that v-trel has educational potential, although some shortcomings were identified in its current state.
Title: v-trel: Vocabulary Trainer for Tracing Word Relations - An Implicit Crowdsourcing Approach
Journal: Recent Advances in Natural Language Processing
Pub Date: 2019-10-22 DOI: 10.26615/978-954-452-056-4_082
S. Mammadli, S. Huseynov, Huseyn Alkaramov, Ulviyya Jafarli, U. Suleymanov, S. Rustamov
The text classification field of natural language processing has experienced remarkable growth in recent years. Sentiment analysis in particular has received considerable attention from both industry and the research community. However, only a few studies exist for the Azerbaijani language. The main objective of this research is to apply various machine learning algorithms to determine the sentiment of news articles in Azerbaijani. Approximately 30,000 social news articles were collected from online news sites and manually labeled as negative or positive according to their sentiment. First, text preprocessing was applied to the data in order to eliminate noise. Second, to convert the text to a more machine-readable form, the bag-of-words (BOW) model was applied. More specifically, two variants of the BOW model, tf-idf and a frequency-based model, were used as vectorization methods. SVM, Random Forest, and Naive Bayes were then applied as classification algorithms, and their combinations with the two vectorization approaches were tested and analyzed. Experimental results indicate that SVM outperforms the other classification algorithms.
Title: Sentiment Polarity Detection in Azerbaijani Social News Articles
Journal: Recent Advances in Natural Language Processing
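The two vectorization variants named in this abstract (frequency-based and tf-idf) can be illustrated with a minimal sketch. The abstract does not give the exact weighting or tokenization, so the formula below (relative term frequency times log(N/df)), the whitespace tokenizer, and the function names `count_vectors` and `tfidf_vectors` are assumptions for illustration only, not the authors' implementation. English toy sentences stand in for the Azerbaijani news data.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Frequency-based bag of words: one term-count mapping per document."""
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    return [Counter(doc.lower().split()) for doc in docs]


def tfidf_vectors(docs):
    """tf-idf bag of words, using tf = count / doc length, idf = log(N / df)."""
    counts = count_vectors(docs)
    n_docs = len(counts)
    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append(
            {term: (n / total) * math.log(n_docs / df[term]) for term, n in c.items()}
        )
    return vectors


docs = ["good news good day", "bad news"]
counts = count_vectors(docs)
tfidf = tfidf_vectors(docs)
print(counts[0])
print(tfidf[0])
```

Under this weighting, a term that occurs in every document (here "news") gets weight zero, while a term concentrated in one document is weighted up. Either kind of vector could then be passed to a classifier such as an SVM.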
Pub Date: 2019-10-22 DOI: 10.26615/978-954-452-056-4_033
R. Evans, Constantin Orasan
In this paper, we report on the extrinsic evaluation of an automatic sentence simplification method with respect to two NLP tasks: semantic role labelling (SRL) and information extraction (IE). The paper begins with our observation of challenges in the intrinsic evaluation of sentence simplification systems, which motivates the use of extrinsic evaluation of these systems with respect to other NLP tasks. We describe the two NLP systems and the test data used in the extrinsic evaluation, and present arguments and evidence motivating the integration of a sentence simplification step as a means of improving the accuracy of these systems. Our evaluation reveals that their performance is improved by the simplification step: the SRL system is better able to assign semantic roles to the majority of the arguments of verbs and the IE system is better able to identify fillers for all IE template slots.
Title: Sentence Simplification for Semantic Role Labelling and Information Extraction
Journal: Recent Advances in Natural Language Processing
Pub Date: 2019-10-22 DOI: 10.26615/978-954-452-056-4_106
A. Plum, Tharindu Ranasinghe, Constantin Orasan
This paper compares how different machine learning classifiers can be used together with simple string matching and named entity recognition to detect locations in texts. We compare five different state-of-the-art machine learning classifiers in order to predict whether a sentence contains a location or not. Following this classification task, we use a string matching algorithm with a gazetteer to identify the exact index of a toponym within the sentence. We evaluate different approaches in terms of machine learning classifiers, text pre-processing and location extraction on the SemEval-2019 Task 12 dataset, compiled for toponym resolution in the bio-medical domain. Finally, we compare the results with our system that was previously submitted to the SemEval-2019 task evaluation.
Title: Toponym Detection in the Bio-Medical Domain: A Hybrid Approach with Deep Learning
Journal: Recent Advances in Natural Language Processing
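The second stage described in this abstract, locating the exact index of a toponym with string matching against a gazetteer, might look roughly like the sketch below. It assumes a sentence has already been flagged by the classifier as containing a location. The function name `find_toponyms`, the longest-match-first policy, and the tiny gazetteer are hypothetical choices made for illustration; the abstract does not specify the matching algorithm.

```python
import re


def find_toponyms(sentence, gazetteer):
    """Return sorted (start, end, text) spans of gazetteer entries in sentence."""
    matches = []
    # Assumption: try longer names first so "New York" wins over "York".
    for name in sorted(gazetteer, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            start, end = m.start(), m.end()
            # Skip a match that lies fully inside an already accepted span.
            if any(s <= start and end <= e for s, e, _ in matches):
                continue
            matches.append((start, end, m.group(0)))
    return sorted(matches)


sentence = "Samples were collected in New York and York."
gazetteer = {"York", "New York"}
spans = find_toponyms(sentence, gazetteer)
for start, end, text in spans:
    print(start, end, text)
```

Returning character offsets rather than token positions keeps the sketch independent of any particular tokenizer; a real system would also need to handle ambiguous names and spelling variants.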
Pub Date: 2019-10-22 DOI: 10.26615/978-954-452-056-4_005
Koray Ak, O. T. Yildiz
Semantic role labeling (SRL) is an important task for understanding natural languages, where the objective is to analyse propositions expressed by the verb and to identify each word that bears a semantic role. It provides an extensive dataset to enhance NLP applications such as information retrieval, machine translation, information extraction, and question answering. However, creating SRL models is difficult. For some languages, it is even infeasible to create SRL models that have predicate-argument structure, due to a lack of linguistic resources. In this paper, we present our method for automatically creating a Turkish PropBank by exploiting parallel data from the translated sentences of the English PropBank. Experiments show that our method gives promising results.
Title: Automatic Propbank Generation for Turkish
Journal: Recent Advances in Natural Language Processing
Pub Date: 2019-10-22 DOI: 10.26615/978-954-452-056-4_128
A. Sirotina, Natalia V. Loukachevitch
In this paper, we discuss the named entity recognition task for Russian texts related to cybersecurity. First, we describe the problems that arise in the course of labeling unstructured texts from the information security domain. We introduce guidelines for human annotators, according to which a corpus has been marked up. A CRF-based system and several neural architectures were then implemented and applied to the corpus. The named entity recognition systems were evaluated and compared to determine the most efficient one.
Title: Named Entity Recognition in Information Security Domain for Russian
Journal: Recent Advances in Natural Language Processing
Pub Date: 2019-10-22 DOI: 10.26615/978-954-452-056-4_031
M. Eremeev, K. Vorontsov
This paper introduces a new approach to estimating the complexity of a text document. Common readability indices are based on the average length of sentences and words. In contrast to these methods, we propose to count the number of rare words that occur abnormally often in the document. We use a reference corpus of texts and a quantile approach to determine which words are rare and which frequencies are abnormal. We construct a general text complexity model, which can be adjusted for a specific task, and introduce two special models. The experimental design is based on a set of thematically similar pairs of Wikipedia articles, labeled using crowdsourcing. The experiments demonstrate the competitiveness of the proposed approach.
Title: Lexical Quantile-Based Text Complexity Measure
Journal: Recent Advances in Natural Language Processing
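The idea of counting rare words that occur unusually often can be sketched in a few lines. The abstract does not give the model's formulas, so everything specific here is an assumption: a word is treated as rare when its reference-corpus count is at or below a chosen quantile of all reference counts, and "abnormally often" is simplified to a fixed repeat threshold. The names `quantile`, `complexity_score`, `rare_q`, and `min_repeats` are hypothetical, and the reference counts are toy values.

```python
from collections import Counter


def quantile(values, q):
    """Simple empirical quantile: the value at position floor(q * n) when sorted."""
    ordered = sorted(values)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[idx]


def complexity_score(doc_tokens, reference_counts, rare_q=0.25, min_repeats=2):
    """Count distinct rare words that repeat at least min_repeats times."""
    # Assumption: rarity is judged by raw counts in the reference corpus.
    rare_threshold = quantile(list(reference_counts.values()), rare_q)
    doc_counts = Counter(doc_tokens)
    return sum(
        1
        for word, n in doc_counts.items()
        if reference_counts.get(word, 0) <= rare_threshold and n >= min_repeats
    )


reference = {"the": 100, "cat": 20, "quark": 1, "boson": 2}
doc = "the quark and the boson meet the quark".split()
score = complexity_score(doc, reference)
print(score)
```

In this toy example only "quark" is both rare in the reference counts and repeated in the document. Words missing from the reference corpus are treated as rare, which is a design choice a real model would likely revisit.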
Pub Date: 2019-10-22 DOI: 10.26615/978-954-452-056-4_160
V. Thorsteinsson, Hulda Óladóttir, H. Loftsson
We present an open-source, wide-coverage context-free grammar (CFG) for Icelandic, and an accompanying parsing system. The grammar has over 5,600 nonterminals, 4,600 terminals and 19,000 productions in fully expanded form, with feature agreement constraints for case, gender, number and person. The parsing system consists of an enhanced Earley-based parser and a mechanism to select best-scoring parse trees from shared packed parse forests. Our parsing system is able to parse about 90% of all sentences in articles published on the main Icelandic news websites. Preliminary evaluation with evalb shows an F-measure of 70.72% on parsed sentences. Our system demonstrates that parsing a morphologically rich language using a wide-coverage CFG can be practical.
Title: A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System
Journal: Recent Advances in Natural Language Processing
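For readers unfamiliar with Earley parsing, the following is a minimal textbook-style Earley recognizer over a toy grammar. It is not the enhanced parser described in the abstract: it has no feature agreement, builds no parse forest, does no scoring, and assumes the grammar has no empty productions. The grammar, the terminal names, and the function `earley_recognize` are invented for illustration.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if tokens can be derived from start.

    grammar maps each nonterminal to a list of right-hand sides (tuples of
    symbols); any symbol that is not a key of grammar is a terminal.
    Assumption: no empty right-hand sides.
    """
    # An item is (lhs, rhs, dot position, origin index).
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for i in range(len(tokens) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predict: expand the nonterminal after the dot.
                    for prod in grammar[symbol]:
                        item = (symbol, prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
                elif i < len(tokens) and tokens[i] == symbol:
                    # Scan: the next token matches the expected terminal.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Complete: advance items that were waiting for lhs.
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[len(tokens)]
    )


toy_grammar = {
    "S": [("NP", "VP")],
    "NP": [("det", "noun"), ("noun",)],
    "VP": [("verb", "NP"), ("verb",)],
}
print(earley_recognize(toy_grammar, "S", ["noun", "verb", "det", "noun"]))
print(earley_recognize(toy_grammar, "S", ["verb", "det"]))
```

The terminals here are part-of-speech categories rather than word forms. A grammar for a morphologically rich language would additionally need to encode agreement in case, gender, number, and person, which is what inflates the expanded production count reported in the abstract.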