Solo Queue at ASSIN: Combinando Abordagens Tradicionais e Emergentes
N. Hartmann
Linguamatica, pp. 59-64, published 2016-12-31
DOI: 10.21814/LM.8.2.230 (https://doi.org/10.21814/LM.8.2.230)
Citation count: 22
Abstract
In this paper we present a proposal to automatically label the similarity between a pair of sentences, together with the results obtained in the ASSIN 2016 sentence similarity shared task. Our proposal combines a classical bag-of-words feature, the TF-IDF model, with an emergent feature obtained from word embeddings. TF-IDF is used to relate texts that share words. Word embeddings are known to capture the syntax and semantics of a word. Following Mikolov et al. (2013), the sum of embedding vectors can model the meaning of a sentence. Using both features, we are able to capture the words shared between sentences as well as their semantics. We use linear regression to solve this problem, since the dataset is labeled with real numbers between 1 and 5. Our results are promising. Although the embeddings alone did not outperform our baseline system, combining them with TF-IDF achieved better results than using TF-IDF alone. Our system ranked first in the ASSIN 2016 sentence similarity shared task for Brazilian Portuguese sentences and second for European Portuguese sentences.
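The pipeline described in the abstract (a TF-IDF feature, a summed-embedding feature, and a linear regression over both) can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the use of cosine similarity to turn each representation into a single feature, the whitespace tokenizer, the smoothed IDF formula, the two-dimensional toy embeddings in `EMBEDDINGS`, and the gold scores in `GOLD` are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; the abstract gives no preprocessing details.
    return text.lower().split()


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (dict) per sentence, using a smoothed IDF."""
    counts = [Counter(tokenize(s)) for s in sentences]
    n_docs = len(counts)
    doc_freq = Counter(tok for c in counts for tok in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sum_embedding(sentence, embeddings):
    """Sum the embedding vectors of the known words (index -> value dict)."""
    total = {}
    for tok in tokenize(sentence):
        for i, x in enumerate(embeddings.get(tok, ())):
            total[i] = total.get(i, 0.0) + x
    return total


def extract_features(pairs, embeddings):
    """Return [tfidf_similarity, embedding_similarity] for each sentence pair."""
    vecs = tfidf_vectors([s for pair in pairs for s in pair])
    feats = []
    for i, (s1, s2) in enumerate(pairs):
        tfidf_sim = cosine(vecs[2 * i], vecs[2 * i + 1])
        emb_sim = cosine(sum_embedding(s1, embeddings), sum_embedding(s2, embeddings))
        feats.append([tfidf_sim, emb_sim])
    return feats


def fit_linear_regression(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y], reduced with Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]  # [intercept, w_tfidf, w_emb]


def predict(weights, features):
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


# Hypothetical toy data: 2-d "embeddings" and made-up similarity scores in [1, 5].
EMBEDDINGS = {
    "o": [0.1, 0.1], "gato": [1.0, 0.0], "felino": [0.8, 0.3],
    "dorme": [0.0, 1.0], "descansa": [0.2, 0.9],
    "carro": [-0.9, 0.2], "corre": [0.1, -0.8],
}
PAIRS = [
    ("o gato dorme", "o gato dorme"),
    ("gato dorme", "felino descansa"),
    ("gato dorme", "carro corre"),
    ("o gato dorme", "o gato corre"),
]
GOLD = [5.0, 4.0, 1.0, 3.0]

features = extract_features(PAIRS, EMBEDDINGS)
weights = fit_linear_regression(features, GOLD)
for pair, feat in zip(PAIRS, features):
    print(pair, [round(f, 3) for f in feat], round(predict(weights, feat), 2))
```

The second toy pair shares no words, so its TF-IDF feature is zero and only the embedding feature can signal that the sentences are related. This illustrates why the two features are complementary. In practice one would likely use pretrained embeddings and a library regressor instead of the hand-rolled solver shown here.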