Hadi Abdine, Moussa Kamal Eddine, Davide Buscaldi, Michalis Vazirgiannis
AI Open, Volume 4 (2023), Pages 193-201. DOI: 10.1016/j.aiopen.2023.12.001. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S2666651023000232
Word sense induction with agglomerative clustering and mutual information maximization
Word sense induction (WSI) is a challenging problem in natural language processing that involves the unsupervised automatic detection of a word's senses (i.e., meanings). Some recent work achieves significant results on the WSI task by pre-training a language model dedicated exclusively to disambiguating word senses; other work employs off-the-shelf pre-trained language models with additional strategies to induce senses. This paper proposes a novel unsupervised method based on hierarchical clustering and invariant information clustering (IIC). The IIC loss is used to train a small model to maximize the mutual information between two vector representations of a target word occurring in a pair of synthetic paraphrases. This model is later used in inference mode to extract a higher-quality vector representation for the hierarchical clustering. We evaluate our method on two WSI tasks and in two distinct clustering configurations (fixed and dynamic number of clusters). We empirically show that our approach is at least on par with the state-of-the-art baselines, outperforming them in several configurations. The code and data to reproduce this work are publicly available¹.
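The two stages the abstract describes can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' code: `iic_loss` follows the standard Invariant Information Clustering objective (Ji et al., 2019) over two batches of soft cluster assignments (one per paraphrase view), and `agglomerate` is a generic average-linkage agglomerative clustering with a cosine-distance threshold for the "dynamic number of clusters" configuration. The function names, linkage choice, and threshold semantics are assumptions for exposition; the paper's actual architecture and hyperparameters may differ.

```python
import numpy as np

def iic_loss(z1, z2, eps=1e-8):
    """Negative mutual information between two batches of soft cluster
    assignments (each of shape batch x k), following the Invariant
    Information Clustering objective (Ji et al., 2019). Illustrative
    sketch; not the authors' implementation."""
    p = z1.T @ z2 / z1.shape[0]        # joint distribution over cluster pairs
    p = (p + p.T) / 2.0                # symmetrize the two views
    p = np.clip(p, eps, None)          # avoid log(0)
    p = p / p.sum()                    # renormalize to a proper joint
    pi = p.sum(axis=1, keepdims=True)  # marginal of view 1
    pj = p.sum(axis=0, keepdims=True)  # marginal of view 2
    mi = (p * (np.log(p) - np.log(pi) - np.log(pj))).sum()
    return -mi                         # minimizing this maximizes I(Z1; Z2)

def agglomerate(vectors, threshold):
    """Average-linkage agglomerative clustering over unit-normalized
    vectors, stopping once the closest cluster pair exceeds a
    cosine-distance threshold (a dynamic number of clusters)."""
    x = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    clusters = [[i] for i in range(len(x))]
    while len(clusters) > 1:
        best, bi, bj = None, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # mean pairwise cosine distance between the two clusters
                d = 1.0 - float(np.mean(x[clusters[i]] @ x[clusters[j]].T))
                if best is None or d < best:
                    best, bi, bj = d, i, j
        if best > threshold:           # nothing close enough left to merge
            break
        clusters[bi] += clusters.pop(bj)
    return clusters
```

As a sanity check, identical one-hot assignments spread uniformly over k clusters give `iic_loss` its minimum of -log k, the maximum achievable mutual information. For the fixed-cluster configuration, one would instead run the merge loop until exactly k clusters remain rather than stopping at a distance threshold.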