Cross-domain corpus selection for cold-start context

Journal of Information Science (IF 1.7; Zone 4, Management; JCR Q3, Computer Science, Information Systems). Pub Date: 2024-07-25. DOI: 10.1177/01655515241263283

Wei-Ching Hsiao, Hei Chia Wang


Citations: 0

Abstract




Sentiment analysis is a powerful tool for monitoring attitudes towards companies, products or services and identifying specific features that drive positive or negative sentiment. However, collecting labelled data for training sentiment analysis models in a specific domain can be challenging in practical applications. One promising solution to this ‘cold-start’ problem is domain adaptation, which leverages labelled data from a related source domain to train a model for the target domain. A critical yet often neglected aspect in prior research is the measurement of similarity between the source and target domains, a factor that greatly impacts the success of domain adaptation. To fill this gap, we propose a novel measure that combines semantic, syntactic and lexical features to assess corpus-level similarity between two domains. Our experimental results demonstrate that our method achieves high precision (0.91) and recall (0.75), outperforming traditional methods. Moreover, our proposed measure can assist new domain products in selecting the most suitable training data set for their sentiment analysis tasks.
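As a rough illustration of what scoring corpus-level similarity for source selection can look like, the sketch below ranks candidate source corpora against a target corpus using lexical features only (vocabulary overlap and term-frequency cosine). This is not the authors' measure, which also combines semantic and syntactic features and is not specified in the abstract. The function names, the whitespace tokenizer, and the toy corpora are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def term_counts(corpus):
    """Aggregate token counts over every document in a corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(tokenize(doc))
    return counts


def jaccard(a, b):
    """Vocabulary overlap between two count tables: |A & B| / |A | B|."""
    va, vb = set(a), set(b)
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity between two term-frequency tables."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sources(target, sources):
    """Rank candidate source corpora by lexical cosine similarity to the target."""
    target_counts = term_counts(target)
    scores = {
        name: cosine(target_counts, term_counts(docs))
        for name, docs in sources.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpora: an unlabelled target domain and two labelled sources.
TOY_TARGET = ["the battery life of this phone is great"]
TOY_SOURCES = {
    "laptops": ["the battery of this laptop is great"],
    "movies": ["the plot of this movie was dull"],
}

if __name__ == "__main__":
    for name, score in rank_sources(TOY_TARGET, TOY_SOURCES):
        print(f"{name}: {score:.2f}")
```

In a cold-start setting, the highest-ranked source corpus would be the first candidate for training data. A purely lexical score like this one is only a baseline of the kind such a combined measure would be compared against.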


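Source journal

Journal of Information Science (Engineering and Technology: Computer Science, Information Systems)

CiteScore: 6.80
Self-citation rate: 8.30%
Articles published: 121
Review time: 4 months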

About the journal: The Journal of Information Science is a peer-reviewed international journal of high repute covering topics of interest to all those researching and working in the sciences of information and knowledge management. The Editors welcome material on any aspect of information science theory, policy, application or practice that will advance thinking in the field.