Leveraging Researcher Domain Expertise to Annotate Concepts Within Imbalanced Data

IF 3.7, Zone 1 (Literature), Q1 COMMUNICATION. Communication Methods and Measures. Pub Date: 2023-02-22. DOI: 10.1080/19312458.2023.2182278

Dror K. Markus, Guy Mor-Lan, Tamir Sheafer, Shaul R. Shenhav

Communication Methods and Measures. Volume: 17 1. Pages: 250-271. Publication type: Journal Article. Open access: no. Source platform: Semantic Scholar. https://doi.org/10.1080/19312458.2023.2182278

Citations: 0


Leveraging Researcher Domain Expertise to Annotate Concepts Within Imbalanced Data

Abstract: As more computational communication researchers turn to supervised machine learning methods for text classification, we note the challenge in implementing such techniques within an imbalanced dataset. Such issues are critical in our domain, where, in many cases, researchers attempt to identify and study theoretically interesting categories that can be rare in a target corpus. Specifically, imbalanced distributions, with a skewed distribution of texts among the categories, can lead to a lengthy and expensive annotation stage, forcing practitioners to sample and label large numbers of texts to train a classification model. In this paper, we provide an overview of the issue, and describe existing strategies for mitigating such challenges. Noting the pitfalls of previous solutions, we then provide a semi-supervised method, Expert Initiated Latent Space Sampling, that complements researcher domain expertise with a systematic, unsupervised exploration of the latent semantic space to overcome such limitations. Utilizing simulations to systematically evaluate our method and compare it to existing approaches, we show that our procedure offers significant advantages in terms of efficiency and accuracy in many classification tasks.
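The annotation cost that the abstract attributes to imbalanced data can be made concrete with a little arithmetic. The sketch below is not taken from the paper. It assumes texts are drawn uniformly at random and labeled one at a time, so that the number of labels needed to reach a fixed number of rare-category examples follows a negative binomial distribution with mean k / p. The function name `expected_labels_needed` and its parameters are hypothetical.

```python
def expected_labels_needed(target_positives: int, prevalence: float) -> float:
    """Expected number of randomly sampled texts an annotator must label
    to collect `target_positives` examples of a rare category.

    Assumption (not from the paper): texts are sampled independently and
    uniformly, so each one belongs to the rare category with probability
    `prevalence`. The count of draws needed is then negative binomial
    with mean target_positives / prevalence.
    """
    if target_positives < 0:
        raise ValueError("target_positives must be non-negative")
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in the interval (0, 1]")
    return target_positives / prevalence


if __name__ == "__main__":
    # Compare a balanced corpus with increasingly rare target categories.
    for p in (0.5, 0.1, 0.02):
        cost = expected_labels_needed(100, p)
        print(f"prevalence={p:.2f}: about {cost:,.0f} labels for 100 positives")
```

Under these assumptions, a category that makes up 2% of a corpus requires on the order of 5,000 labeled texts to yield 100 positive examples, compared with about 200 in a balanced corpus. This is the cost that targeted sampling strategies try to reduce.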


Source journal

Communication Methods and Measures (COMMUNICATION)

CiteScore: 21.10

Self-citation rate: 1.80%

Articles published:

About the journal: Communication Methods and Measures pursues several goals in communication research. First, it brings developments in both qualitative and quantitative research methodologies to the attention of communication scholars, serving as a platform where researchers across the field can discuss and disseminate methodological tools and approaches. Second, it seeks to improve research design and analysis practices. Third, it introduces new methods of measurement that are valuable to communication scientists, or that enhance existing methods. The journal encourages submissions on methods for enhancing research design and theory testing, using quantitative or qualitative approaches, and is open to articles on the epistemological aspects of communication research methodologies. It also welcomes well-written manuscripts that demonstrate the use of methods, and articles that highlight the advantages of lesser-known or newer methods over those traditionally used in communication.