A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis.

Journal of Computational Social Science (IF 2.3, Q2, Social Sciences, Mathematical Methods) | Pub Date: 2023-01-01 | DOI: 10.1007/s42001-022-00191-7

Sandra Wankmüller

Journal of Computational Social Science, vol. 6, no. 1, pp. 91-163. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9762672/pdf/

Citations: 1

Abstract

One of the first steps in many text-based social science studies is to retrieve the documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science is to apply a set of keywords and to treat every document that contains at least one of the keywords as relevant. But applying incomplete keyword lists carries a high risk of drawing biased inferences. More complex and costly methods, such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning, have the potential to separate relevant from irrelevant documents more accurately and thereby reduce the potential size of the bias. Yet whether these more expensive approaches increase retrieval performance compared to keyword lists at all, and if so by how much, is unclear because a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder, SSRN, 2017, doi:10.2139/ssrn.3026393), the Social Bias Inference Corpus (SBIC) (Sap et al., "Social bias frames: reasoning about social and power implications of language", in Jurafsky et al. (eds), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5477-5490, 2020, doi:10.18653/v1/2020.acl-main.486), and the Reuters-21578 corpus (Lewis, Reuters-21578 (Distribution 1.0) [data set], 1997, http://www.daviddlewis.com/resources/testcollections/reuters21578/). Results show that, in most studied settings, query expansion techniques and topic model-based classification rules tend to decrease rather than increase retrieval performance. Active supervised learning, however, when applied to a not too small set of labeled training instances (e.g. 1000 documents), reaches substantially higher retrieval performance than keyword lists.
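The keyword rule described in the abstract (a document counts as relevant if it contains at least one keyword) can be sketched in a few lines. The following is a minimal illustration, not the study's code: the toy corpus, the labels, the keyword list, and the function names `keyword_retrieve` and `precision_recall_f1` are all assumptions made up for this example. It shows how an incomplete keyword list can both miss relevant documents and flag irrelevant ones.

```python
import re


def keyword_retrieve(documents, keywords):
    """Flag a document as relevant if it contains at least one keyword."""
    # Whole-word, case-insensitive matching (an assumption of this sketch).
    patterns = [
        re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE) for k in keywords
    ]
    return [any(p.search(doc) for p in patterns) for doc in documents]


def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 for the relevant (positive) class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy corpus; labels mark relevance to a made-up topic.
documents = [
    "Refugees arrive at the border crossing",    # relevant, matched
    "Asylum seekers wait for a court decision",  # relevant, missed by the list
    "The football match ended in a draw",        # irrelevant
    "New smartphone released this week",         # irrelevant
    "Bird migration peaks in autumn",            # irrelevant, but matched
]
labels = [True, True, False, False, False]
keywords = ["refugees", "migration"]  # deliberately incomplete list

predicted = keyword_retrieve(documents, keywords)
precision, recall, f1 = precision_recall_f1(predicted, labels)
print(predicted, precision, recall, f1)
```

In this toy setting the list misses one relevant document and wrongly retrieves one irrelevant document, which is the kind of selection error the abstract warns can bias downstream inferences.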

Source journal

Journal of Computational Social Science (Social Sciences, Mathematical Methods)

CiteScore

6.20

Self-citation rate

6.20%

Number of publications