A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis.

Journal of Computational Social Science (IF 2.0; Q2; Social Sciences, Mathematical Methods). Pub Date: 2023-01-01. DOI: 10.1007/s42001-022-00191-7
Sandra Wankmüller
Volume 6, issue 1, pages 91-163. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9762672/pdf/
Citations: 1

Abstract

One of the first steps in many text-based social science studies is to retrieve the documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science is to apply a set of keywords and to treat as relevant every document that contains at least one of them. However, applying incomplete keyword lists carries a high risk of drawing biased inferences. More complex and costly methods, such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning, have the potential to separate relevant from irrelevant documents more accurately and thereby reduce the potential size of the bias. Yet whether these more expensive approaches improve retrieval performance over keyword lists at all, and if so by how much, is unclear because a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder in SSRN, 2017. 10.2139/ssrn.3026393), the Social Bias Inference Corpus (SBIC) (Sap et al. in Social bias frames: reasoning about social and power implications of language. In: Jurafsky et al. (eds) Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, p 5477-5490, 2020. 10.18653/v1/2020.acl-main.486), and the Reuters-21578 corpus (Lewis in Reuters-21578 (Distribution 1.0). [Data set], 1997. http://www.daviddlewis.com/resources/testcollections/reuters21578/). Results show that, in most studied settings, query expansion techniques and topic model-based classification rules tend to decrease rather than increase retrieval performance. Active supervised learning, however, reaches a substantially higher retrieval performance than keyword lists if it is applied to a not too small set of labeled training instances (e.g. 1000 documents).
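
The keyword rule described in the abstract (a document counts as relevant if it contains at least one keyword) can be illustrated with a minimal Python sketch. The toy documents, the keyword list, the function names, and the use of precision, recall, and F1 as retrieval metrics are all assumptions made for illustration; none of them are taken from the paper.

```python
import re


def keyword_retrieve(documents, keywords):
    """Return indices of documents that contain at least one keyword.

    Matching is case-insensitive and restricted to whole words.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [i for i, doc in enumerate(documents) if pattern.search(doc)]


def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for a set of retrieved indices."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy corpus; documents 0 and 1 are hand-labeled as relevant.
docs = [
    "Refugees arrive at the border",              # relevant, matches a keyword
    "Asylum seekers wait months for a decision",  # relevant, no keyword match
    "The football match ended in a draw",         # irrelevant, no keyword match
    "The band Refugee announced a reunion tour",  # irrelevant, matches a keyword
]
relevant_ids = [0, 1]
keywords = ["refugee", "refugees"]  # deliberately incomplete keyword list

retrieved_ids = keyword_retrieve(docs, keywords)
precision, recall, f1 = precision_recall_f1(retrieved_ids, relevant_ids)
print(retrieved_ids, precision, recall, f1)
```

In this toy setting the incomplete list misses one relevant document and picks up one irrelevant one, which is the kind of selection error the abstract associates with biased inferences.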

Source journal: Journal of Computational Social Science (Social Sciences, Mathematical Methods)
CiteScore: 6.20
Self-citation rate: 6.20%
Articles published: 30