Text Preprocessing for Text Mining in Organizational Research: Review and Recommendations

Organizational Research Methods | IF 8.9 | Zone 2 (Management) | JCR Q1 (Management) | Pub Date: 2020-11-23 | DOI: 10.1177/1094428120971683

Louis Hickman, Stuti Thapa, L. Tay, Mengyang Cao, P. Srinivasan

Organizational Research Methods, Volume 25, Issue 1, pp. 114-146. https://doi.org/10.1177/1094428120971683

Citations: 76

Abstract



Text Preprocessing for Text Mining in Organizational Research: Review and Recommendations

Recent advances in text mining have provided new methods for capitalizing on the voluminous natural language text data created by organizations, their employees, and their customers. Although often overlooked, decisions made during text preprocessing affect whether the content and/or style of language are captured, the statistical power of subsequent analyses, and the validity of insights derived from text mining. Past methodological articles have described the general process of obtaining and analyzing text data, but recommendations for preprocessing text data were inconsistent. Furthermore, primary studies use and report different preprocessing techniques. To address this, we conduct two complementary reviews of computational linguistics and organizational text mining research to provide empirically grounded text preprocessing decision-making recommendations that account for the type of text mining conducted (i.e., open or closed vocabulary), the research question under investigation, and the data set’s characteristics (i.e., corpus size and average document length). Notably, deviations from these recommendations will be appropriate and, at times, necessary due to the unique characteristics of one’s text data. We also provide recommendations for reporting text mining to promote transparency and reproducibility.
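The abstract ties preprocessing decisions to the data set's characteristics (corpus size and average document length) and calls for transparent reporting. The following is only a minimal sketch of that idea, not the authors' procedure: it assumes that corpus size means the number of documents, that document length is counted in word tokens, and that lowercasing and stop word removal are the choices being made. `PreprocessingConfig`, `preprocess`, `corpus_characteristics`, and the short stop word list are all hypothetical names introduced for illustration.

```python
import re
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessingConfig:
    """Hypothetical record of preprocessing choices, kept so they can be reported."""
    lowercase: bool = True
    remove_stop_words: bool = True
    # Illustrative stop word list only; a real study would justify its own list.
    stop_words: tuple = ("the", "a", "an", "and", "of", "to", "is")


def preprocess(document: str, config: PreprocessingConfig) -> list[str]:
    """Tokenize one document and apply the choices recorded in `config`."""
    text = document.lower() if config.lowercase else document
    # Assumption: a token is a run of ASCII letters or apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    if config.remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in config.stop_words]
    return tokens


def corpus_characteristics(corpus: list[str], config: PreprocessingConfig) -> dict:
    """Summarize the corpus together with the preprocessing choices that produced it."""
    tokenized = [preprocess(doc, config) for doc in corpus]
    n_docs = len(tokenized)
    total_tokens = sum(len(tokens) for tokens in tokenized)
    return {
        "corpus_size": n_docs,  # assumed here to mean number of documents
        "average_document_length": total_tokens / n_docs if n_docs else 0.0,
        "preprocessing": asdict(config),  # reported alongside the statistics
    }


if __name__ == "__main__":
    example_corpus = [
        "The manager praised the team.",
        "Feedback is a gift to employees and leaders.",
    ]
    print(corpus_characteristics(example_corpus, PreprocessingConfig()))
```

Returning the configuration next to the summary statistics is a design choice made for this sketch: it keeps the preprocessing choices attached to the numbers they influenced, which is one simple way to support the reporting the abstract recommends.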


Source journal: Organizational Research Methods

CiteScore: 23.20

Self-citation rate: 3.20%

About the journal: Organizational Research Methods (ORM) was founded with the aim of introducing pertinent methodological advancements to researchers in organizational sciences. The objective of ORM is to promote the application of current and emerging methodologies to advance both theory and research practices. Articles are expected to be comprehensible to readers with a background consistent with the methodological and statistical training provided in contemporary organizational sciences doctoral programs. The text should be presented in a manner that facilitates accessibility. For instance, highly technical content should be placed in appendices, and authors are encouraged to include example data and computer code when relevant. Additionally, authors should explicitly outline how their contribution has the potential to advance organizational theory and research practice.