Louis Hickman, Stuti Thapa, L. Tay, Mengyang Cao, P. Srinivasan
{"title":"Text Preprocessing for Text Mining in Organizational Research: Review and Recommendations","authors":"Louis Hickman, Stuti Thapa, L. Tay, Mengyang Cao, P. Srinivasan","doi":"10.1177/1094428120971683","DOIUrl":null,"url":null,"abstract":"Recent advances in text mining have provided new methods for capitalizing on the voluminous natural language text data created by organizations, their employees, and their customers. Although often overlooked, decisions made during text preprocessing affect whether the content and/or style of language are captured, the statistical power of subsequent analyses, and the validity of insights derived from text mining. Past methodological articles have described the general process of obtaining and analyzing text data, but recommendations for preprocessing text data were inconsistent. Furthermore, primary studies use and report different preprocessing techniques. To address this, we conduct two complementary reviews of computational linguistics and organizational text mining research to provide empirically grounded text preprocessing decision-making recommendations that account for the type of text mining conducted (i.e., open or closed vocabulary), the research question under investigation, and the data set’s characteristics (i.e., corpus size and average document length). Notably, deviations from these recommendations will be appropriate and, at times, necessary due to the unique characteristics of one’s text data. We also provide recommendations for reporting text mining to promote transparency and reproducibility.","PeriodicalId":19689,"journal":{"name":"Organizational Research Methods","volume":"25 1","pages":"114 - 146"},"PeriodicalIF":8.9000,"publicationDate":"2020-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/1094428120971683","citationCount":"76","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Organizational Research Methods","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1177/1094428120971683","RegionNum":2,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MANAGEMENT","Score":null,"Total":0}
引用次数: 76
Abstract
Recent advances in text mining have provided new methods for capitalizing on the voluminous natural language text data created by organizations, their employees, and their customers. Although often overlooked, decisions made during text preprocessing affect whether the content and/or style of language are captured, the statistical power of subsequent analyses, and the validity of insights derived from text mining. Past methodological articles have described the general process of obtaining and analyzing text data, but recommendations for preprocessing text data were inconsistent. Furthermore, primary studies use and report different preprocessing techniques. To address this, we conduct two complementary reviews of computational linguistics and organizational text mining research to provide empirically grounded text preprocessing decision-making recommendations that account for the type of text mining conducted (i.e., open or closed vocabulary), the research question under investigation, and the data set’s characteristics (i.e., corpus size and average document length). Notably, deviations from these recommendations will be appropriate and, at times, necessary due to the unique characteristics of one’s text data. We also provide recommendations for reporting text mining to promote transparency and reproducibility.
期刊介绍:
Organizational Research Methods (ORM) was founded with the aim of introducing pertinent methodological advancements to researchers in organizational sciences. The objective of ORM is to promote the application of current and emerging methodologies to advance both theory and research practices. Articles are expected to be comprehensible to readers with a background consistent with the methodological and statistical training provided in contemporary organizational sciences doctoral programs. The text should be presented in a manner that facilitates accessibility. For instance, highly technical content should be placed in appendices, and authors are encouraged to include example data and computer code when relevant. Additionally, authors should explicitly outline how their contribution has the potential to advance organizational theory and research practice.