Latent Code Identification (LACOID): A Machine Learning-Based Integrative Framework [and Open-Source Software] to Classify Big Textual Data, Rebuild Contextualized/Unaltered Meanings, and Avoid Aggregation Bias

International Journal of Qualitative Methods (IF 3.9; Zone 2, Sociology; JCR Q1, Social Sciences, Interdisciplinary). Pub Date: 2023-01-05. DOI: 10.1177/16094069221144940

Manuel S. González Canché


Citations: 5

Abstract

Labeling or classifying textual data and qualitative evidence is an expensive and consequential challenge. The rigor and consistency behind the construction of these labels ultimately shape research findings and conclusions. A multifaceted methodological conundrum in addressing this challenge is that classification requires human reasoning, which leads to deeper and more nuanced understandings; however, this same manual classification comes with a well-documented increase in inconsistencies and errors, particularly when dealing with vast numbers of documents and teams of coders. An alternative to human coding consists of machine learning-assisted techniques. These data science and visualization techniques offer tools for data classification that are cost-effective and consistent but are prone to losing participants' meanings or voices for two main reasons: (a) these classifications typically aggregate all texts comprising each input file (i.e., each interview transcript) into a single topic or code and (b) the words comprising these texts are analyzed outside of their original contexts. To address this challenge and analytic conundrum, we present an analytic framework and software tool that address the following question: How can we classify vast amounts of qualitative evidence effectively and efficiently without losing context or the original voices of our research participants, while leveraging the nuances that human reasoning brings to the qualitative and mixed methods analytic tables? This framework mirrors the line-by-line coding employed in human/manual code identification but relies on machine learning to classify texts in minutes rather than months. The resulting outputs provide complete transparency of the classification process and help recreate the contextualized, original, and unaltered meanings embedded in the input documents, as provided by our participants.
We offer access to the database (González Canché, 2022e) and software required (González Canché, 2022a; Mac https://cutt.ly/jc7n3OT and Windows https://cutt.ly/wc7nNKF) to replicate the analyses. We hope this opportunity to become familiar with the analytic framework and software may result in expanded access to data science tools for analyzing qualitative evidence (see also González Canché, 2022b, 2022c, 2022d, for related no-code data science applications to classify and analyze qualitative and textual data dynamically).
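The contrast the abstract draws between document-level aggregation and line-by-line coding can be illustrated with a minimal sketch. This is not the LACOID software or its model: the keyword lexicon, the code names ("finances", "belonging"), the function names, and the sample transcript are all hypothetical, and a simple keyword counter stands in for a trained machine learning classifier.

```python
from collections import Counter

# Hypothetical code lexicon: a stand-in for a trained classifier, not LACOID's model.
LEXICON = {
    "finances": {"tuition", "loan", "loans", "afford", "cost"},
    "belonging": {"friends", "welcome", "community", "alone"},
}


def classify(text):
    """Return the code whose keywords appear most often in text, or None."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    scores = Counter()
    for code, keywords in LEXICON.items():
        scores[code] = sum(w in keywords for w in words)
    code, hits = scores.most_common(1)[0]
    return code if hits > 0 else None


def code_document(text):
    """Document-level coding: one label for the whole transcript."""
    return classify(text)


def code_lines(text):
    """Line-level coding: one (line number, line, label) triple per non-empty line."""
    return [
        (i, line, classify(line))
        for i, line in enumerate(text.splitlines(), start=1)
        if line.strip()
    ]


# Invented sample transcript, for illustration only.
transcript = """I could not afford tuition without a loan.
The cost of books kept going up, and loans covered less each year.
Still, I felt welcome once I found friends in my program."""

doc_label = code_document(transcript)
line_labels = code_lines(transcript)

print("document-level code:", doc_label)
for number, line, label in line_labels:
    print(f"line {number} [{label}]: {line}")
```

In this toy case the document-level label reflects only the majority theme, while the line-level output keeps the third line's distinct code attached to the participant's original, unaltered wording. That is the kind of aggregation loss the abstract describes.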

Source journal

International Journal of Qualitative Methods (Social Sciences, Interdisciplinary)

CiteScore: 6.90

Self-citation rate: 11.10%

Articles published: 139

Review time: 12 weeks

About the journal: Journal highlights: Impact Factor: 5.4. Ranked 5/110 in Social Sciences, Interdisciplinary (SSCI). Indexed in: Clarivate Analytics Social Science Citation Index, the Directory of Open Access Journals (DOAJ), and Scopus. Launched in: 2002. Publication is subject to payment of an article processing charge (APC). International Journal of Qualitative Methods (IJQM) is a peer-reviewed open access journal which focuses on methodological advances, innovations, and insights in qualitative or mixed methods studies.